Essam Hassan
A pragmatic software engineer, cyber security enthusiast and a Linux geek. I curse at my machine on a daily basis at Google. My views are my own.

wtf series - wtf is protobuf?

This is part of a series of posts explaining cryptic tech terms in an introductory way.

Disclaimer: this series is not intended to be a main learning source. However, there might be follow up posts with hands-on experiments or deeper technical content for some of these topics.

Protocol buffers are Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data – think XML, but smaller, faster, and simpler. You define how you want your data to be structured once, then you can use special generated source code to easily write and read your structured data to and from a variety of data streams and using a variety of languages. [src]

History

XML
One of the oldest data serialization standards, derived from SGML (the Standard Generalized Markup Language). Standardized between 1996 and 1998, XML was the primary serialization standard for structured and semi-structured data and the basis for the SOAP protocol. It is human-readable, structured, and very verbose. [snippet src]

<?xml version="1.0" encoding="UTF-8"?>
<breakfast_menu>
<food>
    <name>Belgian Waffles</name>
    <price>$5.95</price>
    <description>
    Two of our famous Belgian Waffles with plenty of real maple syrup
    </description>
    <calories>650</calories>
</food>
<food>
    <name>Strawberry Belgian Waffles</name>
    <price>$7.95</price>
    <description>
    Light Belgian waffles covered with strawberries and whipped cream
    </description>
    <calories>900</calories>
</food>
<food>
    <name>Berry-Berry Belgian Waffles</name>
    <price>$8.95</price>
    <description>
    Belgian waffles covered with assorted fresh berries and whipped cream
    </description>
    <calories>900</calories>
</food>
<food>
    <name>French Toast</name>
    <price>$4.50</price>
    <description>
    Thick slices made from our homemade sourdough bread
    </description>
    <calories>600</calories>
</food>
<food>
    <name>Homestyle Breakfast</name>
    <price>$6.95</price>
    <description>
    Two eggs, bacon or sausage, toast, and our ever-popular hash browns
    </description>
    <calories>950</calories>
</food>
</breakfast_menu>

JSON
Short for JavaScript Object Notation, JSON was popularized in the early 2000s. It was a step forward for data representation and serialization: it is less verbose than XML, easier to support in browsers, faster to process, and it enforces a consistent structure. This made it the main data serialization standard for the modern web for many years.

// simple representation of a breakfast menu with only one item
[
  {
    "name": "Homestyle Breakfast",
    "price": "$6.95",
    "description": "Two eggs, bacon or sausage, toast, and our ever-popular hash browns",
    "calories": 950
  }
]

Protocol buffers (protobuf)

Google developed Protocol Buffers for use in their internal services. It is a binary encoding format that allows you to specify a schema for your data using a specification language, like so:

// "required" and "optional" labels as used here are proto2 syntax
syntax = "proto2";

message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;
}

Protocol Buffers is implemented in various languages: Java, C++, Go, and others are supported officially, and most modern languages have an implementation. Here is a Java example that writes a message using the previous schema, followed by a C++ example that reads it back:

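// Java: build a Person and write it to the file named by the first argument
Person john = Person.newBuilder()
    .setId(1234)
    .setName("John Doe")
    .setEmail("jdoe@example.com")
    .build();
FileOutputStream output = new FileOutputStream(args[0]);
john.writeTo(output);
output.close();

// C++: read the Person back from the file named by the first argument
Person john;
fstream input(argv[1],
    ios::in | ios::binary);
john.ParseFromIstream(&input);
int id = john.id();
string name = john.name();
string email = john.email();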

Using protocol buffers has many advantages over plain text serializations like JSON and XML:

  • Very dense data, which results in very small output and therefore less network overhead
  • A declared schema makes parsing from most languages straightforward, with less boilerplate parsing code
  • Very fast processing
  • Binary encoded and hard to decode without knowledge of the schema
  • Backward compatibility as a side effect: fields are identified by number, so readers built against an older schema skip fields they don't recognize
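
To make the "very dense" point a bit more concrete, here is a minimal sketch that hand-encodes the Person message from above in the protobuf wire format and compares its size with an equivalent compact JSON string. It deliberately does not use the protobuf library: the class and method names (WireSketch, writeVarint, writeTag, writeString, encodePerson) are made up for this illustration, and real code would use the generated Person class instead.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only; not part of the protobuf library.
class WireSketch {
    // Wire types defined by the protobuf encoding.
    static final int VARINT = 0;
    static final int LENGTH_DELIMITED = 2;

    // Writes a value as a base-128 varint: 7 bits per byte, lowest group
    // first, with the high bit set on every byte except the last.
    static void writeVarint(ByteArrayOutputStream out, long value) {
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
    }

    // A field is introduced by a tag: (field number << 3) | wire type.
    // Note that the field *name* never appears in the output.
    static void writeTag(ByteArrayOutputStream out, int fieldNumber, int wireType) {
        writeVarint(out, ((long) fieldNumber << 3) | wireType);
    }

    // Strings are length-delimited: tag, byte length, then the UTF-8 bytes.
    static void writeString(ByteArrayOutputStream out, int fieldNumber, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        writeTag(out, fieldNumber, LENGTH_DELIMITED);
        writeVarint(out, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    // Encodes the three Person fields in field-number order.
    static byte[] encodePerson(String name, int id, String email) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeString(out, 1, name);   // string name = 1
        writeTag(out, 2, VARINT);    // int32 id = 2
        writeVarint(out, id);
        writeString(out, 3, email);  // string email = 3
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = encodePerson("John Doe", 1234, "jdoe@example.com");
        String json = "{\"name\":\"John Doe\",\"id\":1234,\"email\":\"jdoe@example.com\"}";
        System.out.println("protobuf bytes: " + bytes.length);
        System.out.println("JSON bytes:     " + json.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

For this particular record the sketch produces 31 bytes against 56 bytes for the compact JSON string. Most of the saving comes from replacing field names with one-byte tags and storing the number 1234 in two bytes, which is also why the output is hard to read without the schema.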

References

Google developers - Protocol Buffers