English | 中文版
[TOC]
Google Protocol Buffers (Protobuf) is an excellent library that defines a compact (compared to XML and JSON) and extensible binary message format, especially suitable for network data transmission.
Protobuf has two versions:
proto2: Older, not recommended for new projectsproto3: Recommended
classDiagram
MessageLite <|-- Message
Message <-- Message : call prototype->New()
Message <|-- Person
Message <|-- AddressBook
Message <-- MessageFactory : returns default instance
MessageFactory <-- Descriptor : passes to GetPrototype()
Descriptor <-- DescriptorPool : returns
class MessageLite{
+~MessageLite()
+New(): MessageLite*
+GetTypeName(): string
+ByteSize(): int
+ParseFrom*(): bool
+SerializeTo*(...): bool
}
class Message{
+New(): Message*
+GetTypeName(): string
+ByteSize(): int
+GetDescriptor(): const Descriptor*
}
class Person{
+New(): Person*
+ByteSize(): int
+descriptor(): const Descriptor*
+default_instance(): const Person&
}
class AddressBook{
+New(): AddressBook*
+ByteSize(): int
+descriptor(): const Descriptor*
+default_instance(): const AddressBook&
}
MessageFactory <|-- GeneratedMessageFactory
class MessageFactory{
+GetPrototype(const Descriptor*): const Message*
+generated_factory(): MessageFactory*
}
class GeneratedMessageFactory{
~type_map_: hash_map<const Descriptor*, const Message>
+GetPrototype(const Descriptor*): const Message*
+singleton(): GeneratedMessageFactory*
}
class Descriptor{
+name(): string
+full_name(): string
}
class DescriptorPool{
+FindMessageTypeByName(string): const Descriptor*
+generated_pool(): const DescriptorPool*
}
| .proto Type | C++ | Java | Python | Go |
|---|---|---|---|---|
| double | double | double | float | *float64 |
| float | float | float | float | *float32 |
| int32 | int32 | int | int | *int32 |
| int64 | int64 | long | int/long | *int64 |
| uint32 | uint32 | int | int/long | *uint32 |
| uint64 | uint64 | long | int/long | *uint64 |
| sint32 | int32 | int | int | *int32 |
| sint64 | int64 | long | int/long | *int64 |
| fixed32 | uint32 | int | int/long | *uint64 |
| fixed64 | uint64 | long | int/long | *uint64 |
| sfixed32 | int32 | int | int | *int32 |
| sfixed64 | int64 | long | bool | *bool |
| bool | bool | boolean | bool | *bool |
| string | string | String | unicode | *string |
| bytes | string | ByteString | bytes | []byte |
Required
Optional
Repeated
Enumeration
Reserve Tag Number for other modules, used for field extension;
Example:
TODO
protoc -I [OPTION] PROTO_FILES
-Iproto file pathOPTIONoutput format--cpp_out--java_out--python_out
# Generate C++ code
protoc -I=xxx.proto --cpp_out=cpp_dir # xxx.proto: proto file path, cpp_dir: generated C++ pathTODO
TODO
In Protobuf 2, message fields can be marked as required or optional, and also support the default modifier to specify default values. By default, if an optional field is not set, or is explicitly set to the default value, this field will be omitted when serialized to binary format.
In Protobuf 3, required and optional modifiers are removed, all fields are optional, and for primitive type fields, there is no hasXxx() method at all.
So how do you know if a value is actually null, or just the default (0/0.0/"")? There are several methods:
- Use special values to distinguish, avoid using null;
- Explicitly define a boolean field;
message Account { string name = 1; double profit_rate = 2; bool has_profit_rate = 3; };
- Use oneof trick (skynet-pbc does not support);
message Account { string name = 1; oneof profit_rate { double profit_rate = 2; } };
- Use wrapper types (recommended)
import "google/protobuf/wrappers.proto" message Account { string name = 1; google.protobuf.DoubleValue profit_rate = 2; }
Each 8-bit byte uses the lower 7 bits to represent data, and the most significant bit (MSB) is used as a flag.
- Step 1: Group bits in 7s in reverse order
- Step 2: Determine if another byte is needed; if so, set MSB to 1
Number 1 in binary: 0000 0000 0000 0000 0000 0000 0000 0001
Varint encoding:
- Step 1:
000 0001 000 0000 000 0000 000 0000 0000 - Step 2:
0000 0001 0000 0000 0000 0000 0000 0000
Number 1 after varint encoding: 0000 0001, hex: 0x01
Number 251 in binary: 0000 0000 0000 0000 0000 0000 1111 1011
Varint encoding:
- Step 1:
111 1011 000 0001 0000 0000 0000 0000 00 - Step 2:
1111 1011 0000 0001 0000 0000 0000 0000 00
Number 251 after varint encoding: 1111 1011 0000 0001, hex: 0xFB01
Number -1 in binary: 1111 1111 1111 1111 1111 1111 1111 1111
- Step 1:
111 1111 111 1111 111 1111 111 1111 1111 - Step 2:
1111 1111 1111 1111 1111 1111 1111 1111 0111 1000
Number -1 after varint encoding: 1111 1111 1111 1111 1111 1111 1111 1111 0111 1000, hex: 0xFFFFFFFF78
For negative types, varint encoding is inefficient
ZigZag is mainly to solve the inefficiency of varint encoding for negative types. The main idea is:
Encodingmaps negative numbers to corresponding positive numbers, then uses Varint encoding;Decodingmaps back to negative numbers after decoding the positive number;
ZigZag encoding mapping:
| Signed Original | Encoded As |
|---|---|
| 0 | 0 |
| -1 | 1 |
| 1 | 2 |
| -2 | 3 |
| 2147483647 | 4294967294 |
| -2147483648 | 4294967295 |
ZigZag mapping is implemented by shifting:
- sint32
Zigzag(n) = (n << 1) ^ (n >> 31) - sint64
Zigzag(n) = (n << 1) ^ (n >> 63)
Number -1 in binary: 1111 1111 1111 1111 1111 1111 1111 1111
Zigzag encoding:
- n << 1:
1111 1111 1111 1111 1111 1111 1111 1110 - n >> 31:
1111 1111 1111 1111 1111 1111 1111 1111 (n << 1) ^ (n >> 31):0000 0000 0000 0000 0000 0000 0000 0001- Varint encoding:
0000 0001 0000 0000 0000 0000 0000 0000
Number -1 after Zigzag+varint encoding: 0000 0001, hex: 0x01
The message structure is as follows:
Field structure: |Tag|Length|Value| or |Tag|Value|
-
Tag(varint encoding)Structure:
wire_type is 3 bits, can represent 8 types
Type Meaning Used For 0 Varint int32, int64, uint32, uint64, sint32, sint64, bool, enum 1 64-bit fixed64, sfixed64, double 2 Length-delimited string, bytes, embedded messages, packed repeated fields 3 Start group groups(deprecated) 4 End group groups(deprecated) 5 32-bit fixed32, sfixed32, float The field type is affected by the wire_type value:
- 0
|Tag|Value|, value uses varint encoding, no extra bits are needed to indicate the length of the value, because the MSB of varint indicates whether the next byte is valid, which serves as a length indicator. - 1, 5
|Tag|Value|, length is fixed at 32 or 64 bits, no extra bits are needed to indicate the value length. - 2
|Tag|Length|Value|, indicates a variable-length value, needs Length to indicate the length.
- 0
-
LengthOptional, determined by the last three bits (wire_type) ofTag. -
Value(varint encoding)
| Size (bytes) | Type |
|---|---|
| 1 | + bool |
| 4 | + float + fixed32 + sfixed32 |
| 8 | + double + fixed64 + sfixed64 |
| variable | + int64 + uint64 + int32 + uint32 + enum + sint32 + sint64 |
| length+data | + string + message + bytes |
- int type encoding length for positive numbers is proportional to the value size; for negative numbers, encoding length is fixed at the maximum (int32 and int64 are both 10 bytes)
- uint type encoding length is the same as int type for positive numbers, and supports a larger value range
- sint type encoding length is proportional to the absolute value size
The Protobuf protocol supports bootstrapping itself, making implementation quite simple! The relationships among multiple protocols are as follows:
TODO
- Download protobuf source code and compile/install
git clone git@github.com:protocolbuffers/protobuf.git
cd protobuf && ./autogen.sh
./configure --prefix=/usr/local/protobuf
make && make check
make install && ldconfig- Define proto file
xxx.protomessage Example1 { optional string stringVal = 1; optional bytes bytesVal = 2; message EmbeddedMessage { int32 int32Val = 1; string stringVal = 2; } optional EmbeddedMessage embeddedExample1 = 3; repeated int32 repeatedInt32Val = 4; repeated string repeatedStringVal = 5; }
- Generate C++ file
protoc -I=xxx.proto --cpp_out=./
- Call the generated C++ file
#include <iostream> #include <fstream> #include <string> #include "xxx.pb.h" int main() { Example example1; example1.set_stringval("hello, world"); example1.set_bytesval("are you ok?"); Example1_EmbeddedMessage *embeddedExample2 = new Example1_EmbeddedMessage(); embeddedExample2->set_int32val(1); embeddedExample2->set_stringval("embeddedInfo"); example1.set_allocated_embeddedexample1(embeddedExample2); example1.add_repeatedint32val(2); example1.add_repeatedint32val(3); example1.add_repeatedstringval("repeated1"); example1.add_repeatedstringval("repeated2"); std::string filename = "single_length_delimited_all_example1_val_result"; std::fstream output(filename, std::ios::out | std::ios::trunc | std::ios::binary); if (!example.SerializeToOstream(&output)) { std::cerr << "Failed to write example1." << std::endl; exit(-1); } return 0; }
(To be confirmed)
For Protocol Buffer, fields with tag values 1 to 15 can be optimized during encoding, i.e., the tag value and type info only take up one byte. Tags in the range 16 to 2047 take up two bytes. Protocol Buffer supports up to
[1] Chen Shuo. Linux Multithreaded Server Programming - Using the muduo C++ Network Library.
- Protocol Buffers
- github protobuf
- Protocol Buffers#extensions
- Wikipedia - Protocol Buffers
- Google Protocol Buffers Practical Techniques: Parsing .proto files and arbitrary data files
- In-depth ProtoBuf - Introduction
- Protobuf Encoding
- Distinguishing missing and default values in Protobuf
- pbc implementation analysis


