string and bytes types in Protocol Buffers (langzi989's blog)

This blog has moved. Original address: https://langzi989.github.io/2017/06/07/protoBuffer中string与byte类型区别/

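From the introduction to Protocol Buffers in the previous section, we know that protobuf offers two types for string-like data: string and bytes. What is the difference between them, when should each one be used, and which C++ types do they map to? This post answers those questions.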

Differences between string and bytes

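As a rule of thumb, bytes is meant for storing binary data. In C++, however, std::string can hold ASCII text as well as binary sequences containing any number of \0 bytes. So where does the difference lie?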
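To make the C++ side of that statement concrete, here is a minimal sketch showing that std::string keeps embedded \0 bytes as long as the length is passed explicitly. The function names make_binary_blob and check_binary_blob are made up for this illustration and are not part of protobuf.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Builds a std::string from raw bytes, including embedded '\0' bytes.
// The length is passed explicitly: the const char* constructor without a
// length would stop at the first '\0'.
std::string make_binary_blob() {
    const char raw[] = {'a', '\0', 'b', '\0', '\xff'};
    return std::string(raw, sizeof(raw));
}

// Small self-check: all five bytes survive, while the length-less
// constructor truncates at the first '\0'.
void check_binary_blob() {
    const std::string blob = make_binary_blob();
    assert(blob.size() == 5);
    assert(blob[1] == '\0');
    assert(std::string("a\0b").size() == 1);
}
```

Calling check_binary_blob() from any test or main function exercises the three assertions. The last byte, 0xFF, can never appear in valid UTF-8, which is exactly the kind of data the next paragraph is about.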

A protobuf string field (as distinct from a C++ std::string) must not contain invalid UTF-8. If it does, protobuf reports an error when the message is serialized:

[libprotobuf ERROR google/protobuf/wire_format.cc:1091] String field 'str' contains invalid UTF-8 data when serializing a protocol buffer. Use the 'bytes' type if you intend to send raw bytes.

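### Cause of the error above
The analysis here is based on the protobuf source code. During serialization, protobuf calls the function SerializeFieldWithCachedSizes. Let's compare how this function handles string and bytes fields.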

For the string type:

case FieldDescriptor::TYPE_STRING: {
  string scratch;
  const string& value = field->is_repeated() ?
    message_reflection->GetRepeatedStringReference(
      message, field, j, &scratch) :
    message_reflection->GetStringReference(message, field, &scratch);
  // Extra step that only string fields go through: UTF-8 validation.
  VerifyUTF8StringNamedField(value.data(), value.length(), SERIALIZE,
                             field->name().c_str());
  WireFormatLite::WriteString(field->number(), value, output);
  break;
}

For the bytes type:

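case FieldDescriptor::TYPE_BYTES: {
  string scratch;
  const string& value = field->is_repeated() ?
    message_reflection->GetRepeatedStringReference(
      message, field, j, &scratch) :
    message_reflection->GetStringReference(message, field, &scratch);
  // No validation here: the value is written as-is.
  WireFormatLite::WriteBytes(field->number(), value, output);
  break;
}

As the code above shows, the main difference between serializing string and bytes is that the string branch calls VerifyUTF8StringNamedField to check whether the value contains invalid UTF-8. The check itself is carried out by WireFormat::VerifyUTF8StringFallback, which VerifyUTF8StringNamedField delegates to. Its implementation is as follows:
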
void WireFormat::VerifyUTF8StringFallback(const char* data,
                                          int size,
                                          Operation op,
                                          const char* field_name) {
  if (!IsStructurallyValidUTF8(data, size)) {
    const char* operation_str = NULL;
    switch (op) {
      case PARSE:
        operation_str = "parsing";
        break;
      case SERIALIZE:
        operation_str = "serializing";
        break;
      // no default case: have the compiler warn if a case is not covered.
    }
    string quoted_field_name = "";
    if (field_name != NULL) {
      quoted_field_name = StringPrintf(" '%s'", field_name);
    }
    // no space below to avoid double space when the field name is missing.
    GOOGLE_LOG(ERROR) << "String field" << quoted_field_name << " contains invalid "
               << "UTF-8 data when " << operation_str << " a protocol "
               << "buffer. Use the 'bytes' type if you intend to send raw "
               << "bytes. ";
  }
}

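To give a feel for what a structural UTF-8 check involves, here is a simplified sketch. The name looks_like_valid_utf8 is hypothetical, and this is not protobuf's IsStructurallyValidUTF8, whose implementation may differ in details. It only illustrates the usual rules: valid lead bytes, the right number of continuation bytes, no overlong encodings, no surrogates, and nothing above U+10FFFF.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper, NOT protobuf's IsStructurallyValidUTF8.
// Returns true if `s` is a well-formed UTF-8 byte sequence.
bool looks_like_valid_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const std::uint8_t lead = static_cast<std::uint8_t>(s[i]);
        std::size_t extra = 0;     // continuation bytes expected after the lead byte
        std::uint32_t cp = 0;      // decoded code point
        std::uint32_t min_cp = 0;  // smallest code point allowed for this length
        if (lead < 0x80)                { extra = 0; cp = lead;        min_cp = 0; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; min_cp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; min_cp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; min_cp = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte

        if (n - i - 1 < extra) return false;  // sequence cut off at end of input
        for (std::size_t k = 1; k <= extra; ++k) {
            const std::uint8_t c = static_cast<std::uint8_t>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min_cp) return false;                   // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                 // outside the Unicode range
        i += extra + 1;
    }
    return true;
}
```

With this sketch, plain ASCII and correctly encoded multi-byte text pass, while a lone 0xFF byte or a multi-byte sequence cut off in the middle is rejected. Values of the rejected kind are the ones that trigger the log message above when stored in a string field.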
Differences between string and bytes in C++ and Java
 
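The protobuf types map to C++ and Java types as follows:
In C++, both string and bytes are implemented as std::string.
In Java, string and bytes are implemented as String and ByteString respectively.

If bytes can represent everything string can, why is string needed at all?
According to forum discussions, Java offers many more APIs for String than for ByteString, so any field that can be declared as string should be. If a field's value definitely or possibly contains invalid UTF-8, use bytes instead.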
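Since the two serialization branches differ only in the UTF-8 check, the bytes written for a string field and a bytes field with the same content are identical: a tag with wire type 2, a varint length, then the raw payload. The following is a minimal sketch of that length-delimited encoding. The helper names append_varint and encode_length_delimited are assumptions made for this example and are not protobuf's API.

```cpp
#include <cstdint>
#include <string>

// Appends `value` as a base-128 varint: 7 payload bits per byte,
// lowest group first, high bit set on every byte except the last.
void append_varint(std::uint64_t value, std::string* out) {
    while (value >= 0x80) {
        out->push_back(static_cast<char>((value & 0x7F) | 0x80));
        value >>= 7;
    }
    out->push_back(static_cast<char>(value));
}

// Hypothetical helper: encodes one length-delimited field (wire type 2).
// The same routine covers both string and bytes fields, because the
// payload is copied verbatim in either case.
std::string encode_length_delimited(std::uint32_t field_number,
                                    const std::string& payload) {
    std::string out;
    append_varint((static_cast<std::uint64_t>(field_number) << 3) | 2, &out);
    append_varint(payload.size(), &out);
    out += payload;
    return out;
}
```

For field number 1 and the payload "abc", the sketch produces the five bytes 0A 03 61 62 63, whichever of the two types the field is declared as. The choice between string and bytes is therefore about validation and language-level APIs rather than about the encoded size.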