protobuf string/bytes

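This article looks at how the string and bytes types differ in Protobuf's implementation, in particular how UTF-8 validation differs during serialization. For applications where efficiency matters, the bytes type is recommended.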

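protobuf provides several basic data types, including string and bytes. Taken literally, bytes is meant for arbitrary binary byte sequences. To a C++ programmer, however, std::string can hold ASCII text as well as binary data containing any number of \0 bytes, and both protobuf types map to std::string in the generated C++ code. So where does the difference lie?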
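To make the std::string point concrete, here is a minimal sketch in plain C++ with no protobuf involved. The helper names FromCString, FromBuffer and DemoEmbeddedNul are hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: builds a std::string from a NUL-terminated C string.
// Construction stops at the first '\0'.
inline std::string FromCString(const char* s) { return std::string(s); }

// Hypothetical helper: builds a std::string from a buffer plus an explicit
// length, so every byte is kept, including embedded '\0' bytes.
inline std::string FromBuffer(const char* data, std::size_t size) {
  return std::string(data, size);
}

// Small self-check showing the difference between the two constructions.
inline void DemoEmbeddedNul() {
  const char raw[] = {'a', 'b', '\0', 'c', 'd'};
  // The C-string constructor sees only "ab".
  assert(FromCString(raw).size() == 2);
  // The (pointer, length) constructor keeps all five bytes.
  assert(FromBuffer(raw, sizeof(raw)).size() == 5);
  // C APIs such as strlen() still stop at the first '\0'.
  assert(std::strlen(FromBuffer(raw, sizeof(raw)).c_str()) == 2);
}
```

So as a container, std::string is perfectly able to carry binary data as long as the length is passed explicitly. Whatever separates string from bytes has to come from protobuf itself, not from the C++ type.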

In practice, we also occasionally run into runtime errors like these:

[libprotobuf ERROR google/protobuf/wire_format.cc:1091] String field 'str' contains invalid UTF-8 data when serializing a protocol buffer. Use the 'bytes' type if you intend to send raw bytes.
[libprotobuf ERROR google/protobuf/wire_format.cc:1091] String field 'str' contains invalid UTF-8 data when parsing a protocol buffer. Use the 'bytes' type if you intend to send raw bytes.

This article analyzes the difference between the string and bytes types from the source code.

An earlier article covered how protobuf serialization works. Here we look at how string and bytes fields are serialized.

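All serialization goes through the function SerializeFieldWithCachedSizes, which calls the serialization routine matching the field's type. It has one branch for the string type and another for the bytes type.

Comparing the two branches, there are two main differences during serialization:

1. The string branch calls VerifyUTF8StringNamedField on the value before writing it; the bytes branch does not.
2. The two branches call different write functions.

As for the second point, both write functions are defined in wire_format_lite.cc, and their implementations are the same.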
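That the two write paths behave identically fits the wire format: string and bytes are both length-delimited fields (wire type 2), encoded as a tag varint, a length varint, and then the raw bytes. The following is a minimal standalone sketch of that encoding, not protobuf's own code; AppendVarint, EncodeLengthDelimited and DemoSameBytes are hypothetical names used only here.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Appends v as a base-128 varint: 7 payload bits per byte, with the high bit
// set on every byte except the last.
inline void AppendVarint(std::uint64_t v, std::string* out) {
  while (v >= 0x80) {
    out->push_back(static_cast<char>((v & 0x7F) | 0x80));
    v >>= 7;
  }
  out->push_back(static_cast<char>(v));
}

// Hypothetical helper: encodes one length-delimited field (wire type 2).
// Nothing in this encoding depends on whether the field was declared as
// string or bytes in the .proto file.
inline std::string EncodeLengthDelimited(std::uint32_t field_number,
                                         const std::string& value) {
  std::string out;
  AppendVarint((static_cast<std::uint64_t>(field_number) << 3) | 2, &out);  // tag
  AppendVarint(value.size(), &out);                                         // length
  out.append(value);                                                        // payload
  return out;
}

// Self-check: field 1 holding "abc" encodes to 0A 03 61 62 63.
inline void DemoSameBytes() {
  const std::string encoded = EncodeLengthDelimited(1, "abc");
  assert(encoded == std::string("\x0a\x03" "abc", 5));
}
```

Under this sketch, a string field and a bytes field with the same number and the same content produce byte-for-byte identical output, so any cost difference between the two types has to come from the validation step.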

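So let us continue with the first point. VerifyUTF8StringNamedField calls VerifyUTF8StringFallback (as an aside, I have never really understood what "fallback" means here, even though the suffix shows up all over the protobuf source). Here is the implementation of that function: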

void WireFormat::VerifyUTF8StringFallback(const char* data,
                                          int size,
                                          Operation op,
                                          const char* field_name) {
  if (!IsStructurallyValidUTF8(data, size)) {
    const char* operation_str = NULL;
    switch (op) {
      case PARSE:
        operation_str = "parsing";
        break;
      case SERIALIZE:
        operation_str = "serializing";
        break;
      // no default case: have the compiler warn if a case is not covered.
    }
    string quoted_field_name = "";
    if (field_name != NULL) {
      quoted_field_name = StringPrintf(" '%s'", field_name);
    }
    // no space below to avoid double space when the field name is missing.
    GOOGLE_LOG(ERROR) << "String field" << quoted_field_name << " contains invalid "
                      << "UTF-8 data when " << operation_str << " a protocol "
                      << "buffer. Use the 'bytes' type if you intend to send raw "
                      << "bytes. ";
  }
}
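
The body of IsStructurallyValidUTF8 is not shown above. To give a rough idea of what a structural UTF-8 check involves, and of which inputs trigger the error message, here is a simplified standalone sketch. It is not protobuf's implementation, and the names LooksLikeValidUtf8 and DemoUtf8Check are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Simplified structural UTF-8 check (an illustration only, NOT protobuf's
// IsStructurallyValidUTF8). It scans every byte, so its cost grows linearly
// with the length of the input.
inline bool LooksLikeValidUtf8(const std::string& s) {
  const unsigned char* p = reinterpret_cast<const unsigned char*>(s.data());
  const std::size_t n = s.size();
  std::size_t i = 0;
  while (i < n) {
    const unsigned char b = p[i];
    if (b < 0x80) { ++i; continue; }  // single-byte (ASCII) character
    std::size_t extra = 0;            // continuation bytes expected
    unsigned int cp = 0;              // decoded code point
    unsigned int min_cp = 0;          // smallest code point for this length
    if ((b & 0xE0) == 0xC0)      { extra = 1; cp = b & 0x1F; min_cp = 0x80; }
    else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; min_cp = 0x800; }
    else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; min_cp = 0x10000; }
    else return false;                 // stray continuation or invalid lead byte
    if (n - i <= extra) return false;  // sequence is cut off at the end
    for (std::size_t k = 1; k <= extra; ++k) {
      if ((p[i + k] & 0xC0) != 0x80) return false;  // not a continuation byte
      cp = (cp << 6) | (p[i + k] & 0x3F);
    }
    if (cp < min_cp) return false;                   // overlong encoding
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
    if (cp > 0x10FFFF) return false;                 // beyond Unicode range
    i += extra + 1;
  }
  return true;
}

// Self-check with the same two characters in two different encodings.
inline void DemoUtf8Check() {
  assert(LooksLikeValidUtf8("\xE4\xB8\xAD\xE6\x96\x87"));  // UTF-8 bytes
  assert(!LooksLikeValidUtf8("\xD6\xD0\xCE\xC4"));         // GBK bytes
}
```

Text in a legacy encoding such as GBK, or arbitrary binary data, typically fails a check like this, which is the situation the error message describes. A check of this kind also has to walk the whole value on every call, which is the extra work that a bytes field skips.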