Java's String Cannot Handle Chinese UTF-8 Encoding

License: CC 4.0 BY-SA

Tags:

#java #string #utf-8 #camus #sequencefile

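The SequenceFile format supports file splitting, which makes it a good fit for MapReduce jobs. I recently worked on a project that writes protobuf data from Kafka to HDFS so that downstream offline jobs can analyze it.

In Kafka, each protobuf message is serialized to a byte array (the Kafka message is that byte array). We configured a job in LinkedIn's Camus (an open-source tool from LinkedIn for writing Kafka data to HDFS) to write the Kafka messages to HDFS in SequenceFile format. The SequenceFile key is org.apache.hadoop.io.LongWritable and the value is org.apache.hadoop.io.Text.

Everything went smoothly at first: we got the data onto HDFS, and I wrote my own Pig UDF to parse the protobuf data. Pig already ships a loader for reading SequenceFiles: org.apache.pig.piggybank.storage.SequenceFileLoader. Because the values are stored as Text, Pig converts them to chararray when it reads them. Then I ran into the following error: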

bad record, bad formatcom.google.protobuf.InvalidProtocolBufferException: While parsing a protocol message, the input ended unexpectedly in the middle of a field. This could mean either than the input has been truncated or that an embedded message misreported its own length.
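As the title hints, one plausible source of this kind of truncation error is that binary protobuf bytes do not survive a trip through a UTF-8 decoded string, which is what a Text to chararray conversion amounts to. The following is a minimal sketch using only the Java standard library, not the actual Camus or Pig code path; the class name `Utf8RoundTrip` and the method `roundTripThroughString` are hypothetical names chosen for illustration, and the sample bytes are the protobuf wire encoding of field 1 with the varint value 150.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8RoundTrip {
    // Decode arbitrary bytes as UTF-8 text, then encode the text back to bytes.
    // Byte sequences that are not valid UTF-8 are replaced with U+FFFD while decoding,
    // so the result may differ from the input.
    public static byte[] roundTripThroughString(byte[] raw) {
        String asText = new String(raw, StandardCharsets.UTF_8);
        return asText.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Illustrative protobuf wire bytes: tag 0x08 (field 1, varint), value 150 = 0x96 0x01.
        byte[] message = {0x08, (byte) 0x96, 0x01};
        byte[] after = roundTripThroughString(message);

        System.out.println("before: " + Arrays.toString(message));
        System.out.println("after:  " + Arrays.toString(after));
        System.out.println("identical: " + Arrays.equals(message, after));
    }
}
```

In this sketch the byte 0x96 is not a valid UTF-8 sequence on its own, so it is replaced by U+FFFD, which re-encodes as the three bytes EF BF BD. The array grows from 3 to 5 bytes and the varint is destroyed, which is the sort of damage a protobuf parser reports as an unexpectedly ended or mis-sized message. Pure ASCII payloads pass through unchanged, which can make the problem look intermittent.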

We then spent a long time on this, even inspecting the raw protobuf binary and decoding it by hand,