1. The Problem
- Environment:
aws EMR s3
hudi-0.10.1
spark-3.1.2
hive-3.1.2
hadoop-3.2.1
- Error log
Caused by: org.apache.hudi.exception.HoodieUpsertException: Failed to merge old record into new file for key cat_id:201225781 from old file s3://...parquet to new file s3://...parquet with writerSchema {
"type" : "record",
"name" : "dim_catpath_hudi_test_ex_record",
"namespace" : "hoodie.dim_catpath_hudi_test_ex",
"fields" : [ {
"name" : "_hoodie_commit_time",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_commit_seqno",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_record_key",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_partition_path",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_file_name",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "platform",
"type" : [ "null", "string" ],
"default" : null
}, {
"name" : "cat_id",
"type" : [ "null", "int" ],
"default" : null
}, {
"name" : "len",
"type" : "int"
}, {
"name" : "map_to",
"type" : "int"
}, {
"name" : "name_path",
"type" : {
"type" : "array",
"items" : [ "string", "null" ]
}
}, {
"name" : "id_path",
"type" : {
"type" : "array",
"items" : [ "string", "null" ]
}
}, {
"name" : "ts",
"type" : [ "null", "long" ],
"default" : null
} ]
}
at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:356)
at org.apache.hudi.table.action.commit.AbstractMergeHelper$UpdateHandler.consumeOneRecord(AbstractMergeHelper.java:122)
at org.apache.hudi.table.action.commit.AbstractMergeHelper$UpdateHandler.consumeOneRecord(AbstractMergeHelper.java:112)
at org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:37)
at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$2(BoundedInMemoryExecutor.java:121)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
... 3 more
Caused by: java.lang.RuntimeException: Null-value for required field: map_to
at org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:194)
at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:165)
at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:128)
at org.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:299)
at org.apache.hudi.io.storage.HoodieParquetWriter.writeAvro(HoodieParquetWriter.java:95)
at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:351)
... 8 more
2. Analyzing the Problem
- The Hudi table itself was created from Python/Scala through the DataFrame API, i.e. "write...save(...)".
- The SQL-side table was registered in the spark-sql environment by executing "create table ... using hudi partitioned by (" + partitions + ") options ( primaryKey = '" + pk + "', type = 'cow', preCombineField = 'ts' ) location '...'" (a sketch of both creation paths follows this list).
- On the options side we tried and combined many variants, to no avail.
- On the table-type side we tried loading the metadata as both external and managed tables; the same problem occurred either way.
- I will not walk through the whole tedious troubleshooting process; here are the current conclusion and workaround.
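For reference, a minimal sketch of the two creation paths described above, assuming a spark-shell session in which spark and a source DataFrame df already exist. The table name dim_catpath_hudi_test_ex is taken from the schema namespace in the log; the S3 path and the partition column (platform) are placeholders, while the record key (cat_id), precombine field (ts) and table type (cow) come from the error log and the create statement.

import org.apache.spark.sql.SaveMode

val basePath = "s3://my-bucket/dim_catpath_hudi_test_ex"  // placeholder location

// 1) Data written from code via the DataFrame API ("write...save(...)")
df.write.format("hudi").
  option("hoodie.table.name", "dim_catpath_hudi_test_ex").
  option("hoodie.datasource.write.recordkey.field", "cat_id").
  option("hoodie.datasource.write.partitionpath.field", "platform").  // assumed partition column
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
  mode(SaveMode.Append).
  save(basePath)

// 2) SQL-side table registered in spark-sql on top of the same location
spark.sql(
  s"""create table dim_catpath_hudi_test_ex using hudi
     |partitioned by (platform)
     |options ( primaryKey = 'cat_id', type = 'cow', preCombineField = 'ts' )
     |location '$basePath'""".stripMargin)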
3. Conclusion & Countermeasures
- Conclusion
When doing updates we have always used the "update table set col = ? where ..." pattern.
With that pattern, if any row outside the where condition still holds null in that column, the error above is thrown; the only way to avoid it is to widen the where condition so that every row whose value is null falls inside the update. Note that in the writer schema above map_to is declared as a plain "int" rather than a nullable union, which is presumably why Parquet treats it as a required field and rejects the null it encounters while the merge rewrites the untouched old rows ("Null-value for required field: map_to").
- An illustration of the conclusion above:
The table contains exactly two rows where map_to is null, with primary keys 127*** and 201*** respectively.
If only one of them is updated, the statement fails (the first upd); if both are updated together, it runs normally (the second upd). The successful case is precisely the one whose where condition covers every row with a null value (a sketch of the two updates follows this list).
- Temporary countermeasure
Update every null value in that column up front; once no null values remain, the Null-value exception no longer occurs (for an int column such as map_to, use a numeric placeholder like 0 rather than the empty string shown in the generic statement below).
update hudi.table_name set col = '' where col is null;
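To make the point of the missing figure concrete, here is a hedged sketch of the two updates, run through spark.sql. The table name and the map_to column come from the schema in the log; the key 201225781 appears in the stack trace, while the other key is masked as 127*** in the original, so a purely hypothetical literal stands in for it.

// 1st upd: the where condition covers only one of the two rows whose map_to is null
// (its real key is masked as 127*** in the original; the literal below is hypothetical).
// The untouched null row, key cat_id:201225781 in the stack trace above, is then copied
// into the new file during the merge and rejected by the required "int" writer schema:
// "Null-value for required field: map_to".
spark.sql("update dim_catpath_hudi_test_ex set map_to = 0 where cat_id = 127000000")

// 2nd upd: the where condition covers every row whose map_to is null, so no null value
// is carried over into the rewritten file and the statement succeeds.
spark.sql("update dim_catpath_hudi_test_ex set map_to = 0 where map_to is null")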
4. Closing Remarks
This scenario is not widespread; it only appears in certain specific environments (the exact environmental factors and root cause have not been identified yet, so the problem is not truly fixed; what is given here is only a temporary workaround, offered in the hope that someone more expert can point the way). In my case the overall environment is aws-EMR-s3, and the Hudi package is hudi-0.10.1, recompiled by capable colleagues on my team after adjustments for the AWS environment.