Datax-HdfsWriter如何实现支持Map,List,Set等符合类型数据写入

最新推荐文章于 2024-09-23 07:30:00 发布

Shadow_Light

最新推荐文章于 2024-09-23 07:30:00 发布

阅读量816

点赞数

CC 4.0 BY-SA版权

分类专栏： Datax

本文链接：https://blog.youkuaiyun.com/Shadow_Light/article/details/108775512

Datax 专栏收录该内容

6 篇文章

订阅专栏

本文介绍了如何将DataX升级到3.0版本，并针对Hadoop 2.7.3和Hive 2.3.2环境，实现对HDFS写入Map类型数据的支持。通过改造HdfsWriter，新增对Map数据类型的处理，包括在SupportHiveDataType中添加MAP类型，修改HdfsHelper以处理Map数据。测试结果显示，Map数据能成功写入Hive的Map字段，对于Text格式的数据，需要额外处理并指定字段分隔符。最后提供了测试案例及代码地址。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

一、问题背景

之前分析hdfswriter时，看到过api支持map,list,set类型的数据写入，因此，想将其合入到datax里，使得源数据端有类似json,array等这种数据可以直接导入到hive中。

二. 环境准备

Datax版本：3.0

Hadoop版本：2.7.3

Hive版本：2.3.2

三. 源码

hdfs写入时，最重要的是获取字段对应类型的ObjectInspector，因此直接从ObjectInspectorFactory类入手看：

ObjectInspectorFactory：

private static ObjectInspector getReflectionObjectInspectorNoCache(Type t, ObjectInspectorFactory.ObjectInspectorOptions options) {
...
if (t instanceof ParameterizedType) {
	ParameterizedType pt = (ParameterizedType)t;
	if (List.class.isAssignableFrom((Class)pt.getRawType()) || Set.class.isAssignableFrom((Class)pt.getRawType())) {
	    return getStandardListObjectInspector(getReflectionObjectInspector(pt.getActualTypeArguments()[0], options));
	}

	if (Map.class.isAssignableFrom((Class)pt.getRawType())) {
	    return getStandardMapObjectInspector(getReflectionObjectInspector(pt.getActualTypeArguments()[0], options), getReflectionObjectInspector(pt.getActualTypeArguments()[1], options));
	}

	t = pt.getRawType();
}
...
}

可以看出，如果类型是Map,List,Set类型的，实际获取的inspector都有对应的方法去获取。以下就拿其中的Map类型做例子来进行讲解：

ObjectInspectorFactory：

public static StandardMapObjectInspector getStandardMapObjectInspector(ObjectInspector mapKeyObjectInspector, ObjectInspector mapValueObjectInspector) {
    ArrayList<ObjectInspector> signature = new ArrayList(2);
    signature.add(mapKeyObjectInspector);
    signature.add(mapValueObjectInspector);
    StandardMapObjectInspector result = (StandardMapObjectInspector)cachedStandardMapObjectInspector.get(signature);
    if (result == null) {
        result = new StandardMapObjectInspector(mapKeyObjectInspector, mapValueObjectInspector);
        StandardMapObjectInspector prev = (StandardMapObjectInspector)cachedStandardMapObjectInspector.putIfAbsent(signature, result);
        if (prev != null) {
            result = prev;
        }
    }

    return result;
}

从这个方法中，可以得知，要想获取map类型的inspector，需要知道对应的Key和Value的对应类型inspector，而hive中又不存在object这个类型，所以一般我们都是直接用string类型来接所有的map值，而且一般源数据读取的都是json格式的字符串类型，以下就以Key和Value都是String来对hdfswriter进行改造：

1、首先还是在SupportHiveDataType中新增一个MAP的类型

2、在HdfsHelper中修改2处地方，一个是源数据源的数据进行map类型转换，一个是获取Map对应的ObjectInspector

HdfsHelper:

public List<ObjectInspector>  getColumnTypeInspectors(List<Configuration> columns){
...
case MAP:
	ObjectInspector stringInspector = ObjectInspectorFactory.getReflectionObjectInspector(String.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA);
	objectInspector = ObjectInspectorFactory.getStandardMapObjectInspector(stringInspector, stringInspector);
break;
...
}

public static MutablePair<List<Object>, Boolean> transportOneRecord(
            Record record,List<Configuration> columnsConfiguration,
            TaskPluginCollector taskPluginCollector){
...
case MAP:
    // 获取Map
	recordList.add(JSON.parseObject(column.asString(), Map.class));
break;
...
}

四、测试结果

Mysql端数据：

写入ORC后查询结果：

但是对于Text类型的数据，还是需要进行额外的改动，具体改造思路可参考java往hive 的map类型字段写数据

public static MutablePair<List<Object>, Boolean> transportOneRecord(
            Record record,List<Configuration> columnsConfiguration,
            TaskPluginCollector taskPluginCollector,
            // 新增一个参数，区分text还是orc
            String fileType){
...
case MAP:
    Map map = JSON.parseObject(column.asString(), Map.class);
    if(fileType.equalsIgnoreCase("TEXT")){
        // text类型特殊处理
        recordList.add(map.toString().replaceAll("\\{|}|\\s",""));
    }else {
        recordList.add(map);
    }
break;
...
}

建表时，也需要指定一下集合的分隔符：

create table test_map(
    id      int,
    student map<string,string>
)
row format delimited
fields terminated by '|'
collection items terminated by ','
map keys terminated by '='
;

Datax指定字段分割符：

"writer": {
	"name": "hdfswriter",
	"parameter": {
	"defaultFS": "hdfs://node:9000",
	"fileType": "text",
	"path": "/usr/hive/warehouse/test.db/test_map/",
	"fileName": "test",
	"column": [
	  {
	    "name": "id",
	    "type": "int"
	  },
	  {
	    "name": "student",
	    "type": "map"
	  }
	],
	"writeMode": "append",
	"fieldDelimiter": "|"
}

最终执行结果：

五、代码地址

github地址