1. Set up the connection (see the earlier article "Java API操作HA方式下的Hadoop", i.e. using the Java API with Hadoop in HA mode):
static String ClusterName = "nsstargate";
private static final String HADOOP_URL = "hdfs://" + ClusterName;
public static Configuration conf;

static {
    conf = new Configuration();
    conf.set("fs.defaultFS", HADOOP_URL);
    conf.set("dfs.nameservices", ClusterName);
    conf.set("dfs.ha.namenodes." + ClusterName, "nn1,nn2");
    conf.set("dfs.namenode.rpc-address." + ClusterName + ".nn1", "172.16.50.24:8020");
    conf.set("dfs.namenode.rpc-address." + ClusterName + ".nn2", "172.16.50.21:8020");
    conf.set("dfs.client.failover.proxy.provider." + ClusterName,
            "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
    conf.set("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem");
}
Note: if you only create Configuration conf = new Configuration(); without setting the HDFS connection info, the file will be written to the local disk instead (this requires the Hadoop environment to be configured locally).
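As a quick sanity check (a minimal sketch, not from the original article), you can print the URI of the filesystem the Configuration resolves to; with the HA settings above it should report hdfs://nsstargate rather than a local file:// path:

import org.apache.hadoop.fs.FileSystem;

// Prints the scheme/authority of the default filesystem.
// With the HA settings above this should be hdfs://nsstargate;
// with a bare `new Configuration()` it is typically file:///.
FileSystem checkFs = FileSystem.get(conf);
System.out.println(checkFs.getUri());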
2. Define the Parquet file schema
public static final MessageType FILE_SCHEMA = Types.buildMessage()
        .required(PrimitiveTypeName.INT32).named("id")
        .required(PrimitiveTypeName.BINARY).as(OriginalType.UTF8).named("name")
        .named("test");
3. Create the Parquet file and write data
Two approaches have been tested successfully:
public static void createFile1() throws Exception {
    String file = "/user/test/test_parquet_file7.parquet";
    Path path = new Path(file);
    FileSystem fs = path.getFileSystem(conf);
    if (fs.exists(path)) {
        fs.delete(path, true);
    }
    GroupWriteSupport.setSchema(FILE_SCHEMA, conf);
    SimpleGroupFactory f = new SimpleGroupFactory(FILE_SCHEMA);
    // CODEC is a CompressionCodecName (e.g. CompressionCodecName.SNAPPY);
    // its definition is not shown in the original snippet.
    ParquetWriter<Group> writer = new ParquetWriter<Group>(path, new GroupWriteSupport(),
            CODEC, 1024, 1024, 512, true, false, WriterVersion.PARQUET_2_0, conf);
    for (int i = 0; i < 100; i++) {
        writer.write(
                f.newGroup()
                        .append("id", i)
                        .append("name", UUID.randomUUID().toString()));
    }
    writer.close();
}
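Note that this ParquetWriter constructor is deprecated in recent parquet-mr releases; the same tuning parameters can be passed through the builder used in the second method below (a sketch assuming parquet-hadoop 1.8+, and assuming CODEC is a CompressionCodecName such as SNAPPY):

// Builder-style equivalent of the constructor call above.
ParquetWriter<Group> writer = ExampleParquetWriter.builder(path)
        .withConf(conf)
        .withType(FILE_SCHEMA)                              // replaces GroupWriteSupport.setSchema
        .withCompressionCodec(CompressionCodecName.SNAPPY)  // assumed value of CODEC
        .withRowGroupSize(1024)                             // block size
        .withPageSize(1024)
        .withDictionaryPageSize(512)
        .withDictionaryEncoding(true)
        .withValidation(false)
        .withWriterVersion(WriterVersion.PARQUET_2_0)
        .build();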
public static void createFile2() throws Exception {
    String file = "/user/test/test_parquet_file8.parquet";
    Path path = new Path(file);
    FileSystem fs = path.getFileSystem(conf);
    if (fs.exists(path)) {
        fs.delete(path, true);
    }
    SimpleGroupFactory f = new SimpleGroupFactory(FILE_SCHEMA);
    ParquetWriter<Group> writer = ExampleParquetWriter.builder(path)
            .withConf(conf)
            .withType(FILE_SCHEMA)
            .build();
    for (int i = 0; i < 100; i++) {
        Group group = f.newGroup();
        group.add("id", i);
        group.add("name", UUID.randomUUID().toString());
        writer.write(group);
    }
    writer.close();
}
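To verify the output, the file can be read back with the matching example read support (a minimal sketch, not from the original post, using ParquetReader and GroupReadSupport from the same parquet-mr example package):

import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.example.GroupReadSupport;

// Read the file back and print each record.
ParquetReader<Group> reader = ParquetReader.builder(new GroupReadSupport(), path)
        .withConf(conf)
        .build();
Group g;
while ((g = reader.read()) != null) {
    System.out.println(g.getInteger("id", 0) + "\t" + g.getString("name", 0));
}
reader.close();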
The two classes used above, GroupWriteSupport and ExampleParquetWriter, come from the parquet-mr project.
4. Create the Hive table and load the Parquet file
create table test_parquet_table(id int, name string) stored as parquet;
load data inpath '/user/test/test_parquet_file8.parquet' overwrite into table test_parquet_table;
When creating the file and importing it into a Hive table, note the following:
When a field is of boolean type, declare it as boolean in the schema, write boolean values, and declare the corresponding Hive column as boolean. The columns in the Hive table do not have to appear in the same order as the fields in the schema, but the names must match (see the sketch below).
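For example (a hypothetical sketch, not from the original post), a schema with a boolean flag field would be declared like this, and the matching Hive column would simply be flag boolean:

// Schema with a boolean field; the matching Hive DDL would be:
//   create table t (id int, flag boolean) stored as parquet;
MessageType SCHEMA_WITH_BOOL = Types.buildMessage()
        .required(PrimitiveTypeName.INT32).named("id")
        .required(PrimitiveTypeName.BOOLEAN).named("flag")
        .named("test_bool");

// Writing a record: append a boolean value for the boolean field.
Group g = new SimpleGroupFactory(SCHEMA_WITH_BOOL).newGroup()
        .append("id", 1)
        .append("flag", true);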
In summary, this article showed how to connect to Hadoop through the Java API, define a Parquet file schema, create and write Parquet files in two different ways, and load the resulting files into a Hive table.
