1. Set up the connection (see the earlier article "Java API操作HA方式下的Hadoop", i.e. using the Java API with Hadoop in HA mode):
static String ClusterName = "nsstargate";
private static final String HADOOP_URL = "hdfs://" + ClusterName;
public static Configuration conf;

static {
    conf = new Configuration();
    conf.set("fs.defaultFS", HADOOP_URL);
    conf.set("dfs.nameservices", ClusterName);
    conf.set("dfs.ha.namenodes." + ClusterName, "nn1,nn2");
    conf.set("dfs.namenode.rpc-address." + ClusterName + ".nn1", "172.16.50.24:8020");
    conf.set("dfs.namenode.rpc-address." + ClusterName + ".nn2", "172.16.50.21:8020");
    conf.set("dfs.client.failover.proxy.provider." + ClusterName,
            "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
    conf.set("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem");
}
Note: if you only create Configuration conf = new Configuration(); without setting the HDFS connection info, the file will be written to the local disk instead (this requires the Hadoop environment to be configured locally).
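As a quick sanity check (a minimal sketch, not from the original article), you can print the URI of the filesystem the Configuration resolves to; with the HA settings above it should report hdfs://nsstargate rather than a local file:// path:

import org.apache.hadoop.fs.FileSystem;

// Prints the scheme/authority of the default filesystem.
// With the HA settings above this should be hdfs://nsstargate;
// with a bare `new Configuration()` it is typically file:///.
FileSystem checkFs = FileSystem.get(conf);
System.out.println(checkFs.getUri());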
2. Define the Parquet file schema
public static final MessageType FILE_SCHEMA = Types.buildMessage()
        .required(PrimitiveTypeName.INT32).named("id")
        .required(PrimitiveTypeName.BINARY).as(OriginalType.UTF8).named("name")
        .named("test");
3. Create the Parquet file and write data
Two approaches have been tested successfully:
public static void createFile1() throws Exception {
    String file = "/user/test/test_parquet_file7.parquet";
    Path path = new Path(file);
    FileSystem fs = path.getFileSystem(conf);
    if (fs.exists(path)) {
        fs.delete(path, true);
    }
    GroupWriteSupport.setSchema(FILE_SCHEMA, conf);
    SimpleGroupFactory f = new SimpleGroupFactory(FILE_SCHEMA);
    // CODEC is a CompressionCodecName (e.g. CompressionCodecName.SNAPPY);
    // its definition is not shown in the original snippet.
    ParquetWriter<Group> writer = new ParquetWriter<Group>(path, new GroupWriteSupport(),
            CODEC, 1024, 1024, 512, true, false, WriterVersion.PARQUET_2_0, conf);
    for (int i = 0; i < 100; i++) {
        writer.write(
                f.newGroup()
                        .append("id", i)
                        .append("name", UUID.randomUUID().toString()));
    }
    writer.close();
}
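Note that this ParquetWriter constructor is deprecated in recent parquet-mr releases; the same tuning parameters can be passed through the builder used in the second method below (a sketch assuming parquet-hadoop 1.8+, and assuming CODEC is a CompressionCodecName such as SNAPPY):

// Builder-style equivalent of the constructor call above.
ParquetWriter<Group> writer = ExampleParquetWriter.builder(path)
        .withConf(conf)
        .withType(FILE_SCHEMA)                              // replaces GroupWriteSupport.setSchema
        .withCompressionCodec(CompressionCodecName.SNAPPY)  // assumed value of CODEC
        .withRowGroupSize(1024)                             // block size
        .withPageSize(1024)
        .withDictionaryPageSize(512)
        .withDictionaryEncoding(true)
        .withValidation(false)
        .withWriterVersion(WriterVersion.PARQUET_2_0)
        .build();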
public static void createFile2() throws Exception {
    String file = "/user/test/test_parquet_file8.parquet";
    Path path = new Path(file);
    FileSystem fs = path.getFileSystem(conf);
    if (fs.exists(path)) {
        fs.delete(path, true);
    }
    SimpleGroupFactory f = new SimpleGroupFactory(FILE_SCHEMA);
    ParquetWriter<Group> writer = ExampleParquetWriter.builder(path)
            .withConf(conf)
            .withType(FILE_SCHEMA)
            .build();
    for (int i = 0; i < 100; i++) {
        Group group = f.newGroup();
        group.add("id", i);
        group.add("name", UUID.randomUUID().toString());
        writer.write(group);
    }
    writer.close();
}
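To verify the output, the file can be read back with the matching example read support (a minimal sketch, not from the original post, using ParquetReader and GroupReadSupport from the same parquet-mr example package):

import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.example.GroupReadSupport;

// Read the file back and print each record.
ParquetReader<Group> reader = ParquetReader.builder(new GroupReadSupport(), path)
        .withConf(conf)
        .build();
Group g;
while ((g = reader.read()) != null) {
    System.out.println(g.getInteger("id", 0) + "\t" + g.getString("name", 0));
}
reader.close();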
The two classes used above, GroupWriteSupport and ExampleParquetWriter, come from the parquet-mr project.
4. Create the Hive table and load the Parquet file
create table test_parquet_table(id int, name string) stored as parquet;
load data inpath '/user/test/test_parquet_file8.parquet' overwrite into table test_parquet_table;
When creating the file and importing it into a Hive table, note the following:
When a field is of boolean type, declare it as boolean in the schema, write boolean values, and declare the corresponding Hive column as boolean. The columns in the Hive table do not have to appear in the same order as the fields in the schema, but the names must match (see the sketch below).
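For example (a hypothetical sketch, not from the original post), a schema with a boolean flag field would be declared like this, and the matching Hive column would simply be flag boolean:

// Schema with a boolean field; the matching Hive DDL would be:
//   create table t (id int, flag boolean) stored as parquet;
MessageType SCHEMA_WITH_BOOL = Types.buildMessage()
        .required(PrimitiveTypeName.INT32).named("id")
        .required(PrimitiveTypeName.BOOLEAN).named("flag")
        .named("test_bool");

// Writing a record: append a boolean value for the boolean field.
Group g = new SimpleGroupFactory(SCHEMA_WITH_BOOL).newGroup()
        .append("id", 1)
        .append("flag", true);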
In summary, this article showed how to connect to Hadoop through the Java API, define a Parquet file schema, create and write Parquet files in two different ways, and load the resulting files into a Hive table.
