Overview
For the read flow in Hadoop 2.7.2, see the earlier post: HDFS Client Read Flow Analysis.
If you simply swap the 2.7.2 version for 3.2.0, you may hit No FileSystem for scheme "hdfs". The cause is that three HDFS-related dependencies now have to be declared:
<!-- versions shown as 3.2.0 per this article; omit if managed by a parent POM -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>3.2.0</version>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>3.2.0</version>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-hdfs</artifactId>
  <version>3.2.0</version>
</dependency>
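FileSystem resolves the "hdfs" scheme through ServiceLoader metadata shipped in these jars, which is why the scheme is unknown when they are missing. If the error persists even with the dependencies present (a common symptom of shaded jars whose builds merge META-INF/services files incorrectly), one general workaround, not prescribed by this article, is to bind the implementation class explicitly:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class HdfsSchemeFix {
  public static Configuration withHdfsScheme() {
    Configuration conf = new Configuration();
    // FileSystem consults fs.<scheme>.impl before falling back to the
    // ServiceLoader, so this binding survives a lossy shaded build.
    conf.set("fs.hdfs.impl", DistributedFileSystem.class.getName());
    return conf;
  }
}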
We keep using the original FsShellTest, switching to a 3.2.0 cluster and test path.
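For reference, a minimal sketch of such a driver; the real FsShellTest comes from the 2.7.2 post, and the cluster address and path below are illustrative placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FsShell;
import org.apache.hadoop.util.ToolRunner;

public class FsShellTest {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://nn-host:9000"); // placeholder 3.2.0 cluster
    // Equivalent to the command line: hadoop fs -text /tmp/test.txt
    int exitCode = ToolRunner.run(conf, new FsShell(), new String[]{"-text", "/tmp/test.txt"});
    System.exit(exitCode);
  }
}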
Call-stack analysis shows that execution still reaches the getInputStream method in Display:
at org.apache.hadoop.fs.shell.Display$Cat.getInputStream(Display.java:108)
at org.apache.hadoop.fs.shell.Display$Text.getInputStream(Display.java:125)
at org.apache.hadoop.fs.shell.Display$Cat.processPath(Display.java:96)
Display is the class dedicated to printing file content or checksums; it registers three inner subclasses, as shown below:
class Display extends FsCommand {
  public static void registerCommands(CommandFactory factory) {
    factory.addClass(Cat.class, "-cat");
    factory.addClass(Text.class, "-text");
    factory.addClass(Checksum.class, "-checksum");
  }

  /**
   * Displays file content to stdout
   */
  public static class Cat extends Display {
    ...
    protected InputStream getInputStream(PathData item) throws IOException {
      return item.fs.open(item.path);
    }
  }

  /**
   * Same behavior as "-cat", but handles zip and TextRecordInputStream
   * and Avro encodings.
   */
  public static class Text extends Cat {
    @Override
    protected InputStream getInputStream(PathData item) throws IOException {
      FSDataInputStream i = (FSDataInputStream)super.getInputStream(item);

      // Handle 0 and 1-byte files
      short leadBytes;
      try {
        leadBytes = i.readShort();
      } catch (EOFException e) {
        i.seek(0);
        return i;
      }

      // Check type of stream first
      switch(leadBytes) {
        case 0x1f8b: { // RFC 1952
          // Must be gzip
          i.seek(0);
          return new GZIPInputStream(i);
        }
        case 0x5345: { // 'S' 'E'
          // Might be a SequenceFile
          if (i.readByte() == 'Q') {
            i.close();
            return new TextRecordInputStream(item.stat);
          }
          break;
        }
        default: {
          // Check the type of compression instead, depending on Codec class's
          // own detection methods, based on the provided path.
          CompressionCodecFactory cf = new CompressionCodecFactory(getConf());
          CompressionCodec codec = cf.getCodec(item.path);
          if (codec != null) {
            i.seek(0);
            return codec.createInputStream(i);
          }
          break;
        }
        case 0x4f62: { // 'O' 'b'
          if (i.readByte() == 'j') {
            i.close();
            return new AvroFileInputStream(item.stat);
          }
          break;
        }
      }

      // File is non-compressed, or not a file container we know.
      i.seek(0);
      return i;
    }
  }
  // ... Checksum subclass omitted ...
}
Note that Text casts the InputStream returned by Cat's getInputStream() to FSDataInputStream, so that it can probe the leading bytes and seek back to the start before further processing.
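The switch above keys on the first two bytes as read big-endian by DataInput.readShort(), which is why gzip's magic bytes 0x1f 0x8b show up as the single value 0x1f8b. A standalone illustration (the header bytes are hard-coded here):

import java.io.ByteArrayInputStream;
import java.io.DataInputStream;

public class LeadBytesDemo {
  public static void main(String[] args) throws Exception {
    // The first bytes of any gzip stream: 0x1f 0x8b, then the deflate method byte
    byte[] gzipHeader = {(byte) 0x1f, (byte) 0x8b, 0x08};
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(gzipHeader))) {
      short leadBytes = in.readShort(); // big-endian: (0x1f << 8) | 0x8b
      System.out.printf("0x%04x%n", leadBytes); // prints 0x1f8b
    }
  }
}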
As in 2.7.2, the call to item.fs.open() marks the start of the client-side file read.
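In application code the same entry point looks like the sketch below (address and path are placeholders). For an hdfs:// URI, open() returns an FSDataInputStream wrapping a DFSInputStream that has already fetched the file's block locations from the NameNode, which is where the analysis continues:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OpenDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // For an hdfs:// URI this returns a DistributedFileSystem instance
    FileSystem fs = FileSystem.get(URI.create("hdfs://nn-host:9000"), conf);
    // open() is where the client read begins: it constructs a DFSInputStream,
    // which asks the NameNode for the file's block locations
    try (FSDataInputStream in = fs.open(new Path("/tmp/test.txt"))) {
      byte[] buf = new byte[4096];
      int n = in.read(buf);
      System.out.println("read " + n + " bytes");
    }
  }
}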
Differences in the Client Read Path