HDFS in Practice: A Complete Walkthrough of File Merging, from Shell Commands to the Java API - 优快云 Blog

Article link: https://blog.youkuaiyun.com/2302_80233472/article/details/146379489

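Introduction

HDFS (Hadoop Distributed File System) is the core storage component of the big data ecosystem and required learning for anyone studying big data. This article works through one complete lab exercise, combining screenshots and code demos, to teach the core skills of HDFS file operations and file merging step by step.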

I. HDFS Shell Operations, End to End

1. Starting the cluster and creating a directory

# Start HDFS and YARN
start-dfs.sh && start-yarn.sh

# Create the HDFS directory
hdfs dfs -mkdir -p /user/hadoop

Note: run hdfs dfs -ls /user to view the directory structure

2. Creating a local file and uploading it

# Create a test file on the Master node
vim /home/hadoop/a.txt

Note: type "2025" and save

# Upload to HDFS
hdfs dfs -put /home/hadoop/a.txt /user/

Note: run hdfs dfs -ls /user to confirm that the file exists

3. Verifying file access across nodes

# View the file from a Slave node
hdfs dfs -cat /user/a.txt

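Note: passwordless SSH login must be configured in advance (start-dfs.sh and start-yarn.sh rely on it to start daemons on the other nodes)

II. Merging Files with the Java API

1. Project setup

Tip: create a new Java project and select the JDK you prepared in advance

Tip: add the hadoop-common, hadoop-hdfs, and related JARs to the project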
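A quick way to confirm the JARs were picked up is to ask the JVM whether it can load the Hadoop classes that MergeFile imports. The sketch below is only an illustration: `ClasspathCheck` and `isOnClasspath` are hypothetical names, not part of the original project, and the class list is taken from the imports and settings used later in this article.

```java
// Hypothetical helper (not part of the original project): reports whether a class
// can be loaded, which tells you if the JAR that provides it is on the classpath.
class ClasspathCheck {
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] required = {
            "org.apache.hadoop.conf.Configuration",         // imported by MergeFile
            "org.apache.hadoop.fs.FileSystem",              // imported by MergeFile
            "org.apache.hadoop.hdfs.DistributedFileSystem"  // set as fs.hdfs.impl in MergeFile
        };
        for (String name : required) {
            System.out.println(name + " -> " + (isOnClasspath(name) ? "found" : "MISSING"));
        }
    }
}
```

Run it from the same project: any line that reports MISSING points to a JAR that still needs to be added.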

2. Core code walkthrough

// Key point: the regex keeps only 1.txt and 2.txt
FileStatus[] sourceStatus = fsSource.listStatus(
    inputPath, 
    new MyPathFilter(".*/(1\\.txt|2\\.txt)$")
);
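To see what this pattern accepts without touching a cluster, the same regex can be run against plain strings. This is a minimal sketch that assumes the filter receives paths in the hdfs://master:9000/user/... form used elsewhere in this article; `RegexFilterDemo` and `accepts` are illustrative names, and `accepts` only mirrors the `path.toString().matches(reg)` check in MyPathFilter.

```java
class RegexFilterDemo {
    // Same pattern that the article passes to MyPathFilter
    static final String PATTERN = ".*/(1\\.txt|2\\.txt)$";

    // Mirrors MyPathFilter.accept, but takes the path as a plain string
    static boolean accepts(String fullPath) {
        return fullPath.matches(PATTERN);
    }

    public static void main(String[] args) {
        String[] candidates = {
            "hdfs://master:9000/user/1.txt",
            "hdfs://master:9000/user/2.txt",
            "hdfs://master:9000/user/3.txt",  // the merge output itself is skipped
            "hdfs://master:9000/user/11.txt", // rejected: "/" must come directly before 1.txt
            "hdfs://master:9000/user/a.txt"
        };
        for (String p : candidates) {
            System.out.println(p + " -> " + accepts(p));
        }
    }
}
```

Because the pattern requires a "/" immediately before the file name, 11.txt is not mistaken for 1.txt, and the output file 3.txt is never read back in as a source.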

3. Verifying the result

hdfs dfs -cat /user/3.txt

Note: 3.txt should contain the full contents of 1.txt and 2.txt

III. Pitfalls to Avoid

1. Permission issues

Permission denied: adjust permissions with hdfs dfs -chmod

Solution:

hdfs dfs -chmod 777 /user/1.txt
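It helps to know what the octal mode passed to chmod actually grants. The sketch below is a plain Java illustration with no Hadoop dependency; `ModeDemo` and `toSymbolic` are hypothetical names. It expands each octal digit into the read, write, and execute bits for owner, group, and others.

```java
class ModeDemo {
    // Converts a three-digit octal mode such as "777" into rwx notation
    static String toSymbolic(String octal) {
        if (!octal.matches("[0-7]{3}")) {
            throw new IllegalArgumentException("expected three octal digits: " + octal);
        }
        StringBuilder sb = new StringBuilder();
        for (char c : octal.toCharArray()) {
            int bits = c - '0';                      // one digit per class: owner, group, others
            sb.append((bits & 4) != 0 ? 'r' : '-');  // 4 = read
            sb.append((bits & 2) != 0 ? 'w' : '-');  // 2 = write
            sb.append((bits & 1) != 0 ? 'x' : '-');  // 1 = execute
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("777 -> " + toSymbolic("777")); // rwxrwxrwx
        System.out.println("644 -> " + toSymbolic("644")); // rw-r--r--
    }
}
```

Mode 777 opens the file to every user, which is convenient for a lab cluster. For reading 1.txt during the merge, a narrower mode such as 644 would likely be enough, assuming the program runs as a user who can already read the directory.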

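2. Regex fails to match

Note: the path handed to the filter is the full URI (for example hdfs://master:9000/user/1.txt), so a pattern written for the bare file name will not match

Fix:
Write the regex against the full HDFS path: hdfs://master:9000/user/.*txt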
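The reason a short pattern fails is that `String.matches`, which MyPathFilter relies on, succeeds only when the regex covers the whole string. The following sketch shows this with plain strings; `FullUriRegexDemo`, `describe`, and `matchesWhole` are illustrative names, and the sample paths are assumptions based on the address used in this article.

```java
import java.net.URI;

class FullUriRegexDemo {
    static final String FULL_PATH = "hdfs://master:9000/user/1.txt";

    // Breaks a full HDFS path into the parts that a regex has to account for
    static String describe(String fullPath) {
        URI uri = URI.create(fullPath);
        return "scheme=" + uri.getScheme() + " host=" + uri.getHost()
                + " port=" + uri.getPort() + " path=" + uri.getPath();
    }

    // String.matches is true only if the regex matches the entire input
    static boolean matchesWhole(String input, String regex) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(describe(FULL_PATH));
        System.out.println(matchesWhole(FULL_PATH, "1\\.txt"));                         // false
        System.out.println(matchesWhole(FULL_PATH, "/user/.*txt"));                     // false
        System.out.println(matchesWhole(FULL_PATH, "hdfs://master:9000/user/.*txt"));   // true
        // ".*txt" accepts any name ending in "txt", with or without a dot
        System.out.println(matchesWhole("hdfs://master:9000/user/notes_txt",
                "hdfs://master:9000/user/.*txt"));                                       // true
        // escaping the dot restricts the match to names ending in ".txt"
        System.out.println(matchesWhole("hdfs://master:9000/user/notes_txt",
                "hdfs://master:9000/user/.*\\.txt"));                                    // false
    }
}
```

If only files with a real .txt extension should be merged, writing the tail of the pattern as `.*\\.txt` is one way to tighten it.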

IV. Full MergeFile.java Code

package abc;

import java.io.IOException;
import java.io.PrintStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;


/**
 * Keeps only the files whose full path matches the given regular expression
 */
class MyPathFilter implements PathFilter {
    String reg;
    MyPathFilter(String reg) {
        this.reg = reg;
    }
    @Override
    public boolean accept(Path path) {
        return path.toString().matches(reg);
    }
}

public class MergeFile {
    Path inputPath = null;
    Path outputPath = null;

    public MergeFile(String input, String output) {
        this.inputPath = new Path(input);
        this.outputPath = new Path(output);
    }

    public void doMerge() throws IOException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://master:9000");
        conf.set("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem");
        FileSystem fsSource = FileSystem.get(URI.create(inputPath.toString()), conf);
        FileSystem fsDst = FileSystem.get(URI.create(outputPath.toString()), conf);

        // Use a regex to select 1.txt and 2.txt
        FileStatus[] sourceStatus = fsSource.listStatus(inputPath,
                new MyPathFilter(".*/(1\\.txt|2\\.txt)$"));

        // Allow overwriting the output file; try-with-resources closes the stream even on failure
        try (FSDataOutputStream fsdos = fsDst.create(outputPath, true)) {
            PrintStream ps = System.out; // console echo; flushed, not closed, so System.out stays usable
            for (FileStatus sta : sourceStatus) {
                ps.print("Path: " + sta.getPath() + "    File size: " + sta.getLen()
                        + "   Permissions: " + sta.getPermission() + "   Content: ");
                try (FSDataInputStream fsdis = fsSource.open(sta.getPath())) {
                    byte[] data = new byte[1024];
                    int read;
                    while ((read = fsdis.read(data)) > 0) {
                        ps.write(data, 0, read); // echo to the console
                        fsdos.write(data, 0, read); // write to the target file
                    }
                }
            }
            ps.flush();
        }
    }

    public static void main(String[] args) throws IOException {
        MergeFile merge = new MergeFile(
                "hdfs://master:9000/user/",
                "hdfs://master:9000/user/3.txt");
        merge.doMerge();
    }
}
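The copy loop inside doMerge can be tried without a cluster by swapping the HDFS streams for in-memory ones. The sketch below is a stand-in, not the article's program: `MergeLoopDemo`, `merge`, and `mergeStrings` are hypothetical names, and `ByteArrayInputStream`/`ByteArrayOutputStream` take the place of `FSDataInputStream`/`FSDataOutputStream`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class MergeLoopDemo {
    // Same buffered copy pattern as doMerge: append every source to one destination
    static void merge(InputStream[] sources, OutputStream dst) throws IOException {
        byte[] data = new byte[1024];
        for (InputStream src : sources) {
            try (InputStream in = src) { // close each source after it is drained
                int read;
                while ((read = in.read(data)) > 0) {
                    dst.write(data, 0, read);
                }
            }
        }
    }

    // Convenience wrapper: merges strings through the byte-level loop above
    static String mergeStrings(String... parts) {
        InputStream[] sources = new InputStream[parts.length];
        for (int i = 0; i < parts.length; i++) {
            sources[i] = new ByteArrayInputStream(parts[i].getBytes(StandardCharsets.UTF_8));
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            merge(sources, out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.print(mergeStrings("contents of 1.txt\n", "contents of 2.txt\n"));
    }
}
```

The loop writes the bytes back to back and adds no separator. If a source file does not end with a newline, its last line and the first line of the next file end up on the same line of the merged output.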

💪 Mastering HDFS operations is a fundamental skill for big data engineers. As a next step, try merging TB-scale files with MapReduce! ⚡(ง •̀_•́)ง