Uploading multiple files to HDFS in a single operation

克终

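This article shows how to use Hadoop's FileStatus class and PathFilter interface together with wildcards to filter files, and how to upload files of a specific format from the local file system to HDFS through the Java API.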

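1. Requirements and scenario
Processing a whole set of files in a single operation is a common need. For example, a MapReduce job for log processing may have to analyze a month's worth of logs. Declaring those inputs file by file, or directory by directory, would be tedious. Instead, we can use wildcards to match multiple files at once (an operation also known as globbing).
2. Background needed to implement this requirement

2.1 Hadoop's FileSystem class provides two methods for processing sets of files: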

public FileStatus[] globStatus(Path pathPattern) throws IOException;
public FileStatus[] globStatus(Path pathPattern, PathFilter filter) throws IOException;
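- PathFilter
  A glob pattern cannot always describe the exact set of files you want. Excluding one particular file, for example, is hard to express with a pattern alone. For this reason the listStatus() and globStatus() methods of FileSystem take an optional PathFilter argument, which gives finer-grained control over matching: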
  
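  package org.apache.hadoop.fs;

  public interface PathFilter {
      boolean accept(Path path);
  }
  
  2.2 File name wildcards in Hadoop (the same as file name wildcards in Linux)
  
  * matches zero or more characters
    Example: yum.* matches yum. as well as yum.a, yum.ab, and yum.abc; any number of characters may follow the dot
  ? matches exactly one character
    Example: yum.? matches yum.a, yum.b, yum.c, and so on; exactly one character must follow the dot
  [] matches any one of the characters inside the brackets
    Example: [abcdef] matches any one of the letters a, b, c, d, e, f; digits may be listed as well
  [-] matches a range
    Example: [a-z] matches any single letter from a to z
  [^] negates the set; ^ is the negation symbol, meaning "not"
    Example: [^abc] matches any single character other than a, b, or c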
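
To make these wildcard rules concrete, here is a minimal sketch that translates the wildcards listed above into a java.util.regex pattern and checks a few names. It only illustrates the matching semantics on single file names. The class GlobSketch and its methods are hypothetical names, and this is not how Hadoop implements globbing internally.

```java
import java.util.regex.Pattern;

// Illustration only: a hand-written translation of the wildcards above
// into java.util.regex. This is not Hadoop's own glob implementation.
class GlobSketch {

    // Converts a glob using *, ?, [abc], [a-z] and [^abc] into a regex string.
    static String toRegex(String glob) {
        StringBuilder regex = new StringBuilder();
        boolean inBrackets = false;
        for (char c : glob.toCharArray()) {
            if (inBrackets) {
                // Inside [...] the glob and regex syntax coincide for
                // plain characters, ranges, and a leading ^.
                regex.append(c);
                if (c == ']') {
                    inBrackets = false;
                }
            } else if (c == '*') {
                regex.append(".*");   // zero or more characters
            } else if (c == '?') {
                regex.append('.');    // exactly one character
            } else if (c == '[') {
                regex.append(c);
                inBrackets = true;
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));  // literal character
            }
        }
        return regex.toString();
    }

    // True if the whole name matches the glob.
    static boolean matches(String glob, String name) {
        return Pattern.matches(toRegex(glob), name);
    }

    public static void main(String[] args) {
        System.out.println(matches("yum.*", "yum."));    // true: * may match nothing
        System.out.println(matches("yum.*", "yum.abc")); // true
        System.out.println(matches("yum.?", "yum."));    // false: ? needs one character
        System.out.println(matches("[a-z]", "q"));       // true
        System.out.println(matches("[^abc]", "a"));      // false
    }
}
```

The sketch treats its input as a single file name, so it says nothing about how a pattern is applied across the directory components of a full path.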

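3. Implementation approach
Using wildcards and a PathFilter object, we take a local directory that contains files in several formats, filter out everything that is not a txt text file, and upload the remaining files to HDFS.
The sample data files are shown in the figure below:
[Figure: the sample files]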

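Based on the analysis above, the task is completed in two steps:

First, call globStatus(Path pathPattern, PathFilter filter) to filter by file format and obtain all txt files.
Then, call the Java API method copyFromLocalFile to upload all of the txt files to HDFS.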
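
The two steps can be tried out without a cluster. The following sketch mirrors them with java.nio.file only: a second local directory stands in for HDFS, Files.newDirectoryStream stands in for globStatus, and Files.copy stands in for copyFromLocalFile. The class LocalDryRun, the method copyMatching, and the sample file names are hypothetical, and no Hadoop API is involved.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Local dry run of the two steps using only java.nio.file.
// A second local directory stands in for HDFS.
class LocalDryRun {

    // Step 1: glob the source directory, then keep only paths matching the regex.
    // Step 2: copy every surviving file into the target directory.
    // Returns the sorted names of the files that were copied.
    static List<String> copyMatching(Path srcDir, String glob, String regex, Path dstDir)
            throws IOException {
        List<String> copied = new ArrayList<>();
        try (DirectoryStream<Path> candidates = Files.newDirectoryStream(srcDir, glob)) {
            for (Path p : candidates) {
                if (Files.isRegularFile(p) && p.toString().matches(regex)) {
                    Files.copy(p, dstDir.resolve(p.getFileName()),
                            StandardCopyOption.REPLACE_EXISTING);
                    copied.add(p.getFileName().toString());
                }
            }
        }
        Collections.sort(copied);
        return copied;
    }

    public static void main(String[] args) throws IOException {
        Path src = Files.createTempDirectory("src");
        Path dst = Files.createTempDirectory("dst");
        for (String name : new String[] {"a.txt", "b.log", "c.txt"}) {
            Files.createFile(src.resolve(name));
        }
        System.out.println(copyMatching(src, "*", "^.*txt$", dst)); // [a.txt, c.txt]
    }
}
```

Keeping the glob broad ("*") and letting the regular expression decide which files survive is the same division of labor the two steps describe.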

4. Program code

4.1 First, define a class RegxAcceptPathFilter that implements PathFilter and filters out every file that is not a txt text file.

package com.ywendeng.hdfs;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

public class RegxAcceptPathFilter implements PathFilter {
    // Regular expression that a path must match to be accepted
    private final String regex;

    public RegxAcceptPathFilter(String regex) {
        this.regex = regex;
    }

    @Override
    public boolean accept(Path path) {
        // String.matches() tests the whole path string against the regex
        return path.toString().matches(regex);
    }
}
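
It may help to see how String.matches behaves with the pattern ^.*txt$ that the upload code passes to this filter. The sketch below assumes local paths are rendered as strings such as file:/home/hadoop/hadoopFile/a.txt (hypothetical sample paths) and shows that the pattern accepts any path ending in the letters txt, with or without a dot. The stricter pattern is an assumed alternative and is not part of the original program.

```java
// Illustration only: how String.matches treats the pattern used by the filter.
// The paths below are hypothetical examples.
class RegexFilterSketch {

    // Same test as the accept() method above, applied to a plain string.
    static boolean accepts(String regex, String path) {
        return path.matches(regex);
    }

    public static void main(String[] args) {
        String loose = "^.*txt$";    // pattern used in the article
        String strict = ".*\\.txt";  // assumed variant: requires a literal ".txt" suffix

        System.out.println(accepts(loose, "file:/home/hadoop/hadoopFile/a.txt"));  // true
        System.out.println(accepts(loose, "file:/home/hadoop/hadoopFile/b.log"));  // false
        System.out.println(accepts(loose, "file:/home/hadoop/hadoopFile/notxt"));  // true: no dot required
        System.out.println(accepts(strict, "file:/home/hadoop/hadoopFile/notxt")); // false
    }
}
```

Because String.matches always tests the entire string, the ^ and $ anchors in the original pattern are redundant but harmless.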

4.2 Call listFile() from main() to upload the files to HDFS

package com.ywendeng.hdfs;

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

/*
 * Filters multiple text files and uploads them to HDFS
 */

public class MultipleFileUpload {
    // Two static variables holding file systems of different types
    private static FileSystem fs = null;
    private static FileSystem local = null;

    public static void main(String[] args) throws Exception {
        // Source data directory on the local Linux file system
        String srcPath = "/home/hadoop/hadoopFile/*";
        String dstPath = "hdfs://ywendeng:9000/hadoopFile/";
        // Upload to HDFS
        listFile(srcPath, dstPath);
    }

    public static void listFile(String srcPath, String dstPath) throws Exception {
        // Load the configuration
        Configuration conf = new Configuration();
        // HDFS address
        URI uri = new URI("hdfs://ywendeng:9000");
        fs = FileSystem.get(uri, conf);
        // Get the local file system
        local = FileSystem.getLocal(conf);
        // Glob the source directory, keeping only paths that end in txt
        FileStatus[] listFile = local.globStatus(new Path(srcPath), new RegxAcceptPathFilter("^.*txt$"));
        // Convert the FileStatus array into an array of paths
        Path[] listPath = FileUtil.stat2Paths(listFile);
        // Destination path
        Path outPath = new Path(dstPath);
        // Copy each matching file to HDFS
        for (Path p : listPath) {
            fs.copyFromLocalFile(p, outPath);
        }
    }
}


4.3 The upload result is shown below:

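(Note: if the program is developed on Windows but has to run on Linux, package it as a jar, copy the jar to a directory on the Linux machine, and run it with the hadoop jar command. For example: hadoop jar /home/hadoop/mul.jar com.ywendeng.hdfs.MultipleFileUpload, where the last argument is the fully qualified name of the class that contains the main function.)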