Data Sources及例子

本文介绍了Apache Flink如何从文件、集合和通用输入格式读取数据。具体包括使用readTextFile、readCsvFile、fromCollection等方法读取文本、CSV、序列文件和递归文件夹,并提供了读取压缩文件的示例。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

Data Sources

基于文件

File-based:

  • readTextFile(path) / TextInputFormat - Reads files line wise and returns them as Strings.

  • readTextFileWithValue(path) / TextValueInputFormat - Reads files line wise and returns them as StringValues. StringValues are mutable strings.

  • readCsvFile(path) / CsvInputFormat - Parses files of comma (or another char) delimited fields. Returns a DataSet of tuples or POJOs. Supports the basic java types and their Value counterparts as field types.

  • readFileOfPrimitives(path, Class) / PrimitiveInputFormat - Parses files of new-line (or another char sequence) delimited primitive data types such as String or Integer.

  • readFileOfPrimitives(path, delimiter, Class) / PrimitiveInputFormat - Parses files of new-line (or another char sequence) delimited primitive data types such as String or Integer using the given delimiter.

  • readSequenceFile(Key, Value, path) / SequenceFileInputFormat - Creates a JobConf and reads file from the specified path with type SequenceFileInputFormat, Key class and Value class and returns them as Tuple2<Key, Value>.

基于集合

Collection-based:

  • fromCollection(Collection) - Creates a data set from the Java Java.util.Collection. All elements in the collection must be of the same type.

  • fromCollection(Iterator, Class) - Creates a data set from an iterator. The class specifies the data type of the elements returned by the iterator.

  • fromElements(T ...) - Creates a data set from the given sequence of objects. All objects must be of the same type.

  • fromParallelCollection(SplittableIterator, Class) - Creates a data set from an iterator, in parallel. The class specifies the data type of the elements returned by the iterator.

  • generateSequence(from, to) - Generates the sequence of numbers in the given interval, in parallel.

Generic:

  • readFile(inputFormat, path) / FileInputFormat - Accepts a file input format.

  • createInput(inputFormat) / InputFormat - Accepts a generic input format

public static void main(String[] args) throws Exception {

        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

//         fromCollection(env);

        textFile(env);

    }

读文件和文件夹:

String filePath = "D:\\yly\\BaiduNetdiskDownload\\最全最新flink教程\\000.代码+环境\\00.flink-train-master\\flink-train\\data\\04\\hello.txt";

env.readTextFile(filePath).print();

System.out.println("~~~~~~~华丽的分割线~~~~~~~~");

filePath = "D:\\yly\\BaiduNetdiskDownload\\最全最新flink教程\\000.代码+环境\\00.flink-train-master\\flink-train\\data\\04\\inputs";

env.readTextFile(filePath).print();

读取csv文件例子:

String filePath = "D:\\yly\\BaiduNetdiskDownload\\最全最新flink教程\\000.代码+环境\\00.flink-train-master\\flink-train\\data\\04\\people.csv";

DataSet<Person> csvInput = env.readCsvFile(filePath).ignoreFirstLine().includeFields("101")/*.types(String.class, Integer.class);*/

        .pojoType(Person.class, "name","job");

csvInput.map(new MapFunction<Person, Person>() {

    @Override

    public Person map(Person value) throws Exception {

        System.out.println("execute map:{"+value+"}");

        TimeUnit.SECONDS.sleep(5);

        return value;

    }

}).print();

读取递归文件夹:

//读取递归文件夹文件

// create a configuration object

Configuration parameters = new Configuration();

// set the recursive enumeration parameter

parameters.setBoolean("recursive.file.enumeration", true);

// pass the configuration to the data source

DataSet<String> logs = env.readTextFile("D:\\yly\\BaiduNetdiskDownload\\最全最新flink教程\\000.代码+环境\\00.flink-train-master\\flink-train\\data\\04")

        .withParameters(parameters);

logs.print();

读取压缩文件:

filePath = "D:\\yly\\BaiduNetdiskDownload\\最全最新flink教程\\000.代码+环境\\00.flink-train-master\\flink-train\\data\\04\\yloader.gz";

env.readTextFile(filePath).print();

下表列出了当前支持的压缩方法。

压缩方法

文件扩展名

可并行

DEFLATE

.deflate

没有

GZip压缩

.gz, .gzip

没有

bzip2的

.bz2

没有

XZ

.xz

没有

 

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值