Data Sources and Examples

darling.0

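This article describes how Apache Flink reads data from files, collections, and generic input formats. It covers reading text files, CSV files, sequence files, and recursively nested directories with methods such as readTextFile, readCsvFile, and fromCollection, and gives an example of reading compressed files.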

Data Sources

File-based:

readTextFile(path) / TextInputFormat - Reads files line by line and returns them as Strings.
readTextFileWithValue(path) / TextValueInputFormat - Reads files line by line and returns them as StringValues. StringValues are mutable strings.
readCsvFile(path) / CsvInputFormat - Parses files of comma (or another char) delimited fields. Returns a DataSet of tuples or POJOs. Supports the basic java types and their Value counterparts as field types.
readFileOfPrimitives(path, Class) / PrimitiveInputFormat - Parses files of new-line (or another char sequence) delimited primitive data types such as String or Integer.
readFileOfPrimitives(path, delimiter, Class) / PrimitiveInputFormat - Parses files of new-line (or another char sequence) delimited primitive data types such as String or Integer using the given delimiter.
readSequenceFile(Key, Value, path) / SequenceFileInputFormat - Creates a JobConf and reads the file at the specified path using SequenceFileInputFormat with the given Key and Value classes, returning the records as Tuple2<Key, Value>.
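
The two most common file-based sources above, readTextFile and readCsvFile, can be sketched as follows. This is a minimal illustration, assuming a Flink 1.x release that still ships the DataSet API (flink-java and flink-clients on the classpath). The class name FileSourcesExample, the helper methods, and the two-column sample file are hypothetical and not part of the original article.

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

public class FileSourcesExample {

    // readTextFile: every line of the file becomes one String record.
    static List<String> readLines(ExecutionEnvironment env, String path) throws Exception {
        DataSet<String> lines = env.readTextFile(path);
        return lines.collect();
    }

    // readCsvFile: each line is split on the delimiter and parsed into a tuple.
    // The (Integer, String) column layout is an assumption made for this sample.
    static List<Tuple2<Integer, String>> readCsv(ExecutionEnvironment env, String path) throws Exception {
        DataSet<Tuple2<Integer, String>> rows = env.readCsvFile(path)
                .fieldDelimiter(",")
                .types(Integer.class, String.class);
        return rows.collect();
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Write a small sample file so the example does not depend on external data.
        Path csv = Files.createTempFile("people", ".csv");
        Files.write(csv, Arrays.asList("1,alice", "2,bob"));
        String uri = csv.toUri().toString();

        System.out.println(readLines(env, uri));
        System.out.println(readCsv(env, uri));
    }
}
```

collect() pulls the results back to the client, which is convenient for small samples like this one; with parallelism greater than 1 the order of the returned records is not guaranteed.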

Collection-based:

fromCollection(Collection) - Creates a data set from a Java java.util.Collection. All elements in the collection must be of the same type.
fromCollection(Iterator, Class) - Creates a data set from an iterator. The class specifies the data type of the elements returned by the iterator.
fromElements(T ...) - Creates a data set from the given sequence of objects. All objects must be of the same type.
fromParallelCollection(SplittableIterator, Class) - Creates a data set from an iterator, in parallel. The class specifies the data type of the elements returned by the iterator.
generateSequence(from, to) - Generates the sequence of numbers in the given interval, in parallel.
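
The collection-based sources listed above are mostly useful for tests and small experiments. The sketch below shows fromElements, fromCollection, and generateSequence, again assuming a Flink 1.x DataSet API dependency. The class name CollectionSourcesExample and its helper methods are hypothetical names chosen for illustration.

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

import java.util.Arrays;
import java.util.List;

public class CollectionSourcesExample {

    // fromElements: builds a data set from a fixed list of same-typed objects.
    static long countElements(ExecutionEnvironment env) throws Exception {
        DataSet<String> words = env.fromElements("flink", "data", "sources");
        return words.count();
    }

    // fromCollection: builds a data set from an existing (non-empty) java.util.Collection.
    static long countCollection(ExecutionEnvironment env, List<Integer> values) throws Exception {
        DataSet<Integer> numbers = env.fromCollection(values);
        return numbers.count();
    }

    // generateSequence: produces the numbers from..to (inclusive) in parallel,
    // which are then summed with a reduce.
    static long sumOfSequence(ExecutionEnvironment env, long from, long to) throws Exception {
        DataSet<Long> sequence = env.generateSequence(from, to);
        return sequence.reduce((a, b) -> a + b).collect().get(0);
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        System.out.println(countElements(env));
        System.out.println(countCollection(env, Arrays.asList(1, 2, 3, 4)));
        System.out.println(sumOfSequence(env, 1, 10));
    }
}
```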

Generic: