Data Sources
基于文件
File-based:
-
readTextFile(path) / TextInputFormat - Reads files line wise and returns them as Strings.
-
readTextFileWithValue(path) / TextValueInputFormat - Reads files line wise and returns them as StringValues. StringValues are mutable strings.
-
readCsvFile(path) / CsvInputFormat - Parses files of comma (or another char) delimited fields. Returns a DataSet of tuples or POJOs. Supports the basic java types and their Value counterparts as field types.
-
readFileOfPrimitives(path, Class) / PrimitiveInputFormat - Parses files of new-line (or another char sequence) delimited primitive data types such as String or Integer.
-
readFileOfPrimitives(path, delimiter, Class) / PrimitiveInputFormat - Parses files of new-line (or another char sequence) delimited primitive data types such as String or Integer using the given delimiter.
-
readSequenceFile(Key, Value, path) / SequenceFileInputFormat - Creates a JobConf and reads file from the specified path with type SequenceFileInputFormat, Key class and Value class and returns them as Tuple2<Key, Value>.
基于集合
Collection-based:
-
fromCollection(Collection) - Creates a data set from the Java Java.util.Collection. All elements in the collection must be of the same type.
-
fromCollection(Iterator, Class) - Creates a data set from an iterator. The class specifies the data type of the elements returned by the iterator.
-
fromElements(T ...) - Creates a data set from the given sequence of objects. All objects must be of the same type.
-
fromParallelCollection(SplittableIterator, Class) - Creates a data set from an iterator, in parallel. The class specifies the data type of the elements returned by the iterator.
-
generateSequence(from, to) - Generates the sequence of numbers in the given interval, in parallel.
Generic:
-
readFile(inputFormat, path) / FileInputFormat - Accepts a file input format.
-
createInput(inputFormat) / InputFormat - Accepts a generic input format
public static void main(String[] args) throws Exception {
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
// fromCollection(env);
textFile(env);
}
读文件和文件夹:
String filePath = "D:\\yly\\BaiduNetdiskDownload\\最全最新flink教程\\000.代码+环境\\00.flink-train-master\\flink-train\\data\\04\\hello.txt";
env.readTextFile(filePath).print();
System.out.println("~~~~~~~华丽的分割线~~~~~~~~");
filePath = "D:\\yly\\BaiduNetdiskDownload\\最全最新flink教程\\000.代码+环境\\00.flink-train-master\\flink-train\\data\\04\\inputs";
env.readTextFile(filePath).print();
读取csv文件例子:
String filePath = "D:\\yly\\BaiduNetdiskDownload\\最全最新flink教程\\000.代码+环境\\00.flink-train-master\\flink-train\\data\\04\\people.csv";
DataSet<Person> csvInput = env.readCsvFile(filePath).ignoreFirstLine().includeFields("101")/*.types(String.class, Integer.class);*/
.pojoType(Person.class, "name","job");
csvInput.map(new MapFunction<Person, Person>() {
@Override
public Person map(Person value) throws Exception {
System.out.println("execute map:{"+value+"}");
TimeUnit.SECONDS.sleep(5);
return value;
}
}).print();
读取递归文件夹:
//读取递归文件夹文件
// create a configuration object
Configuration parameters = new Configuration();
// set the recursive enumeration parameter
parameters.setBoolean("recursive.file.enumeration", true);
// pass the configuration to the data source
DataSet<String> logs = env.readTextFile("D:\\yly\\BaiduNetdiskDownload\\最全最新flink教程\\000.代码+环境\\00.flink-train-master\\flink-train\\data\\04")
.withParameters(parameters);
logs.print();
读取压缩文件:
filePath = "D:\\yly\\BaiduNetdiskDownload\\最全最新flink教程\\000.代码+环境\\00.flink-train-master\\flink-train\\data\\04\\yloader.gz";
env.readTextFile(filePath).print();
下表列出了当前支持的压缩方法。
压缩方法 | 文件扩展名 | 可并行 |
DEFLATE | .deflate | 没有 |
GZip压缩 | .gz, .gzip | 没有 |
bzip2的 | .bz2 | 没有 |
XZ | .xz | 没有 |