Here is a small code snippet showing how to read CSV data from HDFS using rhdfs (RHadoop).
rhdfs uses rJava, and the buffer size is limited by the JVM heap size. By default the buffer size is set to 5 MB in rhdfs. The source code for rhdfs can be found here.
The HADOOP_CMD environment variable should point to the hadoop binary.
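Because the buffer is allocated on the rJava heap, a large buffersize may also call for a larger JVM heap. A minimal sketch, assuming the default heap is too small for a buffer of roughly 100 MB; the "-Xmx2g" value is only an example, and the option has to be set before the JVM is initialized so that rJava picks it up:

options(java.parameters = "-Xmx2g")    ## example heap size; rJava reads this when the JVM is initialized
Sys.setenv(HADOOP_CMD = "/bin/hadoop")
library(rhdfs)
hdfs.init()                            ## JVM starts here with the heap setting above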
Sys.setenv(HADOOP_CMD = "/bin/hadoop")
library(rhdfs)
hdfs.init()

f = hdfs.file("fulldata.csv", "r", buffersize = 104857600)   ## 100 MB buffer
m = hdfs.read(f)                                             ## returns a raw vector
c = rawToChar(m)                                             ## raw bytes -> single character string
data = read.table(textConnection(c), sep = ",")

## Alternatively you can use hdfs.line.reader()
reader = hdfs.line.reader("fulldata.csv")
x = reader$read()
typeof(x)
## [1] "character"
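If the file is too large to pull into memory with a single hdfs.read() call, looping over hdfs.line.reader() in chunks is an option. A rough sketch, not from the original post; the chunk size of 1000 lines and the assumption that read() returns zero lines at end of file are mine:

reader = hdfs.line.reader("fulldata.csv", n = 1000)   ## read up to 1000 lines per call (assumed default-style argument)
chunks = list()
repeat {
  lines = reader$read()
  if (length(lines) == 0) break                       ## assume an empty result signals end of file
  chunks[[length(chunks) + 1]] = read.table(textConnection(lines), sep = ",")
}
reader$close()
data = do.call(rbind, chunks)                         ## combine chunks into one data frame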
This article provides a simple code snippet demonstrating how to read CSV data from HDFS with RHadoop and load it into a data frame. The key points are configuring the HADOOP_CMD environment variable, initializing the RHadoop environment, and reading the file with rhdfs functions.