Here is a small code snippet showing how to read CSV data from HDFS using rhdfs (RHadoop).
rhdfs uses rJava, and the buffer size is limited by the JVM heap size. By default the buffer size is set to 5 MB in rhdfs. The source code for rhdfs can be found here.
The HADOOP_CMD environment variable should point to the hadoop binary.
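Because the buffer is allocated on the rJava heap, a large buffersize may also call for a larger JVM heap. A minimal sketch, assuming the default heap is too small for a buffer of roughly 100 MB; the "-Xmx2g" value is only an example, and the option has to be set before the JVM is initialized so that rJava picks it up:

options(java.parameters = "-Xmx2g")    ## example heap size; rJava reads this when the JVM is initialized
Sys.setenv(HADOOP_CMD = "/bin/hadoop")
library(rhdfs)
hdfs.init()                            ## JVM starts here with the heap setting above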
Sys.setenv(HADOOP_CMD = "/bin/hadoop")
library(rhdfs)
hdfs.init()

f = hdfs.file("fulldata.csv", "r", buffersize = 104857600)   ## 100 MB buffer
m = hdfs.read(f)                                             ## returns a raw vector
c = rawToChar(m)                                             ## raw bytes -> single character string
data = read.table(textConnection(c), sep = ",")

## Alternatively you can use hdfs.line.reader()
reader = hdfs.line.reader("fulldata.csv")
x = reader$read()
typeof(x)
## [1] "character"
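If the file is too large to pull into memory with a single hdfs.read() call, looping over hdfs.line.reader() in chunks is an option. A rough sketch, not from the original post; the chunk size of 1000 lines and the assumption that read() returns zero lines at end of file are mine:

reader = hdfs.line.reader("fulldata.csv", n = 1000)   ## read up to 1000 lines per call (assumed default-style argument)
chunks = list()
repeat {
  lines = reader$read()
  if (length(lines) == 0) break                       ## assume an empty result signals end of file
  chunks[[length(chunks) + 1]] = read.table(textConnection(lines), sep = ",")
}
reader$close()
data = do.call(rbind, chunks)                         ## combine chunks into one data frame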
This article provides a simple code snippet demonstrating how to read CSV data from HDFS with RHadoop and load it into a data frame. The key points are configuring the HADOOP_CMD environment variable, initializing the RHadoop environment, and reading the file with rhdfs functions.