Prediction(2)R running through Spark/Hadoop Cluster

SparkR and Hadoop Cluster in Practice

License: CC 4.0 BY-SA

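This post describes how to use SparkR for data processing on a Hadoop cluster. It walks through installation and configuration, preparing the data on Hadoop, and running the WordCount.R example script. It also covers the fix for a common error and lists related resource links.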

1. How we Load the Config in R
install.packages("yaml", repos="http://cran.rstudio.com/")

library("yaml")
config = yaml.load_file("config.yaml")

config$spark$home

This code runs in RStudio. We can also run it directly from the shell:
> Rscript scripts/WordCount.R
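
The post never shows config.yaml itself. As a minimal sketch, assuming the file sits in the directory the script is run from and holds only the two keys the scripts read (spark$home here, spark$server later in the post), it might look like the fragment below. The Spark path is the installation directory that appears in the upload step of this post; the master URL is a placeholder assumption, not a value from the original.

```yaml
# config.yaml (hypothetical contents; adjust to your own cluster)
spark:
  # Spark installation directory; same path as in the upload step of this post
  home: /home/carl/install/spark-1.4.1-bin-hadoop2.6
  # Spark master URL; host name and port are placeholders
  server: spark://master-host:7077
```

With a file like this, config$spark$home returns the installation path as a character string.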

2. Prepare Hadoop Data
Create the directory
> hadoop fs -mkdir -p /user/carl/sparkR

Upload the file
> cd /home/carl/install/spark-1.4.1-bin-hadoop2.6/examples/src/main/resources

> hadoop fs -put ./people.json /user/carl/sparkR/

3. This R Script Runs Well on the Hadoop Cluster
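#install.packages("yaml", repos="http://cran.rstudio.com/")

# Load the cluster settings from config.yaml
library("yaml")
config <- yaml.load_file("config.yaml")

spark_home <- config$spark$home
# SparkR ships inside the Spark distribution, under R/lib
spark_r_location <- paste0(spark_home, "/R/lib")
spark_server <- config$spark$server

library("SparkR", lib.loc = spark_r_location)

# Connect to the cluster and create the SQL context (SparkR 1.4 API)
sc <- sparkR.init(master = spark_server, appName = "SparkR_Wordcount",
                  sparkHome = spark_home)
sqlContext <- sparkRSQL.init(sc)

# Relative path: with HDFS as the default filesystem it resolves
# against the user's HDFS home directory, e.g. /user/carl
path <- file.path("sparkR/people.json")

# Read the JSON file into a Spark DataFrame
peopleDF <- jsonFile(sqlContext, path)

printSchema(peopleDF)
head(peopleDF)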

This runs well both in RStudio and through Rscript.

Tips
1. Error Message:
trying to use CRAN without setting a mirror

Solution:
install.packages("yaml", repos="http://cran.rstudio.com/")

Passing the repos argument explicitly fixes the problem.
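
The error shows up because a non-interactive Rscript session has no CRAN mirror selected. As an alternative sketch, the mirror can be set once through options(), so that later install.packages() calls need no repos argument. The helper name ensure_package is hypothetical and only serves this illustration; it installs a package only when it is not already available.

```r
# Set a default CRAN mirror for this session (same mirror as used in the post)
options(repos = c(CRAN = "http://cran.rstudio.com/"))

# Hypothetical helper: install a package only if it is missing,
# then report (invisibly) whether it can be loaded.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # uses the mirror set in options(repos = ...)
  }
  invisible(requireNamespace(pkg, quietly = TRUE))
}

ensure_package("yaml")
```

After this runs, library("yaml") can be called as usual. Putting the options() line at the top of the script keeps the commented-out install.packages line from being needed at all.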

References:
http://www.mayin.org/ajayshah/KB/R/

http://stackoverflow.com/questions/5272846/how-to-get-parameters-from-config-file-in-r-script

wordcount example
https://github.com/amplab-extras/SparkR-pkg/blob/master/examples/wordcount.R
