Prediction(2) R Running through Spark/Hadoop Cluster

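SparkR and Hadoop Cluster in Practice
This article shows how to use SparkR for data processing on a Hadoop cluster. It covers loading the configuration, preparing data in Hadoop, and running a sample R script against the cluster. It also gives a fix for a common error and links to related resources.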

1. How to Load the Config in R
install.packages("yaml", repos="http://cran.rstudio.com/")

library("yaml")
config = yaml.load_file("config.yaml")

config$spark$home

This code runs in RStudio. It can also be run directly from the shell:
> Rscript scripts/WordCount.R
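
The post does not show config.yaml itself. The sketch below is one guess at its shape, inferred only from the config$spark$home and config$spark$server accessors used in the scripts. The home path is taken from the Spark directory that appears in step 2; the server value "yarn-client" is an assumption and should be replaced with whatever master URL your cluster uses.

```r
library("yaml")

# Hypothetical contents of config.yaml (not shown in the original post).
# Only the two keys read by the scripts are included.
config_text <- "
spark:
  home: /home/carl/install/spark-1.4.1-bin-hadoop2.6
  server: yarn-client
"

# yaml.load parses a string; yaml.load_file does the same for a file on disk.
config <- yaml.load(config_text)

spark_home <- config$spark$home
spark_r_location <- paste0(spark_home, "/R/lib")
spark_server <- config$spark$server
```

Keeping these values in a YAML file means the same R script can be pointed at a different Spark installation or master without editing the code.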

2. Prepare Hadoop Data
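Create the directory (note the leading slash, which makes the path absolute; -p also creates missing parent directories)
> hadoop fs -mkdir -p /user/carl/sparkR

Upload the file
> cd /home/carl/install/spark-1.4.1-bin-hadoop2.6/examples/src/main/resources

> hadoop fs -put ./people.json /user/carl/sparkR/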
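
Before starting Spark it can help to confirm from R that the upload succeeded. The following is a minimal sketch, not part of the original post: hdfs_exists is a hypothetical helper name, and it assumes the hadoop command is on the PATH. It relies on `hadoop fs -test -e`, which exits with status 0 when the path exists.

```r
# Hypothetical helper: TRUE if the given path exists in HDFS.
# hadoop_bin is a parameter so a different executable path can be supplied.
hdfs_exists <- function(path, hadoop_bin = "hadoop") {
  # With stdout and stderr discarded, system2 returns the exit status.
  status <- suppressWarnings(
    system2(hadoop_bin, c("fs", "-test", "-e", shQuote(path)),
            stdout = FALSE, stderr = FALSE)
  )
  identical(as.integer(status), 0L)
}

if (hdfs_exists("/user/carl/sparkR/people.json")) {
  message("people.json is in HDFS")
} else {
  message("people.json not found (or the hadoop command is unavailable)")
}
```

The helper returns FALSE both when the file is missing and when the hadoop command cannot be run, so a FALSE result needs a manual check with `hadoop fs -ls`.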

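3. The R Script That Runs on the Hadoop Cluster
#install.packages("yaml", repos="http://cran.rstudio.com/")

library("yaml")
# Read the Spark installation path and the master URL from config.yaml
config <- yaml.load_file("config.yaml")

spark_home <- config$spark$home
spark_r_location <- paste0(spark_home, "/R/lib")
spark_server <- config$spark$server

# Load the SparkR package bundled with the Spark distribution
library("SparkR", lib.loc = spark_r_location)

# Start the Spark context, then the SQL context on top of it
sc <- sparkR.init(master = spark_server, appName = "SparkR_Wordcount",
                  sparkHome = spark_home)
sqlContext <- sparkRSQL.init(sc)

# Relative path; on HDFS it resolves against the home directory of the
# user running the script (/user/carl here)
path <- file.path("sparkR/people.json")

# Load the JSON file into a Spark DataFrame
peopleDF <- jsonFile(sqlContext, path)

printSchema(peopleDF)
head(peopleDF)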

The script runs well both in RStudio and through Rscript.
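
Once peopleDF is loaded, it can be queried like any other Spark DataFrame. The sketch below goes beyond the original script: show_teenagers is a hypothetical function name, and it assumes SparkR 1.4 is already loaded, that sqlContext and peopleDF were created as in the script in step 3, and that people.json has name and age fields, as the sample file shipped with Spark does. It is wrapped in a function so it can be sourced without a live cluster.

```r
# Hypothetical helper; call it after the script in step 3 has created
# sqlContext and peopleDF.
show_teenagers <- function(sqlContext, peopleDF) {
  # Expose the DataFrame to Spark SQL under the table name "people"
  registerTempTable(peopleDF, "people")

  # Run a SQL query; the result is another Spark DataFrame
  teenagers <- sql(sqlContext,
                   "SELECT name FROM people WHERE age >= 13 AND age <= 19")

  # head() collects the first rows into a local R data.frame
  head(teenagers)
}
```

Usage, inside a running SparkR session: show_teenagers(sqlContext, peopleDF).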

Tips
1. Error Message:
trying to use CRAN without setting a mirror

Solution:
install.packages("yaml", repos="http://cran.rstudio.com/")

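Passing the repos argument explicitly fixes the problem. The error appears because a non-interactive session such as Rscript cannot prompt for a CRAN mirror.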
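
To avoid reinstalling the package on every run, the install call can be guarded. This is a minimal sketch under the same mirror URL as above; ensure_package is a hypothetical helper name, not something from the original post.

```r
# Hypothetical helper: install a package only if it is not already
# available, always passing an explicit CRAN mirror.
ensure_package <- function(pkg, repos = "http://cran.rstudio.com/") {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, repos = repos)
  }
  # Report whether the package is usable after the attempt
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with R, so this call never triggers an install.
ensure_package("stats")
```

In the scripts above the call would be ensure_package("yaml"). Alternatively, options(repos = c(CRAN = "http://cran.rstudio.com/")) at the top of a script sets the mirror for every later install.packages call.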

References:
http://www.mayin.org/ajayshah/KB/R/

http://stackoverflow.com/questions/5272846/how-to-get-parameters-from-config-file-in-r-script

wordcount example
https://github.com/amplab-extras/SparkR-pkg/blob/master/examples/wordcount.R
