SparkR(1)Naive Bayesian

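This article describes how to set up a SparkR environment and run Spark jobs, with a focus on using SparkR and Naive Bayes for text classification. Through code examples it walks from environment configuration to loading data, and aims to help readers understand how to process large data sets in a distributed environment and apply probability theory to practical problems.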

1. Naive Bayesian
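Bayes' theorem:
P(A|B) = P(B|A) P(A) / P(B)

Features - F1, F2, … Fn
Category - C1, C2, … Cm

For a category C and observed features F1 … Fn:
P(C|F1F2…Fn) = P(F1F2…Fn|C) P(C) / P(F1F2…Fn)

The "naive" assumption is that the features are conditionally independent given the category, so the numerator factorizes:
P(F1F2…Fn|C) P(C) = P(F1|C) P(F2|C) … P(Fn|C) P(C)

The denominator P(F1F2…Fn) is the same for every category, so the predicted category is the one with the largest numerator.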
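As a worked illustration of the formulas above, here is a small sketch in plain R that computes P(C|F1F2) for two categories. All numbers, the category names (spam, ham) and the variable names (prior, likelihood, posterior) are made up for illustration; they do not come from any real data set.

```r
# Hypothetical priors P(C) for two categories
prior <- c(spam = 0.4, ham = 0.6)

# Hypothetical likelihoods P(Fi|C): rows are categories, columns are features
likelihood <- matrix(c(0.8, 0.5,
                       0.1, 0.3),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(c("spam", "ham"), c("F1", "F2")))

# Numerator P(F1|C) P(F2|C) P(C) for each category (independence assumption)
numerator <- apply(likelihood, 1, prod) * prior

# P(F1F2) is the sum of the numerators; dividing by it gives P(C|F1F2)
posterior <- numerator / sum(numerator)
print(posterior)

# The predicted category is the one with the largest posterior
predicted <- names(which.max(posterior))
print(predicted)
```

Because the denominator is shared, comparing the numerators alone would give the same prediction; the normalization only matters when the actual probabilities are needed.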

2. Prepare the Environment
spark-1.4.1
I downloaded the latest version (1.4.1 at the time of writing) and added it to my path.
http://mirror.nexcess.net/apache//spark/spark-1.4.1/spark-1.4.1-bin-hadoop2.6.tgz

R-3.2.2
http://sillycat.iteye.com/blog/2240148

>r --version
R version 3.2.2 (2015-08-14) -- "Fire Safety"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin13.4.0 (64-bit)

RStudio version 0.99.473
http://sillycat.iteye.com/blog/2240148

3. Start the SparkR Shell
> bin/sparkR --master local[2]

We can paste code directly into the shell, for example from this script:
https://github.com/apache/spark/blob/master/examples/src/main/r/dataframe.R

4. Execute an R Script in SparkR
https://github.com/math-and-data/SparkR/blob/master/Demo_of_SparkR.Rmd

https://github.com/apache/spark/blob/master/examples/src/main/r/dataframe.R

https://github.com/apache/spark/blob/master/examples/src/main/resources/people.json

> bin/spark-submit examples/src/main/r/dataframe.R

5. Run the R Code in RStudio

Install JDK 1.6 on my Mac
https://support.apple.com/kb/DL1572?locale=en_US

I downloaded the file from here.
http://supportdownload.apple.com/download.info.apple.com/Apple_Support_Area/Apple_Software_Updates/Mac_OS_X/downloads/031-29055.20150831-0f779fb2-4bf4-11e5-a8d8-/javaforosx.dmg

Extract the Spark binary package and move it to /opt/spark
> tar zxvf spark-1.4.1-bin-hadoop2.6.tgz
> mv spark-1.4.1-bin-hadoop2.6 /opt/spark

The following sample R code can then be run in RStudio
## install the supporting packages
mypkgs <- c("dplyr", "ggplot2", "magrittr")
install.packages(mypkgs)

## JDK location (this is the path on my Mac)
Sys.setenv(JAVA_HOME="/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home")
library("rJava")

## SparkR ships with the Spark distribution under R/lib, so load it from there
Sys.setenv(SPARK_HOME="/opt/spark")
library("SparkR", lib.loc=file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))

## start a local Spark context, then the SQL and Hive contexts on top of it
sc <- sparkR.init(master = "local", appName = "SparkR_demo_RTA",
                  sparkHome = "/opt/spark")
sqlContext <- sparkRSQL.init(sc)
hiveContext <- sparkRHive.init(sc)

## load the sample JSON file bundled with Spark into a DataFrame
path <- file.path(Sys.getenv("SPARK_HOME"),
                  "examples/src/main/resources/people.json")
peopleDF <- jsonFile(sqlContext, path)

## inspect the schema and the first rows
printSchema(peopleDF)
head(peopleDF)

6. Further Example
https://github.com/kiendang/sparkr-naivebayes-example

http://www.slideshare.net/KienDang5/introduction-to-sparkr

Data Types in R
Vector
> c(1,2,3,4)
[1] 1 2 3 4
> 1:4
[1] 1 2 3 4
> c("a","b","c")
[1] "a" "b" "c"
> c(T,F,T)
[1] TRUE FALSE TRUE

Matrix
> matrix(c(1,2,3,4),ncol=2)
     [,1] [,2]
[1,]    1    3
[2,]    2    4
> matrix(c(1,2,3,4),ncol=2,byrow=T)
     [,1] [,2]
[1,]    1    2
[2,]    3    4

Data frame
> name <-c("A","B","C")
> age <- c(30,17,42)
> male <- c(T,F,F)
> data.frame(name, age, male)
  name age  male
1    A  30  TRUE
2    B  17 FALSE
3    C  42 FALSE

http://sillycat.iteye.com/blog/2240148
http://sillycat.iteye.com/blog/2240395

http://sillycat.iteye.com/blog/2240407

http://sillycat.iteye.com/blog/2240494

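runif(n, min=0, max=1) draws n random numbers from a uniform distribution between min and max. Here it adds noise to a straight line, which is then fitted with a linear model and plotted.
> x <- 1:100
> y <- 1:100 + runif(100,0,20)
> m <- lm(y~x)
> plot(y~x)
> abline(m$coefficients)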

R is single-threaded and can only process data sets that fit on a single machine.

SparkR allows users to interactively run jobs from the R shell on a cluster.

Famous Word Count Example
Start the shell
> bin/sparkR --master local[2]
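The idea behind word count can be sketched locally in plain R. This is only an illustration of the split-then-count logic on a single machine, not the SparkR version, and the sample lines are made up.

```r
# Made-up input lines; a real job would read them from a file
lines <- c("spark makes big data simple",
           "r makes statistics simple",
           "sparkr brings spark to r")

# Split each line on whitespace and flatten into one vector of words
words <- unlist(strsplit(lines, "\\s+"))

# Count occurrences of each word, most frequent first
counts <- sort(table(words), decreasing = TRUE)
print(counts)
```

On a cluster the same two steps (split lines into words, then aggregate counts per word) are distributed across workers, which is what makes the example a common first test of a Spark setup.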

Naive Bayes is a supervised machine learning algorithm; here it classifies texts based on word frequency.
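To make the word-frequency idea concrete, here is a minimal single-machine sketch in plain R (it does not use SparkR). The training texts, labels and the helper names tokenize, count_words and classify are all hypothetical, and the add-one (Laplace) smoothing is an assumption added so that a word unseen in one class does not zero out the whole product.

```r
# Made-up training documents and their labels
docs   <- c("cheap pills buy now", "buy cheap watches",
            "meeting at noon", "lunch meeting tomorrow")
labels <- c("spam", "spam", "ham", "ham")

# Lower-case the text and split it on whitespace
tokenize <- function(text) unlist(strsplit(tolower(text), "\\s+"))
vocab <- unique(tokenize(docs))

# Frequency of every vocabulary word within a set of texts
count_words <- function(texts) {
  w <- tokenize(texts)
  sapply(vocab, function(v) sum(w == v))
}

# Word counts per class: one row per word, one column per class
counts <- sapply(split(docs, labels), count_words)

# P(word|class) with add-one smoothing, and class priors P(class)
cond  <- sweep(counts + 1, 2, colSums(counts) + length(vocab), "/")
prior <- sapply(split(labels, labels), length) / length(labels)

# Score each class by log P(class) + sum of log P(word|class)
classify <- function(text) {
  words <- tokenize(text)
  words <- words[words %in% vocab]   # ignore words never seen in training
  scores <- log(prior[colnames(cond)]) +
    colSums(log(cond[words, , drop = FALSE]))
  names(which.max(scores))
}

print(classify("buy cheap pills"))
print(classify("meeting tomorrow"))
```

Working with sums of logarithms instead of products of probabilities avoids numeric underflow when texts contain many words.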

References:
http://www.iteblog.com/archives/1385
http://spark.apache.org/docs/latest/sparkr.html

https://github.com/math-and-data/SparkR/blob/master/Demo_of_SparkR.Rmd

https://github.com/BIDS/sparkR-demo

http://ampcamp.berkeley.edu/5/exercises/sparkr.html

https://github.com/kiendang/sparkr-naivebayes-example

naive bayesian
http://www.cnblogs.com/leoo2sk/archive/2010/09/17/1829190.html
http://www.ruanyifeng.com/blog/2013/12/naive_bayes_classifier.html

algorithm
http://www.ruanyifeng.com/blog/algorithm/