1. First, create a new project, as shown below:
2. Then select Maven and click Next, as shown below:
3. Enter the GroupId and ArtifactId, as shown below:
4. Enter the project name and choose its location, as shown below:
5. A prompt will appear at this point; choose Enable Auto-Import, as shown below:
6. Open Project Structure, as shown below:
7. Click Libraries, then click the + sign and choose Scala SDK, as shown below:
8. Choose the Scala version. My system has 2.10.5 installed. Note: the jar, Spark, Hadoop, and Scala versions must all be consistent with each other, otherwise errors will occur. As shown below:
9. After selecting it, the Scala jars are imported automatically, as shown below:
10. Since Maven is installed on my system, configure Maven next (for how to set up Maven, see https://blog.youkuaiyun.com/sunxiaoju/article/details/86500912). First open Settings..., as shown below:
11. Then find the Maven section, as shown below:
12. Tick the Override checkbox so a custom Maven repository can be selected, as shown below:
13. Select a settings.xml, as shown below:
14. Once it is selected, the Local repository path is filled in automatically, as shown below:
15. After that, you will be prompted again whether to auto-import, as shown below:
16. Then search https://mvnrepository.com/ for the corresponding packages and paste the dependencies into pom.xml, as shown below:
The XML is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>sparktest</groupId>
    <artifactId>spark</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.10</artifactId>
            <version>1.6.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.6.5</version>
        </dependency>
    </dependencies>
</project>
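If you also want the Scala runtime pinned explicitly, so that it matches the 2.10.5 SDK chosen in step 8, a dependency like the one below can be added inside the <dependencies> block. This is only a suggestion and is optional, since spark-core_2.10 already pulls scala-library in transitively:
        <!-- Optional: pin the Scala runtime to the same 2.10.x line as the project SDK -->
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>2.10.5</version>
        </dependency>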
17. Once the download finishes, the red entries turn green automatically, as shown below:
18. Maven then imports the jars automatically, as shown below:
19. Next, create a new Scala file under the java source folder, name it WordCount, and choose Object as the kind, as shown below:
20. Create a new text file (worldcount.txt) with the following content, as shown below:
The content is:
hello
hello
world
world
hello
linux
spark
window
linux
spark
spark
linux
hello
sunxj
window
21. Then create a directory on the cluster with the command hadoop fs -mkdir hdfs://master:9000/user_data/. Note: for how to use hadoop commands on Windows, see https://blog.youkuaiyun.com/sunxiaoju/article/details/101224927. As shown below:
22. Then upload worldcount.txt to the user_data directory, as shown below:
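For reference, the upload can be done with hadoop fs -put, run from whichever local directory you saved the file in:
hadoop fs -put worldcount.txt hdfs://master:9000/user_data/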
23. Then enter the following code:
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Point the driver at the standalone cluster's master
    val conf = new SparkConf().setAppName("Spark 学习")
      .setMaster("spark://master:7077")
    val sc = new SparkContext(conf)
    // val line = sc.textFile(args(0))
    // Read the input file from HDFS
    val file = sc.textFile("hdfs://master:9000/user_data/worldcount.txt")
    // Split each line into words, map each word to (word, 1), and sum the counts per word
    val rdd = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    rdd.collect()
    // Runs on the executors, so this output appears in the worker logs
    rdd.foreach(println)
    // Collects the result back to the driver and prints it locally
    rdd.collectAsMap().foreach(println)
  }
}
As shown below:
24. Then run the file, as shown below:
25. At this point the following error is reported:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
19/09/23 21:41:26 INFO SparkContext: Running Spark version 1.6.3
19/09/23 21:41:26 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:378)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:393)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:386)
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)
at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:116)
at org.apache.hadoop.security.Groups.<init>(Groups.java:93)
at org.apache.hadoop.security.Groups.<init>(Groups.java:73)
at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:293)
at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:283)
at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:260)
at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:789)
at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:774)
at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:647)
at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2214)
at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2214)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2214)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:322)
at WordCount$.main(WordCount.scala:12)
at WordCount.main(WordCount.scala)
As shown below:
26. This is because the HADOOP_HOME environment variable is not configured on the system; adding HADOOP_HOME fixes it, as shown below:
27. Then update the Path variable accordingly, as shown below:
Note: after changing the environment variables, IDEA must be restarted for them to take effect.
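For reference, the two entries look roughly like this; the install directory below is only an assumed example and should point at the Hadoop 2.6.5 distribution on your machine whose bin folder contains winutils.exe:
HADOOP_HOME = D:\hadoop-2.6.5
Path        = %Path%;%HADOOP_HOME%\bin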
28. Running it again produces the following error:
19/09/23 22:02:37 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, slave1, partition 1,NODE_LOCAL, 2134 bytes)
19/09/23 22:02:37 INFO BlockManagerMasterEndpoint: Registering block manager slave1:33248 with 511.1 MB RAM, BlockManagerId(0, slave1, 33248)
19/09/23 22:02:37 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on slave1:33248 (size: 2.3 KB, free: 511.1 MB)
19/09/23 22:02:38 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, slave1): java.lang.ClassNotFoundException: WordCount$$anonfun$2
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:68)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1868)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1751)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2042)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:431)
at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1170)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2178)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:431)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:115)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:64)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
As shown below:
This happens because the code being run was not shipped to the cluster as a jar. You need to build a jar with IDEA and then register it from the code. For how to build the jar, see https://blog.youkuaiyun.com/sunxiaoju/article/details/86183405, starting from step 39 of that article. Note: if you run the job directly on the cluster instead, you only need to build the jar, copy it to the cluster, and submit it with the spark-submit command.
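As a rough sketch of that alternative, assuming the jar has been copied to the current directory on the cluster, the submission would look something like:
spark-submit --class WordCount --master spark://master:7077 spark.jar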
29. Once the jar is built, add one line of code:
sc.addJar("E:\\sunxj\\idea\\spark\\out\\artifacts\\spark_jar\\spark.jar")
As shown below:
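For context, the addJar call sits right after the SparkContext is created in the main method shown earlier, so the driver ships the compiled classes (including the anonymous functions used by flatMap/map) to the executors; the path is the jar produced by the IDEA artifact from the referenced article:
    val sc = new SparkContext(conf)
    // Ship the project's compiled jar to the executors so they can load WordCount's closures
    sc.addJar("E:\\sunxj\\idea\\spark\\out\\artifacts\\spark_jar\\spark.jar")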
30. Run it again and it succeeds; then check the result on the cluster, as shown below:
31. If you want to see the result while the job runs, just add one line at the end:
rdd.collectAsMap().foreach(println)
As shown below:
32. Run it again and the output can be seen at run time, as shown below:
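With the sample worldcount.txt above, the pairs printed by rdd.collectAsMap().foreach(println) should be the following counts (in some order, since a map does not preserve ordering):
(hello,4)
(world,2)
(linux,3)
(spark,3)
(window,2)
(sunxj,1)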