Setting up a Spark debugging environment on Windows with IntelliJ IDEA

1. First, create a new project, as shown below:

2. Select Maven and click Next, as shown below:

3. Enter the GroupId and ArtifactId, as shown below:

4. Enter the project name and choose its location, as shown below:

5. When prompted, choose Enable Auto-Import, as shown below:

6. Open Project Structure, as shown below:

7. Click Libraries, then click the + button and choose Scala SDK, as shown below:

8. Choose the Scala version. My system has 2.10.5 installed. Note: the Scala, Spark, Hadoop, and dependency jar versions must be consistent with each other, otherwise you will get errors. As shown below:

9. After it is selected, the Scala jars are imported automatically, as shown below:

10. Since Maven is already installed on my system, configure Maven next; for how to set up Maven, see https://blog.youkuaiyun.com/sunxiaoju/article/details/86500912. To configure it, first open Settings..., as shown below:

11. Then find the Maven settings page, as shown below:

12. Check Override so that a custom Maven settings file (and thus a custom repository) can be selected, as shown below:

13. Choose a settings.xml file, as shown below:

14. Once it is selected, the Local repository path is filled in automatically, as shown below:

15. After that, IDEA again asks whether to import changes automatically, as shown below:

16. Then search https://mvnrepository.com/ for the required dependencies and paste them into the pom, as shown below:

The pom.xml is as follows (note that spark-core_2.10 matches the Scala 2.10.5 SDK chosen in step 8):

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>sparktest</groupId>
    <artifactId>spark</artifactId>
    <version>1.0-SNAPSHOT</version>
    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.10</artifactId>
            <version>1.6.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.6.5</version>
        </dependency>
    </dependencies>

</project>

17. After the download completes, the red (unresolved) entries turn green, as shown below:

18. Maven then imports the jars automatically, as shown below:

19. Next, create a new Scala file under java, with the file name WordCount and the kind set to Object, as shown below:

20. Create a new text file with the content shown below:

The content is:

hello
hello
world
world
hello
linux
spark
window
linux
spark
spark
linux
hello
sunxj
window

21. Then create a directory on the cluster with the command hadoop fs -mkdir hdfs://master:9000/user_data/. Note: for how to use Hadoop commands on Windows, see https://blog.youkuaiyun.com/sunxiaoju/article/details/101224927. As shown below:

22. Then upload worldcount.txt to the user_data directory (for example with hadoop fs -put worldcount.txt hdfs://master:9000/user_data/), as shown below:

23. Then enter the following code:

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]) {

    // Point the application at the standalone cluster master
    val conf = new SparkConf().setAppName("Spark 学习")
      .setMaster("spark://master:7077")

    val sc = new SparkContext(conf)
    //val line = sc.textFile(args(0))
    // Read the input file from HDFS
    val file = sc.textFile("hdfs://master:9000/user_data/worldcount.txt")
    // Split each line into words, map each word to (word, 1), then sum the counts per word
    val rdd = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    rdd.collect()
    rdd.foreach(println)                // runs on the executors, so this output appears in their logs
    rdd.collectAsMap().foreach(println) // collects the result to the driver and prints it locally
  }
}

As shown below:
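As a side note not covered in the original steps: if you only want to debug the logic locally inside IDEA without a cluster, a minimal sketch is to run with a local master, which also sidesteps the jar-distribution problem described further below (the local file path here is only an assumed example):

import org.apache.spark.{SparkConf, SparkContext}

object WordCountLocal {
  def main(args: Array[String]) {
    // local[*] runs Spark inside the IDEA JVM using all available cores; no cluster is needed
    val conf = new SparkConf().setAppName("WordCount local debug").setMaster("local[*]")
    val sc = new SparkContext(conf)
    // A local path works here; the HDFS path also works if the cluster is reachable
    val file = sc.textFile("E:\\sunxj\\worldcount.txt")
    val counts = file.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.collectAsMap().foreach(println)
    sc.stop()
  }
}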

24. Then run the file, as shown below:

25. At this point the following error is reported:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
19/09/23 21:41:26 INFO SparkContext: Running Spark version 1.6.3
19/09/23 21:41:26 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
	at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:378)
	at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:393)
	at org.apache.hadoop.util.Shell.<clinit>(Shell.java:386)
	at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)
	at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:116)
	at org.apache.hadoop.security.Groups.<init>(Groups.java:93)
	at org.apache.hadoop.security.Groups.<init>(Groups.java:73)
	at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:293)
	at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:283)
	at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:260)
	at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:789)
	at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:774)
	at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:647)
	at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2214)
	at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2214)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2214)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:322)
	at WordCount$.main(WordCount.scala:12)
	at WordCount.main(WordCount.scala)

As shown below:

26. This is because the HADOOP_HOME environment variable is not set on the system; add a HADOOP_HOME variable, as shown below:

27. Then update the Path variable (add the Hadoop bin directory), as shown below:

Note: after changing environment variables you need to restart IDEA for them to take effect.
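Alternatively, as a hedged sketch not from the original article: instead of setting the environment variable, you can point Hadoop at the winutils directory from code, before the SparkContext is created. The C:\hadoop path below is only an assumed location of the unpacked winutils.exe:

import org.apache.spark.{SparkConf, SparkContext}

object WordCountWithHadoopHome {
  def main(args: Array[String]) {
    // Assumption: winutils.exe lives in C:\hadoop\bin; adjust to your own path.
    // This must run before the SparkContext (and Hadoop's Shell class) is initialized.
    System.setProperty("hadoop.home.dir", "C:\\hadoop")

    val conf = new SparkConf().setAppName("Spark 学习").setMaster("spark://master:7077")
    val sc = new SparkContext(conf)
    // ... the rest of the job is identical to the WordCount object above ...
    sc.stop()
  }
}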

28. Running it again produces the following error:

19/09/23 22:02:37 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, slave1, partition 1,NODE_LOCAL, 2134 bytes)
19/09/23 22:02:37 INFO BlockManagerMasterEndpoint: Registering block manager slave1:33248 with 511.1 MB RAM, BlockManagerId(0, slave1, 33248)
19/09/23 22:02:37 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on slave1:33248 (size: 2.3 KB, free: 511.1 MB)
19/09/23 22:02:38 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, slave1): java.lang.ClassNotFoundException: WordCount$$anonfun$2
	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:348)
	at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:68)
	at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1868)
	at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1751)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2042)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:431)
	at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1170)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2178)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:431)
	at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
	at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:115)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:64)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
	at org.apache.spark.scheduler.Task.run(Task.scala:89)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

As shown below:

This happens because the job's compiled classes were not shipped to the cluster as a jar, so the executors cannot find the closure classes (WordCount$$anonfun$...). Build the project into a jar with IDEA and then add that jar to the job from code. For how to build the jar, see https://blog.youkuaiyun.com/sunxiaoju/article/details/86183405, starting from step 39 of that article. Note: if instead you just want to run the job on the cluster directly, you only need to build the jar, copy it to the cluster, and submit it with spark-submit.

29. After the jar is built, add one line of code (after the SparkContext is created and before any action is triggered):

sc.addJar("E:\\sunxj\\idea\\spark\\out\\artifacts\\spark_jar\\spark.jar")

As shown below:
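Putting it together, the relevant part of the main method after this change looks roughly like this (a sketch; the jar path is the artifact output path produced in the previous step):

    val sc = new SparkContext(conf)
    // Ship the jar built by the IDEA artifact to the executors so they can load the job's closure classes
    sc.addJar("E:\\sunxj\\idea\\spark\\out\\artifacts\\spark_jar\\spark.jar")

    val file = sc.textFile("hdfs://master:9000/user_data/worldcount.txt")
    val rdd = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    rdd.collectAsMap().foreach(println)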

30. Run it again and it now succeeds; the result can then be checked on the cluster, as shown below:

31. If you want to see the result in the run output itself, just add one line at the end:

rdd.collectAsMap().foreach(println)

As shown below:

32. Run it again and the output can now be seen at run time, as shown below:
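For reference, given the input file above, the printed counts should be the following pairs (in some arbitrary order):

(hello,4)
(world,2)
(linux,3)
(spark,3)
(window,2)
(sunxj,1)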
