Preface
The goal of this post is to run Spark RDD programs properly without starting a Spark cluster or a Linux virtual machine, so that beginners who have neither a Linux cluster environment nor enough memory can still run Spark programs. The idea came from my own experience learning Spark: the official documentation on standalone deployment casually calls for 20 GB of memory and one Master with several Worker nodes. I then quietly looked at my own laptop, with 8 GB of RAM that would also have to be split across several virtual machines to build a test cluster, and decided that was overkill for getting started. So, building on the Hadoop setup from my earlier studies, I worked out this approach of running Spark programs locally on Windows in local mode. For beginners, the real point is to master Spark RDD, Spark SQL, Spark Streaming, and so on; that is where the effort should go. With that said, let's set up the development environment.
Setting Up the Local Environment
1. Download the Hadoop 2.7.5 binary package from http://hadoop.apache.org/releases.html.
2. Extract it locally. The files under /lib/native in the Hadoop 2.7.x binary package are already 64-bit, so they do not need to be replaced.
3. From the Windows 64-bit Hadoop native library download, only its bin directory is needed: replace the original bin directory with the newly downloaded one.
4. In the system environment variables, set HADOOP_HOME to the home directory of the extracted Hadoop package (an in-code alternative is sketched right after this list).
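If editing the Windows system environment variables is inconvenient, a commonly used alternative is to point Hadoop at its home directory from code, via the hadoop.home.dir system property, before the SparkContext is created. A minimal sketch, assuming a hypothetical extraction path of D:\hadoop-2.7.5:

package spark

object HadoopHomeSetup {
  def main(args: Array[String]): Unit = {
    // Hypothetical path: the directory whose bin\ holds the replaced winutils.exe etc.
    System.setProperty("hadoop.home.dir", "D:\\hadoop-2.7.5")
    // ...then build the SparkConf / SparkContext exactly as in wordcount.scala below.
  }
}

The property has to be set before Spark first touches the Hadoop libraries, which is why it goes at the very top of main.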
Developing in Eclipse
When creating the project, please refer to my earlier post on creating a Spark project, Eclipse+Maven+Scala Project+Spark | compiling and packaging a wordcount program, which describes in detail how to create a Spark project on either Linux or Windows.
Here only the pom.xml configuration file and the Spark program are given; everything else is the same as in the post above.
pom.xml
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>MyScala</groupId>
    <artifactId>MyScala</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <build>
        <sourceDirectory>src/main/scala</sourceDirectory>
        <testSourceDirectory>src/test/scala</testSourceDirectory>
        <plugins>
            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.5.1</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.3.1</version>
            </plugin>
        </plugins>
    </build>
    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.2.1</version>
            <scope>provided</scope>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.scalatest/scalatest -->
        <dependency>
            <groupId>org.scalatest</groupId>
            <artifactId>scalatest_2.11</artifactId>
            <version>3.2.0-SNAP5</version>
            <scope>test</scope>
        </dependency>
        <!-- https://mvnrepository.com/artifact/junit/junit -->
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
            <scope>test</scope>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.7.5</version>
        </dependency>
    </dependencies>
</project>
wordcount.scala
package spark

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

object wordcount {
  def main(args: Array[String]): Unit = {
    // Local test file bundled with the project.
    val localFile = "./src/main/resources/myFile.txt"
    // local mode: the driver and executor both run inside this single JVM.
    val conf = new SparkConf().setAppName("wordcount").setMaster("local")
    val sc = new SparkContext(conf)
    sc.setLogLevel("ERROR")
    val logData = sc.textFile(localFile).cache()
    // Count the lines that contain the letters "s" and "r" respectively.
    val numS = logData.filter(line => line.contains("s")).count()
    val numR = logData.filter(line => line.contains("r")).count()
    println("Lines with s: %s, Lines with r: %s".format(numS, numR))
  }
}
Console output
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
18/01/22 22:55:14 INFO SparkContext: Running Spark version 2.2.1
18/01/22 22:55:15 INFO SparkContext: Submitted application: wordcount
18/01/22 22:55:15 INFO SecurityManager: Changing view acls to: elon
18/01/22 22:55:15 INFO SecurityManager: Changing modify acls to: elon
18/01/22 22:55:15 INFO SecurityManager: Changing view acls groups to:
18/01/22 22:55:15 INFO SecurityManager: Changing modify acls groups to:
18/01/22 22:55:15 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(elon); groups with view permissions: Set(); users with modify permissions: Set(elon); groups with modify permissions: Set()
18/01/22 22:55:16 INFO Utils: Successfully started service 'sparkDriver' on port 14723.
18/01/22 22:55:16 INFO SparkEnv: Registering MapOutputTracker
18/01/22 22:55:16 INFO SparkEnv: Registering BlockManagerMaster
18/01/22 22:55:16 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
18/01/22 22:55:16 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
18/01/22 22:55:16 INFO DiskBlockManager: Created local directory at C:\Users\yilon\AppData\Local\Temp\blockmgr-65564417-bf25-4673-8dbe-c928a3cdbe56
18/01/22 22:55:16 INFO MemoryStore: MemoryStore started with capacity 852.6 MB
18/01/22 22:55:16 INFO SparkEnv: Registering OutputCommitCoordinator
18/01/22 22:55:16 INFO Utils: Successfully started service 'SparkUI' on port 4040.
18/01/22 22:55:16 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.121.1:4040
18/01/22 22:55:17 INFO Executor: Starting executor ID driver on host localhost
18/01/22 22:55:17 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 14732.
18/01/22 22:55:17 INFO NettyBlockTransferService: Server created on 192.168.121.1:14732
18/01/22 22:55:17 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
18/01/22 22:55:17 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 192.168.121.1, 14732, None)
18/01/22 22:55:17 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.121.1:14732 with 852.6 MB RAM, BlockManagerId(driver, 192.168.121.1, 14732, None)
18/01/22 22:55:17 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 192.168.121.1, 14732, None)
18/01/22 22:55:17 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 192.168.121.1, 14732, None)
Lines with s: 2, Lines with r: 2
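Note that the program above, despite its name, only counts lines containing particular letters. For completeness, here is a minimal sketch of an actual word count over the same (hypothetical) myFile.txt, using nothing beyond the RDD API already pulled in by the pom.xml above:

package spark

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object realWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("realWordCount").setMaster("local")
    val sc = new SparkContext(conf)
    sc.setLogLevel("ERROR")

    // Split every line on whitespace, pair each word with 1, and sum the pairs per word.
    val counts = sc.textFile("./src/main/resources/myFile.txt")
      .flatMap(line => line.split("\\s+"))
      .filter(word => word.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // collect() brings everything back to the driver, which is fine for a small local test file.
    counts.collect().foreach { case (word, count) => println(s"$word: $count") }

    sc.stop()
  }
}

In local mode collect() on a small file is harmless; on a real data set you would write the result out with saveAsTextFile instead.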
This concludes the walkthrough of running and debugging Spark programs in local mode on Windows. If you run into any problems while following along, feel free to leave a message in the comments or reach me via the Email link on the left.
Please credit the source when reposting: http://blog.youkuaiyun.com/coder__cs/article/details/79128713
This post is from elon33's blog.