安装spark并编写scala
一、安装spark
1、下载:https://www.apache.org/dyn/closer.lua/spark/spark-2.4.2/spark-2.4.2-bin-hadoop2.7.tgz
2、解压压缩包
3、先启动hadoop 环境start-all.sh
4、启动spark环境
进入到SPARK_HOME/sbin下运行start-all.sh
/opt/module/spark/sbin/start-all.sh
修改配置文件后进入Spark的sbin目录启动spark:
./start-all.sh
启动 Spark Shell :
./bin/spark-shell
二、安装Scala:
1、下载:https://downloads.lightbend.com/scala/2.12.8/scala-2.12.8.rpm
2、tar -zxvf scala-2.12.8.tgz -C /opt/module
3、mv scala-2.12.8 scala
4、测试:scala -version
5、启动:scala
WordCount:
1、在 Spark Shell 使用本地文件进行统计
2、在 CentOS中 编写 WordCount 程序,在 Spark Shell 中执行程序
3、编写 Java 版的 WordCount 程序并执行:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;
三、搭建spark伪分布
配置spark-env.sh:
export JAVA_HOME=/usr/java/jdk1.8.0_211-amd64
export SCALA_HOME=/usr/share/scala
export HADOOP_HOME=/usr/local/hadoop/hadoop-2.7.7
export HADOOP_CONF_DIR=/usr/local/hadoop/hadoop-2.7.7/etc/hadoop
export SPARK_MASTER_HOST=bigdata
export SPARK_MASTER_PORT=7077
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native
配置etc/profile
export JAVA_HOME=/usr/java/jdk1.8.0_211-amd64
export HADOOP_HOME=/usr/local/hadoop/hadoop-2.7.7
export HBASE_HOME=/usr/local/hbase/hbase-1.4.9
export HIVE_HOME=/usr/local/hive/apache-hive-2.3.4-bin
export SPARK_HOME=/usr/local/spark/spark-2.4.2-bin-hadoop2.7
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native
source profile使其生效
进入Spark 的 sbin 目录执行 start-all.sh 启动 spark:./start-all.sh
输入spark-shell进入spark界面
四、安装sbt
参考网址:http://dblab.xmu.edu.cn/blog/1307-2/
五、统计本地文件
val textFile = sc.textFile("file:///usr/local/spark/mycode/wordcount/word.txt")
wordCount.collect()
六、scala程序实现wordcount统计
spark-submit --class "WordCount" /usr/local/spark/mycode/wordcount/target/scala-2.11/simple-project_2.11-4.1.jar
相关scala程序:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object WordCount {
def main(args: Array[String]) {
val inputFile = "file:///usr/local/spark/mycode/wordcount/word.txt"
val conf = new SparkConf().setAppName("WordCount").setMaster("local[2]")
val sc = new SparkContext(conf)
val textFile = sc.textFile(inputFile)
val wordCount = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
wordCount.foreach(println)
}
}
七、java程序实现wordcount统计
程序详情:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;
import java.util.Arrays;
public class JavaWordCount {
public static void main(String[] args) {
SparkConf conf = new SparkConf().setAppName("Spark WordCount written by java!");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> textFile = sc.textFile("hdfs:///user/hadoop/word.txt");
JavaPairRDD<String, Integer> counts = textFile
.flatMap(s -> Arrays.asList(s.split(" ")).iterator())
.mapToPair(word -> new Tuple2<>(word, 1))
.reduceByKey((a, b) -> a + b);
counts.saveAsTextFile(hdfs:///user/hadoop/writeback");
sc.close();
}
}
相关依赖
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>Spark</groupId>
<artifactId>SPARK</artifactId>
<version>0.0.1-SNAPSHOT</version>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.4.1</version>
</dependency>
</dependencies>
<build>
<pluginManagement>
<plugins>
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<configuration>
<appendAssemblyId>false</appendAssemblyId>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
<archive>
<manifest>
<mainClass>JavaWordCount</mainClass>
</manifest>
</archive>
</configuration>
<executions>
<execution>
<id>make-assembly</id>
<phase>package</phase>
<goals>
<goal>assembly</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<source>8</source>
<target>8</target>
</configuration>
</plugin>
</plugins>
</pluginManagement>
</build>
</project>
打包上传到centos后
spark-submit --class spark.JavaWordCount --master spark://bigdata:7077 /usr/local/sp