Spark WordCount Program: Scala and Python Implementations

License: CC 4.0 BY-SA

Article link: https://blog.youkuaiyun.com/qq_21989939/article/details/79432084

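This post shows how to implement a word count program in Spark, in both Scala and Python. The Scala part covers setting up Scala in IDEA, creating a Maven project, writing and running the WordCount code, and submitting the job to a cluster. The Python part is covered only briefly.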

Word count program

Scala implementation

---

1. Install the Scala plugin in IDEA
2. Create a Maven project and add the Scala SDK

pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>spark-learn</groupId>
    <artifactId>cn.spark.learn</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>1.7</maven.compiler.source>
        <maven.compiler.target>1.7</maven.compiler.target>
        <encoding>UTF-8</encoding>
        <scala.version>2.10.6</scala.version>
        <scala.compat.version>2.10</scala.compat.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.10</artifactId>
            <version>1.5.2</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.10</artifactId>
            <version>1.5.2</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.6.2</version>
        </dependency>
    </dependencies>

    <build>
        <sourceDirectory>src/main/scala</sourceDirectory>
        <testSourceDirectory>src/test/scala</testSourceDirectory>
        <plugins>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.0</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                        <configuration>
                            <args>
                                <arg>-make:transitive</arg>
                                <arg>-dependencyfile</arg>
                                <arg>${project.build.directory}/.scala_dependencies</arg>
                            </args>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-surefire-plugin</artifactId>
                <version>2.18.1</version>
                <configuration>
                    <useFile>false</useFile>
                    <disableXmlReport>true</disableXmlReport>
                    <includes>
                        <include>**/*Test.*</include>
                        <include>**/*Suite.*</include>
                    </includes>
                </configuration>
            </plugin>

            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>2.3</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                            <transformers>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                    <mainClass>cn.itcast.spark.WordCount</mainClass>
                                </transformer>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>

3. Code

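// Package matches the mainClass configured in pom.xml
package cn.itcast.spark

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object WordCount {

  def main(args: Array[String]): Unit = {
    // Create the conf: set the application name and the master.
    // local[2] runs in local mode with two threads.
    val conf = new SparkConf().setAppName("wordcount").setMaster("local[2]")
    // Create the SparkContext
    val sc = new SparkContext(conf)
    // textFile reads the input data from HDFS, one element per line
    val file: RDD[String] = sc.textFile("hdfs://mini1:9000/words.txt")
    // Flatten: split each line into individual words
    val words: RDD[String] = file.flatMap(_.split(" "))
    // Pair each word with the count 1
    val tuple: RDD[(String, Int)] = words.map((_, 1))
    // Sum the counts for each word
    val result: RDD[(String, Int)] = tuple.reduceByKey(_ + _)
    // Sort by count, descending
    val resultBy: RDD[(String, Int)] = result.sortBy(_._2, false)

    // Print the result
    resultBy.foreach(println)
    // Release the context once the job is done
    sc.stop()
  }

}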

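The console output of the program above may not look sorted. sortBy does sort the data globally, but with local[2] the sorted RDD has two partitions and foreach(println) prints from both of them in parallel, so the lines interleave. With local[1] the tasks run one at a time, so the output normally appears in order. To print in sorted order regardless of the thread count, bring the result to the driver first: resultBy.collect().foreach(println).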
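
To make the ordering point concrete, here is a small plain-Python model (no Spark involved). It assumes a sorted result held in two partitions, with made-up words and counts, and compares reading the partitions in order, which is what collect() does, with one possible interleaving of two tasks printing at the same time. The names partitions, collected and interleaved are illustrative only.

```python
# Plain-Python model of a sorted RDD with two partitions (not Spark code).
# After a descending sort, partition 0 holds the largest counts and
# partition 1 the smallest; each partition is sorted internally.
partitions = [
    [("spark", 5), ("hadoop", 4)],   # partition 0 (hypothetical data)
    [("hive", 2), ("flink", 1)],     # partition 1 (hypothetical data)
]

# Reading the partitions one after another, in partition order,
# yields the globally sorted sequence.
collected = [pair for part in partitions for pair in part]

# One possible console order when two tasks print concurrently:
# records from both partitions alternate.
interleaved = [pair for group in zip(*partitions) for pair in group]

print(collected)
print(interleaved)
```

The data is sorted in both cases; only the order in which the lines reach the console differs.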

4. Submit to the cluster

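// Package matches the --class value passed to spark-submit
package cn.itcast.spark

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD

object WordCount {

  def main(args: Array[String]): Unit = {
    // Local mode, kept for reference: local[2] runs with two threads
    //    val conf = new SparkConf().setAppName("wordcount").setMaster("local[2]")
    // For the cluster, leave the master unset; spark-submit supplies it via --master
    val conf = new SparkConf().setAppName("wordcount")
    // Create the SparkContext
    val sc = new SparkContext(conf)
    // textFile reads the input data; args(0) is the input path
    val file: RDD[String] = sc.textFile(args(0))
    // Flatten: split each line into individual words
    val words: RDD[String] = file.flatMap(_.split(" "))
    // Pair each word with the count 1
    val tuple: RDD[(String, Int)] = words.map((_, 1))
    // Sum the counts for each word
    val result: RDD[(String, Int)] = tuple.reduceByKey(_ + _)
    // Sort by count, descending
    val resultBy: RDD[(String, Int)] = result.sortBy(_._2, false)

    // Instead of printing, save the result; args(1) is the output directory
    //    resultBy.foreach(println)
    resultBy.saveAsTextFile(args(1))
    // Release the context once the job is done
    sc.stop()
  }

}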

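Package the project with IDEA and upload the jar to any machine in the cluster. The jar name in the command below must match the file your build actually produces.

Submit the job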

spark-submit --master spark://mini1:7077 --class cn.itcast.spark.WordCount  original-spark-learn-1.0-SNAPSHOT.jar hdfs://mini1:9000/words.txt hdfs://mini1:9000/ceshi-scala/

Python implementation

---

#!/usr/bin/python

from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("pywordCount").setMaster("spark://mini1:7077")
sc = SparkContext(conf=conf)

# Read the file, split each line into words, count each word,
# and write the result to HDFS
(sc.textFile("hdfs://mini1:9000/words.txt")
   .flatMap(lambda a: a.split(" "))
   .map(lambda x: (x, 1))
   .reduceByKey(lambda x, y: x + y)
   .saveAsTextFile("hdfs://mini1:9000/wordcount/ceshi/"))

# Release the context once the job is done
sc.stop()
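
The chained call above packs several stages into one expression. As a rough illustration of what each stage computes, here is a plain-Python model that runs on an in-memory list instead of an RDD. It is not Spark code, and the input lines are made up for the example. The last step mirrors the sortBy call in the Scala version, which the Python job does not use.

```python
from itertools import chain

# Hypothetical input standing in for the lines of words.txt.
lines = ["hello spark", "hello hadoop", "hello spark"]

# flatMap: split every line and flatten the pieces into one sequence of words.
words = list(chain.from_iterable(line.split(" ") for line in lines))

# map: pair every word with the count 1.
pairs = [(word, 1) for word in words]

# reduceByKey: add up the counts of all pairs that share a key.
counts = {}
for word, one in pairs:
    counts[word] = counts.get(word, 0) + one

# sortBy (Scala version only): order the pairs by count, descending.
ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

print(ranked)
```

In Spark the same stages run per partition and the per-key sums are merged across the cluster, but the result for this input is the same set of (word, count) pairs.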

Submit the job

spark-submit wordcount.py
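
One detail worth checking on real data: both versions split on a single space character, so a run of several spaces produces empty strings, and those are then counted as a word. A minimal sketch in plain Python, using a hypothetical input line; whether this matters depends on how words.txt is formatted.

```python
line = "hello  spark hello"   # hypothetical line containing a double space

# Splitting on one space keeps an empty string between adjacent spaces.
naive = line.split(" ")

# split() with no argument splits on any run of whitespace and drops empties.
robust = line.split()

print(naive)   # ['hello', '', 'spark', 'hello']
print(robust)  # ['hello', 'spark', 'hello']
```

In the PySpark job this would correspond to flatMap(lambda a: a.split()). In the Scala version, one option is split("\\s+") followed by filter(_.nonEmpty).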