Two Ways to Implement WordCount with Spark

This article shows how to run a WordCount application over a text file with Apache Spark: first interactively in the spark-shell, then as a Scala project in IntelliJ IDEA, with the Maven dependency configured and the job submitted to a Spark cluster.


1. Running WordCount in the spark-shell

Start a spark-shell connected to the cluster and run the following against a text file that has already been uploaded to HDFS:

// Load the text file from HDFS as an RDD of lines
val lines = sc.textFile("hdfs://spark1:9000/spark.txt")
// Split each line into words and map each word to a (word, 1) pair
val words = lines.flatMap(line => line.split(" "))
val pairs = words.map(word => (word, 1))
// Sum the counts per word and print the result
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.foreach(wordCount => println(wordCount._1 + " appeared " + wordCount._2 + " times"))
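
To keep the result instead of just printing it, the same RDD can be sorted by count and written back to HDFS. This is only a sketch: the output path below is an assumption, and the output directory must not already exist.

// Sort by count in descending order and write one "word<TAB>count" line per pair
wordCounts.sortBy(_._2, ascending = false)
  .map { case (word, count) => word + "\t" + count }
  .saveAsTextFile("hdfs://spark1:9000/wordcount_output")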

2. A Scala project in IntelliJ IDEA

Create a Maven-based Scala project in IntelliJ IDEA and add the spark-core dependency to pom.xml; the _2.11 suffix must match the Scala version of the Spark build (Spark 2.2.0 is built against Scala 2.11 by default):

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>2.2.0</version>
    </dependency>

 

package cn.spark.study.core

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object WordCount {

  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setAppName("WordCount")
      .setMaster("spark://hadoop:7077")
      // .setMaster("local[2]")  // use this instead of the line above to run locally, e.g. on Windows
    val sc = new SparkContext(conf)

    // args(0) is the input file path passed on the spark-submit command line
    val lines = sc.textFile(args(0), 1)
    val words = lines.flatMap { line => line.split(" ") }
    val pairs = words.map { word => (word, 1) }
    val wordCounts = pairs.reduceByKey(_ + _)
    wordCounts.foreach(wordCount => println(wordCount._1 + " appeared " + wordCount._2 + " times"))

    sc.stop()
  }
}
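
Before packaging the jar, it can be handy to sanity-check the same pipeline without a cluster. The sketch below is only an illustration (the object name and the sample data are made up for the example); it uses a local[2] master and an in-memory collection instead of an HDFS file:

package cn.spark.study.core

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object WordCountLocalCheck {

  def main(args: Array[String]) {
    // local[2] runs the driver and two worker threads in a single JVM, no cluster required
    val conf = new SparkConf().setAppName("WordCountLocalCheck").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // A tiny in-memory dataset stands in for the HDFS input file
    val lines = sc.parallelize(Seq("hello spark", "hello world"))
    val counts = lines.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.collect().foreach { case (word, count) => println(word + " appeared " + count + " times") }

    sc.stop()
  }
}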

 

Finally, package the project into a jar and run it on the Spark cluster with spark-submit. The last argument is the input file path that the program reads as args(0):

/usr/local/spark/bin/spark-submit \
--class cn.spark.study.core.WordCount \
/usr/local/spark-study/scala/wordcount.jar \
/root/test.txt

Note: stop any running spark-shell before submitting, otherwise the shell holds the cluster's executors and the job fails with insufficient resources (Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources).