A WordCount written in Scala

This post walks through a WordCount example program implemented with Apache Spark. The program reads text from a given input file, counts how often each word occurs, and saves the result to a given output path. In addition, it filters out and reports the log lines that contain 'ERROR'.


Adapted and extended from the documentation; noted here for reference.

package mywork

import java.io.File
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object WordCount {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      System.err.println("Usage: WordCount <input> <output>")
      System.exit(1)
    }
    val infile = args(0) // Should be some file on your system
    println("  =======================")
    println(" || WordCount in Spark !||")
    println("  =======================")
    println("Input: " + args(0) + ",size:" + getFileSize(infile) + "bytes")
    println("Output: " + args(1))
      
    
    val conf = new SparkConf().setAppName("word count")
    val sc = new SparkContext(conf)
    val indata = sc.textFile(infile, 2).cache()
    // flatMap splits each line into words on spaces
    // map turns each word into a (word, 1) pair
    // reduceByKey aggregates by key (here the word), summing the 1s of identical words
    val words = indata.flatMap(line => line.split(" ")).map(word => (word,1)).reduceByKey((a,b) => a+b)
    // Collect all lines that contain ERROR
    val errlineRDD = indata.filter(line => line.contains("ERROR"))
    words.saveAsTextFile(args(1))
    
    println()
    println("All words are counted!")
    val res = words.count().toInt
 
    // Print part of the counting result
    if (res > 20) {
      println("The first 20 words are ... ")
      words.take(20).foreach(println)
      println(" ... ")
    } else {
      words.take(res).foreach(println)
    }
    println() 
    println(errlineRDD.count + " lines with ERROR and the first line with ERROR is:")
    println(errlineRDD.first + "\n")
  }
  
  // Return the size of the file in bytes, or 0 if it is not a regular file.
  // (new File(...) never returns null, and a match without a default case
  // would throw a MatchError for directories or missing paths.)
  def getFileSize(fname: String): Long = {
    val f = new File(fname)
    if (f.isFile()) f.length() else 0L
  }

}
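The flatMap/map/reduceByKey chain is easiest to see without a cluster: Scala's own collections support the same flatMap shape, with groupBy standing in for reduceByKey. A minimal local sketch (the sample lines are made up, not from the /log data above):

```scala
object LocalWordCount extends App {
  val lines = List("ERROR disk full", "ok", "ERROR disk")

  // flatMap splits each line into words, just as in the Spark job
  val words = lines.flatMap(_.split(" "))

  // groupBy + size plays the role of map(word => (word, 1)) + reduceByKey(_ + _)
  val counts: Map[String, Int] =
    words.groupBy(identity).map { case (w, occurrences) => (w, occurrences.size) }

  println(counts) // Map(ERROR -> 2, disk -> 2, full -> 1, ok -> 1), in some order
}
```

The difference is that Spark evaluates this pipeline lazily and in parallel across partitions, while the collections version runs eagerly on one machine.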

The run output is as follows:

[root@sparkmaster bin]# ./spark-submit --class mywork.WordCount /opt/myjars/spark-wordcount-in-scala.jar /log /tmp/wordcount03
  =======================
 || WordCount in Spark !||
  =======================
Input: /log,size:102400015bytes
Output: /tmp/wordcount03
16/02/28 04:47:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


All words are counted!
The first 20 words are ...
(15:50:27.516,78)
(15:50:26.205,78)
(15:50:22.326,76)
(15:50:26.586,77)
(15:50:25.056,76)
(15:50:19.248,73)
(15:50:21.324,78)
(15:50:20.304,76)
(15:50:25.119,77)
(15:50:22.506,77)
(15:50:24.870,78)
(15:50:26.007,77)
(15:50:19.116,74)
(15:50:20.475,78)
(15:50:20.559,80)
(15:50:21.483,78)
(15:50:24.249,67)
(15:50:19.032,73)
(15:50:22.401,78)
(15:50:25.122,77)
 ...


706207 lines with ERROR and the first line with ERROR is:
2015-11-15 15:50:18.864 GMT ERROR [28379:17590581916128] mrss.requesthandler - RequestHandler::svc failed to accept a new connection, error = 24


[root@sparkmaster bin]#
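Note that saveAsTextFile writes a directory, not a single file: one part-NNNNN file per partition (two here, since textFile(infile, 2) asked for two splits). A sketch for merging the parts back, assuming the output directory landed on the local filesystem (for HDFS output you would use `hdfs dfs -getmerge` instead):

```scala
import java.io.File
import scala.io.Source

// Read all part-NNNNN files under a saveAsTextFile output directory,
// in partition order, and return their lines as one sequence.
def mergeParts(dir: File): Seq[String] =
  dir.listFiles()
    .filter(_.getName.startsWith("part-"))
    .sortBy(_.getName)
    .toSeq
    .flatMap(f => Source.fromFile(f).getLines().toSeq)
```

For example, `mergeParts(new File("/tmp/wordcount03"))` would return every `(word,count)` line the job above produced.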



