Computing the Relative Frequency of Words with Spark

This post shows how to use Spark to compute, for each word, the relative frequency of the other words in its neighborhood. The neighborhood is defined as the two words before and the two words after the current word, and the computation determines each neighbor's share of that neighborhood. For example, the neighborhood of w01 is w02 and w03, and each accounts for 1/2. The post includes a code example and its output.


Requirement:

Define a word's neighborhood as the two words before it and the two words after it. The task is to compute each neighbor's share of each word's neighborhood.

For example, given the word sequence:

w01,w02,w03,w04,w05

Neighborhood table:

Word | Neighborhood
-----+--------------------
w01  | w02, w03
w02  | w01, w03, w04
w03  | w01, w02, w04, w05
w04  | w02, w03, w05
w05  | w03, w04

So for w01, w02 and w03 each account for 1/2 of its neighborhood.

For w02, each of w01, w03, and w04 accounts for 1/3.

And so on… (Real data will generally not look like this, with every word appearing exactly once.)
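The shares above can be worked out with a short plain-Scala sketch, no Spark required. This is illustrative only; `window`, `neighborhoods`, and `fractions` are names chosen here, not part of the job below:

```scala
// Plain-Scala sketch of the neighborhood shares above (illustrative names).
object NeighborhoodSketch {
  val window = 2

  // For each position i, collect the tokens at [i - window, i + window],
  // clamped to the array bounds and excluding position i itself.
  def neighborhoods(tokens: Array[String]): Map[String, Seq[String]] =
    tokens.indices
      .map { i =>
        val start = math.max(0, i - window)
        val end   = math.min(tokens.length - 1, i + window)
        tokens(i) -> (start to end).filter(_ != i).map(j => tokens(j))
      }
      .groupBy(_._1)
      .map { case (w, entries) => w -> entries.flatMap(_._2) }

  // Each neighbour's share of the word's total neighborhood.
  def fractions(tokens: Array[String]): Map[String, Map[String, Double]] =
    neighborhoods(tokens).map { case (w, ns) =>
      w -> ns.groupBy(identity).map { case (n, xs) => n -> xs.size.toDouble / ns.size }
    }

  def main(args: Array[String]): Unit = {
    val f = fractions("w01,w02,w03,w04,w05".split(","))
    println(f("w01")) // w02 and w03 each take 1/2
    println(f("w02")) // w01, w03, w04 each take 1/3
  }
}
```

The window clamping here mirrors the `start`/`end` bounds computed in the Spark job below.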

Data:

w01,w02,w03,w04,w05,w06,w07,w08,w09,w10,w01,w02,w03,w04,w05,w06,w07,w08,w09,w10

Code:

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object RelativeFrequency {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RelativeFrequency")
      .master("local")
      .config("spark.sql.shuffle.partitions", "5")
      .getOrCreate()
    val sc = spark.sparkContext

    val broadcastWindow = sc.broadcast(2)
    val rawData = sc.textFile("RelativeFrequency.csv")

    // Emit (word, (neighbour, 1)) for every neighbour within the window,
    // clamping the window to the start and end of the line.
    val pairs = rawData.flatMap(line => {
      val tokens = line.split(",")
      for {
        i <- 0 until tokens.length
        start = if (i - broadcastWindow.value < 0) 0 else i - broadcastWindow.value
        end = if (i + broadcastWindow.value >= tokens.length) tokens.length - 1 else i + broadcastWindow.value
        j <- start to end if i != j
      } yield (tokens(i), (tokens(j), 1))
    })

    // Approach 1: RDD API

    // (word, sum(word))
    val totalByKey = pairs.map(t => (t._1, t._2._2)).reduceByKey(_ + _)

    // (word, (neighbour, sum(neighbour)))
    val uniquePairs = pairs.groupByKey()
                           .flatMapValues(_.groupBy(_._1).mapValues(_.unzip._2.sum))

    // (word, ((neighbour, sum(neighbour)), sum(word)))
    val joined = uniquePairs join totalByKey

    // (word, neighbour, sum(neighbour)/sum(word))
    val relativeFrequency = joined.map(t => (t._1, t._2._1._1, (t._2._1._2.toDouble / t._2._2.toDouble).formatted("%.2f")))

    relativeFrequency.foreach(println)

    // Approach 2: DataFrame + Spark SQL
    val rfSchema = StructType(StructField("word", StringType, false) ::
                              StructField("neighbour", StringType, false) ::
                              StructField("frequency", IntegerType, false) :: Nil)

    spark.createDataFrame(pairs.map(t => Row(t._1, t._2._1, t._2._2)), rfSchema)
         .createOrReplaceTempView("rfTable")

    spark.sql(
      """
        |SELECT a.word,
        |       a.neighbour,
        |       (a.feq_total/b.total) rf
        |FROM
        |  (SELECT word,
        |          neighbour,
        |          SUM(frequency) feq_total
        |   FROM rfTable
        |   GROUP BY word,
        |            neighbour) a
        |INNER JOIN
        |  (SELECT word,
        |          SUM(frequency) AS total
        |   FROM rfTable
        |   GROUP BY word) b ON a.word = b.word
        |ORDER BY a.word, a.neighbour
      """.stripMargin).show()

    spark.stop()
  }

}
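As a sanity check on the results below, the two aggregations the SQL performs (the per-(word, neighbour) count divided by the per-word neighborhood total) can be reproduced locally in plain Scala. `RelativeFrequencyCheck` is a hypothetical helper for this post, not part of the job above:

```scala
// Local re-implementation of the job's logic, for checking the output.
object RelativeFrequencyCheck {
  def relativeFrequencies(tokens: Array[String], window: Int = 2): Map[(String, String), Double] = {
    // Same pair emission as the Spark flatMap: for each position i, every
    // position j within `window` of i (clamped to the bounds), j != i.
    val pairs = for {
      i <- tokens.indices
      start = math.max(0, i - window)
      end   = math.min(tokens.length - 1, i + window)
      j <- start to end if i != j
    } yield (tokens(i), tokens(j))

    // sum(word): total neighborhood size per word.
    val totalByWord = pairs.groupBy(_._1).map { case (w, ps) => w -> ps.size }
    // sum(neighbour) / sum(word) per (word, neighbour) pair.
    pairs.groupBy(identity).map { case ((w, n), ps) =>
      (w, n) -> ps.size.toDouble / totalByWord(w)
    }
  }

  def main(args: Array[String]): Unit = {
    val data = ("w01,w02,w03,w04,w05,w06,w07,w08,w09,w10," * 2).stripSuffix(",")
    val rf = relativeFrequencies(data.split(","))
    println(rf(("w01", "w02"))) // w02 fills 2 of w01's 6 neighborhood slots
  }
}
```

On the sample data, w01 occurs at positions 0 and 10, giving it six neighborhood slots in total, of which w02 and w03 fill two each and w09 and w10 one each, matching the first four rows of the output.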

Results (first 20 rows of the DataFrame output):

+----+---------+-------------------+
|word|neighbour|                 rf|
+----+---------+-------------------+
| w01|      w02| 0.3333333333333333|
| w01|      w03| 0.3333333333333333|
| w01|      w09|0.16666666666666666|
| w01|      w10|0.16666666666666666|
| w02|      w01| 0.2857142857142857|
| w02|      w03| 0.2857142857142857|
| w02|      w04| 0.2857142857142857|
| w02|      w10|0.14285714285714285|
| w03|      w01|               0.25|
| w03|      w02|               0.25|
| w03|      w04|               0.25|
| w03|      w05|               0.25|
| w04|      w02|               0.25|
| w04|      w03|               0.25|
| w04|      w05|               0.25|
| w04|      w06|               0.25|
| w05|      w03|               0.25|
| w05|      w04|               0.25|
| w05|      w06|               0.25|
| w05|      w07|               0.25|
+----+---------+-------------------+