Spark MLlib: Basic Statistics

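This article describes the statistical methods and correlation computations that Spark MLlib provides for RDD[Vector] data: how to compute column summary statistics such as the maximum, minimum, mean, variance, and number of nonzero elements, and how to compute Pearson's and Spearman's correlation between two series.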

Spark MLlib provides several basic statistical algorithms. The main ones are described below:

1. Summary statistics

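For data of type RDD[Vector], Spark MLlib provides the Statistics.colStats method, which returns an instance of MultivariateStatisticalSummary. This instance holds the column-wise maximum, minimum, mean, variance, and number of nonzeros, together with the total row count. For example: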

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}

    val conf = new SparkConf().setAppName("Simple Application").setMaster("yarn-cluster")
    val sc = new SparkContext(conf)
    val observations = sc.textFile("/user/liujiyu/spark/mldata1.txt")
      .map(_.split(' ')                 // split each line into an Array[String]
        .map(_.toDouble))               // parse the fields, giving RDD[Array[Double]]
      .map(line => Vectors.dense(line)) // convert to RDD[Vector]

    // Compute column summary statistics.
    val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
    println(summary.mean) // a dense vector containing the mean value for each column
    println(summary.variance) // column-wise variance
    println(summary.numNonzeros) // number of nonzeros in each column
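
To make the meaning of these column-wise statistics concrete, here is a minimal plain-Scala sketch (no Spark required) that computes the same kinds of quantities on a small in-memory dataset. The object name ColStatsSketch and the sample rows are hypothetical and chosen only for illustration. The variance is assumed to be the sample variance with an (n - 1) denominator; this sketch illustrates the definitions and is not MLlib's actual implementation.

```scala
object ColStatsSketch {
  // Hypothetical sample data: each inner array is one row (one observation).
  val rows: Seq[Array[Double]] = Seq(
    Array(1.0, 0.0, 3.0),
    Array(2.0, 5.0, 0.0),
    Array(3.0, 1.0, 6.0)
  )

  // Transpose rows into columns so that each statistic is computed per column.
  val columns: Seq[Seq[Double]] = rows.map(_.toSeq).transpose

  // Column-wise mean.
  val mean: Seq[Double] = columns.map(c => c.sum / c.size)

  // Column-wise sample variance with an (n - 1) denominator (an assumption of this sketch).
  val variance: Seq[Double] = columns.zip(mean).map { case (c, m) =>
    c.map(x => (x - m) * (x - m)).sum / (c.size - 1)
  }

  // Number of nonzero entries, maximum, and minimum of each column.
  val numNonzeros: Seq[Int] = columns.map(_.count(_ != 0.0))
  val max: Seq[Double] = columns.map(_.max)
  val min: Seq[Double] = columns.map(_.min)

  def main(args: Array[String]): Unit = {
    println(s"mean        = $mean")
    println(s"variance    = $variance")
    println(s"numNonzeros = $numNonzeros")
    println(s"max         = $max")
    println(s"min         = $min")
  }
}
```

Note that the statistics are taken down each column: every row contributes one value to each column, which is why the rows are transposed first.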

2. Correlations

MLlib can compute the correlation between two series, using either Pearson's or Spearman's method. For example:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.linalg.{Matrix, Vector, Vectors}
    import org.apache.spark.mllib.stat.Statistics
    import org.apache.spark.rdd.RDD

    val conf = new SparkConf().setAppName("Simple Application").setMaster("yarn-cluster")
    val sc = new SparkContext(conf)

    val data1 = Array(1.0, 2.0, 3.0, 4.0, 5.0)
    val data2 = Array(1.0, 2.0, 3.0, 4.0, 5.0)
    val distData1: RDD[Double] = sc.parallelize(data1)
    // must have the same number of partitions and cardinality as distData1
    val distData2: RDD[Double] = sc.parallelize(data2)

    // Compute the correlation using Pearson's method. Pass "spearman" for Spearman's method.
    // If no method is specified, Pearson's method is used by default.
    val correlation: Double = Statistics.corr(distData1, distData2, "pearson")

    // Parse each space-separated line into a Vector. Note that each Vector is a row, not a column.
    val data: RDD[Vector] = sc.textFile("/user/liujiyu/spark/mldata1.txt")
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))

    // Calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    // If no method is specified, Pearson's method is used by default.
    val correlMatrix: Matrix = Statistics.corr(data, "pearson")
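
To show how the two methods differ, here is a minimal plain-Scala sketch (no Spark required) of the textbook definitions: Pearson's coefficient is the covariance divided by the product of the standard deviations, and Spearman's coefficient is Pearson's coefficient applied to the ranks of the values. The object name CorrelationSketch and its helper functions are hypothetical, and tied values are assumed to share the average of their ranks. MLlib's internal implementation may differ in its details.

```scala
object CorrelationSketch {
  // Pearson correlation: covariance divided by the product of standard deviations.
  def pearson(x: Seq[Double], y: Seq[Double]): Double = {
    require(x.size == y.size && x.nonEmpty, "series must be non-empty and of equal length")
    val mx = x.sum / x.size
    val my = y.sum / y.size
    val cov = x.zip(y).map { case (a, b) => (a - mx) * (b - my) }.sum
    val varX = x.map(a => (a - mx) * (a - mx)).sum
    val varY = y.map(b => (b - my) * (b - my)).sum
    cov / math.sqrt(varX * varY)
  }

  // Ranks starting at 1; tied values share the average of their sorted positions.
  def ranks(v: Seq[Double]): Seq[Double] = {
    val avgRank: Map[Double, Double] = v.sorted.zipWithIndex
      .groupBy(_._1)
      .map { case (value, group) => value -> (group.map(_._2 + 1.0).sum / group.size) }
    v.map(avgRank)
  }

  // Spearman correlation: Pearson correlation computed on the ranks.
  def spearman(x: Seq[Double], y: Seq[Double]): Double =
    pearson(ranks(x), ranks(y))

  def main(args: Array[String]): Unit = {
    val x = Seq(1.0, 2.0, 3.0, 4.0, 5.0)
    val y = x.map(v => v * v) // monotonic but nonlinear in x
    println(s"pearson  = ${pearson(x, y)}")
    println(s"spearman = ${spearman(x, y)}")
  }
}
```

In this example y grows monotonically with x but not linearly, so the ranks of the two series agree exactly and Spearman's coefficient is 1, while Pearson's coefficient falls slightly below 1.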

 

Reposted from: https://www.cnblogs.com/ljy2013/p/5105549.html
