Spark ML: DistributedMatrix (Part 1)

Article link: https://blog.youkuaiyun.com/legotime/article/details/51085385

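This article takes a close look at how Spark stores and operates on matrices, covering the characteristics and uses of RowMatrix, IndexedRowMatrix, CoordinateMatrix, and BlockMatrix, with worked examples of matrix computation and analysis.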


===========================Contents==============================================

Distributed matrix
------RowMatrix
------IndexedRowMatrix
------CoordinateMatrix
------BlockMatrix

-------------------------------------------------------------------------------------------------------------------------------

For how Spark stores matrices, see these slides: http://stanford.edu/~rezab/nips2014workshop/slides/reza.pdf

1. Matrix types

RowMatrix:

A RowMatrix is a row-oriented distributed matrix without meaningful row indices, backed by an RDD of its rows, where each row is a local vector. Since each row is represented by a local vector, the number of columns is limited by the integer range but it should be much smaller in practice.
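
As a minimal sketch of that description, assuming a local Spark setup with spark-mllib on the classpath (the object name `RowMatrixSketch` is illustrative), the rows can be any local vectors, dense or sparse, and the dimensions are derived from the data:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

object RowMatrixSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("RowMatrixSketch").setMaster("local"))
    // Each row is a local vector; dense and sparse vectors can be mixed.
    val rows = sc.parallelize(Seq(
      Vectors.dense(1.0, 7.0, 0.0, 0.0),
      Vectors.sparse(4, Array(1, 2), Array(2.0, 8.0)) // same values as (0, 2, 8, 0)
    ))
    val mat = new RowMatrix(rows)
    println(mat.numRows()) // 2: the number of rows in the RDD
    println(mat.numCols()) // 4: the size of the row vectors
    sc.stop()
  }
}
```

Because the rows carry no index, the only notion of row order is the order of the underlying RDD.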

IndexedRowMatrix:

An IndexedRowMatrix is similar to a RowMatrix but with meaningful row indices. It is backed by an RDD of indexed rows, so that each row is represented by its index (long-typed) and a local vector.
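
A small sketch of what "meaningful row indices" implies, again assuming a local Spark setup with spark-mllib available (the object name `IndexedRowSketch` is illustrative). The indices are explicit Longs and do not have to be contiguous, so the reported row count follows the largest index rather than the number of stored rows:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

object IndexedRowSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("IndexedRowSketch").setMaster("local"))
    // Two stored rows, at row positions 0 and 5.
    val rows = sc.parallelize(Seq(
      IndexedRow(0L, Vectors.dense(1.0, 7.0)),
      IndexedRow(5L, Vectors.dense(5.0, 3.0))
    ))
    val mat = new IndexedRowMatrix(rows)
    println(mat.numRows()) // 6: largest index + 1
    println(mat.numCols()) // 2
    // Dropping the indices keeps only the stored rows.
    println(mat.toRowMatrix().numRows()) // 2
    sc.stop()
  }
}
```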

CoordinateMatrix:

A CoordinateMatrix is a distributed matrix backed by an RDD of its entries. Each entry is a tuple of (i: Long, j: Long, value: Double), where i is the row index, j is the column index, and value is the entry value. A CoordinateMatrix should be used only when both dimensions of the matrix are huge and the matrix is very sparse.
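
To make the entry-based storage concrete, here is a hedged sketch that builds a CoordinateMatrix directly from `MatrixEntry` values. It assumes a local Spark setup with spark-mllib available; the object name `CoordinateSketch` and the sample entries are made up for illustration. Only three values are stored even though the matrix spans about a million rows:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

object CoordinateSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("CoordinateSketch").setMaster("local"))
    // Only the non-zero entries are stored, as (row, column, value).
    val entries = sc.parallelize(Seq(
      MatrixEntry(0L, 1L, 7.0),
      MatrixEntry(2L, 3L, 9.0),
      MatrixEntry(1000000L, 2L, 4.0)
    ))
    val mat = new CoordinateMatrix(entries)
    println(mat.numRows())       // 1000001: largest row index + 1
    println(mat.numCols())       // 4: largest column index + 1
    println(mat.entries.count()) // 3 stored entries
    sc.stop()
  }
}
```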

2. Practical application

package Basic
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.{Vectors,Matrices}
import org.apache.spark.mllib.linalg.distributed.{IndexedRowMatrix, RowMatrix, CoordinateMatrix,IndexedRow}

object DistributedMatrix {
  def main(args: Array[String]): Unit = {
    /* Working through the "Distributed matrix" section of the official documentation
    * */

    val sparkConf = new SparkConf().setAppName("Distributed matrix Learning").setMaster("local")
    val sc = new SparkContext(sparkConf)
    // Create an RDD[Vector]
    val rdd1= sc.parallelize(
      Array(
        Array(1.0,7.0,0,0),
        Array(0,2.0,8.0,0),
        Array(5.0,0,3.0,9.0),
        Array(0,6.0,0,4.0)
      )
    ).map(f => Vectors.dense(f))
    // Create a RowMatrix
    //http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix
    val rowMatrix = new RowMatrix(rdd1)
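    // rowMatrix.columnSimilarities can be called in two ways:
    //def columnSimilarities(threshold: Double): CoordinateMatrix
    //def columnSimilarities(): CoordinateMatrix
    // With a positive threshold Spark estimates the similarities by sampling (DIMSUM),
    // so the values are approximate and pairs below the threshold are not guaranteed to be reported.
    // threshold = 1 returned no entries
    // threshold = 0.5 returned, in one run:
    //MatrixEntry(2,3,0.9738191526000504)
    //MatrixEntry(0,1,0.2607165735702116)
    //MatrixEntry(0,2,0.5586783719361678)
    // Without an argument the result is exact and has 6 entries, one per pair of columns
    rowMatrix.columnSimilarities(0.5).entries.collect().foreach(println)
    println("=================================exclude threshold==============================================")
    rowMatrix.columnSimilarities().entries.collect().foreach(println)
    // columnSimilarities computes the cosine similarity between columns and returns a CoordinateMatrix,
    // whose values are stored as case class MatrixEntry(i: Long, j: Long, value: Double)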
    println(rowMatrix.numCols())//4
    println(rowMatrix.numRows())//4
    // Call computeColumnSummaryStatistics() once and reuse the result, since every call triggers an action
    println("========computeColumnSummaryStatistics(): triggers an action===========================")
    val summary = rowMatrix.computeColumnSummaryStatistics()
    println(summary.count)       // number of rows (samples): 4
    println(summary.max)         // maximum of each column: [5.0,7.0,8.0,9.0]
    println(summary.min)         // minimum of each column: [0.0,0.0,0.0,0.0]
    println(summary.mean)        // mean of each column: [1.5,3.75,2.75,3.25]
    println(summary.normL1)      // L1 norm of each column, the sum of absolute values, ||x||1 = sum(abs(xi)): [6.0,15.0,11.0,13.0]
    println(summary.normL2)      // L2 norm of each column, the square root of the sum of squares, ||x||2 = sqrt(sum(xi.^2)): [5.0990195135927845,9.433981132056603,8.54400374531753,9.848857801796104]
    println(summary.numNonzeros) // number of non-zero values in each column: [2.0,3.0,2.0,2.0]
    println(summary.variance)    // sample variance (n - 1 denominator) of each column: [5.666666666666667,10.916666666666666,14.25,18.25]
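    println("----------------------------------Covariance-----------------------------------------------")
    // Covariance measures how two variables vary together (variance is the special case Y = X)
    // Cov(X,Y) = E((X-u)*(Y-v)), where u is the mean of X and v is the mean of Y
    // In plain terms, its sign tells whether two columns are positively or negatively correlated; 0 means uncorrelated
    println(rowMatrix.computeCovariance())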
    println("----------------------------------Gramian matrix--------------------------------------------")
    // Definition: https://en.wikipedia.org/wiki/Gramian_matrix
    // A symmetric matrix whose principal minors are all >= 0 is positive semidefinite
    // A Gramian matrix is positive semidefinite.
    println(rowMatrix.computeGramianMatrix())
    //26.0  7.0   15.0  45.0
    // 7.0   89.0  16.0  24.0
    //15.0  16.0  73.0  27.0
    //45.0  24.0  27.0  97.0
    // 26 is the dot product of column 1 with itself, 7 is the dot product of column 1 with column 2, and so on
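    println("-------------------------Principal components----------------------------------------------------")
    // PCA is essentially a change of basis: it maps the data points into a new coordinate system. It is a statistical method for dimensionality reduction.
    // A detailed write-up will follow later; the argument 2 requests the top 2 principal components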
    println(rowMatrix.computePrincipalComponents(2))
    //-0.39298150433827816  -0.12886620383031905
    //0.5385989845229371    -0.3723128718395973
    //-0.1573813891189192   0.8341561627916149
    //-0.7284969248239043   -0.3859535244685352
    println("-----------------------------SVD (singular value decomposition)-----------------------------------")
    // Recommended post: http://blog.chinaunix.net/uid-20761674-id-4040274.html
    // A = U*S*V'
    // the columns of U are eigenvectors of A*A'
    // the columns of V are eigenvectors of A'*A
    // S is a diagonal matrix of singular values; Spark returns it as the vector s
    // The return type in the source is SingularValueDecomposition[RowMatrix, Matrix]
    // Run the decomposition once and reuse it, since each computeSVD call starts a new computation
    val svd = rowMatrix.computeSVD(4, true)
    println(svd)
    println("S: " + svd.s)
    // svd.U is a RowMatrix, so print its rows
    println("U:")
    svd.U.rows.collect().foreach(println)
    println("V: " + svd.V)
    println("=======================================Matrix multiplication=======================")
    //def multiply(B: Matrix): RowMatrix={.................}
    // The result is a RowMatrix
    val Arr1 = Matrices.dense(4,1,Array(1.0,2.0,3.0,4.0))
    // The product is a 4x1 RowMatrix whose rows hold 15, 28, 50 and 28
    rowMatrix.multiply(Arr1).rows.collect().foreach(println)
    //===============Using IndexedRowMatrix=====================================================
    //An IndexedRowMatrix is similar to a RowMatrix but with meaningful row indices.
    // It is backed by an RDD of indexed rows, so that each row is represented by its index (long-typed) and a local vector.
    // An IndexedRowMatrix is created from an RDD[IndexedRow] and can later be converted to a RowMatrix
    // Wrong: val indexedRowMatrix = new IndexedRowMatrix(rdd1)
    // Reason: rdd1 is an RDD[Vector], not an RDD[IndexedRow]
    // IndexedRowMatrix has two constructors:
    //new IndexedRowMatrix(rows: RDD[IndexedRow])
    //new IndexedRowMatrix(rows: RDD[IndexedRow], nRows: Long, nCols: Int)
    // From the source: case class IndexedRow(index: Long, vector: Vector)
    // index: any value we choose, but note that it must be a Long
    // Conversion written as ordinary functions:
    //def keyOfIndex(key:Int):Long=key
    //def keyOfIndex(key:Double)=key.toLong
    // Conversion written as an implicit function:
    implicit def DoubleToLong(key: Double): Long = key.toLong

    val rdd2 = sc.parallelize(
      Array(
        Array(1.0,7.0,0,0),
        Array(0,2.0,8.0,0),
        Array(5.0,0,3.0,9.0),
        Array(0,6.0,0,4.0)
      )
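    ).map(f => IndexedRow(f.take(1)(0),Vectors.dense(f.drop(1))))
    println("rdd2 contents:")
    rdd2.collect.foreach(println)
    //IndexedRow(1,[7.0,0.0,0.0])
    //IndexedRow(0,[2.0,8.0,0.0])
    //IndexedRow(5,[0.0,3.0,9.0])
    //IndexedRow(0,[6.0,0.0,4.0])
    // Here the first element is the index and the remaining n-1 elements are the values:
    //f.take(1)(0),Vectors.dense(f.drop(1)))
    // f.take(1) returns an array, so f.take(1)(0) picks its first element
    // If instead the last element is the index and the first n-1 elements are the values:
    //f.takeRight(1)(0),Vectors.dense(f.dropRight(1)))
    // Having these indices is very convenient when training a network, for example with BP (backpropagation)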
    val indexedRowMatrix = new IndexedRowMatrix(rdd2)
    // Convert the IndexedRowMatrix to a RowMatrix
    println(indexedRowMatrix.toRowMatrix())
    // The basic operations shown in the rowMatrix part are available here as well
    println(indexedRowMatrix.numCols())

    //===============Using CoordinateMatrix===========================================================
    //new CoordinateMatrix(entries: RDD[MatrixEntry])
    //new CoordinateMatrix(entries: RDD[MatrixEntry], nRows: Long, nCols: Long)
    // Judging by the return types, a CoordinateMatrix can also be obtained in the following two ways
    //val coordinateMatrix:CoordinateMatrix= rowMatrix.columnSimilarities(0.5)
    val coordinateMatrix:CoordinateMatrix= rowMatrix.columnSimilarities()
    //return: An n x n sparse upper-triangular matrix of cosine similarities between columns of this matrix.
    // CoordinateMatrix API:
    //http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.distributed.CoordinateMatrix
    // Print the entries, then the number of columns and rows
    //println(coordinateMatrix.entries.collect().foreach(x => println(x)))
    coordinateMatrix.entries.collect.foreach(println)
    //MatrixEntry(2,3,0.32086065591476265)
    //MatrixEntry(0,1,0.14551788123706114)
    //MatrixEntry(1,2,0.19850138864225716)
    //MatrixEntry(0,3,0.896065945800218)
    //MatrixEntry(1,3,0.25830354780341375)
    //MatrixEntry(0,2,0.3443048616036665)
    // Other matrix operations are available as well
    println(coordinateMatrix.numCols())
    println(coordinateMatrix.numRows())
    // It can also be converted to an IndexedRowMatrix
    println(coordinateMatrix.toIndexedRowMatrix())
    sc.stop()
  }
}
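
Several of the numbers quoted in the comments of the program above can be cross-checked without Spark. The following is a minimal sketch in plain Scala over the same 4x4 matrix; the object name `LocalChecks` and its helpers (`column`, `dot`, `gramian`, `cosine`, `sampleVariance`) are illustrative names, not part of any Spark API.

```scala
object LocalChecks {
  // The same 4x4 matrix that the program above loads into rdd1.
  val a: Array[Array[Double]] = Array(
    Array(1.0, 7.0, 0.0, 0.0),
    Array(0.0, 2.0, 8.0, 0.0),
    Array(5.0, 0.0, 3.0, 9.0),
    Array(0.0, 6.0, 0.0, 4.0)
  )

  // Column j as a plain array.
  def column(j: Int): Array[Double] = a.map(row => row(j))

  // Dot product of two equally sized arrays.
  def dot(x: Array[Double], y: Array[Double]): Double =
    x.zip(y).map { case (p, q) => p * q }.sum

  // Gramian A' * A: entry (i, j) is the dot product of columns i and j.
  def gramian: Array[Array[Double]] =
    Array.tabulate(4, 4)((i, j) => dot(column(i), column(j)))

  // Exact cosine similarity between columns i and j.
  def cosine(i: Int, j: Int): Double =
    dot(column(i), column(j)) /
      math.sqrt(dot(column(i), column(i)) * dot(column(j), column(j)))

  // Unbiased sample variance of column j (divides by n - 1).
  def sampleVariance(j: Int): Double = {
    val c = column(j)
    val mean = c.sum / c.length
    c.map(v => (v - mean) * (v - mean)).sum / (c.length - 1)
  }

  def main(args: Array[String]): Unit = {
    gramian.foreach(row => println(row.mkString("  ")))
    println(cosine(0, 3))      // 45 / sqrt(26 * 97), about 0.896
    println(sampleVariance(0)) // 17 / 3, about 5.667
  }
}
```

The hand-computed values agree with the Gramian matrix, the exact (0,3) column similarity, and the first-column variance listed in the comments, which also shows that the reported variance uses the n - 1 denominator.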