SparkML: DistributedMatrix (Part 1)

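This article takes a close look at how matrices are stored and operated on in Spark, covering the characteristics and use of RowMatrix, IndexedRowMatrix, CoordinateMatrix, and BlockMatrix, and works through examples of matrix computation and analysis.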


===========================Contents==============================================

Distributed matrix
------RowMatrix
------IndexedRowMatrix
------CoordinateMatrix
------BlockMatrix

-------------------------------------------------------------------------------------------------------------------------------

For how Spark stores matrices, see http://stanford.edu/~rezab/nips2014workshop/slides/reza.pdf

1. Matrix descriptions

RowMatrix:

RowMatrix is a row-oriented distributed matrix without meaningful row indices, backed by an RDD of its rows, where each row is a local vector. Since each row is represented by a local vector, the number of columns is limited by the integer range but it should be much smaller in practice.

IndexedRowMatrix:

An IndexedRowMatrix is similar to a RowMatrix but with meaningful row indices. It is backed by an RDD of indexed rows, so that each row is represented by its index (long-typed) and a local vector.
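
As a minimal sketch of the difference between the two types, assuming a local SparkContext: the helper `withRowIndices` below is a hypothetical name (it is not part of Spark) that attaches a unique index to each row with `zipWithIndex` and wraps the result in an IndexedRowMatrix. Converting back with `toRowMatrix()` discards the indices again.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, RowMatrix}
import org.apache.spark.rdd.RDD

object IndexedRowSketch {
  // Hypothetical helper (not part of Spark): give every row a unique Long index.
  def withRowIndices(rows: RDD[Vector]): IndexedRowMatrix =
    new IndexedRowMatrix(rows.zipWithIndex().map { case (v, i) => IndexedRow(i, v) })

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("IndexedRowSketch").setMaster("local"))
    try {
      val rows = sc.parallelize(Seq(
        Vectors.dense(1.0, 7.0, 0.0, 0.0),
        Vectors.dense(0.0, 2.0, 8.0, 0.0)))
      val indexed = withRowIndices(rows)
      // Each printed row carries its index next to the vector.
      indexed.rows.collect().foreach(println)
      // The indices are dropped when converting to a RowMatrix.
      val plain: RowMatrix = indexed.toRowMatrix()
      println(s"${plain.numRows()} x ${plain.numCols()}")
    } finally sc.stop()
  }
}
```

Using `zipWithIndex` is only one way to obtain indices; when the data already carries a meaningful key (an ID column, for example), that key can be used as the index instead.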

CoordinateMatrix:

A CoordinateMatrix is a distributed matrix backed by an RDD of its entries. Each entry is a tuple of (i: Long, j: Long, value: Double), where i is the row index, j is the column index, and value is the entry value. A CoordinateMatrix should be used only when both dimensions of the matrix are huge and the matrix is very sparse.
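
To make the entry-based storage concrete, here is a minimal sketch that builds a CoordinateMatrix directly from a handful of MatrixEntry values. It assumes a local SparkContext; the object name, the helper `build`, and the three sample entries are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

object CoordinateMatrixSketch {
  // Hypothetical helper: only nonzero entries are stored, as (row, column, value).
  def build(sc: SparkContext): CoordinateMatrix = {
    val entries = sc.parallelize(Seq(
      MatrixEntry(0L, 1L, 7.0),
      MatrixEntry(2L, 0L, 5.0),
      MatrixEntry(3L, 3L, 4.0)))
    new CoordinateMatrix(entries)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("CoordinateMatrixSketch").setMaster("local"))
    try {
      val coo = build(sc)
      // Without explicit sizes, the dimensions are inferred as (largest index + 1).
      println(s"${coo.numRows()} x ${coo.numCols()}")
      // Convert to a row-oriented format; rows without any entry do not appear.
      coo.toIndexedRowMatrix().rows.collect().foreach(println)
    } finally sc.stop()
  }
}
```

When the true size of the matrix is larger than the largest stored index, the three-argument constructor `new CoordinateMatrix(entries, nRows, nCols)` can be used to state the dimensions explicitly.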

2. Practical application

package Basic
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.{Vectors,Matrices}
import org.apache.spark.mllib.linalg.distributed.{IndexedRowMatrix, RowMatrix, CoordinateMatrix,IndexedRow}

object DistributedMatrix {
  def main(args: Array[String]): Unit = {
    /* Working through the "Distributed matrix" section of the official Spark docs
    * */

    val sparkConf = new SparkConf().setAppName("Distributed matrix Learning").setMaster("local")
    val sc = new SparkContext(sparkConf)
    // Create an RDD[Vector]
    val rdd1= sc.parallelize(
      Array(
        Array(1.0,7.0,0.0,0.0),
        Array(0.0,2.0,8.0,0.0),
        Array(5.0,0.0,3.0,9.0),
        Array(0.0,6.0,0.0,4.0)
      )
    ).map(f => Vectors.dense(f))
    // Create a RowMatrix
    //http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix
    val rowMatrix = new RowMatrix(rdd1)
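    // columnSimilarities on a RowMatrix has two overloads:
    //def columnSimilarities(threshold: Double): CoordinateMatrix
    //def columnSimilarities(): CoordinateMatrix
    // With a threshold > 0 the similarities are estimated by sampling, so the printed
    // values are approximate and can differ from run to run.
    // threshold = 1: no entries were printed in my run
    // threshold = 0.5: one run printed
    //MatrixEntry(2,3,0.9738191526000504)
    //MatrixEntry(0,1,0.2607165735702116)
    //MatrixEntry(0,2,0.5586783719361678)
    // Without an argument the similarities are exact, and this matrix yields 6 entries in total.
    // foreach returns Unit, so wrapping it in println would only print an extra "()".
    rowMatrix.columnSimilarities(0.5).entries.collect().foreach(x=>println(x))
    println("=================================exclude threshold==============================================")
    rowMatrix.columnSimilarities().entries.collect().foreach(x=>println(x))
    // columnSimilarities computes the similarity between columns and returns a CoordinateMatrix,
    // whose values are stored as case class MatrixEntry(i: Long, j: Long, value: Double)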
    println(rowMatrix.numCols())//4
    println(rowMatrix.numRows())//4
    // Call computeColumnSummaryStatistics()
    println("========computeColumnSummaryStatistics(): triggers an action===========================")
    // Compute the summary once and reuse it; each separate call would run another job.
    val summary = rowMatrix.computeColumnSummaryStatistics()
    println(summary.count)// number of rows: 4
    println(summary.max)// maximum of each column, as a vector: [5.0,7.0,8.0,9.0]
    println(summary.min)// minimum of each column, as a vector: [0.0,0.0,0.0,0.0]
    println(summary.mean)// mean of each column: [1.5,3.75,2.75,3.25]
    println(summary.normL1)// L1 norm of each column, i.e. the sum of absolute values, ||x||1 = sum(abs(xi)): [6.0,15.0,11.0,13.0]
    println(summary.normL2)// L2 norm of each column, i.e. the square root of the sum of squares, ||x||2 = sqrt(sum(xi.^2)): [5.0990195135927845,9.433981132056603,8.54400374531753,9.848857801796104]
    println(summary.numNonzeros)// number of nonzero values in each column: [2.0,3.0,2.0,2.0]
    println(summary.variance)// sample variance of each column (divides by n-1): [5.666666666666667,10.916666666666666,14.25,18.25]
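    println("----------------------------------Covariance-----------------------------------------------")
    // Covariance measures how two variables vary together (variance is the special case X = Y).
    //Cov(X,Y) = E((X-u)*(Y-v)), where u is the mean of X and v is the mean of Y
    // Its sign shows whether the two variables are positively or negatively related;
    // zero means they are uncorrelated, which does not by itself imply independence.
    println(rowMatrix.computeCovariance())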
    println("----------------------------------Gramian matrix--------------------------------------------")
    // Definition: https://en.wikipedia.org/wiki/Gramian_matrix
    // A symmetric matrix whose principal minors are all >= 0 is positive semidefinite.
    // A Gramian matrix is positive semidefinite.
    println(rowMatrix.computeGramianMatrix())
    //26.0  7.0   15.0  45.0
    // 7.0   89.0  16.0  24.0
    //15.0  16.0  73.0  27.0
    //45.0  24.0  27.0  97.0
    // Here 26 is the dot product of column 1 with itself, 7 is the dot product of column 1 with column 2, and so on.
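    println("-------------------------Principal components----------------------------------------------------")
    // PCA is essentially a change of basis: the data points are mapped into a new coordinate system. It is a statistical method for dimensionality reduction.
    // A later post covers it in detail; the argument 2 asks for the top 2 principal components.
    println(rowMatrix.computePrincipalComponents(2))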
    //-0.39298150433827816  -0.12886620383031905
    //0.5385989845229371    -0.3723128718395973
    //-0.1573813891189192   0.8341561627916149
    //-0.7284969248239043   -0.3859535244685352
    println("-----------------------------SVD (singular value decomposition)-----------------------------------")
    // Recommended post: http://blog.chinaunix.net/uid-20761674-id-4040274.html
    //A = U*S*V'
    // the columns of U are eigenvectors of A*A'
    // the columns of V are eigenvectors of A'*A
    // S is a diagonal matrix of singular values; Spark returns its diagonal as the vector s
    // Return type in the source: SingularValueDecomposition[RowMatrix, Matrix]
    // Compute the decomposition once and reuse it instead of recomputing it for every field.
    val svd = rowMatrix.computeSVD(4,true)
    println(svd)
    println("S: "+svd.s)
    // svd.U is a RowMatrix, so print its rows
    println("U:")
    svd.U.rows.collect().foreach(x => println(x))
    println("V: "+svd.V)
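    println("=======================================Matrix multiplication=======================")
    //def multiply(B: Matrix): RowMatrix={.................}
    // Returns a RowMatrix
    val Arr1 = Matrices.dense(4,1,Array(1.0,2.0,3.0,4.0))
    // The product is a 4 x 1 RowMatrix. A single column has no column pairs, so
    // columnSimilarities would return no entries; print the rows of the product instead.
    rowMatrix.multiply(Arr1).rows.collect().foreach(println)
    // prints [15.0], [28.0], [50.0], [28.0], one row per line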
    //===============Using IndexedRowMatrix=====================================================
    //An IndexedRowMatrix is similar to a RowMatrix but with meaningful row indices.
    // It is backed by an RDD of indexed rows, so that each row is represented by its index (long-typed) and a local vector.
    // An IndexedRowMatrix can be created from an RDD and later converted to a RowMatrix.
    // Wrong: val indexedRowMatrix = new IndexedRowMatrix(rdd1)
    // Reason: rdd1 is an RDD[Vector], not an RDD[IndexedRow]
    // The two constructors:
    //new IndexedRowMatrix(rows: RDD[IndexedRow])
    //new IndexedRowMatrix(rows: RDD[IndexedRow], nRows: Long, nCols: Int)
    // From the source: case class IndexedRow(index: Long, vector: Vector)
    // index: we can choose the value ourselves, but it must be a Long.
    // The Double read from the array is therefore converted explicitly with .toLong,
    // which is easier to follow than an implicit Double => Long conversion.

    val rdd2 = sc.parallelize(
      Array(
        Array(1.0,7.0,0.0,0.0),
        Array(0.0,2.0,8.0,0.0),
        Array(5.0,0.0,3.0,9.0),
        Array(0.0,6.0,0.0,4.0)
      )
    ).map(f => IndexedRow(f.head.toLong,Vectors.dense(f.tail)))
    rdd2.collect().foreach(println)
    //IndexedRow(1,[7.0,0.0,0.0])
    //IndexedRow(0,[2.0,8.0,0.0])
    //IndexedRow(5,[0.0,3.0,9.0])
    //IndexedRow(0,[6.0,0.0,4.0])
    // Here the first element of each array is the index and the remaining elements form the vector.
    // (Index 0 occurs twice in this toy data; in practice row indices would normally be unique.)
    // If instead the last element is the index and the first n-1 elements are the features, use:
    //IndexedRow(f.last.toLong,Vectors.dense(f.dropRight(1)))
    // Having such an index is convenient when training a network, for example with BP (backpropagation).
    val indexedRowMatrix = new IndexedRowMatrix(rdd2)
    // Convert the IndexedRowMatrix to a RowMatrix and print its rows
    indexedRowMatrix.toRowMatrix().rows.collect().foreach(println)
    // The basic operations shown in the RowMatrix part are available as well
    println(indexedRowMatrix.numCols())//3


    //===============Using CoordinateMatrix===========================================================
    // Constructors:
    //new CoordinateMatrix(entries: RDD[MatrixEntry])
    //new CoordinateMatrix(entries: RDD[MatrixEntry], nRows: Long, nCols: Long)
    // Judging by the return type, a CoordinateMatrix can also be obtained from columnSimilarities, in either form:
    //val coordinateMatrix:CoordinateMatrix= rowMatrix.columnSimilarities(0.5)
    val coordinateMatrix:CoordinateMatrix= rowMatrix.columnSimilarities()
    //return:An n x n sparse upper-triangular matrix of cosine similarities between columns of this matrix.
    // CoordinateMatrix API:
    //http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.distributed.CoordinateMatrix
    // Print the entries
    coordinateMatrix.entries.collect().foreach(println)
    //MatrixEntry(2,3,0.32086065591476265)
    //MatrixEntry(0,1,0.14551788123706114)
    //MatrixEntry(1,2,0.19850138864225716)
    //MatrixEntry(0,3,0.896065945800218)
    //MatrixEntry(1,3,0.25830354780341375)
    //MatrixEntry(0,2,0.3443048616036665)
    // Other matrix operations are available too, such as the number of columns and rows
    println(coordinateMatrix.numCols())//4
    println(coordinateMatrix.numRows())//4
    // It can also be converted to an IndexedRowMatrix; print its rows
    coordinateMatrix.toIndexedRowMatrix().rows.collect().foreach(println)
    sc.stop()
  }
}
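
The Gramian matrix, the exact column similarities, and the column variances printed by the listing above can be checked by hand without Spark. The following is a small plain-Scala sketch of that check; the object and function names are made up for illustration, and only the 4 x 4 input matrix comes from the listing.

```scala
object ColumnMathCheck {
  // The same 4 x 4 matrix as in the listing above, stored row by row.
  val rows: Array[Array[Double]] = Array(
    Array(1.0, 7.0, 0.0, 0.0),
    Array(0.0, 2.0, 8.0, 0.0),
    Array(5.0, 0.0, 3.0, 9.0),
    Array(0.0, 6.0, 0.0, 4.0))

  // Column j as a plain array.
  def column(j: Int): Array[Double] = rows.map(_(j))

  // Dot product of two equally long arrays.
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // Entry (i, j) of the Gramian matrix A' * A: column i dotted with column j.
  def gram(i: Int, j: Int): Double = dot(column(i), column(j))

  // Cosine similarity between columns i and j.
  def cosine(i: Int, j: Int): Double =
    gram(i, j) / (math.sqrt(gram(i, i)) * math.sqrt(gram(j, j)))

  // Sample variance of column j (divides by n - 1).
  def variance(j: Int): Double = {
    val col = column(j)
    val mean = col.sum / col.length
    col.map(x => (x - mean) * (x - mean)).sum / (col.length - 1)
  }

  def main(args: Array[String]): Unit = {
    val n = rows.head.length
    // Gramian matrix, one row per line.
    for (i <- 0 until n) println((0 until n).map(j => gram(i, j)).mkString("  "))
    // Upper-triangular pairs only, matching what columnSimilarities() returns.
    for (i <- 0 until n; j <- i + 1 until n) println(s"($i,$j) -> ${cosine(i, j)}")
    // Per-column sample variance.
    println((0 until n).map(variance).mkString("[", ",", "]"))
  }
}
```

For example, columns 0 and 3 have dot product 45 and squared norms 26 and 97, so their cosine similarity is 45 / (sqrt(26) * sqrt(97)), which is about 0.896 and agrees with MatrixEntry(0,3,...) in the exact output.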

References:

http://stanford.edu/~rezab/nips2014workshop/slides/reza.pdf

http://blog.youkuaiyun.com/lovehuangjiaju/article/details/48297921

http://spark.apache.org/docs/latest/

http://www.cnblogs.com/heaad/p/

Thanks

