===========================Table of Contents==============================================
Distributed matrix
------RowMatrix
------IndexedRowMatrix
------CoordinateMatrix
------BlockMatrix
-------------------------------------------------------------------------------------------------------------------------------
For how Spark stores matrices, see http://stanford.edu/~rezab/nips2014workshop/slides/reza.pdf
1. Description of the matrices
RowMatrix:
A RowMatrix is a row-oriented distributed matrix without meaningful row indices, backed by an RDD of its rows, where each row is a local vector. Since each row is represented by a local vector, the number of columns is limited by the integer range but it should be much smaller in practice.
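IndexedRowMatrix:
An IndexedRowMatrix is similar to a RowMatrix but with meaningful row indices. It is backed by an RDD of indexed rows, so that each row is represented by its index (long-typed) and a local vector.
CoordinateMatrix:
A CoordinateMatrix is a distributed matrix backed by an RDD of its entries. Each entry is a tuple of (i: Long, j: Long, value: Double), where i is the row index, j is the column index, and value is the entry value. A CoordinateMatrix should be used only when both dimensions of the matrix are huge and the matrix is very sparse.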
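The entry format described above can be made concrete by building a CoordinateMatrix directly from MatrixEntry values. The following is only a minimal sketch: it assumes spark-mllib is on the classpath and a local master is acceptable, and the names CoordinateMatrixSketch and build are hypothetical helpers, not part of Spark.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

// Hypothetical helper object, for illustration only
object CoordinateMatrixSketch {
  // Non-zero entries of a small 4 x 4 example matrix, as (row, column, value)
  val entries: Seq[MatrixEntry] = Seq(
    MatrixEntry(0, 0, 1.0), MatrixEntry(0, 1, 7.0),
    MatrixEntry(1, 1, 2.0), MatrixEntry(1, 2, 8.0),
    MatrixEntry(2, 0, 5.0), MatrixEntry(2, 2, 3.0), MatrixEntry(2, 3, 9.0),
    MatrixEntry(3, 1, 6.0), MatrixEntry(3, 3, 4.0)
  )

  // Wrap the entries in an RDD and hand them to the one-argument constructor;
  // the dimensions are then inferred from the largest row and column index
  def build(sc: SparkContext): CoordinateMatrix =
    new CoordinateMatrix(sc.parallelize(entries))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("CoordinateMatrixSketch").setMaster("local")
    val sc = new SparkContext(conf)
    try {
      val coo = build(sc)
      println(coo.numRows()) // 4
      println(coo.numCols()) // 4
      // Convert to an IndexedRowMatrix and print one sparse row per row index
      coo.toIndexedRowMatrix().rows.collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Only the nine non-zero values are stored, which is why this format pays off for very sparse matrices; for a small dense matrix like this one a RowMatrix would be the more natural choice.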
package Basic

import scala.language.implicitConversions

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.{Vectors, Matrices}
import org.apache.spark.mllib.linalg.distributed.{IndexedRowMatrix, RowMatrix, CoordinateMatrix, IndexedRow}

object DistributedMatrix {
  def main(args: Array[String]): Unit = {
    // Working through the "Distributed matrix" section of the official Spark guide
    val sparkConf = new SparkConf().setAppName("Distributed matrix Learning").setMaster("local")
    val sc = new SparkContext(sparkConf)

    // Create an RDD[Vector]; each inner array is one row of a 4 x 4 matrix
    val rdd1 = sc.parallelize(
      Array(
        Array(1.0, 7.0, 0.0, 0.0),
        Array(0.0, 2.0, 8.0, 0.0),
        Array(5.0, 0.0, 3.0, 9.0),
        Array(0.0, 6.0, 0.0, 4.0)
      )
    ).map(f => Vectors.dense(f))

    // Create the RowMatrix
    // http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix
    val rowMatrix = new RowMatrix(rdd1)

    // columnSimilarities computes the similarity between columns and returns a
    // CoordinateMatrix whose values are stored as
    //   case class MatrixEntry(i: Long, j: Long, value: Double)
    // It can be called in two ways:
    //   def columnSimilarities(threshold: Double): CoordinateMatrix
    //   def columnSimilarities(): CoordinateMatrix
    // Observed with threshold = 1: no entries were printed
    // Observed with threshold = 0.5:
    //   MatrixEntry(2,3,0.9738191526000504)
    //   MatrixEntry(0,1,0.2607165735702116)
    //   MatrixEntry(0,2,0.5586783719361678)
    // Without an argument there are 6 entries in total, one per column pair.
    // The thresholded overload uses a sampling approach (see the RowMatrix API doc),
    // so its values are estimates and can differ from the exact ones.
    // foreach returns Unit, so wrapping it in println would print an extra "()".
    rowMatrix.columnSimilarities(0.5).entries.collect().foreach(println)
    println("=================================exclude threshold==============================================")
    rowMatrix.columnSimilarities().entries.collect().foreach(println)

    println(rowMatrix.numCols()) // 4
    println(rowMatrix.numRows()) // 4

    // computeColumnSummaryStatistics() triggers an action, so call it once and reuse the result
    println("========computeColumnSummaryStatistics(): triggers an action===========================")
    val summary = rowMatrix.computeColumnSummaryStatistics()
    println(summary.count)       // number of rows (sample size): 4
    println(summary.max)         // maximum of each column: [5.0,7.0,8.0,9.0]
    println(summary.min)         // minimum of each column: [0.0,0.0,0.0,0.0]
    println(summary.mean)        // mean of each column: [1.5,3.75,2.75,3.25]
    println(summary.normL1)      // 1-norm of each column, ||x||1 = sum(abs(xi)): [6.0,15.0,11.0,13.0]
    println(summary.normL2)      // 2-norm of each column, ||x||2 = sqrt(sum(xi.^2)): [5.0990195135927845,9.433981132056603,8.54400374531753,9.848857801796104]
    println(summary.numNonzeros) // number of non-zero values in each column: [2.0,3.0,2.0,2.0]
    println(summary.variance)    // variance of each column: [5.666666666666667,10.916666666666666,14.25,18.25]

    println("----------------------------------covariance-----------------------------------------------")
    // Covariance measures how two variables vary together (variance is the special case X = Y):
    //   Cov(X,Y) = E((X-u)*(Y-v)), where u is the mean of X and v is the mean of Y
    // Its sign tells whether two columns are positively related, negatively related,
    // or (when it is zero) uncorrelated.
    println(rowMatrix.computeCovariance())

    println("----------------------------------Gramian matrix--------------------------------------------")
    // Definition: https://en.wikipedia.org/wiki/Gramian_matrix
    // All principal minors >= 0 => positive semidefinite matrix
    // A Gramian matrix is positive semidefinite.
    println(rowMatrix.computeGramianMatrix())
    // 26.0  7.0 15.0 45.0
    //  7.0 89.0 16.0 24.0
    // 15.0 16.0 73.0 27.0
    // 45.0 24.0 27.0 97.0
    // 26 is the dot product of column 1 with itself, 7 is the dot product of
    // column 1 with column 2, and so on.

    println("-------------------------principal components----------------------------------------------------")
    // Principal component analysis is essentially a change of basis: the original data
    // points are mapped to a new coordinate system. It is a statistical method for
    // dimensionality reduction.
    // A detailed introduction will follow later; the argument 2 asks for the top 2 principal components.
    println(rowMatrix.computePrincipalComponents(2))
    // -0.39298150433827816 -0.12886620383031905
    //  0.5385989845229371  -0.3723128718395973
    // -0.1573813891189192   0.8341561627916149
    // -0.7284969248239043  -0.3859535244685352

    println("-----------------------------SVD (singular value decomposition)-----------------------------------")
    // Recommended post: http://blog.chinaunix.net/uid-20761674-id-4040274.html
    // A = U * S * V'
    // The columns of U are eigenvectors of A * A'
    // The columns of V are eigenvectors of A' * A
    // s is the vector of singular values
    // The method returns SingularValueDecomposition[RowMatrix, Matrix]
    val svd = rowMatrix.computeSVD(4, computeU = true)
    println(svd)
    println("Value of s: " + svd.s)
    // svd.U is a RowMatrix, so print its rows
    println("Value of U:")
    svd.U.rows.collect().foreach(println)
    println("Value of V: " + svd.V)

    println("=======================================matrix multiplication=======================")
    // def multiply(B: Matrix): RowMatrix
    // The result is a RowMatrix
    val arr1 = Matrices.dense(4, 1, Array(1.0, 2.0, 3.0, 4.0))
    // The product is 4 x 1, so print its rows
    rowMatrix.multiply(arr1).rows.collect().foreach(println)
    // [15.0]
    // [28.0]
    // [50.0]
    // [28.0]

    //===============Using IndexedRowMatrix=====================================================
    // An IndexedRowMatrix is similar to a RowMatrix but with meaningful row indices.
    // It is backed by an RDD of indexed rows, so that each row is represented by its
    // index (long-typed) and a local vector.
    // An IndexedRowMatrix can be created from an RDD and later converted to a RowMatrix.
    // Wrong:  val indexedRowMatrix = new IndexedRowMatrix(rdd1)
    // Reason: rdd1 is an RDD[Vector], not an RDD[IndexedRow]
    // The two constructors:
    //   new IndexedRowMatrix(rows: RDD[IndexedRow])
    //   new IndexedRowMatrix(rows: RDD[IndexedRow], nRows: Long, nCols: Int)
    // From the source: case class IndexedRow(index: Long, vector: Vector)
    // index: we can choose the value ourselves; note that it is a Long
    // As plain functions:
    //   def keyOfIndex(key: Int): Long = key
    //   def keyOfIndex(key: Double) = key.toLong
    // As an implicit conversion:
    implicit def doubleToLong(key: Double): Long = key.toLong

    // Use the first element of each array as the row index and the rest as the row vector
    val rdd2 = sc.parallelize(
      Array(
        Array(1.0, 7.0, 0.0, 0.0),
        Array(0.0, 2.0, 8.0, 0.0),
        Array(5.0, 0.0, 3.0, 9.0),
        Array(0.0, 6.0, 0.0, 4.0)
      )
    ).map(f => IndexedRow(f.take(1)(0), Vectors.dense(f.drop(1))))
    println("Value of rdd2:")
    rdd2.collect.foreach(println)
    // IndexedRow(1,[7.0,0.0,0.0])
    // IndexedRow(0,[2.0,8.0,0.0])
    // IndexedRow(5,[0.0,3.0,9.0])
    // IndexedRow(0,[6.0,0.0,4.0])
    // f.take(1) returns an array, and we then pick its first element, hence f.take(1)(0).
    // If instead the last element is the index and the first n-1 elements are the values, use
    //   IndexedRow(f.takeRight(1)(0), Vectors.dense(f.dropRight(1)))
    // Having these indices is very convenient when training a network, e.g. with backpropagation (BP).
    val indexedRowMatrix = new IndexedRowMatrix(rdd2)
    // Convert the IndexedRowMatrix to a RowMatrix and print its rows
    indexedRowMatrix.toRowMatrix().rows.collect().foreach(println)
    // The basic operations shown in the RowMatrix part are available here as well
    println(indexedRowMatrix.numCols())

    //===============Using CoordinateMatrix===========================================================
    // new CoordinateMatrix(entries: RDD[MatrixEntry])
    // new CoordinateMatrix(entries: RDD[MatrixEntry], nRows: Long, nCols: Long)
    // Judging by the return type, a CoordinateMatrix can also be obtained in these two ways:
    // val coordinateMatrix: CoordinateMatrix = rowMatrix.columnSimilarities(0.5)
    val coordinateMatrix: CoordinateMatrix = rowMatrix.columnSimilarities()
    // Returns: An n x n sparse upper-triangular matrix of cosine similarities between columns of this matrix.
    // CoordinateMatrix API:
    // http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.distributed.CoordinateMatrix
    // Print the entries
    coordinateMatrix.entries.collect.foreach(println)
    // MatrixEntry(2,3,0.32086065591476265)
    // MatrixEntry(0,1,0.14551788123706114)
    // MatrixEntry(1,2,0.19850138864225716)
    // MatrixEntry(0,3,0.896065945800218)
    // MatrixEntry(1,3,0.25830354780341375)
    // MatrixEntry(0,2,0.3443048616036665)
    // Other matrix operations are available too, e.g. the number of columns and rows
    println(coordinateMatrix.numCols())
    println(coordinateMatrix.numRows())
    // It can also be converted to an IndexedRowMatrix
    coordinateMatrix.toIndexedRowMatrix().rows.collect().foreach(println)

    sc.stop()
  }
}
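As a cross-check that does not need Spark, the Gramian matrix and the exact column cosine similarities of the 4 x 4 example matrix can be recomputed in plain Scala. This is a minimal sketch rather than how Spark computes them internally; the names ExactColumnChecks, gramian and cosineSimilarities are hypothetical and not part of any library.

```scala
// Hypothetical helper object, for illustration only (no Spark dependency)
object ExactColumnChecks {
  // Rows of the 4 x 4 example matrix used in the listing above
  val rows: Array[Array[Double]] = Array(
    Array(1.0, 7.0, 0.0, 0.0),
    Array(0.0, 2.0, 8.0, 0.0),
    Array(5.0, 0.0, 3.0, 9.0),
    Array(0.0, 6.0, 0.0, 4.0)
  )

  // Gramian matrix A' * A: entry (i, j) is the dot product of columns i and j
  def gramian(a: Array[Array[Double]]): Array[Array[Double]] = {
    val n = a.head.length
    Array.tabulate(n, n)((i, j) => a.map(r => r(i) * r(j)).sum)
  }

  // Exact cosine similarity for every column pair i < j:
  // dot(col_i, col_j) / (||col_i|| * ||col_j||), read off the Gramian matrix
  def cosineSimilarities(a: Array[Array[Double]]): Seq[(Int, Int, Double)] = {
    val g = gramian(a)
    for {
      i <- g.indices
      j <- (i + 1) until g.length
    } yield (i, j, g(i)(j) / math.sqrt(g(i)(i) * g(j)(j)))
  }

  def main(args: Array[String]): Unit = {
    gramian(rows).foreach(r => println(r.mkString(" ")))
    cosineSimilarities(rows).foreach(println)
  }
}
```

The six similarities computed this way agree with the MatrixEntry values printed by columnSimilarities() without a threshold, for example 45 / sqrt(26 * 97) for columns 0 and 3. The values printed for threshold = 0.5 do not match, which is consistent with that overload being a sampled estimate.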
References:
http://stanford.edu/~rezab/nips2014workshop/slides/reza.pdf
http://blog.youkuaiyun.com/lovehuangjiaju/article/details/48297921
http://spark.apache.org/docs/latest/
http://www.cnblogs.com/heaad/p/
Thanks.