MLlib - Dimensionality Reduction (I)

This post introduces the two main dimensionality reduction methods in Apache Spark MLlib: singular value decomposition (SVD) and principal component analysis (PCA). Worked examples show how to run SVD and PCA on a tall-and-skinny matrix in Spark to extract key features and reduce the dimensionality of the data.


MLlib-Dimensionality Reduction

* Singular value decomposition (SVD)

* Principal component analysis (PCA)

Dimensionality reduction can be understood as the process of reducing the number of variables under consideration. It is often used to extract latent, useful features from raw and noisy ones, or to compress data while preserving its structure. The spark-1.0.0 release provides dimensionality reduction for tall-and-skinny matrices.

SVD

I will not go into the details of SVD itself here; this article explains it very well. What is worth pointing out is that for large matrices we usually do not need the full factorization: the top few largest singular values and their corresponding singular vectors are enough. This is friendlier to storage and, more importantly, it reduces the dimensionality and exposes the low-rank structure of the matrix.
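For reference, the truncated factorization described above can be written in standard notation (not Spark-specific) as

A ≈ U_k Σ_k V_kᵀ

where A is the m×n data matrix, U_k is m×k with orthonormal columns (the left singular vectors), Σ_k is the k×k diagonal matrix holding the k largest singular values, and V_k is n×k with orthonormal columns (the right singular vectors). The computeSVD(k, computeU = true) call below returns exactly these three factors as U, s, and V.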

In the spark-1.0.0 release, SVD is provided for tall-and-skinny matrices, i.e. matrices with many rows and few columns (fewer than 1000). Such a matrix is stored in distributed form as a RowMatrix, and the SVD is computed on that RowMatrix.

import org.apache.spark.mllib.linalg.Matrix
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.SingularValueDecomposition
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val mat: RowMatrix = ...

// Compute the top 20 singular values and corresponding singular vectors.
val svd: SingularValueDecomposition[RowMatrix, Matrix] = mat.computeSVD(20, computeU = true)
val U: RowMatrix = svd.U // The U factor is a RowMatrix.
val s: Vector = svd.s // The singular values are stored in a local dense vector.
val V: Matrix = svd.V // The V factor is a local dense matrix.
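
The snippet above leaves val mat: RowMatrix = ... unspecified. As a minimal sketch (post (II) builds the actual matrix; the whitespace-separated numeric file format assumed here is only a guess based on the data.txt input splits visible in the logs below), such a RowMatrix could be constructed like this:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Assume each line of data.txt holds one matrix row as whitespace-separated numbers.
val rows = sc.textFile("hdfs://node001:9000/spark/input/data.txt")
  .map(line => Vectors.dense(line.trim.split("\\s+").map(_.toDouble)))

val mat: RowMatrix = new RowMatrix(rows) // distributed tall-and-skinny matrix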
Using the RowMatrix generated in post (II), let's experiment with the SVD decomposition. Each step is shown below:

scala> val svd: SingularValueDecomposition[RowMatrix, Matrix] = mat.computeSVD(2, computeU = true)
14/07/11 03:26:49 INFO SparkContext: Starting job: aggregate at RowMatrix.scala:211
14/07/11 03:26:49 INFO DAGScheduler: Got job 4 (aggregate at RowMatrix.scala:211) with 2 output partitions (allowLocal=false)
14/07/11 03:26:49 INFO DAGScheduler: Final stage: Stage 4(aggregate at RowMatrix.scala:211)
14/07/11 03:26:49 INFO DAGScheduler: Parents of final stage: List()
14/07/11 03:26:49 INFO DAGScheduler: Missing parents: List()
14/07/11 03:26:49 INFO DAGScheduler: Submitting Stage 4 (MappedRDD[18] at map at <console>:29), which has no missing parents
14/07/11 03:26:49 INFO DAGScheduler: Submitting 2 missing tasks from Stage 4 (MappedRDD[18] at map at <console>:29)
14/07/11 03:26:49 INFO TaskSchedulerImpl: Adding task set 4.0 with 2 tasks
14/07/11 03:26:49 INFO TaskSetManager: Starting task 4.0:0 as TID 6 on executor localhost: localhost (PROCESS_LOCAL)
14/07/11 03:26:49 INFO TaskSetManager: Serialized task 4.0:0 as 3057 bytes in 0 ms
14/07/11 03:26:49 INFO TaskSetManager: Starting task 4.0:1 as TID 7 on executor localhost: localhost (PROCESS_LOCAL)
14/07/11 03:26:49 INFO TaskSetManager: Serialized task 4.0:1 as 3057 bytes in 0 ms
14/07/11 03:26:49 INFO Executor: Running task ID 6
14/07/11 03:26:49 INFO Executor: Running task ID 7
14/07/11 03:26:49 INFO BlockManager: Found block broadcast_2 locally
14/07/11 03:26:49 INFO BlockManager: Found block broadcast_2 locally
14/07/11 03:26:49 INFO HadoopRDD: Input split: hdfs://node001:9000/spark/input/data.txt:9+9
14/07/11 03:26:49 INFO HadoopRDD: Input split: hdfs://node001:9000/spark/input/data.txt:0+9
14/07/11 03:26:49 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
14/07/11 03:26:49 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
14/07/11 03:26:49 INFO Executor: Serialized size of result for 7 is 1037
14/07/11 03:26:49 INFO Executor: Serialized size of result for 6 is 1037
14/07/11 03:26:49 INFO Executor: Sending result for 7 directly to driver
14/07/11 03:26:49 INFO Executor: Sending result for 6 directly to driver
14/07/11 03:26:49 INFO Executor: Finished task ID 7
14/07/11 03:26:49 INFO Executor: Finished task ID 6
14/07/11 03:26:49 INFO DAGScheduler: Completed ResultTask(4, 1)
14/07/11 03:26:49 INFO TaskSetManager: Finished TID 7 in 39 ms on localhost (progress: 1/2)
14/07/11 03:26:49 INFO TaskSetManager: Finished TID 6 in 42 ms on localhost (progress: 2/2)
14/07/11 03:26:49 INFO TaskSchedulerImpl: Removed TaskSet 4.0, whose tasks have all completed, from pool
14/07/11 03:26:49 INFO DAGScheduler: Completed ResultTask(4, 0)
14/07/11 03:26:49 INFO DAGScheduler: Stage 4 (aggregate at RowMatrix.scala:211) finished in 0.043 s
14/07/11 03:26:49 INFO SparkContext: Job finished: aggregate at RowMatrix.scala:211, took 0.049573705 s
14/07/11 03:26:49 WARN LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeSystemLAPACK
14/07/11 03:26:49 WARN LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeRefLAPACK
14/07/11 03:26:49 INFO MemoryStore: ensureFreeSpace(200) called with curMem=332040, maxMem=309225062
14/07/11 03:26:49 INFO MemoryStore: Block broadcast_3 stored as values to memory (estimated size 200.0 B, free 294.6 MB)
svd: org.apache.spark.mllib.linalg.SingularValueDecomposition[org.apache.spark.mllib.linalg.distributed.RowMatrix,org.apache.spark.mllib.linalg.Matrix] =
SingularValueDecomposition(org.apache.spark.mllib.linalg.distributed.RowMatrix@2be207c2,[12.014542582220963,0.7307604855863146],-0.45395279953839474  0.6207943097808744   
-0.5636972053725028   0.35545611740775257  
-0.6900524012323158   -0.6987598826071044  )

scala> val U: RowMatrix = svd.U
U: org.apache.spark.mllib.linalg.distributed.RowMatrix = org.apache.spark.mllib.linalg.distributed.RowMatrix@2be207c2

scala> val s: Vector = svd.s
s: org.apache.spark.mllib.linalg.Vector = [12.014542582220963,0.7307604855863146]

scala> val V: Matrix = svd.V
V: org.apache.spark.mllib.linalg.Matrix =
-0.45395279953839474  0.6207943097808744   
-0.5636972053725028   0.35545611740775257  
-0.6900524012323158   -0.6987598826071044  
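
To actually use these factors for dimensionality reduction (a sketch, not part of the original walkthrough): project the rows of the original matrix onto the right singular vectors, i.e. compute A·V, which equals U·Σ. In MLlib this is a local-matrix multiply on the RowMatrix:

// Project the original rows onto the k right singular vectors.
// The result has the same number of rows as mat but only k columns.
val reduced: RowMatrix = mat.multiply(svd.V)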

PCA
    

Principal component analysis is very widely used for dimensionality reduction. It can be understood as finding a rotation such that the first coordinate has the largest possible variance, and each succeeding coordinate in turn has the largest possible variance. The columns of the rotation matrix are the principal components. In this Spark release, PCA is likewise only provided for tall-and-skinny matrices. The PCA experiment follows:

import org.apache.spark.mllib.linalg.Matrix
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val mat: RowMatrix = ...

// Compute the top 10 principal components.
val pc: Matrix = mat.computePrincipalComponents(10) // Principal components are stored in a local dense matrix.

// Project the rows to the linear space spanned by the top 10 principal components.
val projected: RowMatrix = mat.multiply(pc)
Each step in detail:

scala> val pc: Matrix = mat.computePrincipalComponents(1)
14/07/11 03:32:17 INFO SparkContext: Starting job: aggregate at RowMatrix.scala:312
14/07/11 03:32:17 INFO DAGScheduler: Got job 5 (aggregate at RowMatrix.scala:312) with 2 output partitions (allowLocal=false)
14/07/11 03:32:17 INFO DAGScheduler: Final stage: Stage 5(aggregate at RowMatrix.scala:312)
14/07/11 03:32:17 INFO DAGScheduler: Parents of final stage: List()
14/07/11 03:32:17 INFO DAGScheduler: Missing parents: List()
14/07/11 03:32:17 INFO DAGScheduler: Submitting Stage 5 (MappedRDD[18] at map at <console>:29), which has no missing parents
14/07/11 03:32:17 INFO DAGScheduler: Submitting 2 missing tasks from Stage 5 (MappedRDD[18] at map at <console>:29)
14/07/11 03:32:17 INFO TaskSchedulerImpl: Adding task set 5.0 with 2 tasks
14/07/11 03:32:17 INFO TaskSetManager: Starting task 5.0:0 as TID 8 on executor localhost: localhost (PROCESS_LOCAL)
14/07/11 03:32:17 INFO TaskSetManager: Serialized task 5.0:0 as 3105 bytes in 1 ms
14/07/11 03:32:17 INFO TaskSetManager: Starting task 5.0:1 as TID 9 on executor localhost: localhost (PROCESS_LOCAL)
14/07/11 03:32:17 INFO TaskSetManager: Serialized task 5.0:1 as 3105 bytes in 1 ms
14/07/11 03:32:17 INFO Executor: Running task ID 8
14/07/11 03:32:17 INFO Executor: Running task ID 9
14/07/11 03:32:17 INFO BlockManager: Found block broadcast_2 locally
14/07/11 03:32:17 INFO BlockManager: Found block broadcast_2 locally
14/07/11 03:32:17 INFO HadoopRDD: Input split: hdfs://node001:9000/spark/input/data.txt:9+9
14/07/11 03:32:17 INFO HadoopRDD: Input split: hdfs://node001:9000/spark/input/data.txt:0+9
14/07/11 03:32:17 INFO Executor: Serialized size of result for 9 is 1140
14/07/11 03:32:17 INFO Executor: Serialized size of result for 8 is 1140
14/07/11 03:32:17 INFO Executor: Sending result for 9 directly to driver
14/07/11 03:32:17 INFO Executor: Sending result for 8 directly to driver
14/07/11 03:32:17 INFO Executor: Finished task ID 9
14/07/11 03:32:17 INFO Executor: Finished task ID 8
14/07/11 03:32:17 INFO DAGScheduler: Completed ResultTask(5, 1)
14/07/11 03:32:17 INFO TaskSetManager: Finished TID 9 in 31 ms on localhost (progress: 1/2)
14/07/11 03:32:17 INFO DAGScheduler: Completed ResultTask(5, 0)
14/07/11 03:32:17 INFO TaskSetManager: Finished TID 8 in 33 ms on localhost (progress: 2/2)
14/07/11 03:32:17 INFO TaskSchedulerImpl: Removed TaskSet 5.0, whose tasks have all completed, from pool
14/07/11 03:32:17 INFO DAGScheduler: Stage 5 (aggregate at RowMatrix.scala:312) finished in 0.034 s
14/07/11 03:32:17 INFO SparkContext: Job finished: aggregate at RowMatrix.scala:312, took 0.045160171 s
14/07/11 03:32:17 INFO SparkContext: Starting job: aggregate at RowMatrix.scala:211
14/07/11 03:32:17 INFO DAGScheduler: Got job 6 (aggregate at RowMatrix.scala:211) with 2 output partitions (allowLocal=false)
14/07/11 03:32:17 INFO DAGScheduler: Final stage: Stage 6(aggregate at RowMatrix.scala:211)
14/07/11 03:32:17 INFO DAGScheduler: Parents of final stage: List()
14/07/11 03:32:17 INFO DAGScheduler: Missing parents: List()
14/07/11 03:32:17 INFO DAGScheduler: Submitting Stage 6 (MappedRDD[18] at map at <console>:29), which has no missing parents
14/07/11 03:32:17 INFO DAGScheduler: Submitting 2 missing tasks from Stage 6 (MappedRDD[18] at map at <console>:29)
14/07/11 03:32:17 INFO TaskSchedulerImpl: Adding task set 6.0 with 2 tasks
14/07/11 03:32:17 INFO TaskSetManager: Starting task 6.0:0 as TID 10 on executor localhost: localhost (PROCESS_LOCAL)
14/07/11 03:32:17 INFO TaskSetManager: Serialized task 6.0:0 as 3057 bytes in 0 ms
14/07/11 03:32:17 INFO TaskSetManager: Starting task 6.0:1 as TID 11 on executor localhost: localhost (PROCESS_LOCAL)
14/07/11 03:32:17 INFO TaskSetManager: Serialized task 6.0:1 as 3057 bytes in 0 ms
14/07/11 03:32:17 INFO Executor: Running task ID 10
14/07/11 03:32:17 INFO Executor: Running task ID 11
14/07/11 03:32:17 INFO BlockManager: Found block broadcast_2 locally
14/07/11 03:32:17 INFO BlockManager: Found block broadcast_2 locally
14/07/11 03:32:17 INFO HadoopRDD: Input split: hdfs://node001:9000/spark/input/data.txt:0+9
14/07/11 03:32:17 INFO HadoopRDD: Input split: hdfs://node001:9000/spark/input/data.txt:9+9
14/07/11 03:32:17 INFO Executor: Serialized size of result for 10 is 1037
14/07/11 03:32:17 INFO Executor: Serialized size of result for 11 is 1037
14/07/11 03:32:17 INFO Executor: Sending result for 10 directly to driver
14/07/11 03:32:17 INFO Executor: Sending result for 11 directly to driver
14/07/11 03:32:17 INFO Executor: Finished task ID 10
14/07/11 03:32:17 INFO Executor: Finished task ID 11
14/07/11 03:32:17 INFO DAGScheduler: Completed ResultTask(6, 0)
14/07/11 03:32:17 INFO TaskSetManager: Finished TID 10 in 15 ms on localhost (progress: 1/2)
14/07/11 03:32:17 INFO DAGScheduler: Completed ResultTask(6, 1)
14/07/11 03:32:17 INFO TaskSetManager: Finished TID 11 in 14 ms on localhost (progress: 2/2)
14/07/11 03:32:17 INFO DAGScheduler: Stage 6 (aggregate at RowMatrix.scala:211) finished in 0.016 s
14/07/11 03:32:17 INFO TaskSchedulerImpl: Removed TaskSet 6.0, whose tasks have all completed, from pool
14/07/11 03:32:17 INFO SparkContext: Job finished: aggregate at RowMatrix.scala:211, took 0.024784268 s
pc: org.apache.spark.mllib.linalg.Matrix =
-0.5023120302337961  
-0.6114256390937501  
-0.6114256390937501
 


scala> val projected: RowMatrix = mat.multiply(pc)
14/07/11 03:34:21 INFO MemoryStore: ensureFreeSpace(176) called with curMem=332240, maxMem=309225062
14/07/11 03:34:21 INFO MemoryStore: Block broadcast_4 stored as values to memory (estimated size 176.0 B, free 294.6 MB)
projected: org.apache.spark.mllib.linalg.distributed.RowMatrix = org.apache.spark.mllib.linalg.distributed.RowMatrix@4d8fde01
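
Since the example matrix is tiny, the projected rows can simply be collected to the driver to inspect the one-dimensional representation (a small sketch; collecting would not be advisable for a genuinely large matrix):

// Pull the projected rows back to the driver and print them.
// Each row is now a 1-element vector: that row's coordinate along
// the first principal component.
projected.rows.collect().foreach(println)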
 
