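Spark's ML (Machine Learning) library provides implementations of the mainstream statistical and data-mining algorithms. In this post William gives an overview; detailed walkthroughs of individual algorithms will follow in later posts.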
Classification and regression algorithms
| Algorithm | Spark algorithm class | Spark model class |
|---|---|---|
| Support vector machine (SVM) | SVMWithSGD | SVMModel |
| Logistic regression | LogisticRegressionWithLBFGS, LogisticRegressionWithSGD | LogisticRegressionModel |
| Linear regression | LinearRegressionWithSGD | LinearRegressionModel |
| Streaming linear regression | StreamingLinearRegressionWithSGD | LinearRegressionModel |
| Ridge regression | RidgeRegressionWithSGD | RidgeRegressionModel |
| Lasso regression | LassoWithSGD | LassoModel |
| Naive Bayes | NaiveBayes | NaiveBayesModel |
| Decision tree | DecisionTree | DecisionTreeModel |
| Random forest | RandomForest | RandomForestModel |
| Gradient-Boosted Trees | GradientBoostedTrees | GradientBoostedTreesModel |
| Isotonic regression | IsotonicRegression | IsotonicRegressionModel |
Collaborative filtering algorithms
| Algorithm | Spark algorithm class | Spark model class |
|---|---|---|
| alternating least squares (ALS) | ALS | MatrixFactorizationModel |
Clustering algorithms
| Algorithm | Spark algorithm class | Spark model class |
|---|---|---|
| k-means | KMeans | KMeansModel |
| Gaussian mixture | GaussianMixture | GaussianMixtureModel |
| power iteration clustering (PIC) | PowerIterationClustering | PowerIterationClusteringModel |
| latent Dirichlet allocation (LDA) | LDA | DistributedLDAModel |
| streaming k-means | StreamingKMeans | KMeansModel |
Dimensionality reduction algorithms
| Algorithm | Spark algorithm class |
|---|---|
| singular value decomposition (SVD) | RowMatrix.computeSVD |
| principal component analysis (PCA) | RowMatrix.computePrincipalComponents |
Feature extraction and transformation
| Algorithm | Spark algorithm class | Spark model class |
|---|---|---|
| TF-IDF | HashingTF, IDF | |
| Word2Vec | Word2Vec | Word2VecModel |
| Standard Scaler | StandardScaler | StandardScalerModel |
| Normalizer | Normalizer | |
Frequent pattern mining
| Algorithm | Spark algorithm class |
|---|---|
| FP-growth | FPGrowth |
| association rules | AssociationRules |
| PrefixSpan | PrefixSpan |