A pyspark.ml Pipeline for Decision Tree Classification of the Wine Dataset

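This post shows how to build a decision tree classifier for the Wine dataset with pyspark.ml, the DataFrame-based API of PySpark MLlib. The data is first preprocessed by indexing the labels and the features. A decision tree classifier is then created and assembled into a pipeline. The data is randomly split into training and test sets, the pipeline is fitted on the training set, and the model's multiclass classification accuracy is evaluated on the test set.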

Data source: http://archive.ics.uci.edu/ml/datasets/Wine
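
As a rough sketch of the file layout this post relies on: each record in the UCI file is one comma-separated line whose first field is the class label (1, 2 or 3), followed by 13 numeric features. The sample row and the helper name parse_row are used for illustration only, and the sketch assumes the file was downloaded unchanged.

```python
# Illustrative only: parse one record in the layout of the UCI Wine data file.
# The sample row and the name parse_row are assumptions made for this sketch.
sample_row = "1,14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065"

def parse_row(line):
    """Split a comma-separated record into (label, features)."""
    fields = line.split(',')
    label = fields[0]                          # class label, kept as a string
    features = [float(v) for v in fields[1:]]  # remaining fields are numeric
    return label, features

label, features = parse_row(sample_row)
print(label, len(features))
```

The label stays a string here because StringIndexer expects a string column and produces the numeric label index itself.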

from pyspark.ml.classification import DecisionTreeClassificationModel
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml import Pipeline,PipelineModel
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.linalg import Vector,Vectors
from pyspark.sql import Row
from pyspark.ml.feature import IndexToString,StringIndexer,VectorIndexer

def getFeaAndLab(x):
    # x is one split record: x[0] is the class label, x[1]..x[13] are the 13 features
    res = {}
    res['features'] = Vectors.dense([float(v) for v in x[1:14]])
    res['label'] = str(x[0])
    return res

def model(data):
    # ------------------------data processing-------------------------
    labelIndexer = StringIndexer(inputCol='label', outputCol='indexedLabel').fit(data)
    featureIndexer = VectorIndexer(inputCol='features', outputCol='indexedFeatures').fit(data)
    labelConverter = IndexToString(inputCol='prediction', outputCol='predictedLabel', labels=labelIndexer.labels)
    
    # -----------------------choose your model--------------------------------
    dtClassifier = DecisionTreeClassifier().setLabelCol("indexedLabel").setFeaturesCol("indexedFeatures")
    
    # print("DecisionTree parameters:\n" + dtClassifier.explainParams())  # describe every parameter
    # ------------------------------pipeline-------------------------------------------------
    dtPipeline = Pipeline().setStages([labelIndexer, featureIndexer, dtClassifier, labelConverter])
    
    # ------------------------------split-------------------------------------------------
    trainingData, testData = data.randomSplit([0.7, 0.3])  # rows are assigned to each split at random, so no separate shuffle is needed
    
    # ------------------------------Train-------------------------------------------------
    dtPipelineModel = dtPipeline.fit(trainingData)
    
    # ------------------------------test-------------------------------------------------
    dtPredictions = dtPipelineModel.transform(testData)
    
    # ------------------------------show-------------------------------------------------    
    dtPredictions.select("predictedLabel", "label", "features").show(20)
    
    # ------------------------------evaluate-------------------------------------------------    
    # The evaluator's default metric is F1, so request accuracy explicitly
    evaluator = MulticlassClassificationEvaluator().setLabelCol("indexedLabel").setPredictionCol("prediction").setMetricName("accuracy")
    dtAccuracy = evaluator.evaluate(dtPredictions)
    print(dtAccuracy)
    
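from pyspark.sql import SparkSession  # needed when run as a script; the pyspark shell predefines spark

def main():
    # Reuse the active SparkSession (e.g. in the pyspark shell) or create one
    spark = SparkSession.builder.appName('wine-decision-tree').getOrCreate()
    # Read the raw text file, one comma-separated record per line
    rdd = spark.sparkContext.textFile('file:///usr/local/spark/mycode/ml/wine.txt')
    # Split each line into fields and convert the records to a DataFrame
    data = rdd.map(lambda x: x.split(',')).map(lambda x: Row(**getFeaAndLab(x))).toDF()
    model(data)

if __name__ == '__main__':
    main()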
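
One detail worth illustrating is why the pipeline ends with IndexToString. By default, StringIndexer assigns indices by descending label frequency, so an index does not have to match the original class number. The plain-Python sketch below mimics that ordering. The class counts (59, 71 and 48) are those documented for the UCI Wine dataset, the function name index_labels is hypothetical, and the alphabetical tie-break is an assumption of this sketch. It only approximates the default behavior and is not Spark's implementation.

```python
from collections import Counter

def index_labels(labels):
    """Approximate StringIndexer's default ordering: most frequent label gets 0.0."""
    counts = Counter(labels)
    # Sort by descending frequency; ties are broken alphabetically in this sketch
    ordered = sorted(counts, key=lambda lab: (-counts[lab], lab))
    return {lab: float(i) for i, lab in enumerate(ordered)}

# Class sizes documented for the UCI Wine dataset
labels = ['1'] * 59 + ['2'] * 71 + ['3'] * 48
mapping = index_labels(labels)
print(mapping)  # {'2': 0.0, '1': 1.0, '3': 2.0}

# Inverse lookup, which is the role IndexToString plays with labelIndexer.labels
inverse = {idx: lab for lab, idx in mapping.items()}
print(inverse[0.0])  # 2
```

With these counts, class 2 would be indexed as 0.0, which is why the numeric prediction column is converted back to predictedLabel before it is shown next to label.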