Chapter 9 Spark MLlib: Companion Lab Guide

Experiment 7: Basic Spark MLlib Programming Practice

I. Objectives

(1) Master basic MLlib programming methods through hands-on practice.
(2) Learn to use MLlib to solve common data-analysis problems, including data import, principal component analysis, and classification and prediction.

II. Platform

Operating system: Ubuntu 16.04
JDK version: 1.8 or later
Spark version: 3.4.0
Python version: 3.8.18
Dataset: download the Adult dataset (http://archive.ics.uci.edu/ml/datasets/Adult); it can also be downloaded directly from the "Datasets" section of the "Download Area" on this tutorial's official website. The data was extracted from the 1994 US Census database and can be used to predict whether a resident's income exceeds $50K/year. The class variable is whether annual income exceeds $50K; the attribute variables include age, workclass, education, occupation, race, and other important information. Notably, 7 of the 14 attribute variables are categorical.

III. Content and Requirements

1. Data import

Import the data from file and convert it into a DataFrame.

[Reference answer]

# Import the required packages
from pyspark.sql import SparkSession, Row, functions
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.linalg import Vector, Vectors
from pyspark.ml.feature import PCA, IndexToString, StringIndexer, VectorIndexer, HashingTF, Tokenizer
from pyspark.ml.classification import LogisticRegression, LogisticRegressionModel, BinaryLogisticRegressionSummary
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Prepare the training and test sets. (The test set needs some cleanup: the labels
# in adult.data are ">50K" and "<=50K", while the labels in adult.test are ">50K."
# and "<=50K.", so the trailing "." is stripped here. Also make sure that neither
# adult.data.txt nor adult.test.txt ends with an extra blank line.)

# Read the training set
>>> with open("/usr/local/spark/adult.data", "r") as train_file:
        train_data = train_file.readlines()
# Read the test set
>>> with open("/usr/local/spark/adult.test", "r") as test_file:
        test_data = test_file.readlines()
# Strip the trailing "." from the test-set labels
>>> test_data = [line.replace(" >50K.", " >50K").replace(" <=50K.", " <=50K") for line in test_data]
# Drop the first line of the test set ("|1x3 Cross validator")
>>> if test_data[0].strip() == "|1x3 Cross validator":
        test_data = test_data[1:]
# Drop blank lines at the end of the training set
>>> while train_data and train_data[-1].strip() == "":
        train_data.pop()
# Drop blank lines at the end of the test set
>>> while test_data and test_data[-1].strip() == "":
        test_data.pop()
# Save the cleaned training set as adult.data.txt
>>> with open("/usr/local/spark/adult.data.txt", "w") as train_processed_file:
        train_processed_file.writelines(train_data)
# Save the cleaned test set as adult.test.txt
>>> with open("/usr/local/spark/adult.test.txt", "w") as test_processed_file:
        test_processed_file.writelines(test_data)
# The training and test files are now ready

# f() keeps the six continuous columns of the Adult data (age, fnlwgt,
# education-num, capital-gain, capital-loss, hours-per-week) as the feature
# vector and column 14, the income class, as the label
>>> def f(x):
        rel = {}
        rel['features'] = Vectors.dense(float(x[0]), float(x[2]), float(x[4]), float(x[10]), float(x[11]), float(x[12]))
        rel['label'] = str(x[14])
        return rel

>>> spark = SparkSession.builder.appName("SparkMLlib").getOrCreate()
>>> df = spark.sparkContext.textFile("file:///usr/local/spark/adult.data.txt").map(lambda line: line.split(',')).map(lambda p: Row(**f(p))).toDF()
>>> test = spark.sparkContext.textFile("file:///usr/local/spark/adult.test.txt").map(lambda line: line.split(',')).map(lambda p: Row(**f(p))).toDF()
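As a quick sanity check (not part of the reference answer), it helps to confirm the schema and row counts right after the import; a minimal sketch, assuming the df and test DataFrames created above:

# Each row should carry a 6-dimensional feature vector and a string label
>>> df.printSchema()
# Peek at a few rows
>>> df.show(3, truncate=False)
# The standard Adult split has 32561 training rows and 16281 test rows
>>> print(df.count(), test.count())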
2. Principal component analysis (PCA)

Perform principal component analysis on the six continuous numeric variables. PCA is a method that applies an orthogonal transformation to convert observations of correlated variables into values of linearly uncorrelated variables, the principal components. By projecting feature vectors onto the low-dimensional space spanned by the principal components, PCA reduces their dimensionality. Use the setK() method to set the number of principal components to 3, turning each continuous feature vector into a 3-dimensional principal-component vector.

[Reference answer]

Build the PCA model, fit the principal component decomposition on the training set, then apply it to both the training and test sets:

>>> pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures").fit(df)
>>> result = pca.transform(df)
>>> testdata = pca.transform(test)
>>> result.show(truncate=False)
+------------------------------------+------+-----------------------------------------------------------+
|features                            |label |pcaFeatures                                                |
+------------------------------------+------+-----------------------------------------------------------+
|[39.0,77516.0,13.0,2174.0,0.0,40.0] | <=50K|[77516.06543281932,-2171.6489938846544,-6.94636047659662]  |
|[50.0,83311.0,13.0,0.0,0.0,13.0]    | <=50K|[83310.99935595779,2.526033892795439,-3.3887024086797206]  |
|[38.0,215646.0,9.0,0.0,0.0,40.0]    | <=50K|[215645.99925048652,6.551842584558863,-8.58495396907331]   |
|[53.0,234721.0,7.0,0.0,0.0,40.0]    | <=50K|[234720.99907961808,7.130299808626894,-9.360179790809578]  |
|[28.0,338409.0,13.0,0.0,0.0,40.0]   | <=50K|[338408.99918830546,10.289249842829474,-13.36825187163079] |
|[37.0,284582.0,14.0,0.0,0.0,40.0]   | <=50K|[284581.99916695454,8.649756033721609,-11.281731333793076] |
|[49.0,160187.0,5.0,0.0,0.0,16.0]    | <=50K|[160186.9992693704,4.8657537211958015,-6.394299355794672]  |
|[52.0,209642.0,9.0,0.0,0.0,45.0]    | >50K |[209641.99910851714,6.366453450454777,-8.387055585722319]  |
|[31.0,45781.0,14.0,14084.0,0.0,50.0]| >50K |[45781.42721110637,-14082.596953729322,-26.303509105368967]|
|[42.0,159449.0,13.0,5178.0,0.0,40.0]| >50K |[159449.15652342225,-5173.151337268407,-15.351831002502344]|
|[37.0,280464.0,10.0,0.0,0.0,80.0]   | >50K |[280463.99908861093,8.519356755970295,-11.18800053344727]  |
|[30.0,141297.0,13.0,0.0,0.0,40.0]   | >50K |[141296.99942061218,4.290098166706542,-5.6631132626324545] |
|[23.0,122272.0,13.0,0.0,0.0,30.0]   | <=50K|[122271.99953623723,3.7134109235615136,-4.887549331279782] |
|[32.0,205019.0,12.0,0.0,0.0,50.0]   | <=50K|[205018.99929839544,6.227844686218623,-8.176186180265162]  |
|[40.0,121772.0,11.0,0.0,0.0,40.0]   | >50K |[121771.99934864059,3.6945287780608354,-4.9185835672785005]|
|[34.0,245487.0,4.0,0.0,0.0,45.0]    | <=50K|[245486.99924622502,7.460149417474324,-9.7500032428796]    |
|[25.0,176756.0,9.0,0.0,0.0,35.0]    | <=50K|[176755.99943997271,5.370793765357621,-7.029037217536837]  |
|[32.0,186824.0,9.0,0.0,0.0,40.0]    | <=50K|[186823.9993467819,5.6755410564333655,-7.445605003141201]  |
|[38.0,28887.0,7.0,0.0,0.0,50.0]     | <=50K|[28886.999469511487,0.8668334219453465,-1.2969921640114914]|
|[43.0,292175.0,14.0,0.0,0.0,45.0]   | >50K |[292174.99908683443,8.879323215730548,-11.599483225617751] |
+------------------------------------+------+-----------------------------------------------------------+
only showing top 20 rows

>>> testdata.show(truncate=False)
+------------------------------------+------+------------------------------------------------------------+
|features                            |label |pcaFeatures                                                 |
+------------------------------------+------+------------------------------------------------------------+
|[25.0,226802.0,7.0,0.0,0.0,40.0]    | <=50K|[226801.9993670891,6.893313042338155,-8.993983821758418]    |
|[38.0,89814.0,9.0,0.0,0.0,50.0]     | <=50K|[89813.9993894769,2.720987324481492,-3.6809508659703227]    |
|[28.0,336951.0,12.0,0.0,0.0,40.0]   | >50K |[336950.9991912231,10.244920104044986,-13.310695651855433]  |
|[44.0,160323.0,10.0,7688.0,0.0,40.0]| >50K |[160323.2327290343,-7683.121090489598,-19.729118648463572]  |
|[18.0,103497.0,10.0,0.0,0.0,30.0]   | <=50K|[103496.99961293538,3.1428623091567167,-4.141563083946152]  |
|[34.0,198693.0,6.0,0.0,0.0,30.0]    | <=50K|[198692.99933690467,6.037911774664424,-7.894879761309243]   |
|[29.0,227026.0,9.0,0.0,0.0,40.0]    | <=50K|[227025.99932507658,6.899470708683594,-9.011878890809934]   |
|[63.0,104626.0,15.0,3103.0,0.0,32.0]| >50K |[104626.09338764263,-3099.8250060691976,-9.648800672049632] |
|[24.0,369667.0,10.0,0.0,0.0,40.0]   | <=50K|[369666.9991911036,11.241251385630434,-14.58110445420285]   |
|[55.0,104996.0,4.0,0.0,0.0,10.0]    | <=50K|[104995.99929475832,3.186050789410868,-4.236895975019619]   |
|[65.0,184454.0,9.0,6418.0,0.0,40.0] | >50K |[184454.19392400666,-6412.391589847377,-18.518448307258247] |
|[36.0,212465.0,13.0,0.0,0.0,40.0]   | <=50K|[212464.99927015402,6.45514884447021,-8.458640605560896]    |
|[26.0,82091.0,9.0,0.0,0.0,39.0]     | <=50K|[82090.99954236703,2.4891114096287383,-3.3355931885530445]  |
|[58.0,299831.0,9.0,0.0,0.0,35.0]    | <=50K|[299830.9989556856,9.111696151579187,-11.90914144134721]    |
|[48.0,279724.0,9.0,3103.0,0.0,48.0] | >50K |[279724.09328344715,-3094.4957992963828,-16.491321474156507]|
|[43.0,346189.0,14.0,0.0,0.0,50.0]   | >50K |[346188.9990067699,10.522518314336622,-13.72068664318214]   |
|[20.0,444554.0,10.0,0.0,0.0,25.0]   | <=50K|[444553.9991678727,13.522886896071775,-17.47586621453686]   |
|[43.0,128354.0,9.0,0.0,0.0,30.0]    | <=50K|[128353.99933456784,3.895809826841343,-5.16363050899861]    |
|[37.0,60548.0,9.0,0.0,0.0,20.0]     | <=50K|[60547.99950268138,1.8343884998321716,-2.482228457083682]   |
|[40.0,85019.0,16.0,0.0,0.0,45.0]    | >50K |[85018.99937940769,2.5751267063738417,-3.492497873708586]   |
+------------------------------------+------+------------------------------------------------------------+
only showing top 20 rows
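The exercise fixes k=3, but the fitted PCAModel itself reports how much variance each component captures, which is one way to sanity-check that choice; a minimal sketch, assuming the fitted pca model from above:

# Fraction of variance explained by each of the 3 components.
# Because the features are unscaled, the large-magnitude fnlwgt column
# dominates the first component (visible in pcaFeatures above, whose first
# entry closely tracks fnlwgt).
>>> print(pca.explainedVariance)
# The projection matrix itself (6 input features x 3 components)
>>> print(pca.pc)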
3. Train a classification model and predict residents' income

Building on the principal components, use a logistic regression or decision tree model to predict whether a resident's income exceeds 50K, and validate the model on the test set.

[Reference answer]

Train a logistic regression model, test it, and obtain the prediction accuracy:

>>> labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(result)
>>> for label in labelIndexer.labels:
        print(label)
<=50K
>50K
>>> featureIndexer = VectorIndexer(inputCol="pcaFeatures", outputCol="indexedFeatures").fit(result)
>>> print(featureIndexer.numFeatures)
3
>>> labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=labelIndexer.labels)
>>> lr = LogisticRegression().setLabelCol("indexedLabel").setFeaturesCol("indexedFeatures").setMaxIter(100)
>>> lrPipeline = Pipeline().setStages([labelIndexer, featureIndexer, lr, labelConverter])
>>> lrPipelineModel = lrPipeline.fit(result)
>>> lrModel = lrPipelineModel.stages[2]
>>> print("Coefficients: \n " + str(lrModel.coefficientMatrix) + "\nIntercept: " + str(lrModel.interceptVector) + "\nnumClasses: " + str(lrModel.numClasses) + "\nnumFeatures: " + str(lrModel.numFeatures))
Coefficients:
 DenseMatrix([[-1.98284969e-07, -3.50909196e-04, -8.45149645e-04]])
Intercept: [-1.4525986478021733]
numClasses: 2
numFeatures: 3
>>> lrPredictions = lrPipelineModel.transform(testdata)
>>> evaluator = MulticlassClassificationEvaluator().setLabelCol("indexedLabel").setPredictionCol("prediction")
>>> lrAccuracy = evaluator.evaluate(lrPredictions)
>>> lrAccuracy
0.7764235163053484
>>> print("Test Error = %g " % (1.0 - lrAccuracy))
Test Error = 0.223576

Note that MulticlassClassificationEvaluator defaults to the f1 metric; call setMetricName("accuracy") if plain accuracy is wanted.
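The exercise also allows a decision tree in place of logistic regression. A sketch of that variant, reusing labelIndexer, featureIndexer, labelConverter, and evaluator from above; only the classifier stage changes, and the resulting accuracy will differ from the logistic-regression figure (no reference value is given here):

>>> from pyspark.ml.classification import DecisionTreeClassifier
# Same pipeline shape as the logistic-regression answer, classifier swapped
>>> dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures")
>>> dtPipeline = Pipeline().setStages([labelIndexer, featureIndexer, dt, labelConverter])
>>> dtPipelineModel = dtPipeline.fit(result)
>>> dtPredictions = dtPipelineModel.transform(testdata)
>>> dtAccuracy = evaluator.evaluate(dtPredictions)
>>> print("Decision tree test accuracy = %g" % dtAccuracy)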
4. Hyperparameter tuning

Use CrossValidator to determine the optimal parameters, including the best number of PCA dimensions and the classifier's own parameters.

[Reference answer]

>>> pca = PCA().setInputCol("features").setOutputCol("pcaFeatures")
>>> labelIndexer = StringIndexer().setInputCol("label").setOutputCol("indexedLabel").fit(df)
>>> featureIndexer = VectorIndexer().setInputCol("pcaFeatures").setOutputCol("indexedFeatures")
>>> labelConverter = IndexToString().setInputCol("prediction").setOutputCol("predictedLabel").setLabels(labelIndexer.labels)
>>> lr = LogisticRegression().setLabelCol("indexedLabel").setFeaturesCol("indexedFeatures").setMaxIter(100)
>>> lrPipeline = Pipeline().setStages([pca, labelIndexer, featureIndexer, lr, labelConverter])
>>> paramGrid = ParamGridBuilder().addGrid(pca.k, [1, 2, 3, 4, 5, 6]).addGrid(lr.elasticNetParam, [0.2, 0.8]).addGrid(lr.regParam, [0.01, 0.1, 0.5]).build()
>>> cv = CrossValidator().setEstimator(lrPipeline).setEvaluator(MulticlassClassificationEvaluator().setLabelCol("indexedLabel").setPredictionCol("prediction")).setEstimatorParamMaps(paramGrid).setNumFolds(3)
>>> cvModel = cv.fit(df)
>>> lrPredictions = cvModel.transform(test)
>>> evaluator = MulticlassClassificationEvaluator().setLabelCol("indexedLabel").setPredictionCol("prediction")
>>> lrAccuracy = evaluator.evaluate(lrPredictions)
>>> print("Accuracy: " + str(lrAccuracy))
Accuracy: 0.7831750221537989
>>> bestModel = cvModel.bestModel
>>> lrModel = bestModel.stages[3]
>>> print("Coefficients: \n " + str(lrModel.coefficientMatrix) + "\nIntercept: " + str(lrModel.interceptVector) + "\nnumClasses: " + str(lrModel.numClasses) + "\nnumFeatures: " + str(lrModel.numFeatures))
Coefficients:
 DenseMatrix([[-1.46929818e-07, -1.68830548e-04, -8.8415347...,
               4.93295331e-02, 3.11466080e-02, -2.82108785...]])
Intercept: [-7.4699641638803005]
numClasses: 2
numFeatures: 6
>>> pcaModel = bestModel.stages[0]
>>> print("Primary Component: " + str(pcaModel.pc))
Primary Component: DenseMatrix([[-9.90507714e-06, -1.43514070e-04, ... (6 total)],
                                [ 9.99999999e-01,  3.04337871e-05, ...],
                                [-1.05283840e-06, -4.27228452e-05, ...],
                                [ 3.03678811e-05, -9.99998483e-01, ...],
                                [-3.91389877e-05,  1.72989546e-03, ...],
                                [-2.19555372e-06, -1.31095844e-04, ...]])

As the result shows, the optimal number of PCA dimensions is 6.
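Reading the best k off the PCA stage works, but the winning parameter combination can also be recovered programmatically, since cvModel.avgMetrics lines up index-for-index with paramGrid; a minimal sketch, assuming cvModel, paramGrid, and lrModel from above:

# Index of the parameter combination with the best cross-validated metric
>>> best_idx = max(range(len(cvModel.avgMetrics)), key=lambda i: cvModel.avgMetrics[i])
>>> print(cvModel.avgMetrics[best_idx])
# Each grid entry is a dict mapping Param objects to the chosen values
>>> for param, value in paramGrid[best_idx].items():
        print(param.name, "=", value)
# Regularization settings of the winning LogisticRegression stage
>>> print(lrModel.getRegParam(), lrModel.getElasticNetParam())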