Training a LightGBM Model with Spark-Scala

Spark-Scala can work with LightGBM models: it supports distributed training as well as distributed prediction, along with the full range of LightGBM parameter settings.

It also supports model saving, and a saved model can be loaded from Python and other language bindings.

Note that when training a LightGBM model in Spark-Scala, the training dataset must be prepared as a DataFrame: use spark.ml.feature.VectorAssembler to combine the feature columns into a single features vector column, with label as a separate column.

1. Environment Configuration

To use the LightGBM model from Spark-Scala, configure the following dependencies in the pom file. (The LightGBMClassifier used below is provided by Microsoft's MMLSpark package, so its com.microsoft.ml.spark artifact must also be on the classpath; the jpmml artifacts are used for PMML export.)

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-mllib_${scala.version}</artifactId>
    <version>${spark.version}</version>
    <!-- spark-mllib must exclude the pmml-model dependency -->
    <exclusions>
        <exclusion>
            <groupId>org.jpmml</groupId>
            <artifactId>pmml-model</artifactId>
        </exclusion>
    </exclusions>
</dependency>

<dependency>
    <groupId>org.jpmml</groupId>
    <artifactId>jpmml-sparkml</artifactId>
    <version>1.3.4</version>
</dependency>
<dependency>
    <groupId>org.jpmml</groupId>
    <artifactId>jpmml-lightgbm</artifactId>
    <version>1.3.4</version>
</dependency>

2. Example Code

Below, we take a binary classification problem as an example and walk through the following familiar steps:

  • 1. Prepare data
  • 2. Define model
  • 3. Train model
  • 4. Evaluate model
  • 5. Use model
  • 6. Save model

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType, IntegerType}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.attribute.Attribute
import org.apache.spark.ml.feature.{IndexToString, StringIndexer}
import com.microsoft.ml.spark.{lightgbm=>lgb}
import com.google.gson.{JsonObject, JsonParser}
import scala.collection.JavaConverters._

object LgbDemo extends Serializable {
    
    def printlog(info:String): Unit ={
        val dt = new java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(new java.util.Date)
        println("=========="*8+dt)
        println(info+"\n")
    }
    
    def main(args:Array[String]):Unit= {


    /*================================================================================*/
    //  1. Load the data
    /*================================================================================*/
    printlog("step1: preparing data ...")

    //load the data
    val spark = SparkSession.builder().getOrCreate()
    val dfdata_raw = spark.read.option("header","true")
        .option("delimiter", "\t")
        .option("inferschema", "true")
        .option("nullValue","")
        .csv("data/breast_cancer.csv")

    dfdata_raw.printSchema

    //assemble the feature columns into a single features vector
    val feature_cols = dfdata_raw.columns.filter(_ != "label")
    val cate_cols = Array("mean_radius","mean_texture") //columns treated as categorical features


    val vectorAssembler = new VectorAssembler().
      setInputCols(feature_cols).
      setOutputCol("features")

    val dfdata = vectorAssembler.transform(dfdata_raw).select("features", "label")
    val Array(dftrain, dfval) = dfdata.randomSplit(Array(0.7, 0.3), seed = 666)

    //feature names are stored in the metadata of the features column's schema,
    //so categorical features can be specified by name
    println(dfdata.schema("features").metadata)
    dfdata.show(10) 

    /*================================================================================*/
    //  2. Define the model
    /*================================================================================*/
    printlog("step2: defining model ...")

    val lgbclassifier = new lgb.LightGBMClassifier()
      .setNumIterations(100)
      .setLearningRate(0.1)
      .setNumLeaves(31)
      .setMinSumHessianInLeaf(0.001)
      .setMaxDepth(-1)
      .setBoostFromAverage(false)
      .setFeatureFraction(1.0)
      .setMaxBin(255)
      .setLambdaL1(0.0)
      .setLambdaL2(0.0)
      .setBaggingFraction(1.0)
      .setBaggingFreq(0)
      .setBaggingSeed(1)
      .setBoostingType("gbdt") //alternatives: rf, dart, goss
      .setCategoricalSlotNames(cate_cols)
      .setObjective("binary") //binary, multiclass
      .setFeaturesCol("features") 
      .setLabelCol("label")

    println(lgbclassifier.explainParams) 


    /*================================================================================*/
    //  3. Train the model
    /*================================================================================*/
    printlog("step3: training model ...")

    val lgbmodel = lgbclassifier.fit(dftrain)

    val feature_importances = lgbmodel.getFeatureImportances("gain")
    val arr = feature_cols.zip(feature_importances).sortBy[Double](t=> -t._2)
    val dfimportance = spark.createDataFrame(arr).toDF("feature_name","feature_importance(gain)")

    dfimportance.show(100)


    /*================================================================================*/
    //  4. Evaluate the model
    /*================================================================================*/
    printlog("step4: evaluating model ...")

    val evaluator = new BinaryClassificationEvaluator()
      .setLabelCol("label")
      .setRawPredictionCol("rawPrediction")
      .setMetricName("areaUnderROC")

    val dftrain_result = lgbmodel.transform(dftrain)
    val dfval_result = lgbmodel.transform(dfval)

    val train_auc  = evaluator.evaluate(dftrain_result)
    val val_auc = evaluator.evaluate(dfval_result)
    println(s"train_auc = ${train_auc}")
    println(s"val_auc = ${val_auc}")


    /*================================================================================*/
    //  5. Use the model
    /*================================================================================*/
    printlog("step5: using model ...")

    //batch prediction
    val dfpredict = lgbmodel.transform(dfval)
    dfpredict.sample(false,0.1,1).show(20)

    //predict a single sample
    val features = dfval.head().getAs[Vector]("features")
    val single_result = lgbmodel.predict(features)

    println(single_result)


    /*================================================================================*/
    //  6. Save the model
    /*================================================================================*/
    printlog("step6: saving model ...")

    //save to distributed storage, as multiple files
    lgbmodel.write.overwrite().save("lgbmodel.model")
    //load the saved model
    println("load model ...")
    val lgbmodel_loaded = lgb.LightGBMClassificationModel.load("lgbmodel.model")
    val dfresult = lgbmodel_loaded.transform(dfval)
    dfresult.show() 

    //save locally as a single native-format file, compatible with the Python API
    //lgbmodel.saveNativeModel("lgb_model",true)
    //load the native model
    //val lgbmodel_native = lgb.LightGBMClassificationModel.loadNativeModelFromFile("lgb_model")
    
    }
    
}


3. Reference Output

Running the code above produces the following output.

Note that println(lgbclassifier.explainParams) prints the meaning and default value of each LightGBM parameter.

================================================================================2021-07-17 22:16:29
step1: preparing data ...

root
 |-- mean_radius: integer (nullable = true)
 |-- mean_texture: integer (nullable = true)
 |-- mean_perimeter: double (nullable = true)
 |-- mean_area: double (nullable = true)
 |-- mean_smoothness: double (nullable = true)
 |-- mean_compactness: double (nullable = true)
 |-- mean_concavity: double (nullable = true)
 |-- mean_concave_points: double (nullable = true)
 |-- mean_symmetry: double (nullable = true)
 |-- mean_fractal_dimension: double (nullable = true)
 |-- radius_error: double (nullable = true)
 |-- texture_error: double (nullable = true)
 |-- perimeter_error: double (nullable = true)
 |-- area_error: double (nullable = true)
 |-- smoothness_error: double (nullable = true)
 |-- compactness_error: double (nullable = true)
 |-- concavity_error: double (nullable = true)
 |-- concave_points_error: double (nullable = true)
 |-- symmetry_error: double (nullable = true)
 |-- fractal_dimension_error: double (nullable = true)
 |-- worst_radius: double (nullable = true)
 |-- worst_texture: double (nullable = true)
 |-- worst_perimeter: double (nullable = true)
 |-- worst_area: double (nullable = true)
 |-- worst_smoothness: double (nullable = true)
 |-- worst_compactness: double (nullable = true)
 |-- worst_concavity: double (nullable = true)
 |-- worst_concave_points: double (nullable = true)
 |-- worst_symmetry: double (nullable = true)
 |-- worst_fractal_dimension: double (nullable = true)
 |-- label: integer (nullable = true)

{"ml_attr":{"attrs":{"numeric":[{"idx":0,"name":"mean_radius"},{"idx":1,"name":"mean_texture"},{"idx":2,"name":"mean_perimeter"},{"idx":3,"name":"mean_area"},{"idx":4,"name":"mean_smoothness"},{"idx":5,"name":"mean_compactness"},{"idx":6,"name":"mean_concavity"},{"idx":7,"name":"mean_concave_points"},{"idx":8,"name":"mean_symmetry"},{"idx":9,"name":"mean_fractal_dimension"},{"idx":10,"name":"radius_error"},{"idx":11,"name":"texture_error"},{"idx":12,"name":"perimeter_error"},{"idx":13,"name":"area_error"},{"idx":14,"name":"smoothness_error"},{"idx":15,"name":"compactness_error"},{"idx":16,"name":"concavity_error"},{"idx":17,"name":"concave_points_error"},{"idx":18,"name":"symmetry_error"},{"idx":19,"name":"fractal_dimension_error"},{"idx":20,"name":"worst_radius"},{"idx":21,"name":"worst_texture"},{"idx":22,"name":"worst_perimeter"},{"idx":23,"name":"worst_area"},{"idx":24,"name":"worst_smoothness"},{"idx":25,"name":"worst_compactness"},{"idx":26,"name":"worst_concavity"},{"idx":27,"name":"worst_concave_points"},{"idx":28,"name":"worst_symmetry"},{"idx":29,"name":"worst_fractal_dimension"}]},"num_attrs":30}}
+--------------------+-----+
|            features|label|
+--------------------+-----+
|[17.0,10.0,122.8,...|    0|
|[20.0,17.0,132.9,...|    0|
|[19.0,21.0,130.0,...|    0|
|[11.0,20.0,77.58,...|    0|
|[20.0,14.0,135.1,...|    0|
|[12.0,15.0,82.57,...|    0|
|[18.0,19.0,119.6,...|    0|
|[13.0,20.0,90.2,5...|    0|
|[13.0,21.0,87.5,5...|    0|
|[12.0,24.0,83.97,...|    0|
+--------------------+-----+
only showing top 10 rows

================================================================================2021-07-17 22:16:29
step2: defining model ...

baggingFraction: Bagging fraction (default: 1.0, current: 1.0)
baggingFreq: Bagging frequency (default: 0, current: 0)
baggingSeed: Bagging seed (default: 3, current: 1)
boostFromAverage: Adjusts initial score to the mean of labels for faster convergence (default: true, current: false)
boostingType: Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling).  (default: gbdt, current: gbdt)
categoricalSlotIndexes: List of categorical column indexes, the slot index in the features column (undefined)
categoricalSlotNames: List of categorical column slot names, the slot name in the features column (current: [Ljava.lang.String;@351fb3fc)
defaultListenPort: The default listen port on executors, used for testing (default: 12400)
earlyStoppingRound: Early stopping round (default: 0)
featureFraction: Feature fraction (default: 1.0, current: 1.0)
featuresCol: features column name (default: features, current: features)
initScoreCol: The name of the initial score column, used for continued training (undefined)
isProvideTrainingMetric: Whether output metric result over training dataset. (default: false)
isUnbalance: Set to true if training data is unbalanced in binary classification scenario (default: false)
labelCol: label column name (default: label, current: label)
lambdaL1: L1 regularization (default: 0.0, current: 0.0)
lambdaL2: L2 regularization (default: 0.0, current: 0.0)
learningRate: Learning rate or shrinkage rate (default: 0.1, current: 0.1)
maxBin: Max bin (default: 255, current: 255)
maxDepth: Max depth (default: -1, current: -1)
minSumHessianInLeaf: Minimal sum hessian in one leaf (default: 0.001, current: 0.001)
modelString: LightGBM model to retrain (default: )
numBatches: If greater than 0, splits data into separate batches during training (default: 0)
numIterations: Number of iterations, LightGBM constructs num_class * num_iterations trees (default: 100, current: 100)
numLeaves: Number of leaves (default: 31, current: 31)
objective: The Objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova.  (default: binary, current: binary)
parallelism: Tree learner parallelism, can be set to data_parallel or voting_parallel (default: data_parallel)
predictionCol: prediction column name (default: prediction)
probabilityCol: Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities (default: probability)
rawPredictionCol: raw prediction (a.k.a. confidence) column name (default: rawPrediction)
thresholds: Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0 excepting that at most one value may be 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold (undefined)
timeout: Timeout in seconds (default: 1200.0)
useBarrierExecutionMode: Use new barrier execution mode in Beta testing, off by default. (default: false)
validationIndicatorCol: Indicates whether the row is for training or validation (undefined)
verbosity: Verbosity where lt 0 is Fatal, eq 0 is Error, eq 1 is Info, gt 1 is Debug (default: 1)
weightCol: The name of the weight column (undefined)
================================================================================2021-07-17 22:16:29
step3: training model ...

+--------------------+------------------------+
|        feature_name|feature_importance(gain)|
+--------------------+------------------------+
|          worst_area|       974.9349449056517|
|     worst_perimeter|       885.3691593843923|
|worst_concave_points|      255.67364284247745|
| mean_concave_points|      250.21955942230738|
|       worst_texture|      151.07745621304454|
|          area_error|       65.75557372416814|
|    worst_smoothness|       62.29973236144293|
|     mean_smoothness|      19.902610011957194|
|        worst_radius|        16.8275272153341|
|           mean_area|       12.41261211467938|
|      mean_perimeter|      12.127510878875537|
|     worst_concavity|      11.414242858900646|
|   compactness_error|      10.996194651604892|
|        mean_texture|       9.274276675339683|
|     concavity_error|       8.009578698471008|
|      symmetry_error|        7.93458393366217|
|        radius_error|       7.357747321194173|
|      worst_symmetry|       5.951699663755868|
|fractal_dimension...|       4.811246624133022|
|concave_points_error|        4.73140145466917|
|   worst_compactness|       4.469820723182832|
|       texture_error|       4.356178728700959|
|    mean_compactness|       3.123736411467967|
|       mean_symmetry|      1.9968633063354835|
|      mean_concavity|      1.9701941942285224|
|    smoothness_error|       1.673042485476758|
|worst_fractal_dim...|      1.3582115541525612|
|mean_fractal_dime...|      0.6050912755332459|
|     perimeter_error|      0.3889888676278275|
|         mean_radius|    5.684356116234315...|
+--------------------+------------------------+

================================================================================2021-07-17 22:16:30
step4: evaluating model ...

train_auc = 1.0
val_auc = 0.9890340267698758
================================================================================2021-07-17 22:16:31
step5: using model ...

+--------------------+-----+--------------------+--------------------+----------+
|            features|label|       rawPrediction|         probability|prediction|
+--------------------+-----+--------------------+--------------------+----------+
|[9.0,12.0,60.34,2...|    1|[-10.570726382467...|[-9.5707263824679...|       1.0|
|[10.0,16.0,65.85,...|    1|[-10.120435089856...|[-9.1204350898567...|       1.0|
|[10.0,21.0,68.51,...|    1|[-8.8020346337692...|[-7.8020346337692...|       1.0|
|[11.0,14.0,73.53,...|    1|[-10.315758226759...|[-9.3157582267596...|       1.0|
|[11.0,15.0,73.38,...|    1|[-10.086077130817...|[-9.0860771308174...|       1.0|
|[11.0,16.0,74.72,...|    1|[-6.9649803118554...|[-5.9649803118554...|       1.0|
|[11.0,17.0,71.25,...|    1|[-10.694667171248...|[-9.6946671712481...|       1.0|
|[11.0,17.0,75.27,...|    1|[-9.0156792680894...|[-8.0156792680894...|       1.0|
|[11.0,18.0,75.17,...|    1|[-5.7513546284621...|[-4.7513546284621...|       1.0|
|[11.0,18.0,76.38,...|    1|[-4.3134421808792...|[-3.3134421808792...|       1.0|
|[12.0,15.0,82.57,...|    0|[2.49310942805160...|[3.49310942805160...|       0.0|
|[12.0,17.0,78.27,...|    1|[-10.516042459712...|[-9.5160424597122...|       1.0|
|[12.0,18.0,83.19,...|    1|[-9.4899850168431...|[-8.4899850168431...|       1.0|
|[12.0,22.0,78.75,...|    1|[-8.9917629958319...|[-7.9917629958319...|       1.0|
|[14.0,15.0,92.68,...|    1|[-7.2724968676775...|[-6.2724968676775...|       1.0|
|[14.0,15.0,95.77,...|    1|[-5.0143190624015...|[-4.0143190624015...|       1.0|
|[14.0,16.0,96.22,...|    1|[-5.3849620427583...|[-4.3849620427583...|       1.0|
|[14.0,19.0,97.83,...|    1|[-3.3292007261919...|[-2.3292007261919...|       1.0|
|[16.0,14.0,104.3,...|    1|[4.66077729134426...|[5.66077729134426...|       0.0|
|[19.0,24.0,122.0,...|    0|[10.1503565558166...|[11.1503565558166...|       0.0|
+--------------------+-----+--------------------+--------------------+----------+

1.0
================================================================================2021-07-17 22:16:31
step6: saving model ...

load model ...



ars\spark-graphx_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-hive-thriftserver_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-hive_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-kubernetes_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-kvstore_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-launcher_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-mesos_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-mllib-local_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-mllib_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-network-common_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-network-shuffle_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-repl_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-sketch_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-sql_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-streaming_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-tags_2.12-3.1.1-tests.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-tags_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-unsafe_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-yarn_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spire-macros_2.12-0.17.0-M1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spire-platform_2.12-0.17.0-M1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spire-util_2.12-0.17.0-M1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spire_2.12-0.17.0-M1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\ST4-4.0.4.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\stax-api-1.0.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\stax2-api-3.1.4.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\stream-2.9.6.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\super-csv-2.2.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\threeten-extra-1.5.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\token-provider-1.0.1.jar;D:\spark\
spark-3.1.1-bin-hadoop3.2\jars\transaction-api-1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\univocity-parsers-2.9.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\velocity-1.5.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\woodstox-core-5.0.3.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\xbean-asm7-shaded-4.15.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\xz-1.5.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\zjsonpatch-0.3.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\zookeeper-3.4.14.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\zstd-jni-1.4.8-1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\arrow-vector-2.0.0.jar" car.LoadModelRideHailing Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 25/06/08 17:05:07 INFO SparkContext: Running Spark version 3.1.1 25/06/08 17:05:07 INFO ResourceUtils: ============================================================== 25/06/08 17:05:07 INFO ResourceUtils: No custom resources configured for spark.driver. 25/06/08 17:05:07 INFO ResourceUtils: ============================================================== 25/06/08 17:05:07 INFO SparkContext: Submitted application: LoadModelRideHailing 25/06/08 17:05:07 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0) 25/06/08 17:05:07 INFO ResourceProfile: Limiting resource is cpu 25/06/08 17:05:07 INFO ResourceProfileManager: Added ResourceProfile id: 0 25/06/08 17:05:07 INFO SecurityManager: Changing view acls to: wyatt 25/06/08 17:05:07 INFO SecurityManager: Changing modify acls to: wyatt 25/06/08 17:05:07 INFO SecurityManager: Changing view acls groups to: 25/06/08 17:05:07 INFO SecurityManager: Changing modify acls groups to: 25/06/08 17:05:07 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users 
with view permissions: Set(wyatt); groups with view permissions: Set(); users with modify permissions: Set(wyatt); groups with modify permissions: Set() 25/06/08 17:05:07 INFO Utils: Successfully started service 'sparkDriver' on port 59361. 25/06/08 17:05:07 INFO SparkEnv: Registering MapOutputTracker 25/06/08 17:05:07 INFO SparkEnv: Registering BlockManagerMaster 25/06/08 17:05:08 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information 25/06/08 17:05:08 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up 25/06/08 17:05:08 INFO SparkEnv: Registering BlockManagerMasterHeartbeat 25/06/08 17:05:08 INFO DiskBlockManager: Created local directory at C:\Users\wyatt\AppData\Local\Temp\blockmgr-8fe065e2-024c-4e2f-8662-45d2fe3de444 25/06/08 17:05:08 INFO MemoryStore: MemoryStore started with capacity 1899.0 MiB 25/06/08 17:05:08 INFO SparkEnv: Registering OutputCommitCoordinator 25/06/08 17:05:08 INFO Utils: Successfully started service 'SparkUI' on port 4040. 25/06/08 17:05:08 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://windows10.microdone.cn:4040 25/06/08 17:05:08 INFO Executor: Starting executor ID driver on host windows10.microdone.cn 25/06/08 17:05:08 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 59392. 
25/06/08 17:05:08 INFO NettyBlockTransferService: Server created on windows10.microdone.cn:59392 25/06/08 17:05:08 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy 25/06/08 17:05:08 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, windows10.microdone.cn, 59392, None) 25/06/08 17:05:08 INFO BlockManagerMasterEndpoint: Registering block manager windows10.microdone.cn:59392 with 1899.0 MiB RAM, BlockManagerId(driver, windows10.microdone.cn, 59392, None) 25/06/08 17:05:08 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, windows10.microdone.cn, 59392, None) 25/06/08 17:05:08 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, windows10.microdone.cn, 59392, None) Exception in thread "main" java.lang.IllegalArgumentException: 测试数据中不包含 features 列,请检查数据! at car.LoadModelRideHailing$.main(LoadModelRideHailing.scala:23) at car.LoadModelRideHailing.main(LoadModelRideHailing.scala) 进程已结束,退出代码为 1 package car import org.apache.spark.ml.classification.{LogisticRegressionModel, RandomForestClassificationModel} import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator import org.apache.spark.sql.{SparkSession, functions => F} object LoadModelRideHailing { def main(args: Array[String]): Unit = { val spark = SparkSession.builder() .master("local[3]") .appName("LoadModelRideHailing") .getOrCreate() spark.sparkContext.setLogLevel("Error") // 使用经过特征工程处理后的测试数据 val TestData = spark.read.option("header", "true").csv("C:\\Users\\wyatt\\Documents\\ride_hailing_test_data.csv") // 将 label 列转换为数值类型 val testDataWithNumericLabel = TestData.withColumn("label", F.col("label").cast("double")) // 检查 features 列是否存在 if (!testDataWithNumericLabel.columns.contains("features")) { throw new IllegalArgumentException("测试数据中不包含 features 列,请检查数据!") } // 修正后的模型路径(确保文件夹存在且包含元数据) val LogisticModel = 
LogisticRegressionModel.load("C:\\Users\\wyatt\\Documents\\ride_hailing_logistic_model") // 示例路径 val LogisticPre = LogisticModel.transform(testDataWithNumericLabel) val LogisticAcc = new MulticlassClassificationEvaluator() .setLabelCol("label") .setPredictionCol("prediction") .setMetricName("accuracy") .evaluate(LogisticPre) println("逻辑回归模型后期数据准确率:" + LogisticAcc) // 随机森林模型路径同步修正 val RandomForest = RandomForestClassificationModel.load("C:\\Users\\wyatt\\Documents\\ride_hailing_random_forest_model") // 示例路径 val RandomForestPre = RandomForest.transform(testDataWithNumericLabel) val RandomForestAcc = new MulticlassClassificationEvaluator() .setLabelCol("label") .setPredictionCol("prediction") .setMetricName("accuracy") .evaluate(RandomForestPre) println("随机森林模型后期数据准确率:" + RandomForestAcc) spark.stop() } }
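One direct fix for code like the above is to build the missing `features` column with `VectorAssembler` before calling `transform`. The sketch below assumes hypothetical raw columns `distance`, `duration`, and `price` (the actual column names depend on what the models were trained on):

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.{DataFrame, functions => F}

// Hypothetical raw columns; replace with the columns the models were trained on
val rawCols = Array("distance", "duration", "price")

def addFeatures(df: DataFrame): DataFrame = {
  // CSV columns come in as strings, so cast them to double before assembling
  val numeric = rawCols.foldLeft(df)((d, c) => d.withColumn(c, F.col(c).cast("double")))
  new VectorAssembler()
    .setInputCols(rawCols)
    .setOutputCol("features") // the column name the saved models expect
    .transform(numeric)
}
```

With this helper, `addFeatures(testDataWithNumericLabel)` yields a DataFrame the saved models can score directly, and the existence check on `features` passes.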
**A:** For the Spark ML error about test data missing the `features` column, here are the diagnosis steps and fixes:

### Why the error occurs

In a Spark ML pipeline, the `features` column is normally produced by a feature-preprocessing stage such as `VectorAssembler`[^1]. If the test data is passed straight to `model.transform(testData)` without going through the full pipeline preprocessing, the column is missing. The main causes:

1. **Not using the full pipeline model**: the test data was not transformed with the same `PipelineModel` used at training time
2. **Missing feature engineering**: the test data never went through the `VectorAssembler` (or similar) preprocessing step
3. **Inconsistent column names**: the feature columns are named differently in the training and test data

### Solution (Scala)

```scala
// 1. Predict with the complete PipelineModel, not a single model
val model: PipelineModel = cv.fit(trainingData) // full pipeline model

// 2. Verify the test data contains the raw feature columns (not "features")
require(testData.columns.contains("col1") && testData.columns.contains("col2"), "missing feature columns")

// 3. Transform the test data through the pipeline model
val predictions = model.transform(testData) // generates the features column automatically

// 4. Explicitly check that the features column exists (for debugging)
if (predictions.columns.contains("features")) {
  println("Feature vector generated successfully")
}
```

### Key fixes

1. **Always use the PipelineModel**
   The test data must be transformed with the complete pipeline model (of type `PipelineModel`) obtained from training, rather than by calling a standalone model such as LightGBM or XGBoost directly.

2. **Validate features up front**
   Before `transform()`, check that the test data contains all of the **raw input columns** expected by `VectorAssembler`:

   ```scala
   val requiredCols = Array("age", "income", "product_view") // replace with your actual column names
   requiredCols.foreach(col =>
     assert(testData.columns.contains(col), s"Test data is missing column: $col")
   )
   ```

3. **Build the pipeline correctly**
   Make sure the training pipeline includes the feature-engineering stage:

   ```scala
   val assembler = new VectorAssembler()
     .setInputCols(Array("col1", "col2"))
     .setOutputCol("features") // ← must be named features

   val pipeline = new Pipeline()
     .setStages(Array(assembler, lightgbm)) // feature engineering comes first
   ```

### Troubleshooting flow

```mermaid
graph TD
A[Test data throws an error] --> B{Does the features column exist?}
B -- yes --> C[Check the column's data type]
B -- no --> D{Verify the raw feature columns}
D -- columns missing --> E[Fix the data source]
D -- columns present --> F[Check the model's type]
F -- not a PipelineModel --> G[Replace model.fit with cv.fit]
F -- is a PipelineModel --> H[Check the VectorAssembler settings]
```

### Notes

1. `PipelineModel.transform()` **runs the feature engineering automatically**; there is no need to create the `features` column by hand
2. The **feature order must be exactly the same** in the training and test data
3. Compare the two structures with `.schema`:

   ```scala
   trainingData.schema.foreach(println)
   testData.schema.foreach(println)
   ```

> **Reference note**: Spark ML requires the test data to have the same feature structure as the training data; the pipeline model generates the features automatically from its metadata[^1][^2].
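The "always use the PipelineModel" fix above also applies to persistence: save and reload the whole pipeline, not just the classifier, so that `transform` replays the feature engineering on raw test data. A minimal end-to-end sketch with toy data (the column names and save path are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.classification.LogisticRegression

val spark = SparkSession.builder().master("local[1]").appName("PipelinePersistDemo").getOrCreate()
import spark.implicits._

// Toy training data with raw feature columns (no pre-built features column)
val trainingData = Seq((1.0, 0.5, 1.0), (2.0, 1.5, 0.0), (3.0, 0.1, 1.0), (4.0, 2.0, 0.0))
  .toDF("col1", "col2", "label")

val assembler = new VectorAssembler()
  .setInputCols(Array("col1", "col2"))
  .setOutputCol("features")
val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")

// Fit and persist the whole pipeline, not just the fitted classifier
val pipelineModel = new Pipeline().setStages(Array(assembler, lr)).fit(trainingData)
pipelineModel.write.overwrite().save("/tmp/demo_pipeline_model") // illustrative path

// Reloading replays the feature engineering: raw test data is enough
val reloaded = PipelineModel.load("/tmp/demo_pipeline_model")
val rawTest = Seq((1.5, 0.7), (3.5, 1.8)).toDF("col1", "col2")
val predictions = reloaded.transform(rawTest)
assert(predictions.columns.contains("features")) // created by the assembler stage

spark.stop()
```

Because the assembler stage is saved inside the `PipelineModel`, the reloaded model accepts raw columns directly and never triggers the missing-`features` error.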