Spark-Pipeline

学习笔记。若涉及侵权,请告知删除。

Intro

构建在DataFrame之上。Mllib提供标准的机器学习算法API,能够方便的将不同的算法组合成一个独立的管道 Pipeline or workflow.

  • DataFrame: from Spark SQL as an ML dataset, which can hold a variety of data types, e.g. different columns storing text, feature vectors, true labels and predictions.
  • Estimator: is an algorithm which can be fit on a DataFrame to produce a Transformer. A learning algorithm is an Estimator which trains on a DataFrame and produces a model.
  • Transformer: is an algorithm which can transform one DataFrame into another DataFrame, e.g. an ML model is a Transformer which transforms a DataFrame with features into a DataFrame with prediction.
  • Pipeline: a pipeline chains multiple Transformers and Estimators together to specify an ML workflow.
  • Parameter: All Transformer and Estimators share a common API for specifying parameters.

Pipeline

A Pipeline is specified as a sequences of stages, and each stage is either a Transformer or an Estimator.
The input DataFrame is transformed as it passes through each stage.

For Transformer stages, the transform() method is called on the DataFrame.

For Estimator stages, the fit() method is called to produce a Transformer (which becomes a part of the PipelineModel, or fitted Pipeline)

Fit

fit

the figure is for the training time usage of a Pipeline.

The top row, three stages, the first two are Transformers and the third one is an Estimator.

The bottom row represents data flowing through the pipeline, where

"C:\Program Files\Java\jdk1.8.0_281\bin\java.exe" "-javaagent:D:\新建文件夹 (2)\IDEA\idea\IntelliJ IDEA 2019.3.3\lib\idea_rt.jar=59342" -Dfile.encoding=UTF-8 -classpath "C:\Program Files\Java\jdk1.8.0_281\jre\lib\charsets.jar;C:\Program Files\Java\jdk1.8.0_281\jre\lib\deploy.jar;C:\Program Files\Java\jdk1.8.0_281\jre\lib\ext\access-bridge-64.jar;C:\Program Files\Java\jdk1.8.0_281\jre\lib\ext\cldrdata.jar;C:\Program Files\Java\jdk1.8.0_281\jre\lib\ext\dnsns.jar;C:\Program Files\Java\jdk1.8.0_281\jre\lib\ext\jaccess.jar;C:\Program Files\Java\jdk1.8.0_281\jre\lib\ext\jfxrt.jar;C:\Program Files\Java\jdk1.8.0_281\jre\lib\ext\localedata.jar;C:\Program Files\Java\jdk1.8.0_281\jre\lib\ext\nashorn.jar;C:\Program Files\Java\jdk1.8.0_281\jre\lib\ext\sunec.jar;C:\Program Files\Java\jdk1.8.0_281\jre\lib\ext\sunjce_provider.jar;C:\Program Files\Java\jdk1.8.0_281\jre\lib\ext\sunmscapi.jar;C:\Program Files\Java\jdk1.8.0_281\jre\lib\ext\sunpkcs11.jar;C:\Program Files\Java\jdk1.8.0_281\jre\lib\ext\zipfs.jar;C:\Program Files\Java\jdk1.8.0_281\jre\lib\javaws.jar;C:\Program Files\Java\jdk1.8.0_281\jre\lib\jce.jar;C:\Program Files\Java\jdk1.8.0_281\jre\lib\jfr.jar;C:\Program Files\Java\jdk1.8.0_281\jre\lib\jfxswt.jar;C:\Program Files\Java\jdk1.8.0_281\jre\lib\jsse.jar;C:\Program Files\Java\jdk1.8.0_281\jre\lib\management-agent.jar;C:\Program Files\Java\jdk1.8.0_281\jre\lib\plugin.jar;C:\Program Files\Java\jdk1.8.0_281\jre\lib\resources.jar;C:\Program Files\Java\jdk1.8.0_281\jre\lib\rt.jar;D:\carspark\out\production\carspark;C:\Users\wyatt\.ivy2\cache\org.scala-lang\scala-library\jars\scala-library-2.12.10.jar;C:\Users\wyatt\.ivy2\cache\org.scala-lang\scala-reflect\jars\scala-reflect-2.12.10.jar;C:\Users\wyatt\.ivy2\cache\org.scala-lang\scala-library\srcs\scala-library-2.12.10-sources.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\accessors-smart-1.2.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\activation-1.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\aircompressor-0.10.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\algebra_2.12-2.0.0-M2.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\antlr-runtime-3.5.2.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\antlr4-runtime-4.8-1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\aopalliance-1.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\aopalliance-repackaged-2.6.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\arpack_combined_all-0.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\arrow-format-2.0.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\arrow-memory-core-2.0.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\arrow-memory-netty-2.0.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\audience-annotations-0.5.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\automaton-1.11-8.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\avro-1.8.2.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\avro-ipc-1.8.2.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\avro-mapred-1.8.2-hadoop2.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\bonecp-0.8.0.RELEASE.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\breeze-macros_2.12-1.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\breeze_2.12-1.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\cats-kernel_2.12-2.0.0-M4.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\chill-java-0.9.5.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\chill_2.12-0.9.5.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\commons-beanutils-1.9.4.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\commons-cli-1.2.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\commons-codec-1.10.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\commons-collections-3.2.2.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\commons-compiler-3.0.16.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\commons-compress-1.20.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\commons-configuration2-2.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\commons-crypto-1.1.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\commons-daemon-1.0.13.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\commons-dbcp-1.4.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\commons-httpclient-3.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\commons-io-2.5.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\commons-lang-2.6.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\commons-lang3-3.10.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\commons-logging-1.1.3.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\commons-math3-3.4.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\commons-net-3.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\commons-pool-1.5.4.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\commons-text-1.6.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\compress-lzf-1.0.3.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\core-1.1.2.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\curator-client-2.13.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\curator-framework-2.13.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\curator-recipes-2.13.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\datanucleus-api-jdo-4.2.4.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\datanucleus-core-4.1.17.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\datanucleus-rdbms-4.1.19.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\derby-10.12.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\dnsjava-2.1.7.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\dropwizard-metrics-hadoop-metrics2-reporter-0.1.2.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\ehcache-3.3.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\flatbuffers-java-1.9.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\generex-1.0.2.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\geronimo-jcache_1.0_spec-1.0-alpha-1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\gson-2.2.4.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\guava-14.0.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\guice-4.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\guice-servlet-4.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hadoop-annotations-3.2.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hadoop-auth-3.2.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hadoop-common-3.2.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hadoop-hdfs-client-3.2.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hadoop-mapreduce-client-common-3.2.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hadoop-mapreduce-client-core-3.2.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hadoop-mapreduce-client-jobclient-3.2.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hadoop-yarn-api-3.2.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hadoop-yarn-client-3.2.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hadoop-yarn-common-3.2.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hadoop-yarn-registry-3.2.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hadoop-yarn-server-common-3.2.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hadoop-yarn-server-web-proxy-3.2.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\HikariCP-2.5.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hive-beeline-2.3.7.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hive-cli-2.3.7.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hive-common-2.3.7.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hive-exec-2.3.7-core.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hive-jdbc-2.3.7.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hive-llap-common-2.3.7.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hive-metastore-2.3.7.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hive-serde-2.3.7.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hive-service-rpc-3.1.2.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hive-shims-0.23-2.3.7.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hive-shims-common-2.3.7.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hive-shims-scheduler-2.3.7.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hive-storage-api-2.7.2.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hive-vector-code-gen-2.3.7.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hk2-api-2.6.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hk2-locator-2.6.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hk2-utils-2.6.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\htrace-core4-4.1.0-incubating.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\httpclient-4.5.6.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\httpcore-4.4.12.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\istack-commons-runtime-3.0.8.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\ivy-2.4.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jackson-annotations-2.10.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jackson-core-2.10.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jackson-core-asl-1.9.13.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jackson-databind-2.10.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jackson-dataformat-yaml-2.10.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jackson-datatype-jsr310-2.11.2.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jackson-jaxrs-base-2.9.5.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jackson-jaxrs-json-provider-2.9.5.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jackson-mapper-asl-1.9.13.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jackson-module-jaxb-annotations-2.10.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jackson-module-paranamer-2.10.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jackson-module-scala_2.12-2.10.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jakarta.activation-api-1.2.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jakarta.annotation-api-1.3.5.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jakarta.inject-2.6.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jakarta.servlet-api-4.0.3.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jakarta.validation-api-2.0.2.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jakarta.ws.rs-api-2.1.6.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jakarta.xml.bind-api-2.3.2.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\janino-3.0.16.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\javassist-3.25.0-GA.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\javax.inject-1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\javax.jdo-3.2.0-m3.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\javolution-5.5.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jaxb-api-2.2.11.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jaxb-runtime-2.3.2.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jcip-annotations-1.0-1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jcl-over-slf4j-1.7.30.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jdo-api-3.0.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jersey-client-2.30.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jersey-common-2.30.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jersey-container-servlet-2.30.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jersey-container-servlet-core-2.30.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jersey-hk2-2.30.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jersey-media-jaxb-2.30.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jersey-server-2.30.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\JLargeArrays-1.5.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jline-2.14.6.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\joda-time-2.10.5.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jodd-core-3.5.2.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jpam-1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\json-1.8.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\json-smart-2.3.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\json4s-ast_2.12-3.7.0-M5.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\json4s-core_2.12-3.7.0-M5.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\json4s-jackson_2.12-3.7.0-M5.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\json4s-scalap_2.12-3.7.0-M5.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jsp-api-2.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jsr305-3.0.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jta-1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\JTransforms-3.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jul-to-slf4j-1.7.30.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kerb-admin-1.0.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kerb-client-1.0.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kerb-common-1.0.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kerb-core-1.0.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kerb-crypto-1.0.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kerb-identity-1.0.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kerb-server-1.0.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kerb-simplekdc-1.0.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kerb-util-1.0.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kerby-asn1-1.0.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kerby-config-1.0.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kerby-pkix-1.0.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kerby-util-1.0.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kerby-xdr-1.0.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kryo-shaded-4.0.2.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kubernetes-client-4.12.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kubernetes-model-admissionregistration-4.12.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kubernetes-model-apiextensions-4.12.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kubernetes-model-apps-4.12.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kubernetes-model-autoscaling-4.12.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kubernetes-model-batch-4.12.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kubernetes-model-certificates-4.12.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kubernetes-model-common-4.12.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kubernetes-model-coordination-4.12.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kubernetes-model-core-4.12.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kubernetes-model-discovery-4.12.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kubernetes-model-events-4.12.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kubernetes-model-extensions-4.12.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kubernetes-model-metrics-4.12.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kubernetes-model-networking-4.12.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kubernetes-model-policy-4.12.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kubernetes-model-rbac-4.12.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kubernetes-model-scheduling-4.12.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kubernetes-model-settings-4.12.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kubernetes-model-storageclass-4.12.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\leveldbjni-all-1.8.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\libfb303-0.9.3.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\libthrift-0.12.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\log4j-1.2.17.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\logging-interceptor-3.12.12.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\lz4-java-1.7.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\machinist_2.12-0.6.8.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\macro-compat_2.12-1.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\mesos-1.4.0-shaded-protobuf.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\metrics-core-4.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\metrics-graphite-4.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\metrics-jmx-4.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\metrics-json-4.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\metrics-jvm-4.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\minlog-1.3.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\netty-all-4.1.51.Final.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\nimbus-jose-jwt-4.41.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\objenesis-2.6.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\okhttp-2.7.5.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\okhttp-3.12.12.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\okio-1.14.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\opencsv-2.3.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\orc-core-1.5.12.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\orc-mapreduce-1.5.12.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\orc-shims-1.5.12.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\oro-2.0.8.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\osgi-resource-locator-1.0.3.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\paranamer-2.8.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\parquet-column-1.10.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\parquet-common-1.10.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\parquet-encoding-1.10.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\parquet-format-2.4.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\parquet-hadoop-1.10.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\parquet-jackson-1.10.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\protobuf-java-2.5.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\py4j-0.10.9.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\pyrolite-4.30.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\re2j-1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\RoaringBitmap-0.9.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\scala-collection-compat_2.12-2.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\scala-compiler-2.12.10.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\scala-library-2.12.10.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\scala-parser-combinators_2.12-1.1.2.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\scala-reflect-2.12.10.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\scala-xml_2.12-1.2.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\shapeless_2.12-2.3.3.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\shims-0.9.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\slf4j-api-1.7.30.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\slf4j-log4j12-1.7.30.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\snakeyaml-1.24.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\snappy-java-1.1.8.2.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-catalyst_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-core_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-graphx_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-hive-thriftserver_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-hive_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-kubernetes_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-kvstore_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-launcher_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-mesos_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-mllib-local_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-mllib_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-network-common_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-network-shuffle_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-repl_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-sketch_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-sql_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-streaming_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-tags_2.12-3.1.1-tests.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-tags_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-unsafe_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-yarn_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spire-macros_2.12-0.17.0-M1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spire-platform_2.12-0.17.0-M1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spire-util_2.12-0.17.0-M1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spire_2.12-0.17.0-M1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\ST4-4.0.4.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\stax-api-1.0.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\stax2-api-3.1.4.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\stream-2.9.6.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\super-csv-2.2.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\threeten-extra-1.5.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\token-provider-1.0.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\transaction-api-1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\univocity-parsers-2.9.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\velocity-1.5.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\woodstox-core-5.0.3.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\xbean-asm7-shaded-4.15.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\xz-1.5.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\zjsonpatch-0.3.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\zookeeper-3.4.14.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\zstd-jni-1.4.8-1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\arrow-vector-2.0.0.jar" car.LoadModelRideHailing Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 25/06/08 17:05:07 INFO SparkContext: Running Spark version 3.1.1 25/06/08 17:05:07 INFO ResourceUtils: ============================================================== 25/06/08 17:05:07 INFO ResourceUtils: No custom resources configured for spark.driver. 25/06/08 17:05:07 INFO ResourceUtils: ============================================================== 25/06/08 17:05:07 INFO SparkContext: Submitted application: LoadModelRideHailing 25/06/08 17:05:07 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0) 25/06/08 17:05:07 INFO ResourceProfile: Limiting resource is cpu 25/06/08 17:05:07 INFO ResourceProfileManager: Added ResourceProfile id: 0 25/06/08 17:05:07 INFO SecurityManager: Changing view acls to: wyatt 25/06/08 17:05:07 INFO SecurityManager: Changing modify acls to: wyatt 25/06/08 17:05:07 INFO SecurityManager: Changing view acls groups to: 25/06/08 17:05:07 INFO SecurityManager: Changing modify acls groups to: 25/06/08 17:05:07 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(wyatt); groups with view permissions: Set(); users with modify permissions: Set(wyatt); groups with modify permissions: Set() 25/06/08 17:05:07 INFO Utils: Successfully started service 'sparkDriver' on port 59361. 25/06/08 17:05:07 INFO SparkEnv: Registering MapOutputTracker 25/06/08 17:05:07 INFO SparkEnv: Registering BlockManagerMaster 25/06/08 17:05:08 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information 25/06/08 17:05:08 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up 25/06/08 17:05:08 INFO SparkEnv: Registering BlockManagerMasterHeartbeat 25/06/08 17:05:08 INFO DiskBlockManager: Created local directory at C:\Users\wyatt\AppData\Local\Temp\blockmgr-8fe065e2-024c-4e2f-8662-45d2fe3de444 25/06/08 17:05:08 INFO MemoryStore: MemoryStore started with capacity 1899.0 MiB 25/06/08 17:05:08 INFO SparkEnv: Registering OutputCommitCoordinator 25/06/08 17:05:08 INFO Utils: Successfully started service 'SparkUI' on port 4040. 25/06/08 17:05:08 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://windows10.microdone.cn:4040 25/06/08 17:05:08 INFO Executor: Starting executor ID driver on host windows10.microdone.cn 25/06/08 17:05:08 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 59392. 25/06/08 17:05:08 INFO NettyBlockTransferService: Server created on windows10.microdone.cn:59392 25/06/08 17:05:08 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy 25/06/08 17:05:08 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, windows10.microdone.cn, 59392, None) 25/06/08 17:05:08 INFO BlockManagerMasterEndpoint: Registering block manager windows10.microdone.cn:59392 with 1899.0 MiB RAM, BlockManagerId(driver, windows10.microdone.cn, 59392, None) 25/06/08 17:05:08 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, windows10.microdone.cn, 59392, None) 25/06/08 17:05:08 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, windows10.microdone.cn, 59392, None) Exception in thread "main" java.lang.IllegalArgumentException: 测试数据中不包含 features 列,请检查数据! at car.LoadModelRideHailing$.main(LoadModelRideHailing.scala:23) at car.LoadModelRideHailing.main(LoadModelRideHailing.scala) 进程已结束,退出代码为 1 package car import org.apache.spark.ml.classification.{LogisticRegressionModel, RandomForestClassificationModel} import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator import org.apache.spark.sql.{SparkSession, functions => F} object LoadModelRideHailing { def main(args: Array[String]): Unit = { val spark = SparkSession.builder() .master("local[3]") .appName("LoadModelRideHailing") .getOrCreate() spark.sparkContext.setLogLevel("Error") // 使用经过特征工程处理后的测试数据 val TestData = spark.read.option("header", "true").csv("C:\\Users\\wyatt\\Documents\\ride_hailing_test_data.csv") // 将 label 列转换为数值类型 val testDataWithNumericLabel = TestData.withColumn("label", F.col("label").cast("double")) // 检查 features 列是否存在 if (!testDataWithNumericLabel.columns.contains("features")) { throw new IllegalArgumentException("测试数据中不包含 features 列,请检查数据!") } // 修正后的模型路径(确保文件夹存在且包含元数据) val LogisticModel = LogisticRegressionModel.load("C:\\Users\\wyatt\\Documents\\ride_hailing_logistic_model") // 示例路径 val LogisticPre = LogisticModel.transform(testDataWithNumericLabel) val LogisticAcc = new MulticlassClassificationEvaluator() .setLabelCol("label") .setPredictionCol("prediction") .setMetricName("accuracy") .evaluate(LogisticPre) println("逻辑回归模型后期数据准确率:" + LogisticAcc) // 随机森林模型路径同步修正 val RandomForest = RandomForestClassificationModel.load("C:\\Users\\wyatt\\Documents\\ride_hailing_random_forest_model") // 示例路径 val RandomForestPre = RandomForest.transform(testDataWithNumericLabel) val RandomForestAcc = new MulticlassClassificationEvaluator() .setLabelCol("label") .setPredictionCol("prediction") .setMetricName("accuracy") .evaluate(RandomForestPre) println("随机森林模型后期数据准确率:" + RandomForestAcc) spark.stop() } }
06-09
评论
成就一亿技术人!
拼手气红包6.0元
还能输入1000个字符
 
红包 添加红包
表情包 插入表情
 条评论被折叠 查看
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值