[spark-src-core] 2.3 shuffle in spark

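This article walks through the shuffle workflow in Apache Spark, including the key steps of sorting data, writing partitions, and generating the index file. It also examines the core implementation details of SortShuffleWriter and ExternalSorter, and raises a few open questions.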


1.flow

1.1 shuffle abstract



 

 

1.2 shuffle flow

 

1.3 sort flow in shuffle

 

1.4 data structure in mem



 

2.core code paths

 

//SortShuffleWriter
override def write(records: Iterator[Product2[K, V]]): Unit = { //-how does the reduce side pick out its partition from this output? via the index file
    //-1 sort the result data
    //-both cases below will spill to disk if the memory threshold is exceeded
    val ts = System.currentTimeMillis()
    //-compare with the reduce-side combine in HashShuffleReader#read(); this is the real map-side combine
    if (dep.mapSideCombine) { //-acts as the map-side combiner, e.g. reduceByKey(): an aggregator is set by default, but no keyOrdering
      require(dep.aggregator.isDefined, "Map-side combine without Aggregator specified!")
      sorter = new ExternalSorter[K, V, C](
        dep.aggregator, Some(dep.partitioner), dep.keyOrdering, dep.serializer)
      logInfo("-aggr:" + dep.aggregator + ",key ord:" + dep.keyOrdering
        +",ser:" + dep.serializer +",part:"+dep.partitioner + ",dep " + dep)

      sorter.insertAll(records) //-if no ordering is given, keys are sorted by hash code within each partition
    } else { //-no map-side combine is requested, e.g. groupBy..
      // In this case we pass neither an aggregator nor an ordering to the sorter, because we don't
      // care whether the keys get sorted in each partition; that will be done on the *reduce side*
      // if the operation being run is *sortByKey.*
      sorter = new ExternalSorter[K, V, V](None, Some(dep.partitioner), None, dep.serializer)
      sorter.insertAll(records)
    }
    logInfo("*total cost of sorting(ms) " + (System.currentTimeMillis()-ts))
    //-2 write the data file, then the index file
    // Don't bother including the time to open the merged output file in the shuffle write time,
    // because it just opens a single file, so is typically too fast to measure accurately
    // (see SPARK-3570).
    val outputFile = shuffleBlockResolver.getDataFile(dep.shuffleId, mapId)//-same as the shuffle output file in insertAll()
    val blockId = ShuffleBlockId(dep.shuffleId, mapId, IndexShuffleBlockResolver.NOOP_REDUCE_ID)

    //-this means the result is always written to disk, even when the data is small
    val partitionLengths = sorter.writePartitionedFile(blockId, context, outputFile)
    shuffleBlockResolver.writeIndexFile(dep.shuffleId, mapId, partitionLengths)

    logInfo("-output file:" + outputFile + ",blockid:" + blockId
      + ",part len " + partitionLengths.length + ",total " + partitionLengths.sum
      +",shuffle server id " + blockManager.shuffleServerId + ",shuffleId " + dep.shuffleId)

    //-3 wrap up the result (serialization happens in Executor#launchTask())
    //-used by MapOutputTracker#getServerStatuses()
    mapStatus = MapStatus(blockManager.shuffleServerId, partitionLengths)//-contains no real data, only the shuffle server id and the partition lengths
  }

  /** Close this writer, passing along whether the map completed */
  override def stop(success: Boolean): Option[MapStatus] = {
    try {
      if (stopping) {
        return None
      }
      stopping = true
      if (success) {
        return Option(mapStatus)
      } else {
        // The map task failed, so delete our output data.
        shuffleBlockResolver.removeDataByMap(dep.shuffleId, mapId)
        return None
      }
    } finally {
      // Clean up our sorter, which may have its own intermediate files
      if (sorter != null) {
        val startTime = System.nanoTime()
        sorter.stop()
        context.taskMetrics.shuffleWriteMetrics.foreach(
          _.incShuffleWriteTime(System.nanoTime - startTime))
        sorter = null
      }
    }
  }
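
The index file produced by writeIndexFile() is derived from partitionLengths. As a minimal sketch (not Spark's own code), assuming the index stores cumulative byte offsets with one entry per partition boundary, a reducer could locate its segment in the data file as shown below. The names IndexSketch, offsets and segmentOf are hypothetical and exist only for this illustration.

```scala
object IndexSketch {
  // Cumulative offsets: entry i is where partition i starts,
  // and the last entry equals the total length of the data file.
  def offsets(partitionLengths: Array[Long]): Array[Long] =
    partitionLengths.scanLeft(0L)(_ + _)

  // (start offset, length) of the segment that holds reduce partition `reduceId`.
  def segmentOf(index: Array[Long], reduceId: Int): (Long, Long) = {
    val start = index(reduceId)
    (start, index(reduceId + 1) - start)
  }
}
```

With lengths (10, 0, 25) the offsets would be (0, 10, 10, 35), so an empty partition simply shows up as two equal neighbouring offsets.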
 
//ExternalSorter
/**-the data layout of this file is the same as that of a spilled file.
   * Write all the data added into this ExternalSorter into *a file* in the disk store. This is
   * called by the SortShuffleWriter and can go through an efficient path of just concatenating
   * binary files if we decided to avoid merge-sorting.
   *
   * @param blockId block ID to write to. The index file will be blockId.name + ".index".-note
   * @param context a TaskContext for a running Spark task, for us to update shuffle metrics.
   * @return array of lengths, in bytes, of each partition of the file (used by map output tracker)
   */
  def writePartitionedFile(
      blockId: BlockId,
      context: TaskContext,
      outputFile: File): Array[Long] = {

    // Track location of each range in the output file
    val lengths = new Array[Long](numPartitions)
    //-case 1: bypass merge-sort
    if (bypassMergeSort && partitionWriters != null) {
      logInfo("-bypass:" + bypassMergeSort+",pwriters:" + partitionWriters.length)
      // We decided to write separate files for each partition, so just concatenate them. To keep
      // this simple we spill out the current in-memory collection so that everything is in files. -so no ordering is guaranteed
      spillToPartitionFiles(if (aggregator.isDefined) map else buffer)
      partitionWriters.foreach(_.commitAndClose())
      val out = new FileOutputStream(outputFile, true) //-note: append mode is 'true'
      val writeStartTime = System.nanoTime
      util.Utils.tryWithSafeFinally {
        for (i <- 0 until numPartitions) {
          val in = new FileInputStream(partitionWriters(i).fileSegment().file)
          util.Utils.tryWithSafeFinally {
            lengths(i) = org.apache.spark.util.Utils.copyStream(in, out, false, transferToEnabled)
          } {
            in.close()
          }
        }
      } {
        out.close()
        context.taskMetrics.shuffleWriteMetrics.foreach(
          _.incShuffleWriteTime(System.nanoTime - writeStartTime))
      }
    } else if (spills.isEmpty && partitionWriters == null) {//-reached only if nothing was spilled to disk; same logic as spill()
      logInfo("-no spills occurred")
      //-case 2: we only have in-memory data
      val collection = if (aggregator.isDefined) map else buffer
      val it = collection.destructiveSortedWritablePartitionedIterator(comparator) //-same as spillToMergableFile()
      while (it.hasNext) { //-note: this is a nested loop
        val writer = blockManager.getDiskWriter(blockId, outputFile, serInstance, fileBufferSize,
          context.taskMetrics.shuffleWriteMetrics.get) //-the file is opened in append mode, so only one file is created
        val partitionId = it.nextPartition() //-the partition id of the current run
        while (it.hasNext && it.nextPartition() == partitionId) { //-the data is already ordered by partition id, so
          it.writeNext(writer)      //-each partition is written out as one contiguous stage
        }
        writer.commitAndClose() //-all data of this partition has been written out
        val segment = writer.fileSegment()
        lengths(partitionId) = segment.length //-record its size
      }
    } else { //-case 3: spilled files plus remaining in-memory data (everything is merged and concatenated into the final file)
      logInfo("-merge spilled file and in-mem data?,part it:" + this.partitionedIterator)
      // Not bypassing merge-sort; get an iterator by partition and just write everything directly.
      for ((id, elements) <- this.partitionedIterator) { //-uses a multiway merge sort
        if (elements.hasNext) {
          val writer = blockManager.getDiskWriter(blockId, outputFile, serInstance, fileBufferSize,
            context.taskMetrics.shuffleWriteMetrics.get)
          for (elem <- elements) {
            writer.write(elem._1, elem._2)
          }
          writer.commitAndClose()
          val segment = writer.fileSegment()
          lengths(id) = segment.length
        }
      }
    }

    context.taskMetrics.incMemoryBytesSpilled(memoryBytesSpilled)
    context.taskMetrics.incDiskBytesSpilled(diskBytesSpilled)
    context.taskMetrics.shuffleWriteMetrics.filter(_ => bypassMergeSort).foreach { m =>
      if (curWriteMetrics != null) {
        m.incShuffleBytesWritten(curWriteMetrics.shuffleBytesWritten)
        m.incShuffleWriteTime(curWriteMetrics.shuffleWriteTime)
        m.incShuffleRecordsWritten(curWriteMetrics.shuffleRecordsWritten)
      }
    }

    lengths
  }
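
The in-memory branch (case 2) works because the records are already ordered by partition id, so each partition forms one contiguous run that can be drained by the inner loop. The toy model below sketches that nested loop under simplifying assumptions: the "file" is an in-memory byte buffer, records are serialized as UTF-8 text lines, and InMemoryWriteSketch and its write method are hypothetical names, not part of Spark.

```scala
import java.io.ByteArrayOutputStream
import java.nio.charset.StandardCharsets.UTF_8

object InMemoryWriteSketch {
  // Writes (partitionId, key, value) records into one buffer, partition by partition,
  // and returns the buffer together with the byte length of each partition's segment.
  def write(records: Seq[(Int, String, Int)], numPartitions: Int): (Array[Byte], Array[Long]) = {
    val out = new ByteArrayOutputStream()
    val lengths = new Array[Long](numPartitions)
    // Order by partition id first, then by key, like a partition-then-key comparator.
    val it = records.sortBy(r => (r._1, r._2)).iterator.buffered
    while (it.hasNext) {
      val partitionId = it.head._1
      val before = out.size()
      // Inner loop: drain the contiguous run that belongs to this partition.
      while (it.hasNext && it.head._1 == partitionId) {
        val (_, k, v) = it.next()
        out.write(s"$k=$v\n".getBytes(UTF_8))
      }
      lengths(partitionId) = (out.size() - before).toLong
    }
    (out.toByteArray, lengths)
  }
}
```

Partitions that receive no records keep a length of 0, which matches how the lengths array in writePartitionedFile() is pre-sized to numPartitions.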
 

 

3.FAQ(TBD TODO)

   Here are some questions I am not clear about, so any clues are highly appreciated:

A. When and where does a ResultTask learn that it should fetch result data from a ShuffleMapTask? For example, take a job with:

  3 maps(m1,m2,m3) x 2 reduces(r1,r2)

t1: m1, m2 and m3 are all running;

t2: m1 has finished while the other maps are still running; r1 is then notified to set up and fetches the result of m1;

t3: m2 has finished. The question is: when does r1 learn that it should fetch the result of m2? My guess is that there should be a shared place where reduces can look up where the maps ran and when they finished, but I have not found it in the source yet.
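
One way to picture such a shared place is a driver-side registry of map statuses, which is what the comment on MapOutputTracker#getServerStatuses() in write() above hints at. As far as I know, Spark submits the reduce stage only after every map task of the parent stage has finished, so r1 would not start at t2 at all. The toy model below is only a sketch of that idea; Registry, Status and locationsFor are hypothetical names and do not mirror Spark's real classes.

```scala
import scala.collection.mutable

object MapOutputRegistrySketch {
  // Hypothetical stand-in for a map status: where the output lives and its per-reduce sizes.
  final case class Status(host: String, partitionLengths: Array[Long])

  final class Registry(numMaps: Int) {
    private val statuses = mutable.Map.empty[Int, Status]

    // Called when a map task finishes successfully.
    def register(mapId: Int, status: Status): Unit = statuses.update(mapId, status)

    // The reduce side can only be planned once every map output is known.
    def allMapsDone: Boolean = statuses.size == numMaps

    // For one reduce partition: (mapId, host, bytes) to fetch, or None while maps are still running.
    def locationsFor(reduceId: Int): Option[Seq[(Int, String, Long)]] =
      if (!allMapsDone) None
      else Some(statuses.toSeq.sortBy(_._1).map { case (mapId, s) =>
        (mapId, s.host, s.partitionLengths(reduceId))
      })
  }
}
```

In this model a reduce never has to be notified about individual maps: it asks the registry once, after all maps are registered, and gets the full list of segments to fetch.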

 

B. In BlockManager, see #Question# below:

private def doGetRemote(blockId: BlockId, asBlockResult: Boolean): Option[Any] = {
    require(blockId != null, "BlockId is null")
    //-1 get the locations of the block's replicas -note: the block is stored in TaskRunner#run() in the case of an 'indirect result'
    val locations = Random.shuffle(master.getLocations(blockId)) //-delegated to BlockManagerMasterEndpoint
    //- #Question# why not pick the BlockManager closest to this driver, in cluster mode or even local mode? (important)
    for (loc <- locations) { //-multiple hosts may hold the same block id, so one successful fetch is enough if the data is valid
      logDebug(s"Getting remote block $blockId from $loc")
      //-2 fetch the actual data
      val data = blockTransferService.fetchBlockSync(
        loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()

      if (data != null) { //-got the data from one of the replica nodes; return immediately
        if (asBlockResult) {
          return Some(new BlockResult(
            dataDeserialize(blockId, data),
            DataReadMethod.Network,
            data.limit()))
        } else {
          return Some(data)
        }
      }
      logDebug(s"The value of block $blockId is null")
    }
    logDebug(s"Block $blockId not found")
    None
  }
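
To make the #Question# concrete, here is a minimal sketch of what a locality-aware ordering could look like: still shuffle for load balancing, but try replicas on the local host first. This is an assumption about a possible design, not the behaviour of the doGetRemote() shown above; Loc, order and localHost are hypothetical names.

```scala
import scala.util.Random

object LocationOrderSketch {
  // Hypothetical location record; Spark's own location type carries more fields.
  final case class Loc(executorId: String, host: String, port: Int)

  // Shuffle for load balancing, then move same-host replicas to the front.
  def order(locations: Seq[Loc], localHost: String, rnd: Random = new Random()): Seq[Loc] = {
    val (sameHost, others) = rnd.shuffle(locations).partition(_.host == localHost)
    sameHost ++ others
  }
}
```

The fetch loop itself would stay unchanged; only the order in which locations are tried differs, so a same-host replica is attempted before any cross-host transfer.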

06-09
The steps below address the Spark ML error raised when the test data lacks a `features` column.

### Cause of the error

In a Spark ML pipeline, the `features` column is usually produced by a feature-preprocessing stage such as `VectorAssembler`[^1]. If `model.transform(testData)` is called on test data that has not gone through the full pipeline preprocessing, that column is missing. The main causes are:

1. **The full pipeline model is not used**: the test data is not transformed by the same `PipelineModel` that was fitted during training.
2. **Feature engineering is missing**: the test data skips feature-preprocessing steps such as `VectorAssembler`.
3. **Column names do not match**: the feature column names differ between the training and test data.

### Solution (Scala code)

```scala
// 1. Predict with the fitted pipeline, not with a single bare model.
//    cv.fit returns a CrossValidatorModel that wraps the best fitted pipeline.
val model: CrossValidatorModel = cv.fit(trainingData)

// 2. Verify that the test data contains the raw feature columns (not `features`)
require(testData.columns.contains("col1") && testData.columns.contains("col2"), "missing feature column")

// 3. Transform the test data with the fitted pipeline
val predictions = model.transform(testData) // the features column is generated here

// 4. Explicitly check that the features column exists (for debugging)
if (predictions.columns.contains("features")) {
  println("feature vector generated")
}
```

### Key fixes

1. **Always use the fitted pipeline**
   Transform the test data with the complete fitted pipeline (a `PipelineModel`, or the `CrossValidatorModel` that wraps it) rather than calling a single model such as `LightGBM` or `XGBoost` directly.

2. **Validate the features up front**
   Before calling `transform()`, check that the test data contains every **raw input feature column**, that is, the input columns of `VectorAssembler`:
   ```scala
   val requiredCols = Array("age", "income", "product_view") // replace with the actual column names
   requiredCols.foreach(col =>
     assert(testData.columns.contains(col), s"test data is missing column: $col")
   )
   ```

3. **Build the pipeline consistently**
   Make sure the training pipeline includes the feature-engineering stage:
   ```scala
   val assembler = new VectorAssembler()
     .setInputCols(Array("col1", "col2"))
     .setOutputCol("features") // must match the estimator's featuresCol ("features" by default)

   val pipeline = new Pipeline()
     .setStages(Array(assembler, lightgbm)) // feature engineering comes first
   ```

### Troubleshooting flow

```mermaid
graph TD
    A[Test data raises the error] --> B{Does the features column exist?}
    B -- yes --> C[Check the column data type]
    B -- no --> D{Verify the raw feature columns}
    D -- columns missing --> E[Fix the data source]
    D -- columns present --> F[Check that the model is a PipelineModel]
    F -- not a PipelineModel --> G[Replace model.fit with cv.fit]
    F -- is a PipelineModel --> H[Check the VectorAssembler settings]
```

### Notes

1. `PipelineModel.transform()` **runs the feature engineering automatically**, so there is no need to create the `features` column by hand.
2. The features must enter the vector **in the same order** as during training. Reusing the fitted pipeline ensures this, because `VectorAssembler` follows the order of its `inputCols`.
3. Use `.schema` to compare the two data structures:
   ```scala
   trainingData.schema.foreach(println)
   testData.schema.foreach(println)
   ```

> **Reference note**: Spark ML requires the test data to have the same feature structure as the training data; the pipeline model handles feature generation automatically through its metadata[^1][^2].
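The up-front column check described above can be sketched without a Spark dependency. `ColumnCheck`, `missingColumns`, and `nearMatches` are hypothetical helper names introduced here for illustration; they are not part of Spark. The sketch assumes the column names are available as plain strings and reports both the required columns that are absent and those that probably exist under a slightly different spelling (different case or stray whitespace).

```scala
// Hypothetical helper, not part of Spark: compares the columns a pipeline
// expects with the columns a dataset actually has.
object ColumnCheck {

  /** Required columns that are absent from `available`, in the order required. */
  def missingColumns(available: Seq[String], required: Seq[String]): Seq[String] = {
    val present = available.toSet
    required.filterNot(present.contains)
  }

  /** Pairs (required, actual) whose names differ only by case or surrounding whitespace. */
  def nearMatches(available: Seq[String], required: Seq[String]): Seq[(String, String)] = {
    def norm(s: String): String = s.trim.toLowerCase
    required
      .filterNot(available.contains)
      .flatMap { req =>
        // Keep the first available column that matches after normalisation, if any
        available.find(a => norm(a) == norm(req)).map(act => (req, act)).toList
      }
  }
}
```

With Spark on the classpath, one plausible call would be `ColumnCheck.missingColumns(testData.columns.toSeq, assembler.getInputCols.toSeq)`, run before `transform()`. Columns reported by `nearMatches` are candidates for renaming, while the remaining missing ones point to a problem in the data source.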