Spark-scheduler

This post digs into Apache Spark's scheduling machinery. It walks through the core Task, ResultTask, and ShuffleMapTask abstractions, explains how the DAGScheduler works and how it manages and schedules task sets (TaskSet), and discusses the role and responsibilities of the TaskSetManager along with the resource-allocation strategy.


@(spark)[scheduler]

Task

/**                                                                                                                                                                     
 * A unit of execution. We have two kinds of Task's in Spark:                                                                                                           
 * - [[org.apache.spark.scheduler.ShuffleMapTask]]                                                                                                                      
 * - [[org.apache.spark.scheduler.ResultTask]]                                                                                                                          
 *                                                                                                                                                                      
 * A Spark job consists of one or more stages. The very last stage in a job consists of multiple                                                                        
 * ResultTasks, while earlier stages consist of ShuffleMapTasks. A ResultTask executes the task                                                                         
 * and sends the task output back to the driver application. A ShuffleMapTask executes the task                                                                         
 * and divides the task output to multiple buckets (based on the task's partitioner).                                                                                   
 *                                                                                                                                                                      
 * @param stageId id of the stage this task belongs to                                                                                                                  
 * @param partitionId index of the number in the RDD                                                                                                                    
 */                                                                                                                                                                     
private[spark] abstract class Task[T](val stageId: Int, var partitionId: Int) extends Serializable {     

ResultTask

/**                                                                                                                                                                     
 * A task that sends back the output to the driver application.                                                                                                         
 *                                                                                                                                                                      
 * See [[Task]] for more information.                                                                                                                                   
 *                                                                                                                                                                      
 * @param stageId id of the stage this task belongs to                                                                                                                  
 * @param taskBinary broadcasted version of the serialized RDD and the function to apply on each                                                                        
 *                   partition of the given RDD. Once deserialized, the type should be                                                                                  
 *                   (RDD[T], (TaskContext, Iterator[T]) => U).                                                                                                         
 * @param partition partition of the RDD this task is associated with                                                                                                   
 * @param locs preferred task execution locations for locality scheduling                                                                                               
 * @param outputId index of the task in this job (a job can launch tasks on only a subset of the                                                                        
 *                 input RDD's partitions).                                                                                                                             
 */                                                                                                                                                                     
private[spark] class ResultTask[T, U](                                                                                                                                  
    stageId: Int,                                                                                                                                                       
    taskBinary: Broadcast[Array[Byte]],                                                                                                                                 
    partition: Partition,                                                                                                                                               
    @transient locs: Seq[TaskLocation],                                                                                                                                 
    val outputId: Int)                                                                                                                                                  
  extends Task[U](stageId, partition.index) with Serializable {    

Let's take a closer look at its runTask:

  override def runTask(context: TaskContext): U = {                                                                                                                     
    // Deserialize the RDD and the func using the broadcast variables.                                                                                                  
    val ser = SparkEnv.get.closureSerializer.newInstance()                                                                                                              
    val (rdd, func) = ser.deserialize[(RDD[T], (TaskContext, Iterator[T]) => U)](                                                                                       
      ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)                                                                                    

    metrics = Some(context.taskMetrics)                                                                                                                                 
    func(context, rdd.iterator(partition, context))                                                                                                                     
  }   
  1. Deserialize the RDD (and the user function) from the broadcast task binary.
  2. Call the RDD's iterator (iterator is a final function on RDD, which calls computeOrReadCheckpoint as appropriate); see the sketch below.
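
For reference, here is a paraphrased sketch of RDD.iterator as it looked in the Spark 1.x sources (simplified, not copied verbatim), showing where computeOrReadCheckpoint comes in:

  // Sketch only: dispatches to the CacheManager for persisted RDDs,
  // otherwise reads the checkpoint or computes the partition.
  final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
    if (storageLevel != StorageLevel.NONE) {
      SparkEnv.get.cacheManager.getOrCompute(this, split, context, storageLevel)
    } else {
      computeOrReadCheckpoint(split, context)
    }
  }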

ShuffleMapTask

/**                                                                                                                                                                     
* A ShuffleMapTask divides the elements of an RDD into multiple buckets (based on a partitioner                                                                         
* specified in the ShuffleDependency).                                                                                                                                  
*                                                                                                                                                                       
* See [[org.apache.spark.scheduler.Task]] for more information.                                                                                                         
*                                                                                                                                                                       
 * @param stageId id of the stage this task belongs to                                                                                                                  
 * @param taskBinary broadcast version of of the RDD and the ShuffleDependency. Once deserialized,                                                                      
 *                   the type should be (RDD[_], ShuffleDependency[_, _, _]).                                                                                           
 * @param partition partition of the RDD this task is associated with                                                                                                   
 * @param locs preferred task execution locations for locality scheduling                                                                                               
 */                                                                                                                                                                     
private[spark] class ShuffleMapTask(                                                                                                                                    
    stageId: Int,                                                                                                                                                       
    taskBinary: Broadcast[Array[Byte]],                                                                                                                                 
    partition: Partition,                                                                                                                                               
    @transient private var locs: Seq[TaskLocation])                                                                                                                     
  extends Task[MapStatus](stageId, partition.index) with Logging {      

Its runTask returns a MapStatus:

MapStatus

/**
 * Result returned by a ShuffleMapTask to a scheduler. Includes the block manager address that the
 * task ran on as well as the sizes of outputs for each reducer, for passing on to the reduce tasks.
 */
private[spark] sealed trait MapStatus {
  /** Location where this task was run. */
  def location: BlockManagerId

  /**
   * Estimated size for the reduce block, in bytes.
   *
   * If a block is non-empty, then this method MUST return a non-zero size. This invariant is
   * necessary for correctness, since block fetchers are allowed to skip zero-size blocks.
   */
  def getSizeForBlock(reduceId: Int): Long
}

runTask

Its runTask is also fairly simple: it obtains a ShuffleWriter and writes out the results.
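
Roughly like this (a trimmed paraphrase of the Spark 1.x implementation; error handling and metrics bookkeeping omitted):

  override def runTask(context: TaskContext): MapStatus = {
    // Deserialize the RDD and its ShuffleDependency from the broadcast task binary.
    val ser = SparkEnv.get.closureSerializer.newInstance()
    val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
      ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)

    // Ask the ShuffleManager for a writer, write this partition out, and return the MapStatus.
    val writer = SparkEnv.get.shuffleManager
      .getWriter[Any, Any](dep.shuffleHandle, partitionId, context)
    writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
    writer.stop(success = true).get
  }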

TaskResult

// Task result. Also contains updates to accumulator variables.
private[spark] sealed trait TaskResult[T]
There are two kinds: DirectTaskResult and IndirectTaskResult.
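
Roughly (declarations paraphrased from memory, so the exact fields may differ by version): a DirectTaskResult carries the serialized value, accumulator updates, and metrics inline in the status update, while an IndirectTaskResult only carries a BlockId that the driver has to fetch, used when the result is too large to send directly.

// Small result: the serialized value travels back with the status update itself.
private[spark] class DirectTaskResult[T](
    var valueBytes: ByteBuffer,
    var accumUpdates: Map[Long, Any],
    var metrics: TaskMetrics)
  extends TaskResult[T] with Externalizable

// Large result: stored in the block manager; the driver fetches it via the BlockId.
private[spark] case class IndirectTaskResult[T](blockId: BlockId, size: Int)
  extends TaskResult[T] with Serializable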

TaskInfo

/**                                                                                                                                                                     
 * :: DeveloperApi ::                                                                                                                                                   
 * Information about a running task attempt inside a TaskSet.                                                                                                           
 */                                                                                                                                                                     
@DeveloperApi                                                                                                                                                           
class TaskInfo(                                                                                                                                                         
    val taskId: Long,                                                                                                                                                   
    val index: Int,                                                                                                                                                     
    val attempt: Int,                                                                                                                                                   
    val launchTime: Long,                                                                                                                                               
    val executorId: String,                                                                                                                                             
    val host: String,                                                                                                                                                   
    val taskLocality: TaskLocality.TaskLocality,                                                                                                                        
    val speculative: Boolean) {   

TaskDescription


/**                                                                                                                                                                     
 * Description of a task that gets passed onto executors to be executed, usually created by                                                                             
 * [[TaskSetManager.resourceOffer]].                                                                                                                                    
 */                                                                                                                                                                     
private[spark] class TaskDescription(        

AccumulableInfo

/**                                                                                                                                                                     
 * :: DeveloperApi ::                                                                                                                                                   
 * Information about an [[org.apache.spark.Accumulable]] modified during a task or stage.                                                                               
 */                                                                                                                                                                     
@DeveloperApi                                                                                                                                                           
class AccumulableInfo (                                                                                                                                                 
    val id: Long,                                                                                                                                                       
    val name: String,                                                                                                                                                   
    val update: Option[String], // represents a partial update within a task                                                                                            
    val value: String) {                                                                                                                                                

SplitInfo

// information about a specific split instance : handles both split instances.                                                                                          
// So that we do not need to worry about the differences.                                                                                                               
@DeveloperApi                                                                                                                                                           
class SplitInfo(                                                                                                                                                        
    val inputFormatClazz: Class[_],                                                                                                                                     
    val hostLocation: String,                                                                                                                                           
    val path: String,                                                                                                                                                   
    val length: Long,                                                                                                                                                   
    val underlyingSplit: Any) {       

SparkListener

  1. Defines a series of events (the SparkListenerEvent case classes).
  2. Defines the listener interface, trait SparkListener.
  3. Defines class StatsReportListener: a simple SparkListener that logs a few summary statistics when each stage completes.
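
As a quick illustration (my own toy example, not taken from the Spark sources), a listener only needs to override the callbacks it cares about and is then registered on the SparkContext with addSparkListener:

// Hypothetical example: log one line whenever a stage completes.
class StageTimingListener extends SparkListener {
  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
    val info = stageCompleted.stageInfo
    println(s"Stage ${info.stageId} '${info.name}' finished (${info.numTasks} tasks)")
  }
}
// sc.addSparkListener(new StageTimingListener)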

JobResult

A result of a job in the DAGScheduler.
There are only two cases: JobSucceeded and JobFailed.
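
That is, roughly (paraphrased):

private[spark] sealed trait JobResult
private[spark] case object JobSucceeded extends JobResult
private[spark] case class JobFailed(exception: Exception) extends JobResult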

JobWaiter

/**                                                                                                                                                                     
 * An object that waits for a DAGScheduler job to complete. As tasks finish, it passes their                                                                            
 * results to the given handler function.                                                                                                                               
 */                                                                                                                                                                     
private[spark] class JobWaiter[T](                                                                                                                                      
    dagScheduler: DAGScheduler,                                                                                                                                         
    val jobId: Int,                                                                                                                                                     
    totalTasks: Int,                                                                                                                                                    
    resultHandler: (Int, T) => Unit)                                                                                                                                    
  extends JobListener {   

JobListener

/**                                                                                                                                                                     
 * Interface used to listen for job completion or failure events after submitting a job to the                                                                          
 * DAGScheduler. The listener is notified each time a task succeeds, as well as if the whole                                                                            
 * job fails (and no further taskSucceeded events will happen).                                                                                                         
 */                                                                                                                                                                     
private[spark] trait JobListener {                                                                                                                                      
  def taskSucceeded(index: Int, result: Any)                                                                                                                            
  def jobFailed(exception: Exception)                                                                                                                                   
}   

JobLogger

/**                                                                                                                                                                     
 * :: DeveloperApi ::                                                                                                                                                   
 * A logger class to record runtime information for jobs in Spark. This class outputs one log file                                                                      
 * for each Spark job, containing tasks start/stop and shuffle information. JobLogger is a subclass                                                                     
 * of SparkListener, use addSparkListener to add JobLogger to a SparkContext after the SparkContext                                                                     
 * is created. Note that each JobLogger only works for one SparkContext                                                                                                 
 *                                                                                                                                                                      
 * NOTE: The functionality of this class is heavily stripped down to accommodate for a general                                                                          
 * refactor of the SparkListener interface. In its place, the EventLoggingListener is introduced                                                                        
 * to log application information as SparkListenerEvents. To enable this functionality, set                                                                             
 * spark.eventLog.enabled to true.                                                                                                                                      
 */                                                                                                                                                                     
@DeveloperApi                                                                                                                                                           
@deprecated("Log application information by setting spark.eventLog.enabled.", "1.0.0")                                                                                  
class JobLogger(val user: String, val logDirName: String) extends SparkListener with Logging {      

ApplicationEventListener

/**                                                                                                                                                                     
 * A simple listener for application events.                                                                                                                            
 *                                                                                                                                                                      
 * This listener expects to hear events from a single application only. If events                                                                                       
 * from multiple applications are seen, the behavior is unspecified.                                                                                                    
 */                                                                                                                                                                     
private[spark] class ApplicationEventListener extends SparkListener {      

DAGSchedulerEvent

/**                                                                                                                                                                     
 * Types of events that can be handled by the DAGScheduler. The DAGScheduler uses an event queue                                                                        
 * architecture where any thread can post an event (e.g. a task finishing or a new job being                                                                            
 * submitted) but there is a single "logic" thread that reads these events and takes decisions.                                                                         
 * This greatly simplifies synchronization.                                                                                                                             
 */                                                                                                                                                                     
private[scheduler] sealed trait DAGSchedulerEvent   

It covers quite a few events; the important ones include JobSubmitted, StageCancelled, and so on.
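
For example, two of the event case classes look roughly like this (field lists abbreviated and paraphrased, so they may not match a given Spark version exactly):

// Posted when SparkContext.runJob hands a new job to the DAGScheduler.
private[scheduler] case class JobSubmitted(
    jobId: Int,
    finalRDD: RDD[_],
    func: (TaskContext, Iterator[_]) => _,
    partitions: Array[Int],
    callSite: CallSite,
    listener: JobListener,
    properties: Properties = null)
  extends DAGSchedulerEvent

// Posted when a stage (and the jobs depending on it) should be cancelled.
private[scheduler] case class StageCancelled(stageId: Int) extends DAGSchedulerEvent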

SparkListenerBus

/**                                                                                                                                                                     
 * A [[SparkListenerEvent]] bus that relays [[SparkListenerEvent]]s to its listeners                                                                                    
 */                                                                                                                                                                     
private[spark] trait SparkListenerBus extends ListenerBus[SparkListener, SparkListenerEvent] {  

EventLoggingListener

/**                                                                                                                                                                     
 * A SparkListener that logs events to persistent storage.                                                                                                              
 *                                                                                                                                                                      
 * Event logging is specified by the following configurable parameters:                                                                                                 
 *   spark.eventLog.enabled - Whether event logging is enabled.                                                                                                         
 *   spark.eventLog.compress - Whether to compress logged events                                                                                                        
 *   spark.eventLog.overwrite - Whether to overwrite any existing files.                                                                                                
 *   spark.eventLog.dir - Path to the directory in which events are logged.                                                                                             
 *   spark.eventLog.buffer.kb - Buffer size to use when writing to output streams                                                                                       
 */                                                                                                                                                                     
private[spark] class EventLoggingListener(                                                                                                                              
    appId: String,                                                                                                                                                      
    logBaseDir: URI,                                                                                                                                                    
    sparkConf: SparkConf,                                                                                                                                               
    hadoopConf: Configuration)                                                                                                                                          
  extends SparkListener with Logging {   

ReplayListenerBus

/**                                                                                                                                                                     
 * A SparkListenerBus that can be used to replay events from serialized event data.                                                                                     
 */                                                                                                                                                                     
private[spark] class ReplayListenerBus extends SparkListenerBus with Logging {  

LiveListenerBus

/**                                                                                                                                                                     
 * Asynchronously passes SparkListenerEvents to registered SparkListeners.                                                                                              
 *                                                                                                                                                                      
 * Until start() is called, all posted events are only buffered. Only after this listener bus                                                                           
 * has started will events be actually propagated to all attached listeners. This listener bus                                                                          
 * is stopped when it receives a SparkListenerShutdown event, which is posted using stop().                                                                             
 */                                                                                                                                                                     
private[spark] class LiveListenerBus                                                                                                                                    
  extends AsynchronousListenerBus[SparkListener, SparkListenerEvent]("SparkListenerBus")                                                                                
  with SparkListenerBus {       

ExecutorLossReason

/**                                                                                                                                                                     
 * Represents an explanation for a executor or whole slave failing or exiting.                                                                                          
 */                                                                                                                                                                     
private[spark]                                                                                                                                                          
class ExecutorLossReason(val message: String) {                                                                                                                         
  override def toString: String = message                                                                                                                               
}       

SchedulerBackend

/**                                                                                                                                                                     
 * A backend interface for scheduling systems that allows plugging in different ones under                                                                              
 * TaskSchedulerImpl. We assume a Mesos-like model where the application gets resource offers as                                                                        
 * machines become available and can launch tasks on them.                                                                                                              
 */                                                                                                                                                                     
private[spark] trait SchedulerBackend {   

SchedulerBackend has four concrete subclasses: MesosSchedulerBackend, CoarseMesosSchedulerBackend, SimrSchedulerBackend, and SparkDeploySchedulerBackend. MesosSchedulerBackend and CoarseMesosSchedulerBackend are used for Mesos deployments, SimrSchedulerBackend for running Spark inside Hadoop MapReduce (SIMR), and SparkDeploySchedulerBackend for Spark's standalone deploy mode. YarnSchedulerBackend covers the YARN case, but it is an abstract class; its implementation lives in
spark/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnClusterSchedulerBackend.scala.
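
The interface these backends plug into is small; it looks roughly like this (member list paraphrased from the Spark 1.x trait, details may differ by version):

private[spark] trait SchedulerBackend {
  def start(): Unit
  def stop(): Unit
  // Ask the backend to hand its free resources to the TaskSchedulerImpl again.
  def reviveOffers(): Unit
  def defaultParallelism(): Int
  // Optional operation: not every backend supports killing individual tasks.
  def killTask(taskId: Long, executorId: String, interruptThread: Boolean): Unit =
    throw new UnsupportedOperationException
  def isReady(): Boolean = true
  def applicationId(): String
}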

LocalBackend

/**                                                                                                                                                                     
 * LocalBackend is used when running a local version of Spark where the executor, backend, and                                                                          
 * master all run in the same JVM. It sits behind a TaskSchedulerImpl and handles launching tasks                                                                       
 * on a single Executor (created by the LocalBackend) running locally.                                                                                                  
 */                                                                                                                                                                     
private[spark] class LocalBackend(scheduler: TaskSchedulerImpl, val totalCores: Int)    

YarnSchedulerBackend

/**                                                                                                                                                                     
 * Abstract Yarn scheduler backend that contains common logic                                                                                                           
 * between the client and cluster Yarn scheduler backends.                                                                                                              
 */                                                                                                                                                                     
private[spark] abstract class YarnSchedulerBackend(                                                                                                                     
    scheduler: TaskSchedulerImpl,                                                                                                                                       
    sc: SparkContext)                                                                                                                                                   
  extends CoarseGrainedSchedulerBackend(scheduler, sc.env.actorSystem) { 

CoarseGrainedSchedulerBackend

/**                                                                                                                                                                     
 * A scheduler backend that waits for coarse grained executors to connect to it through Akka.                                                                           
 * This backend holds onto each executor for the duration of the Spark job rather than relinquishing                                                                    
 * executors whenever a task is done and asking the scheduler to launch a new executor for                                                                              
 * each new task. Executors may be launched in a variety of ways, such as Mesos tasks for the                                                                           
 * coarse-grained Mesos mode or standalone processes for Spark's standalone deploy mode                                                                                 
 * (spark.deploy.*).                                                                                                                                                    
 */                                                                                                                                                                     
private[spark]                                                                                                                                                          
class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val actorSystem: ActorSystem)                                                                         
  extends ExecutorAllocationClient with SchedulerBackend with Logging    

SparkDeploySchedulerBackend

ActiveJob

/**                                                                                                                                                                     
 * Tracks information about an active job in the DAGScheduler.                                                                                                          
 */                                                                                                                                                                     
private[spark] class ActiveJob(                                                                                                                                         
    val jobId: Int,                                                                                                                                                     
    val finalStage: Stage,                                                                                                                                              
    val func: (TaskContext, Iterator[_]) => _,                                                                                                                          
    val partitions: Array[Int],                                                                                                                                         
    val callSite: CallSite,                                                                                                                                             
    val listener: JobListener,                                                                                                                                          
    val properties: Properties) {   

Stage

/**                                                                                                                                                                     
 * A stage is a set of independent tasks all computing the same function that need to run as part                                                                       
 * of a Spark job, where all the tasks have the same shuffle dependencies. Each DAG of tasks run                                                                        
 * by the scheduler is split up into stages at the boundaries where shuffle occurs, and then the                                                                        
 * DAGScheduler runs these stages in topological order.                                                                                                                 
 *                                                                                                                                                                      
 * Each Stage can either be a shuffle map stage, in which case its tasks' results are input for                                                                         
 * another stage, or a result stage, in which case its tasks directly compute the action that                                                                           
 * initiated a job (e.g. count(), save(), etc). For shuffle map stages, we also track the nodes                                                                         
 * that each output partition is on.                                                                                                                                    
 *                                                                                                                                                                      
 * Each Stage also has a jobId, identifying the job that first submitted the stage.  When FIFO                                                                          
 * scheduling is used, this allows Stages from earlier jobs to be computed first or recovered                                                                           
 * faster on failure.                                                                                                                                                   
 *                                                                                                                                                                      
 * The callSite provides a location in user code which relates to the stage. For a shuffle map                                                                          
 * stage, the callSite gives the user code that created the RDD being shuffled. For a result                                                                            
 * stage, the callSite gives the user code that executes the associated action (e.g. count()).                                                                          
 *                                                                                                                                                                      
 * A single stage can consist of multiple attempts. In that case, the latestInfo field will                                                                             
 * be updated for each attempt.                                                                                                                                         
 *                                                                                                                                                                      
 */                                                                                                                                                                     
private[spark] class Stage(                                                                                                                                             
    val id: Int,                                                                                                                                                        
    val rdd: RDD[_],                                                                                                                                                    
    val numTasks: Int,                                                                                                                                                  
    val shuffleDep: Option[ShuffleDependency[_, _, _]],  // Output shuffle if stage is a map stage                                                                      
    val parents: List[Stage],                                                                                                                                           
    val jobId: Int,
    val callSite: CallSite)                                                                                                                                             
  extends Logging {       

StageInfo

/**                                                                                                                                                                     
 * :: DeveloperApi ::                                                                                                                                                   
 * Stores information about a stage to pass from the scheduler to SparkListeners.                                                                                       
 */                                                                                                                                                                     
@DeveloperApi                                                                                                                                                           
class StageInfo(                                                                                                                                                        
    val stageId: Int,                                                                                                                                                   
    val attemptId: Int,                                                                                                                                                 
    val name: String,                                                                                                                                                   
    val numTasks: Int,                                                                                                                                                  
    val rddInfos: Seq[RDDInfo],                                                                                                                                         
    val details: String) {      

TaskResultGetter

/**                                                                                                                                                                     
 * Runs a thread pool that deserializes and remotely fetches (if necessary) task results.                                                                               
 */                                                                                                                                                                     
private[spark] class TaskResultGetter(sparkEnv: SparkEnv, scheduler: TaskSchedulerImpl)      

TaskLocation

/**                                                                                                                                                                     
 * A location where a task should run. This can either be a host or a (host, executorID) pair.                                                                          
 * In the latter case, we will prefer to launch the task on that executorID, but our next level                                                                         
 * of preference will be executors on the same host if this is not possible.                                                                                            
 */                                                                                                                                                                     
private[spark] sealed trait TaskLocation {                                                                                                                              
  def host: String                                                                                                                                                      
}          

SchedulingMode

/**                                                                                                                                                                     
 *  "FAIR" and "FIFO" determines which policy is used                                                                                                                   
 *    to order tasks amongst a Schedulable's sub-queues                                                                                                                 
 *  "NONE" is used when the a Schedulable has no sub-queues.                                                                                                            
 */                                                                                                                                                                     
object SchedulingMode extends Enumeration {                                                                                                                             

  type SchedulingMode = Value                                                                                                                                           
  val FAIR, FIFO, NONE = Value                                                                                                                                          
}  

TaskSet

/**                                                                                                                                                                     
 * A set of tasks submitted together to the low-level TaskScheduler, usually representing                                                                               
 * missing partitions of a particular stage.                                                                                                                            
 */                                                                                                                                                                     
private[spark] class TaskSet(                                                                                                                                           
    val tasks: Array[Task[_]],                                                                                                                                          
    val stageId: Int,                                                                                                                                                   
    val attempt: Int,                                                                                                                                                   
    val priority: Int,                                                                                                                                                  
    val properties: Properties) {          

AccumulableInfo

Information about an [[org.apache.spark.Accumulable]] modified during a task or stage.

InputFormatInfo

Parses and holds information about inputFormat (and files) specified as a parameter.
Interestingly, the companion object contains the following comment:

  /**                                                                                                                                                                   
    Computes the preferred locations based on input(s) and returned a location to block map.                                                                            
    Typical use of this method for allocation would follow some algo like this:                                                                                         

    a) For each host, count number of splits hosted on that host.                                                                                                       
    b) Decrement the currently allocated containers on that host.                                                                                                       
    c) Compute rack info for each host and update rack -> count map based on (b).                                                                                       
    d) Allocate nodes based on (c)                                                                                                                                      
    e) On the allocation result, ensure that we dont allocate "too many" jobs on a single node                                                                          
       (even if data locality on that is very high) : this is to prevent fragility of job if a                                                                          
       single (or small set of) hosts go down.                                                                                                                          

    go to (a) until required nodes are allocated.                                                                                                                       

    If a node 'dies', follow same procedure.                                                                                                                            

    PS: I know the wording here is weird, hopefully it makes some sense !                                                                                               
  */      
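
A toy illustration of step (a) alone, assuming the splits are available as a Seq[SplitInfo] (the helper name is mine):

  // Count how many splits each host serves; this map feeds steps (b) through (e).
  def splitsPerHost(splits: Seq[SplitInfo]): Map[String, Int] =
    splits.groupBy(_.hostLocation).map { case (host, ss) => host -> ss.size }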

Schedulable

Pool

A Schedulable entity that represents a collection of Pools or TaskSetManagers.
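
Both Pool and TaskSetManager implement the Schedulable trait, which looks roughly like this (paraphrased, some members omitted):

import scala.collection.mutable.ArrayBuffer

private[spark] trait Schedulable {
  var parent: Pool
  def schedulingMode: SchedulingMode
  def weight: Int
  def minShare: Int
  def runningTasks: Int
  def priority: Int
  def stageId: Int
  def name: String
  def addSchedulable(schedulable: Schedulable): Unit
  def removeSchedulable(schedulable: Schedulable): Unit
  // A Pool flattens its tree of children into the TaskSetManagers to run, in scheduling order.
  def getSortedTaskSetQueue: ArrayBuffer[TaskSetManager]
}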

TaskSetManager

/**                                                                                                                                                                     
 * Schedules the tasks within a single TaskSet in the TaskSchedulerImpl. This class keeps track of                                                                      
 * each task, retries tasks if they fail (up to a limited number of times), and                                                                                         
 * handles locality-aware scheduling for this TaskSet via delay scheduling. The main interfaces                                                                         
 * to it are resourceOffer, which asks the TaskSet whether it wants to run a task on one node,                                                                          
 * and statusUpdate, which tells it that one of its tasks changed state (e.g. finished).                                                                                
 *                                                                                                                                                                      
 * THREADING: This class is designed to only be called from code with a lock on the                                                                                     
 * TaskScheduler (e.g. its event handlers). It should not be called from other threads.                                                                                 
 *                                                                                                                                                                      
 * @param sched           the TaskSchedulerImpl associated with the TaskSetManager                                                                                      
 * @param taskSet         the TaskSet to manage scheduling for                                                                                                          
 * @param maxTaskFailures if any particular task fails more than this number of times, the entire                                                                       
 *                        task set will be aborted                                                                                                                      
 */                                                                                                                                                                     
private[spark] class TaskSetManager(                                                                                                                                    
    sched: TaskSchedulerImpl,                                                                                                                                           
    val taskSet: TaskSet,                                                                                                                                               
    val maxTaskFailures: Int,                                                                                                                                           
    clock: Clock = new SystemClock())                                                                                                                                   
  extends Schedulable with Logging {   

This file is fairly long and deserves a closer look:

TaskLocality

object TaskLocality extends Enumeration {                                                                                                                               
  // Process local is expected to be used ONLY within TaskSetManager for now.                                                                                           
  val PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY = Value                                                                                                       

  type TaskLocality = Value                                                                                                                                                                                                                                                                                                                   

In TaskSetManager's logic, tasks are preferentially run on nodes close to their data; the priority order is exactly the one listed above, from PROCESS_LOCAL down to ANY.
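To make the ordering concrete, here is a minimal, self-contained sketch (not the actual Spark source): it relies on the fact that an Enumeration's declaration order gives smaller ids to better localities. The helper name isAllowed is an assumption chosen for illustration.

object TaskLocalitySketch extends Enumeration {
  // Declared from "closest" to "farthest": a smaller id means better locality.
  val PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY = Value
  type TaskLocality = Value

  // A locality `condition` is acceptable if it is at least as good as the
  // `constraint` currently allowed by delay scheduling.
  def isAllowed(constraint: TaskLocality, condition: TaskLocality): Boolean =
    condition <= constraint
}

object TaskLocalityDemo extends App {
  import TaskLocalitySketch._
  println(isAllowed(NODE_LOCAL, PROCESS_LOCAL)) // true: process-local is at least as good
  println(isAllowed(NODE_LOCAL, RACK_LOCAL))    // false: rack-local is worse than allowed
}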

resourceOffer

  /**                                                                                                                                                                   
   * Respond to an offer of a single executor from the scheduler by finding a task                                                                                      
   *                                                                                                                                                                    
   * NOTE: this function is either called with a maxLocality which                                                                                                      
   * would be adjusted by delay scheduling algorithm or it will be with a special                                                                                       
   * NO_PREF locality which will be not modified                                                                                                                        
   *                                                                                                                                                                    
   * @param execId the executor Id of the offered resource                                                                                                              
   * @param host  the host Id of the offered resource                                                                                                                   
   * @param maxLocality the maximum locality we want to schedule the tasks at                                                                                           
   */                                                                                                                                                                   
  @throws[TaskNotSerializableException]                                                                                                                                 
  def resourceOffer(                                                                                                                                                    
      execId: String,                                                                                                                                                   
      host: String,                                                                                                                                                     
      maxLocality: TaskLocality.TaskLocality)                                                                                                                           
    : Option[TaskDescription] =     

When an executor offers a slot to run on, resourceOffer tries to pick a suitable task from this TaskSet for it.
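As a rough illustration of how a caller might drive resourceOffer under delay scheduling, the sketch below walks the locality levels from best to worst and stops at the first task the TaskSet is willing to launch. The types and the Offer alias are stand-ins, not the real Spark classes.

object ResourceOfferSketch {
  sealed trait Locality
  case object ProcessLocal extends Locality
  case object NodeLocal extends Locality
  case object NoPref extends Locality
  case object RackLocal extends Locality
  case object AnyLocality extends Locality

  final case class TaskDescription(taskId: Long, execId: String)

  // Stand-in for TaskSetManager.resourceOffer: given an executor, a host and the
  // maximum locality currently allowed, it may or may not return a task to launch.
  type Offer = (String, String, Locality) => Option[TaskDescription]

  // Walk the locality levels from best to worst and stop at the first task the
  // TaskSet is willing to run on this executor.
  def offerOneExecutor(execId: String, host: String, resourceOffer: Offer): Option[TaskDescription] = {
    val levels: Seq[Locality] = Seq(ProcessLocal, NodeLocal, NoPref, RackLocal, AnyLocality)
    levels.iterator
      .map(level => resourceOffer(execId, host, level))
      .collectFirst { case Some(task) => task }
  }
}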

handleSuccessfulTask

Marks the task as successful and notifies the DAGScheduler that a task has ended.

handleFailedTask

Marks the task as failed, re-adds it to the list of pending tasks, and notifies the DAGScheduler.

Basic logic

  1. Group the tasks described by the TaskSet into pending lists according to their locality (how close each task is to its data).
  2. The scheduler (TaskSchedulerImpl) offers an executor that is free to run a task (resourceOffer).
  3. Pick the most suitable pending task and launch it on that executor.
  4. Depending on the task's outcome:
    • on success, mark it as successful;
    • on failure, retry it as long as the failure count has not reached maxTaskFailures.

A rough sketch of this loop follows.
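This is a deliberately simplified sketch, assuming made-up types (PendingTask, TaskSetLoopSketch) rather than the real Spark API; task removal after launch and locality wait timeouts are omitted.

final case class PendingTask(index: Int, preferredHosts: Set[String], var failures: Int = 0)

final class TaskSetLoopSketch(tasks: Seq[PendingTask], maxTaskFailures: Int) {
  // Step 1: index tasks by preferred host so that local tasks can be found quickly.
  private val pendingByHost: Map[String, Seq[PendingTask]] =
    tasks.flatMap(t => t.preferredHosts.map(h => h -> t))
      .groupBy(_._1)
      .map { case (host, pairs) => host -> pairs.map(_._2) }

  // Steps 2 and 3: given an offered host, prefer a task whose data lives there.
  def pickTaskFor(host: String): Option[PendingTask] =
    pendingByHost.getOrElse(host, tasks).headOption

  // Step 4: on failure, keep the task pending and retry until maxTaskFailures is hit.
  def onTaskFinished(task: PendingTask, succeeded: Boolean): Unit =
    if (!succeeded) {
      task.failures += 1
      if (task.failures >= maxTaskFailures)
        sys.error(s"Task ${task.index} failed $maxTaskFailures times; aborting the TaskSet")
    }
}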

SchedulingAlgorithm

/**                                                                                                                                                                     
 * An interface for sort algorithm                                                                                                                                      
 * FIFO: FIFO algorithm between TaskSetManagers                                                                                                                         
 * FS: FS algorithm between Pools, and FIFO or FS within Pools                                                                                                          
 */                                                                                                                                                                     
private[spark] trait SchedulingAlgorithm {                                                                                                                              
  def comparator(s1: Schedulable, s2: Schedulable): Boolean                                                                                                             
}      

Here, FS refers to the FairSchedulingAlgorithm.
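For illustration, a FIFO comparator in the spirit of this trait could look like the sketch below. The Schedulable stand-in only carries the two fields the comparison needs (priority, which is effectively the job id, and stageId); the details are an assumption rather than the exact Spark implementation.

trait SchedulableSketch { def priority: Int; def stageId: Int }

class FIFOSchedulingAlgorithmSketch {
  // Earlier jobs (smaller priority value) come first; within the same job,
  // stages with smaller ids come first.
  def comparator(s1: SchedulableSketch, s2: SchedulableSketch): Boolean =
    if (s1.priority != s2.priority) s1.priority < s2.priority
    else s1.stageId < s2.stageId
}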

OutputCommitCoordinator

/**                                                                                                                                                                     
 * Authority that decides whether tasks can commit output to HDFS. Uses a "first committer wins"                                                                        
 * policy.                                                                                                                                                              
 *                                                                                                                                                                      
 * OutputCommitCoordinator is instantiated in both the drivers and executors. On executors, it is                                                                       
 * configured with a reference to the driver's OutputCommitCoordinatorActor, so requests to commit                                                                      
 * output will be forwarded to the driver's OutputCommitCoordinator.                                                                                                    
 *                                                                                                                                                                      
 * This class was introduced in SPARK-4879; see that JIRA issue (and the associated pull requests)                                                                      
 * for an extensive design discussion.                                                                                                                                  
 */                                                                                                                                                                     
private[spark] class OutputCommitCoordinator(conf: SparkConf) extends Logging {   

On the driver there is an OutputCommitCoordinatorActor, and this actor's owner is the OutputCommitCoordinator. It grants the commit request of the first task attempt and denies the requests of all subsequent attempts for the same output.
The granularity of a request is: AskPermissionToCommitOutput(stage, partition, taskAttempt)
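A minimal first-committer-wins sketch, assuming a simplified in-memory map keyed by (stage, partition); the real coordinator additionally has to cope with failed attempts and with cleaning up state when stages finish.

import scala.collection.mutable

final class FirstCommitterWinsSketch {
  // (stageId, partitionId) -> the task attempt that was authorized to commit
  private val authorized = mutable.Map.empty[(Int, Int), Long]

  def canCommit(stage: Int, partition: Int, taskAttempt: Long): Boolean = synchronized {
    authorized.get((stage, partition)) match {
      case Some(winner) => winner == taskAttempt // re-asking by the winner is fine; everyone else is denied
      case None =>
        authorized((stage, partition)) = taskAttempt // first committer wins
        true
    }
  }
}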

DAGScheduler

/**                                                                                                                                                                     
 * The high-level scheduling layer that implements stage-oriented scheduling. It computes a DAG of                                                                      
 * stages for each job, keeps track of which RDDs and stage outputs are materialized, and finds a                                                                       
 * minimal schedule to run the job. It then submits stages as TaskSets to an underlying                                                                                 
 * TaskScheduler implementation that runs them on the cluster.                                                                                                          
 *                                                                                                                                                                      
 * In addition to coming up with a DAG of stages, this class also determines the preferred                                                                              
 * locations to run each task on, based on the current cache status, and passes these to the                                                                            
 * low-level TaskScheduler. Furthermore, it handles failures due to shuffle output files being                                                                          
 * lost, in which case old stages may need to be resubmitted. Failures *within* a stage that are                                                                        
 * not caused by shuffle file loss are handled by the TaskScheduler, which will retry each task                                                                         
 * a small number of times before cancelling the whole stage.                                                                                                           
 *                                                                                                                                                                      
 * Here's a checklist to use when making or reviewing changes to this class:                                                                                            
 *                                                                                                                                                                      
 *  - When adding a new data structure, update `DAGSchedulerSuite.assertDataStructuresEmpty` to                                                                         
 *    include the new structure. This will help to catch memory leaks.                                                                                                  
 */                                                                                                                                                                     
private[spark]                                                                                                                                                          
class DAGScheduler(                                                                                                                                                     
    private[scheduler] val sc: SparkContext,                                                                                                                            
    private[scheduler] val taskScheduler: TaskScheduler,                                                                                                                
    listenerBus: LiveListenerBus,                                                                                                                                       
    mapOutputTracker: MapOutputTrackerMaster,                                                                                                                           
    blockManagerMaster: BlockManagerMaster,                                                                                                                             
    env: SparkEnv,                                                                                                                                                      
    clock: Clock = new SystemClock())                                                                                                                                   
  extends Logging {    

DAGSchedulerEventProcessLoop

 /**                                                                                                                                                                   
   * The main event loop of the DAG scheduler.                                                                                                                          
   */                                                                                                                                                                   
  override def onReceive(event: DAGSchedulerEvent): Unit = event match {                                                                                                
    case JobSubmitted(jobId, rdd, func, partitions, allowLocal, callSite, listener, properties) =>                                                                      
      dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, allowLocal, callSite,                                                                               
        listener, properties)                                                                                                                                           

    case StageCancelled(stageId) =>                                                                                                                                     
      dagScheduler.handleStageCancellation(stageId)                                                                                                                     

    case JobCancelled(jobId) =>                                                                                                                                         
      dagScheduler.handleJobCancellation(jobId)                                                                                                                         

    case JobGroupCancelled(groupId) =>                                                                                                                                  
      dagScheduler.handleJobGroupCancelled(groupId)                                                                                                                     

    case AllJobsCancelled =>                                                                                                                                            
      dagScheduler.doCancelAllJobs()                                                                                                                                    

    case ExecutorAdded(execId, host) =>                                                                                                                                 
      dagScheduler.handleExecutorAdded(execId, host)                                                                                                                    

    case ExecutorLost(execId) =>                                                                                                                                        
      dagScheduler.handleExecutorLost(execId, fetchFailed = false)     

    case BeginEvent(task, taskInfo) =>                                                                                                                                  
      dagScheduler.handleBeginEvent(task, taskInfo)                                                                                                                     

    case GettingResultEvent(taskInfo) =>                                                                                                                                
      dagScheduler.handleGetTaskResult(taskInfo)                                                                                                                        

    case completion @ CompletionEvent(task, reason, _, _, taskInfo, taskMetrics) =>                                                                                     
      dagScheduler.handleTaskCompletion(completion)                                                                                                                     

    case TaskSetFailed(taskSet, reason) =>                                                                                                                              
      dagScheduler.handleTaskSetFailed(taskSet, reason)                                                                                                                 

    case ResubmitFailedStages =>                                                                                                                                        
      dagScheduler.resubmitFailedStages()                                                                                                                               
  }                                                                                                                                                                     

Basic logic of stage generation

For background, see articles such as 《Stage划分及提交源码分析》 (an analysis of stage splitting and submission).

My understanding:
1. The final RDD always terminates a stage, so treat it as the final stage.
2. Traverse backwards from the final RDD; whenever a ShuffleDependency is encountered, a new stage boundary is created there.
3. Repeat step 2 until all stages have been generated.

A sketch of this traversal follows.
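This is a hedged sketch of the traversal, using stand-in Node/Dep types instead of the real RDD and Dependency classes: walk parents from the final node and cut a new stage at every shuffle dependency.

sealed trait Dep { def parent: Node }
final case class NarrowDep(parent: Node) extends Dep
final case class ShuffleDep(parent: Node) extends Dep
final case class Node(id: Int, deps: Seq[Dep])

object StageSplitSketch {
  // Returns the nodes that start a stage: the final node plus every parent that is
  // reached across a shuffle boundary.
  def stageRoots(finalNode: Node): Set[Node] = {
    val roots = scala.collection.mutable.Set(finalNode)
    val visited = scala.collection.mutable.Set.empty[Int]
    def visit(node: Node): Unit =
      if (visited.add(node.id)) {
        node.deps.foreach {
          case ShuffleDep(p) => roots += p; visit(p) // shuffle dependency => new stage boundary
          case NarrowDep(p)  => visit(p)             // narrow dependency stays in the same stage
        }
      }
    visit(finalNode)
    roots.toSet
  }
}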

"C:\Program Files\Java\jdk1.8.0_281\bin\java.exe" "-javaagent:D:\新建文件夹 (2)\IDEA\idea\IntelliJ IDEA 2019.3.3\lib\idea_rt.jar=59342" -Dfile.encoding=UTF-8 -classpath "C:\Program Files\Java\jdk1.8.0_281\jre\lib\charsets.jar;C:\Program Files\Java\jdk1.8.0_281\jre\lib\deploy.jar;C:\Program Files\Java\jdk1.8.0_281\jre\lib\ext\access-bridge-64.jar;C:\Program Files\Java\jdk1.8.0_281\jre\lib\ext\cldrdata.jar;C:\Program Files\Java\jdk1.8.0_281\jre\lib\ext\dnsns.jar;C:\Program Files\Java\jdk1.8.0_281\jre\lib\ext\jaccess.jar;C:\Program Files\Java\jdk1.8.0_281\jre\lib\ext\jfxrt.jar;C:\Program Files\Java\jdk1.8.0_281\jre\lib\ext\localedata.jar;C:\Program Files\Java\jdk1.8.0_281\jre\lib\ext\nashorn.jar;C:\Program Files\Java\jdk1.8.0_281\jre\lib\ext\sunec.jar;C:\Program Files\Java\jdk1.8.0_281\jre\lib\ext\sunjce_provider.jar;C:\Program Files\Java\jdk1.8.0_281\jre\lib\ext\sunmscapi.jar;C:\Program Files\Java\jdk1.8.0_281\jre\lib\ext\sunpkcs11.jar;C:\Program Files\Java\jdk1.8.0_281\jre\lib\ext\zipfs.jar;C:\Program Files\Java\jdk1.8.0_281\jre\lib\javaws.jar;C:\Program Files\Java\jdk1.8.0_281\jre\lib\jce.jar;C:\Program Files\Java\jdk1.8.0_281\jre\lib\jfr.jar;C:\Program Files\Java\jdk1.8.0_281\jre\lib\jfxswt.jar;C:\Program Files\Java\jdk1.8.0_281\jre\lib\jsse.jar;C:\Program Files\Java\jdk1.8.0_281\jre\lib\management-agent.jar;C:\Program Files\Java\jdk1.8.0_281\jre\lib\plugin.jar;C:\Program Files\Java\jdk1.8.0_281\jre\lib\resources.jar;C:\Program Files\Java\jdk1.8.0_281\jre\lib\rt.jar;D:\carspark\out\production\carspark;C:\Users\wyatt\.ivy2\cache\org.scala-lang\scala-library\jars\scala-library-2.12.10.jar;C:\Users\wyatt\.ivy2\cache\org.scala-lang\scala-reflect\jars\scala-reflect-2.12.10.jar;C:\Users\wyatt\.ivy2\cache\org.scala-lang\scala-library\srcs\scala-library-2.12.10-sources.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\accessors-smart-1.2.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\activation-1.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\aircompressor-0.10.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\algebra_2.12-2.0.0-M2.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\antlr-runtime-3.5.2.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\antlr4-runtime-4.8-1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\aopalliance-1.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\aopalliance-repackaged-2.6.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\arpack_combined_all-0.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\arrow-format-2.0.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\arrow-memory-core-2.0.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\arrow-memory-netty-2.0.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\audience-annotations-0.5.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\automaton-1.11-8.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\avro-1.8.2.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\avro-ipc-1.8.2.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\avro-mapred-1.8.2-hadoop2.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\bonecp-0.8.0.RELEASE.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\breeze-macros_2.12-1.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\breeze_2.12-1.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\cats-kernel_2.12-2.0.0-M4.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\chill-java-0.9.5.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\chill_2.12-0.9.5.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\commons-beanutils-1.9.4.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\commons-cli-1.2.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\commons-codec-1.10.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\commons-collections-3.2.2.jar;D:\spark\spark-3.1
.1-bin-hadoop3.2\jars\commons-compiler-3.0.16.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\commons-compress-1.20.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\commons-configuration2-2.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\commons-crypto-1.1.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\commons-daemon-1.0.13.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\commons-dbcp-1.4.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\commons-httpclient-3.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\commons-io-2.5.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\commons-lang-2.6.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\commons-lang3-3.10.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\commons-logging-1.1.3.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\commons-math3-3.4.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\commons-net-3.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\commons-pool-1.5.4.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\commons-text-1.6.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\compress-lzf-1.0.3.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\core-1.1.2.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\curator-client-2.13.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\curator-framework-2.13.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\curator-recipes-2.13.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\datanucleus-api-jdo-4.2.4.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\datanucleus-core-4.1.17.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\datanucleus-rdbms-4.1.19.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\derby-10.12.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\dnsjava-2.1.7.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\dropwizard-metrics-hadoop-metrics2-reporter-0.1.2.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\ehcache-3.3.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\flatbuffers-java-1.9.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\generex-1.0.2.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\geronimo-jcache_1.0_spec-1.0-alpha-1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\gson-2.2.4.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\guava-14.0.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\guice-4.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\guice-servlet-4.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hadoop-annotations-3.2.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hadoop-auth-3.2.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hadoop-common-3.2.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hadoop-hdfs-client-3.2.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hadoop-mapreduce-client-common-3.2.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hadoop-mapreduce-client-core-3.2.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hadoop-mapreduce-client-jobclient-3.2.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hadoop-yarn-api-3.2.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hadoop-yarn-client-3.2.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hadoop-yarn-common-3.2.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hadoop-yarn-registry-3.2.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hadoop-yarn-server-common-3.2.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hadoop-yarn-server-web-proxy-3.2.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\HikariCP-2.5.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hive-beeline-2.3.7.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hive-cli-2.3.7.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hive-common-2.3.7.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hive-exec-2.3.7-core.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hive-jdbc-2.3.7.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hive-llap-common-2.3.7.jar;D:\spa
rk\spark-3.1.1-bin-hadoop3.2\jars\hive-metastore-2.3.7.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hive-serde-2.3.7.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hive-service-rpc-3.1.2.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hive-shims-0.23-2.3.7.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hive-shims-common-2.3.7.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hive-shims-scheduler-2.3.7.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hive-storage-api-2.7.2.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hive-vector-code-gen-2.3.7.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hk2-api-2.6.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hk2-locator-2.6.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\hk2-utils-2.6.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\htrace-core4-4.1.0-incubating.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\httpclient-4.5.6.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\httpcore-4.4.12.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\istack-commons-runtime-3.0.8.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\ivy-2.4.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jackson-annotations-2.10.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jackson-core-2.10.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jackson-core-asl-1.9.13.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jackson-databind-2.10.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jackson-dataformat-yaml-2.10.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jackson-datatype-jsr310-2.11.2.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jackson-jaxrs-base-2.9.5.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jackson-jaxrs-json-provider-2.9.5.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jackson-mapper-asl-1.9.13.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jackson-module-jaxb-annotations-2.10.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jackson-module-paranamer-2.10.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jackson-module-scala_2.12-2.10.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jakarta.activation-api-1.2.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jakarta.annotation-api-1.3.5.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jakarta.inject-2.6.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jakarta.servlet-api-4.0.3.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jakarta.validation-api-2.0.2.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jakarta.ws.rs-api-2.1.6.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jakarta.xml.bind-api-2.3.2.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\janino-3.0.16.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\javassist-3.25.0-GA.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\javax.inject-1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\javax.jdo-3.2.0-m3.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\javolution-5.5.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jaxb-api-2.2.11.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jaxb-runtime-2.3.2.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jcip-annotations-1.0-1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jcl-over-slf4j-1.7.30.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jdo-api-3.0.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jersey-client-2.30.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jersey-common-2.30.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jersey-container-servlet-2.30.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jersey-container-servlet-core-2.30.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jersey-hk2-2.30.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jersey-media-jaxb-2.30.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jersey-server-2.30.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\JLargeArrays-1.5.jar;D:\spar
k\spark-3.1.1-bin-hadoop3.2\jars\jline-2.14.6.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\joda-time-2.10.5.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jodd-core-3.5.2.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jpam-1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\json-1.8.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\json-smart-2.3.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\json4s-ast_2.12-3.7.0-M5.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\json4s-core_2.12-3.7.0-M5.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\json4s-jackson_2.12-3.7.0-M5.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\json4s-scalap_2.12-3.7.0-M5.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jsp-api-2.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jsr305-3.0.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jta-1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\JTransforms-3.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\jul-to-slf4j-1.7.30.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kerb-admin-1.0.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kerb-client-1.0.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kerb-common-1.0.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kerb-core-1.0.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kerb-crypto-1.0.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kerb-identity-1.0.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kerb-server-1.0.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kerb-simplekdc-1.0.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kerb-util-1.0.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kerby-asn1-1.0.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kerby-config-1.0.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kerby-pkix-1.0.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kerby-util-1.0.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kerby-xdr-1.0.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kryo-shaded-4.0.2.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kubernetes-client-4.12.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kubernetes-model-admissionregistration-4.12.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kubernetes-model-apiextensions-4.12.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kubernetes-model-apps-4.12.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kubernetes-model-autoscaling-4.12.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kubernetes-model-batch-4.12.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kubernetes-model-certificates-4.12.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kubernetes-model-common-4.12.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kubernetes-model-coordination-4.12.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kubernetes-model-core-4.12.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kubernetes-model-discovery-4.12.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kubernetes-model-events-4.12.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kubernetes-model-extensions-4.12.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kubernetes-model-metrics-4.12.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kubernetes-model-networking-4.12.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kubernetes-model-policy-4.12.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kubernetes-model-rbac-4.12.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kubernetes-model-scheduling-4.12.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kubernetes-model-settings-4.12.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\kubernetes-model-storageclass-4.12.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\leveldbjni-all-1.8.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\libfb303-0.9.3.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\libthrif
t-0.12.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\log4j-1.2.17.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\logging-interceptor-3.12.12.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\lz4-java-1.7.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\machinist_2.12-0.6.8.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\macro-compat_2.12-1.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\mesos-1.4.0-shaded-protobuf.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\metrics-core-4.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\metrics-graphite-4.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\metrics-jmx-4.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\metrics-json-4.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\metrics-jvm-4.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\minlog-1.3.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\netty-all-4.1.51.Final.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\nimbus-jose-jwt-4.41.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\objenesis-2.6.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\okhttp-2.7.5.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\okhttp-3.12.12.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\okio-1.14.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\opencsv-2.3.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\orc-core-1.5.12.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\orc-mapreduce-1.5.12.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\orc-shims-1.5.12.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\oro-2.0.8.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\osgi-resource-locator-1.0.3.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\paranamer-2.8.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\parquet-column-1.10.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\parquet-common-1.10.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\parquet-encoding-1.10.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\parquet-format-2.4.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\parquet-hadoop-1.10.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\parquet-jackson-1.10.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\protobuf-java-2.5.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\py4j-0.10.9.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\pyrolite-4.30.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\re2j-1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\RoaringBitmap-0.9.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\scala-collection-compat_2.12-2.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\scala-compiler-2.12.10.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\scala-library-2.12.10.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\scala-parser-combinators_2.12-1.1.2.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\scala-reflect-2.12.10.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\scala-xml_2.12-1.2.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\shapeless_2.12-2.3.3.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\shims-0.9.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\slf4j-api-1.7.30.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\slf4j-log4j12-1.7.30.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\snakeyaml-1.24.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\snappy-java-1.1.8.2.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-catalyst_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-core_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-graphx_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-hive-thriftserver_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-hive_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-kubernetes_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-kvstore_2.12-3.1.1.jar;D:\spark\
spark-3.1.1-bin-hadoop3.2\jars\spark-launcher_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-mesos_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-mllib-local_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-mllib_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-network-common_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-network-shuffle_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-repl_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-sketch_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-sql_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-streaming_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-tags_2.12-3.1.1-tests.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-tags_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-unsafe_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spark-yarn_2.12-3.1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spire-macros_2.12-0.17.0-M1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spire-platform_2.12-0.17.0-M1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spire-util_2.12-0.17.0-M1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\spire_2.12-0.17.0-M1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\ST4-4.0.4.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\stax-api-1.0.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\stax2-api-3.1.4.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\stream-2.9.6.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\super-csv-2.2.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\threeten-extra-1.5.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\token-provider-1.0.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\transaction-api-1.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\univocity-parsers-2.9.1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\velocity-1.5.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\woodstox-core-5.0.3.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\xbean-asm7-shaded-4.15.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\xz-1.5.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\zjsonpatch-0.3.0.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\zookeeper-3.4.14.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\zstd-jni-1.4.8-1.jar;D:\spark\spark-3.1.1-bin-hadoop3.2\jars\arrow-vector-2.0.0.jar" car.LoadModelRideHailing Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 25/06/08 17:05:07 INFO SparkContext: Running Spark version 3.1.1 25/06/08 17:05:07 INFO ResourceUtils: ============================================================== 25/06/08 17:05:07 INFO ResourceUtils: No custom resources configured for spark.driver. 
25/06/08 17:05:07 INFO ResourceUtils: ============================================================== 25/06/08 17:05:07 INFO SparkContext: Submitted application: LoadModelRideHailing 25/06/08 17:05:07 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0) 25/06/08 17:05:07 INFO ResourceProfile: Limiting resource is cpu 25/06/08 17:05:07 INFO ResourceProfileManager: Added ResourceProfile id: 0 25/06/08 17:05:07 INFO SecurityManager: Changing view acls to: wyatt 25/06/08 17:05:07 INFO SecurityManager: Changing modify acls to: wyatt 25/06/08 17:05:07 INFO SecurityManager: Changing view acls groups to: 25/06/08 17:05:07 INFO SecurityManager: Changing modify acls groups to: 25/06/08 17:05:07 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(wyatt); groups with view permissions: Set(); users with modify permissions: Set(wyatt); groups with modify permissions: Set() 25/06/08 17:05:07 INFO Utils: Successfully started service 'sparkDriver' on port 59361. 25/06/08 17:05:07 INFO SparkEnv: Registering MapOutputTracker 25/06/08 17:05:07 INFO SparkEnv: Registering BlockManagerMaster 25/06/08 17:05:08 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information 25/06/08 17:05:08 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up 25/06/08 17:05:08 INFO SparkEnv: Registering BlockManagerMasterHeartbeat 25/06/08 17:05:08 INFO DiskBlockManager: Created local directory at C:\Users\wyatt\AppData\Local\Temp\blockmgr-8fe065e2-024c-4e2f-8662-45d2fe3de444 25/06/08 17:05:08 INFO MemoryStore: MemoryStore started with capacity 1899.0 MiB 25/06/08 17:05:08 INFO SparkEnv: Registering OutputCommitCoordinator 25/06/08 17:05:08 INFO Utils: Successfully started service 'SparkUI' on port 4040. 25/06/08 17:05:08 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://windows10.microdone.cn:4040 25/06/08 17:05:08 INFO Executor: Starting executor ID driver on host windows10.microdone.cn 25/06/08 17:05:08 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 59392. 25/06/08 17:05:08 INFO NettyBlockTransferService: Server created on windows10.microdone.cn:59392 25/06/08 17:05:08 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy 25/06/08 17:05:08 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, windows10.microdone.cn, 59392, None) 25/06/08 17:05:08 INFO BlockManagerMasterEndpoint: Registering block manager windows10.microdone.cn:59392 with 1899.0 MiB RAM, BlockManagerId(driver, windows10.microdone.cn, 59392, None) 25/06/08 17:05:08 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, windows10.microdone.cn, 59392, None) 25/06/08 17:05:08 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, windows10.microdone.cn, 59392, None) Exception in thread "main" java.lang.IllegalArgumentException: 测试数据中不包含 features 列,请检查数据! 
at car.LoadModelRideHailing$.main(LoadModelRideHailing.scala:23) at car.LoadModelRideHailing.main(LoadModelRideHailing.scala) 进程已结束,退出代码为 1 package car import org.apache.spark.ml.classification.{LogisticRegressionModel, RandomForestClassificationModel} import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator import org.apache.spark.sql.{SparkSession, functions => F} object LoadModelRideHailing { def main(args: Array[String]): Unit = { val spark = SparkSession.builder() .master("local[3]") .appName("LoadModelRideHailing") .getOrCreate() spark.sparkContext.setLogLevel("Error") // 使用经过特征工程处理后的测试数据 val TestData = spark.read.option("header", "true").csv("C:\\Users\\wyatt\\Documents\\ride_hailing_test_data.csv") // 将 label 列转换为数值类型 val testDataWithNumericLabel = TestData.withColumn("label", F.col("label").cast("double")) // 检查 features 列是否存在 if (!testDataWithNumericLabel.columns.contains("features")) { throw new IllegalArgumentException("测试数据中不包含 features 列,请检查数据!") } // 修正后的模型路径(确保文件夹存在且包含元数据) val LogisticModel = LogisticRegressionModel.load("C:\\Users\\wyatt\\Documents\\ride_hailing_logistic_model") // 示例路径 val LogisticPre = LogisticModel.transform(testDataWithNumericLabel) val LogisticAcc = new MulticlassClassificationEvaluator() .setLabelCol("label") .setPredictionCol("prediction") .setMetricName("accuracy") .evaluate(LogisticPre) println("逻辑回归模型后期数据准确率:" + LogisticAcc) // 随机森林模型路径同步修正 val RandomForest = RandomForestClassificationModel.load("C:\\Users\\wyatt\\Documents\\ride_hailing_random_forest_model") // 示例路径 val RandomForestPre = RandomForest.transform(testDataWithNumericLabel) val RandomForestAcc = new MulticlassClassificationEvaluator() .setLabelCol("label") .setPredictionCol("prediction") .setMetricName("accuracy") .evaluate(RandomForestPre) println("随机森林模型后期数据准确率:" + RandomForestAcc) spark.stop() } }