(Quick notes) Spark Basics: Odds and Ends

Article link: https://blog.youkuaiyun.com/weixin_44641024/article/details/102643308

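RDD operations

transformation: a transformation does not run immediately. All transformations in Spark are lazy: they are only computed once an action is called.

action: an action triggers the actual computation and returns a result to the driver (or writes it to storage).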
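To make the lazy-evaluation rule concrete, here is a minimal pure-Python sketch. It is not Spark code: `LazyDataset` and `traced_double` are hypothetical names invented for this illustration, and the class only mimics the idea that transformations are recorded while an action runs them.

```python
class LazyDataset:
    """Toy stand-in for an RDD (not a Spark API): it records
    transformations and only runs them when an action is called."""

    def __init__(self, data, ops=()):
        self._data = list(data)
        self._ops = tuple(ops)  # pending transformations, not yet executed

    # Transformations: return a new dataset and compute nothing.
    def map(self, fn):
        return LazyDataset(self._data, self._ops + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._data, self._ops + (("filter", pred),))

    # Action: runs the whole recorded chain and returns the result.
    def collect(self):
        items = iter(self._data)
        for kind, fn in self._ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


calls = []  # records when the mapped function actually runs


def traced_double(x):
    calls.append(x)
    return x * 2


ds = LazyDataset([1, 2, 3, 4]).map(traced_double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # still 0: nothing has executed yet
result = ds.collect()             # the action triggers the chain
```

Until `collect()` is called, `traced_double` has not been invoked even once, which is the behavior the note above describes for real transformations and actions.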

Transformation operators

http://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations

Action operators

http://spark.apache.org/docs/latest/rdd-programming-guide.html#actions

Spark Core, session 2: starts at the 1-hour mark

Some core Spark terminology

http://spark.apache.org/docs/latest/cluster-overview.html#glossary

Application

A user program built on Spark, consisting of one driver and multiple executors.

Application jar

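The application's jar. It contains the application together with its own dependencies. It normally does not bundle the Hadoop or Spark libraries, because those are provided by the runtime environment.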

Driver program

This is the process that creates our SparkContext, so every application has a driver.

Cluster manager

Cluster management. This is an external service used to acquire resources on the cluster. We only need to choose it when submitting the job: standalone, Mesos, or YARN.

Deploy mode

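Deploy mode. It determines where the driver process runs. In "cluster" mode, the framework launches the driver inside the cluster. In "client" mode, the submitter launches the driver outside the cluster.

For example, on YARN (RM, plus NMs that host containers):

cluster mode: the driver runs in a container, and which machine it lands on is not known in advance.

client mode: the driver runs locally on the machine you submit from.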

Worker node

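Worker node. A node that runs application code in the cluster. On YARN, for example, the worker nodes are the cluster's NMs (NodeManagers).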

Executor

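A process launched for an application on a worker node. It runs tasks and keeps data in memory or on disk. Each application has its own executors.

In other words, it is a process that runs tasks and caches data in memory or on disk. On YARN, it runs in a container.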

Task

The most basic unit of work, which is sent to an executor to run.

An RDD is made up of multiple partitions. Each partition corresponds to one task.
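The partition-to-task relationship can be sketched in plain Python. This is a toy model, not Spark code: `split_into_partitions` and `run_task` are hypothetical helpers, and a thread pool stands in for an executor that runs tasks in multiple threads.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into num_partitions roughly equal contiguous chunks."""
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def run_task(partition):
    """One task: process a single partition (here, sum its elements)."""
    return sum(partition)


partitions = split_into_partitions(list(range(10)), 3)

# The thread pool plays the role of an executor; each submitted call is one task.
with ThreadPoolExecutor(max_workers=2) as executor:
    partial_sums = list(executor.map(run_task, partitions))

num_tasks = len(partial_sums)  # one task per partition
total = sum(partial_sums)      # combine the per-partition results
```

Three partitions yield three tasks, regardless of how many worker threads are available to run them.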

Job

A parallel computation made up of multiple tasks. A job is spawned each time an action is encountered.

Stage

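Each job is divided into smaller sets of tasks; each such set is a stage.

One application: has one or more jobs.

One job: has one or more stages.

One stage: consists of one or more tasks, and these tasks correspond one-to-one with partitions. Stage splitting rule: a stage boundary is cut wherever a shuffle operator is encountered.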

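Example: submitting the simplest possible word count program from spark-shell.

The spark-shell session itself is the Application, and the cluster manager setting is local mode. Tasks correspond to partitions. The job is triggered by collect. A job consists of 1 to N stages, and a stage consists of 1 to N tasks that correspond one-to-one with partitions. Stage splitting rule: a stage is cut wherever a shuffle operator is encountered.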
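The stage-splitting rule for the word count job can be modeled in a few lines of plain Python. This is a simplified sketch rather than Spark's scheduler: `SHUFFLE_OPS` is an illustrative subset of operators that typically shuffle, `split_into_stages` is a hypothetical helper, and the task count assumes every stage has the same number of partitions.

```python
# Illustrative subset of operators that typically cause a shuffle (assumption).
SHUFFLE_OPS = {"reduceByKey", "groupByKey", "sortByKey", "repartition"}


def split_into_stages(ops):
    """Group a linear operator chain into stages.

    A shuffle operator starts a new stage, because it has to read
    the shuffled output of everything that came before it.
    """
    stages = [[]]
    for op in ops:
        if op in SHUFFLE_OPS and stages[-1]:
            stages.append([])
        stages[-1].append(op)
    return stages


# Operator chain of a typical word count; the collect action triggers the job.
word_count = ["textFile", "flatMap", "map", "reduceByKey"]
stages = split_into_stages(word_count)

# Assumption: every stage processes the same number of partitions.
num_partitions = 2
total_tasks = len(stages) * num_partitions  # one task per partition per stage
```

Here the single shuffle operator `reduceByKey` produces two stages, so with two partitions per stage the job runs four tasks in total.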

Spark Core, session 2: 1 hour 26 minutes

There are several useful things to note about this architecture:

Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads. This has the benefit of isolating applications from each other, on both the scheduling side (each driver schedules its own tasks) and executor side (tasks from different applications run in different JVMs). However, it also means that data cannot be shared across different Spark applications (instances of SparkContext) without writing it to an external storage system.
Spark is agnostic to the underlying cluster manager. As long as it can acquire executor processes, and these communicate with each other, it is relatively easy to run it even on a cluster manager that also supports other applications (e.g. Mesos/YARN).
The driver program must listen for and accept incoming connections from its executors throughout its lifetime (e.g., see spark.driver.port in the network config section). As such, the driver program must be network addressable from the worker nodes.
Because the driver schedules tasks on the cluster, it should be run close to the worker nodes, preferably on the same local area network. If you’d like to send requests to the cluster remotely, it’s better to open an RPC to the driver and have it submit operations from nearby than to run a driver far away from the worker nodes.

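1. Each application has its own executor processes, which stay up for the whole application and run tasks in multiple threads.

This isolates applications from each other on both the scheduling side (each driver schedules its own tasks) and the executor side (tasks from different applications run in different JVMs).

The cost is that data cannot be shared between different Spark applications (SparkContext instances) without writing it to an external storage system. (Alluxio can currently address this.)

2. Spark is independent of the underlying cluster manager. As long as it can acquire executor processes and they can communicate with each other, it is relatively easy to run even on a cluster manager that also supports other applications (e.g. Mesos/YARN).

3. The driver must listen for and accept incoming connections from its executors throughout its lifetime (see, for example, spark.driver.port in the network configuration section). The driver must therefore be network addressable from the worker nodes.

4. Because the driver schedules tasks on the cluster, it should run close to the worker nodes, preferably on the same local area network. To send requests to the cluster remotely, it is better to open an RPC to the driver and have it submit operations from nearby than to run the driver far away from the worker nodes.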

While Spark programs run, data cannot be shared between different applications. One framework that can address this is Alluxio.