Seatunnel Source Code Analysis (4) - Launching the Spark/Flink Program

This article walks through, at the source level, how Seatunnel launches Spark and Flink applications, from creating an instance of the Execution interface to organizing the Source, Transform and Sink flow. During startup, Seatunnel checks and prepares each plugin, then reads, transforms and writes data in the order given in the configuration file. It also looks at how downstream transforms behave in edge cases, such as when the first Source produces no data.



Requirements

While adopting Seatunnel, our company plans to integrate it into an internal platform and provide visual operation on top of it.
That leads to the following requirements:

  1. Launch a Seatunnel application through a web API, passing in parameters
  2. Customize logging and collect metrics; so far this includes the application's inbound/outbound traffic, start time, end time, and so on
  3. After a job finishes, automatically collect its logs from YARN by applicationId (collecting them by hand is tedious, and the logs expire before long); see the sketch after this list
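As a rough illustration of requirement 3, the sketch below shells out to Hadoop's yarn logs -applicationId command and redirects the aggregated logs into a local file. It assumes YARN log aggregation is enabled; the class name, application id and output path are placeholders, not part of Seatunnel.

import java.io.File;
import java.io.IOException;

public class YarnLogCollector {

    // Run `yarn logs -applicationId <id>` and write its stdout into the target file
    public static void collect(String applicationId, File target) throws IOException, InterruptedException {
        Process p = new ProcessBuilder("yarn", "logs", "-applicationId", applicationId)
                .redirectOutput(target)
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        int exit = p.waitFor();
        if (exit != 0) {
            throw new IOException("yarn logs exited with code " + exit);
        }
    }

    public static void main(String[] args) throws Exception {
        collect("application_1650000000000_0001", new File("app.log"));
    }
}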

Environment

  1. Seatunnel: 2.0.5

The official 2.x release has not been published yet, so it has to be built from source.
Clone the official source from GitHub and open it in IDEA:
GitHub: https://github.com/apache/incubator-seatunnel
Official site: http://seatunnel.incubator.apache.org/
In IDEA's Terminal, package with Maven: mvn clean install -Dmaven.test.skip=true
Packaging takes roughly ten-odd minutes; when it finishes, the packaged *.tar.gz installation archive can be found under the target directory of the seatunnel-dist module.

  2. Spark: 2.4.8
  3. Hadoop: 2.7

Series Navigation

Seatunnel Source Code Analysis (1) - Launching the Application
Seatunnel Source Code Analysis (2) - Loading the Configuration File
Seatunnel Source Code Analysis (3) - Loading Plugins
Seatunnel Source Code Analysis (4) - Launching the Spark/Flink Program
Seatunnel Source Code Analysis (5) - Changing the Startup LOGO

Overview

This chapter reads through the source to show how Seatunnel assembles the already-created Source, Transform and Sink plugins into a complete Spark/Flink application and submits it for execution.

Executing the Spark/Flink Application

  • entryPoint
public class Seatunnel {

    ...
    private static void entryPoint(String configFile, Engine engine) throws Exception {

        // Load and parse the configuration from the .conf file path and wrap it in a ConfigBuilder
        ConfigBuilder configBuilder = new ConfigBuilder(configFile, engine);
        // Use the ConfigBuilder to load the Source, Transform and Sink plugins declared in the config file
        List<BaseSource> sources = configBuilder.createPlugins(PluginType.SOURCE);
        List<BaseTransform> transforms = configBuilder.createPlugins(PluginType.TRANSFORM);
        List<BaseSink> sinks = configBuilder.createPlugins(PluginType.SINK);
        // Use the ConfigBuilder to create the Execution for the target engine (Spark/Flink) and execution mode (batch/streaming)
        Execution execution = configBuilder.createExecution();
        // Run each plugin's own configuration-check logic
        baseCheckConfig(sources, transforms, sinks);
        // Run each plugin's own pre-execution initialization logic
        prepare(configBuilder.getEnv(), sources, transforms, sinks);
        // Print the startup LOGO
        showAsciiLogo();
        // The Execution submits the Spark/Flink application
        execution.start(sources, transforms, sinks);
    }
}
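For context, the .conf file that entryPoint receives is organized into env / source / transform / sink blocks, which is exactly the plugin order the code above follows. The minimal example below is illustrative only; plugin names such as Fake and Console and their exact options vary across versions and are not taken from the official docs.

env {
  spark.app.name = "SeaTunnel-example"
  spark.executor.instances = 2
}

source {
  Fake {
    result_table_name = "fake"
  }
}

transform {
  sql {
    sql = "select * from fake"
  }
}

sink {
  Console {}
}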
  • Execution interface

The Execution interface exposes a start() method that launches a concrete Seatunnel job (Spark/Flink).
start() defines the actual flow of the sources reading data, the transforms converting it, and the sinks writing it out, i.e. a complete Spark/Flink program.
The Execution interface extends the Plugin interface and has five implementation classes.
[Figure: class diagram of the Execution interface and its five implementation classes]

/**
 * the SeaTunnel job's execution context
 */
public interface Execution<SR extends BaseSource, TF extends BaseTransform, SK extends BaseSink> extends Plugin<Void> {
   

    /**
     * start to execute the SeaTunnel job
     *
     * @param sources    source plugin list
     * @param transforms transform plugin list
     * @param sinks      sink plugin list
     */
    void start(List<SR> sources, List<TF> transforms, List<SK> sinks) throws Exception;
}
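To make the flow inside start() concrete, here is a simplified, self-contained sketch of how a Spark batch Execution might wire the plugins together. The SparkSource / SparkTransform / SparkSink interfaces are stand-ins for illustration, not the actual Seatunnel plugin API.

import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkBatchExecutionSketch {

    interface SparkSource { Dataset<Row> read(SparkSession spark); }
    interface SparkTransform { Dataset<Row> apply(SparkSession spark, Dataset<Row> in); }
    interface SparkSink { void write(Dataset<Row> out); }

    // Wire the plugins together in config-file order:
    // read from the first source, run the transforms in sequence, then write to every sink.
    public void start(SparkSession spark,
                      List<SparkSource> sources,
                      List<SparkTransform> transforms,
                      List<SparkSink> sinks) {
        Dataset<Row> ds = sources.get(0).read(spark);   // the first source drives the main flow
        for (SparkTransform t : transforms) {
            ds = t.apply(spark, ds);                    // each transform consumes the previous result
        }
        for (SparkSink s : sinks) {
            s.write(ds);                                // every sink receives the final dataset
        }
    }
}

Note that with this wiring, a first source that yields an empty Dataset still passes through every transform as an empty Dataset, which is the edge case the abstract alludes to.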