12_Flink Streaming cluster

Everything before this point was about writing against the API and building the topology. Once the job is submitted to the cluster, the topology actually runs. The runtime modules are mostly written in Scala, probably because communication is done with Akka; the earlier DAG-construction code is mostly Java.

Flink has two mini-cluster classes, LocalFlinkMiniCluster and FlinkMiniCluster: local execution uses LocalFlinkMiniCluster, cluster execution uses FlinkMiniCluster. LocalFlinkMiniCluster simulates distributed computation with multiple threads inside a single JVM, so there is little value in studying it. We therefore look at FlinkMiniCluster.
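
To see the local variant in action, here is a minimal sketch (the job logic is illustrative): createLocalEnvironment is the standard entry point for in-JVM execution and spins up a local mini cluster instead of connecting to a remote JobManager.

import org.apache.flink.streaming.api.scala._

// A tiny job run against the in-JVM mini cluster
// (a LocalFlinkMiniCluster in this era of Flink).
object LocalRun extends App {
  val env = StreamExecutionEnvironment.createLocalEnvironment(2)

  env.fromElements("to", "be", "or", "not", "to", "be")
    .map(word => (word, 1))
    .keyBy(_._1)
    .sum(1)
    .print()

  env.execute("local word count")
}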

FlinkMiniCluster

Before digging into the cluster, first a look at Flink's architecture: a Flink runtime cluster consists of one JobManager (in non-HA setups) and multiple TaskManagers.

JobManager: accepts client requests and centrally manages the TaskManagers, similar to the relationship between nimbus and the workers in Storm.

TaskManager: manages the execution of Tasks.

Below they are abbreviated as JM and TM.
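
In a standalone deployment the JM/TM split shows up directly in the configuration. A hedged sketch of the relevant conf/flink-conf.yaml keys (host name and values are illustrative):

# conf/flink-conf.yaml (illustrative values)
jobmanager.rpc.address: master-host    # where the single JM listens
jobmanager.rpc.port: 6123
taskmanager.numberOfTaskSlots: 4       # task slots offered by each TM
parallelism.default: 2

# conf/slaves additionally lists one TaskManager host per line.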

Communication between JM and TM is implemented with Akka. Both mix in the FlinkActor trait and exchange different Message types, dispatched through the handleMessage method:


/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.flink.runtime

import _root_.akka.actor.Actor
import grizzled.slf4j.Logger

/** Base trait for Flink's actors.
  *
  * The message handling logic is defined in the handleMessage method. This allows to mixin
  * stackable traits which change the message receiving behaviour.
  */
trait FlinkActor extends Actor {
  val log: Logger

  override def receive: Receive = handleMessage

  /** Handle incoming messages
    *
    * @return
    */
  def handleMessage: Receive

  /** Factory method for messages. This method can be used by mixins to decorate messages
    *
    * @param message The message to decorate
    * @return The decorated message
    */
  def decorateMessage(message: Any): Any = {
    message
  }
}
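
The doc comment above mentions stackable mixin traits. A self-contained toy sketch of the pattern (toy names, not Flink code; Flink's LeaderSessionMessageFilter works along these lines, wrapping messages with a leader session ID):

import akka.actor.{Actor, ActorSystem, Props}

// Toy analog of FlinkActor: receive just delegates to handleMessage.
trait BaseActor extends Actor {
  override def receive: Receive = handleMessage
  def handleMessage: Receive
  def decorateMessage(message: Any): Any = message
}

// Mixin that decorates every message with a session ID, in the spirit
// of the decorateMessage hook above.
trait SessionTagging extends BaseActor {
  def sessionId: String
  override def decorateMessage(message: Any): Any =
    (sessionId, super.decorateMessage(message))
}

class EchoActor extends BaseActor with SessionTagging {
  val sessionId = "session-42"
  def handleMessage: Receive = {
    case msg => println(s"echoing ${decorateMessage(msg)}")
  }
}

object MixinDemo extends App {
  val system = ActorSystem("demo")
  system.actorOf(Props[EchoActor], "echo") ! "hello"
  Thread.sleep(500)
  system.terminate()
}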

TaskManager

/**
   * Central handling of actor messages. This method delegates to the more specialized
   * methods for handling certain classes of messages.
   */
  override def handleMessage: Receive = {
    // task messages are most common and critical, we handle them first
    case message: TaskMessage => handleTaskMessage(message)

    // messages for coordinating checkpoints
    case message: AbstractCheckpointMessage => handleCheckpointingMessage(message)

    case JobManagerLeaderAddress(address, newLeaderSessionID) =>
      handleJobManagerLeaderAddress(address, newLeaderSessionID)

    // registration messages for connecting and disconnecting from / to the JobManager
    case message: RegistrationMessage => handleRegistrationMessage(message)

    // task sampling messages
    case message: StackTraceSampleMessages => handleStackTraceSampleMessage(message)

    // ----- miscellaneous messages ----

    // periodic heart beats that transport metrics
    case SendHeartbeat => sendHeartbeatToJobManager()

    // sends the stack trace of this TaskManager to the sender
    case SendStackTrace => sendStackTrace(sender())

    // registers the message sender to be notified once this TaskManager has completed
    // its registration at the JobManager
    case NotifyWhenRegisteredAtJobManager =>
      if (isConnected) {
        sender ! decorateMessage(RegisteredAtJobManager)
      } else {
        waitForRegistration += sender
      }

    // this message indicates that some actor watched by this TaskManager has died
    case Terminated(actor: ActorRef) =>
      if (isConnected && actor == currentJobManager.orNull) {
          handleJobManagerDisconnect(sender(), "JobManager is no longer reachable")
          triggerTaskManagerRegistration()
      } else {
        log.warn(s"Received unrecognized disconnect message " +
            s"from ${if (actor == null) null else actor.path}.")
      }

    case Disconnect(msg) =>
      handleJobManagerDisconnect(sender(), s"JobManager requested disconnect: $msg")
      triggerTaskManagerRegistration()

    case msg: StopCluster =>
      log.info(s"Stopping TaskManager with final application status ${msg.finalStatus()} " +
        s"and diagnostics: ${msg.message()}")
      shutdown()

    case FatalError(message, cause) =>
      killTaskManagerFatal(message, cause)

    case RequestTaskManagerLog(requestType : LogTypeRequest) =>
      blobService match {
        case Some(_) =>
          handleRequestTaskManagerLog(sender(), requestType, currentJobManager.get)
        case None =>
          sender() ! new IOException("BlobService not available. Cannot upload TaskManager logs.")
      }
  }

  /**
   * Handle unmatched messages with an exception.
   */
  override def unhandled(message: Any): Unit = {
    val errorMessage = "Received unknown message " + message
    val error = new RuntimeException(errorMessage)
    log.error(errorMessage)

    // terminate all we are currently running (with a dedicated message)
    // before the actor is stopped
    cancelAndClearEverything(error)

    // let the actor crash
    throw error
  }
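
The periodic SendHeartbeat above is driven by Akka's scheduler delivering the actor a message to itself. A minimal self-contained illustration of that mechanism (toy actor, not the real TaskManager wiring):

import akka.actor.{Actor, ActorSystem, Props}
import scala.concurrent.duration._

case object SendHeartbeat

class HeartbeatingActor extends Actor {
  import context.dispatcher

  // Periodic self-message; the TaskManager uses the same mechanism to
  // trigger sendHeartbeatToJobManager() at a fixed interval.
  private val heartbeatTask =
    context.system.scheduler.schedule(1.second, 1.second, self, SendHeartbeat)

  override def postStop(): Unit = heartbeatTask.cancel()

  def receive: Receive = {
    case SendHeartbeat => println("heartbeat")
  }
}

object HeartbeatDemo extends App {
  val system = ActorSystem("hb")
  system.actorOf(Props[HeartbeatingActor], "heartbeater")
  Thread.sleep(3500)
  system.terminate()
}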

JobManager

/**
   * Central work method of the JobManager actor. Receives messages and reacts to them.
   *
   * @return
   */
  override def handleMessage: Receive = {

    case GrantLeadership(newLeaderSessionID) =>
      log.info(s"JobManager $getAddress was granted leadership with leader session ID " +
        s"$newLeaderSessionID.")

      leaderSessionID = newLeaderSessionID

      // confirming the leader session ID might be blocking, thus do it in a future
      future {
        leaderElectionService.confirmLeaderSessionID(newLeaderSessionID.orNull)

        // TODO (critical next step) This needs to be more flexible and robust (e.g. wait for task
        // managers etc.)
        if (recoveryMode != RecoveryMode.STANDALONE) {
          log.info(s"Delaying recovery of all jobs by $jobRecoveryTimeout.")

          context.system.scheduler.scheduleOnce(
            jobRecoveryTimeout,
            self,
            decorateMessage(RecoverAllJobs))(
            context.dispatcher)
        }
      }(context.dispatcher)

    // ... the remaining cases (RegisterTaskManager, SubmitJob, CancelJob,
    // RequestJobStatus, etc.) are omitted here
  }
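
Note the future { ... }(context.dispatcher) wrapper: confirming the leader session ID may block, so the JobManager pushes it off the actor's message-processing thread. The same pattern in isolation (toy names):

import akka.actor.{Actor, ActorSystem, Props}
import scala.concurrent.Future

case object DoSlowWork

class NonBlockingActor extends Actor {
  import context.dispatcher // ExecutionContext for the Future

  def receive: Receive = {
    case DoSlowWork =>
      // Run the potentially blocking call off the actor thread so the
      // mailbox keeps draining, as handleMessage does above.
      Future {
        Thread.sleep(1000) // stand-in for a blocking call
        println("slow work done")
      }
  }
}

object FutureDemo extends App {
  val system = ActorSystem("demo")
  system.actorOf(Props[NonBlockingActor]) ! DoSlowWork
  Thread.sleep(2000)
  system.terminate()
}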