Another pitfall! What to do when DolphinScheduler reports the wrong status for Flink tasks

Latest recommended article published on 2025-02-07 11:35:24

fox_初始化


Tags: flink, big data, yarn

Article link: https://blog.youkuaiyun.com/fox_233/article/details/143021642


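Background: Our company uses DolphinScheduler 3.1.4 to schedule offline jobs, and our Flink jobs run on YARN. While running Flink tasks, we noticed that DolphinScheduler showed the task as successful even though the logs contained obvious error output such as fail and ERROR.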

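Investigation: Because the job runs on YARN, the first step is to take the applicationId from the task log and look up the job's execution status in the YARN management UI.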
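Pulling the applicationId out of a task log can be done with a regular expression. The following is a minimal sketch, assuming the log contains IDs in the usual YARN form `application_<timestamp>_<sequence>`; the class name `AppIdExtractor` and the sample log lines are made up for illustration.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AppIdExtractor {
    // YARN application IDs look like application_<clusterTimestamp>_<sequence>.
    private static final Pattern APP_ID = Pattern.compile("application_\\d+_\\d+");

    /** Returns the distinct application IDs found in the log text, in order of first appearance. */
    static Set<String> extract(String logText) {
        Set<String> ids = new LinkedHashSet<>();
        Matcher m = APP_ID.matcher(logText);
        while (m.find()) {
            ids.add(m.group());
        }
        return ids;
    }

    public static void main(String[] args) {
        // Hypothetical log excerpt, not taken from a real run.
        String log = "INFO  Submitted application application_1700000000000_0042\n"
                + "ERROR job failed\n";
        System.out.println(extract(log));
    }
}
```

A set is used because the same ID usually appears many times in one log.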

YARN application UI: http://XXXXXXX:8088/cluster

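The YARN application had in fact failed, yet the scheduler reported success. Following that lead, I searched for the cause. In short: when DolphinScheduler schedules a Flink batch job, the task status flips to success as soon as the job is submitted, while the job itself is still running in the background. That is why a task sometimes shows success after running for only a few seconds. The job may have failed during initialization, or failed later because of bad data, but by then the scheduled task has already finished and returned success.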
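The same check can be done without the web UI: the YARN ResourceManager exposes a REST endpoint, `/ws/v1/cluster/apps/{appId}`, whose JSON response carries a `state` and a `finalStatus` field. The sketch below only builds the URL and reads those two fields from a response body. Fetching the body is left to whatever HTTP client is at hand, the class name `YarnAppStatus` is hypothetical, and the regex extraction stands in for a proper JSON parser to keep the example dependency-free.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class YarnAppStatus {
    // Simplified field extraction; a real implementation would use a JSON library.
    private static final Pattern STATE =
            Pattern.compile("\"state\"\\s*:\\s*\"([A-Z_]+)\"");
    private static final Pattern FINAL_STATUS =
            Pattern.compile("\"finalStatus\"\\s*:\\s*\"([A-Z_]+)\"");

    /** Builds the ResourceManager REST URL for one application. */
    static String restUrl(String rmAddress, String appId) {
        return rmAddress + "/ws/v1/cluster/apps/" + appId;
    }

    /** Returns the captured field value, or null if the field is absent. */
    static String field(Pattern pattern, String json) {
        Matcher m = pattern.matcher(json);
        return m.find() ? m.group(1) : null;
    }

    /** True only when the application has finished and its final status is SUCCEEDED. */
    static boolean succeeded(String json) {
        return "FINISHED".equals(field(STATE, json))
                && "SUCCEEDED".equals(field(FINAL_STATUS, json));
    }
}
```

Checking both fields matters: an application that is still running reports `finalStatus` as `UNDEFINED`, and a finished one can still end as `FAILED` or `KILLED`.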

To sum up, the problem is that DolphinScheduler does not set the final status of the scheduled task from the outcome of the Flink job on YARN.

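Solution: With the cause known, I first checked the official DolphinScheduler site for a configuration option that would tie the task result to the YARN application result. There was none.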

So the only option was to modify the source. First pull the DolphinScheduler 3.1.4 source from GitHub; if GitHub is unreachable, Gitee works too.

Once it is downloaded, since we are using the Flink task node, go straight to

dolphinscheduler-task-plugin/dolphinscheduler-task-flink/src/main/java/org/apache/dolphinscheduler/plugin/task/flink/FlinkTask.java

FlinkTask extends AbstractYarnTask, so the next stop is the AbstractYarnTask class.

As you can see, it really does just submit the job and then return the exit code of the shell script as the scheduling status.


The submitApplication and trackApplicationStatus methods below it are still TODO... sad.

AbstractYarnTask extends AbstractRemoteTask, which shows that separating task submission from status tracking was already planned.

I quickly checked the latest version to see whether this had been fixed. Unfortunately, it is still TODO.

The fix is to move the logic that currently lives in handle into submitApplication, and then implement the trackApplicationStatus method.
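The tracking half of that split boils down to a polling loop: keep asking YARN for the application's final status until it is terminal, then turn it into an exit code. The sketch below shows the idea in isolation and is not the DolphinScheduler implementation. `YarnTrackSketch`, the `FinalStatus` enum, the `lookup` function, and the exit-code values are all assumptions made for the example; in the real class the lookup would query YARN and the constants would come from TaskConstants.

```java
import java.util.function.Function;

final class YarnTrackSketch {
    // Placeholder exit codes for this sketch only.
    static final int EXIT_CODE_SUCCESS = 0;
    static final int EXIT_CODE_FAILURE = -1;

    /** Hypothetical enum mirroring the values of YARN's finalStatus field. */
    enum FinalStatus { UNDEFINED, SUCCEEDED, FAILED, KILLED }

    /**
     * Polls the status lookup until the application reaches a terminal status
     * or maxPolls attempts have been made. Running out of attempts counts as failure.
     */
    static int track(String appId,
                     Function<String, FinalStatus> lookup,
                     long sleepMillis,
                     int maxPolls) {
        for (int i = 0; i < maxPolls; i++) {
            FinalStatus status = lookup.apply(appId);
            if (status == FinalStatus.SUCCEEDED) {
                return EXIT_CODE_SUCCESS;
            }
            if (status == FinalStatus.FAILED || status == FinalStatus.KILLED) {
                return EXIT_CODE_FAILURE;
            }
            try {
                // Still running (UNDEFINED): wait before asking again.
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and treat the task as failed.
                Thread.currentThread().interrupt();
                return EXIT_CODE_FAILURE;
            }
        }
        return EXIT_CODE_FAILURE;
    }
}
```

Passing the lookup in as a function keeps the loop testable without a cluster. The poll limit is a design choice for the sketch; a long-running batch job would more likely be bounded by the task's own timeout.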

Here is the code:

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.dolphinscheduler.plugin.task.api;

import static org.apache.dolphinscheduler.common.constants.Constants.*;
import static org.apache.dolphinscheduler.plugin.task.api.TaskConstants.*;
import static org.apache.dolphinscheduler.plugin.task.api.TaskConstants.EXIT_CODE_FAILURE;
import static org.apache.dolphinscheduler.plugin.task.api.TaskConstants.EXIT_CODE_SUCCESS;
import static org.apache.dolphinscheduler.plugin.task.api.TaskConstants.HADOOP_SECURITY_AUTHENTICATION_STARTUP_STATE;
import static org.apache.dolphinscheduler.plugin.task.api.TaskConstants.SLEEP_TIME_MILLIS;

import org.apache.dolphinscheduler.common.constants.Constants;
import org.apache.dolphinscheduler.common.exception.BaseException;
import org.apache.dolphinscheduler.common.utils.HttpUtils;
import org.apache.dolphinscheduler.common.utils.JSONUtils;
import org.apache.dolphinscheduler.common.utils.KerberosHttpClient;
import org.apache.dolphinscheduler.common.utils.PropertyUtils;
import org.apache.dolphinscheduler.plugin.task.api.model.ApplicationInfo;
import org.apache.dolphinscheduler.plugin.task.api.model.ResourceInfo;
import org.apache.dolphinscheduler.plugin.task.api.model.TaskResponse;
import org.apache.dolphinscheduler.plugin.task.api.utils.LogUtils;

import org.apache.commons.lang3.StringUtils;

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.TimeUnit;
import java.util.regex.Pattern;
import java.util.stream.Stream;

import lombok.Data;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.fasterxml.jackson.databind.no
