10_Flink Streaming jobSubmit

This article takes a detailed look at how a Flink streaming job is submitted. After a job is submitted from the shell, Flink sends the JobGraph and the jar files to the JobManager, which plays a role similar to Storm's Nimbus: it splits the job into tasks and has them run on the TaskManagers. The JobManager converts the JobGraph into an ExecutionGraph and starts execution based on the optimized plan.

./bin/flink run ./examples/batch/WordCount.jar

After the job is submitted from the shell, Flink hands the JobGraph generated by the program, together with the jar files, to the JobManager (JM for short). The JobManager (comparable to Nimbus) then distributes the tasks to the TaskManagers (one TaskManager per JVM) for execution.

The JM converts the JobGraph into an ExecutionGraph. On the client side, Client.getOptimizedPlan obtains the optimized plan, and execution is finally started through JobClient.submitJobAndWait.
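To make the flow concrete, here is a minimal, self-contained sketch of what ./bin/flink run does internally, written against the CliFrontend and Client APIs quoted below (the jar path, JobManager host, and port are placeholders, not values from the original text):

import java.io.File;

import org.apache.flink.api.common.JobSubmissionResult;
import org.apache.flink.client.program.Client;
import org.apache.flink.client.program.PackagedProgram;
import org.apache.flink.configuration.ConfigConstants;
import org.apache.flink.configuration.Configuration;

public class SubmitWordCount {

	public static void main(String[] args) throws Exception {
		// Wrap the user jar, as CliFrontend.buildProgram() does after parsing options.
		PackagedProgram program = new PackagedProgram(new File("examples/batch/WordCount.jar"));

		// The Client resolves the JobManager address from the configuration;
		// the host and port below are placeholders for a local cluster.
		Configuration config = new Configuration();
		config.setString(ConfigConstants.JOB_MANAGER_IPC_ADDRESS_KEY, "localhost");
		config.setInteger(ConfigConstants.JOB_MANAGER_IPC_PORT_KEY, 6123);

		Client client = new Client(config);
		try {
			// Blocking submission: optimize the plan, build the JobGraph,
			// ship it together with the jar to the JobManager, and wait.
			JobSubmissionResult result = client.runBlocking(program, 1);
			System.out.println("Job finished, JobID: " + result.getJobID());
		}
		finally {
			client.shutdown();
			program.deleteExtractedLibraries();
		}
	}
}

The entry point on the command line is the bin/flink launcher script, shown next.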


#!/usr/bin/env bash
################################################################################
#  Licensed to the Apache Software Foundation (ASF) under one
#  or more contributor license agreements.  See the NOTICE file
#  distributed with this work for additional information
#  regarding copyright ownership.  The ASF licenses this file
#  to you under the Apache License, Version 2.0 (the
#  "License"); you may not use this file except in compliance
#  with the License.  You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
# limitations under the License.
################################################################################

target="$0"
# For the case, the executable has been directly symlinked, figure out
# the correct bin path by following its symlink up to an upper bound.
# Note: we can't use the readlink utility here if we want to be POSIX
# compatible.
iteration=0
while [ -L "$target" ]; do
    if [ "$iteration" -gt 100 ]; then
        echo "Cannot resolve path: You have a cyclic symlink in $target."
        break
    fi
    ls=`ls -ld -- "$target"`
    target=`expr "$ls" : '.* -> \(.*\)$'`
    iteration=$((iteration + 1))
done

# Convert relative path to absolute path
bin=`dirname "$target"`

# get flink config
. "$bin"/config.sh

if [ "$FLINK_IDENT_STRING" = "" ]; then
        FLINK_IDENT_STRING="$USER"
fi

CC_CLASSPATH=`constructFlinkClassPath`

log=$FLINK_LOG_DIR/flink-$FLINK_IDENT_STRING-client-$HOSTNAME.log
log_setting=(-Dlog.file="$log" -Dlog4j.configuration=file:"$FLINK_CONF_DIR"/log4j-cli.properties -Dlogback.configurationFile=file:"$FLINK_CONF_DIR"/logback.xml)

export FLINK_ROOT_DIR
export FLINK_CONF_DIR

# Add HADOOP_CLASSPATH to allow the usage of Hadoop file systems
$JAVA_RUN $JVM_ARGS "${log_setting[@]}" -classpath "`manglePathList "$CC_CLASSPATH:$INTERNAL_HADOOP_CLASSPATHS"`" org.apache.flink.client.CliFrontend "$@"
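In effect, the last line launches a JVM running org.apache.flink.client.CliFrontend with the user's arguments; everything before it merely resolves symlinks to locate the installation directory, sources config.sh, and assembles the classpath and log settings. The source of CliFrontend follows.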



/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.flink.client;

import akka.actor.ActorSystem;

import org.apache.commons.cli.CommandLine;

import org.apache.flink.api.common.InvalidProgramException;
import org.apache.flink.api.common.JobExecutionResult;
import org.apache.flink.api.common.JobID;
import org.apache.flink.api.common.JobSubmissionResult;
import org.apache.flink.api.common.accumulators.AccumulatorHelper;
import org.apache.flink.client.cli.CancelOptions;
import org.apache.flink.client.cli.CliArgsException;
import org.apache.flink.client.cli.CliFrontendParser;
import org.apache.flink.client.cli.CommandLineOptions;
import org.apache.flink.client.cli.InfoOptions;
import org.apache.flink.client.cli.ListOptions;
import org.apache.flink.client.cli.ProgramOptions;
import org.apache.flink.client.cli.RunOptions;
import org.apache.flink.client.cli.SavepointOptions;
import org.apache.flink.client.cli.StopOptions;
import org.apache.flink.client.program.Client;
import org.apache.flink.client.program.PackagedProgram;
import org.apache.flink.client.program.ProgramInvocationException;
import org.apache.flink.configuration.ConfigConstants;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.GlobalConfiguration;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.optimizer.DataStatistics;
import org.apache.flink.optimizer.Optimizer;
import org.apache.flink.optimizer.costs.DefaultCostEstimator;
import org.apache.flink.optimizer.plan.FlinkPlan;
import org.apache.flink.optimizer.plan.OptimizedPlan;
import org.apache.flink.optimizer.plan.StreamingPlan;
import org.apache.flink.optimizer.plandump.PlanJSONDumpGenerator;
import org.apache.flink.runtime.akka.AkkaUtils;
import org.apache.flink.runtime.client.JobStatusMessage;
import org.apache.flink.runtime.instance.ActorGateway;
import org.apache.flink.runtime.jobgraph.JobStatus;
import org.apache.flink.runtime.leaderretrieval.LeaderRetrievalService;
import org.apache.flink.runtime.messages.JobManagerMessages;
import org.apache.flink.runtime.messages.JobManagerMessages.CancelJob;
import org.apache.flink.runtime.messages.JobManagerMessages.CancellationFailure;
import org.apache.flink.runtime.messages.JobManagerMessages.RunningJobsStatus;
import org.apache.flink.runtime.messages.JobManagerMessages.StopJob;
import org.apache.flink.runtime.messages.JobManagerMessages.StoppingFailure;
import org.apache.flink.runtime.messages.JobManagerMessages.TriggerSavepoint;
import org.apache.flink.runtime.messages.JobManagerMessages.TriggerSavepointSuccess;
import org.apache.flink.runtime.security.SecurityUtils;
import org.apache.flink.runtime.util.EnvironmentInformation;
import org.apache.flink.runtime.util.LeaderRetrievalUtils;
import org.apache.flink.runtime.yarn.AbstractFlinkYarnClient;
import org.apache.flink.runtime.yarn.AbstractFlinkYarnCluster;
import org.apache.flink.runtime.yarn.FlinkYarnClusterStatus;
import org.apache.flink.util.StringUtils;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import scala.Some;
import scala.concurrent.Await;
import scala.concurrent.Future;
import scala.concurrent.duration.FiniteDuration;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.Date;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.TimeUnit;

import static org.apache.flink.runtime.messages.JobManagerMessages.DisposeSavepoint;
import static org.apache.flink.runtime.messages.JobManagerMessages.DisposeSavepointFailure;
import static org.apache.flink.runtime.messages.JobManagerMessages.TriggerSavepointFailure;

/**
 * Implementation of a simple command line frontend for executing programs.
 */
public class CliFrontend {

	// actions
	public static final String ACTION_RUN = "run";
	public static final String ACTION_INFO = "info";
	private static final String ACTION_LIST = "list";
	private static final String ACTION_CANCEL = "cancel";
	private static final String ACTION_STOP = "stop";
	private static final String ACTION_SAVEPOINT = "savepoint";

	// config dir parameters
	private static final String ENV_CONFIG_DIRECTORY = "FLINK_CONF_DIR";
	private static final String CONFIG_DIRECTORY_FALLBACK_1 = "../conf";
	private static final String CONFIG_DIRECTORY_FALLBACK_2 = "conf";

	// YARN-session related constants
	public static final String YARN_PROPERTIES_FILE = ".yarn-properties-";
	public static final String YARN_PROPERTIES_JOBMANAGER_KEY = "jobManager";
	public static final String YARN_PROPERTIES_PARALLELISM = "parallelism";
	public static final String YARN_PROPERTIES_DYNAMIC_PROPERTIES_STRING = "dynamicPropertiesString";

	public static final String YARN_DYNAMIC_PROPERTIES_SEPARATOR = "@@"; // this has to be a regex for String.split()

	/**
	 * A special host name used to run a job by deploying Flink into a YARN cluster,
	 * if this string is specified as the JobManager address
	 */
	public static final String YARN_DEPLOY_JOBMANAGER = "yarn-cluster";

	// --------------------------------------------------------------------------------------------
	// --------------------------------------------------------------------------------------------

	private static final Logger LOG = LoggerFactory.getLogger(CliFrontend.class);


	private final Configuration config;

	private final FiniteDuration clientTimeout;

	private final FiniteDuration lookupTimeout;

	private ActorSystem actorSystem;

	private AbstractFlinkYarnCluster yarnCluster;

	/**
	 *
	 * @throws Exception Thrown if the configuration directory was not found, the configuration could not
	 *                   be loaded, or the YARN properties could not be parsed.
	 */
	public CliFrontend() throws Exception {
		this(getConfigurationDirectoryFromEnv());
	}

	public CliFrontend(String configDir) throws Exception {

		// configure the config directory
		File configDirectory = new File(configDir);
		LOG.info("Using configuration directory " + configDirectory.getAbsolutePath());

		// load the configuration
		LOG.info("Trying to load configuration file");
		GlobalConfiguration.loadConfiguration(configDirectory.getAbsolutePath());
		this.config = GlobalConfiguration.getConfiguration();

		// load the YARN properties
		String defaultPropertiesFileLocation = System.getProperty("java.io.tmpdir");
		String currentUser = System.getProperty("user.name");
		String propertiesFileLocation = config.getString(ConfigConstants.YARN_PROPERTIES_FILE_LOCATION, defaultPropertiesFileLocation);

		File propertiesFile = new File(propertiesFileLocation, CliFrontend.YARN_PROPERTIES_FILE + currentUser);
		if (propertiesFile.exists()) {

			logAndSysout("Found YARN properties file " + propertiesFile.getAbsolutePath());

			Properties yarnProperties = new Properties();
			try {
				try (InputStream is = new FileInputStream(propertiesFile)) {
					yarnProperties.load(is);
				}
			}
			catch (IOException e) {
				throw new Exception("Cannot read the YARN properties file", e);
			}

			// configure the default parallelism from YARN
			String propParallelism = yarnProperties.getProperty(YARN_PROPERTIES_PARALLELISM);
			if (propParallelism != null) { // maybe the property is not set
				try {
					int parallelism = Integer.parseInt(propParallelism);
					this.config.setInteger(ConfigConstants.DEFAULT_PARALLELISM_KEY, parallelism);

					logAndSysout("YARN properties set default parallelism to " + parallelism);
				}
				catch (NumberFormatException e) {
					throw new Exception("Error while parsing the YARN properties: " +
							"Property " + YARN_PROPERTIES_PARALLELISM + " is not an integer.");
				}
			}

			// get the JobManager address from the YARN properties
			String address = yarnProperties.getProperty(YARN_PROPERTIES_JOBMANAGER_KEY);
			InetSocketAddress jobManagerAddress;
			if (address != null) {
				try {
					jobManagerAddress = parseHostPortAddress(address);
					// store address in config from where it is retrieved by the retrieval service
					writeJobManagerAddressToConfig(jobManagerAddress);
				}
				catch (Exception e) {
					throw new Exception("YARN properties contain an invalid entry for JobManager address.", e);
				}

				logAndSysout("Using JobManager address from YARN properties " + jobManagerAddress);
			}

			// handle the YARN client's dynamic properties
			String dynamicPropertiesEncoded = yarnProperties.getProperty(YARN_PROPERTIES_DYNAMIC_PROPERTIES_STRING);
			Map<String, String> dynamicProperties = getDynamicProperties(dynamicPropertiesEncoded);
			for (Map.Entry<String, String> dynamicProperty : dynamicProperties.entrySet()) {
				this.config.setString(dynamicProperty.getKey(), dynamicProperty.getValue());
			}
		}

		try {
			FileSystem.setDefaultScheme(config);
		} catch (IOException e) {
			throw new Exception("Error while setting the default " +
				"filesystem scheme from configuration.", e);
		}

		this.clientTimeout = AkkaUtils.getClientTimeout(config);
		this.lookupTimeout = AkkaUtils.getLookupTimeout(config);
	}


	// --------------------------------------------------------------------------------------------
	//  Getter & Setter
	// --------------------------------------------------------------------------------------------

	/**
	 * Getter which returns a copy of the associated configuration
	 *
	 * @return Copy of the associated configuration
	 */
	public Configuration getConfiguration() {
		Configuration copiedConfiguration = new Configuration();

		copiedConfiguration.addAll(config);

		return copiedConfiguration;
	}


	// --------------------------------------------------------------------------------------------
	//  Execute Actions
	// --------------------------------------------------------------------------------------------

	/**
	 * Executes the run action.
	 * 
	 * @param args Command line arguments for the run action.
	 */
	protected int run(String[] args) {
		LOG.info("Running 'run' command.");

		RunOptions options;
		try {
			options = CliFrontendParser.parseRunCommand(args);
		}
		catch (CliArgsException e) {
			return handleArgException(e);
		}
		catch (Throwable t) {
			return handleError(t);
		}

		// evaluate help flag
		if (options.isPrintHelp()) {
			CliFrontendParser.printHelpForRun();
			return 0;
		}

		if (options.getJarFilePath() == null) {
			return handleArgException(new CliArgsException("The program JAR file was not specified."));
		}

		PackagedProgram program;
		try {
			LOG.info("Building program from JAR file");
			program = buildProgram(options);
		}
		catch (FileNotFoundException e) {
			return handleArgException(e);
		}
		catch (Throwable t) {
			return handleError(t);
		}

		int exitCode = 1;
		try {
			int userParallelism = options.getParallelism();
			LOG.debug("User parallelism is set to {}", userParallelism);

			Client client = getClient(options, program.getMainClassName(), userParallelism, options.getDetachedMode());
			client.setPrintStatusDuringExecution(options.getStdoutLogging());
			LOG.debug("Client slots is set to {}", client.getMaxSlots());

			LOG.debug("Savepoint path is set to {}", options.getSavepointPath());

			try {
				if (client.getMaxSlots() != -1 && userParallelism == -1) {
					logAndSysout("Using the parallelism provided by the remote cluster ("+client.getMaxSlots()+"). " +
							"To use another parallelism, set it at the ./bin/flink client.");
					userParallelism = client.getMaxSlots();
				}

				// detached mode
				if (options.getDetachedMode() || (yarnCluster != null && yarnCluster.isDetached())) {
					exitCode = executeProgramDetached(program, client, userParallelism);
				}
				else {
					exitCode = executeProgramBlocking(program, client, userParallelism);
				}

				// show YARN cluster status if it's not a detached YARN cluster.
				if (yarnCluster != null && !yarnCluster.isDetached()) {
					List<String> msgs = yarnCluster.getNewMessages();
					if (msgs != null && msgs.size() > 1) {

						logAndSysout("The following messages were created by the YARN cluster while running the Job:");
						for (String msg : msgs) {
							logAndSysout(msg);
						}
					}
					if (yarnCluster.hasFailed()) {
						logAndSysout("YARN cluster is in failed state!");
						logAndSysout("YARN Diagnostics: " + yarnCluster.getDiagnostics());
					}
				}

				return exitCode;
			}
			finally {
				client.shutdown();
			}
		}
		catch (Throwable t) {
			return handleError(t);
		}
		finally {
			if (yarnCluster != null && !yarnCluster.isDetached()) {
				logAndSysout("Shutting down YARN cluster");
				yarnCluster.shutdown(exitCode != 0);
			}
			if (program != null) {
				program.deleteExtractedLibraries();
			}
		}
	}

	/**
	 * Executes the info action.
	 * 
	 * @param args Command line arguments for the info action.
	 */
	protected int info(String[] args) {
		LOG.info("Running 'info' command.");

		// Parse command line options
		InfoOptions options;
		try {
			options = CliFrontendParser.parseInfoCommand(args);
		}
		catch (CliArgsException e) {
			return handleArgException(e);
		}
		catch (Throwable t) {
			return handleError(t);
		}

		// evaluate help flag
		if (options.isPrintHelp()) {
			CliFrontendParser.printHelpForInfo();
			return 0;
		}

		if (options.getJarFilePath() == null) {
			return handleArgException(new CliArgsException("The program JAR file was not specified."));
		}

		// -------- build the packaged program -------------

		PackagedProgram program;
		try {
			LOG.info("Building program from JAR file");
			program = buildProgram(options);
		}
		catch (Throwable t) {
			return handleError(t);
		}

		try {
			int parallelism = options.getParallelism();

			LOG.info("Creating program plan dump");

			Optimizer compiler = new Optimizer(new DataStatistics(), new DefaultCostEstimator(), config);
			FlinkPlan flinkPlan = Client.getOptimizedPlan(compiler, program, parallelism);
			
			String jsonPlan = null;
			if (flinkPlan instanceof OptimizedPlan) {
				jsonPlan = new PlanJSONDumpGenerator().getOptimizerPlanAsJSON((OptimizedPlan) flinkPlan);
			} else if (flinkPlan instanceof StreamingPlan) {
				jsonPlan = ((StreamingPlan) flinkPlan).getStreamingPlanAsJSON();
			}

			if (jsonPlan != null) {
				System.out.println("----------------------- Execution Plan -----------------------");
				System.out.println(jsonPlan);
				System.out.println("--------------------------------------------------------------");
			}
			else {
				System.out.println("JSON plan could not be generated.");
			}

			String description = program.getDescription();
			if (description != null) {
				System.out.println();
				System.out.println(description);
			}
			else {
				System.out.println();
				System.out.println("No description provided.");
			}
			return 0;
		}
		catch (Throwable t) {
			return handleError(t);
		}
		finally {
			program.deleteExtractedLibraries();
		}
	}

	/**
	 * Executes the list action.
	 * 
	 * @param args Command line arguments for the list action.
	 */
	protected int list(String[] args) {
		LOG.info("Running 'list' command.");

		ListOptions options;
		try {
			options = CliFrontendParser.parseListCommand(args);
		}
		catch (CliArgsException e) {
			return handleArgException(e);
		}
		catch (Throwable t) {
			return handleError(t);
		}

		// evaluate help flag
		if (options.isPrintHelp()) {
			CliFrontendParser.printHelpForList();
			return 0;
		}

		boolean running = options.getRunning();
		boolean scheduled = options.getScheduled();

		// print running and scheduled jobs if no option is supplied
		if (!running && !scheduled) {
			running = true;
			scheduled = true;
		}

		try {
			ActorGateway jobManagerGateway = getJobManagerGateway(options);

			LOG.info("Connecting to JobManager to retrieve list of jobs");
			Future<Object> response = jobManagerGateway.ask(
				JobManagerMessages.getRequestRunningJobsStatus(),
				clientTimeout);

			Object result;
			try {
				result = Await.result(response, clientTimeout);
			}
			catch (Exception e) {
				throw new Exception("Could not retrieve running jobs from the JobManager.", e);
			}

			if (result instanceof RunningJobsStatus) {
				LOG.info("Successfully retrieved list of jobs");

				List<JobStatusMessage> jobs = ((RunningJobsStatus) result).getStatusMessages();

				ArrayList<JobStatusMessage> runningJobs = null;
				ArrayList<JobStatusMessage> scheduledJobs = null;
				if (running) {
					runningJobs = new ArrayList<JobStatusMessage>();
				}
				if (scheduled) {
					scheduledJobs = new ArrayList<JobStatusMessage>();
				}

				for (JobStatusMessage rj : jobs) {
					if (running && (rj.getJobState().equals(JobStatus.RUNNING)
							|| rj.getJobState().equals(JobStatus.RESTARTING))) {
						runningJobs.add(rj);
					}
					if (scheduled && rj.getJobState().equals(JobStatus.CREATED)) {
						scheduledJobs.add(rj);
					}
				}

				SimpleDateFormat df = new SimpleDateFormat("dd.MM.yyyy HH:mm:ss");
				Comparator<JobStatusMessage> njec = new Comparator<JobStatusMessage>(){
					@Override
					public int compare(JobStatusMessage o1, JobStatusMessage o2) {
						return (int)(o1.getStartTime()-o2.getStartTime());
					}
				};

				if (running) {
					if(runningJobs.size() == 0) {
						System.out.println("No running jobs.");
					}
					else {
						Collections.sort(runningJobs, njec);

						System.out.println("------------------ Running/Restarting Jobs -------------------");
						for (JobStatusMessage rj : runningJobs) {
							System.out.println(df.format(new Date(rj.getStartTime()))
									+ " : " + rj.getJobId() + " : " + rj.getJobName() + " (" + rj.getJobState() + ")");
						}
						System.out.println("--------------------------------------------------------------");
					}
				}
				if (scheduled) {
					if (scheduledJobs.size() == 0) {
						System.out.println("No scheduled jobs.");
					}
					else {
						Collections.sort(scheduledJobs, njec);

						System.out.println("----------------------- Scheduled Jobs -----------------------");
						for(JobStatusMessage rj : scheduledJobs) {
							System.out.println(df.format(new Date(rj.getStartTime()))
									+ " : " + rj.getJobId() + " : " + rj.getJobName());
						}
						System.out.println("--------------------------------------------------------------");
					}
				}
				return 0;
			}
			else {
				throw new Exception("ReqeustRunningJobs requires a response of type " +
						"RunningJobs. Instead the response is of type " + result.getClass() + ".");
			}
		}
		catch (Throwable t) {
			return handleError(t);
		}
	}

	/**
	 * Executes the STOP action.
	 * 
	 * @param args Command line arguments for the stop action.
	 */
	protected int stop(String[] args) {
		LOG.info("Running 'stop' command.");

		StopOptions options;
		try {
			options = CliFrontendParser.parseStopCommand(args);
		}
		catch (CliArgsException e) {
			return handleArgException(e);
		}
		catch (Throwable t) {
			return handleError(t);
		}

		// evaluate help flag
		if (options.isPrintHelp()) {
			CliFrontendParser.printHelpForStop();
			return 0;
		}

		String[] stopArgs = options.getArgs();
		JobID jobId;

		if (stopArgs.length > 0) {
			String jobIdString = stopArgs[0];
			try {
				jobId = new JobID(StringUtils.hexStringToByte(jobIdString));
			}
			catch (Exception e) {
				return handleError(e);
			}
		}
		else {
			return handleArgException(new CliArgsException("Missing JobID"));
		}

		try {
			ActorGateway jobManager = getJobManagerGateway(options);
			Future<Object> response = jobManager.ask(new StopJob(jobId), clientTimeout);

			final Object rc = Await.result(response, clientTimeout);

			if (rc instanceof StoppingFailure) {
				throw new Exception("Stopping the job with ID " + jobId + " failed.",
						((StoppingFailure) rc).cause());
			}

			return 0;
		}
		catch (Throwable t) {
			return handleError(t);
		}
	}

	/**
	 * Executes the CANCEL action.
	 * 
	 * @param args Command line arguments for the cancel action.
	 */
	protected int cancel(String[] args) {
		LOG.info("Running 'cancel' command.");

		CancelOptions options;
		try {
			options = CliFrontendParser.parseCancelCommand(args);
		}
		catch (CliArgsException e) {
			return handleArgException(e);
		}
		catch (Throwable t) {
			return handleError(t);
		}

		// evaluate help flag
		if (options.isPrintHelp()) {
			CliFrontendParser.printHelpForCancel();
			return 0;
		}

		String[] cleanedArgs = options.getArgs();
		JobID jobId;

		if (cleanedArgs.length > 0) {
			String jobIdString = cleanedArgs[0];
			try {
				jobId = new JobID(StringUtils.hexStringToByte(jobIdString));
			}
			catch (Exception e) {
				LOG.error("Error: The value for the Job ID is not a valid ID.");
				System.out.println("Error: The value for the Job ID is not a valid ID.");
				return 1;
			}
		}
		else {
			LOG.error("Missing JobID in the command line arguments.");
			System.out.println("Error: Specify a Job ID to cancel a job.");
			return 1;
		}

		try {
			ActorGateway jobManager = getJobManagerGateway(options);
			Future<Object> response = jobManager.ask(new CancelJob(jobId), clientTimeout);

			final Object rc = Await.result(response, clientTimeout);

			if (rc instanceof CancellationFailure) {
				throw new Exception("Canceling the job with ID " + jobId + " failed.",
						((CancellationFailure) rc).cause());
			}

			return 0;
		}
		catch (Throwable t) {
			return handleError(t);
		}
	}

	/**
	 * Executes the SAVEPOINT action.
	 *
	 * @param args Command line arguments for the savepoint action.
	 */
	protected int savepoint(String[] args) {
		LOG.info("Running 'savepoint' command.");

		SavepointOptions options;
		try {
			options = CliFrontendParser.parseSavepointCommand(args);
		}
		catch (CliArgsException e) {
			return handleArgException(e);
		}
		catch (Throwable t) {
			return handleError(t);
		}

		// evaluate help flag
		if (options.isPrintHelp()) {
			CliFrontendParser.printHelpForSavepoint();
			return 0;
		}

		if (options.isDispose()) {
			// Discard
			return disposeSavepoint(options, options.getDisposeSavepointPath());
		}
		else {
			// Trigger
			String[] cleanedArgs = options.getArgs();
			JobID jobId;

			if (cleanedArgs.length > 0) {
				String jobIdString = cleanedArgs[0];
				try {
					jobId = new JobID(StringUtils.hexStringToByte(jobIdString));
				}
				catch (Exception e) {
					return handleError(new IllegalArgumentException(
							"Error: The value for the Job ID is not a valid ID."));
				}
			}
			else {
				return handleError(new IllegalArgumentException(
						"Error: The value for the Job ID is not a valid ID. " +
								"Specify a Job ID to trigger a savepoint."));
			}

			return triggerSavepoint(options, jobId);
		}
	}

	/**
	 * Sends a {@link org.apache.flink.runtime.messages.JobManagerMessages.TriggerSavepoint}
	 * message to the job manager.
	 */
	private int triggerSavepoint(SavepointOptions options, JobID jobId) {
		try {
			ActorGateway jobManager = getJobManagerGateway(options);

			logAndSysout("Triggering savepoint for job " + jobId + ".");
			Future<Object> response = jobManager.ask(new TriggerSavepoint(jobId),
					new FiniteDuration(1, TimeUnit.HOURS));

			Object result;
			try {
				logAndSysout("Waiting for response...");
				result = Await.result(response, FiniteDuration.Inf());
			}
			catch (Exception e) {
				throw new Exception("Triggering a savepoint for the job " + jobId + " failed.", e);
			}

			if (result instanceof TriggerSavepointSuccess) {
				TriggerSavepointSuccess success = (TriggerSavepointSuccess) result;
				logAndSysout("Savepoint completed. Path: " + success.savepointPath());
				logAndSysout("You can resume your program from this savepoint with the run command.");

				return 0;
			}
			else if (result instanceof TriggerSavepointFailure) {
				TriggerSavepointFailure failure = (TriggerSavepointFailure) result;
				throw failure.cause();
			}
			else {
				throw new IllegalStateException("Unknown JobManager response of type " +
						result.getClass());
			}
		}
		catch (Throwable t) {
			return handleError(t);
		}
	}

	/**
	 * Sends a {@link org.apache.flink.runtime.messages.JobManagerMessages.DisposeSavepoint}
	 * message to the job manager.
	 */
	private int disposeSavepoint(SavepointOptions options, String savepointPath) {
		try {
			ActorGateway jobManager = getJobManagerGateway(options);
			logAndSysout("Disposing savepoint '" + savepointPath + "'.");
			Future<Object> response = jobManager.ask(new DisposeSavepoint(savepointPath), clientTimeout);

			Object result;
			try {
				logAndSysout("Waiting for response...");
				result = Await.result(response, clientTimeout);
			}
			catch (Exception e) {
				throw new Exception("Disposing the savepoint with path" + savepointPath + " failed.", e);
			}

			if (result.getClass() == JobManagerMessages.getDisposeSavepointSuccess().getClass()) {
				logAndSysout("Savepoint '" + savepointPath + "' disposed.");
				return 0;
			}
			else if (result instanceof DisposeSavepointFailure) {
				DisposeSavepointFailure failure = (DisposeSavepointFailure) result;
				throw failure.cause();
			}
			else {
				throw new IllegalStateException("Unknown JobManager response of type " +
						result.getClass());
			}
		}
		catch (Throwable t) {
			return handleError(t);
		}
	}

	// --------------------------------------------------------------------------------------------
	//  Interaction with programs and JobManager
	// --------------------------------------------------------------------------------------------

	protected int executeProgramDetached(PackagedProgram program, Client client, int parallelism) {
		LOG.info("Starting execution of program");

		JobSubmissionResult result;
		try {
			result = client.runDetached(program, parallelism);
		} catch (ProgramInvocationException e) {
			return handleError(e);
		} finally {
			program.deleteExtractedLibraries();
		}

		if (yarnCluster != null) {
			yarnCluster.stopAfterJob(result.getJobID());
			yarnCluster.disconnect();
		}
		
		System.out.println("Job has been submitted with JobID " + result.getJobID());

		return 0;
	}

	protected int executeProgramBlocking(PackagedProgram program, Client client, int parallelism) {
		LOG.info("Starting execution of program");

		JobSubmissionResult result;
		try {
			result = client.runBlocking(program, parallelism);
		}
		catch (ProgramInvocationException e) {
			return handleError(e);
		}
		finally {
			program.deleteExtractedLibraries();
		}

		LOG.info("Program execution finished");

		if (result instanceof JobExecutionResult) {
			JobExecutionResult execResult = (JobExecutionResult) result;
			System.out.println("Job with JobID " + execResult.getJobID() + " has finished.");
			System.out.println("Job Runtime: " + execResult.getNetRuntime() + " ms");
			Map<String, Object> accumulatorsResult = execResult.getAllAccumulatorResults();
			if (accumulatorsResult.size() > 0) {
					System.out.println("Accumulator Results: ");
					System.out.println(AccumulatorHelper.getResultsFormated(accumulatorsResult));
			}
		}

		return 0;
	}

	/**
	 * Creates a Packaged program from the given command line options.
	 *
	 * @return A PackagedProgram (upon success)
	 * @throws java.io.FileNotFoundException
	 * @throws org.apache.flink.client.program.ProgramInvocationException
	 */
	protected PackagedProgram buildProgram(ProgramOptions options)
			throws FileNotFoundException, ProgramInvocationException
	{
		String[] programArgs = options.getProgramArgs();
		String jarFilePath = options.getJarFilePath();
		List<URL> classpaths = options.getClasspaths();

		if (jarFilePath == null) {
			throw new IllegalArgumentException("The program JAR file was not specified.");
		}

		File jarFile = new File(jarFilePath);

		// Check if JAR file exists
		if (!jarFile.exists()) {
			throw new FileNotFoundException("JAR file does not exist: " + jarFile);
		}
		else if (!jarFile.isFile()) {
			throw new FileNotFoundException("JAR file is not a file: " + jarFile);
		}

		// Get assembler class
		String entryPointClass = options.getEntryPointClassName();

		PackagedProgram program = entryPointClass == null ?
				new PackagedProgram(jarFile, classpaths, programArgs) :
				new PackagedProgram(jarFile, classpaths, entryPointClass, programArgs);

		program.setSavepointPath(options.getSavepointPath());

		return program;
	}

	/**
	 * Writes the given job manager address to the associated configuration object
	 *
	 * @param address Address to write to the configuration
	 */
	protected void writeJobManagerAddressToConfig(InetSocketAddress address) {
		config.setString(ConfigConstants.JOB_MANAGER_IPC_ADDRESS_KEY, address.getHostName());
		config.setInteger(ConfigConstants.JOB_MANAGER_IPC_PORT_KEY, address.getPort());
	}

	/**
	 * Updates the associated configuration with the given command line options
	 *
	 * @param options Command line options
	 */
	protected void updateConfig(CommandLineOptions options) {
		if(options.getJobManagerAddress() != null){
			InetSocketAddress jobManagerAddress = parseHostPortAddress(options.getJobManagerAddress());
			writeJobManagerAddressToConfig(jobManagerAddress);
		}
	}

	/**
	 * Retrieves the {@link ActorGateway} for the JobManager. The JobManager address is retrieved
	 * from the provided {@link CommandLineOptions}.
	 *
	 * @param options CommandLineOptions specifying the JobManager URL
	 * @return Gateway to the JobManager
	 * @throws Exception
	 */
	protected ActorGateway getJobManagerGateway(CommandLineOptions options) throws Exception {
		// overwrite config values with given command line options
		updateConfig(options);

		// start an actor system if needed
		if (this.actorSystem == null) {
			LOG.info("Starting actor system to communicate with JobManager");
			try {
				scala.Tuple2<String, Object> systemEndpoint = new scala.Tuple2<String, Object>("", 0);
				this.actorSystem = AkkaUtils.createActorSystem(
						config,
						new Some<scala.Tuple2<String, Object>>(systemEndpoint));
			}
			catch (Exception e) {
				throw new IOException("Could not start actor system to communicate with JobManager", e);
			}

			LOG.info("Actor system successfully started");
		}

		LOG.info("Trying to lookup the JobManager gateway");
		// Retrieve the ActorGateway from the LeaderRetrievalService
		LeaderRetrievalService lrs = LeaderRetrievalUtils.createLeaderRetrievalService(config);

		return LeaderRetrievalUtils.retrieveLeaderGateway(lrs, actorSystem, lookupTimeout);
	}

	/**
	 * Retrieves a {@link Client} object from the given command line options and other parameters.
	 *
	 * @param options Command line options which contain JobManager address
	 * @param programName Program name
	 * @param userParallelism Given user parallelism
	 * @throws Exception
	 */
	protected Client getClient(
			CommandLineOptions options,
			String programName,
			int userParallelism,
			boolean detachedMode)
		throws Exception {
		InetSocketAddress jobManagerAddress;
		int maxSlots = -1;

		if (YARN_DEPLOY_JOBMANAGER.equals(options.getJobManagerAddress())) {
			logAndSysout("YARN cluster mode detected. Switching Log4j output to console");

			// Default yarn application name to use, if nothing is specified on the command line
			String applicationName = "Flink Application: " + programName;

			// user wants to run Flink in YARN cluster.
			CommandLine commandLine = options.getCommandLine();
			AbstractFlinkYarnClient flinkYarnClient = CliFrontendParser
														.getFlinkYarnSessionCli()
														.withDefaultApplicationName(applicationName)
														.createFlinkYarnClient(commandLine);

			if (flinkYarnClient == null) {
				throw new RuntimeException("Unable to create Flink YARN Client. Check previous log messages");
			}

			// in case the main detached mode wasn't set, we don't want to overwrite the one loaded
			// from yarn options.
			if (detachedMode) {
				flinkYarnClient.setDetachedMode(true);
			}

			// the number of slots available from YARN:
			int yarnTmSlots = flinkYarnClient.getTaskManagerSlots();
			if (yarnTmSlots == -1) {
				yarnTmSlots = 1;
			}
			maxSlots = yarnTmSlots * flinkYarnClient.getTaskManagerCount();
			if (userParallelism != -1) {
				int slotsPerTM = userParallelism / flinkYarnClient.getTaskManagerCount();
				logAndSysout("The YARN cluster has " + maxSlots + " slots available, " +
						"but the user requested a parallelism of " + userParallelism + " on YARN. " +
						"Each of the " + flinkYarnClient.getTaskManagerCount() + " TaskManagers " +
						"will get "+slotsPerTM+" slots.");
				flinkYarnClient.setTaskManagerSlots(slotsPerTM);
			}

			try {
				yarnCluster = flinkYarnClient.deploy();
				yarnCluster.connectToCluster();
			}
			catch (Exception e) {
				throw new RuntimeException("Error deploying the YARN cluster", e);
			}

			jobManagerAddress = yarnCluster.getJobManagerAddress();
			writeJobManagerAddressToConfig(jobManagerAddress);
			
			// overwrite the yarn client config (because the client parses the dynamic properties)
			this.config.addAll(flinkYarnClient.getFlinkConfiguration());

			logAndSysout("YARN cluster started");
			logAndSysout("JobManager web interface address " + yarnCluster.getWebInterfaceURL());
			logAndSysout("Waiting until all TaskManagers have connected");

			while(true) {
				FlinkYarnClusterStatus status = yarnCluster.getClusterStatus();
				if (status != null) {
					if (status.getNumberOfTaskManagers() < flinkYarnClient.getTaskManagerCount()) {
						logAndSysout("TaskManager status (" + status.getNumberOfTaskManagers() + "/" + flinkYarnClient.getTaskManagerCount() + ")");
					} else {
						logAndSysout("All TaskManagers are connected");
						break;
					}
				} else {
					logAndSysout("No status updates from the YARN cluster received so far. Waiting ...");
				}

				try {
					Thread.sleep(500);
				}
				catch (InterruptedException e) {
					LOG.error("Interrupted while waiting for TaskManagers");
					System.err.println("Thread is interrupted");
					Thread.currentThread().interrupt();
				}
			}
		}
		else {
			if(options.getJobManagerAddress() != null) {
				jobManagerAddress = parseHostPortAddress(options.getJobManagerAddress());
				writeJobManagerAddressToConfig(jobManagerAddress);
			}
		}

		return new Client(config, maxSlots);
	}

	// --------------------------------------------------------------------------------------------
	//  Logging and Exception Handling
	// --------------------------------------------------------------------------------------------

	/**
	 * Displays an exception message for incorrect command line arguments.
	 *
	 * @param e The exception to display.
	 * @return The return code for the process.
	 */
	private int handleArgException(Exception e) {
		LOG.error("Invalid command line arguments." + (e.getMessage() == null ? "" : e.getMessage()));

		System.out.println(e.getMessage());
		System.out.println();
		System.out.println("Use the help option (-h or --help) to get help on the command.");
		return 1;
	}

	/**
	 * Displays an exception message.
	 * 
	 * @param t The exception to display.
	 * @return The return code for the process.
	 */
	private int handleError(Throwable t) {
		LOG.error("Error while running the command.", t);

		System.err.println();
		System.err.println("------------------------------------------------------------");
		System.err.println(" The program finished with the following exception:");
		System.err.println();

		if (t.getCause() instanceof InvalidProgramException) {
			System.err.println(t.getCause().getMessage());
			StackTraceElement[] trace = t.getCause().getStackTrace();
			for (StackTraceElement ele: trace) {
				System.err.println("\t" + ele.toString());
				if (ele.getMethodName().equals("main")) {
					break;
				}
			}
		} else {
			t.printStackTrace();
		}
		return 1;
	}

	private void logAndSysout(String message) {
		LOG.info(message);
		System.out.println(message);
	}

	// --------------------------------------------------------------------------------------------
	//  Entry point for executable
	// --------------------------------------------------------------------------------------------

	/**
	 * Parses the command line arguments and starts the requested action.
	 * 
	 * @param args command line arguments of the client.
	 * @return The return code of the program
	 */
	public int parseParameters(String[] args) {

		// check for action
		if (args.length < 1) {
			CliFrontendParser.printHelp();
			System.out.println("Please specify an action.");
			return 1;
		}

		// get action
		String action = args[0];

		// remove action from parameters
		final String[] params = Arrays.copyOfRange(args, 1, args.length);

		// do action
		switch (action) {
			case ACTION_RUN:
				// run() needs to run in a secured environment for the optimizer.
				if (SecurityUtils.isSecurityEnabled()) {
					String message = "Secure Hadoop environment setup detected. Running in secure context.";
					LOG.info(message);

					try {
						return SecurityUtils.runSecured(new SecurityUtils.FlinkSecuredRunner<Integer>() {
							@Override
							public Integer run() throws Exception {
								return CliFrontend.this.run(params);
							}
						});
					}
					catch (Exception e) {
						return handleError(e);
					}
				} else {
					return run(params);
				}
			case ACTION_LIST:
				return list(params);
			case ACTION_INFO:
				return info(params);
			case ACTION_CANCEL:
				return cancel(params);
			case ACTION_STOP:
				return stop(params);
			case ACTION_SAVEPOINT:
				return savepoint(params);
			case "-h":
			case "--help":
				CliFrontendParser.printHelp();
				return 0;
			case "-v":
			case "--version":
				String version = EnvironmentInformation.getVersion();
				String commitID = EnvironmentInformation.getRevisionInformation().commitId;
				System.out.print("Version: " + version);
				System.out.println(!commitID.equals(EnvironmentInformation.UNKNOWN) ? ", Commit ID: " + commitID : "");
				return 0;
			default:
				System.out.printf("\"%s\" is not a valid action.\n", action);
				System.out.println();
				System.out.println("Valid actions are \"run\", \"list\", \"info\", \"stop\", or \"cancel\".");
				System.out.println();
				System.out.println("Specify the version option (-v or --version) to print Flink version.");
				System.out.println();
				System.out.println("Specify the help option (-h or --help) to get help on the command.");
				return 1;
		}
	}

	public void shutdown() {
		ActorSystem sys = this.actorSystem;
		if (sys != null) {
			this.actorSystem = null;
			sys.shutdown();
		}
	}

	/**
	 * Submits the job based on the arguments
	 */
	public static void main(String[] args) {
		EnvironmentInformation.logEnvironmentInfo(LOG, "Command Line Client", args);

		try {
			CliFrontend cli = new CliFrontend();
			int retCode = cli.parseParameters(args);
			System.exit(retCode);
		}
		catch (Throwable t) {
			LOG.error("Fatal error while running command line interface.", t);
			t.printStackTrace();
			System.exit(31);
		}
	}

	// --------------------------------------------------------------------------------------------
	//  Miscellaneous Utilities
	// --------------------------------------------------------------------------------------------

	/**
	 * Parses a given host port address of the format URL:PORT and returns an {@link InetSocketAddress}
	 *
	 * @param hostAndPort host port string to be parsed
	 * @return InetSocketAddress object containing the parsed host port information
	 */
	private static InetSocketAddress parseHostPortAddress(String hostAndPort) {
		// code taken from http://stackoverflow.com/questions/2345063/java-common-way-to-validate-and-convert-hostport-to-inetsocketaddress
		URI uri;
		try {
			uri = new URI("my://" + hostAndPort);
		} catch (URISyntaxException e) {
			throw new RuntimeException("Malformed address " + hostAndPort, e);
		}
		String host = uri.getHost();
		int port = uri.getPort();
		if (host == null || port == -1) {
			throw new RuntimeException("Address is missing hostname or port " + hostAndPort);
		}
		return new InetSocketAddress(host, port);
	}

	public static String getConfigurationDirectoryFromEnv() {
		String location = System.getenv(ENV_CONFIG_DIRECTORY);

		if (location != null) {
			if (new File(location).exists()) {
				return location;
			}
			else {
				throw new RuntimeException("The config directory '" + location + "', specified in the '" +
						ENV_CONFIG_DIRECTORY + "' environment variable, does not exist.");
			}
		}
		else if (new File(CONFIG_DIRECTORY_FALLBACK_1).exists()) {
			location = CONFIG_DIRECTORY_FALLBACK_1;
		}
		else if (new File(CONFIG_DIRECTORY_FALLBACK_2).exists()) {
			location = CONFIG_DIRECTORY_FALLBACK_2;
		}
		else {
			throw new RuntimeException("The configuration directory was not specified. " +
					"Please specify the directory containing the configuration file through the '" +
					ENV_CONFIG_DIRECTORY + "' environment variable.");
		}
		return location;
	}

	public static Map<String, String> getDynamicProperties(String dynamicPropertiesEncoded) {
		if (dynamicPropertiesEncoded != null && dynamicPropertiesEncoded.length() > 0) {
			Map<String, String> properties = new HashMap<>();
			
			String[] propertyLines = dynamicPropertiesEncoded.split(CliFrontend.YARN_DYNAMIC_PROPERTIES_SEPARATOR);
			for (String propLine : propertyLines) {
				if (propLine == null) {
					continue;
				}
				
				String[] kv = propLine.split("=");
				if (kv.length >= 2 && kv[0] != null && kv[1] != null && kv[0].length() > 0) {
					properties.put(kv[0], kv[1]);
				}
			}
			return properties;
		}
		else {
			return Collections.emptyMap();
		}
	}
}


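CliFrontend delegates the actual compilation and submission to org.apache.flink.client.program.Client, whose source follows. Before reading it, here is a condensed sketch of the batch path it implements. This is not the verbatim implementation: the use of JobGraphGenerator.compileJobGraph(OptimizedPlan) is an assumption based on the imports in the source below, since the excerpt is truncated before Client.getJobGraph().

import org.apache.flink.api.common.Plan;
import org.apache.flink.client.program.Client;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.optimizer.DataStatistics;
import org.apache.flink.optimizer.Optimizer;
import org.apache.flink.optimizer.costs.DefaultCostEstimator;
import org.apache.flink.optimizer.plan.OptimizedPlan;
import org.apache.flink.optimizer.plantranslate.JobGraphGenerator;
import org.apache.flink.runtime.jobgraph.JobGraph;

public class PlanTranslationSketch {

	/** Turns a user Plan into the JobGraph that is shipped to the JobManager. */
	public static JobGraph toJobGraph(Plan plan, Configuration config, int parallelism) {
		// 1. Cost-based optimization of the batch Plan (Client.getOptimizedPlan).
		Optimizer compiler = new Optimizer(new DataStatistics(), new DefaultCostEstimator(), config);
		OptimizedPlan optimized = Client.getOptimizedPlan(compiler, plan, parallelism);

		// 2. Translation into a JobGraph. In the real Client, getJobGraph(...)
		//    additionally attaches the user jars, classpaths and savepoint path.
		return new JobGraphGenerator().compileJobGraph(optimized);
	}
}

For streaming programs the batch Optimizer is bypassed and the StreamGraph produces its JobGraph directly (the StreamingPlan branch of the info action above hints at this); in both cases the JobGraph is finally submitted through JobClient.submitJobAndWait (blocking) or JobClient.submitJobDetached (detached).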

/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.flink.client.program;

import java.io.IOException;
import java.net.URISyntaxException;
import java.net.URL;
import java.util.Collections;
import java.util.List;
import java.util.Map;

import com.google.common.base.Preconditions;
import org.apache.flink.api.common.JobID;
import org.apache.flink.api.common.JobSubmissionResult;
import org.apache.flink.api.common.accumulators.AccumulatorHelper;
import org.apache.flink.api.common.JobExecutionResult;
import org.apache.flink.api.common.Plan;
import org.apache.flink.optimizer.CompilerException;
import org.apache.flink.optimizer.DataStatistics;
import org.apache.flink.optimizer.Optimizer;
import org.apache.flink.optimizer.costs.DefaultCostEstimator;
import org.apache.flink.optimizer.plan.FlinkPlan;
import org.apache.flink.optimizer.plan.OptimizedPlan;
import org.apache.flink.optimizer.plan.StreamingPlan;
import org.apache.flink.optimizer.plandump.PlanJSONDumpGenerator;
import org.apache.flink.optimizer.plantranslate.JobGraphGenerator;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.core.fs.Path;
import org.apache.flink.runtime.akka.AkkaUtils;
import org.apache.flink.runtime.client.JobClient;
import org.apache.flink.runtime.client.JobExecutionException;
import org.apache.flink.runtime.instance.ActorGateway;
import org.apache.flink.runtime.jobgraph.JobGraph;
import org.apache.flink.runtime.leaderretrieval.LeaderRetrievalService;
import org.apache.flink.runtime.messages.accumulators.AccumulatorResultsErroneous;
import org.apache.flink.runtime.messages.accumulators.AccumulatorResultsFound;
import org.apache.flink.runtime.messages.accumulators.RequestAccumulatorResults;
import org.apache.flink.runtime.messages.JobManagerMessages;
import org.apache.flink.runtime.util.LeaderRetrievalUtils;
import org.apache.flink.util.SerializedValue;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import scala.concurrent.Await;
import scala.concurrent.Future;
import scala.concurrent.duration.FiniteDuration;
import akka.actor.ActorSystem;

/**
 * Encapsulates the functionality necessary to submit a program to a remote cluster.
 */
public class Client {

	private static final Logger LOG = LoggerFactory.getLogger(Client.class);

	/** The optimizer used in the optimization of batch programs */
	final Optimizer compiler;

	/** The actor system used to communicate with the JobManager */
	private final ActorSystem actorSystem;

	/** Configuration of the client */
	private final Configuration config;

	/** Timeout for futures */
	private final FiniteDuration timeout;

	/** Lookup timeout for the job manager retrieval service */
	private final FiniteDuration lookupTimeout;

	/**
	 * If != -1, this field specifies the total number of available slots on the cluster
	 * connected to the client.
	 */
	private final int maxSlots;

	/** Flag indicating whether to sysout print execution updates */
	private boolean printStatusDuringExecution = true;

	/**
	 * For interactive invocations, the Job ID is only available after the ContextEnvironment has
	 * been run inside the user JAR. We pass the Client to every instance of the ContextEnvironment
	 * which lets us access the last JobID here.
	 */
	private JobID lastJobID;

	// ------------------------------------------------------------------------
	//                            Construction
	// ------------------------------------------------------------------------

	/**
	 * Creates a instance that submits the programs to the JobManager defined in the
	 * configuration. This method will try to resolve the JobManager hostname and throw an exception
	 * if that is not possible.
	 *
	 * @param config The config used to obtain the job-manager's address, and used to configure the optimizer.
	 *
	 * @throws java.io.IOException Thrown, if the client's actor system could not be started.
	 * @throws java.net.UnknownHostException Thrown, if the JobManager's hostname could not be resolved.
	 */
	public Client(Configuration config) throws IOException {
		this(config, -1);
	}

	/**
	 * Creates a new instance of the class that submits the jobs to a job-manager
	 * at the given address using the default port.
	 *
	 * @param config The configuration for the client-side processes, like the optimizer.
	 * @param maxSlots The maximum number of slots available on the cluster if != -1.
	 *
	 * @throws java.io.IOException Thrown, if the client's actor system could not be started.
	 * @throws java.net.UnknownHostException Thrown, if the JobManager's hostname could not be resolved.
	 */
	public Client(Configuration config, int maxSlots) throws IOException {
		this.config = Preconditions.checkNotNull(config);
		this.compiler = new Optimizer(new DataStatistics(), new DefaultCostEstimator(), config);
		this.maxSlots = maxSlots;

		LOG.info("Starting client actor system");

		try {
			this.actorSystem = JobClient.startJobClientActorSystem(config);
		} catch (Exception e) {
			throw new IOException("Could start client actor system.", e);
		}

		timeout = AkkaUtils.getClientTimeout(config);
		lookupTimeout = AkkaUtils.getLookupTimeout(config);
	}

	// ------------------------------------------------------------------------
	//  Startup & Shutdown
	// ------------------------------------------------------------------------

	/**
	 * Shuts down the client. This stops the internal actor system and actors.
	 */
	public void shutdown() {
		if (!this.actorSystem.isTerminated()) {
			this.actorSystem.shutdown();
			this.actorSystem.awaitTermination();
		}
	}

	// ------------------------------------------------------------------------
	//  Configuration
	// ------------------------------------------------------------------------

	/**
	 * Configures whether the client should print progress updates during the execution to {@code System.out}.
	 * All updates are logged via the SLF4J loggers regardless of this setting.
	 *
	 * @param print True to print updates to standard out during execution, false to not print them.
	 */
	public void setPrintStatusDuringExecution(boolean print) {
		this.printStatusDuringExecution = print;
	}

	/**
	 * @return whether the client will print progress updates during the execution to {@code System.out}
	 */
	public boolean getPrintStatusDuringExecution() {
		return this.printStatusDuringExecution;
	}

	/**
	 * @return -1 if unknown. The maximum number of available processing slots at the Flink cluster
	 * connected to this client.
	 */
	public int getMaxSlots() {
		return this.maxSlots;
	}

	// ------------------------------------------------------------------------
	//  Access to the Program's Plan
	// ------------------------------------------------------------------------

	public static String getOptimizedPlanAsJson(Optimizer compiler, PackagedProgram prog, int parallelism)
			throws CompilerException, ProgramInvocationException
	{
		PlanJSONDumpGenerator jsonGen = new PlanJSONDumpGenerator();
		return jsonGen.getOptimizerPlanAsJSON((OptimizedPlan) getOptimizedPlan(compiler, prog, parallelism));
	}

	public static FlinkPlan getOptimizedPlan(Optimizer compiler, PackagedProgram prog, int parallelism)
			throws CompilerException, ProgramInvocationException
	{
		Thread.currentThread().setContextClassLoader(prog.getUserCodeClassLoader());
		if (prog.isUsingProgramEntryPoint()) {
			return getOptimizedPlan(compiler, prog.getPlanWithJars(), parallelism);
		} else if (prog.isUsingInteractiveMode()) {
			// temporary hack to support the optimizer plan preview
			OptimizerPlanEnvironment env = new OptimizerPlanEnvironment(compiler);
			if (parallelism > 0) {
				env.setParallelism(parallelism);
			}

			return env.getOptimizedPlan(prog);
		} else {
			throw new RuntimeException("Couldn't determine program mode.");
		}
	}

	public static OptimizedPlan getOptimizedPlan(Optimizer compiler, Plan p, int parallelism) throws CompilerException {
		if (parallelism > 0 && p.getDefaultParallelism() <= 0) {
			LOG.debug("Changing plan default parallelism from {} to {}", p.getDefaultParallelism(), parallelism);
			p.setDefaultParallelism(parallelism);
		}
		LOG.debug("Set parallelism {}, plan default parallelism {}", parallelism, p.getDefaultParallelism());

		return compiler.compile(p);
	}

	// ------------------------------------------------------------------------
	//  Program submission / execution
	// ------------------------------------------------------------------------

	public JobSubmissionResult runBlocking(PackagedProgram prog, int parallelism) throws ProgramInvocationException {
		Thread.currentThread().setContextClassLoader(prog.getUserCodeClassLoader());
		if (prog.isUsingProgramEntryPoint()) {
			return runBlocking(prog.getPlanWithJars(), parallelism, prog.getSavepointPath());
		}
		else if (prog.isUsingInteractiveMode()) {
			LOG.info("Starting program in interactive mode");
			ContextEnvironment.setAsContext(new ContextEnvironmentFactory(this, prog.getAllLibraries(),
					prog.getClasspaths(), prog.getUserCodeClassLoader(), parallelism, true,
					prog.getSavepointPath()));

			// invoke here
			try {
				prog.invokeInteractiveModeForExecution();
			}
			finally {
				ContextEnvironment.unsetContext();
			}

			return new JobSubmissionResult(lastJobID);
		}
		else {
			throw new RuntimeException("PackagedProgram does not have a valid invocation mode.");
		}
	}

	public JobSubmissionResult runDetached(PackagedProgram prog, int parallelism)
			throws ProgramInvocationException
	{
		Thread.currentThread().setContextClassLoader(prog.getUserCodeClassLoader());
		if (prog.isUsingProgramEntryPoint()) {
			return runDetached(prog.getPlanWithJars(), parallelism, prog.getSavepointPath());
		}
		else if (prog.isUsingInteractiveMode()) {
			LOG.info("Starting program in interactive mode");
			ContextEnvironmentFactory factory = new ContextEnvironmentFactory(this, prog.getAllLibraries(),
					prog.getClasspaths(), prog.getUserCodeClassLoader(), parallelism, false,
					prog.getSavepointPath());
			ContextEnvironment.setAsContext(factory);

			// invoke here
			try {
				prog.invokeInteractiveModeForExecution();
				return ((DetachedEnvironment) factory.getLastEnvCreated()).finalizeExecute();
			}
			finally {
				ContextEnvironment.unsetContext();
			}
		}
		else {
			throw new RuntimeException("PackagedProgram does not have a valid invocation mode.");
		}
	}

	public JobExecutionResult runBlocking(JobWithJars program, int parallelism) throws ProgramInvocationException {
		return runBlocking(program, parallelism, null);
	}

	/**
	 * Runs a program on the Flink cluster to which this client is connected. The call blocks until the
	 * execution is complete, and returns afterwards.
	 *
	 * @param program The program to be executed.
	 * @param parallelism The default parallelism to use when running the program. The default parallelism is used
	 *                    when the program does not set a parallelism by itself.
	 *
	 * @throws CompilerException Thrown, if the compiler encounters an illegal situation.
	 * @throws ProgramInvocationException Thrown, if the program could not be instantiated from its jar file,
	 *                                    or if the submission failed. That might be either due to an I/O problem,
	 *                                    i.e. the job-manager is unreachable, or due to the fact that the
	 *                                    parallel execution failed.
	 */
	public JobExecutionResult runBlocking(JobWithJars program, int parallelism, String savepointPath)
			throws CompilerException, ProgramInvocationException {
		ClassLoader classLoader = program.getUserCodeClassLoader();
		if (classLoader == null) {
			throw new IllegalArgumentException("The given JobWithJars does not provide a usercode class loader.");
		}

		OptimizedPlan optPlan = getOptimizedPlan(compiler, program, parallelism);
		return runBlocking(optPlan, program.getJarFiles(), program.getClasspaths(), classLoader, savepointPath);
	}

	public JobSubmissionResult runDetached(JobWithJars program, int parallelism) throws ProgramInvocationException {
		return runDetached(program, parallelism, null);
	}

	/**
	 * Submits a program to the Flink cluster to which this client is connected. The call returns after the
	 * program has been submitted and does not wait for it to complete.
	 *
	 * @param program The program to be executed.
	 * @param parallelism The default parallelism to use when running the program. The default parallelism is used
	 *                    when the program does not set a parallelism by itself.
	 *
	 * @throws CompilerException Thrown, if the compiler encounters an illegal situation.
	 * @throws ProgramInvocationException Thrown, if the program could not be instantiated from its jar file,
	 *                                    or if the submission failed. That might be either due to an I/O problem,
	 *                                    i.e. the job-manager is unreachable.
	 */
	public JobSubmissionResult runDetached(JobWithJars program, int parallelism, String savepointPath)
			throws CompilerException, ProgramInvocationException {
		ClassLoader classLoader = program.getUserCodeClassLoader();
		if (classLoader == null) {
			throw new IllegalArgumentException("The given JobWithJars does not provide a usercode class loader.");
		}

		OptimizedPlan optimizedPlan = getOptimizedPlan(compiler, program, parallelism);
		return runDetached(optimizedPlan, program.getJarFiles(), program.getClasspaths(), classLoader, savepointPath);
	}

	public JobExecutionResult runBlocking(
			FlinkPlan compiledPlan, List<URL> libraries, List<URL> classpaths, ClassLoader classLoader) throws ProgramInvocationException {
		return runBlocking(compiledPlan, libraries, classpaths, classLoader, null);
	}

	public JobExecutionResult runBlocking(FlinkPlan compiledPlan, List<URL> libraries, List<URL> classpaths,
			ClassLoader classLoader, String savepointPath) throws ProgramInvocationException
	{
		JobGraph job = getJobGraph(compiledPlan, libraries, classpaths, savepointPath);
		return runBlocking(job, classLoader);
	}

	public JobSubmissionResult runDetached(FlinkPlan compiledPlan, List<URL> libraries, List<URL> classpaths, ClassLoader classLoader) throws ProgramInvocationException {
		return runDetached(compiledPlan, libraries, classpaths, classLoader, null);
	}

	public JobSubmissionResult runDetached(FlinkPlan compiledPlan, List<URL> libraries, List<URL> classpaths,
			ClassLoader classLoader, String savepointPath) throws ProgramInvocationException
	{
		JobGraph job = getJobGraph(compiledPlan, libraries, classpaths, savepointPath);
		return runDetached(job, classLoader);
	}

	public JobExecutionResult runBlocking(JobGraph jobGraph, ClassLoader classLoader) throws ProgramInvocationException {
		LeaderRetrievalService leaderRetrievalService;
		try {
			leaderRetrievalService = LeaderRetrievalUtils.createLeaderRetrievalService(config);
		} catch (Exception e) {
			throw new ProgramInvocationException("Could not create the leader retrieval service.", e);
		}

		try {
			this.lastJobID = jobGraph.getJobID();
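			// submitJobAndWait hands the JobGraph to the JobManager through the
			// actor system and blocks until the job reaches a terminal state,
			// optionally printing status updates while it runs.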
			return JobClient.submitJobAndWait(actorSystem, leaderRetrievalService, jobGraph, timeout, printStatusDuringExecution, classLoader);
		} catch (JobExecutionException e) {
			throw new ProgramInvocationException("The program execution failed: " + e.getMessage(), e);
		}
	}

	public JobSubmissionResult runDetached(JobGraph jobGraph, ClassLoader classLoader) throws ProgramInvocationException {
		ActorGateway jobManagerGateway;

		try {
			jobManagerGateway = getJobManagerGateway();
		} catch (Exception e) {
			throw new ProgramInvocationException("Failed to retrieve the JobManager gateway.", e);
		}

		LOG.info("Checking and uploading JAR files");
		try {
			JobClient.uploadJarFiles(jobGraph, jobManagerGateway, timeout);
		}
		catch (IOException e) {
			throw new ProgramInvocationException("Could not upload the program's JAR files to the JobManager.", e);
		}
		try {
			this.lastJobID = jobGraph.getJobID();
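			// Unlike runBlocking, submitJobDetached returns as soon as the
			// JobManager has acknowledged the submission; the job's final
			// result is not awaited.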
			JobClient.submitJobDetached(jobManagerGateway, jobGraph, timeout, classLoader);
			return new JobSubmissionResult(jobGraph.getJobID());
		} catch (JobExecutionException e) {
			throw new ProgramInvocationException("The program execution failed: " + e.getMessage(), e);
		}
	}

	/**
	 * Cancels a job identified by the job id.
	 * @param jobId the job id
	 * @throws Exception In case an error occurred.
	 */
	public void cancel(JobID jobId) throws Exception {
		final ActorGateway jobManagerGateway = getJobManagerGateway();

		final Future<Object> response;
		try {
			response = jobManagerGateway.ask(new JobManagerMessages.CancelJob(jobId), timeout);
		} catch (final Exception e) {
			throw new ProgramInvocationException("Failed to query the job manager gateway.", e);
		}

		final Object result = Await.result(response, timeout);

		if (result instanceof JobManagerMessages.CancellationSuccess) {
			LOG.info("Job cancellation with ID " + jobId + " succeeded.");
		} else if (result instanceof JobManagerMessages.CancellationFailure) {
			final Throwable t = ((JobManagerMessages.CancellationFailure) result).cause();
			LOG.info("Job cancellation with ID " + jobId + " failed.", t);
			throw new Exception("Failed to cancel the job because of \n" + t.getMessage());
		} else {
			throw new Exception("Unknown message received while cancelling: " + result.getClass().getName());
		}
	}

	/**
	 * Stops a program on the Flink cluster whose job-manager is configured in this client's configuration.
	 * Stopping works only for streaming programs. Be aware that the program might continue to run for
	 * a while after the stop command has been sent, because all operators still need to finish
	 * processing once the sources have stopped emitting data.
	 *
	 * @param jobId
	 *            the job ID of the streaming program to stop
	 * @throws Exception
	 *             If the job ID is invalid (i.e., it is unknown or refers to a batch job) or if sending the stop signal
	 *             failed. That might be due to an I/O problem, i.e., the job-manager is unreachable.
	 */
	public void stop(final JobID jobId) throws Exception {
		final ActorGateway jobManagerGateway = getJobManagerGateway();

		final Future<Object> response;
		try {
			response = jobManagerGateway.ask(new JobManagerMessages.StopJob(jobId), timeout);
		} catch (final Exception e) {
			throw new ProgramInvocationException("Failed to query the job manager gateway.", e);
		}

		final Object result = Await.result(response, timeout);

		if (result instanceof JobManagerMessages.StoppingSuccess) {
			LOG.info("