Submitting a Flink Job to a Remote Cluster Programmatically

This article describes how to submit a job from a local IDE to a remote Apache Flink cluster using Scala, and how to resolve a serialization-related error that came up during submission.


While learning Flink, I came across the following method, which returns an ExecutionEnvironment instance connected to a remote cluster. I decided to try it out and submit a job from my local IDE to the cluster:

  def createRemoteEnvironment(host: String, port: Int, jarFiles: String*): ExecutionEnvironment 


The code:

package com.daxin.batch

import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.configuration.{ConfigConstants, Configuration}
//important: this import is needed to access the 'createTypeInformation' macro function
import org.apache.flink.api.scala._
/**
  * Created by Daxin on 2017/4/17.
  * https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/types_serialization.html#type-information-in-the-scala-api
  */
object RemoteJob {
  def main(args: Array[String]) {

    val env = ExecutionEnvironment.createRemoteEnvironment("node", 6123)

    val words = env.readTextFile("hdfs://node:9000/word/spark-env.sh")

    val data = words.flatMap(x => x.split(" ")).map(x => (x, 1)).groupBy(0).sum(1)

    println(data.count) // trigger the job by printing the element count
  }
}

Running it failed with an error. I searched Baidu, Google, and Bing for quite a while without finding a solution. To make this error easier for others to find, I'm pasting the full exception output below:

Submitting job with JobID: 2e9a9550e8352e8f6cfd579b3522a732. Waiting for job completion.
Connected to JobManager at Actor[akka.tcp://flink@node:6123/user/jobmanager#950641914]
04/19/2017 19:37:21	Job execution switched to status RUNNING.
04/19/2017 19:37:21	CHAIN DataSource (at com.daxin.batch.RemoteJob$.main(RemoteJob.scala:25) (org.apache.flink.api.java.io.TextInputFormat)) -> FlatMap (FlatMap at com.daxin.batch.RemoteJob$.main(RemoteJob.scala:27)) -> Map (Map at com.daxin.batch.RemoteJob$.main(RemoteJob.scala:27)) -> Combine(SUM(1))(1/1) switched to SCHEDULED 
04/19/2017 19:37:21	CHAIN DataSource (at com.daxin.batch.RemoteJob$.main(RemoteJob.scala:25) (org.apache.flink.api.java.io.TextInputFormat)) -> FlatMap (FlatMap at com.daxin.batch.RemoteJob$.main(RemoteJob.scala:27)) -> Map (Map at com.daxin.batch.RemoteJob$.main(RemoteJob.scala:27)) -> Combine(SUM(1))(1/1) switched to DEPLOYING 
04/19/2017 19:37:21	CHAIN DataSource (at com.daxin.batch.RemoteJob$.main(RemoteJob.scala:25) (org.apache.flink.api.java.io.TextInputFormat)) -> FlatMap (FlatMap at com.daxin.batch.RemoteJob$.main(RemoteJob.scala:27)) -> Map (Map at com.daxin.batch.RemoteJob$.main(RemoteJob.scala:27)) -> Combine(SUM(1))(1/1) switched to RUNNING 
04/19/2017 19:37:21	CHAIN DataSource (at com.daxin.batch.RemoteJob$.main(RemoteJob.scala:25) (org.apache.flink.api.java.io.TextInputFormat)) -> FlatMap (FlatMap at com.daxin.batch.RemoteJob$.main(RemoteJob.scala:27)) -> Map (Map at com.daxin.batch.RemoteJob$.main(RemoteJob.scala:27)) -> Combine(SUM(1))(1/1) switched to FAILED 
java.lang.RuntimeException: The initialization of the DataSource's outputs caused an error: The type serializer factory could not load its parameters from the configuration due to missing classes.
	at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:92)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:655)
	at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.RuntimeException: The type serializer factory could not load its parameters from the configuration due to missing classes.
	at org.apache.flink.runtime.operators.util.TaskConfig.getTypeSerializerFactory(TaskConfig.java:1145)
	at org.apache.flink.runtime.operators.util.TaskConfig.getOutputSerializer(TaskConfig.java:551)
	at org.apache.flink.runtime.operators.BatchTask.getOutputCollector(BatchTask.java:1216)
	at org.apache.flink.runtime.operators.BatchTask.initOutputs(BatchTask.java:1295)
	at org.apache.flink.runtime.operators.DataSourceTask.initOutputs(DataSourceTask.java:286)
	at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:90)
	... 2 more
Caused by: java.lang.ClassNotFoundException: com.daxin.batch.RemoteJob$$anon$2$$anon$1
	at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:270)
	at org.apache.flink.util.InstantiationUtil$ClassLoaderObjectInputStream.resolveClass(InstantiationUtil.java:66)
	at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
	at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
	at org.apache.flink.util.InstantiationUtil.deserializeObject(InstantiationUtil.java:292)
	at org.apache.flink.util.InstantiationUtil.readObjectFromConfig(InstantiationUtil.java:250)
	at org.apache.flink.api.java.typeutils.runtime.RuntimeSerializerFactory.readParametersFromConfig(RuntimeSerializerFactory.java:76)
	at org.apache.flink.runtime.operators.util.TaskConfig.getTypeSerializerFactory(TaskConfig.java:1143)
	... 7 more

04/19/2017 19:37:21	Job execution switched to status FAILING.
java.lang.RuntimeException: The initialization of the DataSource's outputs caused an error: The type serializer factory could not load its parameters from the configuration due to missing classes.
	at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:92)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:655)
	at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.RuntimeException: The type serializer factory could not load its parameters from the configuration due to missing classes.
	at org.apache.flink.runtime.operators.util.TaskConfig.getTypeSerializerFactory(TaskConfig.java:1145)
	at org.apache.flink.runtime.operators.util.TaskConfig.getOutputSerializer(TaskConfig.java:551)
	at org.apache.flink.runtime.operators.BatchTask.getOutputCollector(BatchTask.java:1216)
	at org.apache.flink.runtime.operators.BatchTask.initOutputs(BatchTask.java:1295)
	at org.apache.flink.runtime.operators.DataSourceTask.initOutputs(DataSourceTask.java:286)
	at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:90)
	... 2 more
Caused by: java.lang.ClassNotFoundException: com.daxin.batch.RemoteJob$$anon$2$$anon$1
	at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:270)
	at org.apache.flink.util.InstantiationUtil$ClassLoaderObjectInputStream.resolveClass(InstantiationUtil.java:66)
	at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
	at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
	at org.apache.flink.util.InstantiationUtil.deserializeObject(InstantiationUtil.java:292)
	at org.apache.flink.util.InstantiationUtil.readObjectFromConfig(InstantiationUtil.java:250)
	at org.apache.flink.api.java.typeutils.runtime.RuntimeSerializerFactory.readParametersFromConfig(RuntimeSerializerFactory.java:76)
	at org.apache.flink.runtime.operators.util.TaskConfig.getTypeSerializerFactory(TaskConfig.java:1143)
	... 7 more
04/19/2017 19:37:21	Reduce (SUM(1))(1/1) switched to CANCELED 
04/19/2017 19:37:21	DataSink (org.apache.flink.api.java.Utils$CountHelper@516be40f)(1/1) switched to CANCELED 
04/19/2017 19:37:21	Job execution switched to status FAILED.
Exception in thread "main" org.apache.flink.client.program.ProgramInvocationException: The program execution failed: Job execution failed.
	at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:427)
	at org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:101)
	at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:400)
	at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:387)
	at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:362)
	at org.apache.flink.client.RemoteExecutor.executePlanWithJars(RemoteExecutor.java:211)
	at org.apache.flink.client.RemoteExecutor.executePlan(RemoteExecutor.java:188)
	at org.apache.flink.api.java.RemoteEnvironment.execute(RemoteEnvironment.java:172)
	at org.apache.flink.api.java.ExecutionEnvironment.execute(ExecutionEnvironment.java:926)
	at org.apache.flink.api.scala.ExecutionEnvironment.execute(ExecutionEnvironment.scala:672)
	at org.apache.flink.api.scala.DataSet.count(DataSet.scala:529)
	at com.daxin.batch.RemoteJob$.main(RemoteJob.scala:29)
	at com.daxin.batch.RemoteJob.main(RemoteJob.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
Caused by: org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
	at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$6.apply$mcV$sp(JobManager.scala:900)
	at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$6.apply(JobManager.scala:843)
	at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$6.apply(JobManager.scala:843)
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
	at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.RuntimeException: The initialization of the DataSource's outputs caused an error: The type serializer factory could not load its parameters from the configuration due to missing classes.
	at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:92)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:655)
	at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.RuntimeException: The type serializer factory could not load its parameters from the configuration due to missing classes.
	at org.apache.flink.runtime.operators.util.TaskConfig.getTypeSerializerFactory(TaskConfig.java:1145)
	at org.apache.flink.runtime.operators.util.TaskConfig.getOutputSerializer(TaskConfig.java:551)
	at org.apache.flink.runtime.operators.BatchTask.getOutputCollector(BatchTask.java:1216)
	at org.apache.flink.runtime.operators.BatchTask.initOutputs(BatchTask.java:1295)
	at org.apache.flink.runtime.operators.DataSourceTask.initOutputs(DataSourceTask.java:286)
	at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:90)
	... 2 more
Caused by: java.lang.ClassNotFoundException: com.daxin.batch.RemoteJob$$anon$2$$anon$1
	at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:270)
	at org.apache.flink.util.InstantiationUtil$ClassLoaderObjectInputStream.resolveClass(InstantiationUtil.java:66)
	at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
	at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
	at org.apache.flink.util.InstantiationUtil.deserializeObject(InstantiationUtil.java:292)
	at org.apache.flink.util.InstantiationUtil.readObjectFromConfig(InstantiationUtil.java:250)
	at org.apache.flink.api.java.typeutils.runtime.RuntimeSerializerFactory.readParametersFromConfig(RuntimeSerializerFactory.java:76)
	at org.apache.flink.runtime.operators.util.TaskConfig.getTypeSerializerFactory(TaskConfig.java:1143)
	... 7 more

Note this line in the exception output:

java.lang.RuntimeException: The initialization of the DataSource's outputs caused an error: The type serializer factory could not load its parameters from the configuration due to missing classes.

I kept assuming this was a serialization problem and combed through the documentation repeatedly without finding a fix. Finally I went back to the API docs and noticed that the third parameter of createRemoteEnvironment is a varargs parameter, not a parameter with a default value. My habit of thinking in terms of Scala functions with default values had misled me: a varargs parameter can legally be omitted, so the call compiled and submitted, but no user jar was shipped to the cluster. After passing the job's jar file as the third argument, the job submitted to the remote cluster and ran correctly.
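The trap above can be demonstrated with a minimal, self-contained sketch (the names `remoteEnv` and `remoteEnvWithDefault` are illustrative stand-ins, not the Flink API): a varargs parameter silently accepts zero arguments, so forgetting the jar still compiles, whereas a default-valued parameter would have supplied something.

```scala
object VarargsDemo {
  // Varargs: calling remoteEnv("node", 6123) is legal, and jarFiles is empty --
  // analogous to submitting a Flink job with no user jar attached.
  def remoteEnv(host: String, port: Int, jarFiles: String*): Seq[String] = jarFiles

  // Default value: omitting the argument still yields a usable value.
  def remoteEnvWithDefault(host: String, port: Int,
                           jarFile: String = "job.jar"): String = jarFile

  def main(args: Array[String]): Unit = {
    assert(remoteEnv("node", 6123).isEmpty)                  // no jars shipped at all
    assert(remoteEnv("node", 6123, "a.jar") == Seq("a.jar")) // jar explicitly passed
    assert(remoteEnvWithDefault("node", 6123) == "job.jar")  // default fills the gap
    println("ok")
  }
}
```

This is why the compiler gave no warning: omitting the varargs argument is a perfectly valid call, and the failure only surfaced at runtime as a ClassNotFoundException on the TaskManager.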


The corrected code:


package com.daxin.batch

import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.configuration.{ConfigConstants, Configuration}
//important: this import is needed to access the 'createTypeInformation' macro function
import org.apache.flink.api.scala._
/**
  * Created by Daxin on 2017/4/17.
  * https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/types_serialization.html#type-information-in-the-scala-api
  */
object RemoteJob {
  def main(args: Array[String]) {

    val env = ExecutionEnvironment.createRemoteEnvironment("node", 6123,"C://logs//flink-lib//flinkwordcount.jar")

    val words = env.readTextFile("hdfs://node:9000/word/spark-env.sh")

    val data = words.flatMap(x => x.split(" ")).map(x => (x, 1)).groupBy(0).sum(1)

    println(data.count) // trigger the job by printing the element count
  }
}


One final note: if you submit local code to the cluster this way, keep the code and the jar consistent. In other words, after modifying the code, rebuild the jar before submitting again; otherwise the cluster will load stale classes from the old jar.




