Flink in Practice: Implementing a Custom KafkaDeserializationSchema (Java/Scala)

Messages in Kafka are usually key-value pairs, so here we write a custom deserialization schema that consumes key-value messages from Kafka. To make it easier to follow along, both a Scala and a Java version are provided. The code is fairly simple, so let's go straight to it.
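
For orientation, the interface being implemented has roughly the following shape in the Flink Kafka connector (paraphrased for the Flink 1.10/1.11 connector used in this article; getProducedType comes from the ResultTypeQueryable interface it extends):

public interface KafkaDeserializationSchema<T> extends Serializable, ResultTypeQueryable<T> {
    // Return true to stop consuming once some special "end" element is seen; usually just return false.
    boolean isEndOfStream(T nextElement);
    // Turn the raw ConsumerRecord<byte[], byte[]> into an element of type T.
    T deserialize(ConsumerRecord<byte[], byte[]> record) throws Exception;
    // getProducedType() is inherited from ResultTypeQueryable<T> and tells Flink the produced TypeInformation.
}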

I. Scala code:

1. Custom deserialization schema class:

package comhadoop.ljs.flink010.kafka
import org.apache.flink.api.common.typeinfo.{TypeHint, TypeInformation}
import org.apache.flink.streaming.connectors.kafka.KafkaDeserializationSchema
import org.apache.kafka.clients.consumer.ConsumerRecord
/**
  * @author: Created By lujisen
  * @company ChinaUnicom Software JiNan
  * @date: 2020-04-25 18:31
  * @version: v1.0
  * @description: comhadoop.ljs.flink010.kafka
  */
class MyKafkaDeserializationSchema  extends KafkaDeserializationSchema[ConsumerRecord[String, String]]{
  /* Whether the stream has ended, e.g. stop once a record with key "end" is read; no such check here, just return false so the stream never ends */
  override def isEndOfStream(t: ConsumerRecord[String, String]): Boolean ={
    false
  }
  /* Convert the raw key/value bytes to UTF-8 strings; note there is no null check here (see the remark in the Java version) */
  override def deserialize(record: ConsumerRecord[Array[Byte], Array[Byte]]): ConsumerRecord[String, String] = {
    new ConsumerRecord(record.topic(), record.partition(), record.offset(),
      new String(record.key(), "UTF-8"), new String(record.value(), "UTF-8"))
  }
  /* Tells Flink the type information of the records produced by this schema */
  override def getProducedType: TypeInformation[ConsumerRecord[String, String]] = {
    TypeInformation.of(new TypeHint[ConsumerRecord[String, String]] {})
  }
}

2. Main class:

package comhadoop.ljs.flink010.kafka
import java.util.Properties
import org.apache.flink.api.common.functions.MapFunction
import org.apache.flink.streaming.api.datastream.DataStream
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import org.apache.flink.streaming.connectors.kafka.internals.KafkaTopicPartition
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
/**
  * @author: Created By lujisen
  * @company ChinaUnicom Software JiNan
  * @date: 2020-04-25 16:32
  * @version: v1.0
  * @description: comhadoop.ljs.flink010.kafka
  */
object KafkaDeserializerSchemaTest {
  def main(args: Array[String]): Unit = {

    /* Initialize the execution environment */
    val senv:StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment()
    /* Enable checkpointing. Note: the key/value are not null-checked here, so with checkpointing enabled a bad record will make the job restart endlessly */
    senv.enableCheckpointing(2000)
    /* If topic3 does not exist, Kafka will create it automatically with a single partition (partition 0) */
    val myConsumer = new FlinkKafkaConsumer[ConsumerRecord[String, String]]("topic3", new MyKafkaDeserializationSchema(), getKafkaConfig())
    /* Specify the start offsets explicitly */
    val specificStartOffsets = new java.util.HashMap[KafkaTopicPartition, java.lang.Long]()
    /* Start consuming from the first record (offset 0) of partition 0 of topic3 */
    specificStartOffsets.put(new KafkaTopicPartition("topic3", 0), 0L)
    myConsumer.setStartFromSpecificOffsets(specificStartOffsets)
    /* Register the Kafka source */
    val source:DataStream[ConsumerRecord[String, String]] = senv.addSource(myConsumer)

    val keyValue = source.map(new MapFunction[ConsumerRecord[String, String], String] {
      override def map(message: ConsumerRecord[String, String]): String = {
        "key:" + message.key + "  value:" + message.value
      }
    })
    /* Print the received records */
    keyValue.print()
    /* Start the job */
    senv.execute()
  }

  def getKafkaConfig():Properties={
    val props:Properties=new Properties()
    props.setProperty("bootstrap.servers","worker1.hadoop.ljs:6667,worker2.hadoop.ljs:6667")
    props.setProperty("group.id","topic_1")
    props.setProperty("key.deserializer",classOf[StringDeserializer].getName)
    props.setProperty("value.deserializer",classOf[StringDeserializer].getName)
    props.setProperty("auto.offset.reset","latest")
    props
  }
}

II. Java code:

1. Custom deserialization schema class:

package com.hadoop.ljs.flink110.kafka;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.streaming.connectors.kafka.KafkaDeserializationSchema;
import org.apache.kafka.clients.consumer.ConsumerRecord;
/**
 * @author: Created By lujisen
 * @company ChinaUnicom Software JiNan
 * @date: 2020-04-25 18:45
 * @version: v1.0
 * @description: com.hadoop.ljs.flink110.kafka
 */
public class MyKafkaDeserializationSchema implements KafkaDeserializationSchema<ConsumerRecord<String, String>> {

    private static final String encoding = "UTF-8";
    @Override
    public boolean isEndOfStream(ConsumerRecord<String, String> nextElement) {
        return false;
    }
    @Override
    public ConsumerRecord<String, String> deserialize(ConsumerRecord<byte[], byte[]> record) throws Exception {
       /* System.out.println("Record--partition::"+record.partition());
        System.out.println("Record--offset::"+record.offset());
        System.out.println("Record--timestamp::"+record.timestamp());
        System.out.println("Record--timestampType::"+record.timestampType());
        System.out.println("Record--checksum::"+record.checksum());
        System.out.println("Record--key::"+record.key());
        System.out.println("Record--value::"+record.value());*/
        return new ConsumerRecord<String, String>(record.topic(),
                record.partition(),
                record.offset(),
                record.timestamp(),
                record.timestampType(),
                record.checksum(),
                record.serializedKeySize(),
                record.serializedValueSize(),
                /* No null check is done here; be sure to handle null keys/values in production */
                new  String(record.key(), encoding),
                new  String(record.value(), encoding));
    }
    @Override
    public TypeInformation<ConsumerRecord<String, String>> getProducedType() {
        return TypeInformation.of(new TypeHint<ConsumerRecord<String, String>>(){});
    }
}
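
As the inline comment says, deserialize above does not guard against null keys or values, both of which Kafka allows. A minimal null-tolerant variant is sketched below (an illustration, not the original code; it assumes empty strings are an acceptable substitute for your downstream logic):

    @Override
    public ConsumerRecord<String, String> deserialize(ConsumerRecord<byte[], byte[]> record) throws Exception {
        // Kafka permits null keys and values; substitute empty strings to avoid a NullPointerException.
        String key = record.key() == null ? "" : new String(record.key(), encoding);
        String value = record.value() == null ? "" : new String(record.value(), encoding);
        return new ConsumerRecord<String, String>(record.topic(),
                record.partition(),
                record.offset(),
                record.timestamp(),
                record.timestampType(),
                record.checksum(),
                record.serializedKeySize(),
                record.serializedValueSize(),
                key,
                value);
    }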

 

2. Main class:

package com.hadoop.ljs.flink110.kafka;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.streaming.connectors.kafka.internals.KafkaTopicPartition;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
/**
 * @author: Created By lujisen
 * @company ChinaUnicom Software JiNan
 * @date: 2020-04-25 18:41
 * @version: v1.0
 * @description: com.hadoop.ljs.flink110.kafka
 */
public class KafkaDeserializerSchemaTest {
    public static void main(String[] args) throws Exception {
        /* Initialize the execution environment */
        StreamExecutionEnvironment senv = StreamExecutionEnvironment.getExecutionEnvironment();
        /* Enable checkpointing. Note: the key/value are not null-checked here, so with checkpointing enabled a bad record will make the job restart endlessly */
        senv.enableCheckpointing(2000);
        /* If topic3 does not exist, Kafka will create it automatically with a single partition (partition 0) */
        FlinkKafkaConsumer<ConsumerRecord<String, String>> myConsumer = new FlinkKafkaConsumer<ConsumerRecord<String, String>>("topic3", new MyKafkaDeserializationSchema(), getKafkaConfig());

        /* Specify the start offsets explicitly */
        Map<KafkaTopicPartition, Long> specificStartOffsets = new HashMap<>();
        /* Start consuming from the first record (offset 0) of partition 0 of topic3 */
        specificStartOffsets.put(new KafkaTopicPartition("topic3", 0), 0L);
        myConsumer.setStartFromSpecificOffsets(specificStartOffsets);

        DataStream<ConsumerRecord<String, String>> source = senv.addSource(myConsumer);
        DataStream<String> keyValue = source.map(new MapFunction<ConsumerRecord<String, String>, String>() {
            @Override
            public String map(ConsumerRecord<String, String> message) throws Exception {
                return "key:" + message.key() + "  value:" + message.value();
            }
        });
        /* Print the results */
        keyValue.print();
        /* Start the job */
        senv.execute();
    }
    public static Properties getKafkaConfig(){
        Properties  props=new Properties();
        props.setProperty("bootstrap.servers","worker1.hadoop.ljs:6667,worker2.hadoop.ljs:6667");
        props.setProperty("group.id","topic_group2");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.setProperty("auto.offset.reset","latest");
        return props;
    }
}
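
A practical note on the checkpointing comment in main(): when checkpointing is enabled and no restart strategy is configured, Flink falls back to a fixed-delay strategy with an effectively unlimited number of attempts, so a poison record (e.g. a null key hitting the UTF-8 conversion) keeps the job restarting forever. One way to bound this, sketched here with the standard Flink restart-strategy API, is to configure it right after creating the environment:

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import java.util.concurrent.TimeUnit;

// Allow at most 3 restart attempts, 10 seconds apart, instead of restarting indefinitely.
senv.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, Time.of(10, TimeUnit.SECONDS)));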

III. Testing

1. A KafkaProducer class that sends test <key, value> records:

package com.hadoop.ljs.kafka220;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Date;
import java.util.Properties;
public class KafkaPartitionProducer extends Thread{
    private static long count = 10;
    private static String topic = "topic3";
    private static String brokerList = "worker1.hadoop.ljs:6667,worker2.hadoop.ljs:6667";
  public static void main(String[] args) {
        KafkaPartitionProducer jproducer = new KafkaPartitionProducer();
        jproducer.start();
    }
    @Override
    public void run() {
        producer();
    }
    private void producer() {
        Properties props = config();
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        ProducerRecord<String, String> record=null;
        System.out.println("Number of records to produce: " + count);
        for (int i = 1; i <= count; i++) {
            String json = "{\"id\":" + i + ",\"ip\":\"192.168.0." + i + "\",\"date\":\"" + new Date().toString() + "\"}";
            String key ="key"+i;
            record = new ProducerRecord<String, String>(topic, key, json);
            producer.send(record, (metadata, e) -> {
                // Callback invoked once the send completes
                if (null != e) {
                    e.printStackTrace();
                }
                if (null != metadata) {
                    System.out.println(String.format("offset: %s, partition:%s, topic:%s  timestamp:%s", metadata.offset(), metadata.partition(), metadata.topic(), metadata.timestamp()));
                }
            });
        }
        producer.close();
    }
    private Properties config() {
        Properties props = new Properties();
        props.put("bootstrap.servers",brokerList);
        props.put("acks", "1");
        props.put("retries", 3);
        props.put("batch.size", 16384);
        props.put("linger.ms", 1);
        props.put("buffer.memory", 33554432);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        /* Optional: plug in a custom partitioner */
        /*props.put("partitioner.class", PartitionUtil.class.getName());*/
        return props;
    }
}
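
The producer above sends asynchronously and reports the result through a callback. If you would rather confirm each record before sending the next, the standard KafkaProducer API also lets you block on the Future returned by send(); a small helper sketch (hypothetical, not part of the original class; it additionally needs imports for RecordMetadata and java.util.concurrent.ExecutionException):

    /* Synchronous alternative: block until the broker acknowledges the record. */
    private static RecordMetadata sendSync(KafkaProducer<String, String> producer,
                                           ProducerRecord<String, String> record)
            throws InterruptedException, ExecutionException {
        return producer.send(record).get();
    }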

2. Test results: running the producer above and then the Flink job prints each consumed record's key and value to the console, in the "key: ... value: ..." format produced by the map function.

