Flink writing to Kafka: data is written to only some of the partitions, never all of them

This article looks at how Apache Flink, when processing large-scale real-time streams, can suffer from data skew caused by its default partitioning strategy. By walking through the source code of FlinkKafkaProducer09, it shows why Flink fails to distribute data across all Kafka partitions and presents a custom solution that spreads records evenly over every partition of the topic.
1. Preface

When using Flink for real-time computation, we often read from Kafka and write results back to Kafka. Once a topic has many partitions — for example, in a cluster with a dozen or more brokers sized for petabyte-scale traffic, a single topic can easily have dozens of partitions or more — you may notice that Flink writes to only a subset of those partitions rather than all of them. The issue is easy to overlook, but it hurts write throughput, concentrates too much data in a few partitions, and in the worst case can fill up the disks hosting those partitions. To understand why this happens, let's go through the source code Flink uses to write to Kafka, mainly the class FlinkKafkaProducer09.
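For context, this is the kind of sink setup that triggers the behaviour (a sketch only: the broker address, topic name, and the distributeDataStream variable are placeholders, not code from an actual job):

// The three-argument constructor of FlinkKafkaProducer09 silently plugs in a
// FlinkFixedPartitioner -- which, as we will see, is the source of the problem.
distributeDataStream.addSink(new FlinkKafkaProducer09<String>("localhost:9092", "my-topic", new SimpleStringSchema()));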

2. Analyzing the source code of FlinkKafkaProducer09
public class FlinkKafkaProducer09<IN> extends FlinkKafkaProducerBase<IN> {
    private static final long serialVersionUID = 1L;

    public FlinkKafkaProducer09(String brokerList, String topicId, SerializationSchema<IN> serializationSchema) {
        this(topicId, (KeyedSerializationSchema)(new KeyedSerializationSchemaWrapper(serializationSchema)), getPropertiesFromBrokerList(brokerList), (FlinkKafkaPartitioner)(new FlinkFixedPartitioner()));
    }

    public FlinkKafkaProducer09(String topicId, SerializationSchema<IN> serializationSchema, Properties producerConfig) {
        this(topicId, (KeyedSerializationSchema)(new KeyedSerializationSchemaWrapper(serializationSchema)), producerConfig, (FlinkKafkaPartitioner)(new FlinkFixedPartitioner()));
    }

    public FlinkKafkaProducer09(String topicId, SerializationSchema<IN> serializationSchema, Properties producerConfig, @Nullable FlinkKafkaPartitioner<IN> customPartitioner) {
        this(topicId, (KeyedSerializationSchema)(new KeyedSerializationSchemaWrapper(serializationSchema)), producerConfig, (FlinkKafkaPartitioner)customPartitioner);
    }

    public FlinkKafkaProducer09(String brokerList, String topicId, KeyedSerializationSchema<IN> serializationSchema) {
        this(topicId, (KeyedSerializationSchema)serializationSchema, getPropertiesFromBrokerList(brokerList), (FlinkKafkaPartitioner)(new FlinkFixedPartitioner()));
    }

    public FlinkKafkaProducer09(String topicId, KeyedSerializationSchema<IN> serializationSchema, Properties producerConfig) {
        this(topicId, (KeyedSerializationSchema)serializationSchema, producerConfig, (FlinkKafkaPartitioner)(new FlinkFixedPartitioner()));
    }

    public FlinkKafkaProducer09(String topicId, KeyedSerializationSchema<IN> serializationSchema, Properties producerConfig, @Nullable FlinkKafkaPartitioner<IN> customPartitioner) {
        super(topicId, serializationSchema, producerConfig, customPartitioner);
    }

    /** @deprecated */
    @Deprecated
    public FlinkKafkaProducer09(String topicId, SerializationSchema<IN> serializationSchema, Properties producerConfig, KafkaPartitioner<IN> customPartitioner) {
        this(topicId, (KeyedSerializationSchema)(new KeyedSerializationSchemaWrapper(serializationSchema)), producerConfig, (KafkaPartitioner)customPartitioner);
    }

    /** @deprecated */
    @Deprecated
    public FlinkKafkaProducer09(String topicId, KeyedSerializationSchema<IN> serializationSchema, Properties producerConfig, KafkaPartitioner<IN> customPartitioner) {
        super(topicId, serializationSchema, producerConfig, new FlinkKafkaDelegatePartitioner(customPartitioner));
    }

    protected void flush() {
        if (this.producer != null) {
            this.producer.flush();
        }

    }
}

We only need to focus on the following two constructors:

    public FlinkKafkaProducer09(String brokerList, String topicId, SerializationSchema<IN> serializationSchema) {
        this(topicId, (KeyedSerializationSchema)(new KeyedSerializationSchemaWrapper(serializationSchema)), getPropertiesFromBrokerList(brokerList), (FlinkKafkaPartitioner)(new FlinkFixedPartitioner()));
    }

    public FlinkKafkaProducer09(String topicId, KeyedSerializationSchema<IN> serializationSchema, Properties producerConfig, @Nullable FlinkKafkaPartitioner<IN> customPartitioner) {
        super(topicId, serializationSchema, producerConfig, customPartitioner);
    }

Look mainly at the first constructor: the partitioner it passes along is a new FlinkFixedPartitioner(), so partition assignment is handled by that class. Let's examine it next.

3. Analyzing the source code of FlinkFixedPartitioner
public class FlinkFixedPartitioner<T> extends FlinkKafkaPartitioner<T> {
    private static final long serialVersionUID = -3785320239953858777L;
    private int parallelInstanceId;

    public FlinkFixedPartitioner() {
    }

    public void open(int parallelInstanceId, int parallelInstances) {
        Preconditions.checkArgument(parallelInstanceId >= 0, "Id of this subtask cannot be negative.");
        Preconditions.checkArgument(parallelInstances > 0, "Number of subtasks must be larger than 0.");
        this.parallelInstanceId = parallelInstanceId;
    }

    public int partition(T record, byte[] key, byte[] value, String targetTopic, int[] partitions) {
        Preconditions.checkArgument(partitions != null && partitions.length > 0, "Partitions of the target topic is empty.");
        return partitions[this.parallelInstanceId % partitions.length];
    }

    public boolean equals(Object o) {
        return this == o || o instanceof FlinkFixedPartitioner;
    }

    public int hashCode() {
        return FlinkFixedPartitioner.class.hashCode();
    }
}

The line return partitions[this.parallelInstanceId % partitions.length] is the culprit: each sink subtask always writes to one fixed partition, so when the sink parallelism is smaller than the number of partitions, only the first "parallelism" partitions ever receive data and the remaining ones are never written to. Let's now write our own replacement for FlinkKafkaProducer09, a class called MyFlinkKafkaProducer09.
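To make the effect concrete, here is a small standalone sketch (plain Java, with made-up numbers: 3 sink subtasks and a topic with 10 partitions) that replays the arithmetic used by FlinkFixedPartitioner:

public class FixedPartitionerDemo {
    public static void main(String[] args) {
        int parallelism = 3;            // assumed number of sink subtasks
        int[] partitions = new int[10]; // assumed partitions 0..9 of the target topic
        for (int i = 0; i < partitions.length; i++) {
            partitions[i] = i;
        }
        for (int subtask = 0; subtask < parallelism; subtask++) {
            // same formula as FlinkFixedPartitioner.partition()
            int target = partitions[subtask % partitions.length];
            System.out.println("subtask " + subtask + " -> partition " + target);
        }
        // Output: subtasks 0, 1, 2 map to partitions 0, 1, 2; partitions 3..9 never receive data.
    }
}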

4. Analyzing the logic of the built-in FlinkKafkaProducer09

1>. Our class needs to extend FlinkKafkaProducerBase

//
// Source code recreated from a .class file by IntelliJ IDEA
// (powered by Fernflower decompiler)
//

package org.apache.flink.streaming.connectors.kafka;

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Properties;
import java.util.Map.Entry;
import org.apache.flink.annotation.Internal;
import org.apache.flink.annotation.VisibleForTesting;
import org.apache.flink.api.common.functions.RuntimeContext;
import org.apache.flink.api.java.ClosureCleaner;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.MetricGroup;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.apache.flink.streaming.api.functions.sink.SinkFunction.Context;
import org.apache.flink.streaming.api.operators.StreamingRuntimeContext;
import org.apache.flink.streaming.connectors.kafka.internals.metrics.KafkaMetricWrapper;
import org.apache.flink.streaming.connectors.kafka.partitioner.FlinkKafkaDelegatePartitioner;
import org.apache.flink.streaming.connectors.kafka.partitioner.FlinkKafkaPartitioner;
import org.apache.flink.streaming.util.serialization.KeyedSerializationSchema;
import org.apache.flink.util.NetUtils;
import org.apache.flink.util.SerializableObject;
import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.serialization.ByteArraySerializer;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

@Internal
public abstract class FlinkKafkaProducerBase<IN> extends RichSinkFunction<IN> implements CheckpointedFunction {
    private static final Logger LOG = LoggerFactory.getLogger(FlinkKafkaProducerBase.class);
    private static final long serialVersionUID = 1L;
    public static final String KEY_DISABLE_METRICS = "flink.disable-metrics";
    protected final Properties producerConfig;
    protected final String defaultTopicId;
    protected final KeyedSerializationSchema<IN> schema;
    protected final FlinkKafkaPartitioner<IN> flinkKafkaPartitioner;
    protected final Map<String, int[]> topicPartitionsMap;
    protected boolean logFailuresOnly;
    protected boolean flushOnCheckpoint = true;
    protected transient KafkaProducer<byte[], byte[]> producer;
    protected transient Callback callback;
    protected transient volatile Exception asyncException;
    protected final SerializableObject pendingRecordsLock = new SerializableObject();
    protected long pendingRecords;

    public FlinkKafkaProducerBase(String defaultTopicId, KeyedSerializationSchema<IN> serializationSchema, Properties producerConfig, FlinkKafkaPartitioner<IN> customPartitioner) {
        Objects.requireNonNull(defaultTopicId, "TopicID not set");
        Objects.requireNonNull(serializationSchema, "serializationSchema not set");
        Objects.requireNonNull(producerConfig, "producerConfig not set");
        ClosureCleaner.clean(customPartitioner, true);
        ClosureCleaner.ensureSerializable(serializationSchema);
        this.defaultTopicId = defaultTopicId;
        this.schema = serializationSchema;
        this.producerConfig = producerConfig;
        this.flinkKafkaPartitioner = customPartitioner;
        if (!producerConfig.containsKey("key.serializer")) {
            this.producerConfig.put("key.serializer", ByteArraySerializer.class.getName());
        } else {
            LOG.warn("Overwriting the '{}' is not recommended", "key.serializer");
        }

        if (!producerConfig.containsKey("value.serializer")) {
            this.producerConfig.put("value.serializer", ByteArraySerializer.class.getName());
        } else {
            LOG.warn("Overwriting the '{}' is not recommended", "value.serializer");
        }

        if (!this.producerConfig.containsKey("bootstrap.servers")) {
            throw new IllegalArgumentException("bootstrap.servers must be supplied in the producer config properties.");
        } else {
            this.topicPartitionsMap = new HashMap();
        }
    }

    public void setLogFailuresOnly(boolean logFailuresOnly) {
        this.logFailuresOnly = logFailuresOnly;
    }

    public void setFlushOnCheckpoint(boolean flush) {
        this.flushOnCheckpoint = flush;
    }

    @VisibleForTesting
    protected <K, V> KafkaProducer<K, V> getKafkaProducer(Properties props) {
        return new KafkaProducer(props);
    }

    public void open(Configuration configuration) {
        this.producer = this.getKafkaProducer(this.producerConfig);
        RuntimeContext ctx = this.getRuntimeContext();
        if (null != this.flinkKafkaPartitioner) {
            if (this.flinkKafkaPartitioner instanceof FlinkKafkaDelegatePartitioner) {
                ((FlinkKafkaDelegatePartitioner)this.flinkKafkaPartitioner).setPartitions(getPartitionsByTopic(this.defaultTopicId, this.producer));
            }

            this.flinkKafkaPartitioner.open(ctx.getIndexOfThisSubtask(), ctx.getNumberOfParallelSubtasks());
        }

        LOG.info("Starting FlinkKafkaProducer ({}/{}) to produce into default topic {}", new Object[]{ctx.getIndexOfThisSubtask() + 1, ctx.getNumberOfParallelSubtasks(), this.defaultTopicId});
        if (!Boolean.parseBoolean(this.producerConfig.getProperty("flink.disable-metrics", "false"))) {
            Map<MetricName, ? extends Metric> metrics = this.producer.metrics();
            if (metrics == null) {
                LOG.info("Producer implementation does not support metrics");
            } else {
                MetricGroup kafkaMetricGroup = this.getRuntimeContext().getMetricGroup().addGroup("KafkaProducer");
                Iterator var5 = metrics.entrySet().iterator();

                while(var5.hasNext()) {
                    Entry<MetricName, ? extends Metric> metric = (Entry)var5.next();
                    kafkaMetricGroup.gauge(((MetricName)metric.getKey()).name(), new KafkaMetricWrapper((Metric)metric.getValue()));
                }
            }
        }

        if (this.flushOnCheckpoint && !((StreamingRuntimeContext)this.getRuntimeContext()).isCheckpointingEnabled()) {
            LOG.warn("Flushing on checkpoint is enabled, but checkpointing is not enabled. Disabling flushing.");
            this.flushOnCheckpoint = false;
        }

        if (this.logFailuresOnly) {
            this.callback = new Callback() {
                public void onCompletion(RecordMetadata metadata, Exception e) {
                    if (e != null) {
                        FlinkKafkaProducerBase.LOG.error("Error while sending record to Kafka: " + e.getMessage(), e);
                    }

                    FlinkKafkaProducerBase.this.acknowledgeMessage();
                }
            };
        } else {
            this.callback = new Callback() {
                public void onCompletion(RecordMetadata metadata, Exception exception) {
                    if (exception != null && FlinkKafkaProducerBase.this.asyncException == null) {
                        FlinkKafkaProducerBase.this.asyncException = exception;
                    }

                    FlinkKafkaProducerBase.this.acknowledgeMessage();
                }
            };
        }

    }

    public void invoke(IN next, Context context) throws Exception {
        this.checkErroneous();
        byte[] serializedKey = this.schema.serializeKey(next);
        byte[] serializedValue = this.schema.serializeValue(next);
        String targetTopic = this.schema.getTargetTopic(next);
        if (targetTopic == null) {
            targetTopic = this.defaultTopicId;
        }

        int[] partitions = (int[])this.topicPartitionsMap.get(targetTopic);
        if (null == partitions) {
            partitions = getPartitionsByTopic(targetTopic, this.producer);
            this.topicPartitionsMap.put(targetTopic, partitions);
        }

        ProducerRecord record;
        if (this.flinkKafkaPartitioner == null) {
            record = new ProducerRecord(targetTopic, serializedKey, serializedValue);
        } else {
            record = new ProducerRecord(targetTopic, this.flinkKafkaPartitioner.partition(next, serializedKey, serializedValue, targetTopic, partitions), serializedKey, serializedValue);
        }

        if (this.flushOnCheckpoint) {
            synchronized(this.pendingRecordsLock) {
                ++this.pendingRecords;
            }
        }

        this.producer.send(record, this.callback);
    }

    public void close() throws Exception {
        if (this.producer != null) {
            this.producer.close();
        }

        this.checkErroneous();
    }

    private void acknowledgeMessage() {
        if (this.flushOnCheckpoint) {
            synchronized(this.pendingRecordsLock) {
                --this.pendingRecords;
                if (this.pendingRecords == 0L) {
                    this.pendingRecordsLock.notifyAll();
                }
            }
        }

    }

    protected abstract void flush();

    public void initializeState(FunctionInitializationContext context) throws Exception {
    }

    public void snapshotState(FunctionSnapshotContext ctx) throws Exception {
        this.checkErroneous();
        if (this.flushOnCheckpoint) {
            this.flush();
            synchronized(this.pendingRecordsLock) {
                if (this.pendingRecords != 0L) {
                    throw new IllegalStateException("Pending record count must be zero at this point: " + this.pendingRecords);
                }

                this.checkErroneous();
            }
        }

    }

    protected void checkErroneous() throws Exception {
        Exception e = this.asyncException;
        if (e != null) {
            this.asyncException = null;
            throw new Exception("Failed to send data to Kafka: " + e.getMessage(), e);
        }
    }

    public static Properties getPropertiesFromBrokerList(String brokerList) {
        String[] elements = brokerList.split(",");
        String[] var2 = elements;
        int var3 = elements.length;

        for(int var4 = 0; var4 < var3; ++var4) {
            String broker = var2[var4];
            NetUtils.getCorrectHostnamePort(broker);
        }

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", brokerList);
        return props;
    }

    protected static int[] getPartitionsByTopic(String topic, KafkaProducer<byte[], byte[]> producer) {
        List<PartitionInfo> partitionsList = new ArrayList(producer.partitionsFor(topic));
        Collections.sort(partitionsList, new Comparator<PartitionInfo>() {
            public int compare(PartitionInfo o1, PartitionInfo o2) {
                return Integer.compare(o1.partition(), o2.partition());
            }
        });
        int[] partitions = new int[partitionsList.size()];

        for(int i = 0; i < partitions.length; ++i) {
            partitions[i] = ((PartitionInfo)partitionsList.get(i)).partition();
        }

        return partitions;
    }

    @VisibleForTesting
    protected long numPendingRecords() {
        synchronized(this.pendingRecordsLock) {
            return this.pendingRecords;
        }
    }
}

2>. Pay attention to the following constructor of FlinkKafkaProducerBase:

public FlinkKafkaProducerBase(String defaultTopicId, KeyedSerializationSchema<IN> serializationSchema, Properties producerConfig, FlinkKafkaPartitioner<IN> customPartitioner) {
        Objects.requireNonNull(defaultTopicId, "TopicID not set");
        Objects.requireNonNull(serializationSchema, "serializationSchema not set");
        Objects.requireNonNull(producerConfig, "producerConfig not set");
        ClosureCleaner.clean(customPartitioner, true);
        ClosureCleaner.ensureSerializable(serializationSchema);
        this.defaultTopicId = defaultTopicId;
        this.schema = serializationSchema;
        this.producerConfig = producerConfig;
        this.flinkKafkaPartitioner = customPartitioner;
        if (!producerConfig.containsKey("key.serializer")) {
            this.producerConfig.put("key.serializer", ByteArraySerializer.class.getName());
        } else {
            LOG.warn("Overwriting the '{}' is not recommended", "key.serializer");
        }

        if (!producerConfig.containsKey("value.serializer")) {
            this.producerConfig.put("value.serializer", ByteArraySerializer.class.getName());
        } else {
            LOG.warn("Overwriting the '{}' is not recommended", "value.serializer");
        }

        if (!this.producerConfig.containsKey("bootstrap.servers")) {
            throw new IllegalArgumentException("bootstrap.servers must be supplied in the producer config properties.");
        } else {
            this.topicPartitionsMap = new HashMap();
        }
    }

3>. Also note this piece of code in the invoke() method of the same class:

    if (this.flinkKafkaPartitioner == null) {
        record = new ProducerRecord(targetTopic, serializedKey, serializedValue);
    }

4>. Next, let's look at the ProducerRecord class

//
// Source code recreated from a .class file by IntelliJ IDEA
// (powered by Fernflower decompiler)
//

package org.apache.kafka.clients.producer;

public final class ProducerRecord<K, V> {
    private final String topic;
    private final Integer partition;
    private final K key;
    private final V value;

    public ProducerRecord(String topic, Integer partition, K key, V value) {
        if (topic == null) {
            throw new IllegalArgumentException("Topic cannot be null");
        } else {
            this.topic = topic;
            this.partition = partition;
            this.key = key;
            this.value = value;
        }
    }

    public ProducerRecord(String topic, K key, V value) {
        this(topic, (Integer)null, key, value);
    }

    public ProducerRecord(String topic, V value) {
        this(topic, (Object)null, value);
    }

    public String topic() {
        return this.topic;
    }

    public K key() {
        return this.key;
    }

    public V value() {
        return this.value;
    }

    public Integer partition() {
        return this.partition;
    }

    public String toString() {
        String key = this.key == null ? "null" : this.key.toString();
        String value = this.value == null ? "null" : this.value.toString();
        return "ProducerRecord(topic=" + this.topic + ", partition=" + this.partition + ", key=" + key + ", value=" + value;
    }

    public boolean equals(Object o) {
        if (this == o) {
            return true;
        } else if (!(o instanceof ProducerRecord)) {
            return false;
        } else {
            ProducerRecord that;
            label56: {
                that = (ProducerRecord)o;
                if (this.key != null) {
                    if (this.key.equals(that.key)) {
                        break label56;
                    }
                } else if (that.key == null) {
                    break label56;
                }

                return false;
            }

            label49: {
                if (this.partition != null) {
                    if (this.partition.equals(that.partition)) {
                        break label49;
                    }
                } else if (that.partition == null) {
                    break label49;
                }

                return false;
            }

            if (this.topic != null) {
                if (!this.topic.equals(that.topic)) {
                    return false;
                }
            } else if (that.topic != null) {
                return false;
            }

            if (this.value != null) {
                if (!this.value.equals(that.value)) {
                    return false;
                }
            } else if (that.value != null) {
                return false;
            }

            return true;
        }
    }

    public int hashCode() {
        int result = this.topic != null ? this.topic.hashCode() : 0;
        result = 31 * result + (this.partition != null ? this.partition.hashCode() : 0);
        result = 31 * result + (this.key != null ? this.key.hashCode() : 0);
        result = 31 * result + (this.value != null ? this.value.hashCode() : 0);
        return result;
    }
}

5>. Focus on this constructor of the class:

    public ProducerRecord(String topic, Integer partition, K key, V value) {
        if (topic == null) {
            throw new IllegalArgumentException("Topic cannot be null");
        } else {
            this.topic = topic;
            this.partition = partition;
            this.key = key;
            this.value = value;
        }
    }

6>. From the code we can see that when no partition is given, ProducerRecord simply stores null in its partition field; the Kafka client then falls back to its own default partitioner, which spreads keyless records across all partitions of the topic. So the custom class we need turns out to be very simple.
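For reference, the sketch below paraphrases the behaviour of the Kafka client's default partitioner (in the 0.9 client this is org.apache.kafka.clients.producer.internals.DefaultPartitioner); the class name and hash function here are simplified stand-ins, not the actual implementation:

import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;

public class DefaultPartitionerSketch {
    private final AtomicInteger counter = new AtomicInteger(0);

    public int partition(byte[] keyBytes, int numPartitions) {
        if (keyBytes == null) {
            // no key: round-robin over all partitions, so every partition eventually receives data
            return toPositive(counter.getAndIncrement()) % numPartitions;
        }
        // with a key: the key hash picks the partition (the real client uses murmur2)
        return toPositive(Arrays.hashCode(keyBytes)) % numPartitions;
    }

    private static int toPositive(int number) {
        return number & 0x7fffffff;
    }
}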

5. Writing our own Flink-to-Kafka producer class MyFlinkKafkaProducer09
package com.run;

import org.apache.flink.api.common.serialization.SerializationSchema;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducerBase;
import org.apache.flink.streaming.connectors.kafka.partitioner.FlinkKafkaPartitioner;
import org.apache.flink.streaming.util.serialization.KeyedSerializationSchema;
import org.apache.flink.streaming.util.serialization.KeyedSerializationSchemaWrapper;
import org.codehaus.commons.nullanalysis.Nullable;

import java.util.Properties;

public class MyFlinkKafkaProducer09<IN> extends FlinkKafkaProducerBase<IN> {

    public MyFlinkKafkaProducer09(String brokerList, String topicId, SerializationSchema<IN> serializationSchema) {
        // Key difference from FlinkKafkaProducer09: pass null instead of new FlinkFixedPartitioner(),
        // so records carry no fixed partition and Kafka distributes them across all partitions.
        this(topicId, (KeyedSerializationSchema)(new KeyedSerializationSchemaWrapper(serializationSchema)), getPropertiesFromBrokerList(brokerList), null);
    }

    public MyFlinkKafkaProducer09(String topicId, KeyedSerializationSchema<IN> serializationSchema, Properties producerConfig, @Nullable FlinkKafkaPartitioner<IN> customPartitioner) {
        super(topicId, serializationSchema, producerConfig, customPartitioner);
    }

    protected void flush() {
        if (this.producer != null) {
            this.producer.flush();
        }
    }
}

Here we do not pass a FlinkKafkaPartitioner (no new FlinkFixedPartitioner()); we pass null instead, so FlinkKafkaProducerBase leaves partition selection to Kafka and records end up in all partitions.
Use it in your own real-time computation job:

distributeDataStream.addSink(new MyFlinkKafkaProducer09<String>("localhost:9092", "my-topic", new SimpleStringSchema()));
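As a side note, the same effect can likely be achieved without a subclass by handing a null partitioner to one of the stock constructors shown in section 2 (a sketch under that assumption; the cast disambiguates the overloads, and the broker address and topic name are placeholders):

Properties props = new Properties();
props.setProperty("bootstrap.servers", "localhost:9092");
distributeDataStream.addSink(new FlinkKafkaProducer09<String>(
        "my-topic", new SimpleStringSchema(), props,
        (FlinkKafkaPartitioner<String>) null)); // null partitioner: let Kafka choose the partition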

Reposted from: https://www.cnblogs.com/jiashengmei/p/10739037.html
