Kafka Java API: producing data fails with Error: MESSAGE_TOO_LARGE

  • 1. Kafka version: kafka_2.11-2.1.0-kafka-4.0.0.jar
  • 2. server.properties: all tuning parameters left at their defaults
  • 3. Topic: null, all parameters left at their defaults
  • 4. Loading a 1 GB txt file, with only three producer parameters set:

acks: all

batch.size: 1048576

linger.ms: 10

batch.size and linger.ms are the two parameters with the biggest impact on Kafka producer performance. batch.size is the basic unit of the producer's batched sends and defaults to 16384 bytes (16 KB); linger.ms is what the sender thread checks, when deciding whether a batch is ready, to see whether the batch has waited long enough, and it defaults to 0 ms.

So does the producer send batches when batch.size is reached, or at linger.ms intervals? The conclusion up front: as soon as either batch.size or linger.ms is satisfied, the producer starts sending.
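For reference, a minimal sketch of the producer setup from item 4, assuming a local broker at localhost:9092 and string records; only acks, batch.size, and linger.ms differ from the defaults:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all");         // wait for all in-sync replicas
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 1048576); // 1 MB, above the broker default message.max.bytes (1000012)
        props.put(ProducerConfig.LINGER_MS_CONFIG, 10);       // wait up to 10 ms for a batch to fill

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("null", "one line of the 1 GB txt file")); // topic named "null", per item 3
        }
    }
}

With batch.size set above message.max.bytes, any full batch already violates the broker's per-batch limit, which is what triggers the error below.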

  • 5. The load fails with:
[2019-12-18 16:50:50,325] WARN [Producer clientId=producer-1] Got error produce response in correlation id 645 on topic-partition null-0, splitting and retrying (2147483647 attempts left). Error: MESSAGE_TOO_LARGE (org.apache.kafka.clients.producer.internals.Sender)
[2019-12-18 16:50:50,359] WARN [Producer clientId=producer-1] Got error produce response in correlation id 646 on topic-partition null-0, splitting and retrying (2147483647 attempts left). Error: MESSAGE_TOO_LARGE (org.apache.kafka.clients.producer.internals.Sender)

Currently, producers do the batch splitting based on the batch size. However, the split will never succeed when the batch size is much larger than the topic-level max message size.

For instance, if the batch size is set to 8 MB but the broker-side `message.max.bytes` keeps its default value (1000012, about 1 MB), the producer will endlessly try to split the oversized batch and never succeed, as shown below:

[2019-05-10 16:25:09,233] WARN [Producer clientId=producer-1] Got error produce response in correlation id 61 on topic-partition test-0, splitting and retrying (2147483647 attempts left). Error: MESSAGE_TOO_LARGE (org.apache.kafka.clients.producer.internals.Sender:617)
[2019-05-10 16:25:10,021] WARN [Producer clientId=producer-1] Got error produce response in correlation id 62 on topic-partition test-0, splitting and retrying (2147483647 attempts left). Error: MESSAGE_TOO_LARGE (org.apache.kafka.clients.producer.internals.Sender:617)
[2019-05-10 16:25:10,758] WARN [Producer clientId=producer-1] Got error produce response in correlation id 63 on topic-partition test-0, splitting and retrying (2147483647 attempts left). Error: MESSAGE_TOO_LARGE (org.apache.kafka.clients.producer.internals.Sender:617)
[2019-05-10 16:25:12,071] WARN [Producer clientId=producer-1] Got error produce response in correlation id 64 on topic-partition test-0, splitting and retrying (2147483647 attempts left). Error: MESSAGE_TOO_LARGE (org.apache.kafka.clients.producer.internals.Sender:617)

A better solution would be for the producer to split based on the minimum of these two configs. However, it is tricky for the client to obtain the topic-level or broker-level config values. There seem to be three ways to do this:

  1. When the broker throws `RecordTooLargeException`, do not swallow its real message, since it already contains the max message size. If the message is not swallowed, the client can easily read the limit from the response.
  2. Add code to issue a `DescribeConfigsRequest` to retrieve the value (see the AdminClient sketch after this list).
  3. If splitting fails, decrease the batch size gradually until the split succeeds. For example,
// In RecordAccumulator.java
private int steps = 1;
// ...
public int splitAndReenqueue(ProducerBatch bigBatch) {
    // ...
    // Shrink the split size (1/1, 1/2, 1/3, ... of batchSize) until the split takes effect
    Deque<ProducerBatch> dq = bigBatch.split(this.batchSize / steps);
    if (dq.size() == 1) // still a single batch, i.e. the split failed: try a smaller size next time
        steps++;
    // ...
}
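For option 2, a minimal sketch of fetching the topic-level limit through the AdminClient, which issues DescribeConfigsRequest under the hood; broker address and topic name are the same placeholders as above:

import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.config.ConfigResource;

public class MaxMessageBytesLookup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "null"); // placeholder topic
            Config config = admin.describeConfigs(Collections.singleton(topic))
                                 .all().get().get(topic);
            // Topic-level limit; when unset it reflects the broker's message.max.bytes
            System.out.println("max.message.bytes = " + config.get("max.message.bytes").value());
        }
    }
}

A producer could then cap its split size at this value instead of blindly dividing batch.size.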
  • 7. Adjusting the load parameters against the topic's max message size (the broker default `message.max.bytes` is 1000012; the topic-level equivalent is `max.message.bytes`):

With batch.size = 1000001, the load succeeds.

With batch.size = 1000012, the load succeeds.

With batch.size = 1000112, the load fails.
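Conversely, if batches above 1000012 bytes are really wanted, the topic-level limit can be raised rather than shrinking batch.size; a sketch using the same placeholder broker and topic:

import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class RaiseMaxMessageBytes {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "null"); // placeholder topic
            // 2 MB topic-level limit, comfortably above the 1048576-byte batch.size
            Config update = new Config(Collections.singleton(
                    new ConfigEntry("max.message.bytes", "2097152")));
            admin.alterConfigs(Collections.singletonMap(topic, update)).all().get();
        }
    }
}

Note that alterConfigs replaces the resource's entire config set; on clients newer than 2.1, incrementalAlterConfigs is the preferred call.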

  • 8. Open question: setting only the producer-side compression to gzip (the one-line change sketched below), with batch.size = 1048576 (greater than the default message.max.bytes) and everything else unchanged, MESSAGE_TOO_LARGE never occurs. I don't know why; any explanation would be appreciated. (A likely explanation: the broker validates message.max.bytes against the batch as it arrives, i.e. after compression, and 1 MB of text gzips to far below the limit.)
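The one-line change relative to the item 4 sketch (same props object):

props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "gzip"); // batches are compressed client-side before the broker's size check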

  • 9. Then, with the topic's compression.type set to lz4 while the producer still compresses with gzip and batch.size = 1048576, the load fails again. (A plausible explanation: when the topic codec differs from the producer's, the broker decompresses and recompresses each batch with lz4 and re-validates its size; lz4 generally compresses text less tightly than gzip, so the recompressed batch can exceed message.max.bytes.)
