How to reuse a database session object created in foreachPartition in Spark Streaming/RDD?

This post looks at common mistakes when using foreachRDD in Spark Streaming and how to optimize around them: avoid creating connections on the driver, use foreachPartition to cut per-record resource overhead, and reuse connections to improve efficiency. It also covers proper connection management and resource allocation to raise overall data-processing throughput.


 

Design Patterns for using foreachRDD

dstream.foreachRDD is a powerful primitive that allows data to be sent out to external systems. However, it is important to understand how to use this primitive correctly and efficiently. Some of the common mistakes to avoid are as follows.

Often writing data to an external system requires creating a connection object (e.g. a TCP connection to a remote server) and using it to send data to the remote system. For this purpose, a developer may inadvertently try creating the connection object at the Spark driver and then use it in a Spark worker to save the records in the RDDs. For example (in Java),

dstream.foreachRDD(rdd -> {
  Connection connection = createNewConnection(); // executed at the driver
  rdd.foreach(record -> {
    connection.send(record); // executed at the worker
  });
});

This is incorrect as this requires the connection object to be serialized and sent from the driver to the worker. Such connection objects are rarely transferable across machines. This error may manifest as serialization errors (connection object not serializable), initialization errors (connection object needs to be initialized at the workers), etc. The correct solution is to create the connection object at the worker.

However, this can lead to another common mistake - creating a new connection for every record. For example,

dstream.foreachRDD(rdd -> {
  rdd.foreach(record -> {
    Connection connection = createNewConnection();
    connection.send(record);
    connection.close();
  });
});

Typically, creating a connection object has time and resource overheads. Therefore, creating and destroying a connection object for each record can incur unnecessarily high overheads and can significantly reduce the overall throughput of the system. A better solution is to use rdd.foreachPartition - create a single connection object and send all the records in an RDD partition using that connection.

dstream.foreachRDD(rdd -> {
  rdd.foreachPartition(partitionOfRecords -> {
    Connection connection = createNewConnection();
    while (partitionOfRecords.hasNext()) {
      connection.send(partitionOfRecords.next());
    }
    connection.close();
  });
});

This amortizes the connection creation overheads over many records.

Finally, this can be further optimized by reusing connection objects across multiple RDDs/batches. One can maintain a static pool of connection objects that can be reused as RDDs of multiple batches are pushed to the external system, thus further reducing the overheads.

dstream.foreachRDD(rdd -> {
  rdd.foreachPartition(partitionOfRecords -> {
    // ConnectionPool is a static, lazily initialized pool of connections
    Connection connection = ConnectionPool.getConnection();
    while (partitionOfRecords.hasNext()) {
      connection.send(partitionOfRecords.next());
    }
    ConnectionPool.returnConnection(connection); // return to the pool for future reuse
  });
});

Note that the connections in the pool should be lazily created on demand and timed out if not used for a while. This achieves the most efficient sending of data to external systems.
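
The guide leaves ConnectionPool unspecified. Below is a minimal sketch of what such a pool could look like in Java, assuming the same hypothetical Connection/createNewConnection() used in the snippets above; a real pool would also add the idle timeout mentioned above and bound its size (or you could reuse an existing pooling library such as Apache Commons Pool, or HikariCP for JDBC). Because the pool is static, one instance lives in each executor JVM and is shared by all partitions and batches processed there.

import java.util.concurrent.ConcurrentLinkedQueue;

public class ConnectionPool {

  // Stand-in for the external system's client handle (e.g. a JDBC or TCP session).
  public interface Connection {
    void send(Object record);
    void close();
  }

  // Idle connections, lazily created on demand and shared by all tasks in this executor JVM.
  private static final ConcurrentLinkedQueue<Connection> idle = new ConcurrentLinkedQueue<>();

  // Hand out an idle connection if one exists; otherwise create a new one.
  public static Connection getConnection() {
    Connection c = idle.poll();
    return (c != null) ? c : createNewConnection();
  }

  // Return the connection so later partitions and batches can reuse it.
  public static void returnConnection(Connection c) {
    idle.offer(c);
  }

  private static Connection createNewConnection() {
    // Open the real connection here; this placeholder only illustrates the shape.
    return new Connection() {
      public void send(Object record) { /* write the record to the external system */ }
      public void close()             { /* release the underlying session/socket */ }
    };
  }
}

With this shape, the connection cost is paid roughly once per executor rather than once per record or once per partition, which is exactly the reuse the title question asks about.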

Other points to remember:

  • DStreams are executed lazily by the output operations, just like RDDs are lazily executed by RDD actions. Specifically, RDD actions inside the DStream output operations force the processing of the received data. Hence, if your application does not have any output operation, or has output operations like dstream.foreachRDD() without any RDD action inside them, then nothing will get executed. The system will simply receive the data and discard it (see the sketch after this list).

  • By default, output operations are executed one-at-a-time. And they are executed in the order they are defined in the application.
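
To make the first point concrete, here is a small sketch in the same style as the snippets above; transform() and send() are hypothetical helpers.

// No RDD action inside foreachRDD: map is lazy, so each batch is received and then discarded.
dstream.foreachRDD(rdd -> {
  rdd.map(record -> transform(record));
});

// foreach is an RDD action, so this version actually forces the processing of each batch.
dstream.foreachRDD(rdd -> {
  rdd.map(record -> transform(record))
     .foreach(record -> send(record));
});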

 

Spark Partition:

A partition is a block of data read on the node where it runs, and an RDD is made up of multiple partitions. After a shuffle, a partition becomes the set of records belonging to a particular group of keys, ready for subsequent key-based operations and reducing data transfer between nodes.

So a Spark partition is a dynamic concept: a slice of an RDD under some partitioning scheme. If foreachPartition() creates a connection (or pool) per partition, it is easy to end up with several connection pools on the same executor and waste resources. Using a single static connection pool per executor (as in the ConnectionPool sketch above) lets the partitions running on that node share it.

 

Relationship between partitions and executors

In general, the number of partitions should be at least the number of executors; ideally each executor handles about 2-3 partitions.
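
As a rough sketch of that guideline (the factor of 3 and the numExecutors parameter are illustrative assumptions, not fixed rules), one could tune the partition count like this:

import org.apache.spark.api.java.JavaRDD;

public class PartitionTuning {
  // Repartition so that each executor processes roughly 3 partitions.
  // numExecutors would typically come from your cluster config (e.g. --num-executors).
  public static <T> JavaRDD<T> tune(JavaRDD<T> rdd, int numExecutors) {
    int target = numExecutors * 3;
    if (rdd.getNumPartitions() == target) {
      return rdd;                     // already at the desired parallelism
    }
    return rdd.repartition(target);   // triggers a shuffle; coalesce(target) avoids it when only shrinking
  }
}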

 

Relationship between Kafka partitions and executors

With the Kafka direct stream, each Kafka partition maps one-to-one to a Spark RDD partition; those partitions are then scheduled across the available executors.
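
For reference, here is a sketch of creating such a direct stream with spark-streaming-kafka-0-10. The broker address, topic, and group id are placeholders, and jssc is assumed to be an existing JavaStreamingContext; each Kafka partition of the subscribed topic becomes one partition of the RDDs in the stream.

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

Map<String, Object> kafkaParams = new HashMap<>();
kafkaParams.put("bootstrap.servers", "broker:9092");            // placeholder broker
kafkaParams.put("key.deserializer", StringDeserializer.class);
kafkaParams.put("value.deserializer", StringDeserializer.class);
kafkaParams.put("group.id", "example-group");                   // placeholder consumer group

JavaInputDStream<ConsumerRecord<String, String>> stream =
    KafkaUtils.createDirectStream(
        jssc,                                                    // existing JavaStreamingContext (assumed)
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, String>Subscribe(Arrays.asList("my-topic"), kafkaParams));

stream.foreachRDD(rdd -> {
  // One Spark partition per Kafka partition of "my-topic".
  System.out.println("partitions in this batch: " + rdd.getNumPartitions());
});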

 

References:

  • Connection pooling in Spark (design patterns for using foreachRDD): https://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd

  • Partitioning in Apache Spark: https://medium.com/parrot-prediction/partitioning-in-apache-spark-8134ad840b0

  • Spark RDD partitions (Mastering Apache Spark): https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-partitions.html
