To connect Spark Streaming to Kafka from Java, follow these steps:
- Add the dependency
Add the following dependency to your pom.xml:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.12</artifactId>
<version>3.1.1</version>
</dependency>
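Note: set the version to match your Spark version, and the _2.12 suffix to match the Scala version your Spark build uses. The core spark-streaming artifact for the same Scala and Spark version must also be on the classpath.
- Create a SparkConf object
In your Java code, create a SparkConf object that sets the application name and the master URL. The local[*] master runs Spark inside the current JVM using all available cores, which is convenient for local testing.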
SparkConf conf = new SparkConf().setAppName("KafkaSparkStreaming").setMaster("local[*]");
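- Create a JavaStreamingContext object
Create a JavaStreamingContext from the SparkConf above and specify the batch interval. With the 5-second interval used here, incoming records are processed in 5-second micro-batches.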
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));
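To make the batch interval more concrete, here is a minimal plain-Java sketch of how arrival times fall into 5-second batches. It is a simplification for illustration only and does not use Spark; the class MicroBatchSketch and the helper groupIntoBatches are hypothetical names, not part of any Spark API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class MicroBatchSketch {

    // Hypothetical helper: maps each arrival time (ms) to the start of the
    // batch window that contains it, and groups arrivals by that window.
    static Map<Long, List<Long>> groupIntoBatches(List<Long> arrivalMillis, long batchIntervalMillis) {
        Map<Long, List<Long>> batches = new TreeMap<>();
        for (long t : arrivalMillis) {
            long batchStart = (t / batchIntervalMillis) * batchIntervalMillis;
            batches.computeIfAbsent(batchStart, k -> new ArrayList<>()).add(t);
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Long> arrivals = Arrays.asList(100L, 4_900L, 5_000L, 7_300L, 12_000L);
        // prints {0=[100, 4900], 5000=[5000, 7300], 10000=[12000]}
        System.out.println(groupIntoBatches(arrivals, 5_000L));
    }
}
```

A shorter interval lowers latency but adds per-batch scheduling overhead; each batch should finish processing before the next one is due.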
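- Define the Kafka connection parameters
Specify the Kafka connection parameters, such as the broker address and port, the key and value deserializers, and the consumer group ID.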
Map<String, Object> kafkaParams = new HashMap<>();
kafkaParams.put("bootstrap.servers", "localhost:9092");
kafkaParams.put("key.deserializer", StringDeserializer.class);
kafkaParams.put("value.deserializer", StringDeserializer.class);
kafkaParams.put("group.id", "test-group");
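The parameter map can also be built by a small helper, which makes it easier to reuse across jobs. The sketch below is one possible shape, not a required pattern: the class KafkaParamsSketch and the method buildKafkaParams are hypothetical, and the auto.offset.reset and enable.auto.commit values are assumptions you should adapt to your own delivery requirements. The deserializers are given as class-name strings, which the Kafka consumer accepts in place of Class objects, so the helper itself compiles without Kafka on the classpath.

```java
import java.util.HashMap;
import java.util.Map;

final class KafkaParamsSketch {

    // Hypothetical helper: collects the consumer settings in one place.
    static Map<String, Object> buildKafkaParams(String bootstrapServers, String groupId) {
        Map<String, Object> params = new HashMap<>();
        params.put("bootstrap.servers", bootstrapServers);
        // Deserializers as fully qualified class names.
        params.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        params.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        params.put("group.id", groupId);
        // Assumption: start from the newest records when no committed offset exists.
        params.put("auto.offset.reset", "latest");
        // Assumption: disable periodic auto-commit so offsets are not committed
        // before the corresponding batch has been processed.
        params.put("enable.auto.commit", false);
        return params;
    }

    public static void main(String[] args) {
        Map<String, Object> kafkaParams = buildKafkaParams("localhost:9092", "test-group");
        kafkaParams.forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

The returned map can be passed to ConsumerStrategies.Subscribe in place of the hand-built kafkaParams.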
- Create the DStream
Use the JavaStreamingContext to create a direct DStream, passing the Kafka topic names and the connection parameters defined above.
Collection<String> topics = Arrays.asList("test-topic");
JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(
    jssc,
    LocationStrategies.PreferConsistent(),
    ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams)
);
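- Parse and process the stream
Use the DStream to parse and process the data, for example by filtering, transforming, and aggregating records. The pipeline below keeps only messages containing "error", splits them into words, and prints the word counts for each batch.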
stream.filter(record -> record.value().contains("error"))
    .map(record -> record.value())
    .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
    .mapToPair(word -> new Tuple2<>(word, 1))
    .reduceByKey((a, b) -> a + b)
    .print();
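To see what this pipeline computes for a single batch, the same logic can be sketched with plain Java streams. This is only an illustration that runs without Spark or Kafka; the class ErrorWordCountSketch, the method countWords, and the sample messages are made up for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class ErrorWordCountSketch {

    // Hypothetical helper mirroring filter -> flatMap -> mapToPair -> reduceByKey
    // for the message values of one batch.
    static Map<String, Integer> countWords(List<String> batch) {
        return batch.stream()
            .filter(line -> line.contains("error"))            // keep only error messages
            .flatMap(line -> Arrays.stream(line.split(" ")))   // split each message into words
            .collect(Collectors.toMap(
                word -> word,    // key: the word
                word -> 1,       // initial count per occurrence
                Integer::sum,    // merge counts for the same word
                TreeMap::new));  // sorted map, for stable printing only
    }

    public static void main(String[] args) {
        List<String> batch = Arrays.asList("error disk full", "info all good", "error disk slow");
        // prints {disk=2, error=2, full=1, slow=1}
        System.out.println(countWords(batch));
    }
}
```

In the streaming job, reduceByKey is applied to each batch separately, so the printed counts start from zero every batch interval rather than accumulating over the lifetime of the application.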
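- Start the streaming computation
Finally, call start() on the JavaStreamingContext to begin the streaming computation, then call awaitTermination() to keep the application running until it is stopped.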
jssc.start();
jssc.awaitTermination();
Complete Java example:
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.*;
import org.apache.spark.streaming.kafka010.*;
import scala.Tuple2;

public class KafkaSparkStreaming {
    public static void main(String[] args) throws InterruptedException {
        // Application name and master URL; local[*] is intended for local testing.
        SparkConf conf = new SparkConf().setAppName("KafkaSparkStreaming").setMaster("local[*]");
        // Process incoming records in 5-second micro-batches.
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        // Kafka consumer settings: broker address, deserializers, consumer group.
        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "localhost:9092");
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "test-group");

        // Subscribe to the topic with a direct stream.
        Collection<String> topics = Arrays.asList("test-topic");
        JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(
            jssc,
            LocationStrategies.PreferConsistent(),
            ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams)
        );

        // Count the words of messages containing "error" and print the result per batch.
        stream.filter(record -> record.value().contains("error"))
            .map(record -> record.value())
            .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
            .mapToPair(word -> new Tuple2<>(word, 1))
            .reduceByKey((a, b) -> a + b)
            .print();

        // Start the computation and block until it is stopped.
        jssc.start();
        jssc.awaitTermination();
    }
}