1. Project Purpose
This section covers the integration of Kafka and Storm: data produced to Kafka is analyzed in real time by Storm.
2. Project Workflow
To be written.
3. Project Steps
3.1 Storm Setup
3.1.1 Build the Maven Project
Declare the required dependencies and plugins in pom.xml:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>StormWordCount</groupId>
    <artifactId>StormWordCount</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <!-- https://mvnrepository.com/artifact/org.apache.storm/storm-core -->
        <dependency>
            <groupId>org.apache.storm</groupId>
            <artifactId>storm-core</artifactId>
            <version>1.2.2</version>
            <!-- provided: the Storm cluster supplies storm-core at runtime -->
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.storm</groupId>
            <artifactId>storm-kafka-client</artifactId>
            <version>1.2.2</version>
        </dependency>
        <!-- storm-kafka-client declares kafka-clients as provided, so it has to be
             added explicitly; 2.2.0 is assumed here to match the Kafka 2.2.0
             cluster used below -->
        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka-clients</artifactId>
            <version>2.2.0</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.1</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                    <testExcludes>
                        <testExclude>/src/test/**</testExclude>
                    </testExcludes>
                    <encoding>utf-8</encoding>
                </configuration>
            </plugin>
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id> <!-- this is used for inheritance merges -->
                        <phase>package</phase> <!-- bind the jar merge to the package phase -->
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
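Building the project (the package goal, see 3.3.1) should then produce target/StormWordCount-1.0-SNAPSHOT-jar-with-dependencies.jar, the fat jar that gets submitted to Storm in 3.3.4. storm-core itself stays out of the fat jar because of its provided scope.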
3.1.2 Write the Code
a. Project code layout
b. KafkaSpoutBuilder
package com.zpark;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.storm.kafka.spout.KafkaSpout;
import org.apache.storm.kafka.spout.KafkaSpoutConfig;

import java.util.List;

/**
 * @description: builder for the KafkaSpout
 */
public class KafkaSpoutBuilder {

    private List<String> brokers;
    private String topic;

    public KafkaSpoutBuilder brokers(List<String> v) {
        brokers = v;
        return this;
    }

    public KafkaSpoutBuilder topic(String v) {
        topic = v;
        return this;
    }

    public KafkaSpout<String, String> build() {
        /*
         * Kafka configuration:
         * 1. A consumer group must be set (note: within one group, each partition
         *    is consumed by exactly one consumer).
         * 2. The initial offset strategy should be chosen to fit the business need.
         */
        String allBrokers = String.join(",", brokers);
        KafkaSpoutConfig<String, String> conf = KafkaSpoutConfig
                .builder(allBrokers, topic)
                .setProp(ConsumerConfig.GROUP_ID_CONFIG, "word-count-storm")
                // consume only messages produced after the topology starts
                .setFirstPollOffsetStrategy(KafkaSpoutConfig.FirstPollOffsetStrategy.LATEST)
                .build();
        return new KafkaSpout<>(conf);
    }
}
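LATEST means the spout only sees messages produced after the topology starts; anything already sitting in the topic is skipped. If the job should instead resume from the consumer group's last committed offset (a common at-least-once setup), a hedged one-line alternative for the strategy in build() is:

                // assumption: resume from the group's last committed offset, falling back
                // to the earliest available offset only when the group has no commits yet
                .setFirstPollOffsetStrategy(KafkaSpoutConfig.FirstPollOffsetStrategy.UNCOMMITTED_EARLIEST)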
c. KafkaSplitSentenceBolt
package com.zpark;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

import java.util.Map;

/**
 * @description: sentence-splitting bolt
 *
 * For example, given the input Tuple("sentence" -> "I love teacher shao"),
 * the emitted tuples are:
 *   Tuple("word" -> "I")
 *   Tuple("word" -> "love")
 *   Tuple("word" -> "teacher")
 *   Tuple("word" -> "shao")
 */
public class KafkaSplitSentenceBolt extends BaseRichBolt {

    private OutputCollector collector;

    @Override
    public void prepare(Map map, TopologyContext topologyContext, OutputCollector outputCollector) {
        this.collector = outputCollector;
    }

    @Override
    public void execute(Tuple tuple) { // receives the tuple stream emitted by the Kafka spout
        // look up the sentence by field name; "value" is one of the fixed
        // field names the Kafka spout emits
        String sentence = tuple.getStringByField("value");
        String[] words = sentence.split(" "); // split the sentence on spaces
        for (String word : words) {
            // emit each word as the value of a new tuple for the next bolt
            this.collector.emit(new Values(word));
        }
        // ack the spout tuple to mark it as successfully processed; without
        // the ack, the kafka-spout keeps re-sending the message
        this.collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {
        outputFieldsDeclarer.declare(new Fields("word")); // field name of the emitted tuples
    }
}
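Note that the emit in execute() is unanchored, so a failure in a downstream bolt does not replay the original Kafka message. For at-least-once semantics, each word tuple can be anchored to the input tuple instead; a minimal sketch of the loop body, using the same OutputCollector API:

            // anchored variant: a downstream failure makes the kafka-spout replay the message
            this.collector.emit(tuple, new Values(word));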
d. KafkaWordCountBolt
package com.zpark;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

import java.util.HashMap;
import java.util.Map;

/**
 * @description: word-count bolt
 */
public class KafkaWordCountBolt extends BaseRichBolt {

    private OutputCollector collector;
    private HashMap<String, Long> counts = null; // running count for each word

    @Override
    public void prepare(Map map, TopologyContext topologyContext, OutputCollector outputCollector) {
        this.collector = outputCollector;
        this.counts = new HashMap<String, Long>();
    }

    @Override
    public void execute(Tuple tuple) { // receives the tuple stream emitted by KafkaSplitSentenceBolt
        String word = tuple.getStringByField("word"); // look up the word by field name
        // update the total number of occurrences of this word
        Long count = counts.getOrDefault(word, 0L);
        count++;
        this.counts.put(word, count);
        // emit the word and its current count as the values of the output tuple
        this.collector.emit(new Values(word, count.toString()));
        this.collector.ack(tuple); // good practice, even though the upstream emit is unanchored
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {
        // two field names because each output tuple carries two values; "key" and
        // "message" are the defaults expected by the KafkaBolt mapper below
        outputFieldsDeclarer.declare(new Fields("key", "message"));
    }
}
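The counts map is held in memory per bolt task, so it resets whenever the topology restarts; that is acceptable for this exercise. If the per-word output volume ever becomes a problem, the bolt could emit counts on a timer instead of on every word, using Storm's tick tuples. A minimal sketch (assumptions: a 10-second interval, plus imports of org.apache.storm.Config and org.apache.storm.utils.TupleUtils):

    // in KafkaWordCountBolt: ask Storm to deliver a tick tuple every 10 seconds
    @Override
    public Map<String, Object> getComponentConfiguration() {
        Config conf = new Config();
        conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 10);
        return conf;
    }

    // at the top of execute(): on a tick, flush all current counts and return
    if (TupleUtils.isTick(tuple)) {
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            collector.emit(new Values(e.getKey(), e.getValue().toString()));
        }
        collector.ack(tuple);
        return;
    }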
e. WordCountKafkaTopology
package com.zpark;

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.generated.AlreadyAliveException;
import org.apache.storm.generated.AuthorizationException;
import org.apache.storm.generated.InvalidTopologyException;
import org.apache.storm.kafka.bolt.KafkaBolt;
import org.apache.storm.kafka.bolt.mapper.FieldNameBasedTupleToKafkaMapper;
import org.apache.storm.kafka.bolt.selector.DefaultTopicSelector;
import org.apache.storm.kafka.spout.KafkaSpout;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

import java.util.Arrays;
import java.util.Properties;

/**
 * @description: WordCount topology that reads from and writes to Kafka
 */
public class WordCountKafkaTopology {

    private static final String SENTENCE_SPOUT_ID = "sentence-spout";
    private static final String SPLIT_BOLT_ID = "split-bolt";
    private static final String COUNT_BOLT_ID = "count-bolt";
    private static final String KAFKA_BOLT_ID = "kafka-bolt";
    private static final String TOPOLOGY_NAME = "word-count-topology";

    public static void main(String[] args) throws InvalidTopologyException, AuthorizationException, AlreadyAliveException {
        int workers = Integer.parseInt(args[0]); // number of worker processes, passed on the command line

        // 1. Consume data from Kafka
        KafkaSpout kafkaSpout = new KafkaSpoutBuilder()
                .brokers(Arrays.asList("hdp-1:9092"))
                .topic("word-count-input")
                .build();
        KafkaSplitSentenceBolt splitSentenceBolt = new KafkaSplitSentenceBolt();
        KafkaWordCountBolt wordCountBolt = new KafkaWordCountBolt();

        // 2. Write the results back to Kafka
        Properties props = new Properties();
        props.put("bootstrap.servers", "hdp-1:9092");
        // "acks" controls when a produce request is considered complete, i.e. how many
        // brokers must have committed the data to their log and confirmed it:
        //   0: the producer never waits for an acknowledgement from the broker --
        //      lowest latency, highest risk (data is lost if the server dies)
        //   1: the leader replica acknowledges once it has received the data --
        //      low latency, with confirmation that the server accepted the write
        //  -1: the producer waits until all in-sync replicas have received the data
        props.put("acks", "1");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        KafkaBolt kafkaBolt = new KafkaBolt()
                .withProducerProperties(props)
                .withTopicSelector(new DefaultTopicSelector("word-count-output"))
                .withTupleToKafkaMapper(new FieldNameBasedTupleToKafkaMapper());

        // Wire the components together
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout(SENTENCE_SPOUT_ID, kafkaSpout);
        builder.setBolt(SPLIT_BOLT_ID, splitSentenceBolt).shuffleGrouping(SENTENCE_SPOUT_ID);
        // fieldsGrouping on "word": the same word always goes to the same counter task
        builder.setBolt(COUNT_BOLT_ID, wordCountBolt).fieldsGrouping(SPLIT_BOLT_ID, new Fields("word"));
        builder.setBolt(KAFKA_BOLT_ID, kafkaBolt).shuffleGrouping(COUNT_BOLT_ID);

        // 3. Submit the topology
        Config config = new Config(); // runtime configuration applied globally to all components of the topology
        config.setNumWorkers(workers);
        StormSubmitter.submitTopology(TOPOLOGY_NAME, config, builder.createTopology());
    }
}
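StormSubmitter requires a running cluster (section 3.3). For a quick single-machine test, a hedged alternative for step 3 is Storm's local mode; note that storm-core's Maven scope must be changed from provided to compile, org.apache.storm.LocalCluster must be imported, and InterruptedException added to main's throws clause:

        // local mode: run the topology inside the current JVM for about a minute
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology(TOPOLOGY_NAME, config, builder.createTopology());
        Thread.sleep(60000);
        cluster.killTopology(TOPOLOGY_NAME);
        cluster.shutdown();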
3.2 Kafka Setup
3.2.1 Start Kafka
A helper script handles this; from the ~ directory, run ./start-kafka-cluster.sh
3.2.2 Create the Topics
On hdp-1, cd apps/kafka_2.12-2.2.0/bin and create the topic word-count-input:
./kafka-topics.sh --create --zookeeper hdp-1:2181,hdp-2:2181,hdp-3:2181 --replication-factor 1 --partitions 1 --topic word-count-input
Create the topic word-count-output the same way:
./kafka-topics.sh --create --zookeeper hdp-1:2181,hdp-2:2181,hdp-3:2181 --replication-factor 1 --partitions 1 --topic word-count-output
3.2.3 Start a Producer and a Consumer
a. Start a producer that sends messages to word-count-input.
On hdp-1: cd apps/kafka_2.12-2.2.0/bin
Start the producer: ./kafka-console-producer.sh --broker-list hdp-1:9092,hdp-2:9092,hdp-3:9092 --topic word-count-input
b. Start a consumer that reads messages from word-count-output.
On hdp-2: cd apps/kafka_2.12-2.2.0/bin
Start the consumer: ./kafka-console-consumer.sh --bootstrap-server hdp-1:9092,hdp-2:9092,hdp-3:9092 --topic word-count-output --property print.key=true
3.3 Verify the Results
3.3.1 Package the Storm code: double-click the package goal in the IDE's Maven panel
3.3.2 Upload the jar
3.3.3 Start ZooKeeper and Storm
a. ZooKeeper is started with a script: ./zkmanager.sh start
b. Start Storm
On hdp-1, start Nimbus and the Web UI:
cd apps/storm-1.2.2/bin
./storm nimbus 2>&1 &
Press Enter, switch to a second terminal, and run:
./storm ui 2>&1 &
Press Enter again.
On hdp-2 and hdp-3, start the Supervisors:
cd apps/storm-1.2.2/bin
./storm supervisor 2>&1 &
3.3.4 Submit the Storm Job
a. Submit the topology
In apps/storm-1.2.2/bin, run the command below (if the environment is configured, it can be run from anywhere).
com.zpark.WordCountKafkaTopology is the fully qualified class containing the main method; the trailing 1 is its argument (here, the number of worker processes):
storm jar StormWordCount-1.0-SNAPSHOT-jar-with-dependencies.jar com.zpark.WordCountKafkaTopology 1
Result of a successful submission:
b. Check the Web UI (hdp-1:8080)
3.3.5 Verification
a. Send messages to Kafka from the console producer
b. Check the consumer output
c. Check Storm's Web UI
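As a concrete check: sending the line "hello world hello" from the producer should, once the topology is running, make the consumer on word-count-output print something like "hello 1", "world 1", and then "hello 2". With print.key=true, each line shows the word (the Kafka message key) followed by its current count.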