CPU: 8 cores
RAM: 32 GB
OS: Aliyun Linux 17.1, 64-bit
Packages: kafka_2.11-0.10.0.0.tgz spark-2.1.1-bin-hadoop2.7.tgz hadoop-2.7.1_64bit.tar.gz jdk-8u65-linux-x64.tar.gz
Setup order
kafka -> hadoop -> spark -> hbase
1. Set up Kafka
tar -xzf kafka_2.11-0.10.0.0.tgz
cd kafka_2.11-0.10.0.0
cd config
vim server.properties
Edit the parameter #advertised.listeners=PLAINTEXT://your.host.name:9092: remove the # and set it to the server's IP.
Start ZooKeeper (run from the Kafka root directory): bin/zookeeper-server-start.sh config/zookeeper.properties
Start Kafka: bin/kafka-server-start.sh config/server.properties
Create a topic from the shell as a test:
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic helloKafka
The topic data lives on disk wherever log.dirs points in server.properties; the default is log.dirs=/tmp/kafka-logs.
Under /tmp/kafka-logs you will see a helloKafka-0 directory. --replication-factor 1 sets the replica count (1 is enough on a single machine); --partitions 1 sets the number of partitions (--partitions 2 would create helloKafka-0 and helloKafka-1).
List the existing topics: bin/kafka-topics.sh --list --zookeeper localhost:2181
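You can also inspect how a topic's partitions and replicas are laid out:
bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic helloKafka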
Test a producer from the shell:
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic helloKafka
hello kafka
Start a consumer from the shell:
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic helloKafka --from-beginning
hello kafka
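For reference, the same round trip can also be driven from code. A minimal producer sketch, assuming the kafka-clients 0.10.0.0 jar is on the classpath (HelloKafkaProducer is just an illustrative name):
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object HelloKafkaProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    // Send one test message to the topic created above
    producer.send(new ProducerRecord[String, String]("helloKafka", "hello kafka"))
    producer.close()
  }
}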
2. Set up Hadoop
1. Configure the hostname
$vim /etc/sysconfig/network
$source /etc/sysconfig/network
(sourcing the file only affects the current shell; reboot, or run hostname spark1, for the new hostname to take effect)
For example:
NETWORKING=yes
HOSTNAME=spark1
2. Configure passwordless SSH to this machine
ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
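To confirm key-based login works (some systems also require chmod 600 ~/.ssh/authorized_keys), the following should log in without a password prompt:
ssh localhost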
3. Configure hosts
vi /etc/hosts
Add the following:
127.0.0.1 spark1
...plus hostname/IP entries for any other machines.
4. Extract the Hadoop tarball and edit its configuration files
cd /home/hadoop-2.7.1/etc/hadoop
(1) Edit hadoop-env.sh
Open it with vim:
vim [hadoop]/etc/hadoop/hadoop-env.sh
The main changes are the JAVA_HOME path and HADOOP_CONF_DIR.
Around line 27 of hadoop-env.sh, change export JAVA_HOME=${JAVA_HOME} to the actual JDK path, e.g. export JAVA_HOME=/home/software/jdk1.8, and point HADOOP_CONF_DIR at the Hadoop config directory:
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/home/hadoop-2.7.1/etc/hadoop/"}
Reload the file so the changes take effect:
source hadoop-env.sh
(2) Edit core-site.xml
Open it with vim:
vim [hadoop]/etc/hadoop/core-site.xml
Add the NameNode address and the file storage (working directory) location:
<configuration>
<property>
<!--Address of the HDFS master (NameNode)-->
<name>fs.defaultFS</name>
<value>hdfs://spark1:9000</value>
</property>
<property>
<!--Working directory where Hadoop stores its files; the path below assumes the layout used above-->
<name>hadoop.tmp.dir</name>
<value>/home/hadoop-2.7.1/tmp</value>
</property>
</configuration>
(3) Edit hdfs-site.xml
Open it with vim:
vim [hadoop]/etc/hadoop/hdfs-site.xml
Configure the number of replicas (the count includes the original copy).
<configuration>
<property>
<!--Number of copies HDFS keeps of each block, including the original; the default is 3-->
<!--In pseudo-distributed mode this must be 1-->
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
(4) Edit mapred-site.xml
Note: the Hadoop etc/hadoop directory only ships mapred-site.xml.template, so make a copy first:
cp mapred-site.xml.template mapred-site.xml
Open it with vim:
vim [hadoop]/etc/hadoop/mapred-site.xml
Configure MapReduce to run on YARN:
<configuration>
<property>
<!--Run MapReduce on YARN-->
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
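With mapreduce.framework.name set to yarn, the NodeManager normally also needs the MapReduce shuffle service declared in yarn-site.xml. This step is not in the original notes, so treat it as a sketch and adjust it to your layout:
vim [hadoop]/etc/hadoop/yarn-site.xml
<configuration>
<property>
<!--Auxiliary service the MapReduce shuffle relies on-->
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<!--ResourceManager host; spark1 is the hostname configured earlier-->
<name>yarn.resourcemanager.hostname</name>
<value>spark1</value>
</property>
</configuration>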
(5) Start Hadoop
From the sbin directory:
./start-all.sh
(on a brand-new install the NameNode has to be formatted first; see the sketch below)
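A sketch of the usual first-time sequence, assuming the install lives in /home/hadoop-2.7.1 as above:
cd /home/hadoop-2.7.1
bin/hdfs namenode -format    # only before the very first start; it wipes HDFS metadata
sbin/start-all.sh
jps                          # should list NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager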
3. Write the Spark Streaming job
The job uses the spark-streaming-kafka-0-10 integration and the HBase 0.98 client, so it needs imports along these lines:
import java.text.SimpleDateFormat
import java.util.Date
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges, OffsetRange}
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes

// local[2]: at least two threads, so the Kafka consumer and the processing can run concurrently
val conf = new SparkConf().setMaster("local[2]").setAppName("text")
val ssc = new StreamingContext(conf, Seconds(30))
// Kafka connection parameters
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "127.0.0.1:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "use_a_separate_group_id_for_each_stream",
//"auto.offset.reset" -> "earliest",
"enable.auto.commit" -> (false: java.lang.Boolean)
)
// Create the Kafka consumer (direct stream)
val topics = Array("test_1","test_2")
val stream = KafkaUtils.createDirectStream[String, String](
ssc,
PreferConsistent,
Subscribe[String, String](topics, kafkaParams)
)
// Word count over the Kafka message values
val str = stream.flatMap(record => record.value().split(" "))
val lines = str.map(x => (x, 1L)).reduceByKey(_ + _)
// Write to HBase (the connection handling here is not optimized: one connection per record;
// see the foreachPartition sketch after the job for a cheaper variant)
lines.foreachRDD(rdd => {
  rdd.foreach(r => {
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set("hbase.zookeeper.quorum", "127.0.0.1")
    hbaseConf.set("hbase.zookeeper.property.clientPort", "2182")
    val statTable = new HTable(hbaseConf, "tabe")
    val arr = r._1.split("\\.")
    // Only keep words that look like dotted names (more than two segments)
    if (arr.length > 2) {
      val key = "row" + r._1
      val put = new Put(Bytes.toBytes(key))
      put.add(Bytes.toBytes("fam1"), Bytes.toBytes("col2"), Bytes.toBytes(r._2.toString))
      statTable.put(put)
    }
    statTable.close()
    println(r._1)
    println(r._2)
  })
})
// Commit the Kafka consumer offsets back to Kafka (Spark marks this API as experimental)
stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // Only untilOffset matters for the commit, so rebuild the ranges with from == until
  val offsets = offsetRanges.map(o => OffsetRange.create(o.topic, o.partition, o.untilOffset, o.untilOffset))
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsets)
}
// Helper: today's date as yyyy-MM-dd, used as the suffix for the saved files
def getNowDate(): String = {
  val dateFormat = new SimpleDateFormat("yyyy-MM-dd")
  dateFormat.format(new Date())
}
// Also dump the raw messages to local disk or HDFS (the suffix is evaluated once, when the job is defined)
stream.saveAsTextFiles("/home/test",getNowDate())
ssc.start()
ssc.awaitTermination()
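As noted in the comment above, opening an HBase connection per record is expensive. A common refinement (a sketch, not part of the original job) opens one connection per partition instead; it would replace the lines.foreachRDD block above:
lines.foreachRDD { rdd =>
  rdd.foreachPartition { iter =>
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set("hbase.zookeeper.quorum", "127.0.0.1")
    hbaseConf.set("hbase.zookeeper.property.clientPort", "2182")
    // One HTable per partition instead of per record
    val table = new HTable(hbaseConf, "tabe")
    iter.foreach { r =>
      val arr = r._1.split("\\.")
      if (arr.length > 2) {
        val put = new Put(Bytes.toBytes("row" + r._1))
        put.add(Bytes.toBytes("fam1"), Bytes.toBytes("col2"), Bytes.toBytes(r._2.toString))
        table.put(put)
      }
    }
    table.close()
  }
}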
4. For a single-machine Spark install, just submit the job:
./spark-submit --driver-class-path=.. --class kafka.test /home/kafkatest.jar
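The job also needs the Kafka and HBase client classes at runtime; a sketch with placeholder paths (the HBase client additionally pulls in hbase-common, hbase-protocol and their dependencies):
./spark-submit --class kafka.test --jars /path/to/spark-streaming-kafka-0-10_2.11-2.1.1.jar,/path/to/kafka-clients-0.10.0.0.jar,/path/to/hbase-client-0.98.17-hadoop2.jar /home/kafkatest.jar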
5. Set up HBase
Download and extract hbase-0.98.17-bin.tar.gz
Set the JDK path in conf/hbase-env.sh:
export JAVA_HOME=/home/software/jdk1.7
Edit the configuration file conf/hbase-site.xml:
<configuration>
<property>
<name>hbase.rootdir</name>
<value>hdfs://spark1:9000/hbase</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
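The streaming job above writes to table "tabe" with column family "fam1" and points its client at ZooKeeper on port 2182, so hbase.zookeeper.property.clientPort should be set to match (Kafka's own ZooKeeper above already uses 2181). After that, start HBase and create the table; a sketch of the usual steps:
bin/start-hbase.sh
bin/hbase shell
create 'tabe', 'fam1'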