0073 Receiving data from a port with Spark Streaming for real-time processing

License: CC 4.0 BY-SA

Tags:

#spark #scala #spark streaming #ncat

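This article describes how to configure the environment variables for Spark Streaming and how to build a local application that receives data from a port and processes it in real time. It mainly involves Scala programming and the ncat tool.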

1. Environment

Windows x64

Java 1.8

Scala 2.10.6

Spark 1.6.0

Hadoop 2.7.5

IntelliJ IDEA 2017.2

nmap (for its ncat command, the counterpart of nc on Linux)

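2. Building the local application

2.1 Environment variables

How to set them: open the system settings, add a variable named XXX_HOME, and use the root directory of the corresponding installation as its value; then add %XXX_HOME%\bin; to the PATH variable.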

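1. Hadoop needs its environment variable set;

2. For Scala, it is best to download and install the matching version yourself and set its environment variable;

3. Spark only needs to be unpacked.

Reference: environment setup guide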
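Before starting the application it can help to confirm that the variables are actually visible to the JVM. The following is a minimal sketch, not part of the original project: `EnvCheck` is a hypothetical helper, and the names `HADOOP_HOME` and `SCALA_HOME` are assumptions that simply follow the XXX_HOME pattern described above.

```scala
object EnvCheck {
  /** Returns the names in `required` that are unset or blank in `env`. */
  def missing(env: Map[String, String], required: Seq[String]): Seq[String] =
    required.filter(name => env.get(name).forall(_.trim.isEmpty))

  def main(args: Array[String]): Unit = {
    // Assumed variable names, following the XXX_HOME convention
    val required = Seq("HADOOP_HOME", "SCALA_HOME")
    val absent = missing(sys.env, required)
    if (absent.isEmpty) println("All required variables are set")
    else println("Missing: " + absent.mkString(", "))
  }
}
```

Passing the environment in as a plain `Map` keeps the check easy to try out with made-up values before pointing it at `sys.env`.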

2.2 Build and test

The project is easy to set up with sbt: use sbt to create a Scala project, which generates the project structure.
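The article does not show the build definition, so the following build.sbt is only a sketch of what such a project might declare. The project name and version are placeholders; the Scala and Spark versions are the ones listed in the Environment section.

```scala
// build.sbt (sketch; name and version are placeholder values)
name := "spark-streaming-test"

version := "0.1"

scalaVersion := "2.10.6"

// %% appends the Scala binary version (_2.10) to the artifact name
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % "1.6.0",
  "org.apache.spark" %% "spark-streaming" % "1.6.0"
)
```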

In that project, testMain.scala is:

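/**
  * notes: To test scala and spark and hadoop
  * date: 2017.12.20
  * author: gendlee
  */
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.log4j.{Level, Logger}
import com.test.SparkStreaming

object test {

  // Only show ERROR-level log output from org.* loggers (Spark, Hadoop)
  Logger.getLogger("org").setLevel(Level.ERROR)

  def main(args: Array[String]): Unit = {

    // Starts the streaming job. This call blocks in awaitTermination(),
    // so the batch code below is not reached while the stream is running.
    SparkStreaming.printWebsites()

    // initiate spark
    // NOTE: the original listing omits the definition of conf;
    // the master and app name used here are assumed values.
    val conf = new SparkConf().setMaster("local").setAppName("test")
    val sc = new SparkContext(conf)

    // read file from local disk
    val rdd = sc.textFile("F:\\Code\\scala2.10.6_spark1.6_hadoop2.8\\Test.log")
  }

}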

and SparkStreaming.scala is:

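/**
  * notes: To test spark streaming
  * date: 2017.12.21
  * author: gendlee
  */
package com.test

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SparkStreaming {
  def printWebsites(): Unit = {

    // local[2]: one thread for the socket receiver, one for processing
    val conf = new SparkConf().setMaster("local[2]").setAppName("PrintWebsites")
    // process the stream in 1-second batches
    val ssc = new StreamingContext(conf, Seconds(1))

    // output path prefix, used only by the commented-out save below
    val output = "F:\\Code\\scala2.10.6_spark1.6_hadoop2.8\\out\\gettedWebsites"

    // connect to localhost:7777 and read newline-delimited text
    val lines = ssc.socketTextStream("localhost", 7777)

    // keep only the lines that contain "http" and print them for each batch
    val websiteLines = lines.filter(_.contains("http"))
    websiteLines.print()
    //websiteLines.repartition(1).saveAsTextFiles(output)

    // start receiving and block until the streaming job is stopped
    ssc.start()
    ssc.awaitTermination()
  }

}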

The goal is to extract from the input the lines that contain a URL, that is, the lines containing http.

Pitfall:

val conf = new SparkConf().setMaster("local[2]").setAppName("PrintWebsites")

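The setMaster argument here must be local[n] with n of at least 2 (local[2] in this case). The socket receiver permanently occupies one thread, so at least one more thread is needed to process the received data. With the default local, which runs a single thread, the receiver takes the only thread and no data is ever processed.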
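The rule above (more local threads than receivers) can be written down as a small check. This is a sketch only: `LocalMaster` is a hypothetical helper and not part of Spark's API, and it only understands the `local`, `local[n]` and `local[*]` forms of the master string.

```scala
object LocalMaster {
  private val Fixed = """local\[(\d+)\]""".r

  /** Threads implied by a local master string, or None for any other master. */
  def threads(master: String): Option[Int] = master match {
    case "local"    => Some(1)
    case "local[*]" => Some(Runtime.getRuntime.availableProcessors())
    case Fixed(n)   => Some(n.toInt)
    case _          => None
  }

  /** True if at least one thread is left for processing after `receivers`
    * receivers each take one. Non-local masters are not checked.
    */
  def enoughThreads(master: String, receivers: Int): Boolean =
    threads(master).forall(_ > receivers)
}
```

For the application in this article there is one receiver (the socket stream), so `LocalMaster.enoughThreads("local", 1)` is false while `LocalMaster.enoughThreads("local[2]", 1)` is true.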

After compiling, run the program. It prints output like this:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/12/22 16:39:14 INFO Slf4jLogger: Slf4jLogger started
17/12/22 16:39:14 INFO Remoting: Starting remoting
17/12/22 16:39:14 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@169.254.78.142:64905]
17/12/22 16:39:15 ERROR ReceiverTracker: Deregistered receiver for stream 0: Restarting receiver with delay 2000ms: Socket data stream had no more data
-------------------------------------------
Time: 1513931956000 ms
-------------------------------------------
Time: 1513931957000 ms
-------------------------------------------

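An error appears. Don't worry: it only means that no data is arriving on port 7777. Stop the program for now, because we first need something that sends data on port 7777.

With socketTextStream() we can receive data from a specific port on a given host. The receiver connects to that port as a TCP client, so the sending side has to be a process that listens on the port. Here is how to send data on port 7777.

Open Windows PowerShell or CMD and enter: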

ncat -lk -p 7777

Then run the program in IDEA again and type into the CMD window that is running ncat. Any input line that contains http is printed in IDEA's Run window.
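If ncat is not at hand, the sending side can also be sketched in plain Scala, which shows what the listening process has to do: accept the connection from Spark's receiver and write newline-terminated lines to it. `LineServer` and the sample lines are hypothetical and only for illustration; unlike `ncat -lk`, this sketch serves a single connection and then exits.

```scala
import java.io.PrintWriter
import java.net.ServerSocket

object LineServer {
  /** Accepts one client on `server` and sends it each line, newline-terminated. */
  def serveOnce(server: ServerSocket, lines: Seq[String]): Unit = {
    // Blocks until a client (for example Spark's socket receiver) connects
    val client = server.accept()
    // Second argument enables auto-flush on println
    val out = new PrintWriter(client.getOutputStream, true)
    try lines.foreach(line => out.println(line))
    finally { out.close(); client.close() }
  }

  def main(args: Array[String]): Unit = {
    val server = new ServerSocket(7777) // same port the streaming job connects to
    try serveOnce(server, Seq("visit http://example.com", "no link here"))
    finally server.close()
  }
}
```

Start it before the streaming program, just like the ncat command. When the lines have been sent the connection closes, and the receiver will log that the socket stream had no more data.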

Filtered output printed in IDEA:

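This reveals a problem: strings such as https are matched as well, which I do not want. In other words, lines where http is only part of a longer word should be excluded, so I will look for a stricter filter later.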
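One possible way to tighten the filter, offered as a sketch rather than as the author's eventual solution, is a regular expression with word boundaries, so that http matches only as a whole word. `HttpFilter` is a hypothetical helper name.

```scala
object HttpFilter {
  // \b is a word boundary: "http" must not be preceded or followed by a
  // letter, digit or underscore, so "https" and "xhttp" do not match.
  private val HttpWord = """\bhttp\b""".r

  /** True if `line` contains "http" as a standalone word. */
  def containsHttpWord(line: String): Boolean =
    HttpWord.findFirstIn(line).isDefined
}
```

In the streaming job this would replace the `contains` check, for example `lines.filter(line => HttpFilter.containsHttpWord(line))`. Note that a line is still kept if it contains both an https URL and a plain http one.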

This completes the task.

3. References:

http://blog.youkuaiyun.com/gendlee1991/article/details/78066548

https://www.cnblogs.com/FG123/p/5324743.html