Big Data - Spark Examples

Spark Log Analysis in Practice

Tomcat access log sample
110.52.250.126 - - [30/May/2018:17:38:20 +0800] "GET /source/plugin/wsh_wx/img/wsh_zk.css HTTP/1.1" 200 1482
27.19.74.143 - - [30/May/2018:17:38:20 +0800] "GET /static/image/common/faq.gif HTTP/1.1" 200 1127
27.19.74.143 - - [30/May/2018:17:38:20 +0800] "GET /static/image/common/hot_1.gif HTTP/1.1" 200 680
27.19.74.143 - - [30/May/2018:17:38:20 +0800] "GET /static/image/common/hot_1.gif HTTP/1.1" 200 682
27.19.74.143 - - [30/May/2018:17:38:20 +0800] "GET /static/image/filetype/common.gif HTTP/1.1" 200 90
110.52.250.126 - - [30/May/2018:17:38:20 +0800] "GET /source/plugin/wsh_wx/img/wx_jqr.gif HTTP/1.1" 200 1770
27.19.74.143 - - [30/May/2018:17:38:20 +0800] "GET /static/image/common/recommend_1.gif HTTP/1.1" 200 1030
110.52.250.126 - - [30/May/2018:17:38:20 +0800] "GET /static/image/common/wsh_zk.css HTTP/1.1" 200 4542
27.19.74.143 - - [30/May/2018:17:38:20 +0800] "GET /data/attachment/common/c8/common_2_verify_icon.png HTTP/1.1" 200 582
27.19.74.143 - - [30/May/2018:17:38:20 +0800] "GET /static/image/common/pn.png HTTP/1.1" 200 592
27.19.74.143 - - [30/May/2018:17:38:20 +0800] "GET /static/image/editor/editor.gif HTTP/1.1" 200 13648
8.35.201.165 - - [30/May/2018:17:38:21 +0800] "GET /uc_server/data/avatar/000/05/94/42_avatar_middle.jpg HTTP/1.1" 200 6153
8.35.201.164 - - [30/May/2018:17:38:21 +0800] "GET /uc_server/data/avatar/000/03/13/42_avatar_middle.jpg HTTP/1.1" 200 5087
8.35.201.163 - - [30/May/2018:17:38:21 +0800] "GET /uc_server/data/avatar/000/04/87/42_avatar_middle.jpg HTTP/1.1" 200 5117
8.35.201.165 - - [30/May/2018:17:38:21 +0800] "GET /uc_server/data/avatar/000/01/01/42_avatar_middle.jpg HTTP/1.1" 200 5844
8.35.201.160 - - [30/May/2018:17:38:21 +0800] "GET /uc_server/data/avatar/000/04/12/42_avatar_middle.jpg HTTP/1.1" 200 3174
8.35.201.163 - - [30/May/2018:17:38:21 +0800] "GET /static/image/common/arw_r.gif HTTP/1.1" 200 65
8.35.201.166 - - [30/May/2018:17:38:21 +0800] "GET /static/image/common/search.png HTTP/1.1" 200 210
8.35.201.144 - - [30/May/2018:17:38:21 +0800] "GET /static/image/common/pmto.gif HTTP/1.1" 200 152
8.35.201.161 - - [30/May/2018:17:38:21 +0800] "GET /static/image/common/search.png HTTP/1.1" 200 3047
8.35.201.164 - - [30/May/2018:17:38:21 +0800] "GET /uc_server/data/avatar/000/05/83/35_avatar_middle.jpg HTTP/1.1" 200 7171
8.35.201.160 - - [30/May/2018:17:38:21 +0800] "GET /uc_server/data/avatar/000/01/54/35_avatar_middle.jpg HTTP/1.1" 200 5396
Example 1: Counting image accesses
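
The key step is pulling the requested image path out of each log line and then reducing it to the bare file name, which the full program below does with two regular expressions. A minimal sketch of just that extraction on one sample line (the object name RegexSketch is made up for illustration; the expected values match the comments in the program):

import scala.util.matching.Regex

object RegexSketch {
  def main(args: Array[String]): Unit = {
    val line = "27.19.74.143 - - [30/May/2018:17:38:20 +0800] \"GET /static/image/common/faq.gif HTTP/1.1\" 200 1127"
    //first pattern: the request path ending in a three-letter extension
    val photoDir = "(/(\\w)+)+\\.[a-z]{3}".r.findAllIn(line).mkString(",")          // /static/image/common/faq.gif
    //second pattern: just the file name
    val photoName = new Regex("\\w+\\.[a-z]{3}").findAllIn(photoDir).mkString(",")  // faq.gif
    println((photoName, 1))                                                         // (faq.gif,1)
  }
}
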
Scala code
package Spark

import org.apache.spark.{SparkConf, SparkContext}
import scala.util.matching.Regex

/*
* Parse the Tomcat access log
*
* @author Jabin
* @version 0.0.1
* @date 2019/07/16
* */
object LogCount {
  def main(args: Array[String]): Unit = {
    //create the Spark configuration
    val conf = new SparkConf().setAppName("Log.Count").setMaster("local")
    //create the SparkContext from the configuration
    val sc = new SparkContext(conf)

    val rdd = sc.textFile("C:\\Users\\Administrator\\Desktop\\日志\\tomcat.log")
      .map(
        line => {
          /*
          * Sample line: 27.19.74.143 - - [30/May/2018:17:38:20 +0800] "GET /static/image/common/faq.gif HTTP/1.1" 200 1127
          * Extract the image name with regular expressions
          * */
          val pattern = "(/(\\w)+)+\\.[a-z]{3}".r
          //yields /static/image/common/faq.gif
          val photoDir = pattern.findAllIn(line).mkString(",")
          val regex = new Regex("\\w+\\.[a-z]{3}")
          //yields faq.gif
          val photoName = regex.findAllIn(photoDir).mkString(",")

          (photoName, 1)
        }
      )

    //sum the counts for each image name
    val rdd1 = rdd.reduceByKey(_+_)

    rdd1.foreach(println)

    //sort by access count in descending order
    val rdd2 = rdd1.sortBy(_._2, false)

    rdd2.foreach(println)

    //print the two most accessed images
    rdd2.take(2).foreach(println)

    //shut down the SparkContext
    sc.stop()
  }
}

Results

//reduceByKey result
(editor.gif,1)
(common.gif,1)
(35_avatar_middle.jpg,2)
(pn.png,1)
(wx_jqr.gif,1)
(pmto.gif,1)
(wsh_zk.css,2)
(42_avatar_middle.jpg,5)
(arw_r.gif,1)
(common_2_verify_icon.png,1)
(hot_1.gif,2)
(search.png,2)
(recommend_1.gif,1)
(faq.gif,1)
//sortBy result
(42_avatar_middle.jpg,5)
(35_avatar_middle.jpg,2)
(wsh_zk.css,2)
(hot_1.gif,2)
(search.png,2)
(editor.gif,1)
(common.gif,1)
(pn.png,1)
(wx_jqr.gif,1)
(pmto.gif,1)
(arw_r.gif,1)
(common_2_verify_icon.png,1)
(recommend_1.gif,1)
(faq.gif,1)
//final result: take(2)
(42_avatar_middle.jpg,5)
(35_avatar_middle.jpg,2)
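
Since only the two most accessed images are needed, sorting the whole RDD before take(2) is not strictly required; RDD.top with an ordering on the count returns the same two pairs directly. A small alternative sketch (not in the original program), assuming rdd1 as defined in LogCount above:

    //alternative to sortBy + take(2): take the two pairs with the largest counts
    val top2 = rdd1.top(2)(Ordering.by[(String, Int), Int](_._2))
    top2.foreach(println)
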
Example 2: Creating a custom partitioner

Scala code

package Spark

import org.apache.spark.{Partitioner, SparkConf, SparkContext}
import scala.collection.mutable
import scala.util.matching.Regex

/*
* Parse the Tomcat access log with a custom partitioner
*
* @author Jabin
* @version 0.0.1
* @date 2019/07/16
* */
object PartitionCount {
  def main(args: Array[String]): Unit = {
    //create the Spark configuration
    val conf = new SparkConf().setAppName("Partition.Count").setMaster("local")
    //create the SparkContext from the configuration
    val sc = new SparkContext(conf)

    val rdd = sc.textFile("C:\\Users\\Administrator\\Desktop\\作业\\tomcat.log")
      .map(
        line => {
          /*
          * Sample line: 27.19.74.143 - - [30/May/2018:17:38:20 +0800] "GET /static/image/common/faq.gif HTTP/1.1" 200 1127
          * Extract the image name with regular expressions
          * */
          val pattern = "(/(\\w)+)+\\.[a-z]{3}".r
          //yields /static/image/common/faq.gif
          val photoDir = pattern.findAllIn(line).mkString(",")
          val regex = new Regex("\\w+\\.[a-z]{3}")
          //yields faq.gif
          val photoName = regex.findAllIn(photoDir).mkString(",")

          (photoName, line)
        }
      )

    //collect the distinct photoName values to the driver
    val rdd1 = rdd.map(_._1).distinct().collect

    //create the partitioning rule from the distinct names
    val partition = new PartitionCount(rdd1)
    val rdd2 = rdd.partitionBy(partition)

    rdd2.saveAsTextFile("C:\\Users\\Administrator\\Desktop\\日志\\partition")

    //shut down the SparkContext
    sc.stop()
  }
}

class PartitionCount(array: Array[String]) extends Partitioner{
  //map from photoName to partition id
  val map = new mutable.HashMap[String, Int]()
  //next partition id to assign
  var id = 0

  for (arr <- array){
    map.put(arr,id)
    id += 1
  }

  //return the number of partitions
  override def numPartitions: Int = map.size

  //return the partition for a given photoName; unknown keys fall back to partition 0
  override def getPartition(key: Any): Int = map.getOrElse(key.toString,0)
}

Results

[Screenshot: the partitioned output files under the partition directory]
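
To check which partition each photoName ended up in without opening the output files, the partition contents can be listed with mapPartitionsWithIndex. A minimal sketch, assuming rdd2 as defined in PartitionCount above:

    //print (partitionId, photoName) pairs so the custom partitioning can be verified
    rdd2.mapPartitionsWithIndex((index, iter) =>
      iter.map { case (name, _) => (index, name) }
    ).collect().foreach(println)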

Example 3: Accessing the database (query)
SQL code
CREATE database IF NOT EXISTS DATA;
USE DATA;
CREATE TABLE EMPLOYEE(ID INT NOT NULL AUTO_INCREMENT,NAME VARCHAR(20),SALARY INT,PRIMARY KEY(ID));
INSERT INTO EMPLOYEE(NAME,SALARY) VALUES('Destiny',1000);
INSERT INTO EMPLOYEE(NAME,SALARY) VALUES('Freedom',4500);
INSERT INTO EMPLOYEE(NAME,SALARY) VALUES('Fate',3000);
SELECT * FROM EMPLOYEE;

[Screenshot: the EMPLOYEE table after the inserts]
Scala code

package Spark

import java.sql.DriverManager

import org.apache.spark.rdd.JdbcRDD
import org.apache.spark.{SparkConf, SparkContext}
/*
* Query MySQL through a JdbcRDD
*
* @author Jabin
* @version 0.0.1
* @date 2019/07/17
* */
object JDBC {
  //connection factory passed to the JdbcRDD
  private val connection = () => {
    Class.forName("com.mysql.cj.jdbc.Driver").newInstance()
    DriverManager.getConnection("jdbc:mysql://localhost:3306/data?serverTimezone=GMT%2B8","root","root")
  }

  def main(args: Array[String]): Unit = {
    //create the Spark configuration
    val conf = new SparkConf().setAppName("JDBC.Count").setMaster("local")
    //create the SparkContext from the configuration
    val sc = new SparkContext(conf)

    //JdbcRDD splits the range 3000..6000 into 2 partitions and binds each sub-range to the two '?' placeholders
    val rdd = new JdbcRDD(sc, connection, "SELECT * FROM EMPLOYEE WHERE SALARY >= ? AND SALARY < ?", 3000, 6000, 2, r => {
      val name = r.getString(2)
      val salary = r.getInt(3)

      (name,salary)
    })

    val result = rdd.collect()

    println(result.toBuffer)
//    result.foreach(println)

    //shut down the SparkContext
    sc.stop()
  }
}

Results

ArrayBuffer((FATE,3000), (FREEDOM,4500))
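
Only Fate and Freedom appear because the bound query keeps salaries in [3000, 6000). Note that JdbcRDD requires exactly two '?' placeholders in the SQL: it splits the lowerBound..upperBound range into numPartitions sub-ranges and binds each sub-range to the placeholders, so every partition runs its own bounded query. A rough sketch of that splitting (modeled on the documented behavior, not copied from Spark's source; splitRange is a made-up helper name):

  def splitRange(lower: Long, upper: Long, numPartitions: Int): Seq[(Long, Long)] = {
    val length = BigInt(1) + upper - lower
    (0 until numPartitions).map { i =>
      val start = lower + (BigInt(i) * length / numPartitions).toLong
      val end   = lower + (BigInt(i + 1) * length / numPartitions).toLong - 1
      (start, end)
    }
  }

  //splitRange(3000, 6000, 2) yields (3000,4499) and (4500,6000)
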
Example 4: Accessing the database (insert)
SQL code
CREATE database IF NOT EXISTS DATA;
USE DATA;
CREATE TABLE LOG(PhotoName VARCHAR(50), Num INT);

[Screenshot: the empty LOG table structure]
Scala code

package Spark

import java.sql.{Connection, DriverManager, PreparedStatement}
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.matching.Regex

/*
* Write the aggregated log counts into MySQL
*
* @author Jabin
* @version 0.0.1
* @date 2019/07/18
* */
object MyMySQL {
  var connection : Connection = _
  var pst : PreparedStatement = _
  def main(args: Array[String]): Unit = {
    //create the Spark configuration
    val conf = new SparkConf().setAppName("MySQL.Count").setMaster("local")
    //create the SparkContext from the configuration
    val sc = new SparkContext(conf)

    val rdd = sc.textFile("C:\\Users\\Administrator\\Desktop\\作业\\tomcat.log")
      .map(
        line => {
          /*
          * Sample line: 27.19.74.143 - - [30/May/2018:17:38:20 +0800] "GET /static/image/common/faq.gif HTTP/1.1" 200 1127
          * Extract the image name with regular expressions
          * */
          val pattern = "(/(\\w)+)+\\.[a-z]{3}".r
          //yields /static/image/common/faq.gif
          val photoDir = pattern.findAllIn(line).mkString(",")
          val regex = new Regex("\\w+\\.[a-z]{3}")
          //yields faq.gif
          val photoName = regex.findAllIn(photoDir).mkString(",")

          (photoName, 1)
        }
      )

    //sum the counts for each image name
    val rdd1 = rdd.reduceByKey(_+_)

    //open one JDBC connection per partition and insert its records
    rdd1.foreachPartition(insertData)

    //shut down the SparkContext
    sc.stop()
  }

  def insertData(iter: Iterator[(String, Int)]) = {
    try{
      connection = DriverManager.getConnection("jdbc:mysql://localhost:3306/data?serverTimezone=GMT%2B8","root","root")
      pst = connection.prepareStatement("INSERT INTO LOG VALUES(?,?)")
      iter.foreach(f =>{
        pst.setString(1,f._1)
        pst.setInt(2,f._2)

        pst.executeUpdate()
      })
    }catch{
      case t: Throwable => t.printStackTrace()
    }finally {
      //close the statement first, then the connection
      if (pst != null) pst.close()
      if (connection != null) connection.close()
    }
  }
}

Results

[Screenshot: the LOG table after the inserts]
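
insertData issues one INSERT per record. For larger partitions, JDBC batching reduces round trips to the database; below is a sketch of the same method rewritten with addBatch/executeBatch (insertDataBatched is a made-up name, and it assumes the same LOG table, connection URL and java.sql imports already used in MyMySQL). With MySQL Connector/J, batching typically only pays off once rewriteBatchedStatements=true is added to the JDBC URL.

  def insertDataBatched(iter: Iterator[(String, Int)]): Unit = {
    var connection: Connection = null
    var pst: PreparedStatement = null
    try {
      connection = DriverManager.getConnection("jdbc:mysql://localhost:3306/data?serverTimezone=GMT%2B8", "root", "root")
      pst = connection.prepareStatement("INSERT INTO LOG VALUES(?,?)")
      iter.foreach { case (name, num) =>
        pst.setString(1, name)
        pst.setInt(2, num)
        //queue the row instead of executing it immediately
        pst.addBatch()
      }
      //send all queued rows in one batch
      pst.executeBatch()
    } catch {
      case t: Throwable => t.printStackTrace()
    } finally {
      if (pst != null) pst.close()
      if (connection != null) connection.close()
    }
  }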
