E-commerce Metrics Project: Real-Time Channel Hot-Spot Analysis Development

This article walks through a real-time data analysis solution covering channel hot-spot statistics, PV/UV analysis, user freshness, regional distribution, carrier preference, and browser usage. Using the Flink stream-processing framework, channel data is monitored and aggregated in real time across hour, day, and month dimensions, providing data support for fine-grained operations.


1. Business Introduction


Channel hot-spot analysis counts how many times each channel is visited (clicked).

The analysis produces data like the following:

| Channel ID | Visits |
| ---------- | ------ |
| Channel 1  | 128    |
| Channel 2  | 401    |
| Channel 3  | 501    |

Historical click counts need to be accumulated on top of each new window's result.
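The accumulation rule can be sketched as a pure function; here a plain Map stands in for the HBase `channel` table that the sink below reads and writes:

```scala
// Merge one window's click count into the running total per channel.
// (The Map is a stand-in for the HBase "channel" table, for illustration only.)
def accumulate(store: Map[String, Long], channelId: String, windowCount: Long): Map[String, Long] =
  store + (channelId -> (store.getOrElse(channelId, 0L) + windowCount))
```

Replaying two windows for the same channel yields the summed total, which is exactly what the HBase sink does with its read-then-write.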

2. Business Development

Steps

  1. Create a real-time hot-spot case class dedicated to the hot-spot computation
  2. Convert the preprocessed data into the target case class (channel, visit count)
  3. Group (partition) by channel
  4. Divide into time windows (3 seconds per window)
  5. Merge and count within each window
  6. Print for testing
  7. Sink the computed data to HBase

Implementation

  1. Create a ChannelRealHotTask singleton object
  2. Add a ChannelRealHot case class encapsulating the two business fields to be counted: channel ID (channelID) and visit count (visited)
  3. Write a process method in ChannelRealHotTask that receives the preprocessed DataStream
  4. Use the map operator to convert each ClickLogWide object into a ChannelRealHot
  5. Partition by channel ID
  6. Divide into time windows (3 seconds per window)
  7. Run a reduce to merge and aggregate
  8. Sink the merged data to HBase
    • Check whether a result record already exists in HBase
    • If it exists, fetch it and accumulate
    • If not, write directly
package com.xu.realprocess.task

import com.xu.realprocess.bean.ClickLogWide
import com.xu.realprocess.util.HBaseUtil
import org.apache.commons.lang.StringUtils
import org.apache.flink.streaming.api.scala.{DataStream, KeyedStream, WindowedStream}
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.datastream.DataStreamSink
import org.apache.flink.streaming.api.functions.sink.SinkFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow



case class ChannelRealHot(var channelid: String, var visited: Long)

/**
  * Channel hot-spot analysis
  *
  * 1. Field conversion
  * 2. Grouping
  * 3. Time window
  * 4. Aggregation
  * 5. Sink to HBase
  *
  */
object ChannelRealHotTask {

  def process(clickLogWideDataStream: DataStream[ClickLogWide]) = {

    // 1. Field conversion: channelid, visited
    val realHotDataStream: DataStream[ChannelRealHot] = clickLogWideDataStream.map {
      clickLogWide: ClickLogWide =>
        ChannelRealHot(clickLogWide.channelID, clickLogWide.count)
    }
    // 2. Grouping
    val keyedStream: KeyedStream[ChannelRealHot, String] = realHotDataStream.keyBy(_.channelid)

    // 3. Time window
    val windowedStream: WindowedStream[ChannelRealHot, String, TimeWindow] = keyedStream.timeWindow(Time.seconds(3))

    // 4. Aggregation
    val reduceDataStream: DataStream[ChannelRealHot] = windowedStream.reduce {
      (t1: ChannelRealHot, t2: ChannelRealHot) =>
        ChannelRealHot(t1.channelid, t1.visited + t2.visited)
    }

    // 5. Sink to HBase
    reduceDataStream.addSink(new SinkFunction[ChannelRealHot] {

      override def invoke(value: ChannelRealHot): Unit = {

        // HBase-related fields
        val tableName = "channel"
        val clfName = "info"
        val channelIdColumn = "channelId"
        val visitedColumn = "visited"
        val rowkey = value.channelid

        // Query HBase for the existing record
        val visitedValue: String = HBaseUtil.getData(tableName, rowkey, clfName, visitedColumn)
        // Temporary variable for the running total
        var totalCount: Long = 0

        if (StringUtils.isBlank(visitedValue)) {
          totalCount = value.visited
        } else {
          totalCount = visitedValue.toLong + value.visited
        }

        // Save the data
        HBaseUtil.putMapData(tableName, rowkey, clfName, Map(
          channelIdColumn -> value.channelid,
          visitedColumn -> totalCount.toString
        ))
      }
    })
  }

}

3 Real-Time Channel PV/UV Analysis

Channel PV and UV are analyzed across different time dimensions. There are three dimensions:

  • Hour
  • Day
  • Month

3.1 Business Introduction


PV (page views)

Page View: each page load or refresh counts as one view.

UV (unique visitors)

Unique Visitor: within a given period, the same client is counted only once.

The statistics to be produced look like this:

| Channel ID | Time       | PV   | UV  |
| ---------- | ---------- | ---- | --- |
| Channel 1  | 2017010116 | 1230 | 350 |
| Channel 2  | 2017010117 | 1251 | 330 |
| Channel 3  | 2017010118 | 5512 | 610 |
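The two definitions can be pinned down with a small sketch (the `Click` fields here are illustrative, not the project's ClickLogWide):

```scala
// PV: every page view counts; UV: each client is counted once within the period.
case class Click(channelId: String, userId: String, hour: String)

def pvUv(clicks: Seq[Click], channelId: String, hour: String): (Long, Long) = {
  val hits = clicks.filter(c => c.channelId == channelId && c.hour == hour)
  (hits.size.toLong, hits.map(_.userId).distinct.size.toLong)
}
```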

3.2 Hour-Dimension PV/UV Development


Steps

  1. Create a channel PV/UV case class
  2. Convert the preprocessed data into the target case class (channel, PV, UV)
  3. Group (partition) by channel and time
  4. Divide into time windows (3 seconds per window)
  5. Merge and count within each window
  6. Print for testing
  7. Sink the computed data to HBase

Implementation

  1. Create a ChannelPvUvTask singleton object
  2. Add a ChannelPvUv case class encapsulating the four business fields to be counted: channel ID (channelID), year-month-day-hour, PV, UV
  3. Write a processHourDim method in ChannelPvUvTask that receives the preprocessed DataStream
  4. Use the map operator to convert each ClickLogWide object into a ChannelPvUv
  5. Partition by channel ID and year-month-day-hour
  6. Divide into time windows (3 seconds per window)
  7. Run a reduce to merge and aggregate
  8. Print for testing
  9. Sink the merged data to HBase
    • Check whether a result record already exists in HBase
    • If it exists, fetch it and accumulate
    • If not, write directly

3.3 Day-Dimension PV/UV Development

Computing PV/UV by day works the same as the hour dimension; only the grouping field differs. You can copy the hour-dimension PV/UV code and adjust it.
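Concretely, the only thing that changes between the variants is which date field is concatenated into the grouping key (`Dims` below is an illustrative stand-in for the date fields on ChannelPvUv):

```scala
// Per-dimension grouping keys: channel ID concatenated with the matching date string.
case class Dims(channelID: String, yearMonthDayHour: String, yearMonthDay: String, yearMonth: String)

val hourKey:  Dims => String = d => d.channelID + d.yearMonthDayHour
val dayKey:   Dims => String = d => d.channelID + d.yearMonthDay
val monthKey: Dims => String = d => d.channelID + d.yearMonth
```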

3.4 Hour/Day/Month-Dimension PV/UV Development


In fact, the code for all three dimensions is the same. We can emit the records for the three time dimensions into one stream and group them together.

Approach

  1. From each ClickLogWide, generate three ChannelPvUv records, one per dimension for the statistics
  • ChannelPvUv --> hour dimension
  • ChannelPvUv --> day dimension
  • ChannelPvUv --> month dimension

Implementation

  1. Use the flatMap operator to convert each ClickLogWide into three ChannelPvUv records
  2. Re-run the tests

Core code:

```scala
  def process(clicklogWideDataStream:DataStream[ClickLogWide]) = {
    ...
    val channelPvUvDataStream: DataStream[ChannelPvUv] = clicklogWideDataStream.flatMap {
      clicklog =>
        List(
          ChannelPvUv(clicklog.channelID, clicklog.yearMonthDayHour, clicklog.count, clicklog.isHourNew),
          ChannelPvUv(clicklog.channelID, clicklog.yearMonthDay, clicklog.count, clicklog.isDayNew),
          ChannelPvUv(clicklog.channelID, clicklog.yearMonth, clicklog.count, clicklog.isMonthNew)
        )
    }
    ...
  }
```

4 Real-Time Channel User Freshness Analysis

4.1 Business Introduction


User freshness is the ratio of active new vs. returning users per hour, day, and month.

Freshness lets you:

  • Understand, at a macro level, the daily new/returning user ratio and its source structure

  • See whether the day's new users correlate with that day's promotion activities

The statistics to be produced look like this:

| Channel ID | Time       | New users | Returning users |
| ---------- | ---------- | --------- | --------------- |
| Channel 1  | 201703     | 512       | 144             |
| Channel 1  | 20170318   | 4114      | 123             |
| Channel 1  | 2017031810 | 342       | 4412            |

4.2 Business Development


Steps

  1. Create a channel freshness case class with the fields (channel, time, new users, returning users)
  2. Convert the preprocessed data into the freshness case class
  3. Group (partition) by channel and time
  4. Divide into time windows (3 seconds per window)
  5. Merge and count within each window
  6. Print for testing
  7. Sink the computed data to HBase

Implementation

  1. Create a ChannelFreshnessTask singleton object

  2. Add a ChannelFreshness case class encapsulating the four business fields to be counted: channel ID (channelID), date (date), new users (newCount), returning users (oldCount)

  3. Write a process method in ChannelFreshnessTask that receives the preprocessed DataStream

  4. Use the flatMap operator to convert each ClickLogWide into three ChannelFreshness records, one per time dimension

  5. Partition by channel ID and date

  6. Divide into time windows (3 seconds per window)

  7. Run a reduce to merge and aggregate

  8. Print for testing

  9. Sink the merged data to HBase

    • Prepare the HBase table name, column family name, rowkey, and column names
    • Check whether a result record already exists in HBase
    • If it exists, fetch it and accumulate
    • If not, write directly

Note:

Returning users need careful handling here; without this check, some user visits would be double-counted.

  1. A new user is identified by the isNew flag set when the ClickLog is widened

  2. A returning user needs a check:

  • If isNew is 0 and the matching period flag (isHourNew / isDayNew / isMonthNew) is 1, count 1 returning user
  • Otherwise count 0
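The returning-user rule reduces to exactly the `isOld` predicate used in the code below:

```scala
// Returning user: already known to the site (isNew == 0) but making their
// first visit in this hour/day/month (the per-period flag isDateNew == 1).
val isOld = (isNew: Int, isDateNew: Int) => if (isNew == 0 && isDateNew == 1) 1 else 0
```

Without the isDateNew condition, a returning user would be counted again on every click inside the same period.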

Core code:

// 1. Add a `ChannelFreshness` case class encapsulating the four business fields: channel ID (channelID), date (date), new users (newCount), returning users (oldCount)
case class ChannelFreshness(var channelID: String,
                            var date: String,
                            var newCount: Long,
                            var oldCount: Long)



object ChannelFreshnessTask {
  // 2. Write a `process` method in `ChannelFreshnessTask` that receives the preprocessed `DataStream`
  def process(clicklogWideDataStream: DataStream[ClickLogWide]) = {

    // 3. Use the flatMap operator to convert each `ClickLogWide` into three `ChannelFreshness` records
    val channelFreshnessDataStream: DataStream[ChannelFreshness] = clicklogWideDataStream.flatMap {
      clicklog =>
        // Returning user: isNew == 0 but first visit in this period (isDateNew == 1)
        val isOld = (isNew: Int, isDateNew: Int) => if (isNew == 0 && isDateNew == 1) 1 else 0

        List(
          ChannelFreshness(clicklog.channelID, clicklog.yearMonthDayHour, clicklog.isNew, isOld(clicklog.isNew, clicklog.isHourNew)),
          ChannelFreshness(clicklog.channelID, clicklog.yearMonthDay, clicklog.isNew, isOld(clicklog.isNew, clicklog.isDayNew)),
          ChannelFreshness(clicklog.channelID, clicklog.yearMonth, clicklog.isNew, isOld(clicklog.isNew, clicklog.isMonthNew))
        )
    }

    // 4. Partition by `channel ID` and `date`

    val groupedDateStream: KeyedStream[ChannelFreshness, String] = channelFreshnessDataStream.keyBy {
      freshness =>
        freshness.channelID + freshness.date
    }

    // 5. Divide into time windows (3 seconds per window)
    val windowStream: WindowedStream[ChannelFreshness, String, TimeWindow] = groupedDateStream.timeWindow(Time.seconds(3))

    // 6. Run a reduce to merge and aggregate
    val reduceDataStream: DataStream[ChannelFreshness] = windowStream.reduce {
      (freshness1, freshness2) =>
        ChannelFreshness(freshness2.channelID, freshness2.date, freshness1.newCount + freshness2.newCount, freshness1.oldCount + freshness2.oldCount)
    }

    // Print for testing
    reduceDataStream.print()

    // 7. Sink the merged data to HBase
    reduceDataStream.addSink(new SinkFunction[ChannelFreshness] {
      override def invoke(value: ChannelFreshness): Unit = {
        val tableName = "channel_freshness"
        val cfName = "info"
        // channel ID (channelID), date (date), new users (newCount), returning users (oldCount)
        val channelIdColName = "channelID"
        val dateColName = "date"
        val newCountColName = "newCount"
        val oldCountColName = "oldCount"
        val rowkey = value.channelID + ":" + value.date

        // - Check whether a result record already exists in HBase
        val newCountOldCountMap = HBaseUtil.getMapData(tableName, rowkey, cfName, List(newCountColName, oldCountColName))

        var totalNewCount = 0L
        var totalOldCount = 0L

        // - If it exists, fetch it and accumulate; if not, write directly
        if (newCountOldCountMap != null && StringUtils.isNotBlank(newCountOldCountMap.getOrElse(newCountColName, ""))) {
          totalNewCount = value.newCount + newCountOldCountMap(newCountColName).toLong
        }
        else {
          totalNewCount = value.newCount
        }

        if (newCountOldCountMap != null && StringUtils.isNotBlank(newCountOldCountMap.getOrElse(oldCountColName, ""))) {
          totalOldCount = value.oldCount + newCountOldCountMap(oldCountColName).toLong
        }
        else {
          totalOldCount = value.oldCount
        }

        HBaseUtil.putMapData(tableName, rowkey, cfName, Map(
          channelIdColName -> value.channelID,
          dateColName -> value.date,
          newCountColName -> totalNewCount.toString,
          oldCountColName -> totalOldCount.toString
        ))
      }
    })
  }
}

4.3 Extracting a Common Base Class with the Template Method Pattern

The template method pattern defines the skeleton of an algorithm in a parent class and defers the concrete steps to subclasses, so individual steps can be redefined without changing the algorithm's overall structure.

BaseTask.scala

package com.itheima.realprocess.task

import com.itheima.realprocess.bean.ClickLogWide
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.windows.TimeWindow

// Extract a common trait that every task implements
trait BaseTask[T] {

  /**
    * Map-transform the wide click-log stream, group, window, aggregate, and sink to HBase
    * @param clickLogWideDataStream
    * @return
    */
  def process(clickLogWideDataStream: DataStream[ClickLogWide]):Any={
    val mapDataStream:DataStream[T] = map(clickLogWideDataStream)
    val keyedStream:KeyedStream[T, String] = keyBy(mapDataStream)
    val windowedStream: WindowedStream[T, String, TimeWindow] = timeWindow(keyedStream)
    val reduceDataStream: DataStream[T] = reduce(windowedStream)
    sink2HBase(reduceDataStream)
  }

  // Map-transform the data stream
  def map(source:DataStream[ClickLogWide]):DataStream[T]

  // Grouping
  def keyBy(mapDataStream: DataStream[T]):KeyedStream[T,String]

  // Time window
  def timeWindow(keyedStream: KeyedStream[T, String]):WindowedStream[T, String, TimeWindow]

  // Aggregation
  def reduce(windowedStream: WindowedStream[T, String, TimeWindow]): DataStream[T]

  // Sink to HBase
  def sink2HBase(reduceDataStream: DataStream[T])
}

The refactored code:

// Add a `ChannelFreshness` case class encapsulating the four business fields: channel ID (channelID), date (date), new users (newCount), returning users (oldCount)
case class ChannelFreshness(var channelID: String,
                            var date: String,
                            var newCount: Long,
                            var oldCount: Long)


object ChannelFreshnessTask extends BaseTask[ChannelFreshness] {

  // 1. Use the flatMap operator to convert each `ClickLogWide` into three `ChannelFreshness` records
  override def map(source: DataStream[ClickLogWide]): DataStream[ChannelFreshness] = {
    source.flatMap {
      clicklog =>
        // Returning user: isNew == 0 but first visit in this period (isDateNew == 1)
        val isOld = (isNew: Int, isDateNew: Int) => if (isNew == 0 && isDateNew == 1) 1 else 0

        List(
          ChannelFreshness(clicklog.channelID, clicklog.yearMonthDayHour, clicklog.isNew, isOld(clicklog.isNew, clicklog.isHourNew)),
          ChannelFreshness(clicklog.channelID, clicklog.yearMonthDay, clicklog.isNew, isOld(clicklog.isNew, clicklog.isDayNew)),
          ChannelFreshness(clicklog.channelID, clicklog.yearMonth, clicklog.isNew, isOld(clicklog.isNew, clicklog.isMonthNew))
        )
    }
  }

  override def keyBy(mapDataStream: DataStream[ChannelFreshness]): KeyedStream[ChannelFreshness, String] = {
    mapDataStream.keyBy {
      freshness =>
        freshness.channelID + freshness.date
    }
  }

  override def timeWindow(keyedStream: KeyedStream[ChannelFreshness, String]): WindowedStream[ChannelFreshness, String, TimeWindow] = {
    keyedStream.timeWindow(Time.seconds(3))
  }

  override def reduce(windowedStream: WindowedStream[ChannelFreshness, String, TimeWindow]): DataStream[ChannelFreshness] = {
    windowedStream.reduce {
      (freshness1, freshness2) =>
        ChannelFreshness(freshness2.channelID, freshness2.date, freshness1.newCount + freshness2.newCount, freshness1.oldCount + freshness2.oldCount)
    }
  }

  override def sink2HBase(reduceDataStream: DataStream[ChannelFreshness]) = {
    reduceDataStream.addSink {
      value => {
        val tableName = "channel_freshness"
        val cfName = "info"
        // channel ID (channelID), date (date), new users (newCount), returning users (oldCount)
        val channelIdColName = "channelID"
        val dateColName = "date"
        val newCountColName = "newCount"
        val oldCountColName = "oldCount"
        val rowkey = value.channelID + ":" + value.date

        // - Check whether a result record already exists in HBase
        val newCountInHBase = HBaseUtil.getData(tableName, rowkey, cfName, newCountColName)
        val oldCountInHBase = HBaseUtil.getData(tableName, rowkey, cfName, oldCountColName)

        var totalNewCount = 0L
        var totalOldCount = 0L

        // Accumulate with any historical metrics already in HBase
        if (StringUtils.isNotBlank(newCountInHBase)) {
          totalNewCount = newCountInHBase.toLong + value.newCount
        }
        else {
          totalNewCount = value.newCount
        }

        if (StringUtils.isNotBlank(oldCountInHBase)) {
          totalOldCount = oldCountInHBase.toLong + value.oldCount
        }
        else {
          totalOldCount = value.oldCount
        }

        // Write the accumulated totals back to HBase
        HBaseUtil.putMapData(tableName, rowkey, cfName, Map(
          channelIdColName -> value.channelID,
          dateColName -> value.date,
          newCountColName -> totalNewCount.toString,
          oldCountColName -> totalOldCount.toString
        ))
      }
    }
  }
}

5 Real-Time Channel Region Analysis Development

5.1 Business Introduction

Region analysis shows region-specific PV/UV and user freshness.

Metrics to compute

  • PV

  • UV

  • New users

  • Returning users

Dimensions to analyze

  • Region (country/province/city). To save time, only the city-level region dimension is analyzed here; the other levels are left as an exercise
  • Time dimension (hour, day, month)
The statistics look like this:

| Channel ID | Region (country/province/city) | Time       | PV  | UV  | New users | Returning users |
| ---------- | ------------------------------ | ---------- | --- | --- | --------- | --------------- |
| Channel 1  | 中国北京市朝阳区               | 2018091000 | 300 | 123 | 17        | 1               |
| Channel 1  | 中国北京市朝阳区               | 20180910   | 512 | 123 | 23        | 100             |
| Channel 1  | 中国北京市朝阳区               | 2018091010 | 100 | 41  | 11        | 30              |
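Each result row is addressed by a composite rowkey; a sketch of the layout for the region dimension (the carrier task later uses the same `channelID:date:network` style):

```scala
// Composite rowkey: channel, date, and area joined with ':' so the parts stay
// separable and a channel's rows sort together in HBase.
def areaRowkey(channelID: String, date: String, area: String): String =
  s"$channelID:$date:$area"
```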

5.2 Business Development


Steps

  1. Create a channel region analysis case class (channel, region (country/province/city), time, PV, UV, new users, returning users)
  2. Convert the preprocessed data into the case class using flatMap
  3. Group (partition) by channel, time, and region
  4. Divide into time windows (3 seconds per window)
  5. Merge and count within each window
  6. Print for testing
  7. Sink the computed data to HBase

Implementation

  1. Create a ChannelAreaTask singleton object

  2. Add a ChannelArea case class encapsulating the business fields to be counted: channel ID (channelID), region (area), date (date), pv, uv, new users (newCount), returning users (oldCount)

  3. Write a process method in ChannelAreaTask that receives the preprocessed DataStream

  4. Use the flatMap operator to convert each ClickLogWide into three ChannelArea records, one per time dimension

  5. Partition by channel ID, time, and region

  6. Divide into time windows (3 seconds per window)

  7. Run a reduce to merge and aggregate

  8. Print for testing

  9. Sink the merged data to HBase

    • Prepare the HBase table name, column family name, rowkey, and column names

    • Check whether a result record already exists in HBase

    • If it exists, fetch it and accumulate

    • If not, write directly

Core code:
ChannelAreaTask.scala (a sketch following the same BaseTask pattern as the carrier task below; the region value is assumed to come from an `address` field that the preprocessing stage adds to ClickLogWide):

package com.itheima.realprocess.task

import com.itheima.realprocess.bean.ClickLogWide
import com.itheima.realprocess.util.HBaseUtil
import org.apache.commons.lang.StringUtils
import org.apache.flink.streaming.api.scala.{DataStream, KeyedStream, WindowedStream}
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.functions.sink.SinkFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow

// 2. ChannelArea case class encapsulating the business fields: channel ID (channelID), region (area), date (date), pv, uv, new users (newCount), returning users (oldCount)
case class ChannelArea(var channelID: String,
                       var area: String,
                       var date: String,
                       var pv: Long,
                       var uv: Long,
                       var newCount: Long,
                       var oldCount: Long)

object ChannelAreaTask extends BaseTask[ChannelArea] {

  override def map(source: DataStream[ClickLogWide]): DataStream[ChannelArea] = {
    source.flatMap {
      clicklog =>
        // Returning user: isNew == 0 but first visit in this period
        val isOld = (isNew: Int, isDateNew: Int) => if (isNew == 0 && isDateNew == 1) 1 else 0

        List(
          ChannelArea(clicklog.channelID, clicklog.address, clicklog.yearMonthDayHour, // address: assumed region field on ClickLogWide
            clicklog.count, clicklog.isHourNew, clicklog.isNew, isOld(clicklog.isNew, clicklog.isHourNew)), // hour dimension
          ChannelArea(clicklog.channelID, clicklog.address, clicklog.yearMonthDay,
            clicklog.count, clicklog.isDayNew, clicklog.isNew, isOld(clicklog.isNew, clicklog.isDayNew)), // day dimension
          ChannelArea(clicklog.channelID, clicklog.address, clicklog.yearMonth,
            clicklog.count, clicklog.isMonthNew, clicklog.isNew, isOld(clicklog.isNew, clicklog.isMonthNew)) // month dimension
        )
    }
  }

  override def keyBy(mapDataStream: DataStream[ChannelArea]): KeyedStream[ChannelArea, String] = {
    mapDataStream.keyBy {
      area =>
        area.channelID + area.date + area.area
    }
  }

  override def timeWindow(keyedStream: KeyedStream[ChannelArea, String]): WindowedStream[ChannelArea, String, TimeWindow] = {
    keyedStream.timeWindow(Time.seconds(3))
  }

  override def reduce(windowedStream: WindowedStream[ChannelArea, String, TimeWindow]): DataStream[ChannelArea] = {
    windowedStream.reduce {
      (area1, area2) =>
        ChannelArea(area2.channelID, area2.area, area2.date,
          area1.pv + area2.pv,
          area1.uv + area2.uv,
          area1.newCount + area2.newCount,
          area1.oldCount + area2.oldCount)
    }
  }

  override def sink2HBase(reduceDataStream: DataStream[ChannelArea]): Unit = {
    reduceDataStream.addSink(new SinkFunction[ChannelArea] {
      override def invoke(value: ChannelArea): Unit = {
        // - prepare the HBase table name, column family, rowkey, and column names
        val tableName = "channel_area"
        val cfName = "info"
        val rowkey = s"${value.channelID}:${value.date}:${value.area}"
        val channelIdColName = "channelID"
        val areaColName = "area"
        val dateColName = "date"
        val pvColName = "pv"
        val uvColName = "uv"
        val newCountColName = "newCount"
        val oldCountColName = "oldCount"

        // - Check whether a result record already exists in HBase
        val resultMap: Map[String, String] = HBaseUtil.getMapData(tableName, rowkey, cfName, List(
          pvColName, uvColName, newCountColName, oldCountColName
        ))

        // - If it exists, accumulate on top of it; if not, write directly
        def total(colName: String, current: Long): Long =
          if (resultMap != null && StringUtils.isNotBlank(resultMap.getOrElse(colName, "")))
            resultMap(colName).toLong + current
          else
            current

        HBaseUtil.putMapData(tableName, rowkey, cfName, Map(
          channelIdColName -> value.channelID,
          areaColName -> value.area,
          dateColName -> value.date,
          pvColName -> total(pvColName, value.pv).toString,
          uvColName -> total(uvColName, value.uv).toString,
          newCountColName -> total(newCountColName, value.newCount).toString,
          oldCountColName -> total(oldCountColName, value.oldCount).toString
        ))
      }
    })
  }
}

6 Real-Time Carrier Analysis Development

6.1 Business Introduction


Compute the metrics per carrier (China Mobile, China Unicom, China Telecom) to see which carrier the traffic mainly comes from, enabling more accurately targeted network promotion.

Metrics to compute

  • PV

  • UV

  • New users

  • Returning users

Dimensions to analyze

  • Carrier
  • Time dimension (hour, day, month)

The statistics look like this:

| Channel ID | Carrier  | Time       | PV  | UV | New users | Returning users |
| ---------- | -------- | ---------- | --- | -- | --------- | --------------- |
| Channel 1  | 中国移动 | 2018091000 | 300 | 0  | 30        | 0               |
| Channel 1  | 中国联通 | 2018091012 | 3   | 1  | 0         | 1               |
| Channel 1  | 中国电信 | 2018091010 | 55  | 2  | 2         | 0               |

6.2 Business Development

Steps

  1. Convert the preprocessed data into the target case class (channel, carrier, time, PV, UV, new users, returning users)
  2. Group (partition) by channel, time, and carrier
  3. Divide into time windows (3 seconds per window)
  4. Merge and count within each window
  5. Print for testing
  6. Sink the computed data to HBase

Implementation

  1. Create a ChannelNetworkTask singleton object

  2. Add a ChannelNetwork case class encapsulating the business fields to be counted: channel ID (channelID), carrier (network), date (date), pv, uv, new users (newCount), returning users (oldCount)

  3. Write a process method in ChannelNetworkTask that receives the preprocessed DataStream

  4. Use the flatMap operator to convert each ClickLogWide into three ChannelNetwork records, one per time dimension

  5. Partition by channel ID, time, and carrier

  6. Divide into time windows (3 seconds per window)

  7. Run a reduce to merge and aggregate

  8. Print for testing

  9. Sink the merged data to HBase

    • Prepare the HBase table name, column family name, rowkey, and column names

    • Check whether a result record already exists in HBase

    • If it exists, fetch it and accumulate

    • If not, write directly
Core code:

package com.itheima.realprocess.task

import com.itheima.realprocess.bean.ClickLogWide
import com.itheima.realprocess.util.HBaseUtil
import org.apache.commons.lang.StringUtils
import org.apache.flink.streaming.api.scala.{DataStream, KeyedStream, WindowedStream}
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.functions.sink.SinkFunction
import org.apache.flink.streaming.api.windowing.time.Time

// 2. ChannelNetwork case class encapsulating the business fields: channel ID (channelID), carrier (network), date (date), pv, uv, new users (newCount), returning users (oldCount)
case class ChannelNetwork(var channelID: String,
                          var network: String,
                          var date: String,
                          var pv: Long,
                          var uv: Long,
                          var newCount: Long,
                          var oldCount: Long)

object ChannelNetworkTask extends BaseTask[ChannelNetwork] {

  override def map(source: DataStream[ClickLogWide]): DataStream[ChannelNetwork] = {

    source.flatMap {
      clicklog =>
        val isOld = (isNew: Int, isDateNew: Int) => if (isNew == 0 && isDateNew == 1) 1 else 0

        List(
          ChannelNetwork(clicklog.channelID,
            clicklog.network,
            clicklog.yearMonthDayHour,
            clicklog.count,
            clicklog.isHourNew,
            clicklog.isNew,
            isOld(clicklog.isNew, clicklog.isHourNew)), // hour dimension
          ChannelNetwork(clicklog.channelID,
            clicklog.network,
            clicklog.yearMonthDay,
            clicklog.count,
            clicklog.isDayNew,
            clicklog.isNew,
            isOld(clicklog.isNew, clicklog.isDayNew)), // day dimension
          ChannelNetwork(clicklog.channelID,
            clicklog.network,
            clicklog.yearMonth,
            clicklog.count,
            clicklog.isMonthNew,
            clicklog.isNew,
            isOld(clicklog.isNew, clicklog.isMonthNew)) // month dimension
        )
    }
  }

  override def keyBy(mapDataStream: DataStream[ChannelNetwork]): KeyedStream[ChannelNetwork, String] = {
    mapDataStream.keyBy {
      network =>
        network.channelID + network.date + network.network
    }
  }

  override def timeWindow(keyedStream: KeyedStream[ChannelNetwork, String]): WindowedStream[ChannelNetwork, String, TimeWindow] = {
    keyedStream.timeWindow(Time.seconds(3))
  }

  override def reduce(windowedStream: WindowedStream[ChannelNetwork, String, TimeWindow]): DataStream[ChannelNetwork] = {
    windowedStream.reduce {
      (network1, network2) =>
        ChannelNetwork(network2.channelID,
          network2.network,
          network2.date,
          network1.pv + network2.pv,
          network1.uv + network2.uv,
          network1.newCount + network2.newCount,
          network1.oldCount + network2.oldCount)
    }
  }

  override def sink2HBase(reduceDataStream: DataStream[ChannelNetwork]): Unit = {
    reduceDataStream.addSink(new SinkFunction[ChannelNetwork] {
      override def invoke(value: ChannelNetwork): Unit = {
        // - prepare the HBase table name, column family, rowkey, and column names
        val tableName = "channel_network"
        val cfName = "info"
        // channel ID (channelID), carrier (network), date (date), pv, uv, new users (newCount), returning users (oldCount)
        val rowkey = s"${value.channelID}:${value.date}:${value.network}"
        val channelIdColName = "channelID"
        val networkColName = "network"
        val dateColName = "date"
        val pvColName = "pv"
        val uvColName = "uv"
        val newCountColName = "newCount"
        val oldCountColName = "oldCount"

        // - Check whether a result record already exists in HBase
        val resultMap: Map[String, String] = HBaseUtil.getMapData(tableName, rowkey, cfName, List(
          pvColName,
          uvColName,
          newCountColName,
          oldCountColName
        ))

        var totalPv = 0L
        var totalUv = 0L
        var totalNewCount = 0L
        var totalOldCount = 0L

        if(resultMap != null && StringUtils.isNotBlank(resultMap.getOrElse(pvColName, ""))) {
          totalPv = resultMap(pvColName).toLong + value.pv
        }
        else {
          totalPv = value.pv
        }

        if(resultMap != null && StringUtils.isNotBlank(resultMap.getOrElse(uvColName, ""))) {
          totalUv = resultMap(uvColName).toLong + value.uv
        }
        else {
          totalUv = value.uv
        }

        if(resultMap != null && StringUtils.isNotBlank(resultMap.getOrElse(newCountColName, ""))) {
          totalNewCount = resultMap(newCountColName).toLong + value.newCount
        }
        else {
          totalNewCount = value.newCount
        }

        if(resultMap != null && StringUtils.isNotBlank(resultMap.getOrElse(oldCountColName, ""))) {
          totalOldCount = resultMap(oldCountColName).toLong + value.oldCount
        }
        else {
          totalOldCount = value.oldCount
        }

        // channel ID (channelID), carrier (network), date (date), pv, uv, new users (newCount), returning users (oldCount)
        HBaseUtil.putMapData(tableName, rowkey, cfName, Map(
          channelIdColName -> value.channelID,
          networkColName -> value.network,
          dateColName -> value.date,
          pvColName -> totalPv.toString,
          uvColName -> totalUv.toString,
          newCountColName -> totalNewCount.toString,
          oldCountColName -> totalOldCount.toString
        ))
      }
    })
  }
}

7 Real-Time Channel Browser Analysis Development

7.1 Business Introduction

Compute each browser's (or client's) share of traffic.

Metrics to compute

  • PV

  • UV

  • New users

  • Returning users

Dimensions to analyze

  • Browser
  • Time dimension (hour, day, month)

The statistics look like this:

| Channel ID | Browser   | Time       | PV  | UV | New users | Returning users |
| ---------- | --------- | ---------- | --- | -- | --------- | --------------- |
| Channel 1  | 360浏览器 | 2018091000 | 300 | 0  | 30        | 0               |
| Channel 1  | IE        | 2018091012 | 3   | 1  | 0         | 1               |
| Channel 1  | Chrome    | 2018091010 | 55  | 2  | 2         | 0               |
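The task below only stores per-browser counts; the share itself can be derived at query time, for example:

```scala
// Derive each browser's PV share from the accumulated per-browser PV totals.
// (Illustrative query-side helper, not part of the Flink job.)
def browserShare(pvByBrowser: Map[String, Long]): Map[String, Double] = {
  val total = pvByBrowser.values.sum.toDouble
  if (total == 0) pvByBrowser.map { case (b, _) => b -> 0.0 }
  else pvByBrowser.map { case (b, pv) => b -> pv / total }
}
```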

7.2 Business Development


Steps

  1. Create a channel browser analysis case class (channel, browser, time, PV, UV, new users, returning users)
  2. Convert the preprocessed data into the target case class using flatMap
  3. Group (partition) by channel, time, and browser
  4. Divide into time windows (3 seconds per window)
  5. Merge and count within each window
  6. Print for testing
  7. Sink the computed data to HBase

Implementation

  1. Create a ChannelBrowserTask singleton object

  2. Add a ChannelBrowser case class encapsulating the business fields to be counted: channel ID (channelID), browser (browser), date (date), pv, uv, new users (newCount), returning users (oldCount)

  3. Write a process method in ChannelBrowserTask that receives the preprocessed DataStream

  4. Use the flatMap operator to convert each ClickLogWide into three ChannelBrowser records, one per time dimension

  5. Partition by channel ID, time, and browser

  6. Divide into time windows (3 seconds per window)

  7. Run a reduce to merge and aggregate

  8. Print for testing

  9. Sink the merged data to HBase

    • Prepare the HBase table name, column family name, rowkey, and column names

    • Check whether a result record already exists in HBase

    • If it exists, fetch it and accumulate

    • If not, write directly

Core code:

package com.itheima.realprocess.task

import com.itheima.realprocess.bean.ClickLogWide
import com.itheima.realprocess.util.HBaseUtil
import org.apache.commons.lang.StringUtils
import org.apache.flink.streaming.api.scala.{DataStream, KeyedStream, WindowedStream}
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.functions.sink.SinkFunction
import org.apache.flink.streaming.api.windowing.time.Time

// 2. ChannelBrowser case class encapsulating the business fields: channel ID (channelID), browser (browser), date (date), pv, uv, new users (newCount), returning users (oldCount)
case class ChannelBrowser(var channelID: String,
                          var browser: String,
                          var date: String,
                          var pv: Long,
                          var uv: Long,
                          var newCount: Long,
                          var oldCount: Long)


object ChannelBrowserTask extends BaseTask[ChannelBrowser] {

  override def map(source: DataStream[ClickLogWide]): DataStream[ChannelBrowser] = {

    source.flatMap {
      clicklog =>
        // Returning user: isNew == 0 but first visit in this period
        val isOld = (isNew: Int, isDateNew: Int) => if (isNew == 0 && isDateNew == 1) 1 else 0

        List(
          ChannelBrowser(clicklog.channelID,
            clicklog.browserType,
            clicklog.yearMonthDayHour,
            clicklog.count,
            clicklog.isHourNew,
            clicklog.isNew,
            isOld(clicklog.isNew, clicklog.isHourNew)), // hour dimension
          ChannelBrowser(clicklog.channelID,
            clicklog.browserType,
            clicklog.yearMonthDay,
            clicklog.count,
            clicklog.isDayNew,
            clicklog.isNew,
            isOld(clicklog.isNew, clicklog.isDayNew)), // day dimension
          ChannelBrowser(clicklog.channelID,
            clicklog.browserType,
            clicklog.yearMonth,
            clicklog.count,
            clicklog.isMonthNew,
            clicklog.isNew,
            isOld(clicklog.isNew, clicklog.isMonthNew)) // month dimension
        )
    }
  }

  override def keyBy(mapDataStream: DataStream[ChannelBrowser]): KeyedStream[ChannelBrowser, String] = {
    mapDataStream.keyBy {
      browser =>
        browser.channelID + browser.date + browser.browser
    }
  }

  override def timeWindow(keyedStream: KeyedStream[ChannelBrowser, String]): WindowedStream[ChannelBrowser, String, TimeWindow] = {
    keyedStream.timeWindow(Time.seconds(3))
  }

  override def reduce(windowedStream: WindowedStream[ChannelBrowser, String, TimeWindow]): DataStream[ChannelBrowser] = {
    windowedStream.reduce {
      (browser1, browser2) =>
        ChannelBrowser(browser2.channelID,
          browser2.browser,
          browser2.date,
          browser1.pv + browser2.pv,
          browser1.uv + browser2.uv,
          browser1.newCount + browser2.newCount,
          browser1.oldCount + browser2.oldCount)
    }
  }

  override def sink2HBase(reduceDataStream: DataStream[ChannelBrowser]): Unit = {

    reduceDataStream.addSink(new SinkFunction[ChannelBrowser] {
      override def invoke(value: ChannelBrowser): Unit = {
        // - prepare the HBase table name, column family, rowkey, and column names
        val tableName = "channel_browser"
        val cfName = "info"
        // channel ID (channelID), browser (browser), date (date), pv, uv, new users (newCount), returning users (oldCount)
        val rowkey = s"${value.channelID}:${value.date}:${value.browser}"
        val channelIDColName = "channelID"
        val browserColName = "browser"
        val dateColName = "date"
        val pvColName = "pv"
        val uvColName = "uv"
        val newCountColName = "newCount"
        val oldCountColName = "oldCount"

        var totalPv = 0L
        var totalUv = 0L
        var totalNewCount = 0L
        var totalOldCount = 0L

        val resultMap: Map[String, String] = HBaseUtil.getMapData(tableName, rowkey, cfName, List(
          pvColName,
          uvColName,
          newCountColName,
          oldCountColName
        ))

        // Compute PV: if HBase already holds a pv value, accumulate on top of it

        if (resultMap != null && StringUtils.isNotBlank(resultMap.getOrElse(pvColName, ""))) {
          totalPv = resultMap(pvColName).toLong + value.pv
        }
        else {
          totalPv = value.pv
        }

        if (resultMap != null && StringUtils.isNotBlank(resultMap.getOrElse(uvColName, ""))) {
          totalUv = resultMap(uvColName).toLong + value.uv
        }
        else {
          totalUv = value.uv
        }


        // - Check whether a result record already exists in HBase
        // - If it exists, fetch it and accumulate
        // - If not, write directly
        if (resultMap != null && StringUtils.isNotBlank(resultMap.getOrElse(newCountColName, ""))) {
          totalNewCount = resultMap(newCountColName).toLong + value.newCount
        }
        else {
          totalNewCount = value.newCount
        }

        if (resultMap != null && StringUtils.isNotBlank(resultMap.getOrElse(oldCountColName, ""))) {
          totalOldCount = resultMap(oldCountColName).toLong + value.oldCount
        }
        else {
          totalOldCount = value.oldCount
        }

        // channel ID (channelID), browser (browser), date (date), pv, uv, new users (newCount), returning users (oldCount)
        HBaseUtil.putMapData(tableName, rowkey, cfName, Map(
          channelIDColName -> value.channelID,
          browserColName -> value.browser,
          dateColName -> value.date,
          pvColName -> totalPv.toString,
          uvColName -> totalUv.toString,
          newCountColName -> totalNewCount.toString,
          oldCountColName -> totalOldCount.toString
        ))
      }
    })
  }

}