Hands-On Real-Time Analysis of Web Traffic Logs with nginx + Flume: Log Data Analysis

Latest recommended article published on 2024-08-11 23:27:23

2401_83946044


Category: Programmers. Article tags: nginx, flume, data analysis

Article link: https://blog.youkuaiyun.com/2401_83946044/article/details/138289357


Run the Flume agent to collect the logs

 /export/server/flume-1.8.0/bin/flume-ng agent -c conf -f /export/server/flume-1.8.0/conf/web_log.conf -n a1  -Dflume.root.logger=INFO,console

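Data preprocessing: cleaning

The log data collected by Flume cannot be analyzed directly; it has to be preprocessed first.
Preprocessing requirements:
1: Discard log records that have too few fields.
2: Drop fields that carry no value for the analysis.
3: Convert the date format.
4: Split each raw log record into fields, using '\001' as the field delimiter.
5: Flag each row as valid or invalid.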

Run the cleaning job:
hadoop jar /export/data/mapreduce/web_log.jar cn.itcast.bigdata.weblog.pre.WeblogPreProcess
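
The five preprocessing requirements above can be illustrated with a small Python sketch. This is not the code inside `WeblogPreProcess` (the article does not show it); it assumes the raw lines use nginx's default `combined` log format, and the function name `clean_line`, the "status below 400 means valid" rule, and the decision to drop the timezone offset are all hypothetical choices made for illustration. The output column order follows the `ods_weblog_origin` table defined later.

```python
import re
from datetime import datetime

# Assumption: input lines follow nginx's default "combined" log format.
LINE_RE = re.compile(
    r'(?P<remote_addr>\S+) - (?P<remote_user>\S+) \[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<body_bytes_sent>\S+) '
    r'"(?P<http_referer>[^"]*)" "(?P<http_user_agent>[^"]*)"'
)
FIELD_SEP = "\x01"  # the '\001' delimiter expected by the Hive tables
MONTHS = {name: i for i, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}


def convert_time(time_local):
    """Turn '18/Sep/2013:06:49:18 +0000' into '2013-09-18 06:49:18'."""
    stamp = time_local.split()[0]  # hypothetical choice: ignore the offset
    day, month, rest = stamp.split("/")
    year, hh, mm, ss = rest.split(":")
    parsed = datetime(int(year), MONTHS[month], int(day),
                      int(hh), int(mm), int(ss))
    return parsed.strftime("%Y-%m-%d %H:%M:%S")


def clean_line(line):
    """Return one '\\001'-delimited row, or None if the line is malformed."""
    match = LINE_RE.match(line)
    if match is None:
        return None  # requirement 1: not enough fields, discard
    f = match.groupdict()
    # requirement 2: keep only the URL part of 'GET /path HTTP/1.1'
    parts = f["request"].split()
    url = parts[1] if len(parts) >= 2 else f["request"]
    # requirement 5: hypothetical validity rule based on the status code
    valid = "true" if int(f["status"]) < 400 else "false"
    columns = [
        valid, f["remote_addr"], f["remote_user"],
        convert_time(f["time_local"]),  # requirement 3
        url, f["status"], f["body_bytes_sent"],
        f["http_referer"], f["http_user_agent"],
    ]
    return FIELD_SEP.join(columns)  # requirement 4
```

In a real job this function would be the body of a map task; lines for which it returns `None` are simply not emitted.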

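Web traffic log analysis: clickstream model data

The clickstream concept

A click stream is the trail a user leaves while continuously browsing a website; the emphasis is on the whole browsing flow. Each visit to the site consists of a series of clicks, and the records of these clicks make up click stream data, which represents the user's complete path through the site.

The clickstream model contains two kinds of model data: PageViews and Visits.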

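The pageviews clickstream model

The pageviews model focuses on identifying each user session, and on how many steps were taken within each session and how long the user stayed at each step.
In web analytics, two consecutive visit records that are no more than 30 minutes apart are usually counted as the same session. If the gap exceeds 30 minutes, the later record starts a new session.
The rough steps are:
Find all visit records of the user in the full access log.
Sort that user's visit records in ascending time order.
Check whether the time difference between each pair of consecutive records exceeds 30 minutes.
If it is less than 30 minutes, the record continues the same session.
If it is more than 30 minutes, the record starts the next session.
Use the time difference between consecutive records as the stay time of the earlier step.
For the last step of a session, and for sessions with only one step, the business rule assigns a default page stay time of 60 s.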

-- Build the pageviews model
hadoop jar /export/data/mapreduce/web_log.jar  cn.itcast.bigdata.weblog.clickstream.ClickStreamPageView
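
The session-splitting steps described for the pageviews model can be sketched in Python as follows. This is only an illustration under stated assumptions, not the logic of `ClickStreamPageView`: the function name `build_pageviews`, the output field names, and the use of a simple counter as the session id are hypothetical, and a gap of exactly 30 minutes is treated as the same session because the text leaves that boundary open.

```python
from datetime import datetime, timedelta

TIME_FMT = "%Y-%m-%d %H:%M:%S"
SESSION_GAP = timedelta(minutes=30)  # gap that starts a new session
DEFAULT_STAY_SECONDS = 60            # default for last or only step


def build_pageviews(records):
    """records: (time_string, url) pairs belonging to ONE visitor.

    Returns one dict per hit with a session number, the step number
    within the session, and the stay time in seconds.
    """
    # Sort the visitor's records in ascending time order.
    hits = sorted((datetime.strptime(t, TIME_FMT), url) for t, url in records)
    rows = []
    session_no, step = 0, 0
    for i, (ts, url) in enumerate(hits):
        # A gap above 30 minutes (or the very first hit) opens a session.
        if i == 0 or ts - hits[i - 1][0] > SESSION_GAP:
            session_no += 1
            step = 1
        else:
            step += 1
        # Stay time is the distance to the next hit of the same session;
        # the last step of a session falls back to the default.
        next_ts = hits[i + 1][0] if i + 1 < len(hits) else None
        if next_ts is not None and next_ts - ts <= SESSION_GAP:
            stay = int((next_ts - ts).total_seconds())
        else:
            stay = DEFAULT_STAY_SECONDS
        rows.append({
            "session": session_no,
            "step": step,
            "time": ts.strftime(TIME_FMT),
            "url": url,
            "stay_seconds": stay,
        })
    return rows
```

In a MapReduce setting the grouping by visitor would typically happen in the shuffle phase, with a function like this running once per visitor in the reducer.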

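The Visits clickstream model

The Visits model focuses on how each session starts and ends: for a given session, the entry page and entry time, the page from which the user left and the exit time, the total number of pages visited in the session, and so on.
The rough steps are:
Start from the pageviews model.
Within each session, sort all visit records in ascending time order.
The time and page of the first record are the entry time and entry page.
By business rule, the time and page of the last record are taken as the exit time and exit page.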

-- Build the visits model
hadoop jar /export/data/mapreduce/web_log.jar cn.itcast.bigdata.weblog.clickstream.ClickStreamVisit
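
Deriving visits from pageviews rows is a per-session aggregation, which the following Python sketch illustrates. It is not the code of `ClickStreamVisit`; the function name `build_visits` and the input and output field names are hypothetical, and each input row is assumed to carry a session id, a `YYYY-MM-DD HH:MM:SS` time string, and a URL.

```python
def build_visits(pageviews):
    """Collapse pageviews rows into one visit row per session.

    pageviews: dicts with the (hypothetical) keys 'session', 'time', 'url'.
    """
    # Group the rows by session id.
    by_session = {}
    for row in pageviews:
        by_session.setdefault(row["session"], []).append(row)

    visits = []
    for session, rows in by_session.items():
        # 'YYYY-MM-DD HH:MM:SS' strings sort correctly as plain text.
        rows.sort(key=lambda r: r["time"])
        first, last = rows[0], rows[-1]
        visits.append({
            "session": session,
            "in_time": first["time"],   # entry time
            "in_page": first["url"],    # entry page
            "out_time": last["time"],   # exit time (business rule)
            "out_page": last["url"],    # exit page (business rule)
            "page_count": len(rows),    # pages visited in the session
        })
    return visits
```

For a session with a single page view, the entry and exit page are the same record and `page_count` is 1.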

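Web log data analysis: data loading

For log data analysis, the Hive warehouse is likewise divided into three layers: the ODS layer, the DW layer, and the APP layer.

Create the databases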

create database if not exists  web_log_ods;
create database if not exists  web_log_dw;
create database if not exists  web_log_app;

Create the ODS-layer tables

Raw log data table

drop table if exists web_log_ods.ods_weblog_origin;
  create table web_log_ods.ods_weblog_origin(
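  valid string , -- validity flag
  remote_addr string, -- visitor IP
  remote_user string, -- visitor user information
  time_local string, -- request time
  request string,  -- requested URL
  status string, -- response status code
  body_bytes_sent string, -- response size in bytes
  http_referer string, -- referrer URL
  http_user_agent string -- visitor client (user agent) information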
  ) 
  partitioned by (dt string)
  row format delimited fields terminated by '\001';

Clickstream pageviews table

drop table if exists web_log_ods.ods_click_pageviews;
create table  web_log_ods.ods_click_pageviews(
session string, -- session id
remote_addr string, -- visitor IP
remote_user string, -- visitor user information
time_local str
