Hadoop Learning: Analyzing Massive Web Logs (Extracting KPI Metrics)

This article shows how to analyze Web logs with Hadoop, extracting KPI metrics such as page views (PV), unique IPs, and access times. MapReduce programs compute page view counts, unique IP counts, hourly visit counts, and browser statistics. The walkthrough covers log parsing and the design and implementation of the MapReduce jobs, and ends with how to launch the program from Eclipse and a link to the source code.


1. Web Log Analysis

From Web logs we can derive the PV value (PageView, page view count) of each page on a site and the visiting IPs, find the pages users stay on longest, and, going further, analyze user behavior patterns.

Each line in a Web log records one user visit. Take the following line as an example:

60.208.6.156 - - [18/Sep/2013:06:49:48 +0000] "GET /wp-content/uploads/2013/07/rcassandra.png HTTP/1.0" 200 185524 "http://cos.name/category/software/packages/" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"
It can be split into eight fields:

remote_addr: 60.208.6.156 // client IP address

remote_user: - // client user name ("-" when not available)

time_local: [18/Sep/2013:06:49:48 +0000] // time of the request

request: "GET /wp-content/uploads/2013/07/rcassandra.png HTTP/1.0" // requested URL and HTTP protocol version

status: 200 // response status code; 200 means success

body_bytes_sent: 185524 // size of the response body sent to the client

http_referer: "http://cos.name/category/software/packages/" // page the visit came from

http_user_agent: "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36" // client browser information

2. KPI Metric Design

Typical KPI metrics are:

PV: page view count per page

IP: unique visiting IPs per page

Time: number of visits per hour

Source: visit counts by referrer domain

Browser: visit counts by client browser

3. Hadoop Algorithm Design

PV: page view count per page

Map: key = request, value = 1

Reduce: key = request, value = sum of the counts


IP: unique visiting IPs per page

Map: key = request, value = remote_addr

Reduce: key = request, value = number of distinct remote_addr values (deduplicate, then count); see the reducer sketch below
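
A minimal sketch of that dedupe reducer, assuming the mapper emits (request, remote_addr) pairs via the KPI parser from section 4 and the same imports as KPIPV below; the class name is illustrative:

    public static class UniqueIPReduce extends Reducer<Text, Text, Text, IntWritable> {
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Deduplicate the IPs seen for this page, then emit the distinct count.
            java.util.Set<String> ips = new java.util.HashSet<String>();
            for (Text ip : values) {
                ips.add(ip.toString());
            }
            context.write(key, new IntWritable(ips.size()));
        }
    }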


Time: number of visits per hour

Map: key = the hour extracted from time_local, value = 1

Reduce: key = hour, value = sum of the counts; see the mapper sketch below
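
A sketch of the hourly mapper, keyed on KPI.getTime_local_Date_hour() from section 4; the reducer is the same summing reducer as in KPIPV, and the class name is illustrative:

    public static class HourMapClass extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text hour = new Text();
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            KPI kpi = new KPI();
            kpi.parser(value.toString());
            if (kpi.isValid()) {
                try {
                    hour.set(kpi.getTime_local_Date_hour()); // e.g. "2013091806"
                    context.write(hour, one);
                } catch (java.text.ParseException e) {
                    // Skip records with unparseable timestamps.
                }
            }
        }
    }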


Source: visit counts by referrer domain

Map: key = the domain extracted from http_referer, value = 1

Reduce: key = domain, value = sum of the counts; see the map fragment below
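
To aggregate by domain rather than by full referrer URL, the map step can reduce http_referer to its host before emitting. A sketch of that fragment, placed inside the map method after kpi.parser(...) and the isValid() check (note the parser in section 4 keeps the surrounding quotes):

            // Turn "http://cos.name/category/..." into cos.name.
            String referer = kpi.getHttp_referer().replace("\"", "");
            try {
                java.net.URL url = new java.net.URL(referer);
                context.write(new Text(url.getHost()), one); // e.g. cos.name
            } catch (java.net.MalformedURLException e) {
                // Direct visits ("-") and malformed referrers carry no domain; skip them.
            }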


Browser: visit counts by client browser

Map: key = http_user_agent, value = 1

Reduce: key = http_user_agent, value = sum of the counts
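
Because the simple space-split parser keeps only the first token of the User-Agent, a usable browser breakdown needs either a quote-aware parse or a coarse classification of the raw line. A hypothetical sketch of the latter, inside a map method where value is the raw log line (category names and matching order are illustrative):

            String ua = value.toString();
            String browser;
            if (ua.contains("Chrome")) {        // check Chrome first: its UA also contains "Safari"
                browser = "Chrome";
            } else if (ua.contains("Firefox")) {
                browser = "Firefox";
            } else if (ua.contains("Safari")) {
                browser = "Safari";
            } else {
                browser = "Other";
            }
            context.write(new Text(browser), one);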


Below we take PV (page view count) as the example and walk through the MapReduce program.

4. MapReduce Program Implementation

1). Parse each log line

2). The Map phase

3). The Reduce phase

KPI.java

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

public class KPI {
    private String remote_addr;      // client IP address
    private String remote_user;      // client user name
    private String time_local;       // time of the request
    private String request;          // requested URL
    private String status;           // HTTP status code
    private String body_bytes_sent;  // bytes sent to the client
    private String http_referer;     // referring page
    private String http_user_agent;  // client browser information
    private boolean valid = true;    // whether the record parsed cleanly

    public String getRemote_addr() {
        return remote_addr;
    }

    public void setRemote_addr(String remote_addr) {
        this.remote_addr = remote_addr;
    }

    public String getRemote_user() {
        return remote_user;
    }

    public void setRemote_user(String remote_user) {
        this.remote_user = remote_user;
    }

    public String getTime_local() {
        return time_local;
    }

    public Date getTime_local_Date() throws ParseException {
        SimpleDateFormat df = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss", Locale.US);
        return df.parse(this.time_local);
    }
    
    public String getTime_local_Date_hour() throws ParseException{
        SimpleDateFormat df = new SimpleDateFormat("yyyyMMddHH");
        return df.format(this.getTime_local_Date());
    }

    public void setTime_local(String time_local) {
        this.time_local = time_local;
    }

    public String getRequest() {
        return request;
    }

    public void setRequest(String request) {
        this.request = request;
    }

    public String getStatus() {
        return status;
    }

    public void setStatus(String status) {
        this.status = status;
    }

    public String getBody_bytes_sent() {
        return body_bytes_sent;
    }

    public void setBody_bytes_sent(String body_bytes_sent) {
        this.body_bytes_sent = body_bytes_sent;
    }

    public String getHttp_referer() {
        return http_referer;
    }
    
    public void setHttp_referer(String http_referer) {
        this.http_referer = http_referer;
    }

    public String getHttp_user_agent() {
        return http_user_agent;
    }

    public void setHttp_user_agent(String http_user_agent) {
        this.http_user_agent = http_user_agent;
    }

    public boolean isValid() {
        return valid;
    }

    public void setValid(boolean valid) {
        this.valid = valid;
    }
    public void parser(String line) {
        String[] arr = line.split(" ");

        if (arr.length > 11) {
            this.setRemote_addr(arr[0]);
            this.setRemote_user(arr[1]);
            // arr[3] begins with '[', which SimpleDateFormat cannot parse; strip it.
            this.setTime_local(arr[3].substring(1));
            this.setRequest(arr[6]);
            this.setStatus(arr[8]);
            this.setBody_bytes_sent(arr[9]);
            this.setHttp_referer(arr[10]);
            // The User-Agent contains spaces, so arr[11] holds only its first token.
            this.setHttp_user_agent(arr[11]);
            this.setValid(true);

            // Treat 4xx/5xx responses as invalid records.
            if (Integer.parseInt(this.getStatus()) >= 400) {
                this.setValid(false);
            }
        } else {
            this.setValid(false);
        }
    }
}
The KPI class parses one log record and stores each field for later use.
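
For a quick sanity check, a main method like the following can be dropped into KPI and run against the sample line from section 1; the expected outputs in the comments assume the parser above:

    public static void main(String[] args) throws ParseException {
        String line = "60.208.6.156 - - [18/Sep/2013:06:49:48 +0000] "
                + "\"GET /wp-content/uploads/2013/07/rcassandra.png HTTP/1.0\" "
                + "200 185524 \"http://cos.name/category/software/packages/\" "
                + "\"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) "
                + "Chrome/29.0.1547.66 Safari/537.36\"";
        KPI kpi = new KPI();
        kpi.parser(line);
        System.out.println(kpi.isValid());                 // true
        System.out.println(kpi.getRemote_addr());          // 60.208.6.156
        System.out.println(kpi.getRequest());              // /wp-content/uploads/2013/07/rcassandra.png
        System.out.println(kpi.getTime_local_Date_hour()); // 2013091806
    }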


KPIPV.java

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.io.IntWritable;


public class KPIPV {
    
    public static class MapClass 
    	extends Mapper<Object, Text, Text, IntWritable> {
        
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text(); 
        public void map(Object key, Text value,
                        Context context ) throws IOException,
                        InterruptedException {
            KPI kpi = new KPI();
            kpi.parser(value.toString());
            if (kpi.isValid()) {
                word.set(kpi.getRequest());
                context.write(word, one);
            }
        }
    }
    
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        
        public void reduce(Text key, Iterable<IntWritable> values,
                           Context context) throws IOException,InterruptedException {
                           
            int count = 0;
            for (IntWritable val : values) {
                count += val.get();
            }
            context.write(key, new IntWritable(count));
        }
    }
    
    public static void main(String[] args) throws Exception { 
        Configuration conf = new Configuration();
        
        String[] otherArgs = new GenericOptionsParser(conf,args).getRemainingArgs();
        if(otherArgs.length != 2){
        	System.err.println("Usage: KPI <in> <out>");
        	System.exit(2);
        }
        
        Job job = new Job(conf, "KPIPV");
        job.setJarByClass(KPIPV.class);
        job.setMapperClass(MapClass.class);
        //job.setCombinerClass(Reduce.class);
        job.setReducerClass(Reduce.class);
        
        //job.setInputFormat(KeyValueTextInputFormat.class);
        //job.setOutputFormat(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.setInputPaths(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
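
One tuning note: because the summing Reduce is associative and commutative, the commented-out combiner line can safely be re-enabled to pre-aggregate counts on the map side and cut shuffle traffic, without changing the output:

        job.setCombinerClass(Reduce.class);
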
The data source in this example is the Web log of a personal site (provided by bsspirit); you can also obtain a site's logs through Baidu Tongji (Baidu Analytics).

5. Running the Program from Eclipse

Set the input and output paths as the program arguments:

hdfs://localhost:9000/user/root/access.log.10 hdfs://localhost:9000/user/root/output9
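
Outside Eclipse, the packaged job can be launched from the command line in the usual way (the jar name here is illustrative). Note that the output directory must not already exist, or the job will refuse to start:

hadoop jar KPIPV.jar KPIPV hdfs://localhost:9000/user/root/access.log.10 hdfs://localhost:9000/user/root/output9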

6. Output

/r-rserve-nodejs/?cf_action=sync_comments&post_id=1769	5
/r-rserve-nodejs/feed/	1
/r-rstudio-server/	2
/r-rstudio-server/?cf_action=sync_comments&post_id=1506	2
/rhadoop-demo-email/	3
/rhadoop-demo-email/?cf_action=sync_comments&post_id=308	1
/rhadoop-hadoop	2
/rhadoop-hadoop/	10
/rhadoop-hadoop/?cf_action=sync_comments&post_id=87	2
/rhadoop-hadoop/feed/	1
/rhadoop-hbase-rhase/	4
/rhadoop-hbase-rhase/?cf_action=sync_comments&post_id=97	2
/rhadoop-hbase-rhase/feed/	1
/rhadoop-java-basic/	3

Source code and data: https://github.com/y521263/Hadoop_in_Action

References:

http://blog.fens.me/hadoop-mapreduce-log-kpi/

