1. Web Log Analysis
From web logs we can obtain the PV (PageView, page view count) of each page, the visiting IPs, the pages on which users spend the most time, and so on; more advanced analysis can profile user behavior.
Each line in a web log represents one user visit. Take the following entry as an example:
60.208.6.156 - - [18/Sep/2013:06:49:48 +0000] "GET /wp-content/uploads/2013/07/rcassandra.png HTTP/1.0" 200 185524 "http://cos.name/category/software/packages/" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"
This line can be split into 8 fields:
remote_addr: 60.208.6.156 // client IP address
remote_user: - // user name ("-" means not recorded)
time_local: [18/Sep/2013:06:49:48 +0000] // access time
request: "GET /wp-content/uploads/2013/07/rcassandra.png HTTP/1.0" // requested URL and HTTP protocol
status: 200 // response status code; 200 means success
body_bytes_sent: 185524 // size of the body sent to the client, in bytes
http_referer: "http://cos.name/category/software/packages/" // page the request came from
http_user_agent: "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36" // client browser information
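The split above can be sketched in plain Java before involving Hadoop at all. `LogLineDemo` is a hypothetical helper; the field indices assume the whitespace-delimited layout of the sample line, where the bracketed timestamp and the quoted request also break apart on spaces, which is why the indices jump:

```java
// Hypothetical helper: split one combined-format log line into its fields.
// Indices assume the whitespace-delimited layout of the sample line above:
// the quoted request and the bracketed timestamp split across several tokens.
public class LogLineDemo {
    public static String[] fields(String line) {
        String[] arr = line.split(" ");
        return new String[] {
            arr[0],              // remote_addr
            arr[1],              // remote_user
            arr[3].substring(1), // time_local, leading '[' stripped
            arr[6],              // requested URL (middle token of "GET <url> HTTP/1.0")
            arr[8],              // status
            arr[9],              // body_bytes_sent
            arr[10]              // http_referer
        };
    }
}
```

Note that a plain split cannot recover the full user agent, which itself contains spaces; the tokens from index 11 onward have to be rejoined.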
2. KPI Design
Typical KPIs include:
PV: page view count
IP: unique visiting IPs per page
Time: visits per hour
Source: referrer domain statistics
Browser: client browser statistics
3. Hadoop Algorithm Design
PV: page view count
Map: key = request, value = 1
Reduce: key = request, value = sum of the 1s
IP: unique visiting IPs per page
Map: key = request, value = remote_addr
Reduce: key = request, value = count of distinct remote_addr values
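The deduplicate-then-count step of the IP reducer can be sketched independently of Hadoop; `DistinctIpDemo` is a hypothetical class showing only that logic:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the IP KPI's reduce step: for one request (the key),
// count distinct remote_addr values by collecting them into a set.
public class DistinctIpDemo {
    public static int distinctCount(Iterable<String> addrs) {
        Set<String> seen = new HashSet<>(); // deduplicates repeated IPs
        for (String a : addrs) {
            seen.add(a);
        }
        return seen.size();                 // number of unique visitors
    }
}
```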
Time: visits per hour
Map: key = time_local (truncated to the hour), value = 1
Reduce: key = time_local, value = sum of the 1s
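The hourly key is produced by truncating `time_local` to the hour. `HourKeyDemo.hourKey` is a hypothetical helper assuming the `dd/MMM/yyyy:HH:mm:ss` timestamp format from the log:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;

// Hypothetical helper: map a time_local value to an hour bucket (yyyyMMddHH),
// so every visit within the same hour shares one key.
public class HourKeyDemo {
    public static String hourKey(String timeLocal) {
        SimpleDateFormat in = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss", Locale.US);
        SimpleDateFormat out = new SimpleDateFormat("yyyyMMddHH");
        try {
            return out.format(in.parse(timeLocal));
        } catch (ParseException e) {
            throw new IllegalArgumentException("unparseable timestamp: " + timeLocal, e);
        }
    }
}
```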
Source: referrer domain statistics
Map: key = http_referer, value = 1
Reduce: key = http_referer, value = sum of the 1s
Browser: client browser statistics
Map: key = http_user_agent, value = 1
Reduce: key = http_user_agent, value = sum of the 1s
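Keying on the raw `http_user_agent` produces one bucket per exact UA string; in practice the map step would usually normalize the UA to a coarse browser family first. `BrowserDemo.browserFamily` is a hypothetical sketch of that normalization:

```java
// Hypothetical sketch: collapse a raw User-Agent string to a browser family,
// so the Browser KPI groups by product rather than by exact version string.
public class BrowserDemo {
    public static String browserFamily(String userAgent) {
        // order matters: Chrome user agents also contain "Safari"
        if (userAgent.contains("Chrome")) return "Chrome";
        if (userAgent.contains("Firefox")) return "Firefox";
        if (userAgent.contains("Safari")) return "Safari";
        if (userAgent.contains("MSIE") || userAgent.contains("Trident")) return "IE";
        return "Other";
    }
}
```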
Below, PV (page view count) is used as the example for designing the MapReduce program.
4. MapReduce Implementation
1) Parse each log line
2) Map phase
3) Reduce phase
KPI.java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
public class KPI {
    private String remote_addr;     // client IP address
    private String remote_user;     // user name
    private String time_local;      // access time
    private String request;         // requested URL
    private String status;          // response status code
    private String body_bytes_sent; // size of the response body
    private String http_referer;    // referring page
    private String http_user_agent; // client browser information
    private boolean valid = true;   // whether this record should be counted

    public String getRemote_addr() {
        return remote_addr;
    }

    public void setRemote_addr(String remote_addr) {
        this.remote_addr = remote_addr;
    }

    public String getRemote_user() {
        return remote_user;
    }

    public void setRemote_user(String remote_user) {
        this.remote_user = remote_user;
    }

    public String getTime_local() {
        return time_local;
    }

    public Date getTime_local_Date() throws ParseException {
        SimpleDateFormat df = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss", Locale.US);
        return df.parse(this.time_local);
    }

    public String getTime_local_Date_hour() throws ParseException {
        // truncate the timestamp to the hour, e.g. 2013091806
        SimpleDateFormat df = new SimpleDateFormat("yyyyMMddHH");
        return df.format(this.getTime_local_Date());
    }

    public void setTime_local(String time_local) {
        this.time_local = time_local;
    }

    public String getRequest() {
        return request;
    }

    public void setRequest(String request) {
        this.request = request;
    }

    public String getStatus() {
        return status;
    }

    public void setStatus(String status) {
        this.status = status;
    }

    public String getBody_bytes_sent() {
        return body_bytes_sent;
    }

    public void setBody_bytes_sent(String body_bytes_sent) {
        this.body_bytes_sent = body_bytes_sent;
    }

    public String getHttp_referer() {
        return http_referer;
    }

    public void setHttp_referer(String http_referer) {
        this.http_referer = http_referer;
    }

    public String getHttp_user_agent() {
        return http_user_agent;
    }

    public void setHttp_user_agent(String http_user_agent) {
        this.http_user_agent = http_user_agent;
    }

    public boolean isValid() {
        return valid;
    }

    public void setValid(boolean valid) {
        this.valid = valid;
    }

    public void parser(String line) {
        String[] arr = line.split(" ");
        if (arr.length > 11) {
            this.setRemote_addr(arr[0]);
            this.setRemote_user(arr[1]);
            // strip the leading '[' so getTime_local_Date() can parse the value
            this.setTime_local(arr[3].substring(1));
            this.setRequest(arr[6]);
            this.setStatus(arr[8]);
            this.setBody_bytes_sent(arr[9]);
            this.setHttp_referer(arr[10]);
            // the user agent contains spaces, so rejoin the remaining tokens
            StringBuilder ua = new StringBuilder(arr[11]);
            for (int i = 12; i < arr.length; i++) {
                ua.append(" ").append(arr[i]);
            }
            this.setHttp_user_agent(ua.toString());
            this.setValid(true);
            if (Integer.parseInt(this.getStatus()) >= 400) {
                // error responses are not counted
                this.setValid(false);
            }
        } else {
            this.setValid(false);
        }
    }
}
The KPI class parses a single log record and stores its fields.
KPIPV.java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.io.IntWritable;
public class KPIPV {

    public static class MapClass extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            KPI kpi = new KPI();
            kpi.parser(value.toString());
            if (kpi.isValid()) {
                word.set(kpi.getRequest()); // emit <url, 1> for each valid record
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int count = 0;
            for (IntWritable val : values) {
                count += val.get();         // sum the 1s for one URL
            }
            context.write(key, new IntWritable(count));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: KPI <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "KPIPV");
        job.setJarByClass(KPIPV.class);
        job.setMapperClass(MapClass.class);
        //job.setCombinerClass(Reduce.class); // the reducer could also serve as a combiner
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.setInputPaths(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
The data source in this example is the web log of a personal site (provided by bsspirit); you can also obtain a site's logs through Baidu Analytics (百度统计).
5. Running from Eclipse
Set the input and output directories:
hdfs://localhost:9000/user/root/access.log.10 hdfs://localhost:9000/user/root/output9
6. Output
/r-rserve-nodejs/?cf_action=sync_comments&post_id=1769 5
/r-rserve-nodejs/feed/ 1
/r-rstudio-server/ 2
/r-rstudio-server/?cf_action=sync_comments&post_id=1506 2
/rhadoop-demo-email/ 3
/rhadoop-demo-email/?cf_action=sync_comments&post_id=308 1
/rhadoop-hadoop 2
/rhadoop-hadoop/ 10
/rhadoop-hadoop/?cf_action=sync_comments&post_id=87 2
/rhadoop-hadoop/feed/ 1
/rhadoop-hbase-rhase/ 4
/rhadoop-hbase-rhase/?cf_action=sync_comments&post_id=97 2
/rhadoop-hbase-rhase/feed/ 1
/rhadoop-java-basic/ 3
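The reducer output is sorted by key (the URL), not by view count. A small post-processing sketch, separate from the job itself and using hypothetical names, can rank pages by PV:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical post-processing step: sort "url count" output lines by the
// count column, descending, to rank pages by PV.
public class SortByPv {
    public static List<String> topPages(List<String> lines) {
        List<String> sorted = new ArrayList<>(lines);
        sorted.sort((a, b) -> {
            int ca = Integer.parseInt(a.split("\\s+")[1]);
            int cb = Integer.parseInt(b.split("\\s+")[1]);
            return Integer.compare(cb, ca); // larger counts first
        });
        return sorted;
    }
}
```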
Source code and data: https://github.com/y521263/Hadoop_in_Action
References:
http://blog.fens.me/hadoop-mapreduce-log-kpi/