Regular Expressions in Hive

weixin_39953756

Tags: Hive

Article link: https://blog.youkuaiyun.com/weixin_39953756/article/details/80985370

One line from a company's server log looks like this; it has 11 fields in total (shown here one field per line):

"27.38.5.159"
"-"
"31/Aug/2015:00:04:37 +0800"
"GET /course/view.php?id=27 HTTP/1.1"
"303"
"440"
-
"http://www.ibeifeng.com/user.php?act=mycourse"
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36"
"-"
"learn.ibeifeng.com"

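Fields are separated by a space, but some fields also contain spaces internally. When this data is loaded into a Hive table, the table cannot simply be declared with row format delimited fields terminated by " ": the quoted fields would be split at their internal spaces and values would be lost.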
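To make the problem concrete, here is a minimal Java sketch that splits the sample line on single spaces, roughly the way a space-delimited table would. The class and method names (SplitDemo, tokens) are made up for illustration and are not part of Hive.

```java
import java.util.Arrays;

class SplitDemo {
    // The sample log line, with its 11 logical fields.
    static final String SAMPLE =
        "\"27.38.5.159\" \"-\" \"31/Aug/2015:00:04:37 +0800\" "
        + "\"GET /course/view.php?id=27 HTTP/1.1\" \"303\" \"440\" - "
        + "\"http://www.ibeifeng.com/user.php?act=mycourse\" "
        + "\"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) "
        + "Chrome/31.0.1650.63 Safari/537.36\" \"-\" \"learn.ibeifeng.com\"";

    // Split on every single space, ignoring the quotes entirely.
    static String[] tokens(String line) {
        return line.split(" ");
    }

    public static void main(String[] args) {
        String[] parts = tokens(SAMPLE);
        System.out.println("tokens: " + parts.length);
        // Only the first 11 tokens would fit into an 11-column table.
        System.out.println(Arrays.toString(Arrays.copyOf(parts, 11)));
    }
}
```

For this sample the split produces 24 tokens instead of 11, and the 11th token is already the referer, so everything after it has no column to go into.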

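-》Field analysis
-》ip: the address may be a public IP, i.e. a proxy IP (different users inside one company who visit the same site may leave the same IP in the log)
-》time_local:
  client time:
  server time: recommended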

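-》Workflow:
  -》Data collection (HDFS, Hive)
  -》Requirements analysis:
  -》Data cleaning:
   -》Custom UDF
   -》Custom Java class: a hand-written MR program that filters and validates records
   -》Regular-expression matching
  -》Data analysis (computation, processing)
  -》Result export (there can be many destinations: HDFS, MySQL, etc.)
  -》Front-end integration (plain JS, or ECharts)
  -》Data visualization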

-》Create the database
create database bf_test_source;
use bf_test_source;
-》Create the source table
create table IF NOT EXISTS bf_source (
remote_addr string,
remote_user string,
time_local string,
request string,
status string,
body_bytes_sent string,
request_body string,
http_referer string,
http_user_agent string,
http_x_forwarded_for string,
host string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
stored as textfile ;

load data local inpath '/opt/datas/moodle.ibeifeng.access.log' into table bf_source ;

Query result
"27.38.5.159"
"-"
"31/Aug/2015:00:04:37   +0800"
"GET    /course/view.php?id=27 HTTP/1.1"
"303"
"440"
-
"http://www.ibeifeng.com/user.php?act=mycourse"

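Fields have been lost! Splitting on every space breaks the quoted values apart, so the 11 columns are already used up by the time the referer is reached, and the user agent, x-forwarded-for and host values are dropped.

[What to do when a field contains the delimiter]
-》Fields are not loaded completely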

Regex testing site: http://wpjam.qiniudn.com/tool/regexpal/

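\ escape character    () group (one field)    [] character set
[^ ]* a run of non-space characters
| alternation (or); inside [] it is a literal character, so [-|^ ]* matches only runs of -, |, ^ and spaces, which is enough for the "-" placeholder
[^}]* a run of characters other than }
[0-9] a single digit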
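The building blocks above can be tried outside Hive first. The sketch below applies the pattern used for the bf_log table in this article to the sample line with java.util.regex; it only illustrates the matching and is not how RegexSerDe itself is implemented. LogRegexDemo and parse are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LogRegexDemo {
    // One capture group per field; fields are separated by single spaces.
    static final Pattern LOG_LINE = Pattern.compile(
        "(\"[^ ]*\") (\"[-|^ ]*\") (\"[^}]*\") (\"[^}]*\") (\"[0-9]*\") (\"[0-9]*\") "
        + "([-|^ ]*) (\"[^ ]*\") (\"[^}]*\") (\"[-|^ ]*\") (\"[^ ]*\")");

    static final String SAMPLE =
        "\"27.38.5.159\" \"-\" \"31/Aug/2015:00:04:37 +0800\" "
        + "\"GET /course/view.php?id=27 HTTP/1.1\" \"303\" \"440\" - "
        + "\"http://www.ibeifeng.com/user.php?act=mycourse\" "
        + "\"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) "
        + "Chrome/31.0.1650.63 Safari/537.36\" \"-\" \"learn.ibeifeng.com\"";

    // Returns the captured fields, or null if the whole line does not match.
    static String[] parse(String line) {
        Matcher m = LOG_LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] fields = new String[m.groupCount()];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = m.group(i + 1); // groups are numbered from 1
        }
        return fields;
    }

    public static void main(String[] args) {
        String[] fields = parse(SAMPLE);
        for (int i = 0; i < fields.length; i++) {
            System.out.println((i + 1) + ": " + fields[i]);
        }
    }
}
```

The captured values still carry their surrounding double quotes, because the quotes sit inside the parentheses of each group.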

[Recreate the table using a regular expression]

create table IF NOT EXISTS bf_log (
remote_addr string,
remote_user string,
time_local string,
request string,
status string,
body_bytes_sent string,
request_body string,
http_referer string,
http_user_agent string,
http_x_forwarded_for string,
host string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(\"[^ ]*\") (\"[-|^ ]*\") (\"[^}]*\") (\"[^}]*\") (\"[0-9]*\") (\"[0-9]*\") ([-|^ ]*) (\"[^ ]*\") (\"[^}]*\") (\"[-|^ ]*\") (\"[^ ]*\")"
)
STORED AS TEXTFILE;

load data local inpath '/opt/datas/moodle.ibeifeng.access.log' into table bf_log;

II. ETL: field filtering and formatting
"27.38.5.159"
"-"
"31/Aug/2015:00:04:37 +0800"
"GET /course/view.php?id=27 HTTP/1.1"
"303"
"440"
-
"http://www.ibeifeng.com/user.php?act=mycourse"
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36"
"-"
"learn.ibeifeng.com"

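-》Field filtering: decide which fields are needed, drop the ones that are not, and analyze the valuable ones
-》The double quotes contribute little to the analysis, so they can be removed (replace, replaceAll)
-》Convert the format of the time field, for example: "31/Aug/2015:00:04:37 +0800"
         -》Target format:
          -》2015-08-31 00:04:37
-》Strip the unnecessary parts of some fields, for example: "GET /course/view.php?id=27 HTTP/1.1"
      startsWith, endsWith, substring
        -》Page views can then be counted per page
        -》Basic traffic analysis for the site (user behavior data: clicks, searches)

-》Get the page visited before the current one, i.e. the referring URL
    Knowing the referring URL helps strengthen promotion of the product
    "http://www.ibeifeng.com/user.php?act=mycourse" (this data can be stored in MySQL)
-》Client information: browser version, user's operating system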
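Two of the cleaning steps listed above, removing the quotes and keeping only the URL part of the request, can be sketched in plain Java as follows. FieldCleanDemo, stripQuotes and requestUrl are hypothetical helper names, and the sketch assumes a request always has the form "METHOD URL PROTOCOL".

```java
class FieldCleanDemo {
    // Remove every double quote from a raw field value.
    static String stripQuotes(String raw) {
        return raw.replace("\"", "");
    }

    // Extract the URL from a request such as: "GET /course/view.php?id=27 HTTP/1.1"
    // Returns null when the request does not consist of exactly three parts.
    static String requestUrl(String rawRequest) {
        String[] parts = stripQuotes(rawRequest).trim().split(" ");
        return parts.length == 3 ? parts[1] : null;
    }

    public static void main(String[] args) {
        System.out.println(stripQuotes("\"303\""));
        System.out.println(requestUrl("\"GET /course/view.php?id=27 HTTP/1.1\""));
    }
}
```

The same logic could be wrapped in a UDF or expressed with Hive's built-in string functions; plain static methods are used here only to keep the example self-contained.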

-》Custom date-conversion UDF: convert the time field in the log records into the format we need.

Extend the UDF class and provide an evaluate method

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

import org.apache.commons.lang.StringUtils;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class DatePractice extends UDF {
    /*
     * Old date format:       "31/Aug/2015:00:04:37 +0800"
     * Converted date format: 2015-08-31 00:04:37
     * Uses Java's SimpleDateFormat. The trailing " +0800" is not part of the
     * input pattern; parse() stops once the pattern is satisfied, so the
     * offset is ignored.
     */
    private final SimpleDateFormat oldDateFormat = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss", Locale.ENGLISH);
    private final SimpleDateFormat newDateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

    public Text evaluate(Text time) {
        if (time == null || StringUtils.isBlank(time.toString())) {
            return null;
        }
        // Strip the surrounding double quotes before parsing.
        String cleaned = time.toString().replaceAll("\"", "");
        try {
            Date parsed = oldDateFormat.parse(cleaned);
            return new Text(newDateFormat.format(parsed));
        } catch (ParseException e) {
            // An unparseable timestamp becomes NULL instead of reusing a stale value.
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(new DatePractice().evaluate(new Text("\"31/DEC/2025:00:12:37 +0800\"")));
    }
}
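As an alternative to SimpleDateFormat, the same conversion can be sketched with java.time, whose DateTimeFormatter is immutable and thread-safe and can parse the "+0800" offset instead of ignoring it. This is only a variation on the code above, not the author's method; LogTimeDemo and convert are hypothetical names, and the Hive UDF wrapper is left out to keep the example self-contained.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Locale;

class LogTimeDemo {
    // Input pattern including the numeric zone offset, e.g. 31/Aug/2015:00:04:37 +0800
    private static final DateTimeFormatter IN =
        DateTimeFormatter.ofPattern("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH);
    // Output pattern, e.g. 2015-08-31 00:04:37
    private static final DateTimeFormatter OUT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Returns the reformatted local time, or null if the value cannot be parsed.
    static String convert(String raw) {
        if (raw == null) {
            return null;
        }
        String cleaned = raw.replace("\"", "").trim();
        try {
            return OffsetDateTime.parse(cleaned, IN).format(OUT);
        } catch (DateTimeParseException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(convert("\"31/Aug/2015:00:04:37 +0800\""));
    }
}
```

One difference to keep in mind: DateTimeFormatter matches month abbreviations case-sensitively by default, so an input such as "31/DEC/2025" that SimpleDateFormat accepts would be rejected here.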

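-》Package the code as a jar and register it with Hive
add jar /opt/datas/testDateUDF.jar;
-》Create a temporary function (the quoted name must be the fully qualified class name of the UDF inside the jar; for the class listed above that is its package name followed by DatePractice):
create temporary function pdate as 'cmz.hivetest.testdate';
-》Check that the UDF works
select pdate(time_local) newdate from bf_log limit 10;
Result: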
newdate
2015-08-31 00:04:37
2015-08-31 00:04:37
2015-08-31 00:04:53
2015-08-31 00:04:53