Module A

This post walks through importing data from MySQL into Hadoop with Sqoop, storing it as Parquet, and then analyzing it with Hive: ranking products by total revenue and analyzing web log data. It covers key stages of a big-data workflow, including data migration, data warehouse construction, and SQL query tuning.

Task 3: Sqoop

sqoop import-all-tables \
 -m 1 \
 --connect jdbc:mysql://localhost:3306/retail_db \
 --username=training \
 --password=training \
 --compression-codec=snappy \
 --as-parquetfile \
 --warehouse-dir=/user/hive/warehouse \
 --hive-import
$ hadoop fs -ls /user/hive/warehouse/
$ hadoop fs -ls /user/hive/warehouse/categories/
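As a further sanity check, the import can also be verified from inside Hive (a minimal sketch, assuming the Hive CLI is available and the --hive-import step above registered the tables in the metastore):

-- List the tables Sqoop registered in the Hive metastore
show tables;
-- Inspect one table; the storage section should report a Parquet input/output format
describe formatted categories;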

Task 4: Hive

(Find the most popular product categories, then the ten products with the highest total revenue)

select c.category_name, count(order_item_quantity) as count
from order_items oi
inner join products p on oi.order_item_product_id = p.product_id
inner join categories c on c.category_id = p.product_category_id
group by c.category_name
order by count desc
limit 10;
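Note that count(order_item_quantity) counts order line items, not units sold. If the goal were instead to rank categories by units sold, the quantity column would need to be summed; a sketch against the same schema:

select c.category_name, sum(oi.order_item_quantity) as units_sold
from order_items oi
inner join products p on oi.order_item_product_id = p.product_id
inner join categories c on c.category_id = p.product_category_id
group by c.category_name
order by units_sold desc
limit 10;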

-- top 10 revenue generating products
select p.product_id, p.product_name, r.revenue
from products p inner join
(select oi.order_item_product_id, sum(cast(oi.order_item_subtotal as float))
as revenue
from order_items oi inner join orders o
on oi.order_item_order_id = o.order_id
where o.order_status <> 'CANCELED'
and o.order_status <> 'SUSPECTED_FRAUD'
group by order_item_product_id) r
on p.product_id = r.order_item_product_id
order by r.revenue desc
limit 10;
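The cast to float guards against order_item_subtotal having been imported as a non-numeric type. As a rough sanity check (a sketch, not part of the original exercise), the overall revenue under the same order-status filter can be computed and compared against the top-10 figures:

-- Overall revenue under the same filter; each top-10 product's revenue
-- should be a small fraction of this total
select sum(cast(oi.order_item_subtotal as float)) as total_revenue
from order_items oi inner join orders o
on oi.order_item_order_id = o.order_id
where o.order_status <> 'CANCELED'
and o.order_status <> 'SUSPECTED_FRAUD';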

Task 5: Load the web log data

1. Load one month of data:
	$ sudo -u hdfs hadoop fs -mkdir /user/hive/warehouse/original_access_logs
	$ sudo -u hdfs hadoop fs -copyFromLocal /opt/examples/log_files/access.log.2 /user/hive/warehouse/original_access_logs
2. Verify the data:
	$ hadoop fs -ls /user/hive/warehouse/original_access_logs
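The same listing can be done without leaving Hive, since the Hive CLI passes dfs commands through to hadoop fs (a minimal sketch):

dfs -ls /user/hive/warehouse/original_access_logs;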

Task 6: Process the web log events

 (Find which page of the site visitors viewed most often)
 
CREATE EXTERNAL TABLE intermediate_access_logs (
 ip STRING,
 date STRING,
 method STRING,
 url STRING,
 http_version STRING,
 code1 STRING,
 code2 STRING,
 dash STRING,
 user_agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
 'input.regex' = '([^ ]*) - - \\[([^\\]]*)\\] "([^\ ]*) ([^\ ]*) ([^\ ]*)" (\\d*) (\\d*) "([^"]*)" "([^"]*)"',
 'output.format.string' = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s")
LOCATION '/user/hive/warehouse/original_access_logs';
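
Before building the second table, the regex parsing can be spot-checked directly (a sketch; the ADD JAR below is repeated here because the RegexSerDe lives in hive-contrib and must be on the classpath before this table is queried):

ADD JAR /usr/lib/hive/lib/hive-contrib.jar;
-- A few parsed rows; empty columns suggest the regex did not match
select ip, url, code1 from intermediate_access_logs limit 5;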


CREATE EXTERNAL TABLE tokenized_access_logs (
 ip STRING,
 date STRING,
 method STRING,
 url STRING,
 http_version STRING,
 code1 STRING,
 code2 STRING,
 dash STRING,
 user_agent STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hive/warehouse/tokenized_access_logs';
ADD JAR /usr/lib/hive/lib/hive-contrib.jar;
INSERT OVERWRITE TABLE tokenized_access_logs SELECT * FROM
intermediate_access_logs;
select count(*),url from tokenized_access_logs
where url like '%\/product\/%'
group by url order by count(*) desc;
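A natural refinement (a sketch, assuming one visitor per IP address) is to count distinct IPs per product page rather than raw hits:

select url, count(distinct ip) as unique_visitors
from tokenized_access_logs
where url like '%\/product\/%'
group by url order by unique_visitors desc;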