[hadoop@hadoop004 hadoop]$ vim core-site.xml
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop004:9000</value>
    </property>
    <property>
        <name>io.compression.codecs</name>
        <value>
            org.apache.hadoop.io.compress.GzipCodec,
            org.apache.hadoop.io.compress.DefaultCodec,
            org.apache.hadoop.io.compress.BZip2Codec,
            org.apache.hadoop.io.compress.SnappyCodec
        </value>
    </property>
</configuration>
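Before relying on these codecs it is worth confirming that the native libraries are actually loadable on the node, because a codec listed here without native support will fail at job runtime. Hadoop 2.x ships a built-in check:

[hadoop@hadoop004 hadoop]$ hadoop checknative -a

This prints true/false for hadoop, zlib, snappy, lz4, bzip2 and openssl; snappy and bzip2 should both report true before the configuration below can work.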
[hadoop@hadoop004 hadoop]$ vim mapred-site.xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <!-- Compress intermediate map-stage output with Snappy -->
    <property>
        <name>mapreduce.map.output.compress</name>
        <value>true</value>
    </property>
    <property>
        <name>mapreduce.map.output.compress.codec</name>
        <value>org.apache.hadoop.io.compress.SnappyCodec</value>
    </property>
    <!-- Enable compression of the final MapReduce output files -->
    <property>
        <name>mapreduce.output.fileoutputformat.compress</name>
        <value>true</value>
    </property>
    <property>
        <name>mapreduce.output.fileoutputformat.compress.codec</name>
        <value>org.apache.hadoop.io.compress.BZip2Codec</value>
    </property>
</configuration>
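These mapred-site.xml settings are only cluster-wide defaults; any single job can override them. The examples jar parses generic options, so a per-job codec override looks like this (a sketch, with the -D flags placed before the program arguments):

[hadoop@hadoop004 hadoop]$ hadoop jar /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar wordcount \
    -Dmapreduce.output.fileoutputformat.compress=true \
    -Dmapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec \
    /data/wc/ /data/wc/output/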
[hadoop@hadoop004 sbin]$ hadoop jar /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar wordcount /data/wc/ /data/wc/output/
19/04/18 20:13:53 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/04/18 20:13:55 INFO input.FileInputFormat: Total input paths to process : 1
19/04/18 20:13:55 INFO mapreduce.JobSubmitter: number of splits:1
19/04/18 20:13:55 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1555589556022_0001
19/04/18 20:13:55 INFO impl.YarnClientImpl: Submitted application application_1555589556022_0001
19/04/18 20:13:55 INFO mapreduce.Job: The url to track the job: http://hadoop004:8088/proxy/application_1555589556022_0001/
19/04/18 20:13:55 INFO mapreduce.Job: Running job: job_1555589556022_0001
19/04/18 20:14:03 INFO mapreduce.Job: Job job_1555589556022_0001 running in uber mode : false
19/04/18 20:14:03 INFO mapreduce.Job: map 0% reduce 0%
19/04/18 20:14:08 INFO mapreduce.Job: map 100% reduce 0%
19/04/18 20:14:15 INFO mapreduce.Job: map 100% reduce 100%
19/04/18 20:14:15 INFO mapreduce.Job: Job job_1555589556022_0001 completed successfully
19/04/18 20:14:15 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=71
        FILE: Number of bytes written=223593
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=145
        HDFS: Number of bytes written=66
        HDFS: Number of read operations=6
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=1
        Launched reduce tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=2983
        Total time spent by all reduces in occupied slots (ms)=3594
        Total time spent by all map tasks (ms)=2983
        Total time spent by all reduce tasks (ms)=3594
        Total vcore-seconds taken by all map tasks=2983
        Total vcore-seconds taken by all reduce tasks=3594
        Total megabyte-seconds taken by all map tasks=3054592
        Total megabyte-seconds taken by all reduce tasks=3680256
    Map-Reduce Framework
        Map input records=3
        Map output records=9
        Map output bytes=80
        Map output materialized bytes=67
        Input split bytes=101
        Combine input records=9
        Combine output records=5
        Reduce input groups=5
        Reduce shuffle bytes=67
        Reduce input records=5
        Reduce output records=5
        Spilled Records=10
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=54
        CPU time spent (ms)=1460
        Physical memory (bytes) snapshot=461029376
        Virtual memory (bytes) snapshot=3206344704
        Total committed heap usage (bytes)=319291392
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=44
    File Output Format Counters
        Bytes Written=66
Note in the counters that "Map output bytes=80" shrank to "Map output materialized bytes=67": that is the map-side Snappy compression already at work. The final output file carries the BZip2 suffix:
[hadoop@hadoop004 sbin]$ hdfs dfs -ls /data/wc/output/par*
-rw-r--r-- 1 hadoop supergroup 66 2019-04-18 20:14 /data/wc/output/part-r-00000.bz2
[hadoop@hadoop004 hadoop]$ hdfs dfs -text /data/wc/output/par*
19/04/18 20:20:15 INFO bzip2.Bzip2Factory: Successfully loaded & initialized native-bzip2 library system-native
19/04/18 20:20:15 INFO compress.CodecPool: Got brand-new decompressor [.bz2]
boy 2
hello 4
hi 1
son 1
word 1
Now let's try Snappy for the final output.
[hadoop@hadoop004 hadoop-2.6.0-cdh5.7.0]$ hdfs dfs -rm -r /data/wc/output
Deleted /data/wc/output
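With mapreduce.output.fileoutputformat.compress.codec in mapred-site.xml switched to SnappyCodec, the same WordCount command is run again (job output omitted):

[hadoop@hadoop004 sbin]$ hadoop jar /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar wordcount /data/wc/ /data/wc/output/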
[hadoop@hadoop004 hadoop-2.6.0-cdh5.7.0]$ hdfs dfs -ls /data/wc/output/par*
-rw-r--r-- 1 hadoop supergroup 42 2019-04-18 20:25 /data/wc/output/part-r-00000.snappy
[hadoop@hadoop004 hadoop-2.6.0-cdh5.7.0]$ hdfs dfs -text /data/wc/output/par*
19/04/18 20:27:29 INFO compress.CodecPool: Got brand-new decompressor [.snappy]
boy 2
hello 4
hi 1
son 1
word 1
The example above used WordCount; next, let's compress a small file through Hive.
[hadoop@hadoop004 data]$ hdfs dfs -put page_views.dat /data/click/
[hadoop@hadoop004 data]$ hdfs dfs -ls /data/click/
Found 1 items
-rw-r--r-- 1 hadoop supergroup 19014993 2019-04-19 12:26 /data/click/page_views.dat
[hadoop@hadoop004 data]$ hdfs dfs -du -s -h /data/click/page_views.dat
18.1 M 18.1 M /data/click/page_views.dat
hive> SET hive.exec.compress.output;
hive.exec.compress.output=false
hive>
> create table page_views(
> track_times string,
> url string,
> session_id string,
> referer string,
> ip string,
> end_user_id string,
> city_id string
> ) row format delimited fields terminated by '\t';
OK
Time taken: 0.741 seconds
hive> load data local inpath '/home/hadoop/data/page_views.dat' overwrite into table page_views;
Loading data to table default.page_views
Table default.page_views stats: [numFiles=1, numRows=0, totalSize=19014993, rawDataSize=0]
OK
Time taken: 0.595 seconds
[hadoop@hadoop004 data]$ hdfs dfs -du -s -h /user/hive/warehouse/page_views/page_views.dat
18.1 M 18.1 M /user/hive/warehouse/page_views/page_views.dat
The file in the warehouse is still 18.1 M: LOAD DATA just moves the file into place, no MapReduce job runs, so nothing gets compressed. Because mapred-site.xml already sets the default output codec to Snappy, enabling hive.exec.compress.output is all that is needed:
hive> set hive.exec.compress.output=true;
hive> set hive.exec.compress.output;
hive.exec.compress.output=true
hive> set mapreduce.output.fileoutputformat.compress.codec;
mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec
hive> create table page_views_snappy as select * from page_views;
Query ID = hadoop_20190419141818_561ede29-e964-4655-9e48-1e4f5d6eeb5c
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1555643336639_0002, Tracking URL = http://hadoop004:8088/proxy/application_1555643336639_0002/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job -kill job_1555643336639_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-04-19 14:30:50,994 Stage-1 map = 0%, reduce = 0%
2019-04-19 14:30:57,450 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.7 sec
MapReduce Total cumulative CPU time: 2 seconds 700 msec
Ended Job = job_1555643336639_0002
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop004:9000/user/hive/warehouse/.hive-staging_hive_2019-04-19_14-30-44_413_8511625915872870114-1/-ext-10001
Moving data to: hdfs://hadoop004:9000/user/hive/warehouse/page_views_snappy
Table default.page_views_snappy stats: [numFiles=1, numRows=100000, totalSize=8814444, rawDataSize=18914993]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 2.7 sec HDFS Read: 19018292 HDFS Write: 8814535 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 700 msec
OK
Time taken: 15.354 seconds
[hadoop@hadoop004 data]$ hdfs dfs -ls /user/hive/warehouse/page_views_snappy
Found 1 items
-rwxr-xr-x 1 hadoop supergroup 8814444 2019-04-19 14:30 /user/hive/warehouse/page_views_snappy/000000_0.snappy
[hadoop@hadoop004 data]$ hdfs dfs -du -s -h /user/hive/warehouse/page_views_snappy
8.4 M 8.4 M /user/hive/warehouse/page_views_snappy
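Hive decompresses transparently on read, so the Snappy table queries like any other. A quick sanity check that should return the 100000 rows reported in the table stats above:

hive> select count(1) from page_views_snappy;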
Next, let's try BZip2.
hive> set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec;
hive> set mapreduce.output.fileoutputformat.compress.codec;
mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec
hive> create table page_views_bzip2 as select * from page_views;
Query ID = hadoop_20190419143939_53d67293-92de-4697-a791-f9a1afe7be01
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1555643336639_0003, Tracking URL = http://hadoop004:8088/proxy/application_1555643336639_0003/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job -kill job_1555643336639_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-04-19 14:44:16,992 Stage-1 map = 0%, reduce = 0%
2019-04-19 14:44:25,471 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 5.19 sec
MapReduce Total cumulative CPU time: 5 seconds 190 msec
Ended Job = job_1555643336639_0003
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop004:9000/user/hive/warehouse/.hive-staging_hive_2019-04-19_14-44-09_564_3458580979229190548-1/-ext-10001
Moving data to: hdfs://hadoop004:9000/user/hive/warehouse/page_views_bzip2
Table default.page_views_bzip2 stats: [numFiles=1, numRows=100000, totalSize=3815195, rawDataSize=18914993]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 5.19 sec HDFS Read: 19018291 HDFS Write: 3815285 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 190 msec
OK
Time taken: 17.289 seconds
[hadoop@hadoop004 hadoop]$ hdfs dfs -ls /user/hive/warehouse/page_views_bzip2
Found 1 items
-rwxr-xr-x 1 hadoop supergroup 3815195 2019-04-19 14:44 /user/hive/warehouse/page_views_bzip2/000000_0.bz2
[hadoop@hadoop004 hadoop]$ hdfs dfs -du -s -h /user/hive/warehouse/page_views_bzip2/*
3.6 M 3.6 M /user/hive/warehouse/page_views_bzip2/000000_0.bz2
Now let's look at Gzip.
hive> set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
hive> set mapreduce.output.fileoutputformat.compress.codec;
mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec
hive> create table page_views_gzip as select * from page_views;
Query ID = hadoop_20190419143939_53d67293-92de-4697-a791-f9a1afe7be01
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1555643336639_0004, Tracking URL = http://hadoop004:8088/proxy/application_1555643336639_0004/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job -kill job_1555643336639_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-04-19 14:48:10,606 Stage-1 map = 0%, reduce = 0%
2019-04-19 14:48:18,019 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.29 sec
MapReduce Total cumulative CPU time: 3 seconds 290 msec
Ended Job = job_1555643336639_0004
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop004:9000/user/hive/warehouse/.hive-staging_hive_2019-04-19_14-48-03_436_7531556390065383047-1/-ext-10001
Moving data to: hdfs://hadoop004:9000/user/hive/warehouse/page_views_gzip
Table default.page_views_gzip stats: [numFiles=1, numRows=100000, totalSize=5550655, rawDataSize=18914993]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 3.29 sec HDFS Read: 19018290 HDFS Write: 5550744 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 290 msec
OK
Time taken: 15.866 seconds
[hadoop@hadoop004 hadoop]$ hdfs dfs -ls /user/hive/warehouse/page_views_gzip
Found 1 items
-rwxr-xr-x 1 hadoop supergroup 5550655 2019-04-19 14:48 /user/hive/warehouse/page_views_gzip/000000_0.gz
[hadoop@hadoop004 hadoop]$ hdfs dfs -du -s -h /user/hive/warehouse/page_views_gzip/*
5.3 M 5.3 M /user/hive/warehouse/page_views_gzip/000000_0.gz
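Putting the three Hive runs side by side against the 19014993-byte (18.1 M) source:

Snappy: 8814444 bytes (8.4 M), ratio ≈ 2.2x, map CPU 2.7 s
BZip2: 3815195 bytes (3.6 M), ratio ≈ 5.0x, map CPU 5.19 s
Gzip: 5550655 bytes (5.3 M), ratio ≈ 3.4x, map CPU 3.29 s

The classic trade-off shows up clearly: BZip2 compresses hardest but costs nearly twice the CPU of Snappy; Snappy is cheapest but leaves the largest files; Gzip sits in between. Also remember that of these three only BZip2 is splittable, which starts to matter once a single output file exceeds one HDFS block.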
Finally, something heavier: a 454 M access log run through a custom ETL job that also does some data cleansing.
[hadoop@hadoop004 data]$ hdfs dfs -put /home/hadoop/data/login.log /g6/hadoop/accesslog/20180717/
[hadoop@hadoop004 data]$ hdfs dfs -ls /g6/hadoop/accesslog/20180717/
Found 1 items
-rw-r--r-- 1 hadoop supergroup 476118692 2019-04-19 12:09 /g6/hadoop/accesslog/20180717/login.log
[hadoop@hadoop004 data]$ hdfs dfs -du -s -h /g6/hadoop/accesslog/20180717/login.log
454.1 M 454.1 M /g6/hadoop/accesslog/20180717/login.log
[hadoop@hadoop004 data]$ hadoop jar /home/hadoop/lib/g6-hadoop-1.0.jar com.ruozedata.hadoop.mapreduce.driver.LogETLDriver /g6/hadoop/accesslog/20180717 /g6/hadoop/access/output
19/04/19 12:12:57 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/04/19 12:12:57 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
19/04/19 12:12:58 INFO input.FileInputFormat: Total input paths to process : 1
19/04/19 12:12:58 INFO mapreduce.JobSubmitter: number of splits:4
19/04/19 12:12:58 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1555643336639_0001
19/04/19 12:12:58 INFO impl.YarnClientImpl: Submitted application application_1555643336639_0001
19/04/19 12:12:58 INFO mapreduce.Job: The url to track the job: http://hadoop004:8088/proxy/application_1555643336639_0001/
19/04/19 12:12:58 INFO mapreduce.Job: Running job: job_1555643336639_0001
19/04/19 12:13:06 INFO mapreduce.Job: Job job_1555643336639_0001 running in uber mode : false
19/04/19 12:13:06 INFO mapreduce.Job: map 0% reduce 0%
19/04/19 12:13:19 INFO mapreduce.Job: map 1% reduce 0%
19/04/19 12:13:22 INFO mapreduce.Job: map 3% reduce 0%
19/04/19 12:13:23 INFO mapreduce.Job: map 5% reduce 0%
19/04/19 12:13:24 INFO mapreduce.Job: map 6% reduce 0%
19/04/19 12:13:25 INFO mapreduce.Job: map 11% reduce 0%
19/04/19 12:13:26 INFO mapreduce.Job: map 13% reduce 0%
19/04/19 12:13:27 INFO mapreduce.Job: map 19% reduce 0%
19/04/19 12:13:28 INFO mapreduce.Job: map 23% reduce 0%
19/04/19 12:13:30 INFO mapreduce.Job: map 34% reduce 0%
19/04/19 12:13:32 INFO mapreduce.Job: map 41% reduce 0%
19/04/19 12:13:34 INFO mapreduce.Job: map 52% reduce 0%
19/04/19 12:13:34 INFO mapreduce.Job: Task Id : attempt_1555643336639_0001_m_000001_0, Status : FAILED
Container killed on request. Exit code is 137
Container exited with a non-zero exit code 137
Killed by external signal
(Exit code 137 is 128 + SIGKILL: the container was killed, typically for exceeding its memory limit. YARN retried the attempt, which is why the job still completes successfully below.)
19/04/19 12:13:35 INFO mapreduce.Job: map 62% reduce 0%
19/04/19 12:13:36 INFO mapreduce.Job: map 75% reduce 0%
19/04/19 12:13:50 INFO mapreduce.Job: map 90% reduce 25%
19/04/19 12:13:51 INFO mapreduce.Job: map 100% reduce 25%
19/04/19 12:13:53 INFO mapreduce.Job: map 100% reduce 78%
19/04/19 12:13:54 INFO mapreduce.Job: map 100% reduce 100%
19/04/19 12:13:54 INFO mapreduce.Job: Job job_1555643336639_0001 completed successfully
19/04/19 12:13:54 INFO mapreduce.Job: Counters: 52
    File System Counters
        FILE: Number of bytes read=39419386
        FILE: Number of bytes written=79392688
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=476131480
        HDFS: Number of bytes written=36841185
        HDFS: Number of read operations=15
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Failed map tasks=1
        Killed map tasks=1
        Launched map tasks=6
        Launched reduce tasks=1
        Other local map tasks=2
        Data-local map tasks=4
        Total time spent by all maps in occupied slots (ms)=120819
        Total time spent by all reduces in occupied slots (ms)=14541
        Total time spent by all map tasks (ms)=120819
        Total time spent by all reduce tasks (ms)=14541
        Total vcore-seconds taken by all map tasks=120819
        Total vcore-seconds taken by all reduce tasks=14541
        Total megabyte-seconds taken by all map tasks=123718656
        Total megabyte-seconds taken by all reduce tasks=14889984
    Map-Reduce Framework
        Map input records=1100000
        Map output records=1100000
        Map output bytes=100291376
        Map output materialized bytes=39416687
        Input split bytes=500
        Combine input records=0
        Combine output records=0
        Reduce input groups=1
        Reduce shuffle bytes=39416687
        Reduce input records=1100000
        Reduce output records=1100000
        Spilled Records=2200000
        Shuffled Maps =4
        Failed Shuffles=0
        Merged Map outputs=4
        GC time elapsed (ms)=3969
        CPU time spent (ms)=35800
        Physical memory (bytes) snapshot=2299715584
        Virtual memory (bytes) snapshot=7979675648
        Total committed heap usage (bytes)=2030043136
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=476130980
    File Output Format Counters
        Bytes Written=36841185
[hadoop@hadoop004 data]$ hdfs dfs -ls /g6/hadoop/access/output
Found 2 items
-rw-r--r-- 1 hadoop supergroup 0 2019-04-19 12:13 /g6/hadoop/access/output/_SUCCESS
-rw-r--r-- 1 hadoop supergroup 36841185 2019-04-19 12:13 /g6/hadoop/access/output/part-r-00000.snappy
[hadoop@hadoop004 data]$ hdfs dfs -du -s -h /g6/hadoop/access/output
35.1 M 35.1 M /g6/hadoop/access/output
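As with the WordCount output, the Snappy part file can be spot-checked without manual decompression, since -text picks a decompressor based on the file extension:

[hadoop@hadoop004 data]$ hdfs dfs -text /g6/hadoop/access/output/part-r-00000.snappy | head -3

Note that the 454.1 M to 35.1 M reduction reflects both the ETL cleansing and the Snappy compression, so it is not directly comparable to the pure-codec ratios measured on page_views above.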