The Block is the most basic storage unit in HDFS.
When an HDFS client uploads data to HDFS, it first buffers the data locally. Once the buffered data reaches one block in size, the client asks the NameNode to allocate a block. The NameNode replies with the addresses of the DataNodes that will hold the block, and the client then communicates with those DataNodes directly, writing the data into a block file on each node.
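For example, an upload that exercises this flow (the local file and target directory are illustrative):

[root@i-love-you hadoop]# bin/hdfs dfs -mkdir /demo
[root@i-love-you hadoop]# bin/hdfs dfs -put /usr/local/data/big.log /demo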
Setting the block size in hdfs-site.xml:
dfs.blocksize = 134217728 (bytes, i.e. 128 MB, the Hadoop 2.x default)
The property dfs.datanode.data.dir sets the local directories in which a DataNode stores its block files.
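A minimal hdfs-site.xml sketch covering both properties (the data directory mirrors the path used on this machine; adjust the values for your cluster):

<configuration>
    <property>
        <name>dfs.blocksize</name>
        <value>134217728</value> <!-- 128 MB per block -->
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/usr/local/mydata/dfs/data</value> <!-- where block files are kept -->
    </property>
</configuration>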
Block files and their metadata (.meta) end up under that directory, e.g.:
/usr/local/mydata/dfs/data/current/BP-1476006134-192.168.1.10-1427374210743/current/finalized/subdir0/subdir0
A listing of that directory looks like this:
-rw-r--r--. 1 root root  33574 Mar 29 18:07 blk_1073741851
-rw-r--r--. 1 root root    271 Mar 29 18:07 blk_1073741851_1027.meta
-rw-r--r--. 1 root root 103997 Mar 29 18:07 blk_1073741852
-rw-r--r--. 1 root root    823 Mar 29 18:07 blk_1073741852_1028.meta
If you clear the data out of HDFS, this directory is emptied as well.
HDFS storage necessarily relies on the operating system's file management: HDFS is a file-management layer sitting on top of the local file systems of the cluster's nodes. Data copied from the Linux file system into HDFS is stored uncompressed, which you can verify by inspecting the block files and their .meta metadata.
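A rough way to check this (the file name and block ID are illustrative; pick out the newly created blk_ file from a listing like the one above): for a file no larger than one block, the block file on the DataNode holds the original bytes verbatim, so the two checksums match.

[root@i-love-you hadoop]# md5sum a.txt
[root@i-love-you hadoop]# md5sum /usr/local/mydata/dfs/data/current/BP-1476006134-192.168.1.10-1427374210743/current/finalized/subdir0/subdir0/blk_1073741853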
Replica management on the DataNodes
The replication factor is configured in hdfs-site.xml via dfs.replication.
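A sketch of the setting (3 is the usual default):

<property>
    <name>dfs.replication</name>
    <value>3</value> <!-- number of copies kept for each block -->
</property>

The factor can also be changed for existing files with bin/hdfs dfs -setrep, e.g. bin/hdfs dfs -setrep 2 /dir/a.txt.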
Data storage: Replication Pipelining
Assume dfs.replication=3.
When the HDFS client uploads data, it asks the NameNode for a block, and the NameNode hands back the addresses of three DataNodes. The client writes the data into the block on the first DataNode; the first DataNode forwards the data to the second, and the second forwards it to the third. Where the replicas landed can be checked as sketched below.
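fsck reports each block of a file together with the DataNodes holding its replicas (assuming a file such as /dir/a.txt already exists):

[root@i-love-you hadoop]# bin/hdfs fsck /dir/a.txt -files -blocks -locations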
HDFS file archiving:
Merging small HDFS files:
The directory to archive (populated as sketched below):
/dir
/dir/a.txt
/dir/b.txt
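One way to create this layout (the file contents are illustrative):

[root@i-love-you hadoop]# bin/hdfs dfs -mkdir /dir
[root@i-love-you hadoop]# echo "hello har" > a.txt && bin/hdfs dfs -put a.txt /dir
[root@i-love-you hadoop]# echo "small file demo" > b.txt && bin/hdfs dfs -put b.txt /dir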
[root@i-love-you hadoop]# bin/hdfs dfs -ls /dir
Found 2 items
-rw-r--r-- 1 root supergroup 13 2015-03-30 20:49 /dir/a.txt
-rw-r--r-- 1 root supergroup 18 2015-03-30 20:49 /dir/b.txt
Create the archive. The -p flag, which names the parent path of the files being archived, is required; the first attempt below omits it and is rejected, the second succeeds:
[root@i-love-you hadoop]# bin/hadoop archive -archiveName c.har /dir /dest
[root@i-love-you hadoop]# bin/hadoop archive -archiveName c.har -p /dir /dest
15/03/30 21:02:41 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/03/30 21:02:45 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/03/30 21:02:45 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/03/30 21:02:47 INFO mapreduce.JobSubmitter: number of splits:1
15/03/30 21:02:49 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1427717039268_0001
15/03/30 21:02:52 INFO impl.YarnClientImpl: Submitted application application_1427717039268_0001
15/03/30 21:02:54 INFO mapreduce.Job: The url to track the job: http://i-love-you:8088/proxy/application_1427717039268_0001/
15/03/30 21:02:54 INFO mapreduce.Job: Running job: job_1427717039268_0001
15/03/30 21:03:36 INFO mapreduce.Job: Job job_1427717039268_0001 running in uber mode : false
15/03/30 21:03:36 INFO mapreduce.Job: map 0% reduce 0%
15/03/30 21:04:25 INFO mapreduce.Job: map 100% reduce 0%
15/03/30 21:04:53 INFO mapreduce.Job: map 100% reduce 100%
15/03/30 21:04:55 INFO mapreduce.Job: Job job_1427717039268_0001 completed successfully
15/03/30 21:04:57 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=206
        FILE: Number of bytes written=214211
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=418
        HDFS: Number of bytes written=236
        HDFS: Number of read operations=17
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=7
    Job Counters
        Launched map tasks=1
        Launched reduce tasks=1
        Other local map tasks=1
        Total time spent by all maps in occupied slots (ms)=48204
        Total time spent by all reduces in occupied slots (ms)=20795
        Total time spent by all map tasks (ms)=48204
        Total time spent by all reduce tasks (ms)=20795
        Total vcore-seconds taken by all map tasks=48204
        Total vcore-seconds taken by all reduce tasks=20795
        Total megabyte-seconds taken by all map tasks=49360896
        Total megabyte-seconds taken by all reduce tasks=21294080
    Map-Reduce Framework
        Map input records=3
        Map output records=3
        Map output bytes=194
        Map output materialized bytes=206
        Input split bytes=116
        Combine input records=0
        Combine output records=0
        Reduce input groups=3
        Reduce shuffle bytes=206
        Reduce input records=3
        Reduce output records=0
        Spilled Records=6
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=467
        CPU time spent (ms)=3810
        Physical memory (bytes) snapshot=293769216
        Virtual memory (bytes) snapshot=1690853376
        Total committed heap usage (bytes)=136450048
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=271
    File Output Format Counters
        Bytes Written=0
View the contents:
[root@i-love-you hadoop]# bin/hadoop fs -ls -R har:///dest/c.har
-rw-r--r-- 1 root supergroup 13 2015-03-30 20:49 har:///dest/c.har/a.txt
-rw-r--r-- 1 root supergroup 18 2015-03-30 20:49 har:///dest/c.har/b.txt
[root@i-love-you hadoop]# bin/hdfs dfs -ls /dest
Found 1 items
drwxr-xr-x - root supergroup 0 2015-03-30 21:04 /dest/c.har
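Files inside the archive can be read back through the har:// scheme; a sketch using the paths above:

[root@i-love-you hadoop]# bin/hdfs dfs -cat har:///dest/c.har/a.txt
[root@i-love-you hadoop]# bin/hdfs dfs -cp har:///dest/c.har/b.txt /b-restored.txt

A MapReduce job can likewise take har:///dest/c.har as an input path, which is what makes archiving a practical answer to the small-files problem.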