Hadoop
Basic Concepts
- Hadoop = HDFS + MapReduce; a big data framework
- HDFS: distributed file system
- MapReduce: big data computing model (map → shuffle → reduce; a local sketch of the flow follows this list)
- NameNode: master node; DataNode: data node; SecondaryNameNode: periodically merges the NameNode's edit log into the fsimage (a checkpoint helper, not a hot standby)
- YARN: resource scheduling
- ResourceManager: unified management and allocation of all resources in the cluster
- NodeManager: manages a single compute node in the Hadoop cluster
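To make the map → shuffle → reduce flow concrete, here is a minimal local simulation in plain Python. This is not the Hadoop API; the function names are only for illustration, and the framework normally performs the shuffle step across nodes.

from collections import defaultdict

def map_phase(lines):
    # map: each input line -> a stream of (word, 1) pairs
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    # shuffle/sort: group values by key (done across nodes by the framework)
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce: aggregate the values collected for each key
    return {word: sum(counts) for word, counts in groups.items()}

print(reduce_phase(shuffle(map_phase(["a b d e a v b"]))))
# -> {'a': 2, 'b': 2, 'd': 1, 'e': 1, 'v': 1}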
Environment Setup
- https://www.jianshu.com/p/2c3b04ac498d
- https://blog.youkuaiyun.com/dai451954706/article/details/50464036
Testing
# map.py -- mapper: read lines from stdin, emit "word<TAB>1" for every word
import sys

def read_input(file):
    for line in file:
        yield line.split()

def main():
    data = read_input(sys.stdin)
    for words in data:
        for word in words:
            print("%s\t%d" % (word, 1))

if __name__ == '__main__':
    main()
# reduce.py -- reducer: input must be sorted by key (Hadoop's shuffle, or `sort` locally),
# so groupby sees all counts for a word consecutively and can sum them
import sys
from operator import itemgetter
from itertools import groupby

def read_map_out(file, sep='\t'):
    for line in file:
        yield line.rstrip().split(sep)

def main():
    data = read_map_out(sys.stdin)
    for cur_word, group in groupby(data, itemgetter(0)):
        tot_count = sum(int(count) for _, count in group)
        print("%s\t%d" % (cur_word, tot_count))

if __name__ == '__main__':
    main()
- Local test (the sort step stands in for Hadoop's shuffle/sort phase)
echo "a b d e a v b" | python3 map.py | sort | python3 reduce.py
- HDFS test via Hadoop Streaming (the output directory /output1 must not already exist; results are written to /output1/part-*)
/usr/local/Cellar/hadoop/3.1.2/bin/hadoop jar /usr/local/Cellar/hadoop/3.1.2/libexec/share/hadoop/tools/lib/hadoop-streaming-3.1.2.jar -files "map.py,reduce.py" -input /kms.sh -output /output1 -mapper "/usr/local/bin/python3 map.py" -reducer "/usr/local/bin/python3 reduce.py"
Spark
A big data parallel computing framework based on in-memory computation; an alternative to MapReduce.
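For comparison with the Streaming job above, here is a minimal PySpark word-count sketch. It assumes pyspark is installed and runs Spark in local mode; input.txt is a placeholder path.

from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount")           # local mode, all cores
counts = (sc.textFile("input.txt")                   # placeholder input file
            .flatMap(lambda line: line.split())      # map: line -> words
            .map(lambda word: (word, 1))             # emit (word, 1) pairs
            .reduceByKey(lambda a, b: a + b))        # reduce: sum counts per word
for word, count in counts.collect():
    print("%s\t%d" % (word, count))
sc.stop()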