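Hadoop Basics 1 - Core Components and Their Functions
My Hadoop learning log. This is the first post, and I hope I can keep it going.
Hadoop Basics
Hadoop Ecosystem Basics
Core components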
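- HDFS: distributed file system;
- MapReduce: framework for developing distributed computation programs;
- Sqoop: tool for moving data between relational databases and HDFS;
- Flume: log data collection framework;
- ZooKeeper: foundational distributed coordination service;
- Hive: SQL data warehouse tool that supports SQL-like queries (translates SQL statements into underlying MapReduce jobs);
- Pig: high-level data-flow scripting platform (translates Pig Latin scripts, rather than SQL, into underlying MapReduce jobs);
- Mahout: machine learning framework;
- YARN: resource scheduling system;
- HBase: distributed database for massive data sets, built on Hadoop.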
Cluster layout
Core Hadoop components and their functions
- HDFS (Hadoop Distributed File System)
A file system written in Java, based on Google's GFS
Sits on top of a native file system
Provides redundant storage for massive amounts of data
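To picture the redundant storage: HDFS stores a file as fixed-size blocks and keeps several copies of each block on different nodes. The sketch below is only an illustration under stated assumptions. The 128 MB block size and replication factor of 3 are common defaults but are treated as assumptions here, the node names are hypothetical, and the round-robin placement is a simplification (real HDFS placement is rack-aware).

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size: 128 MB
REPLICATION = 3                 # assumed replication factor


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes (in bytes) of the blocks a file would occupy."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign every block to `replication` distinct nodes, round-robin.

    Simplified on purpose: this ignores racks, free space and node load.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(replication)]
        for block in range(num_blocks)
    }


if __name__ == "__main__":
    blocks = split_into_blocks(300 * 1024 * 1024)  # a 300 MB file
    nodes = ["node1", "node2", "node3", "node4"]   # hypothetical DataNodes
    for block, holders in place_replicas(len(blocks), nodes).items():
        print(block, blocks[block] // (1024 * 1024), "MB ->", holders)
```

Losing any single node in this layout still leaves two copies of every block, which is the point of the redundancy.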
Files are split into blocks
- MapReduce
Distributing a task across multiple nodes
Consists of Map and Reduce phases
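The two phases can be illustrated with the classic word count. This is a minimal single-process sketch in plain Python, not Hadoop's Java API. The function names are made up for illustration, and the "shuffle" step stands in for the grouping that the framework does between the Map and Reduce phases on a real cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(splits):
    """Run map over every input split, then shuffle, then reduce."""
    mapped = [pair for split in splits for pair in map_phase(split)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Each string plays the role of one input split handled by one map task.
    print(word_count(["hadoop stores data", "hadoop processes data"]))
```

On a cluster, each split would be mapped on a different node and the reducers would run in parallel per key; here everything runs in one process so the data flow is easy to follow.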
MapReduce code is typically written in Java, and many organizations have only a few developers who can write good MapReduce code.
Being able to query the data without knowing MapReduce intimately is therefore crucial, which is why Hive and Pig were created.
- HIVE
Hive was originally developed at Facebook and provides a very SQL-like language (HiveQL). Under the covers, it generates MapReduce jobs that run on the Hadoop cluster.
Because it is based on MapReduce, Hive has built-in latency (typical queries take a few minutes), so Impala was developed to execute queries natively on each node, without going through MapReduce.
- PIG
Pig was originally created at Yahoo! to answer a similar need to Hive. Under the covers, Pig Latin scripts are turned into MapReduce jobs and executed on the cluster.
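To get a feel for what "turned into MapReduce jobs" means for both Hive and Pig, here is a hand-written sketch of how a grouped aggregate could be expressed as a map step and a reduce step. The query, table, and column names are hypothetical, and this is not the plan either tool actually generates. It only shows the shape of the translation.

```python
from collections import defaultdict

# Hypothetical query being illustrated:
#   SELECT dept, AVG(salary) FROM employees GROUP BY dept;
ROWS = [
    {"dept": "sales", "salary": 100},
    {"dept": "sales", "salary": 300},
    {"dept": "ops", "salary": 200},
]


def map_row(row):
    """Map: the GROUP BY column becomes the key, the aggregated column the value."""
    return row["dept"], row["salary"]


def reduce_group(dept, salaries):
    """Reduce: apply the aggregate function (AVG) to one group's values."""
    return dept, sum(salaries) / len(salaries)


def run_query(rows):
    """Map every row, group the pairs by key, then reduce each group."""
    groups = defaultdict(list)
    for dept, salary in map(map_row, rows):
        groups[dept].append(salary)
    return dict(reduce_group(dept, salaries) for dept, salaries in groups.items())


if __name__ == "__main__":
    print(run_query(ROWS))
```

The analyst only writes the one-line query. The map and reduce functions are what the tool has to produce and schedule on the cluster, which is also where the extra latency comes from.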
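JDBC (Java Database Connectivity) is a Java API for executing SQL statements, and is commonly listed as one of the thirteen J2EE specifications. It provides uniform access to many relational databases and consists of a set of classes and interfaces written in Java. JDBC provides a baseline on which higher-level tools and interfaces can be built, enabling database developers to write database applications.
Open Database Connectivity (ODBC) is the database component of Microsoft's Windows Open Services Architecture (WOSA). It defines a set of specifications and provides a standard API for database access. These APIs rely on SQL to do most of their work.
JDBC and ODBC are both driver-based interfaces for connecting to databases. Because they are database-independent, and to a large extent platform-independent, they provide good support for accessing heterogeneous databases over the Internet.
- IMPALA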
- SQOOP
SQL-to-Hadoop
Parallel import/export between Hadoop and various RDBMSes
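The parallelism comes from dividing a table's rows among several map tasks, typically by ranges of a numeric split column (Sqoop exposes this through options such as `--split-by` and `--num-mappers`). The sketch below illustrates only the range-splitting idea. The table name, column name, and helper functions are hypothetical, and Sqoop's real splitting logic handles more cases than this.

```python
def split_ranges(min_id, max_id, num_mappers):
    """Divide the inclusive key range [min_id, max_id] into contiguous chunks.

    Returns at most `num_mappers` inclusive (low, high) ranges of near-equal size.
    """
    total = max_id - min_id + 1
    if total <= 0 or num_mappers <= 0:
        raise ValueError("empty key range or no mappers")
    chunks = min(num_mappers, total)
    base, extra = divmod(total, chunks)
    ranges, start = [], min_id
    for i in range(chunks):
        size = base + (1 if i < extra else 0)  # spread the remainder evenly
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def mapper_queries(table, column, ranges):
    """Build the bounded query each (hypothetical) map task would run."""
    return [
        f"SELECT * FROM {table} WHERE {column} >= {low} AND {column} <= {high}"
        for low, high in ranges
    ]


if __name__ == "__main__":
    # Hypothetical table "orders" with primary key "id" running from 1 to 10.
    for query in mapper_queries("orders", "id", split_ranges(1, 10, 4)):
        print(query)
```

Each map task reads only its own slice, so the slices can be pulled from the database and written to Hadoop at the same time.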
- FLUME
Flume is a distributed, reliable, available service for efficiently moving large amounts of data as it is produced.
- OOZIE
- HBASE
The Hadoop database
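Summary:
Hadoop is used across nearly every industry that works with big data.
From an engineer/IT perspective, you need to understand how to set up the Hadoop ecosystem, migrate and back up data, and manage resource allocation. Sqoop, ZooKeeper, and similar tools are must-know.
From a data analyst's perspective, components such as Hive, Impala, Pig, and Mahout lower the programming skills required of business analysts: day-to-day analysis can be done in SQL-like or other query languages.
From a server-monitoring perspective, YARN, Hue, Flume, and similar tools are the main means of managing resource usage and day-to-day logs.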
Some diagrams of the ecosystem
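I lean toward the analysis side and have so far mostly worked with Hive and Impala. I hope to get more exposure to the lower-level components later.
If anything here is unclear or wrong, corrections are very welcome.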
Thank you!