"Hadoop: The Definitive Guide" Reading Notes -- Chapter 1 Meet Hadoop

Hadoop is a reliable, scalable platform for big-data storage and analysis. Through core components such as the distributed file system HDFS and the batch-processing framework MapReduce, it solves the problems of storing large-scale data and processing it in parallel. The chapter also introduces other components of the Hadoop ecosystem, such as HBase and YARN, and their typical use cases.


Preface

Stripped to its core, the tools that Hadoop provides for working with big data are simple. If there is a common theme, it is about raising the level of abstraction -- to create building blocks for programmers who have lots of data to store and analyze, and who don't have the time, the skill, or the inclination to become distributed-systems experts in order to build the infrastructure to handle it.


Part I Hadoop Fundamentals

Chapter 1 Meet Hadoop

Data!

Individuals' data footprints keep growing, and machines generate even more (machine logs, RFID readers, sensor networks).

"More data usually beats better algorithm"


Data Storage 

It takes a long time to read all the data from a single disk -- and writing is even slower. The obvious way to reduce the time is to read from multiple disks at once.
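A rough back-of-the-envelope (assuming about 1 TB of data and a transfer rate of roughly 100 MB/s, figures of the same order as those the book uses):

    1 TB / 100 MB/s = 10^12 B / 10^8 B/s = 10,000 s, i.e. close to 3 hours to read everything off one disk.
    Spread the same data over 100 disks read in parallel, and a full scan takes about 100 s, under 2 minutes.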

Two problems with reading from and writing to many disks in parallel:

1. The chance that one of them will fail is fairly high. Solution: replication (RAID / HDFS).

2. Most analysis tasks need to be able to combine data from different disks in some way. Solution: MapReduce.

Hadoop provides: a reliable, scalable platform for storage and analysis


Querying All Your Data

MapReduce is a batch query processor: the ability to run an ad hoc query against your whole dataset and get the results back in a reasonable time is transformative.
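As a concrete sketch of the programming model, here is the canonical word-count job written against the standard org.apache.hadoop.mapreduce API; the input and output paths come from the command line, and the whitespace tokenization is deliberately simplistic:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // map: (byte offset, line of text) -> (word, 1) for every word in the line
        public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // reduce: (word, [1, 1, ...]) -> (word, total count)
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The map phase emits (word, 1) pairs in parallel across the cluster, the shuffle groups them by word, and the reduce phase sums the counts -- the "combine the data in some way" step mentioned above.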


Beyond Batch

MapReduce is fundamentally a batch processing system, and is not suitable for interactive analysis. You can't run a query and get results back in a few seconds or less.


"Hadoop" is sometimes used to refer to a larger ecosystem of projects, not just HDFS and MapReduce.

HBase is a key-value store that uses HDFS for its underlying storage. HBase provides both online read/write access to individual rows and batch operations for reading and writing data in bulk.
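A minimal sketch of the online, row-level access side, using the HBase Java client API; the table name access_log, column family info, and row key are made up for illustration, and the table is assumed to already exist:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseRowAccess {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("access_log"))) {  // hypothetical table

                // online write: put one row
                Put put = new Put(Bytes.toBytes("row-001"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("url"), Bytes.toBytes("/index.html"));
                table.put(put);

                // online read: get the same row back
                Get get = new Get(Bytes.toBytes("row-001"));
                Result result = table.get(get);
                System.out.println(Bytes.toString(
                        result.getValue(Bytes.toBytes("info"), Bytes.toBytes("url"))));
            }
        }
    }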

YARN (Yet Another Resource Negotiator) is a cluster resource management system that allows any distributed program (not just MapReduce) to run on data stored in a Hadoop cluster.


Different processing patterns that work with Hadoop: [ROADMAP]

1. Interactive SQL: Impala (daemons), Hive on Tez (container reuse)

2. Iterative Processing: Spark (many machine-learning algorithms are iterative in nature and benefit from keeping the intermediate working set in memory; see the sketch after this list)

3. Stream Processing: Storm, Spark Streaming, and Samza run realtime, distributed computations on unbounded streams of data and emit results.

4. Search: Solr can run on a Hadoop cluster.
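As a sketch of why holding the working set in memory matters for iterative jobs, here is a toy loop using the Spark Java API; the HDFS path, the app name, and the repeated reduce are made up for illustration:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class IterativeSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("iterative-sketch").setMaster("local[*]");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                // hypothetical input: one number per line in an HDFS file
                JavaRDD<Double> points = sc.textFile("hdfs:///data/points.txt")
                                           .map(Double::parseDouble)
                                           .cache();   // keep the parsed dataset in memory

                for (int i = 0; i < 10; i++) {
                    // every pass after the first reuses the cached RDD
                    // instead of rereading and reparsing the HDFS file
                    double sum = points.reduce(Double::sum);
                    System.out.println("iteration " + i + ", sum = " + sum);
                }
            }
        }
    }

Without cache(), each pass through the loop would go back to HDFS; with it, iterations after the first work against the in-memory copy.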


Comparison with Other Systems

Relational Database Management System

Why can't we use databases with lots of disks to do large-scale analysis?

Seek time is improving more slowly than transfer rate. A B-tree (the structure relational databases use for indexing) works well when only a small portion of the data is read or updated; when most of the dataset has to be scanned or rewritten, a sort/merge approach like MapReduce is more efficient.
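A rough illustration with made-up but plausible numbers: take a 1 TB dataset of 100-byte records (10^10 records) and suppose a job must update 1% of them (10^8 records). Touching them one by one at ~10 ms per seek costs about 10^6 s, roughly 11 days; streaming the whole 1 TB through at 100 MB/s and rewriting it sort/merge style costs about 10^4 s, under 3 hours. This is why B-trees win for small, selective reads and writes, while MapReduce wins when most of the dataset is involved.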


Grid Computing (good for compute-intensive jobs)

Problems arise when nodes need to access larger data volumes (hundreds of gigabytes, the point at which Hadoop really starts to shine), since network bandwidth becomes the bottleneck and compute nodes sit idle.

Hadoop tries to co-locate the data with the compute nodes, so data access is fast because it is local. (Data locality)


Volunteer Computing

Volunteer computing projects (e.g. SETI@home) break a problem into CPU-intensive work units sent to untrusted machines across the Internet, using redundant computation to guard against cheating, with no data locality. MapReduce, by contrast, targets data-intensive jobs on trusted, dedicated hardware in a single data center with high-bandwidth interconnects.

