Google cloud computing

http://tlyxy228.blog.163.com/blog/static/1810901201051872733257/

0. Idea

    * Replication, replication, replication
    * Each piece of data is available on multiple machines
    * Literally dozens of copies of the Web across their clusters
    * Requests are split up over the logical clusters and handled in parallel

1. GFS
A distributed file system that provides storage of, and access to, massive amounts of data.

# "Chunks" (i.e. blocks) stored on machines as regular files

    * 64MB block size

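Chunk size is 64 MB.
Each chunk is divided into 64 KB blocks, and each block has a corresponding 32-bit checksum.
(Why? So that a chunk server can detect corrupted data when it reads a block.)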
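
The per-block checksum idea can be sketched in a few lines. This is only an illustration: the notes say "32-bit checksum" without naming the algorithm, so CRC-32 from Python's `zlib` is an assumption here, and the helper names `block_checksums` and `find_corrupt_blocks` are made up for the example.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # 64 KB checksum block, as in the notes above


def block_checksums(chunk: bytes) -> list[int]:
    """Return one 32-bit checksum per 64 KB block (CRC-32 is an assumption)."""
    return [
        zlib.crc32(chunk[i:i + BLOCK_SIZE])
        for i in range(0, len(chunk), BLOCK_SIZE)
    ]


def find_corrupt_blocks(chunk: bytes, expected: list[int]) -> list[int]:
    """Return the indices of blocks whose checksum no longer matches."""
    actual = block_checksums(chunk)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


# 200 KB of data -> 4 blocks (the last one is partial).
data = bytes(200 * 1024)
sums = block_checksums(data)

# Simulate silent disk corruption inside block 1 (bytes 64 KB..128 KB).
damaged = bytearray(data)
damaged[70 * 1024] ^= 0xFF
print(find_corrupt_blocks(bytes(damaged), sums))  # [1]
```

Because the checksum is kept per 64 KB block rather than per 64 MB chunk, a read only has to verify the blocks it touches, and a mismatch pins the damage to a small range.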

# Designed to optimize multiple-reader/multiple-appender files/workloads

    * Crawlers atomically append to a file
    * Readers read file at the same time
    * No races/corruption issues
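
To make the "atomic append" point concrete, here is a single-process sketch. It is not GFS code: `AppendOnlyFile` and `crawler` are hypothetical names, and a `threading.Lock` stands in for whatever coordination the real system uses. The point it illustrates is that the appender hands over only the record, the file picks the offset, and each record lands as one unit.

```python
import threading


class AppendOnlyFile:
    """In-memory stand-in for a file that supports atomic record append."""

    def __init__(self) -> None:
        self._data = bytearray()
        self._lock = threading.Lock()

    def record_append(self, record: bytes) -> int:
        """Append the record as one unit and return the offset chosen for it."""
        with self._lock:
            offset = len(self._data)
            self._data.extend(record)
            return offset

    def read(self, offset: int, length: int) -> bytes:
        """Read a byte range while appends may still be in progress."""
        with self._lock:
            return bytes(self._data[offset:offset + length])


def crawler(log: AppendOnlyFile, name: str, results: list) -> None:
    """Hypothetical crawler: appends 100 records and remembers their offsets."""
    for i in range(100):
        record = f"{name}:{i:03d};".encode()
        results.append((log.record_append(record), record))


log = AppendOnlyFile()
results: list = []
threads = [
    threading.Thread(target=crawler, args=(log, f"c{n}", results))
    for n in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every record can be read back intact from the offset it was given.
intact = all(log.read(off, len(rec)) == rec for off, rec in results)
print(intact)
```

The order in which records from different crawlers interleave is not fixed, but no record is ever torn or overwritten.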

# Each logical chunk is replicated across 3+ boxes

# Processing tasks run on the same boxes as GFS

    * Keep that data close, baby

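Design simplifications:

(1) Centralized server model
A master/slave design: the master manages all metadata, and files are split into chunks stored on chunk servers. Every operation a client initiates must first go through the master.
Advantage: adding a chunk server is easy. It only has to register with the master; no information needs to be pushed to the other chunk servers.
Disadvantage: the master is a single point of failure. Mitigations: keep the metadata small, back up the master remotely, separate control traffic from data traffic, and so on.

(2) No data caching
Clients mostly read and write in a streaming fashion with little repeated access, so caching would do little for performance;
keeping a cache consistent with the actual data is an extremely complex problem;
the volume of data read is too large to cache in full.
Metadata on the master, however, is cached, because it is accessed frequently.

(3) Implemented in user space
rather than in kernel space, because it is simpler, easier to debug, and loosely coupled with the OS.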
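
The centralized model in point (1) can be sketched roughly as follows. Everything here is illustrative: the `Master` class, its method names, and the round-robin replica placement are assumptions made for the example, not the actual GFS interfaces. It shows the two properties the notes call out: a new chunk server only registers with the master, and a client asks the master which chunk (and which replicas) cover a given file offset.

```python
from dataclasses import dataclass, field

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks, as in the notes


@dataclass
class Master:
    """Toy master: holds all metadata, never the file data itself."""

    chunk_servers: list = field(default_factory=list)
    # file name -> list of (chunk handle, replica locations), in file order
    files: dict = field(default_factory=dict)
    next_handle: int = 0

    def register(self, server: str) -> None:
        """Adding a chunk server only touches the master's own state."""
        self.chunk_servers.append(server)

    def create_chunk(self, name: str, replicas: int = 3) -> int:
        """Allocate the next chunk of a file on `replicas` distinct servers."""
        if len(self.chunk_servers) < replicas:
            raise RuntimeError("not enough chunk servers registered")
        n = len(self.chunk_servers)
        start = self.next_handle % n  # round-robin placement (an assumption)
        locations = [self.chunk_servers[(start + k) % n] for k in range(replicas)]
        handle = self.next_handle
        self.next_handle += 1
        self.files.setdefault(name, []).append((handle, locations))
        return handle

    def locate(self, name: str, offset: int) -> tuple:
        """Translate (file, byte offset) into (chunk handle, replica locations)."""
        return self.files[name][offset // CHUNK_SIZE]


master = Master()
for server in ["cs0", "cs1", "cs2", "cs3"]:
    master.register(server)

master.create_chunk("crawl.log")  # covers bytes [0, 64 MB)
master.create_chunk("crawl.log")  # covers bytes [64 MB, 128 MB)

# A client reading at offset 70 MB asks the master where that chunk lives,
# then would fetch the data from one of the returned chunk servers.
print(master.locate("crawl.log", 70 * 1024 * 1024))
# (1, ['cs1', 'cs2', 'cs3'])
```

Only small metadata passes through the master in this sketch, which is the sense in which control traffic is kept apart from data traffic.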

Reference:
The Google File System

2. MapReduce
A programming model for distributed computation and parallel processing.

# Method for processing data in a massively parallel way

    * map(fn1, list) -> list
    * reduce(fn2, init, list) -> scalar
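
The two primitives above exist almost verbatim in Python, which makes for a quick sanity check of the signatures. Note that `functools.reduce` takes its arguments in the order (function, list, initial value), which differs from the notation used in these notes.

```python
from functools import reduce

words = ["gfs", "mapreduce", "chubby"]

# map(fn1, list) -> list: apply fn1 to every element independently.
lengths = list(map(len, words))

# reduce(fn2, init, list) -> scalar: fold the list into one value.
total = reduce(lambda acc, n: acc + n, lengths, 0)

print(lengths, total)  # [3, 9, 6] 18
```

Each `map` call is independent of the others, which is what allows the work to be spread over many machines.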

# Mappers are spawned for each GFS chunk of input and apply fn1 to each record in the chunk

    * Intermediate data is output into "partitions" (chosen by some hash function)

# Reducers aggregate mappers' intermediate data according to fn2

    * One reducer job per partition
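
The mapper / partition / reducer flow can be simulated in one process. This is a sketch under assumptions, not Google's implementation: `run_mapreduce`, `partition`, and the word-count functions are hypothetical names, the "chunks" are plain Python lists, and CRC-32 is used as the partitioning hash only because it is stable across runs.

```python
import zlib
from collections import defaultdict


def partition(key: str, num_partitions: int) -> int:
    """Pick the partition for a key with a stable hash (CRC-32, an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def run_mapreduce(chunks, map_fn, reduce_fn, num_partitions=2):
    # Map phase: one "mapper" per input chunk, applied record by record.
    # Each emitted (key, value) pair goes to the partition chosen by the hash.
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for chunk in chunks:
        for record in chunk:
            for key, value in map_fn(record):
                partitions[partition(key, num_partitions)][key].append(value)

    # Reduce phase: one "reducer" per partition aggregates the values per key.
    result = {}
    for part in partitions:
        for key, values in part.items():
            result[key] = reduce_fn(key, values)
    return result


def word_count_map(line):
    """fn1: emit (word, 1) for every word in a record."""
    for word in line.split():
        yield word.lower(), 1


def word_count_reduce(word, counts):
    """fn2: sum the counts collected for one word."""
    return sum(counts)


chunks = [["the quick fox", "the lazy dog"], ["the end"]]
counts = run_mapreduce(chunks, word_count_map, word_count_reduce)
print(counts["the"])  # 3
```

Since the same key always hashes to the same partition, every value for "the" ends up at a single reducer no matter which mapper produced it.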

# Use cases

    * grep-style jobs
    * log file analysis
    * reverse web-link graph
    * inverted index

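This programming model suits search, mining, analysis, and machine learning over massive (TB-scale and larger) data sets, both structured and unstructured.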
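
As a small worked example of one of the use cases listed above, the reverse web-link graph fits the map/reduce shape directly: the map step turns each (source, target) link into (target, source), and the reduce step collects all sources per target. The function names and the sample pages are made up for illustration, and everything runs in a single process.

```python
from collections import defaultdict


def map_links(source, targets):
    """Map step: for each outgoing link, emit (target, source)."""
    for target in targets:
        yield target, source


def reverse_link_graph(pages):
    """Group the emitted pairs by target, then reduce each group to a sorted list."""
    incoming = defaultdict(list)
    for source, targets in pages.items():
        for target, src in map_links(source, targets):
            incoming[target].append(src)
    return {target: sorted(sources) for target, sources in incoming.items()}


# Hypothetical input: page -> pages it links to.
pages = {"a.html": ["b.html", "c.html"], "b.html": ["c.html"]}
print(reverse_link_graph(pages))
# {'b.html': ['a.html'], 'c.html': ['a.html', 'b.html']}
```

An inverted index has the same structure, with (word, document) pairs in place of (target, source).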

Reference:
MapReduce: Simplified Data Processing on Large Clusters

3. Chubby
A distributed lock service that synchronizes concurrent operations in a distributed environment.
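
The notes give no detail on how Chubby works, so the following is only a toy picture of what "a lock service" means: named locks that at most one client can hold at a time. `LockService`, its methods, and the lock name are all hypothetical, and the whole thing lives in one process rather than being distributed or fault tolerant.

```python
import threading


class LockService:
    """Toy lock table: lock name -> client currently holding it."""

    def __init__(self) -> None:
        self._holders = {}
        self._mutex = threading.Lock()  # protects the table itself

    def try_acquire(self, name: str, client: str) -> bool:
        """Grant the named lock if nobody holds it; never blocks."""
        with self._mutex:
            if name in self._holders:
                return False
            self._holders[name] = client
            return True

    def release(self, name: str, client: str) -> bool:
        """Release the lock, but only if this client is the holder."""
        with self._mutex:
            if self._holders.get(name) != client:
                return False
            del self._holders[name]
            return True


# Two workers race to become the primary; only the first one succeeds.
service = LockService()
first = service.try_acquire("primary-election", "worker-a")
second = service.try_acquire("primary-election", "worker-b")
print(first, second)  # True False
```

Electing a single primary among several candidates, as in the usage lines, is a typical way such a service gets used.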

Reference:
The Chubby lock service for loosely-coupled distributed systems

4. Bigtable
A distributed storage system for structured data that manages and organizes massive data sets.
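
The notes stop at a one-line description, so as a rough mental model only: the Bigtable paper referenced below describes the data as a sorted map from (row key, column key, timestamp) to a value. The sketch below imitates that shape in memory. `TinyTable` and its methods are hypothetical names, and nothing here is distributed or persistent.

```python
class TinyTable:
    """In-memory imitation of a (row, column, timestamp) -> value map."""

    def __init__(self) -> None:
        # (row, column) -> list of (timestamp, value), newest first
        self._cells = {}

    def put(self, row: str, column: str, timestamp: int, value: str) -> None:
        """Store one version of a cell."""
        versions = self._cells.setdefault((row, column), [])
        versions.append((timestamp, value))
        versions.sort(key=lambda v: v[0], reverse=True)

    def get(self, row: str, column: str):
        """Return the most recent value of a cell, or None if it is absent."""
        versions = self._cells.get((row, column))
        return versions[0][1] if versions else None

    def scan(self, start_row: str, end_row: str):
        """Yield (row, column, latest value) for start_row <= row < end_row, in key order."""
        for row, column in sorted(self._cells):
            if start_row <= row < end_row:
                yield row, column, self.get(row, column)


table = TinyTable()
table.put("com.example.www", "contents:", 1, "<html>v1</html>")
table.put("com.example.www", "contents:", 2, "<html>v2</html>")
table.put("org.example", "contents:", 1, "<html>other</html>")

print(table.get("com.example.www", "contents:"))  # <html>v2</html>
print(list(table.scan("com.", "com.~")))
# [('com.example.www', 'contents:', '<html>v2</html>')]
```

Keeping rows in sorted key order is what makes range scans over related rows (here, everything under the "com." prefix) cheap to express.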

Reference:
Bigtable: A Distributed Storage System for Structured Data