GFS

Architecture in a nutshell

 

Chunk:

  1. Each chunk is identified by an immutable, globally unique 64-bit chunk handle assigned by the master at the time of chunk creation.

  2. Each chunk is replicated on multiple chunk servers, three by default.

  3. Chunk size is fixed at 64 MB.
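
As a rough illustration, here is a minimal Go sketch of a chunk from the master's point of view (the type and field names are invented for this note; only the 64-bit handle, the 64 MB size, and the three-replica default come from the paper):

```go
package main

import "fmt"

const ChunkSizeBytes = 64 << 20 // 64 MB fixed chunk size

// ChunkHandle is the immutable, globally unique 64-bit id the master
// assigns when the chunk is created.
type ChunkHandle uint64

// ChunkInfo is the master's view of one chunk: its handle plus the
// chunk servers currently holding a replica (three by default).
type ChunkInfo struct {
	Handle   ChunkHandle
	Replicas []string // chunk server addresses, e.g. "cs1:9000"
}

func main() {
	c := ChunkInfo{
		Handle:   0x1A2B3C4D5E6F7081,
		Replicas: []string{"cs1:9000", "cs2:9000", "cs3:9000"},
	}
	fmt.Printf("chunk %x has %d replicas, each at most %d bytes\n",
		uint64(c.Handle), len(c.Replicas), ChunkSizeBytes)
}
```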

Why 64MB:

Pros:

  1. It reduces clients' need to interact with the master: many reads and writes land on the same chunk, so the client only has to ask the master once per chunk. It also reduces network overhead, since the client can keep a persistent connection to the chunk server.
  2. It reduces the amount of metadata the master has to keep.

Cons:

  1. Hotspots: a small file occupies only one chunk, so many clients reading that file all hit the same few chunk servers.

Metadata: (Both in Memory & Operation Log on disk)

  1. The file and chunk namespace, stored in a tree-like structure
  2. The mapping from files to chunks

Keeping this metadata in memory makes it easy and efficient for the master to scan its entire state periodically, e.g. for chunk rebalancing and garbage collection.

The master keeps less than 64 bytes of metadata per 64 MB chunk, and the namespace is stored compactly using prefix compression.
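
A toy sketch of the two pieces of metadata that are both held in memory and made durable through the operation log (plain Go maps stand in for the real prefix-compressed namespace table; all names here are hypothetical):

```go
package main

import "fmt"

type ChunkHandle uint64

// MasterMetadata holds the metadata that is kept in memory AND made
// durable via the operation log: the namespace and the file-to-chunk mapping.
type MasterMetadata struct {
	// Namespace: full pathname -> per-file attributes. The paper stores
	// this as a lookup table with prefix compression; a plain map stands
	// in for it here.
	Namespace map[string]struct{}

	// FileChunks: full pathname -> ordered list of chunk handles.
	// Chunk index i covers byte range [i*64MB, (i+1)*64MB).
	FileChunks map[string][]ChunkHandle
}

func main() {
	md := MasterMetadata{
		Namespace:  map[string]struct{}{"/logs/web.0": {}},
		FileChunks: map[string][]ChunkHandle{"/logs/web.0": {101, 102, 103}},
	}
	fmt.Println("chunks of /logs/web.0:", md.FileChunks["/logs/web.0"])
}
```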

Metadata: (Only in Memory)

  1. Mapping from chunks to chunk servers

The master polls chunk servers for chunk locations at startup and keeps the mapping up to date via heartbeat messages.

Chunk servers come and go constantly, so keeping this mapping consistent on disk would require updates that are too frequent and costly.
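
A sketch of how the chunk-to-server mapping can live purely in memory and be rebuilt from chunk server reports (hypothetical types; the real master also handles versions, failures, etc.):

```go
package main

import "fmt"

type ChunkHandle uint64

// locations maps chunk handle -> chunk servers that reported holding it.
// It lives only in the master's memory and is reconstructed from
// chunk server reports, never persisted.
type locations map[ChunkHandle][]string

// applyHeartbeat replaces everything previously reported by one chunk
// server with its latest chunk list.
func (loc locations) applyHeartbeat(server string, chunks []ChunkHandle) {
	// Drop the server's old entries.
	for h, servers := range loc {
		kept := servers[:0]
		for _, s := range servers {
			if s != server {
				kept = append(kept, s)
			}
		}
		loc[h] = kept
	}
	// Add the freshly reported chunks.
	for _, h := range chunks {
		loc[h] = append(loc[h], server)
	}
}

func main() {
	loc := locations{}
	loc.applyHeartbeat("cs1:9000", []ChunkHandle{101, 102})
	loc.applyHeartbeat("cs2:9000", []ChunkHandle{101})
	fmt.Println("chunk 101 is on", loc[101])
}
```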

 

Operation Log & Checkpoint

  1. The log contains a historical record of changes to critical metadata (the namespace and file-to-chunk mappings). It also serves as a logical timeline that defines the order of concurrent operations.
  2. The log is replicated on remote machines, and a record is flushed to disk locally and remotely before the corresponding operation's response is sent back to the client.
  3. Log records are batched before flushing whenever possible to reduce disk I/O.
  4. The log is periodically compacted into a checkpoint (a compact B-tree-like form) to keep recovery time short.
  5. Checkpoints are built in a separate thread while a freshly switched-to log file continues to serve traffic.
  6. Checkpoints are replicated on remote machines as well. Only the latest complete checkpoint and the log files after it need to be kept.
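
A simplified sketch of the "log, flush, then respond" rule with batched flushes (the record format and file handling are assumptions; remote log replication and checkpointing are omitted):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
)

// opLog is a minimal write-ahead operation log: every metadata mutation
// is appended and flushed to disk before the client gets a response.
type opLog struct {
	w *bufio.Writer
	f *os.File
}

// Append batches the record in the bufio.Writer; Flush forces it (and
// anything batched with it) to disk. The real master would also wait for
// remote log replicas before acknowledging.
func (l *opLog) Append(record string) error {
	_, err := fmt.Fprintln(l.w, record)
	return err
}

func (l *opLog) Flush() error {
	if err := l.w.Flush(); err != nil {
		return err
	}
	return l.f.Sync() // make the batch durable before replying to clients
}

func main() {
	f, err := os.CreateTemp("", "gfs-oplog-*.log")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())
	log := &opLog{w: bufio.NewWriter(f), f: f}

	// Several concurrent mutations can be batched into one flush.
	_ = log.Append(`{"op":"create","path":"/logs/web.0"}`)
	_ = log.Append(`{"op":"add_chunk","path":"/logs/web.0","handle":101}`)
	if err := log.Flush(); err != nil {
		panic(err)
	}
	fmt.Println("mutations are durable; safe to respond to clients")
}
```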

 

System Interaction:

Lease:

  1. For each chunk mutation, the master grants a lease to one of the chunk servers holding a replica of the chunk. This server becomes the primary.
  2. Each lease has a timeout but can be extended via requests piggybacked on the regular heartbeat messages.
  3. The master grants a chunk lease to one of the replicas, which we call the primary. The primary picks a serial order for all mutations to the chunk, and all replicas follow this order when applying mutations.
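
A minimal sketch of lease bookkeeping on the master, assuming the 60-second lease duration mentioned in the write path below (all type names are invented):

```go
package main

import (
	"fmt"
	"time"
)

type ChunkHandle uint64

// lease records which replica is currently primary for a chunk and
// until when. Extensions are piggybacked on heartbeats.
type lease struct {
	Primary string
	Expires time.Time
}

type leaseTable map[ChunkHandle]*lease

// grant hands the lease to one replica if no unexpired lease exists,
// otherwise it returns the current primary.
func (t leaseTable) grant(h ChunkHandle, replicas []string, now time.Time) string {
	if l, ok := t[h]; ok && now.Before(l.Expires) {
		return l.Primary
	}
	l := &lease{Primary: replicas[0], Expires: now.Add(60 * time.Second)}
	t[h] = l
	return l.Primary
}

// extend renews the lease when the primary asks via heartbeat.
func (t leaseTable) extend(h ChunkHandle, primary string, now time.Time) {
	if l, ok := t[h]; ok && l.Primary == primary {
		l.Expires = now.Add(60 * time.Second)
	}
}

func main() {
	t := leaseTable{}
	now := time.Now()
	p := t.grant(101, []string{"cs1:9000", "cs2:9000", "cs3:9000"}, now)
	t.extend(101, p, now.Add(50*time.Second))
	fmt.Println("primary for chunk 101:", p)
}
```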

Read Operation:

  1. The client translates the file name and byte offset specified by the application into a chunk index within the file (using the fixed 64 MB chunk size).
  2. The client sends the master a request containing the file name and chunk index.
  3. The master replies with the chunk handle and the locations of all of that chunk's replicas (all three by default).
  4. The client then sends a request to the closest replica. The request specifies the chunk handle and a byte range within that chunk.

In step 3, the master may also return information for chunks immediately following those requested, which reduces future round trips to the master.
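
Steps 1 and 4 are just arithmetic on the fixed 64 MB chunk size; a small sketch (the numbers are only an example):

```go
package main

import "fmt"

const chunkSize = 64 << 20 // 64 MB

// chunkIndex converts an application byte offset into the chunk index
// the client sends to the master (step 1).
func chunkIndex(offset int64) int64 {
	return offset / chunkSize
}

// offsetInChunk is the start of the byte range the client sends to the
// chunk server in step 4.
func offsetInChunk(offset int64) int64 {
	return offset % chunkSize
}

func main() {
	// Example: the application asks for data at byte offset 200,000,000.
	off := int64(200_000_000)
	fmt.Printf("ask the master for chunk index %d of the file\n", chunkIndex(off))
	fmt.Printf("then read from byte %d within that chunk at the closest replica\n",
		offsetInChunk(off))
}
```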

 

Write Operation:

  1. The master grants a lease to one of the replicas, called the primary (typically for 60 seconds). Leases are renewed via the periodic heartbeat messages between the master and chunk servers.
  2. The client asks the master for the primary and secondary replicas of the chunk, and caches this information locally.
  3. To avoid network bottlenecks and high-latency links, the client pushes the data only to the nearest replica (not necessarily the primary). The data is then pipelined from replica to replica in network-topology order so every machine's bandwidth is fully utilized.
  4. After all replicas acknowledge receiving the data, the client sends the write request to the primary. The primary assigns consecutive serial numbers to all mutations it receives, which provides the serialization order.
  5. The primary forwards the write request and its serial numbers to the secondary replicas, which apply the mutations in exactly that order.
  6. The secondaries reply to the primary when they are done.
  7. The primary replies to the client.
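
A toy sketch of the ordering role the primary plays in steps 4-7 (data transfer, failure handling, and the actual disk writes are left out; all names are invented):

```go
package main

import "fmt"

// mutation is one client write whose data has already been pushed to
// every replica's buffer (steps 2-3) and is now being ordered.
type mutation struct {
	ClientID string
	Data     []byte
}

// primary assigns consecutive serial numbers to mutations, which is the
// order every replica (itself included) must apply them in (steps 4-5).
type primary struct {
	nextSerial  uint64
	secondaries []string
}

func (p *primary) applyWrite(m mutation) uint64 {
	p.nextSerial++
	serial := p.nextSerial
	// Apply locally, then forward the serial number to the secondaries,
	// which apply mutations strictly in serial-number order (step 5).
	fmt.Printf("primary applies %q from %s as serial %d\n", m.Data, m.ClientID, serial)
	for _, s := range p.secondaries {
		fmt.Printf("  forward serial %d to %s\n", serial, s)
	}
	return serial
}

func main() {
	p := &primary{secondaries: []string{"cs2:9000", "cs3:9000"}}
	p.applyWrite(mutation{ClientID: "clientA", Data: []byte("hello")})
	p.applyWrite(mutation{ClientID: "clientB", Data: []byte("world")})
	// After all secondaries acknowledge (step 6), the primary replies to
	// the client (step 7).
}
```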

Append Operation:

  1. The client pushes the data to all replicas of the last chunk of the file.

  2. The client sends the append request to the primary.

  3. If the record fits in the current last chunk, the primary appends the data to its own replica, tells the secondaries to write it at the same byte offset in theirs, and replies to the client with success.

  4. If the record doesn't fit in the current last chunk, the primary pads the current chunk to its full size, instructs the other replicas to do the same, and replies to the client with "retry on the next chunk".

  5. Retrying after a failure can leave duplicate records on some replicas, so record append is at-least-once.
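
A sketch of the primary's fits-or-pad decision in steps 3-4, under the simplifying assumption that the primary just tracks how many bytes of the last chunk are used:

```go
package main

import "fmt"

const chunkSize = 64 << 20 // 64 MB

// appendResult mirrors the primary's two possible answers to a record
// append: the offset the record was written at, or "retry on the next
// chunk" after the current chunk was padded out.
type appendResult struct {
	Offset         int64
	RetryNextChunk bool
}

// recordAppend is the primary's decision in steps 3-4: if the record
// fits in the current last chunk, append it at the chunk's current end
// on every replica; otherwise pad the chunk and tell the client to retry.
func recordAppend(chunkUsed int64, record []byte) appendResult {
	if chunkUsed+int64(len(record)) <= chunkSize {
		return appendResult{Offset: chunkUsed} // same offset on all replicas
	}
	// Pad [chunkUsed, chunkSize) on all replicas, then make the client retry.
	return appendResult{RetryNextChunk: true}
}

func main() {
	rec := make([]byte, 1<<20) // a 1 MB record
	fmt.Printf("%+v\n", recordAppend(10<<20, rec))        // fits
	fmt.Printf("%+v\n", recordAppend(chunkSize-100, rec)) // pad and retry
	// A client that times out and retries may append the record twice,
	// which is why GFS record append is at-least-once.
}
```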

 

Snapshot

  1. The master receives a snapshot request.
  2. The master revokes all outstanding leases on the chunks of the files to be snapshotted.
  3. The master waits until those leases are revoked (or expire), then logs the snapshot operation to disk.
  4. The master duplicates the metadata so that the newly created snapshot files point to the same chunks as the source files.
  5. The master uses copy-on-write. When a client later needs to write to a shared chunk C, the master tells each chunk server holding C to copy it locally to a new chunk C'. The file being written is then pointed at C' (and leased as usual) so the new data goes to C', while the snapshot continues to reference C.
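
A toy model of this flow using reference counts, which is how the master knows a chunk is shared and must be copied before the next write (invented types; lease handling and the actual chunk-server copies are omitted):

```go
package main

import "fmt"

type ChunkHandle uint64

// files maps pathname -> chunk handles; refCount tracks how many files
// reference each chunk, which is what makes copy-on-write detectable.
type master struct {
	files    map[string][]ChunkHandle
	refCount map[ChunkHandle]int
	next     ChunkHandle
}

// snapshot duplicates only metadata: the snapshot file shares every
// chunk with the source, and each shared chunk's reference count rises.
func (m *master) snapshot(src, dst string) {
	chunks := append([]ChunkHandle(nil), m.files[src]...)
	m.files[dst] = chunks
	for _, h := range chunks {
		m.refCount[h]++
	}
}

// writeChunk is the copy-on-write step: if the chunk is shared, the
// master has the chunk servers copy it locally to a new handle, and the
// written file switches to the copy while the snapshot keeps the original.
func (m *master) writeChunk(path string, idx int) ChunkHandle {
	h := m.files[path][idx]
	if m.refCount[h] <= 1 {
		return h
	}
	m.next++
	newH := m.next
	m.refCount[h]--
	m.refCount[newH] = 1
	m.files[path][idx] = newH // writes now go to the local copy
	return newH
}

func main() {
	m := &master{
		files:    map[string][]ChunkHandle{"/db/current": {101, 102}},
		refCount: map[ChunkHandle]int{101: 1, 102: 1},
		next:     200,
	}
	m.snapshot("/db/current", "/db/snap-2024-01-01")
	fmt.Println("write goes to chunk", m.writeChunk("/db/current", 0))
	fmt.Println("snapshot still reads chunk", m.files["/db/snap-2024-01-01"][0])
}
```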

Master Operation

Chunk Creation

  1. New chunks are placed on chunk servers with below-average disk space utilization.
  2. The master tends not to place a new chunk on a server with many recent chunk creations, since creation usually predicts imminent heavy write traffic.
  3. For each chunk, no more than one replica is placed on the same chunk server.
  4. For each chunk, no more than two replicas are placed on the same rack.
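
A sketch of a placement policy that respects these four rules (the scoring is my own guess at "below-average usage, few recent creations"; the real policy is not specified in this much detail):

```go
package main

import (
	"fmt"
	"sort"
)

// server describes one chunk server as the master sees it when choosing
// where to place a new chunk's replicas.
type server struct {
	Addr            string
	Rack            string
	DiskUtilization float64 // fraction of disk used
	RecentCreations int     // creations in the last few minutes
}

// placeReplicas picks n servers for a new chunk: prefer low disk
// utilization and few recent creations, never reuse a server, and allow
// at most two replicas per rack.
func placeReplicas(servers []server, n int) []string {
	cands := append([]server(nil), servers...)
	sort.Slice(cands, func(i, j int) bool {
		if cands[i].DiskUtilization != cands[j].DiskUtilization {
			return cands[i].DiskUtilization < cands[j].DiskUtilization
		}
		return cands[i].RecentCreations < cands[j].RecentCreations
	})

	var picked []string
	perRack := map[string]int{}
	for _, s := range cands {
		if len(picked) == n {
			break
		}
		if perRack[s.Rack] >= 2 {
			continue // keep replicas spread across racks
		}
		perRack[s.Rack]++
		picked = append(picked, s.Addr)
	}
	return picked
}

func main() {
	servers := []server{
		{"cs1:9000", "rackA", 0.30, 5},
		{"cs2:9000", "rackA", 0.35, 0},
		{"cs3:9000", "rackA", 0.20, 1},
		{"cs4:9000", "rackB", 0.60, 0},
	}
	fmt.Println(placeReplicas(servers, 3)) // [cs3 cs1 cs4]: at most two on rackA
}
```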

Chunk Replication Scenario

  1. A chunk server goes down.
  2. A replica is corrupted (bad checksum).
  3. The user-specified replication goal is increased.

Chunk Replication Priority

  1. Chunks whose replica count is farthest from the user-specified goal
  2. Chunks that are blocking client requests
  3. Chunks belonging to live files rather than recently deleted ones

Chunk Rebalancing

  1. The master prefers to move replicas off chunk servers with below-average free disk space and onto new or underutilized servers.
  2. Rebalancing is rate-limited, so a new server is filled up gradually rather than swamped with write traffic.
  3. Rebalancing never decreases a chunk's replica count or the number of distinct racks holding it.

Garbage Collection

  1. The master doesn't immediately delete a file after receiving a delete request.
  2. Instead, it renames the file to a hidden name that includes the deletion timestamp.
  3. The master periodically (during idle time) scans the namespace and removes the metadata of hidden files older than a configurable number of days.
  4. The master also scans for orphaned chunks (chunks no longer reachable from any file) and erases their metadata.
  5. In heartbeat exchanges, each chunk server reports the chunks it holds, and the master piggybacks back the list of chunks that can be deleted.
  6. To reclaim space quickly, if an already-deleted file is explicitly deleted again, its chunks are reclaimed right away.
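
A sketch of the lazy-deletion idea: delete is just a rename to a hidden, timestamped name, and a later background scan reclaims the metadata (the hidden-name format and the 3-day grace period are assumptions; the interval is configurable):

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

const gracePeriod = 3 * 24 * time.Hour // assumed default; the interval is configurable

// deleteFile only renames: the file becomes a hidden name carrying the
// deletion timestamp, e.g. "/logs/.deleted.web.0.1700000000".
func deleteFile(namespace map[string]time.Time, path string, now time.Time) {
	delete(namespace, path)
	hidden := fmt.Sprintf("%s/.deleted.%s.%d",
		path[:strings.LastIndex(path, "/")],
		path[strings.LastIndex(path, "/")+1:], now.Unix())
	namespace[hidden] = now
}

// gcScan is the master's periodic background pass: hidden files older
// than the grace period finally have their metadata removed, which is
// what lets their chunks become orphaned and be reclaimed.
func gcScan(namespace map[string]time.Time, now time.Time) {
	for name, deletedAt := range namespace {
		if strings.Contains(name, "/.deleted.") && now.Sub(deletedAt) > gracePeriod {
			delete(namespace, name)
			fmt.Println("reclaiming", name)
		}
	}
}

func main() {
	ns := map[string]time.Time{"/logs/web.0": {}}
	now := time.Now()
	deleteFile(ns, "/logs/web.0", now.Add(-4*24*time.Hour)) // deleted 4 days ago
	gcScan(ns, now)
}
```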

Stale Replica Detection

  1. A version number is persisted with each chunk, both on the master and on every replica.
  2. The master increases a chunk's version number whenever it grants a new lease on the chunk. A replica that is down at that time misses the version bump and is detected as stale later.
  3. When a chunk server restarts, it reports its chunk list, including version numbers, to the master, which lets the master detect stale replicas.
  4. The master marks stale replicas and removes them through the regular garbage collection mechanism.
  5. Stale replicas are never returned to clients: the master simply excludes them when answering chunk location requests.
  6. The master also includes the chunk version number when it tells clients which servers hold a chunk, or when it instructs a chunk server to clone a chunk from another during migration. Clients and chunk servers verify the version, preventing stale reads at almost no cost.
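
A sketch of the version check the master can run against a restarting chunk server's report (types invented; the "adopt the higher version" branch follows the paper's handling of a lease grant the master failed to record):

```go
package main

import "fmt"

type ChunkHandle uint64

// masterVersions is the authoritative chunk version table; it is bumped
// whenever the master grants a new lease on a chunk.
type masterVersions map[ChunkHandle]uint64

// checkReport processes one entry of a restarting chunk server's chunk
// report and classifies the replica.
func (m masterVersions) checkReport(h ChunkHandle, reported uint64) string {
	switch current := m[h]; {
	case reported < current:
		return "stale: schedule for garbage collection, never serve to clients"
	case reported > current:
		// The master missed a grant (e.g. it crashed after granting the
		// lease); the paper has the master adopt the higher version.
		m[h] = reported
		return "master version updated"
	default:
		return "up to date"
	}
}

func main() {
	m := masterVersions{101: 7}
	fmt.Println(m.checkReport(101, 6)) // replica missed a lease grant while down
	fmt.Println(m.checkReport(101, 7)) // current replica
}
```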

Fault tolerance & diagnosis

High Availability (Master)

  1. The master's operation log is replicated both remotely and locally.
  2. For simplicity, the master is a single process that handles all mutations and background activities such as garbage collection. This process is restarted automatically when it dies.
  3. For disk or machine failures, monitoring infrastructure outside GFS starts a new master process on another machine using the replicated operation log.
  4. Clients reach the master through a canonical name (a DNS alias / VIP), so they don't depend on a specific machine's IP address.
  5. Shadow masters provide read-only access to the file system even when the primary master is down. A shadow master applies the same operation log and heartbeats with chunk servers periodically, but may lag slightly behind the primary.

High Availability (Chunk)

  1. Chunks are replicated on multiple chunk servers on different racks.
  2. Master clones chunks as chunk servers go offline or chunks get corrupted.
  3. GFS can also use parity or erasure coding for redundancy instead of plain replication.

Data Integrity (Write)

  1. Each chunk is divided into 64 KB blocks, and each block has a 32-bit checksum.
  2. Checksums are kept in memory at the chunk server and persisted with logging, separate from user data.
  3. Chunk servers periodically scan and verify inactive chunks so corruption is detected even in replicas that are rarely read.

Data Integrity (Read)

  1. Checksum verification happens in the chunk server's read path and can largely be overlapped with I/O, so it adds little overhead.
  2. On a checksum mismatch, an error is returned to the requester, which then reads from another replica and reports the corruption to the master. The master clones the chunk from a healthy replica onto another server, and the corrupted replica is garbage collected once the clone completes.
  3. Reads are aligned at checksum block boundaries, which makes verification cheap.
  4. The checksum is returned to the client along with the data stream.
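
A sketch of read-path verification over 64 KB checksum blocks; CRC-32 stands in for the unspecified 32-bit checksum, and an in-memory byte slice stands in for the on-disk chunk:

```go
package main

import (
	"fmt"
	"hash/crc32"
)

const blockSize = 64 << 10 // 64 KB checksum blocks within a chunk

// verifyRead checks every 64 KB block that overlaps the requested byte
// range before any data is returned; a mismatch means this replica is
// corrupted and the reader should go to another replica.
func verifyRead(chunk []byte, checksums []uint32, off, length int) error {
	first := off / blockSize
	last := (off + length - 1) / blockSize
	for b := first; b <= last; b++ {
		end := (b + 1) * blockSize
		if end > len(chunk) {
			end = len(chunk)
		}
		if crc32.ChecksumIEEE(chunk[b*blockSize:end]) != checksums[b] {
			return fmt.Errorf("checksum mismatch in block %d: read from another replica", b)
		}
	}
	return nil
}

func main() {
	chunk := make([]byte, 3*blockSize)
	checksums := []uint32{
		crc32.ChecksumIEEE(chunk[0:blockSize]),
		crc32.ChecksumIEEE(chunk[blockSize : 2*blockSize]),
		crc32.ChecksumIEEE(chunk[2*blockSize : 3*blockSize]),
	}
	fmt.Println(verifyRead(chunk, checksums, 100_000, 50_000)) // ok: blocks 1-2 verify

	chunk[130_000] ^= 0xFF // simulate on-disk corruption
	fmt.Println(verifyRead(chunk, checksums, 100_000, 50_000))
}
```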

 
