Filecoin Distributed Storage Selection: GlusterFS (2022, for reference)
About
This is a technology-selection document from 2022. Although the title mentions Filecoin, it remains a useful reference for large-file storage in general, so it is shared here.
Basic Requirements
- Single-machine storage solutions (RAID, ZFS, LVM, NFS) are out of scope
- Closed-source and commercial storage solutions are out of scope
- File storage only; object storage solutions are out of scope
- Weigh community activity, documentation, and ease of further development (implementation language)
- Weigh high availability, complexity, scalability, performance, and operational burden
Reference Lists
- Open-source distributed storage (https://en.wikipedia.org/wiki/Comparison_of_distributed_file_systems)
- Commercial distributed storage (https://en.wikipedia.org/wiki/Comparison_of_distributed_file_systems)
- OpenStack Manila backend drivers (Manila is OpenStack's shared file system project)
Candidates
- CephFS + NFS-Ganesha: complex architecture; deployment and operations are demanding
- GlusterFS: decentralized (no metadata server); simple architecture, easy to deploy and operate, well documented; written in C (2005)
- HDFS: centralized; metadata (NameNode) high availability depends on QJM or NFS; written in Java
- JuiceFS: a storage gateway; holds no data itself, proxying and translating requests to third-party storage
- Lustre: centralized; complex architecture; incomplete erasure-coding support
- MinIO: distributed object storage, excluded by the file-storage requirement
- MooseFS: centralized; simple architecture, but the free edition lacks metadata HA and erasure coding
- SeaweedFS: centralized (built-in leader election); suited to massive numbers of small files; written in Go (2015)
- FastDFS: developed in China
Preliminary Choice: GlusterFS
Architecture: OpenSDS
An open, autonomous data platform that aims to address the key data-management pain points of storage vendors and end users in a common, standardized way.
Architecture: CephFS + NFS-Ganesha (high operational burden)
Architecture: HDFS (single point of failure; Java)
Architecture: JuiceFS (storage gateway)
Architecture: GlusterFS (the preliminary choice)
Ceph vs. GlusterFS performance comparison (for reference only; to be verified with our own benchmarks)
Performance Evaluations of Distributed File Systems for Scientific Big Data in FUSE Environment(https://www.mdpi.com/2079-9292/10/12/1471)
GlusterFS: Overview
- Started in 2005; a classic distributed file system (by Anand Babu Periasamy, who later founded MinIO)
- Simple architecture; low deployment and operations burden
- Decentralized: no metadata server, so no single point of failure
- Scales to multiple PB and up to ~750 managed disks
- High performance (multi-process on the server side, multi-threaded on the client side)
- Provides object, block, and file interfaces
- Mountable (FUSE) and transparent to applications; retries after disconnects and fails over automatically
- Written in C; easy to extend; simple to compile and build
GlusterFS: Process Model
- gluster: the CLI tool; parses command-line arguments and forwards each command to glusterd for execution.
- glusterd: the management daemon; executes commands sent by gluster and handles cluster membership, volume management, brick management, rebalancing, snapshots, and so on. Cluster, volume, and snapshot state is kept as configuration files on the servers; when a client mounts a volume, glusterd sends it the volume's configuration file (volfile).
- glusterfsd: the server-side (brick) daemon; one glusterfsd process runs per brick in a volume. It serves client read/write requests against the disk backing its brick and returns the results to the client.
- glusterfs: the client daemon; mounts a volume from one of the servers and presents it to the user as a directory. On reads and writes under that directory, the client uses the volume configuration obtained from glusterd to locate the target brick via the DHT algorithm, sends the data to that brick over InfiniBand RDMA or TCP/IP, and returns the result once the brick completes. Replication, striping, hashing, and erasure-coding logic all run on the client (multi-threaded).
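On a live node, these roles are easy to tell apart; a minimal sanity-check sketch (process names as shipped by the Ubuntu packages used below):
# management daemon: one per server
pgrep -a glusterd
# brick server: one glusterfsd per brick
pgrep -a glusterfsd
# FUSE client: one glusterfs per mount
pgrep -fa "glusterfs --process-name fuse"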
GlusterFS: Volume Types (example create commands after the list)
- Distributed (like RAID 0)
- Replicated (like RAID 1)
- Distributed-Replicated
- Dispersed (erasure-coded, like RAID 5)
- Distributed-Dispersed
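Hedged one-line examples of creating each type (hostnames and brick paths are this lab's placeholders; the syntax follows the volume create help reproduced further below):
# distributed (like RAID 0): the default when no type keyword is given
gluster volume create gv_dist node01:/export/sdb1/brick node02:/export/sdb1/brick
# replicated (like RAID 1): three full copies of every file
gluster volume create gv_rep replica 3 node01:/export/sdb1/brick node02:/export/sdb1/brick node03:/export/sdb1/brick
# dispersed (erasure-coded, like RAID 5): 2 data + 1 redundancy per subvolume
gluster volume create gv_disp disperse 3 redundancy 1 node01:/export/sdb1/brick node02:/export/sdb1/brick node03:/export/sdb1/brick
# distributed-replicated / distributed-dispersed: pass a multiple of the subvolume size in bricks (as gv0 does below)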
GlusterFS: Install, Configure, Use (Ubuntu 22.04, Distributed-Dispersed)
- Install: https://docs.gluster.org/en/latest/Install-Guide/Install/
- Configure: https://docs.gluster.org/en/latest/Install-Guide/Configure/
Configure hostname resolution
# vim /etc/hosts
192.168.134.201 node01
192.168.134.202 node02
192.168.134.203 node03
Install the service
apt install software-properties-common
add-apt-repository ppa:gluster/glusterfs-10
apt update
apt install glusterfs-server
systemctl enable glusterd.service
systemctl enable glustereventsd.service
systemctl start glusterd.service
systemctl start glustereventsd.service
Process info
root@jiangsjx:~# ps -ef | grep gluster
root 14193 1 0 17:07 ? 00:00:00 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
root 14247 1 0 17:08 ? 00:00:00 /usr/bin/python3 /usr/sbin/glustereventsd --pid-file /var/run/glustereventsd.pid
root 14262 14247 0 17:08 ? 00:00:00 /usr/bin/python3 /usr/sbin/glustereventsd --pid-file /var/run/glustereventsd.pid
Form the cluster (trusted pool)
root@jiangsjx:~# gluster peer status
Number of Peers: 0
root@jiangsjx:~# gluster peer probe node02
peer probe: success
root@jiangsjx:~# gluster peer probe node03
peer probe: success
root@jiangsjx:~# gluster peer status
Number of Peers: 2
Hostname: node02
Uuid: aae003e0-1f7b-48ca-91c3-9847724aa9c1
State: Peer in Cluster (Connected)
Hostname: node03
Uuid: b6623bfd-8f02-472d-8add-61595693130f
State: Peer in Cluster (Connected)
Disk initialization
# create a single partition spanning the disk (interactive); repeat for /dev/sdc and on every node
fdisk /dev/sdb
mkfs.xfs -i size=512 /dev/sdb1
echo "/dev/sdb1 /export/sdb1 xfs defaults 0 0" >> /etc/fstab
mkdir -p /export/sdb1 && mount -a && mkdir -p /export/sdb1/brick
Create the volume (disperse 3 groups every 3 bricks into a subvolume of 2 data + 1 redundancy; the 6 bricks below form 2 such subvolumes)
root@jiangsjx:~# gluster volume create gv0 disperse 3 node01:/export/sdb1/brick node02:/export/sdb1/brick node03:/export/sdb1/brick node01:/export/sdc1/brick node02:/export/sdc1/brick node03:/export/sdc1/brick force
volume create: gv0: success: please start the volume to access data
root@jiangsjx:~# gluster volume info
Volume Name: gv0
Type: Distributed-Disperse
Volume ID: aaea30ea-b937-4391-a9aa-48008692eee7
Status: Created
Snapshot Count: 0
Number of Bricks: 2 x (2 + 1) = 6
Transport-type: tcp
Bricks:
Brick1: node01:/export/sdb1/brick
Brick2: node02:/export/sdb1/brick
Brick3: node03:/export/sdb1/brick
Brick4: node01:/export/sdc1/brick
Brick5: node02:/export/sdc1/brick
Brick6: node03:/export/sdc1/brick
Options Reconfigured:
storage.fips-mode-rchecksum: on
transport.address-family: inet
nfs.disable: on
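Reading the geometry: "2 x (2 + 1) = 6" means two dispersed subvolumes of 2 data + 1 redundancy bricks each, so usable capacity is 2/3 of raw. Assuming ~2G per brick partition (not stated above, but consistent with the df output after mounting below):
# usable = brick_count x brick_size x data/(data+redundancy)
echo $((6 * 2 * 2 / 3))G   # -> 8G, matching df's 8.0G Size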
Start the volume
root@jiangsjx:~# gluster volume start gv0
volume start: gv0: success
# Process info
root@jiangsjx:~# ps -ef | grep gluster
root 14193 1 0 17:07 ? 00:00:00 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
root 14247 1 0 17:08 ? 00:00:00 /usr/bin/python3 /usr/sbin/glustereventsd --pid-file /var/run/glustereventsd.pid
root 14262 14247 0 17:08 ? 00:00:00 /usr/bin/python3 /usr/sbin/glustereventsd --pid-file /var/run/glustereventsd.pid
root 14579 1 0 17:30 ? 00:00:00 /usr/sbin/glusterfsd -s node01 --volfile-id gv0.node01.export-sdb1-brick -p /var/run/gluster/vols/gv0/node01-export-sdb1-brick.pid -S /var/run/gluster/5f377592500512a3.socket --brick-name /export/sdb1/brick -l /var/log/glusterfs/bricks/export-sdb1-brick.log --xlator-option *-posix.glusterd-uuid=7fdb637c-8485-4292-92c2-e422d5cce20c --process-name brick --brick-port 53898 --xlator-option gv0-server.listen-port=53898
root 14596 1 0 17:30 ? 00:00:00 /usr/sbin/glusterfsd -s node01 --volfile-id gv0.node01.export-sdc1-brick -p /var/run/gluster/vols/gv0/node01-export-sdc1-brick.pid -S /var/run/gluster/805d988efa8f4a66.socket --brick-name /export/sdc1/brick -l /var/log/glusterfs/bricks/export-sdc1-brick.log --xlator-option *-posix.glusterd-uuid=7fdb637c-8485-4292-92c2-e422d5cce20c --process-name brick --brick-port 58267 --xlator-option gv0-server.listen-port=58267
root 14613 1 0 17:30 ? 00:00:00 /usr/sbin/glusterfs -s localhost --volfile-id shd/gv0 -p /var/run/gluster/shd/gv0/gv0-shd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/426de13ab8d53f57.socket --xlator-option *replicate*.node-uuid=7fdb637c-8485-4292-92c2-e422d5cce20c --process-name glustershd --client-pid=-6
# Listening ports
root@jiangsjx:~# netstat -tlnp | grep gluster
tcp 0 0 0.0.0.0:24007 0.0.0.0:* LISTEN 14193/glusterd
tcp 0 0 0.0.0.0:58267 0.0.0.0:* LISTEN 14596/glusterfsd
tcp 0 0 0.0.0.0:53898 0.0.0.0:* LISTEN 14579/glusterfsd
Mount the volume on a client
root@jiangsjx:/mnt# mount -t glusterfs 192.168.134.202:/gv0 /mnt/gfs/
root@jiangsjx:/mnt# df -h
Filesystem Size Used Avail Use% Mounted on
192.168.134.202:/gv0 8.0G 269M 7.7G 4% /mnt/gfs
# Process info
root@jiangsjx:/mnt/gfs# ps -ef | grep gluster
root 10394 1 0 8月16 ? 00:00:07 /usr/sbin/glusterfs --process-name fuse --volfile-server=192.168.134.202 --volfile-id=/gv0 /mnt/gfs
# Connections (automatic retry and failover)
root@jiangsjx:/mnt/gfs# netstat -tnp | grep 192.168.134.20
tcp 0 0 192.168.134.56:48952 192.168.134.201:56957 ESTABLISHED 10394/glusterfs
tcp 0 0 192.168.134.56:49149 192.168.134.201:24007 ESTABLISHED 10394/glusterfs
tcp 0 0 192.168.134.56:49122 192.168.134.201:57188 ESTABLISHED 10394/glusterfs
tcp 0 0 192.168.134.56:49146 192.168.134.202:60040 ESTABLISHED 10394/glusterfs
tcp 0 0 192.168.134.56:49145 192.168.134.202:55765 ESTABLISHED 10394/glusterfs
tcp 0 0 192.168.134.56:48953 192.168.134.203:51221 ESTABLISHED 10394/glusterfs
tcp 0 0 192.168.134.56:48950 192.168.134.203:55777 ESTABLISHED 10394/glusterfs
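The server named on the mount command line is only used to fetch the volfile; to tolerate its failure at mount/boot time as well, a hedged /etc/fstab entry using the documented backup-volfile-servers mount option (hostnames are this lab's):
# /etc/fstab: mount at boot, fall back to node02/node03 for the volfile
node01:/gv0  /mnt/gfs  glusterfs  defaults,_netdev,backup-volfile-servers=node02:node03  0 0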
Adding files (similar to MinIO)
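The original slide showed this as a screenshot; a minimal command-line equivalent, writing through the FUSE mount and then looking at the bricks (file name is arbitrary; on a dispersed volume each brick holds an erasure-coded fragment, not the whole file):
# write a 100MB test file through the mount
dd if=/dev/urandom of=/mnt/gfs/test.bin bs=1M count=100
# the owning subvolume's bricks each hold a fragment
ls -lh /export/sdb1/brick/ /export/sdc1/brick/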
GlusterFS: gluster CLI
root@jiangsjx:~# gluster --help
peer help - display help for peer commands
volume help - display help for volume commands
global help - list global commands
# Trusted-pool node management
root@jiangsjx:~# gluster peer help
peer detach { <HOSTNAME> | <IP-address> } [force] - detach peer specified by <HOSTNAME>
peer probe { <HOSTNAME> | <IP-address> } - probe peer specified by <HOSTNAME>
peer status - list status of peers
pool list - list all the nodes in the pool (including localhost)
# Volume and brick management
root@jiangsjx:~# gluster volume help
volume add-brick <VOLNAME> [<replica> <COUNT> [arbiter <COUNT>]] <NEW-BRICK> ... [force] - add brick to volume <VOLNAME>
volume create <NEW-VOLNAME> [[replica <COUNT> [arbiter <COUNT>]]|[replica 2 thin-arbiter 1]] [disperse [<COUNT>]] [disperse-data <COUNT>] [redundancy <COUNT>] [transport <tcp|rdma|tcp,rdma>] <NEW-BRICK> <TA-BRICK>... [force] - create a new volume of specified type with mentioned bricks
volume delete <VOLNAME> - delete volume specified by <VOLNAME>
volume get <VOLNAME|all> <key|all> - Get the value of the all options or given option for volume <VOLNAME> or all option. gluster volume get all all is to get all global options
volume heal <VOLNAME> [enable | disable | full |statistics [heal-count [replica <HOSTNAME:BRICKNAME>]] |info [summary | split-brain] |split-brain {bigger-file <FILE> | latest-mtime <FILE> |source-brick <HOSTNAME:BRICKNAME> [<FILE>]} |granular-entry-heal {enable | disable}] - self-heal commands on volume specified by <VOLNAME>
volume info [all|<VOLNAME>] - list information of all volumes
volume list - list all volumes in cluster
volume log <VOLNAME> rotate [BRICK] - rotate the log file for corresponding volume/brick
volume profile <VOLNAME> {start|info [peek|incremental [peek]|cumulative|clear]|stop} [nfs] - volume profile operations
volume rebalance <VOLNAME> {{fix-layout start} | {start [force]|stop|status}} - rebalance operations
volume remove-brick <VOLNAME> [replica <COUNT>] <BRICK> ... <start|stop|status|commit|force> - remove brick from volume <VOLNAME>
volume replace-brick <VOLNAME> <SOURCE-BRICK> <NEW-BRICK> {commit force} - replace-brick operations
volume reset <VOLNAME> [option] [force] - reset all the reconfigured options
volume reset-brick <VOLNAME> <SOURCE-BRICK> {{start} | {<NEW-BRICK> commit}} - reset-brick operations
volume set <VOLNAME> <KEY> <VALUE> - set options for volume <VOLNAME>
volume set <VOLNAME> group <GROUP> - This option can be used for setting multiple pre-defined volume options where group_name is a file under /var/lib/glusterd/groups containing one key value pair per line
volume start <VOLNAME> [force] - start volume specified by <VOLNAME>
volume status [all | <VOLNAME> [nfs|shd|<BRICK>|quotad]] [detail|clients|mem|inode|fd|callpool|tasks|client-list] - display status of all or specified volume(s)/brick
volume stop <VOLNAME> [force] - stop volume specified by <VOLNAME>
volume sync <HOSTNAME> [all|<VOLNAME>] - sync the volume information from a peer
volume top <VOLNAME> {open|read|write|opendir|readdir|clear} [nfs|brick <brick>] [list-cnt <value>] | {read-perf|write-perf} [bs <size> count <count>] [brick <brick>] [list-cnt <value>] - volume top operations
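A typical lifecycle sequence combining the commands above, e.g. growing gv0 by one more dispersed subvolume (the sdd1 brick paths are hypothetical; on a dispersed volume, bricks must be added in multiples of the subvolume size, 3 here):
# add a third set of 3 bricks, then spread existing data onto it
gluster volume add-brick gv0 node01:/export/sdd1/brick node02:/export/sdd1/brick node03:/export/sdd1/brick
gluster volume rebalance gv0 start
gluster volume rebalance gv0 status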
GlusterFS: Versions
Latest release: 10.2; upstream recommends running the latest version (https://docs.gluster.org/en/latest/release-notes/)
GlusterFS: Code Layout
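The original slide carried a source-tree diagram; as rough orientation, the main top-level directories of the glusterfs repository (a sketch from a 10.x checkout, not exhaustive):
# git clone https://github.com/gluster/glusterfs.git
# api/           libgfapi, the library interface for applications
# cli/           the gluster command-line tool
# glusterfsd/    daemon entry point (glusterd/glusterfs are symlinks to glusterfsd; see the sbin listing below)
# libglusterfs/  shared infrastructure: memory pools, event loop, call stack/frames
# rpc/           RPC layer and transports (socket, RDMA)
# xlators/       translators: cluster (dht/afr/ec), protocol, performance, mgmt (glusterd), ...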
GlusterFS: Building (--with-tcmalloc, ~20% performance gain)
- Build guide: https://docs.gluster.org/en/latest/Developer-guide/Building-GlusterFS/
apt install make automake autoconf libtool flex bison \
pkg-config libssl-dev libxml2-dev python3-dev libaio-dev \
libibverbs-dev librdmacm-dev libreadline-dev liblvm2-dev \
libglib2.0-dev liburcu-dev libcmocka-dev libsqlite3-dev \
libacl1-dev liburing-dev google-perftools libgoogle-perftools-dev
./autogen.sh
# --prefix matches the sbin listing below; --with-tcmalloc per the heading above (needs the perftools packages installed above)
./configure --prefix=/usr/local/gfs --with-tcmalloc
make -j 4
make install
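A quick check that the tcmalloc option took effect (not from the original document; assumes the /usr/local/gfs prefix used above):
# the brick daemon should link libtcmalloc when configured --with-tcmalloc
ldd /usr/local/gfs/sbin/glusterfsd | grep tcmalloc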
# Installed binaries
root@jiangsjx:/usr/local/gfs/sbin# ll -h
total 2.7M
drwxr-xr-x 2 root root 4.0K 8月 17 16:29 ./
drwxr-xr-x 10 root root 4.0K 8月 17 16:29 ../
-rwxr-xr-x 1 root root 417 8月 17 16:29 conf.py*
-rwxr-xr-x 1 root root 6.5K 8月 17 16:29 gcron.py*
-rwxr-xr-x 1 root root 83K 8月 17 16:24 gf_attach*
lrwxrwxrwx 1 root root 75 8月 17 16:29 gfind_missing_files -> /usr/local/gfs/libexec/glusterfs/gfind_missing_files/gfind_missing_files.sh*
-rwxr-xr-x 1 root root 2.0M 8月 17 16:29 gluster*
lrwxrwxrwx 1 root root 10 8月 17 16:24 glusterd -> glusterfsd*
lrwxrwxrwx 1 root root 50 8月 17 16:29 gluster-eventsapi -> /usr/local/gfs/libexec/glusterfs/peer_eventsapi.py*
lrwxrwxrwx 1 root root 59 8月 17 16:29 glustereventsd -> /usr/local/gfs/libexec/glusterfs/gfevents/glustereventsd.py*
lrwxrwxrwx 1 root root 10 8月 17 16:24 glusterfs -> glusterfsd*
-rwxr-xr-x 1 root root 537K 8月 17 16:24 glusterfsd*
lrwxrwxrwx 1 root root 54 8月 17 16:29 gluster-georep-sshkey -> /usr/local/gfs/libexec/glusterfs/peer_georep-sshkey.py*
lrwxrwxrwx 1 root root 52 8月 17 16:29 gluster-mountbroker -> /usr/local/gfs/libexec/glusterfs/peer_mountbroker.py*
-rwxr-xr-x 1 root root 29K 8月 17 16:29 gluster-setgfid2path*
-rwxr-xr-x 1 root root 34K 8月 17 16:29 snap_scheduler.py*
Production Deployments (field data)
24 nodes:
Effective storage power 10PB; raw ~12PB; 80% storage efficiency (EC 8+2)
36 x 16T per node (576T)
CentOS 7
cache + sealed
48 nodes:
Effective storage power 20PB
Storage vendor 1:
Business network bond 2x10Gbps, internal network bond 2x10Gbps
Storage vendor 2:
Business + internal networks bond 4x10Gbps
Initial Plan
Raw storage: 36 x 16T x 20 nodes = 11520T ≈ 11.25P
Usable storage: 11520T x 0.8 = 9216T ≈ 9P
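The same arithmetic as a quick shell check (integer math, units in T; the 0.8 factor is the EC 8+2 efficiency from the field data above):
# raw = 36 disks x 16T x 20 nodes; usable = raw x 8/10
echo $((36 * 16 * 20))T $((36 * 16 * 20 * 8 / 10))T   # -> 11520T 9216T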
Next Steps
- Performance testing
- Performance tuning
- Dig into the internals