Deploying a 3-Node Elasticsearch Cluster, Step by Step

This article walks through deploying an Elasticsearch cluster across three nodes: the prerequisites (downloading and configuring Elasticsearch and the IK analyzer), enabling user authentication, generating a CA certificate, and installing the IK Chinese analyzer, with reference links at the end.

Prerequisites

Before starting the deployment, download Elasticsearch and the IK analyzer.

  1. Download the packages
    Download Elasticsearch 7.16 (the 7.x line is the last to support JDK 1.8), extract it, and place it under /opt (my usual location):
    elasticsearch-7.16.3-linux-x86_64.tar.gz
    For better Chinese-language support, also download the IK analyzer:
    elasticsearch-analysis-ik-7.16.3.zip
  2. Create a user named es and give it ownership of the install directory
adduser es
passwd es        # enter the password at the prompt (testpassword in this example)
chown -R es:es elasticsearch-7.16.3/
chmod 770 elasticsearch-7.16.3/
  3. Raise the es user's maximum number of open files
    vim /etc/security/limits.conf
es hard nofile 65536
es soft nofile 65536
  4. Raise the es user's maximum number of threads
    vim /etc/security/limits.d/90-nproc.conf
*          soft    nproc     4096
root       soft    nproc     unlimited
  5. Raise the maximum number of virtual memory areas
    vim /etc/sysctl.conf
# Controls max virtual memory areas
vm.max_map_count=262144
  6. Reload the kernel settings
sysctl -p
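After `sysctl -p`, it is worth confirming that the limits actually took effect before starting Elasticsearch. A minimal sketch (the `check_setting` helper is hypothetical, not part of any tool; the expected numbers match the steps above):

```shell
# Hypothetical helper: compare an observed value against the target.
check_setting() {
  # check_setting <actual> <expected> <label>
  if [ "$1" = "$2" ]; then
    echo "OK: $3 = $1"
  else
    echo "WARN: $3 is $1, expected $2"
  fi
}

# On a configured host, read the live values like this:
#   check_setting "$(sysctl -n vm.max_map_count)" 262144 "vm.max_map_count"
#   check_setting "$(su - es -c 'ulimit -n')"     65536  "nofile (es user)"
#   check_setting "$(su - es -c 'ulimit -u')"     4096   "nproc (es user)"
# Sample values are passed here so the script runs anywhere:
check_setting 262144 262144 "vm.max_map_count"
check_setting 65536 65536 "nofile (es user)"
```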

Deploying the Cluster

The steps below are run from inside the elasticsearch-7.16.3 directory, which of course must be accessible to the es account.
Assume three servers, 192.168.1.1, 192.168.1.2, and 192.168.1.3; each of the following steps must be performed on all three.

  1. Create the data directory
mkdir data
  2. In the config directory, create a certificate directory
mkdir certificate
  3. Edit the elasticsearch.yml file (the following can serve as a reference)
# ======================== Elasticsearch Configuration =========================
#
# NOTE: Elasticsearch comes with reasonable defaults for most settings.
#       Before you set out to tweak and tune the configuration, make sure you
#       understand what are you trying to accomplish and the consequences.
#
# The primary way of configuring a node is via this file. This template lists
# the most important settings you may want to configure for a production cluster.
#
# Please consult the documentation for further information on configuration options:
# https://www.elastic.co/guide/en/elasticsearch/reference/index.html
#
# ---------------------------------- Cluster -----------------------------------
#
# Use a descriptive name for your cluster:
#
cluster.name: common-server
#
# ------------------------------------ Node ------------------------------------
#
# Use a descriptive name for the node:
#
node.name: common-server-3
#
# Add custom attributes to the node:
#
#node.attr.rack: r1
#
# ----------------------------------- Paths ------------------------------------
#
# Path to directory where to store the data (separate multiple locations by comma):
#
path.data: /opt/elasticsearch-7.16.3/data
#
# Path to log files:
#
path.logs: /opt/elasticsearch-7.16.3/logs
#
# ----------------------------------- Memory -----------------------------------
#
# Lock the memory on startup:
#
bootstrap.memory_lock: true
# Note: memory_lock requires an unlimited memlock ulimit for the es user
# (e.g. "es - memlock unlimited" in /etc/security/limits.conf), otherwise
# the node fails its bootstrap checks.
bootstrap.system_call_filter: false
#
# Make sure that the heap size is set to about half the memory available
# on the system and that the owner of the process is allowed to use this
# limit.
#
# Elasticsearch performs poorly when the system is swapping the memory.
#
# ---------------------------------- Network -----------------------------------
#
# By default Elasticsearch is only accessible on localhost. Set a different
# address here to expose this node on the network:
#
network.host: 192.168.1.3
#
# By default Elasticsearch listens for HTTP traffic on the first free port it
# finds starting at 9200. Set a specific HTTP port here:
#
http.port: 9200

transport.tcp.port: 9300
#
# For more information, consult the network module documentation.
#
# --------------------------------- Discovery ----------------------------------
#
# Pass an initial list of hosts to perform discovery when this node is started:
# The default list of hosts is ["127.0.0.1", "[::1]"]
#
discovery.seed_hosts: ["192.168.1.1:9300", "192.168.1.2:9300", "192.168.1.3:9300"]
## Legacy Zen fault-detection settings (Elasticsearch 7.x logs deprecation
## warnings for these; tune the values to your environment)
## How often each node is pinged
discovery.zen.fd.ping_interval: 30s

## Timeout for each ping
discovery.zen.fd.ping_timeout: 60s

## How many failed pings before a node is considered down
discovery.zen.fd.ping_retries: 6
#
# Bootstrap the cluster using an initial set of master-eligible nodes:
#
cluster.initial_master_nodes: ["common-server-1"]
#
# For more information, consult the discovery and cluster formation module documentation.
#
# ---------------------------------- Various -----------------------------------
#
# Require explicit names when deleting indices:
#
#action.destructive_requires_name: true
#
# ---------------------------------- Security ----------------------------------
#
#                                 *** WARNING ***
#
# Elasticsearch security features are not enabled by default.
# These features are free, but require configuration changes to enable them.
# This means that users don’t have to provide credentials and can get full access
# to the cluster. Network connections are also not encrypted.
#
# To protect your data, we strongly encourage you to enable the Elasticsearch security features. 
# Refer to the following documentation for instructions.
#
# https://www.elastic.co/guide/en/elasticsearch/reference/7.16/configuring-stack-security.html

## Split-brain protection: number of master-eligible nodes that must agree,
## normally (n/2 + 1). Note that 7.x ignores this legacy setting and
## manages the voting quorum itself.
discovery.zen.minimum_master_nodes: 2
## Start recovery once this many nodes have joined
gateway.recover_after_nodes: 2
## Number of nodes expected before shard allocation begins
gateway.expected_nodes: 3
## How long to wait for the expected nodes before recovering anyway
gateway.recover_after_time: 1m
node.master: true
node.data: true
node.ingest: true
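Only a couple of lines in elasticsearch.yml differ between the three servers: node.name and network.host. A small sketch of rendering those per-node lines, assuming the naming scheme above (the render_node_settings helper is hypothetical):

```shell
# Hypothetical helper: print the per-node lines for server <n> at <ip>;
# everything else in elasticsearch.yml is identical across the cluster.
render_node_settings() {
  printf 'node.name: common-server-%s\nnetwork.host: %s\n' "$1" "$2"
}

render_node_settings 1 192.168.1.1
render_node_settings 2 192.168.1.2
render_node_settings 3 192.168.1.3
```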
  4. Switch to the es account
  5. Start Elasticsearch with ./elasticsearch -d and check for errors
    Normally the node starts cleanly, and curl -L -X GET 'http://192.168.1.1:9200/_cat/nodes?v' shows the cluster state:
ip            heap.percent ram.percent cpu load_1m load_5m load_15m node.role   master name
192.168.1.2           62          97  11    0.38    0.22     0.13 cdfhilmrstw -      common-server-2
192.168.1.1           35          98  23    0.57    0.31     0.22 cdfhilmrstw *      common-server-1
192.168.1.3           69          98  18    0.20    0.39     0.44 cdfhilmrstw -      common-server-3
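In the _cat/nodes output, the node whose master column shows * is the elected master. A quick awk sketch that extracts it, shown here against the captured sample above rather than a live cluster:

```shell
# The master column is the second-to-last field; '*' marks the elected master.
find_master() {
  awk 'NR > 1 && $(NF-1) == "*" { print $NF }'
}

sample='ip            heap.percent ram.percent cpu load_1m load_5m load_15m node.role   master name
192.168.1.2           62          97  11    0.38    0.22     0.13 cdfhilmrstw -      common-server-2
192.168.1.1           35          98  23    0.57    0.31     0.22 cdfhilmrstw *      common-server-1
192.168.1.3           69          98  18    0.20    0.39     0.44 cdfhilmrstw -      common-server-3'

# Against a live cluster: curl -s 'http://192.168.1.1:9200/_cat/nodes?v' | find_master
printf '%s\n' "$sample" | find_master
```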

Generating a CA Certificate and Enabling User Authentication

  1. Switch to the es user
  2. In the ES bin directory, run ./elasticsearch-certutil ca, then press Enter twice
  3. In the ES bin directory, run ./elasticsearch-certutil cert --ca elastic-stack-ca.p12, then press Enter three times
  4. As the es user, copy the generated elastic-certificates.p12 and elastic-stack-ca.p12 into the ES config/certificate directory
  5. On all three servers, add the following to elasticsearch.yml
# CORS
http.cors.enabled: true
http.cors.allow-origin: "*"
http.cors.allow-headers: Authorization,X-Requested-With,Content-Type,Content-Length

xpack.security.enabled: true
xpack.security.authc.accept_default_password: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: /opt/elasticsearch-7.16.3/config/certificate/elastic-certificates.p12
xpack.security.transport.ssl.truststore.path: /opt/elasticsearch-7.16.3/config/certificate/elastic-certificates.p12
  6. Restart Elasticsearch on all three servers and check that the service comes up without errors
  7. On one of the servers, set passwords for the built-in accounts by running, from the ES bin directory:
./elasticsearch-setup-passwords interactive
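Once passwords are set, every request must carry credentials, e.g. curl -u elastic:<password> …. A sketch of the equivalent explicit header (the password below is a placeholder for illustration; the Authorization value is just base64 of user:password):

```shell
ES_USER=elastic
ES_PASS=changeme   # placeholder; use the password you set interactively

# Authenticated request against the cluster (needs the nodes running):
#   curl -u "$ES_USER:$ES_PASS" 'http://192.168.1.1:9200/_cat/nodes?v'

# The same credentials as an explicit HTTP Basic header:
printf 'Authorization: Basic %s\n' "$(printf '%s' "$ES_USER:$ES_PASS" | base64)"
```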

Installing the IK Chinese Analyzer

  1. As the es user, upload elasticsearch-analysis-ik-7.16.3.zip to the es user's home directory (/home/es)
  2. Unzip it in place: unzip elasticsearch-analysis-ik-7.16.3.zip -d ./elasticsearch-analysis-ik-7.16.3
  3. Copy the extracted directory into the ES plugins directory: cp -r elasticsearch-analysis-ik-7.16.3 /opt/elasticsearch-7.16.3/plugins
  4. Restart Elasticsearch (find the process ID with ps -ef | grep elastic, stop it with kill -9, then start it again with ./elasticsearch -d; a plain kill is preferable when the process will shut down gracefully)
  5. Run ./elasticsearch-plugin list to confirm that the elasticsearch-analysis-ik-7.16.3 plugin is installed
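To confirm the analyzer actually works, send a request to the _analyze API using the plugin's ik_max_word analyzer (the sample text is arbitrary). The curl call needs a running, authenticated cluster, so this sketch only prepares and prints the request body:

```shell
# Request body for the _analyze API; ik_max_word is provided by the IK plugin.
ANALYZE_BODY='{ "analyzer": "ik_max_word", "text": "中华人民共和国" }'
printf '%s\n' "$ANALYZE_BODY"

# Against a live node (password placeholder):
#   curl -u elastic:<password> -H 'Content-Type: application/json' \
#     -X POST 'http://192.168.1.1:9200/_analyze' -d "$ANALYZE_BODY"
```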

References

  1. https://blog.youkuaiyun.com/chenshm/article/details/124264078
  2. https://blog.youkuaiyun.com/aotumemedazhao1996/article/details/108746616