Machine Learning with the PS Parameter Server: What on Earth Is Distributed Computing?

This post introduces the basic concepts of the Parameter Server framework, including the node roles and communication patterns it uses for distributed machine learning, and compares synchronous and asynchronous stochastic gradient descent, their characteristics, and when to use each.


1. Overview

The parameter server aims at high-performance distributed machine learning applications. In this framework, multiple nodes run on multiple machines to solve machine learning problems. There is typically a single scheduler node and several worker and server nodes.

(Figure: parameter server architecture)

  • Worker. A worker node performs the main computations, such as reading the data and computing the gradient. It communicates with the server nodes via push and pull: for example, it pushes the computed gradient to the servers, or pulls the latest model from them (a minimal sketch of this push/pull pattern follows the list).
  • Server. A server node maintains and updates the model weights. Each server node maintains only a part of the model.
  • Scheduler. The scheduler node monitors the liveness of the other nodes. It can also be used to send control signals to other nodes and to collect their progress.
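
To make the push/pull interaction concrete, here is a minimal worker-side sketch written against ps-lite's KVWorker interface. The three-key model layout and the app/customer ids are illustrative assumptions, and the exact signatures may vary across ps-lite versions.

#include <vector>
#include "ps/ps.h"

int main(int argc, char* argv[]) {
  ps::Start(0);                                  // join the cluster (customer id 0)
  if (ps::IsWorker()) {
    ps::KVWorker<float> kv(0, 0);                // app id 0, customer id 0 (assumed)

    // Assume the model is stored as three key/value segments on the servers.
    std::vector<ps::Key> keys = {0, 1, 2};
    std::vector<float> weights;                  // filled by Pull
    std::vector<float> grads = {0.0f, 0.0f, 0.0f};

    kv.Wait(kv.Pull(keys, &weights));            // pull the current model
    // ... read a minibatch and compute `grads` from `weights` here ...
    kv.Wait(kv.Push(keys, grads));               // push the gradients back to the servers
  }
  ps::Finalize(0, true);                         // leave the cluster
  return 0;
}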

1.1. Distributed Optimization

Assume we are going to solve the following problem:

$$\min_w \sum_{i=1}^{n} f(x_i, y_i, w)$$

where (x_i, y_i) are example pairs and w is the weight.

We consider solving the above problem by minibatch stochastic gradient descent (SGD) with batch size b. At time t, the algorithm first randomly picks b examples and then updates the weight w by

$$w = w - \eta_t \sum_{i=1}^{b} \nabla f(x_{k_i}, y_{k_i}, w)$$

where k_1, …, k_b are the indices of the randomly picked examples and η_t is the learning rate at time t.
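
For concreteness, here is a minimal single-machine sketch of this update rule. The squared loss and all names below are illustrative assumptions, not part of ps-lite.

#include <cstddef>
#include <random>
#include <vector>

// One minibatch SGD step for least squares: f(x, y, w) = 0.5 * (w.x - y)^2,
// so the per-example gradient is (w.x - y) * x.
void MinibatchSGDStep(const std::vector<std::vector<double>>& X,
                      const std::vector<double>& Y,
                      std::vector<double>* w, std::size_t b, double eta,
                      std::mt19937* rng) {
  std::uniform_int_distribution<std::size_t> pick(0, X.size() - 1);
  std::vector<double> grad(w->size(), 0.0);
  for (std::size_t i = 0; i < b; ++i) {
    std::size_t k = pick(*rng);                  // randomly pick example k
    double pred = 0.0;
    for (std::size_t j = 0; j < w->size(); ++j) pred += (*w)[j] * X[k][j];
    double err = pred - Y[k];
    for (std::size_t j = 0; j < w->size(); ++j) grad[j] += err * X[k][j];
  }
  for (std::size_t j = 0; j < w->size(); ++j) (*w)[j] -= eta * grad[j];
}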

We give two examples to illustrate the basic idea of how to implement a distributed optimization algorithm in ps-lite.

1.1.1. Asynchronous SGD

In the first example, we extend SGD to asynchronous SGD. We let the servers maintain w, where server k gets the k-th segment of w, denoted by w_k. Once it receives a gradient from a worker, server k updates the weights it maintains:

t = 0;
while (Received(&grad)) {   // block until a gradient arrives from any worker
  w_k -= eta(t) * grad;     // apply it immediately with the current learning rate
  t++;
}

where Received returns true when a gradient has been received from any worker node, and eta returns the learning rate at time t.
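
In actual ps-lite code this server loop would typically be expressed as a KVServer request handler rather than an explicit Received() loop. The sketch below assumes the handler signature used by recent ps-lite versions; the learning-rate schedule and the per-key weight map are illustrative assumptions.

#include <cstddef>
#include <unordered_map>
#include "ps/ps.h"

// Illustrative state: one weight per key, plus a step counter for eta(t).
std::unordered_map<ps::Key, float> weights;
int t = 0;
float Eta(int step) { return 0.1f / (1.0f + 0.01f * step); }  // assumed schedule

void AsyncHandle(const ps::KVMeta& req_meta,
                 const ps::KVPairs<float>& req_data,
                 ps::KVServer<float>* server) {
  ps::KVPairs<float> res;
  if (req_meta.push) {
    // A push carries gradients: apply them immediately (asynchronous SGD).
    for (size_t i = 0; i < req_data.keys.size(); ++i) {
      weights[req_data.keys[i]] -= Eta(t) * req_data.vals[i];
    }
    ++t;
  } else {
    // A pull asks for the current weights of the requested keys.
    res.keys = req_data.keys;
    res.vals.resize(req_data.keys.size());
    for (size_t i = 0; i < req_data.keys.size(); ++i) {
      res.vals[i] = weights[req_data.keys[i]];
    }
  }
  server->Response(req_meta, res);
}

// Registration, e.g. inside main() after ps::Start():
//   auto server = new ps::KVServer<float>(0);
//   server->set_request_handle(AsyncHandle);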

A worker, in turn, does four things each time:

Read(&X, &Y);  // read a minibatch X and Y
Pull(&w);      // pull the recent weight from the servers
ComputeGrad(X, Y, w, &grad);  // compute the gradient
Push(grad);    // push the gradients to the servers

where ps-lite provides the push and pull functions, which communicate with the servers that hold the relevant part of the data.
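
Which server holds "the relevant part of the data" is determined by how the key space is partitioned. A simple range-partition sketch of the idea (illustrative only, not ps-lite's actual partitioning code):

#include <cstdint>

// Assume the key space [0, kMaxKey) is split into num_servers equal ranges;
// the server responsible for a key is the one whose range contains it.
int ServerForKey(uint64_t key, uint64_t kMaxKey, int num_servers) {
  uint64_t range = kMaxKey / num_servers;        // size of each server's segment
  int s = static_cast<int>(key / range);
  return s < num_servers ? s : num_servers - 1;  // clamp the last range
}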

Note that asynchronous SGD is semantically different from the single-machine version. Since there is no communication between workers, the weights may be updated while a worker is still computing its gradients; in other words, each worker may use delayed (stale) weights. The following figure shows the communication with 2 server nodes and 3 worker nodes.

1.1.2. Synchronized SGD

Unlike the asynchronous version, we now consider a synchronized version, which is semantically identical to the single-machine algorithm. We use the scheduler to manage the data synchronization:

for (t = 0; t < num_iteration; ++t) {
  for (i = 0; i < num_worker; ++i) {
     IssueComputeGrad(i, t);
  }
  for (i = 0; i < num_server; ++i) {
     IssueUpdateWeight(i, t);
  }
  WaitAllFinished();
}

where IssueComputeGrad and IssueUpdateWeight issue commands to the workers and servers, while WaitAllFinished waits until all issued commands have finished.
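
One way to realize this control flow is ps-lite's SimpleApp, whose Request returns a timestamp that Wait blocks on until every node in the addressed group has responded. The sketch below is an assumption-laden illustration: the app/customer ids, the use of the request head as the iteration counter, and the command strings are all made up, and it waits for the workers before instructing the servers, which is one possible ordering.

#include <string>
#include "ps/ps.h"

// Scheduler-side sketch: broadcast "compute gradients" to all workers, then
// "update weights" to all servers, and wait for every response each time.
void RunScheduler(int num_iterations) {
  ps::SimpleApp app(0, 0);  // app id 0, customer id 0 (assumed)
  for (int t = 0; t < num_iterations; ++t) {
    int ts1 = app.Request(t, "compute_grad", ps::kWorkerGroup);
    app.Wait(ts1);          // returns once all workers have replied
    int ts2 = app.Request(t, "update_weight", ps::kServerGroup);
    app.Wait(ts2);          // returns once all servers have replied
  }
}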

When a worker receives a command, it executes the following function,

ExecComputeGrad(i, t) {
   Read(&X, &Y);  // read minibatch with b / num_workers examples
   Pull(&w);      // pull the recent weight from the servers
   ComputeGrad(X, Y, w, &grad);  // compute the gradient
   Push(grad);    // push the gradients to the servers
}

which is almost identical to asynchronous SGD, except that only b/num_workers examples are processed each time.

A server node, compared with asynchronous SGD, has an additional aggregation step:

ExecUpdateWeight(i, t) {
   aggregated_grad = 0;          // reset the accumulator for this iteration
   for (j = 0; j < num_workers; ++j) {
      Receive(&grad);            // block until the next worker's gradient arrives
      aggregated_grad += grad;
   }
   w_i -= eta(t) * aggregated_grad;
}
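
In a real server process the gradients arrive as messages rather than through a blocking Receive, so the aggregation is usually written as an event-driven handler that counts pushes and applies the update once every worker has contributed. A generic, hedged C++ sketch of that idea (not tied to any particular ps-lite version):

#include <cstddef>
#include <mutex>
#include <vector>

// Per-iteration state for the weight segment owned by this server.
struct SyncAggregator {
  std::vector<double> w;                 // this server's part of the model
  std::vector<double> aggregated_grad;   // running sum of worker gradients
  int received = 0;                      // pushes seen in the current iteration
  int num_workers;
  std::mutex mu;

  SyncAggregator(std::size_t dim, int workers)
      : w(dim, 0.0), aggregated_grad(dim, 0.0), num_workers(workers) {}

  // Called once per incoming gradient push; applies the update after the last one.
  void OnPush(const std::vector<double>& grad, double eta) {
    std::lock_guard<std::mutex> lock(mu);
    for (std::size_t j = 0; j < w.size(); ++j) aggregated_grad[j] += grad[j];
    if (++received == num_workers) {
      for (std::size_t j = 0; j < w.size(); ++j) {
        w[j] -= eta * aggregated_grad[j];
        aggregated_grad[j] = 0.0;        // reset for the next iteration
      }
      received = 0;
    }
  }
};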

1.1.3. Which one to use?

Compared with a single-machine algorithm, a distributed algorithm has two additional costs: the data communication cost, namely the cost of sending data over the network, and the synchronization cost due to imperfect load balancing and performance variance across machines. These two costs may dominate the running time for large-scale applications with hundreds of machines and terabytes of data.

Assume the following notation:

f        convex function
n        number of examples
m        number of workers
b        minibatch size
τ        maximal delay
T_comm   data communication overhead of one minibatch
T_sync   synchronization overhead

The trade-offs are summarized below:

SGD             slowdown of convergence     additional overhead
synchronized    b                           (n/b) (T_comm + T_sync)
asynchronous    bτ                          (n/(mb)) T_comm

What we can see is that:

  • the minibatch size trades off convergence against the communication cost;
  • the maximal allowed delay trades off convergence against the synchronization cost. In synchronized SGD we have τ = 0, so it suffers a large synchronization cost, while asynchronous SGD uses an infinite τ to eliminate this cost. In practice an infinite delay rarely occurs, but we can also place an upper bound on τ to guarantee convergence at the price of some synchronization cost.
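
To make the trade-off concrete with some made-up numbers: with n = 10^8 examples, b = 10^4, m = 100 workers, T_comm = 50 ms, and T_sync = 200 ms, synchronized SGD pays roughly (n/b)(T_comm + T_sync) = 10^4 × 0.25 s ≈ 2,500 s of extra overhead over one pass through the data, while asynchronous SGD pays only about (n/(mb)) T_comm = 100 × 0.05 s ≈ 5 s per worker; the price is the slower convergence introduced by the delay τ.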

1.2. Further Reads

Distributed optimization algorithms have been an active research topic in recent years; the original ps-lite documentation (linked below) lists several representative papers.


Original article:
http://ps-lite.readthedocs.io/en/latest/overview.html
### Distributed Optimization in Machine Learning: Applications and Implementation

#### 1. Why distributed optimization matters

As data volumes grow and models become more complex, traditional single-machine training can no longer process them efficiently. Distributed optimization addresses this by spreading the computation across multiple nodes, accelerating training and improving overall performance. For large datasets and complex models, a distributed architecture not only shortens training time significantly but also makes full use of the cluster's hardware, yielding higher throughput at lower cost.

#### 2. Common distributed computing frameworks and their characteristics

- **Apache Spark**: supports batch, streaming, and other processing modes; iterative operations executed in memory are efficient; well suited to problems such as large-scale linear regression solved by gradient descent.
- **Dask**: offers an interface similar to the Pandas DataFrame API and is easy to pick up; its dynamic scheduler adjusts task dispatching to the actual load; a good fit for small and medium teams that need to set up experiments quickly.
- **TensorFlow / PyTorch**: both libraries go beyond deep neural networks and ship with powerful built-in distributed training. They can synchronize weight updates across devices via a parameter server or AllReduce-style communication.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Define a simple convolutional neural network
def create_model():
    model = models.Sequential()
    model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Flatten())
    model.add(layers.Dense(64, activation='relu'))
    model.add(layers.Dense(10))
    return model

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    distributed_model = create_model()
    optimizer = tf.keras.optimizers.Adam()
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    distributed_model.compile(optimizer=optimizer, loss=loss_fn, metrics=['accuracy'])
```

This snippet shows how to use TensorFlow's `tf.distribute.Strategy` interface for multi-GPU parallel training. It uses the most basic form, `MirroredStrategy`, which keeps an identical replica of the model parameters on each GPU of a single machine and runs the forward and backward passes in parallel.

#### 3. Key ingredients of a distributed optimizer

- **Data partitioning**: the raw input examples are split by some rule into subsets that different workers read and preprocess. This step directly affects how often the nodes communicate in later rounds and hence the stability and convergence speed of the whole system.
- **Gradient aggregation**: the local gradients computed by each worker must be combined into a global gradient that guides the next weight update. Common approaches are a centralized PS architecture or a decentralized Ring-AllReduce scheme.
- **Synchronous vs. asynchronous updates**: the choice of update mechanism determines how tightly the cooperating workers are coordinated. Synchronous updates emphasize strict consistency; asynchronous updates chase raw speed but may sacrifice some accuracy.