Introduction to Petuum

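Petuum is a distributed framework focused on the challenges of large-scale machine learning. It provides two platforms, one for data parallelism and one for model parallelism, and runs efficiently at scale on anything from research clusters to cloud services.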

https://github.com/petuum/bosen/wiki

Foreword - please read

Petuum is a distributed machine learning framework. It takes care of the difficult system "plumbing work", allowing you to focus on the ML. Petuum runs efficiently at scale on research clusters and on cloud platforms such as Amazon EC2 and Google Compute Engine (GCE).

The Petuum project is organized into 4 open-source (BSD 3-clause license) GitHub repositories.

To install Bösen and Strads, please continue reading this manual. If you have a Java environment and want to use JBösen, please start here. If you wish to use Poseidon for Deep Learning, please go here.

Petuum Bösen/Strads v1.1 manual

Installing Petuum
ML Applications
1. Topic Models
  1. Latent Dirichlet Allocation (topic modeling)
  2. MedLDA (supervised topic modeling)
2. Deep Learning
  1. Poseidon: Distributed Deep Learning Framework on Petuum
  2. General-purpose Deep Neural Network (DNN)
    1. DNN for Speech Recognition
3. Matrix Factorization and Sparse Coding
4. Regression
  1. Lasso Regression
5. Metric Learning
  1. Distance Metric Learning
6. Clustering
  1. K-means Clustering
7. Classification
  1. Random Forest
  2. Logistic Regression
  3. SVM (Newly added in v1.1)
  4. Multi-class Logistic Regression
Programming API
1. Bösen Bounded-Async Key-Value Store
2. Strads Model-Parallel Scheduler (Coming soon)

Introduction to Petuum

Petuum provides essential distributed programming tools to tackle the challenges of ML at scale: Big Data (many data samples), and Big Models (very large parameter and intermediate variable spaces). To address these challenges, Petuum provides two key platforms:

Bösen, a bounded-asynchronous key-value store for Data-Parallel ML algorithms
Strads, a scheduler for Model-Parallel ML algorithms

Unlike general-purpose distributed programming platforms, Petuum is designed specifically for ML algorithms. This means that Petuum takes advantage of data correlation, staleness, and other statistical properties to maximize the performance of ML algorithms.

ML programs are built around update functions that are iterated repeatedly until convergence, as the following diagram illustrates:

[Figure: Data and Model Parallelism]

The update function takes the data and model parameters as input, and outputs a change to the model parameters. Data parallelism divides the data among different workers, whereas model parallelism divides the parameters among different workers. Both styles of parallelism can be found in modern ML algorithms: for example, Sparse Coding via Stochastic Gradient Descent is a data-parallel algorithm, while Lasso regression via Coordinate Descent is a model-parallel algorithm. The Petuum Bösen and Strads systems are built to enable data-parallel and model-parallel styles, respectively.
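
The contrast between the two styles can be made concrete with a small sketch. The code below is plain Python and uses no Petuum APIs; workers are simulated by ordinary loops, and every function and variable name is hypothetical. To keep it short and deterministic, the data-parallel half uses full-shard gradient descent on a one-parameter least-squares problem (rather than stochastic sampling or sparse coding), and the model-parallel half uses coordinate descent on unregularized least squares (rather than Lasso).

```python
# Illustrative sketch only: plain Python, no Petuum APIs.
# Workers are simulated by loops; all names here are hypothetical.

def data_parallel_gd(shards, lr=0.01, iters=200):
    """Fit y = w * x. The DATA is split into shards, one per worker."""
    w = 0.0
    for _ in range(iters):
        # Each worker computes a parameter change from its own shard only.
        deltas = [
            -lr * sum((w * x - y) * x for x, y in shard)
            for shard in shards
        ]
        # The per-worker changes are aggregated into the shared parameter.
        w += sum(deltas)
    return w


def model_parallel_cd(rows, targets, blocks, sweeps=50):
    """Fit y = w . x by coordinate descent. The PARAMETERS are split into
    blocks, one per worker; every worker sees all of the data."""
    w = [0.0] * len(rows[0])
    for _ in range(sweeps):
        for block in blocks:  # one block of coordinates per worker
            for j in block:
                num = den = 0.0
                for x, y in zip(rows, targets):
                    # Residual with coordinate j's own contribution removed.
                    partial = y - sum(
                        w[k] * x[k] for k in range(len(w)) if k != j
                    )
                    num += x[j] * partial
                    den += x[j] * x[j]
                w[j] = num / den
    return w


# Two simulated workers, each holding half of the (x, y) samples.
shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
w_dp = data_parallel_gd(shards)

# Two simulated workers, each owning one of the two parameters.
rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
targets = [1.0, 2.0, 3.0]
w_mp = model_parallel_cd(rows, targets, blocks=[[0], [1]])
```

In this sketch the parameter blocks are visited one after another. A real model-parallel system would run them concurrently, which is only safe when the blocks do not interfere with each other; deciding that is the job of a scheduler.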

Key Petuum features

Runs on compute clusters and cloud compute, supporting up to hundreds of machines
Bösen, a bounded-asynchronous distributed key-value store for data-parallel ML programming
- Bösen uses the Stale Synchronous Parallel (SSP) consistency model, in which workers may read parameter values that are a bounded number of iterations stale; this gives asynchronous-like performance that outperforms MapReduce and bulk synchronous execution without sacrificing ML algorithm correctness
Strads, a dynamic scheduler for model-parallel ML programming
- Strads performs fine-grained scheduling of ML update operations, prioritizing computation on the parts of the ML program that need it most, while avoiding unsafe parallel operations that could hurt performance
Programming interfaces for C++ and Java
YARN and HDFS support, allowing execution on Hadoop clusters
ML library with 10+ ready-to-run algorithms
- Newer algorithms such as discriminative topic models, deep learning, distance metric learning and sparse coding
- Classic algorithms such as logistic regression, k-means, and random forest
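
The bounded-staleness idea behind the Stale Synchronous Parallel model named in the feature list can be pictured as one iteration counter ("clock") per worker, with a cap on how far the fastest worker may run ahead of the slowest. The following is a minimal sketch under that reading. It is not Bösen's API or implementation, and the class, method, and parameter names are all hypothetical.

```python
# Illustrative sketch only: this is not Bösen's API or implementation.
# Class, method, and parameter names are hypothetical.

class SSPClock:
    """Tracks one iteration counter ("clock") per worker and enforces a
    bounded gap between the fastest and slowest workers."""

    def __init__(self, num_workers, staleness):
        self.clocks = [0] * num_workers
        self.staleness = staleness

    def can_start_next(self, worker):
        # A worker may begin another iteration only while it is at most
        # `staleness` clocks ahead of the slowest worker.
        return self.clocks[worker] - min(self.clocks) <= self.staleness

    def finish_iteration(self, worker):
        self.clocks[worker] += 1


def run_fast_worker(staleness, attempts=10):
    """Worker 0 tries to run `attempts` iterations while worker 1 is stalled.
    Returns how many iterations worker 0 completes before it must wait."""
    ssp = SSPClock(num_workers=2, staleness=staleness)
    for _ in range(attempts):
        if not ssp.can_start_next(0):
            break
        ssp.finish_iteration(0)
    return ssp.clocks[0]
```

With `staleness=0` the fast worker completes a single iteration and then waits for the other worker, which matches bulk synchronous execution. Larger values let fast workers keep computing on slightly stale parameters instead of idling, while the bound keeps that staleness from growing without limit.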

Support and Bug reports

For support, or to report a bug, please send email to petuum-user@googlegroups.com. Please provide your name and affiliation; we do not support anonymous inquiries.