
K8S AllReduce Task Scheduling (CPU+GPU Platform)
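A container scheduling and resource management platform based on Kubernetes, covering OpenMPI/Horovod, container implementation, CPU/GPU compute, data storage, network bandwidth, and complexity analysis and optimization of AllReduce scheduling algorithms.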
林微
林微, a young blogger.
Docker Image Creation (CPU Environment) --【A-1/4】Installing Docker CE and Basic Usage
1. Installing Docker
Original · 2019-05-14 20:45:41 · 974 views · 0 comments
Docker Image Creation (CPU Environment) --【A-2/4】Adding a Regular User to the Docker Group
1. Problem description. After installing Docker, query the Docker CE version as a regular user with the docker version command. $ docker version Client: Version: 18.09.6 API version: 1.39 Go version: go1.10.8 Git commit: 481bc77 Bui...
Original · 2019-05-15 11:31:22 · 517 views · 0 comments
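Docker Image Creation (CPU Environment) --【A-3/4】Building Your Own Docker Image from a Dockerfile
1. Create a Docker image 1.1. Create a directory. Create a directory to hold the source code and the Dockerfile. $ mkdir /home/joe/Templates/Image 1.2. Create the "hello.py" file. Create a Python file named hello.py. $ cd /home/joe/Templates/Image $ touch hello.py Write the code into hello.py. print...
Original · 2019-05-16 12:14:23 · 393 views · 0 comments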
Docker Image Creation (CPU Environment) --【A-4/4】Building Your Own Docker Image from a Container
1. Pull the base Docker image. Download the ubuntu image from the official Docker Hub. $ docker pull ubuntu Using default tag: latest latest: Pulling from library/ubuntu 6abc03819f3e: Pull complete 05731e63f211: Pull complete 0bd67c50d6be...
Original · 2019-05-16 15:11:03 · 774 views · 0 comments
MPI Distributed Programming (CPU Environment) --【B-1/4】OpenMPI Installation and Basic Usage
1. Installing OpenMPI
Original · 2019-05-16 10:55:57 · 9100 views · 0 comments
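MPI Distributed Programming (CPU Environment) --【B-2/4】Setting Up and Using a Two-Node OpenMPI Cluster
1. Introduction. The previous post covered installing OpenMPI and its basic use on a single node. Given two machines on the same private network with IP addresses 192.168.0.103 and 192.168.0.106, where 192.168.0.103 is the master node and 192.168.0.106 is the worker node, this post describes how to set up a two-node master-worker OpenMPI cluster and run some simple cluster tests...
Original · 2019-05-25 22:34:32 · 3242 views · 5 comments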
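MPI Distributed Programming (CPU Environment) --【B-3/4】OpenMPI Multi-Node Runtime Errors
1. OpenMPI multi-node runtime error. Problem description: node one (host3) uses mpirun to launch an MPI program on node two (host4) and gets the following error. $ mpirun -np 1 --host host4 ./main [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file ess_env_module.c at line 367 [[INVALID],IN...
Original · 2019-05-26 15:38:18 · 3792 views · 0 comments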
MPI Distributed Programming (CPU Environment) --【B-4/4】Client/Server Example Application
1. OpenMPI server-client https://www.mcs.anl.gov/research/projects/mpi/mpi-standard/mpi-report-2.0/node106.htm References [1. Example] https://www.mcs.anl.gov/research/projects/mpi/mpi-standard/mpi-report-2.0/...
Original · 2021-01-29 10:50:45 · 220 views · 0 comments
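Kubernetes Cluster Setup (CPU Environment) --【C-1/15】One-Click Connection to Intranet Machines through a Jump Host with XShell Tunnels
1. Open Xshell 6 2. Configure the Xshell session 2.1. Set the jump host's public address and port 2.2. Set the jump host's username and password 2.3. Configure the login script for the intranet machine; it runs on the jump host and, once it succeeds, logs in to the intranet machine. First, configure the login script; the Xshell 6 keyword for this feature is ogin: Next, configure the run script; the Xshell 6 keyword for this feature is assword:...
Original · 2019-05-14 16:36:20 · 613 views · 0 comments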
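Kubernetes Cluster Setup (CPU Environment) --【C-2/15】Installing a Minikube Cluster on a Single Machine
This post installs a Minikube cluster on a single Ubuntu 16.04 LTS machine. Minikube uses VT-d or VirtualBox to virtualize multiple hosts, then configures the host network environment to build a local Minikube cluster. 1. Install the Minikube dependencies 1.1. VirtualBox. We use VirtualBox as the example for virtualizing multiple hosts (the other way to virtualize hosts is VT-d), Ubuntu 16.04 L...
Original · 2019-04-30 12:01:31 · 665 views · 0 comments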
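Kubernetes Cluster Setup (CPU Environment) --【C-3/15】Installing a Kubernetes Cluster on Two Machines
1. Installation requirements: one or more machines running Ubuntu or CentOS; 2 GB or more of RAM; 2 or more CPUs; 30 GB or more of disk; network connectivity between all machines in the cluster; Internet access for pulling Docker images; swap disabled. 2. s...
Original · 2019-05-09 11:10:49 · 882 views · 0 comments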
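Kubernetes Cluster Setup (CPU Environment) --【C-4/15】Deploying the Dashboard Application on a Kubernetes Cluster
1. Confirm that the Kubernetes cluster is running normally. Check the pods in the kube-system namespace to confirm that the cluster is healthy. $ kubectl get pods -n kube-system NAME READY STATUS RESTARTS AGE coredns-fb8b8dccf-mpczg 0/1 Running ...
Original · 2019-05-14 16:09:26 · 417 views · 0 comments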
Kubernetes Cluster Setup (CPU Environment) --【C-5/15】Deploying OpenMPI
1. docker compose openmpi https://github.com/DaisukeMiyamoto/docker-openmpi 2. kubernetes openmpi https://github.com/DaisukeMiyamoto/docker-openmpi 2.1. login to master pod https://kubernetes.io/docs/...
Original · 2021-01-29 11:03:27 · 341 views · 0 comments
Kubernetes Cluster Setup (CPU Environment) --【C-6/15】Configuring NFS on a Kubernetes Cluster
Full version: https://medium.com/platformer-blog/nfs-persistent-volumes-with-kubernetes-a-case-study-ce1ed6e2c266
Original · 2019-06-24 16:54:44 · 256 views · 0 comments
Kubernetes Cluster Setup (CPU Environment) --【C-7/15】Deleting a Pod That "Cannot Be Deleted"
Delete the deployments first: https://github.com/ypapax/kubernetes/issues/3 How it works
Original · 2019-06-26 16:08:11 · 1419 views · 0 comments
Kubernetes Cluster Setup (CPU Environment) --【C-8/15】Kubernetes Service
1. About the run-my-nginx.yaml file. The source of "run-my-nginx.yaml" is as follows: apiVersion: apps/v1 kind: Deployment metadata: name: my-nginx spec: selector: matchLabels: label: my-nginx replicas: 3 template:...
Original · 2021-01-29 11:13:46 · 116 views · 0 comments
Kubernetes Cluster Setup (CPU Environment) --【C-9/15】OpenMPI Applications on a Kubernetes Cluster
1. Check the version on the mpi-master node. Check the OpenMPI version on the master node. $ kubectl exec -it mpi-master -- mpirun --version Run a command through the master node. $ kubectl exec -it mpi-master -- mpiexec sh -c 'echo $(hostname):hello'...
Original · 2021-01-29 11:19:56 · 258 views · 0 comments
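Kubernetes Cluster Setup (CPU Environment) --【C-10/15】Pod-to-Pod Communication
1. Pod-to-Pod Communication (via Pod IP) 1.1. Server. Run the service in a Kubernetes Pod. The Pod exposes virtual port 8080 so that two or more Pods can communicate over the network. The Pod's virtual IP address is assigned automatically by the Kubernetes Flannel virtual network. The server configuration file test-deployment.yaml is as follows: kind: Deployment a...
Original · 2021-01-29 11:21:48 · 139 views · 0 comments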
Kubernetes Cluster Setup (CPU Environment) --【C-11/15】OpenMPI/Horovod Service
1. Create the Master Pod and Worker Pods. Use the mpi-deployment-nodes.yaml configuration file to create 1 Master Pod and 8 Worker Pods. apiVersion: v1 kind: Pod metadata: name: mpi-master spec: containers: - name: mpi-master image:...
Original · 2021-01-29 11:27:28 · 230 views · 0 comments
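Kubernetes Cluster Setup (CPU Environment) --【C-12/15】Configuring an Existing Image and Deploying It to a Kubernetes Cluster
1. Steps for reconfiguring the image 1.1. Additional features: OpenSSH, container port 22 opened, the vim editor, user password configuration, and container background task settings. The Dockerfile is as follows: FROM blackjack2015/pytorch-cpu:v1.1 # Install openssh RUN apt-get update RUN apt-get install -y openssh-server RUN ap...
Original · 2021-01-29 11:29:28 · 162 views · 0 comments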
Kubernetes Cluster Setup (CPU Environment) --【C-13/15】Configuring SSH at Startup with Kubernetes Init Containers
First, the mpi-master node generates an SSH key pair at startup, as follows: apiVersion: v1 kind: Pod metadata: name: mpi-master spec: containers: - name: mpi-master image: canhui/pytorch-cpu:v1.8 command: ['sh', '-c', 'ssh-key...
Original · 2021-01-29 11:33:46 · 211 views · 0 comments
Kubernetes Cluster Setup (CPU Environment) --【C-14/15】Kubernetes Scheduler: Writing Your Own Scheduler
Writing your own scheduler https://medium.com/@sebgoa/kubernetes-scheduling-in-python-3588f4928b13 https://kubernetes.io/blog/2017/03/advanced-scheduling-in-kubernetes/
Original · 2021-01-29 11:41:59 · 177 views · 0 comments
Kubernetes Cluster Setup (CPU Environment) --【C-15/15】A Kubernetes Open MPI Distributed Task Scheduling Framework
1. Configure NFS. Reference: https://blog.youkuaiyun.com/Canhui_WANG/article/details/91547200 2. Working directory of the MPI image: /root/workspace. For the Dockerfile, see https://blog.youkuaiyun.com/Canhui_WANG/article/details/91493226 FROM blackjack2015/pyto...
Original · 2021-01-29 11:36:02 · 428 views · 0 comments
AllReduce Task Scheduling on Kubernetes (CPU+GPU Environment) --【D-1/14】Quickly Setting Up a Kubernetes v1.14.0 Cluster on Ubuntu 16.04 with Ansible Scripts
Abstract. This post sets up a Kubernetes v1.14.0 cluster with Ansible on Ubuntu 16.04 LTS. The cluster has 24 machines, not counting the Ansible server: one K8S master and 23 K8S workers. In our experiments, one A...
Original · 2021-02-04 10:31:59 · 201 views · 0 comments
AllReduce Task Scheduling on Kubernetes (CPU+GPU Environment) --【D-2/14】Installing Nvidia-Docker2, Method One
Abstract. This post uses NVIDIA GPUs inside Docker containers via Nvidia-Docker2 (i.e., the latest version of Nvidia-Docker). Nvidia-Docker2 can be installed in two ways: with Nvidia-Container-Toolkit, or without it. The...
Original · 2021-02-04 22:26:45 · 190 views · 0 comments
AllReduce Task Scheduling on Kubernetes (CPU+GPU Environment) --【D-3/14】Installing Nvidia-Docker2, Method Two
Abstract. This post uses NVIDIA GPUs inside Docker containers via Nvidia-Docker2 (i.e., the latest version of Nvidia-Docker). Nvidia-Docker2 can be installed in two ways: with Nvidia-Container-Toolkit, or without it. The f...
Original · 2021-02-05 10:20:12 · 174 views · 0 comments
AllReduce Task Scheduling on Kubernetes (CPU+GPU Environment) --【D-4/14】Installing the Nvidia Device Plugin
Abstract. Running Nvidia-Docker2 in Kubernetes requires the Nvidia Device Plugin. This post installs and tests it. 1. Prepare the Docker Environment on Every GPU Node. First, following the official Nvidia Device Plugin tutorial, we need to e...
Original · 2021-02-05 10:39:50 · 199 views · 0 comments
AllReduce Task Scheduling on Kubernetes (CPU+GPU Environment) --【D-5/14】Communication between Docker Containers via Open MPI (1/3)
Abstract. This post configures an Open MPI cluster. 1. Install OpenMPI. Section one installs Open MPI on each node of the cluster. First, skip section one if Open MPI v4.0.1 is already installed. Otherwise, continue. $ mpirun --version mpiru...
Original · 2021-02-05 10:58:13 · 256 views · 0 comments
AllReduce Task Scheduling on Kubernetes (CPU+GPU Environment) --【D-6/14】Communication between Docker Containers via Open MPI (2/3)
Abstract. This post configures an Open MPI cluster. 1. The Interaction Problem for Password Inputs. Suppose we have three machines: 192.168.0.120 (master), 192.168.0.103 (worker one), and 192.168.0.112 (worker two). A technical problem is that whenever...
Original · 2021-02-05 11:07:32 · 161 views · 0 comments
AllReduce Task Scheduling on Kubernetes (CPU+GPU Environment) --【D-7/14】Communication between Docker Containers via Open MPI (3/3)
Abstract. This post configures an Open MPI cluster. 1. The File Sharing Problem. Suppose we have three machines: 192.168.0.120 (master), 192.168.0.103 (worker one), and 192.168.0.112 (worker two). Our previous post enabled the master to send commands to...
Original · 2021-02-05 11:19:47 · 151 views · 0 comments
AllReduce Task Scheduling on Kubernetes (CPU+GPU Environment) --【D-8/14】Hello World Test on the Open MPI Cluster
Abstract. The Open MPI cluster is ready. This post tests the Open MPI cluster on multiple hosts. Note that we assume the remote NFS shared directory is mounted at the same local directory. In other words, for convenience, the...
Original · 2021-02-06 11:04:03 · 185 views · 1 comment
AllReduce Task Scheduling on Kubernetes (CPU+GPU Environment) --【D-9/14】Testing MPI_Send and MPI_Recv on the Open MPI Cluster
Abstract. The Open MPI cluster is ready. This post uses the MPI_Send() and MPI_Recv() functions on the Open MPI cluster. 1. Blocking vs Non-Blocking Communication. Open MPI supports two types of communication: Blocking and Non-Blocking communica...
Original · 2021-02-06 11:25:46 · 256 views · 0 comments
AllReduce Task Scheduling on Kubernetes (CPU+GPU Environment) --【D-10/14】Testing MPI_Bcast on the Open MPI Cluster
Abstract. The Open MPI cluster is ready. This post uses the MPI_Bcast() function on the Open MPI cluster. Tomorrow we will compare their complexity and source code implementations. 1. Implement Our Own MPI Broadcast Function. Se...
Original · 2021-02-07 10:14:17 · 209 views · 0 comments
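The excerpt above mentions implementing a broadcast by hand and comparing its complexity with MPI_Bcast. As a rough illustration of why that comparison matters, the sketch below counts communication rounds for two textbook strategies in plain Python, with no MPI involved. The function names are hypothetical, and this is neither the post's code nor a statement about which algorithm Open MPI actually selects.

```python
def linear_broadcast_rounds(n_ranks: int) -> int:
    """Root (rank 0) sends the value to every other rank, one send per round."""
    has_value = [False] * n_ranks
    has_value[0] = True
    rounds = 0
    for dst in range(1, n_ranks):
        has_value[dst] = True  # root -> dst
        rounds += 1
    assert all(has_value)
    return rounds


def tree_broadcast_rounds(n_ranks: int) -> int:
    """Tree-style broadcast: in each round, every rank that already holds
    the value forwards it to one rank that does not, so the number of
    holders at most doubles per round."""
    holders = 1
    rounds = 0
    while holders < n_ranks:
        holders = min(n_ranks, holders * 2)
        rounds += 1
    return rounds


if __name__ == "__main__":
    for n in (2, 8, 24):
        print(n, linear_broadcast_rounds(n), tree_broadcast_rounds(n))
```

Under these assumptions the linear version needs n - 1 rounds while the tree version needs about log2(n) rounds, which is the kind of gap a complexity comparison would surface.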
AllReduce Task Scheduling on Kubernetes (CPU+GPU Environment) --【D-11/14】Testing Reduce and AllReduce on the Open MPI Cluster
Abstract. The Open MPI cluster is ready. This post uses the Reduce() and AllReduce() functions on the Open MPI cluster. 1. MPI Reduce Function. MPI Reduce [1] is a function that reduces values on target processors to a single value in a single p...
Original · 2021-02-07 10:22:16 · 380 views · 0 comments
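To make the difference between the two collectives in the entry above concrete, here is a minimal sketch that simulates their semantics in plain Python, without MPI. Each list element stands for the value held by one rank. The helper names simulate_reduce and simulate_allreduce are hypothetical and are not part of Open MPI or of the original post.

```python
from functools import reduce as fold
import operator


def simulate_reduce(values, op=operator.add, root=0):
    """Reduce: combine one value per rank; only the root rank receives
    the result, every other rank gets nothing (None here)."""
    total = fold(op, values)
    return [total if rank == root else None for rank in range(len(values))]


def simulate_allreduce(values, op=operator.add):
    """AllReduce: combine one value per rank; every rank receives the result."""
    total = fold(op, values)
    return [total] * len(values)


if __name__ == "__main__":
    per_rank = [1, 2, 3, 4]  # value contributed by ranks 0..3
    print(simulate_reduce(per_rank))     # [10, None, None, None]
    print(simulate_allreduce(per_rank))  # [10, 10, 10, 10]
```

The simulation only shows where the result ends up. It says nothing about how a real implementation moves data between ranks.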
AllReduce Task Scheduling on Kubernetes (CPU+GPU Environment) --【D-12/14】Passwordless SSH Login between Docker Containers
Abstract. This post discusses passwordless SSH access among containers in a Docker-K8S-based cluster. 1. SSH Passwordless Access among Containers on Different Machines. Problem: In the context of Open MPI, when a K8S container is going to log in to other...
Original · 2021-02-07 10:34:47 · 208 views · 0 comments
AllReduce Task Scheduling on Kubernetes (CPU+GPU Environment) --【D-13/14】NFS Remote File Access
Abstract. This post discusses NFS file sharing in a Docker-K8S-based cluster. 1. File Sharing between Docker Containers on Different Machines. Problem: Sharing files between Docker containers on a single machine is easy. An easy solution is using Docker volumes [2]...
Original · 2021-02-07 10:40:02 · 189 views · 0 comments
AllReduce Task Scheduling on Kubernetes (CPU+GPU Environment) --【D-14/14】AllReduce Task Scheduling among Pods/Docker Containers in a Multi-Machine Distributed K8S Cluster
Abstract. This post implements an Open MPI AllReduce example among containers in a Docker-K8S-based cluster. 1. Combining SSH Passwordless and NFS. Our previous post discussed passwordless SSH among Kubernetes containers. Also, another previous post dis...
Original · 2021-02-07 10:52:44 · 252 views · 0 comments