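Introduction
I tried the distributed MNIST example that Kubeflow provides for tf-operator. The instructions on the official GitHub page are fairly terse, so this post records the detailed steps.
URL: https://github.com/kubeflow/tf-operator/tree/master/examples/tensorflow/distribution_strategy/keras-API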
Steps
- Download the code to the server
- Write the code and create the Dockerfile
FROM tensorflow/tensorflow:2.1.0-gpu-py3
RUN pip install tensorflow_datasets==2.1.0
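# The first path is on the host (relative to the build context), the second is inside the container.
# The script is copied to the container root, so the ENTRYPOINT below must refer to it by that path.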
COPY multi_worker_strategy-with-keras.py /
# Run the Python script with its command-line arguments
ENTRYPOINT ["python", "/multi_worker_strategy-with-keras.py", "--saved_model_dir", "/train/saved_model/", "--checkpoint_dir", "/train/checkpoint"]
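The ENTRYPOINT above passes two flags, `--saved_model_dir` and `--checkpoint_dir`, to the training script. As a rough illustration of how a script can receive them, here is a minimal sketch using `argparse`. Only the flag names and the two paths come from the Dockerfile; the `parse_args` helper, the `required=True` setting, and the help texts are assumptions, and the real script may handle its arguments differently.

```python
import argparse


def parse_args(argv=None):
    """Parse the two flags that the Dockerfile's ENTRYPOINT passes in.

    Flag names come from the ENTRYPOINT line; everything else here
    (required=True, help strings) is an assumption for illustration.
    """
    parser = argparse.ArgumentParser(description="Multi-worker Keras training (sketch)")
    parser.add_argument("--saved_model_dir", type=str, required=True,
                        help="Directory where the final SavedModel is written")
    parser.add_argument("--checkpoint_dir", type=str, required=True,
                        help="Directory where training checkpoints are written")
    return parser.parse_args(argv)


# Simulate the arguments supplied by the ENTRYPOINT.
args = parse_args(["--saved_model_dir", "/train/saved_model/",
                   "--checkpoint_dir", "/train/checkpoint"])
print(args.saved_model_dir, args.checkpoint_dir)
```

Both paths sit under `/train`, so that directory is the natural place to mount persistent storage if the saved model and checkpoints should outlive the container.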
- Build the image
docker build -f Dockerfile -t kubeflow/multi_worker_strategy:v1.0 .
- List the local images to confirm the build succeeded
docker images
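If you would rather check for the image from a script than read the `docker images` table by eye, something like the following sketch could work. It assumes `docker` is on the PATH and relies on the `--format` option of `docker images`; the helper names `parse_image_list` and `image_is_present` are hypothetical and not part of the example repository.

```python
import subprocess

# Tag used in the docker build step above.
IMAGE = "kubeflow/multi_worker_strategy:v1.0"


def parse_image_list(listing: str) -> set:
    """Turn `docker images --format '{{.Repository}}:{{.Tag}}'` output into a set of tags."""
    return {line.strip() for line in listing.splitlines() if line.strip()}


def image_is_present(image: str = IMAGE) -> bool:
    """Ask the local Docker daemon whether `image` exists (needs docker installed)."""
    result = subprocess.run(
        ["docker", "images", "--format", "{{.Repository}}:{{.Tag}}"],
        capture_output=True, text=True, check=True,
    )
    return image in parse_image_list(result.stdout)
```

Calling `image_is_present()` on the build machine should return `True` once the build step has completed without errors.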
- Create a PV (optional)
apiVersion: v1
kind: PersistentVolume
metadata:
  name: test-pv