Offline hybrid deployment, tf-operator, TensorFlow deployment, Prometheus + Grafana monitoring

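This article walks through an offline hybrid deployment of TensorFlow on an Alibaba Cloud Kubernetes cluster, covering two ways to deploy Kubeflow and the common errors encountered along the way. It also explains how to use Prometheus + Grafana for resource monitoring to keep the system running stably.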


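Deployment environment and requirements

Alibaba Cloud k8s cluster
k8s version: 1.14.6
Worker node configuration:
4 cores, 16 GB memory, 50 GB disk

Because the resource requirements are relatively high, I recommend deploying on Alibaba Cloud ECS or directly on a cluster to avoid running out of resources.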

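Deployment steps

1. K8s already set up (Docker, etc.)

2. Kubeflow deployment method 1:

2.1 Download the kfctl package and the source package

Download the source package

# either one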
wget https://github.com/kubeflow/manifests/archive/v1.0.2.tar.gz
wget https://github.com/kubeflow/kfctl/archive/v1.0.2.tar.gz

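I spent a lot of time on this step because the kfctl package downloads very slowly from GitHub.
You can run wget directly on the cloud server to pull from GitHub, but in my case the packages pulled this way came out incomplete.

Possible error 1: an incomplete package makes the kfctl command fail with: Segmentation fault

This error can also be caused by the disk not meeting the stated requirement.

The approach I finally settled on was to download from GitHub to my local machine first (a proxy helps, since the download is faster), then upload the files from the local machine to the corresponding directory on the Alibaba Cloud server.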
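
Since a truncated archive was the likely cause of the Segmentation fault, it can help to check the download before uploading it. The following is a minimal sketch, assuming the package is a gzip-compressed tarball; the function name `is_complete_tarball` is my own and is not part of kfctl or Kubeflow.

```python
import tarfile
import zlib


def is_complete_tarball(path):
    """Return True if every member of the .tar.gz at `path` can be read in full."""
    try:
        with tarfile.open(path, "r:gz") as archive:
            for member in archive:
                if not member.isfile():
                    continue
                stream = archive.extractfile(member)
                # Read each member to the end so a truncated tail is detected.
                while stream.read(1 << 20):
                    pass
        return True
    except (tarfile.TarError, EOFError, OSError, zlib.error):
        # Covers truncated, corrupted, and missing files.
        return False
```

Running `is_complete_tarball("v1.0.2.tar.gz")` on the local copy, and again on the server after the upload, gives a quick signal that nothing was lost in transit.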

Transfer with WinSCP

WinSCP played a major role throughout the Kubeflow installation.

Download the yaml file and modify it

wget https://raw.githubusercontent.com/kubeflow/manifests/v1.0-branch/kfdef/kfctl_k8s_istio.v1.0.2.yaml
# Edit the contents of kfctl_k8s_istio.v1.0.2.yaml
Change https://github.com/kubeflow/manifests/archive/v1.0.2.tar.gz to file:///root/kubeflow/v1.0.2.tar.gz
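
The same substitution can be scripted instead of done by hand. This is only a sketch: the two URIs come from the step above, while the helper names `point_to_local_manifests` and `rewrite_file` are hypothetical.

```python
REMOTE_URI = "https://github.com/kubeflow/manifests/archive/v1.0.2.tar.gz"
LOCAL_URI = "file:///root/kubeflow/v1.0.2.tar.gz"


def point_to_local_manifests(kfdef_text, remote=REMOTE_URI, local=LOCAL_URI):
    """Return the KfDef text with the remote manifests URI swapped for the local one."""
    if remote not in kfdef_text:
        # Fail loudly rather than silently leaving the remote URI in place.
        raise ValueError("remote manifests URI not found; nothing to replace")
    return kfdef_text.replace(remote, local)


def rewrite_file(path):
    """Apply the substitution to the file at `path`, in place."""
    with open(path, encoding="utf-8") as handle:
        original = handle.read()
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(point_to_local_manifests(original))
```

Calling `rewrite_file("kfctl_k8s_istio.v1.0.2.yaml")` in the download directory performs the edit. The explicit error guards against running it on a file that has already been changed or that uses a different version.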

2.2 Apply the yaml file

tar -xvf kfctl_v1.0.2_linux.tar.gz
export PATH=$PATH:"<path-to-kfctl>"
export KF_NAME=<your choice of name for the Kubeflow deployment>
export BASE_DIR=<path to a base directory>
export KF_DIR=${BASE_DIR}/${KF_NAME}
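# To apply the locally edited yaml file from the previous step, set CONFIG_URI to that file's local path instead of this URL.
export CONFIG_URI="https://raw.githubusercontent.com/kubeflow/manifests/v1.0-branch/kfdef/kfctl_k8s_istio.v1.0.2.yaml"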
mkdir -p ${KF_DIR}
cd ${KF_DIR}
kfctl apply -V -f ${CONFIG_URI}

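2.3 Create the PV and PVC

Described in detail under method 2

2.4 Build and pull the required images through Alibaba Cloud

GitHub link:
link.
After cloning, build the images in the Alibaba Cloud container registry and then pull them from Alibaba Cloud. This avoids the problem of the upstream registries being blocked.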

gcr.io/kubeflow-images-public/ingress-setup:latest
gcr.io/kubeflow-images-public/admission-webhook:v1.0.0-gaf96e4e3
gcr.io/kubeflow-images-public/kubernetes-sigs/application:1.0-beta
argoproj/argoui:v2.3.0
gcr.io/kubeflow-images-public/centraldashboard:v1.0.0-g3ec0de71
gcr.io/kubeflow-images-public/jupyter-web-app:v1.0.0-g2bd63238
gcr.io/kubeflow-images-public/katib/v1alpha3/katib-controller:v0.8.0
gcr.io/kubeflow-images-public/katib/v1alpha3/katib-db-manager:v0.8.0
mysql:8
gcr.io/kubeflow-images-public/katib/v1alpha3/katib-ui:v0.8.0
gcr.io/kubebuilder/kube-rbac-proxy:v0.4.0
gcr.io/kfserving/kfserving-controller:0.2.2
metacontroller/metacontroller:v0.3.0
mysql:8.0.3
gcr.io/kubeflow-images-public/metadata:v0.1.11
gcr.io/ml-pipeline/envoy:metadata-grpc
gcr.io/tfx-oss-public/ml_metadata_store_server:v0.21.1
gcr.io/kubeflow-images-public/metadata-frontend:v0.1.8
minio/minio:RELEASE.2018-02-09T22-40-05Z
gcr.io/ml-pipeline/api-server:0.2.5
gcr.io/ml-pipeline/visualization-server:0.2.5
gcr.io/ml-pipeline/persistenceagent:0.2.5
gcr.io/ml-pipeline/scheduledworkflow:0.2.5
gcr.io/ml-pipeline/frontend:0.2.5
gcr.io/ml-pipeline/viewer-crd-controller:0.2.5
mysql:5.6
gcr.io/kubeflow-images-public/notebook-controller:v1.0.0-gcd65ce25
gcr.io/kubeflow-images-public/profile-controller:v1.0.0-ge50a8531
gcr.io/kubeflow-images-public/kfam:v1.0.0-gf3e09203
gcr.io/kubeflow-images-public/pytorch-operator:v1.0.0-g047cf0f
docker.io/seldonio/seldon-core-operator:1.0.1
gcr.io/spark-operator/spark-operator:v1beta2-1.0.0-2.4.4
gcr.io/google_containers/spartakus-amd64:v1.1.0
tensorflow/tensorflow:1.8.0
gcr.io/kubeflow-images-public/tf_operator:v1.0.0-g92389064
argoproj/workflow-controller:v2.3.0
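
Once the images are built in a mirror registry, each one has to be pulled and then tagged back to its upstream name so the Kubeflow manifests can find it. The sketch below generates those commands for one image. The registry prefix `registry.cn-hangzhou.aliyuncs.com/my-namespace` is a hypothetical placeholder, and flattening the repository path to its last segment is my assumption about how the mirror repositories are named.

```python
MIRROR_PREFIX = "registry.cn-hangzhou.aliyuncs.com/my-namespace"  # hypothetical


def mirror_name(image, prefix=MIRROR_PREFIX):
    """Map an upstream image reference to a flat repository under `prefix`."""
    repo, sep, tag = image.rpartition(":")
    if not sep or "/" in tag:
        # No tag given, or the colon belonged to a registry port.
        repo, tag = image, "latest"
    # Keep only the last path segment, e.g. "katib/v1alpha3/katib-ui" -> "katib-ui".
    short = repo.rsplit("/", 1)[-1]
    return f"{prefix}/{short}:{tag}"


def retag_commands(image, prefix=MIRROR_PREFIX):
    """Commands that pull from the mirror and restore the upstream name locally."""
    mirrored = mirror_name(image, prefix)
    return [
        f"docker pull {mirrored}",
        f"docker tag {mirrored} {image}",
    ]
```

Looping `retag_commands` over the list above and running the output on every worker node leaves each node with locally cached images under their original names. Flattening works for this list because no two images share a final path segment and tag; a different list may need a different naming scheme.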

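2.5 Change the image pull policy of each Deployment and StatefulSet

The pull policy is Always, which makes every pod start contact the upstream registry. Change it by setting the value after imagePullPolicy to IfNotPresent, so that nodes use the images already pulled locally.
For example:
kubectl edit deploy <deployment-name> -n kubeflow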
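
Editing every workload by hand gets tedious, so here is a sketch that builds an equivalent `kubectl patch` command with a JSON Patch. It assumes the field is already set on the live object (it is Always here, so `replace` applies) and that the number of containers in each workload is known; the helper names are my own.

```python
import json
import shlex


def pull_policy_patch(container_count):
    """JSON Patch that sets imagePullPolicy on the first `container_count` containers."""
    return [
        {
            "op": "replace",
            "path": f"/spec/template/spec/containers/{index}/imagePullPolicy",
            "value": "IfNotPresent",
        }
        for index in range(container_count)
    ]


def patch_command(kind, name, container_count=1, namespace="kubeflow"):
    """Build a kubectl patch command line for one Deployment or StatefulSet."""
    patch = json.dumps(pull_policy_patch(container_count))
    return (
        f"kubectl patch {kind} {shlex.quote(name)} -n {namespace} "
        f"--type=json -p {shlex.quote(patch)}"
    )
```

For instance, `print(patch_command("deployment", "<deployment-name>"))` prints a command that can be reviewed before it is run. Generating the command rather than executing it keeps the change easy to inspect.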

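Deployment succeeded

I ran into many problems when deploying with the official method above, so I recommend the following method instead:

3. Kubeflow deployment method 2

3.1 kustomize

Method 1 installs with kfctl, but under the hood it is really installing with kustomize.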

git clone https://github.com/kubeflow/manifests
cd manifests
git checkout v0.6-branch
cd <target>/base
kubectl kustomize . | tee <output file>
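
To find out which images the rendered output refers to, the file written by `tee` can be scanned for `image:` lines. This is a rough sketch that uses a regular expression rather than a YAML parser, so it assumes each image reference sits on a single line; the function name is hypothetical.

```python
import re

# Matches "image: ref" and "- image: ref", with optional quotes around the ref.
IMAGE_LINE = re.compile(
    r"^[ \t]*(?:-[ \t]+)?image:[ \t]*['\"]?([^\s'\"]+)['\"]?[ \t]*$",
    re.MULTILINE,
)


def images_in_manifest(rendered_yaml):
    """Return the distinct image references in rendered YAML, in first-seen order."""
    seen = []
    for match in IMAGE_LINE.finditer(rendered_yaml):
        if match.group(1) not in seen:
            seen.append(match.group(1))
    return seen
```

Passing the contents of the output file to `images_in_manifest` yields a de-duplicated list of the images that need to be mirrored for that component.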

3.2 Change the kustomize images

grc_image = [
"gcr.io/kubeflow-images-public/ingress-setup:latest",
"gcr.io/kubeflow-images-public/admission-webhook:v20190520-v0-139-gcee39dbc-dirty-0d8f4c",
"gcr.io/kubeflow-images-public/kubernetes-sigs/application:1.0-beta",
"gcr.io/kubeflow-images-public/centraldashboard:v20190823-v0.6.0-rc.0-69-gcb7dab59",
"gcr.io/kubeflow-images-public/jupyter-web-app:9419d4d",
"gcr.io/kubeflow-images-public/katib/v1alpha2/katib-controller:v0.6.0-rc.0",
"gcr.io/kubeflow-images-public/katib/v1alpha2/katib-manager:v0.6.0-rc.0",
"gcr.io/kubeflow-images-public/katib/v1alpha2/katib-manager-rest:v0.6.0-rc.0",
"gcr.io/kubeflow-images-public/katib/v1alpha2/suggestion-bayesianoptimization:v0.6.0-rc.0",
"gcr.io/kubeflow-images-public/katib/v1alpha2/suggestion-grid:v0.6.0-rc.0",
"gcr.io/kubeflow-images-public/katib/v1alpha2/suggestion-hyperband:v0.6.0-rc.0",
"gcr.io/kubeflow-images-public/katib/v1alpha2/suggestion-nasrl:v0.6.0-rc.0",
"gcr.io/kubeflow-images-public/katib/v1alpha2/suggestion-random:v0.6.0-rc.0",
"gcr.io/kubeflow-images-public/katib/v1alpha2/katib-ui:v0.6.0-rc.0",
"gcr.io/kubeflow-images-public/metadata:v0.1.8",
"gcr.io/kubeflow-images-public/metadata-frontend:v0.1.8",
"gcr.io/ml-pipeline/api-server:0.1.23",
"gcr.io/ml-pipeline/persistenceagent:0.1.23",
"gcr.io/ml-pipeline/scheduledworkflow:0.1.23",
"gcr.io/ml-pipeline/frontend:0.1.23",
"gcr.io/ml-pipeline/viewer-crd-controller:0.1.23",
"gcr.io/kubeflow-images-public/notebook-controller:v2019060