Pachyderm automatic annotation experiment

Latest recommended article published on 2025-12-24 05:27:57

License: CC 4.0 BY-SA

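This article describes in detail how to set up Pachyderm on Kubernetes v1.14.2 under CentOS 7.7, including creating a namespace, installing pachctl, deploying, and the data operation workflow. It then covers the technical details of using Pachyderm for image processing and automatic annotation.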

Deployment

Environment

Operating system: CentOS 7.7
Kubernetes version: 1.14.2
Kubernetes cluster: 1 master (192.168.236.66) + 2 nodes (192.168.236.67, 192.168.236.68)

Installation

Step 1: Create a namespace

kubectl create namespace ht
# List namespaces
kubectl get namespace

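Step 2: Install pachctl

pachctl is effectively a control program: a standalone executable that drives k8s to create the Pachyderm pods, deployments, and related resources. Once installed, it can also be used for data operations.

# Download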
curl -o /tmp/pachctl.tar.gz -L https://github.com/pachyderm/pachyderm/releases/download/v1.8.1/pachctl_1.8.1_linux_amd64.tar.gz

# Extract
tar -xvf /tmp/pachctl.tar.gz -C /tmp

# Copy the executable
cp /tmp/pachctl_1.8.1_linux_amd64/pachctl /usr/local/bi

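Step 3: Deploy

Because of network conditions in China, running the command below directly may cause image pulls inside the pods to fail. It is recommended to import the images on the master node first; images are imported the same way as with docker. I downloaded the images via 233.250.

# Local deployment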
pachctl deploy local --namespace ht

# List all k8s resources in the ht namespace
kubectl get all -n ht

Other commands

# Remove the deployment

pachctl undeploy --namespace ht

# Run the port-forward command
pachctl port-forward

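This command always failed when I ran it, but judging from the docker containers the port forwarding itself works and nothing later depends on it, so I ignored it for now.

Web UI

Open http://192.168.236.66:30080/. After registering, the UI can be used for a 15-day trial period. This does not affect command-line operations.

The interface looks as follows (screenshot not preserved):

Data operations

Create a repository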

pachctl create-repo images


# List repositories
pachctl list-repo

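Commit a file

Possibly because of a version mismatch, the command given in the official documentation does not work. Use the following command to commit instead. If downloading the image times out, the URL at the end can be replaced with a local file.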

pachctl put-file images master liberty.png -f http://imgur.com/46Q8nDz.png
# The repository size should now have changed
pachctl list-repo

# List the commit history
pachctl list-commit images

# List the files
pachctl list-file images master

Export a file

# Export the image
pachctl get-file images master liberty.png > /tmp/test.png
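
The data operations above can also be scripted. The following is a minimal sketch, assuming pachctl 1.8.x is on the PATH. The helper name pachctl_argv and the local path ./liberty.png are illustrative and not part of Pachyderm. It only builds and prints the argument lists used in this section, so it is a dry run.

```python
import shlex


def pachctl_argv(*args):
    """Build an argument list for one pachctl call (hypothetical helper)."""
    return ["pachctl", *args]


# The same commands as in this section, in the order they are run.
# "./liberty.png" is an assumed local path that replaces the URL.
commands = [
    pachctl_argv("create-repo", "images"),
    pachctl_argv("put-file", "images", "master", "liberty.png", "-f", "./liberty.png"),
    pachctl_argv("list-commit", "images"),
    pachctl_argv("list-file", "images", "master"),
    pachctl_argv("get-file", "images", "master", "liberty.png"),
]

for argv in commands:
    # Dry run: print the command line instead of executing it.
    print(shlex.join(argv))
```

To actually execute a command, each list can be passed to subprocess.run(argv, check=True). For get-file, the captured stdout would then have to be written to the target file, since the shell redirection is not part of the argument list.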

Create a pipeline

pachctl create-pipeline -f https://raw.githubusercontent.com/pachyderm/pachyderm/master/examples/opencv/edges.json

A local configuration file can also be used:

pachctl create-pipeline -f edges.json

The contents of edges.json are as follows:

{
  "pipeline": {
    "name": "edges"
  },
  "description": "A pipeline that performs image edge detection by using the OpenCV library.",
  "input": {
    "pfs": {
      "glob": "/*",
      "repo": "images"
    }
  },
  "transform": {
    "cmd": [ "python3", "/edges.py" ],
    "image": "pachyderm/opencv"
  }
}

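The image referenced here, "image": "pachyderm/opencv", has already been published as an official image and can be pulled directly. It can also be built from the git source. It is recommended to pull it on the master first.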
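
A pipeline spec like this can also be generated from a script. The following is a minimal sketch, assuming only the fields that appear in edges.json; the function name build_pipeline_spec is illustrative and not part of any Pachyderm library.

```python
import json


def build_pipeline_spec(name, repo, cmd, image, glob="/*", description=""):
    """Return a pipeline spec dict with the same fields as edges.json."""
    return {
        "pipeline": {"name": name},
        "description": description,
        "input": {"pfs": {"glob": glob, "repo": repo}},
        "transform": {"cmd": list(cmd), "image": image},
    }


edges_spec = build_pipeline_spec(
    name="edges",
    repo="images",
    cmd=["python3", "/edges.py"],
    image="pachyderm/opencv",
    description="A pipeline that performs image edge detection by using the OpenCV library.",
)

# Serialize to JSON text; saved to a file, it can be passed to create-pipeline -f.
edges_json = json.dumps(edges_spec, indent=2)
print(edges_json)
```

Only the name, command, and image change between pipelines that read from the same repository, so the same helper could produce the spec of a second pipeline.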

The script edges.py is as follows:

import cv2
import numpy as np
from matplotlib import pyplot as plt
import os

# make_edges reads an image from /pfs/images and outputs the result of running
# edge detection on that image to /pfs/out. Note that /pfs/images and
# /pfs/out are special directories that Pachyderm injects into the container.
def make_edges(image):
    img = cv2.imread(image)
    tail = os.path.split(image)[1]
    edges = cv2.Canny(img, 100, 200)
    plt.imsave(os.path.join("/pfs/out", os.path.splitext(tail)[0] + '.png'), edges, cmap='gray')

# walk /pfs/images and call make_edges on every file found
for dirpath, dirs, files in os.walk("/pfs/images"):
    for file in files:
        make_edges(os.path.join(dirpath, file))

This is an ordinary Python script: it reads an image, extracts the edges with the Canny operator, and writes the resulting image.
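
The way the script derives the output file name from the input path can be isolated as follows. This is a small sketch of that one step only; the function name output_path and the example paths are illustrative. posixpath is used instead of os.path so that the result is the same on every platform.

```python
import posixpath


def output_path(input_path, out_dir="/pfs/out", ext=".png"):
    """Map an input file path to its output path, the way edges.py does."""
    tail = posixpath.split(input_path)[1]   # file name without directories
    stem = posixpath.splitext(tail)[0]      # drop the original extension
    return posixpath.join(out_dir, stem + ext)


print(output_path("/pfs/images/liberty.png"))
print(output_path("/pfs/images/photos/cat.jpg"))
```

Because only the file name is kept, two input files with the same name in different subdirectories would map to the same output file.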

List pipelines

pachctl list-pipeline

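Automatic annotation with Pachyderm

The automatic annotation experiment with a Pachyderm pipeline consists of the following steps:
1. Write a script that submits data to Pachyderm.
2. Write the automatic annotation script. The algorithm uses traditional image segmentation and can annotate simple images automatically.
3. Build the docker image for automatic annotation.
4. Configure the pipeline and bring it online.
5. Generate and store the commit records of the original images and the annotation data.

Submit data

Data can be submitted from the console in a way similar to git or svn, or with a Python script such as the one below. The most user-friendly way is probably the web UI.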

import python_pachyderm


client = python_pachyderm.Client('192.168.236.66')


# Path of the file to submit
file_path = "pfs/images/cat.jpg"
file = open(file_path, 'rb')


# Submit the file as byte data
with client.commit('images', 'master') as c:
    client.put_file_bytes(c, 'cat_annotation8.jpg', fil
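
The submission step can be wrapped in a function so that the file handle is closed before the commit is opened. The following is a sketch, not the original script: it reuses the client.commit and client.put_file_bytes calls shown above and assumes put_file_bytes accepts a bytes value. The names submit_file and FakeClient are hypothetical; FakeClient only stands in for python_pachyderm.Client so the logic can be tried without a cluster.

```python
from contextlib import contextmanager


def submit_file(client, repo, branch, local_path, dest_name):
    """Read a local file and put its bytes into a new commit on repo/branch."""
    # Read the whole file and close the handle before talking to the cluster.
    with open(local_path, "rb") as f:
        data = f.read()
    # Same calls as in the script above: open a commit, then put the bytes.
    with client.commit(repo, branch) as c:
        client.put_file_bytes(c, dest_name, data)
    return len(data)


class FakeClient:
    """Hypothetical stand-in for python_pachyderm.Client, for dry runs only."""

    def __init__(self):
        self.files = {}

    @contextmanager
    def commit(self, repo, branch):
        # The fake "commit" is just the (repo, branch) pair.
        yield (repo, branch)

    def put_file_bytes(self, commit, path, value):
        # Record what would have been uploaded.
        self.files[(commit, path)] = value
```

Against a real cluster, the client would be python_pachyderm.Client('192.168.236.66') as in the script above, for example submit_file(client, 'images', 'master', 'pfs/images/cat.jpg', 'cat_annotation8.jpg').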

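Automatic annotation script

The algorithm uses traditional mathematical methods and only works on simple images with a clear foreground and background. Its logic is as follows: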

original image -> grayscale -> Canny -> Gaussian denoising -> binarization -> morphological operators -> bounding box
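
The last stage, turning the detected box into a label, can be illustrated without OpenCV. This is a minimal sketch, assuming the box is given as four (x, y) corner points, which is the layout cv2.boxPoints returns. The function name box_to_label and the sample coordinates are made up for illustration.

```python
def box_to_label(box):
    """Convert four (x, y) corner points into an 'xmin ymin xmax ymax' label."""
    xs = [int(x) for x, _ in box]   # all x coordinates
    ys = [int(y) for _, y in box]   # all y coordinates
    return f"{min(xs)} {min(ys)} {max(xs)} {max(ys)}"


# Corner points of a rotated rectangle (illustrative values).
corners = [(10, 40), (30, 20), (60, 50), (40, 70)]
print(box_to_label(corners))
```

The label is the axis-aligned rectangle that encloses the rotated one, so the x and y extremes have to be taken over all four corners separately.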

annotation.py

Note that the input and output locations in the script are fixed by Pachyderm and cannot be changed.

import cv2
import numpy as np
from matplotlib import pyplot as plt
import os


def make_edges(image):
    img = cv2.imread(image)
    tail = os.path.split(image)[1]


    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    canny = cv2.Canny(gray, 100, 200)
    blur = cv2.GaussianBlur(canny, (3, 3), 0)
    (_, thresh) = cv2.threshold(blur, 90, 255, cv2.THRESH_BINARY)


    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (50, 50))
    closed = cv2.morphologyEx(thresh, cv2.MORPH_CLOSE, kernel)


    closed = cv2.erode(closed, None, iterations=4)
    closed = cv2.dilate(closed, None, iterations=4)


    # OpenCV 3.x returns (image, contours, hierarchy); OpenCV 4.x returns only (contours, hierarchy)
    a, contours, hierarchy = cv2.findContours(
        closed.copy(),
        cv2.RETR_LIST,
        cv2.CHAIN_APPROX_SIMPLE
    )
    c = sorted(contours, key=cv2.contourArea, reverse=True)[0]
    rect = cv2.minAreaRect(c)
    box = np.int0(cv2.boxPoints(rect))
    # box is a 4x2 array of (x, y) corners: column 0 holds x, column 1 holds y
    xmin = min(box[:, 0])
    ymin = min(box[:, 1])
    xmax = max(box[:, 0])
    ymax = max(box[:, 1])
    label = str(xmin) + " " + str(ymin) + " " + str(xmax) + " " + str(ymax)
    draw_img = cv2.drawContours(img.copy(), [box], -1, (0, 0, 255), 3)
    plt.imsave(os.path.join("/pfs/out", os.path.splitext(tail)[0] + '.png'), draw_img, cmap='gray')
    with open(os.path.join("/pfs/out", os.path.splitext(tail)[0] + '.txt'), 'w') as file:
        file.write(label)




# walk /pfs/images and call make_edges on every file found
for dirpath, dirs, files in os.walk("/pfs/images"):
    for file in files:
        make_edges(os.path.join(dirpath,

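Build the docker image for automatic annotation

In my experiment I obtained the image by starting a container from the pachyderm/opencv image and modifying it. It can also be built from the Dockerfile below, although building from a Dockerfile is not recommended on networks in China.

The details of packaging the image with docker are omitted. The image name is pachyderm/annotation.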

docker build -t pachyderm/annotation .

Dockerfile

Note that annotation.py must be in the same directory as the Dockerfile.

FROM ubuntu:18.04




# Install opencv and matplotlib.
RUN export DEBIAN_FRONTEND=noninteractive; \
    export DEBCONF_NONINTERACTIVE_SEEN=true; \
    echo 'tzdata tzdata/Areas select Etc' | debconf-set-selections; \
    echo 'tzdata tzdata/Zones/Etc select UTC' | debconf-set-selections; \
    apt-get update -qqy \
    && apt-get install -qqy make git pkg-config libswscale-dev python3-dev \
    	python3-numpy python3-tk libtbb2 libtbb-dev libjpeg-dev libpng-dev \
    	libtiff-dev bpython python3-pip libfreetype6-dev wget unzip cmake \
    	sudo \
    && apt-get clean \
    && rm -rf /var/lib/apt




RUN cd \
	&& wget https://github.com/Itseez/opencv/archive/3.4.5.zip \
	&& unzip 3.4.5.zip \
	&& cd opencv-3.4.5 \
	&& mkdir build \
	&& cd build \
	&& cmake .. \
	&& make -j \
	&& make install \
	&& cd \
	&& rm 3.4.5.zip \
    && rm -rf opencv-3.4.5
RUN python3 --version && pip3 --version && sudo pip3 install matplotlib




# Add our own code.
ADD annotation.py /annotati

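Configure the pipeline and bring it online

The configuration is shown below. Note that line 9 specifies the input repository, line 13 is the command the pipeline runs, and line 14 is the image the pipeline uses. A pipeline is essentially a container started from that image; running docker ps on node2 shows it.

(screenshot of the docker ps output not preserved)

The container outlined in red in the screenshot is the pipeline's running container. If something goes wrong while it runs, you can enter the container to debug.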

annotation.json

{
  "pipeline": {
    "name": "annotation"
  },
  "description": "A pipeline that performs image edge detection by using the OpenCV library.",
  "input": {
    "pfs": {
      "glob": "/*",
      "repo": "images"
    }
  },
  "transform": {
    "cmd": [ "python3", "/annotation.py" ],
    "image": "pachyderm/annotation"
  }
}

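Generate the original images and annotation data

Run the data submission script from step 1 to submit the cat picture shown above. When it finishes, the commit records, the data, and the automatically generated annotation files are visible in the Pachyderm front-end page (14-day trial period).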

The original images are in the images repository and the annotation files are in the annotation repository.
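
The .txt files written by annotation.py hold a single line in the form xmin ymin xmax ymax. A minimal sketch of reading such a label back follows; the function names parse_label and box_size and the sample values are illustrative.

```python
def parse_label(text):
    """Parse an 'xmin ymin xmax ymax' label into a tuple of four ints."""
    parts = text.split()
    if len(parts) != 4:
        raise ValueError(f"expected 4 values, got {len(parts)}")
    xmin, ymin, xmax, ymax = (int(p) for p in parts)
    return xmin, ymin, xmax, ymax


def box_size(label):
    """Return (width, height) of the box described by a parsed label."""
    xmin, ymin, xmax, ymax = label
    return xmax - xmin, ymax - ymin


label = parse_label("10 20 60 70")
print(label, box_size(label))
```

A downstream consumer could export each .txt file from the annotation repository with get-file and pass its contents to parse_label.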