Prepare Multi-Label Image Data in 3 Steps: Building an Object Detection Dataset with 🤗 datasets - CSDN Blog



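Still struggling with multi-label image data? Bounding boxes that do not line up with their images, coordinates that drift after augmentation, awkward format conversions? This article walks through three core steps for preparing an object detection dataset with 🤗 datasets, so you can stop adjusting annotations by hand. By the end you will know how to keep annotations aligned with their images, apply augmentations across a dataset, and work with COCO-format annotations.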

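Dataset Structure

An object detection dataset usually consists of image files plus their annotations. In 🤗 datasets, the ImageFolder builder loads this kind of data quickly. It supports many image formats, including the common .jpg and .png as well as more specialized ones such as .tiff and .webp; the full list is in the IMAGE_EXTENSIONS definition in the source code.

A typical object detection dataset is laid out like this: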

dataset_root/
├── images/
│   ├── img001.jpg
│   ├── img002.png
│   └── ...
└── annotations.json  # COCO-format annotation file
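
ImageFolder picks up per-image metadata from a metadata.csv or metadata.jsonl file placed alongside the images, so a COCO-style annotations.json generally has to be regrouped by image first. The following is a minimal sketch of that regrouping, assuming the standard COCO keys (images, annotations, image_id, category_id). The function name coco_to_records and the tiny sample are hypothetical, and the output key names are chosen to match the fields used in this article.

```python
import json
from collections import defaultdict


def coco_to_records(coco):
    """Group COCO annotations by image, producing one record per image."""
    by_image = defaultdict(lambda: {"bbox": [], "category": []})
    for ann in coco["annotations"]:
        entry = by_image[ann["image_id"]]
        entry["bbox"].append(ann["bbox"])  # COCO boxes are [x, y, width, height]
        entry["category"].append(ann["category_id"])

    records = []
    for img in coco["images"]:
        # Images without any annotation get empty lists.
        objects = by_image.get(img["id"], {"bbox": [], "category": []})
        records.append({
            "file_name": img["file_name"],
            "width": img["width"],
            "height": img["height"],
            "objects": objects,
        })
    return records


# Hypothetical sample standing in for the contents of annotations.json.
sample_coco = {
    "images": [
        {"id": 1, "file_name": "img001.jpg", "width": 640, "height": 480},
        {"id": 2, "file_name": "img002.png", "width": 800, "height": 600},
    ],
    "annotations": [
        {"image_id": 1, "bbox": [10, 20, 30, 40], "category_id": 0},
        {"image_id": 1, "bbox": [5, 5, 10, 10], "category_id": 2},
    ],
}

records = coco_to_records(sample_coco)
# One JSON object per line, the layout a metadata.jsonl file uses.
metadata_jsonl = "\n".join(json.dumps(r) for r in records)
```

In a real project, the resulting text would be written to a metadata.jsonl file next to the images, with file_name values relative to that file.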

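The object_detection.mdx page of the official documentation describes the dataset's fields in detail. The main ones are:

image: a PIL image object
image_id: unique identifier of the image
height/width: image dimensions
objects: a dictionary of annotations, including bounding boxes (bbox) and categories (category)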
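
To make the field layout concrete, here is a small sketch of one record and a sanity check on it. The values are made up for illustration, the image field is left out so the snippet runs without PIL, and validate_example is a hypothetical helper rather than part of 🤗 datasets.

```python
def validate_example(example):
    """Return a list of human-readable problems found in one record."""
    problems = []
    objects = example["objects"]
    if len(objects["bbox"]) != len(objects["category"]):
        problems.append("bbox and category lengths differ")
    for i, (x, y, w, h) in enumerate(objects["bbox"]):
        if w <= 0 or h <= 0:
            problems.append(f"box {i} has non-positive size")
        elif x < 0 or y < 0 or x + w > example["width"] or y + h > example["height"]:
            problems.append(f"box {i} extends outside the image")
    return problems


# Hypothetical record; boxes use the COCO [x, y, width, height] layout.
example_record = {
    "image_id": 15,
    "width": 640,
    "height": 480,
    "objects": {
        "bbox": [[10, 20, 100, 50], [600, 400, 80, 100]],
        "category": [0, 1],
    },
}

problems = validate_example(example_record)
```

Running a check like this before augmentation helps catch annotations that would otherwise fail or silently shift once the image is transformed.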

Loading and Visualizing the Data

First install the required dependencies:

pip install -U datasets albumentations opencv-python torch torchvision

Load a sample dataset with load_dataset. This example uses CPPE-5, a dataset for detecting medical personal protective equipment (PPE):

from datasets import load_dataset

ds = load_dataset("cppe-5")
example = ds['train'][0]
print(example)

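Each loaded example contains an image and its annotations. The bounding boxes can be drawn with torchvision's visualization utilities. The boxes are stored in COCO format (x, y, width, height), so they are first converted to the (x1, y1, x2, y2) corner format that draw_bounding_boxes expects: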

import torch
from torchvision.ops import box_convert
from torchvision.utils import draw_bounding_boxes
from torchvision.transforms.functional import pil_to_tensor, to_pil_image

categories = ds['train'].features['objects'].feature['category']
boxes_xywh = torch.tensor(example['objects']['bbox'])
boxes_xyxy = box_convert(boxes_xywh, 'xywh', 'xyxy')
labels = [categories.int2str(x) for x in example['objects']['category']]

img_with_boxes = draw_bounding_boxes(
    pil_to_tensor(example['image']),
    boxes_xyxy,
    colors="red",
    labels=labels
)
to_pil_image(img_with_boxes)
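
The box_convert call above is doing simple arithmetic on each box. A dependency-free sketch of the same xywh to xyxy conversion, with hypothetical helper names, looks like this:

```python
def xywh_to_xyxy(box):
    """Convert [x, y, width, height] to [x_min, y_min, x_max, y_max]."""
    x, y, w, h = box
    return [x, y, x + w, y + h]


def xyxy_to_xywh(box):
    """Convert [x_min, y_min, x_max, y_max] back to [x, y, width, height]."""
    x1, y1, x2, y2 = box
    return [x1, y1, x2 - x1, y2 - y1]


corner_box = xywh_to_xyxy([10, 20, 30, 40])
round_trip = xyxy_to_xywh(corner_box)
```

Mixing up the two layouts is a common source of boxes that look shifted or stretched when drawn, so it helps to convert explicitly at each boundary.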

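Data Augmentation with Synchronized Coordinates

When augmenting with the Albumentations library, the key is to make sure the bounding box coordinates are updated together with the image. Define a transform pipeline that includes the bounding box parameters: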

import albumentations
import numpy as np

transform = albumentations.Compose([
    albumentations.Resize(480, 480),
    albumentations.HorizontalFlip(p=1.0),
    albumentations.RandomBrightnessContrast(p=1.0),
], bbox_params=albumentations.BboxParams(format='coco', label_fields=['category']))

Apply the transform to a single example:

image = np.array(example['image'])
out = transform(
    image=image,
    bboxes=example['objects']['bbox'],
    category=example['objects']['category']
)
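
To see what "synchronized" means for the geometric transforms in this pipeline, the sketch below works through the box arithmetic for a resize followed by a horizontal flip. It only illustrates the geometry under simple assumptions; it is not how Albumentations is implemented internally, and the helper names and sample numbers are hypothetical.

```python
def resize_bbox(box, src_size, dst_size):
    """Scale a COCO [x, y, w, h] box from src (width, height) to dst (width, height)."""
    (src_w, src_h), (dst_w, dst_h) = src_size, dst_size
    sx, sy = dst_w / src_w, dst_h / src_h
    x, y, w, h = box
    return [x * sx, y * sy, w * sx, h * sy]


def hflip_bbox(box, image_width):
    """Mirror a COCO [x, y, w, h] box around the vertical centre line of the image."""
    x, y, w, h = box
    # The old right edge (x + w) becomes the new left edge after mirroring.
    return [image_width - x - w, y, w, h]


# Hypothetical box in a 960x1920 (width x height) image, resized to 480x480.
resized = resize_bbox([100, 200, 200, 400], (960, 1920), (480, 480))
flipped = hflip_bbox(resized, 480)
```

RandomBrightnessContrast only changes pixel values, so it leaves the boxes untouched.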

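To handle batches, wrap the pipeline in a function and attach it to the dataset with set_transform, which applies it on the fly whenever examples are accessed: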

def transforms(examples):
    images, bboxes, categories = [], [], []
    for image, objects in zip(examples['image'], examples['objects']):
        image = np.array(image.convert("RGB"))
        out = transform(
            image=image,
            bboxes=objects['bbox'],
            category=objects['category']
        )
        images.append(torch.tensor(out['image']).permute(2, 0, 1))
        bboxes.append(torch.tensor(out['bboxes']))
        categories.append(out['category'])
    return {'image': images, 'bbox': bboxes, 'category': categories}

ds['train'].set_transform(transforms)
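
Because each image can contain a different number of boxes, the per-example bbox tensors produced above have different shapes and cannot simply be stacked into one batch tensor. A minimal sketch of a collate function that keeps them as per-image lists follows. collate_detection is a hypothetical helper, and plain Python lists stand in for tensors so the snippet runs without PyTorch.

```python
def collate_detection(batch):
    """Combine per-example dicts into a batch, keeping variable-length fields as lists."""
    return {
        "image": [item["image"] for item in batch],
        "bbox": [item["bbox"] for item in batch],
        "category": [item["category"] for item in batch],
    }


# Hypothetical transformed examples: the first has two boxes, the second has one.
batch = collate_detection([
    {"image": "img_a", "bbox": [[0, 0, 5, 5], [1, 1, 2, 2]], "category": [0, 3]},
    {"image": "img_b", "bbox": [[4, 4, 6, 6]], "category": [1]},
])
```

A function of this shape could be passed as the collate_fn argument of torch.utils.data.DataLoader; since every image here is resized to 480x480, the image list could additionally be stacked into a single tensor.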

Advanced Usage and Best Practices

For large datasets, streaming is recommended to save memory:

ds = load_dataset("cppe-5", streaming=True)
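
In streaming mode each split is an iterable dataset, so examples are consumed by iterating rather than by indexing such as ds['train'][0]. The sketch below shows the idea of peeking at the first few examples without reading the rest. fake_stream is a hypothetical stand-in for a streamed split so the snippet runs offline, and take is an illustrative helper.

```python
from itertools import islice


def take(iterable, n):
    """Return the first n items of any iterable without consuming the rest."""
    return list(islice(iterable, n))


def fake_stream():
    # Hypothetical stand-in for a streamed split: yields records lazily, one at a time.
    for i in range(1_000_000):
        yield {"image_id": i}


first_three = take(fake_stream(), 3)
# With the real streamed dataset, the equivalent call would be: take(ds["train"], 3)
```

Streamed splits also provide their own take(n) method, which returns a new iterable dataset limited to the first n examples.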

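For custom data, see create_dataset.mdx for how to build a custom dataset loading script. To integrate with frameworks such as PyTorch Lightning, use the approach described in use_with_pytorch.mdx.

Once the data is processed, follow the guide in share.mdx to share the dataset on the Hugging Face Hub, which makes team collaboration and model training easier.

Tip: for more advanced techniques, see the official object detection tutorial, which covers practical topics such as multi-label handling and computing evaluation metrics.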

Disclosure: parts of this article were generated with AI assistance (AIGC) and are provided for reference only.