YOLO dataset preparation tool: a companion to make-sense for organizing annotated data and splitting it into training and validation sets


Tags:

#YOLO #yolo11 #make-sense #data-annotation #object-tracking #object-detection #computer-vision

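Background

When using YOLO for object detection, one situation comes up again and again: after annotating data with a tool such as make-sense, the loose images and label files have to be arranged into the layout that YOLO training expects and split sensibly into a training set and a validation set. Doing this by hand is tedious and error-prone.

To solve this, I wrote a command-line tool that helps to:

Organize and preview annotated images by class
Pick validation samples conveniently
Build a YOLO-format dataset automatically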

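Features

Preview by class

Groups annotated images into one folder per class
Handles images that contain several classes (such an image appears in every matching class folder)
Uses symbolic links on Linux and macOS, which take almost no extra disk space; on Windows the images are copied instead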
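The link-or-copy idea can be sketched as a small helper. This is only an illustration under my own assumptions, not the code the tool uses: the name link_or_copy is hypothetical, and instead of branching on the operating system it tries a symlink first and falls back to a copy when the platform refuses.

```python
import os
import shutil


def link_or_copy(src_path, dst_path):
    """Place src_path at dst_path as a symlink if possible, otherwise as a copy.

    Returns "symlink" or "copy" so the caller can report what happened.
    (Hypothetical helper for illustration.)
    """
    # lexists() is also true for a dangling symlink, which exists() would miss.
    if os.path.lexists(dst_path):
        os.remove(dst_path)
    try:
        os.symlink(os.path.abspath(src_path), dst_path)
        return "symlink"
    except (OSError, NotImplementedError):
        # Typical on Windows without symlink privileges: make a real copy.
        shutil.copy2(src_path, dst_path)
        return "copy"
```

Calling it twice on the same destination is safe, because the old entry is removed before the new one is created.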

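Flexible validation set selection

Browse the images of each class directly
Select validation images with a simple copy operation
Makes it easier to keep the validation set representative

Automatic dataset construction

Creates the directory structure YOLO expects
Keeps each image together with its label file
Places every image from the source directory in either the training set or the validation set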
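To make the target layout concrete, here is a minimal sketch that creates the images/ and labels/ trees and maps an image path to its label path. The function names make_yolo_layout and label_path_for are hypothetical. The mapping rule (swap the last images directory for labels and change the extension to .txt) reflects how YOLO training code locates labels as far as I know, so treat it as an assumption and check it against the version you train with.

```python
import os


def make_yolo_layout(root, splits=("train", "val")):
    """Create images/<split> and labels/<split> under root and return the paths."""
    layout = {}
    for kind in ("images", "labels"):
        for split in splits:
            path = os.path.join(root, kind, split)
            os.makedirs(path, exist_ok=True)
            layout[(kind, split)] = path
    return layout


def label_path_for(image_path):
    """Map .../images/<split>/name.ext to .../labels/<split>/name.txt.

    Raises ValueError if the path has no 'images' component.
    """
    parts = os.path.normpath(image_path).split(os.sep)
    # Index of the LAST 'images' component in the path.
    idx = len(parts) - 1 - parts[::-1].index("images")
    parts[idx] = "labels"
    stem, _ = os.path.splitext(parts[-1])
    parts[-1] = stem + ".txt"
    return os.sep.join(parts)
```

For example, label_path_for applied to dataset/images/train/a.jpg gives dataset/labels/train/a.txt, which is why the image and label trees must mirror each other.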

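Safe and reliable

Never deletes existing directories; files with the same name are overwritten
Warns before writing into directories that already contain files
Prints the directory layout, per-class image counts and the final split sizes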

Usage

Install the dependency:

pip install pyyaml

Prepare the configuration file (dataset.yaml):

path: ./dataset
train: images/train
val: images/val
names:
  0: person
  1: car
  2: traffic_light
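The constraints on the names mapping (integer class IDs that start at 0 and are contiguous, and unique class names) can be checked on the already-loaded dictionary. The sketch below is a standalone illustration with a hypothetical function name, validate_names. Unlike the check in the tool itself, it does not depend on the order in which the IDs are written.

```python
def validate_names(names):
    """Return {int id: class name} or raise ValueError describing the problem."""
    if not isinstance(names, dict) or not names:
        raise ValueError("names must be a non-empty mapping of id -> class name")
    try:
        normalized = {int(k): str(v) for k, v in names.items()}
    except (TypeError, ValueError):
        raise ValueError("class ids must be integers") from None
    # Keys such as 0 and "0" collapse to the same id after normalization.
    if len(normalized) != len(names):
        raise ValueError("duplicate class ids")
    if sorted(normalized) != list(range(len(normalized))):
        raise ValueError("class ids must be contiguous and start at 0")
    if len(set(normalized.values())) != len(normalized):
        raise ValueError("class names must be unique")
    return normalized
```

With the configuration above, validate_names receives {0: 'person', 1: 'car', 2: 'traffic_light'} and returns it unchanged; a mapping that starts at 1 or repeats a class name raises ValueError.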

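Basic workflow:

# Step 1: organize the preview
python yolo_dataset.py organize --yaml dataset.yaml --images raw_data/images --labels raw_data/labels

# Step 2: manually copy the images chosen for the validation set into the test_selected directory

# Step 3: build the dataset
python yolo_dataset.py build --yaml dataset.yaml --images raw_data/images --labels raw_data/labels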

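Tips

Choosing the validation set:

Pick roughly 20% of the images in each class
Choose representative samples
Cover different scenes and viewing angles where possible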
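The tool leaves the selection to you, but a random draft of about 20% per class can be a useful starting point before the manual review. The sketch below is an optional helper of my own, not part of the tool: pick_validation is a hypothetical name, and it assumes a mapping from class name to image base names like the one the organize step collects.

```python
import math
import random


def pick_validation(class_to_images, fraction=0.2, seed=0):
    """Return a set of image names covering about `fraction` of each class."""
    rng = random.Random(seed)  # fixed seed so the draft is reproducible
    selected = set()
    for class_name in sorted(class_to_images):
        images = sorted(set(class_to_images[class_name]))
        if not images:
            continue
        target = max(1, math.ceil(len(images) * fraction))
        # Images already chosen through another class count toward this one.
        already = [name for name in images if name in selected]
        remaining = [name for name in images if name not in selected]
        need = max(0, target - len(already))
        selected.update(rng.sample(remaining, min(need, len(remaining))))
    return selected
```

A random draft says nothing about scenes or viewing angles, so the result still needs a look in the preview folders before the images are copied into test_selected.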

Keep the raw data intact
Use relative paths so the project is easy to move
Back up important data regularly

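Advantages

Higher efficiency

Automated file organization
Visual, folder-based validation set selection
Fast construction of a standard dataset

Fewer errors

Validates dataset.yaml (required fields, unique class names, contiguous class IDs starting at 0)
Keeps labels matched to their images
Avoids slips from manual file handling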
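A quick way to see whether labels and images really line up is to compare base names in the two raw directories before building. The following is a minimal sketch, assuming the same four image extensions the tool recognizes; find_unpaired is a hypothetical name and not a function of the tool.

```python
import os

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".bmp"}


def find_unpaired(images_dir, labels_dir):
    """Return (images without a label file, label files without an image)."""
    image_stems = {
        os.path.splitext(name)[0]
        for name in os.listdir(images_dir)
        if os.path.splitext(name)[1].lower() in IMAGE_EXTS
    }
    label_stems = {
        os.path.splitext(name)[0]
        for name in os.listdir(labels_dir)
        if name.lower().endswith(".txt")
    }
    return sorted(image_stems - label_stems), sorted(label_stems - image_stems)
```

An image without a label file is not necessarily a mistake, since it may be a deliberate background image, but a label file without an image usually points to a renamed or missing file.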

Easy to use

Simple command-line interface
Detailed help text (-h on the main command and on each subcommand)
Clear error messages

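Summary

This tool simplifies the preparation of a YOLO dataset, so more time can go into training and tuning the model. Whether for a personal project or for team work, it helps manage datasets more efficiently.

Source code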

import os
import shutil
import yaml
import argparse
from collections import defaultdict

def parse_args():
    """Parse command-line arguments."""
    # Usage examples shown at the end of the main help output
    example_text = """
Examples:

1. Show help:
    %(prog)s -h                # main help
    %(prog)s organize -h       # help for the organize command
    %(prog)s build -h          # help for the build command

2. Standard workflow:
    # Step 1: organize the preview
    %(prog)s organize --yaml dataset.yaml --images raw_data/images --labels raw_data/labels

    # Step 2: manually copy the validation images into the test_selected directory

    # Step 3: build the dataset
    %(prog)s build --yaml dataset.yaml --images raw_data/images --labels raw_data/labels

3. Custom paths:
    # Organize the preview
    %(prog)s organize \\
        --yaml custom_dataset.yaml \\
        --images /data/my_images \\
        --labels /data/my_labels \\
        --preview sorted_preview \\
        --test-selected my_test_set

    # Build the dataset
    %(prog)s build \\
        --yaml custom_dataset.yaml \\
        --images /data/my_images \\
        --labels /data/my_labels \\
        --test-selected my_test_set
    """

    parser = argparse.ArgumentParser(
        description='Tool for organizing and building YOLO datasets',
        epilog=example_text,
        formatter_class=argparse.RawDescriptionHelpFormatter  # keep the epilog layout as written
    )

    subparsers = parser.add_subparsers(dest='command', help='available commands')

    # organize subcommand
    organize_parser = subparsers.add_parser(
        'organize',
        help='organize a per-class preview of the dataset',
        formatter_class=argparse.RawDescriptionHelpFormatter
    )
    organize_parser.add_argument('--yaml',
                              default='dataset.yaml',
                              help='path to the dataset YAML config file (default: %(default)s)')
    organize_parser.add_argument('--images',
                              default='raw_data/images',
                              help='directory with the raw images (default: %(default)s)')
    organize_parser.add_argument('--labels',
                              default='raw_data/labels',
                              help='directory with the raw label files (default: %(default)s)')
    organize_parser.add_argument('--preview',
                              default='preview',
                              help='name of the preview directory (default: %(default)s)')
    organize_parser.add_argument('--test-selected',
                              default='test_selected',
                              help='directory for the images selected as validation set (default: %(default)s)')

    # build subcommand
    build_parser = subparsers.add_parser(
        'build',
        help='build the final dataset',
        formatter_class=argparse.RawDescriptionHelpFormatter
    )
    build_parser.add_argument('--yaml',
                           default='dataset.yaml',
                           help='path to the dataset YAML config file (default: %(default)s)')
    build_parser.add_argument('--images',
                           default='raw_data/images',
                           help='directory with the raw images (default: %(default)s)')
    build_parser.add_argument('--labels',
                           default='raw_data/labels',
                           help='directory with the raw label files (default: %(default)s)')
    build_parser.add_argument('--test-selected',
                           default='test_selected',
                           help='directory for the images selected as validation set (default: %(default)s)')

    args = parser.parse_args()

    # No subcommand given: show the help and exit with an error status
    if args.command is None:
        parser.print_help()
        raise SystemExit(1)

    return args

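def load_and_validate_yaml(yaml_path):
    """Load dataset.yaml and validate its required fields."""
    try:
        with open(yaml_path, 'r', encoding='utf-8') as f:
            config = yaml.safe_load(f)

        # An empty file loads as None, so check the type before looking up keys
        if not isinstance(config, dict):
            raise ValueError(f"{yaml_path} must contain a YAML mapping")

        required_fields = ['path', 'train', 'val', 'names']
        for field in required_fields:
            if field not in config:
                raise ValueError(f"dataset.yaml is missing required field: {field}")

        if not isinstance(config['names'], dict):
            raise ValueError("'names' must be a mapping of class index to class name")

        class_names = config['names'].values()
        if len(class_names) != len(set(class_names)):
            raise ValueError("Class names must be unique!")

        class_ids = config['names'].keys()
        if list(map(int, class_ids)) != list(range(len(class_ids))):
            raise ValueError("Class indices must be contiguous and start at 0!")

        return config

    except FileNotFoundError:
        raise FileNotFoundError(f"Config file not found: {yaml_path}")
    except yaml.YAMLError as e:
        raise ValueError(f"Invalid YAML: {e}")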

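def check_existing_dirs(dirs):
    """Warn about directories that already exist and contain files."""
    existing_dirs = []

    for dir_path in dirs.values():
        if os.path.exists(dir_path):
            # Count files recursively, including those in subdirectories
            files = []
            for root, _, filenames in os.walk(dir_path):
                files.extend(os.path.join(root, f) for f in filenames)
            if files:
                existing_dirs.append((dir_path, len(files)))

    if existing_dirs:
        print("\nWarning: the following directories already exist and contain files:")
        for dir_path, file_count in existing_dirs:
            print(f"- {dir_path}: {file_count} files")
        print("They will not be deleted; new files are added alongside them and files with the same name are overwritten")
        input("Press Enter to continue, or Ctrl+C to cancel...")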

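def create_dir_structure(args, dataset_config):
    """Create the working and dataset directories and return their paths."""
    dirs = {
        'images': args.images,
        'labels': args.labels,
        'dataset_root': dataset_config['path'],
        'dataset_train_images': os.path.join(dataset_config['path'], dataset_config['train']),
        'dataset_val_images': os.path.join(dataset_config['path'], dataset_config['val']),
        'dataset_train_labels': os.path.join(dataset_config['path'], 'labels/train'),
        'dataset_val_labels': os.path.join(dataset_config['path'], 'labels/val'),
        'test_selected': args.test_selected
    }

    # Only the organize command defines --preview
    if hasattr(args, 'preview'):
        dirs['preview'] = args.preview

    # Warn only about output directories. The source directories and the
    # selection directory are expected to contain files already, and
    # dataset_root would count the files of its train/val subdirectories twice.
    input_keys = {'images', 'labels', 'test_selected', 'dataset_root'}
    check_existing_dirs({k: v for k, v in dirs.items() if k not in input_keys})

    for dir_path in dirs.values():
        os.makedirs(dir_path, exist_ok=True)

    return dirs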

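def organize_by_class(dirs, config):
    """Group images into per-class preview folders."""
    # YAML loads `0: person` with an integer key and `"0": person` with a
    # string key, so normalize all keys to int before looking up class names
    names = {int(k): v for k, v in config['names'].items()}
    # One set per class, so an image with several boxes of a class counts once
    class_to_images = defaultdict(set)

    for txt_file in os.listdir(dirs['labels']):
        if not txt_file.endswith('.txt'):
            continue

        image_base = os.path.splitext(txt_file)[0]
        txt_path = os.path.join(dirs['labels'], txt_file)

        with open(txt_path, 'r') as f:
            for line in f:
                fields = line.split()
                if not fields:  # skip blank lines
                    continue
                class_id = int(fields[0])
                if class_id not in names:
                    raise ValueError(f"{txt_file}: class index {class_id} is not defined in dataset.yaml")
                class_to_images[names[class_id]].add(image_base)

    for class_name, images in class_to_images.items():
        class_dir = os.path.join(dirs['preview'], class_name)
        os.makedirs(class_dir, exist_ok=True)

        for img_base in sorted(images):
            for ext in ['.jpg', '.jpeg', '.png', '.bmp']:
                src_path = os.path.join(dirs['images'], img_base + ext)
                if os.path.exists(src_path):
                    dst_path = os.path.join(class_dir, img_base + ext)
                    if os.name == 'nt':  # Windows: copy the file
                        shutil.copy2(src_path, dst_path)
                    else:  # Linux/macOS: link to the original
                        # lexists() also detects a dangling link from an earlier run
                        if os.path.lexists(dst_path):
                            os.remove(dst_path)
                        os.symlink(os.path.abspath(src_path), dst_path)
                    break

    return class_to_images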

def build_dataset(dirs):
    """Build the final dataset."""
    image_exts = ('.jpg', '.jpeg', '.png', '.bmp')

    # Base names of the images copied into test_selected; these form the validation set
    test_images = set()
    if os.path.exists(dirs['test_selected']):
        for img_name in os.listdir(dirs['test_selected']):
            # Compare in lower case so that files such as IMG_1.JPG are not skipped
            if img_name.lower().endswith(image_exts):
                test_images.add(os.path.splitext(img_name)[0])

    processed_count = {'train': 0, 'val': 0}

    for img_name in os.listdir(dirs['images']):
        if not img_name.lower().endswith(image_exts):
            continue

        base_name = os.path.splitext(img_name)[0]
        is_test = base_name in test_images

        target_dir = 'val' if is_test else 'train'
        target_img_dir = dirs[f'dataset_{target_dir}_images']
        target_label_dir = dirs[f'dataset_{target_dir}_labels']

        # Copy the image
        shutil.copy2(
            os.path.join(dirs['images'], img_name),
            os.path.join(target_img_dir, img_name)
        )

        # Copy the label file if there is one
        label_name = f"{base_name}.txt"
        label_path = os.path.join(dirs['labels'], label_name)
        if os.path.exists(label_path):
            shutil.copy2(
                label_path,
                os.path.join(target_label_dir, label_name)
            )

        processed_count[target_dir] += 1

    return processed_count

def command_organize(args):
    """Implementation of the organize command."""
    config = load_and_validate_yaml(args.yaml)
    dirs = create_dir_structure(args, config)

    print("\nDirectory layout:")
    print(f"""
    {dirs['images']}/          # raw images
    {dirs['labels']}/          # raw label files
    {dirs['preview']}/         # preview, one folder per class
        ├── <class 1>/
        ├── <class 2>/
        └── ...
    {dirs['test_selected']}/   # images selected for the validation set
    {dirs['dataset_root']}/    # final dataset
        ├── images/
        │   ├── train/
        │   └── val/
        └── labels/
            ├── train/
            └── val/
    """)

    class_to_images = organize_by_class(dirs, config)

    print("\nClass statistics:")
    for class_name, images in sorted(class_to_images.items()):
        print(f"{class_name}: {len(images)} images")

    print("\nNext steps:")
    print("1. Check the images of each class in the preview directory")
    print("2. Copy the images chosen for the validation set into the selection directory")
    print(f"3. Run '{os.path.basename(__file__)} build' to build the final dataset")

def command_build(args):
    """Implementation of the build command."""
    config = load_and_validate_yaml(args.yaml)
    dirs = create_dir_structure(args, config)

    print("\nBuilding the dataset...")
    processed = build_dataset(dirs)

    print("\nDataset build complete!")
    print(f"Training set: {processed['train']} images")
    print(f"Validation set: {processed['val']} images")

def main():
    args = parse_args()

    try:
        if args.command == 'organize':
            command_organize(args)
        elif args.command == 'build':
            command_build(args)
    except Exception as e:
        # Report the problem briefly and exit with an error status
        print(f"\nError: {str(e)}")
        raise SystemExit(1)

if __name__ == "__main__":
    main()
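After a build it can be worth checking how the annotated boxes of each class are actually distributed between the two splits. The sketch below is a separate helper, not part of the script above: count_instances and split_report are hypothetical names, and it assumes the default labels/train and labels/val layout and label lines that start with an integer class ID.

```python
import os
from collections import Counter


def count_instances(labels_dir):
    """Count annotated boxes per class id across all .txt files in labels_dir."""
    counts = Counter()
    for name in os.listdir(labels_dir):
        if not name.endswith(".txt"):
            continue
        with open(os.path.join(labels_dir, name), "r", encoding="utf-8") as f:
            for line in f:
                fields = line.split()
                if fields:  # skip blank lines
                    counts[int(fields[0])] += 1
    return counts


def split_report(dataset_root):
    """Return {class id: (train boxes, val boxes, val share)} for a built dataset."""
    train = count_instances(os.path.join(dataset_root, "labels", "train"))
    val = count_instances(os.path.join(dataset_root, "labels", "val"))
    report = {}
    for class_id in sorted(set(train) | set(val)):
        total = train[class_id] + val[class_id]
        report[class_id] = (train[class_id], val[class_id], val[class_id] / total)
    return report
```

A class whose validation share is 0 or far from the intended 20% is a hint to go back to the preview folders and adjust the selection.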