nnUNet V2 code: the nnUNetv2_plan_and_preprocess command

w1ndfly

Last modified 2025-02-23 11:57:58

First published 2024-05-30 17:12:38

Article link: https://blog.youkuaiyun.com/shey_joe/article/details/139328671

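After setting up the nnUNet_raw folder, the next step is to preprocess the raw data.

In nnU-Net V2 this is done with the nnUNetv2_plan_and_preprocess command.

Running the command calls the plan_and_preprocess_entry function, defined in nnunetv2/experiment_planning/plan_and_preprocess_entrypoints.py.

plan_and_preprocess_entry does three things: it generates dataset_fingerprint.json, it generates nnUNetPlans.json, and it preprocesses the dataset according to those two files.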

Contents

1. Command arguments

2. Generating dataset_fingerprint.json

3. Generating nnUNetPlans.json

4. Data preprocessing

1. Command arguments

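Arguments of the preprocessing command
Argument	Description
`-h` or `--help`	Show the help message.
-d D [D ...]	Required. One or more dataset IDs. For example, passing 2 4 5 runs fingerprint extraction, experiment planning, and preprocessing for datasets 2, 4, and 5. A single ID is also allowed.
-fpe FPE	Optional. Name of the dataset fingerprint extractor class. Default: DatasetFingerprintExtractor.
-npfp NPFP	Optional. Number of processes used for fingerprint extraction. Default: 8.
--verify_dataset_integrity	Recommended. When set, the integrity of the dataset is checked. This is useful for every dataset and should be run at least once per dataset.
--no_pp	Optional. When set, only dataset_fingerprint.json and nnUNetPlans.json are generated and preprocessing is skipped. Very useful for debugging.
--clean	Optional. When set, an existing dataset_fingerprint.json is overwritten. If the flag is not set and dataset_fingerprint.json already exists, the fingerprint extractor does not run. The flag is required whenever the fingerprint extractor or the dataset itself has changed.
-pl PL	Optional. Name of the experiment planner class that produces nnUNetPlans. Default: ExperimentPlanner. Note that there are no longer separate 2d and 3d planners; a single planner now handles all configurations.
-gpu_memory_target GPU_MEMORY_TARGET	Optional. Custom GPU memory target in GB. Default: None (the planner class's own default is used). Changing this value affects patch size and batch size and therefore directly affects model performance. Use it only if you know exactly what you are doing, and never before you have run the default nnU-Net as a baseline.
-preprocessor_name PREPROCESSOR_NAME	Optional. Name of a custom preprocessor class, which must be located in nnunetv2.preprocessing. Default: DefaultPreprocessor. Changing this value may affect model performance. Use it only if you know exactly what you are doing, and never before you have run the default nnU-Net as a baseline.
-overwrite_target_spacing OVERWRITE_TARGET_SPACING [OVERWRITE_TARGET_SPACING ...]	Optional. Target voxel spacing for the 3d_fullres and 3d_cascade_fullres configurations. Default: None (unchanged). Changing this value affects image size and possibly patch size and batch size, and will almost certainly affect model performance. Use it only if you know exactly what you are doing, and never before you have run the default nnU-Net as a baseline. Changing the target spacing of other configurations is not implemented yet. The new target spacing must be a list of three numbers (this is explained in the later articles of this series).
-overwrite_plans_name OVERWRITE_PLANS_NAME	Optional. Custom plans identifier. If you use `-gpu_memory_target`, `-preprocessor_name`, or `-overwrite_target_spacing`, it is best to also use `-overwrite_plans_name` so that the plans are written to a differently named JSON file and the default nnUNetPlans.json is not overwritten. Afterwards you must pass your custom plans identifier with `-p` when running other nnunet commands (training, inference, etc.).
-c C [C ...]	Optional. Configurations for which preprocessing is run. Default: 2d, 3d_fullres, 3d_lowres. 3d_cascade_fullres does not need to be specified because it uses the 3d_fullres data. Configurations that do not exist for a dataset are skipped.
-np NP [NP ...]	Optional. Number of processes used for preprocessing, given as one number or several. With a single number, every configuration uses that many processes. With several numbers (as many as there are configurations in `-c`), each configuration is assigned its own count, in order. More processes are usually faster, up to the number of threads the machine supports (for example, 8 threads on a 4-core CPU with hyperthreading). If in doubt, do not increase the number of processes. Warning: the more processes are used, the more RAM preprocessing may need, and image resampling is RAM-hungry. Monitor RAM usage and reduce `-np` if it gets too high. Defaults: 8 processes for 2d, 4 for 3d_fullres, 8 for 3d_lowres, and 4 for everything else.
--verbose	When set, prints a lot of extra information (useful for debugging). This disables the progress bar. Recommended for cluster environments.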
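
To make the `nargs`-style arguments in the table more concrete, here is a minimal argparse sketch. It is not the parser from nnU-Net: the function name build_parser and the prog name are hypothetical, and only a handful of flags are reproduced, with the defaults described in the table.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical, reduced parser: only a few of the flags from the table above.
    parser = argparse.ArgumentParser(prog="plan_and_preprocess_sketch")
    # "-d D [D ...]": required, one or more integer dataset IDs
    parser.add_argument("-d", nargs="+", type=int, required=True)
    # "-c C [C ...]": optional list of configurations
    parser.add_argument("-c", nargs="+", default=["2d", "3d_fullres", "3d_lowres"])
    # "-np NP [NP ...]": optional, None means "use per-configuration defaults"
    parser.add_argument("-np", nargs="+", type=int, default=None)
    # Boolean switches
    parser.add_argument("--no_pp", action="store_true")
    parser.add_argument("--clean", action="store_true")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["-d", "2", "4", "5", "-c", "3d_fullres", "--no_pp"])
print(args.d, args.c, args.np, args.no_pp, args.clean)
```

With this shape, `-d 2 4 5` arrives as a list of integers, which matches the way the entry function later passes args.d to the per-dataset loops.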

2. Generating dataset_fingerprint.json

In plan_and_preprocess_entry, the entry point for generating dataset_fingerprint.json is:

# fingerprint extraction
print("Fingerprint extraction...")
extract_fingerprints(args.d, args.fpe, args.npfp, args.verify_dataset_integrity, args.clean, args.verbose)

Stepping into extract_fingerprints:

def extract_fingerprints(dataset_ids: List[int], fingerprint_extractor_class_name: str = 'DatasetFingerprintExtractor',
                         num_processes: int = default_num_processes, check_dataset_integrity: bool = False,
                         clean: bool = True, verbose: bool = True):
    """
    clean = False will not actually run this. This is just a switch for use with nnUNetv2_plan_and_preprocess where
    we don't want to rerun fingerprint extraction every time.
    """
    fingerprint_extractor_class = recursive_find_python_class(join(nnunetv2.__path__[0], "experiment_planning"),
                                                              fingerprint_extractor_class_name,
                                                              current_module="nnunetv2.experiment_planning")
    # fingerprint_extractor_class defaults to
    # nnunetv2.experiment_planning.dataset_fingerprint.fingerprint_extractor.DatasetFingerprintExtractor
    for d in dataset_ids:
        extract_fingerprint_dataset(d, fingerprint_extractor_class, num_processes, check_dataset_integrity, clean,
                                    verbose)
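
The lookup of a class from its name, which the excerpt above delegates to recursive_find_python_class, can be illustrated with the standard library alone. The sketch below is not nnU-Net's implementation: find_class_by_name is a hypothetical helper, and the stdlib json package is used as the search target only so that the example runs without nnU-Net installed.

```python
import importlib
import json
import pkgutil
from typing import Optional


def find_class_by_name(package_name: str, class_name: str) -> Optional[type]:
    """Search every module below `package_name` for a class called `class_name`."""
    package = importlib.import_module(package_name)
    for module_info in pkgutil.walk_packages(package.__path__, prefix=package_name + "."):
        if module_info.name.endswith("__main__"):
            continue  # never import script entry points
        try:
            module = importlib.import_module(module_info.name)
        except ImportError:
            continue  # skip modules whose dependencies are missing
        candidate = getattr(module, class_name, None)
        if isinstance(candidate, type):
            return candidate
    return None  # nothing with that name was found


decoder_class = find_class_by_name("json", "JSONDecoder")
print(decoder_class is json.JSONDecoder)
```

Looking classes up by name is what lets the `-fpe`, `-pl`, and `-preprocessor_name` arguments accept plain strings on the command line.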

fingerprint_extractor_class defaults to DatasetFingerprintExtractor; its full path is given in the comment above. Stepping into extract_fingerprint_dataset:

def extract_fingerprint_dataset(dataset_id: int,
                                fingerprint_extractor_class: Type[
                                    DatasetFingerprintExtractor] = DatasetFingerprintExtractor,
                                num_processes: int = default_num_processes, check_dataset_integrity: bool = False,
                                clean: bool = True, verbose: bool = True):
    """
    Returns the fingerprint as a dictionary (additionally to saving it)
    """
    dataset_name = convert_id_to_dataset_name(dataset_id)
    print(dataset_name)

    if check_dataset_integrity:
        verify_dataset_integrity(join(nnUNet_raw, dataset_name), num_processes)

    fpe = fingerprint_extractor_class(dataset_id, num_processes, verbose=verbose)
    return fpe.run(overwrite_existing=clean)

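The function first resolves the dataset name and, if check_dataset_integrity is set, verifies the integrity of the raw dataset. It then instantiates DatasetFingerprintExtractor and calls its run method, which generates dataset_fingerprint.json, and finally returns the fingerprint.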
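
The effect of passing clean as overwrite_existing, as described for the `--clean` flag, can be sketched as a "reuse the file unless forced" pattern. The code below is only an illustration under that assumption: run_extraction is a hypothetical function, and the dictionary it writes is a placeholder, not the real fingerprint content.

```python
import json
import os
import tempfile


def run_extraction(output_file: str, overwrite_existing: bool = False) -> dict:
    """Return the stored fingerprint unless it is missing or a rerun is forced."""
    if overwrite_existing or not os.path.isfile(output_file):
        # Placeholder for the expensive statistics computed over the dataset.
        fingerprint = {"note": "placeholder for the expensive statistics"}
        with open(output_file, "w") as f:
            json.dump(fingerprint, f)
        print("fingerprint computed")
    else:
        with open(output_file) as f:
            fingerprint = json.load(f)
        print("existing fingerprint reused")
    return fingerprint


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "dataset_fingerprint.json")
    run_extraction(target)                           # file missing, so it is computed
    run_extraction(target)                           # file present, so it is reused
    run_extraction(target, overwrite_existing=True)  # forced rerun
```

This is why a changed dataset needs `--clean`: without it, a stale file would simply be read back.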

For a walkthrough of DatasetFingerprintExtractor and its run method, see:

Reading the nnUNet V2 code: generating dataset_fingerprint.json (优快云 blog)

3. Generating nnUNetPlans.json

In plan_and_preprocess_entry, the entry point for generating nnUNetPlans.json is:

# experiment planning
print('Experiment planning...')
plans_identifier = plan_experiments(args.d, args.pl, args.gpu_memory_target, args.preprocessor_name,
                                    args.overwrite_target_spacing, args.overwrite_plans_name)

Stepping into plan_experiments:

def plan_experiments(dataset_ids: List[int], experiment_planner_class_name: str = 'ExperimentPlanner',
                     gpu_memory_target_in_gb: float = None, preprocess_class_name: str = 'DefaultPreprocessor',
                     overwrite_target_spacing: Optional[Tuple[float, ...]] = None,
                     overwrite_plans_name: Optional[str] = None):
    """
    overwrite_target_spacing ONLY applies to 3d_fullres and 3d_cascade fullres!
    """
    if experiment_planner_class_name == 'ExperimentPlanner':
        print("\n############################\n"
              "INFO: You are using the old nnU-Net default planner. We have updated our recommendations. "
              "Please consider using those instead! "
              "Read more here: https://github.com/MIC-DKFZ/nnUNet/blob/master/documentation/resenc_presets.md"
              "\n############################\n")
    experiment_planner = recursive_find_python_class(join(nnunetv2.__path__[0], "experiment_planning"),
                                                     experiment_planner_class_name,
                                                     current_module="nnunetv2.experiment_planning")
    # experiment_planner defaults to
    # nnunetv2.experiment_planning.experiment_planners.default_experiment_planner.ExperimentPlanner
    print(experiment_planner)
    plans_identifier = None
    for d in dataset_ids:
        _, plans_identifier = plan_experiment_dataset(d, experiment_planner, gpu_memory_target_in_gb,
                                                      preprocess_class_name,
                                                      overwrite_target_spacing, overwrite_plans_name)
    return plans_identifier

experiment_planner defaults to the ExperimentPlanner class; its full path is given in the comment above. Stepping into plan_experiment_dataset:

def plan_experiment_dataset(dataset_id: int,
                            experiment_planner_class: Type[ExperimentPlanner] = ExperimentPlanner,
                            gpu_memory_target_in_gb: float = None, preprocess_class_name: str = 'DefaultPreprocessor',
                            overwrite_target_spacing: Optional[Tuple[float, ...]] = None,
                            overwrite_plans_name: Optional[str] = None) -> Tuple[dict, str]:
    """
    overwrite_target_spacing ONLY applies to 3d_fullres and 3d_cascade fullres!
    """
    kwargs = {}
    if overwrite_plans_name is not None:
        kwargs['plans_name'] = overwrite_plans_name
    if gpu_memory_target_in_gb is not None:
        kwargs['gpu_memory_target_in_gb'] = gpu_memory_target_in_gb

    planner = experiment_planner_class(dataset_id,
                                       preprocessor_name=preprocess_class_name,
                                       overwrite_target_spacing=[float(i) for i in overwrite_target_spacing] if
                                       overwrite_target_spacing is not None else overwrite_target_spacing,
                                       suppress_transpose=False,  # might expose this later,
                                       **kwargs
                                       )
    ret = planner.plan_experiment()
    return ret, planner.plans_identifier

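The function first collects the two optional overrides, plans_name and gpu_memory_target_in_gb, into kwargs, so that they are passed on only when they were explicitly set. It then instantiates ExperimentPlanner and calls its plan_experiment method, which generates nnUNetPlans.json, and finally returns the plans together with the plans identifier.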
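
The kwargs pattern used here, forwarding an override only when it is not None so that the class's own defaults stay in effect, can be shown in isolation. In the sketch below ToyPlanner and make_planner are hypothetical stand-ins, and the default values "toyPlans" and 8.0 are invented for illustration; they are not nnU-Net's defaults.

```python
from typing import List, Optional, Sequence


class ToyPlanner:
    """Hypothetical stand-in for a planner class; the default values are invented."""

    def __init__(self, dataset_id: int, plans_name: str = "toyPlans",
                 gpu_memory_target_in_gb: float = 8.0,
                 overwrite_target_spacing: Optional[List[float]] = None):
        self.dataset_id = dataset_id
        self.plans_name = plans_name
        self.gpu_memory_target_in_gb = gpu_memory_target_in_gb
        self.overwrite_target_spacing = overwrite_target_spacing


def make_planner(dataset_id: int,
                 gpu_memory_target_in_gb: Optional[float] = None,
                 overwrite_target_spacing: Optional[Sequence] = None,
                 overwrite_plans_name: Optional[str] = None) -> ToyPlanner:
    # Only forward overrides that were actually given; otherwise the
    # defaults declared on the class itself apply.
    kwargs = {}
    if overwrite_plans_name is not None:
        kwargs["plans_name"] = overwrite_plans_name
    if gpu_memory_target_in_gb is not None:
        kwargs["gpu_memory_target_in_gb"] = gpu_memory_target_in_gb
    # Command-line values may arrive as strings, so convert them to floats.
    spacing = None
    if overwrite_target_spacing is not None:
        spacing = [float(i) for i in overwrite_target_spacing]
    return ToyPlanner(dataset_id, overwrite_target_spacing=spacing, **kwargs)


default_planner = make_planner(1)
custom_planner = make_planner(1, gpu_memory_target_in_gb=24,
                              overwrite_target_spacing=["1", "1.5", "2"],
                              overwrite_plans_name="customPlans")
print(default_planner.plans_name, default_planner.gpu_memory_target_in_gb)
print(custom_planner.plans_name, custom_planner.overwrite_target_spacing)
```

The design keeps a single source of truth for defaults: a planner subclass can change its own default memory target without the entry function having to know about it.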

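For a walkthrough of ExperimentPlanner and its plan_experiment method, see:

nnUNet V2 code: generating nnUNetPlans.json (part 1) (优快云 blog)

nnUNet V2 code: generating nnUNetPlans.json (part 2) (优快云 blog)

nnUNet V2 code: generating nnUNetPlans.json (part 3) (优快云 blog)

Note: some methods of ExperimentPlanner, such as load_plans, save_plans, and generate_data_identifier, are straightforward to read and are not covered in the three articles above.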

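4. Data preprocessing

Entry point:

if not args.no_pp:
    print('Preprocessing...')
    # np holds the per-configuration process counts derived from args.np (it is not the NumPy alias)
    preprocess(args.d, plans_identifier, args.c, np, args.verbose)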
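
The `-np` defaults listed in the argument table (8 for 2d, 4 for 3d_fullres, 8 for 3d_lowres, 4 for anything else) suggest a simple per-configuration lookup. The following is a sketch of such a lookup, assuming that behavior; resolve_num_processes and DEFAULT_NP are hypothetical names, not taken from the nnU-Net source.

```python
from typing import Dict, List, Optional

# Defaults as described in the argument table; any other configuration gets 4.
DEFAULT_NP: Dict[str, int] = {"2d": 8, "3d_fullres": 4, "3d_lowres": 8}


def resolve_num_processes(configurations: List[str],
                          np_arg: Optional[List[int]] = None) -> List[int]:
    """Use the user's -np values if given, else one default per configuration."""
    if np_arg is not None:
        return list(np_arg)
    return [DEFAULT_NP.get(c, 4) for c in configurations]


print(resolve_num_processes(["2d", "3d_fullres", "3d_lowres"]))
print(resolve_num_processes(["3d_fullres", "my_custom_config"]))
print(resolve_num_processes(["2d", "3d_fullres"], np_arg=[2]))
```

A single user-supplied value is returned unchanged here; expanding it to one value per configuration is left to the later validation step.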

Stepping into preprocess:

def preprocess(dataset_ids: List[int],
               plans_identifier: str = 'nnUNetPlans',
               configurations: Union[Tuple[str], List[str]] = ('2d', '3d_fullres', '3d_lowres'),
               num_processes: Union[int, Tuple[int, ...], List[int]] = (8, 4, 8),
               verbose: bool = False):
    for d in dataset_ids:
        preprocess_dataset(d, plans_identifier, configurations, num_processes, verbose)

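There are five parameters: the dataset IDs, the plans identifier (nnUNetPlans by default), the configurations (2d, 3d_fullres, etc.), the process counts, and a flag that enables verbose output.

Inside the function, each dataset is preprocessed in turn by calling preprocess_dataset: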

def preprocess_dataset(dataset_id: int,
                       plans_identifier: str = 'nnUNetPlans',
                       configurations: Union[Tuple[str], List[str]] = ('2d', '3d_fullres', '3d_lowres'),
                       num_processes: Union[int, Tuple[int, ...], List[int]] = (8, 4, 8),
                       verbose: bool = False) -> None:
    if not isinstance(num_processes, list):
        num_processes = list(num_processes)
    if len(num_processes) == 1:
        num_processes = num_processes * len(configurations)
    if len(num_processes) != len(configurations):
        raise RuntimeError(
            f'The list provided with num_processes must either have len 1 or as many elements as there are '
            f'configurations (see --help). Number of configurations: {len(configurations)}, length '
            f'of num_processes: '
            f'{len(num_processes)}')

    dataset_name = convert_id_to_dataset_name(dataset_id)
    print(f'Preprocessing dataset {dataset_name}')
    plans_file = join(nnUNet_preprocessed, dataset_name, plans_identifier + '.json')
    plans_manager = PlansManager(plans_file)
    for n, c in zip(num_processes, configurations):
        print(f'Configuration: {c}...')
        if c not in plans_manager.available_configurations:
            print(
                f"INFO: Configuration {c} not found in plans file {plans_identifier + '.json'} of "
                f"dataset {dataset_name}. Skipping.")
            continue
        configuration_manager = plans_manager.get_configuration(c)
        preprocessor = configuration_manager.preprocessor_class(verbose=verbose)
        preprocessor.run(dataset_id, c, plans_identifier, num_processes=n)

    # copy the gt to a folder in the nnUNet_preprocessed so that we can do validation even if the raw data is no
    # longer there (useful for compute cluster where only the preprocessed data is available)
    from distutils.file_util import copy_file
    maybe_mkdir_p(join(nnUNet_preprocessed, dataset_name, 'gt_segmentations'))
    dataset_json = load_json(join(nnUNet_raw, dataset_name, 'dataset.json'))
    dataset = get_filenames_of_train_images_and_targets(join(nnUNet_raw, dataset_name), dataset_json)
    # only copy files that are newer than the ones already present
    for k in dataset:
        copy_file(dataset[k]['label'],
                  join(nnUNet_preprocessed, dataset_name, 'gt_segmentations', k + dataset_json['file_ending']),
                  update=True)

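The function first checks that the process counts and the configurations are consistent: a single process count is repeated for every configuration, and any other length mismatch raises a RuntimeError.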
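
The broadcast-then-validate step can be isolated into a small helper. This is a sketch rather than the library code: normalize_num_processes is a hypothetical name, and the explicit handling of a bare int is an addition of the sketch (calling list() directly on an int would raise a TypeError).

```python
from typing import List, Sequence, Union


def normalize_num_processes(num_processes: Union[int, Sequence[int]],
                            configurations: Sequence[str]) -> List[int]:
    """Return exactly one process count per configuration."""
    if isinstance(num_processes, int):
        num_processes = [num_processes]  # sketch-only convenience for a bare int
    num_processes = list(num_processes)
    if len(num_processes) == 1:
        # One value given: reuse it for every configuration.
        num_processes = num_processes * len(configurations)
    if len(num_processes) != len(configurations):
        raise RuntimeError(
            f"Expected 1 or {len(configurations)} process counts, "
            f"got {len(num_processes)}")
    return num_processes


print(normalize_num_processes(4, ["2d", "3d_fullres"]))
print(normalize_num_processes((8, 4, 8), ["2d", "3d_fullres", "3d_lowres"]))
```

After this step the counts and the configurations can be paired safely with zip, as the loop in preprocess_dataset does.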

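It then resolves the dataset name, builds the path to nnUNetPlans.json, and instantiates PlansManager from that file.

Next, a for loop runs over the configurations (2d, 3d_fullres, etc.) and preprocesses the dataset for each one; configurations that are missing from the plans file are skipped. The details inside the loop are covered later. After the loop, the function creates the gt_segmentations folder inside nnUNet_preprocessed and, following dataset.json, copies the segmentation files of the training set into it. By the time this code runs, all image files have already been processed inside the loop. preprocess_dataset then returns.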
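
The quoted code copies the label files with copy_file(..., update=True) from distutils, which copies only when the destination is missing or older than the source. Since distutils was removed from the standard library in Python 3.12, the same behavior can be sketched with os and shutil. copy_if_newer is a hypothetical helper and the file names are made up for the example; this is not a drop-in patch for nnU-Net.

```python
import os
import shutil
import tempfile


def copy_if_newer(src: str, dst: str) -> bool:
    """Copy src to dst only if dst is missing or older than src.

    Returns True if a copy was made, False if it was skipped.
    """
    if os.path.isfile(dst) and os.path.getmtime(src) <= os.path.getmtime(dst):
        return False
    shutil.copy2(src, dst)  # copy2 also carries over the modification time
    return True


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "case_0001.nii.gz")
    gt_dir = os.path.join(tmp, "gt_segmentations")
    os.makedirs(gt_dir, exist_ok=True)
    with open(src, "wb") as f:
        f.write(b"label bytes")
    dst = os.path.join(gt_dir, "case_0001.nii.gz")
    print("copied on first call:", copy_if_newer(src, dst))
```

Skipping up-to-date files keeps repeated preprocessing runs cheap, because the ground-truth folder is only touched when a label actually changed.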

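Back inside the for loop, the three lines starting at configuration_manager involve three classes: PlansManager (mentioned above), ConfigurationManager, which lives in the same file as PlansManager, and DefaultPreprocessor.

PlansManager and ConfigurationManager act as service classes: they supply DefaultPreprocessor with information drawn from the dataset_fingerprint.json and nnUNetPlans.json files generated above. DefaultPreprocessor performs the actual preprocessing. It is easiest to read ConfigurationManager first, since many of its functions and variables already appeared while nnUNetPlans.json was being generated, then PlansManager, and finally DefaultPreprocessor.

For a walkthrough of these three classes, see:

nnUNet V2 code: data preprocessing (part 1) (优快云 blog)

nnUNet V2 code: data preprocessing (part 2) (优快云 blog)

This concludes the walkthrough of the nnUNetv2_plan_and_preprocess command.

The training part comes next.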