Linux memory manager and your big data

Disclaimer: In our experience, when we hit an issue and suspect the operating system, 99% of the time it turns out to be something else. We therefore caution against assuming that the problem is with your operating system, unless your use case closely matches the following example.

It all started with one of our customers reporting performance issues with their CitusDB cluster. This customer designed their cluster such that their working set would fit into memory, but their query run-times showed every indication that their queries were hitting disk. This naturally slowed their queries down by 10-100x.

We started looking into this problem by first examining CitusDB's query distribution mechanism and then by checking the PostgreSQL instances on the machines. We found that neither was the culprit here, and came up with the following observations:

  1. The customer's working set was one day's worth of query logs. Once they were done looking at a particular day, they started querying the next day's data.
  2. Their queries involved mostly sequential I/O. They didn't use indexes a lot.
  3. A day's data occupied more than 60% of the memory on each node (but way less than total available memory). They didn't have anything else using memory on their instances.

Our assumption going into this was that since each day's data easily fit into RAM, the Linux memory manager would eventually bring that day's data into the page cache. Once the customer started querying the next day's data (and only next day's data), then the new data would come into the page cache. At least, this is what a simple cache using the LRU eviction policy would do.
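To make that expectation concrete, here is a toy LRU cache in Python (not the kernel's actual implementation; the capacity of three "pages" and the page names are made up for illustration). Under plain LRU, scanning the second day's pages steadily evicts the first day's:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache: on overflow, evict the least recently used key."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def access(self, key):
        if key in self.data:
            self.data.move_to_end(key)         # mark as most recently used
        else:
            if len(self.data) >= self.capacity:
                self.data.popitem(last=False)  # evict least recently used
            self.data[key] = True

cache = LRUCache(3)
for page in ["day1-a", "day1-b", "day1-c"]:  # day 1 fills the cache
    cache.access(page)
for page in ["day2-a", "day2-b", "day2-c"]:  # day 2 evicts day 1, page by page
    cache.access(page)
print(list(cache.data))  # ['day2-a', 'day2-b', 'day2-c']
```

This is the behavior the customer (and we) expected: the new working set simply displaces the old one.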

It turns out LRU has two shortcomings when used as a page replacement algorithm. First, an exact LRU implementation is too costly in this context. Second, the memory manager needs to account for frequency as well, so that a large file read doesn't evict the entire cache. Therefore, Linux uses a more sophisticated algorithm than LRU; and that algorithm doesn't play well with the workload we just described.

To put things into an example, let's assume that you have a kernel newer than 2.6.31 (released in 2009) and that you're using an m2.4xlarge EC2 instance with 68 GB of memory. Let's also say that you have two days' worth of clickstream data. Each day's data takes more than 60% of available memory, but individually each day easily fits into RAM.

$ ls -lh clickstream.csv.*
-rw-rw-r-- ec2-user ec2-user 42G Nov 25 19:45 clickstream.csv.1
-rw-rw-r-- ec2-user ec2-user 42G Nov 25 19:47 clickstream.csv.2

Now, let's bring in the first day's data to memory by running the "word count" command on the clickstream file several times. Note the time difference between these two runs. The first time we run the command, the Linux memory manager brings the file's pages into the page cache. On the next run, everything gets served from memory.

$ time wc -l clickstream.csv.1 
336006288 clickstream.csv.1

real	10m4.575s
...

$ time wc -l clickstream.csv.1 
336006288 clickstream.csv.1

real	0m18.858s

Then, let's switch over to the second day's clickstream file. We again run the word count command multiple times to bring the file into memory. An LRU-like policy here would evict the first day's data after several runs, and bring the second day's data into memory. Unfortunately, no matter how many times you access the second file in this case, the Linux memory manager will never bring it into memory.

$ time wc -l clickstream.csv.2
336027448 clickstream.csv.2

real	9m50.542s

$ time wc -l clickstream.csv.2
336027448 clickstream.csv.2

real	9m52.265s

In fact, if you run into this scenario, the only way to bring the second day's data into memory is by manually flushing the page cache. Obviously, this cure might be worse than the disease, but for our little experiment, it helps.

$ echo 1 | sudo tee /proc/sys/vm/drop_caches
1

$ time wc -l clickstream.csv.2
336027448 clickstream.csv.2

real	9m51.906s

$ time wc -l clickstream.csv.2
336027448 clickstream.csv.2

real	0m17.874s

Taking a step back, the problem here lies with how Linux manages its page cache. The Linux memory manager keeps cached filesystem pages in two types of lists. One list holds recently accessed pages (recency list), and the other one holds pages that have been referenced multiple times (frequency list).

In current kernel versions, the memory manager splits available memory evenly between these two lists to establish a trade-off between protecting frequently used pages and detecting recently used ones. In other words, the kernel reserves 50% of available memory for the frequency list.

In the previous example, both lists start out empty. When referenced, the first day's pages first go into the recency list. On the second reference, they get promoted to the frequency list.

Next, when the user starts working on the second day's data, the file is larger than the recency list, which is capped at 50% of available memory. Therefore, sequential scans over the file result in thrashing. Each filesystem page in the second file makes it into the recency list, but gets kicked out once the list fills up, before the scan comes back around to reference it again. As a result, no page in the second file stays in the recency list long enough for its reference count to get incremented, so none of them ever get promoted to the frequency list.
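The thrashing can be sketched with a toy model of the two-list scheme. This is a deliberate simplification, not the kernel's code: we assume a page is promoted on its second reference while still resident, each list holds half of an 8-page "memory", and the recency list evicts in FIFO order:

```python
from collections import deque

LIST_SIZE = 4          # each list gets 50% of an 8-page "memory"
recency = deque()      # pages referenced once
frequency = deque()    # pages referenced again while still cached

def touch(page):
    if page in frequency:
        return                        # already protected
    if page in recency:
        recency.remove(page)          # second reference while resident:
        if len(frequency) >= LIST_SIZE:
            frequency.popleft()
        frequency.append(page)        # promote to the frequency list
    else:
        if len(recency) >= LIST_SIZE:
            recency.popleft()         # evicted before a second reference
        recency.append(page)

# A 6-page file scanned sequentially through a 4-slot recency list:
day2 = [f"day2-page{i}" for i in range(6)]
for _ in range(10):                   # many full scans
    for page in day2:
        touch(page)

print(list(frequency))  # [] -- no page survives long enough to be promoted
```

No matter how many times the file is scanned, every page is evicted from the recency list before the next scan references it again, so the frequency list stays empty, which is exactly the behavior the `wc -l` timings above show.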

Fortunately, this issue occurs only when you have all three observations that we outlined above (very infrequent), and it's getting fixed as we speak. If you're interested, you can read more about the original problem report and the proposed fix in the Linux kernel mailing lists.

For us, the really neat part was how easy it was to identify the problem. Since Citus extends PostgreSQL, once we saw the issue, we could quickly reproduce it on Postgres. We then posted our findings to the Linux mailing lists, and the community took over from there.

Got comments? Join the discussion on Hacker News.

sys.exit(run_training_entry()) File "/home/jzuser/.local/lib/python3.10/site-packages/nnunetv2/run/run_training.py", line 252, in run_training_entry run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights, File "/home/jzuser/.local/lib/python3.10/site-packages/nnunetv2/run/run_training.py", line 195, in run_training nnunet_trainer.run_training() File "/home/jzuser/.local/lib/python3.10/site-packages/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1211, in run_training train_outputs.append(self.train_step(next(self.dataloader_train))) File "/home/jzuser/.local/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 196, in __next__ item = self.__get_next_item() File "/home/jzuser/.local/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 181, in __get_next_item raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the " RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message Using device: cuda:0 /home/jzuser/.local/lib/python3.10/site-packages/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py:152: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead. self.grad_scaler = GradScaler() if self.device.type == 'cuda' else None ####################################################################### Please cite the following paper when using nnU-Net: Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211. 
####################################################################### /home/jzuser/.local/lib/python3.10/site-packages/torch/optim/lr_scheduler.py:62: UserWarning: The verbose parameter is deprecated. Please use get_last_lr() to access the learning rate. warnings.warn( This is the configuration used by this training: Configuration name: 2d {'data_identifier': 'nnUNetPlans_2d', 'preprocessor_name': 'DefaultPreprocessor', 'batch_size': 12, 'patch_size': [512, 512], 'median_image_size_in_voxels': [512.0, 512.0], 'spacing': [0.767578125, 0.767578125], 'normalization_schemes': ['CTNormalization'], 'use_mask_for_norm': [False], 'UNet_class_name': 'PlainConvUNet', 'UNet_base_num_features': 32, 'n_conv_per_stage_encoder': [2, 2, 2, 2, 2, 2, 2, 2], 'n_conv_per_stage_decoder': [2, 2, 2, 2, 2, 2, 2], 'num_pool_per_axis': [7, 7], 'pool_op_kernel_sizes': [[1, 1], [2, 2], [2, 2], [2, 2], [2, 2], [2, 2], [2, 2], [2, 2]], 'conv_kernel_sizes': [[3, 3], [3, 3], [3, 3], [3, 3], [3, 3], [3, 3], [3, 3], [3, 3]], 'unet_max_num_features': 512, 'resampling_fn_data': 'resample_data_or_seg_to_shape', 'resampling_fn_seg': 'resample_data_or_seg_to_shape', 'resampling_fn_data_kwargs': {'is_seg': False, 'order': 3, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_seg_kwargs': {'is_seg': True, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_probabilities': 'resample_data_or_seg_to_shape', 'resampling_fn_probabilities_kwargs': {'is_seg': False, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'batch_dice': True} These are the global plan.json settings: {'dataset_name': 'Dataset003_Liver', 'plans_name': 'nnUNetPlans', 'original_median_spacing_after_transp': [1.0, 0.767578125, 0.767578125], 'original_median_shape_after_transp': [432, 512, 512], 'image_reader_writer': 'SimpleITKIO', 'transpose_forward': [0, 1, 2], 'transpose_backward': [0, 1, 2], 'experiment_planner_used': 'ExperimentPlanner', 'label_manager': 'LabelManager', 
'foreground_intensity_properties_per_channel': {'0': {'max': 5420.0, 'mean': 99.48007202148438, 'median': 101.0, 'min': -983.0, 'percentile_00_5': -15.0, 'percentile_99_5': 197.0, 'std': 37.13840103149414}}} 2025-08-27 11:20:58.407072: unpacking dataset... 2025-08-27 11:21:01.513495: unpacking done... 2025-08-27 11:21:01.514885: do_dummy_2d_data_aug: False 2025-08-27 11:21:01.517129: Using splits from existing split file: /home/jzuser/Work_dir/Gn/pystudy/NnuNet/nnUNet_preprocessed/Dataset003_Liver/splits_final.json 2025-08-27 11:21:01.517678: The split file contains 5 splits. 2025-08-27 11:21:01.517827: Desired fold for training: 2 2025-08-27 11:21:01.517945: This split has 105 training and 26 validation cases. 2025-08-27 11:21:01.529522: Unable to plot network architecture: 2025-08-27 11:21:01.529716: No module named 'hiddenlayer' 2025-08-27 11:21:01.540922: 2025-08-27 11:21:01.541166: Epoch 0 2025-08-27 11:21:01.541448: Current learning rate: 0.01 Exception in background worker 2: No data left in file Traceback (most recent call last): File "/home/jzuser/.local/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 53, in producer item = next(data_loader) File "/home/jzuser/.local/lib/python3.10/site-packages/batchgenerators/dataloading/data_loader.py", line 126, in __next__ return self.generate_train_batch() File "/home/jzuser/.local/lib/python3.10/site-packages/nnunetv2/training/dataloading/data_loader_2d.py", line 18, in generate_train_batch data, seg, properties = self._data.load_case(current_key) File "/home/jzuser/.local/lib/python3.10/site-packages/nnunetv2/training/dataloading/nnunet_dataset.py", line 86, in load_case data = np.load(entry['data_file'][:-4] + ".npy", 'r') File "/home/jzuser/.local/lib/python3.10/site-packages/numpy/lib/_npyio_impl.py", line 460, in load raise EOFError("No data left in file") EOFError: No data left in file Exception in background worker 3: No data left in file Traceback (most 
recent call last): File "/home/jzuser/.local/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 53, in producer item = next(data_loader) File "/home/jzuser/.local/lib/python3.10/site-packages/batchgenerators/dataloading/data_loader.py", line 126, in __next__ return self.generate_train_batch() File "/home/jzuser/.local/lib/python3.10/site-packages/nnunetv2/training/dataloading/data_loader_2d.py", line 18, in generate_train_batch data, seg, properties = self._data.load_case(current_key) File "/home/jzuser/.local/lib/python3.10/site-packages/nnunetv2/training/dataloading/nnunet_dataset.py", line 86, in load_case data = np.load(entry['data_file'][:-4] + ".npy", 'r') File "/home/jzuser/.local/lib/python3.10/site-packages/numpy/lib/_npyio_impl.py", line 460, in load raise EOFError("No data left in file") EOFError: No data left in file using pin_memory on device 0 Traceback (most recent call last): File "/home/jzuser/.local/bin/nnUNetv2_train", line 8, in <module> sys.exit(run_training_entry()) File "/home/jzuser/.local/lib/python3.10/site-packages/nnunetv2/run/run_training.py", line 252, in run_training_entry run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights, File "/home/jzuser/.local/lib/python3.10/site-packages/nnunetv2/run/run_training.py", line 195, in run_training nnunet_trainer.run_training() File "/home/jzuser/.local/lib/python3.10/site-packages/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1211, in run_training train_outputs.append(self.train_step(next(self.dataloader_train))) File "/home/jzuser/.local/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 196, in __next__ item = self.__get_next_item() File "/home/jzuser/.local/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 181, in __get_next_item raise RuntimeError("One or more background workers are 
no longer alive. Exiting. Please check the " RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message Using device: cuda:0 /home/jzuser/.local/lib/python3.10/site-packages/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py:152: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead. self.grad_scaler = GradScaler() if self.device.type == 'cuda' else None ####################################################################### Please cite the following paper when using nnU-Net: Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211. ####################################################################### /home/jzuser/.local/lib/python3.10/site-packages/torch/optim/lr_scheduler.py:62: UserWarning: The verbose parameter is deprecated. Please use get_last_lr() to access the learning rate. 
warnings.warn( This is the configuration used by this training: Configuration name: 2d {'data_identifier': 'nnUNetPlans_2d', 'preprocessor_name': 'DefaultPreprocessor', 'batch_size': 12, 'patch_size': [512, 512], 'median_image_size_in_voxels': [512.0, 512.0], 'spacing': [0.767578125, 0.767578125], 'normalization_schemes': ['CTNormalization'], 'use_mask_for_norm': [False], 'UNet_class_name': 'PlainConvUNet', 'UNet_base_num_features': 32, 'n_conv_per_stage_encoder': [2, 2, 2, 2, 2, 2, 2, 2], 'n_conv_per_stage_decoder': [2, 2, 2, 2, 2, 2, 2], 'num_pool_per_axis': [7, 7], 'pool_op_kernel_sizes': [[1, 1], [2, 2], [2, 2], [2, 2], [2, 2], [2, 2], [2, 2], [2, 2]], 'conv_kernel_sizes': [[3, 3], [3, 3], [3, 3], [3, 3], [3, 3], [3, 3], [3, 3], [3, 3]], 'unet_max_num_features': 512, 'resampling_fn_data': 'resample_data_or_seg_to_shape', 'resampling_fn_seg': 'resample_data_or_seg_to_shape', 'resampling_fn_data_kwargs': {'is_seg': False, 'order': 3, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_seg_kwargs': {'is_seg': True, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_probabilities': 'resample_data_or_seg_to_shape', 'resampling_fn_probabilities_kwargs': {'is_seg': False, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'batch_dice': True} These are the global plan.json settings: {'dataset_name': 'Dataset003_Liver', 'plans_name': 'nnUNetPlans', 'original_median_spacing_after_transp': [1.0, 0.767578125, 0.767578125], 'original_median_shape_after_transp': [432, 512, 512], 'image_reader_writer': 'SimpleITKIO', 'transpose_forward': [0, 1, 2], 'transpose_backward': [0, 1, 2], 'experiment_planner_used': 'ExperimentPlanner', 'label_manager': 'LabelManager', 'foreground_intensity_properties_per_channel': {'0': {'max': 5420.0, 'mean': 99.48007202148438, 'median': 101.0, 'min': -983.0, 'percentile_00_5': -15.0, 'percentile_99_5': 197.0, 'std': 37.13840103149414}}} 2025-08-27 11:21:08.460438: unpacking dataset... 
2025-08-27 11:21:11.615700: unpacking done... 2025-08-27 11:21:11.616486: do_dummy_2d_data_aug: False 2025-08-27 11:21:11.618074: Using splits from existing split file: /home/jzuser/Work_dir/Gn/pystudy/NnuNet/nnUNet_preprocessed/Dataset003_Liver/splits_final.json 2025-08-27 11:21:11.618454: The split file contains 5 splits. 2025-08-27 11:21:11.618557: Desired fold for training: 3 2025-08-27 11:21:11.618626: This split has 105 training and 26 validation cases. 2025-08-27 11:21:11.628197: Unable to plot network architecture: 2025-08-27 11:21:11.628319: No module named 'hiddenlayer' 2025-08-27 11:21:11.635873: 2025-08-27 11:21:11.636014: Epoch 0 2025-08-27 11:21:11.636152: Current learning rate: 0.01 Exception in background worker 1: No data left in file Traceback (most recent call last): File "/home/jzuser/.local/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 53, in producer item = next(data_loader) File "/home/jzuser/.local/lib/python3.10/site-packages/batchgenerators/dataloading/data_loader.py", line 126, in __next__ return self.generate_train_batch() File "/home/jzuser/.local/lib/python3.10/site-packages/nnunetv2/training/dataloading/data_loader_2d.py", line 18, in generate_train_batch data, seg, properties = self._data.load_case(current_key) File "/home/jzuser/.local/lib/python3.10/site-packages/nnunetv2/training/dataloading/nnunet_dataset.py", line 86, in load_case data = np.load(entry['data_file'][:-4] + ".npy", 'r') File "/home/jzuser/.local/lib/python3.10/site-packages/numpy/lib/_npyio_impl.py", line 460, in load raise EOFError("No data left in file") EOFError: No data left in file Exception in background worker 3: No data left in file Traceback (most recent call last): File "/home/jzuser/.local/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 53, in producer item = next(data_loader) File 
"/home/jzuser/.local/lib/python3.10/site-packages/batchgenerators/dataloading/data_loader.py", line 126, in __next__ return self.generate_train_batch() File "/home/jzuser/.local/lib/python3.10/site-packages/nnunetv2/training/dataloading/data_loader_2d.py", line 18, in generate_train_batch data, seg, properties = self._data.load_case(current_key) File "/home/jzuser/.local/lib/python3.10/site-packages/nnunetv2/training/dataloading/nnunet_dataset.py", line 86, in load_case data = np.load(entry['data_file'][:-4] + ".npy", 'r') File "/home/jzuser/.local/lib/python3.10/site-packages/numpy/lib/_npyio_impl.py", line 460, in load raise EOFError("No data left in file") EOFError: No data left in file Exception in background worker 2: No data left in file Traceback (most recent call last): File "/home/jzuser/.local/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 53, in producer item = next(data_loader) File "/home/jzuser/.local/lib/python3.10/site-packages/batchgenerators/dataloading/data_loader.py", line 126, in __next__ return self.generate_train_batch() File "/home/jzuser/.local/lib/python3.10/site-packages/nnunetv2/training/dataloading/data_loader_2d.py", line 18, in generate_train_batch data, seg, properties = self._data.load_case(current_key) File "/home/jzuser/.local/lib/python3.10/site-packages/nnunetv2/training/dataloading/nnunet_dataset.py", line 86, in load_case data = np.load(entry['data_file'][:-4] + ".npy", 'r') File "/home/jzuser/.local/lib/python3.10/site-packages/numpy/lib/_npyio_impl.py", line 460, in load raise EOFError("No data left in file") EOFError: No data left in file using pin_memory on device 0 Traceback (most recent call last): File "/home/jzuser/.local/bin/nnUNetv2_train", line 8, in <module> sys.exit(run_training_entry()) File "/home/jzuser/.local/lib/python3.10/site-packages/nnunetv2/run/run_training.py", line 252, in run_training_entry run_training(args.dataset_name_or_id, args.configuration, 
args.fold, args.tr, args.p, args.pretrained_weights, File "/home/jzuser/.local/lib/python3.10/site-packages/nnunetv2/run/run_training.py", line 195, in run_training nnunet_trainer.run_training() File "/home/jzuser/.local/lib/python3.10/site-packages/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1211, in run_training train_outputs.append(self.train_step(next(self.dataloader_train))) File "/home/jzuser/.local/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 196, in __next__ item = self.__get_next_item() File "/home/jzuser/.local/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 181, in __get_next_item raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the " RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message Using device: cuda:0 /home/jzuser/.local/lib/python3.10/site-packages/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py:152: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead. self.grad_scaler = GradScaler() if self.device.type == 'cuda' else None ####################################################################### Please cite the following paper when using nnU-Net: Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211. ####################################################################### /home/jzuser/.local/lib/python3.10/site-packages/torch/optim/lr_scheduler.py:62: UserWarning: The verbose parameter is deprecated. Please use get_last_lr() to access the learning rate. 
warnings.warn( This is the configuration used by this training: Configuration name: 2d {'data_identifier': 'nnUNetPlans_2d', 'preprocessor_name': 'DefaultPreprocessor', 'batch_size': 12, 'patch_size': [512, 512], 'median_image_size_in_voxels': [512.0, 512.0], 'spacing': [0.767578125, 0.767578125], 'normalization_schemes': ['CTNormalization'], 'use_mask_for_norm': [False], 'UNet_class_name': 'PlainConvUNet', 'UNet_base_num_features': 32, 'n_conv_per_stage_encoder': [2, 2, 2, 2, 2, 2, 2, 2], 'n_conv_per_stage_decoder': [2, 2, 2, 2, 2, 2, 2], 'num_pool_per_axis': [7, 7], 'pool_op_kernel_sizes': [[1, 1], [2, 2], [2, 2], [2, 2], [2, 2], [2, 2], [2, 2], [2, 2]], 'conv_kernel_sizes': [[3, 3], [3, 3], [3, 3], [3, 3], [3, 3], [3, 3], [3, 3], [3, 3]], 'unet_max_num_features': 512, 'resampling_fn_data': 'resample_data_or_seg_to_shape', 'resampling_fn_seg': 'resample_data_or_seg_to_shape', 'resampling_fn_data_kwargs': {'is_seg': False, 'order': 3, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_seg_kwargs': {'is_seg': True, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_probabilities': 'resample_data_or_seg_to_shape', 'resampling_fn_probabilities_kwargs': {'is_seg': False, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'batch_dice': True} These are the global plan.json settings: {'dataset_name': 'Dataset003_Liver', 'plans_name': 'nnUNetPlans', 'original_median_spacing_after_transp': [1.0, 0.767578125, 0.767578125], 'original_median_shape_after_transp': [432, 512, 512], 'image_reader_writer': 'SimpleITKIO', 'transpose_forward': [0, 1, 2], 'transpose_backward': [0, 1, 2], 'experiment_planner_used': 'ExperimentPlanner', 'label_manager': 'LabelManager', 'foreground_intensity_properties_per_channel': {'0': {'max': 5420.0, 'mean': 99.48007202148438, 'median': 101.0, 'min': -983.0, 'percentile_00_5': -15.0, 'percentile_99_5': 197.0, 'std': 37.13840103149414}}} 2025-08-27 11:21:18.424697: unpacking dataset... 
2025-08-27 11:21:21.510880: unpacking done... 2025-08-27 11:21:21.511596: do_dummy_2d_data_aug: False 2025-08-27 11:21:21.513083: Using splits from existing split file: /home/jzuser/Work_dir/Gn/pystudy/NnuNet/nnUNet_preprocessed/Dataset003_Liver/splits_final.json 2025-08-27 11:21:21.513473: The split file contains 5 splits. 2025-08-27 11:21:21.513583: Desired fold for training: 4 2025-08-27 11:21:21.513662: This split has 105 training and 26 validation cases. 2025-08-27 11:21:21.521894: Unable to plot network architecture: 2025-08-27 11:21:21.522025: No module named 'hiddenlayer' 2025-08-27 11:21:21.530473: 2025-08-27 11:21:21.530618: Epoch 0 2025-08-27 11:21:21.530759: Current learning rate: 0.01 Exception in background worker 2: No data left in file Traceback (most recent call last): File "/home/jzuser/.local/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 53, in producer item = next(data_loader) File "/home/jzuser/.local/lib/python3.10/site-packages/batchgenerators/dataloading/data_loader.py", line 126, in __next__ return self.generate_train_batch() File "/home/jzuser/.local/lib/python3.10/site-packages/nnunetv2/training/dataloading/data_loader_2d.py", line 18, in generate_train_batch data, seg, properties = self._data.load_case(current_key) File "/home/jzuser/.local/lib/python3.10/site-packages/nnunetv2/training/dataloading/nnunet_dataset.py", line 97, in load_case seg = np.load(entry['data_file'][:-4] + "_seg.npy", 'r') File "/home/jzuser/.local/lib/python3.10/site-packages/numpy/lib/_npyio_impl.py", line 460, in load raise EOFError("No data left in file") EOFError: No data left in file Exception in background worker 4: No data left in file Exception in background worker 1: mmap length is greater than file size Traceback (most recent call last): File "/home/jzuser/.local/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 53, in producer item = next(data_loader) 
File "/home/jzuser/.local/lib/python3.10/site-packages/batchgenerators/dataloading/data_loader.py", line 126, in __next__ return self.generate_train_batch() File "/home/jzuser/.local/lib/python3.10/site-packages/nnunetv2/training/dataloading/data_loader_2d.py", line 18, in generate_train_batch data, seg, properties = self._data.load_case(current_key) File "/home/jzuser/.local/lib/python3.10/site-packages/nnunetv2/training/dataloading/nnunet_dataset.py", line 86, in load_case data = np.load(entry['data_file'][:-4] + ".npy", 'r') File "/home/jzuser/.local/lib/python3.10/site-packages/numpy/lib/_npyio_impl.py", line 477, in load return format.open_memmap(file, mode=mmap_mode, File "/home/jzuser/.local/lib/python3.10/site-packages/numpy/lib/format.py", line 965, in open_memmap marray = numpy.memmap(filename, dtype=dtype, shape=shape, order=order, File "/home/jzuser/.local/lib/python3.10/site-packages/numpy/_core/memmap.py", line 289, in __new__ mm = mmap.mmap(fid.fileno(), bytes, access=acc, offset=start) ValueError: mmap length is greater than file size Traceback (most recent call last): File "/home/jzuser/.local/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 53, in producer item = next(data_loader) File "/home/jzuser/.local/lib/python3.10/site-packages/batchgenerators/dataloading/data_loader.py", line 126, in __next__ return self.generate_train_batch() File "/home/jzuser/.local/lib/python3.10/site-packages/nnunetv2/training/dataloading/data_loader_2d.py", line 18, in generate_train_batch data, seg, properties = self._data.load_case(current_key) File "/home/jzuser/.local/lib/python3.10/site-packages/nnunetv2/training/dataloading/nnunet_dataset.py", line 86, in load_case data = np.load(entry['data_file'][:-4] + ".npy", 'r') File "/home/jzuser/.local/lib/python3.10/site-packages/numpy/lib/_npyio_impl.py", line 460, in load raise EOFError("No data left in file") EOFError: No data left in file Exception in background 
worker 3: No data left in file Traceback (most recent call last): File "/home/jzuser/.local/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 53, in producer item = next(data_loader) File "/home/jzuser/.local/lib/python3.10/site-packages/batchgenerators/dataloading/data_loader.py", line 126, in __next__ return self.generate_train_batch() File "/home/jzuser/.local/lib/python3.10/site-packages/nnunetv2/training/dataloading/data_loader_2d.py", line 18, in generate_train_batch data, seg, properties = self._data.load_case(current_key) File "/home/jzuser/.local/lib/python3.10/site-packages/nnunetv2/training/dataloading/nnunet_dataset.py", line 86, in load_case data = np.load(entry['data_file'][:-4] + ".npy", 'r') File "/home/jzuser/.local/lib/python3.10/site-packages/numpy/lib/_npyio_impl.py", line 460, in load raise EOFError("No data left in file") EOFError: No data left in file Exception in background worker 0: No data left in file Exception in background worker 5: No data left in file Traceback (most recent call last): File "/home/jzuser/.local/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 53, in producer item = next(data_loader) File "/home/jzuser/.local/lib/python3.10/site-packages/batchgenerators/dataloading/data_loader.py", line 126, in __next__ return self.generate_train_batch() File "/home/jzuser/.local/lib/python3.10/site-packages/nnunetv2/training/dataloading/data_loader_2d.py", line 18, in generate_train_batch data, seg, properties = self._data.load_case(current_key) File "/home/jzuser/.local/lib/python3.10/site-packages/nnunetv2/training/dataloading/nnunet_dataset.py", line 86, in load_case data = np.load(entry['data_file'][:-4] + ".npy", 'r') File "/home/jzuser/.local/lib/python3.10/site-packages/numpy/lib/_npyio_impl.py", line 460, in load raise EOFError("No data left in file") EOFError: No data left in file Traceback (most recent call last): File 
"/home/jzuser/.local/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 53, in producer item = next(data_loader) File "/home/jzuser/.local/lib/python3.10/site-packages/batchgenerators/dataloading/data_loader.py", line 126, in __next__ return self.generate_train_batch() File "/home/jzuser/.local/lib/python3.10/site-packages/nnunetv2/training/dataloading/data_loader_2d.py", line 18, in generate_train_batch data, seg, properties = self._data.load_case(current_key) File "/home/jzuser/.local/lib/python3.10/site-packages/nnunetv2/training/dataloading/nnunet_dataset.py", line 86, in load_case data = np.load(entry['data_file'][:-4] + ".npy", 'r') File "/home/jzuser/.local/lib/python3.10/site-packages/numpy/lib/_npyio_impl.py", line 460, in load raise EOFError("No data left in file") EOFError: No data left in file using pin_memory on device 0 Traceback (most recent call last): File "/home/jzuser/.local/bin/nnUNetv2_train", line 8, in <module> sys.exit(run_training_entry()) File "/home/jzuser/.local/lib/python3.10/site-packages/nnunetv2/run/run_training.py", line 252, in run_training_entry run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights, File "/home/jzuser/.local/lib/python3.10/site-packages/nnunetv2/run/run_training.py", line 195, in run_training nnunet_trainer.run_training() File "/home/jzuser/.local/lib/python3.10/site-packages/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1211, in run_training train_outputs.append(self.train_step(next(self.dataloader_train))) File "/home/jzuser/.local/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 196, in __next__ item = self.__get_next_item() File "/home/jzuser/.local/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 181, in __get_next_item raise RuntimeError("One or more background workers are no longer alive. Exiting. 
Please check the " RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message如何解决
Answered 08-28:
Your error output contains two key problems:

---

## 🔴 **1. No space left on device (out of disk space)**

### ❓ Symptom:

```bash
OSError: [Errno 28] No space left on device
```

This means you ran out of disk space while training `nnUNet`, so `.npy` files or training log files could not be written.

### ✅ Fixes:

#### Option 1: Free up disk space

Check the current disk usage with:

```bash
df -h
```

and see which partition is full, especially `/home` or `/tmp`.

##### Common cleanup steps:

- Empty the Trash (as you mentioned earlier):

```bash
rm -rf ~/.local/share/Trash/files/*
```

- Delete old training outputs (e.g. old models under the `nnUNet_results` directory):

```bash
rm -rf /home/jzuser/Work_dir/Gn/pystudy/NnuNet/nnUNet_results/*
```

- Drop the system caches (needs `sudo`):

```bash
sudo sync && echo 3 > /proc/sys/vm/drop_caches
```

#### Option 2: Move the preprocessing and training output to a larger disk

Point the `nnUNet_results` and `nnUNet_preprocessed` environment variables at a disk with more space:

```bash
export nnUNet_results="/mnt/big_disk/nnUNet_results"
export nnUNet_preprocessed="/mnt/big_disk/nnUNet_preprocessed"
```

Make sure you have permission to access and write to these directories.

---

## 🔴 **2. EOFError: No data left in file (data loading failure)**

### ❓ Symptom:

```bash
EOFError: No data left in file
```

This error means a `.npy` file was corrupt or incomplete when loaded, most likely because the earlier disk-space shortage made a write fail.

### ✅ Fixes:

#### Option 1: Re-run preprocessing

Rebuild the `.npy` files by re-running the preprocessing command:

```bash
nnUNetv2_plan_and_preprocess -d 3 --verify_dataset_integrity
```

This regenerates the preprocessed data and verifies dataset integrity.

#### Option 2: Check the `.npy` files by hand

Go into the preprocessed directory and check whether the `.npy` files are damaged:

```bash
ls -l /home/jzuser/Work_dir/Gn/pystudy/NnuNet/nnUNet_preprocessed/Dataset003_Liver/
```

You can try loading a single `.npy` file as a test:

```python
import numpy as np
data = np.load("some_case.npy")
print(data.shape)
```

If loading fails, the file is corrupt; delete it and re-run preprocessing.

---

## 🛠️ **Extra suggestions: tune the training setup**

### 1. Lower the `batch_size` (if GPU or host memory is tight)

Adjust the `batch_size` parameter in `nnUNetPlans` to avoid running out of memory.

### 2. Use a smaller patch size

Change `patch_size` in `nnUNetPlans`, e.g. from `[512, 512]` to `[256, 256]`.

### 3. Check `.npz` integrity

Load an `.npz` file by hand to confirm the data is intact:

```python
import numpy as np
data = np.load("some_case.npz")
print(data['data'].shape)
print(data['seg'].shape)
```

---

## ✅ Summary: full recovery procedure

1. **Free up disk space**
   - empty the Trash
   - delete old training results
   - drop caches
2. **Re-run preprocessing**
   ```bash
   nnUNetv2_plan_and_preprocess -d 3 --verify_dataset_integrity
   ```
3. **Point the output paths at a larger disk**
   ```bash
   export nnUNet_results="/mnt/big_disk/nnUNet_results"
   export nnUNet_preprocessed="/mnt/big_disk/nnUNet_preprocessed"
   ```
4. **Re-run training**
   ```bash
   nnUNetv2_train 3 2d 0
   ```

---