[PyTorch] Removing specific samples by index from train_set or test_set in a custom Dataset class

In PyTorch, data is usually handled with a DataLoader plus a custom Dataset. If you want to filter out problematic samples before training, you can add an index-based condition inside the Dataset class, for example setting the problematic sample's value to None. You also need to modify the default_collate function in `torch/utils/data/_utils/collate.py`, adding logic to handle batches that contain None values. This way, problematic samples can be removed dynamically without cleaning the dataset beforehand.

When working on deep learning tasks, we usually load raw or preprocessed data with "PyTorch's built-in DataLoader + a custom Dataset class". This pipeline is quite rigid, however: whatever you feed into the Dataset is exactly what the DataLoader hands over to training. That raises a question: if we know that some samples in the raw or preprocessed data are problematic, but we do not want to crudely clean the data (delete those samples) before passing it to the Dataset class, how can we make our own Dataset class automatically filter out (remove) those samples?

For example, suppose the preprocessed data passed to the Dataset class has shape (1345, 256, 256, 3), i.e. 1345 images, and I know the 1239th image is bad and do not want to train on it. I want the Dataset class to automatically drop this sample by its index (1238, since indexing starts at 0) before the data reaches training.

Credit for the idea goes to the post 《Pytorch自定义Dataset和DataLoader去除不存在和空的数据》, but that author's code only works when the Dataset returns exactly two values, and it also requires writing a separate dataset_collate script, which is cumbersome and not general.

Here is the improved approach; it only takes two steps.

1. Add a small piece of code to your own Dataset class:

Suppose the initial Dataset class looks like the following (simplified). Each sample returns several values, such as img, pose, heatmap, landmark, and so on; the number of returned values does not matter.

from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self):
        pass

    def load_data(self, path):
        pass

    def __len__(self):
        pass

    def __getitem__(self, index):
        # ... load img, pose, heatmap, landmark for this index ...
        return img, pose, heatmap, landmark

To remove a specific sample by index inside the Dataset class, you only need to set the first returned value to None for that index, by adding the following two lines:

        if index == 1238:   # index 1238 corresponds to the 1239th image
            img = None

Note that it does not have to be img that is set to None; it is just that in this example img happens to be the first returned value, which is why img = None is used. The complete class looks like this:

from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self):
        pass

    def load_data(self, path):
        pass

    def __len__(self):
        pass

    def __getitem__(self, index):
        # ... load img, pose, heatmap, landmark for this index ...
        if index == 1238:   # index 1238 corresponds to the 1239th image
            img = None
        return img, pose, heatmap, landmark

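If more than one sample needs to be dropped, the same trick generalizes. Below is a minimal sketch of the relevant lines in __getitem__, assuming a hypothetical self.bad_indices set defined in __init__ (the indices are illustrative):

        # e.g. in __init__: self.bad_indices = {1238, 77}  # hypothetical bad indices
        if index in self.bad_indices:
            img = None
        return img, pose, heatmap, landmark
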
2. Modify the default_collate function in your_python_envs_path/Lib/site-packages/torch/utils/data/_utils/collate.py, the copy installed in your Python environment.

The change above is not enough on its own: with the original default_collate() function you will get an error, so two lines need to be added to its code:

    if isinstance(batch, list):
        batch = [i for i in batch if i[0] is not None]

If you use VSCode, you can create a script, write `import torch`, then type torch.utils.data.DataLoader, hold Ctrl and click on DataLoader to jump to the directory containing the DataLoader source; from there, open the _utils folder and you will find collate.py.
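Alternatively, the file's location can be printed directly from Python, which works regardless of editor:

import torch.utils.data._utils.collate as collate
print(collate.__file__)  # full path of collate.py in the current environment
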

The original default_collate() function is as follows:

def default_collate(batch):
    r"""Puts each data field into a tensor with outer dimension batch size"""
    elem = batch[0]
    elem_type = type(elem)
    if isinstance(elem, torch.Tensor):
        out = None
        if torch.utils.data.get_worker_info() is not None:
            # If we're in a background process, concatenate directly into a
            # shared memory tensor to avoid an extra copy
            numel = sum(x.numel() for x in batch)
            storage = elem.storage()._new_shared(numel)
            out = elem.new(storage)
        return torch.stack(batch, 0, out=out)
    elif elem_type.__module__ == 'numpy' and elem_type.__name__ != 'str_' \
            and elem_type.__name__ != 'string_':
        if elem_type.__name__ == 'ndarray' or elem_type.__name__ == 'memmap':
            # array of string classes and object
            if np_str_obj_array_pattern.search(elem.dtype.str) is not None:
                raise TypeError(default_collate_err_msg_format.format(elem.dtype))

            return default_collate([torch.as_tensor(b) for b in batch])
        elif elem.shape == ():  # scalars
            return torch.as_tensor(batch)
    elif isinstance(elem, float):
        return torch.tensor(batch, dtype=torch.float64)
    elif isinstance(elem, int):
        return torch.tensor(batch)
    elif isinstance(elem, string_classes):
        return batch
    elif isinstance(elem, collections.abc.Mapping):
        return {key: default_collate([d[key] for d in batch]) for key in elem}
    elif isinstance(elem, tuple) and hasattr(elem, '_fields'):  # namedtuple
        return elem_type(*(default_collate(samples) for samples in zip(*batch)))
    elif isinstance(elem, collections.abc.Sequence):
        # check to make sure that the elements in batch have consistent size
        it = iter(batch)
        elem_size = len(next(it))
        if not all(len(elem) == elem_size for elem in it):
            raise RuntimeError('each element in list of batch should be of equal size')
        transposed = zip(*batch)
        return [default_collate(samples) for samples in transposed]

    raise TypeError(default_collate_err_msg_format.format(elem_type))

The default_collate() function with the added code is as follows:

def default_collate(batch):
    r"""Puts each data field into a tensor with outer dimension batch size"""

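    # the two added lines: drop any sample whose first returned value is None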
    if isinstance(batch, list):
        batch = [i for i in batch if i[0] is not None]

    elem = batch[0]
    elem_type = type(elem)
    if isinstance(elem, torch.Tensor):
        out = None
        if torch.utils.data.get_worker_info() is not None:
            # If we're in a background process, concatenate directly into a
            # shared memory tensor to avoid an extra copy
            numel = sum(x.numel() for x in batch)
            storage = elem.storage()._new_shared(numel)
            out = elem.new(storage)
        return torch.stack(batch, 0, out=out)
    elif elem_type.__module__ == 'numpy' and elem_type.__name__ != 'str_' \
            and elem_type.__name__ != 'string_':
        if elem_type.__name__ == 'ndarray' or elem_type.__name__ == 'memmap':
            # array of string classes and object
            if np_str_obj_array_pattern.search(elem.dtype.str) is not None:
                raise TypeError(default_collate_err_msg_format.format(elem.dtype))

            return default_collate([torch.as_tensor(b) for b in batch])
        elif elem.shape == ():  # scalars
            return torch.as_tensor(batch)
    elif isinstance(elem, float):
        return torch.tensor(batch, dtype=torch.float64)
    elif isinstance(elem, int):
        return torch.tensor(batch)
    elif isinstance(elem, string_classes):
        return batch
    elif isinstance(elem, collections.abc.Mapping):
        return {key: default_collate([d[key] for d in batch]) for key in elem}
    elif isinstance(elem, tuple) and hasattr(elem, '_fields'):  # namedtuple
        return elem_type(*(default_collate(samples) for samples in zip(*batch)))
    elif isinstance(elem, collections.abc.Sequence):
        # check to make sure that the elements in batch have consistent size
        it = iter(batch)
        elem_size = len(next(it))
        if not all(len(elem) == elem_size for elem in it):
            raise RuntimeError('each element in list of batch should be of equal size')
        transposed = zip(*batch)
        return [default_collate(samples) for samples in transposed]

    raise TypeError(default_collate_err_msg_format.format(elem_type))

Once these two steps are done, you can remove samples from the training set or test set by index inside your custom Dataset class.
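As a less invasive alternative, the same filter can be wrapped in a local collate_fn and passed to the DataLoader, so the installed collate.py stays untouched. Below is a minimal, self-contained sketch; the toy dataset, tensor shapes, and the bad index 3 are illustrative assumptions, not part of the original example:

import torch
from torch.utils.data import Dataset, DataLoader
from torch.utils.data._utils.collate import default_collate

class ToyDataset(Dataset):
    """Illustrative dataset: 10 fake 'images' plus one extra label per sample."""
    def __init__(self, n=10, bad_indices=(3,)):
        self.n = n
        self.bad_indices = set(bad_indices)  # hypothetical problematic samples

    def __len__(self):
        return self.n

    def __getitem__(self, index):
        img = torch.randn(3, 8, 8)
        pose = torch.tensor(float(index))
        if index in self.bad_indices:
            img = None  # first returned value marks the sample for removal
        return img, pose

def filtering_collate(batch):
    # drop samples whose first element is None, then collate as usual
    batch = [sample for sample in batch if sample[0] is not None]
    if len(batch) == 0:
        return None  # every sample in this batch was filtered out
    return default_collate(batch)

loader = DataLoader(ToyDataset(), batch_size=4, collate_fn=filtering_collate)
for batch in loader:
    if batch is None:
        continue
    imgs, poses = batch
    print(imgs.shape, poses.shape)  # the batch containing index 3 has one fewer sample

Passing collate_fn this way has the same effect as the modification above, avoids editing site-packages, and also lets you decide explicitly what to do when an entire batch happens to be filtered out (here it is simply skipped).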
