Fine-tuning DeepSeek with transformers.Trainer.train() raises RuntimeError: chunk expects at least a 1-dimensional tensor

With the dataset preprocessing and training arguments both verified, fine-tuning still fails with:

File "/home/chenrd/.conda/envs/nlp/lib/python3.8/contextlib.py", line 131, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/chenrd/.conda/envs/nlp/lib/python3.8/site-packages/accelerate/accelerator.py", line 995, in no_sync
    yield
  File "/home/chenrd/.conda/envs/nlp/lib/python3.8/site-packages/transformers/trainer.py", line 2481, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
  File "/home/chenrd/.conda/envs/nlp/lib/python3.8/site-packages/transformers/trainer.py", line 3579, in training_step
    loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
  File "/home/chenrd/.conda/envs/nlp/lib/python3.8/site-packages/transformers/trainer.py", line 3633, in compute_loss
    outputs = model(**inputs)
  File "/home/chenrd/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/chenrd/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/chenrd/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 176, in forward
    inputs, module_kwargs = self.scatter(inputs, kwargs, self.device_ids)
  File "/home/chenrd/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 198, in scatter
    return scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim)
  File "/home/chenrd/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 78, in scatter_kwargs
    scattered_kwargs = scatter(kwargs, target_gpus, dim) if kwargs else []
  File "/home/chenrd/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 64, in scatter
    res = scatter_map(inputs)
  File "/home/chenrd/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in scatter_map
    return [type(obj)(i) for i in zip(*map(scatter_map, obj.items()))]
  File "/home/chenrd/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 51, in scatter_map
    return list(zip(*map(scatter_map, obj)))
  File "/home/chenrd/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 47, in scatter_map
    return Scatter.apply(target_gpus, None, dim, obj)
  File "/home/chenrd/.conda/envs/nlp/lib/python3.8/site-packages/torch/autograd/function.py", line 574, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/chenrd/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/parallel/_functions.py", line 96, in forward
    outputs = comm.scatter(input, target_gpus, chunk_sizes, ctx.dim, streams)
  File "/home/chenrd/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/parallel/comm.py", line 188, in scatter
    return tuple(torch._C._scatter(tensor, devices, chunk_sizes, dim, streams))
RuntimeError: chunk expects at least a 1-dimensional tensor
python-BaseException

Tracing the bug shows that the train() call creates a num_items_in_batch variable:

batch_samples, num_items_in_batch = self.get_batch_samples(epoch_iterator, num_batches)

This variable is a 0-dimensional tensor.

Downstream, the scatter function in torch/nn/parallel/comm.py splits this tensor across devices, but a tensor being split must be at least 1-dimensional.
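The failure is easy to reproduce in isolation: comm.scatter ultimately splits the tensor with chunk semantics, and torch.chunk refuses 0-dimensional input. A minimal sketch:

```python
import torch

# A 0-dim (scalar) tensor, like the num_items_in_batch that train() produces
scalar = torch.tensor(7)
print(scalar.dim())  # 0

# chunk() rejects 0-dim tensors -- the same error DataParallel's scatter raises
try:
    torch.chunk(scalar, 2)
except RuntimeError as e:
    print(e)  # chunk expects at least a 1-dimensional tensor

# After adding a dimension, the split succeeds
print(torch.chunk(scalar.reshape(1), 2))
```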

Solution 1

The error comes from torch's data-parallel path, which is only taken when multiple GPUs are available, so one fix is to restrict training to a single GPU:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
torch.cuda.set_device(0)
# The lines above must run before importing any transformers modules
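A quick sanity check (not part of the original fix) that the restriction took effect: with only one visible device, Trainer never wraps the model in DataParallel, so the faulty scatter path is never reached.

```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # must run before torch/transformers are imported

import torch

# With a single visible device, device_count() is at most 1 and
# Trainer skips the DataParallel wrapper entirely.
if torch.cuda.is_available():
    assert torch.cuda.device_count() == 1
```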

Solution 2

Running on a single GPU is painfully slow, so the alternative is to subclass Trainer and override the affected method, keeping multi-GPU parallelism:

class MyTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
        """
        How the loss is computed by Trainer. By default, all models return the loss in the first element.

        Subclass and override for custom behavior.
        """
        if (self.label_smoother is not None or self.compute_loss_func is not None) and "labels" in inputs:
            labels = inputs.pop("labels")
        else:
            labels = None
        if self.model_accepts_loss_kwargs:
            loss_kwargs = {}
            if num_items_in_batch is not None:
                loss_kwargs["num_items_in_batch"] = num_items_in_batch.reshape(1,-1)
            inputs = {**inputs, **loss_kwargs}
        outputs = model(**inputs)
        # Save past state if it exists
        # TODO: this needs to be fixed and made cleaner later.
        if self.args.past_index >= 0:
            self._past = outputs[self.args.past_index]

        if labels is not None:
            unwrapped_model = self.accelerator.unwrap_model(model)
            if self._is_peft_model(unwrapped_model):
                model_name = unwrapped_model.base_model.model._get_name()
            else:
                model_name = unwrapped_model._get_name()
            # User-defined compute_loss function
            if self.compute_loss_func is not None:
                loss = self.compute_loss_func(outputs, labels, num_items_in_batch=num_items_in_batch)
            elif model_name in MODEL_FOR_CAUSAL_LM_MAPPING_NAMES.values():
                loss = self.label_smoother(outputs, labels, shift_labels=True)
            else:
                loss = self.label_smoother(outputs, labels)
        else:
            if isinstance(outputs, dict) and "loss" not in outputs:
                raise ValueError(
                    "The model did not return a loss from the inputs, only the following keys: "
                    f"{','.join(outputs.keys())}. For reference, the inputs it received are {','.join(inputs.keys())}."
                )
            # We don't use .loss here since the model may return tuples instead of ModelOutput.
            loss = outputs["loss"] if isinstance(outputs, dict) else outputs[0]

        if self.args.average_tokens_across_devices and self.model_accepts_loss_kwargs:
            loss *= self.accelerator.num_processes

        return (loss, outputs) if return_outputs else loss

The only change is to add a dimension to num_items_in_batch, which avoids the error:

loss_kwargs["num_items_in_batch"] = num_items_in_batch.reshape(1,-1)
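A quick check of what reshape(1, -1) does to the shape (a sketch with a made-up value; during training the tensor holds the real token count):

```python
import torch

num_items_in_batch = torch.tensor(1024)    # 0-dim, as produced by train()
fixed = num_items_in_batch.reshape(1, -1)  # now 2-D with shape (1, 1)

print(fixed.shape)                   # torch.Size([1, 1])
print(torch.chunk(fixed, 2, dim=0))  # splits without raising
```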

Problem solved.

#### Root-cause analysis

When PEFT (e.g. LoRA) is combined with the Hugging Face Trainer, this error is usually caused by one of the following:

1. **Adapter output losing dimensions**: a PEFT module unexpectedly returns a scalar or 0-dim tensor
2. **Missing dimensions after collation**: a custom `data_collator` does not preserve tensor dimensions
3. **Gradient-accumulation conflict**: `gradient_accumulation_steps` interacts badly with multi-GPU splitting
4. **Mixed-precision issues**: `fp16` mode degrades tensor dimensions[^2]

---

#### Targeted fixes

##### 1. Check the PEFT adapter dimensions (key step)

Verify that `target_modules` in the LoRA config matches the model:

```python
from peft import LoraConfig

peft_config = LoraConfig(
    target_modules=["query_key_value"],  # check against the model's actual module names
    r=8,
    lora_alpha=16,
    modules_to_save=["output_layer"]  # keep key layers from losing dimensions
)
```

##### 2. Fix the collation dimensions

Force every field in a custom `data_collator` to keep a batch dimension:

```python
def custom_collator(features):
    batch = {
        # Promote scalar fields to tensors with an explicit extra dimension
        k: torch.tensor([f[k] for f in features]).unsqueeze(1)
        if isinstance(features[0][k], (int, float))
        else torch.stack([f[k] for f in features])
        for k in features[0].keys()
    }
    return batch
```

##### 3. Adjust the Trainer configuration

Tune `TrainingArguments` to avoid dimension conflicts:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,  # total_batch_size must be >= num_gpus
    fp16_full_eval=False,  # disable mixed precision during evaluation
    ddp_find_unused_parameters=False
)
```

##### 4. Add dimension assertions for debugging

Insert dimension checks into the model's forward pass:

```python
class CustomModel(PreTrainedModel):
    def forward(self, inputs):
        # Check critical points during the forward pass
        assert inputs.dim() >= 2, f"Unexpected input dimensions: {inputs.shape}"
        outputs = super().forward(inputs)
        # Make sure the loss keeps at least one dimension
        return {"loss": outputs.loss.unsqueeze(0)} if outputs.loss.dim() == 0 else outputs
```

##### 5. Fix the loss dimensions

Expand the loss when the model returns a scalar:

```python
# Override compute_loss in a Trainer subclass
class FixedTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        outputs = model(**inputs)
        loss = outputs.get("loss")
        return loss.unsqueeze(0) if loss.dim() == 0 else loss  # force at least 1-D
```

---

#### Verification

1. Single-GPU debug mode:

```python
training_args._n_gpu = 1  # force the single-GPU path so DataParallel's scatter is skipped
trainer.train()
```

2. Trace the tensor dimensions:

```python
from torch.utils.data import DataLoader

loader = trainer.get_train_dataloader()
first_batch = next(iter(loader))
print({k: v.shape for k, v in first_batch.items()})  # verify input dimensions
```