When your input has one extra dimension

Article link: https://blog.youkuaiyun.com/weixin_64911856/article/details/144164207

If your model fails with this error: RuntimeError: expand(torch.cuda.FloatTensor{[2, 2, 1, 1, 512]}, size=[2, 1, 1, 512]): the number of sizes provided (4) must be greater or equal to the number of dimensions in the tensor (5)

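Judging from the stack trace, the failure happens during the model's forward pass: Qwen2ForCausalLM.forward fails when it calls self.model. The tensor being expanded has five dimensions while the target size has only four, so one of the inputs is carrying an extra dimension. In general, this kind of error means the shape or format of the input data does not match what the model expects.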
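To make the shape mismatch concrete, here is a minimal sketch of how an extra dimension can creep in when each example is encoded with its own batch dimension of 1 and the examples are then stacked into a batch. Plain Python lists stand in for tensors so the sketch runs without PyTorch. The helpers `shape_of`, `tokenize_with_batch_dim` and `tokenize_flat` are hypothetical names used only for illustration, and a sequence length of 4 stands in for 512.

```python
def shape_of(nested):
    """Return the shape of a rectangular nested list, e.g. [[1, 2], [3, 4]] -> [2, 2]."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0] if nested else None
    return shape


SEQ_LEN = 4  # illustrative stand-in for max_length=512


def tokenize_with_batch_dim(_text):
    # Hypothetical stand-in for an encoder that returns a leading
    # batch dimension of 1 for a single example: shape [1, SEQ_LEN].
    return [[0] * SEQ_LEN]


def tokenize_flat(_text):
    # Hypothetical stand-in for an encoder that returns a flat
    # sequence for a single example: shape [SEQ_LEN].
    return [0] * SEQ_LEN


texts = ["first example", "second example"]

# Stacking per-example results that already carry a batch dimension
# produces [batch, 1, seq_len] instead of [batch, seq_len].
bad_batch = [tokenize_with_batch_dim(t) for t in texts]
good_batch = [tokenize_flat(t) for t in texts]

print(shape_of(bad_batch))   # [2, 1, 4]
print(shape_of(good_batch))  # [2, 4]
```

The extra size-1 axis in the middle is the kind of surplus dimension that later operations, such as the expand call in the error message, may reject.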

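Possible causes and fixes

Input tensor shapes:
- Make sure the input tensors (input_ids, attention_mask, labels) have the correct shape.
- Check that the data returned by preprocess_function matches the input format the model expects.
Label tensor shape:
- Make sure the labels tensor has the same shape as input_ids and carries no extra dimensions.
Batching:
- Make sure the DataCollator combines individual samples into a batch correctly.
Model configuration:
- Make sure the settings used by the model (such as max_length) agree with those used in the preprocessing function.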
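The checks listed above can be sketched as one small validation helper. This is an illustration under stated assumptions, not part of any library: `shape_of` and `check_batch_shapes` are hypothetical names, and the batch is represented with nested Python lists so the example runs on its own. With real tensors, `tuple(tensor.shape)` would play the role of `shape_of`.

```python
def shape_of(nested):
    """Return the shape of a rectangular nested list."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0] if nested else None
    return shape


def check_batch_shapes(batch, keys=("input_ids", "attention_mask", "labels")):
    """Return a list of problems found; an empty list means the batch looks fine."""
    problems = []
    shapes = {k: shape_of(batch[k]) for k in keys if k in batch}

    # Every field should be present and two-dimensional: [batch, seq_len].
    for key in keys:
        if key not in shapes:
            problems.append(f"{key}: missing")
        elif len(shapes[key]) != 2:
            problems.append(f"{key}: expected 2 dimensions, got shape {shapes[key]}")

    # Every field should have the same shape as input_ids.
    reference = shapes.get("input_ids")
    for key, shape in shapes.items():
        if reference is not None and shape != reference:
            problems.append(f"{key}: shape {shape} differs from input_ids {reference}")
    return problems


good = {
    "input_ids": [[1, 2, 3], [4, 5, 6]],
    "attention_mask": [[1, 1, 1], [1, 1, 0]],
    "labels": [[1, 2, 3], [4, 5, -100]],
}
# Same batch, but the labels carry an extra size-1 dimension.
bad = dict(good, labels=[[[1, 2, 3]], [[4, 5, -100]]])

print(check_batch_shapes(good))  # []
for problem in check_batch_shapes(bad):
    print(problem)
```

Running a check like this on the first batch, before the forward pass, tends to point at the offending field more directly than the error raised inside the model.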

Debugging steps

Print the shapes of the input tensors:
- Add print statements to preprocess_function that output the shapes of model_inputs and labels.

def preprocess_function(example):
    # Make sure every required field exists and has the right type
    if not all(key in example and isinstance(example[key], (str, list)) for key in ['query_description', 'query_text', 'correct_answer', 'incorrect_answer', 'Misconception']):
        raise ValueError("One or more required fields are missing or have incorrect types.")

    # Join list-valued fields into a single string
    query_description = ' '.join(example['query_description']) if isinstance(example['query_description'], list) else example['query_description']
    query_text = ' '.join(example['query_text']) if isinstance(example['query_text'], list) else example['query_text']
    correct_answer = ' '.join(example['correct_answer']) if isinstance(example['correct_answer'], list) else example['correct_answer']
    incorrect_answer = ' '.join(example['incorrect_answer']) if isinstance(example['incorrect_answer'], list) else example['incorrect_answer']
    misconception = ' '.join(example['Misconception']) if isinstance(example['Misconception'], list) else example['Misconception']

    input_text = (query_description + "\n" + query_text + "\n" + 
                  correct_answer + "\n" + incorrect_answer)
    target_text = misconception

    # Encode the input and the target (`tokenizer` is assumed to be defined at module level)
    model_inputs = tokenizer(input_text, max_length=512, truncation=True, padding="max_length", return_tensors="pt")
    labels = tokenizer(target_text, max_length=512, truncation=True, padding="max_length", return_tensors="pt").input_ids

    # Replace padding positions in the labels with -100 so they are ignored when computing the loss
    labels[labels == tokenizer.pad_token_id] = -100

    # Print the tensor shapes; with return_tensors="pt" both are [1, 512] at this point
    print(f"Input shape: {model_inputs['input_ids'].shape}")
    print(f"Labels shape: {labels.shape}")

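    # Drop the leading batch dimension of size 1 from every field, not only the labels,
    # so that each example is 1-D and the collator builds [batch, seq_len] tensors
    model_inputs = {key: value.squeeze(0) for key, value in model_inputs.items()}
    model_inputs["labels"] = labels.squeeze(0)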

    # Return the preprocessed example
    return model_inputs

Check the DataCollator output:
- Add print statements to the DataCollator that output the shapes of the batch it returns.

from transformers import DataCollatorForSeq2Seq

class CustomDataCollator(DataCollatorForSeq2Seq):
    def __call__(self, features):
        batch = super().__call__(features)
        
        # Print the shapes of the batch tensors; each should be [batch_size, seq_len]
        print(f"Batch input_ids shape: {batch['input_ids'].shape}")
        print(f"Batch attention_mask shape: {batch['attention_mask'].shape}")
        print