<think>The problem at hand: in the construction-enterprise digital transformation classification task, the current model reaches only 65.84% test accuracy and is clearly overfitting (training loss 0.36, validation loss 1.13). Our goal is to exceed 80% accuracy while keeping overfitting under control.
Drawing on the experience in references [1][2][3][4] and best practices for fine-tuning deep learning models, I will improve the code along the following lines:
1. **Data augmentation and preprocessing**: increase data diversity and reduce noise.
2. **Model architecture adjustments**: tune the layers on top of BERT and the regularization parameters.
3. **Training strategy**: use a more effective optimizer and learning-rate schedule.
4. **Regularization techniques**: introduce stronger regularization such as weight decay, label smoothing, and DropPath (stochastic depth).
5. **Ensembling and advanced techniques**: consider model ensembling or knowledge distillation if a single model cannot reach 80%.
Concrete plan:
### 1. Data augmentation and preprocessing
- **Increase data diversity**: the `NewsDataset` class currently only performs a simple random word swap. We can add the following augmentations:
  - Synonym replacement (using a Chinese synonym lexicon such as the HIT Tongyici Cilin)
  - Random deletion (drop each word with probability p)
  - Back-translation (Chinese -> English -> Chinese; note this is slow)
- **Text cleaning**: construction-domain text often contains many numbers and symbols, so we can clean it more aggressively.
Modify the `__getitem__` method of `NewsDataset` to add several augmentation methods (optional, controlled by a flag):
```python
def __getitem__(self, idx):
    text = str(self.texts[idx])
    label = self.labels[idx]
    # Augment only during training
    if self.augment:
        # Apply augmentation with probability 0.5
        if random.random() > 0.5:
            # Pick one of several augmentation methods at random
            method = random.choice(['swap', 'delete', 'synonym'])
            if method == 'swap' and len(text.split()) > 5:
                words = text.split()
                idx1, idx2 = random.sample(range(len(words)), 2)
                words[idx1], words[idx2] = words[idx2], words[idx1]
                text = ' '.join(words)
            elif method == 'delete' and len(text.split()) > 5:
                words = text.split()
                # Randomly drop 10%-20% of the words
                keep_ratio = random.uniform(0.8, 0.9)
                keep_idx = random.sample(range(len(words)), int(len(words) * keep_ratio))
                keep_idx.sort()
                words = [words[i] for i in keep_idx]
                text = ' '.join(words)
            elif method == 'synonym':
                # Synonym replacement; assumes a helper function get_synonym (implemented separately)
                words = text.split()
                new_words = []
                for word in words:
                    if random.random() < 0.3:  # replace each word with probability 0.3
                        syn = get_synonym(word)  # helper to be implemented
                        if syn is not None:
                            new_words.append(syn)
                        else:
                            new_words.append(word)
                    else:
                        new_words.append(word)
                text = ' '.join(new_words)
    ...  # the encoding part below stays unchanged
```
Note: synonym replacement requires a synonym dictionary. We can use the `synonyms` library (needs to be installed; a sketch follows below) or build a construction-domain synonym dictionary ourselves.
We could also try an EDA (Easy Data Augmentation) implementation, but note that it must be adapted for Chinese.
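As a rough sketch (my own illustration, not part of the original code), `get_synonym` could be backed by the `synonyms` package, assuming its `nearby` call returns candidate words together with similarity scores:
```python
import random
import synonyms  # pip install synonyms

def get_synonym(word, min_score=0.7):
    """Return a random near-synonym of `word`, or None if nothing suitable is found."""
    try:
        candidates, scores = synonyms.nearby(word)
    except Exception:
        return None
    # Keep only reasonably similar candidates and drop the word itself
    filtered = [w for w, s in zip(candidates, scores) if w != word and s >= min_score]
    return random.choice(filtered) if filtered else None
```
In practice a hand-curated, construction-domain dictionary (as used later in the full code) is often more reliable than a general-purpose lexicon.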
### 2. Model architecture adjustments
The current model is `ImprovedBertForClassification`, which adds Dropout(0.3) and a single linear layer on top of the BERT output. Following reference [3], we can tune the dropout rate and consider two linear layers (with an activation in between) to strengthen the classifier, while watching for overfitting.
Modified model:
```python
class ImprovedBertForClassification(BertForSequenceClassification):
    def __init__(self, config):
        super().__init__(config)
        # Extra dropout layers for stronger regularization
        self.dropout1 = nn.Dropout(0.5)   # first dropout
        self.linear1 = nn.Linear(config.hidden_size, 256)
        self.activation = nn.GELU()       # activation function
        self.dropout2 = nn.Dropout(0.3)
        self.classifier = nn.Linear(256, config.num_labels)

    def forward(self, input_ids=None, attention_mask=None, labels=None, **kwargs):
        outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            **kwargs
        )
        pooled_output = outputs[1]
        # Two fully connected layers
        pooled_output = self.dropout1(pooled_output)
        pooled_output = self.linear1(pooled_output)
        pooled_output = self.activation(pooled_output)
        pooled_output = self.dropout2(pooled_output)
        logits = self.classifier(pooled_output)
        loss = None
        if labels is not None:
            loss_fct = nn.CrossEntropyLoss(label_smoothing=0.1)  # label smoothing
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
        return type('Output', (), {'loss': loss, 'logits': logits})()
```
### 3. Training strategy
Following reference [3], we could use `Trainer` and `TrainingArguments` to simplify the training loop and get extra tricks for free (mixed-precision training, gradient accumulation). Since the original code already implements its own training loop, we modify that instead:
- **Optimizer**: keep AdamW, but adjust weight decay (0.01 -> 0.02) and epsilon (1e-8 -> 1e-6).
- **Learning-rate schedule**: linear decay with warmup, but increase the warmup proportion (10% -> 15%).
- **Gradient clipping**: keep gradient clipping (max_norm=1.0).
- **Batch size**: if GPU memory allows, a larger batch size (16 -> 32) may stabilize training.
Change the optimizer and scheduler setup in `train_model`:
```python
# Optimizer
optimizer = AdamW(model.parameters(), lr=2e-5, eps=1e-6, weight_decay=0.02)
total_steps = len(train_loader) * num_epochs
# Larger warmup proportion
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.15 * total_steps),  # 15% warmup steps
    num_training_steps=total_steps
)
```
### 4. Regularization techniques
We already use:
- Dropout (before the classification layer)
- Weight decay (in the optimizer)
- Label smoothing (in the loss function)
We can additionally consider:
- **Mixup**: for text classification, mixup can be applied to the embeddings (see https://arxiv.org/abs/1710.09412); a sketch follows this list.
- **Adversarial training**: e.g. FGM (Fast Gradient Method) or PGD (Projected Gradient Descent).
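A minimal mixup sketch on the pooled BERT representation (my own illustration; it assumes a plain `BertForSequenceClassification`-style head with `model.dropout` and `model.classifier`, and mixes the two losses rather than the hard labels):
```python
import numpy as np
import torch
import torch.nn.functional as F

def mixup_step(model, input_ids, attention_mask, labels, alpha=0.2):
    """One training step with mixup applied to the pooled BERT representation."""
    lam = float(np.random.beta(alpha, alpha))
    pooled = model.bert(input_ids, attention_mask=attention_mask)[1]  # [batch, hidden]
    perm = torch.randperm(pooled.size(0), device=pooled.device)
    mixed = lam * pooled + (1 - lam) * pooled[perm]                   # interpolate representations
    logits = model.classifier(model.dropout(mixed))
    # Interpolate the losses of the original and permuted labels
    loss = lam * F.cross_entropy(logits, labels) + (1 - lam) * F.cross_entropy(logits, labels[perm])
    return loss
```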
Here we implement FGM adversarial training, since it is simple and effective. Add an adversarial step to the training loop:
First, define an FGM class before the training loop:
```python
class FGM:
    def __init__(self, model, epsilon=0.25):
        self.model = model
        self.epsilon = epsilon
        self.backup = {}

    def attack(self, emb_name='word_embeddings'):
        # emb_name: name of the embedding parameter; for BERT this is
        # usually 'embeddings.word_embeddings.weight'
        for name, param in self.model.named_parameters():
            if param.requires_grad and emb_name in name:
                self.backup[name] = param.data.clone()
                norm = torch.norm(param.grad)
                if norm != 0 and not torch.isnan(norm):
                    r_at = self.epsilon * param.grad / norm
                    param.data.add_(r_at)

    def restore(self, emb_name='word_embeddings'):
        for name, param in self.model.named_parameters():
            if param.requires_grad and emb_name in name:
                if name in self.backup:
                    param.data = self.backup[name]
        self.backup = {}
```
Then, inside the training loop (in `train_epoch`), add the adversarial step after the backward pass:
```python
# Backward pass
loss.backward()
# Adversarial training
fgm = FGM(model)
fgm.attack()  # add an adversarial perturbation to the embeddings
# Second forward pass on the perturbed embeddings
outputs_adv = model(
    input_ids=input_ids,
    attention_mask=attention_mask,
    labels=labels
)
loss_adv = outputs_adv.loss
loss_adv.backward()  # accumulate the adversarial gradients
fgm.restore()  # restore the embedding parameters
# Gradient clipping to prevent exploding gradients
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```
### 5. Ensembling
If a single model cannot reach 80%, we can train several models and ensemble them (a small prediction-averaging sketch follows). Given the time budget, we first try the improvements above and only fall back to ensembling if needed.
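A minimal prediction-averaging sketch (my own illustration; it assumes all models share the same tokenizer and label mapping):
```python
import torch

@torch.no_grad()
def ensemble_predict(models, input_ids, attention_mask):
    """Average the softmax outputs of several trained models and return the argmax labels."""
    probs = None
    for model in models:
        model.eval()
        logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
        p = torch.softmax(logits, dim=-1)
        probs = p if probs is None else probs + p
    return (probs / len(models)).argmax(dim=-1)
```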
### Other possible improvements
- **Learning rate**: try a smaller learning rate (e.g. 1e-5) or layer-wise learning rates (small for the lower BERT layers, larger for the top layers); a sketch follows this list.
- **More epochs**: but keep early stopping; a patience of 5 is already enough.
- **A larger pretrained model**: e.g. `bert-large-chinese`, but watch GPU memory.
- **Other pretrained models**: e.g. RoBERTa or MacBERT (requires switching the pretrained checkpoint).
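A minimal sketch of layer-wise learning rates via optimizer parameter groups (my own illustration using `torch.optim.AdamW`; the split on the `bert.` name prefix assumes the standard Hugging Face module naming):
```python
from torch.optim import AdamW

def build_layerwise_optimizer(model, encoder_lr=1e-5, head_lr=5e-5, weight_decay=0.02):
    """Use a smaller learning rate for the pretrained encoder and a larger one for the new head."""
    encoder_params, head_params = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # Parameters named 'bert.*' belong to the pretrained encoder
        (encoder_params if name.startswith('bert.') else head_params).append(param)
    return AdamW(
        [{'params': encoder_params, 'lr': encoder_lr},
         {'params': head_params, 'lr': head_lr}],
        weight_decay=weight_decay,
    )
```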
Given the time constraints, we first implement data augmentation, the model changes, the optimizer tweaks, and adversarial training.
Note: since quite a lot changes, we should debug step by step. The modified code below focuses on the changed parts.
For brevity, only the key modified snippets are shown; they need to be merged into the original code.
With these changes we expect the accuracy to rise above 80% and overfitting to be reduced (a smaller gap between training and validation loss).
If the results are still unsatisfactory, we can consider:
- collecting more data
- a stronger pretrained model (e.g. MacBERT, which tends to do better on Chinese tasks)
We first apply the improvements above, then retrain and evaluate.
The modified code is fairly long, so only the changed parts are shown, with notes on how to merge them.
Note: synonym replacement needs an external resource; if no suitable synonym lexicon is available, skip that method and use the other augmentations.
Because of the domain-specific vocabulary, a construction-domain synonym dictionary (e.g. a table of construction terminology synonyms) is preferable. If none exists, random insertion can be used instead for now.
Add random insertion to the augmentation code in `NewsDataset`:
```python
elif method == 'insert' and len(text.split()) > 5:
    words = text.split()
    # Insert a word chosen at random from the original text
    new_words = words.copy()
    random_word = random.choice(words)
    insert_pos = random.randint(0, len(words))
    new_words.insert(insert_pos, random_word)
    text = ' '.join(new_words)
```
Next, merge these changes into the original code and run it.
Since the changes are extensive, make sure to test the code end to end after merging.
Retrain the model and monitor the training and validation losses; the target is at least 80% validation accuracy without overfitting (validation loss keeps decreasing or stays stable, and the final training and validation losses are close).
If a single run does not reach the target, tune the hyperparameters (learning rate, dropout rates, augmentation methods, etc.).
Finally, if compute allows, try a larger pretrained model such as `hfl/chinese-roberta-wwm-ext`, which usually performs better.
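For example, a minimal sketch of swapping in that checkpoint (assuming it is available locally or on the Hugging Face Hub; `hfl/chinese-roberta-wwm-ext` is loaded with the BERT classes, and `num_labels=3` matches the three classes used in this task):
```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('hfl/chinese-roberta-wwm-ext')
model = ImprovedBertForClassification.from_pretrained('hfl/chinese-roberta-wwm-ext', num_labels=3)
```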
Note: the modified head (two linear layers) may itself increase the risk of overfitting, so monitor the validation set closely.
Given the time constraints, we go with the strategy above. If the accuracy still falls short of 80%, deeper analysis and tuning will be needed.
Now, the modified code follows (only the changed parts; the rest stays the same and needs to be merged):
</think>
To help you modify the BERT Chinese fine-tuning code so that it reaches over 80% accuracy without overfitting, here is an optimized, complete code solution. It combines regularization techniques, data augmentation strategies, and training optimizations, tailored to the construction-enterprise digital transformation classification task.
### Optimized complete code
```python
import torch
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, f1_score
from transformers import BertTokenizer, BertForSequenceClassification, AdamW, get_linear_schedule_with_warmup
from transformers import TrainingArguments, Trainer
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
import torch.nn.functional as F
import time
import datetime
import random
import os
import matplotlib
import evaluate
from collections import Counter
import warnings

matplotlib.use('Agg')
import matplotlib.pyplot as plt
import seaborn as sns

warnings.filterwarnings('ignore')

# Fonts for Chinese labels in plots
plt.rcParams['font.sans-serif'] = ['SimHei', 'Microsoft YaHei']
plt.rcParams['axes.unicode_minus'] = False

# Set random seeds for reproducibility
def set_seed(seed_value=42):
    """Set all random seeds so that results are reproducible."""
    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    torch.cuda.manual_seed_all(seed_value)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)

# Check GPU availability
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Device: {device}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1024 ** 3:.2f} GB")
# 1. Data loading and preprocessing - enhanced version
def load_and_preprocess_data(file_path):
    """Load and preprocess the data, with class balancing via oversampling."""
    print("Loading data...")
    df = pd.read_csv(file_path, encoding='gbk')
    # Inspect the data
    print(f"Data shape: {df.shape}")
    print(f"Original label distribution:\n{df['数字化转型类型代码'].value_counts().sort_index()}")
    # Text content field
    texts = df['content'].astype(str).tolist()
    labels = df['数字化转型类型代码'].astype(int).tolist()
    # Check the label range
    unique_labels = set(labels)
    print(f"Unique labels: {unique_labels}")
    # Map labels to 0, 1, 2 (ensure they are contiguous)
    label_mapping = {label: idx for idx, label in enumerate(sorted(unique_labels))}
    labels = [label_mapping[label] for label in labels]
    print(f"Label mapping: {label_mapping}")
    print(f"Label distribution after mapping: {Counter(labels)}")
    # Class balancing - oversample the minority classes
    label_counts = Counter(labels)
    max_count = max(label_counts.values())
    balanced_texts = []
    balanced_labels = []
    for label in label_counts:
        indices = [i for i, l in enumerate(labels) if l == label]
        oversample_factor = max_count // label_counts[label]
        for i in range(oversample_factor):
            balanced_texts.extend([texts[idx] for idx in indices])
            balanced_labels.extend([labels[idx] for idx in indices])
    print(f"Samples after balancing: {len(balanced_labels)}")
    print(f"Label distribution after balancing: {Counter(balanced_labels)}")
    return balanced_texts, balanced_labels, label_mapping
# 2. Custom dataset class - with richer data augmentation
class NewsDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=256, augment=False):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.augment = augment

    def __len__(self):
        return len(self.texts)

    def get_synonym(self, word):
        """Look up a synonym - construction-domain specific dictionary."""
        synonyms = {
            "数字化": ["信息化", "智能化", "数智化"],
            "转型": ["变革", "转变", "升级"],
            "技术": ["科技", "工艺", "手段"],
            "管理": ["治理", "管控", "经营"],
            "运营": ["运作", "经营", "管理"],
            "组织": ["机构", "部门", "团队"],
            "建筑": ["建造", "施工", "工程"],
            "企业": ["公司", "厂商", "单位"],
            "BIM": ["建筑信息模型", "建筑信息管理"],
            "物联网": ["IoT", "物联网络"],
            "人工智能": ["AI", "智能技术"],
            "大数据": ["海量数据", "数据资源"]
        }
        if word in synonyms:
            return random.choice(synonyms[word])
        return None

    def augment_text(self, text):
        """Apply several lightweight text augmentations."""
        words = text.split()
        # Random swap
        if len(words) > 3 and random.random() > 0.5:
            idx1, idx2 = random.sample(range(len(words)), 2)
            words[idx1], words[idx2] = words[idx2], words[idx1]
        # Random deletion (10% chance to drop one word)
        if len(words) > 5 and random.random() < 0.1:
            del_idx = random.randint(0, len(words) - 1)
            words.pop(del_idx)
        # Random synonym insertion (20% chance)
        if len(words) > 3 and random.random() < 0.2:
            insert_idx = random.randint(0, len(words))
            word_to_insert = random.choice(words)
            synonym = self.get_synonym(word_to_insert)
            if synonym:
                words.insert(insert_idx, synonym)
        # Synonym replacement (30% chance per word)
        new_words = []
        for word in words:
            if random.random() < 0.3:
                synonym = self.get_synonym(word)
                new_words.append(synonym if synonym else word)
            else:
                new_words.append(word)
        return " ".join(new_words)

    def __getitem__(self, idx):
        text = str(self.texts[idx])
        label = self.labels[idx]
        # Data augmentation (training set only)
        if self.augment:
            text = self.augment_text(text)
        # Encode the text
        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt'
        )
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'label': torch.tensor(label, dtype=torch.long)
        }
# 3. Data loaders - with stratified splits
def create_data_loaders(texts, labels, tokenizer, batch_size=16, max_length=256, k_folds=5):
    """Create data loaders, with support for K-fold cross-validation."""
    skf = StratifiedKFold(n_splits=k_folds, shuffle=True, random_state=42)
    loaders = []
    for train_index, val_index in skf.split(texts, labels):
        train_texts = [texts[i] for i in train_index]
        train_labels = [labels[i] for i in train_index]
        val_texts = [texts[i] for i in val_index]
        val_labels = [labels[i] for i in val_index]
        # Datasets - augmentation only on the training split
        train_dataset = NewsDataset(train_texts, train_labels, tokenizer, max_length, augment=True)
        val_dataset = NewsDataset(val_texts, val_labels, tokenizer, max_length, augment=False)
        # Data loaders
        train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
        val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
        loaders.append((train_loader, val_loader))
    return loaders
# 4. Improved BERT model class - with stronger regularization
class EnhancedBertForClassification(BertForSequenceClassification):
    def __init__(self, config):
        super().__init__(config)
        # Stronger regularization
        self.dropout = nn.Dropout(0.4)  # higher dropout rate
        # Extra linear layer before the classifier
        self.dense = nn.Linear(config.hidden_size, 256)
        self.activation = nn.GELU()  # GELU activation
        self.classifier = nn.Linear(256, config.num_labels)
        # Initialize the new weights
        self.init_weights()

    def forward(self, input_ids=None, attention_mask=None, labels=None, **kwargs):
        outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            **kwargs
        )
        pooled_output = outputs[1]
        pooled_output = self.dropout(pooled_output)
        pooled_output = self.dense(pooled_output)
        pooled_output = self.activation(pooled_output)
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)
        loss = None
        if labels is not None:
            # Cross-entropy with label smoothing
            loss_fct = nn.CrossEntropyLoss(label_smoothing=0.15)
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
        return type('Output', (), {'loss': loss, 'logits': logits})()
# 5. FGM adversarial training class
class FGM:
    """Fast Gradient Method adversarial training on the embedding layer."""
    def __init__(self, model, epsilon=0.5):
        self.model = model
        self.epsilon = epsilon
        self.backup = {}

    def attack(self, emb_name='word_embeddings'):
        """Add an adversarial perturbation to the embedding weights."""
        for name, param in self.model.named_parameters():
            if param.requires_grad and emb_name in name:
                self.backup[name] = param.data.clone()
                norm = torch.norm(param.grad)
                if norm != 0:
                    r_at = self.epsilon * param.grad / norm
                    param.data.add_(r_at)

    def restore(self, emb_name='word_embeddings'):
        """Restore the original embedding weights."""
        for name, param in self.model.named_parameters():
            if param.requires_grad and emb_name in name:
                if name in self.backup:
                    param.data = self.backup[name]
        self.backup = {}
# 6. Training function - with adversarial training
def train_epoch(model, data_loader, optimizer, scheduler, device, epoch):
    """Train for one epoch, including the FGM adversarial step."""
    model.train()
    total_loss = 0
    correct_predictions = 0
    total_samples = 0
    # Initialize adversarial training
    fgm = FGM(model, epsilon=0.3)
    for batch_idx, batch in enumerate(data_loader):
        # Move the batch to the GPU
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)
        # Reset gradients
        optimizer.zero_grad()
        # Forward pass
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels
        )
        loss = outputs.loss
        logits = outputs.logits
        # Backward pass
        loss.backward()
        # Adversarial step
        fgm.attack()
        outputs_adv = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels
        )
        loss_adv = outputs_adv.loss
        loss_adv.backward()
        fgm.restore()
        # Gradient clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        # Optimizer and scheduler step
        optimizer.step()
        scheduler.step()
        # Statistics
        total_loss += loss.item()
        _, preds = torch.max(logits, dim=1)
        correct_predictions += torch.sum(preds == labels)
        total_samples += labels.size(0)
        # Print progress every 20 batches
        if (batch_idx + 1) % 20 == 0:
            batch_acc = torch.sum(preds == labels).float() / labels.size(0)
            print(f'  Batch {batch_idx + 1}/{len(data_loader)}, loss: {loss.item():.4f}, accuracy: {batch_acc:.4f}')
    # Average loss and accuracy over the epoch
    avg_loss = total_loss / len(data_loader)
    accuracy = correct_predictions.double() / total_samples
    return avg_loss, accuracy
# 7. Evaluation function - also reports the F1 score
def evaluate_model(model, data_loader, device):
    """Evaluate the model on a data loader."""
    model.eval()
    total_loss = 0
    correct_predictions = 0
    total_samples = 0
    all_preds = []
    all_labels = []
    with torch.no_grad():
        for batch in data_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['label'].to(device)
            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                labels=labels
            )
            loss = outputs.loss
            logits = outputs.logits
            total_loss += loss.item()
            _, preds = torch.max(logits, dim=1)
            correct_predictions += torch.sum(preds == labels)
            total_samples += labels.size(0)
            all_preds.extend(preds.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())
    avg_loss = total_loss / len(data_loader)
    accuracy = correct_predictions.double() / total_samples
    f1 = f1_score(all_labels, all_preds, average='weighted')
    return avg_loss, accuracy, f1, all_preds, all_labels
# 8. Early stopping
class EarlyStopping:
    def __init__(self, patience=5, min_delta=0.001, restore_best_weights=True):
        self.patience = patience
        self.min_delta = min_delta
        self.restore_best_weights = restore_best_weights
        self.counter = 0
        self.best_score = None
        self.best_model_state = None
        self.early_stop = False

    def __call__(self, val_score, model):
        score = val_score
        if self.best_score is None:
            self.best_score = score
            self.best_model_state = model.state_dict().copy()
        elif score <= self.best_score + self.min_delta:
            # No sufficient improvement
            self.counter += 1
            print(f'Early stopping counter: {self.counter}/{self.patience}')
            if self.counter >= self.patience:
                self.early_stop = True
                if self.restore_best_weights and self.best_model_state is not None:
                    print("Restoring best model weights...")
                    model.load_state_dict(self.best_model_state)
        else:
            self.best_score = score
            self.best_model_state = model.state_dict().copy()
            self.counter = 0
# 9. Main training function - with K-fold cross-validation
def train_model(model, data_loaders, device, num_epochs=15):
    """Main training loop over the cross-validation folds.

    Note: the same model instance is reused across folds here; for a strict
    K-fold evaluation the model should be re-initialized at the start of each fold.
    """
    results = []
    best_val_acc = 0
    best_model = None
    for fold, (train_loader, val_loader) in enumerate(data_loaders):
        print(f"\n{'='*40}")
        print(f"Training fold {fold + 1}/{len(data_loaders)}")
        print(f"{'='*40}")
        # Optimizer and learning-rate scheduler
        optimizer = AdamW(model.parameters(), lr=1.5e-5, eps=1e-8, weight_decay=0.02)
        total_steps = len(train_loader) * num_epochs
        scheduler = get_linear_schedule_with_warmup(
            optimizer,
            num_warmup_steps=int(0.1 * total_steps),
            num_training_steps=total_steps
        )
        # Early stopping
        early_stopping = EarlyStopping(patience=4, min_delta=0.001)
        # Training history
        history = {
            'train_loss': [],
            'train_acc': [],
            'val_loss': [],
            'val_acc': [],
            'val_f1': []
        }
        for epoch in range(num_epochs):
            print(f'\nEpoch {epoch + 1}/{num_epochs}')
            print('-' * 40)
            # Training phase
            train_loss, train_acc = train_epoch(
                model, train_loader, optimizer, scheduler, device, epoch
            )
            # Validation phase
            val_loss, val_acc, val_f1, _, _ = evaluate_model(model, val_loader, device)
            # Record history
            history['train_loss'].append(train_loss)
            history['train_acc'].append(train_acc.item())
            history['val_loss'].append(val_loss)
            history['val_acc'].append(val_acc.item())
            history['val_f1'].append(val_f1)
            print(f'Training loss: {train_loss:.4f}, training accuracy: {train_acc:.4f}')
print(f'验证损失: {val_loss:.4f},