BERT Project gh_mirrors/be/bert: Sequence Labeling in Practice

Project repository: bert (TensorFlow code and pre-trained models for BERT), https://gitcode.com/gh_mirrors/be/bert

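Sequence Labeling Overview

Sequence labeling is a fundamental task in natural language processing (NLP): it assigns a label from a predefined set to every token in a text sequence. Typical applications include named entity recognition (NER), part-of-speech (POS) tagging, and semantic role labeling (SRL). Through pre-training followed by fine-tuning, BERT (Bidirectional Encoder Representations from Transformers) provides strong contextual features for sequence labeling.

BERT Architecture and Its Adaptation to Sequence Labeling

The core of BERT is a multi-layer Transformer encoder that captures bidirectional context. For sequence labeling, the final-layer hidden states (sequence_output) usually serve as the per-token representations, and a linear layer maps each of them to the label space.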

# Abridged excerpt from modeling.py showing where the BERT model output comes from
class BertModel(object):
    def __init__(self, config, is_training, input_ids, input_mask, token_type_ids):
        # ... initialization code omitted ...
        self.sequence_output = self.all_encoder_layers[-1]  # [batch_size, seq_length, hidden_size]

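Adaptation workflow for sequence labeling

Input processing: convert the raw text into BERT's input format (input_ids, input_mask, segment_ids)
Feature extraction: use the BERT model to obtain a contextual representation for each token
Label prediction: add a linear layer that maps the hidden states to the label space
Loss computation: optimize the model parameters with a cross-entropy loss (the model code later in this article uses a CRF negative log-likelihood instead)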
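
The last two steps of the workflow can be illustrated without TensorFlow. The following is a minimal pure-Python sketch, not code from the repository: the function names, the toy weights, and the two-dimensional hidden size are assumptions chosen for illustration. It projects each token vector to label logits and averages the cross-entropy over the positions where input_mask is 1, so padding does not contribute to the loss.

```python
import math

def linear(hidden, weights, bias):
    """Maps one token vector [hidden_size] to label logits [num_labels]."""
    return [sum(h * w for h, w in zip(hidden, row)) + b
            for row, b in zip(weights, bias)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def masked_cross_entropy(token_vectors, label_ids, input_mask, weights, bias):
    """Average negative log-likelihood over real (non-padding) tokens."""
    total, count = 0.0, 0
    for vec, label_id, keep in zip(token_vectors, label_ids, input_mask):
        if not keep:
            continue  # padding position: excluded from the loss
        probs = softmax(linear(vec, weights, bias))
        total += -math.log(probs[label_id])
        count += 1
    return total / max(count, 1)

# Toy setup (illustrative values): hidden_size=2, num_labels=3, last position is padding.
token_vectors = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
weights = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]  # one row per label
bias = [0.0, 0.0, 0.0]
loss = masked_cross_entropy(token_vectors, [0, 1, 0], [1, 1, 0], weights, bias)
print(round(loss, 4))
```

In the TensorFlow code the same projection is a single matrix multiplication over all tokens at once; the loop here only makes the per-token view explicit.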

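Environment Setup and Project Structure

Dependencies

TensorFlow 1.11.0 or later in the 1.x series (see requirements.txt; the tf.contrib module used later in this article does not exist in TensorFlow 2.x)
Python 2.7/3.x

Core project files

File path	Description
modeling.py	Core BERT model implementation
tokenization.py	Text tokenization and vocabulary handling
run_classifier.py	Fine-tuning code for classification tasks
run_squad.py	Question-answering task (includes span-prediction logic)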

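Data Preprocessing

Data format (BIO tagging scheme)

The data uses the BIO (Begin-Inside-Outside) tagging scheme: B- marks the first token of an entity, I- marks a continuation token, and O marks a token outside any entity. For example: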

EU B-ORG
rejects O
German B-MISC
call O
to O
boycott O
British B-MISC
lamb O
. O
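
The example above lists one token per line, whereas the data processor in the next section reads one sentence per TSV row. A possible bridge between the two layouts is sketched below; the function name conll_to_tsv_rows and the row layout (space-joined tokens, a tab, space-joined labels) are assumptions made for this illustration, not part of the repository.

```python
def conll_to_tsv_rows(lines):
    """Groups token-per-line BIO data into one 'tokens<TAB>labels' row per sentence."""
    rows, tokens, labels = [], [], []
    for line in list(lines) + [""]:  # trailing blank line flushes the last sentence
        parts = line.split()
        if len(parts) == 2:
            tokens.append(parts[0])
            labels.append(parts[1])
        elif tokens:
            # A blank (or malformed) line ends the current sentence.
            rows.append(" ".join(tokens) + "\t" + " ".join(labels))
            tokens, labels = [], []
    return rows

sample = [
    "EU B-ORG", "rejects O", "German B-MISC", "call O", "to O",
    "boycott O", "British B-MISC", "lamb O", ". O",
]
rows = conll_to_tsv_rows(sample)
print(rows[0])
```

Each returned row can be written to train.tsv as-is, so that the first column holds the sentence and the second column holds its labels.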

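Data conversion code

# Data processor for sequence labeling. DataProcessor and InputExample are the
# classes defined in run_classifier.py.
# Assumes each TSV row holds a whitespace-tokenized sentence and, after a tab,
# its space-separated labels.
import os
import tokenization

class SequenceLabelingProcessor(DataProcessor):
    def get_train_examples(self, data_dir):
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

    def get_test_examples(self, data_dir):
        # Used by the prediction code; the file name test.tsv is an assumption
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")

    def get_labels(self):
        # "X" marks WordPiece continuation tokens; add B-MISC/I-MISC if the data contains them
        return ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "X", "[CLS]", "[SEP]"]

    def _create_examples(self, lines, set_type):
        examples = []
        for (i, line) in enumerate(lines):
            guid = "%s-%s" % (set_type, i)  # no f-string, so Python 2.7 keeps working
            text_a = tokenization.convert_to_unicode(line[0])
            labels = tokenization.convert_to_unicode(line[1]).split()
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=None, label=labels))
        return examples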

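WordPiece tokenization and label alignment

BERT's WordPiece tokenizer may split an original token into several sub-tokens, so the word-level labels have to be aligned with the resulting pieces:

# Label alignment (tokenized_input is the output of the tokenize method in tokenization.py)
def align_labels_with_tokens(labels, tokenized_input, label_all_tokens=True):
    new_labels = []
    current_label = "O"
    word_index = 0  # index into labels, so the caller's list is not modified
    for token in tokenized_input:
        if token.startswith("##"):
            # Continuation piece: propagate the word's label (B- becomes I-) or mark it "X"
            inside_label = "I-" + current_label[2:] if current_label.startswith("B-") else current_label
            new_labels.append(inside_label if label_all_tokens else "X")
        else:
            current_label = labels[word_index] if word_index < len(labels) else "O"
            word_index += 1
            new_labels.append(current_label)
    return new_labels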
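
Detecting continuation pieces by their "##" prefix can miss splits that carry no prefix (for example, punctuation separated from a word). An alternative sketch tokenizes each word on its own, so the word-to-piece mapping is known exactly. Here toy_wordpiece is a hypothetical stand-in for tokenizer.tokenize(word), and tokenize_and_align is an illustrative name; the "X", "[CLS]", and "[SEP]" labels follow the label list used in this article.

```python
def toy_wordpiece(word):
    """Hypothetical stand-in for tokenizer.tokenize(word): cuts words into 4-character pieces."""
    if not word:
        return []
    pieces = [word[i:i + 4] for i in range(0, len(word), 4)]
    return [pieces[0]] + ["##" + p for p in pieces[1:]]

def tokenize_and_align(words, labels, tokenize_fn, sub_label="X"):
    """Tokenizes word by word; the first piece keeps the word label, the rest get sub_label."""
    tokens, aligned = ["[CLS]"], ["[CLS]"]
    for word, label in zip(words, labels):
        pieces = tokenize_fn(word)
        if not pieces:
            continue  # the tokenizer produced nothing for this word
        tokens.extend(pieces)
        aligned.extend([label] + [sub_label] * (len(pieces) - 1))
    tokens.append("[SEP]")
    aligned.append("[SEP]")
    return tokens, aligned

tokens, aligned = tokenize_and_align(
    ["EU", "rejects", "German"], ["B-ORG", "O", "B-MISC"], toy_wordpiece)
print(list(zip(tokens, aligned)))
```

With the real tokenizer, tokenize_fn would be the tokenize method of a FullTokenizer instance; truncation to max_seq_length is left out of this sketch.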

Model Construction

Extending the model for sequence labeling

A sequence labeling head is added on top of the BertModel class from modeling.py:

import tensorflow as tf
import modeling

def create_sequence_labeling_model(bert_config, is_training, input_ids, input_mask, segment_ids, labels):
    model = modeling.BertModel(
        config=bert_config, is_training=is_training,
        input_ids=input_ids, input_mask=input_mask, token_type_ids=segment_ids)

    output_layer = model.get_sequence_output()  # [batch_size, seq_length, hidden_size]
    # Read seq_length from the tensor; bert_config has no max_seq_length field
    seq_length = modeling.get_shape_list(output_layer, expected_rank=3)[1]
    hidden_size = output_layer.shape[-1].value

    # Output layer for sequence labeling (num_labels is the custom field added to bert_config.json)
    output_weights = tf.get_variable(
        "output_weights", [bert_config.num_labels, hidden_size],
        initializer=tf.truncated_normal_initializer(stddev=0.02))
    output_bias = tf.get_variable("output_bias", [bert_config.num_labels], initializer=tf.zeros_initializer())

    with tf.variable_scope("loss"):
        if is_training:
            output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)

        logits = tf.matmul(tf.reshape(output_layer, [-1, hidden_size]), output_weights, transpose_b=True)
        logits = tf.nn.bias_add(logits, output_bias)
        logits = tf.reshape(logits, [-1, seq_length, bert_config.num_labels])

        # CRF loss (requires tf.contrib.crf); sequence lengths are derived from the input mask
        log_likelihood, transition_params = tf.contrib.crf.crf_log_likelihood(
            logits, labels, tf.reduce_sum(input_mask, axis=1))
        loss = tf.reduce_mean(-log_likelihood)
        return (loss, logits, transition_params)
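
The model returns transition_params alongside the logits because a CRF does not pick each label independently: at prediction time the best label sequence is found with Viterbi decoding. In TensorFlow 1.x, tf.contrib.crf provides viterbi_decode and crf_decode for this. The pure-Python sketch below only illustrates the algorithm; the toy scores and the three-tag setup are assumptions.

```python
def viterbi_decode(scores, transitions):
    """Finds the highest-scoring tag path.

    scores: [seq_len][num_tags] emission scores (the logits of one sentence).
    transitions: [num_tags][num_tags], transitions[i][j] scores moving from tag i to tag j.
    """
    num_tags = len(transitions)
    best = list(scores[0])  # best score of any path ending in each tag
    backpointers = []
    for emit in scores[1:]:
        new_best, pointers = [], []
        for j in range(num_tags):
            prev = max(range(num_tags), key=lambda i: best[i] + transitions[i][j])
            pointers.append(prev)
            new_best.append(best[prev] + transitions[prev][j] + emit[j])
        best = new_best
        backpointers.append(pointers)
    last = max(range(num_tags), key=lambda i: best[i])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]

# Toy example with tags 0=O, 1=B, 2=I; the transition O -> I is heavily penalized.
scores = [[2.0, 1.0, 0.0], [0.0, 0.0, 3.0], [1.0, 0.0, 0.0]]
transitions = [[0.0, 0.0, -10.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
path, score = viterbi_decode(scores, transitions)
print(path, score)
```

Taking the per-position argmax here would give O, I, O, which contains the penalized O to I transition; Viterbi instead returns B, I, O.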

Model Training and Evaluation

Training script parameters

python run_sequence_labeling.py \
  --task_name=ner \
  --do_train=true \
  --do_eval=true \
  --data_dir=./data \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=32 \
  --learning_rate=2e-5 \
  --num_train_epochs=3.0 \
  --output_dir=./ner_output/
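
To see what these flags imply, the number of optimizer steps can be estimated in the way run_classifier.py derives it from the example count, batch size, and epoch count. The sketch below assumes the new script keeps that logic and the default warmup proportion of 0.1; the dataset size of 14000 sentences is a made-up figure for illustration.

```python
def training_schedule(num_examples, train_batch_size, num_train_epochs, warmup_proportion=0.1):
    """Returns (num_train_steps, num_warmup_steps) for a fine-tuning run."""
    num_train_steps = int(num_examples / train_batch_size * num_train_epochs)
    num_warmup_steps = int(num_train_steps * warmup_proportion)
    return num_train_steps, num_warmup_steps

# Hypothetical dataset of 14000 sentences with the flags from the command above.
steps, warmup = training_schedule(14000, 32, 3.0)
print(steps, warmup)
```

During the warmup steps the learning rate ramps up linearly to the configured 2e-5 before it starts to decay.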

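Computing evaluation metrics

# Evaluation sketch for sequence labeling (loosely following the result handling in run_squad.py).
# SeqEvalMetrics is a placeholder for an entity-level metric accumulator; it is not part of the repository.
def evaluate_seq_labeling(result, examples, features):
    metric = SeqEvalMetrics()
    for example in examples:
        feature = features[example.guid]      # features: dict keyed by example guid
        pred_ids = result[feature.unique_id]  # result: predicted label ids keyed by unique_id
        label_ids = feature.label_ids         # gold label ids are stored on the feature
        metric.add(pred_ids, label_ids)
    return metric.get_metric()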
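
The metric object in the code above is not defined in the article. NER results are commonly reported as entity-level precision, recall, and F1, where a predicted entity counts only if both its type and its exact span match the gold annotation. One possible computation is sketched here in Python 3; the function names are illustrative, and the inputs are assumed to be label strings that have already been mapped back from label ids.

```python
def bio_spans(labels):
    """Collects (entity_type, start, end_exclusive) spans from one BIO label sequence."""
    spans = set()
    start, ent_type = None, None
    for i, label in enumerate(list(labels) + ["O"]):  # sentinel closes a trailing span
        inside = label.startswith("I-") and ent_type == label[2:]
        if start is not None and not inside:
            spans.add((ent_type, start, i))
            start, ent_type = None, None
        if label.startswith("B-"):
            start, ent_type = i, label[2:]
    return spans

def entity_f1(gold_sequences, pred_sequences):
    """Entity-level precision, recall, and F1 over a list of sentences."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_sequences, pred_sequences):
        gold_spans, pred_spans = bio_spans(gold), bio_spans(pred)
        tp += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "B-ORG"]]
precision, recall, f1 = entity_f1(gold, pred)
print(precision, recall, f1)
```

In this sketch an I- tag without a preceding B- of the same type is ignored rather than treated as the start of an entity; stricter or more lenient conventions exist.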

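Model Prediction and Deployment

Prediction code

# Adapted from the prediction logic in run_squad.py.
# estimator and predict_input_fn are assumed to be built as in run_classifier.py;
# convert_examples_to_features and format_output stand for sequence-labeling
# versions that the reader has to supply.
def predict_seq_labeling(input_file, output_file):
    tokenizer = tokenization.FullTokenizer(
        vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case)
    processor = SequenceLabelingProcessor()
    examples = processor.get_test_examples(FLAGS.data_dir)
    features = convert_examples_to_features(examples, tokenizer)

    # Run prediction. checkpoint_path must point to a fine-tuned checkpoint:
    # the pre-trained bert_model.ckpt has no weights for the labeling head.
    predictions = estimator.predict(input_fn=predict_input_fn, checkpoint_path=FLAGS.init_checkpoint)

    # Write the results, one line per example
    with open(output_file, "w") as writer:
        for example, feature, pred in zip(examples, features, predictions):
            output_line = format_output(example.text_a, pred, feature)
            writer.write(output_line + "\n")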
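
The output formatting step is left undefined in the code above. As one possible shape for it, the sketch below keeps the prediction made on the first WordPiece of each word and drops the positions of [CLS], [SEP], and continuation pieces. The function name format_word_labels, the first_piece_mask argument (assumed to be recorded during feature conversion), and the word/label output layout are all assumptions of this sketch.

```python
SPECIAL_LABELS = {"X", "[CLS]", "[SEP]"}

def format_word_labels(words, first_piece_mask, pred_labels, fallback="O"):
    """Builds a 'word/label' line with one predicted label per original word."""
    word_labels = [label for label, is_first in zip(pred_labels, first_piece_mask) if is_first]
    # Words cut off by max_seq_length have no prediction; give them the fallback label.
    word_labels += [fallback] * (len(words) - len(word_labels))
    # A helper label predicted on a real word is not a valid tag; replace it as well.
    word_labels = [fallback if label in SPECIAL_LABELS else label for label in word_labels]
    return " ".join("%s/%s" % (word, label) for word, label in zip(words, word_labels))

words = ["EU", "rejects", "German"]
# Token positions: [CLS] EU reje ##cts Germ ##an [SEP]
first_piece_mask = [0, 1, 1, 0, 1, 0, 0]
pred_labels = ["[CLS]", "B-ORG", "O", "X", "B-MISC", "X", "[SEP]"]
line = format_word_labels(words, first_piece_mask, pred_labels)
print(line)
```

The predicted label ids from the estimator would first be mapped to label strings through the list returned by get_labels().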

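Performance tips

Batch prediction: raise the predict_batch_size parameter to improve prediction throughput
Model quantization: convert the model to a quantized one with TensorFlow Lite
Long texts: see the sliding-window mechanism in run_squad.py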

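Common Problems and Solutions

Problem 1: label imbalance

Solution: weight the loss by class (this applies to a softmax cross-entropy head rather than to the CRF loss)

# One weight per label id, in get_labels() order; the last three entries (X, [CLS], [SEP]) are assumed neutral
class_weights = tf.constant([1.0, 3.0, 2.5, 3.0, 2.5, 2.0, 2.0, 1.0, 1.0, 1.0])
# weighted_cross_entropy_with_logits targets sigmoid outputs, so weight a softmax loss per token instead
token_weights = tf.reduce_sum(one_hot_labels * class_weights, axis=-1)
per_token_loss = tf.nn.softmax_cross_entropy_with_logits_v2(labels=one_hot_labels, logits=logits)
loss = tf.reduce_sum(per_token_loss * token_weights) / tf.reduce_sum(token_weights)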
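
The article does not say how the weight values were chosen. One common heuristic, shown here only as an assumption-laden sketch, is to derive them from label frequencies in the training data: rarer labels get larger weights, softened by a square root and capped so that very rare labels do not dominate. The function name and the cap of 5.0 are illustrative choices.

```python
import math
from collections import Counter

def inverse_frequency_weights(label_sequences, label_list, max_weight=5.0):
    """Returns one weight per label in label_list, larger for rarer labels."""
    counts = Counter(label for seq in label_sequences for label in seq)
    most_common = max(counts.values()) if counts else 1
    weights = []
    for label in label_list:
        count = counts.get(label, 0)
        # Labels that never occur keep a neutral weight of 1.0.
        weight = math.sqrt(most_common / count) if count else 1.0
        weights.append(min(weight, max_weight))
    return weights

train_labels = [["O"] * 16 + ["B-PER"] * 4 + ["B-LOC"]]
weights = inverse_frequency_weights(train_labels, ["O", "B-PER", "B-LOC", "I-LOC"])
print(weights)
```

The resulting list can be passed to tf.constant in place of hand-picked values, provided it follows the label order of get_labels().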

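Problem 2: long sequences

Solution: predict with a sliding window

# Modeled on the DocSpan handling in run_squad.py; tokens is the full WordPiece token list
max_seq_length = 128
stride = 64
max_chunk_len = max_seq_length - 2  # reserve room for [CLS] and [SEP]
for start in range(0, len(tokens), stride):
    chunk = tokens[start:start + max_chunk_len]
    # Process each chunk, then merge the per-chunk results
    if start + max_chunk_len >= len(tokens):
        break  # this window already reaches the end of the text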
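
The merging step is only hinted at in the loop above. Since overlapping windows label some tokens twice, one option is to keep, for each token, the prediction from the window in which it sits farthest from a window edge, which is similar in spirit to the maximum-context check in run_squad.py. The function names and the toy window sizes below are assumptions of this sketch.

```python
def make_windows(num_tokens, max_len, stride):
    """Returns (start, end_exclusive) windows covering all tokens; assumes stride <= max_len."""
    windows = []
    start = 0
    while True:
        end = min(start + max_len, num_tokens)
        windows.append((start, end))
        if end == num_tokens:
            break
        start += stride
    return windows

def merge_window_labels(num_tokens, windows, window_labels):
    """For tokens in several windows, keeps the label from the window with the most context."""
    merged = [None] * num_tokens
    best_margin = [-1] * num_tokens
    for (start, end), labels in zip(windows, window_labels):
        for offset, label in enumerate(labels):
            pos = start + offset
            # Distance to the nearer window edge: larger means more context on both sides.
            margin = min(offset, end - start - 1 - offset)
            if margin > best_margin[pos]:
                merged[pos], best_margin[pos] = label, margin
    return merged

windows = make_windows(num_tokens=10, max_len=6, stride=4)
# Pretend the model labeled everything in window 1 as "a" and in window 2 as "b".
merged = merge_window_labels(10, windows, [["a"] * 6, ["b"] * 6])
print(windows, merged)
```

With real predictions, window_labels holds one list of predicted labels per window, with the [CLS] and [SEP] positions already removed.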

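Summary and Extensions

This tutorial implemented a sequence labeling task on top of the BERT project gh_mirrors/be/bert. The core steps are:

Data preprocessing (BIO format conversion and label alignment)
Model extension (adding a CRF layer and a sequence labeling loss)
Training and evaluation (end-to-end training within BERT's fine-tuning framework)

Directions for further improvement

Try other pre-trained models (such as the multilingual models described in multilingual.md)
Introduce adversarial training (see the dropout implementation in modeling.py for reference)
Apply knowledge distillation to reduce the model size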

Appendix: Changes to the Core Code Files

New file: run_sequence_labeling.py (adapted from run_classifier.py)
Modified file: tokenization.py (adds BIO label handling)
Configuration file: bert_config.json (adds the num_labels parameter)

Author's statement: parts of this article were generated with AI assistance (AIGC) and are provided for reference only.