[Notes] How BERT's data preprocessing handles overly long sentences

License: CC 4.0 BY-SA

Tags: #pytorch #AI #BERT


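This note walks through how BERT's pretraining data generation builds sentence pairs that fill the sequence length limit. Segments of a document are accumulated up to a target sequence length and then split at a random point into a first and a second sentence. About half of the time the second sentence is instead taken from another, randomly chosen document. The two key operations are this merging of segments and the final truncation of the pair to the maximum number of tokens.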
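For context, the fragment that follows runs inside a loop that walks over a document's segments and collects them into `current_chunk`. The sketch below reproduces only that accumulation step, under some assumptions: `chunk_document` is a hypothetical helper name, a document is a list of segments, each segment is a list of tokens, and the "put back" rewinding of `i` is left out on purpose.

```python
def chunk_document(document, target_seq_length):
  """Hypothetical helper: groups segments into chunks ready to be split."""
  chunks = []
  current_chunk = []
  current_length = 0
  i = 0
  while i < len(document):
    segment = document[i]
    current_chunk.append(segment)
    current_length += len(segment)
    # A chunk is complete at the last segment or once it is long enough.
    if i == len(document) - 1 or current_length >= target_seq_length:
      if current_chunk:
        chunks.append(current_chunk)
      current_chunk = []
      current_length = 0
    i += 1
  return chunks


# Five segments with 3, 4, 2, 5 and 1 tokens; the target length is 6.
document = [["a"] * 3, ["b"] * 4, ["c"] * 2, ["d"] * 5, ["e"]]
chunks = chunk_document(document, target_seq_length=6)
print([[len(seg) for seg in chunk] for chunk in chunks])
# prints [[3, 4], [2, 5], [1]]
```

Note that a chunk may overshoot the target (7 tokens here for a target of 6), which is why the pair is truncated afterwards.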

    if i == len(document) - 1 or current_length >= target_seq_length:
      if current_chunk:
        # `a_end` is how many segments from `current_chunk` go into the `A`
        # (first) sentence.
        a_end = 1
        if len(current_chunk) >= 2:
          a_end = rng.randint(1, len(current_chunk) - 1)

        tokens_a = []
        for j in range(a_end):
          tokens_a.extend(current_chunk[j])

        tokens_b = []
        # Random next
        is_random_next = False
        if len(current_chunk) == 1 or rng.random() < 0.5:
          is_random_next = True
          target_b_length = target_seq_length - len(tokens_a)

          # This should rarely go for more than one iteration for large
          # corpora. However, just to be careful, we try to make sure that
          # the random document is not the same as the document
          # we're processing.
          for _ in range(10):
            random_document_index = rng.randint(0, len(all_documents) - 1)
            if random_document_index != document_index:
              break

          random_document = all_documents[random_document_index]
          random_start = rng.randint(0, len(random_document) - 1)
          for j in range(random_start, len(random_document)):
            tokens_b.extend(random_document[j])
            if len(tokens_b) >= target_b_length:
              break
          # We didn't actually use these segments so we "put them back" so
          # they don't go to waste.
          num_unused_segments = len(current_chunk) - a_end
          i -= num_unused_segments
        # Actual next
        else:
          is_random_next = False
          for j in range(a_end, len(current_chunk)):
            tokens_b.extend(current_chunk[j])
        truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng)

        assert len(tokens_a) >= 1
        assert len(tokens_b) >= 1

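When the loop reaches the last segment of the document, or the accumulated length reaches `target_seq_length`, the segments collected in `current_chunk` are split at a randomly chosen segment boundary `a_end`: the first `a_end` segments become sentence A. In the "actual next" case, the remaining segments become sentence B. In the "random next" case (chosen with probability 0.5, and always when the chunk holds a single segment), B is built from a different, randomly chosen document: starting at a random segment, whole segments are appended until B reaches `target_seq_length - len(tokens_a)` tokens or the document ends. The unused segments of `current_chunk` are then put back by rewinding `i`, so they can be used for later pairs.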
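The pair-building step can be isolated into a small function to make its inputs and outputs explicit. This is a minimal sketch, not the original code: `split_chunk` and its return tuple are hypothetical, and the toy documents are made up for illustration. The logic follows the fragment above.

```python
import random


def split_chunk(current_chunk, all_documents, document_index,
                target_seq_length, rng):
  """Hypothetical helper: builds one (A, B) pair from a chunk of segments."""
  # Choose how many leading segments form sentence A.
  a_end = 1
  if len(current_chunk) >= 2:
    a_end = rng.randint(1, len(current_chunk) - 1)
  tokens_a = [tok for seg in current_chunk[:a_end] for tok in seg]

  tokens_b = []
  num_unused_segments = 0
  is_random_next = len(current_chunk) == 1 or rng.random() < 0.5
  if is_random_next:
    target_b_length = target_seq_length - len(tokens_a)
    # Try a few times to pick a document other than the current one.
    for _ in range(10):
      random_document_index = rng.randint(0, len(all_documents) - 1)
      if random_document_index != document_index:
        break
    random_document = all_documents[random_document_index]
    # Start at a random segment and take whole segments until B is long enough.
    random_start = rng.randint(0, len(random_document) - 1)
    for segment in random_document[random_start:]:
      tokens_b.extend(segment)
      if len(tokens_b) >= target_b_length:
        break
    # Segments after a_end were not consumed; the caller rewinds i by this.
    num_unused_segments = len(current_chunk) - a_end
  else:
    for segment in current_chunk[a_end:]:
      tokens_b.extend(segment)
  return tokens_a, tokens_b, is_random_next, num_unused_segments


docs = [
    [["a1", "a2"], ["a3"], ["a4", "a5", "a6"]],
    [["b1"], ["b2", "b3"]],
]
rng = random.Random(0)
tokens_a, tokens_b, is_random_next, unused = split_chunk(
    docs[0], docs, 0, 6, rng)
print(tokens_a, tokens_b, is_random_next, unused)
```

Both `rng.randint` calls are inclusive at both ends, which is why the upper bounds are `len(...) - 1`.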

Truncation:

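def truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng):
  """Truncates a pair of token lists in place to at most max_num_tokens."""
  while True:
    total_length = len(tokens_a) + len(tokens_b)
    if total_length <= max_num_tokens:
      break

    # Always shorten the longer of the two lists (B when they are equal).
    trunc_tokens = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
    assert len(trunc_tokens) >= 1

    # Remove one token from the front or the back with equal probability,
    # so the truncation does not always cut from the same end.
    if rng.random() < 0.5:
      del trunc_tokens[0]
    else:
      trunc_tokens.pop()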
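A short worked example shows what the truncation does to the lengths. The function is repeated so the snippet runs on its own, and the token names are made up. Which tokens survive depends on the random draws, but the resulting lengths do not, because each step always shortens the longer list.

```python
import random


def truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng):
  """Same logic as above, repeated here to keep the example self-contained."""
  while True:
    total_length = len(tokens_a) + len(tokens_b)
    if total_length <= max_num_tokens:
      break
    trunc_tokens = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
    assert len(trunc_tokens) >= 1
    if rng.random() < 0.5:
      del trunc_tokens[0]
    else:
      trunc_tokens.pop()


rng = random.Random(0)
tokens_a = ["a%d" % k for k in range(10)]
tokens_b = ["b%d" % k for k in range(3)]
truncate_seq_pair(tokens_a, tokens_b, 8, rng)
print(len(tokens_a), len(tokens_b))  # prints 5 3
```

Only A is shortened here because it stays the longer list until the total reaches 8. The lists are modified in place, and what remains of each is always a contiguous slice of the original.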