d2l语言模型--生成小批量序列

最新推荐文章于 2024-07-05 10:25:41 发布

原创

最新推荐文章于 2024-07-05 10:25:41 发布 · 1.1k 阅读

0 ·

CC 4.0 BY-SA版权

文章标签：

#语言模型 #人工智能 #自然语言处理 #深度学习

文章介绍了如何处理语言模型的数据集，包括一元、二元和三元语法的统计，以及两种不同的随机抽样方法——各bs之间随机和连续。在一元、二元和三元语法中，统计了相邻词汇的频率。随机抽样中，讨论了如何生成不同大小的序列样本。

对语言模型的数据集处理做以下汇总与总结

1.k元语法

1.1一元

tokens = d2l.tokenize(d2l.read_time_machine())
# 因为每个⽂本⾏不⼀定是⼀个句⼦或⼀个段落，因此我们把所有⽂本⾏拼接到⼀起
corpus = [token for line in tokens for token in line]
vocab = d2l.Vocab(corpus)

看看得到了什么:corpus为原txt所有词汇拉成一维list；token_freqs为按1元语法统计的token-freqs列表。

vocab.token_freqs[:10], corpus[:20]

'''
([('the', 2261),
  ('i', 1267),
  ('and', 1245),
  ('of', 1155),
  ('a', 816),
  ('to', 695),
  ('was', 552),
  ('in', 541),
  ('that', 443),
  ('my', 440)],
 ['the',
  'time',
  'machine',
  'by',
  'h',
  'g',
  'wells',
  'i',
  'the',
  'time',
  'traveller',
  'for',
  'so',
  'it',
  'will',
  'be',
  'convenient',
  'to',
  'speak',
  'of'])
'''

1.2 二元

二元即为将前后两词连在一起当作一个词元，其token为一个包含tuple的list，其中每个tuple为元txt各2个相邻词组成。

bigram_tokens = [pair for pair in zip(corpus[:-1], corpus[1:])]
bigram_vocab = d2l.Vocab(bigram_tokens)
# 返回
bigram_vocab.token_freqs[:10],bigram_tokens[:5]

'''
([(('of', 'the'), 309),
  (('in', 'the'), 169),
  (('i', 'had'), 130),
  (('i', 'was'), 112),
  (('and', 'the'), 109),
  (('the', 'time'), 102),
  (('it', 'was'), 99),
  (('to', 'the'), 85),
  (('as', 'i'), 78),
  (('of', 'a'), 73)],
 [('the', 'time'),
  ('time', 'machine'),
  ('machine', 'by'),
  ('by', 'h'),
  ('h', 'g')])
'''

其中，讲一下相邻2词的实现：

bigram_tokens = [pair for pair in zip(corpus[:-1], corpus[1:])]
bigram_tokens[:5]

'''
[('the', 'time'),
 ('time', 'machine'),
 ('machine', 'by'),
 ('by', 'h'),
 ('h', 'g')]
'''</

最低0.47元/天解锁文章