Bonus Materials
- 02_bonus_bytepair-encoder contains optional code to benchmark different byte pair encoder implementations.
- 03_bonus_embedding-vs-matmul contains optional code showing that an embedding layer is equivalent to a fully connected layer applied to one-hot encoded vectors.
- 04_bonus_dataloader-intuition contains optional code that explains the data loader more intuitively, using simple numbers rather than text.
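The equivalence behind 03_bonus_embedding-vs-matmul can be sketched in a few lines of plain Python. This is only an illustration, not the code from that folder: the weight values, the 4-token vocabulary, and the helper names `embedding_lookup`, `one_hot`, and `matmul` are all made up for this example, and the fully connected layer is assumed to have no bias term.

```python
# Hypothetical embedding matrix: 4 tokens in the vocabulary, 3 dimensions each.
weights = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]


def embedding_lookup(token_ids, weights):
    """Embedding layer: pick one row of the weight matrix per token ID."""
    return [list(weights[i]) for i in token_ids]


def one_hot(token_id, vocab_size):
    """Vector of zeros with a single 1.0 at position token_id."""
    return [1.0 if j == token_id else 0.0 for j in range(vocab_size)]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [
        [sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
        for row in a
    ]


token_ids = [2, 0, 3]

# Path 1: direct row lookup, as an embedding layer does.
looked_up = embedding_lookup(token_ids, weights)

# Path 2: one-hot encode, then multiply by the same weights (bias-free linear layer).
one_hot_rows = [one_hot(i, len(weights)) for i in token_ids]
multiplied = matmul(one_hot_rows, weights)

print(looked_up == multiplied)  # True
```

Both paths return the same rows. The lookup simply skips the multiplications by zero, which is why embedding layers are preferred in practice.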
02_bonus_bytepair-encoder
# Copyright notice and license information
# This code comes from OpenAI's GPT-2 project and is distributed under the Modified MIT License
# Source: https://github.com/openai/gpt-2/blob/master/src/encoder.py
# License:
# Modified MIT License
# Software Copyright (c) 2019 OpenAI
# We don’t claim ownership of the content you create with GPT-2, so it is yours to do with as you please.
# We only ask that you use GPT-2 responsibly and clearly indicate your content was created using GPT-2.
# Permission is hereby granted, free of charge, to any person obtaining a copy of this software and
# associated documentation files (the "Software"), to deal in the Software without restriction,
# including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so,
# subject to the following conditions:
# The above copyright notice and this permission notice shall be included
# in all copies or substantial portions of the Software.
# The above copyright notice and this permission notice need not be included
# with content created by the Software.
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
# INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
# BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
# TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE
# OR OTHER DEALINGS IN THE SOFTWARE.
import os
import json
import regex as re
import requests
from tqdm import tqdm
from functools import lru_cache
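# Build the lookup table between UTF-8 bytes and Unicode strings.
# The reversible BPE codes operate on Unicode strings, so avoiding UNK tokens
# would otherwise require a large number of Unicode characters in the vocabulary.
# The table also avoids mapping to whitespace/control characters, which break the BPE code.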
@lru_cache()
def bytes_to_unicode():
    """
    Returns list of utf-8 byte and a corresponding list of unicode strings.
    The reversible bpe codes work on unicode strings.
    This means you need a large # of unicode characters in your vocab if you want to avoid UNKs.
    When you're at something like a 10B token dataset you end up needing around 5K for decent coverage.
    This is a significant percentage of your normal, say, 32K bpe vocab.
    To avoid that, we want lookup tables between utf-8 bytes and unicode strings.
    And avoids mapping to whitespace/control characters the bpe code barfs on.
    """
    # Printable bytes that are mapped to themselves.
    bs = list(range(ord("!"), ord("~") + 1)) + list(range(ord("¡"), ord("¬") + 1)) + list(range(ord("®"), ord("ÿ") + 1))
    cs = bs[:]
    n = 0
    # Every remaining byte (whitespace, control characters) is shifted to a code point of 256 or above.
    for b in range(2**8):
        if b not in bs:
            bs.append(b)
            cs.append(2**8 + n)
            n += 1
    cs = [chr(n) for n in cs]
    return dict(zip(bs, cs))
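A quick way to see what `bytes_to_unicode` produces is to map a short string through the table and back. The sketch below repeats the function in condensed form so it runs on its own, assuming its final line is `return dict(zip(bs, cs))` as in the GPT-2 source linked above. The helpers `to_printable` and `from_printable` are hypothetical names introduced only for this illustration.

```python
from functools import lru_cache


@lru_cache()
def bytes_to_unicode():
    # Condensed copy of the function above, repeated so the snippet is self-contained.
    bs = list(range(ord("!"), ord("~") + 1)) + list(range(ord("¡"), ord("¬") + 1)) + list(range(ord("®"), ord("ÿ") + 1))
    cs = bs[:]
    n = 0
    for b in range(2**8):
        if b not in bs:
            bs.append(b)
            cs.append(2**8 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))


byte_encoder = bytes_to_unicode()                          # byte value -> printable character
byte_decoder = {ch: b for b, ch in byte_encoder.items()}   # printable character -> byte value


def to_printable(text):
    """Hypothetical helper: replace every UTF-8 byte of text with its mapped character."""
    return "".join(byte_encoder[b] for b in text.encode("utf-8"))


def from_printable(mapped):
    """Hypothetical helper: invert to_printable and decode the bytes as UTF-8."""
    return bytes(byte_decoder[ch] for ch in mapped).decode("utf-8")


mapped = to_printable("Hello world")
print(len(byte_encoder))       # 256: every possible byte has an entry
print(mapped)                  # HelloĠworld (the space byte 32 maps to chr(288), "Ġ")
print(from_printable(mapped))  # Hello world
```

Because each of the 256 byte values gets its own character, the mapping is reversible for any UTF-8 input, including text outside ASCII.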




