The torchtext package consists of data processing utilities and popular datasets for natural language.
(1) RawField
# Defines a general datatype.
'''
Every dataset consists of one or more types of data.
For instance, a text classification dataset contains sentences and their classes,
while a machine translation dataset contains paired examples of text in two languages.
Each of these types of data is represented by a RawField object.
A RawField object does not assume any property of the data type and it holds
parameters relating to how a datatype should be processed.
'''
class torchtext.data.RawField(preprocessing=None, postprocessing=None, is_target=False)
# Initialize self. See help(type(self)) for accurate signature.
'''
Variables =>:
preprocessing – The Pipeline that will be applied to examples using this field before creating an example. Default: None.
postprocessing – A Pipeline that will be applied to a list of examples using this field before assigning to a batch. Function signature: (batch(list)) -> object. Default: None.
is_target – Whether this field is a target variable. Affects iteration over batches. Default: False.
'''
__init__(preprocessing=None, postprocessing=None, is_target=False)
# Preprocess an example if the preprocessing Pipeline is provided.
preprocess(x)
# Process a list of examples to create a batch.
# Postprocess the batch with the user-provided Pipeline.
'''
Parameters =>:
batch (list(object)) – A list of objects from a batch of examples.
Returns: Processed object given the input and custom postprocessing Pipeline.
Return type: object
'''
process(batch, *args, **kwargs)
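The preprocess/process flow described above can be sketched in plain Python. The class below is a hypothetical minimal re-implementation for illustration, not torchtext's actual source:

```python
# Minimal sketch of RawField's preprocess/process flow
# (hypothetical re-implementation, not torchtext's source code).
class SimpleRawField:
    def __init__(self, preprocessing=None, postprocessing=None, is_target=False):
        self.preprocessing = preprocessing
        self.postprocessing = postprocessing
        self.is_target = is_target

    def preprocess(self, x):
        # Apply the preprocessing pipeline to a single example, if provided.
        return self.preprocessing(x) if self.preprocessing is not None else x

    def process(self, batch, *args, **kwargs):
        # Apply the postprocessing pipeline to the whole batch, if provided.
        return self.postprocessing(batch) if self.postprocessing is not None else batch

field = SimpleRawField(preprocessing=str.strip, postprocessing=sorted)
examples = [field.preprocess(s) for s in ["  b ", "a  "]]
print(field.process(examples))  # ['a', 'b']
```

Note how preprocessing runs per example while postprocessing runs once per batch, matching the signatures above.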
(2) Field
'''
Defines a datatype together with instructions for converting to Tensor.
Field class models common text processing datatypes that can be represented by tensors. It holds a Vocab object that defines the set of possible values for elements of the field and their corresponding numerical representations. The Field object also holds other parameters relating to how a datatype should be numericalized, such as a tokenization method and the kind of Tensor that should be produced.
If a Field is shared between two columns in a dataset (e.g., question and answer in a QA dataset), then they will have a shared vocabulary.
'''
class torchtext.data.Field(sequential=True, use_vocab=True, init_token=None, eos_token=None, fix_length=None, dtype=torch.int64, preprocessing=None, postprocessing=None, lower=False, tokenize=None, tokenizer_language='en', include_lengths=False, batch_first=False, pad_token='<pad>', unk_token='<unk>', pad_first=False, truncate_first=False, stop_words=None, is_target=False)
'''
Variables =>:
sequential – Whether the datatype represents sequential data. If False, no tokenization is applied. Default: True.
use_vocab – Whether to use a Vocab object. If False, the data in this field should already be numerical. Default: True.
init_token – A token that will be prepended to every example using this field, or None for no initial token. Default: None.
eos_token – A token that will be appended to every example using this field, or None for no end-of-sentence token. Default: None.
fix_length – A fixed length that all examples using this field will be padded to, or None for flexible sequence lengths. Default: None.
dtype – The torch.dtype class that represents a batch of examples of this kind of data. Default: torch.long.
preprocessing – The Pipeline that will be applied to examples using this field after tokenizing but before numericalizing. Many Datasets replace this attribute with a custom preprocessor. Default: None.
postprocessing – A Pipeline that will be applied to examples using this field after numericalizing but before the numbers are turned into a Tensor. The pipeline function takes the batch as a list, and the field's Vocab. Default: None.
lower – Whether to lowercase the text in this field. Default: False.
tokenize – The function used to tokenize strings using this field into sequential examples. If "spacy", the SpaCy tokenizer is used. If a non-serializable function is passed as an argument, the field will not be able to be serialized. Default: string.split.
tokenizer_language – The language of the tokenizer to be constructed. Various languages are currently supported only in SpaCy.
include_lengths – Whether to return a tuple of a padded minibatch and a list containing the lengths of each example, or just a padded minibatch. Default: False.
batch_first – Whether to produce tensors with the batch dimension first. Default: False.
pad_token – The string token used as padding. Default: "<pad>".
unk_token – The string token used to represent OOV words. Default: "<unk>".
pad_first – Do the padding of the sequence at the beginning. Default: False.
truncate_first – Do the truncating of the sequence at the beginning. Default: False.
stop_words – Tokens to discard during the preprocessing step. Default: None.
is_target – Whether this field is a target variable. Affects iteration over batches. Default: False.
'''
# Initialize self. See help(type(self)) for accurate signature.
__init__(sequential=True, use_vocab=True, init_token=None, eos_token=None, fix_length=None, dtype=torch.int64, preprocessing=None, postprocessing=None, lower=False, tokenize=None, tokenizer_language='en', include_lengths=False, batch_first=False, pad_token='<pad>', unk_token='<unk>', pad_first=False, truncate_first=False, stop_words=None, is_target=False)
# Construct the Vocab object for this field from one or more datasets.
'''
Parameters =>:
arguments (Positional) – Dataset objects or other iterable data sources from which to construct the Vocab object that represents the set of possible values for this field. If a Dataset object is provided, all columns corresponding to this field are used; individual columns can also be provided directly.
keyword arguments (Remaining) – Passed to the constructor of Vocab.
'''
build_vocab(*args, **kwargs)
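Conceptually, build_vocab counts tokens across the given columns and maps each token to an integer index, reserving the first indices for special tokens. A hedged pure-Python sketch (the token-to-index table is called stoi and its inverse itos in torchtext's Vocab; everything else here is illustrative):

```python
from collections import Counter

# Hypothetical sketch of what build_vocab does conceptually: count tokens
# across the given columns, then assign indices by frequency, with the
# special tokens placed first (illustrative, not torchtext's source).
def build_simple_vocab(columns, specials=('<unk>', '<pad>')):
    counter = Counter(
        tok for column in columns for example in column for tok in example
    )
    itos = list(specials) + [tok for tok, _ in counter.most_common()]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return stoi, itos

stoi, itos = build_simple_vocab([[['hello', 'world'], ['hello']]])
print(stoi['hello'])  # 2 (first slot after the two specials)
```

A Dataset argument would contribute every column bound to this field; here the columns are passed directly as nested lists.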
'''
Turn a batch of examples that use this field into a Variable.
If the field has include_lengths=True, a tensor of lengths will be included in the return value.
'''
'''
Parameters =>:
arr (List[List[str]], or tuple of (List[List[str]], List[int])) – List of tokenized and padded examples, or a tuple of a list of tokenized and padded examples and a list of the lengths of each example if self.include_lengths is True.
device (str or torch.device) – A string or instance of torch.device specifying which device the Variables are going to be created on. If left as default, the tensors will be created on the CPU. Default: None.
'''
numericalize(arr, device=None)
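The token-to-index lookup at the heart of numericalize can be illustrated without torch: each token is mapped through the vocab's stoi table, with out-of-vocabulary tokens falling back to <unk>. The stoi dict below is a toy stand-in for a real Vocab, and torchtext would return a torch.Tensor rather than nested lists:

```python
# Toy stand-in for a Vocab's stoi table (illustrative only).
stoi = {'<unk>': 0, '<pad>': 1, 'hello': 2, 'world': 3}

def numericalize(arr, stoi):
    # Map each token of each padded example to its vocab index,
    # falling back to the <unk> index for OOV tokens.
    return [[stoi.get(tok, stoi['<unk>']) for tok in example] for example in arr]

print(numericalize([['hello', 'world'], ['hi', '<pad>']], stoi))
# [[2, 3], [0, 1]]
```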
'''
Pad a batch of examples using this field.
Pads to self.fix_length if provided, otherwise pads to the length of the longest example in the batch. Prepends self.init_token and appends self.eos_token if those attributes are not None. Returns a tuple of the padded list and a list containing lengths of each example if self.include_lengths is True and self.sequential is True, else just returns the padded list. If self.sequential is False, no padding is applied.
'''
pad(minibatch)
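The padding rules above can be sketched as a small pure-Python function. This is an illustrative approximation of Field.pad for sequential data; truncation here always keeps the front of the sequence, i.e. pad_first and truncate_first are not modeled:

```python
# Hypothetical sketch of Field.pad semantics for sequential data:
# pad to fix_length if given, else to the longest example in the batch,
# optionally adding init/eos tokens and returning per-example lengths.
def pad_batch(minibatch, pad_token='<pad>', init_token=None, eos_token=None,
              fix_length=None, include_lengths=False):
    extra = (init_token is not None) + (eos_token is not None)
    max_len = (fix_length - extra) if fix_length is not None \
        else max(len(x) for x in minibatch)
    padded, lengths = [], []
    for x in minibatch:
        seq = ([init_token] if init_token else []) + list(x[:max_len]) \
            + ([eos_token] if eos_token else [])
        lengths.append(len(seq))                       # length before padding
        seq += [pad_token] * (max_len + extra - len(seq))
        padded.append(seq)
    return (padded, lengths) if include_lengths else padded

print(pad_batch([['a', 'b'], ['c']], include_lengths=True))
# ([['a', 'b'], ['c', '<pad>']], [2, 1])
```

Note that fix_length bounds the total output width, so the room left for real tokens shrinks when init_token or eos_token is set.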
'''
Load a single example using this field, tokenizing if necessary.
If the input is a Python 2 str, it will be converted to Unicode first. If sequential=True, it will be tokenized. Then the input will be optionally lowercased and passed to the user-provided preprocessing Pipeline.
'''
preprocess(x)
'''
Process a list of examples to create a torch.Tensor.
Pad, numericalize, and postprocess a batch and create a tensor.
'''
'''
Parameters =>:
batch (list(object)) – A list of objects from a batch of examples.
Returns: Processed object given the input and custom postprocessing Pipeline.
Return type: torch.autograd.Variable
'''
process(batch, device=None)
# alias of torchtext.vocab.Vocab
vocab_cls
(3) ReversibleField
class torchtext.data.ReversibleField(**kwargs)
# Initialize self. See help(type(self)) for accurate signature.
__init__(**kwargs)
(4) SubwordField
class torchtext.data.SubwordField(**kwargs)
# Initialize self. See help(type(self)) for accurate signature.
__init__(**kwargs)
# Segment one or more datasets with this subword field.
'''
Parameters =>:
arguments (Positional) – Dataset objects or other indexable mutable sequences to segment. If a Dataset object is provided, all columns corresponding to this field are used; individual columns can also be provided directly.
'''
segment(*args)
# alias of torchtext.vocab.SubwordVocab
vocab_cls
(5) NestedField
# A nested field.
'''
A nested field holds another field (called the nesting field), accepts an untokenized string or a list of string tokens, and groups and treats them as one field as described by the nesting field. Every token will be preprocessed, padded, etc. in the manner specified by the nesting field. Note that this means a nested field always has sequential=True. The two fields' vocabularies will be shared. Their numericalization results will be stacked into a single tensor. NestedField also shares include_lengths with nesting_field, so one shouldn't specify include_lengths in the nesting_field. This field is primarily used to implement character embeddings. See tests/data/test_field.py for examples on how to use this field.
'''
class torchtext.data.NestedField(nesting_field, use_vocab=True, init_token=None, eos_token=None, fix_length=None, dtype=torch.int64, preprocessing=None, postprocessing=None, tokenize=None, tokenizer_language='en', include_lengths=False, pad_token='<pad>', pad_first=False, truncate_first=False)
'''
Parameters =>:
nesting_field (Field) – A field contained in this nested field.
use_vocab (bool) – Whether to use a Vocab object. If False, the data in this field should already be numerical. Default: True.
init_token (str) – A token that will be prepended to every example using this field, or None for no initial token. Default: None.
eos_token (str) – A token that will be appended to every example using this field, or None for no end-of-sentence token. Default: None.
fix_length (int) – A fixed length that all examples using this field will be padded to, or None for flexible sequence lengths. Default: None.
dtype – The torch.dtype class that represents a batch of examples of this kind of data. Default: torch.long.
preprocessing (Pipeline) – The Pipeline that will be applied to examples using this field after tokenizing but before numericalizing. Many Datasets replace this attribute with a custom preprocessor. Default: None.
postprocessing (Pipeline) – A Pipeline that will be applied to examples using this field after numericalizing but before the numbers are turned into a Tensor. The pipeline function takes the batch as a list, and the field's Vocab. Default: None.
include_lengths – Whether to return a tuple of a padded minibatch and a list containing the lengths of each example, or just a padded minibatch. Default: False.
tokenize – The function used to tokenize strings using this field into sequential examples. If "spacy", the SpaCy tokenizer is used. If a non-serializable function is passed as an argument, the field will not be able to be serialized. Default: string.split.
tokenizer_language – The language of the tokenizer to be constructed. Various languages are currently supported only in SpaCy.
pad_token (str) – The string token used as padding. If nesting_field is sequential, this will be set to its pad_token. Default: "<pad>".
pad_first (bool) – Do the padding of the sequence at the beginning. Default: False.
'''
# Initialize self. See help(type(self)) for accurate signature.
__init__(nesting_field, use_vocab=True, init_token=None, eos_token=None, fix_length=None, dtype=torch.int64, preprocessing=None, postprocessing=None, tokenize=None, tokenizer_language='en', include_lengths=False, pad_token='<pad>', pad_first=False, truncate_first=False)
# Construct the Vocab object for the nesting field and combine it with this field's vocab.
'''
Parameters =>:
arguments (Positional) – Dataset objects or other iterable data sources from which to construct the Vocab object that represents the set of possible values for the nesting field. If a Dataset object is provided, all columns corresponding to this field are used; individual columns can also be provided directly.
keyword arguments (Remaining) – Passed to the constructor of Vocab.
'''
build_vocab(*args, **kwargs)
'''
Convert a padded minibatch into a variable tensor.
Each item in the minibatch will be numericalized independently and the resulting tensors will be stacked at the first dimension.
'''
'''
Parameters =>:
arr (List[List[str]]) – List of tokenized and padded examples.
device (str or torch.device) – A string or instance of torch.device specifying which device the Variables are going to be created on. If left as default, the tensors will be created on the CPU. Default: None.
'''
numericalize(arrs, device=None)
'''
Pad a batch of examples using this field.
If self.nesting_field.sequential is False, each example in the batch must be a list of string tokens, which are padded as if by a Field with sequential=True. Otherwise, each example must be a list of lists of tokens. Using self.nesting_field, the list of tokens is padded to self.nesting_field.fix_length if provided, or otherwise to the length of the longest list of tokens in the batch. Next, using this field, the result is padded by filling short examples with self.nesting_field.pad_token.
'''
'''
Parameters =>:
minibatch (list) – Each element is a list of strings if self.nesting_field.sequential is False, and a list of lists of strings otherwise.
Returns: The padded minibatch, or (padded, sentence_lens, word_lengths).
Return type: list
'''
pad(minibatch)
Example:
>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=4)
>>>
>>> nesting_field = Field(pad_token='<c>', init_token='<w>', eos_token='</w>')
>>> field = NestedField(nesting_field, init_token='<s>', eos_token='</s>')
>>> minibatch = [
... [list('john'), list('loves'), list('mary')],
... [list('mary'), list('cries')],
... ]
>>> padded = field.pad(minibatch)
>>> pp.pprint(padded)
[ [ ['<w>', '<s>', '</w>', '<c>', '<c>', '<c>', '<c>'],
['<w>', 'j', 'o', 'h', 'n', '</w>', '<c>'],
['<w>', 'l', 'o', 'v', 'e', 's', '</w>'],
['<w>', 'm', 'a', 'r', 'y', '</w>', '<c>'],
['<w>', '</s>', '</w>', '<c>', '<c>', '<c>', '<c>']],
[ ['<w>', '<s>', '</w>', '<c>', '<c>', '<c>', '<c>'],
['<w>', 'm', 'a', 'r', 'y', '</w>', '<c>'],
['<w>', 'c', 'r', 'i', 'e', 's', '</w>'],
['<w>', '</s>', '</w>', '<c>', '<c>', '<c>', '<c>'],
['<c>', '<c>', '<c>', '<c>', '<c>', '<c>', '<c>']]]
'''
Preprocess a single example.
First, tokenization and the supplied preprocessing pipeline are applied. Since this field is always sequential, the result is a list. Then, each element of the list is preprocessed using self.nesting_field.preprocess and the resulting list is returned.
'''
'''
Parameters: xs (list or str) – The input to preprocess.
Returns: The preprocessed list. 
Return type: list
'''
preprocess(xs)
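The two-level preprocessing can be sketched as: tokenize the string into word tokens, then apply the nesting field's preprocess to each token. For character embeddings, the nesting preprocess simply splits each word into its characters. The helper below is illustrative, not torchtext code:

```python
# Hypothetical sketch of NestedField.preprocess for a character-level
# nesting field: outer tokenization into words, then per-token nesting
# preprocessing (here: split each word into characters via list()).
def nested_preprocess(xs, tokenize=str.split, nesting_preprocess=list):
    tokens = tokenize(xs) if isinstance(xs, str) else xs
    return [nesting_preprocess(tok) for tok in tokens]

print(nested_preprocess("john loves mary"))
# [['j', 'o', 'h', 'n'], ['l', 'o', 'v', 'e', 's'], ['m', 'a', 'r', 'y']]
```

This produces exactly the list-of-character-lists shape that the pad example above consumes.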