Components make up your NLU pipeline and work sequentially to process user input into structured output. There are components for entity extraction, intent classification, response selection, pre-processing, and more.
Language Models
The following components load pre-trained models, which are needed if you want to use pre-trained word vectors in your pipeline.
MitieNLP
- Short
MITIE initializer
- Outputs
Nothing
- Requires
Nothing
- Description
Initializes MITIE structures. Every MITIE component relies on this, so it should be placed at the beginning of every pipeline that uses any MITIE component.
- Configuration
The MITIE library needs a language model file, which must be specified in the configuration:
pipeline:
- name: "MitieNLP"
  # language model to load
  model: "data/total_word_feature_extractor.dat"
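You can also pre-train your own word vectors from a language corpus using MITIE. To do so:
- Get a clean language corpus (a Wikipedia dump works) as a set of text files.
- Build and run the MITIE Wordrep tool on your corpus. This can take several hours or days depending on your dataset and your workstation. Running wordrep needs 128GB of RAM, which is a lot: try extending your swap.
- Set the path of your new total_word_feature_extractor.dat as the model parameter of the MitieNLP component in your configuration.
For a full example of how to train MITIE word vectors, check out 用Rasa NLU构建自己的中文NLU系统 ("Building your own Chinese NLU system with Rasa NLU"), a blog post about creating a MITIE model from a Chinese Wikipedia dump.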
SpacyNLP
- Short
spaCy language initializer
- Outputs
Nothing
- Requires
Nothing
- Description
Initializes spaCy structures. Every spaCy component relies on this, so it should be placed at the beginning of every pipeline that uses any spaCy component.
- Configuration
You need to specify the language model to use. The name is passed to spacy.load(name). You can find more information on the available models in the spaCy documentation.
pipeline:
- name: "SpacyNLP"
  # language model to load
  model: "en_core_web_md"
  # when retrieving word vectors, this will decide if the casing
  # of the word is relevant. E.g. `hello` and `Hello` will
  # retrieve the same vector, if set to `False`. For some
  # applications and models it makes sense to differentiate
  # between these two words, therefore setting this to `True`.
  case_sensitive: False
For more information on how to download spaCy models, see the spaCy installation instructions.
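In addition to spaCy's pre-trained language models, you can also use this component to attach spaCy models that you have trained yourself.
FALLBACK
If you don't pass a model setting, Rasa Open Source will try to fall back to a public model on your behalf. This is a temporary feature introduced as part of the spaCy 3.0 migration; the fallback will be removed in Rasa 3.0.0.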
HFTransformersNLP
DEPRECATED
HFTransformersNLP is deprecated and will be removed in a future release. LanguageModelFeaturizer now implements its behavior.
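Tokenizers
Tokenizers split text into tokens. If you want to split intents into multiple labels, e.g. for predicting multiple intents or for modeling a hierarchical intent structure, use the following flags with any tokenizer:
- intent_tokenization_flag indicates whether to tokenize intent labels. Set it to True so that intent labels are tokenized.
- intent_split_symbol sets the delimiter string used to split the intent labels. The default is underscore (_).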
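The effect of the two intent flags can be pictured with a small Python sketch. The helper `split_intent_label` is hypothetical and is not part of Rasa; it only mirrors the behaviour described above under the assumption that splitting is a plain string split on the configured symbol.

```python
from typing import List


def split_intent_label(
    label: str,
    intent_tokenization_flag: bool = False,
    intent_split_symbol: str = "_",
) -> List[str]:
    """Return the sub-labels of an intent label (illustrative helper only)."""
    if not intent_tokenization_flag:
        # Tokenization disabled: the label is kept as a single unit.
        return [label]
    # Tokenization enabled: split on the configured delimiter string.
    return label.split(intent_split_symbol)


print(split_intent_label("feedback_positive"))
print(split_intent_label("feedback_positive", intent_tokenization_flag=True))
print(split_intent_label("check+balance", True, intent_split_symbol="+"))
```

With the flag left at its default, the label stays intact; with the flag enabled, each part can be treated as its own label.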
WhitespaceTokenizer
- Short
Tokenizer using whitespaces as a separator
- Outputs
tokens for user messages, responses (if present), and intents (if specified)
- Requires
Nothing
- Description
Creates a token for every whitespace-separated character sequence.
- Configuration
pipeline:
- name: "WhitespaceTokenizer"
  # Flag to check whether to split intents
  "intent_tokenization_flag": False
  # Symbol on which intent should be split
  "intent_split_symbol": "_"
  # Regular expression to detect tokens
  "token_pattern": None
JiebaTokenizer
- Short
Tokenizer using Jieba for the Chinese language
- Outputs
tokens for user messages, responses (if present), and intents (if specified)
- Requires
Nothing
- Description
Creates tokens using the Jieba tokenizer, which is built specifically for Chinese. It only works for the Chinese language.
- Configuration
Custom dictionary files can be loaded automatically by specifying the directory path of the files via dictionary_path. If dictionary_path is None (the default), no custom dictionary is used.
pipeline:
- name: "JiebaTokenizer"
  dictionary_path: "path/to/custom/dictionary/dir"
  # Flag to check whether to split intents
  "intent_tokenization_flag": False
  # Symbol on which intent should be split
  "intent_split_symbol": "_"
  # Regular expression to detect tokens
  "token_pattern": None
MitieTokenizer
- Short
Tokenizer using MITIE
- Outputs
tokens for user messages, responses (if present), and intents (if specified)
- Requires
MitieNLP
- Description
Creates tokens using the MITIE tokenizer.
- Configuration
pipeline:
- name: "MitieTokenizer"
  # Flag to check whether to split intents
  "intent_tokenization_flag": False
  # Symbol on which intent should be split
  "intent_split_symbol": "_"
  # Regular expression to detect tokens
  "token_pattern": None
SpacyTokenizer
- Short
Tokenizer using spaCy
- Outputs
tokens for user messages, responses (if present), and intents (if specified)
- Requires
SpacyNLP
- Description
Creates tokens using the spaCy tokenizer.
- Configuration
pipeline:
- name: "SpacyTokenizer"
  # Flag to check whether to split intents
  "intent_tokenization_flag": False
  # Symbol on which intent should be split
  "intent_split_symbol": "_"
  # Regular expression to detect tokens
  "token_pattern": None
ConveRTTokenizer
Deprecated
LanguageModelTokenizer
Deprecated
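Featurizers
Text featurizers are divided into two categories: sparse featurizers and dense featurizers. Sparse featurizers return feature vectors with a lot of missing values, e.g. zeros. Because these feature vectors would normally take up a lot of memory, we store them as sparse features. Sparse features only store the values that are non-zero and their positions in the vector. This saves a lot of memory and makes it possible to train on larger datasets.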
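To make the storage idea concrete, here is a minimal pure-Python sketch of a sparse representation. The helpers `to_sparse` and `to_dense` are illustrative names, and the dictionary layout is an assumption made for readability; it is not how Rasa stores sparse features internally.

```python
from typing import Dict, List


def to_sparse(vector: List[float]) -> Dict[int, float]:
    """Keep only the non-zero values, keyed by their position."""
    return {index: value for index, value in enumerate(vector) if value != 0}


def to_dense(sparse: Dict[int, float], size: int) -> List[float]:
    """Rebuild the full vector, filling missing positions with zeros."""
    return [sparse.get(index, 0.0) for index in range(size)]


dense = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 2.0, 0.0]
sparse = to_sparse(dense)
print(sparse)                # only two entries are stored
print(to_dense(sparse, 8))   # the original vector is recoverable
```

Eight values shrink to two stored entries here; the saving grows with the vocabulary size, since most positions of a bag-of-words vector are zero.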
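All featurizers can return two different kinds of features: sequence features and sentence features. The sequence features are a matrix of size (number-of-tokens x feature-dimension). The matrix contains a feature vector for every token in the sequence, which allows us to train sequence models. The sentence features are represented by a matrix of size (1 x feature-dimension). It contains the feature vector for the complete utterance. The sentence features can be used in any bag-of-words model. The corresponding classifier can therefore decide what kind of features to use. Note: the feature-dimension for sequence and sentence features does not have to be the same.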
MitieFeaturizer
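- Short
Creates a vector representation of user message and response (if specified) using the MITIE featurizer.
- Outputs
dense_features for user messages and responses
- Requires
MitieNLP
- Type
Dense featurizer
- Description
Creates features for entity extraction, intent classification, and response classification using the MITIE featurizer.
- Configuration
The sentence vector, i.e. the vector of the complete utterance, can be calculated in two different ways, either via mean or via max pooling. You can specify the pooling method in your configuration file with the option pooling. The default pooling method is mean.
pipeline:
- name: "MitieFeaturizer"
  # Specify what pooling operation should be used to calculate the vector of
  # the complete utterance. Available options: 'mean' and 'max'.
  "pooling": "mean"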
SpacyFeaturizer
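- Short
Creates a vector representation of user message and response (if specified) using the spaCy featurizer.
- Outputs
dense_features for user messages and responses
- Requires
SpacyNLP
- Type
Dense featurizer
- Description
Creates features for entity extraction, intent classification, and response classification using the spaCy featurizer.
- Configuration
The sentence vector, i.e. the vector of the complete utterance, can be calculated in two different ways, either via mean or via max pooling. You can specify the pooling method in your configuration file with the option pooling. The default pooling method is mean.
pipeline:
- name: "SpacyFeaturizer"
  # Specify what pooling operation should be used to calculate the vector of
  # the complete utterance. Available options: 'mean' and 'max'.
  "pooling": "mean"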
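The two pooling options reduce the per-token (sequence) vectors to a single sentence vector. The following is a minimal sketch in plain Python; the function `pool` and the toy vectors are assumptions for illustration and do not reproduce the featurizers' actual code.

```python
from typing import List


def pool(sequence_features: List[List[float]], pooling: str = "mean") -> List[float]:
    """Collapse a (tokens x dimension) matrix into one sentence vector."""
    # zip(*rows) yields one tuple per feature dimension (i.e. per column).
    columns = list(zip(*sequence_features))
    if pooling == "mean":
        return [sum(column) / len(column) for column in columns]
    if pooling == "max":
        return [max(column) for column in columns]
    raise ValueError(f"unknown pooling operation: {pooling}")


# Three tokens, each with a two-dimensional feature vector.
token_vectors = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]
print(pool(token_vectors, "mean"))
print(pool(token_vectors, "max"))
```

Mean pooling averages every dimension over all tokens, while max pooling keeps the largest value per dimension, so a single strongly activated token dominates the result.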
ConveRTFeaturizer
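- Short
Creates a vector representation of user message and response (if specified) using the ConveRT model.
- Outputs
dense_features for user messages and responses
- Requires
tokens
- Type
Dense featurizer
- Description
Creates features for entity extraction, intent classification, and response selection. It uses the default signature to compute vector representations of input text.
Since the ConveRT model is trained only on an English corpus of conversations, this featurizer should only be used if your training data is in English.
To use the ConveRTFeaturizer, install it with pip install rasa[convert].
- Configuration
pipeline:
- name: "ConveRTFeaturizer"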
LanguageModelFeaturizer
- Short
Creates a vector representation of user message and response (if specified) using a pre-trained language model.
- Outputs
dense_features for user messages and responses
- Requires
tokens
- Type
Dense featurizer
- Description
Creates features for entity extraction, intent classification, and response selection. Uses a pre-trained language model to compute vector representations of input text. Make sure that the language model you use is pre-trained on a corpus in the same language as your training data.
- Configuration
Include a Tokenizer component before this component.
You should specify what language model to load via the parameter model_name. See the table below for the available language models. Additionally, you can specify the architecture variation of the chosen language model via the parameter model_weights. The full list of supported architectures can be found in the HuggingFace documentation. If left empty, it uses the default model architecture that the original Transformers library loads (see table below).
+----------------+--------------+-------------------------+
| Language Model | Parameter    | Default value for       |
|                | "model_name" | "model_weights"         |
+----------------+--------------+-------------------------+
| BERT           | bert         | rasa/LaBSE              |
+----------------+--------------+-------------------------+
| GPT            | gpt          | openai-gpt              |
+----------------+--------------+-------------------------+
| GPT-2          | gpt2         | gpt2                    |
+----------------+--------------+-------------------------+
| XLNet          | xlnet        | xlnet-base-cased        |
+----------------+--------------+-------------------------+
| DistilBERT     | distilbert   | distilbert-base-uncased |
+----------------+--------------+-------------------------+
| RoBERTa        | roberta      | roberta-base            |
+----------------+--------------+-------------------------+
The following configuration loads the language model BERT:
pipeline:
- name: LanguageModelFeaturizer
  # Name of the language model to use
  model_name: "bert"
  # Pre-Trained weights to be loaded
  model_weights: "rasa/LaBSE"
  # An optional path to a specific directory to download and cache the pre-trained model weights.
  # The "default" cache_dir is the same as https://huggingface.co/transformers/serialization.html#cache-directory .
  cache_dir: null
RegexFeaturizer
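- Short
Creates a vector representation of user message using regular expressions.
- Outputs
sparse_features for user messages and tokens.pattern
- Requires
tokens
- Type
Sparse featurizer
- Description
Creates features for entity extraction and intent classification. During training, the RegexFeaturizer creates a list of regular expressions defined in the training data format. For each regex, a feature is set that marks whether this expression was found in the user message or not. All features are later fed into an intent classifier / entity extractor to simplify classification (assuming that the classifier has learned during the training phase that this set feature indicates a certain intent / entity). Regex features for entity extraction are currently only supported by the CRFEntityExtractor and the DIETClassifier components!
- Configuration
Make the featurizer case insensitive by adding the case_sensitive: False option; the default is case_sensitive: True.
To correctly process languages such as Chinese that don't use whitespace for word separation, add the use_word_boundaries: False option; the default is use_word_boundaries: True.
pipeline:
- name: "RegexFeaturizer"
  # Text will be processed with case sensitive as default
  "case_sensitive": True
  # use match word boundaries for lookup table
  "use_word_boundaries": True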
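Configuring for incremental training
To ensure that sparse_features are of fixed size during incremental training, the component should be configured to account for additional patterns that may be added to the training data in the future. To do so, configure the number_additional_patterns parameter when training the base model from scratch:
pipeline:
- name: RegexFeaturizer
  number_additional_patterns: 10
If not configured by the user, the component uses twice the number of patterns currently present in the training data (including lookup tables and regex patterns) as the default value for number_additional_patterns. To avoid running out of additional slots for new patterns too frequently during incremental training, this number is kept at a minimum of 10. Once the component runs out of additional pattern slots, new patterns are dropped and are not considered during featurization. At this point, it is advisable to retrain a new model from scratch.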
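The interplay between pattern flags, the two options, and the reserved slots can be sketched with Python's standard `re` module. Everything below is an illustrative assumption: the helper names are hypothetical, word boundaries are applied to every pattern for simplicity, and the real component builds sparse matrices rather than lists.

```python
import re
from typing import List


def default_additional_patterns(number_of_current_patterns: int) -> int:
    """Twice the current number of patterns, but never fewer than 10."""
    return max(2 * number_of_current_patterns, 10)


def regex_features(
    text: str,
    patterns: List[str],
    number_of_slots: int,
    case_sensitive: bool = True,
    use_word_boundaries: bool = True,
) -> List[int]:
    """Return a fixed-size vector with one 0/1 flag per pattern slot."""
    flags = 0 if case_sensitive else re.IGNORECASE
    features = [0] * number_of_slots
    # Patterns that do not fit into the reserved slots are dropped.
    for slot, pattern in enumerate(patterns[:number_of_slots]):
        if use_word_boundaries:
            pattern = r"\b(?:" + pattern + r")\b"
        if re.search(pattern, text, flags):
            features[slot] = 1
    return features


patterns = [r"\d{5}", r"hello|hi"]
slots = len(patterns) + default_additional_patterns(len(patterns))
print(slots)
print(regex_features("Hi, my zip is 10115", patterns, slots, case_sensitive=False))
print(regex_features("this", ["hi"], 1, use_word_boundaries=False))
```

Because the vector length is fixed up front, a pattern added later simply occupies one of the spare slots, and the downstream model keeps seeing inputs of the same size.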
CountVectorsFeaturizer
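- Short
Creates bag-of-words representation of user messages, intents, and responses.
- Outputs
sparse_features for user messages, intents, and responses
- Requires
tokens
- Type
Sparse featurizer
- Description
Creates features for intent classification and response selection. Creates a bag-of-words representation of user messages, intents, and responses using sklearn's CountVectorizer. All tokens which consist only of digits (e.g. 123 and 99 but not a123d) are assigned to the same feature.
- Configuration
See sklearn's CountVectorizer docs for a detailed description of the configuration parameters.
This featurizer can be configured to use word or character n-grams via the analyzer configuration parameter. By default analyzer is set to word, so word token counts are used as features. If you want to use character n-grams, set analyzer to char or char_wb. The lower and upper boundaries of the n-grams can be configured via the parameters min_ngram and max_ngram. By default both of them are set to 1. By default the featurizer takes the lemma of a word instead of the word directly, if it is available. The lemma of a word is currently only set by the SpacyTokenizer. You can disable this behavior by setting use_lemma to False.
The option char_wb creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space. This option can be used to create Subword Semantic Hashing.
For character n-grams, do not forget to increase the min_ngram and max_ngram parameters. Otherwise the vocabulary will contain only single letters.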
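The char_wb behaviour can be approximated in a few lines of plain Python. This is a simplified sketch, not sklearn's implementation: the function `char_wb_ngrams` is a hypothetical name, words are split on whitespace only, and words shorter than the n-gram size are skipped here, which sklearn handles differently.

```python
from collections import Counter
from typing import List


def char_wb_ngrams(text: str, min_ngram: int, max_ngram: int) -> List[str]:
    """Character n-grams taken inside word boundaries, padded with spaces."""
    ngrams = []
    for word in text.split():
        padded = f" {word} "  # one space of padding on each edge of the word
        for size in range(min_ngram, max_ngram + 1):
            for start in range(len(padded) - size + 1):
                ngrams.append(padded[start:start + size])
    return ngrams


print(char_wb_ngrams("hi", 3, 3))
print(Counter(char_wb_ngrams("hi there", 3, 3)).most_common(3))
print(sorted(set(char_wb_ngrams("hi", 1, 1))))
```

The last call shows why the n-gram bounds matter: with both bounds left at 1, the vocabulary degenerates to single characters.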
Handling out-of-vocabulary (OOV) words:
Only enabled if analyzer is word.
LexicalSyntacticFeaturizer
-
Short
-
Outputs
-
Requires
-
Type
-
Description
-
Configuration
Intent Classifiers
Intent classifiers assign one of the intents defined in the domain file to incoming user messages.
MitieIntentClassifier
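- Short
MITIE intent classifier (using a text categorizer)
- Outputs
intent
- Requires
tokens for user message and MitieNLP
- Output-Example
{
    "intent": {"name": "greet", "confidence": 0.98343}
}
- Description
This classifier uses MITIE to perform intent classification. The underlying classifier uses a multi-class linear SVM with a sparse linear kernel (see the train_text_categorizer_classifier function in the MITIE trainer code).
Note
This classifier does not rely on any featurizer, as it extracts features on its own.
- Configuration
pipeline:
- name: "MitieIntentClassifier"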
SklearnIntentClassifier
- Short
Sklearn intent classifier
- Outputs
intent and intent_ranking
- Requires
dense_features for user messages
- Output-Example
{
    "intent": {"name": "greet", "confidence": 0.78343},
    "intent_ranking": [
        {
            "confidence": 0.1485910906220309,
            "name": "goodbye"
        },
        {
            "confidence": 0.08161531595656784,
            "name": "restaurant_search"
        }
    ]
}
- Description
The sklearn intent classifier trains a linear SVM which gets optimized using a grid search. It also provides rankings of the labels that did not "win". The SklearnIntentClassifier needs to be preceded by a dense featurizer in the pipeline. This dense featurizer creates the features used for the classification. For more information about the algorithm itself, take a look at the GridSearchCV documentation.
- Configuration
During the training of the SVM a hyperparameter search is run to find the best parameter set. In the configuration you can specify the parameters that will get tried.
pipeline:
- name: "SklearnIntentClassifier"
  # Specifies the list of regularization values to
  # cross-validate over for C-SVM.
  # This is used with the ``kernel`` hyperparameter in GridSearchCV.
  C: [1, 2, 5, 10, 20, 100]
  # Specifies the kernel to use with C-SVM.
  # This is used with the ``C`` hyperparameter in GridSearchCV.
  kernels: ["linear"]
  # Gamma parameter of the C-SVM.
  "gamma": [0.1]
  # We try to find a good number of cross folds to use during
  # intent training, this specifies the max number of folds.
  "max_cross_validation_folds": 5
  # Scoring function used for evaluating the hyper parameters.
  # This can be a name or a function.
  "scoring_function": "f1_weighted"
KeywordIntentClassifier
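- Short
Simple keyword matching intent classifier, intended for small, short-term projects.
- Outputs
intent
- Requires
Nothing
- Output-Example
{
    "intent": {"name": "greet", "confidence": 1.0}
}
- Description
This classifier works by searching a message for keywords. The matching is case sensitive by default and searches only for exact matches of the keyword string in the user message. The keywords for an intent are the examples of that intent in the NLU training data. This means the entire example is the keyword, not the individual words in the example.
Note
This classifier is intended only for small projects or to get started. If you have little NLU training data, you can take a look at the recommended pipelines in Tuning Your Model.
- Configuration
pipeline:
- name: "KeywordIntentClassifier"
  case_sensitive: True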
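The matching rule, where each whole training example acts as one keyword, can be sketched as follows. The function `classify_by_keyword` and the toy training data are hypothetical, and matching on word boundaries via `re` is an assumption of this sketch rather than a statement about the component's internals.

```python
import re
from typing import Dict, List, Optional


def classify_by_keyword(
    message: str,
    examples_per_intent: Dict[str, List[str]],
    case_sensitive: bool = True,
) -> Optional[str]:
    """Return the first intent whose complete example occurs in the message."""
    flags = 0 if case_sensitive else re.IGNORECASE
    for intent, examples in examples_per_intent.items():
        for example in examples:
            # The entire example is the keyword, not its individual words.
            if re.search(r"\b" + re.escape(example) + r"\b", message, flags):
                return intent
    return None


training_examples = {"greet": ["hello", "good morning"], "goodbye": ["bye"]}
print(classify_by_keyword("hello there", training_examples))
print(classify_by_keyword("Hello there", training_examples))
print(classify_by_keyword("Hello there", training_examples, case_sensitive=False))
print(classify_by_keyword("good to see you this morning", training_examples))
```

The last call returns no intent because "good" and "morning" appear separately, while only the full phrase "good morning" counts as a keyword.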
DIETClassifier
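- Short
Dual Intent Entity Transformer (DIET) used for intent classification and entity extraction
- Outputs
entities, intent and intent_ranking
- Requires
dense_features or sparse_features for user message, and optionally the intent
- Output-Example
{
    "intent": {"name": "greet", "confidence": 0.8343},
    "intent_ranking": [
        {
            "confidence": 0.385910906220309,
            "name": "goodbye"
        },
        {
            "confidence": 0.28161531595656784,
            "name": "restaurant_search"
        }
    ],
    "entities": [{
        "end": 53,
        "entity": "time",
        "start": 48,
        "value": "2017-04-10T00:00:00.000+02:00",
        "confidence": 1.0,
        "extractor": "DIETClassifier"
    }]
}
- Description
DIET (Dual Intent and Entity Transformer) is a multi-task architecture for intent classification and entity recognition. The architecture is based on a transformer which is shared for both tasks. A sequence of entity labels is predicted through a Conditional Random Field (CRF) tagging layer on top of the transformer output sequence corresponding to the input sequence of tokens. For the intent labels, the transformer output for the complete utterance and the intent labels are embedded into a single semantic vector space. The dot-product loss is used to maximize the similarity with the target label and minimize similarities with negative samples.
If you want to learn more about the model, check out the Algorithm Whiteboard series on YouTube, where we explain the model architecture in detail.
Note
If during prediction time a message contains only words unseen during training and no out-of-vocabulary preprocessor was used, an empty intent None is predicted with confidence 0.0. This might happen if you only use the CountVectorsFeaturizer with a word analyzer as featurizer. If you use the char_wb analyzer, you should always get an intent with a confidence value greater than 0.0.
- Configuration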
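If you want to use the DIETClassifier just for intent classification, set entity_recognition to False. If you want to do only entity recognition, set intent_classification to False. By default DIETClassifier does both, i.e. entity_recognition and intent_classification are set to True.
You can define a number of hyperparameters to adapt the model. If you want to adapt your model, start by modifying the following parameters:
- epochs: This parameter sets the number of times the algorithm will see the training data (default: 300). One epoch equals one forward pass and one backward pass of all the training examples. Sometimes the model needs more epochs to learn properly. Sometimes more epochs don't influence the performance. The lower the number of epochs, the faster the model is trained.
- hidden_layers_sizes: This parameter allows you to define the number of feed forward layers and their output dimensions for user messages and intents (default: text: [], label: []). Every entry in the list corresponds to a feed forward layer. For example, if you set text: [256, 128], two feed forward layers are added in front of the transformer. The vectors of the input tokens (coming from the user message) are passed on to those layers. The first layer has an output dimension of 256 and the second layer has an output dimension of 128. If an empty list is used (default behavior), no feed forward layer is added. Make sure to use only positive integer values. Usually, powers of two are used. Also, it is usual practice to have decreasing values in the list: the next value is smaller than or equal to the value before.
- embedding_dimension: This parameter defines the output dimension of the embedding layers used inside the model (default: 20). Multiple embedding layers are used inside the model architecture. For example, the vectors of the complete utterance and the intent are passed on to an embedding layer before they are compared and the loss is calculated.
- number_of_transformer_layers: This parameter sets the number of transformer layers to use (default: 2). The number of transformer layers corresponds to the transformer blocks to use for the model.
- transformer_size: This parameter sets the number of units in the transformer (default: 256). The vectors coming out of the transformer will have the given transformer_size.
- weight_sparsity: This parameter defines the fraction of kernel weights that are set to 0 for all feed forward layers in the model (default: 0.8). The value should be between 0 and 1. If you set weight_sparsity to 0, no kernel weights are set to 0 and the layer acts as a standard feed forward layer. You should not set weight_sparsity to 1, as this would result in all kernel weights being 0, i.e. the model is not able to learn.
- constrain_similarities: When set to True, this parameter applies a sigmoid cross entropy loss over all similarity terms. This helps in keeping similarities between input and negative labels to smaller values. This should help the model generalize better to real world test sets.
- model_confidence: This parameter allows the user to configure how confidences are computed during inference. It can take two values:
  - softmax: Confidences are in the range [0, 1] (old behavior and current default). Computed similarities are normalized with the softmax activation function.
  - linear_norm: Confidences are in the range [0, 1]. Computed dot product similarities are normalized with a linear function.
  Please try using linear_norm as the value for model_confidence. This should make it easier to tune fallback thresholds for the FallbackClassifier.
The above configuration parameters are the ones you should configure to fit your model to your data. However, additional parameters exist that can be adapted.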
+---------------------------------+------------------+--------------------------------------------------------------+
| Parameter | Default Value | Description |
+=================================+==================+==============================================================+
| hidden_layers_sizes | text: [] | Hidden layer sizes for layers before the embedding layers |
| | label: [] | for user messages and labels. The number of hidden layers is |
| | | equal to the length of the corresponding list. |
+---------------------------------+------------------+--------------------------------------------------------------+
| share_hidden_layers | False | Whether to share the hidden layer weights between user |
| | | messages and labels. |
+---------------------------------+------------------+--------------------------------------------------------------+
| transformer_size | 256 | Number of units in transformer. |
+---------------------------------+------------------+--------------------------------------------------------------+
| number_of_transformer_layers | 2 | Number of transformer layers. |
+---------------------------------+------------------+--------------------------------------------------------------+
| number_of_attention_heads | 4 | Number of attention heads in transformer. |
+---------------------------------+------------------+--------------------------------------------------------------+
| use_key_relative_attention | False | If 'True' use key relative embeddings in attention. |
+---------------------------------+------------------+--------------------------------------------------------------+
| use_value_relative_attention | False | If 'True' use value relative embeddings in attention. |
+---------------------------------+------------------+--------------------------------------------------------------+
| max_relative_position | None | Maximum position for relative embeddings. |
+---------------------------------+------------------+--------------------------------------------------------------+
| unidirectional_encoder | False | Use a unidirectional or bidirectional encoder. |
+---------------------------------+------------------+--------------------------------------------------------------+
| batch_size | [64, 256] | Initial and final value for batch sizes. |
| | | Batch size will be linearly increased for each epoch. |
| | | If constant `batch_size` is required, pass an int, e.g. `8`. |
+---------------------------------+------------------+--------------------------------------------------------------+
| batch_strategy | "balanced" | Strategy used when creating batches. |
| | | Can be either 'sequence' or 'balanced'. |
+---------------------------------+------------------+--------------------------------------------------------------+
| epochs | 300 | Number of epochs to train. |
+---------------------------------+------------------+--------------------------------------------------------------+
| random_seed | None | Set random seed to any 'int' to get reproducible results. |
+---------------------------------+------------------+--------------------------------------------------------------+
| learning_rate | 0.001 | Initial learning rate for the optimizer. |
+---------------------------------+------------------+--------------------------------------------------------------+
| embedding_dimension | 20 | Dimension size of embedding vectors. |
+---------------------------------+------------------+--------------------------------------------------------------+
| dense_dimension | text: 128 | Dense dimension for sparse features to use. |
| | label: 20 | |
+---------------------------------+------------------+--------------------------------------------------------------+
| concat_dimension | text: 128 | Concat dimension for sequence and sentence features. |
| | label: 20 | |
+---------------------------------+------------------+--------------------------------------------------------------+
| number_of_negative_examples | 20 | The number of incorrect labels. The algorithm will minimize |
| | | their similarity to the user input during training. |
+---------------------------------+------------------+--------------------------------------------------------------+
| similarity_type | "auto" | Type of similarity measure to use, either 'auto' or 'cosine' |
| | | or 'inner'. |
+---------------------------------+------------------+--------------------------------------------------------------+
| loss_type | "cross_entropy" | The type of the loss function, either 'cross_entropy' |
| | | or 'margin'. |
+---------------------------------+------------------+--------------------------------------------------------------+
| ranking_length | 10 | Number of top intents to normalize scores for. Applicable |
| | | only with loss type 'cross_entropy' and 'softmax' |
| | | confidences. Set to 0 to disable normalization. |
+---------------------------------+------------------+--------------------------------------------------------------+
| maximum_positive_similarity | 0.8 | Indicates how similar the algorithm should try to make |
| | | embedding vectors for correct labels. |
| | | Should be 0.0 < ... < 1.0 for 'cosine' similarity type. |
+---------------------------------+------------------+--------------------------------------------------------------+
| maximum_negative_similarity | -0.4 | Maximum negative similarity for incorrect labels. |
| | | Should be -1.0 < ... < 1.0 for 'cosine' similarity type. |
+---------------------------------+------------------+--------------------------------------------------------------+
| use_maximum_negative_similarity | True | If 'True' the algorithm only minimizes maximum similarity |
| | | over incorrect intent labels, used only if 'loss_type' is |
| | | set to 'margin'. |
+---------------------------------+------------------+--------------------------------------------------------------+
| scale_loss | False | Scale loss inverse proportionally to confidence of correct |
| | | prediction. |
+---------------------------------+------------------+--------------------------------------------------------------+
| regularization_constant | 0.002 | The scale of regularization. |
+---------------------------------+------------------+--------------------------------------------------------------+
| negative_margin_scale | 0.8 | The scale of how important it is to minimize the maximum |
| | | similarity between embeddings of different labels. |
+---------------------------------+------------------+--------------------------------------------------------------+
| weight_sparsity | 0.8 | Sparsity of the weights in dense layers. |
| | | Value should be between 0 and 1. |
+---------------------------------+------------------+--------------------------------------------------------------+
| drop_rate | 0.2 | Dropout rate for encoder. Value should be between 0 and 1. |
| | | The higher the value the higher the regularization effect. |
+---------------------------------+------------------+--------------------------------------------------------------+
| drop_rate_attention | 0.0 | Dropout rate for attention. Value should be between 0 and 1. |
| | | The higher the value the higher the regularization effect. |
+---------------------------------+------------------+--------------------------------------------------------------+
| use_sparse_input_dropout | True | If 'True' apply dropout to sparse input tensors. |
+---------------------------------+------------------+--------------------------------------------------------------+
| use_dense_input_dropout | True | If 'True' apply dropout to dense input tensors. |
+---------------------------------+------------------+--------------------------------------------------------------+
| evaluate_every_number_of_epochs | 20 | How often to calculate validation accuracy. |
| | | Set to '-1' to evaluate just once at the end of training. |
+---------------------------------+------------------+--------------------------------------------------------------+
| evaluate_on_number_of_examples | 0 | How many examples to use for hold out validation set. |
| | | Large values may hurt performance, e.g. model accuracy. |
+---------------------------------+------------------+--------------------------------------------------------------+
| intent_classification | True | If 'True' intent classification is trained and intents are |
| | | predicted. |
+---------------------------------+------------------+--------------------------------------------------------------+
| entity_recognition | True | If 'True' entity recognition is trained and entities are |
| | | extracted. |
+---------------------------------+------------------+--------------------------------------------------------------+
| use_masked_language_model | False | If 'True' random tokens of the input message will be masked |
| | | and the model has to predict those tokens. It acts like a |
| | | regularizer and should help to learn a better contextual |
| | | representation of the input. |
+---------------------------------+------------------+--------------------------------------------------------------+
| tensorboard_log_directory | None | If you want to use tensorboard to visualize training |
| | | metrics, set this option to a valid output directory. You |
| | | can view the training metrics after training in tensorboard |
| | | via 'tensorboard --logdir <path-to-given-directory>'. |
+---------------------------------+------------------+--------------------------------------------------------------+
| tensorboard_log_level | "epoch" | Define when training metrics for tensorboard should be |
| | | logged. Either after every epoch ('epoch') or for every |
| | | training step ('batch'). |
+---------------------------------+------------------+--------------------------------------------------------------+
| featurizers | [] | List of featurizer names (alias names). Only features |
| | | coming from the listed names are used. If list is empty |
| | | all available features are used. |
+---------------------------------+------------------+--------------------------------------------------------------+
| checkpoint_model | False | Save the best performing model during training. Models are |
| | | stored to the location specified by `--out`. Only the one |
| | | best model will be saved. |
| | | Requires `evaluate_on_number_of_examples > 0` and |
| | | `evaluate_every_number_of_epochs > 0` |
+---------------------------------+------------------+--------------------------------------------------------------+
| split_entities_by_comma | True | Splits a list of extracted entities by comma to treat each |
| | | one of them as a single entity. Can either be `True`/`False` |
| | | globally, or set per entity type, such as: |
| | | ```|
| | | ... |
| | | - name: DIETClassifier |
| | | split_entities_by_comma: |
| | | address: True |
| | | ... |
| | | ... |
| | | ```|
+---------------------------------+------------------+--------------------------------------------------------------+
| constrain_similarities | False | If `True`, applies sigmoid on all similarity terms and adds |
| | | it to the loss function to ensure that similarity values are |
| | | approximately bounded. Used only if `loss_type=cross_entropy`|
+---------------------------------+------------------+--------------------------------------------------------------+
| model_confidence | "softmax" | Affects how model's confidence for each intent |
| | | is computed. It can take two values: |
| | | 1. `softmax` - Similarities between input and intent |
| | | embeddings are post-processed with a softmax function, |
| | | as a result of which confidence for all intents sum up to 1. |
| | | 2. `linear_norm` - Linearly normalized dot product similarity|
| | | between input and intent embeddings. Confidence for each |
| | | intent will be in the range `[0,1]` |
| | | This parameter does not affect the confidence for entity |
| | | prediction. |
+---------------------------------+------------------+--------------------------------------------------------------+
Note
The parameter maximum_negative_similarity is set to a negative value to mimic the original starspace algorithm in the case maximum_negative_similarity = maximum_positive_similarity and use_maximum_negative_similarity = False. See the starspace paper for details.
FallbackClassifier
-
Short
-
Outputs
-
Requires
-
Output-Example
-
Description
-
Configuration
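Entity Extractors
Entity extractors extract entities, such as person names or locations, from the user message.
Note
If you use multiple entity extractors, we advise that each extractor targets an exclusive set of entity types. For example, use Duckling to extract dates and times, and DIETClassifier to extract person names. Otherwise, if multiple extractors target the same entity types, it is very likely that entities will be extracted multiple times.
For example, if you use two or more general purpose extractors like MitieEntityExtractor, DIETClassifier, or CRFEntityExtractor, the entity types in your training data will be found and extracted by all of them. If the slots you are filling with your entity types are of type text, the last extractor in your pipeline will win. If the slot is of type list, all results will be added to the list, including duplicates.
Another, less obvious case of duplicate/overlapping extraction can happen even if extractors focus on different entity types. Imagine a food delivery bot and a user message like I would like to order the Monday special. Hypothetically, if your time extractor's performance isn't very good, it might extract Monday here as a time for the order, and your other extractor might extract Monday special as the meal. If you struggle with overlapping entities of this sort, it might make sense to add additional training data to improve your extractor. If this is not sufficient, you can add a custom component that resolves conflicts in entity extraction according to your own logic.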
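The difference between text and list slots when several extractors report the same entity type can be sketched as below. The function `fill_slot` is a hypothetical stand-in written for this illustration; it is not Rasa's slot-filling code and the entity values are made up.

```python
from typing import Any, Dict, List, Optional, Union


def fill_slot(
    slot_type: str, entities: List[Dict[str, Any]], entity_type: str
) -> Union[Optional[str], List[str]]:
    """Mimic how repeated extractions end up in a text or a list slot."""
    # Entities are assumed to be ordered as the pipeline produced them.
    values = [e["value"] for e in entities if e["entity"] == entity_type]
    if slot_type == "text":
        # A text slot holds one value, so the last extractor wins.
        return values[-1] if values else None
    if slot_type == "list":
        # A list slot collects everything, duplicates included.
        return values
    raise ValueError(f"unsupported slot type: {slot_type}")


extracted = [
    {"entity": "city", "value": "New York", "extractor": "MitieEntityExtractor"},
    {"entity": "city", "value": "New York City", "extractor": "CRFEntityExtractor"},
    {"entity": "city", "value": "New York City", "extractor": "DIETClassifier"},
]
print(fill_slot("text", extracted, "city"))
print(fill_slot("list", extracted, "city"))
```

Seeing the duplicated list entries makes the advice above tangible: giving each extractor an exclusive set of entity types avoids this clean-up work later.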
MitieEntityExtractor
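- Short
MITIE entity extraction (using a MITIE NER trainer)
- Outputs
entities
- Requires
MitieNLP and tokens
- Output-Example
{
    "entities": [{
        "value": "New York City",
        "start": 20,
        "end": 33,
        "confidence": null,
        "entity": "city",
        "extractor": "MitieEntityExtractor"
    }]
}
- Description
MitieEntityExtractor uses the MITIE entity extraction to find entities in a message. The underlying classifier uses a multi-class linear SVM with a sparse linear kernel and custom features. The MITIE component does not provide entity confidence values.
Note
This entity extractor does not rely on any featurizer, as it extracts features on its own.
- Configuration
pipeline:
- name: "MitieEntityExtractor"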
SpacyEntityExtractor
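- Short
spaCy entity extraction
- Outputs
entities
- Requires
SpacyNLP
- Output-Example
{
    "entities": [{
        "value": "New York City",
        "start": 20,
        "end": 33,
        "confidence": null,
        "entity": "city",
        "extractor": "SpacyEntityExtractor"
    }]
}
- Description
Using spaCy, this component predicts the entities of a message. spaCy uses a statistical BILOU transition model. As of now, this component can only use the spaCy built-in entity extraction models and cannot be retrained. This extractor does not provide any confidence scores.
You can test spaCy's entity extraction models in this interactive demo. Note that some spaCy models are highly case-sensitive.
Note
The SpacyEntityExtractor does not provide a confidence level and will always return null.
- Configuration
Configure which dimensions, i.e. entity types, the spaCy component should extract. A full list of available dimensions can be found in the spaCy documentation. Leaving the dimensions option unspecified will extract all available dimensions.
pipeline:
- name: "SpacyEntityExtractor"
  # dimensions to extract
  dimensions: ["PERSON", "LOC", "ORG", "PRODUCT"]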
CRFEntityExtractor
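- Short
Conditional random field (CRF) entity extraction
- Outputs
entities
- Requires
tokens and dense_features (optional)
- Output-Example
{
    "entities": [{
        "value": "New York City",
        "start": 20,
        "end": 33,
        "entity": "city",
        "confidence": 0.874,
        "extractor": "CRFEntityExtractor"
    }]
}
- Description
This component implements a conditional random field (CRF) to do named entity recognition. CRFs can be thought of as an undirected Markov chain where the time steps are words and the states are entity classes. Features of the words (capitalization, POS tagging, etc.) give probabilities to certain entity classes, as do transitions between neighbouring entity tags: the most likely set of tags is then calculated and returned.
If you want to pass custom features, such as pre-trained word embeddings, to CRFEntityExtractor, you can add any dense featurizer to the pipeline before CRFEntityExtractor. CRFEntityExtractor automatically finds the additional dense features and checks if the dense features are an iterable of len(tokens), where each entry is a vector. A warning is shown if the check fails. However, CRFEntityExtractor will continue to train, just without the additional custom features. If dense features are present, CRFEntityExtractor passes them to sklearn_crfsuite and uses them for training.
- Configuration
CRFEntityExtractor has a list of default features to use. However, you can overwrite the default configuration. The following features are available: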
============== ==========================================================================================
Feature Name Description
============== ==========================================================================================
low Checks if the token is lower case.
upper Checks if the token is upper case.
title Checks if the token starts with an uppercase character and all remaining characters are
lowercased.
digit Checks if the token contains just digits.
prefix5 Take the first five characters of the token.
prefix2 Take the first two characters of the token.
suffix5 Take the last five characters of the token.
suffix3 Take the last three characters of the token.
suffix2 Take the last two characters of the token.
suffix1 Take the last character of the token.
pos Take the Part-of-Speech tag of the token (``SpacyTokenizer`` required).
pos2 Take the first two characters of the Part-of-Speech tag of the token
(``SpacyTokenizer`` required).
pattern Take the patterns defined by ``RegexFeaturizer``.
bias Add an additional "bias" feature to the list of features.
============== ==========================================================================================
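As the featurizer moves over the tokens in a user message with a sliding window, you can define features for the previous token, the current token, and the next token in the sliding window. You define the features as a [before, token, after] array.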
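A rough sketch of how a [before, token, after] window turns into per-token features is shown below. The feature functions follow the descriptions in the table above, but the helper `window_features`, the key format, and the BOS/EOS markers are assumptions of this example, not the extractor's actual implementation.

```python
from typing import Any, Callable, Dict, List

# A small subset of the features listed in the table above.
FEATURE_FUNCTIONS: Dict[str, Callable[[str], Any]] = {
    "low": lambda token: token.islower(),
    "upper": lambda token: token.isupper(),
    "title": lambda token: token.istitle(),
    "digit": lambda token: token.isdigit(),
    "suffix2": lambda token: token[-2:],
}


def window_features(
    tokens: List[str], index: int, window: List[List[str]]
) -> Dict[str, Any]:
    """Collect features for the token at `index` and its two neighbours."""
    features: Dict[str, Any] = {}
    # The window entries map to offsets -1 (before), 0 (token), +1 (after).
    for offset, feature_names in zip((-1, 0, 1), window):
        position = index + offset
        if position < 0:
            features["BOS"] = True  # no previous token: beginning of sentence
            continue
        if position >= len(tokens):
            features["EOS"] = True  # no next token: end of sentence
            continue
        for name in feature_names:
            features[f"{offset}:{name}"] = FEATURE_FUNCTIONS[name](tokens[position])
    return features


tokens = ["Fly", "to", "NYC"]
window = [["title"], ["low", "suffix2"], ["upper"]]
print(window_features(tokens, 1, window))
print(window_features(tokens, 0, window))
```

Each token is therefore described not only by its own properties but also by those of its neighbours, which is what lets the CRF pick up on context such as a capitalized word following "to".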
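Additionally, you can set a flag to determine whether to use the BILOU tagging schema or not.
BILOU_flag determines whether to use BILOU tagging or not. The default is True.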
pipeline:
- name: "CRFEntityExtractor"
  # BILOU_flag determines whether to use BILOU tagging or not.
  "BILOU_flag": True
  # features to extract in the sliding window
  "features": [
    ["low", "title", "upper"],
    [
      "bias",
      "low",
      "prefix5",
      "prefix2",
      "suffix5",
      "suffix3",
      "suffix2",
      "upper",
      "title",
      "digit",
      "pattern",
    ],
    ["low", "title", "upper"],
  ]
  # The maximum number of iterations for optimization algorithms.
  "max_iterations": 50
  # weight of the L1 regularization
  "L1_c": 0.1
  # weight of the L2 regularization
  "L2_c": 0.1
  # Name of dense featurizers to use.
  # If list is empty all available dense features are used.
  "featurizers": []
  # Indicates whether a list of extracted entities should be split into individual entities for a given entity type
  "split_entities_by_comma":
      address: False
      email: True
Note
If POS features are used (pos or pos2), you need to have SpacyTokenizer in your pipeline.
Note
If pattern features are used, you need to have RegexFeaturizer in your pipeline.
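For readers unfamiliar with the BILOU schema controlled by BILOU_flag, the sketch below tags a token sequence with B (begin), I (inside), L (last), O (outside), and U (unit, a single-token entity). The helper `bilou_tags` and its token-index interface are assumptions made for illustration; the extractor itself works from character offsets in the training data.

```python
from typing import List


def bilou_tags(tokens: List[str], start: int, end: int, label: str) -> List[str]:
    """Tag tokens[start:end] as one entity using the BILOU schema."""
    tags = ["O"] * len(tokens)  # everything outside the entity is "O"
    length = end - start
    if length == 1:
        tags[start] = f"U-{label}"  # single-token entity
    elif length > 1:
        tags[start] = f"B-{label}"  # first token of the entity
        for position in range(start + 1, end - 1):
            tags[position] = f"I-{label}"  # tokens strictly inside
        tags[end - 1] = f"L-{label}"  # last token of the entity
    return tags


sentence = ["I", "moved", "to", "New", "York", "City"]
print(bilou_tags(sentence, 3, 6, "city"))
print(bilou_tags(["I", "live", "in", "NYC"], 3, 4, "city"))
```

Compared with a plain per-token label, the explicit begin and last markers give the CRF more information about where multi-token entities start and stop.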
DucklingEntityExtractor
- Short
Duckling可以让你用多种语言提取诸如日期、金额、距离等常见实体。 - Outputs
entities
- Requires
无 - Output-Example
{
"entities": [{
"end": 53,
"entity": "time",
"start": 48,
"value": "2017-04-10T00:00:00.000+02:00",
"confidence": 1.0,
"extractor": "DucklingEntityExtractor"
}]
}
- Description
要使用此组件,您需要运行duckling服务。最简单的方法是使用docker run -p 8000:8000 rasa/duckling
来启动docker容器。
或者,您可以直接在计算机上安装duckling并启动服务。
Duckling允许识别日期、数字、距离和其他结构化实体,并对它们进行规范化。请注意,duckling试图提取尽可能多的实体类型而不提供排名。例如,如果同时指定数字
和时间
作为duckling组件的维度,则该组件将从文本我将在10分钟内到达
中提取两个实体:10
作为数字
,10分钟
作为时间
。在这种情况下,应用程序必须确定哪种实体类型是正确的。提取器将始终返回1.0作为置信度,因为它是基于规则的系统。
支持的语言列表可以在Duckling GitHub存储库中找到。 - Configuration
配置duckling组件应提取哪些维度,比如实体类型。在duckling文档中可以找到可用尺寸的完整列表。不指定维度选项将提取所有可用的维度。
pipeline:
- name: "DucklingEntityExtractor"
  # url of the running duckling server
  url: "http://localhost:8000"
  # dimensions to extract
  dimensions: ["time", "number", "amount-of-money", "distance"]
  # allows you to configure the locale, by default the language is
  # used
  locale: "de_DE"
  # if not set the default timezone of Duckling is going to be used
  # needed to calculate dates from relative expressions like "tomorrow"
  timezone: "Europe/Berlin"
  # Timeout for receiving response from http url of the running duckling server
  # if not set the default timeout of duckling http url is set to 3 seconds.
  timeout: 3
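Because duckling returns overlapping entities without a ranking, the application has to pick one. The sketch below shows one possible policy, keeping the longest span among overlapping candidates. The helper names make_entity and keep_longest_spans are hypothetical, the entity dictionaries only imitate the output format shown above, and the time value is a placeholder rather than real duckling output.

```python
from typing import Any, Dict, List


def make_entity(text: str, surface: str, dimension: str, value: Any) -> Dict[str, Any]:
    """Build an entity dict in the shape of the Output-Example (illustrative only)."""
    start = text.index(surface)
    return {
        "start": start,
        "end": start + len(surface),
        "entity": dimension,
        "value": value,
        "confidence": 1.0,
        "extractor": "DucklingEntityExtractor",
    }


def keep_longest_spans(entities: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Among overlapping entities keep only the one covering the most characters."""
    kept: List[Dict[str, Any]] = []
    by_length = sorted(entities, key=lambda e: e["end"] - e["start"], reverse=True)
    for candidate in by_length:
        overlaps = any(
            candidate["start"] < other["end"] and other["start"] < candidate["end"]
            for other in kept
        )
        if not overlaps:
            kept.append(candidate)
    return sorted(kept, key=lambda e: e["start"])


text = "I will be there in 10 minutes"
entities = [
    make_entity(text, "10", "number", 10),
    make_entity(text, "10 minutes", "time", "<resolved timestamp>"),  # placeholder value
]
resolved = keep_longest_spans(entities)
```

Preferring the longest span is only one heuristic; an application that expects plain numbers in a given dialogue step might instead filter by entity type.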
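DIETClassifier
- Short
Dual Intent Entity Transformer (DIET) used for intent classification and entity extraction
- Description
You can find a detailed description of the DIETClassifier under the section Combined Intent Classifiers and Entity Extractors.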
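RegexEntityExtractor
- Short
Extracts entities using the lookup tables and/or regexes defined in the training data
- Outputs
entities
- Requires
Nothing
- Description
This component extracts entities using the lookup tables and regexes defined in the training data. The component checks whether the user message contains an entry of one of the lookup tables or matches one of the regexes. If a match is found, the value is extracted as an entity.
This component only uses those regex features whose name is equal to one of the entities defined in the training data. Make sure to annotate at least one example per entity.
Note
When you use this extractor in combination with MitieEntityExtractor, CRFEntityExtractor, or DIETClassifier, it can lead to multiple extraction of entities. This is especially likely if many training sentences have entity annotations for the entity types for which you have also defined regexes. See the information at the beginning of the entity extractors section for more on multiple extraction.
If you need both this RegexEntityExtractor and one of the statistical extractors mentioned above, we recommend that you consider one of the following two options.
Option 1 is advisable when you have exclusive entity types for each type of extractor. To ensure that the extractors don't interfere with one another, annotate only one example sentence for each regex/lookup entity type, but not more.
Option 2 is useful when you want to use regex matches as an additional signal for your statistical extractor, but you don't have separate entity types. In this case you will need to
1) add the RegexFeaturizer before the extractors in your pipeline,
2) annotate all your entity examples in the training data, and
3) remove the RegexEntityExtractor from your pipeline.
This way, your statistical extractors receive an additional signal about the presence of regex matches and can statistically determine when to rely on these matches and when not to.
- Configuration
Make the entity extractor case sensitive by adding the case_sensitive: True option; the default is case_sensitive: False.
To correctly process languages such as Chinese that don't use whitespace for word separation, you need to add the use_word_boundaries: False option; the default is use_word_boundaries: True.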
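The effect of the case_sensitive and use_word_boundaries options can be illustrated with Python's re module. This is a sketch of the matching idea only, under the assumption that lookup entries are combined into a single alternation pattern; the function name extract_lookup_entities is hypothetical and this is not the component's actual code.

```python
import re
from typing import Dict, List, Union


def extract_lookup_entities(
    text: str,
    entity: str,
    entries: List[str],
    case_sensitive: bool = False,
    use_word_boundaries: bool = True,
) -> List[Dict[str, Union[str, int]]]:
    """Find lookup-table entries in text and return them as entity dicts."""
    # Longer entries first so that "New York City" wins over "New York".
    ordered = sorted(entries, key=len, reverse=True)
    alternatives = "|".join(re.escape(entry) for entry in ordered)
    if use_word_boundaries:
        pattern = f"\\b(?:{alternatives})\\b"
    else:
        pattern = f"(?:{alternatives})"
    flags = 0 if case_sensitive else re.IGNORECASE
    return [
        {"entity": entity, "start": m.start(), "end": m.end(), "value": m.group(0)}
        for m in re.finditer(pattern, text, flags)
    ]


english = extract_lookup_entities("I live in berlin.", "city", ["Berlin"])
strict = extract_lookup_entities("I live in berlin.", "city", ["Berlin"], case_sensitive=True)
chinese_bounded = extract_lookup_entities("我想去北京旅游", "city", ["北京"])
chinese_unbounded = extract_lookup_entities(
    "我想去北京旅游", "city", ["北京"], use_word_boundaries=False
)
```

In the Chinese sentence the city name is surrounded by other word characters, so a word-boundary pattern finds nothing, while the unbounded pattern finds the entry.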
pipeline:
- name: RegexEntityExtractor
  # text is processed case-insensitively by default
  case_sensitive: False
  # use lookup tables to extract entities
  use_lookup_tables: True
  # use regexes to extract entities
  use_regexes: True
  # match lookup table entries on word boundaries
  use_word_boundaries: True
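EntitySynonymMapper
- Short
Maps synonymous entity values to the same value.
- Outputs
Modifies existing entities that previous entity extraction components found.
- Requires
An extractor from Entity Extractors
- Description
If the training data contains defined synonyms, this component makes sure that detected entity values are mapped to the same value. For example, if your training data contains the following examples: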
[
{
"text": "I moved to New York City",
"intent": "inform_relocation",
"entities": [{
"value": "nyc",
"start": 11,
"end": 24,
"entity": "city"
}]
},
{
"text": "I got a new flat in NYC.",
"intent": "inform_relocation",
"entities": [{
"value": "nyc",
"start": 20,
"end": 23,
"entity": "city"
}]
}
]
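This component lets you map the entities New York City and NYC to nyc. The entity extraction returns nyc even though the message contains NYC. When this component changes an existing entity, it appends itself to the processor list of this entity.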
- Configuration
pipeline:
- name: "EntitySynonymMapper"
Note
When using the EntitySynonymMapper as part of an NLU pipeline, it needs to be placed below any entity extractors in the configuration file.
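The synonym mapping described above can be sketched in a few lines of Python. The functions collect_synonyms and replace_synonyms are hypothetical names, and lower-casing the lookup key is an assumption made for this illustration; the sketch uses the two training examples from the description.

```python
from typing import Any, Dict, List


def collect_synonyms(examples: List[Dict[str, Any]]) -> Dict[str, str]:
    """Map the annotated surface text of each entity to its declared value."""
    synonyms: Dict[str, str] = {}
    for example in examples:
        for entity in example.get("entities", []):
            surface = example["text"][entity["start"]:entity["end"]]
            if surface != entity["value"]:
                synonyms[surface.lower()] = entity["value"]
    return synonyms


def replace_synonyms(
    entities: List[Dict[str, Any]],
    synonyms: Dict[str, str],
    processor_name: str = "EntitySynonymMapper",
) -> List[Dict[str, Any]]:
    """Rewrite entity values in place and record who changed them."""
    for entity in entities:
        key = str(entity["value"]).lower()
        if key in synonyms and synonyms[key] != entity["value"]:
            entity["value"] = synonyms[key]
            entity.setdefault("processors", []).append(processor_name)
    return entities


training_examples = [
    {
        "text": "I moved to New York City",
        "intent": "inform_relocation",
        "entities": [{"value": "nyc", "start": 11, "end": 24, "entity": "city"}],
    },
    {
        "text": "I got a new flat in NYC.",
        "intent": "inform_relocation",
        "entities": [{"value": "nyc", "start": 20, "end": 23, "entity": "city"}],
    },
]
synonyms = collect_synonyms(training_examples)
extracted = [{"entity": "city", "value": "NYC", "start": 17, "end": 20}]
mapped = replace_synonyms(extracted, synonyms)
```

This also shows why the mapper has to come after the extractors in the pipeline: it only rewrites entities that an earlier component has already found.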
Combined Intent Classifiers and Entity Extractors
DIETClassifier
- Short
Dual Intent Entity Transformer (DIET) used for intent classification and entity extraction
- Outputs
entities, intent, and intent_ranking
- Requires
dense_features and/or sparse_features for the user message and the intent
- Output-Example
{
"intent": {"name": "greet", "confidence": 0.8343},
"intent_ranking": [
{
"confidence": 0.385910906220309,
"name": "goodbye"
},
{
"confidence": 0.28161531595656784,
"name": "restaurant_search"
}
],
"entities": [{
"end": 53,
"entity": "time",
"start": 48,
"value": "2017-04-10T00:00:00.000+02:00",
"confidence": 1.0,
"extractor": "DIETClassifier"
}]
}
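- Description
DIET (Dual Intent and Entity Transformer) is a multi-task architecture for intent classification and entity recognition. The architecture is based on a transformer that is shared for both tasks. A sequence of entity labels is predicted through a Conditional Random Field (CRF) tagging layer on top of the transformer output sequence corresponding to the input sequence of tokens. For the intent labels, the transformer output for the complete utterance and the intent labels are embedded into a single semantic vector space. A dot-product loss is used to maximize the similarity with the target label and minimize the similarity with negative samples.
If you want to learn more about the model, check out the Algorithm Whiteboard series on YouTube, where the model architecture is explained in detail.
Note
If during prediction time a message contains only words unseen during training and no out-of-vocabulary preprocessor was used, an empty intent None is predicted with confidence 0.0. This might happen if you only use the CountVectorsFeaturizer with a word analyzer as featurizer. If you use the char_wb analyzer, you should always get an intent with a confidence value greater than 0.0.
- Configuration
If you want to use the DIETClassifier just for intent classification, set entity_recognition to False. If you want to do only entity recognition, set intent_classification to False. By default the DIETClassifier does both, i.e. entity_recognition and intent_classification are both set to True.
You can define a number of hyperparameters to adapt the model. If you want to adapt your model, start by modifying the following parameters:
- epochs: This parameter sets the number of times the algorithm will see the training data (default: 300). One epoch equals one forward pass and one backward pass of all the training examples. Sometimes the model needs more epochs to learn properly. Sometimes more epochs don't influence the performance. The lower the number of epochs, the faster the model is trained.
- hidden_layers_sizes: This parameter allows you to define the number of feed forward layers and their output dimensions for user messages and intents (default: text: [], label: []). Every entry in the list corresponds to a feed forward layer. For example, if you set text: [256, 128], two feed forward layers are added in front of the transformer. The vectors of the input tokens (coming from the user message) are passed on to those layers. The first layer has an output dimension of 256 and the second layer has an output dimension of 128. If an empty list is used (default behavior), no feed forward layer is added. Make sure to use only positive integer values. Usually, powers of two are used. It is also common practice to use decreasing values in the list: the next value is smaller than or equal to the value before.
- embedding_dimension: This parameter defines the output dimension of the embedding layers used inside the model (default: 20). Multiple embedding layers are used inside the model architecture. For example, the vectors of the complete utterance and the intent are passed on to an embedding layer before they are compared and the loss is calculated.
- number_of_transformer_layers: This parameter sets the number of transformer layers to use (default: 2). The number of transformer layers corresponds to the transformer blocks to use for the model.
- transformer_size: This parameter sets the number of units in the transformer (default: 256). The vectors coming out of the transformer have the given transformer_size.
- weight_sparsity: This parameter defines the fraction of kernel weights that are set to 0 for all feed forward layers in the model (default: 0.8). The value should be between 0 and 1. If you set weight_sparsity to 0, no kernel weights are set to 0 and the layer acts as a standard feed forward layer. You should not set weight_sparsity to 1, as this would result in all kernel weights being 0, i.e. the model would not be able to learn.
- constrain_similarities: When set to True, this parameter applies a sigmoid cross entropy loss over all similarity terms. This helps in keeping similarities between input and negative labels at smaller values, which should help the model generalize better to real-world test sets.
- model_confidence: This parameter lets you configure how confidences are computed during inference. It can take two values:
  - softmax: Confidences are in the range [0, 1] (old behavior and current default). Computed similarities are normalized with the softmax activation function.
  - linear_norm: Confidences are in the range [0, 1]. Computed dot product similarities are normalized with a linear function.
  Please try using linear_norm as the value for model_confidence. This should make it easier to tune fallback thresholds for the FallbackClassifier.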
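To make the difference between the two model_confidence settings more concrete, here is a small sketch in plain Python. The softmax function is standard. The linear normalization shown (clip negative similarities to zero, then divide by the sum) is an assumption for illustration, since the text above only states that a linear function is used; the function names are hypothetical.

```python
import math
from typing import List


def softmax_confidences(similarities: List[float]) -> List[float]:
    """Standard softmax: exponentiate and normalize so the values sum to 1."""
    highest = max(similarities)
    exps = [math.exp(s - highest) for s in similarities]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def linear_norm_confidences(similarities: List[float]) -> List[float]:
    """Assumed linear scheme: clip negatives to 0, then divide by the sum."""
    clipped = [max(s, 0.0) for s in similarities]
    total = sum(clipped)
    if total == 0.0:
        return [0.0 for _ in clipped]
    return [c / total for c in clipped]


similarities = [4.0, 1.0, -2.0]  # made-up dot products for three intents
soft = softmax_confidences(similarities)
linear = linear_norm_confidences(similarities)
```

With softmax, differences between similarities are amplified exponentially, so the top intent tends to receive a confidence close to 1. Under the assumed linear scheme the confidences stay proportional to the similarities, which is one way to see why a fallback threshold could be easier to tune.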
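The parameters above are the ones you should configure to fit your model to your data. However, additional parameters exist that can be adapted.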
+---------------------------------+------------------+--------------------------------------------------------------+
| Parameter | Default Value | Description |
+=================================+==================+==============================================================+
| hidden_layers_sizes | text: [] | Hidden layer sizes for layers before the embedding layers |
| | label: [] | for user messages and labels. The number of hidden layers is |
| | | equal to the length of the corresponding list. |
+---------------------------------+------------------+--------------------------------------------------------------+
| share_hidden_layers | False | Whether to share the hidden layer weights between user |
| | | messages and labels. |
+---------------------------------+------------------+--------------------------------------------------------------+
| transformer_size | 256 | Number of units in transformer. |
+---------------------------------+------------------+--------------------------------------------------------------+
| number_of_transformer_layers | 2 | Number of transformer layers. |
+---------------------------------+------------------+--------------------------------------------------------------+
| number_of_attention_heads | 4 | Number of attention heads in transformer. |
+---------------------------------+------------------+--------------------------------------------------------------+
| use_key_relative_attention | False | If 'True' use key relative embeddings in attention. |
+---------------------------------+------------------+--------------------------------------------------------------+
| use_value_relative_attention | False | If 'True' use value relative embeddings in attention. |
+---------------------------------+------------------+--------------------------------------------------------------+
| max_relative_position | None | Maximum position for relative embeddings. |
+---------------------------------+------------------+--------------------------------------------------------------+
| unidirectional_encoder | False | Use a unidirectional or bidirectional encoder. |
+---------------------------------+------------------+--------------------------------------------------------------+
| batch_size | [64, 256] | Initial and final value for batch sizes. |
| | | Batch size will be linearly increased for each epoch. |
| | | If constant `batch_size` is required, pass an int, e.g. `8`. |
+---------------------------------+------------------+--------------------------------------------------------------+
| batch_strategy | "balanced" | Strategy used when creating batches. |
| | | Can be either 'sequence' or 'balanced'. |
+---------------------------------+------------------+--------------------------------------------------------------+
| epochs | 300 | Number of epochs to train. |
+---------------------------------+------------------+--------------------------------------------------------------+
| random_seed | None | Set random seed to any 'int' to get reproducible results. |
+---------------------------------+------------------+--------------------------------------------------------------+
| learning_rate | 0.001 | Initial learning rate for the optimizer. |
+---------------------------------+------------------+--------------------------------------------------------------+
| embedding_dimension | 20 | Dimension size of embedding vectors. |
+---------------------------------+------------------+--------------------------------------------------------------+
| dense_dimension | text: 128 | Dense dimension for sparse features to use. |
| | label: 20 | |
+---------------------------------+------------------+--------------------------------------------------------------+
| concat_dimension | text: 128 | Concat dimension for sequence and sentence features. |
| | label: 20 | |
+---------------------------------+------------------+--------------------------------------------------------------+
| number_of_negative_examples | 20 | The number of incorrect labels. The algorithm will minimize |
| | | their similarity to the user input during training. |
+---------------------------------+------------------+--------------------------------------------------------------+
| similarity_type | "auto" | Type of similarity measure to use, either 'auto' or 'cosine' |
| | | or 'inner'. |
+---------------------------------+------------------+--------------------------------------------------------------+
| loss_type | "cross_entropy" | The type of the loss function, either 'cross_entropy' |
| | | or 'margin'. |
+---------------------------------+------------------+--------------------------------------------------------------+
| ranking_length | 10 | Number of top intents to normalize scores for. Applicable |
| | | only with loss type 'cross_entropy' and 'softmax' |
| | | confidences. Set to 0 to disable normalization. |
+---------------------------------+------------------+--------------------------------------------------------------+
| maximum_positive_similarity | 0.8 | Indicates how similar the algorithm should try to make |
| | | embedding vectors for correct labels. |
| | | Should be 0.0 < ... < 1.0 for 'cosine' similarity type. |
+---------------------------------+------------------+--------------------------------------------------------------+
| maximum_negative_similarity | -0.4 | Maximum negative similarity for incorrect labels. |
| | | Should be -1.0 < ... < 1.0 for 'cosine' similarity type. |
+---------------------------------+------------------+--------------------------------------------------------------+
| use_maximum_negative_similarity | True | If 'True' the algorithm only minimizes maximum similarity |
| | | over incorrect intent labels, used only if 'loss_type' is |
| | | set to 'margin'. |
+---------------------------------+------------------+--------------------------------------------------------------+
| scale_loss | False | Scale loss inverse proportionally to confidence of correct |
| | | prediction. |
+---------------------------------+------------------+--------------------------------------------------------------+
| regularization_constant | 0.002 | The scale of regularization. |
+---------------------------------+------------------+--------------------------------------------------------------+
| negative_margin_scale | 0.8 | The scale of how important it is to minimize the maximum |
| | | similarity between embeddings of different labels. |
+---------------------------------+------------------+--------------------------------------------------------------+
| weight_sparsity | 0.8 | Sparsity of the weights in dense layers. |
| | | Value should be between 0 and 1. |
+---------------------------------+------------------+--------------------------------------------------------------+
| drop_rate | 0.2 | Dropout rate for encoder. Value should be between 0 and 1. |
| | | The higher the value the higher the regularization effect. |
+---------------------------------+------------------+--------------------------------------------------------------+
| drop_rate_attention | 0.0 | Dropout rate for attention. Value should be between 0 and 1. |
| | | The higher the value the higher the regularization effect. |
+---------------------------------+------------------+--------------------------------------------------------------+
| use_sparse_input_dropout | True | If 'True' apply dropout to sparse input tensors. |
+---------------------------------+------------------+--------------------------------------------------------------+
| use_dense_input_dropout | True | If 'True' apply dropout to dense input tensors. |
+---------------------------------+------------------+--------------------------------------------------------------+
| evaluate_every_number_of_epochs | 20 | How often to calculate validation accuracy. |
| | | Set to '-1' to evaluate just once at the end of training. |
+---------------------------------+------------------+--------------------------------------------------------------+
| evaluate_on_number_of_examples | 0 | How many examples to use for hold out validation set. |
| | | Large values may hurt performance, e.g. model accuracy. |
+---------------------------------+------------------+--------------------------------------------------------------+
| intent_classification | True | If 'True' intent classification is trained and intents are |
| | | predicted. |
+---------------------------------+------------------+--------------------------------------------------------------+
| entity_recognition | True | If 'True' entity recognition is trained and entities are |
| | | extracted. |
+---------------------------------+------------------+--------------------------------------------------------------+
| use_masked_language_model | False | If 'True' random tokens of the input message will be masked |
| | | and the model has to predict those tokens. It acts like a |
| | | regularizer and should help to learn a better contextual |
| | | representation of the input. |
+---------------------------------+------------------+--------------------------------------------------------------+
| tensorboard_log_directory | None | If you want to use tensorboard to visualize training |
| | | metrics, set this option to a valid output directory. You |
| | | can view the training metrics after training in tensorboard |
| | | via 'tensorboard --logdir <path-to-given-directory>'. |
+---------------------------------+------------------+--------------------------------------------------------------+
| tensorboard_log_level | "epoch" | Define when training metrics for tensorboard should be |
| | | logged. Either after every epoch ('epoch') or for every |
| | | training step ('batch'). |
+---------------------------------+------------------+--------------------------------------------------------------+
| featurizers | [] | List of featurizer names (alias names). Only features |
| | | coming from the listed names are used. If list is empty |
| | | all available features are used. |
+---------------------------------+------------------+--------------------------------------------------------------+
| checkpoint_model | False | Save the best performing model during training. Models are |
| | | stored to the location specified by `--out`. Only the one |
| | | best model will be saved. |
| | | Requires `evaluate_on_number_of_examples > 0` and |
| | | `evaluate_every_number_of_epochs > 0` |
+---------------------------------+------------------+--------------------------------------------------------------+
| split_entities_by_comma | True | Splits a list of extracted entities by comma to treat each |
| | | one of them as a single entity. Can either be `True`/`False` |
| | | globally, or set per entity type, such as: |
| | | ```|
| | | ... |
| | | - name: DIETClassifier |
| | | split_entities_by_comma: |
| | | address: True |
| | | ... |
| | | ... |
| | | ```|
+---------------------------------+------------------+--------------------------------------------------------------+
| constrain_similarities | False | If `True`, applies sigmoid on all similarity terms and adds |
| | | it to the loss function to ensure that similarity values are |
| | | approximately bounded. Used only if `loss_type=cross_entropy`|
+---------------------------------+------------------+--------------------------------------------------------------+
| model_confidence | "softmax" | Affects how model's confidence for each intent |
| | | is computed. It can take two values: |
| | | 1. `softmax` - Similarities between input and intent |
| | | embeddings are post-processed with a softmax function, |
| | | as a result of which confidence for all intents sum up to 1. |
| | | 2. `linear_norm` - Linearly normalized dot product similarity|
| | | between input and intent embeddings. Confidence for each |
| | | intent will be in the range `[0,1]` |
| | | This parameter does not affect the confidence for entity |
| | | prediction. |
+---------------------------------+------------------+--------------------------------------------------------------+
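The batch_size row in the table states that the batch size is increased linearly from the initial to the final value over the epochs. A sketch of such a schedule is shown below. The function name batch_size_for_epoch and the exact rounding are assumptions for illustration; epochs are counted from 0.

```python
from typing import List, Union


def batch_size_for_epoch(
    batch_size: Union[int, List[int]], epoch: int, epochs: int
) -> int:
    """Return the batch size for a zero-based epoch index."""
    if isinstance(batch_size, int):
        return batch_size  # constant batch size, e.g. batch_size: 8
    first, last = batch_size
    if epochs <= 1:
        return first
    # Linear interpolation between the first and last value, rounded down.
    return int(first + epoch * (last - first) / (epochs - 1))


schedule = [batch_size_for_epoch([64, 256], epoch, 300) for epoch in range(300)]
```

With the default values the first epoch uses batches of 64 examples and the last epoch batches of 256.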
Note
The parameter maximum_negative_similarity is set to a negative value to mimic the original starspace algorithm in the case maximum_negative_similarity = maximum_positive_similarity and use_maximum_negative_similarity = False. See the starspace paper for details.
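Selectors
Selectors predict a bot response from a set of candidate responses.
ResponseSelector
- Short
Response Selector
- Outputs
A dictionary with the retrieval intent of the response selector as the key and a value containing the predicted responses, the confidence, and the response key under the retrieval intent
- Requires
dense_features and/or sparse_features for user messages and responses
- Output-Example
The parsed output from NLU has a property called response_selector containing the output of each response selector component. Each response selector is identified by its retrieval_intent parameter and stores two properties:
- response: The predicted response key under the corresponding retrieval intent, the prediction's confidence, and the associated responses.
- ranking: A ranking with confidences of the top 10 candidate response keys.
Example result: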
{
    "response_selector": {
        "faq": {
            "response": {
                "id": 1388783286124361986,
                "confidence": 0.7,
                "intent_response_key": "chitchat/ask_weather",
                "responses": [
                    {
                        "text": "It's sunny in Berlin today",
                        "image": "https://i.imgur.com/nGF1K8f.jpg"
                    },
                    {
                        "text": "I think it's about to rain."
                    }
                ],
                "utter_action": "utter_chitchat/ask_weather"
            },
            "ranking": [
                {
                    "id": 1388783286124361986,
                    "confidence": 0.7,
                    "intent_response_key": "chitchat/ask_weather"
                },
                {
                    "id": 1388783286124361986,
                    "confidence": 0.3,
                    "intent_response_key": "chitchat/ask_name"
                }
            ]
        }
    }
}
If the retrieval_intent parameter of a particular response selector is left at its default value, the corresponding response selector is identified as default in the returned output.
{
    "response_selector": {
        "default": {
            "response": {
                "id": 1388783286124361986,
                "confidence": 0.7,
                "intent_response_key": "chitchat/ask_weather",
                "responses": [
                    {
                        "text": "It's sunny in Berlin today",
                        "image": "https://i.imgur.com/nGF1K8f.jpg"
                    },
                    {
                        "text": "I think it's about to rain."
                    }
                ],
                "utter_action": "utter_chitchat/ask_weather"
            },
            "ranking": [
                {
                    "id": 1388783286124361986,
                    "confidence": 0.7,
                    "intent_response_key": "chitchat/ask_weather"
                },
                {
                    "id": 1388783286124361986,
                    "confidence": 0.3,
                    "intent_response_key": "chitchat/ask_name"
                }
            ]
        }
    }
}
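- Description
The Response Selector component can be used to build a response retrieval model that directly predicts a bot response from a set of candidate responses. The dialogue manager uses the prediction of this model to utter the predicted response. It embeds user inputs and response labels into the same space and follows the exact same neural network architecture and optimization as the DIETClassifier.
To use this component, your training data should contain retrieval intents. To define these, check out the documentation on NLU training examples and the documentation on defining response utterances for retrieval intents.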
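- Configuration
The algorithm includes almost all the hyperparameters that the DIETClassifier uses. If you want to adapt your model, start by modifying the following parameters:
- epochs: This parameter sets the number of times the algorithm will see the training data (default: 300). One epoch equals one forward pass and one backward pass of all the training examples. Sometimes the model needs more epochs to learn properly. Sometimes more epochs don't influence the performance. The lower the number of epochs, the faster the model is trained.
- hidden_layers_sizes: This parameter allows you to define the number of feed forward layers and their output dimensions for user messages and intents (default: text: [], label: []). Every entry in the list corresponds to a feed forward layer. For example, if you set text: [256, 128], two feed forward layers are added in front of the transformer. The vectors of the input tokens (coming from the user message) are passed on to those layers. The first layer has an output dimension of 256 and the second layer has an output dimension of 128. If an empty list is used (default behavior), no feed forward layer is added. Make sure to use only positive integer values. Usually, powers of two are used. It is also common practice to use decreasing values in the list: the next value is smaller than or equal to the value before.
- embedding_dimension: This parameter defines the output dimension of the embedding layers used inside the model (default: 20). Multiple embedding layers are used inside the model architecture. For example, the vectors of the complete utterance and the intent are passed on to an embedding layer before they are compared and the loss is calculated.
- number_of_transformer_layers: This parameter sets the number of transformer layers to use (default: 2). The number of transformer layers corresponds to the transformer blocks to use for the model.
- transformer_size: This parameter sets the number of units in the transformer (default: 256). The vectors coming out of the transformer have the given transformer_size.
- weight_sparsity: This parameter defines the fraction of kernel weights that are set to 0 for all feed forward layers in the model (default: 0.8). The value should be between 0 and 1. If you set weight_sparsity to 0, no kernel weights are set to 0 and the layer acts as a standard feed forward layer. You should not set weight_sparsity to 1, as this would result in all kernel weights being 0, i.e. the model would not be able to learn.
- constrain_similarities: When set to True, this parameter applies a sigmoid cross entropy loss over all similarity terms. This helps in keeping similarities between input and negative labels at smaller values, which should help the model generalize better to real-world test sets.
- model_confidence: This parameter lets you configure how confidences are computed during inference. It can take two values:
  - softmax: Confidences are in the range [0, 1] (old behavior and current default). Computed similarities are normalized with the softmax activation function.
  - linear_norm: Confidences are in the range [0, 1]. Computed dot product similarities are normalized with a linear function.
  Please try using linear_norm as the value for model_confidence. This should make it easier to tune fallback thresholds for the FallbackClassifier.
The component can also be configured to train a response selector for a particular retrieval intent. The parameter retrieval_intent sets the name of the retrieval intent for which this response selector model is trained. The default is None, i.e. the model is trained for all retrieval intents.
In its default configuration, the component uses the retrieval intent with the response key (e.g. faq/ask_name) as the label for training. Alternatively, it can be configured to use the text of the response as the training label by switching use_text_as_label to True. In this mode, the component uses the first available response that has a text attribute for training. If none is found, it falls back to using the retrieval intent combined with the response key as the label.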
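Examples and tutorials
Check out the responseselectorbot for an example of how to use the ResponseSelector component in your assistant. You will also find the tutorial on handling FAQs with a ResponseSelector useful.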
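As an illustration of the output format shown in the Output-Example above, the following sketch reads the top prediction of one response selector from a parsed NLU result. The helper name best_response is hypothetical, and parse_result is a hand-written dictionary that mirrors the example output rather than the result of an actual model call.

```python
from typing import Any, Dict, Optional


def best_response(
    parse_result: Dict[str, Any], retrieval_intent: str = "default"
) -> Optional[Dict[str, Any]]:
    """Return key, confidence, and texts of the predicted response, if any."""
    selectors = parse_result.get("response_selector", {})
    selector_output = selectors.get(retrieval_intent)
    if selector_output is None:
        return None  # no selector was trained for this retrieval intent
    response = selector_output["response"]
    texts = [r["text"] for r in response.get("responses", []) if "text" in r]
    return {
        "intent_response_key": response["intent_response_key"],
        "confidence": response["confidence"],
        "texts": texts,
    }


parse_result = {
    "response_selector": {
        "faq": {
            "response": {
                "id": 1388783286124361986,
                "confidence": 0.7,
                "intent_response_key": "chitchat/ask_weather",
                "responses": [
                    {"text": "It's sunny in Berlin today",
                     "image": "https://i.imgur.com/nGF1K8f.jpg"},
                    {"text": "I think it's about to rain."},
                ],
                "utter_action": "utter_chitchat/ask_weather",
            },
            "ranking": [
                {"confidence": 0.7, "intent_response_key": "chitchat/ask_weather"},
                {"confidence": 0.3, "intent_response_key": "chitchat/ask_name"},
            ],
        }
    }
}
top = best_response(parse_result, "faq")
missing = best_response(parse_result)  # looks up the "default" selector
```

The second call returns None because the example output only contains a selector for the faq retrieval intent.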
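Custom Components
You can create a custom component to perform a specific task which NLU doesn't currently offer (for example, sentiment analysis). Below is the specification of the rasa.nlu.components.Component class with the methods you'll need to implement.
You can add a custom component to your pipeline by adding the module path. So if you have a module called sentiment containing a SentimentAnalyzer class:
pipeline:
- name: "sentiment.SentimentAnalyzer"
Also be sure to read the section on the component lifecycle.
To get started, you can use this skeleton, which contains the most important methods you should implement: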
import typing
from typing import Any, Optional, Text, Dict, List, Type

from rasa.nlu.components import Component
from rasa.nlu.config import RasaNLUModelConfig
from rasa.shared.nlu.training_data.training_data import TrainingData
from rasa.shared.nlu.training_data.message import Message

if typing.TYPE_CHECKING:
    from rasa.nlu.model import Metadata


class MyComponent(Component):
    """A new component"""

    # Which components are required by this component.
    # Listed components should appear before the component itself in the pipeline.
    @classmethod
    def required_components(cls) -> List[Type[Component]]:
        """Specify which components need to be present in the pipeline."""
        return []

    # Defines the default configuration parameters of a component
    # these values can be overwritten in the pipeline configuration
    # of the model. The component should choose sensible defaults
    # and should be able to create reasonable results with the defaults.
    defaults = {}

    # Defines what language(s) this component can handle.
    # This attribute is designed for instance method: `can_handle_language`.
    # Default value is None which means it can handle all languages.
    # This is an important feature for backwards compatibility of components.
    supported_language_list = None

    # Defines what language(s) this component can NOT handle.
    # This attribute is designed for instance method: `can_handle_language`.
    # Default value is None which means it can handle all languages.
    # This is an important feature for backwards compatibility of components.
    not_supported_language_list = None

    def __init__(self, component_config: Optional[Dict[Text, Any]] = None) -> None:
        super().__init__(component_config)

    def train(
        self,
        training_data: TrainingData,
        config: Optional[RasaNLUModelConfig] = None,
        **kwargs: Any,
    ) -> None:
        """Train this component.

        This is the components chance to train itself provided
        with the training data. The component can rely on
        any context attribute to be present, that gets created
        by a call to :meth:`components.Component.pipeline_init`
        of ANY component and
        on any context attributes created by a call to
        :meth:`components.Component.train`
        of components previous to this one."""
        pass

    def process(self, message: Message, **kwargs: Any) -> None:
        """Process an incoming message.

        This is the components chance to process an incoming
        message. The component can rely on
        any context attribute to be present, that gets created
        by a call to :meth:`components.Component.pipeline_init`
        of ANY component and
        on any context attributes created by a call to
        :meth:`components.Component.process`
        of components previous to this one."""
        pass

    def persist(self, file_name: Text, model_dir: Text) -> Optional[Dict[Text, Any]]:
        """Persist this component to disk for future loading."""
        pass

    @classmethod
    def load(
        cls,
        meta: Dict[Text, Any],
        model_dir: Text,
        model_metadata: Optional["Metadata"] = None,
        cached_component: Optional["Component"] = None,
        **kwargs: Any,
    ) -> "Component":
        """Load this component from file."""
        if cached_component:
            return cached_component
        else:
            return cls(meta)
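When you define metadata for your intent examples in your training data, your component can access both the intent metadata and the intent example metadata during processing:
# in your component class
    def process(self, message: Message, **kwargs: Any) -> None:
        metadata = message.get("metadata")
        print(metadata.get("intent"))
        print(metadata.get("example"))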
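Custom Tokenizers
If you create a custom tokenizer, you should implement the methods of rasa.nlu.tokenizers.tokenizer.Tokenizer. The train and process methods are already implemented, and you only need to overwrite the tokenize method.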
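The core of a tokenize method is splitting the text into tokens while keeping track of where each token starts. The sketch below shows only that splitting logic in plain Python, without importing Rasa. The function name tokenize_with_offsets is hypothetical, and wrapping each pair in the token objects that the Tokenizer base class expects is left out.

```python
import re
from typing import List, Tuple


def tokenize_with_offsets(text: str) -> List[Tuple[str, int]]:
    """Split on whitespace and return (token text, start offset) pairs."""
    return [(match.group(0), match.start()) for match in re.finditer(r"\S+", text)]


pairs = tokenize_with_offsets("Book a  table")
```

Keeping the start offsets matters because entity extractors report their results as character positions in the original message.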
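Custom Featurizers
If you create a custom featurizer, you can return two different kinds of features: sequence features and sentence features. The sequence features are a matrix of size (number-of-tokens x feature-dimension), i.e. the matrix contains a feature vector for every token in the sequence. The sentence features are represented by a matrix of size (1 x feature-dimension).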
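The two feature shapes can be illustrated with nested Python lists. In this sketch the three per-token features and the choice to average them into the sentence feature are assumptions made for illustration; a real featurizer may compute its sentence features differently. The function name featurize is hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]
FEATURE_DIMENSION = 3


def featurize(tokens: List[str]) -> Tuple[Matrix, Matrix]:
    """Return (sequence features, sentence features) for a list of tokens."""
    # One row per token: token length, is-title-case flag, is-digit flag.
    sequence = [
        [float(len(token)), float(token.istitle()), float(token.isdigit())]
        for token in tokens
    ]
    if sequence:
        # One row for the whole message: the column-wise mean of the token rows.
        sentence = [[sum(column) / len(sequence) for column in zip(*sequence)]]
    else:
        sentence = [[0.0] * FEATURE_DIMENSION]
    return sequence, sentence


sequence_features, sentence_features = featurize(["Book", "2", "tables"])
```

Here sequence_features has three rows (one per token) and sentence_features has a single row, both with three columns.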
References
1. Official documentation