Introduction to the keras.layers.StringLookup layer

Original article · First published 2022-03-28 17:41:59 · Last modified 2023-01-04 18:22:51

License: CC 4.0 BY-SA

Tags: #tensorflow

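Usage: this layer can return integer index encoding, one-hot encoding, or multi-hot encoding. It can also invert the encoding, mapping indices back to the original tokens.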

Index encoding is the default:

import tensorflow as tf

vocab = ["a", "b", "c", "d"]
data = tf.constant([["a", "c", "d"], ["d", "z", "b"]])
# StringLookup's output_mode defaults to "int", so it returns the vocabulary index of each string
layer = tf.keras.layers.StringLookup(vocabulary=vocab)
layer(data)
# Output:
<tf.Tensor: shape=(2, 3), dtype=int64, numpy=
array([[1, 3, 4],
       [4, 0, 2]], dtype=int64)>



layer.get_vocabulary()
# Output:
['[UNK]', 'a', 'b', 'c', 'd']
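The vocabulary listing above shows why the unknown token "z" was encoded as 0: index 0 is reserved for '[UNK]'. As a minimal sketch of the inverse lookup mentioned at the start, the same vocabulary can be passed to a second layer built with invert=True. The names to_index, to_string and restored_text are illustrative only.

```python
import tensorflow as tf

vocab = ["a", "b", "c", "d"]

# Forward layer: strings -> integer indices (index 0 is reserved for "[UNK]").
to_index = tf.keras.layers.StringLookup(vocabulary=vocab)
# Inverse layer: integer indices -> strings, built from the same vocabulary.
to_string = tf.keras.layers.StringLookup(vocabulary=vocab, invert=True)

data = tf.constant([["a", "c", "d"], ["d", "z", "b"]])
indices = to_index(data)
restored = to_string(indices)

# The inverse layer returns byte strings; decode them for readability.
restored_text = [[s.decode("utf-8") for s in row] for row in restored.numpy()]
print(restored_text)
```

Note that the round trip is lossy for unknown tokens: "z" comes back as "[UNK]", not as the original string.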

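Setting the number of OOV (out-of-vocabulary) indices: StringLookup lets you choose how many OOV indices to reserve via num_oov_indices. Using more OOV indices can make the model somewhat more robust, because unknown tokens are hashed and spread across the reserved OOV positions [0, num_oov_indices) instead of all sharing a single index.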
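A small sketch of this, assuming two OOV indices and no mask token. Which of the two slots a given unknown token lands in depends on the hash, so the example only checks that it falls inside the reserved range.

```python
import tensorflow as tf

vocab = ["a", "b", "c", "d"]
layer = tf.keras.layers.StringLookup(vocabulary=vocab, num_oov_indices=2)

# The two OOV slots come first, so known tokens start at index 2.
vocabulary = layer.get_vocabulary()
known = layer(tf.constant(["a", "b", "c", "d"])).numpy().tolist()

# Unknown tokens are hashed into one of the OOV slots (index 0 or 1).
unknown = layer(tf.constant(["m", "z"])).numpy().tolist()

print(vocabulary)
print(known, unknown)
```

Compared with the default of one OOV index, the known tokens are shifted by one position, so any downstream embedding needs to be sized for the larger vocabulary.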

Note: with output_mode="int", input and output of any shape are supported.

With any other output_mode, however, the output tensor can have rank at most 2.
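To make the shape behavior concrete, here is a hedged sketch that contrasts the default "int" mode on a rank-3 input with "multi_hot" mode on a rank-2 input. The variable names are illustrative only.

```python
import tensorflow as tf

vocab = ["a", "b", "c", "d"]

# "int" mode: a rank-3 input keeps its shape, each string becomes one index.
int_layer = tf.keras.layers.StringLookup(vocabulary=vocab)
rank3 = tf.constant([[["a", "b"], ["c", "z"]]])  # shape (1, 2, 2)
int_out = int_layer(rank3)

# "multi_hot" mode: each row of a rank-2 input becomes one vector whose
# length is the vocabulary size (4 tokens + 1 OOV slot = 5).
multi_layer = tf.keras.layers.StringLookup(vocabulary=vocab, output_mode="multi_hot")
rows = tf.constant([["a", "c", "d", "d"], ["d", "z", "b", "z"]])
multi_out = multi_layer(rows)

print(int_out.shape, multi_out.shape)
```

In multi-hot mode repeated tokens within a row still produce a single 1, so the first row marks "a", "c" and "d" once each.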

With one-hot encoding, the output has rank at most 2, so the input is passed as a 1-D tensor:

vocab = ["a", "b", "c", "d"]
layer = tf.keras.layers.StringLookup(vocabulary=vocab, output_mode='one_hot')
layer(tf.constant(["a", "b", "c", "d", "z"]))
# Output:
  <tf.Tensor: shape=(5, 5), dtype=float32, numpy=