This post walks through two hands-on ways of using ELMo:
I. Getting ELMo word vectors directly through AllenNLP
1. Environment setup
For environment setup, see my earlier post: https://blog.youkuaiyun.com/Findingxu/article/details/91542654
2. Download the pretrained weights and model configuration
Weights:
https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5
Options (model configuration):
https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_options.json
3. Example code:
from allennlp.commands.elmo import ElmoEmbedder
import torch
options_file = "elmo_2x4096_512_2048cnn_2xhighway_options.json"
weight_file = "elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5"
elmo = ElmoEmbedder(options_file, weight_file)
# Each sentence is a list of tokens; batch_to_embeddings converts them
# to character ids internally and runs the model on the batch.
context_tokens = [['I', 'love', 'you', '.'],
                  ['Sorry', ',', 'I', 'don', "'t", 'love', 'you', '.']]
elmo_embedding, elmo_mask = elmo.batch_to_embeddings(context_tokens)
# torch.Size([2, 3, 8, 1024]) -> (batch_size, 3 layers, num_timesteps, 1024)
print(elmo_embedding, elmo_embedding.size())
# The mask marks the positions that hold real tokens (shorter sentences
# are padded): torch.Size([2, 8]) -> (batch_size, num_timesteps)
print(elmo_mask, elmo_mask.size())
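The embedding tensor above stacks three ELMo layers per token (layer 0 is the character-CNN layer, layers 1 and 2 are the two biLSTM layers). Downstream you usually collapse them into one vector per token. A minimal sketch of two common choices, using a random tensor of the same shape as a stand-in for the real `elmo_embedding` (shapes assumed from the output above):

```python
import torch

# Random stand-in for the output of elmo.batch_to_embeddings(...):
# (batch_size=2, 3 ELMo layers, num_timesteps=8, dim=1024)
elmo_embedding = torch.randn(2, 3, 8, 1024)

# Option 1: keep only the top biLSTM layer.
top_layer = elmo_embedding[:, 2, :, :]   # (2, 8, 1024)

# Option 2: an unweighted mean over the three layers (the ELMo paper
# instead learns a softmax-weighted sum; the mean is a simple baseline).
avg_layers = elmo_embedding.mean(dim=1)  # (2, 8, 1024)

print(top_layer.shape, avg_layers.shape)
```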
II. Using ELMo inside AllenNLP
AllenNLP is a wrapper built on top of PyTorch and a very handy deep-learning NLP toolkit. For its design and usage, see my post "AllenNLP之入门解读代码".
To use ELMo inside AllenNLP, just configure word_embeddings in the configuration file (JSON/Jsonnet format):
An example follows. As before, it needs an options_file and a weight_file; if the downloads are slow, fetch the files first and replace the URLs with local paths. (Note that this example uses the smaller 2x1024_128 model, whose output dimension is 256, matching embedding_dim below.)
"model": {
"type": "lstm_classifier",
"word_embeddings": {
"tokens": {
"type": "elmo_token_embedder",
"options_file": "https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x1024_128_2048cnn_1xhighway/elmo_2x1024_128_2048cnn_1xhighway_options.json",
"weight_file": "https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x1024_128_2048cnn_1xhighway/elmo_2x1024_128_2048cnn_1xhighway_weights.hdf5",
"do_layer_norm": false,
"dropout": 0.5
}
},
"encoder": {
"type": "lstm",
"input_size": embedding_dim, # 256
"hidden_size": hidden_dim # 128
}
},
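Inside AllenNLP, elmo_token_embedder collapses the three layers into a single vector per token using the learned scalar mix from the ELMo paper: ELMo_t = γ · Σ_j s_j · h_{t,j}, where s = softmax(w), and both the layer weights w and the scale γ are trained jointly with the task. A minimal pure-Python sketch of that formula, with toy 4-dimensional vectors and hand-set weights (in the real model these are learned, not fixed):

```python
import math

def scalar_mix(layers, w, gamma):
    """Softmax-weighted sum of per-layer vectors for one token.

    layers: list of 3 vectors, one per ELMo layer (h_{t,j})
    w:      3 unnormalized scalar layer weights (learned in practice)
    gamma:  task-specific scalar scale (learned in practice)
    """
    exp_w = [math.exp(x) for x in w]
    z = sum(exp_w)
    s = [e / z for e in exp_w]  # softmax over the 3 layers
    dim = len(layers[0])
    return [gamma * sum(s[j] * layers[j][i] for j in range(3))
            for i in range(dim)]

# Toy per-layer vectors for a single token:
layers = [[1.0, 0.0, 0.0, 0.0],
          [0.0, 1.0, 0.0, 0.0],
          [0.0, 0.0, 1.0, 0.0]]
print(scalar_mix(layers, w=[0.0, 0.0, 0.0], gamma=1.0))
```

With w = [0, 0, 0] the softmax assigns each layer weight 1/3, so the result is simply the (γ-scaled) average of the three layer vectors.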