The raw training corpus has the following format:
{
"sentText": "But that spasm of irritation by a master intimidator was minor compared with what Bobby Fischer , the erratic former world chess champion , dished out in March at a news conference in Reykjavik , Iceland .",
"articleId": "/m/vinci8/data1/riedel/projects/relation/kb/nyt1/docstore/nyt-2005-2006.backup/1677367.xml.pb",
"relationMentions": [{
"em1Text": "Bobby Fischer",
"em2Text": "Iceland",
"label": "/people/person/nationality"
}, {
"em1Text": "Iceland",
"em2Text": "Reykjavik",
"label": "/location/country/capital"
}, {
"em1Text": "Iceland",
"em2Text": "Reykjavik",
"label": "/location/location/contains"
}, {
"em1Text": "Bobby Fischer",
"em2Text": "Reykjavik",
"label": "/people/deceased_person/place_of_death"
}],
"entityMentions": [{
"start": 0,
"label": "PERSON",
"text": "Bobby Fischer"
}, {
"start": 1,
"label": "LOCATION",
"text": "Reykjavik"
}, {
"start": 2,
"label": "LOCATION",
"text": "Iceland"
}],
"sentId": "1"
}
It needs to be reduced to a format that contains only the sentence text:
But that spasm of irritation by a master intimidator was minor compared with what Bobby Fischer , the erratic former world chess champion , dished out in March at a news conference in Reykjavik , Iceland .
The code is as follows (it reads train.json line by line, so it expects one JSON object per line):
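import io
import json

train = './train.json'
result = './trainResult.txt'

# Read one JSON object per line and write out only its sentence text.
with io.open(train, 'r', encoding='utf-8') as f, \
        io.open(result, 'w', encoding='utf-8') as fw:
    for line in f:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        data = json.loads(line)
        fw.write(data['sentText'] + u'\n')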
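With the input file for word-vector training generated, the next step is to produce a word vector for each word. This requires the word2vec tool, which is available at: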
https://github.com/dav/word2vec
The key command is this one from create-text8-vector-data.sh:
time $BIN_DIR/word2vec -train $TEXT_DATA -output $VECTOR_DATA -cbow 1 -size 300 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 0 -iter 15
-binary 0 writes the word vectors as plain text; -binary 1 writes them in binary format.
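
Once training finishes, the text-format output can be read back into Python. The following is a minimal sketch, not part of the original workflow. It assumes the file was produced with -binary 0, so the first line holds the vocabulary size and the vector dimension, and each following line holds a word and its values separated by spaces. The function name load_text_vectors is hypothetical and chosen for illustration.

```python
import io


def load_text_vectors(path):
    """Parse word2vec text output (-binary 0) into a dict of word -> list of floats.

    Assumed layout: a header line "<vocab_size> <dimension>", then one
    line per word of the form "<word> <v1> <v2> ... <vN>".
    """
    vectors = {}
    with io.open(path, 'r', encoding='utf-8') as f:
        header = f.readline().split()
        dim = int(header[1])  # header[0] is the vocabulary size
        for line in f:
            # Lines may end with a trailing space before the newline.
            parts = line.rstrip().split(' ')
            if len(parts) != dim + 1:
                continue  # skip blank or malformed lines
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors, dim
```

Calling load_text_vectors on the file that $VECTOR_DATA points to returns the lookup table and the dimension, which should be 300 for the command above. A plain dict of lists keeps the sketch dependency-free. For a large vocabulary, storing the values in a NumPy array would use less memory.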