Training Set Preprocessing and Word Vector Generation: How to Generate a Pretrained Word Vector File (npz)


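This article shows how to reduce a raw training corpus to a file containing only sentences, and then how to generate word vectors with the word2vec tool. A Python example walks through the data-processing step: reading the JSON training data, extracting the text of each sentence, and writing it to a new file.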

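The raw training corpus has the following format. One record is shown, pretty-printed for readability; the script further down expects each record to occupy a single line of the file: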

{
	"sentText": "But that spasm of irritation by a master intimidator was minor compared with what Bobby Fischer , the erratic former world chess champion , dished out in March at a news conference in Reykjavik , Iceland .",
	"articleId": "/m/vinci8/data1/riedel/projects/relation/kb/nyt1/docstore/nyt-2005-2006.backup/1677367.xml.pb",
	"relationMentions": [{
		"em1Text": "Bobby Fischer",
		"em2Text": "Iceland",
		"label": "/people/person/nationality"
	}, {
		"em1Text": "Iceland",
		"em2Text": "Reykjavik",
		"label": "/location/country/capital"
	}, {
		"em1Text": "Iceland",
		"em2Text": "Reykjavik",
		"label": "/location/location/contains"
	}, {
		"em1Text": "Bobby Fischer",
		"em2Text": "Reykjavik",
		"label": "/people/deceased_person/place_of_death"
	}],
	"entityMentions": [{
		"start": 0,
		"label": "PERSON",
		"text": "Bobby Fischer"
	}, {
		"start": 1,
		"label": "LOCATION",
		"text": "Reykjavik"
	}, {
		"start": 2,
		"label": "LOCATION",
		"text": "Iceland"
	}],
	"sentId": "1"
}

It needs to be reduced to plain sentences, one per line:

But that spasm of irritation by a master intimidator was minor compared with what Bobby Fischer , the erratic former world chess champion , dished out in March at a news conference in Reykjavik , Iceland .

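The code is as follows:

import io
import json

train = './train.json'
result = './trainResult.txt'

# Each line of train.json holds one JSON record; keep only its sentence text.
with io.open(train, 'r', encoding='utf-8') as f, \
        io.open(result, 'w', encoding='utf-8') as fw:
    for line in f:
        data = json.loads(line)
        # One sentence per output line.
        fw.write(data['sentText'])
        fw.write(u'\n')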
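To sanity-check the extraction without the full corpus, the same logic can be wrapped in a small function and run on an in-memory sample. This is only a sketch: the name `extract_sentences` is hypothetical, the sample record is a made-up, shortened one in the same shape as the training data, and it assumes (as the script above does) that each input line holds one complete JSON object.

```python
import json


def extract_sentences(lines):
    """Yield the sentText field of each non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines instead of failing in json.loads
        yield json.loads(line)['sentText']


# Made-up, shortened record in the same shape as the training data,
# followed by a blank line to show that it is skipped.
sample = [
    '{"sentText": "Bobby Fischer was in Reykjavik , Iceland .", "sentId": "1"}',
    '',
]
sentences = list(extract_sentences(sample))
print(sentences)
```

Because the function accepts any iterable of lines, an open file object can be passed in directly in place of the list.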

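With the input file for word vector training generated, the next step is to produce a vector for every word. This relies on the word2vec tool, available at: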

https://github.com/dav/word2vec

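The key command is the following line from the create-text8-vector-data.sh script. To train on the corpus prepared above, $TEXT_DATA should point to the generated sentence file and $VECTOR_DATA to the desired output path: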

time $BIN_DIR/word2vec -train $TEXT_DATA -output $VECTOR_DATA -cbow 1 -size 300 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 0 -iter 15

-binary 0 writes the word vectors in text format; -binary 1 writes them in binary format.
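Since the text format is plain and line-oriented, the resulting vectors can be read back with a few lines of Python. The sketch below assumes the usual word2vec text layout: a header line with the vocabulary size and the vector dimension, followed by one line per word containing the word and its space-separated components. The function name `load_text_vectors` is hypothetical and not part of the word2vec tool.

```python
import io


def load_text_vectors(path):
    """Read word2vec text output into a {word: [float, ...]} dict."""
    vectors = {}
    with io.open(path, 'r', encoding='utf-8') as f:
        # Header line: "<vocabulary size> <vector dimension>"
        vocab_size, dim = (int(x) for x in f.readline().split())
        for line in f:
            # Lines typically end with a trailing space before the newline.
            parts = line.rstrip().split(' ')
            if len(parts) != dim + 1:
                continue  # skip malformed or empty lines
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors
```

If an .npz file is needed, as the title suggests, one option is to convert the word list and the vectors to NumPy arrays and store them with `numpy.savez`; the array names used in that file would depend on whatever code consumes it.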