Preprocessing the Training Set and Generating Word Vectors

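This post shows how to reduce a raw training corpus to a sentence-only format and then generate word vectors with the word2vec tool. A Python example walks through the data processing: reading the JSON training data, extracting the sentence text, and writing it to a new file.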

The raw training corpus has the following format:

{
	"sentText": "But that spasm of irritation by a master intimidator was minor compared with what Bobby Fischer , the erratic former world chess champion , dished out in March at a news conference in Reykjavik , Iceland .",
	"articleId": "/m/vinci8/data1/riedel/projects/relation/kb/nyt1/docstore/nyt-2005-2006.backup/1677367.xml.pb",
	"relationMentions": [{
		"em1Text": "Bobby Fischer",
		"em2Text": "Iceland",
		"label": "/people/person/nationality"
	}, {
		"em1Text": "Iceland",
		"em2Text": "Reykjavik",
		"label": "/location/country/capital"
	}, {
		"em1Text": "Iceland",
		"em2Text": "Reykjavik",
		"label": "/location/location/contains"
	}, {
		"em1Text": "Bobby Fischer",
		"em2Text": "Reykjavik",
		"label": "/people/deceased_person/place_of_death"
	}],
	"entityMentions": [{
		"start": 0,
		"label": "PERSON",
		"text": "Bobby Fischer"
	}, {
		"start": 1,
		"label": "LOCATION",
		"text": "Reykjavik"
	}, {
		"start": 2,
		"label": "LOCATION",
		"text": "Iceland"
	}],
	"sentId": "1"
}

It needs to be reduced to a format that contains only the sentence:

But that spasm of irritation by a master intimidator was minor compared with what Bobby Fischer , the erratic former world chess champion , dished out in March at a news conference in Reykjavik , Iceland .

The code is as follows:

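import io
import json

train = "./train.json"
result = "./trainResult.txt"

# json.loads(line) assumes each line of train.json holds one complete JSON object.
# Both files are opened as UTF-8 and closed automatically by the with statement.
with io.open(train, "r", encoding="utf-8") as f, \
        io.open(result, "w", encoding="utf-8") as fw:
    for line in f:
        data = json.loads(line)
        # Write only the sentence text, one sentence per line.
        fw.write(data["sentText"] + u"\n")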
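To check the extraction logic without touching the real corpus, here is a minimal sketch that runs on in-memory lines. The helper name `extract_sentences`, the sample records, and the choice to skip blank lines are assumptions made for illustration; they are not part of the original script.

```python
import json


def extract_sentences(lines):
    """Yield the sentText field of each JSON line, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # assumption: a blank line carries no record
        yield json.loads(line)["sentText"]


# Hypothetical sample in the same one-object-per-line layout.
sample = [
    '{"sentText": "Bobby Fischer spoke in Reykjavik , Iceland .", "sentId": "1"}',
    "",
    '{"sentText": "A second sentence .", "sentId": "2"}',
]
sentences = list(extract_sentences(sample))
```

Passing an open file object instead of `sample` would process the corpus line by line in the same way.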

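With the input file for word-vector training in place, the next step is to generate a vector for every word. This relies on the word2vec tool, available at: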

https://github.com/dav/word2vec

The core command is the following line from create-text8-vector-data.sh:

time $BIN_DIR/word2vec -train $TEXT_DATA -output $VECTOR_DATA -cbow 1 -size 300 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 0 -iter 15
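If the rest of the pipeline is in Python, the same command can be assembled as an argument list and launched with `subprocess`. This is only a sketch: the function name `build_word2vec_command` and the paths `./bin/word2vec`, `./trainResult.txt`, and `./vectors.txt` are hypothetical placeholders, while the flags are copied unchanged from the command above.

```python
import subprocess


def build_word2vec_command(binary, text_data, vector_data):
    """Return the argument list for the training command shown above."""
    return [
        binary,
        "-train", text_data,
        "-output", vector_data,
        "-cbow", "1", "-size", "300", "-window", "8",
        "-negative", "25", "-hs", "0", "-sample", "1e-4",
        "-threads", "20", "-binary", "0", "-iter", "15",
    ]


def run_training(cmd):
    """Run word2vec; raises CalledProcessError if the tool exits with an error."""
    subprocess.run(cmd, check=True)


# Hypothetical paths; adjust them to wherever word2vec was built.
cmd = build_word2vec_command("./bin/word2vec", "./trainResult.txt", "./vectors.txt")
```

Calling `run_training(cmd)` starts training, provided the compiled binary exists at the given path.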

In this command, -binary 0 writes the word vectors as text; -binary 1 writes them in binary format instead.
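The text output is easy to read back in plain Python: the first line holds the vocabulary size and the vector dimension, and every following line holds a word followed by its space-separated components. The sketch below parses that layout. The function name `load_text_vectors` and the tiny three-dimensional sample file are assumptions for illustration; real output from the command above would have 300 components per word.

```python
import io
import os
import tempfile


def load_text_vectors(path):
    """Parse word2vec text output: a 'vocab_size dim' header, then 'word v1 ... vN' rows."""
    vectors = {}
    with io.open(path, "r", encoding="utf-8") as f:
        vocab_size, dim = (int(x) for x in f.readline().split())
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) != dim + 1:
                continue  # skip blank or malformed rows
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vocab_size, dim, vectors


# Hypothetical miniature file in the same layout (rows end with a trailing space).
sample = u"2 3\nIceland 0.1 0.2 0.3 \nReykjavik -0.4 0.5 0.6 \n"
tmp_path = os.path.join(tempfile.mkdtemp(), "vectors.txt")
with io.open(tmp_path, "w", encoding="utf-8") as f:
    f.write(sample)

vocab_size, dim, vectors = load_text_vectors(tmp_path)
```

A plain dictionary of lists keeps the sketch dependency-free; for a large vocabulary an array library would be a more memory-efficient choice.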
