Word2Vec

This article covers Word2Vec and the text-processing pipeline around it: tokenization for English and Chinese, word-form normalization (stemming and lemmatization), and stopword removal. It discusses POS tagging with the NLTK library and the role of term frequency-inverse document frequency (TF-IDF) in sentiment analysis, text similarity, and text classification. It also walks through text-classification case studies: keyword-search relevance, movie-review analysis, and handling Chinese data.


Text-processing pipeline:
preprocess: tokenize + lemma/stemming + stopwords + word_list
+ make features
+ ML
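A minimal sketch of this whole pipeline (assuming NLTK and its punkt/wordnet/stopwords data are available; the sample sentence is made up):

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def preprocess(sent):
    tokens = nltk.word_tokenize(sent.lower())           # tokenize
    lemmatizer = WordNetLemmatizer()
    lemmas = [lemmatizer.lemmatize(t) for t in tokens]  # lemmatize
    stops = set(stopwords.words('english'))
    return [w for w in lemmas if w.isalpha() and w not in stops]  # drop stopwords/punctuation

word_list = preprocess("The cats are chasing mice in the garden")
# word_list then goes into feature building (bag-of-words, TF-IDF, ...) and finally an ML model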

tokenize

English

import nltk
sent = ""
tokens = nltk.word_tokenize(sent)
tokens

Chinese

import jieba
seg_list = jieba.cut("", cut_all=True)   # full mode
seg_list = jieba.cut("", cut_all=False)  # accurate mode (the default)
seg_list = jieba.cut_for_search("")      # search-engine mode
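A quick illustration with the sample sentence from jieba's own README (output joined with "/" for readability):

import jieba

sent = "我来到北京清华大学"
print("/".join(jieba.cut(sent, cut_all=True)))   # full mode: lists every dictionary word it finds
print("/".join(jieba.cut(sent, cut_all=False)))  # accurate mode: one best segmentation
print("/".join(jieba.cut_for_search(sent)))      # search-engine mode: accurate mode, long words re-split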

For messy social-media language, use regular expressions (re)

import re
emoticons_str = r""
regex_str = [emoticons_str, r"",r"",r""]
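The elided patterns above usually cover emoticons, @-mentions, hashtags, URLs and plain words. A sketch of how such patterns are typically combined into a tokenizer (the concrete regexes here are illustrative, not the original ones):

import re

emoticons_str = r"(?:[:=;][oO\-]?[D\)\]\(\]/\\OpP])"   # e.g. :-) ;D :(
regex_str = [
    emoticons_str,
    r"(?:@[\w_]+)",                    # @-mentions
    r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)",  # hashtags
    r"(?:[\w_]+)",                     # plain words
]
tokens_re = re.compile(r"(" + "|".join(regex_str) + ")", re.IGNORECASE)

def tokenize(s):
    return tokens_re.findall(s)

print(tokenize("RT @user: this #NLP tutorial is great :-)"))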

Word-form normalization

Stemming: chop off affixes and keep only the stem (root) of the word.
Lemmatization: map all inflected forms of a word back to one base form (the lemma).

from nltk.stem.porter import PorterStemmer
porter_stemmer = PorterStemmer()
porter_stemmer.stem("")

from nltk.stem.lancaster import LancasterStemmer
from nltk.stem import SnowballStemmer
from nltk.stem.porter import PorterStemmer
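A quick comparison of the three stemmers on one sample word (the word is illustrative):

from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem import SnowballStemmer

word = "maximum"
print(PorterStemmer().stem(word))             # 'maximum'
print(LancasterStemmer().stem(word))          # 'maxim' (Lancaster is the most aggressive)
print(SnowballStemmer('english').stem(word))  # 'maximum'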

Lemmatization with NLTK

Normalize each word according to its part of speech.

from nltk.stem import WordNetLemmatizer

POS tagging with NLTK

Get the concrete part of speech of each token.
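A sketch of how the two are usually combined: tag each token with nltk.pos_tag, map the Penn Treebank tag to a WordNet POS, and pass it to the lemmatizer (the mapping helper below is my own, not from the original text):

import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def to_wordnet_pos(treebank_tag):
    # crude mapping from Penn Treebank tags to WordNet POS constants
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

lemmatizer = WordNetLemmatizer()
tokens = nltk.word_tokenize("the dogs are running faster")
for word, tag in nltk.pos_tag(tokens):
    print(word, lemmatizer.lemmatize(word, pos=to_wordnet_pos(tag)))
# with the verb tag, 'running' -> 'run'; without a POS it would stay 'running'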

stopwords

Pronouns and other function words carry little content and can introduce ambiguity, so they are usually dropped.
A ready-made stopword list: www.ranks.nl/stopwords

from nltk.corpus import stopwords
# tokenize first to get word_list
# then filter out the stopwords
filtered_words = [word for word in word_list if word not in stopwords.words('english')]

Application: sentiment analysis

Simple approach: use a ready-made English word-score list such as AFINN-111.

sentiment_dictionary = {}
for line in open(''):
	word, score = line.split('\t')
	sentiment_dictionary[word] = int(score)
total_score = sum(sentiment_dictionary.get(word, 0) for word in words)

Sentiment analysis with ML on top
A Naive Bayes classifier can be used.

from nltk.classify import NaiveBayesClassifier
s1 = ""
s2 = ""
s3 = ""
s4 = ""
def preprocess(s):
	return {word: True for word in s.lower().split()}
# the training set needs examples of both classes
training_data = [[preprocess(s1), 'pos'],
				 [preprocess(s2), 'pos'],
				 [preprocess(s3), 'neg'],
				 [preprocess(s4), 'neg'],
				]
model = NaiveBayesClassifier.train(training_data)
print(model.classify(preprocess('')))

Application: text similarity

Represent text features by word frequencies.

import nltk
from nltk import FreqDist

corpus = ""
tokens = nltk.word_tokenize(corpus)
print(tokens)

fdist = FreqDist(tokens) # a dict-like object mapping each word to how many times it occurs

print(fdist[""])

standard_freq_vector = fdist.most_common(50) # the 50 most common words and their counts; the counts only serve to pick this standard vocabulary -- each new sentence is later mapped onto a vector of this fixed length, where the value at a word's position counts its occurrences (a sparse representation)
size = len(standard_freq_vector) # the 50 most common words with their frequencies
print(standard_freq_vector)
# record the position of every word, ordered by frequency
def position_lookup(v): # returns a dict mapping each word to its position
	res = {}
	counter = 0
	for word in v:
		res[word[0]] = counter
		counter += 1
	return res
# record the standard word positions: a dict of the common words and their positions
standard_position_dict = position_lookup(standard_freq_vector)
print(standard_position_dict) # the position lookup table
# a new input
sentence = ""
freq_vector = [0] * size # build a vector of the same length
tokens = nltk.word_tokenize(sentence)
for word in tokens:
	try:
		freq_vector[standard_position_dict[word]] += 1 # look up the word's position and add 1 there to record one occurrence
	except KeyError: # words outside the standard vocabulary are ignored
		continue
print(freq_vector)
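Once two sentences have been mapped to vectors of the same length, their similarity is usually measured with cosine similarity; a minimal sketch (the two vectors below are made-up examples):

import math

def cosine_similarity(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

# hypothetical frequency vectors of two sentences over the same 50-word vocabulary
vec_a = [1, 0, 2, 0, 1] + [0] * 45
vec_b = [1, 1, 1, 0, 0] + [0] * 45
print(cosine_similarity(vec_a, vec_b))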

Application: text classification

TF-IDF

TF: Term Frequency, how often a term occurs in a document
TF(t) = (number of times t appears in the document) / (total number of terms in the document)
IDF: Inverse Document Frequency, how distinctive (important) a term is
IDF(t) = log(total number of documents / number of documents containing t)
TF-IDF = TF * IDF
Example: a document has 100 words and "baby" appears 3 times,
so TF(baby) = 3/100 = 0.03.
If there are 10M documents and "baby" appears in 1,000 of them,
IDF(baby) = log10(10,000,000 / 1,000) = 4 (a base-10 log is used in this example),
so TF-IDF(baby) = 0.03 * 4 = 0.12.
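The same arithmetic in code, using the numbers from the example above:

import math

tf = 3 / 100                          # "baby" appears 3 times in a 100-word document
idf = math.log10(10_000_000 / 1_000)  # 10M documents, 1,000 of them contain "baby"
print(tf * idf)                       # 0.12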

TF-IDF with NLTK

from nltk.text import TextCollection

corpus = TextCollection(["", "", ""]) # this class handles the tokenizing, counting and computation automatically

print(corpus.tf_idf("", "")) # call it directly with a term and the text it appears in
# to get a standardized vector (the same length for every sentence)
new_sentence = ""
for word in standard_vocab: # loop over the standard vocabulary and compute each word's tf-idf in the new sentence; every new sentence then yields a vector of the same length
	print(corpus.tf_idf(word, new_sentence))

Once every new sentence has been turned into a vector of the same length this way, machine learning can begin.

Case study: keyword search

Search on an online shopping site: score how relevant each search result is to the query.
train contains product names, various search terms, and a human relevance score for each (product, search term) pair.
test provides the search terms and products; the task is to predict the relevance for each pair and submit the result.

#Keyword search
#Kaggle competition: https://www.kaggle.com/c/home-depot-product-search-relevance

#Since the slides already show NLTK's usage for every NLP step in full, I won't repeat it here.

#This tutorial deliberately uses some different libraries, so you get a feel for the strengths and weaknesses of the various Python NLP libraries.

#Step 1: imports
#all the libraries we will need

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor # random forest (plus bagging)
from nltk.stem.snowball import SnowballStemmer # preprocessing
#read in the training / test sets

df_train = pd.read_csv('../input/train.csv', encoding="ISO-8859-1") # mind the encoding
df_test = pd.read_csv('../input/test.csv', encoding="ISO-8859-1")
#there is one more useful file: the product descriptions

df_desc = pd.read_csv('../input/product_descriptions.csv')
#take a look at what the data looks like

df_train.head()
#id	product_uid	product_title	search_term	relevance
#0	2	100001	Simpson Strong-Tie 12-Gauge Angle	angle bracket	3.00
#1	3	100001	Simpson Strong-Tie 12-Gauge Angle	l bracket	2.50
#2	9	100002	BEHR Premium Textured DeckOver 1-gal. #SC-141 ...	deck over	3.00
#3	16	100005	Delta Vero 1-Handle Shower Only Faucet Trim Ki...	rain shower head	2.33
#4	17	100005	Delta Vero 1-Handle Shower Only Faucet Trim Ki...	shower only faucet	2.67
df_desc.head() # the product-description lookup table
#product_uid	product_description
#0	100001	Not only do angles make joints stronger, they ...
#1	100002	BEHR Premium Textured DECKOVER is an innovativ...
#2	100003	Classic architecture meets contemporary design...
#3	100004	The Grape Solar 265-Watt Polycrystalline PV So...
#4	100005	Update your bathroom with the Delta Vero Singl...
#It doesn't look like heavy processing is needed, so we simply concatenate the test and training sets to do the further text preprocessing in one place

df_all = pd.concat((df_train, df_test), axis=0, ignore_index=True) # stack the two dataframes vertically
df_all.head()
#id	product_title	product_uid	relevance	search_term
#0	2	Simpson Strong-Tie 12-Gauge Angle	100001	3.00	angle bracket
#1	3	Simpson Strong-Tie 12-Gauge Angle	100001	2.50	l bracket
#2	9	BEHR Premium Textured DeckOver 1-gal. #SC-141 ...	100002	3.00	deck over
#3	16	Delta Vero 1-Handle Shower Only Faucet Trim Ki...	100005	2.33	rain shower head
#4	17	Delta Vero 1-Handle Shower Only Faucet Trim Ki...	100005	2.67	shower only faucet#
#After concatenating we get:

df_all.shape
#(240760, 5)
#The product description is also extremely useful information, so we pull it in:

df_all = pd.merge(df_all, df_desc, how='left', on='product_uid') # merge in the descriptions
df_all.head()
#id	product_title	product_uid	relevance	search_term	product_description
#0	2	Simpson Strong-Tie 12-Gauge Angle	100001	3.00	angle bracket	Not only do angles make joints stronger, they #...
#1	3	Simpson Strong-Tie 12-Gauge Angle	100001	2.50	l bracket	Not only do angles make joints stronger, they ...
#2	9	BEHR Premium Textured DeckOver 1-gal. #SC-141 ...	100002	3.00	deck over	BEHR Premium Textured DECKOVER is an innovativ...
#3	16	Delta Vero 1-Handle Shower Only Faucet Trim Ki...	100005	2.33	rain shower head	Update your bathroom with the Delta Vero Singl...
#4	17	Delta Vero 1-Handle Shower Only Faucet Trim Ki...	100005	2.67	shower only faucet	Update your bathroom with the Delta Vero Singl...
#Now we have one big table with all the data

#Step 2: text preprocessing
#The preprocessing we need here is fairly simple, because what mainly matters is whether the keywords are contained in the text.

#So we normalize the text so that every term has only one written form in our dataset.

#We use a simple stemmer as an example:

#(Feel free to use any preprocessing you find sensible: removing stopwords, spelling correction, removing digits, removing emoji, and so on)

stemmer = SnowballStemmer('english') # the Snowball stemmer for English

def str_stemmer(s): # lowercase, split into words, stem each word, then join back into a string
    return " ".join([stemmer.stem(word) for word in s.lower().split()])
#To measure how effective a keyword is, we can naively count how many times it appears (here we simply count how often the words of str1 occur in str2; a more advanced option would be tf-idf)

def str_common_word(str1, str2):
    return sum(int(str2.find(word)>=0) for word in str1.split())
#Next, run this over every column to clean all of the text

df_all['search_term'] = df_all['search_term'].map(lambda x:str_stemmer(x)) # lambda: run str_stemmer on every cell of this column
df_all['product_title'] = df_all['product_title'].map(lambda x:str_stemmer(x))
df_all['product_description'] = df_all['product_description'].map(lambda x:str_stemmer(x))
#Step 3: hand-crafted text features
#This is usually a brainstorming step: add whatever you can think of.

#Of course, more features is not automatically better; they should at least be plausible.

#length of the query:
df_all['len_of_query'] = df_all['search_term'].map(lambda x:len(x.split())).astype(np.int64) 
#how many query words appear in the title
df_all['commons_in_title'] = df_all.apply(lambda x:str_common_word(x['search_term'],x['product_title']), axis=1)
#how many query words appear in the description
df_all['commons_in_desc'] = df_all.apply(lambda x:str_common_word(x['search_term'],x['product_description']), axis=1)
#...and so on: invent whatever features can be expressed as numbers and throw them in (one extra illustration follows below)
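# As one extra illustration (not part of the original notebook), a hypothetical ratio feature:
# the share of query words that also appear in the title.
def common_word_ratio(str1, str2):
    words = str1.split()
    if not words:
        return 0.0
    return str_common_word(str1, str2) / len(words)

df_all['ratio_in_title'] = df_all.apply(lambda x: common_word_ratio(x['search_term'], x['product_title']), axis=1)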

#Once that's done, drop the columns the ML model cannot digest (drop the text, keep the numbers)

df_all = df_all.drop(['search_term','product_title','product_description'],axis=1)
#Step 4: rebuild the training / test sets
#As Shu Qi put it: you have to put back on, one by one, the clothes you took off earlier

#Data processing is the same: after a round of preprocessing, we restore the data to its original shape

#split the training and test sets back apart (they were preprocessed together)
num_train = df_train.shape[0] # note: slice by the original training-set size; df_test's index restarts at 0, so df_all.loc[df_test.index] would wrongly return training rows
df_train = df_all.iloc[:num_train]
df_test = df_all.iloc[num_train:]
#record the ids of the test set
#so the submission rows can be matched up later

test_ids = df_test['id']
#separate out y_train; the relevance score is our y
y_train = df_train['relevance'].values
#drop the label from the feature matrix
#otherwise it would be cheating

X_train = df_train.drop(['id','relevance'],axis=1).values # remove y; axis=1 means dropping columns
X_test = df_test.drop(['id','relevance'],axis=1).values
#Step 5: build a model
#We use the simplest possible model: a random forest

from sklearn.ensemble import RandomForestRegressor 
from sklearn.model_selection import cross_val_score # 5-fold cross-validation: split the training set into 5 folds, train on 4 and test on the remaining 1, then average the test scores
#use the CV score for a fair, objective comparison while trying different max_depth values

params = [1,3,5,6,7,8,9,10] # a hand-rolled grid search
test_scores = []
for param in params:
    clf = RandomForestRegressor(n_estimators=30, max_depth=param) # the max depth is the hyperparameter being tuned
    test_score = np.sqrt(-cross_val_score(clf, X_train, y_train, cv=5, scoring='neg_mean_squared_error')) 
    test_scores.append(np.mean(test_score)) # average over the 5 folds
#plot it and take a look:

import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(params, test_scores)
plt.title("Param vs CV Error");

#the optimum is reached at a depth of roughly 6~7

#Step 6: submit the results
#build the model with the best hyperparameter we found and run it on the test set

rf = RandomForestRegressor(n_estimators=30, max_depth=6)
rf.fit(X_train, y_train)
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=6,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=30, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)
y_pred = rf.predict(X_test)
#put the predictions into a pandas DataFrame and write them out as a CSV for submission:

pd.DataFrame({"id": test_ids, "relevance": y_pred}).to_csv('submission.csv', index=False)
#Summary:
#Although this tutorial only uses the simplest methods, the basic framework is complete.

#Parts you can try to modify / tune / upgrade:

#Text preprocessing: there are many different ways to make the text cleaner

#Hand-crafted features: think up more feature representations (number of whole-query matches, overlap ratios, and so on)

#A better regression model: use the Ensemble methods from the earlier lessons to push the regressor further (a sketch follows below)
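As one possible upgrade (a sketch, not part of the original notebook): wrap the tuned random forest in the BaggingRegressor that was imported at the top but never used.

from sklearn.ensemble import BaggingRegressor, RandomForestRegressor

# bag 20 copies of the tuned forest, each trained on a bootstrap sample of the training data
bagged_rf = BaggingRegressor(
    RandomForestRegressor(n_estimators=30, max_depth=6),
    n_estimators=20,
)
bagged_rf.fit(X_train, y_train)
y_pred = bagged_rf.predict(X_test)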

Case study: Bag of Words Meets Bags of Popcorn

https://www.kaggle.com/ymanojkumar023/kumarmanoj-bag-of-words-meets-bags-of-popcorn

#import the required libraries
import os
import re
import numpy as np
import pandas as pd

from bs4 import BeautifulSoup

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier # random forest
from sklearn.metrics import confusion_matrix # confusion matrix
import nltk
#nltk.download()
from nltk.corpus import stopwords

# read the training data with pandas
datafile = os.path.join('..', 'data', 'labeledTrainData.tsv')
df = pd.read_csv(datafile, sep='\t', escapechar='\\')
print('Number of reviews: {}'.format(len(df)))
df.head()
#Number of reviews: 25000
#id	sentiment	review
#0	5814_8	1	With all this stuff going down at the moment w...
#1	2381_9	1	"The Classic War of the Worlds" by Timothy Hin...
#2	7759_3	0	The film starts with a manager (Nicholas Bell)...
#3	3630_4	0	It must be assumed that those who praised this...
#4	9495_8	1	Superbly trashy and wondrously unpretentious 8...
#Preprocess the review data; roughly these steps:
#strip the HTML tags
#remove punctuation
#split into words/tokens
#remove stopwords
#re-join into a new sentence
def display(text, title):
    print(title)
    print("\n---------- separator -------------\n")
    print(text) 
raw_example = df['review'][1]
display(raw_example, 'original data')
'''
original data

---------- separator -------------

"The Classic War of the Worlds" by Timothy Hines is a very entertaining film that obviously goes to great effort and lengths to faithfully recreate H. G. Wells' classic book. Mr. Hines succeeds in doing so. I, and those who watched his film with me, appreciated the fact that it was not the standard, predictable Hollywood fare that comes out every year, e.g. the Spielberg version with Tom Cruise that had only the slightest resemblance to the book. Obviously, everyone looks for different things in a movie. Those who envision themselves as amateur "critics" look only to criticize everything they can. Others rate a movie on more important bases,like being entertained, which is why most people never agree with the "critics". We enjoyed the effort Mr. Hines put into being faithful to H.G. Wells' classic novel, and we found it to be very entertaining. This made it easy to overlook what the "critics" perceive to be its shortcomings.
'''
example = BeautifulSoup(raw_example, 'html.parser').get_text() # parse with BeautifulSoup and strip the HTML
display(example, 'data with HTML tags removed')
'''
data with HTML tags removed

---------- separator -------------

"The Classic War of the Worlds" by Timothy Hines is a very entertaining film that obviously goes to great effort and lengths to faithfully recreate H. G. Wells' classic book. Mr. Hines succeeds in doing so. I, and those who watched his film with me, appreciated the fact that it was not the standard, predictable Hollywood fare that comes out every year, e.g. the Spielberg version with Tom Cruise that had only the slightest resemblance to the book. Obviously, everyone looks for different things in a movie. Those who envision themselves as amateur "critics" look only to criticize everything they can. Others rate a movie on more important bases,like being entertained, which is why most people never agree with the "critics". We enjoyed the effort Mr. Hines put into being faithful to H.G. Wells' classic novel, and we found it to be very entertaining. This made it easy to overlook what the "critics" perceive to be its shortcomings.
'''
example_letters = re.sub(r'[^a-zA-Z]', ' ', example) # use a regex to replace every non-letter character with a space
display(example_letters, 'data with punctuation removed')
'''
data with punctuation removed

---------- separator -------------

 The Classic War of the Worlds  by Timothy Hines is a very entertaining film that obviously goes to great effort and lengths to faithfully recreate H  G  Wells  classic book  Mr  Hines succeeds in doing so  I  and those who watched his film with me  appreciated the fact that it was not the standard  predictable Hollywood fare that comes out every year  e g  the Spielberg version with Tom Cruise that had only the slightest resemblance to the book  Obviously  everyone looks for different things in a movie  Those who envision themselves as amateur  critics  look only to criticize everything they can  Others rate a movie on more important bases like being entertained  which is why most people never agree with the  critics   We enjoyed the effort Mr  Hines put into being faithful to H G  Wells  classic novel  and we found it to be very entertaining  This made it easy to overlook what the  critics  perceive to be its shortcomings 
 '''
words = example_letters.lower().split() # lowercase and split on whitespace into a list of words
display(words, 'pure word list')
'''
pure word list

---------- separator -------------

[u'the', u'classic', u'war', u'of', u'the', u'worlds', u'by', u'timothy', u'hines', u'is', u'a', u'very', u'entertaining', u'film', u'that', u'obviously', u'goes', u'to', u'great', u'effort', u'and', u'lengths', u'to', u'faithfully', u'recreate', u'h', u'g', u'wells', u'classic', u'book', u'mr', u'hines', u'succeeds', u'in', u'doing', u'so', u'i', u'and', u'those', u'who', u'watched', u'his', u'film', u'with', u'me', u'appreciated', u'the', u'fact', u'that', u'it', u'was', u'not', u'the', u'standard', u'predictable', u'hollywood', u'fare', u'that', u'comes', u'out', u'every', u'year', u'e', u'g', u'the', u'spielberg', u'version', u'with', u'tom', u'cruise', u'that', u'had', u'only', u'the', u'slightest', u'resemblance', u'to', u'the', u'book', u'obviously', u'everyone', u'looks', u'for', u'different', u'things', u'in', u'a', u'movie', u'those', u'who', u'envision', u'themselves', u'as', u'amateur', u'critics', u'look', u'only', u'to', u'criticize', u'everything', u'they', u'can', u'others', u'rate', u'a', u'movie', u'on', u'more', u'important', u'bases', u'like', u'being', u'entertained', u'which', u'is', u'why', u'most', u'people', u'never', u'agree', u'with', u'the', u'critics', u'we', u'enjoyed', u'the', u'effort', u'mr', u'hines', u'put', u'into', u'being', u'faithful', u'to', u'h', u'g', u'wells', u'classic', u'novel', u'and', u'we', u'found', u'it', u'to', u'be', u'very', u'entertaining', u'this', u'made', u'it', u'easy', u'to', u'overlook', u'what', u'the', u'critics', u'perceive', u'to', u'be', u'its', u'shortcomings']
'''
#download stopwords and other corpora as needed
#nltk.download() # NLTK's built-in stopword list could be used directly
#words_nostop = [w for w in words if w not in stopwords.words('english')]
stopwords = {}.fromkeys([line.rstrip() for line in open('../stopwords.txt')])
words_nostop = [w for w in words if w not in stopwords] # drop the stopwords
display(words_nostop, 'data with stopwords removed')
'''
data with stopwords removed

---------- separator -------------

[u'classic', u'war', u'worlds', u'timothy', u'hines', u'entertaining', u'film', u'effort', u'lengths', u'faithfully', u'recreate', u'classic', u'book', u'hines', u'succeeds', u'watched', u'film', u'appreciated', u'standard', u'predictable', u'hollywood', u'fare', u'spielberg', u'version', u'tom', u'cruise', u'slightest', u'resemblance', u'book', u'movie', u'envision', u'amateur', u'critics', u'criticize', u'rate', u'movie', u'bases', u'entertained', u'people', u'agree', u'critics', u'enjoyed', u'effort', u'hines', u'faithful', u'classic', u'entertaining', u'easy', u'overlook', u'critics', u'perceive', u'shortcomings']
'''
#eng_stopwords = set(stopwords.words('english'))
eng_stopwords = set(stopwords)
# wrap all of the cleaning steps into one function
def clean_text(text):
    text = BeautifulSoup(text, 'html.parser').get_text()
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    words = text.lower().split()
    words = [w for w in words if w not in eng_stopwords]
    return ' '.join(words)
clean_text(raw_example)
'''
u'classic war worlds timothy hines entertaining film effort lengths faithfully recreate classic book hines succeeds watched film appreciated standard predictable hollywood fare spielberg version tom cruise slightest resemblance book movie envision amateur critics criticize rate movie bases entertained people agree critics enjoyed effort hines faithful classic entertaining easy overlook critics perceive shortcomings'
'''
# add the cleaned text to the dataframe
df['clean_review'] = df.review.apply(clean_text) # clean every row
df.head()
'''
id	sentiment	review	clean_review
0	5814_8	1	With all this stuff going down at the moment w...	stuff moment mj ve started listening music wat...
1	2381_9	1	"The Classic War of the Worlds" by Timothy Hin...	classic war worlds timothy hines entertaining ...
2	7759_3	0	The film starts with a manager (Nicholas Bell)...	film starts manager nicholas bell investors ro...
3	3630_4	0	It must be assumed that those who praised this...	assumed praised film filmed opera didn read do...
4	9495_8	1	Superbly trashy and wondrously unpretentious 8...	superbly trashy wondrously unpretentious explo...
'''

#extract bag-of-words features with sklearn's CountVectorizer (as in the earlier code; here the vocabulary is capped at 5000 words)
vectorizer = CountVectorizer(max_features = 5000) 
train_data_features = vectorizer.fit_transform(df.clean_review).toarray()
train_data_features.shape
'''
(25000, 5000)
'''

# train a random forest classifier
forest = RandomForestClassifier(n_estimators = 100)
forest = forest.fit(train_data_features, df.sentiment)
#predict on the training set to get a first look (training-set performance, so it will look far better than it really is)
confusion_matrix(df.sentiment, forest.predict(train_data_features))
'''
array([[12500,     0],
       [    0, 12500]])
'''
       
#delete variables we no longer need, to free memory
del df
del train_data_features

#read the test data and make predictions
datafile = os.path.join('..', 'data', 'testData.tsv')
df = pd.read_csv(datafile, sep='\t', escapechar='\\')
print('Number of reviews: {}'.format(len(df)))
df['clean_review'] = df.review.apply(clean_text)
df.head()
'''
Number of reviews: 25000
id	review	clean_review
0	12311_10	Naturally in a film who's main themes are of m...	naturally film main themes mortality nostalgia...
1	8348_2	This movie is a disaster within a disaster fil...	movie disaster within disaster film full great...
2	5828_4	All in all, this is a movie for kids. We saw i...	movie kids saw tonight child loved one point k...
3	7186_2	Afraid of the Dark left me with the impression...	afraid dark left impression several different ...
4	12128_7	A very accurate depiction of small time mob li...	accurate depiction small time mob life filmed ...
'''
test_data_features = vectorizer.transform(df.clean_review).toarray()
test_data_features.shape
'''
(25000, 5000)
'''
result = forest.predict(test_data_features)
output = pd.DataFrame({'id': df.id, 'sentiment': result})
output.head()
'''
id	sentiment
0	12311_10	1
1	8348_2	0
2	5828_4	1
3	7186_2	1
4	12128_7	1
'''
output.to_csv(os.path.join('..', 'data', 'Bag_of_Words_model.csv'), index=False)
del df
del test_data_features



#train word vectors with word2vec
import os
import re
import numpy as np
import pandas as pd

from bs4 import BeautifulSoup

import nltk.data
#nltk.download()
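The original listing breaks off here. A minimal sketch of how the word2vec training typically continues (assuming gensim is installed; the file path reuses the one above, and the hyperparameters are illustrative, not the original ones):

import os
import pandas as pd
from nltk.tokenize import sent_tokenize
from gensim.models import Word2Vec

# re-read the training reviews (the labels are not needed for word2vec)
df = pd.read_csv(os.path.join('..', 'data', 'labeledTrainData.tsv'), sep='\t', escapechar='\\')

# split each review into sentences, then clean each sentence into a word list
# (clean_text is the cleaning function defined earlier)
def review_to_sentences(review):
    return [clean_text(s).split() for s in sent_tokenize(review) if s.strip()]

sentences = []
for review in df['review']:
    sentences += review_to_sentences(review)

# train the word vectors; the keyword is vector_size in gensim 4.x (size in the older 3.x API)
model = Word2Vec(sentences, vector_size=300, window=10, min_count=40, workers=4)
model.save('imdb_word2vec.model')
print(model.wv.most_similar('movie'))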
余额充值