Building and Storing a Term-Document Matrix in Python

shandler

Published 2020-09-25 20:38:19

Article link: https://blog.youkuaiyun.com/shandler/article/details/108803816

# De-tokenization: join each document's token list back into one space-separated string
detokenized_cn_doc = [' '.join(tokens) for tokens in tokenized_cn_doc]

# Store the joined strings as a new column (one row per document)
news_cn_df['token_cn_doc'] = detokenized_cn_doc

detokenized_cn_doc is a list of strings, one per document, in the form ['崔宥莉成为了中国女足又一名强劲的对手', '本文为作者原创未经授权不得转载'].
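To make the format concrete, here is a minimal sketch of the de-tokenization step on toy data. The token lists below are hypothetical stand-ins for tokenized_cn_doc (for example, the output of a Chinese word segmenter); they are not the author's data.

```python
# Hypothetical tokenized documents: one list of tokens per document.
toy_tokenized_doc = [
    ["中国", "女足", "强劲", "对手"],
    ["作者", "原创", "未经", "授权"],
]

# Join each token list with single spaces so that a whitespace-based
# vectorizer can later recover the token boundaries.
toy_detokenized_doc = [" ".join(tokens) for tokens in toy_tokenized_doc]

# Splitting on whitespace gives the original tokens back.
round_trip = [doc.split() for doc in toy_detokenized_doc]

print(toy_detokenized_doc)
```

Joining with a space matters for Chinese text: without a separator the token boundaries produced by segmentation would be lost again.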

import xlwt  # writes .xls workbooks
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer

# Compute term frequencies
count_vectorizer = CountVectorizer(min_df
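The listing is cut off at the min_df argument, so the author's actual settings and storage code are not visible. As a rough sketch of what building and storing a term-document matrix can look like, the example below makes several assumptions: the documents are toy data, min_df=1 is a placeholder value, and the matrix is written to CSV with pandas rather than to an .xls file with xlwt.

```python
import os
import tempfile

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical de-tokenized documents (space-separated tokens).
docs = [
    "中国 女足 强劲 对手",
    "作者 原创 未经 授权 不得 转载",
    "中国 女足 比赛",
]

# min_df=1 keeps every term. This value is an assumption; the original
# listing is truncated before its setting.
count_vectorizer = CountVectorizer(min_df=1)

# fit_transform returns a sparse document-term matrix: (n_docs, n_terms).
doc_term = count_vectorizer.fit_transform(docs)

# Recover the terms in column order from the fitted vocabulary
# (vocabulary_ maps each term to its column index).
vocab = count_vectorizer.vocabulary_
terms = sorted(vocab, key=vocab.get)

# Transpose so that rows are terms and columns are documents.
term_doc_df = pd.DataFrame(
    doc_term.T.toarray(),
    index=terms,
    columns=[f"doc_{i}" for i in range(len(docs))],
)

# Store the matrix; utf-8-sig helps spreadsheet programs detect the encoding.
out_path = os.path.join(tempfile.gettempdir(), "term_document_matrix.csv")
term_doc_df.to_csv(out_path, encoding="utf-8-sig")
```

One thing to watch with Chinese text: CountVectorizer's default token_pattern, r"(?u)\b\w\w+\b", only keeps tokens of two or more characters, so single-character words are silently dropped. Passing token_pattern=r"(?u)\b\w+\b" keeps them. For a large corpus the dense toarray() call may not fit in memory, in which case keeping the sparse matrix is the safer choice.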
