Once again we use sklearn, so no further explanation is needed:
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
import sys
# Python 2 workaround for UnicodeEncodeError
reload(sys)
sys.setdefaultencoding("utf8")

# get all file names in the "ParentFolder"
def GetFilesInFolder(ParentFolder):
    import os
    filenameList = []
    for filename in os.listdir(ParentFolder):
        print filename
        filenameList.append(filename)
    return filenameList

ParentFolder="wikiData"
filenameList=GetFilesInFolder(ParentFolder)
dataList=[]
for fileName in filenameList:
    f=open(ParentFolder+"/"+fileName,"r")
    fileDatas=f.readlines()
    f.close()
    fileStr=""
    for lineDatas in fileDatas:
        fileStr+=l

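This article shows how to use the sklearn library for text processing, and explains in detail how to extract TF and TF-IDF features from text for use in computing word similarity.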
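
Since the listing above is cut off before the feature-extraction step, here is a minimal sketch of how TF and TF-IDF features could be obtained with the two classes the script imports. It is not the author's original code: the three-sentence corpus `docs` is a hypothetical stand-in for the contents of the "wikiData" files, and it is written for Python 3 rather than the Python 2 used above.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus (assumption): stands in for the text read from "wikiData".
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# TF: raw term counts, one row per document and one column per vocabulary term.
vectorizer = CountVectorizer()
tf = vectorizer.fit_transform(docs)

# TF-IDF: reweight the counts by inverse document frequency.
# With default settings each document row is L2-normalized.
tfidf = TfidfTransformer().fit_transform(tf)

# vocabulary_ maps each term to its column index in both matrices.
vocab = vectorizer.vocabulary_
print(tf.shape, tfidf.shape)
print(tf[0, vocab["the"]])  # count of "the" in the first document
```

Both `tf` and `tfidf` are sparse matrices; calling `.toarray()` on them gives dense arrays that are easier to inspect on a small corpus like this one.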