使用NLTK进行英文分词

最新推荐文章于 2025-01-07 14:40:40 发布

原创

最新推荐文章于 2025-01-07 14:40:40 发布 · 2.4k 阅读

8 ·

CC 4.0 BY-SA版权

使用NLTK进行英文分词

import nltk
import re
english='C:\\Users\\pc\\CapStone\\english.txt'
with open(english,'r',encoding='utf-8') as file:
u=file.read()
str=re.sub('[^\w ]','',u)
print(nltk.word_tokenize(str))
print(nltk.pos_tag(nltk.word_tokenize(str))) #对分完词的结果进行词性标注

#result

['Trump', 'was', 'born', 'and', 'raised', 'in', 'the', 'New', 'York', 'City', 'borough', 'of', 'Queens', 'and', 'received', 'an', 'economics', 'degree', 'from', 'the', 'Wharton', 'School', 'He', 'was', 'appointed', 'president', 'of', 'his', 'familys', 'real', 'estate', 'business', 'in', '1971', 'renamed', 'it', 'The', 'Trump', 'Organization', 'and', 'expanded', 'it', 'from', 'Queens', 'and', 'Brooklyn', 'into', 'Manhattan', 'The', 'company', 'built', 'or', 'renovated', 'skyscrapers', 'hotels', 'casinos', 'and', 'golf', 'courses', 'Trump', 'later', 'started', 'various', 'side', 'ventures', 'including', 'licensing', 'his