Using NLTK

Sentence Tokenize (splitting text into sentences)

1. Using sent_tokenize directly

from sklearn.datasets import fetch_20newsgroups
from nltk.tokenize import sent_tokenize

# Load the 20 Newsgroups training set and take the first document as sample text
news = fetch_20newsgroups(subset='train')
X, y = news.data, news.target
text = X[0]

# Split the document into a list of sentences
sent_tokenize_list = sent_tokenize(text)
print(sent_tokenize_list)
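
Note: if the Punkt tokenizer data has not been installed yet, sent_tokenize raises a LookupError; a one-time download fixes this (this snippet assumes the NLTK downloader can reach the network):

import nltk
nltk.download('punkt')  # fetch the pre-trained Punkt sentence tokenizer data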
2. Using nltk.tokenize.punkt, which ships with many pre-trained tokenizer models.

from sklearn.datasets import fetch_20newsgroups
from bs4 import BeautifulSoup
import nltk

# Load the 20 Newsgroups training set and take the first document as sample text
news = fetch_20newsgroups(subset='train')
X, y = news.data, news.target
news = X[0]
print(news)

# Strip any markup before tokenizing
news_text = BeautifulSoup(news, 'html.parser').get_text()
print(news_text)

# Load the pre-trained Punkt model for English and split the text into sentences
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
raw_sentences = tokenizer.tokenize(news_text)
print(raw_sentences)
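
Because Punkt ships models for several languages, the same pattern works beyond English. A minimal sketch, assuming the German model 'tokenizers/punkt/german.pickle' has been downloaded locally:

import nltk

# Load the pre-trained German Punkt model and split a German text into sentences
german_tokenizer = nltk.data.load('tokenizers/punkt/german.pickle')
print(german_tokenizer.tokenize('Heute ist ein schöner Tag. Die Sonne scheint.'))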

Word Tokenize (splitting sentences into words)

1. Using word_tokenize

from nltk.tokenize import word_tokenize

text = 'The cat is walking in the bedroom.'
# Split the sentence into individual tokens (words and punctuation)
word_tokenize_list = word_tokenize(text)
print(word_tokenize_list)
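
NLTK also offers other word tokenizers with slightly different splitting rules. A small sketch using WordPunctTokenizer, which splits on every punctuation character (the contraction sentence is only an illustrative example; the commented output is the expected result):

from nltk.tokenize import WordPunctTokenizer

# WordPunctTokenizer splits on all punctuation, so contractions come apart
# differently than with word_tokenize
sample = "The cat isn't walking in the bedroom."
print(WordPunctTokenizer().tokenize(sample))
# e.g. ['The', 'cat', 'isn', "'", 't', 'walking', 'in', 'the', 'bedroom', '.']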


Part-Of-Speech Tagging and POS Tagger (tagging each word's part of speech)

import nltk
from nltk.tokenize import word_tokenize

text = 'The cat is walking in the bedroom.'
word_tokenize_list = word_tokenize(text)
print(word_tokenize_list)

# Tag each token with its part of speech (returns a list of (word, tag) tuples)
pos_tag = nltk.pos_tag(word_tokenize_list)
print(pos_tag)
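
Since nltk.pos_tag returns (word, tag) tuples, specific word classes can be picked out with a plain comprehension. A short sketch filtering the nouns from the result above:

# Keep only tokens whose Penn Treebank tag starts with 'NN' (nouns)
nouns = [word for word, tag in pos_tag if tag.startswith('NN')]
print(nouns)  # expected: ['cat', 'bedroom']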

Stemming (extracting word stems)

import nltk

sent1 = 'The cat is walking in the bedroom.'
sent2 = 'A dog was running across the kitchen.'

# Tokenize the first sentence
tokens_1 = nltk.word_tokenize(sent1)
print(tokens_1)

# Reduce each token to its stem with the Porter stemmer
stemmer = nltk.stem.PorterStemmer()
stem_1 = [stemmer.stem(t) for t in tokens_1]
print(stem_1)
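
The second sentence, sent2, can be stemmed the same way to see inflected forms normalized (the commented output is the expected Porter result):

# Stem the second sentence defined above
tokens_2 = nltk.word_tokenize(sent2)
stem_2 = [stemmer.stem(t) for t in tokens_2]
print(stem_2)
# e.g. 'running' -> 'run', 'was' -> 'wa', 'kitchen' -> 'kitchen'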


