Sentence Tokenize (splitting text into sentences)
1. Use sent_tokenize directly
from sklearn.datasets import fetch_20newsgroups
news = fetch_20newsgroups(subset='train')
X,y = news.data,news.target
text = X[0]
from nltk.tokenize import sent_tokenize
sent_tokenize_list = sent_tokenize(text)
print(sent_tokenize_list)
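The example above runs sent_tokenize on a whole newsgroup post, so the output is long. A minimal sketch on a made-up string (assuming the punkt data has been downloaded, e.g. via nltk.download('punkt')) makes the behaviour easier to see:
from nltk.tokenize import sent_tokenize
# Short illustrative text, invented for this example.
sample = 'Hello World. NLTK is a toolkit for NLP. It can split text into sentences.'
print(sent_tokenize(sample))
# Expected output (roughly): ['Hello World.', 'NLTK is a toolkit for NLP.', 'It can split text into sentences.']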
2. Use nltk.tokenize.punkt, which contains many pre-trained tokenizer models.
from sklearn.datasets import fetch_20newsgroups
news = fetch_20newsgroups(subset='train')
X,y = news.data,news.target
print(X[0])
news = X[0]
from bs4 import BeautifulSoup
import nltk,re
news_text = BeautifulSoup(news, 'html.parser').get_text()  # strip any HTML markup from the post
print(news_text)
tokenizer=nltk.data.load('tokenizers/punkt/english.pickle')
raw_sentences=tokenizer.tokenize(news_text)
print(raw_sentences)
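Because punkt ships pre-trained models for several languages, a tokenizer for another language can be loaded the same way. A small sketch, assuming the standard punkt package layout and that nltk.download('punkt') has been run:
import nltk
# Load the German punkt model and split a short German text into sentences.
german_tokenizer = nltk.data.load('tokenizers/punkt/german.pickle')
print(german_tokenizer.tokenize('Hallo Welt. Wie geht es dir heute?'))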
Word Tokenize (splitting sentences into words)
1. Use word_tokenize
from nltk.tokenize import word_tokenize
text='The cat is walking in the bedroom.'
word_tokenize_list = word_tokenize(text)
print(word_tokenize_list)
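Compared with a plain str.split(), word_tokenize also separates punctuation into its own tokens; a quick sketch to show the difference:
from nltk.tokenize import word_tokenize
text = 'The cat is walking in the bedroom.'
print(text.split())          # 'bedroom.' keeps the trailing period
print(word_tokenize(text))   # 'bedroom' and '.' become separate tokens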
Part-Of-Speech Tagging and POS Tagger (tagging each word)
import nltk
from nltk.tokenize import word_tokenize
text = 'The cat is walking in the bedroom.'
word_tokenize_list = word_tokenize(text)
print(word_tokenize_list)
# nltk.pos_tag needs the 'averaged_perceptron_tagger' data package
pos_tag = nltk.pos_tag(word_tokenize_list)
print(pos_tag)
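The tags follow the Penn Treebank tag set; if a tag is unfamiliar, nltk.help.upenn_tagset can print its definition (this assumes the 'tagsets' data package is installed, e.g. via nltk.download('tagsets')):
import nltk
nltk.help.upenn_tagset('VBG')   # verb, gerund or present participle
nltk.help.upenn_tagset('NN')    # noun, singular or mass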
Stemming (extracting word stems)
import nltk
sent1='The cat is walking in the bedroom.'
sent2='A dog was running across the kitchen.'
tokens_1 = nltk.word_tokenize(sent1)
print(tokens_1)
tokens_2 = nltk.word_tokenize(sent2)
print(tokens_2)
stemmer = nltk.stem.PorterStemmer()
stem_1 = [stemmer.stem(t) for t in tokens_1]
print(stem_1)
stem_2 = [stemmer.stem(t) for t in tokens_2]
print(stem_2)
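Note that stems are truncated forms rather than dictionary words. As a further sketch, nltk.stem also provides the Lancaster stemmer, which is more aggressive than the Porter stemmer; the word list below is made up just for comparison:
import nltk
words = ['walking', 'running', 'cats', 'beautiful']
porter = nltk.stem.PorterStemmer()
lancaster = nltk.stem.LancasterStemmer()
for w in words:
    print(w, porter.stem(w), lancaster.stem(w))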