Stemming the words and word lemmatization – Python Data Science Cookbook

This article introduces the concepts of stemming and lemmatization, which aim to reduce the different forms of a word to a common base form. Both are implemented in Python with the NLTK library, covering the Porter, Lancaster, and Snowball stemmers, of which Lancaster is the most aggressive but may produce unreadable tokens. It also contrasts stemming with lemmatization, which uses vocabulary and morphological analysis to obtain the dictionary base form of a word, demonstrated with WordNetLemmatizer.


English grammar dictates how certain words are used in sentences. For example, perform, performing, and performs indicate the same action; they appear in different sentences based on the grammar rules.

"The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form." – *Introduction to Information Retrieval*, Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze. (Chinese, by contrast, has no inflectional morphology, but word segmentation in Chinese is itself a hard problem.)
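The perform/performing/performs example above can be checked directly. A minimal sketch, assuming NLTK is installed (the stemmers themselves are introduced in the next section):

```python
# Porter stemming collapses all three grammatical forms to one base form.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ['perform', 'performing', 'performs']])
# ['perform', 'perform', 'perform']
```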

Stemming the words

Let’s look into how we can perform word stemming using Python NLTK. NLTK provides us with a rich set of functions that can help us do the stemming pretty easily:

>>> import nltk.stem
>>> dir(nltk.stem)
['ISRIStemmer', 'LancasterStemmer', 'PorterStemmer', 'RSLPStemmer',
'RegexpStemmer', 'SnowballStemmer', 'StemmerI', 'WordNetLemmatizer',
'__builtins__', '__doc__', '__file__', '__name__', '__package__',
'__path__', 'api', 'isri', 'lancaster', 'porter', 'regexp', 'rslp',
'snowball', 'wordnet']
>>>
We have the following stemmers:
  1. Porter – the Porter stemmer
  2. Lancaster – the Lancaster stemmer
  3. Snowball – the Snowball stemmer
Porter is the most commonly used stemmer. The algorithm is not very aggressive when moving words to their root form.
Snowball is an improvement over Porter. It is also faster than Porter in terms of computational time.

Lancaster is the most aggressive stemmer. With Porter and Snowball, the final word tokens are still readable by humans, but with Lancaster they often are not. It is the fastest of the trio.

There’s more…

All three algorithms are fairly involved; going into their details is beyond the scope of this book. I recommend looking them up on the web. For details of the Porter and Snowball stemmers, refer to the following link: http://snowball.tartarus.org/algorithms/porter/stemmer.html

Example:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
@author: snaildove
"""
# Load libraries
from nltk import stem

# 1. A small input list to see how the three stemmers perform.
input_words = ['movies', 'dogs', 'planes', 'flowers', 'flies', 'fries', 'fry',
               'weeks', 'planted', 'running', 'throttle']

# 2. Porter stemming
porter = stem.porter.PorterStemmer()
p_words = [porter.stem(w) for w in input_words]
print(p_words)

# 3. Lancaster stemming
lancaster = stem.lancaster.LancasterStemmer()
l_words = [lancaster.stem(w) for w in input_words]
print(l_words)

# 4. Snowball stemming
snowball = stem.snowball.EnglishStemmer()
s_words = [snowball.stem(w) for w in input_words]
print(s_words)
Output:

['movi', 'dog', 'plane', 'flower', 'fli', 'fri', 'fri', 'week', 'plant', 'run', 'throttl']
['movy', 'dog', 'plan', 'flow', 'fli', 'fri', 'fry', 'week', 'plant', 'run', 'throttle']
['movi', 'dog', 'plane', 'flower', 'fli', 'fri', 'fri', 'week', 'plant', 'run', 'throttl']
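Note that Snowball is actually a family of algorithms covering many languages; NLTK's `SnowballStemmer` class takes a language name (the `EnglishStemmer` used above is equivalent to `SnowballStemmer('english')`). A small sketch, with a German word chosen here purely for illustration:

```python
from nltk.stem import SnowballStemmer

# The supported languages are listed on the class itself.
print(SnowballStemmer.languages)

# Instantiate a stemmer for a specific language.
german = SnowballStemmer('german')
print(german.stem('Autobahnen'))
```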

Word lemmatization

Stemming is a heuristic process that chops off word suffixes in order to get to the root form of a word. In the previous recipe, we saw that it may end up chopping even correct words, that is, removing derivational affixes. See the following Wikipedia link for derivational patterns:
http://en.wikipedia.org/wiki/Morphological_derivation#Derivational_patterns
Lemmatization, on the other hand, uses vocabulary and morphological analysis to get the lemma of a word. It tries to change only the inflectional endings and return the base word from a dictionary. See Wikipedia for more information on inflection:
http://en.wikipedia.org/wiki/Inflection
Let's use NLTK's WordNetLemmatizer.

# Load libraries
from nltk import stem

# 1. The same small input list as before.
input_words = ['movies', 'dogs', 'planes', 'flowers', 'flies', 'fries', 'fry',
               'weeks', 'planted', 'running', 'throttle']

# 2. Perform lemmatization.
wordnet_lemm = stem.WordNetLemmatizer()
wn_words = [wordnet_lemm.lemmatize(w) for w in input_words]
print(wn_words)

Output:

['movie', 'dog', 'plane', 'flower', 'fly', 'fry', 'fry', 'week', 'planted', 'running', 'throttle']

Ideally, the word running should be reduced to run, and our lemmatizer should have gotten it right. We can see that it has made no change to running. Our heuristic-based stemmers, however, got it right!

>>> wordnet_lemm.lemmatize('running')
'running'
>>> porter.stem('running')
'run'
>>> lancaster.stem('running')
'run'
>>> snowball.stem('running')
'run'

Tip

By default, the lemmatizer assumes that the input is a noun; this can be rectified by passing the POS tag of the word to our lemmatizer, as follows:
>>> wordnet_lemm.lemmatize('running', 'v')
'run'