different sentences based on the grammar rules.
The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.
(Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze. Note that Chinese has no inflectional word forms, but Chinese word segmentation is itself a hard problem.)
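To make the distinction concrete, here is a minimal sketch (assuming NLTK and its wordnet data are installed, for example via nltk.download('wordnet')) that contrasts the two approaches on a single token; both results match the output we will see later in this recipe:
from nltk import stem

porter = stem.porter.PorterStemmer()
wordnet_lemm = stem.WordNetLemmatizer()

print porter.stem('flies')            # heuristic suffix chopping gives 'fli'
print wordnet_lemm.lemmatize('flies') # dictionary-based lookup gives 'fly'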
Stemming the words
Let’s look into how we can perform word stemming using Python NLTK. NLTK provides us with a rich set of functions that can help us do the stemming pretty easily:
>>> import nltk.stem
>>> dir(nltk.stem)
['ISRIStemmer', 'LancasterStemmer', 'PorterStemmer', 'RSLPStemmer',
'RegexpStemmer', 'SnowballStemmer', 'StemmerI', 'WordNetLemmatizer',
'__builtins__', '__doc__', '__file__', '__name__', '__package__',
'__path__', 'api', 'isri', 'lancaster', 'porter', 'regexp', 'rslp',
'snowball', 'wordnet']
>>>
From the preceding output, we have the following stemmers:
- Porter – Porter stemmer
- Lancaster – Lancaster stemmer
- Snowball – Snowball stemmer
Snowball is an improvement over Porter and is also faster than Porter in terms of computation time.
Lancaster is the most aggressive of the three stemmers. With Porter and Snowball, the final word tokens are usually still readable by humans, but with Lancaster they often are not. Lancaster is also the fastest of the trio.
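Before the full recipe, here is a minimal sketch of this difference in aggressiveness on a single word; the values in the comments match the recipe's output shown further below:
from nltk import stem

porter = stem.porter.PorterStemmer()
lancaster = stem.lancaster.LancasterStemmer()
snowball = stem.snowball.EnglishStemmer()

print porter.stem('flowers')    # 'flower' - still readable
print snowball.stem('flowers')  # 'flower' - still readable
print lancaster.stem('flowers') # 'flow'   - aggressively shortened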
There’s more…
All three algorithms are fairly involved; going into their details is beyond the scope of this book. I recommend looking them up on the web for more information. For details of the Porter and Snowball stemmers, refer to the following link:
http://snowball.tartarus.org/algorithms/porter/stemmer.html
Example:
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
"""
@author: snaildove
"""
# Load Libraries
from nltk import stem
# 1. A small input list to see how the three stemmers perform.
input_words = ['movies','dogs','planes','flowers','flies','fries','fry','weeks','planted','running','throttle']
# Let's jump into the different stemming algorithms, as follows:
# 2. Porter stemming
porter = stem.porter.PorterStemmer()
p_words = [porter.stem(w) for w in input_words]
print p_words
# 3. Lancaster stemming
lancaster = stem.lancaster.LancasterStemmer()
l_words = [lancaster.stem(w) for w in input_words]
print l_words
# 4. Snowball stemming
snowball = stem.snowball.EnglishStemmer()
s_words = [snowball.stem(w) for w in input_words]
print s_words
Output:
[u'movi', u'dog', u'plane', u'flower', u'fli', u'fri', u'fri', u'week', u'plant', u'run', u'throttl']
[u'movy', 'dog', 'plan', 'flow', 'fli', 'fri', 'fry', 'week', 'plant', 'run', 'throttle']
[u'movi', u'dog', u'plane', u'flower', u'fli', u'fri', u'fri', u'week', u'plant', u'run', u'throttl']
Word lemmatization
Stemming is a heuristic process that chops off word suffixes in order to get to the root form of a word. In the previous recipe, we saw that it may end up chopping too much, removing derivational affixes as well. See the following Wikipedia link for the derivational patterns:
http://en.wikipedia.org/wiki/Morphological_derivation#Derivational_patterns
Lemmatization, on the other hand, uses morphological analysis and a vocabulary to get the lemma of a word. It tries to change only the inflectional endings and return the base, dictionary form of the word. See Wikipedia for more information on inflection at
http://en.wikipedia.org/wiki/Inflection.
We will use NLTK's WordNetLemmatizer:
# Load Libraries
from nltk import stem
# 1. The same small input list, this time to see how the lemmatizer performs.
input_words = ['movies','dogs','planes','flowers','flies','fries','fry','weeks','planted','running','throttle']
# 2. Perform lemmatization.
wordnet_lemm = stem.WordNetLemmatizer()
wn_words = [wordnet_lemm.lemmatize(w) for w in input_words]
print wn_words
Output:
[u'movie', u'dog', u'plane', u'flower', u'fly', u'fry', 'fry', u'week', 'planted', 'running', 'throttle']
The word running should ideally become run, and our lemmatizer should have gotten it right, yet we can see that it has made no change to running. Our heuristic-based stemmers, however, got it right!
>>> wordnet_lemm.lemmatize('running')
'running'
>>> porter.stem('running')
u'run'
>>> lancaster.stem('running')
'run'
>>> snowball.stem('running')
u'run'
Tip
By default, the lemmatizer assumes that the input is a noun; this can be rectified by passing the POS tag of the word to our lemmatizer, as follows:
>>> wordnet_lemm.lemmatize('running','v')
u'run'
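Building on this tip, the POS hint can also be derived automatically with NLTK's pos_tag. The following is a minimal sketch rather than part of the original recipe; it assumes the punkt, averaged_perceptron_tagger, and wordnet NLTK data packages have been downloaded, and penn_to_wordnet is our own illustrative helper:
import nltk
from nltk import stem
from nltk.corpus import wordnet

def penn_to_wordnet(tag):
    # Map a Penn Treebank tag to the WordNet POS constant the lemmatizer expects.
    if tag.startswith('J'):
        return wordnet.ADJ
    if tag.startswith('V'):
        return wordnet.VERB
    if tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN  # noun is also the lemmatizer's default

wordnet_lemm = stem.WordNetLemmatizer()
tokens = nltk.word_tokenize('The dogs are running in the fields')
tagged = nltk.pos_tag(tokens)
print [wordnet_lemm.lemmatize(w, penn_to_wordnet(t)) for w, t in tagged]
With the verb tag supplied this way, running should now be lemmatized to run, just as in the tip above.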