A Spell Checker allows to suggest a list of words similar to a misspelled word. This implementation is based on David Spencer's code using the n-gram method and the Levenshtein distance.
Structure of a dictionary index
An index (the dictionary) with all the possible words (a lucene index) must be created. The structure of this index is (for a 3-4 gram) this:
Index Structure | Example |
word | kings |
gram3 | kin, ing, ngs |
gram4 | king, ings |
start3 | kin |
start4 | king |
end3 | ngs |
end4 | ings |
Import: Adding Words to the Dictionary
We can add the words coming from a Lucene Index (more precisely from a set of Lucene fields), and from a text file with a list of words.
- Example: we can add all the keywords of a given Lucene field of my index.
SpellChecker spell= new SpellChecker(dictionaryDirectory); spell.indexDictionary(new LuceneDictionary(my_luceneReader,my_fieldname));
Getting a List of Suggested Words
The suggestSimilar method returns a list of suggested words sorted by:
- the Levenshtein distance (the most similar word to the misspelled word is the first in the list).
- (optionally) the popularity of the word in a given Lucene Field.
Furthermore, that list can be restricted only to the words present in a given Lucene Field.
- First example: the suggestSimilar(misspelled_word, num_list) method.
-
The num_list is the maximum number of words returned. In this example the list is just sorted with the Levenshtein distance.
String[] l=spellChecker.suggestSimilar("sevanty", 2); //l[0] = "seventy"
-
- Second example: the suggestSimilar(misspelled_word, num_list, myIndexReader,myField, morePopular)
Note: if myIndexReader and myField are null this method is the same as the first method
-
The returned words are restricted only to the words presents in the field myField of the Lucene Index "myIndexReader"
- The list is also sorted with a second criterium: the popularity (the frequency) of the word in the user field
-
If morePopular is true and the mispelled word exists in the user field, return only the words more frequent than this.
-
Changes
Version 1.1 :
- sort fixed (the sort was inversed!)
- set gram dynamically (depending of the length of the word)
-
use the FuzzyQuery score: ((edit distance)/(length of word))
-
new Dictionary interface + LuceneDictionary and PlaintextDictionary implementation
- replace addWords method by indexDictionary(Dictionnary dic)
- add a new public method: boolean exist(word)
- add a build.xml
Credits
- Maisonneuve Nicolas
- Spencer David