SpellChecker



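This article describes a SpellChecker implementation based on David Spencer's code. It uses the 3-gram/4-gram method and the Levenshtein distance to suggest a list of words similar to a misspelled word. It explains the structure of the dictionary index and shows how to add words to the dictionary and how to get a list of suggested words, with example code and a list of changes.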

A spell checker suggests a list of words similar to a misspelled word. This implementation is based on David Spencer's code and uses the n-gram method and the Levenshtein distance.

Structure of a dictionary index

A Lucene index (the dictionary) containing all the possible words must be created first. For 3-grams and 4-grams, each word is indexed with the following fields:

Index Structure	Example
word	kings
gram3	kin, ing, ngs
gram4	king, ings
start3	kin
start4	king
end3	ngs
end4	ings
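
The fields in the table can be derived from a word with plain string slicing. The following is a minimal sketch, not the library's code: the class name `NGramFields` and the methods `grams` and `fields` are hypothetical, and a `Map` stands in for the Lucene document that would hold these fields.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: derives the dictionary fields shown in the table.
class NGramFields {
    // All contiguous substrings of length n, in order of appearance.
    static List<String> grams(String word, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            out.add(word.substring(i, i + n));
        }
        return out;
    }

    // Field name -> values for gram sizes min..max (a Map stands in for a Lucene document).
    static Map<String, List<String>> fields(String word, int min, int max) {
        Map<String, List<String>> doc = new LinkedHashMap<>();
        doc.put("word", List.of(word));
        for (int n = min; n <= max; n++) {
            List<String> g = grams(word, n);
            if (g.isEmpty()) continue;                         // word is shorter than n
            doc.put("gram" + n, g);                            // every n-gram
            doc.put("start" + n, List.of(g.get(0)));           // first n-gram
            doc.put("end" + n, List.of(g.get(g.size() - 1)));  // last n-gram
        }
        return doc;
    }

    public static void main(String[] args) {
        // Prints one line per field for the word used in the table.
        fields("kings", 3, 4).forEach((k, v) -> System.out.println(k + "\t" + v));
    }
}
```

Keeping the first and last gram in separate fields presumably lets a query give extra weight to candidates that share the beginning or the end of the misspelled word.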

Import: Adding Words to the Dictionary

Words can be added from a Lucene index (more precisely, from a set of Lucene fields) or from a text file containing a list of words.

Example: adding all the keywords of a given Lucene field of an index.

SpellChecker spell = new SpellChecker(dictionaryDirectory);
spell.indexDictionary(new LuceneDictionary(my_luceneReader, my_fieldname));

Getting a List of Suggested Words

The suggestSimilar method returns a list of suggested words sorted by:

the Levenshtein distance (the word most similar to the misspelled word comes first in the list).
(optionally) the popularity of the word in a given Lucene field.

The list can also be restricted to the words present in a given Lucene field.

First example: the suggestSimilar(misspelled_word, num_list) method.
- num_list is the maximum number of words returned. In this example the list is sorted by the Levenshtein distance only.
```
   String[] suggestions = spellChecker.suggestSimilar("sevanty", 2);
   // suggestions[0] = "seventy"
```
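
To make the sorting criterion concrete, here is a minimal sketch of ranking candidate words by Levenshtein distance. It is an illustration under stated assumptions, not the library's implementation: the class `LevenshteinRanking` and the methods `distance` and `suggest` are hypothetical, and the `candidates` list stands in for the words that the n-gram query would retrieve from the dictionary index.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: ranks candidate words by edit distance to a misspelled word.
class LevenshteinRanking {
    // Classic dynamic-programming edit distance
    // (insertion, deletion and substitution each cost 1).
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;  // reuse the two rows
        }
        return prev[b.length()];
    }

    // Returns at most numSug candidates, closest to the misspelled word first.
    static List<String> suggest(String misspelled, List<String> candidates, int numSug) {
        List<String> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingInt(c -> distance(misspelled, c)));
        return new ArrayList<>(sorted.subList(0, Math.min(numSug, sorted.size())));
    }
}
```

For instance, `LevenshteinRanking.suggest("sevanty", List.of("seventh", "seventy", "severity"), 2)` puts "seventy" first, because it is a single substitution away from the input.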
Second example: the suggestSimilar(misspelled_word, num_list, myIndexReader, myField, morePopular) method.
Note: if myIndexReader and myField are null, this method behaves like the first one.
1. The returned words are restricted to the words present in the field myField of the Lucene index "myIndexReader".
2. The list is also sorted by a second criterion: the popularity (the frequency) of the word in the user field.
3. If morePopular is true and the misspelled word exists in the user field, only the words more frequent than the misspelled word are returned.
See the test case code for an example.
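
The three rules above can be sketched independently of Lucene. This is only an illustration of the described behaviour, assuming that the edit distance and the field frequency of each candidate have already been computed. The names `PopularityRanking`, `Candidate` and `rank` are hypothetical and do not belong to the library.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: applies the field restriction and the popularity criterion.
class PopularityRanking {
    // A candidate with its edit distance to the misspelled word and its
    // frequency in the user field (both assumed to be computed elsewhere).
    static final class Candidate {
        final String word;
        final int distance;
        final int freq;

        Candidate(String word, int distance, int freq) {
            this.word = word;
            this.distance = distance;
            this.freq = freq;
        }
    }

    static List<String> rank(List<Candidate> candidates, int misspelledFreq,
                             boolean morePopular, int numSug) {
        List<Candidate> kept = new ArrayList<>();
        for (Candidate c : candidates) {
            if (c.freq < 1) continue;  // rule 1: the word must occur in the field
            // rule 3: keep only words more frequent than the misspelled word
            if (morePopular && misspelledFreq > 0 && c.freq <= misspelledFreq) continue;
            kept.add(c);
        }
        // rule 2: smallest distance first, then highest frequency first
        kept.sort(Comparator.comparingInt((Candidate c) -> c.distance)
                .thenComparing(Comparator.comparingInt((Candidate c) -> c.freq).reversed()));
        List<String> words = new ArrayList<>();
        for (Candidate c : kept.subList(0, Math.min(numSug, kept.size()))) {
            words.add(c.word);
        }
        return words;
    }
}
```

Passing `misspelledFreq = 0` models a misspelled word that does not occur in the field, in which case the morePopular filter has no effect.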

Changes

Version 1.1 :

sort fixed (the sort order was reversed!)
set the gram size dynamically (depending on the length of the word)
use the FuzzyQuery score: ((edit distance)/(length of word))
new Dictionary interface + LuceneDictionary and PlaintextDictionary implementation
replace the addWords method with indexDictionary(Dictionary dic)
add a new public method: boolean exist(word)
add a build.xml
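
As an illustration of the "set gram dynamically" item, the gram sizes could be chosen from the word length roughly as follows. This is a sketch only: the changelog does not state the actual thresholds, so the values below are assumptions, and the class `GramRange` with its methods `minGram` and `maxGram` is hypothetical.

```java
// Hypothetical helper: picks the n-gram size range from the word length.
class GramRange {
    // Smallest gram size for a word of the given length (thresholds are assumptions).
    static int minGram(int length) {
        if (length > 5) return 3;
        if (length == 5) return 2;
        return 1;
    }

    // Largest gram size for a word of the given length (thresholds are assumptions).
    static int maxGram(int length) {
        if (length > 5) return 4;
        if (length == 5) return 3;
        return 2;
    }
}
```

With these assumed thresholds, a seven-letter word such as "sevanty" would be indexed with 3-grams and 4-grams, while shorter words would use smaller grams so that they still produce several grams to match on.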

Credits

Maisonneuve Nicolas
Spencer David