spell check investigation

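This article presents a string similarity algorithm for spell checking: words are stored in a dictionary tree (trie), which allows efficient spelling verification and suggestion generation. It describes the structure of the tree, the operations for adding, removing, and looking up words, and four common types of spelling error together with how each is handled.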


 

http://www.ibm.com/developerworks/cn/java/j-jazzy/#download

 

 

String similarity algorithm

 

Introduction

We all make the odd tryping error, so spull checkers are an essential part of any application that involves the user editing large amounts of text. (OK, I promise, no more corny mis-spellings!) This article describes the spell checking engine that I developed for the editing-intensive application I am currently working on.

The spell checking engine described is simply that - it checks words, and indicates whether or not they are spelt correctly. It is designed to sit behind a front end which allows the user to control spell checking, and as such, the design and operation of the user interface is beyond the scope of this article.

Dictionary Structure

Words are stored in a tree, with each node representing one character. Each node can have two children - a Next node and an Alternate node. The diagram below shows a dictionary containing three words, "horse", "horses" and "hose". Notice how we reuse nodes where the words start with the same common substring.

 

Storing the words in this manner is actually quite efficient in terms of storage space. When compared with an ASCII text file containing one word per line, the corresponding dictionary file is often considerably smaller, particularly when there are a large number of words.
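As a rough illustration of the node layout described above, here is a minimal Python sketch. The names `Node`, `END`, `chain` and `count_nodes` are hypothetical, and the exact position of the Alternate links (in particular, hanging the final "s" of "horses" off the terminator of "horse") is an assumption, since the original diagram is not reproduced here.

```python
from dataclasses import dataclass
from typing import Optional

END = "\0"  # assumed end-of-word marker, standing in for the null terminator


@dataclass
class Node:
    char: str
    next: Optional["Node"] = None       # following character of the same word
    alternate: Optional["Node"] = None  # different character at the same position


def chain(word: str) -> Node:
    """Build a straight run of Next nodes for word, finishing with END."""
    head = Node(END)
    for ch in reversed(word):
        head = Node(ch, next=head)
    return head


def count_nodes(node: Optional[Node]) -> int:
    """Count every node reachable through Next and Alternate links."""
    if node is None:
        return 0
    return 1 + count_nodes(node.next) + count_nodes(node.alternate)


# Hand-build the three-word dictionary: "horse", "horses" and "hose".
root = chain("horse")                  # h -> o -> r -> s -> e -> END
r_node = root.next.next                # the 'r' node after "ho"
r_node.alternate = chain("se")         # "ho" + "se" gives "hose"
end_of_horse = r_node.next.next.next   # the END node that closes "horse"
end_of_horse.alternate = chain("s")    # "horse" + "s" gives "horses"

flat_size = sum(len(w) + 1 for w in ("horse", "horses", "hose"))
print(count_nodes(root), "nodes versus", flat_size, "characters in a flat list")
```

In this sketch the shared prefixes "ho" and "horse" are stored only once, which is where the space saving comes from.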

Adding and Removing Words

To add a word, we traverse the tree to match as much of the word as possible, and then insert the remaining letters as new nodes. This is best shown by example. Say we wish to insert the word "hosts" into the dictionary above. We would perform the following steps:

  • Start at the first node
  • Check the node. The character is equal to the first letter of "hosts", so we move onto the Next node, and insert the word "osts"
  • Check the node. The character is equal to the first letter of "osts", so we move onto the Next node, and insert the word "sts"
  • Check the node. The character is not equal to the first letter of "sts", so we move onto the Alternate node, and insert the word "sts"
  • Check the node. The character is equal to the first letter of "sts", so we move onto the Next node, and insert the word "ts"
  • Check the node. The character is not equal to the first letter of "ts". The node has no Alternate node, so we create one and set the character to 't'. We then move onto this new node.
  • The node has no Next node, so we add one and set it to the next character in the word, which is 's'. We then move onto this new node.
  • The node has no Next node, so we add one and set it to the next character in the word, which is the null terminator. Because we have reached the end of the word, we end here.
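The steps above can be sketched in Python roughly as follows. This is an illustrative reading of the walkthrough, not the original implementation: `Node`, `END`, `add`, `words` and `count_nodes` are hypothetical names, and the null terminator is modelled as the character `"\0"`.

```python
END = "\0"  # assumed end-of-word marker, standing in for the null terminator


class Node:
    def __init__(self, char):
        self.char = char
        self.next = None        # following character of the same word
        self.alternate = None   # different character at the same position


def add(root, word):
    """Insert word and return the root node (created if the tree was empty)."""
    chars = word + END
    if root is None:
        root = Node(chars[0])
    node, i = root, 0
    while True:
        if node.char == chars[i]:
            i += 1
            if i == len(chars):
                return root                     # terminator stored: done
            if node.next is None:
                node.next = Node(chars[i])      # extend with a new Next node
            node = node.next
        else:
            if node.alternate is None:
                node.alternate = Node(chars[i])  # branch with a new Alternate
            node = node.alternate


def words(node, prefix=""):
    """List every word stored at or below node."""
    if node is None:
        return []
    if node.char == END:
        found = [prefix]
    else:
        found = words(node.next, prefix + node.char)
    return found + words(node.alternate, prefix)


def count_nodes(node):
    if node is None:
        return 0
    return 1 + count_nodes(node.next) + count_nodes(node.alternate)


root = None
for w in ("horse", "horses", "hose"):
    root = add(root, w)
before = count_nodes(root)
root = add(root, "hosts")
print(words(root))
print("nodes added for 'hosts':", count_nodes(root) - before)
```

Only three new nodes are created for "hosts" in this sketch ('t', 's' and the terminator), because "hos" is already present from "hose".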

Removing words is considerably easier. We traverse the tree to find the word, and then set the terminating character to something other than the null terminator. Although this does not reduce the size of the dictionary, it does mean that the word will no longer be found.

Checking Words

To check a word, we traverse the tree in much the same way as we did when we inserted a word. However, if we come to the point where there is no node with a matching character, the word is not in the dictionary and we end the search.
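The lookup can be sketched as a short loop over Next and Alternate links. As before, this is an assumption-laden illustration: `Node`, `END`, `add` and `contains` are hypothetical names, and `"\0"` stands in for the null terminator.

```python
END = "\0"  # assumed end-of-word marker


class Node:
    def __init__(self, char):
        self.char, self.next, self.alternate = char, None, None


def add(root, word):
    """Insert word and return the root node (same scheme as the insertion steps)."""
    chars = word + END
    if root is None:
        root = Node(chars[0])
    node, i = root, 0
    while True:
        if node.char == chars[i]:
            i += 1
            if i == len(chars):
                return root
            if node.next is None:
                node.next = Node(chars[i])
            node = node.next
        else:
            if node.alternate is None:
                node.alternate = Node(chars[i])
            node = node.alternate


def contains(root, word):
    """Return True only if every character, including the terminator, matches."""
    chars = word + END
    node, i = root, 0
    while node is not None:
        if node.char == chars[i]:
            i += 1
            if i == len(chars):
                return True      # terminator matched: the word is stored
            node = node.next
        else:
            node = node.alternate  # try another character at this position
    return False                 # ran out of candidates: not in the dictionary


root = None
for w in ("horse", "horses", "hose"):
    root = add(root, w)
for candidate in ("horse", "hors", "hose", "house"):
    print(candidate, contains(root, candidate))
```

Note that a stored prefix such as "hors" is rejected, because the terminator must match as well.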

Getting Suggestions

When getting suggestions for a word, we consider four types of error that the user could have made. We assume only one error has been made in a word - if we accounted for more errors, our suggestion list could grow to something which is too large to be of use to the user.

The four errors that we account for are:

  • Extra character: For example, "horpse" instead of "horse"
  • Missing character: For example, "hrse" instead of "horse"
  • Incorrect character: For example, "hprse" instead of "horse"
  • Two characters transposed: For example, "hrose" instead of "horse"

Because traversal of the tree is so fast, we individually check each of the combinations of characters that make up each of the errors.

So, when finding suggestions for the extra character error, we will try and find the words "orpse", "hrpse", "hopse", "horse", "horpe" and "horps". Any matches that we find are added to the suggestions list. Similarly, we try all the combinations of transposed characters to find suggestions.
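The candidate generation for these two error types can be sketched as below. To keep the example short, a plain Python set (`WORDS`) stands in for the tree lookup; that substitution, and the function names, are assumptions made for illustration.

```python
def extra_char_candidates(word):
    """Drop each character in turn, undoing a possible extra character."""
    return [word[:i] + word[i + 1:] for i in range(len(word))]


def transposed_candidates(word):
    """Swap each adjacent pair in turn, undoing a possible transposition."""
    return [
        word[:i] + word[i + 1] + word[i] + word[i + 2:]
        for i in range(len(word) - 1)
    ]


def suggest(word, words):
    """Return the candidates that are real words, without duplicates."""
    found = []
    for candidate in extra_char_candidates(word) + transposed_candidates(word):
        if candidate in words and candidate not in found:
            found.append(candidate)
    return found


# Stand-in for the dictionary tree: any membership test would do here.
WORDS = {"horse", "horses", "hose"}

print(extra_char_candidates("horpse"))
print(suggest("horpse", WORDS))
print(suggest("hrose", WORDS))
```

A word of length n produces n candidates for the extra character case and n - 1 for the transposition case, so the number of lookups grows only linearly with word length.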

The missing and incorrect character errors use pattern matching to find suggestions. For the missing character errors, we insert a wildcard character at each character position, and see what words we can find that match the search string. When looking for incorrect character errors, we replace each character in turn with the wildcard character.

So, if we were getting suggestions for the word "hrse", we would get the suggestions "horse" (missing character 'o') and "hose" (incorrect character 'r').
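One way to sketch the wildcard search is a recursive walk that follows every Alternate node whenever the pattern holds a wildcard. The wildcard character `"?"` and all names below are assumptions for illustration; the original engine may use a different character and traversal.

```python
END = "\0"      # assumed end-of-word marker
WILDCARD = "?"  # assumed wildcard character


class Node:
    def __init__(self, char):
        self.char, self.next, self.alternate = char, None, None


def add(root, word):
    """Insert word and return the root node."""
    chars = word + END
    if root is None:
        root = Node(chars[0])
    node, i = root, 0
    while True:
        if node.char == chars[i]:
            i += 1
            if i == len(chars):
                return root
            if node.next is None:
                node.next = Node(chars[i])
            node = node.next
        else:
            if node.alternate is None:
                node.alternate = Node(chars[i])
            node = node.alternate


def match(node, pattern, prefix=""):
    """Return every stored word that matches pattern exactly in length."""
    results = []
    while node is not None:          # walk all alternatives at this position
        if not pattern:
            if node.char == END:
                results.append(prefix)
        elif node.char != END and pattern[0] in (WILDCARD, node.char):
            results += match(node.next, pattern[1:], prefix + node.char)
        node = node.alternate
    return results


def suggest(root, word):
    """Try a wildcard inserted at, then substituted for, each position."""
    missing = [word[:i] + WILDCARD + word[i:] for i in range(len(word) + 1)]
    wrong = [word[:i] + WILDCARD + word[i + 1:] for i in range(len(word))]
    found = []
    for pattern in missing + wrong:
        for hit in match(root, pattern):
            if hit not in found:
                found.append(hit)
    return found


root = None
for w in ("horse", "horses", "hose"):
    root = add(root, w)
print(suggest(root, "hrse"))
```

With the three-word dictionary, the pattern "h?rse" finds "horse" and the pattern "h?se" finds "hose", which matches the example above.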

Case

Words are inserted into the dictionary in lower case, unless they contain upper case characters. So, place names will be inserted in "correct" case.

When we try to find words, we always start by trying to get an exact match. If we do not, we then see what the best match we can get is - either matching apart from a capitalised first letter, matching but with mixed case, or no match at all. If the match indicates a capitalised first letter, it is up to the application to determine whether or not this is valid. For example, this may be treated as valid for the first letter of a sentence, but invalid elsewhere.
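The matching levels described here could be modelled as a small classification function. The sketch below is one possible interpretation: the `Match` enum, the `classify` function and the use of a plain set in place of the tree are all assumptions, and the original engine may rank or detect the cases differently.

```python
from enum import Enum


class Match(Enum):
    EXACT = "exact"
    CAPITALISED = "capitalised first letter"
    MIXED_CASE = "mixed case"
    NONE = "none"


# Stand-in dictionary: lower case unless the word itself needs capitals.
WORDS = {"horse", "hose", "London"}


def classify(word, words=WORDS):
    """Return the best kind of match available for word."""
    if word in words:
        return Match.EXACT
    # Same word apart from a capitalised first letter, e.g. at a sentence start.
    if word[:1].isupper() and word[:1].lower() + word[1:] in words:
        return Match.CAPITALISED
    # Same letters but some other difference in case.
    if word.lower() in {w.lower() for w in words}:
        return Match.MIXED_CASE
    return Match.NONE


for sample in ("horse", "Horse", "hoRse", "london", "hrse"):
    print(sample, "->", classify(sample).value)
```

The calling application would then decide what to do with `Match.CAPITALISED`, for example accepting it only for the first word of a sentence.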

When getting suggestions for a mis-spelt word, we get words regardless of case. However, there is also an option to return only those words that differ from the search word by case alone. This may be used when the search indicated a valid word with incorrect case.

 

code

review

Disadvantages:

    No sorting, low efficiency when inserting and finding words, and exponential algorithmic complexity.

    Not suitable for very large numbers of words.

 

 

 

Double Metaphone algorithm

http://www.cnblogs.com/dandandan/archive/2006/06/02/415598.html 

 

Metaphone phonetic matching algorithm

http://aspell.net/metaphone/

 

other tech

http://www.cnblogs.com/chinafine/articles/1270414.html

 
