A Fast Keyword Filtering Method in C#

License: CC 4.0 BY-SA

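This post presents an efficient hash-based keyword filtering algorithm. It stores the keywords in a dictionary and scans the text once to match and mask them. The filter takes a little over 90 lines of code, runs very fast, and suits large volumes of text.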

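The approach in this post is simple and is still based on hashing. The first character of each keyword serves as the key and the keyword itself as the value, and every keyword is hashed into a Dictionary<key, value>. Since several keywords can share the same first character, the value is really a collection of keywords. To filter, walk through the content, look up each character in the dictionary, and mask any keyword that matches at that position. Because the idea is so simple and clear, there is little room for bugs, and the whole keyword filter takes only a bit more than 90 lines of code. Did you see that? It is also reasonably efficient: with a keyword list and a content text of more than ten thousand characters each, filtering took only 10 ms, and both data sets were read from plain text files.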
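To make the bucket layout concrete, here is a minimal sketch of only the dictionary-building step. The names `BucketDemo` and `BuildBuckets` and the sample keyword list are assumptions made for illustration; they are not part of the original source.

```csharp
using System;
using System.Collections.Generic;

public static class BucketDemo
{
    // Groups '|'-separated keywords by their first character.
    public static Dictionary<char, List<string>> BuildBuckets(string keyList)
    {
        var buckets = new Dictionary<char, List<string>>();
        foreach (string word in keyList.Split('|'))
        {
            if (word.Length == 0)
            {
                continue; // skip empty entries such as the gap in "a||b"
            }
            if (!buckets.TryGetValue(word[0], out List<string> list))
            {
                list = new List<string>();
                buckets[word[0]] = list;
            }
            list.Add(word);
        }
        return buckets;
    }

    public static void Main()
    {
        // Hypothetical keyword list, for illustration only.
        var buckets = BuildBuckets("spam|scam|bad|spy");
        foreach (var pair in buckets)
        {
            // One line per bucket: first character, then its keywords.
            Console.WriteLine(pair.Key + " -> " + string.Join(", ", pair.Value));
        }
    }
}
```

With this sample list, the three keywords starting with `s` share one bucket and `bad` sits alone in the `b` bucket, so a lookup on a character of the text only has to try the keywords that could possibly start there.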

I really don't want to ramble on. Once you read the source you will think: wow, it is that simple. Believe it or not, here is the source.
using System;
using System.Collections.Generic;
using System.Text;

namespace WordsFilter
{
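    /// <summary>
    /// Keyword filter: buckets keywords by their first character.
    /// </summary>
    public class WordSearch
    {
        // First character of a keyword -> all keywords that start with it.
        private Dictionary<char, IList<string>> keyDict;

        /// <param name="keyList">Keywords separated by '|'.</param>
        public WordSearch(string keyList)
        {
            HandleKeyWords(keyList);
        }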

        // Splits the '|'-separated keyword list and fills the dictionary.
        private void HandleKeyWords(string text)
        {
            if (string.IsNullOrEmpty(text))
            {
                keyDict = new Dictionary<char, IList<string>>();
            }
            else
            {
                string[] strList = text.Split('|');
                keyDict = new Dictionary<char, IList<string>>(strList.Length / 4);
                foreach (string s in strList)
                {
                    if (s == "")
                    {
                        continue;
                    }
                    if (keyDict.ContainsKey(s[0]))
                    {
                        keyDict[s[0]].Add(s);
                    }
                    else
                    {
                        keyDict.Add(s[0], new List<string> { s });
                    }
                }
            }
        }

        // Replaces every keyword occurrence in str with '*' characters.
        public string Filter(string str)
        {
            if (string.IsNullOrEmpty(str))
            {
                return string.Empty;
            }
            int len = str.Length;
            StringBuilder sb = new StringBuilder(len);
            bool isOK = true;
            for (int i = 0; i < len; i++)
            {
                if (keyDict.ContainsKey(str[i]))
                {
                    // Try each keyword that starts with the current character.
                    foreach (string s in keyDict[str[i]])
                    {
                        isOK = true;
                        int j = i;
                        foreach (char c in s)
                        {
                            if (j >= len || c != str[j++])
                            {
                                isOK = false;
                                break;
                            }
                        }
                        if (isOK)
                        {
                            // Match: mask it and skip past the keyword.
                            i += s.Length - 1;
                            sb.Append('*', s.Length);
                            break;
                        }

                    }
                    if (!isOK)
                    {
                        // No keyword matched here; keep the character.
                        sb.Append(str[i]);
                    }
                }
                else
                {
                    sb.Append(str[i]);
                }
            }
            return sb.ToString();
        }

    }
}
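For a quick feel of the behavior, the following is a condensed, self-contained sketch of the same first-character lookup together with a small usage example. `FilterDemo`, `Mask`, and the sample keywords are hypothetical names and data chosen for illustration; the sketch uses `TryGetValue` and `string.CompareOrdinal` instead of the explicit character loop, so it is a variant of the idea rather than the author's exact code.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class FilterDemo
{
    // Masks every occurrence of a keyword in text with '*'.
    public static string Mask(string text, IEnumerable<string> keywords)
    {
        // Bucket the keywords by first character.
        var buckets = new Dictionary<char, List<string>>();
        foreach (string k in keywords)
        {
            if (string.IsNullOrEmpty(k)) continue;
            if (!buckets.TryGetValue(k[0], out List<string> list))
            {
                buckets[k[0]] = list = new List<string>();
            }
            list.Add(k);
        }

        var sb = new StringBuilder(text.Length);
        int i = 0;
        while (i < text.Length)
        {
            string hit = null;
            if (buckets.TryGetValue(text[i], out List<string> candidates))
            {
                foreach (string k in candidates)
                {
                    // Ordinal comparison of text[i .. i + k.Length) with k.
                    if (i + k.Length <= text.Length &&
                        string.CompareOrdinal(text, i, k, 0, k.Length) == 0)
                    {
                        hit = k;
                        break; // the first matching keyword in the bucket wins
                    }
                }
            }
            if (hit != null)
            {
                sb.Append('*', hit.Length);
                i += hit.Length;
            }
            else
            {
                sb.Append(text[i]);
                i++;
            }
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // Hypothetical sample data.
        string[] keywords = { "bad", "spam" };
        Console.WriteLine(Mask("no spam, no bad words", keywords));
        // prints "no ****, no *** words"
    }
}
```

One consequence of stopping at the first match is that the order inside a bucket matters: with the keywords `ab` and then `abc`, the text `abc` is masked as `**c`. Listing longer keywords first is one way to make the longer keyword win.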
Test screenshot: [image not included]

Author: 陈太汉
Blog: http://www.cnblogs.com/hlxs/

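I ran a test with your example, looping 1000 times, and mine is a lot faster.
Here are the results (the label 用时(毫秒) means "elapsed time in milliseconds"):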
WordSearch用时(毫秒): 5824 Milliseconds (GCs=194)
TrieFilter用时(毫秒): 1497 Milliseconds (GCs=70)
FastFilter用时(毫秒): 617 Milliseconds (GCs=70)
I tested with a slightly modified version of your Program:
    using System;
    using System.IO;

    // TrieFilter, FastFilter, IOHelper and OperationTimer are helper types
    // whose source is not shown in this post.
    class Program
    {
        static TrieFilter tf = new TrieFilter();
        static FastFilter ff = new FastFilter();

        static void Main(string[] args)
        {
            // Load one keyword per line into both filters.
            using (StreamReader reader = new StreamReader(File.OpenRead("words.txt")))
            {
                string key = reader.ReadLine();
                while (key != null)
                {
                    if (key != string.Empty)
                    {
                        tf.AddKey(key);
                        ff.AddKey(key);
                    }
                    key = reader.ReadLine();
                }
            }

            string keys = IOHelper.Read("words.txt").Replace("\r\n", "|");
            WordSearch ws = new WordSearch(keys);
            string str = IOHelper.Read("content.txt");

            using (new OperationTimer("WordSearch用时(毫秒):"))
            {
                for (int i = 0; i < 1000; i++)
                {
                    string s = ws.Filter(str);
                }
                //Console.WriteLine(s);
            }
            using (new OperationTimer("TrieFilter用时(毫秒):"))
            {
                for (int i = 0; i < 1000; i++)
                {
                    string s = tf.Replace(str);
                }
            }
            using (new OperationTimer("FastFilter用时(毫秒):"))
            {
                for (int i = 0; i < 1000; i++)
                {
                    string s = ff.Replace(str);
                }
            }

            Console.Read();
        }
    }
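The `OperationTimer` used in the benchmark is not shown in the post. To reproduce measurements of this shape, a disposable timer along the following lines could be used. `SimpleTimer` is a hypothetical stand-in written for illustration, assuming the original prints the elapsed milliseconds and the number of generation-0 garbage collections; the real helper may differ.

```csharp
using System;
using System.Diagnostics;

// Hypothetical stand-in for OperationTimer, whose source is not shown.
public sealed class SimpleTimer : IDisposable
{
    private readonly string label;
    private readonly Stopwatch watch;
    private readonly int gcAtStart;

    public SimpleTimer(string label)
    {
        this.label = label;
        // Remember how many gen-0 collections happened before the block.
        gcAtStart = GC.CollectionCount(0);
        watch = Stopwatch.StartNew();
    }

    public long ElapsedMilliseconds => watch.ElapsedMilliseconds;

    // Called at the end of the using block: stop and report.
    public void Dispose()
    {
        watch.Stop();
        int gcs = GC.CollectionCount(0) - gcAtStart;
        Console.WriteLine("{0} {1} Milliseconds (GCs={2})",
            label, watch.ElapsedMilliseconds, gcs);
    }
}

public static class TimerDemo
{
    public static void Main()
    {
        using (new SimpleTimer("Concat loop:"))
        {
            string s = string.Empty;
            for (int i = 0; i < 1000; i++)
            {
                s += "x"; // placeholder workload
            }
        }
    }
}
```

Wrapping each filter's 1000-iteration loop in its own `using` block, as the Program above does, keeps the measurements for the three filters separate.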