mysql全文索引之停止词(stopword)

MySQL全文索引中,停止词会被忽略。InnoDB和MyISAM有不同的默认停止词列表,可以查询INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD获取InnoDB的默认停止词。MyISAM的停止词在源码中预设,可通过ft_stopword_file覆盖。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

本文IT技术学习网将给大家讲述什么是mysql全文索引中的停止词(stopword也有的翻译做停止字)。

stopword

在全文索引中,如果一个词被认为是太普通或者太没价值,那么它将会被搜索索引和搜索查询忽略。innodb和myisam分别有两组不同的设置,控制着对应的stopword。

全文检索时,停止词列表将会被读取和检索,在不同的字符集和排序方式下(character_set_server and collation_server 系统变量),可能会导致在搜索时的停止词的不匹配。

停止词是否大小写敏感,取决于不同的排序方式,比如:latin1_swedish_ci下停止词是大小写敏感的,latin1_general_cs 或 latin1_bin下停止词是大小写不敏感的。

innodb的索引停止词

innodb的默认停止词列表很短。查询INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD表来查看默认的innodb停止词表。

    1. mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD;

 

    1. +-------+

 

    1. | value |

 

    1. +-------+

 

    1. | a     |

 

    1. | about |

 

    1. | an    |

 

    1. | are   |

 

    1. | as    |

 

    1. | at    |

 

    1. | be    |

 

    1. | by    |

 

    1. | com   |

 

    1. | de    |

 

    1. | en    |

 

    1. | for   |

 

    1. | from  |

 

    1. | how   |

 

    1. | i     |

 

    1. | in    |

 

    1. | is    |

 

    1. | it    |

 

    1. | la    |

 

    1. | of    |

 

    1. | on    |

 

    1. | or    |

 

    1. | that  |

 

    1. | the   |

 

    1. | this  |

 

    1. | to    |

 

    1. | was   |

 

    1. | what  |

 

    1. | when  |

 

    1. | where |

 

    1. | who   |

 

    1. | will  |

 

    1. | with  |

 

    1. | und   |

 

    1. | the   |

 

    1. | www   |

 

    1. +-------+

 

  1. 36 rows in set (0.00 sec)

myisam索引的停止词

myisam索引的停止词列表与innodb不同,默认的myisam停止词列表是直接在mysql程序源码中已写入。设置ft_stopword_file系统变量来指定停止词文件,从而覆盖默认设置。

在mysql源程序的 storage/myisam/ft_static.c file文件中,你可以找到默认的myisam停止词列表:

    1. a's able about above according

 

    1. accordingly across actually after afterwards

 

    1. again against ain't all allow

 

    1. allows almost alone along already

 

    1. also although always am among

 

    1. amongst an and another any

 

    1. anybody anyhow anyone anything anyway

 

    1. anyways anywhere apart appear appreciate

 

    1. appropriate are aren't around as

 

    1. aside ask asking associated at

 

    1. available away awfully be became

 

    1. because become becomes becoming been

 

    1. before beforehand behind being believe

 

    1. below beside besides best better

 

    1. between beyond both brief but

 

    1. by c'mon c's came can

 

    1. can't cannot cant cause causes

 

    1. certain certainly changes clearly co

 

    1. com come comes concerning consequently

 

    1. consider considering contain containing contains

 

    1. corresponding could couldn't course currently

 

    1. definitely described despite did didn't

 

    1. different do does doesn't doing

 

    1. don't done down downwards during

 

    1. each edu eg eight either

 

    1. else elsewhere enough entirely especially

 

    1. et etc even ever every

 

    1. everybody everyone everything everywhere ex

 

    1. exactly example except far few

 

    1. fifth first five followed following

 

    1. follows for former formerly forth

 

    1. four from further furthermore get

 

    1. gets getting given gives go

 

    1. goes going gone got gotten

 

    1. greetings had hadn't happens hardly

 

    1. has hasn't have haven't having

 

    1. he he's hello help hence

 

    1. her here here's hereafter hereby

 

    1. herein hereupon hers herself hi

 

    1. him himself his hither hopefully

 

    1. how howbeit however i'd i'll

 

    1. i'm i've ie if ignored

 

    1. immediate in inasmuch inc indeed

 

    1. indicate indicated indicates inner insofar

 

    1. instead into inward is isn't

 

    1. it it'd it'll it's its

 

    1. itself just keep keeps kept

 

    1. know known knows last lately

 

    1. later latter latterly least less

 

    1. lest let let's like liked

 

    1. likely little look looking looks

 

    1. ltd mainly many may maybe

 

    1. me mean meanwhile merely might

 

    1. more moreover most mostly much

 

    1. must my myself name namely

 

    1. nd near nearly necessary need

 

    1. needs neither never nevertheless new

 

    1. next nine no nobody non

 

    1. none noone nor normally not

 

    1. nothing novel now nowhere obviously

 

    1. of off often oh ok

 

    1. okay old on once one

 

    1. ones only onto or other

 

    1. others otherwise ought our ours

 

    1. ourselves out outside over overall

 

    1. own particular particularly per perhaps

 

    1. placed please plus possible presumably

 

    1. probably provides que quite qv

 

    1. rather rd re really reasonably

 

    1. regarding regardless regards relatively respectively

 

    1. right said same saw say

 

    1. saying says second secondly see

 

    1. seeing seem seemed seeming seems

 

    1. seen self selves sensible sent

 

    1. serious seriously seven several shall

 

    1. she should shouldn't since six

 

    1. so some somebody somehow someone

 

    1. something sometime sometimes somewhat somewhere

 

    1. soon sorry specified specify specifying

 

    1. still sub such sup sure

 

    1. t's take taken tell tends

 

    1. th than thank thanks thanx

 

    1. that that's thats the their

 

    1. theirs them themselves then thence

 

    1. there there's thereafter thereby therefore

 

    1. therein theres thereupon these they

 

    1. they'd they'll they're they've think

 

    1. third this thorough thoroughly those

 

    1. though three through throughout thru

 

    1. thus to together too took

 

    1. toward towards tried tries truly

 

    1. try trying twice two un

 

    1. under unfortunately unless unlikely until

 

    1. unto up upon us use

 

    1. used useful uses using usually

 

    1. value various very via viz

 

    1. vs want wants was wasn't

 

    1. way we we'd we'll we're

 

    1. we've welcome well went were

 

    1. weren't what what's whatever when

 

    1. whence whenever where where's whereafter

 

    1. whereas whereby wherein whereupon wherever

 

    1. whether which while whither who

 

    1. who's whoever whole whom whose

 

    1. why will willing wish with

 

    1. within without won't wonder would

 

    1. wouldn't yes yet you you'd

 

    1. you'll you're you've your yours

 

    1. yourself yourselves zero



<think>嗯,用户问的是MySQL全文索引是如何工作的。这个问题看起来挺技术性的,可能用户是开发人员或者数据库管理员,想要优化数据库的查询性能,特别是在处理文本搜索的时候。首先,我需要回忆一下MySQL全文索引的基本机制。 首先,我应该解释全文索引是什么,和普通的B树索引有什么区别。普通索引比如B树,适合精确匹配和范围查询,而全文索引则是为了文本中的关键搜索设计的。用户可能已经遇到了LIKE查询效率低下的问题,所以想了解更高效的解决方案。 接下来,需要详细说明MySQL如何建立全文索引。比如,分的过程,停用的处理,以及索引存储的结构。InnoDB和MyISAM引擎在全文索引上的区别可能也需要提到,因为不同引擎的支持情况不同,尤其是在MySQL 5.6之后InnoDB开始支持全文索引,这对用户来说可能是个重要的信息点。 然后,要解释查询时的过程,包括自然语言模式和布尔模式的区别,以及如何计算相关性分数。用户可能想知道为什么某些搜索结果排名靠前,相关性算法可能涉及TF-IDF之类的概念,但MySQL的具体实现可能有所不同,比如基于频和匹配次数。 还需要考虑使用全文索引的限制,比如最小长度(ft_min_word_len)和停用列表,这些因素会影响索引的内容和查询结果。用户如果遇到某些无法被搜索到的情况,可能需要调整这些配置。 另外,应用场景和最佳实践也是用户关心的部分。比如,在什么情况下使用全文索引更合适,对比Elasticsearch等专业搜索引擎的优缺点,可能用户的项目规模不大,不想引入额外技术栈,所以选择MySQL内置的全文索引。 可能用户没有说出来的需求是优化搜索性能,或者处理大量文本数据时的效率问题。需要强调全文索引在提高文本搜索速度上的优势,同时提醒其局限性,比如不支持复杂的语言处理或大规模分布式环境。 最后,总结一下使用全文索引的步骤,如何创建索引,编写查询语句,以及调整配置参数。确保用户能够根据这些信息实际操作,解决他们的问题。</think>MySQL全文索引(Full-Text Index)是一种专门针对文本字段设计的索引技术,用于高效实现基于关键的模糊搜索(如 `MATCH() AGAINST()` 查询),相比传统的 `LIKE` 查询,它在处理大文本时性能更高。以下是其核心工作原理和特点: --- ### 一、全文索引的工作流程 1. **分(Tokenization)** MySQL 将文本内容按规则拆分成独立的「元」(Token)。例如: - 英文默认按空格、标点分(如 `"Hello world"` → `"hello"`, `"world"`)。 - 中文需依赖分插件(如 `ngram`,默认按字符切分,如 `"数据库"` → `"数"`, "据", "库"`)。 2. **过滤停用Stopwords)** 忽略常见无意义(如英文的 "the", "and",中文的 "的"、"是"),列表由 `ft_stopword_file` 配置。 3. **归一化(Normalization)** 将元转为小写(不区分大小写),移除后缀等,确保搜索时匹配更灵活。 4. **构建倒排索引(Inverted Index)** 记录每个元出现的文档 ID 和位置,形成类似 ` → 文档列表` 的映射,加速查找。 --- ### 二、全文索引的查询模式 1. **自然语言模式(NATURAL LANGUAGE MODE)** 默认模式,按关键匹配并自动计算相关性得分(基于频、文档长度等),返回排序结果。 ```sql SELECT * FROM articles WHERE MATCH(content) AGAINST('数据库 优化' IN NATURAL LANGUAGE MODE); ``` 2. **布尔模式(BOOLEAN MODE)** 支持操作符(`+` 必须包含,`-` 排除,`*` 通配符等),适合复杂逻辑搜索。 ```sql SELECT * FROM articles WHERE MATCH(content) AGAINST('+MySQL -Oracle*' IN BOOLEAN MODE); ``` --- ### 三、核心配置参数 - **`ft_min_word_len`** 最小长(默认 4),短于此长度的(如 "is")会被忽略。需重启并重建索引生效。 - **`ngram_token_size`** 适用于中文的 ngram 分长度(默认 2),如设为 2 时 `"数据库"` 会分为 `"数据"`, `"据库"`。 --- ### 四、适用场景与限制 - **适用场景** - 日志分析、文章搜索等文本密集型场景。 - 替代低效的 `LIKE '%keyword%'` 查询。 - **限制** - 仅支持 InnoDB(MySQL 5.6+)和 MyISAM 引擎。 - 中文分需手动配置(如使用 `ngram` 插件)。 - 不支持实时更新索引(存在延迟)。 --- ### 五、与其他方案的对比 | 方案 | 优点 | 缺点 | |------|------|------| | **MySQL 全文索引** | 内嵌于数据库,无需额外工具 | 功能较基础,中文支持弱 | | **Elasticsearch** | 分强大,支持高并发、分布式 | 需独立维护,学习成本高 | | **LIKE 查询** | 简单易用 | 性能差,无法利用索引 | --- ### 六、最佳实践 1. 对需要搜索的字段(如 `TEXT`、`VARCHAR`)创建全文索引: ```sql ALTER TABLE articles ADD FULLTEXT INDEX ft_idx_content (content) WITH PARSER ngram; ``` 2. 避免短搜索(调整 `ft_min_word_len` 或 `ngram_token_size`)。 3. 对中文搜索,建议使用更专业的分工具(如结巴分+Elasticsearch)。 通过合理使用全文索引,可以显著提升文本搜索效率,但需根据场景权衡功能与复杂度。
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值