Common Word Segmentation Algorithms: A Brief Summary

License: CC 4.0 BY-SA

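This article summarizes three common families of word segmentation algorithms: rule-based segmentation (single-character segmentation, bigram segmentation, forward maximum matching, and reverse maximum matching), statistical segmentation, and understanding-based (knowledge-based) segmentation. It is meant to help readers understand and review the basics of word segmentation.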


I. Rule-based segmentation algorithms

1 Single-character segmentation

    public static void tokennizer(String a) {
        // print every character of the input as its own token
        for (int i = 0; i < a.length(); i++) {
            System.out.println(a.charAt(i));
        }
    }

2 Bigram segmentation

   
public static void split(String a) {
    int len = a.length();
    // slide a two-character window over the string, one position at a time
    for (int i = 0; i + 2 <= len; i++) {
        String key = a.substring(i, i + 2);
        System.out.println(i + " ---- " + key);
    }
}
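For reuse, it is often handier to collect the bigrams instead of printing them. The following is a minimal sketch under that assumption; the class name `BigramSketch` and the method `bigrams` are illustrative names, not part of the original snippet.

```java
import java.util.ArrayList;
import java.util.List;

class BigramSketch {

    // Returns every overlapping two-character window of the input, in order.
    // Strings shorter than two characters yield an empty list.
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + 2 <= text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("abcd")); // prints [ab, bc, cd]
    }
}
```

A string of length n produces n - 1 overlapping bigrams, which is why this scheme needs no dictionary.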

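3 Forward maximum matching

(Think of it as two pointers, start and end. With start=0 and end=n, the first candidate is substring(0, n). Moving end first and start second ensures that the longest string is found in the forward direction.)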

       
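// lt is the dictionary of known words (for example a static List<String>), declared elsewhere in the class.
// n is the end index of the text to segment, typically s.length().
public static void match(String s, int n) {
    int start = 0;
    while (start < n) {
        int end = n; // reset the right pointer so the longest candidate is tried first
        for (; end > start; end--) {
            String key = s.substring(start, end); // cut out the longest remaining candidate
            if (lt.contains(key)) { // is the candidate in the dictionary?
                System.out.println(key);
                break;
            }
        }
        // after a match, resume right behind the word; otherwise skip one character
        start = (end > start) ? end : start + 1;
    }
}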
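The two-pointer idea can also be written as a self-contained program. This is a minimal sketch, not a definitive implementation, and it makes a few assumptions that are not in the original: the dictionary is passed in as a `Set<String>`, candidates are capped at a hypothetical `maxWordLen` parameter, unmatched characters are emitted as single-character tokens, and the toy dictionary is invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ForwardMaxMatchSketch {

    // Splits text left to right, always taking the longest dictionary word
    // (at most maxWordLen characters) that starts at the current position.
    static List<String> segment(String text, Set<String> dict, int maxWordLen) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = Math.min(start + maxWordLen, text.length());
            // Shrink the window from the right until it is a dictionary word
            // or only one character is left.
            while (end > start + 1 && !dict.contains(text.substring(start, end))) {
                end--;
            }
            tokens.add(text.substring(start, end)); // a word, or a single-character fallback
            start = end; // continue right after the emitted token
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Hypothetical toy dictionary, for illustration only.
        Set<String> dict = new HashSet<>(
                Arrays.asList("they", "the", "you", "youth", "event", "vent"));
        System.out.println(segment("theyouthevent", dict, 5));
        // prints [they, o, u, the, vent]
    }
}
```

The output shows the typical weakness of the greedy forward scan: once "they" is taken, the remaining text can no longer be split into "youth" and "event".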

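4 Reverse maximum matching

Reverse maximum matching is a string-matching segmentation method: it scans the sentence from right to left, compares each candidate substring against the supplied keyword list, and extracts the substring as a keyword when it is found. Compared with forward maximum matching, reverse matching is slightly more accurate.

(Think of it as two pointers, start and end. With start=0 and end=n, the first candidate is substring(0, n). Moving start first and end second ensures that the longest string is found in the reverse direction.)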

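// lt is the dictionary of known words, declared elsewhere in the class.
// n is the end index of the text to segment, typically s.length().
// Words are printed from right to left, that is, in reverse order of appearance.
public static void matchrev(String s, int n) {
    int end = n;
    while (end > 0) {
        int start = 0; // reset the left pointer so the longest candidate is tried first
        for (; start < end; start++) {
            String key = s.substring(start, end); // longest candidate ending at end
            if (lt.contains(key)) { // is the candidate in the dictionary?
                System.out.println(key);
                break;
            }
        }
        // after a match, resume right before the word; otherwise skip one character
        end = (start < end) ? start : end - 1;
    }
}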
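A self-contained version of the right-to-left scan might look as follows. Again this is only a sketch under stated assumptions: the `Set<String>` dictionary, the `maxWordLen` cap, the single-character fallback, and the toy dictionary are all illustrative choices rather than part of the original snippet. Tokens are inserted at the front of the list so that the result reads in sentence order.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Set;

class ReverseMaxMatchSketch {

    // Splits text right to left, always taking the longest dictionary word
    // (at most maxWordLen characters) that ends at the current position.
    static List<String> segment(String text, Set<String> dict, int maxWordLen) {
        LinkedList<String> tokens = new LinkedList<>();
        int end = text.length();
        while (end > 0) {
            int start = Math.max(end - maxWordLen, 0);
            // Shrink the window from the left until it is a dictionary word
            // or only one character is left.
            while (start < end - 1 && !dict.contains(text.substring(start, end))) {
                start++;
            }
            tokens.addFirst(text.substring(start, end)); // keep sentence order
            end = start; // continue right before the emitted token
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Hypothetical toy dictionary, for illustration only.
        Set<String> dict = new HashSet<>(
                Arrays.asList("they", "the", "you", "youth", "event", "vent"));
        System.out.println(segment("theyouthevent", dict, 5));
        // prints [the, youth, event]
    }
}
```

With this dictionary a left-to-right greedy scan of the same input would commit to "they" first and leave stray single characters behind, while the right-to-left scan recovers all three words. It is one small instance of the accuracy difference described above, not evidence that reverse matching always wins.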