The HanLP Cultivation Guide: From Chinese Word Segmentation Novice to Semantic Analysis Grandmaster

五行星辰

Published 2025-04-12 20:53:04

License: CC 4.0 BY-SA

Category: Business System Application Technology. Tags: Chinese word segmentation, natural language processing, Java

Article link: https://blog.youkuaiyun.com/wang543203/article/details/147171168

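Fellow cultivators who have been tearing your hair out over Chinese word segmentation! Today we unlock the "Heaven Sword" of the NLP world: HanLP. It not only splits Chinese sentences into words but also tags place names, person names, and, with a custom dictionary, even internet slang. Ready to make your program truly "read" Chinese? 📚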

I. Foundation Establishment: Quick Start

1.1 Forging the Artifact (Adding the Dependency)

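<!-- Basic portable version (20MB+) -->
<dependency>
    <groupId>com.hankcs</groupId>
    <artifactId>hanlp</artifactId>
    <version>portable-1.8.4</version>
</dependency>

<!-- Full-featured version (1.5GB+, requires downloading the data package).
     Note: this coordinate may not resolve from Maven Central, where HanLP 1.x is
     published as portable-* versions; the HanLP README describes installing the full
     version from the release jar, the data package, and hanlp.properties. -->
<dependency>
    <groupId>com.hankcs</groupId>
    <artifactId>hanlp</artifactId>
    <version>1.8.4</version>
</dependency>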

1.2 Basic Segmentation (First Strike)

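import com.hankcs.hanlp.HanLP;
import com.hankcs.hanlp.seg.common.Term;

import java.util.List;
import java.util.stream.Collectors;

public class FirstDemo {
    public static void main(String[] args) {
        String text = "今天北京天安门广场人山人海";

        // Standard segmentation: each Term carries a word and its part-of-speech tag
        List<Term> termList = HanLP.segment(text);
        System.out.println(termList);
        // e.g. [今天/t, 北京/ns, 天安门广场/ns, 人山人海/i]

        // Join only the words into a single string
        System.out.println(termList.stream()
            .map(term -> term.word)
            .collect(Collectors.joining("/")));
        // e.g. 今天/北京/天安门广场/人山人海
    }
}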

II. Golden Core: Advanced Features

2.1 Part-of-Speech Tagging (Recognizing Person and Place Names)

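// Print each word with its part-of-speech tag
for (Term term : HanLP.segment("马云在杭州阿里巴巴总部发表演讲")) {
    System.out.printf("%s → %s\n", term.word, term.nature);
}
/*
Example output (the notes in parentheses explain each tag and are not printed):
马云 → nr (person name)
在 → p  (preposition)
杭州 → ns (place name)
阿里巴巴 → nz (other proper noun)
总部 → n  (noun)
发表 → v  (verb)
演讲 → vn (verbal noun)
*/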
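
Once tokens are tagged, pulling out person or place names is just a filter on the tag. The following is a minimal sketch, assuming the tagged tokens arrive as `word/nature` strings (the form shown in the printed term list earlier). `TagFilter` and `wordsWithPrefix` are hypothetical names for illustration, not part of HanLP.

```java
import java.util.ArrayList;
import java.util.List;

final class TagFilter {
    /** Returns the words whose nature tag starts with the given prefix, e.g. "nr" or "ns". */
    static List<String> wordsWithPrefix(List<String> taggedTokens, String tagPrefix) {
        List<String> words = new ArrayList<>();
        for (String token : taggedTokens) {
            int slash = token.lastIndexOf('/');
            if (slash <= 0) {
                continue; // skip tokens that lack a word/nature separator
            }
            String word = token.substring(0, slash);
            String nature = token.substring(slash + 1);
            if (nature.startsWith(tagPrefix)) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        List<String> tagged = List.of("马云/nr", "在/p", "杭州/ns", "阿里巴巴/nz", "总部/n");
        System.out.println(wordsWithPrefix(tagged, "nr")); // [马云]
        System.out.println(wordsWithPrefix(tagged, "ns")); // [杭州]
    }
}
```

Matching on a prefix rather than the exact tag means a broader prefix such as "n" would return every noun-like token, including the names.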

2.2 Keyword Extraction (The Digest Helper)

String content = "自然语言处理是人工智能的重要分支。"
    + "HanLP是一款优秀的中文NLP工具包。";
// Extract the top 3 keywords from the text
List<String> keywordList = HanLP.extractKeyword(content, 3);
System.out.println(keywordList); // e.g. [HanLP, 自然语言处理, 人工智能]
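
HanLP's keyword extraction is based on TextRank. To show the idea behind it, here is a rough sketch that ranks already-segmented tokens by their position in a co-occurrence graph. It is an illustration under simplifying assumptions, not HanLP's implementation: `TextRankSketch`, the `window` parameter, the fixed 50 iterations, and the damping factor 0.85 (the conventional choice) are all picked for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class TextRankSketch {
    /** Ranks tokens by TextRank score; `window` counts the word itself, so 2 links neighbours only. */
    static List<String> topKeywords(List<String> tokens, int window, int size) {
        // Build an undirected co-occurrence graph over words that appear within the window
        Map<String, Set<String>> graph = new LinkedHashMap<>();
        for (int i = 0; i < tokens.size(); i++) {
            String a = tokens.get(i);
            graph.computeIfAbsent(a, k -> new LinkedHashSet<>());
            for (int j = i + 1; j < Math.min(i + window, tokens.size()); j++) {
                String b = tokens.get(j);
                if (a.equals(b)) {
                    continue; // no self-links
                }
                graph.get(a).add(b);
                graph.computeIfAbsent(b, k -> new LinkedHashSet<>()).add(a);
            }
        }

        // Iterate score(w) = (1 - d) + d * sum over neighbours n of score(n) / degree(n)
        final double d = 0.85;
        Map<String, Double> score = new HashMap<>();
        for (String word : graph.keySet()) {
            score.put(word, 1.0);
        }
        for (int iter = 0; iter < 50; iter++) {
            Map<String, Double> next = new HashMap<>();
            for (Map.Entry<String, Set<String>> entry : graph.entrySet()) {
                double sum = 0.0;
                for (String neighbour : entry.getValue()) {
                    sum += score.get(neighbour) / graph.get(neighbour).size();
                }
                next.put(entry.getKey(), (1 - d) + d * sum);
            }
            score = next;
        }

        // Sort words by descending score and keep the first `size`
        final Map<String, Double> finalScore = score;
        List<String> words = new ArrayList<>(graph.keySet());
        words.sort((x, y) -> Double.compare(finalScore.get(y), finalScore.get(x)));
        return new ArrayList<>(words.subList(0, Math.min(size, words.size())));
    }

    public static void main(String[] args) {
        // "X" co-occurs with every other token, so it should rank first
        List<String> tokens = List.of("A", "X", "B", "X", "C", "X", "D");
        System.out.println(topKeywords(tokens, 2, 1)); // [X]
    }
}
```

A word scores highly when it co-occurs with many other words, which is why stop words are normally filtered out before ranking.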

2.3 Phrase Extraction (Automatic Summarization)

// Reuses the `content` string from section 2.2
List<String> phraseList = HanLP.extractPhrase(content, 3);
System.out.println(phraseList); // e.g. [自然语言处理, 重要分支, 优秀的中文NLP工具包]

III. Nascent Soul: Professional Configuration

3.1 Custom Dictionary (Recognizing New Words)

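import com.hankcs.hanlp.HanLP;
import com.hankcs.hanlp.dictionary.CustomDictionary;

// Load from a file: set the path before the custom dictionary is first used
HanLP.Config.CustomDictionaryPath =
    new String[]{"src/main/resources/custom/CustomDictionary.txt"};
CustomDictionary.reload();

// File format, one entry per line: [word] [nature] [frequency]
// 区块链 nz 1000
// 奥利给 i 1000

// Temporary addition (in memory only, lost on restart)
CustomDictionary.insert("区块链", "nz 1000");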
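
HanLP's documented line format for a custom dictionary is `[word] [nature] [frequency]`. If you generate the file from code, a small formatter keeps entries consistent. This is a sketch only: `CustomDictionaryLines` and `formatEntry` are hypothetical helpers, and rejecting whitespace inside a word is an assumption made here because the plain-text format is space-separated.

```java
import java.util.ArrayList;
import java.util.List;

final class CustomDictionaryLines {
    /** Formats one entry as "[word] [nature] [frequency]". */
    static String formatEntry(String word, String nature, int frequency) {
        if (word.isEmpty() || word.chars().anyMatch(Character::isWhitespace)) {
            throw new IllegalArgumentException("word must be non-empty and contain no whitespace");
        }
        if (frequency <= 0) {
            throw new IllegalArgumentException("frequency must be positive");
        }
        return word + " " + nature + " " + frequency;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add(formatEntry("区块链", "nz", 1000));
        lines.add(formatEntry("奥利给", "i", 1000));
        lines.forEach(System.out::println);
        // 区块链 nz 1000
        // 奥利给 i 1000
    }
}
```

The resulting lines can be written to the file that `HanLP.Config.CustomDictionaryPath` points at.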

3.2 Tuning the Segmentation Algorithm (Accuracy vs. Speed)

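// Dictionary-based segmenter ("dat"): the fastest option, at some cost in accuracy
List<Term> fastTerms = HanLP.newSegment("dat")
    .seg("今天天气真好");

// Default Viterbi segmenter, tuned: index mode emits finer-grained terms for
// search indexing, and turning off name recognition skips that step
List<Term> termList = HanLP.newSegment()
    .enableIndexMode(true)
    .enableNameRecognize(false)
    .seg("今天天气真好");

// CRF segmenter (higher accuracy); the CRF model files from the data package must be available
List<Term> crfTerms = HanLP.newSegment("crf")
    .enablePartOfSpeechTagging(true)
    .seg("中国科学院计算技术研究所");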
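
To see why pure dictionary matching is fast but can be wrong, here is a toy forward longest-match segmenter. It only illustrates the general technique and is not how HanLP implements its segmenters; `LongestMatchSketch`, the tiny dictionary, and `maxWordLength` are all made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class LongestMatchSketch {
    /** Greedy forward matching: at each position take the longest dictionary word, else one character. */
    static List<String> segment(String text, Set<String> dictionary, int maxWordLength) {
        List<String> words = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxWordLength);
            // Shrink the candidate until it is in the dictionary or only one character is left
            while (end > i + 1 && !dictionary.contains(text.substring(i, end))) {
                end--;
            }
            words.add(text.substring(i, end));
            i = end;
        }
        return words;
    }

    public static void main(String[] args) {
        Set<String> dictionary = Set.of("研究", "研究生", "生命", "起源");
        // Greedy matching grabs "研究生" first and leaves "命" stranded
        System.out.println(segment("研究生命起源", dictionary, 3)); // [研究生, 命, 起源]
    }
}
```

The intended reading is 研究 / 生命 / 起源, which a greedy matcher cannot recover. Statistical segmenters such as Viterbi or CRF spend extra time weighing alternatives to avoid exactly this kind of mistake.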

IV. Spirit Transformation: Practical Applications

4.1 Sensitive-Word Filtering (The Mountain-Guarding Array)

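import com.hankcs.hanlp.collection.AhoCorasick.AhoCorasickDoubleArrayTrie;
import java.util.TreeMap;

String text = "这里有赌博、毒品和敏感内容";
// Map each sensitive word to its mask; HanLP's build() takes a TreeMap
TreeMap<String, String> words = new TreeMap<>();
words.put("赌博", "**");
words.put("毒品", "**");
AhoCorasickDoubleArrayTrie<String> trie = new AhoCorasickDoubleArrayTrie<>();
trie.build(words);

// Replace each hit span [begin, end) from right to left so earlier offsets stay valid
// (uses Java 10+ `var`; assumes the hits do not overlap)
StringBuilder masked = new StringBuilder(text);
var hits = trie.parseText(text);
for (int i = hits.size() - 1; i >= 0; i--) {
    masked.replace(hits.get(i).begin, hits.get(i).end, hits.get(i).value);
}
System.out.println(masked); // 这里有**、**和敏感内容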
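
When sensitive words can overlap (say "毒品" and "毒品交易" are both in the list), replacing hit by hit gets fiddly. One simple alternative is to mask character by character. The sketch below assumes the hits have already been reduced to `[begin, end)` index pairs; `SpanMasker` is a hypothetical helper, not a HanLP class.

```java
import java.util.Arrays;
import java.util.List;

final class SpanMasker {
    /** Masks every character covered by at least one [begin, end) span with '*'. */
    static String mask(String text, List<int[]> spans) {
        char[] chars = text.toCharArray();
        for (int[] span : spans) {
            Arrays.fill(chars, span[0], span[1], '*');
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        // Two overlapping spans: characters 1..2 and 2..4
        List<int[]> spans = List.of(new int[]{1, 3}, new int[]{2, 5});
        System.out.println(mask("abcdef", spans)); // a****f
    }
}
```

Because each character is masked in place, overlapping spans are harmless and the output keeps the same length as the input.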

4.2 Text Similarity (Comparing Techniques)

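import com.hankcs.hanlp.mining.word2vec.DocVectorModel;
import com.hankcs.hanlp.mining.word2vec.WordVectorModel;

// Sentence similarity in HanLP 1.x goes through a word-vector model (DocVectorModel).
// "path/to/word2vec.txt" is a placeholder; the WordVectorModel constructor throws IOException.
DocVectorModel docVectorModel = new DocVectorModel(new WordVectorModel("path/to/word2vec.txt"));
double similarity = docVectorModel.similarity("我爱自然语言处理", "我喜欢NLP技术");
System.out.printf("Similarity: %.2f\n", similarity); // the value depends on the model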
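
A model-free baseline for comparing two texts is cosine similarity over token counts. The sketch below assumes both texts have already been segmented into word lists (for example by taking `term.word` from each segmented term); `CosineSimilarity` is a hypothetical helper written for this example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class CosineSimilarity {
    /** Cosine of the angle between the two token-count vectors; 0 when either side is empty. */
    static double cosine(List<String> a, List<String> b) {
        Map<String, Integer> countsA = count(a);
        Map<String, Integer> countsB = count(b);
        double dot = 0.0;
        for (Map.Entry<String, Integer> entry : countsA.entrySet()) {
            dot += entry.getValue() * countsB.getOrDefault(entry.getKey(), 0);
        }
        double normA = norm(countsA);
        double normB = norm(countsB);
        return (normA == 0 || normB == 0) ? 0.0 : dot / (normA * normB);
    }

    private static Map<String, Integer> count(List<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    private static double norm(Map<String, Integer> counts) {
        double sumOfSquares = 0.0;
        for (int c : counts.values()) {
            sumOfSquares += (double) c * c;
        }
        return Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        double similarity = cosine(List.of("我", "爱", "NLP"), List.of("我", "喜欢", "NLP"));
        System.out.printf("%.2f%n", similarity); // 0.67
    }
}
```

Token counting treats 爱 and 喜欢 as unrelated words, so this baseline only measures literal overlap; capturing synonyms is what word vectors are for.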

V. Mahayana: Performance Optimization

5.1 Caching Segmentation Results (Spirit Energy Circulation)

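import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;

// Guava cache keyed by input text, holding at most 10,000 entries
LoadingCache<String, List<Term>> segmentCache = CacheBuilder.newBuilder()
    .maximumSize(10000)
    .build(CacheLoader.from(HanLP::segment));

// getUnchecked avoids the checked ExecutionException that get() declares
List<Term> terms = segmentCache.getUnchecked("缓存分词结果");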
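
If you would rather not pull in Guava, the JDK alone can give you a small LRU cache. This is a minimal sketch: `SegmentCache` is a hypothetical class, and the segmenter is passed in as a plain `Function<String, List<String>>`, so plugging in HanLP (mapping each term to its word) is left as an assumption.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

final class SegmentCache {
    private final Map<String, List<String>> cache;
    private final Function<String, List<String>> segmenter;

    SegmentCache(int maxSize, Function<String, List<String>> segmenter) {
        this.segmenter = segmenter;
        // An access-ordered LinkedHashMap drops the least recently used entry when over capacity
        this.cache = new LinkedHashMap<String, List<String>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, List<String>> eldest) {
                return size() > maxSize;
            }
        };
    }

    /** Returns the cached result, computing and storing it on a miss. */
    synchronized List<String> get(String text) {
        List<String> cached = cache.get(text);
        if (cached == null) {
            cached = segmenter.apply(text);
            cache.put(text, cached);
        }
        return cached;
    }

    synchronized int size() {
        return cache.size();
    }

    public static void main(String[] args) {
        // Stand-in segmenter that splits into single characters
        SegmentCache cache = new SegmentCache(2, text -> List.of(text.split("")));
        System.out.println(cache.get("今天")); // [今, 天]
        System.out.println(cache.get("今天")); // served from the cache
    }
}
```

The `synchronized` methods keep it safe to share between threads at the cost of some contention; for heavy concurrent load a purpose-built cache library is the better fit.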

5.2 Parallel Processing (The Clone Technique)

// Segment the texts in parallel on the JVM-wide common fork-join pool
List<String> texts = Arrays.asList("文本1", "文本2", "...");
List<List<Term>> results = texts.parallelStream()
    .map(HanLP::segment)
    .collect(Collectors.toList());
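
Parallel streams share the JVM-wide common pool. When you want to cap how many threads segmentation may use, a dedicated executor is one option. The sketch below is generic on purpose: `BatchSegmenter` and `processAll` are hypothetical names, and the per-text task is passed in as a `Function`, so using `HanLP::segment` there is an assumption rather than something shown running here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class BatchSegmenter {
    /** Applies `task` to every text on a fixed-size pool and returns the results in input order. */
    static <R> List<R> processAll(List<String> texts, Function<String, R> task, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (String text : texts) {
                Callable<R> job = () -> task.apply(text);
                futures.add(pool.submit(job));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> future : futures) {
                results.add(future.get()); // blocks until this text is done
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for results", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // String::length stands in for a real segmentation call
        System.out.println(processAll(List.of("a", "bb", "ccc"), String::length, 2)); // [1, 2, 3]
    }
}
```

Collecting the futures in submission order is what keeps the results aligned with the input list, whichever thread finishes first.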

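Tribulation Guide: Common Problems

Problem	Solution
New words are not recognized	Add a custom dictionary or switch to a different segmentation algorithm
Domain terms are split incorrectly	Turn on forced custom-dictionary mode with `enableCustomDictionaryForcing`
Out of memory	Use the `portable` version or increase the JVM heap size
Slow segmentation	Disable features you do not need (such as NER) or use the dictionary-based `dat` segmenter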