NLP: Getting Started with the Stanford Word Segmenter

This post walks through using the Stanford Chinese word segmentation toolkit stanford-segmenter.jar: downloading it, configuring a project, and wrapping the segmenter in code, with particular attention to how the returned token list must be handled.


This example demonstrates the Stanford Chinese word segmentation toolkit, stanford-segmenter.jar.

Download: the package and its download notes are on the Stanford Word Segmenter page (https://nlp.stanford.edu/software/segmenter.shtml).

After unzipping, the contents look like this:
[Screenshot: contents of the unzipped stanford-segmenter distribution]
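The screenshot is not reproduced here; as a rough guide, a recent stanford-segmenter release unpacks to something like the following (names from current distributions and from the code below; your version may differ):

stanford-segmenter-YYYY-MM-DD/
  data/                       models: ctb.gz, pku.gz, dict-chris6.ser.gz, ...
  stanford-segmenter-x.x.x.jar
  SegDemo.java                the demo file used below
  segment.sh / segment.bat    command-line run scripts
  README-Chinese.txt, README-Arabic.txt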
Set up the project:

1. Create a new project.
2. Copy the data folder from the unzipped package into the project root.
3. Add the jar packages to the build path.
4. SegDemo.java in the package is the demo file.

Note: when SegDemo runs it reads the model files from the data directory, so that directory must be resolvable at run time (see the flag sketch after this note).
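The demo resolves its model directory through the SegDemo system property, falling back to data (see the basedir line in the code below). To point it somewhere else, add a JVM argument to your run configuration; a minimal sketch, with a hypothetical path:

java -DSegDemo=/path/to/stanford-segmenter/data <rest of your usual command line>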
Reading the source shows that the returned token collection segmented is built from an array via Arrays.asList(), so it is a fixed-size list and cannot be modified. To get a mutable result, rebuild it with a List constructor.
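The pitfall is easy to reproduce with the JDK alone; a minimal sketch (no segmenter involved, the class name is hypothetical):

import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;

public class FixedSizeListDemo {
  public static void main(String[] args) {
    // Arrays.asList returns a fixed-size view backed by the array:
    List<String> fixed = Arrays.asList("首脑", "会谈");
    try {
      fixed.add("举行"); // structural modification is rejected
    } catch (UnsupportedOperationException e) {
      System.out.println("add() failed: the list is fixed-size");
    }
    // Copying into a real List restores mutability:
    List<String> mutable = new LinkedList<>(fixed);
    mutable.add("举行"); // fine
    System.out.println(mutable); // [首脑, 会谈, 举行]
  }
}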
Below is a test case with the code wrapped into a small helper class:

package com.hhh.part;

import java.io.PrintStream;
import java.util.LinkedList;
import java.util.List;
import java.util.Properties;

import org.junit.jupiter.api.Test;

import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

public class PartWord {

  // Model directory; can be overridden with -DSegDemo=/path/to/data
  private static final String basedir = System.getProperty("SegDemo", "data");

  public static List<String> part(String sample) throws Exception {
    // Make sure console output is UTF-8 so Chinese tokens print correctly
    System.setOut(new PrintStream(System.out, true, "utf-8"));

    Properties props = new Properties();
    props.setProperty("sighanCorporaDict", basedir);
    props.setProperty("serDictionary", basedir + "/dict-chris6.ser.gz");
    props.setProperty("inputEncoding", "UTF-8");
    props.setProperty("sighanPostProcessing", "true");

    // Load the CRF segmenter with the Penn Chinese Treebank (CTB) model
    CRFClassifier<CoreLabel> segmenter = new CRFClassifier<>(props);
    segmenter.loadClassifierNoExceptions(basedir + "/ctb.gz", props);

    List<String> segmented = segmenter.segmentString(sample);

    // segmentString returns a fixed-size list (Arrays.asList), so copy it
    // into a LinkedList to get a mutable result
    return new LinkedList<>(segmented);
  }

  @Test
  public void test1() {
    try {
      System.out.println(part("韩国《中央日报》则报道称,有人推测,"
          + "第二次朝美首脑会谈的时间和地点有可能定于10月下旬左右在华盛顿举行。"
          + "这一时期正好是对特朗普总统进行具有“期中考核”性质的11月6日美国中期选举之前。"
          + "若第二次朝美首脑会谈在美国举行,将成为朝鲜首脑的第一次访美。"
          + "然而正如朝鲜曾强烈要求第一次朝美首脑会谈在平壤举行一样,"
          + "此次朝鲜也有可能提出在平壤举行会谈。"));
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}
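Outside of JUnit, the same helper can be driven from a plain main method; a minimal sketch (this entry point is not part of the original demo):

// Hypothetical standalone entry point, equivalent to the test above.
public static void main(String[] args) throws Exception {
  List<String> tokens = PartWord.part("第二次朝美首脑会谈有可能在华盛顿举行。");
  System.out.println(String.join(" ", tokens));
}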

For reference, here is the tool's official description from the Stanford Word Segmenter documentation:

Tokenization of raw text is a standard pre-processing step for many NLP tasks. For English, tokenization usually involves punctuation splitting and separation of some affixes like possessives. Other languages require more extensive token pre-processing, which is usually called segmentation. The Stanford Word Segmenter currently supports Arabic and Chinese. The provided segmentation schemes have been found to work well for a variety of applications.

The system requires Java 1.6+ to be installed. We recommend at least 1G of memory for documents that contain long sentences. For files with shorter sentences (e.g., 20 tokens), decrease the memory requirement by changing the option java -mx1g in the run scripts.

Arabic

Arabic is a root-and-template language with abundant bound morphemes. These morphemes include possessives, pronouns, and discourse connectives. Segmenting bound morphemes reduces lexical sparsity and simplifies syntactic analysis. The Arabic segmenter model processes raw text according to the Penn Arabic Treebank 3 (ATB) standard. It is a stand-alone implementation of the segmenter described in: Spence Green and John DeNero. 2012. A Class-Based Agreement Model for Generating Accurately Inflected Translations. In ACL.

Chinese

Chinese is standardly written without spaces between words (as are some other languages). This software will split Chinese text into a sequence of words, defined according to some word segmentation standard. It is a Java implementation of the CRF-based Chinese Word Segmenter described in: Huihsin Tseng, Pichuan Chang, Galen Andrew, Daniel Jurafsky and Christopher Manning. 2005. A Conditional Random Field Word Segmenter. In Fourth SIGHAN Workshop on Chinese Language Processing.

Two models with two different segmentation standards are included: Chinese Penn Treebank standard and Peking University standard. On May 21, 2008, we released a version that makes use of lexicon features. With external lexicon features, the segmenter segments more consistently and also achieves higher F measure when we train and test on the bakeoff data.
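The two Chinese models ship as ctb.gz and pku.gz in the data directory. Assuming those file names (they match the releases this post appears to use), switching the part() helper above to the Peking University standard is a one-line change; a sketch:

// Load the Peking University model instead of the CTB one.
// Assumes data/pku.gz is present in the unzipped distribution.
segmenter.loadClassifierNoExceptions(basedir + "/pku.gz", props);

From a shell, the bundled run script takes the standard as its first argument, e.g. ./segment.sh ctb test.simp.utf8 UTF-8 0, where test.simp.utf8 is a sample file shipped with the package.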