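OpenNLP Command Line
1 Installation
- Download
- Download page: https://opennlp.apache.org/download.html. Older releases are available at https://archive.apache.org/dist/opennlp/.
- After downloading, extract the archive to a directory of your choice. In my case: E:\Software\NLP\apache-opennlp1.9.1
- Environment variables
- Create a new variable with the following name and value: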
OPENNLP_HOME
E:\Software\NLP\apache-opennlp1.9.1
- Append to the CLASSPATH variable:
%OPENNLP_HOME%\lib;
- Append to Path:
%OPENNLP_HOME%\bin;
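On Linux or macOS, the equivalent setup can be sketched in the shell as follows. The install path /opt/apache-opennlp-1.9.1 is an assumption chosen for illustration; substitute the directory where the archive was actually extracted.

```shell
# Hypothetical install location; replace with your own extraction directory.
export OPENNLP_HOME="/opt/apache-opennlp-1.9.1"

# Make the opennlp launcher script in bin/ callable from any directory.
export PATH="$PATH:$OPENNLP_HOME/bin"

# Confirm the variable took effect.
echo "OPENNLP_HOME=$OPENNLP_HOME"
```

To keep the settings across sessions, the two export lines can be added to a shell startup file such as ~/.bashrc.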
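- Usage
On Linux, use the opennlp script in the bin directory; on Windows, use opennlp.bat.
Example: if the current directory contains a file named sentences.txt, tokenize the sentences in it as follows:
Linux
./opennlp SimpleTokenizer < sentences.txt
Windows
opennlp.bat SimpleTokenizer < sentences.txt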
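A minimal sketch of the Linux invocation from start to finish. The two sample sentences and the output file name tokens.txt are made up for illustration, and the guard assumes the launcher is reachable as opennlp on the PATH; the tokenizer call is skipped otherwise.

```shell
# Create a small input file, one sentence per line (sample text is illustrative).
printf '%s\n' \
  'Pierre Vinken will join the board as a director.' \
  'Mr. Vinken is chairman of the company.' > sentences.txt

# Run the tokenizer only if the opennlp launcher is on the PATH.
if command -v opennlp >/dev/null 2>&1; then
  # Input is read from stdin; stdout is redirected into tokens.txt.
  opennlp SimpleTokenizer < sentences.txt > tokens.txt
else
  echo "opennlp not found on PATH; skipping tokenization" >&2
fi
```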
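1.2 Tool list
LanguageDetector #language detection
LanguageDetectorTrainer #trains a language detection model
LanguageDetectorConverter #converts the Leipzig data format to the native OpenNLP format
LanguageDetectorCrossValidator #K-fold cross-validator
LanguageDetectorEvaluator #measures the performance of a language detection model
DictionaryBuilder #builds a dictionary
SentenceDetector #sentence detection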
SentenceDetectorTrainer
SentenceDetectorEvaluator
SentenceDetectorCrossValidator
SentenceDetectorConverter
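SimpleTokenizer #character-class tokenizer
TokenizerME #tokenization
TokenizerTrainer #trains a tokenizer model
TokenizerMEEvaluator
TokenizerCrossValidator
TokenizerConverter #converts foreign data formats to the native OpenNLP format
DictionaryDetokenizer
TokenNameFinder #named entity recognition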
TokenNameFinderTrainer
TokenNameFinderEvaluator
TokenNameFinderCrossValidator
TokenNameFinderConverter
CensusDictionaryCreator #converts 1990 US Census names into a dictionary
Doccat #document categorization
DoccatTrainer
DoccatCrossValidator
DoccatConverter
POSTagger #part-of-speech tagging
POSTaggerTrainer
POSTaggerEvaluator
POSTaggerCrossValidator
POSTaggerConverter
LemmatizerME #lemmatization
LemmatizerTrainerME
LemmatizerEvaluator
ChunkerME #chunking
ChunkerTrainerME
ChunkerEvaluator
ChunkerCrossValidator
ChunkerConverter
Parser #syntactic parsing
ParserTrainer
ParserEvaluator
ParserConverter
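BuildModelUpdater #trains and updates the build model of a parser model
CheckModelUpdater #trains and updates the check model of a parser model
TaggerModelReplacer #replaces the tagger model in a parser model
EntityLinker #links entities to an external data set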
NGramLanguageModel
1.3 Detailed usage
1.3.1 Sentence detector
- SentenceDetector
Usage: opennlp SentenceDetector model < sentences
Arguments description:
model
the sentence detector model file.
sentences
the file to process, read from standard input.
Example:
opennlp.bat SentenceDetector ch_sentence_detector.bin < sentences.txt > output.txt
- SentenceDetectorTrainer
Usage: opennlp SentenceDetectorTrainer[.irishsentencebank|.ad|.pos|.conllx|.namefinder|.parse|.moses|.conllu|.letsmt]
[-factory factoryName]
[-eosChars string]
[-abbDict path]
[-params paramsFile]
-lang language
-model modelFile
-data sampleData
[-encoding charsetName]
Arguments description:
-factory factoryName
A sub-class of SentenceDetectorFactory where to get implementation and resources.
-eosChars string
EOS characters.
-abbDict path
abbreviation dictionary in XML format.
-params paramsFile
training parameters file.
-lang language
language which is being processed.
-model modelFile
output model file.
-data sampleData
data to be used, usually a file name.
-encoding charsetName
encoding for reading and writing text, if absent the system default is used.
Example:
opennlp.bat SentenceDetectorTrainer -model ch_sentence_detector.bin -lang jpn -data ch_sentence_detector.train -encoding UTF-8
Note: when training on Chinese text with the default end-of-sentence characters, -lang must be set to jpn.
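The trainer reads its -data file in OpenNLP's native sentence format: one sentence per line, with an empty line marking a document boundary. The sketch below writes a tiny file in that format and then calls the trainer with the same file and model names as the example above. The sentences are invented sample text and far too few for a usable model; a real run needs a much larger corpus and may fail outright on a file this small. The guard assumes the launcher is reachable as opennlp on the PATH.

```shell
# Write training data: one sentence per line, empty line = document boundary.
# The sentences are invented sample text, not a real corpus.
cat > ch_sentence_detector.train <<'EOF'
今天天气很好。
我们去公园散步吧!
你想一起去吗?

这是第二篇文档的第一句话。
这是第二篇文档的第二句话。
EOF

# Train only if the opennlp launcher is available on the PATH.
if command -v opennlp >/dev/null 2>&1; then
  opennlp SentenceDetectorTrainer \
    -model ch_sentence_detector.bin \
    -lang jpn \
    -data ch_sentence_detector.train \
    -encoding UTF-8
else
  echo "opennlp not found on PATH; skipping training" >&2
fi
```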
- SentenceDetectorEvaluator
Usage: opennlp SentenceDetectorEvaluator[.nkjp|.irishsentencebank|.ad|.pos|.conllx|.namefinder|.parse|.moses|.conllu|.letsmt]
-model model
[-misclassified true|false]
-data sampleData
[-encoding charsetName]
Arguments description:
-model model
the model file to be evaluated.
-misclassified true|false
if true will print false negatives and false positives.
-data sampleData
data to be used, usually a file name.
-encoding charsetName
encoding for reading and writing text, if absent the system default is used.
Example:
opennlp.bat SentenceDetectorEvaluator -model ch_sentence_detector.bin -misclassified true -data sentences.txt -encoding UTF-8
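Evaluation figures are more meaningful when the evaluation data was not used for training. One way to arrange that with standard shell tools is a simple hold-out split, sketched below. Everything named here is an assumption for illustration: the placeholder corpus.txt generated by the script, the 80/20 split ratio, the train.txt, test.txt, and model.bin file names, and the eng language code chosen to match the English placeholder text.

```shell
# Build a placeholder corpus so the script is self-contained (replace with real data).
: > corpus.txt
i=1
while [ "$i" -le 10 ]; do
  echo "This is sample sentence number $i." >> corpus.txt
  i=$((i + 1))
done

# Split: the first 80% of the lines for training, the remainder for evaluation.
total=$(wc -l < corpus.txt | tr -d ' ')
train_n=$((total * 8 / 10))
head -n "$train_n" corpus.txt > train.txt
tail -n "+$((train_n + 1))" corpus.txt > test.txt

# Train on train.txt, then evaluate on the held-out test.txt.
if command -v opennlp >/dev/null 2>&1; then
  # "eng" is an assumed language code for the placeholder text.
  opennlp SentenceDetectorTrainer -model model.bin -lang eng \
    -data train.txt -encoding UTF-8
  opennlp SentenceDetectorEvaluator -model model.bin -misclassified true \
    -data test.txt -encoding UTF-8
else
  echo "opennlp not found on PATH; skipping training and evaluation" >&2
fi
```

A plain line-based split like this ignores document boundaries, so for a corpus that uses empty lines between documents it may be preferable to split at a boundary instead.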