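Sphinx does not support Chinese word segmentation out of the box. Several segmentation components have been written for it in China; this article covers installing LibMMSeg, a Chinese word segmentation package designed by Coreseek.com for the Sphinx full-text search engine. It is released under the GPL and uses Chih-Hao Tsai's MMSEG algorithm.
First, download the package that contains LibMMSeg (see http://www.coreseek.cn/news/7/99/), as follows: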
cd /usr/local/src/
wget http://www.coreseek.cn/uploads/csft/3.2/coreseek-3.2.13.tar.gz -c
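If the version needs to change later, the download URL can be built from the version number instead of being typed by hand. The following is a minimal sketch; the function name `coreseek_url` is hypothetical, and it assumes that other releases follow the same directory layout as 3.2.13, which this article does not confirm.

```shell
# Build the download URL for a given Coreseek version.
# Assumption: other releases use the same layout as 3.2.13, i.e.
#   .../uploads/csft/<major.minor>/coreseek-<version>.tar.gz
coreseek_url() {
    version="$1"
    series="${version%.*}"   # drop the last component: 3.2.13 -> 3.2
    printf 'http://www.coreseek.cn/uploads/csft/%s/coreseek-%s.tar.gz\n' \
        "$series" "$version"
}

# Usage (needs network access, so it is left commented out):
#   cd /usr/local/src/
#   wget -c "$(coreseek_url 3.2.13)"
```

The `-c` flag lets wget resume a partially downloaded file, as in the command above.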
Then extract the archive:
tar -zxv -f coreseek-3.2.13.tar.gz
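Before changing into the source tree, it can help to confirm that the archive really contains the mmseg directory used in the next step. This is a small sketch under that assumption; `has_mmseg_dir` is a hypothetical helper name, not part of Coreseek.

```shell
# Return 0 if the gzip-compressed tarball $1 lists the directory
# coreseek-<version>/mmseg-<version>/ for the version given as $2.
has_mmseg_dir() {
    archive="$1"
    version="$2"
    # -t lists the archive without extracting; grep -F matches the path literally.
    tar -tzf "$archive" | grep -q -F "coreseek-$version/mmseg-$version/"
}

# Usage:
#   has_mmseg_dir coreseek-3.2.13.tar.gz 3.2.13 && echo "mmseg sources found"
```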
Change into the mmseg directory and run configure:
cd coreseek-3.2.13/mmseg-3.2.13/
./configure --prefix=/usr/local/mmseg
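During this step, configure failed with the error "config.status: error: cannot find input file: src/Makefile.in". Running the following commands to regenerate the build files, and then configuring again, resolved it: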
aclocal
libtoolize --force
automake --add-missing
autoconf
autoheader
make clean
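The regeneration is only needed when src/Makefile.in is missing, so it can be guarded by a check. The sketch below wraps the same command sequence in two functions; the names `needs_regen` and `regen_build_system` are hypothetical. It leaves out `make clean`, on the assumption that nothing has been built yet at this point.

```shell
# Return 0 when the source tree at $1 lacks the generated src/Makefile.in,
# the file named in the config.status error above.
needs_regen() {
    [ ! -f "$1/src/Makefile.in" ]
}

# Rerun the autotools sequence from this article inside the directory $1.
# The subshell keeps the caller's working directory unchanged.
regen_build_system() {
    (
        cd "$1" || exit 1
        aclocal &&
        libtoolize --force &&
        automake --add-missing &&
        autoconf &&
        autoheader
    )
}

# Usage, from inside the mmseg source directory:
#   if needs_regen .; then regen_build_system .; fi
#   ./configure --prefix=/usr/local/mmseg
```

Chaining the commands with `&&` stops the sequence at the first failure, which makes a missing tool such as libtoolize easier to spot.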
Then configure again, build, and install:
./configure --prefix=/usr/local/mmseg
make && make install
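Make the mmseg command available on the PATH by symlinking it into /bin, then run mmseg. If the installation succeeded, it prints its usage information: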
ln -s /usr/local/mmseg/bin/mmseg /bin/mmseg
mmseg
Coreseek COS(tm) MM Segment 1.0
Copyright By Coreseek.com All Right Reserved.
Usage: mmseg <option> <file>
-u <unidict> Unigram Dictionary
-r Combine with -u, used a plain text build Unigram Dictionary, default Off
-b <Synonyms> Synonyms Dictionary
-t <thesaurus> Thesaurus Dictionary
-h print this help and exit
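The same check can be scripted, for example in a provisioning script. This is a minimal sketch; `mmseg_banner_ok` is a hypothetical helper that looks for the "MM Segment" banner shown above. The exit status of mmseg when run without arguments is not documented here, so the sketch ignores it and inspects only the output.

```shell
# Return 0 if running the command $1 (default: mmseg) with no arguments
# prints the "MM Segment" banner, and 1 otherwise.
mmseg_banner_ok() {
    cmd="${1:-mmseg}"
    # Capture stdout and stderr; the exit status is deliberately ignored.
    out=$("$cmd" 2>&1 || true)
    case "$out" in
        *"MM Segment"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage:
#   mmseg_banner_ok && echo "mmseg is installed"
#   mmseg_banner_ok /usr/local/mmseg/bin/mmseg   # check without the symlink
```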