Building a UTF-8 Index with Zend_Search_Lucene
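Zend_Search_Lucene is a PHP component that ships with Zend Framework. Because my content is Chinese, the index has to be built in UTF-8, but once the index was built I found that searching no longer worked correctly.
It turns out that at search time, even when the string you type is already UTF-8, you still have to set the default encoding of the query string explicitly. The indexing and search listings further down should make this clear.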
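To see why the analyzer and encoding matter, here is a minimal self-contained sketch. It does not use Zend at all: `asciiTokens` and `utf8Tokens` are hypothetical helpers that only mimic the difference between a byte-oriented, ASCII-only tokenizer and a UTF-8 aware one. It assumes PCRE with Unicode support and the mbstring extension.

```php
<?php
// Hypothetical stand-ins for illustration only; this is not Zend's analyzer code.

/** Byte-oriented tokenizer: keeps only runs of ASCII letters, lowercased. */
function asciiTokens($text)
{
    preg_match_all('/[a-zA-Z]+/', $text, $m);
    return array_map('strtolower', $m[0]);
}

/** UTF-8 aware tokenizer: keeps runs of any Unicode letter (note the /u flag). */
function utf8Tokens($text)
{
    preg_match_all('/\p{L}+/u', $text, $m);
    $tokens = array();
    foreach ($m[0] as $token) {
        $tokens[] = mb_strtolower($token, 'UTF-8');
    }
    return $tokens;
}

$text = 'PHP 中文 搜索';
print_r(asciiTokens($text)); // only "php" survives; the Chinese text is dropped
print_r(utf8Tokens($text));  // "php", "中文", "搜索"
```

In this sketch the UTF-8 aware version keeps every unbroken run of Chinese characters as one token, so it does no word segmentation. That is one way to read why UTF-8 indexing is only a first step toward Chinese search.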
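Building the index in UTF-8 is only the first step toward Chinese search; in a few days I will write another post on indexing Chinese text with Zend_Search_Lucene.
(The code below is for testing only.)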
The code that builds the index:
<?php
/**
 * @see Zend_Feed
 */
require_once 'Zend/Feed.php';

/**
 * @see Zend_Search_Lucene
 */
require_once 'Zend/Search/Lucene.php';

// Create the index (the second argument, true, creates a new index)
$index = new Zend_Search_Lucene(dirname(__FILE__) . '/index', true);

// Use the UTF-8 analyzer, which is also case-insensitive.
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8_CaseInsensitive()
);

// Index each item
$rss = Zend_Feed::import('http://feeds.feedburner.com/ZendDeveloperZone');
foreach ($rss->items as $item) {
    $doc = new Zend_Search_Lucene_Document();
    if ($item->link && $item->title && $item->description) {
        // Pass the charset explicitly: before PHP 5.4, htmlentities() assumed ISO-8859-1.
        $link = htmlentities(strip_tags($item->link()), ENT_COMPAT, 'UTF-8');
        $doc->addField(Zend_Search_Lucene_Field::UnIndexed('link', $link, 'utf-8'));

        $title = htmlentities(strip_tags($item->title()), ENT_COMPAT, 'UTF-8');
        $doc->addField(Zend_Search_Lucene_Field::Text('title', $title, 'utf-8'));

        $contents = htmlentities(strip_tags($item->description()), ENT_COMPAT, 'UTF-8');
        $doc->addField(Zend_Search_Lucene_Field::Text('contents', $contents, 'utf-8'));

        echo "Adding {$item->title()}\n";
        $index->addDocument($doc);
    }
}
$index->commit();
The search code:
<?php
/**
 * @see Zend_Search_Lucene
 */
require_once 'Zend/Search/Lucene.php';

$index = new Zend_Search_Lucene(dirname(__FILE__) . '/index');
echo "Index contains {$index->count()} documents.\n";

// Required: tells Lucene how to process (tokenize) the strings.
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8_CaseInsensitive()
);

$search = 'php';

// Required: tells the query parser the encoding of the query string.
// My Zend version is fairly old and finds nothing without this call;
// newer versions reportedly do not need it.
Zend_Search_Lucene_Search_QueryParser::setDefaultEncoding('utf-8');

$hits = $index->find(strtolower($search));
echo "Search for \"$search\" returned " . count($hits) . " hits.\n\n";

foreach ($hits as $hit) {
    echo str_repeat('-', 80) . "\n";
    echo 'ID:    ' . $hit->id . "\n";
    echo 'Score: ' . sprintf('%.2f', $hit->score) . "\n\n";
    foreach ($hit->getDocument()->getFieldNames() as $field) {
        echo "$field:\n";
        echo '    ' . trim(substr($hit->$field, 0, 76)) . "\n";
    }
}
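One detail in the listing above deserves care once the fields hold Chinese text: `substr()` counts bytes, so cutting at a fixed offset can split a multibyte UTF-8 character. The following is a minimal sketch of a character-based alternative. The `snippet` helper is hypothetical (it is not part of Zend), and it assumes the mbstring extension is available.

```php
<?php
// Hypothetical helper for illustration; requires the mbstring extension.
function snippet($text, $maxChars = 76)
{
    // mb_substr() counts characters in the given encoding, not bytes.
    return trim(mb_substr($text, 0, $maxChars, 'UTF-8'));
}

$title = '中文搜索';               // 4 characters, 12 bytes in UTF-8

$byteCut = substr($title, 0, 4);   // 4 bytes: cuts through the second character
$charCut = snippet($title, 2);     // 2 characters

var_dump(mb_check_encoding($byteCut, 'UTF-8')); // bool(false): no longer valid UTF-8
var_dump($charCut === '中文');                   // bool(true)
```

In the search loop, this would mean replacing `trim(substr($hit->$field, 0, 76))` with `snippet($hit->$field)`, under the same assumptions.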