Lucene Study Notes: Building an Index

Licensed under CC 4.0 BY-SA


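This article walks through the index-building process: extracting text, analyzing documents, and adding documents to the index. It focuses on how index files are organized and how to perform indexing with the Lucene library. It also covers deleting documents through IndexReader and IndexWriter, including when deletions take effect.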

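Building an Index

2.2 Understanding the indexing process

Text is first extracted from the raw data and used to create a Document instance. The Document contains several Field instances, which together hold the raw data. Analysis then turns the field text into a stream of tokens, and finally the tokens are added to the index's segment structure.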

2.2.1 Extracting text and creating the document

The details of text extraction are covered in Chapter 7, together with the Tika framework.

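2.2.2 Analyzing the document

During indexing, Lucene first analyzes the text, splitting it into a stream of tokens. For Chinese text this mainly means word segmentation and stop-word removal. The resulting tokens are then written to the index files.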
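The analysis step can be illustrated with a toy sketch. It is not Lucene's Analyzer API: the class name ToyAnalysis and its stop-word list are made up for this example, and it splits on whitespace only, so it suits space-separated text. Chinese text would need a real word segmenter in place of the split.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy illustration of analysis; this is not Lucene's Analyzer API.
class ToyAnalysis {
    // Hypothetical stop-word list chosen for this example only.
    static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("the", "a", "of", "is"));

    // Split on whitespace, lowercase each piece, and drop stop words.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String piece : text.trim().split("\\s+")) {
            String token = piece.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [structure, lucene, index]
        System.out.println(analyze("The structure of a Lucene index"));
    }
}
```

The tokens that survive this step are what an index would store as terms. The stop words never reach the index at all.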

2.2.3 Adding the document to the index

A Lucene index directory is organized around one kind of structure: the index segment.

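Index segments: every Lucene index consists of one or more segments. Each segment is a standalone index that holds a subset of all the indexed documents. A new segment is created whenever the writer flushes buffered added documents and pending deletions to the directory. At search time each segment is visited separately, and the results are combined before being returned.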
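The relationship between flushing, segments, and searching can be modelled in a few lines. This is only a toy model under simplifying assumptions, not how Lucene stores or searches data: ToySegmentedIndex and its nested Segment class are hypothetical names, and each "segment" is just an in-memory map from term to document numbers.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of segments; this is not how Lucene stores or searches data.
class ToySegmentedIndex {
    // One segment: an independent inverted index over a subset of the documents.
    static class Segment {
        final Map<String, List<Integer>> postings = new HashMap<String, List<Integer>>();

        void add(int docId, String text) {
            for (String term : text.toLowerCase().split("\\s+")) {
                List<Integer> docs = postings.get(term);
                if (docs == null) {
                    docs = new ArrayList<Integer>();
                    postings.put(term, docs);
                }
                if (!docs.contains(docId)) {
                    docs.add(docId);
                }
            }
        }

        List<Integer> search(String term) {
            List<Integer> docs = postings.get(term);
            return docs == null ? Collections.<Integer>emptyList() : docs;
        }
    }

    final List<Segment> segments = new ArrayList<Segment>();
    private Segment buffer = new Segment();
    private int nextDocId = 0;

    // Buffer a document in memory; it is not searchable until flush().
    void addDocument(String text) {
        buffer.add(nextDocId++, text);
    }

    // Flushing turns the buffered documents into a new segment.
    void flush() {
        if (!buffer.postings.isEmpty()) {
            segments.add(buffer);
            buffer = new Segment();
        }
    }

    // Visit every segment separately, then combine the hits into one result.
    List<Integer> search(String term) {
        List<Integer> hits = new ArrayList<Integer>();
        for (Segment segment : segments) {
            hits.addAll(segment.search(term));
        }
        Collections.sort(hits);
        return hits;
    }

    public static void main(String[] args) {
        ToySegmentedIndex index = new ToySegmentedIndex();
        index.addDocument("lucene index");
        index.flush();
        index.addDocument("lucene segment");
        index.addDocument("merge policy");
        index.flush();
        // prints: 2 segments, hits for lucene: [0, 1]
        System.out.println(index.segments.size() + " segments, hits for lucene: "
                + index.search("lucene"));
    }
}
```

Two flushes produce two segments. The search for "lucene" finds one hit in each and returns them as a single list.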

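Each segment consists of several files named _X.<ext>, where X is the segment name and <ext> is an extension identifying which part of the index the file holds (term vectors, stored fields, the inverted index, and so on). With the compound file format (Lucene's default, which can be changed through IndexWriter.setUseCompoundFile), these files are combined into a single file, _X.cfs. This reduces the number of files that must be open during searching.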

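There is also one special file, the segments file, named segments_<N>, which references all live segments. Lucene opens this file first and then opens the files it points to. The value <N> increases by one every time a change is committed to the index.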
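The naming scheme can be made concrete with a small helper that only looks at file names. It does not read a real index: ToyIndexFileNames is a hypothetical class, the directory listing is made up, and the sketch assumes the <N> suffix is written in base 36, which should be checked against the Lucene version in use.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Toy helper that interprets index file names; it does not read a real index.
class ToyIndexFileNames {
    // Return the largest generation among names of the form segments_<N>, or -1 if none.
    // Assumption: <N> is written in base 36 (Character.MAX_RADIX).
    static long latestGeneration(List<String> fileNames) {
        long latest = -1;
        for (String name : fileNames) {
            if (name.startsWith("segments_")) {
                long gen = Long.parseLong(
                        name.substring("segments_".length()), Character.MAX_RADIX);
                latest = Math.max(latest, gen);
            }
        }
        return latest;
    }

    // Collect segment names (the X in _X.<ext>) from per-segment files.
    static Set<String> segmentNames(List<String> fileNames) {
        Set<String> names = new TreeSet<String>();
        for (String name : fileNames) {
            int dot = name.indexOf('.');
            if (name.startsWith("_") && dot > 1) {
                names.add(name.substring(1, dot));
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Hypothetical directory listing, made up for this example.
        List<String> files = Arrays.asList(
                "_0.cfs", "_1.fdt", "_1.fdx", "_1.tis", "segments_2", "segments_3");
        System.out.println("segments: " + segmentNames(files));             // segments: [0, 1]
        System.out.println("latest generation: " + latestGeneration(files)); // latest generation: 3
    }
}
```

In this listing segment _0 uses the compound format (one .cfs file), while segment _1 is spread over several files.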

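Over time an index accumulates many segments, especially when the program opens and closes the writer frequently. IndexWriter periodically selects some segments and merges them into a single new segment.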

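2.3 Basic index operations

2.3.1 Adding documents to an index

There are two methods for adding a document:

addDocument(Document): adds the document using the default analyzer, which is specified when the IndexWriter is created, to tokenize it.

addDocument(Document, Analyzer): adds the document and tokenizes it with the specified analyzer.

The complete indexing code is as follows: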

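import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.Date;

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.TermVector;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class LuceneIndex {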
	public static void main(String[] args) throws Exception {
		//A path to a directory where we store the Lucene index
		File indexDir = new File("F:\\ntcr_index");
		//A path to a directory that contains the files we want to index
		File dataDir = new File("F:\\NTCR_ChangeCodeToUTF");
		long start = new Date().getTime();
		int numIndexed = index(indexDir, dataDir);//get the number of Indexed
		long end = new Date().getTime();
		System.out.println("Indexed " + numIndexed + " files in " + (end - start) + " ms.");
	}
	// Open the index and start traversing the data directory
	public static int index(File indexDir, File dataDir) throws IOException {
		// indexDir: the Lucene index is created in this directory
		// dataDir: the .txt files under this directory are indexed
		if (!dataDir.exists() || !dataDir.isDirectory()) {
			throw new IOException(dataDir + " does not exist or is not a directory.");
		}
		/*
		 * IndexWriter usage for Lucene 3.2.0 and later
		 */
		// Use the same Version constant for the analyzer and the writer config
		WhitespaceAnalyzer analyzer = new WhitespaceAnalyzer(Version.LUCENE_34);

		Directory directory = FSDirectory.open(indexDir);
		IndexWriterConfig indexConfig = new IndexWriterConfig(
				Version.LUCENE_34, analyzer);
		indexConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
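		IndexWriter writer = new IndexWriter(directory, indexConfig);
		try {
			indexDirectory(writer, dataDir);
			// With CREATE_OR_APPEND this counts every document in the index,
			// including any that were added by earlier runs
			int numIndexed = writer.numDocs();
			System.out.println("Optimizing...");
			System.out.println("Please wait...");
			// optimize() is the Lucene 3.x call; later releases replaced it with forceMerge(1)
			writer.optimize();
			return numIndexed;
		} finally {
			// Always release the index write lock, even if indexing fails
			writer.close();
		}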
	}
	// Recursive method that calls itself when it finds a directory
	private static void indexDirectory(IndexWriter writer, File dir)
			throws IOException {
		File[] files = dir.listFiles(); // null if the directory cannot be read
		if (files == null) {
			return;
		}
		for (int i = 0; i < files.length; i++) {
			File f = files[i];
			if (f.isDirectory()) {
				indexDirectory(writer, f);
			} else if (f.getName().endsWith(".txt")) {
				indexFile(writer,f);
			}
		}
	}
	// Index a single file with Lucene
	private static void indexFile(IndexWriter writer, File f) throws IOException {
		if (f.isHidden() || !f.exists() || !f.canRead()) {
			return;
		}
		System.out.println("Indexing " + f.getCanonicalPath());

		// Read the whole file and always close the reader.
		// FileReader decodes with the platform default charset.
		StringBuilder text = new StringBuilder();
		BufferedReader reader = new BufferedReader(new FileReader(f));
		try {
			String line;
			while ((line = reader.readLine()) != null) {
				text.append(line).append('\n');
			}
		} finally {
			reader.close();
		}

		// Every field below is stored, analyzed, and indexed with term vectors
		Document doc = new Document();
		doc.add(new Field("FilePath", f.getCanonicalPath(), Field.Store.YES,
				Field.Index.ANALYZED, TermVector.YES));
		doc.add(new Field("FileName", f.getName(), Field.Store.YES,
				Field.Index.ANALYZED, TermVector.YES));
		doc.add(new Field("textField", text.toString(), Field.Store.YES,
				Field.Index.ANALYZED, TermVector.YES));
		// Add the document to the Lucene index
		writer.addDocument(doc);
	}
}

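2.13.1 Deleting documents with IndexReader

1) IndexReader can delete a document by its document number.

2) IndexReader can delete documents by Term, as IndexWriter can, but IndexReader also returns the number of documents deleted.

3) If the program searches with the same reader, IndexReader's deletions take effect immediately, whereas deletions made through IndexWriter are not visible until the program opens a new reader.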

4) IndexWriter can delete documents by Query, but IndexReader cannot.

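5) IndexReader provides a sometimes very useful method, undeleteAll, which reverses all pending deletions in the index. It only works for documents whose segments have not yet been merged, because IndexWriter merely marks a deleted document as deleted. The document is actually removed when its segment is merged.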
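Point 5 is easier to see in a toy model of a single segment. The sketch below is not Lucene's implementation: ToyDeletableSegment is a hypothetical class, the method names numDocs, maxDoc and undeleteAll are borrowed only to echo IndexReader, and a BitSet stands in for the deletion marks.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Toy model of one segment with deletion marks; not Lucene's implementation.
class ToyDeletableSegment {
    private final List<String> docs = new ArrayList<String>();
    private final BitSet deleted = new BitSet();

    void add(String doc) {
        docs.add(doc);
    }

    // Deleting only sets a mark; the document data stays in the segment.
    void delete(int docId) {
        if (docId < 0 || docId >= docs.size()) {
            throw new IndexOutOfBoundsException("docId " + docId);
        }
        deleted.set(docId);
    }

    // Clear every pending deletion mark in this segment.
    void undeleteAll() {
        deleted.clear();
    }

    // Number of live (not deleted) documents.
    int numDocs() {
        return docs.size() - deleted.cardinality();
    }

    // Number of document slots, including those marked as deleted.
    int maxDoc() {
        return docs.size();
    }

    // Merging copies only live documents, so marked documents are gone for good.
    ToyDeletableSegment merge() {
        ToyDeletableSegment merged = new ToyDeletableSegment();
        for (int i = 0; i < docs.size(); i++) {
            if (!deleted.get(i)) {
                merged.add(docs.get(i));
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        ToyDeletableSegment segment = new ToyDeletableSegment();
        segment.add("a");
        segment.add("b");
        segment.add("c");

        segment.delete(1);
        System.out.println(segment.numDocs() + " of " + segment.maxDoc()); // 2 of 3
        segment.undeleteAll();
        System.out.println(segment.numDocs() + " of " + segment.maxDoc()); // 3 of 3

        segment.delete(1);
        ToyDeletableSegment merged = segment.merge();
        merged.undeleteAll(); // nothing left to restore after the merge
        System.out.println(merged.numDocs() + " of " + merged.maxDoc());   // 2 of 2
    }
}
```

Before the merge, undeleteAll brings document 1 back because only its mark was cleared. After the merge the document no longer exists in the new segment, so there is nothing to undelete.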