How to make indexing faster (repost)

This article lists a number of ways to improve Lucene indexing speed, such as using the latest version of Lucene, tuning the hardware, and setting indexing parameters sensibly. It also covers further gains from reducing garbage-collection cost and optimizing document construction.

Here are some things to try to improve the indexing speed of your Lucene application. Please see ImproveSearchingSpeed for how to speed up searching.

http://wiki.apache.org/lucene-java/ImproveIndexingSpeed

  • Be sure you really need to speed things up. Many of the ideas here are simple to try, but others will necessarily add some complexity to your application. So be sure your indexing speed is indeed too slow and the slowness is indeed within Lucene.

  • Make sure you are using the latest version of Lucene.

  • Use a local filesystem. Remote filesystems are typically quite a bit slower for indexing. If your index needs to be on a remote filesystem, consider building it first on the local filesystem and then copying it up to the remote filesystem.

  • Get faster hardware, especially a faster IO system. If possible, use a solid-state disk (SSD). These devices have come down substantially in price recently, and much lower cost of seeking can be a very sizable speedup in cases where the index cannot fit entirely in the OS's IO cache.

  • Open a single writer and re-use it for the duration of your indexing session.

  • Flush by RAM usage instead of document count.

    For Lucene <= 2.2: call writer.ramSizeInBytes() after every added doc then call flush() when it's using too much RAM. This is especially good if you have small docs or highly variable doc sizes. You need to first set maxBufferedDocs large enough to prevent the writer from flushing based on document count. However, don't set it too large otherwise you may hit LUCENE-845. Somewhere around 2-3X your "typical" flush count should be OK.

    For Lucene >= 2.3: IndexWriter can flush according to RAM usage itself. Call writer.setRAMBufferSizeMB() to set the buffer size. Be sure you don't also have any leftover calls to setMaxBufferedDocs since the writer will flush "either or" (whichever comes first).
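
    For example, here is a minimal sketch of RAM-based flushing, assuming the Lucene 2.3-era IndexWriter API; the class name, the /tmp/index path, and the choice of SimpleAnalyzer are placeholders, and the 48 MB value is just the LUCENE-843 sweet spot mentioned below:

        import java.io.IOException;
        import org.apache.lucene.analysis.SimpleAnalyzer;
        import org.apache.lucene.index.IndexWriter;
        import org.apache.lucene.store.FSDirectory;

        public class RamFlushSetup {
          public static IndexWriter openWriter() throws IOException {
            // Placeholder path and analyzer; substitute your own.
            IndexWriter writer =
                new IndexWriter(FSDirectory.getDirectory("/tmp/index"), new SimpleAnalyzer(), true);

            // Lucene >= 2.3: flush once ~48 MB of documents are buffered, never by doc count.
            writer.setRAMBufferSizeMB(48.0);
            writer.setMaxBufferedDocs(IndexWriter.DISABLE_AUTO_FLUSH);

            // Lucene <= 2.2 alternative: poll RAM yourself after each addDocument:
            //   if (writer.ramSizeInBytes() > 48 * 1024 * 1024) writer.flush();
            return writer;
          }
        }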

  • Use as much RAM as you can afford.

    More RAM before flushing means Lucene writes larger segments to begin with which means less merging later. Testing in LUCENE-843 found that around 48 MB is the sweet spot for that content set, but, your application could have a different sweet spot.

  • Turn off compound file format.

    Call setUseCompoundFile(false). Building the compound file format takes time during indexing (7-33% in testing for LUCENE-888). However, note that doing this will greatly increase the number of file descriptors used by indexing and by searching, so you could run out of file descriptors if mergeFactor is also large.

  • Re-use Document and Field instances

    As of Lucene 2.3 there are new setValue(...) methods that allow you to change the value of a Field. This allows you to re-use a single Field instance across many added documents, which can save substantial GC cost. It's best to create a single Document instance, then add multiple Field instances to it, but hold onto these Field instances and re-use them by changing their values for each added document. For example you might have an idField, bodyField, nameField, storedField1, etc. After the document is added, you then directly change the Field values (idField.setValue(...), etc), and then re-add your Document instance.

    Note that you cannot re-use a single Field instance within a Document, and, you should not change a Field's value until the Document containing that Field has been added to the index. See Field for details.
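
    A minimal sketch of this pattern, assuming the Lucene 2.3-era Field.setValue API; fetchBody and fetchName are hypothetical accessors standing in for your own content source:

        import java.io.IOException;
        import org.apache.lucene.document.Document;
        import org.apache.lucene.document.Field;
        import org.apache.lucene.index.IndexWriter;

        public class ReusingDocIndexer {
          static void indexAll(IndexWriter writer, int numDocs) throws IOException {
            Document doc = new Document();
            Field idField   = new Field("id",   "", Field.Store.YES, Field.Index.UN_TOKENIZED);
            Field bodyField = new Field("body", "", Field.Store.NO,  Field.Index.TOKENIZED);
            Field nameField = new Field("name", "", Field.Store.YES, Field.Index.TOKENIZED);
            // Add the fields once, always in the same order (see the next item on field order).
            doc.add(idField);
            doc.add(bodyField);
            doc.add(nameField);

            for (int i = 0; i < numDocs; i++) {
              // Change the values in place instead of new'ing Fields and Documents per doc.
              idField.setValue(Integer.toString(i));
              bodyField.setValue(fetchBody(i));   // hypothetical content accessor
              nameField.setValue(fetchName(i));   // hypothetical content accessor
              writer.addDocument(doc);            // re-add the same Document instance
            }
          }

          static String fetchBody(int i) { return "body text for doc " + i; }
          static String fetchName(int i) { return "name " + i; }
        }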

  • Always add fields in the same order to your Document, when using stored fields or term vectors

    Lucene's merging has an optimization whereby stored fields and term vectors can be bulk-byte-copied, but the optimization only applies if the field name -> number mapping is the same across segments. Future Lucene versions may attempt to assign the same mapping automatically (see LUCENE-1737), but until then the only way to get the same mapping is to always add the same fields in the same order to each document you index.

  • Re-use a single Token instance in your analyzer

    Analyzers often create a new Token for each term in sequence that needs to be indexed from a Field. You can save substantial GC cost by re-using a single Token instance instead.

  • Use the char[] API in Token instead of the String API to represent token Text

    As of Lucene 2.3, a Token can represent its text as a slice into a char array, which saves the GC cost of new'ing and then reclaiming String instances. By re-using a single Token instance and using the char[] API you can avoid new'ing any objects for each term. See Token for details.
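
    A rough sketch covering both of the last two items: a whitespace-splitting TokenStream that fills the caller-supplied Token via the char[] API instead of creating Strings. It assumes the Lucene 2.3-era TokenStream.next(Token) and Token.setTermBuffer methods, and the class name is illustrative rather than a drop-in replacement for a real Tokenizer:

        import java.io.IOException;
        import org.apache.lucene.analysis.Token;
        import org.apache.lucene.analysis.TokenStream;

        // Splits on whitespace while reusing the supplied Token and its char[] term buffer.
        public class ReusingWhitespaceStream extends TokenStream {

          private final char[] text;   // full field text as a char[]
          private int pos = 0;

          public ReusingWhitespaceStream(char[] text) {
            this.text = text;
          }

          // Lucene 2.3+ can pass in a reusable Token; fill it in place rather than
          // allocating a new Token (and a new String) for every term.
          public Token next(Token reusableToken) throws IOException {
            while (pos < text.length && Character.isWhitespace(text[pos])) pos++;
            if (pos >= text.length) return null;
            int start = pos;
            while (pos < text.length && !Character.isWhitespace(text[pos])) pos++;
            // Copy the term chars into the token's internal char[] buffer: no String is created.
            reusableToken.setTermBuffer(text, start, pos - start);
            reusableToken.setStartOffset(start);
            reusableToken.setEndOffset(pos);
            reusableToken.setPositionIncrement(1);   // reset in case a prior use changed it
            return reusableToken;
          }

          // Older single-argument API, kept so the class works against either variant.
          public Token next() throws IOException {
            return next(new Token());
          }
        }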

  • Use autoCommit=false when you open your IndexWriter

    In Lucene 2.3 there are substantial optimizations for Documents that use stored fields and term vectors, to save merging of these very large index files. You should see the best gains by using autoCommit=false for a single long-running session of IndexWriter. Note however that searchers will not see any of the changes flushed by this IndexWriter until it is closed; if that is important you should stick with autoCommit=true instead or periodically close and re-open the writer.
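
    For example, a sketch using the Lucene 2.3-era constructor that takes an explicit autoCommit flag (later Lucene versions removed autoCommit in favor of explicit commits); the class name and the /tmp/index path are placeholders:

        import org.apache.lucene.analysis.SimpleAnalyzer;
        import org.apache.lucene.index.IndexWriter;
        import org.apache.lucene.store.FSDirectory;

        public class NoAutoCommitSession {
          public static void main(String[] args) throws Exception {
            // autoCommit=false: flushed segments stay invisible to searchers until close().
            IndexWriter writer = new IndexWriter(
                FSDirectory.getDirectory("/tmp/index"),   // placeholder path
                false,                                    // autoCommit
                new SimpleAnalyzer(),
                true);                                    // create a new index
            // ... add documents here for one long-running session ...
            writer.close();                               // changes become searchable here
          }
        }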

  • Instead of indexing many small text fields, aggregate the text into a single "contents" field and index only that (you can still store the other fields).

  • Increase mergeFactor, but not too much.

    A larger mergeFactor defers merging of segments until later, thus speeding up indexing, because merging is a large part of indexing. However, this will slow down searching, and you will run out of file descriptors if you make it too large. Values that are too large may even slow down indexing, since merging more segments at once means much more seeking for the hard drives.
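
    As a sketch, on an already-opened IndexWriter (30 is only an illustrative value; the default is 10):

        writer.setMergeFactor(30);   // higher defers merges, but costs file descriptors and search speed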

  • Turn off any features you are not in fact using. If you are storing fields but not using them at query time, don't store them. Likewise for term vectors. If you are indexing many fields, turning off norms for those fields may help performance.
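
    As a hedged sketch of a "lean" document, assuming the Lucene 2.3-era Field constants (Field.Index.NO_NORMS indexes without an analyzer and without norms); the class and field names are illustrative:

        import org.apache.lucene.document.Document;
        import org.apache.lucene.document.Field;

        public class LeanDocExample {
          static Document leanDocument(String category, String body) {
            Document doc = new Document();
            // Not stored, not tokenized, no norms: the cheapest way to index an exact-match key.
            doc.add(new Field("category", category, Field.Store.NO, Field.Index.NO_NORMS));
            // Indexed for search, not stored, no term vectors (this constructor defaults to TermVector.NO).
            doc.add(new Field("body", body, Field.Store.NO, Field.Index.TOKENIZED));
            return doc;
          }
        }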

  • Use a faster analyzer.

    Sometimes analysis of a document takes a lot of time. For example, StandardAnalyzer is quite time consuming, especially in Lucene versions <= 2.2. If you can get by with a simpler analyzer, then try it.

  • Speed up document construction. Often the process of retrieving a document from somewhere external (database, filesystem, crawled from a Web site, etc.) is very time consuming.

  • Don't optimize unless you really need to (for faster searching).

  • Use multiple threads with one IndexWriter. Modern hardware is highly concurrent (multi-core CPUs, multi-channel memory architectures, native command queuing in hard drives, etc.) so using more than one thread to add documents can give good gains overall. Even on older machines there is often still concurrency to be gained between IO and CPU. Test the number of threads to find the best performance point.
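
    A minimal sketch of sharing one IndexWriter across several threads, assuming the Lucene 2.3-era API; the class name, makeDocument factory, document counts, thread count, and index path are all illustrative placeholders:

        import java.io.IOException;
        import org.apache.lucene.analysis.SimpleAnalyzer;
        import org.apache.lucene.document.Document;
        import org.apache.lucene.document.Field;
        import org.apache.lucene.index.IndexWriter;
        import org.apache.lucene.store.FSDirectory;

        public class ThreadedIndexer {

          // Hypothetical document factory; replace with your own content source.
          static Document makeDocument(int i) {
            Document doc = new Document();
            doc.add(new Field("id", Integer.toString(i), Field.Store.YES, Field.Index.UN_TOKENIZED));
            doc.add(new Field("contents", "body text for doc " + i, Field.Store.NO, Field.Index.TOKENIZED));
            return doc;
          }

          public static void main(String[] args) throws Exception {
            final int numDocs = 100000;
            final int numThreads = 4;
            final IndexWriter writer =
                new IndexWriter(FSDirectory.getDirectory("/tmp/index"), new SimpleAnalyzer(), true);

            Thread[] threads = new Thread[numThreads];
            for (int t = 0; t < numThreads; t++) {
              final int offset = t;
              threads[t] = new Thread() {
                public void run() {
                  try {
                    // IndexWriter.addDocument may be called concurrently from many threads.
                    for (int i = offset; i < numDocs; i += numThreads) {
                      writer.addDocument(makeDocument(i));
                    }
                  } catch (IOException e) {
                    throw new RuntimeException(e);
                  }
                }
              };
              threads[t].start();
            }
            for (int t = 0; t < numThreads; t++) {
              threads[t].join();   // wait for all adder threads before closing
            }
            writer.close();
          }
        }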

  • Index into separate indices, then merge. If you have a very large amount of content to index, you can break your content into N "silos", index each silo on a separate machine, then use writer.addIndexesNoOptimize to merge them all into one final index.
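
    A sketch of the final merge step, assuming the Lucene 2.3-era addIndexesNoOptimize(Directory[]) method; the class name and silo paths are placeholders:

        import org.apache.lucene.analysis.SimpleAnalyzer;
        import org.apache.lucene.index.IndexWriter;
        import org.apache.lucene.store.Directory;
        import org.apache.lucene.store.FSDirectory;

        public class MergeSilos {
          public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter(
                FSDirectory.getDirectory("/data/final-index"), new SimpleAnalyzer(), true);
            Directory[] silos = new Directory[] {
                FSDirectory.getDirectory("/data/silo0"),   // placeholder silo paths
                FSDirectory.getDirectory("/data/silo1"),
                FSDirectory.getDirectory("/data/silo2"),
            };
            writer.addIndexesNoOptimize(silos);   // copy in the silo segments without optimizing
            writer.close();                       // optimize() only if you truly need it
          }
        }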

  • Run a Java profiler.

    If all else fails, profile your application to figure out where the time is going. I've had success with a very simple profiler called JMP. There are many others. Often you will be pleasantly surprised to find some silly, unexpected method is taking far too much time.

 

 
