Tokyo Cabinet Observations

http://parand.com/say/index.php/2009/04/09/tokyo-cabinet-observations/


I’m using Tokyo Cabinet with the Python tc bindings for a decent-sized dataset (~19G in a single hash table) on OS X. A few observations and oddities:

* Writes slow down significantly as the database size grows. I’m writing 97 roughly equal-sized batches to the tch table. The first batch takes ~40 seconds, and processing time increases fairly linearly, with the last batch taking ~14 minutes. I’m not sure why this would be the case, but it’s discouraging. I’ll probably write a simple partitioning scheme to split the data into multiple databases and keep each one small, but it seems like this should be handled out of the box.
* [Update] I implemented a simple partitioning scheme, and sure enough it makes a big difference. Apparently keeping the file size small (where small is < 500G) is important. Surprising: why doesn’t TC implement partitioning if it’s susceptible to performance issues at larger file sizes? Is this a python tc issue or a Tokyo Cabinet issue?
* [Also] It seems I can only open 53-54 tc.HDB() instances before I get an ‘mmap error’, which limits how finely I can partition.
* Reading records that have already been read from the tch go much faster on the second access (roughly an order of magnitude). I suspect this is the disk cache at work, but if anyone has more information on this, please enlighten me.
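The partitioning scheme mentioned above can be sketched roughly as follows: hash each key to pick one of N shard files, so no single database grows large enough to hit the slowdown. This is a minimal illustration, not the post's actual code; plain dicts stand in for the tc.HDB() handles, and the shard count is an arbitrary example value.

```python
import hashlib

NUM_SHARDS = 8  # illustrative; in practice, enough to keep each file small

def shard_for(key, num_shards=NUM_SHARDS):
    """Deterministically map a key to a shard index via a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards

# Each shard would really be a tc.HDB() opened on its own .tch file;
# plain dicts stand in here so the sketch is self-contained and runnable.
shards = [dict() for _ in range(NUM_SHARDS)]

def put(key, value):
    shards[shard_for(key)][key] = value

def get(key):
    return shards[shard_for(key)].get(key)

put("user:42", "alice")
print(get("user:42"))
```

A stable hash (rather than Python's built-in `hash()`, which can vary between runs) matters here, since the same key must always route to the same .tch file across process restarts. Note the mmap limit above caps how many shards you can hold open at once.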

Another somewhat surprising aspect: using the tc library you’re essentially embedding Tokyo Cabinet in your app; I had assumed access would be network-based, but it’s not. For network access you can use either the memcached protocol or pytyrant.
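To make the memcached-protocol option concrete: Tokyo Tyrant (the network server in front of Tokyo Cabinet) speaks the standard memcached text protocol, so any memcached client can talk to it. The sketch below just builds the raw protocol lines a client would write to the socket; it assumes a familiarity with the memcached `set`/`get` wire format and doesn't require a running server.

```python
def memcached_set(key, value):
    """Build a memcached-protocol 'set' command (flags=0, no expiry).
    Wire format: set <key> <flags> <exptime> <bytes>\r\n<data>\r\n"""
    data = value.encode("utf-8")
    return b"set %s 0 0 %d\r\n%s\r\n" % (key.encode("utf-8"), len(data), data)

def memcached_get(key):
    """Build a memcached-protocol 'get' command."""
    return b"get %s\r\n" % key.encode("utf-8")

cmd = memcached_set("greeting", "hello")
# A real client would write `cmd` to a socket connected to the Tyrant
# server's port and read back "STORED\r\n" on success.
```

In practice you'd just point an existing memcached client library at the Tyrant host and port rather than hand-rolling the protocol; this only shows what goes over the wire.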