Compaction magic in Couchbase Server 2.0

http://blog.couchbase.com/compaction-magic-couchbase-server-20

With Couchbase Server’s append-only storage design, data and index files cannot be left corrupted by a crash mid-write: updates go only to the end of the file, there are no in-place file updates, and the files are never in an inconsistent state. But writing to an ever-expanding file will eventually eat up all your disk space, so Couchbase Server has a process called compaction. Compaction reclaims disk space by removing stale data and index values so that the data and index files don’t grow without bound. If your app’s use case is mostly reads, this may not matter much, but if you have write-heavy workloads, you will want to understand how auto-compaction works in Couchbase Server.
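To make the fragmentation problem concrete, here is a minimal, self-contained sketch (plain Python, not Couchbase’s actual storage engine) of an append-only log: every update appends a record, so stale versions accumulate on disk until something compacts them away.

```python
# Toy append-only "file": records are only ever appended, never rewritten.
log = []

def write(key, value):
    log.append((key, value))   # an update never touches earlier entries

write("doc_a", "v1")
write("doc_b", "v1")
write("doc_a", "v2")           # doc_a's first version is now stale, but still on disk

live = dict(log)               # only the last write per key is live data
print(f"{len(log)} records on disk, {len(live)} live")  # -> 3 records on disk, 2 live
```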

 
By design, documents in Couchbase Server are partitioned into vBuckets (partitions). Several files are used for storage: a data file per partition (the “data files”), multiple index files (active, replica, and temp) per design document, and a master file that holds metadata about the design documents and view definitions. For example, on Mac OS X (shown below) the sample ‘gamesim’ bucket has 64 individual data files, one per partition (0.couch.1 to 63.couch.1), plus a master file containing the design documents and other view metadata (master.couch.1).
 

 

Couchbase Data and Master File

The index files live in the @indexes folder and consist of the active index file (names starting with main_), the replica index file (starting with replica_, present if index replication is enabled), and a temporary file used while building and updating the index (starting with tmp_).

 

Index Files in Couchbase Server

 

Data and index files in Couchbase Server are organized as b-trees. The root nodes (shown in red) contain pointers to the intermediate nodes, which contain pointers to the leaf nodes (shown in blue). In data files, the root and intermediate nodes track the sizes of the documents under their sub-trees, while the leaf nodes store the document ID, document metadata, and pointers to the document contents. In index files, the root and intermediate nodes track the emitted map-function results and the reduce values under their sub-trees.

B-tree storage for data files (shown on the left) and index files (shown on the right)
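The size tracking in the interior nodes is what makes the fragmentation check cheap: the root already knows how much live document data sits below it. Here is a toy illustration of that aggregate (the real files are CouchDB-style copy-on-write b-trees, so this Python class is only a sketch):

```python
class Node:
    """Toy b-tree node: interior nodes aggregate the document sizes below them."""
    def __init__(self, children=None, doc_size=0):
        self.children = children or []   # an empty list means this is a leaf
        self.doc_size = doc_size         # for a leaf: the size of one document

    def subtree_size(self):
        if not self.children:            # leaf: just this document's size
            return self.doc_size
        return sum(c.subtree_size() for c in self.children)

root = Node([Node(doc_size=700), Node([Node(doc_size=300)])])
print(root.subtree_size())               # -> 1000 bytes of live document data
```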

 

All mutations, including inserts, updates, and deletes, are written to the end of the file, leaving old data in place. Add a new document? The b-tree grows at the end. Delete a document? That, too, gets recorded at the end of the b-tree. For example, in the figure below, document A is mutated, then document B is mutated, then a new document D is added, and then document A is mutated again. Stale data is shown by the red crossed-out nodes.

Logical data layout for data and index files

 

By examining the size of the documents tracked by the root node of the b-tree, the server computes how much of the file is stale data (the fragmentation ratio). When fragmentation crosses a configurable threshold (see the settings figure below), an online compaction process is triggered. Compaction scans the current data and index files and writes new data and index files without the items marked for cleanup. During compaction, the b-trees are balanced and the reduce values are re-computed for the new tree. Data that no longer belongs to the node is cleaned up as well.
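The arithmetic behind the trigger is simple; assuming the live-data size comes from the root node and the physical size from the file, a sketch:

```python
def fragmentation_percent(live_bytes, file_bytes):
    # Share of the file occupied by stale (cleanable) data.
    return 100.0 * (file_bytes - live_bytes) / file_bytes

# 70 MB of live documents in a 100 MB partition file = 30% fragmentation,
# which happens to be the default auto-compaction trigger.
print(fragmentation_percent(70_000_000, 100_000_000))  # -> 30.0
```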
 
Finally, to catch up with the active workload that may have changed the old partition data file during compaction, Couchbase copies over the data appended since the start of the compaction process to the new partition data file so that it is up to date. The new index file is brought up to date the same way. The old partition data and index files are then deleted and the new files take their place.
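Here is the copy-then-catch-up idea in miniature, reusing the append-only log shape from the first sketch (illustrative Python only; the server does this against b-tree files, per partition):

```python
def compact(old_log):
    snapshot = len(old_log)                      # compaction starts from this point
    latest = {}                                  # keep only the newest version per key
    for key, value in old_log[:snapshot]:
        latest[key] = value
    new_log = [(k, v) for k, v in latest.items()]
    # Writers kept appending to old_log while we copied; replay the tail
    # until the new file has caught up, then it can be swapped in.
    while snapshot < len(old_log):
        new_log.append(old_log[snapshot])
        snapshot += 1
    return new_log

print(compact([("a", 1), ("b", 1), ("a", 2)]))   # -> [('a', 2), ('b', 1)]
```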
 

Compaction is normally an all-or-nothing operation, but because Couchbase compacts on a per-partition (vBucket) basis, the dataset is compacted incrementally: if compaction is aborted, the partitions already compacted keep their reclaimed space and no changes are lost.

Configuring the compaction thresholds in the settings UI tab for data and index files

 

 

Compaction in Couchbase Server is an online operation. By default, auto-compaction kicks in when the fragmentation threshold reaches 30%, but you should test which settings work well for your workload and tune accordingly.

Because compaction is resource-intensive, you can also schedule it during off-peak hours. To prevent auto-compaction from running while your database is under heavy use, you can configure a time period during which compaction is allowed, using the UI shown above. For example, here I’ve set compaction to run between 12am and 1am server time every day. If a compaction run does not complete within this window it will keep running, but you can check the abort box to have it stopped instead.
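The same settings can be applied without the UI through the REST API; here is a sketch using Python’s requests library (endpoint and parameter names as documented for Couchbase Server 2.0; the host, port, and credentials are placeholders):

```python
import requests

resp = requests.post(
    "http://127.0.0.1:8091/controller/setAutoCompaction",
    auth=("Administrator", "password"),
    data={
        "databaseFragmentationThreshold[percentage]": 30,
        "viewFragmentationThreshold[percentage]": 30,
        "allowedTimePeriod[fromHour]": 0,           # 12am server time
        "allowedTimePeriod[fromMinute]": 0,
        "allowedTimePeriod[toHour]": 1,             # 1am server time
        "allowedTimePeriod[toMinute]": 0,
        "allowedTimePeriod[abortOutside]": "true",  # the abort checkbox
        "parallelDBAndViewCompaction": "false",
    },
)
resp.raise_for_status()
```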
 
Compaction can also be triggered manually, per bucket or per design document, as shown in the figures below.
 

Manually compacting a data bucket in Couchbase

 

 

Manually compacting a design document in Couchbase
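These buttons map to REST endpoints as well; a sketch follows (the bucket name matches the earlier ‘gamesim’ example, while the design-document name ‘players’ is made up for illustration):

```python
import requests

auth = ("Administrator", "password")
base = "http://127.0.0.1:8091/pools/default/buckets/gamesim"

# Compact the bucket's data files:
requests.post(f"{base}/controller/compactBucket", auth=auth).raise_for_status()

# Compact the index files of a single design document (note the URL-encoded
# "_design/" prefix in the design document's name):
requests.post(f"{base}/ddocs/_design%2Fplayers/controller/compactView",
              auth=auth).raise_for_status()
```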

 
Compaction performance in Couchbase Server depends on I/O capacity and proper cluster sizing. Your cluster must be sized so that there is enough headroom to run compaction alongside everything else the system is doing, while maintaining the required level of performance. So, how do you tune compaction in Couchbase Server?
 
There is no magic bullet here. Depending on your application’s IOPS requirements, you need to size your cluster properly and may want to test your workload across a variety of storage hardware. If your app is write-heavy, SSDs might be the best option; for heavy read ratios, EBS can be a good solution at a low cost.
 
By default, if both data and view indexes are configured for auto-compaction, compaction runs sequentially: first the data files, then the views. With parallel compaction enabled, the databases and views are compacted at the same time. This requires more CPU and disk I/O, but if the data and view index files are stored on different physical disk devices (as our best practices recommend anyway), the two can complete in parallel so that neither the index nor the data files grow excessively large.
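In REST terms, parallel compaction is the parallelDBAndViewCompaction flag on the same setAutoCompaction call shown earlier; flipping it is a one-parameter change:

```python
import requests

requests.post(
    "http://127.0.0.1:8091/controller/setAutoCompaction",
    auth=("Administrator", "password"),
    data={
        "databaseFragmentationThreshold[percentage]": 30,
        "viewFragmentationThreshold[percentage]": 30,
        "parallelDBAndViewCompaction": "true",   # compact data and views together
    },
).raise_for_status()
```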
 
Conclusion
 
At the end of the day, every database needs regular maintenance. Online compaction is a huge plus, but you have to test your system and configure your compaction settings appropriately so that compaction does not interfere with your system load.
