How do you build a database? (self.Database)

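This article takes an in-depth look at the core principles of building a database, including key steps such as data storage, indexing, SQL parsing and optimization, and transaction management. From data file organization to complex query processing, it walks through how a database system works.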


Original link: How do you build a database? (self.Database)


It's a great question, and it deserves a long answer.


Most database servers are built in C, and store data using B-tree-type constructs.
In the old days there was a product called C-ISAM (a C library for an indexed sequential access method), a low-level library that helps C programmers write data in B-tree format.
So you need to know about B-trees and understand what they are.
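To make the B-tree idea concrete, here is a minimal sketch of a lookup, written in Python rather than C for brevity. The `BTreeNode` class, the `btree_search` function and the hand-built tree are all illustrative assumptions, not code from C-ISAM or any real database; insertion and node splitting are left out.

```python
from bisect import bisect_left


class BTreeNode:
    """One B-tree node: sorted keys plus child pointers (no children for a leaf)."""

    def __init__(self, keys, children=None):
        self.keys = keys                # sorted list of keys stored in this node
        self.children = children or []  # internal nodes have len(keys) + 1 children


def btree_search(node, key):
    """Return True if key is stored in the subtree rooted at node."""
    while True:
        i = bisect_left(node.keys, key)  # first position whose key is >= key
        if i < len(node.keys) and node.keys[i] == key:
            return True
        if not node.children:            # reached a leaf without a match
            return False
        node = node.children[i]          # descend into the child covering key


# Hand-built tree for illustration:
#            [40]
#          /      \
#   [10, 20]    [50, 60, 70]
root = BTreeNode([40], [BTreeNode([10, 20]), BTreeNode([50, 60, 70])])

print(btree_search(root, 60))  # True
print(btree_search(root, 35))  # False
```

Each step discards every subtree except one, which is why a lookup touches only a handful of nodes even when the tree holds a very large number of keys.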


Most databases store data separately from indexes. Let's assume a record (or row) is 800 bytes long and you write 5 rows of data to a file.

If the row contains columns such as first name, last name, address, etc., and you want to search for a specific record by last name, you can open the file and search through the records sequentially, but this is very slow.
Instead, you open an index file that contains just the last name and the position of the record in the data file.

Then, once you have the position, you open the data file, lseek to that position and read the data.
Because index data is very small, it is much quicker to search through index files.
Also, because the index files are stored as B-trees, it is very quick to do what is effectively a binary search (divide and conquer) to find the record you are looking for.
So for one "table" you will have a data file with the data and one (or many) index files.
The first index file could be for last name, the next could be for searching by Social Security number, and so on.
When the user defines a query to get some data, they decide which index file to search through.
If you can find any information on C-ISAM (there used to be an open-source, or cheap commercial, version called D-ISAM), you will understand this concept quite well.
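The data-file-plus-index idea can be sketched in a few lines. This is only an illustration under some assumptions: the 30/30/740-byte column layout, the sample rows and the function names are made up, and the index is kept as a sorted in-memory list instead of a B-tree index file. It assumes a POSIX system, and Python's `os.lseek` and `os.read` stand in for the C calls of the same name.

```python
import os
import struct
import tempfile
from bisect import bisect_left

RECORD_SIZE = 800                  # fixed row length, as in the text
ROW = struct.Struct("30s30s740s")  # assumed layout: first name, last name, address


def write_table(path, rows):
    """Write fixed-length rows to the data file; return sorted keys and their offsets."""
    entries = []
    with open(path, "wb") as f:
        for first, last, address in rows:
            offset = f.tell()      # position of this row in the data file
            f.write(ROW.pack(first.encode(), last.encode(), address.encode()))
            entries.append((last, offset))
    entries.sort()                 # the "index file": last name -> offset
    return [k for k, _ in entries], [o for _, o in entries]


def get_by_lastname(path, keys, offsets, lastname):
    """Binary-search the index, then seek straight to the row in the data file."""
    i = bisect_left(keys, lastname)
    if i == len(keys) or keys[i] != lastname:
        return None
    fd = os.open(path, os.O_RDONLY)
    try:
        os.lseek(fd, offsets[i], os.SEEK_SET)
        raw = os.read(fd, RECORD_SIZE)
    finally:
        os.close(fd)
    return tuple(field.rstrip(b"\0").decode() for field in ROW.unpack(raw))


# Made-up sample rows, purely for illustration.
rows = [
    ("Alice", "Smith", "1 Main Street"),
    ("Bob", "Jones", "2 Oak Avenue"),
    ("Carol", "Nguyen", "3 Elm Road"),
    ("Dave", "Brown", "4 Pine Lane"),
    ("Erin", "Adams", "5 Birch Way"),
]
path = os.path.join(tempfile.mkdtemp(), "people.dat")
keys, offsets = write_table(path, rows)

print(os.path.getsize(path))  # 5 rows * 800 bytes
print(get_by_lastname(path, keys, offsets, "Nguyen"))
```

A second index, say on a Social Security number column, would just be another sorted key list pointing at the same offsets in the same data file.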

Once you have stored data and have index files, an ISAM-type approach allows you to GET a record based on a value, or PUT a new record.
However, modern database servers all support SQL, so you need an SQL parser that translates the SQL statement into a sequence of related GETs.
SQL may join two tables, so an optimizer is also needed to decide which table to read first (normally based on the number of rows in each table and the indexes available) and how to relate it to the next table.
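As a rough illustration of turning SQL into GETs, the sketch below handles exactly one statement shape, `SELECT * FROM table WHERE column = 'value'`, and picks an index lookup when an index exists for that column, falling back to a full scan otherwise. The `TABLES` dictionary, the regular expression and `run_select` are all hypothetical; a real parser and optimizer (including join ordering) are far more involved.

```python
import re

# In-memory stand-ins for one table's data file and index files (made-up data).
TABLES = {
    "people": {
        "rows": [
            {"lastname": "Smith", "firstname": "Alice", "city": "Leeds"},
            {"lastname": "Jones", "firstname": "Bob", "city": "York"},
            {"lastname": "Smith", "firstname": "Carol", "city": "Bath"},
        ],
        # index: column name -> {value -> list of row positions}
        "indexes": {"lastname": {"Smith": [0, 2], "Jones": [1]}},
    }
}

# Only this one statement shape is recognised.
SELECT_RE = re.compile(
    r"^\s*SELECT\s+\*\s+FROM\s+(\w+)\s+WHERE\s+(\w+)\s*=\s*'([^']*)'\s*;?\s*$",
    re.IGNORECASE,
)


def run_select(sql):
    """Parse the statement, choose an access path, and return (plan, rows)."""
    match = SELECT_RE.match(sql)
    if match is None:
        raise ValueError("unsupported statement: " + sql)
    table_name, column, value = match.groups()
    table = TABLES[table_name]
    index = table["indexes"].get(column)
    if index is not None:
        # An index exists: translate the query into GETs by row position.
        plan = "index lookup on " + column
        rows = [table["rows"][pos] for pos in index.get(value, [])]
    else:
        # No index: fall back to reading every row.
        plan = "full scan"
        rows = [row for row in table["rows"] if row.get(column) == value]
    return plan, rows


print(run_select("SELECT * FROM people WHERE lastname = 'Smith'"))
print(run_select("select * from people where city = 'York'"))
```

The choice between the two branches is a tiny version of what an optimizer does: it looks at which indexes are available before deciding how to read the table.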


SQL can INSERT data, so you need to parse that into PUT statements. It can also combine multiple INSERTs into transactions, so you need a transaction manager to control this, and you will need transaction logs to store in-progress and completed transactions.
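One simple way to picture a transaction log is an append-only file where every PUT is recorded under a transaction id, and only transactions that reached COMMIT are applied when the log is replayed. The sketch below assumes a JSON-lines log format and hypothetical `TransactionLog` and `recover` names; it is not how any particular database lays out its log.

```python
import json
import os
import tempfile


class TransactionLog:
    """Append-only log of BEGIN / PUT / COMMIT entries, one JSON object per line."""

    def __init__(self, path):
        self.path = path

    def _append(self, entry):
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(entry) + "\n")
            f.flush()
            os.fsync(f.fileno())  # push the entry to disk before carrying on

    def begin(self, txid):
        self._append({"tx": txid, "op": "BEGIN"})

    def put(self, txid, key, value):
        self._append({"tx": txid, "op": "PUT", "key": key, "value": value})

    def commit(self, txid):
        self._append({"tx": txid, "op": "COMMIT"})


def recover(path):
    """Replay the log, applying only PUTs from transactions that reached COMMIT."""
    pending = {}  # txid -> PUTs seen so far
    data = {}     # the recovered key/value state
    with open(path, encoding="utf-8") as f:
        for line in f:
            entry = json.loads(line)
            tx, op = entry["tx"], entry["op"]
            if op == "BEGIN":
                pending[tx] = []
            elif op == "PUT":
                pending[tx].append((entry["key"], entry["value"]))
            elif op == "COMMIT":
                for key, value in pending.pop(tx):
                    data[key] = value
    return data


log_path = os.path.join(tempfile.mkdtemp(), "tx.log")
log = TransactionLog(log_path)

log.begin(1)
log.put(1, "smith", "row A")
log.put(1, "jones", "row B")
log.commit(1)

log.begin(2)
log.put(2, "brown", "row C")  # no COMMIT: pretend the server crashed here

recovered = recover(log_path)
print(recovered)
```

Because transaction 2 never committed, its PUT is ignored on replay, which is the all-or-nothing behaviour a transaction manager is there to provide.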


You may also need some backup/restore commands to back up your data files and index files, and maybe also your transaction log files. If you really want to go for it, you could write some replication tools that read your transaction log and replicate the transactions to a backup database on a different server.


Note that if you want your client programs (for example an SQL UI like phpMyAdmin) to reside on a different machine from your database server, you will need to write a connection manager that receives the SQL requests sent over TCP/IP to your server, authenticates each client using some credentials, parses each request, runs your GETs and sends the data back to the client.
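The shape of such a connection manager can be sketched with a toy line-based protocol: the first line carries the credentials, and each following line is one request. Everything here is an assumption made for illustration: the protocol, the `USERS` table, and `run_query`, which only echoes the statement instead of parsing it. The demo uses `socket.socketpair()` so it runs without opening a port; a real server would accept TCP connections in a loop and hand each one to `handle_client`.

```python
import socket
import threading

USERS = {"app": "secret"}  # hypothetical credentials, never store passwords like this


def run_query(sql):
    """Stand-in for the parser / optimizer / GET pipeline; it just echoes the SQL."""
    return "OK " + sql


def send_line(writer, text):
    writer.write(text + "\n")
    writer.flush()


def handle_client(conn):
    """Authenticate one client, then answer one request per line until it disconnects."""
    with conn, conn.makefile("r", encoding="utf-8") as reader, \
            conn.makefile("w", encoding="utf-8") as writer:
        user, _, password = reader.readline().strip().partition(" ")
        if USERS.get(user) != password:
            send_line(writer, "ERR authentication failed")
            return
        send_line(writer, "OK authenticated")
        for line in reader:
            send_line(writer, run_query(line.strip()))


# Demo over a connected socket pair instead of a real TCP listener.
server_sock, client_sock = socket.socketpair()
worker = threading.Thread(target=handle_client, args=(server_sock,))
worker.start()

with client_sock, client_sock.makefile("r", encoding="utf-8") as reader, \
        client_sock.makefile("w", encoding="utf-8") as writer:
    send_line(writer, "app secret")
    auth_reply = reader.readline().strip()
    send_line(writer, "SELECT * FROM people")
    query_reply = reader.readline().strip()

worker.join()
print(auth_reply)
print(query_reply)
```

Handling each connection in its own thread is only one option; the point is that authentication happens once per connection, before any request is parsed.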


So these database servers can be a lot of work, especially for one person. 

But you can create simple versions of these tools one at a time. 


Start with how to store data and indexes, and how to retrieve data using an ISAM-type interface.
There are books out there: look for older books on MySQL and mSQL, search Google for anything on B-trees and ISAM, and look for open-source C libraries that already do ISAM.


Get a good understanding of file I/O on a Linux machine using C.

Many commercial databases now don't even use the filesystem for their data files because of caching issues; they write directly to raw disk.

You want to just write to files initially.


I hope this helps a little bit.
