Fastest way to estimate rows in a table
A friend wrote to me recently with a question. He was working on a method to ship application metrics to statsd at a 1-minute interval. He had three examples of how to estimate the number of rows in a table, and he wanted to know the difference between them.
Data length/average row length
The example given:
mysql> select DATA_LENGTH/AVG_ROW_LENGTH from INFORMATION_SCHEMA.TABLES
WHERE TABLE_NAME = 'line_items';
+----------------------------+
| DATA_LENGTH/AVG_ROW_LENGTH |
+----------------------------+
|              10497541.7528 |
+----------------------------+
1 row in set (0.03 sec)
I have actually never thought of using this method! I don’t think it’s accurate, though, since DATA_LENGTH includes deleted space plus preallocated and overhead space. For example, an InnoDB page is filled at most 15/16ths full. As you can see, the number it returns is about 11% higher than the actual number of rows (9441296).
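As a rough sanity check, the estimate and its error can be computed side by side in one query. This is just an illustrative sketch against the same hypothetical table; note that it also filters on TABLE_SCHEMA, which the example above omits and which avoids matching a same-named table in another schema:

```sql
-- Illustrative: compare the DATA_LENGTH-based estimate with an exact count.
SELECT
  DATA_LENGTH / AVG_ROW_LENGTH                     AS estimated_rows,
  (SELECT COUNT(*) FROM line_items)                AS actual_rows,
  DATA_LENGTH / AVG_ROW_LENGTH
    / (SELECT COUNT(*) FROM line_items) - 1        AS relative_error
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME   = 'line_items';
```

With the numbers above, relative_error works out to roughly 0.11, i.e. an overestimate of about 11%.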
Table rows from Information Schema
The example given:
mysql> SELECT TABLE_ROWS FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_NAME = 'line_items';
+------------+
| TABLE_ROWS |
+------------+
|    9800078 |
+------------+
1 row in set (0.03 sec)
This method uses a number that InnoDB itself provides, which in this case is accurate to within 4% of the actual number of rows (estimating slightly over here). So the question is really about how efficient an already-provided count is, and where it comes from.
InnoDB samples a number of random pages in the table, and then extrapolates the total rows for the whole table. By default this sampling is just 8 pages in MySQL 5.5 (configurable via innodb_stats_sample_pages), but it is greatly improved in MySQL 5.6, with 20 pages sampled by default. The option is now called innodb_stats_persistent_sample_pages - a reference to the new persistent statistics feature!
Because a fixed number of pages is examined, this method also scales reasonably well with table growth. Pro tip: it may suddenly look much slower as soon as the table no longer fits in memory, since 8 random pages could mean more than 8 random IOs.
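If the estimate is too coarse for your purposes, the sample size can be raised. A sketch, assuming MySQL 5.6+ with persistent statistics; the value 64 is an arbitrary example, not a recommendation:

```sql
-- Check the current sample size (MySQL 5.6+, persistent statistics).
SHOW VARIABLES LIKE 'innodb_stats_persistent_sample_pages';

-- Raise it globally (64 is an arbitrary example value)...
SET GLOBAL innodb_stats_persistent_sample_pages = 64;

-- ...or per table, which is usually the safer choice:
ALTER TABLE line_items STATS_SAMPLE_PAGES = 64;

-- Recompute statistics so TABLE_ROWS reflects the new sample size.
ANALYZE TABLE line_items;
```

More pages sampled means a better estimate at the cost of a slower ANALYZE, so this is a trade-off rather than a free win.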
Select count(1)
The example given:
mysql> SELECT COUNT(1) FROM line_items;
+----------+
| COUNT(1) |
+----------+
|  9441296 |
+----------+
1 row in set (2.03 sec)
This requires an index scan of the primary key. It’s important to explain why, since this behavior differs from MyISAM. InnoDB supports MVCC, an important feature that allows concurrent access to rows without readers taking locks that block other users from writing. In practical terms, this feature means that at any one point in time there will be multiple versions of a row. The actual COUNT(1) result will depend on when your transaction started, and on its isolation level.
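That isolation-level dependence can be sketched with two sessions under the default REPEATABLE READ level (session interleaving shown in comments):

```sql
-- Session 1: the first consistent read establishes the snapshot.
START TRANSACTION;
SELECT COUNT(1) FROM line_items;

-- Session 2 (meanwhile): inserts some rows and commits.

-- Session 1: still reads its original snapshot, so the new rows
-- are invisible and the count is unchanged.
SELECT COUNT(1) FROM line_items;
COMMIT;
```

Under READ COMMITTED, by contrast, the second count would see the newly committed rows.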
This solution will not scale well as the number of rows in the table grows, and while storage-engine development is outside my expertise, I suspect it is unlikely that this will be improved in any future MySQL version. My reasoning is that I cannot see an easy way of maintaining multiple different pre-computed counts without introducing new global locking or overhead - which is a big no-no for scaling across multiple cores/CPUs.
Finally, of the three solutions, this is the only one that is 100% accurate: it tells you the exact number of rows in the table.
Reposted from: http://www.tocker.ca/2013/05/02/fastest-way-to-count-rows-in-a-table.html
Author: Morgan Tocker, MySQL Community Manager