浅谈索引系列之聚簇因子(clustering

初次听说聚簇因子，我相信大部分人都是丈二和尚摸不着头脑，不知所云。然而当我们了解其真正含义后，也许会觉得这个高大上的名词也没有那么神秘。废话少说，上官方文档：
Index Clustering Factor
The index clustering factor measures row order in relation to an indexed value such as employee last name. The more order that exists in row storage for this value, the lower the clustering factor.

The clustering factor is useful as a rough measure of the number of I/Os required to read an entire table by means of an index

If the clustering factor is high, then Oracle Database performs a relatively high number of I/Os during a large index range scan. The index entries point to random table blocks, so the database may have to read and reread the same blocks over and over again to retrieve the data pointed to by the index.
If the clustering factor is low, then Oracle Database performs a relatively low number of I/Os during a large index range scan. The index keys in a range tend to point to the same data block, so the database does not have to read and reread the same blocks over and over.

简单来讲聚簇因子用来反映索引键值字段存放杂乱排序程度的一个度量。那么聚簇因子是怎么计算得到的呢？Oracle在计算其值的时候，会对每个索引键值查找对应表的数据，在查找过程中，会跟踪数据块间跳转的次数(当然，数据库不可能真的这么做，源代码只是简单的扫描索引，得到rowid，然后根据rowid就可以获得数据块的地址了)，每一次跳转，计数器都会增加，计算完全部的索引键值得到的结果就是聚簇因子。因此表的某个字段越有序，若在此字段创建索引，索引对应的聚簇因子就越小。
下面通过实验作一个直观的展现：

点击(此处)折叠或打开

drop table colocated;
create table colocated ( x int, y varchar2(2000));
begin
for i in 1..100000 loop
insert into colocated values(i,dbms_random.string('a',5));
end loop;
commit;
end;
/
alter table colocated add constraint colocated_pk primary key(x);
drop table disorganized;
create table disorganized as select x,y from colocated order by y;
alter table disorganized add constraint disorganized_pk primary key(x);
exec dbms_stats.gather_table_stats(USER,'COLOCATED',CASCADE=>TRUE);
exec dbms_stats.gather_table_stats(USER,'DISORGANIZED',CASCADE=>TRUE);
SELECT a.index_name,b.num_rows,b.blocks,a.clustering_factor FROM user_indexes a,user_tables b WHERE a.table_name=b.table_name AND a.table_name='COLOCATED';
INDEX_NAME NUM_ROWS BLOCKS CLUSTERING_FACTOR
------------------------------ ---------- ---------- -----------------
COLOCATED_PK 100000 244 219
SELECT a.index_name,b.num_rows,b.blocks,a.clustering_factor FROM user_indexes a,user_tables b WHERE a.table_name=b.table_name AND a.table_name='DISORGANIZED';
INDEX_NAME NUM_ROWS BLOCKS CLUSTERING_FACTOR
------------------------------ ---------- ---------- -----------------
DISORGANIZED_PK 100000 232 99572

ID 按照有序依次插入到 COLOCATED 表中，而 DISORGANIZED 表中的 ID 列则是随机分布的。因此索引 COLOCATED_P K 的聚簇因子较小，基本等于表的 BLOCKS 数，索引 DISORGANIZED_PK 的聚簇因子较大，基本等于表的 NUM_ROWS .
了解了聚簇因子的概念，具体聚簇因子有什么用途呢？下面进一步说明。

点击(此处)折叠或打开

set autotrace traceonly;
SQL> select * from COLOCATED where x between 20000 and 20050;
51 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 1550765370
--------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 52 | 572 | 3 (0)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID| COLOCATED | 52 | 572 | 3 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | COLOCATED_PK | 52 | | 2 (0)| 00:00:01 |
--------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("X">=20000 AND "X"<=20050)
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
11 consistent gets
0 physical reads
0 redo size
2000 bytes sent via SQL*Net to client
556 bytes received via SQL*Net from client
5 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
51 rows processed
SQL> select * from DISORGANIZED where x between 20000 and 20050;
51 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 2594580634
-----------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 52 | 572 | 54 (0)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID| DISORGANIZED | 52 | 572 | 54 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | DISORGANIZED_PK | 52 | | 2 (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("X">=20000 AND "X"<=20050)
Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
56 consistent gets
0 physical reads
0 redo size
2000 bytes sent via SQL*Net to client
556 bytes received via SQL*Net from client
5 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
51 rows processed

COLOCATED和 DISORGANIZED两张表存储了同样的语句，相同的 SQL语句执行，聚簇因子小的表逻辑读为 11，聚簇因子大的逻辑读却为 56，大家都知道逻辑读和锁机制息息相关，锁会影响到了数据库的并发性，也会影响性能。
小结：
1.聚簇因子的高低会影响执行计划的选择。
2.聚簇因子与表的存储有关，无法通过重建索引来改变聚簇因子的大小。

来自 “ ITPUB博客 ” ，链接：http://blog.itpub.net/29827284/viewspace-2134262/，如需转载，请注明出处，否则将追究法律责任。

转载于:http://blog.itpub.net/29827284/viewspace-2134262/