访问路径的选择--返回少量数据时一定走索引吗？

本文链接：https://blog.youkuaiyun.com/scogeek/article/details/18284647

本文探讨了数据库优化器在返回少量数据时是否总会选择索引的问题。通过实验展示了即使返回10%的数据，表T1由于数据有序，优化器选择索引访问；而表T2因数据分散，选择了全表扫描。执行计划差异源于索引的聚簇因子，揭示了数据物理分布对查询效率的影响。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

-- 问题：
当某字段上有索引的表返回少量数据（如，10%数据）时，优化器一定会走索引吗？
-- 实验:
-- 1.建立两张表
-- a.T1表
--drop table t1 purge ;
create table t1 nologging as
select trunc((rownum-1)/100) id
,rpad(rownum,100) t_pad
from dba_source
where rownum <= 10000 ;

create index t1_idx1 on t1(id) ;

select owner,segment_name
,bytes/1024/1024
from dba_segments
where segment_name = 't1'
;

exec dbms_stats.gather_table_stats( 'SYS','t1',method_opt => 'FOR ALL COLUMNS SIZE 1',cascade => TRUE) ;

-- b.T2表
--drop table t2 purge ;
create table t2 nologging as
select mod(rownum,100) id
,rpad(rownum,100) t_pad
from dba_source
where rownum <= 10000 ;

create index t2_idx1 on t2(id) ;

exec dbms_stats.gather_table_stats( 'SYS','t2',method_opt => 'FOR ALL COLUMNS SIZE 1',cascade => TRUE) ;

-- 2.查看各自的执行计划
set autotrace traceonly
-- a.T1表
select * from t1 where id = 1 ;
执行计划
----------------------------------------------------------
Plan hash value: 2623418078

---------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 100 | 10400 | 3 (0)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID| T1 | 100 | 10400 | 3 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | T1_IDX1 | 100 | | 1 (0)| 00:00:01 |
---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("ID"=1)

统计信息
----------------------------------------------------------
0 recursive calls
0 db block gets
19 consistent gets
0 physical reads
0 redo size
12650 bytes sent via SQL*Net to client
586 bytes received via SQL*Net from client
8 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
100 rows processed

-- b.T2表
select * from t2 where id = 1 ;
执行计划
----------------------------------------------------------
Plan hash value: 1513984157

--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 100 | 10400 | 43 (0)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| T2 | 100 | 10400 | 43 (0)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter("ID"=1)

统计信息
----------------------------------------------------------
24 recursive calls
0 db block gets
166 consistent gets
0 physical reads
0 redo size
12330 bytes sent via SQL*Net to client
586 bytes received via SQL*Net from client
8 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
100 rows processed

从上可知，T1表走了索引，但是T2表却走全表扫描。两张表返回的结果是一样的，但是为什么导致执行计划的不同呢？
这里面涉及到一个叫聚簇因子的度量值。
索引的聚簇因子向优化器表明具有相同键值的数据块在物理存储中的分布情况。用一个接近于表中数据块数量的较低数值
来表明该表中基于索引后的数据行的排序度或聚簇度是很高的。而用接近于表中的数据行数来表明索引条目的聚簇度很低。

通过查询上面两个表的聚簇因子，可以知道为什么T1走索引，而T2走全表。这是因为T1的数据分布比较有序，所以通过索引访问
只需少量的IO。而T2表的数据比较分散，而且索引扫描是单块读（IFFS除外）的，而全表扫描是多块读的，
所以优化器认为T2表走全表扫描比索引扫描更高效。
因而可以得出一个结论，一个表是否走索引不仅与返回的数据量有关，还与数据的物理分布有关。
-- 查询表的聚簇因子
col idx_name format a16
col clustering_factor format 99999
col blocks format 99999
col num_rows format 99999
select a.table_name||'.'||a.index_name idx_name
,a.clustering_factor
,b.blocks
,b.num_rows
from dba_indexes a
,dba_tables b
where 1 = 1
and a.table_name = b.table_name
and b.table_name in ('t1','t2')
order by a.table_name,a.index_name ;
IDX_NAME CLUSTERING_FACTOR BLOCKS NUM_ROWS
---------------- ----------------- ------ --------
t1.T1_IDX1 152 152 10000
t2.T2_IDX1 10000 152 10000