Inside Oracle10g Dynamic Sampling

This article examines dynamic sampling, a technique introduced in Oracle 10g that gathers schema statistics at run time to improve the accuracy of the cost-based optimizer. By helping the optimizer estimate selectivity and cardinality more accurately, it produces faster execution plans.
by Donald K. Burleson

Introduction
One of the greatest problems with the Oracle Cost-based Optimizer (CBO) was not a problem with the CBO at all, but with the failure of the Oracle DBA to gather accurate schema statistics. Even with the dbms_stats package, the schema statistics were often stale, and the DBA did not always create histograms for skewed data columns and data columns that are used to estimate the size of SQL intermediate result sets.

This resulted in a “bum rap” for Oracle’s CBO, and beginner DBAs often falsely accused it of failing to generate optimal execution plans when the real cause of the sub-optimal execution plan was the DBA’s failure to collect complete schema statistics.

Hence, Oracle has automated the function of collecting and refreshing schema statistics in Oracle10g. This automates a very important DBA task and ensures that Oracle will always gather good statistics and choose the best execution plan for any query. Using the enhanced dbms_stats package, Oracle will automatically estimate the sample size, detect skewed columns that would benefit from histograms, and refresh the schema statistics when they become stale.

```sql
begin
   dbms_stats.gather_schema_stats(
      ownname          => 'SCOTT',
      estimate_percent => dbms_stats.auto_sample_size,
      method_opt       => 'for all columns size skewonly',
      degree           => 7
   );
end;
/
```
However, there was always a nagging problem with the CBO. Even with good statistics, the CBO would sometimes determine a sub-optimal table-join order, causing unnecessarily large intermediate result sets. For example, consider the complex WHERE clause in the query below. Even with the best schema statistics, it can be impossible to predict a priori the optimal table-join order (the one that has the smallest intermediate baggage). Reducing the size of the intermediate row-sets can greatly improve the speed of the query.

```sql
select
   stuff
from
   customer
natural join
   orders
natural join
   item
natural join
   product
where
   credit_rating * extended_credit > .07
and
   (qty_in_stock * velocity) / .075 < 30
or
   (sku_price / 47) * (qty_in_stock / velocity) > 47;
```
In this example, the four-way table join returns only 18 rows, but the query carries 9,000 rows in intermediate result sets, slowing down SQL execution (refer to figure 1).



Figure 1: Sub-optimal intermediate row sets.

If we were somehow able to predict the sizes of the intermediate results, we could re-sequence the table-join order to carry less “intermediate baggage” during the four-way table join, in this example carrying only 3,000 intermediate rows between the table joins (refer to figure 2).



Figure 2: Optimal intermediate row sets.
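The join-order arithmetic above can be sketched in a few lines of Python. This is a toy illustration, not Oracle's costing code, and the per-step row counts are hypothetical numbers chosen to echo the 9,000- and 3,000-row totals in figures 1 and 2:

```python
# Hypothetical per-step intermediate row counts for two candidate orders of
# the four-way customer/orders/item/product join. The totals echo the 9,000
# and 3,000 intermediate rows shown in figures 1 and 2.
join_orders = {
    "customer -> orders -> item -> product": [5000, 3000, 1000],  # 9,000 rows
    "product -> item -> orders -> customer": [1500, 1000, 500],   # 3,000 rows
}

def total_baggage(step_sizes):
    """Total intermediate rows carried between join steps."""
    return sum(step_sizes)

# The optimizer's goal: pick the order with the least intermediate baggage.
best = min(join_orders, key=lambda order: total_baggage(join_orders[order]))
print(best)                               # product -> item -> orders -> customer
print(total_baggage(join_orders[best]))   # 3000
```

The hard part, of course, is not the comparison but obtaining those per-step row counts in the first place, which is exactly what statistics and sampling are for.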

Let’s take a closer look at this issue. Assume that we have a three-way table join against three large tables: the database has 50,000 student rows, 10,000 course rows and 5,000 professor rows (refer to figure 3).



Figure 3: Number of rows in each table.

If the number of rows in the table determined the best table-join order, we would expect that any three-way table join would start by joining the professor and course tables, and then would join the RESULT set to the student table.

Ah, but whenever there is a WHERE clause, the total number of rows in each table does not matter (if you are using index access). Here is the query:

```sql
select
   student_name
from
   professor
natural join
   course
natural join
   student
where
   professor = 'jones'
and
   course = 'anthropology 610';
```
Stan Nowakowski
Bob Crane
James Bakke
Patty O’Furniture

4 Rows selected.

Despite the huge number of rows in each table, the final result set will only be four rows. If the CBO can guess a priori the size of the final result, it can use sampling techniques to examine the WHERE clause of the query and determine which two tables should be joined together first.

There are only two table-join choices in our simplified example:

1. Join (student to course) and (RESULT to professor)

2. Join (professor to course) and (RESULT to student)

So, then, which is better? The best solution will be the one in which RESULT is smallest. Because the query is filtered with a WHERE clause, the number of rows in each table is incidental, and what we are really concerned about is the number of rows “where professor = ‘jones’” and “where course = ‘Anthropology 610’.”

If we know these counts, the best table-join order becomes obvious. Assume that Professor Jones is very popular and teaches 50 courses, and that Anthropology 610 has a total of eight students. Knowing this, we can see that the size of the intermediate row baggage is very different:

Join (professor to course) and (RESULT to student).



Figure 4: A sub-optimal intermediate row size.

If the CBO were to join the student table to the course table first, the intermediate result set would only be eight rows, far less baggage to carry over to the final join:

Join (student to course) and (RESULT to professor).



Figure 5: An optimal intermediate row size.

Now that we have only eight rows returned from the first join, it is easy to join the tiny eight-row result set to the professor table to get the final answer.
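Under the article's hypothetical statistics (Jones teaches 50 courses; Anthropology 610 has eight students), the optimizer's choice reduces to picking the first join with the smaller result. A toy Python sketch makes this explicit:

```python
# Estimated rows produced by each candidate first join, using the article's
# hypothetical statistics: Professor Jones teaches 50 courses, and
# Anthropology 610 has 8 enrolled students.
rows_from_first_join = {
    "join (professor to course) first": 50,
    "join (student to course) first": 8,
}

# Choose the first join that carries the least intermediate baggage.
best_first_join = min(rows_from_first_join, key=rows_from_first_join.get)
print(best_first_join)                        # join (student to course) first
print(rows_from_first_join[best_first_join])  # 8
```

The comparison is trivial once the two cardinalities are known; the optimizer's real problem is estimating those numbers before running the query.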

How Do We Estimate Join Cardinality?

As we can see, in the absence of column histograms, the Oracle CBO must be able to “guess” this information, and it sometimes gets it wrong. This is one reason why the ORDERED hint is one of the most popular SQL tuning hints; it lets you specify that the tables be joined together in the same order that they appear in the FROM clause, like this:

```sql
select /*+ ordered */
   student_name
from
   student
natural join
   course
natural join
   professor
where
   professor = 'jones'
and
   course = 'anthropology 610';
```
Remember, if the values for the professor and course table columns are not skewed, then it is unlikely that the 10g automatic statistics would have created histogram buckets in the dba_histograms view for these columns.

As we can see, the Oracle CBO needs to be able to accurately estimate the final number of rows returned by each step of the query and then use schema metadata (from running dbms_stats) to choose the table-join order that results in the least amount of “baggage” (intermediate rows) from each of the table join operations.

But this is a daunting task. When a SQL query has a complex WHERE clause, it can be very difficult to estimate the size of the intermediate result sets, especially when the WHERE clause transforms column values with mathematical functions. Yet Oracle is committed to making the CBO choose well even when it has incomplete information. Hence, Oracle9i introduced the dynamic sampling method for gathering run-time schema statistics, and it is now enabled by default in Oracle10g.

Note that dynamic sampling is not for every database. Let’s take a closer look.


Dynamic Sampling

The main objective of dynamic sampling is to create more accurate selectivity and cardinality estimates, which, in turn, helps the CBO generate faster execution plans. Dynamic sampling is normally used to estimate single-table predicate selectivity when collected statistics cannot be used or are likely to lead to significant errors in estimation. It is also used to estimate table cardinality for tables without statistics or for tables whose statistics are too out of date to trust.

The optimizer_dynamic_sampling initialization parameter controls the number of blocks read by the dynamic sampling query. The parameter can be set to a value from 0 to 10. In 10g, the default for this parameter is 2, which automatically enables dynamic sampling. Beware that the optimizer_features_enable parameter turns off dynamic sampling if it is set to a version earlier than 9.2.0.

A value of 0 means dynamic sampling will not be performed. Increasing the value of the parameter results in more aggressive dynamic sampling, in terms of both the type of tables sampled (analyzed or un-analyzed) and the amount of I/O spent on sampling.

Dynamic Sampling Is Not for Everyone

When dynamic_sampling was first introduced in Oracle9i, it was used primarily for data warehouse systems with complex queries. Because it is enabled by default in Oracle10g, you may want to turn off dynamic_sampling to remove unnecessary overhead if any of the following are true:

- You have an online transaction processing (OLTP) database with small, single-table queries.
- Your queries are not frequently re-executed (as determined by the executions column in v$sql and executions_delta in dba_hist_sqlstat).
- Your multi-table joins have simple WHERE clause predicates with single-column values and no built-in or mathematical functions.

Conversely, dynamic sampling is ideal whenever a query is going to execute multiple times, because the sampling time is small compared to the overall query execution time.
By sampling data from the table at runtime, Oracle10g can quickly evaluate complex WHERE clause predicates and determine the selectivity of each predicate, using this information to determine the optimal table-join order. Let’s use the Oracle SQL sample clause to see how this works.
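Conceptually, that sampling step works like the following Python sketch. The data and predicate are synthetic (loosely modeled on the earlier customer query), so this illustrates only the statistics, not Oracle's implementation: read a small random sample, measure the fraction of rows that satisfy the predicate, and scale that selectivity up to the table's row count:

```python
import random

# Toy model of dynamic sampling: estimate the cardinality of a WHERE
# predicate from a small random sample instead of scanning every row.
random.seed(42)
TABLE_ROWS = 50_000
# Synthetic "customer" rows: (credit_rating, extended_credit) pairs.
table = [(random.random(), random.random()) for _ in range(TABLE_ROWS)]

def predicate(row):
    # A made-up predicate in the spirit of the article's complex WHERE clause.
    credit_rating, extended_credit = row
    return credit_rating * extended_credit > 0.07

def estimate_cardinality(table, predicate, sample_size=500):
    """Sample a few rows, measure predicate selectivity, scale to the table."""
    sample = random.sample(table, sample_size)
    selectivity = sum(predicate(r) for r in sample) / sample_size
    return int(selectivity * len(table))

actual = sum(predicate(r) for r in table)       # full scan (what we avoid)
estimate = estimate_cardinality(table, predicate)  # 1% sample
print(f"actual={actual} estimated={estimate}")
```

Even a 500-row sample typically lands within a few percent of the true count, which is accurate enough to rank candidate join orders, and far cheaper than scanning the whole table.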

Sample Table Scans

A sample table scan retrieves a random sample of data of whatever size you choose. The sample can be from a simple table or a complex SELECT statement such as a statement involving multiple joins and complex views.

To peek inside dynamic sampling, we can run some simple SQL queries. The following SQL statements use sample block and sample row scans on the customer table. (Note that there are 50,000 rows in this table.) The first statement shows a sample block scan and the second shows a sample row scan:

```sql
select
   count(*)
from
   customer sample block (20);
```

```
  COUNT(*)
----------
     12268
```
```sql
select
   pol_no,
   sales_id,
   sum_assured,
   premium
from
   customer sample (0.02);
```

```
    POL_NO   SALES_ID SUM_ASSURED    PREMIUM
---------- ---------- ----------- ----------
      2895         10        2525          2
      3176         10        2525          2
      9228         10        2525          2
     11294         11        2535          4
     19846         11        2535          4
     25547         12        2545          6
     29583         12        2545          6
     40042         13        2555          8
     47331         14        2565         10
     45283         14        2565         10

10 rows selected.
```
We can use the new dynamic_sampling hint to sample rows from the table.

```sql
select /*+ dynamic_sampling(customer 10) */
   pol_no,
   sales_id,
   sum_assured,
   premium
from
   customer;
```

```
    POL_NO   SALES_ID SUM_ASSURED    PREMIUM
---------- ---------- ----------- ----------
      2895         10        2525          2
      3176         10        2525          2
      9228         10        2525          2
     11294         11        2535          4
     19846         11        2535          4
     25547         12        2545          6
     29583         12        2545          6
     40042         13        2555          8
     47331         14        2565         10
     45283         14        2565         10
```

Conclusion

Dynamic sampling addresses a problem innate to SQL, one that is common to all relational databases. Estimating the optimal join order means finding the sequence that produces the smallest intermediate row-sets, and it is impossible to pre-collect statistics for every possible combination of WHERE clauses.

Dynamic sampling is a godsend for databases that have large n-way table joins that execute frequently. By sampling a tiny subset of the data, the Oracle 10g CBO gleans clues about the fastest table-join order.

As we have noted, dynamic_sampling does not take long to execute, but it can be unnecessary overhead for some Oracle10g databases. Dynamic sampling is just another example of Oracle’s commitment to making Oracle10g an intelligent, self-optimizing database.

Related link:
http://www.dba-oracle.com/art_dbazine_oracle10g_dynamic_sampling_hint.htm

 