Improvement of AUTO sampling statistics gathering feature in Oracle 11g
2013-12-30 15:11
597 查看
原文:http://optimizermagic.blogspot.com/2008/01/improvement-of-auto-sampling-statistics.html
Optimizer statistics in Oracle are managed via a pl/sql package, dbms_stats. It provides several pl/sql procedures to gather statistics for a table, schema, or a database. For example, gather_table_statistics is used to gather statistics on a table. This procedure
has an estimate_percent parameter, which specifies the sampling percentage of the statistics gathering. The users can specify any number between 0 ~ 100 for this parameter. For example, suppose you have a table BIGT, you can specify a 1% sampling percentage
as follows:
oracle通过dbms_stats包来管理优化器统计信息。该包提供了很多plsql过程来收集表,模式和数据的统计信息。例如,GATHER_TABLE_STATISTICS可以用来收集表的统计信息。该过程具有一个ESTIMATE_PERCENT参数,这个参数规定了信息统计的采样比例。我们可以将该参数设置为0~100之间的任意值。例如,对于表BIGT,我们可以通过入如下的方式来设置1%的采样比例:
It is not always easy for users to pick the appropriate sampling percentage. If the specified sampling percentage is too high, it can take longer to gather statistics. On the contray, if the data is skewed and the specified sampling percentage is too low, the
resulting statistics can be inaccurate.
For this reason, Oracle introduced the AUTO value for the estimate_percent parameter. For example, you can gather statistics on BIGT as follows:
通常人们很难确定一个合适的采样比例,如果采样比例过大,那么收集统计信息的时间就会过长,于此相反,如果数据的分布是不均衡了而采样比例又过小,那么收集的统计信息往往是不正确的。
基于这个原因,oracle引入了统计自动化评估采样比例的方法,例如,我们可以按照如下的方式来收集BIGT的统计信息:
The advantage of using AUTO sample size over a fixed number is two-folds. First, when AUTO sample size is specified, the system automatically determines the appropriate sampling percentage. Second, AUTO sample size is more flexible than a fixed sampling percentage.
A fixed sampling percentage size that was good at some point in time may not be appropriate after the data distribution in the table has changed. On the other hand when AUTO value is used Oracle will adjust the sample size when the data distribution changes.
自动采样相对于手工固定采样比例的优势可以提现在两个方面。首先,如果采用自动采样比例,系统可以自动决定合适的采样比例。第二,自动采样相对于手工固化采样比例具有更多的灵活性。手工固化采样比例在某一时间点可能是较好的,但是在数据分布发生变化后就可能不再那么合适了。于此相对,自动采样比例可以保证oracle在任何数据分布发生变化后都可以进行适当的调整。
When AUTO is used Oracle picks a sample size where the statistics quality is good enough. However, it does not work very well under cases of extreme skew in the data. In Oracle 11g, we improved the behavior when the AUTO value is used. First, AUTO sampling
now generates deterministic statistics. Second, and more importantly, AUTO sampling generates statistics that are almost as accurate as 100% sampling but takes much less time than 100% sampling. To illustrate these merits, we compare the performance of using
a fixed sampling percentage, AUTO sample size in Oracle 10g and AUTO sample size in Oracle 11g.
如果采用自动采样ORACLE会设置一个合适的比例,使得统计信息的质量是足够好的。但是如果数据的分布是是否不均衡的,自动采样的处理结果并不会太理想。在oracle11g中,oracle改进了自动采样的处理方式。首先,自动采样现在总是会产生确定性的统计信息,其次,也是更重要的一点,自动采样会生成与100%采样非常接近的统计信息,但是使用的时间相对100%采样来说却是非常小的。为了验证这些结论,下面我们将对比10g和11g下自动采样和手动固定采样的性能。
We used the standard TPC-D data generator to generate a Lineitem table. The Lineitem table is about 230G and contains 1.8 million rows with 16 columns. The schema of the lineitem table is as follows:
我们基于TPC-D标准来生成一张lineitem表,这张表大概占用230g的空间并包含180万条具有19列的数据。lineitem 表的结构如下:
Table 1 gives the elapsed time of gathering statistics on the Lineitem table by different sampling percentages.
表1 给出了在不同的采样比例下,收集统计信息所花费的时间
Table 1: Statistics gathering time on 230G TPC-D Lineitem Table Using Different Estimate Percentages
We also compare the quality of the statistics gathered using different estimate percentages. Among all the statistics of a column, number of distinct values (NDV) is the one whose accuracy used to be an issue. We define the accuracy rate of NDV of a column
as follows:
我们同样比较一下在不同的采样比例下,统计信息的质量是如何的。在某列的所有统计信息中,NDV的准确性在以往的版本中往往是存在问题的,我们定义NDV的正确性如下:
The accuracy rate ranges from 0% to 100%. The higher the accuracy rate is, the more accurate the gathered statistics are. Since 100% sampling always lead to an accuracy rate of 100%, we do not report it. We focus on the columns which has at least one statistics
accuracy rate below 99.9% when using different estimate percentages. Table 2 illustrates the accurate rates of the columns.
ndv统计信息的准确性范围从0~100%。准确性越高,统计信息的正确率越高。因此100%的采样比例总是具有100%的准确率,不予考虑。我们仅仅考虑至少存在某一列的准确率低于99.9%的情况。表2显示了各个列的准确率,如下
Table 2: Accuracy Rate of Gathering NDV LineItem Using Different Estimate Percentages
In short, the elapsed time of ''AUTO sampling in Oracle 11g'' is 10 times faster than 100% sampling but yields the statistics of similar quality (accuracy rate close to 100%).
总而言之,在11g中自动采样的采样速度要高出100%采样10倍左右,而采用的质量却非常接近100%采样。
在之前的版本中,自动采样的效果是有时并不理想。
Optimizer statistics in Oracle are managed via a pl/sql package, dbms_stats. It provides several pl/sql procedures to gather statistics for a table, schema, or a database. For example, gather_table_statistics is used to gather statistics on a table. This procedure
has an estimate_percent parameter, which specifies the sampling percentage of the statistics gathering. The users can specify any number between 0 ~ 100 for this parameter. For example, suppose you have a table BIGT, you can specify a 1% sampling percentage
as follows:
oracle通过dbms_stats包来管理优化器统计信息。该包提供了很多plsql过程来收集表,模式和数据的统计信息。例如,GATHER_TABLE_STATISTICS可以用来收集表的统计信息。该过程具有一个ESTIMATE_PERCENT参数,这个参数规定了信息统计的采样比例。我们可以将该参数设置为0~100之间的任意值。例如,对于表BIGT,我们可以通过入如下的方式来设置1%的采样比例:
exec dbms_stats.gather_table_stats(null, 'BIGT', estimate_percent => 1);
It is not always easy for users to pick the appropriate sampling percentage. If the specified sampling percentage is too high, it can take longer to gather statistics. On the contray, if the data is skewed and the specified sampling percentage is too low, the
resulting statistics can be inaccurate.
For this reason, Oracle introduced the AUTO value for the estimate_percent parameter. For example, you can gather statistics on BIGT as follows:
通常人们很难确定一个合适的采样比例,如果采样比例过大,那么收集统计信息的时间就会过长,于此相反,如果数据的分布是不均衡了而采样比例又过小,那么收集的统计信息往往是不正确的。
基于这个原因,oracle引入了统计自动化评估采样比例的方法,例如,我们可以按照如下的方式来收集BIGT的统计信息:
exec dbms_stats.gather_table_stats(null, 'BIGT', estimate_percent => dbms_stats.auto_sample_size);
The advantage of using AUTO sample size over a fixed number is two-folds. First, when AUTO sample size is specified, the system automatically determines the appropriate sampling percentage. Second, AUTO sample size is more flexible than a fixed sampling percentage.
A fixed sampling percentage size that was good at some point in time may not be appropriate after the data distribution in the table has changed. On the other hand when AUTO value is used Oracle will adjust the sample size when the data distribution changes.
自动采样相对于手工固定采样比例的优势可以提现在两个方面。首先,如果采用自动采样比例,系统可以自动决定合适的采样比例。第二,自动采样相对于手工固化采样比例具有更多的灵活性。手工固化采样比例在某一时间点可能是较好的,但是在数据分布发生变化后就可能不再那么合适了。于此相对,自动采样比例可以保证oracle在任何数据分布发生变化后都可以进行适当的调整。
When AUTO is used Oracle picks a sample size where the statistics quality is good enough. However, it does not work very well under cases of extreme skew in the data. In Oracle 11g, we improved the behavior when the AUTO value is used. First, AUTO sampling
now generates deterministic statistics. Second, and more importantly, AUTO sampling generates statistics that are almost as accurate as 100% sampling but takes much less time than 100% sampling. To illustrate these merits, we compare the performance of using
a fixed sampling percentage, AUTO sample size in Oracle 10g and AUTO sample size in Oracle 11g.
如果采用自动采样ORACLE会设置一个合适的比例,使得统计信息的质量是足够好的。但是如果数据的分布是是否不均衡的,自动采样的处理结果并不会太理想。在oracle11g中,oracle改进了自动采样的处理方式。首先,自动采样现在总是会产生确定性的统计信息,其次,也是更重要的一点,自动采样会生成与100%采样非常接近的统计信息,但是使用的时间相对100%采样来说却是非常小的。为了验证这些结论,下面我们将对比10g和11g下自动采样和手动固定采样的性能。
We used the standard TPC-D data generator to generate a Lineitem table. The Lineitem table is about 230G and contains 1.8 million rows with 16 columns. The schema of the lineitem table is as follows:
我们基于TPC-D标准来生成一张lineitem表,这张表大概占用230g的空间并包含180万条具有19列的数据。lineitem 表的结构如下:
column name | column type |
l_shipdate | date |
l_orderkey | number |
l_discount | number |
l_extendedprice | number |
l_suppkey | number |
l_quantity | number |
l_returnflag | varchar2 |
l_partkey | number |
l_linestatus | varchar2 |
l_tax | number |
l_commitdate | date |
l_receiptdate | date |
l_shipmode | varchar2 |
l_linenumber | number |
l_shipinstruct | varchar2 |
l_comment | varchar2 |
表1 给出了在不同的采样比例下,收集统计信息所花费的时间
Sampling Percentage | Elapsed Time (sec) |
1% sampling | 797 |
100% sampling (Compute) | 18772 |
Auto sampling in Oracle 10g | 2935 |
Auto sampling in Oracle 11g | 1908 |
We also compare the quality of the statistics gathered using different estimate percentages. Among all the statistics of a column, number of distinct values (NDV) is the one whose accuracy used to be an issue. We define the accuracy rate of NDV of a column
as follows:
我们同样比较一下在不同的采样比例下,统计信息的质量是如何的。在某列的所有统计信息中,NDV的准确性在以往的版本中往往是存在问题的,我们定义NDV的正确性如下:
accuracy rate = 1 - (|estimated NDV - actual NDV|)/actual NDV.
The accuracy rate ranges from 0% to 100%. The higher the accuracy rate is, the more accurate the gathered statistics are. Since 100% sampling always lead to an accuracy rate of 100%, we do not report it. We focus on the columns which has at least one statistics
accuracy rate below 99.9% when using different estimate percentages. Table 2 illustrates the accurate rates of the columns.
ndv统计信息的准确性范围从0~100%。准确性越高,统计信息的正确率越高。因此100%的采样比例总是具有100%的准确率,不予考虑。我们仅仅考虑至少存在某一列的准确率低于99.9%的情况。表2显示了各个列的准确率,如下
Column Name | Actual NDV | Auto Sampling in Oracle 11g | 1% Sampling |
orderkey | 450,000,000 | 98.0% | 50% |
comment | 181,122,127 | 98.60% | 4.60% |
partkey | 60,000,000 | 99.20% | 98.20% |
suppkey | 3,000,000 | 99.60% | 99.90% |
extendedprice | 3,791,320 | 99.60% | 94.30% |
In short, the elapsed time of ''AUTO sampling in Oracle 11g'' is 10 times faster than 100% sampling but yields the statistics of similar quality (accuracy rate close to 100%).
总而言之,在11g中自动采样的采样速度要高出100%采样10倍左右,而采用的质量却非常接近100%采样。
在之前的版本中,自动采样的效果是有时并不理想。
相关文章推荐
- Disable autostart of Oracle-xe in Ubuntu
- In Oracle 11g, how to change the order of the results of a sql without “order by”?(转)
- 184 You are in the process of creating a Virtual Private Catalog (VPC) in your Oracle Database 11g d
- Extended Optimizer Statistics in Oracle 11g Improve Performance
- Oracle 11G数据库报错 :ORA-12505, TNS:listener does not currently know of SID given in connect descriptor
- ORA-12012: error on auto execute of job "ORACLE_OCM"."MGMT_CONFIG_JOB_2_3"
- Autogrow of file '******' in database '******' was cancelled by user or timed out after ****** milli
- The filter pushed in subquery issue in oracle 11G (filter推进子查询)
- oracle 11g给表建触发器错误“索引中丢失 IN 或 OUT 参数:: 1
- InnoDB: auto-extending data file ./ibdata1 is of a different size 640 pages (rounded down to MB) than specified in the .cnf file: initial 768 pages, max 0 (relevant if non-zero) pages!
- 评论数据库Win A Free Copy of Packt’s Managing Multimedia and Unstructured Data in the Oracle Database e-book
- Placement of Voting disk and OCR Files in Oracle RAC 10g and 11gR1 [ID 293819.1]
- Philosophies Of Arguments Feature in C# 4.0
- ORA-01078: failure in processing system parameters 问题的解决方法(Oracle 11g)
- [置顶] How to create Oracle 11g R2 database manually in ASM?
- How-to find the SQL that using lots of temp tablespace in Oracle
- Using Transparent Database Encryption in Oracle Database 11g
- ORACLE 11g R2 安装时,提示[INS-06101] IP address of localhost could not be detemined
- How to switch to another database in oracle 11g(如何在oracle中从一个数据库切换到另一个数据库)
- Explain types of tables in oracle