数据库性能故障—SQL案例1-SRE空间

问题描述：

某子系统慢，具体体现为数据库服务器的WIO大，达到70%以上。数据库长时间的ACTIVE进程达到30个以上，SQL长时间运行没有结果。

处理步骤：

利用AWR信息生成当天10点—11点的历史报告。由于I/O为主要原因，因此查看awr报告中占reads前几位的sql 如下：

SQL ordered by Reads DB/Inst: SIMIS/simis2 Snaps: 11758-11759

-> Total Disk Reads: 22,233,296

-> Captured SQL account for 87.1% of Total

Reads CPU Elapsed

Physical Reads Executions per Exec %Total Time (s) Time (s) SQL Id

————– ———– ————- —— ——– ——— ————-

16,662,280 7,908 2,107.0 74.9 5128.25 ######### g8xnyqtspkytq

Module: JDBC Thin Client

BEGIN PKG_K_SIINTERFACE.PRC_K_SIINTERFACE(:1,:2,:3,:4,:5,:6); END;

13,111,117 614 21,353.6 59.0 3655.61 ######### b8a7wmhujk71u

Module: JDBC Thin Client

DELETE KC22_TABLE WHERE AKB020 = :B2 AND AKC190 = :B1 AND AAE072 = ‘AAAA’ AND (A

KC126 = 0 OR AKC126 IS NULL) AND (AAE100 = 1 OR AAE100 IS NULL)

2,852,087 279 10,222.5 12.8 932.06 30029.03 6adx668cna6uv

Module: JDBC Thin Client

SELECT * FROM KC22_TABLE WHERE AKB020 = :B2 AND AKC190 = :B1 AND AAE072 = ‘AAAA’

AND ( AKC126 = 0 OR AKC126 IS NULL ) AND AAE100 = 1 ORDER BY AKA063 ASC

将上述SQL直接在PL/SQL里运行，发现COST值很低，相应的执行计划如下：

根据之前的awr报告，对于reads占前几位的SQL进行SQL PLAN的跟踪发现，这几个SQL的SQL PLAN存在着问题，ORACLE选择了错误的执行计划（都使用了索引INDX_KC22_AKB020），但是他自身统计出来的COST值却很低。

SQL ordered by Reads DB/Inst: SIMIS/simis2 Snaps: 11781-11782

-> Total Disk Reads: 19,938,631

-> Captured SQL account for 87.3% of Total

Reads CPU Elapsed

Physical Reads Executions per Exec %Total Time (s) Time (s) SQL Id

————– ———– ————- —— ——– ——— ————-

16,264,523 14,044 1,158.1 81.6 6141.55 ######### g8xnyqtspkytq

Module: JDBC Thin Client

BEGIN PKG_K_SIINTERFACE.PRC_K_SIINTERFACE(:1,:2,:3,:4,:5,:6); END;

11,820,425 598 19,766.6 59.3 3713.90 ######### fx8k7mg06230w

SELECT AKA063,SUM(NVL(AKC227,0)) AKC227,SUM(NVL(AKC253,0)) AKC253,SUM(NVL(AKC228

,0)) AKC228,SUM(NVL(AKA631,0)) AKA631,SUM(NVL(AKA634,0)) AKA634 FROM V_KC22 WHER

E AKB020 = :B2 AND AKC190 = :B1 AND NVL(AKA135,’0′) = ‘1’ GROUP BY AKA063

3,395,156 263 12,909.3 17.0 1505.19 39494.38 7dcnhdh9w2fgu

SELECT * FROM KC22_TABLE WHERE AKB020 = :B5 AND AKC190 = :B4 AND AKC220 = :B3 AN

D AAE072 = ‘AAAA’ AND AKC515 = TRIM(:B2 ) AND AKC222 = TRIM(:B1 ) AND AKA135 = ‘

1′ AND AAE100 = ‘1’

SQL1：7dcnhdh9w2fgu

Plan Statistics DB/Inst: SIMIS/simis2 Snaps: 11781-11782

-> % Total DB Time is the Elapsed Time of the SQL statement divided

into the Total Database Time multiplied by 100

Stat Name Statement Per Execution % Snap

—————————————- ———- ————– ——-

Elapsed Time (ms) ########## 150,168.7 22.7

CPU Time (ms) 1,505,189 5,723.2 15.7

Executions 263 N/A N/A

Buffer Gets ########## 111,167.2 12.4

Disk Reads 3,395,156 12,909.3 17.0

Parse Calls 122 0.5 0.0

Rows 147 0.6 N/A

User I/O Wait Time (ms) ########## N/A N/A

Cluster Wait Time (ms) 1,239,094 N/A N/A

Application Wait Time (ms) 0 N/A N/A

Concurrency Wait Time (ms) 43,135 N/A N/A

Invalidations 0 N/A N/A

Version Count 4 N/A N/A

Sharable Mem(KB) 62 N/A N/A

————————————————————-

Execution Plan

————————————————————————————————

————————————————————————————————

| 0 | SELECT STATEMENT | | | | 4 (100)| |

| 1 | TABLE ACCESS BY INDEX ROWID| KC22_TABLE | 1 | 289 | 4 (0)| 00:00:01 |

| 2 | INDEX RANGE SCAN | INDX_KC22_AKB020 | 1 | | 3 (0)| 00:00:01 |

————————————————————————————————

Full SQL Text

SQL ID SQL Text

———— —————————————————————–

7dcnhdh9w2fg SELECT * FROM KC22_TABLE WHERE AKB020 = :B5 AND AKC190 = :B4 AND

AKC220 = :B3 AND AAE072 = ‘AAAA’ AND AKC515 = TRIM(:B2 ) AND AKC

222 = TRIM(:B1 ) AND AKA135 = ‘1’ AND AAE100 = ‘1’

SQL2：fx8k7mg06230w

Plan Statistics DB/Inst: SIMIS/simis2 Snaps: 11787-11788

-> % Total DB Time is the Elapsed Time of the SQL statement divided

into the Total Database Time multiplied by 100

Stat Name Statement Per Execution % Snap

—————————————- ———- ————– ——-

Elapsed Time (ms) ########## 116,454.7 36.2

CPU Time (ms) 1,210,696 5,631.1 10.5

Executions 215 N/A N/A

Buffer Gets ########## 127,204.7 7.5

Disk Reads 3,536,633 16,449.5 35.1

Parse Calls 147 0.7 0.0

Rows 1,264 5.9 N/A

User I/O Wait Time (ms) ########## N/A N/A

Cluster Wait Time (ms) 512,298 N/A N/A

Application Wait Time (ms) 0 N/A N/A

Concurrency Wait Time (ms) 19,039 N/A N/A

Invalidations 0 N/A N/A

Version Count 7 N/A N/A

Sharable Mem(KB) 30 N/A N/A

————————————————————-

Execution Plan

————————————————————————————————-

————————————————————————————————-

| 0 | SELECT STATEMENT | | | | 5 (100)| |

| 1 | HASH GROUP BY | | 1 | 52 | 5 (20)| 00:00:01 |

| 2 | TABLE ACCESS BY INDEX ROWID| KC22_TABLE | 1 | 52 | 4 (0)| 00:00:01 |

| 3 | INDEX RANGE SCAN | INDX_KC22_AKB020 | 1 | | 3 (0)| 00:00:01 |

————————————————————————————————-

Full SQL Text

SQL ID SQL Text

———— —————————————————————–

fx8k7mg06230 SELECT AKA063, SUM(NVL(AKC227, 0)) AKC227, SUM(NVL(AKC253, 0)) AK

KC253, SUM(NVL(AKC228, 0)) AKC228, SUM(NVL(AKA631, 0)) AKA631, SU

M(NVL(AKA634, 0)) AKA634 FROM V_KC22 WHERE AKB020 = :B2 AND AKC19

0 = :B1 AND NVL(AKA135, ‘0’) = ‘1’ GROUP BY AKA063

通过现场监控发现（OEM查看），ACTIVE的进程果然使用了错误的索引，执行了错误的执行计划，如下：

ORACLE统计的cost仅为2，但实际上在PL/SQL下，如果使用该索引，COST值将达到13183

不带HINT，在PL/SQL里执行该SQL，正确的SQL PLAN应该是走index_kc22_aae072索引

存在问题的SQL列表如下：

SELECT *
FROM cssimis.kc22_table a
WHERE akb020 = :b2
AND akc190 = :b1
AND aae072 = ‘AAAA’
AND (akc126 = 0
OR akc126 IS
NULL)
AND aae100 = 1
ORDER
BY aka063 ASC
SELECT aka063, sum(nvl(akc227, 0)) akc227, sum(nvl(akc253, 0)) akc253,
sum(nvl(akc228, 0)) akc228, sum(nvl(aka631, 0)) aka631, sum(nvl(aka634,
0)) aka634
FROM cssimis.v_kc22 a
WHERE akb020 = :b2
AND akc190 = :b1
AND nvl(aka135, ‘0’) = ‘1’
GROUP
BY aka063
DELETE FROM kc22_table

WHERE akb020 = :b2

AND akc190 = :b1

AND aae072 = ‘AAAA’

AND (akc126 = 0

OR akc126 IS NULL)

AND (aae100 = 1

OR aae100 IS NULL)

SELECT * FROM KC22_TABLE WHERE AKB020 = :B5 AND AKC190 = :B4 AND

AKC220 = :B3 AND AAE072 = ‘AAAA’ AND AKC515 = TRIM(:B2 ) AND AKC

222 = TRIM(:B1 ) AND AKA135 = ‘1’ AND AAE100 = ‘1’

查看indx_kc22_akb020索引，发现该列的distinct值仅为128个，而kc22_table的记录超过9KW，因此该索引没有存在的必要。
停止相关子系统业务，将该索引删除

10、重启业务，SQL的PLAN恢复正常。系统恢复

故障分析：

该故障为使用Oracle绑定变量产生的”bug”，提供以下几种方法供参考

可对相关的对象（表及索引）重新收集统计信息
如若表较大，方法1不可行，则可删除相关索引的统计信息
设置隐含参数_OPTIM_PEEK_USER_BINDS=FALSE
使用sql_profile固定sql plan（常量情况无效）

附录：

Query using Bind Variables is suddenly slow [ID 387394.1]

	Modified 28-NOV-2008 Type PROBLEM Status MODERATED

In this Document
  Symptoms
  Cause
  Solution

This document is being delivered to you via Oracle Support’s Rapid Visibility (RaV) process, and therefore has not been subject to an independent technical review.

Applies to:

Oracle Server – Enterprise Edition – Version: 9.0.1 to 10.2
This problem can occur on any platform.

Symptoms

You are running on a database at 9.x or above, and have observed that sometimes, for no apparent
reason, some SQL which has been running fine suddenly runs very poorly. You have made no changes
to the data, the SQL, or the statistics for the objects involved.

On further examination of the SQL it can be seen that it is using bind variables.

Cause

One reason for this behaviour may be explained by the use of a feature introduced in 9.x called bind
peeking.

With this feature the query optimizer peeks at the values of user-defined bind variables on the first
invocation of a cursor. This feature lets the optimizer determine the selectivity of any WHERE
clause condition, based on this value just as if a literal had been used instead of a bind variable.
On subsequent invocations of the cursor, no peeking takes place, and the cursor is shared, based on
the standard cursor-sharing criteria, even if subsequent invocations use different bind values.
Thus if the first set of bind values that happen to be used when the cursor is first hard-parsed are not
representative then the plan may be inappropriate for subsequent executions.

The Oracle 10.2 Database Performance Tuning Guide manual Chapter 13 “The Query Optimizer” says the
following about peeking bind variables :-

“When bind variables are used in a statement, it is assumed that cursor sharing is intended and that
different invocations are supposed to use the same execution plan. If different invocations of the
cursor would significantly benefit from different execution plans, then bind variables may have been
used inappropriately in the SQL statement.”

Bind peeking has been known to cause a different execution plan to be used on different nodes of a RAC
cluster, because each node has its own Shared Pool, and despite the same SQL, data, and statistics the
first time a cursor was hard parsed on each node a different set of bind values was presented to the
optimizer, and it thus chose a different plan on each node.

There is a hidden parameter which controls this feature’s behaviour, whose default value is TRUE.
Although a stable plan can be achieved by setting the parameter off it must be realized that this stable
plan is not necessarily the optimum plan for all bind values. Consider the following simple example where
tablex has 10,000 rows and col1 has an index.

SELECT <whatever>
FROM tablex
WHERE col1 BETWEEN :bind1 AND :bind2;

If this SQL is used, for example, with values 123 and 124 in order to pick out 2 rows from the
10,000 rows in the table then using an index would be the obvious choice.
However, if the same SQL is used with bind values 123 and 9999 then we would be getting the vast majority
of the rows and a full table scan would be more appropriate. But the optimizer cannot now know this,
and does not change the plan accordingly.

Solution

In this sort of situation it would perhaps be a good idea to modify the application and have two separate
modules/sections each with the above SQL, but with a variation (perhaps modified with a hint) that will result
in the desired plan. The appropriate module could then be invoked programmatically. An example might be
a situation where you use essentially the same SQL to query the pay for one employee or all 10,000 employees.
The query for one employee should use indexes, and the query for all employees should do a full table scan.

N.B. The cursor will be moved out of the Shared Pool and therefore require a hard parse on subsequent
invocation in a number of circumstances, such as :-

1) database shutdown/restart

2) cursor not in use by any session, and aged out by LRU algorithm

3) change to the stats associated with any referenced object (eg following a gather stats job)

4) change to the structure of any referenced object (eg alter table)

5) Granting/revoking privileges on a referenced object

It will NOT get moved out by flushing the Shared Pool if it is pinned (ie in use)

CONCLUSION
==========

It is desirable to share cursors, particularly in an OLTP environment, for all the good reasons outlined in

Note 62143.1 Understanding/Tuning the Shared Pool in Oracle7, 8, 8i

Thus coding bind variables, or perhaps using CURSOR_SHARING values of SIMILAR or FORCE, is an appropriate
path to follow, but it must be realized that having bind peeking may result in unpredictable execution
plans dependent on the first set of bind values presented to the optimizer on hard parse.
Tuning SQL with hints and coding your application to allow the use of the appropriate “version” of the
SQL, or using literal values, is the preferred method of dealing with SQL having changing
execution plans due to bind peeking, but if necessary this feature can also be disabled.

To set this feature off for the whole database :-

a) set _OPTIM_PEEK_USER_BINDS=FALSE in the spfile/init.ora

or just for the session :-

b) use alter session set “_OPTIM_PEEK_USER_BINDS”=FALSE;

For a good case study where this was a factor please see Note 369427.1 “Case Study: The Mysterious Performance Drop”

N.B. Please also be aware of the following

i) (unpublished) Bug:5082178 (fixed in 10.2.0.4 and 11.x).

Details:
In some situations bind peeking can occur when it should not eg: Bind peeking can occur for user binds even if “_optim_peek_user_binds” is set to FALSE.
This can cause binds to be marked “unsafe” leading to cursors not being shared when they should be.
This fix is notable as plan changes could occur if statements suffering this problem execute in a release with this fix as the CBO will no longer have peeked data to use when determining an execution plan.

ii) (unpublished) Bug: 4567767 Abstract: UNEXPLAINED PLAN CHANGES CAN OCCUR WITHOUT STATS REGATHER (Fixed in 10.2.0.4 and 11.x)

Details:
It is possible for queries’ execution plans to change without any modification in statistics or optimizer environment. Usually it is interpreted as the plans changed “out of the blue”. The reason for the change is that density is being reevaluated as 1/ndv instead of taking the statistic stored in the data dictionary when the table is reloaded to the row cache for whatever reason, like a shared pool flush.
It is not easy to catch in the act but can be seen on a 10053 trace file when the query is hardparsed before and after the table is reloaded to the row cache.
Before:
Column: ISOCODE Col#: 7 Table: PL_X_NP Alias: X
NDV: 1344 NULLS: 0 DENS: 1.5152e-02 <—— From Dict.
NO HISTOGRAM: #BKT: 1 #VAL: 2
After:
Column: ISOCODE Col#: 7 Table: PL_X_NP Alias: X
NDV: 1344 NULLS: 0 DENS: 7.4405e-04 <—— 1 / 1344
NO HISTOGRAM: #BKT: 1 #VAL: 2

To turn this fix off (in 11g and 10gR2):
Set “_fix_control”=’4567767:off’
Workaround
Set event 10139 :-

alter session set events ‘10139 trace name context forever’;
or
event=”10139 trace name context forever”

This bug is described in Note:338113.1 “Plans can change despite no stats being regathered”

iii) Bug: 5364143 Abstract: UNPREDICTABLE CHANGE IN QUERY OPTIMIZER PLAN (Fixed in 10.2.0.4 and 11.x)

Details:
It is possible for queries’ execution plans to change without any modification in statistics or optimizer environment. Usually its interpreted like the plans changed “out of the blue”. The reason for the change is that the cursor was taken out of the library cache for whatever reason (flush, Memory Pressure, DDLs,etc) and upon reload sometimes bind peeking is skipped for the cursor.
Note: Disabling Bind Peeking DOES NOT workaround the issue.

SUMMARY
=======

In summary, the bind peeking feature can give the optimizer better information and allow a more appropriate
execution plan if the bind values presented on hard parsing the cursor are representative. However, if there
is a possibility they are NOT representative then a plan which is sub-optimal for subsequent invocations
may result. Under these circumstances one of the above strategies should be considered.
Ultimately, in order to make the most appropriate decision, a good knowledge of both the application
and the data is required.

FOOTNOTE
========

Once a good plan is in operation for a key SQL statement it always good practice to do the following :-

a) for a 9.2 database capture the statistics for the objects involved using DBMS_STATS.EXPORT_TABLE_STATS.
(See Metalink Note 117203.1 “How to Use DBMS_STATS to Move Statistics to a Different Database” for
more information on how to do this).
These statistics could then be imported in an “emergency” to restore use of a good plan while a rogue plan
is investigated. (NB take a copy of the “bad” stats before importing, of course).

In 10g whenever optimizer statistics are modified using the DBMS_STATS package, old versions of the statistics
are saved automatically for future restoration, so the above is not necessary.
See Note 281793.1 “Oracle Database 10g stats history using dbms_stats.restore_table_stats”.

b) capture the good execution plan so it can be used as a baseline reference in the event that an
undesired change occurs. For 9.2 use Note 260942.1: “Display Execution plans from Statements in V$SQL_PLAN”.

In 10g the view DBA_HIST_SQL_PLAN has historical information with respect to execution plans. Use

select * from table(dbms_xplan.display_awr(‘&sql_id’))

…as documented in Note 362887.1 “A 10g Equivalant Process To The 9i Statspack Level 6 Execution Plan Output”

未经允许不得转载：SRE空间 » 数据库性能故障—SQL案例1

数据库性能故障—SQL案例1

Applies to:

Symptoms

Cause

Solution

相关推荐

评论抢沙发

评论前必须登录！

oracle

Applies to:

Symptoms

Cause

Solution

相关推荐

评论 抢沙发

评论前必须登录！

oracle

评论抢沙发