Make the job run faster + compulsive tuning disorder + down the rabbit hole we go + a few other comments
Posted: February 18, 2011 Filed under: Uncategorized Leave a comment »I was cc’d on an email yesterday morning that was sent out to the development group. It was a pretty bluntly worded question asking why a monthly job was taking well over an hour to run.
Being the DBA here, I felt it was my responsibility to create a 10046 trace file and summarize where all the time was going. Thankfully this was 100% reproducible in our staging environment. What I found was very interesting.
Generally when there’s something running in the database that I running slower that expected, the first thing I do is figure out how to best get an accurate trace of the problem. That involves initiating the trace immediately before the problem actions and terminating the trace immediately after those actions are complete. That keeps dooda from creeping into your trace file. There are a whole bunch of ways to that these days and the internet is plastered with info on how to do that. I then just run tkprof on the resulting file with sys=no and everything else default. That gives me the sql in order of execution and I just follow the execution order and make note of SQL with the highest elapsed time. I know Millsap would be disappointed but that’s usually enough technical info to figure out what to start trying to make happen faster. Not rocket science but it usually does the trick.
In this case, what was extremely interesting was the non-recursive and recursive summary output found at the bottom of the tkprof file. When I think recursive SQL I think of all the stuff Oracle has to do in the background that I can’t do much about. and user defined functions embedded in SQL somwhere. If you embed user defined functions in SQL to hide complexity you are usually handicapping the optimizer and vastly increasing the chance that you will get some overly resource intensive activity. Here’s what that summary looked like:
OVERALL TOTALS FOR ALL NON-RECURSIVE STATEMENTScall
count cpu elapsed disk query current rows
——- —— ——– ———- ———- ———- ———- ———-
Parse 2 0.01 0.01 0 7 0 0
Execute 2 0.88 5.98 0 12145 74414 2
Fetch 0 0.00 0.00 0 0 0 0
——- —— ——– ———- ———- ———- ———- ———-
total 4 0.89 5.99 0 12152 74414 2
OVERALL TOTALS FOR ALL RECURSIVE STATEMENTS
call count cpu elapsed disk query current rows
——- —— ——– ———- ———- ———- ———- ———-
Parse 2141 17.89 17.89 0 0 0 0
Execute 13589173 492.64 489.04 5 180 0 0
Fetch 13589083 4007.37 4035.89 778773 293428221 462 13589118
——- —— ——– ———- ———- ———- ———- ———-
total 27180397 4517.91 4542.83 778778 293428401 462 13589118
13.5 million executions of recursive SQL statements. Returning 13.5 million rows. After I picked my jaw up off the floor, I gave myself a pep talk and started hunting through the trace file. Now, if I keep this writing thing up, I’m will eventually start sounding like a broken record but our homegrown applications seem to be heavy on the views on top of views on top of views methodology for saving the developers some time. What we end up with are tons of rabbit holes to crawl down when performance problems arise. Anyways…
So, I went backup up the trace file and found the following SQL was executed nearly 13.5 million times.
SELECT PERIODID
FROM
PERIOD WHERE TRUNC(STARTDATE) <= :B1 AND TRUNC(ENDDATE) >= :B1
Odd. I looked through the PL/SQL package that implemented the job, searching for this SQL. Nope. Eventually I found the culprit. It was in a view and sure enough, it was a user defined function used to filter results for the current period. This will be short on details but I spent a day and a half working this view into submission. I got the LIO’s down from ~ 17 million down to 18 thousand. There is no doubt more that could be done (isn’t there always) but I feel like I plucked the lowest hanging fruit and after re-tracing the original job, the recursive calls fell from 13.5 million down to 4996. More importantly the job went from lasting > 2.5 hours to completing in < 5 minutes.
My biggest problem now is stepping away from the computer before I start obsessing about all that can be done with the code base here.
I few things that really helped me here:
- Get rid of the user defined functions unless they really make sense to keep around.
- If you use /*+ gather_plan_statistics */ + dbms_xplan + allstats to dig into what is going on and notice that the a-rows is way different from the e-rows from a row source that you see could explicitly have some more filters applied to it because of transitivity, don’t be scared to do that.
- If you are having trouble reading the SQL don’t be scared to start breaking it up with the ‘WITH’ clause so you can focus on the real work.
- If you are still having problems, start breaking out individual pieces of the SQL and do the /*+ gather_plan_statistics */ + dbms_xplan + allstats thing on that and work your way back.
Why didn’t I catch this with my AWR monitoring? Well, it only ran once a month and it wasn’t a single SQL taking a ridiculous amount of time to run. I do have a report that shows the top 5 mostly frequently executed SQL each day but you’d be surprised at how much internal stuff shows up on that. (hmmm…I should probably filter that out).
Sorry for the lack of details but it’s beer:30 and I have an abundance of Hopslam waiting for me at home.