Analyzing Performance Problems – Part 1 of 943
“It’s slow.” If you want to see a DBA’s eyes roll to the back of his head, utter the words “it’s slow”. What, exactly, is slow? Seems it’s our job to find out.
Get the Facts
“It takes for-EVER to create an order!”
You can’t fix what you can’t measure, and the first step is to get details about the problem. Make sure that the end-user understands that they need to work with you to help isolate and reproduce the issue, and to test possible solutions.
- Is the problem related to a specific screen, program, or business process?
- Is the problem repeatable? On demand? Is there a pattern (e.g., “only in the afternoon” or “whenever X is also running”)?
- Is there a specific target metric that needs to be satisfied? Does the process have to finish before 6:00 AM, for example?
- Has the issue gotten progressively worse over time, or did it get bad all of a sudden?
- When did the problem begin?
Wait! Where are all the cool DBA Tricks?
Yes, I know that you want me to jump straight into fancy ProTop graphs showing BHT latch waits, and scream endlessly about the perils of RAID 5, but alas, a good DBA needs to start, well, at the start.
The Problem Start Date
This is not always the same as the date the user first noticed the problem. You’ll need to find out if the problem started on a particular date or after an identified change. Establishing a reliable start date is important, because the wrong start date can lead you to the wrong root cause. Check available log files and date/time stamps on output files. Interview end-users. List all code changes to see if you can find any correlations.
You often hear us talking about trending data and keeping a rolling year of log files: this is why.
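If you do have that trending data, a few lines of script can suggest a candidate start date to check against what the users tell you. The following is only a rough sketch: the function name, the 14-day baseline, the 1.5x threshold, and the sample numbers are all assumptions made up for illustration, not values from any particular tool.

```python
from datetime import date, timedelta
from statistics import median

def find_problem_start(samples, baseline_days=14, factor=1.5, sustain=3):
    """Return the first date of a sustained slowdown, or None.

    samples: list of (date, value) pairs sorted by date, one per day.
    The baseline is the median of the first `baseline_days` values.
    A slowdown is `sustain` consecutive days above baseline * factor.
    """
    if len(samples) <= baseline_days:
        return None
    baseline = median(v for _, v in samples[:baseline_days])
    threshold = baseline * factor
    run_start = None
    run_length = 0
    for day, value in samples[baseline_days:]:
        if value > threshold:
            if run_length == 0:
                run_start = day  # remember where this run began
            run_length += 1
            if run_length >= sustain:
                return run_start
        else:
            run_length = 0  # a normal day breaks the run
    return None

# Hypothetical history: 20 normal days near 10 seconds, then a jump to 18.
first_day = date(2024, 1, 1)
history = [(first_day + timedelta(days=i), 10.0 + (i % 3) * 0.5) for i in range(20)]
history += [(first_day + timedelta(days=20 + i), 18.0) for i in range(10)]

problem_start = find_problem_start(history)
print(problem_start)
```

Treat the result as a lead to verify, not an answer. A gradual degradation will not trip a fixed threshold cleanly, and the baseline period itself may already include the problem.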
What Changes Occurred?
And when were they implemented? A complete and reliable timeline of all changes is a necessity. Be prepared to re-examine the scope of a change as changes sometimes have unexpected side effects. Make sure to track changes to all systems involved in the business process under analysis.
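Once the timeline exists, lining it up against the problem start date is mechanical. As a minimal sketch, assuming the change log is kept as simple (date, system, description) records and that a seven-day look-back window is reasonable, something like this pulls out the changes worth a second look. The entries below are invented for the example.

```python
from datetime import date, timedelta

def changes_near(changes, problem_start, window_days=7):
    """Return changes made in the window_days up to and including
    problem_start, newest first."""
    earliest = problem_start - timedelta(days=window_days)
    nearby = [c for c in changes if earliest <= c[0] <= problem_start]
    return sorted(nearby, key=lambda c: c[0], reverse=True)

# Hypothetical change log covering every system in the business process.
change_log = [
    (date(2024, 1, 5), "application", "order entry release"),
    (date(2024, 1, 18), "storage", "SAN firmware update"),
    (date(2024, 1, 20), "application", "new index added to order table"),
    (date(2024, 2, 2), "network", "switch replacement"),
]

suspects = changes_near(change_log, date(2024, 1, 21))
for when, system, description in suspects:
    print(when, system, description)
```

A change that falls outside the window is not cleared. Side effects can take time to surface, so widen the window if nothing plausible turns up.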
Consider Any Changes to System Workload
This can be divided roughly into application, environment & external workloads.
Application Workload
Over time, more users create more data and engage in more activity. If parts of the application are inappropriately sensitive to growth, there may be problems. Database activity should scale linearly (or less) with business volume. Number of users, record creates, reads, updates, and deletes, number of transactions, etc. should all relate to business activity. Reporting and data aggregation take longer with more data – but the additional time should move in step with the number of work transactions and be a predictable trend.
Any metric that grows disproportionately faster than the business (quadratically or exponentially, say, rather than linearly) is an indicator of an underlying application scalability problem.
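One way to put a number on “scales linearly” is to fit a power law: regress the log of the metric against the log of the business volume and look at the slope. A slope near 1 means the metric moves in step with the business, and a slope well above 1 means it is growing faster. The sketch below assumes you have paired samples of volume and metric. The function name and the sample figures are hypothetical.

```python
import math

def growth_exponent(volumes, metrics):
    """Least-squares slope of log(metric) against log(volume).

    Roughly 1.0 means linear scaling, 2.0 means quadratic, and so on.
    All inputs must be positive.
    """
    xs = [math.log(v) for v in volumes]
    ys = [math.log(m) for m in metrics]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Hypothetical samples: orders per day at four points in the business's growth.
orders_per_day = [1000, 2000, 4000, 8000]
record_reads = [50 * v for v in orders_per_day]           # grows in step with orders
report_seconds = [1e-6 * v ** 2 for v in orders_per_day]  # grows with the square of orders

reads_exponent = growth_exponent(orders_per_day, record_reads)
report_exponent = growth_exponent(orders_per_day, report_seconds)
print(round(reads_exponent, 2), round(report_exponent, 2))
```

Real data is noisier than this, so expect slopes like 1.1 or 0.9 from healthy metrics. It is the ones drifting toward 2 that deserve attention.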
Environment Workload
As the application becomes busier, so will the hosting environment. Similar to the application workload, the environment workload should scale linearly with business activity. Analyzing CPU utilization, network traffic, memory utilization, and disk space should show increases in step with business growth.
Disk IO operations should grow more slowly than the business. Because IO operations are buffered, they should follow an inverse square growth path.
External Workload
Changes outside the immediate environment can impact performance. This relates closely to the environment workload. Network load from other applications may reduce the available bandwidth, increasing latency. Implementation of real-time barcode scanning could be impacting network traffic more than planned. Disk IO from other applications/systems impacts SAN performance. Demand for CPU from other applications on a shared VM may impact the scheduling and availability of shared CPU. If the application and environment workloads have not changed, but we are seeing increased latency or deeper queuing, it’s likely the external workload has increased.
Next Steps
Wait for part 2.
But seriously, the steps outlined in this blog are by far the most important. Don’t try to shortcut this critical initial data-gathering step.
And if you can’t wait for part 2 (and 3, and 4…) then add a comment below or send us an email.