Database monitoring is the process of collecting and analyzing performance metrics so that your database can fully support your applications, avoiding slowdowns and responding quickly to queries. Database professionals who know the right metrics and track them over time can confidently say things like this:
“Yes, I know users are complaining about application performance. But our metrics are all in the green, so the problem must lie somewhere else.”
Wouldn’t you like to be able to say that when your boss is trying to get to the bottom of an application performance problem in your organization?
Think about all the moving parts between your users and your database: the application itself, the operating system, virtualization, storage, hardware and the network. With so many components in play, it’s not always easy to find the source of application performance problems. The database is one of those moving parts, so it comes under scrutiny when users start complaining.
Smart database professionals rely on the most prominent database monitoring metrics so they can get through the day without constantly wondering how their databases are doing. I’ll describe them below. They apply to all major database platforms, including Oracle, SQL Server and MySQL.
Database monitoring: Why is it so important?
Database monitoring is what you do before you start performance tuning. It’s how you measure what you want to improve.
In the same way that prescription without diagnosis is poor medicine, few database professionals dive in and start making changes at the first sign of trouble. Going with your gut may apply to stealing second base or betting on the World Cup, but not so much to solving performance problems in enterprise IT. As you’ll see below, you’re better off monitoring so you can develop a baseline and history against which to compare performance on any given day. That will give you a better chance of success when you analyze current data and begin performance tuning.
Another reason that database monitoring is important is that, when done correctly, it’s a multilayer effort. Your database performance bottlenecks may lie on a different level from the one you expect; tuning SQL, for example, won’t help much if your real problem is I/O contention.
In proper database monitoring, you focus on all levels, including the moving parts mentioned above:
- SQL level — Are your applications running inefficient SQL statements that introduce latency, mishandle errors and impair throughput and concurrency?
- Instance/database level — Are your applications using the platform itself efficiently? How is it handling I/O contention and locked objects, and what do its wait statistics show?
- Infrastructure level — Consider the hardware and operating system under your database. Are they able to keep up with the demands of your applications?
- User/session level — Your users are the most vocal of your moving parts. Are they trying to tell you things that even your dashboards haven’t yet detected?
Finally, database monitoring is how you ensure that all the work you invest in tuning on all levels has paid off. Like any maintenance effort, the sequence of monitor-analyze-tune is cyclical. No sooner do you finish tuning than you resume monitoring in preparation for the next wave of transactions, the next schema modification and the next configuration change.
How do I monitor databases in my environment?
The essence of database monitoring is developing a history of past performance statistics against which to compare your performance on any given day. You do that by monitoring database performance over time, gradually building up a collection of snapshots that summarize resource utilization over a long horizon.
Within the database, you create a monitoring table, then automate the process of populating it with scripts that periodically collect moment-in-time statistics and append them to the table. That way, when performance lags, you can base your decisions on data instead of instinct or experimentation.
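That pattern can be sketched in a few lines. Here is a minimal illustration using Python and an in-memory SQLite database (the table layout, metric names and values are assumptions for the example; a real implementation would run inside your own database on a schedule):

```python
import sqlite3
import time

# Illustrative monitoring table; in practice this lives inside your
# database and a scheduled job (cron, SQL Agent, etc.) populates it.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE perf_snapshot (
        captured_at  REAL,   -- epoch seconds at collection time
        metric_name  TEXT,   -- e.g. 'cache_hit_ratio'
        metric_value REAL
    )
""")

def take_snapshot(metrics):
    """Append one moment-in-time set of statistics to the history."""
    now = time.time()
    conn.executemany(
        "INSERT INTO perf_snapshot VALUES (?, ?, ?)",
        [(now, name, value) for name, value in metrics.items()],
    )
    conn.commit()

# Two collection cycles with made-up values:
take_snapshot({"cache_hit_ratio": 97.4, "lock_waits": 0})
take_snapshot({"cache_hit_ratio": 96.8, "lock_waits": 2})

history = conn.execute(
    "SELECT metric_name, COUNT(*) FROM perf_snapshot GROUP BY metric_name"
).fetchall()
print(dict(history))   # two samples recorded per metric
```

The value of the table only emerges over weeks and months of samples; the collection job itself should stay cheap so it doesn’t distort what it measures.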
When developing your monitoring process, keep several requirements in mind:
- Whichever metrics you choose, interpret them in the context of the transaction workload (number of users, batch jobs, automated tasks), so be sure to measure that workload as well.
- The longer your time horizon, the better your understanding of what’s normal and the easier it is to spot anomalies when they arise. The best day to start monitoring was yesterday; if you couldn’t start then, start today.
- Sample more frequently when transaction volume changes (for example, on the cusp of the busy retail season) or when a particularly thorny problem comes up.
- Too much overhead skews your measurement. Keep your collection costs low so that they don’t affect your conclusions.
- Use adaptive alerts based on thresholds that change with time, transaction volume, business conditions and resource capacity. That reduces the incidence of false-positive notifications.
- And, as mentioned above, monitor and study trends at all levels in your database environment to ensure you don’t start your troubleshooting in the wrong place.
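One way to make alert thresholds adaptive, as suggested above, is to derive them from recent history rather than from a fixed number. A sketch under stated assumptions: flag a sample only when it falls more than k standard deviations above the mean of the prior samples (the window size and the choice of k are tuning assumptions, not fixed rules):

```python
from statistics import mean, stdev

def adaptive_alert(history, current, k=3.0):
    """True if `current` is more than k standard deviations above
    the mean of the recent history of this metric."""
    if len(history) < 2:
        return False            # not enough data for a baseline yet
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return current > baseline
    return current > baseline + k * spread

# Lock waits per minute over the last ten samples:
history = [0, 1, 0, 2, 1, 0, 1, 0, 2, 1]
print(adaptive_alert(history, 2))   # within normal variation: False
print(adaptive_alert(history, 9))   # well above the baseline: True
```

Because the baseline is recomputed from the rolling window, the threshold drifts with seasonal load, which is what suppresses the false positives a static threshold would generate.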
Which database monitoring metrics are the best indicators of performance?
Most performance problems fall into four categories: memory, resource usage, locks and blocks, and indexes. If you keep an eye on the following database monitoring metrics, you’ll be able to troubleshoot small problems before they turn into big ones.
When data blocks are read in from disk, the buffer cache stores copies of them in memory. When the database needs data, it looks first in the buffer cache and only then goes to disk. Since the cache is so much faster than disk, it pays to monitor metrics that reveal the current state of memory.
Cache hit ratio
How frequently — 98 percent of the time? 11 percent of the time? — is the database able to find what it’s looking for among the pages stored in the cache? The higher this ratio, the less often the system must make the performance-impairing hop out to disk. The lower this ratio, the more likely that it’s time to increase the size of the cache.
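Whatever counters your platform exposes, the arithmetic behind the ratio is simple. A minimal sketch, with the counter values as illustrative inputs:

```python
def cache_hit_ratio(cache_hits, disk_reads):
    """Percentage of page requests satisfied from the buffer cache."""
    total = cache_hits + disk_reads
    if total == 0:
        return 100.0      # no requests yet, so nothing has missed
    return 100.0 * cache_hits / total

# 9,800 requests served from memory, 200 that had to go to disk:
print(cache_hit_ratio(9_800, 200))   # 98.0
```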
Page life expectancy
Since memory capacity is finite, no page can remain in the buffer cache forever. Eventually, the oldest data pages are evicted to make room for newer ones. Page life expectancy is the number of seconds a page can be expected to stay in memory without being referenced. Here, large numbers indicate that your cache hit ratio is healthy, and small numbers indicate possible memory pressure.
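A fixed threshold for page life expectancy is hard to defend as memory sizes grow, so many SQL Server practitioners scale the threshold to the size of the buffer cache, commonly cited as roughly 300 seconds per 4 GB. A sketch of that heuristic (the scaling factor is a community rule of thumb, not a vendor guarantee):

```python
def ple_threshold_seconds(buffer_cache_gb):
    """Heuristic minimum page life expectancy, scaled to cache size:
    roughly 300 seconds for every 4 GB of buffer cache."""
    return (buffer_cache_gb / 4.0) * 300.0

def memory_pressure(ple_seconds, buffer_cache_gb):
    """True if the observed PLE falls below the scaled threshold."""
    return ple_seconds < ple_threshold_seconds(buffer_cache_gb)

print(ple_threshold_seconds(32))   # 2400.0 seconds for a 32 GB cache
print(memory_pressure(900, 32))    # True: 900 s is well below that
```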
Checkpoint pages per second
When data blocks in the buffer cache are new or modified, the system must save, or flush, them to disk in a checkpoint operation. It’s useful to establish a baseline count of checkpoint pages per second so that you can compare later. An increase in checkpoint pages may indicate an I/O problem.
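Comparing the current rate against that baseline can be as simple as flagging a sustained jump. A minimal sketch, assuming a baseline average has already been collected (the 50 percent tolerance is an illustrative choice):

```python
def checkpoint_rate_alarm(current_pages_per_sec, baseline_pages_per_sec,
                          tolerance=0.5):
    """True if checkpoint pages/sec exceeds the baseline by more than
    the given tolerance (default: 50 percent above baseline)."""
    return current_pages_per_sec > baseline_pages_per_sec * (1 + tolerance)

print(checkpoint_rate_alarm(120, 100))   # False: within 50% of baseline
print(checkpoint_rate_alarm(180, 100))   # True: possible I/O problem
```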
Everything that occurs with the database affects the resources around it and the objects inside it. The metrics focused on those resources and objects are your window into current and potential performance problems. Plus, with a sufficiently long history of those metrics, you can plan for capacity around cyclical changes in workload.
Row counts
How can you discover the number of rows affected by your last SQL statement (INSERT, UPDATE, DELETE or SELECT), or as of your most recent analysis? You can use row counts to determine the volume of data touched in a given table. Collect them with system variables and scripts over time so that, when row counts rise or fall suddenly, you can examine the SQL and figure out how to change your application.
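As a sketch of that approach, here is one way to flag sudden changes between periodic row-count samples (the table and numbers are illustrative):

```python
def row_count_change(previous, current):
    """Percent change in a table's row count between two samples."""
    if previous == 0:
        return float("inf") if current else 0.0
    return 100.0 * (current - previous) / previous

# Daily samples of an orders table (illustrative numbers):
samples = [10_000, 10_150, 10_300, 18_900]
changes = [row_count_change(a, b) for a, b in zip(samples, samples[1:])]
print([round(c, 1) for c in changes])   # [1.5, 1.5, 83.5]
```

The last sample is the one worth investigating: an 80-plus percent jump in a day points at a new batch job, a runaway INSERT or an application change.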
Database file I/O
Using I/O metrics, you can determine the amount of data written to and read from any given file in the database, whether a data file or a log file. This provides a useful check that I/O is proportionate to the size of the file. When the figures are collected over time, you’ll see trends and cycles emerge. They also help answer questions about resource consumption.
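Most platforms expose file I/O as cumulative counters, so rates come from the difference between two snapshots divided by the collection interval. A hedged sketch (file names and counter values are illustrative):

```python
def io_rates(snap_a, snap_b, interval_seconds):
    """Per-file read/write throughput (bytes/sec) computed from two
    snapshots of cumulative counters: {file: (bytes_read, bytes_written)}."""
    rates = {}
    for f, (read_b, written_b) in snap_b.items():
        read_a, written_a = snap_a.get(f, (0, 0))
        rates[f] = ((read_b - read_a) / interval_seconds,
                    (written_b - written_a) / interval_seconds)
    return rates

# Two snapshots taken 60 seconds apart:
snap_1 = {"sales.mdf": (1_000_000, 500_000), "sales.ldf": (0, 200_000)}
snap_2 = {"sales.mdf": (7_000_000, 800_000), "sales.ldf": (0, 2_600_000)}
print(io_rates(snap_1, snap_2, 60))
# data file mostly read; log file written at ~40 KB/s, as expected
```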
Locks and blocks
Locks prevent multiple simultaneous transactions from modifying the same object; blocking is what happens while competing processes wait for the object to be released. Usually, locks release themselves quickly, but when they don’t, these metrics point to what’s holding things up.
Lock waits
Ordinarily, this metric stays close to zero because locks rarely get in the way of requests. An increase in lock waits tends to go hand in hand with an increase in load times.
When lock waits affect performance, the next level of examination is the blocking process or session that is causing the waits. That sets you on the road to SQL tuning to make your application’s queries more efficient.
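The underlying mechanics are easy to reproduce in miniature. A sketch using SQLite, in which one connection holds a write lock and a second connection’s attempt to write gives up when its timeout expires rather than waiting indefinitely:

```python
import os
import sqlite3
import tempfile

# A shared on-disk database so two connections can contend for it.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
writer = sqlite3.connect(path, timeout=0.2, isolation_level=None)
blocked = sqlite3.connect(path, timeout=0.2, isolation_level=None)
writer.execute("CREATE TABLE t (x INTEGER)")

# Connection 1 opens a write transaction and holds its lock.
writer.execute("BEGIN IMMEDIATE")
writer.execute("INSERT INTO t VALUES (1)")

# Connection 2 waits for the lock, then gives up at its timeout.
try:
    blocked.execute("BEGIN IMMEDIATE")
    outcome = "acquired"
except sqlite3.OperationalError:
    outcome = "blocked"        # 'database is locked'
print(outcome)

writer.execute("COMMIT")       # releasing the lock unblocks others
```

On a server platform, the equivalent step is querying the engine’s session views to find which session holds the lock that everyone else is queued behind.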
Indexes are meant to accelerate operations in tables with large numbers of records, but they are not immune to problems that can impair performance; namely, fragmentation.
As records are added to and deleted from a table, its index becomes fragmented, with new pages added out of order and existing pages left partially empty. The greater the degree of fragmentation, the more likely the query optimizer is to choose an inefficient execution plan, resulting in poor performance. The remedy is to defragment the index occasionally by reorganizing or rebuilding it. Easily accessible database monitoring metrics show the degree of fragmentation.
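For SQL Server, Microsoft’s index-maintenance guidance is commonly summarized as: leave fragmentation below about 5 percent alone, reorganize between 5 and 30 percent, and rebuild above 30 percent. A sketch of that decision, treating the thresholds as starting points rather than hard rules:

```python
def index_maintenance_action(fragmentation_pct):
    """Map a fragmentation percentage to a maintenance action, using
    commonly cited SQL Server thresholds (starting points only)."""
    if fragmentation_pct < 5:
        return "leave"       # the fix would cost more than it saves
    if fragmentation_pct <= 30:
        return "reorganize"  # lightweight, online defragmentation
    return "rebuild"         # recreate the index from scratch

for pct in (2, 18, 45):
    print(pct, index_maintenance_action(pct))
```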
“The application is slow again today.”
As a database professional, you’re going to hear that a lot. And you’ll probably turn first to the usual suspects: excessive logical I/O, long wait times and absent or fragmented indexes. Most of those you can solve with SQL tuning.
But before you dive in and start changing code, refine your practice through database monitoring that shows whether SQL tuning is the right place to start. Are your OS patches up to date? Are your databases running in a healthy virtual environment? Are the storage systems well maintained?
Being able to see not only what the problem is but also where to solve it — that’s insight you can truly act on.