What is database performance monitoring?
Database performance monitoring is the practice of continuously tracking health metrics for your databases (query execution time, connection utilization, replication lag, storage consumption, lock contention, and index effectiveness) in order to detect problems before they cause outages or user-facing degradation.
Teams typically use tools like pganalyze, Percona Monitoring and Management (PMM), Datadog Database Monitoring, and New Relic to track these metrics and surface problems. Databases are frequently the root cause of production incidents, yet they are often the last place teams look. An application that returns errors or responds slowly may have perfectly healthy application servers, well-provisioned load balancers, and functioning network connections. The bottleneck is a query that changed its execution plan overnight, a table that has accumulated enough bloat to degrade scan performance, or a connection pool that is slowly leaking connections due to an idle-in-transaction bug. Without database-level monitoring, these issues are invisible until they cascade into symptoms that are visible at the application layer, by which point the impact is already affecting users.
Effective database monitoring goes beyond simply checking whether the database is up. It means understanding the internal behavior of the database engine: how queries are being planned and executed, how resources are being allocated, and whether maintenance processes like vacuum and compaction are keeping pace with write activity. This level of visibility turns database operations from a reactive practice (something broke, figure out why) into a proactive one (something is trending in a concerning direction, address it before it matters).
What database metrics should you monitor?
The metrics that matter most depend on your database engine, but several categories are universally important across PostgreSQL, MySQL, ClickHouse, and other systems.
Query execution time and plan changes are the single most important category. Tools like pg_stat_statements in PostgreSQL aggregate statistics for every distinct query shape: how many times it has executed, the total and mean execution time, rows returned, and blocks hit or read. Monitoring these statistics over time reveals query regressions, where a query that previously ran in milliseconds suddenly takes seconds. Often, the regression is caused by a change in the query planner's chosen execution plan. The planner might switch from an index scan to a sequential scan based on updated table statistics, a change in the estimated selectivity of a filter, or simply because the table has grown past a threshold where the planner's cost model favors a different strategy.
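In PostgreSQL, a query like the following surfaces the statements with the highest mean execution time (column names shown are for PostgreSQL 13 and later; earlier versions use total_time and mean_time instead of the *_exec_time columns):

```sql
-- Top queries by mean execution time, from pg_stat_statements.
SELECT queryid,
       calls,
       round(mean_exec_time::numeric, 2)  AS mean_ms,
       round(total_exec_time::numeric, 2) AS total_ms,
       rows,
       shared_blks_hit,
       shared_blks_read
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
```

Snapshotting this output on a schedule and diffing mean_ms per queryid is a simple way to detect regressions in specific query shapes.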
Connection pool utilization tells you how close you are to running out of database connections. Most production databases have a maximum connection limit, and each connection consumes memory. When the limit is approached, new requests queue or fail. Monitoring active connections, idle connections, and idle-in-transaction connections (which hold resources but are not doing work) provides early warning. A gradual increase in idle-in-transaction sessions often indicates an application-level bug where connections are not being properly returned to the pool.
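In PostgreSQL, connection states can be summarized from pg_stat_activity; comparing the total against max_connections shows how much headroom remains:

```sql
-- Connection counts by state (active, idle, idle in transaction, ...).
SELECT state, count(*) AS connections
FROM pg_stat_activity
WHERE backend_type = 'client backend'
GROUP BY state;

-- The configured ceiling, for comparison.
SHOW max_connections;
```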
Replication lag measures the delay between writes on the primary database and their visibility on read replicas. In systems that use replicas for read scaling or failover, replication lag directly affects data consistency and failover readiness. A replica that is minutes behind the primary will serve stale data to users and may require significant catch-up time during a failover event. Monitoring lag at the replica level, broken down by time and bytes, helps teams understand whether their replication configuration can keep up with write volume.
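On a PostgreSQL primary, pg_stat_replication exposes per-replica lag in both bytes and wall-clock time (the *_lag columns are available in PostgreSQL 10 and later):

```sql
-- Replication lag per connected replica.
SELECT application_name,
       client_addr,
       state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
       write_lag,
       replay_lag
FROM pg_stat_replication;
```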
Table bloat and dead tuples are critical for PostgreSQL and similar MVCC-based databases. When rows are updated or deleted, the old row versions remain in the table as dead tuples until the vacuum process reclaims the space. If vacuum falls behind, tables grow larger than necessary, indexes become bloated, and sequential scans read through dead data. Monitoring the ratio of dead tuples to live tuples per table, and the time since last vacuum, reveals whether maintenance is keeping pace with write activity.
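In PostgreSQL, pg_stat_user_tables carries the counters needed to see whether vacuum is keeping up:

```sql
-- Tables with the most dead tuples, with the dead:live ratio and the
-- last time autovacuum touched them.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       round(n_dead_tup::numeric / nullif(n_live_tup + n_dead_tup, 0), 3)
         AS dead_ratio,
       last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```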
Index usage and missing indexes directly affect query performance. Monitoring which indexes are being used (and which are not) helps identify both missing indexes that would improve query performance and unused indexes that waste write-path resources. Every index must be maintained on every insert, update, or delete to the table. An index that is written to on every operation but never read from is pure overhead. Conversely, a query pattern that consistently performs sequential scans on a large table might benefit dramatically from an index that does not yet exist.
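A starting point for finding unused indexes in PostgreSQL is pg_stat_user_indexes, which tracks how often each index has been scanned since statistics were last reset:

```sql
-- Indexes that have never been scanned, largest first. Exclude unique and
-- primary-key indexes before dropping anything: they enforce constraints
-- even when they are never read.
SELECT schemaname,
       relname,
       indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
```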
Lock contention and wait events reveal concurrency bottlenecks that may not be visible in aggregate query statistics. When multiple transactions compete for the same rows or tables, they queue behind locks. Monitoring lock waits by type (row-level, table-level, advisory) and duration helps identify contention patterns. Long-held locks often indicate transaction design issues or missing indexes that force the database to lock more broadly than necessary.
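In PostgreSQL, sessions currently stuck behind a lock can be listed from pg_stat_activity, together with the backends blocking them (pg_blocking_pids is available in PostgreSQL 9.6 and later):

```sql
-- Who is waiting on a lock, for how long, and who is blocking them.
SELECT pid,
       wait_event_type,
       wait_event,
       now() - query_start AS waiting_for,
       pg_blocking_pids(pid) AS blocked_by,
       left(query, 60) AS query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';
```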
Auto-increment and sequence ranges are easy to overlook and catastrophic when exhausted. Monitoring the current value of every auto-increment column relative to its data type maximum provides a simple, early warning for a failure mode that produces hard errors with no graceful degradation.
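In PostgreSQL, the pg_sequences view (PostgreSQL 10 and later) makes this check a one-liner:

```sql
-- Sequence consumption relative to each sequence's declared maximum.
-- Caveat: a bigint sequence feeding an integer column overflows at
-- 2,147,483,647, well before the sequence's own max_value.
SELECT schemaname,
       sequencename,
       last_value,
       max_value,
       round(100.0 * last_value / max_value, 4) AS pct_used
FROM pg_sequences
WHERE last_value IS NOT NULL
ORDER BY pct_used DESC;
```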
How can proactive database monitoring prevent incidents?
The value of database monitoring is best illustrated through the types of incidents it prevents. Many of the most impactful production incidents trace back to database issues that were detectable hours, days, or even weeks before they caused user-facing problems.
Query plan flips are a classic example. In one production incident at a mid-sized software company, a PostgreSQL query plan changed because the null fraction estimate for a column (null_frac in pg_stats) became inaccurate after a shift in data distribution. The planner, relying on stale statistics, switched from an efficient index scan to a sequential scan. The query in question was used by a high-traffic API endpoint, so the performance degradation cascaded rapidly: response times increased, connection pools filled up, and dependent services began timing out. The incident was eventually traced back to the plan change, but not before it had affected users for an extended period.
Proactive monitoring would have caught this in at least two ways. First, monitoring pg_stat_statements would have shown a sudden increase in mean execution time for that query shape. Second, monitoring the query plan itself (by periodically running EXPLAIN on key queries and comparing the output) would have detected the plan change before the latency impact became visible at the application layer. The window between "the plan changed" and "users are affected" is where proactive monitoring pays for itself.
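A lightweight way to implement the second check is to capture EXPLAIN output for key queries on a schedule and compare it run to run. The table and filter below are hypothetical placeholders; COSTS OFF suppresses the numeric estimates so that only structural plan changes show up as a diff:

```sql
-- Capture the plan shape for a key query; store the output (or a hash
-- of it) and alert when it changes between runs.
EXPLAIN (COSTS OFF)
SELECT * FROM orders WHERE customer_id = 42;
```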
Deployment-related database load is another common pattern. When an organization rolls out a new feature or migrates to a new service, the change in application behavior often translates to a change in database workload. New query patterns appear, existing query volumes shift, and the database may suddenly face a workload it was never tuned for. One team experienced this during an organization-wide rollout of a new product feature: the rollout put unexpectedly heavy load on the database, causing query latencies to spike across the application.
Monitoring database load metrics (queries per second, rows examined, CPU utilization, I/O wait) during deployments provides immediate visibility into whether a rollout is affecting database health. When combined with application-level deployment markers, teams can correlate performance changes with specific releases and roll back before the impact spreads.
Infrastructure events that affect database performance are another category that proactive monitoring catches. Cloud providers periodically perform live migrations, moving virtual machines between physical hosts for maintenance. These migrations cause brief performance dips that are visible in database metrics (elevated latency, temporary I/O stalls) but may not be visible in application-level monitoring if the impact is short-lived. Over time, correlating database performance anomalies with cloud provider maintenance events helps teams distinguish between application issues and infrastructure issues, avoiding unnecessary debugging of problems they cannot control.
The compounding effect of continuous monitoring should not be underestimated. Individual issues (a missing index here, an unused index there, a table overdue for a vacuum) may each seem minor. But databases are systems where small inefficiencies compound. A missing index that adds 50 milliseconds to a query, multiplied by thousands of executions per hour, adds up to meaningful load on the database. That additional load makes other queries slower, which increases connection hold times, which reduces available connections, which causes queuing under peak traffic. Proactive monitoring catches these small issues before they compound into systemic problems.
The most effective database monitoring setups combine automated detection with contextual alerting. Rather than setting static thresholds (alert when CPU exceeds 80%), they track trends and anomalies: alert when query execution time for a specific query shape increases by more than a standard deviation, or when the rate of dead tuple accumulation exceeds the rate of vacuum cleanup. This approach reduces false positives while catching genuine issues earlier. When paired with autonomous database agents that can not only detect but also remediate issues, the monitoring practice evolves from alerting humans to resolving problems automatically, with humans reviewing the actions taken rather than performing them. Firetiger's database agents go beyond dashboards: they continuously analyze PGStats, index usage, and replication health, then generate pull requests for recommended optimizations.
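As a sketch of trend-based alerting, assume a hypothetical query_stats_history table populated by periodically snapshotting pg_stat_statements (columns queryid, captured_at, mean_exec_time are assumptions of this sketch). A query like the following flags statements whose latest mean latency sits more than two standard deviations above their seven-day average:

```sql
WITH history AS (
  SELECT queryid,
         avg(mean_exec_time)         AS avg_ms,
         stddev_samp(mean_exec_time) AS stddev_ms
  FROM query_stats_history               -- hypothetical snapshot table
  WHERE captured_at > now() - interval '7 days'
  GROUP BY queryid
),
latest AS (
  SELECT DISTINCT ON (queryid) queryid, mean_exec_time
  FROM query_stats_history
  ORDER BY queryid, captured_at DESC
)
SELECT l.queryid,
       l.mean_exec_time AS latest_ms,
       h.avg_ms,
       h.stddev_ms
FROM latest l
JOIN history h USING (queryid)
WHERE l.mean_exec_time > h.avg_ms + 2 * h.stddev_ms;
```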
Where to start
- Enable pg_stat_statements (or equivalent): Ensure query-level performance data is being collected so you can identify slow queries and regressions.
- Monitor connection pool utilization: Track active vs. idle connections to catch pool exhaustion before it causes outages.
- Set up alerts on replication lag: Define thresholds for acceptable lag and alert when replicas fall behind.
- Deploy proactive database monitoring: Use agent-driven platforms like Firetiger that go beyond dashboards, continuously analyzing PGStats, index usage, and replication health, then generating pull requests for recommended optimizations.