What is database performance monitoring?
Database performance monitoring is the practice of continuously tracking health metrics for your databases (query execution time, connection utilization, replication lag, storage consumption, lock contention, and index effectiveness) in order to detect problems before they cause outages or user-facing degradation.
Teams typically use tools like pganalyze, Percona Monitoring and Management (PMM), Datadog Database Monitoring, and New Relic to track these metrics and surface problems. Databases are frequently the root cause of production incidents, yet they are often the last place teams look. An application that returns errors or responds slowly may have perfectly healthy application servers, well-provisioned load balancers, and functioning network connections. The bottleneck is a query that changed its execution plan overnight, a table that has accumulated enough bloat to degrade scan performance, or a connection pool that is slowly leaking connections due to an idle-in-transaction bug. Without database-level monitoring, these issues are invisible until they cascade into symptoms that are visible at the application layer, by which point the impact is already affecting users.
Effective database monitoring goes beyond simply checking whether the database is up. It means understanding the internal behavior of the database engine: how queries are being planned and executed, how resources are being allocated, and whether maintenance processes like vacuum and compaction are keeping pace with write activity. This level of visibility turns database operations from a reactive practice (something broke, figure out why) into a proactive one (something is trending in a concerning direction, address it before it matters).
What database metrics should you monitor?
The metrics that matter most depend on your database engine, but several categories are universally important across PostgreSQL, MySQL, ClickHouse, and other systems.
Query execution time and plan changes are the single most important category. Tools like pg_stat_statements in PostgreSQL aggregate statistics for every distinct query shape: how many times it has executed, the total and mean execution time, rows returned, and blocks hit or read. Monitoring these statistics over time reveals query regressions, where a query that previously ran in milliseconds suddenly takes seconds. Often, the regression is caused by a change in the query planner's chosen execution plan. The planner might switch from an index scan to a sequential scan based on updated table statistics, a change in the estimated selectivity of a filter, or simply because the table has grown past a threshold where the planner's cost model favors a different strategy.
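In PostgreSQL, a query like the following surfaces the statements with the highest mean execution time (column names shown are for PostgreSQL 13 and later; earlier versions use total_time and mean_time instead of the *_exec_time columns):

```sql
-- Top queries by mean execution time, from pg_stat_statements.
SELECT queryid,
       calls,
       round(mean_exec_time::numeric, 2)  AS mean_ms,
       round(total_exec_time::numeric, 2) AS total_ms,
       rows,
       shared_blks_hit,
       shared_blks_read
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
```

Snapshotting this output on a schedule and diffing mean_ms per queryid is a simple way to detect regressions in specific query shapes.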
Connection pool utilization tells you how close you are to running out of database connections. Most production databases have a maximum connection limit, and each connection consumes memory. When the limit is approached, new requests queue or fail. Monitoring active connections, idle connections, and idle-in-transaction connections (which hold resources but are not doing work) provides early warning. A gradual increase in idle-in-transaction sessions often indicates an application-level bug where connections are not being properly returned to the pool.
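In PostgreSQL, connection states can be summarized from pg_stat_activity; comparing the total against max_connections shows how much headroom remains:

```sql
-- Connection counts by state (active, idle, idle in transaction, ...).
SELECT state, count(*) AS connections
FROM pg_stat_activity
WHERE backend_type = 'client backend'
GROUP BY state;

-- The configured ceiling, for comparison.
SHOW max_connections;
```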
Replication lag measures the delay between writes on the primary database and their visibility on read replicas. In systems that use replicas for read scaling or failover, replication lag directly affects data consistency and failover readiness. A replica that is minutes behind the primary will serve stale data to users and may require significant catch-up time during a failover event. Monitoring lag at the replica level, broken down by time and bytes, helps teams understand whether their replication configuration can keep up with write volume.
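On a PostgreSQL primary, pg_stat_replication exposes per-replica lag in both bytes and wall-clock time (the *_lag columns are available in PostgreSQL 10 and later):

```sql
-- Replication lag per connected replica.
SELECT application_name,
       client_addr,
       state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
       write_lag,
       replay_lag
FROM pg_stat_replication;
```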
Table bloat and dead tuples are critical for PostgreSQL and similar MVCC-based databases. When rows are updated or deleted, the old row versions remain in the table as dead tuples until the vacuum process reclaims the space. If vacuum falls behind, tables grow larger than necessary, indexes become bloated, and sequential scans read through dead data. Monitoring the ratio of dead tuples to live tuples per table, and the time since last vacuum, reveals whether maintenance is keeping pace with write activity.
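In PostgreSQL, pg_stat_user_tables carries the counters needed to see whether vacuum is keeping up:

```sql
-- Tables with the most dead tuples, with the dead:live ratio and the
-- last time autovacuum touched them.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       round(n_dead_tup::numeric / nullif(n_live_tup + n_dead_tup, 0), 3)
         AS dead_ratio,
       last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```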
Index usage and missing indexes directly affect query performance. Monitoring which indexes are being used (and which are not) helps identify both missing indexes that would improve query performance and unused indexes that waste write-path resources. Every index must be maintained on every insert, update, or delete to the table. An index that is written to on every operation but never read from is pure overhead. Conversely, a query pattern that consistently performs sequential scans on a large table might benefit dramatically from an index that does not yet exist.
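A starting point for finding unused indexes in PostgreSQL is pg_stat_user_indexes, which tracks how often each index has been scanned since statistics were last reset:

```sql
-- Indexes that have never been scanned, largest first. Exclude unique and
-- primary-key indexes before dropping anything: they enforce constraints
-- even when they are never read.
SELECT schemaname,
       relname,
       indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
```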
Lock contention and wait events reveal concurrency bottlenecks that may not be visible in aggregate query statistics. When multiple transactions compete for the same rows or tables, they queue behind locks. Monitoring lock waits by type (row-level, table-level, advisory) and duration helps identify contention patterns. Long-held locks often indicate transaction design issues or missing indexes that force the database to lock more broadly than necessary.
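In PostgreSQL, sessions currently stuck behind a lock can be listed from pg_stat_activity, together with the backends blocking them (pg_blocking_pids is available in PostgreSQL 9.6 and later):

```sql
-- Who is waiting on a lock, for how long, and who is blocking them.
SELECT pid,
       wait_event_type,
       wait_event,
       now() - query_start AS waiting_for,
       pg_blocking_pids(pid) AS blocked_by,
       left(query, 60) AS query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';
```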
Auto-increment and sequence ranges are easy to overlook and catastrophic when exhausted. Monitoring the current value of every auto-increment column relative to its data type maximum provides a simple, early warning for a failure mode that produces hard errors with no graceful degradation.
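In PostgreSQL, the pg_sequences view (PostgreSQL 10 and later) makes this check a one-liner:

```sql
-- Sequence consumption relative to each sequence's declared maximum.
-- Caveat: a bigint sequence feeding an integer column overflows at
-- 2,147,483,647, well before the sequence's own max_value.
SELECT schemaname,
       sequencename,
       last_value,
       max_value,
       round(100.0 * last_value / max_value, 4) AS pct_used
FROM pg_sequences
WHERE last_value IS NOT NULL
ORDER BY pct_used DESC;
```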
How can proactive database monitoring prevent incidents?
The value of database monitoring is best illustrated through the types of incidents it prevents. Many of the most impactful production incidents trace back to database issues that were detectable hours, days, or even weeks before they caused user-facing problems.
Query plan flips are a classic example. In one production incident at a mid-sized software company, a PostgreSQL query plan changed because the null fraction estimate for a column (null_frac in pg_stats) became inaccurate after a shift in data distribution. The planner, relying on stale statistics, switched from an efficient index scan to a sequential scan. The query in question was used by a high-traffic API endpoint, so the performance degradation cascaded rapidly: response times increased, connection pools filled up, and dependent services began timing out. The incident was eventually traced back to the plan change, but not before it had affected users for an extended period.
Proactive monitoring would have caught this in at least two ways. First, monitoring pg_stat_statements would have shown a sudden increase in mean execution time for that query shape. Second, monitoring the query plan itself (by periodically running EXPLAIN on key queries and comparing the output) would have detected the plan change before the latency impact became visible at the application layer. The window between "the plan changed" and "users are affected" is where proactive monitoring pays for itself.
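A lightweight way to implement the second check is to capture EXPLAIN output for key queries on a schedule and compare it run to run. The table and filter below are hypothetical placeholders; COSTS OFF suppresses the numeric estimates so that only structural plan changes show up as a diff:

```sql
-- Capture the plan shape for a key query; store the output (or a hash
-- of it) and alert when it changes between runs.
EXPLAIN (COSTS OFF)
SELECT * FROM orders WHERE customer_id = 42;
```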
Deployment-related database load is another common pattern. When an organization rolls out a new feature or migrates to a new service, the change in application behavior often translates to a change in database workload. New query patterns appear, existing query volumes shift, and the database may suddenly face a workload it was never tuned for. One team experienced this during an organization-wide rollout of a new product feature: the rollout put unexpectedly heavy load on the database, causing query latencies to spike across the application.
Monitoring database load metrics (queries per second, rows examined, CPU utilization, I/O wait) during deployments provides immediate visibility into whether a rollout is affecting database health. When combined with application-level deployment markers, teams can correlate performance changes with specific releases and roll back before the impact spreads.
Infrastructure events that affect database performance are another category that proactive monitoring catches. Cloud providers periodically perform live migrations, moving virtual machines between physical hosts for maintenance. These migrations cause brief performance dips that are visible in database metrics (elevated latency, temporary I/O stalls) but may not be visible in application-level monitoring if the impact is short-lived. Over time, correlating database performance anomalies with cloud provider maintenance events helps teams distinguish between application issues and infrastructure issues, avoiding unnecessary debugging of problems they cannot control.
The compounding effect of continuous monitoring should not be underestimated. Individual issues (a missing index here, an unused index there, a table overdue for a vacuum) may each seem minor. But databases are systems where small inefficiencies compound. A missing index that adds 50 milliseconds to a query, multiplied by thousands of executions per hour, adds up to meaningful load on the database. That additional load makes other queries slower, which increases connection hold times, which reduces available connections, which causes queuing under peak traffic. Proactive monitoring catches these small issues before they compound into systemic problems.
The most effective database monitoring setups combine automated detection with contextual alerting. Rather than setting static thresholds (alert when CPU exceeds 80%), they track trends and anomalies: alert when query execution time for a specific query shape increases by more than a standard deviation, or when the rate of dead tuple accumulation exceeds the rate of vacuum cleanup. This approach reduces false positives while catching genuine issues earlier. When paired with autonomous database agents that can not only detect but also remediate issues, the monitoring practice evolves from alerting humans to resolving problems automatically, with humans reviewing the actions taken rather than performing them. Firetiger's database agents go beyond dashboards: they continuously analyze PGStats, index usage, and replication health, then generate pull requests for recommended optimizations.
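As a sketch of trend-based alerting, assume a hypothetical query_stats_history table populated by periodically snapshotting pg_stat_statements (columns queryid, captured_at, mean_exec_time are assumptions of this sketch). A query like the following flags statements whose latest mean latency sits more than two standard deviations above their seven-day average:

```sql
WITH history AS (
  SELECT queryid,
         avg(mean_exec_time)         AS avg_ms,
         stddev_samp(mean_exec_time) AS stddev_ms
  FROM query_stats_history               -- hypothetical snapshot table
  WHERE captured_at > now() - interval '7 days'
  GROUP BY queryid
),
latest AS (
  SELECT DISTINCT ON (queryid) queryid, mean_exec_time
  FROM query_stats_history
  ORDER BY queryid, captured_at DESC
)
SELECT l.queryid,
       l.mean_exec_time AS latest_ms,
       h.avg_ms,
       h.stddev_ms
FROM latest l
JOIN history h USING (queryid)
WHERE l.mean_exec_time > h.avg_ms + 2 * h.stddev_ms;
```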
Where to start
- Enable pg_stat_statements (or equivalent): Ensure query-level performance data is being collected so you can identify slow queries and regressions.
- Monitor connection pool utilization: Track active vs. idle connections to catch pool exhaustion before it causes outages.
- Set up alerts on replication lag: Define thresholds for acceptable lag and alert when replicas fall behind.
- Deploy proactive database monitoring: Use agent-driven platforms like Firetiger that go beyond dashboards, continuously analyzing PGStats, index usage, and replication health, then generating pull requests for recommended optimizations.