Postgres Performance: Why Peak Throughput Benchmarks Miss the Real Problem

By Matty Stratton · 7 min read · Mar 27, 2026

Tags: PostgreSQL, PostgreSQL Performance

Table of contents

1. What benchmarks actually measure
2. The specific ways sustained load differs from peak load
3. The number you should actually be looking at
4. Why this question is structurally hard to ask
5. What the right benchmark looks like
6. The benchmark question and the architecture question
7. Ask the right question before you ship


You ran the benchmark. 80,000 inserts per second. The database handled it cleanly, latency stayed flat, no alarms. You shipped with confidence.

Three months later, p95 write latency is creeping. Six months later, autovacuum is in your top processes by CPU. Nine months later, you're rebuilding indexes on a table that's crossed 400 million rows.

The benchmark wasn't wrong. The question it answered just wasn't the right one.

Peak throughput tells you what the database can do in a sprint. Production asks what it can do running forever. Those are different questions with different answers, and most teams only ask the first one.

The number that actually matters is the sustained throughput ceiling: the write rate at which all of the database's maintenance processes (autovacuum, checkpointing, WAL archiving, replication) can keep up indefinitely. It's always lower than peak throughput. It drops over time as data volume grows. And almost nobody measures it.

What benchmarks actually measure

A typical load test runs for minutes. Sometimes an hour if you're thorough. It hits the database hard, measures throughput and latency, and stops. During that window, the buffer cache is warm from the test setup. Autovacuum hasn't had time to accumulate a backlog. WAL hasn't been piling up for 72 hours straight. The indexes are fresh. The table fits mostly in memory.

These are ideal conditions. Not because anyone cheated. That's just what a bounded test looks like. The database performs brilliantly under bounded load because its maintenance subsystems haven't been outrun yet.

Production is unbounded. The data keeps arriving after the benchmark ends. Autovacuum runs against a table that grows every hour. The buffer cache works against a dataset that expands past RAM over weeks. The indexes that fit in memory at 50 million rows don't fit at 500 million. The checkpoint cycle that completed cleanly at low data volume starts competing with writes as WAL volume climbs.

The specific ways sustained load differs from peak load

There are four concrete mechanisms at work here. All four run simultaneously in production. None of them show up in a benchmark.

Your hot data stops being hot

At launch, your hot data fits in shared_buffers and the OS page cache. Read performance is largely a RAM question. As data volume grows past available RAM, cache hit rates fall. Queries that returned in milliseconds start hitting disk. The degradation is slow enough that it looks like a query regression, not a growth problem, and that's what makes it dangerous. You'll spend a sprint chasing query plans and index strategies before someone checks pg_statio_user_tables and realizes the hit rate has been sliding since month four. The latency change wasn't a code problem. It was a ratio problem.
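One way to catch that slide early is to trend the hit rate straight out of Postgres's own counters. A minimal sketch (the LIMIT and rounding are arbitrary choices; watch the slope over weeks, not the absolute number):

```sql
-- Buffer cache hit rate per table. A hit_rate that drifts downward
-- month over month is the "ratio problem" in numbers.
SELECT relname,
       heap_blks_read,
       heap_blks_hit,
       round(heap_blks_hit::numeric
             / nullif(heap_blks_hit + heap_blks_read, 0), 4) AS hit_rate
FROM pg_statio_user_tables
ORDER BY heap_blks_read DESC
LIMIT 10;
```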

Autovacuum falls behind and can't catch up

A benchmark run doesn't give autovacuum time to fall behind. Production does.

At high sustained insert rates, autovacuum fires continuously. During write peaks, it falls behind. The backlog accumulates. Bloat builds. By the time monitoring catches it, the table has weeks of accumulated dead tuples and hint-bit work queued up.

Here's the part that really gets you: clearing the backlog requires running autovacuum harder, which competes with writes, which slows ingestion. The fix and the problem share the same resource pool. You're asking the database to clean up faster while also writing faster, and there's only so much I/O to go around.
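You can see whether cleanup is losing that race from the standard statistics views. A sketch, not a full bloat audit:

```sql
-- Dead tuples versus live tuples, plus when autovacuum last finished.
-- A dead_ratio that keeps climbing between runs means autovacuum is
-- being outrun by ingestion, not just running late.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       round(n_dead_tup::numeric / nullif(n_live_tup, 0), 3) AS dead_ratio,
       last_autovacuum,
       autovacuum_count
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```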

Indexes rot

Fresh B-tree indexes on a small table are compact and cache-friendly. The same indexes a year later on a table with a billion rows are fragmented, partially sparse from the hot-right-edge problem on timestamp columns, and too large to stay in cache.

Traversal costs go up. Page splits happen more often. The 10x read improvement you got from careful indexing in the first month erodes slowly, then faster. You'll REINDEX and get performance back for a while, but the table is still growing. The next degradation cycle is already in progress.
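Tracking that erosion takes one periodic snapshot of index size against table size. A sketch using the standard catalog functions; log the output somewhere and plot it over time:

```sql
-- Index size relative to its table. On an append-only table, a ratio
-- that climbs between snapshots is fragmentation, not data growth.
SELECT relname,
       indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
       round(pg_relation_size(indexrelid)::numeric
             / nullif(pg_relation_size(relid), 0), 3) AS index_to_table_ratio
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC
LIMIT 10;
```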

WAL never stops arriving

WAL volume scales directly with insert rate. At sustained high rates, WAL generation is constant. Replicas that keep up at launch start falling behind as write volume grows. The primary retains unprocessed WAL. Disk fills. And the replica needs to process a growing backlog while new WAL keeps arriving, which means there's no quiet period to catch up. If you've ever watched pg_stat_replication and seen replay_lag tick steadily upward with no sign of plateauing, you know exactly how this ends.
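If you'd rather have that as numbers than as a bad feeling, a sketch against pg_stat_replication (the *_lag columns assume Postgres 10 or later):

```sql
-- Replay lag per replica. The question isn't the value at any one
-- moment; it's whether the lag recovers after write peaks or only grows.
SELECT application_name,
       state,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag_bytes,
       write_lag,
       flush_lag,
       replay_lag
FROM pg_stat_replication;
```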

Each of these mechanisms is invisible in a benchmark. In production, they compound.

The number you should actually be looking at

So how do you actually find the sustained throughput ceiling?

You can estimate it. Look at autovacuum activity under current load: is it finishing cycles or perpetually falling behind? Check pg_stat_bgwriter for checkpoint pressure. Watch pg_wal directory size trends. Plot the ratio of index size to table size over time. These aren't exotic metrics. They're already in Postgres. Most teams aren't watching them together.
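The checkpoint check, for instance, is a single query. A sketch against the pre-Postgres-17 layout (these counters moved to pg_stat_checkpointer in 17):

```sql
-- Checkpoint pressure. Requested checkpoints consistently outnumbering
-- timed ones means WAL volume is forcing them early, ahead of schedule.
SELECT checkpoints_timed,
       checkpoints_req,
       checkpoint_write_time,
       checkpoint_sync_time
FROM pg_stat_bgwriter;
```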

The leading indicators of a sustained throughput ceiling are: autovacuum workers consistently showing in pg_stat_activity, checkpoint completion times trending up, replica lag growing during write peaks, and n_dead_tup climbing faster than vacuum runs can clear it.

None of these show up in a benchmark. All of them show up in production, usually together, usually around month six or nine.
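The first of those indicators is also the easiest to automate. Autovacuum workers report their query text with an "autovacuum:" prefix, so a sketch like this shows whether they've become a permanent fixture:

```sql
-- Currently running autovacuum workers. Seeing the same tables here at
-- every check is the "consistently showing" signal described above.
SELECT pid,
       now() - xact_start AS running_for,
       query
FROM pg_stat_activity
WHERE query LIKE 'autovacuum:%'
ORDER BY xact_start;
```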


Why this question is structurally hard to ask

Smart teams miss this. The reasons are structural.

Benchmarks have a natural stopping point. Load tests end. Sustained load doesn't have a natural evaluation moment until something breaks. There's no "sustained throughput benchmark" in most team playbooks because the concept doesn't have a clean boundary. When do you declare the test over?

The degradation timeline is also longer than most planning cycles. Indexing starts showing stress at 300 million rows. Partitioning gets complicated at 500+ partitions. WAL volume becomes a crisis when replica lag crosses a threshold that trips an alert. These events are six to eighteen months apart. The engineer who ran the initial benchmark often isn't the one debugging the production incident.

Then there's the procurement problem. Peak throughput is a good number for architecture decisions. "This database handles 80K inserts per second" is a clean, defensible statement. "This database handles 80K inserts per second now, but that number will effectively be lower in eight months as the buffer cache hit rate falls and autovacuum starts competing for I/O" is harder to put in a slide. (Both statements are true. Only one of them gets you budget approval.)

And most capacity planning frameworks are built around static estimates. How many users, how many requests, how much storage. Sustained throughput degradation is a dynamic problem. The ceiling moves as the system runs. That doesn't fit neatly into a capacity model built for stable workloads.

This adds up to something bigger than individual teams making mistakes. The entire way the industry evaluates databases is optimized for procurement, not production. Vendor benchmarks measure peak throughput because it's the largest number. Load testing frameworks default to bounded runs because unbounded runs don't have a natural end state. Capacity planning templates assume static ceilings because dynamic ceilings are harder to model. Every layer of the evaluation stack is designed to produce a number that looks good in a slide deck. None of it answers the question you'll actually need answered in month twelve.

So if the standard evaluation framework is structurally set up to miss this, what does a better one look like?

What the right benchmark looks like

Run the load test for longer. Hours, not minutes. Watch what happens to autovacuum, not just query latency.

Start the test with a table that already has data in it, sized to your 12-month projection. A benchmark on an empty table tells you about cold start performance. It tells you almost nothing about what the system looks like after a year of continuous ingestion.
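A minimal way to do that pre-loading, assuming a hypothetical metrics table (the name, columns, and row density are stand-ins for your real schema and your real projection):

```sql
-- Hypothetical seed: a year of synthetic rows, so the benchmark starts
-- at the data volume you expect twelve months in, not at zero.
CREATE TABLE metrics (
    ts        timestamptz NOT NULL,
    device_id int         NOT NULL,
    value     double precision
);

-- One row per second for a year is roughly 31 million rows; shrink the
-- step interval to scale the volume up toward your projection.
INSERT INTO metrics (ts, device_id, value)
SELECT g,
       (random() * 10000)::int,
       random()
FROM generate_series(now() - interval '12 months', now(),
                     interval '1 second') AS g;

-- Build the index on the pre-loaded data so the benchmark exercises a
-- production-sized index from the first write.
CREATE INDEX ON metrics (ts);
```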

Measure these things during the test:

  • pg_stat_bgwriter: checkpoint frequency and write volume
  • pg_stat_activity: autovacuum activity
  • Replica lag if you're running replicas
  • pg_stat_wal: WAL generation rate
  • Index size relative to table size
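For the WAL line item, pg_stat_wal (Postgres 14 and later) exposes cumulative counters; a sketch is to sample it on an interval during the run and diff consecutive readings to get a rate:

```sql
-- WAL generated since the last stats reset. Two samples a few minutes
-- apart, subtracted, give you the generation rate under load.
SELECT wal_records,
       wal_fpi,
       pg_size_pretty(wal_bytes) AS wal_generated,
       stats_reset
FROM pg_stat_wal;
```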

Repeat the test with 3x the data volume. If performance drops more than linearly, you've found where the architecture starts to strain. That's the number you want before you ship, not after.

The test that catches the Optimization Treadmill is a test that asks: what happens when this runs for a year? You can simulate that in a day if you load the data upfront and run the benchmark against a realistic data volume.

The benchmark question and the architecture question

If your system has the six workload characteristics (continuous ingestion, time-series access patterns, append-only data, long retention, operational query requirements, sustained growth), the sustained throughput ceiling is structural. Better benchmarking tells you earlier where the ceiling is, but it won't raise it.

Benchmarking tells you how fast the ceiling approaches. Architecture determines where it sits.

Teams that run good sustained-load benchmarks early find out at 30 million rows that they're on the Optimization Treadmill. Teams that only run peak throughput benchmarks find out at 800 million rows. The underlying architectural problem is identical in both cases. The migration cost is not.

Ask the right question before you ship

Peak throughput is a useful number. It tells you whether the hardware can keep up with the write rate at a point in time. Worth knowing.

It just doesn't tell you whether the maintenance processes can keep up with that write rate indefinitely, as data volume grows and the vacuum backlog and WAL volume and cache pressure all grow with it.

The question nobody asks before shipping is usually the one that generates the incident nine months later. Ask it now. Run the load test against a full-size dataset. Watch autovacuum, not just query latency. Track the ceiling as a moving target, not a static spec.

And if the benchmark reveals what the scoring framework already suggested, the cheapest architectural decision you'll make is the one you make before the table crosses 100 million rows.
