<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[Tiger Data Blog]]></title>
        <description><![CDATA[Insights, product updates, and tips from TigerData (Creators of TimescaleDB) engineers on Postgres, time series & AI. IoT, crypto, and analytics tutorials & use cases.]]></description>
        <link>https://www.tigerdata.com/blog</link>
        <image>
            <url>https://www.tigerdata.com/icon.ico</url>
            <title>Tiger Data Blog</title>
            <link>https://www.tigerdata.com/blog</link>
        </image>
        <generator>RSS for Node</generator>
        <lastBuildDate>Tue, 07 Apr 2026 09:54:10 GMT</lastBuildDate>
        <atom:link href="https://www.tigerdata.com/blog" rel="self" type="application/rss+xml"/>
        <ttl>60</ttl>
        <item>
            <title><![CDATA[Document Databases: Be Honest]]></title>
            <description><![CDATA[Most MongoDB pain isn't a MongoDB problem. It's a workload shape problem that would follow you to Postgres.]]></description>
            <link>https://www.tigerdata.com/blog/document-databases-be-honest</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/document-databases-be-honest</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Matty Stratton]]></dc:creator>
            <pubDate>Wed, 01 Apr 2026 17:22:30 GMT</pubDate>
            <media:content medium="image" url="https://timescale.ghost.io/blog/content/images/2026/04/Document-Databases_-Be-Honest-V2.png">
            </media:content>
            <content:encoded><![CDATA[<p>MongoDB gets a bad reputation in certain engineering circles that it doesn't entirely deserve. It ships fast. Schema flexibility is real. The developer experience for document-shaped data is good. A lot of teams made a reasonable call when they chose it.</p><p>But there's a version of this story that ends badly, and it follows a recognizable pattern. The team picks MongoDB for a new system. The system works. Then the data starts looking less like documents and more like a stream of timestamped events. Queries start filtering by time range. Write volume climbs. Performance degrades in ways that feel familiar if you've read about this problem, and deeply confusing if you haven't.</p><p>This post isn't here to relitigate the MongoDB decision. It's here to help you figure out whether the pain you're feeling is a MongoDB problem, a document database problem, or a workload problem that would follow you to Postgres.</p><p>The answer matters because the fix is different in each case.</p><h2 id="what-mongodb-is-actually-good-at">What MongoDB is actually good at</h2><p>Flexible schema for variable data that's actually variable. Product catalogs where every SKU has different attributes. User profiles where fields vary by account type. Content management where article structure differs by category. These are real document shapes, and MongoDB handles them without the ceremony Postgres requires.</p><p>Rapid iteration without migration overhead. Early-stage products change their data model constantly. In Postgres, every schema change is an <code>ALTER TABLE</code>. In MongoDB, you just write different fields. For teams that are still figuring out the shape of their data, this is a real advantage.</p><p>Nested and hierarchical data. Some data is naturally a tree. A purchase order with line items with sub-components. A configuration object with nested sections. 
Postgres can model this with JSONB, but MongoDB's native document model fits it more naturally and queries it more cleanly.</p><p>Horizontal scaling for document reads. MongoDB's sharding model was designed for document workloads. For read-heavy document access at scale, it's a mature and well-understood architecture.</p><p>These aren't consolation prizes. They're real reasons MongoDB is the right choice for a lot of workloads.</p><p>The trouble starts when the data changes shape.</p><h2 id="what-time-series-data-actually-looks-like">What time-series data actually looks like</h2><p>Time-series data has a specific shape, and it's not a document shape. Every row is a measurement. It has a timestamp, a source identifier, and a value or set of values. The schema doesn't vary between rows. There's nothing hierarchical about it. The document model isn't adding anything.</p><p>What time-series data has instead: enormous volume, strict ordering requirements, queries that almost always filter by time range, and retention policies that drop entire time windows at once.</p><p>A wind turbine sensor reporting every five seconds doesn't produce documents. It produces a flat stream of readings: timestamp, sensor ID, RPM, temperature, vibration. A financial trade feed isn't a document store. It's a sequence of immutable events. An APM platform collecting metrics from a distributed system is generating hundreds of thousands of measurements per second, all with the same shape.</p><p>The test is simple. Look at your most-written collection. Does each document have a different structure? 
Or does every document look essentially the same, with a timestamp and some measurements?</p><p>If it's the latter, you're storing time-series data in a document database, and the document model is providing zero value while the storage engine works against you.</p><h2 id="where-mongodb-struggles-with-this-workload">Where MongoDB struggles with this workload</h2><p>WiredTiger (MongoDB's default storage engine) uses a B-tree structure optimized for a workload that includes updates to existing documents. For high-frequency append-only writes, it faces a fundamental mismatch. Consider a single sensor reading: one document insert triggers a write to the primary collection, a write to the oplog, and a separate B-tree update for every index on that collection. Three indexes means five writes for one data point. At 10,000 inserts per second, that's 50,000 storage operations per second before you've run a single query. The engine was designed for mixed read-write workloads with in-place updates, not an endless append stream where no document is ever modified after creation.</p><p>MongoDB has no native time-based partitioning. Postgres has declarative range partitioning.<a href="https://www.tigerdata.com/blog/postgres-optimization-treadmill"> <u>TimescaleDB automates it entirely with hypertables</u></a>. MongoDB has no equivalent primitive. Teams end up implementing time-based collection bucketing manually: separate collections per day or week, application-level routing logic, custom cleanup scripts. It works, but it's the same operational burden as manual Postgres partitioning, without the tooling ecosystem that exists on the Postgres side.</p><p>MongoDB's aggregation pipeline is expressive. But for time-series workloads, the queries that matter are time-range aggregations: hourly averages, daily maximums, week-over-week comparisons. These queries scan large volumes of documents and aggregate across fields. 
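</p><p>To make that shape concrete, here is roughly what an hourly-average query looks like as a MongoDB aggregation pipeline, sketched in plain Python (the collection layout and field names are invented for illustration, and <code>$dateTrunc</code> requires MongoDB 5.0+):</p>

```python
from datetime import datetime, timezone

# Hypothetical document shape: {"ts": datetime, "sensor_id": str, "temp_c": float}.
# A typical time-range aggregation: match a window, truncate timestamps
# to the hour, average per bucket, sort by time.
pipeline = [
    {"$match": {"ts": {"$gte": datetime(2026, 3, 1, tzinfo=timezone.utc),
                       "$lt":  datetime(2026, 3, 2, tzinfo=timezone.utc)}}},
    {"$group": {
        "_id": {"$dateTrunc": {"date": "$ts", "unit": "hour"}},
        "avg_temp": {"$avg": "$temp_c"},
    }},
    {"$sort": {"_id": 1}},
]

# Pure-Python equivalent over in-memory readings, to show the work the
# engine has to do either way: touch every document in the window, bucket
# it by hour, then aggregate.
def hourly_averages(readings):
    buckets = {}
    for r in readings:
        hour = r["ts"].replace(minute=0, second=0, microsecond=0)
        buckets.setdefault(hour, []).append(r["temp_c"])
    return {h: sum(v) / len(v) for h, v in sorted(buckets.items())}
```

<p>Against a real collection this would be <code>db.readings.aggregate(pipeline)</code> (with <code>readings</code> standing in for your own collection name). The point of the pure-Python version is that there is no document-model trick that avoids the scan: every document in the window gets touched.</p><p>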
Without columnar storage and purpose-built time-series compression, performance degrades with data volume in the same way it does in vanilla Postgres.</p><p>MongoDB did add a native time-series collection type in 5.0. It's a real improvement for simple append-only use cases. But it doesn't support secondary indexes the same way regular collections do, restricts certain aggregation stages and update operations, and is still relatively new compared to the Postgres ecosystem. Worth knowing about. Not a full answer.</p><h2 id="why-moving-to-vanilla-postgres-isnt-automatically-the-fix">Why moving to vanilla Postgres isn't automatically the fix</h2><p>This is the section most competitive content skips entirely. If you're evaluating a migration, you deserve the full picture.</p><p>If the workload is continuous high-frequency time-series ingestion with long retention and operational query requirements, vanilla Postgres has its own version of this problem.<a href="https://www.tigerdata.com/blog/postgres-optimization-treadmill"> <u>The MVCC overhead, write amplification, autovacuum contention, and index maintenance costs that create the Optimization Treadmill</u></a> exist in Postgres too. The storage model is different from MongoDB's, but the outcome at scale is the same: performance degrades with data volume, maintenance overhead accumulates, and each optimization cycle buys time without changing the trajectory.</p><p>Moving from MongoDB to vanilla Postgres solves the schema flexibility problem (you probably don't need it for this workload anyway). You get a mature partitioning ecosystem, a better query planner, and a richer extension ecosystem. These are real improvements.</p><p>It doesn't solve the core time-series storage problem, because that problem lives in the storage model, not the database brand.</p><p>The question isn't MongoDB vs. Postgres. It's document store vs. purpose-built time-series storage. 
That's the actual axis the decision should sit on.</p><h2 id="the-decision-framework">The decision framework</h2><p><strong>Your data is actually documents.</strong> Variable schema, nested structures, hierarchical relationships, read-heavy access patterns. MongoDB is the right tool. The pain you're feeling is probably a schema design or indexing problem, not a fundamental architectural mismatch. Fix the schema.</p><p><strong>Your data is time-series but volume is modest.</strong> Sub-10K inserts per second, retention under 90 days, no hard operational latency requirements on the full retention window. Vanilla Postgres with good partitioning and indexing handles this fine. The Optimization Treadmill exists, but the ceiling is far enough away that standard tuning keeps you ahead of it. Move to Postgres, implement partitioning early, and<a href="https://www.tigerdata.com/blog/six-signs-postgres-tuning-wont-fix-performance-problems"> <u>monitor the warning signs</u></a>.</p><p><strong>Your data is time-series at sustained high volume.</strong> Continuous ingestion, long retention, operational query requirements, growing data volume. This is the workload that breaks both MongoDB and vanilla Postgres through the same class of mechanisms. Purpose-built time-series storage on Postgres (same SQL, same wire protocol, same tooling) is the right answer.<a href="https://www.tigerdata.com/complete-guide-migrating-from-mongodb-to-tiger-data-step-by-step"> <u>Migration from MongoDB to TimescaleDB follows a well-documented path</u></a>: you keep everything Postgres-compatible and gain the storage architecture that matches the workload.</p><h2 id="what-to-do-next">What to do next</h2><p>MongoDB didn't fail you if you're reading this. Your workload evolved past what document storage was designed for. That's a different thing.</p><p>Most database choices are right at the time they're made and wrong eighteen months later when the system looks nothing like it did at launch. 
Sensor data that started as a feature became the core product. The document store that handled early prototyping became the production system for a time-series pipeline.</p><p>The question now is whether the fix is tuning, migration, or architecture. The framework above gives you a clear read on which one applies. If it's architecture, the good news is that moving from MongoDB to a Postgres-compatible time-series database is less disruptive than it sounds. Your application SQL stays the same. Your tooling stays the same. The storage engine underneath is the thing that changes.</p><p>That's the right scope for the change. Not the whole stack. Just the part that was always wrong for this workload.</p><p><a href="https://www.tigerdata.com/blog/postgres-optimization-treadmill"><u>Read the full technical breakdown of why vanilla Postgres hits these limits</u></a>, or<a href="https://console.cloud.tigerdata.com/signup"> <u>start a Tiger Cloud trial</u></a> and see how TimescaleDB handles your workload directly.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Postgres Performance: Why Peak Throughput Benchmarks Miss the Real Problem]]></title>
            <description><![CDATA[Peak throughput tells you what Postgres can do in a sprint. Production asks what it can do forever. Those are different questions.]]></description>
            <link>https://www.tigerdata.com/blog/postgres-performance-why-peak-throughput-benchmarks-miss-real-problem</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/postgres-performance-why-peak-throughput-benchmarks-miss-real-problem</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[PostgreSQL Performance]]></category>
            <dc:creator><![CDATA[Matty Stratton]]></dc:creator>
            <pubDate>Fri, 27 Mar 2026 14:30:33 GMT</pubDate>
            <media:content medium="image" url="https://timescale.ghost.io/blog/content/images/2026/03/The-Database-Question-Nobody-Asks.png">
            </media:content>
            <content:encoded><![CDATA[<p>You ran the benchmark. 80,000 inserts per second. The database handled it clean, latency stayed flat, no alarms. You shipped with confidence.</p><p>Three months later, p95 write latency is creeping. Six months later, autovacuum is in your top processes by CPU. Nine months later, you're rebuilding indexes on a table that's crossed 400 million rows.</p><p>The benchmark wasn't wrong. The question it answered just wasn't the right one.</p><p>Peak throughput tells you what the database can do in a sprint. Production asks what it can do running forever. Those are different questions with different answers, and most teams only ask the first one.</p><p>The number that actually matters is the <em>sustained throughput ceiling</em>: the write rate at which all of the database's maintenance processes (autovacuum, checkpointing, WAL archiving, replication) can keep up indefinitely. It's always lower than peak throughput. It drops over time as data volume grows. And almost nobody measures it.</p><h2 id="what-benchmarks-actually-measure">What benchmarks actually measure</h2><p>A typical load test runs for minutes. Sometimes an hour if you're thorough. It hits the database hard, measures throughput and latency, and stops. During that window, the buffer cache is warm from the test setup. Autovacuum hasn't had time to accumulate a backlog. WAL hasn't been generating for 72 hours straight. The indexes are fresh. The table fits mostly in memory.</p><p>These are ideal conditions. Not because anyone cheated. That's just what a bounded test looks like. The database performs brilliantly under bounded load because its maintenance subsystems haven't been outrun yet.</p><p>Production is unbounded. The data keeps arriving after the benchmark ends. Autovacuum runs against a table that grows every hour. The buffer cache works against a dataset that expands past RAM over weeks. The indexes that fit in memory at 50 million rows don't fit at 500 million. 
The checkpoint cycle that completed cleanly at low data volume starts competing with writes as WAL volume climbs.</p><h2 id="the-specific-ways-sustained-load-differs-from-peak-load">The specific ways sustained load differs from peak load</h2><p>There are four concrete mechanisms at work here. All four run simultaneously in production. None of them show up in a benchmark.</p><h3 id="your-hot-data-stops-being-hot">Your hot data stops being hot</h3><p>At launch, your hot data fits in <code>shared_buffers</code> and the OS page cache. Read performance is largely a RAM question. As data volume grows past available RAM, cache hit rates fall. Queries that returned in milliseconds start hitting disk. The degradation is slow enough that it looks like a query regression, not a growth problem, and that's what makes it dangerous. You'll spend a sprint chasing query plans and index strategies before someone checks <code>pg_statio_user_tables</code> and realizes the hit rate has been sliding since month four. The latency change wasn't a code problem. It was a ratio problem.</p><h3 id="autovacuum-falls-behind-and-cant-catch-up">Autovacuum falls behind and can't catch up</h3><p>A benchmark run doesn't give autovacuum time to fall behind. Production does.</p><p>At high sustained insert rates, autovacuum fires continuously. During write peaks, it falls behind. The backlog accumulates. Bloat builds. By the time monitoring catches it, the table has weeks of accumulated dead tuples and hint-bit work queued up.</p><p>Here's the part that really gets you: clearing the backlog requires running autovacuum harder, which competes with writes, which slows ingestion. The fix and the problem share the same resource pool. You're asking the database to clean up faster while also writing faster, and there's only so much I/O to go around.</p><h3 id="indexes-rot">Indexes rot</h3><p>Fresh B-tree indexes on a small table are compact and cache-friendly. 
The same indexes a year later on a table with a billion rows are fragmented, partially sparse from the hot-right-edge problem on timestamp columns, and too large to stay in cache.</p><p>Traversal costs go up. Page splits happen more often. The 10x read improvement you got from careful indexing in the first month erodes slowly, then faster. You'll REINDEX and get performance back for a while, but the table is still growing. The next degradation cycle is already in progress.</p><h3 id="wal-never-stops-arriving">WAL never stops arriving</h3><p>WAL volume scales directly with insert rate. At sustained high rates, WAL generation is constant. Replicas that keep up at launch start falling behind as write volume grows. The primary retains unprocessed WAL. Disk fills. And the replica needs to process a growing backlog while new WAL keeps arriving, which means there's no quiet period to catch up. If you've ever watched <code>pg_stat_replication</code> and seen <code>replay_lag</code> tick steadily upward with no sign of plateauing, you know exactly how this ends.</p><p>Each of these mechanisms is invisible in a benchmark. In production, they compound.</p><h2 id="the-number-you-should-actually-be-looking-at">The number you should actually be looking at</h2><p>So how do you actually find the sustained throughput ceiling?</p><p>You can estimate it. Look at autovacuum activity under current load: is it finishing cycles or perpetually falling behind? Check <code>pg_stat_bgwriter</code> for checkpoint pressure. Watch <code>pg_wal</code> directory size trends. Plot the ratio of index size to table size over time. These aren't exotic metrics. They're already in Postgres. 
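</p><p>Watching them together can be as simple as trending a handful of sampled counters side by side. A minimal sketch, assuming you already sample these views periodically into your monitoring stack (the field names here are illustrative, not actual Postgres column names):</p>

```python
# Periodic samples (say, hourly) of counters pulled from pg_stat_bgwriter,
# pg_stat_replication, and pg_stat_user_tables. The signal is in the
# combination of trends, not in any single counter's absolute value.

def trending_up(values, tolerance=0.05):
    """True if the series rises persistently (near-monotone within tolerance)."""
    return all(b >= a * (1 - tolerance) for a, b in zip(values, values[1:])) \
        and values[-1] > values[0]

def ceiling_warnings(samples):
    series = {k: [s[k] for s in samples] for k in samples[0]}
    warnings = []
    if trending_up(series["checkpoint_write_time_ms"]):
        warnings.append("checkpoint completion time trending up")
    if trending_up(series["replay_lag_s"]):
        warnings.append("replica lag growing")
    if trending_up(series["n_dead_tup"]):
        warnings.append("dead tuples accumulating faster than vacuum clears them")
    return warnings
```

<p>A real version would live in whatever already scrapes your database; the sketch is only here to show that "watch the ceiling" is a few lines of trend-checking once the samples exist.</p><p>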
Most teams aren't watching them together.</p><p>The leading indicators of a sustained throughput ceiling: autovacuum consistently showing in <code>pg_stat_activity</code>, checkpoint completion times trending up, replica lag growing during write peaks, <code>n_dead_tup</code> climbing faster than <code>vacuum_count</code> is cleaning.</p><p>None of these show up in a benchmark. All of them show up in production, usually together, usually around month six or nine.</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2026/03/data-src-image-0038c140-3769-49ad-abbc-4e1e65c072e1.jpeg" class="kg-image" alt="" loading="lazy" width="1376" height="768" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2026/03/data-src-image-0038c140-3769-49ad-abbc-4e1e65c072e1.jpeg 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2026/03/data-src-image-0038c140-3769-49ad-abbc-4e1e65c072e1.jpeg 1000w, https://timescale.ghost.io/blog/content/images/2026/03/data-src-image-0038c140-3769-49ad-abbc-4e1e65c072e1.jpeg 1376w" sizes="(min-width: 720px) 720px"></figure><h2 id="why-this-question-is-structurally-hard-to-ask">Why this question is structurally hard to ask</h2><p>Smart teams miss this. The reasons are structural.</p><p>Benchmarks have a natural stopping point. Load tests end. Sustained load doesn't have a natural evaluation moment until something breaks. There's no "sustained throughput benchmark" in most team playbooks because the concept doesn't have a clean boundary. When do you declare the test over?</p><p>The degradation timeline is also longer than most planning cycles. Indexing starts showing stress at 300 million rows. Partitioning gets complicated at 500+ partitions. WAL volume becomes a crisis when replica lag crosses a threshold that trips an alert. These events are six to eighteen months apart. 
The engineer who ran the initial benchmark often isn't the one debugging the production incident.</p><p>Then there's the procurement problem. Peak throughput is a good number for architecture decisions. "This database handles 80K inserts per second" is a clean, defensible statement. "This database handles 80K inserts per second now, but that number will effectively be lower in eight months as the buffer cache hit rate falls and autovacuum starts competing for I/O" is harder to put in a slide. (Both statements are true. Only one of them gets you budget approval.)</p><p>And most capacity planning frameworks are built around static estimates. How many users, how many requests, how much storage. Sustained throughput degradation is a dynamic problem. The ceiling moves as the system runs. That doesn't fit neatly into a capacity model built for stable workloads.</p><p>This adds up to something bigger than individual teams making mistakes. The entire way the industry evaluates databases is optimized for procurement, not production. Vendor benchmarks measure peak throughput because it's the largest number. Load testing frameworks default to bounded runs because unbounded runs don't have a natural end state. Capacity planning templates assume static ceilings because dynamic ceilings are harder to model. Every layer of the evaluation stack is designed to produce a number that looks good in a slide deck. None of it answers the question you'll actually need answered in month twelve.</p><p>So if the standard evaluation framework is structurally set up to miss this, what does a better one look like?</p><h2 id="what-the-right-benchmark-looks-like">What the right benchmark looks like</h2><p>Run the load test for longer. Hours, not minutes. Watch what happens to autovacuum, not just query latency.</p><p>Start the test with a table that already has data in it, sized to your 12-month projection. A benchmark on an empty table tells you about cold start performance. 
It tells you almost nothing about what the system looks like after a year of continuous ingestion.</p><p>Measure these things during the test:</p><ul><li><code>pg_stat_bgwriter</code>: checkpoint frequency and write volume</li><li><code>pg_stat_activity</code>: autovacuum activity</li><li>Replica lag if you're running replicas</li><li><code>pg_stat_wal</code>: WAL generation rate</li><li>Index size relative to table size</li></ul><p>Repeat the test with 3x the data volume. If performance drops more than linearly, you've found where the architecture starts to strain. That's the number you want before you ship, not after.</p><p>The test that catches the<a href="https://www.tigerdata.com/blog/postgres-optimization-treadmill"> <u>Optimization Treadmill</u></a> is a test that asks: what happens when this runs for a year? You can simulate that in a day if you load the data upfront and run the benchmark against a realistic data volume.</p><h2 id="the-benchmark-question-and-the-architecture-question">The benchmark question and the architecture question</h2><p>If your system has<a href="https://www.tigerdata.com/blog/postgres-optimization-treadmill"> <u>the six workload characteristics</u></a> (continuous ingestion, time-series access patterns, append-only data, long retention, operational query requirements, sustained growth), the sustained throughput ceiling is structural. Better benchmarking tells you earlier where the ceiling is, but it won't raise it.</p><p>Benchmarking tells you how fast the ceiling approaches. Architecture determines where it sits.</p><p>Teams that run good sustained-load benchmarks early find out at 30 million rows that they're on the Optimization Treadmill. Teams that only run peak throughput benchmarks find out at 800 million rows. The underlying architectural problem is identical in both cases. 
The migration cost is not.</p><h2 id="ask-the-right-question-before-you-ship">Ask the right question before you ship</h2><p>Peak throughput is a useful number. It tells you whether the hardware can keep up with the write rate at a point in time. Worth knowing.</p><p>It just doesn't tell you whether the maintenance processes can keep up with that write rate indefinitely, as data volume grows and the vacuum backlog and WAL volume and cache pressure all grow with it.</p><p>The question nobody asks before shipping is usually the one that generates the incident nine months later. Ask it now. Run the load test against a full-size dataset. Watch autovacuum, not just query latency. Track the ceiling as a moving target, not a static spec.</p><p>And if the benchmark reveals what the<a href="https://www.tigerdata.com/blog/postgres-optimization-treadmill"> <u>scoring framework</u></a> already suggested, the cheapest architectural decision you'll make is the one you make before the table crosses 100 million rows.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[MVCC: The Feature You're Paying For But Not Using]]></title>
            <description><![CDATA[MVCC is great for concurrent workloads. For append-only data, it's 23 bytes of overhead per row that never gets used. Here's what that actually costs.]]></description>
            <link>https://www.tigerdata.com/blog/mvcc-feature-youre-paying-for-but-not-using</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/mvcc-feature-youre-paying-for-but-not-using</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Scaling PostgreSQL]]></category>
            <dc:creator><![CDATA[Matty Stratton]]></dc:creator>
            <pubDate>Fri, 20 Mar 2026 13:07:26 GMT</pubDate>
            <media:content medium="image" url="https://timescale.ghost.io/blog/content/images/2026/03/V1.png">
            </media:content>
            <content:encoded><![CDATA[<p>Most engineers have a working mental model of MVCC. Readers don't block writers. Concurrent transactions see consistent snapshots. It's why Postgres handles mixed read/write workloads so well, and it's a genuine engineering achievement.</p><p>What's less obvious is that MVCC isn't free. Every row in every table carries its overhead. Not just rows that get updated. The system doesn't know at write time whether a row will ever be touched again, so it prepares for that possibility. Every time.</p><p>If you're running an IoT pipeline, a financial data feed, or an observability platform, most of your rows will never be updated. Sensor readings don't get corrected. Trade records are immutable. Log entries are permanent. You're writing append-only data into a system built to handle concurrent modification of shared rows, and you're paying the full price for that capability whether you use it or not.</p><p>This post breaks down exactly what that costs you: at the byte level, at the I/O level, and at the maintenance level.</p><h2 id="what-mvcc-actually-does-and-why-its-damn-good">What MVCC actually does (and why it's damn good)</h2><p>Before MVCC, databases had two options: lock rows during reads so writers couldn't touch them, or lock rows during writes so readers couldn't see them. Either way, concurrent workloads serialized through lock contention. If you've ever worked with a database that does this, you know how painful it gets at scale.</p><p>MVCC solves the problem differently. When a row is updated, Postgres doesn't modify it in place. It writes a new version of the row and keeps the old version visible to transactions that started before the update. Each transaction sees a consistent snapshot of the database as of the moment it began. Readers and writers operate on different row versions simultaneously. 
No locking required.</p><p>For an e-commerce backend processing orders while users browse, a SaaS application handling concurrent sessions, or any system where multiple transactions touch the same rows, this is transformative. The PostgreSQL documentation puts it simply: reading never blocks writing and writing never blocks reading.</p><p>That's not a small thing. That's the reason Postgres can handle the concurrency patterns that would bring a lock-based system to its knees.</p><p>The cost of maintaining this guarantee is what the rest of this post is about.</p><h2 id="the-per-row-overhead-in-bytes">The per-row overhead, in bytes</h2><p>This is where most explanations go vague. Let's not do that.</p><p>Every heap tuple in Postgres carries a fixed 23-byte header before a single byte of your actual data gets written. Here's what's in it:</p><ul><li><code>t_xmin</code>: the transaction ID that created this row (4 bytes)</li><li><code>t_xmax</code>: the transaction ID that deleted or updated it, zero if the row is live (4 bytes)</li><li><code>t_cid</code>: command ID within the transaction (4 bytes)</li><li><code>t_ctid</code>: physical location of this tuple or its newer version (6 bytes)</li><li><code>t_infomask</code> and <code>t_infomask2</code>: status flags for transaction visibility (4 bytes)</li><li><code>t_hoff</code>: offset to actual row data (1 byte)</li></ul><p>These fields exist to answer one question: is this row visible to this transaction?</p><p>For a workload where rows are being updated and deleted concurrently, that question needs answering constantly. The 23 bytes are worth it.</p><p>For an append-only workload? <code>t_xmax</code> is zero for every live row and will stay zero. <code>t_ctid</code> points to itself because there's no newer version. The visibility question still gets asked, and the header still gets written, and the page still gets dirtied to set hint bits after the first read. But the answers are trivial every time. 
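</p><p>The arithmetic from the field list above is easy to total up. A quick sketch, counting the alignment pad and the per-tuple line pointer as well (the 50K/sec ingest rate is an illustrative assumption):</p>

```python
# Fixed heap tuple header fields, in bytes (from the list above).
HEADER_FIELDS = {"t_xmin": 4, "t_xmax": 4, "t_cid": 4, "t_ctid": 6,
                 "t_infomask+t_infomask2": 4, "t_hoff": 1}

header_bytes = sum(HEADER_FIELDS.values())   # 23 bytes before any row data
per_row = header_bytes + 1 + 4               # + pad to 24 + 4-byte line pointer = 28

INSERTS_PER_SEC = 50_000                     # illustrative sustained ingest rate
overhead_mb_per_sec = INSERTS_PER_SEC * per_row / 1_000_000
overhead_tb_per_year = overhead_mb_per_sec * 86_400 * 365 / 1_000_000
# roughly 1.4 MB/sec of header bytes, on the order of 44 TB over a year
```

<p>None of those bytes are your data. They exist so a visibility question with a trivial answer can keep being asked.</p><p>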
The mechanism is running in full for a case that never needed it.</p><p>Add alignment padding and a 4-byte <code>ItemIdData</code> pointer per tuple, and the true per-row overhead is closer to 28 to 30 bytes before your row data starts.</p><p>Let's make that concrete. At 50K inserts per second, that's 1.4 to 1.5 MB/sec of pure header overhead. Per year: roughly 44 TB of header data for a workload that never updates a row.</p><p>That's not a rounding error.</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2026/03/heap-tuple-diagram-v2.jpg" class="kg-image" alt="Postgres heap tuple header layout" loading="lazy" width="2000" height="1116" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2026/03/heap-tuple-diagram-v2.jpg 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2026/03/heap-tuple-diagram-v2.jpg 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2026/03/heap-tuple-diagram-v2.jpg 1600w, https://timescale.ghost.io/blog/content/images/size/w2400/2026/03/heap-tuple-diagram-v2.jpg 2400w" sizes="(min-width: 720px) 720px"></figure><h2 id="what-autovacuum-is-actually-doing-on-your-append-only-table">What autovacuum is actually doing on your append-only table</h2><p>Here’s what’s going to wrinkle your brain.</p><p>You think, “Autovacuum cleans up dead tuples from updates and deletes. Append-only tables don't update or delete rows. Therefore, autovacuum shouldn't have much to do.”</p><p>That intuition is wrong in three specific ways.</p><p><strong>Aborted transactions leave dead tuples.</strong> Not every <code>INSERT</code> commits. Connection drops, application errors, explicit rollbacks. These all leave tuple versions that need cleaning.
If you're running high insert rates, you've got a steady trickle of aborted transactions even in perfectly healthy systems.</p><p><strong>Hint bits require page dirtying.</strong> When a row is first read after being written, Postgres needs to check <code>pg_xact</code> to confirm the writing transaction committed. Once confirmed, it sets a hint bit in <code>t_infomask</code> to cache that result. Setting the hint bit dirties the page, which means writing it back to disk. On an append-only table with high read rates, hint bit setting is continuous background I/O on pages that will never change in any meaningful way. Welcome to your new normal.</p><p><strong>Since PostgreSQL 13, insert volume alone triggers autovacuum.</strong> Not just dead tuples. Postgres needs to periodically freeze old transaction IDs to prevent XID wraparound, which is a hard limit built into the 32-bit transaction counter. At high insert rates, autovacuum fires continuously just to freeze tuples on tables with zero updates.</p><p>Go check <code>autovacuum_count</code> and <code>vacuum_count</code> on your busiest append-only partition. They're climbing whether or not <code>n_dead_tup</code> is.</p><p>The result: autovacuum workers show up in <code>pg_stat_activity</code> at all hours on tables that never see a single <code>UPDATE</code>. You tune <code>autovacuum_vacuum_scale_factor</code> and <code>autovacuum_max_workers</code>, and it helps at the margin. But what you're tuning is how the cleanup process competes with writes. Not why it needs to run at all.</p><h2 id="the-write-amplification-chain">The write amplification chain</h2><p>Now let's connect all of this into the full cost picture.</p><p>A single 1 KB sensor reading doesn't write 1 KB. 
Here's what actually hits disk:</p><ul><li>23-byte heap tuple header plus padding</li><li>1,024 bytes of your actual row data</li><li>One entry per index, roughly 40 to 80 bytes each in B-tree leaf pages (five indexes = 200 to 400 bytes)</li><li>One WAL record per heap insert, one per index insertion: approximately 1.2 KB total</li><li>Periodically: an 8 KB full-page write after checkpoint for any newly dirtied page</li></ul><p>Total actual I/O: 2.5 to 3.5 KB for 1 KB of logical data.</p><p>The MVCC header is the entry point for this entire chain. It's what requires the visibility tracking, the hint bit mechanism, the autovacuum sweep, and the WAL record structure that Postgres uses.</p><p>At 100K inserts per second, you're writing 250 to 350 MB/sec of actual I/O for 100 MB/sec of application data. The 3 to 5x write amplification ratio isn't configuration. It's the cost of MVCC applied to data that will never be updated.</p><h2 id="why-you-cant-opt-out">Why you can't opt out</h2><p>There's no per-table setting to disable MVCC. No <code>append_only = true</code> flag that strips the header and skips the visibility machinery. MVCC is not a feature you can turn off for specific tables. It's the storage model. Every heap tuple gets the header. Every insert goes through the same write path.</p><p>This isn't an oversight. It's an architectural decision with a clear rationale: the storage engine doesn't know at write time what future transactions will need to see. The consistency guarantee requires the mechanism to be universal.</p><p>For most workloads, this is the right tradeoff. The overhead is small relative to the value of the concurrency guarantee, and mixed read/write workloads on shared rows are exactly what Postgres is built for.</p><p>The overhead only becomes the dominant cost when the workload is append-only at high sustained rates. 
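</p><p>The amplification chain above can be tallied in a few lines. A sketch using the per-entry estimates from the list; the index-entry and WAL sizes are rough figures, and periodic full-page writes push the real total toward the 3.5 KB end of the range:</p>

```python
# Tally the write path for one 1 KB row on a table with five indexes.
ROW = 1_024
TUPLE_OVERHEAD = 28          # header + padding + line pointer
INDEX_ENTRY = (40, 80)       # low/high estimate per B-tree leaf entry
N_INDEXES = 5
WAL = 1_200                  # heap insert + index insertion records, roughly

low = ROW + TUPLE_OVERHEAD + N_INDEXES * INDEX_ENTRY[0] + WAL
high = ROW + TUPLE_OVERHEAD + N_INDEXES * INDEX_ENTRY[1] + WAL

print(low, high)                               # 2452 2652 bytes, before full-page writes
print(f"{low / ROW:.1f}x to {high / ROW:.1f}x amplification")
```

<p>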
That's when you're paying the full price for a guarantee you never exercise.</p><h2 id="what-changes-when-the-storage-model-changes">What changes when the storage model changes</h2><p>TimescaleDB's columnar storage (the Columnstore layer) addresses this at the architecture level, not the configuration level. Rather than writing one heap tuple per row, it batches up to 1,000 row versions per column into compressed arrays before writing to disk. The MVCC header overhead gets amortized across the batch. One write operation covers what would have been 1,000 individual heap tuple insertions.</p><p>The practical results: write amplification drops from 3 to 5x to near 1:1 for sustained append workloads. Autovacuum pressure drops proportionally because there's far less row-level churn to clean. WAL volume at 100K inserts/sec falls from 50 to 100 MB/sec to roughly 5 to 15 MB/sec. Replicas that previously fell behind during write peaks can keep up.</p><p>Everything else stays the same. Same SQL. Same wire protocol. Same extensions. Same tooling. The change is underneath, at the layer where MVCC overhead was accumulating.</p><h2 id="the-bottom-line">The bottom line</h2><p>MVCC is not a bug in Postgres. It's one of the reasons Postgres is the right choice for the majority of production workloads.</p><p>But if most of your rows are immutable after the insert commits, if your tables never see concurrent updates to the same rows, if autovacuum is running constantly on data you've never touched, you're running an append-only workload inside a concurrency model built for something else.</p><p>That's not misconfiguration. It's an architectural mismatch. The distinction matters because misconfiguration has a config fix. 
Architectural mismatch doesn't.</p><p>If high-frequency append-only ingestion describes what you're running, <a href="https://www.tigerdata.com/blog/postgres-optimization-treadmill"><u>the full essay on the Optimization Treadmill</u></a> covers what this costs across your entire stack, and what the path forward looks like.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[When Continuous Ingestion Breaks Traditional Postgres]]></title>
            <description><![CDATA[Postgres maintenance depends on quiet periods your continuous workload eliminated. Here's what happens inside the database when the gaps disappear.]]></description>
            <link>https://www.tigerdata.com/blog/when-continuous-ingestion-breaks-traditional-postgres</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/when-continuous-ingestion-breaks-traditional-postgres</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Matty Stratton]]></dc:creator>
            <pubDate>Fri, 13 Mar 2026 19:33:45 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2026/03/When-Continuous-Ingestion-Breaks-Traditional-Postgres-1280x720.png">
            </media:content>
            <content:encoded><![CDATA[<p>Your system writes data constantly. Not in jobs. Not in batches. A stream that runs at 3am the same as it runs at 3pm. IoT sensors. Trade feeds. Metrics collectors. The data never stops.</p><p>For a while, Postgres handles it fine. Then you start noticing things. Autovacuum is always running. Write latency has a pattern you can't explain by traffic alone. Maintenance tasks that used to take minutes now take hours. And the really annoying part: nothing is misconfigured.</p><p>You check the usual suspects. Indexes are correct. Query plans look reasonable. Configs follow best practices. A colleague confirms the same.</p><p>The problem isn't a missing index or a bad query plan. The problem is that Postgres was designed with a quiet period baked into its assumptions. Your system eliminated that quiet period. Now you're paying for it.</p><h2 id="what-breathing-room-actually-means-in-postgres">What "breathing room" actually means in Postgres</h2><p>Most database systems are designed around a workload shape that includes peaks and valleys. Peaks are when users are active. Valleys are when the database catches up.</p><p>Postgres maintenance is built around the valley.</p><p>Autovacuum runs more aggressively when the database is quiet. <code>ANALYZE</code> refreshes statistics without competing for I/O. Checkpoint cycles complete cleanly. WAL accumulation clears out. The buffer cache warms up on predictable patterns.</p><p>Batch ETL fits this model perfectly. A nightly job writes data for two hours. The database writes, then rests, then writes again. Maintenance runs in the gaps. Everything resets before the next cycle starts.</p><p>Continuous ingestion has no gaps. The window that used to be quiet at 2am is now the same as the window at 2pm. Every maintenance process that depends on quiet time now runs in direct competition with writes. All day. 
All night.</p><h2 id="the-maintenance-competition-problem">The maintenance competition problem</h2><p>Three maintenance processes need quiet time and don't get it under continuous ingestion.</p><p><strong>Autovacuum.</strong> Even on append-only tables, autovacuum fires continuously at high insert rates. Since PostgreSQL 13, inserts themselves trigger autovacuum to freeze tuples and update the visibility map. This isn't about dead tuples from updates or deletes. It's insert-driven vacuum, running because the data is arriving too fast for the system to catch up.</p><p>At 50K inserts/second, autovacuum never finishes a cycle before the next one starts. It competes for I/O with your writes. When it loses, bloat accumulates. When it wins, write latency spikes.</p><p>There's no configuration fix for this. You can tune <code>autovacuum_vacuum_cost_delay</code> and <code>autovacuum_max_workers</code> all day. What you're tuning is how autovacuum loses gracefully. Not how it stops competing.</p><p><strong>Checkpoints.</strong> Postgres writes dirty pages to disk at checkpoint intervals. After a checkpoint completes, the first write to any previously-clean page triggers a full-page write to WAL (that's the <code>full_page_writes</code> mechanism, and it's on by default for good reason). At high insert rates, checkpoint cycles are constant. The full-page write burst that follows each one adds significant WAL volume on top of your baseline write load.</p><p>Batch systems checkpoint, rest, then return to normal. Continuous systems checkpoint and immediately start generating the next burst. There's no recovery window.</p><p><strong>ANALYZE and statistics.</strong> Query planning accuracy depends on fresh statistics. On a billion-row table, <code>ANALYZE</code> is expensive. On a batch system, you schedule it after the load completes. On a continuous system, there is no "after." You run it during writes or you let statistics go stale. Stale statistics mean bad query plans. 
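</p><p>The insert-driven trigger has a documented formula: autovacuum fires once the tuples inserted since the last vacuum exceed <code>autovacuum_vacuum_insert_threshold</code> plus <code>autovacuum_vacuum_insert_scale_factor</code> times the table's row count. A sketch using the default values from recent releases; check your server's actual settings before relying on these numbers:</p>

```python
# Insert-driven autovacuum trigger (PostgreSQL 13+), default GUC values.
INSERT_THRESHOLD = 1_000     # autovacuum_vacuum_insert_threshold
INSERT_SCALE = 0.2           # autovacuum_vacuum_insert_scale_factor

def inserts_until_vacuum(reltuples):
    """Tuples inserted since the last vacuum before autovacuum fires."""
    return INSERT_THRESHOLD + INSERT_SCALE * reltuples

# A billion-row append-only table: a vacuum roughly every 200M inserts,
# which at a sustained 50K inserts/sec is a vacuum every hour or so.
gap = inserts_until_vacuum(1_000_000_000)
print(f"vacuum every {gap / 50_000 / 3600:.1f} hours, indefinitely")
```

<p>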
Bad query plans mean unexpected sequential scans at the worst possible time.</p><h2 id="wal-as-the-throughput-ceiling-you-cant-tune-past">WAL as the throughput ceiling you can't tune past</h2><p>This is the mechanical core of the problem.</p><p>Every insert generates WAL. Heap insert record, index insertion records for every index on the table, plus full-page writes after checkpoints. A single 1KB sensor reading with five indexes generates roughly 2.5-3.5KB of actual I/O once you account for the heap tuple, B-tree leaf page insertions, and WAL records. At 100K inserts/second, that puts sustained WAL throughput at 50-100MB/sec under normal conditions. After a checkpoint, it spikes higher because of full-page writes.</p><p>That's 3-6GB per minute. 180-360GB per hour. Just WAL.</p><p>WAL writes are sequential and synchronous by default. That's a hard ceiling on write throughput for a given storage configuration. You can raise the ceiling by buying faster storage. You can't eliminate it, because WAL is how Postgres guarantees durability. And you shouldn't want to eliminate it. Durability matters. But you should understand that your write throughput has a physical upper bound set by how fast your storage can absorb WAL, and continuous ingestion pushes against that bound constantly.</p><p>Here's where continuous ingestion and batch ETL diverge completely.</p><p>Batch ETL generates bursts of WAL followed by silence. The silence lets replicas catch up. A streaming replica can fall behind during a batch load and recover in the gap. Nobody notices because the gap is long enough.</p><p>Continuous ingestion generates WAL constantly. Replicas that fall slightly behind have no gap to recover in. They fall further behind. The primary retains unprocessed WAL in <code>pg_wal</code>, consuming disk. The further behind the replica gets, the more WAL it needs to process, and the more disk the primary holds. It's a feedback loop. 
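</p><p>That feedback loop is easiest to see in the arithmetic. A sketch assuming roughly 0.5 to 1 KB of WAL per insert (an assumption back-derived from the 50-100 MB/sec range above; your real per-insert WAL size depends on row width, index count, and full-page writes):</p>

```python
# Sustained WAL volume as a function of insert rate.
def wal_mb_per_sec(inserts_per_sec, wal_bytes_per_insert):
    return inserts_per_sec * wal_bytes_per_insert / 1e6

low = wal_mb_per_sec(100_000, 500)      # 50 MB/sec
high = wal_mb_per_sec(100_000, 1_000)   # 100 MB/sec

print(f"{low:.0f}-{high:.0f} MB/sec of WAL")
print(f"{low * 60 / 1e3:.0f}-{high * 60 / 1e3:.0f} GB/min")       # 3-6 GB per minute
print(f"{low * 3600 / 1e3:.0f}-{high * 3600 / 1e3:.0f} GB/hour")  # 180-360 GB per hour
```

<p>A replica has to replay this stream at the same rate it's produced; there is no term in this equation for catching up.</p><p>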
The thing that causes the problem (WAL volume) is the same thing that prevents recovery (WAL volume).</p><p>Adding replicas makes it worse, not better. Each replica is another consumer that needs to keep up with the same WAL stream, and the primary holds WAL until the slowest one catches up.</p><p>The standard fix is more provisioned IOPS. It works for a while. Then data volume grows and you're having the same conversation again, just with bigger numbers on the invoice.</p><h2 id="why-the-standard-toolkit-doesnt-solve-this">Why the standard toolkit doesn't solve this</h2><p>Walk through each common response and you'll see exactly where it runs out.</p><p><strong>More autovacuum workers.</strong> More workers means more I/O competition with writes, not less. You're distributing the problem across more processes. The aggregate I/O pressure is unchanged.</p><p><strong>Aggressive autovacuum cost limits.</strong> You can configure vacuum to run faster and harder. It cleans up faster but hits writes harder. There's no setting that makes the competition disappear. You're choosing which process suffers.</p><p><strong>More RAM.</strong> Bigger <code>shared_buffers</code> and page cache reduce physical reads. Write amplification is unchanged. WAL volume is unchanged. Autovacuum competition is unchanged. You bought better read performance for a write-bound problem.</p><p><strong>Faster storage.</strong> Raises the WAL ceiling. Doesn't change the ratio of actual I/O to logical data. At 3-5x write amplification, faster storage lets you sustain a higher write rate before hitting the ceiling. But data volume grows, and the ceiling moves up proportionally.</p><p><strong>Vertical scaling.</strong> Same as faster storage with more CPU. You've bought headroom measured in months. At the current data growth trajectory, that math doesn't improve over time.</p><p>Each of these is the right response to the symptom. 
None of them changes the underlying dynamic: continuous ingestion is in constant competition with the maintenance processes Postgres needs to stay healthy.</p><h2 id="the-workloads-where-this-actually-matters">The workloads where this actually matters</h2><p>Not every write-heavy system has this problem. Let's be precise.</p><p>The pattern shows up when three things are true at once: writes are continuous rather than bursty, data volume is growing on a sustained curve, and the database needs to stay queryable under latency requirements while ingestion is running.</p><p><strong>Industrial IoT</strong> is the clearest example. A wind farm with 10,000 sensors reporting every five seconds generates roughly 2,000 inserts/second. That's modest by financial or observability standards, but it never pauses. The turbines don't stop overnight. Maintenance windows don't exist because the data source doesn't know what a maintenance window is.</p><p><strong>Financial market data</strong> is the high-frequency version. Trade feeds run at hundreds of thousands of events per second during market hours. Pre-market and after-market data keeps coming. Systems that aggregate this data for risk and compliance queries need it available immediately, not at end of day.</p><p><strong>Observability platforms</strong> are the distributed version. Metrics, traces, and logs from thousands of hosts. Each host generates data independently. The aggregate rate is enormous and constant.</p><p>What these have in common: the data source runs on its own schedule, completely independent of what the database needs. The wind turbine doesn't care that autovacuum is behind. The trading engine doesn't wait for a checkpoint to finish.</p><p>If your write pattern is bursty (user-driven traffic, nightly batch jobs, periodic syncs), you probably don't have this problem. The database gets its breathing room, maintenance catches up, and standard Postgres optimization works the way it's supposed to. 
The pattern described in this post shows up specifically when the gap disappears.</p><h2 id="recognizing-the-pattern-early">Recognizing the pattern early</h2><p>The instinct when Postgres starts struggling under continuous ingestion is to tune harder. Add workers. Raise limits. Upgrade storage.</p><p>Those are correct responses for a database that has misconfiguration or a bad schema. Postgres is doing exactly what it was designed to do. The MVCC model, the WAL architecture, the maintenance scheduler: these are good design decisions for the workloads Postgres was built to handle. The system changed underneath it. That's not a criticism of the tool.</p><p>But continuous ingestion isn't a heavier version of batch ETL. It's a different workload class. The architectural assumptions underneath Postgres were built around a workload that breathes. Continuous ingestion doesn't breathe. And that distinction matters because it determines whether optimization will change your trajectory or just delay the same outcome.</p><p>Recognizing that early is worth a lot. At 50M rows, switching to a purpose-built architecture takes days. At 1B rows, it takes months. Every quarter you spend optimizing within the wrong architecture is a quarter where migration gets harder and the engineering team spends more time managing the database than building product.</p><p>If this sounds familiar, the full analysis covers the scoring framework and the mechanics behind why each optimization phase hits a ceiling. It's the same trajectory described here, zoomed out to show the complete path and where it leads.</p><p><a href="https://www.tigerdata.com/blog/postgres-optimization-treadmill"><strong><u>Read the full analysis: Understanding Postgres Performance Limits for Analytics on Live Data →</u></strong></a></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Why Adding More Indexes Eventually Makes Things Worse]]></title>
            <description><![CDATA[Every Postgres index is a flat tax on every insert. At high ingestion rates, that tax is the whole problem.]]></description>
            <link>https://www.tigerdata.com/blog/why-adding-more-indexes-eventually-makes-things-worse</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/why-adding-more-indexes-eventually-makes-things-worse</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[PostgreSQL Performance]]></category>
            <dc:creator><![CDATA[Matty Stratton]]></dc:creator>
            <pubDate>Wed, 11 Mar 2026 16:36:20 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2026/03/Option-1-1200X627.png">
            </media:content>
            <content:encoded><![CDATA[<p>The pattern is familiar. A query is slow. You run <code>EXPLAIN</code> and see a sequential scan. You add an index. The query drops from seconds to milliseconds.</p><p>You do this a dozen times over two years and it works every time.</p><p>Then write latency starts climbing and you can't figure out why. The queries are fast. The schema looks clean. Nothing is obviously wrong.</p><p>Pull up <code>pg_stat_user_indexes</code>. Count your indexes. Now think about what happens at the storage layer every time a row lands in that table.</p><p>The indexes didn't stop helping reads. They started hurting writes. Every index is a flat tax on every insert: one extra write operation per row, every time, no exceptions. At low ingestion rates, the tax is invisible. At high ingestion rates, it's the whole problem.</p><h2 id="what-actually-happens-when-you-insert-a-row">What actually happens when you insert a row</h2><p>No handwaving here. Let's walk through the mechanics.</p><p>A single <code>INSERT</code> into a table with five indexes doesn't write once. It writes six times: one heap tuple to the table's data pages, and one B-tree leaf page insertion per index. Each index insertion traverses the B-tree from root to leaf, finds the correct position, and writes the new entry. If the target leaf page is full, it splits. A split can cascade up the tree.</p><p>Then there's WAL. One heap insert record. Five index insertion records. If it's the first modification to a page since the last checkpoint, Postgres writes a full 8 KB page image on top of all that.</p><p>At one insert per second, this is completely invisible. At 50,000 inserts per second with five indexes, you're looking at 300,000 write operations per second. Not 50,000. Six times the logical write rate, minimum.</p><p>That's your write amplification number. For this table configuration: 6x. 
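</p><p>The multiplier is simple to model: one heap write plus one B-tree insertion per index. A sketch of the op-count arithmetic; it deliberately ignores page splits and WAL, so treat it as a floor, not a measurement:</p>

```python
# Write operations per second: heap tuple + one B-tree insertion per index.
def write_ops_per_sec(inserts_per_sec, n_indexes):
    return inserts_per_sec * (1 + n_indexes)

assert write_ops_per_sec(50_000, 5) == 300_000   # the 6x figure above

for n in (1, 3, 5, 7, 10):
    print(n, write_ops_per_sec(10_000, n),
             write_ops_per_sec(50_000, n),
             write_ops_per_sec(100_000, n))
```

<p>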
More indexes, higher multiplier.</p><h2 id="the-math-that-makes-this-concrete">The math that makes this concrete</h2><p>Take a table with five indexes and a 1 KB row. The heap tuple costs 23 bytes of header plus your 1,024 bytes of row data plus a 4-byte <code>ItemIdData</code> pointer. Each of the five B-tree index entries adds roughly 40 to 80 bytes. Then WAL: approximately 1.2 KB covering the heap insert plus all five index insertions. Add it up and you're writing roughly 2.5 to 3.5 KB for every 1 KB of logical data.</p><p>At 50K inserts/sec, that's 125 to 175 MB/sec of actual I/O for 50 MB/sec of application data. The index tax at work.</p><p>Now add two more indexes because a couple of new dashboard queries need covering indexes. You're at seven. The multiplier goes up. The WAL volume goes up. Write latency goes up. Autovacuum has more index pages to scan and maintain.</p><p>The relationship is linear per index, but the effect compounds with ingestion rate. At 1K inserts/sec, two extra indexes barely register. At 100K inserts/sec, they're a real cost.</p><p>Here's what the math looks like across different configurations:</p>
<!--kg-card-begin: html-->
<table>
<thead>
<tr><th>Indexes</th><th>Write ops/sec @ 10K inserts</th><th>Write ops/sec @ 50K inserts</th><th>Write ops/sec @ 100K inserts</th></tr>
</thead>
<tbody>
<tr><td>1</td><td>20,000</td><td>100,000</td><td>200,000</td></tr>
<tr><td>3</td><td>40,000</td><td>200,000</td><td>400,000</td></tr>
<tr><td>5</td><td>60,000</td><td>300,000</td><td>600,000</td></tr>
<tr><td>7</td><td>80,000</td><td>400,000</td><td>800,000</td></tr>
<tr><td>10</td><td>110,000</td><td>550,000</td><td>1,100,000</td></tr>
</tbody>
</table>
<!--kg-card-end: html-->
<p>The numbers are approximate (real-world I/O depends on page splits, full-page writes, and your specific index types), but the pattern is clear. Each additional index is a flat tax on every insert. The tax rate doesn't change. The bill does.</p><h2 id="why-timestamp-indexes-have-a-specific-problem">Why timestamp indexes have a specific problem</h2><p>B-tree behavior for monotonically increasing keys is worse than for random keys. And most time-series tables insert in timestamp order.</p><p>With a random key distribution, new inserts scatter across the B-tree's leaf layer. Any given leaf page gets a roughly even share of new entries. Splits happen, but they're spread out.</p><p>With a timestamp key, every insert goes to the rightmost leaf page. The same page, over and over. That page fills up and splits. The new rightmost page fills up and splits. This is called a "hot right edge," and it means B-tree index maintenance for timestamp columns involves constant page splits concentrated in one area of the tree.</p><p>The old leaf pages that were once the rightmost page sit mostly empty but remain allocated. Index size grows faster than data size. The index bloat you see in pg_stat_user_indexes is a direct result of this pattern, not random fragmentation.</p><p>For non-timestamp indexes on the same table (device ID, metric name, sensor type), inserts scatter across the tree instead, which means random I/O rather than sequential. So you get two different flavors of write overhead hitting the same table simultaneously: constant splits on the timestamp index, random I/O on everything else.</p><h2 id="the-feedback-loop">The feedback loop</h2><p>All of that overhead is manageable if it stays constant. The problem is that it doesn't. It self-reinforces.</p><p>You add indexes to fix slow queries. Write amplification increases. Write latency creeps up. Bloat accumulates faster. Autovacuum fires more frequently and has more index pages to clean. 
Autovacuum competes with your writes for I/O bandwidth. Write latency climbs higher.</p><p>Slower writes mean rows sit in the buffer longer. Buffer pressure increases. The query performance you were trying to protect starts degrading anyway, now from I/O contention rather than missing indexes.</p><p>The response is usually to check query plans again. Some queries have gone back to sequential scans because statistics are stale or the planner is making different cost estimates under load. So you add another index. The cycle repeats.</p><p>This loop runs slowly enough that the connection between each index addition and the eventual write degradation is hard to see. Six months can pass between the two events. By that point, you've forgotten which indexes were added and why, and the symptom looks like a completely different problem.</p><h2 id="the-diagnostic-questions">The diagnostic questions</h2><p>Before adding the next index, ask these:</p><p><strong>How many indexes does this table already have?</strong> Pull <code>pg_stat_user_indexes</code> and look at <code>idx_scan</code>. Indexes with low scan counts are paying full write overhead for queries that run rarely or never.</p><p><strong>What's the actual write rate on this table?</strong> Low ingestion rate tables can carry many indexes without much penalty. The math only gets ugly at high sustained rates. If you're inserting 100 rows/sec, ten indexes are probably fine. If you're inserting 50K rows/sec, every index counts.</p><p><strong>Is the slow query a read problem or a write problem?</strong> Adding an index to fix a slow query while write amplification is already the bottleneck treats the symptom and makes the underlying condition worse.</p><p><strong>What's the index bloat trend?</strong> Growing index size relative to table size, especially on timestamp columns, is the fingerprint of the hot right edge problem. 
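</p><p>A sketch of that check (assumes the table is named <code>device_metrics</code>; adjust to your schema):</p><pre><code class="language-sql">-- Index size relative to table size; a ratio that keeps climbing
-- on a timestamp index is the hot-right-edge signature.
SELECT c.relname AS index_name,
       pg_size_pretty(pg_relation_size(c.oid)) AS index_size,
       pg_size_pretty(pg_relation_size('device_metrics')) AS table_size,
       round(pg_relation_size(c.oid)::numeric
             / nullif(pg_relation_size('device_metrics'), 0), 2) AS ratio
FROM pg_class c
JOIN pg_index i ON i.indexrelid = c.oid
WHERE i.indrelid = 'device_metrics'::regclass;
</code></pre><p>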
You can measure it directly with <code>pgstattuple</code> or by comparing <code>pg_relation_size</code> for the index against the table over time.</p><p><strong>Could a different query shape eliminate the need for this index?</strong> Sometimes the answer is restructuring the query or adjusting the access pattern, not adding another index to support the query as written.</p><h2 id="when-youre-past-the-point-where-index-pruning-helps">When you're past the point where index pruning helps</h2><p>You can drop indexes with low <code>idx_scan</code> counts. You can consolidate partial indexes. You can audit and remove redundant coverage. All of that is correct and worth doing.</p><p>But for a table with continuous high-frequency ingestion, even a minimal index set still generates substantial write amplification. Three carefully chosen indexes on a 50K inserts/sec table still mean 200K write operations per second. WAL volume is still 3–5x logical data volume. Autovacuum is still competing for I/O.</p><p>Index pruning buys back headroom. It doesn't change the architecture.</p><p>The write amplification problem for this class of workload is in the storage model itself. Row-based heap storage with B-tree indexes is how Postgres handles every table. It's the right design for most workloads. For sustained high-frequency, append-heavy ingestion, the overhead is intrinsic. It's not a configuration problem you can tune your way out of.</p><p>This is what changes when the storage model changes. The reason the index tax is so expensive in row-based storage is that every row is an independent write event. One heap insert, one WAL record, one B-tree traversal per index. The cost is per-row because the storage is per-row.</p><p>Columnar storage changes the unit of work. Instead of writing one row at a time, it batches thousands of row versions into a single segment before writing. One WAL record covers the whole batch. Index maintenance happens at the segment level, not the row level.
The per-row tax that makes five indexes expensive at 50K inserts/sec gets amortized across thousands of rows per write. Write amplification drops from the 3 to 5x range to near 1:1.</p><p>That's not a tuning improvement. It's a different cost structure for the same logical operation. We covered the full architecture in <a href="https://www.tigerdata.com/blog/postgres-optimization-treadmill"><u>The Postgres Optimization Treadmill</u></a>, which walks through why these constraints exist in row-based Postgres and what it looks like when the storage layer is built for this workload pattern from the start.</p><h2 id="the-bottom-line">The bottom line</h2><p>Every index you've ever added was the right call at the time. That's not the argument here.</p><p>The point is that the index tax is a <em>real cost</em> with a specific multiplier, and that multiplier matters a lot more at 50K inserts/sec than it does at 500. If write latency is climbing on a table that looks well-indexed, pull the insert rate and count the indexes. Do the multiplication. The answer is usually sitting right there in the numbers.</p><p>And if those numbers show you're paying five or more index taxes on every row, with no signs of the data slowing down, the question isn't which indexes to drop. It's whether the per-row cost structure is the right one for the workload.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Vertical Scaling: Buying Time You Can't Afford]]></title>
            <description><![CDATA[Postgres vertical scaling works, until it doesn't. Learn why high-frequency ingestion workloads hit an architectural wall and what to do about it.]]></description>
            <link>https://www.tigerdata.com/blog/vertical-scaling-buying-time-you-cant-afford</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/vertical-scaling-buying-time-you-cant-afford</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[PostgreSQL Performance]]></category>
            <dc:creator><![CDATA[Matty Stratton]]></dc:creator>
            <pubDate>Thu, 26 Feb 2026 14:48:27 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2026/02/Blog-Thumbnail-1280x720.png">
            </media:content>
<content:encoded><![CDATA[<p>Your Postgres database is struggling. Write latency is climbing, autovacuum is fighting for I/O, and the indexes you added three months ago aren't cutting it anymore. So you do the obvious thing.</p><p>You upgrade the instance. Metrics drop. Everyone exhales.</p><p>Six months later, you do it again.</p><p>Nobody puts this in a postmortem, because vertical scaling works. That's why teams keep reaching for it. But if you're running continuous high-frequency ingestion on Postgres, it's not a fix. It's a payment plan on a debt that keeps growing.</p><h2 id="the-cost-curve-doesnt-lie">The Cost Curve Doesn't Lie</h2><p>You've probably already run the numbers. At 50K inserts per second, you're adding roughly 1.6 trillion rows per year. Your data volume curve is exponential. Your infrastructure cost moves in steps, doubling each time you provision the next tier up.</p><p>Plot both lines on the same chart. Watch them diverge.</p><p>You upgrade from 16 vCPU/64GB to 32 vCPU/128GB with provisioned IOPS (io2 at 10,000+ IOPS on AWS, say). Cost roughly doubles. You get six months of breathing room. Then the data keeps growing, and the metrics start climbing again.</p><p>So you upgrade again. The cost doubles again. Twelve months out, you're projecting another upgrade. The database line item is growing faster than the product revenue it supports.</p><p>Oof.</p><h2 id="what-youre-actually-buying">What You're Actually Buying</h2><p>More CPU gives autovacuum room to run without starving query execution. More RAM improves <code>shared_buffers</code> and OS page cache hit rates. Faster storage reduces I/O wait across the board.</p><p>All real wins. None of them touch the per-row overhead.</p><p>Here's what's actually happening underneath. At 100K inserts per second, you're writing 250-350MB of actual I/O for every 100MB of application data. Every row carries MVCC headers, index entries, and WAL records whether you asked for them or not.
A 1KB sensor reading becomes roughly 2.5 to 3.5KB of actual I/O: 23-byte heap tuple header, five index entries at ~60 bytes each, plus a ~1.2KB WAL record stacking on top.</p><p>At 100K inserts/sec, that's 250-350MB/sec of real I/O to move 100MB/sec of data. A bigger instance tolerates that overhead more gracefully. It does not reduce it.</p><p>So the trajectory holds. Six months of headroom, metrics creep back, another upgrade, another budget conversation. Each step costs more than the last one and buys roughly the same amount of time.</p><h2 id="the-invisible-cost-nobody-tracks">The Invisible Cost Nobody Tracks</h2><p>Here's where it gets uncomfortable. The latency graphs are one thing. Engineers watch latency graphs. Finance watches the <em>invoice</em>.</p><p>At some point the database line item becomes visible enough that someone schedules a meeting. Now you're explaining autovacuum to a person who manages a spreadsheet for a living. (That meeting is not fun. The prep work for that meeting costs engineering time you don't have.)</p><p>But that's the visible cost. The invisible one is worse.</p><p>When teams hit this pattern, senior engineers typically spend 20-30% of their time on database operations. Not firefighting. Weekly. Monitoring autovacuum lag. Tuning per-partition settings. Watching replication delay. Reviewing runbooks before anyone touches the schema. Making sure the pg_partman automation didn't silently fail again.</p><p>None of that shows up in the cloud bill. It doesn't trigger a finance meeting – it just quietly drains your best people every single week. New engineers need weeks of onboarding before they can safely operate the partitioning scheme. What should be a one-person schema change becomes a team event with a rollback plan.</p><p>You've built a database operations practice inside your product engineering team. 
That wasn't the plan.</p><h2 id="why-vertical-scaling-feels-like-its-working">Why Vertical Scaling Feels Like It's Working</h2><p>The thing that makes this pattern so persistent is that each optimization phase genuinely does help. Vertical scaling is no exception.</p><p>You add the bigger instance, and autovacuum workers stop competing with queries for CPU. Shared buffers expand, and buffer cache hit rates climb. Those io2 IOPS stop being the bottleneck. For a while, the system breathes.</p><p>But here's the thing: Postgres wasn't designed for continuous, high-frequency, append-only ingestion at scale. The design choices that make it excellent for general-purpose workloads, MVCC for concurrency, row-based heap storage, B-tree indexes, the WAL architecture – all generate overhead that multiplies when you're hammering it with hundreds of thousands of inserts per second that never pause.</p><p>Vertical scaling gives the existing architecture more room to operate. It doesn't change the architecture.</p><p>MVCC creates per-tuple overhead on data you'll never update. Row storage forces you to read all 30 columns when your query needs two. B-tree indexes mean every insert has to traverse and update every index, and at 50K inserts/sec with five indexes, that's 250K index insertions per second. WAL records every single one of those operations before touching a data page, so at 100K inserts/sec you're generating 50-100MB/sec of WAL just to do normal work.</p><p>None of those problems shrink when you add more vCPUs.</p><h2 id="how-it-shows-up-before-its-a-crisis">How It Shows Up Before It's a Crisis</h2><p>The real tell isn't in a p95 latency chart. It's the <em>pattern</em>.</p><p>You optimize. You get relief. The metrics climb back. You optimize again. 
The relief lasts a little less time than before.</p><p>Before it becomes a full crisis, it shows up in how the team is spending its time.</p><p>Optimization is on every quarterly roadmap, not as a one-time project, but as a line item, every quarter, competing with features for engineering time.</p><p>The database bill goes up 40% while user growth was 15%. Finance notices. Those numbers don't get ignored.</p><p>You ship a 2x performance improvement and data growth erases it within two quarters. The treadmill doesn't slow down – it <strong>speeds up</strong>.</p><p>And autovacuum just keeps coming up! It's in the top five processes by CPU and I/O at all hours and tuning it is somehow <em>always</em> on someone's plate.</p><p>Two or three of these? Pay attention. Four? You're <em>already</em> in the pattern.</p><h2 id="optimization-problem-vs-architecture-problem">Optimization Problem vs. Architecture Problem</h2><p>There are two different problems that both show up as "database performance is degrading."</p><p>The first is an optimization problem. The workload fits the database design. Better indexes, query rewrites, config tuning, vertical scaling. These directly improve the trajectory, and Postgres expertise solves it. For most workloads, vanilla Postgres is the right tool and this is the right path.</p><p>The second is an <strong>architectural mismatch</strong>. The workload is hitting design tradeoffs baked into the storage engine and the write path. Optimization helps short-term, but it doesn't change the trajectory. You're working <em>around</em> the architecture instead of <em>with</em> it.</p><p>Both of these look identical from the outside: degrading query latency, climbing infrastructure costs, teams spending more time on database operations than product work. The difference only becomes obvious when you notice each fix is lasting a little less time than the last one.</p><p>Vertical scaling is the right move for the first problem. 
For the second, it's just the most expensive item on the treadmill.</p><h2 id="when-to-think-about-architecture-instead">When to Think About Architecture Instead</h2><p>If your workload is continuous high-frequency ingestion, your data is append-only, queries predominantly filter on time ranges, and you're measuring retention in months or years, you're probably dealing with an architectural mismatch, not an optimization problem.</p><p>You also don't need to replace Postgres. TimescaleDB extends vanilla Postgres with columnar compression, hypertables with automatic chunking, and a query planner that understands <a href="https://www.tigerdata.com/learn/the-best-time-series-databases-compared" rel="noreferrer">time-based access patterns</a>. You keep SQL, your extensions, your team's knowledge, and the entire Postgres ecosystem. What changes is the storage engine and write path underneath (the parts <em>actually</em> generating the overhead).</p><p>Migration complexity scales with data volume. At 10M-50M rows, it's days to two weeks. At 100M-500M rows, two to six weeks. At 1B+, you're looking at months. Those hours don't go toward product features. And there's no point on that curve where waiting makes it cheaper.</p><p>If your team is spending 20%+ of engineering time on database operations and scalability is on every quarterly roadmap, you already know something is off. The upgrade cycles don't get cheaper. They just get further apart until they don't.</p><p><em>This post is part of a series on Postgres performance limits for high-frequency data workloads. The full analysis, including a workload scoring framework and migration complexity breakdown at different scales, is in the anchor essay:</em><a href="https://www.tigerdata.com/blog/postgres-optimization-treadmill" rel="noreferrer"><em> <u>Understanding Postgres Performance Limits for Analytics on Live Data</u></em></a><em>. 
Ready to test it on your own data?</em><a href="https://console.cloud.timescale.com/signup"><em> <u>Start a free Tiger Data trial.</u></em></a></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Understanding Postgres Performance Limits for Analytics on Live Data]]></title>
            <description><![CDATA[PostgreSQL hits hard limits under analytics workloads. Here's why MVCC, WAL, and row storage compound — and what to do instead.]]></description>
            <link>https://www.tigerdata.com/blog/postgres-optimization-treadmill</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/postgres-optimization-treadmill</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Analytics]]></category>
            <category><![CDATA[PostgreSQL Performance]]></category>
            <dc:creator><![CDATA[Matty Stratton]]></dc:creator>
            <pubDate>Wed, 25 Feb 2026 19:18:16 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2026/02/advocacy-essay-thumbnail-with-elephant.png">
            </media:content>
            <content:encoded><![CDATA[<h2 id="the-pattern-recognition-moment">The Pattern Recognition Moment</h2>
<p>You're reviewing monitoring on a normal workday. There hasn't been a new deployment, no weird traffic spike, and no schema changes. But p95 write latency has crept from 8ms to 25ms over the past month, and last week it touched 45ms. Your largest tables crossed 500M rows sometime in March and they're still climbing.</p>
<p>Six weeks of data points, all trending the same direction.</p>
<p><img src="https://timescale.ghost.io/blog/content/images/2026/02/diagram-2-p95-write-latency.png" alt="" loading="lazy"></p>
<p>You've run Postgres in production for years. You've tuned queries, rebuilt indexes, and right-sized instances. But this time the fixes don't stick; every new index or config tweak brings the metrics back down for a few weeks, then they climb again. You can plot the trajectory out three months and know exactly where it lands.</p>
<p>So you do a proper audit: query plans, connection overhead, table stats, bloat. Everything checks out. Schema is sound, indexes cover the hot paths, and configs follow best practices. A consultant confirms the same: nothing misconfigured. But performance keeps degrading, and it correlates with data volume, not traffic.</p>
<p>You look closer at the workload. Most writes are inserts, not updates, and every row carries a timestamp. Queries almost always filter by time range. Data arrives continuously, not in batches or bursts, but as a steady stream that never pauses. You need months or years of retention, and you're not just storing this data. You're querying it under latency requirements.</p>
<p>This doesn't fit the profile of a transactional workload, and it doesn't fit a data warehouse either. It's continuous high-frequency ingestion that needs to stay operationally queryable.</p>
<p>Postgres is a brilliant general-purpose database. The same design choices that make it handle e-commerce, SaaS backends, and CMS workloads so well create compounding overhead for sustained high-frequency <a href="https://www.tigerdata.com/learn/time-series-database-what-it-is-how-it-works-and-when-you-need-one">time-series</a> ingestion with long retention. Design tradeoffs, not bugs. Baked into the architecture by intent.</p>
<p>You are not fighting misconfiguration. You are fighting architectural boundaries designed for a different workload class.</p>
<p>This piece walks through what we call the Optimization Treadmill: the sequence of phases most teams follow, each a correct response to observed symptoms, each providing temporary relief without changing the underlying trajectory. Understanding the mechanics of why the treadmill exists is what lets you recognize it early. If you recognize the scenario above, this is a common path. The question isn't whether you'll hit the ceiling. It's when, and how much runway you have left when you do.</p>
<h2 id="what-this-workload-looks-like">What This Workload Looks Like</h2>
<p>Not all high-write workloads will hit this wall. Postgres handles enormous write volumes for e-commerce, social networks, and SaaS backends without issue. The friction comes from a specific combination of six characteristics. If four or five describe your system, the optimization phases in the next section will be familiar.</p>
<p><strong>Continuous high-frequency ingestion.</strong> Thousands to hundreds of thousands of inserts per second, 24/7, with no pause: IoT sensors reporting every few seconds, financial systems processing trades in real time, or APM platforms collecting metrics from thousands of hosts. High-frequency data generation is independent of user count. Batch systems get quiet periods where the database can run maintenance, but continuous ingestion never stops. Maintenance competes directly with writes, and there is no scheduling window.</p>
<p><strong>Time-series access patterns.</strong> Nearly every row has a timestamp, and queries almost always include time range filters. "Last 30 minutes of CPU utilization," "this week compared to last week," "all transactions between two dates." This goes beyond a <code>created_at</code> column; the entire query pattern revolves around time. General-purpose indexes aren't built for this access pattern, which is why teams end up reimplementing time-based data organization through manual partitioning scripts and custom tooling.</p>
<p><strong>Append-only data.</strong> Once written, rows rarely change. Sensor readings don't get updated, financial transactions are immutable, log entries are permanent. Deletes happen in bulk (drop an entire month's partition), not row by row. MVCC exists to handle concurrent reads and writes on the same rows. Append-only workloads pay that overhead on data they never touch again. Autovacuum is running constantly just to clean up dead tuples that were never created through updates.</p>
<p><strong>Long retention.</strong> Months to years, not days or weeks. Compliance might require seven years of financial records, manufacturing teams need root cause analysis across quarters, and ML pipelines need two-plus years of training data. Shortening retention only hides the architectural problem as old data ages out, and long retention means unbounded table growth. At 50K inserts per second, that's roughly 1.6 trillion rows per year. After three years? Nearly 5 trillion rows.</p>
<p><strong>Operational query requirements.</strong> This isn't cold storage or an analytics warehouse you query once a day. You need millisecond responses on the last day's data, sub-second on the last week, and reasonable performance across the full retention window. Real-time dashboards, alert systems, user-facing analytics, ad-hoc investigation, all querying the same database. Data warehouse depth with operational latency requirements.</p>
<p><strong>Sustained growth.</strong> Data volume growing 50–100%+ year over year on a predictable curve. Static workloads can be over-provisioned once and left alone, but growing workloads demand constant re-optimization. You're not solving for current scale. You're chasing projected scale, and the gap keeps widening.</p>
<p>If four or five of these apply, the next section maps the optimization path most teams follow. If your workload is standard OLTP, batch warehouse, low-volume time-series, or short-retention, the underlying issues are likely different.</p>
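<p>Concretely, the time-series access pattern above tends to look like this kind of query (illustrative; the table and column names here are assumptions, not a fixed schema):</p><pre><code class="language-sql">-- "Last 30 minutes of CPU utilization," bucketed per minute
SELECT date_trunc('minute', ts) AS minute,
       device_id,
       avg(value) AS avg_cpu
FROM device_metrics
WHERE metric = 'cpu_utilization'
  AND ts &gt; now() - interval '30 minutes'
GROUP BY 1, 2
ORDER BY 1;
</code></pre>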
<p>This combination of characteristics didn't exist at scale 15 years ago. It's a product of specific infrastructure shifts: billions of connected devices generating continuous telemetry, high-frequency trading systems that treat microseconds as a competitive moat, AI pipelines that require years of operational history as training data, and observability platforms collecting metrics from every process in a distributed system. The cloud didn't just scale these workloads up. It made them continuous. Machines that never go offline generate data that never stops. That changed what operational databases are asked to do, and general-purpose engines weren't redesigned to match.</p>
<h2 id="the-optimization-path">The Optimization Path</h2>
<p>Most teams working this pattern follow roughly the same sequence. Each phase is a reasonable response to observed symptoms, but each buys 3–6 months of relief at most, adds operational complexity, and has diminishing returns. The optimizations address symptoms without changing the underlying architecture. The ceiling doesn't move. You do, until you run out of room.</p>
<h3 id="phase-1-index-optimization">Phase 1: Index optimization</h3>
<p>The trigger is predictable: query performance degrades as tables grow past 50–100M rows, or sequential scans on a 100M-row table take minutes. The textbook answer is to add B-tree indexes on timestamp columns, build composite indexes for common filter combinations, create partial indexes on hot time ranges, and run ANALYZE to refresh <code>pg_statistic</code>.</p>
<pre><code class="language-sql">-- Composite index for the most common dashboard query pattern
CREATE INDEX idx_metrics_device_time
  ON device_metrics (device_id, ts DESC);

-- Partial index covering only the hot partition
CREATE INDEX idx_metrics_recent
  ON device_metrics (ts DESC)
  WHERE ts &gt; now() - interval '7 days';
</code></pre>
<p>A query that did a sequential scan across 100M rows now hits an index and returns in milliseconds. 10–100x improvement on read performance is typical. Problem solved, for now.</p>
<p>Issues start showing up as tables grow past 300M rows. Every INSERT must update every index on the table. With five indexes, each insert performs six write operations: one heap tuple write and five B-tree leaf page insertions. At 50K inserts/sec, that's 300K write operations per second. Each index insertion traverses the B-tree, potentially causing page splits that trigger additional I/O. <code>pg_stat_user_indexes</code> starts showing index bloat climbing:</p>
<pre><code class="language-sql">-- Monitoring index bloat
SELECT schemaname, tablename, indexname,
       pg_size_pretty(pg_relation_size(indexrelid)) as index_size,
       idx_scan as index_scans,
       idx_tup_read,
       idx_tup_fetch
FROM pg_stat_user_indexes
WHERE schemaname = 'public'
ORDER BY pg_relation_size(indexrelid) DESC;
</code></pre>
<p>Index sizes grow faster than table sizes because B-trees don't reclaim space efficiently for append-heavy, time-ordered data. For keys that increase monotonically like timestamps, inserts concentrate on the rightmost leaf pages, resulting in repeated splits. Old leaf pages become sparse but remain allocated. You've improved read latency at the cost of write throughput, and this workload needs both.</p>
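<p>One way to observe this directly is the <code>pgstattuple</code> contrib extension (a sketch, using the index from the example above):</p><pre><code class="language-sql">CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- Falling avg_leaf_density on a timestamp index means leaf pages
-- left half-empty by splits; leaf_fragmentation shows how
-- out-of-order the leaf chain has become.
SELECT avg_leaf_density, leaf_fragmentation, leaf_pages, empty_pages
FROM pgstatindex('idx_metrics_recent');
</code></pre>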
<h3 id="phase-2-table-partitioning">Phase 2: Table partitioning</h3>
<p>Your largest table has crossed 800M to 1B rows, and dropping old data via DELETE causes table bloat and long-running transactions that block autovacuum. You implement time-based range partitioning (typically daily or weekly).</p>
<pre><code class="language-sql">-- Partitioned table setup
CREATE TABLE device_metrics (
    ts          timestamptz NOT NULL,
    device_id   bigint NOT NULL,
    metric      text NOT NULL,
    value       double precision
) PARTITION BY RANGE (ts);

-- Daily partitions created by cron or pg_partman
CREATE TABLE device_metrics_20250601
  PARTITION OF device_metrics
  FOR VALUES FROM ('2025-06-01') TO ('2025-06-02');
</code></pre>
<p>Implementation requires automation: cron jobs or pg_partman to create future partitions, monitoring to detect gaps where partition creation failed, and careful handling of queries that span partition boundaries. Backup and restore now operates on hundreds of individual tables, <code>pg_dump</code> time scales with partition count, and schema migrations touch every partition.</p>
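<p>For reference, the pg_partman side of that automation is roughly this (a sketch using the v5-style API; parameter names differ in older versions):</p><pre><code class="language-sql">CREATE SCHEMA IF NOT EXISTS partman;
CREATE EXTENSION IF NOT EXISTS pg_partman SCHEMA partman;

-- Register the parent table; pg_partman pre-creates daily partitions
SELECT partman.create_parent(
    p_parent_table := 'public.device_metrics',
    p_control      := 'ts',
    p_interval     := '1 day'
);

-- Run on a schedule (cron or pg_cron) to create upcoming partitions
-- and enforce retention
SELECT partman.run_maintenance();
</code></pre>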
<p>The wins are concrete. Queries with time-range filters trigger partition pruning, and EXPLAIN shows the planner excluding irrelevant partitions:</p>
<pre><code class="language-sql">EXPLAIN SELECT avg(value) FROM device_metrics
WHERE ts &gt; now() - interval '1 hour';

-- Scans 1-2 partitions instead of the entire table
-- "Partitions removed: 498 of 500"
</code></pre>
<p>Dropping old data becomes <code>DROP TABLE device_metrics_20240101</code> instead of a multi-hour DELETE that generates gigabytes of WAL and dead tuples.</p>
<p>What happens at 500+ partitions? The <a href="https://www.postgresql.org/docs/current/ddl-partitioning.html">PostgreSQL documentation on partitioning best practices</a> is direct about the cost: "Planning times become longer and memory consumption becomes higher when more partitions remain after the planner performs partition pruning." <code>pg_partman</code> maintenance jobs occasionally fail silently, leaving gaps. Queries spanning long ranges (quarterly reports, year-over-year comparisons) hit hundreds of partitions and regress in performance. Each active partition still has its own autovacuum overhead. The write path is faster per-partition but aggregate write load is unchanged. And the operational complexity is real. New engineers need to understand the partitioning scheme, the automation scripts, the monitoring for gaps, the procedures for backfills, and the implications for schema changes.</p>
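<p>A quick way to see how many partitions the planner has to consider (sketch):</p><pre><code class="language-sql">SELECT count(*) AS partition_count
FROM pg_inherits
WHERE inhparent = 'device_metrics'::regclass;
</code></pre>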
<h3 id="phase-3-autovacuum-tuning">Phase 3: Autovacuum tuning</h3>
<p>This is where it starts to feel wrong. You're tuning a cleanup process for data you never modify. <code>n_dead_tup</code> counts are climbing on active partitions, <code>last_autovacuum</code> timestamps show vacuum running constantly but falling behind during write peaks, and <code>pg_stat_activity</code> regularly shows autovacuum workers competing for I/O.</p>
<p>Even append-only workloads generate work for autovacuum. Aborted transactions leave dead tuples. Hint-bit setting (marking tuples as known-committed or known-aborted to avoid future <code>pg_xact</code> lookups) requires dirtying pages. And since PostgreSQL 13, autovacuum triggers based on insert count (not just dead tuples) specifically to freeze tuples and update the visibility map. At high insert rates, this means autovacuum fires continuously on tables that never see a single UPDATE or DELETE.</p>
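<p>You can watch that insert-driven trigger approaching in <code>pg_stat_user_tables</code> (PostgreSQL 13+):</p><pre><code class="language-sql">-- n_ins_since_vacuum counts inserts since the last vacuum; autovacuum
-- fires when it exceeds autovacuum_vacuum_insert_threshold
-- + autovacuum_vacuum_insert_scale_factor * reltuples
SELECT relname, n_ins_since_vacuum, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_ins_since_vacuum DESC
LIMIT 10;
</code></pre>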
<pre><code class="language-sql">-- Per-table autovacuum settings on high-traffic partitions
ALTER TABLE device_metrics_20250601 SET (
    autovacuum_vacuum_scale_factor = 0.01,    -- default 0.2
    autovacuum_vacuum_cost_delay = 2,         -- default 2ms (20ms before PG 12)
    autovacuum_vacuum_cost_limit = 1000       -- default 200
);
</code></pre>
<pre><code># postgresql.conf adjustments
autovacuum_max_workers = 6            # default 3
autovacuum_naptime = 15s              # default 1min
maintenance_work_mem = 2GB            # default 64MB
autovacuum_vacuum_cost_delay = 2ms
autovacuum_vacuum_cost_limit = 800
</code></pre>
<p>This helps stabilize bloat, and <code>pg_stat_user_tables.n_dead_tup</code> stays under control. But autovacuum workers now consume measurable CPU and I/O continuously, and monitoring shows autovacuum in <code>pg_stat_activity</code> at all hours. During write peaks, vacuum falls behind, bloat creeps back, and query performance becomes variable. You're tuning a process that exists to clean up overhead your workload doesn't fundamentally produce, but that the storage engine creates anyway.</p>
<h3 id="phase-4-vertical-scaling">Phase 4: Vertical scaling</h3>
<p>All of your optimizations are showing diminishing returns. The next logical step is to add more resources: upgrade from 16 vCPU/64GB to 32 vCPU/128GB with provisioned IOPS storage (e.g., io2 at 10,000+ IOPS on AWS).</p>
<p>More CPU gives autovacuum workers room to operate without starving query execution. More RAM increases <code>shared_buffers</code> and OS page cache hit rates, reducing physical disk reads. Faster storage reduces I/O wait time across the board. This gives you roughly six months of headroom.</p>
<p>Math doesn't lie: the infrastructure cost doubled or tripled, but data growth is still exponential. At the current trajectory, you'll need another upgrade in 12 months. The database cost line item is growing faster than the product revenue it supports.</p>
<h3 id="phase-5-read-replicas">Phase 5: Read replicas</h3>
<p>Dashboards and analytics queries compete with ingestion for CPU and I/O on the primary. You add 1–3 streaming replicas, configure PgBouncer or Pgpool-II to route read traffic, and separate the connection pools. Immediately, write performance on the primary improves. Expensive analytical queries run against replicas without blocking ingestion.</p>

<p>The primary still carries the full write load. At sustained high insert rates generating tens of megabytes per second of WAL, replicas that fall behind accumulate WAL on the primary, consuming disk. The further behind a replica gets, the more WAL the primary must retain, and high write volume is exactly what causes replicas to fall behind in the first place. Real-time dashboards pointing at lagging replicas show stale data. You're now managing multiple Postgres instances with their own monitoring, autovacuum tuning, and connection pooling. The write bottleneck is still untouched.</p>
<h3 id="taking-stock">Taking stock</h3>
<p>After all five phases, this is what the infrastructure looks like: partitioned tables across 500+ partitions with <code>pg_partman</code> automation and monitoring, aggressive per-table autovacuum settings under constant adjustment, instances upgraded 2–3x from original specs with provisioned IOPS, 2–3 streaming replicas with connection-level routing, detailed runbooks covering partition management, vacuum procedures, and failover scenarios.</p>
<p>Each optimization was the right response. Each bought time. Yet the trajectory is unchanged.</p>
<p><img src="https://timescale.ghost.io/blog/content/images/2026/02/diagram-1-latency-across-optimization-phases.png" alt="" loading="lazy"></p>
<p>Senior engineers are now spending 20–30% of their time on database operations. Quarterly planning includes a database scalability line item. New hire onboarding takes weeks before someone can safely operate the partitioning scheme. The team has become part product engineering, part DBA.</p>
<p>Is this inherent to the scale, or is it inherent to the architecture?</p>
<p>The answer matters because the two problems have different solutions. Optimization within the right architecture has a ceiling you can raise. Optimization against an architectural mismatch has a ceiling that doesn't move. Only the timeline changes. For this workload pattern, the ceiling is structural. The question was never if you'd hit it. It was always when.</p>
<h2 id="why-these-optimizations-hit-a-ceiling">Why These Optimizations Hit a Ceiling</h2>
<p>The optimization phases above aren't ineffective. Each one operates within architectural boundaries that weren't designed for this workload pattern, and those boundaries constrain how much any optimization can actually move the needle. Understanding the mechanics explains why returns diminish.</p>
<p>Postgres is a brilliant general-purpose relational database. Its design handles an enormous range of workloads well: e-commerce, content management, authentication, SaaS backends. "General-purpose" means optimized for the average case. High-frequency time-series ingestion with long retention is not the average case. Four core design decisions create this compounding overhead.</p>
<h3 id="mvcc-multi-version-concurrency-control">MVCC (Multi-Version Concurrency Control)</h3>
<p>MVCC lets readers and writers operate concurrently without lock contention. The <a href="https://www.postgresql.org/docs/current/mvcc-intro.html">PostgreSQL documentation on concurrency control</a> describes the core guarantee: "reading never blocks writing and writing never blocks reading." When a row is updated, Postgres keeps the old tuple version visible to in-flight transactions, and autovacuum later marks dead tuples as reusable. For workloads with concurrent reads and updates on shared rows, this is an excellent tradeoff.</p>
<p>For append-only ingestion, every insert still pays the full MVCC cost. Each heap tuple carries a fixed-size header (23 bytes on most machines) containing <code>t_xmin</code>, <code>t_xmax</code>, <code>t_cid</code>, <code>t_ctid</code>, <code>t_infomask</code>, <code>t_infomask2</code>, and <code>t_hoff</code>. These fields track transaction visibility, even though the row will never be updated or deleted by a transaction. Extra cost with no extra value.</p>
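<p>A rough breakdown of those 23 bytes (field sizes per the heap tuple header layout; note that <code>t_cid</code> shares its storage with <code>t_xvac</code> in a union):</p><pre><code class="language-python"># Approximate byte layout of the fixed 23-byte heap tuple header.
header_fields = {
    "t_xmin": 4,       # inserting transaction ID
    "t_xmax": 4,       # deleting/locking transaction ID
    "t_cid": 4,        # command ID (union with t_xvac)
    "t_ctid": 6,       # current tuple ID: (page number, offset)
    "t_infomask2": 2,  # attribute count plus flag bits
    "t_infomask": 2,   # visibility hint bits, among others
    "t_hoff": 1,       # offset to the start of user data
}
assert sum(header_fields.values()) == 23
</code></pre>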
<p>The write amplification is easily observable. A 1KB sensor reading becomes:</p>
<ul>
<li>23-byte heap tuple header (plus alignment padding and a 4-byte <code>ItemIdData</code> pointer)</li>
<li>1,024 bytes of row data</li>
<li>5 index entries (assuming 5 indexes, ~40–80 bytes each in B-tree leaf pages)</li>
<li>~1.2KB WAL record (heap insert + index insertions)</li>
</ul>
<p>Total actual I/O: roughly 2.5–3.5KB per 1KB of logical data. At 100K inserts/sec of 1KB rows, you're writing 250–350MB/sec of actual I/O for 100MB/sec of application data. The exact ratio varies with row width, index count, and whether <code>full_page_writes</code> triggers after a checkpoint.</p>
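<p>The arithmetic behind those totals, with every per-item size an estimate rather than a measurement:</p><pre><code class="language-python"># Illustrative write-amplification arithmetic for the 1KB-row example above.
row_bytes = 1024
tuple_overhead = 23 + 4        # heap tuple header + ItemIdData line pointer
index_bytes = 5 * 60           # 5 B-tree entries at roughly 40-80 bytes each
wal_bytes = 1200               # heap insert + index insert WAL records

physical_bytes = row_bytes + tuple_overhead + index_bytes + wal_bytes
amplification = physical_bytes / row_bytes     # ~2.5x

inserts_per_sec = 100_000
io_mb_per_sec = physical_bytes * inserts_per_sec / 1_000_000   # ~255 MB/s
</code></pre><p>Nudge the index count or WAL record sizes toward the high end of their ranges and the result lands at the upper end of the 2.5–3.5x window.</p>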
<p><img src="https://timescale.ghost.io/blog/content/images/2026/02/diagram-3-logical-data-vs-IO-breakdown.png" alt="" loading="lazy"></p>
<p>Autovacuum still has work to do on append-only tables. Aborted transactions leave dead tuples, and hint-bit setting (marking tuples as known-committed or known-aborted to avoid future <code>pg_xact</code> lookups) requires dirtying pages. At high insert rates, even these minor sources of work keep autovacuum continuously active. <code>pg_stat_user_tables.n_dead_tup</code> may stay low, but <code>vacuum_count</code> and <code>autovacuum_count</code> keep climbing steadily.</p>
<h3 id="row-based-storage-with-b-tree-indexes">Row-based storage with B-tree indexes</h3>
<p>Postgres stores data as a heap of 8KB pages, each containing variable-length tuples laid out row by row. Every tuple contains all columns. B-tree indexes map key values to ctid (page number + offset) pointers into the heap.</p>
<p>For time-series analytics, this creates read amplification:</p>
<pre><code class="language-sql">SELECT avg(temperature)
FROM sensor_readings
WHERE ts &gt; now() - interval '1 hour'
  AND device_id = 42;
</code></pre>
<p>This query needs two columns: <code>ts</code> and <code>temperature</code>. If the table has 30 columns, Postgres reads all 30 columns for every matching row from the heap pages. The I/O is 15x what a columnar layout would require, where only the referenced columns are read from disk.</p>
<p>Time-series data also compresses extremely well in columnar formats. Sequential timestamps delta-encode to near-zero storage (a regular interval collapses from 8 bytes per timestamp down to a single bit via delta-of-delta encoding), and repeated device IDs run-length-encode. Floating-point sensor values compress with XOR-based compression derived from Facebook's Gorilla algorithm (<a href="http://www.vldb.org/pvldb/vol8/p1816-teller.pdf">Pelkonen et al., "Gorilla: A Fast, Scalable, In-Memory Time Series Database," VLDB, 2015</a>). Columnar storage routinely achieves 10–20x compression on time-series data. Row-based heap storage can't apply any of these techniques because values from different columns are interleaved on the same page.</p>
<p>On the write side, B-tree index maintenance creates significant overhead. Each insert traverses every index's B-tree from root to leaf, finds the correct leaf page, and inserts the new entry. If the leaf page is full, it splits, which can cascade up the tree. For time-ordered data, inserts concentrate on the right edge of timestamp indexes, creating contention on a small number of leaf pages. Non-timestamp indexes (device ID, metric type) scatter inserts across the tree, causing random I/O. With five indexes on a table, every row insert performs one heap page write, five B-tree traversals and leaf page insertions, plus WAL records for each. At 50K inserts/sec, that's 50K heap writes + 250K index insertions per second.</p>
<h3 id="query-planning-overhead">Query planning overhead</h3>
<p>The Postgres planner runs a full optimization pass on every query: it enumerates possible paths, estimates costs from <code>pg_statistic</code> entries, considers index usage, evaluates join orders, and selects an execution plan. For workloads with diverse, unpredictable query patterns involving complex joins, this is the right approach.</p>
<p>For time-series workloads, query shapes are highly repetitive. The same <code>WHERE ts &gt; now() - interval '...'</code> filter runs thousands of times per second. The full planning cycle executes every time. At high query rates, planning overhead is measurable in <code>pg_stat_statements</code> as the gap between <code>total_plan_time</code> and <code>total_exec_time</code>.</p>
<p>Statistics maintenance creates its own cost. ANALYZE samples rows to populate <code>pg_statistic</code>, with the sample size scaled by <code>default_statistics_target</code> (default: 100, which yields roughly 30,000 sampled rows). On billion-row tables, even this sampling-based statistics collection is expensive and must run frequently to keep estimates accurate. Stale statistics provide poor cardinality estimates, leading the planner to choose sequential scans over index scans, or vice versa.</p>
<p>With hundreds of partitions, the planner must evaluate partition pruning for each partition's bounds against the query predicates. This is fast per-partition but scales linearly with partition count. At 500+ partitions, plan time for simple queries can exceed execution time.</p>
<h3 id="write-ahead-logging-wal-volume">Write-Ahead Logging (WAL) volume</h3>
<p>Every data modification generates a WAL record before it's applied to the heap or index pages. WAL writes are sequential and synchronous (fsync per commit, or per <code>wal_writer_delay</code> interval with asynchronous commit). At 100K inserts/sec, WAL generation is roughly:</p>
<ul>
<li>Heap insert records: ~100–150 bytes each = 10–15MB/sec</li>
<li>Index insert records: 5 indexes × ~60–80 bytes each = 30–40MB/sec</li>
<li>Full-page writes (after checkpoint): intermittent bursts of 8KB per dirtied page</li>
</ul>
<p>Total sustained WAL throughput: 50–100MB/sec under normal operation, spiking higher after checkpoints when <code>full_page_writes</code> triggers 8KB records for newly dirtied pages. <a href="https://www.postgresql.org/docs/current/wal-reliability.html">The PostgreSQL documentation</a> describes why: "the first modification of a data page after each checkpoint results in logging the entire page content." At those rates, that's 3–6GB/min, 180–360GB/hour.</p>
<p>WAL I/O becomes a direct throughput bottleneck. <code>pg_stat_wal</code> shows <code>wal_write</code> and <code>wal_sync</code> times climbing. Replicas that can't apply WAL fast enough fall behind, and unprocessed WAL files accumulate on the primary's <code>pg_wal</code> directory, consuming disk. <code>max_wal_size</code> and checkpoint frequency become critical tuning parameters.</p>
<h3 id="the-compounding-effect">The compounding effect</h3>
<p>None of these four constraints operates in isolation. Each amplifies the others, and that's where the math gets ugly.</p>
<p>MVCC overhead creates per-tuple bloat, which accumulates faster than autovacuum can clean at high insert rates. Autovacuum competing for I/O degrades write throughput. Degraded write throughput causes queries on bloated tables to slow down, which increases pressure to add more indexes. More indexes produce more write amplification, more WAL, and more replication lag. Row storage forces read amplification on time-range queries, which creates pressure to add covering indexes. Those indexes add to the write overhead feeding back into the MVCC/autovacuum loop.</p>
<p><img src="https://timescale.ghost.io/blog/content/images/2026/02/diagram-5-updated.png" alt="" loading="lazy"></p>
<p>At 50K inserts/sec with five indexes on a table, the steady-state database workload is: 50K heap tuple writes/sec, 250K B-tree index insertions/sec, 50–100MB/sec sustained WAL generation, continuous autovacuum activity across active partitions, and full query planning on every incoming query.</p>
<p>This is why a 16-core/64GB instance struggles with what appears to be a straightforward append-only workload.</p>
<p>Partitioning reduces per-partition table size but doesn't change the per-row overhead. Adding RAM improves buffer cache hit rates but doesn't reduce write amplification. Autovacuum tuning manages bloat but can't eliminate the cost of producing it. Each optimization operates within these constraints. None removes the constraints themselves.</p>
<p>This is the Optimization Treadmill at the mechanical level. You're not fighting configuration. You're fighting the storage model, the concurrency architecture, and the write path. All of which are designed for a workload that looks nothing like yours.</p>
<h2 id="when-to-choose-a-different-path">When to Choose a Different Path</h2>
<p>Most teams recognize this pattern 12–18 months too late. By then, the tables are massive, the partitioning scheme is deeply embedded, and migration has become a multi-month project. The difference between acting at 10M rows and acting at 1B rows is roughly an order of magnitude in engineering cost.</p>
<h3 id="postgres-workload-scoring-framework">Postgres Workload Scoring Framework</h3>
<p>Go back to <a href="#what-this-workload-looks-like">the six characteristics</a>. Be honest about how many describe your system right now, and then score yourself again against where you'll be in 12–18 months.</p>
<p>If four or five apply, you're in this pattern. The optimization phases above are already in your future, or you've started them.</p>
<p>If all six apply, you're past the point of easy exits. Architectural friction is the dominant factor in your performance trajectory, and the migration cost is climbing every quarter you wait.</p>
<p>If three or fewer apply, you likely have a different problem. Standard Postgres optimization should change the trajectory.</p>
<h3 id="early-warning-signs">Early warning signs</h3>
<p>Before the pattern becomes a crisis, it shows up in how the team spends its time:</p>
<p><strong>Optimization dominates planning.</strong> 10–20% of engineering time goes to database performance, and every quarterly roadmap includes a scalability line item.</p>
<p><strong>Costs grow faster than revenue.</strong> Finance is asking why the database bill increased 40% while user growth was only 15%.</p>
<p><strong>Operational complexity accumulates.</strong> 20+ pages of runbooks, partition management scripts, monitoring for autovacuum lag, replication delay, and index bloat. New engineers need weeks of onboarding before they can safely operate the database.</p>
<p><strong>Growth outpaces optimization.</strong> You ship a 2x improvement and data growth erases it within two quarters.</p>
<p><strong>Autovacuum is a constant concern.</strong> It's in the top five processes by CPU and I/O at all hours, and tuning it is a recurring conversation.</p>
<p>Two or three of these signs mean you should be paying attention. Four or more means you're already in the pattern.</p>
<h3 id="migration-complexity-at-different-scales">Migration complexity at different scales</h3>
<p><strong>10M–50M rows.</strong> A few days to 1–2 weeks. Simple dump/restore, or logical replication. Low risk, fast validation, easy rollback. 1–2 engineers part-time (roughly 80 engineer-hours).</p>
<p><strong>100M–500M rows.</strong> 2–6 weeks. Partition-by-partition migration. More dependencies to account for, more thorough testing required. 2–3 engineers, mostly full-time (roughly 400 engineer-hours).</p>
<p><strong>1B+ rows.</strong> 2–6 months. Hundreds or thousands of partitions. Zero-downtime required, complex rollback planning. Application-level dual-write or change-data-capture pipelines are in play. 3–5 engineers full-time plus a validation period (roughly 2,000 engineer-hours).</p>
<p>Those hours are not spent on product features. And there's no point on this curve where migration gets easier by waiting.</p>
<h3 id="what-purpose-built-postgres-variants-means">What "purpose-built Postgres variants" means</h3>
<p>TimescaleDB is built on top of Postgres, not in place of it. The PostgreSQL wire protocol, SQL query language, extensions like PostGIS and pgvector, your application code, and your ecosystem tooling all stay the same. What changes is the storage engine and execution layer underneath.</p>
<p><strong>MVCC overhead addressed through columnar compression.</strong> The problem: every row insert in vanilla Postgres generates per-tuple MVCC headers, index entries, and WAL records regardless of whether the data will ever be updated, driving 3–5x write amplification and continuous autovacuum load. TimescaleDB's columnar storage (the <code>Columnstore</code> layer) batches up to 1,000 row versions per column into compressed arrays before writing to disk. Each batch write replaces thousands of individual heap tuple insertions with a single compressed segment write. The per-tuple MVCC header overhead is amortized across the batch, and autovacuum pressure drops proportionally. Far less row-level churn to clean up. In practice, write amplification drops from the 3–5x range to near 1:1 for sustained append workloads. The <a href="https://www.tigerdata.com/docs/about/latest/whitepaper">Tiger Data architecture whitepaper</a> covers the columnar layout and compression pipeline in detail.</p>
<p><strong>Row storage replaced by columnar layout for <a href="https://www.tigerdata.com/learn/the-best-time-series-databases-compared">time-series data</a>.</strong> The problem: vanilla Postgres reads all columns of every matching row even when a query needs two, creating 15x+ read amplification on wide tables, with none of the compression techniques applicable to time-series data. Rather than reading all 30 columns of a row to get two values, queries against the columnar layer read only the referenced columns from compressed column arrays. The 15x read amplification drops to near 1:1. Time-series compression (delta-of-delta for timestamps, gorilla-style XOR for floats, run-length encoding for repeated values) routinely achieves 10–20x compression ratios vs. heap storage. A dataset that occupies 1TB in vanilla Postgres often fits in 50–100GB with columnar compression enabled.</p>
<p><strong>Query planning overhead reduced through chunk exclusion and continuous aggregates.</strong> The problem: the Postgres planner runs a full optimization pass on every query, and with hundreds of partitions, partition pruning overhead can exceed execution time for simple queries. TimescaleDB's planner extension adds chunk exclusion that operates at a lower level than Postgres's partition pruning. Chunks are indexed by time range in a catalog table, and the planner excludes non-overlapping chunks before the standard planning phase. For query shapes that repeat thousands of times per second, this eliminates most of the per-partition pruning overhead. Continuous aggregates go further: pre-computed rollups stored as materialized views, updated incrementally as new data arrives, so dashboards querying hourly or daily aggregations hit a small summary table instead of scanning billions of raw rows.</p>
<p><strong>WAL volume reduced through batched ingestion.</strong> The problem: at 100K inserts/sec, vanilla Postgres generates 50–100MB/sec of WAL, creating I/O bottlenecks and causing replicas to fall behind. Lagging replicas force the primary to retain more unprocessed WAL, which consumes disk and makes the lag worse. The root cause is per-row WAL records: one per heap insert, one per index insertion. Columnar storage's batch writes generate WAL at the segment level rather than the row level. At 100K inserts/sec, WAL volume drops from 50–100MB/sec to roughly 5–15MB/sec in typical deployments, which eliminates most replication lag issues. Replicas that previously fell behind during write peaks can keep up without tuning.</p>
<p><img src="https://timescale.ghost.io/blog/content/images/2026/02/comparison-chart.png" alt="" loading="lazy"></p>
<p><strong>Concrete numbers.</strong> Benchmark results vary by workload, but the directional data is consistent: ingestion throughput 10–20x higher than vanilla Postgres at equivalent instance size, query performance on time-range aggregations 100x+ faster with columnar storage, storage footprint 10–20x smaller with compression enabled. RTABench, a benchmark for real-time analytics workloads, publishes results showing the performance gap between vanilla PostgreSQL and TimescaleDB across real-world query patterns (<a href="https://rtabench.com/">see the benchmark results</a>).</p>
<h3 id="decision-framework">Decision framework</h3>
<p>Choose a specialized architecture if you score 4+ on the <a href="#when-to-choose-a-different-path">Postgres Workload Scoring Framework</a> AND you're experiencing 2+ early warning signs AND you can project continued data growth.</p>
<p>Strong indicators to act now: you're under 100M rows, you're already building custom partitioning, your team spends 15%+ of engineering time on database optimization, and you can project 500M+ rows within 12 months.</p>
<p>You might not need this if writes are bursty rather than continuous, retention is 7–30 days, queries don't predominantly filter on time ranges, or growth is stable and slow.</p>
<h2 id="optimization-vs-architecture">Optimization vs. Architecture</h2>
<p>There are two different problems that both show up as "database performance is degrading."</p>
<p><strong>Problem 1: Optimization within the right architecture.</strong> The workload fits the database's design. Better indexes, query rewrites, configuration tuning, and hardware upgrades directly improve the trajectory. Postgres expertise solves the problem. For most workloads, vanilla Postgres is the right choice.</p>
<p><strong>Problem 2: The Optimization Treadmill.</strong> The workload hits design tradeoffs baked into the storage engine, concurrency model, and query planner. Optimization helps in the short term but doesn't change the trajectory. Each phase buys time. None buys a different outcome. You're working around the architecture rather than with it.</p>
<p>Knowing which problem you have determines the path forward.</p>
<p>If you followed the optimization phases in this piece, you weren't doing anything wrong. Those were correct responses to the symptoms. Any experienced Postgres team would have done the same. The pattern is common precisely because the progression makes sense at each step.</p>
<p>What changes with recognition is agency. At 10M–50M rows, you can choose a purpose-built architecture in days to weeks and redirect engineering time to product work. At 100M–500M rows, migration is harder but still reasonable, taking 2–6 weeks. At 1B+, it's a multi-month project, and every quarter of delay adds complexity.</p>
<p>The broader principle applies beyond this workload. Different databases have different architectural strengths, so the best choice depends on the workload. Postgres is brilliant for general-purpose relational work. Specialized variants built on top of Postgres excel at specialized patterns. Recognizing when architecture matters more than optimization is an engineering judgment call, not a criticism of the tool.</p>
<p>Architectural fit determines your ceiling. Optimization determines where you operate relative to that ceiling. When you're hitting the ceiling repeatedly, the productive question isn't "how do we optimize better?" It's "are we operating within the right architecture?" With this workload pattern, the ceiling was always there. You just needed enough data volume to find it. Score your workload. If four or more of the six characteristics apply and you're under 100M rows, this is the cheapest architectural decision you'll make this year. <a href="https://www.tigerdata.com/docs/about/latest/whitepaper">The whitepaper</a> covers the mechanics. The <a href="https://console.cloud.timescale.com/signup">Tiger Data free trial</a> lets you validate on your own data.</p>
<p><a href="https://timescale.ghost.io/blog/from-4-databases-to-1-how-plexigrid-replaced-influxdb-got-350x-faster-queries-tiger-data/" rel="noreferrer">New: Learn how Plexigrid moved from 4 databases to 1 with Tiger Data.</a></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Six Signs That Postgres Tuning Won't Fix Your Performance Problems]]></title>
            <description><![CDATA[When Postgres tuning won't fix performance: recognize the six characteristics of time-series workloads that need TimescaleDB's purpose-built architecture.]]></description>
            <link>https://www.tigerdata.com/blog/six-signs-postgres-tuning-wont-fix-performance-problems</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/six-signs-postgres-tuning-wont-fix-performance-problems</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Matty Stratton]]></dc:creator>
            <pubDate>Thu, 12 Feb 2026 21:26:14 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2026/02/Postgres-Tuning-Performance-compressed.png">
            </media:content>
            <content:encoded><![CDATA[<p>You've added indexes. You've partitioned tables. You've tuned autovacuum within an inch of its life. Performance improves for a few months, and then the dashboards go red again. Sound familiar?</p><p>If so, you're probably not doing anything wrong. You're running a workload that vanilla Postgres was never designed for, and no amount of configuration will change that.</p><p>It's not transactional. It's not a data warehouse. It's analytics on live data: high-frequency ingestion that stays operationally queryable. If you're running this pattern, you've already been through the cycle: add indexes, partition tables, tune autovacuum, upgrade instances. Each fix buys a few months. Then the metrics climb again.</p><p><strong>The longer you wait, the harder the migration. At 10M rows it takes days. At 500M rows, weeks. At 1B+, months.</strong> Recognizing the pattern early is the highest-leverage decision you can make.</p><p>This post describes six characteristics that define this workload. If four or five apply to your system, the friction is architectural, not operational. (For a deeper look at how <a href="https://www.tigerdata.com/learn/the-best-time-series-databases-compared" rel="noreferrer">purpose-built time-series architecture</a> addresses these constraints, see the <a href="https://www.tigerdata.com/docs/about/latest/whitepaper"><u>Tiger Data architecture whitepaper</u></a>.)</p><h2 id="continuous-high-frequency-ingestion">Continuous High-Frequency Ingestion</h2><p>The database is absorbing thousands to hundreds of thousands of inserts per second. Not in bursts. Not during a nightly ETL window. Continuously, 24/7.</p><p>Consider a semiconductor fab with 8,000 CNC machines and inspection stations on the floor, each reporting vibration, temperature, spindle speed, and tool wear every 2 seconds. That's 4,000 inserts/sec from a single facility. 
Add process control events, quality inspection results, and environmental monitoring across three plants, and you're at 30–50K inserts/sec before accounting for growth.</p><pre><code class="language-SQL">-- What a single station's insert stream looks like
INSERT INTO machine_telemetry (ts, station_id, metric, value)
VALUES
  (now(), 'CNC-4401', 'vibration_mm_s', 2.34),
  (now(), 'CNC-4401', 'spindle_rpm', 12045),
  (now(), 'CNC-4401', 'coolant_temp_c', 31.2),
  (now(), 'CNC-4401', 'tool_wear_pct', 67.8);
-- Multiply by 8,000 stations × 0.5 Hz × 3 facilities
</code></pre><p>This matters because Postgres needs breathing room to run maintenance. Autovacuum, index maintenance, statistics collection. Continuous ingestion means maintenance always competes with writes. There is no off-peak window.</p><h2 id="queries-revolve-around-time">Queries Revolve Around Time</h2><p>Nearly every row has a timestamp, and nearly every query filters on a time range. Last 30 minutes. This week versus last week. Everything between two dates.</p><p>A trading platform captures every order, fill, and cancellation across multiple venues. The operations team monitors execution quality in real time. The compliance team audits historical patterns. Both teams write queries that look like this:</p><pre><code class="language-SQL">-- Operations: real-time execution quality
SELECT venue, avg(fill_latency_us), percentile_cont(0.99)
  WITHIN GROUP (ORDER BY fill_latency_us)
FROM executions
WHERE ts &gt; now() - interval '15 minutes'
GROUP BY venue;

-- Compliance: historical pattern detection
SELECT account_id, count(*) as cancel_count
FROM order_events
WHERE ts BETWEEN '2025-01-01' AND '2025-03-31'
  AND event_type = 'cancel'
  AND cancel_reason = 'client_requested'
GROUP BY account_id
HAVING count(*) &gt; 500;
</code></pre><p>Time is the primary axis for both storage and retrieval. General-purpose B-tree indexes aren't built for this access pattern, which is why teams end up building manual partitioning schemes and custom tooling to get time-range queries to perform.</p><h2 id="data-is-append-only">Data Is Append-Only</h2><p>Once a row lands, it doesn't change. Sensor readings are immutable. Financial transactions don't get updated. Log entries are permanent. When data gets removed, it happens in bulk: drop an entire month's partition, not individual rows.</p><p>A wind farm operator collects turbine performance data: blade pitch, rotor speed, power output, nacelle orientation. Once recorded, these readings are facts. They never get corrected or overwritten.</p><pre><code class="language-SQL">-- This is the entire write pattern. INSERT. No UPDATE. No single-row DELETE.
INSERT INTO turbine_readings
  (ts, turbine_id, blade_pitch_deg, rotor_rpm, power_kw, wind_speed_ms)
VALUES
  (now(), 'WT-112', 12.4, 14.2, 2840, 11.3);

-- Data removal is always bulk
DROP TABLE turbine_readings_2023_q1;</code></pre><p><strong>Every row you insert carries 23 bytes of MVCC transaction metadata, on data you will never update.</strong> Autovacuum still scans these tables regularly, for transaction-ID freezing and visibility-map maintenance, even though a workload with no updates never produces dead tuples for it to clean. At 50K inserts/sec, that's MVCC overhead on 4.3 billion rows per day that will never be modified. You're paying the full cost of a concurrency model designed for workloads that look nothing like yours.</p><h2 id="retention-is-measured-in-months-or-years">Retention Is Measured in Months or Years</h2><p>Seven years of financial records for compliance. Quarters of manufacturing data for root cause analysis. Two-plus years of training data for ML pipelines.</p><p>A pharmaceutical manufacturer tracks environmental conditions (temperature, humidity, particulate count) across cleanroom facilities to meet FDA 21 CFR Part 11 requirements. When a batch fails quality control six months after production, the investigation pulls environmental data from the exact time window the batch was in each room.</p><pre><code class="language-SQL">-- Root cause investigation: what were cleanroom conditions
-- during a batch produced 6 months ago?
SELECT room_id, avg(temp_c), max(particulate_count),
  bool_or(humidity_pct &gt; 45) as humidity_excursion
FROM cleanroom_environment
WHERE ts BETWEEN '2025-08-14 06:00' AND '2025-08-14 18:00'
  AND facility = 'building_3'
GROUP BY room_id;
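
-- Editorial sketch (not in the original post): the composite index a
-- vanilla Postgres deployment would typically need for the query above;
-- names follow the example.
CREATE INDEX ON cleanroom_environment (facility, ts);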
</code></pre><p>Short retention hides architectural problems because old data ages out. Long retention removes that escape valve. At 50K inserts per second, that's roughly 1.6 trillion rows per year. After three years: nearly 5 trillion rows.</p><h2 id="queries-are-latency-sensitive">Queries Are Latency-Sensitive</h2><p>This data isn't sitting in cold storage waiting for a weekly report. It's being queried actively, under latency constraints.</p><p>A SaaS observability platform collects metrics from thousands of customer deployments. The product serves real-time dashboards, automated alerting, and deep-dive investigation, all from the same database. Latency expectations form a gradient:</p><pre><code class="language-SQL">-- Dashboard widget: last 5 minutes, needs &lt; 100ms response
SELECT host_id, avg(cpu_pct), max(mem_used_bytes)
FROM host_metrics
WHERE ts &gt; now() - interval '5 minutes'
  AND customer_id = 'cust_8821'
GROUP BY host_id;

-- Alert evaluation: last hour, needs &lt; 500ms
SELECT host_id, avg(cpu_pct)
FROM host_metrics
WHERE ts &gt; now() - interval '1 hour'
  AND customer_id = 'cust_8821'
GROUP BY host_id
HAVING avg(cpu_pct) &gt; 90;

-- Incident investigation: last 3 months, seconds acceptable
SELECT date_trunc('hour', ts), avg(cpu_pct), avg(mem_used_bytes)
FROM host_metrics
WHERE ts &gt; now() - interval '90 days'
  AND host_id = 'host-a3f9c'
GROUP BY 1 ORDER BY 1;
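
-- Editorial sketch (not in the original post): the hand-maintained hourly
-- rollup teams often add on vanilla Postgres to keep the 90-day query fast.
CREATE MATERIALIZED VIEW host_metrics_hourly AS
SELECT date_trunc('hour', ts) AS hour, host_id,
       avg(cpu_pct) AS avg_cpu_pct, avg(mem_used_bytes) AS avg_mem_bytes
FROM host_metrics
GROUP BY 1, 2;
-- ...which then needs its own REFRESH MATERIALIZED VIEW schedule.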
</code></pre><p>Data warehouse scope with operational latency requirements. All from a single system.</p><h2 id="growth-is-sustained">Growth Is Sustained</h2><p>Data volume growing 50-100%+ year over year on a predictable curve. Static workloads can be over-provisioned once and left alone. Growing workloads demand constant re-optimization.</p><p>A logistics company tracks GPS position, engine diagnostics, and cargo conditions across a fleet of refrigerated trucks. They started with 200 trucks. Expansion added 150 trucks in year one, another 300 in year two. Each truck reports every 10 seconds.</p><pre><code class="language-markdown">Year 1:  200 trucks ×  6 readings/min × 1,440 min/day = 1.7M rows/day
Year 2:  350 trucks ×  6 readings/min × 1,440 min/day = 3.0M rows/day
Year 3:  650 trucks ×  6 readings/min × 1,440 min/day = 5.6M rows/day

Cumulative after 3 years: ~3.8 billion rows
</code></pre><p>Every optimization you ship today is solving for a table size you'll blow past in six months. The treadmill doesn't stop.</p><h2 id="what-to-do-with-this">What to Do With This</h2><p>Count how many of these characteristics describe your system. If it's two or three, standard <a href="https://timescale.ghost.io/blog/postgres-optimization-treadmill/" rel="noreferrer">Postgres optimization</a> should have a real impact. The architecture fits your workload. Better indexes, smarter queries, autovacuum tuning. The usual playbook works.</p><p>If it's four or five, however, the friction is architectural, not operational. You don't need to abandon Postgres. Tiger Data extends vanilla Postgres to handle exactly this workload. You keep SQL, your extensions, your team's expertise, and the entire Postgres ecosystem. What changes is the storage engine, partitioning, and query planning underneath.</p><p>The numbers bear this out. In <a href="https://www.tigerdata.com/blog/postgresql-timescaledb-1000x-faster-queries-90-data-compression-and-much-more"><u>benchmarks against vanilla PostgreSQL at one billion rows</u></a>, TimescaleDB delivered up to 1,000x faster query performance while reducing storage by 90% through native compression. Ingest throughput stays constant past 10 billion rows, while PostgreSQL's performance degrades as indexed tables outgrow memory (throughput that starts at 100K+ rows/sec can crash to hundreds). On Azure infrastructure running <a href="https://www.tigerdata.com/blog/benchmark-results-fastest-time-series-database-azure"><u>RTABench workloads</u></a>, Tiger Cloud was 1,200x faster than vanilla PostgreSQL across 40 real-time analytics queries. These aren't synthetic edge cases. 
They're the exact query patterns this post describes: time-range filters, aggregations, selective scans on growing datasets.</p><p><em>This post is part of a series on Postgres performance limits for </em><a href="https://www.tigerdata.com/learn/time-series-database-what-it-is-how-it-works-and-when-you-need-one" rel="noreferrer"><em>high-frequency data workloads</em></a><em>. The full analysis, including a workload scoring framework and migration complexity breakdown at different scales, is in the anchor essay:</em><a href="https://www.tigerdata.com/blog/postgres-optimization-treadmill" rel="noreferrer"><em> <u>Understanding Postgres Performance Limits for Analytics on Live Data</u></em></a><em>. Ready to test it on your own data?</em><a href="https://console.cloud.timescale.com/signup"><em> <u>Start a free Tiger Data trial.</u></em></a></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to Train Your Agent to Be a Postgres Expert]]></title>
            <description><![CDATA[Turn AI into a Postgres expert with our MCP server. Get 35 years of best practices, versioned docs, and prompt templates for production-ready schemas.]]></description>
            <link>https://www.tigerdata.com/blog/free-postgres-mcp-prompt-templates</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/free-postgres-mcp-prompt-templates</guid>
            <category><![CDATA[Announcements & Releases]]></category>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Matty Stratton]]></dc:creator>
            <pubDate>Wed, 22 Oct 2025 14:02:12 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2025/10/2025-Oct-21-Prompt-Template-Thumbnail.png">
            </media:content>
<content:encoded><![CDATA[<h3 id="with-prompt-templates-and-versioned-docs-we-turn-35-years-of-postgres-wisdom-into-structured-knowledge-your-agent-can-reason-with">With prompt templates and versioned docs, we turn 35 years of Postgres wisdom into structured knowledge your Agent can reason with.</h3><p>Agents are the <a href="https://www.tigerdata.com/blog/postgres-for-agents" rel="noreferrer">new developer</a>. But they’re generalists.&nbsp;</p><p>What happens when they design your Postgres database? Your schema runs, your tests pass… and six months later your queries crawl and your costs skyrocket.&nbsp;</p><p>AI-generated SQL and database schemas are almost right. And that’s the problem. Fixing schema design mistakes is costlier than refactoring code. It often means multi-week migrations, downtime windows, rollback plans, and your CEO asking why the site is in maintenance mode. The root issue? LLMs don’t have the depth of Postgres and database expertise needed to build scalable systems. And when agents try to learn, they find documentation written for humans, not for them.&nbsp;</p><p>But agents don’t need more data; they need better context. They need to know what “good Postgres” actually looks like. The good news is that, given the right context and tools, agents can become instant experts. Even with Postgres.&nbsp;</p><p>That’s why we built an MCP server that provides 35 years of Postgres wisdom and full access to the Postgres docs, all in a format that agents can easily process.&nbsp;</p><p>And we think this just might be the best database MCP server ever built. While Neon, Supabase, and other Postgres companies created MCP servers as thin API wrappers, ours teaches AI how to think in Postgres. The Tiger MCP server gives AI tools that work automatically: no prompt engineering or manual lookups needed. You just ask. And it provides correct, idiomatic Postgres. 
</p><p>Our new MCP server ships with detailed prompt templates written by our senior engineers, plus versioned Postgres (15-18) docs, and <a href="https://docs.tigerdata.com/" rel="noreferrer">TimescaleDB docs</a>.&nbsp;When your AI needs to design a schema, it automatically pulls the right template and searches the docs (using hybrid search) to generate code that actually lasts. And of course, it handles the basics: start, stop, <a href="https://www.tigerdata.com/blog/fast-zero-copy-database-forks" rel="noreferrer">fork a database</a>. </p><p>But the real magic happens when you ask AI to design your schema. So let’s see what happens when AI designs a simple schema. First without the MCP server, then with it.</p><h2 id="without-the-mcp-server">Without the MCP Server</h2><pre><code class="language-SQL">CREATE TABLE customers (
    customer_id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    email VARCHAR(255) UNIQUE NOT NULL,
    password_hash VARCHAR(255) NOT NULL,
    first_name VARCHAR(100) NOT NULL,
    last_name VARCHAR(100) NOT NULL,
    phone VARCHAR(20),
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    is_active BOOLEAN DEFAULT TRUE
);

CREATE INDEX idx_customers_email ON customers(email);

CREATE TABLE authors (
    author_id SERIAL PRIMARY KEY,
    first_name VARCHAR(100) NOT NULL,
    last_name VARCHAR(100) NOT NULL,
    biography TEXT,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);
);</code></pre><p>What’s wrong here:&nbsp;</p><ul><li>Mixed ID types (SERIAL and UUID) with no clear reason</li><li>Arbitrary VARCHAR limits instead of TEXT</li><li>Case-sensitive email index → duplicate accounts</li><li>CURRENT_TIMESTAMP instead of now() (the two are equivalent in Postgres, but mixing conventions invites inconsistency)</li></ul><p>Someone new to Postgres may not notice these problems. Until it’s too late.</p><p>Now let’s see what AI generates with our new MCP server.</p><h2 id="with-the-mcp-server">With the MCP Server</h2><pre><code class="language-SQL">CREATE TABLE authors (
  author_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  name TEXT NOT NULL,
  bio TEXT,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX ON authors (name);

CREATE TABLE users (
  user_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  email TEXT NOT NULL UNIQUE,
  password_hash TEXT NOT NULL,
  first_name TEXT NOT NULL,
  last_name TEXT NOT NULL,
  phone TEXT,
  is_active BOOLEAN NOT NULL DEFAULT true,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
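
-- Editorial sketch (not part of the MCP output): a DEFAULT only stamps
-- updated_at on INSERT; refreshing it on UPDATE takes a trigger.
CREATE FUNCTION set_updated_at() RETURNS trigger AS $$
BEGIN
  NEW.updated_at := now();
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER users_set_updated_at
  BEFORE UPDATE ON users
  FOR EACH ROW EXECUTE FUNCTION set_updated_at();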
CREATE UNIQUE INDEX ON users (LOWER(email));
CREATE INDEX ON users (created_at);</code></pre><p>What’s better about this?</p><ul><li>Consistent ID strategy with BIGINT GENERATED ALWAYS AS IDENTITY</li><li>TEXT instead of arbitrary VARCHAR limits</li><li>Case-insensitive email lookups</li><li>Modern timestamp handling</li></ul><p>But why does this matter?</p><p>Each of these differences creates a compounding problem. Changing datatypes in the future will require full table rewrites. Missing lowercase email handling means duplicate accounts and confused users. And time zones? Every senior developer gets the thousand-yard stare when you mention UTC conversions.</p><p>This is just a small example; imagine what would happen with more complex schemas.</p><p>And if you don’t believe us, here’s what Claude has to say:</p><pre><code class="language-markdown">&gt; Please describe the schema you would create for an e-commerce website two times, first with the tiger mcp server disabled, then with the tiger mcp server enabled. For each time, write the schema to its own file in the current working directory. Then compare the two files and let me know which approach generated the better schema, using both qualitative and quantitative reasons. For this example, only use standard Postgres.</code></pre><figure class="kg-card kg-video-card kg-width-regular" data-kg-thumbnail="https://timescale.ghost.io/blog/content/media/2025/10/how-to-train-your-agent_thumb.jpg" data-kg-custom-thumbnail="">
            <div class="kg-video-container">
                <video src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/media/2025/10/how-to-train-your-agent.mp4" poster="https://img.spacergif.org/v1/1280x720/0a/spacer.png" width="1280" height="720" loop="" autoplay="" muted="" playsinline="" preload="metadata" style="background: transparent url('https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/media/2025/10/how-to-train-your-agent_thumb.jpg') 50% 50% / cover no-repeat;"></video>
                <div class="kg-video-overlay">
                    <button class="kg-video-large-play-icon" aria-label="Play video">
                        <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
                            <path d="M23.14 10.608 2.253.164A1.559 1.559 0 0 0 0 1.557v20.887a1.558 1.558 0 0 0 2.253 1.392L23.14 13.393a1.557 1.557 0 0 0 0-2.785Z"></path>
                        </svg>
                    </button>
                </div>
                <div class="kg-video-player-container kg-video-hide">
                    <div class="kg-video-player">
                        <button class="kg-video-play-icon" aria-label="Play video">
                            <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
                                <path d="M23.14 10.608 2.253.164A1.559 1.559 0 0 0 0 1.557v20.887a1.558 1.558 0 0 0 2.253 1.392L23.14 13.393a1.557 1.557 0 0 0 0-2.785Z"></path>
                            </svg>
                        </button>
                        <button class="kg-video-pause-icon kg-video-hide" aria-label="Pause video">
                            <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
                                <rect x="3" y="1" width="7" height="22" rx="1.5" ry="1.5"></rect>
                                <rect x="14" y="1" width="7" height="22" rx="1.5" ry="1.5"></rect>
                            </svg>
                        </button>
                        <span class="kg-video-current-time">0:00</span>
                        <div class="kg-video-time">
                            /<span class="kg-video-duration">0:50</span>
                        </div>
                        <input type="range" class="kg-video-seek-slider" max="100" value="0">
                        <button class="kg-video-playback-rate" aria-label="Adjust playback speed">1×</button>
                        <button class="kg-video-unmute-icon" aria-label="Unmute">
                            <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
                                <path d="M15.189 2.021a9.728 9.728 0 0 0-7.924 4.85.249.249 0 0 1-.221.133H5.25a3 3 0 0 0-3 3v2a3 3 0 0 0 3 3h1.794a.249.249 0 0 1 .221.133 9.73 9.73 0 0 0 7.924 4.85h.06a1 1 0 0 0 1-1V3.02a1 1 0 0 0-1.06-.998Z"></path>
                            </svg>
                        </button>
                        <button class="kg-video-mute-icon kg-video-hide" aria-label="Mute">
                            <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
                                <path d="M16.177 4.3a.248.248 0 0 0 .073-.176v-1.1a1 1 0 0 0-1.061-1 9.728 9.728 0 0 0-7.924 4.85.249.249 0 0 1-.221.133H5.25a3 3 0 0 0-3 3v2a3 3 0 0 0 3 3h.114a.251.251 0 0 0 .177-.073ZM23.707 1.706A1 1 0 0 0 22.293.292l-22 22a1 1 0 0 0 0 1.414l.009.009a1 1 0 0 0 1.405-.009l6.63-6.631A.251.251 0 0 1 8.515 17a.245.245 0 0 1 .177.075 10.081 10.081 0 0 0 6.5 2.92 1 1 0 0 0 1.061-1V9.266a.247.247 0 0 1 .073-.176Z"></path>
                            </svg>
                        </button>
                        <input type="range" class="kg-video-volume-slider" max="100" value="100">
                    </div>
                </div>
            </div>
            
        </figure><pre><code class="language-markdown">Verdict:

The Tiger MCP-enabled schema is objectively superior because:

1. Type Safety: Uses modern PostgreSQL types that prevent common errors
2. Data Integrity: 4x more constraints ensure data quality at the database level
3. Performance: Better indexing strategy with 55% more indexes, including partial and expression indexes
4. Maintainability: Better documentation, clear comments, proper naming
5. Features: More complete e-commerce functionality (discounts, full-text search, audit trails)
6. PostgreSQL Best Practices: Follows official PG17 recommendations from the documentation
7. Future-Proof: Uses modern features like GENERATED ALWAYS AS IDENTITY and NULLS NOT DISTINCT

The Tiger MCP server provided access to PostgreSQL-specific best practices documentation and design patterns that resulted in a more robust, performant, and maintainable schema.
</code></pre><h2 id="how-prompt-templates-make-it-all-possible">How Prompt Templates Make It All Possible</h2><p>Behind the scenes, AI uses the MCP server to call <code>get_prompt_template('design_postgres_tables')</code> to load schema design guidance. No pasted docs. No corrections. Just better code.</p><p>Prompt templates turn production wisdom into reusable guidance for AI. Instead of hunting through documentation written for humans, AI gets the information it needs in a format built for machines.</p><p>It comes down to the fact that humans and LLMs have opposite needs. Humans need narratives and memorable examples (and yes, even cat memes) to help them retain information. LLMs need to preserve context window space. That’s why prompt templates make terrible blog posts, but perfect AI guidance.</p><p>Our philosophy is: don't re-teach what the model already knows. LLMs have seen millions of lines of SQL. They know how to write CREATE TABLE. What they don’t know is the 35 years of Postgres wisdom about what works well and what doesn’t.</p><p>It's like your senior DBA whispering advice in the model's ear.</p><p>Our schema design template (<code>design_postgres_tables</code>) doesn’t explain what a primary key is. It jumps straight to guidance:</p><p>“Prefer <code>BIGINT GENERATED ALWAYS AS IDENTITY</code>; use <code>UUID</code> only when global uniqueness is needed.”</p><p>For data types, it doesn’t teach from scratch. It just tells you what works:</p><p>“DO NOT use <code>money</code> type; DO use <code>numeric</code> instead.”</p><p>Here’s a real snippet from the template:</p><pre><code class="language-markdown">## Postgres "Gotchas"

- **FK indexes**: Postgres **does not** auto-index FK columns. Add them.
- **No silent coercions**: length/precision overflows error out (no truncation). 
  Example: inserting 999 into `NUMERIC(2,0)` fails with error, unlike some 
  databases that silently truncate or round.
- **Heap storage**: no clustered PK by default (unlike SQL Server/MySQL InnoDB); 
row order on disk is insertion order unless explicitly clustered.</code></pre><p>These gotchas trip up LLMs the same way they trip up developers new to Postgres. We optimized these templates for machines: short, factual, and precise, packing maximum guidance into minimum tokens.&nbsp;</p><p>We tested the same approach on a real IoT schema design task. Without templates, the AI added forbidden configurations and missed critical optimizations. <em>With</em> templates, it generated production-ready code with compression, continuous aggregates, and tuned performance.</p><p>That’s how prompt templates work. Now let’s see how the MCP server makes it all happen.</p><h2 id="how-this-mcp-server-is-smarter-than-others">How This MCP Server is Smarter Than Others</h2><p>As we said up front, most database MCP servers are thin API wrappers. Ours teaches AI how to think in Postgres, with tools that work automatically: no prompt engineering or manual lookups needed. You just ask, and it provides correct, idiomatic Postgres.</p><p><strong><code>get_prompt_template</code> provides auto-discovered expertise. </strong>Instead of having to call a template explicitly, you just say “I want to make a schema for IoT devices…” and the MCP server figures it out. </p><p>With self-discoverable templates, the AI can detect intent and load the right recipe, applying 35 years of Postgres best practices behind the scenes. </p><p><strong>The templates have real depth. </strong>No scraped snippets or boilerplate. The templates are written by senior Postgres engineers, and provide opinionated, production-tested guidance that steers around every trap seasoned DBAs know.</p><p><strong>Postgres-native vector retrieval adds the right context.</strong> When the AI needs more information, the MCP server searches the versioned Postgres (15-18) and TimescaleDB docs. 
And it uses Postgres itself for storage and vector search.</p><p>Versioning is critical. For example, Postgres 15 introduced UNIQUE NULLS NOT DISTINCT, while 16 improved parallel queries, and 17 changed COPY error handling. The MCP keeps AIs grounded in correct syntax every time, avoiding broken code from the wrong version.</p><p>The Tiger MCP doesn’t just wire up APIs. It teaches AI to think like a real Postgres engineer. </p><p>You don’t have to craft the perfect prompt. You just ask, and it does the right thing.</p><h2 id="see-it-for-yourself">See It For Yourself</h2><p>Install the Tiger CLI and MCP server:</p><pre><code class="language-shell">curl -fsSL https://cli.tigerdata.com | sh
tiger auth login
tiger mcp install</code></pre><p>(We also have alternative <a href="https://github.com/timescale/tiger-cli"><u>installation instructions</u></a> for the CLI tool.)</p><p>Then select your AI assistant (Claude Code, Cursor, VS Code, Windsurf, etc.) and immediately get real Postgres knowledge flowing into your AI.</p><p>This is how Postgres becomes the best database to use with AI coding tools: not by accident, not because someone pasted docs into a chat, but because the tooling now teaches AI how to think in Postgres.&nbsp;</p><p>Try the MCP server. Break it. <a href="https://timescaledb.slack.com/join/shared_invite/zt-38c4rrt9t-eR8I4hnb4qeGLUrL6hM3mA#/shared-invite/email"><u>Improve it</u></a>. Help us teach every AI to write real Postgres.</p><hr><p><strong>About the authors</strong></p><p><strong>Matty Stratton</strong></p><p>Matty Stratton is the Head of Developer Advocacy and Docs at Tiger Data, a well-known member of the DevOps community, founder and co-host of the popular <a href="https://www.arresteddevops.com/"><u>Arrested DevOps</u></a> podcast, and a global organizer of the <a href="https://devopsdays.org"><u>DevOpsDays</u></a> set of conferences.</p><p>Matty has over 20 years of experience in IT operations and is a sought-after speaker internationally, presenting at Agile, DevOps, and cloud engineering focused events worldwide. Demonstrating his keen insight into the changing landscape of technology, he recently changed his license plate from DEVOPS to KUBECTL.</p><p>He lives in the Chicagoland area and has three awesome kids and two Australian Shepherds, whom he loves just a little bit more than he loves Diet Coke.</p><p><strong>Matvey Arye</strong></p><p><a href="https://www.linkedin.com/in/matvey-arye/"><u>Matvey Arye</u></a> is a founding engineering leader at Tiger Data (creators of TimescaleDB), the premier provider of relational database technology for time-series data and AI. 
Currently, he manages the team at Tiger Data responsible for building the go-to developer platform for AI applications.&nbsp;</p><p>Under his leadership, the Tiger Data engineering team has introduced partitioning, compression, and incremental materialized views for time-series data, plus cutting-edge indexing and performance innovations for AI.&nbsp;</p><p>Matvey earned a Bachelor's degree in Engineering at The Cooper Union. He earned a Doctorate in Computer Science at Princeton University, where his research focused on cross-continental data analysis covering issues such as networking, approximate algorithms, and performant data processing.&nbsp;</p><p><strong>Jacky Liang</strong></p><p><a href="https://www.linkedin.com/in/jjackyliang/"><u>Jacky Liang</u></a> is a developer advocate at Tiger Data with an obsession for AI and LLMs. He's worked at Pinecone, Oracle Cloud, and Looker Data as both a software developer and product manager, which has shaped the way he thinks about software.&nbsp;</p><p>He cuts through AI hype to focus on what actually works. How can we use AI to solve real problems? What tools are worth your time? How will this technology actually change how we work?&nbsp;</p><p>When he's not writing or speaking about AI, Jacky builds side projects and tries to keep up with the endless stream of new AI tools and research—an impossible task, but he keeps trying anyway. His model of choice is Claude Sonnet 4 and his favorite coding tool is Claude Code.</p>]]></content:encoded>
        </item>
    </channel>
</rss>