
Most engineers have a working mental model of MVCC. Readers don't block writers. Concurrent transactions see consistent snapshots. It's why Postgres handles mixed read/write workloads so well, and it's a genuine engineering achievement.
What's less obvious is that MVCC isn't free. Every row in every table carries its overhead. Not just rows that get updated. The system doesn't know at write time whether a row will ever be touched again, so it prepares for that possibility. Every time.
If you're running an IoT pipeline, a financial data feed, or an observability platform, most of your rows will never be updated. Sensor readings don't get corrected. Trade records are immutable. Log entries are permanent. You're writing append-only data into a system built to handle concurrent modification of shared rows, and you're paying the full price for that capability whether you use it or not.
This post breaks down exactly what that costs you: at the byte level, at the I/O level, and at the maintenance level.
What MVCC actually does (and why it's damn good)
Before MVCC, databases had two options: lock rows during reads so writers couldn't touch them, or lock rows during writes so readers couldn't see them. Either way, concurrent workloads serialized through lock contention. If you've ever worked with a database that does this, you know how painful it gets at scale.
MVCC solves the problem differently. When a row is updated, Postgres doesn't modify it in place. It writes a new version of the row and keeps the old version visible to transactions that started before the update. Each transaction sees a consistent snapshot of the database as of the moment it began. Readers and writers operate on different row versions simultaneously. No locking required.
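You can watch the version chain from psql. Here's a minimal sketch with a throwaway table (the table and values are illustrative); the hidden system columns xmin, xmax, and ctid expose the bookkeeping directly:

```sql
CREATE TABLE accounts (id int PRIMARY KEY, balance numeric);
INSERT INTO accounts VALUES (1, 100);

-- xmin = the transaction that created this version; xmax = 0 while it's live
SELECT xmin, xmax, ctid, id, balance FROM accounts;

UPDATE accounts SET balance = 90 WHERE id = 1;

-- The UPDATE wrote a brand-new tuple at a new ctid. The old version still
-- sits on the page, visible to snapshots that predate the update, until
-- vacuum reclaims it.
SELECT xmin, xmax, ctid, id, balance FROM accounts;
```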
For an e-commerce backend processing orders while users browse, a SaaS application handling concurrent sessions, or any system where multiple transactions touch the same rows, this is transformative. The PostgreSQL documentation puts it simply: reading never blocks writing and writing never blocks reading.
That's not a small thing. That's the reason Postgres can handle the concurrency patterns that would bring a lock-based system to its knees.
The cost of maintaining this guarantee is what the rest of this post is about.
The per-row overhead, in bytes
This is where most explanations go vague. Let's not do that.
Every heap tuple in Postgres carries a fixed 23-byte header before a single byte of your actual data gets written. Here's what's in it:
- t_xmin: the transaction ID that created this row (4 bytes)
- t_xmax: the transaction ID that deleted or updated it, zero if the row is live (4 bytes)
- t_cid: command ID within the transaction (4 bytes)
- t_ctid: physical location of this tuple or its newer version (6 bytes)
- t_infomask and t_infomask2: status flags for transaction visibility (4 bytes)
- t_hoff: offset to actual row data (1 byte)

These fields exist to answer one question: is this row visible to this transaction?
For a workload where rows are being updated and deleted concurrently, that question needs answering constantly. The 23 bytes are worth it.
For an append-only workload? t_xmax is zero for every live row and will stay zero. t_ctid points to itself because there's no newer version. The visibility question still gets asked, and the header still gets written, and the page still gets dirtied to set hint bits after the first read. But the answers are trivial every time. The mechanism is running in full for a case that never needed it.
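Don't take that on faith. If the pageinspect contrib extension is available (it needs superuser), you can dump the raw headers yourself; the table name here is a stand-in for your own append-only table:

```sql
CREATE EXTENSION IF NOT EXISTS pageinspect;

-- Dump the tuple headers from the first page of the table.
SELECT lp, t_xmin, t_xmax, t_ctid, t_infomask, t_hoff
FROM heap_page_items(get_raw_page('sensor_readings', 0));
-- Every live tuple on an append-only table shows t_xmax = 0 and a t_ctid
-- that points back at its own position.
```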
Add alignment padding and a 4-byte ItemIdData pointer per tuple, and the true per-row overhead is closer to 28 to 30 bytes before your row data starts.
Let's make that concrete. At 50K inserts per second, that's 1.4 to 1.5 MB/sec of pure header overhead. Per year: roughly 45 TB of header data for a workload that never updates a row.
That's not a rounding error.
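The arithmetic is easy to reproduce in psql, using 29 bytes as the midpoint of that 28-to-30-byte range:

```sql
SELECT pg_size_pretty((50000 * 29)::bigint)              AS overhead_per_second,
       pg_size_pretty(50000::bigint * 29 * 86400 * 365)  AS overhead_per_year;
--  overhead_per_second | overhead_per_year
-- ---------------------+-------------------
--  1416 kB             | 42 TB
-- (pg_size_pretty uses binary units; 42 TiB is roughly 45 TB decimal)
```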

What autovacuum is actually doing on your append-only table
Here's what's going to wrinkle your brain.
You think, "Autovacuum cleans up dead tuples from updates and deletes. Append-only tables don't update or delete rows. Therefore, autovacuum shouldn't have much to do."
That intuition is wrong in three specific ways.
Aborted transactions leave dead tuples. Not every INSERT commits. Connection drops, application errors, explicit rollbacks. These all leave tuple versions that need cleaning. If you're running high insert rates, you've got a steady trickle of aborted transactions even in perfectly healthy systems.
Hint bits require page dirtying. When a row is first read after being written, Postgres needs to check pg_xact to confirm the writing transaction committed. Once confirmed, it sets a hint bit in t_infomask to cache that result. Setting the hint bit dirties the page, which means writing it back to disk. On an append-only table with high read rates, hint bit setting is continuous background I/O on pages that will never change in any meaningful way. Welcome to your new normal.
Since PostgreSQL 13, insert volume alone triggers autovacuum. Not just dead tuples. Postgres needs to periodically freeze old transaction IDs to prevent XID wraparound, which is a hard limit built into the 32-bit transaction counter. At high insert rates, autovacuum fires continuously just to freeze tuples on tables with zero updates.
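You can confirm how the insert-driven trigger is configured on your instance; these are the stock parameters with their defaults:

```sql
-- Insert-driven autovacuum (PostgreSQL 13+): a vacuum fires once inserts
-- since the last one exceed threshold + scale_factor * table size.
SHOW autovacuum_vacuum_insert_threshold;     -- default: 1000
SHOW autovacuum_vacuum_insert_scale_factor;  -- default: 0.2
```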
Go check autovacuum_count and vacuum_count on your busiest append-only partition. They're climbing whether or not n_dead_tup is.
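A query along these lines will show it, using nothing but the stock pg_stat_user_tables view:

```sql
SELECT relname, n_tup_ins, n_dead_tup,
       vacuum_count, autovacuum_count, last_autovacuum
FROM pg_stat_user_tables
ORDER BY autovacuum_count DESC
LIMIT 10;
-- A high autovacuum_count next to a near-zero n_dead_tup is the signature
-- of insert-triggered freezing, not dead-tuple cleanup.
```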
The result: autovacuum workers show up in pg_stat_activity at all hours on tables that never see a single UPDATE. You tune autovacuum_vacuum_scale_factor and autovacuum_max_workers, and it helps at the margin. But what you're tuning is how the cleanup process competes with writes. Not why it needs to run at all.
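For reference, those mitigations look like this; the table name is a placeholder:

```sql
-- Trigger vacuum earlier on this table so each pass has less to do.
ALTER TABLE sensor_readings SET (
  autovacuum_vacuum_scale_factor        = 0.02,
  autovacuum_vacuum_insert_scale_factor = 0.02
);

-- Worker count is instance-wide (historically needs a restart to change).
ALTER SYSTEM SET autovacuum_max_workers = 6;
```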
The write amplification chain
Now let's connect all of this into the full cost picture.
A single 1 KB sensor reading doesn't write 1 KB. Here's what actually hits disk:
- The heap tuple: 1 KB of row data plus the 23-byte header, alignment padding, and the 4-byte line pointer
- The WAL record: the row logged again with its own record headers, and after each checkpoint, the first touch of a page logs a full 8 KB page image
- The hint-bit write: the first read after commit dirties the page, which eventually gets flushed back to disk
- The freeze pass: autovacuum reads and rewrites the page again to freeze old transaction IDs

Total actual I/O: 2.5 to 3.5 KB for 1 KB of logical data.
The MVCC header is the entry point for this entire chain. It's what requires the visibility tracking, the hint bit mechanism, the autovacuum sweep, and the WAL record structure that Postgres uses.
At 100K inserts per second, you're writing 250 to 350 MB/sec of actual I/O for 100 MB/sec of application data. The 3 to 5x write amplification ratio isn't configuration. It's the cost of MVCC applied to data that will never be updated.
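If you want to measure the WAL side of this on your own system, PostgreSQL 14 and later expose it directly: reset the counters, run your ingest for a fixed interval, and compare wal_bytes against the logical bytes you wrote.

```sql
SELECT pg_stat_reset_shared('wal');  -- zero the WAL counters (PG 14+)

-- ...run the ingest workload for a fixed interval, then:

SELECT wal_records,
       wal_fpi,                       -- full-page images
       pg_size_pretty(wal_bytes) AS wal_bytes
FROM pg_stat_wal;
```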
Why you can't opt out
There's no per-table setting to disable MVCC. No append_only = true flag that strips the header and skips the visibility machinery. MVCC is not a feature you can turn off for specific tables. It's the storage model. Every heap tuple gets the header. Every insert goes through the same write path.
This isn't an oversight. It's an architectural decision with a clear rationale: the storage engine doesn't know at write time what future transactions will need to see. The consistency guarantee requires the mechanism to be universal.
For most workloads, this is the right tradeoff. The overhead is small relative to the value of the concurrency guarantee, and mixed read/write workloads on shared rows are exactly what Postgres is built for.
The overhead only becomes the dominant cost when the workload is append-only at high sustained rates. That's when you're paying the full price for a guarantee you never exercise.
What changes when the storage model changes
TimescaleDB's columnar storage (the Columnstore layer) addresses this at the architecture level, not the configuration level. Rather than writing one heap tuple per row, it batches up to 1,000 row versions per column into compressed arrays before writing to disk. The MVCC header overhead gets amortized across the batch. One write operation covers what would have been 1,000 individual heap tuple insertions.
The practical results: write amplification drops from 3 to 5x to near 1:1 for sustained append workloads. Autovacuum pressure drops proportionally because there's far less row-level churn to clean. WAL volume at 100K inserts/sec falls from 50 to 100 MB/sec to roughly 5 to 15 MB/sec. Replicas that previously fell behind during write peaks can keep up.
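The setup is a few SQL calls. A sketch, with placeholder table and column names (the exact option names vary a bit across TimescaleDB versions):

```sql
-- Turn the table into a hypertable partitioned on its time column.
SELECT create_hypertable('sensor_readings', 'ts');

-- Enable columnar compression, segmenting batches by device.
ALTER TABLE sensor_readings SET (
  timescaledb.compress,
  timescaledb.compress_segmentby = 'device_id'
);

-- Compress chunks automatically once they're a week old.
SELECT add_compression_policy('sensor_readings', INTERVAL '7 days');
```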
Everything else stays the same. Same SQL. Same wire protocol. Same extensions. Same tooling. The change is underneath, at the layer where MVCC overhead was accumulating.
The bottom line
MVCC is not a bug in Postgres. It's one of the reasons Postgres is the right choice for the majority of production workloads.
But if most of your rows are immutable after the insert commits, if your tables never see concurrent updates to the same rows, if autovacuum is running constantly on data you've never touched, you're running an append-only workload inside a concurrency model built for something else.
That's not misconfiguration. It's an architectural mismatch. The distinction matters because misconfiguration has a config fix. Architectural mismatch doesn't.
If high-frequency append-only ingestion describes what you're running, the full essay on the Optimization Treadmill covers what this costs across your entire stack, and what the path forward looks like.
