
May 05, 2026
Table of contents
01 What autovacuum is actually for
02 Three reasons autovacuum runs on append-only tables
03 What tuning actually does
04 The operational cost people don't account for
05 Why append-only data shouldn't work this way
06 What changes when the storage model matches the workload
07 Conclusion

You open pg_stat_activity during a write peak and there it is. Autovacuum. Again.
You've tuned this three times. You wrote a runbook for it. At this point it has its own section in the quarterly database review.
The table it's working on hasn't seen a single UPDATE in six months. Every write is an INSERT. Old data gets dropped by partition, not deleted row by row. By every reasonable intuition about what autovacuum is for, this table should basically run itself.
But there it is.
The intuition isn't wrong. It's incomplete. Autovacuum exists to clean up after concurrent row modifications, and your workload doesn't do that. What your workload does generate is a steady stream of other work that lands in the same process: dead tuples from aborted transactions, hint bits that need setting, transaction IDs that need freezing before the counter wraps. None of it is the problem autovacuum was designed to solve. All of it runs through the same mechanism.
This post isn't about how to tune autovacuum. It's about understanding what it's actually doing on your tables, why tuning helps at the margin but not at the root, and what it means that a process built for row modification cleanup is your most persistent background worker on a table that never modifies rows.
What autovacuum is actually for

Postgres MVCC keeps old row versions alive as long as any active transaction might need to see them. When a row gets updated, the old version stays on the heap page, marked dead. When a row gets deleted, same thing. These dead tuples accumulate until something cleans them up. That something is autovacuum.
Without it, dead tuples pile up permanently. Table bloat grows without bound. Heap pages that hold dead tuples can't be reused. Query performance degrades as scans trip over dead rows.
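And then there's the harder problem: XID wraparound. Transaction IDs are 32-bit counters, and they are compared in a circle, so only about 2 billion of them can count as "the past" at any moment. Once a row's XID falls more than about 2 billion transactions behind, Postgres can no longer distinguish old from new. The row would look like it came from the future and become invisible. This isn't theoretical. It has happened in production. Modern Postgres stops accepting new write transactions before it lets that happen, which is its own kind of outage. Autovacuum's freeze pass exists to prevent all of this by marking old tuples as frozen before the counter laps them.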
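To make the wraparound arithmetic concrete, here is a minimal Python sketch of circular XID comparison. It follows the signed-difference-modulo-2^32 rule Postgres applies to normal transaction IDs, but it is a sketch, not the server's code: the function name `xid_precedes` is illustrative, and the handful of reserved special XIDs are ignored.

```python
XID_SPACE = 2**32    # transaction IDs are 32-bit
HALF_SPACE = 2**31   # about 2 billion: how far back "the past" reaches


def xid_precedes(a: int, b: int) -> bool:
    """Return True if XID a is logically older than XID b.

    The difference is taken modulo 2^32 and read as a signed 32-bit number.
    Reserved special XIDs are ignored in this sketch.
    """
    diff = (a - b) % XID_SPACE
    return diff >= HALF_SPACE  # negative when read as signed 32-bit


row_xid = 100
just_inside = (row_xid + HALF_SPACE - 1) % XID_SPACE
just_past = (row_xid + HALF_SPACE + 1) % XID_SPACE

# While the counter is within the horizon, the row still looks old.
print(xid_precedes(row_xid, just_inside))
# One step past the horizon, the same row no longer looks old.
print(xid_precedes(row_xid, just_past))
```

Freezing sidesteps the comparison entirely: a frozen tuple is treated as older than every transaction, no matter where the counter is.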
For a standard OLTP workload with concurrent reads and updates on shared rows, this is essential infrastructure. Dead tuple accumulation is a direct consequence of normal operation. The cleanup cost is proportional to the update and delete rate. Makes sense.
The confusing part is what happens when your workload never updates or deletes anything.
Three reasons autovacuum runs on append-only tables

Each of these is a different mechanism. None of them are the same problem.
Aborted transactions leave dead tuples. Not every INSERT commits. Connections drop mid-transaction. Application errors trigger rollbacks. Explicit transaction management has bugs. At high insert rates, even a small abort rate produces a steady trickle of dead tuples. A 0.1% abort rate at 50,000 single-row inserts per second is 50 dead tuples per second. Autovacuum has to find and remove them, even though those rows were never part of a committed write.
This is real dead tuple work. Just not from updates. You can see it directly in n_dead_tup in pg_stat_user_tables. Tuning autovacuum to run more aggressively cleans it up faster. There's no way to eliminate it without eliminating aborted transactions, which isn't realistic.
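The abort arithmetic is easy to rerun with your own numbers. The sketch below assumes one row per INSERT and one dead tuple per aborted INSERT, which is a simplification. The function name is illustrative. The query is held in a string only so the whole example stays in one language, and its column names come from pg_stat_user_tables.

```python
SECONDS_PER_DAY = 86_400


def dead_tuples_per_second(inserts_per_second: float, abort_rate: float) -> float:
    """Dead tuples produced per second.

    Assumes one row per INSERT and one dead tuple per aborted INSERT.
    """
    return inserts_per_second * abort_rate


per_second = dead_tuples_per_second(50_000, 0.001)
per_day = per_second * SECONDS_PER_DAY

# Run this in psql to watch the counter that the trickle feeds.
DEAD_TUPLE_QUERY = """
SELECT relname, n_dead_tup, n_live_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
"""

print(f"{per_second:.0f} dead tuples/s, {per_day:,.0f} per day")
```

At the rate from the example above, the trickle adds up to roughly 4.3 million dead tuples a day.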
Hint bits require page dirtying. This one surprises most people who haven't dug deep into Postgres internals. When a row is first read after being written, Postgres doesn't just hand you the data. It verifies the writing transaction committed. It checks pg_xact. Once confirmed, it sets a hint bit in t_infomask to cache that result so future reads don't have to hit pg_xact again.
Setting that hint bit modifies the tuple header. A modified tuple header dirties the page. A dirty page needs writing back to disk.
So: your append-only table with immutable rows is generating I/O from reads. Not writes. Reads. The rows don't change. The headers do. This affects checkpoint pressure and the overall I/O budget autovacuum has to compete for.
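A toy model makes the read-dirties-page effect visible. This is not Postgres source. The classes `HeapTuple`, `Page`, and `ToyHeap` are invented for illustration, and a Python set stands in for pg_xact. The flag value matches the HEAP_XMIN_COMMITTED bit in t_infomask.

```python
from dataclasses import dataclass

HEAP_XMIN_COMMITTED = 0x0100  # infomask flag: inserting transaction committed


@dataclass
class HeapTuple:
    xmin: int          # XID of the inserting transaction
    infomask: int = 0  # no hint bits set when first written


@dataclass
class Page:
    tuples: list
    dirty: bool = False


@dataclass
class ToyHeap:
    committed_xids: set    # stand-in for pg_xact
    xact_lookups: int = 0  # how often the commit log was consulted

    def read(self, page: Page) -> int:
        """Count visible tuples, setting hint bits as a side effect."""
        visible = 0
        for tup in page.tuples:
            if tup.infomask & HEAP_XMIN_COMMITTED:
                visible += 1        # cached answer, no lookup needed
                continue
            self.xact_lookups += 1  # first read has to check the commit log
            if tup.xmin in self.committed_xids:
                tup.infomask |= HEAP_XMIN_COMMITTED
                page.dirty = True   # header changed, so the page must be written back
                visible += 1
        return visible


heap = ToyHeap(committed_xids={7})
page = Page(tuples=[HeapTuple(xmin=7) for _ in range(3)])

heap.read(page)
print("dirty after first read:", page.dirty)

page.dirty = False  # pretend a checkpoint flushed the page
heap.read(page)
print("dirty after second read:", page.dirty, "lookups:", heap.xact_lookups)
```

The first read pays for three commit-log lookups and a page write. The second read pays for neither. On an append-only table, every freshly inserted page goes through that first read once.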
Insert volume alone triggers autovacuum for freezing. Since PostgreSQL 13, autovacuum_vacuum_insert_threshold and autovacuum_vacuum_insert_scale_factor control a separate trigger: autovacuum fires based on insert count, not just dead tuple count. The reason is XID wraparound prevention. At high insert rates, unfrozen tuples accumulate fast. Postgres needs to freeze them before the counter laps. So autovacuum runs a freeze pass continuously on high-insert tables, regardless of whether any rows have been updated or deleted.
Go check autovacuum_count in pg_stat_user_tables on your busiest append-only partition. It's climbing. (vacuum_count sits next to it, but it only counts manual VACUUM runs.) n_dead_tup might be low. The freeze passes are happening anyway. This BPFtrace walkthrough shows exactly which passes run and when.
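The insert trigger is a simple formula, and it is worth plugging your own partition sizes into it. The sketch below uses the documented PostgreSQL 13 form (threshold plus scale factor times the table's estimated row count) with the default values of 1000 and 0.2. Later releases may refine the calculation, so treat it as an estimate. The function names and the 100-million-row partition are illustrative.

```python
def insert_vacuum_threshold(reltuples: float,
                            insert_threshold: int = 1000,
                            insert_scale_factor: float = 0.2) -> float:
    """Inserts since the last vacuum that trigger an insert-driven autovacuum."""
    return insert_threshold + insert_scale_factor * reltuples


def seconds_between_insert_vacuums(reltuples: float,
                                   inserts_per_second: float) -> float:
    """Rough interval between triggers, treating reltuples as a fixed snapshot."""
    return insert_vacuum_threshold(reltuples) / inserts_per_second


# Run this in psql; n_ins_since_vacuum is available from PostgreSQL 13.
INSERT_TRIGGER_QUERY = """
SELECT relname, n_ins_since_vacuum, autovacuum_count, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_ins_since_vacuum DESC
LIMIT 10;
"""

rows = 100_000_000
print(f"threshold: {insert_vacuum_threshold(rows):,.0f} inserts")
print(f"interval:  {seconds_between_insert_vacuums(rows, 50_000):,.0f} s")
```

With default settings and 50,000 inserts per second, a 100-million-row partition crosses the threshold about every 400 seconds.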
What tuning actually does

Most autovacuum tuning falls into two categories: make it run more aggressively, or make it yield more to writes.
Running it more aggressively means lowering autovacuum_vacuum_scale_factor, increasing autovacuum_max_workers, reducing autovacuum_naptime. Dead tuples get cleaned before they affect query plans. Freeze passes complete before XID pressure builds. Real improvement.
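To see what lowering autovacuum_vacuum_scale_factor buys, here is the dead-tuple trigger as a small Python function. It uses the documented formula (autovacuum_vacuum_threshold plus scale factor times estimated rows) with the defaults of 50 and 0.2. The table size and the 0.01 setting are illustrative, not recommendations.

```python
def dead_tuple_vacuum_threshold(reltuples: float,
                                vacuum_threshold: int = 50,
                                vacuum_scale_factor: float = 0.2) -> float:
    """Dead tuples that must accumulate before autovacuum triggers."""
    return vacuum_threshold + vacuum_scale_factor * reltuples


rows = 100_000_000
default = dead_tuple_vacuum_threshold(rows)
aggressive = dead_tuple_vacuum_threshold(rows, vacuum_scale_factor=0.01)

print(f"default scale factor 0.2: {default:,.0f} dead tuples")
print(f"scale factor 0.01:        {aggressive:,.0f} dead tuples")
```

On a 100-million-row table the trigger point moves from about 20 million dead tuples to about 1 million. Vacuum starts sooner and each pass has less to clean.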
Yielding more to writes means increasing autovacuum_vacuum_cost_delay, lowering autovacuum_vacuum_cost_limit. Write latency stabilizes. The tradeoff is that vacuum falls further behind and bloat accumulates more between cycles.
Here's what neither category does: reduce the amount of work autovacuum needs to do. Every configuration choice is about how the work gets distributed across time and how aggressively it competes with your actual workload. You're tuning the scheduler. Not the workload.
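A back-of-the-envelope model shows why. Under cost-based delay, vacuum accrues cost per page and sleeps each time the accrued cost reaches the limit. The sketch below computes an upper bound on pages per second and ignores the time spent doing the work, so real throughput is lower. The default values used (cost limit 200, delay 2 ms, cost 20 per dirtied page) are the documented defaults in recent PostgreSQL releases as far as I know; check your version. The function names and the one-million-page table are illustrative.

```python
def max_pages_per_second(cost_limit: int,
                         cost_delay_ms: float,
                         cost_per_page: int) -> float:
    """Upper bound on pages vacuum can process per second when throttled."""
    if cost_delay_ms <= 0:
        return float("inf")  # a delay of zero disables throttling
    pages_per_round = cost_limit / cost_per_page
    rounds_per_second = 1000.0 / cost_delay_ms
    return pages_per_round * rounds_per_second


def seconds_to_process(pages: int, pages_per_second: float) -> float:
    """Time to get through a fixed amount of work at a given rate."""
    return pages / pages_per_second


pages_to_dirty = 1_000_000  # the work itself; no setting changes this number

default_rate = max_pages_per_second(200, 2.0, 20)
gentler_rate = max_pages_per_second(200, 4.0, 20)

print(f"default: {default_rate:,.0f} pages/s, "
      f"{seconds_to_process(pages_to_dirty, default_rate):,.0f} s")
print(f"gentler: {gentler_rate:,.0f} pages/s, "
      f"{seconds_to_process(pages_to_dirty, gentler_rate):,.0f} s")
```

Doubling the delay halves the rate and doubles the duration. The million pages stay a million pages.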
Per-table overrides are the right implementation of this: more aggressive settings on active partitions, letting older ones vacuum on a slower cycle. Good practice. It's still adjusting the tax rate, not the taxable activity.
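In practice the overrides end up generated by a script. Here is a minimal sketch of one. The partition names and the setting values are hypothetical, and the values are not recommendations. The storage parameter names are real per-table autovacuum settings, and the generator refuses anything outside that list.

```python
ALLOWED_SETTINGS = {
    "autovacuum_vacuum_scale_factor",
    "autovacuum_vacuum_threshold",
    "autovacuum_vacuum_insert_scale_factor",
    "autovacuum_vacuum_insert_threshold",
    "autovacuum_vacuum_cost_delay",
    "autovacuum_vacuum_cost_limit",
}


def quote_ident(name: str) -> str:
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def autovacuum_override_sql(table: str, settings: dict) -> str:
    """Build one ALTER TABLE statement that sets per-table autovacuum options."""
    parts = []
    for key in sorted(settings):
        value = settings[key]
        if key not in ALLOWED_SETTINGS:
            raise ValueError(f"unsupported setting: {key}")
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"{key} must be numeric")
        parts.append(f"{key} = {value}")
    return f"ALTER TABLE {quote_ident(table)} SET ({', '.join(parts)});"


# Hypothetical policy: the newest partition is vacuumed eagerly, older ones slowly.
ACTIVE = {"autovacuum_vacuum_insert_scale_factor": 0.01,
          "autovacuum_vacuum_cost_delay": 0}
COLD = {"autovacuum_vacuum_insert_scale_factor": 0.5}

partitions = ["events_2026_03", "events_2026_04", "events_2026_05"]
for name in partitions:
    policy = ACTIVE if name == partitions[-1] else COLD
    print(autovacuum_override_sql(name, policy))
```

The whitelist is the design choice worth keeping: a typo in a storage parameter name fails in the script instead of in a migration.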
The operational cost people don't account for

Autovacuum tuning isn't free engineering time.
Someone has to write the per-table ALTER statements. Someone has to monitor whether the settings are actually working. At 500 partitions, "monitor autovacuum lag" is a real job that runs on a recurring schedule. New partitions inherit defaults unless automation creates them with the right settings. That automation needs maintaining.
The monitoring surface for autovacuum lag spans three separate system views and requires correlating timestamps across them. It’s not a dashboard that lights up. It’s a debugging session.
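As one hedged example of that correlation work, the sketch below joins pg_stat_user_tables to pg_class to put the last autovacuum time next to the table's XID age, then flags rows in Python. Which sources you correlate is a choice, and this is only one combination. The thresholds, the flag names, and the function `vacuum_lag_flags` are assumptions for illustration. The 200 million figure is the default autovacuum_freeze_max_age.

```python
from datetime import datetime, timedelta, timezone

# Run this in psql or through your driver; each result row maps to a dict below.
LAG_QUERY = """
SELECT s.relname,
       s.last_autovacuum,
       s.n_ins_since_vacuum,
       age(c.relfrozenxid) AS xid_age
FROM pg_stat_user_tables AS s
JOIN pg_class AS c ON c.oid = s.relid
ORDER BY xid_age DESC;
"""


def vacuum_lag_flags(row: dict,
                     now: datetime,
                     freeze_max_age: int = 200_000_000,
                     xid_warn_fraction: float = 0.5,
                     max_quiet: timedelta = timedelta(hours=6)) -> list:
    """Return warning flags for one table; thresholds are illustrative."""
    flags = []
    if row["xid_age"] >= xid_warn_fraction * freeze_max_age:
        flags.append("xid_age_high")
    last = row["last_autovacuum"]
    if last is None or now - last > max_quiet:
        flags.append("no_recent_autovacuum")
    return flags


now = datetime(2026, 5, 5, 12, 0, tzinfo=timezone.utc)
sample = {"relname": "events_2026_05",   # hypothetical partition
          "last_autovacuum": now - timedelta(hours=9),
          "n_ins_since_vacuum": 42_000_000,
          "xid_age": 150_000_000}
print(sample["relname"], vacuum_lag_flags(sample, now))
```

Multiply that by 500 partitions and a schedule, and it becomes clear why this tends to turn into a recurring job.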
When autovacuum falls behind, the symptom usually isn't an autovacuum alert. It's query latency regression. Write performance degradation. The connection between autovacuum backlog and query performance is real but indirect. Connecting the two is senior engineer work. Hours, not a quick look at a graph. Sigh.
If you've tracked where your team actually spends time on database operations, a meaningful slice of it is this: watching autovacuum, tuning autovacuum, debugging incidents that turn out to be autovacuum-adjacent, and writing runbooks for autovacuum behavior on new partition types.
All of that cost is real. None of it is the kind of cost an append-only workload should be paying.
Why append-only data shouldn't work this way

An append-only system has a simple storage contract. Data arrives. It gets written. It ages. Eventually it gets dropped, wholesale, by time window.
Nothing is ever modified in place. There is no concurrent row modification to manage. Reads never block writes and writes never block reads because there's no contention to prevent. The MVCC guarantee is irrelevant to the workload.
Postgres MVCC is the right model for workloads that need that concurrency guarantee. For workloads that don't, it's overhead. Autovacuum running continuously on your append-only table isn't Postgres failing. It's Postgres correctly maintaining the MVCC invariants on a workload that doesn't benefit from them.
The cost is real. The benefit isn't.
Tuning autovacuum makes the cost more manageable. It doesn't make the benefit appear.
What changes when the storage model matches the workload

Hypercore is built for exactly that contract. It abandons the per-tuple MVCC model for compressed column segments. Up to 1,000 rows get batched into a single compressed segment before writing. Dead tuples in the traditional sense don't accumulate because the storage model doesn't create them.
When rows land in a Hypercore segment, transaction visibility is tracked at the segment level, not the tuple level. There is no per-tuple t_xmin to freeze. No hint bit to set on first read. The three mechanisms that drive autovacuum on your append-only heap table have nothing to work with here because the storage model does not create the conditions they require.
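A toy model shows the shape of the difference. This is not Hypercore's on-disk format or API. The `Segment` class and `batch_into_segments` function are invented for illustration, compression is left out, and the only thing borrowed from the description above is the batch size of up to 1,000 rows with one visibility record per segment.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    columns: dict        # column name -> list of values for the whole batch
    row_count: int
    created_by_xid: int  # one visibility record per segment, not per row


def batch_into_segments(rows: list, xid: int, batch_size: int = 1000) -> list:
    """Group row dicts into column-oriented segments of up to batch_size rows."""
    segments = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        columns = {name: [row[name] for row in chunk] for name in chunk[0]}
        segments.append(Segment(columns, len(chunk), xid))
    return segments


rows = [{"ts": i, "device_id": i % 4, "value": i * 0.5} for i in range(2500)]
segments = batch_into_segments(rows, xid=812)

print("row-level headers in a heap table:", len(rows))
print("segment-level visibility records: ", len(segments))
print("rows per segment:", [s.row_count for s in segments])
```

In this toy version, 2,500 rows that would each carry their own header on the heap collapse into three segments, each with a single visibility record.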
The result: with those mechanisms idle, autovacuum pressure drops accordingly. Postgres doesn't disappear. There's still housekeeping work. But the continuous autovacuum activity on high-insert partitions goes away. So does the I/O competition during write peaks, the tuning overhead, the monitoring surface for vacuum lag. The conditions that produced all of it no longer exist.
autovacuum_count stops climbing on tables that nobody updates. pg_stat_activity stops showing vacuum workers at 3am. The runbook section on autovacuum tuning gets shorter.
Same SQL. Same wire protocol. Different storage contract underneath, matched to the workload that's actually running.
Conclusion

Autovacuum isn't broken. It isn't misconfigured. It's working correctly given what it's been handed.
The problem is that "working correctly" on an append-only high-insert table means running constantly, competing with writes, requiring ongoing tuning, and consuming real engineering time. All to manage overhead that a purpose-built append system wouldn't generate in the first place.
That's the tax. The settings determine how you pay it. The workload determines that you owe it.
If autovacuum is in your top processes by CPU and I/O on a table that nobody updates, that's not a Postgres problem to fix. It's a signal about the relationship between your workload and your storage model. If your write pattern is append-only, the question worth asking is not how to tune autovacuum better. It is whether your storage model is matched to your workload at all.
Understanding Postgres Performance Limits for Analytics on Live Data goes deeper on where that mismatch shows up across the system, and what the path forward looks like at different data volumes.
