---
title: "Write Amplification in Postgres: The 3-4x Tax on Every Insert"
published: 2026-04-20T17:31:25.000-04:00
updated: 2026-04-20T17:31:25.000-04:00
excerpt: "Every 1 KB insert in Postgres becomes ~2.5 KB of committed I/O before it's done. Here's where the multiplier comes from, and where the tuning knobs run out."
tags: PostgreSQL Performance
authors: Matty Stratton
---

You're inserting 100 MB/sec of sensor data into Postgres. Your monitoring dashboard, meanwhile, is cheerfully reporting 300-400 MB/sec of disk writes. You check the math twice. You check it a third time. The math does not improve.

Welcome to write amplification.

This is the ratio between the logical data you _intended_ to write and the physical I/O Postgres actually performs. For a typical insert-heavy workload with a handful of indexes, that ratio lands somewhere between 3x and 5x in steady state, with spikes well beyond that after checkpoints. It's not a bug, and it's not a misconfiguration. It's the structural cost of the guarantees Postgres gives you: crash safety, transactional consistency, and index maintenance, applied to every single row you write.

If you're running concurrent updates across shared rows with real transactional contention, those guarantees are paying for themselves. Good. If you're running an append-only telemetry workload at 50K+ inserts per second, some of that multiplier is paying for guarantees you're never exercising. Also good to know.

Let's trace where the multiplier actually comes from.

## The anatomy of a single INSERT

Pick one row. Say it's 1 KB of sensor data: a timestamp, a device ID, a few numeric measurements. Follow it through the write path and count bytes as you go.

**Step 1: The heap tuple.** Postgres wraps your 1 KB of row data in a 23-byte tuple header (`xmin`, `xmax`, `cid`, `ctid`, the infomask fields), plus whatever alignment padding the types demand. The tuple lands inside an 8 KB heap page. If the current page has room, it's an in-place append. If not, Postgres allocates a new page. Either way, the page gets flushed as a full 8 KB block eventually, so your "1 KB" row is actually a ~1,050-byte write into an 8 KB block.

**Step 2: The index entries.** Every index on the table gets its own write. A B-tree entry is typically 40-80 bytes (the indexed value plus the tuple identifier plus node overhead), landing in its own 8 KB index page. Five indexes on the table means five separate index writes, adding 200-400 bytes of index data, each into its own 8 KB page. More indexes mean more writes, and as the next step shows, the cost compounds in the WAL too.

**Step 3: WAL.** Before any of the above touches the actual data files, Postgres writes a [WAL record](https://www.postgresql.org/docs/current/wal-internals.html) for the heap insert and one for _each_ index modification. WAL records carry the full tuple data plus metadata. One row plus five indexes comes out to roughly 1.2-1.5 KB of WAL.

That's the steady-state picture for a single row: a ~1 KB payload produces ~2.5 KB of committed I/O across heap pages, index pages, and WAL. Add a spike pattern after every checkpoint, which we'll get to in the next section, and the multiplier climbs from there.
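The three steps above can be tallied in a few lines. This is back-of-envelope arithmetic using the article's own estimates; the 60-byte index entry is a midpoint guess, and the WAL figure is taken directly from the 1.2-1.5 KB range above rather than derived:

```python
ROW_PAYLOAD = 1024      # logical row data, bytes
TUPLE_HEADER = 23       # heap tuple header, before alignment padding
INDEX_ENTRY = 60        # midpoint of the 40-80 byte B-tree entry estimate
NUM_INDEXES = 5

heap_bytes = ROW_PAYLOAD + TUPLE_HEADER        # ~1,047 B into a heap page
index_bytes = NUM_INDEXES * INDEX_ENTRY        # ~300 B across five index pages
wal_lo, wal_hi = 1200, 1500                    # heap record + five index records

total_lo = heap_bytes + index_bytes + wal_lo
total_hi = heap_bytes + index_bytes + wal_hi
print(f"total ~{total_lo / 1024:.1f}-{total_hi / 1024:.1f} KB per 1 KB row "
      f"({total_lo / ROW_PAYLOAD:.1f}-{total_hi / ROW_PAYLOAD:.1f}x, before full-page writes)")
```

The low end of that range is the ~2.5 KB figure quoted above; the high end is what you see when WAL records run heavier.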

## Scale it up: where the structural tax actually bites

One row is educational. Production workloads don't write one row. Let's run the same math at 50K inserts/sec with a 1 KB average row size, five indexes, and a typical checkpoint interval.

-   Logical data rate: **50 MB/sec**
-   Heap writes (with headers): ~52 MB/sec
-   Index maintenance (5 indexes): ~15-20 MB/sec
-   WAL generation (heap + index records): ~60-75 MB/sec
-   Full-page writes, amortized across the checkpoint interval: ~20-40 MB/sec

Total sustained disk write throughput: **150-190 MB/sec for 50 MB/sec of application data**. That's a 3-3.8x multiplier, and the dashboard is not lying to you.
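The totals aren't mysterious; they're the four line items summed. Plugging the list's own ranges in (rates in MB/sec) lands within rounding of the 150-190 MB/sec figure:

```python
logical = 50.0                      # 50K rows/sec x ~1 KB
heap = 52.0                         # heap pages, tuple headers included
index_lo, index_hi = 15.0, 20.0     # five B-tree indexes
wal_lo, wal_hi = 60.0, 75.0         # heap + index WAL records
fpw_lo, fpw_hi = 20.0, 40.0         # full-page writes, amortized

total_lo = heap + index_lo + wal_lo + fpw_lo
total_hi = heap + index_hi + wal_hi + fpw_hi
print(f"physical: {total_lo:.0f}-{total_hi:.0f} MB/s for {logical:.0f} MB/s logical "
      f"-> {total_lo / logical:.1f}-{total_hi / logical:.1f}x")
```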

Two of those line items are doing most of the damage, and both of them are structural.

The first is the WAL double-write. Every modification gets durably written to the log _before_ the corresponding change lands in the heap or index pages. That means every byte you insert is written at least twice: once to WAL, once to the data file. That's a structural 2x, and no tuning knob removes it. It's the entire reason Postgres survives a power cut.

The second is full-page writes. After every checkpoint, the first modification to any page triggers a write of the _entire 8 KB page_ into WAL, not just the bytes that changed. This protects against torn pages during recovery, where a crash mid-write could leave a page in an inconsistent state that WAL records alone can't reconstruct. Worst case, right after a checkpoint, a single 1 KB insert with five indexes can trigger FPWs on the heap page and every index page it touched: 8 KB × 6 pages = **48 KB of WAL for a single row**. That's the spike that shows up in your throughput graphs right after checkpoints.
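The worst-case burst is straight multiplication: one heap page plus one page per index, each emitted into WAL as a full 8 KB image:

```python
PAGE = 8 * 1024          # Postgres block size, bytes
NUM_INDEXES = 5

pages_touched = 1 + NUM_INDEXES      # the heap page plus one page per index
fpw_wal_bytes = pages_touched * PAGE

print(f"{fpw_wal_bytes // 1024} KB of WAL for one 1 KB insert right after a checkpoint")
```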

For append-only workloads this gets extra obnoxious. Every insert touches a freshly allocated or recently allocated heap page. That page gets modified exactly once (the insert), then never again. But that single modification is, by definition, the first one after some checkpoint, so it still gets the full-page-write treatment, as if the page were hotly contended. You pay the protection cost for a page that needs none of it.

The biggest knob you can turn on the overall multiplier is index count. [Every additional index adds roughly 0.3-0.5x to the amplification ratio](https://www.tigerdata.com/blog/why-adding-more-indexes-eventually-makes-things-worse). A table with two indexes often sits around 2.5x. A table with eight indexes can push past 5x and keep going. "Just drop the indexes" is not real advice, though, because you almost always need a timestamp index for range queries and usually a device or source ID for filtering. Zero indexes means zero useful reads, which is a different kind of performance problem.
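That rule of thumb can be written as a linear model: the ~2x structural floor from the WAL double-write, plus 0.3-0.5x per index. The function and constants here are illustrative, not from any Postgres tooling, and real workloads won't track the line exactly:

```python
def estimated_amplification(num_indexes: int) -> tuple[float, float]:
    """Rule-of-thumb range: ~2x structural floor (WAL double-write)
    plus ~0.3-0.5x per index. Illustrative only."""
    return 2.0 + 0.3 * num_indexes, 2.0 + 0.5 * num_indexes

for n in (0, 2, 5, 8):
    lo, hi = estimated_amplification(n)
    print(f"{n} indexes -> ~{lo:.1f}-{hi:.1f}x")
```

Two indexes puts you in the mid-2x range; eight pushes the top of the range past 5x, consistent with the anchors above.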

## Finding the floor

Here's what the tuning knobs can actually do, and here's where they run out.

**Index strategy** is the highest-leverage lever. Drop indexes nobody queries. Use partial indexes when your query patterns filter on a specific condition. For time-ordered data that arrives roughly in timestamp order, consider [BRIN indexes](https://www.postgresql.org/docs/current/brin.html). BRIN stores summary info per block range rather than per row, which makes it dramatically smaller and cheaper to maintain than a B-tree. It only helps when the indexed column correlates with physical row order (timestamps on append-only tables, basically), but when it fits, it fits well. Every index you remove or swap for BRIN saves you roughly 0.3-0.5x on the amplification ratio.

**Checkpoint tuning.** Bump [checkpoint\_timeout](https://www.postgresql.org/docs/current/runtime-config-wal.html#GUC-CHECKPOINT-TIMEOUT) and [max\_wal\_size](https://www.postgresql.org/docs/current/runtime-config-wal.html#GUC-MAX-WAL-SIZE) to reduce FPW frequency. Typical improvement: 10-20% reduction in sustained write amplification. Costs: longer crash recovery, more WAL sitting on disk between checkpoints. If your RTO allows it, take the tradeoff.
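To see why a longer interval helps, model the pages that get re-dirtied every cycle (mostly index pages): each one emits at most one 8 KB full-page image per checkpoint, so that component of the WAL rate scales inversely with the interval. The 500K-page hot set below is a hypothetical figure for illustration:

```python
PAGE = 8 * 1024   # Postgres block size, bytes

def fpw_rate_mb_per_s(hot_pages: int, checkpoint_interval_s: float) -> float:
    """Amortized full-page-image WAL rate for pages re-dirtied every cycle."""
    return hot_pages * PAGE / checkpoint_interval_s / 1e6

HOT_PAGES = 500_000   # hypothetical ~4 GB set of repeatedly-modified pages
for interval in (300, 900, 1800):
    rate = fpw_rate_mb_per_s(HOT_PAGES, interval)
    print(f"checkpoint every {interval:>4} s -> ~{rate:.1f} MB/s of FPW WAL")
```

Note this only covers pages dirtied again and again across cycles. The append-only heap pages from the previous section pay their single FPW no matter how the interval is set, which is part of why the improvement tops out around 10-20%.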

[**fillfactor**](https://www.postgresql.org/docs/current/sql-createtable.html#RELOPTION-FILLFACTOR) is the subtle footgun. The default is 100, meaning pages fill completely. Lowering it reserves space in pages for in-place HOT updates, which is great for UPDATE-heavy OLTP workloads and actively harmful for append-only ones. If someone on your team set `fillfactor` to 80 on your telemetry table because they read a tuning blog, you're writing 20% more pages than you need to. Check it. Set it back to 100.

Now the floor. With aggressive tuning on a well-designed schema, you can get write amplification down to roughly 2.5x for a table with two indexes. You cannot get below 2x, because the WAL double-write is structural to crash safety. Anyone quoting you a lower number is either wrong or not measuring what they think they're measuring.

And that's the interesting part. The knobs run out before the amplification does. What's left past that floor isn't configuration. It's what each insert represents.

## The asymmetry

Look back at what the multiplier is actually paying for.

MVCC transaction headers exist so concurrent transactions can read consistent snapshots of rows being updated. WAL exists so any row can be safely modified across a crash boundary. Full-page writes exist so pages that get rewritten can survive a mid-write failure. Per-row B-tree entries exist so individual rows can be looked up, inserted, updated, or deleted in any order.

Every one of these mechanisms is designed for data that lives in flux.

For a telemetry workload, your data doesn't live in flux. A row arrives. It gets inserted. It's never updated. It's never individually deleted. Weeks or months later, an entire partition gets dropped, not a row. The MVCC header is transaction metadata on a transaction nobody will ever contest. The FPW is torn-page protection on a page nobody will ever rewrite. The per-row index entry is an individual lookup slot for a row that will only ever be read as part of a time-range scan.

And this isn't a Postgres quirk. Every row-oriented database with MVCC and a write-ahead log pays a version of this tax. The concurrency model and the durability model both assume your rows are hot, mutable, individually addressable. That assumption is correct for most transactional workloads. It's the whole point of the architecture. It's just not true for yours.

The write path was built for data that lives in flux. Your data doesn't.

So the question changes. What would it look like if the write path could eventually amortize these costs across batches of rows that never change?

Here's the answer. The insert-time path stays the same. There's nothing to change at the edge, because new rows really do need MVCC and WAL and index entries while they're still landing. What changes is what happens _after_ the data cools. Once a chunk of rows is no longer being actively written, the whole chunk gets batch-compressed into columnar storage. The per-row MVCC header collapses into per-chunk metadata. Per-row B-tree entries collapse into chunk-level exclusion. A thousand rows that would have cost a thousand tuple headers now share one.

That's [**Hypercore**](https://www.tigerdata.com/blog/hypercore-a-hybrid-row-storage-engine-for-real-time-analytics), the hybrid row/columnar storage engine inside TimescaleDB's **Hypertables**. Hypertables are still Postgres tables, so the insert edge pays the same amplification you just walked through. Nothing is different at the write boundary. What's different is the lifecycle. Recent data lives in row format, exactly as Postgres wants it. Older data transitions to columnar storage in the background, at which point the math flips.

Take that 1 KB row. In row format it costs ~1,050 bytes of heap tuple plus 23 bytes of MVCC header plus one B-tree entry per index. Stack 1,000 of those rows into a single compressed chunk, and the per-row MVCC overhead (23 KB of headers across the batch) collapses into a single chunk-level metadata entry. The 1 MB of row-format data becomes ~50-100 KB of compressed columnar storage, depending on the data's compressibility. Effective on-disk cost per row drops from ~1 KB-plus-overhead to roughly 50-100 bytes.
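The chunk arithmetic from the paragraph above, spelled out. The 50-100 KB compressed size is the article's compressibility-dependent estimate, not something you can compute without the actual data:

```python
ROWS = 1000
ROW_PAYLOAD = 1024        # bytes of logical data per row
TUPLE_HEADER = 23         # per-row MVCC header in row format

row_format_bytes = ROWS * (ROW_PAYLOAD + TUPLE_HEADER)
header_bytes = ROWS * TUPLE_HEADER                     # collapses to one chunk-level entry
compressed_lo, compressed_hi = 50 * 1024, 100 * 1024   # columnar, per the article

print(f"row format: ~{row_format_bytes / 1024:.0f} KB "
      f"({header_bytes / 1024:.1f} KB of it MVCC headers)")
print(f"compressed: 50-100 KB -> "
      f"{row_format_bytes / compressed_hi:.0f}-{row_format_bytes / compressed_lo:.0f}x smaller")
```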

That's a 10-20x reduction in the on-disk footprint, and the corresponding reduction in I/O shows up everywhere downstream. Reads scan a fraction of the pages. Autovacuum has a fraction of the tuples to consider. Backups move a fraction of the bytes. Most of your data lives in the compressed state, so most of your I/O budget lives there too.

The insert-time amplification is roughly unchanged. The lifecycle I/O is a different order of magnitude.

## The point

Write amplification in Postgres isn't a defect. It's the price of guarantees the database makes on every row: crash safety, transactional consistency, index correctness. For workloads where every row participates in all three, you want every byte of that multiplier, and the 3-4x is a bargain.

For append-only workloads running at sustained high rates, the guarantees and the workload don't match. You're paying for features most of your data never uses. Tuning can get you to about 2.5x. It can't get you past the structural floor, because past the structural floor isn't a config problem. It's a workload-fit problem.

The question isn't what to tune. It's whether the guarantees match the workload.

If they match, you already know what to adjust. If they don't, [the MVCC deep-dive](https://www.tigerdata.com/blog/mvcc-feature-youre-paying-for-but-not-using) covers what the per-row overhead is actually costing you, and [the Hypercore writeup](https://www.tigerdata.com/blog/hypercore-a-hybrid-row-storage-engine-for-real-time-analytics) walks through what the math looks like once the data cools.

Either way: check your dashboard. Count your indexes. Then check your `fillfactor`.