---
title: "Why Your PostgreSQL Partition Key Is Creating a Write Bottleneck"
published: 2026-06-19T07:23:00.000-04:00
updated: 2026-06-19T09:27:48.000-04:00
excerpt: "A skewed partition key can route 80% of your writes to one partition. Learn how to diagnose PostgreSQL partition hot spots and fix them at the schema level."
tags: PostgreSQL, Database
authors: NanoHertz Communications
---

> **TimescaleDB is now Tiger Data.**

At 50K inserts per second, a skewed partition key can concentrate 40K of those inserts onto a single partition. That one partition bottlenecks your entire write path while the other 199 sit idle.

When most teams add [partitioning](https://www.tigerdata.com/learn/when-to-consider-postgres-partitioning) to a high-volume table, they partition by whatever column they already filter on. For a SaaS product, that typically means `region_id`. For a logistics platform, it's `warehouse_id`. The schema looks reasonable, queries run fast in staging, and nobody gives it a second thought until p95 write latency starts climbing on a table that otherwise looks healthy.

By the end of this guide, you'll understand what causes partition write hot spots, how to run a diagnostic to confirm one, and two schema-level techniques to redistribute write load before the bottleneck breaks throughput. You'll leave with queries you can run against your largest partitioned table today.

If you've set up `PARTITION BY RANGE` on a production table, you have the context you need.

## Why a bad partition key hurts more than a missing index

Hot spots are invisible until the damage is done. A table with 200 partitions can look perfectly balanced in storage while routing 80% of its inserts into one. The signal that something is wrong isn't a clear error message. It's p95 write latency creeping from 8ms to 25ms over six weeks while every other metric looks fine.

If you've read [Understanding Postgres Performance Limits for Analytics on Live Data](https://www.tigerdata.com/blog/postgres-optimization-treadmill), you'll recognize this as the same compounding overhead described in the Optimization Treadmill, now concentrated into a single partition instead of distributed throughout the table. Partitioning was supposed to be the fix. A skewed partition key turns it into the problem.

## Step 1: Understand how the damage compounds

A partition key determines which partition receives each new row. When you partition by `region_id`, rows land in the partition matching that value. If 80% of your users map to us-east-1, 80% of your inserts go to the same partition. The other 199 partitions sit idle.

This is data gravity: writes accumulate wherever data density is highest, and that concentration deepens over time.

The damage isn't only about storage imbalance. Every insert into the hot partition updates every B-tree index on that partition. With five indexes, each insert performs one heap page write and five index leaf insertions. On a heavily loaded partition, those index pages absorb far more concurrent writes than they would with balanced distribution, forcing more frequent [page splits](https://www.tigerdata.com/blog/indexing-your-way-into-a-performance-bottleneck). Splits require additional I/O, [generate more WAL](https://www.tigerdata.com/blog/write-amplification-in-postgres-the-3-4x-tax-on-every-insert), and force autovacuum into a continuous catch-up cycle on the hot partition while the others go untouched.

Autovacuum runs per-partition and has no awareness that one partition receives ten times the write load of another. When the hot partition absorbs 40K inserts per second, [autovacuum fires constantly](https://www.tigerdata.com/blog/the-autovacuum-tax) just to keep up with hint-bit setting and tuple freezing, even though no `UPDATE` or `DELETE` ever touches those rows. You can tune `autovacuum_vacuum_scale_factor` and `autovacuum_vacuum_cost_delay` per partition, but tuning doesn't fix the concentration. It only speeds up cleanup for a problem that shouldn't exist in the first place.

## Step 2: Run the diagnostic

Before changing anything, confirm which partition is the bottleneck. This query surfaces outlier partitions by [dead tuple ratio](https://www.tigerdata.com/blog/preventing-silent-spiral-table-bloat):

```SQL
SELECT
    schemaname,
    relname AS partition_name,
    n_live_tup,
    n_dead_tup,
    ROUND(
        100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 2
    ) AS dead_ratio_pct,
    last_autovacuum
FROM pg_stat_user_tables
WHERE relname LIKE 'your_table%'
ORDER BY dead_ratio_pct DESC
LIMIT 10;
```

Run this during a normal write window, not a maintenance event. A partition with `dead_ratio_pct` ten times higher than its neighbors is your hot spot. Then cross-reference index bloat on the same partition:

```SQL
SELECT
    indexrelname AS index_name,
    relname AS partition_name,
    idx_scan,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE relname LIKE 'your_table%'
ORDER BY pg_relation_size(indexrelid) DESC
LIMIT 10;
```

If one partition shows that same outlier `dead_ratio_pct` and carries indexes three or more times larger than the next-biggest partition, the partition key is the problem.

## Step 3: Fix the distribution problem

The fix lives at the schema level. Two approaches address it directly.

### Switch to time-based range partitioning

For append-heavy workloads, [time-based partitioning](https://www.tigerdata.com/learn/pg_partman-vs-hypertables-for-postgres-partitioning) is almost always the better fit. Writes land in the current time partition, old partitions become cold, and you can drop entire time windows cleanly instead of running expensive `DELETE` operations. Here's what the corrected table definition looks like:

```SQL
-- Before: partitioned by region_id, creating write skew
CREATE TABLE sensor_events (
    ts          timestamptz NOT NULL,
    region_id   int NOT NULL,
    device_id   bigint NOT NULL,
    value       double precision
) PARTITION BY RANGE (region_id);

-- After: partitioned by time, distributing writes naturally
CREATE TABLE sensor_events (
    ts          timestamptz NOT NULL,
    region_id   int NOT NULL,
    device_id   bigint NOT NULL,
    value       double precision
) PARTITION BY RANGE (ts);

-- Daily partition, typically created by pg_partman or a cron job
CREATE TABLE sensor_events_2025_06_01
  PARTITION OF sensor_events
  FOR VALUES FROM ('2025-06-01') TO ('2025-06-02');
```

### Use Tiger Data's hypertable to handle chunk management automatically

If you want time-based partitioning without building the automation to create and drop partitions on a schedule, Tiger Data handles it for you:

```SQL
-- Convert the existing table to a hypertable, partitioned on ts
SELECT create_hypertable('sensor_events', 'ts', migrate_data => TRUE);
```

Tiger Data creates time-based chunks automatically, applies [columnar compression](https://www.tigerdata.com/learn/postgresql-compression) to older chunks, and handles maintenance without external scripts. The write hot spot disappears because writes always land in the current time chunk. The columnar storage layer also batches rows before writing to disk, dropping `WAL` volume from the 50-100MB/sec range typical of high-frequency vanilla Postgres inserts down to roughly 5-15MB/sec. That's where the Optimization Treadmill finally stops.

Switching partition keys on a live table requires a migration step regardless of which approach you take. For tables under 100M rows, a dump and restore is usually the simplest path. For larger tables, [logical replication](https://www.tigerdata.com/blog/migrating-a-terabyte-scale-postgresql-database-to-timescale-with-zero-downtime) into the new schema gives you a cutover path with minimal downtime.

## Validate that it worked

After the migration completes, let one full write cycle run and re-run the diagnostic:

```SQL
SELECT
    relname AS partition_name,
    n_dead_tup,
    ROUND(
        100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 2
    ) AS dead_ratio_pct
FROM pg_stat_user_tables
WHERE relname LIKE 'sensor_events%'
ORDER BY dead_ratio_pct DESC;
```

With time-based partitioning, dead tuple ratios should be roughly even. No single partition should show ten times the bloat of its neighbors. Autovacuum stops running in perpetual crisis on one partition while the others coast.

If one partition still shows a skewed `dead_ratio_pct` after the change, check whether a secondary index on a low-cardinality column is concentrating writes the same way the old partition key did.

## Next step

Run the hot-spot diagnostic against your largest partitioned table today. Check whether your partition key routes writes by time or by value. If it's value-based and your data is not uniformly distributed, you're likely already seeing the early signal: p95 write latency climbing on a table that looks healthy by every other metric.

That's the partition-level version of the Optimization Treadmill. The fix isn't in your config. It's in your schema. If you're ready to go further, you can test this against your own data with a [free Tiger Data trial](https://console.cloud.timescale.com/signup).