
Jun 19, 2025

Blocked Bloom Filters: Speeding Up Point Lookups in Tiger Postgres' Native Columnstore

Posted by Jacky Liang

💡 This is the first post in a technical deep-dive series that explores the ways we are building the fastest Postgres.

Database storage is a study in locality.

  • Row stores keep all fields of a record together; operations read and write whole rows at a time.
  • Column stores organize each column’s values in compressed blocks, allowing operations to target specific columns.

This trade-off is structural, not just cosmetic. Row layout is great for fast inserts and point lookups. Column layout excels at filters, aggregations, and scans over a large number of rows but a small number of columns (as long as your query plays by the rules).

But here’s the catch: columnstores are only fast at filtering when your predicate aligns with the physical sort order. If your data is ordered by time, or clustered by a segmentby column (like customer ID or device), the engine can skip large blocks. But if you filter on an unsorted field (like a trace ID, transaction UUID, or error code), there’s often no optimization to exploit. The engine has to decompress every block and scan every value, just in case there’s a match.

TimescaleDB combines both layouts into a hybrid table model, with recent records written to the rowstore and automatically migrated to the columnstore over time. This design closely matches real-world query patterns: fresh data is updated frequently, while older data is rarely updated and mostly used in aggregate. We’ve already implemented columnar mutability, but one challenge remained: sometimes you need to filter on unsorted fields across terabytes of columnar data.

And it’s not hypothetical: we’ve seen it in the wild many times. Dashboards hang while users query a UUID, waiting as the engine churns through thousands of compressed blocks. The data is there. The filter is simple. But the system grinds.

That isn’t acceptable. At Tiger, we’re here to deliver speed without sacrifice.

So we implemented blocked bloom filters, and our users already love them: 

Our community member @pantonis saw 100× faster lookups after upgrading to TimescaleDB 2.20

P.S. If you're using TimescaleDB 2.20 or later (or Tiger Postgres on Tiger Cloud), bloom filters are already optimizing lookups on sparsely distributed UUIDs, enums, and text fields by up to 100x.

The Challenge of Point Lookups in Columnar Storage

If you've worked with large-scale time-series or analytics workloads, you've probably experienced this pain. You're querying 10TB of trace data to find a single ID like '550e8400-e29b-41d4-a716-446655440000'. Your database starts churning through millions of batches, reading and decompressing terabytes of data. Minutes, then hours, tick by. Your application times out. Users complain because they needed this report an hour ago (everything is urgent).

Imagine you’re looking for a needle in hundreds of compressed hay bales: you have to unbundle, loosen, and search through every single bale because you don't know which one contains your needle.

This happens because columnar databases store data in sorted batches, compress each column separately, and use ordering metadata to skip irrelevant batches. This works perfectly when you're querying by the same column you sorted by:

-- This works well - time-based query on ordered data
SELECT * FROM metrics
  WHERE timestamp BETWEEN '2024-01-01' AND '2024-01-02';

But it completely breaks down for uncorrelated columns:

-- This is painful - random ID query on a non-segmented column
SELECT * FROM metrics 
  WHERE trace_id = '550e8400-e29b-41d4-a716-446655440000';

The thing with UUIDs (except for UUIDv7; keep an eye out for support coming soon) is that they're completely random, so it doesn't make sense to order them. When your data is sorted by time but you're searching by trace ID, every batch contains a random mix of IDs. Ordering becomes useless, and the database can't skip any batches.

What is a Bloom Filter? 

This is where bloom filters help: they're additional metadata that can efficiently answer "is this value definitely not in this batch?" without actually storing a reference to each value.

A bloom filter is a small yet efficient data structure that uses an array of bits and a handful of hash functions to quickly test whether something might be in a set.

Bloom filter illustration courtesy of Bytedrum: Bloom Filters, a great visual explainer.

Bloom filters can say something is "definitely not there" or "might be there", and crucially, they never say "it's missing" when something is actually there. In other words, false positives are possible, but false negatives are not.
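
To make that concrete, here's a tiny illustrative bloom filter in Python. It's a sketch of the general idea, not TimescaleDB's implementation (the sizes and the SHA-256-based hashing are stand-ins):

import hashlib

class BloomFilter:
    def __init__(self, num_bits=8192, num_hashes=6):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int doubles as an arbitrary-width bit array

    def _positions(self, value):
        # Derive num_hashes bit positions from one strong hash using
        # the standard double-hashing trick: pos_i = h1 + i * h2.
        digest = hashlib.sha256(str(value).encode()).digest()
        h1 = int.from_bytes(digest[:8], 'little')
        h2 = int.from_bytes(digest[8:16], 'little') | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # False = "definitely not there"; True = "might be there".
        return all((self.bits >> pos) & 1 for pos in self._positions(value))

f = BloomFilter()
f.add('550e8400-e29b-41d4-a716-446655440000')
print(f.might_contain('550e8400-e29b-41d4-a716-446655440000'))  # True
print(f.might_contain('some-other-id'))  # almost certainly False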

Take Spotify’s playlist feature as an example. When Spotify needs to check whether a song is in one of your playlists, instead of scanning through every song in every playlist (which means reading every playlist from storage, absurdly expensive with billions of songs), it can use a bloom filter: a compact “summary” that instantly says whether a song is “definitely not in this playlist” or “might be in this playlist”.

For the “might be” cases, Spotify then uses traditional seek methods to check the actual playlist data. 

Interactive Spotify bloom filter demo: https://spotify-bloom-filter.vercel.app/

This may not sound that useful, but at the scale of millions of playlists and billions of songs, a bloom filter eliminates 95%+ of linear-time playlist scans, turning minutes of searching into milliseconds.

Of course, no data structure is catch-free: the filter occasionally checks a playlist unnecessarily, at around a 2% false positive rate (more on this later). But this is an acceptable trade-off, because you still get massive I/O savings across your entire system.

How We Added Bloom Filters into the Columnstore

Here's how TimescaleDB solves this: bloom filters act as a quick pre-check. Before reading any batch from disk, TimescaleDB checks a tiny bloom filter in memory that says "this ID is definitely not in this batch" or "this ID might be in this batch." This lets us skip 95%+ of batches instantly.

For the few batches that might contain your ID, TimescaleDB reads them from disk and processes them efficiently using vectorized operations (SIMD) that check many rows at once—much faster than Postgres's traditional row-by-row approach. But the real win is avoiding the I/O in the first place.

No manual configuration needed

When building with TimescaleDB, you don’t need to worry about when to use bloom filters or min/max indexes, because we automatically choose for you based on your column types! 

For columns that you use in your table ordering (like timestamps and numbers used in range queries), we stick with the min/max method because we know that scans will be in order.

For random things like text fields, UUIDs, and enum types (or basically anything else that supports Postgres hash indexes), we create bloom filters automatically (as long as the column is indexed in the rowstore with a btree, hash, or brin index).

Diving deeper: “blocked bloom filters”

TimescaleDB uses a technique called a "blocked bloom filter", where each bloom filter starts at about 16 kilobits (2 KB) per batch, sized for up to 1,000 items with a 2% false positive rate, and uses 6 different hash functions per value.

That size isn't arbitrary; it comes straight from the standard bloom filter sizing formulas. Here’s a handy calculator you can try out yourself.


For 1,000 items with a 2% false positive rate, the optimal formula gives us ~8,100 bits, but we round up to 16,384 bits, the next power of two, to enable our folding compression trick (we will get to this below!). This sizing ensures we get exactly the false positive rate we want while keeping the filters small enough to stay fast in memory.
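
If you want to check the math yourself, the standard sizing formulas are easy to evaluate (a quick Python sketch):

import math

n = 1_000  # expected items per batch
p = 0.02   # target false positive rate

# Standard bloom filter sizing formulas.
m = -n * math.log(p) / math.log(2) ** 2  # optimal bits: ~8,142
k = m / n * math.log(2)                  # optimal hashes: ~5.6, rounded to 6

print(f"{m:,.0f} bits, {k:.1f} hash functions")
# Rounding m up to the next power of two (16,384 bits = 2 KB)
# is what makes the folding trick described below possible.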

The "blocked" part is a performance technique—instead of spreading hash bits all over a huge array, TimescaleDB keeps all the bits for one value within a 256-bit block. This fits nicely in your CPU cache and makes everything faster. 

For hashing, TimescaleDB primarily uses a modern library called UMASH that's faster than Postgres's built-in hashing, but it falls back to the Postgres version for custom data types or older processors. There's a funny story here about interoperability that I'll share on socials!

Achieving 250x space savings 

Here’s where we squeezed additional space and performance benefits out of bloom filters: when a batch doesn’t have many unique values (for example [0, 0, 0, 0, 1, …, 0, 0]), TimescaleDB can compress the bloom filter by “folding” it in half using bitwise OR operations.

It can keep folding until the filter shrinks from 16 kilobits down to just 64 bits (8 bytes) for columns with few unique values, also known as low-cardinality columns. We can do this because most of the bits in the filter are zero, so folding concentrates the few set bits without significantly increasing false positives.
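
Folding itself is simple to express. Here's a sketch in Python (illustrative, not the C implementation), which relies on the filter width always being a power of two:

def fold_once(bits, width):
    # OR the top half of the filter onto the bottom half.
    half = width // 2
    low = bits & ((1 << half) - 1)
    high = bits >> half
    return low | high, half

Because the width is always a power of two, a bit originally set at position h % width ends up at h % half after folding, so a query never needs to know how many times a filter was folded; it just takes the hash modulo the current width.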

Query Walkthrough

Here’s a step-by-step example of how a query works. Let’s use the following trace ID search query:

SELECT * FROM metrics 
   WHERE trace_id = 'abc123';

  1. TimescaleDB first checks the bloom filters (which are likely to be cached in memory using the traditional Postgres buffer manager) for every batch in your columnstore.
  2. For each batch, the bloom filter either says “definitely not here” or “might be here”. The database immediately skips all the “definitely not here” batches.
  3. For the “might be here” batches, TimescaleDB reads them from disk, decompresses them, and scans the actual data (the expensive part). If it was a false positive (that 2% chance we mentioned earlier), no match is found, and the query just continues running normally.

The key here is that false positives are okay and don’t hurt performance: even when we hit one, we still avoid massive amounts of unnecessary I/O by not having to go through every batch from the get-go. Put together, the scan loop looks something like the sketch below.
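
Here's a minimal sketch of that loop in Python, reusing the toy BloomFilter from earlier (Batch, decompress, and the row layout are illustrative stand-ins, not TimescaleDB's actual code):

from dataclasses import dataclass

@dataclass
class Batch:
    bloom: BloomFilter  # per-batch filter, cheap to keep hot in memory
    rows: list          # stand-in for the compressed columnar data

def decompress(batch):
    # Stand-in for the expensive part: disk read + decompression.
    return batch.rows

def scan(batches, target):
    for batch in batches:
        # Cheap pre-check: no false negatives, ~2% false positives.
        if not batch.bloom.might_contain(target):
            continue  # "definitely not here" -> skip the I/O entirely
        for row in decompress(batch):
            if row['trace_id'] == target:
                yield row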

A side benefit of our implementation of bloom filters is that the bloom filter metadata is more likely to stay hot in memory. Under concurrent workloads, where different users are querying different parts of your dataset, every query can quickly eliminate most batches without touching slower, more expensive storage.

Where Bloom Filters Excel

Bloom filters excel at large time-series datasets queried by non-temporal identifiers.

Think about scenarios where you're storing massive amounts of data over time, but you need to find specific records using IDs, addresses, or other fixed identifiers.

For financial services teams. Your customer is trying to resolve a failed payment. They enter a transaction reference number into your search. Nothing loads. Your backend query scans years of data just to return a single match, and the user waits...

-- Finding financial transactions by reference
SELECT * FROM payments 
  WHERE transaction_ref = 'TXN-2024-001234';

For IoT platform teams. Your dashboard shows live sensor data, but one widget is blank. It’s querying a single sensor reading by ID, and your backend is scanning billions of rows to find it. Users refresh the page. Nothing. The spinner keeps spinning.

-- Device-specific IoT data queries
SELECT * FROM sensor_data 
  WHERE reading_id = '73e98d71-5eb7-4018-ace7-1f4490da654a';

For teams working on blockchain analytics. Your analytics engine scans millions of blocks to find transactions from a wallet address. API timeouts leave users thinking your service is broken.

-- Looking up blockchain transactions by wallet address  
SELECT * FROM transactions 
  WHERE from_address = '0x742d35Cc6634C0532925a3b8D';

Bloom filters turn these queries from minutes or hours into milliseconds by eliminating the need to decompress and scan billions of rows. Instead of checking every batch in your columnstore, you skip 95% of them and only decompress the ones that might actually contain your data.

Performance Numbers: Find Specific Values Up to 100x Faster 

Instead of telling you why our bloom filter implementation is great, let's just show you some numbers from our benchmarks.

SELECT min(sent), max(sent), count(*)
  FROM hackers 
  WHERE subject = 'unsubscribe' 
  ORDER BY count(*) DESC 
  LIMIT 10;

Before bloom filters, this took 12ms. With bloom filters, it dropped to 2.7ms, roughly 4.4x faster.

Or consider this blockchain address lookup:

SELECT * FROM token_transfers 
  WHERE to_address = '0xe23d4eb73b399250301fb024019a734ba9f0d9b5';

This one went from 1.065 seconds down to 171.134 ms—a 6x improvement.

And let's not forget the report from our user pantonis, who saw a massive 100x improvement:

Our community member @pantonis saw 100× faster lookups after upgrading to TimescaleDB 2.20

Where Bloom Filters Don’t Work 

Like all data structures, bloom filters have strengths and limitations.

Bloom filters work great when you're looking for exact matches—queries that use the equals sign (=) to find specific values. They also work with standard string comparisons where the rules are consistent. 

-- These work great with bloom filters

SELECT * FROM traces WHERE trace_id = 'abc-123-def';

SELECT * FROM orders WHERE email = 'user@example.com';

SELECT * FROM transactions WHERE status = 'completed';

However, bloom filters have fundamental limitations and some current implementation restrictions. By design, they can't handle "not equal" searches (<>) or range queries (< or >) because they only test set membership.

-- These don't work with bloom filters (fundamental limitations)
SELECT * FROM traces WHERE trace_id <> 'abc-123-def';

SELECT * FROM users WHERE created_at > '2024-01-01';

SELECT * FROM transactions WHERE amount BETWEEN 100 AND 500;

Current implementation restrictions in TimescaleDB mean they also can't help with multiple value searches (like WHERE column IN (1, 2, 3)) or cross-type comparisons without explicit casting. These may change in future versions.

-- These don't work yet (implementation restrictions)
SELECT * FROM traces WHERE trace_id IN 
  ('abc-123', 'def-456', 'ghi-789');

SELECT * FROM users WHERE user_id = 12345;  -- int8 = int4 comparison

SELECT * FROM posts WHERE category = ANY(ARRAY['tech', 'science']);

Also, bloom filters don't help much when the value you're looking for exists in most batches. 

For example, if you're searching for a common status like 'active' that appears in every batch, the bloom filter will report a potential positive for every batch, forcing TimescaleDB to decompress and check them all anyway. The bloom filter can't skip anything, so you don't get any savings.

Speed without Sacrifice

Bloom filters in TimescaleDB are a perfect example of "it just works" optimization and our commitment to making life easier for developers working at massive data scales.

The bloom data structure automatically kicks in for the right data types and query patterns, dramatically improving performance for needle-in-haystack queries without any configuration required. It works out of the box in Tiger Cloud—no setup required other than having an index on your rowstore columns.

You can verify bloom filters are working by looking for _timescaledb_functions.bloom1_contains in your query execution plans. The storage overhead is minimal, typically a few hundred bytes per batch, with a maximum of 1KB. For a table with a million batches, you're looking at roughly 100MB to 1GB of bloom filter metadata. That's 0.01% storage overhead for massive query speedups.

Built by developers, for developers. TimescaleDB refuses to accept the traditional trade-offs of database storage. We give you the speed of columnar analytics with the flexibility of point lookups, all in the same system.

Try it now on Tiger Cloud

Additional reading 

  1. Speed Without Sacrifice: 2500x Faster Distinct Queries, 10x Faster Upserts, Bloom Filters and More in TimescaleDB 2.20 
  2. Bloom Filters: The Unsung Heroes of Computer Science
  3. Bloom Filter Calculator

