
Published at Dec 26, 2024

How to Handle High-Cardinality Data in PostgreSQL

Written by Semab Tariq

One of the fundamental concepts that developers, DBAs, and data engineers encounter when dealing with databases is the cardinality of data. Simply put, cardinality refers to the number of distinct values in a particular column of a database table. The higher the number of unique values, the higher the cardinality. Many decisions about schema design are based on what the cardinality of the expected data looks like. 

High cardinality presents several challenges. For instance, it can significantly slow down queries and lead to increased memory usage. That’s why it's crucial to optimize high cardinality data to prevent performance issues and avoid putting unnecessary strain on your database.

In this blog, we’ll cover these key points:

  • Understanding what high cardinality is

  • Challenges associated with handling high cardinality data

  • How Timescale addresses these limitations to manage high-cardinality data more efficiently

Let's get started!

Understanding What High Cardinality Is

In databases, high cardinality refers to a column in a table that contains a significant number of unique values. These columns often include a vast range of distinct entries, sometimes reaching millions or even billions.

To better understand high cardinality, imagine you're managing a library of books: 

  • The ISBN number of each book would represent high cardinality data. This is because each ISBN is unique to a particular book edition, and there are millions of unique ISBNs globally. This column would contain a vast range of distinct values.

  • On the other hand, if you're tracking book genres like "fiction," "non-fiction," or "sci-fi," these would represent low cardinality. Since there are a limited number of genres, the genre column has far fewer distinct values.

CREATE TABLE books (
    book_id SERIAL PRIMARY KEY,
    isbn VARCHAR(20) UNIQUE,
    title VARCHAR(255),
    genre VARCHAR(50),
    publish_year INT
);

Breakdown of high and low cardinality columns

  • High-cardinality columns

    • book_id: This is a unique identifier for each book, and it will never be repeated. This makes it high cardinality.

    • isbn: An ISBN is globally unique for each edition of a book, which is why it’s high cardinality. The ISBN column will contain a vast range of distinct values.

    • title: Book titles are often unique across a collection. Even though titles may occasionally repeat (e.g., multiple editions), a large collection is likely to contain a vast number of distinct titles, classifying this as high cardinality.

  • Low-cardinality columns

    • genre: The number of genres in a library collection is small compared to the number of books. Genres such as "fiction," "non-fiction," and "sci-fi" are limited, which makes this a low-cardinality column.

    • publish_year: The publishing year for books can repeat, and there are only a limited number of years in the dataset, so it also has low cardinality.

To learn more about high cardinality, check out this blog post.

Is It Possible for a Table to Have Entirely High Cardinality?

In the previous section, we discussed a mix of high-cardinality columns (such as isbn, book_id, and title) and low-cardinality columns (like genre and publish_year). However, in more complex scenarios, a table can consist entirely of high-cardinality data. In such cases, even the columns that store the actual data (those containing the core information) can also be considered high cardinality.

Examples

  • Consider a table logging API requests where each value represents a unique session token or request ID. In this case, each entry in the value column is unique, making it high cardinality.

  • Imagine a scenario where the value column stores JSON data representing user preferences or configurations. If each JSON object contains distinct settings or configurations specific to each user, this column could be considered high cardinality. Even though JSON objects might share similar structures, the unique content or configuration data in each JSON could lead to a wide variety of values, making the value column high cardinality.

-- Create a table to store user preferences

CREATE TABLE user_preferences (
    user_id INTEGER PRIMARY KEY,
    preferences JSON
);

-- Insert sample data showing variety in JSON values

INSERT INTO user_preferences (user_id, preferences) VALUES
(1, '{
    "theme": "dark",
    "fontSize": 14,
    "notifications": {"email": true, "push": false, "frequency": "daily"},
    "dashboardLayout": [
        {"widget": "calendar", "position": "top-left"},
        {"widget": "tasks", "position": "top-right"}
    ],
    "shortcuts": {"newItem": "ctrl+n", "save": "ctrl+s", "custom1": "alt+1"}
}'),
(2, '{
    "theme": "light",
    "fontSize": 16,
    "notifications": {"email": false, "push": true, "frequency": "weekly"},
    "dashboardLayout": [
        {"widget": "analytics", "position": "full-width"},
        {"widget": "messages", "position": "bottom"}
    ],
    "shortcuts": {"newItem": "cmd+n", "save": "cmd+s", "custom1": "cmd+shift+1"}
}'),
(3, '{
    "theme": "system",
    "fontSize": 12,
    "notifications": {"email": true, "push": true, "frequency": "instant"},
    "dashboardLayout": [
        {"widget": "reports", "position": "left"},
        {"widget": "calendar", "position": "right"},
        {"widget": "tasks", "position": "bottom"}
    ],
    "shortcuts": {"newItem": "n", "save": "s", "custom1": "1", "custom2": "2"}
}');

-- Query to demonstrate the unique values
-- (the JSON type has no equality operator, so cast to text for DISTINCT)
SELECT COUNT(DISTINCT preferences::text) AS unique_preference_count,
       COUNT(*) AS total_rows,
       (COUNT(DISTINCT preferences::text)::float / COUNT(*)) * 100 AS cardinality_percentage
FROM user_preferences;

-- Query to analyze specific preference patterns

SELECT
    json_extract_path_text(preferences::json, 'theme') AS theme,
    json_extract_path_text(preferences::json, 'fontSize') AS font_size,
    json_extract_path_text(preferences::json, 'notifications', 'frequency') AS notification_frequency,
    COUNT(*) AS count
FROM user_preferences
GROUP BY 1, 2, 3
ORDER BY count DESC;

Combination of Columns Can Impact Table Cardinality

The combination of multiple columns in a table, even if each column has low cardinality, can result in high cardinality. This is particularly true when one column contains unique identifiers and the others introduce additional variations.

Example

In an e-commerce order table, you might have columns like order_id, customer_id, and order_status. The order_id and customer_id columns are high cardinality because they have unique values for each order and customer. 

The order_status column, however, has low cardinality since it only has a few distinct values, such as 'pending,' 'shipped,' and 'delivered.' Even though the order_status column has low cardinality, combining the order_id and customer_id columns can create many unique combinations, especially in large-scale e-commerce systems where customers place multiple orders. This makes the table high cardinality overall.
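A quick way to see this effect is to compare the cardinality of each column with the cardinality of the combination. A minimal sketch, assuming a hypothetical orders table with the columns described above:

-- Distinct values per column vs. distinct combinations
SELECT
    COUNT(DISTINCT order_status)                 AS status_cardinality,
    COUNT(DISTINCT customer_id)                  AS customer_cardinality,
    COUNT(DISTINCT (customer_id, order_status))  AS combined_cardinality
FROM orders;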

Recapping

In simple cases, columns like isbn or book_id have high cardinality. However, in more complex tables, even columns storing event data, transaction details, or measurements can have high cardinality. Understanding this helps you design efficient databases and optimize queries, indexes, and performance for both high- and low-cardinality columns.

How to Determine if Your Table Contains High-Cardinality Data

Identifying high-cardinality data is generally straightforward. The key characteristic of high cardinality is that the column contains many distinct, non-repeating values. There are a few ways you can identify such columns:

Distinct count

You can count the number of distinct values in a column using the COUNT(DISTINCT column_name) function. If the count of distinct values is large, you have high-cardinality data.

SELECT COUNT(DISTINCT sensor_id) FROM sensor_readings;

If the result shows a high number (e.g., millions), you are dealing with high cardinality.

Analyze columns

By analyzing the data columns themselves, we can identify which columns have high cardinality. For example:

ID columns

Columns like user_id, sensor_id, or transaction_id typically have high cardinality, as each entry is usually unique to a specific entity or event.

Timestamp columns

Timestamps can exhibit high cardinality, especially in time-series data (such as sensor readings or logs), where each entry often gets a new timestamp, sometimes down to the millisecond level. 

To put it in simple words, imagine a clock paired with a calendar. Each tick of the clock represents a new data entry, recording the exact date and time down to the millisecond. Over days, weeks, or months, this creates an enormous number of unique timestamps, thus increasing the cardinality.

Textual columns

Columns that store unique identifiers, such as customer names, product codes, or transaction IDs, usually have high cardinality. 

For example, in a customer_name column, each name is likely unique to a specific customer. Even if a name like "John Smith" appears multiple times, it can still represent different customers, making it a high-cardinality identifier.

Data distribution

When determining a column's cardinality, we consider not only the number of distinct values but also how these values are spread out in the database, referred to as data distribution. Cardinality measures how many unique values exist in a column, while data distribution tells us how those values are distributed across rows.

Even distribution

If a column contains many unique values, but each value appears roughly the same number of times, it’s considered an even distribution. For example, imagine a sensor_id column where there are 100 sensors, and each sensor records 500 readings. In this case, the data distribution is even because every sensor contributes an equal number of readings to the dataset.

Uneven or skewed distribution

If most values in the column are concentrated around a few items while others are rarely used, we call this uneven or skewed distribution. For example, one sensor might have one million readings, while the rest only have a few hundred.
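To check how evenly values are spread, you can group by the column and compare per-value counts. A minimal sketch, assuming the sensor_readings table used earlier:

-- Rows per sensor; a large gap between the top and bottom counts indicates skew
SELECT sensor_id, COUNT(*) AS readings
FROM sensor_readings
GROUP BY sensor_id
ORDER BY readings DESC
LIMIT 10;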

How does data distribution impact the data cardinality?

The relationship between high cardinality and data distribution lies in understanding both the number of unique values (cardinality) and how those values are spread across your dataset. 

While they are distinct concepts, they are deeply connected in terms of how they affect query performance, data management, and resource optimization. 

Example

High cardinality with uneven distribution can lead to performance bottlenecks. 

  • If a sensor_id column has hundreds of distinct sensor IDs but only a few are frequently queried, queries focusing on those frequently queried values will have to scan through a larger amount of data. This can negatively impact query speed and efficiency.

  • High cardinality with even distribution generally has less of an impact on performance because the queries access the data more evenly. This means no single value is queried too much. However, the impact still depends on how the data is actually queried and used.

Challenges Associated With Handling High Cardinality Data

High cardinality data presents unique challenges in database management, particularly as datasets grow in size and complexity. These challenges impact performance, storage, and analytics. Let’s explore these challenges in detail.

Indexing overhead

High-cardinality columns often require indexes for efficient query performance. However, indexing such data can lead to significant storage and performance issues.

Storage costs

  • High-cardinality columns, when indexed using a B-tree, can result in larger index sizes because each unique value requires a separate entry in the index. These larger indexes can lead to increased disk usage, especially if the column contains millions of unique values.

  • Over time, as data is inserted, updated, or deleted, the index can become fragmented, further increasing the storage requirements and potentially degrading performance.

Write overhead

  • Every time a row is inserted or updated, the database must update the index to reflect the new or changed data. This can be particularly costly for high-cardinality columns because the index must accommodate a large number of unique entries.

  • During write operations, locks may be held on the specific part of the index (e.g., pages or buckets), which can lead to contention and slow down concurrent write operations in high-traffic environments.

Cache efficiency

  • Large indexes may not fit entirely in memory, leading to more frequent disk I/O operations. This increase in operations can reduce the cache hit rate and slow down query performance, as accessing data from disk is significantly slower than accessing it from memory.
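One way to see whether large indexes are hurting cache efficiency is to check PostgreSQL's buffer cache hit ratios from the pg_statio statistics views. A minimal sketch, assuming the test_indexes table created in the example below:

-- Cache hit ratios for a table and its indexes (values close to 1 are good)
SELECT relname,
       heap_blks_hit::float / NULLIF(heap_blks_hit + heap_blks_read, 0) AS heap_hit_ratio,
       idx_blks_hit::float  / NULLIF(idx_blks_hit + idx_blks_read, 0)   AS index_hit_ratio
FROM pg_statio_user_tables
WHERE relname = 'test_indexes';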

Example

In a customer database with millions of users, a column like email_address (high cardinality) indexed for quick lookups will significantly increase the storage footprint and slow down bulk inserts due to index updates.

Step 1: Create the table

CREATE TABLE test_indexes (
    id SERIAL PRIMARY KEY,
    email_address TEXT NOT NULL UNIQUE, -- High-cardinality column
    status TEXT NOT NULL                -- Low-cardinality column
);

Step 2: Insert records

INSERT INTO test_indexes (email_address, status)
SELECT
    concat('user', i, '@example.com') AS email_address,
    CASE WHEN i % 2 = 0 THEN 'active' ELSE 'inactive' END AS status
FROM generate_series(1, 100000) i;

Step 3: Create indexes

-- Index on the high-cardinality column
CREATE INDEX idx_email_address ON test_indexes(email_address);

-- Index on the low-cardinality column
CREATE INDEX idx_status ON test_indexes(status);

Step 4: Check the size for both of the indexes

SELECT
    pg_size_pretty(pg_relation_size('idx_email_address')) AS high_cardinality_index_size,
    pg_size_pretty(pg_relation_size('idx_status')) AS low_cardinality_index_size;

 high_cardinality_index_size | low_cardinality_index_size
-----------------------------+----------------------------
 3992 kB                     | 696 kB
(1 row)

Query Performance Degradation

Queries on high-cardinality columns can be computationally expensive, particularly for operations like GROUP BY or DISTINCT. This is because the database engine has to process a large set of unique values.

Sorting and hashing overhead

  • Operations like GROUP BY and DISTINCT often require sorting or hashing of the data to identify unique values or group similar ones. For high cardinality columns, this means processing a vast number of unique entries, which can be computationally intensive.

  • Sorting a large dataset involves significant CPU and memory resources, as the database engine must arrange all unique values in order.

Memory usage

  • High-cardinality operations can consume a lot of memory because the database engine may need to store all unique values in memory to perform the operation efficiently. If the dataset is too large to fit in memory, it can lead to increased disk I/O, further slowing down the query.

  • When the configured value for work_mem is insufficient, the database engine may use temporary disk storage to handle intermediate results. This disk usage can significantly degrade performance due to the slower speed of disk access compared to memory access.
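You can confirm whether an aggregate is spilling to disk by running it under EXPLAIN (ANALYZE, BUFFERS) and looking for external sorts or temporary file usage, then raising work_mem for the session. A minimal sketch, assuming the transactions table from the example below (the 256MB value is purely illustrative):

SHOW work_mem;                      -- check the current setting

SET work_mem = '256MB';             -- raise it for this session only

EXPLAIN (ANALYZE, BUFFERS)          -- look for on-disk ("external merge") vs. in-memory sorts and hashes
SELECT customer_id, COUNT(*)
FROM transactions
GROUP BY customer_id;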

Execution plan complexity

  • The query planner must choose an optimal execution plan for handling high-cardinality operations. However, the complexity of managing a large number of unique values can lead to suboptimal plans, especially if statistics are outdated or inaccurate.

  • PostgreSQL manages table statistics through a process called ANALYZE, which collects key statistics that include information like the number of rows, the most common values, and the distribution of data within columns. This data is stored in the system catalog and used by the query planner to make informed decisions about query execution plans. To ensure the query planner has up-to-date information, it's important to regularly analyze tables, especially after significant data changes. You can also configure the level of detail collected by setting the default_statistics_target parameter, which determines the default level of detail collected by the ANALYZE command for table statistics. A higher value means more detailed statistics are collected, which can lead to better query planning decisions, especially for complex queries or those involving columns with a wide range of values. However, collecting more detailed statistics can also increase the time and resources required for the ANALYZE operation.
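To give the planner better statistics for a specific high-cardinality column, you can raise that column's statistics target and re-run ANALYZE. A minimal sketch, again assuming the transactions table from the example below (the target values are illustrative):

-- Collect more detailed statistics for one high-cardinality column
ALTER TABLE transactions ALTER COLUMN customer_id SET STATISTICS 1000;

-- Refresh the statistics used by the query planner
ANALYZE transactions;

-- Alternatively, raise the database-wide default (then reload the configuration)
-- ALTER SYSTEM SET default_statistics_target = 500;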

A transaction table with a high-cardinality customer_id column, executing SELECT COUNT(DISTINCT customer_id), requires the database to identify and count each unique customer ID. With billions of rows, this operation demands substantial processing power and memory, leading to longer execution times.

Below is an example where we use a transaction table and insert 50 million rows. In this case, the customer_id column will have high cardinality, while the status column will contain low cardinality data.

This is what the table will look like:

tsdb=> select * from transactions limit 5;
 transaction_id | customer_id | amount |  status   |      transaction_date
----------------+-------------+--------+-----------+----------------------------
              1 |      625317 | 197.04 | completed | 2024-12-19 07:50:15.439523
              2 |      729692 | 636.61 | completed | 2024-12-19 07:50:15.439523
              3 |      680008 | 282.23 | pending   | 2024-12-19 07:50:15.439523
              4 |      736607 | 607.30 | pending   | 2024-12-19 07:50:15.439523
              5 |      849894 | 476.13 | completed | 2024-12-19 07:50:15.439523
(5 rows)

-- Create a table
CREATE TABLE transactions (
    transaction_id SERIAL PRIMARY KEY,
    customer_id INT NOT NULL,
    amount NUMERIC,
    status TEXT DEFAULT 'pending',
    transaction_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Insert the data into the table
INSERT INTO transactions (customer_id, amount, status)
SELECT
    FLOOR(RANDOM() * 1000000)::INT,       -- High variety: customer IDs between 0 and 999,999
    (RANDOM() * 1000)::NUMERIC(10, 2),    -- Random amounts between 0.00 and 1000.00
    CASE
        WHEN RANDOM() < 0.7 THEN 'completed'  -- 70% of transactions
        WHEN RANDOM() < 0.9 THEN 'pending'    -- 20% of transactions
        ELSE 'failed'                         -- 10% of transactions
    END
FROM generate_series(1, 10000000);        -- Generate 10 million rows

-- Performing a SELECT on the high-cardinality column
SELECT COUNT(DISTINCT customer_id) FROM transactions;
  count
----------
 50000000
(1 row)

Time: 21161.682 ms (00:21.162)

-- Performing a SELECT on the low-cardinality column
SELECT COUNT(DISTINCT status) FROM transactions;
 count
-------
     3
(1 row)

Time: 18051.252 ms (00:18.051)

As observed, the SELECT query on the status column with low cardinality data is executing faster, even though the query planner performed a sequential scan (Seq Scan) for both SELECT queries.

Let's see if we can improve the performance with an index:

CREATE INDEX idx_customer_id ON transactions (customer_id);
CREATE INDEX

Re-run the SELECT query:

SELECT COUNT(DISTINCT customer_id) FROM transactions;
  count
----------
 50000000
(1 row)

Time: 8169.219 ms (00:08.169)

This query took only eight seconds to complete, demonstrating that while high-cardinality columns may take longer to execute SELECT queries, the performance can be improved by adding an index to the high-cardinality column.

Aggregation and Analytics Complexity

Aggregating data in high-cardinality columns can be resource-intensive, as it involves grouping a large number of distinct values. This aggregation is particularly problematic for real-time analytics systems where performance is critical.

Data volume and grouping

  • High cardinality columns, like sensor_id in a sensor monitoring system, have a vast number of unique values. Aggregating data requires the system to group data by each unique value, which can be computationally expensive and time-consuming.

  • The sheer volume of data to be processed and grouped can strain system resources, particularly CPU and memory.

  • High cardinality increases the number of groups, leading to higher memory usage to store intermediate results and more CPU cycles to perform calculations.

Real-time constraints

  • Real-time analytics systems demand low-latency processing to provide timely insights. High cardinality aggregation can introduce delays as the system struggles to keep up with the volume and complexity of the data.

  • In real-time systems, multiple queries may run concurrently, each requiring access to the same high cardinality data. This concurrency can lead to contention for resources and further slow down processing.

  • As the number of sensors or data points grows, the system must scale to handle increased load, which can be challenging with high cardinality data.

Example

In the sensor data monitoring system, calculating metrics like average or peak readings per sensor involves processing data for millions of unique sensors. This requires the system to efficiently manage and compute large datasets in real time, which can overwhelm system resources and degrade performance.

Step 1: Create the table

CREATE TABLE sensor_data (
    sensor_id INT,
    timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    reading_value NUMERIC
);

Step 2: Ingest the data

-- Inserting sample data for 1 million sensors, each with random readings
INSERT INTO sensor_data (sensor_id, reading_value)
SELECT
    (random() * 1000000)::INT,      -- Random sensor_id between 0 and 1 million
    random() * 100                  -- Random reading value between 0 and 100
FROM generate_series(1, 30000000);  -- Generating 30 million rows of data

Step 3: Perform aggregate on the data

-- Using window function to calculate the moving average of readings for each sensor

SELECT
    sensor_id,
    timestamp,
    reading_value,
    AVG(reading_value) OVER (
        PARTITION BY sensor_id
        ORDER BY timestamp
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS moving_avg_reading
FROM sensor_data
ORDER BY sensor_id, timestamp;

Time: 45890.669 ms

Partitioning and Sharding Challenges

High-cardinality data presents specific challenges when it comes to partitioning and sharding, primarily due to the following reasons.

Uneven distribution

  • When partitioning by a high cardinality column such as user_id, data distribution across partitions can become uneven. Some partitions may end up with significantly more data than others, leading to "hotspots" where certain partitions experience higher load and contention.

  • This uneven distribution can result in load imbalance, where some partitions are overutilized while others are underutilized. This can degrade overall system performance and lead to inefficient resource utilization.

Increased metadata overhead

  • High-cardinality columns can lead to a large number of partitions, each requiring metadata management. This increases the system’s complexity and can slow down operations involving metadata, such as query planning and execution.

  • Query planning time can increase with a large number of partitions, as the planner needs to evaluate which partitions are relevant for a given query. In turn, this can increase planning time and complexity, which is particularly problematic when dealing with many partitions, as it can lead to slower query execution.

Maintenance complexity

  • Managing a large number of partitions can complicate maintenance tasks such as backups, restores, and schema changes. Each partition may need to be handled individually, increasing administrative overhead.

  • Addressing uneven distribution often requires rebalancing data across partitions, which can be resource-intensive and disruptive to normal operations.

Example

An e-commerce application partitioned by customer_id may face uneven workloads during seasonal sales, as a few high-spending customers generate disproportionate traffic.

Step 1: Create the table with partitions

-- Note: drop any existing transactions table from the earlier example before running this
-- (e.g., DROP TABLE IF EXISTS transactions;)
CREATE TABLE transactions (
    transaction_id SERIAL,
    customer_id INT NOT NULL,
    amount NUMERIC,
    transaction_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (transaction_id, customer_id)  -- the partition key must be part of the primary key
) PARTITION BY RANGE (customer_id);

-- Create partitions (simulating a few partitions for different customer groups)
CREATE TABLE transactions_1 PARTITION OF transactions FOR VALUES FROM (1) TO (1000);

CREATE TABLE transactions_2 PARTITION OF transactions FOR VALUES FROM (1000) TO (2000);

CREATE TABLE transactions_3 PARTITION OF transactions FOR VALUES FROM (2000) TO (3000);

Step 2: Insert dummy data into table

-- Insert a mix of low-spending and high-spending customers
-- Low-spending customers: customer_id between 1 and 1000
-- High-spending customers: customer_id between 2000 and 3000

-- Low-spending customers (simulate 1000 transactions)

INSERT INTO transactions (customer_id, amount)
SELECT
    (random() * 1000)::INT + 1,  -- Random customer_id from 1 to 1000
    random() * 50                -- Random transaction amount between 0 and 50
FROM generate_series(1, 1000);

-- High-spending customers (simulate 1000 transactions with larger amounts)
INSERT INTO transactions (customer_id, amount)
SELECT
    floor(random() * 1000)::INT + 2000,  -- Random customer_id from 2000 to 2999 (high-spending)
    random() * 500 + 100                 -- Random transaction amount between 100 and 600
FROM generate_series(1, 1000);

Step 3: Simulate the traffic during sales

-- Simulate a sales event with a higher volume of transactions for high-spending customers

-- For example, generating 5000 transactions for high-spending customers
INSERT INTO transactions (customer_id, amount)
SELECT
    floor(random() * 1000)::INT + 2000,  -- Random customer_id from 2000 to 2999 (high-spending)
    random() * 500 + 100                 -- Random transaction amount between 100 and 600
FROM generate_series(1, 5000);

Step 4: Examine the performance

-- Time taken to query low-spending customers

EXPLAIN ANALYZE SELECT COUNT(*) FROM transactions WHERE customer_id BETWEEN 1 AND 1000;

Time: 1.043 ms

-- Time taken to query high-spending customers

EXPLAIN ANALYZE SELECT COUNT(*) FROM transactions WHERE customer_id BETWEEN 2000 AND 3000;

Time: 2.196 ms

Vacuum and Autovacuum Challenges

High-cardinality data presents challenges to the autovacuum process. Vacuum in PostgreSQL is a maintenance process that reclaims storage occupied by dead tuples, prevents table bloat, and updates statistics for query optimization.

Table and index bloat

High-cardinality columns can lead to larger indexes, which, when combined with frequent updates, can result in index bloat. This increases the amount of data that needs to be scanned during queries, degrading performance.

There are a number of strategies you can use to prevent index bloat from high-cardinality data:

  1. Be selective with indexing high-cardinality columns.

  • Only index columns that are frequently used in WHERE clauses or joins.

  • Consider dropping indexes that aren't being used (analyze query patterns).

  • For columns with very high cardinality like UUIDs, consider if you really need them indexed.

2. Use partial indexes when possible. Instead of indexing all rows, create indexes that only cover the subset of data you regularly query (see the sketch after this list).

3. Consider compound indexes strategically.

  • Put lower cardinality columns first in compound indexes when query patterns allow.

  • This reduces the number of distinct index entries at each level.

  • Example: Instead of separate indexes on (high_cardinality, low_cardinality), use (low_cardinality, high_cardinality).

4. Implement regular maintenance.

  • Set up periodic REINDEX operations during low-traffic periods.

  • Monitor index size and bloat using queries against the system catalogs.

5. Use appropriate index types.

  • Consider using BRIN indexes instead of B-tree for naturally ordered data.

  • Hash indexes can be more efficient for equality comparisons on high-cardinality columns.
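To make strategies 2, 3, and 5 above concrete, here is a minimal sketch against the test_indexes table from earlier; the index names and the 'active' predicate are illustrative, and whether hash or BRIN actually helps depends on your data and query patterns:

-- Partial index: only cover the rows you regularly query
CREATE INDEX idx_active_email ON test_indexes (email_address)
WHERE status = 'active';

-- Compound index: low-cardinality column first, high-cardinality column second
CREATE INDEX idx_status_email ON test_indexes (status, email_address);

-- Hash index: equality-only lookups on a high-cardinality column
CREATE INDEX idx_email_hash ON test_indexes USING hash (email_address);

-- BRIN index: suited to naturally ordered data such as an always-increasing id
CREATE INDEX idx_id_brin ON test_indexes USING brin (id);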

But high-cardinality columns can also cause table bloat. When there are frequent updates or deletions in these columns with many unique values, dead rows can accumulate. This increases the table size and uses up unnecessary disk space, which in turn slows down query performance.

Autovacuum resource consumption

  • High-cardinality tables often involve a significant number of unique values, which can correlate with frequent updates or deletions across many rows.

    • Frequent data modifications in high-cardinality tables tend to generate a higher volume of dead tuples. Autovacuum must scan and clean these dead tuples, requiring more processing effort.

    • High cardinality often leads to larger and more complex indexes. During autovacuum, these indexes also need to be cleaned up and maintained, adding to the resource usage.

    • For very large tables with high cardinality, the autovacuum process needs to scan significant portions of the table, which increases CPU and I/O demand.

  • The autovacuum process removes dead tuples and prevents table bloat. In tables with high cardinality, the large number of unique values and frequent data modifications can make autovacuum operations more resource-intensive, consuming more CPU, memory, and I/O bandwidth.

  • The high rate of data modification in high-cardinality tables can trigger autovacuum more frequently. This can lead to increased overhead and potential contention with regular database operations, especially in write-heavy environments.

Impact on query performance

  • While autovacuum is running, it can impact query performance by competing for system resources. Although designed to minimize locking, autovacuum can still cause contention, particularly in systems with high concurrency.

Configuration challenges

  • Default autovacuum settings may not be optimal for high-cardinality data. Adjusting parameters such as autovacuum_vacuum_cost_limit and autovacuum_vacuum_cost_delay is often necessary to balance performance with maintenance needs.

  • Increasing autovacuum_vacuum_cost_limit and decreasing autovacuum_vacuum_cost_delay make the autovacuum process more aggressive, enabling it to clean up rows more quickly and efficiently. However, this increased aggressiveness may also consume more system resources, so adjustments should be made carefully based on workload and system capacity.
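These parameters can be tuned globally or, often more usefully, per table, since a single hot high-cardinality table usually needs more aggressive settings than the rest of the database. A minimal sketch (the values are illustrative, not recommendations):

-- Per-table autovacuum tuning for a frequently updated, high-cardinality table
ALTER TABLE transactions SET (
    autovacuum_vacuum_scale_factor = 0.02,  -- vacuum after ~2% of rows are dead instead of the 20% default
    autovacuum_vacuum_cost_limit   = 2000,  -- allow each autovacuum run to do more work
    autovacuum_vacuum_cost_delay   = 2      -- shorten the pause between work batches (milliseconds)
);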

With a better understanding of high-cardinality data and its challenges in databases, let’s delve into how TimescaleDB overcomes these limitations with its advanced features.

Solutions for Efficiently Handling High-Cardinality Data

TimescaleDB, an open-source extension for PostgreSQL, is designed to efficiently manage time series, events, analytics, and vector data. It combines PostgreSQL's reliability with specialized features for handling continuous data collection and high cardinality, making it ideal for such applications. TimescaleDB is widely used in different areas, including: 

  • IoT applications

  • Financial analytics

  • Infrastructure monitoring, where large datasets with unique identifiers—such as sensors, devices, or users—generate data continuously

In addition to TimescaleDB, Timescale has gained recognition for its cloud offering. It provides a mature cloud-based PostgreSQL platform designed for resource-intensive workloads, helping you build faster, scale more efficiently, and stay within budget.

We will discuss how Timescale handles high-cardinality data both in TimescaleDB and Timescale Cloud.

TimescaleDB

As mentioned, TimescaleDB is an extension for PostgreSQL designed for storing and analyzing time-series data and other demanding workloads efficiently.

Here’s a deeper look at the features TimescaleDB offers to illustrate how each feature supports high-cardinality data handling.

Hypertables and chunks

Partitioning is an essential strategy for working with high-cardinality time-series data in PostgreSQL. Native PostgreSQL partitioning allows for some degree of data segmentation by dividing large tables into smaller, more manageable parts based on specified criteria, like time intervals. However, native PostgreSQL partitioning has functionality limitations. 

For example, it does not automatically manage and create new partitions as data grows over time. That’s where hypertables and chunks come in.

TimescaleDB uses hypertables as a core feature for managing time-series data. A hypertable acts like a regular PostgreSQL table but automatically partitions data into smaller chunks based on time intervals.

Chunks in TimescaleDB are essentially regular PostgreSQL tables that act as time-and-key-based partitions for a hypertable. Data is dynamically partitioned based on time intervals (the primary dimension), along with the hypertable schema and chunk settings. These chunks take advantage of PostgreSQL's native storage, indexing, and query optimization features, ensuring efficient data insertion and retrieval.

Hypertables and chunks in TimescaleDB optimize storage and query performance for high-cardinality data by partitioning it into time-based segments. This partitioning improves retrieval speed, reduces scan times, and enables efficient, scalable storage, making it ideal for high-cardinality datasets.

Example: In a fleet management application, suppose there are hundreds of GPS-enabled vehicles, each sending location data every second. This setup generates billions of rows in a short time, and each vehicle has a unique identifier, contributing to high cardinality. 

With TimescaleDB, a hypertable is created as the main table structure to organize the data. Here, general metadata and schema (like column names: vehicle_id, timestamp, location) are defined in the hypertable, while the actual data rows are stored in automatically managed "chunk tables." These chunks can be partitioned by day, so each day's data is stored separately. This way, querying data for a specific vehicle and date only accesses relevant chunks, optimizing storage and retrieval efficiency.

Install TimescaleDB extension

CREATE EXTENSION IF NOT EXISTS timescaledb;

Create the table

CREATE TABLE vehicle_location_data (
    vehicle_id INT,
    timestamp TIMESTAMPTZ NOT NULL,
    location GEOGRAPHY(Point, 4326),  -- requires the PostGIS extension
    PRIMARY KEY (vehicle_id, timestamp)
);

Convert to a hypertable

SELECT create_hypertable('vehicle_location_data', 'timestamp', chunk_time_interval => interval '1 day');

Data is partitioned into chunks based on the timestamp column (one chunk per day).

Only relevant chunks will be queried when you access data for a specific vehicle on a particular date.

Querying the data

SELECT * FROM vehicle_location_data
WHERE vehicle_id = 123
  AND timestamp >= '2024-11-01 00:00:00'
  AND timestamp < '2024-11-02 00:00:00';

This query will only scan the chunk for the day of November 1st, 2024, significantly improving performance when dealing with large datasets.

Hybrid-row columnar storage engine

As mentioned, high-cardinality data can occupy significant storage space, but TimescaleDB addresses this with its hybrid-row columnar storage engine. This capability allows high-cardinality data to be compressed, reducing storage requirements and enhancing query performance. 

New data is initially stored as uncompressed rows, but TimescaleDB's built-in job scheduler automatically compresses it into a columnar format (storing data by columns rather than rows) within hypertable chunks, optimizing storage and enabling efficient handling of large high-cardinality datasets.

Example: In an IoT environment with hundreds of thousands of temperature sensors reporting data every minute, storage quickly becomes a concern. By applying compression to data older than 30 days, TimescaleDB could reduce storage requirements while keeping recent data uncompressed and easily accessible for real-time analysis.

Enable compression

ALTER TABLE sensors_compressed SET (timescaledb.compress, timescaledb.compress_segmentby = 'sensor_id');

Add compression policy

SELECT add_compression_policy('sensors_compressed', INTERVAL '30 days');

In our tests, we achieved an impressive 95 % reduction in table size when applying compression to a dataset with billions of rows. What started as a hefty 111 GB table was shrunk to a mere 5.5 GB after compression, significantly optimizing storage and performance.

If you are using Timescale Cloud, you can effortlessly view the compression stats for your tables directly from your dashboard.


Check out the “Timescale Cloud” section below for more details.
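If you are running self-hosted TimescaleDB, or simply prefer SQL, you can also inspect compression results with TimescaleDB's informational functions; a minimal sketch, assuming the sensors_compressed hypertable above and TimescaleDB 2.x:

-- Before/after sizes for a compressed hypertable
SELECT pg_size_pretty(before_compression_total_bytes) AS before,
       pg_size_pretty(after_compression_total_bytes)  AS after
FROM hypertable_compression_stats('sensors_compressed');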

Aggregate functions

High-cardinality data tends to grow rapidly due to the uniqueness of the data, which can make aggregating it into useful summaries slow. Aggregate functions provided by TimescaleDB speed up this process, making data aggregation significantly faster.

Types of aggregate functions available 

There are two main types: continuous aggregates and real-time aggregates. Continuous aggregates automatically and incrementally refresh pre-aggregated data as new data is added. However, they may not include the most recent raw data until the next scheduled refresh. For example, if your scheduled refresh runs hourly and you query 30 minutes after the last refresh, the results will exclude raw data from the past 30 minutes.

Setting up a continuous aggregate in TimescaleDB involves two distinct steps:

Creating the view

  • A continuous aggregate view is defined using a SQL query. This query specifies how data from a hypertable (or another continuous aggregate) should be aggregated over time. The view calculates and stores aggregated data, like averages or sums, to improve query performance on large datasets.

Enabling a refresh policy

  • After creating the view, you need to define a refresh policy determining how often the aggregated data is updated. Without this policy, the view will not automatically reflect new or changed data in the underlying table.

Here are some key features of continuous aggregates:

  • Continuous aggregates are a type of hypertable that are automatically refreshed in the background as new data is added or old data is modified.

  • Continuous aggregates automatically track changes and update the hypertable without needing manual intervention.

  • Unlike regular PostgreSQL materialized views, continuous aggregates have a lower maintenance burden, as they don't need to be recreated from scratch on each refresh.

  • Continuous aggregates make aggregating high cardinality datasets much faster, even as data grows quickly.

  • To learn more, check out this article.

Example: For high-cardinality data collection (e.g., temperature readings every second), continuous aggregates allow you to aggregate data into time intervals (e.g., hourly averages), improving performance and reducing the need for full table scans.

Create materialized view

Here, we are creating a continuous materialized view called conditions_summary_daily that aggregates temperature data from the conditions table. It groups the data by device and day, calculating the average, maximum, and minimum temperature for each device per day using the time_bucket function to create daily intervals.

CREATE MATERIALIZED VIEW conditions_summary_daily
WITH (timescaledb.continuous) AS
SELECT device,
    time_bucket(INTERVAL '1 day', time) AS bucket,
    AVG(temperature),
    MAX(temperature),
    MIN(temperature)
FROM conditions
GROUP BY device, bucket;

Create a refresh policy for the materialized view

Here, we are creating a refresh policy for the conditions_summary_daily materialized view. It specifies that the aggregate should refresh data every hour, with data starting from one month ago and ending one day before the current time.

This means that if today is February 2nd, the query would refresh the conditions_summary_daily continuous aggregate for data from January 2nd to February 1st. It will not include data from February 2nd in the refresh, as the end_offset => INTERVAL '1 day' excludes the most recent day (February 2nd).

SELECT add_continuous_aggregate_policy('conditions_summary_daily',
    start_offset => INTERVAL '1 month',
    end_offset => INTERVAL '1 day',
    schedule_interval => INTERVAL '1 hour');

Real-time aggregates

Real-time aggregates function similarly to continuous aggregates but also include the most recent raw data in the results, ensuring up-to-date aggregation without waiting for the next refresh. For instance, even if your scheduled refresh runs hourly, querying 30 minutes after the last refresh will include both the pre-aggregated data and raw data from the past 30 minutes, providing the most current results.

Enable real-time aggregates 

ALTER MATERIALIZED VIEW table_name SET (timescaledb.materialized_only = false);

Data retention policies

Managing data retention is critical for high-cardinality datasets, as they can grow exponentially over time. TimescaleDB allows users to define retention policies that automatically drop or archive data that’s no longer relevant. By enforcing retention policies, TimescaleDB ensures that storage usage remains sustainable without requiring manual intervention to delete old data.

Example: In an application monitoring setup, metrics like CPU usage, memory usage, and response times are tracked for thousands of servers. Retaining detailed data for long periods can be costly and unnecessary. By setting a policy to retain raw data for 90 days and then automatically drop older data, TimescaleDB helps keep storage under control while preserving essential metrics.

SELECT add_retention_policy('conditions', INTERVAL '90 day');

The query leverages TimescaleDB's automated data management feature by adding a retention policy. It schedules a background job that periodically checks for and automatically removes data older than 90 days from the specified hypertable named conditions.

These features make TimescaleDB well-equipped for handling the scale, complexity, and performance requirements of high-cardinality time-series data, particularly in environments where continuous data collection and complex analytics are essential. In this article, we compare how TimescaleDB handles high-cardinality data vs. InfluxDB.

Timescale Cloud

Timescale Cloud is a high-performance, developer-focused platform that enhances PostgreSQL with advanced capabilities, including blazing-fast vector search. With TimescaleDB at its core, Timescale Cloud comes fully equipped with all the powerful TimescaleDB features described in the previous sections but allows development teams to have a seamless and worry-free infrastructure management experience while boosting scalability and cost-efficiency.

Timescale Cloud is tailored to three key use cases:

  • Designed for managing and analyzing large-scale time-series data, events, real-time analytics, and vector data. Benefit from faster time-based queries with features like hypertables, continuous aggregates, and hybrid-row columnar storage. Optimize storage usage with native compression, data retention policies, and seamless data tiering to Amazon S3.

  • Ideal for building AI-powered applications. Leverage pgvector and pgvectorscale extensions for fast and precise similarity searches. Use the pgai extension to create vector embeddings (or create them and update them automatically with pgai Vectorizer) and enable advanced LLM reasoning on your data.

  • Perfect for applications requiring robust data consistency and complex queries. Timescale Cloud provides the reliability of an industry-standard RDBMS with ACID compliance, extensive SQL features, JSON support, and extensibility through custom functions, data types, and extensions.

Finally, deploying your database is easier than ever before. With just a few clicks, you can get your dedicated Timescale instance and be ready to deploy your applications in just a few minutes. You can enjoy a free 30-day trial when you sign up for Timescale Cloud.

Conclusion

TimescaleDB provides an effective solution for handling high-cardinality data. The database’s hypertables, compression, and data retention capabilities make it well-suited for high-volume datasets with large numbers of unique entities. 

Timescale Cloud extends these benefits to a fully managed environment, allowing users to focus on data insights rather than operational overhead. Choose the right approach when handling high-cardinality data in PostgreSQL. Self-host TimescaleDB or try Timescale Cloud, free of charge, for 30 days (no credit card required).

Learn more about high cardinality

  • What Is High Cardinality?

  • How Different Databases Handle High-Cardinality Data
