
Published at Dec 26, 2024

How to Handle High-Cardinality Data in PostgreSQL

Written by Semab Tariq

One of the fundamental concepts that developers, DBAs, and data engineers encounter when dealing with databases is the cardinality of data. Simply put, cardinality refers to the number of distinct values in a particular column of a database table. The higher the number of unique values, the higher the cardinality. Many decisions about schema design are based on what the cardinality of the expected data looks like. 

High cardinality presents several challenges. For instance, it can significantly slow down queries and lead to increased memory usage. That’s why it's crucial to optimize high cardinality data to prevent performance issues and avoid putting unnecessary strain on your database.

In this blog, we’ll cover these key points:

  • Understanding what high cardinality is

  • Challenges associated with handling high cardinality data

  • How Timescale addresses these limitations to manage high-cardinality data more efficiently

Let's get started!

Understanding What High Cardinality Is

In databases, high cardinality refers to a column in a table that contains a significant number of unique values. These columns often include a vast range of distinct entries, sometimes reaching millions or even billions.

To better understand high cardinality, imagine you're managing a library of books: 

  • The ISBN number of each book would represent high cardinality data. This is because each ISBN is unique to a particular book edition, and there are millions of unique ISBNs globally. This column would contain a vast range of distinct values.

  • On the other hand, if you're tracking book genres like "fiction," "non-fiction," or "sci-fi," these would represent low cardinality. Since there are a limited number of genres, the genre column has far fewer distinct values.

CREATE TABLE books (
    book_id SERIAL PRIMARY KEY,
    isbn VARCHAR(20) UNIQUE,
    title VARCHAR(255),
    genre VARCHAR(50),
    publish_year INT
);

Breakdown of high and low cardinality columns

  • High-cardinality columns

    • book_id: This is a unique identifier for each book, and it will never be repeated. This makes it high cardinality.

    • isbn: An ISBN is globally unique for each edition of a book, which is why it’s high cardinality. The ISBN column will contain a vast range of distinct values.

    • title: Book titles are often unique across a collection. Even though titles may occasionally repeat (e.g., multiple editions), a large collection is likely to contain a vast number of distinct titles, classifying this as high cardinality.

  • Low-cardinality columns

    • genre: The number of genres in a library collection is small compared to the number of books. Genres such as "fiction," "non-fiction," and "sci-fi" are limited, which makes this a low-cardinality column.

    • publish_year: The publishing year for books can repeat, and there are only a limited number of years in the dataset, so it also has low cardinality.

To learn more about high cardinality, check out this blog post.

Is It Possible for a Table to Have Entirely High Cardinality?

In the previous section, we discussed a mix of high-cardinality columns (such as isbn, book_id, and title) and low-cardinality columns (like genre and publish_year). However, in more complex scenarios, a table can consist entirely of high-cardinality data. In such cases, even the columns that store the actual data (those containing the core information) can also be considered high cardinality.

Examples

  • Consider a table logging API requests where each value represents a unique session token or request ID. In this case, each entry in the value column is unique, making it high cardinality.

  • Imagine a scenario where the value column stores JSON data representing user preferences or configurations. If each JSON object contains distinct settings or configurations specific to each user, this column could be considered high cardinality. Even though JSON objects might share similar structures, the unique content or configuration data in each JSON could lead to a wide variety of values, making the value column high cardinality.

-- Create a table to store user preferences

CREATE TABLE user_preferences (
    user_id INTEGER PRIMARY KEY,
    preferences JSON
);

-- Insert sample data showing variety in JSON values

INSERT INTO user_preferences (user_id, preferences) VALUES
(1, '{
    "theme": "dark",
    "fontSize": 14,
    "notifications": {"email": true, "push": false, "frequency": "daily"},
    "dashboardLayout": [
        {"widget": "calendar", "position": "top-left"},
        {"widget": "tasks", "position": "top-right"}
    ],
    "shortcuts": {"newItem": "ctrl+n", "save": "ctrl+s", "custom1": "alt+1"}
}'),
(2, '{
    "theme": "light",
    "fontSize": 16,
    "notifications": {"email": false, "push": true, "frequency": "weekly"},
    "dashboardLayout": [
        {"widget": "analytics", "position": "full-width"},
        {"widget": "messages", "position": "bottom"}
    ],
    "shortcuts": {"newItem": "cmd+n", "save": "cmd+s", "custom1": "cmd+shift+1"}
}'),
(3, '{
    "theme": "system",
    "fontSize": 12,
    "notifications": {"email": true, "push": true, "frequency": "instant"},
    "dashboardLayout": [
        {"widget": "reports", "position": "left"},
        {"widget": "calendar", "position": "right"},
        {"widget": "tasks", "position": "bottom"}
    ],
    "shortcuts": {"newItem": "n", "save": "s", "custom1": "1", "custom2": "2"}
}');

-- Query to demonstrate the unique values
-- (the JSON type has no equality operator, so cast to text for DISTINCT)
SELECT COUNT(DISTINCT preferences::text) AS unique_preference_count,
       COUNT(*) AS total_rows,
       (COUNT(DISTINCT preferences::text)::float / COUNT(*)) * 100 AS cardinality_percentage
FROM user_preferences;

-- Query to analyze specific preference patterns

SELECT
    json_extract_path_text(preferences::json, 'theme') AS theme,
    json_extract_path_text(preferences::json, 'fontSize') AS font_size,
    json_extract_path_text(preferences::json, 'notifications', 'frequency') AS notification_frequency,
    COUNT(*) AS count
FROM user_preferences
GROUP BY 1, 2, 3
ORDER BY count DESC;

Combination of Columns Can Impact Table Cardinality

The combination of multiple columns in a table, even if each column has low cardinality, can result in high cardinality. This is particularly true when one column contains unique identifiers and the others introduce additional variations.

Example

In an e-commerce order table, you might have columns like order_id, customer_id, and order_status. The order_id and customer_id columns are high cardinality because they have unique values for each order and customer. 

The order_status column, however, has low cardinality since it only has a few distinct values, such as 'pending,' 'shipped,' and 'delivered.' Even though the order_status column has low cardinality, combining the order_id and customer_id columns can create many unique combinations, especially in large-scale e-commerce systems where customers place multiple orders. This makes the table high cardinality overall.
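A quick way to see this effect is to compare the cardinality of each column with the cardinality of the combination. A minimal sketch, assuming a hypothetical orders table with the columns described above:

-- Distinct values per column vs. distinct combinations
SELECT
    COUNT(DISTINCT order_status)                 AS status_cardinality,
    COUNT(DISTINCT customer_id)                  AS customer_cardinality,
    COUNT(DISTINCT (customer_id, order_status))  AS combined_cardinality
FROM orders;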

Recapping

In simple cases, columns like isbn or book_id have high cardinality. However, in more complex tables, even columns storing event data, transaction details, or measurements can have high cardinality. Understanding this helps you design efficient databases and optimize queries, indexes, and performance for both high- and low-cardinality columns.

How to Determine if Your Table Contains High-Cardinality Data

Identifying high-cardinality data is generally straightforward. The key characteristic of high cardinality is that the column contains many distinct, non-repeating values. There are a few ways you can identify such columns:

Distinct count

You can count the number of distinct values in a column using the COUNT(DISTINCT column_name) function. If the count of distinct values is large, you have high-cardinality data.

SELECT COUNT(DISTINCT sensor_id) FROM sensor_readings;

If the result shows a high number (e.g., millions), you are dealing with high cardinality.

Analyze columns

By analyzing the data columns themselves, we can identify which columns have high cardinality. For example:

ID columns

Columns like user_id, sensor_id, or transaction_id typically have high cardinality, as each entry is usually unique to a specific entity or event.

Timestamp columns

Timestamps can exhibit high cardinality, especially in time-series data (such as sensor readings or logs), where each entry often gets a new timestamp, sometimes down to the millisecond level. 

To put it in simple words, imagine a clock paired with a calendar. Each tick of the clock represents a new data entry, recording the exact date and time down to the millisecond. Over days, weeks, or months, this creates an enormous number of unique timestamps, thus increasing the cardinality.

Textual columns

Columns that store unique identifiers, such as customer names, product codes, or transaction IDs, usually have high cardinality. 

For example, in a customer_name column, each name is likely unique to a specific customer. Even if a name like "John Smith" appears multiple times, it can still represent different customers, making it a high-cardinality identifier.

Data distribution

When determining a column's cardinality, we consider not only the number of distinct values but also how these values are spread out in the database, referred to as data distribution. Cardinality measures how many unique values exist in a column, while data distribution tells us how those values are distributed across rows.

Even distribution

If a column contains many unique values, but each value appears roughly the same number of times, it’s considered an even distribution. For example, imagine a sensor_id column where there are 100 sensors, and each sensor records 500 readings. In this case, the data distribution is even because every sensor contributes an equal number of readings to the dataset.

Uneven or skewed distribution

If most values in the column are concentrated around a few items while others are rarely used, we call this uneven or skewed distribution. For example, one sensor might have one million readings, while the rest only have a few hundred.
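To check how evenly values are spread, you can group by the column and compare per-value counts. A minimal sketch, assuming the sensor_readings table used earlier:

-- Rows per sensor; a large gap between the top and bottom counts indicates skew
SELECT sensor_id, COUNT(*) AS readings
FROM sensor_readings
GROUP BY sensor_id
ORDER BY readings DESC
LIMIT 10;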

How does data distribution impact the data cardinality?

The relationship between high cardinality and data distribution lies in understanding both the number of unique values (cardinality) and how those values are spread across your dataset. 

While they are distinct concepts, they are deeply connected in terms of how they affect query performance, data management, and resource optimization. 

Example

High cardinality with uneven distribution can lead to performance bottlenecks. 

  • If a sensor_id column has hundreds of distinct sensor IDs but only a few are frequently queried, queries focusing on those frequently queried values will have to scan through a larger amount of data. This can negatively impact query speed and efficiency.

  • High cardinality with even distribution generally has less of an impact on performance because the queries access the data more evenly. This means no single value is queried too much. However, the impact still depends on how the data is actually queried and used.

Challenges Associated With Handling High Cardinality Data

High cardinality data presents unique challenges in database management, particularly as datasets grow in size and complexity. These challenges impact performance, storage, and analytics. Let’s explore these challenges in detail.

Indexing overhead

High-cardinality columns often require indexes for efficient query performance. However, indexing such data can lead to significant storage and performance issues.

Storage costs

  • High-cardinality columns, when indexed using a B-tree, can result in larger index sizes because each unique value requires a separate entry in the index. These larger indexes can lead to increased disk usage, especially if the column contains millions of unique values.

  • Over time, as data is inserted, updated, or deleted, the index can become fragmented, further increasing the storage requirements and potentially degrading performance.

Write overhead

  • Every time a row is inserted or updated, the database must update the index to reflect the new or changed data. This can be particularly costly for high-cardinality columns because the index must accommodate a large number of unique entries.

  • During write operations, locks may be held on the specific part of the index (e.g., pages or buckets), which can lead to contention and slow down concurrent write operations in high-traffic environments.

Cache efficiency

  • Large indexes may not fit entirely in memory, leading to more frequent disk I/O operations. This increase in operations can reduce the cache hit rate and slow down query performance, as accessing data from disk is significantly slower than accessing it from memory.
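One way to see whether large indexes are hurting cache efficiency is to check PostgreSQL's buffer cache hit ratios from the pg_statio statistics views. A minimal sketch, assuming the test_indexes table created in the example below:

-- Cache hit ratios for a table and its indexes (values close to 1 are good)
SELECT relname,
       heap_blks_hit::float / NULLIF(heap_blks_hit + heap_blks_read, 0) AS heap_hit_ratio,
       idx_blks_hit::float  / NULLIF(idx_blks_hit + idx_blks_read, 0)   AS index_hit_ratio
FROM pg_statio_user_tables
WHERE relname = 'test_indexes';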

Example

In a customer database with millions of users, a column like email_address (high cardinality) indexed for quick lookups will significantly increase the storage footprint and slow down bulk inserts due to index updates.

Step 1: Create the table

CREATE TABLE test_indexes (
    id SERIAL PRIMARY KEY,
    email_address TEXT NOT NULL UNIQUE, -- High-cardinality column
    status TEXT NOT NULL                -- Low-cardinality column
);

Step 2: Insert records

INSERT INTO test_indexes (email_address, status)
SELECT
    concat('user', i, '@example.com') AS email_address,
    CASE WHEN i % 2 = 0 THEN 'active' ELSE 'inactive' END AS status
FROM generate_series(1, 100000) i;

Step 3: Create indexes

-- Index on the high-cardinality column
CREATE INDEX idx_email_address ON test_indexes(email_address);

-- Index on the low-cardinality column
CREATE INDEX idx_status ON test_indexes(status);

Step 4: Check the size for both of the indexes

SELECT
    pg_size_pretty(pg_relation_size('idx_email_address')) AS high_cardinality_index_size,
    pg_size_pretty(pg_relation_size('idx_status')) AS low_cardinality_index_size;

 high_cardinality_index_size | low_cardinality_index_size
-----------------------------+----------------------------
 3992 kB                     | 696 kB
(1 row)

Query Performance Degradation

Queries on high-cardinality columns can be computationally expensive, particularly for operations like GROUP BY or DISTINCT. This is because the database engine has to process a large set of unique values.

Sorting and hashing overhead

  • Operations like GROUP BY and DISTINCT often require sorting or hashing of the data to identify unique values or group similar ones. For high cardinality columns, this means processing a vast number of unique entries, which can be computationally intensive.

  • Sorting a large dataset involves significant CPU and memory resources, as the database engine must arrange all unique values in order.

Memory usage

  • High-cardinality operations can consume a lot of memory because the database engine may need to store all unique values in memory to perform the operation efficiently. If the dataset is too large to fit in memory, it can lead to increased disk I/O, further slowing down the query.

  • When the configured value for work_mem is insufficient, the database engine may use temporary disk storage to handle intermediate results. This disk usage can significantly degrade performance due to the slower speed of disk access compared to memory access.
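You can confirm whether an aggregate is spilling to disk by running it under EXPLAIN (ANALYZE, BUFFERS) and looking for external sorts or temporary file usage, then raising work_mem for the session. A minimal sketch, assuming the transactions table from the example below (the 256MB value is purely illustrative):

SHOW work_mem;                      -- check the current setting

SET work_mem = '256MB';             -- raise it for this session only

EXPLAIN (ANALYZE, BUFFERS)          -- look for on-disk ("external merge") vs. in-memory sorts and hashes
SELECT customer_id, COUNT(*)
FROM transactions
GROUP BY customer_id;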

Execution plan complexity

  • The query planner must choose an optimal execution plan for handling high-cardinality operations. However, the complexity of managing a large number of unique values can lead to suboptimal plans, especially if statistics are outdated or inaccurate.

  • PostgreSQL manages table statistics through a process called ANALYZE, which collects key statistics that include information like the number of rows, the most common values, and the distribution of data within columns. This data is stored in the system catalog and used by the query planner to make informed decisions about query execution plans. To ensure the query planner has up-to-date information, it's important to regularly analyze tables, especially after significant data changes. You can also configure the level of detail collected by setting the default_statistics_target parameter, which determines the default level of detail collected by the ANALYZE command for table statistics. A higher value means more detailed statistics are collected, which can lead to better query planning decisions, especially for complex queries or those involving columns with a wide range of values. However, collecting more detailed statistics can also increase the time and resources required for the ANALYZE operation.
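To give the planner better statistics for a specific high-cardinality column, you can raise that column's statistics target and re-run ANALYZE. A minimal sketch, again assuming the transactions table from the example below (the target values are illustrative):

-- Collect more detailed statistics for one high-cardinality column
ALTER TABLE transactions ALTER COLUMN customer_id SET STATISTICS 1000;

-- Refresh the statistics used by the query planner
ANALYZE transactions;

-- Alternatively, raise the database-wide default (then reload the configuration)
-- ALTER SYSTEM SET default_statistics_target = 500;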

A transaction table with a high-cardinality customer_id column, executing SELECT COUNT(DISTINCT customer_id), requires the database to identify and count each unique customer ID. With billions of rows, this operation demands substantial processing power and memory, leading to longer execution times.

Below is an example where we use a transaction table and insert 50 million rows. In this case, the customer_id column will have high cardinality, while the status column will contain low cardinality data.

This is what the table will look like:

tsdb=> select * from transactions limit 5;
 transaction_id | customer_id | amount |  status   |      transaction_date
----------------+-------------+--------+-----------+----------------------------
              1 |      625317 | 197.04 | completed | 2024-12-19 07:50:15.439523
              2 |      729692 | 636.61 | completed | 2024-12-19 07:50:15.439523
              3 |      680008 | 282.23 | pending   | 2024-12-19 07:50:15.439523
              4 |      736607 | 607.30 | pending   | 2024-12-19 07:50:15.439523
              5 |      849894 | 476.13 | completed | 2024-12-19 07:50:15.439523
(5 rows)

-- Create a table
CREATE TABLE transactions (
    transaction_id SERIAL PRIMARY KEY,
    customer_id INT NOT NULL,
    amount NUMERIC,
    status TEXT DEFAULT 'pending',
    transaction_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Insert the data into the table
INSERT INTO transactions (customer_id, amount, status)
SELECT
    FLOOR(RANDOM() * 1000000)::INT,       -- High variety: customer IDs between 0 and 999,999
    (RANDOM() * 1000)::NUMERIC(10, 2),    -- Random amounts between 0.00 and 1000.00
    CASE
        WHEN RANDOM() < 0.7 THEN 'completed'  -- 70% of transactions
        WHEN RANDOM() < 0.9 THEN 'pending'    -- 20% of transactions
        ELSE 'failed'                         -- 10% of transactions
    END
FROM generate_series(1, 10000000);        -- Generate 10 million rows

-- Performing a SELECT on the high-cardinality column
SELECT COUNT(DISTINCT customer_id) FROM transactions;
  count
----------
 50000000
(1 row)

Time: 21161.682 ms (00:21.162)

-- Performing a SELECT on the low-cardinality column
SELECT COUNT(DISTINCT status) FROM transactions;
 count
-------
     3
(1 row)

Time: 18051.252 ms (00:18.051)

As observed, the SELECT query on the status column with low cardinality data is executing faster, even though the query planner performed a sequential scan (Seq Scan) for both SELECT queries.

Let's see if we can improve the performance with an index:

CREATE INDEX idx_customer_id ON transactions (customer_id);
CREATE INDEX

Re-run the SELECT query:

SELECT COUNT(DISTINCT customer_id) FROM transactions;
  count
----------
 50000000
(1 row)

Time: 8169.219 ms (00:08.169)

This query took only eight seconds to complete, demonstrating that while high-cardinality columns may take longer to execute SELECT queries, the performance can be improved by adding an index to the high-cardinality column.

Aggregation and Analytics Complexity

Aggregating data in high-cardinality columns can be resource-intensive, as it involves grouping a large number of distinct values. This aggregation is particularly problematic for real-time analytics systems where performance is critical.

Data volume and grouping

  • High cardinality columns, like sensor_id in a sensor monitoring system, have a vast number of unique values. Aggregating data requires the system to group data by each unique value, which can be computationally expensive and time-consuming.

  • The sheer volume of data to be processed and grouped can strain system resources, particularly CPU and memory.

  • High cardinality increases the number of groups, leading to higher memory usage to store intermediate results and more CPU cycles to perform calculations.

Real-time constraints

  • Real-time analytics systems demand low-latency processing to provide timely insights. High cardinality aggregation can introduce delays as the system struggles to keep up with the volume and complexity of the data.

  • In real-time systems, multiple queries may run concurrently, each requiring access to the same high cardinality data. This concurrency can lead to contention for resources and further slow down processing.

  • As the number of sensors or data points grows, the system must scale to handle increased load, which can be challenging with high cardinality data.

Example

In the sensor data monitoring system, calculating metrics like average or peak readings per sensor involves processing data for millions of unique sensors. This requires the system to efficiently manage and compute large datasets in real time, which can overwhelm system resources and degrade performance.

Step 1: Create the table

CREATE TABLE sensor_data (
    sensor_id INT,
    timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    reading_value NUMERIC
);

Step 2: Ingest the data

-- Inserting sample data for 1 million sensors, each with random readings
INSERT INTO sensor_data (sensor_id, reading_value)
SELECT
    (random() * 1000000)::INT,      -- Random sensor_id between 0 and 1 million
    random() * 100                  -- Random reading value between 0 and 100
FROM generate_series(1, 30000000);  -- Generating 30 million rows of data

Step 3: Perform aggregate on the data

-- Using window function to calculate the moving average of readings for each sensor

SELECT
    sensor_id,
    timestamp,
    reading_value,
    AVG(reading_value) OVER (
        PARTITION BY sensor_id
        ORDER BY timestamp
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS moving_avg_reading
FROM sensor_data
ORDER BY sensor_id, timestamp;

Time: 45890.669 ms

Partitioning and Sharding Challenges

High-cardinality data presents specific challenges when it comes to partitioning and sharding, primarily due to the following reasons.

Uneven distribution

  • When partitioning by a high cardinality column such as user_id, data distribution across partitions can become uneven. Some partitions may end up with significantly more data than others, leading to "hotspots" where certain partitions experience higher load and contention.

  • This uneven distribution can result in load imbalance, where some partitions are overutilized while others are underutilized. This can degrade overall system performance and lead to inefficient resource utilization.

Increased metadata overhead

  • High-cardinality columns can lead to a large number of partitions, each requiring metadata management. This increases the system’s complexity and can slow down operations involving metadata, such as query planning and execution.

  • Query planning time can increase with a large number of partitions, as the planner needs to evaluate which partitions are relevant for a given query. In turn, this can increase planning time and complexity, which is particularly problematic when dealing with many partitions, as it can lead to slower query execution.

Maintenance complexity

  • Managing a large number of partitions can complicate maintenance tasks such as backups, restores, and schema changes. Each partition may need to be handled individually, increasing administrative overhead.

  • Addressing uneven distribution often requires rebalancing data across partitions, which can be resource-intensive and disruptive to normal operations.

Example

An e-commerce application partitioned by customer_id may face uneven workloads during seasonal sales, as a few high-spending customers generate disproportionate traffic.

Step 1: Create the table with partitions

-- Note: drop any existing transactions table from the earlier example before running this
-- (e.g., DROP TABLE IF EXISTS transactions;)
CREATE TABLE transactions (
    transaction_id SERIAL,
    customer_id INT NOT NULL,
    amount NUMERIC,
    transaction_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (transaction_id, customer_id)  -- the partition key must be part of the primary key
) PARTITION BY RANGE (customer_id);

-- Create partitions (simulating a few partitions for different customer groups)
CREATE TABLE transactions_1 PARTITION OF transactions FOR VALUES FROM (1) TO (1000);

CREATE TABLE transactions_2 PARTITION OF transactions FOR VALUES FROM (1000) TO (2000);

CREATE TABLE transactions_3 PARTITION OF transactions FOR VALUES FROM (2000) TO (3000);

Step 2: Insert dummy data into table

-- Insert a mix of low-spending and high-spending customers
-- Low-spending customers: customer_id between 1 and 1000
-- High-spending customers: customer_id between 2000 and 3000

-- Low-spending customers (simulate 1000 transactions)

INSERT INTO transactions (customer_id, amount)
SELECT
    (random() * 1000)::INT + 1,  -- Random customer_id from 1 to 1000
    random() * 50                -- Random transaction amount between 0 and 50
FROM generate_series(1, 1000);

-- High-spending customers (simulate 1000 transactions with larger amounts)
INSERT INTO transactions (customer_id, amount)
SELECT
    floor(random() * 1000)::INT + 2000,  -- Random customer_id from 2000 to 2999 (high-spending)
    random() * 500 + 100                 -- Random transaction amount between 100 and 600
FROM generate_series(1, 1000);

Step 3: Simulate the traffic during sales

-- Simulate a sales event with a higher volume of transactions for high-spending customers

-- For example, generating 5000 transactions for high-spending customers
INSERT INTO transactions (customer_id, amount)
SELECT
    floor(random() * 1000)::INT + 2000,  -- Random customer_id from 2000 to 2999 (high-spending)
    random() * 500 + 100                 -- Random transaction amount between 100 and 600
FROM generate_series(1, 5000);

Step 4: Examine the performance

-- Time taken to query low-spending customers

EXPLAIN ANALYZE SELECT COUNT(*) FROM transactions WHERE customer_id BETWEEN 1 AND 1000;

Time: 1.043 ms

-- Time taken to query high-spending customers

EXPLAIN ANALYZE SELECT COUNT(*) FROM transactions WHERE customer_id BETWEEN 2000 AND 3000;

Time: 2.196 ms

Vacuum and Autovacuum Challenges

High-cardinality data presents challenges to the autovacuum process. Vacuum in PostgreSQL is a maintenance process that reclaims storage occupied by dead tuples, prevents table bloat, and updates statistics for query optimization.

Table and index bloat

High-cardinality columns can lead to larger indexes, which, when combined with frequent updates, can result in index bloat. This increases the amount of data that needs to be scanned during queries, degrading performance.

There are a number of strategies you can use to prevent index bloat from high-cardinality data:

  1. Be selective with indexing high-cardinality columns.

  • Only index columns that are frequently used in WHERE clauses or joins.

  • Consider dropping indexes that aren't being used (analyze query patterns).

  • For columns with very high cardinality like UUIDs, consider if you really need them indexed.

2. Use partial indexes when possible. Instead of indexing all rows, create indexes that only cover the subset of data you regularly query (see the sketch after this list).

3. Consider compound indexes strategically.

  • Put lower cardinality columns first in compound indexes when query patterns allow.

  • This reduces the number of distinct index entries at each level.

  • Example: Instead of separate indexes on (high_cardinality, low_cardinality), use (low_cardinality, high_cardinality).

4. Implement regular maintenance.

  • Set up periodic REINDEX operations during low-traffic periods.

  • Monitor index size and bloat using queries against the system catalogs.

5. Use appropriate index types.

  • Consider using BRIN indexes instead of B-tree for naturally ordered data.

  • Hash indexes can be more efficient for equality comparisons on high-cardinality columns.
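To make strategies 2, 3, and 5 above concrete, here is a minimal sketch against the test_indexes table from earlier; the index names and the 'active' predicate are illustrative, and whether hash or BRIN actually helps depends on your data and query patterns:

-- Partial index: only cover the rows you regularly query
CREATE INDEX idx_active_email ON test_indexes (email_address)
WHERE status = 'active';

-- Compound index: low-cardinality column first, high-cardinality column second
CREATE INDEX idx_status_email ON test_indexes (status, email_address);

-- Hash index: equality-only lookups on a high-cardinality column
CREATE INDEX idx_email_hash ON test_indexes USING hash (email_address);

-- BRIN index: suited to naturally ordered data such as an always-increasing id
CREATE INDEX idx_id_brin ON test_indexes USING brin (id);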

But high-cardinality columns can also cause table bloat. When there are frequent updates or deletions in these columns with many unique values, dead rows can accumulate. This increases the table size and uses up unnecessary disk space, which in turn slows down query performance.

Autovacuum resource consumption

  • High-cardinality tables often involve a significant number of unique values, which can correlate with frequent updates or deletions across many rows.

    • Frequent data modifications in high-cardinality tables tend to generate a higher volume of dead tuples. Autovacuum must scan and clean these dead tuples, requiring more processing effort.

    • High cardinality often leads to larger and more complex indexes. During autovacuum, these indexes also need to be cleaned up and maintained, adding to the resource usage.

    • For very large tables with high cardinality, the autovacuum process needs to scan significant portions of the table, which increases CPU and I/O demand.

  • The autovacuum process removes dead tuples and prevents table bloat. In tables with high cardinality, the large number of unique values and frequent data modifications can make autovacuum operations more resource-intensive, consuming more CPU, memory, and I/O bandwidth.

  • The high rate of data modification in high-cardinality tables can trigger autovacuum more frequently. This can lead to increased overhead and potential contention with regular database operations, especially in write-heavy environments.

Impact on query performance

  • While autovacuum is running, it can impact query performance by competing for system resources. Although designed to minimize locking, autovacuum can still cause contention, particularly in systems with high concurrency.

Configuration challenges

  • Default autovacuum settings may not be optimal for high-cardinality data. Adjusting parameters such as autovacuum_vacuum_cost_limit and autovacuum_vacuum_cost_delay is often necessary to balance performance with maintenance needs.

  • Increasing autovacuum_vacuum_cost_limit and decreasing autovacuum_vacuum_cost_delay make the autovacuum process more aggressive, enabling it to clean up rows more quickly and efficiently. However, this increased aggressiveness may also consume more system resources, so adjustments should be made carefully based on workload and system capacity.
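These parameters can be tuned globally or, often more usefully, per table, since a single hot high-cardinality table usually needs more aggressive settings than the rest of the database. A minimal sketch (the values are illustrative, not recommendations):

-- Per-table autovacuum tuning for a frequently updated, high-cardinality table
ALTER TABLE transactions SET (
    autovacuum_vacuum_scale_factor = 0.02,  -- vacuum after ~2% of rows are dead instead of the 20% default
    autovacuum_vacuum_cost_limit   = 2000,  -- allow each autovacuum run to do more work
    autovacuum_vacuum_cost_delay   = 2      -- shorten the pause between work batches (milliseconds)
);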

With a better understanding of high-cardinality data and its challenges in databases, let’s delve into how TimescaleDB overcomes these limitations with its advanced features.

Solutions for Efficiently Handling High-Cardinality Data

TimescaleDB, an open-source extension for PostgreSQL, is designed to efficiently manage time series, events, analytics, and vector data. It combines PostgreSQL's reliability with specialized features for handling continuous data collection and high cardinality, making it ideal for such applications. TimescaleDB is widely used in different areas, including: 

  • IoT applications

  • Financial analytics

  • Infrastructure monitoring, where large datasets with unique identifiers—such as sensors, devices, or users—generate data continuously

In addition to TimescaleDB, Timescale has gained recognition for its cloud offering. It provides a mature cloud-based PostgreSQL platform designed for resource-intensive workloads, helping you build faster, scale more efficiently, and stay within budget.

We will discuss how Timescale handles high-cardinality data both in TimescaleDB and Timescale Cloud.

TimescaleDB

As mentioned, TimescaleDB is an extension for PostgreSQL designed for storing and analyzing time-series data and other demanding workloads efficiently.

Here’s a deeper look at the features TimescaleDB offers to illustrate how each feature supports high-cardinality data handling.

Hypertables and chunks

Partitioning is an essential strategy for working with high-cardinality time-series data in PostgreSQL. Native PostgreSQL partitioning allows for some degree of data segmentation by dividing large tables into smaller, more manageable parts based on specified criteria, like time intervals. However, native PostgreSQL partitioning has functionality limitations. 

For example, it does not automatically manage and create new partitions as data grows over time. That’s where hypertables and chunks come in.

TimescaleDB uses hypertables as a core feature for managing time-series data. A hypertable acts like a regular PostgreSQL table but automatically partitions data into smaller chunks based on time intervals.

Chunks in TimescaleDB are essentially regular PostgreSQL tables that act as time-and-key-based partitions for a hypertable. Data is dynamically partitioned based on time intervals (the primary dimension), along with the hypertable schema and chunk settings. These chunks take advantage of PostgreSQL's native storage, indexing, and query optimization features, ensuring efficient data insertion and retrieval.

Hypertables and chunks in TimescaleDB optimize storage and query performance for high-cardinality data by partitioning it into time-based segments. This partitioning improves retrieval speed, reduces scan times, and enables efficient, scalable storage, making it ideal for high-cardinality datasets.

Example: In a fleet management application, suppose there are hundreds of GPS-enabled vehicles, each sending location data every second. This setup generates billions of rows in a short time, and each vehicle has a unique identifier, contributing to high cardinality. 

With TimescaleDB, a hypertable is created as the main table structure to organize the data. Here, general metadata and schema (like column names: vehicle_id, timestamp, location) are defined in the hypertable, while the actual data rows are stored in automatically managed "chunk tables." These chunks can be partitioned by day, so each day's data is stored separately. This way, querying data for a specific vehicle and date only accesses relevant chunks, optimizing storage and retrieval efficiency.

Install TimescaleDB extension

CREATE EXTENSION IF NOT EXISTS timescaledb;

Create the table

CREATE TABLE vehicle_location_data (
    vehicle_id INT,
    timestamp TIMESTAMPTZ NOT NULL,
    location GEOGRAPHY(Point, 4326),  -- requires the PostGIS extension
    PRIMARY KEY (vehicle_id, timestamp)
);

Convert to a hypertable

SELECT create_hypertable('vehicle_location_data', 'timestamp', chunk_time_interval => interval '1 day');

Data is partitioned into chunks based on the timestamp column (one chunk per day).

Only relevant chunks will be queried when you access data for a specific vehicle on a particular date.

Querying the data

SELECT * FROM vehicle_location_data
WHERE vehicle_id = 123
  AND timestamp >= '2024-11-01 00:00:00'
  AND timestamp < '2024-11-02 00:00:00';

This query will only scan the chunk for the day of November 1st, 2024, significantly improving performance when dealing with large datasets.

Hybrid-row columnar storage engine

As mentioned, high-cardinality data can occupy significant storage space, but TimescaleDB addresses this with its hybrid-row columnar storage engine. This capability allows high-cardinality data to be compressed, reducing storage requirements and enhancing query performance. 

New data is initially stored as uncompressed rows, but TimescaleDB's built-in job scheduler automatically compresses it into a columnar format (storing data by columns rather than rows) within hypertable chunks, optimizing storage and enabling efficient handling of large high-cardinality datasets.

Example: In an IoT environment with hundreds of thousands of temperature sensors reporting data every minute, storage quickly becomes a concern. By applying compression to data older than 30 days, TimescaleDB could reduce storage requirements while keeping recent data uncompressed and easily accessible for real-time analysis.

Enable compression

ALTER TABLE sensors_compressed SET (timescaledb.compress, timescaledb.compress_segmentby = 'sensor_id');

Add compression policy

SELECT add_compression_policy('sensors_compressed', INTERVAL '30 days');

In our tests, we achieved an impressive 95 % reduction in table size when applying compression to a dataset with billions of rows. What started as a hefty 111 GB table was shrunk to a mere 5.5 GB after compression, significantly optimizing storage and performance.

If you are using Timescale Cloud, you can effortlessly view the compression stats for your tables directly from your dashboard.


Check out the “Timescale Cloud” section below for more details.
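If you are running self-hosted TimescaleDB, or simply prefer SQL, you can also inspect compression results with TimescaleDB's informational functions; a minimal sketch, assuming the sensors_compressed hypertable above and TimescaleDB 2.x:

-- Before/after sizes for a compressed hypertable
SELECT pg_size_pretty(before_compression_total_bytes) AS before,
       pg_size_pretty(after_compression_total_bytes)  AS after
FROM hypertable_compression_stats('sensors_compressed');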

Aggregate functions

High-cardinality data tends to grow rapidly due to the uniqueness of the data, which can make aggregating it into useful summaries slow. Aggregate functions provided by TimescaleDB speed up this process, making data aggregation significantly faster.

Types of aggregate functions available 

There are two main types: continuous aggregates and real-time aggregates. Continuous aggregates automatically and incrementally refresh pre-aggregated data as new data is added. However, they may not include the most recent raw data until the next scheduled refresh. For example, if your scheduled refresh runs hourly and you query 30 minutes after the last refresh, the results will exclude raw data from the past 30 minutes.

Setting up a continuous aggregate in TimescaleDB involves two distinct steps:

Creating the view

  • A continuous aggregate view is defined using a SQL query. This query specifies how data from a hypertable (or another continuous aggregate) should be aggregated over time. The view calculates and stores aggregated data, like averages or sums, to improve query performance on large datasets.

Enabling a refresh policy

  • After creating the view, you need to define a refresh policy determining how often the aggregated data is updated. Without this policy, the view will not automatically reflect new or changed data in the underlying table.

Here are some key features of continuous aggregates:

  • Continuous aggregates are a type of hypertable that are automatically refreshed in the background as new data is added or old data is modified.

  • Continuous aggregates automatically track changes and update the hypertable without needing manual intervention.

  • Unlike regular PostgreSQL materialized views, continuous aggregates have a lower maintenance burden, as they don't need to be recreated from scratch on each refresh.

  • Continuous aggregates make aggregating high cardinality datasets much faster, even as data grows quickly.

  • To learn more, check out this article.

Example: For high-cardinality data collection (e.g., temperature readings every second), continuous aggregates allow you to aggregate data into time intervals (e.g., hourly averages), improving performance and reducing the need for full table scans.

Create materialized view

Here, we are creating a continuous materialized view called conditions_summary_daily that aggregates temperature data from the conditions table. It groups the data by device and day, calculating the average, maximum, and minimum temperature for each device per day using the time_bucket function to create daily intervals.

CREATE MATERIALIZED VIEW conditions_summary_daily
WITH (timescaledb.continuous) AS
SELECT device,
    time_bucket(INTERVAL '1 day', time) AS bucket,
    AVG(temperature),
    MAX(temperature),
    MIN(temperature)
FROM conditions
GROUP BY device, bucket;

Create a refresh policy for the materialized view

Here, we are creating a refresh policy for the conditions_summary_daily materialized view. It specifies that the aggregate should refresh data every hour, with data starting from one month ago and ending one day before the current time.

This means that if today is February 2nd, the query would refresh the conditions_summary_daily continuous aggregate for data from January 2nd to February 1st. It will not include data from February 2nd in the refresh, as the end_offset => INTERVAL '1 day' excludes the most recent day (February 2nd).

SELECT add_continuous_aggregate_policy('conditions_summary_daily',
    start_offset => INTERVAL '1 month',
    end_offset => INTERVAL '1 day',
    schedule_interval => INTERVAL '1 hour');

Real-time aggregates

Real-time aggregates function similarly to continuous aggregates but also include the most recent raw data in the results, ensuring up-to-date aggregation without waiting for the next refresh. For instance, even if your scheduled refresh runs hourly, querying 30 minutes after the last refresh will include both the pre-aggregated data and raw data from the past 30 minutes, providing the most current results.

Enable real-time aggregates 

ALTER MATERIALIZED VIEW table_name SET (timescaledb.materialized_only = false);

Data retention policies

Managing data retention is critical for high-cardinality datasets, as they can grow exponentially over time. TimescaleDB allows users to define retention policies that automatically drop or archive data that’s no longer relevant. By enforcing retention policies, TimescaleDB ensures that storage usage remains sustainable without requiring manual intervention to delete old data.

Example: In an application monitoring setup, metrics like CPU usage, memory usage, and response times are tracked for thousands of servers. Retaining detailed data for long periods can be costly and unnecessary. By setting a policy to retain raw data for 90 days and then automatically drop older data, TimescaleDB helps keep storage under control while preserving essential metrics.

SELECT add_retention_policy('conditions', INTERVAL '90 day');

The query leverages TimescaleDB's automated data management feature by adding a retention policy. It schedules a background job that periodically checks for and automatically removes data older than 90 days from the specified hypertable named conditions.

These features make TimescaleDB well-equipped for handling the scale, complexity, and performance requirements of high-cardinality time-series data, particularly in environments where continuous data collection and complex analytics are essential. In this article, we compare how TimescaleDB handles high-cardinality data vs. InfluxDB.

Timescale Cloud

Timescale Cloud is a high-performance, developer-focused platform that enhances PostgreSQL with advanced capabilities, including blazing-fast vector search. With TimescaleDB at its core, Timescale Cloud comes fully equipped with all the powerful TimescaleDB features described in the previous sections but allows development teams to have a seamless and worry-free infrastructure management experience while boosting scalability and cost-efficiency.

Timescale Cloud is tailored to three key use cases:

  • Designed for managing and analyzing large-scale time-series data, events, real-time analytics, and vector data. Benefit from faster time-based queries with features like hypertables, continuous aggregates, and hybrid-row columnar storage. Optimize storage usage with native compression, data retention policies, and seamless data tiering to Amazon S3.

  • Ideal for building AI-powered applications. Leverage pgvector and pgvectorscale extensions for fast and precise similarity searches. Use the pgai extension to create vector embeddings (or create them and update them automatically with pgai Vectorizer) and enable advanced LLM reasoning on your data.

  • Perfect for applications requiring robust data consistency and complex queries. Timescale Cloud provides the reliability of an industry-standard RDBMS with ACID compliance, extensive SQL features, JSON support, and extensibility through custom functions, data types, and extensions.

Finally, deploying your database is easier than ever before. With just a few clicks, you can get your dedicated Timescale instance and be ready to deploy your applications in just a few minutes. You can enjoy a free 30-day trial when you sign up for Timescale Cloud.

Conclusion

TimescaleDB provides an effective solution for handling high-cardinality data. The database’s hypertables, compression, and data retention capabilities make it well-suited for high-volume datasets with large numbers of unique entities. 

Timescale Cloud extends these benefits to a fully managed environment, allowing users to focus on data insights rather than operational overhead. Choose the right approach when handling high-cardinality data in PostgreSQL. Self-host TimescaleDB or try Timescale Cloud, free of charge, for 30 days (no credit card required).

Learn more about high cardinality

  • What Is High Cardinality?

  • How Different Databases Handle High-Cardinality Data
