<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[Tiger Data Blog]]></title>
        <description><![CDATA[Insights, product updates, and tips from TigerData (Creators of TimescaleDB) engineers on Postgres, time series & AI. IoT, crypto, and analytics tutorials & use cases.]]></description>
        <link>https://www.tigerdata.com/blog</link>
        <image>
            <url>https://www.tigerdata.com/icon.ico</url>
            <title>Tiger Data Blog</title>
            <link>https://www.tigerdata.com/blog</link>
        </image>
        <generator>RSS for Node</generator>
        <lastBuildDate>Tue, 07 Apr 2026 10:15:19 GMT</lastBuildDate>
        <atom:link href="https://www.tigerdata.com/blog" rel="self" type="application/rss+xml"/>
        <ttl>60</ttl>
        <item>
            <title><![CDATA[PostgreSQL + TimescaleDB: 1,000x Faster Queries, 90% Data Compression, and Much More]]></title>
            <description><![CDATA[TimescaleDB improves PostgreSQL query performance by up to 1,000x, reduces storage utilization by 90%, and provides time-saving features for time-series and analytical applications—while still being 100% Postgres.]]></description>
            <link>https://www.tigerdata.com/blog/postgresql-timescaledb-1000x-faster-queries-90-data-compression-and-much-more</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/postgresql-timescaledb-1000x-faster-queries-90-data-compression-and-much-more</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[General]]></category>
            <category><![CDATA[Engineering]]></category>
            <category><![CDATA[Benchmarks & Comparisons]]></category>
            <dc:creator><![CDATA[Ryan Booz]]></dc:creator>
            <pubDate>Thu, 22 Sep 2022 15:32:44 GMT</pubDate>
            <media:content medium="image" url="https://timescale.ghost.io/blog/content/images/2023/10/Screenshot-2023-10-11-at-7.24.23-PM.png">
            </media:content>
            <content:encoded><![CDATA[
<!--kg-card-begin: html-->
<div class="highlight">
	
    <p class="highlight__text">
<b> Compared to PostgreSQL alone, TimescaleDB can dramatically improve query performance by 1,000x or more, reduce storage utilization by 90%, and provide features essential for time-series and analytical applications. Some of these features even benefit non-time-series data, increasing query performance just by loading the extension. </b> 
    </p>
</div>

<!--kg-card-end: html-->
<p>PostgreSQL is today’s most advanced and most popular open-source relational database. We believe this as much today as we did <a href="https://timescale.ghost.io/blog/when-boring-is-awesome-building-a-scalable-time-series-database-on-postgresql-2900ea453ee2/">five years ago</a> when we chose PostgreSQL as the foundation of TimescaleDB because of its longevity, extensibility, and rock-solid architecture.</p><p>By loading the TimescaleDB extension into a PostgreSQL database, you can effectively “supercharge” PostgreSQL, empowering it to excel for both time-series workloads and classic transactional ones. </p><p>This article highlights how TimescaleDB improves PostgreSQL query performance at scale, increases storage efficiency (thus lowering costs), and provides developers with the tools necessary for building modern, innovative, and cost-effective time-series applications—all while retaining access to the full Postgres feature set and ecosystem.</p><p>(To show our work, this article also presents benchmarks comparing query performance and data ingestion for one billion rows of time-series data between PostgreSQL 14.4 and TimescaleDB 2.7.2. For PostgreSQL, we benchmarked both a single table and declarative partitioning.)</p><h2 id="better-performance-at-scale">Better Performance at Scale</h2><p>With orders of magnitude better performance at scale, TimescaleDB enables developers to build on top of PostgreSQL <em>and </em>“future-proof” their applications.</p><h3 id="1000x-faster-performance-for-time-series-queries">1,000x faster performance for time-series queries</h3><p>The core concept in TimescaleDB is the notion of the “hypertable”: seamless partitioning of data while presenting the abstraction of a single, virtual table across all your data. </p><p>This partitioning enables faster queries by quickly excluding irrelevant data, as well as enabling enhancements to the query planner and execution process. 
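</p><p>As a minimal sketch of what this looks like (the table name and columns here are hypothetical, chosen purely for illustration), a regular table becomes a hypertable with a single function call:</p><pre><code class="language-SQL">-- A plain PostgreSQL table (hypothetical schema for illustration)
CREATE TABLE conditions (
  time        TIMESTAMPTZ NOT NULL,
  device_id   INTEGER,
  temperature DOUBLE PRECISION
);

-- Convert it into a hypertable partitioned on the time column;
-- the chunk interval is an assumption and should be tuned per workload
SELECT create_hypertable('conditions', 'time',
  chunk_time_interval =&gt; INTERVAL '4 hours');
</code></pre><p>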
In this way, a hypertable looks and feels just like a normal PostgreSQL table but enables a lot more.</p><p>For example, one recent query planner improvement excludes data more efficiently for relative <code>now()</code>-based queries (e.g., <code>WHERE time &gt;= now()-’1 week’::interval</code>). To be even more specific, <a href="https://timescale.ghost.io/blog/how-we-fixed-long-running-postgresql-now-queries/">these relative time predicates are constified</a> at planning time to ignore chunks that don't have data to satisfy the query. Furthermore, as the number of partitions increases, planning times can be reduced by 100x or more over vanilla PostgreSQL for the same number of partitions.</p><p>When hypertables are compressed, the amount of data that queries need to read is reduced, leading to dramatic increases in performance of 1000x or more. For more information (including a discussion of this bar chart), keep reading the benchmark below.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/09/single-query-latency-milliseconds.png" class="kg-image" alt="" loading="lazy" width="2000" height="1053" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/09/single-query-latency-milliseconds.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/09/single-query-latency-milliseconds.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2022/09/single-query-latency-milliseconds.png 1600w, https://timescale.ghost.io/blog/content/images/2022/09/single-query-latency-milliseconds.png 2070w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Query latency comparison (ms) between TimescaleDB and PostgreSQL 14.4. 
To see the complete query, scroll down below.</span></figcaption></figure><p>Other enhancements in TimescaleDB apply to both hypertables and normal PostgreSQL tables, e.g., SkipScan, which <a href="https://timescale.ghost.io/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql/">dramatically improves DISTINCT queries on any PostgreSQL table</a> with a matching B-tree index regardless of whether you have time-series data or not.</p><h3 id="reduce-commonly-run-queries-to-milliseconds-even-when-the-original-query-took-minutes-or-hours">Reduce commonly run queries to milliseconds (even when the original query took minutes or hours)<br></h3><p>Today, nearly every time-series application reaches for rolling aggregations to query and analyze data more efficiently. The raw data could be saved per second, minute, or hour (and a plethora of other permutations in between), but what most applications display are time-based aggregates. </p><p>What's more, most time-series applications are append-only, which means that aggregate queries return the same values over and over based on the unchanged raw data. Most of the time, it's much more efficient to store the results of the aggregate query and use those for analytic reporting and analysis. </p><p>Often, developers try materialized views in vanilla PostgreSQL to help; however, they have two main problems with fast-changing time-series data:</p><ul><li>Materialized views <em>recreate the entire view every time the materialization process runs, </em>even if little or no data has changed.</li><li>Materialized views don't provide any data retention management. Any time you delete raw data and update the materialized view, the aggregated data is removed as well.</li></ul><p>In contrast, <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/continuous-aggregates/about-continuous-aggregates/">TimescaleDB’s continuous aggregates</a> solve both of these problems. 
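</p><p>As a sketch (the hypertable and column names here are illustrative, not taken from the benchmark below), a continuous aggregate and its refresh policy can be created like this:</p><pre><code class="language-SQL">-- A daily rollup, maintained incrementally by TimescaleDB
CREATE MATERIALIZED VIEW conditions_daily
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 day', time) AS day,
       device_id,
       avg(temperature) AS avg_temp
FROM conditions
GROUP BY day, device_id;

-- Refresh only the window that may have changed, on a schedule;
-- the offsets and interval are illustrative assumptions
SELECT add_continuous_aggregate_policy('conditions_daily',
  start_offset      =&gt; INTERVAL '3 days',
  end_offset        =&gt; INTERVAL '1 hour',
  schedule_interval =&gt; INTERVAL '1 hour');
</code></pre><p>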
They are updated automatically on the <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/continuous-aggregates/refresh-policies/">schedule you configure</a>, they can have data <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/data-retention/data-retention-with-continuous-aggregates/#about-data-retention-with-continuous-aggregates">retention policies applied separately from the underlying hypertable</a>, and they only update the portions of data that have been modified since the last materialization run.</p><p>When we compare using a continuous aggregate to querying the data directly, customers often see queries that might take minutes or even hours drop to milliseconds. When that query is powering a dashboard or a web page, this can be the difference between snappy and unusable.</p><h2 id="lower-storage-costs">Lower Storage Costs</h2><p>The number one driver of cost for modern time-series applications is storage. Even when storage is cheap, time-series data piles up quickly. TimescaleDB provides two methods to reduce the amount of data being stored: compression and downsampling using continuous aggregates.</p><h3 id="90-or-more-storage-savings-via-best-in-class-compression-algorithms">90% or more storage savings via best-in-class <a href="https://www.tigerdata.com/blog/time-series-compression-algorithms-explained" rel="noreferrer">compression algorithms</a></h3><p>A TimescaleDB hypertable partitions its data into many smaller pieces called “chunks,” and TimescaleDB provides <a href="https://www.tigerdata.com/blog/building-columnar-compression-in-a-row-oriented-database" rel="noreferrer">native columnar compression</a> on this per-chunk basis. </p><p>As we show in the benchmark results (and as we see often in production databases), compression reduced disk consumption by over 90% compared to the same data in vanilla PostgreSQL. 
</p><p>Even better, TimescaleDB doesn't change anything about the PostgreSQL storage system to achieve this level of compression. Instead, TimescaleDB utilizes PostgreSQL storage features, namely TOAST, to transition historical data from row-store to column-store, a key component for querying long-term aggregates over individual columns.</p><p>To demonstrate the effectiveness of compression, here’s a comparison of the total size of the CPU table and indexes in TimescaleDB and in PostgreSQL.</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/09/image-5.png" class="kg-image" alt="" loading="lazy" width="2000" height="1053" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/09/image-5.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/09/image-5.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2022/09/image-5.png 1600w, https://timescale.ghost.io/blog/content/images/2022/09/image-5.png 2070w" sizes="(min-width: 720px) 720px"></figure><p><br>With the proper <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/compression/about-compression/#enable-compression">compression policy in place</a>, hypertable chunks will be compressed automatically once all data in the chunk has aged beyond the specified time interval. </p><p>In practice, this means that a hypertable can store data as row-oriented for newer data and column-oriented for older data simultaneously. Having the data stored as both row and column store also matches the typical query patterns of time-series applications to help improve overall query performance—again, something we see in the benchmark results.</p><p>This reduces the storage footprint and improves query performance even further for many time-series aggregate queries. 
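</p><p>Enabling compression takes only a couple of statements. A rough sketch for a <code>cpu</code> hypertable (the segmenting, ordering, and seven-day horizon here are illustrative assumptions, not prescriptions):</p><pre><code class="language-SQL">-- Choose how compressed data is segmented and ordered on disk
ALTER TABLE cpu SET (
  timescaledb.compress,
  timescaledb.compress_segmentby = 'tags_id',
  timescaledb.compress_orderby   = 'time DESC'
);

-- Compress chunks automatically once all of their data
-- is older than the chosen horizon (assumed: 7 days)
SELECT add_compression_policy('cpu', INTERVAL '7 days');
</code></pre><p>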
Compression is also automatic: users set a compression horizon, and then data is automatically compressed as it ages.</p><p>This also means that users can save significant costs using cloud services that provide separation of compute and storage—such as Tiger Cloud (formerly Timescale Cloud)—so that larger machines aren’t needed just for more storage. </p><h3 id="more-storage-savings-by-easily-removing-or-downsampling-data">More storage savings by easily removing or downsampling data</h3><p>With TimescaleDB, automated <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/data-retention/about-data-retention/">data retention </a>is achieved with <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/data-retention/create-a-retention-policy/">one SQL command</a>:</p><pre><code class="language-SQL">SELECT add_retention_policy('cpu', INTERVAL '7 days');
</code></pre><p>There's no further setup or extra extensions to install or configure. Each day, any partitions older than 7 days will be dropped automatically. If you were to implement this in vanilla PostgreSQL, you’d need to use DELETE to remove records, which is a very costly operation as it needs to scan for the data to remove. Even if you were using PostgreSQL declarative partitioning, you’d still need to automate the process yourself, wasting precious developer time, adding additional requirements, and implementing bespoke code that needs to be supported moving forward.</p><p>One can also combine continuous aggregates and data retention policies to downsample data and then drop the raw measurements, thus saving even more data storage. </p><p>Using this architecture, you can retain higher-level rollup values for a longer period of time, even after the raw data has been dropped from the database. This allows multiple different levels of granularity to be stored in the database, and provides even more ways to control storage costs.</p><h2 id="more-features-to-speed-up-development-time">More Features to Speed Up Development Time</h2><p>TimescaleDB includes more features that speed up development time. This includes a library of over 100 hyperfunctions, which make complex time-series analysis in SQL easy: count approximations, statistical aggregates, and more. TimescaleDB also includes a built-in, multi-purpose job scheduling engine for setting up automated workflows.</p><h3 id="library-of-over-100-hyperfunctions-that-make-complex-analysis-easy">Library of over 100 hyperfunctions that make complex analysis easy</h3><p>TimescaleDB hyperfunctions make data analysis in SQL easy. 
This library includes <a href="https://docs.timescale.com/api/latest/hyperfunctions/time-weighted-averages/">time-weighted averages</a>, <a href="https://docs.timescale.com/api/latest/hyperfunctions/gapfilling-interpolation/locf/">last observation carried forward</a>, <a href="https://docs.timescale.com/api/latest/hyperfunctions/downsample/">downsampling with LTTB or ASAP algorithms</a>, <a href="https://docs.timescale.com/api/latest/hyperfunctions/time_bucket/">time_bucket()</a>, and <a href="https://docs.timescale.com/api/latest/hyperfunctions/gapfilling-interpolation/time_bucket_gapfill/">time_bucket_gapfill()</a>. </p><p>As an example, one could get the average temperature every day for each device over the last seven days, carrying forward the last value for missing readings, with the following SQL.</p><pre><code class="language-SQL">SELECT
  time_bucket_gapfill('1 day', time) AS day,
  device_id,
  avg(temperature) AS value,
  locf(avg(temperature))
FROM metrics
WHERE time &gt; now() - INTERVAL '1 week'
GROUP BY day, device_id
ORDER BY day;
</code></pre><p>For more information on the extensive list of hyperfunctions in TimescaleDB, please visit our <a href="https://docs.timescale.com/api/latest/hyperfunctions/">API documentation</a>.</p><h3 id="built-in-job-scheduler-for-workflow-automation">Built-in job scheduler for workflow automation </h3><p>TimescaleDB provides the ability to schedule the execution of custom stored procedures with <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/user-defined-actions/">user-defined actions</a>. This feature provides access to the same job scheduler that TimescaleDB uses to run all of the native automation jobs for compression, continuous aggregates, data retention, and more. </p><p>This provides similar functionality to a third-party scheduler like <code>pg_cron</code> without needing to maintain multiple <a href="https://www.tigerdata.com/blog/top-8-postgresql-extensions" rel="noreferrer">PostgreSQL extensions</a> or databases.</p><p>We see users doing all sorts of neat stuff with user-defined actions, from calculating complex SLAs to sending event emails based on data correctness to polling tables.</p><h2 id="still-100-postgresql-and-sql">Still 100% PostgreSQL and SQL</h2><p>Notably, because TimescaleDB is packaged as a <a href="https://www.tigerdata.com/blog/top-8-postgresql-extensions" rel="noreferrer">PostgreSQL extension</a>, it achieves these results without forking or breaking PostgreSQL.</p><h3 id="extending-postgresql%E2%80%94not-forking-or-cloning">Extending PostgreSQL—not forking or cloning</h3><p>Postgres is popular at the moment, but a lot of that popularity is with ‘Postgres compatible’ products which might look like Postgres, or talk like Postgres, or query somewhat like Postgres, but aren’t Postgres under the hood (and are sometimes closed-source). </p><p>TimescaleDB is just PostgreSQL. 
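</p><p>Because it is a standard extension, TimescaleDB sits alongside any other extension in the same database (PostGIS is shown here purely as an illustration):</p><pre><code class="language-SQL">-- TimescaleDB is enabled like any other PostgreSQL extension
CREATE EXTENSION IF NOT EXISTS timescaledb;

-- and coexists with the rest of the extension ecosystem
CREATE EXTENSION IF NOT EXISTS postgis;
</code></pre><p>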
One can install other extensions, make full use of the type system, and benefit from the incredibly diverse Postgres ecosystem.</p><h3 id="100-sql">100&nbsp;% SQL</h3><p>Any product that can connect to PostgreSQL can query time-series data stored with TimescaleDB using the same SQL it normally would. While we provide helper functions for working with data, we do not restrict the SQL features one can use. Once in the database, users can combine <a href="https://www.tigerdata.com/blog/time-series-introduction" rel="noreferrer">time series</a> and business data as necessary.</p><h3 id="rock-solid-foundations-thanks-to-postgresql">Rock-solid foundations thanks to PostgreSQL</h3><p>PostgreSQL is not a new database: it has years of production deployments under its belt. High availability, backup and restore, and load-balancing are all solved problems. As we mentioned earlier, we chose Postgres because it was reliable, and TimescaleDB inherits that reliability.</p><h2 id="benchmarking-setup-and-results">Benchmarking Setup and Results</h2><p>This section provides details about how we tested TimescaleDB against vanilla PostgreSQL. Feel free to download the <a href="https://github.com/timescale/tsbs">Time-Series Benchmarking Suite</a> and run it for yourself. If you'd like to get started with TimescaleDB quickly, you can use Tiger Cloud, which lets you <a href="https://console.cloud.timescale.com/signup">sign up for a free 30-day trial</a>.</p><h3 id="benchmark-configuration">Benchmark configuration</h3><p>For this benchmark, all tests were run on the same m5.2xlarge EC2 instance in AWS us-east-1 with the following configuration and software versions. 
</p><ul><li>Versions: TimescaleDB version 2.7.2, community edition, and PostgreSQL 14.4</li><li>One remote client machine running TSBS, one database server, both in the same cloud data center</li><li>TSBS client instance: EC2 m5.4xlarge with 16 vCPU and 64&nbsp;GB memory</li><li>Database server instance: EC2 m5.2xlarge with 8 vCPU and 32&nbsp;GB memory</li><li>OS: both server and client machines ran Ubuntu 20.04</li><li>Disk size: 1&nbsp;TB of EBS GP2 storage</li><li>TSBS config: Dev-ops profile, 4,000 devices recording metrics every 10 seconds over one month.</li></ul><p>We also deliberately chose to use EBS (Elastic Block Store) volumes rather than attached SSDs. While benchmark performance would certainly improve with SSDs, the baseline performance using EBS is illustrative of what many self-hosted users could expect while saving some expenses by using elastic storage.</p><h3 id="database-configuration">Database configuration</h3><p>We ran only one PostgreSQL cluster on the EC2 database instance. The TimescaleDB extension was loaded via <code>shared_preload_libraries</code> but not installed into the PostgreSQL-only database.</p><p>To set sane defaults for the PostgreSQL cluster, we ran <code>timescaledb-tune</code> and set <code>synchronous_commit=off</code> in postgresql.conf. This is a common performance configuration for write-heavy workloads while still maintaining transactional, logged integrity. <strong>All configuration changes applied to both PostgreSQL and TimescaleDB benchmarks alike.</strong></p><h3 id="the-dataset">The dataset</h3><p>As we mentioned earlier, for this benchmark, we used the <a href="https://github.com/timescale/tsbs">Time-Series Benchmarking Suite</a> and generated data for 4,000 devices, recording metrics every 10 seconds, for one month. This generated just over one billion rows of data. 
Because TimescaleDB is a PostgreSQL extension, we could use the same data file and ingestion process, ensuring identical data in each database.</p><h3 id="timescaledb-setup">TimescaleDB setup</h3><p>TimescaleDB uses an abstraction called <a href="https://www.tigerdata.com/blog/database-indexes-in-postgresql-and-timescale-cloud-your-questions-answered" rel="noreferrer">hypertables</a>, which splits large tables into smaller chunks, increasing performance and greatly easing management of large amounts of time-series data.</p><p>We also enabled native compression on TimescaleDB. We compressed everything but the most recent chunk of data, leaving it uncompressed. This configuration is a commonly recommended one where raw, uncompressed data is kept for recent time periods and older data is compressed, enabling greater query efficiency. The parameters we used to enable compression are as follows: we segmented by the <code>tags_id</code> column and ordered by time descending and the <code>usage_user</code> column.</p><p><strong><em>All benchmarks were run on a single PostgreSQL table and on an initially empty TimescaleDB hypertable created with four-hour chunks.</em></strong></p><p>(And for those thinking that we also need to compare TimescaleDB with PostgreSQL Declarative Partitioning, please read on to the end; we discuss that as well.)</p><h2 id="query-latency-deep-dive">Query Latency Deep Dive</h2><p>For this benchmark, we inserted one billion rows of data and then ran a set of queries 100 times each against the respective database. The data, indexes, and queries are exactly the same for both databases. 
The only difference is that the TimescaleDB queries use the <code>time_bucket()</code> function for doing arbitrary interval bucketing, whereas the PostgreSQL queries use the new <code>date_bin()</code> function, introduced in PostgreSQL 14.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/09/Query-latency-deep-dive--1--1.png" class="kg-image" alt="" loading="lazy" width="2000" height="1639" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/09/Query-latency-deep-dive--1--1.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/09/Query-latency-deep-dive--1--1.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2022/09/Query-latency-deep-dive--1--1.png 1600w, https://timescale.ghost.io/blog/content/images/2022/09/Query-latency-deep-dive--1--1.png 2070w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Query latency comparison between PostgreSQL and TimescaleDB, for different queries</span></figcaption></figure><p>The results are clear and consistently reproducible. For one billion rows of data spanning one month of time (with four-hour partitions), <strong><em>TimescaleDB consistently outperformed a vanilla PostgreSQL database running 100 queries at a time.</em></strong> </p><p>There are two main reasons for TimescaleDB's consistent query performance.</p><h3 id="compression-smaller-storage-less-work">Compression = smaller storage + less work</h3><p>In PostgreSQL (and many other databases), table data is stored in an 8&nbsp;KB page (sometimes called a block). If a query has to read 1,000 pages to satisfy it, it reads ~8&nbsp;MB of data. 
If some of that data has to be retrieved from disk, then the query will usually be slower than if all of the data is found in memory (the reserved space known as <em>shared buffers</em> in PostgreSQL; if you’re looking for some insight into PostgreSQL caching, we have <a href="https://timescale.ghost.io/blog/database-scaling-postgresql-caching-explained/">a blog on that</a>).</p><p>With TimescaleDB compression, queries that return the same results have to read significantly fewer pages of data (this is both because of the actual compression and because it can return single columns rather than whole rows). For all of our benchmarking queries, this also translates into higher concurrency for the benchmark duration.</p><p>Stated another way, compression typically impacts fetching historical data most because TimescaleDB can query individual columns rather than entire rows. Because less I/O is occurring for each query, TimescaleDB can handle more queries with a lower standard deviation than vanilla PostgreSQL.</p><p>Let's look at two examples of how this plays out between the two databases using two of the queries above, <code>cpu-max-all-1</code> and <code>single-groupby-1-1-12</code>.</p><h3 id="single-groupby-1-1-12"><code>single-groupby-1-1-12</code></h3><p>We selected one of the queries from the benchmark and ran it on both databases. Recall that each database has the exact same data and indexes on uncompressed data. TimescaleDB has the advantage of being able to segment and order compressed data in a way that's beneficial to typical application queries.</p><pre><code class="language-SQL">EXPLAIN (ANALYZE,BUFFERS)
SELECT time_bucket('1 minute', time) AS minute,
        max(usage_user) as max_usage_user
        FROM cpu
        WHERE tags_id IN (
          SELECT id FROM tags WHERE hostname IN ('host_249')
        ) 
        AND time &gt;= '2022-08-03 06:16:22.646325 +0000' 
        AND time &lt; '2022-08-03 18:16:22.646325 +0000'
        GROUP BY minute ORDER BY minute;</code></pre><p>When we run the <code>EXPLAIN</code> on this query and ask for <code>BUFFERS</code> to be returned, we start to get a hint of what's happening.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/09/single-query-latency-single-groupby-1-1-12.png" class="kg-image" alt="" loading="lazy" width="2000" height="1053" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/09/single-query-latency-single-groupby-1-1-12.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/09/single-query-latency-single-groupby-1-1-12.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2022/09/single-query-latency-single-groupby-1-1-12.png 1600w, https://timescale.ghost.io/blog/content/images/2022/09/single-query-latency-single-groupby-1-1-12.png 2070w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Query latency vs volume of data that has to be read to satisfy the query in TimescaleDB and PostgreSQL.</span></figcaption></figure><p>Two things quickly jump out from these results. First, the execution times are significantly lower than the benchmarking results above; individually, these queries execute pretty fast. Second, PostgreSQL has to read approximately 27x more data to satisfy the query. When 16 workers request data across the time range, PostgreSQL has to do a lot more I/O, which consumes resources. TimescaleDB can simply handle a higher concurrency for the same workload. </p><h3 id="cpu-max-all-1"><code>cpu-max-all-1</code></h3><p>Again, we can clearly see the impact of compression on TimescaleDB's ability to handle a higher concurrent load when compared to vanilla PostgreSQL for time-series queries.</p><pre><code class="language-SQL">EXPLAIN (ANALYZE, buffers) 
SELECT
   time_bucket('3600 seconds', time) AS hour,
   max(usage_user) AS max_usage_user,
   max(usage_system) AS max_usage_system,
   max(usage_idle) AS max_usage_idle,
   max(usage_nice) AS max_usage_nice,
   max(usage_iowait) AS max_usage_iowait,
   max(usage_irq) AS max_usage_irq,
   max(usage_softirq) AS max_usage_softirq,
   max(usage_steal) AS max_usage_steal,
   max(usage_guest) AS max_usage_guest,
   max(usage_guest_nice) AS max_usage_guest_nice 
FROM cpu 
WHERE  
   tags_id IN (
      SELECT id FROM tags WHERE hostname IN ('host_249')
   )
   AND time &gt;= '2022-08-08 18:16:22.646325 +0000' 
   AND time &lt; '2022-08-09 02:16:22.646325 +0000' 
GROUP BY HOUR 
ORDER BY HOUR;</code></pre><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/09/single-query-latency-cpu-max-all-1.png" class="kg-image" alt="" loading="lazy" width="2000" height="1053" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/09/single-query-latency-cpu-max-all-1.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/09/single-query-latency-cpu-max-all-1.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2022/09/single-query-latency-cpu-max-all-1.png 1600w, https://timescale.ghost.io/blog/content/images/2022/09/single-query-latency-cpu-max-all-1.png 2070w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Query latency vs volume of data that has to be read to satisfy the query, in TimescaleDB and PostgreSQL</span></figcaption></figure><p>With compression, TimescaleDB does significantly less work to retrieve the same data, resulting in faster queries and higher query concurrency.</p><h3 id="time-ordered-queries-just-work-better">Time-ordered queries just work better</h3><p>TimescaleDB hypertables require a time column to partition the data. Because time is an essential (and known) part of each row and chunk, TimescaleDB can intelligently improve how the query is planned and executed to take advantage of the time component of the data.</p><p>For example, let's query for the maximum CPU usage for each minute for the last 10 minutes.</p><pre><code class="language-SQL">EXPLAIN (ANALYZE,BUFFERS)        
SELECT time_bucket('1 minute', time) AS minute, 
  max(usage_user) 
FROM cpu 
WHERE time &gt; '2022-08-14 07:12:17.568901 +0000' 
GROUP BY minute 
ORDER BY minute DESC 
LIMIT 10;
</code></pre><p>Because TimescaleDB understands that this query is aggregating on time and the result is ordered by the time column (something each chunk is already ordering by in an index), it can use the ChunkAppend custom execution node. In contrast, PostgreSQL plans five workers to scan all partitions before sorting the results and finally doing a <code>GroupAggregate</code> on the time column.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/09/single-query-latency-chunkappend-1.png" class="kg-image" alt="" loading="lazy" width="2000" height="1053" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/09/single-query-latency-chunkappend-1.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/09/single-query-latency-chunkappend-1.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2022/09/single-query-latency-chunkappend-1.png 1600w, https://timescale.ghost.io/blog/content/images/2022/09/single-query-latency-chunkappend-1.png 2070w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Query latency vs volume of data that has to be read to satisfy the query, in TimescaleDB and PostgreSQL</span></figcaption></figure><p>TimescaleDB scans less data and doesn't need to spend time re-sorting the data that it knows is already sorted in the chunk. For time-series data with a known order and constraints, TimescaleDB works better for most queries than vanilla PostgreSQL.</p><h2 id="ingest-performance">Ingest Performance</h2><p>Intriguingly, ingest performance for TimescaleDB and PostgreSQL is nearly identical, a dramatic improvement for PostgreSQL given the <a href="https://timescale.ghost.io/blog/timescaledb-vs-6a696248104e/">results five years ago with PostgreSQL 9.6</a>. 
However, TimescaleDB still consistently finished with an average rate 3,000 to 4,000 rows/second higher than a single PostgreSQL table.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/09/insert-performance-of-1-billion-rows--1--1.png" class="kg-image" alt="" loading="lazy" width="2000" height="1349" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/09/insert-performance-of-1-billion-rows--1--1.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/09/insert-performance-of-1-billion-rows--1--1.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2022/09/insert-performance-of-1-billion-rows--1--1.png 1600w, https://timescale.ghost.io/blog/content/images/2022/09/insert-performance-of-1-billion-rows--1--1.png 2070w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Insert performance comparison between TimescaleDB 2.7.2 and PostgreSQL 14.4</span></figcaption></figure><p>This shows that while vast improvements have been made in PostgreSQL, TimescaleDB hypertables also continue to perform exceptionally well. Beyond the raw rate, the other characteristics of ingest performance are nearly identical between TimescaleDB and PostgreSQL. Modifying the batch size for the number of rows to insert at a time affects both databases the same way: small batch sizes of a few hundred rows significantly hinder ingest performance, while batch sizes of 10,000 to 15,000 rows seem to be about optimal for this dataset.</p><h2 id="declarative-partitioning">Declarative Partitioning</h2><p>In the benchmarks above, we tested TimescaleDB against a single PostgreSQL table simply because that’s the default option that most people end up using. PostgreSQL also has support for native declarative partitioning, which has been maturing over the past few years. 
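</p><p>To make the comparison concrete, here is a sketch of what each setup involves (the schema is simplified from the benchmark's <code>cpu</code> table, and the partition boundary is illustrative):</p><pre><code>-- Option A: native declarative partitioning.
-- Each partition must be created by hand, and new ones
-- added before data for a new time range arrives.
CREATE TABLE cpu (
    time       timestamptz NOT NULL,
    usage_user double precision
) PARTITION BY RANGE (time);

CREATE TABLE cpu_2022_08 PARTITION OF cpu
    FOR VALUES FROM ('2022-08-01') TO ('2022-09-01');

-- Option B: TimescaleDB. Starting from the same table defined as a
-- plain (non-partitioned) table, one call makes it a hypertable;
-- chunks are then created automatically as data is inserted.
SELECT create_hypertable('cpu', 'time');</code></pre><p>With the hypertable approach, chunk creation, per-chunk indexes, and retention are managed for you as data arrives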
</p><p>For the sake of completeness, we also tested TimescaleDB against native declarative partitioning. As the graphic below shows, TimescaleDB is still 1,000x faster for some queries, with strong performance gains across the board. Ingest performance was similar between TimescaleDB and declarative partitioning.</p><p>In fact, if anything, the takeaway from these tests was that while declarative partitioning has matured, the gap between using a single table and declarative partitioning has shrunk.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2023/04/image.png" class="kg-image" alt="" loading="lazy" width="2000" height="1639" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2023/04/image.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2023/04/image.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2023/04/image.png 1600w, https://timescale.ghost.io/blog/content/images/2023/04/image.png 2070w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Query latency comparison between TimescaleDB and PostgreSQL with declarative partitioning</span></figcaption></figure><p>Using declarative partitioning is also harder. You need to manually pre-create partitions, ensure there are no data gaps, prevent inserts outside your partition ranges, and create more partitions as time goes on.</p><p>With TimescaleDB, in contrast, none of this is necessary: a single <code>create_hypertable</code> command converts a standard table into a hypertable, and TimescaleDB takes care of the rest.</p><h2 id="conclusion">Conclusion</h2><p>TimescaleDB harnesses the power of the extension framework to supercharge PostgreSQL for time-series and analytical applications. 
With additional features like compression and continuous aggregates, TimescaleDB provides not only the most performant way of using time-series data in PostgreSQL but also the best developer experience. </p><p>Compared to traditional PostgreSQL, TimescaleDB enables 1,000x faster time-series queries, compresses data by 90&nbsp;%, and provides access to advanced time-series analysis tools and operational features specifically designed to ease data management. TimescaleDB also provides benefits for other types of queries with features like SkipScan—just by installing the extension.</p><p>In short, TimescaleDB extends PostgreSQL to enable developers to continue to use the database they love for time series, perform better at scale, spend less, and streamline data analysis and operations. </p><p>If you’re looking to improve your <a href="https://www.tigerdata.com/learn/building-a-scalable-database" rel="noreferrer">database scalability</a>, try our hosted service, <a href="https://www.timescale.com/cloud">Tiger Cloud</a>. You will get the PostgreSQL you know and love with extra features for time series (<a href="https://timescale.ghost.io/blog/how-we-made-data-aggregation-better-and-faster-on-postgresql-with-timescaledb-2-7/">continuous aggregation</a>, <a href="https://docs.timescale.com/api/latest/compression/">compression</a>, <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/data-retention/about-data-retention/#drop-data-by-chunk">automatic retention policies</a>, <a href="https://timescale.ghost.io/blog/how-postgresql-aggregation-works-and-how-it-inspired-our-hyperfunctions-design-2/">hyperfunctions</a>), plus a platform with <a href="https://timescale.ghost.io/blog/how-high-availability-works-in-our-cloud-database/">automated backups, high availability</a>, automatic upgrades, and much more. <a href="https://console.cloud.timescale.com/signup">You can use it for free for 30 days; no credit card required</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[State of PostgreSQL 2022—13 Tools That Aren't psql]]></title>
            <description><![CDATA[Performance and tooling are frequently debated in the State of PostgreSQL survey, and this year was no exception. With psql remaining the number one tool for querying and admin among PostgreSQL users, we decided to compile a list of tools—that aren’t psql—to broaden your options.]]></description>
            <link>https://www.tigerdata.com/blog/state-of-postgresql-2022-13-tools-that-arent-psql</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/state-of-postgresql-2022-13-tools-that-arent-psql</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[State of PostgreSQL]]></category>
            <dc:creator><![CDATA[Ryan Booz]]></dc:creator>
            <pubDate>Tue, 26 Jul 2022 13:26:48 GMT</pubDate>
            <media:content medium="image" url="https://timescale.ghost.io/blog/content/images/2022/07/Blog-Hero.png">
            </media:content>
            <content:encoded><![CDATA[<p>The State of PostgreSQL 2022 survey closed a few weeks ago, and we're hard at work cleaning and analyzing the data to provide the best insights we can for the PostgreSQL community.</p><p>In the database community, however, there are usually two things that drive lots of <em>discussion</em> year after year: performance and tooling. During this year's survey, we modified the questions slightly so that we could focus on three specific use cases and the PostgreSQL tools that the community finds most helpful for each: querying and administration, development, and data visualization.</p><h2 id="postgresql-tools-what-do-we-have-against-psql">PostgreSQL Tools: What Do We Have Against psql?</h2><p>Absolutely nothing! As evidenced by the majority of respondents (69.4 %) who mentioned using psql for querying and administration, it's the ubiquitous choice for so many PostgreSQL users, and there are already good documentation and community-contributed resources (<a href="https://psql-tips.org/">https://psql-tips.org/</a> by <a href="https://www.youtube.com/watch?v=j2gHN0ItUq8">Laetitia Avrot</a> is a great example) to learn more about it.</p><p>So that got us thinking. What other tools did folks bring up often for interacting with PostgreSQL across the three use cases mentioned above?</p><p>I'm glad we asked. 😉</p><h2 id="postgresql-querying-and-administration">PostgreSQL Querying and Administration</h2><p>As we just said, psql is by far the most popular tool for interacting with PostgreSQL. 🎉</p><p>It's clear, however, that many users with all levels of experience do trust other tools as well.</p><h3 id="query-and-administration-tools">Query and administration tools</h3><p>pgAdmin (35 %), DBeaver (26 %), DataGrip (13 %), and IntelliJ (10 %) IDEs received the most mentions. Most of these aren't surprising if you've been working with databases, PostgreSQL or not. 
The most popular GUIs (pgAdmin and DBeaver) are open source and freely available to use. The next most popular GUIs (DataGrip and IntelliJ) are licensed per seat. However, if your company or team already uses JetBrains tools, you might already have access to them.</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/07/Blog---10-tools-that-aren-t-psql.png" class="kg-image" alt loading="lazy" width="2000" height="1310" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/07/Blog---10-tools-that-aren-t-psql.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/07/Blog---10-tools-that-aren-t-psql.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2022/07/Blog---10-tools-that-aren-t-psql.png 1600w, https://timescale.ghost.io/blog/content/images/2022/07/Blog---10-tools-that-aren-t-psql.png 2000w" sizes="(min-width: 720px) 720px"></figure><p>What interested me more were the mentions just after the popular tools I expected to see. Often, this next set of PostgreSQL tools has gained enough attention from community members that there's clearly a value proposition worth investigating. If they can be helpful to my (or your) development workflow in certain situations, I think it's worth digging a little deeper.</p><h3 id="pgcli">pgcli</h3><!--kg-card-begin: markdown--><p>First on the list is <a href="https://www.pgcli.com/">pgcli</a>, a Python-based command-line tool and one of many <a href="https://github.com/dbcli">dbcli</a> tools created for various databases. Although this is not a replacement for <code>psql</code>, it provides an interactive, auto-complete interface for writing SQL and getting results. Syntax highlighting and some basic support for psql backslash commands are included. 
If you love to stay in the terminal but want a little more interactivity, the dbcli tools have been around for quite some time, have a nice community of support, and can make day-to-day database exploration a little easier.</p>
<!--kg-card-end: markdown--><h3 id="azure-data-studio">Azure Data Studio</h3><p>Introduced as a beta in December 2017 by the Microsoft database tooling team, <a href="https://docs.microsoft.com/en-us/sql/azure-data-studio/download-azure-data-studio?view=sql-server-ver16">Azure Data Studio</a> has been built on top of the same Electron platform as Visual Studio Code. Although the primary feature set is currently geared towards SQL Server (for obvious reasons), the ability to connect to PostgreSQL has been available since 2019.</p><p>There are a couple of unique features in Azure Data Studio (ADS) that work with both SQL Server and PostgreSQL connections that I think are worth mentioning. </p><p>First, ADS includes the ability to create and run SQL-based Jupyter Notebooks. Typically you'd have to wrap your SQL inside of another runtime like Python, but ADS provides the option to select the "SQL" kernel and deals with the connection and SQL wrapping behind the scenes.</p><p>Second, ADS provides the ability to export query results to Excel without any plugins needed. While there are (seemingly) a thousand ways to quickly get a result set into CSV, producing a correctly formatted Excel file requires a plugin with almost any other tool. Regardless of how you feel about Excel, it is still the tool of choice for many data analysts, and being able to provide an Excel file easily does help sometimes.</p><p>Finally, ADS also provides some basic charting capabilities using query results. There's no need to set up a notebook and use a <a href="https://plotly.com/python/">charting library like plotly</a> if you just need to get some quick visualizations on the data. 
I've had a few hiccups with the capabilities (it's certainly not intended as a serious data analytics tool), but it can be helpful to get some quick chart images to share while exploring query data.</p><h3 id="postico">Postico</h3><p>For anyone using macOS, <a href="https://eggerapps.at/postico/">Postico</a> is a GUI application that's been recommended in many of my circles. Many folks prefer the native macOS feel, as well as the unique usability and editing features that make working with PostgreSQL simple and intuitive.</p><h3 id="up-and-coming">Up and coming</h3><p>We'll leave it to you to look through the data and see which other GUI/query tools fellow PostgreSQL users rely on, but there are a few that were mentioned multiple times and even caused me to hit Google a few times to find out more. Some are free and open source, while others require licenses but provide interesting features like built-in data analytics capabilities. Whether you end up using any of these or not, it's good to see continued innovation within the tooling market, something that doesn't seem to be slowing down decades into our SQL journey. </p><ul><li><a href="https://arctype.com/">Arctype</a></li><li><a href="https://tableplus.com/">TablePlus</a></li><li><a href="https://www.aquafold.com/">Aqua Data Studio</a></li><li><a href="https://www.beekeeperstudio.io/">Beekeeper Studio</a></li></ul><h2 id="helpful-third-party-postgresql-tools-for-application-development">Helpful Third-Party PostgreSQL Tools for Application Development</h2><p>Although the GUI/administration landscape is certainly as active as ever, one of the most impactful features of PostgreSQL is how extensible it is. 
If the core application doesn't provide exactly what your application needs, there's a good chance someone (or some company) is working to provide that functionality.</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/07/Blog---10-tools-that-aren-t-psql--1-.png" class="kg-image" alt loading="lazy" width="2000" height="1726" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/07/Blog---10-tools-that-aren-t-psql--1-.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/07/Blog---10-tools-that-aren-t-psql--1-.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2022/07/Blog---10-tools-that-aren-t-psql--1-.png 1600w, https://timescale.ghost.io/blog/content/images/2022/07/Blog---10-tools-that-aren-t-psql--1-.png 2000w" sizes="(min-width: 720px) 720px"></figure><p>The total distinct number of tools mentioned was similar to GUI/administration tools and generally fell into four categories: management features, cluster monitoring, query plan insights, and database DevOps tooling. For this blog post, we're going to focus on the first three areas.</p><h3 id="management-features">Management features</h3><p>It's not surprising that the most popular third-party PostgreSQL tools tend to be focused on daily management tasks of some sort. Two of the most popular tools in this area are mainstays in most self-hosted PostgreSQL circles.</p><p><strong>pgBouncer</strong></p><!--kg-card-begin: markdown--><p>PostgreSQL creates one new process (not thread) per connection. Without proper tuning and a right-sized server, a database can quickly become overwhelmed with unplanned spikes in usage. <a href="https://www.pgbouncer.org/">pgBouncer</a> is an open-source connection pooling application that helps manage connection usage for high-traffic applications.</p>
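<p>As a rough sketch of what deploying pgBouncer involves, a minimal <code>pgbouncer.ini</code> might look like the following (the hostnames, paths, and pool sizes here are illustrative, not recommendations—see the pgBouncer documentation for the full option list):</p><pre><code>[databases]
; clients connect to pgBouncer on port 6432; it forwards to Postgres
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
listen_addr = *
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
; reuse server connections between client transactions
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 20</code></pre>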
<!--kg-card-end: markdown--><p>If your database is self-hosted or your DBaaS doesn't provide some kind of connection pooling management for you, pgBouncer can be installed wherever makes sense in your application architecture to provide better connection management.</p><p><strong>pgBackRest</strong></p><!--kg-card-begin: markdown--><p>Database backups are essential, obviously, and PostgreSQL has always had standard tooling for backup and restore. But as databases have grown in size and application architectures have become more complex, performing these tasks well with <code>pg_dump</code> and <code>pg_restore</code> alone can be more difficult than intended.</p>
<!--kg-card-end: markdown--><p>The Crunchy Data team created <a href="https://pgbackrest.org/">pgBackRest</a> to provide a full-fledged backup and restore system with many necessary features for enterprise workloads. Multi-threaded backup and compression, multiple repository locations, and backup resume are just a few features that make this a common and valuable tool for any PostgreSQL administrator.</p><h3 id="cluster-monitoring">Cluster monitoring</h3><p>The second area of third-party PostgreSQL tools that shows up often focuses on improved database monitoring, which includes query monitoring in most cases. There are a lot of folks tackling this problem area from many different angles, which demonstrates the continued need that many developers and administrators have when managing PostgreSQL.</p><p><strong>pgBadger</strong></p><p>PostgreSQL has a lot of settings that can be tuned and details that can be logged into server logs, but there is no built-in functionality for analyzing that data holistically. This is where pgBadger steps in to help generate useful reports from all of the data your server is logging.</p><p><a href="https://github.com/darold/pgbadger">pgBadger</a> is one of a few popular PostgreSQL tools written in <a href="https://www.perl.org/">Perl</a> (which surprises me for some reason), but the developer has gone to great lengths to not require lots of Perl-specific modules for drawing charts and graphs, instead relying on common JavaScript libraries in the rendered reports.</p><p>There's a lot to look at with pgBadger, and the larger PostgreSQL community often recommends it as a helpful, long-term debugging tool for server performance issues.</p><p><strong>pganalyze</strong></p><p><a href="https://pganalyze.com/">pganalyze</a> has grown in popularity quite a lot over the last few years. 
<a href="https://twitter.com/LukasFittl">Lukas Fittl</a> has done a great job adding new features and capabilities while also providing a number of great PostgreSQL community resources across various platforms.</p><!--kg-card-begin: markdown--><p>pganalyze is a fee-based product that uses data provided by standard plugins (<code>pg_stat_statements</code> for example) which is then consumed through a collector that sends the data to a cloud service. If you use pganalyze to query log information as well (e.g., long-running queries), then features like example problem queries and index advisor could be really helpful for your development workflow and user experience.</p>
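<p>For a sense of the raw data such tools build on, you can query <code>pg_stat_statements</code> yourself (the extension must be listed in <code>shared_preload_libraries</code>; the <code>*_exec_time</code> column names shown are the PostgreSQL 13+ spelling—older versions use <code>total_time</code> and <code>mean_time</code>):</p><pre><code>CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Top five statements by cumulative execution time
SELECT query,
       calls,
       total_exec_time,
       mean_exec_time,
       rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 5;</code></pre>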
<!--kg-card-end: markdown--><h3 id="query-plan-analysis">Query plan analysis</h3><p>No discussion about PostgreSQL would be complete without mentioning tools that help you understand EXPLAIN output better. This is one area many people struggle with, often depending on their previous experience with other databases, and a small set of helpful tools has been growing in popularity to assist with this essential task.</p><p><strong>Depesz EXPLAIN and Dalibo EXPLAIN</strong></p><p>Both <a href="https://explain.depesz.com/">Depesz</a> and <a href="https://explain.dalibo.com/">Dalibo</a> EXPLAIN provide a quick, free platform for taking a PostgreSQL explain plan, highlighting which operations are causing a slow query, and, in some cases, offering helpful hints to speed things up. Also, if you let them, both tools provide a permalink to the output for you to share with others if necessary.</p><p><strong>pgMustard</strong></p><p>One of my favorite EXPLAIN tools is <a href="https://www.pgmustard.com/">pgMustard</a>, created and maintained by <a href="https://uk.linkedin.com/in/michristofides">Michael Christofides</a>. This is a for-fee tool, but it provides a lot of unique insights and features that others currently don't. Michael is also doing great work within the community, even recently starting a <a href="http://postgres.fm">PostgreSQL podcast</a> with Nikolay Samokhvalov, <a href="https://timescale.ghost.io/blog/what-is-sql-used-for-build-environments-where-devs-can-experiment/">with whom we recently talked about all things SQL</a>.</p><h2 id="which-visualization-tools-do-you-use">Which Visualization Tools Do You Use?</h2><p>The final tooling question on the State of PostgreSQL survey asked about visualization tools that folks used. 
Without a doubt, Grafana was the top vote-getter, but that's something we could have probably guessed pretty easily.</p><p>I <em>was</em> surprised that the next two top vote-getters were pgAdmin and DBeaver, both popular database GUI tools we mentioned earlier. In both cases, visualization capabilities are somewhat limited, so it's hard to tell exactly what kind of features are being used that would categorize them as visualization tools.</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/07/Blog---10-tools-that-aren-t-psql-2.png" class="kg-image" alt loading="lazy" width="2000" height="676" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/07/Blog---10-tools-that-aren-t-psql-2.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/07/Blog---10-tools-that-aren-t-psql-2.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2022/07/Blog---10-tools-that-aren-t-psql-2.png 1600w, https://timescale.ghost.io/blog/content/images/2022/07/Blog---10-tools-that-aren-t-psql-2.png 2000w" sizes="(min-width: 720px) 720px"></figure><p>The next group of tools is more interesting to me, and I wanted to highlight a few that might pique your interest enough to investigate further.</p><h3 id="qgis">QGIS</h3><p><a href="https://www.qgis.org/en/site/">QGIS</a> is a desktop application that's used to visualize spatial data, whether from PostGIS queries or other data sources. As I've had the pleasure of learning about GIS data and queries from <a href="https://twitter.com/RustProofLabs">Ryan Lambert</a> over the past few years, I've seen him use this tool for lots of valuable and interesting spatial queries. 
If you rely on PostGIS for application features and you store spatial data, take a look at how QGIS might be able to help your analysis workflow.</p><h3 id="superset">Superset</h3><p>There are a number of data visualization and dashboarding alternatives in the market, and PostgreSQL support is universally expected regardless of the tool. <a href="https://superset.apache.org/">Superset</a> is an open-source option that also has commercial support and hosting options available through Preset.io. With more than 40 chart types and a vibrant community, there's a lot to explore in the Superset ecosystem.</p><h3 id="streamlit">Streamlit</h3><p>For those developers that use Python for most of their data analysis and visualizations, <a href="https://streamlit.io/">Streamlit</a> is another popular choice that can easily fit into your existing workflow. Streamlit isn't a drag-and-drop UI for creating dashboards, but rather a programmatic interface for building and deploying data analysis applications using Python. And as of July 2022, you can deploy public data apps using Streamlit.io.</p><h2 id="what-about-you">What About You?</h2><p>There were so many interesting answers and suggestions provided by the community to these three questions. It's clear that there are a lot of people around the world working to help developers and database professionals be more productive across many common tasks.</p><p>Are there any surprises in this list or tools that you think didn't make the list? Hit us up on <a href="https://slack.timescale.com">Slack</a>, our <a href="https://www.timescale.com/forum/t/state-of-postgresql-2022-postgres-tools/735">Forum</a>, or Twitter (<a href="https://twitter.com/timescaledb">@timescaleDB</a>) to share other tools that are important to your daily PostgreSQL workflow!</p><h2 id="read-the-report">Read the Report</h2><p>Now that we’ve given you a taste of our survey results, are you curious to learn more about the PostgreSQL community? 
If you’d like more insights into the State of PostgreSQL 2022, including why respondents chose PostgreSQL, their opinions on industry events, and which information sources they would recommend to friends and colleagues, don’t miss our complete report. <a href="https://www.timescale.com/state-of-postgres/2022?utm_source=state-of-pg-2022&amp;utm_medium=blog&amp;utm_campaign=state-of-pg-2022&amp;utm_id=state-of-pg-2022&amp;utm_content=state-postgres-blog">Click here to read it and learn firsthand what the State of PostgreSQL is in 2022</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[State of PostgreSQL 2022—First Findings]]></title>
            <description><![CDATA[The results of the third State of PostgreSQL survey are almost out! While you wait for the complete report, read some of the survey’s initial findings, including where PostgreSQL users come from, their experience level, and favorite tools.]]></description>
            <link>https://www.tigerdata.com/blog/state-of-postgresql-2022-first-findings</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/state-of-postgresql-2022-first-findings</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[State of PostgreSQL]]></category>
            <dc:creator><![CDATA[Ryan Booz]]></dc:creator>
            <pubDate>Fri, 08 Jul 2022 14:59:54 GMT</pubDate>
            <media:content medium="image" url="https://timescale.ghost.io/blog/content/images/2022/07/BlogHero_First-Findings.png">
            </media:content>
            <content:encoded><![CDATA[<div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">🚀</div><div class="kg-callout-text"><b><strong style="white-space: pre-wrap;">About the State of PostgreSQL</strong></b><i><em class="italic" style="white-space: pre-wrap;">Timescale’s love for PostgreSQL, one of the world’s most advanced open-source databases with 30+ years of history, runs deep. </em></i><a href="http://www.timescale.com/"><i><em class="italic" style="white-space: pre-wrap;">We built our products on PostgreSQL</em></i></a><i><em class="italic" style="white-space: pre-wrap;">, </em></i><a href="https://timescale.ghost.io/blog/the-future-of-community-in-light-of-babelfish/"><i><em class="italic" style="white-space: pre-wrap;">are proud members of the PostgreSQL community,</em></i></a><i><em class="italic" style="white-space: pre-wrap;"> </em></i><a href="https://www.youtube.com/playlist?list=PLsceB9ac9MHRnmNZrCn_TWkUrCBCPR3mc"><i><em class="italic" style="white-space: pre-wrap;">and wouldn’t exist without it and the extensibility it provides</em></i></a><i><em class="italic" style="white-space: pre-wrap;">. In 2019, Timescale launched the first State of PostgreSQL report, advancing our desire to provide greater insights into the vibrant and growing PostgreSQL user base. From the most popular programming languages and favorite features to whether respondents use PostgreSQL for work or personal projects (or both!), the State of PostgreSQL provides valuable insights into this great community. Following a one-year hiatus due to the pandemic, we resumed the annual survey in 2021. </em></i><a href="https://www.timescale.com/state-of-postgres-results"><i><em class="italic" style="white-space: pre-wrap;">Check out our previous reports </em></i></a><i><em class="italic" style="white-space: pre-wrap;">for more info, and keep reading to learn more about this year’s first findings. 
</em></i><a href="https://www.timescale.com/state-of-postgres/2022?utm_source=state-of-pg-2022&amp;utm_medium=blog&amp;utm_campaign=state-of-pg-2022&amp;utm_id=state-of-pg-2022&amp;utm_content=state-postgres-blog"><i><em class="italic" style="white-space: pre-wrap;">If you want to read the complete 2022 report, go ahead!</em></i></a></div></div><p>Earlier this year, we launched the third <em>State of PostgreSQL</em> survey. Participation reached unprecedented levels, with nearly one thousand PostgreSQL users answering our questions—more than double the previous year! A huge thank you to you all for giving back to the community by sharing your experiences with PostgreSQL.</p><p>Gathering and making this information open to the public is our way of helping build a better and more inclusive PostgreSQL community. Here are some of our initial findings.</p><h2 id="demographics">Demographics</h2><h3 id="what-is-your-primary-geographical-location">What is your primary geographical location?</h3><p>Mirroring the 2019 and 2021 survey results, most respondents (54.4 %) are located in the EMEA (Europe, Middle East, Africa) region, followed by North America, with 25.9 %.</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/07/What-is-your-primary-geographical-location.png" class="kg-image" alt="" loading="lazy" width="1800" height="1117" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/07/What-is-your-primary-geographical-location.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/07/What-is-your-primary-geographical-location.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2022/07/What-is-your-primary-geographical-location.png 1600w, https://timescale.ghost.io/blog/content/images/2022/07/What-is-your-primary-geographical-location.png 1800w" sizes="(min-width: 720px) 720px"></figure><h3 id="how-long-have-you-been-using-postgresql">How 
long have you been using PostgreSQL?</h3><p>While many of this year’s participants have been using PostgreSQL for 3-5 years, the number of new users experimenting with the database for less than a year has grown (6.4 %). The majority of the surveyed PostgreSQL users—54 %—have been using PostgreSQL for six or more years.</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/07/How-long-have-you-been-using-PostgreSQL.png" class="kg-image" alt="" loading="lazy" width="1800" height="1116" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/07/How-long-have-you-been-using-PostgreSQL.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/07/How-long-have-you-been-using-PostgreSQL.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2022/07/How-long-have-you-been-using-PostgreSQL.png 1600w, https://timescale.ghost.io/blog/content/images/2022/07/How-long-have-you-been-using-PostgreSQL.png 1800w" sizes="(min-width: 720px) 720px"></figure><h3 id="what-is-your-current-profession-or-job-status">What is your current profession or job status?</h3><p>Most PostgreSQL users (43.3 %) work as software developers/engineers, followed by software architects (13.2 %) and database administrators or DBAs (7.3 %). However, we introduced more title options in this year’s survey, which lent a bit more nuance to the answers. 
For example, we learned that 5.8 % of respondents are consultants, while 2.4 % are researchers.</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/07/Blog---current-profession-or-job-status_-1.png" class="kg-image" alt="" loading="lazy" width="1724" height="1116" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/07/Blog---current-profession-or-job-status_-1.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/07/Blog---current-profession-or-job-status_-1.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2022/07/Blog---current-profession-or-job-status_-1.png 1600w, https://timescale.ghost.io/blog/content/images/2022/07/Blog---current-profession-or-job-status_-1.png 1724w" sizes="(min-width: 720px) 720px"></figure><h2 id="community">Community</h2><h3 id="have-you-ever-contributed-to-postgresql"><br>Have you ever contributed to PostgreSQL?</h3><p>Among our sample of PostgreSQL users with 15 or more years of experience, 44&nbsp;% said they have contributed to PostgreSQL at least once. 
In fact, regardless of their experience, users across the board have contributed to the PostgreSQL community.</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/07/Blog---Have-you-ever-contributed-to-PostgreSQL_.png" class="kg-image" alt="" loading="lazy" width="1801" height="1117" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/07/Blog---Have-you-ever-contributed-to-PostgreSQL_.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/07/Blog---Have-you-ever-contributed-to-PostgreSQL_.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2022/07/Blog---Have-you-ever-contributed-to-PostgreSQL_.png 1600w, https://timescale.ghost.io/blog/content/images/2022/07/Blog---Have-you-ever-contributed-to-PostgreSQL_.png 1801w" sizes="(min-width: 720px) 720px"></figure><p><br><br><br>Over 300 respondents answered bonus questions and shed light on what they like the most about the PostgreSQL community, where they see room for improvement, and what would make the community more welcoming.</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/07/Improvement-in-PostgreSQL--1-.png" class="kg-image" alt="" loading="lazy" width="1836" height="926" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/07/Improvement-in-PostgreSQL--1-.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/07/Improvement-in-PostgreSQL--1-.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2022/07/Improvement-in-PostgreSQL--1-.png 1600w, https://timescale.ghost.io/blog/content/images/2022/07/Improvement-in-PostgreSQL--1-.png 1836w" sizes="(min-width: 720px) 720px"></figure><p></p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/07/Blog---welcoming-to-newcomers_---v2.png" class="kg-image" alt="" loading="lazy" width="1837" height="926" 
srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/07/Blog---welcoming-to-newcomers_---v2.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/07/Blog---welcoming-to-newcomers_---v2.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2022/07/Blog---welcoming-to-newcomers_---v2.png 1600w, https://timescale.ghost.io/blog/content/images/2022/07/Blog---welcoming-to-newcomers_---v2.png 1837w" sizes="(min-width: 720px) 720px"></figure><p></p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/07/Best-thing-about-the-community.png" class="kg-image" alt="" loading="lazy" width="1836" height="926" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/07/Best-thing-about-the-community.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/07/Best-thing-about-the-community.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2022/07/Best-thing-about-the-community.png 1600w, https://timescale.ghost.io/blog/content/images/2022/07/Best-thing-about-the-community.png 1836w" sizes="(min-width: 720px) 720px"></figure><h2 id="tools">Tools</h2><p>Three tools stood out among the respondents who use them for queries and administration tasks: <a href="https://www.postgresql.org/docs/current/app-psql.html"><strong>psql</strong></a> (69.4 %), <a href="https://www.pgadmin.org/"><strong>pgAdmin</strong></a> (35.3 %), and <a href="https://dbeaver.io/"><strong>DBeaver</strong></a> (26.2 %).</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/07/Top_3_Adim_Tools.png" class="kg-image" alt="" loading="lazy" width="1836" height="926" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/07/Top_3_Adim_Tools.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/07/Top_3_Adim_Tools.png 1000w, 
https://timescale.ghost.io/blog/content/images/size/w1600/2022/07/Top_3_Adim_Tools.png 1600w, https://timescale.ghost.io/blog/content/images/2022/07/Top_3_Adim_Tools.png 1836w" sizes="(min-width: 720px) 720px"></figure><h2 id="read-the-report">Read the Report</h2><p>Now that we’ve given you a taste of our survey results, are you curious to learn more about the PostgreSQL community? If you’d like to know more insights about the <em>State of PostgreSQL 2022, </em>including why respondents chose PostgreSQL, their opinion on industry events, and what information sources they would recommend to friends and colleagues, don’t miss our complete report. <a href="https://www.timescale.com/state-of-postgres/2022?utm_source=state-of-pg-2022&amp;utm_medium=blog&amp;utm_campaign=state-of-pg-2022&amp;utm_id=state-of-pg-2022&amp;utm_content=state-postgres-blog">Click here to read the report and learn firsthand what the <em>State of PostgreSQL</em> is in 2022</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How We Made Data Aggregation Better and Faster on PostgreSQL With TimescaleDB 2.7]]></title>
            <description><![CDATA[They’re so fast we can’t catch up! Check out our benchmarks with two datasets to learn how we used continuous aggregates to make queries up to 44,000x faster, while requiring 60 % less storage (on average). ]]></description>
            <link>https://www.tigerdata.com/blog/how-we-made-data-aggregation-better-and-faster-on-postgresql-with-timescaledb-2-7</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/how-we-made-data-aggregation-better-and-faster-on-postgresql-with-timescaledb-2-7</guid>
            <category><![CDATA[Engineering]]></category>
            <category><![CDATA[General]]></category>
            <category><![CDATA[Announcements & Releases]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Ryan Booz]]></dc:creator>
            <pubDate>Tue, 21 Jun 2022 12:58:38 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2022/06/candlesticks-2.png">
            </media:content>
<content:encoded><![CDATA[<p><a href="https://timescale.ghost.io/blog/what-the-heck-is-time-series-data-and-why-do-i-need-a-time-series-database-dcf3b1b18563/">Time-series data</a> is the lifeblood of the analytics revolution in nearly every industry today. One of the most difficult challenges for application developers and data scientists is aggregating data efficiently without always having to query billions (or trillions) of raw data rows. Over the years, developers and databases have created numerous ways to solve this problem, usually a variation on one of the following options:</p><ul><li><strong>DIY processes to pre-aggregate data and store it in regular tables</strong>. Although this provides a lot of flexibility, particularly with indexing and data retention, it's cumbersome to develop and maintain, especially when deciding how to track and update aggregates with data that arrives late or has been updated in the past.</li><li><strong>Extract, Transform, and Load (ETL) processes for longer-term analytics.</strong> Even today, development teams employ entire groups that specifically manage ETL processes for databases and applications because of the constant overhead of creating and maintaining the perfect process.</li><li><strong>Materialized views.</strong> While these VIEWS are flexible and easy to create, they are static snapshots of the aggregated data. Unfortunately, developers need to manage updates using TRIGGERs or CRON-like applications in all current implementations. And in all but a very few databases, all historical data is replaced on each refresh, preventing developers from dropping older raw data to save space and computation resources.</li></ul><p>Most developers head down one of these paths because we learn, often the hard way, that running reports and analytic queries over the same raw data, request after request, doesn't perform well under heavy load. 
In truth, most raw time-series data doesn't change after it's been saved, so these complex aggregate calculations return the same results each time.</p><p>In fact, as a long-term time-series database developer, I've used all of these methods too, so that I could manage historical aggregate data to make reporting, dashboards, and analytics faster and more valuable, even under heavy usage.</p><p>I loved when customers were happy, even if it meant a significant amount of work behind the scenes maintaining that data.</p><p>But, I always wished for a more straightforward solution.</p><h2 id="how-timescaledb-improves-queries-on-aggregated-data-in-postgresql">How TimescaleDB Improves Queries on Aggregated Data in PostgreSQL</h2><p><a href="https://timescale.ghost.io/blog/continuous-aggregates-faster-queries-with-automatically-maintained-materialized-views/">In 2019, TimescaleDB introduced continuous aggregates to solve this very problem</a>, making the ongoing aggregation of massive time-series data easy and flexible. This is the feature that first caught my attention as a PostgreSQL developer looking to build more scalable time-series applications—precisely because I had been doing it the hard way for so long.</p><p>Continuous aggregates look and act like <a href="https://www.tigerdata.com/learn/guide-to-postgresql-views" rel="noreferrer">materialized views in PostgreSQL</a>, but with many of the additional features I was looking for (<a href="https://www.youtube.com/watch?v=1V1ADr6CKz4">if you want to learn more about views, materialized views, and continuous aggregates, check out this lesson from our Foundations of PostgreSQL and TimescaleDB course</a>). 
These are just some of the things they do:</p><ul><li>Automatically track changes and additions to the underlying raw data.</li><li>Provide configurable, user-defined policies to keep the materialized data up-to-date automatically.</li><li>Automatically append new data (as <a href="https://timescale.ghost.io/blog/achieving-the-best-of-both-worlds-ensuring-up-to-date-results-with-real-time-aggregation/">real-time aggregates</a> by default) before the scheduled process has materialized to disk. This setting is configurable.</li><li>Retain historical aggregated data even if the underlying raw data is dropped.</li><li>Can be compressed to reduce storage needs and further improve the performance of analytic queries.</li><li>Keep dashboards and reports running smoothly.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/06/ABL_CAGGS_Comparison-Chart_V2.0.png" class="kg-image" alt="Table comparing the functionality of PostgreSQL materialized views with continuous aggregates in TimescaleDB" loading="lazy" width="1800" height="2066" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/06/ABL_CAGGS_Comparison-Chart_V2.0.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/06/ABL_CAGGS_Comparison-Chart_V2.0.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2022/06/ABL_CAGGS_Comparison-Chart_V2.0.png 1600w, https://timescale.ghost.io/blog/content/images/2022/06/ABL_CAGGS_Comparison-Chart_V2.0.png 1800w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Table comparing the functionality of PostgreSQL materialized views with continuous aggregates in TimescaleDB</em></i></figcaption></figure><p>Once I tried continuous aggregates, I realized that TimescaleDB provided the solution that I (and many other PostgreSQL users) were looking for. 
With this feature, managing and analyzing massive volumes of time-series data in PostgreSQL finally felt fast and easy.</p><div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text">Want to make your queries even faster? Try <a href="https://www.timescale.com/blog/an-incremental-materialized-view-on-steroids-how-we-made-continuous-aggregates-even-better/" rel="noreferrer">hierarchical continuous aggregates</a>, a.k.a. continuous aggregates on top of continuous aggregates.</div></div><h2 id="what-about-other-databases">What About Other Databases?</h2><p>By now, some readers might be thinking something along these lines:</p><p><em>“Continuous aggregates may help with the management and analytics of time-series data in PostgreSQL, but that’s what NoSQL databases are for—they already provide the features you needed from the get-go. Why didn’t you try a NoSQL database?”</em></p><p>Well, I did.</p><p>There are numerous time-series and NoSQL databases on the market that attempt to solve this specific problem. I looked at (and used) many of them. But from my experience, nothing can quite match the advantages of a relational database with a feature like continuous aggregates for time-series data. These other options provide a lot of features for a myriad of use cases, but they weren't the right solution for this particular problem, among other things.</p><h3 id="what-about-mongodb">What about MongoDB?</h3><p><a href="https://www.mongodb.com/">MongoDB</a> has been the go-to for many data-intensive applications. Included since version 4.2 is a feature called <a href="https://www.mongodb.com/docs/manual/core/materialized-views/">On-Demand Materialized Views</a>. 
On the surface, it works similarly to a materialized view by combining the <a href="https://www.mongodb.com/docs/manual/core/aggregation-pipeline/">Aggregation Pipeline</a> feature with a $merge operation to mimic ongoing updates to an aggregate data collection. However, there is no built-in automation for this process, and MongoDB doesn't keep track of any modifications to underlying data. The developer is still required to keep track of which time frames to materialize and how far back to look.</p><h3 id="what-about-influxdb">What about InfluxDB?</h3><p>For many years, <a href="https://www.influxdata.com/">InfluxDB</a> has been the destination for time-series applications. Although <a href="https://www.tigerdata.com/blog/what-is-high-cardinality" rel="noreferrer">we've discussed in other articles how InfluxDB doesn't scale effectively, particularly with high cardinality datasets</a>, it does provide a feature called <a href="https://docs.influxdata.com/influxdb/v1.8/query_language/continuous_queries/">Continuous Queries</a>. This feature is also similar to a materialized view and goes one step further than MongoDB by automatically keeping the dataset updated. Unfortunately, it suffers from the same lack of raw data monitoring and doesn't provide nearly as much flexibility as SQL in how the datasets are created and stored.</p><h3 id="what-about-clickhouse">What about ClickHouse?</h3><p><a href="https://clickhouse.com/">ClickHouse</a>, and several recent forks like <a href="https://www.firebolt.io/">Firebolt</a>, have redefined the way some analytic workloads perform. 
Even with its <a href="https://timescale.ghost.io/blog/what-is-clickhouse-how-does-it-compare-to-postgresql-and-timescaledb-and-how-does-it-perform-for-time-series-data/">impressive query performance</a>, ClickHouse also provides a mechanism similar to a materialized view, backed by an <a href="https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/aggregatingmergetree/">AggregatingMergeTree</a> engine. In a sense, this provides almost real-time aggregated data because all inserts are saved to both the regular table and the materialized view. The biggest downside of this approach is dealing with updates or modifying the timing of the process.</p><h2 id="recent-improvements-in-continuous-aggregates-meet-timescaledb-27">Recent Improvements in Continuous Aggregates: Meet TimescaleDB 2.7</h2><p>Continuous aggregates were first introduced in <a href="https://timescale.ghost.io/blog/continuous-aggregates-faster-queries-with-automatically-maintained-materialized-views/">TimescaleDB 1.3</a>, solving the problems that many PostgreSQL users, including me, faced with time-series data and materialized views: automatic updates, real-time results, easy data management, and the option of using the view for downsampling.</p><p>But continuous aggregates have come a long way. One of the previous improvements was the introduction of <a href="https://timescale.ghost.io/blog/increase-your-storage-savings-with-timescaledb-2-6-introducing-compression-for-continuous-aggregates/">compression for continuous aggregates in TimescaleDB 2.6</a>. Now, we've taken it a step further with the arrival of TimescaleDB 2.7, which introduces dramatic performance improvements in continuous aggregates. <strong>They are now blazing fast—up to 44,000x faster in some queries than in previous versions. 
</strong></p><p>Let me give you one concrete example: <strong>in initial testing using live, real-time stock trade transaction data, typical candlestick aggregates were nearly 2,800x faster to query </strong>than in previous versions of continuous aggregates (which were already fast!).</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/06/sonic-run.gif" class="kg-image" alt="" loading="lazy" width="498" height="206"></figure><p>Later in this post, we will dig into the performance and storage improvements introduced by TimescaleDB 2.7 by presenting a complete benchmark of continuous aggregates using multiple datasets and queries. 🔥</p><p>But the improvements don’t end here.</p><p>First, the new continuous aggregates also require 60&nbsp;% less storage (on average) than before for many common aggregates, which directly translates into storage savings. <br><br>Second, in previous versions of TimescaleDB, continuous aggregates came with certain limitations: users, for example, could not use functions like DISTINCT, FILTER, or ORDER BY. These limitations are now gone. TimescaleDB 2.7 ships with a completely redesigned materialization process that solves many of the previous usability issues, so you can use any aggregate function to define your continuous aggregate. <a href="https://docs.timescale.com/timescaledb/latest/overview/release-notes/">Check out our release notes for all the details on what's new.</a></p>
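<p>To illustrate the lifted restrictions, a definition like the following (a sketch with a hypothetical view name, using the stock trade schema from the tutorial referenced later in this post) can now be used for a continuous aggregate:</p><pre><code class="language-sql">-- Sketch: FILTER and DISTINCT were previously rejected in continuous aggregates
CREATE MATERIALIZED VIEW daily_symbol_stats
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('1 day', time) AS bucket,
    symbol,
    avg(price) FILTER (WHERE day_volume &gt; 0) AS avg_traded_price,
    count(DISTINCT day_volume) AS distinct_volume_readings
FROM stocks_real_time
GROUP BY bucket, symbol;</code></pre>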
<!--kg-card-begin: html-->
<div class="highlight">
	
    <p class="highlight__text">
        <svg width="17" height="16" viewBox="0 0 17 16" fill="none" xmlns="http://www.w3.org/2000/svg">
	</svg>
✨ A big thank you to the Timescale engineers who made the improvements in continuous aggregates possible, with special mentions to Fabrízio Mello, Markos Fountoulakis, and David Kohn.
    </p>
</div>

<!--kg-card-end: html-->
<p><br>And now, the fun part.</p><h2 id="show-me-the-numbers-benchmarking-aggregate-queries">Show Me the Numbers: Benchmarking Aggregate Queries</h2><p>To test the new version of continuous aggregates, we chose two datasets that represent common time-series datasets: IoT and financial analysis.</p><ul><li><strong>IoT dataset (~1.7 billion rows): </strong>The IoT data we leveraged is the New York City Taxicab dataset that's been maintained by Todd Schneider for a number of years, and scripts are available in his <a href="https://github.com/toddwschneider/nyc-taxi-data">GitHub repository</a> to load data into PostgreSQL. Unfortunately, a week after his latest update, the transit authority that maintains the actual datasets changed their long-standing export data format from CSV to Parquet—which means the current scripts will not work. Therefore, the dataset we tested with is from data prior to that change and covers ride information from 2014 to 2021.</li><li><strong>Stock transactions dataset (~23.7 million rows): </strong>The financial dataset we used is a real-time stock trade dataset provided by <a href="https://twelvedata.com/">Twelve Data</a> and ingests ongoing transactions for the top 100 stocks by volume from February 2022 until now. Real-time transaction data is typically the source of many stock trading analysis applications requiring aggregate rollups over intervals for visualizations like <a href="https://docs.timescale.com/timescaledb/latest/tutorials/financial-candlestick-tick-data/create-candlestick-aggregates/">candlestick charts </a>and machine learning analysis. 
While our example dataset is smaller than a full-fledged financial application would maintain, it provides a working example of ongoing data ingestion using continuous aggregates, TimescaleDB <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/compression/about-compression/">native compression</a>, and <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/data-retention/about-data-retention/">automated raw data retention</a> (while keeping aggregate data for long-term analysis).</li></ul><p>You can use a sample of this data, generously provided by Twelve Data, to try all of the improvements in TimescaleDB 2.7 by following <a href="https://docs.timescale.com/timescaledb/latest/tutorials/ingest-real-time-websocket-data/">this tutorial</a>, which provides stock trade data for the last 30 days. Once you have the database set up, you can take it a step further by registering for an API key and <a href="https://docs.timescale.com/timescaledb/latest/tutorials/ingest-real-time-websocket-data/">following our tutorial to ingest ongoing transactions from the Twelve Data API</a>.</p><h3 id="creating-continuous-aggregates-using-standard-postgresql-aggregate-functions">Creating Continuous Aggregates Using Standard PostgreSQL Aggregate Functions<br></h3><p>For our first benchmark, we created an aggregate query that used standard PostgreSQL aggregate functions like <code>MIN()</code>, <code>MAX()</code>, and <code>AVG()</code>. For each dataset we tested, we created the same continuous aggregate in TimescaleDB 2.6.1 and 2.7, ensuring that both aggregates had computed and stored the same number of rows. </p><p><strong>IoT dataset</strong></p><p>This continuous aggregate resulted in 1,760,000 rows of aggregated data spanning seven years.</p><pre><code class="language-sql">CREATE MATERIALIZED VIEW hourly_trip_stats
WITH (timescaledb.continuous, timescaledb.finalized=false) 
AS
SELECT 
	time_bucket('1 hour',pickup_datetime) bucket,
	avg(fare_amount) avg_fare,
	min(fare_amount) min_fare,
	max(fare_amount) max_fare,
	avg(trip_distance) avg_distance,
	min(trip_distance) min_distance,
	max(trip_distance) max_distance,
	avg(congestion_surcharge) avg_surcharge,
	min(congestion_surcharge) min_surcharge,
	max(congestion_surcharge) max_surcharge,
	cab_type_id,
	passenger_count
FROM 
	trips
GROUP BY 
	bucket, cab_type_id, passenger_count</code></pre><p></p><p><strong>Stock transactions dataset</strong></p><p>This continuous aggregate resulted in 950,000 rows of data at the time of testing, although these are updated as new data comes in.</p><pre><code class="language-sql">CREATE MATERIALIZED VIEW five_minute_candle_delta
WITH (timescaledb.continuous) AS
    SELECT
        time_bucket('5 minute', time) AS bucket,
        symbol,
        FIRST(price, time) AS "open",
        MAX(price) AS high,
        MIN(price) AS low,
        LAST(price, time) AS "close",
        MAX(day_volume) AS day_volume,
        (LAST(price, time)-FIRST(price, time))/FIRST(price, time) AS change_pct
    FROM stocks_real_time srt
    GROUP BY bucket, symbol;
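
-- A refresh policy keeps the materialization current automatically as new
-- trades arrive. This is a sketch: the interval choices below are
-- illustrative assumptions, not values from this post.
SELECT add_continuous_aggregate_policy('five_minute_candle_delta',
    start_offset =&gt; INTERVAL '1 hour',
    end_offset =&gt; INTERVAL '5 minutes',
    schedule_interval =&gt; INTERVAL '5 minutes');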
</code></pre><p>To test the performance of these two continuous aggregates, we selected the following queries, all common queries among our users for both the IoT and financial use cases:</p><ol><li>SELECT COUNT(*)</li><li>SELECT COUNT(*) with WHERE</li><li>ORDER BY</li><li>time_bucket reaggregation</li><li>FILTER</li><li>HAVING</li></ol><p>Let’s take a look at the results.</p><h3 id="query-1-%60select-count-from%E2%80%A6%60">Query #1: <code>SELECT COUNT(*) FROM…</code></h3><p>Doing a <code>COUNT(*)</code> from PostgreSQL is a known performance bottleneck. It's one of the reasons we created the <a href="https://docs.timescale.com/api/latest/hyperfunctions/approximate_row_count/"><code>approximate_row_count()</code></a> function in TimescaleDB, which uses table statistics to provide a close approximation of the overall row count. However, it's instinctual for most users (and ourselves, if we're honest) to try to get a quick row count by doing a <code>COUNT(*)</code> query:</p><pre><code class="language-sql">-- IoT dataset
SELECT count(*) FROM hourly_trip_stats;
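
-- A statistics-based alternative for the raw hypertable (a sketch of
-- approximate_row_count() usage; it estimates without scanning every row):
SELECT approximate_row_count('trips');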

-- Stock transactions dataset
SELECT count(*) FROM five_minute_candle_delta;</code></pre><p>Most users also noticed that, in previous versions of TimescaleDB, running a COUNT over the materialized data seemed slower than normal. <br></p><p>Thinking about our two example datasets, both continuous aggregates reduce the overall row count from raw data by 20x or more. So, while counting rows in PostgreSQL is slow, it always felt a little slower than it had to be. The reason was that not only did PostgreSQL have to scan and count all of the rows of data, but it also had to group the data a second time because of some additional data that TimescaleDB stored as part of the original design of continuous aggregates. With the new design of continuous aggregates in TimescaleDB 2.7, that second grouping is no longer required, and PostgreSQL can just query the data normally, translating into faster queries.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/06/Query--1-2.png" class="kg-image" alt="Table comparing the performance of a query with SELECT COUNT (*) in a continuous aggregate in TimescaleDB 2.6.1 and TimescaleDB 2.7" loading="lazy" width="1351" height="441" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/06/Query--1-2.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/06/Query--1-2.png 1000w, https://timescale.ghost.io/blog/content/images/2022/06/Query--1-2.png 1351w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Performance of a query with SELECT COUNT (*) in a continuous aggregate in TimescaleDB 2.6.1 and TimescaleDB 2.7</em></i></figcaption></figure><h3 id="query-2-select-count-based-on-the-value-of-a-column">Query #2: SELECT COUNT(*) Based on the Value of a Column</h3><p>Another common query that many analytic applications perform is to count the number of records where the aggregate value is within a certain range:</p><pre><code 
class="language-sql">-- IoT dataset
SELECT count(*) FROM hourly_trip_stats
WHERE avg_fare &gt; 13.1
AND bucket &gt; '2018-01-01' AND bucket &lt; '2019-01-01';

-- Stock transactions dataset
SELECT count(*) FROM five_minute_candle_delta
WHERE change_pct &gt; 0.02;
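
-- Because finalized aggregates store plain column values, an index on the
-- filtered column can speed this up further. A sketch (hypothetical index;
-- indexing continuous aggregate columns is possible as of TimescaleDB 2.7):
CREATE INDEX ON five_minute_candle_delta (change_pct);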
</code></pre><p>In previous versions of continuous aggregates, TimescaleDB had to finalize the value before it could be filtered against the predicate value, which caused queries to perform more slowly. With the new version of continuous aggregates, PostgreSQL can now search for the value directly, <em>and</em> we can add an index to meaningful columns to speed up the query even more!</p><p>In the case of the financial dataset, we see a very significant improvement: 1,336x faster. The large change in performance can be attributed to the formula query that has to be calculated over all of the rows of data in the continuous aggregate. With the IoT dataset, we're comparing against a simple average function, but for the stock data, multiple values have to be finalized (FIRST/LAST) before the formula can be calculated and used for the filter.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/06/Query--2-2.png" class="kg-image" alt="Table comparing the performance of a query with SELECT COUNT (*) plus WHERE in a continuous aggregate in TimescaleDB 2.6.1 and TimescaleDB 2.7." loading="lazy" width="1351" height="441" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/06/Query--2-2.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/06/Query--2-2.png 1000w, https://timescale.ghost.io/blog/content/images/2022/06/Query--2-2.png 1351w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Performance of a query with SELECT COUNT (*) plus WHERE in a continuous aggregate in TimescaleDB 2.6.1 and TimescaleDB 2.7</em></i></figcaption></figure><h3 id="query-3-select-top-10-rows-by-value">Query #3: Select Top 10 Rows by Value<br></h3><p>Taking the first example a step further, it's very common to query data within a range of time and get the top rows:</p><pre><code class="language-sql">-- IoT dataset
SELECT * FROM hourly_trip_stats
ORDER BY avg_fare desc
LIMIT 10;
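
-- Sketch of the materialized-only variant we also timed: disabling
-- real-time aggregation means only materialized rows are read.
ALTER MATERIALIZED VIEW hourly_trip_stats
    SET (timescaledb.materialized_only = true);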

-- Stock transactions dataset
SELECT * FROM five_minute_candle_delta
ORDER BY change_pct DESC 
LIMIT 10;</code></pre><p>In this case, we tested queries with the continuous aggregate set to provide <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/continuous-aggregates/real-time-aggregates/">real-time results</a> (the default for continuous aggregates) and materialized-only results. When set to real-time, TimescaleDB always queries data that's been materialized first and then appends (with a <code>UNION</code>) any newer data that exists in the raw data but that has not yet been materialized by the ongoing refresh policy. And, because it's now possible to index columns within the continuous aggregate, we added an index on the <code>ORDER BY</code> column.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/06/Query--3-1.png" class="kg-image" alt="Table comparing the performance of a query with ORDER BY in a continuous aggregate TimescaleDB 2.6.1 and TimescaleDB 2.7." loading="lazy" width="1351" height="606" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/06/Query--3-1.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/06/Query--3-1.png 1000w, https://timescale.ghost.io/blog/content/images/2022/06/Query--3-1.png 1351w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Performance of a query with ORDER BY in a continuous aggregate TimescaleDB 2.6.1 and TimescaleDB 2.7</em></i></figcaption></figure><p><strong>Yes, you read that correctly. Nearly 45,000x better performance on  <code>ORDER BY</code> </strong>when the query only searches through materialized data.</p><p>The dramatic difference between real-time and materialized-only queries is because of the <code>UNION</code> of both materialized and raw aggregate data. 
The PostgreSQL planner needs to union the total result before it can limit the query to 10 rows (in our example), and so all of the data from both tables needs to be read and ordered first. When you only query materialized data, PostgreSQL and TimescaleDB know that they can query just the index of the materialized data.</p><p>Again, storing the finalized form of your data and indexing column values dramatically impacts the querying performance of historical aggregate data! And all of this is updated continuously over time in a non-destructive way—something that's impossible to do with any other relational database, including vanilla PostgreSQL.</p><h3 id="query-4-timescale-hyperfunctions-to-re-aggregate-into-higher-time-buckets">Query #4: Timescale Hyperfunctions to Re-aggregate Into Higher Time Buckets</h3><p>Another example we wanted to test was the impact that finalizing data values has on our suite of <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/hyperfunctions/">analytical hyperfunctions</a>. Many of the hyperfunctions we provide as part of the <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/hyperfunctions/install-toolkit/">TimescaleDB Toolkit</a> utilize custom aggregate values that allow many different values to be accessed later depending on the needs of an application or report. Furthermore, these aggregate values can be <a href="https://timescale.ghost.io/blog/how-postgresql-aggregation-works-and-how-it-inspired-our-hyperfunctions-design-2/">re-aggregated into different size time buckets</a>. This means that if the aggregate functions fit your use case, one continuous aggregate can produce results for many different time_bucket sizes! This is a feature many users have asked for over time, and hyperfunctions make this possible.</p><p>For this example, we only examined the New York City Taxicab dataset to benchmark the impact of finalized CAGGs. 
Currently, there is not an aggregate hyperfunction that aligns with the OHLC values needed for the stock dataset; however, <a href="https://github.com/timescale/timescaledb-toolkit/issues/445">there is a feature request</a> for it! (😉)</p><p>Although there are not currently any one-to-one hyperfunctions that provide exact replacements for our min/max/avg example, we can still observe the query improvement using a <code>tdigest</code> value for each of the columns in our original query.</p><p><strong>Original min/max/avg continuous aggregate for multiple columns:</strong></p><pre><code class="language-sql">CREATE MATERIALIZED VIEW hourly_trip_stats
WITH (timescaledb.continuous, timescaledb.finalized=false) 
AS
SELECT 
	time_bucket('1 hour',pickup_datetime) bucket,
	avg(fare_amount) avg_fare,
	min(fare_amount) min_fare,
	max(fare_amount) max_fare,
	avg(trip_distance) avg_distance,
	min(trip_distance) min_distance,
	max(trip_distance) max_distance,
	avg(congestion_surcharge) avg_surcharge,
	min(congestion_surcharge) min_surcharge,
	max(congestion_surcharge) max_surcharge,
	cab_type_id,
	passenger_count
FROM 
	trips
GROUP BY 
	bucket, cab_type_id, passenger_count
</code></pre><p><strong>Hyperfunction-based continuous aggregate for multiple columns:</strong></p><pre><code class="language-sql">CREATE MATERIALIZED VIEW hourly_trip_stats_toolkit
WITH (timescaledb.continuous, timescaledb.finalized=false) 
AS
SELECT 
	time_bucket('1 hour',pickup_datetime) bucket,
	tdigest(1,fare_amount) fare_digest,
	tdigest(1,trip_distance) distance_digest,
	tdigest(1,congestion_surcharge) surcharge_digest,
	cab_type_id,
	passenger_count
FROM 
	trips
GROUP BY 
	bucket, cab_type_id, passenger_count</code></pre><p>With the continuous aggregate created, we then queried this data in two different ways:</p><p><strong>1. Using the same <code>time_bucket()</code> size defined in the continuous aggregate, which in this example was one-hour data.</strong></p><pre><code class="language-sql">SELECT 
	bucket AS b,
	cab_type_id, 
	passenger_count,
	min_val(ROLLUP(fare_digest)),
	max_val(ROLLUP(fare_digest)),
	mean(ROLLUP(fare_digest))
FROM hourly_trip_stats_toolkit
WHERE bucket &gt; '2021-05-01' AND bucket &lt; '2021-06-01'
GROUP BY b, cab_type_id, passenger_count 
ORDER BY b DESC, cab_type_id, passenger_count;</code></pre><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/06/Query--4.png" class="kg-image" alt="Table comparing the performance of a query with time_bucket() in a continuous aggregate in TimescaleDB 2.6.1 and TimescaleDB 2.7 (the query uses the same bucket size as the definition of the continuous aggregate)" loading="lazy" width="1350" height="354" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/06/Query--4.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/06/Query--4.png 1000w, https://timescale.ghost.io/blog/content/images/2022/06/Query--4.png 1350w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Performance of a query with time_bucket() in a continuous aggregate in TimescaleDB 2.6.1 and TimescaleDB 2.7 (the query uses the same bucket size as the definition of the continuous aggregate)</em></i></figcaption></figure><p><strong>2. We re-aggregated the data from one-hour buckets into one-day buckets. </strong>This allows us to efficiently query different bucket lengths based on the original bucket size of the continuous aggregate.</p><pre><code class="language-sql">SELECT 
	time_bucket('1 day', bucket) AS b,
	cab_type_id, 
	passenger_count,
	min_val(ROLLUP(fare_digest)),
	max_val(ROLLUP(fare_digest)),
	mean(ROLLUP(fare_digest))
FROM hourly_trip_stats_toolkit
WHERE bucket &gt; '2021-05-01' AND bucket &lt; '2021-06-01'
GROUP BY b, cab_type_id, passenger_count 
ORDER BY b DESC, cab_type_id, passenger_count;</code></pre><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/06/Query--4-2.png" class="kg-image" alt="Table comparing the performance of a query with time_bucket() in a continuous aggregate in TimescaleDB 2.6.1 and TimescaleDB 2.7. The query re-aggregates the data from one-hour buckets into one-day buckets." loading="lazy" width="1350" height="354" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/06/Query--4-2.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/06/Query--4-2.png 1000w, https://timescale.ghost.io/blog/content/images/2022/06/Query--4-2.png 1350w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Performance of a query with time_bucket() in a continuous aggregate in TimescaleDB 2.6.1 and TimescaleDB 2.7. The query re-aggregates the data from one-hour buckets into one-day buckets</em></i></figcaption></figure><p>In this case, the speed is almost identical because the same amount of data has to be queried. But if these aggregates satisfy your data requirements, only one continuous aggregate would be necessary in many cases, rather than a different continuous aggregate for each bucket size (one minute, five minutes, one hour, etc.).</p><h3 id="query-5-pivot-queries-with-filter">Query #5: Pivot Queries With FILTER</h3><p><a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/continuous-aggregates/about-continuous-aggregates/#function-support">In previous versions of continuous aggregates, many common SQL features were not permitted</a> because of how the partial data was stored and finalized later. 
Using a PostgreSQL <code>FILTER</code> clause was one such restriction.</p><p>For example, we took the IoT dataset and created a simple <code>COUNT(*)</code> to calculate each company's number of taxi rides (<code>cab_type_id</code>) for each hour. Before TimescaleDB 2.7, you would have to store this data in a narrow column format, storing a row in the continuous aggregate for each cab type.</p><pre><code class="language-sql">CREATE MATERIALIZED VIEW hourly_ride_counts_by_type 
WITH (timescaledb.continuous, timescaledb.finalized=false) 
AS
SELECT 
	time_bucket('1 hour',pickup_datetime) bucket,
	cab_type_id,
	COUNT(*)
FROM trips
WHERE cab_type_id IN (1,2)
GROUP BY 
	bucket, cab_type_id;</code></pre><p>To then query this data in a pivoted fashion, we could <code>FILTER</code> the continuous aggregate data after the fact.</p><pre><code class="language-sql">SELECT bucket,
	sum(count) FILTER (WHERE cab_type_id IN (1)) yellow_cab_count,
	sum(count) FILTER (WHERE cab_type_id IN (2)) green_cab_count
FROM hourly_ride_counts_by_type
WHERE bucket &gt; '2021-05-01' AND bucket &lt; '2021-06-01'
GROUP BY bucket
ORDER BY bucket;</code></pre><p>In TimescaleDB 2.7, you can now store the aggregated data using a <code>FILTER</code> clause to achieve the same result in one step!</p><pre><code class="language-sql">CREATE MATERIALIZED VIEW hourly_ride_counts_by_type_new 
WITH (timescaledb.continuous) 
AS
SELECT 
	time_bucket('1 hour',pickup_datetime) bucket,
	COUNT(*) FILTER (WHERE cab_type_id IN (1)) yellow_cab_count,
	COUNT(*) FILTER (WHERE cab_type_id IN (2)) green_cab_count
FROM trips
GROUP BY 
	bucket;</code></pre><p>Querying this data is much simpler, too, because the data is already pivoted and finalized.</p><pre><code class="language-sql">SELECT * FROM hourly_ride_counts_by_type_new 
WHERE bucket &gt; '2021-05-01' AND bucket &lt; '2021-06-01'
ORDER BY bucket;
</code></pre><p>This saves storage (50% fewer rows in this case) and CPU to finalize the <code>COUNT(*)</code> and then filter the results each time based on <code>cab_type_id</code>. We can see this in the query performance numbers.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/06/Query--5.png" class="kg-image" alt="Table comparing the performance of a query with FILTER in a continuous aggregate in TimescaleDB 2.6.1 and TimescaleDB 2.7." loading="lazy" width="1350" height="354" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/06/Query--5.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/06/Query--5.png 1000w, https://timescale.ghost.io/blog/content/images/2022/06/Query--5.png 1350w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Performance of a query with FILTER in a continuous aggregate in TimescaleDB 2.6.1 and TimescaleDB 2.7.</em></i></figcaption></figure><p>Being able to use <code>FILTER</code> and other SQL features improves both developer experience and flexibility long term!</p><h3 id="query-6-having-stores-significantly-less-materialized-data">Query #6: HAVING Stores Significantly Less Materialized Data</h3><p>As a final example of how the improvements to continuous aggregates will impact your day-to-day development and analytics processes, let's look at a simple query that uses a <code>HAVING</code> clause to reduce the number of rows that the aggregate stores.</p><p>In previous versions of TimescaleDB, the <code>HAVING</code> clause couldn't be applied at materialization time. Instead, the <code>HAVING</code> clause was applied after the fact to all of the aggregated data as it was finalized. 
In many cases, this dramatically affected both the speed of queries to the continuous aggregate and the amount of data stored overall.</p><p>Using our stock data as an example, let's create a continuous aggregate that only stores a row of data if the <code>change_pct</code> value is greater than 2%. This would indicate that a stock price changed dramatically over one hour, something we don't expect to see in most hourly stock trades.</p><pre><code class="language-sql">CREATE MATERIALIZED VIEW one_hour_outliers
WITH (timescaledb.continuous) AS
    SELECT
        time_bucket('1 hour', time) AS bucket,
        symbol,
        FIRST(price, time) AS "open",
        MAX(price) AS high,
        MIN(price) AS low,
        LAST(price, time) AS "close",
        MAX(day_volume) AS day_volume,
        (LAST(price, time)-FIRST(price, time))/LAST(price, time) AS change_pct
    FROM stocks_real_time srt
    GROUP BY bucket, symbol
   HAVING (LAST(price, time)-FIRST(price, time))/LAST(price, time) &gt; .02;</code></pre><p>Once the dataset is created, we can query each aggregate to see how many rows matched our criteria.</p><pre><code class="language-sql">SELECT count(*) FROM one_hour_outliers;</code></pre><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/06/Query--6.png" class="kg-image" alt="Table comparing the performance of a query with HAVING in a continuous aggregate in TimescaleDB 2.6.1 and TimescaleDB 2.7." loading="lazy" width="1351" height="354" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/06/Query--6.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/06/Query--6.png 1000w, https://timescale.ghost.io/blog/content/images/2022/06/Query--6.png 1351w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Performance of a query with HAVING in a continuous aggregate in TimescaleDB 2.6.1 and TimescaleDB 2.7</em></i></figcaption></figure><p>The biggest difference here (and the one that will more negatively impact the performance of your application over time) is the storage size of this aggregated data. Because TimescaleDB 2.7 only stores rows that meet the criteria, the data footprint is significantly smaller!</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/06/Query--6-2.png" class="kg-image" alt="Table comparing the storage footprint of a continuous aggregate bucketing stock transactions by the hour in TimescaleDB 2.6.1 and TimescaleDB 2.7." 
loading="lazy" width="1351" height="354" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/06/Query--6-2.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/06/Query--6-2.png 1000w, https://timescale.ghost.io/blog/content/images/2022/06/Query--6-2.png 1351w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Storage footprint of a continuous aggregate bucketing stock transactions by the hour in TimescaleDB 2.6.1 and TimescaleDB 2.7</em></i></figcaption></figure><h2 id="storage-savings-in-timescaledb-27">Storage Savings in TimescaleDB 2.7</h2><p>One of the final pieces of this update that excites us is how much storage will be saved over time. On many occasions, users with large datasets that contained complex equations in their continuous aggregates would join our <a href="https://slack.timescale.com/">Slack community</a> to ask why more storage is required for the rolled-up aggregate than the raw data.</p><p>In every case we've tested, the new, finalized form of continuous aggregates is smaller than the same example in previous versions of TimescaleDB, with or without a <code>HAVING</code> clause that might filter additional data out.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/06/Storage-Savings.png" class="kg-image" alt="Table comparing the storage footprint of a query with HAVING in a continuous aggregate in TimescaleDB 2.6.1 and TimescaleDB 2.7." 
loading="lazy" width="1351" height="654" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/06/Storage-Savings.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/06/Storage-Savings.png 1000w, https://timescale.ghost.io/blog/content/images/2022/06/Storage-Savings.png 1351w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Storage savings for different continuous aggregates in TimescaleDB 2.6.1 and TimescaleDB 2.7</em></i></figcaption></figure><h2 id="the-new-continuous-aggregates-are-a-game-changer">The New Continuous Aggregates Are a Game-Changer</h2><p>For those dealing with massive amounts of time-series data, continuous aggregates are the best way to solve a problem that has long haunted PostgreSQL users. The following list details how continuous aggregates extend PostgreSQL materialized views:</p><ul><li>They always stay up-to-date, automatically tracking changes in the source table for targeted, efficient updates of materialized data.</li><li>You can use configurable policies to conveniently manage the refresh/update interval.</li><li>You can keep your materialized data even after the raw data is dropped, allowing you to downsample your large datasets.</li><li>And you can compress older data to save space and improve analytic queries.</li></ul><p>And in TimescaleDB 2.7, continuous aggregates got much better. First, they are blazing fast: as we demonstrated with our benchmark, the performance of continuous aggregates got consistently better across queries and datasets, up to thousands of times better for common queries. 
They also got lighter, requiring an average of 60% less storage.</p><p>But besides the performance improvements and storage savings, there are significantly fewer limitations on the types of aggregate queries you can use with continuous aggregates. The following are now supported:</p><ul><li>Aggregates with DISTINCT</li><li>Aggregates with FILTER</li><li>Aggregates with FILTER in the HAVING clause</li><li>Aggregates without a combine function</li><li>Ordered-set aggregates</li><li>Hypothetical-set aggregates</li></ul><p>This new version of continuous aggregates is available by default in <a href="https://docs.timescale.com/timescaledb/latest/overview/release-notes/">TimescaleDB 2.7</a>: now, when you create a new continuous aggregate, you will automatically benefit from all the latest changes. <a href="https://docs.timescale.com/timescaledb/latest/overview/release-notes/">Read our release notes for more information on TimescaleDB 2.7</a>, and for instructions on how to upgrade, <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/update-timescaledb/">check out our docs.</a></p><p>Looking to migrate your existing continuous aggregates to the new version? Now, with TimescaleDB 2.8.1, you don’t have to worry about <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/continuous-aggregates/migrate/">migrating from the old continuous aggregates to the new</a>. Say hello to our frictionless migration: an in-place upgrade that avoids disrupting the queries applications and dashboards run against continuous aggregates, even when the data is no longer in the original <a href="https://www.tigerdata.com/blog/database-indexes-in-postgresql-and-timescale-cloud-your-questions-answered" rel="noreferrer">hypertable</a>.</p>
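<p>If you want to run the in-place migration yourself, it is driven by a single procedure call per aggregate. A minimal sketch using the <code>one_hour_outliers</code> view from earlier (the procedure and its optional arguments are described in the migration docs linked above; treat this as a sketch, not a drop-in command):</p><pre><code class="language-sql">-- Rebuilds the old-format continuous aggregate
-- in the new, finalized format, in place.
CALL cagg_migrate('one_hour_outliers');</code></pre>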
<!--kg-card-begin: html-->
<div class="highlight">
	
    <p class="highlight__text">
        <svg width="17" height="16" viewBox="0 0 17 16" fill="none" xmlns="http://www.w3.org/2000/svg">
	</svg>
☁️🐯 Timescale avoids the manual work involved in updating your TimescaleDB version. Updates take place automatically during a maintenance window picked by you. 
        <a href="https://docs.timescale.com/cloud/latest/service-operations/maintenance/" target="_blank">Learn more about maintenance and automatic version updates in Timescale</a>,
        and to test it yourself,
        <a href="https://www.timescale.com/timescale-signup/" target="_blank">start a free trial!</a>
    </p>
</div>

<!--kg-card-end: html-->
]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Point-in-Time PostgreSQL Database and Query Monitoring With pg_stat_statements]]></title>
            <description><![CDATA[The pg_stat_statements extension strikes again. Learn how to store metrics snapshots regularly for more efficient database monitoring.]]></description>
            <link>https://www.tigerdata.com/blog/point-in-time</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/point-in-time</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[PostgreSQL Performance]]></category>
            <category><![CDATA[PostgreSQL Tips]]></category>
            <dc:creator><![CDATA[Ryan Booz]]></dc:creator>
            <pubDate>Tue, 03 May 2022 13:08:42 GMT</pubDate>
            <media:content medium="image" url="https://timescale.ghost.io/blog/content/images/2022/05/PIT-PostgreSQL-Database-Query-Monitoring--1-.png">
            </media:content>
            <content:encoded><![CDATA[<p>Database monitoring is a crucial part of effective data management and building high-performance applications. <a href="https://timescale.ghost.io/blog/identify-postgresql-performance-bottlenecks-with-pg_stat_statements/">In our previous blog post</a>, we discussed how to enable <code>pg_stat_statements</code> (and that it comes standard on all Timescale instances), what data it provides, and demonstrated a few queries that you can run to glean useful information from the metrics to help pinpoint problem queries. </p><p>We also discussed one of the few pitfalls with <code>pg_stat_statements</code>: <em>all of the data it provides is cumulative since the last server restart (or a superuser reset the statistics).</em></p><p>While <code>pg_stat_statements</code> can work as a go-to source of information to determine where initial problems might be occurring when the server isn’t performing as expected, the cumulative data it provides can also pose a problem when said server is struggling to keep up with the load.</p><p>The data might show that a particular application query has been called frequently and read a lot of data from the disk to return results, but that only tells part of the story. With cumulative data, it's impossible to answer specific questions about the state of your cluster, such as:</p><ul><li>Does it usually struggle with resources at this time of day?</li><li>Are there particular forms of the query that are slower than others?</li><li>Is it a specific database that's consuming resources more than others right now?</li><li>Is that normal given the current load?</li></ul><p>The database monitoring information that <code>pg_stat_statements</code> provides is invaluable when you need it. 
However, it's most helpful when the data shows trends and patterns over time to visualize the true state of your database when problems arise.</p><p>It would be much more valuable if you could transform this static, cumulative data into time-series data, regularly storing snapshots of the metrics. Once the data is stored, we can use standard SQL to query delta values of each snapshot and metric to see how each database, user, and query performed interval by interval. That also makes it much easier to pinpoint <em>when</em> a problem started and <em>what</em> query or database appears to be contributing the most.</p><p>In this blog post (a sequel to <a href="https://timescale.ghost.io/blog/using-pg-stat-statements-to-optimize-queries/" rel="noreferrer">our previous 101 on the subject</a>), we'll discuss the basic process for storing and analyzing the <code>pg_stat_statements</code> snapshot data over time using TimescaleDB features to store, manage, and query the metric data efficiently.</p><p>Although you could automate the storing of snapshot data with the help of a few <a href="https://www.tigerdata.com/blog/top-8-postgresql-extensions" rel="noreferrer">PostgreSQL extensions</a>, only TimescaleDB provides everything you need, including automated job scheduling, data compression, data retention, and continuous aggregates to manage the entire solution efficiently.</p><div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text">You can complement pg_stat_statements with <a href="https://www.timescale.com/blog/database-monitoring-and-query-optimization-introducing-insights-on-timescale/" rel="noreferrer">Insights</a> for great database observability at an unprecedented level of granularity. 
To try Insights and unlock a new, insightful understanding of your queries and performance,&nbsp;<a href="https://console.cloud.timescale.com/signup?ref=timescale.com">sign up for Timescale</a>.</div></div><h2 id="postgresql-database-monitoring-201-preparing-to-store-data-snapshots">PostgreSQL Database Monitoring 201: Preparing to Store Data Snapshots</h2><p>Before we can query and store ongoing snapshot data from <code>pg_stat_statements</code>, we need to prepare a schema and, optionally, a separate database to keep all of the information we'll collect with each snapshot.</p><p>We're choosing to be very opinionated about how we store the snapshot metric data and, optionally, how to separate some information, like the query text itself. Use our example as a building block to store the information you find most useful in your environment. You may not want to keep some metric data (e.g., <code>exec_stddev</code>), and that's okay. Adjust based on your database monitoring needs.</p><h3 id="create-a-metrics-database">Create a metrics database</h3><p>Recall that <code>pg_stat_statements</code> tracks statistics for every database in your PostgreSQL cluster. Also, any user with the appropriate permissions can query all of the data while connected to any database. Therefore, while creating a separate database is an optional step, storing this data in a separate TimescaleDB database makes it easier to filter out the queries from the ongoing snapshot collection process.</p><p>We also show the creation of a separate schema called <code>statements_history</code> to store all of the tables and procedures used throughout the examples. This allows a clean separation of this data from anything else you may want to do within this database.</p><pre><code class="language-sql">psql=&gt; CREATE DATABASE statements_monitor;

psql=&gt; \c statements_monitor;

psql=&gt; CREATE EXTENSION IF NOT EXISTS timescaledb;

psql=&gt; CREATE SCHEMA IF NOT EXISTS statements_history;
</code></pre>
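<p>Before moving on, it's worth confirming that <code>pg_stat_statements</code> itself is available. A sketch for self-managed PostgreSQL, where the library must be preloaded and the server restarted (on Timescale instances the extension comes standard, so only the <code>CREATE EXTENSION</code> step applies; if your <code>shared_preload_libraries</code> already lists other libraries, append rather than overwrite):</p><pre><code class="language-sql">-- Preload the library (requires a server restart to take effect).
ALTER SYSTEM SET shared_preload_libraries = 'timescaledb,pg_stat_statements';

-- After the restart, create the extension in the database
-- you'll collect snapshots from.
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;</code></pre>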
<h3 id="create-hypertables-to-store-snapshot-data">Create hypertables to store snapshot data</h3><p>Whether you create a separate database or use an existing TimescaleDB database, we need to create the tables to store the snapshot information. For our example, we'll create three tables:</p><ul><li><strong><code>snapshots</code> <code>(hypertable)</code></strong>: a cluster-wide aggregate snapshot of all metrics for easy cluster-level monitoring</li><li><code><strong>queries</strong></code>: separate storage of query text by <code>queryid</code>, <code>rolname</code>, and <code>datname</code> </li><li><strong><code>statements</code> <code>(hypertable)</code></strong>: statistics for each query every time the snapshot is taken, grouped by <code>queryid</code>, <code>rolname</code>, and <code>datname</code> </li></ul><p>Both the <code>snapshots</code> and <code>statements</code> tables are converted to hypertables in the following SQL. Because these tables will store lots of metric data over time, making them hypertables unlocks powerful features for managing the data with compression and data retention, as well as speeding up queries that focus on specific periods of time. Finally, take note that each table is assigned a different <code>chunk_time_interval</code> based on the amount and frequency of the data that is added to it.</p><p>The <code>snapshots</code> table, for instance, will only receive one row per snapshot, which allows the chunks to be created less frequently (every four weeks) without growing too large. In contrast, the <code>statements</code> table will potentially receive thousands of rows every time a snapshot is taken, and so creating chunks more frequently (every week) allows us to compress this data sooner and provides more fine-grained control over data retention. 
</p><p>The size and activity of your cluster, along with how often you run the job to take a snapshot of data, will influence what the right <code>chunk_time_interval</code> is for your system. More information about chunk sizes and best practices can be <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/hypertables/best-practices/">found in our documentation</a>.</p><pre><code class="language-sql">/*
 * The snapshots table holds the cluster-wide values
 * each time an overall snapshot is taken. There is
 * no database or user information stored. This allows
 * you to create cluster dashboards for very fast, high-level
 * information on the trending state of the cluster.
 */
CREATE TABLE IF NOT EXISTS statements_history.snapshots (
    created timestamp with time zone NOT NULL,
    calls bigint NOT NULL,
    total_plan_time double precision NOT NULL,
    total_exec_time double precision NOT NULL,
    rows bigint NOT NULL,
    shared_blks_hit bigint NOT NULL,
    shared_blks_read bigint NOT NULL,
    shared_blks_dirtied bigint NOT NULL,
    shared_blks_written bigint NOT NULL,
    local_blks_hit bigint NOT NULL,
    local_blks_read bigint NOT NULL,
    local_blks_dirtied bigint NOT NULL,
    local_blks_written bigint NOT NULL,
    temp_blks_read bigint NOT NULL,
    temp_blks_written bigint NOT NULL,
    blk_read_time double precision NOT NULL,
    blk_write_time double precision NOT NULL,
    wal_records bigint NOT NULL,
    wal_fpi bigint NOT NULL,
    wal_bytes numeric NOT NULL,
    wal_position bigint NOT NULL,
    stats_reset timestamp with time zone NOT NULL,
    PRIMARY KEY (created)
);

/*
 * Convert the snapshots table into a hypertable with a 4 week
 * chunk_time_interval. TimescaleDB will create a new chunk
 * every 4 weeks to store new data. By making this a hypertable we
 * can take advantage of other TimescaleDB features like native 
 * compression, data retention, and continuous aggregates.
 */
SELECT * FROM create_hypertable(
    'statements_history.snapshots',
    'created',
    chunk_time_interval =&gt; interval '4 weeks'
);

COMMENT ON TABLE statements_history.snapshots IS
$$This table contains a full aggregate of the pg_stat_statements view
at the time of the snapshot. This allows for very fast queries that require a high-level overview.$$;

/*
 * To reduce the storage requirement of saving query statistics
 * at a consistent interval, we store the query text in a separate
 * table and join it as necessary. The queryid is the identifier
 * for each query across tables.
 */
CREATE TABLE IF NOT EXISTS statements_history.queries (
    queryid bigint NOT NULL,
    rolname text,
    datname text,
    query text,
    PRIMARY KEY (queryid, datname, rolname)
);

COMMENT ON TABLE statements_history.queries IS
$$This table contains all query text. Storing it once, keyed by queryid, avoids repeatedly storing the same query text with every snapshot.$$;


/*
 * Finally, we store the individual statistics for each queryid
 * each time we take a snapshot. This allows you to dig into a
 * specific interval of time and see the snapshot-by-snapshot view
 * of query performance and resource usage
 */
CREATE TABLE IF NOT EXISTS statements_history.statements (
    created timestamp with time zone NOT NULL,
    queryid bigint NOT NULL,
    plans bigint NOT NULL,
    total_plan_time double precision NOT NULL,
    calls bigint NOT NULL,
    total_exec_time double precision NOT NULL,
    rows bigint NOT NULL,
    shared_blks_hit bigint NOT NULL,
    shared_blks_read bigint NOT NULL,
    shared_blks_dirtied bigint NOT NULL,
    shared_blks_written bigint NOT NULL,
    local_blks_hit bigint NOT NULL,
    local_blks_read bigint NOT NULL,
    local_blks_dirtied bigint NOT NULL,
    local_blks_written bigint NOT NULL,
    temp_blks_read bigint NOT NULL,
    temp_blks_written bigint NOT NULL,
    blk_read_time double precision NOT NULL,
    blk_write_time double precision NOT NULL,
    wal_records bigint NOT NULL,
    wal_fpi bigint NOT NULL,
    wal_bytes numeric NOT NULL,
    rolname text NOT NULL,
    datname text NOT NULL,
    PRIMARY KEY (created, queryid, rolname, datname),
    FOREIGN KEY (queryid, datname, rolname) REFERENCES statements_history.queries (queryid, datname, rolname) ON DELETE CASCADE
);

/*
 * Convert the statements table into a hypertable with a 1 week
 * chunk_time_interval. TimescaleDB will create a new chunk
 * every week to store new data. Because this table will receive
 * more data every time we take a snapshot, a shorter interval
 * allows us to manage compression and retention to a shorter interval
 * if needed. It also provides smaller overall chunks for querying
 * when focusing on specific time ranges.
 */
SELECT * FROM create_hypertable(
    'statements_history.statements',
    'created',
    create_default_indexes =&gt; false,
    chunk_time_interval =&gt; interval '1 week'
);

</code></pre>
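<p>Because <code>snapshots</code> and <code>statements</code> are hypertables, compression and retention can be handled by the built-in policies mentioned above, so the metric history doesn't grow unbounded. A sketch with illustrative values (the <code>segmentby</code> column and the intervals are assumptions to tune for your own query patterns and retention needs):</p><pre><code class="language-sql">-- Enable native compression on the statements hypertable, segmenting
-- compressed data by queryid for efficient per-query reads.
ALTER TABLE statements_history.statements SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'queryid'
);

-- Compress chunks once they are two weeks old.
SELECT add_compression_policy('statements_history.statements', INTERVAL '2 weeks');

-- Drop per-query detail after 90 days.
SELECT add_retention_policy('statements_history.statements', INTERVAL '90 days');</code></pre>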
<h3 id="create-the-snapshot-stored-procedure">Create the snapshot stored procedure</h3><p>With the tables created to store the statistics data from <code>pg_stat_statements</code>, we need to create a stored procedure that will run on a scheduled basis to collect and store the data. This is a straightforward process with <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/user-defined-actions/create-and-register/">TimescaleDB user-defined actions</a>. </p><p>A user-defined action provides a method for scheduling a custom stored procedure using the underlying scheduling engine that TimescaleDB uses for automated policies like continuous aggregate refresh and data retention. Although there are other <a href="https://www.tigerdata.com/blog/top-8-postgresql-extensions" rel="noreferrer">PostgreSQL extensions</a> for managing schedules, this feature is included with TimescaleDB by default.<br><br>First, create the stored procedure to populate the data. In this example, we use a multi-part common table expression (CTE) to fill in each table, starting with the results of the <code>pg_stat_statements</code> view.</p><pre><code class="language-sql">CREATE OR REPLACE PROCEDURE statements_history.create_snapshot(
    job_id int,
    config jsonb
)
LANGUAGE plpgsql AS
$function$
DECLARE
    snapshot_time timestamp with time zone := now();
BEGIN
	/*
	 * This first CTE queries pg_stat_statements and joins
	 * to the roles and database table for more detail that
	 * we will store later.
	 */
    WITH statements AS (
        SELECT
            *
        FROM
            pg_stat_statements(true)
        JOIN
            pg_roles ON (userid=pg_roles.oid)
        JOIN
            pg_database ON (dbid=pg_database.oid)
        WHERE queryid IS NOT NULL
    ), 
    /*
     * We then get the individual queries out of the result
     * and store the text and queryid separately to avoid
     * storing the same query text over and over.
     */
    queries AS (
        INSERT INTO
            statements_history.queries (queryid, query, datname, rolname)
        SELECT
            queryid, query, datname, rolname
        FROM
            statements
        ON CONFLICT
            DO NOTHING
        RETURNING
            queryid
    ), 
    /*
     * This query SUMs all data from all queries and databases
     * to get high-level cluster statistics each time the snapshot
     * is taken.
     */
    snapshot AS (
        INSERT INTO
            statements_history.snapshots
        SELECT
            snapshot_time,
            sum(calls),
            sum(total_plan_time) AS total_plan_time,
            sum(total_exec_time) AS total_exec_time,
            sum(rows) AS rows,
            sum(shared_blks_hit) AS shared_blks_hit,
            sum(shared_blks_read) AS shared_blks_read,
            sum(shared_blks_dirtied) AS shared_blks_dirtied,
            sum(shared_blks_written) AS shared_blks_written,
            sum(local_blks_hit) AS local_blks_hit,
            sum(local_blks_read) AS local_blks_read,
            sum(local_blks_dirtied) AS local_blks_dirtied,
            sum(local_blks_written) AS local_blks_written,
            sum(temp_blks_read) AS temp_blks_read,
            sum(temp_blks_written) AS temp_blks_written,
            sum(blk_read_time) AS blk_read_time,
            sum(blk_write_time) AS blk_write_time,
            sum(wal_records) AS wal_records,
            sum(wal_fpi) AS wal_fpi,
            sum(wal_bytes) AS wal_bytes,
            pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0'),
            pg_postmaster_start_time()
        FROM
            statements
    )
    /*
     * And finally, we store the individual pg_stat_statement 
     * aggregated results for each query, for each snapshot time.
     */
    INSERT INTO
        statements_history.statements
    SELECT
        snapshot_time,
        queryid,
        plans,
        total_plan_time,
        calls,
        total_exec_time,
        rows,
        shared_blks_hit,
        shared_blks_read,
        shared_blks_dirtied,
        shared_blks_written,
        local_blks_hit,
        local_blks_read,
        local_blks_dirtied,
        local_blks_written,
        temp_blks_read,
        temp_blks_written,
        blk_read_time,
        blk_write_time,
        wal_records,
        wal_fpi,
        wal_bytes,
        rolname,
        datname
    FROM
        statements;

END;
$function$;
</code></pre>
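<p>Before scheduling anything, it can be worth invoking the procedure once by hand to confirm that it runs cleanly. Because it is a stored procedure, use <code>CALL</code>; the <code>job_id</code> and <code>config</code> arguments are normally supplied by the job scheduler, so placeholder values are fine for a manual test:</p><pre><code class="language-sql">-- One-off manual run to verify the snapshot logic
CALL statements_history.create_snapshot(0, NULL);

-- The snapshots table should now contain at least one row
SELECT count(*) FROM statements_history.snapshots;
</code></pre>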
<p>Once you create the stored procedure, schedule it to run on an ongoing basis as a user-defined action. In the following example, we schedule snapshot data collection every minute, which may be more frequent than you need. Adjust the interval to suit your data capture and monitoring requirements.</p><pre><code class="language-sql">/*
* Adjust the schedule_interval based on how often
* a snapshot of the data should be captured
*/
SELECT add_job(
    'statements_history.create_snapshot',
    schedule_interval=&gt;'1 minutes'::interval
);
</code></pre>
<p>And finally, you can verify that the user-defined action job is running correctly by querying the jobs information views. If you set the <code>schedule_interval</code> to one minute (as shown above), wait a few minutes, and then ensure that <code>last_run_status</code> is <code>Success</code> and that <code>total_failures</code> is zero.</p><pre><code class="language-sql">SELECT js.* FROM timescaledb_information.jobs j
 INNER JOIN timescaledb_information.job_stats js ON j.job_id = js.job_id
WHERE j.proc_name='create_snapshot';


Name                  |Value                        |
----------------------+-----------------------------+
hypertable_schema     |                             |
hypertable_name       |                             |
job_id                |1008                         |
last_run_started_at   |2022-04-13 17:43:15.053 -0400|
last_successful_finish|2022-04-13 17:43:15.068 -0400|
last_run_status       |Success                      |
job_status            |Scheduled                    |
last_run_duration     |00:00:00.014755              |
next_start            |2022-04-13 17:44:15.068 -0400|
total_runs            |30186                        |
total_successes       |30167                        |
total_failures        |0                            |

</code></pre>
<p>The query metric database is set up and ready to query! Let's look at a few query examples to help you get started.</p><h2 id="querying-pgstatstatements-snapshot-data">Querying pg_stat_statements Snapshot Data</h2><p>We chose to create two statistics tables: one that aggregates the snapshot statistics for the cluster, independent of any specific query, and another that stores statistics for each query per snapshot. The data is time-stamped using the <code>created</code> column in both tables. The rate of change for each snapshot is the difference in cumulative statistics values from one snapshot to the next.</p><p>This is accomplished in SQL using the <code>LAG</code> window function, which subtracts each row's values from those of the previous row, ordered by the <code>created</code> column.</p><h3 id="cluster-performance-over-time">Cluster performance over time</h3><p>This first example queries the "snapshots" table, which stores the aggregate total of all statistics for the entire cluster. Running this query returns the per-snapshot change in each value, not the overall cumulative <code>pg_stat_statements</code> totals.</p><pre><code class="language-sql">/*
 * This CTE queries the snapshot table (full cluster statistics)
 * to get a high-level view of the cluster state.
 * 
 * We query each row with a LAG of the previous row to retrieve
 * the delta of each value to make it suitable for graphing.
 */
WITH deltas AS (
    SELECT
        created,
        extract('epoch' from created - lag(d.created) OVER (w)) AS delta_seconds,
        d.ROWS - lag(d.rows) OVER (w) AS delta_rows,
        d.total_plan_time - lag(d.total_plan_time) OVER (w) AS delta_plan_time,
        d.total_exec_time - lag(d.total_exec_time) OVER (w) AS delta_exec_time,
        d.calls - lag(d.calls) OVER (w) AS delta_calls,
        d.wal_bytes - lag(d.wal_bytes) OVER (w) AS delta_wal_bytes,
        stats_reset
    FROM
        statements_history.snapshots AS d
    WHERE
        created &gt; now() - INTERVAL '2 hours'
    WINDOW
        w AS (PARTITION BY stats_reset ORDER BY created ASC)
)
SELECT
    created AS "time",
    delta_rows,
    delta_calls/delta_seconds AS calls,
    delta_plan_time/delta_seconds/1000 AS plan_time,
    delta_exec_time/delta_seconds/1000 AS exec_time,
    delta_wal_bytes/delta_seconds AS wal_bytes
FROM
    deltas
ORDER BY
    created ASC;  

time                         |delta_rows|calls               |plan_time|exec_time         |wal_bytes         |
-----------------------------+----------+--------------------+---------+------------------+------------------+
2022-04-13 15:55:12.984 -0400|          |                    |         |                  |                  |
2022-04-13 15:56:13.000 -0400|        89| 0.01666222812679749|      0.0| 0.000066054620811| 576.3131464496716|
2022-04-13 15:57:13.016 -0400|        89|0.016662253391151797|      0.0|0.0000677694667946|  591.643293413018|
2022-04-13 15:58:13.031 -0400|        89|0.016662503817796187|      0.0|0.0000666146741069| 576.3226820499345|
2022-04-13 15:59:13.047 -0400|        89|0.016662103471929153|      0.0|0.0000717084114511| 591.6379700812604|
2022-04-13 16:00:13.069 -0400|        89| 0.01666062607900462|      0.0|0.0001640335102535|3393.3363560151874|

</code></pre>
<h3 id="top-100-most-expensive-queries">Top 100 most expensive queries</h3><p>Getting an overview of the cluster instance is really helpful for understanding the state of the whole system over time. Another useful set of data to analyze quickly is a list of the queries consuming the most cluster resources, query by query. There are many ways you could query the snapshot information for these details, and your definition of "resource-intensive" might be different from what we show, but this example gives the high-level cumulative statistics for each query over the specified time, ordered by the highest total sum of execution and planning time.</p><pre><code class="language-sql">/*
* individual data for each query for a specified time range, 
* which is particularly useful for zeroing in on a specific
* query in a tool like Grafana
*/
WITH snapshots AS (
    SELECT
        max,
        -- We need at least 2 snapshots to calculate a delta. If the dashboard is currently showing
        -- a period &lt; 5 minutes, we only have 1 snapshot, and therefore no delta. In that case,
        -- we take the snapshot just before this window to still come up with useful deltas.
        CASE
            WHEN max = min
                THEN (SELECT max(created) FROM statements_history.snapshots WHERE created &lt; min)
            ELSE min
        END AS min
    FROM (
        SELECT
            max(created),
            min(created)
        FROM
            statements_history.snapshots WHERE created &gt; now() - '1 hour'::interval
            -- Grafana-based filter
            --statements_history.snapshots WHERE $__timeFilter(created)
        GROUP by
            stats_reset
        ORDER by
            max(created) DESC
        LIMIT 1
    ) AS max(max, min)
), deltas AS (
    SELECT
        rolname,
        datname,
        queryid,
        extract('epoch' from max(created) - min(created)) AS delta_seconds,
        max(total_exec_time) - min(total_exec_time) AS delta_exec_time,
        max(total_plan_time) - min(total_plan_time) AS delta_plan_time,
        max(calls) - min(calls) AS delta_calls,
        max(shared_blks_hit) - min(shared_blks_hit) AS delta_shared_blks_hit,
        max(shared_blks_read) - min(shared_blks_read) AS delta_shared_blks_read
    FROM
        statements_history.statements
    WHERE
        -- Granted, this looks odd; however, it helps the DecompressChunk node tremendously.
        -- Without these distinct filters, it would aggregate first and then filter.
        -- With them, it filters while scanning, which has a huge knock-on effect on the
        -- upper nodes.
        (created &gt;= (SELECT min FROM snapshots) AND created &lt;= (SELECT max FROM snapshots))
    GROUP BY
        rolname,
        datname,
        queryid
)
SELECT
    rolname,
    datname,
    queryid::text,
    delta_exec_time/delta_seconds/1000 AS exec,
    delta_plan_time/delta_seconds/1000 AS plan,
    delta_calls/delta_seconds AS calls,
    delta_shared_blks_hit/delta_seconds*8192 AS cache_hit,
    delta_shared_blks_read/delta_seconds*8192 AS cache_miss,
    query
FROM
    deltas
JOIN
    statements_history.queries USING (rolname,datname,queryid)
WHERE
    delta_calls &gt; 1
    AND delta_exec_time &gt; 1
    AND query ~* $$.*$$
ORDER BY
    delta_exec_time+delta_plan_time DESC
LIMIT 100;


rolname  |datname|queryid             |exec              |plan|calls               |cache_hit         |cache_miss|query      
---------+-------+--------------------+------------------+----+--------------------+------------------+----------+-----------
tsdbadmin|tsdb   |731301775676660043  |0.0000934033977289| 0.0|0.016660922907623773| 228797.2725854585|       0.0|WITH statem...
tsdbadmin|tsdb   |-686339673194700075 |0.0000570625206738| 0.0|  0.0005647770477161|116635.62329618855|       0.0|WITH snapsh...
tsdbadmin|tsdb   |-5804362417446225640|0.0000008223159463| 0.0|  0.0005647770477161| 786.5311077312939|       0.0|-- NOTE Thi...

</code></pre>
<p>However you decide to order this data, you now have a quick result set with the text of the query and the <code>queryid</code>. With just a bit more effort, we can dig even deeper into the performance of a specific query over time.</p><p>For example, in the output from the previous query, we can see that <code>queryid=731301775676660043</code> has the longest overall execution and planning time of all queries for this period. We can use that <code>queryid</code> to dig a little deeper into the snapshot-by-snapshot performance of this specific query.</p><pre><code class="language-sql">/*
 * When you want to dig into an individual query, this takes
 * a similar approach to the "snapshot" query above, but for 
 * an individual query ID.
 */
WITH deltas AS (
    SELECT
        created,
        st.calls - lag(st.calls) OVER (query_w) AS delta_calls,
        st.plans - lag(st.plans) OVER (query_w) AS delta_plans,
        st.rows - lag(st.rows) OVER (query_w) AS delta_rows,
        st.shared_blks_hit - lag(st.shared_blks_hit) OVER (query_w) AS delta_shared_blks_hit,
        st.shared_blks_read - lag(st.shared_blks_read) OVER (query_w) AS delta_shared_blks_read,
        st.temp_blks_written - lag(st.temp_blks_written) OVER (query_w) AS delta_temp_blks_written,
        st.total_exec_time - lag(st.total_exec_time) OVER (query_w) AS delta_total_exec_time,
        st.total_plan_time - lag(st.total_plan_time) OVER (query_w) AS delta_total_plan_time,
        st.wal_bytes - lag(st.wal_bytes) OVER (query_w) AS delta_wal_bytes,
        extract('epoch' from st.created - lag(st.created) OVER (query_w)) AS delta_seconds
    FROM
        statements_history.statements AS st
    JOIN
        statements_history.snapshots USING (created)
    WHERE
        -- Adjust filters to match your queryid and time range
        created &gt; now() - interval '25 minutes'
        AND created &lt; now() + interval '25 minutes'
        AND queryid=731301775676660043
    WINDOW
        query_w AS (PARTITION BY datname, rolname, queryid, stats_reset ORDER BY created)
)
SELECT
    created AS "time",
    delta_calls/delta_seconds AS calls,
    delta_plans/delta_seconds AS plans,
    delta_total_exec_time/delta_seconds/1000 AS exec_time,
    delta_total_plan_time/delta_seconds/1000 AS plan_time,
    delta_rows/nullif(delta_calls, 0) AS rows_per_query,
    delta_shared_blks_hit/delta_seconds*8192 AS cache_hit,
    delta_shared_blks_read/delta_seconds*8192 AS cache_miss,
    delta_temp_blks_written/delta_seconds*8192 AS temp_bytes,
    delta_wal_bytes/delta_seconds AS wal_bytes,
    delta_total_exec_time/nullif(delta_calls, 0) exec_time_per_query,
    delta_total_plan_time/nullif(delta_plans, 0) AS plan_time_per_plan,
    delta_shared_blks_hit/nullif(delta_calls, 0)*8192 AS cache_hit_per_query,
    delta_shared_blks_read/nullif(delta_calls, 0)*8192 AS cache_miss_per_query,
    delta_temp_blks_written/nullif(delta_calls, 0)*8192 AS temp_bytes_written_per_query,
    delta_wal_bytes/nullif(delta_calls, 0) AS wal_bytes_per_query
FROM
    deltas
WHERE
    delta_calls &gt; 0
ORDER BY
    created ASC;


time                         |calls               |plans|exec_time         |plan_time|ro...
-----------------------------+--------------------+-----+------------------+---------+--
2022-04-14 14:33:39.831 -0400|0.016662115132216382|  0.0|0.0000735602224659|      0.0|  
2022-04-14 14:34:39.847 -0400|0.016662248949061972|  0.0|0.0000731468396678|      0.0|  
2022-04-14 14:35:39.863 -0400|  0.0166622286820572|  0.0|0.0000712116494436|      0.0|  
2022-04-14 14:36:39.880 -0400|0.016662015187426844|  0.0|0.0000702374920336|      0.0|  

</code></pre>
<h2 id="compression-continuous-aggregates-and-data-retention">Compression, Continuous Aggregates, and Data Retention</h2><p>This self-serve query monitoring setup doesn't require TimescaleDB. You could schedule the snapshot job with other extensions or tools, and regular PostgreSQL tables could likely store the data you retain for some time without much of an issue. Still, all of this is classic time-series data, tracking the state of your PostgreSQL cluster(s) over time.</p><p>Keeping as much historical data as possible significantly improves the effectiveness of this database monitoring solution. TimescaleDB offers several features that are not available with vanilla PostgreSQL to help you manage growing time-series data and improve the efficiency of your queries and process.</p><p><a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/compression/"><strong>Compression</strong></a> on this data is highly effective and efficient for two reasons:</p><ol><li>Most of the data is stored as integers, which compress very well (96%+) using our type-specific algorithms.</li><li>The data can be compressed more aggressively because we never update or delete it once written. This means that it's possible to store months or years of data with very little disk utilization, and queries on specific columns of data will often be significantly faster from compressed data.</li></ol><p><a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/continuous-aggregates/"><strong>Continuous aggregates</strong></a> allow you to maintain higher-level rollups over time for aggregate queries that you run often. Suppose you have dashboards that show 10-minute averages for all of this data. In that case, you could write a continuous aggregate to pre-aggregate that data for you over time without modifying the snapshot process. 
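</p><p>As a rough sketch (the intervals, view name, and segmenting columns below are illustrative assumptions, not recommendations), these features might be configured for the statements hypertable like this:</p><pre><code class="language-sql">-- Compress chunks older than a week, segmenting by query identity
ALTER TABLE statements_history.statements SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'queryid, datname, rolname'
);
SELECT add_compression_policy('statements_history.statements', INTERVAL '7 days');

-- A 10-minute rollup of per-query execution time deltas
CREATE MATERIALIZED VIEW statements_history.statements_10m
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('10 minutes', created) AS bucket,
    queryid, datname, rolname,
    max(total_exec_time) - min(total_exec_time) AS exec_time
FROM statements_history.statements
GROUP BY bucket, queryid, datname, rolname;

-- Automatically drop raw chunks once they are older than 90 days
SELECT add_retention_policy('statements_history.statements', INTERVAL '90 days');
</code></pre><p>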
This allows you to create new aggregations after the raw data has been stored and new query opportunities come to light.</p><p>And finally, <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/data-retention/"><strong>data retention</strong></a> allows you to drop older raw data automatically once it has reached a defined age. If continuous aggregates are defined on the raw data, they will continue to serve the aggregated data, letting you maintain the level of data fidelity you need as data ages.</p><p>Together, these features provide a complete solution for storing lots of monitoring data about your cluster(s) over the long haul. See the links provided for each feature for more information.</p><h2 id="better-self-serve-query-monitoring-with-pgstatstatements">Better Self-Serve Query Monitoring With pg_stat_statements<br></h2><p>Everything we've discussed and shown in this post is just the beginning. With a few <a href="https://www.tigerdata.com/blog/database-indexes-in-postgresql-and-timescale-cloud-your-questions-answered" rel="noreferrer">hypertables</a> and queries, the cumulative data from <code>pg_stat_statements</code> can quickly come to life. Once the process is in place and you get more comfortable querying it, visualizing it will be very helpful. </p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/04/PIT-Query-pg_stat-img-1.gif" class="kg-image" alt="" loading="lazy" width="512" height="330"></figure><p><br><br><code>pg_stat_statements</code> is automatically enabled in all Timescale services. 
If you’re not a user yet, <a href="https://console.cloud.timescale.com/signup">you can try out Timescale for free</a> (no credit card required) to get access to a modern cloud-native database platform with <a href="https://timescale.ghost.io/blog/postgresql-timescaledb-1000x-faster-queries-90-data-compression-and-much-more/" rel="noreferrer">TimescaleDB's top performance</a>.<br><br>To complement <code>pg_stat_statements</code> for better query monitoring, check out <a href="https://timescale.ghost.io/blog/database-monitoring-and-query-optimization-introducing-insights-on-timescale/" rel="noreferrer">Insights</a>, also available for trial services.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Teaching Elephants to Fish]]></title>
            <description><![CDATA[Timescale's developer advocate Ryan Booz reflects on the PostgreSQL community and shares five ideas on how to improve it.]]></description>
            <link>https://www.tigerdata.com/blog/the-future-of-community-in-light-of-babelfish</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/the-future-of-community-in-light-of-babelfish</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[SQL]]></category>
            <category><![CDATA[AWS]]></category>
            <category><![CDATA[Events & Recaps]]></category>
            <dc:creator><![CDATA[Ryan Booz]]></dc:creator>
            <pubDate>Wed, 27 Apr 2022 16:35:09 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2022/04/PostgreSQL-community-elephants.jpg">
            </media:content>
            <content:encoded><![CDATA[<h3 id="the-future-of-community-in-light-of-babelfish">The future of community in light of Babelfish</h3><p></p><p><em>This blog post was adapted from the PGConf NYC 2021 keynote. Originally published at </em><a href="https://postgresconf.org/blog/posts/teaching-elephants-to-fish">https://postgresconf.org</a>.</p><p>On December 1, 2020, at its annual re:Invent conference, Amazon AWS announced Babelfish—an open-source PostgreSQL translation layer that allows SQL Server applications to work natively, and transparently, with PostgreSQL. To be honest, as someone that's spent a significant part of my career using both SQL Server and PostgreSQL, this wasn't actually a very “exciting” development. </p><p>I'm not sure that most people in either community really gave it that much notice a year ago. In fact, my first thought was that Babelfish is just an oversized object-relational mapping (ORM) framework that wasn't tied to any specific development language. While these tools have proven to be hugely useful to many developers, nearly every DBA has first-hand war stories that demonstrate the challenge that automated query generation can impose on a complex system. Frankly, the thought terrifies me a bit.</p><p>Until recently, however, all we knew about Babelfish was based on Amazon published content. But in October 2021, Babelfish was finally released for public access and preview at <a href="https://babelfishpg.org/">https://babelfishpg.org/</a>.</p><p>Not long after the release was announced, I had the opportunity to participate in a video call with some members of the European PostgreSQL community for a first look at Babelfish in real life. It was interesting, and kind of exciting to see what worked and what didn't. However, I didn't leave that call any less concerned about the struggles my SQL Server friends will have as their management teams mandate switching to PostgreSQL using Babelfish. 
It also got me thinking: <em>why SQL Server?</em></p><p>I decided to look at <a href="https://db-engines.com/">DB-engines.com</a> to see if the engagement metrics that they track would shed any light on it. Although the website doesn't disclose the specific method for determining database engine rank, we know that social engagement and search engine trends play a role in the rankings.</p><p>When I zoomed in on the four "major" relational database engines, utilizing the last nine years of data, two things jumped out at me:</p><ol><li>Only PostgreSQL has seen consistent increases in popularity and engagement over the nine years of tracking.</li><li>While the three other engines have had some steady decline over the same period, SQL Server seems to have the biggest drop-off compared to Oracle and MySQL.</li></ol><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/04/Babelfish-Img-1.png" class="kg-image" alt="A database engines ranking showing how SQL Server seems to have the biggest drop off compared to Oracle and MySQL in the last few years." loading="lazy" width="1370" height="836" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/04/Babelfish-Img-1.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/04/Babelfish-Img-1.png 1000w, https://timescale.ghost.io/blog/content/images/2022/04/Babelfish-Img-1.png 1370w" sizes="(min-width: 720px) 720px"></figure><p>While I can posit many reasons why Amazon AWS chose to go directly after SQL Server for this transparent compatibility tool, the reasons to use PostgreSQL as the solution to the problem are obvious to the PostgreSQL community. 
</p><p>A few facts on PostgreSQL:</p><ul><li>It is the fastest growing, relational database engine on the planet.</li><li>It has a proven foundation.</li><li>It is easy to enhance through extensibility.</li></ul><p>Regardless of how we feel about this new turn of events or the potential onslaught of new support needs by users of SQL Server, there's not much we can do about it. The proverbial cat is already out of the bag.</p><p>Whether the necessary patches are included in the core PostgreSQL code or not, AWS Aurora, at least, will still offer this functionality as a service. I believe this means that over the next 2-4 years, the community will grow from a base of users that have a lot of database experience but have little footing for how to approach a similar, but different, database. And regardless of how we feel about the new demands this will put on this community, it's a group of people that still want to do their job well and contribute back to the community.</p><h2 id="why-do-i-care">Why Do I Care?</h2><p>So, why do I care? Well, if you were to put my 20+ years of database experience into a word cloud of sorts, SQL Server would occupy the largest portion of space. For nearly 15 of the last 21 years, I was primarily a Microsoft data platform user. And while PostgreSQL has occupied the second largest (and longest) portion of my database landscape, I really came of age as a database professional within the SQL Server community. In fact, since coming back into the PostgreSQL community almost four years ago, I've continued to look for ways to foster a community and learning modeled after what I knew and experienced through SQL Server and the Professional Association of SQL Server (PASS) community. 
And I can say, without a doubt, that I'm not the only one looking for the same thing.</p><p>Let me ask you something then.</p><p><strong><em>When was the last time you had to join a new community?</em></strong></p><p>Is the first thing that comes to mind a technical community (data, programming language, visualization tools, etc.) or something else? Whatever community that was, how did you feel after first stepping through the proverbial door?</p><p>Did you have a crowd of people ready to cheer you on, eager to see you succeed, and ready to support you?</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/04/Picture1.gif" class="kg-image" alt="People cheering in the audience at a sports game" loading="lazy" width="480" height="270"></figure><p>Or, did you feel like an outsider looking in, trying to figure out how to find the right help, from the right people? Did you feel like everyone else always knew how to connect with the people around you but you struggled to find comradery and resources?</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/04/Picture2.gif" class="kg-image" alt="Comedian Conan O'Brien looking inside a house through its window" loading="lazy" width="440" height="237"></figure><p>Depending on where you fall, what could have made it better for others or for yourself?</p><p>Let's bring this same thought experiment a little closer to home. What about the PostgreSQL community? What was your "onboarding" experience like and how does that compare to some of the newest members you've met recently?</p><p>It just so happens that we have some feedback from the larger community based on <a href="https://www.timescale.com/state-of-postgres-results/"><em>The State of PostgreSQL</em></a> survey Timescale orchestrated this past April. 
There were <strong>500 respondents</strong> who ranged in experience levels from novice, newly joined developers, to users with 20+ years of experience. Of those 500 respondents, <strong>49.5&nbsp;%</strong> had used PostgreSQL for <strong>five years or less</strong>, and <strong>50.5&nbsp;%</strong> had used PostgreSQL for <strong>six years or more</strong>.</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/04/bablefish-img-4.png" class="kg-image" alt="A bar graph on how long users have been using Postgres: 49.5&nbsp;% have used PostgreSQL for five years or less" loading="lazy" width="1150" height="520" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/04/bablefish-img-4.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/04/bablefish-img-4.png 1000w, https://timescale.ghost.io/blog/content/images/2022/04/bablefish-img-4.png 1150w" sizes="(min-width: 720px) 720px"></figure><p>It would make sense, then, that about 50&nbsp;% of the survey participants felt like it was a bit difficult to use PostgreSQL and get involved with the community.</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/04/bablefish-img-5.png" class="kg-image" alt="Text boxes with users' opinions on the PostgreSQL community over a blue background" loading="lazy" width="1600" height="915" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/04/bablefish-img-5.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/04/bablefish-img-5.png 1000w, https://timescale.ghost.io/blog/content/images/2022/04/bablefish-img-5.png 1600w" sizes="(min-width: 720px) 720px"></figure><p>And yet, the other 50&nbsp;% of participants seem to have a very different experience with PostgreSQL the longer they stay connected.</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/04/bablefish-img-6.png" 
class="kg-image" alt="Text boxes with users' opinions on the PostgreSQL community over a blue blackground" loading="lazy" width="1600" height="839" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/04/bablefish-img-6.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/04/bablefish-img-6.png 1000w, https://timescale.ghost.io/blog/content/images/2022/04/bablefish-img-6.png 1600w" sizes="(min-width: 720px) 720px"></figure><p>The real goal, then, is to determine ways to improve the user experience within the PostgreSQL community earlier in the cycle, rather than hoping folks stick around for more than five years so that they can begin to have a more positive outlook on the community at large.</p><p>As a developer advocate at Timescale, one of my primary responsibilities is to engage with the PostgreSQL community so that we can figure out how to tackle these issues head-on. I'm excited to contribute to the efforts, learning better methods to teach people how to use this technology well and help them be successful.</p><p>And, as a former SQL Server professional and community member, I want to prepare for what I believe will be a growing number of database professionals joining the Slack channel, Twitter conversations, and conferences trying to improve their craft and give back to the community.</p><p>To do this, we have at least two options in the months and years ahead.</p><p>The first option is to take sides. Unfortunately, this happens all too often in technical communities. Whether it's a database engine or the newest development language, this approach is an option. "We're better! You're not!"</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/04/bablefish-img-7.png" class="kg-image" alt="A meme from the Captain America Marvel movie showing two groups of superheroes opposing each other (PostgreSQL vs. 
SQL server)" loading="lazy" width="450" height="449"></figure><p>Or… we could choose to just sit around the table, share a meal, and learn from one another about how we can build a better, shared community based on the best parts of what each community offers.</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/04/bablefish-img-8.png" class="kg-image" alt="A scene from the same movie showing a group of people hanging out at a restaurant table" loading="lazy" width="450" height="233"></figure><p>Quite honestly, we could ensure that we're treating the community more like our beloved namesake—Slonik, the elephant.</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/04/bablefish-img-9.png" class="kg-image" alt="A group of elephants protecting a baby elephant" loading="lazy" width="450" height="282"></figure><p>It turns out that elephants form close-knit communities. They care for each other, they actively accept orphans and elephants in need, and they're generational, passing down knowledge and community expectations from one generation to the next. Not a bad example to follow, right?</p><h2 id="five-initial-options-for-building-a-better-postgresql-community">Five Initial Options for Building a Better PostgreSQL Community</h2><p>Let me present five quick, high-level ideas we could reuse from the SQL Server community that might begin the process of improving participation and engagement.</p><h3 id="1-lead-with-empathy-and-curiosity"><strong>1. Lead with empathy and curiosity</strong></h3><p>What does it mean to lead with empathy and curiosity? When the posture of the community starts with empathy, it means that we remember what it was like to be new to a community ourselves. When we start conversations from a place of curiosity, we avoid choosing sides and instead join new users where they are. 
<br><br>Here are a few things to keep in mind regarding SQL Server users, who might start to show up in greater numbers soon, although I think these are good things to remember regardless of the user.</p><p><strong>Expect confusion from users who already know SQL</strong>. Oftentimes these users honestly don't know that the dialect of SQL they've been using (T-SQL in this case) isn't a standard. They know concepts but not direct comparisons to the SQL standard or PL/pgSQL.</p><p><strong>Remember that you were a newbie once</strong>. Remember when I asked you to think of the last new community you joined? ;-)</p><p><strong>Assume positive intent and that they have tried to search out a solution first.</strong> Again, many users coming from the SQL Server community have a good network of people and resources they've grown accustomed to (we'll touch on that next!). But that community has also fostered good habits for finding solutions and asking better questions. Assume the best!</p><p><strong>Prepare resources to meet their specific needs</strong>. This doesn't mean that we craft all documentation, blogs, and help forums for a specific user base. But, if we know many people will be joining this community with the same understanding of a feature or topic, we can provide them a better foundation for transferring their knowledge into PostgreSQL. It would reduce support overall and foster better community involvement.</p><p>Let me give you one really simple example from my own experience (and that of <em>many</em> SQL Server users who try PostgreSQL for the first time).</p><p>Creating and using variables and control logic directly in the flow of a script is common practice in nearly every T-SQL script I've written, seen, or used. Because SQL Server executes all SQL as T-SQL by default, there is no need for separate code blocks to handle procedural logic. 
This really simple T-SQL example is incredibly common in the day-to-day workflow of a SQL Server DBA or developer.</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/04/bablefish-img-10.png" class="kg-image" alt="" loading="lazy" width="450" height="287"></figure><p>In PostgreSQL, however, I can't do this in the midst of my migration scripts or maintenance tasks directly. This was a constant source of frustration for me during the release of my first feature at a new company that was using PostgreSQL (but had very few developers who understood databases). <em>I knew</em> what I wanted to do, but I couldn't get any IDE or migration script to do what I wanted.</p><p>Eventually, I found a post that talked about anonymous code blocks (<code>DO</code> blocks) within a SQL script for ad hoc processing. Although I didn't like the added complexity, I could finally write my migrations correctly.</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/04/bablefish-img-11.png" class="kg-image" alt="" loading="lazy" width="450" height="400"></figure><p>Having proactive examples in the documentation that acknowledge this difference could be a game-changer for developer success.</p><h3 id="2-lower-the-bar-for-entry-level-pghelp">2. Lower the bar for entry-level #pghelp</h3><p>More than 10 years ago, someone in the SQL Server community had the idea of using the #sqlhelp Twitter hashtag to provide help. At the time, the size limit for a tweet was 140 characters, so they understood it could only provide short, succinct, triage-like help. In some ways, I think this limitation has actually helped the community learn how to ask better questions that will draw valuable answers. In my opinion, this is evidenced by two things.</p><p>First, not every question gets an answer. It's a community-led initiative, and so the quality of the question and respect for the "free" nature of the help influence overall engagement. 
And second, the SQL Server community actively protects the use of this hashtag. That's not always taken kindly by outsiders, and I'm not even sure how much I agree that a community can "own" a hashtag, but it produces a valuable community around a specific technology that has proven to be helpful for thousands of users over the years.</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/04/bablefish-img-12.png" class="kg-image" alt=" Twitter users chipping in on a thread with the #sqlhelp hashtag" loading="lazy" width="847" height="1172" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/04/bablefish-img-12.png 600w, https://timescale.ghost.io/blog/content/images/2022/04/bablefish-img-12.png 847w" sizes="(min-width: 720px) 720px"></figure><p>Like PostgreSQL users, SQL Server users also have a community Slack channel that is active and more long-form. But even today, more than 10 years after the community started using it, #sqlhelp remains an active channel for connecting with the larger community to give and receive timely help.</p><p>Could we do something similar with a #pghelp hashtag?</p><h3 id="3-support-new-members-by-cultivating-more-leaders">3. Support new members by cultivating more leaders</h3><p>As the PostgreSQL community grows, we can only support users if we build a growing group of leaders. A few of the leaders in the SQL Server community realized this same need over a decade ago and proactively sought ways to build new leaders, content creators, and community advocates. One successful example of this is an initiative called "T-SQL Tuesday," a worldwide monthly "blog fest."</p><p>The idea is simple. Each month, someone volunteers to be the host and announces the topic at least a week in advance; then anyone from around the world can contribute to the conversation by publishing a blog on that topic. 
Some of the topics are technical (replication, high availability, query tuning success), while some are more soft-skill-focused (best/worst SQL interview experience, how to avoid burnout in the SQL field).</p><p>As I said, it was specifically started as an initiative to get more people in the community to contribute to the conversation. Of all the initiatives this community has undertaken in the last 10-15 years, T-SQL Tuesday has done the most to cultivate new community leaders, alter careers, and bring collaboration across the globe. The most intriguing part of this for me is that it's free to run and participate in, and Microsoft has had nothing to do with it. It is completely community-led.</p><p>Starting in April 2022, a few PostgreSQL community members are launching "PSQL Phriday," a monthly community blogging initiative. To learn more about it, the monthly topics, and how you can participate, watch the blogging feeds at <a href="http://planet.postgresql.org/">planet.postgresql.org</a> and monitor the #psqlphriday hashtag on Twitter. I'm excited to see this get started and can't wait to learn from many others in the community!</p><h3 id="4-seek-leaders-proactively"><strong>4. Seek leaders proactively</strong></h3><p>As new members join the community and it grows, many of them will come with a desire to contribute in some way. Some of that has already happened in meaningful ways around tooling that has dramatically improved the day-to-day tasks of every SQL Server DBA.</p><p>One of the best examples of this in recent years has been the <a href="https://dbatools.io/">DBATools project</a>. This is a PowerShell toolkit of hundreds of commands that can help back up a database or migrate an entire cluster of servers, no UI necessary. It is heavily supported by the community and they're always looking for opportunities to grow, learn, and contribute. 
Finding these developer-focused initiatives could be a great way to enlist help and add additional support resources as the community grows.</p><h3 id="5-develop-consistent-messaging-around-community"><strong>5. Develop consistent messaging around community</strong></h3><p>Lastly, I think it would be helpful to consider ways to consistently articulate the best methods and practices for accessing help within the PostgreSQL community. Although the <a href="https://www.postgresql.org/community/">Community page</a> on the PostgreSQL.org website does list many avenues for getting help, it still requires a fair amount of cognitive load to figure out which avenue is best suited for a given need.</p><ul><li>When do I use the email lists? What if I don't want to subscribe long-term?</li><li>If I join the Slack channel, how do I best ask for help? Can I mention specific people to try and get help? Create new rooms?</li><li>Why would I use IRC or Discord over Slack?</li><li>Is there a #pghelp Twitter hashtag, and if so, what kind of questions are best asked there?</li></ul><p>As a new user in the PostgreSQL community, I wanted this kind of guidance because I didn't want to overuse a resource or direct questions to the wrong group of people. If we had more consistent guidance on how to interface with the community across the plethora of channels, then leaders within the PostgreSQL community could reinforce the same message. </p><p>I appreciate how leaders in the SQL Server community like Brent Ozar don't feel obligated to answer every question thrown their way. They have clear instructions on their blogs that describe the best ways for someone to actually get effective help, and they often direct people there. "Hey, thanks for reaching out. That sounds like a great question for #sqlhelp. Check out this link for instructions on how to use it!"</p><p>When people feel heard, they're more likely to stay connected and involved. 
Even if the answer often points them to good documentation on how to get help, they're still acknowledged and included.</p><h2 id="wrap-up">Wrap up</h2><p>I'd like to leave you with a small twist on a common adage.</p><p>"You can't pick your family… but you <em>can</em> influence who becomes your friends."</p><p>As this community grows, are we prepared to offer new ways of engaging with its newest members? These specific ideas might not all fit within the PostgreSQL community, but I'd be interested to hear your thoughts about ways we can better incorporate the skills and talents of the community we already have to prepare for the future. Feel free to reach out to me on Twitter (<a href="https://twitter.com/ryanbooz?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor">@ryanbooz</a>), through email (<a href="mailto:ryan@timescale.com">ryan@timescale.com</a>), or on our <a href="https://slack.timescale.com">Timescale Slack channel</a>. </p><p>One last thing! The 2022 State of PostgreSQL survey will open later this spring. Take some time to <a href="https://www.timescale.com/state-of-postgres-results/">review the results from last year</a>, and then sign up at the bottom of the report to be notified when the new survey is ready. The more feedback we receive, the better we can understand our community, what's working well, and what can be improved in the years to come.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Using Pg_Stat_Statements to Optimize Queries]]></title>
            <description><![CDATA[Discover how the pg_stat_statements PostgreSQL extension can help you identify problematic queries and optimize your query performance.]]></description>
            <link>https://www.tigerdata.com/blog/using-pg-stat-statements-to-optimize-queries</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/using-pg-stat-statements-to-optimize-queries</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[PostgreSQL Performance]]></category>
            <category><![CDATA[PostgreSQL Tips]]></category>
            <category><![CDATA[Cloud]]></category>
            <category><![CDATA[Announcements & Releases]]></category>
            <dc:creator><![CDATA[Ryan Booz]]></dc:creator>
            <pubDate>Wed, 30 Mar 2022 13:15:09 GMT</pubDate>
            <media:content medium="image" url="https://timescale.ghost.io/blog/content/images/2022/03/pg-stat-statements-timescale-2.png">
            </media:content>
            <content:encoded><![CDATA[<p><em><code>pg_stat_statements</code> allows you to quickly identify problematic or slow Postgres queries, providing instant visibility into your database performance. Today, we're announcing that we've enabled <code>pg_stat_statements</code> by default in all Timescale services. This is part of our #AlwaysBeLaunching Cloud Week with MOAR features! </em>🐯☁️</p><p>PostgreSQL is one of the fastest-growing databases in terms of usage and community size, backed by many dedicated developers and supported by a broad ecosystem of tooling, connectors, libraries, and visualization applications. PostgreSQL is also extensible: using <a href="https://www.tigerdata.com/blog/top-8-postgresql-extensions" rel="noreferrer">PostgreSQL extensions</a>, users can add extra functionality to PostgreSQL’s core. Indeed, TimescaleDB itself is <a href="https://www.tigerdata.com/blog/top-8-postgresql-extensions" rel="noreferrer">packaged as a PostgreSQL extension</a>, which also plays nicely with the broad set of other PostgreSQL extensions, as we’ll see today.</p><p>Today, we’re excited to share that <code>pg_stat_statements</code>, one of the most popular and widely used PostgreSQL extensions, is now enabled by default in all Timescale services. If you’re new to Timescale, <a href="https://console.cloud.timescale.com/signup">start a free trial</a> (100&nbsp;% free for 30 days, no credit card required).</p><h3 id="what-is-pgstatstatements">What is pg_stat_statements?</h3><p><a href="https://www.postgresql.org/docs/current/pgstatstatements.html"><code>pg_stat_statements</code></a> is a PostgreSQL extension that records information about your running queries. Identifying performance bottlenecks in your database can often feel like a cat-and-mouse game. Quickly written queries, index changes, or complicated ORM query generators can (and often do) negatively impact your database and application performance. 
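</p><p>If you run self-managed PostgreSQL rather than a Timescale service, the extension is not enabled by default. A minimal setup sketch (assuming superuser access and the ability to restart the server) looks like this:</p><figure class="kg-card kg-code-card"><pre><code class="language-SQL">-- 1. Preload the module at server start. Note that this overwrites any
--    existing shared_preload_libraries value, so merge it with your
--    current setting if you already preload other libraries:
ALTER SYSTEM SET shared_preload_libraries = 'pg_stat_statements';

-- 2. Restart PostgreSQL, then create the extension in each database
--    you want to inspect:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- 3. Confirm that statistics are being collected:
SELECT count(*) FROM pg_stat_statements;</code></pre></figure><p>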
</p><h3 id="how-to-use-pgstatstatements">How to use pg_stat_statements</h3><p>As we will show you in this post, <code>pg_stat_statements</code> is an invaluable tool to help you identify which queries are performing slowly or poorly and why. For example, you can query <code>pg_stat_statements</code> to know how many times a query has been called, the query execution time, the hit cache ratio for a query (how much data was available in memory vs. on disk to satisfy your query), and other helpful statistics such as the standard deviation of a query's execution time.</p><p>Keep reading to learn how to query <code>pg_stat_statements</code> to identify PostgreSQL slow queries and other performance bottlenecks in your Timescale database.</p><p><em>A huge thank you to Lukas Bernert, Monae Payne, and Charis Lam for taking care of all things pg_stat_statements in Timescale. </em></p><h2 id="how-to-query-pgstatstatements-in-timescale">How to Query <code>pg_stat_statements</code> in Timescale </h2><p>Querying statistics data for your Timescale database from the <code>pg_stat_statements</code> view is straightforward once you're connected to the database.</p><figure class="kg-card kg-code-card"><pre><code class="language-SQL">SELECT * FROM pg_stat_statements;

userid|dbid |queryid             |query                         
------+-----+--------------------+------------------------------
 16422|16434| 8157083652167883764|SELECT pg_size_pretty(total_by
    10|13445|                    |&lt;insufficient privilege&gt;      
 16422|16434|-5803236267637064108|SELECT game, author_handle, gu
 16422|16434|-8694415320949103613|SELECT c.oid,c.*,d.description
    10|16434|                    |&lt;insufficient privilege&gt;      
    10|13445|                    |&lt;insufficient privilege&gt;   
 ...  |...  |...                 |...  </code></pre><figcaption><p><i><em class="italic" style="white-space: pre-wrap;">Queries that the </em></i><i><code spellcheck="false" style="white-space: pre-wrap;"><em class="italic">tsdbadmin</em></code></i><i><em class="italic" style="white-space: pre-wrap;"> user does not have access to will hide query text and identifier</em></i></p></figcaption></figure><p>The view returns many columns of data (more than 30!), but if you look at the results above, one value immediately sticks out: <code>&lt;insufficient privilege&gt;</code>. </p><p><code>pg_stat_statements</code> collects data on all databases and users, which presents a security challenge if any user is allowed to query performance data. Therefore, although any user can query data from the views, only superusers and those specifically granted the <code>pg_read_all_stats</code> permission can see all user-level details, including the <code>queryid</code> and <code>query</code> text. </p><p>This applies even to the <code>tsdbadmin</code> user, which is created by default for all Timescale services. Although this user owns the database and has the most privileges, it is not a superuser account and cannot see the details of all other queries within the service cluster.</p><p>For that reason, it's best to filter <code>pg_stat_statements</code> data by <code>userid</code> for any queries you want to perform.</p><figure class="kg-card kg-code-card"><pre><code class="language-SQL">-- current_user will provide the rolname of the authenticated user
SELECT * FROM pg_stat_statements pss
	JOIN pg_roles pr ON (userid=oid)
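	-- pg_roles.oid corresponds to pg_stat_statements.userid, so this join lets us filter by role name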
WHERE rolname = current_user;


userid|dbid |queryid             |query                         
------+-----+--------------------+------------------------------
 16422|16434| 8157083652167883764|SELECT pg_size_pretty(total_by
 16422|16434|-5803236267637064108|SELECT game, author_handle, gu
 16422|16434|-8694415320949103613|SELECT c.oid,c.*,d.description
 ...  |...  |...                 |...  		 </code></pre><figcaption><p><i><em class="italic" style="white-space: pre-wrap;">Queries for only the </em></i><i><code spellcheck="false" style="white-space: pre-wrap;"><em class="italic">tsdbadmin</em></code></i><i><em class="italic" style="white-space: pre-wrap;"> user, showing all details and statistics</em></i></p></figcaption></figure><p>When you add the filter, only data that you have access to is displayed. If you have created additional accounts in your service for specific applications, you could also filter to those accounts.</p><p>To make the rest of our example queries easier to work with, we recommend that you use this base query with a common table expression (CTE). This query form will return the same data but make the rest of the query a little easier to write.</p><figure class="kg-card kg-code-card"><pre><code class="language-SQL">-- current_user will provide the rolname of the authenticated user
WITH statements AS (
SELECT * FROM pg_stat_statements pss
		JOIN pg_roles pr ON (userid=oid)
WHERE rolname = current_user
)
SELECT * FROM statements;

userid|dbid |queryid             |query                         
------+-----+--------------------+------------------------------
 16422|16434| 8157083652167883764|SELECT pg_size_pretty(total_by
 16422|16434|-5803236267637064108|SELECT game, author_handle, gu
 16422|16434|-8694415320949103613|SELECT c.oid,c.*,d.description
 ...  |...  |...                 |...   		 </code></pre><figcaption><p><i><em class="italic" style="white-space: pre-wrap;">Query that shows the same results as before, but this time with the base query in a CTE for more concise queries later</em></i></p></figcaption></figure><p>Now that we know how to query only the data we have access to, let's review a few of the columns that will be the most useful for spotting potential problems with your queries. </p><ul><li><strong><code>calls</code></strong>: the number of times this query has been called.</li><li><strong><code>total_exec_time</code></strong>: the total time spent executing the query, in milliseconds.</li><li><strong><code>rows</code></strong>: the total number of rows retrieved by this query.</li><li><strong><code>shared_blks_hit</code></strong>: the number of blocks already cached when read for the query.</li><li><strong><code>shared_blks_read</code></strong>: the number of blocks that had to be read from the disk to satisfy all calls for this query form.</li></ul><p>Two quick reminders about the data columns above:</p><ol><li>All values are cumulative since the last time the service was started or since a superuser manually reset the values.</li><li>All values are for the same query form after parameterizing the query and based on the resulting hashed <code>queryid</code>.</li></ol><p>Using these columns of data, let's look at a few common queries that can help you narrow in on the problematic queries.</p><h2 id="long-running-postgresql-queries">Long-Running PostgreSQL Queries</h2><p>One of the quickest ways to find slow Postgres queries that merit your attention is to look at each query’s average execution time (<code>mean_exec_time</code>). This is not a time-weighted average since the data is cumulative, but it still helps frame a relevant context for where to start.</p><p>Adjust the <code>calls</code> value to fit your specific application needs. 
Querying for higher (or lower) total number of calls can help you identify queries that aren't run often but are very expensive or queries that are run much more often than you expect and take longer to run than they should.</p><figure class="kg-card kg-code-card"><pre><code class="language-SQL">-- query the 10 longest running queries with more than 500 calls
WITH statements AS (
SELECT * FROM pg_stat_statements pss
		JOIN pg_roles pr ON (userid=oid)
WHERE rolname = current_user
)
SELECT calls, 
	mean_exec_time, 
	total_exec_time,
	query
FROM statements
WHERE calls &gt; 500
AND shared_blks_hit &gt; 0
ORDER BY mean_exec_time DESC
LIMIT 10;


calls|mean_exec_time |total_exec_time | query
-----+---------------+----------------+-----------
 2094|        346.93 |      726479.51 | SELECT time FROM nft_sales ORDER BY time ASC LIMIT $1 |
 3993|         5.728 |       22873.52 | CREATE TEMPORARY TABLE temp_table ... |
 3141|          4.79 |       15051.06 | SELECT name, setting FROM pg_settings WHERE ... |
60725|          3.64 |      221240.88 | CREATE TEMPORARY TABLE temp_table ... |   
  801|          1.33 |        1070.61 | SELECT pp.oid, pp.* FROM pg_catalog.pg_proc p  ...|
 ... |...            |...                 |  </code></pre><figcaption><p><i><em class="italic" style="white-space: pre-wrap;">Queries that take the most time, on average, to execute</em></i></p></figcaption></figure><p>This sample database we're using for these queries is based on the <a href="https://github.com/timescale/nft-starter-kit">NFT starter kit</a>, which allows you to ingest data on a schedule from the OpenSea API and query NFT sales data. As part of the normal process, you can see that a <code>TEMPORARY TABLE</code> is created to ingest new data and update existing records as part of a lightweight extract-transform-load process.</p><p>That query has been called 60,725 times since this service started and has taken around 3.7 minutes of total execution time to create the table. By contrast, the first query shown takes the longest, on average, to execute—around 350 milliseconds each time. It retrieves the oldest timestamp in the <code>nft_sales</code> table and has used more than 12 minutes of execution time since the server was started.</p><p>From a work perspective, finding a way to improve the performance of the first query will have a more significant impact on the overall server workload.</p><h2 id="hit-cache-ratio">Hit Cache Ratio</h2><p>Like nearly everything in computing, databases tend to perform best when data can be queried in memory rather than going to external disk storage. If PostgreSQL has to retrieve data from storage to satisfy a query, it will typically be slower than if all of the needed data was already loaded into the reserved memory space of PostgreSQL. We can measure how often a query has to do this through a value known as Hit Cache Ratio.</p><p>Hit Cache Ratio is a measurement of how often the data needed to satisfy a query was available in memory. 
A higher percentage means that the data was already available and it didn't have to be read from disk, while a lower value can be an indication that there is memory pressure on the server and that it isn't able to keep up with the current workload.</p><p>If PostgreSQL has to constantly read data from disk to satisfy the same query, it means that other operations and data are taking precedence and "pushing" the data your query needs back out to disk each time. </p><p>This is a common scenario for time-series workloads because newer data is written to memory first, and if there isn't enough free buffer space, data that is used less will be evicted. If your application queries a lot of historical data, older <a href="https://www.tigerdata.com/blog/database-indexes-in-postgresql-and-timescale-cloud-your-questions-answered" rel="noreferrer">hypertable</a> chunks might not be loaded into memory and ready to quickly serve the query.</p><p>A good place to start is with queries that run often and have a Hit Cache Ratio of less than 98&nbsp;%. Do these queries tend to pull data from long periods of time? If so, that could be an indication that there's not enough RAM to efficiently store this data long enough before it is evicted for newer data. </p><p>Depending on the application query pattern, you could improve Hit Cache Ratio by increasing server resources, tuning indexes to reduce table storage, or using <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/compression/">TimescaleDB compression</a> on older chunks that are queried regularly.</p><figure class="kg-card kg-code-card"><pre><code class="language-SQL">-- frequently called queries with the lowest hit cache ratio
WITH statements AS (
SELECT * FROM pg_stat_statements pss
		JOIN pg_roles pr ON (userid=oid)
WHERE rolname = current_user
)
SELECT calls, 
	shared_blks_hit,
	shared_blks_read,
	shared_blks_hit/(shared_blks_hit+shared_blks_read)::NUMERIC*100 hit_cache_ratio,
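	-- the ::NUMERIC cast avoids integer division; shared_blks_hit &gt; 0 below keeps the denominator non-zero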
	query
FROM statements
WHERE calls &gt; 500
AND shared_blks_hit &gt; 0
ORDER BY calls DESC, hit_cache_ratio ASC
LIMIT 10;


calls | shared_blks_hit | shared_blks_read | hit_cache_ratio |query
------+-----------------+------------------+-----------------+--------------
  118|            441126|                 0|           100.00| SELECT bucket, slug, volume AS "volume (count)", volume_eth...
  261|          62006272|             22678|            99.96| SELECT slug FROM streamlit_collections_daily cagg...¶        I
 2094|         107188031|           7148105|            93.75| SELECT time FROM nft_sales ORDER BY time ASC LIMIT $1...      
  152|          41733229|                 1|            99.99| SELECT slug FROM streamlit_collections_daily cagg...¶        I
  154|          36846841|             32338|            99.91| SELECT a.img_url, a.name, MAX(s.total_price) AS price, time...

 ... |...               |...               | ...             | ...</code></pre><figcaption><p><i><em class="italic" style="white-space: pre-wrap;">The query that shows the Hit Cache Ratio of each query, including the number of buffers that were read from disk or memory to satisfy the query</em></i></p></figcaption></figure><p>This sample database isn't very active, so the overall query counts are not very high compared to what a traditional application would probably show. In our example data above, a query called more than 500 times is a "frequently used query." </p><p>We can see above that one of the most expensive queries also happens to have the lowest Hit Cache Ratio of 93.75&nbsp;%. This means that roughly 6&nbsp;% of the time, PostgreSQL has to retrieve data from disk to satisfy the query. While that might not seem like a lot, your most frequently called queries should have a ratio of 99&nbsp;% or more in most cases.</p><p>If you look closely, notice that this is the same query that stood out in our first example that showed how to find long-running queries. It's quickly becoming apparent that we can probably tune this query in some way to perform better. As it stands now, it's the slowest query per call, and it consistently has to read some data from disk rather than from memory.</p><h2 id="queries-with-high-standard-deviation">Queries With High Standard Deviation</h2><p>For a final example, let's consider another way to judge which queries often have the greatest opportunity for improvement: using the standard deviation of a query's execution time.</p><p>Finding the slowest queries is a good place to start. However, as discussed in the blog post <a href="https://timescale.ghost.io/blog/what-time-weighted-averages-are-and-why-you-should-care/">What Time-Weighted Averages Are and Why You Should Care</a>, averages are only part of the story. 
Although <code>pg_stat_statements</code> doesn't provide a method for tracking time-weighted averages, it does track the standard deviation of execution time across all calls.</p><h3 id="how-can-this-be-helpful">How can this be helpful?</h3><p>Standard deviation is a way of assessing how widely individual execution times vary from the overall mean. If the standard deviation value is small, then queries all take a similar amount of time to execute. If the standard deviation value is large, this indicates that the execution time of the query varies significantly from request to request.</p><p>Determining how good or bad the standard deviation is for a particular query requires more data than just the mean and standard deviation values. To make the most sense of these numbers, we at least need to add the minimum and maximum execution times to the query. By doing this, we can start to form a mental model for the overall span of execution times that the query takes.</p><p>In the example result below, we're only showing the data for one query to make it easier to read, the same <code>ORDER BY time LIMIT 1</code> query we've seen in our previous example output.</p><figure class="kg-card kg-code-card"><pre><code class="language-SQL">-- execution-time spread (min, max, mean, stddev) for frequently called queries
WITH statements AS (
SELECT * FROM pg_stat_statements pss
		JOIN pg_roles pr ON (userid=oid)
WHERE rolname = current_user
)
SELECT calls, 
	min_exec_time,
	max_exec_time, 
	mean_exec_time,
	stddev_exec_time,
	(stddev_exec_time/mean_exec_time) AS coeff_of_variance,
	query
FROM statements
WHERE calls &gt; 500
AND shared_blks_hit &gt; 0
ORDER BY mean_exec_time DESC
LIMIT 10;


Name              |Value                                                |
------------------+-----------------------------------------------------+
calls             |2094                                                 |
min_exec_time     |0.060303                                             |
max_exec_time     |1468.401726                                          |
mean_exec_time    |346.9338636657108                                    |
stddev_exec_time  |212.3896857655582                                    |
coeff_of_variance |0.612190702635494                                    |
query             |SELECT time FROM nft_sales ORDER BY time ASC LIMIT $1|	 </code></pre><figcaption><p><i><em class="italic" style="white-space: pre-wrap;">The min, max, mean, and standard deviation of the example query</em></i></p></figcaption></figure><p>In this case, we can extrapolate a few things from these statistics:</p><ul><li>For our application, this query is called frequently (remember, more than 500 calls is a lot for this sample database).</li><li>If we look at the full range of execution time in conjunction with the mean, we see that the mean is not centered. This could imply that there are execution time outliers or that the data is skewed. Both are good reasons to investigate this query’s execution times further.</li><li>Additionally, if we look at the coefficient of variation column, which is the ratio between the standard deviation and the mean, we get 0.612, which is fairly high. In general, if this ratio is above 0.3, then the variation of your data is quite large. Since we find the data is quite varied, it seems that instead of a few outliers skewing the mean, there are a number of execution times taking longer than they should. This provides further confirmation that the execution time for this query should be investigated further.</li></ul><p>When I examine the output of these three queries together, this specific <code>ORDER BY time LIMIT 1</code> query seems to stick out. 
It's slower per call than most other queries, it often requires the database to retrieve data from disk, and the execution times seem to vary dramatically over time.<br><br>As long as I understood where this query was used and how the application could be impacted, I would certainly put this "first point" query on my list of things to improve.</p><h2 id="speed-up-your-postgresql-queries">Speed Up Your PostgreSQL Queries</h2><p>The <code>pg_stat_statements</code> extension is an invaluable monitoring tool, especially when you understand how statistical data can be used in the database and application context. </p><p>For example, an expensive query called a few times a day or month might not be worth the effort to tune right now. Instead, a moderately slow query called hundreds of times an hour (or more) will probably be a better use of your query-tuning effort.</p><p>If you want to learn how to store metrics snapshots regularly and move from static, cumulative information to time-series data for more efficient database monitoring, check out the blog post <a href="https://timescale.ghost.io/blog/point-in-time/"><strong>Point-in-Time PostgreSQL Database and Query Monitoring With pg_stat_statements</strong></a>.</p><p><code>pg_stat_statements</code> is automatically enabled in all Timescale services. 
If you’re not a user yet, <a href="https://console.cloud.timescale.com/signup">you can try out Timescale for free</a> (no credit card required) to get access to a modern cloud-native database platform with <a href="https://timescale.ghost.io/blog/postgresql-timescaledb-1000x-faster-queries-90-data-compression-and-much-more/">TimescaleDB's top performance</a>, one-click <a href="https://timescale.ghost.io/blog/high-availability-for-your-production-environments-introducing-database-replication-in-timescale-cloud/">database replication</a>, <a href="https://timescale.ghost.io/blog/introducing-one-click-database-forking-in-timescale-cloud/">forking</a>, and <a href="https://timescale.ghost.io/blog/vpc-peering-from-zero-to-hero/">VPC peering</a>.<br></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Select the Most Recent Record (of Many Items) With PostgreSQL]]></title>
            <description><![CDATA[Get five methods for retrieving the most recent data for each item in your PostgreSQL database quickly and efficiently.]]></description>
            <link>https://www.tigerdata.com/blog/select-the-most-recent-record-of-many-items-with-postgresql</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/select-the-most-recent-record-of-many-items-with-postgresql</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Tutorials]]></category>
            <category><![CDATA[Engineering]]></category>
            <dc:creator><![CDATA[Ryan Booz]]></dc:creator>
            <pubDate>Fri, 04 Feb 2022 14:50:31 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2023/10/Screenshot-2023-10-11-at-6.56.02-PM.png">
            </media:content>
            <content:encoded><![CDATA[<p><a href="https://timescale.ghost.io/blog/time-series-data/" rel="noreferrer">Time-series data is ubiquitous in almost every application today</a>. One of the most frequent queries applications make on time-series data is to find the most recent value for a given device or item.</p><p>In this blog post, we'll explore five methods for accessing the most recent value in PostgreSQL. Each option has its advantages and disadvantages, which we'll discuss as we go.</p><p><em>Note: Throughout this post, references to a "device" or "truck" are simply placeholders to whatever your application is storing time-series data for, whether it be an air quality sensor, airplane, car, website visits, or something else. As you read, focus on the concept of each option, rather than the specific data we're using as an example.</em></p><h2 id="the-problem">The Problem</h2><p>Knowing how to query the most recent timestamp and data for a device in large time-series datasets is often a challenge for many application developers. We study the data, determine the appropriate schema, and create the indexes that <em>should</em> make the queries return quickly. </p><p><a href="https://www.timescale.com/learn/postgresql-performance-tuning-how-to-size-your-database" rel="noreferrer">When the queries aren't as fast as we expect</a>, it's easy to be confused because indexes in PostgreSQL are supposed to help your queries return quickly - correct?</p><p>In most cases, the answer to that is emphatically "true". With the appropriate index, PostgreSQL is normally <em>very</em> efficient at retrieving data for your query. 
There are always nuances that we don't have time to get into in this post (<a href="https://www.timescale.com/learn/postgresql-performance-tuning-optimizing-database-indexes" rel="noreferrer">don't create too many indexes, make sure statistics are kept up-to-date, etc.</a>), but generally speaking, the right index will dramatically improve the query performance of a SQL database, PostgreSQL included.</p><p>Quick aside:</p><p><em>Before we dive into how to efficiently find specific records in a large time-series database using indexes, I want to make sure we're talking about the same thing. For the duration of this post, all references to indexes specifically mean a <a href="https://use-the-index-luke.com/sql/anatomy/the-tree"><em>B-tree index</em></a>. These are the most common index type, supported by all major OLTP databases, and they are very good at locating specific rows of data across tables large and small. PostgreSQL actually supports <strong>many</strong> different index types that can help for various types of queries and data (including timestamp-centric data), but from here on out, we're only talking about B-tree indexes.</em></p><h2 id="the-impact-of-indexes">The Impact of Indexes</h2><p>In our TimescaleDB <a href="https://slack.timescale.com">Slack community channel</a> and in other developer forums such as StackOverflow (<a href="https://dba.stackexchange.com/questions/177162/how-to-make-distinct-on-faster-in-postgresql">example</a>), developers often wonder why a query for the latest value is slow in PostgreSQL even when it seems like the correct index exists to make the query "fast".</p><p>The answer to that lies in how the PostgreSQL query planner works. It doesn't always use the index exactly how you might expect, as we'll discuss below. In order to demonstrate how PostgreSQL might use an index on a large time-series table, let's set the stage with a set of fictitious data. 
</p><p>For these example queries, let's pretend that our application is tracking a trucking fleet, with sensors that report data a few times a minute as long as the truck has a cell connection. Sometimes a truck loses its signal, which causes data to be sent a few hours or days later. Although the app would certainly be more involved and have a more complex schema for tracking both time-series and business-related data, let's focus on two of the tables.</p><p><strong>Truck</strong></p><p>This table tracks every truck that is part of the fleet. Even for a very large company, this table will typically contain only a few tens of thousands of rows.</p><table>
<thead>
<tr>
<th>truck_id</th>
<th>make</th>
<th>model</th>
<th>weight_class</th>
<th>date_acquired</th>
<th>active_status</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>Ford</td>
<td>Single Sleeper</td>
<td>S</td>
<td>2018-03-14</td>
<td>true</td>
</tr>
<tr>
<td>2</td>
<td>Tesla</td>
<td>Double Sleeper</td>
<td>XL</td>
<td>2019-02-18</td>
<td>false</td>
</tr>
<tr>
<td>…</td>
<td>…</td>
<td>…</td>
<td>…</td>
<td>…</td>
<td></td>
</tr>
</tbody>
</table>
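<p>As a sketch, the table above might be declared like this in SQL (the column types are assumptions based on the sample rows, not taken from the original application):</p><pre><code class="language-sql">CREATE TABLE trucks (
	truck_id      int PRIMARY KEY,
	make          text,
	model         text,
	weight_class  text,
	date_acquired date,
	active_status boolean
);
</code></pre>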
<p>For the queries below, we'll pretend that this table has ~10,000 trucks, most of which are currently active and recording data a few times a minute.</p><p><strong>Truck Reading</strong></p><p>The reading hypertable stores all data that is delivered from every truck over time. Data typically comes in a few times a minute in time order, although older data can arrive when trucks lose their connection to cell service or transmitters break down. For this example, we'll show a wide-format table schema, and only a few columns of data to keep things simple. Many IoT applications store many types of data points for each set of readings.</p><table>
<thead>
<tr>
<th>ts</th>
<th>truck_id</th>
<th>milage</th>
<th>fuel</th>
<th>latitude</th>
<th>longitude</th>
</tr>
</thead>
<tbody>
<tr>
<td>2021-11-30 16:39:46</td>
<td>1</td>
<td>49.8</td>
<td>29</td>
<td>40.626</td>
<td>83.139</td>
</tr>
<tr>
<td>2021-11-30 16:39:46</td>
<td>2</td>
<td>33.0</td>
<td>371</td>
<td>40.056</td>
<td>78.978</td>
</tr>
<tr>
<td>2021-11-30 16:39:46</td>
<td>3</td>
<td>54.5</td>
<td>403</td>
<td>42.732</td>
<td>83.756</td>
</tr>
<tr>
<td>…</td>
<td>…</td>
<td>…</td>
<td>…</td>
<td>…</td>
<td>…</td>
</tr>
</tbody>
</table>
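<p>As a sketch, the reading table and its conversion to a hypertable might look like this (column types are assumptions based on the sample rows):</p><pre><code class="language-sql">CREATE TABLE truck_reading (
	ts        timestamp NOT NULL,
	truck_id  int,
	milage    float8,
	fuel      int,
	latitude  float8,
	longitude float8
);

-- convert the table into a TimescaleDB hypertable partitioned by ts
SELECT create_hypertable('truck_reading', 'ts');
</code></pre>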
<p>When you create a TimescaleDB hypertable, an index on the timestamp column will be created automatically unless you specifically tell the <code>create_hypertable()</code> function not to. For the <code>truck_reading</code> table, the default index should look similar to:</p><p><code>CREATE INDEX ix_ts ON truck_reading (ts DESC);</code></p><p>This index (or at least a composite index that uses the time column first) is necessary for even the most basic queries where time is involved and is strongly recommended for hypertable chunk management. Queries involving time alone like <code>MIN(ts)</code> or <code>MAX(ts)</code> can easily be satisfied from this index.</p><p>However, if we wanted to know the minimum or maximum readings for a specific truck, PostgreSQL would have no path to quickly find that information. Consider the following query that searches for the most recent readings of a specific truck:</p><pre><code class="language-sql">SELECT * FROM truck_reading WHERE truck_id = 1234 ORDER BY ts DESC LIMIT 1;
</code></pre>
<p>If the <code>truck_reading</code> table only had the default timestamp index ( <code>ix_ts</code> above), PostgreSQL has no efficient method to get the most recent row of data for this specific truck. Instead, it has to start reading the index from the beginning (the most recent timestamp is first based on the index order) and check each row to see if it contains <code>1234</code> as the <code>truck_id</code>. </p><p>If this truck had reported recently, PostgreSQL would only have to read a few thousand rows <em>at most</em> and the query would still be "fast". If the truck hadn't recorded data in a few hours or days, PostgreSQL might have to read hundreds of thousands, or even millions, of rows of data before it finds a row where <code>truck_id = 1234</code>.</p><p>To demonstrate this, we created a sample dataset with ~20 million rows of data (1 week for 10,000 trucks) and then deleted the most recent 12 hours for <code>truck_id = 1234</code>.</p><p>In the EXPLAIN output below, we can see that PostgreSQL had to scan the entire index and FILTER out more than 1.53 million rows that did not match the <code>truck_id</code> we were searching for. Even more alarming is the amount of data PostgreSQL had to process to correctly retrieve the one row of data we were asking for - <strong><em>~181 MB of data!</em></strong> (23,168 buffers × 8 KB per buffer)</p><pre><code class="language-sql">QUERY PLAN                                                                  
------------------------------------------------------------------
Limit  (cost=0.44..289.59 rows=1 width=52) (actual time=189.343..189.344 rows=1 loops=1)                                                       
  Buffers: shared hit=23168                                                 
  -&gt;  Index Scan using ix_ts on truck_reading  (cost=0.44..627742.58 rows=2171 width=52) (actual time=189.341..189.341 rows=1 loops=1)|
        Filter: (truck_id = 1234)                                           
        Rows Removed by Filter: 1532532                                     
        Buffers: shared hit=23168                                           
Planning:                                                                   
  Buffers: shared hit=5                                                     
Planning Time: 0.116 ms                                                     
Execution Time: 189.364 ms 
</code></pre>
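<p>For reference, plan output like the above comes from prefixing the query with <code>EXPLAIN (ANALYZE, BUFFERS)</code>, which executes the statement and reports the number of buffers touched at each node:</p><pre><code class="language-sql">EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM truck_reading
WHERE truck_id = 1234
ORDER BY ts DESC
LIMIT 1;
</code></pre>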
<p>If your application has to do that much work for each query, it will quickly become bottlenecked on the simplest of queries as the data grows.</p><p><strong><em>Therefore, it's essential that we have the correct index(es) for the typical query patterns of our application. </em></strong></p><p>In this example (and in many real-life applications), we should <strong><em>at least</em></strong> create one other index that includes both <code>truck_id</code> and <code>ts</code>. This will allow queries about a specific truck based on time to be searched much more efficiently. An example index would look like this:</p><pre><code class="language-sql">CREATE INDEX ix_truck_id_ts ON truck_reading (truck_id, ts DESC);
</code></pre>
<p>With this index created, PostgreSQL can find the most recent record for a specific truck very quickly, whether it reported data a few seconds or a few weeks ago. </p><p>With the same dataset as above, the same query which returns a datapoint for <code>truck_id = 1234</code> from 12 hours ago reads only<strong><em> 40kb of data</em></strong>! That's ~<strong>4600x less data</strong> that had to be read because we created the appropriate index, not to mention the sub-millisecond execution time! <em>That's bananas!</em></p><pre><code class="language-sql">QUERY PLAN                                                                                                                                               
-----------------------------------------------------
Limit  (cost=0.56..1.68 rows=1 width=52) (actual time=0.015..0.015 rows=1 loops=1)                                                                
  Buffers: shared hit=5                                                    
  -&gt;  Index Scan using ix_truck_id_ts on truck_reading  (cost=0.56..2425.55 rows=2171 width=52) (actual time=0.014..0.014 rows=1 loops=1)
        Index Cond: (truck_id = 1234)                                       
        Buffers: shared hit=5                                               
Planning:                                                                   
  Buffers: shared hit=5                                                     
Planning Time: 0.117 ms                                                     
Execution Time: 0.028 ms
                                                            
</code></pre>
<p>To be clear, both queries did use an index to search for the row. The difference is in how the indexes were used to find the data we wanted.</p><p>The first query had to FILTER the tuple because only the timestamp was part of the index. Filtering takes place <strong><em>after</em></strong> the tuple is read from disk, which means a lot more work takes place just trying to find the correct data.</p><p>In contrast, the second query used both parts of the index ( <code>truck_id</code> and <code>ts</code>) as part of the Index Condition. This means that only the rows that match the constraint are read from disk. In this case, that's a very small number and the query is much faster!</p><p>Unfortunately, even with <strong><em>both</em></strong> of these targeted indexes, there are a few common time-series SQL queries that won't perform as well as most developers expect them to. </p><p>Let's talk about why that is.</p><h3 id="open-ended-queries">Open-ended queries:</h3><p>Open-ended queries look for unique data points (first, last, most recent) without specifying a specific time-range or device constraint ( <code>truck_id</code> in our example). These types of queries leave the planner with few options, so it assumes that it will have to scan through the entire index <em>at planning time</em>. That might not be true, but PostgreSQL can't really know before it executes the query and starts looking for the data.</p><p>This is especially difficult when tables are partitioned because the actual indexes are stored independently with each table partition. Therefore, there is no global index for the entire table that identifies if a specific <code>truck_id</code> (in our case) exists in a partition. 
Once again, when the PostgreSQL planner doesn't have enough information during the planning phase, it assumes that each partition will need to be queried, typically causing increased planning time.</p><p>Consider a query like the following, which asks for the earliest reading for a specific <code>truck_id</code>:</p><pre><code class="language-sql">SELECT * FROM truck_reading WHERE truck_id=1234 ORDER BY ts LIMIT 1;
</code></pre>
<p>With the two indexes we have in place ( <code>(ts DESC)</code> and <code>(truck_id, ts DESC)</code>), it <em>feels</em> like this should be a fast query. But because the hypertable is partitioned on time, the planner initially assumes that it will have to scan each chunk. If you have a lot of partitions, the planning time will take longer. </p><p>If the <code>truck_reading</code> table is actively receiving new data, the execution of the query will still be "fast" because the answer will probably be found in the first chunk and returned quickly. But if <code>truck_id=1234</code> has <strong><em>never</em></strong> reported any data or has been offline for weeks, PostgreSQL will have to both <em>plan</em> and then <em>scan the index of</em> every chunk. The query will use the composite index on each partition to quickly determine there are no records for the truck, but it still has to take the time to plan and execute the query.</p><p>Instead, we want to avoid doing unnecessary work whenever possible and avoid the potential for this query anti-pattern.</p><h3 id="high-cardinality-queries">High-cardinality queries:</h3><p>Many queries can also be negatively impacted by increasing cardinality, becoming slower as data volumes grow and more individual items are tracked. Options 1-4 below are good examples of queries that perform well on small to medium-size datasets, but <em>often</em> become slower as volumes and cardinality increase.</p><p>These queries attempt to "step" through the time-series table by <code>truck_id</code>, taking advantage of the indexes on the hypertable. 
However, as more items need to be queried, the iteration often becomes slower because the index is too big to efficiently fit in memory, causing PostgreSQL to frequently swap data to and from disk.</p><p>Understanding that these two types of queries <em>may</em> not perform as well under every circumstance, let's examine five different methods for getting the most recent record for each item in your time-series table. In most circumstances, at least one of these options will work well for your data.</p><h2 id="development-production">Development != Production</h2><p>One quick word of warning as we jump into the SQL examples below.</p><p>It's always good to remember that your development database is unlikely to have the same volume, cardinality, and transaction throughput as your production database. Any one of the example queries we show below might perform really well on a smaller, less active database, only to perform more poorly than expected in production.</p><p>It's always best to test in an environment that is as similar to production as possible. How to do that is beyond the scope of this post, but a few options could be:</p><ul><li>Use <a href="https://timescale.ghost.io/blog/blog/introducing-one-click-database-forking-in-timescale-cloud/">one-click database forking</a> with your <a href="https://console.cloud.timescale.com/signup">Timescale</a> instance to easily make a copy of production for testing and learning. Using data as close to production as possible is usually preferred!</li><li>Back up and restore your production database to an approved location and anonymize the data, keeping similar cardinality and row statistics. 
Always <code>ANALYZE</code> the table after any data changes.</li><li>Consider reusing your schema and generating lots of high-volume, high-cardinality sample data with <code>generate_series()</code> (possibly using some of the ideas from <a href="https://timescale.ghost.io/blog/blog/how-to-create-lots-of-sample-time-series-data-with-postgresql-generate_series/">our series about generating more realistic sample data</a> inside of PostgreSQL).</li></ul><p>Whatever method you choose, always remember that a database with 1 million rows of time-series data for 100 items will act much differently from a database with 10 billion rows of time-series data for 10,000 items reporting every few seconds.</p><p>Now that we've discussed how indexes help us find the data and reviewed some of the query patterns that can be slower than usual, it's time to write some SQL and talk about when it might be appropriate to use each option.</p><h2 id="option-1-naive-group-by">Option 1: Naive GROUP BY</h2><p>SQL is a powerful language. Unfortunately, every database that allows queries to be written in SQL often has slightly different functions for doing similar work, or simply doesn't support SQL standards that would otherwise allow for efficient "last point" queries like we've been discussing.</p><p>However, in nearly every database where SQL is a supported query language, you can run this query to get the most recent time that each truck recorded data. In <em>most</em> cases, this will not perform well on large datasets because the <code>GROUP BY</code> clause prevents the indexes from being used.</p><pre><code class="language-sql">SELECT truck_id, max(ts) FROM truck_reading GROUP BY truck_id;
</code></pre>
<p>Because the indexes won't be used in PostgreSQL, this approach is not recommended for high-volume/high-cardinality datasets. But, it will get the result you expect, even if it's not efficient.</p><p>If you have a query like this, consider how one of the other options listed below might better fit your query pattern.</p><h2 id="option-2-lateral-join">Option 2: LATERAL JOIN</h2><p>One of the easiest pieces of advice to give for any PostgreSQL database developer is <a href="https://www.timescale.com/learn/sql-joins-summary">to learn how to use LATERAL JOINs</a>. In some other database engines (like SQL Server), these are called APPLY commands, but they do essentially the same thing—run the inner query for every row produced by the outer query. Because it is a JOIN, the inner query can utilize values from the outer query. (While this is similar to a correlated subquery, it's not the same thing.)</p><p>LATERAL JOINs are a great option when you, as the developer or administrator, know approximately how many rows the outer query will return. For a few hundred or a few thousand rows, this pattern is likely to return your "recent" record very quickly as long as the correct index is in place.</p><pre><code class="language-sql">SELECT * FROM trucks t 
INNER JOIN LATERAL (
	SELECT * FROM truck_reading 
	WHERE truck_id = t.truck_id
	ORDER BY ts DESC 
	LIMIT 1
) l ON TRUE
ORDER BY t.truck_id DESC;
</code></pre>
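<p>Because the outer side of a LATERAL JOIN is an ordinary query, it can also be filtered and paged. A sketch of a variation that only checks active trucks (reusing the <code>active_status</code> column from the sample <code>trucks</code> table above) and pages through them:</p><pre><code class="language-sql">SELECT * FROM trucks t
INNER JOIN LATERAL (
	SELECT * FROM truck_reading
	WHERE truck_id = t.truck_id
	ORDER BY ts DESC
	LIMIT 1
) l ON TRUE
WHERE t.active_status = true
ORDER BY t.truck_id
OFFSET 500 LIMIT 100;
</code></pre>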
<p>The convenient thing about a LATERAL JOIN query is that additional filtering can be applied to the outer query to identify specific items to retrieve data for. In most cases, the relational business data ( <code>trucks</code>) will be a smaller table with faster query times. Paging can also be applied to the smaller table more efficiently (i.e., <code>OFFSET 500 LIMIT 100</code>), which further reduces the total work that the inner query needs to perform.</p><p>Unfortunately, one downside of a LATERAL JOIN query is that it can be susceptible to the high cardinality issue we discussed above in at least two ways.</p><p>First, if the outer query returns many more items than the inner table has data for, this query will loop over the inner table doing more work than necessary. For example, if the <code>trucks</code> table had 10,000 entries for trucks, but only 1,000 of them had ever reported readings, the query would run the inner query 10x more often than it needed to. </p><p>Second, even if the cardinality of the inner and outer query generally match, if that cardinality is high or the table on the inner query is very large, a LATERAL JOIN query will slow down over time as memory or I/O become a limiting factor. At some point, you may need to consider <strong>Option 5</strong> below as a final solution.</p><h2 id="option-3-timescaledb-skipscan">Option 3: TimescaleDB SkipScan</h2><p><em>Disclaimer: this method only works when the TimescaleDB extension is installed. If you aren’t using it yet, you can find more information on our <a href="https://docs.timescale.com/install/latest/"><em>documentation page</em></a>.</em></p><p>LATERAL JOINs are a great tool to have on hand when working with iterative queries. 
However, as we just discussed, they're not always the best choice when iterating the items of the outer query would cause the inner query to be executed often, looking for data that doesn't exist.</p><p>This is when it can be advantageous to use the reading table itself to get the distinct items and related data. In particular, this is helpful when we want to query trucks that have reported data within a period of time, for example, the last 24 hours. While we could add a filter to the inner query above ( <code>WHERE ts &gt; now() - INTERVAL '24 hours'</code>), we'd still have to iterate over every <code>truck_id</code>, some of which might not have reported data in the last 24 hours.</p><p>Because we already created the <code>ix_truck_id_ts</code> index above that is ordered by <code>truck_id</code> and <code>ts DESC</code>, a common approach that many PostgreSQL developers try is to use a <code>DISTINCT ON</code> query with PostgreSQL.</p><pre><code class="language-sql">SELECT DISTINCT ON (truck_id) * FROM truck_reading WHERE ts &gt; now() - INTERVAL '24 hours' ORDER BY truck_id, ts DESC;
</code></pre>
<p>If you try this <strong><em>without</em></strong> <a href="https://docs.timescale.com/install/latest/">TimescaleDB</a> installed, <em>it won't perform well</em> - <strong>even though we have an index that <em>appears</em> to have the data ordered correctly and easy to "jump" through! </strong>This is because, as of PostgreSQL 14, there is no feature within the query execution phase that can "walk" the index to find each unique instance of a particular key. Instead, PostgreSQL essentially reads all of the data, groups it by the <code>ON</code> columns, and then filters out all but the first row (based on order).</p><p>However, <em>with the TimescaleDB extension installed</em> (version 2.3 or greater), <strong><em>the <code>DISTINCT ON</code> query will work much more efficiently</em></strong> as long as the correct index exists and is ordered the same as the query. This is because the TimescaleDB extension <a href="https://timescale.ghost.io/blog/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql/">adds a new query node called "SkipScan</a>", which jumps ahead in the index to the next key value as soon as the current one has been found, in order. One of the best parts of SkipScan is that it works on <strong>any</strong> PostgreSQL table with a B-tree index. It doesn't <em>have</em> to be a TimescaleDB hypertable! </p><p>There are a few nuances to how the index is used, all of which are outlined in the blog post linked above.</p><h2 id="option-4-loose-index-scan">Option 4: Loose Index Scan</h2><p>If you don't (or can't) install the TimescaleDB extension, there is still a way to query the <code>truck_reading</code> table to efficiently return the timestamp of the most recent reading for each <code>truck_id</code>. </p><p>On the PostgreSQL Wiki there is a page dedicated to the <a href="https://wiki.postgresql.org/wiki/Loose_indexscan">Loose Index Scan</a>. 
It demonstrates a way to use recursive CTE queries to essentially do what the TimescaleDB SkipScan node does. It's not nearly as straightforward to write and is more difficult to return multiple rows (it's not the same as a DISTINCT query), but it does provide a way to more efficiently use the index to retrieve one row for each item.</p><p>The biggest drawback with this approach is that it's much more difficult to return multiple columns of data with the recursive CTE (and in most cases, it's simply impossible to return multiple rows). So while some developers refer to this as a Skip Scan query, it doesn't easily allow you to retrieve all of the row data for a high-volume table like the SkipScan query node that TimescaleDB provides.</p><pre><code class="language-sql">/*
 * Loose index scan via https://wiki.postgresql.org/wiki/Loose_indexscan
 */
WITH RECURSIVE t AS (
   SELECT min(ts) AS ts FROM truck_reading
   UNION ALL
   SELECT (SELECT min(ts) FROM truck_reading WHERE ts &gt; t.ts)
   FROM t WHERE t.ts IS NOT NULL
   )
SELECT ts FROM t WHERE ts IS NOT NULL
UNION ALL
SELECT null WHERE EXISTS(SELECT 1 FROM truck_reading WHERE ts IS NULL);
</code></pre>
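<p>The same recursive trick can be adapted to step through the distinct <code>truck_id</code> values instead of timestamps. The following is a sketch (not taken verbatim from the wiki page); each <code>truck_id</code> it returns can then be fed to the LATERAL pattern from Option 2 to fetch that truck's most recent row:</p><pre><code class="language-sql">/*
 * Sketch: walk the index to find each distinct truck_id,
 * one index probe per truck instead of a full scan.
 */
WITH RECURSIVE t AS (
   SELECT min(truck_id) AS truck_id FROM truck_reading
   UNION ALL
   SELECT (SELECT min(truck_id) FROM truck_reading WHERE truck_id &gt; t.truck_id)
   FROM t WHERE t.truck_id IS NOT NULL
)
SELECT truck_id FROM t WHERE truck_id IS NOT NULL;
</code></pre>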
<h2 id="option-5-logging-table-and-trigger">Option 5: Logging Table and Trigger</h2><p>Sometimes, particularly with large, high-cardinality datasets, the above options aren't efficient enough for day-to-day operations. Querying for the last reading of all items, or the devices that haven't reported a value in the last 24 hours, will not meet your expectations as data volume and cardinality grow. </p><p>In this case, a better option might be to maintain a table that stores the last reading for each device as it's inserted into the raw time-series table so that your application can query a much smaller dataset for the most recent values. To track and update the logging table, we'll create a database trigger on the raw data (hyper)table.</p><p><em>"Wait a minute! Did you just say we're going to create a database trigger? Doesn't everyone say you should never use them?" </em></p><p>It's true. Triggers often get a bad rap in the SQL world, and honestly, that can often be justified. Used properly and with the correct implementation, database triggers can be tremendously useful and have minimal impact on SELECT performance. INSERT and UPDATE performance will be impacted because each transaction has to do more work. The performance hit may or may not impact your application, so testing is essential.</p><p>The SQL below provides a minimal example of how you could implement this kind of logging. There are a <strong><em>lot</em></strong> of considerations on how to best implement this option for your specific application. Thoroughly test any new processes you add to the data processing in your database.</p><p>In short, the example script below:</p><ul><li>creates a table to store the most recent data. 
If you only want to store the most recent timestamp of each truck's readings, this could easily just insert values into a new field on the <code>truck</code> table</li><li>ALTER's the FILLFACTOR of the table to 90% because it will be UPDATE heavy</li><li>creates a trigger function that will insert a row if it doesn't exist for a truck or update the values if a row for that truck already has an entry in the table ( <code>ON CONFLICT</code>)</li><li>enables the trigger on the data hypertable</li></ul><p>The key to this approach is to only track what is necessary, reducing the amount of work PostgreSQL has to do as part of the overall transaction that is ingesting raw data.  If your application updates values for 100,000 devices every second (and you're tracking 50 columns of data), a different trigger approach might be necessary. If this is the kind of data volume you see regularly, we assume that you have an experienced PostgreSQL DBA on your team to help manage and maintain your application database—and help you decide if the logging table approach will work with the available server resources.</p><pre><code class="language-sql">/*
 * The logging table alternative. The PRIMARY KEY will create an
 * index on the truck_id column to make querying for specific trucks more efficient.
 */
CREATE TABLE truck_log ( 
	truck_id int PRIMARY KEY REFERENCES trucks (truck_id),
	last_time timestamp,
	milage int,
	fuel int,
	latitude float8,
	longitude float8
);

/*
* Because the table will mostly be UPDATE heavy, a slightly reduced
* FILLFACTOR can alleviate maintenance contention and reduce
* page bloat on the table.
*/
ALTER TABLE truck_log SET (fillfactor=90);

/*
 * This is the trigger function which will be executed for each row
 * of an INSERT or UPDATE. Again, YMMV, so test and adjust appropriately.
 */
CREATE OR REPLACE FUNCTION create_truck_trigger_fn()
  RETURNS TRIGGER LANGUAGE PLPGSQL AS
$BODY$
BEGIN
  INSERT INTO truck_log
    VALUES (NEW.truck_id, NEW.time, NEW.milage, NEW.fuel, NEW.latitude, NEW.longitude)
  ON CONFLICT (truck_id) DO UPDATE SET
    last_time=NEW.time,
    milage=NEW.milage,
    fuel=NEW.fuel,
    latitude=NEW.latitude,
    longitude=NEW.longitude;
  RETURN NEW;
END
$BODY$;

/*
 * With the trigger function created, actually assign it to the truck_reading
 * table so that it will execute for each row.
 */
CREATE TRIGGER create_truck_trigger
  BEFORE INSERT OR UPDATE ON truck_reading
  FOR EACH ROW EXECUTE PROCEDURE create_truck_trigger_fn();
</code></pre>
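<p>Once the trigger is active, retrieving the latest values becomes a trivial lookup against the logging table. A quick sketch (the literal <code>truck_id</code> value is just an example):</p><pre><code class="language-sql">-- Most recent reading for every truck: a scan of the small logging table
SELECT truck_id, last_time, milage, fuel, latitude, longitude
FROM truck_log
ORDER BY truck_id;

-- Most recent reading for one truck: a primary-key lookup
SELECT * FROM truck_log WHERE truck_id = 1234;
</code></pre>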
<p>With these pieces in place, the new table will start receiving new rows of data and updating the last values as data is ingested. Querying this table will be much more efficient than searching through hundreds of millions of rows.</p><h2 id="review-the-options">Review The Options</h2><table>
<thead>
<tr>
<th></th>
<th>Requires matching index</th>
<th>Impacted by higher cardinality</th>
<th>Insert performance may be impacted</th>
</tr>
</thead>
<tbody>
<tr>
<td>Option 1: GROUP BY</td>
<td></td>
<td>X</td>
<td></td>
</tr>
<tr>
<td>Option 2: LATERAL JOIN</td>
<td>X</td>
<td>X</td>
<td></td>
</tr>
<tr>
<td>Option 3: TimescaleDB SkipScan</td>
<td>X</td>
<td>X</td>
<td>X (if index needs to be added)</td>
</tr>
<tr>
<td>Option 4: Recursive CTE</td>
<td>X</td>
<td>X</td>
<td>X (if index needs to be added)</td>
</tr>
<tr>
<td>Option 5: Logging table</td>
<td></td>
<td></td>
<td>X</td>
</tr>
</tbody>
</table>
<h2 id="wrap-up">Wrap Up </h2><p>Whatever approach you take, hopefully one of these options will help you take the next step to improve the performance of your application.</p><p>If you are not using TimescaleDB yet,&nbsp;<a href="https://www.timescale.com/?ref=timescale.com" rel="noreferrer">take a look</a>. It's a PostgreSQL extension that&nbsp;<a href="https://timescale.ghost.io/blog/postgresql-timescaledb-1000x-faster-queries-90-data-compression-and-much-more/" rel="noreferrer">will make your queries faster</a> via&nbsp;<a href="https://www.timescale.com/?ref=timescale.com" rel="noreferrer">automatic partitioning</a>, query planner enhancements, improved materialized views,&nbsp;<a href="https://timescale.ghost.io/blog/building-columnar-compression-in-a-row-oriented-database/" rel="noreferrer">columnar compression</a>, and much more.&nbsp;</p><p>If you're running your PostgreSQL database on your own hardware,&nbsp;<a href="https://docs.timescale.com/self-hosted/latest/install/?ref=timescale.com" rel="noreferrer">you can simply add the TimescaleDB extension</a>. If you prefer to try Timescale in AWS,&nbsp;<a href="https://console.cloud.timescale.com/signup?ref=timescale.com" rel="noreferrer">create a free account on our platform</a>. It only takes a couple of seconds, no credit card required!</p><p><a href="https://www.timescale.com/learn/postgresql-performance-tuning-how-to-size-your-database" rel="noreferrer"><em>PS: For more tips that will help you enhance performance in PostgreSQL, check out our collection of articles on PostgreSQL Performance Tuning</em></a><em>. </em></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to shape sample data with PostgreSQL generate_series() and SQL]]></title>
            <description><![CDATA[Learn how to quickly create recurring time-series data for charting and testing PostgreSQL and TimescaleDB functions.]]></description>
            <link>https://www.tigerdata.com/blog/how-to-shape-sample-data-with-postgresql-generate_series-and-sql</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/how-to-shape-sample-data-with-postgresql-generate_series-and-sql</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[General]]></category>
            <dc:creator><![CDATA[Ryan Booz]]></dc:creator>
            <pubDate>Thu, 20 Jan 2022 14:34:18 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2022/01/pexels-burak-kebapci-187041.jpg">
            </media:content>
            <content:encoded><![CDATA[<p><strong><em>In the lifecycle of any application, developers are often asked to create proof-of-concept features, test newly released functionality, and visualize data analysis. In many cases, the available test data is stale, not representative of normal usage, or simply doesn't exist for the feature being implemented. In situations like this, knowing how to quickly create sample time-series data with native PostgreSQL and SQL functions is a valuable skill to draw upon!</em></strong></p><p>In this three-part series on generating sample time-series data, we demonstrate how to use the built-in PostgreSQL function, <code><a href="https://www.postgresql.org/docs/current/functions-srf.html">generate_series()</a></code>, to more easily create large sets of data to help test various workloads, database features, or just to create fun samples.</p><p>In <a href="https://timescale.ghost.io/blog/blog/how-to-create-lots-of-sample-time-series-data-with-postgresql-generate_series/">part 1</a> and <a href="https://timescale.ghost.io/blog/blog/generating-more-realistic-sample-time-series-data-with-postgresql-generate_series/">part 2</a> of the series, we reviewed how <code>generate_series()</code> works, how to join multiple series using a CROSS JOIN to create large datasets quickly, and finally how to create and use custom PostgreSQL functions as part of the query to generate more realistic values for your dataset. If you haven't used <code>generate_series()</code> much before, we recommend first reading the other two posts. 
<em>The first one is an </em><a href="https://timescale.ghost.io/blog/blog/how-to-create-lots-of-sample-time-series-data-with-postgresql-generate_series/"><em>intro to the generate_series() function</em></a><em>, and the second one shows </em><a href="https://timescale.ghost.io/blog/blog/generating-more-realistic-sample-time-series-data-with-postgresql-generate_series/"><em>how to generate more realistic data</em></a><em>.</em></p><p>With those skills in hand, you can quickly and easily generate tens of millions of rows of realistic-looking data. Even still, there's one more problem that we hinted at in part 2 - all of our data, regardless of how it's formatted or constrained, is still based on the <code>random()</code> function. This means that over thousands or millions of samples, every device we create data for will likely have the same MAX() and MIN() value, and the distribution of random values over millions of rows for each device generally means that all devices will have similar average values.</p><p>This third post demonstrates a few methods for influencing how to create data that mimics a desired shape or trend. Do you need to simulate time-series values that cycle over time? What about demonstrating a counter value that resets every so often to test the <a href="https://docs.timescale.com/api/latest/hyperfunctions/counter_aggs/">counter_agg</a> hyperfunction? Are you trying to create new dashboards that display sales data over time, influenced for different months of the year when sales would ebb and flow?</p><p>Below we'll cover all of these examples to provide you with the final building blocks to create awesome sample data for all of your testing and exploration needs. Remember, however, that these examples are just the beginning. Keep playing. 
Tweak the formulas or add different relational data to influence the values that get generated so that it meets your use case.</p><h2 id="data-inception">Data inception</h2><p>Time-series data often has patterns. Weather temperatures and rainfall measurements change in a (mostly) predictable way throughout the year. Vibration measurements from an IoT device connected to an air conditioning system usually increase in the summer and decrease in the winter. Manufacturing data that measures the total units produced per hour (and the percentage of defective units) usually follow a pattern based on shift schedules and seasonal demand.</p><p>If you want to demonstrate this kind of data without having access to the production dataset, how would you go about it using <code>generate_series()</code>? SQL functions ended up being pretty handy when we discussed different methods for creating realistic-looking data in <a href="https://timescale.ghost.io/blog/blog/generating-more-realistic-sample-time-series-data-with-postgresql-generate_series/">part 2</a>. Do you think they might help here? 😉</p><h3 id="two-options-to-easily-return-the-row-number">Two options to easily return the row number</h3><p>Remember, for our purposes we're specifically talking about creating sample <em><strong>time-series data</strong></em>. Every row increases along the time axis, and if we use the multiplication formula from part 1, we can determine how many rows our sample data query will generate. Using built-in SQL functions, we can quickly start manipulating data values that change with the cycle of time. 💥</p><p>There are many reasons why it can be helpful to know the ordinal position of each row number in a query result. That's why standard SQL dialects have some variation of the <code>row_number() over()</code> window function. 
This simple, yet powerful, window function allows us to return the row number of a result set, and can utilize the ORDER BY and PARTITION BY keywords to further determine the row values.</p><!--kg-card-begin: markdown--><pre><code class="language-sql">SELECT ts, row_number() over(order by ts) AS rownum
FROM generate_series('2022-01-01','2022-01-05',INTERVAL '1 day') ts;

ts                          |rownum|
-----------------------------+------+
2022-01-01 00:00:00.000 -0500|     1|
2022-01-02 00:00:00.000 -0500|     2|
2022-01-03 00:00:00.000 -0500|     3|
2022-01-04 00:00:00.000 -0500|     4|
2022-01-05 00:00:00.000 -0500|     5|
</code></pre>
<!--kg-card-end: markdown--><p>In a normal query, this can be useful for tasks like paging data in a web API when there is a need to consistently return values based on a common partition.</p><p>There's one problem though. <code>row_number() over()</code> requires PostgreSQL (and any other SQL database) to process the query results twice to add the values correctly. Therefore, it's very useful, but also very expensive as datasets grow. </p><p>Fortunately, <strong>PostgreSQL helps us once again</strong> for our specific use case of generating sample time-series data. </p><p><br>Through this series of blog posts on generating sample time-series data, we've discussed that <code>generate_series()</code> is a <a href="https://www.postgresql.org/docs/current/functions-srf.html">Set Returning Function (SRF)</a>. Like the results from a table, set data can be JOINed and queried. Additionally, PostgreSQL provides the <code>WITH ORDINALITY</code> clause that can be applied to any SRF to generate an additional, incrementing BIGINT column. The best part? It doesn't require a second pass through the data in order to generate this value!</p><!--kg-card-begin: markdown--><pre><code class="language-sql">SELECT ts AS time, rownum
FROM generate_series('2022-01-01','2022-01-05',INTERVAL '1 day') WITH ORDINALITY AS t(ts,rownum);

time                         |rownum|
-----------------------------+------+
2022-01-01 00:00:00.000 -0500|     1|
2022-01-02 00:00:00.000 -0500|     2|
2022-01-03 00:00:00.000 -0500|     3|
2022-01-04 00:00:00.000 -0500|     4|
2022-01-05 00:00:00.000 -0500|     5|
</code></pre>
<!--kg-card-end: markdown--><p>Because it serves our purpose and is more efficient, the remainder of this post will use <code>WITH ORDINALITY</code>. However, remember that you <em>can</em> accomplish the same results using <code>row_number() over()</code> if that's more comfortable for you.</p><h3 id="harnessing-the-row-value">Harnessing the row value</h3><p>With increasing timestamps and an increasing integer on every row, we can begin to use other functions to create interesting data.</p><p>Remember from the previous blog posts that calling a function as part of your query executes the function for each row and returns the value. Just like a regular column, however, we don't have to actually emit that column in the final query results. Instead, the function value for that row can be used in calculating values in other columns.</p><p>As an example, let's modify the previous query. Instead of displaying the row number, let's multiply the value by 2. That is, the function value is treated as an input to a multiplication formula.</p><!--kg-card-begin: markdown--><pre><code class="language-sql">SELECT ts AS time, 2 * rownum AS rownum_by_two
FROM generate_series('2022-01-01','2022-01-05',INTERVAL '1 day') WITH ORDINALITY AS t(ts,rownum);

time                         |rownum_by_two|
-----------------------------+-------------+
2022-01-01 00:00:00.000 -0500|            2|
2022-01-02 00:00:00.000 -0500|            4|
2022-01-03 00:00:00.000 -0500|            6|
2022-01-04 00:00:00.000 -0500|            8|
2022-01-05 00:00:00.000 -0500|           10|
</code></pre>
<!--kg-card-end: markdown--><p>Easy enough, right? What else can we do with the row number value?</p><h3 id="counters-with-reset">Counters with reset</h3><p>Many time-series datasets record values that reset over time, often referred to as counters. The odometer on a car is an example. If you drive far enough, it will "roll over" to zero again and start counting upward. The same is true for many utilities, like water and electric meters, that track consumption. Eventually, the total digits will increment to the point where the counter resets and starts from zero again.</p><p>To simulate this with time-series data, we can use the incrementing row number and, after some period of time, reset the count and start over using the modulus operator (%).</p><!--kg-card-begin: markdown--><pre><code class="language-sql">-- This example resets the counter every 10 rows 
WITH counter_rows AS (
	SELECT ts, 
		CASE WHEN rownum % 10 = 0 THEN 10
		     ELSE rownum % 10 END AS row_counter
	FROM generate_series(now() - INTERVAL '5 minutes', now(), INTERVAL '1 second') WITH ORDINALITY AS t(ts, rownum)
)
SELECT ts, row_counter
FROM counter_rows;


ts                         |row_counter|
-----------------------------+-----------+
2022-01-07 13:17:46.427 -0500|          1|
2022-01-07 13:17:47.427 -0500|          2|
2022-01-07 13:17:48.427 -0500|          3|
2022-01-07 13:17:49.427 -0500|          4|
2022-01-07 13:17:50.427 -0500|          5|
2022-01-07 13:17:51.427 -0500|          6|
2022-01-07 13:17:52.427 -0500|          7|
2022-01-07 13:17:53.427 -0500|          8|
2022-01-07 13:17:54.427 -0500|          9|
2022-01-07 13:17:55.427 -0500|         10|
2022-01-07 13:17:56.427 -0500|          1|
… | …
</code></pre>
<!--kg-card-end: markdown--><p>By putting the CASE statement inside of the CTE, the counter data can be selected more easily to test other functions. For instance, to see how the <code><a href="https://docs.timescale.com/api/latest/hyperfunctions/counter_aggs/rate/">rate()</a></code> and <code><a href="https://docs.timescale.com/api/latest/hyperfunctions/counter_aggs/delta/">delta()</a></code> hyperfunctions work, we can use <code><a href="https://docs.timescale.com/api/latest/hyperfunctions/time_bucket/">time_bucket()</a></code> to group our 1-second readings into 1-minute buckets.</p><!--kg-card-begin: markdown--><pre><code class="language-sql">WITH counter_rows AS (
	SELECT ts, 
		CASE WHEN rownum % 10 = 0 THEN 10
		     ELSE rownum % 10 END AS row_counter
	FROM generate_series(now() - INTERVAL '5 minutes', now(), INTERVAL '1 second') WITH ORDINALITY AS t(ts, rownum)
)
SELECT time_bucket('1 minute', ts) bucket, 
  delta(counter_agg(ts,row_counter)),
  rate(counter_agg(ts, row_counter))
FROM counter_rows
GROUP BY bucket
ORDER BY bucket;

bucket                       |delta|rate|
-----------------------------+-----+----+
2022-01-07 13:25:00.000 -0500| 33.0| 1.0|
2022-01-07 13:26:00.000 -0500| 59.0| 1.0|
2022-01-07 13:27:00.000 -0500| 59.0| 1.0|
2022-01-07 13:28:00.000 -0500| 59.0| 1.0|
2022-01-07 13:29:00.000 -0500| 59.0| 1.0|
2022-01-07 13:30:00.000 -0500| 26.0| 1.0|
</code></pre>
<!--kg-card-end: markdown--><p><code>time_bucket()</code> outputs the starting time of the bucket, which, based on our date math for <code>generate_series()</code>, produces four complete buckets of 1-minute aggregated data and two partial buckets - one for the minute we are currently in, and a second for the partial minute 5 minutes ago. We can see that the delta correctly calculates the difference between the last and first readings of each bucket, and the rate of change (the increment between each reading) correctly displays a unit of one.</p><p>What are some other ways we can use these PostgreSQL functions to generate different shapes of data to help you explore other features of SQL and TimescaleDB quickly?</p><h3 id="increasing-trend-over-time">Increasing trend over time</h3><p>With the knowledge of how to create an ordinal value for each row of data produced by <code>generate_series()</code>, we can explore other ways of generating useful time-series data. Because the row number value will always increase, we can easily produce a random dataset that always increases over time but has some variability to it. Consider this a very rough representation of daily website traffic over the span of two years.</p><!--kg-card-begin: markdown--><pre><code class="language-sql">SELECT ts, (10 + 10 * random()) * rownum as value FROM generate_series
       ( '2020-01-01'::date
       , '2021-12-31'::date
       , INTERVAL '1 day') WITH ORDINALITY AS t(ts, rownum);
</code></pre>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/fig1.svg" class="kg-image" alt="Line chart showing fake daily website visits per day over two years of time, rising up and to the right" loading="lazy" width="900" height="500"><figcaption>Sample daily website traffic growing over time with random daily values</figcaption></figure><p>In reality, this chart isn't very realistic or representative. Any website that gains and loses viewers upwards of 50% per day probably isn't going to have great long-term success. Don't worry, we can do better with this example after we learn about another method for creating shaped data using sine waves.</p><h3 id="simple-cycles-sine-wave">Simple cycles (sine wave)</h3><p>Using the built-in <code>sin()</code> and <code>cos()</code> PostgreSQL functions, we can generate data useful for graphing and testing functions that need a predictable data trend. This is particularly useful for testing TimescaleDB downsampling hyperfunctions like <a href="https://docs.timescale.com/api/latest/hyperfunctions/downsample/lttb/">lttb</a> or <a href="https://docs.timescale.com/api/latest/hyperfunctions/downsample/asap/">asap</a>. These functions can take tens of thousands (or millions) of data points and return a smaller, but still accurately representative dataset for graphing.</p><p>We'll start with a basic example that produces one row per day, for 30 days. For each row number value, we'll get the cosine value that can be used to graph a wave.</p><!--kg-card-begin: markdown--><pre><code class="language-sql">-- subtract 1 from the row number for the wave to start
-- at zero radians and produce a more representative chart
SELECT  ts,
 cos(rownum-1) as value
FROM generate_series('2021-01-01','2021-01-30',INTERVAL '1 day') WITH ORDINALITY AS t(ts, rownum);

ts                         |value                 |
-----------------------------+--------------------+
2021-01-01 00:00:00.000 -0500|                 1.0|
2021-01-02 00:00:00.000 -0500|  0.5403023058681398|
2021-01-03 00:00:00.000 -0500| -0.4161468365471424|
2021-01-04 00:00:00.000 -0500| -0.9899924966004454|
2021-01-05 00:00:00.000 -0500| -0.6536436208636119|
2021-01-06 00:00:00.000 -0500| 0.28366218546322625|
… | …
</code></pre>
<!--kg-card-end: markdown--><p>Unfortunately, the graph of this cosine wave doesn't look all that appealing. For one month of daily data points, we only have ~6 distinct data points from peak to peak of each wave.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/basic_30day_cosine-1.svg" class="kg-image" alt="Graph showing a cosine wave using daily values for one month" loading="lazy" width="900" height="500"><figcaption>Cosine wave graph using daily values for one month</figcaption></figure><p>The reason our wave is so jagged is that sine and cosine values are measured in radians (based on 𝞹), not degrees. A complete cycle (peak-to-peak) on a sine wave happens from zero to 2*𝞹 (~6.28…). Therefore, every ~6 rows of data will produce a complete period in the wave - unless we find a way to modify that value. </p><p>To take control over the sine/cosine values, we need to think about how to modify the data based on the date range and interval (how many rows) and what we want the wave to look like.</p><p>This means we need to take a quick trip back to math class to talk about radians.</p><h2 id="math-class-flashback">Math class flashback!</h2><p>Step back with me for a minute to primary school and your favorite math subject - Algebra 2 (or Trigonometry as the case may be). 
How many hours did you spend working with graph paper (or graphing calculators) determining the amplitude, period, and shift of a sine or cosine graph?</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/20220114_Creating-sample-data-Image1--v2.0.svg" class="kg-image" alt="Sine wave graph showing period and amplitude, modified from https://www.mathsisfun.com/algebra/amplitude-period-frequency-phase-shift.html" loading="lazy"><figcaption>Sine wave period and amplitude</figcaption></figure><p>If you reach even further into your memory, you might remember this formula which allows you to modify the various aspects of a wave.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/20220114_Creating-sample-data-Image2--v2.0b.svg" class="kg-image" alt="Mathematical formula showing how to modify a sine wave period, amplitude, and shift. Formula: y = A sin(B(x+C))+D" loading="lazy" width="900" height="400"><figcaption>Mathematical formula for modifying the shape and values of a sine wave</figcaption></figure><p>There's a lot here, I know. Let's primarily focus on the two numbers that matter most for our current use case:</p><p>X = the "number of radians", which is the row number in our dataset</p><p>B = a value to multiply the row number by, to decrease the "radian" value for each row</p><p><em>(A, C, and D change the height and placement of the wave, but to start, we want to elongate each period and provide more "points" on the line to graph.)</em></p><p>Let's start with a small dataset example, generating cosine data for three months of daily timestamps with no modifications.</p><!--kg-card-begin: markdown--><pre><code class="language-sql">SELECT  ts, 
cos(rownum) as value
FROM generate_series('2021-01-01','2021-03-31',INTERVAL '1 day') WITH ORDINALITY AS t(ts, rownum);


ts                         |value                  |
-----------------------------+---------------------+
2021-01-01 00:00:00.000 -0500|   0.5403023058681398|
2021-01-02 00:00:00.000 -0500|  -0.4161468365471424|
2021-01-03 00:00:00.000 -0500|  -0.9899924966004454|
2021-01-04 00:00:00.000 -0500|  -0.6536436208636119|
2021-01-05 00:00:00.000 -0500|  0.28366218546322625|
2021-01-06 00:00:00.000 -0500|    0.960170286650366|
2021-01-07 00:00:00.000 -0500|   0.7539022543433046|
2021-01-08 00:00:00.000 -0500| -0.14550003380861354|
2021-01-09 00:00:00.000 -0500|  -0.9111302618846769|
… | …
2021-03-29 00:00:00.000 -0400|   0.9993732836951247|
2021-03-30 00:00:00.000 -0400|   0.5101770449416689|
2021-03-31 00:00:00.000 -0400|  -0.4480736161291701|
</code></pre>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/basic_90day_cosine.svg" class="kg-image" alt="Graph showing a cosine wave using daily values for three months" loading="lazy" width="900" height="500"><figcaption>Cosine wave with daily data for three months</figcaption></figure><p>In this example, we see ~14 peaks in our wave because there are 90 points of data and without modification, the wave will have a period (peak-to-peak) every ~6.28 points. To lengthen the cycle, we need to perform some simple division.</p><!--kg-card-begin: markdown--><pre><code class="language-bash">[cycle modifying value] = 6.28/[total interval (rows) per cycle]
</code></pre>
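<p>As a side note, 6.28 is just an approximation of 2𝞹. PostgreSQL also ships a built-in <code>pi()</code> function, so you can compute the modifier exactly if you prefer. A sketch of a 30-day cycle written with <code>pi()</code>:</p><pre><code class="language-sql">-- 2 * pi() radians per cycle, spread across 30 daily rows
SELECT ts, cos(rownum * 2 * pi() / 30) AS value
FROM generate_series('2021-01-01','2021-03-31',INTERVAL '1 day') WITH ORDINALITY AS t(ts, rownum);
</code></pre>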
<!--kg-card-end: markdown--><p>Using the same 3 months of generated daily values, let's see how to modify the data to lengthen the period of the wave.</p><h3 id="one-cycle-per-month-30-days">One cycle per month (30 days)</h3><p>If we want our daily data to cycle every 30 days, multiply our row number value by 6.28/30.</p><p>6.28/30 = .209 (the row number radians modifier)</p><!--kg-card-begin: markdown--><pre><code class="language-sql">SELECT  ts, cos(rownum * 6.28/30) as value
FROM generate_series('2021-01-01','2021-03-31',INTERVAL '1 day') WITH ORDINALITY AS t(ts, rownum);
</code></pre>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/basic_90day_cosine_3_phases.svg" class="kg-image" alt="Graph showing a cosine wave, adjusted to produce one wave period per month, three total periods" loading="lazy" width="900" height="500"><figcaption>Cosine wave using daily data for three months, adjusted to have monthly periods</figcaption></figure><h3 id="one-cycle-per-quarter-90-days">One cycle per quarter (90 days)</h3><p>6.28/90 = .07 (this is our radians modifier)</p><pre><code>SELECT  ts, cos(rownum * 6.28/90) as value
FROM generate_series('2021-01-01','2021-03-31',INTERVAL '1 day') WITH ORDINALITY AS t(ts, rownum);</code></pre><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/basic_90day_cosine_1_phase1.svg" class="kg-image" alt="Graph showing a cosine wave, adjusted to produce one wave over three months of data" loading="lazy" width="900" height="500"><figcaption>Cosine wave using daily data for three months, adjusted to have one 3-month period</figcaption></figure><p>To modify the overall length of the period, you need to modify the row number value based on the total number of rows in the result and the granularity of the timestamp.</p><p>Here are some example values that you can use to modify the wave period based on the interval used with <code>generate_series()</code>.</p><!--kg-card-begin: markdown--><table>
<thead>
<tr>
<th>generate_series() interval</th>
<th>Desired period length</th>
<th>Divide 6.28 by…</th>
</tr>
</thead>
<tbody>
<tr>
<td>daily</td>
<td>1 month</td>
<td>30</td>
</tr>
<tr>
<td>daily</td>
<td>3 months</td>
<td>90</td>
</tr>
<tr>
<td>hourly</td>
<td>1 day</td>
<td>24</td>
</tr>
<tr>
<td>hourly</td>
<td>1 week</td>
<td>168</td>
</tr>
<tr>
<td>hourly</td>
<td>1 month</td>
<td>720</td>
</tr>
<tr>
<td>minute</td>
<td>1 hour</td>
<td>60</td>
</tr>
<tr>
<td>minute</td>
<td>1 day</td>
<td>1440</td>
</tr>
</tbody>
</table>
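<p>For example, taking the hourly interval with a one-day period from the table above (divide 6.28 by 24), a week of hourly readings cycles once per day:</p><pre><code class="language-sql">-- Hourly data, one full wave period every 24 rows (one day)
SELECT ts, cos(rownum * 6.28/24) AS value
FROM generate_series('2021-01-01','2021-01-07',INTERVAL '1 hour') WITH ORDINALITY AS t(ts, rownum);
</code></pre>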
<!--kg-card-end: markdown--><h3 id="modifying-the-wave-amplitude-and-shift">Modifying the wave amplitude and shift</h3><p>Another tweak we can make to our wave data is to change the amplitude (difference between the min and max peaks) and, as necessary, shift the wave up or down on the Y-axis. </p><p>To do this, multiply the cosine value by the maximum value you want the wave to have. For example, we can multiply the monthly cycle data by 10, which changes the overall minimum and maximum values of the data.</p><!--kg-card-begin: markdown--><pre><code class="language-sql">SELECT  ts, 10 * cos(rownum * 6.28/30) as value
FROM generate_series('2021-01-01','2021-03-31',INTERVAL '1 day') WITH ORDINALITY AS t(ts, rownum);
</code></pre>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/basic_90day_cosine_adjusted_amplitude.svg" class="kg-image" alt="Graph showing a cosine wave with adjusted amplitude from -10 to 10 on the Y-axis" loading="lazy" width="900" height="500"><figcaption>Cosine wave using daily data for three months with increased amplitude</figcaption></figure><p>Notice that the min/max values are now from -10 to 10.</p><p>We can take it one step further by adding a value to the output which will shift the final values up or down on the Y-axis. In this example, we modified the previous query by adding 10 to the value of each row which results in values from 0 to 20.</p><!--kg-card-begin: markdown--><pre><code class="language-sql">SELECT  ts, 10 + 10 * cos(rownum * 6.28/30) as value
FROM generate_series('2021-01-01','2021-03-31',INTERVAL '1 day') WITH ORDINALITY AS t(ts, rownum);
</code></pre>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/basic_90day_cosine_y_sift.svg" class="kg-image" alt="Graph showing a cosine wave with shifted min/max values on the Y-axis of zero to twenty" loading="lazy" width="900" height="500"><figcaption>Cosine wave using daily data for three months with shifted min/max values on the Y-axis</figcaption></figure><p>Why spend so much time showing you how to generate and manipulate sine or cosine wave data, especially when we rarely see repeatable data this smooth in real life? </p><p>One of the main advantages of using consistent, predictable data like this in testing is that you can easily tell if your application, charting tools, and query are working as expected. Once you begin adding in unpredictable, real-life data, it can be difficult to determine if the data, query, or application are producing unexpected results. Quickly generating known data with a specific pattern can help rule out errors with the query, at least.</p><p>The second advantage of using a known dataset is that it can be used to shape and influence the results of other queries. Earlier in this post, we demonstrated a very simplistic example of increasing website traffic by multiplying the row number and a random value. Let's look at how we can join both datasets to create a better shape for the sample website traffic data.</p><h3 id="better-website-traffic-samples">Better website traffic samples</h3><p>One of the key takeaways from this series of posts is that <code>generate_series()</code> returns a set of data that can be JOINed and manipulated like data from a regular table. Therefore, we can join together our rough "website traffic" data and our sine wave to produce a smoother, more realistic set of data to experiment with. 
SQL for the win!</p><p>Overall, this is one of the more complex examples we've presented, utilizing multiple common table expressions (CTEs) to break the various sets into separate tables that we can query and join. However, this also means that you can independently modify the time range and other values to change the data that is generated from this query for your own experimentation.</p><!--kg-card-begin: markdown--><pre><code class="language-sql">-- This is the generate series data
-- with a &quot;short&quot; date to join with later
WITH daily_series AS ( 
	SELECT ts, date(ts) AS day, rownum FROM generate_series
       ( '2020-01-01'
       , '2021-12-31'
       , '1 day'::interval) WITH ORDINALITY AS t(ts, rownum)
),
-- This selects the time, &quot;day&quot;, and a 
-- random value that represents our daily website visits
daily_value AS ( 
	SELECT ts, day, rownum, random() AS val
    FROM daily_series
    ORDER BY day
),
-- This cosine wave dataset has the same &quot;day&quot; values which allow 
-- it to be joined to the daily_value easily. The wave value is used to modify
-- the &quot;website&quot; value by some percentage to smooth it out 
-- in the shape of the wave.
daily_wave AS ( 
	SELECT
       day,
       -- 6.28 radians divided by 180 days (rows) to get 
       -- one peak every 6 months (twice a year)
       1 + .2 * cos(rownum * 6.28/180) as p_mod
       FROM daily_series
)
-- (500 + 20 * val) = 500-520 visits per day before modification
-- p_mod = an adjusted cosine value that raises or lowers our data each day
-- row_number = a big incremental value for each row to quickly increase &quot;visits&quot; each day
SELECT dv.ts, (500 + 20 * val) * p_mod * rownum as value
FROM daily_value dv
	INNER JOIN daily_wave dw ON dv.day = dw.day
    ORDER BY ts;
</code></pre>
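<p>If you want to confirm that the <code>p_mod</code> wave really does peak every 180 rows, evaluating it at a few row numbers shows the peak-trough-peak cycle (a quick check you can run on its own):</p>
<pre><code class="language-sql">SELECT 1 + .2 * cos(0   * 6.28/180) AS row_0,    -- 1.2  (peak)
       1 + .2 * cos(90  * 6.28/180) AS row_90,   -- ~0.8 (trough)
       1 + .2 * cos(180 * 6.28/180) AS row_180;  -- ~1.2 (peak again)
</code></pre>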
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/sample_wave-based_website_traffic-2years-1.svg" class="kg-image" alt="Graph showing increasing, fake website visits, adjusted with cosine wave data to provide shape with elevated visits twice a year" loading="lazy" width="900" height="500"><figcaption>Combining wave data and random increasing data to shape the data pattern</figcaption></figure><p>Without much effort, we are able to generate a time-series dataset, use two different SQL functions, and join multiple sets together to create fun, graphical data. In this example, our traffic peaks twice a year (every ~180 days) during July and late December.</p><p>But we don't have to stop there. We can carry our website traffic example one step further by applying just a little more control over how much the data increases or decreases during certain periods. </p><p>Once again, relational data to the rescue!</p><h3 id="influence-the-pattern-with-relational-data">Influence the pattern with relational data</h3><p>As a final example, let's consider one other type of data that we can include in our queries that influence the final generated values - relational data. Although we've been using data that was created using <code>generate_series()</code> to produce some fun and interesting sample datasets, we can just as easily JOIN to other data in our database to further manipulate the final result.</p><p>There are many ways you could JOIN to and use additional data depending on your use case and the type of time-series data you're trying to mimic. 
For example:</p><ul><li><strong>IoT data from weather sensors:</strong> store the typical weekly temperature highs/lows in a database table and use those values as input to the <code>random_between()</code> function we created in post 2</li><li><strong>Stock data analysis:</strong> store the dates for quarterly disclosures and a hypothetical factor that will influence the impact on stock price moving forward</li><li><strong>Sales or website traffic:</strong> store the monthly or weekly change observed in a typical sales cycle. Does traffic or sales increase a quarter-end? What about during the end-of-year holiday season? </li></ul><p>To demonstrate this, we'll use the fictitious website traffic data from earlier in this post. Specifically, we've decided that we want to see a spike in traffic during June and December. </p><p>First, we create a regular PostgreSQL table to store the numerical month (1-12) and a float value which will be used to modify our generated data (up or down). This will allow us to tweak the overall shape for a given month.</p><!--kg-card-begin: markdown--><pre><code class="language-sql">CREATE TABLE overrides (
	m_val INT NOT NULL,
	p_inc FLOAT4 NOT null
);

INSERT INTO overrides(m_val, p_inc) VALUES 
	(1,1.04), -- 4% residual increase from December
	(2,1),
	(3,1),
	(4,1),
	(5,1),
	(6,1.10), -- June increase of 10%
	(7,1),
	(8,1),
	(9,1),
	(10,1),
	(11,1.08), -- 8% early shoppers sales/traffic growth
	(12,1.18); -- 18% holiday increase
</code></pre>
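<p>A quick way to double-check the modifiers before joining - only four months should deviate from the neutral value of 1:</p>
<pre><code class="language-sql">-- January, June, November, and December carry a non-neutral modifier
SELECT m_val, p_inc FROM overrides WHERE p_inc &lt;&gt; 1 ORDER BY m_val;
</code></pre>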
<!--kg-card-end: markdown--><p>Using this simple dataset, let's first join it to the "simplistic" query that had randomly growing data over time.</p><!--kg-card-begin: markdown--><pre><code class="language-sql">WITH daily_series AS (
-- a random value that increases over time based on the row number
SELECT ts, date_part('month',ts) AS m_val, (10 + 10*random()) * rownum as value FROM generate_series
       ( '2020-01-01'::date
       , '2021-12-31'::date
       , INTERVAL '1 day') WITH ORDINALITY AS t(ts, rownum)
)
-- join to the `overrides` table to get the 'p_inc' value 
-- for the month of the current row
SELECT ts, value * p_inc AS value FROM daily_series ds
INNER JOIN overrides o ON ds.m_val=o.m_val
ORDER BY ts;
</code></pre>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/enhanced_sample_daily_website_traffic-2years.svg" class="kg-image" alt="Graph showing increasing, fake website visits, adjusted during some months using values from a relational table" loading="lazy" width="900" height="920"><figcaption>Sample website traffic for two years, modified during some months with relational data</figcaption></figure><p>Joining to the <code>overrides</code> table based on the month of each data point, we are able to multiply the percentage increase (<code>p_inc</code>) value and the fake website traffic value to influence the trend of our data during specific time periods.</p><p>Combining everything we've learned and taking this example one step further, we can enhance the cosine data query with the same monthly override values to tweak our fake, cyclical time-series data that represents growing website traffic with a more realistic shape.</p><!--kg-card-begin: markdown--><pre><code class="language-sql">-- This is the generate series data
-- with a &quot;short&quot; date to join with later
WITH daily_series AS ( 
	SELECT ts, date(ts) AS day, rownum FROM generate_series
       ( '2020-01-01'
       , '2021-12-31'
       , '1 day'::interval) WITH ORDINALITY AS t(ts, rownum)
),
-- This selects the time, &quot;day&quot;, and a 
-- random value that represents our daily website visits
-- 'm_val' will be used to join with the 'overrides' table
daily_value AS ( 
	SELECT ts, day, date_part('month',ts) as m_val, rownum, random() AS val
    FROM daily_series
    ORDER BY day
),
-- This cosine wave dataset has the same &quot;day&quot; values which allow 
-- it to be joined to the daily_value easily. The wave value is used to modify
-- the &quot;website&quot; value by some percentage to smooth it out 
-- in the shape of the wave.
daily_wave AS ( 
	SELECT
       day,
       -- 6.28 radians divided by 180 days (rows) to get 
       -- one peak every 6 months (twice a year)
       1 + .2 * cos(rownum * 6.28/180) as p_mod
       FROM daily_series
)
-- (500 + 20 * val) = 500-520 visits per day before modification
-- p_mod = an adjusted cosine value that raises or lowers our data each day
-- row_number = a big incremental value for each row to quickly increase &quot;visits&quot; each day
-- p_inc = a monthly adjustment value taken from the 'overrides' table
SELECT dv.ts, (500 + 20 * val) * p_mod * rownum * p_inc as value
FROM daily_value dv
	INNER JOIN daily_wave dw ON dv.day = dw.day
    INNER JOIN overrides o ON dv.m_val = o.m_val
    ORDER BY ts;
</code></pre>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/sample_wave-based_website_traffic_enhanced-2years.svg" class="kg-image" alt="Graph showing increasing, fake website visits, combined with sine wave data and adjusted during some months using values from a relational table" loading="lazy" width="900" height="500"><figcaption>Sample website traffic for two years combined with sine wave data and modified during some months with relational data</figcaption></figure><h2 id="wrapping-it-up">Wrapping it up</h2><p>In this 3rd and final blog post of our series about generating sample time-series datasets, we demonstrated how to add shape and trend into your sample time-series data (e.g., increasing web traffic over time and quarterly sales cycles) using built-in SQL functions and relational data. With a little bit of math mixed in, we learned how to manipulate the pattern of generated data, which is particularly useful for visualizing time-series data and learning analytical PostgreSQL or TimescaleDB functions.</p><p>To see some of these examples in action, watch my video on creating realistic sample data:</p><figure class="kg-card kg-embed-card"><iframe width="200" height="113" src="https://www.youtube.com/embed/Ff2ltGrPGIg?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></figure><p>If you have questions about using <a href="https://www.postgresql.org/docs/current/functions-srf.html">generate_series()</a> or have any questions about TimescaleDB, please <a href="https://slack.timescale.com/">join our community Slack channel</a>, where you'll find an active community and a handful of the Timescale team most days.</p><p>If you want to try creating larger sets of sample time-series data using generate_series() and see how the exciting features of TimescaleDB work, <a 
href="https://www.timescale.com/timescale-signup">sign up for a free 30-day trial</a> or <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/install-timescaledb/self-hosted/">install and manage it on your instances</a>. (You can also learn more by <a href="https://docs.timescale.com/timescaledb/latest/tutorials/">following one of our many tutorials</a>.)</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Generating More Realistic Sample Time-Series Data With PostgreSQL generate_series()]]></title>
            <description><![CDATA[In the second post about generate_series(), learn ways to create more realistic-looking data for testing and evaluating new features in PostgreSQL and TimescaleDB.]]></description>
            <link>https://www.tigerdata.com/blog/generating-more-realistic-sample-time-series-data-with-postgresql-generate_series</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/generating-more-realistic-sample-time-series-data-with-postgresql-generate_series</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[General]]></category>
            <dc:creator><![CDATA[Ryan Booz]]></dc:creator>
            <pubDate>Thu, 11 Nov 2021 14:51:33 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2021/11/pierre-chatel-innocenti-F4VHOj76D0o-unsplash.jpg">
            </media:content>
            <content:encoded><![CDATA[<p>In this three-part series on generating sample time-series data, we demonstrate how to use the built-in <a href="https://www.tigerdata.com/learn/understanding-postgresql-user-defined-functions" rel="noreferrer">PostgreSQL function</a>, <a href="https://www.postgresql.org/docs/current/functions-srf.html"><code>generate_series()</code></a>, to more easily create large sets of data to help test various workloads, database features, or just to create fun samples.</p><p><a href="https://timescale.ghost.io/blog/blog/how-to-create-lots-of-sample-time-series-data-with-postgresql-generate_series/">In part 1 of the series</a>, we reviewed how <code>generate_series()</code> works, including the ability to join multiple series into a larger table of time-series data - through a feature known as a CROSS (or Cartesian) JOIN. We ended the first post by showing you how to quickly calculate the number of rows a query will produce and modify the parameters for <code>generate_series()</code> to fine-tune the size and shape of the data.</p><p>However, there was one problem with the data we could produce at the end of the first post. The data that we were able to generate was very basic and not very realistic. Without more effort, using functions like <code>random()</code> to generate values doesn't provide much control over precisely what numbers are produced, so the data still feels more fake than we might want.</p>
<p>This second post will demonstrate a few ways to create more realistic-looking data beyond a column or two of random decimal values. Read on for more.</p><p><a href="https://timescale.ghost.io/blog/blog/how-to-shape-sample-data-with-postgresql-generate_series-and-sql/">Part 3</a> of this blog series adds one final tool to the mix - combining the data formatting techniques below with additional equations and relational data to shape your sample time-series output into something that more closely resembles real-life applications.</p><p>By the end of this series, you'll be ready to test almost any feature that TimescaleDB offers and create quick datasets for your testing and demos!</p><h2 id="a-brief-review-of-generateseries">A brief review of generate_series()</h2><p>In the first post, we demonstrated how <a href="https://www.postgresql.org/docs/current/functions-srf.html"><code>generate_series()</code></a> (a Set Returning Function) could quickly create a data set based on a range of numeric values or dates. The generated data is essentially an in-memory table that can quickly create large sets of sample data.</p><pre><code class="language-sql">-- create a series of values, 1 through 5, incrementing by 1
SELECT * FROM generate_series(1,5);

generate_series|
---------------|
              1|
              2|
              3|
              4|
              5|


-- generate a series of timestamps, incrementing by 1 hour
SELECT * from generate_series('2021-01-01','2021-01-02', INTERVAL '1 hour');

    generate_series     
------------------------
 2021-01-01 00:00:00+00
 2021-01-01 01:00:00+00
 2021-01-01 02:00:00+00
 2021-01-01 03:00:00+00
 2021-01-01 04:00:00+00
...
</code></pre><p>We then discussed how the data quickly becomes more complex as we join the various sets together (along with some value returning functions) to create a multiple of both sets together.</p><p>This example from the first post joined a timestamp set, a numeric set, and the <code>random()</code> function to create fake CPU data for four fake devices over time.</p><pre><code class="language-sql">-- there is an implicit CROSS JOIN between the two generate_series() sets
SELECT time, device_id, random()*100 as cpu_usage 
FROM generate_series('2021-01-01 00:00:00','2021-01-01 04:00:00',INTERVAL '1 hour') as time, 
generate_series(1,4) device_id;


time               |device_id|cpu_usage          |
-------------------+---------+-------------------+
2021-01-01 00:00:00|        1|0.35415126479989567|
2021-01-01 01:00:00|        1| 14.013393572770028|
2021-01-01 02:00:00|        1|   88.5015939122006|
2021-01-01 03:00:00|        1|  97.49037810105996|
2021-01-01 04:00:00|        1|  50.22781125586846|
2021-01-01 00:00:00|        2|  46.41196423062297|
2021-01-01 01:00:00|        2|  74.39903569177027|
2021-01-01 02:00:00|        2|  85.44087332221935|
2021-01-01 03:00:00|        2|  4.329394730750735|
2021-01-01 04:00:00|        2| 54.645873866589056|
2021-01-01 00:00:00|        3|  63.01888063314749|
2021-01-01 01:00:00|        3|  21.70606884856987|
2021-01-01 02:00:00|        3|  32.47610779097485|
2021-01-01 03:00:00|        3| 47.565982341726354|
2021-01-01 04:00:00|        3|  64.34867263419619|
2021-01-01 00:00:00|        4|   78.1768041898232|
2021-01-01 01:00:00|        4|  84.51505102850199|
2021-01-01 02:00:00|        4| 24.029611792753514|
2021-01-01 03:00:00|        4|  17.08996115345549|
2021-01-01 04:00:00|        4| 29.642690955760997|
</code></pre><p>And finally, we talked about how to calculate the total number of rows your query would generate based on the time range, the interval between timestamps, and the number of "things" for which you are creating fake data.</p><table>
<thead>
<tr>
<th>Range of readings</th>
<th>Length of interval</th>
<th>Number of "devices"</th>
<th>Total rows</th>
</tr>
</thead>
<tbody>
<tr>
<td>1 year</td>
<td>1 hour</td>
<td>4</td>
<td>35,040</td>
</tr>
<tr>
<td>1 year</td>
<td>10 minutes</td>
<td>100</td>
<td>5,256,000</td>
</tr>
<tr>
<td>6 months</td>
<td>5 minutes</td>
<td>1,000</td>
<td>52,560,000</td>
</tr>
</tbody>
</table>
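<p>The totals above are simply the number of readings per device multiplied by the number of devices. For the first row of the table, for example:</p>
<pre><code class="language-sql">-- 1 year of hourly readings: 365 days * 24 readings/day * 4 devices
SELECT 365 * 24 * 4 AS total_rows;  -- 35,040
</code></pre>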
<p>Still, <strong><em>the main problem remains.</em></strong> Even if we can generate 50 million rows of data with a few lines of SQL, the data we generate isn't very realistic. It's all random numbers, with lots of decimals and minimal variation.</p><p>As we saw in the query above (generating fake CPU data), any columns of data that we add to the SELECT query are added to each row of the resulting set. If we add static text (like 'Hello, Timescale!'), that text is repeated for every row. Likewise, adding a function as a column value will be called one time for each row of the final set. </p><p>That's what happened with the <code>random()</code> function in the CPU data example. Every row has a different value because the function is called separately for each row of generated data. We can use this to our advantage to begin making the data look more realistic.</p><p>With a little more thought and custom <a href="https://www.tigerdata.com/blog/function-pipelines-building-functional-programming-into-postgresql-using-custom-operators" rel="noreferrer">PostgreSQL functions</a>, we can start to bring our sample data "to life."</p><h3 id="what-is-realistic-data">What is realistic data?</h3><p>This feels like a good time to make sure we're on the same page. What do I mean by "realistic" data?</p><p>Using the basic techniques we've already discussed allows you to create a lot of data quickly. In most cases, however, you often know what the data you're trying to explore looks like. It's probably not a bunch of decimal or integer values. Even if the data you're trying to mimic <em>are</em> just numeric values, they likely have valid ranges and maybe a predictable frequency.</p><p>Take our simple example of CPU and temperature data from above. With just two fields, we have a few choices to make if we want the generated data to <em>feel</em> more realistic.</p><ul><li>Is CPU a percentage? 
Out of 100% or are we representing multi-core CPUs that can present as 200%, 400%, or 800%?</li><li>Is temperature measured in Fahrenheit or Celsius? What are reasonable values for CPU temperature in each unit? Do we store temperature with decimals or as an integer in the schema?</li><li>What if we added a "note" field to the schema for messages that our monitoring software might add to the readings from time to time? Would every reading have a note or just when a threshold was reached? Is there a special diagnostic message at the top of each hour that we need to replicate in some way?</li></ul><p>Using <code>random()</code> and static text by themselves allows us to generate <em>lots</em> of data with many columns, but it's not going to be very interesting or as useful in testing features in the database.</p><p>That's the goal of the second and third posts in this series, helping you to produce sample data that looks more like the real thing without much extra work. Yes, it <em>will</em> still be random, but it will be random within constraints that help you feel more connected to the data as you explore various aspects of time-series data.</p><p>And, by using functions, all of the work is easily reusable from table to table.</p><h3 id="walk-before-you-run">Walk before you run</h3><p>In each of the examples below, we'll approach our solutions much as we learned in elementary math class: <em>show your work</em>! It's often difficult to create a function or procedure in PostgreSQL without playing with a plain SQL statement first. This abstracts away the need to think about function inputs and outputs at the outset so that we can focus on how the SQL works to produce the value we want. </p><p>Therefore, the examples below show you how to get a value (random numbers, text, JSON, etc.) in a SELECT statement first before converting the SQL into a function that can be reused. 
This kind of iterative process is a great way to learn features of PostgreSQL, particularly when it's combined with <a href="https://www.postgresql.org/docs/current/functions-srf.html"><code>generate_series()</code></a>.</p><p>So, take one foot and put it in front of the other, and let's start creating better sample data.</p><h2 id="creating-more-realistic-numbers">Creating more realistic numbers</h2><p>In time-series data, numeric values are often the most common data type. Using a function like <code>random()</code> without any other formatting creates very… well... random (and precise) numbers with lots of decimal points. While it <em>works,</em> the values aren't <em>realistic</em>. Most users and devices aren't tracking CPU usage to 12+ decimals. We need a way to manipulate and constrain the final value that's returned in the query.</p><p>For numeric values, PostgreSQL provides many built-in functions to modify the output. In many cases, using <code>round()</code> and <code>floor()</code> with basic arithmetic can quickly start shaping the data in a way that better fits your schema and use case.</p>
<p>Let's modify the example query for getting device metrics, returning values for CPU and temperature. We want to update the query to ensure that the data values are "customized" for each column, returning values within a specific range and precision. Therefore, we need to apply a standard formula to each numeric value in our SELECT query.</p><pre><code class="language-bash">Final value = random() * (max allowed value - min allowed value) + min allowed value</code></pre><p>This equation will always generate a decimal value between (and inclusive of) the min and max value. If <code>random()</code> returns a value of 1, the final output will equal the maximum value. If <code>random()</code> returns a value of 0, then the result will equal the minimum value. Any other number that <code>random()</code> returns will produce some output between the min and max values.</p>
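<p>Plugging the boundary outputs of <code>random()</code> (0 and 1) into the formula with a sample range of 3 to 100 confirms this behavior:</p>
<pre><code class="language-sql">-- with min = 3 and max = 100:
SELECT 0 * (100 - 3) + 3 AS at_random_zero,  -- 3   (the minimum)
       1 * (100 - 3) + 3 AS at_random_one;   -- 100 (the maximum)
</code></pre>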
<p>Depending on whether we want a decimal or integer value, we can further format the "final value" of our formula with <code>round()</code> and <code>floor()</code>.</p><p>This example produces a reading every minute for one hour for 10 devices. The cpu value will always fall between 3 and 100 (with four decimals of precision), and the temperature will always be an integer between 28 and 83.</p><pre><code class="language-sql">SELECT
  time,
  device_id,
  round((random()* (100-3) + 3)::NUMERIC, 4) AS cpu,
  floor(random()* (83-28) + 28)::INTEGER AS tempc
FROM 
	generate_series(now() - interval '1 hour', now(), interval '1 minute') AS time, 
	generate_series(1,10,1) AS device_id;


time                         |device_id|cpu    |tempc        |
-----------------------------+---------+-------+-------------+
2021-11-03 12:47:01.181 -0400|        1|53.7301|           61|
2021-11-03 12:48:01.181 -0400|        1|34.7655|           46|
2021-11-03 12:49:01.181 -0400|        1|78.6849|           44|
2021-11-03 12:50:01.181 -0400|        1|95.5484|           64|
2021-11-03 12:51:01.181 -0400|        1|86.3073|           82|
…|...|...|...
</code></pre><p>By using our simple formula and formatting the result correctly, the query produced the "curated" output (random as it is) we wanted.</p><h3 id="the-power-of-functions">The power of functions</h3><p>But there's also a bit of a letdown here, isn't there? Typing that formula repeatedly for each value - trying to remember the order of parameters and when I need to cast a value - will become tedious quickly. After all, you only have so many <a href="https://keysleft.com/">keystrokes left</a>. </p><p>The solution is to create and use PostgreSQL functions that can take the inputs we need, do the correct calculations, and return the formatted value that we want. There are <em>many</em> ways we could accomplish a calculation like this in a function. Use this example as a starting place for your learning and exploration.</p><p><em><strong>Note:</strong> </em>In this example, I chose to return the value from this function as a <code>numeric</code> data type because it can return values that <em>look</em> like integers (no decimals) or floats (decimals). As long as the return values are inserted into a table with the intended schema, this is a "trick" to visually see what we expect - an integer or a float. In general, the <code>numeric</code> data type will often perform worse in queries and features like compression because of how <code>numeric</code>  values are represented internally. We recommend avoiding <code>numeric</code> types in schema design whenever possible, preferring the float or integer types instead.</p><pre><code class="language-sql">/*
 * Function to create a random numeric value between two numbers
 * 
 * NOTICE: We are using the type of 'numeric' in this function in order
 * to visually return values that look like integers (no decimals) and 
 * floats (with decimals). However, if inserted into a table, the assumption
 * is that the appropriate column type is used. The `numeric` type is often
 * not the correct or most efficient type for storing numbers in a table.
 */
CREATE OR REPLACE FUNCTION random_between(min_val numeric, max_val numeric, round_to int=0) 
   RETURNS numeric AS
$$
 DECLARE
 	value NUMERIC = random() * (max_val - min_val) + min_val;
BEGIN
   IF round_to = 0 THEN 
	 RETURN floor(value);
   ELSE 
   	 RETURN round(value,round_to);
   END IF;
END
$$ language 'plpgsql';
</code></pre><p>This example function uses the minimum and maximum values provided, applies the "range" formula we discussed earlier, and finally returns a <code>numeric</code> value that either has decimals (to the specified number of digits) or not. Using this function in our query, we can simplify creating formatted values for sample data, and it cleans up the SQL, making it easier to read and use.</p><pre><code class="language-sql">SELECT
  time,
  device_id,
  random_between(3,100, 4) AS cpu,
  random_between(28,83) AS temperature_c
FROM 
	generate_series(now() - interval '1 hour', now(), interval '1 minute') AS time, 
	generate_series(1,10,1) AS device_id;
</code></pre><p>This query provides the same formatted output, but now it's much easier to repeat the process.</p><h2 id="creating-more-realistic-text">Creating more realistic text</h2><p>What about text? So far, in both articles, we've only discussed how to generate numeric data. We all know, however, that time-series data often contain more than just numeric values. Let's turn to another common data type: text. </p><p>Time-series data often contains text values. When your schema contains log messages, item names, or other identifying information stored as text, we want to generate sample text that feels more realistic, even if it's random.</p><p>Let's consider the query used earlier that creates CPU and temperature data for a set of devices. If the devices were real, the data they create might contain an intermittent status message of varying length.</p><p>To figure out how to generate this random text, we will follow the same process as before, working directly in a stand-alone SQL query before moving our solution into a reusable function. After some initial attempts (and ample Googling), I came up with this example for producing random text of variable length using a defined character set. As with the <code>random_between()</code> function above, this can be modified to suit your needs. For instance, it would be fairly easy to get unique, random hexadecimal values by limiting the set of characters and lengths. </p><p><em>Let your creativity guide you.</em></p><pre><code class="language-sql">WITH symbols(characters) as (VALUES ('ABCDEFGHIJKLMNOPQRSTUVWXYZ abcdefghijklmnopqrstuvwxyz 0123456789 {}')),
w1 AS (
	SELECT string_agg(substr(characters, (random() * length(characters) + 1) :: INTEGER, 1), '') r_text, 'g1' AS idx
	FROM symbols,
generate_series(1,10) as word(chr_idx) -- word length
	GROUP BY idx)
SELECT
  time,
  device_id,
  random_between(3,100, 4) AS cpu,
  random_between(28,83) AS temperature_c,
  w1.r_text AS note
FROM w1, generate_series(now() - interval '1 hour', now(), interval '1 minute') AS time, 
	generate_series(1,10,1) AS device_id
ORDER BY 1,2;

time                         |device_id|cpu     |temperature_c|note      |
-----------------------------+---------+--------+-------------+----------+
2021-11-03 16:49:24.218 -0400|        1| 88.3525|           50|I}3U}FIsX9|
2021-11-03 16:49:24.218 -0400|        2| 29.5313|           53|I}3U}FIsX9|
2021-11-03 16:49:24.218 -0400|        3| 97.6065|           70|I}3U}FIsX9|
2021-11-03 16:49:24.218 -0400|        4| 96.2170|           40|I}3U}FIsX9|
2021-11-03 16:49:24.218 -0400|        5| 53.2318|           82|I}3U}FIsX9|
2021-11-03 16:49:24.218 -0400|        6| 73.7244|           56|I}3U}FIsX9|
</code></pre><p>In this case, it was easier to generate a random value inside of a CTE that we could reference later in the query. However, this approach has one problem that's pretty easy to spot in the first few rows of returned data. </p><p>While the CTE does create random text of 10 characters (go ahead and run it a few times to verify), the value of the CTE is generated once each time and then cached, repeating the same result over and over for every row. Once we transfer the query into a function, we expect to see a different value for each row.</p><p>For this second example function to generate "words" of random lengths (or no text at all in some cases), the user will need to provide an integer for the minimum and maximum length of the generated text. After some testing, we also added a simple randomizing feature. </p><p>Notice the IF...THEN condition that we added. Any time the generated number is divided by five and has a remainder of zero or one, the function will not return a text value. There is nothing special about this approach to providing randomness to the frequency of the output, so feel free to adjust this part of the function to suit your needs.</p><pre><code class="language-sql">/*
 * Function to create random text, of varying length
 */
CREATE OR REPLACE FUNCTION random_text(min_val INT=0, max_val INT=50) 
   RETURNS text AS
$$
DECLARE 
	word_length NUMERIC  = floor(random() * (max_val-min_val) + min_val)::INTEGER;
	random_word TEXT = '';
BEGIN
	-- only if the word length we get has a remainder after being divided by 5. This gives
	-- some randomness to when words are produced or not. Adjust for your tastes.
	IF(word_length % 5) &gt; 1 THEN
	SELECT a.agg_text INTO random_word FROM (
		WITH symbols(characters) AS (VALUES ('ABCDEFGHIJKLMNOPQRSTUVWXYZ abcdefghijklmnopqrstuvwxyz 0123456789 '))
		SELECT string_agg(substr(characters, (random() * length(characters) + 1) :: INTEGER, 1), '') AS agg_text
		FROM symbols
		JOIN generate_series(1,word_length) AS word(chr_idx) ON 1 = 1 -- word length
		) a;
	END IF;
	RETURN random_word;
END
$$ LANGUAGE 'plpgsql';
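
-- Quick sanity check (our addition, not in the original post): unlike the
-- cached CTE above, each call of the function should return different text
SELECT random_text(2,10) FROM generate_series(1,3);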
</code></pre><p>When we use this function to add random text to our sample time-series query, notice that the text is random in length (between 2 and 10 characters) and frequency.</p><pre><code class="language-sql">SELECT
  time,
  device_id,
  random_between(3,100, 4) AS cpu,
  random_between(28,83) AS temperature_c,
  random_text(2,10) AS note
FROM generate_series(now() - interval '1 hour', now(), interval '1 minute') AS time, 
	generate_series(1,10,1) AS device_id
ORDER BY 1,2;

time                         |device_id|cpu     |temperature_c|note     |
-----------------------------+---------+--------+-------------+---------+
2021-11-04 14:17:03.410 -0400|        1| 86.5780|           67|         |
2021-11-04 14:17:03.410 -0400|        2|  3.5370|           76|pCVBp AZ |
2021-11-04 14:17:03.410 -0400|        3| 59.7085|           28|kMrr     |
2021-11-04 14:17:03.410 -0400|        4| 69.6153|           46|3UdA     |
2021-11-04 14:17:03.410 -0400|        5| 33.0906|           56|d0sSUilx |
2021-11-04 14:17:03.410 -0400|        6| 44.2837|           74|         |
2021-11-04 14:17:03.410 -0400|        7| 14.2550|           81|TOgbHOU  |
</code></pre><p>Hopefully, you're starting to see a pattern. Using <code>generate_series()</code> and some custom functions can help you create time-series data of many shapes and sizes.</p><p>We've demonstrated ways to create more realistic numbers and text data because they are the primary data types used in time-series data. Are there any other data types included with time-series data that you might need to generate with your sample data? </p><p>What about JSON values?</p><h2 id="creating-sample-json">Creating sample JSON</h2><p><strong>Note:</strong> The sample queries below create JSON strings as the output with the intention that they would be inserted into a table for further testing and learning. In PostgreSQL, JSON string data can be stored in a JSON or JSONB column, each providing different features for querying and displaying the JSON data. In most circumstances, JSONB is the preferred column type because it provides more efficient storage and the ability to create indexes over the contents. The main downside is that the actual formatting of the JSON string, including the order of the keys and values, is not retained and may be difficult to reproduce exactly. To better understand when to store JSON string data in one column type over the other, please <a href="https://www.postgresql.org/docs/current/datatype-json.html">refer to the PostgreSQL documentation</a>.</p><p>PostgreSQL has supported JSON and JSONB data types for many years. With each major release, the feature set for working with JSON and overall query performance improves. In a growing number of data models, particularly when REST or Graph APIs are involved, storing extra metadata as a JSON document can be beneficial. 
The data is available if needed while facilitating efficient queries on serialized data stored in regular columns.</p><p>We used a design pattern similar to this in our <a href="https://docs.timescale.com/timescaledb/latest/tutorials/analyze-nft-data/">NFT Starter Kit</a>. The OpenSea JSON API used as the data source for the starter kit includes many properties and values for each asset and collection. A lot of the values weren't helpful for the specific analysis in that tutorial. However, we knew that some of the values in the JSON properties could be useful in future analysis, tutorials, or demonstrations. Therefore, we stored additional metadata about assets and collections in a JSONB field to query it if needed. Still, it didn't complicate the schema design for otherwise common data like <code>name</code> and <code>asset_id</code>.</p><p>Storing data in a JSON field is also a common practice in areas like IIoT device data. Engineers usually have an agreed-upon schema to store and query metrics produced by the device, followed by a "free form" JSON column that allows engineers to send error or diagnostic data that changes over time as hardware is modified or updated.</p><p>There are several approaches to add JSON data to our sample query. One added challenge is that JSON data includes both a key and a value, along with the possibility of numerous levels of child object nesting. The approach you take will depend on how complex you want the PostgreSQL function to be and the end goal of the sample data. In this example, we'll create a function that takes an array of keys for the JSON and generates random numerical values for each key without nesting. Generating the JSON string in SQL from our values is straightforward, thanks to built-in PostgreSQL functions for reading and writing JSON strings. 
🎉</p><p>As with the other examples in this post, we'll start by using a CTE to generate a random JSON document in a stand-alone SELECT query to verify that the result is what we want. Remember, we'll observe the same issue we had earlier when generating random text in the stand-alone query because we are using a CTE. The JSON is random every time the query runs, but the string is reused for all rows in the result set. CTEs are materialized once per query, whereas functions are called again for every row. Because of this, we won't observe random values in each row until we move the SQL into a function to reuse later.</p><pre><code class="language-sql">WITH random_json AS (
SELECT json_object_agg(key, random_between(1,10)) as json_data
    FROM unnest(array['a', 'b']) as u(key))
  SELECT json_data, generate_series(1,5) FROM random_json;

json_data       |generate_series|
----------------+---------------+
{"a": 6, "b": 2}|              1|
{"a": 6, "b": 2}|              2|
{"a": 6, "b": 2}|              3|
{"a": 6, "b": 2}|              4|
{"a": 6, "b": 2}|              5|
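
-- Aside (illustrative, our addition): casting the same string to json vs. jsonb
-- shows why jsonb may not preserve key order, as noted above:
--   SELECT '{"b":2, "a":1}'::json AS as_json, '{"b":2, "a":1}'::jsonb AS as_jsonb;
-- as_json returns the text verbatim; as_jsonb returns {"a": 1, "b": 2}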
</code></pre><p>We can see that the JSON data is created using our keys (['a','b']) with numbers between 1 and 10. We just have to create a function that will create random JSON data each time it is called. This function will always return a JSON document with numeric integer values for each key we provide for demonstration purposes. Feel free to enhance this function to return more complex documents with various data types if that's a requirement for you.</p><pre><code class="language-sql">CREATE OR REPLACE FUNCTION random_json(keys TEXT[]='{"a","b","c"}',min_val NUMERIC = 0, max_val NUMERIC = 10) 
   RETURNS JSON AS
$$
DECLARE 
	random_val INT = floor(random() * (max_val-min_val) + min_val)::INTEGER;
	random_json JSON = NULL;
BEGIN
	-- again, this adds some randomness into the results. Remove or modify if this
	-- isn't useful for your situation
	IF (random_val % 5) &gt; 1 THEN
		SELECT * INTO random_json FROM (
			SELECT json_object_agg(key, random_between(min_val,max_val)) as json_data
	    		FROM unnest(keys) as u(key)
		) json_val;
	END IF;
	RETURN random_json;
END
$$ LANGUAGE plpgsql;
</code></pre><p>With the <code>random_json()</code> function in place, we can test it in a few ways. First, we'll simply call the function directly without any parameters, which will return a JSON document with the default keys provided in the function definition ("a", "b", "c") and values from 0 to 10 (the default minimum and maximum value).</p><pre><code class="language-sql">SELECT random_json();

random_json             |
------------------------+
{"a": 7, "b": 3, "c": 8}|
</code></pre><p>Next, we'll join this to a small numeric set from <code>generate_series()</code>.</p><pre><code class="language-sql">SELECT device_id, random_json() FROM generate_series(1,5) device_id;

device_id|random_json              |
---------+-------------------------+
        1|{"a": 2, "b": 2, "c": 2} |
        2|                         |
        3|{"a": 10, "b": 7, "c": 1}|
        4|                         |
        5|{"a": 7, "b": 1, "c": 0} |
</code></pre><p>Notice two things with this example.</p><p>First, the data is different for each row, showing that the function gets called for each row and produces different numeric values each time. Second, because we kept the same random output mechanism from the <code>random_text()</code> example, not every row includes JSON.</p><p>Finally, let's add this into the sample query for generating device data that we've used throughout this article to see how to provide an array of keys ("building" and "rack") for the generated JSON data.</p><pre><code class="language-sql">SELECT
  time,
  device_id,
  random_between(3,100, 4) AS cpu,
  random_between(28,83) AS temperature_c,
  random_text(2,10) AS note,
  random_json(ARRAY['building','rack'],1,20) device_location
FROM generate_series(now() - interval '1 hour', now(), interval '1 minute') AS time, 
	generate_series(1,10,1) AS device_id
ORDER BY 1,2;


time                         |device_id|cpu     |temperature_c|note     |device_location             |
-----------------------------+---------+--------+-------------+---------+----------------------------+
2021-11-04 16:19:22.991 -0400|        1| 14.7614|           70|CTcX8 2s4|                            |
2021-11-04 16:19:22.991 -0400|        2| 62.2618|           81|x1V      |{"rack": 4, "building": 5}  |
2021-11-04 16:19:22.991 -0400|        3| 10.1214|           50|1PNb     |                            |
2021-11-04 16:19:22.991 -0400|        4| 96.3742|           29|aZpikXGe |{"rack": 12, "building": 4} |
2021-11-04 16:19:22.991 -0400|        5| 22.5327|           30|lM       |{"rack": 2, "building": 3}  |
2021-11-04 16:19:22.991 -0400|        6| 57.9773|           44|         |{"rack": 16, "building": 5} |
...
</code></pre><p>There are just so many possibilities for creating sample data with <code>generate_series()</code>, PostgreSQL functions, and some custom logic.</p><h2 id="putting-it-all-together">Putting it all together</h2><p>Let's put what we've learned into practice, using these three functions to create and insert ~1 million rows of data and then query it with the hyperfunctions <a href="https://docs.timescale.com/api/latest/hyperfunctions/time_bucket/"><code>time_bucket()</code></a>, <a href="https://docs.timescale.com/api/latest/hyperfunctions/time_bucket_ng/"><code>time_bucket_ng()</code></a>, <a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/approx_percentile/"><code>approx_percentile()</code></a> and <a href="https://docs.timescale.com/api/latest/hyperfunctions/time-weighted-averages/time_weight/"><code>time_weight()</code></a>. To do this, we'll create two tables: one will be a list of computer hosts and the second will be a <a href="https://www.tigerdata.com/blog/database-indexes-in-postgresql-and-timescale-cloud-your-questions-answered" rel="noreferrer">hypertable</a> that stores fake time-series data about the computers.</p><h3 id="step-1-create-the-schema-and-hypertable">Step 1: Create the schema and <a href="https://www.tigerdata.com/blog/database-indexes-in-postgresql-and-timescale-cloud-your-questions-answered" rel="noreferrer">hypertable</a><br></h3><pre><code class="language-sql">CREATE TABLE host (
	id int PRIMARY KEY,
	host_name TEXT,
	location JSONB
);

CREATE TABLE host_data (
	date timestamptz NOT NULL,
	host_id int NOT NULL,
	cpu DOUBLE PRECISION,
	tempc int,
	status TEXT	
);

SELECT create_hypertable('host_data','date');
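
-- Optional check (our addition): confirm the hypertable was created
SELECT hypertable_name FROM timescaledb_information.hypertables;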
</code></pre><h3 id="step-2-generate-and-insert-data">Step 2: Generate and insert data</h3><pre><code class="language-sql">-- Insert data to create fake hosts
INSERT INTO host
SELECT id, 'host_' || id::TEXT AS host_name, 
	random_json(ARRAY['building','rack'],1,20) AS location
FROM generate_series(1,100) AS id;


-- insert ~1.3 million records for the last 3 months
INSERT INTO host_data
SELECT date, host_id,
	random_between(5,100,3) AS cpu,
	random_between(28,90) AS tempc,
	random_text(20,75) AS status
FROM generate_series(now() - INTERVAL '3 months',now(), INTERVAL '10 minutes') AS date,
generate_series(1,100) AS host_id;
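
-- Sanity check (our addition): 3 months of 10-minute intervals across
-- 100 hosts should produce roughly 1.3 million rows
SELECT count(*) FROM host_data;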
</code></pre><h3 id="step-3-query-data-using-timebucket-and-timebucketng">Step 3: Query data using <code>time_bucket()</code> and <code>time_bucket_ng()</code></h3><pre><code class="language-sql">-- Using time_bucket(), query the average CPU and max tempc
SELECT time_bucket('7 days', date) AS bucket, host_name,
	avg(cpu),
	max(tempc)
FROM host_data
JOIN host ON host_data.host_id = host.id
WHERE date &gt; now() - INTERVAL '1 month'
GROUP BY 1,2
ORDER BY 1 DESC, 2;


-- try the experimental time_bucket_ng() to query data in month buckets
SELECT timescaledb_experimental.time_bucket_ng('1 month', date) AS bucket, host_name,
	avg(cpu) avg_cpu,
	max(tempc) max_temp
FROM host_data
JOIN host ON host_data.host_id = host.id
WHERE date &gt; now() - INTERVAL '3 month'
GROUP BY 1,2
ORDER BY 1 DESC, 2;
</code></pre><h3 id="step-4-query-data-using-toolkit-hyperfunctions">Step 4: Query data using toolkit hyperfunctions</h3><pre><code class="language-sql">-- query all hosts in building 10 in 7-day buckets
-- also try the new percentile approximation function to 
-- get the p75 of data for each 7-day period
SELECT time_bucket('7 days', date) AS bucket, host_name,
	avg(cpu),
	approx_percentile(0.75,percentile_agg(cpu)) p75,
	max(tempc)
FROM host_data
JOIN host ON host_data.host_id = host.id
WHERE date &gt; now() - INTERVAL '1 month'
	AND location -&gt; 'building' = '10'
GROUP BY 1, 2
ORDER BY 1 DESC, 2;



-- To test time-weighted averages, we need to simulate missing
-- some data points in our host_data table. To do this, we'll
-- randomly select ~10% of the rows, and then delete them from the
-- host_data table.
WITH random_delete AS (SELECT date, host_id FROM host_data
	 JOIN host ON host_id = id WHERE 
	date &gt; now() - INTERVAL '2 weeks'
	ORDER BY random() LIMIT 20000
)
DELETE FROM host_data hd
USING random_delete rd
WHERE hd.date = rd.date
AND hd.host_id = rd.host_id;
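
-- Optional check (our addition): the two-week window should now hold roughly
-- 20,000 fewer rows than the ~200,000 originally generated for that period
SELECT count(*) FROM host_data WHERE date &gt; now() - INTERVAL '2 weeks';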


-- Select the daily time-weighted average and regular average
-- of each host for building 10 for the last two weeks.
-- Notice the variation in the two numbers because of the missing data.
SELECT time_bucket('1 day',date) AS bucket,
	host_name,
	average(time_weight('LOCF',date,cpu)) weighted_avg,
	avg(cpu) 
FROM host_data
	JOIN host ON host_data.host_id = host.id
WHERE location -&gt; 'building' = '10'
AND date &gt; now() - INTERVAL '2 weeks'
GROUP BY 1,2
ORDER BY 1 DESC, 2;
</code></pre><p>In a few lines of SQL, we created 1.3 million rows of data and were able to test four different functions in TimescaleDB, all without relying on any external source. 💪</p><p>Still, you may notice one last issue with the values in our <code>host_data</code> table (even though the values are now more realistic in nature). By using <code>random()</code> as the basis for our queries, the calculated numeric values all tend to be uniformly distributed within the specified range, which causes the average of the values to always be near the median. This makes sense statistically, but it highlights one more area where the data we generate could be improved. In the third post of this series, we'll demonstrate a few ways to influence the generated values to provide shape to the data (and even some outliers if we need them).</p>
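<p>To see this uniformity for yourself (a quick check we've added, assuming the <code>host_data</code> table from above), compare the mean and median of the generated CPU values; with <code>random()</code>-based data, the two land very close together:</p><pre><code class="language-sql">-- with uniformly distributed values, avg() and the median nearly coincide
SELECT avg(cpu) AS mean,
	percentile_cont(0.5) WITHIN GROUP (ORDER BY cpu) AS median
FROM host_data;
</code></pre>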
<h2 id="reviewing-our-progress">Reviewing our progress</h2><p>When using a database like TimescaleDB or testing features in PostgreSQL, the ability to generate a representative dataset is a beneficial tool to have in your SQL toolbelt. </p><p><a href="https://timescale.ghost.io/blog/blog/how-to-create-lots-of-sample-time-series-data-with-postgresql-generate_series/">In the first post</a>, we learned how to generate lots of data by combining the result sets of multiple <code>generate_series()</code> functions. Using the implicit <code>CROSS JOIN</code>, the total number of rows in the final output is the product of the sizes of the joined sets. When one of the data sets contains timestamps, the output can be used to create time-series data for testing and querying.</p>
<p>The problem with our initial examples was that the actual values we generated were uniformly random, offered no control over precision, and were all numeric. So in this second post, we demonstrated how to format the numeric data for a given column and generate random data of other types, like text and JSON documents. We also added an example in the text and JSON functions that created randomness in how often the values were emitted for each of those columns.</p><p>Again, all of these are building-block examples for you to use, creating functions that generate the kind of data you need to test.</p><p>To see some of these examples in action, watch my video on creating realistic sample data:</p><figure class="kg-card kg-embed-card"><iframe width="200" height="113" src="https://www.youtube.com/embed/iKBH_p327vw?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></figure><p>In <a href="https://timescale.ghost.io/blog/blog/how-to-shape-sample-data-with-postgresql-generate_series-and-sql/">part 3</a> of this series, we will demonstrate how to add shape and trends to your sample time-series data (e.g., increasing web traffic over time and quarterly sales cycles) using the formatting functions in this post in conjunction with relational lookup tables and additional mathematical functions. 
Knowing how to manipulate the pattern of generated data is particularly useful for visualizing time-series data and learning analytical PostgreSQL or TimescaleDB functions.</p><p>If you have questions about using <a href="https://www.postgresql.org/docs/current/functions-srf.html"><code>generate_series()</code></a> or have any questions about TimescaleDB, please <a href="https://slack.timescale.com">join our community Slack channel</a>, where you'll find an active community and a handful of the Timescale team most days.</p><p>If you want to try creating larger sets of sample time-series data using <code>generate_series()</code> and see how the exciting features of TimescaleDB work, <a href="https://www.timescale.com/timescale-signup">sign up for a free 30-day trial</a> or <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/install-timescaledb/self-hosted/">install and manage it on your instances</a>. (You can also learn more by <a href="https://docs.timescale.com/timescaledb/latest/tutorials/">following one of our many tutorials</a>.)</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[What Is ClickHouse and How Does It Compare to PostgreSQL and TimescaleDB for Time Series?]]></title>
            <description><![CDATA[A detailed benchmark comparing the TimescaleDB and ClickHouse ingest speeds, disk space, and query response times.]]></description>
            <link>https://www.tigerdata.com/blog/what-is-clickhouse-how-does-it-compare-to-postgresql-and-timescaledb-and-how-does-it-perform-for-time-series-data</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/what-is-clickhouse-how-does-it-compare-to-postgresql-and-timescaledb-and-how-does-it-perform-for-time-series-data</guid>
            <category><![CDATA[Engineering]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Benchmarks & Comparisons]]></category>
            <dc:creator><![CDATA[Ryan Booz]]></dc:creator>
            <pubDate>Thu, 21 Oct 2021 13:43:23 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2023/09/Timescale-vs-clickhouse-hero.png">
            </media:content>
            <content:encoded><![CDATA[<p>Over the past year, one database we keep hearing about is ClickHouse, a column-oriented OLAP database initially built and open-sourced by Yandex. </p><p>In this detailed post, which is the culmination of three months of research and analysis, we answer the most common questions we hear, including:</p><ul><li>What is ClickHouse (including a deep dive into its architecture)</li><li>How does ClickHouse compare to PostgreSQL</li><li>How does ClickHouse compare to TimescaleDB</li><li>How does ClickHouse perform for time-series data vs. TimescaleDB</li></ul><p>At Timescale, we take our benchmarks very seriously. We find that, in our industry, there is far too much vendor-biased “benchmarketing” and not enough honest “benchmarking.” We believe developers deserve better. So we take great pains to really understand the technologies we are comparing against—and also to point out places where the other technology shines (and where TimescaleDB may fall short). </p><p>You can see this in our other detailed benchmarks vs. <a href="https://www.timescale.com/blog/timescaledb-vs-amazon-timestream-6000x-higher-inserts-175x-faster-queries-220x-cheaper/" rel="noreferrer">AWS Timestream</a> (29-minute read), <a href="https://www.timescale.com/blog/how-to-store-time-series-data-mongodb-vs-timescaledb-postgresql-a73939734016/" rel="noreferrer">MongoDB</a> (19-minute read), and <a href="https://www.timescale.com/blog/timescaledb-vs-influxdb-for-time-series-data-timescale-influx-sql-nosql-36489299877" rel="noreferrer">InfluxDB</a> (26-minute read).</p><p>We’re also database nerds at heart who really enjoy learning about and digging into other systems. 
(Which are a few reasons why these posts—including this one—are so long!)</p><p>So, to better understand the strengths and weaknesses of ClickHouse, we spent the last three months and hundreds of hours benchmarking, testing, reading documentation, and working with contributors.</p><p><em>Shout-out to Timescale engineers Alexander Kuzmenkov, who was most recently a core developer on ClickHouse, and Aleksander Alekseev, who is also a PostgreSQL contributor, who helped check our work and keep us honest with this post.</em></p><h2 id="how-clickhouse-fared-in-our-tests">How ClickHouse Fared in Our Tests</h2><p>ClickHouse is a very impressive piece of technology. In some tests, ClickHouse proved to be a blazing-fast database, able to ingest data faster than anything else we’ve tested so far (including TimescaleDB). In some complex queries, particularly those that do complex grouping aggregations, ClickHouse is hard to beat.</p><p>But nothing in databases comes for free. ClickHouse achieves these results because its developers have made specific architectural decisions. 
These architectural decisions also introduce limitations, especially when compared to PostgreSQL and TimescaleDB.</p><h3 id="clickhouse%E2%80%99s-limitationsweaknesses-include">ClickHouse’s limitations/weaknesses include:</h3><ul><li><strong>Worse query performance than TimescaleDB at nearly all queries</strong> in the <a href="https://github.com/timescale/tsbs#:~:text=The%20Time%20Series%20Benchmark%20Suite,write%20performance%20of%20various%20databases.">Time-Series Benchmark Suite</a>, except for complex aggregations.</li><li><strong>Poor inserts and much higher disk usage</strong> (e.g., 2.7x higher disk usage than TimescaleDB) at small batch sizes (e.g., 100-300 rows/batch).</li><li><strong>Non-standard SQL-like query language</strong> with several limitations (e.g., joins are discouraged, syntax is at times non-standard).</li><li><strong>Lack of other features</strong> one would expect in a robust SQL database (e.g., PostgreSQL or TimescaleDB): no transactions, no correlated sub-queries, no stored procedures, no user-defined functions, no index management beyond primary and secondary indexes, no triggers.</li><li><strong>Inability to modify or delete data at a high rate and low latency—</strong>instead have to batch deletes and updates.</li><li><strong>Batch deletes and updates happen asynchronously.</strong></li><li>Because data modification is asynchronous, <strong>ensuring consistent backups is difficult</strong>: the only way to ensure a consistent backup is to stop all writes to the database.</li><li><strong>Lack of transactions and lack of data consistency also affect other features like materialized views</strong> because the server can't atomically update multiple tables at once. If something breaks during a multi-part insert to a table with materialized views, the end result is an inconsistent state of your data.</li></ul><p>We list these shortcomings not because we think ClickHouse is a bad database. 
We actually think it’s a great database—well, to be more precise, a great database <em>for certain workloads</em>. And as a developer, you have to choose the right tool for <em>your workload</em>.</p><h3 id="why-does-clickhouse-fare-well-in-certain-cases-but-worse-in-others">Why does ClickHouse fare well in certain cases but worse in others?</h3><p>The answer is the <em>underlying architecture</em>.</p><p>Generally, in databases, there are two types of fundamental architectures, each with strengths and weaknesses: OnLine Transactional Processing (OLTP) and OnLine Analytical Processing (OLAP).</p><table>
<thead>
<tr>
<th style="text-align:center">OLTP</th>
<th style="text-align:center">OLAP</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:center">Large and small datasets</td>
<td style="text-align:center">Large datasets focused on reporting/analysis</td>
</tr>
<tr>
<td style="text-align:center">Transactional data (the raw, individual records matter)</td>
<td style="text-align:center">Pre-aggregated or transformed data to foster better reporting</td>
</tr>
<tr>
<td style="text-align:center">Many users performing varied queries and updates on data across the system</td>
<td style="text-align:center">Fewer users performing deep data analysis with few updates</td>
</tr>
<tr>
<td style="text-align:center">SQL is the primary language for interaction</td>
<td style="text-align:center">Often, but not always, utilizes a particular query language other than SQL</td>
</tr>
</tbody>
</table>
<h2 id="clickhouse-postgresql-and-timescaledb-architectures">ClickHouse, PostgreSQL, and TimescaleDB architectures</h2><p>At a high level, ClickHouse is an excellent OLAP database designed for systems of analysis. </p><p>PostgreSQL, by comparison, is a general-purpose database designed to be a versatile and reliable OLTP database for systems of record with high user engagement. </p><p>TimescaleDB is a relational database for time-series: purpose-built on PostgreSQL for time-series workloads. It combines the best of PostgreSQL plus new capabilities that increase performance, reduce cost, and provide an overall better developer experience for time series.</p><p>So, if you find yourself needing to perform fast analytical queries on mostly immutable large datasets with few users, i.e., OLAP, ClickHouse may be the better choice.</p><p>Instead, if you find yourself needing something more versatile that works well for powering applications with many users and likely frequent updates/deletes, i.e., OLTP, PostgreSQL may be the better choice.</p><p>And if your applications have time-series data—and especially if you also want the versatility of PostgreSQL—TimescaleDB is likely the best choice.</p><h2 id="time-series-benchmark-suite-results-summary-timescaledb-vs-clickhouse">Time-Series Benchmark Suite results summary (TimescaleDB vs. ClickHouse)</h2><p>We can see the impact of these architectural decisions on how TimescaleDB and ClickHouse fare with time-series workloads. </p><p>We spent hundreds of hours working with ClickHouse and TimescaleDB during this benchmark research. We tested insert loads from 100 million rows (1 billion metrics) to 1 billion rows (10 billion metrics), cardinalities from 100 to 10 million, and numerous combinations in between. We really wanted to understand how each database works across various datasets.</p><p>Overall, for inserts we find that ClickHouse outperforms on inserts with large batch sizes—but underperforms with smaller batch sizes. 
For queries, we find that ClickHouse underperforms on most queries in the benchmark suite, except for complex aggregates. </p><h3 id="insert-performance">Insert performance</h3><p>When rows are batched between 5,000 and 15,000 rows per insert, speeds are fast for both databases, with ClickHouse performing noticeably better:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2023/09/01-timescale-vs-clickhouse.png" class="kg-image" alt="Insert comparison between ClickHouse and TimescaleDB at cardinalities between 100 and 1 million hosts" loading="lazy" width="1842" height="1288" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2023/09/01-timescale-vs-clickhouse.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2023/09/01-timescale-vs-clickhouse.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2023/09/01-timescale-vs-clickhouse.png 1600w, https://timescale.ghost.io/blog/content/images/2023/09/01-timescale-vs-clickhouse.png 1842w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Performance comparison: ClickHouse outperforms TimescaleDB at all cardinalities when batch sizes are 5,000 rows or greater</span></figcaption></figure><p>However, when the batch size is smaller, the results are reversed in two ways: insert speed and disk consumption. With larger batches of 5,000 rows/batch, ClickHouse consumed ~16GB of disk during the test, while TimescaleDB consumed ~19GB (both before compression).</p><p>With smaller batch sizes, not only does TimescaleDB maintain steady insert speeds that are faster than ClickHouse between 100-300 rows/batch, but ClickHouse's disk usage is also 2.7x higher. 
This difference should be expected because of the architectural design choices of each database, but it's still interesting to see.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2023/09/02-small-batch-insert-performance.png" class="kg-image" alt="Insert comparison of TimescaleDB and ClickHouse with small batch sizes. TimescaleDB outperforms and uses 2.7x less disk space." loading="lazy" width="1842" height="1358" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2023/09/02-small-batch-insert-performance.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2023/09/02-small-batch-insert-performance.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2023/09/02-small-batch-insert-performance.png 1600w, https://timescale.ghost.io/blog/content/images/2023/09/02-small-batch-insert-performance.png 1842w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Performance comparison: Timescale outperforms ClickHouse with smaller batch sizes and uses 2.7x less disk space</span></figcaption></figure><h3 id="query-performance">Query performance</h3><p>For testing query performance, we used a "standard" dataset that queries data for 4,000 hosts over a three-day period, with a total of 100 million rows. In our experience running benchmarks in the past, we found that this cardinality and row count works well as a representative dataset for benchmarking because it allows us to run many ingest and query cycles across each database in a few hours. </p><p>Based on ClickHouse’s reputation as a fast OLAP database, we expected ClickHouse to outperform TimescaleDB for nearly all queries in the benchmark.</p><p>When we ran TimescaleDB without compression, ClickHouse did outperform. 
</p><p>However, when we enabled TimescaleDB compression—which is the recommended approach—we found the opposite, with TimescaleDB outperforming nearly across the board:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2023/09/03-query-latency.png" class="kg-image" alt="Bar chart displaying results of query response between TimescaleDB and ClickHouse. TimescaleDB outperforms in almost every query category." loading="lazy" width="2000" height="1584" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2023/09/03-query-latency.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2023/09/03-query-latency.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2023/09/03-query-latency.png 1600w, https://timescale.ghost.io/blog/content/images/2023/09/03-query-latency.png 2354w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Results of query benchmarking between TimescaleDB and ClickHouse. TimescaleDB outperforms in almost every query category</span></figcaption></figure><p>(For those that want to replicate our findings or better understand why ClickHouse and TimescaleDB perform the way they do under different circumstances, please read the entire article for the full details.)</p><h3 id="cars-vs-bulldozers">Cars vs. bulldozers</h3><p>Today, we live in the golden age of databases: there are so many databases that all these lines (OLTP/OLAP/time-series/etc.) are blurring. Yet every database is architected differently and, as a result, has different advantages and disadvantages. As a developer, you should choose the right tool for the job. </p><p>After spending lots of time with ClickHouse, reading their docs, and working through weeks of benchmarks, we found ourselves repeating this simple analogy:</p><p><em>ClickHouse is like a bulldozer - very efficient and performant for a specific use case. 
PostgreSQL (and TimescaleDB) is like a car: versatile, reliable, and useful in most situations you will face in your life. </em></p><p>Most of the time, a “car” will satisfy your needs. But if you find yourself doing a lot of “construction”, by all means, get a “bulldozer.”</p><p>We aren’t the only ones who feel this way. Here is a similar opinion shared on <a href="https://news.ycombinator.com/item?id=28596718">HackerNews</a> by <em>stingraycharles </em>(whom we don’t know, but <em>stingraycharles</em> if you are reading this—we love your username):</p>
<!--kg-card-begin: html-->
<div class="quote-block">
    <p class="quote-block__text">
        "TimescaleDB has a great time-series story and an average data warehousing story; Clickhouse has a great data warehousing story, an average time-series story, and a bit meh clustering story (YMMV)."
    </p>   
</div>
<!--kg-card-end: html-->
<p>In the rest of this article, we do a deep dive into the ClickHouse architecture and then highlight some of the advantages and disadvantages of ClickHouse, PostgreSQL, and TimescaleDB that result from the architectural decisions that the developers of each database (including us) have made. </p><p>We conclude with a more detailed time-series benchmark analysis. We also have a detailed description of our testing environment to replicate these tests yourself and verify our results.</p><p>Yes, we’re the makers of TimescaleDB, so you may not trust our analysis. If so, we ask you to hold your skepticism for the next few minutes and give the rest of this article a read. </p><p>As you (hopefully) will see, we spent a lot of time understanding ClickHouse for this comparison: first, to make sure we were conducting the benchmark the right way so that we were fair to ClickHouse, but also because we are database nerds at heart and were genuinely curious to learn how ClickHouse was built. </p><h3 id="next-steps">Next steps</h3><p>Are you curious about TimescaleDB? The easiest way to get started is by <a href="https://console.cloud.timescale.com/signup" rel="noreferrer">creating a free Timescale Cloud account</a>, which will give you access to a fully managed TimescaleDB instance (100% free for 30 days).</p><p>If you want to host TimescaleDB yourself, you can do it completely for free: <a href="https://github.com/timescale/timescaledb">visit our GitHub</a> to learn more about options, get installation instructions, and more (⭐️ are very much appreciated! 🙏)</p><p>One last thing: <a href="http://slack.timescale.com">join our Community Slack</a> to ask questions, get advice, and connect with other developers (we are 7,000+ and counting!). 
We, the authors of this post, are very active on all channels—as well as all our engineers, members of Team Timescale, and many passionate users.</p><h2 id="what-is-clickhouse">What Is ClickHouse?</h2><p>ClickHouse, short for “Clickstream Data Warehouse”, is a columnar OLAP database that was initially built for web analytics in Yandex Metrica. Generally, ClickHouse is known for its high insert rates, fast analytical queries, and SQL-like dialect.</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2021/10/04-timeline--1-.jpg" class="kg-image" alt="Timeline of ClickHouse development from 2008 to 2020" loading="lazy" width="2000" height="903" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2021/10/04-timeline--1-.jpg 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2021/10/04-timeline--1-.jpg 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2021/10/04-timeline--1-.jpg 1600w, https://timescale.ghost.io/blog/content/images/2021/10/04-timeline--1-.jpg 2000w" sizes="(min-width: 720px) 720px"></figure><p><em>Timeline of ClickHouse development (</em><a href="https://clickhouse.com/blog/en/2020/the-clickhouse-community/"><em>Full history here.</em></a><em>)</em></p><p>We are fans of ClickHouse. It is a very good database built around certain architectural decisions that make it a good option for OLAP-style analytical queries. In particular, in our benchmarking with the Time Series Benchmark Suite (TSBS),<strong> ClickHouse performed better for data ingestion than any time-series database we've tested so far</strong> (TimescaleDB included) at an average of more than 600k rows/second on a single instance <em>when rows are batched appropriately</em>. </p><p>But nothing in databases comes for free - and as we’ll show below, this architecture also creates significant limitations for ClickHouse, making it slower for many types of time-series queries and some insert workloads. 
</p><p>If your application doesn't fit within the architectural boundaries of ClickHouse (or TimescaleDB, for that matter), you'll probably end up with a frustrating development experience, redoing a lot of work down the road.</p><h2 id="the-clickhouse-architecture">The ClickHouse Architecture</h2><p>ClickHouse was designed for OLAP workloads, which have specific characteristics. From the <a href="https://clickhouse.com/docs/en/">ClickHouse documentation</a>, here are some of the requirements for this type of workload: </p><ul><li>The vast majority of requests are for read access.</li><li>Data is inserted in fairly large batches (&gt; 1000 rows), not by single rows, or it is not updated at all.</li><li>Data is added to the DB but is not modified.</li><li>For reads, quite a large number of rows are processed from the DB, but only a small subset of columns.</li><li>Tables are “wide,” meaning they contain a large number of columns.</li><li>Queries are relatively rare (usually hundreds of queries per server or less per second).</li><li>For simple queries, latencies of around 50 ms are allowed.</li><li>Column values are fairly small: numbers and short strings (for example, 60 bytes per URL).</li><li>Requires high throughput when processing a single query (up to billions of rows per second per server).</li><li>Transactions are not necessary.</li><li>Low requirements for data consistency.</li><li>There is one large table per query. All tables are small, except for one.</li><li>A query result is significantly smaller than the source data. In other words, data is filtered or aggregated so the result fits in a single server’s RAM.</li></ul><p>How is ClickHouse designed for these workloads? 
Here are some of the key aspects of their architecture: </p><ul><li>Compressed, column-oriented storage</li><li>Table Engines</li><li>Indexes</li><li>Vector Computation Engine</li></ul><h3 id="compressed-column-oriented-storage"><strong>Compressed, column-oriented storage</strong></h3><p>First, ClickHouse (like nearly all OLAP databases) is column-oriented (or columnar), meaning that data for the same table column is stored together. (In contrast, in row-oriented storage, used by nearly all OLTP databases, data for the same table row is stored together.)</p><p>Column-oriented storage has a few advantages:</p><ul><li>If your query only needs to read a few columns, then reading that data is much faster (you don’t need to read entire rows, just the columns)</li><li>Storing columns of the same data type together leads to greater compressibility (although, as we have shown, it is possible to build <a href="https://www.timescale.com/blog/building-columnar-compression-in-a-row-oriented-database" rel="noreferrer">columnar compression into row-oriented storage</a>).</li></ul><h3 id="table-engines">Table engines</h3><p>To improve the storage and processing of data in ClickHouse, columnar data storage is implemented using a collection of table "engines". The table engine determines the type of table and the features that will be available for processing the data stored inside. </p><p>ClickHouse primarily uses the <strong>MergeTree table engine</strong> as the basis for how data is written and combined. 
Nearly all other table engines derive from MergeTree and allow additional functionality to be performed automatically as the data is (later) processed for long-term storage.</p><p>(Quick clarification: From this point forward, whenever we mention MergeTree, we're referring to the overall MergeTree architecture design and all table types that derive from it unless we name a specific MergeTree type.)</p><p>At a high level, MergeTree allows data to be written and stored very quickly to multiple immutable files (called "parts" by ClickHouse). These files are later processed in the background and merged into a larger <strong>part</strong> with the goal of reducing the total number of <strong>parts</strong> on disk (fewer files = more efficient data reads later). This is one of the key reasons behind ClickHouse’s astonishingly high insert performance on large batches.</p><p>All columns in a table are stored in separate <strong>parts</strong> (files), and all values in each column are stored in the order of the primary key. This column separation and sorting implementation makes future data retrieval more efficient, particularly when computing aggregates on large ranges of contiguous data.</p><h3 id="indexes">Indexes</h3><p>Once the data is stored and merged into the most efficient set of <strong>parts</strong> for each column, queries need to know how to efficiently find the data. For this, ClickHouse relies on two types of indexes: the primary index and, additionally, a secondary (data skipping) index.</p><p>Unlike a traditional OLTP B-tree index, which knows how to locate any row in a table, the ClickHouse primary index is sparse, meaning that it does not have a pointer to the location of every value for the primary index. Instead, because all data is stored in primary key order, <strong>the primary index stores the value of the primary key in every N-th row</strong> (called index_granularity, 8192 by default). 
This is done with the specific design goal of<strong> fitting the primary index into memory </strong>for extremely fast processing.</p><p>When your query patterns fit with this index style, the sparse nature can help improve query speed significantly. The one limitation is that you cannot create other indexes on specific columns to help improve a different query pattern. We'll discuss this more later.</p><h3 id="vector-computation-engine">Vector computation engine</h3><p>ClickHouse was designed with the desire to have "online" query processing in a way that other OLAP databases hadn't been able to achieve. Even with compression and columnar data storage, most other OLAP databases still rely on incremental processing to pre-compute aggregated data. It has generally been the pre-aggregated data that's provided the speed and reporting capabilities.</p><p>To overcome these limitations, ClickHouse implemented a series of vector algorithms for working with large arrays of data on a column-by-column basis. With vectorized computation, ClickHouse can specifically work with data in blocks of tens of thousands of rows (per column) for many computations. Vectorized computing also provides an opportunity to write more efficient code that utilizes modern SIMD processors and keeps code and data closer together for better memory access patterns, too.</p><p>In total, this is a great feature for working with large data sets and writing complex queries on a limited set of columns and something TimescaleDB could benefit from as we explore more opportunities to utilize columnar data.</p><p>That said, as you'll see from the benchmark results, enabling compression in TimescaleDB (which converts data into compressed columnar storage) improves the query performance of many aggregate queries in ways that are even better than ClickHouse.</p><h3 id="clickhouse-architectures-disadvantages-aka-nothing-comes-for-free">ClickHouse architecture's disadvantages (a.k.a. 
nothing comes for free)</h3><p>Nothing comes for free in database architectures. Clearly, ClickHouse is designed with a very specific workload in mind. Similarly, it is <em>not designed</em> for other types of workloads. </p><p>We can see an initial set of disadvantages from the <a href="https://clickhouse.com/docs/en/introduction/distinctive-features/">ClickHouse docs</a>:</p><ul><li>No full-fledged transactions.</li><li>Lack of ability to modify or delete already inserted data with a high rate and low latency. There are batch deletes and updates available to clean up or modify data, for example, to comply with GDPR, but not for regular workloads.</li><li>The sparse index makes ClickHouse not so efficient for point queries retrieving single rows by their keys.</li></ul><p>There are a few disadvantages that are worth exploring in more detail:</p><ul><li>Data can’t be directly modified in a table</li><li>Some “synchronous” actions aren’t really synchronous</li><li>SQL-like, but not quite SQL</li><li>No data consistency in backups</li></ul><h3 id="mergetree-limitation-data-can%E2%80%99t-be-directly-modified-in-a-table">MergeTree limitation: Data can’t be directly modified in a table</h3><p>All tables in ClickHouse are immutable. There is no way to directly update or delete a value that's already been stored. Instead, any operations that <code>UPDATE</code> or <code>DELETE</code> data can only be accomplished through an <code>ALTER TABLE</code> statement that applies a filter and actually re-writes the entire table (<strong>part</strong> by <strong>part</strong>) in the background to update or delete the data in question. Essentially, it's just another merge operation with some filters applied. </p><p>As a result, several MergeTree table engines exist to address this deficiency—to handle common scenarios where frequent data modifications would otherwise be necessary. 
Yet, this can lead to unexpected behavior and non-standard queries.</p><p>For example, if you need to store only the most recent reading of a value, creating a <a href="https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/collapsingmergetree/">CollapsingMergeTree table type</a> is your best option. With this table type, an additional column (called <code>Sign</code>) is added to the table, which indicates which row is the current state of an item when all other field values match. ClickHouse will then asynchronously delete rows with a <code>Sign</code> that cancel each other out (a value of 1 vs -1), leaving the most recent state in the database.</p><p>For example, consider a common database design pattern where the most recent values of a sensor are stored alongside the long-term time-series table for fast lookup. We'll call this table <strong>SensorLastReading</strong>. In ClickHouse, this table would require the following pattern to store the most recent value every time new information is stored in the database.</p><p><strong>SensorLastReading</strong></p>
<table>
<thead>
<tr>
<th>SensorID</th>
<th>Temp</th>
<th>Cpu</th>
<th>Sign</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>55</td>
<td>78</td>
<td>1</td>
</tr>
</tbody>
</table>
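<p>To make the pattern concrete, a table like this could be declared with the CollapsingMergeTree engine. The sketch below is illustrative (the column types are our assumptions, not from the original schema); the engine is parameterized by the name of the sign column, which must be an <code>Int8</code>:</p><pre><code class="language-sql">-- Hypothetical definition for the SensorLastReading example.
-- CollapsingMergeTree takes the name of the sign column, which
-- holds 1 for a "state" row and -1 for a "cancel" row.
CREATE TABLE SensorLastReading (
    SensorID Int32,
    Temp Int32,
    Cpu Int32,
    Sign Int8
)
ENGINE = CollapsingMergeTree(Sign)
ORDER BY SensorID;
</code></pre>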
<p>When new data is received, you need to add <strong>2 more rows</strong> to the table, one to negate the old value and one to replace it.</p><table>
<thead>
<tr>
<th>SensorID</th>
<th>Temp</th>
<th>Cpu</th>
<th>Sign</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>55</td>
<td>78</td>
<td>1</td>
</tr>
<tr>
<td>1</td>
<td>55</td>
<td>78</td>
<td>-1</td>
</tr>
<tr>
<td>1</td>
<td>40</td>
<td>35</td>
<td>1</td>
</tr>
</tbody>
</table>
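<p>In SQL, this amounts to inserting a cancel row that mirrors the previous state (with <code>Sign = -1</code>) alongside the new state row, ideally in the same batch so ClickHouse can collapse them within a single part (a sketch using the sample values above):</p><pre><code class="language-sql">-- Negate the old reading and record the new one in one batch.
INSERT INTO SensorLastReading (SensorID, Temp, Cpu, Sign) VALUES
    (1, 55, 78, -1),  -- cancels the previous state row
    (1, 40, 35, 1);   -- the new current state
</code></pre>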
<p>At some point after this insert, ClickHouse will merge the changes, removing the two rows that cancel each other out on Sign, leaving the table with just this row:</p><table>
<thead>
<tr>
<th>SensorID</th>
<th>Temp</th>
<th>Cpu</th>
<th>Sign</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>40</td>
<td>35</td>
<td>1</td>
</tr>
</tbody>
</table>
<p>But remember, <strong>MergeTree operations are asynchronous,</strong> so queries can occur on data before something like the collapse operation has been performed. Therefore, the queries to get data out of a CollapsingMergeTree table require additional work, like multiplying rows by their <code>Sign</code>, to make sure you get the correct value any time the table is in a state that still contains duplicate data. </p><p>Here is one solution that the ClickHouse documentation provides, modified for our sample data. Notice that with numerical values, you can get the "correct" answer by multiplying all values by the <code>Sign</code> column and adding a HAVING clause.</p><pre><code class="language-sql">SELECT
    SensorID,
    sum(Temp * Sign) AS Temp,
    sum(Cpu * Sign) AS Cpu
FROM SensorLastReading
GROUP BY SensorID
HAVING sum(Sign) &gt; 0
</code></pre>
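<p>Alternatively, ClickHouse documents a <code>FINAL</code> modifier that applies the collapse at query time, so you can skip the <code>Sign</code> arithmetic, at the cost of merging parts on the fly, which makes the query slower (a sketch against the same sample table):</p><pre><code class="language-sql">-- FINAL forces the collapse before results are returned,
-- trading query speed for simplicity.
SELECT SensorID, Temp, Cpu
FROM SensorLastReading FINAL;
</code></pre>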
<p>Again, the value here is that MergeTree tables provide really fast ingestion of data at the expense of transactions and simple concepts like UPDATE and DELETE in the way traditional applications would try to use a table like this. With ClickHouse, it's just more work to manage this kind of data workflow. </p><p>Because ClickHouse isn't an ACID database, these background modifications (or really any data manipulations) have no guarantees of ever being completed. Because there is no such thing as transaction isolation, any SELECT query that touches data in the middle of an UPDATE or DELETE modification (or a Collapse modification, as we noted above) will get whatever data is <strong><em>currently</em></strong> in each part. If the delete process, for instance, has only modified 50% of the parts for a column, queries would return outdated data from the remaining parts that have not yet been processed.</p><p>More importantly, <strong>this holds true for all data that is stored in ClickHouse, </strong>not just the large, analytical-focused tables that store something like time-series data, <em>but also the related metadata</em>. </p><p>While it's understandable that time-series data, for example, is often insert-only (and rarely updated), business-centric metadata tables almost always have modifications and updates as time passes. </p><p>Regardless, the related business data that you may store in ClickHouse to do complex joins and deeper analysis is still in a MergeTree table (or variation of a MergeTree), and therefore, updates or deletes would still require an entire rewrite (through the use of <code>ALTER TABLE</code>) any time there are modifications.</p><h3 id="distributed-mergetree-tables">Distributed MergeTree tables</h3><p>Distributed tables are another example of where asynchronous modifications might cause you to change how you query data. 
If your application writes data directly to the distributed table (rather than to different cluster nodes, which is possible for advanced users), the data is first written to the "initiator" node, which in turn copies the data to the shards in the background as quickly as possible. Because there are no transactions to verify that the data was moved as part of something like two-phase commits (available in PostgreSQL), your data might not actually be where you think it is.</p><p>There is at least one other problem with how distributed data is handled. Because ClickHouse does not support transactions and data is in a constant state of being moved, there is no guarantee of consistency in the state of the cluster nodes. Saving 100,000 rows of data to a distributed table doesn't guarantee that backups of all nodes will be consistent with one another (we'll discuss reliability in a bit). Some of that data might have been moved, and some of it might still be in transit.</p><p>Again, this is by design, so there's nothing specifically wrong with what's happening in ClickHouse! It's just something to be aware of when comparing ClickHouse to something like PostgreSQL and TimescaleDB.</p><h3 id="some-%E2%80%9Csynchronous%E2%80%9D-actions-aren%E2%80%99t-really-synchronous"><strong>Some “synchronous” actions aren’t really synchronous</strong></h3><p>Most actions in ClickHouse are not synchronous. But we found that even some of the ones labeled “synchronous” weren’t really synchronous either.</p><p>One particular example that caught us by surprise during our benchmarking was how <code>TRUNCATE</code> worked. We ran many test cycles against ClickHouse and TimescaleDB to identify how changes in row batch size, workers, and even cardinality impacted the performance of each database. At the end of each cycle, we would <code>TRUNCATE</code> the database in each server, expecting the disk space to be released quickly so that we could start the next test. 
In PostgreSQL (and other OLTP databases), this is an atomic action. As soon as the truncate is complete, the space is freed up on disk.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/pasted-image-0--1--5.png" class="kg-image" alt="Dashboard graph showing disk usage and immediate release of space after using TRUNCATE" loading="lazy" width="1524" height="540" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/01/pasted-image-0--1--5.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/01/pasted-image-0--1--5.png 1000w, https://timescale.ghost.io/blog/content/images/2022/01/pasted-image-0--1--5.png 1524w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">TRUNCATE is an atomic action in TimescaleDB/PostgreSQL and frees disk almost immediately</span></figcaption></figure><p>We expected the same thing with ClickHouse because the documentation mentions that this is a <strong>synchronous</strong> action (and most things are <strong>not</strong> synchronous in ClickHouse). It turns out, however, that the files only get marked for deletion and the disk space is freed up at a later, unspecified time in the background. 
There's no specific guarantee for when that might happen.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/pasted-image-0-10.png" class="kg-image" alt="Dashboard graph showing disk usage of ClickHouse and the time needed to free disk space after TRUNCATE" loading="lazy" width="1526" height="550" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/01/pasted-image-0-10.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/01/pasted-image-0-10.png 1000w, https://timescale.ghost.io/blog/content/images/2022/01/pasted-image-0-10.png 1526w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">TRUNCATE is an asynchronous action in ClickHouse, freeing disk at some future time</span></figcaption></figure><p>For our tests, it was a minor inconvenience. <strong>We had to add a 10-minute sleep into the testing cycle</strong> to ensure that ClickHouse had released the disk space fully. In real-world situations, like ETL processing that utilizes staging tables, a <code>TRUNCATE</code> wouldn't actually free the staging table data immediately, which could cause you to modify your current processes.</p><p>We point out a few of these scenarios simply to highlight that ClickHouse isn't a drop-in replacement for many things that a system of record (OLTP database) is generally used for in modern applications. When data modifications are asynchronous, working with your data effectively can take a lot more effort.</p><h3 id="sql-like-but-not-quite-sql">SQL-like, but not quite SQL</h3><p>In many ways, ClickHouse was ahead of its time by choosing SQL as the language of choice.</p><p>ClickHouse chose early in its development to utilize SQL as the primary language for managing and querying data. Given the focus on data analytics, this was a smart and obvious choice, given that SQL was already widely adopted and understood for querying data. 
</p><p>In ClickHouse, the SQL isn't something that was added after the fact to satisfy a portion of the user community. That said, what ClickHouse provides is a SQL-like language that doesn't comply with any actual standard.</p><p>The challenges of a SQL-like query language are many. One is retraining users who will be accessing the database (or writing applications that access the database). Another challenge is a lack of ecosystem: connectors and tools that speak SQL won’t just work out of the box—i.e., they will require some modification (and again, knowledge by the user) to work. </p><p>Overall, ClickHouse handles basic SQL queries well. </p><p>However, because the data is stored and processed in a different way from most SQL databases, there are a number of commands and functions you may expect to use from a SQL database (e.g., PostgreSQL, TimescaleDB), but which ClickHouse doesn't support or has limited support for:</p><ul><li>Not optimized for JOINs</li><li>No index management beyond the primary and secondary indexes</li><li>No recursive CTEs</li><li>No correlated subqueries or LATERAL joins</li><li>No stored procedures</li><li>No user-defined functions</li><li>No triggers</li></ul><p>One example that <strong>stands out about ClickHouse is that </strong><a href="https://www.timescale.com/learn/sql-joins-summary"><strong>JOINs</strong></a><strong>, by nature, are generally discouraged because the query engine lacks any ability to optimize the join of two or more tables</strong>. </p><p>Instead, users are encouraged to query table data with separate sub-select statements and then use something like an <a href="https://www.timescale.com/learn/what-is-a-sql-inner-join"><code>ANY INNER JOIN</code></a>, which strictly looks for <a href="https://clickhouse.tech/docs/en/sql-reference/statements/select/join/#select-join-types">unique pairs on both sides of the join </a>(avoiding a Cartesian product that can occur with standard JOIN types). 
There's also no caching support for the product of a JOIN, so if a table is joined multiple times, <strong>the query on that table is executed multiple times</strong>, further slowing down the query.</p><p>For example, all of the "double-groupby" queries in the TSBS group by multiple columns and then join to the tag table to get the <code>hostname</code> for the final output. Here is how that query is written for each database.</p><p><strong>TimescaleDB:</strong></p><pre><code class="language-sql">WITH cpu_avg AS (
     SELECT time_bucket('1 hour', time) as hour,
       tags_id, 
	   AVG(cpu_user) AS mean_cpu_user
     FROM cpu
     WHERE time &gt;= '2021-01-01T12:30:00Z' 
       AND time &lt; '2021-01-02T12:30:00Z'
     GROUP BY 1, 2
)
SELECT hour, hostname, mean_cpu_user
FROM cpu_avg
JOIN tags ON cpu_avg.tags_id = tags.id
ORDER BY hour, hostname;
</code></pre>
<p><strong>ClickHouse:</strong></p><pre><code class="language-sql">SELECT
    hour,
    id,
    mean_cpu_user
FROM
(
    SELECT
        toStartOfHour(created_at) AS hour,
        tags_id AS id,
        AVG(cpu_user) as mean_cpu_user
    FROM cpu
    WHERE (created_at &gt;= '2021-01-01T12:30:00Z') 
        AND (created_at &lt; '2021-01-02T12:30:00Z')
    GROUP BY
        hour,
        id
) AS cpu_avg
ANY INNER JOIN tags USING (id)
ORDER BY
    hour ASC,
    id;
</code></pre>
<h3 id="reliability-no-data-consistency-in-backups">Reliability: no data consistency in backups</h3><p>One last aspect to consider as part of the ClickHouse architecture and its lack of support for transactions is that there is no data consistency in backups. As we've already shown, all data modification (even sharding across a cluster) is asynchronous. Therefore, the only way to ensure a consistent backup would be to stop all writes to the database and then make a backup. Data recovery suffers from the same limitation.</p><p>The lack of transactions and data consistency also affects other features like materialized views because the server can't atomically update multiple tables at once. If something breaks during a multi-part insert to a table with materialized views, the end result is an inconsistent state of your data.</p><p>ClickHouse is aware of these shortcomings and is certainly working on or planning updates for future releases. Some form of transaction support <a href="https://github.com/ClickHouse/ClickHouse/issues/22086">has been in discussion for some time</a>, and <a href="https://github.com/ClickHouse/ClickHouse/pull/21945">backups are in progress</a> and merged into the main branch of code, although it's <a href="https://github.com/ClickHouse/ClickHouse/pull/21945#issuecomment-933598875">not yet recommended for production use</a>. But even then, it only provides limited support for transactions.</p><h2 id="clickhouse-vs-postgresql">ClickHouse vs. PostgreSQL</h2><p><em>(A proper ClickHouse vs. PostgreSQL comparison would probably take another 8,000 words. To avoid making this post even longer, we opted to provide a short comparison of the two databases—but if anyone wants to provide a more detailed comparison, we would love to read it.) </em></p><p>As we can see above, ClickHouse is a well-architected database for OLAP workloads. Conversely, PostgreSQL is a well-architected database for OLTP workloads. 
</p><p>Also, PostgreSQL isn’t just an OLTP database: it’s the fastest-growing and most loved OLTP database (<a href="https://db-engines.com/en/ranking">DB-Engines</a>, <a href="https://survey.stackoverflow.co/2024/technology#1-databases" rel="noreferrer">Stack Overflow 2024 Developer Survey</a>).</p><p>As a result, we won’t compare the performance of ClickHouse vs. PostgreSQL because—to continue our analogy from before—it would be like comparing the performance of a bulldozer vs. a car. These are two different things designed for two different purposes. </p><p>We’ve already established why ClickHouse is excellent for analytical workloads. Let’s now understand why PostgreSQL is so loved for transactional workloads: versatility, extensibility, and reliability.</p><h3 id="postgresql-versatility-and-extensibility">PostgreSQL versatility and extensibility</h3><p>Versatility is one of the distinguishing strengths of PostgreSQL. It's one of the main reasons for the recent resurgence of PostgreSQL in the wider technical community.</p><p><a href="https://www.timescale.com/blog/best-practices-for-picking-postgresql-data-types/" rel="noreferrer">PostgreSQL supports a variety of data types</a>, including arrays, JSON, and more. It supports <a href="https://www.timescale.com/blog/database-indexes-in-postgresql-and-timescale-cloud-your-questions-answered/" rel="noreferrer">various index types</a>—not just the common B-tree but also GiST, GIN, and more. Full-text search? Check. Role-based access control? Check. And, of course, full SQL.</p><p>Also, through the use of extensions, PostgreSQL can retain the things it's good at while adding specific functionality to enhance the ROI of your development efforts.</p><p>Does your application need geospatial data? Add the PostGIS extension. What about features that benefit time-series data workloads? Add TimescaleDB. 
<a href="https://www.timescale.com/learn/postgresql-extensions-pg-trgm">Could your application benefit from the ability to search using trigrams? Add pg_trgm</a>.</p><p>With all these capabilities, PostgreSQL is quite flexible, meaning it is essentially future-proof. As your application or workload changes, you will know that you can still adapt PostgreSQL to your needs.</p><p>(For one specific example of the powerful extensibility of PostgreSQL, please read how our engineering team built <a href="https://www.timescale.com/blog/function-pipelines-building-functional-programming-into-postgresql-using-custom-operators/" rel="noreferrer">functional programming into PostgreSQL using custom operators</a>.)</p><h3 id="postgresql-reliability">PostgreSQL reliability</h3><p>As developers, we’re resigned to the fact that programs crash, servers encounter hardware or power failures, and disks fail or experience corruption. You can mitigate this risk (e.g., robust software engineering practices, uninterruptible power supplies, disk RAID, etc.) but not eliminate it completely; it’s a fact of life for systems. </p><p>In response, databases are built with various mechanisms to further reduce such risk, including streaming replication to replicas, full-snapshot backup and recovery, streaming backups, robust data export tools, etc.</p><p>PostgreSQL has the benefit of 20+ years of development and usage, which has resulted in not just a reliable database but also a broad spectrum of rigorously tested tools: streaming replication for high availability and read-only replicas, pg_dump and pg_restore for full database snapshots, pg_basebackup and log shipping/streaming for incremental backups and arbitrary point-in-time recovery, pgBackRest or WAL-E for continuous archiving to cloud storage, and robust COPY FROM and COPY TO tools for quickly importing/exporting data with a variety of formats. 
</p><p>This enables PostgreSQL to offer a greater “peace of mind”—because all of the skeletons in the closet have already been found (and addressed).</p><h2 id="clickhouse-vs-timescaledb">ClickHouse vs. TimescaleDB</h2><p>TimescaleDB is the leading relational database for time series, built on PostgreSQL. It offers everything PostgreSQL has to offer, plus a full time-series database. </p><p>As a result, all of the advantages of PostgreSQL also apply to TimescaleDB, including versatility and reliability.</p><p>But TimescaleDB adds some critical capabilities that allow it to outperform Postgres for time-series data:</p><ul><li><strong>Hypertables: </strong>The foundation for many TimescaleDB features (listed below), hypertables provide <a href="https://www.timescale.com/learn/data-partitioning-what-it-is-and-why-it-matters" rel="noreferrer">automatically partitioned data</a> across time and space for more performant inserts and queries</li><li><strong>Continuous aggregates: </strong><a href="https://www.timescale.com/learn/postgresql-materialized-views-and-where-to-find-them">Intelligently updated materialized views for time-series data</a>. 
Rather than recreating the materialized view every time, TimescaleDB updates data based only on underlying changes to raw data.</li><li><strong>Columnar compression: </strong><a href="https://www.timescale.com/learn/what-is-data-compression-and-how-does-it-work">Efficient data compression of 90%+ on most time-series data</a> with dramatically improved query performance for historical, long+narrow queries.</li><li><strong>Hyperfunctions: </strong><a href="https://docs.timescale.com/api/latest/hyperfunctions/">Analytic-focused functions added to PostgreSQL to enhance time-series queries</a> with features like approximate percentiles, efficient downsampling, and two-step aggregation.</li><li><strong>Function pipelines (</strong><a href="https://www.timescale.com/blog/function-pipelines-building-functional-programming-into-postgresql-using-custom-operators/" rel="noreferrer"><strong>released this week!</strong></a><strong>): </strong>Radically improve the developer ergonomics of analyzing data in PostgreSQL and SQL by applying principles from functional programming and popular tools like Python’s Pandas and PromQL.</li></ul><h2 id="clickhouse-vs-timescaledb-performance-for-time-series-data">ClickHouse vs. TimescaleDB Performance for Time-Series Data</h2><p>Time-series data has exploded in popularity because the value of tracking and analyzing how things change over time has become evident in every industry: DevOps and IT monitoring, industrial manufacturing, financial trading and risk management, sensor data, ad tech, application eventing, smart home systems, autonomous vehicles, professional sports, and more. </p><p>It differs from more traditional business (OLTP) data in at least two primary ways: it is primarily insert-heavy, and the scale of the data grows at an unceasing rate. This impacts both data collection and storage, as well as how we analyze the values themselves. 
Traditional OLTP databases often can't handle millions of transactions per second or provide effective means of storing and maintaining the data.</p><p>Time-series data also differs from general analytical (OLAP) data in that queries generally have a time component and rarely touch every row in the database.</p><p>Over the last few years, however, the lines between the capabilities of OLTP and OLAP databases have started to blur. For the last decade, numerous NoSQL architectures mitigated the storage challenge while still failing to deal effectively with the query and analytics requirements of time-series data.</p><p>As a result, many applications try to find the right balance between the transactional capabilities of OLTP databases and the large-scale analytics provided by OLAP databases. It makes sense, therefore, that many applications would try to use ClickHouse, which offers fast ingest and analytical query capabilities for time-series data.</p><p>So, let's see how both ClickHouse and TimescaleDB compare for time-series workloads using our standard <a href="https://github.com/timescale/tsbs">TSBS</a> benchmarks.</p><p><strong>Performance Benchmarks</strong></p><p>Let me start by saying that this wasn't a test we completed in a few hours and then moved on from. In fact, just yesterday, while finalizing this blog post, <strong>we installed the latest version of ClickHouse</strong> (released three days ago) and ran all of the tests again to ensure we had the best numbers possible! (benchmarking, not benchmarketing)</p><p>In preparation for the final set of tests, we ran benchmarks on both TimescaleDB and ClickHouse dozens of times each—<em>at least</em>. We tried different cardinalities, different lengths of time for the generated data, and various settings for things that we had easy control over—like "chunk_time_interval" with TimescaleDB. 
We wanted to really understand how each database would perform with typical cloud hardware and the specs that we often see in the wild.</p><p>We also acknowledge that most real-world applications don't work like the benchmark does: ingesting data first and querying it second. Separating each operation allows us to understand which settings impacted each database during different phases, allowing us to tweak benchmark settings for each database along the way to get the best performance.</p><p>Finally, we always view these benchmarking tests as an academic and self-reflective experience. That is, spending a few hundred hours working with both databases often causes us to consider ways we might improve TimescaleDB (in particular), and thoughtfully consider when we can—and should—say that another database solution is a good option for specific workloads.</p><h3 id="%E2%80%8B%E2%80%8Bmachine-configuration">Machine Configuration</h3><p>For this benchmark, we consciously decided to use cloud-based hardware configurations that were reasonable for a medium-sized workload typical of startups and growing businesses. In previous benchmarks, we've used bigger machines with specialized RAID storage, <strong>a very typical setup</strong> for a production database environment.</p><p>But, as time has marched on and we see more developers use Kubernetes and modular infrastructure setups without lots of specialized storage and memory optimizations, it felt more genuine to benchmark each database on instances that more closely matched what we tend to see in the wild. 
Sure, we can always throw more hardware and resources at a benchmark to inflate the numbers, but that often doesn't help convey what most real-world applications can expect.</p><p>To that end, for comparing both insert and read latency performance, we used the following setup in AWS:</p><ul><li><strong>Versions</strong>: TimescaleDB <a href="https://github.com/timescale/timescaledb/releases/tag/2.4.0">version 2.4.0</a>, community edition, with PostgreSQL 13; ClickHouse <a href="https://github.com/ClickHouse/ClickHouse/releases/tag/v21.6.5.37-stable">version 21.6.5</a> (the latest non-beta releases for both databases at the time of testing).</li><li>1 remote client machine running TSBS, 1 database server, both in the same cloud datacenter</li><li><strong>Instance size</strong>: Both client and database server ran on Amazon EC2 virtual machines (m5.4xlarge) with 16 vCPU and 64GB memory each.</li><li><strong>OS</strong>: Both server and client machines ran Ubuntu 20.04.3</li><li><strong>Disk Size</strong>: 1TB of EBS GP2 storage</li><li><strong>Deployment method</strong>: Installed via apt-get using official sources</li></ul><h3 id="database-configuration">Database configuration</h3><p><strong>ClickHouse:</strong> No configuration modifications were made to ClickHouse; we simply installed it per the official documentation. There is currently no tool like <code>timescaledb-tune</code> for ClickHouse.</p>
<p><strong>TimescaleDB:</strong> For TimescaleDB, we followed the recommendations in the Timescale documentation. Specifically, we ran <a href="https://github.com/timescale/timescaledb-tune"><code>timescaledb-tune</code></a> and accepted the configuration suggestions, which are based on the specifications of the EC2 instance. We also set <code>synchronous_commit=off</code> in <code>postgresql.conf</code>, a common performance configuration for write-heavy workloads that still maintains transactional, logged integrity.</p>
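<p>For reference, the same <code>synchronous_commit</code> change can be applied from a SQL session rather than by editing <code>postgresql.conf</code> by hand; a minimal sketch:</p><pre><code class="language-sql">-- Turn off synchronous commit: commits no longer wait for the WAL flush.
-- A crash can lose the last few transactions, but on-disk data stays
-- consistent because WAL logging itself remains enabled.
ALTER SYSTEM SET synchronous_commit = 'off';
SELECT pg_reload_conf();
</code></pre>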
<h3 id="insert-performance-1">Insert performance</h3><p>For insert performance, we used the following datasets and configurations. The datasets were created with the <a href="https://github.com/timescale/tsbs">Time Series Benchmark Suite (TSBS)</a> using the <code>cpu-only</code> use case.</p>
<ul><li><strong>Dataset</strong>: 100-1,000,000 simulated devices generated 10 CPU metrics every 10 seconds for ~100 million reading intervals.</li><li><strong>Intervals</strong> used for each configuration are as follows: 31 days for 100 devices; 3 days for 4,000 devices; 3 hours for 100,000 devices; 30 minutes for 1,000,000</li><li><strong>Batch size</strong>: Inserts were made using a batch size of 5,000 which was used for both ClickHouse and TimescaleDB. We tried multiple batch sizes and found that in most cases there was little difference in overall insert efficiency between 5,000 and 15,000 rows per batch with each database.</li><li><strong>TimescaleDB chunk size</strong>: We set the chunk time depending on the data volume, aiming for 7-16 chunks in total for each configuration (<a href="https://docs.timescale.com/latest/using-timescaledb/hypertables?utm_source=timescale-influx-benchmark&amp;utm_medium=blog&amp;utm_campaign=july-2020-advocacy&amp;utm_content=chunks-docs#best-practices">more on chunks here</a>).<br></li></ul><p>In the end, these were the performance numbers for ingesting pre-generated time-series data from the TSBS client machine into each database using a batch size of 5,000 rows.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2023/09/07-clickhouse-improvement.png" class="kg-image" alt="Table showing the final insert results between ClickHouse and TimescaleDB when using larger 5,000 rows/batch" loading="lazy" width="2000" height="682" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2023/09/07-clickhouse-improvement.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2023/09/07-clickhouse-improvement.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2023/09/07-clickhouse-improvement.png 1600w, https://timescale.ghost.io/blog/content/images/2023/09/07-clickhouse-improvement.png 2064w" sizes="(min-width: 720px) 
720px"><figcaption><span style="white-space: pre-wrap;">Insert performance comparison between ClickHouse and TimescaleDB with 5,000 row/batches</span></figcaption></figure><p>To be honest, <strong>this didn't surprise us</strong>. We've seen numerous recent blog posts about ClickHouse ingest performance, and since ClickHouse uses a different storage architecture and mechanism that doesn't include transaction support or ACID compliance, we generally expected it to be faster.</p><p>The story does change a bit, however, when you consider that ClickHouse is designed to save every "transaction" of ingested rows as separate files (to be merged later using the MergeTree architecture). It turns out that when you ingest much smaller batches of data, ClickHouse is significantly slower and consumes much more disk space than TimescaleDB.</p><p><em>(Ingesting 100 million rows, 4,000 hosts, 3 days of data - 22GB of raw data)</em></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2023/09/08-chunk-time-interval.png" class="kg-image" alt="Table showing the impact smaller batch sizes have on TimescaleDB and ClickHouse. 
TimescaleDB insert performance and disk usage stays steady, while ClickHouse performance is negatively impacted" loading="lazy" width="2000" height="802" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2023/09/08-chunk-time-interval.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2023/09/08-chunk-time-interval.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2023/09/08-chunk-time-interval.png 1600w, https://timescale.ghost.io/blog/content/images/2023/09/08-chunk-time-interval.png 2064w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Insert performance comparison between ClickHouse and TimescaleDB using smaller batch sizes, which significantly impacts ClickHouse's performance and disk usage</span></figcaption></figure><p>Do you notice something in the numbers above?</p><p>Regardless of batch size, TimescaleDB consistently consumed ~19GB of disk space with each data ingest benchmark before compression. This is a result of the <code>chunk_time_interval</code> which determines how many chunks will get created for a given range of time-series data. Although ingest speeds may decrease with smaller batches, the same chunks are created for the same data, resulting in consistent disk usage patterns. Before compression, it's easy to see that TimescaleDB continually consumes the same amount of disk space regardless of the batch size.</p>
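<p>For context, the chunk interval is set when a hypertable is created and can be changed later for newly created chunks; a minimal sketch (the table name and interval values are illustrative):</p><pre><code class="language-sql">-- Convert a regular table into a hypertable with 1-day chunks
SELECT create_hypertable('cpu', 'time', chunk_time_interval =&gt; INTERVAL '1 day');

-- Adjust the interval later; only chunks created afterward use it
SELECT set_chunk_time_interval('cpu', INTERVAL '4 hours');
</code></pre>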
<p>By comparison, ClickHouse storage needs are correlated to how many files need to be written (which is partially dictated by the size of the row batches being saved). It can actually take significantly more storage to save data to ClickHouse before it can be merged into larger files. Even at 500-row batches, ClickHouse consumed 1.75x more disk space than TimescaleDB for a source data file that was 22GB in size.</p><h3 id="read-latency">Read latency</h3><p>For benchmarking read latency, we used the following setup for each database (the machine configuration is the same as the one used in the Insert comparison):</p><ul>
<li><strong>Dataset</strong>: 4,000/10,000 simulated devices generated 10 CPU metrics every 10 seconds for 3 full days (100M+ reading intervals, 1B+ metrics)</li>
<li>We also enabled <a href="https://www.timescale.com/blog/building-columnar-compression-in-a-row-oriented-database/?utm_source=timescale-influx-benchmark&amp;utm_medium=blog&amp;utm_campaign=july-2020-advocacy&amp;utm_content=compression-blog">native compression</a> on TimescaleDB. We compressed everything but the most recent chunk of data, leaving it uncompressed. This is a commonly recommended configuration where raw, uncompressed data is kept for recent time periods and older data is compressed, enabling greater query efficiency (see our <a href="https://docs.timescale.com/latest/using-timescaledb/compression?utm_source=timescale-influx-benchmark&amp;utm_medium=blog&amp;utm_campaign=july-2020-advocacy&amp;utm_content=compression-docs#quick-start">compression docs</a> for more). The parameters we used to enable compression are as follows: we segmented by the <code>tags_id</code> column and ordered by <code>time</code> (descending) and <code>usage_user</code>.</li>
</ul>
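<p>The compression settings above map to a table-level configuration plus a per-chunk compression step; roughly (the table name and cutoff are illustrative):</p><pre><code class="language-sql">-- Enable native compression with the segment/order settings described above
ALTER TABLE cpu SET (
  timescaledb.compress,
  timescaledb.compress_segmentby = 'tags_id',
  timescaledb.compress_orderby = 'time DESC, usage_user'
);

-- Compress every chunk older than the cutoff, leaving recent data uncompressed
SELECT compress_chunk(chunk)
FROM show_chunks('cpu', older_than =&gt; INTERVAL '12 hours') AS chunk;
</code></pre>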
<p>On read (i.e., query) latency, the results are more complex. Unlike inserts, which primarily vary on cardinality size (and perhaps batch size), the universe of possible queries is essentially infinite, especially with a language as powerful as SQL. Often, the best way to benchmark read latency is to do it with the actual queries you plan to execute. For this case, we use a broad set of queries to mimic the most common query patterns.</p><p>The results shown below are the median from 1000 queries for each query type. Latencies in this chart are all shown as milliseconds, with an additional column showing the relative performance of TimescaleDB compared to ClickHouse (highlighted in green when TimescaleDB is faster, in blue when ClickHouse is faster).</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2023/09/09-clickhouse-benchmark-read-latency-performance.png" class="kg-image" alt="Table showing query response results when querying 4,000 hosts and 100 million rows of data. TimescaleDB outperforms in almost all query categories." 
loading="lazy" width="2000" height="1849" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2023/09/09-clickhouse-benchmark-read-latency-performance.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2023/09/09-clickhouse-benchmark-read-latency-performance.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2023/09/09-clickhouse-benchmark-read-latency-performance.png 1600w, https://timescale.ghost.io/blog/content/images/size/w2400/2023/09/09-clickhouse-benchmark-read-latency-performance.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Results of benchmarking query performance of 4,000 hosts with 100 million rows of data</span></figcaption></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2023/09/10-clickhouse-benchmark-read-latency-performance.png" class="kg-image" alt="Table showing query response results when querying 10,000 hosts and 100 million rows of data. TimescaleDB outperforms in almost all query categories." 
loading="lazy" width="2000" height="1815" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2023/09/10-clickhouse-benchmark-read-latency-performance.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2023/09/10-clickhouse-benchmark-read-latency-performance.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2023/09/10-clickhouse-benchmark-read-latency-performance.png 1600w, https://timescale.ghost.io/blog/content/images/size/w2400/2023/09/10-clickhouse-benchmark-read-latency-performance.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Results of benchmarking query performance of 10,000 hosts with 100 million rows of data</span></figcaption></figure><h3 id="simple-rollups">Simple rollups</h3><p>For simple rollups (i.e., single-groupby), when aggregating one metric across a single host for 1 or 12 hours, or multiple metrics across one or multiple hosts (either for 1 hour or 12 hours), TimescaleDB generally outperforms ClickHouse at both low and high cardinality. In particular, TimescaleDB exhibited up to 1058% of the performance of ClickHouse on configurations with 4,000 and 10,000 devices, with 10 unique metrics being generated every read interval.</p><h3 id="aggregates">Aggregates</h3><p>When calculating a simple aggregate for a single device, TimescaleDB consistently outperforms ClickHouse, regardless of the total number of devices. In our benchmark, TimescaleDB demonstrates 156% of the performance of ClickHouse when aggregating 8 metrics across 4,000 devices and 164% when aggregating 8 metrics across 10,000 devices. Once again, TimescaleDB outperforms ClickHouse for high-end scenarios.</p><h3 id="double-rollups">Double rollups</h3><p>The one set of queries in which ClickHouse consistently bested TimescaleDB on query latency was the double rollup queries that aggregate metrics by time and another dimension (e.g., GROUP BY time, deviceId). 
We'll go into a bit more detail below on why this might be, but this also wasn't completely unexpected.</p><h3 id="thresholds">Thresholds</h3><p>When selecting rows based on a threshold, TimescaleDB demonstrates between 249% and 357% of the performance of ClickHouse when computing thresholds for a single device, but only 58-130% of the performance of ClickHouse when computing thresholds for all devices in a random time window.</p><h3 id="complex-queries">Complex queries</h3><p>For complex queries that go beyond rollups or thresholds, the comparison is a bit more nuanced, particularly when looking at TimescaleDB. The difference is that TimescaleDB gives you control over which chunks are compressed. In most time-series applications, especially things like IoT, there's a constant need to find the most recent value of an item or a list of the top X things by some aggregation. This is what the <code>lastpoint</code> and <code>groupby-orderby-limit</code> queries benchmark.</p>
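<p>For reference, a <code>lastpoint</code>-style query in SQL looks something like this (the schema mirrors the TSBS <code>cpu</code> table; column names are illustrative):</p><pre><code class="language-sql">-- Most recent reading per device ("lastpoint"): one row for each tags_id
SELECT DISTINCT ON (tags_id) tags_id, time, usage_user
FROM cpu
ORDER BY tags_id, time DESC;
</code></pre>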
<p>As we've shown previously with other databases (<a href="https://www.timescale.com/blog/timescaledb-vs-influxdb-for-time-series-data-timescale-influx-sql-nosql-36489299877/">InfluxDB</a> and <a href="https://www.timescale.com/blog/how-to-store-time-series-data-mongodb-vs-timescaledb-postgresql-a73939734016/">MongoDB</a>), and as ClickHouse documents themselves, getting individual ordered values for items is not a use case for a MergeTree-like/OLAP database, generally because there is no ordered index that you can define for a time, key, and value. This means asking for the most recent value of an item still causes a more intense scan of data in OLAP databases.</p>
<p>We see that expressed in our results. TimescaleDB was around 3486% faster than ClickHouse when searching for the most recent values (<code>lastpoint</code>) for each item in the database. This is because the most recent uncompressed chunk will often hold the majority of those values as data is ingested, and it's a great example of why this flexibility with compression can have a significant impact on the performance of your application.</p>
<p>We fully admit, however, that compression doesn't always return favorable results for every query form. In the last complex query, <code>groupby-orderby-limit</code>, ClickHouse bests TimescaleDB by a significant amount, almost 15x faster. What our results didn't show is that queries that read from an uncompressed chunk (the most recent chunk) are 17x faster than ClickHouse, averaging 64ms per query. The query looks like this in TimescaleDB:</p>
<pre><code class="language-sql">SELECT time_bucket('60 seconds', time) AS minute, max(usage_user)
        FROM cpu
        WHERE time &lt; '2021-01-03 15:17:45.311177 +0000'
        GROUP BY minute
        ORDER BY minute DESC
        LIMIT 5
</code></pre>
<p>As you might guess, when the chunk is uncompressed, PostgreSQL indexes can be used to order the data by time quickly. When the chunk is compressed, the data matching the predicate (<code>WHERE time &lt; '2021-01-03 15:17:45.311177 +0000'</code> in the example above) must first be decompressed before it is ordered and searched.</p><p>When the data for a <code>lastpoint</code> query falls within an uncompressed chunk (which is often the case with near-term queries that have a predicate like <code>WHERE time &gt; now() - INTERVAL '6 hours'</code>), the results are startling.</p><p><em>(uncompressed chunk query, 4k hosts)</em></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2021/10/11-clickhouse-benchmark-uncompressed-chunk--1-.jpg" class="kg-image" alt="Table showing the positive impact querying uncompressed data in TimescaleDB can have, specifically the lastpoint and groupby-orderby-limit queries." loading="lazy" width="2000" height="307" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2021/10/11-clickhouse-benchmark-uncompressed-chunk--1-.jpg 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2021/10/11-clickhouse-benchmark-uncompressed-chunk--1-.jpg 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2021/10/11-clickhouse-benchmark-uncompressed-chunk--1-.jpg 1600w, https://timescale.ghost.io/blog/content/images/2021/10/11-clickhouse-benchmark-uncompressed-chunk--1-.jpg 2000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Query latency performance when </span><code spellcheck="false" style="white-space: pre-wrap;"><span>lastpoint</span></code><span style="white-space: pre-wrap;"> and </span><code spellcheck="false" style="white-space: pre-wrap;"><span>groupby-orderby-limit</span></code><span style="white-space: pre-wrap;"> queries use an uncompressed chunk in TimescaleDB</span></figcaption></figure><p>One of the key takeaways 
from this last set of queries is that the features provided by a database can have a material impact on the performance of your application. Sometimes, it just works, while other times, having the ability to fine-tune how data is stored can be a game-changer.</p><h3 id="read-latency-performance-summary">Read latency performance summary</h3><ul><li>For simple queries, TimescaleDB outperforms ClickHouse, regardless of whether native compression is used.</li><li>For typical aggregates, even across many values and items, TimescaleDB outperforms ClickHouse.</li><li>Doing more complex double rollups, ClickHouse outperforms TimescaleDB every time. To some extent, we were surprised by the gap and will continue to understand how we can better accommodate queries like this on raw time-series data. One solution to this disparity in a real application would be to use a continuous aggregate to pre-aggregate the data.</li><li>When selecting rows based on a threshold, TimescaleDB outperforms ClickHouse and is up to 250% faster.</li><li>For some complex queries, particularly a standard query like "lastpoint", TimescaleDB vastly outperforms ClickHouse</li><li>Finally, depending on the time range being queried, TimescaleDB can be significantly faster (up to 1760%) than ClickHouse for grouped and ordered queries. When these kinds of queries reach further back into compressed chunks, ClickHouse outperforms TimescaleDB because more data must be decompressed to find the appropriate max() values to order by.</li></ul><h2 id="conclusion">Conclusion</h2><p>You made it to the end! Thank you for taking the time to read our detailed report.</p><p>Understanding ClickHouse and then comparing it with PostgreSQL and TimescaleDB made us appreciate that there is a lot of choice in today’s database market—but often, there is still only one <em>right tool for the job</em>. 
</p><p>Before deciding which to use for your application, we recommend taking a step back and analyzing your stack, your team's skills, and your needs, now and in the future. Choosing the best technology for your situation can make all the difference down the road: you want to pick an architecture that evolves and grows with you, not one that forces you to start all over when the data starts flowing from production applications.</p><p>We’re always interested in feedback, and we’ll continue to share our insights with the greater community.</p><h3 id="want-to-learn-more-about-timescaledb">Want to learn more about TimescaleDB?<br></h3><p><a href="https://console.cloud.timescale.com/signup"><strong>Create a free account to get started</strong></a> with a fully managed TimescaleDB instance (100% free for 30 days).</p><p>Want to host TimescaleDB yourself? <a href="https://github.com/timescale/timescaledb">Visit our GitHub</a> to learn more about options, get installation instructions, and more (and, as always, ⭐️ are appreciated!)</p><p><a href="https://slack.timescale.com/">Join our Slack community</a> to ask questions, get advice, and connect with other developers (the authors of this post, as well as our co-founders, engineers, and passionate community members, are active on all channels).</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to Create (Lots of) Sample Time-Series Data With PostgreSQL generate_series()]]></title>
            <description><![CDATA[Use generate_series to see how TimescaleDB uses PostgreSQL's rock-solid foundation to build a scalable, fully extensible, powerful time-series database.]]></description>
            <link>https://www.tigerdata.com/blog/how-to-create-lots-of-sample-time-series-data-with-postgresql-generate_series</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/how-to-create-lots-of-sample-time-series-data-with-postgresql-generate_series</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Time Series Data]]></category>
            <category><![CDATA[General]]></category>
            <dc:creator><![CDATA[Ryan Booz]]></dc:creator>
            <pubDate>Thu, 26 Aug 2021 18:05:53 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2023/10/Screenshot-2023-10-12-at-5.55.04-PM.png">
            </media:content>
            <content:encoded><![CDATA[<p>As the makers of <a href="https://www.timescale.com" rel="noreferrer">TimescaleDB</a>, we often need to quickly create lots of sample time-series data to demonstrate a new database feature, run a benchmark, or talk about use cases internally. We also see users in our <a href="https://slack.timescale.com/">community Slack</a> asking how they can test a feature or decide if their data is a good fit for TimescaleDB. </p><p><a href="https://www.timescale.com/learn/is-postgres-partitioning-really-that-hard-introducing-hypertables" rel="noreferrer"><em>📝 If you want to know why people want to use Timescale, check this out.</em></a></p><p>Although using real data from your current application would be great (and ideal),  knowing <strong>how to quickly create a representative time-series dataset using varying cardinalities and different lengths of time</strong> is a helpful - and advantageous - skill to have.</p><p>Fortunately PostgreSQL provides a built-in function to help us create this sample data using the SQL that we already know and love - no external tools required.</p><p>In this three-part blog series, we'll go through a few ways to use the <code>generate_series()</code> function to create large datasets in PostgreSQL, including:</p><ul><li>What PostgreSQL <code>generate_series()</code> is and how to use it for basic data generation </li><li>How to create more realistic-looking time-series data with custom <a href="https://www.tigerdata.com/learn/understanding-postgresql-user-defined-functions" rel="noreferrer">PostgreSQL functions</a></li><li>Ways to create complex time-series data using additional PostgreSQL math functions and JOINs.</li></ul><p>By the end of this series, you'll be ready to test almost any PostgreSQL or TimescaleDB feature, create quick datasets for general testing, meetup presentations, demos, and more!</p><h2 id="intro-to-postgresql-generateseries">Intro to PostgreSQL 
generate_series()</h2><p><code>generate_series()</code> is a built-in <a href="https://www.tigerdata.com/blog/function-pipelines-building-functional-programming-into-postgresql-using-custom-operators" rel="noreferrer">PostgreSQL function</a> that makes it easy to create ordered tables of numbers or dates. The PostgreSQL documentation calls it a <a href="https://www.postgresql.org/docs/13/functions-srf.html">Set Returning Function</a> because it can return more than one row. </p><p>The function is simple to start, taking at least two required arguments to specify the start and stop parameters for the generated data.</p><pre><code class="language-sql">SELECT * FROM generate_series(1,5);</code></pre><p>This will produce output that looks like this:</p><pre><code class="language-sql">generate_series|
---------------|
              1|
              2|
              3|
              4|
              5|</code></pre><p>This first example shows that <code>generate_series()</code> returns sequential numbers between a start parameter and a stop parameter. When used to generate numeric data, <code>generate_series()</code> will increment the values by 1. However, an optional third parameter can be used to specify the increment length, known as the step parameter.</p><p>For example, if we wanted to generate rows that counted by two from 0 to 10, we could use this SQL instead:</p><pre><code class="language-sql">SELECT * from generate_series(0,10,2);

 generate_series 
-----------------
               0
               2
               4
               6
               8
              10</code></pre><h2 id="postgresql-generateseries-with-dates">PostgreSQL generate_series() With Dates</h2><p>Using <code>generate_series()</code> to produce a range of dates is equally straightforward. The only difference is that the third parameter, the step <code>INTERVAL</code> used to increment the date, is <strong><em>required </em></strong>for date generation, as shown here:</p><pre><code class="language-sql">SELECT * from generate_series(
	'2021-01-01',
    '2021-01-02', INTERVAL '1 hour'
  );

    generate_series     
------------------------
 2021-01-01 00:00:00+00
 2021-01-01 01:00:00+00
 2021-01-01 02:00:00+00
 2021-01-01 03:00:00+00
 2021-01-01 04:00:00+00
 2021-01-01 05:00:00+00
 2021-01-01 06:00:00+00
 2021-01-01 07:00:00+00
 2021-01-01 08:00:00+00
 2021-01-01 09:00:00+00
 2021-01-01 10:00:00+00
 2021-01-01 11:00:00+00
 2021-01-01 12:00:00+00
 2021-01-01 13:00:00+00
 2021-01-01 14:00:00+00
 2021-01-01 15:00:00+00
 2021-01-01 16:00:00+00
 2021-01-01 17:00:00+00
 2021-01-01 18:00:00+00
 2021-01-01 19:00:00+00
 2021-01-01 20:00:00+00
 2021-01-01 21:00:00+00
 2021-01-01 22:00:00+00
 2021-01-01 23:00:00+00
 2021-01-02 00:00:00+00
(25 rows)</code></pre><p>Notice that the returned dates are <em>inclusive</em> of the start and stop values, just as we saw with the numeric example before. The reason we got 25 rows (representing 25 hours rather than 24 as you might expect) is that the stop value can be reached exactly by stepping with the one-hour <code>INTERVAL</code> (the step parameter). As long as the <code>INTERVAL</code> can increment evenly up to the stop date, it will be included.</p><p>However, if the step interval skips over the stop value, it will not be included in your output. For example, if we modify the step <code>INTERVAL</code> above to '1 hour 25 minutes', the result only returns 17 rows, the last of which is before the stop value.</p><pre><code class="language-sql">SELECT * from generate_series(
	'2021-01-01','2021-01-02', 
    INTERVAL '1 hour 25 minutes'
   );

    generate_series     
------------------------
 2021-01-01 00:00:00+00
 2021-01-01 01:25:00+00
 2021-01-01 02:50:00+00
 2021-01-01 04:15:00+00
 2021-01-01 05:40:00+00
 2021-01-01 07:05:00+00
 2021-01-01 08:30:00+00
 2021-01-01 09:55:00+00
 2021-01-01 11:20:00+00
 2021-01-01 12:45:00+00
 2021-01-01 14:10:00+00
 2021-01-01 15:35:00+00
 2021-01-01 17:00:00+00
 2021-01-01 18:25:00+00
 2021-01-01 19:50:00+00
 2021-01-01 21:15:00+00
 2021-01-01 22:40:00+00
(17 rows)</code></pre><h2 id="how-to-generate-time-series-data">How to Generate Time-Series Data</h2><p>Now that we understand how to use <code>generate_series()</code>, how do we create some <a href="https://timescale.ghost.io/blog/time-series-data/">time-series data</a> to insert into TimescaleDB for testing and visualization? </p><p>To do this, we utilize a standard feature of relational databases and SQL. Recall that <code>generate_series()</code> is a Set Returning Function that returns a "table" of data (a set) just as if we had selected it from a table. Therefore, just like when we select data from a regular table with SQL, we can add more columns of data using other functions or static values. </p><p>You may have done this at some point with a string of text that needed to be repeated for each returned row.</p><pre><code class="language-sql">SELECT 'Hello Timescale!' as myStr, * FROM generate_series(1,5);

      myStr       | generate_series 
------------------+-----------------
 Hello Timescale! |               1
 Hello Timescale! |               2
 Hello Timescale! |               3
 Hello Timescale! |               4
 Hello Timescale! |               5
(5 rows)</code></pre><p>In this example, we simply added data to the rows being returned from <code>generate_series()</code>. For every row it returned, we added a column with some static text.</p>
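<p>The extra columns can also be derived from the generated value itself: give the series an alias and reference it in other expressions. As a quick illustrative sketch (not one of the original examples):</p><pre><code class="language-sql">SELECT num, num * 2 AS doubled
FROM generate_series(1,5) AS num;</code></pre><p>Because the set-returning function is aliased like a table, the column <code>num</code> can be reused anywhere in the SELECT list.</p>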
<p>But, these added columns don't have to be static data! Using the <a href="https://www.postgresql.org/docs/13/functions-math.html">built-in <code>random()</code></a> function (for example), we can generate data that starts to look a little more like data we'd see when monitoring computer CPU values (i.e., realistic data to use for our demos and tests!).</p><pre><code class="language-sql">SELECT random()*100 as CPU, * FROM generate_series(1,5);

        cpu         | generate_series 
--------------------+-----------------
 48.905450626783775 |               1
  71.94031820213382 |               2
 25.210553719011486 |               3
  19.24163308357194 |               4
  8.434915599133674 |               5
(5 rows)</code></pre><p>In this example, <code>generate_series()</code> produced five rows of data; for every row, PostgreSQL also executed the <code>random()</code> function to produce a value. With a little creativity, this can become a pretty efficient method for generating lots of data quickly.</p><h2 id="performance-considerations-when-generating-sample-data">Performance Considerations When Generating Sample Data</h2><h3 id="functions">Functions</h3><p>Before getting too deep into the many ways we can use various functions with <code>generate_series()</code> to create more realistic data, let's tackle one potential downside of functions before your mind starts dreaming up a myriad of ways to produce interesting sample data. </p><p>Functions are a powerful database tool, allowing complex code to be hidden behind a standard interface. In the example above, I don't have to know how <code>random()</code> produces a value. I just have to call it in my SQL statement (without arguments) and a random value is returned for every row.</p><p>Unfortunately, functions can slow down your query and use lots of resources if you're not careful. This is true any time you call functions in SQL, not just when creating sample data.</p><p>The reason for this inefficiency is that scalar functions are executed once for each column and row in which they are used. So, if you produce a set of 1,000 rows with <code>generate_series()</code> and then add on 5 additional columns with data generated by functions, PostgreSQL has to effectively process 5,001 function calls: one to generate the initial series of data (which is returned as a set), plus five more for each of the 1,000 rows to fill in the 5 additional columns.</p><p>The main issue is that PostgreSQL has no idea how to determine how much "work" it will take to generate the result from each function in the query. 
Something like <code>random()</code> requires very little effort, so calling it 5,000 or even 1 million times won't break the bank for most machines. However, if the function you're calling generates lots of temporary data internally before producing a result, be aware that it could perform more slowly and consume more resources on your database server.</p><p><em>The takeaway here is that nothing comes for free, especially when functions are called once for every column, for every row</em>. Most of the time, the data you generate will easily complete in a handful of seconds for tens of millions of rows. But, if you notice a query that generates data taking a long time, consider finding alternative ways to generate data for the columns that require more resources.</p><h3 id="transactions">Transactions</h3><p>There is a second caveat to be aware of when using a tool like <code>generate_series()</code>. All data is created and inserted as a single transaction. You <em>can</em> put in more time and effort to create functions and methods for breaking up the work, but understand that if you create (SELECT) one large set of data with 100 million rows, it's likely to consume a lot of memory and CPU on the server while the process occurs.</p><p>This doesn't mean it's broken or any less valid of a method for generating data. However, as we demonstrate additional ways to create more realistic data in parts 2 and 3 of this series, recognize that the more data you create, the more resources you'll need from the server unless you manage batches and transactions with a more advanced setup.</p><p>Alright, let's get back to the fun!</p><h2 id="tips-for-quickly-increasing-sample-dataset-scale">Tips for Quickly Increasing Sample Dataset Scale</h2><p>So far we've looked at how <code>generate_series()</code> works and how you can use functions to add additional dynamic content to the output. 
But how do we quickly get to the scale of tens of millions of rows (or more)?</p><p>This is where the concept of a Cartesian product comes into play with databases. A Cartesian product (otherwise known as a <a href="https://www.tigerdata.com/learn/what-is-a-sql-cross-join" rel="noreferrer">CROSS JOIN</a>) takes two or more result sets (rows from multiple tables or functions) and produces a new result set that contains rows equal to the count of the first set multiplied by the count of the second set. </p><p>In doing so, the database pairs every row from the first table with the first row from the second table, then repeats with the next row of the second table, and so on, until every combination of rows has been produced. (And yes, if you join more than two tables this way, the multiplication just continues, with tables processed left to right.)</p><p>Let's look at a small example using two <code>generate_series()</code> calls in the same SELECT statement.</p><pre><code class="language-sql">SELECT * from generate_series(1,10) a, generate_series(1,2) b;

a |b|
--+-+
 1|1|
 2|1|
 3|1|
 4|1|
 5|1|
 6|1|
 7|1|
 8|1|
 9|1|
10|1|
 1|2|
 2|2|
 3|2|
 4|2|
 5|2|
 6|2|
 7|2|
 8|2|
 9|2|
10|2|</code></pre><p>As you can see, PostgreSQL generated 10 rows for the first series and 2 rows for the second series. After processing and iterating over each table consecutively, the query produced a total of 20 rows (10 x 2). For generating lots of data quickly, this is about as easy as it gets!</p><p>The same process applies when using <code>generate_series()</code> to return a set of dates. If you (effectively) CROSS JOIN multiple sets (numbers or dates), the total number of rows in the final set will be the product of the sizes of all sets. Again, for generating sample time-series data, you'd be hard-pressed to find an easier method!</p><p>In this example, we generate 12 timestamps an hour apart, a random value representing CPU usage, and then a second series of four values that represent IDs for fake devices. This should produce 48 rows (12 timestamps x 4 device IDs = 48 rows).</p><pre><code class="language-sql">SELECT time, device_id, random()*100 as cpu_usage 
FROM generate_series(
	'2021-01-01 00:00:00',
    '2021-01-01 11:00:00',
    INTERVAL '1 hour'
  ) as time, 
generate_series(1,4) device_id;


time               |device_id|cpu_usage          |
-------------------+---------+-------------------+
2021-01-01 00:00:00|        1|0.35415126479989567|
2021-01-01 01:00:00|        1| 14.013393572770028|
2021-01-01 02:00:00|        1|   88.5015939122006|
2021-01-01 03:00:00|        1|  97.49037810105996|
2021-01-01 04:00:00|        1|  50.22781125586846|
2021-01-01 05:00:00|        1|  77.93431470586931|
2021-01-01 06:00:00|        1|  45.73481750582076|
2021-01-01 07:00:00|        1|   70.7999843735724|
2021-01-01 08:00:00|        1|   4.72949831884506|
2021-01-01 09:00:00|        1|  85.29122113229981|
2021-01-01 10:00:00|        1| 14.539664281598874|
2021-01-01 11:00:00|        1|  45.95244258556228|
2021-01-01 00:00:00|        2|  46.41196423062297|
2021-01-01 01:00:00|        2|  74.39903569177027|
2021-01-01 02:00:00|        2|  85.44087332221935|
2021-01-01 03:00:00|        2|  4.329394730750735|
2021-01-01 04:00:00|        2| 54.645873866589056|
2021-01-01 05:00:00|        2|  6.544334492894777|
2021-01-01 06:00:00|        2|  39.05071228953645|
2021-01-01 07:00:00|        2|  71.07264365438404|
2021-01-01 08:00:00|        2|   72.4732704336219|
2021-01-01 09:00:00|        2| 34.533280927542975|
2021-01-01 10:00:00|        2| 26.764760864598003|
2021-01-01 11:00:00|        2|  62.32048879645227|
2021-01-01 00:00:00|        3|  63.01888063314749|
2021-01-01 01:00:00|        3|  21.70606884856987|
2021-01-01 02:00:00|        3|  32.47610779097485|
2021-01-01 03:00:00|        3| 47.565982341726354|
2021-01-01 04:00:00|        3|  64.34867263419619|
2021-01-01 05:00:00|        3|  57.74424991855476|
2021-01-01 06:00:00|        3| 55.593286571750156|
2021-01-01 07:00:00|        3|  36.92650110894995|
2021-01-01 08:00:00|        3| 53.166926049881624|
2021-01-01 09:00:00|        3| 10.009505806123897|
2021-01-01 10:00:00|        3| 58.067700285561585|
2021-01-01 11:00:00|        3|  81.58883725078034|
2021-01-01 00:00:00|        4|   78.1768041898232|
2021-01-01 01:00:00|        4|  84.51505102850199|
2021-01-01 02:00:00|        4| 24.029611792753514|
2021-01-01 03:00:00|        4|  17.08996115345549|
2021-01-01 04:00:00|        4| 29.642690955760997|
2021-01-01 05:00:00|        4|  90.83844806413275|
2021-01-01 06:00:00|        4| 6.5019080489854275|
2021-01-01 07:00:00|        4|    32.336484070672|
2021-01-01 08:00:00|        4|    55.595524107963|
2021-01-01 09:00:00|        4|   97.5442141375293|
2021-01-01 10:00:00|        4|   37.0741925805568|
2021-01-01 11:00:00|        4| 19.093927249791776|</code></pre><h2 id="choosing-a-date-range">Choosing a Date Range</h2><p>Now it's time to put all of the features and concepts together. We've seen how to use <code>generate_series()</code> to create a sample table of data (both numbers and dates), add static and dynamic content to each row, and finally how to join multiple sets of data together to create a deterministic number of rows of test data.</p><p>The final piece of the puzzle is to figure out how to create date ranges that are more dynamic using date math. Once again, PostgreSQL makes this straightforward because date math is handled automatically.</p><p>When generating sample data, you usually have an idea of the duration of time you want to generate—one month, six months, or a year—for instance. But calculating your use case's exact start and end timestamps can get pretty tedious. </p><p>Instead, it's often easier to use <code>now()</code> or a static ending timestamp (i.e., '2021-06-01 00:00:00') and then use date math to get the starting timestamp based on your chosen interval. This makes it <em>very</em> easy to generate data for different durations simply by changing the interval. And to be clear, you can go the other direction (picking a start timestamp and adding time), but that can often lead to data in the future.</p><p>Let's look at three examples to demonstrate ways of creating dynamic date ranges.</p><h3 id="create-six-months-of-data-with-one-hour-intervals-ending-now">Create six months of data with one-hour intervals, ending now()</h3><pre><code class="language-sql">SELECT time, device_id, random()*100 as cpu_usage 
FROM generate_series(
	now() - INTERVAL '6 months',
    now(),
    INTERVAL '1 hour'
   ) as time, 
generate_series(1,4) device_id;</code></pre><p>Notice that we use the PostgreSQL function <code>now()</code> to automatically choose the ending timestamp, and then use date math (- INTERVAL '6 months') to let PostgreSQL find the starting timestamp for us. With almost no effort, we can easily generate weeks, months, or years of time-series data by changing the <code>INTERVAL</code> we subtract from <code>now()</code>.</p><h3 id="create-one-year-of-data-with-one-hour-intervals-ending-on-a-timestamp">Create six months of data with one-hour intervals, ending on a timestamp</h3><pre><code class="language-sql">SELECT time, device_id, random()*100 as cpu_usage 
FROM generate_series(
	'2021-08-01 00:00:00' - INTERVAL '6 months',
    '2021-08-01 00:00:00',
    INTERVAL '1 hour'
  ) as time, 
generate_series(1,4) device_id;
</code></pre><p>This example is the same as the first, but here we specify the timestamp that we want to end on. PostgreSQL can still do the date math for us (subtracting 6 months in this example), but we get control over the exact ending timestamp to use. This is particularly useful when you want the actual time portion of the timestamp to begin and end on even, rounded hours, minutes, or seconds. When we use <code>now()</code>, the range can begin and end at (what feels like) a random timestamp, causing all of the time-series data to have timestamps that are increments of <code>now()</code>.</p><h3 id="create-1-year-of-data-with-one-hour-intervals-beginning-on-a-timestamp">Create two months of data with one-hour intervals, beginning on a timestamp</h3><p>For this last example, we're going to specify the start timestamp instead. This can be useful when you need to test a specific fiscal period or maybe some logic that deals with the turning of each year. In this case, maybe you just need a month of data but it needs to cross over from one year to the next. In that case, trying to figure out the math of how far back to start and how far to go forward might be more complicated than you want to worry about.</p><pre><code class="language-sql">SELECT time, device_id, random()*100 as cpu_usage 
FROM generate_series(
	'2020-12-15 00:00:00',
    '2020-12-15 00:00:00' + INTERVAL '2 months',
    INTERVAL '1 hour'
  ) as time, 
generate_series(1,4) device_id;
</code></pre><p>Each of these examples uses date math to help you find the appropriate start and end timestamps so that you can easily adjust to create more or less data while still fitting the range profile that you need. </p><p>Speaking of calculating how many rows you want to generate… 😉</p><h2 id="calculating-total-rows">Calculating Total Rows</h2><p>We now have all the tools we need to create datasets of almost any size. For time-series data specifically, it's a straightforward calculation to figure out how many rows your query will generate.</p><p><em>Total Rows = Readings per hour * Total hours * Number of "things" being tracked</em></p><p>Using this formula, we can quickly determine how many rows will be created by changing any combination of the total range (start/end timestamps), how many readings per hour for each item, or the total number of items. Let's look at a couple of examples to get an idea of how quickly you could create many rows of time-series data for your testing scenario.</p><table>
<thead>
<tr>
<th>Range of readings</th>
<th>Length of interval</th>
<th>Number of "devices"</th>
<th>Total rows</th>
</tr>
</thead>
<tbody>
<tr>
<td>1 year</td>
<td>1 hour</td>
<td>4</td>
<td>35,040</td>
</tr>
<tr>
<td>1 year</td>
<td>10 minutes</td>
<td>100</td>
<td>5,256,000</td>
</tr>
<tr>
<td>6 months</td>
<td>5 minutes</td>
<td>1,000</td>
<td>52,560,000</td>
</tr>
</tbody>
</table>
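<p>The first row of the table above is easy to verify with a <code>count(*)</code> over the same kind of cross join (a quick sketch; the start and stop timestamps here are chosen so the series yields exactly 8,760 hourly readings for the non-leap year 2021):</p><pre><code class="language-sql">SELECT count(*)
FROM generate_series(
	'2021-01-01 00:00:00',
    '2021-12-31 23:00:00',
    INTERVAL '1 hour'
  ) as time, 
generate_series(1,4) device_id;

-- 8,760 timestamps x 4 devices = 35,040 rows</code></pre>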
<p>As you can see, the numbers start to add up <em>very</em> quickly.</p><h2 id="lets-review">Let's Review</h2><p>In this first post, we've demonstrated that it's pretty easy to generate lots of data using the PostgreSQL <code>generate_series()</code> function. We also learned that when you select multiple sets (using <code>generate_series()</code> or selecting from tables and functions), PostgreSQL will produce what's known as a Cartesian product, the selection of all rows from all tables - the product of all rows. With this knowledge, we can quickly create large and diverse datasets to test various features of PostgreSQL and TimescaleDB.</p><h2 id="keep-learning-about-generating-sample-data">Keep Learning About Generating Sample Data</h2><p>This was part 1 of the three-part series: </p><ul><li><strong>Read part 2: </strong><a href="https://timescale.ghost.io/blog/blog/generating-more-realistic-sample-time-series-data-with-postgresql-generate_series/"><strong>Generating more realistic sample time-series data with PostgreSQL <code>generate_series()</code></strong></a>. Learn how to use custom user-defined functions to create more realistic-looking data to use for testing, including generated text, numbers constrained by a range, and even fake JSON data.</li><li><strong>Read part 3: </strong><a href="https://timescale.ghost.io/blog/blog/how-to-shape-sample-data-with-postgresql-generate_series-and-sql/"><strong>How to shape sample data with PostgreSQL <code>generate_series()</code> and SQL</strong></a><strong>. </strong>Learn how to add shape and trends into your sample time-series data (e.g., increasing web traffic over time and quarterly sales cycles) using the formatting functions in this post in conjunction with relational lookup tables and additional mathematical functions. 
Knowing how to manipulate the pattern of generated data is particularly useful for visualizing time-series data and learning analytical PostgreSQL or TimescaleDB functions.</li><li><strong>Watch the video: </strong></li></ul><figure class="kg-card kg-embed-card"><iframe width="200" height="113" src="https://www.youtube.com/embed/t5ULlC1MYWU?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></figure><p></p><h2 id="try-it-yourself-with-timescaledb">Try it Yourself With TimescaleDB </h2><p>If you are not using TimescaleDB yet,&nbsp;<a href="https://www.timescale.com/?ref=timescale.com" rel="noreferrer">take a look</a>. It's a <a href="https://www.tigerdata.com/blog/top-8-postgresql-extensions" rel="noreferrer">PostgreSQL extension</a> that&nbsp;<a href="https://timescale.ghost.io/blog/postgresql-timescaledb-1000x-faster-queries-90-data-compression-and-much-more/" rel="noreferrer">will make your queries faster</a> via&nbsp;<a href="https://www.timescale.com/?ref=timescale.com" rel="noreferrer">automatic partitioning</a>, query planner enhancements, improved materialized views,&nbsp;<a href="https://timescale.ghost.io/blog/building-columnar-compression-in-a-row-oriented-database/" rel="noreferrer">columnar compression</a>, and much more.&nbsp;<a href="https://www.tigerdata.com/blog/building-columnar-compression-in-a-row-oriented-database" rel="noreferrer">It also comes with awesome functionality for time-series data analysis. </a></p><p>If you're running your PostgreSQL database on your own hardware,&nbsp;<a href="https://docs.timescale.com/self-hosted/latest/install/?ref=timescale.com" rel="noreferrer">you can simply add the TimescaleDB extension</a>. If you prefer to try Timescale in AWS,&nbsp;<a href="https://console.cloud.timescale.com/signup?ref=timescale.com" rel="noreferrer">create a free account on our platform</a>. 
</p><p>See what's possible when you <a href="https://www.tigerdata.com/learn/what-is-a-sql-inner-join" rel="noreferrer">join PostgreSQL</a> with time-series superpowers!</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Hacking NFL Data With PostgreSQL, TimescaleDB, and SQL]]></title>
            <description><![CDATA[Learn how to use time-series data provided by the NFL to uncover insights.]]></description>
            <link>https://www.tigerdata.com/blog/hacking-nfl-data-with-postgresql-timescaledb-and-sql</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/hacking-nfl-data-with-postgresql-timescaledb-and-sql</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Time Series Data]]></category>
            <category><![CDATA[Analytics]]></category>
            <dc:creator><![CDATA[Attila Toth]]></dc:creator>
            <pubDate>Tue, 27 Jul 2021 13:32:25 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2021/07/ameer-basheer-Yzef5dRpwWg-unsplash--1-.jpg">
            </media:content>
            <content:encoded><![CDATA[<p><em>Learn how to use time-series data provided by the NFL to uncover valuable insights into many player performance metrics—and ways to apply the same methods to improve your fantasy league team, your knowledge of the game, or your viewing experience—all with PostgreSQL, standard SQL, and freely available extensions.</em></p><p>Time-series data is everywhere, including, much to our surprise, the world of professional sports. At Timescale, we're always looking for fun ways to showcase the expanding reach of time-series data. <a href="https://docs.timescale.com/timescaledb/latest/tutorials/analyze-intraday-stocks/">Stock</a>, <a href="https://docs.timescale.com/timescaledb/latest/tutorials/analyze-cryptocurrency-data/">cryptocurrency</a>, <a href="https://docs.timescale.com/timescaledb/latest/tutorials/nyc-taxi-cab/">IoT</a>, and <a href="https://docs.timescale.com/timescaledb/latest/tutorials/promscale/">infrastructure metrics</a> data are relatively common and widely understood time-series data scenarios. Head to Twitter on any given day, search for <a href="https://twitter.com/hashtag/TimeSeries">#timeseries</a> or <a href="https://twitter.com/hashtag/TimescaleDB">#TimescaleDB</a>, and you're sure to find questions about high-frequency trading or massive-scale observability data with tools like Prometheus.</p><p>You can imagine our excitement, then, when we happened upon the <a href="https://operations.nfl.com/gameday/analytics/big-data-bowl/">NFL Big Data Bowl</a>, an annual competition that encourages the data science community to use historical player position and play data to create machine learning models. </p><p>Did the NFL <strong><em>really</em></strong> give access to 18+ million rows of detailed play data from every regular season NFL game?</p>
<!--kg-card-begin: html-->
<div style="width:100%;height:0;padding-bottom:100%;position:relative;"><iframe src="https://giphy.com/embed/2w3uNmvsIZ09hNzgzo" width="100px" height="100px" style="position:absolute; min-width: 100%; min-height: 100%" frameBorder="0" class="giphy-embed" allowFullScreen></iframe></div><p><a href="https://giphy.com/gifs/nfl-colts-indianapolis-2w3uNmvsIZ09hNzgzo">via GIPHY</a></p>
<!--kg-card-end: html-->
<p>For background, the National Football League (NFL) is the US professional sports league for American football, and the NFL season is followed by tens of millions of people, culminating in the annual Super Bowl (which attracts 100M+ global viewers, whether for the game or for the commercials). </p><p>Each NFL game takes place as a series of “plays,” in which the two teams try to score and prevent the other team from scoring. There are approximately 200 plays per game, with up to 15 games a week during the regular season. A healthy amount of data, but nothing unmanageable.  </p><p>So, at first glance, football game metrics might not immediately jump out as anything special. </p><p>But then the NFL did something pretty ambitious and amazing. </p><p>All <a href="https://operations.nfl.com/gameday/technology/nfl-next-gen-stats/">NFL players are equipped with RFID chips</a> that track players’ position, speed, and various other metrics, which teams use to identify trends, mitigate risks, and continuously optimize.  The NFL started tracking and storing data for every player on the field, for every play, for every game. </p><p>As a result, we now have access to a very detailed analysis of exactly how a play unfolded, how quickly various players accelerated during each play, and the play’s outcome. A traditional view of play-by-play metrics is “down and distance” and the result of the play (yards gained, whether or not there was a score, and so on). With the NFL’s dataset, we're able to mine approximately 100 data points at 100-millisecond intervals throughout the play to see speed, distance, involved players, and much more.</p><p>This isn’t ordinary data. <a href="https://timescale.ghost.io/blog/blog/what-the-heck-is-time-series-data-and-why-do-i-need-a-time-series-database-dcf3b1b18563/">This is time-series data</a>. Time-series data is a sequence of data points collected over time intervals, giving us the ability to track changes over time. 
In the case of the NFL’s dataset, we have time-series data that represents how a play changes, including the locations of the players on the field, the location of the ball, the relative acceleration of players in the field of play, and so much more.</p><p>Time-series data comes at you fast, sometimes generating millions of data points per second (<a href="https://timescale.ghost.io/blog/blog/what-the-heck-is-time-series-data-and-why-do-i-need-a-time-series-database-dcf3b1b18563/">read more about time-series data</a>). Because of the sheer volume and rate of information, time-series data can already be complex to query and analyze, which is why we built TimescaleDB, a petabyte-scale, relational database for <a href="https://www.tigerdata.com/blog/time-series-introduction" rel="noreferrer">time series</a>. </p><p>We couldn't pass up the opportunity to look at the NFL dataset with TimescaleDB, exploring ways we could peer deeper into player performance in hopes of providing insights about overall player performance in the coming season. </p><p>Read on for more information about the <a href="https://www.kaggle.com/c/nfl-big-data-bowl-2021/overview">NFL’s dataset</a> and how you can start using it, plus some sample queries to jumpstart your analysis. They may help you get more enjoyment out of the game.</p><p><strong>If you’d like to get started with NFL data, you can spin up a fully managed TimescaleDB service</strong>: create an account to <a href="https://console.cloud.timescale.com/signup">try it for free</a> for 30 days. 
The instructions later in this post will take you through how to ingest the data and start using it for analysis.</p><p>If you’re new to time-series data or just have some questions you’d like to ask about the dataset, <a href="https://slack.timescale.com">join our public Slack community</a>, where you’ll find Timescale team members and thousands of time-series enthusiasts, and we’ll be happy to help you.</p><h2 id="the-nfl-time-series-dataset">The NFL Time Series Dataset</h2><p><br>Over the last few years, the NFL and Kaggle have collaborated on the <a href="https://www.kaggle.com/c/nfl-big-data-bowl-2021/overview">NFL Big Data Bowl</a>. The goal is to use historical data to answer a predetermined genre of questions, typically producing a machine learning model that can help predict the outcome of certain plays during regular season games.</p><p>Although the 2020/2021 contest is over, the sample dataset they provided from a prior season is still available for download and analysis. The 2020/2021 competition focused on pass-play defense efficiency; therefore, only the tracking data for offensive and defensive "playmakers" is available in the dataset. No offensive or defensive linemen data is included. (You can read more about <a href="https://www.kaggle.com/c/nfl-big-data-bowl-2021/discussion/217170">last year’s winners</a>.)</p><p>(Keep watching the <a href="https://operations.nfl.com/gameday/analytics/big-data-bowl/">NFL website</a> for more information on the next Big Data Bowl.)</p><h2 id="accessing-the-data">Accessing the Data</h2><p><br>For the purposes of this blog post and accompanying tutorial, we will use <a href="https://www.kaggle.com/c/nfl-big-data-bowl-2021/overview">the sample data provided by the NFL</a>. This data is from the 2018 NFL season and is available as CSV files, including game-specific data and week-by-week tracking data for each player involved in the "offensive" part of the pass play. 
Participants in the next season of the contest will have access to new weekly game data.</p><p>This data is also very relational in nature, which means that SQL is a great medium to start gleaning value – without the need for Jupyter notebooks, other data science specific languages (like Python or R), or additional toolsets. </p><p>If you want to follow along with - or recreate! - the queries we go through below, <a href="https://docs.timescale.com/timescaledb/latest/tutorials/nfl-analytics/">follow our tutorial</a> to set up the tables, ingest data, and start analyzing data in TimescaleDB. For those unfamiliar with TimescaleDB, it’s built on PostgreSQL, so you’ll find that all of our queries are standard SQL. If you know SQL, you’ll know how to do everything here. (Some of the more advanced query examples we provide require our new, advanced hyperfunctions, which come pre-installed with any <a href="https://console.cloud.timescale.com" rel="noreferrer">Timescale instance</a>.)</p><h2 id="lets-start-exploring-time-series-insights">Let's Start Exploring Time-Series Insights!</h2><p><br>We've provided the steps needed to ingest the dataset into TimescaleDB in the <a href="https://docs.timescale.com/timescaledb/latest/tutorials/nfl-analytics/">accompanying tutorial</a>, so we won’t go into that here. </p><p>The NFL dataset includes the following data:</p>
<ul>
<li><strong>Games</strong>: all relevant data about each game of the regular season, including date, teams, time, and location</li>
<li><strong>Players</strong>: information on each player, including what team they play for and their originating college</li>
<li><strong>Plays</strong>: a wealth of data about each pass play in the game. Helpful fields include the down, description of the play that happened, line of scrimmage, and total offensive yardage, among other details.</li>
<li><strong>Week [1-17]</strong>: for each week of the season, the NFL provides a new CSV file with the tracking data of every player, for every play (pass plays for this data). Interesting fields include X/Y position data (relative to the football field) every few hundred milliseconds throughout each play, player acceleration, and the "type" of a route that was taken. (In our tutorial, this data is imported into the <code>tracking</code> table and totals almost 20 million rows of time-series data.)</li>
</ul>
<p>In addition to the NFL dataset, we also provide some extra data from Wikipedia that includes game scores and stadium conditions for each game, which you can load as part of the tutorial. With other time-series databases, it can be difficult to combine your time-series data with any other data you may have on hand (see <a href="https://timescale.ghost.io/blog/blog/timescaledb-vs-influxdb-for-time-series-data-timescale-influx-sql-nosql-36489299877/">our TimescaleDB vs. InfluxDB comparison</a> for reference). </p><p>Because TimescaleDB is PostgreSQL with time-series superpowers, it supports JOINS, so any extra relational data you want to add for deeper analysis is just a SQL query away. In our case, we’re able to combine the NFL’s play-by-play data along with weather data for each stadium.</p><p>Once you have the data ready, the world of NFL playmakers is at your fingertips, so let’s get started!</p><h2 id="the-power-of-sql">The Power of SQL</h2><p>Year after year, we see SQL listed as one of the most popular languages among developers on the <a href="https://insights.stackoverflow.com/survey/2020#technology-programming-scripting-and-markup-languages-all-respondents">Stack Overflow survey</a>. Sometimes, however, we can be lured into thinking that the only way to gain insights from relational data is to query it with powerful data analytics tools and languages, create data frames, and use specialized regression algorithms before we can do anything productive.</p><p>SQL, it often feels, is only useful for getting and storing data in applications and that we need to leave the "heavy lifting" of analysis to more mature tools.</p><p>Not so! SQL can data munge with the best of them! 
Let's look at a first, quick example.</p><h3 id="average-yards-per-position-per-game">Average yards per position, per game</h3><p>For this first example, we'll query the <code>tracking</code> table (the player movement data from all 17 weeks of games) and join to the <code>game</code> table to determine the number of yards per player position, per game.</p>
<p>The results give you a quick overview of how many yards different positions ran throughout each game. You could use this later to compare a specific player's yardage against the positional average to see whether they ran more or fewer yards than that total.</p><pre><code class="language-SQL">WITH total_position_yards AS (
	SELECT sum(dis) position_yards, POSITION, gameid FROM tracking t 
	GROUP BY POSITION, gameid)
SELECT avg(position_yards), position, game_date
FROM game g
INNER JOIN total_position_yards tpy ON g.game_id = tpy.gameid
WHERE POSITION IN ('QB','RB','WR','TE')
GROUP BY game_date, POSITION;
</code></pre>
<h3 id="number-of-plays-by-offensive-player"><br>Number of plays by offensive player</h3><p>As a season progresses and players get injured (or traded), it's helpful to know which of the available players have more playing experience, rather than those that have been sitting on the sideline for most of the season. Players with more playing time are often able to contribute to the outcome of the game.</p><p>This query finds all players that were on the offense for any play and counts how many total passing plays they have been a part of, ordered by total passing plays descending.</p><pre><code class="language-SQL">WITH snap_events AS (
-- Create a table that filters the play events to show only snap plays
-- and display the players team information
 SELECT DISTINCT player_id, t.event, t.gameid, t.playid,
   CASE
     WHEN t.team = 'away' THEN g.visitor_team
     WHEN t.team = 'home' THEN g.home_team
     ELSE NULL
     END AS team_name
 FROM tracking t
 LEFT JOIN game g ON t.gameid = g.game_id
 WHERE t.event IN ('snap_direct','ball_snap')
)
-- Count these events &amp; filter results to only display data when the player was
-- on the offensive
SELECT a.player_id, pl.display_name, COUNT(a.event) AS play_count, a.team_name
FROM snap_events a
LEFT JOIN play p ON a.gameid = p.gameid AND a.playid = p.playid
LEFT JOIN player pl ON a.player_id = pl.player_id
WHERE a.team_name = p.possessionteam
GROUP BY a.player_id, pl.display_name, a.team_name
ORDER BY play_count DESC;
</code></pre>
<table>
<thead>
<tr>
<th>player_id</th>
<th>display_name</th>
<th>play_count</th>
<th>team_name</th>
</tr>
</thead>
<tbody>
<tr>
<td>2506109</td>
<td>Ben Roethlisberger</td>
<td>725</td>
<td>PIT</td>
</tr>
<tr>
<td>2558149</td>
<td>JuJu Smith-Schuster</td>
<td>691</td>
<td>PIT</td>
</tr>
<tr>
<td>2533031</td>
<td>Andrew Luck</td>
<td>683</td>
<td>IND</td>
</tr>
<tr>
<td>2508061</td>
<td>Antonio Brown</td>
<td>679</td>
<td>PIT</td>
</tr>
<tr>
<td>310</td>
<td>Matt Ryan</td>
<td>659</td>
<td>ATL</td>
</tr>
<tr>
<td>2506363</td>
<td>Aaron Rodgers</td>
<td>656</td>
<td>GB</td>
</tr>
<tr>
<td>2505996</td>
<td>Eli Manning</td>
<td>639</td>
<td>NYG</td>
</tr>
<tr>
<td>2543495</td>
<td>Davante Adams</td>
<td>630</td>
<td>GB</td>
</tr>
<tr>
<td>2540158</td>
<td>Zach Ertz</td>
<td>629</td>
<td>PHI</td>
</tr>
<tr>
<td>2532820</td>
<td>Kirk Cousins</td>
<td>621</td>
<td>MIN</td>
</tr>
<tr>
<td>79860</td>
<td>Matthew Stafford</td>
<td>619</td>
<td>DET</td>
</tr>
<tr>
<td>2504211</td>
<td>Tom Brady</td>
<td>613</td>
<td>NE</td>
</tr>
</tbody>
</table>
<p>If you’re familiar with American football, you might know that players are substituted in and out of the game based on game conditions. Stronger, larger players may play in some situations, while faster, more agile players may play in others. </p><p>Quarterbacks, however, are the most “important” players on the field, and tend to play more than others. By omitting quarterbacks from the query, we can get a deeper insight into players across all other positions.</p><pre><code class="language-SQL">WITH snap_events AS (
-- Create a table that filters the play events to show only snap plays
-- and display the players team information
 SELECT DISTINCT player_id, t.event, t.gameid, t.playid,
   CASE
     WHEN t.team = 'away' THEN g.visitor_team
     WHEN t.team = 'home' THEN g.home_team
     ELSE NULL
     END AS team_name
 FROM tracking t
 LEFT JOIN game g ON t.gameid = g.game_id
 WHERE t.event IN ('snap_direct','ball_snap')
)
-- Count these events &amp; filter results to only display data when the player was
-- on the offensive
SELECT a.player_id, pl.display_name, COUNT(a.event) AS play_count, a.team_name, pl."position"
FROM snap_events a
LEFT JOIN play p ON a.gameid = p.gameid AND a.playid = p.playid
LEFT JOIN player pl ON a.player_id = pl.player_id
WHERE a.team_name = p.possessionteam AND pl."position" != 'QB'
GROUP BY a.player_id, pl.display_name, a.team_name, pl."position"
ORDER BY play_count DESC;
</code></pre>
<p>So, now we can see the non-quarterbacks who are on offense the most in a season:</p><table>
<thead>
<tr>
<th>player_id</th>
<th>display_name</th>
<th>play_count</th>
<th>team_name</th>
<th>position</th>
</tr>
</thead>
<tbody>
<tr>
<td>2558149</td>
<td>JuJu Smith-Schuster</td>
<td>691</td>
<td>PIT</td>
<td>WR</td>
</tr>
<tr>
<td>2508061</td>
<td>Antonio Brown</td>
<td>679</td>
<td>PIT</td>
<td>WR</td>
</tr>
<tr>
<td>2543495</td>
<td>Davante Adams</td>
<td>630</td>
<td>GB</td>
<td>WR</td>
</tr>
<tr>
<td>2540158</td>
<td>Zach Ertz</td>
<td>629</td>
<td>PHI</td>
<td>TE</td>
</tr>
<tr>
<td>2541785</td>
<td>Adam Thielen</td>
<td>612</td>
<td>MIN</td>
<td>WR</td>
</tr>
<tr>
<td>2543468</td>
<td>Mike Evans</td>
<td>610</td>
<td>TB</td>
<td>WR</td>
</tr>
<tr>
<td>2555295</td>
<td>Sterling Shepard</td>
<td>610</td>
<td>NYG</td>
<td>WR</td>
</tr>
<tr>
<td>2540169</td>
<td>Robert Woods</td>
<td>604</td>
<td>LA</td>
<td>WR</td>
</tr>
<tr>
<td>2552600</td>
<td>Nelson Agholor</td>
<td>604</td>
<td>PHI</td>
<td>WR</td>
</tr>
<tr>
<td>2543488</td>
<td>Jarvis Landry</td>
<td>592</td>
<td>CLE</td>
<td>WR</td>
</tr>
<tr>
<td>2540165</td>
<td>DeAndre Hopkins</td>
<td>587</td>
<td>HOU</td>
<td>WR</td>
</tr>
<tr>
<td>2543498</td>
<td>Brandin Cooks</td>
<td>581</td>
<td>LA</td>
<td>WR</td>
</tr>
</tbody>
</table>
<h3 id="sack-percentage-by-quarterback-on-passing-plays">Sack percentage by quarterback on passing plays</h3><p>We can start to go a little deeper by extracting specific data from the <code>tracking</code> table and layering queries on top of it to make correlations. One piece of information that might be helpful in your analysis is knowing which quarterbacks are sacked most often during passing plays. In football, a “sack” is a negative play for the offense, and quarterbacks who get sacked more often tend to be lower performers overall.</p>
<p>Once you know those players, you could expand your analysis to see if they are sacked more on specific types of plays (shotgun formation) or maybe if sacks occur more often in a specific quarter of the game (maybe the fourth quarter because the offensive line is more tired, or the team tends to be behind late in games and must pass more often).</p><p>Queries like this can quickly show you quarterbacks that are more likely to get sacked, particularly when they play a strong defensive team.<br><br>To get started, we wanted to find the sack percentage of each quarterback based on the total number of pass plays they were involved in during the regular season. To do that we approached the tracking data by layering on Common Table Expressions so that each query could build upon previous results.</p><p>First, we select the distinct list of all plays, for each quarterback (<code>qb_plays</code>). The reason we do a <code>SELECT DISTINCT…</code> is because the tracking table holds multiple entries for each player, for each play. We just need one row for each play, for each quarterback.</p>
<p>With this result, we can then count the number of total plays per quarterback (<code>total_qb_plays</code>), the total number of games each quarterback played (<code>qb_games</code>) and then finally the number of pass plays the quarterback was a part of that resulted in a sack (<code>sacks</code>).</p>
<p>With that data in hand, we can finally query all of the values, do a percentage calculation, and order it by the total sack count.</p><pre><code class="language-SQL">WITH qb_plays AS (
	SELECT DISTINCT ON (POSITION, playid, gameid) POSITION, playid, player_id, gameid 
	FROM tracking t 
	WHERE POSITION = 'QB'
),
total_qb_plays AS (
	SELECT count(*) play_count, player_id FROM qb_plays
	GROUP BY player_id
),
qb_games AS (
	SELECT count(DISTINCT gameid) game_count, player_id FROM qb_plays 
	GROUP BY player_id
),
sacks AS (
	SELECT count(*) sack_count, player_id 
	FROM play p
	INNER JOIN qb_plays ON p.gameid = qb_plays.gameid AND p.playid = qb_plays.playid
	WHERE p.passresult = 'S'
	GROUP BY player_id
)
SELECT play_count, game_count, sack_count, (sack_count/play_count::float)*100 sack_percentage, display_name FROM total_qb_plays tqp
INNER JOIN qb_games qg ON tqp.player_id = qg.player_id
LEFT JOIN sacks s ON s.player_id = qg.player_id
INNER JOIN player ON tqp.player_id = player.player_id
ORDER BY sack_count DESC NULLS last;
</code></pre>
<p>If you're an ardent football fan, the results from 2018 probably don't surprise you.</p><table>
<thead>
<tr>
<th>play_count</th>
<th>game_count</th>
<th>sack_count</th>
<th>sack_percentage</th>
<th>display_name</th>
</tr>
</thead>
<tbody>
<tr>
<td>579</td>
<td>16</td>
<td>65</td>
<td>11.23</td>
<td>Deshaun Watson</td>
</tr>
<tr>
<td>602</td>
<td>16</td>
<td>55</td>
<td>9.14</td>
<td>Dak Prescott</td>
</tr>
<tr>
<td>611</td>
<td>16</td>
<td>53</td>
<td>8.67</td>
<td>Derek Carr</td>
</tr>
<tr>
<td>656</td>
<td>16</td>
<td>49</td>
<td>7.47</td>
<td>Aaron Rodgers</td>
</tr>
<tr>
<td>462</td>
<td>15</td>
<td>48</td>
<td>10.39</td>
<td>Russell Wilson</td>
</tr>
<tr>
<td>639</td>
<td>16</td>
<td>47</td>
<td>7.36</td>
<td>Eli Manning</td>
</tr>
<tr>
<td>448</td>
<td>14</td>
<td>45</td>
<td>10.04</td>
<td>Josh Rosen</td>
</tr>
<tr>
<td>659</td>
<td>16</td>
<td>43</td>
<td>6.53</td>
<td>Matt Ryan</td>
</tr>
<tr>
<td>386</td>
<td>14</td>
<td>43</td>
<td>11.14</td>
<td>Marcus Mariota</td>
</tr>
<tr>
<td>619</td>
<td>16</td>
<td>41</td>
<td>6.62</td>
<td>Matthew Stafford</td>
</tr>
<tr>
<td>621</td>
<td>15</td>
<td>38</td>
<td>6.12</td>
<td>Kirk Cousins</td>
</tr>
<tr>
<td>324</td>
<td>11</td>
<td>37</td>
<td>11.42</td>
<td>Ryan Tannehill</td>
</tr>
<tr>
<td>447</td>
<td>11</td>
<td>36</td>
<td>8.05</td>
<td>Carson Wentz</td>
</tr>
</tbody>
</table>
<p>Of course, there are a few quarterbacks that always seem to have a way of avoiding a sack.</p><table>
<thead>
<tr>
<th>play_count</th>
<th>game_count</th>
<th>sack_count</th>
<th>sack_percentage</th>
<th>display_name</th>
</tr>
</thead>
<tbody>
<tr>
<td>725</td>
<td>16</td>
<td>25</td>
<td>3.45</td>
<td>Ben Roethlisberger</td>
</tr>
<tr>
<td>682</td>
<td>16</td>
<td>22</td>
<td>3.23</td>
<td>Andrew Luck</td>
</tr>
<tr>
<td>613</td>
<td>16</td>
<td>21</td>
<td>3.43</td>
<td>Tom Brady</td>
</tr>
</tbody>
</table>
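The sack_percentage column in these tables is simply sack_count divided by play_count, expressed as a percentage. As a quick sanity check - a minimal sketch using values copied from the result tables above - the arithmetic can be reproduced in a few lines of Python:

```python
def sack_percentage(sack_count: int, play_count: int) -> float:
    """Percentage of a quarterback's pass plays that ended in a sack."""
    return round(sack_count / play_count * 100, 2)

# Spot-check rows from the tables above
print(sack_percentage(65, 579))  # Deshaun Watson -> 11.23
print(sack_percentage(25, 725))  # Ben Roethlisberger -> 3.45
```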
<p>Now, let’s try some more “advanced” queries and analyses.</p><h2 id="faster-insights-with-postgresql-and-timescaledb">Faster Insights With PostgreSQL and TimescaleDB</h2><p>So far, the queries we've shown are interesting and help provide insights into various players throughout the season – but if you were looking closely, they're all regular SQL statements. </p><p>A season of NFL tracking data isn't quite like typical time-series data, however: most of the queries we want to perform need to examine all 20 million rows in some way.</p><p>This is where a tool that's been built for time-series analysis, even when the data isn't typical time-series data, can significantly improve your ability to examine the data and save money at the same time.</p><h2 id="faster-queries-with-timescaledb-continuous-aggregates">Faster Queries With TimescaleDB Continuous Aggregates</h2><p>We noticed that we often needed to build queries that started with the <code>tracking</code> table, filtering data by specific players, positions, and games. Part of the reason is that the <code>play</code> table doesn't list all of the players who were involved in a particular play. As a result, we need to cross-reference the <code>tracking</code> table to identify the players who were involved in any given play.</p>
<p>The first example query we demonstrated - “average yards per position, per game” - is a good example of this. The query begins by summing all yards, by position, for each game.</p><p>This means that every row in <code>tracking</code> has to be read and aggregated <em>before</em> we can do any other analysis. Scanning those 20 million rows is pretty boring, repetitive, and slow work – especially compared to the analysis we want to do!</p>
<p>On our small test instance, the "average yards" query takes about 8 seconds to run. We could increase the size of the instance (which will cost us more money), or we could be smarter about how we query the data (which will cost us more time).</p><p>Instead, we can use continuous aggregates to pre-aggregate the data we're querying over and over again, which reduces the amount of work TimescaleDB needs to do every time we run the query. (Continuous aggregates are like PostgreSQL materialized views. For more info, check out our <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/continuous-aggregates/">continuous aggregates docs</a>.)</p><pre><code class="language-SQL">CREATE MATERIALIZED VIEW player_yards_by_game_
WITH (timescaledb.continuous) AS
SELECT player_id, position, gameid,
 time_bucket(INTERVAL '1 day', "time") AS bucket,
 SUM(dis) AS yards
FROM tracking t
GROUP BY player_id, position, gameid, bucket;
</code></pre>
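The time_bucket call in this view floors each timestamp to the start of its day-wide bucket. Conceptually - a hedged Python sketch assuming plain epoch-second timestamps, and ignoring time_bucket's actual origin and timezone handling - it works like a modulo floor:

```python
def time_bucket(width_seconds: int, ts: int) -> int:
    """Floor a timestamp (epoch seconds) to the start of its bucket,
    mirroring the idea behind time_bucket(INTERVAL '1 day', "time")."""
    return ts - (ts % width_seconds)

DAY = 86_400
# Two timestamps on the same day land in the same bucket
print(time_bucket(DAY, 1_600_000_000), time_bucket(DAY, 1_600_040_000))
```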
<p>After running this query and creating a continuous aggregate, we can modify that first query just slightly, using this as our basis table.</p><pre><code class="language-SQL">WITH total_position_yards AS (
	SELECT sum(yards) position_yards, POSITION, gameid 
	FROM player_yards_by_game t
	GROUP BY POSITION, gameid)
SELECT avg(position_yards), position, game_date
FROM game g
INNER JOIN total_position_yards tpy ON g.game_id = tpy.gameid
WHERE POSITION IN ('QB','RB','WR','TE')
GROUP BY game_date, POSITION
ORDER BY game_date, position;
</code></pre>
<p>We get the same result, but now the query runs in 100ms - <strong>80x faster</strong>!</p><h2 id="advanced-sql-data-analysis-with-timescaledb-hyperfunctions">Advanced SQL Data Analysis With TimescaleDB Hyperfunctions</h2><p>Finally, the more we dug into the data, the more we found we needed (or wanted) functions specifically tuned for time-series data analysis to answer the types of questions we wanted to ask.</p><p>It is for this kind of analysis that we built <a href="https://timescale.ghost.io/blog/blog/introducing-hyperfunctions-new-sql-functions-to-simplify-working-with-time-series-data-in-postgresql/">TimescaleDB hyperfunctions</a>, a series of SQL functions within TimescaleDB that make it easier to manipulate and analyze time-series data in PostgreSQL with fewer lines of code.</p><h3 id="grouping-data-into-percentiles">Grouping data into percentiles</h3><p>The NFL dataset is a great use case for <a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/">percentiles</a>. Being able to quickly find players that perform better or worse than some cohort is really powerful.</p><p>As an example, we'll use the same continuous aggregate we created earlier (total yards, per game, per player) to find the median total yards traveled by position for each game.</p><pre><code class="language-SQL">WITH sum_yards AS (
--Add position to the table to allow for grouping by it later
 SELECT a.player_id, display_name, SUM(yards) AS yards, p.position, gameid
 FROM player_yards_by_game a
 LEFT JOIN player p ON a.player_id = p.player_id
 GROUP BY a.player_id, display_name, p.position, gameid
)
--Find the mean and median for each position type
SELECT position, mean(percentile_agg(yards)) AS mean_yards, approx_percentile(0.5, percentile_agg(yards)) AS median_yards
FROM sum_yards
WHERE POSITION IS NOT null
GROUP BY position
ORDER BY mean_yards DESC;
</code></pre>
<table>
<thead>
<tr>
<th>position</th>
<th>mean_yards</th>
<th>median_yards</th>
</tr>
</thead>
<tbody>
<tr>
<td>FS</td>
<td>595.583433048431</td>
<td>626.388099960848</td>
</tr>
<tr>
<td>CB</td>
<td>572.3336749867212</td>
<td>592.2175990890378</td>
</tr>
<tr>
<td>WR</td>
<td>552.6508570179277</td>
<td>555.5030569048633</td>
</tr>
<tr>
<td>S</td>
<td>530.6436781609186</td>
<td>550.5961518474892</td>
</tr>
<tr>
<td>SS</td>
<td>522.5604103343453</td>
<td>551.1296628916651</td>
</tr>
<tr>
<td>MLB</td>
<td>462.70229007633407</td>
<td>490.77906906009343</td>
</tr>
<tr>
<td>ILB</td>
<td>402.7882871125599</td>
<td>403.3779668359464</td>
</tr>
<tr>
<td>OLB</td>
<td>393.40014271151847</td>
<td>390.6742117791442</td>
</tr>
<tr>
<td>QB</td>
<td>334.7025466893028</td>
<td>352.1192705472368</td>
</tr>
<tr>
<td>LB</td>
<td>328.9812527472519</td>
<td>257.72003396053884</td>
</tr>
<tr>
<td>TE</td>
<td>327.9515596330271</td>
<td>257.72003396053884</td>
</tr>
</tbody>
</table>
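For intuition on what the hyperfunctions above estimate: approx_percentile(0.5, percentile_agg(...)) approximates the exact median, while mean(percentile_agg(...)) corresponds to the plain average. Over a small, made-up set of per-game yardage sums (illustrative numbers only, not taken from the dataset), the exact equivalents in plain Python are:

```python
import statistics

# Hypothetical per-game yardage totals for one position (illustrative only)
yards = [310.5, 455.0, 480.2, 510.7, 530.1, 612.3, 690.8]

mean_yards = statistics.fmean(yards)     # the plain average
median_yards = statistics.median(yards)  # the exact 0.5 percentile
print(round(mean_yards, 1), median_yards)
```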
<h3 id="finding-extreme-outliers">Finding extreme outliers</h3><p>Finally, we can build upon this percentile query to find players at each position that run more than 95% of all other players at that position. For some positions, like wide receiver or free safety, this could help us find the “outlier” players that are able to travel the field consistently throughout a game – and make plays!</p><pre><code class="language-SQL">WITH sum_yards AS (
--Add position to the table to allow for grouping by it later
 SELECT a.player_id, display_name, SUM(yards) AS yards, p.position
 FROM player_yards_by_game a
 LEFT JOIN player p ON a.player_id = p.player_id
 GROUP BY a.player_id, display_name, p.position
),
position_percentile AS (
	SELECT POSITION, approx_percentile(0.95, percentile_agg(yards)) AS p95
	FROM sum_yards 
	GROUP BY position
)
SELECT a.POSITION, a.display_name, yards, p95
	FROM sum_yards a
	LEFT JOIN position_percentile pp ON a.POSITION = pp.position
	WHERE yards &gt;= p95
AND a.POSITION IN ('WR','FS','QB','TE')
ORDER BY position;
</code></pre>
<table>
<thead>
<tr>
<th>position</th>
<th>display_name</th>
<th>yards</th>
<th>p95</th>
</tr>
</thead>
<tbody>
<tr>
<td>FS</td>
<td>Eric Weddle</td>
<td>13869.759999999997</td>
<td>12320.288323166456</td>
</tr>
<tr>
<td>FS</td>
<td>Adrian Amos</td>
<td>12989.439999999966</td>
<td>12320.288323166456</td>
</tr>
<tr>
<td>FS</td>
<td>Tyrann Mathieu</td>
<td>12565.219999999956</td>
<td>12320.288323166456</td>
</tr>
<tr>
<td>QB</td>
<td>Aaron Rodgers</td>
<td>7422.35999999995</td>
<td>6667.51452813257</td>
</tr>
<tr>
<td>QB</td>
<td>Patrick Mahomes</td>
<td>6985.989999999952</td>
<td>6667.51452813257</td>
</tr>
<tr>
<td>QB</td>
<td>Matt Ryan</td>
<td>6759.959999999969</td>
<td>6667.51452813257</td>
</tr>
<tr>
<td>TE</td>
<td>Zach Ertz</td>
<td>13124.58999999995</td>
<td>10667.986199523099</td>
</tr>
<tr>
<td>TE</td>
<td>Jimmy Graham</td>
<td>12693.679999999982</td>
<td>10667.986199523099</td>
</tr>
<tr>
<td>TE</td>
<td>Travis Kelce</td>
<td>12218.129999999957</td>
<td>10667.986199523099</td>
</tr>
<tr>
<td>TE</td>
<td>David Njoku</td>
<td>11502.159999999965</td>
<td>10667.986199523099</td>
</tr>
<tr>
<td>TE</td>
<td>George Kittle</td>
<td>11058.099999999975</td>
<td>10667.986199523099</td>
</tr>
<tr>
<td>TE</td>
<td>Kyle Rudolph</td>
<td>10761.949999999968</td>
<td>10667.986199523099</td>
</tr>
<tr>
<td>TE</td>
<td>Jared Cook</td>
<td>10678.22999999998</td>
<td>10667.986199523099</td>
</tr>
<tr>
<td>WR</td>
<td>Antonio Brown</td>
<td>16877.559999999965</td>
<td>14271.23409723974</td>
</tr>
<tr>
<td>WR</td>
<td>Brandin Cooks</td>
<td>15510.01999999995</td>
<td>14271.23409723974</td>
</tr>
<tr>
<td>WR</td>
<td>JuJu Smith-Schuster</td>
<td>15492.76999999996</td>
<td>14271.23409723974</td>
</tr>
<tr>
<td>WR</td>
<td>Robert Woods</td>
<td>15253.179999999958</td>
<td>14271.23409723974</td>
</tr>
<tr>
<td>WR</td>
<td>Nelson Agholor</td>
<td>15180.32999999997</td>
<td>14271.23409723974</td>
</tr>
<tr>
<td>WR</td>
<td>Tyreek Hill</td>
<td>15106.609999999973</td>
<td>14271.23409723974</td>
</tr>
<tr>
<td>WR</td>
<td>Zay Jones</td>
<td>14790.589999999967</td>
<td>14271.23409723974</td>
</tr>
<tr>
<td>WR</td>
<td>Sterling Shepard</td>
<td>14673.79999999996</td>
<td>14271.23409723974</td>
</tr>
<tr>
<td>WR</td>
<td>Mike Evans</td>
<td>14620.129999999983</td>
<td>14271.23409723974</td>
</tr>
<tr>
<td>WR</td>
<td>Davante Adams</td>
<td>14574.509999999951</td>
<td>14271.23409723974</td>
</tr>
<tr>
<td>WR</td>
<td>Kenny Golladay</td>
<td>14354.499999999973</td>
<td>14271.23409723974</td>
</tr>
<tr>
<td>WR</td>
<td>Jarvis Landry</td>
<td>14281.509999999971</td>
<td>14271.23409723974</td>
</tr>
</tbody>
</table>
<h2 id="where-can-the-data-take-you">Where Can the Data Take You?</h2><p>As you’ve seen in this example, <strong>time-series data is everywhere</strong>. Being able to harness it gives you a huge advantage, whether you’re working on a professional solution or a personal project.</p><p>We’ve shown you a few ways that time-series queries can unlock interesting insights, give you a greater appreciation for the game and its players, and (hopefully) inspired you to dig into the data yourself.</p><p><strong>To get started with the </strong><a href="https://www.kaggle.com/c/nfl-big-data-bowl-2021/overview"><strong>NFL data</strong></a><strong>:</strong></p><ul><li><strong>Spin up a fully managed TimescaleDB service</strong>: create an account to <a href="https://console.cloud.timescale.com/signup">try it for free</a> for 30 days.</li><li><a href="https://docs.timescale.com/timescaledb/latest/tutorials/nfl-analytics/">Follow our complete tutorial</a> for step-by-step instructions for preparing and ingesting the dataset, along with several more queries to help you glean insights from the dataset.</li></ul><p>If you’re new to time-series data or just have some questions about how to use TimescaleDB to analyze the NFL’s dataset, <a href="https://slack.timescale.com">join our public Slack community</a>. You’ll find Timescale engineers and thousands of time-series enthusiasts from around the world – and we’ll be happy to help you.</p><p>🙏 We’d like to thank the NFL for making this data available and the millions of passionate fans around the world who make the NFL such an exciting game to watch.</p><p>And, Geaux Saints 🏈!</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Improving DISTINCT Query Performance Up to 8,000x on PostgreSQL]]></title>
            <description><![CDATA[Learn common performance pitfalls and discover techniques to optimize your DISTINCT PostgreSQL queries.]]></description>
            <link>https://www.tigerdata.com/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[PostgreSQL Performance]]></category>
            <dc:creator><![CDATA[Sven Klemm]]></dc:creator>
            <pubDate>Thu, 06 May 2021 11:01:13 GMT</pubDate>
            <media:content medium="image" url="https://timescale.ghost.io/blog/content/images/2021/05/pexels-pixabay-373543.jpg">
            </media:content>
            <content:encoded><![CDATA[<p>PostgreSQL is an amazing database, but it can struggle with certain types of queries, especially as tables approach tens and hundreds of millions of rows (or more). <a href="https://www.timescale.com/learn/understanding-distinct-in-postgresql-with-examples" rel="noreferrer"><code>DISTINCT</code> queries</a> are an example of this.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/baby-yoda-tea.gif" class="kg-image" alt="Baby Yoda drinking tea" loading="lazy" width="320" height="320"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Waiting for our DISTINCT queries to return</em></i></figcaption></figure><p>Why are <code>DISTINCT</code> queries slow on PostgreSQL when they seem to ask an "easy" question? It turns out that PostgreSQL currently lacks the ability to efficiently pull a list of unique values from an ordered index. </p><div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-emoji">🔖</div><div class="kg-callout-text">Learning PostgreSQL? <a href="https://www.timescale.com/learn/understanding-distinct-in-postgresql-with-examples" rel="noreferrer">Read the basics on DISTINCT</a>.</div></div><p></p><p><br></p><p>Even when you have an index that matches the exact order and columns for these "last-point" queries, PostgreSQL is still forced to scan the entire index to find all unique values. As a table grows (and <a href="https://timescale.ghost.io/blog/blog/what-the-heck-is-time-series-data-and-why-do-i-need-a-time-series-database-dcf3b1b18563/">they grow quickly with time-series data</a>), this operation keeps getting slower.</p><p>Other databases, such as MySQL, Oracle, and DB2, implement a feature called "Loose index scan," "Index Skip Scan," or “Skip Scan,” to speed up the performance of queries like this. 
</p><p>When a database has a feature like "Skip Scan," it can incrementally jump from one ordered value to the next without reading all of the rows in between. <em>Without</em> support for this feature, the database engine has to scan the entire ordered index and then deduplicate it at the end—which is a much slower process.</p><p>Since 2018, there have been <a href="https://commitfest.postgresql.org/19/1741/">plans to support something similar</a> in PostgreSQL. <em>(<strong>Note</strong>: We couldn’t use this implementation directly due to some limitations of what is possible within the </em><a href="https://www.tigerdata.com/blog/top-8-postgresql-extensions" rel="noreferrer"><em>Postgres extension</em></a><em> framework.)</em></p><p>Unfortunately, this patch wasn't included in the <a href="https://commitfest.postgresql.org/32/">CommitFest</a> for PostgreSQL 14, so it won't be included until PostgreSQL 15 at the earliest (i.e., no sooner than Fall 2022, at least 1.5 years from now). </p><p>We don’t want our users to have to wait that long.</p><h2 id="what-is-timescales-skipscan">What is Timescale's SkipScan?</h2><p>Today, via TimescaleDB 2.2.1, we are releasing <strong>TimescaleDB SkipScan</strong>, a custom query planner node that makes ordered <code>DISTINCT</code> queries blazing fast in PostgreSQL 🔥. </p><p>As you'll see in the benchmarks below, <strong>some queries performed more than</strong> <strong>8,000x better than before</strong>—and many of the SQL queries your applications and analytics tools use could also see dramatic improvements with this new feature.</p><p>This feature works in both Timescale <a href="https://www.tigerdata.com/blog/database-indexes-in-postgresql-and-timescale-cloud-your-questions-answered" rel="noreferrer">hypertables</a> and normal PostgreSQL tables. 
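</p><p>As a rough illustration - a minimal Python sketch of the idea, not TimescaleDB's actual C implementation - a skip scan walks a sorted index by repeatedly jumping to the first entry greater than the value it just emitted, instead of reading every row and de-duplicating at the end:</p>

```python
import bisect

def skip_scan(sorted_index):
    """Collect distinct values from a sorted list by jumping past
    duplicates, rather than scanning every entry."""
    distinct = []
    i = 0
    while i < len(sorted_index):
        value = sorted_index[i]
        distinct.append(value)
        # "Skip" to the first position holding a value greater than this one
        i = bisect.bisect_right(sorted_index, value, i)
    return distinct

# A sorted index column with many duplicate keys, like tags_id below
print(skip_scan([1, 1, 1, 2, 2, 3, 3, 3, 3, 7]))  # -> [1, 2, 3, 7]
```

<p>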
</p><p>This means that with Timescale, not only will your time-series <code>DISTINCT</code> queries be faster, but <strong>any other related queries you may have on normal PostgreSQL tables will also be faster. </strong></p><p>This is because Timescale is not just a time-series database. It’s a relational database, specifically, a relational database for <a href="https://www.tigerdata.com/blog/time-series-introduction" rel="noreferrer">time series</a>. Developers who use Timescale benefit from a purpose-built time-series database plus a classic relational (Postgres) database, all in one, with full SQL support.</p><p>And to be clear, we love PostgreSQL. We employ engineers who contribute to PostgreSQL. We contribute to the ecosystem around PostgreSQL. PostgreSQL is the world’s fastest-growing database, and we are excited to support it alongside thousands of other users and contributors.</p><p>We constantly seek to advance the state of the art with databases, and features like SkipScan are only our latest contribution to the industry. SkipScan makes Timescale and PostgreSQL better, more competitive databases overall, especially compared to MySQL, Oracle, DB2, and others. </p><h3 id="how-to-check-and-optimize-your-query-performance-in-postgresql">How to check (and optimize) your query performance in PostgreSQL</h3><p>If you're new to PostgreSQL and are wondering how to check your query performance in the first place (and optimize it!), we're going to leave two helpful resources here:</p><ul><li><a href="https://www.timescale.com/forum/t/a-beginners-guide-to-explain-analyze/77">This beginner's guide to <code>EXPLAIN ANALYZE</code> </a>by Michael Christofides in one of our Timescale Community Days. 
And here's a blog post on <a href="https://www.timescale.com/learn/explaining-postgresql-explain" rel="noreferrer">Explaining EXPLAIN</a> in case you're more of a reader.</li></ul><figure class="kg-card kg-embed-card"><iframe width="200" height="113" src="https://www.youtube.com/embed/31EmOKBP1PY?start=1&amp;feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen="" title="A beginners guide to EXPLAIN ANALYZE – Michael Christofides"></iframe></figure><ul><li>And our <a href="https://timescale.ghost.io/blog/identify-postgresql-performance-bottlenecks-with-pg_stat_statements/">blog post on using pg_stat_statements to optimize queries</a>.</li></ul><h3 id="optimizing-distinct-query-performance-what-about-recursive-ctes">Optimizing DISTINCT query performance: What about RECURSIVE CTEs?</h3><p>However, if you're an experienced PostgreSQL user, you might point out that it is already possible to get reasonably fast <code>DISTINCT</code>queries via <code>RECURSIVE CTEs</code>.</p><p>From the <a href="https://wiki.postgresql.org/wiki/Loose_indexscan">PostgreSQL Wiki</a>, using a <code>RECURSIVE CTE</code> can get you good results, but writing these kinds of queries can often feel cumbersome and unintuitive, especially for developers new to PostgreSQL:</p><pre><code class="language-sql">WITH RECURSIVE cte AS (
   (SELECT tags_id FROM cpu ORDER BY tags_id, time DESC LIMIT 1)
   UNION ALL
   SELECT (
      SELECT tags_id FROM cpu
      WHERE tags_id &gt; t.tags_id 
      ORDER BY tags_id, time DESC LIMIT 1
   )
   FROM cte t
   WHERE t.tags_id IS NOT NULL
)
SELECT * FROM cte LIMIT 50;
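-- Note: each UNION ALL step above performs a single index probe for the
-- smallest tags_id greater than the previous one (a "loose index scan"),
-- so the whole query costs roughly one B-tree descent per distinct value
-- rather than a scan of every row. This relies on the (tags_id, time DESC)
-- index; without it, the rewrite loses its advantage.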
</code></pre><p>But even if writing a <code>RECURSIVE CTE</code> like this in day-to-day querying felt natural to you, there's a bigger problem. Most application developers, ORMs, and charting tools like Grafana or Tableau will still use the simpler, straightforward form:</p><pre><code class="language-sql">SELECT DISTINCT ON (tags_id) * FROM cpu
WHERE tags_id &gt;=1 
ORDER BY tags_id, time DESC
LIMIT 50;</code></pre><p>Other databases, such as MySQL, Oracle, and DB2, implement a feature called a "Loose index scan" or "Index Skip Scan." In PostgreSQL, without a "Skip Scan" node, this query will perform the much slower Index Only Scan, causing your applications and graphing tools to feel clunky and slow.</p><p>Surely there's a better way, right?</p><h2 id="skipscan-is-the-way">SkipScan Is the Way</h2><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Screen-Shot-2021-04-16-at-1.45.56-PM.png" class="kg-image" alt="" loading="lazy" width="1472" height="608" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/01/Screen-Shot-2021-04-16-at-1.45.56-PM.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/01/Screen-Shot-2021-04-16-at-1.45.56-PM.png 1000w, https://timescale.ghost.io/blog/content/images/2022/01/Screen-Shot-2021-04-16-at-1.45.56-PM.png 1472w" sizes="(min-width: 720px) 720px"></figure><p>SkipScan is an optimization for queries in the form of <code>SELECT DISTINCT ON</code> (column). Conceptually, a SkipScan is a regular IndexScan that “skips” across an index looking for the next value that is greater than the current value:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2021/05/skip-scan-illustration.png" class="kg-image" alt="Illustration of how a Skip Scan search works on a Btree index" loading="lazy" width="519" height="253"><figcaption><i><em class="italic" style="white-space: pre-wrap;">SkipScan: An index scan that “skips” across an index looking for the next greater value</em></i></figcaption></figure><p>With SkipScan in Timescale/PostgreSQL, query planning and execution can now utilize a new node (displayed as <code>(SkipScan)</code> in the <code>EXPLAIN</code> output) to quickly return distinct items from a properly ordered index. 
</p><p>Rather than scanning the entire index with an Index Only Scan, SkipScan incrementally searches for each successive item in the ordered index. As it locates one item, the <code>(SkipScan)</code> node quickly restarts the search for the next item. This is a <em>much</em> more efficient way of finding distinct items in an ordered index. (<a href="https://github.com/timescale/timescaledb/blob/master/tsl/src/nodes/skip_scan/exec.c">See GitHub for more details.</a>)</p><h2 id="benchmarking-timescaledb-skipscan-vs-a-normal-postgresql-index-scan">Benchmarking TimescaleDB SkipScan vs. a Normal PostgreSQL Index Scan</h2><p>In every example query, <strong>Timescale with SkipScan improved query response times by at least 26x</strong>. </p><div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-emoji">✨</div><div class="kg-callout-text">If you don't want to go through the entire benchmark, here's a short and sweet piece on <a href="https://www.timescale.com/blog/skip-scan-under-load/" rel="noreferrer">SkipScan's performance under load</a>.</div></div><p>But the real surprise is <strong>how much of a difference it makes at lower cardinalities with lots of data—</strong>it is <strong>almost 8,500x faster to retrieve <em>all columns</em> for the most recent reading of each device</strong>. 
That's fast!</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/01/mandalorian-ships.gif" class="kg-image" alt="Mandalorian Razor Crest being chased by X-wing fighters" loading="lazy" width="1280" height="720" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/01/mandalorian-ships.gif 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/01/mandalorian-ships.gif 1000w, https://timescale.ghost.io/blog/content/images/2022/01/mandalorian-ships.gif 1280w" sizes="(min-width: 720px) 720px"></figure><p>In our tests, <strong>SkipScan is also consistently faster—by 80x or more—in our 4,000-device benchmarks</strong>. (This level of cardinality is typical for many users of Timescale.)</p><p>Before we share the full results, here is how our benchmark was set up.</p><h3 id="benchmark-setup">Benchmark setup</h3><p>To perform our benchmarks, we installed Timescale on a DigitalOcean Droplet using the following specifications. PostgreSQL and Timescale were installed from packages, and we applied the recommended tuning from <a href="https://github.com/timescale/timescaledb-tune"><code>timescaledb-tune</code></a>.</p><ul><li>8 Intel vCPUs</li><li>16&nbsp;GB of RAM</li><li>320&nbsp;GB NVMe SSD</li><li>Ubuntu 20.04 LTS</li><li>Postgres 12.6</li><li>TimescaleDB 2.2 <em>(The first release with SkipScan. TimescaleDB 2.2.1 primarily adds distributed hypertable support and some bug fixes.)</em></li></ul><p>To demonstrate the performance impact of SkipScan on varying degrees of cardinality, we benchmarked three separate datasets of varying sizes. To generate our datasets, we used the 'cpu-only' use case in the <a href="https://github.com/timescale/tsbs">Time Series Benchmark Suite (TSBS)</a>, which creates 10 metrics every 10 seconds for each device (identified by the <code>tags_id</code> in our benchmark queries).</p><table>
<thead>
<tr>
<th>Dataset 1</th>
<th>Dataset 2</th>
<th>Dataset 3</th>
</tr>
</thead>
<tbody>
<tr>
<td>100 devices</td>
<td>4000 devices</td>
<td>10,000 devices</td>
</tr>
<tr>
<td>4 months of data</td>
<td>4 days of data</td>
<td>36 hours of data</td>
</tr>
<tr>
<td>~103,000,000 rows</td>
<td>~103,000,000 rows</td>
<td>~144,000,000 rows</td>
</tr>
</tbody>
</table>
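<p>For reference, the TSBS 'cpu-only' data maps onto a hypertable roughly like the sketch below. The table and column names are inferred from the queries in this post (the actual TSBS schema carries ten <code>usage_*</code> metric columns plus a separate <code>tags</code> table), so treat this as an illustration rather than the exact benchmark DDL:</p><pre><code class="language-sql">-- Hypothetical sketch of the benchmark hypertable (not the full TSBS DDL)
CREATE TABLE cpu (
    time       TIMESTAMPTZ NOT NULL,
    tags_id    INTEGER,
    usage_user DOUBLE PRECISION
    -- ...nine further usage_* metric columns in the real TSBS schema
);
SELECT create_hypertable('cpu', 'time');
CREATE INDEX cpu_tags_id_time_idx ON cpu (tags_id, time DESC);</code></pre>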
<h3 id="additional-data-preparation">Additional data preparation</h3><p>Not all device data is up to date in real life because devices go offline and internet connections get interrupted. Therefore, to simulate a more realistic scenario (i.e., that some devices had stopped reporting for a period of time), we deleted rows for random devices over each of the following periods.</p><table>
<thead>
<tr>
<th>Dataset 1</th>
<th>Dataset 2</th>
<th>Dataset 3</th>
</tr>
</thead>
<tbody>
<tr>
<td>5 random devices over:</td>
<td>100 random devices over:</td>
<td>250 random devices over:</td>
</tr>
<tr>
<td>30 minutes</td>
<td>1 hour</td>
<td>10 minutes</td>
</tr>
<tr>
<td>36 hours</td>
<td>12 hours</td>
<td>1 hour</td>
</tr>
<tr>
<td>7 days</td>
<td>36 hours</td>
<td>12 hours</td>
</tr>
<tr>
<td>1 month</td>
<td>3 days</td>
<td>24 hours</td>
</tr>
</tbody>
</table>
<p>To delete the data, we utilized the <code>TABLESAMPLE</code> clause of Postgres. This <code>SELECT</code> feature allows you to return a random sample of rows from a table based on a percentage of the total rows. In the example below, we randomly sample 10% of the rows (<code>bernoulli(10)</code>) and then take the first 10 (<code>LIMIT 10</code>).</p><pre><code class="language-sql">DELETE FROM cpu
WHERE tags_id IN 
  (SELECT id FROM tags tablesample bernoulli(10) LIMIT 10)
  AND time &gt;= now() - INTERVAL '30 minutes';</code></pre><p>From there, we ran each benchmarking query multiple times, with and without SkipScan enabled, to account for caching.</p><p>As mentioned earlier, the following two indexes were present on the hypertable for all queries.</p><pre><code class="language-sql">"cpu_tags_id_time_idx" btree (tags_id, "time" DESC)
"cpu_time_idx" btree ("time" DESC)</code></pre><h3 id="benchmark-results">Benchmark results</h3><p>Here are the results:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2021/05/Skip-Scan-vs-Normal.jpg" class="kg-image" alt="" loading="lazy" width="2000" height="2183" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2021/05/Skip-Scan-vs-Normal.jpg 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2021/05/Skip-Scan-vs-Normal.jpg 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2021/05/Skip-Scan-vs-Normal.jpg 1600w, https://timescale.ghost.io/blog/content/images/2021/05/Skip-Scan-vs-Normal.jpg 2000w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">TimescaleDB with SkipScan improved the query response by at least 26x, up to 8500x in some cases.</em></i></figcaption></figure><h2 id="about-the-queries-benchmarked">About the Queries Benchmarked</h2><p>For this test, we benchmarked five types of common queries:</p><h3 id="scenario-1-what-was-the-last-reported-time-of-each-device-in-a-paged-list">Scenario #1: What was the last reported time of each device in a paged list?</h3><pre><code class="language-sql">SELECT DISTINCT ON (tags_id) tags_id, time FROM cpu
ORDER BY tags_id, time DESC
LIMIT 10 OFFSET 50;</code></pre><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/01/SkipScan---Scenario-1.jpg" class="kg-image" alt="" loading="lazy" width="1095" height="236" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/01/SkipScan---Scenario-1.jpg 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/01/SkipScan---Scenario-1.jpg 1000w, https://timescale.ghost.io/blog/content/images/2022/01/SkipScan---Scenario-1.jpg 1095w" sizes="(min-width: 720px) 720px"></figure><h3 id="scenario-2-what-was-the-time-and-most-recently-reported-set-of-values-for-each-device-in-a-paged-list">Scenario #2: What was the time and most recently reported set of values for each device in a paged list?</h3><pre><code class="language-sql">SELECT DISTINCT ON (tags_id) * FROM cpu
ORDER BY tags_id, time DESC
LIMIT 10 OFFSET 50;</code></pre><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/01/SkipScan---Scenario-2.jpg" class="kg-image" alt="" loading="lazy" width="1095" height="236" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/01/SkipScan---Scenario-2.jpg 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/01/SkipScan---Scenario-2.jpg 1000w, https://timescale.ghost.io/blog/content/images/2022/01/SkipScan---Scenario-2.jpg 1095w" sizes="(min-width: 720px) 720px"></figure><h3 id="scenario-3-what-is-the-most-recent-point-for-all-reporting-devices-in-the-last-5-minutes">Scenario #3: What is the most recent point for all reporting devices in the last 5 minutes?</h3><pre><code class="language-sql">SELECT DISTINCT ON (tags_id) * FROM cpu 
WHERE time &gt;= now() - INTERVAL '5 minutes' 
ORDER BY tags_id, time DESC;</code></pre><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/01/SkipScan---Scenario-3.jpg" class="kg-image" alt="" loading="lazy" width="1095" height="236" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/01/SkipScan---Scenario-3.jpg 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/01/SkipScan---Scenario-3.jpg 1000w, https://timescale.ghost.io/blog/content/images/2022/01/SkipScan---Scenario-3.jpg 1095w" sizes="(min-width: 720px) 720px"></figure><h3 id="scenario-4-which-devices-reported-at-some-time-today-but-not-within-the-last-hour">Scenario #4: Which devices reported at some time today but not within the last hour?</h3><pre><code class="language-sql">WITH older AS (
  SELECT DISTINCT ON (tags_id) tags_id FROM cpu 
  WHERE time &gt; now() - INTERVAL '24 hours'
)                                          
SELECT * FROM older o 
WHERE NOT EXISTS (
  SELECT 1 FROM cpu 
  WHERE cpu.tags_id = o.tags_id 
  AND time &gt; now() - INTERVAL '1 hour'
);</code></pre><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2021/05/SkipScan---Scenario-4.jpg" class="kg-image" alt="" loading="lazy" width="2000" height="431" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2021/05/SkipScan---Scenario-4.jpg 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2021/05/SkipScan---Scenario-4.jpg 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2021/05/SkipScan---Scenario-4.jpg 1600w, https://timescale.ghost.io/blog/content/images/2021/05/SkipScan---Scenario-4.jpg 2000w" sizes="(min-width: 720px) 720px"></figure><h3 id="scenario-5-which-devices-reported-yesterday-but-not-in-the-last-24-hours">Scenario #5: Which devices reported yesterday but not in the last 24 hours?</h3><pre><code class="language-sql">WITH older AS (
  SELECT DISTINCT ON (tags_id) tags_id FROM cpu 
  WHERE time &gt; now() - INTERVAL '48 hours'
  AND time &lt; now() - INTERVAL '24 hours'
)                                          
SELECT * FROM older o 
WHERE NOT EXISTS (
  SELECT 1 FROM cpu 
  WHERE cpu.tags_id = o.tags_id 
  AND time &gt; now() - INTERVAL '24 hours'
);</code></pre><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2021/05/SkipScan---Scenario-5.jpg" class="kg-image" alt="" loading="lazy" width="2000" height="431" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2021/05/SkipScan---Scenario-5.jpg 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2021/05/SkipScan---Scenario-5.jpg 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2021/05/SkipScan---Scenario-5.jpg 1600w, https://timescale.ghost.io/blog/content/images/2021/05/SkipScan---Scenario-5.jpg 2000w" sizes="(min-width: 720px) 720px"></figure><h2 id="how-will-your-application-improve">How Will Your Application Improve?</h2><p>But SkipScan isn’t a theoretical improvement reserved for benchmarking blog posts 😉—it has real-world implications, and many applications we use rely on getting this data as fast as possible.</p><p>Think about the applications you use (or develop) every day. Do they retrieve paged lists of unique items from database tables to fill dropdown options (or grids of data)?</p><p>At a few thousand items, the query latency might not be very noticeable. But, as your data grows and you have millions of rows of data and tens of thousands of distinct items, that dropdown menu might take seconds—or minutes—to populate. 
</p><p>SkipScan can reduce that to tens of <em>milliseconds</em>!</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/tenor.gif" class="kg-image" alt="" loading="lazy" width="498" height="287"><figcaption><span style="white-space: pre-wrap;">Baby Yoda</span></figcaption></figure><p>Even better, SkipScan also provides a fast, efficient way of answering the question that so many people with time-series data ask every day:</p><p><em>"What was the last time and value recorded for each of my [devices / users / services / crypto and stock investments / etc]?"</em></p><p>As long as there is an index on "device_id" and "time" descending, SkipScan will retrieve the data using a query like this much more efficiently.</p><pre><code class="language-sql">SELECT DISTINCT ON (device_id) * FROM cpu 
ORDER BY device_id, time DESC;</code></pre><p>With SkipScan, your application and dashboards that rely on these types of queries will now load a whole lot faster 🚀  (see below).</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2021/05/4k_with_skipscan.gif" class="kg-image" alt="" loading="lazy" width="658" height="527" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2021/05/4k_with_skipscan.gif 600w, https://timescale.ghost.io/blog/content/images/2021/05/4k_with_skipscan.gif 658w"><figcaption><i><em class="italic" style="white-space: pre-wrap;">TimescaleDB 2.2 </em></i><i><b><strong class="italic" style="white-space: pre-wrap;">with</strong></b></i><i><em class="italic" style="white-space: pre-wrap;"> SkipScan enabled runs in less than 400&nbsp;ms</em></i></figcaption></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2021/05/4k_without_skipscan.gif" class="kg-image" alt="" loading="lazy" width="658" height="527" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2021/05/4k_without_skipscan.gif 600w, https://timescale.ghost.io/blog/content/images/2021/05/4k_without_skipscan.gif 658w"><figcaption><i><em class="italic" style="white-space: pre-wrap;">TimescaleDB 2.2</em></i><i><b><strong class="italic" style="white-space: pre-wrap;"> without</strong></b></i><i><em class="italic" style="white-space: pre-wrap;"> SkipScan enabled runs in 23 seconds</em></i></figcaption></figure><h2 id="how-to-use-skipscan-on-timescale">How to Use SkipScan on Timescale</h2><p>How do you get started? Upgrade to TimescaleDB 2.2.1 and set up your schema and indexing as described below. 
You should start to see immediate speed improvements in many of your <code>DISTINCT</code> queries.</p><p><strong>To ensure that a (SkipScan) node can be chosen for your query plan:</strong></p><p><strong>First, the query must use the <code>DISTINCT</code> keyword on a single column</strong>. The benchmarking queries above will give you some examples to draw from.</p><p><strong>Second, there must be an index that contains the <code>DISTINCT</code> column first, and any other <code>ORDER BY</code> columns.</strong> Specifically:</p><ul><li>The index needs to be a <code>BTREE</code> index.</li><li>The index needs to match the <code>ORDER BY</code> in your query.</li><li>The <code>DISTINCT</code> column must either be the first column of the index, or any leading column(s) must be used as constraints in your query.</li></ul><p>In practice, this means that if we use the questions from the beginning of this blog post ("retrieve a list of unique IDs in order" and "retrieve the last reading of each ID"), we would need at least one index like this (but if you're using a TimescaleDB hypertable, this likely already exists):</p><pre><code class="language-sql"> "cpu_tags_id_time_idx" btree (tags_id, "time" DESC)</code></pre><p>With that index in place, you should start to see immediate benefits if your queries look similar to the benchmarking examples above. When SkipScan is chosen for your query, the <code>EXPLAIN ANALYZE</code> output will show one or more <code>Custom Scan (SkipScan)</code> nodes similar to this:</p><pre><code>-&gt;  Unique
  -&gt;  Merge Append
        Sort Key: _hyper_8_79_chunk.tags_id, _hyper_8_79_chunk."time" DESC
        -&gt;  Custom Scan (SkipScan) on _hyper_8_79_chunk
              -&gt;  Index Only Scan using _hyper_8_79_chunk_cpu_tags_id_time_idx on _hyper_8_79_chunk
                    Index Cond: (tags_id &gt; NULL::integer)
        -&gt;  Custom Scan (SkipScan) on _hyper_8_80_chunk
              -&gt;  Index Only Scan using _hyper_8_80_chunk_cpu_tags_id_time_idx on _hyper_8_80_chunk
                    Index Cond: (tags_id &gt; NULL::integer)
...</code></pre><h2 id="learn-more-and-get-started">Learn More and Get Started</h2><p>If you’re new to Timescale, <a href="https://console.cloud.timescale.com/signup">create a free account</a> to get started with a fully managed TimescaleDB instance (100% free for 30 days).</p><p>If you are an existing user:</p><ul><li><strong>Timescale: </strong>TimescaleDB 2.2.1 is now the default for all new services on Timescale, and any of your existing services will be automatically upgraded during your next maintenance window.</li><li><strong>Self-managed TimescaleDB</strong>: <a href="https://docs.timescale.com/latest/update-timescaledb">Here are the upgrade instructions</a>.</li></ul><p>Join our <a href="https://slack.timescale.com">Slack Community</a> to share your results, ask questions, get advice, and connect with other developers (I, as well as our co-founders, engineers, and passionate community members, are active on all channels).</p><p>You can also <a href="https://github.com/timescale/timescaledb">visit our GitHub</a> to learn more (and, as always, ⭐️ are appreciated!)</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[TimescaleDB vs. Amazon Timestream: 6,000x Higher Inserts, 5-175x Faster Queries, 150-220x Cheaper]]></title>
            <description><![CDATA[Our TimescaleDB vs Amazon Timestream results surprised us, but even after testing several configurations, we found Timestream slow, expensive, and missing key database capabilities like backups, restores, updates, and deletes. ]]></description>
            <link>https://www.tigerdata.com/blog/timescaledb-vs-amazon-timestream-6000x-higher-inserts-175x-faster-queries-220x-cheaper</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/timescaledb-vs-amazon-timestream-6000x-higher-inserts-175x-faster-queries-220x-cheaper</guid>
            <category><![CDATA[Announcements & Releases]]></category>
            <category><![CDATA[Benchmarks & Comparisons]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Ryan Booz]]></dc:creator>
            <pubDate>Wed, 02 Dec 2020 19:41:47 GMT</pubDate>
            <media:content medium="image" url="https://timescale.ghost.io/blog/content/images/2023/08/timescale-vs-amazon-timestream-2023-08-02---Timestream---Compare---Hero.png">
            </media:content>
            <content:encoded><![CDATA[<p>This post compares TimescaleDB and Amazon Timestream across quantitative and qualitative dimensions. </p><p>Yes, we are the developers of TimescaleDB, so you might quickly disregard our comparison as biased. But if you let the analysis speak for itself, you’ll find that we stay as objective as possible and aim to be fair to <a href="https://www.tigerdata.com/blog/so-long-timestream-how-and-why-to-migrate-before-its-too-late" rel="noreferrer">Amazon Timestream</a> in our testing and results reporting. </p><p>Also, if you want to check our work or run your own analysis, we provide all our testing via the <a href="https://github.com/timescale/tsbs">Time Series Benchmark Suite</a>, an open-source project that anyone can use and contribute to.</p><h2 id="about-timescaledb-and-amazon-timestream">About TimescaleDB and Amazon Timestream</h2><p><strong>TimescaleDB, </strong>first launched in <a href="https://timescale.ghost.io/blog/blog/when-boring-is-awesome-building-a-scalable-time-series-database-on-postgresql-2900ea453ee2/">April 2017</a>, is today the industry-leading relational database for <a href="https://www.tigerdata.com/blog/time-series-introduction" rel="noreferrer">time series</a>, open-source, engineered on top of PostgreSQL, and offered via download or as a fully managed service on AWS. 
</p><p>The TimescaleDB community has become the largest developer community for time-series data: tens of millions of downloads; over 500,000 active databases; organizations like AppDynamics, Bosch, Cisco, Comcast, Credit Suisse, DigitalOcean, Dow Chemical, Electronic Arts, Fujitsu, IBM, Microsoft, Rackspace, Schneider Electric, Samsung, Siemens, Uber, Walmart, Warner Music, WebEx, and thousands of others (all in addition to the PostgreSQL community and ecosystem).</p><p><strong>Amazon Timestream </strong>was first announced at AWS re:Invent in <a href="https://aws.amazon.com/blogs/aws/aws-previews-and-pre-announcements-at-reinvent-2018-andy-jassy-keynote/">November 2018</a>, but its launch was delayed until <a href="https://aws.amazon.com/blogs/aws/store-and-access-time-series-data-at-any-scale-with-amazon-timestream-now-generally-available/">September 2020</a>. This is Amazon’s time-series database-as-a-service. Amazon Timestream not only shares a similar name with TimescaleDB, but also embraces SQL as its query language. Amazon Timestream customers include Autodesk, PubNub, and Trimble.</p><p>We compare TimescaleDB and Amazon Timestream across several dimensions:</p><ul><li>Insert and query performance</li><li>Cost for equivalent workloads</li><li>Backups, reliability, and tooling</li><li>Query language, ecosystem, ease-of-use</li><li>Clouds and regions supported</li></ul><p>Below is a summary of our results. For those interested, we go into much more detail later in this post.</p><h2 id="insert-performance-query-performance">Insert Performance, Query Performance</h2><p>Our results are striking. TimescaleDB outperformed Amazon Timestream by 6,000x on inserts and 5-175x on queries, depending on the query type. 
In particular, there were workloads and query types easily supported by TimescaleDB that Amazon Timestream was unable to handle.</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2023/08/timescale-vs-amazon-timestream-Datasets-for-benchmarking-table.png" class="kg-image" alt="TimescaleDB achieves 6,000 times higher inserts than Amazon Timestream" loading="lazy" width="1355" height="891" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2023/08/timescale-vs-amazon-timestream-Datasets-for-benchmarking-table.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2023/08/timescale-vs-amazon-timestream-Datasets-for-benchmarking-table.png 1000w, https://timescale.ghost.io/blog/content/images/2023/08/timescale-vs-amazon-timestream-Datasets-for-benchmarking-table.png 1355w" sizes="(min-width: 720px) 720px"></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2023/08/timescale-vs-amazon-timestream-175x-faster-queries.png" class="kg-image" alt="TimescaleDB achieved 5 to 175 times better query performance than Amazon Timestream" loading="lazy" width="1562" height="1160" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2023/08/timescale-vs-amazon-timestream-175x-faster-queries.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2023/08/timescale-vs-amazon-timestream-175x-faster-queries.png 1000w, https://timescale.ghost.io/blog/content/images/2023/08/timescale-vs-amazon-timestream-175x-faster-queries.png 1562w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Note: Several queries’ ratios (high-cpu-all, lastpoint, groupby-orderby-limit) are “undefined” because Amazon Timestream did not finish executing them within the default 60-second timeout period that Timestream imposes, while TimescaleDB completed them in less than a single 
second</em></i></figcaption></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2023/08/timescale-vs-amazon-timestream-Query-Performance.png" class="kg-image" alt="Table showing latency ratios for various queries - run on 100 devices x 10 metrics - in milliseconds" loading="lazy" width="1710" height="1496" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2023/08/timescale-vs-amazon-timestream-Query-Performance.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2023/08/timescale-vs-amazon-timestream-Query-Performance.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2023/08/timescale-vs-amazon-timestream-Query-Performance.png 1600w, https://timescale.ghost.io/blog/content/images/2023/08/timescale-vs-amazon-timestream-Query-Performance.png 1710w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Results of benchmarking query performance between TimescaleDB and Amazon Timestream</em></i></figcaption></figure><p>These results were so dramatic that we did not believe them at first, and we tried a variety of workloads and settings to ensure we weren’t missing anything. </p><p><a href="https://www.reddit.com/r/aws/comments/jsgn9x/realworld_aws_timestream_ingest_performance/">We even posted on Reddit</a> to see if others had been able to get better performance with Amazon Timestream. 
Although feedback was hard to find, we weren’t the only ones seeing these performance results, as evidenced by a <a href="https://crate.io/a/amazon-timestream-first-impressions/">similar benchmark by Crate.io</a>.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Screen-Shot-2020-11-17-at-12.00.37-PM--1-.png" class="kg-image" alt="Screenshot of Amazon Timestream UI showing list of databases created during benchmarking process" loading="lazy" width="1584" height="1262" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/01/Screen-Shot-2020-11-17-at-12.00.37-PM--1-.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/01/Screen-Shot-2020-11-17-at-12.00.37-PM--1-.png 1000w, https://timescale.ghost.io/blog/content/images/2022/01/Screen-Shot-2020-11-17-at-12.00.37-PM--1-.png 1584w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">We REALLY tried to get Amazon Timestream to perform better. Just look at all of the databases we created through the process!</em></i></figcaption></figure><p>After all of our attempts to achieve better Amazon Timestream performance, we were even more confused when we read a <a href="https://aws.amazon.com/blogs/database/deriving-real-time-insights-over-petabytes-of-time-series-data-with-amazon-timestream/">recent post on the AWS Database Blog</a> that discusses achieving ingest speeds of three billion metrics/hour. </p><p>Although the details of how they ingested this scale of data aren’t completely clear, it appears that each “monitored host” sent individual metrics at various intervals directly to Amazon Timestream.</p><p>To achieve three billion metrics/hour in their test, four million hosts sent 26 metrics every two minutes, an average of 33,000 hosts reporting 866,667 metrics every second. 
</p><p>It’s certainly impressive to support 33,000 connections per second without issue, and this demonstrates one of the key advantages that Amazon presents with a serverless architecture like Timestream. </p><p>If you have an edge-based IoT system that pre-computes metrics on thousands of edge nodes before sending them, Amazon Timestream could simplify your data collection architecture. </p><p>However, as you’ll see, if you have a more traditional client-server data-collection architecture, or <a href="https://timescale.ghost.io/blog/blog/create-a-data-pipeline-with-timescaledb-and-kafka/">one using a more common streaming pipeline with database consumers, like Apache Kafka</a>, TimescaleDB can import more than three million metrics per second from one client—and doesn’t need 33,000 clients.</p><p>Because performance benchmarking is complex, we share the details of our setup, configurations, and workload patterns later in this post, as well as instructions on how to reproduce them.</p><h2 id="cost-for-equivalent-workloads">Cost for Equivalent Workloads</h2><p>The stark difference in performance translates into a large cost differential as well. 
</p><p>To compare costs, we calculated the cost for our above insert and query workloads, which store one billion metrics in TimescaleDB and ~410 million metrics in Amazon Timestream (because we were unable to load the full one billion—more later in this post), and ran our suite of queries on top.</p><p>For the same workloads, we found that <a href="https://www.timescale.com/products">fully managed TimescaleDB</a> is 154x cheaper than Amazon Timestream (224x cheaper if you’re self-managing TimescaleDB on a virtual machine) and inserted twice as many metrics.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2023/08/timescale-vs-amazon-timestream-154x-cheaper-than-Amazon-Timestream.png" class="kg-image" alt="To run this test TimescaleDB took less than an hour at a cost of $2.18. The same test took a week and cost $336 in Amazon Timestream" loading="lazy" width="1382" height="746" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2023/08/timescale-vs-amazon-timestream-154x-cheaper-than-Amazon-Timestream.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2023/08/timescale-vs-amazon-timestream-154x-cheaper-than-Amazon-Timestream.png 1000w, https://timescale.ghost.io/blog/content/images/2023/08/timescale-vs-amazon-timestream-154x-cheaper-than-Amazon-Timestream.png 1382w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The results of the costs for equivalent workloads</span></figcaption></figure><p>We go into further details about the cost comparison later in this post.</p><h2 id="backups-reliability-and-tooling">Backups, Reliability, and Tooling</h2><p>For reliability, the differences are also striking. 
In particular, backups, reliability, and tooling feel like an afterthought with Amazon Timestream.</p><p>In the <a href="https://docs.aws.amazon.com/timestream/latest/developerguide/timestream.pdf">240-page development guide</a> for Amazon Timestream, the words “recovery” and “restore” don’t appear at all, and the word “backup” appears only once to tell the developer that there is no backup mechanism. </p><p>Instead, you can “[...]write your own application using the Timestream SDK to query data and save it to the destination of your choice” (page 100). There isn’t a mechanism or support to DELETE or UPDATE existing data.</p><p><strong>The only way to remove data is to drop the entire table.</strong> Furthermore, there is no way to recover a dropped table: dropping is an atomic action that cannot be undone through any Amazon API or the console.</p><p>In contrast, TimescaleDB is built on PostgreSQL, which means it inherits the 25+ years of hard, careful engineering work that the entire PostgreSQL community has done to build a rock-solid database that supports millions of mission-critical applications worldwide. </p><p>When operating TimescaleDB, one inherits all of the battle-tested tools that exist in the PostgreSQL ecosystem: <a href="https://www.postgresql.org/docs/9.6/static/app-pgdump.html">pg_dump</a>/<a href="https://www.postgresql.org/docs/9.6/static/app-pgrestore.html">pg_restore</a> and <a href="http://www.postgresql.cn/docs/9.6/app-pgbasebackup.html">pg_basebackup</a> for backup/restore, high-availability/failover tools like <a href="https://github.com/zalando/patroni">Patroni</a>, load balancing tools for clustering reads like <a href="http://www.pgpool.net/mediawiki/index.php/Main_Page">Pgpool</a>/<a href="https://www.tigerdata.com/blog/using-pgbouncer-to-improve-your-postgresql-database-performance" rel="noreferrer">pgbouncer</a>, etc. Since TimescaleDB looks and feels like PostgreSQL, there are minimal operational learning curves. 
TimescaleDB “just works,” as one would expect from PostgreSQL.</p><h2 id="query-language-ecosystem-and-ease-of-use">Query Language, Ecosystem, and Ease-Of-Use</h2><p>We applaud Amazon Timestream’s decision to adopt SQL as their query language. Even if Amazon Timestream functions like a NoSQL database in many ways, opting for SQL as the query interface lowers developers’ barrier to entry—especially when compared to other databases like MongoDB and InfluxDB.</p><p>That said, because Amazon Timestream is <strong>not </strong>a relational database, it doesn’t support normalized datasets and JOINs across tables. Also, because Amazon Timestream enforces a specific narrow table model on your data, deriving value from your data relies heavily on CASE statements and Common Table Expressions (CTEs) when requesting multiple measurement values (defined by “measurement_name”), leading to some clunky queries (see example later in this post).</p><p>TimescaleDB, on the other hand, has fully embraced all parts of the SQL language from day one—and <a href="https://docs.timescale.com/v2.0/api#analytics">extended SQL with functions custom-built to simplify time-series analysis</a>. TimescaleDB is also a relational database, allowing developers to store their metadata alongside their time-series data and JOIN across tables as necessary. Consequently, with TimescaleDB, new users have a minimal learning curve and are in full control when querying their data. </p><p>Full SQL means that TimescaleDB supports everything that SQL has to offer, including normalized datasets, cross-table JOINs, subqueries, stored procedures, and user-defined functions. 
</p><p>Supporting SQL also enables TimescaleDB to support everything in the SQL ecosystem, including Tableau, Looker, PowerBI, Apache Kafka, Apache Spark, Jupyter Notebooks, R, native libraries for every major programming language, and much more.</p><p>For example, if you already use <a href="https://docs.timescale.com/use-timescale/latest/integrations/observability-alerting/tableau/" rel="noreferrer">Tableau to visualize data</a> or Apache Spark for data processing, TimescaleDB can plug right into the existing infrastructure due to its compatible connectors. And, given its roots, TimescaleDB supports everything in the PostgreSQL ecosystem, including tools like EXPLAIN that help pinpoint why queries are slow and identify ways to improve performance.</p><p>By contrast, even though Amazon Timestream speaks a variant of SQL, it is “SQL-like,” not full SQL. Thus, tooling that normally works with SQL—e.g., the Tableau and Apache Spark examples cited above—is unable to utilize Amazon Timestream data unless it incorporates specific Amazon Timestream drivers and the SQL-like dialect. </p><p>This means, for example, that the tooling you might normally use to help you improve query performance doesn’t currently support Amazon Timestream. And, unfortunately, the current Amazon Timestream UI doesn’t give us any clues about why queries might be performing poorly or ways to improve performance (e.g., via settings or query hints).</p><p>In short, if you use PostgreSQL and any tools or extensions with your applications, they will “just work” when connected to TimescaleDB. The same isn’t true for Amazon Timestream. </p><p>So, while adopting a SQL-like query language is a great start for Amazon Timestream, we found a lot to be desired for a true “easy” developer experience.</p><h2 id="cloud-offering">Cloud Offering</h2><p>Amazon Timestream is only offered as a serverless cloud service on Amazon Web Services. As of writing, it is available in three U.S. 
regions and one E.U. region. </p><p>Conversely, TimescaleDB can be run in your own infrastructure or fully managed <a href="https://www.timescale.com/products">through our cloud offering</a>, which makes TimescaleDB available on AWS in over 75 regions and many different possible region/storage/compute configurations.</p><p>With TimescaleDB, our goal is to give our customers their choice of cloud and the ability to choose the region closest to their customers and co-locate with their other workloads.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2023/08/timescale-vs-amazon-timestream--most-clouds-and-regions-of-any-managed-service.png" class="kg-image" alt="TimescaleDB is available in all 3 clouds and more than 75 regions. Amazon Timestream is only available in 3 regions" loading="lazy" width="1457" height="1274" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2023/08/timescale-vs-amazon-timestream--most-clouds-and-regions-of-any-managed-service.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2023/08/timescale-vs-amazon-timestream--most-clouds-and-regions-of-any-managed-service.png 1000w, https://timescale.ghost.io/blog/content/images/2023/08/timescale-vs-amazon-timestream--most-clouds-and-regions-of-any-managed-service.png 1457w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Cloud coverage comparison between Amazon Timestream and TimescaleDB (as of November 2020)</em></i></figcaption></figure><h2 id="what-we-like-about-amazon-timestream">What We Like About Amazon Timestream</h2><p>Despite the results of this comparison, we liked several things about Amazon Timestream. </p><p>Highest on the list is the simplicity of setting up a serverless offering. There aren’t any sizing decisions or knobs to tweak. Simply create a database and a table, then start sending data. 
</p><p>This brings up a second advantage of a serverless architecture: even if throughput isn’t ideal from a single client, the service appears to handle thousands of connections without issue. <a href="https://docs.aws.amazon.com/timestream/latest/developerguide/timestream.pdf">According to their documentation</a>, Amazon Timestream will eventually add more resources to keep up with additional ingest or query threads. This means that an application shouldn’t be limited by the resources of one particular server for reads or writes.</p><p>As with other NoSQL databases, some may find the schemaless nature of Amazon Timestream appealing, especially when a project is just getting off the ground. </p><p>Although schemas become more necessary for performance and data-validation reasons as workloads grow, one reason databases like MongoDB have grown in popularity is that they don’t require the same upfront planning as more traditional SQL databases.</p><p>Lastly, SQL. It shouldn’t come as a surprise that we like SQL as an efficient interface for the data we need to examine. And, although Amazon Timestream lacks some support for standard SQL dialects, most users will find it pretty straightforward to start querying data (after they understand Amazon Timestream’s narrow table model).</p><h3 id="but-why-is-amazon-timestream-so-expensive-slow-and-underwhelming">But why is Amazon Timestream so expensive, slow, and underwhelming?</h3><p>The reality is that Amazon Timestream, despite taking two years post-announcement to launch, still seems half-baked.</p><p>Why is Amazon Timestream so expensive, slow, and seemingly underdeveloped? We assume the reason lies in its underlying architecture. 
</p><p>Unlike other systems that we and others have benchmarked via the <a href="https://github.com/timescale/tsbs">Time-Series Benchmark Suite</a> (e.g., <a href="https://timescale.ghost.io/blog/blog/timescaledb-vs-influxdb-for-time-series-data-timescale-influx-sql-nosql-36489299877/">InfluxDB</a> and <a href="https://timescale.ghost.io/blog/blog/how-to-store-time-series-data-mongodb-vs-timescaledb-postgresql-a73939734016/">MongoDB</a>), Amazon Timestream is completely closed-source. </p><p>Based on our usage and experience with other Amazon services, we suspect that under the hood Amazon Timestream is backed by a combination of other Amazon services similar to Amazon ElastiCache, Athena, and S3. But because we cannot inspect the source code (and because Amazon does not make this sort of information public), this is just a guess.</p><p>By comparison, all of the <a href="https://github.com/timescale/timescaledb">source code for TimescaleDB</a> is available for anyone to inspect. We built TimescaleDB on top of PostgreSQL, giving it a rock-solid foundation and large ecosystem, and then spent years adding advanced capabilities to increase performance, lower costs, and improve the developer experience. </p><p>These capabilities include <a href="https://docs.timescale.com/latest/using-timescaledb/hypertables">auto-partitioning via hypertables and chunks</a>, faster queries via <a href="https://docs.timescale.com/latest/using-timescaledb/continuous-aggregates">continuous aggregates</a>, lower costs via 94&nbsp;%+ <a href="https://docs.timescale.com/use-timescale/latest/compression/about-compression/" rel="noreferrer">native compression</a>, high performance (10+ million inserts a second), and more. </p><p>We believe the real reason behind the difference between the two products is the companies building these products and how each approaches software development, community, and licensing. 
(And kind reader, this is where our bias may sneak in a little bit.)</p><h2 id="amazon-vs-timescale">Amazon vs. Timescale</h2><p>The viability of our company, Timescale, is 100&nbsp;% dependent on the quality of TimescaleDB. If we build a sub-par product, we cease to exist. </p><p>Amazon Timestream is just another of the <a href="https://en.wikipedia.org/wiki/List_of_Amazon_products_and_services">200+ services that Amazon is developing</a>. Regardless of the quality of Amazon Timestream, that team will still be supported by the rest of Amazon’s business—and if the product gets shut down, that team will find homes elsewhere within the larger company.</p><p>One can see this difference in how the two companies approach the developer community. Without a doubt, <a href="https://www.zdnet.com/article/the-top-cloud-providers-of-2020-aws-microsoft-azure-google-cloud-hybrid-saas/">Amazon Web Services is a leader in all things cloud computing</a>. However, with its enormous catalog of cloud services, many originally derived from external open-source projects, Amazon’s attention is spread over hundreds of products. </p><p>Case in point, when Amazon Timestream was announced in 2018, there was strong interest in when it would be released and how it would perform. However, after a two-year delay, with no information from Amazon, <a href="https://www.reddit.com/r/aws/comments/i5gomj/aws_timestream_still_happening_in_2020/">many gave up on waiting for the product</a>. When the product was finally released on September 30, 2020, there was <a href="https://news.ycombinator.com/item?id=24645416">very little fanfare from the community</a>.</p><p>In contrast, Timescale develops its <a href="https://github.com/timescale/timescaledb">source code</a> out in the open, and developers can reach us for help anytime directly via our <a href="https://slack.timescale.com/">Slack channel</a> (which is staffed by our engineers), whether they are a paying customer or not. 
We’ve continued to invest in our community by making all our software available for free while also serving our customers with our <a href="https://www.timescale.com/products/">hosted and fully managed cloud services</a>. </p><p>Building a high-performance, cost-effective, reliable, and easy-to-use time-series database is a hard and increasingly business-critical problem. For us, building TimescaleDB into a best-in-class time-series developer experience is an existential requirement. Without it, we cease to exist. For Amazon, Amazon Timestream is just a checkbox, another service to list on their website.</p><h3 id="when-amazon-is-forced-to-compete-on-product-quality-all-open-source-companies-have-a-shot-at-building-great-businesses">When Amazon is forced to compete on product quality, all open-source companies have a shot at building great businesses</h3><p>Amazon has a history of offering services that take advantage of the R&amp;D efforts of others: for example, <a href="https://aws.amazon.com/elasticsearch-service/">Amazon Elasticsearch Service</a>, <a href="https://aws.amazon.com/msk/">Amazon Managed Streaming for Apache Kafka</a>, <a href="https://aws.amazon.com/elasticache/redis/">Amazon ElastiCache for Redis</a>, and many others. </p><p>If Amazon wanted to launch a time-series database service that supported SQL, why did they build one from scratch and not just offer managed TimescaleDB?</p><p>Answer: our innovative licensing. The core of TimescaleDB is open-source, licensed under Apache 2. But advanced capabilities, such as compression and continuous aggregates, are licensed under the Timescale License, a source-available license that is open-source in spirit and makes all software available for free—but contains a critical restriction: preventing companies from offering that software via a hosted database-as-a-service. 
</p><p>The Timescale License is an example of a <em>“Cloud Protection License,”</em> a class of licenses that recognize that the cloud has increasingly become the dominant form of open-source commercialization. </p><p>These licenses reserve the right to offer the software as a cloud service for the project’s main creator/maintainer (who often contributes 99&nbsp;% of the R&amp;D effort). (Read more about <a href="https://timescale.ghost.io/blog/blog/building-open-source-business-in-cloud-era-v2/">how we're building a self-sustaining open-source business in the cloud era</a>.)</p><p>This “cloud protection” prevents Amazon from just distributing our R&amp;D and forces them to develop their own offering and compete on product quality, not just distribution. And as we can see from Amazon Timestream, building best-in-class database technologies is not easy, even for a company like Amazon.</p><p><strong>The truth is that when Amazon is forced to compete on product quality, all open-source companies have a shot at building great businesses. </strong></p><p>We welcome Amazon’s new entry to the time-series database market and appreciate that developers now have even more choices for storing and analyzing their time-series data. Competition is good for developers and helps drive further innovation. </p><p>For those who want to dig deeper into our benchmarking and comparison, we include detailed notes and methodology below.</p><p>For those who want to try Timescale, <a href="https://console.cloud.timescale.com/signup">create a free account </a>to get started with a fully managed TimescaleDB instance (100&nbsp;% free for 30 days). </p><p>Want to host TimescaleDB yourself? 
<a href="https://github.com/timescale/timescaledb">Visit our GitHub</a> to learn more about options, get installation instructions, and more (and, if you like what you see, ⭐️  are always appreciated!).</p><p>Join our <a href="http://slack.timescale.com/">Slack community</a> to ask questions, get advice, and connect with other developers (I, as well as our co-founders, engineers, and passionate community members are active on all channels).</p><h2 id="performance-comparison-details">Performance Comparison Details</h2><p>Here is a quantitative comparison of the two databases across insert and query workloads.</p><p><em>Note: We've released all the code and data used for the below benchmarks as part of the open-source Time Series Benchmark Suite (TSBS) (</em><a href="https://github.com/timescale/tsbs"><em>GitHub</em></a><em>, </em><a href="https://timescale.ghost.io/blog/blog/time-series-database-benchmarks-timescaledb-influxdb-cassandra-mongodb-bc702b72927e/?utm_source=timescale-influx-benchmark&amp;utm_medium=blog&amp;utm_campaign=july-2020-advocacy&amp;utm_content=tsbs-announcemement-blog"><em>announcement</em></a><em>), so you can reproduce our results or run your own analysis. </em></p><p><em>Typically, when we conduct performance benchmarks (for example, in our previous benchmarks versus </em><a href="https://timescale.ghost.io/blog/blog/timescaledb-vs-influxdb-for-time-series-data-timescale-influx-sql-nosql-36489299877/"><em>InfluxDB</em></a><em> and </em><a href="https://timescale.ghost.io/blog/blog/how-to-store-time-series-data-mongodb-vs-timescaledb-postgresql-a73939734016/"><em>MongoDB</em></a><em>) we use five different dataset configurations. These configurations increase metric loads and cardinalities, to simulate a breadth of time-series workloads for inserts and queries. 
</em></p><p><em>However, as you’ll see below, because of performance issues with Amazon Timestream, we were unable to look at Amazon Timestream’s performance under higher cardinalities and were limited to testing just our lowest-cardinality dataset.</em><br></p><h3 id="machine-configuration">Machine Configuration</h3>
<!--kg-card-begin: html-->
<p>
    <strong>Amazon Timestream </strong>
    <br style="display: block" />
    Amazon Timestream is a serverless offering, which means that a user cannot provision a specific service tier. The only meaningful configuration options a user can modify are the “memory store retention” period and the “magnetic store retention” period. In Amazon Timestream, data can only be inserted into a table if the timestamp falls within the memory store retention period. Therefore, the only setting we modified for our first test was the memory store retention period, which we set to 865 hours (~36 days) to provide padding for a slower insert rate.
</p>
<!--kg-card-end: html-->
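<p>To put that retention setting in concrete terms, here is the back-of-the-envelope arithmetic behind it (our own numbers and padding assumption, not an AWS API call):</p>

```python
# Rough sizing arithmetic (ours, not an AWS API call). Timestream rejects a
# write unless the record's timestamp falls inside the memory store retention
# window, so the window must cover the full 31-day simulated data span plus
# however long the (slow) load itself runs. The ~5 days of padding is our
# assumption about how long ingest might drag on.
data_span_hours = 31 * 24        # 31 days of simulated readings
load_padding_hours = 5 * 24      # headroom for a multi-day load
retention_hours = data_span_hours + load_padding_hours + 1
print(retention_hours)                 # 865
print(round(retention_hours / 24, 1))  # ~36 days
```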
<p></p><p><strong>It did not take long for us to realize that Amazon Timestream’s insert performance was dramatically slower than that of other time-series databases we’ve benchmarked</strong>. Therefore, we took extra time to test insert performance using three different Amazon EC2 instance configurations, each launched in the same region as our Amazon Timestream database:</p><ul><li>t3.medium running Ubuntu 18 LTS, 2 vCPUs, 4&nbsp;GB mem, up to 5 Gb network</li><li>c5n.2xlarge running Ubuntu 20 LTS, 8 vCPUs, 29&nbsp;GB mem, up to 25 Gb network</li><li>m5n.12xlarge running Ubuntu 18 LTS, 48 vCPUs, 192&nbsp;GB mem, 50 Gb network</li></ul><p>After numerous attempts to insert data with each of these instance types, we determined that the size of the client did not noticeably impact insert performance. Instead, we needed to run multiple client instances to ingest more data. </p><p>In the end, we chose to write data first from one t3.medium client and then from 10 t3.medium clients, each running 20 TSBS threads. In the case of 10 clients, each covered a portion of the 30 days to avoid writing duplicate data (Amazon Timestream does not support writing duplicate data).</p>
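<p>Because Amazon Timestream rejects duplicate records, the 10 clients each had to load a disjoint slice of the 30-day range. A minimal sketch of that split (illustrative Python, not TSBS itself):</p>

```python
from datetime import datetime, timedelta

def split_time_range(start, end, n_clients):
    """Split [start, end) into n_clients disjoint, contiguous slices so no
    two clients ever write a record with the same timestamp and host."""
    step = (end - start) / n_clients
    return [(start + i * step, start + (i + 1) * step) for i in range(n_clients)]

# Hypothetical 30-day window split across the 10 t3.medium clients:
start = datetime(2020, 11, 1)
slices = split_time_range(start, start + timedelta(days=30), 10)
print(len(slices))                  # 10
print(slices[0][1] - slices[0][0])  # 3 days, 0:00:00 -- each client loads 3 days
```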
<!--kg-card-begin: html-->
<p>
    <strong>TimescaleDB</strong>
    <br style="display:block">
    To test the same insert and read latency performance on TimescaleDB, we used the following setup:
</p>
<!--kg-card-end: html-->
<ul><li>Version: TimescaleDB <a href="https://github.com/timescale/timescaledb/releases/tag/1.7.4">version 1.7.4</a>, with PostgreSQL 12</li><li>One remote client machine and one database server, both in the same cloud data center</li><li>Instance size: both client and database server ran on DigitalOcean virtual machines (droplets) with 32 vCPU and 192&nbsp;GB memory each</li><li>OS: both server and client machines ran Ubuntu 18.04.3</li><li>Disk Size: 4.8&nbsp;TB of disk in a raid0 configuration (EXT4 filesystem)</li><li>Deployment method: TimescaleDB was deployed using <a href="https://hub.docker.com/r/timescale/timescaledb">Docker images from the official Docker hub</a></li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2023/08/timescale-vs-amazon-timestream-Datasets-for-benchmarking-table-1.png" class="kg-image" alt="TimescaleDB achieves 6000 times higher inserts than Amazon Timestream" loading="lazy" width="1355" height="891" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2023/08/timescale-vs-amazon-timestream-Datasets-for-benchmarking-table-1.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2023/08/timescale-vs-amazon-timestream-Datasets-for-benchmarking-table-1.png 1000w, https://timescale.ghost.io/blog/content/images/2023/08/timescale-vs-amazon-timestream-Datasets-for-benchmarking-table-1.png 1355w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Results of the insert performance between TimescaleDB and Amazon Timestream</span></figcaption></figure><p><strong>In our tests, TimescaleDB outperformed Amazon Timestream by a shocking 6,000x on inserts.</strong> </p><p>The lackluster insert performance of Amazon Timestream took us by surprise, especially since we were using the Amazon Timestream SDK and modeling our TSBS code from examples in their documentation. 
</p><p>In the interest of being thorough and fair to Amazon Timestream, we tried increasing the number of clients writing data, made some code modifications to increase concurrency (in ways that weren’t necessary for TimescaleDB), and worked to eliminate any possible thread contention, and then ran the same benchmark with 10 clients on Amazon Timestream. </p><p>After this effort, we were able to increase Amazon Timestream performance to 5,250 metrics/second (across 10 clients)—but even then, TimescaleDB (with only one client and without any extra code modifications) outperformed Amazon Timestream by 600x.</p><p><em>(Hypothetically, assuming no bottlenecks, we could have started many more clients to increase insert performance on Amazon Timestream, but with an average ingest rate of ~523 metrics/second per client, we would have had to start ~61,000 EC2 instances at the same time to finish inserting metrics as fast as one client writing to TimescaleDB.)</em></p><p>In particular, with this low performance, we were only able to test our lowest-cardinality workload, not our usual five—even though we worked at it for more than a week. This scenario simulates 100 devices, each generating 10 CPU metrics every 10 seconds, for ~100M readings (one billion metrics in total). </p><p>We never actually made it to the full one billion metrics with Amazon Timestream. After nearly 40 hours of inserting data from 10 EC2 clients, we were only able to insert slightly over 410 million metrics. 
<em>(</em><a href="https://github.com/timescale/tsbs/blob/master/docs/timestream.md"><em>The dataset was created using Time-Series Benchmarking Suite, using the cpu-only use case.</em></a><em>)</em></p><p>Let us put it another way:</p><ul><li>We first tested Amazon Timestream and TimescaleDB with one client writing data.</li><li>Then, in an attempt to be fair to Amazon Timestream, we tested it with 10 separate EC2 instances over a two-day period, inserting batches of 1,000 readings (100 hosts, 10 measurements per host) as fast as possible.</li><li>It’s also worth noting that most clients started to receive a fatal connection error from Amazon Timestream between the 28 and 32-hour mark and didn’t recover. Only one client made uninterrupted inserts for more than 40 hours before we manually stopped it. It’s possible that with some additional error checking with the Amazon Timestream SDK response, TSBS could have recovered on its own and continued to send metrics from all 10 clients.</li></ul><p>In total, this means that we inserted data into Amazon Timestream for 332.5 hours and achieved slightly more than 410 million metrics.</p><p><strong>TimescaleDB inserted one billion metrics from one client in just under five minutes.</strong></p><p>Amazon claims that Amazon Timestream will learn from your insert and query patterns and automatically adjust resources to increase performance. Their documentation <a href="https://docs.aws.amazon.com/timestream/latest/developerguide/writes.html">specifically warns that writes may become throttled</a>, with the only remedy to keep inserting at the same (or higher) rate until Amazon Timestream adjusts. 
</p><p>However, in our experience, 332.5 hours of inserting data at a very consistent rate was not enough time for it to make this adjustment.</p><p><em><strong>The issue of cardinality: </strong></em>One other side-effect of Amazon Timestream taking so long to ingest data: we couldn’t compare how it performs with higher cardinalities, which are common in time-series scenarios, where we need to ingest a <em>relentless</em> stream of metrics from devices, apps, customers, and beyond. (<a href="https://timescale.ghost.io/blog/blog/what-is-high-cardinality-how-do-time-series-databases-influxdb-timescaledb-compare/">Read more about the role of cardinality in time series and how TimescaleDB solves it</a>.)<br><br>We’ve shown in <a href="https://timescale.ghost.io/blog/blog/what-is-high-cardinality-how-do-time-series-databases-influxdb-timescaledb-compare/#:~:text=In%20the%20world%20of%20databases,%E2%80%9D)%20that%20describes%20that%20data.">previous benchmarks</a> that TimescaleDB actually sees better performance relative to other time-series databases as cardinality increases, with moderate drop off in terms of absolute insert rate. TimescaleDB surpasses many other popular time-series databases, like InfluxDB, in terms of insert performance for the configurations of 4,000, 100,000, 1 million, and 10 million devices.</p><p>But again, we were unable to test this given Amazon Timestream’s (lack of) performance.</p><p><strong>Insert performance summary:</strong></p><ul><li>TimescaleDB outperforms Amazon Timestream in raw numbers that we found hard to believe. However, despite our best efforts to optimize Amazon Timestream, TimescaleDB still outperformed Amazon Timestream by 6,000x (600x if using 10 clients on Amazon Timestream to TimescaleDB’s one).</li><li>In the time it took us to make a pot of coffee, TimescaleDB inserted one billion metrics for a 31-day period. 
With Amazon Timestream, we got two nights' sleep and inserted less than half the metrics.</li><li>That said, if your ingest requirements are far below these benchmarks (e.g., a few thousand metrics/second), then insert performance will not be your bottleneck.</li></ul><p><strong>Full results:</strong></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2020/12/image-14.png" class="kg-image" alt="Table showing TimescaleDB vs. Amazon Timestream Insert Rate comparison ratios in metrics/second" loading="lazy" width="2000" height="1200" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2020/12/image-14.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2020/12/image-14.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2020/12/image-14.png 1600w, https://timescale.ghost.io/blog/content/images/2020/12/image-14.png 2280w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">TimescaleDB vs. Amazon Timestream Insert Rate comparison ratios</em></i></figcaption></figure><p><strong>More information on database configuration for this test:</strong></p>
<!--kg-card-begin: html-->
<p>
    <em>Batch size</em>
    <br style="display:block">
    From our research and community members’ feedback, we’ve found that larger batch sizes generally provide better insert performance. (It’s one of the reasons we created tools like Parallel COPY to help our users insert data in large batches concurrently). 
</p>
<!--kg-card-end: html-->
<p>In our benchmarking tests for TimescaleDB, the batch size was set to 10,000, something we’ve found works well for this kind of high throughput. The batch size, however, is completely configurable and often worth customizing based on your application requirements.<br><br>Amazon Timestream, on the other hand, has a fixed batch size limit of 100 values. This seems to require significantly more overhead, and insert latency increases dramatically as the number of metrics sent in one request grows. This is one of the first reasons we believe insert performance was so much slower with Amazon Timestream.</p>
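<p>The round-trip overhead is easy to see with a bit of batching math (a sketch of the arithmetic, not either database’s client library):</p>

```python
def batches(records, batch_size):
    """Yield successive slices of at most batch_size records."""
    for i in range(0, len(records), batch_size):
        yield records[i:i + batch_size]

payload = list(range(10_000))  # one TimescaleDB-sized batch of values
# Under Amazon Timestream's 100-value cap, the same payload takes 100 writes:
print(len(list(batches(payload, 100))))     # 100
print(len(list(batches(payload, 10_000))))  # 1
```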
<!--kg-card-begin: html-->
<p>
    <em>
        Additional database configurations
    </em>
    <br style="display: block">
    For TimescaleDB, we set the chunk time interval depending on the data volume, aiming for 7-16 chunks in total for each configuration (see our documentation for more on hypertables and chunks). 
</p>
<!--kg-card-end: html-->
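<p>As a rough illustration of that sizing (plain arithmetic, not a TimescaleDB API call), a 31-day dataset lands in the 7-16 chunk range when the chunk time interval falls between two and four days:</p>

```python
import math

# For each candidate chunk interval, count the chunks a 31-day dataset
# would produce and check the count against the 7-16 chunk target.
days_of_data = 31
for interval_days in (1, 2, 3, 4, 7):
    n_chunks = math.ceil(days_of_data / interval_days)
    in_target = 7 <= n_chunks <= 16
    print(f"{interval_days}-day chunks -> {n_chunks} chunks (target hit: {in_target})")
```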
<p>With Amazon Timestream, there aren’t additional settings you can tweak to try to improve insert performance—at least none that we found, given the tools provided by Amazon. </p><p>As mentioned in the machine configuration section above, we had to set the memory store retention period equal to ~36 days to ensure we could get all of our data inserted before the magnetic store retention period kicked in.</p><h2 id="query-performance-comparison">Query Performance Comparison</h2><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2023/08/timescale-vs-amazon-timestream-175x-faster-queries-1.png" class="kg-image" alt="TimescaleDB achieved 5 to 175 times better query performance than Amazon Timestream" loading="lazy" width="1562" height="1160" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2023/08/timescale-vs-amazon-timestream-175x-faster-queries-1.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2023/08/timescale-vs-amazon-timestream-175x-faster-queries-1.png 1000w, https://timescale.ghost.io/blog/content/images/2023/08/timescale-vs-amazon-timestream-175x-faster-queries-1.png 1562w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Note: Several queries’ ratios (high-cpu-all, lastpoint, groupby-orderby-limit) are “undefined” because Amazon Timestream did not finish executing them within the default 60-second timeout period that Timestream imposes, while TimescaleDB completed them in less than a single second</em></i></figcaption></figure><p>Measuring query latency is complex. Unlike inserts, which vary primarily with cardinality, the universe of possible queries is essentially infinite, especially with a language as powerful as SQL. </p><p>Often, the best way to benchmark read latency is to do it with the actual queries you plan to execute. 
For this comparison, we used a broad set of queries that mimic the most common time-series query patterns.</p><p>For benchmarking query performance, we decided to use a c5n.2xlarge EC2 instance to perform the queries with Amazon Timestream. Our hope was that having more memory and network throughput available to the query application would give Amazon Timestream a better chance. The client for TimescaleDB was unchanged.</p><p>Recall that we ran these queries on Amazon Timestream with a dataset that was 40% the size of the one we ran on TimescaleDB (410 million vs. one billion metrics), owing to the insert problems described above. </p><p>Also, because we had to set the memory store retention period to ~36 days, all the data we queried was in the fastest storage available. These two advantages <em>should</em> have given Amazon Timestream a considerable edge.</p><p><strong>That said, TimescaleDB still outperformed Amazon Timestream by 5x to 175x, depending on the query, with Amazon Timestream unable to finish several queries. </strong></p><p>The results below are the average across 1,000 queries for each query type. 
Latencies in this chart are all shown as milliseconds, with an additional column showing the relative performance of TimescaleDB compared to Amazon Timestream.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2023/08/timescale-vs-amazon-timestream-Query-Performance-1.png" class="kg-image" alt="Table showing latency ratios for various queries - run on 100 devices x 10 metrics - in milliseconds" loading="lazy" width="1710" height="1496" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2023/08/timescale-vs-amazon-timestream-Query-Performance-1.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2023/08/timescale-vs-amazon-timestream-Query-Performance-1.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2023/08/timescale-vs-amazon-timestream-Query-Performance-1.png 1600w, https://timescale.ghost.io/blog/content/images/2023/08/timescale-vs-amazon-timestream-Query-Performance-1.png 1710w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Results of benchmarking query performance between TimescaleDB and Amazon Timestream</em></i></figcaption></figure><p><strong>Results by query type:</strong></p>
<!--kg-card-begin: html-->
<p>
    <em>SIMPLE ROLLUPS</em>
    <br style="display: block">
    For simple rollups (i.e., groupbys), when aggregating one metric across a single host for 1 or 12 hours, or multiple metrics across one or multiple hosts (either for 1 hour or 12 hours), TimescaleDB significantly outperforms Amazon Timestream by 11x to 28x.
</p>
<!--kg-card-end: html-->
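<p>In TimescaleDB, these rollups take roughly the following shape (a sketch assuming a TSBS-style <code>cpu</code> hypertable with a <code>hostname</code> tag and a <code>usage_user</code> metric; the names are illustrative, not the exact benchmark schema):</p><pre><code class="language-SQL">-- Hourly average of one metric for one host over the last 12 hours
SELECT time_bucket('1 hour', time) AS hour,
       AVG(usage_user) AS avg_usage_user
FROM cpu
WHERE hostname = 'host_0'
  AND time &gt;= now() - INTERVAL '12 hours'
GROUP BY hour
ORDER BY hour;</code></pre>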

<!--kg-card-begin: html-->
<p>
    <em>AGGREGATES</em>
    <br style="display: block">
    When calculating a simple aggregate for one device, TimescaleDB again outperforms Amazon Timestream by a considerable margin, returning results for each of 1,000 queries more than 19x faster.
</p>
    
<!--kg-card-end: html-->
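<p>A query of this shape, again sketched against an assumed TSBS-style <code>cpu</code> hypertable with <code>hostname</code> and <code>usage_user</code> columns (illustrative, not the exact benchmark schema), might look like:</p><pre><code class="language-SQL">-- Maximum of one metric for a single device across the whole dataset
SELECT hostname, MAX(usage_user) AS max_usage_user
FROM cpu
WHERE hostname = 'host_0'
GROUP BY hostname;</code></pre>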

<!--kg-card-begin: html-->
<p>
    <em>DOUBLE ROLLUPS</em>
    <br style="display: block">
    For double rollups aggregating metrics by time and another dimension (e.g., GROUPBY time, deviceId), TimescaleDB again achieves significantly better performance, 5x to 12x.
</p>
    
<!--kg-card-end: html-->
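<p>A double rollup groups by both the time bucket and a second dimension. Sketched in TimescaleDB, assuming a TSBS-style <code>cpu</code> hypertable with <code>hostname</code> and <code>usage_user</code> columns (illustrative names):</p><pre><code class="language-SQL">-- Average per hour AND per host
SELECT time_bucket('1 hour', time) AS hour,
       hostname,
       AVG(usage_user) AS avg_usage_user
FROM cpu
WHERE time &gt;= now() - INTERVAL '12 hours'
GROUP BY hour, hostname
ORDER BY hour, hostname;</code></pre>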

<!--kg-card-begin: html-->
<p>
    <em>THRESHOLDS</em>
    <br style="display: block">
   When selecting rows based on a threshold (CPU > 90%), we see Amazon Timestream really begin to fall apart. Finding the last reading greater than 90% for one host runs 170x faster on TimescaleDB than on Amazon Timestream. And the second variation of this query, finding the last reading greater than 90% for all 100 hosts (in the last 31 days), never finished in Amazon Timestream.
</p>
    
<!--kg-card-end: html-->
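<p>The single-host variant of this threshold query is straightforward in TimescaleDB (a sketch assuming a TSBS-style <code>cpu</code> hypertable with <code>hostname</code> and <code>usage_user</code> columns; names are illustrative):</p><pre><code class="language-SQL">-- Last reading above the threshold for one host
SELECT time, hostname, usage_user
FROM cpu
WHERE hostname = 'host_0'
  AND usage_user &gt; 90
ORDER BY time DESC
LIMIT 1;</code></pre>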
<p><em>Again, to be fair and ensure our query was returning the data we expected, we did manually run one of these queries in the Amazon Timestream Query interface of the AWS Console. It would routinely finish in 30-40 seconds (which would still be 36x slower than TimescaleDB). </em></p><p><em>In addition, running 100 of these queries at a time with the benchmark suite appears to be too much for the query engine, and results for the first set of 100 queries didn’t complete after more than 10 minutes of waiting.</em></p>
<!--kg-card-begin: html-->
<p>
    <em>COMPLEX QUERIES</em>
    <br style="display: block">
   Likewise, for complex queries that go beyond rollups or thresholds, there is no comparison. TimescaleDB vastly outperforms Amazon Timestream, in most cases because Amazon Timestream never returned results for the first set of 100 queries. Just like the complex aggregate above that failed to return any results when queried in batches of 100, these complex queries never returned results with the benchmark client.
</p>
<!--kg-card-end: html-->
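<p>As one example, the “lastpoint” query (latest reading per host) maps naturally onto PostgreSQL’s <code>DISTINCT ON</code> in TimescaleDB. This is a sketch against an assumed TSBS-style <code>cpu</code> hypertable with <code>hostname</code> and <code>usage_user</code> columns, not the exact benchmark query:</p><pre><code class="language-SQL">-- Latest reading for each of the 100 hosts
SELECT DISTINCT ON (hostname) hostname, time, usage_user
FROM cpu
ORDER BY hostname, time DESC;</code></pre>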
<p>In each case, we attempted to run the queries multiple times, ensuring that no other clients or processes were inserting or accessing data. We also ran at least one of the queries manually in the AWS Console to verify that it worked and that we got the expected results. However, when running these kinds of queries in parallel, there seems to be a major issue with Amazon Timestream’s ability to satisfy the requests.</p><p>For the more complex queries that did return results from Amazon Timestream, TimescaleDB provides real-time responses (tens to hundreds of milliseconds), while Amazon Timestream sees significant human-observable delays (seconds). </p><p>And remember, this dataset only had a cardinality of 100 hosts, the lowest cardinality we typically test with the Time Series Benchmark Suite (TSBS), and we were unable to test higher-cardinality datasets because of Amazon Timestream issues.</p><p>Notice that Timescale exhibits 48x-175x the performance of Amazon Timestream on these complex queries, many of which are common to historical time-series analysis and monitoring.</p><p><strong>Read latency performance summary</strong></p><ul><li>For simple queries, TimescaleDB outperforms Amazon Timestream in every category.</li><li>When selecting rows based on a threshold, TimescaleDB outperforms Amazon Timestream by a significant margin, being over 175x faster.</li><li>For complex queries with even low cardinality, Amazon Timestream was unable to return results for sets of 100 queries within the 60-second default query timeout.</li><li>Concurrent query load over the same time range seems to impact Amazon Timestream in a dramatic way.</li></ul><h2 id="cost-comparison-details">Cost Comparison Details</h2><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2023/08/timescale-vs-amazon-timestream-154x-cheaper-than-Amazon-Timestream-1.png" class="kg-image" alt="To run this test TimescaleDB took less than an hour at a
cost of $2.18. The same test took a week and cost $336 in Amazon Timestream" loading="lazy" width="1382" height="746" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2023/08/timescale-vs-amazon-timestream-154x-cheaper-than-Amazon-Timestream-1.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2023/08/timescale-vs-amazon-timestream-154x-cheaper-than-Amazon-Timestream-1.png 1000w, https://timescale.ghost.io/blog/content/images/2023/08/timescale-vs-amazon-timestream-154x-cheaper-than-Amazon-Timestream-1.png 1382w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Results of cost comparison between TimescaleDB and Amazon Timestream</span></figcaption></figure><p>These performance differences between TimescaleDB and Amazon Timestream lead to massive cost differences for the same workloads. </p><p>To compare costs, we calculated the cost for our above insert and query workloads, which store one billion metrics in TimescaleDB and ~410 million metrics in Amazon Timestream (because we could not load the full one billion), and run our suite of queries on top.</p><p>Pricing for Amazon Timestream is rather complex. In all, our bill for testing Amazon Timestream over the <strong>course of seven days</strong> cost us $336.39, <strong>which does not include any Amazon EC2 charges </strong>(which we needed for the extra clients). During that time, our bill shows that:</p><ul><li>We inserted 100&nbsp;GB of data (~500 million metrics total across <strong><em>all</em></strong> of our attempts to ingest data)</li><li>Stored a lot of data in memory (and we continue to be charged per hour for that data)</li><li>Queried 21 TB of data when running 25,000 real-world queries</li></ul><p>For comparison, our tests for TimescaleDB (inserts and queries) completed in far less than an hour, and our Digital Ocean droplet costs $1.50/hour. 
We also ran this test on Timescale, our fully managed TimescaleDB service, and it also completed in far less than an hour; the instance (8 vCPU, 32&nbsp;GB, 1&nbsp;TB) cost $2.18/hour.</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2020/12/image-12.png" class="kg-image" alt="Screenshot of Amazon Timestream bill with various charges, totaling 336.40 USD" loading="lazy" width="2000" height="804" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2020/12/image-12.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2020/12/image-12.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2020/12/image-12.png 1600w, https://timescale.ghost.io/blog/content/images/2020/12/image-12.png 2280w" sizes="(min-width: 720px) 720px"></figure><p><strong>$336.39 for Amazon Timestream vs. $2.18 for fully managed TimescaleDB ($1.50 if you would rather self-manage the instance), which means that TimescaleDB is 154x cheaper than Amazon Timestream (224x cheaper if self-managed)—and it loaded <em>and</em> queried over twice the number of metrics.</strong></p><h3 id="now-let%E2%80%99s-dig-a-little-deeper-into-our-amazon-timestream-bill">Now, let’s dig a little deeper into our Amazon Timestream bill</h3><p>When using Amazon Timestream, users are charged in four main categories: data ingested, data stored in memory, data stored on magnetic storage, and data scanned to satisfy queries.</p><p>Data that resides in the memory store costs $0.036 per GB/hour, while data that is eventually moved to magnetic storage costs only $0.03 per GB/month. 
For our one-month memory store setting (which was required to insert 30 days of historical data), <strong>that’s more than a 720x difference in cost for the same data</strong>.</p><p> What’s more, since Amazon Timestream doesn’t expose any information about how data is stored in the Console or otherwise, we have no idea how well-compressed this data is, or if there is anything more we could do to reduce storage.</p><p>The real surprise, however, came with querying data because the charges don’t scale with performance. Instead, you will be charged for the amount of data scanned to produce a query result, no matter how fast that result comes back. In almost all other database-as-a-service offerings, you can modify the storage, compute, or cluster size (at a known cost) for better performance.</p><p>After waiting nearly two days to insert 410 million metrics, we created the traditional set of query batches (as outlined above) and began to run our queries. <br><br>In total, we had 15 query files, each with 1,000 queries, for a total of 15,000 queries to run against both Amazon Timestream and TimescaleDB. </p><p>While some of the queries are certainly complex, others just ask for the most recent point for each host (and remember, this dataset only had 100 hosts).  Also, recall from our query performance comparison that a few of the most complex queries were unable to return results for just 100 queries, let alone the full 1,000 query test. </p><p><strong>With Amazon Timestream, you are still charged for the data that was read, even if the query was ultimately canceled or never returned a result.</strong></p><p>To validate results, we ran each query file twice. In the case where three of the query files failed to return results, we attempted to execute them five times, hoping for some result. </p><p>Doing the math, this means that we ran around 25,000 queries. In doing so, Amazon says that we scanned 21,598.02 GB of data, which cost $215.98. 
There were certainly a few other ad hoc queries performed through the AWS Console UI, but before we started running the benchmarking queries, the total cost for scanning data was about $15.00.</p><p>Furthermore, as we’ve mentioned a few times, there is no built-in support to help you identify which queries are scanning too much data and how you might improve them. For comparison, both Amazon Redshift and Amazon RDS provide this kind of feedback in their AWS Console interface.</p><p>When we consider some of the recent user applications that we have highlighted elsewhere on our blog, like <a href="https://timescale.ghost.io/blog/blog/how-flightaware-fuels-flight-prediction-models-with-timescaledb-and-grafana/">FlightAware</a> or <a href="https://timescale.ghost.io/blog/blog/how-clevabit-builds-data-pipeline-for-agricultural-iot/">clevabit</a>, 25,000 queries of various shapes and sizes would easily be run in a few hours or less.</p><p>While the bytes scanned might improve over time as partitioning improves, if you don’t need to scale storage beyond a few petabytes of data, it’s hard to see how this would be less costly than a fixed Compute and Storage cost.</p><h2 id="reliability-comparison-details">Reliability Comparison Details</h2><p>Another cardinal rule for a database: it cannot lose or corrupt your data. In this respect, the serverless nature of Amazon Timestream requires that you trust Amazon will not lose your data and all of the data will be stored without corruption. Usually, this is probably a pretty safe bet. In fact, many companies rely on services like Amazon S3 or Amazon Glacier to store their data as a reliable backup solution. 
</p><p>The problem is that we don’t know where our time-series data is stored in Amazon Timestream—because Amazon does not tell us.</p><p>This presents a specific challenge that Amazon hasn’t addressed natively: validating or backing up your data.</p><p>In their <a href="https://docs.aws.amazon.com/timestream/latest/developerguide/timestream.pdf">240-page development guide</a>, the words “recovery” and “restore” don’t appear at all, and the word “backup” appears only once to tell the developer that there isn’t a backup mechanism. Instead, you can “write your own application using the Timestream SDK to query data and save it to the destination of your choice” (page 100).</p><p>This is not to say that Amazon Timestream will lose or corrupt your data. As we mentioned, Amazon S3, for instance, is a widely known and used service for data storage. The issue here is that we’re unable to learn or easily verify where our data resides and how it’s protected in a service interruption.</p><p>We also found it worrisome that with Amazon Timestream, there isn’t a mechanism or support to DELETE or UPDATE existing data. <strong>The only way to remove data is to drop the entire table.</strong> Furthermore, there is no way to recover a deleted table since it is an atomic action that cannot be recovered through any Amazon API or Console. </p><p>Even if one were to write their own backup and restore utility, there is no method for importing more than the most recent year of data because of the memory store retention period limitation. </p><p>As an Amazon Timestream user, all these limitations put us in a precarious position. There’s no easy way to back up our data or restore it once we’ve accumulated more than a year's worth. 
Even if Amazon never loses our data, deleting an essential table of data <a href="https://www.geekwire.com/2015/starbucks-back-in-business-internal-report-blames-deleted-database-table-indicates-outage-was-global/">through human error is not uncommon</a>.</p><p>TimescaleDB uses a dramatically different design principle: build on PostgreSQL. As noted previously, this allows TimescaleDB to inherit over 25 years of dedicated engineering effort that the entire PostgreSQL community has done to build a rock-solid database that supports millions of applications worldwide. (In fact, this principle was at the core of our initial <a href="https://timescale.ghost.io/blog/blog/when-boring-is-awesome-building-a-scalable-time-series-database-on-postgresql-2900ea453ee2/?utm_source=timescale-influx-benchmark&amp;utm_medium=blog&amp;utm_campaign=july-2020-advocacy&amp;utm_content=timescale-launch-blog">TimescaleDB launch announcement</a>.)</p><h2 id="query-language-ecosystem-and-ease-of-use-comparison-details">Query Language, Ecosystem, and Ease-Of-Use Comparison Details</h2><p>We applaud Amazon Timestream’s decision to adopt SQL as their query language. We have always been big fans and vocal advocates of SQL, which has become the query language of choice for data infrastructure, is well-documented, and currently ranks as the <a href="https://insights.stackoverflow.com/survey/2020#most-popular-technologies">third-most commonly used programming language among developers</a> (<a href="https://timescale.ghost.io/blog/blog/why-sql-beating-nosql-what-this-means-for-future-of-data-time-series-database-348b777b847a/">see our SQL vs. NoSQL comparison</a> for more details). 
</p><p>Even if Amazon Timestream functions like a NoSQL database in many ways, opting for SQL as the query interface lowers developers’ barrier to entry—especially when compared to other databases like MongoDB and InfluxDB with their proprietary query languages.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2023/08/timescale-vs-amazon-timestream-Comparisson-table.png" class="kg-image" alt="Screenshot of Stack Overflow Developer survey results, showing percentage breakdown of various languages" loading="lazy" width="1290" height="1617" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2023/08/timescale-vs-amazon-timestream-Comparisson-table.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2023/08/timescale-vs-amazon-timestream-Comparisson-table.png 1000w, https://timescale.ghost.io/blog/content/images/2023/08/timescale-vs-amazon-timestream-Comparisson-table.png 1290w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Most popular Programming, Scripting, and Markup Languages. Source: </em></i><a href="https://insights.stackoverflow.com/survey/2020#most-popular-technologies"><i><em class="italic" style="white-space: pre-wrap;">2020 Stack Overflow Developer Surve</em></i></a><i><em class="italic" style="white-space: pre-wrap;">y</em></i></figcaption></figure><p>As discussed earlier in this article, Amazon Timestream is not a relational database, despite feeling like it could be because of the SQL query interface. It doesn’t support normalized datasets, JOINs across tables, or even some common “tricks of the trade” like a simple LATERAL JOIN and correlated subqueries. </p><p>Between these SQL limitations and the narrow table model that Amazon Timestream enforces on your data, writing efficient (and easily readable) queries can be a challenge. </p>
<!--kg-card-begin: html-->
<p>
    <em>Example</em>
    <br style="display: block">
    To see how Amazon Timestream’s “narrow” table model affects the SQL you write, let’s look at an example from the Timestream documentation, <a href="https://docs.aws.amazon.com/timestream/latest/developerguide/sample-queries.iot-scenarios.html" target="_blank">Queries with aggregate functions.</a>
</p>
<!--kg-card-end: html-->
<p>Specifically, we’ll look at the example to “<em>find the average load and max speed for each truck for the past week</em>”:</p><p><strong>Amazon Timestream SQL (CASE statement needed)</strong></p><pre><code class="language-SQL">SELECT
    bin(time, 1d) as binned_time,
    fleet,
    truck_id,
    make,
    model,
    AVG(
        CASE WHEN measure_name = 'load' THEN measure_value::double ELSE NULL END
    ) AS avg_load_tons,
    MAX(
        CASE WHEN measure_name = 'speed' THEN measure_value::double ELSE NULL END
    ) AS max_speed_mph
FROM "sampleDB".IoT
WHERE time &gt;= ago(7d)
AND measure_name IN ('load', 'speed')
GROUP BY fleet, truck_id, make, model, bin(time, 1d)
ORDER BY truck_id</code></pre><p><strong>TimescaleDB SQL</strong></p><pre><code class="language-SQL">SELECT
    time_bucket('1 day', time) as binned_time,
    fleet,
    truck_id,
    make,
    model,
    AVG(load) AS avg_load_tons,
    MAX(speed) AS max_speed_mph
FROM "public".IoT
WHERE time &gt;= now() - INTERVAL '7 days'
GROUP BY fleet, truck_id, make, model, binned_time
ORDER BY truck_id</code></pre><p>If the above is any indication, even the simplest aggregate queries in Amazon Timestream require multiple levels of CASE statements and column renaming. We could not find a simpler way to write this query or to pivot the results.</p><p>Conversely, as the example shows, with TimescaleDB we use standard SQL syntax. Additionally, any query that already works with your PostgreSQL-supported applications will “just work.” The same isn’t true for Amazon Timestream.</p><p>So while the decision to adopt a SQL-like query language is a great start for Amazon Timestream, there is still a lot to be desired for a truly frictionless, developer-first experience.</p><h2 id="summary">Summary</h2><p>No one wants to invest in a technology only to have it limit their growth or scale in the future, let alone invest in something that's the wrong fit today.</p><p>Before making a decision, we recommend taking a step back and analyzing your stack, your team's skills, and your needs (now and in the future). It could be the difference between infrastructure that evolves and grows with you and one that forces you to start all over.</p><p>In this post, we performed a detailed comparison of TimescaleDB and Amazon Timestream. We don’t claim to be <a href="https://www.tigerdata.com/blog/so-long-timestream-how-and-why-to-migrate-before-its-too-late" rel="noreferrer">Amazon Timestream</a> experts, so we’re open to suggestions on improving this comparison—and invite you to perform your own and share your results.</p><p>In general, we aim to be as transparent as possible about our data models, methodologies, and analysis, and we welcome feedback. 
We also encourage readers to raise any concerns about the information we’ve presented to help us with benchmarking in the future.</p><p>We recognize that <a href="http://www.timescale.com/?utm_source=timescale-influx-benchmark&amp;utm_medium=blog&amp;utm_campaign=july-2020-advocacy&amp;utm_content=homepage">Timescale</a> isn’t the only time-series solution on the market. There are situations where it might not be the best time-series database choice, and we strive to be upfront in admitting where an alternate solution may be preferable.</p><p>We’re always interested in holistically evaluating our solution against others, and we’ll continue to share our insights with the greater community.</p><h2 id="want-to-learn-more-about-timescale">Want to Learn More About Timescale?</h2><p><a href="https://console.cloud.timescale.com/signup">Create a free account </a>to get started with a fully managed TimescaleDB instance (100&nbsp;% free for 30 days). </p><p>Want to host TimescaleDB yourself? <a href="https://github.com/timescale/timescaledb">Visit our GitHub</a> to learn more about options, get installation instructions, and more (and, as always, ⭐️  are appreciated!)</p><p>Join our <a href="http://slack.timescale.com/">Slack community</a> to ask questions, get advice, and connect with other developers (I, as well as our co-founders, engineers, and passionate community members are active on all channels).</p>]]></content:encoded>
        </item>
    </channel>
</rss>