<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[Tiger Data Blog]]></title>
        <description><![CDATA[Insights, product updates, and tips from TigerData (Creators of TimescaleDB) engineers on Postgres, time series & AI. IoT, crypto, and analytics tutorials & use cases.]]></description>
        <link>https://www.tigerdata.com/blog</link>
        <image>
            <url>https://www.tigerdata.com/icon.ico</url>
            <title>Tiger Data Blog</title>
            <link>https://www.tigerdata.com/blog</link>
        </image>
        <generator>RSS for Node</generator>
        <lastBuildDate>Tue, 07 Apr 2026 13:05:37 GMT</lastBuildDate>
        <atom:link href="https://www.tigerdata.com/blog" rel="self" type="application/rss+xml"/>
        <ttl>60</ttl>
        <item>
            <title><![CDATA[Understanding PostgreSQL Aggregation and Hyperfunctions’ Design]]></title>
<description><![CDATA[A discussion of aggregation in PostgreSQL and how it integrates with the design of Timescale’s hyperfunctions.]]></description>
            <link>https://www.tigerdata.com/blog/how-postgresql-aggregation-works-and-how-it-inspired-our-hyperfunctions-design</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/how-postgresql-aggregation-works-and-how-it-inspired-our-hyperfunctions-design</guid>
            <category><![CDATA[Product & Engineering]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Analytics]]></category>
            <dc:creator><![CDATA[David Kohn]]></dc:creator>
            <pubDate>Thu, 11 Jan 2024 17:06:03 GMT</pubDate>
            <media:content medium="image" url="https://timescale.ghost.io/blog/content/images/2024/01/PostgreSQL-Aggregation-and-Hyperfunctions-Design--1-.png">
            </media:content>
            <content:encoded><![CDATA[<p>At Timescale, our goal is always to focus on the developer experience, and we take great care to design our products and APIs to be developer-friendly. This focus on developer experience is why we decided <a href="https://timescale.ghost.io/blog/blog/when-boring-is-awesome-building-a-scalable-time-series-database-on-postgresql-2900ea453ee2/">early in the design of TimescaleDB to build on top of PostgreSQL</a>. We believed then, as we do now, that building on the world’s fastest-growing database would have numerous benefits for our users.</p><p>The same logic applies to many of our features, including <a href="https://www.timescale.com/learn/time-series-data-analysis-hyperfunctions" rel="noreferrer">hyperfunctions</a>. Timescale's hyperfunctions are designed to enhance PostgreSQL's native aggregation capabilities—<a href="https://timescale.ghost.io/blog/introducing-hyperfunctions-new-sql-functions-to-simplify-working-with-time-series-data-in-postgresql/" rel="noreferrer">a series of SQL functions within TimescaleDB that make it easier to manipulate and analyze time-series data in PostgreSQL with fewer lines of code</a>.</p><p>So, let's have a closer look at how PostgreSQL aggregation works and how it has influenced the design of Timescale's hyperfunctions.</p><div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-emoji">✨</div><div class="kg-callout-text">Before we start, here's a recap on <a href="https://www.timescale.com/learn/understanding-sql-aggregate-functions" rel="noreferrer">SQL aggregate functions</a> and how to use them.</div></div><h2 id="a-primer-on-postgresql-aggregation">A Primer on PostgreSQL Aggregation</h2><p>We'll start by going over PostgreSQL aggregates. 
But first, a little backstory.</p><p>When I first started learning about PostgreSQL five or six years ago (I was an electrochemist and was dealing with lots of battery data, as mentioned in <a href="https://timescale.ghost.io/blog/blog/what-time-weighted-averages-are-and-why-you-should-care/">my last post on time-weighted averages</a>), I ran into some performance issues. I was trying to understand better what was happening inside the database to improve its performance—and that’s when I found <a href="https://momjian.us">Bruce Momjian</a>’s talks on <a href="https://momjian.us/main/presentations/internals.html">PostgreSQL Internals Through Pictures</a>. Bruce is well-known in the community for his insightful talks (and his penchant for bow ties); his sessions were a revelation for me. </p><p>They’ve served as a foundation for my understanding of how PostgreSQL works ever since. He explained things so clearly, and I’ve always learned best when I can visualize what’s going on, so the “through pictures” part really helped—and stuck with—me. </p><p>So, this next bit is my attempt to channel Bruce by explaining some PostgreSQL internals through pictures. Cinch up your bow ties and get ready for some learnin'.</p>
<!--kg-card-begin: html-->
<figure>
	<iframe height="100px" width="100px" style="min-width: 100%; min-height: 330px" 		src="https://s3.amazonaws.com/blog.timescale.com/gifs/player.html?source=https://s3.amazonaws.com/blog.timescale.com/gifs/how-postgres-works/david_bowtie.mp4" frameborder="0" class="gif-stand-in">
	</iframe>
    <figcaption class="gif-caption">The author pays homage to Bruce Momjian (and looks rather pleased with himself because he’s managed to tie a bow tie on the first try).
    </figcaption>
</figure>

<!--kg-card-end: html-->
<h3 id="postgresql-aggregates-vs-functions">PostgreSQL aggregates vs. functions</h3><p>We have written about <a href="https://timescale.ghost.io/blog/blog/introducing-hyperfunctions-new-sql-functions-to-simplify-working-with-time-series-data-in-postgresql/">how we use custom functions and aggregates to extend SQL</a>, but we haven’t exactly explained the difference<em> between</em> them.</p><p>The fundamental difference between an aggregate function and a “regular” function in SQL is that an <strong>aggregate</strong> produces a single result from a <em>group</em> of related rows, while a regular <strong>function </strong>produces a result for <em>each</em> row:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2021/08/Aggregate-vs.-Function-2.jpg" class="kg-image" alt="A side-by-side diagram depicting an “aggregate” side and a “function” side and how each product results. There are three individual rows on the aggregate side, with arrows that point to a single result; on the function side, there are three individual rows, with arrows that point to three different results (one per row). " loading="lazy" width="1804" height="752" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2021/08/Aggregate-vs.-Function-2.jpg 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2021/08/Aggregate-vs.-Function-2.jpg 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2021/08/Aggregate-vs.-Function-2.jpg 1600w, https://timescale.ghost.io/blog/content/images/2021/08/Aggregate-vs.-Function-2.jpg 1804w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">In SQL, aggregates produce a result from multiple rows, while functions produce a result per row.</span></figcaption></figure><p>This is not to say that a function can’t have inputs from multiple columns; they just have to come from the same row. 
</p><p>Another way to think about it is that functions often act on rows, whereas aggregates act on columns. To illustrate this, let’s consider a theoretical table <code>foo</code> with two columns:</p><pre><code class="language-SQL">CREATE TABLE foo(
	bar DOUBLE PRECISION,
	baz DOUBLE PRECISION);
</code></pre><p>And just a few values so we can easily see what’s going on:</p><pre><code class="language-SQL">INSERT INTO foo(bar, baz) VALUES (1.0, 2.0), (2.0, 4.0), (3.0, 6.0);
</code></pre><p>The function <a href="https://www.postgresql.org/docs/13/functions-conditional.html#FUNCTIONS-GREATEST-LEAST"><code>greatest()</code></a> will produce the largest of the values in columns <code>bar</code> and <code>baz</code> for each row:</p><p></p><pre><code class="language-SQL">SELECT greatest(bar, baz) FROM foo; 
 greatest 
----------
        2
        4
        6
</code></pre><p>Whereas the aggregate <a href="https://www.postgresql.org/docs/current/functions-aggregate.html"><code>max()</code></a> will produce the largest value from each column:</p><pre><code class="language-SQL">SELECT max(bar) as bar_max, max(baz) as baz_max FROM foo;

 bar_max | baz_max 
---------+---------
       3 |       6
</code></pre><p>Using the above data, here’s a picture of what happens when we aggregate something:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2021/08/Aggregate-vs.-Function-1-1.jpg" class="kg-image" alt="A diagram showing how the statement: `SELECT max(bar) FROM foo;` works: multiple rows with values of “bar equal to” 1.0, 2.0, and 3.0, go through the `max(bar)` aggregate to ultimately produce a result of 3.0. " loading="lazy" width="1804" height="600" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2021/08/Aggregate-vs.-Function-1-1.jpg 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2021/08/Aggregate-vs.-Function-1-1.jpg 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2021/08/Aggregate-vs.-Function-1-1.jpg 1600w, https://timescale.ghost.io/blog/content/images/2021/08/Aggregate-vs.-Function-1-1.jpg 1804w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The </span><code spellcheck="false" style="white-space: pre-wrap;"><span>max()</span></code><span style="white-space: pre-wrap;"> aggregate gets the largest value from multiple rows.</span></figcaption></figure><p>The aggregate takes inputs from multiple rows and produces a single result. That’s the main difference between it and a function, but how does it do that? Let’s look at what it’s doing under the hood.</p><h3 id="aggregate-internals-row-by-row">Aggregate internals: Row-by-row</h3><p>Under the hood, aggregates in PostgreSQL work row-by-row. But then, how does an aggregate know anything about the previous rows?</p><p>An aggregate stores some state about the rows it has previously seen, and as the database sees new rows, it updates that internal state.</p><p>For the <code>max()</code> aggregate we’ve been discussing, the internal state is simply the largest value we’ve collected so far. 
</p><p>Let’s take this step-by-step.</p><p>When we start, our internal state is <code>NULL</code> because we haven’t seen any rows yet:</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2021/08/1-2.jpg" class="kg-image" alt="Flowchart arrow diagram representing the max open parens bar close parens aggregate, with three rows below the arrow where bar is equal to 1.0, 2.0, and 3.0, respectively. There is a box in the arrow in which the state is equal to NULL. " loading="lazy" width="1600" height="904" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2021/08/1-2.jpg 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2021/08/1-2.jpg 1000w, https://timescale.ghost.io/blog/content/images/2021/08/1-2.jpg 1600w" sizes="(min-width: 720px) 720px"></figure><p>Then, we get our first row in:</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2021/08/2-2.jpg" class="kg-image" alt="" loading="lazy" width="1600" height="904" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2021/08/2-2.jpg 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2021/08/2-2.jpg 1000w, https://timescale.ghost.io/blog/content/images/2021/08/2-2.jpg 1600w" sizes="(min-width: 720px) 720px"></figure><p>Since our state is <code>NULL</code>, we initialize it to the first value we see:</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2021/08/3-1.jpg" class="kg-image" alt="The same flowchart diagram, except that row one has moved _out_ of the arrow, and the state has been updated from NULL to the 1.0, row one’s value. 
" loading="lazy" width="1600" height="904" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2021/08/3-1.jpg 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2021/08/3-1.jpg 1000w, https://timescale.ghost.io/blog/content/images/2021/08/3-1.jpg 1600w" sizes="(min-width: 720px) 720px"></figure><p>Now, we get our second row:</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2021/08/4-2.jpg" class="kg-image" alt="The same flowchart diagram, except that row two has moved into the arrow representing the max aggregate. " loading="lazy" width="1600" height="904" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2021/08/4-2.jpg 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2021/08/4-2.jpg 1000w, https://timescale.ghost.io/blog/content/images/2021/08/4-2.jpg 1600w" sizes="(min-width: 720px) 720px"></figure><p>And we see that the value of bar (2.0) is greater than our current state (1.0), so we update the state:</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2021/08/5-2.jpg" class="kg-image" alt=" The same diagram, except that row two has moved out of the max aggregate, and the state has been updated to the largest value (the value of row two, 2.0). " loading="lazy" width="1600" height="904" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2021/08/5-2.jpg 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2021/08/5-2.jpg 1000w, https://timescale.ghost.io/blog/content/images/2021/08/5-2.jpg 1600w" sizes="(min-width: 720px) 720px"></figure><p>Then, the next row comes into the aggregate:</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2021/08/6-1.jpg" class="kg-image" alt="The same diagram, except that the row three has moved into the arrow representing the max aggregate. 
" loading="lazy" width="1600" height="904" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2021/08/6-1.jpg 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2021/08/6-1.jpg 1000w, https://timescale.ghost.io/blog/content/images/2021/08/6-1.jpg 1600w" sizes="(min-width: 720px) 720px"></figure><p>We compare it to our current state, take the greatest value, and update our state:</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2021/08/7-1.jpg" class="kg-image" alt="The same diagram, expect that row three has moved out of the max aggregate, and the state has been updated to the largest value, the value of the third row, 3.0." loading="lazy" width="1600" height="904" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2021/08/7-1.jpg 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2021/08/7-1.jpg 1000w, https://timescale.ghost.io/blog/content/images/2021/08/7-1.jpg 1600w" sizes="(min-width: 720px) 720px"></figure><p>Finally, we don’t have any more rows to process, so we output our result:</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2021/08/8-1.jpg" class="kg-image" alt="The same diagram, now noting that there are “no more rows” to process, and including a final result, 3.0, being output at the end of the arrow. " loading="lazy" width="1600" height="904" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2021/08/8-1.jpg 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2021/08/8-1.jpg 1000w, https://timescale.ghost.io/blog/content/images/2021/08/8-1.jpg 1600w" sizes="(min-width: 720px) 720px"></figure><p>So, to summarize, each row comes in, gets compared to our current state, and then the state gets updated to reflect the new greatest value. Then the next row comes in, and we repeat the process until we’ve processed all our rows and output the result.</p>
<!--kg-card-begin: html-->
    <video autoplay loop muted playsinline>
      <source id="player" src="https://s3.amazonaws.com/blog.timescale.com/gifs/how-postgres-works/how_postgres_works_1.mp4" type="video/mp4"></source>
    </video>
    <figcaption class="gif-caption">The `max()` aggregation process, told in GIFs. 
    </figcaption>
<!--kg-card-end: html-->
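<p>The row-by-row update shown above can be written out as a plain SQL function plus a <code>CREATE AGGREGATE</code> call. Here’s a minimal, hand-rolled sketch (the <code>my_max</code> names are illustrative, not PostgreSQL’s built-in <code>max()</code>):</p><pre><code class="language-SQL">-- Update step: given the state so far and the incoming row's value,
-- return the new state (the largest value seen so far).
CREATE FUNCTION my_max_step(state DOUBLE PRECISION, value DOUBLE PRECISION)
RETURNS DOUBLE PRECISION AS $$
    SELECT CASE
        WHEN state IS NULL THEN value   -- first row: initialize the state
        WHEN value > state THEN value   -- new largest value seen
        ELSE state                      -- keep the current state
    END;
$$ LANGUAGE SQL IMMUTABLE;

CREATE AGGREGATE my_max(DOUBLE PRECISION) (
    SFUNC = my_max_step,        -- called once per row
    STYPE = DOUBLE PRECISION    -- the type of the internal state
);
</code></pre><p>With the <code>foo</code> table from earlier, <code>SELECT my_max(bar) FROM foo;</code> walks the rows exactly as pictured: the state goes NULL → 1.0 → 2.0 → 3.0, and 3 comes out at the end.</p>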
<p></p><p>There’s a name for the function that processes each row and updates the internal state: the <a href="https://www.postgresql.org/docs/current/sql-createaggregate.html"><strong>state transition function</strong></a> (or just “transition function” for short.) The transition function for an aggregate takes the current state and the value from the incoming row as arguments and produces a new state. </p><p>It’s defined like this, where <code>current_value</code> represents values from the incoming row, <code>current_state</code> represents the current aggregate state built up over the previous rows (or NULL if we haven’t yet gotten any), and <code>next_state</code> represents the output after analyzing the incoming row:</p><pre><code class="language-SQL">next_state = transition_func(current_state, current_value)</code></pre><h3 id="aggregate-internals-composite-state">Aggregate internals: Composite state</h3><p>So, the <code>max()</code> aggregate has a straightforward state that contains just one value (the largest we’ve seen). But not all aggregates in PostgreSQL have such a simple state.</p><p>Let’s consider the aggregate for average (<code>avg</code>):</p><pre><code class="language-SQL">SELECT avg(bar) FROM foo;</code></pre><p>To refresh, an average is defined as:</p><p>\begin{equation} avg(x) = \frac{sum(x)}{count(x)}  \end{equation}</p>
<p>To calculate it, we store the sum and the count as our internal state and update our state as we process rows:</p>
<!--kg-card-begin: html-->
    <video autoplay loop muted playsinline>
      <source id="player" src="https://s3.amazonaws.com/blog.timescale.com/gifs/how-postgres-works/how_postgres_works_2.mp4" type="video/mp4"></source>
    </video>
    <figcaption class="gif-caption">The `avg()` aggregation process, told in GIFs. For `avg()`, the transition function must update a more complex state since the sum and count are stored separately at each aggregation step. 
    </figcaption>
<!--kg-card-end: html-->
<p>But, when we’re ready to output our result for <code>avg</code>, we need to divide <code>sum</code> by <code>count</code>:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2021/08/9-1.jpg" class="kg-image" alt="An arrow flowchart diagram similar to those before, showing the end state of the avg aggregate. The rows have moved through the aggregate, and the state is 6.0 - the sum and three - the count. There are then some question marks and an end result of 2.0." loading="lazy" width="1600" height="904" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2021/08/9-1.jpg 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2021/08/9-1.jpg 1000w, https://timescale.ghost.io/blog/content/images/2021/08/9-1.jpg 1600w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">For some aggregates, we can output the state directly – but for others, we need to perform an operation on the state before calculating our final result.</span></figcaption></figure><p>There’s another function inside the aggregate that performs this calculation: the <a href="https://www.postgresql.org/docs/current/sql-createaggregate.html"><strong>final function</strong></a>. Once we’ve processed all the rows, the final function takes the state and does whatever it needs to produce the result. </p><p>It’s defined like this, where <code>final_state</code> represents the output of the transition function after it has processed all the rows:</p><pre><code class="language-SQL">result = final_func(final_state)
</code></pre><p>And, through pictures:</p>
<!--kg-card-begin: html-->
    <video autoplay loop muted playsinline>
      <source id="player" src="https://s3.amazonaws.com/blog.timescale.com/gifs/how-postgres-works/how_postgres_works_3.mp4" type="video/mp4"></source>
    </video>
    <figcaption class="gif-caption">How the average aggregate works, told in GIFs. Here, we’re highlighting the role of the final function.
    </figcaption>
<!--kg-card-end: html-->
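<p>In <code>CREATE AGGREGATE</code> terms, the sum-and-count state and the dividing final function can be spelled out directly. A minimal hand-rolled sketch (the <code>my_avg</code> names are illustrative, not PostgreSQL’s built-in <code>avg()</code>, which uses a more careful internal state):</p><pre><code class="language-SQL">-- The state is a {sum, count} pair carried in an array.
CREATE FUNCTION my_avg_step(state DOUBLE PRECISION[], value DOUBLE PRECISION)
RETURNS DOUBLE PRECISION[] AS $$
    SELECT ARRAY[state[1] + value, state[2] + 1];  -- add to sum, bump count
$$ LANGUAGE SQL STRICT;

CREATE FUNCTION my_avg_final(state DOUBLE PRECISION[])
RETURNS DOUBLE PRECISION AS $$
    -- sum / count, returning NULL for an empty group
    SELECT CASE WHEN state[2] = 0 THEN NULL ELSE state[1] / state[2] END;
$$ LANGUAGE SQL STRICT;

CREATE AGGREGATE my_avg(DOUBLE PRECISION) (
    SFUNC     = my_avg_step,       -- runs once per row
    STYPE     = DOUBLE PRECISION[],
    INITCOND  = '{0,0}',
    FINALFUNC = my_avg_final       -- runs once per group
);
</code></pre>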
<p>To summarize, as an aggregate scans over rows, its <strong>transition function</strong> updates its internal state. Once the aggregate has scanned all of the rows, its <strong>final function</strong> produces a result, which is returned to the user.</p><h3 id="improving-the-performance-of-aggregate-functions">Improving the performance of aggregate functions</h3><p>One interesting thing to note here: the transition function is called many, many more times than the final function: once for each row, whereas the final function is called once per <em>group</em> of rows. </p><p>Now, the transition function isn’t inherently more expensive than the final function on a per-call basis—but because there are usually orders of magnitude more rows going into the aggregate than coming out, the transition function step becomes the most expensive part very quickly. This is especially true when you have high-volume time-series data being ingested at high rates; optimizing aggregate transition function calls is important for improving performance.</p><p>Luckily, PostgreSQL already has ways to optimize aggregates.</p><h3 id="parallelization-and-the-combine-function">Parallelization and the combine function</h3><p>Because the transition function is run on each row, <a href="https://www.postgresql.org/message-id/flat/CA%2BTgmoYSL_97a--qAvdOa7woYamPFknXsXX17m0t2Pwc%2BFOvYw%40mail.gmail.com#fb9f2ae2a52ac605a4439a1879ff3c10">some enterprising PostgreSQL developers</a> asked: <em>what if we parallelized the transition function calculation?</em> </p><p>Let’s revisit our definitions for transition functions and final functions:</p><pre><code class="language-SQL">next_state = transition_func(current_state, current_value)

result = final_func(final_state)</code></pre><p>We can run this in parallel by instantiating multiple copies of the transition function and handing a subset of rows to each instance. Then, each parallel aggregate will run the transition function over the subset of rows it sees, producing multiple (partial) states, one for each parallel aggregate. But, since we need to aggregate over the <em>entire</em> data set, we can’t run the final function on each parallel aggregate separately because they only have some of the rows. </p><p>So, now we’ve ended up in a bit of a pickle: we have multiple partial aggregate states, and the final function is only meant to work on the single, final state—right before we output the result to the user. </p><p>To solve this problem, we need a new type of function that combines two partial states into one so that the final function can do its work. This is (aptly) called the <a href="https://www.postgresql.org/docs/current/sql-createaggregate.html"><strong>combine function</strong></a>. </p><p>We can run the combine function iteratively over all of the partial states that are created when we parallelize the aggregate.</p><pre><code class="language-SQL">combined_state = combine_func(partial_state_1, partial_state_2)</code></pre><p>For instance, in <code>avg</code>, the combine function will add up the counts and sums.</p>
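<p>For an average-style aggregate whose partial state is a {sum, count} pair, combining two partial states is just element-wise addition. A hedged sketch (hypothetical names; <code>COMBINEFUNC</code> and <code>PARALLEL</code> are the actual <code>CREATE AGGREGATE</code> options involved):</p><pre><code class="language-SQL">-- Merge two partial {sum, count} states produced by parallel workers.
CREATE FUNCTION my_avg_combine(a DOUBLE PRECISION[], b DOUBLE PRECISION[])
RETURNS DOUBLE PRECISION[] AS $$
    SELECT ARRAY[a[1] + b[1], a[2] + b[2]];
$$ LANGUAGE SQL STRICT;

-- Registered alongside the transition and final functions, it lets the
-- planner split the scan across workers:
--
--   CREATE AGGREGATE my_avg(DOUBLE PRECISION) (
--       SFUNC = ..., STYPE = DOUBLE PRECISION[], INITCOND = '{0,0}',
--       FINALFUNC = ..., COMBINEFUNC = my_avg_combine, PARALLEL = SAFE
--   );
</code></pre>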
<!--kg-card-begin: html-->
    <video autoplay loop muted playsinline>
      <source id="player" src="https://s3.amazonaws.com/blog.timescale.com/gifs/how-postgres-works/how_postgres_works_4.mp4" type="video/mp4"></source>
    </video>
    <figcaption class="gif-caption">How parallel aggregation works, told in GIFs. Here, we’re highlighting the combine function. (We’ve added a couple more rows to illustrate parallel aggregation.)
    </figcaption>
<!--kg-card-end: html-->
<p>Then, after we have the combined state from all of our parallel aggregates, we run the final function and get our result.</p><h3 id="deduplication">Deduplication</h3><p>Parallelization and the combine function are one way to reduce the cost of calling an aggregate, but they’re not the only way. </p><p>One other built-in <a href="https://www.tigerdata.com/blog/best-practices-for-query-optimization-in-postgresql" rel="noreferrer">PostgreSQL optimization</a> that reduces an aggregate’s cost occurs in a statement like this:</p><pre><code class="language-SQL">SELECT avg(bar), avg(bar) / 2 AS half_avg FROM foo;</code></pre><p>PostgreSQL will optimize this statement to evaluate the <code>avg(bar)</code> calculation only once and then use that result twice. </p><p>And what if we have different aggregates with the same transition function but different final functions? PostgreSQL further optimizes by calling the transition function (the expensive part) on all the rows and then doing both final functions! Pretty neat!</p><p>That’s not all that PostgreSQL aggregates can do, but it’s a pretty good tour, and it’s enough to get us where we need to go today.</p><h2 id="two-step-aggregation-in-timescaledb-hyperfunctions">Two-step Aggregation in TimescaleDB Hyperfunctions</h2><p>In TimescaleDB, we’ve implemented the two-step aggregation design pattern for our aggregate functions. This generalizes the PostgreSQL internal aggregation API and exposes it to the user via our aggregates, accessors, and rollup functions. 
(In other words, each internal <a href="https://www.tigerdata.com/blog/function-pipelines-building-functional-programming-into-postgresql-using-custom-operators" rel="noreferrer">PostgreSQL function</a> has an equivalent function in TimescaleDB hyperfunctions.)</p><p>As a refresher, when we talk about the two-step aggregation design pattern, we mean the following convention, where we have an inner aggregate call:</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2021/08/Inner-aggregate-call-1.jpg" class="kg-image" alt="SELECT average(time_weight('LOCF', value)) as time_weighted_average FROM foo; -- or SELECT approx_percentile(0.5, percentile_agg(value)) as median FROM bar;  With the snippets: time_weight('LOCF', value) and percentile_agg(value) highlighted." loading="lazy" width="1232" height="230" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2021/08/Inner-aggregate-call-1.jpg 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2021/08/Inner-aggregate-call-1.jpg 1000w, https://timescale.ghost.io/blog/content/images/2021/08/Inner-aggregate-call-1.jpg 1232w" sizes="(min-width: 720px) 720px"></figure><p>And an outer accessor call:</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2021/08/Outer-accessor-call-1.jpg" class="kg-image" alt="The same as the previous in terms of code, except the sections: average(time_weight('LOCF', value)) and approx_percentile(0.5, percentile_agg(value)) are highlighted" loading="lazy" width="1232" height="230" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2021/08/Outer-accessor-call-1.jpg 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2021/08/Outer-accessor-call-1.jpg 1000w, https://timescale.ghost.io/blog/content/images/2021/08/Outer-accessor-call-1.jpg 1232w" sizes="(min-width: 720px) 720px"></figure><p>The inner aggregate call returns the internal state, just like the 
transition function does in PostgreSQL aggregates. </p><p>The outer accessor call takes the internal state and returns a result to the user, just like the final function does in PostgreSQL. </p><p>We also have special <a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/rollup-percentile/#sample-usage"><code>rollup</code></a> functions <a href="https://docs.timescale.com/api/latest/hyperfunctions/time-weighted-averages/rollup-timeweight/">defined for each of our aggregates</a> that work much like PostgreSQL combine functions.</p>
<!--kg-card-begin: html-->
<span id="agg-table" />
<!--kg-card-end: html-->
<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2021/08/table-pg-two-step-comparison-1.jpg" class="kg-image" alt="A table with columns labeled: the PostgreSQL internal aggregation API, Two-step aggregate equivalent, and TimescaleDB hyperfunction example. In the first row, we have the transition function equivalent to the aggregate, and the examples are time_weight() and percentile_agg(). In the second row, we have the final function, equivalent to the accessor, and the examples are average() and approx_percentile(). In the third row, we have the combine function equivalent to rollup in two-step aggregates, and the example is rollup()." loading="lazy" width="1109" height="466" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2021/08/table-pg-two-step-comparison-1.jpg 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2021/08/table-pg-two-step-comparison-1.jpg 1000w, https://timescale.ghost.io/blog/content/images/2021/08/table-pg-two-step-comparison-1.jpg 1109w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">PostgreSQL internal aggregation APIs and their TimescaleDB hyperfunctions’ equivalent</span></figcaption></figure><h3 id="why-we-use-the-two-step-aggregate-design-pattern">Why we use the two-step aggregate design pattern</h3><p>There are four basic reasons we expose the two-step aggregate design pattern to users rather than leave it as an internal structure (and the last two helped us build our continuous aggregates): </p><ol><li>Allow multi-parameter aggregates to re-use state, making them more efficient.</li><li>Cleanly distinguish between parameters that affect aggregates vs. 
accessors, making performance implications easier to understand and predict.</li><li>Enable easy-to-understand rollups, with logically consistent results, in continuous aggregates and window functions (one of our most common requests on continuous aggregates).</li><li>Allow easier <em>retrospective analysis</em> of downsampled data in continuous aggregates as requirements change after the raw data is already gone.</li></ol><p>That’s a little theoretical, so let’s dive in and explain each one.</p><h2 id="efficiency-of-two-step-aggregates">Efficiency of Two-Step Aggregates<br></h2><h3 id="re-using-state"><strong>Re-using state</strong></h3><p>PostgreSQL is very good at optimizing statements (as we saw earlier in this post through pictures 🙌), but you have to give it things in a way it can understand. </p><p>For instance, <a href="https://timescale.ghost.io/blog/blog/how-postgresql-aggregation-works-and-how-it-inspired-our-hyperfunctions-design-2/#deduplication">when we talked about deduplication</a>, we saw that PostgreSQL could “figure out” when a statement occurs more than once in a query (i.e., <code>avg(bar)</code>) and only run the statement a single time to avoid redundant work:</p><pre><code class="language-SQL">SELECT avg(bar), avg(bar) / 2 AS half_avg FROM foo;</code></pre><p>This works because the <code>avg(bar)</code> occurs multiple times without variation. </p><p>However, if I write the equation in a slightly different way and move the division <em>inside</em> the parentheses so that the expression <code>avg(bar)</code> doesn’t repeat so neatly, PostgreSQL <em>can’t</em> figure out how to optimize it:</p><pre><code class="language-SQL">SELECT avg(bar), avg(bar / 2) AS half_avg FROM foo;</code></pre><p>It doesn’t know that division distributes over the average, so it can’t tell that those two queries are equivalent. 
</p><p>This is a complicated problem for database developers to solve, and thus, as a PostgreSQL user, you need to make sure to write your query in a way that the database can understand. </p><p>Performance problems caused by statements that are equivalent, but that the database doesn’t recognize as equivalent (or that are equivalent only in the specific case you wrote, not in the general case), can be some of the trickiest SQL optimizations to figure out as a user. </p><p>Therefore, <strong>when we design our APIs, we try to make it hard for users to write low-performance code unintentionally: in other words, the default option should be the high-performance option</strong>.</p><p>For the next bit, it’ll be useful to have a simple table defined as:</p><pre><code class="language-SQL">CREATE TABLE foo(
	id INTEGER, 
	ts timestamptz, 
	val DOUBLE PRECISION);</code></pre><p>Let’s look at an example of how we use two-step aggregation in the <a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/">percentile approximation hyperfunction</a> to allow PostgreSQL to optimize performance.</p><pre><code class="language-SQL">SELECT 
    approx_percentile(0.1, percentile_agg(val)) as p10, 
    approx_percentile(0.5, percentile_agg(val)) as p50, 
    approx_percentile(0.9, percentile_agg(val)) as p90 
FROM foo;</code></pre><p>...is treated as the same as:</p><pre><code class="language-SQL">SELECT 
    approx_percentile(0.1, pct_agg) as p10, 
    approx_percentile(0.5, pct_agg) as p50, 
    approx_percentile(0.9, pct_agg) as p90 
FROM 
(SELECT percentile_agg(val) as pct_agg FROM foo) pct;
</code></pre><p>This calling convention allows us to use identical aggregates so that, under the hood, PostgreSQL can deduplicate calls to the identical aggregates (and is faster as a result).</p><p>Now, let’s compare this to the one-step aggregate approach. </p><p>PostgreSQL can’t deduplicate aggregate calls here because the extra parameter in the <code>approx_percentile</code> aggregate changes with each call:</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2021/08/Approx_percentile.jpg" class="kg-image" alt="-- NB: THIS IS AN EXAMPLE OF AN API WE DECIDED NOT TO USE, IT DOES NOT WORK SELECT      approx_percentile(0.1, val) as p10,      approx_percentile(0.5, val) as p50,      approx_percentile(0.9, val) as p90  FROM foo; The first values in each of the approx_percentile calls are highlighted and in red." loading="lazy" width="1232" height="368" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2021/08/Approx_percentile.jpg 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2021/08/Approx_percentile.jpg 1000w, https://timescale.ghost.io/blog/content/images/2021/08/Approx_percentile.jpg 1232w" sizes="(min-width: 720px) 720px"></figure><p>So, even though all of those functions could use the same approximation built up over all the rows, PostgreSQL has no way of knowing that. The two-step aggregation approach enables us to structure our calls so that PostgreSQL can optimize our code, and it enables developers to understand when things will be more expensive and when they won't. 
Multiple different aggregates with different inputs will be expensive, whereas multiple accessors to the same aggregate will be much less expensive.</p><h3 id="cleanly-distinguishing-between-aggregateaccessor-parameters">Cleanly distinguishing between aggregate/accessor parameters</h3><p>We also chose the two-step aggregate approach because some of our aggregates can take multiple parameters or options themselves, and their accessors can also take options:</p><pre><code class="language-SQL">SELECT
    approx_percentile(0.5, uddsketch(1000, 0.001, val)) as median, -- 1000 buckets, 0.001 target err
    approx_percentile(0.9, uddsketch(1000, 0.001, val)) as p90, 
    approx_percentile(0.5, uddsketch(100, 0.01, val)) as less_accurate_median -- modifying the aggregate's parameters gets a new approximation
FROM foo;</code></pre><p><br>That’s an example of <a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/percentile-aggregation-methods/uddsketch/"><code>uddsketch</code></a>, an <a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/percentile-aggregation-methods/#choosing-the-right-algorithm-for-your-use-case">advanced aggregation method</a> for percentile approximation that can take its own parameters. </p><p>Imagine if the parameters were jumbled together in one aggregate:</p><pre><code class="language-SQL">-- NB: THIS IS AN EXAMPLE OF AN API WE DECIDED NOT TO USE, IT DOES NOT WORK
SELECT
    approx_percentile(0.5, 1000, 0.001, val) as median
FROM foo;
</code></pre><p><br>It’d be pretty difficult to understand which argument is related to which part of the functionality.</p><p>Conversely, the two-step approach separates the arguments to the accessor vs. aggregate very cleanly, where the aggregate function is defined in parentheses within the inputs of our final function:</p><pre><code class="language-SQL">SELECT
    approx_percentile(0.5, uddsketch(1000, 0.001, val)) as median
FROM foo;
</code></pre><p>By making it clear which is which, users can know that if they change the inputs to the aggregate, they will get more (costly) aggregate nodes, while inputs to the accessor are cheaper to change. </p><p>So, those are the first two reasons we expose the API—and what it allows developers to do as a result. The last two reasons involve continuous aggregates and how they relate to hyperfunctions, so first, a quick refresher on what they are.</p><h2 id="continuous-aggregates-and-two-step-design-in-timescaledb">Continuous Aggregates and Two-Step Design in TimescaleDB</h2><p>TimescaleDB includes a feature called <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/continuous-aggregates/">continuous aggregates</a>, which are designed to make queries on very large datasets run faster. TimescaleDB's continuous aggregates continuously and incrementally store the results of an aggregation query in the background, so when you run the query, only the data that has changed needs to be computed, not the entire dataset. </p><p>In our discussion of the combine function <a href="https://timescale.ghost.io/blog/blog/how-postgresql-aggregation-works-and-how-it-inspired-our-hyperfunctions-design-2/#deduplication">above,</a> we covered how you could take the expensive work of computing the transition function over every row and split the rows over multiple parallel aggregates to speed up the calculation.</p><p>TimescaleDB continuous aggregates do something similar, except they spread the computation work over<em> time</em> rather than between parallel processes running simultaneously. The continuous aggregate computes the transition function over a subset of rows inserted some time in the past, stores the result, and then, at query time, we only need to compute over the raw data for a small section of recent time that we haven’t yet calculated. 
</p><p>When we designed TimescaleDB hyperfunctions, we wanted them to work well within continuous aggregates and even open new possibilities for users.  </p><p>Let’s say I create a continuous aggregate from the simple table above to compute the sum, average, and percentile (the latter using a hyperfunction) in 15-minute increments:</p><pre><code class="language-SQL">CREATE MATERIALIZED VIEW foo_15_min_agg
WITH (timescaledb.continuous)
AS SELECT id,
    time_bucket('15 min'::interval, ts) as bucket,
    sum(val),
    avg(val),
    percentile_agg(val)
FROM foo
GROUP BY id, time_bucket('15 min'::interval, ts);</code></pre><p>And then what if I come back and I want to re-aggregate it to hours or days rather than 15-minute buckets—or need to aggregate my data across all IDs? Which aggregates can I do that for, and which can’t I?</p><h3 id="logically-consistent-rollups">Logically consistent rollups</h3><p>One of the problems we wanted to solve with two-step aggregation was how to convey to the user when it is “okay” to re-aggregate and when it’s not. (By “okay,” I mean you would get the same result from the re-aggregated data as you would running the aggregate on the raw data directly.) </p><p>For instance:</p><pre><code class="language-SQL">SELECT sum(val) FROM tab;
-- is equivalent to:
SELECT sum(sum) 
FROM 
    (SELECT id, sum(val) 
    FROM tab
    GROUP BY id) s;</code></pre><p>But:</p><pre><code class="language-SQL">SELECT avg(val) FROM tab;
-- is NOT equivalent to:
SELECT avg(avg) 
FROM 
    (SELECT id, avg(val) 
    FROM tab
    GROUP BY id) s;
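
-- To re-aggregate an average correctly, the outer query has to carry
-- avg's full internal state, (sum, count), like so:
SELECT sum(sum) / sum(count) AS avg 
FROM 
    (SELECT id, sum(val), count(val) 
    FROM tab
    GROUP BY id) s;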
</code></pre><p>Why is re-aggregation okay for <code>sum</code> but not for <code>avg</code>? </p><p>Technically, it’s logically consistent to re-aggregate when:</p><ul><li>The aggregate returns the internal aggregate state. The internal aggregate state for sum is <code>(sum)</code>, whereas for average, it is <code>(sum, count)</code>. </li><li>The aggregate’s combine and transition functions are equivalent. For <code>sum()</code>, the states and the operations are the same. For <code>count()</code>, the <em>states</em> are the same, but the transition and combine functions <em>perform different operations </em>on them. <code>sum()</code>’s transition function adds the incoming value to the state, and its combine function adds two states together, or a sum of sums. Conversely, <code>count()</code>’s transition function increments the state for each incoming value, but its combine function adds two states together, or a sum of counts.</li></ul><p>But you have to have in-depth (and sometimes rather arcane) knowledge about each aggregate’s internals to know which ones meet the above criteria—and, therefore, which ones you can re-aggregate.</p><p><strong>With the two-step aggregate approach, we can convey when it is logically consistent to re-aggregate by exposing our equivalent of the combine function when the aggregate allows it.</strong></p><p>We call that function <code>rollup()</code>. <code>Rollup()</code> takes multiple inputs from the aggregate and combines them into a single value. </p><p>All of our aggregates that can be combined have <code>rollup</code> functions that will combine the output of the aggregate from two different groups of rows. (Technically, <code>rollup()</code> is an aggregate function because it acts on multiple rows. For clarity, I’ll call them rollup functions to distinguish them from the base aggregate.) Then you can call the accessor on the combined output! 
</p><p>So using that continuous aggregate we created to get a 1-day re-aggregation of our <code>percentile_agg</code> becomes as simple as:</p><pre><code class="language-SQL">SELECT id, 
    time_bucket('1 day'::interval, bucket) as bucket, 
    approx_percentile(0.5, rollup(percentile_agg)) as median
FROM foo_15_min_agg
GROUP BY id, time_bucket('1 day'::interval, bucket);</code></pre><p>(We actually suggest that you create your continuous aggregates without calling the accessor function for this very reason. Then, you can just create views on top of it or put the accessor call in your query.) </p><p>This brings us to our final reason.</p><h3 id="retrospective-analysis-using-continuous-aggregates">Retrospective analysis using continuous aggregates</h3><p>When we create a continuous aggregate, we’re defining a view of our data that we could then be stuck with for a very long time. </p><p>For example, we might have a data retention policy that deletes the underlying data after a given time period. If we want to go back and re-calculate anything, it can be challenging, if not impossible, since we’ve “dropped” the data. </p><p>But we understand that in the real world, you don’t always know what you’re going to need to analyze ahead of time. </p><p>Thus, we designed hyperfunctions to use the two-step aggregate approach so they would better integrate with continuous aggregates. As a result, users can store the aggregate state in the continuous aggregate view and modify accessor functions <em>without</em> requiring them to recalculate old states that might be difficult (or impossible) to reconstruct (because the data is archived, deleted, etc.). </p><p>The two-step aggregation design also allows for much greater flexibility with continuous aggregates. For instance, let’s take a continuous aggregate where we do the aggregate part of the two-step aggregation like this:</p><pre><code class="language-SQL">CREATE MATERIALIZED VIEW foo_15_min_agg
WITH (timescaledb.continuous)
AS SELECT id,
    time_bucket('15 min'::interval, ts) as bucket,
    percentile_agg(val)
FROM foo
GROUP BY id, time_bucket('15 min'::interval, ts);
</code></pre><p>When we first create the aggregate, we might only want to get the median:</p><pre><code class="language-SQL">SELECT
    approx_percentile(0.5, percentile_agg) as median
FROM foo_15_min_agg;
</code></pre><p>But then, later, we decided we wanted to know the 95th percentile as well. </p><p>Luckily, we don’t have to modify the continuous aggregate; we<strong> just modify the parameters to the accessor function in our original query to return the data we want from the aggregate state</strong>:</p><pre><code class="language-SQL">SELECT
    approx_percentile(0.5, percentile_agg) as median,
    approx_percentile(0.95, percentile_agg) as p95
FROM foo_15_min_agg;</code></pre><p>And then, if a year later, we want the 99th percentile as well, we can do that too:</p><pre><code class="language-SQL">SELECT
    approx_percentile(0.5, percentile_agg) as median,
    approx_percentile(0.95, percentile_agg) as p95,
    approx_percentile(0.99, percentile_agg) as p99
FROM foo_15_min_agg;
</code></pre><p>That’s just scratching the surface. Ultimately, our goal is to provide a high level of developer productivity that enhances other PostgreSQL and TimescaleDB features, like aggregate deduplication and continuous aggregates.</p><h2 id="example-time-weighted-average">Example: Time-Weighted Average</h2><p>To illustrate how the two-step aggregate design pattern impacts how we think about and code hyperfunctions, let’s look at the<a href="https://docs.timescale.com/api/latest/hyperfunctions/time-weighted-averages/"> time-weighted average family of functions</a>. (Our<a href="https://timescale.ghost.io/blog/blog/what-time-weighted-averages-are-and-why-you-should-care/"> What Time-Weighted Averages Are and Why You Should Care</a> post provides a lot of context for this next bit, so if you haven’t read it, we recommend doing so. You can also skip this next bit for now.)</p><p>The equation for the time-weighted average is as follows:</p><p>\begin{equation}  time\_weighted\_average = \frac{area\_under\_curve}{ \Delta T}   \end{equation}</p>
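<p>In SQL, the whole family is called like this (a sketch against the <code>foo</code> table from earlier; <code>'LOCF'</code>, last observation carried forward, is one of the weighting methods <code>time_weight()</code> accepts):</p><pre><code class="language-SQL">SELECT average(time_weight('LOCF', ts, val)) AS time_weighted_average
FROM foo;</code></pre>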
<p>As we noted in the <a href="https://timescale.ghost.io/blog/blog/how-postgresql-aggregation-works-and-how-it-inspired-our-hyperfunctions-design-2/#agg-table">table above</a>:</p><ul><li><code>time_weight()</code> is TimescaleDB hyperfunctions’ aggregate and corresponds to the transition function in PostgreSQL’s internal API.</li><li><code>average()</code> is the accessor, which corresponds to the PostgreSQL final function.</li><li><code>rollup()</code> for re-aggregation corresponds to the PostgreSQL combine function.</li></ul><p>The <code>time_weight()</code> function returns an aggregate type that has to be usable by the other functions in the family.</p><p>In this case, we decided on a <code>TimeWeightSummary</code> type that is defined like so (in pseudocode):</p><pre><code class="language-SQL">TimeWeightSummary = (w_sum, first_pt, last_pt)</code></pre><p><code>w_sum</code> is the weighted sum (another name for the area under the curve), and <code>first_pt</code> and <code>last_pt</code> are the first and last (time, value) pairs in the rows that feed into the <code>time_weight()</code> aggregate. </p><p>Here’s a graphic depiction of those elements, which builds on our <a href="https://timescale.ghost.io/blog/blog/what-time-weighted-averages-are-and-why-you-should-care/#mathy-bits-how-to-derive-a-time-weighted-average">how to derive a time-weighted average theoretical description</a>:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2021/08/example-graph1-1.jpg" class="kg-image" alt="A graph showing value on the y-axis and time on the x-axis. There are four points:  open parens t 1 comma v 1 close parens, labeled first point to open parens t 4 comma  v 4 close parens, labeled last point. The points are spaced unevenly in time on the graph. The area under the graph is shaded, and labeled w underscore sum. 
The time axis has a brace describing the total distance between the first and last points labeled Delta T. " loading="lazy" width="692" height="589" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2021/08/example-graph1-1.jpg 600w, https://timescale.ghost.io/blog/content/images/2021/08/example-graph1-1.jpg 692w"><figcaption><span style="white-space: pre-wrap;">Depiction of the values we store in the </span><code spellcheck="false" style="white-space: pre-wrap;"><span>TimeWeightSummary</span></code><span style="white-space: pre-wrap;"> representation.</span></figcaption></figure><p></p><p>So, the <code>time_weight()</code> aggregate does all of the calculations as it receives each of the points in our graph and builds a weighted sum for the time period (ΔT) between the first and last points it “sees.” It then outputs the <code>TimeWeightSummary</code>.</p><p>The <code>average()</code> accessor function performs simple calculations to return the time-weighted average from the <code>TimeWeightSummary</code> (in pseudocode where <code>pt.time()</code> returns the time from the point):</p><pre><code class="language-SQL">func average(TimeWeightSummary tws) 
	-&gt; float {
		delta_t = tws.last_pt.time - tws.first_pt.time;
		time_weighted_average = tws.w_sum / delta_t;
		return time_weighted_average;
	}</code></pre><p><br>But, as we built the <code>time_weight</code> hyperfunction, ensuring the <code>rollup()</code> function worked as expected was a little more difficult – and introduced constraints that impacted the design of our <code>TimeWeightSummary</code> data type. </p><p>To understand the rollup function, let’s use our graphical example and imagine the <code>time_weight()</code> function returns two <code>TimeWeightSummaries</code> from different regions of time like so:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2021/08/Example-graph-2-1.jpg" class="kg-image" alt="A similar graph to the previous, except that now there are two sets of shaded regions. The first is similar to the previous and is labeled with first sub 1 open parens t 1 comma v 1 close parens, last  1 open parens t 4 comma  v 4 close parens , and w underscore sum  1.  The second is similar, with points first 2 open parens t 5 comma  v 4 close parens and last 2 open parens t 8 comma  v 8 close parens and the label w underscore sum 2 on the shaded portion. " loading="lazy" width="672" height="511" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2021/08/Example-graph-2-1.jpg 600w, https://timescale.ghost.io/blog/content/images/2021/08/Example-graph-2-1.jpg 672w"><figcaption><span style="white-space: pre-wrap;">What happens when we have multiple TimeWeightSummaries representing different regions of the graph</span></figcaption></figure><p>The <code>rollup()</code> function needs to take in and return the same <code>TimeWeightSummary</code> data type so that our <code>average()</code> accessor can understand it. 
(This mirrors how PostgreSQL’s combine function takes in two states from the transition function and then returns a single state for the final function to process.)</p><p>We also want the <code>rollup()</code> output to be the same as if we had computed the <code>time_weight()</code> over all the underlying data. The output should be a <code>TimeWeightSummary</code> representing the full region.  </p><p>The <code>TimeWeightSummary</code> we output should also account for the area in the gap between these two weighted sum states:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2021/08/example-graph-3-1.jpg" class="kg-image" alt="A similar picture to the previous, with the area between the points open parens t 4 comma  v 4 close parens aka last 1 and open parens t 5 comma  v 5 close parens aka first 2, down to the time axis highlighted. This is called w underscore sum gap. " loading="lazy" width="663" height="501" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2021/08/example-graph-3-1.jpg 600w, https://timescale.ghost.io/blog/content/images/2021/08/example-graph-3-1.jpg 663w"><figcaption><span style="white-space: pre-wrap;">Mind the gap! (between one </span><code spellcheck="false" style="white-space: pre-wrap;"><span>TimeWeightSummary</span></code><span style="white-space: pre-wrap;"> and the next).</span></figcaption></figure><p>The gap area is easy to get because we have the last<sub>1</sub> and first<sub>2</sub> points—and it’s the same as the <code>w_sum</code> we’d get by running the <code>time_weight()</code> aggregate on them.</p><p>Thus, the overall <code>rollup()</code> function needs to do something like this (where <code>w_sum()</code> extracts the weighted sum from the <code>TimeWeightSummary</code>):</p><pre><code class="language-SQL">func rollup(TimeWeightSummary tws1, TimeWeightSummary tws2) 
	-&gt; TimeWeightSummary {
		w_sum_gap = time_weight(tws1.last_pt, tws2.first_pt).w_sum;
		w_sum_total = w_sum_gap + tws1.w_sum + tws2.w_sum;
		return TimeWeightSummary(w_sum_total, tws1.first_pt, tws2.last_pt);
	}
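
-- (a simplification: a real implementation would also need to handle
-- empty or NULL summaries, and assumes tws1's time range precedes tws2's)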
</code></pre><p>Graphically, that means we’d end up with a single <code>TimeWeightSummary</code> representing the whole area:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2021/08/example-graph-4-1.jpg" class="kg-image" alt="Similar to the previous graphs, except that now there is only one region that has been shaded, the combined area of the w underscore sum 1, w underscore sum 2, and w underscore sum gap has become one area, w underscore sum. Only the overall first open parens t 1 comma  v 1 close parens and last open parens t 8 comma  v 8 close parens points are shown. " loading="lazy" width="672" height="505" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2021/08/example-graph-4-1.jpg 600w, https://timescale.ghost.io/blog/content/images/2021/08/example-graph-4-1.jpg 672w"><figcaption><span style="white-space: pre-wrap;">The combined </span><code spellcheck="false" style="white-space: pre-wrap;"><span>TimeWeightSummary</span></code></figcaption></figure><p>So that’s how the two-step aggregate design approach ends up affecting the real-world implementation of our time-weighted average hyperfunctions. The above explanations are a bit condensed, but they should give you a more concrete look at how <code>time_weight()</code> aggregate, <code>average()</code> accessor, and <code>rollup()</code> functions work.</p><h2 id="summing-it-up">Summing It Up</h2><p>Now that you’ve gotten a tour of the PostgreSQL aggregate API, how it inspired us to make the TimescaleDB hyperfunctions two-step aggregate API, and a few examples of how this works in practice, we hope you'll try it out yourself and tell us what you think :). 
</p><p>If you're currently dealing with gigantic databases, remember that you can always tier your older, infrequently accessed data to keep things running smoothly without breaking the bank—we built the perfect solution for this with our <a href="https://docs.timescale.com/use-timescale/latest/data-tiering/" rel="noreferrer">Tiered Storage architecture backend</a>. Check it out!</p><p>If you'd like to keep learning about Postgres and its community, visit our <a href="https://www.timescale.com/state-of-postgres/2023" rel="noreferrer"><em>State of PostgreSQL 2023</em> </a>report, which is full of insights on how people around the world use PostgreSQL.</p><p><strong>Going back to hyperfunctions, to get started right away, </strong><a href="https://console.cloud.timescale.com/signup"><strong>spin up a fully managed Timescale service and try it for free</strong></a><strong>. </strong>Hyperfunctions are pre-loaded on each new database service on Timescale, so after you create a new service, you’re all set to use them!</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Postgres Materialized Views, The Timescale Way]]></title>
            <description><![CDATA[Continuous aggregates are one of the most popular features in TimescaleDB, but you can’t fully understand them without some context on PostgreSQL views and materialized views.]]></description>
            <link>https://www.tigerdata.com/blog/materialized-views-the-timescale-way</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/materialized-views-the-timescale-way</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[David Kohn]]></dc:creator>
            <pubDate>Thu, 14 Jul 2022 15:34:43 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_12-1.png">
            </media:content>
            <content:encoded><![CDATA[<h3 id="how-postgresql-views-and-materialized-views-work-and-how-they-influenced-timescaledb-continuous-aggregates"><br>How PostgreSQL Views and Materialized Views Work and How They Influenced TimescaleDB Continuous Aggregates</h3><p>Soon after I graduated from college, I was working on starting a business and moved into a house near campus with five friends. The house was a bit run down, but the rent was cheap, and the people were great. </p><p>I lived there for a couple of years, and we developed a tradition: whenever someone would move out, we would ask them to share three pieces of wisdom with the group. Some were silly, some were serious, some were profound, none seemed remotely related to PostgreSQL or materialized views at the time, but one really has stuck with me. My friend Jason, who’d just finished his Ph.D. in Applied Physics, said the wisdom he’d learned was “Don’t squander your ignorance.” He explained that once you learn something, you end up taking it for granted and it becomes so much harder to overcome your tacit knowledge and ask simple, but important, questions.</p><p>A few months ago, I started a new role managing Engineering Education here at Timescale, and as I’ve started teaching more, Jason’s advice has never been more relevant. Teaching is all about reclaiming squandered ignorance; it’s about constantly remembering what it was like to first learn something, even when you’ve been working on it for years. The things that felt like revelations when we first learned them feel normal as we continue in the field. </p><p>So it’s been common for me that things I’ve worked most closely on can be the hardest to teach. Continuous aggregates are one of the most popular features of TimescaleDB and one of the ones I’m most proud of, partially because I helped design them. 
</p><p>We were recently working on a <a href="https://timescale.ghost.io/blog/how-we-made-data-aggregation-better-and-faster-on-postgresql-with-timescaledb-2-7/">revamp of continuous aggregates</a>, and as we were discussing the changes, I realized that whenever I’ve explained continuous aggregates, I’ve done it with all of this context about <a href="https://www.tigerdata.com/learn/guide-to-postgresql-views" rel="noreferrer">PostgreSQL views</a> and materialized views in my head.</p><p>I was lucky enough to learn this by trial and error; I had problems that forced me to learn about views and materialized views and figure out all the ways they worked and didn’t work, and I was able to bring that experience to their design when I joined Timescale, but not everyone has that luxury.</p><p><strong>This post is an attempt to distill a few lessons about what views and materialized </strong><a href="https://www.tigerdata.com/learn/guide-to-postgresql-views" rel="noreferrer"><strong>views in PostgreSQL</strong></a><strong> are, what they’re good at, where they fall short, and how we learned from them to make continuous aggregates incredible tools for time-series data analysis. If you need a summary of the main concepts around this topic, </strong><a href="https://www.timescale.com/learn/guide-to-postgresql-views" rel="noreferrer"><strong>check out our Guide on PostgreSQL views</strong></a><strong>.</strong><br><br>This post also comes out of our ongoing <a href="https://www.youtube.com/playlist?list=PLsceB9ac9MHRnmNZrCn_TWkUrCBCPR3mc">Foundations of PostgreSQL and TimescaleDB YouTube series</a>, so please, those of you who are encountering this for the first time, don’t squander your ignorance! Send me questions, no matter how basic they seem—I’d really appreciate it because it will help me become a better teacher. 
I’ve set up a<a href="https://www.timescale.com/forum/t/ask-your-questions-about-the-views-materialized-views-and-continuous-aggregates-blog-post-here/674"> forum post where you can ask them</a>, or, if you’d prefer, you can ask on our <a href="https://join.slack.com/t/timescaledb/shared_invite/zt-1chdnn3cy-2j_6wpt~TzWUXN6ZygMkbg">Community Slack</a>!</p><h2 id="getting-started-with-views-and-materialized-views">Getting Started With Views and Materialized Views</h2><p>To get an understanding of <a href="https://www.postgresql.org/docs/current/tutorial-views.html" rel="noreferrer">PostgreSQL views</a>, <a href="https://www.postgresql.org/docs/current/rules-materializedviews.html" rel="noreferrer">materialized views</a>, and <a href="https://docs.timescale.com/use-timescale/latest/continuous-aggregates/about-continuous-aggregates/" rel="noreferrer">TimescaleDB continuous aggregates</a>, we’re going to want to have some data to work with to demonstrate the concepts and better understand where each of them is most useful. <br><br>I’ve used the data from our <a href="https://docs.timescale.com/getting-started/latest/#what-is-timescaledb">getting started tutorial</a> so that if you’d like to, you can follow along (you may need to change some of the dates in <code>WHERE</code> clauses, though). Our tutorial deals with financial data, but many of the insights are very broadly applicable. </p><p>Also, I won’t go through the whole thing here, but you should know that we have a <code>company</code> table and a <code>stocks_real_time</code> hypertable, defined like so:</p><pre><code class="language-SQL">CREATE TABLE company (
    symbol text NOT NULL,
    name text NOT NULL
);

CREATE TABLE stocks_real_time (
    time timestamp with time zone NOT NULL,
    symbol text NOT NULL,
    price double precision,
    day_volume integer
);
CREATE INDEX ON stocks_real_time (symbol, time);
SELECT create_hypertable('stocks_real_time', 'time');
</code></pre>
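<p>If you’d rather not load the full dataset, a few hand-made rows are enough to experiment with (the symbols, prices, and volumes below are purely illustrative):</p><pre><code class="language-SQL">INSERT INTO company VALUES
    ('AAPL', 'Apple Inc.'),
    ('MSFT', 'Microsoft Corporation');

INSERT INTO stocks_real_time VALUES
    ('2022-04-05 09:30:00+00', 'AAPL', 170.10, 1500),
    ('2022-04-05 09:30:05+00', 'MSFT', 310.50, 2300),
    ('2022-04-05 09:31:00+00', 'AAPL', 170.25, 3100);</code></pre>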
<p>Once you’ve set that up, you can <a href="https://docs.timescale.com/getting-started/latest/add-data/">import data</a>, and you should be able to follow along with the rest if you’d like.</p><h2 id="what-are-postgresql-views-why-should-i-use-them">What Are PostgreSQL Views? Why Should I Use Them?</h2><p>One thing we might want to do with this dataset is get the name of each company alongside its trades. You’ll note that the <code>name</code> column only exists in the <code>company</code> table, which can be joined to the <code>stocks_real_time</code> table on the <code>symbol</code> column. We can store that join as a view, like so:</p><pre><code class="language-SQL">CREATE VIEW stocks_company AS 
SELECT s.symbol, s.price, s.time, s.day_volume, c.name 
FROM stocks_real_time s 
INNER JOIN company c ON s.symbol = c.symbol;
</code></pre>
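<p>The view stores only the query text, not any data. If you later want to see the query a view was defined with, the standard PostgreSQL catalog function <code>pg_get_viewdef</code> will show it:</p><pre><code class="language-SQL">SELECT pg_get_viewdef('stocks_company', true);</code></pre>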
<p>Once I’ve created a view, I can refer to it in another query:</p><pre><code class="language-SQL">SELECT symbol, price 
FROM stocks_company 
WHERE time &gt;= '2022-04-05' and time &lt;'2022-04-06';
</code></pre>
<p>But what is that actually doing under the hood? As I mentioned before, the view acts as an alias for the stored query, so PostgreSQL replaces the view <code>stocks_company</code> with the query it was defined with and runs the full resulting query. That means the query to the <code>stocks_company</code> view is the same as:</p><pre><code class="language-SQL">SELECT symbol, price 
FROM (
SELECT s.symbol, s.price, s.time, s.day_volume, c.name 
FROM stocks_real_time s 
INNER JOIN company c ON s.symbol = c.symbol) sc 
WHERE time &gt;= '2022-04-05' and time &lt;'2022-04-06';
</code></pre>
<p>We’ve manually replaced the view with the same query that we defined it with. </p><p>How can we tell that they are the same? <a href="https://www.postgresql.org/docs/current/using-explain.html" rel="noreferrer">The <code>EXPLAIN</code> command</a> tells us how PostgreSQL executes a query, and we can use it to see whether the query against the view and the version written out as a subselect produce the same plan. </p><p>Note, I know that <code>EXPLAIN</code> plans can initially seem a little intimidating. I’ve tried to make it so you don’t need to know a whole lot about <code>EXPLAIN</code> plans or the like to understand this post, <a href="#what-postgresql-materialized-views-are-and-when-to-use-them" rel="noreferrer">so if you don’t want to read them, feel free to skip over them</a>.</p><p>And if we run both:</p><pre><code class="language-SQL">EXPLAIN (ANALYZE ON, BUFFERS ON) 
SELECT symbol, price 
FROM stocks_company 
WHERE time &gt;= '2022-04-05' and time &lt;'2022-04-06';
--AND
EXPLAIN (ANALYZE ON, BUFFERS ON) 
SELECT symbol, price 
FROM (
SELECT s.symbol, s.price, s.time, s.day_volume, c.name 
FROM stocks_real_time s 
INNER JOIN company c ON s.symbol = c.symbol) sc 
WHERE time &gt;= '2022-04-05' and time &lt;'2022-04-06';
</code></pre>
<p>We can see that they both produce the same query plan (though the timings might be slightly different, they’ll even out with repeated runs).</p><pre><code>Hash Join  (cost=3.68..16328.94 rows=219252 width=12) (actual time=0.110..274.764 rows=437761 loops=1)
   Hash Cond: (s.symbol = c.symbol)
   Buffers: shared hit=3667
   -&gt;  Index Scan using _hyper_5_2655_chunk_stocks_real_time_time_idx on _hyper_5_2655_chunk s  (cost=0.43..12488.79 rows=438503 width=12) (actual time=0.057..125.607 rows=437761 loops=1)
         Index Cond: (("time" &gt;= '2022-04-05 00:00:00+00'::timestamp with time zone) AND ("time" &lt; '2022-04-06 00:00:00+00'::timestamp with time zone))
         Buffers: shared hit=3666
   -&gt;  Hash  (cost=2.00..2.00 rows=100 width=4) (actual time=0.034..0.035 rows=100 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 12kB
         Buffers: shared hit=1
         -&gt;  Seq Scan on company c  (cost=0.00..2.00 rows=100 width=4) (actual time=0.006..0.014 rows=100 loops=1)
               Buffers: shared hit=1
 Planning:
   Buffers: shared hit=682
 Planning Time: 1.807 ms
 Execution Time: 290.851 ms
(15 rows)
</code></pre>
<p>The plan joins the <code>company</code> to the relevant chunk of the <code>stocks_real_time</code> hypertable and uses an index scan to fetch the right rows. But you don’t really need to understand exactly what’s going on here to understand that they’re doing the same thing.</p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">✨</div><div class="kg-callout-text"><b><strong style="white-space: pre-wrap;">Editor's Note: </strong></b>If you’d like to learn more about EXPLAIN, I recommend taking a look at the <a href="https://www.youtube.com/watch?v=UUcVS0290nY&amp;list=PLsceB9ac9MHRnmNZrCn_TWkUrCBCPR3mc&amp;index=8">Explaining Explain session</a> that my colleague Feike Steenbergen gave a couple of weeks ago. It was awesome!</div></div><h3 id="views-hide-complexity">Views hide complexity</h3><p>The <code>JOIN</code> in our view is very simple, which means the aliased query is relatively simple, but you can imagine that as the views get more complex, it can be very helpful to have a much simpler way for a user to query the database, where they don’t have to write all the <code>JOINs</code> themselves. (You can also use special views like <a href="https://www.2ndquadrant.com/en/blog/how-do-postgresql-security_barrier-views-work/">security barrier views to grant access to data securely</a>, but that’s more than we can cover here!).</p><p>Unfortunately, hiding the complexity can also be a problem. For instance, you may or may not have noticed in our example that we <em>don’t actually need the <code>JOIN</code>!</em> The <code>JOIN</code> gets us the <code>name</code> column from the <code>company</code> table, but we’re only selecting the <code>symbol</code> and <code>price</code> columns, which come from the <code>stocks_real_time</code> table! 
If we run the query directly on the table, it can go about twice as fast by avoiding the <code>JOIN</code>:</p><pre><code>Index Scan using _hyper_5_2655_chunk_stocks_real_time_time_idx on _hyper_5_2655_chunk  (cost=0.43..12488.79 rows=438503 width=12) (actual time=0.021..72.770 rows=437761 loops=1)
  Index Cond: (("time" &gt;= '2022-04-05 00:00:00+00'::timestamp with time zone) AND ("time" &lt; '2022-04-06 00:00:00+00'::timestamp with time zone))
  Buffers: shared hit=3666
Planning:
  Buffers: shared hit=10
Planning Time: 0.243 ms
Execution Time: 140.775 ms
</code></pre>
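<p>For reference, the direct query on the hypertable that produces a plan like the one above is just:</p><pre><code class="language-SQL">EXPLAIN (ANALYZE ON, BUFFERS ON) 
SELECT symbol, price 
FROM stocks_real_time 
WHERE time &gt;= '2022-04-05' and time &lt;'2022-04-06';
</code></pre>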
<p>If I’d written out the query myself, I might have noticed that I didn’t need the <code>JOIN</code> (or never written it in the first place), whereas the view hides that complexity. Views can make things easier, but that can lead to performance pitfalls if we’re not careful. </p><p>If we actually <code>SELECT</code> the <code>name</code> column, then we’re using the view for what it was meant for, like so:</p><pre><code class="language-SQL">SELECT name, price, symbol 
FROM stocks_company 
WHERE time &gt;= '2022-04-05' AND time &lt;'2022-04-06';
</code></pre>
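<p>By the way, since a view is just a stored query, you can always ask PostgreSQL to show you what a view will expand to, which is a handy way to spot hidden <code>JOINs</code> before they bite:</p><pre><code class="language-SQL">SELECT pg_get_viewdef('stocks_company'::regclass, true);
-- or, in psql:
-- \d+ stocks_company
</code></pre>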
<p>So to sum up this section on views: </p><ul><li>Views are a way to store an alias for a query in the database.</li><li>PostgreSQL will replace the view name with the query you use in the view definition. </li></ul><p>Views are good for reducing complexity for the user, who doesn’t have to write out complex <code>JOINs</code>, but hiding that complexity can make performance pitfalls harder to spot, especially if views are overused. </p><p>One thing you’ll notice is that <strong>views can improve the user interface, but they won’t really ever improve performance</strong>, because they don’t actually run the query, they just alias it. If you want something that runs the query, you’ll need a materialized view.</p><h2 id="what-postgresql-materialized-views-are-and-when-to-use-them">What PostgreSQL Materialized Views Are and When to Use Them</h2><p>When I <a href="https://www.postgresql.org/docs/current/sql-creatematerializedview.html">create a materialized view</a>, it actually runs the query and stores the results. In essence, this means the materialized view acts as a <a href="https://en.wikipedia.org/wiki/Cache_(computing)">cache</a> for the query. Caching is a common way to improve performance in all sorts of computing systems. The question we might ask is: will it be helpful here? So let’s try it out and see how it goes. <br><br><a href="https://www.postgresqltutorial.com/postgresql-views/postgresql-materialized-views/" rel="noreferrer">Creating a materialized view</a> is quite easy: I can just add the <code>MATERIALIZED</code> keyword to my create view command:</p><pre><code class="language-SQL">CREATE MATERIALIZED VIEW stocks_company_mat AS 
SELECT s.symbol, s.price, s.time, s.day_volume, c.name 
FROM stocks_real_time s INNER JOIN company c ON s.symbol = c.symbol;

CREATE INDEX on stocks_company_mat (symbol, time DESC);
CREATE INDEX on stocks_company_mat (time DESC);
</code></pre>
<p>You’ll also notice that I created some indexes on the materialized view (the same ones I have on <code>stocks_real_time</code>)! That’s one of the cool things about materialized views: you can create indexes on them because under the hood they’re just tables that store the results of a query (we’ll explain that more later). </p><p>Now I can run <a href="https://www.cybertec-postgresql.com/en/how-to-interpret-postgresql-explain-analyze-output/" rel="noreferrer"><code>EXPLAIN ANALYZE</code></a> on a slightly different query, fetching four days of data for ‘AAPL’ from both, to see how much this caching helps our query:</p><pre><code class="language-SQL">EXPLAIN (ANALYZE ON, BUFFERS ON) SELECT name, price FROM stocks_company_mat WHERE time &gt;= '2022-04-05' AND time &lt;'2022-04-09' AND symbol = 'AAPL';
Bitmap Heap Scan on stocks_company_mat  (cost=1494.93..56510.51 rows=92196 width=17) (actual time=11.796..46.336 rows=95497 loops=1)
  Recheck Cond: ((symbol = 'AAPL'::text) AND ("time" &gt;= '2022-04-05 00:00:00+00'::timestamp with time zone) AND ("time" &lt; '2022-04-09 00:00:00+00'::timestamp with time zone))
  Heap Blocks: exact=14632
  Buffers: shared hit=14969
  -&gt;  Bitmap Index Scan on stocks_company_mat_symbol_time_idx  (cost=0.00..1471.88 rows=92196 width=0) (actual time=9.456..9.456 rows=95497 loops=1)
        Index Cond: ((symbol = 'AAPL'::text) AND ("time" &gt;= '2022-04-05 00:00:00+00'::timestamp with time zone) AND ("time" &lt; '2022-04-09 00:00:00+00'::timestamp with time zone))
        Buffers: shared hit=337
Planning:
  Buffers: shared hit=5
Planning Time: 0.102 ms
Execution Time: 49.995 ms

EXPLAIN (ANALYZE ON, BUFFERS ON) SELECT name, price FROM stocks_company WHERE time &gt;= '2022-04-05' AND time &lt;'2022-04-09' AND symbol = 'AAPL';
Nested Loop  (cost=919.95..30791.92 rows=96944 width=19) (actual time=6.023..75.367 rows=95497 loops=1)
  Buffers: shared hit=13215
  -&gt;  Seq Scan on company c  (cost=0.00..2.25 rows=1 width=15) (actual time=0.006..0.018 rows=1 loops=1)
        Filter: (symbol = 'AAPL'::text)
        Rows Removed by Filter: 99
        Buffers: shared hit=1
  -&gt;  Append  (cost=919.95..29820.23 rows=96944 width=12) (actual time=6.013..67.491 rows=95497 loops=1)
        Buffers: shared hit=13214
        -&gt;  Bitmap Heap Scan on _hyper_5_2655_chunk s_1  (cost=919.95..11488.49 rows=49688 width=12) (actual time=6.013..22.334 rows=49224 loops=1)
              Recheck Cond: ((symbol = 'AAPL'::text) AND ("time" &gt;= '2022-04-05 00:00:00+00'::timestamp with time zone) AND ("time" &lt; '2022-04-09 00:00:00+00'::timestamp with time zone))
              Heap Blocks: exact=6583
              Buffers: shared hit=6895
(... elided for space)
Planning:
  Buffers: shared hit=30
Planning Time: 0.465 ms
Execution Time: 78.932 ms
</code></pre>
<p>Taking a look at these plans, we can see that it helps less than one might think! It sped it up a little, but really, they’re doing almost the same amount of work! How can I tell? Well, they scan approximately the same number of 8KB buffers (see <a href="https://www.youtube.com/watch?v=JOrXRsES3mk&amp;list=PLsceB9ac9MHRnmNZrCn_TWkUrCBCPR3mc&amp;index=1">Lesson 0 of the Foundations series</a> to learn more about those), and they scan the same number of rows.</p><h3 id="when-materialized-view-performance-doesnt-materialize">When materialized view performance doesn't materialize</h3><p>Why is this? Well, our <code>JOIN</code> didn’t reduce the number of rows in the query, so the materialized view <code>stocks_company_mat</code> actually has the same number of rows in it as the <code>stocks_real_time</code> hypertable!</p><pre><code class="language-SQL">SELECT 
(SELECT count(*) FROM stocks_company_mat) as rows_mat, 
(SELECT count(*) FROM stocks_real_time) as rows_tab;

 rows_mat | rows_tab 
----------+----------
  7375355 |  7375355

</code></pre>
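<p>You can put a number on that storage cost, too. Here’s one way to compare sizes (<code>hypertable_size</code> is TimescaleDB’s helper for totaling a hypertable’s chunks; for a plain table you’d use <code>pg_total_relation_size</code> for both):</p><pre><code class="language-SQL">SELECT pg_size_pretty(pg_total_relation_size('stocks_company_mat')) AS mat_size,
       pg_size_pretty(hypertable_size('stocks_real_time')) AS table_size;
</code></pre>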
<p>So, not a huge benefit, and <em>we have to store the same number of rows all over again</em>: little benefit for a pretty large storage cost. This could have been a big win if we were running a very expensive function or doing a very complex <code>JOIN</code> in our materialized view definition, but we’re not, so this doesn’t save us much. </p><p>The thing about our example is that it only gets worse from here. One of the things we might want to do with our view or materialized view is to use a <a href="https://www.geeksforgeeks.org/postgresql-where-clause/" rel="noreferrer"><code>WHERE</code></a> clause to filter not just on <code>symbol</code> but on the company <code>name</code>. (Maybe I don’t remember the stock symbol for a company, but I do remember its name.) Remember that the <code>name</code> column is the one the <code>JOIN</code> brings in from the <code>company</code> table, so let’s run that query on both the view and materialized view and see what happens:</p><pre><code class="language-SQL">
EXPLAIN (ANALYZE ON, BUFFERS ON) SELECT name, price from stocks_company_mat WHERE time &gt;= '2022-04-05' and time &lt;'2022-04-09' AND name = 'Apple' ;
Index Scan using stocks_company_mat_time_idx on stocks_company_mat  (cost=0.43..57619.99 rows=92196 width=17) (actual time=0.022..605.268 rows=95497 loops=1)
  Index Cond: (("time" &gt;= '2022-04-05 00:00:00+00'::timestamp with time zone) AND ("time" &lt; '2022-04-09 00:00:00+00'::timestamp with time zone))
  Filter: (name = 'Apple'::text)
  Rows Removed by Filter: 1655717
  Buffers: shared hit=112577
Planning:
  Buffers: shared hit=3
Planning Time: 0.116 ms
Execution Time: 609.040 ms

EXPLAIN (ANALYZE ON, BUFFERS ON) SELECT name, price from stocks_company WHERE time &gt;= '2022-04-05' and time &lt;'2022-04-09' AND name = 'Apple' ;
Nested Loop  (cost=325.22..21879.02 rows=8736 width=19) (actual time=5.642..56.062 rows=95497 loops=1)
  Buffers: shared hit=13215
  -&gt;  Seq Scan on company c  (cost=0.00..2.25 rows=1 width=15) (actual time=0.007..0.018 rows=1 loops=1)
        Filter: (name = 'Apple'::text)
        Rows Removed by Filter: 99
        Buffers: shared hit=1
  -&gt;  Append  (cost=325.22..21540.78 rows=33599 width=12) (actual time=5.633..48.232 rows=95497 loops=1)
        Buffers: shared hit=13214
        -&gt;  Bitmap Heap Scan on _hyper_5_2655_chunk s_1  (cost=325.22..9866.59 rows=17537 width=12) (actual time=5.631..21.713 rows=49224 loops=1)
              Recheck Cond: ((symbol = c.symbol) AND ("time" &gt;= '2022-04-05 00:00:00+00'::timestamp with time zone) AND ("time" &lt; '2022-04-09 00:00:00+00'::timestamp with time zone))
              Heap Blocks: exact=6583
              Buffers: shared hit=6895
…
Planning:
  Buffers: shared hit=30
Planning Time: 0.454 ms
Execution Time: 59.558 ms
</code></pre>
<p>This time, my query on the regular view is much better! It hits far fewer buffers and returns 10x faster! This is because we created an index on <code>(symbol, time DESC)</code> for the materialized view, but not on <code>(name, time DESC)</code>, so it has to fall back to scanning the full <code>time</code> index and removing the rows that don’t match. </p><p>The normal view, however, can use the more selective <code>(symbol, time DESC)</code> index on the <code>stocks_real_time</code> hypertable: because it’s performing the <code>JOIN</code> to the <code>company</code> table, and it joins on the <code>symbol</code> column, it can still use the more selective index. We “enhanced” the materialized view by performing the <code>JOIN</code> and caching the results, but then we’d need to create an index on the joined column too. <br><br>So we’re learning that this query isn’t a great candidate for a materialized view, because the <code>JOIN</code> is neither complex nor time-consuming and doesn’t reduce the number of rows. But if we had a query that did reduce the number of rows, that would be a great candidate for a materialized view.</p><h3 id="when-materialized-views-perform-well">When materialized views perform well</h3><p>As it turns out, there’s a very common set of queries on stock data like this that does reduce the number of rows: they’re called <strong>O</strong>pen-<strong>H</strong>igh-<strong>L</strong>ow-<strong>C</strong>lose queries (OHLC), and they look something like this:</p><pre><code class="language-SQL">CREATE VIEW ohlc_view AS 
SELECT time_bucket('15 min', time) bucket, symbol, first(price, time), max(price), min(price), last(price, time) 
FROM stocks_real_time 
WHERE time &gt;= '2022-04-05' and time &lt;'2022-04-06' 
GROUP BY time_bucket('15 min', time), symbol;

CREATE MATERIALIZED VIEW ohlc_mat AS 
SELECT time_bucket('15 min', time) bucket, symbol, first(price, time), max(price), min(price), last(price, time) 
FROM stocks_real_time 
GROUP BY time_bucket('15 min', time), symbol ;

CREATE INDEX on ohlc_mat(symbol, bucket);
CREATE INDEX ON ohlc_mat(bucket);
</code></pre>
<p>Here I’m aggregating a lot of rows together, so I end up storing a lot fewer in my materialized view. (The view doesn’t store any rows, it’s just an alias for the query.) I still created a few indexes to help speed lookups, but they’re much smaller as well because there are many fewer rows in the output of this query. So now, if I select from the normal view and the materialized view, I see a huge speedup!<br><br>Normal view:</p><pre><code class="language-SQL">EXPLAIN (ANALYZE ON, BUFFERS ON) 
SELECT  bucket, symbol, first, max, min, last 
FROM ohlc_view
WHERE bucket &gt;= '2022-04-05' AND bucket &lt;'2022-04-06';
Finalize GroupAggregate  (cost=39098.81..40698.81 rows=40000 width=44) (actual time=875.233..1000.171 rows=3112 loops=1)
  Group Key: (time_bucket('00:15:00'::interval, _hyper_5_2655_chunk."time")), _hyper_5_2655_chunk.symbol
  Buffers: shared hit=4133, temp read=2343 written=6433
  -&gt;  Sort  (cost=39098.81..39198.81 rows=40000 width=92) (actual time=875.212..906.810 rows=5151 loops=1)
        Sort Key: (time_bucket('00:15:00'::interval, _hyper_5_2655_chunk."time")), _hyper_5_2655_chunk.symbol
        Sort Method: quicksort  Memory: 1561kB
        Buffers: shared hit=4133, temp read=2343 written=6433
        -&gt;  Gather  (cost=27814.70..36041.26 rows=40000 width=92) (actual time=491.920..902.094 rows=5151 loops=1)
              Workers Planned: 1
              Workers Launched: 1
              Buffers: shared hit=4133, temp read=2343 written=6433
              -&gt;  Partial HashAggregate  (cost=26814.70..31041.26 rows=40000 width=92) (actual time=526.663..730.168 rows=2576 loops=2)
                    Group Key: time_bucket('00:15:00'::interval, _hyper_5_2655_chunk."time"), _hyper_5_2655_chunk.symbol
                    Planned Partitions: 128  Batches: 129  Memory Usage: 1577kB  Disk Usage: 19592kB
                    Buffers: shared hit=4133, temp read=2343 written=6433
                    Worker 0:  Batches: 129  Memory Usage: 1577kB  Disk Usage: 14088kB
                    -&gt;  Result  (cost=0.43..13907.47 rows=257943 width=28) (actual time=0.026..277.314 rows=218880 loops=2)
                          Buffers: shared hit=4060
                          -&gt;  Parallel Index Scan using _hyper_5_2655_chunk_stocks_real_time_time_idx on _hyper_5_2655_chunk  (cost=0.43..10683.19 rows=257943 width=20) (actual time=0.025..176.330 rows=218880 loops=2)
                                Index Cond: (("time" &gt;= '2022-04-05 00:00:00+00'::timestamp with time zone) AND ("time" &lt; '2022-04-06 00:00:00+00'::timestamp with time zone))
                                Buffers: shared hit=4060
Planning:
  Buffers: shared hit=10
Planning Time: 0.615 ms
Execution Time: 1003.425 ms
</code></pre>
<p>Materialized view:</p><pre><code class="language-SQL">EXPLAIN (ANALYZE ON, BUFFERS ON) 
SELECT  bucket, symbol, first, max, min, last 
FROM ohlc_mat 
WHERE bucket &gt;= '2022-04-05' AND bucket &lt;'2022-04-06';
Index Scan using ohlc_mat_bucket_idx on ohlc_mat  (cost=0.29..96.21 rows=3126 width=43) (actual time=0.009..0.396 rows=3112 loops=1)
  Index Cond: ((bucket &gt;= '2022-04-05 00:00:00+00'::timestamp with time zone) AND (bucket &lt; '2022-04-06 00:00:00+00'::timestamp with time zone))
  Buffers: shared hit=35
Planning:
  Buffers: shared hit=6
Planning Time: 0.148 ms
Execution Time: 0.545 ms
</code></pre>
<p>Well, that helped! We hit far fewer buffers and scanned far fewer rows with the materialized case, and we didn’t need to perform the <a href="https://www.tutorialspoint.com/postgresql/postgresql_group_by.htm" rel="noreferrer"><code>GROUP BY</code></a> and aggregate, which removes a sort, etc. All of that means that we sped up our query dramatically! But it’s not all rainbows and butterflies for materialized views, because we haven’t yet covered one of their big problems: they get out of date!<br><br>So, if you think about a table like our stocks table, it’s a typical time-series use case, which means it looks something like this:</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_1.png" class="kg-image" alt="" loading="lazy" width="1526" height="1096" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_1.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_1.png 1000w, https://timescale.ghost.io/blog/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_1.png 1526w" sizes="(min-width: 720px) 720px"></figure><p><br>We have a materialized view that we created, and it has been populated at a certain time with our query.</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_2.png" class="kg-image" alt="" loading="lazy" width="1526" height="1214" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_2.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_2.png 1000w, https://timescale.ghost.io/blog/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_2.png 1526w" sizes="(min-width: 720px) 
720px"></figure><p><br>But then time passes, and, say, 15 minutes later, we’ve inserted more data: the view is out of date! We’re <code>time_bucket</code>ing by 15-minute increments, so there’s a whole set of buckets that we don’t have! </p><p>Essentially, materialized views are only as accurate as the last time they ran the query they are caching. You need to run <a href="https://www.postgresql.org/docs/current/sql-refreshmaterializedview.html" rel="noreferrer"><code>REFRESH MATERIALIZED VIEW</code></a> to ensure they are up to date. </p><p>Once we run <code>REFRESH MATERIALIZED VIEW</code>, we’ll end up with the new data in our materialized view, like so:</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_3.png" class="kg-image" alt="" loading="lazy" width="1526" height="1214" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_3.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_3.png 1000w, https://timescale.ghost.io/blog/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_3.png 1526w" sizes="(min-width: 720px) 720px"></figure><p>The thing is, <code>REFRESH</code>ing a materialized view can be expensive, and to understand why, it helps to know a bit more about how these views work and why they get out of date.</p><h2 id="how-materialized-views-work-and-why-they-get-out-of-date">How Materialized Views Work (and Why They Get Out of Date)</h2><p>To understand how <a href="https://www.postgresqltutorial.com/postgresql-views/postgresql-materialized-views/" rel="noreferrer">materialized views</a> get out of date and what refresh is doing, it helps to understand a little about how they work under the hood. 
Essentially, when you create a materialized view, you are creating a table and populating it with the data from the query. For the <code>ohlc_mat</code> view we’ve been working with, it’s equivalent to:</p><pre><code class="language-SQL">CREATE TABLE ohlc_tab AS 
SELECT time_bucket('15 min', time) bucket, symbol, first(price, time), max(price), min(price), last(price, time) 
FROM stocks_real_time 
GROUP BY time_bucket('15 min', time), symbol;
</code></pre>
<p>Our materialized view <code>ohlc_mat</code> stores the results of the query as of when we created it. So, what happens when I insert data into the underlying table?</p><pre><code class="language-SQL">INSERT INTO stocks_real_time VALUES (now(), 'AAPL', 170.91, NULL);
</code></pre>
<figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_4.png" class="kg-image" alt="" loading="lazy" width="1526" height="1214" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_4.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_4.png 1000w, https://timescale.ghost.io/blog/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_4.png 1526w" sizes="(min-width: 720px) 720px"></figure><p>The regular view (<code>ohlc_view</code>) will stay up to date because it’s just running the queries directly on the raw data in <code>stocks_real_time</code>. And if we’re only inserting data close to <code>now()</code>, and only querying much older data, then the materialized view will seem like it’s okay. We’ll see no change for our query from a month or two ago, but if we try to query a more recent time, we won’t have any data. If we want it up to date with more recent data, we’ll need to run:</p><pre><code class="language-SQL">REFRESH MATERIALIZED VIEW ohlc_mat;
</code></pre>
<p>When we do this, what is actually happening under the hood is that we truncate (remove all the data) from the table, and then run the query again and insert it into the table.</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_5.png" class="kg-image" alt="" loading="lazy" width="1526" height="1214" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_5.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_5.png 1000w, https://timescale.ghost.io/blog/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_5.png 1526w" sizes="(min-width: 720px) 720px"></figure><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_6.png" class="kg-image" alt="" loading="lazy" width="1526" height="1214" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_6.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_6.png 1000w, https://timescale.ghost.io/blog/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_6.png 1526w" sizes="(min-width: 720px) 720px"></figure><p>If we were using the <code>ohlc_tab</code> table from above, the equivalent operations would be something like:</p><pre><code class="language-SQL">TRUNCATE TABLE ohlc_tab;

INSERT INTO ohlc_tab 
SELECT time_bucket('15 min', time) bucket, symbol, first(price, time), max(price), min(price), last(price, time) 
FROM stocks_real_time 
GROUP BY time_bucket('15 min', time), symbol;
</code></pre>
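<p>To make the inefficiency concrete: in principle, you could hand-roll an incremental refresh that only redoes recent buckets, something like the sketch below. (This is <em>not</em> what <code>REFRESH MATERIALIZED VIEW</code> does, and the sketch glosses over real problems, like late-arriving data and partially filled buckets.)</p><pre><code class="language-SQL">-- Hypothetical sketch: re-materialize only the last day of buckets.
DELETE FROM ohlc_tab 
WHERE bucket &gt;= time_bucket('15 min', now() - INTERVAL '1 day');

INSERT INTO ohlc_tab 
SELECT time_bucket('15 min', time) bucket, symbol, first(price, time), max(price), min(price), last(price, time) 
FROM stocks_real_time 
WHERE time &gt;= time_bucket('15 min', now() - INTERVAL '1 day') 
GROUP BY time_bucket('15 min', time), symbol;
</code></pre>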
<p>(This works slightly differently when you run <code>REFRESH MATERIALIZED VIEW</code> with the <code>CONCURRENTLY</code> option, but fundamentally, it always runs the query over the entire data set, and getting into the details is beyond the scope of this post.)</p><p>The database, much like with a view, stores the query we ran so that when we run our <code>REFRESH</code> it just knows what to do, which is great, but it’s not the most efficient. Even though most of the data didn’t change, we still threw out the whole data set and re-ran the whole query. </p><p>While that might be okay for a typical <a href="https://www.tigerdata.com/learn/understanding-oltp" rel="noreferrer">OLTP</a> workload, where updates and deletes are spread randomly around the data set, it starts to look pretty inefficient for time-series data, where the writes mostly land in the most recent period. </p><p>So, to sum up, we found a case where the materialized view really helps us because the output from the query is so much smaller than the number of rows we have to scan to calculate it. In our case, it was an aggregate. But we also noticed that, when we used a materialized view, the data gets out of date because we’re storing the output of the query, rather than rerunning it at query time as you do with a view.</p><p>In order to get the materialized view to be up to date, we learned that we need to <code>REFRESH</code> it, but for time-series use cases, a) you have to refresh it frequently (in our case, at least every 15 minutes or so) for it to be up to date, and b) the refresh is inefficient because we have to delete and re-materialize all the data, maybe going back months, to get the new information from just the previous 15 minutes. 
<strong>And that’s one of the main reasons we developed continuous aggregates at Timescale.</strong></p><h2 id="what-are-continuous-aggregates-why-should-i-use-them">What Are Continuous Aggregates? Why Should I Use Them?</h2><p><a href="https://docs.timescale.com/use-timescale/latest/continuous-aggregates/about-continuous-aggregates/#about-continuous-aggregates"><u>Continuous aggregates</u></a> function as incremental, automatically updated materialized views. They scan and update the results of your query in the background, ensuring that new data is incorporated and old data is adjusted based on your refresh policy. <a href="https://timescale.ghost.io/blog/continuous-aggregates-faster-queries-with-automatically-maintained-materialized-views/"><u>Check out an example here.</u></a> Keep in mind that this doesn’t change how you query them: you read from a continuous aggregate just as you would from any other view.</p><p>During a refresh, only the data that has changed since the last refresh is updated, which eliminates the need to rerun the entire query over data you already have. Continuous aggregates also support real-time aggregation, described next, which is enabled by default.</p><h3 id="real-time-aggregates">Real-time aggregates</h3><p><a href="https://docs.timescale.com/use-timescale/latest/continuous-aggregates/real-time-aggregates/"><u>Real-time aggregates</u></a> address the problem of keeping a continuous aggregate fully current. Here's an example: if you have a 15-minute refresh policy and the last refresh occurred five minutes ago, there's already a gap in your data. Real-time aggregation tackles this by combining the most recent, not-yet-materialized raw data from the source hypertable with the already-materialized results at query time. 
This gives you an up-to-date view of your data for <a href="https://www.timescale.com/learn/real-time-analytics-in-postgres"><u>real-time analytics</u></a>. <br></p><h2 id="how-continuous-aggregates-work-and-how-they-were-inspired-by-the-best-of-views-and-materialized-views"><strong>How Continuous Aggregates Work and How They Were Inspired by the Best of Views and Materialized Views</strong></h2><p>We’ll build up to how exactly continuous aggregates work in this section, piece by piece.</p><p>Fundamentally, when we create a continuous aggregate, we’re doing something very similar to what happens when we create a materialized view. That’s why we use a slightly modified version of the interface for creating a materialized view:</p><pre><code class="language-SQL">CREATE MATERIALIZED VIEW ohlc_cont 
WITH (timescaledb.continuous) AS 
SELECT time_bucket('15 min', time) bucket, symbol, first(price, time), max(price), min(price), last(price, time) 
FROM stocks_real_time 
GROUP BY time_bucket('15 min', time), symbol;
</code></pre>
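<p>Once created, you can query the continuous aggregate like any other view. Here’s a quick sketch (the symbol <code>'AAPL'</code> and the one-day window are illustrative values, assuming the <code>stocks_real_time</code> hypertable used in this example):</p><pre><code class="language-SQL">-- Query the continuous aggregate like an ordinary view;
-- the unaliased aggregates get the default column names below.
SELECT bucket, symbol, first, max, min, last
FROM ohlc_cont
WHERE symbol = 'AAPL'
  AND bucket &gt;= now() - '1 day'::interval
ORDER BY bucket;
</code></pre>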
<p>Once we’ve done that, we end up in a very similar situation to what we have with a materialized view. We have the data that was around when we created the view, but as new data gets inserted, the view will get out of date.<br></p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_6_5-1.png" class="kg-image" alt="" loading="lazy" width="1526" height="1214" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_6_5-1.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_6_5-1.png 1000w, https://timescale.ghost.io/blog/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_6_5-1.png 1526w" sizes="(min-width: 720px) 720px"></figure><p>In order to keep a continuous aggregate up to date, we need a scheduled aggregation.</p><h3 id="scheduled-aggregation-of-new-data">Scheduled aggregation of new data</h3><p>We saw two main problems with materialized views that we wanted to address with scheduled aggregations:</p><ol><li>We have to manually refresh a materialized view when we want it to remain up to date.</li><li>We don’t want to re-run the query on all the old data unnecessarily; we should only run it on the new data. </li></ol><p>To schedule aggregations, we need to create a continuous aggregate policy:</p><pre><code class="language-SQL">SELECT add_continuous_aggregate_policy('ohlc_cont'::regclass, start_offset=&gt;NULL, end_offset=&gt;'15 mins'::interval,  schedule_interval=&gt;'5 mins'::interval);
</code></pre>
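<p>Once the policy exists, it runs as a background job. As a sketch, you can inspect it through the <code>timescaledb_information.jobs</code> view (available in TimescaleDB 2.x; exact columns may vary by version):</p><pre><code class="language-SQL">-- List the refresh job that the policy created, with its schedule.
SELECT job_id, proc_name, schedule_interval, config
FROM timescaledb_information.jobs
WHERE proc_name = 'policy_refresh_continuous_aggregate';
</code></pre>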
<p>Once we’ve scheduled a continuous aggregate policy, it will run automatically according to the <code>schedule_interval</code> we’ve specified. In our case, it runs every five minutes. When it runs, it looks at the data we’ve already materialized and the new inserts to see if we’ve finished at least one 15-minute bucket. If we have, it will run the query on just that next 15-minute portion and materialize the results in our continuous aggregate.</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_7.png" class="kg-image" alt="" loading="lazy" width="1526" height="1214" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_7.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_7.png 1000w, https://timescale.ghost.io/blog/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_7.png 1526w" sizes="(min-width: 720px) 720px"></figure><p>This means that the continuous aggregate now automatically has data from the next 15-minute period without user intervention. </p><p>And it’s much more efficient: unlike running <code>REFRESH MATERIALIZED VIEW</code>, we didn’t drop all the old data and recompute the aggregate against it; we just ran the aggregate query against the next 15-minute period and added that to our materialization. And as we move forward in time, this can keep occurring as each successive 15-minute period (or whatever period we chose for the <code>time_bucket</code> in the continuous aggregate definition) gets filled in with new data and then materialized. </p><p>One thing to note about this is that we keep track of where we’ve materialized up to by storing what we call a watermark, represented here by the dotted line. 
(<strong>NB</strong>: It’s named after the high watermark caused by a flood, not the watermark on a bank check.) So before the scheduled aggregation runs, the watermark is right after all the data we’ve materialized:</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_8.png" class="kg-image" alt="" loading="lazy" width="1526" height="1246" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_8.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_8.png 1000w, https://timescale.ghost.io/blog/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_8.png 1526w" sizes="(min-width: 720px) 720px"></figure><p>That helps us locate our next bucket and ensure it’s all there before we run the aggregation. Once we have, we move the watermark:</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_9-1.png" class="kg-image" alt="" loading="lazy" width="1526" height="1312" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_9-1.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_9-1.png 1000w, https://timescale.ghost.io/blog/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_9-1.png 1526w" sizes="(min-width: 720px) 720px"></figure><p>So our watermark represents the furthest point we’ve materialized up until now.</p><p>But, you might notice that our continuous aggregates still aren’t fully up to date and wouldn’t give us the same results as a view that ran the same query. 
Why?</p><ol><li>Scheduled aggregates will have some gap between when the next bucket has all of its data and when the job runs to materialize it.</li><li>We only materialize data once the next bucket is full by default, so we’re missing the partial bucket where inserts are happening right now. We might want to get partial results for that bucket (this is especially true when we’re using larger buckets).</li></ol><p>To address this, we made real-time views.</p><h3 id="real-time-views">Real-time views</h3><p>Real-time views combine the best of materialized views and normal views to give us a more up-to-date view of our data. They’re the default for continuous aggregates, so I don’t need to change how I made my continuous aggregate at all. However, I will admit that I elided a few things in the previous picture about how continuous aggregates work under the hood.</p><p>Real-time continuous aggregates have two parts:</p><ol><li>A <em>materialized hypertable</em>, where our already computed aggregates are stored.</li><li>And a <em>real-time view, </em>which queries both the materialized <a href="https://www.tigerdata.com/blog/database-indexes-in-postgresql-and-timescale-cloud-your-questions-answered" rel="noreferrer">hypertable</a> and the raw hypertable (in the not-yet-aggregated region) and combines the results together.</li></ol><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_10.png" class="kg-image" alt="" loading="lazy" width="1526" height="1892" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_10.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_10.png 1000w, https://timescale.ghost.io/blog/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_10.png 1526w" sizes="(min-width: 720px) 720px"></figure><p>So, if 
you look at the view definition of the continuous aggregate, it looks like this:</p><pre><code class="language-SQL">CREATE VIEW ohlc_cont AS  SELECT _materialized_hypertable_15.bucket,
    _materialized_hypertable_15.symbol,
    _materialized_hypertable_15.first,
    _materialized_hypertable_15.max,
    _materialized_hypertable_15.min,
    _materialized_hypertable_15.last
   FROM _timescaledb_internal._materialized_hypertable_15
  WHERE _materialized_hypertable_15.bucket &lt; COALESCE(_timescaledb_internal.to_timestamp(_timescaledb_internal.cagg_watermark(15)), '-infinity'::timestamp with time zone)
UNION ALL
 SELECT time_bucket('00:15:00'::interval, stocks_real_time."time") AS bucket,
    stocks_real_time.symbol,
    first(stocks_real_time.price, stocks_real_time."time") AS first,
    max(stocks_real_time.price) AS max,
    min(stocks_real_time.price) AS min,
    last(stocks_real_time.price, stocks_real_time."time") AS last
   FROM stocks_real_time
  WHERE stocks_real_time."time" &gt;= COALESCE(_timescaledb_internal.to_timestamp(_timescaledb_internal.cagg_watermark(15)), '-infinity'::timestamp with time zone)
  GROUP BY (time_bucket('00:15:00'::interval, stocks_real_time."time")), stocks_real_time.symbol;
</code></pre>
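<p>If you’re curious, you can also check the watermark directly with the same internal functions that appear in the generated view above. These are internal APIs, so treat this as a debugging sketch; the hypertable ID <code>15</code> comes from our example and will differ on your system:</p><pre><code class="language-SQL">-- Where does the materialized region currently end?
SELECT _timescaledb_internal.to_timestamp(
        _timescaledb_internal.cagg_watermark(15)
       ) AS watermark;
</code></pre>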
<p>It’s two queries put together with a <a href="https://www.postgresql.org/docs/current/queries-union.html"><code>UNION ALL</code></a>, the first just selecting the data straight out of the materialized hypertable where our bucket is below the watermark, the second running the aggregation query where our time column is above the watermark. </p><p>So you can see how this takes advantage of the best of both materialized views and normal views to create something that is much faster than a normal view but still up to date! </p><p>It’s not going to be as performant as just querying the already materialized data (though we do have an <a href="https://docs.timescale.com/getting-started/latest/create-cagg/create-cagg-policy/">option to allow you to do that if you want to</a>), but for most users, the last few months or even years of data is already materialized whereas only the last few minutes or days of raw data needs to be queried, so that still creates a huge speedup!</p><h3 id="invalidation-of-out-of-order-data">Invalidation of out-of-order data</h3><p>You may have noticed that I was making a big assumption in all of my diagrams. I was assuming that <em>all</em> of our inserts happen in the most recent time period. For time-series workloads, this is <em>mostly</em> true. <em>Most</em> data comes in time order. But, most and all are <em>very</em> different things. 
Especially with time-series workloads, so much data is coming in that even if 99 percent of it arrives in time order, the remaining 1 percent is still a lot!</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_11-1.png" class="kg-image" alt="" loading="lazy" width="1526" height="1210" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_11-1.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_11-1.png 1000w, https://timescale.ghost.io/blog/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_11-1.png 1526w" sizes="(min-width: 720px) 720px"></figure><p>And the results of the aggregate would be wrong by a meaningful amount if we simply let the inserts (or updates or deletes) build up over time. This cache invalidation problem is a very common problem in computing and a very hard one! PostgreSQL materialized views solve this problem by dropping all the old data and re-materializing it every time, but we already said how inefficient that was. </p><p>The other way that many folks try to solve this sort of problem in a database like PostgreSQL is a trigger. A standard trigger would run for every row and update the aggregates accordingly.</p><p>But in practice, it’s hard to get a per-row trigger to work very well, and it still would cause significant <em>write amplification</em>, meaning we’d have to write multiple times for every row we insert. </p><p>In fact, we’d need to write at least once for each row for each continuous aggregate we had on the raw hypertable. It would also limit the aggregates we can use to those that can be modified by a trigger, which are fewer than we might like. 
<br><br>So instead, we created a special kind of trigger that tracks the minimum and maximum <em>times</em> modified across all the rows in a statement and writes out the range of times that were modified to a log table. We call that an invalidation log.</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_12.png" class="kg-image" alt="" loading="lazy" width="1526" height="1258" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_12.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_12.png 1000w, https://timescale.ghost.io/blog/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_12.png 1526w" sizes="(min-width: 720px) 720px"></figure><p>The next time the continuous aggregate job runs, it has to do two things: it runs the normal aggregation of the next 15 minutes of data, and it runs an aggregation over each of the invalidated regions to recalculate the proper value over that period.</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_13.png" class="kg-image" alt="" loading="lazy" width="1526" height="1312" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_13.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_13.png 1000w, https://timescale.ghost.io/blog/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_13.png 1526w" sizes="(min-width: 720px) 720px"></figure><p>Note that this makes our continuous aggregates <a href="https://en.wikipedia.org/wiki/Eventual_consistency">eventually consistent</a> for out-of-order modifications. 
However, real-time views make continuous aggregates more strongly consistent for recent data (because they use a view under the hood). </p><p><br><br>While we could make more strongly consistent aggregates by joining to our log table and rerunning the aggregation in real time for the invalidated regions, we talked to users and decided that eventual consistency was good enough for most cases here. After all, this data is already late coming in. Essentially, we think the performance impact of doing that wasn’t worth the stronger consistency guarantees. And if a user ever wants to, they can trigger a manual refresh of the continuous aggregate for the modified region by running the <a href="https://docs.timescale.com/api/latest/continuous-aggregates/refresh_continuous_aggregate/"><code>refresh_continuous_aggregate</code> procedure</a>, which updates the data in the materialized hypertable right away.</p><h3 id="data-retention">Data retention</h3><p>The final thing we wanted to accomplish with our continuous aggregates was a way to <em>keep</em> aggregated data around after dropping the raw data. This is impossible with both PostgreSQL views and materialized views: views work directly on the raw data—if you drop it, they can’t aggregate it. </p><p>With materialized views, it’s a bit more complicated: until you run a refresh, they keep the old data around, but once you run a refresh to get the new data that you’ve added in more recent time periods, the old data is dropped. </p><p>With continuous aggregates, the implementation is much simpler. We mentioned the invalidation trigger that fires when we modify data that we’ve already materialized. We simply ignore any events older than a certain time horizon, including the drop event. </p><p>We can also process any invalidations before dropping data so that you can have the correct data materialized from right before you dropped the oldest stuff. 
You can configure data retention <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/data-retention/data-retention-with-continuous-aggregates/">by setting up your continuous aggregate policy correctly</a>.</p><h2 id="does-it-work">Does It Work?</h2><p>So we’ve got this whole mashup of views and materialized views and triggers in order to try to make a good set of trade-offs that works well for time-series data. So the question is: does it work? </p><p><br><br>To test that, I recreated our continuous aggregate from above without data or a policy and ran the <code>refresh_continuous_aggregate</code> procedure so that it would have approximately a month’s worth of data materialized in the aggregate, with about 30 minutes that need to go through the real-time view.</p><p><br><br>If we <strong>query for the aggregated data over the whole period from the continuous aggregate</strong>, it takes about<strong> 18 ms</strong>, which is slightly slower than the <strong>5-6 ms for the fully materialized data in a materialized view</strong>, but it’s still <strong>almost 1,000x faster than the 15 s it takes from a normal view</strong>, <strong>and we get <em>most</em> of the up-to-date benefits of a normal view from it. </strong>I'd be pretty happy with that trade-off.</p><p>If you are new to TimescaleDB and would like to try this out, I invite you to <a href="https://console.cloud.timescale.com/signup">sign up for Timescale</a>. It's the easiest way to get started. It’s 100 percent free for 30 days, no credit card required, and you’ll be able to spin up a database in seconds. </p><p>You can easily host PostgreSQL tables and TimescaleDB hypertables in your Timescale demo database, create views, materialized views, and continuous aggregates, and explore the differences in performance and developer experience between them. 
We've put together some extra directions in this guide on <a href="https://www.timescale.com/learn/postgresql-materialized-views-and-where-to-find-them">using incremental materialized views, a.k.a. continuous aggregates</a>—we hope it helps!<br><br></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Function Pipelines: Building Functional Programming Into PostgreSQL Using Custom Operators]]></title>
            <description><![CDATA[Function pipelines allow you to analyze data by composing multiple functions in SQL and express complex logic in PostgreSQL in a cleaner way.]]></description>
            <link>https://www.tigerdata.com/blog/function-pipelines-building-functional-programming-into-postgresql-using-custom-operators</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/function-pipelines-building-functional-programming-into-postgresql-using-custom-operators</guid>
            <category><![CDATA[Announcements & Releases]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Engineering]]></category>
            <dc:creator><![CDATA[David Kohn]]></dc:creator>
            <pubDate>Tue, 19 Oct 2021 13:33:44 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2024/06/Function-pipelines--1-.png">
            </media:content>
            <content:encoded><![CDATA[<p><strong>Today, we are announcing <em>function pipelines</em>, a new capability that introduces functional programming concepts inside PostgreSQL (and SQL) using custom operators.</strong></p><p><em>Function pipelines</em> radically improve the developer ergonomics of analyzing data in PostgreSQL and SQL, by applying principles from <a href="https://en.wikipedia.org/wiki/Functional_programming">functional programming</a> and popular tools like <a href="https://pandas.pydata.org/docs/index.html">Python’s Pandas</a> and <a href="https://prometheus.io/docs/prometheus/latest/querying/basics/">PromQL</a>.</p><p>At Timescale, our mission is to serve developers worldwide and enable them to build exceptional data-driven products that measure everything that matters: software applications, industrial equipment, financial markets, blockchain activity, user actions, consumer behavior, machine learning models, climate change, and more.</p><p><a href="https://timescale.ghost.io/blog/blog/why-sql-beating-nosql-what-this-means-for-future-of-data-time-series-database-348b777b847a/">We believe SQL is the best language for data analysis</a>. We’ve championed its benefits for several years, even when many were abandoning it for custom domain-specific languages. And we were right—SQL has resurged and become the universal language for data analysis, and now many NoSQL databases are adding SQL interfaces to keep up.</p><p>But SQL is not perfect and, at times, can get quite unwieldy. For example,</p><pre><code class="language-SQL">SELECT device_id, 
	sum(abs_delta) as volatility
FROM (
	SELECT device_id, 
		abs(val - lag(val) OVER (PARTITION BY device_id ORDER BY ts))
        	as abs_delta 
	FROM measurements
	WHERE ts &gt;= now() - '1 day'::interval) calc_delta
GROUP BY device_id; </code></pre><p><em>Pop quiz: What does this query do?</em> </p><p>Even if you are a SQL expert, queries like this can be quite difficult to read—and even harder to express. Complex data analysis in SQL can be hard.</p><h2 id="the-advantages-of-function-pipelines">The Advantages of Function Pipelines</h2><p><em>Function pipelines</em> let you express that same query like this:</p><pre><code class="language-SQL">SELECT device_id, 
	timevector(ts, val) -&gt; sort() -&gt; delta() -&gt; abs() -&gt; sum() 
    		as volatility
FROM measurements
WHERE ts &gt;= now() - '1 day'::interval
GROUP BY device_id;</code></pre><p>Now it is much clearer what this query is doing:</p><ul><li>It gets the last day’s data from the measurements table, grouped by <code>device_id</code>.</li><li>It sorts the data by the time column.</li><li>It calculates the delta (or change) between values.</li><li>It takes the absolute value of the delta.</li><li>And then takes the sum of the results of the previous steps.</li></ul><p><strong>Function pipelines improve your own coding productivity, while also making your SQL code easier for others to comprehend and maintain.</strong></p><p>Inspired by functional programming languages, function pipelines enable you to analyze data by composing multiple functions, leading to a simpler, cleaner way of expressing complex logic in PostgreSQL.</p><p>The best part is that we built function pipelines in a way that is fully PostgreSQL compliant—we did not change any SQL syntax—meaning that any tool that speaks PostgreSQL will be able to support data analysis using function pipelines.</p><h2 id="how-we-built-function-pipelines">How We Built Function Pipelines</h2><p>How did we build this? By taking advantage of PostgreSQL's incredible extensibility, particularly <a href="https://www.postgresql.org/docs/current/sql-createtype.html">custom types</a>, <a href="https://www.postgresql.org/docs/current/sql-createoperator.html">custom operators</a>, and <a href="https://www.postgresql.org/docs/current/sql-createfunction.html">custom functions</a>.</p><p>In our previous example, you can see the key elements of function pipelines:</p><ul><li><strong>Custom data types</strong>: in this case, the <code>timevector</code>, which is a set of <code>(time, value)</code> pairs.</li><li><strong>Custom operator</strong>: <code>-&gt;</code>, used to <em>compose</em> and <em>apply </em>function pipeline elements to the data that comes in.</li><li>And finally, <strong>custom functions,</strong> called <em>pipeline elements. 
</em>Pipeline elements can transform and analyze <code>timevector</code>s (or other data types) in a function pipeline. For this initial release, we’ve built <em>60 custom functions! </em><strong>(</strong><a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/hyperfunctions/function-pipelines/"><strong>Full list here</strong></a><strong>)</strong>.</li></ul><p>We’ll go into more detail on function pipelines in the rest of this post, but if you just want to get started as soon as possible, <strong>the easiest way to try function pipelines is through a fully managed Timescale Cloud service</strong>. <a href="https://console.cloud.timescale.com/signup">Try it for free</a> (no credit card required) for 30 days. </p><p>Function pipelines are pre-loaded on each new database service on Timescale Cloud, available immediately—so after you’ve created a new service, you’re all set to use them!</p><p>If you prefer to manage your own database instances, <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/hyperfunctions/install-toolkit/">you can install the <code>timescaledb_toolkit</code></a> into your existing PostgreSQL installation, completely for free. </p><p>We’ve been working on this capability for a long time, but in line with our belief of “<a href="https://timescale.ghost.io/blog/blog/move-fast-but-dont-break-things-introducing-the-experimental-schema-with-new-experimental-features-in-timescaledb-2-4/">move fast but don’t break things</a>,” we’re initially releasing function pipelines as an <a href="https://docs.timescale.com/api/latest/api-tag-overview/#experimental-timescaledb-toolkit">experimental feature</a>—and we would absolutely love to <strong>get your feedback</strong>. 
You can <a href="https://github.com/timescale/timescaledb-toolkit/issues">open an issue</a> or join a <a href="https://github.com/timescale/timescaledb-toolkit/discussions">discussion thread</a> in GitHub (and, if you like what you see, GitHub ⭐ is always welcome and appreciated, too!).</p><p><em>We’d also like to take this opportunity to give a huge shoutout to </em><a href="https://github.com/zombodb/pgx"><em><code>pgx</code>, the Rust-based framework for building PostgreSQL extensions</em></a><em>—it handles a lot of the heavy lifting for this project. We have over 600 custom types, operators, and functions in the <code>timescaledb_toolkit</code> extension at this point; managing this without <code>pgx</code> (and the ease of use that comes from working with Rust) would be a real bear of a job.</em></p><h2 id="function-pipelines-why-are-they-useful">Function Pipelines: Why Are They Useful?</h2><p>It’s October. In the northern hemisphere (where most of Team Timescale, including your author, lives), it is starting to get cold.</p><p>Now imagine a restaurant in New York City whose owners care about their customers and their customers’ comfort. And you are working on an IoT product designed to help small businesses like these owners minimize their heating bill while maximizing their customers’ happiness. So you install two thermometers, one at the front measuring the temperature right by the door, and another at the back of the restaurant.</p><p>Now, as many of you may know (if you’ve ever had to sit by the door of a restaurant in the fall or winter), when someone enters, the temperature drops—and once the door is closed, the temperature warms back up. The temperature at the back of the restaurant will vary much less than at the front, right by the door. Both of them will drop slowly down to a lower set point during non-business hours and warm back up sometime before business hours based on the setpoints on our thermostat. 
So overall we’ll end up with a graph that looks something like this:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2021/10/graph--1-.jpg" class="kg-image" alt="A graph with time on the x axis and temperature on the y axis showing two curves. First a curve labeled back which starts low steadily rises stays relatively constant in the section labeled operating hours and then drops slowly back down. Second a curve labeled front which starts following the other, starts low, rises then it starts getting jumpy, drastically varying while the other stays constant during operating hours, then it falls back down. " loading="lazy" width="1600" height="986" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2021/10/graph--1-.jpg 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2021/10/graph--1-.jpg 1000w, https://timescale.ghost.io/blog/content/images/2021/10/graph--1-.jpg 1600w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">A graph of the temperature at the front (near the door) and back. The back is much steadier, while the front is more volatile. Graph is for illustrative purposes only, data is fabricated. No restaurants or restaurant patrons were harmed in the making of this post.</span></figcaption></figure><p>As we can see, the temperature by the front door varies much more than at the back of the restaurant. Another way to say this is the temperature by the front door is more <em>volatile</em>. Now, the owners of this restaurant want to measure this because frequent temperature changes means uncomfortable customers.</p><p>In order to measure volatility, we could first subtract each point from the point before to calculate a delta. If we add this up directly, large positive and negative deltas will cancel out. 
But, we only care about the magnitude of the delta, not its sign—so what we really should do is take the absolute value of the delta, and then take the total sum of the previous steps.</p><p>We now have a metric that might help us measure customer comfort, and also the efficacy of different weatherproofing methods (for example, adding one of those little vestibules that acts as a windbreak). </p><p>To track this, we collect measurements from our thermometers and store them in a table:</p><pre><code class="language-SQL">CREATE TABLE measurements(
	device_id BIGINT,
	ts TIMESTAMPTZ,
	val DOUBLE PRECISION
);
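	-- Note (not in the original post): in TimescaleDB you would typically
	-- also turn this table into a hypertable, along the lines of:
	-- SELECT create_hypertable('measurements', 'ts');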
</code></pre><p>The <code>device_id</code> identifies the thermostat, <code>ts</code> the time of reading and <code>val</code> the temperature. </p><p>Using the data in our measurements table, let’s look at how we calculate volatility using function pipelines.</p><p><em>Note: because all of the function pipeline features are still experimental, they exist in the <code>toolkit_experimental</code> schema. Before running any of the SQL code in this post you will need to set your <code>search_path</code> to include the experimental schema as we do in the example below, we won’t repeat this throughout the post so as not to distract.</em></p><pre><code class="language-SQL">set search_path to toolkit_experimental, public; --still experimental, so do this to make it easier to read

SELECT device_id, 
	timevector(ts, val) -&gt; sort() -&gt; delta() -&gt; abs() -&gt; sum() 
    	as volatility
FROM measurements
WHERE ts &gt;= now()-'1 day'::interval
GROUP BY device_id;</code></pre><p>And now we have the same query that we used as our example in the introduction.</p><p>In this query, the function pipeline <code>timevector(ts, val) -&gt; sort() -&gt; delta() -&gt; abs() -&gt; sum()</code> succinctly expresses the following operations:</p><ol><li>Create <code>timevector</code>s (more detail on this later) out of the <code>ts</code> and <code>val</code> columns.</li><li>Sort each <code>timevector</code> by the time column.</li><li>Calculate the delta (or change) between each pair in the <code>timevector</code> by subtracting the previous <code>val</code> from the current.</li><li>Take the absolute value of the delta.</li><li>Take the sum of the results from the previous steps.</li></ol><p>The <code>FROM</code>, <code>WHERE</code> and <code>GROUP BY</code> clauses do the rest of the work telling us:</p><ol><li>We’re getting data <em>FROM</em> the <code>measurements</code> table</li><li><em>WHERE </em>the ts, or timestamp column, contains values over the last day</li><li>Showing one pipeline output per <code>device_id</code> (the GROUP BY column)</li></ol><p>As we noted before, if you were to do this same calculation using SQL and PostgreSQL functionality, your query would look like this:</p><pre><code class="language-SQL">SELECT device_id, 
sum(abs_delta) as volatility
FROM (
	SELECT 
		abs(val - lag(val) OVER (PARTITION BY device_id ORDER BY ts) ) 
        	as abs_delta 
	FROM measurements
	WHERE ts &gt;= now() - '1 day'::interval) calc_delta
GROUP BY device_id; 
</code></pre><p>This does the same five steps as the above but is much harder to understand. We have to use a <a href="https://www.postgresql.org/docs/current/functions-window.html">window function</a> and aggregate the results—but also, because aggregates are performed before window functions, we need to actually execute the window function in a subquery.</p><p>As we can see, function pipelines make it significantly easier to comprehend the overall analysis of our data. There’s no need to completely understand what’s going on in these functions just yet, but for now, it’s enough to understand that we’ve essentially implemented a small functional programming language inside of PostgreSQL. You can still use all of the normal, expressive SQL you’ve come to know and love. Function pipelines just add new tools to your SQL toolbox that make it easier to work with time-series data.</p><p>Some avid SQL users might find the syntax a bit foreign at first, but for many people who work in other programming languages, especially using tools like <a href="https://pandas.pydata.org/docs/index.html">Python’s Pandas Package</a>, this type of <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pipe.html">successive operation on data sets</a> will feel natural. 
</p><p>Again, this is still fully PostgreSQL-compliant: we introduce no changes to the parser or anything that should break compatibility with PostgreSQL drivers.</p><h2 id="building-function-pipelines-without-forking-postgresql">Building Function Pipelines Without Forking PostgreSQL</h2><p>We built function pipelines—without modifying the <a href="https://www.postgresql.org/docs/10/parser-stage.html">parser</a> or anything that would require a fork of PostgreSQL—by taking advantage of three of the many ways that PostgreSQL enables extensibility: <a href="https://www.postgresql.org/docs/current/sql-createtype.html">custom types</a>, <a href="https://www.postgresql.org/docs/current/sql-createfunction.html">custom functions</a>, and <a href="https://www.postgresql.org/docs/current/sql-createoperator.html">custom operators</a>.</p><ul><li><strong>Custom data types</strong>, starting with the <code>timevector</code>, which is a set of <code>(time, value)</code> pairs</li><li><strong>A custom operator</strong>: <code>-&gt;</code>, which is used to <em>compose</em> and <em>apply </em>function pipeline elements to the data that comes in.</li><li><strong>Custom functions</strong>, called <em>pipeline elements</em>, which can transform and analyze <code>timevector</code>s (or other data types) in a function pipeline (with 60 functions in this initial release)</li></ul><p>We believe that new idioms like these are exactly what PostgreSQL was <em>meant to enable</em>. That’s why it has supported custom types, functions and operators from its earliest days. (And is one of the many reasons why we love PostgreSQL.)</p><h2 id="a-custom-data-type-the-timevector">A Custom Data Type: The <code>timevector</code></h2><p>A <code>timevector</code> is a collection of <code>(time, value)</code> pairs. As of now, the times must be <code>TIMESTAMPTZ</code>s and the values must be <code>DOUBLE PRECISION</code> numbers. 
(But this may change in the future as we continue to develop this data type. If you have ideas/input, please <a href="https://github.com/timescale/timescaledb-toolkit/issues">file feature requests on GitHub</a> explaining what you’d like!)</p><p>You can think of the <code>timevector</code> as something like this: </p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2021/10/table1--1-.jpg" class="kg-image" alt=" A box with `timevector` on the outside, inside there is a table with one column labeled time containing multiple timestamps and another column labeled value containing floating point numbers." loading="lazy" width="950" height="880" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2021/10/table1--1-.jpg 600w, https://timescale.ghost.io/blog/content/images/2021/10/table1--1-.jpg 950w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">A depiction of a </span><code spellcheck="false" style="white-space: pre-wrap;"><span>timevector</span></code><span style="white-space: pre-wrap;">.</span></figcaption></figure><p>One of the first questions you might ask is: how does a <code>timevector</code> relate to time-series data? (If you want to know more about time-series data, <a href="https://timescale.ghost.io/blog/time-series-data/" rel="noreferrer">we have a great blog post on that</a>.) </p><p>Let’s consider our example from above, where we were talking about a restaurant that was measuring temperatures, and we had a <code>measurements</code> table like so:</p><pre><code class="language-SQL">CREATE TABLE measurements(
	device_id BIGINT,
	ts TIMESTAMPTZ,
	val DOUBLE PRECISION
);
</code></pre><p>In this example, we can think of a single time-series dataset as all historical and future time and temperature measurements from a device. </p><p>Given this definition, we can think of a <code>timevector</code> as a <strong>finite subset of a time-series dataset</strong>. The larger time-series dataset may extend back into the past, and it may extend into the future, but the <code>timevector</code> is bounded.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2021/10/table2--1-.jpg" class="kg-image" alt="A box labeled time-series with a table in it. The table has one column labeled time and another labeled value like the diagram above. There is a timevector box inside the table with timestamps and values. Each column also has an arrow at the top pointing to the word past and another arrow at the bottom pointing to the word future conveying that the time-series extends into the past and future while the timevector contains a subset of the values. 
" loading="lazy" width="1200" height="1500" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2021/10/table2--1-.jpg 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2021/10/table2--1-.jpg 1000w, https://timescale.ghost.io/blog/content/images/2021/10/table2--1-.jpg 1200w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">A </span><code spellcheck="false" style="white-space: pre-wrap;"><span>timevector</span></code><span style="white-space: pre-wrap;"> is a finite subset of a time-series and contains all the </span><code spellcheck="false" style="white-space: pre-wrap;"><span>(time, value)</span></code><span style="white-space: pre-wrap;"> pairs in some region of the time-series.</span></figcaption></figure><p>In order to construct a <code>timevector</code> from the data gathered from a thermometer, we use a custom aggregate and pass in the columns we want to become our <code>(time, value)</code> pairs. We can use the <code>WHERE</code> clause to define the extent of the <code>timevector</code> (i.e., the limits of this subset), and the <code>GROUP BY</code> clause to provide identifying information about the <a href="https://www.tigerdata.com/blog/time-series-introduction" rel="noreferrer">time series</a> that are represented.</p><p>Building on our example, this is how we construct a <code>timevector</code> for each thermometer in our dataset:</p><pre><code class="language-SQL">SELECT device_id, 
	timevector(ts, val)
FROM measurements
WHERE ts &gt;= now() - '1 day'::interval
GROUP BY device_id;
</code></pre><p>But a <code>timevector</code> doesn't provide much value by itself. So now, let’s also consider some complex calculations that we can apply to the <code>timevector</code>, starting with the custom operator used to apply these functions.</p><h2 id="a-custom-operator">A Custom Operator: <code>-&gt;</code></h2><p>In function pipelines, the <code>-&gt;</code> operator is used to apply and compose multiple functions in an easy-to-write, easy-to-read format. </p><p>Fundamentally, <code>-&gt;</code> means: “apply the operation on the right to the inputs on the left”, or, more simply, “do the next thing”. </p><p>We created a single general-purpose operator for this because we think that too many operators meaning different things can get very confusing and difficult to read.</p><p>One thing that you’ll notice about the pipeline elements is that the arguments are in an unusual place in a statement like:</p><pre><code class="language-SQL">SELECT device_id, 
 	timevector(ts, val) -&gt; sort() -&gt; delta() -&gt; abs() -&gt; sum() 
    	as volatility
FROM measurements
WHERE ts &gt;= now() - '1 day'::interval
GROUP BY device_id;</code></pre><p>It <em>appears</em> (from the syntax) that the <code>timevector(ts, val)</code> is an argument to <code>sort()</code>, the resulting <code>timevector</code> is an argument to <code>delta()</code>, and so on. </p><p>The thing is that <code>sort()</code> (and the others) are regular function calls; they can’t see anything outside of their parentheses and don’t know about anything to their left in the statement; so we need a way to get the <code>timevector</code> into the <code>sort()</code> (and the rest of the pipeline). </p><p>We solved this by taking advantage of one of the fundamental computing insights that functional programming languages use: <em>code and data are really the same thing</em>. </p><p>Each of our functions returns a special type that describes the function and its arguments. We call these types <em>pipeline elements</em> (more later). </p><p>The <code>-&gt;</code> operator then performs one of two different types of actions depending on the types on its right and left sides. It can either:</p><ol><li><em>Apply</em> a pipeline element to the left-hand argument: perform the function described by the pipeline element directly on the incoming data type.</li><li><em>Compose</em> pipeline elements into a combined element that can be applied at some point in the future (this is an optimization that allows us to apply multiple elements in a “nested” manner so that we don’t perform multiple unnecessary passes).</li></ol><p>The operator determines the action to perform based on its left and right arguments. </p><p>Let’s look at our pipeline from before: <code>timevector(ts, val) -&gt; sort() -&gt; delta() -&gt; abs() -&gt; sum()</code>. 
If you remember from before, I noted that this function pipeline performs the following steps: </p><ol><li>Create <code>timevector</code>s out of the <code>ts</code> and <code>val</code> columns.</li><li>Sort it by the time column.</li><li>Calculate the delta (or change) between each pair in the <code>timevector</code> by subtracting the previous <code>val</code> from the current.</li><li>Take the absolute value of the delta.</li><li>Take the sum of the results from the previous steps.</li></ol><p>And logically, at each step, we can think of the <code>timevector</code> being materialized and passed to the next step in the pipeline. </p><p>However, while this will produce a correct result, it’s not the most efficient way to compute this. Instead, it would be more efficient to compute as much as possible in a single pass over the data.</p><p>In order to do this, we allow not only the <em>apply</em> operation, but also the <em>compose</em> operation. Once we’ve composed a pipeline into a logically equivalent higher-order pipeline with all of the elements, we can choose the most efficient way to execute it internally. (Importantly, even if we have to perform each step sequentially, we don’t need to <em>materialize</em> the <code>timevector</code> and pass it between each step in the pipeline, so this has significantly less overhead even without other optimizations.)</p><h2 id="custom-functions-pipeline-elements">Custom Functions: Pipeline Elements</h2><p>Now let’s discuss the third, and final, key piece that makes up function pipelines: custom functions, or as we call them, <em>pipeline elements</em>.</p><p>We have implemented over 60 individual pipeline elements, which fall into 4 categories (with a few subcategories):</p><h3 id="timevector-transforms"><code>timevector</code> transforms</h3><p>These elements take in a <code>timevector</code> and produce a <code>timevector</code>. 
They are the easiest to compose, as they produce the same type.</p><p>Example pipeline:</p><pre><code class="language-SQL">SELECT device_id, 
	timevector(ts, val) 
    	-&gt; sort() 
        -&gt; delta() 
        -&gt; map($$ ($value^3 + $value^2 + $value * 2) $$) 
        -&gt; lttb(100) 
FROM measurements
GROUP BY device_id;</code></pre><p>Organized by sub-category:</p><h4 id="unary-mathematical">Unary mathematical</h4>
<p>Simple mathematical functions applied to the value in each point in a <code>timevector</code>. (<a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/hyperfunctions/function-pipelines/#unary-mathematical-functions">Docs link</a>)</p>
<table>
<thead>
<tr>
<th>Element</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>abs()</code></td>
<td>Computes the absolute value of each value</td>
</tr>
<tr>
<td><code>cbrt()</code></td>
<td>Computes the cube root of each value</td>
</tr>
<tr>
<td><code>ceil()</code></td>
<td>Computes the first integer greater than or equal to each value</td>
</tr>
<tr>
<td><code>floor()</code></td>
<td>Computes the first integer less than or equal to each value</td>
</tr>
<tr>
<td><code>ln()</code></td>
<td>Computes the natural logarithm of each value</td>
</tr>
<tr>
<td><code>log10()</code></td>
<td>Computes the base 10 logarithm of each value</td>
</tr>
<tr>
<td><code>round()</code></td>
<td>Computes the closest integer to each value</td>
</tr>
<tr>
<td><code>sign()</code></td>
<td>Computes +/-1 for each positive/negative value</td>
</tr>
<tr>
<td><code>sqrt()</code></td>
<td>Computes the square root for each value</td>
</tr>
<tr>
<td><code>trunc()</code></td>
<td>Computes only the integer portion of each value</td>
</tr>
</tbody>
</table>
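<p>Unary elements drop into a pipeline anywhere a <code>timevector</code> flows. As a sketch against the <code>measurements</code> table from earlier (the alias is purely illustrative), we can round each absolute delta before summing:</p>
<pre><code class="language-SQL">SELECT device_id, 
	timevector(ts, val) -&gt; sort() -&gt; delta() -&gt; abs() -&gt; round() -&gt; sum() 
    	as rounded_volatility
FROM measurements
WHERE ts &gt;= now() - '1 day'::interval
GROUP BY device_id;</code></pre>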
<h4 id="binary-mathematical">Binary mathematical</h4>
<p>Simple mathematical functions with a scalar input applied to the value in each point in a <code>timevector</code>. (<a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/hyperfunctions/function-pipelines/#binary-mathematical-functions">Docs link</a>)</p>
<table>
<thead>
<tr>
<th>Element</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>add(N)</code></td>
<td>Computes each value plus <code>N</code></td>
</tr>
<tr>
<td><code>div(N)</code></td>
<td>Computes each value divided by <code>N</code></td>
</tr>
<tr>
<td><code>logn(N)</code></td>
<td>Computes the logarithm base <code>N</code> of each value</td>
</tr>
<tr>
<td><code>mod(N)</code></td>
<td>Computes the remainder when each number is divided by <code>N</code></td>
</tr>
<tr>
<td><code>mul(N)</code></td>
<td>Computes each value multiplied by <code>N</code></td>
</tr>
<tr>
<td><code>power(N)</code></td>
<td>Computes each value taken to the <code>N</code> power</td>
</tr>
<tr>
<td><code>sub(N)</code></td>
<td>Computes each value less <code>N</code></td>
</tr>
</tbody>
</table>
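<p>Binary elements chain the same way. As a sketch (assuming, purely for illustration, that <code>val</code> stores degrees Celsius), converting each reading to Fahrenheit:</p>
<pre><code class="language-SQL">SELECT device_id, 
	timevector(ts, val) -&gt; mul(1.8) -&gt; add(32.0) -&gt; unnest()
FROM measurements
GROUP BY device_id;</code></pre>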
<h4 id="compound-transforms">Compound transforms</h4>
<p>Transforms involving multiple points inside of a <code>timevector</code>. (<a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/hyperfunctions/function-pipelines/#compound-transforms">Docs link</a>)</p>
<table>
<thead>
<tr>
<th>Element</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>delta()</code></td>
<td>Subtracts the previous value from each value</td>
</tr>
<tr>
<td><code>fill_to(interval, fill_method)</code></td>
<td>Fills gaps larger than <code>interval</code> with points at <code>interval</code> from the previous using <code>fill_method</code></td>
</tr>
<tr>
<td><code>lttb(resolution)</code></td>
<td>Downsamples a <code>timevector</code> using the largest triangle three buckets algorithm at <code>resolution</code>; requires sorted input.</td>
</tr>
<tr>
<td><code>sort()</code></td>
<td>Sorts the <code>timevector</code> by the <code>time</code> column ascending</td>
</tr>
</tbody>
</table>
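<p>As a sketch combining these, sort and then fill gaps larger than five minutes (here we assume <code>'locf'</code>, last observation carried forward, is an accepted <code>fill_method</code>; see the docs for the supported methods):</p>
<pre><code class="language-SQL">SELECT device_id, 
	timevector(ts, val) -&gt; sort() -&gt; fill_to('5 minutes'::interval, 'locf') -&gt; unnest()
FROM measurements
GROUP BY device_id;</code></pre>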
<h4 id="lambda-elements">Lambda elements</h4>
<p>These elements use lambda expressions, which allow the user to write small functions to be evaluated over each point in a <code>timevector</code>.<br>
Lambda expressions can return a <code>DOUBLE PRECISION</code> value like <code>$$ $value^2 + $value + 3 $$</code>. They can return a <code>BOOL</code> like <code>$$ $time &gt; '2020-01-01't $$</code>. They can also return a <code>(time, value)</code> pair like <code>$$ ($time + '1 day'i, sin($value) * 4) $$</code>. (<a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/hyperfunctions/function-pipelines/#lambda-elements">Docs link</a>)</p>
<p>You can apply them using the elements below:</p>
<table>
<thead>
<tr>
<th>Element</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>filter(lambda (bool) )</code></td>
<td>Removes points from the <code>timevector</code> where the lambda expression evaluates to <code>false</code></td>
</tr>
<tr>
<td><code>map(lambda (value) )</code></td>
<td>Applies the lambda expression to all the values in the <code>timevector</code></td>
</tr>
<tr>
<td><code>map(lambda (time, value) )</code></td>
<td>Applies the lambda expression to all the times and values in the <code>timevector</code></td>
</tr>
</tbody>
</table>
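<p>Combining the two, here is a sketch that drops sub-zero readings and then shifts each remaining point forward an hour (the threshold and the offset are arbitrary illustrations):</p>
<pre><code class="language-SQL">SELECT device_id, 
	timevector(ts, val) 
    	-&gt; filter($$ $value &gt; 0.0 $$) 
        -&gt; map($$ ($time + '1 hour'i, $value) $$) 
        -&gt; unnest()
FROM measurements
GROUP BY device_id;</code></pre>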
<h3 id="timevector-finalizers"><code>timevector</code> finalizers</h3><p>These elements end the <code>timevector</code> portion of a pipeline: they can either help with output or produce an aggregate over the entire <code>timevector</code>. They are an optimization barrier to composition, as they (usually) produce types other than <code>timevector</code>.</p><p>Example pipelines:</p><pre><code class="language-SQL">SELECT device_id, 
	timevector(ts, val) -&gt; sort() -&gt; delta() -&gt; unnest()
FROM measurements
GROUP BY device_id;</code></pre><pre><code class="language-SQL">SELECT device_id, 
	timevector(ts, val) -&gt; sort() -&gt; delta() -&gt; time_weight()
FROM measurements
GROUP BY device_id;</code></pre><p>Finalizer pipeline elements organized by sub-category:</p><h4 id="timevector-output"><code>timevector</code> output</h4>
<p>These elements help with output, and can produce either a set of <code>(time, value)</code> pairs or a materialized <code>timevector</code>. Note: this is an area where we’d love further feedback. Are there particular data formats that would be especially useful for, say, graphing, that we can add? <a href="https://github.com/timescale/timescaledb-toolkit/issues">File an issue in our GitHub</a>!</p>
<table>
<thead>
<tr>
<th>Element</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>unnest( )</code></td>
<td>Produces a set of <code>(time, value)</code> pairs. You can wrap and expand as a composite type to produce separate columns <code>(pipe -&gt; unnest()).*</code></td>
</tr>
<tr>
<td><code>materialize()</code></td>
<td>Materializes a <code>timevector</code> to pass to an application or other operation directly; blocks any optimizations that would materialize it lazily.</td>
</tr>
</tbody>
</table>
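<p>Following the <code>(pipe -&gt; unnest()).*</code> pattern described above, a sketch that expands the pairs into separate <code>time</code> and <code>value</code> columns:</p>
<pre><code class="language-SQL">SELECT device_id, 
	(timevector(ts, val) -&gt; sort() -&gt; unnest()).*
FROM measurements
GROUP BY device_id;</code></pre>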
<h4 id="timevector-aggregates"><code>timevector</code> aggregates</h4>
<p>Aggregate all the points in a <code>timevector</code> to produce a single value as a result. (<a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/hyperfunctions/function-pipelines/#aggregate-output-elements">Docs link</a>)</p>
<table>
<thead>
<tr>
<th>Element</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>average()</code></td>
<td>Computes the average of the values in the <code>timevector</code></td>
</tr>
<tr>
<td><code>counter_agg()</code></td>
<td>Computes the <code>counter_agg</code> aggregate over the times and values in the <code>timevector</code></td>
</tr>
<tr>
<td><code>stats_agg()</code></td>
<td>Computes a range of statistical aggregates and returns a <code>1DStatsAgg</code> over the values in the <code>timevector</code></td>
</tr>
<tr>
<td><code>sum()</code></td>
<td>Computes the sum of the values in the <code>timevector</code></td>
</tr>
<tr>
<td><code>num_vals()</code></td>
<td>Counts the points in the <code>timevector</code></td>
</tr>
</tbody>
</table>
<h2 id="aggregate-accessors-and-mutators">Aggregate Accessors and Mutators</h2><p>These function pipeline elements act like the accessors that I described in our previous <a href="https://timescale.ghost.io/blog/blog/how-postgresql-aggregation-works-and-how-it-inspired-our-hyperfunctions-design-2/">post on aggregates</a>. You can use them to get a value from the aggregate part of a function pipeline like so:</p><pre><code class="language-SQL">SELECT device_id, 
	timevector(ts, val) -&gt; sort() -&gt; delta() -&gt; stats_agg() -&gt; variance() 
FROM measurements
GROUP BY device_id;
</code></pre><p>But these don’t <em>just</em> work on <code>timevector</code>s; they also work on a normally produced aggregate. </p><p>When used instead of normal function accessors and mutators, they can make the syntax clearer by getting rid of nested functions like:</p><pre><code class="language-SQL">SELECT approx_percentile(0.5, percentile_agg(val)) 
FROM measurements
</code></pre><p>Instead, we can use the arrow accessor to convey the same thing: </p><pre><code class="language-SQL">SELECT percentile_agg(val) -&gt; approx_percentile(0.5) 
FROM measurements
</code></pre><p>By aggregate family:</p><h4 id="counter-aggregates">Counter aggregates</h4>
<p><a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/hyperfunctions/counter-aggregation/counter-aggs/#counter-aggregates">Counter aggregates</a> deal with resetting counters (and were stabilized in our 1.3 release this week!). Counters are a common type of metric in the application performance monitoring and metrics world. All values are adjusted to account for resets. These elements must have a <code>CounterSummary</code> to their left when used in a pipeline, produced by a <code>counter_agg()</code> aggregate or pipeline element. (<a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/hyperfunctions/function-pipelines/#counter-aggregates">Docs link</a>)</p>
<table>
<thead>
<tr>
<th>Element</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>counter_zero_time()</code></td>
<td>The time at which the counter value is predicted to have been zero based on the least squares fit of the points input to the <code>CounterSummary</code> (x-intercept)</td>
</tr>
<tr>
<td><code>corr()</code></td>
<td>The correlation coefficient of the least squares fit line of the adjusted counter value.</td>
</tr>
<tr>
<td><code>delta()</code></td>
<td>Computes the last - first value of the counter</td>
</tr>
<tr>
<td><code>extrapolated_delta(method)</code></td>
<td>Computes the delta extrapolated to the bounds of the range using the provided method. Bounds must have been provided in the aggregate or a <code>with_bounds</code> call</td>
</tr>
<tr>
<td><code>idelta_left()</code> / <code>idelta_right()</code></td>
<td>Computes the instantaneous difference between the second and first points (left) or last and next-to-last points (right)</td>
</tr>
<tr>
<td><code>intercept()</code></td>
<td>The y-intercept of the least squares fit line of the adjusted counter value.</td>
</tr>
<tr>
<td><code>irate_left()</code> / <code>irate_right()</code></td>
<td>Computes the instantaneous rate of change between the second and first points (left) or last and next-to-last points (right)</td>
</tr>
<tr>
<td><code>num_changes()</code></td>
<td>Number of times the counter changed values.</td>
</tr>
<tr>
<td><code>num_elements()</code></td>
<td>Number of points; any with the exact same time will have been counted only once.</td>
</tr>
<tr>
<td><code>num_resets()</code></td>
<td>Number of times the counter reset.</td>
</tr>
<tr>
<td><code>slope()</code></td>
<td>The slope of the least squares fit line of the adjusted counter value.</td>
</tr>
<tr>
<td><code>with_bounds(range)</code></td>
<td>Applies bounds using the <code>range</code> (a <code>TSTZRANGE</code>) to the <code>CounterSummary</code> if they weren’t provided in the aggregation step</td>
</tr>
</tbody>
</table>
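<p>As a sketch, here is the reset-adjusted increase per device over the last day (we reuse the <code>measurements</code> columns purely for illustration; counter aggregates are intended for monotonically increasing values such as request counts):</p>
<pre><code class="language-SQL">SELECT device_id, 
	counter_agg(ts, val) -&gt; delta() 
    	as total_increase
FROM measurements
WHERE ts &gt;= now() - '1 day'::interval
GROUP BY device_id;</code></pre>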
<h4 id="percentile-approximation">Percentile approximation</h4>
<p>These aggregate accessors deal with <a href="https://timescale.ghost.io/blog/blog/how-percentile-approximation-works-and-why-its-more-useful-than-averages/">percentile approximation</a>. For now we’ve only implemented them for <code>percentile_agg</code> and <code>uddsketch</code> based aggregates. We have not yet implemented them for <code>tdigest</code>. (<a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/hyperfunctions/function-pipelines/#percentile-approximation">Docs link</a>)</p>
<table>
<thead>
<tr>
<th>Element</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>approx_percentile(p)</code></td>
<td>The approximate value at percentile <code>p</code></td>
</tr>
<tr>
<td><code>approx_percentile_rank(v)</code></td>
<td>The approximate percentile a value <code>v</code> would fall in</td>
</tr>
<tr>
<td><code>error()</code></td>
<td>The maximum relative error guaranteed by the approximation</td>
</tr>
<tr>
<td><code>mean()</code></td>
<td>The exact average of the input values.</td>
</tr>
<tr>
<td><code>num_vals()</code></td>
<td>The number of input values</td>
</tr>
</tbody>
</table>
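<p>These accessors compose with the arrow syntax shown earlier. A sketch that reports the approximate median alongside the maximum relative error guaranteed by the approximation:</p>
<pre><code class="language-SQL">SELECT percentile_agg(val) -&gt; approx_percentile(0.5) as median,
	percentile_agg(val) -&gt; error() as max_relative_error
FROM measurements;</code></pre>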
<h4 id="statistical-aggregates">Statistical aggregates</h4>
<p>These aggregate accessors add support for common statistical aggregates (and were stabilized in our 1.3 release this week!). They allow you to compute and <code>rollup()</code> common statistical aggregates like <code>average</code> and <code>stddev</code>, more advanced ones like <code>skewness</code>, and 2-dimensional aggregates like <code>slope</code> and <code>covariance</code>. Because there are both 1D and 2D versions of these, the accessors can have multiple forms: for instance, <code>average()</code> calculates the average on a 1D aggregate, while <code>average_y()</code> &amp; <code>average_x()</code> do so on each dimension of a 2D aggregate. (<a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/hyperfunctions/function-pipelines/#statistical-aggregates">Docs link</a>)</p>
<table>
<thead>
<tr>
<th>Element</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>average() / average_y() / average_x()</code></td>
<td>The average of the values.</td>
</tr>
<tr>
<td><code>corr()</code></td>
<td>The correlation coefficient of the least squares fit line.</td>
</tr>
<tr>
<td><code>covariance(method)</code></td>
<td>The covariance of the values using either <code>population</code> or <code>sample</code> method.</td>
</tr>
<tr>
<td><code>determination_coeff()</code></td>
<td>The determination coefficient (aka R squared)  of the values.</td>
</tr>
<tr>
<td><code>kurtosis(method) / kurtosis_y(method) / kurtosis_x(method)</code></td>
<td>The kurtosis (4th moment) of the values using either <code>population</code> or <code>sample</code> method.</td>
</tr>
<tr>
<td><code>intercept()</code></td>
<td>The intercept of the least squares fit line.</td>
</tr>
<tr>
<td><code>num_vals()</code></td>
<td>The number of (non-null) values seen.</td>
</tr>
<tr>
<td><code>sum() / sum_x() / sum_y()</code></td>
<td>The sum of the values seen.</td>
</tr>
<tr>
<td><code>skewness(method) / skewness_y(method) / skewness_x(method)</code></td>
<td>The skewness (3rd moment) of the values using either <code>population</code> or <code>sample</code> method.</td>
</tr>
<tr>
<td><code>slope()</code></td>
<td>The slope of the least squares fit line.</td>
</tr>
<tr>
<td><code>stddev(method) / stddev_y(method) / stddev_x(method)</code></td>
<td>The standard deviation of the values using either <code>population</code> or <code>sample</code> method.</td>
</tr>
<tr>
<td><code>variance(method) / variance_y(method) / variance_x(method)</code></td>
<td>The variance of the values using either <code>population</code> or <code>sample</code> method.</td>
</tr>
<tr>
<td><code>x_intercept()</code></td>
<td>The x intercept of the least squares fit line.</td>
</tr>
</tbody>
</table>
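<p>As a sketch using a one-dimensional aggregate with one of the method-taking accessors (<code>'population'</code> being one of the two methods listed above):</p>
<pre><code class="language-SQL">SELECT device_id, 
	stats_agg(val) -&gt; stddev('population') 
    	as stddev_val
FROM measurements
GROUP BY device_id;</code></pre>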
<h4 id="time-weighted-averages">Time-weighted averages</h4>
<p>(<a href="https://timescale.ghost.io/blog/blog/what-time-weighted-averages-are-and-why-you-should-care/">More info</a>)<br>
The <code>average()</code> accessor may be called on the output of a <code>time_weight()</code> like so:</p>
<pre><code class="language-SQL">SELECT time_weight('Linear', ts, val) -&gt; average() FROM measurements;
</code></pre><h4 id="approximate-count-distinct-hyperloglog">Approximate count distinct (Hyperloglog)</h4>
<p>This is an approximation for distinct counts that was stabilized in our 1.3 release! (<a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/hyperfunctions/approx-count-distincts/">Docs link</a>) The <code>distinct_count()</code> accessor may be called on the output of a <code>hyperloglog()</code> like so:</p>
<pre><code class="language-SQL">SELECT hyperloglog(device_id) -&gt; distinct_count() FROM measurements;
</code></pre><h2 id="next-steps">Next Steps</h2><p>We hope this post helped you understand how function pipelines leverage PostgreSQL extensibility to offer functional programming concepts in a way that is fully PostgreSQL-compliant, and how they can improve the ergonomics of your code, making it easier to write, read, and maintain. </p><p><a href="https://console.cloud.timescale.com/signup"><strong>You can try function pipelines today</strong></a> with a fully managed Timescale Cloud service (no credit card required, free for 30 days). Function pipelines are available now on every new database service on Timescale Cloud, so after you’ve created a new service, you’re all set to use them!</p><p>If you prefer to manage your own database instances, you can <a href="https://github.com/timescale/timescaledb-toolkit">download and install the timescaledb_toolkit extension</a> on GitHub for free, after which you’ll be able to use function pipelines.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How Percentiles Work (and Why They're Better Than Averages)]]></title>
            <description><![CDATA[Get a primer on percentile approximations and why they're useful for analyzing large time-series datasets.]]></description>
            <link>https://www.tigerdata.com/blog/how-percentiles-work-and-why-theyre-better-than-averages</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/how-percentiles-work-and-why-theyre-better-than-averages</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Engineering]]></category>
            <category><![CDATA[Hyperfunctions]]></category>
            <dc:creator><![CDATA[David Kohn]]></dc:creator>
            <pubDate>Tue, 14 Sep 2021 15:41:30 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2021/09/maxim-hopman-fiXLQXAhCfk-unsplash--1-.jpg">
            </media:content>
            <content:encoded><![CDATA[<p>In my recent post on <a href="https://timescale.ghost.io/blog/blog/what-time-weighted-averages-are-and-why-you-should-care/">time-weighted averages</a>, I described how my early career as an electrochemist exposed me to the importance of time-weighted averages, which shaped how we built them into TimescaleDB hyperfunctions. </p><p>A few years ago, soon after I started learning more about PostgreSQL internals (check out my <a href="https://timescale.ghost.io/blog/blog/how-postgresql-aggregation-works-and-how-it-inspired-our-hyperfunctions-design-2/">aggregation and two-step aggregates</a> post to learn about them yourself!), I worked on backends for an ad analytics company, where I started using TimescaleDB.</p><p>Like most companies, we cared a lot about making sure our website and API calls returned results in a reasonable amount of time for the user; we had billions of rows in our analytics databases, but we still wanted to make sure that the website was responsive and useful. </p><p>There’s a direct correlation between website performance and business results: users get bored if they have to wait too long for results, which is obviously not ideal from a business and customer loyalty perspective. To understand how our website performed and find ways to improve, we tracked the timing of our API calls and used API call response time as a key metric.</p><p>Monitoring an API is a common scenario and generally falls under the category of application performance monitoring (APM), but there are lots of similar scenarios in other fields, including:</p><ol><li>Predictive maintenance for industrial machines</li><li>Fleet monitoring for shipping companies</li><li>Energy and water use monitoring and anomaly detection</li></ol><p>Of course, analyzing raw (usually time-series) data only gets you so far. 
You want to analyze trends, understand how your system performs relative to what you and your users expect, catch and fix issues before they impact production users, and so much more. We <a href="https://timescale.ghost.io/blog/blog/introducing-hyperfunctions-new-sql-functions-to-simplify-working-with-time-series-data-in-postgresql/">built TimescaleDB hyperfunctions to help solve this problem and simplify how developers work with time-series data</a>.</p><p>For reference, hyperfunctions are a series of SQL functions that make it easier to manipulate and analyze time-series data in PostgreSQL with fewer lines of code. You can use hyperfunctions to calculate percentile approximations of data, compute time-weighted averages, downsample and smooth data, and perform faster <code>COUNT DISTINCT</code> queries using approximations. </p><p>Moreover, hyperfunctions are “easy” to use: you call a hyperfunction using the same SQL syntax you know and love.</p><p>We spoke with community members to understand their needs, and our initial release includes some of the most frequently requested functions, including <strong>percentile approximations</strong> (see <a href="https://github.com/timescale/timescaledb-toolkit/issues/41">GitHub feature request and discussion</a>). 
</p><p>They’re very useful for working with large time-series data sets because they offer the benefits of using percentiles (rather than averages or other counting statistics) while still being quick and space-efficient to compute, parallelizable, and useful with continuous aggregates and other advanced TimescaleDB features.</p><p><strong>If you’d like to get started with the </strong><a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/"><strong>percentile approximation hyperfunctions</strong></a><strong>—and many more—right away, spin up a fully managed Timescale Cloud service: </strong>create an account to <a href="https://console.cloud.timescale.com/signup">try it for free</a> for 30 days. (Hyperfunctions are pre-loaded on each new database service on Timescale Cloud, so after you create a new service, you’re all set to use them).</p><p><strong>If you prefer to manage your own database instances, you can </strong><a href="https://github.com/timescale/timescaledb-toolkit"><strong>download and install the timescaledb_toolkit extension</strong></a> on GitHub, after which you’ll be able to use percentile approximation and other hyperfunctions.</p><p>Finally, we love building in public and continually improving:</p><ul><li>If you have questions or comments on this blog post, <a href="https://github.com/timescale/timescaledb-toolkit/discussions/185">we’ve started a discussion on our GitHub page, and we’d love to hear from you</a>. And, if you like what you see, GitHub ⭐ are always welcome and appreciated, too!</li><li>You can view our <a href="https://github.com/timescale/timescaledb-toolkit">upcoming roadmap on GitHub</a> for a list of proposed features, as well as features we’re currently implementing and those that are available to use today.</li></ul><h2 id="things-i-forgot-from-7th-grade-math-percentiles-vs-averages">Things I Forgot From 7th Grade Math: Percentiles vs. 
Averages</h2><p>I probably learned about averages, medians, and modes in 7th-grade math class, but if you’re anything like me, they may periodically get lost in the cloud of “things I learned once and thought I knew, but actually, I don’t remember quite as well as I thought.”</p><p>As I was researching this piece, I found a number of good blog posts (see examples from the folks at <a href="https://www.dynatrace.com/news/blog/why-averages-suck-and-percentiles-are-great/">Dynatrace</a>, <a href="https://www.elastic.co/blog/averages-can-dangerous-use-percentile">Elastic</a>, <a href="https://blog.appsignal.com/2018/12/04/dont-be-mean-statistical-means-and-percentiles-101.html">AppSignal</a>, and <a href="https://www.optimizely.com/insights/blog/why-cdn-balancing/">Optimizely</a>) about how averages aren’t great for understanding application performance, or other similar things and why it’s better to use percentiles.</p><p>I won’t spend too long on this, but I think it’s important to provide some background on <em>why</em> and <em>how</em> percentiles can help us better understand our data.</p><p>First off, let’s consider how percentiles and averages are defined. To understand this, let’s start by looking at a <a href="https://en.wikipedia.org/wiki/Normal_distribution"><strong>normal distribution</strong></a><strong>:</strong></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2021/09/Graph-1.jpg" class="kg-image" alt="A graph with a Gaussian/normal distribution, it has a peak at the center, then falls off relatively sharply and symmetrically on either side. 
" loading="lazy" width="2000" height="1182" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2021/09/Graph-1.jpg 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2021/09/Graph-1.jpg 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2021/09/Graph-1.jpg 1600w, https://timescale.ghost.io/blog/content/images/2021/09/Graph-1.jpg 2152w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">A normal, or Gaussian, distribution describes many real-world processes that fall around a given value and where the probability of finding values that are further from the center decreases. The median, average, and mode are all the same for a normal distribution, and they fall on the dotted line at the center.</span></figcaption></figure><p>The normal distribution is what we often think of when we think about statistics; it’s one of the most frequently used distributions and a staple of introductory courses. In a normal distribution, the median, the average (also known as the mean), and the mode are all the same, even though they’re defined differently.</p><p>The <strong>median</strong> is the middle value, where half of the data is above and half is below. The <strong>mean</strong> (aka average) is defined as the sum(value) / count(value), and the <strong>mode </strong>is defined as the most common or frequently occurring value.</p><p>When we’re looking at a curve like this, the x-axis represents the value, while the y-axis represents the frequency with which we see a given value (i.e., values that are “higher” on the y-axis occur more frequently).</p><p>In a normal distribution, we see a curve centered (the dotted line) at its most frequent value, with decreasing probability of seeing values further away from the most frequent one (the most frequent value is the mode). 
Note that the normal distribution is symmetric, which means that values to the left and right of the center have the same probability of occurring.</p><p>The median, or the middle value, is also known as the 50th percentile (the middle percentile out of 100). This is the value at which 50% of the data is less than the value, and 50% is greater than the value (or equal to it).</p><p>In the below graph, half of the data is to the left (shaded in blue), and half is to the right (shaded in yellow), with the 50th percentile directly in the center.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2021/09/Graph-2.jpg" class="kg-image" alt="A normal distribution as in the last graph, except now the left 50% of the graph is shaded in one color, the right in a different color. The dotted line in the center  is labeled median (50th percentile)" loading="lazy" width="2000" height="1182" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2021/09/Graph-2.jpg 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2021/09/Graph-2.jpg 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2021/09/Graph-2.jpg 1600w, https://timescale.ghost.io/blog/content/images/2021/09/Graph-2.jpg 2152w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">A normal distribution with the median/50th percentile depicted.</span></figcaption></figure><p>This leads us to percentiles: a <strong>percentile</strong> is defined as the value where x percent of the data falls below the value. 
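</p><p>To make the definition concrete, here’s a toy query with made-up values that computes a median the brute-force way: sort the data and pick out the middle element. (This is purely illustrative; PostgreSQL’s built-in percentile functions, covered later in this post, are the idiomatic way to do this.)</p><pre><code class="language-SQL">-- Seven made-up values; sorted they are 1, 1, 2, 3, 4, 5, 9,
-- so the 4th value is the middle one: the median, or 50th percentile.
SELECT v AS median
FROM unnest(ARRAY[3, 1, 4, 1, 5, 9, 2]) AS t(v)
ORDER BY v
OFFSET 3 LIMIT 1;</code></pre><p>The same sort-and-pick procedure extends to any percentile: sort, then index to the appropriate fraction of the way through the list.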
</p><p>For example, if we call something “the 10th percentile,” we mean that 10% of the data is less than the value and 90% is greater than (or equal to) the value.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2021/09/Graph-3.jpg" class="kg-image" alt="The same as the last graph except the dotted line has shifted leftward, and now 10% of the area under the graph to the left of the line is shaded one color with 90% shaded a different color. The dotted line has been re-labeled as the 10th percentile." loading="lazy" width="2000" height="1182" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2021/09/Graph-3.jpg 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2021/09/Graph-3.jpg 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2021/09/Graph-3.jpg 1600w, https://timescale.ghost.io/blog/content/images/2021/09/Graph-3.jpg 2152w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">A normal distribution with the 10th percentile depicted.</span></figcaption></figure><p>And the 90th percentile is where 90% of the data is less than the value and 10% is greater:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2021/09/Graph-4.jpg" class="kg-image" alt="The same as the last graph except the line has shifted to the right and is now labeled 90th percentile, 90% of the area under the curve to the left of the line is shaded one color, the 10% to the right of the line is shaded a different color." 
loading="lazy" width="2000" height="1182" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2021/09/Graph-4.jpg 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2021/09/Graph-4.jpg 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2021/09/Graph-4.jpg 1600w, https://timescale.ghost.io/blog/content/images/2021/09/Graph-4.jpg 2152w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">A normal distribution with the 90th percentile depicted.</span></figcaption></figure><p>To calculate the 10th percentile, let’s say we have 10,000 values. We take all of the values, order them from smallest to largest, and identify the 1001st value (where 1000 or 10% of the values are below it), which will be our 10th percentile.</p><p>We noted before that the median and average are the same in a normal distribution. This is because a normal distribution is <em>symmetric</em>. Thus, the points with values larger than the median completely balance the points with values smaller than it, both in number and in magnitude.</p><p>In other words, there is always the same number of points on either side of the median, but the <em>average</em> takes into account the actual value of the points. </p><p>For the median and average to be equal, the points less than the median and greater than the median must have the same distribution (i.e., there must be the same number of points that are somewhat larger and somewhat smaller and much larger and much smaller). (<strong>Correction:</strong> as pointed out to us in<a href="https://news.ycombinator.com/item?id=28527954"> a helpful comment on Hacker News</a>, technically, this is guaranteed only for symmetric distributions; for asymmetric distributions it may or may not hold, and you can even construct odd asymmetric distributions where the two are equal, though such cases are less likely!)</p><p><strong>Why is this important? 
</strong>The fact that the median and average are the same in the normal distribution can cause some confusion. Since a normal distribution is often one of the first things we learn, we (myself included!) can think it applies to more cases than it actually does.</p><p>It’s easy to forget or fail to realize that only the <em>median</em> guarantees that 50% of the values will be above and 50% below—while the average guarantees that 50% of the <strong>weighted</strong> values will be above and 50% below (i.e., the average is the <a href="https://en.wikipedia.org/wiki/Centroid">centroid</a>, while the median is the center).</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2021/09/Graph-5.jpg" class="kg-image" alt="The same graph as the one several graphs ago showing the median/50th percentile with half shaded one color and half another, except now the line is labeled as the average as well as the median. " loading="lazy" width="2000" height="1182" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2021/09/Graph-5.jpg 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2021/09/Graph-5.jpg 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2021/09/Graph-5.jpg 1600w, https://timescale.ghost.io/blog/content/images/2021/09/Graph-5.jpg 2152w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The average and median are the same in a normal distribution, and they split the graph exactly in half. 
But they aren’t calculated the same way, don’t represent the same thing, and aren’t necessarily the same in other distributions.</span></figcaption></figure><p>🙏 Shout out to the folks over at <a href="https://www.desmos.com/">Desmos</a> for their great graphing calculator, which helped make these graphs and even allowed me to make an <a href="https://www.desmos.com/calculator/ty3jt8ftgs">interactive demonstration of these concepts</a>!</p><p>But, to get out of the theoretical, let’s consider something more common in the real world, like the API response time scenario from my work at the ad analytics company.</p><h2 id="why-percentiles-are-better-than-averages-for-understanding-your-data">Why Percentiles Are Better Than Averages for Understanding Your Data</h2><p>We looked at how averages and percentiles are different—and now, we’re going to use a real-world scenario to demonstrate how using averages instead of percentiles can lead to false alarms or missed opportunities. </p><p>Why? Averages don’t always give you enough information to distinguish between real effects and outliers or noise, whereas percentiles can do a much better job.</p><p>Simply put, using averages can have a dramatic (and negative) impact on how values are reported, while percentiles can help you get closer to the “truth.”</p><p>If you’re looking at something like API response time, you’ll likely see a frequency distribution curve that looks something like this:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2021/09/Graph-6.jpg" class="kg-image" alt="A curve with frequency on the vertical axis and response time on the horizontal axis. 
The curve has a relatively sharp peak at the beginning labelled 250 ms, it then falls off quickly before shallowing out into a long tail" loading="lazy" width="1600" height="986" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2021/09/Graph-6.jpg 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2021/09/Graph-6.jpg 1000w, https://timescale.ghost.io/blog/content/images/2021/09/Graph-6.jpg 1600w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">A frequency distribution for API response times with a peak at 250ms (all graphs are not to scale and are meant only for demonstration purposes).</span></figcaption></figure><p>In my former role at the ad analytics company, we’d aim for most of our API response calls to finish in under half a second, and many were much, much shorter than that. When we monitored our API response times, one of the most important things we tried to understand was how users were affected by changes in the code.  </p><p>Most of our API calls finished in under half a second, but some people used the system to get data over very long time periods or had odd configurations that meant their dashboards were a bit less responsive (though we tried to make sure those were rare!). </p><p>The type of curve that resulted is characterized as a <strong>long-tail distribution</strong>, where we have a relatively large spike at 250 ms, with many of our values under that and then an exponentially decreasing number of longer response times.</p><p>Earlier, we talked about symmetric curves (like the normal distribution); a long-tail distribution, by contrast, is an<strong> asymmetric </strong>curve.</p><p>This means that the largest values are much larger than the middle values, while the smallest values aren’t that far from the middle values. 
(In the API monitoring case,  you can never have an API call that takes less than zero seconds to respond, but there’s no limit to how long they can take, so you get that long tail of longer API calls).</p><p>Thus,  the average and the median of a long-tail distribution start to diverge:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2021/09/Graph-7.jpg" class="kg-image" alt="The same curve as last time except now the median and average are labeled. The median is near the peak of the curve while the average is a bit rightward." loading="lazy" width="1600" height="986" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2021/09/Graph-7.jpg 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2021/09/Graph-7.jpg 1000w, https://timescale.ghost.io/blog/content/images/2021/09/Graph-7.jpg 1600w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The API response time frequency curve with the median and average labeled. Graphs are not to scale and are meant for demonstration purposes only.</span></figcaption></figure><p>In this scenario, the average is significantly larger than the median because there are enough “large” values in the long tail to make the average larger. Conversely, in some other cases, the average might be smaller than the median. </p><p>But at the ad analytics company, we found that the average didn’t give us enough information to distinguish between important changes in how our API responded to software changes vs. noise/outliers that only affected a few individuals. </p><p>In one case, we introduced a change to the code that had a new query. The query worked fine in staging, but there was a lot more data in the production system. </p><p>Once the data was “warm” (in memory), it would run quickly, but it was very slow the first time. 
When the query went into production, the response time was well over a second for ~10% of the calls. </p><p>In our frequency curve, a response time over a second (but less than 10s) for ~10% of the calls resulted in a second, smaller hump, which looked like this:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2021/09/Graph-8.jpg" class="kg-image" alt="The same curve as last time except now there is a smaller hump further to the right of the long tail of the original graph, it’s approximately one fifth the height of the original and has approximately one tenth the area. The average and median have both shifted rightward, the average more than the median. The average is nearly inline with the second hump." loading="lazy" width="1600" height="986" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2021/09/Graph-8.jpg 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2021/09/Graph-8.jpg 1000w, https://timescale.ghost.io/blog/content/images/2021/09/Graph-8.jpg 1600w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">A frequency curve showing the shift and extra hump that occurs when 10% of calls take a moderate amount of time, between 1 and 10s (graph still not to scale).</span></figcaption></figure><p>In this scenario, the average shifted a lot, while the median shifted only slightly; it was much less impacted.</p><p>You might think that this makes the average a better metric than the median because it helped us identify the problem (too-long API response times), and we could set up our alerting to notify us when the average shifts.</p><p>Let’s imagine that we’ve done that, and people will jump into action when the average goes above, say, 1 second.</p><p>But now, we have a few users who have started requesting 15 years of data from our UI...and those API calls take a <em>really long time</em>. 
This is because the API wasn’t really built to handle this “off-label” use.</p><p>Just a few calls from these users easily shifted the average way over our 1s threshold.</p><p>Why? The average (as a value) can be dramatically affected by outliers like this, even though they impact only a small fraction of our users. The average uses the sum of the data, so the magnitude of the outliers can have an outsized impact, whereas the median and other percentiles are based on the ordering of the data.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2021/09/Graph-9-1.jpg" class="kg-image" alt="The same curve without the second hump, but with a couple of hash marks on the x axis and some outliers far off to the right. These outliers are circled. The average has shifted far to the right, the same place it was with the hump but the median has remained in the same place as the original curve without hump or outliers." loading="lazy" width="2000" height="916" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2021/09/Graph-9-1.jpg 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2021/09/Graph-9-1.jpg 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2021/09/Graph-9-1.jpg 1600w, https://timescale.ghost.io/blog/content/images/2021/09/Graph-9-1.jpg 2000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Our curve with a few outliers, where less than 1% of the API call responses are over 100s (the response time has a break representing the fact that the outliers would be way to the right otherwise, still, the graph is not to scale).</span></figcaption></figure><p><strong>The point is that the average doesn’t give us a good way to distinguish between outliers and real effects and can give odd results when we have a long-tail or asymmetric distribution.</strong></p><p>Why is this important to understand?</p><p>Well, in the first case, we had 
a problem affecting 10% of our API calls, which could be 10% or more of our users (how could it affect more than 10% of the users? Well, if a user makes 10 calls on average, and 10% of API calls are affected, then, on average, all the users would be affected... or at least some large percentage of them).</p><p>We want to respond very quickly to that type of urgent problem, affecting a large number of users. We built alerts and might even get our engineers up in the middle of the night and/or revert a change.</p><p>But the second case, where “off-label” user behavior or minor bugs had a large effect on a few API calls, was much more benign. Because relatively few users are affected by these outliers, we wouldn’t want to get our engineers up in the middle of the night or revert a change. (Outliers can still be important to identify and understand, both for understanding user needs or potential bugs in the code, but they usually <em>aren’t an emergency</em>).</p><p>Instead of using the average, we can instead use multiple percentiles to understand this type of behavior. Remember, unlike averages, percentiles rely on the <em>ordering</em> of the data rather than being impacted by the <em>magnitude</em> of data. If we use the 90th percentile, we <em>know</em> that 10% of users have values (API response times in our case) greater than it.   </p><p>Let’s look at the 90th percentile in our original graph; it nicely captures some of the long tail behavior:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2021/09/Graph-10.jpg" class="kg-image" alt="Back to the original graph with the median and average, except now the 90th percentile is also drawn in. The 90th percentile is further to the right than the median or average and is around the halfway point of the long tail portion of the curve." 
loading="lazy" width="1600" height="986" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2021/09/Graph-10.jpg 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2021/09/Graph-10.jpg 1000w, https://timescale.ghost.io/blog/content/images/2021/09/Graph-10.jpg 1600w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Our original API response time graph showing the 90th percentile, median, and average. Graph not to scale.</span></figcaption></figure><p>When we have some outliers caused by a few users who’re running super long queries or a bug affecting a small group of queries, the average shifts, but the 90th percentile is hardly affected.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2021/09/Graph-11-1.jpg" class="kg-image" alt="Back to the graph with the outliers, except now the 90th percentile is there as well, it has remained in the same spot as the previous graph, but the average has shifted over to the right well beyond it. " loading="lazy" width="2000" height="916" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2021/09/Graph-11-1.jpg 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2021/09/Graph-11-1.jpg 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2021/09/Graph-11-1.jpg 1600w, https://timescale.ghost.io/blog/content/images/2021/09/Graph-11-1.jpg 2000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Outliers affect the average but don’t impact the 90th percentile or median. 
(Graph is not to scale.)</span></figcaption></figure><p>But, when the tail is increased due to a problem affecting 10% of users, we see that the 90th percentile shifts outward pretty dramatically – which enables our team to be notified and respond appropriately:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2021/09/Graph-12.jpg" class="kg-image" alt="The same graph as the previous one with a smaller hump, except now the 90th percentile is on there as well, it has shifted to the right this time and is further right than the average." loading="lazy" width="1600" height="986" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2021/09/Graph-12.jpg 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2021/09/Graph-12.jpg 1000w, https://timescale.ghost.io/blog/content/images/2021/09/Graph-12.jpg 1600w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">But when there are “real” effects from responses that impact more than 10% of users, the 90th percentile shifts dramatically (Graph not to scale.)</span></figcaption></figure><p>This (hopefully) gives you a better sense of how and why percentiles can help you identify cases where large numbers of users are affected – but not burden you with false positives that might wake engineers up and give them alarm fatigue!</p><p>So, now that we know why we might want to use percentiles rather than averages, let’s talk about how we calculate them.</p><h2 id="how-percentiles-work-in-postgresql">How Percentiles Work in PostgreSQL</h2><p>To calculate any sort of exact percentile, you take <em>all</em> your values, sort them, then find the <em>n</em>th value based on the percentile you’re trying to calculate. </p><p>To see how this works in PostgreSQL, we’ll present a simplified case of our ad analytics company’s API tracking. 
</p><p>We’ll start off with a table like this:</p><pre><code class="language-SQL">CREATE TABLE responses(
	ts timestamptz, 
	response_time DOUBLE PRECISION);</code></pre><p>In PostgreSQL, we can calculate a percentile over the column <code>response_time</code> using the <a href="https://www.postgresql.org/docs/current/functions-aggregate.html#FUNCTIONS-ORDEREDSET-TABLE"><code>percentile_disc</code> aggregate</a>:</p><pre><code class="language-SQL">SELECT 
	percentile_disc(0.5) WITHIN GROUP (ORDER BY response_time) as median
FROM responses;</code></pre><p>This doesn’t look the same as a normal aggregate; the <code>WITHIN GROUP (ORDER BY …)</code> is a different syntax that works on special aggregates called <a href="https://www.postgresql.org/docs/13/xaggr.html#XAGGR-ORDERED-SET-AGGREGATES">ordered-set aggregates</a>.</p><p>Here, we pass in the percentile we want (0.5 or the 50th percentile for the median) to the <code>percentile_disc</code> function, and the column that we’re evaluating (<code>response_time</code>) goes in the <code>ORDER BY</code> clause.</p><p>Why the syntax looks this way becomes clearer once we understand what’s going on under the hood. Percentiles guarantee that x percent of the data will fall below the value they return. To calculate that, we need to sort all of our data in a list and then pick out the value where 50% of the data falls below it and 50% falls above it.</p><p>If you read the section of our previous post on <a href="https://timescale.ghost.io/blog/blog/how-postgresql-aggregation-works-and-how-it-inspired-our-hyperfunctions-design-2/#a-primer-on-postgresql-aggregation-through-pictures">how PostgreSQL aggregates work</a>, you’ll recall how an aggregate like <code>avg</code> works. </p><p>As it scans each row, the transition function updates some internal state (for <code>avg</code> it’s the <code>sum</code> and the <code>count</code>), and then a final function processes the internal state to produce a result (for <code>avg</code>, divide <code>sum</code> by <code>count</code>).</p>
<!--kg-card-begin: html-->
    <video autoplay loop muted playsinline>
      <source id="player" src="https://s3.amazonaws.com/blog.timescale.com/gifs/hyperfunctions-3/avg_bar.mp4" type="video/mp4"></source>
    </video>
    <figcaption class="gif-caption">A GIF showing how the avg is calculated in PostgreSQL with the sum and count as the partial state as rows are processed and a final function that divides them when we’ve finished.
    </figcaption>
<!--kg-card-end: html-->
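<p>To make the transition-function/final-function machinery more tangible, here’s a sketch of how you could hand-roll an average-like aggregate yourself with <code>CREATE AGGREGATE</code>. (The names <code>my_avg</code>, <code>my_avg_trans</code>, and <code>my_avg_final</code> are made up for illustration; this is not how the built-in <code>avg</code> is implemented, but it follows the same pattern.)</p><pre><code class="language-SQL">-- Transition function: the state is a two-element array
-- holding the running sum and the running count.
CREATE FUNCTION my_avg_trans(state double precision[], val double precision)
RETURNS double precision[] AS $$
    SELECT ARRAY[state[1] + val, state[2] + 1];
$$ LANGUAGE SQL STRICT;

-- Final function: divide the accumulated sum by the count.
CREATE FUNCTION my_avg_final(state double precision[])
RETURNS double precision AS $$
    SELECT CASE WHEN state[2] = 0 THEN NULL
                ELSE state[1] / state[2] END;
$$ LANGUAGE SQL;

CREATE AGGREGATE my_avg(double precision) (
    sfunc     = my_avg_trans,
    stype     = double precision[],
    finalfunc = my_avg_final,
    initcond  = '{0,0}'
);

-- SELECT my_avg(response_time) FROM responses;
-- should return the same result as avg(response_time)</code></pre><p>The key point: the state stays tiny and fixed-size no matter how many rows are scanned. Keep that in mind as a contrast with what a percentile calculation has to carry around.</p>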
<p>The ordered set aggregates, like <code>percentile_disc</code>, work somewhat similarly, with one exception: instead of the state being a relatively small fixed-size data structure (like <code>sum</code> and <code>count</code> for <code>avg</code>), it must keep all the values it has processed to sort them and calculate the percentile later. </p><p>Usually, PostgreSQL does this by putting the values into a data structure called a <a href="https://github.com/postgres/postgres/blob/c30f54ad732ca5c8762bb68bbe0f51de9137dd72/src/backend/utils/sort/tuplestore.c"><code>tuplestore</code></a> that stores and sorts values easily. </p><p>Then, when the final function is called, the <code>tuplestore</code> first sorts the data. Then, based on the percentile passed into <code>percentile_disc</code>, it traverses to the correct point (0.5 of the way through the data for the median) in the sorted data and outputs the result.<br></p>
<!--kg-card-begin: html-->
    <video autoplay loop muted playsinline>
      <source id="player" src="https://s3.amazonaws.com/blog.timescale.com/gifs/hyperfunctions-3/percentile_disc.mp4" type="video/mp4"></source>
    </video>
    <figcaption class="gif-caption">With the percentile_disc ordered set aggregate, PostgreSQL has to store each value it sees in a tuplestore; once it has processed all the rows, it sorts them and goes to the right point in the sorted list to extract the percentile we need. 
    </figcaption>
<!--kg-card-end: html-->
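<p>In other words, the whole <code>percentile_disc</code> procedure boils down to “keep everything, sort, index.” Here’s a simplified Python stand-in (an illustration of the semantics only; the real implementation uses a tuplestore that can spill to disk):</p>

```python
import math

def percentile_disc(p, values):
    # The "state" is the entire list of values we've seen.
    ordered = sorted(values)
    # Return the first value whose cumulative fraction of rows >= p
    # (a discrete percentile: always an actual value from the data).
    index = max(math.ceil(p * len(ordered)) - 1, 0)
    return ordered[index]

response_times = [320, 250, 5000, 275, 410, 290, 260, 300, 280, 330]
print(percentile_disc(0.5, response_times))  # 290
```

<p>The sorted list is as large as the input itself, which is exactly why this approach gets expensive at scale.</p>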
<p>Instead of performing these expensive calculations over very large data sets, <strong>many people find that approximate percentile calculations can provide a “close enough” answer with significantly less work</strong>...which is why we introduced percentile approximation hyperfunctions.</p><h2 id="percentile-approximation-what-it-is-and-why-we-use-it-in-hyperfunctions">Percentile Approximation: What It Is and Why We Use It in Hyperfunctions</h2><p>In my experience, people often use averages and other summary statistics more frequently than percentiles because they are significantly “cheaper” to calculate over large datasets, both in computational resources and time.</p><p>As we noted above, calculating the average in PostgreSQL requires only a simple, two-valued aggregate state. Even if we calculate a few additional, related functions like the standard deviation, we still need only a small, fixed number of state values.</p><p>In contrast, to calculate a percentile, we need all of the input values in a sorted list.</p><p>This leads to a few issues:</p><ol><li><strong>Memory footprint</strong>: The algorithm has to keep all of these values somewhere, which means holding them in memory until some of the data must be written to disk to avoid using too much memory (known as “spilling to disk”). This produces a significant memory burden and/or dramatically slows down the operation, because disk accesses are orders of magnitude slower than memory accesses.</li><li><strong>Limited benefits from parallelization</strong>: Even though the algorithm can sort lists in parallel, the benefits from parallelization are limited because it still needs to merge all the sorted lists into a single, sorted list in order to calculate a percentile.</li><li><strong>High network costs</strong>: In distributed systems, all the values must be passed over the network to one node to be made into a single sorted list, which is slow and costly.</li><li><strong>No true partial states</strong>: Materialization of partial states (e.g., for continuous aggregates) is not useful because the partial state is simply all the values that underlie it. This could save on sorting the lists, but the storage burden would be high and the payoff low.</li><li><strong>No streaming algorithm</strong>: For streaming data, this is completely infeasible. You still need to maintain the full list of values (similar to the materialization of partial states problem above), which means that the algorithm essentially needs to store the entire stream!</li></ol><p>All of these issues are manageable when you’re dealing with relatively small datasets, but for high-volume time-series workloads, they quickly become significant.</p><p>But you only need the full list of values if you want <strong><em>exact</em></strong> percentiles. 
<strong>With relatively large datasets, you can often accept some accuracy tradeoffs to avoid running into any of these issues.</strong></p><p>The problems above, and the recognition of the tradeoffs involved in weighing whether to use averages or percentiles, led to the development of multiple algorithms to <a href="https://en.wikipedia.org/wiki/Quantile#Approximate_quantiles_from_a_stream">approximate percentiles in high volume systems.</a> Most percentile approximation approaches involve some sort of modified <a href="https://en.wikipedia.org/wiki/Histogram">histogram</a> to represent the overall shape of the data more compactly, while still capturing much of the shape of the distribution.</p><p>As we were designing hyperfunctions, we thought about how we could capture the benefits of percentiles (e.g., robustness to outliers, better correspondence with real-world impacts) while avoiding some of the pitfalls that come with calculating exact percentiles (above). </p><p>Percentile approximations seemed like the right fit for working with large, time-series datasets.</p><p>The result is a whole family of <a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/">percentile approximation hyperfunctions</a>, built into TimescaleDB. The simplest way to call them is to use the <a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/percentile_agg/"><code>percentile_agg</code> aggregate</a> along with the<a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/approx_percentile/"> <code>approx_percentile</code> accessor</a>.</p><p>This query calculates approximate 10th, 50th, and 90th percentiles:</p><pre><code class="language-SQL">SELECT 
    approx_percentile(0.1, percentile_agg(response_time)) as p10, 
    approx_percentile(0.5, percentile_agg(response_time)) as p50, 
    approx_percentile(0.9, percentile_agg(response_time)) as p90 
FROM responses;
</code></pre><p>(If you’d like to learn more about aggregates, accessors, and two-step aggregation design patterns, check out <a href="https://timescale.ghost.io/blog/blog/how-postgresql-aggregation-works-and-how-it-inspired-our-hyperfunctions-design-2/">our primer on PostgreSQL two-step aggregation</a>.) </p><p>These percentile approximations have many benefits when compared to the normal PostgreSQL exact percentiles, especially when used for large data sets.</p><h3 id="memory-footprint">Memory footprint</h3><p>When calculating percentiles over large data sets, our percentile approximations limit the memory footprint (or need to spill to disk, as described above). </p><p>Standard percentiles create memory pressure since they build up as much of the data set in memory as possible...and then slow down when forced to spill to disk.</p><p>Conversely, hyperfunctions’ percentile approximations have fixed-size representations based on the number of buckets in their modified histograms, so they limit the amount of memory required to calculate them.</p><p>Plus, all of our percentile approximation algorithms are parallelizable, so they can be computed using multiple workers in a single node; this can provide significant speedups because ordered-set aggregates like <code>percentile_disc</code> are not parallelizable in PostgreSQL. </p><h3 id="materialization-in-continuous-aggregates">Materialization in continuous aggregates</h3><p>TimescaleDB includes a feature called <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/continuous-aggregates/">continuous aggregates</a>, designed to make queries on very large datasets run faster. </p><p>TimescaleDB continuous aggregates continuously and incrementally store the results of an aggregation query in the background, so when you run the query, only the data that has changed needs to be computed, not the entire dataset. 
</p><p>Unfortunately, exact percentiles using <code>percentile_disc</code> cannot be stored in continuous aggregates because they cannot be broken down into a partial form and would instead require storing the entire dataset inside the aggregate.</p><p>We designed our percentile approximation algorithms to be usable with continuous aggregates. They have fixed-size partial representations that can be stored and re-aggregated inside continuous aggregates. </p><p>This is a huge advantage compared to exact percentiles because now you can do things like baselining and alerting on longer periods without having to re-calculate from scratch every time. </p><p>Let’s go back to our API response time example and imagine we want to identify recent outliers to investigate potential problems.</p><p>One way to do that would be to look at everything that is, say, above the 99th percentile in the previous hour. </p><p>As a reminder, we have a table:</p><pre><code class="language-SQL">CREATE TABLE responses(
	ts timestamptz, 
	response_time DOUBLE PRECISION);
SELECT create_hypertable('responses', 'ts'); -- make it a hypertable so we can make continuous aggs</code></pre><p>First, we’ll create a one-hour aggregation:</p><pre><code class="language-SQL">CREATE MATERIALIZED VIEW responses_1h_agg
WITH (timescaledb.continuous)
AS SELECT 
    time_bucket('1 hour'::interval, ts) as bucket,
    percentile_agg(response_time)
FROM responses
GROUP BY time_bucket('1 hour'::interval, ts);</code></pre><p>Note that we don’t perform the accessor function in the continuous aggregate; we just perform the aggregation function. </p><p>Now, we can find the data in the last 30&nbsp;s greater than the 99th percentile like so:</p><pre><code class="language-SQL">SELECT * FROM responses 
WHERE ts &gt;= now()-'30s'::interval
AND response_time &gt; (
	SELECT approx_percentile(0.99, percentile_agg)
	FROM responses_1h_agg
	WHERE bucket = time_bucket('1 hour'::interval, now()-'1 hour'::interval)
);
</code></pre><p>At the ad analytics company, we had a lot of users, so we’d have tens or hundreds of thousands of API calls every hour. </p><p>By default, we have 200 buckets in our representation, so we’re getting a large reduction in the amount of data that we store and process by using a continuous aggregate. This translates into significantly faster query response times. If you don’t have as much data, you’ll want to increase the size of your buckets or decrease the fidelity of the approximation to achieve a similarly large reduction in the data you have to process.</p><p>We mentioned that we only performed the aggregate step in the continuous aggregate view definition; we didn’t use our <code>approx_percentile</code> accessor function directly in the view. We do that because we want to be able to use other accessor functions and/or the <a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/rollup-percentile/"><u><code>rollup</code></u> function</a>, which you may remember as one of the main <a href="https://timescale.ghost.io/blog/blog/how-postgresql-aggregation-works-and-how-it-inspired-our-hyperfunctions-design-2/#why-we-use-the-two-step-aggregate-design-pattern">reasons we chose the two-step aggregate approach</a>.</p><p>Let’s look at how that works. We can create a daily rollup and get the 99th percentile like this:</p><pre><code class="language-SQL">SELECT 
	time_bucket('1 day', bucket),
	approx_percentile(0.99, rollup(percentile_agg)) as p_99_daily
FROM responses_1h_agg
GROUP BY 1;</code></pre><p>We could even use the <a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/approx_percentile_rank/"><code>approx_percentile_rank</code></a> accessor function, which tells you what percentile a value would fall into.</p><p>Percentile rank is the inverse of the percentile function; in other words, if normally you ask, what is the value of <em>n</em>th percentile? The answer is a value.</p><p>With percentile rank, you ask what percentile would this value be in? The answer is a percentile.</p><p>So, using <code>approx_percentile_rank</code> allows us to see where the values that arrived in the last 5 minutes rank compared to values in the last day:</p><pre><code class="language-SQL">WITH last_day as (SELECT 
	time_bucket('1 day', bucket),
 	rollup(percentile_agg) as pct_daily
FROM responses_1h_agg
WHERE bucket &gt;= time_bucket('1 day', now()-'1 day'::interval)
GROUP BY 1)

SELECT approx_percentile_rank(response_time, pct_daily) as pct_rank_in_day
FROM responses, last_day
WHERE responses.ts &gt;= now()-'5 minutes'::interval;
</code></pre><p>This is another way continuous aggregates can be valuable.</p><p>We performed a <code>rollup</code> over a day, which just combined 24 partial states, rather than performing a full calculation over 24 hours of data with millions of data points.</p><p>We then used the <code>rollup</code> to see how that impacted just the last few minutes of data, giving us insight into how the last few minutes compare to the last 24 hours. These are just a few examples of how the percentile approximation hyperfunctions can give us some pretty nifty results and allow us to perform complex analysis relatively simply.</p><h2 id="percentile-approximation-deep-dive">Percentile Approximation Deep Dive</h2><p>Some of you may be wondering how TimescaleDB hyperfunctions’ underlying algorithms work, so let’s dive in! (For those of you who don’t want to get into the weeds, feel free to skip over this bit.)</p><h3 id="approximation-methods-and-how-they-work">Approximation methods and how they work</h3><p>We implemented two different percentile approximation algorithms as TimescaleDB hyperfunctions: <a href="https://arxiv.org/pdf/2004.08604.pdf">UDDSketch</a> and <a href="https://github.com/tdunning/t-digest">T-Digest</a>. Each is useful in different scenarios, but first, let’s understand some of the basics of how they work. </p><p>Both use a modified histogram to approximate the shape of a distribution. A histogram buckets nearby values into a group and tracks their frequency.</p><p>You often see a histogram plotted like so:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/pasted-image-0--1--2.png" class="kg-image" alt="A graph similar to the previous graphs of the response time of an API, except now the curve has been replaced by a series of black line graph boxes representing the buckets of a histogram." 
loading="lazy" width="2000" height="821" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/01/pasted-image-0--1--2.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/01/pasted-image-0--1--2.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2022/01/pasted-image-0--1--2.png 1600w, https://timescale.ghost.io/blog/content/images/2022/01/pasted-image-0--1--2.png 2048w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">A histogram representing the same data as our response time frequency curve above; you can see how the shape of the graph is similar to the frequency curve. Not to scale.</span></figcaption></figure><p>If you compare this to the frequency curve we showed above, you can see how this could provide a reasonable approximation of the API response time vs. frequency curve. Essentially, a histogram has a series of bucket boundaries and a count of the number of values that fall within each bucket.</p><p>To calculate an approximate percentile, say the 20th percentile, you first consider the fraction of your total data that would represent it. For the 20th percentile, that would be 0.2 * <code>total_points</code>.</p><p>Once you have that value, you can sum the frequencies in each bucket, left to right, to find the bucket where the cumulative count reaches 0.2 * <code>total_points</code>. </p><p>You can even interpolate between buckets to get more exact approximations when the bucket spans a percentile of interest. </p><p>When you think of a histogram, you may think of one that looks like the one above, where the buckets are all the same width. </p><p>But choosing the bucket width, especially for widely varying data, can get very difficult or lead you to store a lot of extra data. </p><p>In our API response time example, we could have data spanning from tens of milliseconds up to ten seconds or hundreds of seconds. 
</p><p>This means that the right bucket size for a good approximation of the 1st percentile, e.g., 2ms, would be WAY smaller than necessary for a good approximation of the 99th percentile.</p><p>This is why most percentile approximation algorithms use a modified histogram with a <em>variable bucket width</em>. </p><p>For instance, the UDDSketch algorithm uses logarithmically sized buckets, which might look something like this:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/pasted-image-0-6.png" class="kg-image" alt="A graph similar to the last one except the black boxes start smaller than the previous ones and increase in width as you move to the right." loading="lazy" width="2000" height="828" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/01/pasted-image-0-6.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/01/pasted-image-0-6.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2022/01/pasted-image-0-6.png 1600w, https://timescale.ghost.io/blog/content/images/2022/01/pasted-image-0-6.png 2048w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">A modified histogram showing how logarithmic buckets like the UDDSketch algorithm uses can still represent the data. (Note: we’d need to modify the plot to plot the frequency/bucket width so that the scale would remain similar; however, this is just for demonstration purposes and not drawn to scale).</span></figcaption></figure><p>The designers of UDDSketch used a logarithmic bucket size like this because what they care about is the <em>relative error</em>.</p>
<!--kg-card-begin: html-->
<p>For reference, absolute error is defined as the difference between the actual and the approximated value:
\begin{equation}
\text{err}_\text{absolute} = abs(v_\text{actual} - v_\text{approx})
\end{equation}
   </p>
<!--kg-card-end: html-->

<!--kg-card-begin: html-->
<p>To get relative error, you divide the absolute error by the value:
    \begin{equation}
\text{err}_\text{relative} = \frac{\text{err}_\text{absolute}}{   v_\text{actual}}
\end{equation}
</p>
<!--kg-card-end: html-->
<p>If we had a constant absolute error, we might run into a situation like the following: </p><p>We ask for the 99th percentile, and the algorithm tells us it’s 10s +/- 100ms. Then, we ask for the 1st percentile, and the algorithm tells us it’s 10ms +/- 100ms.</p><p>The error for the 1st percentile is way too high!</p><p>If we have a constant relative error instead, then we’d get 10ms +/- 100 microseconds. </p><p>This is much, much more useful. (And going the other way, a constant absolute error of +/- 100 microseconds would be unnecessarily tight at 10s; we don’t really care about 100 microseconds if we’re already at 10 seconds.)</p><p>This is why the UDDSketch algorithm uses logarithmically sized buckets, where the width of each bucket scales with the magnitude of the underlying values. This allows the algorithm to provide constant relative error across the full range of percentiles. </p>
<!--kg-card-begin: html-->
<p>As a result, you always know that the true value of the percentile will fall within the range \([v_\text{approx} (1-err), v_\text{approx} (1+err)]\).</p>
<!--kg-card-end: html-->
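<p>To make the logarithmic-bucket idea concrete, here’s a toy Python sketch in the spirit of the DDSketch/UDDSketch family (our own simplified illustration with invented names, not TimescaleDB’s actual implementation): values land in logarithmically sized buckets, a percentile is read off by summing bucket counts left to right, and every answer stays within the target relative error:</p>

```python
import math
from collections import Counter

class ToyLogSketch:
    """Toy log-bucketed percentile sketch (illustration only)."""

    def __init__(self, alpha=0.01):
        self.alpha = alpha                    # target relative error
        self.gamma = (1 + alpha) / (1 - alpha)
        self.counts = Counter()               # bucket index -> count
        self.total = 0

    def add(self, value):
        # Bucket i covers (gamma^(i-1), gamma^i]; assumes value > 0.
        self.counts[math.ceil(math.log(value, self.gamma))] += 1
        self.total += 1

    def approx_percentile(self, q):
        # Walk buckets left to right, accumulating counts until we
        # reach the target rank, then return that bucket's
        # representative value, which is within alpha relative error
        # of every value the bucket can contain.
        target = q * self.total
        seen = 0
        for i in sorted(self.counts):
            seen += self.counts[i]
            if seen >= target:
                return 2 * self.gamma ** i / (self.gamma + 1)

sketch = ToyLogSketch(alpha=0.01)
for v in [250, 260, 275, 280, 290, 300, 320, 330, 410, 5000]:
    sketch.add(v)
print(sketch.approx_percentile(0.5))  # ~290, within 1% of the exact median
```

<p>Note that the sketch stores only bucket counts (a fixed-size state), that two sketches with the same parameters could be merged by adding their counters (the property that makes rollups possible), and that the error bound is relative, not absolute, exactly as described above.</p>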
<p>On the other hand, T-Digest uses buckets that are variably sized, based on where they fall in the distribution. Specifically, it uses smaller buckets at the extremes of the distribution and larger buckets in the middle.</p><p>So, it might look something like this:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/pasted-image-0--2--1.png" class="kg-image" alt="A graph similar to the previous two except now the black boxes start smaller than the first, increase in width towards the center and then decrease in width as you move to the right edge." loading="lazy" width="2000" height="867" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/01/pasted-image-0--2--1.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/01/pasted-image-0--2--1.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2022/01/pasted-image-0--2--1.png 1600w, https://timescale.ghost.io/blog/content/images/2022/01/pasted-image-0--2--1.png 2048w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">A modified histogram showing how variably sized buckets that are smaller at the extremes, like what the TDigest algorithm uses, can still represent the data (Note: for illustration purposes, not to scale.)</span></figcaption></figure><p>This histogram structure with variable-sized buckets optimizes for different things than UDDSketch. Specifically, it takes advantage of the idea that when you’re trying to understand the distribution, you likely care more about fine distinctions between extreme values than about the middle of the range. 
</p><p>For example, I usually care a lot about distinguishing the 5th percentile from the 1st or the 95th from the 99th, while I don’t care as much about distinguishing between the 50th and the 55th percentile.</p><p>The distinctions in the middle are less meaningful and interesting than the distinctions at the extremes. (Caveat: the TDigest algorithm is a bit more complex than this, and this doesn’t completely capture its behavior, but we’re trying to give a general gist of what’s going on. If you want more information, <a href="https://arxiv.org/abs/1902.04023">we recommend this paper</a>.)</p><h3 id="using-advanced-approximation-methods-in-timescaledb-hyperfunctions">Using advanced approximation methods in TimescaleDB hyperfunctions</h3><p>So far in this post, we’ve only used the general-purpose <code>percentile_agg</code> aggregate. It uses the UDDSketch algorithm under the hood and is a good starting point for most users.</p><p>We’ve also provided separate <a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/percentile-aggregation-methods/uddsketch/"><code>uddsketch</code></a> and <a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/percentile-aggregation-methods/tdigest/"><code>tdigest</code></a> aggregates to allow for more customizability.</p><p>Each takes the number of buckets as its first argument (which determines the size of the internal data structure), and <code>uddsketch</code> also has an argument for the target maximum relative error. </p><p>We can use the normal <code>approx_percentile</code> accessor function just as we did with <code>percentile_agg</code>, so we can compare median estimations like so:</p><pre><code class="language-SQL">SELECT 
	approx_percentile(0.5, uddsketch(200, 0.001, response_time)) as median_udd,
	approx_percentile(0.5, tdigest(200, response_time)) as median_tdig
FROM responses;</code></pre><p>Both of them also work with the <code>approx_percentile_rank</code> hyperfunction we discussed above. </p><p>If we wanted to see where 1000 would fall in our distribution, we could do something like this:</p><pre><code class="language-SQL">SELECT 
	approx_percentile_rank(1000, uddsketch(200, 0.001, response_time)) as rnk_udd,
	approx_percentile_rank(1000, tdigest(200, response_time)) as rnk_tdig
FROM responses;</code></pre><p>In addition, each approximation has some accessors that only work with it, based on its underlying approximation structure.</p><p>For instance, <code>uddsketch</code> provides an <code>error</code> accessor function. This will tell you the actual guaranteed maximum relative error based on the values that the <code>uddsketch</code> saw.</p><p>The UDDSketch algorithm guarantees a maximum relative error, while the T-Digest algorithm does not, so <code>error</code> only works with <code>uddsketch</code> (and <code>percentile_agg</code>, because it uses the <code>uddsketch</code> algorithm under the hood).</p><p>This error guarantee is one of the main reasons we chose it as the default: error guarantees are useful for determining whether you’re getting a good approximation.</p><p><code>Tdigest</code>, on the other hand, provides <code>min_val</code> &amp; <code>max_val</code> accessor functions because it biases its buckets toward the extremes and can provide the exact min and max values at no extra cost. <code>Uddsketch</code> can’t provide that.</p><p>You can call these other accessors like so:</p><pre><code class="language-SQL">SELECT 
	approx_percentile(0.5, uddsketch(200, 0.001, response_time)) as median_udd,
	error(uddsketch(200, 0.001, response_time)) as error_udd,
	approx_percentile(0.5, tdigest(200, response_time)) as median_tdig,
	min_val(tdigest(200, response_time)) as min,
	max_val(tdigest(200, response_time)) as max
FROM responses;</code></pre><p>As we discussed in the last post about <a href="https://timescale.ghost.io/blog/blog/how-postgresql-aggregation-works-and-how-it-inspired-our-hyperfunctions-design-2/#why-we-use-the-two-step-aggregate-design-pattern">two-step aggregates</a>, calls to all of these aggregates are automatically deduplicated and optimized by PostgreSQL so that you can call multiple accessors with minimal extra cost. </p><p>They also both have <a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/rollup-percentile/"><code>rollup</code></a>  functions defined for them, so you can re-aggregate when they’re used in continuous aggregates or regular queries. </p><p>(Note: <code>tdigest</code> rollup can introduce some additional error or differences compared to calling the <code>tdigest</code> on the underlying data directly. In most cases, this should be negligible and would often be comparable to changing the order in which the underlying data was ingested.)</p><p>We’ve provided a few of the tradeoffs and differences between the algorithms here, but we have a <a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/percentile-aggregation-methods/#choosing-the-right-algorithm-for-your-use-case">longer discussion in the docs that can help you choose</a>. 
You can also start with the default <code>percentile_agg</code> and then experiment with different algorithms and parameters on your data to see what works best for your application.</p><h2 id="wrapping-it-up">Wrapping It Up</h2><p>We’ve provided a brief overview of percentiles, how they can be more informative than more common statistical aggregates like average, why percentile approximations exist, and a little bit of how they generally work within TimescaleDB hyperfunctions.</p><p><strong>If you’d like to get started with the </strong><a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/"><strong>percentile approximation hyperfunctions</strong></a><strong>—and many more—right away, spin up a fully managed Timescale Cloud service</strong>: create an account to <a href="https://console.cloud.timescale.com/signup">try it for free</a> for 30 days. (Hyperfunctions are pre-loaded on each new cloud database service on Timescale, so after you create a new service, you’re all set to use them).</p><p><strong>If you prefer to manage your own database instances, you can </strong><a href="https://github.com/timescale/timescaledb-toolkit"><strong>download and install the <code>timescaledb_toolkit</code> extension</strong></a> on GitHub, after which you’ll be able to use percentile approximation and other hyperfunctions.</p><p>We believe time-series data is everywhere, and making sense of it is crucial for all manner of technical problems. We built hyperfunctions to make it easier for developers to harness the power of time-series data. </p><p>We’re always looking for feedback on what to build next and would love to know how you’re using hyperfunctions, problems you want to solve, or things you think should—or could—be simplified to make analyzing time-series data in SQL that much better. 
(To contribute feedback, comment on an <a href="https://github.com/timescale/timescaledb-toolkit/issues">open issue</a> or in a <a href="https://github.com/timescale/timescaledb-toolkit/discussions">discussion thread</a> in GitHub.)</p><h2 id="faqs">FAQs</h2><h3 id="why-are-percentiles-better-than-averages-for-understanding-data">Why are percentiles better than averages for understanding data?</h3><p>Percentiles are more robust to outliers than averages. This is especially true for asymmetric distributions like API response times.&nbsp;</p><p>While an average can be dramatically skewed by a few extreme values (like API calls taking 100+ seconds), percentiles like the 90th percentile only shift when a significant portion of the data changes.&nbsp;</p><p>For example: if 10&nbsp;% of API calls slow down from 250&nbsp;ms to 5 seconds, the 90th percentile will clearly show this problem, while outliers from just a few users won't trigger false alarms.</p><h3 id="how-do-percentile-approximations-work-in-timescaledb">How do percentile approximations work in TimescaleDB?</h3><p>TimescaleDB's percentile approximation hyperfunctions use modified histograms to represent data distribution more compactly while maintaining accuracy.&nbsp;</p><p>These approximations use either UDDSketch (the default) with logarithmically sized buckets or T-Digest with variable-sized buckets that are smaller at distribution extremes.&nbsp;</p><p>You can calculate approximate percentiles using syntax like <code>SELECT approx_percentile(0.9, percentile_agg(response_time)) as p90 FROM responses;</code></p><p>This syntax will use significantly less memory and computation than exact percentiles.</p><h3 id="what-are-the-benefits-of-using-percentile-approximations-for-large-datasets">What are the benefits of using percentile approximations for large datasets?</h3><p>Percentile approximations have a fixed memory footprint regardless of dataset size, unlike exact percentiles, which must store all 
values.&nbsp;</p><p>They’re parallelizable for faster computation and can be used with continuous aggregates for incremental calculation—all while still providing accurate insights with configurable error bounds.&nbsp;</p><p>For example, let’s say you’re tracking hourly API response times using a continuous aggregate with percentile approximations. This method allows you to instantly identify recent outliers compared to historical patterns without recalculating from raw data.</p><h3 id="how-do-i-choose-between-uddsketch-and-t-digest-algorithms">How do I choose between UDDSketch and T-Digest algorithms?</h3><p>UDDSketch (the default in <code>percentile_agg</code>) guarantees a maximum relative error across all percentiles. It’s ideal when you need consistent accuracy across your distribution.&nbsp;</p><p>T-Digest optimizes for accuracy at the extremes of your distribution, making it better when you care more about precisely measuring the 1st or 99th percentiles than the middle ranges.&nbsp;</p><p>You can compare both approaches using <code>SELECT approx_percentile(0.5, uddsketch(200, 0.001, response_time)) as median_udd, approx_percentile(0.5, tdigest(200, response_time)) as median_tdig FROM responses;</code></p><h3 id="how-can-i-use-percentile-approximations-with-continuous-aggregates">How can I use percentile approximations with continuous aggregates?</h3><p>Continuous aggregates allow you to precompute and incrementally update percentile approximations, significantly improving query performance.&nbsp;</p><p><strong>Step 1</strong>: Create a continuous aggregate that stores just the percentile approximation state: <code>CREATE MATERIALIZED VIEW responses_1h_agg WITH (timescaledb.continuous) AS SELECT time_bucket('1 hour'::interval, ts) as bucket, percentile_agg(response_time) FROM responses GROUP BY 1;&nbsp;</code></p><p><strong>Step 2</strong>: Query it using accessor functions like <code>approx_percentile(0.99, percentile_agg)</code> or combine multiple time 
buckets with the rollup function to analyze trends across different time periods.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[What Time-Weighted Averages Are and Why You Should Care]]></title>
            <description><![CDATA[Learn how time-weighted averages are calculated and why they’re so powerful for data analysis.]]></description>
            <link>https://www.tigerdata.com/blog/what-time-weighted-averages-are-and-why-you-should-care</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/what-time-weighted-averages-are-and-why-you-should-care</guid>
            <category><![CDATA[Engineering]]></category>
            <category><![CDATA[General]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[David Kohn]]></dc:creator>
            <pubDate>Thu, 22 Jul 2021 13:04:23 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2021/07/national-cancer-institute-zz_3tCcrk7o-unsplash--1-.jpg">
            </media:content>
            <content:encoded><![CDATA[<p>Many people who work with time-series data have nice, regularly sampled datasets. Data could be sampled every few seconds, or milliseconds, or whatever they choose, but by regularly sampled, we mean the time between data points is basically constant. Computing the average value of data points over a specified time period in a regular dataset is a relatively well-understood query to compose. </p><p>But for those who don't have regularly sampled data, getting a representative average over a period of time can be a complex and time-consuming query to write. <strong>Time-weighted averages are a way to get an unbiased average when you are working with irregularly sampled data</strong>.</p><p>Time-series data comes at you fast, sometimes generating millions of data points per second (<a href="https://timescale.ghost.io/blog/blog/what-the-heck-is-time-series-data-and-why-do-i-need-a-time-series-database-dcf3b1b18563/">read more about time-series data</a>). Because of the sheer volume and rate of information, time-series data can already be complex to query and analyze, which is why we built TimescaleDB, a petabyte-scale, relational database for <a href="https://www.tigerdata.com/blog/time-series-introduction" rel="noreferrer">time series</a>.</p><p>Irregularly sampled time-series data just adds another level of complexity – and is more common than you may think. 
For example, irregularly sampled data, and thus the need for time-weighted averages, frequently occurs in:</p><ul><li><strong>Industrial IoT</strong>, where teams “compress” data by only sending points when the value changes</li><li><strong>Remote sensing</strong>, where sending data back from the edge can be costly, so you only send high-frequency data for the most critical operations</li><li><strong>Trigger-based systems</strong>, where the sampling rate of one sensor is affected by the reading of another (i.e., a security system that sends data more frequently when a motion sensor is triggered)</li><li>...and many, many more</li></ul><p>At Timescale, we’re always looking for ways to make developers’ lives easier, especially when they’re working with time-series data. To this end, <a href="https://timescale.ghost.io/blog/blog/introducing-hyperfunctions-new-sql-functions-to-simplify-working-with-time-series-data-in-postgresql/">we introduced hyperfunctions</a>, new SQL functions that simplify working with time-series data in PostgreSQL. <strong>One of these hyperfunctions enables you to </strong><a href="https://docs.timescale.com/api/latest/hyperfunctions/time-weighted-averages/#time-weighted-average-functions"><strong>compute time-weighted averages</strong></a><strong> quickly and efficiently</strong>, so you gain hours of productivity.</p><p><br>Read on for examples of time-weighted averages, how they’re calculated, how to use the time-weighted averages hyperfunctions in TimescaleDB, and some ideas for how you can use them to get a productivity boost for your projects, no matter the domain.</p><p><strong>If you’d like to get started with the <code>time_weight</code> hyperfunction—and many more—right away, spin up a fully managed Timescale service</strong>: create an account to <a href="https://console.cloud.timescale.com/signup">try it for free</a> for 30 days. 
Hyperfunctions are pre-loaded on each new database service on Timescale, so after you create a new service, you’re all set to use them.</p><p><strong>If you prefer to manage your own database instances, you can </strong><a href="https://github.com/timescale/timescaledb-toolkit"><strong>download and install the timescaledb_toolkit extension</strong></a> on GitHub, after which you’ll be able to use <code>time_weight</code> and other hyperfunctions.</p><p>Finally, we love building in public and continually improving:</p><ul><li>If you have questions or comments on this blog post, <a href="https://github.com/timescale/timescaledb-toolkit/discussions/185">we’ve started a discussion on our GitHub page, and we’d love to hear from you</a>. (And, if you like what you see, GitHub ⭐ are always welcome and appreciated too!)</li><li>You can view our <a href="https://github.com/timescale/timescaledb-toolkit">upcoming roadmap on GitHub</a> for a list of proposed features, as well as features we’re currently implementing and those that are available to use today.</li></ul><h2 id="what-are-time-weighted-averages">What are time-weighted averages?</h2>
<p>I’ve been a developer at Timescale for over 3 years and worked in databases for about 5 years, but I was an electrochemist before that. As an electrochemist, I worked for a battery manufacturer and saw a lot of charts like these:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2021/08/image.png" class="kg-image" alt="Battery discharge curve showing cell voltage on the y-axis and capacity in amp-hours on the x-axis. The curve starts high, decreases relatively rapidly through the exponential zone, then stays relatively constant for a long period through the nominal zone, after which the voltage drops quite rapidly as it reaches its fully discharged state. " loading="lazy" width="744" height="426" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2021/08/image.png 600w, https://timescale.ghost.io/blog/content/images/2021/08/image.png 744w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Example battery discharge curve, which describes how long a battery can power something. (Also a prime example of where time-weighted averages are 💯 necessary) Derived from </em></i><a href="https://www.nrel.gov/docs/fy17osti/67809.pdf"><i><em class="italic" style="white-space: pre-wrap;">https://www.nrel.gov/docs/fy17osti/67809.pdf</em></i></a></figcaption></figure><p>That’s a battery discharge curve, which describes how long a battery can power something. The x-axis shows capacity in Amp-hours, and since this is a constant current discharge, the x-axis is really just a proxy for time. The y-axis displays voltage, which determines the battery’s power output; as you continue to discharge the battery, the voltage drops until it gets to a point where it needs to be recharged.</p><p>When we’d do R&amp;D for new battery formulations, we’d cycle many batteries many times to figure out which formulations make batteries last the longest. 
</p><p>If you look more closely at the discharge curve, you’ll notice that there are only two “interesting” sections:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2021/08/image-1.png" class="kg-image" alt="The same battery discharge curve as in the previous image but with the “interesting bits” circled, namely where the voltage decreases rapidly at the beginning and the end of the discharge curve. " loading="lazy" width="757" height="447" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2021/08/image-1.png 600w, https://timescale.ghost.io/blog/content/images/2021/08/image-1.png 757w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Example battery discharge curve, calling out the “interesting bits” (the points in time where data changes rapidly)</em></i></figcaption></figure><p>These are the parts at the beginning and end of the discharge where the voltage changes rapidly. Between these two sections, there’s that long period in the middle, where the voltage hardly changes at all:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2021/08/image-2.png" class="kg-image" alt="The same battery discharge curve again, except now the “boring” part of the curve is highlighted, which is the middle section where the voltage hardly changes." 
loading="lazy" width="749" height="449" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2021/08/image-2.png 600w, https://timescale.ghost.io/blog/content/images/2021/08/image-2.png 749w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Example battery discharge curve, calling out the “boring bits” (the points in time where the data remains fairly constant)</em></i></figcaption></figure><p>Now, when I said before that I was an electrochemist, I will admit that I was exaggerating a little bit. I knew enough about electrochemistry to be dangerous, but I worked with folks with PhDs who knew <em>a lot</em> more than I did. </p><p>But, I was often better than them at working with data, so I’d do things like programming the <a href="https://en.wikipedia.org/wiki/Potentiostat">potentiostat</a>, the piece of equipment you hook the battery up to in order to perform these tests. </p><p>For the interesting parts of the discharge cycle (those parts at the start and end), we could have the potentiostat sample at its max rate, usually a point every 10 milliseconds or so. We didn’t want to sample as many data points during the long, boring parts where the voltage didn’t change because it would mean saving lots of data with unchanging values and wasting storage.</p><p>To reduce the boring data we’d have to deal with without losing the interesting bits, we’d set up the program to sample every 3 minutes, or when the voltage changed by a reasonable amount, say more than 5 mV. </p><p>In practice, what would happen is something like this:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2021/08/image-3.png" class="kg-image" alt="The same battery discharge curve again, this time with data points superimposed on the image. 
The data points are spaced close together in the “interesting bits,” where the voltage changes quickly at the beginning and end of the discharge curve. The data points are spaced further apart during the “boring” part in the middle, where the voltage hardly changes at all. " loading="lazy" width="755" height="403" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2021/08/image-3.png 600w, https://timescale.ghost.io/blog/content/images/2021/08/image-3.png 755w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Example battery discharge curve with data points superimposed to depict rapid sampling during the interesting bits and slower sampling during the boring bits.</em></i></figcaption></figure><p>By sampling the data in this way, we'd get more data during the interesting parts and less data during the boring middle section. That’s great!</p><p>It let us answer more interesting questions about the quickly changing parts of the curve and gave us all the information we needed about the slowly changing sections – without storing gobs of redundant data.<strong> But, here’s a question: given this dataset, how do we find the average voltage during the discharge?</strong></p><p>That question is important because it was one of the things we could compare between this discharge curve and future ones, say 10 or 100 cycles later. As a battery ages, its average voltage drops, and how much it dropped over time could tell us how well the battery’s storage capacity held up during its lifecycle – and if it could turn into a useful product. 
</p><p>The problem is that the data in the interesting bits are sampled more frequently (i.e., there are more data points for the interesting bits), which would give them more weight when calculating the average, even though they shouldn't count extra.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2021/08/image-4.png" class="kg-image" alt="The same battery curve again, with the same data points superimposed and the “interesting bits” circled again, however this time noting that the “interesting bits” shouldn’t count extra even though there are more data points included in the circled area." loading="lazy" width="754" height="449" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2021/08/image-4.png 600w, https://timescale.ghost.io/blog/content/images/2021/08/image-4.png 754w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Example battery discharge curve, with illustrative data points to show that while we collect more data during the interesting bits, they shouldn’t count “extra.”</em></i></figcaption></figure><p>If we just took a naive average over the whole curve, adding the value at each point and dividing by the number of points, it would mean that a change to our sampling rate could change our calculated average...even though the underlying effect was really the same! </p><p>We could easily overlook any of the differences we were trying to identify – and any clues about how we could improve the batteries could just get lost in the variation of our sampling protocol. </p><p>Now, some people will say: well, why not just sample at the max rate of the potentiostat, even during the boring parts? Well, these discharge tests ran <em>really</em> long. They’d take 10 to 12 hours to complete, but the interesting bits could be pretty short, lasting anywhere from seconds to minutes. 
If we sampled at the highest rate, one point every 10 ms or so, it would mean orders of magnitude more data to store even though we would hardly use any of it! And orders of magnitude more data would mean more cost, more time spent on analysis, and all sorts of other problems.</p><p>So the big question is: <strong>how do we get a representative average when we’re working with irregularly spaced data points?</strong></p><p>Let’s get theoretical for a moment here:</p><p>(This next bit is a little equation-heavy, but I think they’re <em>relatively</em> simple equations, and they map very well onto their graphical representation. I always like it when folks give me the math and graphical intuition behind the calculations – but if you want to skip ahead to just see how time-weighted averages are used, the mathy bits end <a href="https://timescale.ghost.io/blog/blog/what-time-weighted-averages-are-and-why-you-should-care/#how-to-compute-time-weighted-averages-in-sql">here</a>.)</p><h2 id="mathy-bits-how-to-derive-a-time-weighted-average">Mathy Bits: How to derive a time-weighted average</h2>
<p>Let’s say we have some points like this:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/pasted-image-0-11.png" class="kg-image" alt="A graph showing value on the y-axis and time on the x-axis. There are four points:  open parens t 1 comma v 1 close parens to open parens t 4 comma  v 4 close parens spaced unevenly in time on the graph. " loading="lazy" width="1380" height="1144" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/01/pasted-image-0-11.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/01/pasted-image-0-11.png 1000w, https://timescale.ghost.io/blog/content/images/2022/01/pasted-image-0-11.png 1380w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">A theoretical, irregularly sampled time-series dataset</em></i></figcaption></figure><p>Then, the normal average would be the sum of the values, divided by the total number of points:</p><p>\begin{equation} avg = \frac{(v_1 + v_2 + v_3 + v_4)}{4}  \end{equation}</p>
<p>But, because they’re irregularly spaced, we need some way to account for that. </p><p>One way to think about it would be to get a value at every point in time, and then divide it by the total amount of time. This would be like getting the total area under the curve and dividing by the total amount of time ΔT.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/pasted-image-0--3-.png" class="kg-image" alt="The same graph as above but with the area under the curve shaded in gray. The area under the curve is drawn by drawing a line through each pair of points and then shading down to the x-axis. The total time spanned by the points from t 1 to t 4 is denoted as Delta T." loading="lazy" width="1348" height="1122" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/01/pasted-image-0--3-.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/01/pasted-image-0--3-.png 1000w, https://timescale.ghost.io/blog/content/images/2022/01/pasted-image-0--3-.png 1348w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">The area under an irregularly sampled time-series dataset</em></i></figcaption></figure><p>\begin{equation} better\_avg = \frac{area\_under\_curve}{\Delta T}  \end{equation}</p>
<p>(In this case, we’re doing a linear interpolation between the points). So, let’s focus on finding that area. The area between the first two points is a trapezoid:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/pasted-image-0--1--6.png" class="kg-image" alt="The same graph as above, except there is a trapezoid shaded in blue bounded on top by the line connecting the first two points and vertical lines connecting the points to the x-axis. The distance between the two points on the x-axis is denoted delta t 1." loading="lazy" width="1182" height="1008" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/01/pasted-image-0--1--6.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/01/pasted-image-0--1--6.png 1000w, https://timescale.ghost.io/blog/content/images/2022/01/pasted-image-0--1--6.png 1182w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">A trapezoid representing the area under the first two points</em></i></figcaption></figure><p>Which is really a rectangle plus a triangle:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/pasted-image-0--4-.png" class="kg-image" alt="The same graph as the previous, except now the trapezoid, has been divided into a rectangle and a triangle. The rectangle is the height of the first point v 1. The triangle is a right triangle with the line connecting the first two points as the hypotenuse. The distance on the y-axis between the first two points is denoted as delta v 1. 
" loading="lazy" width="1180" height="972" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/01/pasted-image-0--4-.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/01/pasted-image-0--4-.png 1000w, https://timescale.ghost.io/blog/content/images/2022/01/pasted-image-0--4-.png 1180w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">That same trapezoid broken down into a rectangle and a triangle.</em></i></figcaption></figure><p>Okay, let’s calculate that area:</p><p>\begin{equation} area = \Delta t_1 v_1 + \frac{\Delta t_1 \Delta v_1}{2}  \end{equation}</p>
<p>So just to be clear, that’s:</p><math xmlns="http://www.w3.org/1998/Math/MathML" display="block">
  <semantics>
    <mrow>
      <mi>a</mi>
      <mi>r</mi>
      <mi>e</mi>
      <mi>a</mi>
      <mo>=</mo>
      <munder>
        <mrow data-mjx-texclass="OP">
          <munder>
            <mrow>
              <mi mathvariant="normal">Δ</mi>
              <msub>
                <mi>t</mi>
                <mn>1</mn>
              </msub>
              <msub>
                <mi>v</mi>
                <mn>1</mn>
              </msub>
            </mrow>
            <mo>⏟</mo>
          </munder>
        </mrow>
        <mtext>area of rectangle</mtext>
      </munder>
      <mo>+</mo>
      <munder>
        <mrow data-mjx-texclass="OP">
          <munder>
            <mfrac>
              <mrow>
                <mi mathvariant="normal">Δ</mi>
                <msub>
                  <mi>t</mi>
                  <mn>1</mn>
                </msub>
                <mi mathvariant="normal">Δ</mi>
                <msub>
                  <mi>v</mi>
                  <mn>1</mn>
                </msub>
              </mrow>
              <mn>2</mn>
            </mfrac>
            <mo>⏟</mo>
          </munder>
        </mrow>
        <mtext>area of triangle</mtext>
      </munder>
    </mrow>
    <annotation encoding="application/x-tex">\begin{equation} area = \underbrace{\Delta t_1 v_1}_\text{area of rectangle} + \underbrace{\frac{\Delta t_1 \Delta v_1}{2}}_\text{area of triangle}  \end{equation}</annotation>
  </semantics>
</math><p>Okay. So now if we notice that:</p><p>\begin{equation} \Delta v_1 = v_2 - v_1      \end{equation}</p>
<p>We can simplify this equation pretty nicely:</p><p>Start with:<br>
\begin{equation} \Delta t_1 v_1 + \frac{\Delta t_1 (v_2 - v_1)}{2}  \end{equation}</p>
<p>Factor out: \begin{equation}(\frac{\Delta t_1}{2} ) \end{equation}</p>
<p>\begin{equation}  \frac{\Delta t_1}{2} (2v_1 + (v_2 - v_1))    \end{equation}</p>
<p>Simplify:<br>
\begin{equation}  \frac{\Delta t_1}{2} (v_1 + v_2)    \end{equation}</p>
<p>One cool thing to note is that this gives us a new way to think about this solution: it’s the average of each pair of adjacent values, weighted by the time between them:</p><p>\begin{equation}  area = \underbrace{\frac{(v_1 + v_2)}{2}}_{\text{average of    } v_1 \text{ &amp; } v_2} \Delta t_1   \end{equation}</p>
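<p>If you want to convince yourself that the simplification holds, here’s a quick numeric check (a throwaway Python sketch with made-up sample values, spaced five minutes apart):</p>
<pre><code class="language-python"># Two samples (time in seconds, value), 300 seconds apart.
t1, v1 = 0.0, 4.0
t2, v2 = 300.0, 5.5

dt = t2 - t1
dv = v2 - v1

rect_plus_triangle = dt * v1 + dt * dv / 2  # rectangle + triangle form
pair_average_form = (v1 + v2) / 2 * dt      # simplified, time-weighted form

print(rect_plus_triangle, pair_average_form)  # 1425.0 1425.0
</code></pre>
<p>Both forms compute the same area, exactly as the algebra promises.</p>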
<p>It’s also equal to the area of the rectangle drawn to the midpoint between v1 and v2:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/pasted-image-0--2--2.png" class="kg-image" alt="The same graph as the previous, except that now there is a rectangle imposed on the trapezoid. The rectangle is the same width as the others and goes to a height of v 1 plus v 2 over 2. " loading="lazy" width="1198" height="948" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/01/pasted-image-0--2--2.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/01/pasted-image-0--2--2.png 1000w, https://timescale.ghost.io/blog/content/images/2022/01/pasted-image-0--2--2.png 1198w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">The area of the trapezoid and of the rectangle, drawn to the midpoint between the two points, is the same.</em></i></figcaption></figure><p>Now that we’ve derived the formula for two adjacent points, we can repeat this for every pair of adjacent points in the dataset. Then all we need to do is sum that up, and that will be the time-weighted sum, which is equal to the area under the curve. (Folks who have studied calculus may actually remember some of this from when they were learning about integrals and integral approximations!) </p><p>With the total area under the curve calculated,  all we have to do is divide the time-weighted sum by the overall  ΔT and we have our time-weighted average. 💥</p><p>Now that we've worked through our time-weighted average in theory, let’s test it out in SQL.</p><h2 id="how-to-compute-time-weighted-averages-in-sql">How to compute time-weighted averages in SQL</h2>
<p>Let’s consider the scenario of an ice cream manufacturer or shop owner who is monitoring their freezers. It turns out that ice cream needs to stay in a relatively narrow range of temperatures (~0-10℉)<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup> so that it doesn’t melt and re-freeze, causing those weird crystals that no one likes. Similarly, if ice cream gets too cold, it’s too hard to scoop.</p>
<p>The air temperature in the freezer will vary a bit more dramatically as folks open and close the door, but the ice cream temperature takes longer to change. Thus, problems (melting, pesky ice crystals) will only happen if it's exposed to extreme temperatures for a prolonged period. By measuring this data, the ice cream manufacturer can impose quality controls on each batch of product they’re storing in the freezer.</p>
<p>Taking this into account, the sensors in the freezer measure temperature in the following way: when the door is closed and we’re in the optimal range, the sensors take a measurement every 5 minutes; when the door is opened, the sensors take a measurement every 30 seconds until the door is closed, and the temperature has returned below 10℉.</p>
<p>To model that we might have a simple table like this:</p>
<pre><code class="language-SQL">CREATE TABLE freezer_temps (
	freezer_id int,
	ts timestamptz,
	temperature float);
</code></pre>
<p>And some data like this:</p>
<pre><code class="language-SQL">INSERT INTO freezer_temps VALUES 
( 1, '2020-01-01 00:00:00+00', 4.0), 
( 1, '2020-01-01 00:05:00+00', 5.5), 
( 1, '2020-01-01 00:10:00+00', 3.0), 
( 1, '2020-01-01 00:15:00+00', 4.0), 
( 1, '2020-01-01 00:20:00+00', 3.5), 
( 1, '2020-01-01 00:25:00+00', 8.0), 
( 1, '2020-01-01 00:30:00+00', 9.0), 
( 1, '2020-01-01 00:31:00+00', 10.5), -- door opened!
( 1, '2020-01-01 00:31:30+00', 11.0), 
( 1, '2020-01-01 00:32:00+00', 15.0), 
( 1, '2020-01-01 00:32:30+00', 20.0), -- door closed
( 1, '2020-01-01 00:33:00+00', 18.5), 
( 1, '2020-01-01 00:33:30+00', 17.0), 
( 1, '2020-01-01 00:34:00+00', 15.5), 
( 1, '2020-01-01 00:34:30+00', 14.0), 
( 1, '2020-01-01 00:35:00+00', 12.5), 
( 1, '2020-01-01 00:35:30+00', 11.0), 
( 1, '2020-01-01 00:36:00+00', 10.0), -- temperature stabilized
( 1, '2020-01-01 00:40:00+00', 7.0),
( 1, '2020-01-01 00:45:00+00', 5.0);
</code></pre>
<p>The period after the door opens, minutes 31-36, has a lot more data points. If we were to take the average of all the points, we would get a misleading value. The freezer was only above the threshold temperature for 5 out of 45 minutes (11% of the time period), but those minutes make up 10 out of 20 data points (50%!) because we sample freezer temperature more frequently after the door is opened.</p>
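<p>Before we write the SQL, we can sanity-check what the answer <em>should</em> be. Here’s a quick Python sketch (just the trapezoid formula we derived earlier, applied to the sample data above; it’s an illustration, not how TimescaleDB computes it):</p>
<pre><code class="language-python"># The readings from the INSERT above, as (seconds from start, °F) pairs.
readings = [
    (0, 4.0), (300, 5.5), (600, 3.0), (900, 4.0), (1200, 3.5),
    (1500, 8.0), (1800, 9.0), (1860, 10.5), (1890, 11.0),
    (1920, 15.0), (1950, 20.0), (1980, 18.5), (2010, 17.0),
    (2040, 15.5), (2070, 14.0), (2100, 12.5), (2130, 11.0),
    (2160, 10.0), (2400, 7.0), (2700, 5.0),
]

def time_weighted_average(points):
    """Trapezoid area between each adjacent pair, divided by total time."""
    area = sum((v1 + v2) / 2 * (t2 - t1)
               for (t1, v1), (t2, v2) in zip(points, points[1:]))
    return area / (points[-1][0] - points[0][0])

naive_avg = sum(v for _, v in readings) / len(readings)
print(naive_avg)                        # 10.2
print(time_weighted_average(readings))  # ~6.6361
</code></pre>
<p>The naive average lands at 10.2℉ (outside our ~0-10℉ comfort zone!), while the time-weighted average comes out around 6.64℉, which is exactly what the SQL should return.</p>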
<p>To find the more accurate, time-weighted average temperature, let’s write the SQL for the formula we derived above. We’ll also compute the normal average for comparison’s sake. (Don’t worry if you have trouble reading it; we’ll write a much simpler version later.)</p>
<pre><code class="language-SQL">WITH setup AS (
	SELECT lag(temperature) OVER (PARTITION BY freezer_id ORDER BY ts) as prev_temp, 
		extract('epoch' FROM ts) as ts_e, 
		extract('epoch' FROM lag(ts) OVER (PARTITION BY freezer_id ORDER BY ts)) as prev_ts_e, 
		* 
	FROM  freezer_temps), 
nextstep AS (
	SELECT CASE WHEN prev_temp is NULL THEN NULL 
		ELSE (prev_temp + temperature) / 2 * (ts_e - prev_ts_e) END as weighted_sum, 
		* 
	FROM setup)
SELECT freezer_id,
    avg(temperature), -- the regular average
	sum(weighted_sum) / (max(ts_e) - min(ts_e)) as time_weighted_average -- our derived average
FROM nextstep
GROUP BY freezer_id;
</code></pre>
<pre><code class="language-SQL"> freezer_id |  avg  | time_weighted_average 
------------+-------+-----------------------
          1 | 10.2  |     6.636111111111111
</code></pre>
<p>It does return what we want, and gives us a much better picture of what happened, but it’s not exactly fun to write, is it?</p>
<p>We’ve got a few window functions in there, some case statements to deal with nulls, and several CTEs to try to make it reasonably clear what’s going on. <strong>This is the kind of thing that can really lead to code maintenance issues when people try to figure out what’s going on and tweak it.</strong></p>
<p>Code is all about managing complexity. A long, complex query to accomplish a relatively simple task makes it much less likely that the developer who comes along next (i.e., you in three months) will understand what’s going on, how to use it, or how to change it if they (or you!) need a different result. Or, worse, it means that the code will never get changed because people don’t quite understand what the query’s doing, and it just becomes a black box that no one wants to touch (including you).</p>
<h2 id="timescaledb-hyperfunctions-to-the-rescue">TimescaleDB hyperfunctions to the rescue!</h2>
<p>This is why we created <strong><a href="https://docs.timescale.com/api/latest/hyperfunctions/">hyperfunctions</a></strong>, to make complicated time-series data analysis less complex. Let’s look at what the time-weighted average freezer temperature query looks like if we use the <a href="https://docs.timescale.com/api/latest/hyperfunctions/time-weighted-averages/">hyperfunctions for computing time-weighted averages</a>:</p>
<pre><code class="language-SQL">SELECT freezer_id, 
	avg(temperature), 
	average(time_weight('Linear', ts, temperature)) as time_weighted_average 
FROM freezer_temps
GROUP BY freezer_id;
</code></pre>
<pre><code class="language-SQL"> freezer_id |  avg  | time_weighted_average 
------------+-------+-----------------------
          1 | 10.2  |     6.636111111111111
</code></pre>
<p>Isn’t that so much more concise?! Calculate a <a href="https://docs.timescale.com/api/latest/hyperfunctions/time-weighted-averages/time_weight/"><code>time_weight</code></a> with a <code>'Linear'</code> weighting method (that’s the kind of weighting derived above <sup class="footnote-ref"><a href="#fn2" id="fnref2">[2]</a></sup>), then take the average of the weighted values, and we’re done. I like that API much better (and I’d better, because I designed it!).</p>
<p>What’s more, not only do we save ourselves from writing all that SQL, but it also becomes far, far easier to <strong>compose</strong> (build up more complex analyses over top of the time-weighted average). This is a huge part of the design philosophy behind hyperfunctions; we want to make fundamental things simple so that you can easily use them to build more complex, application-specific analyses.</p>
<p>Let’s imagine we’re not satisfied with the average over our entire dataset, and we want to get the time-weighted average for every 10-minute bucket:</p>
<pre><code class="language-SQL">SELECT time_bucket('10 mins'::interval, ts) as bucket, 
	freezer_id, 
	avg(temperature), 
	average(time_weight('Linear', ts, temperature)) as time_weighted_average 
FROM freezer_temps
GROUP BY bucket, freezer_id;
</code></pre>
<p>We added a <a href="https://docs.timescale.com/api/latest/hyperfunctions/time_bucket/"><code>time_bucket</code></a>, grouped by it, and done! Let’s look at some other kinds of sophisticated analysis that hyperfunctions enable.</p>
<p>Continuing with our ice cream example, let’s say that we’ve set our threshold because we know that if the ice cream spends more than 15 minutes above 15 ℉, it’ll develop those ice crystals that make it all sandy/grainy tasting. We can use the time-weighted average in a <a href="https://www.postgresql.org/docs/current/functions-window.html">window function</a> to see if that happened:</p>
<pre><code class="language-SQL">SELECT *, 
    average(time_weight('Linear', ts, temperature) OVER fifteen_min) as rolling_twa
FROM freezer_temps
WINDOW fifteen_min AS 
    (PARTITION BY freezer_id ORDER BY ts RANGE '15 minutes'::interval PRECEDING)
ORDER BY freezer_id, ts;
</code></pre>
<pre><code class="language-SQL">
 freezer_id |           ts           | temperature |    rolling_twa     
------------+------------------------+-------------+--------------------
          1 | 2020-01-01 00:00:00+00 |           4 |                   
          1 | 2020-01-01 00:05:00+00 |         5.5 |               4.75
          1 | 2020-01-01 00:10:00+00 |           3 |                4.5
          1 | 2020-01-01 00:15:00+00 |           4 |  4.166666666666667
          1 | 2020-01-01 00:20:00+00 |         3.5 | 3.8333333333333335
          1 | 2020-01-01 00:25:00+00 |           8 |  4.333333333333333
          1 | 2020-01-01 00:30:00+00 |           9 |                  6
          1 | 2020-01-01 00:31:00+00 |        10.5 |  7.363636363636363
          1 | 2020-01-01 00:31:30+00 |          11 |  7.510869565217392
          1 | 2020-01-01 00:32:00+00 |          15 |  7.739583333333333
          1 | 2020-01-01 00:32:30+00 |          20 |               8.13
          1 | 2020-01-01 00:33:00+00 |        18.5 |  8.557692307692308
          1 | 2020-01-01 00:33:30+00 |          17 |  8.898148148148149
          1 | 2020-01-01 00:34:00+00 |        15.5 |  9.160714285714286
          1 | 2020-01-01 00:34:30+00 |          14 |   9.35344827586207
          1 | 2020-01-01 00:35:00+00 |        12.5 |  9.483333333333333
          1 | 2020-01-01 00:35:30+00 |          11 | 11.369047619047619
          1 | 2020-01-01 00:36:00+00 |          10 | 11.329545454545455
          1 | 2020-01-01 00:40:00+00 |           7 |             10.575
          1 | 2020-01-01 00:45:00+00 |           5 |  9.741666666666667
</code></pre>
<p>The window here is over the previous 15 minutes, ordered by time. And it looks like we stayed below our ice-crystallization temperature!</p>
<p>We also provide a special <a href="https://docs.timescale.com/api/latest/hyperfunctions/time-weighted-averages/rollup-timeweight/"><code>rollup</code></a> function so you can re-aggregate time-weighted values from subqueries. For instance:</p>
<pre><code class="language-SQL">SELECT average(rollup(time_weight)) as time_weighted_average 
FROM (SELECT time_bucket('10 mins'::interval, ts) as bucket, 
		freezer_id, 
		time_weight('Linear', ts, temperature)
	FROM freezer_temps
	GROUP BY bucket, freezer_id) t;
</code></pre>
<pre><code class="language-SQL">time_weighted_average 
-----------------------
    6.636111111111111
</code></pre>
<p>This gives us the same output as the grand total from our first query because we’re just re-aggregating the bucketed values.</p>
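<p>The reason re-aggregation can work at all is that <code>time_weight</code> produces an aggregate state rather than a finished number: roughly, an accumulated weighted area plus the first and last points seen, so partial states can be merged (bridging the gap between them) before the final division. A hand-wavy Python sketch of that idea, with hypothetical data and a much simpler state than the toolkit actually stores:</p>
<pre><code class="language-python">from datetime import datetime, timedelta

def trapezoid(p0, p1):
    (t0, v0), (t1, v1) = p0, p1
    return (v0 + v1) / 2 * (t1 - t0).total_seconds()

def time_weight(points):
    """Partial aggregate state: (accumulated area, first point, last point)."""
    points = sorted(points)
    area = sum(trapezoid(a, b) for a, b in zip(points, points[1:]))
    return (area, points[0], points[-1])

def rollup(s1, s2):
    """Merge two partial states (s1 earlier than s2), bridging the gap between them."""
    area1, first1, last1 = s1
    area2, first2, last2 = s2
    return (area1 + trapezoid(last1, first2) + area2, first1, last2)

def average(state):
    area, first, last = state
    return area / (last[0] - first[0]).total_seconds()

base = datetime(2020, 1, 1)
pts = [(base + timedelta(minutes=m), v)
       for m, v in [(0, 4.0), (5, 5.5), (12, 3.0), (15, 4.0), (22, 6.0)]]

# aggregate two halves separately, then roll the partial states up
partial_a, partial_b = time_weight(pts[:3]), time_weight(pts[3:])
print(average(rollup(partial_a, partial_b)))  # equals the average over all points
print(average(time_weight(pts)))
</code></pre>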
<p>But this is mainly there so that you can do more interesting analysis, like, say, normalizing each ten-minute time-weighted average by freezer to the overall time-weighted average.</p>
<pre><code class="language-SQL">WITH t as (SELECT time_bucket('10 mins'::interval, ts) as bucket, 
		freezer_id, 
		time_weight('Linear', ts, temperature)
	FROM freezer_temps
	GROUP BY bucket, freezer_id) 
SELECT bucket, 
	freezer_id, 
	average(time_weight) as bucketed_twa,  
	(SELECT average(rollup(time_weight)) FROM t) as overall_twa, 
	average(time_weight) / (SELECT average(rollup(time_weight)) FROM t) as normalized_twa
FROM t; 
</code></pre>
<p>This kind of feature (storing the time-weight for analysis later) is most useful in a <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/continuous-aggregates/">continuous aggregate</a>, and it just so happens that we’ve designed our time-weighted average to be usable in that context!</p>
<p>We’ll be going into more detail on that in a future post, so be sure to <a href="https://www.timescale.com/signup/newsletter">subscribe to our newsletter</a> so you can get notified when we publish new technical content.</p>
<h2 id="try-time-weighted-averages-today">Try time-weighted averages today</h2>
<p><strong>If you’d like to get started with the time_weight hyperfunction - and many more - right away, spin up a fully managed TimescaleDB service:</strong> create an account to <a href="http://console.cloud.timescale.com/signup">try it for free</a> for 30 days. Hyperfunctions are pre-loaded on each new database service on Timescale, so after you create a new service, you’re all set to use them!</p>
<p><strong>If you prefer to manage your own database instances, you can <a href="https://github.com/timescale/timescaledb-toolkit">download and install the timescaledb_toolkit extension</a></strong> on GitHub, after which you’ll be able to use time_weight and all other hyperfunctions.</p>
<p><strong>If you have questions or comments on this blog post, <a href="https://github.com/timescale/timescaledb-toolkit/discussions/185">we’ve started a discussion on our GitHub page, and we’d love to hear from you</a>.</strong> (And, if you like what you see, GitHub ⭐ are always welcome and appreciated too!)</p>
<ul>
<li>We love building in public, and you can view our <a href="https://github.com/timescale/timescaledb-toolkit">upcoming roadmap on GitHub</a> for a list of proposed features, features we’re currently implementing, and features available to use today.</li>
</ul>
<p>We’d like to give a special thanks to <a href="https://github.com/inselbuch">@inselbuch</a>, who <a href="https://github.com/timescale/timescaledb-toolkit/issues/46">submitted the GitHub issue</a> that got us started on this project (as well as the other folks who 👍’d it and let us know they wanted to use it.)</p>
<p>We believe time-series data is everywhere, and making sense of it is crucial for all manner of technical problems. We built hyperfunctions to make it easier for developers to harness the power of time-series data. We’re always looking for feedback on what to build next and would love to know how you’re using hyperfunctions, problems you want to solve, or things you think should - or could - be simplified to make analyzing time-series data in SQL that much better. (To contribute feedback, comment on an <a href="https://github.com/timescale/timescaledb-toolkit/issues">open issue</a> or in a <a href="https://github.com/timescale/timescaledb-toolkit/discussions">discussion thread</a> in GitHub.)</p>
<p>Lastly, in future posts, we’ll give some more context around our design philosophy, explain the decisions we’ve made around our APIs for time-weighted averages (and other features), and detail how other hyperfunctions work. So, if that’s your bag, you’re in luck – but you’ll have to wait a week or two.</p>
<hr class="footnotes-sep">
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p>I don’t know that these times or temperatures are accurate per se; however, the phenomenon of ice cream partially melting and refreezing causing larger ice crystals to form - and coarsening the ice cream as a result - is well documented. See, for instance, <a href="https://peoplegetreadybooks.com/?q=h.tviewer&amp;using_sb=status&amp;qsb=keyword&amp;qse=OqerFF92q0vIs_NOprdwmw">Harold McGee’s On Food And Cooking</a> (p 44 in the 2004 revised edition). So, just in case you are looking for advice on storing your ice cream from a blog about time-series databases: for longer-term storage, you would likely want the ice cream to be stored below 0℉. Our example is more like a scenario you’d see in an ice cream display (e.g., in an ice cream parlor or factory line) since the ice cream is kept between 0-10℉ (ideal for scooping, because lower temperatures make ice cream too hard to scoop). <a href="#fnref1" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn2" class="footnote-item"><p>We also offer <code>'LOCF'</code> or last-observation-carried-forward weighting, which is best suited to cases where you record data points whenever the value changes (i.e., the old value is valid until you get a new one.)</p>
<p>The derivation for that is similar, except the rectangles have the height of the first value, rather than the linear weighting we’ve discussed in this post (i.e., where we do linear interpolation between adjacent data points):</p>
<p><img src="https://timescale.ghost.io/blog/content/images/2021/07/LOCF-Weighting.png" alt="LOCF-Weighting" loading="lazy"><br>
Rather than:<br>
<img src="https://timescale.ghost.io/blog/content/images/2021/07/Linear-Weighting.png" alt="Linear-Weighting" loading="lazy"></p>
<p>In general, linear weighting is appropriate for cases where the sampling rate is variable, but there are no guarantees provided by the system about only providing data when it changes. LOCF works best when there’s some guarantee that your system will provide data only when it changes, and you can accurately carry the old value until you receive a new one. <a href="#fnref2" class="footnote-backref">↩︎</a></p>
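<p>A small Python comparison makes the difference concrete; the series here is hypothetical, and the function is just an illustration of the two weightings, not our implementation:</p>
<pre><code class="language-python">def weighted_average(points, method):
    """Time-weighted average of (seconds, value) pairs, sorted by time.
    'Linear' interpolates between points; 'LOCF' carries the old value forward."""
    area = 0.0
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        if method == 'LOCF':
            area += v0 * (t1 - t0)             # step function: hold v0 until t1
        else:
            area += (v0 + v1) / 2 * (t1 - t0)  # trapezoid between the two points
    return area / (points[-1][0] - points[0][0])

# a long quiet stretch followed by a jump: the two weightings disagree
series = [(0, 10.0), (60, 10.0), (600, 20.0), (630, 12.0)]
print(weighted_average(series, 'Linear'))
print(weighted_average(series, 'LOCF'))
</code></pre>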
</li>
</ol>
</section>
]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Introducing Hyperfunctions: New SQL Functions to Simplify Working With Time-Series Data in PostgreSQL]]></title>
            <description><![CDATA[TimescaleDB hyperfunctions are pre-built functions for the most common and difficult queries that developers write today in TimescaleDB and PostgreSQL. Hyperfunctions help developers measure what matters in time-series data, which generates massive, ever-growing streams of information.]]></description>
            <link>https://www.tigerdata.com/blog/introducing-hyperfunctions-new-sql-functions-to-simplify-working-with-time-series-data-in-postgresql</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/introducing-hyperfunctions-new-sql-functions-to-simplify-working-with-time-series-data-in-postgresql</guid>
            <category><![CDATA[Announcements & Releases]]></category>
            <category><![CDATA[Engineering]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Joshua Lockerman]]></dc:creator>
            <pubDate>Tue, 13 Jul 2021 13:02:11 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2021/07/ryan-stone-OlxJVn9fxz4-unsplash.jpg">
            </media:content>
            <content:encoded><![CDATA[<p></p><p>Today, we’re excited to launch <strong>TimescaleDB hyperfunctions</strong>, a series of SQL functions within TimescaleDB that make it easier to manipulate and analyze time-series data in PostgreSQL with fewer lines of code. You can use hyperfunctions to calculate percentile approximations of data, compute time-weighted averages, downsample and smooth data, and perform faster <code>COUNT DISTINCT</code> queries using approximations. Moreover, hyperfunctions are “easy” to use: you call a hyperfunction using the same SQL syntax you know and love.</p>
<p>At Timescale, our mission is to <a href="https://www.timescale.com/products" rel="noreferrer">enable every software developer to store, analyze, and build on top of their time-series data</a> so that they can measure what matters in their world: IoT devices, IT systems, marketing analytics, user behavior, financial metrics, and more.</p><p>We made the decision early in the design of TimescaleDB to build on top of PostgreSQL. We believed then, as we do now, that building on the <a href="https://db-engines.com/en/blog_post/85">world’s fastest-growing database</a> would have numerous benefits for our customers. Perhaps the biggest of these advantages is in developer productivity. Developers can use the tools and frameworks they know and love and bring all their skills and expertise with SQL with them.</p><p>SQL is a powerful language and we believe that by adding a specialized set of functions for time-series analysis, we can make it even better.</p><p>Today, there are nearly three million active TimescaleDB databases running mission-critical time-series workloads across industries. Time-series data comes at you fast, sometimes generating millions of data points per second. In order to measure everything that matters, you need to capture all of the data you possibly can. Because of the volume and rate of information, time-series data can be complex to query and analyze. </p><p>As we interviewed customers and learned how they analyze and manipulate time-series data, we noticed several common queries begin to take shape. Often, these queries were difficult to compose in standard SQL. TimescaleDB hyperfunctions are a series of SQL functions to address the most common, and often most difficult, queries developers write today. 
We made the decision to take the hard path ourselves so that we could give developers an easier path.</p><h3 id="hyperfunctions-included-in-this-initial-release">Hyperfunctions included in this initial release</h3><p>Today, we’re releasing several hyperfunctions, including:</p><ul>
<li><strong>Time-Weighted Average</strong> allows you to take the average over an irregularly spaced dataset that only includes changepoints.</li>
<li><strong>Percentile-Approximation</strong> brings percentile analysis to more workflows. When used with <a href="https://timescale.ghost.io/blog/blog/continuous-aggregates-faster-queries-with-automatically-maintained-materialized-views/">continuous aggregates</a>, you can compute percentiles over any time range of your dataset in near real-time and use them for baselining and normalizing incoming data. For maximum control, we provide implementations of two different approximation algorithms:
<ul>
<li><strong>Uddsketch</strong> gives formal guarantees to the accuracy of approximate percentiles, in exchange for always returning a range of possible values.</li>
<li><strong>T-Digest</strong> gives fuzzier guarantees which allow it to be more precise at the extremes of the distribution.</li>
</ul>
</li>
<li><strong>Hyperloglog</strong> enables faster approximate <code>COUNT DISTINCT</code>, making it easier to track how the cardinality of your data changes over time.</li>
<li><strong>Counter Aggregate</strong> enables working with counters in an ergonomic SQL-native manner.</li>
<li><strong>ASAP Smoothing</strong> smooths datasets to bring out the most important features when graphed.</li>
<li><strong>Largest Triangle Three Buckets Downsampling</strong> reduces the number of elements in a dataset while retaining important features when graphed.</li>
<li><strong>Stats-agg</strong> makes using rolling, cumulative and normalized statistics as easy as their standard counterparts.</li>
</ul>
<p>Note that Hyperfunctions work on TimescaleDB <a href="https://docs.timescale.com/api/latest/hypertable/">hypertables</a>, as well as regular PostgreSQL tables.</p><h3 id="new-sql-functions-not-new-sql-syntax">New SQL functions, not new SQL syntax</h3><p>We made the decision to create <strong>new SQL functions</strong> for each of the time-series analysis and manipulation capabilities above. This stands in contrast to other efforts which aim to improve the developer experience by introducing new SQL <em>syntax</em>. </p><p>While introducing new syntax with new keywords and new constructs may have been easier from an implementation perspective, we made the deliberate decision not to do so since we believe that it actually leads to a worse experience for the end-user. </p><p>New SQL syntax means that existing drivers, libraries, and tools may no longer work. This can leave developers with more problems than solutions as their favorite tools, libraries, or drivers may not support the new syntax, or may require time-consuming modifications to do so. </p><p>On the other hand, new SQL functions mean that your query will run in every visualization tool, database admin tool, or data analysis tool. We have the freedom to create custom functions, aggregates, and procedures that help developers better understand and work with their data, <strong>and </strong>ensure all their drivers and interfaces still work as expected.</p><h3 id="hyperfunctions-are-written-in-rust">Hyperfunctions are written in Rust</h3><p>Rust was our language of choice for developing the new hyperfunctions. We chose it for its superior productivity, community, and the <a href="https://github.com/zombodb/pgx">pgx software development kit</a>. We felt Rust was a more friendly language for a project like ours and would encourage more community contributions. </p><p>The inherent safety of Rust means we could focus more time on feature development rather than worrying about how the code is written. 
The extensive Rust community (💗 <a href="http://crates.io">crates.io</a>), along with excellent package-management tools, means we can use off-the-shelf solutions for common problems, leaving us more time to focus on the uncommon ones. </p><p>On the topic of community, we found the Rust community to be one of the friendliest on the internet, and its commitment to open source, open communication, and good documentation makes it an utter joy to work with. Libraries such as <a href="https://serde.rs/">Serde</a> and <a href="https://github.com/BurntSushi/quickcheck">quickcheck</a> make common tasks a breeze and let us focus on the code that’s novel to our project, instead of writing boilerplate that's already been written by thousands of others. </p><p>We’d like to shout out ZomboDB’s <a href="https://github.com/zombodb/pgx">pgx</a>, an SDK for building <a href="https://www.tigerdata.com/blog/top-8-postgresql-extensions" rel="noreferrer">Postgres extensions</a> using Rust. Pgx provides tools to generate extension scripts from Rust files and bind Rust functions to Postgres functions, as well as tools to set up, run, and test PostgreSQL instances. (For us, it’s been an amazing tool and experience with incredible benefits – we estimate that pgx has reduced our workload by at least one-third!)</p><h3 id="next-steps">Next steps</h3><p>In the rest of this post, we detail why we chose to build new SQL functions (not new SQL syntax), and explore each hyperfunction and its example usage.</p><p>But <strong>if you’d like to get started with hyperfunctions right away, the easiest way to do so is with a fully-managed TimescaleDB service</strong>. <a href="https://console.cloud.timescale.com/signup">Try it for free</a> (no credit card required) for 30 days. 
Hyperfunctions are pre-loaded on each new database service on Timescale, so after you’ve created a new service, you’re all set to use them!</p><p><strong>If you prefer to manage your own database instances, you can </strong><a href="https://github.com/timescale/timescaledb-toolkit"><strong>download and install the timescaledb_toolkit extension</strong></a> on GitHub for free, after which you’ll be able to use all the hyperfunctions listed above. </p><p>Finally, we love building in public. You can view our <a href="https://github.com/timescale/timescaledb-toolkit">upcoming roadmap on GitHub</a> for a list of proposed features, as well as features we’re currently implementing and those that are available to use today. </p><p>We also welcome feedback from the community (it helps us prioritize the features users really want). To contribute feedback, comment on an <a href="https://github.com/timescale/timescaledb-toolkit/issues">open issue</a> or in a <a href="https://github.com/timescale/timescaledb-toolkit/discussions">discussion thread</a> in GitHub.</p><p>To learn more about hyperfunctions, please continue reading.</p><h2 id="building-new-sql-functions-instead-of-reinventing-syntax">Building new SQL functions instead of reinventing syntax</h2><p>SQL is the <a href="https://insights.stackoverflow.com/survey/2020#most-popular-technologies">third most popular programming language in the world</a>. It’s the language known and loved by many software developers, data scientists, and business analysts the world over, and it's a big reason we chose to build TimescaleDB on top of PostgreSQL in the first place.</p><p>Similarly, we choose to make our APIs user-friendly without breaking full SQL compatibility. This means we can create custom functions, aggregates, and procedures but no new syntax - and all the drivers and interfaces can still work. 
You get the peace of mind that your query will run in every visualization tool, database admin tool, or data analysis tool that speaks SQL.</p><p>SQL is powerful and it’s even <a href="http://blog.coelho.net/database/2013/08/17/turing-sql-1.html">Turing complete</a>, so you can technically do anything with it. But that doesn’t mean you’d want to 😉. Our hyperfunctions are designed to make complex analysis and time-series manipulation in SQL simpler, without undermining the guarantees of full SQL compatibility. We’ve spent a large amount of our time on design: prototyping and writing out different names and calling conventions for clarity and ease of use. </p><p>Our guiding philosophy is to make simple things easy and complex things possible. We enable things that <em>feel</em> like they should be straightforward, like using a single function call to calculate a time-weighted average of a single item over a time period. We also enable operations that would otherwise be prohibitively expensive (in terms of complexity to write) or would previously take too long to respond to be useful, like calculating a rolling time-weighted average of each item normalized to the monthly average of the whole group of things.</p><p>For example, we’ve implemented a default for percentile approximation called <code>percentile_agg</code> that should work for most users, while also exposing the lower-level <code>UDDsketch</code> and <code>tdigest</code> implementations for users who want to have more control and get into the weeds.</p>
<p>Another advantage of using SQL functions rather than new syntax is that we bring your code closer to your data, rather than forcing you to take your data to your code. Simply put, you can now perform more sophisticated analysis and manipulation operations on your data right inside your database, rather than creating data pipelines to funnel data into Python or other analysis libraries to conduct analysis there. </p><p>We want to make the more complex analysis simpler and easier in the database not just because we want to build a good product, but also because it’s far, far more efficient to do your analysis as close to the data as possible, and then get aggregated or other simpler results that get passed back to the user. </p><p>This is because the network transmission step is often the slowest and most expensive part of many calculations, and because the serialization and deserialization overhead can be very large as you get to large datasets. So by making these functions and all sorts of analysis simpler to perform in the database, nearer to the data, developers save time and money.</p><p>Moreover, while you could perform some of the complex analysis enabled by hyperfunctions in other languages inside the database (e.g., programs in Python or R), hyperfunctions now enable you to perform such sophisticated time-series analysis and manipulation in SQL right in your query statements, making them more accessible.</p><h2 id="hyperfunctions-released-today">Hyperfunctions released today</h2><p>Hyperfunctions refactor some of the most gnarly SQL queries for time-series data into concise, elegant functions that feel natural to any developer that knows SQL. 
Let’s walk through the hyperfunctions we’re releasing today and the ones that will be available soon.</p><p>Back in January, <a href="https://timescale.ghost.io/blog/blog/time-series-analytics-for-postgresql-introducing-the-timescale-analytics-project/">when we launched our initial hyperfunctions release</a>, we asked for feedback and input from the community. We want this to be a community-driven project, so for our 1.0 release, we’ve prioritized several features requested by community members. We’ll have a brief overview here, with a technical deep dive into each family of functions in a series of separate blog posts in the coming weeks.</p><p><strong>Time-weighted averages</strong></p><p>Time-series averages can be complicated to calculate; generally, you need to determine how long each value has been recorded in order to know how much to weigh them. While doing this in native SQL is <em>possible</em>, it is extremely error-prone and unwieldy. More damningly, the SQL needed would not work in every context. In particular, it would not work in TimescaleDB’s automatically refreshing materialized views, <a href="https://timescale.ghost.io/blog/blog/continuous-aggregates-faster-queries-with-automatically-maintained-materialized-views/">continuous aggregates</a>, so users who wanted to calculate time-weighted averages over multiple time intervals would be forced to rescan the entire dataset for each average so calculated. Our time-weighted average hyperfunction removes this complexity and can be used in continuous aggregates to make multi-interval time-weighted averages as cheap as summing a few sub-averages.</p><p>Here’s an example of using time-weighted averages for an IoT use case, specifically to find the average temperature in a set of freezers over time. 
(Notice how it takes sixteen lines of complex code to find the time-weighted average using regular SQL, compared to just five lines with a single <code>SELECT</code> statement when using the TimescaleDB hyperfunction):</p>
<p><strong>Time-weighted average using TimescaleDB hyperfunction</strong></p><pre><code class="language-SQL">SELECT freezer_id, 
	avg(temperature), 
	average(time_weight('Linear', ts, temperature)) as time_weighted_average 
FROM freezer_temps
GROUP BY freezer_id;
</code></pre>
<pre><code class="language-output"> freezer_id |  avg  | time_weighted_average 
------------+-------+-----------------------
          1 | 10.35 |     6.802777777777778
</code></pre>
<p><strong>Time-weighted average using regular SQL</strong></p><pre><code class="language-SQL">WITH setup AS (
	SELECT lag(temperature) OVER (PARTITION BY freezer_id ORDER BY ts) as prev_temp, 
		extract('epoch' FROM ts) as ts_e, 
		extract('epoch' FROM lag(ts) OVER (PARTITION BY freezer_id ORDER BY ts)) as prev_ts_e, 
		* 
	FROM  freezer_temps), 
nextstep AS (
	SELECT CASE WHEN prev_temp is NULL THEN NULL 
		ELSE (prev_temp + temperature) / 2 * (ts_e - prev_ts_e) END as weighted_sum, 
		* 
	FROM setup)
SELECT freezer_id, 
	avg(temperature),
	sum(weighted_sum) / (max(ts_e) - min(ts_e)) as time_weighted_average 
FROM nextstep
GROUP BY freezer_id;
</code></pre>
<pre><code class="language-output"> freezer_id |  avg  | time_weighted_average 
------------+-------+-----------------------
          1 | 10.35 |     6.802777777777778
</code></pre>
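<p>To see what all that SQL is doing, note that the <code>lag()</code>-based version is just the trapezoidal rule: each row contributes <code>(prev_temp + temperature) / 2</code> weighted by the elapsed seconds. A quick Python rendering of the same arithmetic on hypothetical, irregularly spaced samples (just the math, not our implementation) shows why the plain average and the time-weighted average disagree:</p>
<pre><code class="language-python">def plain_average(samples):
    return sum(v for _, v in samples) / len(samples)

def time_weighted_average(samples):
    """Mirror of the lag()-based SQL over (seconds, value) pairs."""
    samples = sorted(samples)
    weighted_sum = sum((v0 + v1) / 2 * (t1 - t0)
                       for (t0, v0), (t1, v1) in zip(samples, samples[1:]))
    return weighted_sum / (samples[-1][0] - samples[0][0])

# a burst of readings during a brief warm spell skews the
# plain average, but barely moves the time-weighted one
samples = [(0, 4.0), (300, 5.0), (310, 15.0), (320, 16.0), (330, 15.0), (900, 4.0)]
print(plain_average(samples))
print(time_weighted_average(samples))
</code></pre>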
<p><strong>Percentile approximation (UDDsketch &amp; TDigest)</strong></p><p>Aggregate statistics are useful when you know the underlying distribution of your data, but for other cases, they <a href="https://en.wikipedia.org/wiki/Anscombe%27s_quartet">can be misleading</a>. For cases where they don’t work, and for more exploratory analyses looking at the ground truth, <a href="https://en.wikipedia.org/wiki/Percentile">percentiles</a> are useful. </p><p>As useful as it is, percentile analysis comes with one major downside: it needs to store the entire dataset in memory. This means that such analysis is only feasible for relatively small datasets, and even then can take longer than ideal to calculate. </p><p>The approximate-percentile hyperfunctions we’ve implemented suffer from neither of these problems: they take constant storage, and, when combined with automatically refreshing materialized views, they can produce results nearly instantaneously. This performance improvement opens up opportunities to use percentile analysis for use cases and datasets where it was previously unfeasible.</p><p>Here’s an example of using percentile approximation for a DevOps use case, where we alert on response times that are over the 95th percentile:</p><pre><code class="language-SQL">WITH "95th percentile" as (
    SELECT approx_percentile(0.95, percentile_agg(response_time)) as threshold
    FROM response_times
)
SELECT count(*) 
FROM response_times, "95th percentile"
WHERE response_time &gt; "95th percentile".threshold;
</code></pre>
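<p>For a flavor of how a sketch gets away with constant storage, here’s a toy Python version of the log-bucketing idea behind UDDsketch: values are counted into exponentially sized buckets, which bounds the relative error of any reported percentile. (This is a drastically simplified illustration, not our UDDsketch implementation.)</p>
<pre><code class="language-python">import math
from collections import Counter

class ToyPercentileSketch:
    """Toy log-bucketed percentile sketch with bounded relative error (alpha)."""
    def __init__(self, alpha=0.01):
        self.gamma = (1 + alpha) / (1 - alpha)
        self.buckets = Counter()
        self.count = 0

    def add(self, value):  # positive values only, for simplicity
        self.buckets[math.ceil(math.log(value, self.gamma))] += 1
        self.count += 1

    def percentile(self, q):
        rank = q * self.count
        seen = 0
        for key in sorted(self.buckets):
            seen += self.buckets[key]
            if seen >= rank:
                # report the midpoint of the bucket's value range
                return 2 * self.gamma ** key / (self.gamma + 1)

sketch = ToyPercentileSketch()
data = [1 + (i * i) % 997 for i in range(10_000)]  # hypothetical response times
for v in data:
    sketch.add(v)

exact = sorted(data)[int(0.95 * len(data))]
print(exact, round(sketch.percentile(0.95), 1))
</code></pre>
<p>The sketch never stores individual values, only a handful of bucket counters, yet the estimate lands within the configured relative error of the exact 95th percentile.</p>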
<p><a href="https://docs.timescale.com/api/latest/hyperfunctions/">See our hyperfunctions docs</a> to get started today. In the coming weeks, we will be releasing a series of blog posts which detail each of the hyperfunctions released today, in the context of using them to solve a real-world problem. </p><h3 id="hyperfunctions-in-public-preview">Hyperfunctions in public preview</h3><p>In addition to the hyperfunctions released today, we’re making several hyperfunctions available for public preview. These include hyperfunctions for downsampling, smoothing, approximate count-distinct, working with counters, and working with more advanced forms of averaging. All of these are available for trial today through our experimental schema, and, with your feedback, will be made available for production usage soon. </p><p>Here’s a tour through each hyperfunction and why we created them:</p><p><strong>Graph Downsampling &amp; Smoothing</strong></p><p>We have two algorithms implemented to help downsample your data for better, faster graphing:</p><p>The first graphing algorithm for downsampling is <a href="https://github.com/timescale/timescaledb-toolkit/blob/main/docs/lttb.md"><strong>Largest triangle three bucket</strong></a><strong> (LTTB)</strong>. LTTB limits the number of points you need to send to your graphing engine while maintaining visual acuity. This means that you don’t try to plot 200,000 points on a graph that’s only 2000 pixels wide, which is inefficient in terms of network and rendering costs. 
</p><p>Given an original dataset which looks like the graph below:</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/01/lttb_raw.png" class="kg-image" alt="" loading="lazy" width="716" height="371" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/01/lttb_raw.png 600w, https://timescale.ghost.io/blog/content/images/2022/01/lttb_raw.png 716w"></figure><p>We can downsample it to just 34 points with the following query using the LTTB hyperfunction:</p><pre><code>SELECT toolkit_experimental.lttb(time, val, 34)
</code></pre>
<p>The above query yields the following graph, which retains the periodic pattern of the original graph, with just 34 points of data.</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/01/lttb_34.png" class="kg-image" alt="" loading="lazy" width="600" height="371" srcset="https://timescale.ghost.io/blog/content/images/2022/01/lttb_34.png 600w"></figure><p>The second graphing algorithm for downsampling is <a href="https://github.com/timescale/timescaledb-toolkit/blob/main/docs/asap.md"><strong>Automatic smoothing for attention prioritization</strong></a><strong> (ASAP smoothing). </strong>ASAP Smoothing uses optimal moving averages to smooth a graph to remove noise and make sure that trends are obvious to the user, while not over-smoothing and removing all the signals as well. This leads to vastly improved readability. </p><p>For example, the graph below displays 250 years of monthly temperature readings from England (raw data can be found <a href="http://futuredata.stanford.edu/asap/Temp.csv">here</a>):</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/01/ASAP_raw.png" class="kg-image" alt="" loading="lazy" width="809" height="341" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/01/ASAP_raw.png 600w, https://timescale.ghost.io/blog/content/images/2022/01/ASAP_raw.png 809w" sizes="(min-width: 720px) 720px"></figure><p>We can run the following query using the ASAP smoothing hyperfunction:</p><pre><code>SELECT toolkit_experimental.asap_smooth(month, value, 800) FROM temperatures
</code></pre>
<p>The result is the graph below, which is much less noisy than the original and one where users can more easily spot trends.</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/01/ASAP_smoothed.png" class="kg-image" alt="Smoothed data" loading="lazy" width="809" height="341" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/01/ASAP_smoothed.png 600w, https://timescale.ghost.io/blog/content/images/2022/01/ASAP_smoothed.png 809w" sizes="(min-width: 720px) 720px"></figure><p><a href="https://github.com/timescale/timescaledb-toolkit/blob/main/docs/counter_agg.md"><strong>Counter Aggregates</strong></a></p><p>Metrics generally come in a few different varieties, which many systems have come to call <strong>gauges</strong> and <strong>counters</strong>. A gauge is a typical metric that can vary up or down, something like temperature or percent utilization. A counter is meant to be monotonically increasing. So it keeps track of, say, the total number of visitors to a website. The main difference in processing counters and gauges is that a decrease in the value of a counter (compared to its previous value in the <a href="https://www.tigerdata.com/blog/time-series-introduction" rel="noreferrer">time series</a>) is interpreted as a <strong>reset</strong>.  TimescaleDB’s <a href="https://github.com/timescale/timescaledb-toolkit/blob/main/docs/counter_agg.md">counter aggregate hyperfunctions</a> enable a simple and optimized analysis of these counters. </p><p>For example, despite a dataset being stored like:</p><pre><code>data
------
  10
  20
   0
   5
  15
</code></pre>
<p>We can calculate the delta (along with various other statistics) over this counter, accounting for the reset, with the following query using the counter aggregate hyperfunction:</p><pre><code>SELECT toolkit_experimental.delta(
    toolkit_experimental.counter_agg(ts, val))
FROM foo;
</code></pre>
<pre><code> delta
------
  25
</code></pre>
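<p>The reset logic is simple enough to sketch outside the database. The following Python (a simplified illustration of the idea, not the Toolkit implementation) accumulates the pre-reset reading into a running offset whenever the raw value decreases, then takes last-minus-first over the corrected series:</p>

```python
def counter_delta(readings):
    """Change in a counter over a series of readings, treating any
    decrease as a reset (illustrative sketch, not counter_agg itself)."""
    if not readings:
        return 0
    offset = 0                 # total counted before all resets seen so far
    first = readings[0]
    corrected_last = readings[0]
    prev = readings[0]
    for r in readings[1:]:
        if r < prev:           # apparent decrease: the counter reset
            offset += prev     # carry forward the pre-reset total
        corrected_last = r + offset
        prev = r
    return corrected_last - first

# One reset (250 -> 30): the corrected series is 100, 250, 280, 330.
print(counter_delta([100, 250, 30, 80]))
```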
<p><a href="https://github.com/timescale/timescaledb-toolkit/blob/main/docs/hyperloglog.md"><strong>Hyperloglog for Approximate Count Distinct</strong></a></p><p>We’ve implemented a version of the hyperloglog algorithm to run approximate count distinct queries over data in a more efficient and parallelizable fashion. Existing TimescaleDB users will be happy to hear that these work in continuous aggregates, which are automatically refreshing materialized views. </p><p><a href="https://github.com/timescale/timescaledb-toolkit/blob/main/docs/stats_agg.md"><strong>Statistical Aggregates</strong></a></p><p>Calculating rolling averages and other statistical aggregates over tumbling windows is surprisingly difficult in standard SQL: to do it accurately, you need to separate out the underlying components (for an average, the count and the sum), combine them across windows, and compute the final result yourself. Our statistical aggregates do all of this for you with a simple <code>rollup</code>.</p>
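<p>What makes a <code>rollup</code> possible is that each aggregate is stored in a mergeable partial form rather than as a finished number. A toy Python sketch of the idea (hypothetical names, not the Toolkit's API): keep an average as a (count, sum) pair, and combining two windows is just adding the pairs.</p>

```python
def partial(values):
    """Summarize one time bucket as a mergeable (count, sum) pair."""
    return (len(values), sum(values))

def rollup(*partials):
    """Merge bucket summaries without re-reading the raw rows."""
    return (sum(p[0] for p in partials), sum(p[1] for p in partials))

def average(p):
    """Finalize a partial into the actual average."""
    count, total = p
    return total / count if count else None

# Two pre-aggregated buckets (say, two hourly windows)...
bucket_a = partial([10, 20, 30])
bucket_b = partial([40, 50])
# ...rolled up into one exact average over both windows.
print(average(rollup(bucket_a, bucket_b)))
```

<p>A naive average-of-averages over the two buckets would weight them incorrectly; carrying the counts and sums separately is what keeps the rolled-up result exact.</p>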
<p>To follow the progress and contribute to improving these (and future) hyperfunctions, you can view our <a href="https://github.com/timescale/timescaledb-toolkit">roadmap on GitHub.</a> Our development process is heavily influenced by community feedback, so your comments on <a href="https://github.com/timescale/timescaledb-toolkit/issues">issues</a> and <a href="https://github.com/timescale/timescaledb-toolkit/discussions">discussion threads</a> will help determine which features get prioritized, and when they’re stabilized for release.</p><h2 id="next-steps-1">Next Steps</h2><p><a href="https://console.cloud.timescale.com/signup">Try hyperfunctions today</a> with a fully-managed Timescale service (no credit card required, free for 30 days). Hyperfunctions are pre-loaded on each new database service on Timescale, so after you’ve created a new service, you’re all set to use them!</p><p>If you prefer to manage your own database instances, you can <a href="https://github.com/timescale/timescaledb-toolkit">download and install the timescaledb_toolkit extension</a> on GitHub for free, after which you’ll be able to use all the hyperfunctions listed above. </p><p>We love building in public. You can view our <a href="https://github.com/timescale/timescaledb-toolkit">upcoming roadmap on GitHub</a> for a list of proposed features, as well as features we’re currently implementing and those that are available to use today. We also welcome feedback from the community (it helps us prioritize the features users really want). To contribute feedback, comment on an <a href="https://github.com/timescale/timescaledb-toolkit/issues">open issue</a> or in a <a href="https://github.com/timescale/timescaledb-toolkit/discussions">discussion thread</a> in GitHub.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Time-Series Analytics for PostgreSQL: Introducing the Timescale Analytics Project]]></title>
            <description><![CDATA[We're excited to announce Timescale Analytics, a new project focused on combining all of the capabilities SQL needs to perform time-series analytics into one Postgres extension. Learn about our plans, why we're sharing it now, and ways to contribute your feedback and ideas.  ]]></description>
            <link>https://www.tigerdata.com/blog/time-series-analytics-for-postgresql-introducing-the-timescale-analytics-project</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/time-series-analytics-for-postgresql-introducing-the-timescale-analytics-project</guid>
            <category><![CDATA[Announcements & Releases]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Analytics]]></category>
            <dc:creator><![CDATA[David Kohn]]></dc:creator>
            <pubDate>Thu, 21 Jan 2021 03:16:15 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2021/01/alexander-andrews-4JdvOwrVzfY-unsplash.jpg">
            </media:content>
            <content:encoded><![CDATA[<p>We're excited to announce Timescale Analytics, a new project focused on combining all of the capabilities SQL needs to perform time-series analytics into one <a href="https://www.tigerdata.com/blog/top-8-postgresql-extensions" rel="noreferrer">Postgres extension</a>. Learn about our plans, why we're announcing them now, and ways to contribute your feedback and ideas.  </p><p>At Timescale, our mission is to enable every software developer to store, analyze, and build on top of their time-series data so that they can measure what matters in their world: IoT devices, IT systems, marketing analytics, user behavior, financial metrics, and more. To this end, we’ve built a <a href="https://www.timescale.com/products" rel="noreferrer">petabyte-scale, time-series database</a>.</p><p>Today, we’re excited to announce the Timescale Analytics project, an initiative to make Postgres the best way to execute critical time-series queries quickly, analyze time-series data, and extract meaningful information. 
SQL is a powerful language (we're obviously big fans ourselves), and we believe that by adding a specialized set of functions for time-series analysis, we can make it even better.</p><p>The Timescale Analytics project aims to identify, build, and combine all of the functionality SQL needs to perform time-series analysis into a single extension.</p><p><strong>In other words, the Timescale Analytics extension will be a "one-stop shop" for time-series analytics in PostgreSQL, and we're looking for feedback from the community: what analytical functionality would you find most useful?</strong></p><p>We believe that it is important to develop our code in the open and are requiring radical transparency of ourselves: everything about this project, our priorities, intended features, trade-off discussions, and (tentative) roadmap, are available in <a href="https://github.com/timescale/timescale-analytics">our GitHub repository</a>.</p><p>It is our hope that working like this will make it easier for the community to interact with the project and allow us to respond quickly to community needs. </p><p>To this end, we’re announcing the project as early as possible, so we can get community feedback before we become too invested in a single direction. Over the next few weeks, we’ll be gathering thoughts on initial priorities and opening some sample PRs. Soon after that, we plan to create an initial version of the Timescale Analytics extension for you to experiment with.</p><p>Here are some examples of analytics functions we are considering adding: monotonic counters, tools for graphing, statistical sketching, and pipelining.</p><h2 id="monotonic-counters">Monotonic Counters</h2><p>A monotonically increasing counter is a type of metric often used in time-series analysis. Logically, such a counter should only ever increase, but the value is often read from an ephemeral source that can get reset back to zero at any time (due to crashes or other similar phenomena). 
To analyze data from such a source, you need to account for these resets: whenever the counter appears to decrease, you assume a reset occurred, and add the value immediately before the reset to every value after it.</p><p>Assume we have a counter that measures visitors to a website. If we were running a new marketing campaign focused on driving people to a new page on our site, we could use the change in the counter to measure the success of the campaign. While this kind of analysis can be performed in stock SQL, it quickly becomes unwieldy.</p><p>Using native SQL, such a query would look like:</p><pre><code class="language-SQL">SELECT sum(counter_reset_val) + last(counter, ts) - first(counter, ts) as counter_delta 
FROM (
    SELECT *,
        CASE WHEN counter - lag(counter) OVER (ORDER BY ts ASC) &lt; 0
            THEN lag(counter) OVER (ORDER BY ts ASC)
            ELSE 0
        END as counter_reset_val
    FROM user_counter
) f;
</code></pre><p></p><p>This is a relatively simple example, and more sophisticated queries are even more complicated.</p><p><a href="https://github.com/timescale/timescale-analytics/issues/4">One of our first proposals for capabilities to include in Timescale Analytics</a> would make this much simpler, allowing us to  write something like:</p><pre><code class="language-SQL">SELECT delta(counter_agg(counter, ts)) as counter_delta FROM user_counter;</code></pre><p></p><p>There are many examples like this: scenarios where it’s <em>possible</em> to solve the problem in stock SQL, but the resulting code is not exactly easy to write, nor pretty to read.</p><p>We believe we can solve that problem and make writing analytical SQL as easy as any other modern language.</p><h2 id="tools-for-graphing">Tools for Graphing</h2><p>When graphing time-series data, you often need to perform operations such as <a href="https://en.wikipedia.org/wiki/Change_detection">change-point analysis</a>, <a href="https://medium.com/@hayley.morrison/sampling-time-series-data-sets-fc16caefff1b">downsampling</a>, or <a href="https://dawn.cs.stanford.edu/2017/08/07/asap/">smoothing</a>. Right now, these are usually generated with a front-end service, such as <a href="https://grafana.com/">Grafana</a>, but this means the graphs you use are heavily tied to the renderer you’re using. </p><p>Moving these functions to the database offers a number of advantages:</p><ul><li>Users can choose their graphing front-end based on how well it does graphing, not on how well it does data analytics</li><li>Queries can remain consistent across all front-end tools and consumers of your data</li><li>Doing all the work in the database involves shipping a much smaller number of data points over the network</li></ul><p>Key to getting this project working is building the output formats that will work for a variety of front-ends and identifying the necessary APIs. 
If you have thoughts on the matter, please hop on our <a href="https://github.com/timescale/timescale-analytics/discussions/30">discussion threads</a>.</p><p>A fully worked-out pure-SQL example of a downsampling algorithm is too long to include inline here (for example, a worked-through version of largest-triangle-three-buckets can be found in <a href="https://medium.com/@hayley.morrison/sampling-time-series-data-sets-fc16caefff1b">this blog post</a>) – but with aggregate support, it could be as simple as:</p><pre><code class="language-SQL">SELECT lttb(time, value, num_buckets=&gt;500) FROM data;</code></pre><p></p><p>This could return a <code>timeseries</code> data type, which could be ingested directly into a tool like Grafana or another language, or it could be unnested to get back to the time-value pairs to send into an external tool. </p><p>These tools can then use the simplified query instead of doing their own custom analysis on your data.</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Screen-Shot-2021-01-20-at-3.19.57-PM.png" class="kg-image" alt="Grafana dashboard UI, showing initial and downsampled data" loading="lazy" width="1600" height="944" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/01/Screen-Shot-2021-01-20-at-3.19.57-PM.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/01/Screen-Shot-2021-01-20-at-3.19.57-PM.png 1000w, https://timescale.ghost.io/blog/content/images/2022/01/Screen-Shot-2021-01-20-at-3.19.57-PM.png 1600w" sizes="(min-width: 720px) 720px"></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Screen-Shot-2021-01-20-at-3.18.43-PM.png" class="kg-image" alt="Grafana dashboard UI, showing initial and downsampled data" loading="lazy" width="1600" height="944" 
srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/01/Screen-Shot-2021-01-20-at-3.18.43-PM.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/01/Screen-Shot-2021-01-20-at-3.18.43-PM.png 1000w, https://timescale.ghost.io/blog/content/images/2022/01/Screen-Shot-2021-01-20-at-3.18.43-PM.png 1600w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Example downsampling data from </span><a href="http://ecg.mit.edu/time-series/"><span style="white-space: pre-wrap;">this dataset</span></a><span style="white-space: pre-wrap;">. It keeps the large-scale features of the data, with an order of magnitude fewer data points</span></figcaption></figure><h2 id="statistical-sketching">Statistical Sketching</h2><p>Sketching algorithms, such as <a href="https://github.com/timescale/timescale-analytics/issues/1">t-digest</a>, <a href="https://github.com/timescale/timescale-analytics/issues/3">hyperloglog</a>, and <a href="https://github.com/timescale/timescale-analytics/issues/6">count-min</a>, allow us to get a quick, approximate answer for certain queries when the statistical bounds provided are acceptable.</p><p>This is even more exciting in the TimescaleDB ecosystem since it appears most of these sketches will fit nicely into <a href="https://docs.timescale.com/latest/using-timescaledb/continuous-aggregates">continuous aggregates</a>, allowing incredibly low query latency. </p><p>For instance, a continuous aggregate displaying the daily unique visitors to a website could be defined like:</p><pre><code class="language-SQL">CREATE MATERIALIZED VIEW unique_visitors
WITH (timescaledb.continuous) AS
    SELECT 
    time_bucket('1 day', time) as day, 
    hll(visitor_id) as visitors
    FROM connections
    GROUP BY time_bucket('1 day', time);
</code></pre><p></p><p>Such a view could be queried to get the visitors over a range of days, like so:</p><pre><code class="language-SQL">SELECT day, approx_distinct(visitors)
FROM unique_visitors
WHERE day &gt;= '2020-01-01' AND day &lt;= '2020-01-15'
</code></pre><p></p><p>Additionally, it would allow for re-aggregation to determine the number of unique visitors over a coarser time range, such as the number of monthly visitors: </p><pre><code class="language-SQL">SELECT time_bucket('30 days', day), approx_distinct(hll(visitors))
FROM unique_visitors
GROUP BY time_bucket('30 days', day)
</code></pre><p></p><h2 id="pipelining">Pipelining</h2><p>SQL queries can get long, especially when there are multiple layers of aggregation and function calls.</p><p>For instance, to write a pairwise delta at minute-granularity in TimescaleDB, we’d use something like:</p><pre><code class="language-SQL">SELECT minutes, sampled - lag(sampled) OVER (ORDER BY minutes) as delta
FROM (
    SELECT
        time_bucket_gapfill('1 minute', time) minutes,
        interpolate(first(value, time)) sampled
    FROM data
    GROUP BY time_bucket_gapfill('1 minute', time)
) interpolated;
</code></pre><p></p><p>To mitigate this, the Timescale Analytics proposal includes a unified <a href="https://github.com/timescale/timescale-analytics/discussions/10">pipeline API</a> capability that would allow us to use the much more straightforward (and elegant) query below:</p><pre><code class="language-SQL">SELECT timeseries(time, value) |&gt; sample('1 minute') |&gt; interpolate('linear') |&gt; delta() FROM data;</code></pre><p></p><p>Besides the simpler syntax, this API could also enable some powerful optimizations, such as incremental pipelines, single-pass processing, and vectorization. </p><p>This is still very much in the design phase, and we’re currently having discussions about what such an API should <a href="https://github.com/timescale/timescale-analytics/discussions/10">look like</a>, what pipeline elements <a href="https://github.com/timescale/timescale-analytics/discussions/26">are appropriate</a>, and what the textual format <a href="https://github.com/timescale/timescale-analytics/discussions/10#discussioncomment-282898">should be</a>.</p><h2 id="how-we%E2%80%99re-building-timescale-analytics">How we’re building Timescale Analytics</h2><p>We’re building Timescale Analytics as a <a href="https://www.tigerdata.com/blog/top-8-postgresql-extensions" rel="noreferrer">PostgreSQL extension</a>. PostgreSQL's extension framework is quite powerful and allows for different levels of integration with database internals. </p><p>Timescale Analytics will be separate from the core  TimescaleDB extension. This is because TimescaleDB core interfaces quite deeply into PostgreSQL’s internals— including the planner, executor, and DDL interfaces—due to the demands of time-series data storage. 
This necessitates a certain conservatism in its development process in order to ensure that updating TimescaleDB versions cannot damage existing databases, and that features interact appropriately with PostgreSQL’s core functionality.</p><p>By separating the new analytics functionality into a dedicated Timescale Analytics extension, we can vastly reduce the contact area for these new functions, enabling us to move faster without increased risk. We will be focusing on improvements that take advantage of the PostgreSQL extension hooks for creating functions, aggregates, operators, and other database objects, rather than those that require interfacing with the lower-level planning and execution infrastructure. Creating a separate extension also allows us to experiment with our build process and technologies, for instance, writing the extension <a href="https://github.com/zombodb/pgx">in Rust</a>.</p><p>More importantly, we hope using a separate extension will lower barriers for community contributions. We know that the complexity of our integrations with PostgreSQL can make it difficult to contribute to TimescaleDB proper. We believe this new project will allow for much more self-contained contributions by avoiding projects requiring deep integration with the PostgreSQL planner or executor.</p><p>So, if you’ve been wanting to contribute back but didn’t know how or are a Rustacean looking to get involved in databasing, please join us!</p><h2 id="get-involved">Get Involved</h2><p>Before the code is written is the perfect time to have a say in where the project will go. To this end, we want—and need—your feedback: what are the frustrating parts of analyzing time-series data? What takes far more code than you feel it should? 
What runs slowly or only runs quickly after seemingly arcane rewrites?</p><p>We want to solve community-wide problems and incorporate as much feedback as possible, in addition to relying on our intuition, observation, and experiences.</p><p><strong>Want to help? </strong>You can submit suggestions and help shape the direction in 3 primary ways:</p><ul><li><a href="https://github.com/timescale/timescale-analytics/discussions"><strong>Look at some of the discussions</strong></a> we’re having right now and weigh in with your opinions. Any and all comments are welcome, whether you’re an experienced developer or just learning.</li><li><a href="https://github.com/timescale/timescale-analytics/labels/proposed-feature"><strong>Check out the features</strong></a> we’re thinking of adding, and weigh in on whether they’re something you want, whether we’re missing something, or whether there are any issues or alternatives. We are releasing nightly Docker images of our builds.</li><li><a href="https://github.com/timescale/timescale-analytics/labels/feature-request"><strong>Explore our running feature requests, add a +1, </strong></a><strong> and </strong><a href="https://github.com/timescale/timescale-analytics/issues/new?assignees=&amp;labels=feature-request&amp;template=feature-request.md&amp;title="><strong>contribute your own</strong></a>.</li></ul><p><strong>Most importantly: </strong><a href="https://github.com/timescale/timescale-analytics/discussions"><strong>share your problems</strong></a><strong>! </strong>Tell us the kinds of queries or analyses you wish were easier, the issues you run into, or the workarounds you’ve created to fill the gaps. (Example datasets are especially helpful, as they concretize your problems and create a shared language in which to discuss them.)</p>]]></content:encoded>
        </item>
    </channel>
</rss>