<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[Tiger Data Blog]]></title>
        <description><![CDATA[Insights, product updates, and tips from TigerData (Creators of TimescaleDB) engineers on Postgres, time series & AI. IoT, crypto, and analytics tutorials & use cases.]]></description>
        <link>https://www.tigerdata.com/blog</link>
        <image>
            <url>https://www.tigerdata.com/icon.ico</url>
            <title>Tiger Data Blog</title>
            <link>https://www.tigerdata.com/blog</link>
        </image>
        <generator>RSS for Node</generator>
        <lastBuildDate>Tue, 07 Apr 2026 09:51:28 GMT</lastBuildDate>
        <atom:link href="https://www.tigerdata.com/blog" rel="self" type="application/rss+xml"/>
        <ttl>60</ttl>
        <item>
            <title><![CDATA[Creating a Fast Time-Series Graph With Postgres Materialized Views]]></title>
            <description><![CDATA[Build a time-series graph or plot to quickly visualize data using Postgres materialized views and their upgraded version, continuous aggregates.]]></description>
            <link>https://www.tigerdata.com/blog/creating-a-fast-time-series-graph-with-postgres-materialized-views</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/creating-a-fast-time-series-graph-with-postgres-materialized-views</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Data Visualization]]></category>
            <dc:creator><![CDATA[Dylan Paulus]]></dc:creator>
            <pubDate>Mon, 27 Nov 2023 18:21:08 GMT</pubDate>
<media:content medium="image" url="https://timescale.ghost.io/blog/content/images/2023/11/Creating-a-Fast-Time-Series-Graph-With-Postgres-Materialized-ViewsCreating-a-Fast-Time-Series-Graph-With-Postgres-Materialized-Views.jpg">
            </media:content>
            <content:encoded><![CDATA[<p>Imagine you have a massive amount of time-series data you want to explore and visualize. Seeing the latest trends, the historical patterns, and the outliers in your data can help you gain insights and make decisions. But how do you visualize and analyze time-series data effectively? How do you create graphs, plots, and other visualizations for real-time analytics showing the current state of your data and the historical changes over different time intervals? And how do you do it efficiently without sacrificing performance or accuracy?</p><p>In this article, we will see how to use PostgreSQL materialized views and Timescale’s improved version of these—continuous aggregates—to create a time-series graph that answers these questions.&nbsp;</p><div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-emoji">📊</div><div class="kg-callout-text"><a href="https://www.timescale.com/blog/what-is-a-time-series-plot-and-how-can-you-create-one/"><u>Learn more about time-series plots</u></a> or dive into <a href="https://www.timescale.com/blog/what-is-a-time-series-graph-with-examples/"><u>an explainer about time-series graphs</u></a>.</div></div><h2 id="creating-a-time-series-graph-in-postgresql">Creating a Time-Series Graph in PostgreSQL<br></h2><h3 id="method-1-creating-plots-and-graphs-directly-from-raw-data">Method 1: Creating plots and graphs directly from raw data&nbsp;</h3><p>Pretend you are a senior engineer at a company that creates devices to monitor the electrical power grid. These devices export a large amount of data—one PostgreSQL row is created per device every second. For this example, let's say the local power company uses one hundred devices (60,480,000 rows created per week). 
You want to be able to give your customers data visualizations of the <a href="https://en.wikipedia.org/wiki/Electrical_grid#:~:text=The%20demand%2C%20or%20load%20on,demand%20is%20the%20maximum%20load"><u>load on a given line</u></a> per hour, day, and week.</p><p>Our table looks like this:</p><pre><code class="language-sql">CREATE TABLE demand (
    id          serial primary key,
    amps        DOUBLE PRECISION  NOT NULL,
    location    TEXT,
    time        TIMESTAMPTZ       NOT NULL
);
</code></pre>
<p>We can import a single device's data by running the following INSERT command (or, you know, generate dummy data!). It will take some time to insert 10,540,800 rows; shorten the gap between timestamps to produce less data:</p><pre><code class="language-sql">INSERT INTO demand (amps, location, time) VALUES  (random()*40, 'Spokane, WA', 
generate_series('2023-09-01T00:00:00+03:00'::timestamptz, '2023-12-31T23:59:59+03:00'::timestamptz, '1 second'));
</code></pre>
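<p>If you want a smaller dataset to experiment with, widen the <code>generate_series</code> step as suggested above. For example, generating one row per minute instead of one per second produces roughly 175,000 rows:</p><pre><code class="language-sql">INSERT INTO demand (amps, location, time) VALUES  (random()*40, 'Spokane, WA', 
generate_series('2023-09-01T00:00:00+03:00'::timestamptz, '2023-12-31T23:59:59+03:00'::timestamptz, '1 minute'));
</code></pre>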
<p>Now, we can generate a time-series plot to calculate average amps per minute with the following SQL. Change <code>'1 minute'</code> to <code>'1 day'</code> or <code>'1 week'</code> to create time-series plots for different intervals.</p><pre><code class="language-sql">SELECT 
 date_bin(interval '1 minute', time, timestamptz '2023-08-01') AS time_interval, 
 AVG(amps)
FROM demand
GROUP BY 1
ORDER BY 1;
</code></pre>
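<p>If you'd like to measure the query on your own machine, you can prefix it with PostgreSQL's standard <code>EXPLAIN ANALYZE</code>, which runs the query and reports its actual execution time:</p><pre><code class="language-sql">EXPLAIN ANALYZE
SELECT 
 date_bin(interval '1 minute', time, timestamptz '2023-08-01') AS time_interval, 
 AVG(amps)
FROM demand
GROUP BY 1
ORDER BY 1;
</code></pre>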
<figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2023/11/query-results.png" class="kg-image" alt="The query output" loading="lazy" width="1724" height="1422" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2023/11/query-results.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2023/11/query-results.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2023/11/query-results.png 1600w, https://timescale.ghost.io/blog/content/images/2023/11/query-results.png 1724w" sizes="(min-width: 720px) 720px"></figure><p>This query can take some time, depending on how much data is in the <code>demand</code> table. For the 10,540,800 rows we created, the query takes 15 seconds to execute on an Apple MacBook Pro with an 8-core Intel Core i9 and 32&nbsp;GB of RAM. That is 15 seconds to return plot data for a single device over three months! Imagine if we had hundreds or thousands of devices spanning over a year.</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2023/11/explain-initial-query.png" class="kg-image" alt="The query plan for the initial query" loading="lazy" width="2000" height="964" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2023/11/explain-initial-query.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2023/11/explain-initial-query.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2023/11/explain-initial-query.png 1600w, https://timescale.ghost.io/blog/content/images/2023/11/explain-initial-query.png 2252w" sizes="(min-width: 720px) 720px"></figure><p>Let's look at a few ways to improve the speed of our time-series plot using materialized views and continuous aggregates.</p><h3 id="method-2-using-materialized-views-to-make-graphs-more-performant">Method 2: Using materialized views to make graphs more performant&nbsp;</h3><p>In PostgreSQL, a view can be thought of as a stored query on top of a
table. When we query a view, the underlying query the view was created with gets called. This gives us the ability to abstract away and simplify our queries, but a view won't do much to improve the speed of a query.&nbsp;</p><p>Somewhere between a table and a view sits the materialized view. A materialized view works similarly to a view in that you can make queries reusable. The difference is a materialized view will store the resulting data on disk—caching the data. When you use a materialized view, you don’t have to run the query again. You get the results from the disk. This makes your queries much faster!</p><div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-emoji">🗒️</div><div class="kg-callout-text"><a href="https://www.timescale.com/blog/how-postgresql-views-and-materialized-views-work-and-how-they-influenced-timescaledb-continuous-aggregates/"><u>Learn more about PostgreSQL materialized views and how they influenced the design of our continuous aggregates</u></a>.</div></div><p></p><p>To improve the speed of our time-series graph data, let's create a materialized view over the <code>demand</code> table.</p><pre><code class="language-sql">CREATE MATERIALIZED VIEW demand_amps_by_minute AS 
SELECT 
  date_bin(
    interval '1 minute', time, timestamptz '2023-08-01'
  ) AS time_interval, 
  AVG(amps) AS avg_amps 
FROM 
  demand 
GROUP BY 
  1 
ORDER BY 
  1;
</code></pre>
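<p>One caveat: a plain materialized view is a snapshot. Rows inserted into <code>demand</code> after the view is created won't appear in its results until you refresh it:</p><pre><code class="language-sql">REFRESH MATERIALIZED VIEW demand_amps_by_minute;
</code></pre>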
<p>Since creating the materialized view needs to run the same average amps per minute SQL, it can take some time to create. Once it's complete, run <code>SELECT * FROM demand_amps_by_minute;</code>. On my same MacBook Pro, the query now takes 58ms—much better!</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2023/11/explain-materialized-view.png" class="kg-image" alt="The query plan with a materialized view" loading="lazy" width="2000" height="668" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2023/11/explain-materialized-view.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2023/11/explain-materialized-view.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2023/11/explain-materialized-view.png 1600w, https://timescale.ghost.io/blog/content/images/2023/11/explain-materialized-view.png 2252w" sizes="(min-width: 720px) 720px"></figure><p>This shows off the speed improvement materialized views can give us, but they come with a downside we haven't covered yet. When new data is added, updated, or deleted from the underlying table, we have to manually refresh the materialized view with a <code>REFRESH MATERIALIZED VIEW [materialized view name];</code> statement. 
This will completely replace the data in the materialized view with all the new data from the table using the query from the definition.</p><p>Having to refresh your materialized views comes with a few glaring problems:</p><ul><li>If you have a steady stream of data being written to your table, which is very common in time-series data, then once you refresh your materialized view, it'll be out of date.</li><li>Refreshing a materialized view comes with a performance hit, as it needs to rerun the materialized view's definition on all the data in the table to refresh itself.</li><li>You'll need to remember to manually run a refresh on your materialized views or maintain a cron job.</li></ul><p>However, Timescale has engineered a little magic under the hood to remove all these pain points of using materialized views through <a href="https://docs.timescale.com/use-timescale/latest/continuous-aggregates/"><u>continuous aggregates</u></a>.</p><h3 id="method-3-creating-graphs-that-are-more-resource-efficient-and-easier-to-maintain-via-continuous-aggregates">Method 3: Creating graphs that are more resource-efficient and easier to maintain via continuous aggregates</h3><p>Timescale’s continuous aggregates have the same look and feel as materialized views, but they add some essential functionality to help you keep your graphs, plots, dashboards, or other visualizations of real-time analytics performant over time without manual maintenance.&nbsp;</p><p>First, continuous aggregates stay automatically updated via a refresh policy defined by you—i.e., you can configure your continuous aggregate view so it gets updated automatically every 30 minutes, including your latest data. This is much more convenient than refreshing your views manually!&nbsp;</p><p>But the key is what happens under the hood once this refresh policy kicks in. In plain PostgreSQL materialized views, when you refresh the view, the query will be recomputed over the entire dataset. 
In other words, in plain PostgreSQL, materialized views’ refreshes are not incremental. This makes the refresh process unnecessarily expensive, especially once your dataset grows and a large volume of data needs to be materialized.&nbsp;</p><p>Continuous aggregates fix this inefficiency: when you refresh a continuous aggregate, Timescale doesn’t drop all the old data and recompute the aggregate against it. Instead, the engine just runs the query against the most recent refresh period (e.g., 30 minutes) and the data that has changed since the last refresh. This way, continuous aggregates keep your visualizations performant over time, no matter how much your dataset grows.</p><p>Switching over to Timescale, we'll recreate our <code>demand</code> table using the same <code>CREATE TABLE</code> statement as before but leaving off the <code>id</code> column (we'll use <code>time</code> instead).</p><pre><code class="language-sql">CREATE TABLE demand (
    amps        DOUBLE PRECISION  NOT NULL,
    location    TEXT,
    time        TIMESTAMPTZ       NOT NULL
);
</code></pre>
<p>Next, we'll update <code>demand</code> to be a <a href="https://www.tigerdata.com/blog/database-indexes-in-postgresql-and-timescale-cloud-your-questions-answered" rel="noreferrer">hypertable</a>:</p><pre><code class="language-sql">SELECT create_hypertable('demand', 'time');
</code></pre>
<p>Finally, populate the <code>demand</code> table with data:</p><pre><code class="language-sql">INSERT INTO demand (amps, location, time) VALUES  (random()*40, 'Spokane, WA', 
generate_series('2023-09-01T00:00:00+03:00'::timestamptz, '2023-12-31T23:59:59+03:00'::timestamptz, '1 second'));
</code></pre>
<p>At last, we can create our continuous aggregate, which will work similarly to the previous materialized view.</p><pre><code class="language-sql">CREATE MATERIALIZED VIEW demand_amps_by_minute
WITH (timescaledb.continuous) AS
SELECT 
   time_bucket(INTERVAL '1 minute', time) AS bucket,
   AVG(amps) AS avg_amps
FROM demand
GROUP BY bucket;
</code></pre>
<p>To have our continuous aggregate refresh automatically, we need to add a refresh policy. For this example, we'll have it refresh every minute, but for your own workloads, you'll want to tune these settings to fit your needs.</p><pre><code class="language-sql">SELECT add_continuous_aggregate_policy(
	'demand_amps_by_minute', 
	start_offset =&gt; NULL, 
	end_offset =&gt; INTERVAL '1 h',
	schedule_interval =&gt; INTERVAL '1 m');
</code></pre>
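<p>To double-check that the policy was registered (assuming a TimescaleDB 2.x install, where policies run as background jobs), you can inspect the <code>timescaledb_information.jobs</code> view:</p><pre><code class="language-sql">SELECT job_id, proc_name, schedule_interval
FROM timescaledb_information.jobs
WHERE proc_name = 'policy_refresh_continuous_aggregate';
</code></pre>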
<p>If we run a <code>SELECT</code> query on <code>demand_amps_by_minute</code>, the query now takes 120&nbsp;ms. That's a little slower than a plain materialized view, but still much faster than querying the table directly!&nbsp;</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2023/11/explain-continous-aggregate.png" class="kg-image" alt="The query plan with a continuous aggregate" loading="lazy" width="2000" height="1023" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2023/11/explain-continous-aggregate.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2023/11/explain-continous-aggregate.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2023/11/explain-continous-aggregate.png 1600w, https://timescale.ghost.io/blog/content/images/size/w2400/2023/11/explain-continous-aggregate.png 2400w" sizes="(min-width: 720px) 720px"></figure><p>Continuous aggregates track which chunks have been materialized and which data hasn't been materialized yet by using a watermark (essentially a pointer). When you query a continuous aggregate, you get materialized data from before the watermark and freshly computed data from after it. The watermark moves forward as the refresh policy works through materializing new data.</p><p>All this adds some time to the overall query speed, but we benefit from not having to manually refresh the materialized view!</p><p>Let's try it out. Insert a new row into the underlying <code>demand</code> table:</p><pre><code class="language-sql">INSERT INTO demand (amps, location, time) VALUES (100.2, 'Pullman, WA', now());
</code></pre>
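<p>To see the new reading, query the most recent buckets of the continuous aggregate:</p><pre><code class="language-sql">SELECT * FROM demand_amps_by_minute ORDER BY bucket DESC LIMIT 5;
</code></pre>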
<p>Then, if we re-query our continuous aggregate, we'll see the newly added row returned to us.</p><h2 id="start-speeding-up-your-queries-today">Start Speeding Up Your Queries Today</h2><p>Throughout this article, we saw how to create a time-series graph from a large table of raw data. We improved the query performance by taking advantage of PostgreSQL's materialized views.&nbsp;</p><p>However, materialized views can be time-consuming to maintain. Finally, we removed the need to manually refresh the materialized view by taking advantage of Timescale's continuous aggregates. Now, it’s your turn to create your own time-series plots or <a href="https://www.timescale.com/learn/real-time-analytics-in-postgres"><u>real-time analytics</u></a> using these methods!&nbsp;</p><p>You can <a href="https://console.cloud.timescale.com/signup"><u>create a free Timescale account</u></a> and start speeding up your queries today.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to Reduce Your PostgreSQL Database Size]]></title>
            <description><![CDATA[Shrinking the storage used by your PostgreSQL database will help keep your costs low and improve the performance of your large tables. ]]></description>
            <link>https://www.tigerdata.com/blog/how-to-reduce-your-postgresql-database-size</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/how-to-reduce-your-postgresql-database-size</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Dylan Paulus]]></dc:creator>
            <pubDate>Fri, 06 Oct 2023 18:52:39 GMT</pubDate>
<media:content medium="image" url="https://timescale.ghost.io/blog/content/images/2023/10/How-to-Reduce-Your-PostgreSQL-Database-Size.png">
            </media:content>
            <content:encoded><![CDATA[<p>Your phone buzzes in the middle of the night. You pick it up. A monitor went off at work—your PostgreSQL database is slowly but steadily reaching its maximum storage space. You are the engineer in charge. What should you do?</p><p>Okay, if it comes down to that situation, you should remedy it ASAP by adding more storage. But you’re going to need a better long-term strategy to optimize your PostgreSQL storage use, or you’ll keep paying more and more money.</p><p>Does your PostgreSQL database really need to be that large? Is there something you can do to optimize your storage use?</p><p>This article explores several strategies that will help you reduce your PostgreSQL database size considerably and sustainably.</p><h2 id="why-is-postgresql-storage-optimization-important">Why Is PostgreSQL Storage Optimization Important?</h2><p></p><p>Perhaps you’re thinking:</p><p>“Storage is cheap these days, and optimizing a PostgreSQL database takes time and effort. I’ll just keep adding more storage.”</p><p>Or perhaps:</p><p>“My PostgreSQL provider is actually usage-based (<a href="https://timescale.ghost.io/blog/savings-unlocked-why-we-switched-to-a-pay-for-what-you-store-database-storage-model/">like Timescale</a>), and I don’t have the problem of being locked into a large disk.”</p><p>Indeed, resigning yourself to simply using more storage is the most straightforward way to tackle an increasingly growing PostgreSQL database. Are you running servers on-prem? Slap another hard drive on that bad boy. Are you running PostgreSQL in RDS? Raise the storage limits. But this comes with problems.</p><p>The first and most obvious problem is the cost. For example, if you’re running PostgreSQL in an EBS instance in AWS or in <a href="https://timescale.ghost.io/blog/understanding-amazon-rds-cost/">RDS</a>, you’ll be charged on an allocation basis. 
This model assumes you’ll predetermine how much disk space you’ll need in the future and then pay for it, regardless of whether you end up using it or not, and without the chance of downscaling.</p><p>In other PostgreSQL providers, when you run out of storage space, you must upgrade and pay for the next available plan or storage tier, meaning you’ll see a considerably higher bill overnight.</p><p>In a way, these issues are mitigated by usage-based models. <a href="https://timescale.ghost.io/blog/savings-unlocked-why-we-switched-to-a-pay-for-what-you-store-database-storage-model/">Timescale</a> <a href="https://www.timescale.com/pricing">charges by the amount of storage you use</a>: you don't need to worry about allocating storage or managing storage plans, which really simplifies things—and the less storage you use, the less it costs. </p><p>Usage-based models are a great incentive to actually optimize your PostgreSQL database size as much as possible since you’ll see immediate reductions in your bill. But yes, this also works the opposite way: if you ignore managing your storage, your storage bill will go up.</p><p>The second problem with not optimizing your PostgreSQL storage usage is that this situation can lead to bad performance. Queries run slower and your I/O operations increase. 
This is something that often gets overlooked, <a href="https://www.timescale.com/learn/postgresql-performance-tuning-how-to-size-your-database">but maintaining PostgreSQL storage usage is paramount to keeping large PostgreSQL tables fast</a>.</p><p>This last point deserves a deeper dive into how data is actually stored in PostgreSQL and what is causing the problem, so let’s briefly cover some essential PostgreSQL storage concepts.</p><h2 id="essential-postgresql-storage-concepts%E2%80%8C%E2%80%8C%E2%80%8C%E2%80%8C">Essential PostgreSQL Storage Concepts</h2><p><strong>How does PostgreSQL store data?</strong></p><p>At a high level, there are two terms you need to understand: tuples and pages. </p><ul><li>A tuple is the physical representation of an entry in a table. You'll generally see the terms tuple and row used interchangeably. Each element in a tuple corresponds to a specific column in that table, containing the actual data value for that column.</li><li>A page is the unit of storage in PostgreSQL, typically 8&nbsp;kB in size, that holds one or more tuples. PostgreSQL reads and writes data in page units. </li></ul><p>Each page in PostgreSQL consists of a page header (which contains metadata about the page, such as page layout versions, page flags, and so on) and actual data (including tuples). 
There’s also a special area called the Line Pointer Array, which provides the offsets where each tuple begins.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2023/10/Screenshot-2023-10-06-at-11.29.48-AM.png" class="kg-image" alt="A simple representation of a PostgreSQL page containing metadata about the page and tuples stored in the page" loading="lazy" width="1858" height="944" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2023/10/Screenshot-2023-10-06-at-11.29.48-AM.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2023/10/Screenshot-2023-10-06-at-11.29.48-AM.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2023/10/Screenshot-2023-10-06-at-11.29.48-AM.png 1600w, https://timescale.ghost.io/blog/content/images/2023/10/Screenshot-2023-10-06-at-11.29.48-AM.png 1858w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">A simple representation of a PostgreSQL page containing metadata about the page and tuples stored in the page</em></i></figcaption></figure><h3 id="what-happens-when-querying-data">What happens when querying data? </h3><p>When querying data, PostgreSQL utilizes the metadata to quickly navigate to the relevant page and tuple. The PostgreSQL query planner examines the metadata to decide the optimal path for retrieving data, for example, estimating the cost of different query paths based on the metadata information about the tables, indexes, and data distribution.</p><h3 id="what-happens-when-we-insert-delete-update-a-row-in-postgresql">What happens when we INSERT/ DELETE/ UPDATE a row in PostgreSQL?</h3><p>When a new tuple is inserted into a PostgreSQL table, it gets added to a page with enough free space to accommodate the tuple. 
Each tuple within a page is identified and accessed using the offset provided in the Line Pointer Array.</p><p>If an inserted tuple is too big for the available space of a page, PostgreSQL doesn't split it between two 8&nbsp;kB pages. Instead, it employs TOAST (The Oversized-Attribute Storage Technique) to compress and/or break the large values into smaller pieces. These pieces are then stored in a separate TOAST table, while the original tuple retains a pointer to this externally stored data. </p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2023/10/Screenshot-2023-10-06-at-11.31.16-AM-1.png" class="kg-image" alt="When we insert a tuple that's too large for a single page, a new page is created" loading="lazy" width="1858" height="726" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2023/10/Screenshot-2023-10-06-at-11.31.16-AM-1.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2023/10/Screenshot-2023-10-06-at-11.31.16-AM-1.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2023/10/Screenshot-2023-10-06-at-11.31.16-AM-1.png 1600w, https://timescale.ghost.io/blog/content/images/2023/10/Screenshot-2023-10-06-at-11.31.16-AM-1.png 1858w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">When we insert a tuple that's too large for a single page, a new page is created.</em></i></figcaption></figure><h3 id="what-is-a-dead-tuple">What is a dead tuple? </h3><p>A key aspect to understand (and this will influence our PostgreSQL database size, as we’ll see shortly) is that when you delete data in PostgreSQL via <code>DELETE FROM</code>,  you’re not actually deleting it but marking the rows as unavailable. These unavailable rows are usually referred to as “dead tuples.”</p><p>When you run <code>UPDATE</code>, the row you’re updating will also be marked as a dead tuple. 
Then, PostgreSQL will insert a new tuple with the updated column. </p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2023/10/Screenshot-2023-10-06-at-11.36.00-AM.png" class="kg-image" alt="A page in a Postgres table with tuples that have been deleted or updated. The old instances are now dead tuples" loading="lazy" width="1858" height="842" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2023/10/Screenshot-2023-10-06-at-11.36.00-AM.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2023/10/Screenshot-2023-10-06-at-11.36.00-AM.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2023/10/Screenshot-2023-10-06-at-11.36.00-AM.png 1600w, https://timescale.ghost.io/blog/content/images/2023/10/Screenshot-2023-10-06-at-11.36.00-AM.png 1858w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">A page in a Postgres table with tuples that have been deleted or updated. The old instances are now dead tuples</em></i></figcaption></figure><p>You might be wondering <a href="https://www.tigerdata.com/blog/postgres-for-everything" rel="noreferrer">why PostgreSQL</a> does this. Dead tuples are actually a compromise: they reduce excessive locking on tables during concurrent operations across multiple connections and simplify transactions. Imagine a transaction failing halfway through its execution; it is much easier to revert a change when the old data is still available than trying to rewind each action in an idempotent way. Furthermore, this mechanism supports the easy and efficient implementation of rollbacks, ensuring data consistency and integrity during transactions. 
</p><p>The trade-off, however, is the increased database size due to the accumulation of dead tuples, necessitating regular maintenance to reclaim space and maintain performance… which brings us to table bloat.</p><h3 id="what-is-table-bloat">What is table bloat?</h3><p>When a tuple is deleted or updated, its old instance is considered a dead tuple. The issue with dead tuples is that they’re effectively still a tuple on disk, taking up storage space—yes, that storage page that is costing you money every month. </p><p>Table bloat refers to this excess space that dead tuples occupy in your PostgreSQL database, which not only leads to an inflated table size but also to increased I/O and slower queries. Since PostgreSQL runs under the MVCC (multi-version concurrency control) system, it doesn't immediately purge these dead tuples from the disk. Instead, they linger until a vacuum process reclaims their space.</p><p>Table bloat also occurs when a table contains unused pages, which can accumulate as a result of operations such as mass deletes.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2023/10/table-bloat.png" class="kg-image" alt="A visualization of table bloat in PostgreSQL. Pages contain many dead tuples and a lot of empty space" loading="lazy" width="1698" height="544" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2023/10/table-bloat.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2023/10/table-bloat.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2023/10/table-bloat.png 1600w, https://timescale.ghost.io/blog/content/images/2023/10/table-bloat.png 1698w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">A visualization of table bloat in PostgreSQL. 
Pages contain many dead tuples and a lot of empty space</em></i></figcaption></figure><h3 id="what-is-vacuum">What is <code>VACUUM</code>?</h3><p>Dead tuples get cleaned and deleted from storage when the <code>VACUUM</code> command runs: </p><pre><code class="language-SQL">VACUUM customers;</code></pre><p><a href="https://www.postgresql.org/docs/current/sql-vacuum.html">Vacuum</a> has a lot of roles, but the relevant point for this article is that vacuum removes dead tuples once all connections using the dead tuples are closed. <code>VACUUM</code> by itself will not delete pages, though. Any pages created by a table will stay allocated, although the memory in those pages is now usable space after running vacuum.</p><h3 id="what-is-autovacuum">What is autovacuum? </h3><p>Postgres conveniently includes a daemon to automatically run vacuum on tables that get heavy insert, update, and delete traffic. It operates in the background, monitoring the database to identify tables with accumulating dead tuples and then initiating the vacuum process autonomously. </p><p>Autovacuum comes enabled by default, although the threshold PostgreSQL uses to enable autovacuum is very conservative. </p><h3 id="what-is-vacuum-full">What is VACUUM FULL?</h3><p>Autovacuum helps with dead tuples, but what about unused pages? </p><p>The <code>VACUUM FULL</code> command is a more aggressive version of <code>VACUUM</code> that locks the table, removes dead tuples and empty pages, and then returns the reclaimed space to the operating system. <code>VACUUM FULL</code> can be resource-intensive and requires an exclusive lock on the table during the process. 
We’ll come back to this later.</p><p>Now that you have the necessary context, let’s jump into the advice.</p><h2 id="how-to-reduce-your-postgresql-database-size">How To Reduce Your PostgreSQL Database Size</h2><h3 id="use-timescale-compression">Use Timescale compression </h3><p>There are different ways we can compress our data to consistently save storage space. <a href="https://www.postgresql.org/docs/current/storage-toast.html">PostgreSQL has some compression mechanisms</a>, but if you want to take data compression even further, especially for time-series data, you should use <a href="https://www.tigerdata.com/blog/building-columnar-compression-in-a-row-oriented-database" rel="noreferrer">Timescale’s columnar compression</a>.</p><p>It allows you to dramatically compress data through a provided <code>add_compression_policy()</code> function. To achieve high compression rates, <a href="https://timescale.ghost.io/blog/time-series-compression-algorithms-explained/">Timescale uses various compression techniques</a> depending on data types to reduce your data footprint. Timescale also uses columnar storage to merge many rows into a single row, saving space.</p><p>Let's illustrate how this works with an example.</p><p>Let’s say we have a <a href="https://www.tigerdata.com/blog/database-indexes-in-postgresql-and-timescale-cloud-your-questions-answered" rel="noreferrer">hypertable</a> with a week's worth of data. Imagine that our application generally only needs data from the last day, but we must keep historical data around for reporting purposes. We could run <code>SELECT add_compression_policy('my_table', INTERVAL '24 hours');</code>, which automatically compresses rows in the <code>my_table</code> hypertable older than 24 hours. </p><p>Timescale’s compression would combine batches of up to 1,000 rows into a single row, where each column stores an array of the batch’s values. 
Visually, this would take a table that looks like this:</p><pre><code class="language-SQL">| time                   | location | temperature |
|------------------------|----------|-------------|
| 2023-09-20 00:16:00.00 | garage   | 80          |
| 2023-09-21 00:10:00.00 | attic    | 92.3        |
| 2023-09-22 00:05:00.00 | basement | 73.9        |
</code></pre><p>And compress it down to a table like this:</p><pre><code class="language-SQL">| time                                                                     | location                    | temperature               |
|--------------------------------------------------------------------------|-----------------------------|---------------------------|
| [2023-09-20 00:16:00.00, 2023-09-21 00:10:00.00, 2023-09-22 00:05:00.00] | [garage, attic, basement]   | [80, 92.3, 73.9]          |
</code></pre><p>To see exactly how much space we can save, let’s run compression on a table filled with a week’s worth of sample data (50 rows per day for the last seven days) that looks like this:</p><pre><code class="language-SQL">CREATE TABLE conditions (
  time        TIMESTAMPTZ       NOT NULL,
  location    TEXT              NOT NULL,
  temperature DOUBLE PRECISION  NULL
);

SELECT create_hypertable('conditions', 'time');
</code></pre><p>Next, we’d enable compression on <code>conditions</code> and add a policy that compresses rows older than one day:</p><pre><code class="language-SQL">ALTER TABLE conditions SET (timescaledb.compress);

SELECT add_compression_policy('conditions', INTERVAL '1 day');</code></pre><p>In the Timescale platform, if we navigate to the <a href="https://docs.timescale.com/use-timescale/latest/services/service-explorer/">Explorer tab</a> under Services, we’d see our table shrink from 72 kB to 16&nbsp;kB, a 78% savings!</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2023/10/Screenshot-2023-10-06-at-11.50.48-AM.png" class="kg-image" alt="The Timescale console showing a 78% space reduction in table size due to compression" loading="lazy" width="1858" height="390" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2023/10/Screenshot-2023-10-06-at-11.50.48-AM.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2023/10/Screenshot-2023-10-06-at-11.50.48-AM.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2023/10/Screenshot-2023-10-06-at-11.50.48-AM.png 1600w, https://timescale.ghost.io/blog/content/images/2023/10/Screenshot-2023-10-06-at-11.50.48-AM.png 1858w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">The Timescale console showing a 78% space reduction in table size due to compression</em></i></figcaption></figure><p>This is a simple example, but it shows the potential that Timescale compression has to reduce storage space. 
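You can also check these numbers directly from SQL. As a sketch (assuming the compressed <code>conditions</code> hypertable from above, once the policy has compressed some chunks), TimescaleDB’s <code>hypertable_compression_stats()</code> function reports before-and-after sizes:</p><pre><code class="language-SQL">-- Compare the hypertable's size before and after compression
SELECT pg_size_pretty(before_compression_total_bytes) AS before,
       pg_size_pretty(after_compression_total_bytes) AS after
FROM hypertable_compression_stats('conditions');</code></pre><p>The ratio between the two columns should line up with the savings shown in the console.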
</p><h3 id="monitor-dead-tuples">Monitor dead tuples</h3><p>A great practice to ensure you’re using as little storage as possible is to consistently monitor the number of dead tuples in each table. This is the first step towards putting together an efficient PostgreSQL storage management strategy.</p><p>To see pages and tuples in action, you can use <a href="https://www.postgresql.org/docs/current/pgstattuple.html"><code>pgstattuple()</code></a>, a contrib extension that ships with PostgreSQL and offers insights into how your tables manage tuples:</p><pre><code class="language-sql">CREATE EXTENSION IF NOT EXISTS pgstattuple;</code></pre><p>If you run the following query, </p><pre><code class="language-sql">SELECT * FROM pgstattuple('my_table');</code></pre><p>Postgres would give you a table of helpful information in response:</p><pre><code class="language-sql"> table_len | tuple_count | tuple_len | tuple_percent | dead_tuple_count | dead_tuple_len | dead_tuple_percent | free_space | free_percent 
-----------+-------------+-----------+---------------+------------------+----------------+--------------------+------------+--------------
  81920000 |      500000 |  40000000 |          48.8 |            10000 |        1000000 |                1.2 |     300000 |          0.4</code></pre><ul><li><code>table_len</code> is the physical size of the table in bytes, including live data, dead tuples, and free space (indexes and TOAST tables are separate relations with their own sizes).</li><li><code>dead_tuple_len</code> tells you how much space is occupied by dead tuples, which can be reclaimed by vacuuming.</li><li><code>free_space</code> indicates the unused space within the allocated pages of the table. This space stays allocated to the table and is reused for new rows before new pages are created.</li></ul><p>You can also perform calculations or transformations on the result to make the information more understandable. For example, this query calculates the ratios of dead tuples and free space to the total table length, giving you a clearer perspective on the storage efficiency of your table:</p><pre><code class="language-sql">SELECT
  (dead_tuple_len * 100.0 / table_len) AS dead_tuple_ratio,
  (free_space * 100.0 / table_len) AS free_space_ratio
FROM
  pgstattuple('my_table');</code></pre><h3 id="run-autovacuum-more-frequently">Run autovacuum more frequently</h3><p>If your table is experiencing table bloat, having autovacuum run more frequently may help you free up wasted storage space.</p><p>The default thresholds and values for autovacuum live in <code>postgresql.conf</code>. Updating <code>postgresql.conf</code> will change the autovacuum behavior for the whole Postgres instance. However, this is generally not recommended, since some tables accumulate dead tuples much faster than others.</p><p>Instead, you should update autovacuum’s settings per table. For example, consider the following statement:</p><pre><code class="language-SQL">ALTER TABLE my_table SET (autovacuum_vacuum_scale_factor = 0, autovacuum_vacuum_threshold = 200);</code></pre><p>This will update <code>my_table</code> so that autovacuum runs once 200 dead tuples have accumulated from updates and deletes. </p><p>More information about additional autovacuum settings is available in the <a href="https://www.postgresql.org/docs/current/runtime-config-autovacuum.html">PostgreSQL documentation</a>. Each database and table will require different settings for how often autovacuum should run, but vacuuming often is a great way to reduce storage space.</p><p>Also, keep an eye on long-running transactions that might block autovacuum, leading to issues. You can use PostgreSQL’s <code>pg_stat_activity</code> view to identify such transactions, canceling them if necessary to allow autovacuum to complete its operations efficiently:</p><pre><code class="language-sql">SELECT pid, NOW() - xact_start AS duration, query, state
FROM pg_stat_activity
WHERE (NOW() - xact_start) &gt; INTERVAL '5 minutes';

-- Cancel a specific backend (substitute the pid returned above):
SELECT pg_cancel_backend(pid);</code></pre><p>You could also inspect long-running vacuum processes and adjust the <code>autovacuum_work_mem</code> parameter to increase the memory allocated to each autovacuum invocation, <a href="https://www.timescale.com/learn/postgresql-performance-tuning-key-parameters">as we discussed in our article about PostgreSQL fine-tuning</a>.</p><h3 id="reclaim-unused-pages">Reclaim unused pages</h3><p>Autovacuum and vacuum will free up dead tuples, but you’ll need the big guns to clean up unused pages. </p><p>As we saw previously, running <code>VACUUM FULL my_table</code> will reclaim pages, but it has a significant problem: it exclusively locks the entire table. A table running <code>VACUUM FULL</code> cannot be read or written to while the vacuum holds the lock, and the process can take a long time to finish. This is usually an instant no-go for any production database.</p><p>The PostgreSQL community has a solution: <a href="https://github.com/reorg/pg_repack">pg_repack</a>. <code>pg_repack</code> is an extension that cleans up unused pages and bloat by cloning a given table, swapping the original table with the new one, and then deleting the old table. All these operations are done with minimal exclusive locking, leading to less downtime. </p><p>At the end of the <code>pg_repack</code> process, the pages associated with the original table are deleted from storage, and the new table has only the minimum number of pages needed to store its rows, thus eliminating table bloat.</p><h3 id="find-unused-indexes">Find unused indexes </h3><p>As we mention <a href="https://www.timescale.com/learn/postgresql-performance-tuning-optimizing-database-indexes">in this article on indexing design</a>, over-indexing is a frequent issue in many large PostgreSQL databases. Indexes consume disk space, so removing unused or underutilized indexes will help you keep your PostgreSQL database lean. 
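A quick way to gauge how much of a table’s footprint comes from indexes is PostgreSQL’s built-in sizing functions (shown here against a hypothetical <code>my_table</code>):</p><pre><code class="language-sql">-- Total size of all indexes on the table
SELECT pg_size_pretty(pg_indexes_size('my_table'));

-- Compare it with the size of the table itself
SELECT pg_size_pretty(pg_table_size('my_table'));</code></pre><p>If the index footprint rivals or exceeds the table’s, it’s worth checking which indexes are actually being used.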
</p><p>You can use <code>pg_stat_user_indexes</code> to spot opportunities: </p><pre><code class="language-sql">SELECT
  relname AS table_name,
  indexrelname AS index_name,
  pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
  idx_scan AS index_scan_count
FROM
  pg_stat_user_indexes
WHERE
  idx_scan &lt; 50 -- Choose a threshold that makes sense for your application.
ORDER BY
  index_scan_count ASC,
  pg_relation_size(indexrelid) DESC;</code></pre><p>(This query looks for indexes with fewer than 50 scans, but this is an arbitrary number. You should adjust it based on your own usage patterns.)</p><h3 id="arrange-columns-by-data-type-from-largest-to-smallest">Arrange columns by data type (from largest to smallest)</h3><p>In PostgreSQL, storage efficiency is significantly influenced by the ordering of columns, which is closely related to alignment padding determined by the size of the column types. Each data type is aligned at memory addresses that are multiples of its alignment requirement (typically its size). </p><p>This alignment is systematic, ensuring that data retrieval is efficient and that the architecture adheres to specific memory and storage management protocols. But it can also lead to unused space, as the alignment necessitates padding to meet the address-multiple criteria.</p><p>The way to fix this is to strategically order your columns from the largest to the smallest data type in your table definitions. This practical tip will help you minimize wasted space. <a href="https://www.timescale.com/learn/postgresql-performance-tuning-designing-and-implementing-database-schema">Check out this article for a more in-depth explanation</a>.</p><h3 id="delete-old-data-regularly">Delete old data regularly </h3><p>You should always ask yourself: how long should I keep data around? Setting up data retention policies is essential for managing storage appropriately. Your users may not need data older than a year. Deleting old, unused records and indexes regularly is an easy win to reduce your database size.  </p><p>Timescale can automatically delete old data for us using <a href="https://docs.timescale.com/use-timescale/latest/data-retention/about-data-retention/">retention policies</a>. Timescale’s hypertables are <a href="https://timescale.ghost.io/blog/when-to-consider-postgres-partitioning/">automatically partitioned by time</a>, which helps a lot with data retention.  
Retention policies automatically delete partitions (which are called chunks in Timescale) once the data they contain is older than a given interval. </p><p>You can <a href="https://docs.timescale.com/use-timescale/latest/data-retention/create-a-retention-policy/">create a retention policy</a> by running:</p><pre><code class="language-SQL">SELECT add_retention_policy('my_table', INTERVAL '24 hours');</code></pre><p>In this snippet, Timescale would delete chunks from <code>my_table</code> once they are older than 24 hours.</p><h2 id="wrap-up">Wrap-Up</h2><p>We examined how table bloat and dead tuples contribute to wasted storage space, which affects not only your bill but also the performance of your large PostgreSQL tables. </p><p>To reduce your PostgreSQL database size to its minimum, enable Timescale compression, use data retention policies, and set up a maintenance routine to periodically delete your dead tuples and reclaim your unused pages.  </p><p>All these techniques together provide a holistic approach to maintaining a healthy PostgreSQL database and keeping your PostgreSQL database costs low.</p>]]></content:encoded>
        </item>
    </channel>
</rss>