<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[Tiger Data Blog]]></title>
        <description><![CDATA[Insights, product updates, and tips from TigerData (Creators of TimescaleDB) engineers on Postgres, time series & AI. IoT, crypto, and analytics tutorials & use cases.]]></description>
        <link>https://www.tigerdata.com/blog</link>
        <image>
            <url>https://www.tigerdata.com/icon.ico</url>
            <title>Tiger Data Blog</title>
            <link>https://www.tigerdata.com/blog</link>
        </image>
        <generator>RSS for Node</generator>
        <lastBuildDate>Tue, 07 Apr 2026 09:51:55 GMT</lastBuildDate>
        <atom:link href="https://www.tigerdata.com/blog" rel="self" type="application/rss+xml"/>
        <ttl>60</ttl>
        <item>
            <title><![CDATA[Best Practices for Query Optimization on PostgreSQL]]></title>
            <description><![CDATA[Explore the optimization of PostgreSQL queries using tactics such as efficient indexing and partitioning, and judicious use of data types.]]></description>
            <link>https://www.tigerdata.com/blog/best-practices-for-query-optimization-in-postgresql</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/best-practices-for-query-optimization-in-postgresql</guid>
            <category><![CDATA[Product & Engineering]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[PostgreSQL Performance]]></category>
            <category><![CDATA[PostgreSQL Tips]]></category>
            <dc:creator><![CDATA[Team Tiger Data]]></dc:creator>
            <pubDate>Fri, 08 Dec 2023 15:18:00 GMT</pubDate>
<media:content medium="image" url="https://timescale.ghost.io/blog/content/images/2023/12/Best-Practices-for-Query-Optimization-on-PostgreSQL-1.png">
            </media:content>
            <content:encoded><![CDATA[<p>The demands of modern applications and the exponential growth of data in today’s data-driven world have put immense pressure on databases. Traditional relational databases, including PostgreSQL, are increasingly being pushed to their limits as they struggle to cope with the sheer scale of data that needs to be processed and analyzed, requiring constant query optimization practices and performance tweaks.</p><p>Having built our product on PostgreSQL, at Timescale we’ve written extensively on the topic of tweaking your PostgreSQL database performance, from <a href="https://www.timescale.com/learn/postgresql-performance-tuning-how-to-size-your-database"><u>how to size your database</u></a>, <a href="https://www.timescale.com/learn/postgresql-performance-tuning-key-parameters"><u>key PostgreSQL parameters</u></a>, <a href="https://www.timescale.com/learn/postgresql-performance-tuning-optimizing-database-indexes"><u>database indexes</u></a>, and <a href="https://www.timescale.com/learn/postgresql-performance-tuning-designing-and-implementing-database-schema"><u>designing your schema</u></a> to <a href="https://www.timescale.com/learn/when-to-consider-postgres-partitioning"><u>PostgreSQL partitioning</u></a>.</p><p>In this article, we aim to explore best practices for enhancing query optimization in PostgreSQL. We’ll offer insights into optimizing queries, the importance of indexing, data type selection, and the implications of fluctuating data volumes and high transaction rates.</p><p>Let’s jump right in.</p><h2 id="why-is-postgresql-query-optimization-necessary">Why Is PostgreSQL Query Optimization Necessary?</h2>
<p>Optimizing your PostgreSQL queries becomes a necessity when performance issues significantly impact the efficiency and functionality of your PostgreSQL database, making your application sluggish and impacting the user experience. Before we dive into the solutions, let’s look at some of the key contributors to performance issues in PostgreSQL:</p>
<p><strong>Inefficient queries</strong>: The impact of poorly optimized or complex queries on PostgreSQL's performance is profound. These queries act as significant bottlenecks, impeding data processing efficiency and overall database throughput. Regular analysis and refinement of these query structures are not just beneficial but crucial for maintaining optimal database performance. <a href="https://www.postgresql.org/docs/current/queries.html">Understanding and optimizing SQL queries</a> is essential for efficient database operations. This knowledge is pivotal for developing efficient and responsive database operations, ensuring the database's capability to handle complex <a href="https://www.tigerdata.com/learn/understanding-database-workloads-variable-bursty-and-uniform-patterns">data workloads</a> effectively.</p>
<p><strong>Insufficient indexes</strong>: Inadequate indexing can significantly slow down query execution in PostgreSQL. Strategically implementing indexes, particularly on columns that are frequently accessed, can drastically enhance performance and optimize database responsiveness. <a href="https://timescale.ghost.io/blog/use-composite-indexes-to-speed-up-time-series-queries-sql-8ca2df6b3aaa/">Effective indexing strategies</a> are not only crucial for accelerating query speeds but also play a central role in optimizing the efficiency of complex queries and large-scale data operations, ensuring a more responsive and robust database environment.</p>
<p><strong>Over-indexing</strong>: While it's true that insufficient indexing can hurt your <a href="https://www.tigerdata.com/learn/postgres-performance-best-practices">PostgreSQL performance</a>, it's equally important not to overdo it. Excessive indexes can lead to their own set of challenges: each additional index introduces overhead during your <code>INSERT</code>s, <code>UPDATE</code>s, and <code>DELETE</code>s; indexes also consume disk space and can make database maintenance tasks (such as vacuuming) more time-consuming.</p>
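<p>PostgreSQL's statistics views make it straightforward to spot indexes that aren't earning their keep. As a sketch (results reflect activity since statistics were last reset, so interpret them with care):</p><pre><code class="language-sql">-- Indexes that have never been scanned are candidates for removal
SELECT schemaname,
       relname AS table_name,
       indexrelname AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
</code></pre>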
<p><strong>Inappropriate data types</strong>: Using unsuitable data types in PostgreSQL can lead to increased storage usage and slower query execution, as inappropriate types may need additional processing and can occupy more storage space than necessary. Carefully <a href="https://timescale.ghost.io/blog/best-practices-for-picking-postgresql-data-types/">selecting and optimizing data types</a> to align with the specific characteristics of the data is a critical aspect of database optimization. The right choice of data types not only influences overall database performance but also contributes to storage efficiency. Additionally, it helps in avoiding costly type conversions during <a href="https://www.tigerdata.com/learn/guide-to-postgresql-database-operations">database operations</a>, thereby streamlining data processing and retrieval.</p>
<p><strong>Fluctuating data volume</strong>: PostgreSQL's query planner relies on up-to-date data statistics to formulate efficient execution plans. Fluctuations in data volume can significantly impact these plans, potentially leading to suboptimal performance if the planner operates on outdated information. As data volumes change, it becomes crucial to regularly assess and adapt execution plans to these new conditions. Keeping the database statistics current is essential, as it enables the query planner to accurately assess the data landscape and make informed decisions, thereby optimizing query performance and ensuring the database responds effectively to varying data loads.</p>
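<p>Refreshing statistics is a one-liner; autovacuum handles it automatically, but it's worth running by hand after bulk loads. A minimal sketch, assuming a hypothetical <code>conditions</code> table:</p><pre><code class="language-sql">-- Refresh planner statistics for one table
ANALYZE conditions;

-- For columns with skewed distributions, collect a larger sample first
ALTER TABLE conditions ALTER COLUMN device SET STATISTICS 500;
ANALYZE conditions;
</code></pre>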
<p><strong>High transaction volumes</strong>: Large numbers of transactions can significantly strain PostgreSQL's resources, especially in high-traffic or data-intensive environments. Effectively leveraging <a href="https://docs.timescale.com/use-timescale/latest/ha-replicas/read-scaling/#read-replicas">read replicas</a> in PostgreSQL can substantially mitigate the impact of high transaction volumes, ensuring a more efficient and robust database environment.</p>
<p><strong>Hardware limitations</strong>: Constraints in CPU, memory, or storage can create significant bottlenecks in PostgreSQL's performance, as these hardware limitations directly affect the database's ability to process queries, handle concurrent operations, and store data efficiently. Upgrading hardware components, such as increasing CPU speed, expanding memory capacity, or adding more storage, can provide immediate improvements in performance. Additionally, optimizing resource allocation, like adjusting memory distribution for different database processes or balancing load across storage devices, can also effectively alleviate these hardware limitations.</p>
<p><strong>Lock contention</strong>: Excessive locking on tables or rows in PostgreSQL, particularly in environments that handle parallel queries, can lead to significant slowdowns, inconsistent data, and locking issues. This is because row-level or table-level locks can restrict data access, leading to increased waiting times for other operations and potentially causing queuing delays. Therefore, <a href="https://timescale.ghost.io/blog/how-timescaledb-solves-common-postgresql-problems-in-database-operations-with-data-retention-management/">judicious use of locks</a> is crucial in maintaining database concurrency and ensuring smooth operation. Strategies such as using less restrictive lock types, designing transactions to minimize locked periods, and optimizing query execution plans can help reduce lock contention.</p>
<p><strong>Lack of maintenance</strong>: Routine maintenance tasks such as vacuuming, reindexing, and updating statistics are fundamental to sustaining optimal performance in PostgreSQL databases. Vacuuming is essential for reclaiming storage space and preventing <a href="https://timescale.ghost.io/blog/how-to-fix-transaction-id-wraparound/">transaction ID wraparound issues</a>, ensuring the database remains efficient and responsive. Regular reindexing is crucial for maintaining the speed and efficiency of index-based query operations, as indexes can become fragmented over time. Additionally, keeping statistics up-to-date is vital for the query planner to make well-informed decisions, as outdated statistics can lead to suboptimal query plans. Ignoring these tasks can lead to a gradual but significant deterioration in database efficiency and reliability.</p>
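<p>These routine tasks map to a handful of commands; as a sketch (the table and index names here are illustrative):</p><pre><code class="language-sql">-- Reclaim dead-tuple space and refresh statistics in one pass
VACUUM (ANALYZE) conditions;

-- Rebuild a fragmented index without blocking writes (PostgreSQL 12+)
REINDEX INDEX CONCURRENTLY conditions_time_idx;
</code></pre>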
<h2 id="how-to-measure-query-performance-in-postgresql">How to Measure Query Performance in PostgreSQL</h2>
<h3 id="pgstatstatements">pg_stat_statements</h3>
<p>To optimize your queries, you must first identify your PostgreSQL performance bottlenecks. A simple way to do this is using <code>pg_stat_statements</code>, a <a href="https://www.tigerdata.com/blog/top-8-postgresql-extensions">PostgreSQL extension</a> that provides essential information about query performance. It records data about running queries, helping to identify performance slowdowns caused by inefficient queries, index changes, or ORM query generators. Notably, <code>pg_stat_statements</code> is enabled by default in TimescaleDB, enhancing its capability to monitor and optimize database performance out of the box.</p>
<p>You can query <code>pg_stat_statements</code> to gather various statistics such as the number of times a query has been called, total execution time, rows retrieved, and cache hit ratios:</p>
<ul>
<li>
<p><strong>Identifying long-running queries</strong>: Focus on queries with high mean and total execution times, adjusting the minimum <code>calls</code> threshold to fit your application's workload.</p>
</li>
<li>
<p><strong>Cache hit ratio</strong>: This metric measures how often the data a query needs is already in memory; a low ratio means more disk reads and slower queries.</p>
</li>
<li>
<p><strong>Standard deviation in query execution time</strong>: Analyzing the standard deviation can reveal the consistency of query execution times, helping to identify queries with significant performance variability.</p>
</li>
</ul>
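<p>Putting these together, a query along the following lines surfaces all three signals at once. This is a sketch using the PostgreSQL 13+ column names (<code>total_exec_time</code>, <code>mean_exec_time</code>, <code>stddev_exec_time</code>); older versions use <code>total_time</code> and friends:</p><pre><code class="language-sql">SELECT query,
       calls,
       total_exec_time,
       mean_exec_time,
       stddev_exec_time,
       rows,
       100.0 * shared_blks_hit /
           NULLIF(shared_blks_hit + shared_blks_read, 0) AS cache_hit_ratio
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
</code></pre>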
<h3 id="insights-by-timescale">Insights by Timescale</h3>
<p>Timescale’s Insights (available to Timescale users at no extra cost) is a tool providing in-depth observation of PostgreSQL queries over time. It offers detailed statistics on query timing, latency, and memory and storage I/O usage, enabling users to comprehensively monitor and analyze their query and database performance.</p>
<ul>
<li>
<p><strong>Scalable query collection system</strong>: Insights is built on a scalable system that collects sanitized statistics on every query stored in Timescale, facilitating comprehensive analysis and optimization. <a href="https://timescale.ghost.io/blog/how-we-scaled-postgresql-to-350-tb-with-10b-new-records-day/">Best of all, the team dogfoods its own product to power this tool</a>, scaling PostgreSQL to hundreds of terabytes of data (and growing).</p>
</li>
<li>
<p><strong>Insights interface</strong>: The tool presents a graph showing the relationship between system resources (CPU, memory, disk I/O) and query latency.</p>
</li>
</ul>
<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2023/12/Best-Practices-for-Query-Optimization-on-PostgreSQL_Insights.png" class="kg-image" alt="Insights offers a drill-down view with finer-grain metrics for quick query optimization" loading="lazy" width="2000" height="1564" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2023/12/Best-Practices-for-Query-Optimization-on-PostgreSQL_Insights.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2023/12/Best-Practices-for-Query-Optimization-on-PostgreSQL_Insights.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2023/12/Best-Practices-for-Query-Optimization-on-PostgreSQL_Insights.png 1600w, https://timescale.ghost.io/blog/content/images/size/w2400/2023/12/Best-Practices-for-Query-Optimization-on-PostgreSQL_Insights.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Insights offers a drill-down view with finer-grain metrics for quick query optimization</span></figcaption></figure><ul>
<li>
<p><strong>Detailed query information</strong>: Insights provides a table of the top 50 queries based on chosen criteria, offering insights into query frequency, affected rows, and usage of Timescale features like hypertables and continuous aggregates.</p>
</li>
<li>
<p><strong>Drill-down view</strong>: Insights offers a drill-down view with finer-grain metrics, including trends in latency, buffer usage, and cache utilization.</p>
</li>
<li>
<p><strong>Real-world application</strong>: Check out the <a href="https://twitter.com/gooddaymax/status/1717573289285996889?s=20&amp;ref=timescale.com">Humblytics</a> story, which demonstrates Insights' practical application in identifying and resolving performance issues.</p>
</li>
</ul>
<h2 id="best-practices-for-query-optimization-in-postgresql">Best Practices for Query Optimization in PostgreSQL</h2>
<h3 id="understand-common-performance-bottlenecks">Understand common performance bottlenecks</h3>
<p>To effectively identify inefficient queries so you can optimize them, analyze query execution plans using PostgreSQL's <code>EXPLAIN</code> command. This tool provides a breakdown of how your queries are executed, revealing critical details such as execution paths and the use of indexes. Look specifically for patterns like full table scans, which suggest missing indexes or queries consuming high CPU or memory, indicating potential optimizations. By understanding the intricacies of the execution plan, you can pinpoint exactly where performance issues are occurring.</p>
<p>For example, you could run the following code to view the execution plan of a query:</p>
<pre><code class="language-sql">EXPLAIN SELECT * FROM your_table WHERE your_column = 'value';
</code></pre>
<p>To identify full table scans, you would look for “Seq Scan” in the output. This suggests that the query is scanning the entire table, which is often a sign that an index is missing or not being used effectively:</p>
<pre><code class="language-SQL">Seq Scan on large_table  (cost=0.00..1445.00 rows=50000 width=1024)
</code></pre>
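<p>The usual remedy is an index on the filtered column, after which the plan should change. A sketch using the placeholder names from the example above:</p><pre><code class="language-sql">-- Build the index without blocking concurrent writes
CREATE INDEX CONCURRENTLY idx_your_table_your_column
    ON your_table (your_column);

-- Re-run EXPLAIN; the plan should now show an Index Scan or Bitmap Index Scan
EXPLAIN SELECT * FROM your_table WHERE your_column = 'value';
</code></pre>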
<div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-emoji">📚</div><div class="kg-callout-text"><a href="https://www.postgresql.org/docs/current/using-explain.html"><u>Check out the PostgreSQL documentation for more examples of what to look for when using </u><u><code spellcheck="false" style="white-space: pre-wrap;">EXPLAIN</code></u></a><u>.</u></div></div><h3 id="partition-your-data">Partition your data</h3>
<p>Partitioning large PostgreSQL tables is a powerful strategy to enhance their speed and efficiency. However, the process of setting up and maintaining partitioned tables can be burdensome, often requiring countless hours of manual configurations, testing, and maintenance. But, there's a more efficient solution: hypertables. Available through the TimescaleDB extension and on AWS via the Timescale platform, hypertables simplify the PostgreSQL partition creation process significantly by automating the generation and management of <a href="https://www.tigerdata.com/learn/data-partitioning-what-it-is-and-why-it-matters">data partitions</a> without altering your user experience. Behind the scenes, however, hypertables work their magic, accelerating your queries and ingest operations.</p>
<p>To create a <a href="https://www.tigerdata.com/blog/database-indexes-in-postgresql-and-timescale-cloud-your-questions-answered">hypertable</a>, create a regular PostgreSQL table:</p>
<pre><code class="language-SQL">CREATE TABLE conditions (
   time        TIMESTAMPTZ       NOT NULL,
   location    TEXT              NOT NULL,
   device      TEXT              NOT NULL,
   temperature DOUBLE PRECISION  NULL,
   humidity    DOUBLE PRECISION  NULL
);
</code></pre>
<p>Then, convert the table to a hypertable. Specify the name of the table you want to convert and the column that holds its time values.</p>
<pre><code class="language-SQL">SELECT create_hypertable('conditions', by_range('time'));
</code></pre>
<div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-emoji">📚</div><div class="kg-callout-text">Want to learn more about hypertables? <a href="https://www.timescale.com/learn/pg_partman-vs-hypertables-for-postgres-partitioning"><u>Check out this comparison of pg_partman vs. hypertables.</u></a></div></div><h3 id="employ-partial-aggregation-for-complex-queries">Employ partial aggregation for complex queries</h3>
<p><a href="https://docs.timescale.com/use-timescale/latest/continuous-aggregates/about-continuous-aggregates/">Continuous aggregates</a> in TimescaleDB are a powerful tool to improve the performance of commonly accessed aggregate queries over large volumes. Continuous aggregates are based on PostgreSQL materialized views but incorporate incremental and automatic refreshes so they are always up-to-date and remain performant as the underlying dataset grows.</p>
<p>In the example below, we’re setting up a continuous aggregate for daily average temperatures, which is remarkably simple.</p>
<pre><code class="language-sql">CREATE MATERIALIZED VIEW daily_temp_avg
WITH (timescaledb.continuous)
AS
SELECT time_bucket('1 day', time) AS bucket,
       AVG(temperature) AS avg_temp
FROM conditions
GROUP BY bucket;
</code></pre>
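<p>Continuous aggregates stay current by way of a refresh policy. As a sketch, this attaches one with TimescaleDB's <code>add_continuous_aggregate_policy</code> (the interval values here are illustrative and should be tuned to your data):</p><pre><code class="language-sql">SELECT add_continuous_aggregate_policy('daily_temp_avg',
    start_offset =&gt; INTERVAL '3 days',
    end_offset =&gt; INTERVAL '1 hour',
    schedule_interval =&gt; INTERVAL '1 hour');
</code></pre>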
<p>Learn how continuous aggregates can help you get real-time analytics or create a time-series graph.</p>
<h3 id="continuously-update-and-educate">Continuously update and educate</h3>
<p><a href="https://timescale.ghost.io/blog/read-before-you-upgrade-best-practices-for-choosing-your-postgresql-version/">Regularly updating PostgreSQL and TimescaleDB</a> is vital for performance and security. The upgrade process involves assessing changes, performance gains, security patches, and extension compatibility. Focus on best practices, which include upgrading major versions at minor version .2 for stability, consistently updating minor versions, and upgrading major versions when needed for functionality or security. Timescale further eases this process by handling minor updates automatically with no downtime and providing tools for testing major version upgrades, ensuring smooth transitions with minimal disruption.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Optimizing your queries in PostgreSQL doesn’t have to be a daunting task. While it involves understanding and addressing various factors, there is much you can do by adopting best practices, such as efficient indexing, judicious use of data types, regular database maintenance, and staying up-to-date with the latest PostgreSQL releases.</p>
<p>If you really want to extend PostgreSQL’s capabilities, <a href="https://console.cloud.timescale.com/signup">create a free Timescale account today</a>. Features such as hypertables, continuous aggregates, and advanced data management techniques significantly enhance PostgreSQL's ability to manage your demanding workloads effectively.</p>
<p>Written by <a href="https://pt.w3d.community/paulogio">Paulinho Giovannini</a></p>
]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[What Is TOAST (and Why It Isn’t Enough for Data Compression in Postgres)]]></title>
            <description><![CDATA[Postgres TOAST is often seen as a data compression mechanism in PostgreSQL, but it falls short of that task. Learn how TOAST really works and why there is a better alternative.]]></description>
            <link>https://www.tigerdata.com/blog/what-is-toast-and-why-it-isnt-enough-for-data-compression-in-postgres</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/what-is-toast-and-why-it-isnt-enough-for-data-compression-in-postgres</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Team Tiger Data]]></dc:creator>
            <pubDate>Wed, 25 Oct 2023 18:48:16 GMT</pubDate>
<media:content medium="image" url="https://timescale.ghost.io/blog/content/images/2023/10/What-Is-TOAST--and-Why-It-Isn-t-Enough-for-Data-Compression-in-Postgres-.png">
            </media:content>
            <content:encoded><![CDATA[<p>If you’re working with large databases in Postgres, this story will sound familiar. As your Postgres database keeps growing, your performance starts to decline, and you begin to worry about storage space—or, to be precise, how much you’ll pay for it. You love PostgreSQL, but there’s something you wish you had: a highly effective data compression mechanism.</p><p>PostgreSQL does have somewhat of a compression mechanism: <a href="https://www.postgresql.org/docs/current/storage-toast.html">TOAST</a> 🍞. In this post, we’ll walk you through how Postgres TOAST works and the different TOASTing strategies. <br><br>As much as we enjoy a good TOAST, we’ll discuss why this is not the kind of compression feature you need for reducing the storage footprint of modern large databases—and how, as the PostgreSQL enthusiasts that we are here at Timescale, we decided to build a more suitable compression mechanism for PostgreSQL, inspired by the columnar design of NoSQL databases.&nbsp;</p><h2 id="what-is-postgres-toast">What Is Postgres TOAST?</h2><p>Even if it might reduce the size of datasets, TOAST (The Oversized Attribute Storage Technique) is not your traditional data compression mechanism. To understand  TOAST, we have to start by understanding <a href="https://www.timescale.com/blog/how-to-reduce-your-postgresql-database-size" rel="noreferrer">how PostgreSQL stores data</a>.&nbsp;</p><p>Postgres’ storage units are called pages, and pages have a fixed size (8 kB by default). Having a fixed page size gives Postgres many advantages: <a href="https://www.timescale.com/blog/guide-to-postgres-data-management" rel="noreferrer">data management simplicity, efficiency, and consistency</a>. But there is a downside: some data values might not fit within that page.&nbsp;</p><p>This is where TOAST comes in. TOAST refers to the automatic mechanism that PostgreSQL uses to efficiently store and manage values in Postgres that do not fit within a page. 
To handle such values, Postgres TOAST will, by default, compress them using an internal algorithm. If, after compression, the values are still too large, Postgres will move them to a separate table (called the TOAST table), leaving pointers in the original table.&nbsp;</p><p>(As we’ll see later in this article, you can modify this strategy as a user, for example, by telling&nbsp;Postgres to avoid compressing data in a particular column.)</p><h2 id="toast-able-data-types">TOAST-able Data Types&nbsp;</h2><p>The data types subject to TOAST are primarily variable-length ones that have the potential to exceed the size limits of a standard PostgreSQL page. On the other hand, fixed-length data types, like <code>integer</code>, <code>float</code>, or&nbsp; <code>timestamp</code>, are not subjected to TOAST since they fit comfortably within a page.&nbsp;</p><p>Some examples of these data types are:&nbsp;</p><ul><li><code>json</code> and <code>jsonb</code></li><li>Large <code>text</code> strings</li><li><code>varchar</code> and <code>varchar(n)</code> (If the length specified in <code>varchar(n)</code> is small enough, then values of that column might always stay below the TOAST threshold.)</li><li><code>bytea</code> storing binary data</li><li>Geometric data like <code>path</code> and <code>polygon</code> and PostGIS types like&nbsp; <code>geometry</code> or <code>geography</code></li></ul><h2 id="how-does-postgres-toast-work">How Does Postgres TOAST Work?&nbsp;</h2><p>Understanding TOAST relates not only to page size but also to another Postgres storage concept: tuples. Tuples are rows in a PostgreSQL table. 
Typically, the TOAST mechanism kicks in if all fields within a tuple have a total size of over 2 kB approx.</p><p>If you’ve been paying attention, you might wonder, “Wait, but the page size is around 8&nbsp;kB—why is there overhead?” That’s because PostgreSQL likes to ensure it can store multiple tuples on a single page: if tuples are too large, fewer tuples fit on each page, leading to increased I/O operations and reduced performance. </p><p>Postgres also needs to keep free space to fit additional operational data: each page stores the tuple data <em>and</em> additional information for managing the data, such as item identifiers, headers, and transaction information.&nbsp;</p><p>So, when the combined size of all fields in a tuple exceeds approximately 2&nbsp;kB (or the TOAST threshold parameter, as we’ll see later), PostgreSQL takes action to ensure that the data is stored efficiently. TOAST handles this in two primary ways:</p><ol><li><strong>Compression.</strong> PostgreSQL can compress the large field values within the tuple to reduce their size using a <a href="https://www.tigerdata.com/blog/time-series-compression-algorithms-explained" rel="noreferrer">compression algorithm</a>. By default, if compression is sufficient to bring the tuple's total size below the threshold, the data will remain in the main table, albeit in a compressed format.</li><li><strong>Out-of-line storage.</strong> If compression alone isn't effective enough to reduce the size of the large field values, Postgres moves them to a separate TOAST table. This process is known as "out-of-line" storage because the original tuple in the main table doesn’t hold the large field values anymore. 
Instead, it contains a "pointer" or reference to the location of the large data in the TOAST table.&nbsp;</li></ol><p>(We’re simplifying things slightly for the purpose of this article—<a href="https://www.postgresql.org/docs/current/storage-toast.html">read the PostgreSQL documentation for a full detailed view.</a>)&nbsp;</p><h2 id="the-postgres-compression-algorithm-pglz">The Postgres Compression Algorithm: <code>pglz</code>&nbsp;</h2><p>We’ve mentioned that TOAST can compress large values in PostgreSQL. But which compression algorithm is PostgreSQL using, and how effective is it?&nbsp;</p><p>The <code>pglz</code> (PostgreSQL Lempel-Ziv) is the default internal compression algorithm used by PostgreSQL specifically tailored for TOAST. Here’s how it works in simple terms:</p><ul><li><code>pglz</code> tries to avoid repeated data. When it sees repeated data, instead of writing the same thing again, it just points back to where it wrote it before. This "avoiding repetition" helps in saving space.</li><li>As <code>pglz</code> reads through data, it remembers a bit of the recent data it has seen. This recent memory is the "sliding window."</li><li>As new data comes in, <code>pglz</code> checks if it has seen this data recently (within its sliding window). If yes, it writes a short reference instead of repeating the data.</li><li>If the data is new or not repeated enough times to make a reference shorter than the actual data, <code>pglz</code> just writes it down as it is.</li><li>When it's time to read the compressed data, <code>pglz</code> uses its references to fetch the original data. This process is quite direct, as it looks up the referred data and places it where it belongs.</li><li><code>pglz</code> doesn't need separate storage for its memory (the sliding window); it builds it on the go while compressing and does the same when decompressing.</li></ul><p>This implementation balances compression efficiency and speed within the TOAST mechanism. 
The compression rate effectiveness of <code>pglz</code> will largely depend on the nature of the data. </p><p>For example, highly repetitive data will compress much better than high entropy data (like random data). You might see compression ratios in the range of 25 to 50 percent, but this is a very general estimate—results will vary widely based on the exact nature of the data.</p><h2 id="configuring-toast">Configuring TOAST&nbsp;<br></h2><h3 id="toast-strategies">TOAST strategies&nbsp;</h3><p>By default, PostgreSQL will go through the TOAST mechanism according to the procedure explained earlier (compression first and out-of-line storage next, if compression is not enough). Still, there might be scenarios where you might want to fine-tune this behavior on a per-column basis. PostgreSQL allows you to do this using the TOAST strategies <code>EXTENDED</code>, <code>EXTERNAL</code>, <code>MAIN</code>, and&nbsp; <code>PLAIN</code>.</p><ul><li><strong><code>EXTENDED</code>: </strong>This is the default strategy. Data will be stored out of line in a separate TOAST table if it’s too large for a regular table page. Data will be compressed to save space before being moved to the TOAST table.</li><li><strong><code>EXTERNAL</code>: </strong>This strategy tells PostgreSQL to store the data for this column out of line if the data is too large to fit in a regular table page, and we’re asking PostgreSQL not to compress the data—the value will be moved to the TOAST table as-is.</li><li><strong><code>MAIN</code>: </strong>This strategy is a middle ground. It tries to keep data in line in the main table through compression; if the data is definitely too large, it will move the data to the TOAST table to avoid an error, but PostgreSQL won't move the compressed data. 
Instead, it will store the value in the TOAST table in its original form.&nbsp;</li><li><strong><code>PLAIN</code>: </strong>Using <code>PLAIN</code> in a column tells PostgreSQL to always store the column's data in line in the main table, ensuring it isn't moved to an out-of-line TOAST table. Take into account that if the data grows beyond the page size, the <code>INSERT</code> will fail because the data won’t fit.&nbsp;</li></ul><p>If you want to inspect the current strategies of a particular table, you can run the following:&nbsp;</p><pre><code class="language-SQL">\d+ your_table_name&nbsp;
</code></pre>
<p>You'll get an output like this: </p><pre><code class="language-sql">=&gt; \d+ example_table
                    Table "public.example_table"
 Column |      Type       | Modifiers | Storage  | Stats target | Description
--------+-----------------+-----------+----------+--------------+-------------
 bar    | varchar(100000) |           | extended |              |
</code></pre><p>If you wish to modify the storage setting, you can do so using the following command:&nbsp;</p><pre><code class="language-SQL">-- Sets EXTENDED as the TOAST strategy for the bar column
ALTER TABLE example_table ALTER COLUMN bar SET STORAGE EXTENDED;
</code></pre>
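<p>You can also read the strategy of each column from the system catalogs. The following is a sketch using the illustrative table name <code>example_table</code>; the <code>attstorage</code> column holds a one-letter code per column (<code>p</code> = <code>PLAIN</code>, <code>m</code> = <code>MAIN</code>, <code>e</code> = <code>EXTERNAL</code>, <code>x</code> = <code>EXTENDED</code>):&nbsp;</p><pre><code class="language-SQL">-- List each column's TOAST strategy code from pg_attribute
SELECT attname, attstorage
FROM pg_attribute
WHERE attrelid = 'example_table'::regclass
  AND attnum &gt; 0
  AND NOT attisdropped;
</code></pre>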
<h3 id="key-parameters">Key parameters&nbsp;</h3><p>Apart from the strategies above, these two parameters are also important to control TOAST behavior: &nbsp;</p><p><strong><code>TOAST_TUPLE_THRESHOLD</code></strong></p><p>This is the parameter that sets the size threshold for when TOASTing operations (compression and out-of-line storage) are considered for oversized tuples.</p><p>As previously mentioned, <code>TOAST_TUPLE_THRESHOLD</code> is set to approximately 2&nbsp;kB by default.</p><p><strong><code>TOAST_COMPRESSION_THRESHOLD</code></strong></p><p>This parameter specifies the minimum size of a value before Postgres considers compressing it during the TOASTing process.</p><p>If a value surpasses this threshold, PostgreSQL will attempt to compress it. However, just because a value is above the compression threshold, it doesn't automatically mean it will be compressed: the TOAST strategies will guide PostgreSQL on how to handle the data based on whether it was compressed and its resultant size relative to the tuple and page limits, as we’ll see in the next section.&nbsp;</p><h3 id="bringing-it-all-together">Bringing it all together</h3><p><code>TOAST_TUPLE_THRESHOLD</code> is the trigger point. When the size of a tuple's data fields combined exceeds this threshold, PostgreSQL will evaluate how to manage it based on the set TOAST strategy for its columns, considering compression and out-of-line storage. The exact actions taken will also depend on whether column data surpasses the <code>TOAST_COMPRESSION_THRESHOLD</code>.</p>
<!--kg-card-begin: html-->
<table style="border:none;border-collapse:collapse;"><colgroup><col width="160"><col width="307"><col width="233"><col width="233"></colgroup><tbody><tr style="height:0pt"><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;background-color:#f3f3f3;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Strategy&nbsp;</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;background-color:#f3f3f3;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Compress if tuple &gt; TOAST_COMPRESSION_THRESHOLD</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;background-color:#f3f3f3;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Store out-of-line if tuple &gt; TOAST_TUPLE_THRESHOLD</span></p></td><td 
style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;background-color:#f3f3f3;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Description&nbsp;</span></p></td></tr><tr style="height:0pt"><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">EXTENDED</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Yes</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" 
style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Yes</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Default strategy. Compresses first, then checks if out-of-line storage is needed.&nbsp;</span></p></td></tr><tr style="height:0pt"><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">MAIN</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span 
style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Yes</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Only as a last resort&nbsp;</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Compresses first; if still oversized, moves the compressed data to the TOAST table as a last resort.&nbsp;</span></p></td></tr><tr style="height:0pt"><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span
style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">EXTERNAL</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">No</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Yes</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Always moves to TOAST if oversized, without compression.</span></p></td></tr><tr style="height:0pt"><td 
style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">PLAIN</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">No</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">No</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span 
style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Data always stays in the main table. If a tuple exceeds the page size, an error occurs.&nbsp;</span></p></td></tr></tbody></table>
<!--kg-card-end: html-->
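<p>Note that <code>TOAST_TUPLE_THRESHOLD</code> is a compile-time constant, so you can't change it on a running server. However, since PostgreSQL 11 you can tune the closely related <code>toast_tuple_target</code> storage parameter on a per-table basis, which controls both when TOASTing kicks in and the size it tries to shrink tuples below. A minimal sketch (the table name is illustrative):&nbsp;</p><pre><code class="language-SQL">-- Only consider compression/out-of-line storage once tuples exceed ~4 kB
ALTER TABLE example_table SET (toast_tuple_target = 4096);
</code></pre>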
<p></p><h2 id="why-toast-isnt-enough-as-a-data-compression-mechanism-in-postgresql">Why TOAST Isn't Enough as a Data Compression Mechanism in PostgreSQL&nbsp;</h2><p>By now, you’ll probably understand why TOAST is not the data compression mechanism you wish you had in PostgreSQL. Modern applications ingest large volumes of data daily, which means databases grow quickly. </p><p>Such a problem was not as prominent when our beloved Postgres was built decades ago, but today’s developers need compression solutions for reducing the storage footprint of their datasets.&nbsp;</p><p>While TOAST incorporates compression as one of its techniques, it's crucial to understand that its primary role isn't to serve as a database compression mechanism in the traditional sense. TOAST is mainly a solution to one problem: managing large values within the structural confines of a Postgres page.&nbsp;</p><p>While this approach can lead to some storage space savings due to the compression of specific large values, its primary purpose is not to optimize storage space across the board. </p><p>For example, if you have a 5 TB database made up of small tuples, TOAST won’t help you turn those 5 TB into 1 TB. While there are parameters within TOAST that can be adjusted, this won't transform TOAST into a generalized storage-saving solution. </p><p>And there are other inherent problems with using TOAST as a traditional compression mechanism in PostgreSQL; for example: </p><ul><li>Accessing TOASTed data can add overhead, especially when the data is stored out of line. This overhead becomes more evident when many large text or other TOAST-able data types are frequently accessed.</li><li>TOAST lacks a high-level, user-friendly mechanism for dictating compression policies. It’s not built to optimize storage costs or facilitate storage management.</li><li>TOAST's compression is not designed to provide high compression ratios. 
It uses only one algorithm (<code>pglz</code>), with compression rates typically between 25 and 50 percent.&nbsp;</li></ul><h2 id="adding-columnar-compression-to-postgresql-with-timescale">Adding Columnar Compression to PostgreSQL With Timescale&nbsp;</h2><p><a href="https://docs.timescale.com/self-hosted/latest/install/" rel="noreferrer">Via the TimescaleDB extension</a>, PostgreSQL users have a better alternative. Inspired by the compression design of NoSQL databases, <a href="https://www.timescale.com/blog/building-columnar-compression-in-a-row-oriented-database" rel="noreferrer">we added columnar compression functionality to PostgreSQL</a>. This transformative approach transcends PostgreSQL’s conventional row-based storage paradigm, introducing the efficiency and performance of columnar storage. </p><p>By adding a compression policy to your large tables, <a href="https://www.timescale.com/blog/how-ndustrial-is-providing-fast-real-time-queries-and-safely-storing-client-data-with-97-compression/" rel="noreferrer"><strong>you can reduce your PostgreSQL database size by up to 10x (achieving +90 percent compression rates)</strong></a>. &nbsp;</p><div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-emoji">💪</div><div class="kg-callout-text">Ready for a compression faceoff? Read our <a href="https://www.timescale.com/blog/postgres-toast-vs-timescale-compression/" rel="noreferrer">PostgreSQL TOAST vs. Timescale Compression</a> comparison and see the numbers for yourself!</div></div><p><br></p><p>By defining a time-based compression policy, you indicate when data should be compressed. For instance, you might choose to compress data older than seven days automatically: &nbsp;</p><pre><code class="language-SQL">-- Compress data older than 7 days
SELECT add_compression_policy('my_hypertable', INTERVAL '7 days');
</code></pre>
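<p>Note that compression must be enabled on the hypertable before a policy can be added. A minimal sketch (the table and segmenting column are illustrative):&nbsp;</p><pre><code class="language-SQL">-- Enable columnar compression, grouping compressed batches by device_id
ALTER TABLE my_hypertable SET (
  timescaledb.compress,
  timescaledb.compress_segmentby = 'device_id'
);
</code></pre>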
<p>Via this compression policy, Timescale will transform the table <a href="https://www.timescale.com/learn/when-to-consider-postgres-partitioning" rel="noreferrer">partitions</a> (<a href="https://www.timescale.com/learn/is-postgres-partitioning-really-that-hard-introducing-hypertables" rel="noreferrer">which in Timescale are <em>also created automatically</em></a>) into a <a href="https://www.tigerdata.com/blog/building-columnar-compression-in-a-row-oriented-database" rel="noreferrer">columnar</a> format behind the scenes, combining many rows (1,000) into an array. To boost compressibility,  Timescale will apply different compression algorithms depending on the data type:&nbsp;</p><ul><li><a href="http://www.vldb.org/pvldb/vol8/p1816-teller.pdf?ref=timescale.com">Gorilla compression</a> for floats</li><li>Delta-of-delta + <a href="https://arxiv.org/abs/1209.2137?ref=timescale.com">Simple-8b</a> with <a href="https://github.com/lemire/FastPFor/tree/c69935a1b507ea58c4cbd2f5e32d997e2c7402e9?ref=timescale.com">run-length encoding</a> compression for timestamps and other integer-like types</li><li>Whole-row dictionary compression for columns with a few repeating values (+ LZ compression on top)</li><li>LZ-based array compression for all other types</li></ul><p>This columnar compression design offers an efficient and scalable solution to the problem of large datasets in PostgreSQL. It allows you to use less storage to store more data without hurting your query performance (it actually improves it). <a href="https://docs.timescale.com/about/latest/release-notes/#timescaledb-2110-on-2023-05-22" rel="noreferrer">And in the latest versions of TimescaleDB, you can also <code>INSERT</code>, <code>DELETE</code>, and <code>UPDATE</code> directly over compressed data.&nbsp;</a></p><h2 id="keep-reading">Keep Reading</h2><p>Have we piqued your curiosity? 
Read the following blog posts to learn more about compression in Timescale: </p><ul><li><a href="https://www.timescale.com/blog/building-columnar-compression-in-a-row-oriented-database" rel="noreferrer">Building Columnar Compression in a Row-Oriented Database&nbsp;</a></li><li><a href="https://www.timescale.com/blog/how-ndustrial-is-providing-fast-real-time-queries-and-safely-storing-client-data-with-97-compression" rel="noreferrer">How Ndustrial Is Providing Fast Real-Time Queries and Safely Storing Client Data With 97 % Compression</a></li><li><a href="https://www.timescale.com/blog/allowing-dml-operations-in-highly-compressed-time-series-data-in-postgresql" rel="noreferrer">Allowing DML Operations in Highly Compressed Time-Series Data in PostgreSQL</a></li><li><a href="https://www.timescale.com/blog/time-series-compression-algorithms-explained" rel="noreferrer">Time-Series Compression Algorithms, Explained</a></li></ul><h2 id="wrap-up">Wrap-Up&nbsp;</h2><p>We hope this article helped you understand that while TOAST is a well-thought-out mechanism to manage large values within a PostgreSQL page, it’s not effective for optimizing database storage within the realm of modern applications. </p><p>If you’re looking for effective data compression that can move the needle on your storage savings, give Timescale a go. You can try our cloud platform <a href="https://www.timescale.com/blog/postgresql-timescaledb-1000x-faster-queries-90-data-compression-and-much-more" rel="noreferrer">that propels PostgreSQL to new performance heights</a>, making it faster and fiercer—<a href="https://console.cloud.timescale.com/signup">it’s free, and no credit card is required</a>—or you can add<a href="https://docs.timescale.com/self-hosted/latest/install/"> the TimescaleDB extension</a> to your self-hosted PostgreSQL database. </p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Database Backups and Disaster Recovery in PostgreSQL: Your Questions, Answered]]></title>
            <description><![CDATA[Database backups are one of the biggest pain points for developers. Use our guide to PostgreSQL backup to help you navigate your way.]]></description>
            <link>https://www.tigerdata.com/blog/database-backups-and-disaster-recovery-in-postgresql-your-questions-answered</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/database-backups-and-disaster-recovery-in-postgresql-your-questions-answered</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Team Tiger Data]]></dc:creator>
            <pubDate>Tue, 24 Oct 2023 14:12:07 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2023/10/Database-Backups-and-Disaster-Recovery-in-PostgreSQL.png">
            </media:content>
            <content:encoded><![CDATA[<p>When we ask our community about the elementary challenges they face with their PostgreSQL production databases, we often hear about three pain points: query speed, optimizing large tables, and managing database backups. We’ve covered the first two topics in articles about <a href="https://www.timescale.com/learn/is-postgres-partitioning-really-that-hard-introducing-hypertables?ref=timescale.com"><u>partitioning</u></a><a href="https://www.timescale.com/learn/postgresql-performance-tuning-key-parameters?ref=timescale.com"> <u>and fine-tuning your database</u></a>. We’ve also discussed<a href="https://timescale.ghost.io/blog/how-to-reduce-your-postgresql-database-size/"> <u>how to reduce your database size</u></a> to better manage large tables. </p><p>In this guide, we’ll answer some of the most frequently asked questions about database backup and recovery in PostgreSQL. We’ll also discuss how we handle things in the<a href="https://www.timescale.com/?ref=timescale.com"> <u>Timescale platform</u></a>.</p><h2 id="why-are-postgresql-database-backups-important">Why Are PostgreSQL Database Backups Important?</h2><p>When we discuss backup and recovery, we’re referring to a set of processes and protocols established to safeguard your data from loss or corruption and restore it to a usable state:</p><ul><li><strong>Backups</strong> involve creating copies of your data at regular intervals, copies that encapsulate the state of your PostgreSQL database at a specific point in time.&nbsp;</li><li><strong>Recovery</strong>, on the other hand, is the process of restoring data from these backups. If both things are taken care of (i.e., you always have up-to-date backups and a good recovery strategy in place), your PostgreSQL database will be resilient against failure, and you’ll be protected against data loss.&nbsp;&nbsp;</li></ul><p>Effective backup management is not only about creating copies of data. 
It’s also about ensuring those copies are healthy, accurate, and up-to-date.&nbsp;</p><p>To define a good backup strategy for your production PostgreSQL database, you need to consider several aspects. This includes how frequently you will back up your database, where these backups will be stored, and how often you will audit them.&nbsp;</p><p>But your job isn’t finished once you get up-to-date and healthy database backups. You must also establish an effective disaster recovery protocol. No matter how careful you are, it’s a fact of database management that failures will happen sooner or later. They can be caused by outages, failed upgrades, corrupted hardware, or human error—you name it.</p><p>Your disaster recovery plan must encompass all the steps to restore data as quickly as possible after an incident. This ensures that your database is not just backed up but also recoverable in a timely and efficient manner.</p><h2 id="what-is-the-difference-between-a-physical-backup-and-a-logical-backup-in-postgresql">What Is the Difference Between a Physical Backup and a Logical Backup in PostgreSQL?&nbsp;</h2><p>In PostgreSQL, there are two main types of database backups: physical backups and logical backups.&nbsp;</p><ul><li><strong>Physical backups</strong> capture the database's state at a specific point in time. They involve copying the actual PostgreSQL database data at the file system level.&nbsp;</li><li><strong>Logical backups </strong>involve exporting specific database objects or the entire database into a human-readable SQL file format. A logical backup contains SQL statements to recreate the database objects and insert data.</li></ul><p>Logical backups can be highly granular, allowing for the backup of specific database objects like tables, schemas, or databases. They are also portable and can be used across different database systems or versions, making them popular for migrating small to medium databases. 
This is your common <code>pg_dump/pg_restore</code>.&nbsp;&nbsp;</p><p>But a main drawback of logical backups is speed. For large databases, the process of restoring from a logical backup is too slow to be useful as a sole disaster recovery mechanism (or migration mechanism, for that matter). Restoring from physical backups is faster than restoring from logical backups, and it’s exact. When putting together a disaster recovery strategy, you’ll be dealing with physical backups.</p><h2 id="a-guide-to-physical-backups-in-postgresql">A Guide to Physical Backups in PostgreSQL</h2><p>Let’s explore some essential concepts around physical backups and how they can help you recover your database in case of failure.&nbsp;</p><h3 id="file-system-backups">File system backups&nbsp;</h3><p>Physical backups are referred to as<a href="https://www.postgresql.org/docs/current/backup-file.html?ref=timescale.com"> <u>file system backups</u></a> in PostgreSQL. This refers to the process of directly copying the directories and files that PostgreSQL uses to store its data, resulting in a complete representation of the database at a specific moment in time.&nbsp;</p><p>Maintaining file system backups is an essential piece of every disaster recovery strategy and imperative in production databases. But putting together a solid disaster recovery plan requires other techniques beyond simply taking “physical” file system backups regularly. That’s especially true if you’re dealing with large production databases.&nbsp;</p><p>Taking physical backups of very large databases can be a rather slow and resource-intensive process that conflicts with other high-priority database tasks, affecting your overall performance. Physical backups are not enough to ensure consistency in case of failure, as they only reflect the database state at the time they were taken. 
To restore a database in case of failure, you’ll need another mechanism to be able to restore all the transactions that occurred between the moment the last backup was taken and the failure.&nbsp;</p><h3 id="wal-and-continuous-archiving">WAL and continuous archiving&nbsp;</h3><p>WAL stands for <a href="https://www.postgresql.org/docs/current/wal-intro.html?ref=timescale.com"><u>Write-Ahead Logging</u></a>. It’s a protocol that improves the reliability, consistency, and durability of a PostgreSQL database by logging changes before they are written to the actual database files.</p><p>WAL is key for assuring atomicity and durability in PostgreSQL transactions. By writing changes to a log before they're committed to the database, WAL ensures that either all the changes related to a transaction are made or none at all.&nbsp;</p><p>WAL is also essential for disaster recovery since, in the event of a failure, the WAL files can be replayed to bring the database back to a consistent state. The process of regularly saving and storing these WAL records in a secondary storage location, ensuring that they are preserved over the long term, is usually referred to as continuous archiving.&nbsp;</p><p>Keeping WAL records and a recent, healthy physical database backup ensures that your PostgreSQL database can be successfully restored in case of failure. The physical backup will get PostgreSQL to the same state as it was when the backup was taken, which hopefully was not so long ago, and the WAL files will be rolled forward right before things start failing.&nbsp;</p><p>You might be wondering why it’s necessary to keep up-to-date backups if WAL can be replayed. The answer is speed. Replaying WAL during a recovery process is time-consuming, especially when dealing with large datasets with complex transactions. 
Backups provide a snapshot of the database at a specific point in time, enabling quick restoration up to that point.&nbsp; </p><p>In the optimal recovery scenario, you restore a recent backup (e.g., from the previous day). Then, you replay the WAL recorded post-backup to update the database to its most recent state. You don’t want to rely on WAL to reproduce two weeks’ worth of transactions.</p><h3 id="what-is-point-in-time-recovery-pitr-in-postgresql">What is point-in-time recovery (PITR) in PostgreSQL?</h3><p>Point-in-time recovery refers to restoring a PostgreSQL database to any specific point in time of the user’s choosing. For example, if I perform an upgrade and, for whatever reason, decide to revert the change, I could choose to recover the database from any day before.&nbsp;</p><p>Behind the scenes, PITR in PostgreSQL is often anchored in WAL. By integrating a backup with the sequential replay of WAL, PostgreSQL can be restored to an exact moment.</p><h2 id="a-guide-to-postgresql-physical-backup-tools">A Guide to PostgreSQL Physical Backup Tools&nbsp;&nbsp;</h2><p>There are multiple tools that help with the creation of physical backups, two of the most popular being <code>pg_basebackup</code> and <code>pgBackRest</code>.&nbsp;</p><h3 id="pgbasebackup">pg_basebackup</h3><p><a href="https://www.postgresql.org/docs/current/app-pgbasebackup.html?ref=timescale.com"><u><code>pg_basebackup</code></u></a> is the native tool offered by PostgreSQL for taking physical backups. It’s straightforward and reliable. It allows you to efficiently copy the data directory and include the WAL files to ensure a consistent and complete backup.</p><p><code>pg_basebackup</code> has important limitations. Taking full backups of a large database can be a lengthy and resource-intensive process. A good workaround to mitigate this is to combine full backups with incremental backups. 
You could, for example, copy the data that has changed since the last full backup frequently (e.g., once a day) and create full backups less frequently (e.g., once a week). However, incremental backups are not supported by <code>pg_basebackup</code>.&nbsp;</p><p><code>pg_basebackup</code> also has limited parallelization capabilities, which can further slow down the creation of full backups. The process is mostly manual, requiring developers to closely monitor and manage the backup operations.</p><h3 id="pgbackrest">pgBackRest</h3><p>To address the constraints of <code>pg_basebackup</code>, the PostgreSQL community built tools like <a href="https://pgbackrest.org/?ref=timescale.com"><u><code>pgBackRest</code></u></a>. <code>pgBackRest</code> introduces several important improvements:</p><ul><li>It supports both full and incremental backups.&nbsp;</li><li>It introduces multi-threaded operations, accelerating the backup process for larger databases.&nbsp;</li><li>It validates checksums during the backup process to ensure data integrity, offering an additional layer of security.</li><li>It supports various storage solutions, offering flexibility in how and where backups are stored.&nbsp;</li></ul><p>We use <a href="https://www.tigerdata.com/blog/making-postgresql-backups-100x-faster-via-ebs-snapshots-and-pgbackrest" rel="noreferrer"><code>pgBackRest</code></a> to manage our own backup and restore process in Timescale, although we’ve implemented some hacks to speed up the full backup process (<code>pgBackRest</code> can still be quite slow for creating backups in large databases).</p><h2 id="a-guide-to-logical-backups-in-postgresql">A Guide to Logical Backups in PostgreSQL&nbsp;</h2><p>Logical backups involve exporting data into a human-readable format, such as SQL statements.
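</p><p>For instance, a plain-format dump is just an ordinary SQL script that you can open and read (a sketch; <code>mydb</code> is a placeholder database name):</p>

```shell
# Dump a database as plain SQL, then peek at the script.
# Illustrative; requires a running server and a database named mydb.
pg_dump --format=plain --dbname=mydb --file=mydb.sql
head mydb.sql   # the first lines are comments and SET commands,
                # followed by CREATE TABLE and COPY ... FROM stdin;
```

<p>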
This type of backup is generally more flexible and portable, making it handy to reproduce a database in another architecture (e.g., for migrations). However, recovering from a logical backup is quite a slow process, which makes logical backups practical only for migrating small to medium PostgreSQL production databases.&nbsp;</p><h3 id="pgdumppgrestore">pg_dump/pg_restore&nbsp;</h3><p>The most common way to create logical backups and restore from them is by using <code>pg_dump/pg_restore</code>:</p><ul><li><strong><code>pg_dump</code> </strong>creates logical backups of a PostgreSQL database. It generates a script file or other formats that contain the SQL statements needed to reconstruct the database to the state it was in at backup time. You can use <code>pg_dump</code> to back up an entire database or individual tables, schemas, or other database objects.</li><li><code><strong>pg_restore</strong></code> restores databases from backups created by <code>pg_dump</code>. Just as <code>pg_dump</code> offers granularity in creating backups, <code>pg_restore</code> allows for selective restoration of specific database objects, providing flexibility in the recovery process. While <code>pg_restore</code> itself only reads the archive formats produced by <code>pg_dump</code>, plain-format dumps are ordinary SQL scripts that can often be loaded into other SQL-compliant database systems, enhancing their utility as a migration tool.</li></ul><h2 id="when-should-i-use-logical-backups-and-when-should-i-use-physical-backups-in-postgresql">When Should I Use Logical Backups, and When Should I Use Physical Backups in PostgreSQL?&nbsp;</h2><p><strong>Logical backups via <code>pg_dump/pg_restore</code> are mostly useful for creating testing databases or for database migrations.
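</strong></p><p>As a sketch, a small-database migration with this pair of tools might look like the following (hostnames, user, and database name are placeholders):</p>

```shell
# Dump the source database in custom format (compressed, and restorable
# selectively and in parallel). Illustrative; requires live servers.
pg_dump -h old-host -U postgres -Fc -f mydb.dump mydb

# Restore into the target server using four parallel jobs
pg_restore -h new-host -U postgres -d mydb -j 4 mydb.dump
```

<p><strong>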
</strong>In terms of migrations, if you’re operating a production database, we only recommend going the <code>pg_dump/pg_restore</code> route if your database is small (&lt;100&nbsp;GB).</p><p>Migrating larger and more complex databases via <code>pg_dump/pg_restore</code> might take your production database offline for too long. Other migration strategies, like the dual-write and backfill method, can avoid this downtime.&nbsp;</p><p><strong>Physical backups are mostly used for disaster recovery and data archiving. </strong>If you’re operating a production database, you’ll want to maintain up-to-date physical backups and WAL to recover your database when failure occurs. If your industry requires you to keep copies of your data for a certain period of time due to regulations, physical backups will be the way to go.&nbsp;</p><p>In production applications, you’ll most likely use a combination of logical and physical backups. For disaster recovery, physical backups will be your foundational line of defense, but logical backups can serve as additional assurance (redundancy is a good thing). For migrating large databases, you’ll most likely use a staged approach, <a href="https://docs.timescale.com/migrate/latest/dual-write-and-backfill/dual-write-from-postgres/?ref=timescale.com"><u>combining logical backups with other tactics</u></a>, and so on.</p><h2 id="what-about-replicas-in-postgresql">What About Replicas in PostgreSQL?&nbsp;</h2><p><a href="https://timescale.ghost.io/blog/how-high-availability-works-in-our-cloud-database/"><u>Replicas</u></a> are continuously updated mirrors of the primary database, capturing every transaction and modification almost instantaneously. They're not the same as backups, but their usefulness in disaster recovery is indisputable. 
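</p><p>At the PostgreSQL level, seeding a streaming replica can be sketched with <code>pg_basebackup</code> (the host, user, and data directory are placeholders):</p>

```shell
# Seed a standby from the primary. -R writes standby.signal and a
# primary_conninfo setting so the node starts as a streaming replica.
# Illustrative; requires a primary configured to accept replication.
pg_basebackup -h primary-host -U replicator -D /var/lib/postgresql/standby \
  -R -X stream
```

<p>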
In the event of a failure, you can promote replicas to serve as the primary database, ensuring minimal downtime while you restore the damaged database.&nbsp;</p><p>Building a high-availability replica and failover mechanism generally involves the following steps:&nbsp;</p><ul><li>The primary database should be configured to allow connections from replicas.&nbsp;</li><li>Physical backups of the primary should be regularly created, e.g., using <code>pgBackRest</code>.&nbsp;</li><li>WAL capturing all changes made to the database should be shipped to the replica, for example, via streaming replication. Replication can be synchronous, where each transaction is confirmed only when both primary and replica have received it, or asynchronous, where transactions are confirmed without waiting for the replica.</li><li>Configurations for automatic failover should be established to promote a replica to become the primary database in case of a failure.</li><li>Tools and scripts should be used to monitor replication lag and ensure the replica is up-to-date.</li></ul><p>This setup can be considerably complex to maintain. Most providers of managed PostgreSQL databases, <a href="https://docs.timescale.com/use-timescale/latest/ha-replicas/?ref=timescale.com"><u>including Timescale</u></a>, offer fully managed replicas as one of their services. This simplifies running highly available databases.</p><h2 id="a-guide-to-database-backups-and-disaster-recovery-with-timescale">A Guide to Database Backups and Disaster Recovery with Timescale&nbsp;</h2><p>The<a href="https://console.cloud.timescale.com/signup?ref=timescale.com"> <u>Timescale platform</u></a> allows our customers to create fully managed PostgreSQL and TimescaleDB databases. That means we take care of the backup and disaster recovery process for them.
Let’s run through how the platform handles backups, replication, upgrades, and restores.</p><h3 id="how-do-backups-work-in-timescale">How do backups work in Timescale?&nbsp;</h3><p>Backups in Timescale are fully automated. Using <code>pgBackRest</code> under the hood, Timescale automatically creates one full backup every week and incremental backups every day.&nbsp;</p><p>Timescale also keeps WAL files of any changes made to the database. This WAL can be replayed in the event of a failure to reproduce any transactions not captured by the last daily backup. For example, it can replay the changes made to your database during the last few hours. Timescale stores the two most recent full backups and WAL in<a href="https://aws.amazon.com/s3/?ref=timescale.com"> <u>S3</u></a> volumes.&nbsp;</p><p>On top of the full and incremental backups taken by <code>pgBackRest</code>,<a href="https://timescale.ghost.io/blog/making-postgresql-backups-100x-faster-via-ebs-snapshots-and-pgbackrest/"> <u>Timescale also takes EBS snapshots daily</u></a>. EBS snapshots create copies of the storage volume that can be restored, effectively serving as backups. They are significantly faster (about 100x) to create than full backups via <code>pgBackRest</code>.&nbsp;</p><p>By taking EBS snapshots daily (on top of the weekly full backups by <code>pgBackRest</code>), we introduce an extra layer of redundancy, ensuring that we always have a fresh snapshot that we can quickly restore if the customer experiences a critical failure that requires recovery from a full backup.&nbsp;</p><h3 id="disaster-recovery-in-timescale-what-happens-if-my-database-fails">Disaster recovery in Timescale: What happens if my database fails?&nbsp;</h3><p>Timescale is built on AWS with decoupled compute and storage, which makes the platform especially resilient against failures.
There are two classes of failures that Timescale handles distinctly: compute and storage failures.</p><h4 id="how-timescale-handles-compute-failures">How Timescale handles compute failures&nbsp;</h4><p>Compute failures are more frequent than storage failures, as they can be caused by things like unoptimized queries or other issues that result in a maxed-out CPU. To improve uptime for the customer, Timescale has developed a methodology that makes the platform recover extremely quickly from compute failures. We call this technique<a href="https://docs.timescale.com/use-timescale/latest/ha-replicas/high-availability/?ref=timescale.com"> <u>rapid recovery</u></a>.&nbsp;</p><p>Timescale decouples the compute and storage nodes. So, if the compute node fails, Timescale automatically spins up a new compute node and attaches the undamaged storage unit to it. Any WAL that was still in memory is then replayed.&nbsp;</p><p>The length of this recovery process depends on how much WAL needs replaying. Typically, it completes in less than thirty seconds. Under the hood, this entire process is automated via Kubernetes.&nbsp;</p><h4 id="how-timescale-handles-storage-failures">How Timescale handles storage failures&nbsp;</h4><p>Storage failures are much less common than compute failures, but when they happen, they’re more severe. Having a<a href="https://docs.timescale.com/use-timescale/latest/ha-replicas/high-availability/?ref=timescale.com"> <u>high-availability replica</u></a> can be a lifesaver in this circumstance; while your storage is being restored, instead of experiencing downtime, your replica will automatically take over.&nbsp;</p><p>To automatically restore your damaged storage, Timescale uses the backups it keeps in storage, replaying the WAL recorded since the last incremental backup.
The figure below illustrates the process:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2023/10/Database-Backups-in-PostgreSQL_Timescale-backups.png" class="kg-image" alt="Recovery from backup in Timescale " loading="lazy" width="1025" height="658" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2023/10/Database-Backups-in-PostgreSQL_Timescale-backups.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2023/10/Database-Backups-in-PostgreSQL_Timescale-backups.png 1000w, https://timescale.ghost.io/blog/content/images/2023/10/Database-Backups-in-PostgreSQL_Timescale-backups.png 1025w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Recovery from backup in Timescale&nbsp;</em></i></figcaption></figure><h3 id="how-do-replicas-work-in-timescale-and-how-do-they-help-with-recovery">How do replicas work in Timescale, and how do they help with recovery?&nbsp;</h3><p>In Timescale, you can create two types of replicas:</p><ul><li><strong>Read replicas </strong>are useful for read scaling. They’re used to offload work from your primary database in read-heavy applications, for example, if you’re powering a BI tool or doing frequent reporting. Read replicas are read-only, and you can create as many as you need.</li><li><strong>High-availability replicas</strong> are exact, up-to-date copies of your database that automatically take over operations if your primary becomes unavailable.&nbsp;</li></ul><p>We’ve been talking about the importance of backups and disaster recovery. There’s a related concept that’s also important to consider: <strong>high availability</strong>.
In broad terms, a “highly available” database is one that’s able to stay running without significant interruption (perhaps no more than a few seconds) even in case of failure.&nbsp;</p><p>The process of recovering a large database from backup might take a while, even when you’ve done everything right. That’s why it’s handy to have a replica running. Instead of waiting for the backup and restore process to finish, when your primary database fails, your connection will automatically fail over to the replica. That spares your users any major downtime.</p><p>Failover also helps remove downtime for common operations that would normally cause a service to reset, like upgrades. In these cases, Timescale makes changes to each node sequentially so that there is always a node available.&nbsp;And speaking of upgrades…&nbsp;</p><h3 id="how-are-upgrades-handled-in-timescale">How are upgrades handled in Timescale?&nbsp;</h3><p>In Timescale, you’re running PostgreSQL databases with the TimescaleDB extension enabled. Therefore, during your Timescale experience, you’ll most likely encounter three different types of upgrades:&nbsp;</p><h4 id="timescaledb-upgrades">TimescaleDB upgrades</h4><p>These refer to upgrades between TimescaleDB versions, e.g., from TimescaleDB 2.11 to TimescaleDB 2.12. You don’t have to worry about these. They’re backward compatible, they require no downtime, and they will happen automatically during your maintenance window. Your Timescale services always run the latest available TimescaleDB version, so you can enjoy all the new features we ship.&nbsp;</p><h4 id="postgresql-minor-version-upgrades">PostgreSQL minor version upgrades</h4><p>We always run the latest available minor version of PostgreSQL in Timescale as well, mostly for security reasons.
These minor updates may contain security patches and fixes for data corruption problems and frequent bugs.&nbsp;</p><p>Timescale automatically handles these upgrades during your maintenance window, and they are also backward compatible. However, they require a service restart, which could cause some downtime (30 seconds to a few minutes) if you don’t have a replica. We alert you ahead of time about these, so you can set your maintenance window to a low-traffic time (e.g., the middle of the night) to minimize the impact.&nbsp;</p><h4 id="postgresql-major-version-upgrades">PostgreSQL major version upgrades<strong>&nbsp;</strong></h4><p>These refer to upgrading, for example, from PostgreSQL 15 to 16. These upgrades are different and more serious since they’re often not backward compatible.</p><p>We can’t run these upgrades for you, as this might cause issues in your application. Besides, the downtime associated with upgrading major versions of PostgreSQL can be more severe (e.g., 20 minutes). Unfortunately, in this particular case, high-availability replicas can’t help you avoid downtime.</p><p>Major PostgreSQL upgrades are always a significant lift, but Timescale has some tools that will make the transition smoother.
For example, you can initiate the upgrade process in a particular database <a href="https://docs.timescale.com/self-hosted/latest/upgrades/upgrade-pg/?ref=timescale.com#upgrade-postgresql"><u>with a click of a button</u></a>. Before doing so, you can test your upgrade in a<a href="https://docs.timescale.com/use-timescale/latest/services/service-management/?ref=timescale.com#forking-a-service"> <u>copy of your database</u></a> to make sure nothing will break and get an accurate idea of how much downtime the upgrade will require.<a href="https://timescale.ghost.io/blog/read-before-you-upgrade-best-practices-for-choosing-your-postgresql-version/"> <u>Read this article for more information.&nbsp;</u></a></p><h3 id="can-i-do-pitr-in-timescale-ie-restore-my-database-to-a-previous-state-at-my-own-will">Can I do PITR in Timescale, i.e., restore my database to a previous state at will?&nbsp;</h3><p>Yes, you can! All Timescale services <a href="https://docs.timescale.com/use-timescale/latest/backup-restore/point-in-time-recovery/?ref=timescale.com"><u>allow PITR</u></a> to any point in the last three days. If you're using our<a href="https://www.timescale.com/enterprise?ref=timescale.com"> <u>Enterprise plan</u></a>, this window extends to 14 days.</p><h2 id="stress-free-postgresql-backups">Stress-Free PostgreSQL Backups</h2><p>Having a solid backup and recovery strategy is top of mind for every PostgreSQL user. We hope this introductory article answers some of your questions; if you’d like to see more articles diving deeper into this topic, <a href="https://x.com/TimescaleDB?s=20&amp;ref=timescale.com"><u>tell us on Twitter/X</u></a>.</p><p>If you prefer not to worry about maintaining your backups and taking care of recovering your database when things fail, <a href="https://console.cloud.timescale.com/signup?ref=timescale.com"><u>try Timescale</u></a>, our managed PostgreSQL platform.
It takes care of all things backups so you can focus on what matters (building and running your application) while experiencing <a href="https://timescale.ghost.io/blog/postgresql-timescaledb-1000x-faster-queries-90-data-compression-and-much-more/"><u>the performance boost of TimescaleDB</u></a>. You can start a free trial <a href="https://console.cloud.timescale.com/signup?ref=timescale.com"><u>here</u></a> (no credit card required).&nbsp;</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The 2023 State of PostgreSQL Survey Is Now Open!]]></title>
            <description><![CDATA[The 2023 State of PostgreSQL survey is now live! Help us learn more about this community, and check out last year’s main highlights.]]></description>
            <link>https://www.tigerdata.com/blog/the-2023-state-of-postgresql-survey-is-now-open</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/the-2023-state-of-postgresql-survey-is-now-open</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[State of PostgreSQL]]></category>
            <dc:creator><![CDATA[Team Tiger Data]]></dc:creator>
            <pubDate>Tue, 01 Aug 2023 12:58:32 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2023/07/2023-SOP-blog-hero.png">
            </media:content>
            <content:encoded><![CDATA[<p>Almost half (45.55 %) of the <a href="https://survey.stackoverflow.co/2023/#databases">2023 Stack Overflow Developer Survey</a> respondents who answered the question about their favorite database (76,634 in total) chose PostgreSQL, making it the most popular database in the survey. This is a testament to the quality, reliability, and performance of PostgreSQL, as well as the vibrant and diverse community that supports it.</p><p>As proud members of the PostgreSQL community, we want to continue giving back to this awesome group of data techies. We’re happy to announce that the 2023 State of PostgreSQL survey is officially live, and we are excited to hear once again from PostgreSQL users worldwide.</p><p>Over the years, we have learned a lot about the community. In 2019, we noticed that while PostgreSQL is a popular choice among organizations, <a href="https://drive.google.com/file/d/1VGWN0oCXRxX-qOiq4T-QQwu89Id07dBZ/view?usp=sharing">81 % of you use PostgreSQL for personal projects</a>. In 2021, <a href="https://www.timescale.com/state-of-postgres/2021/">the community shared that the most frequently used extension is PostGIS</a>. Last year, <a href="https://www.timescale.com/state-of-postgres/2022">17 % of the respondents said they contributed to PostgreSQL at least once</a>.</p><p>This year, we want to know how these practices have evolved, and we will explore the two magical letters of the hour—AI. We want to learn what AI tools the PostgreSQL community uses and if AI workloads are already part of personal and work projects.</p><p>The survey results and anonymized raw data will be published in a report that will be available for free to everyone. The report will provide valuable insights into the PostgreSQL ecosystem and help us understand how we can collectively make PostgreSQL better.</p><p>For now, to whet the appetite for the 2023 report, read the highlights of last year’s findings. 
To download the full 2022 report, head over to <a href="https://www.timescale.com/state-of-postgres/2022/">https://www.timescale.com/state-of-postgres/2022/</a>.</p>
<!--kg-card-begin: html-->
<div class="gray-cta-box">
   <div class="gray-cta-box__text">
       <p><strong>The survey is open until September 15, 2023.</strong></p>
       <br>
       <p>So what are you waiting for? Take the survey now and share your voice with the PostgreSQL community!</p>
       
    </div>
    <a class="gray-cta-box__button" href="https://timescale.typeform.com/state-of-pg-23" target="_blank">
        <p><strong>Take the survey</strong></p>
    </a>
</div>
<!--kg-card-end: html-->
<h2 id="the-state-of-postgresql-in-2022">The State of PostgreSQL in 2022</h2><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2023/07/2023-07-25-Infographic-pdf.jpg" class="kg-image" alt="" loading="lazy" width="2000" height="4924" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2023/07/2023-07-25-Infographic-pdf.jpg 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2023/07/2023-07-25-Infographic-pdf.jpg 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2023/07/2023-07-25-Infographic-pdf.jpg 1600w, https://timescale.ghost.io/blog/content/images/2023/07/2023-07-25-Infographic-pdf.jpg 2000w" sizes="(min-width: 720px) 720px"></figure><h3 id="postgresqls-popularity-is-increasing">PostgreSQL's popularity is increasing</h3><p>The number of PostgreSQL newbies using the database for less than a year has grown from 6.1&nbsp;% in 2021 to 6.4&nbsp;% in 2022.</p><h3 id="reasons-for-choosing-postgresql-over-other-databases">Reasons for choosing PostgreSQL over other databases</h3><p>Open-source, reliability, and extensions are the main reasons PostgreSQL users cited in 2022. Interestingly, users’ years of experience were directly related to their answers. “Reliability” was the number one reason to choose PostgreSQL among those who have been using the database for 11-15 years, while “open source” was the top reason among users with up to five years of experience.</p><h3 id="postgresql-usage-is-growing">PostgreSQL usage is growing</h3><p>Small and medium businesses (0-50 employees) use PostgreSQL a lot more than they did one year ago. 
The result is consistent with a broader trend: PostgreSQL’s usage is growing, with the majority of respondents—55 %—saying that they have increased their usage of the database.</p><h3 id="postgresql-users-%E2%99%A5-documentation">PostgreSQL users ♥ documentation</h3><p>The majority of respondents (76.1 %) answered that technical documentation is their preferred way of learning about PostgreSQL, followed by long-form blog posts (51.5 %) and short-form blog posts (43.3 %). But the new generation of PostgreSQL users enjoys learning slightly differently: users with less than five years of PostgreSQL experience gravitate toward video as their first option.</p><h3 id="postgresql-users-increasingly-use-dbaas-providers-to-deploy-postgresql">PostgreSQL users increasingly use DBaaS providers to deploy PostgreSQL</h3><p>The trend that we first saw in 2021 continues in 2022. Fewer PostgreSQL users reported self-managing the database compared to previous years. More respondents are using a managed PostgreSQL service to deploy the database.<br></p><p>Many thanks to everyone who took the time to fill in the 2023 survey. If you have not done that yet, grab a cup or glass of your favorite beverage and <a href="https://timescale.typeform.com/state-of-pg-23?utm_source=xxxxx&amp;utm_medium=xxxxx&amp;utm_campaign=xxxxx&amp;utm_term=xxxxx&amp;utm_content=xxxxx">share your experience with PostgreSQL</a>.<br></p><p>Help us make the survey more representative. Share this far and wide! Post it in the company chat. Share it on social media.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The 2022 State of PostgreSQL Survey Is Now Open!]]></title>
            <description><![CDATA[We’re surveying the PostgreSQL community for the third year to learn more about how developers use and deploy the open-source database and surface collective trends and opportunities.]]></description>
            <link>https://www.tigerdata.com/blog/the-2022-state-of-postgresql-survey-is-now-open</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/the-2022-state-of-postgresql-survey-is-now-open</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[State of PostgreSQL]]></category>
            <dc:creator><![CDATA[Team Tiger Data]]></dc:creator>
            <pubDate>Mon, 06 Jun 2022 15:05:46 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2022/06/SoP-2022-Lockup.png">
            </media:content>
            <content:encoded><![CDATA[<p>Our love for PostgreSQL runs deep. <a href="http://www.timescale.com">We built our products on PostgreSQL</a>, <a href="https://timescale.ghost.io/blog/the-future-of-community-in-light-of-babelfish/">are proud members of the PostgreSQL community,</a> <a href="https://www.youtube.com/playlist?list=PLsceB9ac9MHRnmNZrCn_TWkUrCBCPR3mc">and wouldn’t exist without it and the extensibility it provides</a>.</p><p>In 2019, Timescale launched the first <em>State of PostgreSQL report</em>, advancing our desire to provide greater insights into the specificities and features useful to the PostgreSQL community. Following a one-year hiatus due to the pandemic and after the 2021 survey submissions, <a href="https://timescale.ghost.io/blog/2021-state-of-postgres-survey-results/">we released the 2021 report</a>.</p><p>We are pleased to announce that the 2022 survey is now open for submissions! We are keen to learn more about how you use PostgreSQL for work and personal projects, how you deploy it, and how we can collectively improve it.</p><div class="kg-card kg-callout-card kg-callout-card-yellow"><div class="kg-callout-emoji">✨</div><div class="kg-callout-text">Help us give back to this awesome group: answer survey questions and share with other PostgreSQL users. We are excited to hear your thoughts and spark a conversation that will keep us moving forward and building better things together. 🙌 <b><strong style="white-space: pre-wrap;">We will share our report (as well as give you full and free access to the survey’s anonymized raw data) in July. </strong></b>Thank you for being a part of the community!</div></div>
<!--kg-card-begin: html-->
<div class="gray-cta-box">
    <a style="width: auto; display: flex; justify-content: center; align-items: center;" class="gray-cta-box__button" href="https://timescale.typeform.com/state-of-pg-22" target="_blank">
        <p><strong>Take the 2022 State of PostgreSQL survey </strong></p>
    </a>
</div>
<!--kg-card-end: html-->
<h2 id="the-state-of-postgresql-in-2019-and-2021">The State of PostgreSQL in 2019 and 2021</h2><p>So, what have we learned from the two years we sent out our survey? You will find a few key findings here, but <a href="https://drive.google.com/drive/folders/14elckaNv7FLKyWhzp3JKd3tH6PvI9F45">check out our reports</a> for a full picture. From the most used programming languages to whether developers use PostgreSQL for work or personal projects (or both!), favorite features, and qualitative answers, <em>The State of PostgreSQL </em>paints an accurate and informative portrait of this great community. </p><h3 id="sample">Sample </h3><p>Five hundred developers answered our survey in 2019, and 445 participated two years later. In both years, respondents were mainly software developers/engineers, software architects, and database administrators from the EMEA (Europe, Middle East, Africa) region.</p><h3 id="postgresql-usage-is-growing">PostgreSQL usage is growing</h3><p>Around 67 % of developers said they were using the database “more” or “a lot more” in 2019, compared to 52 % in 2021. However, the number of participants using it “about the same” increased from 31 % in 2019 to 43 % in 2021.</p><h3 id="use-cases-building-applications-at-the-top">Use cases: Building applications at the top</h3><p>Building applications is the primary use case for PostgreSQL developers, totaling 70 % in 2019 and 67 % in 2021.</p><h3 id="community-contribution-is-increasing">Community contribution is increasing</h3><p>Code contributions are crucial to open-source software development, and PostgreSQL successfully mobilizes its community. 
In 2019, about 9 % of respondents contributed their code to the database, and 11 % said they did so two years later.</p><h3 id="why-do-you-use-postgresql">Why do you use PostgreSQL?</h3><p>In both surveys, developers said that reliability and SQL were the main reasons they use PostgreSQL.</p><h3 id="the-way-developers-deploy-postgresql-is-changing">The way developers deploy PostgreSQL is changing</h3><p>In 2019, 51 % of respondents deployed PostgreSQL using AWS, while 46 % relied on a self-managed data center. In 2021, the self-managed option took the lead, with 36.4 % deploying on-site, 35.3 % from a private data center, and 32.8 % on a public cloud. That year, AWS was the leading cloud provider, with 46.1 % of the answers.</p><p><em>If you’ve got any feedback or questions on The State of PostgreSQL, let us know on </em><a href="https://twitter.com/TimescaleDB"><em>Twitter</em></a><em> or join Timescale’s Community </em><a href="http://timescaledb.slack.com/"><em>Slack</em></a><em> and message us in </em><a href="https://timescaledb.slack.com/archives/C4GT3N90X"><em>#general</em></a><em>.</em></p>]]></content:encoded>
        </item>
    </channel>
</rss>