<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[Tiger Data Blog]]></title>
        <description><![CDATA[Insights, product updates, and tips from TigerData (Creators of TimescaleDB) engineers on Postgres, time series & AI. IoT, crypto, and analytics tutorials & use cases.]]></description>
        <link>https://www.tigerdata.com/blog</link>
        <image>
            <url>https://www.tigerdata.com/icon.ico</url>
            <title>Tiger Data Blog</title>
            <link>https://www.tigerdata.com/blog</link>
        </image>
        <generator>RSS for Node</generator>
        <lastBuildDate>Tue, 07 Apr 2026 10:01:58 GMT</lastBuildDate>
        <atom:link href="https://www.tigerdata.com/blog" rel="self" type="application/rss+xml"/>
        <ttl>60</ttl>
        <item>
            <title><![CDATA[The Problem With Locks and PostgreSQL Partitioning (and How to Actually Fix It)]]></title>
            <description><![CDATA[PostgreSQL locks can cause issues in partitioned tables. Read how TimescaleDB solves this using lock minimization strategies.]]></description>
            <link>https://www.tigerdata.com/blog/how-timescaledb-solves-common-postgresql-problems-in-database-operations-with-data-retention-management</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/how-timescaledb-solves-common-postgresql-problems-in-database-operations-with-data-retention-management</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[PostgreSQL Tips]]></category>
            <category><![CDATA[Hypertables]]></category>
            <dc:creator><![CDATA[Chris Travers]]></dc:creator>
            <pubDate>Thu, 12 Oct 2023 13:00:00 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2023/10/Screenshot-2023-10-12-at-11.52.06-AM.png">
            </media:content>
            <content:encoded><![CDATA[<p>In my career, I have frequently worked for companies with large amounts of<a href="https://www.timescale.com/learn/types-of-data-supported-by-postgresql-and-timescale" rel="noreferrer"> time-partitioned data</a>, where I was a software engineer focusing on our PostgreSQL databases. </p><p>We’d already grown past the point where deleting data row-by-row was no longer practical, so <a href="https://www.timescale.com/learn/when-to-consider-postgres-partitioning" rel="noreferrer">we needed to use PostgreSQL partitioning to manage data retention.</a> In brief, dropping a whole partition allows PostgreSQL to remove the entire file from disk for a subset of your data rather than going through each row and removing them individually. So it’s <em>much</em> faster. <a href="https://www.timescale.com/learn/pg_partman-vs-hypertables-for-postgres-partitioning" rel="noreferrer">But if you are doing partitioning natively in PostgreSQL, you do have to make sure to add new partitions where you’re ingesting new data and drop the old ones. </a></p><p>This was a frequent cause of outages for us, even if we had reasonably well-tested scripts for adding and removing partitions. Unfortunately, the interactions around the scripts were less well-tested, and new, frequent, and long-running queries prevented the partition management scripts from getting the locks required and creating new partitions. We didn’t see the problem at first because we’d created partitions a few days in advance, but then we ran out of time, and with no new partitions, we couldn’t insert, and whoops, down goes the app.</p><p>These types of problems are particularly hard to debug and disentangle because they are often caused by totally unrelated pieces of code in combination with changes in load. PostgreSQL has begun to address this with newer approaches attaching partitions concurrently, but they’re quite complex. 
</p><p>I’ve seen the outages caused by <a href="https://timescale.ghost.io/blog/how-to-fix-no-partition-of-relation-found-for-row/" rel="noreferrer">partitions failing to create</a>, disk filling up because they can’t be dropped, or the pauses in other normal queries by partition management code. I know how difficult these problems can be. This is why TimescaleDB's hypertables were so exciting to me when I discovered them, especially their lock minimization strategies. </p><h2 id="understanding-postgresql-locks">Understanding PostgreSQL Locks </h2><h3 id="what-are-postgresql-locks">What are PostgreSQL locks? </h3><p>PostgreSQL locks are mechanisms that control concurrent access to data in the database to ensure consistency, integrity, and isolation of database transactions. </p><p>PostgreSQL, like most other relational database management systems, is a <em>concurrent</em> system, which means that multiple queries can be processed at the same time. Locks help in managing multiple transactions attempting to access the same data simultaneously, avoiding conflicts and potential data corruption. </p><h3 id="why-are-postgresql-locks-necessary">Why are PostgreSQL locks necessary? </h3><p>Concurrency is essential for optimizing the performance and responsiveness of the database. However, concurrency introduces several challenges that need careful handling to ensure the database’s integrity, consistency, and reliability: </p><ul><li>When multiple queries are executed concurrently, there's a risk that one transaction might view inconsistent or uncommitted data modified by another ongoing transaction. 
This can lead to erroneous results and inconsistencies in the database.</li><li>Queries executing simultaneously can interfere with each other, leading to performance degradation, locking issues, or inconsistent data.</li><li>When two transactions try to modify the same data simultaneously, it can lead to conflicts, data corruption, or loss of data.</li></ul><p>Locks are necessary to prevent these problems. </p><h3 id="types-of-postgresql-locks">Types of PostgreSQL locks </h3><p><a href="https://www.postgresql.org/docs/current/explicit-locking.html">PostgreSQL supports many different types of locks</a>, but the three relevant to this article are <code>ACCESS SHARE</code>, <code>SHARE UPDATE EXCLUSIVE</code>, and <code>ACCESS EXCLUSIVE</code> locks. </p><ul><li><code>ACCESS SHARE</code> locks are the least restrictive and are intended to prevent the database schema from changing underneath a query and to keep related caches from being cleared. Access share locks are acquired for database read operations. Their purpose is to block access exclusive locks.</li><li><code>SHARE UPDATE EXCLUSIVE</code> locks allow concurrent writes to a table but block operations that change the database schema in ways that might interfere with running queries. These are used for some forms of concurrent schema changes in PostgreSQL, though two concurrent transactions cannot both take this lock on the same table. For example, you cannot concurrently detach and attach the same partition to/from the same parent table in different sessions. One must complete before the other starts. These locks are generally used for concurrency-safe schema changes, which do not clear cached relation information.</li><li><code>ACCESS EXCLUSIVE</code> locks are the most restrictive and are intended to prevent other queries from operating across a schema change. 
Access exclusive locks block all locks from all other transactions on the locked table.</li></ul><h3 id="cache-invalidation-and-access-exclusive-locks">Cache invalidation and <code>ACCESS EXCLUSIVE</code> locks </h3><p>For performance reasons, PostgreSQL caches information about tables and views (which we call “relations”) and uses this cached information in query execution. This strategy is instrumental for PostgreSQL's efficiency, ensuring that data retrieval is quick and resource utilization is optimized. </p><p>A critical scenario that needs meticulous lock handling is when the structure of a table is altered. When the schema is altered (e.g., by adding or dropping columns, changing data types, or modifying constraints), the cached information related to that table might become outdated or inconsistent. Therefore, it needs to be invalidated and refreshed to ensure that query execution reflects the modified schema. </p><p>To make this work, PostgreSQL takes an access exclusive lock on the table in question before the cached information for that relation can be invalidated.</p><h2 id="using-postgresql-partitioning-to-simplify-data-management">Using PostgreSQL Partitioning to Simplify Data Management </h2><p>In PostgreSQL <a href="https://www.postgresql.org/docs/current/ddl-partitioning.html" rel="noreferrer">declarative partitioning</a>, PostgreSQL tables are used both for the empty parent table and for the partitions holding the data. Internally, each partition is a table, and there is mapping information used by the planner to indicate which partitions should be looked at for each query. This information is cached in the relation cache.</p><p>When tables are partitioned based on time, it allows for an organized structure where data is segmented into specific time frames. 
This makes data management much faster, since dropping a whole partition allows PostgreSQL to remove the entire partition from disk rather than going through each row and removing them individually. </p><p>In PostgreSQL, you can follow two general approaches for managing partitions and data retention, which as we'll see later, have two different concurrency considerations and problematics. </p><h3 id="approach-1-dropping-partitions">Approach #1: Dropping partitions<br></h3><p>In the first approach, we simply drop partitions from a partitioned table when we want to delete data. </p><pre><code class="language-SQL">CREATE TABLE partition_test (
    event_time timestamp,
    sensor_id bigint,
    reported_value float
) partition by range (event_time);

-- Create partition 
CREATE TABLE partition_test_2022 PARTITION OF partition_test 
FOR VALUES FROM ('2022-01-01 00:00:00') TO ('2023-01-01 00:00:00');

-- Drop partition
DROP TABLE partition_test_2022;  -- PostgreSQL has no DROP PARTITION clause; dropping the partition table removes its data</code></pre><h3 id="approach-2-concurrent-workflow">Approach #2: Concurrent workflow<br></h3><p>PostgreSQL also offers a more concurrent workflow for these operations: attaching a pre-created table as a partition and, in PostgreSQL 14 and newer, detaching a partition with <code>CONCURRENTLY</code>.</p><pre><code class="language-SQL">CREATE TABLE partition_test_2022 (LIKE partition_test);
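-- Optional step (a sketch; the constraint name is illustrative): add a CHECK
-- constraint matching the intended partition bounds before attaching.
-- PostgreSQL can then skip the full validation scan during ATTACH PARTITION,
-- shortening the time locks are held on both tables.
ALTER TABLE partition_test_2022 ADD CONSTRAINT partition_test_2022_bounds
    CHECK (event_time >= '2022-01-01 00:00:00' AND event_time < '2023-01-01 00:00:00');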


-- Note: ATTACH PARTITION has no CONCURRENTLY option; since PostgreSQL 12 it
-- takes a SHARE UPDATE EXCLUSIVE lock on the parent rather than ACCESS EXCLUSIVE.
ALTER TABLE partition_test ATTACH PARTITION partition_test_2022 FOR VALUES FROM ('2022-01-01 00:00:00') TO ('2023-01-01 00:00:00');
</code></pre>
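<p>To observe the effect of these operations, you can watch the locks they hold from a second session by querying the standard <code>pg_locks</code> and <code>pg_stat_activity</code> views (a sketch; the table name comes from the examples above):</p><pre><code class="language-SQL">-- Run in a second session while a partition operation is in progress
SELECT l.mode, l.granted, a.query
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.relation = 'partition_test'::regclass;
</code></pre>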
<p>To remove a partition concurrently, we can:</p><pre><code class="language-SQL">ALTER TABLE partition_test DETACH PARTITION partition_test_2022 CONCURRENTLY;


DROP TABLE partition_test_2022;
</code></pre>
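<p>Because these operations can queue behind (and ahead of) other queries, a common precaution is to run them with a lock time-out so that a busy table produces a quick, retryable failure rather than a pile-up. This is a sketch of the time-out-and-retry approach discussed later in this article:</p><pre><code class="language-SQL">-- Fail fast if the lock cannot be acquired, then retry later from a script
SET lock_timeout = '2s';
ALTER TABLE partition_test DETACH PARTITION partition_test_2022 CONCURRENTLY;
DROP TABLE partition_test_2022;
RESET lock_timeout;
</code></pre>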
<h2 id="the-problem-with-locks-and-postgresql-partitioning">The Problem With Locks and <a href="https://www.tigerdata.com/learn/when-to-consider-postgres-partitioning" rel="noreferrer">PostgreSQL Partitioning</a> </h2><p>From a database administration perspective, neither of these approaches is very safe. </p><p>Both partition creation and partition dropping require an access exclusive lock on <code>partition_test</code>, meaning that once the query is issued, no other queries can run against that table until the query is concluded and the transaction committed or rolled back. The locking in each case looks like this:</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2023/01/Data-Retention-Access-Exclusive-Lock_img1.png" class="kg-image" alt="" loading="lazy" width="2000" height="626" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2023/01/Data-Retention-Access-Exclusive-Lock_img1.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2023/01/Data-Retention-Access-Exclusive-Lock_img1.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2023/01/Data-Retention-Access-Exclusive-Lock_img1.png 1600w, https://timescale.ghost.io/blog/content/images/2023/01/Data-Retention-Access-Exclusive-Lock_img1.png 2070w" sizes="(min-width: 720px) 720px"></figure><p>The concurrent approach still has to address the issue of clearing the relation cache. It does so in two stages: first, a share update exclusive lock is taken on <code>partition_test</code>, and then information is written to the catalogs indicating that the table will be removed from the partition list. 
The backend then waits until all running queries have concluded (and all transactions guaranteeing repeatable reads have concluded) before removing the table from the partition map.</p><p>This approach does not rely on locks to signal that the process is complete, only to prevent multiple concurrent updates to the status of the same set of partitions. As a result, even unrelated queries can block the detach operation. If the partition management script’s connection is interrupted for any reason, cleanup processes must be performed by the database administrator.</p><p>Once the partition is removed from the partition list, it is locked in access exclusive mode and dropped. The locking approach of this process looks like this:</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2023/01/Data-retention-management-stages-1-and-2_img2.png" class="kg-image" alt="" loading="lazy" width="2000" height="854" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2023/01/Data-retention-management-stages-1-and-2_img2.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2023/01/Data-retention-management-stages-1-and-2_img2.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2023/01/Data-retention-management-stages-1-and-2_img2.png 1600w, https://timescale.ghost.io/blog/content/images/2023/01/Data-retention-management-stages-1-and-2_img2.png 2070w" sizes="(min-width: 720px) 720px"></figure><p>In conclusion, </p><ul><li>The first approach (involving the manual creation and dropping of partitions) involves relatively quick operations but forces hard synchronization points on partitioned tables; in time-series workloads, tables are usually partitioned precisely because they are heavily used. Problems here can cause database outages fairly quickly.</li><li>The concurrent workflow doesn’t always solve these problems. 
In mixed-workflow applications, waiting for all running queries to complete (which can include long-running automatic maintenance tasks) can lead to long delays, dropped connections, and general difficulties in actually managing data retention. Particularly under load, these operations may not perform well enough to be useful.</li></ul><h2 id="common-advice-on-how-to-fix-this-problem-and-why-its-not-the-best">Common Advice on How to Fix This Problem (and Why It's Not the Best) </h2><p>The overall problems of partition management with time-series data fall into two categories:  </p><p>1) Failure to create partitions before they are needed can block inserts.</p><p>2) Dropping partitions when needed for regulatory or cost reasons not only can fail but can also block reading and writing to the relevant tables. </p><p>If you ask for advice, you'll probably hear one of these two things: </p><h3 id="use-custom-scripts">Use custom scripts</h3><p>Many companies begin their partition-management journey with custom scripts. This has the advantage of simplicity, but the disadvantage is that the operations can require heavy locks, and teams often lack the initial knowledge of how to address this.</p><p>Custom scripts are the most flexible approach to the lock problems of partition management because the entire toolkit (lock escalation, time-outs and retries, and more) is available. This allows knowledgeable teams to build solutions that work around the existing database workloads with the best chance of success. </p><p>On the other hand, this problem is full of landmines, and teams often do not begin with the knowledge to navigate these hazards successfully.</p><p>A second major problem with custom scripts is that database workloads can change over time, and this is often out of the hands of the responsible team. 
For example, a data science team might run workloads that interfere with production in ways the software engineering teams had not considered.</p><h3 id="use-pgpartman">Use pg_partman </h3><p><a href="https://www.timescale.com/learn/pg_partman-vs-hypertables-for-postgres-partitioning" rel="noreferrer"><code>pg_partman</code> provides a general toolkit for partition management which can mitigate the problem on some workloads.</a> <code>pg_partman</code> takes a time-out-and-retry approach to partition creation and removal, meaning that—depending on the configuration and how things are run—the functions will run in an environment where a lock time-out is set. This prevents a failed lock from leading to an outage, but there is no guarantee that the lock will be obtained before the partitions are required. </p><p>In most cases, you can tune these features to provide reasonable assurances that problems will usually be avoided. However, workloads exist that prevent the partition management functions from successfully running in such an environment.</p><p><code>pg_partman</code> is a good tool and an important contribution to this topic, but at scale and under load, it will only work in cases where you have a real opportunity to get the locks required within the lock time-out. I have personally worked in environments where important services would have to be briefly disabled to allow this to happen.</p><h2 id="how-timescaledb-solves-the-problem-of-locking-in-postgresql-partitioning">How TimescaleDB Solves the Problem of Locking in PostgreSQL Partitioning </h2><p>Instead of using PostgreSQL native partitioning, you can <a href="https://docs.timescale.com/self-hosted/latest/install/" rel="noreferrer">install the TimescaleDB extension</a> and use <a href="https://www.tigerdata.com/blog/database-indexes-in-postgresql-and-timescale-cloud-your-questions-answered" rel="noreferrer">hypertables</a>, which are PostgreSQL tables that are automatically partitioned. 
This solves the problems caused by locking since hypertables minimize locks by design.  <br></p><div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text">For those using our managed PostgreSQL service, <a href="https://www.timescale.com/cloud" rel="noreferrer">Timescale Cloud</a>, <a href="https://docs.timescale.com/about/latest/changelog/#:~:text=%F0%9F%94%90%20Current%20Lock%20Contention" rel="noreferrer">you can see current lock contention in the results section of our SQL editor</a> if a query is waiting on locks and can't complete execution.</div></div><p><br></p><p>TimescaleDB automatically partitions hypertables into chunks, organized by various partitioning criteria, usually time. This implementation is independent of PostgreSQL’s partitioning strategies and has been optimized as an independent add-on to PostgreSQL rather than a part of PostgreSQL core. TimescaleDB does not use inheritance as a table partitioning structure either, nor does TimescaleDB rely on the relation cache mentioned above for determining which chunks to scan.</p><p>Within a TimescaleDB hypertable, chunks are added transparently as needed and removed asynchronously without intrusive locks on the parent table. TimescaleDB then uses various strategies to hook into the planner and execute TimescaleDB-specific approaches to partition selection and elimination. These strategies require intrusive locks on the chunk table being dropped, but not on the parent. <br><br>This approach can lead to problems at the serializable transaction isolation level because once the underlying partition is gone, it is gone. 
In the event that a serializable transaction starts and then chunks are dropped, this will result in serialization errors or isolation violations.<br></p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2023/01/Data-retention-management-Access-Share-Lock-and-Exclusive-Lock_img3.png" class="kg-image" alt="" loading="lazy" width="2000" height="738" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2023/01/Data-retention-management-Access-Share-Lock-and-Exclusive-Lock_img3.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2023/01/Data-retention-management-Access-Share-Lock-and-Exclusive-Lock_img3.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2023/01/Data-retention-management-Access-Share-Lock-and-Exclusive-Lock_img3.png 1600w, https://timescale.ghost.io/blog/content/images/2023/01/Data-retention-management-Access-Share-Lock-and-Exclusive-Lock_img3.png 2070w" sizes="(min-width: 720px) 720px"></figure><h3 id="lock-minimization">Lock minimization</h3><p>PostgreSQL has traditionally taken the view that concurrency is not extremely important for <a href="https://www.tigerdata.com/learn/guide-to-postgresql-database-operations" rel="noreferrer">database operations</a> while Data Definition Language (DDL) commands are run. Traditionally, this view was reasonable. Even today, DDL commands are usually run infrequently enough that the database can afford the performance hit of treating them as synchronization points. </p><p>The emerging problems of heavy PostgreSQL users today are not usually performance problems but the fact that applications are often not written with an awareness of what these added synchronization points will mean. 
In my experience, these synchronization points themselves are a significant cause of database outages among large-scale PostgreSQL users.</p><p>Timescale has been built to avoid the sort of locking problems that currently exist with PostgreSQL’s declarative partitioning simply because this is a common problem in time-series workloads.</p><p>TimescaleDB maintains its own chunk catalogs and only locks the partitions that will be removed. The catalog entry is removed, then the chunk table is locked and dropped. Only an access share lock is taken on the top-level table. This means that reads and even writes can be done to other chunks without interfering with dropping or adding chunks.</p><p>TimescaleDB’s current approach has one limitation when used under serializable transactions. Currently, if you use serializable transactions, there are certain circumstances where a transaction could go to read dropped chunks and no longer see them, resulting in a violation of the serialization guarantees. This is only a problem under very specific circumstances, but in this case, TimescaleDB behaves differently than PostgreSQL’s concurrent DDL approaches. </p><p>In general, though, you should only drop chunks when you are reasonably sure they are not going to be accessed if you use serializable transaction isolation.</p><h3 id="why-is-postgresql-not-doing-this">Why is PostgreSQL not doing this? </h3><p>TimescaleDB’s solution cannot be perfectly replicated with stock PostgreSQL at the moment because dropping partitions requires active invalidation of cached data structures, which other concurrent queries might be using. </p><p>Offering some sort of lazy invalidation infrastructure (via message queues, etc.) 
would go a long way to making some of this less painful, as would allowing more fine-grained invalidations to caching.</p><h2 id="conclusion">Conclusion</h2><p>TimescaleDB’s approach to the problem of locking is the best solution today, better than the options available in stock PostgreSQL. But it's not yet perfect; in terms of concurrency capabilities, it sits between the two options described above. Regardless, a chunk that a serializable transaction has read cannot safely be dropped until that transaction concludes.</p><p>Getting there is likely to require some changes to how PostgreSQL caches the table and view characteristics and how this cache invalidation works. However, such improvements would help us move toward more transactional DDL. </p><p>Many <code>ALTER TABLE</code> commands are limited in concurrency largely because of these caching considerations. I think the general success of our approach here is also evidence of a need to address these limitations generally.<br><br>In the meantime, if you're planning to partition your tables, check out Timescale. If you're running your PostgreSQL database on your own hardware,&nbsp;<a href="https://docs.timescale.com/self-hosted/latest/install/?ref=timescale.com" rel="noreferrer">you can simply add the TimescaleDB extension</a>. </p><p>If you're running managed PostgreSQL, <a href="https://console.cloud.timescale.com/signup?ref=timescale.com" rel="noreferrer">try the Timescale platform for free.</a> Besides the advantages of a mature cloud platform, Timescale Cloud will warn you about lock contention via our UI. <a href="https://docs.timescale.com/about/latest/changelog/#:~:text=%F0%9F%94%90%20Current%20Lock%20Contention" rel="noreferrer">The Timescale Console displays the current lock contention</a> in the results section of our SQL editor if a query is waiting on locks and can't complete execution.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[A PostgreSQL Developer's Perspective: Five Interesting Patches From September's Commitfest]]></title>
            <description><![CDATA[Welcome to our new blog series! Every other month, Timescale’s developer advocate, Chris Travers, will use his PostgreSQL developer perspective to feel the pulse of the beloved database by looking into new commitfest patches.]]></description>
            <link>https://www.tigerdata.com/blog/a-postgresql-developers-perspective-five-interesting-patches-from-septembers-commitfest</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/a-postgresql-developers-perspective-five-interesting-patches-from-septembers-commitfest</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Chris Travers]]></dc:creator>
            <pubDate>Wed, 02 Nov 2022 14:14:11 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2022/11/PostgreSQL-Developer-Perspective-September-commitfest--1-.png">
            </media:content>
            <content:encoded><![CDATA[<div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">🐘</div><div class="kg-callout-text"><em>The PostgreSQL community organizes patch reviews into “</em><a href="https://commitfest.postgresql.org/"><em>commitfests</em></a><em>” which last for a month at a time, every other month. In this series, our very own PostgreSQL developer advocate and expert, Chris Travers, will discuss a few patches that may be of interest to PostgreSQL users after each commitfest. This is intended to provide a sense of the progress and future of PostgreSQL as a major player in the database world.</em></div></div><p><a href="https://commitfest.postgresql.org/39/">September’s commitfest</a> is over with 65 committed patches, 40 patches returned with feedback, 177 patches moved to the next commitfest, 3 rejected, and 11 withdrawn. From a PostgreSQL developer's perspective and beyond, the patches include a large number of improvements across many areas.</p><p>In this new blog post series pilot, I have selected a few patches that I find particularly interesting and helpful, and whose importance I feel I can easily communicate to a general audience of PostgreSQL users. This is by no means a comprehensive list of committed patches of interest, and in particular, patches that improve code quality or set the foundations for new features somewhere in the distant future are not included in this review. However, in my discussions with PostgreSQL developers, the changes I heard the most about were type-safety improvements, which are not on this list.</p><p>In this article, I discuss five patches: three committed and two returned with feedback. 
I will focus on their utility to database users and application developers from my PostgreSQL developer point of view (POV).</p><p>Let’s start with the committed patches.</p><h2 id="postgresql-developer-pov-interesting-patches-committed">PostgreSQL Developer POV: Interesting Patches Committed</h2><h3 id="reducing-chunk-header-size-on-all-memory-context-types">Reducing chunk header size on all memory context types</h3><p>PostgreSQL <a href="https://www.youtube.com/watch?v=tP2pHbKz2R0">manages memory by lifetime</a> and allocates based on either a “chunk allocator” or a “slab allocator” (the latter being designed specifically for logical replication contexts, and not relevant to this patch). </p><p>Allocation sets are arranged in a hierarchy relating to memory lifetime within the software. This frees the developer from having to release each allocation individually; the system frees or reuses the whole context at a defined point in time.  For example, if we allocate memory with a lifetime related to the processing of a row, then the memory will be reused or freed when PostgreSQL moves on to process the next row.  If we allocate to the lifetime of the transaction, then the memory is freed when the transaction commits or rolls back.  It also means that in some cases, PostgreSQL can just reuse a chunk of memory without having to do significant processing beyond updating the header.</p><p>The chunk allocator also increases the memory allocation on each subsequent call for an allocation set. The first chunk in the allocation set is 8 kB in size, and each subsequent chunk doubles in size until reaching 1 GiB. 
A given allocation within a set cannot span chunks, and this is why you cannot allocate (compressed or not) more than 1 GB of data within PostgreSQL in C <a href="http://postgresql.org/docs/6.3/c3903.htm">using the palloc memory allocation interfaces</a>.</p><p><a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=c6e0fe1f2a08505544c410f613839664eea9eb21">This patch</a> provides a number of important memory improvements in this regard. Although PostgreSQL memory management is well-optimized performance-wise and avoids most of the problems plaguing C programmers generally, this introduces further improvements.</p><p>Databases are environments where memory allocation efficiency matters for a number of reasons. In addition to pure memory savings, the fact that these may allow more data to fit in a given chunk means fewer malloc() calls will happen, and this is likely to produce performance improvements as well.</p><p>This patch does not affect all aspects of memory in PostgreSQL. While data coming into shared buffers or from shared buffers back to disk are not affected by this change, the data allocated for processing data extracted from shared buffers (or processed for writing to shared buffers) is affected.</p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">✨</div><div class="kg-callout-text"><strong>Editor's Note: </strong>shared_buffers are an important part of configuring your PostgreSQL instance. 
<a href="https://timescale.ghost.io/blog/database-scaling-postgresql-caching-explained/">Check out this article on what they are and how they’re designed</a>, and <a href="https://timescale.ghost.io/blog/postgresql-caching-the-postmaster-process/">read this blog post to learn how they interact with the postmaster process</a>.</div></div><p>This also shows that even very well-optimized, mature systems often still have gains to be made by those with the knowledge and insight to find them.</p><h3 id="handle-infinite-recursion-in-a-logical-replication-setup">Handle infinite recursion in a logical replication setup</h3><p>PostgreSQL’s logical replication is built on a publisher/subscriber model, where a publisher exposes a series of changes that a subscriber ingests and writes locally. In current releases of PostgreSQL, it is not really possible to have a loop of subscribers and publishers. Replication must always be unidirectional.</p><p>For the most part, this works well because bidirectional replication poses fundamental (mathematical) conflict resolution problems that are impossible to solve where authoritative data is required. For example, if the same row is updated on two different mutually replicating systems, there is insufficient information to determine what the final state should be. For this reason, cyclic replication topologies—often called “multi-master replication”—are generally frowned upon.</p><p>A common approach to addressing this problem is to use “last update wins” as a strategy, but this approach necessarily clobbers existing updates. 
In my previous work at other companies, where we used such complex replication topologies with other databases, we had to take steps elsewhere in the infrastructure to prevent conflicting updates for exactly this reason.</p><p><a href="https://commitfest.postgresql.org/39/3610/">This patch</a> allows, for the first time, logical replication loops. A subscriber can already republish data it subscribed to; with new options set on the subscription, the original publisher can now subscribe in turn to data republished from its own subscriptions. Setting an “origin” option on the subscription prevents replication write loops, where the same insert or update would otherwise be replicated back and forth forever.</p><p>Logical replication is a completely different beast from physical streaming replication, with different use cases, different pitfalls, and very different administrative burdens. Nevertheless, this is a massive leap forward toward a community-owned bidirectional logical replication capability, which will likely open doors for PostgreSQL where replication topologies based on cyclic graphs (often called multi-master replication) are actually worth their significant costs.</p><h3 id="proper-planner-support-for-order-by-distinct">Proper planner support for <code>ORDER BY</code> / <code>DISTINCT</code></h3><p>PostgreSQL supports various operations whose results depend on input ordering, such as <code>percentile_cont</code> and other ordered-set aggregates, window functions such as <code>ntile</code>, and aggregates with the <code>DISTINCT</code> modifier. The efficiency of these operations is affected by the sort ordering of the table scans, and in current versions of PostgreSQL, the planner does not take this into account. 
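</p><p>For example (with a hypothetical <code>orders</code> table), both aggregates below depend on the ordering or de-duplication of their input, and each may force its own sort:</p>

```sql
SELECT customer_id,
       -- ordered-set aggregate: the result depends on the sort order
       -- of the input rows
       percentile_cont(0.5) WITHIN GROUP (ORDER BY amount) AS median_amount,
       -- DISTINCT aggregate: the input must be de-duplicated, which is
       -- cheap when rows arrive pre-sorted
       count(DISTINCT product_id) AS distinct_products
FROM   orders
GROUP  BY customer_id;
```

<p>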
As a result, <code>ORDER BY</code> and <code>DISTINCT</code> in aggregate functions can lead to unnecessary sorts and therefore slower queries. </p><p>Proper planner support for these operations is a significant performance win for anyone using such aggregates. In my experience, these features have many users, especially those whose systems combine transactional and decision-support workloads. <a href="https://commitfest.postgresql.org/39/3164/">This patch</a> represents another significant improvement for analytic workloads on PostgreSQL.</p><p>Now let’s address the two patches that haven’t yet been committed.</p><h2 id="patches-not-yet-committed">Patches Not Yet Committed</h2><p>Not every interesting patch that caught my attention got committed this time around. Two important patches deserve further discussion as they progress through peer review and the commitfest feedback process. Both are currently listed as “returned with feedback” but are of sufficient importance, or near enough to completion, to be worth watching anyway.</p><h3 id="kerberos-delegation">Kerberos Delegation</h3><p><a href="https://web.mit.edu/kerberos/">Kerberos</a>, the authentication protocol developed at MIT and available on many platforms before being incorporated into Microsoft’s Active Directory, has the capacity to pass delegated credentials between hosts. PostgreSQL can already accept delegated credentials for authentication but currently cannot delegate credentials itself.</p><p>So, for example, when a user accesses an internal ASP.NET application on a company intranet, the web server can authenticate the user via Kerberos and then pass a delegated credential on to the database if needed for actual access. 
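</p><p>On the database side, accepting Kerberos (GSSAPI) credentials is configured in <code>pg_hba.conf</code>. A sketch, with a hypothetical realm and network range:</p>

```
# Require GSSAPI-encrypted connections and authenticate via Kerberos;
# include_realm=0 strips the realm from the authenticated user name.
hostgssenc  all  all  10.0.0.0/8  gss  include_realm=0 krb_realm=EXAMPLE.COM
```

<p>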
This approach allows fine-grained, layered control over access to data, with a great deal of defense in depth.</p><p>What PostgreSQL currently cannot do is delegate Kerberos credentials onward, which means Kerberos authentication cannot be used between PostgreSQL nodes over mechanisms like foreign data wrappers.</p><p>A <a href="https://commitfest.postgresql.org/39/3582/">proposed patch</a> would let PostgreSQL delegate Kerberos credentials to libpq connections, allowing them to be used by the PostgreSQL foreign data wrapper, dblink, and similar extensions. While this patch is listed as “returned with feedback,” the general sense is that it would be a really useful feature, so I would read that status as a note that the current patch has some problems that need to be rethought before it can be accepted.</p><p>The fundamental difficulty is that Kerberos session encryption does not provide forward secrecy, and therefore, when session encryption is combined with credential delegation, an attacker on the middle host could potentially break the encryption. Where passwords are also used for a foreign data wrapper connection, this would render some protections on sensitive data less robust or, in some cases (for example, where passwords are used in a passthrough way to authenticate against a third-party provider), even allow password disclosure.</p><p>The overall consensus is that this feature would be very helpful in PostgreSQL, something I believe as well. The patch’s rejection is also a great testament to the degree of peer review the community applies to security-critical code paths. 
I expect that sooner or later this feature will be included with appropriate safety measures in place, and I hope it is resubmitted sooner rather than later.</p><h3 id="allows-database-specific-role-memberships">Allows database-specific role memberships</h3><p>In PostgreSQL, roles (which include users) are global to an instance of PostgreSQL (called a “cluster” in PostgreSQL terminology). This means that if you have several databases managed by the same PostgreSQL instance, the roles and role memberships are shared across all of them.</p><p>One major difficulty in writing multi-tenant applications that use databases as the tenant boundary is that granting a user a role membership in one database grants it in every database. For example, if you have an accounting application and a user “chris” who needs different permissions in three different databases on the same server, you cannot use a single, consistent set of roles to manage those permissions.</p><p>One solution to this problem (which we adopted when I was building out such a system in LedgerSMB) is to create different roles for each database, embedding, for example, the database name in the role name. This adds a certain degree of complexity and makes the role names harder to read. Another option is to limit each user to a single database, which adds complexity in a different place.</p><p>A <a href="https://commitfest.postgresql.org/39/3374/">proposed patch</a> would allow role memberships to be granted only on a specific database. For multi-tenant applications, this would be a game changer. The patch was near acceptance when further discussion arose around documentation; it needed to be rebased and corrected, and the initial submitter did not reply in a timely manner.  
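</p><p>To make the proposal concrete, here is a sketch contrasting today’s workaround with what the patch would allow. The role and database names are hypothetical, and the scoped-grant syntax is illustrative of the proposal, not committed PostgreSQL syntax:</p>

```sql
-- Today: roles are global to the cluster, so one workaround is to
-- encode the database name into each role name:
CREATE ROLE app1_readonly;
CREATE ROLE app2_readonly;
GRANT app1_readonly TO chris;  -- applies cluster-wide; "app1" is only a naming convention

-- With the proposed patch: a role membership scoped to a single
-- database (illustrative syntax based on the patch discussion):
GRANT readonly TO chris IN DATABASE app1;
```

<p>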
This can happen—people become busy or unavailable for one reason or another, and patches can end up temporarily orphaned.</p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">✨</div><div class="kg-callout-text"><strong>Editor's Note: </strong>For an insider’s perspective on commitfests, <a href="https://timescale.ghost.io/blog/how-to-manage-a-commitfest/">read this blog post on being a commitfest manager</a>, and <a href="https://timescale.ghost.io/blog/what-does-a-postgresql-commitfest-manager-do-and-should-you-become-one/">check out if you should become one</a>.</div></div><p>Given the general utility of this patch for multi-tenant applications, I would like to see it fixed and resubmitted sooner rather than later.</p><h2 id="concluding-thoughts">Concluding Thoughts</h2><p>The patches discussed here represent small but useful steps forward for PostgreSQL, and they are far from alone: improvements like these land in every major release. </p><p>Commitfests such as this one provide great insight into that process, the review patches undergo, and how PostgreSQL keeps moving forward.</p><p>That’s one of my <strong>favorite things about PostgreSQL: it’s a database that is significantly improving with each major release. </strong>If you want to add even more functionality to PostgreSQL, <a href="https://www.timescale.com/timescale-signup">explore TimescaleDB</a>—<a href="https://timescale.ghost.io/blog/postgresql-timescaledb-1000x-faster-queries-90-data-compression-and-much-more/"><strong>it extends PostgreSQL</strong> with things like automatic time-based partitioning and indexing, continuous aggregates, columnar compression, and time-series functionality.</a> And if you’re using a managed service for PostgreSQL, <a href="https://console.cloud.timescale.com/">try Timescale</a>—it’s free for 30 days, no credit card required. <br></p>]]></content:encoded>
        </item>
    </channel>
</rss>