<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[Tiger Data Blog]]></title>
        <description><![CDATA[Insights, product updates, and tips from TigerData (Creators of TimescaleDB) engineers on Postgres, time series & AI. IoT, crypto, and analytics tutorials & use cases.]]></description>
        <link>https://www.tigerdata.com/blog</link>
        <image>
            <url>https://www.tigerdata.com/icon.ico</url>
            <title>Tiger Data Blog</title>
            <link>https://www.tigerdata.com/blog</link>
        </image>
        <generator>RSS for Node</generator>
        <lastBuildDate>Tue, 07 Apr 2026 17:58:34 GMT</lastBuildDate>
        <atom:link href="https://www.tigerdata.com/blog" rel="self" type="application/rss+xml"/>
        <ttl>60</ttl>
        <item>
            <title><![CDATA[Introducing Direct Compress: Up to 40x Faster, Leaner Data Ingestion for Developers (Tech Preview)]]></title>
            <description><![CDATA[Direct Compress delivers up to 40x faster data ingestion for TimescaleDB by compressing time-series data in memory during insertion, eliminating background jobs.]]></description>
            <link>https://www.tigerdata.com/blog/introducing-direct-compress-up-to-40x-faster-leaner-data-ingestion-for-developers-tech-preview</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/introducing-direct-compress-up-to-40x-faster-leaner-data-ingestion-for-developers-tech-preview</guid>
            <category><![CDATA[Announcements & Releases]]></category>
            <category><![CDATA[TimescaleDB]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Product & Engineering]]></category>
            <dc:creator><![CDATA[Sven Klemm]]></dc:creator>
            <pubDate>Tue, 09 Sep 2025 13:00:52 GMT</pubDate>
            <media:content medium="image" url="https://timescale.ghost.io/blog/content/images/2025/09/direct-compress-thumbnail.png">
            </media:content>
            <content:encoded><![CDATA[<p>Time-series and analytical data continues to grow at an unprecedented pace, and with it comes the challenge of efficiently storing and querying massive datasets. Traditionally, compressing this data required background jobs and additional tuning. This slowed down ingestion, added operational headaches, and delayed storage savings. </p><p>That’s why today, we're excited to announce <strong>Direct Compress</strong>, a new feature coming to TimescaleDB that compresses data during ingestion in memory, eliminating the need for traditional compression policies and improving insert performance by up to 40x.</p><p><em>Note: Direct Compress is currently available as a tech preview in TimescaleDB </em><a href="https://github.com/timescale/timescaledb/releases/tag/2.21.0"><em><u>2.21</u></em></a><em> for COPY operations, with full support for INSERT operations coming in a later version.</em></p><h2 id="the-evolution-of-timescaledb%E2%80%99s-columnstore">The Evolution of TimescaleDB’s Columnstore&nbsp;</h2><p>TimescaleDB has long been recognized for its industry-leading compression capabilities. With <a href="https://docs.tigerdata.com/use-timescale/latest/hypercore/"><u>hypercore</u></a>, TimescaleDB's hybrid row-columnar storage engine, users can achieve compression ratios of over 90% while maintaining fast query performance. Traditionally, the system would:</p><ol><li>Insert data in uncompressed row format</li><li>Write individual WAL records for each tuple</li><li>Later compress chunks through background policies</li></ol><p>Now, Direct Compress fundamentally changes this approach by compressing data <strong>during the ingestion process itself</strong>.</p><h2 id="what-is-direct-compress">What is Direct Compress?</h2><p>Direct Compress is a feature that allows TimescaleDB to compress data in memory as it's being ingested. Instead of writing WAL records for individual tuples, the system writes compressed batches directly to disk. 
This approach addresses several key challenges that developers and database administrators face when working with high-volume time-series data:</p><ul><li><strong>Excessive I/O overhead</strong>: Traditional ingestion requires writing each tuple individually to the WAL, creating significant I/O bottlenecks</li><li><strong>Dependency on compression policies</strong>: Previously, you had to wait for background compression jobs to optimize storage</li><li><strong>Insert performance limitations</strong>: Large-scale data ingestion was constrained by the overhead of individual tuple processing</li></ul><h2 id="benchmark-results-37x-improvement">Benchmark Results (37x Improvement)</h2><p>To test the per-tuple overhead, a narrow table with only one integer column was used. Direct compression provided considerable performance improvements, with the single integer table achieving 148.8 million tuples per second using 10k batch compression—a 37x improvement over uncompressed insertion. For a table with a timestamp column and two integer columns, we achieved an insert rate of 66 million tuples per second with compression.</p><p>The schema used has a big impact on the achievable insert rate, with more complex datatypes like jsonb or wider rows having lower ingest rates. Parsing integer columns was found to have the least overhead compared to other datatypes, and for these benchmarks more than half of the CPU time was spent parsing input even when using binary input format. Performance scaled linearly across all thread counts until reaching the storage I/O bottleneck. During these tests, we used a <a href="https://www.tigerdata.com/cloud"><u>Tiger Cloud</u></a> instance with 64 cores and EBS storage—with more optimized storage, higher numbers are likely achievable. For the uncompressed tests, no indexes were present on the <a href="https://docs.tigerdata.com/use-timescale/latest/hypertables/"><u>hypertables</u></a>. 
The 1k and 10k batch sizes refer to the batch size used internally during compression, not the batch size used by the client sending the data.</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2025/09/insert-rate-vs-number-of-threads-.png" class="kg-image" alt="insert rate vs number of threads" loading="lazy" width="896" height="672" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2025/09/insert-rate-vs-number-of-threads-.png 600w, https://timescale.ghost.io/blog/content/images/2025/09/insert-rate-vs-number-of-threads-.png 896w" sizes="(min-width: 720px) 720px"></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2025/09/Table-with-timestamp-and-2-int-columns-1.png" class="kg-image" alt="Table with timestamp and 2 int columns" loading="lazy" width="810" height="607" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2025/09/Table-with-timestamp-and-2-int-columns-1.png 600w, https://timescale.ghost.io/blog/content/images/2025/09/Table-with-timestamp-and-2-int-columns-1.png 810w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Table with timestamp and 2 int columns</span></figcaption></figure><h2 id="key-benefits">Key Benefits</h2><h3 id="reduced-io-operations">Reduced I/O operations</h3><p>By compressing data in memory before writing to disk, Direct Compress eliminates the need to write individual WAL records for each tuple. Instead, only compressed batches are written, dramatically reducing I/O overhead.&nbsp;</p><h3 id="eliminated-policy-dependencies">Eliminated policy dependencies</h3><p>With Direct Compress, your <code>INSERT</code> operations already produce compressed chunks. 
This means <code>compress_chunk()</code> functions and compression policies become less critical to your workflow, simplifying your database maintenance.</p><h3 id="immediate-storage-efficiency">Immediate storage efficiency</h3><p>Unlike traditional compression that happens after ingestion, Direct Compress provides storage benefits immediately, reducing your storage footprint from the moment data arrives.</p><h2 id="how-direct-compress-works">How Direct Compress Works</h2><p>Direct Compress operates by intercepting data during the ingestion process and compressing it in memory before writing to disk. The process involves:</p><ol><li><strong>Batch Collection</strong>: Data is collected in configurable batches during <code>COPY</code> or <code>INSERT</code> operations.</li><li><strong>In-Memory Compression</strong>: Each batch is compressed using TimescaleDB's proven compression algorithms.</li><li><strong>Optimized Writing</strong>: Compressed batches are written directly to disk with minimal WAL overhead.</li></ol><p>This approach differs from traditional compression methods because it eliminates the two-step process of "ingest then compress," instead performing both operations simultaneously. <strong>Importantly, Direct Compress requires batched operations on the client side</strong> to achieve these performance benefits. 
With direct compression, data ingestion becomes limited by CPU processing rather than I/O speed.</p><p><strong>Roadmap</strong></p><ul><li><strong>COPY support</strong> (TimescaleDB 2.21 - Tech Preview)</li><li><strong>INSERT support</strong> (coming soon)</li><li><strong>Continuous aggregate support</strong> (coming soon)</li></ul><h2 id="getting-started-with-direct-compress">Getting Started with Direct Compress</h2><h3 id="prerequisites">Prerequisites</h3><p>Before using Direct Compress, ensure you have:</p><ul><li>TimescaleDB version 2.21 or later (currently in tech preview)</li><li>A hypertable with compression enabled (<a href="#basic-usage-example" rel="noreferrer"><u>see example</u></a>)</li><li>Batched client operations to make use of the feature</li></ul><h3 id="important-requirements-and-limitations">Important requirements and limitations</h3><p>Direct Compress <strong>requires batching</strong> on the client side to function effectively. It cannot be used:</p><ul><li>If the hypertable schema has unique constraints</li><li>If the hypertable has triggers</li><li>If there are continuous aggregates on the target hypertable</li></ul><h3 id="configuration-options">Configuration options</h3><p>Direct Compress is controlled through several GUCs (Grand Unified Configuration parameters):</p><p><code>timescaledb.enable_direct_compress_copy</code><strong> (default: off)</strong></p><p>Enables the core Direct Compress feature for <code>COPY</code> operations. When enabled, chunks will be marked as unordered, so presorting is not required.</p><p><code>timescaledb.enable_direct_compress_copy_sort_batches</code><strong> (default: on)</strong></p><p>Enables per-batch sorting before writing compressed data, which can improve query performance.</p><p><code>timescaledb.enable_direct_compress_copy_client_sorted</code><strong> (default: off)</strong></p><p><strong>⚠️ DANGER</strong>: When enabled, chunks will not be marked as unordered. 
Only use this if your data is globally sorted, as queries that rely on ordering will produce incorrect results otherwise. In the context of this feature, we distinguish between local and global sorting: local sorting means that the data within each batch is sorted; global sorting additionally means that no two batches overlap.</p><h2 id="basic-usage-example">Basic Usage Example</h2><pre><code class="language-SQL">-- Create a hypertable with compression
CREATE TABLE sensor_data(
    time timestamptz, 
    device text, 
    value float
) WITH (
    tsdb.hypertable,
    tsdb.partition_column='time'
);

-- Enable Direct Compress
SET timescaledb.enable_direct_compress_copy = on;

-- Use binary format for maximum performance
COPY sensor_data FROM '/tmp/sensor_data.binary' WITH (format binary);</code></pre><h2 id="best-practices-and-recommendations">Best Practices and Recommendations</h2><h3 id="1-use-binary-format">1. Use binary format</h3><p>Binary format achieves the highest insert rates. CSV and text formats are also supported but carry more parsing overhead.</p><h3 id="2-consider-order-by-configuration">2. Consider the orderby configuration</h3><p>The default <code>orderby</code> configuration is <code>time DESC</code> for query optimization. However, for maximum Direct Compress benefits, consider changing this to <code>time</code> to optimize for insert performance:</p><pre><code class="language-SQL">ALTER TABLE sensor_data SET (timescaledb.orderby = 'time');</code></pre><p>This represents a trade-off between insert performance and query performance—choose based on your primary use case.</p><h3 id="3-presort-data-before-ingestion">3. Presort data before ingestion</h3><p>While TimescaleDB can sort data as part of Direct Compress, doing so takes CPU resources away from other tasks, so presort your data on the client side where possible.</p><h3 id="4-leverage-multiple-threads">4. Leverage multiple threads</h3><p>The benchmark results show significant benefits from parallel ingestion. Consider using multiple threads for large data imports.</p><h2 id="migration-and-compatibility">Migration and Compatibility</h2><h3 id="upgrading-existing-tables">Upgrading existing tables</h3><p>Direct Compress works with any existing hypertable that has the columnstore enabled, provided the limitations (no unique constraints, triggers, or continuous aggregates) are met.&nbsp;</p><h3 id="backward-compatibility">Backward compatibility</h3><p>Direct Compress is fully compatible with existing TimescaleDB compression features. 
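</p><p>To verify that freshly ingested chunks really arrive compressed, you can inspect TimescaleDB's compression statistics. The sketch below assumes the <code>sensor_data</code> hypertable from the earlier example and the standard stats function:</p><pre><code class="language-SQL">-- Compare per-chunk sizes before and after compression
SELECT chunk_name,
       before_compression_total_bytes,
       after_compression_total_bytes
FROM chunk_compression_stats('sensor_data');</code></pre><p>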
You can use both traditional columnstore policies and Direct Compress simultaneously, though Direct Compress reduces the need for background compression jobs.</p><h2 id="looking-forward">Looking Forward</h2><p>Direct Compress represents a significant milestone in TimescaleDB's ongoing evolution toward real-time analytics at scale. This feature is part of our broader commitment to eliminating the traditional trade-offs between ingestion speed and storage efficiency.</p><p>Future enhancements to Direct Compress will include:</p><ul><li>Support for <code>INSERT</code></li><li>Additional optimizations for unsorted data when using Direct Compress</li><li>Compatibility with continuous aggregates</li><li>Enhanced client-side tooling for optimal batching</li></ul><h2 id="try-direct-compress-today">Try Direct Compress Today</h2><p>Direct Compress brings considerable performance improvements to TimescaleDB users by eliminating the traditional ingestion bottleneck. With up to 40x faster ingestion rates and immediate storage benefits, this feature is a game-changer for high-volume time-series applications.</p><p>Whether you're managing IoT sensor data, financial market feeds, or application monitoring metrics, Direct Compress can help you achieve unprecedented ingestion performance while reducing storage costs from day one.</p><p>We encourage you to try the tech preview of Direct Compress in your development environment and share your experiences with the community. Your feedback will help us refine this feature as we move toward full release. As always, our team is available to help you optimize your TimescaleDB deployment for your specific use case.</p><hr><p><strong>Ready to get started?</strong> Check out our <a href="https://docs.tigerdata.com/"><u>documentation</u></a> or <a href="https://www.tigerdata.com/contact"><u>contact our team</u></a> for personalized assistance with Direct Compress implementation.</p><p><em>Have questions about Direct Compress or want to share your results? 
Join the conversation in our </em><a href="https://forum.tigerdata.com/forum/"><em><u>community forum</u></em></a><em> or reach out to us on </em><a href="https://github.com/timescale/timescaledb"><em><u>GitHub</u></em></a><em>.</em></p><hr><h3 id="about-the-author">About the Author</h3><p>Sven is the tech lead for TimescaleDB, but his journey with databases started a long time ago. For over 25 years, he has been a huge fan of PostgreSQL, and it's that deep-seated passion that led him to where he is today. His work on planner optimizations and diving into the columnstore to squeeze out every bit of performance is a direct extension of his goal: to make the Postgres ecosystem even more powerful and efficient for everyone who uses it.</p><p>That long history with Postgres also informs his work on the security front. One of the projects he is most passionate about is <a href="https://github.com/timescale/pgspot"><u>pgspot</u></a>, where he gets to help build a more secure future for the database. After all these years, he has seen firsthand how a strong, trustworthy foundation is essential. To him, a great database isn't just about speed; it's about protecting the data with unwavering reliability. This blend of performance and security is what truly excites him every day.</p><p>When he's not in the weeds of database code, you can find him thinking about the bigger picture—how to make the community and product stronger, safer, and more user-friendly. He loves the challenge of taking a complex problem and finding a simple, elegant solution. His journey with Postgres has taught him that the best technology is built on a foundation of trust and a commitment to continuous improvement.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Bridging the Gap Between Compressed and Uncompressed Data in Postgres: Introducing Compression Tuple Filtering]]></title>
            <description><![CDATA[Compressed data in Postgres now performs more like uncompressed data. Discover how we achieved up to 500x faster updates & deletes on compressed data.]]></description>
            <link>https://www.tigerdata.com/blog/bridging-the-gap-between-compressed-and-uncompressed-data-in-postgres</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/bridging-the-gap-between-compressed-and-uncompressed-data-in-postgres</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Announcements & Releases]]></category>
            <category><![CDATA[Analytics]]></category>
            <category><![CDATA[PostgreSQL Performance]]></category>
            <dc:creator><![CDATA[Sven Klemm]]></dc:creator>
            <pubDate>Wed, 18 Sep 2024 13:00:59 GMT</pubDate>
            <media:content medium="image" url="https://timescale.ghost.io/blog/content/images/2024/09/Blog-2.png">
            </media:content>
            <content:encoded><![CDATA[<p>When we introduced <a href="https://timescale.ghost.io/blog/building-columnar-compression-in-a-row-oriented-database/"><u>columnar compression for Postgres</u></a> in 2019, our goal was to help developers scale Postgres and efficiently manage growing datasets, such as IoT sensors, financial ticks, product metrics, and even vector data. Compression quickly became a game-changer, saving users significant storage costs and boosting query performance—all while keeping their data in Postgres. With many seeing over 95&nbsp;% compression rates, the impact was immediate.</p><p>But we didn’t stop there. Recognizing that many real-time analytics workloads demand flexibility for updating and backfilling data, we <a href="https://timescale.ghost.io/blog/timescaledb-2-3-improving-columnar-compression-for-time-series-on-postgresql/"><u>slowly</u></a> but <a href="https://timescale.ghost.io/blog/allowing-dml-operations-in-highly-compressed-time-series-data-in-postgresql/"><u>surely enhanced</u></a> our compression engine to support <code>INSERT</code>, <code>UPDATE</code>, and <code>DELETE</code> (DML) operations directly on compressed data. This allowed users to work with compressed data almost as easily as they do with uncompressed data.</p><p>However, it also created a problem. While we had originally intended mutating compressed chunks to be a rare event, people were now pushing its limits with frequent inserts, updates, and deletes. Seeing our customers go all in on this feature confirmed that we were on the right track, but we had to double down on performance.</p><p>Today, we’re proud to announce significant improvements as of TimescaleDB 2.16.0, delivering up to <strong>500x faster updates and deletes</strong> and <strong>10x faster upserts</strong> on compressed data. 
These optimizations make compressed data behave even more like uncompressed data—without sacrificing performance or flexibility.</p><p>Let’s dive into how we achieved these performance gains and what they mean for you. To check this week’s previous launches and keep track of upcoming ones, head to this <a href="https://timescale.ghost.io/blog/making-postgres-faster/"><u>blog post</u></a> or our <a href="https://www.timescale.com/launch/2024"><u>launch page</u></a>.&nbsp;</p><h2 id="how-we-allowed-dml-operations-on-compressed-data">How We Allowed DML Operations on Compressed Data</h2><p>To understand our latest improvements, it helps to revisit how we initially threw away the rule book and <a href="https://timescale.ghost.io/blog/allowing-dml-operations-in-highly-compressed-time-series-data-in-postgresql/"><u>enabled DML operations on compressed data</u></a>.</p><p>Working with compressed data is tricky. Imagine trying to update a zipped file. You’d need to unzip the file, make your changes, and then zip it back up. Similarly, updating or deleting data in a compressed database often involves decompressing and reprocessing large chunks (potentially full tables) of data, which can slow things down significantly.</p><p>Our solution was to segment and group records into batches of 1,000, so instead of working with millions of records at once, we operate on smaller, more manageable groups. We also used techniques like <strong>segment indexes</strong> and <strong>sparse indexes </strong>on batch metadata to identify only the relevant batches of compressed data that contain the values to be updated or deleted.</p><p>However, a challenge remained: if no metadata is stored for a batch, we have to assume there could be a row inside that needs to be updated. 
This assumption requires decompressing the batch and materializing it into an uncompressed row format, which takes up precious time and disk space.</p><h2 id="optimizing-batch-processing-in-timescaledb-2160">Optimizing Batch Processing in TimescaleDB 2.16.0</h2><p>With the release of TimescaleDB 2.16.0, we focused on reducing the number of batches that need to be decompressed and materialized. This improvement, known as <strong>compression tuple filtering</strong>, allows us to filter out unnecessary data at various processing stages, dramatically speeding up DML operations.</p><h3 id="real-world-performance-gains">Real-world performance gains</h3><p>Here’s what that looks like in practice:</p><ul><li><strong>Up to 500x faster updates and deletes</strong>: by avoiding the need to decompress and materialize irrelevant batches, DML operations like <code>UPDATE</code> and <code>DELETE</code> can now be completed significantly faster.</li><li><strong>Up to 10x faster upserts</strong>: similarly, upserts (a combination of <code>INSERT</code> and <code>UPDATE</code>) are optimized to avoid unnecessary decompression (<a href="https://timescale.ghost.io/blog/how-we-made-postgresql-upserts-300x-faster-on-compressed-data/" rel="noreferrer">and that’s not the only boost we made for upserts</a>).</li></ul><p>These gains translate to major real-world performance improvements, particularly for users dealing with large datasets that require high-frequency updates of data that has already been compressed.</p><h2 id="how-compression-tuple-filtering-works">How Compression Tuple Filtering Works</h2><p>To achieve these optimizations, we filter data at multiple stages during DML operations. 
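</p><p>For context, the batch metadata discussed here comes from the hypertable's compression settings. A typical configuration (the <code>device_id</code> column is illustrative; the <code>metrics</code> table matches the example later in this post) looks like this:</p><pre><code class="language-SQL">-- Batches are grouped by device_id; min/max metadata is kept for time
ALTER TABLE metrics SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'device_id',
    timescaledb.compress_orderby = 'time DESC'
);</code></pre><p>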
Previously, filtering was only possible using constraints on <strong><code>segmentby</code></strong> columns or columns with metadata, such as <strong><code>orderby</code></strong> or columns with a <a href="https://timescale.ghost.io/blog/boost-postgres-performance-by-7x-with-chunk-skipping-indexes/" rel="noreferrer">chunk skipping index</a>.</p><p>With TimescaleDB 2.16.0, we’ve taken a microscope to our decompression pipeline and added an additional layer of inline filtering. When running DML operations, if a column doesn’t have metadata, we now apply constraints incrementally during decompression. If a batch is fully filtered out by the constraint, it’s skipped entirely and never materialized to disk or passed to Postgres to continue evaluation. This saves significant resources and time by reducing the amount of work left to do further down the query pipeline.</p><h3 id="insert-optimizations">INSERT optimizations</h3><p>Let’s break it down by DML operation, starting with <code>INSERT</code>.</p><ul><li><strong>Regular <code>INSERT</code> operations</strong>: When inserting data into compressed chunks, no decompression is required unless you have a <code>UNIQUE</code> constraint on the <a href="https://www.tigerdata.com/blog/database-indexes-in-postgresql-and-timescale-cloud-your-questions-answered" rel="noreferrer">hypertable</a>. This makes inserts almost as fast as with uncompressed data.</li><li><strong>Inserts with <code>UNIQUE</code> constraints</strong>: When a <code>UNIQUE</code> constraint is in place, there are three possible scenarios:<ol><li><strong>No <code>ON CONFLICT</code> clause</strong>: We check for constraint violations during decompression. 
If a violation is found, it’s flagged before any data is materialized to disk.</li><li><code><strong>ON CONFLICT DO NOTHING</strong></code>: Similar to the first case, violations are flagged during decompression, and the <code>INSERT</code> is skipped without materializing data.</li><li><strong><code>ON CONFLICT DO UPDATE</code> (<code>UPSERT</code>)</strong>: In cases of conflict, we decompress only the batch containing the conflicting tuple. If no conflict is detected, there’s no need to materialize anything.</li></ol></li></ul><h3 id="updatedelete-optimizations">UPDATE/DELETE Optimizations</h3><p>For <code>UPDATE</code> and <code>DELETE</code> operations, we first check if any tuples in a batch match the query constraints. If none do, the batch is skipped, and no decompression or materialization is needed. This skip leads to dramatically faster update and delete operations.</p><h3 id="real-world-comparison-before-and-after">Real-world comparison: Before and after</h3><p>Let’s look at a concrete example to illustrate the performance difference with tuple filtering.</p><p>Assume you have a hypertable with 1,000 batches, each containing 1,000 tuples, for a total of 1,000,000 rows. Now, you want to delete all rows where <code>value &lt; 10</code>. 
In this case, these rows are all contained within a single batch (maybe values lower than 10 are very rare and happen only for a short period of time).</p><p><code>DELETE FROM metrics WHERE value &lt; 10;</code></p><p>Before TimescaleDB 2.16.0, we would have to decompress all 1,000 batches and materialize 1,000,000 tuples to disk—even if only a small portion matched the query constraints.</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2024/09/Bridging-the-Gap-Between-Compressed-and-Uncompressed-Data-in-Postgres_compression-tuple-filtering_decompression.png" class="kg-image" alt="A diagram illustrating the before: we would decompress all 1,000 batches and materialize them even if only a small part of them matched the query constraints" loading="lazy" width="1200" height="1089" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2024/09/Bridging-the-Gap-Between-Compressed-and-Uncompressed-Data-in-Postgres_compression-tuple-filtering_decompression.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2024/09/Bridging-the-Gap-Between-Compressed-and-Uncompressed-Data-in-Postgres_compression-tuple-filtering_decompression.png 1000w, https://timescale.ghost.io/blog/content/images/2024/09/Bridging-the-Gap-Between-Compressed-and-Uncompressed-Data-in-Postgres_compression-tuple-filtering_decompression.png 1200w" sizes="(min-width: 720px) 720px"></figure><p>In TimescaleDB 2.16.0, this changes with compressed tuple filtering. 
We now only need to decompress the relevant batches, avoiding unnecessary materialization.</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2024/09/Bridging-the-Gap-Between-Compressed-and-Uncompressed-Data-in-Postgres-Introducing-Compression-Tuple-Filtering_tuple-filtering.png" class="kg-image" alt="A diagram illustrating how tuple filtering during compression works" loading="lazy" width="777" height="826" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2024/09/Bridging-the-Gap-Between-Compressed-and-Uncompressed-Data-in-Postgres-Introducing-Compression-Tuple-Filtering_tuple-filtering.png 600w, https://timescale.ghost.io/blog/content/images/2024/09/Bridging-the-Gap-Between-Compressed-and-Uncompressed-Data-in-Postgres-Introducing-Compression-Tuple-Filtering_tuple-filtering.png 777w" sizes="(min-width: 720px) 720px"></figure><p>In our example query, this reduces the total query time from 829,517.741 ms to 1,487.494 ms—<strong>557x faster</strong>! As you can see, tuple filtering allows us to drastically reduce the work needed to execute this query, resulting in a huge speed-up. The speed-up also scales with selectivity: the more tuples that can be filtered out, the faster your query becomes.</p><h2 id="why-do-dml-operations-on-compressed-data-matter">Why DML Operations on Compressed Data Matter</h2><p>When working with massive datasets, every millisecond counts, and resource efficiency becomes crucial. The improvements introduced in TimescaleDB 2.16.0 directly address these needs by minimizing the amount of data that must be decompressed, written to disk, and then evaluated by Postgres. This not only reduces disk I/O and CPU usage but also significantly lowers execution time. The result? 
More headroom to handle larger datasets, scale your systems seamlessly, and improve overall application performance.</p><p>For developers managing frequent updates, inserts, or deletes in TimescaleDB hypertables, these optimizations mean that less thought has to be given to the current format of the data. Regardless of whether the data is currently stored on disk as rows or compressed columns, DML operations will work as expected.</p><h2 id="final-words">Final Words</h2><p>With the release of TimescaleDB 2.16.0, our <a href="https://www.tigerdata.com/blog/building-columnar-compression-in-a-row-oriented-database" rel="noreferrer">columnar</a> compression engine takes another big leap forward. Users can now benefit from up to <strong>500x faster updates and deletes</strong> and <strong>10x faster upserts</strong>—all while continuing to enjoy the storage savings and performance gains of compression.</p><p>Looking ahead, we’re committed to further enhancing our compression engine, delivering even more flexibility and performance gains. Stay tuned—there’s more to come.</p><p>Want to see these improvements in action? <a href="https://www.timescale.com"><u>Sign up for Timescale today</u></a> and experience the power of our compression engine for your real-time analytics workloads.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How We Fixed Long-Running PostgreSQL now( ) Queries (and Made Them Lightning Fast)]]></title>
            <description><![CDATA[Help requests about slowdowns in PostgreSQL now( ) queries are a thing of the past. Learn how we fixed it in TimescaleDB 2.7 for lightning-fast performance (up to 400x faster!).]]></description>
            <link>https://www.tigerdata.com/blog/how-we-fixed-long-running-postgresql-now-queries</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/how-we-fixed-long-running-postgresql-now-queries</guid>
            <category><![CDATA[Announcements & Releases]]></category>
            <category><![CDATA[PostgreSQL Performance]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Sven Klemm]]></dc:creator>
            <pubDate>Wed, 22 Jun 2022 13:00:24 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2022/06/dog-racing-2878713_1920--1-.jpg">
            </media:content>
            <content:encoded><![CDATA[<p>It was just another regular Wednesday in our home offices when we received a question in the <a href="https://www.timescale.com/forum">forum</a> about a query using the Postgres <code>now()</code> function. A TimescaleDB user with dozens of tables of IoT data reported a slow degradation in query performance and creeping server CPU usage. After struggling with the issue, they turned to our community for help.</p>
<!--kg-card-begin: html-->
<iframe src="https://giphy.com/embed/QBAzA0CaPCKwGg3pDs" width="480" height="270" frameBorder="0" class="giphy-embed" allowFullScreen></iframe><p><a href="https://giphy.com/gifs/groundhogday-groundhog-day-movie-QBAzA0CaPCKwGg3pDs">via GIPHY</a></p>

<!--kg-card-end: html-->
<p>That same question came up in our forum, <a href="http://timescaledb.slack.com">Community Slack</a>, and <a href="https://www.timescale.com/support">support</a> more often than we’d like. We could relate to this particular pain point because we also struggled with it in partitioned vanilla PostgreSQL. After a closer look at the user’s query, we found the usual suspect: the issue of high planning time in the presence of many chunks—<a href="https://docs.timescale.com/timescaledb/latest/overview/core-concepts/hypertables-and-chunks/">in Timescale slang, chunks are data partitions within a table</a>—and in a query using a rather common function: <code>now()</code>.</p><p>Usually, the problem with these queries is that the chunk exclusion happens late. Chunk exclusion is what happens when some data partitions are not even considered during the query to speed up the process. The logic is simple: the less data a query has to go through, the faster it is.</p><p>However, the problem is that <code>now()</code>, <a href="https://www.postgresql.org/docs/current/xfunc-volatility.html">similarly to other stable functions in PostgreSQL</a>, is not considered during plan-time chunk exclusion, those precious moments in which your machine is trying to find the quickest way to execute your query while excluding some of your data partitions to further speed up the process. So, your chunks are only excluded later, at execution time, which results in higher planning time—and yes, you guessed it—slower performance.</p><p>Until now, every time this issue popped up, we knew what to do. 
We had written a wrapper function, marked as immutable, that simply called <code>now()</code>; its only purpose was to add the immutable marking so that PostgreSQL would consider it during plan-time chunk exclusion, thus improving query performance.</p><p>Well, not anymore.</p><p><strong>Today, we’re announcing the optimization of the <code>now()</code> function with the release of TimescaleDB 2.7</strong>, which solves this problem natively, matching the performance of our previous workaround. </p><p>In this blog post, we’ll look at the basics of the <code>now()</code> function, explain how it works in vanilla PostgreSQL and our previous TimescaleDB version, and wrap everything up with a description of our optimization, which evaluates <code>now()</code> expressions during plan-time chunk exclusion, significantly reducing planning time. Finally, we include a performance comparison that will blow you away (all we can say for now is “more than 400 times faster”).</p>
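<p>For reference, the old workaround described above can be sketched like this (a sketch; the function name is illustrative, not the exact one we shipped). Marking the wrapper as immutable deliberately overstates its stability guarantees, which is exactly what lets the planner constify it early:</p><pre><code class="language-sql">CREATE OR REPLACE FUNCTION approx_now() RETURNS timestamptz
LANGUAGE sql IMMUTABLE AS $$ SELECT now() $$;

-- Used in place of now() so chunk exclusion can happen at plan time:
SELECT * FROM hypertable WHERE time &gt; approx_now() - INTERVAL '5 minutes';</code></pre>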
<!--kg-card-begin: html-->
<iframe src="https://giphy.com/embed/Gpu3skdN58ApO" width="480" height="382" frameBorder="0" class="giphy-embed" allowFullScreen></iframe><p><a href="https://giphy.com/gifs/funny-elephant-fast-Gpu3skdN58ApO">via GIPHY</a></p>
<!--kg-card-end: html-->
<p>If you are already a TimescaleDB user, <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/update-timescaledb/">check out our docs for instructions on how to upgrade</a>. If you are using Timescale, upgrades are automatic, so all you need to do is sit back and enjoy this very fast ride! (New to Timescale? <a href="https://console.cloud.timescale.com/signup">You can start a free 30-day trial, no credit card required</a>.)</p><h2 id="now-in-vanilla-postgresql">now( ) in Vanilla PostgreSQL</h2><p>Queries with <code>now()</code> expressions are common in time-series data to retrieve readings of the last five minutes, three hours, three days, or other time intervals. In sum, <a href="https://www.postgresql.org/docs/current/functions-datetime.html"><code>now()</code> is a function</a> that returns the current time or, more accurately, the start time of the current transaction. These queries usually only need data from the most recent partition in a <a href="https://www.tigerdata.com/blog/database-indexes-in-postgresql-and-timescale-cloud-your-questions-answered" rel="noreferrer">hypertable</a>, also called a chunk. </p><p>A query to retrieve readings from the last five minutes could look like this:</p><pre><code class="language-sql">SELECT * FROM hypertable WHERE time &gt; now() - interval '5 minutes';
</code></pre>
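<p>If you want to see this planning overhead yourself, <code>EXPLAIN ANALYZE</code> reports planning and execution time separately (a generic sketch, reusing the hypertable from the query above):</p><pre><code class="language-sql">EXPLAIN (ANALYZE, TIMING OFF)
SELECT * FROM hypertable
WHERE time &gt; now() - interval '5 minutes';
-- With many chunks, the "Planning Time" line dominates "Execution Time"</code></pre>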
<p>To understand our users' slowdown, it’s vital to know that constraints in PostgreSQL can be constified at different stages in the planning process. The problem with <code>now()</code> is that it can only be constified during execution because the planning and execution times may differ. </p><p>Since <code>now()</code> is a stable function, it’s not considered for plan-time constraint exclusion; therefore, all chunks will have to be part of the planning process. For hypertables with many chunks, the total time for this query is often dominated by planning time, resulting in poor query performance.</p><p>If we dig a little deeper with the EXPLAIN output, we can see that all chunks of the hypertable are part of the plan, painfully inflating planning time.<br></p><pre><code class="language-sql"> Append  (cost=0.00..1118.94 rows=1097 width=20)
   -&gt;  Seq Scan on _hyper_3_38356_chunk  (cost=0.00..1.01 rows=1 width=20)
         Filter: ("time" &gt; now())
   -&gt;  Seq Scan on _hyper_3_38357_chunk  (cost=0.00..1.01 rows=1 width=20)
         Filter: ("time" &gt; now())
   -&gt;  Seq Scan on _hyper_3_38358_chunk  (cost=0.00..1.01 rows=1 width=20)
         Filter: ("time" &gt; now())
   -&gt;  Seq Scan on _hyper_3_38359_chunk  (cost=0.00..1.01 rows=1 width=20)
         Filter: ("time" &gt; now())
   -&gt;  Seq Scan on _hyper_3_38360_chunk  (cost=0.00..1.01 rows=1 width=20)
         Filter: ("time" &gt; now())
   -&gt;  Seq Scan on _hyper_3_38361_chunk  (cost=0.00..1.01 rows=1 width=20)
         Filter: ("time" &gt; now())
</code></pre>
<p>We had to do something to improve this, and so we did.</p><h2 id="now-in-timescaledb">now( ) in TimescaleDB</h2><p>As proud builders on top of PostgreSQL, we wanted to come up with a solution. So in previous versions of TimescaleDB, we did not use the <code>now()</code> expression for plan-time constraint exclusion. </p><p>In turn, we implemented constraint exclusion at execution time in a bid to improve query performance. If you want to learn more about how we did this, <a href="https://timescale.ghost.io/blog/implementing-constraint-exclusion-for-faster-query-performance/">check out this blog post, which offers a detailed behind-the-scenes explanation of what happens when you execute a query in PostgreSQL</a>. </p><p>While the resulting plan does look much slimmer than the original, all the chunks were still considered during planning and removed only during execution. So, even though the resulting plan looks very different (look at those 1,096 excluded chunks), the effort is very similar to the vanilla PostgreSQL plan.</p><pre><code class="language-sql">Custom Scan (ChunkAppend) on metrics1k  (cost=0.00..1113.45 rows=1097 width=20)
   Chunks excluded during startup: 1096
   -&gt;  Seq Scan on _hyper_3_39453_chunk  (cost=0.00..1.01 rows=1 width=20)
         Filter: ("time" &gt; now())
</code></pre>
<p>Close, but not good enough.</p><h2 id="now-were-talking">now( ) We're Talking</h2><p>With our latest release, TimescaleDB 2.7, we approached things differently, adding an optimization that would allow the evaluation of <code>now()</code> expressions during plan-time chunk exclusion. </p><p>Looking at the root of the problem, the reason why constifying <code>now()</code> at plan time would not be correct is prepared statements. If you execute <code>now()</code> but only use that value in a transaction half an hour later, the value no longer reflects the current time.</p><p>However, <strong>certain expressions will still hold true even as time goes by.</strong> For example, <code>time &gt;= now()</code> will be true at this moment, in 5 minutes, and in 10 hours. So, when optimizing this, we looked for expressions that held as time passed and used those during plan-time exclusion. <br><br>The initial implementation of this feature works for intervals of hours, minutes, and seconds (e.g., <code>now() - INTERVAL '1 hour'</code>).<br><br>As you can see from the EXPLAIN output, chunks are no longer excluded during execution. The exclusion happens earlier, during planning, speeding up the query. Success!</p><pre><code class="language-sql"> Custom Scan (ChunkAppend) on metrics1k  (cost=0.00..1.02 rows=1 width=20)
   Chunks excluded during startup: 0
   -&gt;  Seq Scan on _hyper_3_39453_chunk  (cost=0.00..1.02 rows=1 width=20)
         Filter: (("time" &gt; '2022-05-24 12:41:31.266968+02'::timestamp with time zone) AND ("time" &gt; now()))
</code></pre>
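<p>The prepared-statement pitfall mentioned above can be sketched like this (hypothetical statement name): the plan may be built once and reused much later, so the plan-time value of <code>now()</code> cannot simply replace the expression outright.</p><pre><code class="language-sql">PREPARE recent_rows AS
  SELECT * FROM metrics1k WHERE time &gt; now() - INTERVAL '1 hour';

-- Executed hours after planning, this must still evaluate now() at
-- execution time; a plan-time bound is only safe because it is ANDed
-- with the original filter and time only moves forward.
EXECUTE recent_rows;</code></pre>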
<p>In the next TimescaleDB version, 2.8, we are removing the initial limitations of the <code>now()</code> optimization, making it also available for intervals of months and years. This means that you will be able to make the most of this improvement in a wider range of situations, as any <code>time &gt; now() - INTERVAL</code> expression will be usable during plan-time chunk exclusion. </p><pre><code class="language-sql"> Custom Scan (ChunkAppend) on metrics1k  (cost=0.00..1.02 rows=1 width=20)
   Chunks excluded during startup: 0
   -&gt;  Seq Scan on _hyper_3_39453_chunk  (cost=0.00..1.02 rows=1 width=20)
         Filter: ("time" &gt; now())
</code></pre>
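<p>For example, once that limitation is lifted, a month-based filter like the following (sketched against the same <code>metrics1k</code> hypertable) also qualifies for plan-time chunk exclusion:</p><pre><code class="language-sql">SELECT * FROM metrics1k
WHERE time &gt; now() - INTERVAL '3 months';</code></pre>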
<p>This code is already <a href="https://github.com/timescale/timescaledb/pull/4397">committed</a> in our <a href="https://github.com/timescale/timescaledb/pull/4393">GitHub repo</a>, and will be available shortly.</p><h2 id="how-does-it-work">How Does It Work?</h2><p>But how did we make this happen? The optimization works by rewriting the constraint. For example: <br></p><pre><code class="language-sql">time &gt; now() - INTERVAL '5 min'
</code></pre>
<p>turns into</p><pre><code class="language-sql">(("time" &gt; (now() - '00:05:00'::interval)) AND ("time" &gt; '2022-06-10 09:58:04.224996+02'::timestamp with time zone))
</code></pre>
<p>This means that the constified part of the constraint will be used during plan-time chunk exclusion. And, assuming that time only moves forward, the result will still be correct even in the presence of prepared statements, as the original constraint is ANDed with the constified value.</p><p>Rewriting the constraint makes the constified value available to plan-time constraint exclusion, leading to massive reductions in planning time, especially in the presence of many chunks.</p><p>So we know that this translates into faster queries. But how fast?</p><h2 id="performance-comparison%E2%80%94now-that-is-fast">Performance Comparison—now( ) That Is Fast!</h2><p>As shown in our table, the optimization’s performance improvement scales with the total number of chunks in the hypertables. The more data partitions you’re dealing with, the more you’ll notice the speed improvement—<strong>up to 401x faster in TimescaleDB 2.7</strong> for a total of 20,000 chunks when compared to the previous version.</p><p><code>now()</code> that is fast. 🔥</p><figure class="kg-card kg-image-card kg-width-wide"><img src="https://timescale.ghost.io/blog/content/images/2022/06/Screenshot-2022-06-23-at-10.19.45.png" class="kg-image" alt="" loading="lazy" width="605" height="247" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/06/Screenshot-2022-06-23-at-10.19.45.png 600w, https://timescale.ghost.io/blog/content/images/2022/06/Screenshot-2022-06-23-at-10.19.45.png 605w"></figure><p><em>The table lists the total execution time of the query (at the beginning of the post) on hypertables with different numbers of chunks</em></p><h2 id="now-go-try-it">now( ) Go Try It</h2><p>There are few things more satisfying for a developer than solving a problem for your users, especially a recurring one. Achieving such a performance optimization is just the icing on the cake. 
</p><p>If you want to experience the lightning-fast performance of PostgreSQL <code>now()</code> queries for yourself, TimescaleDB 2.7 is available for Timescale and self-managed TimescaleDB.</p><ul><li>If you are a Timescale user, you will be automatically upgraded to TimescaleDB 2.7. No action is required from your side. You can also create a free Timescale account to get <a href="https://console.cloud.timescale.com/signup">a free 30-day trial</a> (no credit card required).</li><li>If you are using TimescaleDB in your own instances, <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/update-timescaledb/">check out our docs for instructions on how to upgrade</a>.</li></ul><p>Once you’re using TimescaleDB, connect with us! You can find us in our <a href="http://slack.timescale.com/">Community Slack</a> and the <a href="http://timescale.com/forum/">Timescale Community Forum</a>. We’ll be more than happy to answer any questions about query performance improvements, TimescaleDB, PostgreSQL, or other time-series issues.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Improving DISTINCT Query Performance Up to 8,000x on PostgreSQL]]></title>
            <description><![CDATA[Learn common performance pitfalls and discover techniques to optimize your DISTINCT PostgreSQL queries.]]></description>
            <link>https://www.tigerdata.com/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[PostgreSQL Performance]]></category>
            <dc:creator><![CDATA[Sven Klemm]]></dc:creator>
            <pubDate>Thu, 06 May 2021 11:01:13 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2021/05/pexels-pixabay-373543.jpg">
            </media:content>
            <content:encoded><![CDATA[<p>PostgreSQL is an amazing database, but it can struggle with certain types of queries, especially as tables approach tens and hundreds of millions of rows (or more). <a href="https://www.timescale.com/learn/understanding-distinct-in-postgresql-with-examples" rel="noreferrer"><code>DISTINCT</code> queries</a> are an example of this.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/baby-yoda-tea.gif" class="kg-image" alt="Baby Yoda drinking tea" loading="lazy" width="320" height="320"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Waiting for our DISTINCT queries to return</em></i></figcaption></figure><p>Why are <code>DISTINCT</code> queries slow on PostgreSQL when they seem to ask an "easy" question? It turns out that PostgreSQL currently lacks the ability to efficiently pull a list of unique values from an ordered index. </p><div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-emoji">🔖</div><div class="kg-callout-text">Learning PostgreSQL? <a href="https://www.timescale.com/learn/understanding-distinct-in-postgresql-with-examples" rel="noreferrer">Read the basics on DISTINCT</a>.</div></div><p></p><p><br></p><p>Even when you have an index that matches the exact order and columns for these "last-point" queries, PostgreSQL is still forced to scan the entire index to find all unique values. As a table grows (and <a href="https://timescale.ghost.io/blog/blog/what-the-heck-is-time-series-data-and-why-do-i-need-a-time-series-database-dcf3b1b18563/">they grow quickly with time-series data</a>), this operation keeps getting slower.</p><p>Other databases, such as MySQL, Oracle, and DB2, implement a feature called "Loose index scan," "Index Skip Scan," or “Skip Scan,” to speed up the performance of queries like this. 
</p><p>When a database has a feature like "Skip Scan," it can incrementally jump from one ordered value to the next without reading all of the rows in between. <em>Without</em> support for this feature, the database engine has to scan the entire ordered index and then deduplicate it at the end—which is a much slower process.</p><p>Since 2018, there have been <a href="https://commitfest.postgresql.org/19/1741/">plans to support something similar</a> in PostgreSQL. <em>(<strong>Note</strong>: We couldn’t use this implementation directly due to some limitations of what is possible within the </em><a href="https://www.tigerdata.com/blog/top-8-postgresql-extensions" rel="noreferrer"><em>Postgres extension</em></a><em> framework.)</em></p><p>Unfortunately, this patch wasn't included in the <a href="https://commitfest.postgresql.org/32/">CommitFest</a> for PostgreSQL 14, so it won't be included until PostgreSQL 15 at the earliest (i.e., no sooner than Fall 2022, at least 1.5 years from now). </p><p>We don’t want our users to have to wait that long.</p><h2 id="what-is-timescales-skipscan">What is Timescale's SkipScan?</h2><p>Today, via TimescaleDB 2.2.1, we are releasing <strong>TimescaleDB SkipScan</strong>, a custom query planner node that makes ordered <code>DISTINCT</code> queries blazing fast in PostgreSQL 🔥. </p><p>As you'll see in the benchmarks below, <strong>some queries performed more than</strong> <strong>8,000x better than before</strong>—and many of the SQL queries your applications and analytics tools use could also see dramatic improvements with this new feature.</p><p>This feature works in both Timescale <a href="https://www.tigerdata.com/blog/database-indexes-in-postgresql-and-timescale-cloud-your-questions-answered" rel="noreferrer">hypertables</a> and normal PostgreSQL tables. 
</p><p>This means that with Timescale, not only will your time-series <code>DISTINCT</code> queries be faster, but <strong>any other related queries you may have on normal PostgreSQL tables will also be faster. </strong></p><p>This is because Timescale is not just a time-series database. It’s a relational database, specifically, a relational database for <a href="https://www.tigerdata.com/blog/time-series-introduction" rel="noreferrer">time series</a>. Developers who use Timescale benefit from a purpose-built time-series database plus a classic relational (Postgres) database, all in one, with full SQL support.</p><p>And to be clear, we love PostgreSQL. We employ engineers who contribute to PostgreSQL. We contribute to the ecosystem around PostgreSQL. PostgreSQL is the world’s fastest-growing database, and we are excited to support it alongside thousands of other users and contributors.</p><p>We constantly seek to advance the state of the art with databases, and features like SkipScan are only our latest contribution to the industry. SkipScan makes Timescale and PostgreSQL better, more competitive databases overall, especially compared to MySQL, Oracle, DB2, and others. </p><h3 id="how-to-check-and-optimize-your-query-performance-in-postgresql">How to check (and optimize) your query performance in PostgreSQL</h3><p>If you're new to PostgreSQL and are wondering how to check your query performance in the first place (and optimize it!), we're going to leave two helpful resources here:</p><ul><li><a href="https://www.timescale.com/forum/t/a-beginners-guide-to-explain-analyze/77">This beginner's guide to <code>EXPLAIN ANALYZE</code> </a>by Michael Christofides in one of our Timescale Community Days. 
And here's a blog post on <a href="https://www.timescale.com/learn/explaining-postgresql-explain" rel="noreferrer">Explaining EXPLAIN</a> in case you're more of a reader.</li></ul><figure class="kg-card kg-embed-card"><iframe width="200" height="113" src="https://www.youtube.com/embed/31EmOKBP1PY?start=1&amp;feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen="" title="A beginners guide to EXPLAIN ANALYZE – Michael Christofides"></iframe></figure><ul><li>And our <a href="https://timescale.ghost.io/blog/identify-postgresql-performance-bottlenecks-with-pg_stat_statements/">blog post on using pg_stat_statements to optimize queries</a>.</li></ul><h3 id="optimizing-distinct-query-performance-what-about-recursive-ctes">Optimizing DISTINCT query performance: What about RECURSIVE CTEs?</h3><p>However, if you're an experienced PostgreSQL user, you might point out that it is already possible to get reasonably fast <code>DISTINCT</code> queries via <code>RECURSIVE CTEs</code>.</p><p>From the <a href="https://wiki.postgresql.org/wiki/Loose_indexscan">PostgreSQL Wiki</a>, using a <code>RECURSIVE CTE</code> can get you good results, but writing these kinds of queries can often feel cumbersome and unintuitive, especially for developers new to PostgreSQL:</p><pre><code class="language-sql">WITH RECURSIVE cte AS (
   (SELECT tags_id FROM cpu ORDER BY tags_id, time DESC LIMIT 1)
   UNION ALL
   SELECT (
      SELECT tags_id FROM cpu
      WHERE tags_id &gt; t.tags_id 
      ORDER BY tags_id, time DESC LIMIT 1
   )
   FROM cte t
   WHERE t.tags_id IS NOT NULL
)
SELECT * FROM cte LIMIT 50;
</code></pre><p>But even if writing a <code>RECURSIVE CTE</code> like this in day-to-day querying felt natural to you, there's a bigger problem. Most application developers, ORMs, and charting tools like Grafana or Tableau will still use the simpler, straightforward form:</p><pre><code class="language-sql">SELECT DISTINCT ON (tags_id) * FROM cpu
WHERE tags_id &gt;=1 
ORDER BY tags_id, time DESC
LIMIT 50;</code></pre><p>In PostgreSQL, without a "Skip Scan" node, this query has to scan the entire index (a full Index Only Scan), causing your applications and graphing tools to feel clunky and slow.</p><p>Surely there's a better way, right?</p><h2 id="skipscan-is-the-way">SkipScan Is the Way</h2><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Screen-Shot-2021-04-16-at-1.45.56-PM.png" class="kg-image" alt="" loading="lazy" width="1472" height="608" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/01/Screen-Shot-2021-04-16-at-1.45.56-PM.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/01/Screen-Shot-2021-04-16-at-1.45.56-PM.png 1000w, https://timescale.ghost.io/blog/content/images/2022/01/Screen-Shot-2021-04-16-at-1.45.56-PM.png 1472w" sizes="(min-width: 720px) 720px"></figure><p>SkipScan is an optimization for queries in the form of <code>SELECT DISTINCT ON (column)</code>. Conceptually, a SkipScan is a regular IndexScan that “skips” across an index looking for the next value that is greater than the current value:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2021/05/skip-scan-illustration.png" class="kg-image" alt="Illustration of how a Skip Scan search works on a Btree index" loading="lazy" width="519" height="253"><figcaption><i><em class="italic" style="white-space: pre-wrap;">SkipScan: An index scan that “skips” across an index looking for the next greater value</em></i></figcaption></figure><p>With SkipScan in Timescale/PostgreSQL, query planning and execution can now utilize a new node (displayed as <code>(SkipScan)</code> in the <code>EXPLAIN</code> output) to quickly return distinct items from a properly ordered index. 
</p><p>Rather than scanning the entire index with an Index Only Scan, SkipScan incrementally searches for each successive item in the ordered index. As it locates one item, the <code>(SkipScan)</code> node quickly restarts the search for the next item. This is a <em>much</em> more efficient way of finding distinct items in an ordered index. (<a href="https://github.com/timescale/timescaledb/blob/master/tsl/src/nodes/skip_scan/exec.c">See GitHub for more details.</a>)</p><h2 id="benchmarking-timescaledb-skipscan-vs-a-normal-postgresql-index-scan">Benchmarking TimescaleDB SkipScan vs. a Normal PostgreSQL Index Scan</h2><p>In every example query, <strong>Timescale with SkipScan improved query response times by at least 26x</strong>. </p><div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-emoji">✨</div><div class="kg-callout-text">If you don't want to go through the entire benchmark, here's a short and sweet piece on <a href="https://www.timescale.com/blog/skip-scan-under-load/" rel="noreferrer">SkipScan's performance under load</a>.</div></div><p>But the real surprise is <strong>how much of a difference it makes at lower cardinalities with lots of data—</strong>it is <strong>almost 8,500x faster to retrieve <em>all columns</em> for the most recent reading of each device</strong>. 
That's fast!</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/01/mandalorian-ships.gif" class="kg-image" alt="Mandolorian Razor Crest being chased by X-wing fighters" loading="lazy" width="1280" height="720" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/01/mandalorian-ships.gif 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/01/mandalorian-ships.gif 1000w, https://timescale.ghost.io/blog/content/images/2022/01/mandalorian-ships.gif 1280w" sizes="(min-width: 720px) 720px"></figure><p>In our tests, <strong>SkipScan is also consistently faster—by 80x or more—in our 4,000 device benchmarks</strong>. (This level of cardinality is typical for many users of Timescale.)</p><p>Before we share the full results, here is how our benchmark was set up.</p><h3 id="benchmark-setup">Benchmark setup</h3><p>To perform our benchmarks, we installed Timescale on a DigitalOcean Droplet using the following specifications. PostgreSQL and Timescale were installed from packages, and we applied the recommended tuning from <a href="https://github.com/timescale/timescaledb-tune"><code>timescaledb-tune</code></a>.</p><ul><li>8 Intel vCPUs</li><li>16&nbsp;GB of RAM</li><li>320&nbsp;GB NVMe SSD</li><li>Ubuntu 20.04 LTS</li><li>Postgres 12.6</li><li>TimescaleDB 2.2 <em>(The first release with SkipScan. TimescaleDB 2.2.1 primarily adds distributed hypertable support and some bug fixes.)</em></li></ul><p>To demonstrate the performance impact of SkipScan on varying degrees of cardinality, we benchmarked three separate datasets of varying sizes. To generate our datasets, we used the 'cpu-only' use case in the <a href="https://github.com/timescale/tsbs">Time Series Benchmark Suite (TSBS)</a>, which creates 10 metrics every 10 seconds for each device (identified by the <code>tag_id</code> in our benchmark queries).</p><table>
<thead>
<tr>
<th>Dataset 1</th>
<th>Dataset 2</th>
<th>Dataset 3</th>
</tr>
</thead>
<tbody>
<tr>
<td>100 devices</td>
<td>4000 devices</td>
<td>10,000 devices</td>
</tr>
<tr>
<td>4 months of data</td>
<td>4 days of data</td>
<td>36 hours of data</td>
</tr>
<tr>
<td>~103,000,000 rows</td>
<td>~103,000,000 rows</td>
<td>~144,000,000 rows</td>
</tr>
</tbody>
</table>
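<p>If you want to reproduce a similar setup, a hypertable of this shape could be created like so (a sketch with illustrative columns, not the exact TSBS schema):</p><pre><code class="language-sql">CREATE TABLE cpu (
  "time"     timestamptz NOT NULL,
  tags_id    integer     NOT NULL,
  usage_user double precision
);
SELECT create_hypertable('cpu', 'time');

-- create_hypertable adds the ("time" DESC) index by default;
-- the composite index matches the DISTINCT ON query shape.
CREATE INDEX ON cpu (tags_id, "time" DESC);</code></pre>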
<h3 id="additional-data-preparation">Additional data preparation</h3><p>Not all device data is up-to-date in real life because devices go offline and internet connections get interrupted. Therefore, to simulate a more realistic scenario (i.e., that some devices had stopped reporting for a period of time), we deleted rows for random devices over each of the following periods.</p><table>
<thead>
<tr>
<th>Dataset 1</th>
<th>Dataset 2</th>
<th>Dataset 3</th>
</tr>
</thead>
<tbody>
<tr>
<td>5 random devices over:</td>
<td>100 random devices over:</td>
<td>250 random devices over:</td>
</tr>
<tr>
<td>30 minutes</td>
<td>1 hour</td>
<td>10 minutes</td>
</tr>
<tr>
<td>36 hours</td>
<td>12 hours</td>
<td>1 hour</td>
</tr>
<tr>
<td>7 days</td>
<td>36 hours</td>
<td>12 hours</td>
</tr>
<tr>
<td>1 month</td>
<td>3 days</td>
<td>24 hours</td>
</tr>
</tbody>
</table>
<p>To delete the data, we utilized the <code>TABLESAMPLE</code> clause of Postgres. This <code>SELECT</code> feature allows you to return a random sample of rows from a table based on a percentage of the total rows. In the example below, we randomly sample 10% of the rows (<code>bernoulli(10)</code>) and then take the first 10 (<code>LIMIT 10</code>).</p><pre><code class="language-sql">DELETE FROM cpu
WHERE tags_id IN 
  (SELECT id FROM tags tablesample bernoulli(10) LIMIT 10)
  AND time &gt;= now() - INTERVAL '30 minutes';</code></pre><p>From there, we ran each benchmarking query multiple times to account for caching, with and without SkipScan enabled.</p><p>As mentioned earlier, the following two indexes were present on the hypertable for all queries.</p><pre><code class="language-sql">"cpu_tags_id_time_idx" btree (tags_id, "time" DESC)
"cpu_time_idx" btree ("time" DESC)</code></pre><h3 id="benchmark-results">Benchmark results</h3><p>Here are the results:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2021/05/Skip-Scan-vs-Normal.jpg" class="kg-image" alt="" loading="lazy" width="2000" height="2183" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2021/05/Skip-Scan-vs-Normal.jpg 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2021/05/Skip-Scan-vs-Normal.jpg 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2021/05/Skip-Scan-vs-Normal.jpg 1600w, https://timescale.ghost.io/blog/content/images/2021/05/Skip-Scan-vs-Normal.jpg 2000w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">TimescaleDB with SkipScan improved the query response by at least 26x, up to 8500x in some cases.</em></i></figcaption></figure><h2 id="about-the-queries-benchmarked">About the Queries Benchmarked</h2><p>For this test, we benchmarked five types of common queries:</p><h3 id="scenario-1-what-was-the-last-reported-time-of-each-device-in-a-paged-list">Scenario #1: What was the last reported time of each device in a paged list?</h3><pre><code class="language-sql">SELECT DISTINCT ON (tags_id) tags_id, time FROM cpu
ORDER BY tags_id, time DESC
LIMIT 10 OFFSET 50;</code></pre><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/01/SkipScan---Scenario-1.jpg" class="kg-image" alt="" loading="lazy" width="1095" height="236" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/01/SkipScan---Scenario-1.jpg 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/01/SkipScan---Scenario-1.jpg 1000w, https://timescale.ghost.io/blog/content/images/2022/01/SkipScan---Scenario-1.jpg 1095w" sizes="(min-width: 720px) 720px"></figure><h3 id="scenario-2-what-was-the-time-and-most-recently-reported-set-of-values-for-each-device-in-a-paged-list">Scenario #2: What was the time and most recently reported set of values for each device in a paged list?</h3><pre><code class="language-sql">SELECT DISTINCT ON (tags_id) * FROM cpu
ORDER BY tags_id, time DESC
LIMIT 10 OFFSET 50;</code></pre><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/01/SkipScan---Scenario-2.jpg" class="kg-image" alt="" loading="lazy" width="1095" height="236" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/01/SkipScan---Scenario-2.jpg 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/01/SkipScan---Scenario-2.jpg 1000w, https://timescale.ghost.io/blog/content/images/2022/01/SkipScan---Scenario-2.jpg 1095w" sizes="(min-width: 720px) 720px"></figure><h3 id="scenario-3-what-is-the-most-recent-point-for-all-reporting-devices-in-the-last-5-minutes">Scenario #3: What is the most recent point for all reporting devices in the last 5 minutes?</h3><pre><code class="language-sql">SELECT DISTINCT ON (tags_id) * FROM cpu 
WHERE time &gt;= now() - INTERVAL '5 minutes' 
ORDER BY tags_id, time DESC;</code></pre><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/01/SkipScan---Scenario-3.jpg" class="kg-image" alt="" loading="lazy" width="1095" height="236" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/01/SkipScan---Scenario-3.jpg 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/01/SkipScan---Scenario-3.jpg 1000w, https://timescale.ghost.io/blog/content/images/2022/01/SkipScan---Scenario-3.jpg 1095w" sizes="(min-width: 720px) 720px"></figure><h3 id="scenario-4-which-devices-reported-at-some-time-today-but-not-within-the-last-hour">Scenario #4: Which devices reported at some time today but not within the last hour?</h3><pre><code class="language-sql">WITH older AS (
  SELECT DISTINCT ON (tags_id) tags_id FROM cpu 
  WHERE time &gt; now() - INTERVAL '24 hours'
)                                          
SELECT * FROM older o 
WHERE NOT EXISTS (
  SELECT 1 FROM cpu 
  WHERE cpu.tags_id = o.tags_id 
  AND time &gt; now() - INTERVAL '1 hour'
);</code></pre><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2021/05/SkipScan---Scenario-4.jpg" class="kg-image" alt="" loading="lazy" width="2000" height="431" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2021/05/SkipScan---Scenario-4.jpg 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2021/05/SkipScan---Scenario-4.jpg 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2021/05/SkipScan---Scenario-4.jpg 1600w, https://timescale.ghost.io/blog/content/images/2021/05/SkipScan---Scenario-4.jpg 2000w" sizes="(min-width: 720px) 720px"></figure><h3 id="scenario-5-which-devices-reported-yesterday-but-not-in-the-last-24-hours">Scenario #5: Which devices reported yesterday but not in the last 24 hours?</h3><pre><code class="language-sql">WITH older AS (
  SELECT DISTINCT ON (tags_id) tags_id FROM cpu 
  WHERE time &gt; now() - INTERVAL '48 hours'
  AND time &lt; now() - INTERVAL '24 hours'
)                                          
SELECT * FROM older o 
WHERE NOT EXISTS (
  SELECT 1 FROM cpu 
  WHERE cpu.tags_id = o.tags_id 
  AND time &gt; now() - INTERVAL '24 hours'
);</code></pre><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2021/05/SkipScan---Scenario-5.jpg" class="kg-image" alt="" loading="lazy" width="2000" height="431" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2021/05/SkipScan---Scenario-5.jpg 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2021/05/SkipScan---Scenario-5.jpg 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2021/05/SkipScan---Scenario-5.jpg 1600w, https://timescale.ghost.io/blog/content/images/2021/05/SkipScan---Scenario-5.jpg 2000w" sizes="(min-width: 720px) 720px"></figure><h2 id="how-will-your-application-improve">How Will Your Application Improve?</h2><p>But SkipScan isn’t a theoretical improvement reserved for benchmarking blog posts 😉—it has real-world implications, and many applications we use rely on getting this data as fast as possible.</p><p>Think about the applications you use (or develop) every day. Do they retrieve paged lists of unique items from database tables to fill dropdown options (or grids of data)?</p><p>At a few thousand items, the query latency might not be very noticeable. But, as your data grows and you have millions of rows of data and tens of thousands of distinct items, that dropdown menu might take seconds—or minutes—to populate. 
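</p><p>As a concrete sketch (with a hypothetical <code>readings</code> table and <code>device_label</code> column — illustrative names, not from the benchmark), such a dropdown query is just a paged <code>DISTINCT</code>:</p><pre><code class="language-sql">-- Hypothetical schema: fetch page 6 of the distinct device labels, 10 per page.
SELECT DISTINCT ON (device_label) device_label FROM readings
ORDER BY device_label
LIMIT 10 OFFSET 50;</code></pre><p>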
</p><p>SkipScan can reduce that to tens of <em>milliseconds</em>!</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/tenor.gif" class="kg-image" alt="" loading="lazy" width="498" height="287"><figcaption><span style="white-space: pre-wrap;">Baby Yoda</span></figcaption></figure><p>Even better, SkipScan also provides a fast, efficient way of answering the question that so many people with time-series data ask every day:</p><p><em>"What was the last time and value recorded for each of my [devices / users / services / crypto and stock investments / etc]?"</em></p><p>As long as there is an index on "device_id" and "time" descending, SkipScan makes queries like the following much more efficient:</p><pre><code class="language-sql">SELECT DISTINCT ON (device_id) * FROM cpu 
ORDER BY device_id, time DESC;</code></pre><p>With SkipScan, your application and dashboards that rely on these types of queries will now load a whole lot faster 🚀  (see below).</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2021/05/4k_with_skipscan.gif" class="kg-image" alt="" loading="lazy" width="658" height="527" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2021/05/4k_with_skipscan.gif 600w, https://timescale.ghost.io/blog/content/images/2021/05/4k_with_skipscan.gif 658w"><figcaption><i><em class="italic" style="white-space: pre-wrap;">TimescaleDB 2.2 </em></i><i><b><strong class="italic" style="white-space: pre-wrap;">with</strong></b></i><i><em class="italic" style="white-space: pre-wrap;"> SkipScan enabled runs in less than 400&nbsp;ms</em></i></figcaption></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2021/05/4k_without_skipscan.gif" class="kg-image" alt="" loading="lazy" width="658" height="527" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2021/05/4k_without_skipscan.gif 600w, https://timescale.ghost.io/blog/content/images/2021/05/4k_without_skipscan.gif 658w"><figcaption><i><em class="italic" style="white-space: pre-wrap;">TimescaleDB 2.2</em></i><i><b><strong class="italic" style="white-space: pre-wrap;"> without</strong></b></i><i><em class="italic" style="white-space: pre-wrap;"> SkipScan enabled runs in 23 seconds</em></i></figcaption></figure><h2 id="how-to-use-skipscan-on-timescale">How to Use SkipScan on Timescale</h2><p>How do you get started? Upgrade to TimescaleDB 2.2.1 and set up your schema and indexing as described below. 
You should start to see immediate speed improvements in many of your <code>DISTINCT</code> queries.</p><p><strong>To ensure that a <code>Custom Scan (SkipScan)</code> node can be chosen for your query plan:</strong></p><p><strong>First, the query must use the <code>DISTINCT</code> keyword on a single column</strong>. The benchmarking queries above will give you some examples to draw from.</p><p><strong>Second, there must be an index that contains the <code>DISTINCT</code> column first, and any other <code>ORDER BY</code> columns.</strong> Specifically:</p><ul><li>The index needs to be a <code>BTREE</code> index.</li><li>The index needs to match the <code>ORDER BY</code> in your query.</li><li>The <code>DISTINCT</code> column must either be the first column of the index, or any leading column(s) must be used as constraints in your query.</li></ul><p>In practice, this means that if we use the questions from the beginning of this blog post ("retrieve a list of unique IDs in order" and "retrieve the last reading of each ID"), we would need at least one index like this (but if you're using a TimescaleDB hypertable, this likely already exists):</p><pre><code class="language-sql"> "cpu_tags_id_time_idx" btree (tags_id, "time" DESC)</code></pre><p>With that index in place, you should start to see immediate benefit if your queries look similar to the benchmarking examples above. When SkipScan is chosen for your query, the <code>EXPLAIN ANALYZE</code> output will show one or more <code>Custom Scan (SkipScan)</code> nodes similar to this:</p><pre><code>-&gt;  Unique
      -&gt;  Merge Append
            Sort Key: _hyper_8_79_chunk.tags_id, _hyper_8_79_chunk."time" DESC
            -&gt;  Custom Scan (SkipScan) on _hyper_8_79_chunk
                  -&gt;  Index Only Scan using _hyper_8_79_chunk_cpu_tags_id_time_idx on _hyper_8_79_chunk
                        Index Cond: (tags_id &gt; NULL::integer)
            -&gt;  Custom Scan (SkipScan) on _hyper_8_80_chunk
                  -&gt;  Index Only Scan using _hyper_8_80_chunk_cpu_tags_id_time_idx on _hyper_8_80_chunk
                        Index Cond: (tags_id &gt; NULL::integer)
...</code></pre><h2 id="learn-more-and-get-started">Learn More and Get Started</h2><p>If you’re new to Timescale, <a href="https://console.cloud.timescale.com/signup">create a free account</a> to get started with a fully managed TimescaleDB instance (100&nbsp;% free for 30 days).</p><p>If you are an existing user:</p><ul><li><strong>Timescale: </strong>TimescaleDB 2.2.1 is now the default for all new services on Timescale, and any of your existing services will be automatically upgraded during your next maintenance window.</li><li><strong>Self-managed TimescaleDB</strong>: <a href="https://docs.timescale.com/latest/update-timescaledb">Here are the upgrade instructions</a>. </li></ul><p>Join our <a href="https://slack.timescale.com">Slack Community</a> to share your results, ask questions, get advice, and connect with other developers (I, as well as our co-founders, engineers, and passionate community members, are active on all channels).</p><p>You can also <a href="https://github.com/timescale/timescaledb">visit our GitHub</a> to learn more (and, as always, ⭐️ are appreciated!)<br></p><p></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Achieving the Best of Both Worlds: Ensuring Up-To-Date Results With Real-Time Aggregation]]></title>
            <description><![CDATA[Real-time aggregates (released with TimescaleDB 1.7) build on continuous aggregates' ability to increase query speed and optimize storage. Learn what's new, details about how they work, and how to get started. ]]></description>
            <link>https://www.tigerdata.com/blog/achieving-the-best-of-both-worlds-ensuring-up-to-date-results-with-real-time-aggregation</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/achieving-the-best-of-both-worlds-ensuring-up-to-date-results-with-real-time-aggregation</guid>
            <category><![CDATA[Product & Engineering]]></category>
            <category><![CDATA[Engineering]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Sven Klemm]]></dc:creator>
            <pubDate>Thu, 07 May 2020 15:11:33 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2020/05/mana5280-dkeOcAkors4-unsplash.jpg">
            </media:content>
            <content:encoded><![CDATA[<p>Real-time aggregates (released with TimescaleDB 1.7) build on continuous aggregates' ability to increase query speed and optimize storage. Learn what's new, details about how they work, and how to get started. </p><p>One constant across all time-series use cases is data: metrics, logs, events, sensor readings; IT and application performance monitoring, SaaS applications, IoT, martech, fintech, and more.  Lots (and lots) of data. What’s more, it typically arrives <em>continuously.</em></p><p>This need to handle large volumes of constantly generated data motivated some of our earliest TimescaleDB architectural decisions, such as its use of automated time-based partitioning and local-only indexing to achieve high insert rates.  And last year, we added type-specific <a href="https://www.tigerdata.com/blog/building-columnar-compression-in-a-row-oriented-database" rel="noreferrer">columnar</a> compression to significantly shrink the overhead involved in storing all of this data (often by 90% or higher – <a href="https://timescale.ghost.io/blog/blog/building-columnar-compression-in-a-row-oriented-database/?utm_source=timescale-real-time-aggregates-details&amp;utm_medium=blog&amp;utm_campaign=1-7-release&amp;utm_content=1-6-release-blog">see our technical description and benchmarking results</a>). </p><p>And another key capability in TimescaleDB, which is the focus of this post, has been <em>continuous aggregates</em>, which we first <a href="https://timescale.ghost.io/blog/blog/continuous-aggregates-faster-queries-with-automatically-maintained-materialized-views/?utm_source=timescale-real-time-aggregates-details&amp;utm_medium=blog&amp;utm_campaign=1-7-release&amp;utm_content=continuous-aggs-1-3-blog">introduced</a> in TimescaleDB 1.3.  Continuous aggregates allow one to specify a SQL query that continually processes raw data into a so-called materialized table.  
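</p><p>As a minimal sketch (against a hypothetical <code>conditions</code> hypertable; the full, runnable <code>cpu</code> example appears later in this post), defining a continuous aggregate in TimescaleDB 1.x looks like this:</p><pre><code class="language-SQL">-- Hypothetical hypertable conditions(time, device_id, temperature):
-- continually maintain an hourly average per device in the background.
CREATE VIEW conditions_1h
   WITH (timescaledb.continuous)
   AS
      SELECT time_bucket('1 hour', time) AS hour,
             device_id,
             avg(temperature) AS avg_temp
      FROM conditions
      GROUP BY hour, device_id;</code></pre><p>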
</p><p>Continuous aggregates are somewhat similar to materialized views in databases, but unlike a materialized view (as in <a href="https://www.postgresql.org/docs/current/rules-materializedviews.html">PostgreSQL</a>), continuous aggregates do not need to be refreshed manually; the view will be refreshed automatically in the background as new data is added, or old data is modified. Additionally, TimescaleDB does not need to re-calculate all of the data on every refresh. Only new and/or invalidated data will be calculated. And since this re-aggregation is automatic – it executes as a background job at regular intervals – this process doesn’t add any maintenance burden to your database.</p><p>This is where most database or streaming systems that offer continuous aggregates or continuous queries give up.  We knew we could do better.</p><p>Enter Real-Time Aggregation, introduced in TimescaleDB 1.7 (<a href="https://timescale.ghost.io/blog/blog/timescaledb-1-7-fast-continuous-aggregates-with-real-time-views-postgresql-12-support-and-more-community-features/?utm_source=timescale-real-time-aggregates-details&amp;utm_medium=blog&amp;utm_campaign=1-7-release&amp;utm_content=1-7-release-announcement-blog">see our release blog</a>).</p><h2 id="quick-background-on-continuous-aggregates">Quick Background on Continuous Aggregates</h2><p>The benefits of continuous aggregates are twofold:</p><ul><li><strong>Query performance.</strong>  By executing queries against pre-calculated results, rather than the underlying raw data,  continuous aggregates can significantly improve query performance.</li><li><strong>Storage savings with </strong><a href="https://docs.timescale.com/latest/using-timescaledb/continuous-aggregates?utm_source=timescale-real-time-aggregates-details&amp;utm_medium=blog&amp;utm_campaign=1-7-release&amp;utm_content=continuous-aggs-drop-data-docs#dropping-data"><strong>downsampling</strong></a><strong><em>. 
</em></strong> Continuous aggregates are often combined with data retention policies for better storage management.  Raw data can be continually aggregated into a materialized table, and dropped after it reaches a certain age.  So the database may only store some fixed period of raw data (say, one week), yet store aggregate data for much longer.</li></ul><p>Consider the following example, collecting system metrics around CPU usage and storing them in a CPU metrics <a href="https://www.tigerdata.com/blog/database-indexes-in-postgresql-and-timescale-cloud-your-questions-answered" rel="noreferrer">hypertable</a>, where each row includes a timestamp, hostname, and three usage metrics (usage_user, usage_system, usage_iowait).  </p><p>We collect these statistics every second per server.</p><pre><code>            time              | hostname |     usage_user     |    usage_system     |    usage_iowait
-------------------------------+----------+--------------------+---------------------+---------------------
2020-05-06 02:32:34.627143+00 | host0    | 0.5378765249290502 |  0.2958572490961302 | 0.10685818344495246
2020-05-06 02:32:34.627143+00 | host1    | 0.3175958910709298 |  0.7874926624954846 | 0.16615243032654803
2020-05-06 02:32:34.627143+00 | host2    | 0.4788377981501064 | 0.18277343256546175 |  0.7183967491020162</code></pre><p>So a query that wants to compute the per-hourly histogram of usage consumption over the course of 7 days for 10 servers will process 10 servers * 60 seconds * 60 minutes * 24 hours * 7 days = 6,048,000 rows of data.</p><p>On the other hand, if we pre-compute a histogram per hour, then the same query on the continuous aggregate table will only need to process 10 servers * 24 hours * 7 days = 1680 rows of data.</p><p>But pre-computed results in the continuous aggregate view will lag behind the latest data, as the materialization only runs at scheduled intervals.  So, both to more cheaply handle out-of-order data and to avoid excessive load, there is typically some <em>refresh lag </em>between the raw data and when it’s materialized.  In fact, this refresh lag is configurable in TimescaleDB, such that the continuous aggregation engine will not materialize data that’s newer than the refresh lag.  </p><p>(Slightly more specifically, if we compute aggregations across some <a href="https://docs.timescale.com/latest/using-timescaledb/reading-data?utm_source=timescale-real-time-aggregates-details&amp;utm_medium=blog&amp;utm_campaign=1-7-release&amp;utm_content=time-bucket-docs#time-bucket">time bucket</a>, such as hourly, then each hourly interval has a start time and end time.  TimescaleDB will only materialize data when its corresponding aggregation interval’s <em>end time</em> is older than the refresh lag. So, if we are doing hourly rollups with a 30-minute refresh lag, then we’d only perform the materialized aggregation for, say, 2:00am - 3:00am <em>after</em> 3:30am.)</p><p>So, on one hand, using a continuous aggregate view has cut down the amount of data we process at query time by 3600x (i.e., from more than 6 million rows to fewer than 2000).  
But, in this view, we’re often missing the last hour or so of data.</p><p>While you could just make the refresh lag smaller and smaller to work around this problem, it comes at the cost of higher and higher load; unless these aggregates are recomputed on <em>every</em> new insert (expensive!), they’re fundamentally always stale.</p><h2 id="introducing-real-time-aggregation">Introducing Real-Time Aggregation</h2><p>With real-time aggregation, when you query a continuous aggregate view, rather than just getting the pre-computed aggregate from the materialized table, the query will transparently combine this pre-computed aggregate with raw data from the hypertable that’s yet to be materialized.  And, by combining raw and materialized data in this way, you get accurate and up-to-date results, while still enjoying the speedups that come from pre-computing a large portion of the result.</p><p>Let’s return to the example above.  Recall that when we created hourly rollups, we set the refresh lag to 30 minutes, so our continuous aggregate view will lag behind by 30-90 minutes.</p><p>But, when querying a view that supports real-time aggregation, the same query as before for hourly data across the past week will process and combine the results from two tables:</p><ul><li>Materialized table: 10 servers * (22 hours + 24 hours * 6 days) = 1660 rows</li><li>Raw data: 10 servers * 60 seconds * 90 minutes = 54,000 rows  </li></ul><p>So now, with these “back of the envelope” calculations, we’ve processed a total of 55,660 rows, still well below the 6 million from before. 
Moreover, the last 90 minutes of data are more likely to already be memory resident for even better performance, given the database page caching already happening for recent data.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2020/09/image.png" class="kg-image" alt="Diagram showing how data moves to a materialized table as it ages and continuous aggregate queries execute, and how real-time aggregates combine this data with newer, not yet materialized data" loading="lazy" width="1500" height="1154" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2020/09/image.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2020/09/image.png 1000w, https://timescale.ghost.io/blog/content/images/2020/09/image.png 1500w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Real-time aggregates allow you to query your pre-calculated data </span><b><strong style="white-space: pre-wrap;">and</strong></b><span style="white-space: pre-wrap;"> newer, not yet materialized "raw" data</span></figcaption></figure><p>The above illustration shows this in practice. The database internally maintains a <strong><em>completion threshold</em></strong> as metadata, which records the point-in-time to which all previous records from the raw table have been materialized.  This completion threshold lags behind the <em>refresh lag </em>we discussed earlier, and gets updated by the database engine whenever a background task updates the materialized view.</p><p><em>(In fact, it’s a bit more complicated given TimescaleDB’s ability to handle late data that gets written after some time region has already been materialized, i.e., behind the completion threshold.  
But we’re going to ignore how TimescaleDB tracks invalidation regions in this post.)</em></p><p>So now when processing our query covering the last 7 days, the database engine will conceptually take a UNION ALL between results from the materialized table starting at <code>now() - interval '7 days'</code> up to the completion threshold, with results from the raw table from the completion threshold up to <code>now()</code>.</p><p>But rather than just describe this behavior, let’s walk through a concrete example and compare our query times without continuous aggregates, with vanilla continuous aggregates, and with real-time aggregation enabled.</p><p>These capabilities were developed by Timescale engineers: <a href="https://github.com/svenklemm"><em>Sven Klemm</em></a><em>, </em><a href="https://github.com/cevian"><em>Matvey Arye</em></a><em>, </em><a href="https://github.com/gayyappan"><em>Gayathri Ayyapan</em></a><em>, </em><a href="https://github.com/davidkohn88"><em>David Kohn</em></a>, and <a href="https://github.com/JLockerman"><em>Josh Lockerman</em></a>.</p><h2 id="testing-real-time-aggregation">Testing Real-Time Aggregation</h2><p>In the following, I’ve created a TimescaleDB 1.7 instance via <a href="https://www.timescale.com/products">Managed Service for TimescaleDB</a> (specifically, a “basic-100-compute-optimized” instance with PostgreSQL 12, 4 vCPU, and 100GB SSD storage), and then created the following hypertable:</p><pre><code class="language-SQL">$ psql postgres://tsdbadmin@tsdb-bb8e760-internal-90d0.a.timescaledb.io:26479/defaultdb?sslmode=require

=&gt; CREATE TABLE cpu (
      time TIMESTAMPTZ,
      hostname TEXT,
      usage_user FLOAT,
      usage_system FLOAT,
      usage_iowait FLOAT
   );

=&gt; SELECT create_hypertable ('cpu', 'time', 
      chunk_time_interval =&gt; interval '1d');</code></pre><p>I’m now going to load the hypertable with 14 days of synthetic data (which is created with the following INSERT statement):</p><pre><code class="language-SQL">=&gt; INSERT INTO cpu (
   SELECT time, hostname, random(), random(), random()
      FROM generate_series(NOW() - interval '14d', NOW(), '1s') AS time
      CROSS JOIN LATERAL (
         SELECT 'host' || host_id::text AS hostname 
            FROM generate_series(0,9) AS host_id
      ) h
   );</code></pre><p>Okay, so that inserted 12,096,010 rows of synthetic data into our hypertable of the following format, stretching from 2:32am UTC on April 22 to 2:32am UTC on May 6:</p><pre><code class="language-SQL">=&gt; SELECT * FROM cpu ORDER BY time DESC LIMIT 3;

             time              | hostname |     usage_user     |    usage_system     |    usage_iowait     
-------------------------------+----------+--------------------+---------------------+---------------------
 2020-05-06 02:32:34.627143+00 | host0    | 0.5378765249290502 |  0.2958572490961302 | 0.10685818344495246
 2020-05-06 02:32:34.627143+00 | host1    | 0.3175958910709298 |  0.7874926624954846 | 0.16615243032654803
 2020-05-06 02:32:34.627143+00 | host2    | 0.4788377981501064 | 0.18277343256546175 |  0.7183967491020162


=&gt; SELECT min(time) AS start, max(time) AS end FROM cpu;

-[ RECORD 1 ]------------------------
start | 2020-04-22 02:32:34.627143+00
end   | 2020-05-06 02:32:34.627143+00</code></pre><p>Let’s now create a continuous aggregate view on this table with hourly <a href="https://docs.timescale.com/latest/api?utm_source=timescale-real-time-aggregates-details&amp;utm_medium=blog&amp;utm_campaign=1-7-release&amp;utm_content=api-docs-histograms#histogram">histograms</a>: </p><pre><code class="language-SQL">=&gt; CREATE VIEW cpu_1h 
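   -- refresh_lag = '30m': only materialize a bucket once its end time is
   -- older than now() - 30 minutes; refresh_interval = '30m': run the
   -- background materialization job every 30 minutes.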
   WITH (timescaledb.continuous, 
         timescaledb.refresh_lag = '30m',
         timescaledb.refresh_interval = '30m')
   AS
      SELECT 
         time_bucket('1 hour', time) AS hour,
         hostname, 
         histogram(usage_user, 0.0, 1.0, 5) AS hist_usage_user,
         histogram(usage_system, 0.0, 1.0, 5) AS hist_usage_system,
         histogram(usage_iowait, 0.0, 1.0, 5) AS hist_usage_iowait
      FROM cpu
      GROUP BY hour, hostname;</code></pre><p>By default, queries to this view use these real-time aggregation features.  If you want to disable real-time aggregation, set <code>materialized_only = true</code> when creating the view or by later ALTERing the view.  (See <a href="https://docs.timescale.com/latest/api?utm_source=timescale-real-time-aggregates-details&amp;utm_medium=blog&amp;utm_campaign=1-7-release&amp;utm_content=continuous-aggs-create-view-docs#continuous_aggregate-create_view">API docs here</a>.)</p><p>Now, the job scheduling framework will start to asynchronously process this view, which we can see in our <a href="https://docs.timescale.com/latest/api?utm_source=timescale-real-time-aggregates-details&amp;utm_medium=blog&amp;utm_campaign=1-7-release&amp;utm_content=continuous-aggs-stats-docs#timescaledb_information-continuous_aggregate_stats">informational view</a>.  (You can also <a href="https://docs.timescale.com/latest/api?utm_source=timescale-real-time-aggregates-details&amp;utm_medium=blog&amp;utm_campaign=1-7-release&amp;utm_content=continuous-aggs-refresh-view-docs#continuous_aggregate-refresh_view">manually force</a> the materialization to occur if needed.)  <br></p><pre><code class="language-SQL">=&gt; SELECT * FROM timescaledb_information.continuous_aggregate_stats;

-[ RECORD 1 ]
view_name              | cpu_1h
completed_threshold    | 2020-05-06 02:00:00+00
invalidation_threshold | 2020-05-06 02:00:00+00
job_id                 | 1000
last_run_started_at    | 2020-05-06 02:34:08.300524+00
last_successful_finish | 2020-05-06 02:34:09.04923+00
last_run_status        | Success
job_status             | Scheduled
last_run_duration      | 00:00:00.748706
next_scheduled_run     | 2020-05-06 03:04:09.04923+00
total_runs             | 17
total_successes        | 17
total_failures         | 0
total_crashes          | 0
</code></pre><p>From this data, we see that the materialized view includes data up to 2:00am on May 6, while from above we’ve learned that the raw data goes up to 2:32am. </p><p>Let’s try our query directly on the raw table, and use an EXPLAIN ANALYZE to both show the database plan, as well as actually execute the query and collect timing information.  (Note that in many use cases, one would offset queries from <code>now() - &lt;some interval&gt;</code>. But to ensure that we use identical datasets in our subsequent analysis, we explicitly select the interval offset from the dataset’s last timestamp.)</p><pre><code class="language-SQL">=&gt; EXPLAIN (ANALYZE, COSTS OFF)
   SELECT 
      time_bucket('1 hour', time) AS hour,
      hostname, 
      histogram(usage_user, 0.0, 1.0, 5) AS hist_usage_user,
      histogram(usage_system, 0.0, 1.0, 5) AS hist_usage_system,
      histogram(usage_iowait, 0.0, 1.0, 5) AS hist_usage_iowait
   FROM cpu
   WHERE time &gt; '2020-05-06 02:32:34.627143+00'::timestamptz - interval '7 days'
   GROUP BY hour, hostname
   ORDER BY hour DESC;

QUERY PLAN             
----------------------------------------------------------------
 Finalize GroupAggregate (actual time=1859.306..1862.331 rows=1690 loops=1)
   Group Key: (time_bucket('01:00:00'::interval, cpu."time")), cpu.hostname
   -&gt;  Gather Merge (actual time=1841.735..1849.604 rows=1881 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         -&gt;  Sort (actual time=1194.162..1194.222 rows=627 loops=3)
               Sort Key: (time_bucket('01:00:00'::interval, cpu."time")) DESC, cpu.hostname
               Sort Method: quicksort  Memory: 25kB
               Worker 0:  Sort Method: quicksort  Memory: 274kB
               Worker 1:  Sort Method: quicksort  Memory: 274kB
               -&gt;  Partial HashAggregate (actual time=1193.198..1193.594 rows=627 loops=3)
                     Group Key: time_bucket('01:00:00'::interval, cpu."time"), cpu.hostname
                     -&gt;  Parallel Custom Scan (ChunkAppend) on cpu (actual time=9.840..716.952 rows=2016000 loops=3)
                           Chunks excluded during startup: 7
                           -&gt;  Parallel Seq Scan on _hyper_1_14_chunk (actual time=14.751..199.098 rows=864000 loops=1)
                                 Filter: ("time" &gt; ('2020-05-06 02:32:34.627143+00'::timestamp with time zone - '7 days'::interval))
                           -&gt;  Parallel Seq Scan on _hyper_1_13_chunk (actual time=14.749..201.100 rows=864000 loops=1)
                                 Filter: ("time" &gt; ('2020-05-06 02:32:34.627143+00'::timestamp with time zone - '7 days'::interval))
                           -&gt;  Parallel Seq Scan on _hyper_1_12_chunk (actual time=0.025..182.591 rows=864000 loops=1)
                                 Filter: ("time" &gt; ('2020-05-06 02:32:34.627143+00'::timestamp with time zone - '7 days'::interval))
                           -&gt;  Parallel Seq Scan on _hyper_1_11_chunk (actual time=0.031..182.812 rows=864000 loops=1)
                                 Filter: ("time" &gt; ('2020-05-06 02:32:34.627143+00'::timestamp with time zone - '7 days'::interval))
                           -&gt;  Parallel Seq Scan on _hyper_1_10_chunk (actual time=0.035..183.918 rows=864000 loops=1)
                                 Filter: ("time" &gt; ('2020-05-06 02:32:34.627143+00'::timestamp with time zone - '7 days'::interval))
                           -&gt;  Parallel Seq Scan on _hyper_1_9_chunk (actual time=0.019..184.416 rows=864000 loops=1)
                                 Filter: ("time" &gt; ('2020-05-06 02:32:34.627143+00'::timestamp with time zone - '7 days'::interval))
                           -&gt;  Parallel Seq Scan on _hyper_1_8_chunk (actual time=0.823..91.605 rows=386225 loops=2)
                                 Filter: ("time" &gt; ('2020-05-06 02:32:34.627143+00'::timestamp with time zone - '7 days'::interval))
                                 Rows Removed by Filter: 45775
                           -&gt;  Parallel Seq Scan on _hyper_1_15_chunk (actual time=0.022..20.277 rows=91550 loops=1)
                                 Filter: ("time" &gt; ('2020-05-06 02:32:34.627143+00'::timestamp with time zone - '7 days'::interval))

 Planning Time: 1.917 ms
 Execution Time: 1921.753 ms</code></pre><p>Note that TimescaleDB’s constraint exclusion excluded 7 of the chunks from being queried given the WHERE predicate (as the query was for the last 7 days of the 14-day dataset), then processed the query on the remaining 8 chunks (performing a scan over 6,048,000 rows) using two parallel workers. The query in total took just over 1.9 seconds.</p><p>Now let’s try the query on our materialized table, first turning off real-time aggregation just for this experiment:</p><pre><code class="language-SQL">=&gt; ALTER VIEW cpu_1h set (timescaledb.materialized_only = true);</code></pre><p>First, let’s look at the view definition, which runs a SELECT over the materialized hypertable with the specified GROUP BYs. But we also see that each of the histogram columns calls “finalize_agg.” TimescaleDB doesn’t pre-compute and store the exact answer specified in the query, but rather a <a href="https://www.postgresql.org/docs/current/xaggr.html#XAGGR-PARTIAL-AGGREGATES">partial aggregate</a> that is then “finalized” at query time, a design that allows for greater parallelization and, in a future release, rebucketing at query time.</p><pre><code class="language-SQL"> \d+ cpu_1h;

                                          View "public.cpu_1h"
      Column       |           Type           | Collation | Nullable | Default | Storage  | Description 
-------------------+--------------------------+-----------+----------+---------+----------+-------------
 hour              | timestamp with time zone |           |          |         | plain    | 
 hostname          | text                     |           |          |         | extended | 
 hist_usage_user   | integer[]                |           |          |         | extended | 
 hist_usage_system | integer[]                |           |          |         | extended | 
 hist_usage_iowait | integer[]                |           |          |         | extended | 

View definition:
 SELECT _materialized_hypertable_2.hour,
    _materialized_hypertable_2.hostname,
    _timescaledb_internal.finalize_agg('histogram(double precision,double precision,double precision,integer)'::text, NULL::name, NULL::name, '{{pg_catalog,float8},{pg_catalog,float8},{pg_catalog,float8},{pg_catalog,int4}}'::name[], _materialized_hypertable_2.agg_3_3, NULL::integer[]) AS hist_usage_user,
    _timescaledb_internal.finalize_agg(...) AS hist_usage_system,
    _timescaledb_internal.finalize_agg(...) AS hist_usage_iowait
   FROM _timescaledb_internal._materialized_hypertable_2
  GROUP BY _materialized_hypertable_2.hour, _materialized_hypertable_2.hostname;</code></pre><p>Now let’s run the query with vanilla continuous aggregates enabled:</p><pre><code class="language-SQL">=&gt; EXPLAIN (ANALYZE, COSTS OFF)
   SELECT * FROM cpu_1h
   WHERE hour &gt; '2020-05-06 02:32:34.627143+00'::timestamptz - interval '7 days'
   ORDER BY hour DESC;

QUERY PLAN
----------------------------------------------------------------
 Sort (actual time=3.218..3.312 rows=1670 loops=1)
   Sort Key: _materialized_hypertable_2.hour DESC
   Sort Method: quicksort  Memory: 492kB
   -&gt;  HashAggregate (actual time=1.943..2.891 rows=1670 loops=1)
         Group Key: _materialized_hypertable_2.hour, _materialized_hypertable_2.hostname
         -&gt;  Custom Scan (ChunkAppend) on _materialized_hypertable_2 (actual time=0.064..0.688 rows=1670 loops=1)
               Chunks excluded during startup: 1
               -&gt;  Seq Scan on _hyper_2_17_chunk (actual time=0.063..0.590 rows=1670 loops=1)
                     Filter: (hour &gt; ('2020-05-06 02:32:34.627143+00'::timestamp with time zone - '7 days'::interval))
                     Rows Removed by Filter: 270

 Planning Time: 0.645 ms
 Execution Time: 3.461 ms</code></pre><p>Just 4 milliseconds, after a scan of 1,670 rows in the materialized hypertable.  And let’s look at the most recent 3 rows returned for a specific host:</p><pre><code class="language-SQL">=&gt; SELECT hour, hostname, hist_usage_user
    FROM cpu_1h
    WHERE hour &gt; '2020-05-06 02:32:34.627143+00'::timestamptz - interval '7 days'         
       AND hostname = 'host0'
    ORDER BY hour DESC LIMIT 3;

          hour          | hostname |      hist_usage_user      
------------------------+----------+---------------------------
 2020-05-06 01:00:00+00 | host0    | {0,781,676,712,719,712,0}
 2020-05-06 00:00:00+00 | host0    | {0,736,714,776,689,685,0}
 2020-05-05 23:00:00+00 | host0    | {0,714,759,715,692,720,0}</code></pre><p>Note that the most recent record is from the 1:00am - 2:00am hour.</p><p>Now let’s re-enable real-time aggregation and try the same query, first showing how the real-time aggregation is defined as a UNION ALL between the materialized and raw data.</p><pre><code class="language-SQL">=&gt; ALTER VIEW cpu_1h set (timescaledb.materialized_only = false);

=&gt; \d+ cpu_1h;

                                          View "public.cpu_1h"
      Column       |           Type           | Collation | Nullable | Default | Storage  | Description 
-------------------+--------------------------+-----------+----------+---------+----------+-------------
 hour              | timestamp with time zone |           |          |         | plain    | 
 hostname          | text                     |           |          |         | extended | 
 hist_usage_user   | integer[]                |           |          |         | extended | 
 hist_usage_system | integer[]                |           |          |         | extended | 
 hist_usage_iowait | integer[]                |           |          |         | extended | 

View definition:
 SELECT _materialized_hypertable_2.hour,
    _materialized_hypertable_2.hostname,
    _timescaledb_internal.finalize_agg(...) AS hist_usage_user,
    _timescaledb_internal.finalize_agg(...) AS hist_usage_system,
    _timescaledb_internal.finalize_agg(...) AS hist_usage_iowait
   FROM _timescaledb_internal._materialized_hypertable_2
  WHERE _materialized_hypertable_2.hour &lt; COALESCE(_timescaledb_internal.to_timestamp(_timescaledb_internal.cagg_watermark(1)), '-infinity'::timestamp with time zone)
  GROUP BY _materialized_hypertable_2.hour, _materialized_hypertable_2.hostname
UNION ALL
 SELECT time_bucket('01:00:00'::interval, cpu."time") AS hour,
    cpu.hostname,
    histogram(cpu.usage_user, 0.0::double precision, 1.0::double precision, 5) AS hist_usage_user,
    histogram(cpu.usage_system, 0.0::double precision, 1.0::double precision, 5) AS hist_usage_system,
    histogram(cpu.usage_iowait, 0.0::double precision, 1.0::double precision, 5) AS hist_usage_iowait
   FROM cpu
  WHERE cpu."time" &gt;= COALESCE(_timescaledb_internal.to_timestamp(_timescaledb_internal.cagg_watermark(1)), '-infinity'::timestamp with time zone)
  GROUP BY (time_bucket('01:00:00'::interval, cpu."time")), cpu.hostname;
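-- cagg_watermark(1) above marks the materialization's completion threshold:
-- buckets strictly before the watermark are served from the materialized
-- hypertable, while buckets at or after it are aggregated from the raw
-- cpu hypertable at query time.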


=&gt; EXPLAIN (ANALYZE, COSTS OFF)
   SELECT * FROM cpu_1h
   WHERE hour &gt; '2020-05-06 02:32:34.627143+00'::timestamptz - interval '7 days'
   ORDER BY hour DESC;

QUERY PLAN               
----------------------------------------------------------------
 Sort (actual time=20.871..21.055 rows=1680 loops=1)
   Sort Key: _materialized_hypertable_2.hour DESC
   Sort Method: quicksort  Memory: 495kB
   -&gt;  Append (actual time=1.842..20.536 rows=1680 loops=1)
         -&gt;  HashAggregate (actual time=1.841..2.789 rows=1670 loops=1)
               Group Key: _materialized_hypertable_2.hour, _materialized_hypertable_2.hostname
               -&gt;  Custom Scan (ChunkAppend) on _materialized_hypertable_2 (actual time=0.105..0.580 rows=1670 loops=1)
                     Chunks excluded during startup: 1
                     -&gt;  Index Scan using _hyper_2_17_chunk__materialized_hypertable_2_hour_idx on _hyper_2_17_chunk (actual time=0.104..0.475 rows=1670 loops=1)
                           Index Cond: ((hour &lt; COALESCE(_timescaledb_internal.to_timestamp(_timescaledb_internal.cagg_watermark(1)), '-infinity'::timestamp with time zone)) AND (hour &gt; ('2020-05-06 02:32:34.627143+00'::timestamp with time zone - '7 days'::interval)))
         -&gt;  HashAggregate (actual time=17.641..17.655 rows=10 loops=1)
               Group Key: time_bucket('01:00:00'::interval, cpu."time"), cpu.hostname
               -&gt;  Custom Scan (ChunkAppend) on cpu (actual time=0.165..12.297 rows=19550 loops=1)
                     Chunks excluded during startup: 14
                     -&gt;  Index Scan using _hyper_1_15_chunk_cpu_time_idx on _hyper_1_15_chunk (actual time=0.163..9.723 rows=19550 loops=1)
                           Index Cond: ("time" &gt;= COALESCE(_timescaledb_internal.to_timestamp(_timescaledb_internal.cagg_watermark(1)), '-infinity'::timestamp with time zone))
                           Filter: (time_bucket('01:00:00'::interval, "time") &gt; ('2020-05-06 02:32:34.627143+00'::timestamp with time zone - '7 days'::interval))

 Planning Time: 3.532 ms
 Execution Time: 22.905 ms
</code></pre><p>Still very fast at just over 26 milliseconds (scanning 1,670 materialized rows and 19,550 raw rows), and now the results:</p><pre><code class="language-SQL">=&gt; SELECT hour, hostname, hist_usage_user
   FROM cpu_1h
   WHERE hour &gt; '2020-05-06 02:32:34.627143+00'::timestamptz - interval '7 days'
      AND hostname = 'host0'
   ORDER BY hour DESC LIMIT 3;

          hour          | hostname |      hist_usage_user      
------------------------+----------+---------------------------
 2020-05-06 02:00:00+00 | host0    | {0,384,388,385,400,398,0}
 2020-05-06 01:00:00+00 | host0    | {0,781,676,712,719,712,0}
 2020-05-06 00:00:00+00 | host0    | {0,736,714,776,689,685,0}

</code></pre><p>Unlike when we were processing the materialized table without the real-time aggregation, we now have up-to-date results, including data from the 2:00 - 3:00am hour. This is because the materialized table didn’t have data from the last hour, while the real-time aggregation was able to compute that result from the raw data at query time. You can also notice that there is less data in the newest row (namely, each histogram bucket has about half the counts of the prior rows), as this row was the aggregation of 32 minutes of raw data, not a full hour.</p><p>You can also observe these two stages of real-time aggregation in the above query plan: the materialized hypertable is processed in the first section via <code>Custom Scan (ChunkAppend) on _materialized_hypertable_2</code>, while the underlying raw hypertable is processed in the second section via <code>Custom Scan (ChunkAppend) on cpu</code>, and each processes only data before or after the offset specified by the completion threshold (shown with <code>_timescaledb_internal.cagg_watermark(1)</code> in the plan).</p><p>So, in summary: a complete, up-to-date aggregate over the data, both at a fraction of the latency of querying the raw data, and avoiding the excessive overhead of schemes that update materializations through per-row or per-statement triggers.</p><table>
<thead>
<tr>
<th>Query Type</th>
<th>Latency</th>
<th>Freshness</th>
</tr>
</thead>
<tbody>
<tr>
<td>Raw Data</td>
<td>1924 ms</td>
<td>Up-to-date</td>
</tr>
<tr>
<td>Continuous Aggregates</td>
<td>4 ms</td>
<td>Lags up to 90 minutes</td>
</tr>
<tr>
<td>Real-Time Aggregation</td>
<td>26 ms</td>
<td>Up-to-date</td>
</tr>
</tbody>
</table>
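<p>For reference, the <code>cpu_1h</code> continuous aggregate used throughout this comparison can be defined along the following lines (a sketch in the TimescaleDB 1.x syntax; the bucket width and histogram parameters mirror the view definition shown above):</p><pre><code class="language-SQL">CREATE VIEW cpu_1h
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', time) AS hour,
       hostname,
       histogram(usage_user, 0.0, 1.0, 5) AS hist_usage_user,
       histogram(usage_system, 0.0, 1.0, 5) AS hist_usage_system,
       histogram(usage_iowait, 0.0, 1.0, 5) AS hist_usage_iowait
FROM cpu
GROUP BY time_bucket('1 hour', time), hostname;</code></pre>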
<p><strong>Continuous aggregates and real-time aggregation for the win!</strong></p><h2 id="conclusions">Conclusions</h2><p>What motivated us to build TimescaleDB is the firm belief that time-series use cases need a best-in-class, flexible time-series database, with advanced capabilities specifically designed for time-series workloads.  We developed real-time aggregation for time-series use cases such as devops monitoring, real-time analytics, and IoT, where fast queries over high-volume workloads and accurate, real-time results really matter. </p><p>Real-time aggregation joins a number of advanced capabilities in TimescaleDB around data lifecycle management and time-series analytics, including automated data retention, data reordering, native compression, downsampling, and traditional continuous aggregates.</p><p>And, <strong>there’s still much more to come</strong>. Keep an eye out for our much-anticipated TimescaleDB 2.0 release, which introduces horizontal scaling to TimescaleDB for terabyte to petabyte workloads.</p><h3 id="want-to-check-out-real-time-aggregation">Want to check out real-time aggregation?</h3><ul><li>Ready to dig in? Check out our <a href="https://docs.timescale.com/latest/using-timescaledb/continuous-aggregates/?utm_source=timescale-real-time-aggregates-details&amp;utm_medium=blog&amp;utm_campaign=1-7-release&amp;utm_content=continuous-aggregates-docs">docs</a>.</li><li>Brand new to TimescaleDB?  
Get started <a href="https://docs.timescale.com/latest/getting-started/?utm_source=timescale-real-time-aggregates-details&amp;utm_medium=blog&amp;utm_campaign=1-7-release&amp;utm_content=getting-started-docs">here</a>.</li></ul><p>If you have any questions along the way, we’re always available via our <a href="https://slack.timescale.com">community Slack</a> (we’re <a href="https://timescaledb.slack.com/archives/D011A62GNR0">@mike</a> and <a href="https://timescaledb.slack.com/archives/D0137UNE550">@sven </a>, come say hi 👋).</p><p>And, if you are interested in keeping up-to-date with future TimescaleDB releases, <a href="https://www.timescale.com/signup/release-notes/?utm_source=timescale-real-time-aggregates-details&amp;utm_medium=blog&amp;utm_campaign=1-7-release&amp;utm_content=release-notes-subscribe">sign up for our Release Notes</a>.  It’s low-traffic, we promise.</p><p>Until next time, keep it real!</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[OrderedAppend: An Optimization for Range Partitioning]]></title>
            <description><![CDATA[With this feature, we’ve seen up to 100x performance improvements for certain queries.]]></description>
            <link>https://www.tigerdata.com/blog/ordered-append-postgresql-optimization</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/ordered-append-postgresql-optimization</guid>
            <category><![CDATA[Product & Engineering]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Sven Klemm]]></dc:creator>
            <pubDate>Wed, 31 Jul 2019 17:00:07 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2019/07/blogorderedappend.jpg">
            </media:content>
            <content:encoded><![CDATA[<p>In our previous post on <a href="https://www.timescale.com/blog/implementing-constraint-exclusion-for-faster-query-performance/">implementing constraint exclusion</a>, we discussed how TimescaleDB leverages PostgreSQL’s foundation and expands on its capabilities to improve performance. Continuing with the same theme, in this post, we will discuss how we’ve added support for ordered appends, which optimize a wide range of queries, particularly those that are ordered by time.</p><p><strong>We’ve seen performance improvements up to 100x for certain queries</strong> after applying this feature, so we encourage you to keep reading!</p><h2 id="optimizing-appends-for-large-queries">Optimizing Appends for Large Queries</h2><p>PostgreSQL represents how plans should be executed using “nodes.” Various nodes may appear in an EXPLAIN output, but we want to focus specifically on Append nodes, which essentially combine the results from multiple sources into a single result.</p><p>PostgreSQL has two standard Append node types that you will commonly find in an EXPLAIN output:</p><ul><li><strong>Append:</strong> appends results of child nodes to return a unioned result</li><li><strong>MergeAppend:</strong> merges the output of child nodes by sort key; all child nodes must be sorted by that same sort key; accesses every chunk when used in TimescaleDB</li></ul><p>When MergeAppend nodes are used with TimescaleDB, every chunk must be accessed to determine whether it has keys that need to be merged, which is inefficient when only a few chunks could satisfy the query.</p><p>To address this issue, with the release of <a href="https://github.com/timescale/timescaledb/releases/tag/1.2.0">TimescaleDB 1.2</a>, we introduced <strong>OrderedAppend</strong> as an optimization for range partitioning. This feature optimizes a large range of queries, particularly those that are ordered by time and contain a LIMIT clause. 
</p><p>This optimization takes advantage of the fact that we know the range of time held in each chunk and can stop accessing chunks once we’ve found enough rows to satisfy the LIMIT clause. As mentioned above, this optimization can improve performance by up to 100x, depending on the query. </p><p>With the release of <a href="https://github.com/timescale/timescaledb/releases/tag/1.4.0">TimescaleDB 1.4</a>, we wanted to extend the cases in which OrderedAppend can be used. This meant making OrderedAppend space-partition aware and removing the LIMIT clause restriction. With these additions, more users can benefit from the performance benefits achieved through leveraging OrderedAppend.</p><h2 id="developing-query-plans-with-the-optimization">Developing Query Plans With the Optimization</h2><p>As an optimization for range partitioning, OrderedAppend eliminates sort steps because it is aware of the way data is partitioned. </p><p>Since each chunk has a known time range it covers to get sorted output, no global sort step is needed. Only local sort steps have to be completed and then appended in the correct order. If index scans are utilized, which return the output sorted, sorting can be completely avoided.</p><p><strong>For a query ordering by the time dimension with a LIMIT clause, you would normally get something like this: </strong></p><pre><code>dev=# EXPLAIN (ANALYZE,COSTS OFF,BUFFERS,TIMING OFF,SUMMARY OFF)
dev-# SELECT * FROM metrics ORDER BY time LIMIT 1;
                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------
 Limit (actual rows=1 loops=1)
   Buffers: shared hit=16
   -&gt;  Merge Append (actual rows=1 loops=1)
         Sort Key: metrics."time"
         Buffers: shared hit=16
         -&gt;  Index Scan using metrics_time_idx on metrics (actual rows=0 loops=1)
               Buffers: shared hit=1
         -&gt;  Index Scan using _hyper_1_1_chunk_metrics_time_idx on _hyper_1_1_chunk (actual rows=1 loops=1)
               Buffers: shared hit=3
         -&gt;  Index Scan using _hyper_1_2_chunk_metrics_time_idx on _hyper_1_2_chunk (actual rows=1 loops=1)
               Buffers: shared hit=3
         -&gt;  Index Scan using _hyper_1_3_chunk_metrics_time_idx on _hyper_1_3_chunk (actual rows=1 loops=1)
               Buffers: shared hit=3
         -&gt;  Index Scan using _hyper_1_4_chunk_metrics_time_idx on _hyper_1_4_chunk (actual rows=1 loops=1)
               Buffers: shared hit=3
         -&gt;  Index Scan using _hyper_1_5_chunk_metrics_time_idx on _hyper_1_5_chunk (actual rows=1 loops=1)
               Buffers: shared hit=3
</code></pre><p>You can see that three pages are read from every chunk, plus an additional page from the parent table, which contains no actual rows.</p><p><strong>With this optimization enabled, you would get a plan looking like this:</strong></p><pre><code>dev=# EXPLAIN (ANALYZE,COSTS OFF,BUFFERS,TIMING OFF,SUMMARY OFF)
dev-# SELECT * FROM metrics ORDER BY time LIMIT 1;
                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------
 Limit (actual rows=1 loops=1)
   Buffers: shared hit=3
   -&gt;  Custom Scan (ChunkAppend) on metrics (actual rows=1 loops=1)
         Order: metrics."time"
         Buffers: shared hit=3
         -&gt;  Index Scan using _hyper_1_1_chunk_metrics_time_idx on _hyper_1_1_chunk (actual rows=1 loops=1)
               Buffers: shared hit=3
         -&gt;  Index Scan using _hyper_1_2_chunk_metrics_time_idx on _hyper_1_2_chunk (never executed)
         -&gt;  Index Scan using _hyper_1_3_chunk_metrics_time_idx on _hyper_1_3_chunk (never executed)
         -&gt;  Index Scan using _hyper_1_4_chunk_metrics_time_idx on _hyper_1_4_chunk (never executed)
         -&gt;  Index Scan using _hyper_1_5_chunk_metrics_time_idx on _hyper_1_5_chunk (never executed)</code></pre><p>After the first chunk, the remaining chunks never get executed, and to complete the query, only three pages have to be read. TimescaleDB removes parent tables from plans like this because we know the parent table does not contain any data.</p><h2 id="mergeappend-vs-chunkappend">MergeAppend vs. ChunkAppend</h2><p>The main difference between these two examples is the type of Append node we used. In the first case, a MergeAppend node is used. In the second case, we used a ChunkAppend node (also introduced in 1.4), which is a TimescaleDB custom node that works similarly to the PostgreSQL Append node but contains additional optimizations. </p><p>The MergeAppend node implements the global sort and requires locally sorted input, with every child sorted by the same sort key. To produce one tuple, the MergeAppend node has to read one tuple from every chunk to decide which one to return next.</p><p>For the very simple example query above, you will see 16 pages read (with MergeAppend) vs. three pages (with ChunkAppend), a 5x improvement over the unoptimized case (ignoring the single page from the parent table), matching the five chunks present in that <a href="https://www.tigerdata.com/blog/database-indexes-in-postgresql-and-timescale-cloud-your-questions-answered" rel="noreferrer">hypertable</a>. So for a hypertable with 100 chunks, there would be 100 times fewer pages to be read to produce the result for the query.</p><p>As you can see, you gain the most benefit from OrderedAppend with a LIMIT clause, as older chunks don’t have to be touched if the required results can be satisfied from more recent chunks. This type of query is very common in time-series workloads (e.g., if you want to get the last reading from a sensor). 
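For instance, the last-reading case is just the example query from above flipped to descending order (a sketch against the same <code>metrics</code> hypertable):</p><pre><code>SELECT * FROM metrics ORDER BY time DESC LIMIT 1;</code></pre><p>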
However, even for queries without a LIMIT clause, this feature is beneficial because it eliminates the sorting of data.</p><h2 id="next-steps">Next Steps</h2><p>If you are interested in using OrderedAppend, make sure you have the latest version of TimescaleDB installed (<a href="https://docs.timescale.com/self-hosted/latest/install/" rel="noreferrer">installation guide</a>). You can also <a href="https://console.cloud.timescale.com/signup" rel="noreferrer">create a free Timescale account</a> (30-day trial, no credit card required) and never worry about upgrading again (we'll do it for you).</p><p>If you are brand new to TimescaleDB, <a href="https://docs.timescale.com/getting-started" rel="noreferrer">get started here</a>. Have questions? Join our <a href="http://slack.timescale.com">Slack</a> channel!</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Mind the Gap: Using SQL Functions for Time-Series Analysis]]></title>
            <description><![CDATA[Write more efficient and readable SQL queries with a new set of time-series analytic tools.]]></description>
            <link>https://www.tigerdata.com/blog/sql-functions-for-time-series-analysis</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/sql-functions-for-time-series-analysis</guid>
            <category><![CDATA[Product & Engineering]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Sven Klemm]]></dc:creator>
            <pubDate>Thu, 24 Jan 2019 20:01:11 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2019/01/20190123_TimeBucketGapFill.jpg">
            </media:content>
            <content:encoded><![CDATA[<p>SQL functions are reusable routines written in SQL or supported procedural languages that perform operations on input values and return a result, often used to encapsulate logic and simplify complex queries.</p><p>With the release of <a href="https://github.com/timescale/timescaledb" rel="noreferrer">TimescaleDB 1.2</a> came three new SQL functions for time-series analysis: <code>time_bucket_gapfill</code>, <code>interpolate</code>, and <code>locf</code>. Used together, these SQL functions will enable you to write more efficient and readable SQL queries for <a href="https://www.timescale.com/blog/time-series-analysis-what-is-it-how-to-use-it" rel="noreferrer">time-series analysis</a>. </p><p>The efficiency gains were so evident that we have since developed a complete set of <a href="https://www.timescale.com/learn/time-series-data-analysis-hyperfunctions" rel="noreferrer">hyperfunctions</a> for faster time-series analysis with fewer lines of code. You can find them in the <a href="https://docs.timescale.com/self-hosted/latest/tooling/install-toolkit/" rel="noreferrer">Timescale Toolkit</a>.</p><p>In this post, we'll discuss why you'd want to use time buckets, the related gapfilling techniques, and how they’re implemented under the hood.<em> </em>Ultimately, it's the story of how we extended SQL and the PostgreSQL query planner to create a set of highly optimized functions for time-series analysis.</p><h2 id="sql-functions-for-time-series-analysis-introduction-to-time-bucketing">SQL Functions for Time-Series Analysis: Introduction to Time Bucketing</h2><p>Many <a href="https://www.timescale.com/blog/time-series-analysis-what-is-it-how-to-use-it" rel="noreferrer">common techniques for time-series analysis</a> assume that our temporal observations are aggregated to fixed time intervals. 
Dashboards and most visualizations of time series rely on this technique to make sense of our raw data, turning the noise into a smoother trend line that is more easily interpretable and analytically tractable.</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2019/01/timebucket-1.gif" class="kg-image" alt="" loading="lazy" width="556" height="347"></figure><p>When writing queries for this type of reporting, you need an efficient way to aggregate raw observations (often noisy and irregular) to fixed time intervals. Examples of such queries might be average temperature per hour or the average CPU utilization per five seconds.</p><p>The solution is <strong>time bucketing</strong>. The <code>time_bucket</code> function has been a core feature of TimescaleDB since the <a href="https://timescale.ghost.io/blog/when-boring-is-awesome-building-a-scalable-time-series-database-on-postgresql-2900ea453ee2">first public beta release</a>. With time bucketing, we can get a clear picture of the important data trends using a concise, declarative SQL query.</p><pre><code class="language-SQL">SELECT
  time_bucket('1 minute', time) as one_minute_bucket,
  avg(value) as avg_value
FROM observations
GROUP BY one_minute_bucket
ORDER BY one_minute_bucket;</code></pre><h2 id="challenges-with-time-bucketing-for-time-series">Challenges With Time Bucketing for Time Series</h2><p>The reality of time-series data engineering is not always so easy.</p><p>Consider measurements recorded at <strong>irregular sampling intervals</strong>, either intentionally, as with measurements recorded in response to external events (e.g., a motion sensor), or inadvertently due to network problems, out-of-sync clocks, or equipment taken offline for maintenance.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2019/01/none.jpg" class="kg-image" alt="" loading="lazy" width="580" height="253"><figcaption><span style="white-space: pre-wrap;">Time bucket: none</span></figcaption></figure><p>We should also consider analyzing multiple measurements recorded at <strong>mismatched sampling intervals</strong>. For instance, you might collect some of your data every second and some every minute, but still need to analyze both metrics at 15-second intervals.</p><p>The <code>time_bucket</code> function will only aggregate your data to a given time bucket if there is data in it. With either mismatched or irregular sampling, a time bucket interval might come back with missing data (i.e., gaps). 
</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2019/01/20mins.jpg" class="kg-image" alt="" loading="lazy" width="579" height="220"><figcaption><span style="white-space: pre-wrap;">Time bucket: 20 minutes</span></figcaption></figure><p>If your analysis requires data aggregated to contiguous time intervals, time bucketing with <strong>gapfilling</strong> solves this problem.</p><h2 id="sql-functions-time-bucketing-with-gapfilling">SQL Functions: Time Bucketing With Gapfilling</h2><p>TimescaleDB community users have access to a set of SQL functions:</p><ul><li><code>time_bucket_gapfill</code> for creating contiguous, ordered time buckets</li><li><code>interpolate</code> to perform linear interpolation between the previous and next value</li><li><code>locf</code> or <em>last observation carried forward </em>to fill in gaps with the previous known value </li></ul><h3 id="gapfilling">Gapfilling</h3><p>The new <code>time_bucket_gapfill</code> function is similar to <code>time_bucket</code> except that it guarantees a contiguous, ordered set of time buckets.</p><p>The function requires that you provide a <code>start</code> and <code>finish</code> argument to specify the time range for which you need contiguous buckets. The result set will contain additional rows in place of any gaps, ensuring that the returned rows are in chronological order and contiguous.</p><p>Let’s look at the SQL:</p><pre><code class="language-SQL">SELECT
    time_bucket_gapfill(
        '1 hour', time,
        start =&gt; '2019-01-21 9:00', 
        finish =&gt; '2019-01-21 17:00') AS hour,
    avg(value) AS avg_val
FROM temperature
GROUP BY hour;

          hour          |         avg_val
------------------------+-------------------------
 2019-01-21 09:00:00+00 |     26.5867799823790905
 2019-01-21 10:00:00+00 |    23.25141648529633607
 2019-01-21 11:00:00+00 |     21.9964633100885991
 2019-01-21 12:00:00+00 |    23.08512263446292656
 2019-01-21 13:00:00+00 |
 2019-01-21 14:00:00+00 |     27.9968220672055895
 2019-01-21 15:00:00+00 |     26.4914455532679670
 2019-01-21 16:00:00+00 |   24.07531628738616732</code></pre><p>Note that one of the hours is missing data entirely, and the average value is represented as <code>NULL</code>. Gapfilling gives us a contiguous set of time buckets but no data for those rows. That's where the <code>locf</code> and <code>interpolate</code> functions come into play.</p><h3 id="locf-or-last-observation-carried-forward">LOCF or last observation carried forward</h3><p>The “last observation carried forward” technique can be used to impute missing values by assuming the previous known value. </p><pre><code class="language-SQL">SELECT
    time_bucket_gapfill(
        '1 hour', time,
        start =&gt; '2019-01-21 9:00', 
        finish =&gt; '2019-01-21 17:00') AS hour,
  -- instead of avg(val)
  locf(avg(val))
FROM temperature
GROUP BY hour
ORDER BY hour</code></pre><p>Shown here: </p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2019/01/LOCF_-20-minutes.jpg" class="kg-image" alt="" loading="lazy" width="588" height="210"><figcaption><span style="white-space: pre-wrap;">LOCF at 20 minutes</span></figcaption></figure><h3 id="linear-interpolation">Linear interpolation</h3><p>Linear interpolation imputes missing values by assuming a line between the previous known value and the next known value.</p><pre><code class="language-SQL">SELECT
    time_bucket_gapfill(
        '1 hour', time,
        start =&gt; '2019-01-21 9:00', 
        finish =&gt; '2019-01-21 17:00') AS hour,
  -- instead of avg(value)
  interpolate(avg(value))
FROM temperature
GROUP BY hour
ORDER BY hour;</code></pre><p>Shown here: </p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2019/01/inter20mins-1.jpg" class="kg-image" alt="" loading="lazy" width="550" height="199"><figcaption><span style="white-space: pre-wrap;">Interpolate at 20 minutes</span></figcaption></figure><p>These techniques are not exclusive; you can combine them as needed in a single time-bucketed query:</p><pre><code class="language-SQL">locf(avg(temperature)), interpolate(max(humidity)), avg(other_val)</code></pre><h2 id="best-practices-for-time-series-analysis-with-sql-functions">Best Practices for Time-Series Analysis With SQL Functions</h2><p>Whether you use <code>locf</code>, <code>interpolate</code>, or plain gapfilling with NULLs depends on your assumptions about the time-series data and your analytical approach.</p><ul><li>Use <code>locf</code> if you assume your measurement changes only when you've received new data.</li><li>Use <code>interpolate</code> if you assume your continuous measurement would have a smooth, roughly linear trend if sampled at a higher rate.</li><li>Use standard aggregate functions (without <code>locf</code> or <code>interpolate</code>) if your data is not continuous on the time axis. Where there is no data, the result is NULL.</li><li>If you want to assume scalar values (typically zero) in place of NULLs, you can use PostgreSQL’s <code>COALESCE</code> function: <code>COALESCE(avg(value), 0)</code>.</li></ul><p>If you choose to explicitly <code>ORDER</code> your results, keep in mind that the gapfilling will sort by time in ascending order.
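</p><p>For instance, the imputation functions, <code>COALESCE</code>, and an explicit ascending sort can be combined in one gapfilled query. This is a sketch only: the <code>conditions</code> table and its columns are illustrative, not from the example data above.</p><pre><code class="language-SQL">SELECT
    time_bucket_gapfill(
        '1 hour', time,
        start =&gt; '2019-01-21 9:00',
        finish =&gt; '2019-01-21 17:00') AS hour,
    locf(avg(temperature)),
    interpolate(max(humidity)),
    -- report zero instead of NULL where a bucket has no rows
    COALESCE(avg(other_val), 0)
FROM conditions
GROUP BY hour
ORDER BY hour ASC;  -- matches the gapfilling sort order</code></pre><p>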
Any other explicit ordering may introduce additional sorting steps in the query plan.</p><h2 id="extending-sql-for-time-series-analysis">Extending SQL for Time-Series Analysis</h2><p>Queries built on the new <code>time_bucket_gapfill</code> function are significantly more readable, less error-prone, more flexible regarding grouping, and faster to execute.</p><p>How does TimescaleDB achieve this? Under the hood, these are not ordinary functions but specially optimized hooks into the database query planner itself. </p><p>The <code>time_bucket_gapfill</code> function inserts a <a href="https://www.postgresql.org/docs/11/custom-scan.html">custom scan</a> node and a sort node (if needed) into the query plan. This creates ordered, contiguous time buckets even if some of the buckets are missing observations. The <code>locf</code> and <code>interpolate</code> functions are not executed directly but serve as markers so that the gapfilling node can track the previous and next known values. </p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2019/01/customscan.png" class="kg-image" alt="" loading="lazy" width="1070" height="1128" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2019/01/customscan.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2019/01/customscan.png 1000w, https://timescale.ghost.io/blog/content/images/2019/01/customscan.png 1070w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Query plan visualization resulting from time_bucket_gapfill; courtesy of https://tatiyants.com/pev</span></figcaption></figure><p>The result is a semantically cleaner language for expressing time-series analysis: it is easier to debug, performs better, and saves the application developer from having to implement any of these tricks on the application side.
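</p><p>If you want to see this planner hook yourself, run <code>EXPLAIN</code> on a gapfilled query (a sketch; the exact plan text varies by TimescaleDB and PostgreSQL version, but it should include a custom scan node for the gapfill):</p><pre><code class="language-SQL">EXPLAIN
SELECT
    time_bucket_gapfill(
        '1 hour', time,
        start =&gt; '2019-01-21 9:00',
        finish =&gt; '2019-01-21 17:00') AS hour,
    locf(avg(value))
FROM temperature
GROUP BY hour;</code></pre><p>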
This is another example of how Timescale is extending PostgreSQL for high-performance, general-purpose time-series data management.</p><h2 id="supercharge-your-time-series-analysis">Supercharge Your Time-Series Analysis</h2><p>Time buckets with gapfilling and the related imputation functions are available as community features under the TSL license. (For more information on the license, read this <a href="https://timescale.ghost.io/blog/how-we-are-building-an-open-source-business-a7701516a480">blog post</a>.) </p><p>If you’re interested in learning more about <a href="https://docs.timescale.com/use-timescale/latest/hyperfunctions/gapfilling-interpolation/" rel="noreferrer">gapfilling, check out our docs</a>. If you are new to TimescaleDB and ready to get started, follow the <a href="https://docs.timescale.com/self-hosted/latest/install/" rel="noreferrer">installation instructions</a>. </p><p>We encourage active TimescaleDB users to join our <a href="https://slack.timescale.com/">Slack community</a> and post any questions you may have there. Finally, if you are looking for a modern cloud-native PostgreSQL platform, <a href="https://www.timescale.com/cloud" rel="noreferrer">check out Timescale Cloud</a>.</p><hr><p><em>Interested in learning more? Follow us on </em><a href="https://twitter.com/" rel="noopener"><em>Twitter</em></a><em> or sign up below to receive more posts like this! </em></p>]]></content:encoded>
        </item>
    </channel>
</rss>