<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[PostgreSQL - Tiger Data Blog]]></title>
        <description><![CDATA[Insights, product updates, and tips from TigerData (Creators of TimescaleDB) engineers on Postgres, time series & AI. IoT, crypto, and analytics tutorials & use cases.]]></description>
        <link>https://www.tigerdata.com/blog</link>
        <image>
            <url>https://www.tigerdata.com/icon.ico</url>
            <title>PostgreSQL - Tiger Data Blog</title>
            <link>https://www.tigerdata.com/blog</link>
        </image>
        <generator>RSS for Node</generator>
        <lastBuildDate>Sun, 14 Jun 2026 06:15:34 GMT</lastBuildDate>
        <atom:link href="https://www.tigerdata.com/blog/tag/postgresql/rss" rel="self" type="application/rss+xml"/>
        <ttl>60</ttl>
        <item>
            <title><![CDATA[When PostgreSQL Isn't the Right Fit: Recognizing Workloads That Need Different Architecture]]></title>
            <description><![CDATA[Postgres handles 90% of workloads well. Here's how to tell if yours is in the 10% — and what the diagnostic query that confirms it looks like.]]></description>
            <link>https://www.tigerdata.com/blog/when-postgresql-isnt-the-right-fit</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/when-postgresql-isnt-the-right-fit</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Database]]></category>
            <dc:creator><![CDATA[NanoHertz Communications]]></dc:creator>
            <pubDate>Fri, 12 Jun 2026 12:00:47 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/06/when-postgresql-isnt-the-right-fit.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/06/when-postgresql-isnt-the-right-fit.png" alt="When PostgreSQL Isn't the Right Fit: Recognizing Workloads That Need Different Architecture" /><p>When PostgreSQL isn't the right fit, the signs don't announce themselves clearly. Postgres is the right database for roughly 90% of workloads, such as SaaS backends, CRUD applications, and transactional systems with mixed read/write access on shared rows. But there's a narrow 10% where those same strengths become overhead: high-frequency append-only ingestion, time-ordered data accumulating at sustained rates, analytical scans over hundreds of millions of rows. If that sounds like your system, this post is for you.</p><h2 id="what-you-will-learn">What you will learn</h2><p>If you've added indexes, implemented partitioning, tuned autovacuum, and upgraded hardware only to watch performance degrade again on the same trajectory, the problem likely isn't your configuration. By the end of this post, you'll know whether your workload is in Postgres's 10%, how to confirm it with a single diagnostic query, and what the first concrete step toward the right architecture looks like.</p><h2 id="why-it-matters">Why it matters</h2><p>An optimization problem and an architecture problem look identical in the early stages. Both show up as slow queries. Both respond to the same fixes: indexes, partitioning, autovacuum tuning, hardware upgrades. The divergence happens later, when the fixes stop holding and performance degrades on the same trajectory regardless of what you change.</p><p>This is what’s known as the <a href="https://www.tigerdata.com/blog/postgres-optimization-treadmill"><u>optimization treadmill</u></a>: a predictable sequence of phases that each buy three to six months of relief without changing the underlying trajectory. 
<a href="https://www.tigerdata.com/blog/mvcc-feature-youre-paying-for-but-not-using"><u>MVCC overhead</u></a>, row-oriented storage, B-tree index maintenance, WAL volume. These aren't bugs. They're architectural tradeoffs that work well for 90% of workloads and work poorly for the 10%.</p><p>Knowing which problem you have determines whether you should keep tuning or make a different decision.</p><h2 id="what-postgres-was-designed-for">What Postgres was designed for</h2><p>Postgres's architecture is built around concurrent access to shared rows. Multiple transactions read and write the same data at the same time, and MVCC handles the isolation. B-tree indexes find specific rows by key. Row-oriented storage assumes that when you retrieve a row, you want most of the columns in it.</p><p>For an e-commerce backend, a user authentication system, or a multi-tenant SaaS product, these are exactly the right tradeoffs. Transactions need isolation. Point lookups by user ID are the dominant query pattern. Write rates track user activity, which gives the database natural breathing room between peaks. The question isn't whether Postgres is good. It's whether the workload you're running matches the patterns its architecture was designed to serve.</p><h2 id="the-workload-that-breaks-the-match">The workload that breaks the match</h2><p>Three characteristics, when they appear together, put a workload outside what Postgres handles well.</p><p><strong>Append-only or append-heavy writes.</strong> Rows are written once and never, or almost never, updated. Sensor readings, financial transactions, log entries, event streams. Every row still pays the full <a href="https://www.tigerdata.com/blog/mvcc-feature-youre-paying-for-but-not-using"><u>MVCC cost</u></a>: a 23-byte tuple header tracking transaction visibility, hint-bit dirtying on reads, and autovacuum running continuously to freeze tuples and update the visibility map. 
None of that overhead produces value on data that will never be touched again.</p><p><strong>Sustained high write rates.</strong> Not burst traffic that settles. Continuous ingestion at thousands to hundreds of thousands of rows per second, around the clock. The table grows without pause, B-tree index maintenance adds overhead with every insert, and that cost compounds with row volume, so there is no quiet window for <a href="https://www.tigerdata.com/blog/preventing-silent-spiral-table-bloat"><u>autovacuum to catch up</u></a>.</p><p><strong>Analytical query patterns.</strong> The queries are aggregations over time ranges: averages, counts, percentiles, <code>GROUP BY</code> time bucket. Row-oriented storage forces Postgres to read all columns of every matching row even when the query needs two. On a 30-column table, that's fifteen times the I/O a <a href="https://www.tigerdata.com/blog/hypercore-a-hybrid-row-storage-engine-for-real-time-analytics"><u>columnar layout would require</u></a>.</p><p>Any one of these is manageable. All three together is the combination that Postgres handles well at one million rows and struggles with at one hundred million.</p><h2 id="the-optimization-treadmill-in-practice">The optimization treadmill in practice</h2><p>The pattern is predictable. Queries slow down as the table grows. You add indexes, and reads get faster. Write performance drops because index maintenance scales with row volume. You upgrade the instance. Performance stabilizes and costs go up. You implement partitioning. Recent-data queries get faster. Partition management becomes its own maintenance burden. You tune autovacuum settings. Things stabilize for a while. Data volume increases. The cycle repeats.</p><p>Each step is individually correct. The problem is that the sequence never ends. 
You're working around an architectural mismatch instead of running a workload the architecture was designed to serve.</p><p>The engineering cost accumulates in ways that are harder to see on a dashboard. The senior engineer spending a week on partition strategy is not shipping product features. The on-call rotation starts treating "database is slow again" as a recurring incident category. Quarterly planning includes a database scalability line item, every quarter.</p><h2 id="how-to-know-which-10-youre-in">How to know which 10% you're in</h2><p>The answer is already in your table statistics. Not in <code>EXPLAIN</code> plans or monitoring dashboards, but in the counters tracking exactly how rows have been written, updated, and cleaned up over the table's lifetime. Run this against your highest-traffic tables:</p><pre><code class="language-SQL">SELECT
    relname AS table_name,
    n_live_tup,
    n_dead_tup,
    n_tup_ins,
    n_tup_upd,
    ROUND(n_tup_upd::numeric / NULLIF(n_tup_ins, 0) * 100, 2) AS update_pct,
    last_autovacuum,
    last_autoanalyze
FROM pg_stat_user_tables
WHERE schemaname = 'public'
ORDER BY n_tup_ins DESC
LIMIT 10;
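
-- Optional follow-up (a sketch added for illustration, not part of the query above):
-- the same view also tracks deletes and how often autovacuum has run,
-- which helps confirm an append-only write pattern.
-- The delete_pct alias is an illustrative name.
SELECT
    relname AS table_name,
    n_tup_del,
    ROUND(n_tup_del::numeric / NULLIF(n_tup_ins, 0) * 100, 2) AS delete_pct,
    autovacuum_count
FROM pg_stat_user_tables
WHERE schemaname = 'public'
ORDER BY n_tup_ins DESC
LIMIT 10;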
</code></pre><p>Here's an example of what a flagged table looks like next to a healthy one:</p>
<!--kg-card-begin: html-->
<table>
  <thead>
    <tr>
      <th>table_name</th>
      <th>n_tup_ins</th>
      <th>n_tup_upd</th>
      <th>update_pct</th>
      <th>last_autovacuum</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>device_metrics</td>
      <td>847,293,041</td>
      <td>24,892</td>
      <td>0.00</td>
      <td>2025-06-01 14:22:11</td>
    </tr>
    <tr>
      <td>user_accounts</td>
      <td>184,203</td>
      <td>91,843</td>
      <td>49.86</td>
      <td>2025-05-29 08:14:03</td>
    </tr>
  </tbody>
</table>

<!--kg-card-end: html-->
<p><code>device_metrics</code> is in the 10%: 847 million inserts, near-zero updates, and autovacuum fired three minutes ago on a table that has never had a meaningful <code>UPDATE</code> run against it. <code>user_accounts</code> is not: nearly half its rows are updated, and autovacuum runs only when it actually needs to.</p><p>Look for <code>update_pct</code> under 5% and <code>last_autovacuum</code> timestamps within the last few minutes on tables with near-zero deletes. That's the overhead the <a href="https://www.tigerdata.com/blog/postgres-optimization-treadmill"><u>companion piece</u></a> documents in detail: a cleanup process running non-stop on data you never modify, because the storage engine generates that work regardless of your intent.</p><p>Pair those numbers against the broader pattern. Your sustained write rate exceeds 10,000 rows per second. Your most common queries aggregate over time ranges, not point lookups by row identifier. You added partitioning specifically to control table size. You upgraded your instance specifically for query performance, not connection headroom.</p><p>Three or more of those conditions, and you're in the 10%. The optimization treadmill will keep running, but the trajectory won't change.</p><h2 id="what-the-10-actually-needs">What the 10% actually needs</h2><p>If you've confirmed you're in the 10%, migrating your highest-traffic table starts with a single function call:</p><pre><code class="language-SQL">SELECT create_hypertable('device_metrics', by_range('ts'));</code></pre><p>This converts the table to a TimescaleDB hypertable, which does automatic time-based chunking without cron jobs or partition management scripts. From there, you can enable columnar storage on your chunks. This format reads only the columns a query requests, not full rows, and compresses historical data by 10 to 20x, bringing time-range aggregation performance in line with what the workload demands. 
The <a href="https://www.tigerdata.com/blog/how-to-migrate-your-data-to-timescale"><u>migration post</u></a> walks through the full process, including zero-downtime options for production tables.</p><p>You keep the same SQL, the same connection strings, the same ecosystem tooling. This isn't a replacement for Postgres. It's Postgres with the storage primitives your specific workload actually needs.</p><h2 id="conclusion">Conclusion</h2><p>Postgres is not the problem. Running the wrong workload class through an architecture designed for a different problem is. The distinction matters because one has a tuning fix and the other has a structural fix, and those two paths look identical for the first several months.</p><p>The most expensive version of this recognition happens after 18 months of optimization effort. The cheapest version happens now.</p><p>Run the diagnostic query above. If the numbers land where you expect, read the <a href="https://www.tigerdata.com/blog/postgres-optimization-treadmill"><u>full architectural breakdown</u></a>. If you're ready to test on your own data, <a href="https://console.cloud.timescale.com/signup"><u>start a free Tiger Data trial</u></a> today.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Row vs Columnar Storage for Analytics: Why PostgreSQL Scans Are Slower Than They Should Be]]></title>
            <description><![CDATA[Learn why PostgreSQL reads 16x more data than your queries need, and how a hybrid row-columnar storage layout eliminates the bottleneck without changing your SQL.]]></description>
            <link>https://www.tigerdata.com/blog/row-vs-columnar-storage-analytics-postgresql-scans</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/row-vs-columnar-storage-analytics-postgresql-scans</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[real time analytics]]></category>
            <dc:creator><![CDATA[NanoHertz Communications]]></dc:creator>
            <pubDate>Fri, 05 Jun 2026 12:48:04 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/06/row-vs-columnar-storage-analytics-postgresql-scans.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/06/row-vs-columnar-storage-analytics-postgresql-scans.png" alt="Row vs Columnar Storage for Analytics: Why PostgreSQL Scans Are Slower Than They Should Be" /><p>Here's a query that runs on most time-series tables:</p><pre><code class="language-SQL">SELECT time_bucket('1 hour', ts) AS hour,
       avg(temperature),
       max(temperature)
FROM sensor_readings
WHERE ts &gt; now() - interval '7 days'
GROUP BY hour
ORDER BY hour;
</code></pre><p>The query needs two columns: <code>ts</code> and <code>temperature</code>. The table has 15 columns. Postgres reads all 15 columns for every row that matches the <code>WHERE</code> clause.</p><p>That's not a bug. It's how row-oriented storage works. Each row is stored as a contiguous block of bytes on disk, called a heap tuple, and Postgres reads the entire tuple to access any column within it. For point lookups on individual records, this is efficient. You want the whole row, and it's stored together. For analytical scans over millions of rows where you need two columns out of fifteen, it's the dominant source of wasted I/O.</p><p>In <a href="https://www.tigerdata.com/blog/postgres-optimization-treadmill"><u>Understanding Postgres Performance Limits for Analytics on Live Data</u></a>, row-oriented storage was identified as one of four architectural constraints that compound under high-frequency ingestion. That whitepaper maps the pattern at a system level. This post goes deeper on the physical mechanism: exactly how pages work, how read amplification accumulates, and why the usual fixes don't reach it.</p><h2 id="what-you-will-learn">What You Will Learn</h2><p>By the end of this post, you'll have a concrete diagnostic formula: the read amplification ratio. It tells you whether your storage layout is the dominant I/O bottleneck for analytical queries on any table you own. You'll also understand why indexes can't fix this class of problem and how a hybrid row-columnar storage layout changes the math. This post assumes working familiarity with Postgres page layout and B-tree indexes.</p><h2 id="how-row-storage-actually-works-in-postgres">How Row Storage Actually Works in Postgres</h2><p>Postgres stores data in 8KB pages. Each page holds multiple heap tuples. 
Each tuple contains every column value for that row, stored sequentially, preceded by a 23-byte header that carries transaction visibility metadata.</p><p>A table with 15 columns averaging 200 bytes per row fits roughly 35 to 40 rows per page, after accounting for headers, alignment padding, and page overhead.</p><p>When Postgres runs a sequential scan, it reads pages from disk in order. Each page load brings all the rows on that page into <code>shared_buffers</code>, with all 15 columns per row intact. The executor then evaluates the <code>WHERE</code> clause and pulls the needed columns from what was already loaded into memory.</p><p>The I/O cost is proportional to total table size, not to the size of the queried columns. A query that needs 12 bytes of data per row still reads 200 bytes from disk. The remaining 188 bytes load into the buffer cache and get discarded.</p><h2 id="the-read-amplification-math">The Read Amplification Math</h2><p>The number that makes this concrete is the read amplification ratio: total row width divided by the width of the columns the query actually needs.</p><p>For <code>sensor_readings</code>, the calculation is direct. The <code>ts</code> column is a <code>timestamptz</code> at 8 bytes. The <code>temperature</code> column is a <code>float4</code> at 4 bytes. Together they represent 12 bytes of useful data per row. The full row is 200 bytes.</p><p><strong>Read amplification ratio: 200 ÷ 12 = 16.7x</strong></p><p>For every byte the query uses, Postgres reads 16.7 bytes from disk.</p><p>At 100 million rows covering seven days, that ratio stops being abstract. The query needs 100M x 12 bytes = 1.12 GB. Postgres reads 100M x 200 bytes = 18.6 GB. At a 500 MB/sec sequential read rate, the scan takes approximately 38 seconds. Reading only the needed columns would take roughly 2.3 seconds. That 16x gap is pure storage model overhead.</p><p>No index changes this number. No configuration setting changes it. Partitioning reduces scope. 
Fewer pages get scanned by cutting the time range, but within each partition the same per-row read cost applies. The storage layout determines the I/O, and the storage layout is fixed.</p><h2 id="try-this-now-measure-your-read-amplification">Try This Now: Measure Your Read Amplification</h2><p>You can calculate the ratio for any table you own. Run these two queries to get the byte widths you need:</p><pre><code class="language-SQL">-- Full row weight
SELECT pg_column_size(t.*) AS row_bytes
FROM sensor_readings t
LIMIT 1;

-- Queried column weight
SELECT pg_column_size(ts) + pg_column_size(temperature) AS queried_bytes
FROM sensor_readings
LIMIT 1;
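
-- Optional: compute the ratio in one pass (a sketch, not part of the original queries).
-- Averaging over a 1,000-row sample is an assumption made here to smooth out
-- variable-width columns; the sample is arbitrary because there is no ORDER BY.
-- The read_amplification alias is an illustrative name.
SELECT round(
           avg(pg_column_size(t.*))
           / nullif(avg(pg_column_size(t.ts) + pg_column_size(t.temperature)), 0),
           1
       ) AS read_amplification
FROM (SELECT * FROM sensor_readings LIMIT 1000) AS t;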
</code></pre><p>Divide <code>row_bytes</code> by <code>queried_bytes</code>. If the ratio is above 5x, the storage model is your largest I/O bottleneck for analytical queries on that table. No index or configuration change will close that gap.</p><h2 id="why-indexes-don%E2%80%99t-solve-this">Why Indexes Don’t Solve This</h2><p>When a query is slow, the instinctive response is to add an index. For OLTP workloads, that instinct is correct. B-tree indexes excel at row selection: they find specific rows in <code>O(log n)</code> time, and for a lookup like <code>SELECT * FROM users WHERE id = 123</code>, the index locates the target row in microseconds.</p><p>For analytical queries that touch millions of rows, row selection is not the bottleneck. Finding the rows is fast. Reading the data from those rows is slow. An index scan on a million-row result set still reads the full heap tuple for every matching row to extract the needed columns.</p><p>The one exception is a covering index, which stores column values inside the index itself so Postgres can satisfy the query without touching the heap. But covering indexes for analytical queries become impractical at scale. When queries involve aggregations across high-frequency writes, wide covering indexes impose substantial write overhead, compounding exactly the index maintenance costs described in the <a href="https://www.tigerdata.com/blog/postgres-optimization-treadmill"><u>optimization treadmill post</u></a>.</p><p>B-tree indexes optimize for row selection (which rows to read). Analytical query performance is dominated by row width (how much data per row). These are different problems, and solving one leaves the other intact. 
For a broader look at what this means for your schema design, see <a href="https://www.tigerdata.com/learn/postgresql-data-analysis-best-practices"><u>Best Practices for PostgreSQL Data Analysis</u></a>.</p><h2 id="how-columnar-storage-changes-the-equation">How Columnar Storage Changes the Equation</h2><p>In <a href="https://www.tigerdata.com/learn/columnar-databases-vs-row-oriented-databases-which-to-choose"><u>columnar storage</u></a>, data is organized by column instead of by row. All values for <code>ts</code> live together in one stream on disk. All values for <code>temperature</code> live together in another. When the query needs those two columns, it reads two streams. The other 13 columns are never touched.</p><p>Same query, same 100 million rows: data read drops to 100M x 12 bytes = 1.12 GB. With typical 10 to 20x compression for time-series data, that compresses to approximately 60 to 120 MB. At 500 MB/sec, the same scan completes in roughly 0.12 to 0.24 seconds.</p><p>The compression benefit stacks on top of the I/O reduction. Because all values in a column share the same data type, compression algorithms work far more effectively. Sequential timestamps delta-encode to near-zero storage overhead. Floating-point sensor values compress with XOR-based techniques derived from <a href="https://www.vldb.org/pvldb/vol8/p1816-teller.pdf"><u>Facebook's Gorilla algorithm</u></a>. Row-oriented heap storage can't apply any of these because values from different columns are interleaved on every page. There's no contiguous column stream to compress.</p><h2 id="hypercore-row-and-columnar-in-one-table">Hypercore: Row and Columnar in One Table</h2><p>The tradeoff with pure columnar storage is write performance. Every new row appends to each column file separately, which adds overhead for high-frequency ingestion. You get the read benefit but give up write throughput. 
Tiger Data's Hypercore solves this with a <a href="https://www.tigerdata.com/blog/hypercore-a-hybrid-row-storage-engine-for-real-time-analytics"><u>hybrid layout that keeps both</u></a>.</p><p>Recent data stays in row-oriented storage for fast ingestion. Older data converts automatically to columnar format based on a compression policy you configure. The application writes standard SQL to one table. The storage format changes by age without any application-layer involvement.</p><pre><code class="language-SQL">-- Enable Hypercore on a hypertable with a 7-day row storage window
ALTER TABLE sensor_readings SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'device_id',
    timescaledb.compress_orderby = 'ts DESC'
);

SELECT add_compression_policy('sensor_readings', INTERVAL '7 days');
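
-- Optional check (a sketch, not part of the original setup): confirm the policy
-- was registered as a background job. Assumes the timescaledb_information.jobs
-- view and the 'policy_compression' procedure name used by add_compression_policy;
-- names may differ across TimescaleDB versions.
SELECT job_id, schedule_interval, config
FROM timescaledb_information.jobs
WHERE proc_name = 'policy_compression'
  AND hypertable_name = 'sensor_readings';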
</code></pre><p>New rows land in row format and ingest quickly. Data older than seven days converts to columnar chunks. To verify the behavior immediately without waiting for the policy schedule, compress a chunk manually:</p><pre><code class="language-SQL">SELECT compress_chunk(c) FROM show_chunks('sensor_readings') c LIMIT 1;
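
-- Optional check (a sketch, not part of the original steps): compare on-disk size
-- before and after compression. Assumes hypertable_compression_stats() is available
-- in your TimescaleDB version; before_size and after_size are illustrative aliases.
SELECT pg_size_pretty(before_compression_total_bytes) AS before_size,
       pg_size_pretty(after_compression_total_bytes)  AS after_size
FROM hypertable_compression_stats('sensor_readings');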
</code></pre><p>Then run <code>EXPLAIN (ANALYZE, BUFFERS)</code> on the aggregation query to see the difference in buffer reads (representative output on a 100M-row dataset):</p><pre><code>-- Before: row storage sequential scan
Seq Scan on sensor_readings
  Buffers: shared read=2375000    -- 18.6 GB read from disk
  Execution Time: 38142.2 ms

-- After: Hypercore columnar scan
Custom Scan (ColumnarScan) on sensor_readings
  Buffers: shared read=10240      -- 80 MB read from disk
  Execution Time: 196.4 ms
</code></pre><p>The same <code>SELECT</code> statement works against both storage formats. The query planner handles the difference transparently.</p><h2 id="conclusion">Conclusion</h2><p>Row storage reads every column to access any column. For analytical queries that scan millions of rows and need only a few, this is the largest source of I/O overhead. It doesn't yield to <a href="https://www.tigerdata.com/learn/postgres-performance-best-practices"><u>index tuning</u></a>, partitioning, or hardware upgrades.</p><p>Calculate the read amplification ratio for your most common analytical queries using the <code>pg_column_size</code> queries above. If the ratio is above 5x, <a href="https://www.tigerdata.com/docs/reference/timescaledb/hypercore"><u>Hypercore</u></a> is the direct fix. Start a <a href="https://console.cloud.timescale.com/signup"><u>free Tiger Data trial</u></a> today to enable the hybrid storage model on your tables.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The Engineering Calendar Is the Database Bill Nobody Tracks]]></title>
            <description><![CDATA[The cost of the Optimization Treadmill doesn't show up on the database bill. It shows up on the engineering calendar. And it compounds in ways that are easy to miss until someone actually adds it up.]]></description>
            <link>https://www.tigerdata.com/blog/the-engineering-calendar-is-the-database-bill-nobody-tracks</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/the-engineering-calendar-is-the-database-bill-nobody-tracks</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[PostgreSQL Tips]]></category>
            <dc:creator><![CDATA[Matty Stratton]]></dc:creator>
            <pubDate>Tue, 02 Jun 2026 13:54:23 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/05/The-Engineering-Time-Sink-Nobody-Talks-About.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/05/The-Engineering-Time-Sink-Nobody-Talks-About.png" alt="The Engineering Calendar Is the Database Bill Nobody Tracks" /><p>The database bill went up 40% last quarter, and everybody noticed. Finance noticed. You had a meeting about it. Somebody made a slide with a trend line that looked bad enough to earn its own agenda item.</p><p>The other database bill probably did not make the deck. It lives on the engineering calendar: weekly database reviews, monthly capacity checks, Slack threads about replica lag, onboarding sessions where a senior engineer explains why the partitioning scheme is the way it is, and the quarterly planning meeting where "database scalability" shows up again. I think of this as "calendar debt".</p><p><strong>Calendar debt</strong> is what happens when an architectural mismatch stops showing up only as performance pain and starts changing how the team spends its time. The work is legitimate. Debugging write latency, tuning autovacuum, and reviewing partition migrations all require real engineering judgment.</p><p>The problem is that this work keeps coming back. At some point, the database is not just consuming CPU, storage, and cloud budget. It is consuming the attention of the people who are supposed to be building the product.</p><h2 id="run-a-90-day-calendar-audit">Run a 90-day calendar audit</h2><p>Before arguing about whether the database needs <a href="https://www.tigerdata.com/blog/six-signs-postgres-tuning-wont-fix-performance-problems"><u>a different architecture</u></a>, look at the last 90 days of work. You do not need perfect time tracking. You need a useful approximation.</p><p>Pull the incident tickets with database root causes: slow queries, replica lag, connection pool exhaustion, WAL growth, autovacuum backlog, failed partition jobs, index bloat, storage pressure. 
Count the response time and the follow-up work. The incident is rarely the whole cost, because the real time usually shows up in the cleanup, the postmortem, the monitoring tweak, and the "we should make sure this never happens again" task that becomes someone's afternoon.</p><p>Search Slack for the terms that usually mean real work is happening: <code>autovacuum</code>, <code>partition</code>, <code>vacuum</code>, <code>replica lag</code>, <code>WAL</code>, <code>bloat</code>, <code>statistics</code>, <code>index rebuild</code>, <code>capacity</code>. Look at sprint work and meeting titles. Count query tuning, partition management, retention cleanup, schema migration support, capacity planning, runbook updates, and recurring reviews with "database" in the title.</p><p>Then ask the senior engineers directly: how much time last week went to database work that was not directly building product? Ask it as a systems question, not a performance question. Nobody needs a tiny productivity court hiding inside a database conversation. The point is whether the architecture is creating recurring work.</p><p>That recurrence is the signal. One slow query is an incident. A standing meeting about slow queries is architecture becoming process. One partition failure is a bug. A recurring partition review is lifecycle management leaking onto the calendar.</p><p>If the same names keep showing up in the tickets, Slack threads, sprint tasks, and meeting invites, the infrastructure bill is only part of what you are paying. You are carrying calendar debt too.</p><h2 id="what-the-debt-looks-like">What the debt looks like</h2><p>On a high-ingest Postgres workload, calendar debt usually starts small. One slow dashboard query becomes an index review. One retention problem becomes a partitioning discussion. One write latency spike becomes an autovacuum tuning session. 
One schema migration gets delayed because it touches too many partitions and nobody wants to find out in production that the migration plan was optimistic.</p><p>All of that is reasonable work. That is why it is easy to miss. It becomes a standing category of work.</p><p><a href="https://www.tigerdata.com/blog/hidden-costs-table-partitioning-scale"><u>Partition management</u></a> means creating future partitions, checking for gaps, validating the automation, and handling the incident when a missing partition breaks ingestion at the least convenient possible hour. If you have been on the receiving end of that alert, you already know this is not an abstract problem.</p><p><a href="https://www.tigerdata.com/blog/the-autovacuum-tax"><u>Autovacuum tuning</u></a> means watching <code>pg_stat_activity</code>, changing per-table settings as data volume changes, and figuring out whether a write latency spike is actually I/O contention from vacuum activity. Index maintenance means tracking bloat, rebuilding indexes, and debating whether a new read-path index is worth the extra <a href="https://www.tigerdata.com/blog/write-amplification-in-postgres-the-3-4x-tax-on-every-insert"><u>write amplification</u></a> on a table that already takes continuous inserts.</p><p>Replication management means watching lag, tuning <code>max_wal_size</code>, and dealing with the WAL accumulation alert when a replica falls behind during a write peak. Capacity planning means projecting data growth, modeling the next <a href="https://www.tigerdata.com/blog/vertical-scaling-buying-time-you-cant-afford"><u>vertical scaling</u></a> event, writing the infrastructure ticket, and explaining why the database needs more money after it got more money last quarter.</p><p>Individually, these tasks feel too small to count. Twenty minutes here. An hour there. A planning meeting. A follow-up review. 
A "quick" Slack thread that is somehow still active after lunch.</p><p>Aggregated across a week, they become a day. For senior engineers on teams deep into the Optimization Treadmill, 20-30% of their time can disappear this way. That number sounds high until you actually count the work instead of remembering it.</p><p>Memory rounds down. Calendars do not.</p><h2 id="the-calendar-is-the-leading-indicator">The calendar is the leading indicator</h2><p>The cloud bill tells you what happened after the workload grew. The calendar tells you what is going to keep happening if nothing changes.</p><p>A one-time tuning sprint might be normal. A recurring tuning sprint is a strategy, even if nobody meant to make it one. The same goes for recurring capacity reviews, recurring partition checks, recurring autovacuum investigations, recurring schema migration reviews, and recurring "can we ship this dashboard without making the database sad?" conversations.</p><p>Each one is a small admission that the system requires ongoing human coordination to stay acceptable. The invoice can tell you the instance got more expensive. It cannot tell you the team has accepted a permanent tax on planning, onboarding, incident response, and senior engineering attention.</p><p>This is where the decision gets harder, because the people who understand the database path are the same people you need to change it. When their calendar is already full of maintenance, the work that would reduce the maintenance keeps moving out by a quarter.</p><p>That is the loop. The current architecture creates recurring work. The recurring work consumes the time needed to change the architecture. The data keeps growing while everyone waits for a clean window that never arrives. Not great.</p><h2 id="onboarding-is-where-the-debt-becomes-obvious">Onboarding is where the debt becomes obvious</h2><p>Existing teams normalize their own weirdness. 
The partition naming convention makes sense because everyone remembers the incident that created it. The autovacuum thresholds make sense because someone tuned them six months ago after a write peak. The runbook makes sense because the people reading it already know the missing context.</p><p>Then a new engineer joins, and suddenly the team has to explain all of it from scratch. Why the partitions are named that way. How the <code>pg_partman</code> automation works and what happens when it fails. Which <code>autovacuum</code> alerts are noisy until the day they are not. How schema migrations work across hundreds of partitions. Which replica is safe to query. Which incident from two years ago explains the one thing in the runbook that otherwise looks completely unhinged.</p><p>This is operational folklore, not product knowledge. It lives in runbooks, Slack history, and the heads of the two or three engineers who were there when the decisions were made.</p><p>So onboarding takes three or four weeks before someone can safely operate the database path. During that time, the new hire is less productive and the senior engineers are doing support work. Every hire pays that cost again.</p><p>The runbooks also have their own bill. Writing them takes time. Keeping them current takes time. When the partitioning scheme changes, someone has to know to update the docs, find the docs, and then actually update the docs. Documentation debt on operational procedures accumulates the same way technical debt does. It just looks more respectable because it has headings.</p><h2 id="the-debt-compounds-with-data-volume">The debt compounds with data volume</h2><p>The shape of the problem matters. At 100 million rows, the partitioning scheme may be manageable: maybe 50 partitions, a few runbooks, and one engineer who really understands the sharp edges. Database operations might take 10% of engineering time.</p><p>At 500 million rows, the partition count has grown. 
Autovacuum tuning is more complicated. A few incidents have added new alerts, new checklists, and new exceptions. The original expert has either become the bottleneck or has left enough knowledge behind to make everyone nervous. Now the work is closer to 20%.</p><p>At a billion rows and beyond, the scheme is embedded in how the team operates. Schema migrations are multi-day projects. Onboarding has a dedicated database section. Quarterly planning became monthly planning without anyone formally deciding that should happen. At that point, 30% is not a dramatic estimate. It is the floor on a bad quarter.</p><p>The growth is not linear because operational surface area does not grow one-to-one with data volume. Each threshold creates new work: more partitions, more monitoring, more migration caution, more review paths, more tribal knowledge.</p><p>Meanwhile, product work slows down in the most annoying possible way: gradually. Nobody flips a table. Features just take a little longer because database changes require more review. Releases get a little more careful because the partition scheme adds risk. The roadmap gets a quiet asterisk on every data-heavy feature: check with the database people first. That is how you know the architecture has become a product constraint.</p><h2 id="what-changes-when-the-architecture-matches-the-workload">What changes when the architecture matches the workload</h2><p>This is the part where vendor content usually gets hand-wavy, so let's be specific. Database work does not become zero. You still operate a database. You still care about schema design, query behavior, retention, capacity, and reliability. The useful question is which calendar items should stop existing.</p><p>Take the recurring partition review. 
If that meeting exists because the team has to create future partitions, check for gaps, validate automation, and explain <code>pg_partman</code> failure modes to every new engineer, that is lifecycle work sitting in a meeting invite. Hypertables move time-based partitioning into the table abstraction. Chunks are created automatically as data arrives, so the partition creation job and the gap-monitoring ritual stop being monthly team activities.</p><p>Take the retention cleanup thread. If engineers are debating <a href="https://www.tigerdata.com/blog/moving-from-row-deletes-to-instant-data-retention"><u>row deletes</u></a>, manual partition drops, and cleanup windows every time data ages out, retention has become process. A retention policy turns that into database behavior. Expired chunks can be dropped by policy rather than by a quarterly cleanup project everyone swears will be simple this time.</p><p>Take the autovacuum investigation that keeps coming back. If the team is repeatedly tuning vacuum behavior around older high-volume data, the storage model is making historical data operationally expensive. <a href="https://www.tigerdata.com/blog/hypercore-a-hybrid-row-storage-engine-for-real-time-analytics"><u>Hypercore</u></a> moves older chunks into a columnar format. Vacuum does not disappear from Postgres, but the recurring work created by high-ingest row churn on data that is no longer actively modified gets smaller.</p><p>Take the schema migration review. If every migration requires a special conversation because the table is really hundreds of manually managed partitions wearing a trench coat, the abstraction is leaking. With Hypertables, the application still sees a table. The migration discussion gets smaller because the lifecycle machinery is not scattered across a partition tree the team has to reason about by hand.</p><p>The calendar changes because whole categories of recurring work shrink or disappear. 
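</p><p>As a rough sketch of what replaces those calendar items, assume a hypothetical <code>sensor_readings</code> table with a <code>ts</code> timestamp column and a <code>device_id</code> column. The intervals are illustrative, and the function names are the long-standing TimescaleDB ones; newer releases also expose columnstore-named equivalents, so check the docs for your version:</p><pre><code class="language-sql">-- Turn a regular table into a hypertable partitioned on time.
-- Chunks are then created automatically as rows arrive.
SELECT create_hypertable('sensor_readings', 'ts');

-- Retention as database behavior: drop chunks older than 90 days on a schedule.
SELECT add_retention_policy('sensor_readings', INTERVAL '90 days');

-- Convert chunks older than 7 days to the compressed columnar format.
ALTER TABLE sensor_readings SET (
  timescaledb.compress,
  timescaledb.compress_segmentby = 'device_id'
);
SELECT add_compression_policy('sensor_readings', INTERVAL '7 days');</code></pre><p>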
No partition creation review next month, no gap-monitoring script to babysit, fewer autovacuum conversations about old high-volume data, and less onboarding time spent explaining why the lifecycle machinery works the way it does.</p><p>Same Postgres ecosystem. Different operational shape. That is the actual value.</p><h2 id="bring-your-calendar-to-the-architecture-conversation">Bring your calendar to the architecture conversation</h2><p>The cloud bill is visible. It shows up in the budget report with a trend line and a year-over-year comparison. The engineering calendar usually does not.</p><p>That is why teams undercount database cost. The work is distributed across incident tickets, sprint tasks, Slack threads, onboarding sessions, planning meetings, and "quick reviews" that are never quite quick.</p><p>If you want to know whether database optimization is still the right path, start with the calendar. Count the incidents. Count the meetings. Count the onboarding time. Count the senior engineer hours that went to keeping the database acceptable instead of moving the product forward.</p><p>Then ask the better question: is this optimization work buying us a better architecture, or is it paying interest on the current one? At 50 million rows, changing direction might take a week. At a billion rows, it can take months. Waiting does not make the work cheaper. It usually adds more runbooks.</p><p>If you want the mechanical side of why this happens, read <a href="https://www.tigerdata.com/blog/postgres-optimization-treadmill"><u>Understanding Postgres Performance Limits for Analytics on Live Data</u></a>. It explains the Optimization Treadmill and the architectural constraints behind it.</p><p>Bring your calendar when you read it. That is where the real bill is hiding.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The Postgres Developer's Guide to Vector Index Tradeoffs]]></title>
            <description><![CDATA[Vector search becomes an index design problem as your data grows. Here's how to make the right call without leaving Postgres.]]></description>
            <link>https://www.tigerdata.com/blog/the-postgres-developers-guide-to-vector-index-tradeoffs</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/the-postgres-developers-guide-to-vector-index-tradeoffs</guid>
            <category><![CDATA[pg_textsearch]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[PostgreSQL Extensions]]></category>
            <dc:creator><![CDATA[Hien Phan]]></dc:creator>
            <pubDate>Tue, 26 May 2026 14:23:55 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/05/thumbnail-blog-thumbnail-1280x720--5-.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/05/thumbnail-blog-thumbnail-1280x720--5-.png" alt="The Postgres Developer's Guide to Vector Index Tradeoffs" /><p>Vector search in Postgres usually starts simply. You add an embedding column, run a nearest-neighbor query, and order by distance.</p><pre><code class="language-sql">SELECT content
FROM documents
ORDER BY embedding &lt;=&gt; '[0.1, 0.2, ...]'
LIMIT 10;</code></pre><p>For a while, that is enough.</p><p>That simplicity breaks down as the workload becomes real. The table grows, filters become part of the query path, and recall starts affecting user experience. The index still has to stay fast while new data keeps arriving.</p><p>That is when vector search stops being a query pattern and becomes an index design problem.</p><p>Most vector search advice starts with algorithms: HNSW, IVFFlat, DiskANN, recall, latency. That is useful, but incomplete once vector search lives inside Postgres. Postgres developers do not choose algorithms in the abstract. They choose indexes under constraints: memory, recall, write volume, filter selectivity, and the operational cost of adding another system.</p><p>The right index is not the best ANN algorithm in isolation. It is the index that fits the constraint your workload hits first: memory, recall, writes, or filters.</p><p>This article maps those constraints to real Postgres index choices: what each one costs, when it becomes the binding variable, and which index type it points to.</p><h2 id="when-exact-search-stops-being-enough">When exact search stops being enough</h2><p>Exact k-nearest neighbor search compares the query vector against every vector in the table. It gives perfect recall because it does not approximate the result set. It also scales linearly with the number of rows.</p><p>That tradeoff is fine early on. Exact search is the right starting point when the dataset is small, the query rate is low or you are still validating whether embeddings work for your application. It also gives you a useful baseline because the results are not affected by index tuning.</p><p>The problem shows up when the table grows into millions or tens of millions of vectors, or when users expect low latency. At that point, scanning every vector for every query becomes too expensive.</p><p>Approximate nearest neighbor search, or ANN search, exists for this moment. 
ANN indexes organize vectors ahead of time so the database can search only the most promising candidates instead of scanning the full table. The index gives up a small, controlled amount of accuracy in exchange for much lower query latency.</p><p>That is the first tradeoff: ANN is not magic. You are deciding how much recall you can afford to exchange for speed, memory efficiency, and lower infrastructure cost.</p><h2 id="the-four-constraints-behind-every-vector-index">The four constraints behind every vector index</h2><p>The right vector index is usually decided by four constraints: whether the working set fits in memory, how much recall the application needs, how often the data changes and how selective the surrounding filters are.</p><h3 id="memory">Memory</h3><p>Memory is fast and low-latency, but expensive. SSDs are cheaper and can still work well for many workloads. Object storage is cheaper still, but its higher latency makes it a poor fit for index designs that require many small random reads.</p><p>Vector indexes do not all touch storage the same way. Graph-based indexes follow connections between vectors through the index. That access pattern works very well when the graph is in memory and becomes more expensive when each hop risks a disk read. Partitioning-based indexes group vectors into regions and scan the most promising ones, which can be more memory efficient but usually requires more tuning.</p><p>In Postgres, the practical question is whether the index working set fits comfortably in <code>shared_buffers</code> and the operating system page cache. If it does, an in-memory graph index can perform very well. 
If it does not, the storage access pattern starts to dominate the design.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/05/digram-A.png" class="kg-image" alt="" loading="lazy" width="2000" height="1194" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2026/05/digram-A.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2026/05/digram-A.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2026/05/digram-A.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w2400/2026/05/digram-A.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Storage changes the index tradeoff. Graph-based indexes perform best when traversal stays hot in memory. Disk-aware and partition-based designs become increasingly important as the working set migrates to SSD or object storage.</em></i></figcaption></figure><h3 id="recall">Recall</h3><p>Recall measures how close approximate search gets to exact search. Higher recall usually costs more because the index has to inspect more candidates, traverse more of a graph or scan more partitions.</p><p>For some applications, slightly lower recall is acceptable if latency improves dramatically. For others, especially RAG systems where missing the right document leads to a bad answer, recall is part of product quality.</p><p>The honest way to set this tradeoff is to measure against your own data. Embedding model, dimensionality, filters, and query distribution all affect the result.</p><h3 id="writes">Writes</h3><p>Some vector workloads are mostly read-heavy. You build the index, query it many times, and update it occasionally. Other workloads change constantly. 
New documents arrive, old ones are deleted, embeddings are regenerated.</p><p>A structure optimized for high-recall reads may have higher write or maintenance costs. A lighter-weight index may be easier to update but require more tuning to reach the same recall.</p><h3 id="filters">Filters</h3><p>Real Postgres queries rarely search vectors alone. A query might ask for the nearest vectors, but only within a specific customer, time range, tenant or category.</p><p>Those predicates change the shape of the search problem. If a filter is highly selective, it may be cheaper to narrow the rows first and then search. If the filter is broad, it may be better to use the vector index first and apply the filter after. The right plan depends on the data distribution, the selectivity of the filter, and the index available to the planner.</p><p>That is one reason vector benchmarks can vary so much. Vector search without filters is not the same workload as vector search inside a real application query.</p><p>That is why there is no universal best vector index. There is only the index that best matches the shape of your workload.</p><h2 id="the-ann-algorithms-behind-postgres-index-choices">The ANN algorithms behind Postgres index choices</h2><p>The point of understanding ANN algorithms is not to memorize every paper. It is to understand why each index behaves differently as your workload changes. Most of the indexes discussed below fall into two broad patterns.</p><p>Graph-based indexes, such as HNSW and DiskANN-style designs, search by moving through connections between nearby vectors. 
Spatial partitioning indexes, such as IVFFlat and SPANN-style designs, divide the vector space into regions and search the most promising ones.</p><p>That distinction matters because graph-based indexes tend to optimize for high recall when the working set is hot, while partitioning-based indexes often trade more tuning for lower memory and maintenance overhead.</p><p>Each algorithm below is best understood as a response to a specific pressure: memory, write cost, disk access, or update churn.</p><h3 id="hnsw-when-the-index-fits-in-memory">HNSW: When the index fits in memory</h3><p>Your dataset fits in memory and you need high recall at high query throughput. HNSW is built for this.</p><p><a href="https://arxiv.org/abs/1603.09320"><u>Hierarchical Navigable Small Worlds</u></a> organizes vectors as a layered graph where each node connects to nearby vectors across multiple levels of granularity. A query enters at the top layer, moves toward the target neighborhood, then descends to finer layers until it converges on the best candidates.</p><p>The layered structure is what gives HNSW its speed-recall profile. The upper layers help the search move quickly across the vector space. The lower layers refine the candidate set around the target neighborhood. When the graph is in memory, that traversal can be fast and accurate.</p><p>The tradeoffs show up on the write side and at scale. Each node stores multiple edge pointers, so the index carries a higher memory footprint than simpler partitioning-based alternatives. Inserts and deletes require maintaining graph structure, which makes writes more expensive. And when the index grows beyond available memory, latency can climb.</p><p>In <code>pgvector</code>, HNSW is often the first ANN index Postgres developers try when query latency and recall matter most. 
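</p><p>As a minimal sketch of the knobs behind that speed-recall profile, assume the <code>documents</code> table from the opening example and a hypothetical index name. The build parameters shown are <code>pgvector</code>'s documented defaults (<code>m = 16</code>, <code>ef_construction = 64</code>), and the <code>hnsw.ef_search</code> value is an arbitrary example above its default of 40, not a recommendation:</p><pre><code class="language-sql">-- Build-time parameters: m is the number of connections per node,
-- ef_construction is the candidate list size used while building the graph.
-- Larger values tend to improve recall at the cost of build time and memory.
CREATE INDEX documents_embedding_hnsw_idx ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- Query-time parameter: the candidate list size during search.
-- Raising it trades latency for recall.
SET hnsw.ef_search = 100;</code></pre><p>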
For a practical look at how it performs, see <a href="https://www.tigerdata.com/blog/vector-database-basics-hnsw"><u>Vector Database Basics: HNSW</u></a>.</p><h3 id="ivfflat-when-memory-and-writes-matter-more">IVFFlat: When memory and writes matter more</h3><p>Your write throughput matters, or your index cannot comfortably fit in memory. IVFFlat is worth considering.</p><p>IVF stands for inverted file. The basic idea is to partition the vector space into lists, then search only the most promising lists at query time. In <code>pgvector</code>, this index type is exposed as <code>ivfflat</code>.</p><p>Compared with HNSW, IVFFlat is usually lighter to build and maintain. Inserts are simpler because adding a vector means assigning it to a list rather than updating a graph of neighboring nodes.</p><p>The tradeoff is that recall is more sensitive to tuning. If you create 1,000 lists and set <code>probes = 10</code>, the query searches 10 of the 1,000 lists, roughly 1% of the partitions. Increasing <code>probes</code> gives the query more chances to find the true nearest neighbors, but it also pushes the query closer to a broader scan. IVFFlat tuning is about finding the lowest <code>probes</code> value that still meets your recall target.</p><p>That is the core IVFFlat tradeoff: lower memory and maintenance overhead, but more responsibility for tuning <code>lists</code> and <code>probes</code> against your workload.</p><h3 id="diskann-when-the-index-needs-to-live-partly-on-disk">DiskANN: When the index needs to live partly on disk</h3><p>HNSW assumes the graph fits comfortably in memory. At tens of millions of high-dimensional vectors, that often stops being practical.</p><p><a href="https://www.microsoft.com/en-us/research/publication/diskann-fast-accurate-billion-point-nearest-neighbor-search-on-a-single-node/"><u>DiskANN</u></a>, developed at Microsoft Research, was built for this case. It is a graph-based algorithm designed for datasets too large to fit entirely in RAM. 
At a high level, it keeps enough compressed information in memory to guide the search while storing more of the full index and vector data on SSD.</p><p>The lesson for Postgres developers is the storage pattern. A vector index that works well in RAM may behave very differently when the query path depends on repeated disk reads. Disk-aware indexes are designed around that constraint instead of treating it as an afterthought.</p><p>DiskANN still carries higher update costs than many partitioning-based approaches. But for read-heavy workloads on large datasets, it explains the shape of the problem that disk-aware Postgres vector indexing is trying to solve. See <a href="https://www.tigerdata.com/blog/understanding-diskann"><u>Understanding DiskANN</u></a> for a deeper look.</p><h3 id="spfresh-the-update-problem-at-scale">SPFresh: The update problem at scale</h3><p>Large vector indexes create another problem: updates.</p><p>Many ANN systems handle inserts and deletes by buffering changes, maintaining secondary structures, or periodically rebuilding parts of the index. Those approaches can work, but at very large scale they require either accepting stale index state or paying an increasingly expensive maintenance cost to keep the index current.</p><p>SPFresh, from Microsoft Research, is one attempt to avoid that choice. It builds on partitioning-oriented ideas to reduce the need for global rebuilds, incrementally rebalancing partitions as vectors are inserted, deleted, or updated. Partition assignments are not fixed. They can drift and be corrected over time.</p><p>SPFresh is not implemented in Postgres today. But it is not purely academic either. The ideas behind it have already shaped how production vector systems outside Postgres are being designed. Turbopuffer is one example: an object-storage-first vector search service whose architecture is built around centroid-based indexing and minimizing storage round trips. Turbopuffer is not a Postgres system. 
But the tradeoffs it navigates (high-update workloads, disk-based search, incremental index maintenance without global rebuilds) are real problems the Postgres ecosystem will need to address as vector workloads become more dynamic.</p><p>This is worth tracking because the maintenance cost of a vector index is not static. It grows with update frequency and dataset size. For read-heavy workloads on stable datasets, this is not a near-term concern. For teams with high insert and delete rates (documents being added, embeddings regenerated, records retired), it is worth understanding now, before the index becomes the bottleneck.</p><h2 id="the-postgres-vector-search-stack">The Postgres vector search stack</h2><p>The algorithms above map to real problems Postgres developers run into. HNSW is useful for in-memory performance, IVFFlat for lighter-weight indexing and write-sensitive workloads, and DiskANN-style designs for larger datasets where memory becomes the constraint.</p><p>Here is how the Postgres ecosystem addresses those problems today.</p><h3 id="pgvector">pgvector</h3><p><a href="https://github.com/pgvector/pgvector"><u>pgvector</u></a> is the starting point. It adds a native vector column type to Postgres and supports both HNSW and IVFFlat indexes directly.</p><p>An HNSW index looks like this:</p><pre><code class="language-sql">CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops);</code></pre><p>For IVFFlat, you define the number of lists and tune the number of probes:</p><pre><code class="language-sql">CREATE INDEX ON documents
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 1000);
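-- Assumption, taken from the pgvector README rather than this article: a common
-- starting point is lists = rows / 1000 for up to about 1M rows (sqrt(rows) above
-- that) and probes = sqrt(lists). Build the index after the table has data, since
-- the list centroids are computed from the rows present at build time.
-- probes is a session-level setting, so it applies to queries run after this SET: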
SET ivfflat.probes = 10;</code></pre><p>The query planner can use these indexes for nearest-neighbor queries, and you can combine vector search with standard SQL filters, joins and CTEs in the same query. For many teams already running Postgres, this can remove the need to operate a separate vector database.</p><p><code>pgvector</code> can start to show limits at larger scale, especially with high-dimensional embeddings at tens of millions of rows and indexes that no longer fit comfortably in memory. That is the problem <code>pgvectorscale</code> was built to address.</p><h3 id="pgvectorscale">pgvectorscale</h3><p>The DiskANN section above describes a specific problem: vector workloads that have grown too large to keep the working index in memory. For Postgres, <a href="https://github.com/timescale/pgvectorscale" rel="noreferrer"><code>pgvectorscale</code></a> addresses that directly. It introduces a StreamingDiskANN index type that keeps a compressed representation in memory to guide search while storing the full index on disk.</p><p>On a <a href="https://www.tigerdata.com/blog/pgvector-is-now-as-fast-as-pinecone-at-75-less-cost"><u>Tiger Data benchmark</u></a> of 50 million Cohere embeddings at 768 dimensions, Postgres with <code>pgvector</code> and <code>pgvectorscale</code> achieved 28x lower p95 latency and 16x higher query throughput compared to Pinecone's storage-optimized index at 99% recall. This was a vendor-run benchmark. Treat it as directionally useful, not universally predictive. Results will vary with embedding model, dimensionality, filters, recall target, and hardware.</p><p>The relevant point is that <code>pgvectorscale</code> stays inside the Postgres operational model. It remains composable with <code>pgvector</code> data types and standard SQL patterns. If your index has outgrown memory, you do not need a different system. 
You need a different index type.</p><h3 id="pgtextsearch-and-paradedb">pg_textsearch and ParadeDB</h3><p>Vector similarity handles the semantic side of search well, but it is not the whole retrieval problem. Keyword-based retrieval still matters. It catches exact matches that embeddings miss, and for many queries, users know precisely what they are looking for.</p><p>This is where <code>pg_textsearch</code> and ParadeDB come in.</p><p><a href="https://github.com/timescale/pg_textsearch"><u>pg_textsearch</u></a>, also from Tiger Data, brings BM25-based search into Postgres. BM25 accounts for term frequency saturation and document length normalization, which is why it is often a stronger ranking model for keyword search than simple term matching.</p><p>ParadeDB takes a related position as a Postgres distribution, bundling <a href="https://github.com/paradedb/paradedb/tree/main/pg_search"><u>pg_search</u></a> for BM25-based full-text search and <a href="https://github.com/paradedb/pg_analytics"><u>pg_analytics</u></a> for analytical query execution. If you want Elasticsearch-style search quality and are open to running a Postgres distribution rather than adding individual extensions, ParadeDB belongs on your evaluation list. When you are operating on a small dataset, BM25 relevance ranking may not be a key requirement and <code>pg_search</code> will suffice. However, <code>pg_textsearch</code> is a better option when you need true BM25 relevance ranking, including term frequency saturation (diminishing returns for repeated terms) and document length normalization, to match the experience of Lucene (which powers Elasticsearch) or the algorithms that power Google.</p><p>The real payoff of having both vector search and BM25 inside Postgres is hybrid search: combining vector similarity and keyword scoring in a single query. For many RAG applications, this is a stronger retrieval pattern than vector search alone because each approach covers the other's blind spots. 
Vector search captures semantic meaning. BM25 catches exact matches.</p><h3 id="a-simple-hybrid-search-pattern-in-sql">A simple hybrid search pattern in SQL</h3><p>One common way to merge vector and keyword results is Reciprocal Rank Fusion, or RRF.</p><p>RRF avoids averaging scores across different scales. Instead, it combines rank positions. A result that appears near the top of either list gets a boost.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/05/digram-B.png" class="kg-image" alt="" loading="lazy" width="2000" height="1667" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2026/05/digram-B.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2026/05/digram-B.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2026/05/digram-B.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w2400/2026/05/digram-B.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Hybrid search combines semantic and lexical retrieval. Vector search finds meaning. BM25 catches exact matches. RRF merges the ranked lists without comparing raw scores directly.</em></i></figcaption></figure><p>The exact syntax depends on which BM25 extension you use, but the query shape looks like this:</p><pre><code class="language-sql">WITH keyword_results AS (
&nbsp;&nbsp;SELECT
&nbsp;&nbsp;&nbsp;&nbsp;id,
&nbsp;&nbsp;&nbsp;&nbsp;content,
&nbsp;&nbsp;&nbsp;&nbsp;paradedb.score(id) AS bm25_score,
&nbsp;&nbsp;&nbsp;&nbsp;ROW_NUMBER() OVER (ORDER BY paradedb.score(id) DESC) AS keyword_rank
&nbsp;&nbsp;FROM documents
&nbsp;&nbsp;WHERE content @@@ 'vector search'
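&nbsp;&nbsp;-- Order before limiting so the top 60 by BM25 score are kept, not an arbitrary 60
&nbsp;&nbsp;ORDER BY bm25_score DESC
&nbsp;&nbsp;LIMIT 60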
),
vector_results AS (
&nbsp;&nbsp;SELECT
&nbsp;&nbsp;&nbsp;&nbsp;id,
&nbsp;&nbsp;&nbsp;&nbsp;content,
&nbsp;&nbsp;&nbsp;&nbsp;1 - (embedding &lt;=&gt; '[0.1, 0.2, ...]') AS similarity_score,
&nbsp;&nbsp;&nbsp;&nbsp;ROW_NUMBER() OVER (ORDER BY embedding &lt;=&gt; '[0.1, 0.2, ...]') AS vector_rank
&nbsp;&nbsp;FROM documents
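&nbsp;&nbsp;-- Order by distance before limiting so the 60 nearest vectors are kept
&nbsp;&nbsp;ORDER BY embedding &lt;=&gt; '[0.1, 0.2, ...]'
&nbsp;&nbsp;LIMIT 60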
),
combined AS (
&nbsp;&nbsp;SELECT
&nbsp;&nbsp;&nbsp;&nbsp;COALESCE(k.id, v.id) AS id,
&nbsp;&nbsp;&nbsp;&nbsp;COALESCE(k.content, v.content) AS content,
&nbsp;&nbsp;&nbsp;&nbsp;COALESCE(1.0 / (60 + k.keyword_rank), 0) +
&nbsp;&nbsp;&nbsp;&nbsp;COALESCE(1.0 / (60 + v.vector_rank), 0) AS rrf_score
&nbsp;&nbsp;FROM keyword_results k
&nbsp;&nbsp;FULL OUTER JOIN vector_results v ON k.id = v.id
)
SELECT id, content
FROM combined
ORDER BY rrf_score DESC
LIMIT 10;</code></pre><p>This retrieves candidates from both systems, ranks them separately, and merges the ranked lists.</p><p>This is one of the strongest reasons to keep search in Postgres. Your embeddings, documents, metadata filters, joins, keyword search, and application data can live in one query model.</p><p>Learn more: <a href="https://www.tigerdata.com/docs/build/examples/hybrid-search"><u>how to build Hybrid Search in Postgres using pg_textsearch and pgvectorscale</u></a>, and <a href="https://www.tigerdata.com/blog/hybrid-search-postgres-you-probably-should"><u>why hybrid search outperforms vector-only retrieval</u></a>.</p><h2 id="what-this-guide-does-not-decide-for-you">What this guide does not decide for you</h2><p>No article can tell you the right vector index without your data.</p><p>Embedding model, dimensionality, filter selectivity, recall target, update rate, hardware, concurrency, and query distribution all change the answer. Even two datasets with the same number of rows can behave differently if their vectors cluster differently or their filters have different selectivity.</p><p>The point of this guide is not to replace benchmarking. It is to help you know what to benchmark first. Start with the simplest index that matches the shape of your workload. Measure it against exact search where possible. Tune recall and latency together. Then move to a more specialized index only when the workload gives you a reason.</p><h2 id="which-postgres-vector-index-should-you-use">Which Postgres vector index should you use?</h2>
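<p>Whichever row of the table below applies, the "measure it against exact search" step can be sketched roughly as follows. This assumes the <code>documents</code> table from the earlier examples with an <code>id</code> column and an ANN index already built. The temp table names are hypothetical, and the query vector is the same placeholder used above:</p><pre><code class="language-sql">-- 1. Exact top 10: disable index scans for this transaction so the
--    planner falls back to a sequential scan (exact search).
BEGIN;
SET LOCAL enable_indexscan = off;
CREATE TEMP TABLE exact_top10 AS
SELECT id FROM documents
ORDER BY embedding &lt;=&gt; '[0.1, 0.2, ...]'
LIMIT 10;
COMMIT;

-- 2. Approximate top 10: same query, with the ANN index available again.
CREATE TEMP TABLE ann_top10 AS
SELECT id FROM documents
ORDER BY embedding &lt;=&gt; '[0.1, 0.2, ...]'
LIMIT 10;

-- 3. Recall@10 for this one query: the fraction of exact results
--    that the index also returned.
SELECT count(*)::float / 10 AS recall_at_10
FROM exact_top10 e
JOIN ann_top10 a USING (id);</code></pre><p>Averaging this over a sample of real queries, with your real filters applied, gives a more useful number than any single query.</p>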
<!--kg-card-begin: html-->
<table style="border:none;border-collapse:collapse;table-layout:fixed;width:468pt"><colgroup><col><col><col></colgroup><thead><tr style="height:0pt"><th style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;" scope="col"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:700;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Workload pattern</span></p></th><th style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;" scope="col"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:700;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Start with</span></p></th><th style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;" scope="col"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:700;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Why</span></p></th></tr></thead><tbody><tr style="height:0pt"><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 
1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Small dataset or still validating the application</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Exact search</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Simple, accurate and useful as a recall baseline</span></p></td></tr><tr style="height:0pt"><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span 
style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Starting a serious Postgres vector search workload</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:'Roboto Mono',monospace;color:#188038;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">pgvector</span><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;"> with HNSW</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Strong speed-recall tradeoff for read-heavy workloads</span></p></td></tr><tr style="height:0pt"><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" 
style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Lighter index or higher write throughput matters</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:'Roboto Mono',monospace;color:#188038;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">pgvector</span><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;"> with IVFFlat</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Lower memory and maintenance overhead, with more tuning required</span></p></td></tr><tr style="height:0pt"><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 
5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Index no longer fits comfortably in memory</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:'Roboto Mono',monospace;color:#188038;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">pgvectorscale</span><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;"> with </span><span style="font-size:11pt;font-family:'Roboto Mono',monospace;color:#188038;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">StreamingDiskANN</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span 
style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Disk-aware vector indexing while staying inside Postgres</span></p></td></tr><tr style="height:0pt"><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Retrieval quality is the bottleneck</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Hybrid search with vector plus BM25</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span 
style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Combines semantic similarity with exact keyword matching</span></p></td></tr></tbody></table>
<!--kg-card-end: html-->
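<p>As a rough illustration of the first three rows of the table above, the following sketch shows exact search, an HNSW index, and an IVFFlat index in <code>pgvector</code>. The <code>items</code> table, its <code>embedding</code> column, the three-dimensional vectors, and the <code>lists = 100</code> setting are assumptions made for the example, not values taken from a benchmark. In practice you would usually keep only one of the two indexes.</p><pre><code class="language-SQL">-- Hypothetical table; 3 dimensions keeps the example readable.
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));

-- Exact search: no index, compares against every row, gives the recall baseline.
SELECT id FROM items ORDER BY embedding &lt;-&gt; '[3,1,2]' LIMIT 5;

-- HNSW: approximate search suited to read-heavy workloads.
CREATE INDEX ON items USING hnsw (embedding vector_l2_ops);

-- IVFFlat: lighter index; build it after the table holds data and tune lists.
CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);
</code></pre>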
<p>The path usually looks like this: start with exact search while the dataset is small, move to HNSW when latency requires ANN, consider IVFFlat when memory or write cost matters more, evaluate disk-aware indexing when the working set outgrows memory, and add BM25 when retrieval quality needs more than semantic similarity alone.</p><h2 id="where-things-stand-and-where-they-are-going">Where things stand and where they are going</h2><p>The practical rule is simple: benchmark the workload you actually run, not the cleanest version of vector search.</p><p>Start with exact search while the dataset is small. Move to HNSW when latency requires ANN. Consider IVFFlat when memory or write cost matters more. Evaluate StreamingDiskANN when the working set outgrows memory. Add BM25 when retrieval quality needs more than semantic similarity.</p><p>The one gap that remains is what SPFresh points toward: high-update workloads at scale without global index rebuilds. That capability is not yet in Postgres, but it is already showing up in production vector systems outside the Postgres ecosystem. </p><p>Whether it eventually appears as an extension, a fork or something nobody has named yet, the pattern is familiar: a hard problem gets real and someone in this community builds the thing.</p><p>Want to dig in further? Look at Tiger Data docs for <a href="https://github.com/timescale/pgvectorscale"><u>pgvectorscale</u></a> and <a href="https://github.com/timescale/pg_textsearch"><u>pg_textsearch</u></a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Understanding Why OS RAM and Postgres Buffer Cache Compete]]></title>
            <description><![CDATA[PostgreSQL and your OS cache the same data twice. Learn how double buffering degrades performance at scale and how the 25% shared_buffers rule fixes it.]]></description>
            <link>https://www.tigerdata.com/blog/understanding-why-os-ram-postgres-buffer-cache-compete</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/understanding-why-os-ram-postgres-buffer-cache-compete</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[PostgreSQL Tips]]></category>
            <dc:creator><![CDATA[NanoHertz Communications]]></dc:creator>
            <pubDate>Fri, 22 May 2026 14:51:13 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/05/understanding-why-os-ram-and-postgres-buffer-cache-compete.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/05/understanding-why-os-ram-and-postgres-buffer-cache-compete.png" alt="Understanding Why OS RAM and Postgres Buffer Cache Compete" /><p>You just doubled the RAM on your database server to handle a climb in p95 latency. You expect the extra memory to absorb your growing dataset and bring those 45ms spikes back down to 8ms. Instead, the dashboard shows minimal improvement. Write latency remains high, and query response times stay variable.</p><p>The problem isn’t that you added too little RAM. It’s that you gave most of it to the wrong layer.</p><p>PostgreSQL and your operating system <a href="https://www.tigerdata.com/blog/database-scaling-postgresql-caching-explained"><u>both cache data independently</u></a>. When you over-allocate memory to Postgres, the OS loses the RAM it needs to do its own caching. Both layers end up storing identical data blocks simultaneously, a condition known as double buffering, while your system spends CPU cycles shuffling data between two pools instead of serving queries. At scale, this pattern becomes a vicious cycle: you add resources, the database absorbs them, performance recovers briefly, and then degrades again as the dataset grows.</p><p>This guide explains the double buffering mechanism, gives you the tuning rule that breaks the cycle, and shows you how to diagnose whether your current configuration is already caught in it. 
By the end, you will know how to calculate the correct <code>shared_buffers</code> value for your server, run a query to identify which tables are crowding out your buffer cache, and interpret the results to decide what to do next.</p><h2 id="the-two-layers-of-database-memory">The Two Layers of Database Memory</h2><p>To manage memory effectively, you need to understand the differences between the two independent caches that operate simultaneously on every Postgres server.</p><p>The <strong>internal buffer cache</strong> is defined by the <a href="https://www.tigerdata.com/learn/postgresql-performance-tuning-key-parameters"><u><code>shared_buffers</code> configuration parameter</u></a>. When a query needs a data block, Postgres checks here first. Ideally, it finds the data block so it can avoid a system call entirely. This cache is where your hot data lives.</p><p>The <strong>OS page cache</strong> lives in whatever RAM the operating system has not allocated elsewhere. When Postgres requests a block that is not in <code>shared_buffers</code>, it issues a file system call. If the OS has that block in its page cache, it serves the data immediately. If not, the OS falls through to a physical disk read.</p><p>It’s important to note that Postgres does not manage the OS page cache at all. Instead, the kernel manages the cache on its own, including allocating space and moving data into and out of the cache. Regardless, the OS page cache is a required part of Postgres, and not just a backup option for the internal buffer cache.</p><h2 id="the-double-buffering-problem">The Double Buffering Problem</h2><p>Double buffering happens because neither cache knows what the other holds. Postgres does not inspect the OS page cache before storing a block in <code>shared_buffers</code>. The OS does not inspect <code>shared_buffers</code> before caching a file page. 
Both layers frequently hold copies of the same data at the same time.</p><p>This is wasteful at any size, but at scale it becomes actively harmful.</p><p>When <code>shared_buffers</code> is set too high (e.g., 80% of total RAM), the OS page cache is confined to the remaining 20%. Under a write-heavy workload, the OS needs that headroom to manage checkpoint writes, background writer activity, and WAL file flushes that grow proportionally with data volume. When the OS cache is too small, the kernel is forced to evict useful data pages to make room for incoming writes. Postgres then misses in both caches and falls through to disk, even if you have plenty of RAM.</p><p>This creates a <a href="https://www.tigerdata.com/blog/surviving-performance-cliff-disk-bound-data"><u>vicious cycle</u></a>. Raising <code>shared_buffers</code> temporarily absorbs the working set, but as the dataset grows the same pressure returns. Each tuning cycle buys less time than the one before it.</p><h2 id="using-the-25-rule">Using the 25% Rule</h2><p>The standard recommendation for Postgres is to set <code>shared_buffers</code> to 25% of total system RAM. By leaving 75% of memory to the OS, you give the kernel the headroom it needs to cache active data files, manage writes, and handle I/O bursts without evicting pages that Postgres will immediately need again.</p><p>To apply this, open <code>postgresql.conf</code> and update the parameter:</p><pre><code class="language-Bash"># For a server with 64GB RAM: 25% = 16GB
shared_buffers = '16GB'
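# Sketch of the companion settings for the same 64GB example, using the
# ranges given under "Additional Settings" below. The values assume
# max_connections = 200 and are starting points to benchmark, not tuned results.
effective_cache_size = '48GB'        # 75% of RAM; a planner hint, allocates nothing
work_mem = '163MB'                   # 64GB / (200 max_connections x 2)
checkpoint_completion_target = 0.9   # spread checkpoint writes over a longer window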
</code></pre><p>This parameter requires a full server restart. A configuration reload is not sufficient.</p><h3 id="large-memory-servers">Large Memory Servers</h3><p>On systems with 512GB or more of RAM, 25% works out to 128GB or more. Beyond this point, the overhead of managing the internal buffer mapping can decrease performance rather than improve it. For very large memory systems, many teams cap <code>shared_buffers</code> at 128GB to 256GB and let the OS page cache handle the rest. Treat 128GB as your starting ceiling and benchmark from there.</p><h3 id="additional-settings">Additional Settings</h3><p>Changing <code>shared_buffers</code> in isolation can produce misleading results if these settings are not also configured correctly:</p><ul><li><code>effective_cache_size</code>: Tells the query planner how much total cache (<code>shared_buffers</code> plus OS page cache combined) it can expect to use. Set this to 50-75% of total RAM. It does not allocate memory; it only informs planning decisions and affects whether the planner chooses index scans over sequential scans.</li><li><a href="https://www.tigerdata.com/learn/postgresql-performance-tuning-how-to-size-your-database"><u><code>work_mem</code></u></a>: Controls per-operation memory for sorts and hash joins. Too high, and concurrent queries can exhaust available RAM; too low, and sort operations spill to disk. A conservative starting point is total RAM divided by (<code>max_connections</code> x 2). 
On a 64GB server with 200 <code>max_connections</code>, that works out to roughly 163MB per operation, a reasonable baseline to start from and adjust under load.</li><li><a href="https://www.tigerdata.com/blog/timescale-parameters-you-should-know-about-and-tune-to-maximize-your-performance"><u><code>checkpoint_completion_target</code></u></a>: Set to 0.9 to spread checkpoint writes across a longer window, reducing the I/O spikes that compete with the OS page cache during heavy write periods.</li></ul><h2 id="diagnosing-your-current-configuration">Diagnosing Your Current Configuration</h2><p>Once you apply the 25% rule, the <a href="https://www.postgresql.org/docs/current/pgbuffercache.html"><u><code>pg_buffercache</code></u></a> extension shows you exactly which tables and indexes are occupying your buffer cache right now.</p><pre><code class="language-SQL">SELECT
  c.relname AS table_name,
  count(*) AS buffered_pages,
  pg_size_pretty(count(*) * 8192) AS buffer_size,
  round(100.0 * count(*) /
    (SELECT setting FROM pg_settings WHERE name = 'shared_buffers')::integer, 2
  ) AS percent_of_cache
FROM pg_buffercache b
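-- pg_buffercache exposes relfilenode, not a relation OID, so join through
-- pg_relation_filenode() and restrict to the current database (0 = shared objects).
INNER JOIN pg_class c
  ON b.relfilenode = pg_relation_filenode(c.oid)
  AND b.reldatabase IN (0, (SELECT oid FROM pg_database WHERE datname = current_database()))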
INNER JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname NOT IN ('pg_catalog', 'information_schema', 'pg_toast')
GROUP BY c.relname
ORDER BY buffered_pages DESC
LIMIT 10;
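-- Sketch of two supporting checks (not part of the query above).
-- The query assumes pg_buffercache is installed in the current database:
CREATE EXTENSION IF NOT EXISTS pg_buffercache;
-- shared_buffers is reported in blocks (8kB by default), the same unit
-- that count(*) over pg_buffercache is measured in:
SELECT name, setting, unit FROM pg_settings WHERE name = 'shared_buffers';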
</code></pre><h3 id="interpreting-your-results">Interpreting Your Results</h3><p>A healthy result shows no single object above 15-20% of the cache.</p><p>If any single table or index exceeds 30% of the cache, treat it as a signal that one object is crowding out everything else. Do not respond by increasing <code>shared_buffers</code>. If the object is already larger than your current allocation, giving Postgres more memory will only delay the problem until the table grows again. Instead, ask yourself the following questions:</p><ul><li>Can the table be partitioned by time or key range so that queries touch only a recent, smaller slice of the data?</li><li>Can the queries driving the cache pressure be rewritten to use more selective indexes rather than scanning large portions of the table?</li></ul><h3 id="addressing-index-bloat">Addressing Index Bloat</h3><p>A separate but related problem is <a href="https://www.tigerdata.com/learn/how-to-reduce-bloat-in-large-postgresql-tables"><u>index bloat</u></a>. When index entries dominate the output over table entries, your indexes have likely grown faster than your access patterns have changed. Use this query to list indexes that take up space but have recorded no scans since statistics were last reset:</p><pre><code class="language-SQL">SELECT
  schemaname,
  relname AS tablename,
  indexrelname AS indexname,
  idx_scan AS scans,
  pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;</code></pre><p>Any index returned here is a candidate for removal, provided it does not enforce a primary key or unique constraint and the statistics cover a representative period. <a href="https://www.tigerdata.com/learn/how-to-monitor-and-optimize-postgresql-index-performance"><u>Dropping unused indexes</u></a> directly reduces buffer pressure and frees cache space for objects that are actually serving queries.</p><p>Re-run the <code>pg_buffercache</code> query after any significant data volume increase or schema change to catch concentration drift before it affects query performance.</p><h2 id="when-tuning-reaches-its-limit">When Tuning Reaches Its Limit</h2><p>The 25% rule and the diagnostics above will recover significant performance for most Postgres deployments. But when your working dataset is larger than the memory you can reasonably allocate to either cache layer, buffer management stops being the constraint. Instead, the data volume itself is the problem.</p><p>You can see this in <code>pg_buffercache</code> directly. If your largest table is 60GB and <code>shared_buffers</code> is 16GB, the table will never be fully cached regardless of how the allocation is tuned. <code>percent_of_cache</code> for that object will approach 100% as the query workload pulls it in, leaving almost nothing for everything else.</p><p>At this point, adding more RAM extends the runway but does not change the slope. The next doubling of your dataset will return you to this same result. <a href="https://www.tigerdata.com/blog/building-columnar-compression-in-a-row-oriented-database"><u>Columnar storage</u></a> changes the equation by compressing data aggressively before it ever reaches the cache, reducing the volume that needs to be buffered in the first place.</p><p>You can test whether your workload would benefit from this approach by running the same <code>pg_buffercache</code> checks on a Tiger Data instance. 
<a href="https://console.cloud.timescale.com/signup"><u>Start a free trial today</u></a> to optimize your database and internal buffer cache without affecting production.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Why Giant IN Clauses Slow Down Your App]]></title>
            <description><![CDATA[Giant `IN` clauses inflate PostgreSQL planning time and spike p99 latency. Learn how `ANY(ARRAY[])` cuts the hidden planning tax and keeps your app fast at scale.]]></description>
            <link>https://www.tigerdata.com/blog/why-giant-in-clauses-slow-down-your-app</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/why-giant-in-clauses-slow-down-your-app</guid>
            <category><![CDATA[PostgreSQL Performance]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[NanoHertz Communications]]></dc:creator>
            <pubDate>Fri, 15 May 2026 14:34:25 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/05/Why-Giant-IN-Clauses-Slow-Down-Your-App.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/05/Why-Giant-IN-Clauses-Slow-Down-Your-App.png" alt="Why Giant IN Clauses Slow Down Your App" /><p>Your <a href="https://www.tigerdata.com/learn/postgresql-performance-tuning-optimizing-database-indexes"><u>indexes are perfect</u></a>, your CPU is healthy, but <a href="https://www.tigerdata.com/blog/six-signs-postgres-tuning-wont-fix-performance-problems"><u>your p99 latency is spiking</u></a>. When you look at the logs, you see a standard <code>IN</code> clause fetching data for a few thousand users. It looks harmless, but for high-growth databases, this single query pattern becomes a significant performance burden. This guide explains how switching to <code>ANY(ARRAY[])</code> reduces query planning time and prevents your query planner from stalling as you scale.</p><h2 id="what-you-will-learn">What You Will Learn</h2><p>Large <code>IN</code> clauses act as a hidden tax on your PostgreSQL database. Passing a massive list of IDs from your application to a query creates a performance bottleneck that simple indexing cannot fix. In this article, you will learn:</p><ul><li>The hidden planning tax PostgreSQL pays to parse large lists of constants.</li><li>How to implement <code>ANY(ARRAY[...])</code> as a more efficient alternative for batch lookups.</li><li>How to diagnose query planner bottlenecks using <a href="https://www.tigerdata.com/learn/explaining-postgresql-explain"><u><code>EXPLAIN ANALYZE</code></u></a>.</li></ul><h2 id="why-it-matters">Why It Matters</h2><p>Object-Relational Mappers (ORMs) are the primary culprits behind oversized <code>IN</code> clauses. When an application needs to fetch metadata for a list of 5,000 users, the ORM typically generates a single query like the following example:</p><pre><code class="language-SQL">SELECT * FROM users WHERE id IN (1, 2, 3, ... 
5000)</code></pre><p>While this appears to be a standard batch operation, it forces the database to parse each item individually. <a href="https://www.tigerdata.com/blog/best-practices-for-query-optimization-in-postgresql"><u>Large <code>IN</code> lists</u></a> can cause the query planner to spend more time analyzing the query than executing it. This overhead also undermines plan caching, so latency creeps up even without traffic spikes. If you are <a href="https://www.tigerdata.com/learn/guide-to-postgresql-scaling"><u>managing billions of rows</u></a>, the cost of moving these lists between your application and the database can consume your entire CPU budget.</p><h2 id="%E2%80%8Bparsing-vs-executing">Parsing vs. Executing</h2><p>To a developer, an <code>IN</code> clause is just a list. To the PostgreSQL parser, it is a complex expression tree.</p><p>When you send a query with 5,000 IDs in an <code>IN</code> clause, the parser must:</p><ol><li>Tokenize every single constant in the string.</li><li>Assign types to every element (e.g., ensuring they are all integers).</li><li>Build a tree node for each item.</li></ol><p>This creates a large internal structure that the query planner needs to traverse. In contrast, using an array parameter wraps all 5,000 IDs into a single object. The parser sees one parameter, assigns it one type (<code>int[]</code>), and moves straight to the planning phase. This reduces the memory usage of the query string and prevents the expression tree from becoming unmanageable.</p><h2 id="query-planner-diagnostics">Query Planner Diagnostics</h2><p>To determine if your <code>IN</code> clauses are hurting performance, you need to look beyond total latency and instead look at a breakdown of your query’s lifecycle. Luckily, most of these diagnostics are available via <code>EXPLAIN ANALYZE</code>.</p><h2 id="identifying-the-planning-tax">Identifying the Planning Tax</h2><p>When running <code>EXPLAIN ANALYZE</code>, pay close attention to the Planning Time. 
If it exceeds 50% of the total time (planning plus execution), the database is spending more effort preparing the query than running it. For example, a query that takes 40 ms to plan but only 2 ms to execute is a prime candidate for refactoring.</p><h2 id="analyzing-scan-types">Analyzing Scan Types</h2><p>Check how the engine navigates your data structures within the <code>EXPLAIN</code> output. A standard Index Scan is highly efficient for small <code>IN</code> lists because the overhead of looking up a few individual keys is relatively low. However, as the list grows, the planner typically switches to a <a href="https://www.tigerdata.com/learn/optimizing-array-queries-with-gin-indexes-in-postgresql"><u>Bitmap Index Scan</u></a>, which is commonly used with the <code>ANY(ARRAY[])</code> syntax. In this scenario, the database constructs a map of all matching rows in memory before initiating the fetch. This process is far more efficient for large batches than jumping back and forth on an index for 5,000 separate values.</p><h2 id="%E2%80%8Bbenchmarking-the-planning-tax">Benchmarking the Planning Tax</h2><p>To see the performance gains in your own environment, use <code>EXPLAIN (ANALYZE, BUFFERS)</code> to monitor the planning and execution metrics of your queries. These examples demonstrate how to measure the planning tax of a standard <code>IN</code> clause, and the gains you will see with the more efficient <code>ANY(ARRAY[])</code> pattern, for a typical <code>device_metrics</code> table.</p><h3 id="standard-in-clause">Standard <code>IN</code> Clause</h3><p>A standard <code>IN</code> clause with 5,000 IDs forces the database to tokenize each element individually, resulting in high Planning Time. The problem is that the parser needs to parse each constant and build a massive expression tree before execution starts.</p><pre><code class="language-SQL">EXPLAIN (ANALYZE, BUFFERS)
SELECT reading_value
FROM device_metrics
WHERE device_id IN (101, 102, 103, ... 5000);
</code></pre><h3 id="anyarray"><code>ANY(ARRAY[])</code></h3><p>When you use <code>ANY(ARRAY[])</code> syntax, you tell PostgreSQL to treat the thousands of IDs as a single object rather than thousands of individual constants. This drastically reduces the query's structural complexity.</p><pre><code class="language-SQL">EXPLAIN (ANALYZE, BUFFERS)
SELECT reading_value
FROM device_metrics
WHERE device_id = ANY(ARRAY[101, 102, 103, ... 5000]::int[]);
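
-- A sketch of the same lookup with the list bound as a single int[] parameter,
-- assuming device_id is an integer column. batch_lookup is a hypothetical
-- statement name, and the three IDs stand in for a full batch.
PREPARE batch_lookup(int[]) AS
  SELECT reading_value
  FROM device_metrics
  WHERE device_id = ANY($1);

EXPLAIN (ANALYZE, BUFFERS)
EXECUTE batch_lookup(ARRAY[101, 102, 103]);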
</code></pre><h3 id="results">Results</h3><p>In high-volume environments, moving from the standard <code>IN</code> clause to an array parameter often yields the following improvements:</p><ul><li><strong>Planning Time</strong>: Can drop from 40+ ms to less than 2 ms.</li><li><strong>Memory Usage</strong>: Significant reduction in the memory required to store the query string and the resulting expression tree.</li><li><strong>Scan Strategy</strong>: The planner is more likely to use a Bitmap Index Scan, which fetches data in larger, more efficient batches than <a href="https://www.tigerdata.com/learn/how-to-monitor-and-optimize-postgresql-index-performance"><u>individual index lookups</u></a>.</li></ul><h2 id="next-steps">Next Steps</h2><p>Switching from a standard <code>IN</code> clause to array parameters can be one of the most impactful changes you make for database performance. Here are some steps you can take to implement this in your own database.</p><ol><li><strong>Enable </strong><a href="https://www.tigerdata.com/learn/postgresql-extensions-pg-stat-statements"><strong><u><code>pg_stat_statements</code></u></strong></a>: Ensure this extension is listed in your <code>shared_preload_libraries</code> to track historical query performance at scale.</li><li><strong>Find top performance offenders</strong>: <a href="https://www.tigerdata.com/blog/using-pg-stat-statements-to-optimize-queries"><u>Identify queries with large parameter counts</u></a> in <code>pg_stat_statements</code> with the following query:</li></ol><pre><code class="language-SQL">SELECT query, calls, mean_exec_time
FROM pg_stat_statements
WHERE query ILIKE '%IN (%'
ORDER BY mean_exec_time DESC LIMIT 5;
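-- Sketch of a planning-focused variant. Assumes PostgreSQL 13 or later with
-- pg_stat_statements.track_planning = on (it is off by default); without it,
-- mean_plan_time stays at zero.
SELECT query, calls, mean_plan_time, mean_exec_time
FROM pg_stat_statements
WHERE query ILIKE '%IN (%'
ORDER BY mean_plan_time DESC LIMIT 5;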
</code></pre><ol start="3"><li><strong>Audit ORMs</strong>: Check your ORM configuration to see if it supports "Array Mode" to automate the use of the <code>ANY(ARRAY[])</code> pattern.</li><li><strong>Benchmark</strong>: Run <code>EXPLAIN (ANALYZE, BUFFERS)</code> on your largest batch queries. If planning time exceeds 10 ms or 50% of total latency, apply the refactor immediately.</li></ol><p><code>ANY(ARRAY[])</code> and all these tools come with Tiger Cloud by default. Start a <a href="https://console.cloud.tigerdata.com/signup"><u>free Tiger Cloud trial</u></a> today and see these performance benefits for yourself.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The True Cost of Database Optimization: Engineering Time]]></title>
            <description><![CDATA[The true cost of Postgres optimization isn't the cloud bill. It's 12-16 engineer-weeks per year that never show up on a budget report.]]></description>
            <link>https://www.tigerdata.com/blog/the-true-cost-of-database-optimization-engineering-time</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/the-true-cost-of-database-optimization-engineering-time</guid>
            <category><![CDATA[Database]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[PostgreSQL Performance]]></category>
            <dc:creator><![CDATA[Matty Stratton]]></dc:creator>
            <pubDate>Thu, 14 May 2026 20:36:27 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/05/thumbnail-blog--1-.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/05/thumbnail-blog--1-.png" alt="The True Cost of Database Optimization: Engineering Time" /><p>"We can fix the performance issue with better indexes, smarter partitioning, and some vacuum tuning. It's cheaper than switching."</p><p>You've heard this sentence. You may have said this sentence.</p><p>The optimization wasn't cheap. It just felt like it was.</p><p>"Cheaper than what" is the question nobody asks. The optimization doesn't show up on an invoice. It costs engineering time. And engineering time has a rate: the fully-loaded cost of the senior engineers doing the work, plus whatever those engineers aren't building while they're doing it. Most teams have never actually added up their database optimization spend. When they do, the number is larger than expected. And it comes back every quarter.</p><p>This problem is specific to a particular class of workload: high-frequency, append-heavy data. Telemetry, metrics, events, anything where timestamps are how you think about your data and the table only ever gets bigger. If that describes your system, keep reading. If you're running a CRUD app with predictable write volume, this isn't your problem.</p><h2 id="why-optimization-doesnt-fix-this">Why optimization doesn't fix this</h2><p>Here's what most teams figure out a year or two in: optimization isn't the wrong thing to try. It's just solving the wrong problem.</p><p>Tuning vanilla Postgres for a high-frequency append workload is a bit like upgrading the engine on a pickup truck because you want to haul more freight. You can make the truck faster and it feels productive. But at some point, you're limited by what the vehicle fundamentally is. The problem isn't the mechanic. It's the vehicle.</p><p>When your workload is structurally mismatched to your database architecture, the optimization treadmill is inevitable. 
Every index you add, every partition scheme you design, every autovacuum you tune: it's solving for a data volume you'll outgrow in months. The gap between "current optimization" and "needed optimization" widens every quarter. Not because you're falling behind. Because the data compounds faster than the fixes do.</p><h2 id="a-realistic-year">A realistic year</h2><p>Here's what that looks like. A year of Postgres optimization for a high-volume append workload.</p><p><strong>Q1.</strong> Queries are slowing down. A senior engineer spends two weeks analyzing query plans, adding targeted indexes, and rewriting three critical queries. Performance improves. Write throughput drops roughly 15% because of new index maintenance overhead. (These numbers are illustrative. Your Q1 will have its own version of this tradeoff.)</p><p><strong>Q2.</strong> Table size is causing partition-related issues. The team implements time-based partitioning. Two engineers spend three weeks on it: designing the partition scheme, migrating existing data, updating application queries that assumed a single table, and fixing the CI/CD pipeline that didn't account for partition management.</p><p><strong>Q3.</strong> Autovacuum is competing with production writes during peak hours. One engineer spends a week tuning autovacuum parameters, adjusting cost delays, and setting up monitoring for vacuum lag. A follow-up incident two weeks later, when a vacuum job blocks a schema migration, costs another three days.</p><p><strong>Q4.</strong> Storage costs are climbing. The team evaluates compression options, considers archiving old data to cold storage, and ultimately decides to upgrade the instance size to buy headroom for Q1 of next year. The upgrade takes a day. The evaluation and planning took two weeks.</p><p>Total: 12 to 16 engineer-weeks across the year. At fully-loaded senior engineer cost (call it $150K to $200K/year), that's $35K to $60K in direct labor. You bought time, not a solution. 
And the bill comes back next year.</p><h2 id="the-opportunity-cost-the-real-number">The opportunity cost (the real number)</h2><p>The $35K to $60K understates it.</p><p>12 to 16 engineer-weeks is a feature. It's a product launch. For a team of 10, that's 3 to 4% of total engineering output spent keeping the database at "acceptable." Not advancing it. Just treading water against a growing dataset.</p><p>Ask your engineering manager: if you reclaimed those 12 to 16 weeks, what would you build? That's the true cost of optimization. Not the hours. The roadmap you didn't ship.</p><p>And it compounds. Year two has all the same optimization needs plus new ones as data grows, but now you're also maintaining the partitioning scheme from Q2 and the vacuum configuration from Q3. The baseline maintenance burden grows even as new problems arrive.</p><p><a href="https://www.tigerdata.com/blog/how-flogistix-by-flowco-reduced-infrastructure-management-costs-by-66-with-tiger-data"><u>Flogistix</u></a>, who runs high-frequency oil and gas telemetry, reported 66% monthly cost savings after moving to Tiger Cloud, and their engineering team said the freed time directly increased roadmap velocity. That's what the other side of this decision looks like.</p><h2 id="the-hidden-costs-nobody-tracks">The hidden costs nobody tracks</h2><p>These don't show up in sprint planning.</p><p><strong>Incident response.</strong> Database performance incidents pull engineers off planned work. A slow query that triggers alerts at 2am costs the on-call engineer a night of sleep and a mostly useless next day. These incidents increase in frequency as the gap between "current optimization" and "needed optimization" widens. And the gap always widens.</p><p><strong>Knowledge concentration.</strong> Database optimization work accumulates in one or two senior engineers who understand the schema, the query patterns, and enough Postgres internals to make changes safely. This is your single point of failure. 
When that engineer is on vacation or leaves, optimization work stalls or gets done slowly by someone learning as they go. Trust me, I've seen this play out in ways that aren't fun for anyone involved.</p><p><strong>Context switching.</strong> Engineers don't work on database optimization in clean, uninterrupted blocks. They get pulled in for an afternoon here, a day there, to diagnose a regression or review a partition change. Context switching is expensive because it disrupts both the database work and whatever they were doing before. You're not just paying for the time spent on the database. You're paying for the interrupt tax on everything else.</p><p>All three are part of the platform tax: the invisible engineering cost of maintaining infrastructure that doesn't quite fit the workload. It doesn't show up on an invoice either.</p><h2 id="calculate-your-own-number">Calculate your own number</h2><p>Track for one month. Count hours spent on: query optimization and explain plan analysis; partition management and creation; autovacuum tuning, monitoring, and incident response; database-related incident response (slow query alerts, replication lag, connection pool exhaustion); and meetings discussing performance, capacity planning, or migration timing.</p><p>Multiply the monthly total by 12. Multiply that by the fully-loaded hourly rate of the engineers involved. That's your annual optimization cost.</p><p>Compare it against the one-time cost of migrating to a system designed for the workload (typically 2 to 8 engineer-weeks depending on data volume), plus ongoing maintenance that scales with workload complexity rather than with data growth.</p><p>For most teams, the breakeven is within the first year. Often within the first quarter. 
Do the math before assuming migration is the expensive option.</p><h2 id="what-the-alternative-looks-like">What the alternative looks like</h2><p>After migrating to <a href="https://www.tigerdata.com/docs/learn/hypertables/understand-hypertables"><u>TimescaleDB</u></a> (the open-source Postgres extension that powers Tiger Cloud), the engineering time picture looks different.</p><p>Migration cost: one-time, typically 1 to 4 weeks for a single engineer depending on data volume and schema complexity. Most of that time is data backfill, not application changes. TimescaleDB is still Postgres. Your SQL, your tooling, your team's existing knowledge stays intact.</p><p>Ongoing costs: not zero, but different in kind. The categories of work that consumed engineering time on vanilla Postgres shift significantly. Automatic partitioning via <a href="https://www.tigerdata.com/docs/learn/hypertables/understand-hypertables"><u>Hypertables</u></a> removes partition management as a recurring quarterly project. The database handles it. Compression policies run automatically in the background. Autovacuum pressure on historical data drops because <a href="https://www.tigerdata.com/docs/learn/columnar-storage/understand-hypercore"><u>Hypercore</u></a> converts older chunks to columnar format: instead of accumulating MVCC dead tuples on row-level records, that data is stored as compressed column arrays that don't generate the same vacuum workload. You still tune a database. You just stop tuning the same problems every quarter.</p><p>What was being spent on keeping vanilla Postgres at "acceptable" is now available for product work. Not because the database is magic. Because the architecture fits the workload.</p><h2 id="the-decision-you-keep-deferring">The decision you keep deferring</h2><p>The true cost of database optimization is not the cloud bill. 
It's the engineering time: senior engineers spending weeks per quarter on maintenance that keeps the system at "acceptable" rather than moving it forward.</p><p>If the annual optimization cost exceeds the one-time migration cost (and it usually does, often within the first year), the economic case writes itself. The harder question is whether the team can keep deferring the decision, knowing that each quarter of optimization increases the total spend without changing the trajectory.</p><p>Run the numbers. Then decide.</p><p>If you've done the math and want to understand what migration looks like at your data scale, <a href="https://www.tigerdata.com/blog/when-to-migrate-postgres-to-timescaledb"><u>The Best Time to Migrate Was at 10M Rows. The Second Best Time Is Now.</u></a> is a good next read. And when you're ready to move, the <a href="https://www.tigerdata.com/docs/deploy/self-hosted/migration"><u>migration guide</u></a> covers the mechanics.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How Relational Complexity Crushes Real-Time Dashboards]]></title>
            <description><![CDATA[Joins on billion-row Postgres tables crush real-time dashboards. Flatten your schema to cut shared buffer hits 70–90% and restore dashboard speed.]]></description>
            <link>https://www.tigerdata.com/blog/how-relational-complexity-crushes-real-time-dashboards</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/how-relational-complexity-crushes-real-time-dashboards</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[real time analytics]]></category>
            <dc:creator><![CDATA[NanoHertz Communications]]></dc:creator>
            <pubDate>Fri, 08 May 2026 15:33:00 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/05/realtional-complexity-real-time-dashboards.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/05/realtional-complexity-real-time-dashboards.png" alt="How Relational Complexity Crushes Real-Time Dashboards" /><p>When your Shared Buffer Hits cross 100,000 for a simple dashboard refresh, you have hit the wall of relational scaling. A simple query that used to return in 100ms now takes 1.5 seconds. You’ve <a href="https://www.tigerdata.com/learn/postgresql-performance-tuning-optimizing-database-indexes"><u>added indexes</u></a> and tuned the buffer pool, but the performance gains are fleeting. Your largest tables <a href="https://www.tigerdata.com/blog/indexing-your-way-into-a-performance-bottleneck"><u>crossed 500M rows</u></a>, and the joins that once felt elegant are now a weight dragging down your entire system.</p><h2 id="what-you-will-learn">What You Will Learn</h2><p>High-frequency data and deep relational hierarchies eventually collide. By following this guide, you will be able to:</p><ul><li>Identify the metadata tax in your current query plans using buffer metrics.</li><li>Implement a flattened schema to bypass the Join Explosion.</li><li>Automate data synchronization between relational and flattened tables.</li><li>Compare the performance gap between relational and denormalized architectures.</li></ul><h2 id="why-it-matters">Why It Matters</h2><p>Relational schemas work during small-scale pilots but struggle when dashboards must stitch ten tables together for every refresh, hundreds of times per second, against billions of rows. Complex joins consume excessive CPU cycles, leading to the query speed degradation typical of the <a href="https://www.tigerdata.com/learn/postgres-performance-best-practices"><u>optimization treadmill</u></a>.</p><p>Each join forces the database to navigate multiple B-tree indexes and load disparate pages into memory. 
By flattening data, you reduce the database's computational burden and enable much higher query concurrency.</p><h2 id="the-join-explosion-problem">The Join Explosion Problem</h2><p>A <a href="https://www.tigerdata.com/learn/how-to-use-postgresql-for-data-normalization"><u>normalized schema</u></a> fragments telemetry into separate tables for readings, hardware IDs, and site data. While efficient for storage, this creates a “join explosion” at scale. On tables with a billion rows, the query planner has to navigate multiple B-tree indexes and perform nested loops, or hash joins, just to assemble a single dashboard view. This forces the database to jump between disparate disk locations, incurring a heavy computational tax every time a user refreshes their screen.</p><p>The solution is to move the join cost from read-time to write-time. Instead of stitching tables together during every query, you flatten the data by pre-joining the metadata to the raw reading during ingestion.</p><h2 id="moving-from-relational-to-flattened">Moving From Relational to Flattened</h2><p>Moving from relational to flattened means transitioning from a passive storage model to an active processing model. Instead of a lean ingestion phase followed by expensive reads, you perform the join logic exactly once when the data enters the system. Storing the result in a <a href="https://www.tigerdata.com/learn/designing-your-database-schema-wide-vs-narrow-postgres-tables"><u>single, wide table</u></a> allows the database to perform a single index scan rather than a multi-way join, reducing I/O load and restoring dashboard responsiveness.</p><p>The following sections illustrate how to move to a flattened data model, with example SQL commands for a smart building management system. In this scenario, we will move the computational cost of joining sensor data (e.g. 
temperature, air quality) with building and regional metadata from query-time to write-time.</p><h3 id="step-1-define-the-flattened-structure">Step 1: Define the Flattened Structure</h3><p>Start by creating a table that includes the metadata as native columns. This removes the need for downstream joins.</p><p>This example query creates the flattened table and a composite index so the database can locate time-series readings and their metadata without expensive lookups across separate tables. This structure allows the database to instantly locate specific data, like North region telemetry, without performing costly table scans.</p><pre><code class="language-SQL">CREATE TABLE readings_flattened (
   ts TIMESTAMPTZ NOT NULL,
   sensor_name TEXT,
   building_name TEXT,
   region TEXT,
   value DOUBLE PRECISION
);
CREATE INDEX idx_flattened_ts_region ON readings_flattened (ts DESC, region);
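-- Optional sketch, not part of the original schema: if most dashboard queries
-- filter one region by equality plus a time range, an index that leads with
-- region may let the planner use a tighter range scan. The index name is
-- hypothetical; compare plans with EXPLAIN before keeping both indexes.
CREATE INDEX idx_flattened_region_ts ON readings_flattened (region, ts DESC);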
</code></pre><h3 id="step-2-backfill-existing-data-in-batches">Step 2: Backfill Existing Data in Batches</h3><p>Moving a billion rows in a single statement creates one long-running transaction that competes with production traffic. Use a batch approach to migrate data from your relational tables.</p><p>This example query migrates historical records from the fragmented tables into the new wide format <strong>one window at a time</strong>, starting with the past thirty days. Because the data is pre-joined during this migration, the database never has to stitch these records together again.</p><pre><code class="language-SQL">INSERT INTO readings_flattened (ts, sensor_name, building_name, region, value)
SELECT
   r.ts,
   s.sensor_name,
   l.building_name,
   l.region,
   r.value
FROM readings r
JOIN sensors s ON r.sensor_id = s.id
JOIN locations l ON s.location_id = l.id
WHERE r.ts &gt; now() - interval '30 days';
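-- Sketch of a later batch. The timestamps are placeholder assumptions; use
-- fixed, half-open boundaries so consecutive windows neither overlap nor leave
-- gaps, and repeat with older windows until the history is covered.
INSERT INTO readings_flattened (ts, sensor_name, building_name, region, value)
SELECT r.ts, s.sensor_name, l.building_name, l.region, r.value
FROM readings r
JOIN sensors s ON r.sensor_id = s.id
JOIN locations l ON s.location_id = l.id
WHERE r.ts &gt;= TIMESTAMPTZ '2026-03-01 00:00:00+00'
  AND r.ts &lt; TIMESTAMPTZ '2026-04-01 00:00:00+00';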
</code></pre><h3 id="step-3-automate-maintenance-via-triggers">Step 3: Automate Maintenance via Triggers</h3><p>To keep the flattened table up to date without changing your application code, <a href="https://www.tigerdata.com/blog/speed-up-triggers-by-7x-with-transition-tables"><u>use a database trigger</u></a>. This ensures that every new reading is pre-joined as it enters the system.</p><p>This example query creates an automated trigger that flattens every incoming sensor reading in real time. This ensures that, as new sensor data arrives, it is immediately linked to its building and regional context before being stored.</p><pre><code class="language-SQL">CREATE OR REPLACE FUNCTION flatten_reading_trigger()
RETURNS TRIGGER AS $$
BEGIN
   INSERT INTO readings_flattened (ts, sensor_name, building_name, region, value)
   SELECT
       NEW.ts, s.sensor_name, l.building_name, l.region, NEW.value
   FROM sensors s
   JOIN locations l ON s.location_id = l.id
   WHERE s.id = NEW.sensor_id;
   RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_flatten_reading
AFTER INSERT ON readings
FOR EACH ROW EXECUTE FUNCTION flatten_reading_trigger();
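-- Sketch of a sanity check, not part of the original setup: the trigger's
-- inner join silently skips readings whose sensor or location row is missing,
-- so count recent readings that have no metadata match.
SELECT count(*) AS unmatched_readings
FROM readings r
LEFT JOIN sensors s ON r.sensor_id = s.id
LEFT JOIN locations l ON s.location_id = l.id
WHERE r.ts &gt; now() - interval '1 day'
  AND l.id IS NULL;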
</code></pre><h2 id="%E2%80%8Bmeasuring-the-performance-gap">Measuring the Performance Gap</h2><p>To see the mechanical benefits of your new flattened table, you need to move beyond measuring execution time, which can fluctuate based on concurrent load. Instead, use the <a href="https://www.tigerdata.com/learn/postgresql-performance-tuning-how-to-size-your-database"><u>BUFFERS metric in your query plan</u></a> to observe how many blocks the database touches. This provides a stable, repeatable measure of the work required to retrieve your data.</p><h3 id="the-relational-approach-join-at-read">The Relational Approach (Join at Read)</h3><p>To identify the join tax in your system, establish a performance baseline using your existing relational schema. This example query uses the <code>EXPLAIN (ANALYZE, BUFFERS)</code> command to measure the buffer and I/O work required by a standard multi-way join.</p><p>In our example scenario, this query tracks how many data blocks your database engine must touch to assemble the North region dashboard.</p><pre><code class="language-SQL">EXPLAIN (ANALYZE, BUFFERS)
SELECT r.ts, s.sensor_name, l.building_name, r.value
FROM readings r
JOIN sensors s ON r.sensor_id = s.id
JOIN locations l ON s.location_id = l.id
WHERE r.ts &gt; now() - interval '1 hour' AND l.region = 'North';
</code></pre><p>Focus on the shared hit count. Each hit represents one 8 KB block that the database found in shared buffers and had to process; blocks that had to be fetched from outside shared buffers are reported separately as <code>read</code>. If you see tens of thousands of hits for one hour of data, the join is touching far more blocks than the result needs, and that block access is the bottleneck.</p><h3 id="the-flattened-approach-pre-joined">The Flattened Approach (Pre-Joined)</h3><p>Validate the architectural shift by running a comparative diagnostic against the new flattened structure. This example query measures the effect of data co-location through shared buffer hits. Unlike the relational model, which hunts across multiple indexes, this approach allows the engine to find all necessary data in a single table and index, significantly reducing the number of blocks touched.</p><p>In our example scenario, this query will show the efficiency of your database engine in assembling the North region dashboard with a flattened structure.</p><pre><code class="language-SQL">EXPLAIN (ANALYZE, BUFFERS)
SELECT ts, sensor_name, building_name, value
FROM readings_flattened
WHERE ts &gt; now() - interval '1 hour' AND region = 'North';
</code></pre><p>Look for a 70–90% drop in shared hits, which is the foundation of any <a href="https://www.tigerdata.com/blog/real-time-analytics-for-time-series-continuous-aggregates"><u>real-time analytics</u></a> workload. This reduction in blocks touched is what allows your dashboard to scale to hundreds of concurrent users without CPU saturation.</p><h3 id="trade-offs-consistency-vs-speed">Trade-offs: Consistency vs. Speed</h3><p>Flattening introduces redundancy. If a building is renamed, the flattened table will still contain the old name for historical rows. However, for <a href="https://www.tigerdata.com/learn/a-beginners-guide-to-iiot-and-industry-4-0"><u>telemetry and IIoT</u></a>, metadata is usually static. The massive gain in query concurrency and dashboard speed almost always outweighs the cost of a rare historical update.</p><h2 id="%E2%80%8Bnext-step">Next Step</h2><p>Identify the slowest query in your dashboard. Test a flattened version of that dataset using a temporary table to see the I/O savings immediately:</p><pre><code class="language-SQL">CREATE TEMPORARY TABLE test_flattened AS
SELECT r.ts, s.sensor_name, l.building_name, l.region, r.value
FROM readings r
JOIN sensors s ON r.sensor_id = s.id
JOIN locations l ON s.location_id = l.id
LIMIT 1000000;
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM test_flattened WHERE region = 'North';
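-- Sketch, assuming you want the test to resemble an indexed production table:
-- autovacuum does not analyze temporary tables, so add an index and gather
-- statistics by hand, then rerun the measurement.
CREATE INDEX ON test_flattened (region, ts DESC);
ANALYZE test_flattened;
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM test_flattened WHERE region = 'North';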
</code></pre><p>Compare the buffer counts of this test against your production query. Because temporary tables use session-local buffers, the plan reports them as <code>local hit</code> and <code>local read</code> rather than <code>shared hit</code>, so compare the total number of blocks touched. If the savings are significant, it is time to move toward a <a href="https://www.tigerdata.com/learn/real-time-analytics-in-postgres"><u>denormalized architecture</u></a>. For workloads exceeding billions of rows, see how <a href="https://www.tigerdata.com/docs/about/latest/whitepaper"><u>Tiger Data</u></a> does this flattening for you with hybrid row-columnar storage. Start a <a href="https://console.cloud.timescale.com/signup"><u>Tiger Cloud free trial</u></a> today to see these savings for yourself.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Five Warning Signs Your Database Needs Different Architecture]]></title>
            <description><![CDATA[There's a category of Postgres performance issue that no amount of tuning will fix. Here's how to tell if you're already in it.]]></description>
            <link>https://www.tigerdata.com/blog/five-warning-signs-your-database-needs-different-architecture</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/five-warning-signs-your-database-needs-different-architecture</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Matty Stratton]]></dc:creator>
            <pubDate>Thu, 07 May 2026 11:40:41 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/05/thumbnail-blog-thumbnail-1280x720--4-.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/05/thumbnail-blog-thumbnail-1280x720--4-.png" alt="Five Warning Signs Your Database Needs Different Architecture" /><p>Every database has bad days. A slow query after a schema change. A spike in replication lag during a traffic surge. An autovacuum job that runs long enough to make you nervous.</p><p>Those are tuning problems. They have tuning solutions. A senior engineer can fix them in a day, sometimes in an hour, and the fix holds. You move on.</p><p>But there's a different category of symptom. The kind that comes back. The kind where each solution introduces the need for the next solution, where performance degrades not because of a specific bad query but because the workload has grown into territory the database wasn't designed for.</p><p>Here's the thing that makes this category different: growth changes the <em>nature</em> of the problem, not just the severity. When your workload is fundamentally append-heavy and analytically queried, you're running it on a storage engine built for point lookups and updates. That mismatch doesn't show up at small scale. It compounds. And eventually no amount of configuration adjusts for it.</p><p>I've watched teams burn two quarters optimizing their way around this before realizing the optimization itself was the trap. The five signs below are how you recognize it before that happens.</p><p>If you're seeing three or more, the problem isn't your configuration.</p><h2 id="sign-1-your-optimization-work-is-cyclical-not-cumulative">Sign 1: Your optimization work is cyclical, not cumulative</h2><p>Tuning that sticks is cumulative. You add an index, queries get faster, they stay faster. You adjust <code>work_mem</code>, hash joins improve, the improvement is permanent. Done.</p><p>The warning sign: you're doing the same work again six months later. 
You partition the table, and after a while you need to re-partition because the data distribution shifted. You optimize a query, and three months later the same query is slow again because the table grew 40%. You upgrade the instance, and the headroom is gone in a quarter.</p><p>When optimization is cyclical, the underlying cause is growth, not misconfiguration. Growth doesn't respond to configuration. It responds to architecture.</p><p><strong>How to measure:</strong> look at your sprint backlog. How many database-related tasks do you have this quarter versus last quarter versus the one before that? If the number is flat or growing, your optimization is running in place.</p><h2 id="sign-2-autovacuum-is-running-constantly-on-tables-with-low-update-rates">Sign 2: Autovacuum is running constantly on tables with low update rates</h2><p>Autovacuum on a heavily updated table makes sense. Dead tuples need cleaning, that's the job.</p><p>The warning sign: autovacuum running persistently on tables where fewer than 5% of rows are ever updated. This usually means one of two things: transaction ID wraparound prevention driving vacuum on high-insert tables, or hint bit maintenance generating constant background I/O.</p><p>Both are <a href="https://www.tigerdata.com/blog/mvcc-feature-youre-paying-for-but-not-using"><u>MVCC overhead</u></a> on append-only data. The visibility tracking machinery runs at full cost on rows that will never be modified. Tuning autovacuum parameters (<code>autovacuum_vacuum_scale_factor</code>, <code>autovacuum_vacuum_cost_delay</code>, <code>autovacuum_max_workers</code>) adjusts how the work is distributed, not whether it has to happen. The work still has to happen.</p><p><strong>How to measure:</strong> check <code>pg_stat_user_tables</code> for your largest tables. Compare <code>n_dead_tup</code> to <code>autovacuum_count</code>. If autovacuum runs frequently but dead tuple counts are consistently low, the vacuum is maintenance overhead, not cleanup. 
You're paying the full MVCC tax on data that never gets modified.</p><h2 id="sign-3-query-performance-degrades-linearly-with-data-volume">Sign 3: Query performance degrades linearly with data volume</h2><p>Postgres query performance should be sublinear with data volume when indexes are working correctly. A B-tree index lookup is O(log n). Adding 10x more data should add a small, roughly constant amount of work to each indexed lookup, not 10x more time.</p><p>The warning sign: queries that get proportionally slower as the table grows. This happens when the dominant pattern is sequential scans or index scans over wide ranges rather than point lookups. Aggregations over time ranges, <code>GROUP BY</code> on large result sets, analytical scans touching millions of rows. These scale with data volume, not with the index.</p><p>Partitioning helps by limiting scans to relevant partitions. It's also a manual process that requires ongoing management, and partition-level sequential scans still scale linearly within each partition.</p><p><strong>How to measure:</strong> run your five slowest queries against progressively larger time ranges. If execution time scales proportionally with the range, the queries are scan-bound. More data will always mean slower queries, and no index is going to change that trajectory.</p><h2 id="sign-4-index-strategy-has-become-a-tradeoff-negotiation">Sign 4: Index strategy has become a tradeoff negotiation</h2><p>In a well-matched architecture, indexes are straightforward. Add them where you query, the read performance benefit outweighs the write cost, everyone's happy.</p><p>The warning sign: adding indexes to improve read performance measurably degrades write throughput, and the team is now making deliberate tradeoffs. "We can't add that index, it would slow inserts below our SLA." That conversation is the sign.</p><p>This is specific to high-throughput insert workloads. A SaaS app doing 1K inserts/sec can add ten indexes without noticing. 
A telemetry pipeline doing 100K inserts/sec feels every additional index in <a href="https://www.tigerdata.com/blog/write-amplification-in-postgres-the-3-4x-tax-on-every-insert"><u>write amplification</u></a>. The math is just different.</p><p>When index strategy becomes a negotiation between read and write performance, the workload has outgrown what B-tree indexing on row-oriented storage can serve without tradeoffs. You can't index your way out.</p><p><strong>How to measure:</strong> benchmark write throughput with your current index set, then with one additional index. If the drop is more than 10-15%, you're in the tradeoff zone. Whoops.</p><h2 id="sign-5-storage-costs-are-growing-faster-than-data-value">Sign 5: Storage costs are growing faster than data value</h2><p>Postgres stores data at the row level, uncompressed (unless you're using TOAST for large values). For time-series and event data, raw storage cost is proportional to row count with no automatic lifecycle management built in.</p><p>The warning sign: storage costs growing linearly with data volume, and the team having conversations about retention not because old data is worthless, but because storing it is expensive. You archive to cold storage or drop old partitions to manage costs, even though you'd rather keep the data online.</p><p>When storage cost forces data lifecycle decisions, the storage model isn't efficient for the data type. Columnar compression (10-20x for typical time-series data) and automatic data tiering change the economics so that keeping years of history queryable is practical rather than a monthly budget conversation.</p><p><strong>How to measure:</strong> calculate your cost per million rows stored. Compare that number to the analytical value of the data. If you're deleting or archiving data you'd rather be querying, storage cost is constraining your product. 
That's the sign.</p><h2 id="the-diagnostic">The diagnostic</h2><p>Count how many of these apply to your system.</p><p><strong>0-1:</strong> Tuning is the right move. Your workload fits the architecture, specific optimizations will address the symptoms. The usual playbook works.</p><p><strong>2-3:</strong> You're approaching the boundary. Current optimizations will buy months, not years. Start evaluating architectural options now, while you still have time to choose rather than react.</p><p><strong>4-5:</strong> The workload has outgrown the architecture. Further optimization has diminishing returns. Migration to a system designed for this workload will be less expensive than continued optimization within 6-12 months. Not a fun conclusion, but better to reach it now than after another year of sprint-backlog churn.</p><h2 id="what-this-actually-means">What this actually means</h2><p>Tuning problems have solutions that stick. Architectural mismatches have solutions that buy time. The goal of this whole exercise is to figure out which one you're dealing with before you've spent another quarter on the latter.</p><p>If the diagnostic puts you in the 4-5 range, the answer isn't to abandon Postgres. The answer is to stop running a workload it wasn't designed for on vanilla storage primitives. What you actually need is automatic time-based partitioning, hybrid row/columnar storage for analytical scans, and lifecycle management that doesn't require you to babysit it. You need Postgres extended for this workload, not replaced.</p><p>That's what TimescaleDB does. Same database, same SQL, same toolchain your team already knows. 
Different storage engine underneath.</p><p><em>If you want to go deeper on why vanilla Postgres hits this wall and what the underlying mechanics look like, </em><a href="https://www.tigerdata.com/blog/postgres-optimization-treadmill"><em><u>Understanding Postgres Performance Limits for Analytics on Live Data</u></em></a><em> covers the architecture side in detail. And if some of these signs felt familiar but you're not sure you're fully in the architectural category yet, </em><a href="https://www.tigerdata.com/blog/six-signs-postgres-tuning-wont-fix-performance-problems"><em><u>Six Signs That Postgres Tuning Won't Fix Your Performance Problems</u></em></a><em> has a different angle on the same diagnostic.</em></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Autovacuum: The Tax You're Always Paying]]></title>
            <description><![CDATA[Your append-only table doesn't need autovacuum. It runs anyway. Here's what that actually costs your team.]]></description>
            <link>https://www.tigerdata.com/blog/the-autovacuum-tax</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/the-autovacuum-tax</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[PostgreSQL Performance]]></category>
            <dc:creator><![CDATA[Matty Stratton]]></dc:creator>
            <pubDate>Tue, 05 May 2026 19:04:21 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/05/Autovacuum_-The-Tax-You-re-Always-Paying-V2.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/05/Autovacuum_-The-Tax-You-re-Always-Paying-V2.png" alt="Autovacuum: The Tax You're Always Paying" /><p>You open <code>pg_stat_activity</code> during a write peak and there it is. Autovacuum. Again.</p><p>You've tuned this three times. You wrote a runbook for it. At this point it has its own section in the quarterly database review.</p><p>The table it's working on hasn't seen a single UPDATE in six months. Every write is an INSERT. Old data gets dropped by partition, not deleted row by row. By every reasonable intuition about what autovacuum is for, this table should basically run itself.</p><p>But there it is.</p><p>The intuition isn't wrong. It's incomplete. Autovacuum exists to clean up after concurrent row modifications, and your workload doesn't do that. What your workload <em>does</em> generate is a steady stream of other work that lands in the same process: dead tuples from aborted transactions, hint bits that need setting, transaction IDs that need freezing before the counter wraps. None of it is the problem autovacuum was designed to solve. All of it runs through the same mechanism.</p><p>This post isn't about how to tune autovacuum. It's about understanding what it's actually doing on your tables, why tuning helps at the margin but not at the root, and what it means that a process built for row modification cleanup is your most persistent background worker on a table that never modifies rows.</p><h2 id="what-autovacuum-is-actually-for">What autovacuum is actually for</h2><p>Postgres MVCC keeps old row versions alive as long as any active transaction might need to see them. When a row gets updated, the old version stays on the heap page, marked dead. When a row gets deleted, same thing. These dead tuples accumulate until something cleans them up. 
That something is autovacuum.</p><p>Without it, dead tuples pile up permanently. <a href="https://www.tigerdata.com/learn/how-to-reduce-bloat-in-large-postgresql-tables"><u>Table bloat</u></a> grows without bound. Heap pages that hold dead tuples can't be reused. Query performance degrades as scans trip over dead rows.</p><p>And then there's the harder problem: <a href="https://www.tigerdata.com/blog/how-to-fix-transaction-id-wraparound"><u>XID wraparound</u></a>. Transaction IDs are 32-bit counters. About 2 billion transactions in, Postgres loses the ability to distinguish old from new. Rows from before the wraparound point become invisible. This isn't theoretical. It has happened in production. Autovacuum's freeze pass exists to prevent it by marking old tuples as frozen before the counter laps them.</p><p>For a standard OLTP workload with concurrent reads and updates on shared rows, this is essential infrastructure. Dead tuple accumulation is a direct consequence of normal operation. The cleanup cost is proportional to the update and delete rate. Makes sense.</p><p>The confusing part is what happens when your workload never updates or deletes anything.</p><h2 id="three-reasons-autovacuum-runs-on-append-only-tables">Three reasons autovacuum runs on append-only tables</h2><p>Each of these is a different mechanism. None of them are the same problem.</p><p><strong>Aborted transactions leave dead tuples.</strong> Not every INSERT commits. Connection drops mid-transaction. Application errors trigger rollbacks. Explicit transaction management has bugs. At <a href="https://www.tigerdata.com/blog/13-tips-to-improve-postgresql-insert-performance"><u>high insert rates</u></a>, even a small abort rate produces a steady trickle of dead tuples. A 0.1% abort rate at 50,000 inserts per second is 50 dead tuples per second. Autovacuum has to find and mark them, even though those rows were never part of a committed write.</p><p>This is real dead tuple work. Just not from updates. 
You can see it directly in <code>n_dead_tup</code> in <code>pg_stat_user_tables</code>. Tuning autovacuum to run more aggressively cleans it up faster. There's no way to eliminate it without eliminating aborted transactions, which isn't realistic.</p><p><strong>Hint bits require page dirtying.</strong> This one surprises most people who haven't dug deep into Postgres internals. When a row is first read after being written, Postgres doesn't just hand you the data. It verifies the writing transaction committed. It checks <code>pg_xact</code>. Once confirmed, it sets a hint bit in <code>t_infomask</code> to cache that result so future reads don't have to hit <code>pg_xact</code> again.</p><p>Setting that hint bit modifies the tuple header. A modified tuple header dirties the page. A dirty page needs writing back to disk.</p><p>So: your append-only table with immutable rows is generating I/O from reads. Not writes. Reads. The rows don't change. The headers do. This affects checkpoint pressure and the overall I/O budget autovacuum has to compete for.</p><p><strong>Insert volume alone triggers autovacuum for freezing.</strong> Since PostgreSQL 13, <code>autovacuum_vacuum_insert_threshold</code> and <code>autovacuum_vacuum_insert_scale_factor</code> control a separate trigger: autovacuum fires based on insert count, not just dead tuple count. The reason is XID wraparound prevention. At high insert rates, unfrozen tuples accumulate fast. Postgres needs to freeze them before the counter laps. So autovacuum runs a freeze pass continuously on high-insert tables, regardless of whether any rows have been updated or deleted.</p><p>Go check <code>vacuum_count</code> and <code>autovacuum_count</code> in <code>pg_stat_user_tables</code> on your busiest append-only partition. They're climbing. <code>n_dead_tup</code> might be low. The freeze passes are happening anyway. 
<a href="https://www.tigerdata.com/blog/using-bpftrace-to-trace-postgresql-vacuum-operations"><u>This BPFtrace walkthrough</u></a> shows exactly which passes and when.</p><h2 id="what-tuning-actually-does">What tuning actually does</h2><p>Most autovacuum tuning falls into two categories: make it run more aggressively, or make it yield more to writes.</p><p>Running it more aggressively means lowering <code>autovacuum_vacuum_scale_factor</code>, increasing <code>autovacuum_max_workers</code>, reducing <code>autovacuum_naptime</code>. Dead tuples get cleaned before they affect query plans. Freeze passes complete before XID pressure builds. Real improvement.</p><p>Yielding more to writes means increasing <code>autovacuum_vacuum_cost_delay</code>, lowering <code>autovacuum_vacuum_cost_limit</code>. Write latency stabilizes. The tradeoff is that vacuum falls further behind and bloat accumulates more between cycles.</p><p>Here's what neither category does: reduce the amount of work autovacuum needs to do. Every configuration choice is about how the work gets distributed across time and how aggressively it competes with your actual workload. You're tuning the scheduler. Not the workload.</p><p>Per-table overrides are the right implementation of this: more aggressive settings on active partitions, letting older ones vacuum on a slower cycle. Good practice. It's still adjusting the tax rate, not the taxable activity.</p><h2 id="the-operational-cost-people-dont-account-for">The operational cost people don't account for</h2><p>Autovacuum tuning isn't free engineering time.</p><p>Someone has to write the per-table <code>ALTER</code> statements. Someone has to monitor whether the settings are actually working. At 500 partitions, "monitor autovacuum lag" is a real job that runs on a recurring schedule. New partitions inherit defaults unless automation creates them with the right settings. 
That automation needs maintaining.</p><p>The monitoring surface for autovacuum lag spans three separate system views and requires correlating timestamps across them. It’s not a dashboard that lights up. It’s a debugging session.</p><p>When autovacuum falls behind, the symptom usually isn't an autovacuum alert. It's query latency regression. Write performance degradation. The connection between autovacuum backlog and query performance is real but indirect. Connecting the two is senior engineer work. Hours, not a quick look at a graph. Sigh.</p><p>If you've tracked where your team actually spends time on database operations, a meaningful slice of it is this: watching autovacuum, tuning autovacuum, debugging incidents that turn out to be autovacuum-adjacent, and writing runbooks for autovacuum behavior on new partition types.</p><p>All of that cost is real. None of it is the kind of cost an append-only workload should be paying.</p><h2 id="why-append-only-data-shouldnt-work-this-way">Why append-only data shouldn't work this way</h2><p>An append-only system has a simple storage contract. Data arrives. It gets written. It ages. Eventually it gets dropped, wholesale, by time window.</p><p>Nothing is ever modified in place. There is no concurrent row modification to manage. Reads never block writes and writes never block reads because there's no contention to prevent. The MVCC guarantee is irrelevant to the workload.</p><p>Postgres MVCC is the right model for workloads that need that concurrency guarantee. For workloads that don't, it's overhead. Autovacuum running continuously on your append-only table isn't Postgres failing. It's Postgres correctly maintaining the MVCC invariants on a workload that doesn't benefit from them.</p><p>The cost is real. The benefit isn't.</p><p>Tuning autovacuum makes the cost more manageable. 
It doesn't make the benefit appear.</p><h2 id="what-changes-when-the-storage-model-matches-the-workload">What changes when the storage model matches the workload</h2><p><a href="https://www.tigerdata.com/blog/hypercore-a-hybrid-row-storage-engine-for-real-time-analytics"><u>Hypercore</u></a> is built for exactly that contract. It abandons the per-tuple MVCC model for compressed column segments. Up to 1,000 row versions get batched into a single compressed segment before writing. Dead tuples in the traditional sense don't accumulate because the storage model doesn't create them.</p><p>When rows land in a Hypercore segment, transaction visibility is tracked at the segment level, not the tuple level. There is no per-tuple <code>t_xmin</code> to freeze. No hint bit to set on first read. The three mechanisms that drive autovacuum on your append-only heap table have nothing to work with here because the storage model does not create the conditions they require.</p><p>The result: autovacuum pressure drops proportionally. Postgres doesn't disappear. There's still housekeeping work. But the continuous autovacuum activity on high-insert partitions goes away. So does the I/O competition during write peaks, the tuning overhead, the monitoring surface for vacuum lag. The conditions that produced all of it no longer exist.</p><p><code>vacuum_count</code> stops climbing on tables that nobody updates. <code>pg_stat_activity</code> stops showing vacuum workers at 3am. The runbook section on autovacuum tuning gets shorter.</p><p>Same SQL. Same wire protocol. Different storage contract underneath, matched to the workload that's actually running.</p><h2 id="conclusion">Conclusion</h2><p>Autovacuum isn't broken. It isn't misconfigured. It's working correctly given what it's been handed.</p><p>The problem is that "working correctly" on an append-only high-insert table means running constantly, competing with writes, requiring ongoing tuning, and consuming real engineering time. 
All to manage overhead that a purpose-built append system wouldn't generate in the first place.</p><p>That's the tax. The settings determine how you pay it. The workload determines that you owe it.</p><p>If autovacuum is in your top processes by CPU and I/O on a table that nobody updates, that's not a Postgres problem to fix. It's a signal about the relationship between your workload and your storage model. If your write pattern is append-only, the question worth asking is not how to tune autovacuum better. It is whether your storage model is matched to your workload at all.</p><p><a href="https://www.tigerdata.com/blog/postgres-optimization-treadmill"><u>Understanding Postgres Performance Limits for Analytics on Live Data</u></a> goes deeper on where that mismatch shows up across the system, and what the path forward looks like at different data volumes.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Postgres Extensions Cheat Sheet: Replace 7 Databases With SQL]]></title>
            <description><![CDATA[Working SQL examples for replacing Elasticsearch, Pinecone, InfluxDB, Redis, and more, using Postgres extensions you can enable today.]]></description>
            <link>https://www.tigerdata.com/blog/postgres-extensions-cheat-sheet</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/postgres-extensions-cheat-sheet</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[PostgreSQL Extensions]]></category>
            <dc:creator><![CDATA[Matty Stratton]]></dc:creator>
            <pubDate>Sat, 02 May 2026 20:47:24 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/05/thumbnail-blog-thumbnail-1280x720--3-.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/05/thumbnail-blog-thumbnail-1280x720--3-.png" alt="Postgres Extensions Cheat Sheet: Replace 7 Databases With SQL" /><p>This post is a practical companion to <a href="https://www.tigerdata.com/blog/its-2026-just-use-postgres"><u>It's 2026, Just Use Postgres</u></a>. That post makes the architectural case for consolidating on Postgres. This one shows you how.</p><p>Below are working SQL examples for each use case. Every extension listed here is available on <a href="https://console.cloud.timescale.com"><u>Tiger Cloud</u></a> with no additional setup. If you're self-hosting, each section links to the extension's repo.</p><p><strong>What you'll be able to do after reading this:</strong> Set up Postgres extensions for full-text search, vector search, time-series, caching, message queues, document storage, geospatial queries, and scheduled jobs. Each section is self-contained, so you can skip to what you need.</p><h2 id="enable-everything">Enable Everything</h2><p>Here's the full set. You probably don't need all of them. Pick the ones that match your workload.</p><pre><code class="language-sql">CREATE EXTENSION pg_textsearch;   -- BM25 full-text search
CREATE EXTENSION vector;          -- Vector search (pgvector)
CREATE EXTENSION vectorscale;     -- DiskANN index for vectors
CREATE EXTENSION ai;              -- AI embeddings and RAG workflows
CREATE EXTENSION timescaledb;     -- Time-series
CREATE EXTENSION pgmq;            -- Message queues
CREATE EXTENSION pg_cron;         -- Scheduled jobs
CREATE EXTENSION postgis;         -- Geospatial</code></pre><h2 id="full-text-search-replace-elasticsearch">Full-Text Search (Replace Elasticsearch)</h2><p><strong>Extension:</strong> <a href="https://github.com/timescale/pg_textsearch"><u><code>pg_textsearch</code></u></a> (true BM25 ranking)</p><p><strong>What you're replacing:</strong> Elasticsearch (separate JVM cluster, complex mappings, sync pipelines), Solr, or Algolia ($1 per 1,000 searches).</p><p><strong>What you get:</strong> The same BM25 algorithm that powers Elasticsearch, running natively in Postgres. No separate cluster. No sync jobs. No data drift.</p><pre><code class="language-sql">CREATE TABLE articles (
  id SERIAL PRIMARY KEY,
  title TEXT,
  content TEXT
);

-- Create a BM25 index
CREATE INDEX idx_articles_bm25 ON articles USING bm25(content)
  WITH (text_config = 'english');

-- Search with BM25 scoring
SELECT title, -(content &lt;@&gt; 'database optimization') AS score
FROM articles
ORDER BY content &lt;@&gt; 'database optimization'
LIMIT 10;</code></pre><p><strong>Deep dive:</strong> <a href="https://www.tigerdata.com/blog/you-dont-need-elasticsearch-bm25-is-now-in-postgres"><u>You Don't Need Elasticsearch: BM25 is Now in Postgres</u></a></p><h2 id="vector-search-replace-pinecone">Vector Search (Replace Pinecone)</h2><p><strong>Extensions:</strong> <a href="https://github.com/pgvector/pgvector"><u><code>pgvector</code></u></a> + <a href="https://github.com/timescale/pgvectorscale"><u><code>pgvectorscale</code></u></a></p><p><strong>What you're replacing:</strong> Pinecone ($70/month minimum, separate infrastructure, data sync), Qdrant, Milvus, or Weaviate.</p><p><strong>What you get:</strong> pgvectorscale uses the DiskANN algorithm (from Microsoft Research). On a <a href="https://www.tigerdata.com/blog/pgvector-vs-pinecone"><u>50M vector benchmark</u></a>, it achieved 28x lower p95 latency and 16x higher throughput than Pinecone at 99% recall.</p><pre><code class="language-sql">CREATE EXTENSION vector;
CREATE EXTENSION vectorscale CASCADE;

CREATE TABLE documents (
  id SERIAL PRIMARY KEY,
  content TEXT,
  embedding vector(1536)
);

-- High-performance DiskANN index
CREATE INDEX idx_docs_embedding ON documents USING diskann(embedding);

-- Find similar documents
SELECT content, embedding &lt;=&gt; '[0.1, 0.2, ...]'::vector AS distance
FROM documents
ORDER BY embedding &lt;=&gt; '[0.1, 0.2, ...]'::vector
LIMIT 10;</code></pre><h3 id="auto-sync-embeddings-with-pgai">Auto-sync embeddings with pgai</h3><p>No more manual embedding pipelines. pgai regenerates embeddings automatically on every INSERT and UPDATE.</p><pre><code class="language-sql">SELECT ai.create_vectorizer(
  'documents'::regclass,
  loading =&gt; ai.loading_column(column_name =&gt; 'content'),
  embedding =&gt; ai.embedding_openai(
    model =&gt; 'text-embedding-3-small',
    dimensions =&gt; '1536'
  )
);</code></pre><p>Every row stays in sync. No batch jobs. No drift.</p><h2 id="hybrid-search-bm25-vectors-in-one-query">Hybrid Search: BM25 + Vectors in One Query</h2><p>This is where Postgres consolidation pays off immediately. Combining keyword search and semantic search in other stacks requires two API calls, result merging, failure handling, and double the latency. In Postgres, it's one query.</p><h3 id="simple-weighted-hybrid">Simple weighted hybrid</h3><pre><code class="language-sql">SELECT
  title,
  -(content &lt;@&gt; 'database optimization') AS bm25_score,
  embedding &lt;=&gt; query_embedding AS vector_distance,
  0.7 * (-(content &lt;@&gt; 'database optimization')) +
  0.3 * (1 - (embedding &lt;=&gt; query_embedding)) AS hybrid_score
FROM articles
ORDER BY hybrid_score DESC
LIMIT 10;</code></pre><h3 id="reciprocal-rank-fusion-for-rag-applications">Reciprocal Rank Fusion (for RAG applications)</h3><pre><code class="language-sql">WITH bm25 AS (
  SELECT id, ROW_NUMBER() OVER (ORDER BY content &lt;@&gt; $1) AS rank
  FROM documents
  ORDER BY content &lt;@&gt; $1
  LIMIT 20
),
vectors AS (
  SELECT id, ROW_NUMBER() OVER (ORDER BY embedding &lt;=&gt; $2) AS rank
  FROM documents
  ORDER BY embedding &lt;=&gt; $2
  LIMIT 20
)
SELECT d.*,
  1.0 / (60 + COALESCE(b.rank, 1000)) +
  1.0 / (60 + COALESCE(v.rank, 1000)) AS score
FROM documents d
LEFT JOIN bm25 b ON d.id = b.id
LEFT JOIN vectors v ON d.id = v.id
WHERE b.id IS NOT NULL OR v.id IS NOT NULL
ORDER BY score DESC LIMIT 10;</code></pre><p>One query. One transaction. One result set.</p><h2 id="time-series-replace-influxdb">Time-Series (Replace InfluxDB)</h2><p><strong>Extension:</strong> <a href="https://github.com/timescale/timescaledb"><u>TimescaleDB</u></a> (21K+ GitHub stars)</p><p><strong>What you're replacing:</strong> InfluxDB (separate database, Flux or limited SQL), Prometheus (metrics only, not application data).</p><p><strong>What you get:</strong> Automatic time-based partitioning, compression up to 95%, continuous aggregates for fast dashboards, and full SQL. Your time-series data lives alongside your relational data with <code>JOIN</code>s and <a href="https://www.tigerdata.com/learn/understanding-acid-compliance"><u>ACID guarantees</u></a>.</p><pre><code class="language-sql">CREATE EXTENSION timescaledb;

CREATE TABLE metrics (
  time TIMESTAMPTZ NOT NULL,
  device_id TEXT,
  temperature DOUBLE PRECISION
);

-- Convert to a hypertable (automatic time partitioning)
SELECT create_hypertable('metrics', 'time');

-- Query with time buckets
SELECT time_bucket('1 hour', time) AS hour,
       AVG(temperature)
FROM metrics
WHERE time &gt; NOW() - INTERVAL '24 hours'
GROUP BY hour;</code></pre><h3 id="lifecycle-automation">Lifecycle automation</h3><p>TimescaleDB handles retention and compression policies so you don't have to build cron jobs for data management.</p><pre><code class="language-sql">-- Automatically drop data older than 30 days
SELECT add_retention_policy('metrics', INTERVAL '30 days');
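
-- Continuous aggregate for fast dashboards: a minimal sketch, assuming the
-- metrics hypertable above. The view name metrics_hourly and the refresh
-- offsets are illustrative assumptions, not recommended values.
CREATE MATERIALIZED VIEW metrics_hourly
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', time) AS hour,
       device_id,
       AVG(temperature) AS avg_temperature
FROM metrics
GROUP BY hour, device_id;

-- Refresh the aggregate hourly over the most recent few hours of data
SELECT add_continuous_aggregate_policy('metrics_hourly',
  start_offset =&gt; INTERVAL '3 hours',
  end_offset =&gt; INTERVAL '1 hour',
  schedule_interval =&gt; INTERVAL '1 hour');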

-- Compress data older than 7 days (up to 95% storage reduction)
ALTER TABLE metrics SET (timescaledb.compress);
SELECT add_compression_policy('metrics', INTERVAL '7 days');</code></pre><p><strong>Case study:</strong> <a href="https://www.tigerdata.com/blog/from-4-databases-to-1-how-plexigrid-replaced-influxdb-got-350x-faster-queries-tiger-data"><u>Plexigrid went from 4 databases to 1</u></a> and got 350x faster queries.</p><hr><h2 id="caching-replace-redis">Caching (Replace Redis)</h2><p><strong>Feature:</strong> <code>UNLOGGED</code> tables + <code>JSONB</code> (built into Postgres, no extension needed)</p><p><strong>What you're replacing:</strong> Redis for simple key-value caching scenarios.</p><p><strong>What you get:</strong> In-memory-speed storage without WAL overhead. Good for session data, temporary lookups, and simple caches. No separate service to operate.</p><p><strong>When to keep Redis:</strong> If you need pub/sub, sorted sets, Lua scripting, or complex data structures, Redis is still the better tool for those specific jobs.</p><pre><code class="language-sql">-- UNLOGGED = no WAL overhead, faster writes
CREATE UNLOGGED TABLE cache (
  key TEXT PRIMARY KEY,
  value JSONB,
  expires_at TIMESTAMPTZ
);

-- Set with expiration
INSERT INTO cache (key, value, expires_at)
VALUES ('user:123', '{"name": "Alice"}', NOW() + INTERVAL '1 hour')
ON CONFLICT (key) DO UPDATE
  SET value = EXCLUDED.value,
      expires_at = EXCLUDED.expires_at;  -- refresh the TTL on overwrite

-- Get
SELECT value FROM cache
WHERE key = 'user:123' AND expires_at &gt; NOW();

-- Schedule cleanup with pg_cron
SELECT cron.schedule('cache_cleanup', '0 * * * *',
  $$DELETE FROM cache WHERE expires_at &lt; NOW()$$);</code></pre><h2 id="message-queues-replace-kafka">Message Queues (Replace Kafka)</h2><p><strong>Extension:</strong> <a href="https://github.com/tembo-io/pgmq"><u><code>pgmq</code></u></a></p><p><strong>What you're replacing:</strong> Kafka or RabbitMQ for task queues and simple event processing.</p><p><strong>What you get:</strong> A lightweight message queue inside Postgres. Send, receive with visibility timeouts, and delete after processing. Transactional with the rest of your data.</p><p><strong>When to keep Kafka:</strong> If you need high-throughput event streaming across dozens of services, consumer groups, exactly-once semantics, or multi-datacenter replication, Kafka is purpose-built for that.</p><pre><code class="language-sql">CREATE EXTENSION pgmq;
SELECT pgmq.create('my_queue');

-- Send a message
SELECT pgmq.send('my_queue', '{"event": "signup", "user_id": 123}');

-- Receive up to 5 messages (with a 30-second visibility timeout)
SELECT * FROM pgmq.read('my_queue', 30, 5);

-- Delete after processing
SELECT pgmq.delete('my_queue', 1);  -- 1 stands in for a msg_id returned by pgmq.read</code></pre><h3 id="alternative-skip-locked-pattern-no-extension-needed">Alternative: SKIP LOCKED pattern (no extension needed)</h3><p>For simple job queues, Postgres has a built-in pattern using <code>FOR UPDATE SKIP LOCKED</code>:</p><pre><code class="language-sql">CREATE TABLE jobs (
  id SERIAL PRIMARY KEY,
  payload JSONB,
  status TEXT DEFAULT 'pending'
);

-- Worker claims a job atomically
UPDATE jobs SET status = 'processing'
WHERE id = (
  SELECT id FROM jobs WHERE status = 'pending'
  ORDER BY id              -- claim the oldest pending job first
  LIMIT 1
  FOR UPDATE SKIP LOCKED
) RETURNING *;</code></pre><h2 id="documents-replace-mongodb">Documents (Replace MongoDB)</h2><p><strong>Feature:</strong> Native <code>JSONB</code> (built into Postgres since 2014)</p><p><strong>What you're replacing:</strong> MongoDB for document storage.</p><p><strong>What you get:</strong> Schemaless document storage with GIN indexing, plus everything Postgres gives you: ACID transactions, relational <code>JOIN</code>s, and SQL. No separate database for your "document-shaped" data.</p><pre><code class="language-sql">CREATE TABLE users (
  id SERIAL PRIMARY KEY,
  data JSONB
);

-- Insert a nested document
INSERT INTO users (data) VALUES ('{
  "name": "Alice",
  "profile": {"bio": "Developer", "links": ["github.com/alice"]}
}');

-- Query nested fields
SELECT data-&gt;&gt;'name', data-&gt;'profile'-&gt;&gt;'bio'
FROM users
WHERE data-&gt;'profile'-&gt;&gt;'bio' LIKE '%Developer%';
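
-- GIN index over the whole document: a minimal sketch of the GIN indexing
-- mentioned above (the index name idx_users_data_gin is illustrative)
CREATE INDEX idx_users_data_gin ON users USING GIN (data);

-- Containment (@&gt;) queries can use the GIN index
SELECT data-&gt;&gt;'name'
FROM users
WHERE data @&gt; '{"profile": {"bio": "Developer"}}';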

-- Index specific JSON fields for fast lookups
CREATE INDEX idx_users_email ON users ((data-&gt;&gt;'email'));</code></pre><h2 id="geospatial-replace-specialized-gis">Geospatial (Replace Specialized GIS)</h2><p><strong>Extension:</strong> <a href="https://postgis.net/"><u>PostGIS</u></a> (the industry standard since 2001)</p><p><strong>What you're replacing:</strong> Nothing, really. PostGIS is what most specialized GIS tools are built on. It powers OpenStreetMap and has been in production for more than two decades.</p><pre><code class="language-sql">CREATE EXTENSION postgis;

CREATE TABLE stores (
  id SERIAL PRIMARY KEY,
  name TEXT,
  location GEOGRAPHY(POINT, 4326)
);
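
-- Spatial index so ST_DWithin can use an index instead of scanning every row
-- (a sketch; the index name idx_stores_location is illustrative)
CREATE INDEX idx_stores_location ON stores USING GIST (location);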

-- Find stores within 5km
SELECT name,
  ST_Distance(location, ST_MakePoint(-122.4, 37.78)::geography) AS meters
FROM stores
WHERE ST_DWithin(location, ST_MakePoint(-122.4, 37.78)::geography, 5000);</code></pre><h2 id="scheduled-jobs-replace-external-cron">Scheduled Jobs (Replace External Cron)</h2><p><strong>Extension:</strong> <a href="https://github.com/citusdata/pg_cron"><u><code>pg_cron</code></u></a></p><p><strong>What you're replacing:</strong> External <code>cron</code> jobs, Kubernetes CronJobs, or Lambda scheduled triggers for database maintenance tasks.</p><p><strong>What you get:</strong> Cron scheduling inside Postgres. Useful for cache cleanup, materialized view refreshes, data retention, and periodic aggregation.</p><pre><code class="language-sql">CREATE EXTENSION pg_cron;

-- Run cache cleanup every hour
SELECT cron.schedule('cleanup', '0 * * * *',
  $$DELETE FROM cache WHERE expires_at &lt; NOW()$$);

-- Refresh a materialized view every night at 2 AM
SELECT cron.schedule('rollup', '0 2 * * *',
  $$REFRESH MATERIALIZED VIEW CONCURRENTLY daily_stats$$);</code></pre><h2 id="fuzzy-search-typo-tolerance">Fuzzy Search (Typo Tolerance)</h2><p><strong>Extension:</strong> <code>pg_trgm</code> (ships with Postgres as a contrib module)</p><pre><code class="language-sql">CREATE EXTENSION pg_trgm;

CREATE INDEX idx_name_trgm ON products USING GIN (name gin_trgm_ops);

-- Finds "PostgreSQL" even when typed as "posgresql"
SELECT name FROM products
WHERE name % 'posgresql'
ORDER BY similarity(name, 'posgresql') DESC;</code></pre><h2 id="whats-next">What's Next</h2><p>If you want the architectural argument for why consolidating on Postgres matters (especially in the AI era), read <a href="https://www.tigerdata.com/blog/its-2026-just-use-postgres"><u>It's 2026, Just Use Postgres</u></a>.</p><p>All of these extensions come pre-configured on <a href="https://console.cloud.timescale.com"><u>Tiger Cloud</u></a>. Create a free database and start building.</p><p><strong>Further reading:</strong></p><ul><li><a href="https://www.tigerdata.com/docs/use-timescale/latest/extensions/pg-textsearch"><u>pg_textsearch documentation</u></a></li><li><a href="https://github.com/timescale/pgvectorscale"><u>pgvectorscale on GitHub</u></a></li><li><a href="https://www.tigerdata.com/docs/"><u>TimescaleDB documentation</u></a></li><li><a href="https://github.com/tembo-io/pgmq"><u>pgmq on GitHub</u></a></li><li><a href="https://postgis.net/"><u>PostGIS</u></a></li><li><a href="https://www.tigerdata.com/blog/from-4-databases-to-1-how-plexigrid-replaced-influxdb-got-350x-faster-queries-tiger-data"><u>How Plexigrid replaced InfluxDB and got 350x faster queries</u></a></li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Optimization vs. Architecture: Knowing the Difference]]></title>
            <description><![CDATA[Optimization problems stay fixed. Architectural ones come back. A framework for knowing which you're dealing with before you've spent months on the wrong fix.]]></description>
            <link>https://www.tigerdata.com/blog/optimization-vs-architecture-knowing-the-difference</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/optimization-vs-architecture-knowing-the-difference</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[PostgreSQL Performance]]></category>
            <dc:creator><![CDATA[Matty Stratton]]></dc:creator>
            <pubDate>Wed, 29 Apr 2026 02:21:18 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/04/thumbnail-blog.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/04/thumbnail-blog.png" alt="Optimization vs. Architecture: Knowing the Difference" /><p>There are two kinds of database performance problems.</p><p>The first kind responds to optimization. You add an index, queries speed up, the improvement holds. Done.</p><p>The second kind responds to optimization temporarily. You add an index, queries speed up, then slow down again as data grows. You partition the table, scans get faster, then the partition count becomes its own management burden. You upgrade the instance, headroom appears, then it's consumed within a quarter. You're back in the same meeting, staring at the same graphs trending in the same wrong direction.</p><p>Both feel identical in the moment. A slow query is a slow query. The difference shows up after the fix. Optimization problems stay fixed. Architectural problems come back. The difference isn't a matter of degree; it's structural. Operational problems respond to tuning. Architectural problems compound with every additional row you write.</p><p>This distinction matters most for a specific class of workload: high-volume, append-heavy data that gets queried analytically. Time-series telemetry. Financial tick data. Operational metrics. Any system where data accumulates continuously and queries shift from point lookups to scans and aggregates on live data. If that describes your system, this post gives you a framework for figuring out which kind of problem you're dealing with, before you've spent six months finding out the hard way.</p><h2 id="what-optimization-can-fix">What optimization can fix</h2><p>Optimization is genuinely powerful. Worth being clear about that before talking about its limits.</p><p>Configuration mismatches are real problems with real fixes. <code>shared_buffers</code> set too low. <code>work_mem</code> too small for hash joins. 
<code>effective_cache_size</code> not reflecting actual available memory. You fix these once, they stay fixed. These aren't minor tweaks.</p><p>Missing indexes are the same story. A query doing a sequential scan when an index scan would serve is O(n) when it could be O(log n). Add the right index, the improvement is permanent. The fix works at 1 million rows and at 100 million rows.</p><p>Inefficient queries, connection management, autovacuum tuning, PgBouncer config. All real, all fixable, all improvements that don't interact with data volume in ways that eventually undo your work.</p><p>The common thread: these problems don't compound with scale. A missing index is a missing index. You find it, you fix it, you move on.</p><h2 id="why-certain-workloads-hit-a-wall">Why certain workloads hit a wall</h2><p>Here's where it gets tricky.</p><p>Some problems look like optimization problems on the surface. You apply the standard tools. Performance improves. You close the ticket. Three months later, you're back.</p><p>That pattern is the signal.</p><p>The underlying issue isn't misconfiguration. It's a mismatch between what your architecture provides and what your workload actually requires. For high-volume, append-heavy, analytically-queried data, that mismatch runs three layers deep, and each layer is worse than the last.</p><p><strong>Layer one: you're reading data you'll never use.</strong> When your dominant query pattern is analytical scans and aggregations across time ranges, and your storage model packs all columns together in rows, no index strategy resolves that. Row storage reads the full row on every access. For transactional workloads, fine. For a query that needs to scan 50 million rows and pull two columns from each one, you're reading every other column on every single row, all the way through. That's not a configuration problem. 
That's a physics problem.</p><p><strong>Layer two: you're paying for features you never use.</strong> MVCC exists so concurrent reads and writes stay correct. Valuable for transactional data. But if you're inserting 50,000 rows per second of data that will never be updated, you're paying the full cost of that concurrency model on every insert. Each row carries 23 bytes of <a href="https://www.tigerdata.com/blog/mvcc-feature-youre-paying-for-but-not-using"><u>MVCC transaction metadata</u></a>. At 50K inserts per second, that's overhead on 4.3 billion rows per day that will never be modified. Autovacuum still runs constantly, scanning for dead tuples that updates never created and freezing rows that will never change. No configuration setting removes that structural cost. You're paying for correctness guarantees on data that will never be written again.</p><p><strong>Layer three: the maintenance never stops growing.</strong> <a href="https://www.tigerdata.com/blog/preventing-silent-spiral-table-bloat"><u>Autovacuum</u></a>, ANALYZE, background statistics collection. As data volume grows, the time these tasks take grows with it. Tuning parameters adjusts priority, not necessity. The work still has to happen, and it will compete with your production workload in ways that compound indefinitely. The data keeps accumulating, and the maintenance scales with it.</p><p>Operational problems respond to tuning. Architectural problems compound with every additional row you write.</p><h2 id="the-recurrence-test">The recurrence test</h2><p>The question is whether you've already crossed that line. Here's how to tell.</p><p>Apply the standard fix for whatever symptom you're seeing. Better index. Config change. Query rewrite. Partitioning. Then watch what happens.</p><p>A fix that holds for a week is probably an operational problem. A fix that holds for a month could be either; watch for regression. 
A fix that holds for a quarter, and then the same metrics start climbing again: that's the architectural signal.</p><p>But the most reliable test isn't the persistence of any individual fix. It's the pattern across fixes. Over the last 12 months:</p><ul><li>Are you solving different problems each time, or variations of the same problem?</li><li>Are the fixes holding, or buying progressively less time?</li><li>Is the interval between optimization cycles getting shorter?</li></ul><p>Diverse symptoms, targeted fixes, improvements that hold = healthy optimization. You're in the right architecture.</p><p>Same symptoms recurring, each fix buying less runway than the last, the cycle accelerating = you've crossed the architectural boundary. The optimization is working. It's just fighting a battle it can't win.</p><p>If you answered yes to all three of those last questions, you already know the answer. The recurrence test just gives you the language to say it out loud in a meeting.</p><h2 id="the-cost-of-getting-this-wrong">The cost of getting this wrong</h2><p>Both failure modes are expensive. Worth saying plainly, because the bias in engineering is almost always toward optimization over migration.</p><p><strong>Treating an optimization problem as architectural:</strong> You migrate when tuning would have been sufficient. The cost is a disruptive migration project (two to eight weeks is realistic), team bandwidth redirected from product work to infrastructure, and the risk of introducing new complexity into a system that didn't need it. This mistake is less common because the default is to keep optimizing. Teams usually get pushed toward migration by external pressure, not internal conviction.</p><p><strong>Treating an architectural problem as optimization:</strong> This one compounds. You keep tuning when the architecture is the constraint. 
The cost is cumulative engineering time (teams working through this pattern typically lose months per year to ongoing optimization cycles), a growing operational burden, and a deferred migration that gets more expensive the longer you wait. At 10 million rows, migration takes days. At 500 million, weeks. At a billion plus, you're looking at months.</p><p>Sigh.</p><p>Every quarter you spend on optimizations that aren't sticking, you're paying twice: once for the engineering time, and once in the form of increased migration cost when you eventually get there. The diagnostic matters because catching it early changes the math significantly.</p><h2 id="what-the-right-architecture-actually-changes">What the right architecture actually changes</h2><p>If you ran the recurrence test and landed where the evidence points, here's what changes.</p><p>If your workload is high-volume, append-heavy, and primarily analytical, the architectural answer doesn't require leaving Postgres. It requires extending Postgres with primitives designed for that specific workload pattern.</p><p>TimescaleDB addresses each of the three failure modes directly.</p><p><strong>Layer one (reading data you'll never use): </strong><a href="https://www.tigerdata.com/blog/hypercore-a-hybrid-row-storage-engine-for-real-time-analytics"><strong><u>Hypercore</u></strong></a><strong>.</strong> TimescaleDB's hybrid row/columnar storage engine keeps recent data in row format for writes and automatically converts older data to columnar format for analytical scans. A query that needs to scan millions of rows and extract two columns reads only those columns, from compressed columnar chunks. You stop paying row-storage costs on data you'll only scan.</p><p><strong>Layer two (paying for features you never use): Hypertables and chunk exclusion.</strong> Hypertables partition your data automatically by time. Chunk exclusion lets the query planner skip entire time partitions without scanning them. 
Query latency stays bounded as data grows, not because you've tuned something but because the planner knows which chunks are irrelevant. Compressed chunks also dramatically reduce autovacuum load: there's nothing to vacuum in a chunk where data is already frozen and compressed.</p><p><strong>Analytical performance on top of all of it: </strong><a href="https://www.tigerdata.com/blog/how-we-made-real-time-data-aggregation-in-postgres-faster-by-50-000"><strong><u>Continuous aggregates</u></strong></a><strong>.</strong> Incremental materialized views that refresh in the background, updating only what changed since the last refresh. Dashboards and aggregations stay fast without batch jobs or stale results. Precomputed rollups combine with raw recent data at query time, so results are current without the full scan cost.</p><p>The optimization work that was recurring becomes unnecessary because the architecture handles it at the right layer. You're not doing the same work in a faster system. You're doing less work in a system built for the load.</p><p>And it's still Postgres. Same SQL. Same extensions. Same tooling your team already knows. What changes is what happens underneath. That's the only thing that needed to change.</p><h2 id="the-decision">The decision</h2><p>Optimization and architecture solve different categories of problems. The skill is knowing which category you're in before you've invested months in the wrong one.</p><p>If your optimization work is targeted, diverse, and the fixes hold, you're in the right architecture. Keep going. The standard playbook works.</p><p>If the recurrence test comes back positive (same symptoms, shrinking runway, accelerating cycles), the architecture is the constraint. One more optimization buys a quarter. 
The right architectural change buys years.</p><p>That gap is the decision.</p><p>For the full diagnostic on what this workload pattern looks like in a real system, <a href="https://www.tigerdata.com/blog/six-signs-postgres-tuning-wont-fix-performance-problems"><u>Six Signs That Postgres Tuning Won't Fix Your Performance Problems</u></a> walks through six specific characteristics with concrete examples. For the deeper mechanical explanation of why vanilla Postgres hits these limits at the architecture level, <a href="https://www.tigerdata.com/blog/postgres-optimization-treadmill"><u>Understanding Postgres Performance Limits for Analytics on Live Data</u></a> covers that ground. Start with whichever question is more pressing.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Balance Relational Purity and Speed in High Frequency Systems]]></title>
            <description><![CDATA[Normalized schemas create latency at scale. This guide shows when to flatten your tables and use columnar compression to cut join overhead and reclaim query speed.]]></description>
            <link>https://www.tigerdata.com/blog/balance-relational-purity-speed-high-frequency-systems</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/balance-relational-purity-speed-high-frequency-systems</guid>
            <category><![CDATA[Database]]></category>
            <category><![CDATA[Compression]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[NanoHertz Communications]]></dc:creator>
            <pubDate>Fri, 24 Apr 2026 12:41:52 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/04/balance-relational-purity-speed-in-high-frequency-systems.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/04/balance-relational-purity-speed-in-high-frequency-systems.png" alt="Balance Relational Purity and Speed in High Frequency Systems" /><p>Standard <a href="https://www.tigerdata.com/learn/how-to-use-postgresql-for-data-normalization"><u>relational normalization</u></a> is the bedrock of database design. It prevents data duplication and maintains integrity by splitting information into specialized tables. However, as your tables approach the 500 million row mark, the very joins that keep your data clean begin to degrade performance. You may notice <a href="https://www.tigerdata.com/blog/postgres-optimization-treadmill"><u>p95 latencies</u></a> creeping up even though your queries remain logically sound. This guide explains how to identify when relational purity is holding you back and how to use data flattening and columnar compression to reclaim your speed.</p><h2 id="what-you-will-learn">What You Will Learn</h2><p>This guide breaks down the mechanics of read amplification and join overhead in high-volume systems. You will learn:</p><ul><li>How normalized schemas create latency through expensive nested loop joins.</li><li>When to flatten data into wide tables to reduce query-time computation.</li><li>The way columnar compression turns denormalized tables into high-performance assets.</li><li>Techniques to reduce read amplification by minimizing the data your database must touch.</li></ul><h2 id="why-it-matters">Why It Matters</h2><p>Standard normalization relies on joins to reconstruct data at query time. On small datasets, this is efficient. On tables with billions of rows, every join adds a layer of I/O and CPU overhead that compounds as your data volume grows. 
Row-based storage forces the database to read every column in a row, even if your query only needs two, leading to massive read amplification.</p><p>Flattening your schema, moving frequently joined metadata directly into your main ingestion table, removes the need for these expensive joins. While this usually increases storage size in a row-based system, specialized <a href="https://www.tigerdata.com/blog/building-columnar-compression-in-a-row-oriented-database"><u>columnar storage</u></a> allows for aggressive compression of this redundant data. By choosing the right schema architecture, you can significantly boost the responsiveness of your real-time dashboards and analytics.</p><h2 id="audit-your-p95-latency">Audit Your p95 Latency</h2><p>Start by identifying the queries that drive your latency spikes. Use the <a href="https://www.tigerdata.com/blog/using-pg-stat-statements-to-optimize-queries"><u>pg_stat_statements extension</u></a> to find queries with high total execution time. Run EXPLAIN (ANALYZE, BUFFERS) on these queries to see how the database interacts with the storage layer.</p><p>Look for "<a href="https://www.tigerdata.com/learn/strategies-for-improving-postgres-join-performance"><u>Nested Loop</u></a>" or "Hash Join" operations where the "Shared Hit Blocks" are high. If the database spends 80% of its time matching keys between a 500M row metrics table and a 10k row device table, your architecture is likely CPU-bound by join coordination. A high number of "read" buffers in the output indicates that the database is hunting through indexes and table pages to find metadata that isn't stored locally with the record.</p><p>When you see a query plan where the join cost increases linearly with the size of the primary table, you have identified a "join tax" that cannot be solved with more memory or faster disks. 
At this stage, the overhead of managing the relationship between tables is outweighing the benefits of relational purity.</p><h2 id="migrate-high-access-metadata">Migrate High-Access Metadata</h2><p>Identify the metadata columns that appear most often in your WHERE and JOIN clauses.</p><ul><li>Good Candidates: Low-cardinality strings like region, site_id, or device_type. These compress effectively in a columnar format because the engine only needs to store a single value and a list of row counts.</li><li>Poor Candidates: High-cardinality unique identifiers like session_uuid or frequently updated timestamps. These values vary for almost every row, which prevents the compression engine from reducing the storage footprint and can lead to table bloat.</li></ul><p>Then, add the new denormalized column to your high-frequency ingestion table:</p><pre><code class="language-SQL">ALTER TABLE device_metrics ADD COLUMN region TEXT;</code></pre><p>Next, backfill the existing rows. On tables with hundreds of millions of rows, a single UPDATE statement can cause massive transaction log bloat and table locks. Instead, backfill in batches or use a join-based update to pull metadata from your reference table:</p><pre><code class="language-SQL">UPDATE device_metrics m
SET region = d.region
FROM devices d
WHERE m.device_id = d.id
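  -- Sketch for batching (the date bounds are illustrative, and ts is assumed
  -- to be the table's time column): uncomment the next line to restrict each
  -- run to one time slice, then repeat with the next slice.
  -- AND m.ts &gt;= '2026-04-01' AND m.ts &lt; '2026-04-02'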
  AND m.region IS NULL;</code></pre><p>This migration eliminates the join at query time, allowing the database to filter and aggregate in a single pass. By localizing the data, you ensure the database engine no longer needs to load secondary table pages into the buffer pool just to check a filter condition.</p><h2 id="configure-columnar-compression">Configure Columnar Compression</h2><p>After flattening your table, you must enable columnar storage to handle the redundant metadata. In Tiger Data, this is achieved through a hybrid storage engine that partitions data into "chunks" and then compresses those chunks into a columnar format. This storage engine groups data by column rather than row, allowing the database to ignore unused columns during a scan and reducing the physical I/O required for every query.</p><p>To implement this, you first define your compression settings and then add a policy. You must choose a segmentby column, typically a device ID or another column you filter on often (avoid unique keys, which leave too few rows per segment to compress well), and an orderby column, usually a timestamp. The segmentby column is the most important for performance; it determines how data is grouped within the compressed segments.</p><pre><code class="language-SQL">-- 1. Enable compression on the hypertable
ALTER TABLE device_metrics SET (
  timescaledb.compress = true,
  timescaledb.compress_orderby = 'ts DESC',
  timescaledb.compress_segmentby = 'device_id, region'
);

-- 2. Add a policy to compress data older than 7 days
SELECT add_compression_policy('device_metrics', INTERVAL '7 days');
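
-- 3. Optional check (a sketch): confirm the policy job was registered.
-- Assumes the timescaledb_information.jobs view and the 'policy_compression'
-- procedure name; both may differ across TimescaleDB versions.
SELECT job_id, schedule_interval, config
FROM timescaledb_information.jobs
WHERE hypertable_name = 'device_metrics'
  AND proc_name = 'policy_compression';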
</code></pre><p>When the compression policy runs, the database transforms the row-based data into compressed columnar batches. For a flattened table, this is where the performance "magic" happens. If you have 100 million rows where the region is 'north_east', a row-based engine stores that string 100 million times. The columnar engine stores it once along with a metadata bitmask, reducing the storage footprint of that column by up to 99%.</p><p>You can verify the effectiveness of your compression and the reduction in <a href="https://www.tigerdata.com/blog/hidden-performance-cost-wildcard-queries"><u>read amplification</u></a> by querying the compression statistics:</p><pre><code class="language-SQL">SELECT 
    total_chunks,
    number_compressed_chunks,
    pg_size_pretty(before_compression_total_bytes) AS before_size,
    pg_size_pretty(after_compression_total_bytes) AS after_size
FROM hypertable_columnstore_stats('device_metrics');
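
-- Sketch: derive an overall compression ratio from the same statistics.
-- NULLIF avoids a division by zero; the result is NULL until chunks are compressed.
SELECT
    round(before_compression_total_bytes::numeric
          / NULLIF(after_compression_total_bytes, 0), 1) AS compression_ratio
FROM hypertable_columnstore_stats('device_metrics');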
</code></pre><p>This architectural shift moves your database from being I/O bound to being CPU efficient. By reducing the "Shared Hit Blocks" (the number of 8KB pages the database must load into memory), you free up the buffer pool for other critical operations and effectively raise the performance ceiling of your entire system.</p><h2 id="measuring-the-join-tax">Measuring the Join Tax</h2><p>You can measure the performance gap between a normalized join and a flattened table by comparing their execution costs. Consider a system tracking <a href="https://www.tigerdata.com/blog/timescaledb-manufacturing-iot-building-data-pipeline"><u>millions of IoT sensors</u></a> that needs to find the average reading for a specific region.</p><h3 id="the-normalized-approach-slow">The Normalized Approach (Slow)</h3><p>This query must join two tables, forcing the database to match foreign keys for every row in the time range.</p><pre><code class="language-SQL">SELECT avg(m.value), d.region
FROM device_metrics m
JOIN devices d ON m.device_id = d.id
WHERE m.ts &gt; now() - interval '1 hour'
  AND d.region = 'north_east'
GROUP BY d.region;
</code></pre><h3 id="%E2%80%8Bthe-flattened-and-compressed-approach-fast">The Flattened and Compressed Approach (Fast)</h3><p>By moving the region column into the device_metrics table and using columnar compression, the join disappears. The database reads only the columns the query references (value, region, and ts) from disk, ignoring the rest of the row.</p><pre><code class="language-SQL">SELECT avg(value), region
FROM device_metrics
WHERE ts &gt; now() - interval '1 hour'
  AND region = 'north_east'
GROUP BY region;
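
-- To measure the gap, run both versions under EXPLAIN (ANALYZE, BUFFERS) and
-- compare the "Buffers: shared hit/read" counts and the execution time.
EXPLAIN (ANALYZE, BUFFERS)
SELECT avg(value), region
FROM device_metrics
WHERE ts &gt; now() - interval '1 hour'
  AND region = 'north_east'
GROUP BY region;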
</code></pre><h2 id="next-step">Next Step</h2><p>Identify your most expensive join query using pg_stat_statements. Create a denormalized version of that table that includes the necessary metadata, and test query performance against a compressed columnar version. This test will help you determine if your system is hitting a structural ceiling that only an architectural change can fix.</p><p>To try this on a fully managed instance, <a href="https://console.cloud.tigerdata.com/signup" rel="noreferrer"><u>start a free Tiger Cloud trial</u></a>. You can compare and contrast query performance on real data without impacting your production database.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Yes, You Can Do Hybrid Search in Postgres (And You Probably Should)]]></title>
            <description><![CDATA[Most search stacks run four systems to answer one question. You don't need any of them. Build production hybrid search in Postgres with pg_textsearch for BM25, pgvectorscale for vector similarity, and Reciprocal Rank Fusion to combine them. One query. One database.]]></description>
            <link>https://www.tigerdata.com/blog/hybrid-search-postgres-you-probably-should</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/hybrid-search-postgres-you-probably-should</guid>
            <category><![CDATA[pg_textsearch]]></category>
            <category><![CDATA[Cloud]]></category>
            <category><![CDATA[Hypertables]]></category>
            <category><![CDATA[Hyperfunctions]]></category>
            <category><![CDATA[pgvector]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Scaling PostgreSQL]]></category>
            <category><![CDATA[SQL]]></category>
            <category><![CDATA[Thought Leadership]]></category>
            <category><![CDATA[Tiger Cloud]]></category>
            <category><![CDATA[Tiger Data]]></category>
            <category><![CDATA[Timescale Cloud]]></category>
            <category><![CDATA[Timescale Community]]></category>
            <category><![CDATA[Tutorials]]></category>
            <dc:creator><![CDATA[Erin Mikail Staples]]></dc:creator>
            <pubDate>Mon, 20 Apr 2026 18:30:28 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/04/2026-04-Hybrid-Search-In-Postgres.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/04/2026-04-Hybrid-Search-In-Postgres.png" alt="Orange magnifying glass on a graph-paper background beside three yellow sparkle shapes, illustrating smarter search." /><p>Picture this: you're searching through your company's internal knowledge base. You <em>know</em> the answer is in there. You typed in what you thought were the right keywords. And yet, the top result is somebody's 2019 onboarding doc about the coffee machine.</p><p>Sound familiar?</p><p>I've watched this scene play out in every company I've worked at, across every tool that claimed to have "search." The problem isn't that search itself is broken. It's that we've been asking it to do two completely different jobs with one rigid toolset, and calling it solved.</p><p>Here's what most people miss: <strong>search doesn't have a modality problem. It has an architecture problem.</strong> And the fix isn't a bigger stack. It's a smarter query, running in the database you already trust.</p><h2 id="the-two-halves-of-finding-stuff">The two halves of "finding stuff"</h2><p>We use "search" as though it's one thing. It isn't.</p><p>When someone types a query into a search bar, they're usually doing one of two things:</p><ol><li><strong>Looking for a specific word or phrase they remember.</strong> "Find me that doc with 'SOC 2 audit' in it." This is <strong>keyword search</strong>. It rewards precision and specificity.</li><li><strong>Trying to describe a concept, not quote it.</strong> "Find me docs about how we handle customer data during compliance reviews." This is <strong>semantic search</strong>. It rewards understanding.</li></ol><p>Traditional full-text search, powered by algorithms like <strong>BM25</strong>, is excellent at the first job. It's stemmed, ranked, battle-tested, and has been the backbone of search engines for decades. 
It finds "imposter syndrome" when you type "imposter."</p><p><strong>Vector search</strong>, the newer kid on the block, is excellent at the second job. It embeds your text into a high-dimensional space where <em>meaning</em> lives closer together than words do. It finds "imposter syndrome" when you type "feeling like a fraud at work," even though literally none of the words overlap.</p><p>Now here's the punchline: <strong>neither of these systems is wrong, and neither of them is enough.</strong></p><p>If you only use BM25, you miss every user who doesn't know the exact vocabulary your docs use. If you only use vector search, you miss the person who typed a part number, a legal citation, or "episode 100." Vector search has no particular opinion about the number 100.</p><p>The answer is not to pick a side. The answer is to run both and combine the results. That's hybrid search, and it turns out you can do all of it in one database, in one query, against data you're already storing there.</p><h2 id="why-this-matters-now">Why this matters now</h2><p>Two years ago, hybrid search in Postgres was a demo. Now it's a production architecture. Here's what changed.</p><p><strong>BM25 actually runs fast in Postgres now.</strong> <code>pg_textsearch</code> 1.0 built the full search engine in C, on top of Postgres's storage layer. Not a wrapper around an external library. Not ts_rank with extra steps. Block-Max WAND optimization, native. For a lot of workloads, <a href="https://www.tigerdata.com/blog/you-dont-need-elasticsearch-bm25-is-now-in-postgres"><u>you don't need Elasticsearch anymore</u></a>.</p><p><strong>Vector search in Postgres stopped being a party trick.</strong> <code>pgvectorscale</code>'s DiskANN-inspired index <a href="https://www.tigerdata.com/blog/pgvector-is-now-as-fast-as-pinecone-at-75-less-cost"><u>outperforms specialized vector databases on latency</u></a> at 75% less cost. The index lives on disk, so you're not RAM-capped. 
Filtered search actually works under production RAG conditions, which is historically where specialized vector databases fall apart.</p><p><strong>Every team I talk to is consolidating.</strong> LLM apps don't need five databases. They need one system that can store source content, embeddings, chat history, user context, and audit trails, all in the same transactional boundary. The teams winning on infrastructure right now are the ones who realized <a href="https://www.tigerdata.com/blog/postgres-for-agents"><u>Postgres is that system</u></a> and stopped building around it. (My colleagues have been saying this loudly for a while: <a href="https://www.tigerdata.com/blog/its-2026-just-use-postgres"><u>it's 2026, just use Postgres</u></a>.)</p><p>That third one is the one I keep coming back to. Five databases to answer one question is an architecture that made sense once and then quietly stopped making sense. Most teams are a quarter or two behind on noticing.</p><p>Put those three together and hybrid search in Postgres went from "neat prototype" to a real production architecture. That's the shift this post is about.</p><h2 id="why-postgres-is-where-this-belongs">Why Postgres is where this belongs</h2><p>I've built the multi-system search stack. Elastic for BM25, Pinecone for vectors, a custom merge service, Redis for caching, and the ritual of explaining the whole thing to every new hire who asks "but why can't we just query the database?"</p><p>I've also cussed out that architecture at 2am when the sync job fell behind and results went stale and nobody could figure out which system was wrong.</p><p>Every piece of that stack is something you provision, secure, back up, upgrade, monitor, and carry on-call. Every hop between systems is a place where results drift, eventual consistency bites you, and your latency budget gets eaten. 
Every six months, one of those systems has a breaking change, an incompatible update, or a pricing restructure, and the whole Jenga tower needs attention.</p><p>The pragmatic answer: <strong>your data is almost certainly already in Postgres.</strong> Your users, your content, your metadata, your access controls, your audit logs. So why ship all of it to three other systems just to search it?</p><p>You don't have to. Postgres now has what you need:</p><ul><li><a href="https://www.tigerdata.com/blog/introducing-pg_textsearch-true-bm25-ranking-hybrid-retrieval-postgres"><strong><u>pg_textsearch</u></strong></a> for BM25 keyword search — proper ranked retrieval, not the old ts_rank approximation</li><li><a href="https://www.tigerdata.com/blog/how-we-made-postgresql-the-best-vector-database"><strong><u>pgvectorscale</u></strong></a> for vector similarity search with a <a href="https://www.tigerdata.com/blog/understanding-diskann"><u>StreamingDiskANN index</u></a> that doesn't require your entire index to fit in RAM</li><li><strong>Everything else you already know</strong>: joins, filters, transactions, row-level security, time-based partitioning, and the SQL your whole team can read</li></ul><p>Put those together and you get a database that can rank by keyword, rank by meaning, filter by any column in your schema, respect your access rules, and return a single fused result set. In one query. Against one system.</p><p>That's not a compromise. That's a better architecture.</p><h2 id="how-hybrid-search-actually-works-the-short-version">How hybrid search actually works (the short version)</h2><p>The full step-by-step is in the <a href="https://www.tigerdata.com/docs/build/examples/hybrid-search"><u>Tiger Data docs</u></a>, including a walkthrough using podcast transcripts from <a href="https://www.relay.fm/conduit"><u>Conduit</u></a>: 12 episodes, real embeddings, about 30 minutes. 
Swap "podcast episodes" for "support tickets" and the pattern is identical.</p><p>Strip it down to the bones: one table, two indexes, two queries, a little math, one answer.</p><ol><li><strong>Store your content and its embedding side by side.</strong> A single episodes table with a description column <em>and</em> a <a href="https://www.tigerdata.com/learn/postgresql-extensions-pgvector"><u>vector(1536)</u></a> column. No sync jobs. No reconciliation.</li><li><strong>Index both.</strong> A BM25 index on the text column. A StreamingDiskANN index on the embedding column.</li><li><strong>Run both searches.</strong> BM25 returns a ranked list of keyword matches. Vector search returns a ranked list of semantic matches. Each produces, say, the top 20 candidates.</li><li><strong>Fuse the rankings with RRF.</strong> Reciprocal Rank Fusion scores each result by 1 / (k + rank) across both lists and sums the scores. An item that's #1 in both lists crushes an item that's only in one. An item that's #15 in one and missing from the other still gets a say. (The math here is simpler than it sounds; you're basically rewarding documents that placed well in both races.)</li><li><strong>Return the fused list.</strong> One query. One result set. Ranked by a score that rewards showing up in both.</li></ol><p>The SQL for step 4:</p><pre><code class="language-SQL">WITH bm25_results AS (
&nbsp;&nbsp;SELECT id, ROW_NUMBER() OVER (
&nbsp;&nbsp;&nbsp;&nbsp;ORDER BY description &lt;@&gt; 'mental health boundaries'
&nbsp;&nbsp;) AS rank
&nbsp;&nbsp;FROM episodes
&nbsp;&nbsp;ORDER BY description &lt;@&gt; 'mental health boundaries'
&nbsp;&nbsp;LIMIT 20
),
vector_results AS (
&nbsp;&nbsp;SELECT id, ROW_NUMBER() OVER (
&nbsp;&nbsp;&nbsp;&nbsp;ORDER BY embedding &lt;=&gt; $1
&nbsp;&nbsp;) AS rank
&nbsp;&nbsp;FROM episodes
&nbsp;&nbsp;ORDER BY embedding &lt;=&gt; $1
&nbsp;&nbsp;LIMIT 20
)
SELECT
&nbsp;&nbsp;d.id, d.title,
&nbsp;&nbsp;COALESCE(1.0 / (60 + b.rank), 0)
&nbsp;&nbsp;&nbsp;&nbsp;+ COALESCE(1.0 / (60 + v.rank), 0) AS rrf_score
FROM episodes d
LEFT JOIN bm25_results b ON d.id = b.id
LEFT JOIN vector_results v ON d.id = v.id
WHERE b.id IS NOT NULL OR v.id IS NOT NULL
ORDER BY rrf_score DESC
LIMIT 10;
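
-- Worked RRF arithmetic: an illustrative sketch, not part of the tutorial.
-- Assumes k = 60 as in the query above; the ids and ranks are made up.
SELECT id,
&nbsp;&nbsp;COALESCE(1.0 / (60 + bm25_rank), 0)
&nbsp;&nbsp;&nbsp;&nbsp;+ COALESCE(1.0 / (60 + vector_rank), 0) AS rrf_score
FROM (VALUES
&nbsp;&nbsp;('first_in_both', 1, 1),          -- 1/61 + 1/61, about 0.0328
&nbsp;&nbsp;('first_in_one', 1, NULL),        -- 1/61, about 0.0164
&nbsp;&nbsp;('fifteenth_in_one', 15, NULL)    -- 1/75, about 0.0133
) AS t(id, bm25_rank, vector_rank)
ORDER BY rrf_score DESC;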
</code></pre><p>That's hybrid search. No sidecar service, no merge layer, no second database, no "enterprise search platform" subscription. Your finance team can thank you later.</p><h2 id="who-should-be-building-this-right-now">Who should be building this right now</h2><p>The teams I'd push hardest toward this pattern are the ones whose search is quietly embarrassing them without anyone saying it out loud.</p><ul><li><strong>Internal knowledge and support teams.</strong> Your agents need to find "ticket #48291" (a keyword query) <em>and</em> "something similar to this customer issue" (a semantic query) in the same tool. Building two separate search surfaces for the same data is a solvable problem, and you're paying for that complexity every sprint.</li><li><strong>RAG pipelines.</strong> If your retrieval layer is only doing vector search, you're leaving quality on the table. <a href="https://www.tigerdata.com/blog/building-a-rag-system-with-claude-postgresql-python-on-aws"><u>Hybrid retrieval consistently outperforms pure semantic retrieval</u></a>, especially for queries that include specific identifiers, product names, or technical terms the embedding model treats as just more tokens. Worth fixing before you spend another month tuning your prompts.</li><li><strong>Product catalogs and discovery.</strong> A user searching "red running shoes under $80 that won't fall apart" is giving you structured filters, keyword cues, <em>and</em> a vibe. BM25 handles the keyword cues, vectors handle the vibe, and a SQL WHERE clause handles the filter. All three, one query.</li></ul><h2 id="a-few-honest-caveats">A few honest caveats</h2><p><code>pg_textsearch</code> 1.0 doesn't support native phrase queries... yet. There's an over-fetch plus ILIKE workaround in the tutorial, and AND/OR/NOT operators are on the roadmap. BM25 indexes are single-column by default; you use a generated column to search across title + body. 
And no, a 100-million-row table isn't going to be free to embed. Cost and batching are still real things you have to think through.</p><p>But those are tuning problems, not architectural ones. A tuning problem inside one database is a much better problem than an integration problem across four. (If you'd rather not hand-roll the embedding pipeline, <a href="https://www.tigerdata.com/blog/pgai-vectorizer-now-works-with-any-postgres-database"><u>pgai Vectorizer</u></a> automates most of that cost-and-batching work.)</p><h2 id="try-it-yourself">Try it yourself</h2><p>The step-by-step tutorial walks through the whole thing: <a href="https://www.tigerdata.com/docs/build/examples/hybrid-search"><u>Build hybrid search with BM25 and vector similarity</u></a>. Real podcast transcripts, about 30 minutes end to end, works on <a href="https://console.cloud.timescale.com"><u>Tiger Cloud</u></a>, Docker, or a local Postgres install.</p><p>If you want to go straight to code, <a href="https://github.com/timescale/cookbook-search/tree/main/Hybrid-search"><u>the companion repo is here</u></a>. Clone it, point it at a Postgres instance with <code>pg_textsearch</code> and <code>pgvectorscale</code> installed, and you have a working hybrid search system one <code>psql -f setup.sql</code> away. 
And if you want a deeper implementation walkthrough — including pgai auto-sync so embeddings stay current without a separate pipeline — <a href="https://www.tigerdata.com/blog/elasticsearchs-hybrid-search-now-in-postgres-bm25-vector-rrf"><u>this post covers that pattern end to end</u></a>.</p><p>If you want the deeper "why," these are the posts I'd read next:</p><ul><li><a href="https://www.tigerdata.com/blog/introducing-pg_textsearch-true-bm25-ranking-hybrid-retrieval-postgres"><u>From ts_rank to BM25: introducing pg_textsearch</u></a></li><li><a href="https://www.tigerdata.com/blog/pg-textsearch-bm25-full-text-search-postgres"><u>pg_textsearch 1.0: how we built a BM25 search engine on Postgres</u></a></li><li><a href="https://www.tigerdata.com/blog/hybrid-search-timescaledb-vector-keyword-temporal-filtering"><u>Hybrid search with TimescaleDB: vector, keyword, and temporal filtering</u></a></li><li><a href="https://www.tigerdata.com/blog/combining-semantic-search-and-full-text-search-in-postgresql-with-cohere-pgvector-and-pgai"><u>Combining semantic search and full-text search in Postgres with Cohere, pgvector, and pgai</u></a></li><li><a href="https://www.tigerdata.com/blog/understanding-diskann"><u>Understanding DiskANN</u></a></li></ul><h2 id="the-takeaway">The takeaway</h2><p>Search is hard. It's hard because humans are imprecise, language is slippery, and the things we want are rarely the things we type. No computer is reading our minds yet (and honestly, I'm not sure I want one to). One retrieval method will never be enough.</p><p>The solution isn't a bigger stack. <strong>It's a smarter query.</strong></p><p>Postgres, with <code>pg_textsearch</code> and <code>pgvectorscale</code>, lets you run keyword search and semantic search against the same table, fuse them with a few lines of SQL, and ship one system instead of four. That's not just operationally cheaper. 
It's a better developer experience, a better user experience, and a better night of sleep for whoever's on call (you can thank me later).</p><p>If your search is frustrating your users (or worse, quietly frustrating them without anyone saying anything), this is the version I'd build next.</p><p>Have you built hybrid search on Postgres? What interesting queries are you running? I want to hear about it.</p><p>Find me at <a href="mailto:erin@tigerdata.com"><u>erin@tigerdata.com</u></a>, or drop a note on <a href="https://www.linkedin.com/in/erinmikail/"><u>LinkedIn</u></a> or <a href="https://bsky.app/profile/erinmikail.bsky.social"><u>Bluesky</u></a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The Best Time to Migrate Was at 10M Rows. The Second Best Time Is Now.]]></title>
            <description><![CDATA[Migration cost scales with data volume. The optimization tax you pay while waiting scales faster.]]></description>
            <link>https://www.tigerdata.com/blog/when-to-migrate-postgres-to-timescaledb</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/when-to-migrate-postgres-to-timescaledb</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[PostgreSQL Performance]]></category>
            <dc:creator><![CDATA[Matty Stratton]]></dc:creator>
            <pubDate>Wed, 08 Apr 2026 17:45:12 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/04/Best-time-to-migrate-Blog.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/04/Best-time-to-migrate-Blog.png" alt="The Best Time to Migrate Was at 10M Rows. The Second Best Time Is Now." /><p>There's a pattern that plays out across almost every team running high-volume append workloads on vanilla Postgres. I've watched it happen enough times that I can practically set a timer.</p><p>At 10M rows, everything is fine. Queries are fast. The team is shipping features. Nobody is thinking about the database.</p><p>At 50M, queries start getting slow. Someone opens a Jira ticket about dashboard latency. The fix is usually an index or two. Takes an afternoon.</p><p>At 100M, someone proposes partitioning. Or read replicas. Or bumping the instance size. These are reasonable ideas. They work for a while.</p><p>At 500M, the team is spending one to two days per sprint on database performance work. Not building product. Tuning <code>autovacuum_vacuum_cost_delay</code> and rewriting queries and having meetings about whether to re-partition.</p><p>At every stage, migration to a purpose-built solution feels like it can wait. The current optimizations are working. The pain is manageable. Next quarter has fewer deadlines. (Next quarter never has fewer deadlines.)</p><p>Then the table hits a billion rows, and migration is now a project, not a task. What would have been a weekend of <code>create_hypertable()</code> and a data backfill is now a phased migration plan with rollback strategies and a project manager.</p><p>This post makes the case for migrating earlier than feels necessary. Not because your current setup is broken, but because the cost of waiting (what I'll call the optimization tax) compounds in ways that aren't visible until you're deep in them.</p><h2 id="the-migration-cost-curve">The migration cost curve</h2><p>Let's put real numbers on this. 
The technical steps of a migration don't change much as data grows. What changes is how long they take and how many people you need in the room.</p><p><strong>At 10M rows:</strong> Migration is essentially <code>CREATE TABLE</code>, convert to hypertable, backfill data. A single engineer, a weekend. Downtime measured in minutes if you use a blue-green approach. Risk is near zero because you can run both instances in parallel and compare results before cutting over.</p><pre><code class="language-sql">-- This is the scary migration at 10M rows
CREATE TABLE sensor_data_new (LIKE sensor_data_old INCLUDING DEFAULTS);
SELECT create_hypertable('sensor_data_new', 'time');

INSERT INTO sensor_data_new SELECT * FROM sensor_data_old;
-- Go get coffee. You'll be done before it's cool enough to drink.
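
-- A hedged sketch of the "compare before cutting over" step (not in the
-- original walkthrough): the two counts should match once the backfill is done.
SELECT
  (SELECT count(*) FROM sensor_data_old) AS old_rows,
  (SELECT count(*) FROM sensor_data_new) AS new_rows;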
</code></pre><p><strong>At 100M rows:</strong> Backfill takes hours, not minutes. You need to plan for write continuity during migration. Indexes need rebuilding. Compression policies need configuring. Testing is more involved because edge cases in the data are now visible. One engineer, one to two weeks.</p><p><strong>At 500M+ rows:</strong> Full migration project. Data transfer takes days. You need a parallel write strategy (dual-write or CDC). Query compatibility testing across the application layer. A rollback plan. Stakeholder communication. Two to four engineers, four to eight weeks.</p><p>The work itself doesn't get harder. There's just more of it, and the blast radius of a mistake gets larger. The migration at 10M rows and the migration at 500M rows are the same technical steps. The difference is entirely in scale and coordination overhead.</p><p>That's worth repeating: you're not doing something different at 500M rows. You're doing the same thing, slower, with more people watching.</p><h2 id="the-optimization-tax-the-part-people-dont-calculate">The optimization tax (the part people don't calculate)</h2><p>Migration has a visible cost. You can put it on a roadmap. Estimate it. Schedule it. Argue about it in sprint planning.</p><p>Staying has an invisible cost. I call it the optimization tax: the cumulative engineering time, infrastructure spend, and incident burden you pay every month to keep vanilla Postgres performing on a workload it wasn't designed for. And unlike migration, the optimization tax never stops. It just goes up.</p><p><strong>Engineering time on optimization.</strong> How many hours per sprint does your team spend on query tuning, partition management, autovacuum configuration, index strategy reviews? Track it for a month. I've seen teams burning 15-20% of their engineering capacity on database maintenance work that wouldn't exist on a purpose-built system. Most of them didn't realize it until they measured. 
(If you're wondering whether your team has <a href="https://www.tigerdata.com/blog/six-signs-postgres-tuning-wont-fix-performance-problems"><u>signs that tuning won't fix the problem</u></a>, I wrote about that too.)</p><p><strong>Instance cost escalation.</strong> The progression from <code>db.r6g.xlarge</code> to <code>db.r6g.4xlarge</code> to <code>db.r6g.8xlarge</code> happens gradually enough that nobody raises a flag. Each upgrade is individually justified. "We need more memory for the working set." "The CPU is pegged during dashboard queries." "Read replicas need a bigger instance too." The aggregate cost curve is something else entirely. I wrote a whole post about <a href="https://www.tigerdata.com/blog/vertical-scaling-buying-time-you-cant-afford"><u>why vertical scaling buys time you can't afford</u></a>. The short version: each instance upgrade gets you less headroom than the last.</p><p><strong>Opportunity cost.</strong> Every sprint hour spent on database maintenance is a sprint hour not spent on product features. This compounds. The team spending 15% of engineering time on database operations ships 15% fewer features than the team that doesn't. Over 12 months, that gap is visible to customers.</p><p><strong>Incident burden.</strong> Slow query alerts at 2 AM. Autovacuum blocking production writes. Replication lag during write spikes. These aren't catastrophic. They're erosive. They train the team to accept degraded baseline performance as normal. "Oh, the dashboard is slow on Mondays because of the weekly aggregation job." That sentence should make you uncomfortable.</p><p>Add these up over 12 months. That's your optimization tax. Compare it to the migration cost at your current data volume. The math almost always favors migrating now over migrating later. 
And the longer you wait, the more lopsided that comparison gets.</p><h2 id="why-teams-delay-and-why-those-reasons-stop-holding">Why teams delay (and why those reasons stop holding)</h2><p>I want to be fair about this. The reasons teams wait are legitimate. I've used most of them myself. But each one has a shelf life.</p><p><strong>"We don't have time right now."</strong> This is the most common one, and it contains a cruel irony. Migration time increases with data volume. Delaying to find a better window means the work itself gets bigger. The window never gets better. The task gets worse. The team that "doesn't have time" for a weekend migration at 10M rows will somehow need to find time for an eight-week migration at 500M rows.</p><p><strong>"The current optimizations are working."</strong> They're working now. Each optimization has a ceiling. When you hit it, you need the next one. The sequence is predictable: indexes, then partitioning, then read replicas, then instance upgrades, then custom vacuum tuning. Each step buys less time than the last. You're running up a down escalator.</p><p><strong>"Migration is risky."</strong> This is the one that sounds the most reasonable and holds up the least. Migration from Postgres to TimescaleDB is lower risk than most database migrations because TimescaleDB is Postgres. Same SQL. Same wire protocol. Same drivers. Same <code>pg_dump</code>. Your application code changes are minimal. The risk profile is closer to "adding an extension" than "switching databases." You're not leaving Postgres. You're giving it better tools.</p><pre><code class="language-sql">-- Your existing queries still work. This isn't a rewrite.
SELECT time_bucket('1 hour', ts) AS hour,
       avg(temperature),
       max(temperature)
FROM sensor_data
WHERE ts &gt; now() - interval '7 days'
GROUP BY hour
ORDER BY hour;
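
-- For comparison, a sketch of the same rollup in plain Postgres syntax.
-- date_trunc is core Postgres, so a query like this runs before and after
-- the migration; sensor_data, ts, and temperature are the example names above.
SELECT date_trunc('hour', ts) AS hour,
       avg(temperature),
       max(temperature)
FROM sensor_data
WHERE ts &gt; now() - interval '7 days'
GROUP BY hour
ORDER BY hour;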
</code></pre><p><strong>"We'd need to convince stakeholders."</strong> Quantify your optimization tax. That's the whole pitch. "We stop spending X hours per month on database maintenance and reclaim that for product work." If your team is spending two days per sprint on database performance, that's roughly 20% of engineering capacity. Put a dollar figure on it. The conversation gets short.</p><h2 id="why-migration-risk-is-lower-than-you-think">Why migration risk is lower than you think</h2><p>This is the part where I'm supposed to list seven migration steps and make you feel calm about them. I'll skip the list and give you the summary instead: install the extension, create hypertables, backfill data, configure compression and retention, update connection strings, validate. The <a href="https://docs2.tigerdata.com/docs/migrate"><u>migration guide</u></a> walks through each one. The <a href="https://www.tigerdata.com/blog/postgresql-migration-made-easier"><u>live migration tool</u></a> handles the hard parts at scale.</p><p>The core point is this: the steps are the same whether you have 10M rows or 500M rows. What changes is the logistics around them. At 10M rows, the backfill is a single <code>INSERT...SELECT</code> that finishes while you're refilling your coffee. At 500M rows, it's a parallel <code>COPY</code> job that runs for days and needs monitoring, dual-write strategies, and a rollback plan.</p><p>The steps don't scale. The logistics do. And that's exactly why doing it earlier is the move.</p><h2 id="the-compound-benefit-of-migrating-early">The compound benefit of migrating early</h2><p>Everything above has been about the cost of staying. But there's a positive version of this story too.</p><p><strong>You skip the treadmill entirely.</strong> The team that migrates at 10M rows never learns what autovacuum tuning feels like at 500M rows. They never have the "should we add ClickHouse?" meeting. They never build the CDC pipeline. 
They never debug replication lag during write spikes. Those problems simply don't exist in their world. That's the real dividend.</p><p><strong>Compression from the start.</strong> TimescaleDB's native compression typically achieves 90-95% compression ratios on time-series data. One of our customers, Latitude, saves $12,000 per month on database costs from compression alone. Your storage costs grow at 5-10% of the raw data rate instead of 100%. Over 12 months on a high-ingest workload, that's a line item your finance team will notice.</p><p><strong>Fast dashboards without the custom plumbing.</strong> Continuous aggregates give you incrementally-updating materialized views that combine precomputed rollups with the newest raw data. FlightAware went from 6.4-second query times to 30 milliseconds. Your dashboards are fast from day one, not "fast after we spent two weeks building and maintaining custom materialized view refresh jobs." Hypertables handle partitioning automatically in the background, so you also never write another <code>CREATE TABLE sensor_data_2026_q2 PARTITION OF ...</code> DDL statement. One less thing.</p><p>That's the real cost of waiting. The migration effort is the obvious part. The intermediate suffering you accumulate between now and when you eventually migrate anyway? That's the part nobody budgets for.</p><h2 id="so-about-that-timing">So, about that timing</h2><p>The best time to migrate was when the table was small and the effort was trivial. If that window has passed, the second best time is now, before the data volume makes the migration larger and the optimization tax makes the ROI case embarrassingly obvious.</p><p>Every month of delay increases both costs: the migration gets bigger and the tax keeps compounding. The crossover point where "just optimize" costs more than "migrate and stop optimizing" is earlier than most teams think.</p><p>If you're reading this and nodding, you probably already know what you need to do. 
The question isn't whether to migrate. It's whether you do it this quarter while it's a task, or next year when it's a project.</p><p><a href="https://docs2.tigerdata.com/docs/migrate"><u>Get started with the migration guide →</u></a></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Read Replicas Don't Solve Write Bottlenecks]]></title>
            <description><![CDATA[Read replicas fix read contention. They don't fix write throughput. Here's the mechanical reason why, and what actually changes the trajectory.]]></description>
            <link>https://www.tigerdata.com/blog/read-replicas-dont-solve-write-bottlenecks</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/read-replicas-dont-solve-write-bottlenecks</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[PostgreSQL Performance]]></category>
            <dc:creator><![CDATA[Matty Stratton]]></dc:creator>
            <pubDate>Tue, 07 Apr 2026 16:15:23 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/04/Read-Replicas-Don-t-Solve-Write-Bottlenecks-V2.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/04/Read-Replicas-Don-t-Solve-Write-Bottlenecks-V2.png" alt="Read Replicas Don't Solve Write Bottlenecks" /><p>You added a read replica and something real happened. Dashboard queries stopped competing with ingestion. The primary's CPU dropped. Write latency came down. You watched the metrics and thought: that worked.</p><p>It did work. For a while.</p><p>Now the replica is lagging during write peaks. The primary is accumulating WAL because the replica can't keep up. Write latency is climbing again, despite the replica handling every read you can throw at it. You're managing two Postgres instances instead of one, and the problem you solved is back.</p><p>Read replicas are a real solution to a real problem. You did the right thing. The right thing ran out.</p><p>What you have now is a write bottleneck. Read replicas solve a read bottleneck. These sound similar. The mechanics are completely different.</p><h2 id="what-read-replicas-actually-solve">What read replicas actually solve</h2><p>Let's be fair about this. The win is real.</p><p>Read replicas fix resource contention on the primary. When expensive analytical queries run on the same instance that handles ingestion, they compete directly for CPU, I/O, and buffer cache. A dashboard query scanning 200M rows is doing real work. It blocks autovacuum, slows the write path, and evicts hot data from shared buffers. Moving that query to a replica removes it from the primary entirely.</p><p>The immediate result: primary CPU drops, write latency improves, buffer cache stops getting thrashed by analytical scans. For a system where reads were the dominant load, this can be transformative. A SaaS application with thousands of concurrent users reading data that a small number of writers produce? 
Read replicas are the right call, and they solve the problem completely.</p><p>The mechanism is<a href="https://www.tigerdata.com/blog/scalable-postgresql-high-availability-read-scalability-streaming-replication-fb95023e2af"> <u>streaming replication</u></a>. WAL generated by the primary gets shipped to the replica and applied in order. The replica stays current. Reads on the replica are reads against real data.</p><p>That's the win. The primary has less to do because reads moved elsewhere. Full stop.</p><h2 id="what-read-replicas-dont-touch">What read replicas don't touch</h2><p>Here's where it gets mechanical. Each of these is specific, and none of them changes when you add a replica.</p><p><strong>The write path is identical.</strong> Every INSERT on the primary still goes through the full Postgres write path: heap tuple with<a href="https://www.tigerdata.com/blog/mvcc-feature-youre-paying-for-but-not-using"> <u>MVCC header</u></a>, B-tree index insertions for every index on the table, WAL record generation for heap and indexes, autovacuum triggered by insert volume for freezing. Nothing about streaming replication changes any of this. The replica receives the WAL and applies it. The WAL is generated at the same rate it always was, with the same per-row overhead.</p><p>Write amplification at 50K inserts/sec with five indexes is still 300K write operations per second on the primary. Adding a replica doesn't change that number. It adds the same write operations on the replica as well, applied from WAL.</p><p><strong>Autovacuum still runs on the primary.</strong> The maintenance work that drives continuous autovacuum activity on high-insert tables (hint bit setting, freeze passes, dead tuple cleanup from aborted transactions) still happens on the primary. Routing reads to a replica doesn't reduce the insert rate that triggers autovacuum. It doesn't reduce dead tuple accumulation from aborted transactions. 
It doesn't change the XID freeze schedule.</p><p>Autovacuum workers still show up in <code>pg_stat_activity</code> at 3am. They still compete with writes during peaks. The tuning conversation still happens every quarter.</p><p><strong>WAL volume increases.</strong> Here's the counterintuitive part. Adding replicas increases the total WAL-related workload, not decreases it. The primary now has to ship WAL to every replica in addition to writing it locally. At 50-100MB/sec sustained WAL generation, that's 50-100MB/sec of outbound data per replica. With two replicas, you're managing twice that outbound load.</p><p>This is usually invisible when replicas are keeping up. It becomes visible the moment a replica falls behind.</p><h2 id="the-replica-lag-problem">The replica lag problem</h2><p>This is where things get interesting. And by interesting, I mean expensive.</p><p>Streaming replication works by applying WAL on the replica fast enough to stay current. Under normal conditions this is fine. WAL apply is sequential and fast. Replicas keep up.</p><p>High sustained write volume changes this. At 50-100MB/sec of WAL generation, the replica is continuously applying writes. If the replica hits any slowdown (a large analytical query competing for I/O, a vacuum cycle, a momentary resource spike) it falls behind. A small lag at high WAL volume becomes a large lag fast.</p><p>Here's the self-reinforcing part, and it's the thing most people don't see coming.</p><p>A lagging replica can't be used for real-time dashboards. So reads start routing back to the primary. The primary is handling reads again. More resource contention on the primary. Write latency climbs. You're back where you started, except now you're also managing a lagging replica.</p><p>Meanwhile, the primary has to retain unprocessed WAL in <code>pg_wal</code> until the replica catches up. The further behind the replica falls, the more disk the primary uses holding WAL. 
On a system generating 100MB/sec of WAL, a replica that falls 30 minutes behind represents 180GB of retained WAL on the primary. That's not a theoretical edge case. That's a disk alert at 4am.</p><p><code>max_wal_size</code> and checkpoint frequency become critical tuning parameters that they weren't before. Another surface to monitor. Another thing to tune. Another runbook entry.</p><h2 id="the-operational-load-nobody-budgets-for">The operational load nobody budgets for</h2><p>Adding replicas means running multiple Postgres instances. The cost of that gets absorbed into engineering time, and it's rarely accounted for when the decision is made.</p><p>Each replica needs its own monitoring: replication lag alerts, vacuum state, connection pooling, failover procedures. <code>pg_stat_replication</code> on the primary becomes a permanent fixture in the ops dashboard. Lag thresholds need calibration. Alerts need tuning to avoid false positives during expected spikes.</p><p>Connection routing adds its own complexity. pgbouncer or pgpool sitting in front of the cluster, routing reads to replicas and writes to the primary. The routing layer needs its own monitoring, and it has its own failure modes. A misconfigured connection router that sends writes to a replica produces cryptic application errors that take a while to diagnose. That incident happens once on every team that adds replicas. (Ask me how I know.)</p><p>Schema migrations now touch multiple instances. An <code>ALTER TABLE</code> on the primary replicates to replicas, but the timing matters. Long-running schema changes that lock tables on the primary lock them on replicas too, which means reads route back to the primary, which defeats the purpose of having replicas in the first place.</p><p>New engineers need to understand the topology. Which instance handles writes. Which handles reads. What happens when a replica lags. When to promote a replica. 
The replica architecture that seemed like a contained infrastructure change has tendrils into application code, deployment procedures, and incident response.</p><p>None of this is unusual or unreasonable. It's the actual cost of the solution.</p><h2 id="isolation-vs-solution">Isolation vs. solution</h2><p>Step back from the mechanics for a second.</p><p>There are two different things that can happen when you add capacity to a struggling system. You can solve the problem, meaning the root cause goes away and performance improves permanently. Or you can isolate the problem, meaning you move it further from the things it was affecting. You buy time without changing trajectory.</p><p>Read replicas are isolation.</p><p>The write bottleneck on the primary doesn't go away. The autovacuum tax doesn't go away. The WAL volume doesn't decrease. What changes is that reads no longer compete with these things. The primary's problems are contained to the primary.</p><p>That's not nothing. Isolation buys real headroom, and it's the right call in the right situation. A system where reads actually were the bottleneck is solved, not just isolated. The problem class matters.</p><p>For<a href="https://www.tigerdata.com/blog/six-signs-postgres-tuning-wont-fix-performance-problems"> <u>continuous high-frequency ingestion</u></a>, the bottleneck isn't reads stealing write resources. It's write overhead at the storage level: MVCC headers on every tuple, B-tree index maintenance on every insert, WAL records per row per index, autovacuum competing for the same I/O budget. Routing reads elsewhere reduces resource contention on the primary but doesn't change the write overhead mechanics at all.</p><p>Six months after adding replicas, the primary is slower than it was when you added them. The data volume grew. The write overhead grew with it. 
The ceiling you isolated is now back in frame.</p><h2 id="what-actually-addresses-the-write-bottleneck">What actually addresses the write bottleneck</h2><p>The write bottleneck in this workload pattern is architectural. It's in the storage model: per-row MVCC overhead, per-insert WAL records, and the maintenance machinery built around row-level concurrency management. These are fixed costs baked into vanilla Postgres heap storage. They don't go away through replication topology changes.</p><p>What changes them is a different storage model.</p><p><a href="https://www.tigerdata.com/blog/hypercore-a-hybrid-row-storage-engine-for-real-time-analytics"><u>Columnar storage</u></a> batches 1,000 rows into compressed segments before writing. Instead of 1,000 individual heap inserts, each generating its own WAL record plus WAL records for every index entry, you get one segment write and one WAL record. The per-row MVCC headers that added 23 bytes to every tuple get amortized across the batch. Autovacuum pressure drops because the storage engine isn't creating row-level churn that needs cleaning up.</p><p>Do the math on WAL alone. At 100K inserts/sec in vanilla Postgres with five indexes, you're generating a WAL record for each heap tuple and each index insertion: roughly 600K WAL entries per second. Columnar batching at 1,000 rows per segment reduces that to ~600 segment-level WAL entries per second. WAL volume goes from 50-100MB/sec to roughly 5-15MB/sec. Replica lag stops being a crisis because replicas can apply WAL at the rate it arrives. The primary stops retaining large WAL backlogs. The monitoring surface shrinks. The replication architecture that was struggling to keep up suddenly has room to breathe.</p><p>The replicas don't go away. They just stop being the bottleneck.</p><h2 id="where-this-leaves-you">Where this leaves you</h2><p>Read replicas were the right call. The write performance improvement was real. The isolation of read traffic from ingestion was real. 
You did the correct thing given what you knew.</p><p>What replication doesn't do is change the underlying write cost. That cost is in the storage model, and it scales with data volume. Adding replicas doesn't touch it. Tuning replication doesn't touch it. The ceiling that read replicas pushed into the background is still there, and it moves closer as ingestion continues.</p><p>The write bottleneck gets solved at the storage level or it doesn't get solved. Replication topology is orthogonal to that problem.</p><p>If continuous high-frequency ingestion describes what you're running and you're already managing replica lag,<a href="https://www.tigerdata.com/blog/postgres-optimization-treadmill"> <u>the full picture of the optimization treadmill</u></a> covers what's left on the path and what the off-ramp looks like.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Document Databases: Be Honest]]></title>
            <description><![CDATA[Most MongoDB pain isn't a MongoDB problem. It's a workload shape problem that would follow you to Postgres.]]></description>
            <link>https://www.tigerdata.com/blog/document-databases-be-honest</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/document-databases-be-honest</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Matty Stratton]]></dc:creator>
            <pubDate>Wed, 01 Apr 2026 17:22:30 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/04/Document-Databases_-Be-Honest-V2.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/04/Document-Databases_-Be-Honest-V2.png" alt="Document Databases: Be Honest" /><p>MongoDB gets a bad reputation in certain engineering circles that it doesn't entirely deserve. It ships fast. Schema flexibility is real. The developer experience for document-shaped data is good. A lot of teams made a reasonable call when they chose it.</p><p>But there's a version of this story that ends badly, and it follows a recognizable pattern. The team picks MongoDB for a new system. The system works. Then the data starts looking less like documents and more like a stream of timestamped events. Queries start filtering by time range. Write volume climbs. Performance degrades in ways that feel familiar if you've read about this problem, and deeply confusing if you haven't.</p><p>This post isn't here to relitigate the MongoDB decision. It's here to help you figure out whether the pain you're feeling is a MongoDB problem, a document database problem, or a workload problem that would follow you to Postgres.</p><p>The answer matters because the fix is different in each case.</p><h2 id="what-mongodb-is-actually-good-at">What MongoDB is actually good at</h2><p>Flexible schema for variable data that's actually variable. Product catalogs where every SKU has different attributes. User profiles where fields vary by account type. Content management where article structure differs by category. These are real document shapes, and MongoDB handles them without the ceremony Postgres requires.</p><p>Rapid iteration without migration overhead. Early-stage products change their data model constantly. In Postgres, every schema change is an <code>ALTER TABLE</code>. In MongoDB, you just write different fields. For teams that are still figuring out the shape of their data, this is a real advantage.</p><p>Nested and hierarchical data. Some data is naturally a tree. 
A purchase order with line items with sub-components. A configuration object with nested sections. Postgres can model this with JSONB, but MongoDB's native document model fits it more naturally and queries it more cleanly.</p><p>Horizontal scaling for document reads. MongoDB's sharding model was designed for document workloads. For read-heavy document access at scale, it's a mature and well-understood architecture.</p><p>These aren't consolation prizes. They're real reasons MongoDB is the right choice for a lot of workloads.</p><p>The trouble starts when the data changes shape.</p><h2 id="what-time-series-data-actually-looks-like">What time-series data actually looks like</h2><p>Time-series data has a specific shape, and it's not a document shape. Every row is a measurement. It has a timestamp, a source identifier, and a value or set of values. The schema doesn't vary between rows. There's nothing hierarchical about it. The document model isn't adding anything.</p><p>What time-series data has instead: enormous volume, strict ordering requirements, queries that almost always filter by time range, and retention policies that drop entire time windows at once.</p><p>A wind turbine sensor reporting every five seconds doesn't produce documents. It produces a flat stream of readings: timestamp, sensor ID, RPM, temperature, vibration. A financial trade feed isn't a document store. It's a sequence of immutable events. An APM platform collecting metrics from a distributed system is generating hundreds of thousands of measurements per second, all with the same shape.</p><p>The test is simple. Look at your most-written collection. Does each document have a different structure? 
Or does every document look essentially the same, with a timestamp and some measurements?</p><p>If it's the latter, you're storing time-series data in a document database, and the document model is providing zero value while the storage engine works against you.</p><h2 id="where-mongodb-struggles-with-this-workload">Where MongoDB struggles with this workload</h2><p>WiredTiger (MongoDB's default storage engine) uses a B-tree structure optimized for a workload that includes updates to existing documents. For high-frequency append-only writes, it faces a fundamental mismatch. Consider a single sensor reading: one document insert triggers a write to the primary collection, a write to the oplog, and a separate B-tree update for every index on that collection. Three indexes means five writes for one data point. At 10,000 inserts per second, that's 50,000 storage operations per second before you've run a single query. The engine was designed for mixed read-write workloads with in-place updates, not an endless append stream where no document is ever modified after creation.</p><p>MongoDB has no native time-based partitioning. Postgres has declarative range partitioning.<a href="https://www.tigerdata.com/blog/postgres-optimization-treadmill"> <u>TimescaleDB automates it entirely with hypertables</u></a>. MongoDB has no equivalent primitive. Teams end up implementing time-based collection bucketing manually: separate collections per day or week, application-level routing logic, custom cleanup scripts. It works, but it's the same operational burden as manual Postgres partitioning, without the tooling ecosystem that exists on the Postgres side.</p><p>MongoDB's aggregation pipeline is expressive. But for time-series workloads, the queries that matter are time-range aggregations: hourly averages, daily maximums, week-over-week comparisons. These queries scan large volumes of documents and aggregate across fields. 
Without columnar storage and purpose-built time-series compression, performance degrades with data volume in the same way it does in vanilla Postgres.</p><p>MongoDB did add a native time-series collection type in 5.0. It's a real improvement for simple append-only use cases. But it doesn't support secondary indexes the same way regular collections do, restricts certain aggregation stages and update operations, and is still relatively new compared to the Postgres ecosystem. Worth knowing about. Not a full answer.</p><h2 id="why-moving-to-vanilla-postgres-isnt-automatically-the-fix">Why moving to vanilla Postgres isn't automatically the fix</h2><p>This is the section most competitive content skips entirely. If you're evaluating a migration, you deserve the full picture.</p><p>If the workload is continuous high-frequency time-series ingestion with long retention and operational query requirements, vanilla Postgres has its own version of this problem.<a href="https://www.tigerdata.com/blog/postgres-optimization-treadmill"> <u>The MVCC overhead, write amplification, autovacuum contention, and index maintenance costs that create the Optimization Treadmill</u></a> exist in Postgres too. The storage model is different from MongoDB's, but the outcome at scale is the same: performance degrades with data volume, maintenance overhead accumulates, and each optimization cycle buys time without changing the trajectory.</p><p>Moving from MongoDB to vanilla Postgres solves the schema flexibility problem (you probably don't need it for this workload anyway). You get a mature partitioning ecosystem, a better query planner, and a richer extension ecosystem. These are real improvements.</p><p>It doesn't solve the core time-series storage problem, because that problem lives in the storage model, not the database brand.</p><p>The question isn't MongoDB vs. Postgres. It's document store vs. purpose-built time-series storage. 
That's the actual axis the decision should sit on.</p><h2 id="the-decision-framework">The decision framework</h2><p><strong>Your data is actually documents.</strong> Variable schema, nested structures, hierarchical relationships, read-heavy access patterns. MongoDB is the right tool. The pain you're feeling is probably a schema design or indexing problem, not a fundamental architectural mismatch. Fix the schema.</p><p><strong>Your data is time-series but volume is modest.</strong> Sub-10K inserts per second, retention under 90 days, no hard operational latency requirements on the full retention window. Vanilla Postgres with good partitioning and indexing handles this fine. The Optimization Treadmill exists, but the ceiling is far enough away that standard tuning keeps you ahead of it. Move to Postgres, implement partitioning early, and<a href="https://www.tigerdata.com/blog/six-signs-postgres-tuning-wont-fix-performance-problems"> <u>monitor the warning signs</u></a>.</p><p><strong>Your data is time-series at sustained high volume.</strong> Continuous ingestion, long retention, operational query requirements, growing data volume. This is the workload that breaks both MongoDB and vanilla Postgres through the same class of mechanisms. Purpose-built time-series storage on Postgres (same SQL, same wire protocol, same tooling) is the right answer.<a href="https://www.tigerdata.com/learn/complete-guide-migrating-from-mongodb-to-tiger-data-step-by-step" rel="noreferrer"> <u>Migration from MongoDB to TimescaleDB follows a well-documented path</u></a>: you keep everything Postgres-compatible and gain the storage architecture that matches the workload.</p><h2 id="what-to-do-next">What to do next</h2><p>MongoDB didn't fail you if you're reading this. Your workload evolved past what document storage was designed for. 
That's a different thing.</p><p>Most database choices are right at the time they're made and wrong eighteen months later when the system looks nothing like it did at launch. Sensor data that started as a feature became the core product. The document store that handled early prototyping became the production system for a time-series pipeline.</p><p>The question now is whether the fix is tuning, migration, or architecture. The framework above gives you a clear read on which one applies. If it's architecture, the good news is that moving from MongoDB to a Postgres-compatible time-series database is less disruptive than it sounds. Your queries do have to move from MongoDB's query language to SQL, but what you land on is ordinary Postgres: standard SQL, the standard wire protocol, the standard tooling. The storage engine underneath is the part that's different.</p><p>That's the right scope for the change. Not the whole stack. Just the part that was always wrong for this workload.</p><p><a href="https://www.tigerdata.com/blog/postgres-optimization-treadmill"><u>Read the full technical breakdown of why vanilla Postgres hits these limits</u></a>, or<a href="https://console.cloud.tigerdata.com/signup"> <u>start a Tiger Cloud trial</u></a> and see how TimescaleDB handles your workload directly.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Postgres Performance: Why Peak Throughput Benchmarks Miss the Real Problem]]></title>
            <description><![CDATA[Peak throughput tells you what Postgres can do in a sprint. Production asks what it can do forever. Those are different questions.]]></description>
            <link>https://www.tigerdata.com/blog/postgres-performance-why-peak-throughput-benchmarks-miss-real-problem</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/postgres-performance-why-peak-throughput-benchmarks-miss-real-problem</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[PostgreSQL Performance]]></category>
            <dc:creator><![CDATA[Matty Stratton]]></dc:creator>
            <pubDate>Fri, 27 Mar 2026 14:30:33 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/03/The-Database-Question-Nobody-Asks.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/03/The-Database-Question-Nobody-Asks.png" alt="Postgres Performance: Why Peak Throughput Benchmarks Miss the Real Problem" /><p>You ran the benchmark. 80,000 inserts per second. The database handled it clean, latency stayed flat, no alarms. You shipped with confidence.</p><p>Three months later, p95 write latency is creeping. Six months later, autovacuum is in your top processes by CPU. Nine months later, you're rebuilding indexes on a table that's crossed 400 million rows.</p><p>The benchmark wasn't wrong. The question it answered just wasn't the right one.</p><p>Peak throughput tells you what the database can do in a sprint. Production asks what it can do running forever. Those are different questions with different answers, and most teams only ask the first one.</p><p>The number that actually matters is the <em>sustained throughput ceiling</em>: the write rate at which all of the database's maintenance processes (autovacuum, checkpointing, WAL archiving, replication) can keep up indefinitely. It's always lower than peak throughput. It drops over time as data volume grows. And almost nobody measures it.</p><h2 id="what-benchmarks-actually-measure">What benchmarks actually measure</h2><p>A typical load test runs for minutes. Sometimes an hour if you're thorough. It hits the database hard, measures throughput and latency, and stops. During that window, the buffer cache is warm from the test setup. Autovacuum hasn't had time to accumulate a backlog. WAL hasn't been generating for 72 hours straight. The indexes are fresh. The table fits mostly in memory.</p><p>These are ideal conditions. Not because anyone cheated. That's just what a bounded test looks like. The database performs brilliantly under bounded load because its maintenance subsystems haven't been outrun yet.</p><p>Production is unbounded. 
The data keeps arriving after the benchmark ends. Autovacuum runs against a table that grows every hour. The buffer cache works against a dataset that expands past RAM over weeks. The indexes that fit in memory at 50 million rows don't fit at 500 million. The checkpoint cycle that completed cleanly at low data volume starts competing with writes as WAL volume climbs.</p><h2 id="the-specific-ways-sustained-load-differs-from-peak-load">The specific ways sustained load differs from peak load</h2><p>There are four concrete mechanisms at work here. All four run simultaneously in production. None of them show up in a benchmark.</p><h3 id="your-hot-data-stops-being-hot">Your hot data stops being hot</h3><p>At launch, your hot data fits in <code>shared_buffers</code> and the OS page cache. Read performance is largely a RAM question. As data volume grows past available RAM, cache hit rates fall. Queries that returned in milliseconds start hitting disk. The degradation is slow enough that it looks like a query regression, not a growth problem, and that's what makes it dangerous. You'll spend a sprint chasing query plans and index strategies before someone checks <code>pg_statio_user_tables</code> and realizes the hit rate has been sliding since month four. The latency change wasn't a code problem. It was a ratio problem.</p><h3 id="autovacuum-falls-behind-and-cant-catch-up">Autovacuum falls behind and can't catch up</h3><p>A benchmark run doesn't give autovacuum time to fall behind. Production does.</p><p>At high sustained insert rates, autovacuum fires continuously. During write peaks, it falls behind. The backlog accumulates. Bloat builds. By the time monitoring catches it, the table has weeks of accumulated dead tuples and hint-bit work queued up.</p><p>Here's the part that really gets you: clearing the backlog requires running autovacuum harder, which competes with writes, which slows ingestion. The fix and the problem share the same resource pool. 
You're asking the database to clean up faster while also writing faster, and there's only so much I/O to go around.</p><h3 id="indexes-rot">Indexes rot</h3><p>Fresh B-tree indexes on a small table are compact and cache-friendly. The same indexes a year later on a table with a billion rows are fragmented, partially sparse from the hot-right-edge problem on timestamp columns, and too large to stay in cache.</p><p>Traversal costs go up. Page splits happen more often. The 10x read improvement you got from careful indexing in the first month erodes slowly, then faster. You'll REINDEX and get performance back for a while, but the table is still growing. The next degradation cycle is already in progress.</p><h3 id="wal-never-stops-arriving">WAL never stops arriving</h3><p>WAL volume scales directly with insert rate. At sustained high rates, WAL generation is constant. Replicas that keep up at launch start falling behind as write volume grows. The primary retains unprocessed WAL. Disk fills. And the replica needs to process a growing backlog while new WAL keeps arriving, which means there's no quiet period to catch up. If you've ever watched <code>pg_stat_replication</code> and seen <code>replay_lag</code> tick steadily upward with no sign of plateauing, you know exactly how this ends.</p><p>Each of these mechanisms is invisible in a benchmark. In production, they compound.</p><h2 id="the-number-you-should-actually-be-looking-at">The number you should actually be looking at</h2><p>So how do you actually find the sustained throughput ceiling?</p><p>You can estimate it. Look at autovacuum activity under current load: is it finishing cycles or perpetually falling behind? Check <code>pg_stat_bgwriter</code> (<code>pg_stat_checkpointer</code> on PostgreSQL 17 and later, where the checkpoint counters moved) for checkpoint pressure. Watch <code>pg_wal</code> directory size trends. Plot the ratio of index size to table size over time. These aren't exotic metrics. They're already in Postgres. 
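</p><p>As a rough sketch of what watching them together could look like (not a finished dashboard): the queries below assume PostgreSQL 13 or later for <code>n_ins_since_vacuum</code>, a role with <code>pg_monitor</code> or superuser rights for <code>pg_ls_waldir()</code>, and a connection to the primary. The <code>LIMIT 10</code> cutoffs and the column aliases are illustrative choices, not tuned values.</p><pre><code class="language-sql">-- 1. Is autovacuum keeping up? Dead tuples and inserts since the last vacuum, per table.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       n_ins_since_vacuum,      -- available from PostgreSQL 13
       autovacuum_count,
       last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_ins_since_vacuum DESC
LIMIT 10;

-- 2. Index size relative to table size, largest indexes first.
SELECT relname,
       pg_size_pretty(pg_table_size(relid))   AS table_size,
       pg_size_pretty(pg_indexes_size(relid)) AS index_size,
       round(pg_indexes_size(relid)::numeric
             / nullif(pg_table_size(relid), 0), 2) AS index_to_table_ratio
FROM pg_stat_user_tables
ORDER BY pg_indexes_size(relid) DESC
LIMIT 10;

-- 3. Current size of the pg_wal directory.
SELECT pg_size_pretty(sum(size)) AS wal_dir_size
FROM pg_ls_waldir();

-- 4. Replay lag per replica, as an interval and in bytes.
SELECT application_name,
       replay_lag,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;</code></pre><p>A single snapshot of these says little. Captured on a schedule and plotted on one timeline, they show the trend, and the trend is what tells you where the ceiling is.</p><p>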
Most teams aren't watching them together.</p><p>The leading indicators of a sustained throughput ceiling: autovacuum consistently showing in <code>pg_stat_activity</code>, checkpoint completion times trending up, replica lag growing during write peaks, <code>n_dead_tup</code> climbing even while <code>autovacuum_count</code> keeps ticking up.</p><p>None of these show up in a benchmark. All of them show up in production, usually together, usually around month six or nine.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/03/data-src-image-0038c140-3769-49ad-abbc-4e1e65c072e1.jpeg" class="kg-image" alt="" loading="lazy" width="1376" height="768" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2026/03/data-src-image-0038c140-3769-49ad-abbc-4e1e65c072e1.jpeg 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2026/03/data-src-image-0038c140-3769-49ad-abbc-4e1e65c072e1.jpeg 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/03/data-src-image-0038c140-3769-49ad-abbc-4e1e65c072e1.jpeg 1376w" sizes="(min-width: 720px) 720px"></figure><h2 id="why-this-question-is-structurally-hard-to-ask">Why this question is structurally hard to ask</h2><p>Smart teams miss this. The reasons are structural.</p><p>Benchmarks have a natural stopping point. Load tests end. Sustained load doesn't have a natural evaluation moment until something breaks. There's no "sustained throughput benchmark" in most team playbooks because the concept doesn't have a clean boundary. When do you declare the test over?</p><p>The degradation timeline is also longer than most planning cycles. Indexing starts showing stress at 300 million rows. Partitioning gets complicated at 500+ partitions. WAL volume becomes a crisis when replica lag crosses a threshold that trips an alert. 
These events are six to eighteen months apart. The engineer who ran the initial benchmark often isn't the one debugging the production incident.</p><p>Then there's the procurement problem. Peak throughput is a good number for architecture decisions. "This database handles 80K inserts per second" is a clean, defensible statement. "This database handles 80K inserts per second now, but that number will effectively be lower in eight months as the buffer cache hit rate falls and autovacuum starts competing for I/O" is harder to put in a slide. (Both statements are true. Only one of them gets you budget approval.)</p><p>And most capacity planning frameworks are built around static estimates. How many users, how many requests, how much storage. Sustained throughput degradation is a dynamic problem. The ceiling moves as the system runs. That doesn't fit neatly into a capacity model built for stable workloads.</p><p>This adds up to something bigger than individual teams making mistakes. The entire way the industry evaluates databases is optimized for procurement, not production. Vendor benchmarks measure peak throughput because it's the largest number. Load testing frameworks default to bounded runs because unbounded runs don't have a natural end state. Capacity planning templates assume static ceilings because dynamic ceilings are harder to model. Every layer of the evaluation stack is designed to produce a number that looks good in a slide deck. None of it answers the question you'll actually need answered in month twelve.</p><p>So if the standard evaluation framework is structurally set up to miss this, what does a better one look like?</p><h2 id="what-the-right-benchmark-looks-like">What the right benchmark looks like</h2><p>Run the load test for longer. Hours, not minutes. Watch what happens to autovacuum, not just query latency.</p><p>Start the test with a table that already has data in it, sized to your 12-month projection. 
A benchmark on an empty table tells you about cold start performance. It tells you almost nothing about what the system looks like after a year of continuous ingestion.</p><p>Measure these things during the test:</p><ul><li><code>pg_stat_bgwriter</code> (<code>pg_stat_checkpointer</code> on PostgreSQL 17 and later): checkpoint frequency and write volume</li><li><code>pg_stat_activity</code>: autovacuum activity</li><li>Replica lag if you're running replicas</li><li><code>pg_stat_wal</code> (PostgreSQL 14 and later): WAL generation rate</li><li>Index size relative to table size</li></ul><p>Repeat the test with 3x the data volume. If performance degrades out of proportion to the added data, you've found where the architecture starts to strain. That's the number you want before you ship, not after.</p><p>The test that catches the<a href="https://www.tigerdata.com/blog/postgres-optimization-treadmill"> <u>Optimization Treadmill</u></a> is a test that asks: what happens when this runs for a year? You can simulate that in a day if you load the data upfront and run the benchmark against a realistic data volume.</p><h2 id="the-benchmark-question-and-the-architecture-question">The benchmark question and the architecture question</h2><p>If your system has<a href="https://www.tigerdata.com/blog/postgres-optimization-treadmill"> <u>the six workload characteristics</u></a> (continuous ingestion, time-series access patterns, append-only data, long retention, operational query requirements, sustained growth), the sustained throughput ceiling is structural. Better benchmarking tells you earlier where the ceiling is, but it won't raise it.</p><p>Benchmarking tells you how fast the ceiling approaches. Architecture determines where it sits.</p><p>Teams that run good sustained-load benchmarks early find out at 30 million rows that they're on the Optimization Treadmill. Teams that only run peak throughput benchmarks find out at 800 million rows. The underlying architectural problem is identical in both cases. 
The migration cost is not.</p><h2 id="ask-the-right-question-before-you-ship">Ask the right question before you ship</h2><p>Peak throughput is a useful number. It tells you whether the hardware can keep up with the write rate at a point in time. Worth knowing.</p><p>It just doesn't tell you whether the maintenance processes can keep up with that write rate indefinitely, as data volume grows and the vacuum backlog and WAL volume and cache pressure all grow with it.</p><p>The question nobody asks before shipping is usually the one that generates the incident nine months later. Ask it now. Run the load test against a full-size dataset. Watch autovacuum, not just query latency. Track the ceiling as a moving target, not a static spec.</p><p>And if the benchmark reveals what the<a href="https://www.tigerdata.com/blog/postgres-optimization-treadmill"> <u>scoring framework</u></a> already suggested, the cheapest architectural decision you'll make is the one you make before the table crosses 100 million rows.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[MVCC: The Feature You're Paying For But Not Using]]></title>
            <description><![CDATA[MVCC is great for concurrent workloads. For append-only data, it's 23 bytes of overhead per row that never gets used. Here's what that actually costs.]]></description>
            <link>https://www.tigerdata.com/blog/mvcc-feature-youre-paying-for-but-not-using</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/mvcc-feature-youre-paying-for-but-not-using</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Scaling PostgreSQL]]></category>
            <dc:creator><![CDATA[Matty Stratton]]></dc:creator>
            <pubDate>Fri, 20 Mar 2026 13:07:26 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/03/V1.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/03/V1.png" alt="MVCC: The Feature You're Paying For But Not Using" /><p>Most engineers have a working mental model of MVCC. Readers don't block writers. Concurrent transactions see consistent snapshots. It's why Postgres handles mixed read/write workloads so well, and it's a genuine engineering achievement.</p><p>What's less obvious is that MVCC isn't free. Every row in every table carries its overhead. Not just rows that get updated. The system doesn't know at write time whether a row will ever be touched again, so it prepares for that possibility. Every time.</p><p>If you're running an IoT pipeline, a financial data feed, or an observability platform, most of your rows will never be updated. Sensor readings don't get corrected. Trade records are immutable. Log entries are permanent. You're writing append-only data into a system built to handle concurrent modification of shared rows, and you're paying the full price for that capability whether you use it or not.</p><p>This post breaks down exactly what that costs you: at the byte level, at the I/O level, and at the maintenance level.</p><h2 id="what-mvcc-actually-does-and-why-its-damn-good">What MVCC actually does (and why it's damn good)</h2><p>Before MVCC, databases had two options: lock rows during reads so writers couldn't touch them, or lock rows during writes so readers couldn't see them. Either way, concurrent workloads serialized through lock contention. If you've ever worked with a database that does this, you know how painful it gets at scale.</p><p>MVCC solves the problem differently. When a row is updated, Postgres doesn't modify it in place. It writes a new version of the row and keeps the old version visible to transactions that started before the update. Each transaction sees a consistent snapshot of the database as of the moment it began. 
Readers and writers operate on different row versions simultaneously. No locking required.</p><p>For an e-commerce backend processing orders while users browse, a SaaS application handling concurrent sessions, or any system where multiple transactions touch the same rows, this is transformative. The PostgreSQL documentation puts it simply: reading never blocks writing and writing never blocks reading.</p><p>That's not a small thing. That's the reason Postgres can handle the concurrency patterns that would bring a lock-based system to its knees.</p><p>The cost of maintaining this guarantee is what the rest of this post is about.</p><h2 id="the-per-row-overhead-in-bytes">The per-row overhead, in bytes</h2><p>This is where most explanations go vague. Let's not do that.</p><p>Every heap tuple in Postgres carries a fixed 23-byte header before a single byte of your actual data gets written. Here's what's in it:</p><ul><li><code>t_xmin</code>: the transaction ID that created this row (4 bytes)</li><li><code>t_xmax</code>: the transaction ID that deleted or updated it, zero if the row is live (4 bytes)</li><li><code>t_cid</code>: command ID within the transaction (4 bytes)</li><li><code>t_ctid</code>: physical location of this tuple or its newer version (6 bytes)</li><li><code>t_infomask</code> and <code>t_infomask2</code>: status flags for transaction visibility (4 bytes)</li><li><code>t_hoff</code>: offset to actual row data (1 byte)</li></ul><p>These fields exist to answer one question: is this row visible to this transaction?</p><p>For a workload where rows are being updated and deleted concurrently, that question needs answering constantly. The 23 bytes are worth it.</p><p>For an append-only workload? <code>t_xmax</code> is zero for every live row and will stay zero. <code>t_ctid</code> points to itself because there's no newer version. 
The visibility question still gets asked, and the header still gets written, and the page still gets dirtied to set hint bits after the first read. But the answers are trivial every time. The mechanism is running in full for a case that never needed it.</p><p>Add alignment padding and a 4-byte <code>ItemIdData</code> pointer per tuple, and the true per-row overhead is closer to 28 to 30 bytes before your row data starts.</p><p>Let's make that concrete. At 50K inserts per second, that's 1.4 to 1.5 MB/sec of pure header overhead. Per year (about 31.5 million seconds): roughly 44 to 47 TB of header data for a workload that never updates a row.</p><p>That's not a rounding error.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/03/heap-tuple-diagram-v2.jpg" class="kg-image" alt="" loading="lazy" width="2000" height="1116" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2026/03/heap-tuple-diagram-v2.jpg 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2026/03/heap-tuple-diagram-v2.jpg 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2026/03/heap-tuple-diagram-v2.jpg 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w2400/2026/03/heap-tuple-diagram-v2.jpg 2400w" sizes="(min-width: 720px) 720px"></figure><h2 id="what-autovacuum-is-actually-doing-on-your-append-only-table">What autovacuum is actually doing on your append-only table</h2><p>Here’s what’s going to wrinkle your brain.</p><p>You think, “Autovacuum cleans up dead tuples from updates and deletes. Append-only tables don't update or delete rows. Therefore, autovacuum shouldn't have much to do.”</p><p>That intuition is wrong in three specific ways.</p><p><strong>Aborted transactions leave dead tuples.</strong> Not every INSERT commits. 
Connection drops, application errors, explicit rollbacks. These all leave tuple versions that need cleaning. If you're running high insert rates, you've got a steady trickle of aborted transactions even in perfectly healthy systems.</p><p><strong>Hint bits require page dirtying.</strong> When a row is first read after being written, Postgres needs to check <code>pg_xact</code> to confirm the writing transaction committed. Once confirmed, it sets a hint bit in <code>t_infomask</code> to cache that result. Setting the hint bit dirties the page, which means writing it back to disk. On an append-only table with high read rates, hint bit setting is continuous background I/O on pages that will never change in any meaningful way. Welcome to your new normal.</p><p><strong>Since PostgreSQL 13, insert volume alone triggers autovacuum.</strong> Not just dead tuples. Postgres needs to periodically freeze old transaction IDs to prevent XID wraparound, which is a hard limit built into the 32-bit transaction counter. At high insert rates, autovacuum fires continuously just to freeze tuples on tables with zero updates.</p><p>Go check <code>autovacuum_count</code> and <code>vacuum_count</code> on your busiest append-only partition. They're climbing whether or not n_dead_tup is.</p><p>The result: autovacuum workers show up in <code>pg_stat_activity</code> at all hours on tables that never see a single <code>UPDATE</code>. You tune <code>autovacuum_vacuum_scale_factor</code> and <code>autovacuum_max_workers</code>, and it helps at the margin. But what you're tuning is how the cleanup process competes with writes. Not why it needs to run at all.</p><h2 id="the-write-amplification-chain">The write amplification chain</h2><p>Now let's connect all of this into the full cost picture.</p><p>A single 1 KB sensor reading doesn't write 1 KB. 
Here's what actually hits disk:</p><ul><li>23-byte heap tuple header plus padding</li><li>1,024 bytes of your actual row data</li><li>One entry per index, roughly 40 to 80 bytes each in B-tree leaf pages (five indexes = 200 to 400 bytes)</li><li>One WAL record per heap insert, one per index insertion: approximately 1.2 KB total</li><li>Periodically: an 8 KB full-page write after checkpoint for any newly dirtied page</li></ul><p>Total actual I/O: 2.5 to 3.5 KB for 1 KB of logical data.</p><p>The MVCC header is the entry point for this entire chain. It's what requires the visibility tracking, the hint bit mechanism, the autovacuum sweep, and the WAL record structure that Postgres uses.</p><p>At 100K inserts per second, you're writing 250 to 350 MB/sec of actual I/O for 100 MB/sec of application data. The 3 to 5x write amplification ratio isn't configuration. It's the cost of MVCC applied to data that will never be updated.</p><h2 id="why-you-cant-opt-out">Why you can't opt out</h2><p>There's no per-table setting to disable MVCC. No <code>append_only = true</code> flag that strips the header and skips the visibility machinery. MVCC is not a feature you can turn off for specific tables. It's the storage model. Every heap tuple gets the header. Every insert goes through the same write path.</p><p>This isn't an oversight. It's an architectural decision with a clear rationale: the storage engine doesn't know at write time what future transactions will need to see. The consistency guarantee requires the mechanism to be universal.</p><p>For most workloads, this is the right tradeoff. The overhead is small relative to the value of the concurrency guarantee, and mixed read/write workloads on shared rows are exactly what Postgres is built for.</p><p>The overhead only becomes the dominant cost when the workload is append-only at high sustained rates. 
That's when you're paying the full price for a guarantee you never exercise.</p><h2 id="what-changes-when-the-storage-model-changes">What changes when the storage model changes</h2><p>TimescaleDB's columnar storage (the Columnstore layer) addresses this at the architecture level, not the configuration level. Rather than writing one heap tuple per row, it batches up to 1,000 row versions per column into compressed arrays before writing to disk. The MVCC header overhead gets amortized across the batch. One write operation covers what would have been 1,000 individual heap tuple insertions.</p><p>The practical results: write amplification drops from 3 to 5x to near 1:1 for sustained append workloads. Autovacuum pressure drops proportionally because there's far less row-level churn to clean. WAL volume at 100K inserts/sec falls from 50 to 100 MB/sec to roughly 5 to 15 MB/sec. Replicas that previously fell behind during write peaks can keep up.</p><p>Everything else stays the same. Same SQL. Same wire protocol. Same extensions. Same tooling. The change is underneath, at the layer where MVCC overhead was accumulating.</p><h2 id="the-bottom-line">The bottom line</h2><p>MVCC is not a bug in Postgres. It's one of the reasons Postgres is the right choice for the majority of production workloads.</p><p>But if most of your rows are immutable after the insert commits, if your tables never see concurrent updates to the same rows, if autovacuum is running constantly on data you've never touched, you're running an append-only workload inside a concurrency model built for something else.</p><p>That's not misconfiguration. It's an architectural mismatch. The distinction matters because misconfiguration has a config fix. 
Architectural mismatch doesn't.</p><p>If high-frequency append-only ingestion describes what you're running, <a href="https://www.tigerdata.com/blog/postgres-optimization-treadmill"><u>the full essay on the Optimization Treadmill</u></a> covers what this costs across your entire stack, and what the path forward looks like.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[When Continuous Ingestion Breaks Traditional Postgres]]></title>
            <description><![CDATA[Postgres maintenance depends on quiet periods your continuous workload eliminated. Here's what happens inside the database when the gaps disappear.]]></description>
            <link>https://www.tigerdata.com/blog/when-continuous-ingestion-breaks-traditional-postgres</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/when-continuous-ingestion-breaks-traditional-postgres</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Matty Stratton]]></dc:creator>
            <pubDate>Fri, 13 Mar 2026 19:33:45 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/03/When-Continuous-Ingestion-Breaks-Traditional-Postgres-1280x720.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/03/When-Continuous-Ingestion-Breaks-Traditional-Postgres-1280x720.png" alt="When Continuous Ingestion Breaks Traditional Postgres" /><p>Your system writes data constantly. Not in jobs. Not in batches. A stream that runs at 3am the same as it runs at 3pm. IoT sensors. Trade feeds. Metrics collectors. The data never stops.</p><p>For a while, Postgres handles it fine. Then you start noticing things. Autovacuum is always running. Write latency has a pattern you can't explain by traffic alone. Maintenance tasks that used to take minutes now take hours. And the really annoying part: nothing is misconfigured.</p><p>You check the usual suspects. Indexes are correct. Query plans look reasonable. Configs follow best practices. A colleague confirms the same.</p><p>The problem isn't a missing index or a bad query plan. The problem is that Postgres was designed with a quiet period baked into its assumptions. Your system eliminated that quiet period. Now you're paying for it.</p><h2 id="what-breathing-room-actually-means-in-postgres">What "breathing room" actually means in Postgres</h2><p>Most database systems are designed around a workload shape that includes peaks and valleys. Peaks are when users are active. Valleys are when the database catches up.</p><p>Postgres maintenance is built around the valley.</p><p>Autovacuum runs more aggressively when the database is quiet. <code>ANALYZE</code> refreshes statistics without competing for I/O. Checkpoint cycles complete cleanly. WAL accumulation clears out. The buffer cache warms up on predictable patterns.</p><p>Batch ETL fits this model perfectly. A nightly job writes data for two hours. The database writes, then rests, then writes again. Maintenance runs in the gaps. Everything resets before the next cycle starts.</p><p>Continuous ingestion has no gaps. 
The window that used to be quiet at 2am is now the same as the window at 2pm. Every maintenance process that depends on quiet time now runs in direct competition with writes. All day. All night.</p><h2 id="the-maintenance-competition-problem">The maintenance competition problem</h2><p>Three maintenance processes need quiet time and don't get it under continuous ingestion.</p><p><strong>Autovacuum.</strong> Even on append-only tables, autovacuum fires continuously at high insert rates. Since PostgreSQL 13, inserts themselves trigger autovacuum to freeze tuples and update the visibility map. This isn't about dead tuples from updates or deletes. It's insert-driven vacuum, running because the data is arriving too fast for the system to catch up.</p><p>At 50K inserts/second, autovacuum never finishes a cycle before the next one starts. It competes for I/O with your writes. When it loses, bloat accumulates. When it wins, write latency spikes.</p><p>There's no configuration fix for this. You can tune <code>autovacuum_vacuum_cost_delay</code> and <code>autovacuum_max_workers</code> all day. What you're tuning is how autovacuum loses gracefully. Not how it stops competing.</p><p><strong>Checkpoints.</strong> Postgres writes dirty pages to disk at checkpoint intervals. After a checkpoint, the first modification to each page triggers a full-page write to WAL (that's the <code>full_page_writes</code> mechanism, and it's on by default for good reason). At high insert rates, checkpoint cycles are constant. The full-page write burst that follows each one adds significant WAL volume on top of your baseline write load.</p><p>Batch systems checkpoint, rest, then return to normal. Continuous systems checkpoint and immediately start generating the next burst. There's no recovery window.</p><p><strong>ANALYZE and statistics.</strong> Query planning accuracy depends on fresh statistics. On a billion-row table, <code>ANALYZE</code> is expensive. 
On a batch system, you schedule it after the load completes. On a continuous system, there is no "after." You run it during writes or you let statistics go stale. Stale statistics mean bad query plans. Bad query plans mean unexpected sequential scans at the worst possible time.</p><h2 id="wal-as-the-throughput-ceiling-you-cant-tune-past">WAL as the throughput ceiling you can't tune past</h2><p>This is the mechanical core of the problem.</p><p>Every insert generates WAL. Heap insert record, index insertion records for every index on the table, plus full-page writes after checkpoints. A single 1KB sensor reading with five indexes generates roughly 2.5-3.5KB of actual I/O once you account for the heap tuple, B-tree leaf page insertions, and WAL records. At 100K inserts/second, that puts sustained WAL throughput at 50-100MB/sec under normal conditions. After a checkpoint, it spikes higher because of full-page writes.</p><p>That's 3-6GB per minute. 180-360GB per hour. Just WAL.</p><p>WAL writes are sequential and synchronous by default. That's a hard ceiling on write throughput for a given storage configuration. You can raise the ceiling by buying faster storage. You can't eliminate it, because WAL is how Postgres guarantees durability. And you shouldn't want to eliminate it. Durability matters. But you should understand that your write throughput has a physical upper bound set by how fast your storage can absorb WAL, and continuous ingestion pushes against that bound constantly.</p><p>Here's where continuous ingestion and batch ETL diverge completely.</p><p>Batch ETL generates bursts of WAL followed by silence. The silence lets replicas catch up. A streaming replica can fall behind during a batch load and recover in the gap. Nobody notices because the gap is long enough.</p><p>Continuous ingestion generates WAL constantly. Replicas that fall slightly behind have no gap to recover in. They fall further behind. 
The primary retains unprocessed WAL in <code>pg_wal</code>, consuming disk. The further behind the replica gets, the more WAL it needs to process, and the more disk the primary holds. It's a feedback loop. The thing that causes the problem (WAL volume) is the same thing that prevents recovery (WAL volume).</p><p>Adding replicas makes it worse, not better. Each replica is another consumer that needs to keep up with the same WAL stream, and the primary holds WAL until the slowest one catches up.</p><p>The standard fix is more provisioned IOPS. It works for a while. Then data volume grows and you're having the same conversation again, just with bigger numbers on the invoice.</p><h2 id="why-the-standard-toolkit-doesnt-solve-this">Why the standard toolkit doesn't solve this</h2><p>Walk through each common response and you'll see exactly where it runs out.</p><p><strong>More autovacuum workers.</strong> More workers means more I/O competition with writes, not less. You're distributing the problem across more processes. The aggregate I/O pressure is unchanged.</p><p><strong>Aggressive autovacuum cost limits.</strong> You can configure vacuum to run faster and harder. It cleans up faster but hits writes harder. There's no setting that makes the competition disappear. You're choosing which process suffers.</p><p><strong>More RAM.</strong> Bigger <code>shared_buffers</code> and page cache reduce physical reads. Write amplification is unchanged. WAL volume is unchanged. Autovacuum competition is unchanged. You bought better read performance for a write-bound problem.</p><p><strong>Faster storage.</strong> Raises the WAL ceiling. Doesn't change the ratio of actual I/O to logical data. At 3-5x write amplification, faster storage lets you sustain a higher write rate before hitting the ceiling. But data volume grows, and the ceiling moves up proportionally.</p><p><strong>Vertical scaling.</strong> Same as faster storage with more CPU. You've bought headroom measured in months. 
At the current data growth trajectory, that math doesn't improve over time.</p><p>Each of these is the right response to the symptom. None of them changes the underlying dynamic: continuous ingestion is in constant competition with the maintenance processes Postgres needs to stay healthy.</p><h2 id="the-workloads-where-this-actually-matters">The workloads where this actually matters</h2><p>Not every write-heavy system has this problem. Let's be precise.</p><p>The pattern shows up when three things are true at once: writes are continuous rather than bursty, data volume is growing on a sustained curve, and the database needs to stay queryable under latency requirements while ingestion is running.</p><p><strong>Industrial IoT</strong> is the clearest example. A wind farm with 10,000 sensors reporting every five seconds generates roughly 2,000 inserts/second. That's modest by financial or observability standards, but it never pauses. The turbines don't stop overnight. Maintenance windows don't exist because the data source doesn't know what a maintenance window is.</p><p><strong>Financial market data</strong> is the high-frequency version. Trade feeds run at hundreds of thousands of events per second during market hours. Pre-market and after-market data keeps coming. Systems that aggregate this data for risk and compliance queries need it available immediately, not at end of day.</p><p><strong>Observability platforms</strong> are the distributed version. Metrics, traces, and logs from thousands of hosts. Each host generates data independently. The aggregate rate is enormous and constant.</p><p>What these have in common: the data source runs on its own schedule, completely independent of what the database needs. The wind turbine doesn't care that autovacuum is behind. The trading engine doesn't wait for a checkpoint to finish.</p><p>If your write pattern is bursty (user-driven traffic, nightly batch jobs, periodic syncs), you probably don't have this problem. 
The database gets its breathing room, maintenance catches up, and standard Postgres optimization works the way it's supposed to. The pattern described in this post shows up specifically when the gap disappears.</p><h2 id="recognizing-the-pattern-early">Recognizing the pattern early</h2><p>The instinct when Postgres starts struggling under continuous ingestion is to tune harder. Add workers. Raise limits. Upgrade storage.</p><p>Those are correct responses for a database that has misconfiguration or a bad schema. Postgres is doing exactly what it was designed to do. The MVCC model, the WAL architecture, the maintenance scheduler: these are good design decisions for the workloads Postgres was built to handle. The system changed underneath it. That's not a criticism of the tool.</p><p>But continuous ingestion isn't a heavier version of batch ETL. It's a different workload class. The architectural assumptions underneath Postgres were built around a workload that breathes. Continuous ingestion doesn't breathe. And that distinction matters because it determines whether optimization will change your trajectory or just delay the same outcome.</p><p>Recognizing that early is worth a lot. At 50M rows, switching to a purpose-built architecture takes days. At 1B rows, it takes months. Every quarter you spend optimizing within the wrong architecture is a quarter where migration gets harder and the engineering team spends more time managing the database than building product.</p><p>If this sounds familiar, the full analysis covers the scoring framework and the mechanics behind why each optimization phase hits a ceiling. It's the same trajectory described here, zoomed out to show the complete path and where it leads.</p><p><a href="https://www.tigerdata.com/blog/postgres-optimization-treadmill"><strong><u>Read the full analysis: Understanding Postgres Performance Limits for Analytics on Live Data →</u></strong></a></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to Measure Your IIoT PostgreSQL Table]]></title>
            <description><![CDATA[Learn how to measure your IIoT PostgreSQL table's size, ingest capacity, and query speed with practical SQL queries as your data grows over time.]]></description>
            <link>https://www.tigerdata.com/blog/measure-your-iiot-postgresql-table</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/measure-your-iiot-postgresql-table</guid>
            <category><![CDATA[IoT]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Doug Pagnutti]]></dc:creator>
            <pubDate>Thu, 12 Mar 2026 18:50:42 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/03/How_to_measure_banner.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/03/How_to_measure_banner.png" alt="How to Measure Your IIoT PostgreSQL Table" /><p>I was doing some validation tests for an essay about <a href="https://www.tigerdata.com/blog/the-iiot-postgresql-performance-envelope"><u>the performance envelope for an IIoT PostgreSQL database</u></a> and realized that measuring a database table is not as straightforward as I assumed it would be.</p><p>The general idea was that I would insert IIoT data into a table and then measure the size and performance of the table as it grows. But how do you actually read the size of a table? What is performance? How can we quantify these values in a way that’s useful for us engineers?</p><p>Here’s what I did.</p><h2 id="table-size">Table Size</h2><p>There are two key measurements that define a table’s size: how many rows it has and how much disk space it occupies.</p><h3 id="row-count">Row Count</h3><p>For small tables, this is straightforward:</p><pre><code class="language-SQL">SELECT COUNT(*) FROM &lt;table_name&gt;;</code></pre><p>However, that query requires scanning every row in the table. For typical IIoT tables, like the ones I was testing, that might be billions of rows and might take minutes to execute.</p><p>Instead there’s a much faster query:</p><pre><code class="language-SQL">SELECT reltuples::bigint AS row_count
FROM pg_class
WHERE relname = '&lt;table_name&gt;';</code></pre><p>This is the row count that PostgreSQL uses for the query planner. It’s not guaranteed to match the row count exactly, because it’s only refreshed by <code>VACUUM</code>, <code>ANALYZE</code>, and a few DDL commands such as <code>CREATE INDEX</code>, but it’s close enough and returns almost instantly.</p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">⚠️</div><div class="kg-callout-text">For hypertables created with TimescaleDB, <code>reltuples</code> should not be used. Instead use <a href="https://www.tigerdata.com/docs/api/latest/hyperfunctions/approximate_row_count"><u>approximate_row_count()</u></a>.</div></div><h3 id="size-on-disk">Size on Disk</h3><p>PostgreSQL stores table data across several components: the heap (the main table data), the indices, and TOAST storage (where large values get stashed). All three contribute to the table size and overall storage requirements as shown in the following image.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/03/data-src-image-1d878c42-0846-4e35-aac5-83bfe8f9dfca.png" class="kg-image" alt="" loading="lazy" width="1600" height="510" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2026/03/data-src-image-1d878c42-0846-4e35-aac5-83bfe8f9dfca.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2026/03/data-src-image-1d878c42-0846-4e35-aac5-83bfe8f9dfca.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/03/data-src-image-1d878c42-0846-4e35-aac5-83bfe8f9dfca.png 1600w" sizes="(min-width: 720px) 720px"></figure><p>Here’s the query I used to get the three separate components.</p><pre><code class="language-SQL">SELECT pg_relation_size('&lt;table_name&gt;') AS heap_size,
pg_indexes_size('&lt;table_name&gt;') AS indexes_size,
pg_table_size('&lt;table_name&gt;')
  - pg_relation_size('&lt;table_name&gt;') AS toast_size;</code></pre><p>This returns the sizes in bytes, but you can wrap each value in <code>pg_size_pretty()</code> to get more human-readable output. Strictly speaking, the third column is TOAST plus the table's free space map and visibility map, because <code>pg_table_size()</code> counts all of them; those two are small next to the data.</p><h2 id="ingest-capacity">Ingest Capacity</h2><p>Ingest capacity is critical to IIoT workflows, and it’s where a lot of systems run into serious trouble. How do you measure capacity? You can either get an approximate value from current ingest speeds, or push your database to the limit.</p><h3 id="if-you-already-have-a-data-source-connected">If You Already Have a Data Source Connected</h3><p>If your data stream is already connected, you can look at how long ingests are taking and figure out the capacity from that.</p><p>This requires the <code>pg_stat_statements</code> extension, which is essential for any serious database. It ships with PostgreSQL as a contrib module, but it only collects data once it is listed in <code>shared_preload_libraries</code>, and changing that setting takes a server restart. With that in place, enable it in your database:</p><pre><code class="language-SQL">CREATE EXTENSION IF NOT EXISTS pg_stat_statements;</code></pre><p>Once it’s running, it exposes a view called <code>pg_stat_statements</code> that you can query for your INSERT performance:</p><pre><code class="language-SQL">SELECT 
    query,
    calls,
    rows,
    total_exec_time / 1000 AS total_time_sec,
    mean_exec_time AS avg_ms_per_call,
    rows / NULLIF(calls, 0) AS avg_rows_per_call,
    rows / NULLIF(total_exec_time / 1000, 0) AS rows_per_sec
FROM pg_stat_statements
WHERE query ILIKE '%INSERT%&lt;table_name&gt;%'
ORDER BY total_exec_time DESC;</code></pre><p>This gives you a picture of real ingest performance based on what your application is actually doing. You'll see how many rows each call inserts (obviously <a href="https://www.tigerdata.com/blog/mqtt-sql-practical-guide-sensor-data-ingestion"><u>you’re batching</u></a>), the average time per call, and a rough rows-per-second figure. The time it takes to insert a batch divided by the period of your desired insertions gives you a rough estimate of how much ingest capacity you’re using.</p><p>You can reset the stats whenever you want for a fresh baseline:</p><pre><code class="language-SQL">SELECT pg_stat_statements_reset();</code></pre><p>By measuring this as your table grows, you’ll get a good sense of how your ingest capacity is evolving and you’ll be able to deal with it well before it becomes an issue.</p><h3 id="the-actual-ingest-capacity">The actual ingest capacity</h3><p>If you don’t mind really pushing your table to its limits (and maybe breaking it), you can try to ingest as much as possible and see if the database keeps up. I wrote a full walkthrough for this, including the SQL for generating realistic IIoT data and a scripted test loop, in <a href="https://www.tigerdata.com/blog/how-to-break-postgresql-iiot-database-learn-something-in-process"><u>How to Break Your PostgreSQL IIoT Database and Learn Something in the Process</u></a>.&nbsp;</p><h2 id="query-speed">Query Speed</h2><p>Query speed is the most obvious metric for a database, as it affects everyone using the data. However, I found it to be one of the most difficult to generalize. 
Every application will have specific queries that are important, and different definitions of what is ‘fast enough’. It’s also something that tends to degrade over time and only becomes an issue well into the life of the table.</p><h3 id="for-queries-you%E2%80%99re-already-running">For queries you’re already running</h3><p>If you already have dashboards running, or your analysis workflow in place, you can again use <code>pg_stat_statements</code>. Here's how to pull information for the 20 queries with the highest total execution time:</p><pre><code class="language-SQL">SELECT
    query,
    calls,
    rows,
    mean_exec_time AS avg_ms,
    total_exec_time / 1000 AS total_time_sec,
    stddev_exec_time AS stddev_ms,
    rows / NULLIF(calls, 0) AS avg_rows_returned
FROM pg_stat_statements
WHERE query ILIKE '%SELECT%&lt;table_name&gt;%'
ORDER BY total_exec_time DESC
LIMIT 20;</code></pre><h3 id="for-more-general-queries">For more general queries</h3><p>IIoT queries tend to fall into two categories: wide (what is the state of all devices at a specific time?) and deep (what is the history of a particular device?). By running at least one example from each type, you’ll get a sense of how quickly these types of queries will return.</p><p>Generic Wide Query</p><pre><code class="language-SQL">SELECT DISTINCT ON (tag_id) 
  tag_id, 
  time, 
  value
FROM &lt;table_name&gt;
ORDER BY tag_id, time DESC
LIMIT 100;
</code></pre><p>This returns the most recent value from 100 tags.</p><p>Generic Deep Query</p><pre><code class="language-SQL">SELECT 
    tag_id,
    DATE_TRUNC('hour',time) as hour,
    AVG(value) as hourly_average
FROM &lt;table_name&gt;
WHERE tag_id = &lt;specific tag_id&gt;
GROUP BY tag_id, DATE_TRUNC('hour',time)
ORDER BY hour DESC
LIMIT 100;</code></pre><p>This returns the past 100 hourly averages from one specific tag.</p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text">It's important to run these queries multiple times to get a robust measurement. The first run usually has to read pages from disk, while later runs find them in shared buffers or the OS page cache, so a query is likely to run faster after a few executions.</div></div><h2 id="putting-it-all-together">Putting It All Together</h2><p>The real value comes from combining these measurements as the table grows. Here's the general approach I followed for my essay:</p><ol><li>Create a simple IIoT table schema and common index.</li><li>Measure table size (rows + disk space), query times, and ingest time for a couple of standard batches.</li><li>Insert many batches as fast as possible so the table grows quickly.</li><li>Repeat steps 2 and 3 until you hit some predefined limit (usually disk space or query time).</li></ol><p>If I were working with a real production system instead, I would rely more on <code>pg_stat_statements</code> to track query and ingest rates. Checking daily while the system is new, then weekly after that, will ensure you know exactly how your table is evolving.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Why Adding More Indexes Eventually Makes Things Worse]]></title>
            <description><![CDATA[Every Postgres index is a flat tax on every insert. At high ingestion rates, that tax is the whole problem.]]></description>
            <link>https://www.tigerdata.com/blog/why-adding-more-indexes-eventually-makes-things-worse</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/why-adding-more-indexes-eventually-makes-things-worse</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[PostgreSQL Performance]]></category>
            <dc:creator><![CDATA[Matty Stratton]]></dc:creator>
            <pubDate>Wed, 11 Mar 2026 16:36:20 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/03/Option-1-1200X627.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/03/Option-1-1200X627.png" alt="Why Adding More Indexes Eventually Makes Things Worse" /><p>The pattern is familiar. A query is slow. You run <code>EXPLAIN</code> and see a sequential scan. You add an index. The query drops from seconds to milliseconds.</p><p>You do this a dozen times over two years and it works every time.</p><p>Then write latency starts climbing and you can't figure out why. The queries are fast. The schema looks clean. Nothing is obviously wrong.</p><p>Pull up <code>pg_stat_user_indexes</code>. Count your indexes. Now think about what happens at the storage layer every time a row lands in that table.</p><p>The indexes didn't stop helping reads. They started hurting writes. Every index is a flat tax on every insert: one extra write operation per row, every time, no exceptions. At low ingestion rates, the tax is invisible. At high ingestion rates, it's the whole problem.</p><h2 id="what-actually-happens-when-you-insert-a-row">What actually happens when you insert a row</h2><p>No handwaving here. Let's walk through the mechanics.</p><p>A single <code>INSERT</code> into a table with five indexes doesn't write once. It writes six times: one heap tuple to the table's data pages, and one B-tree leaf page insertion per index. Each index insertion traverses the B-tree from root to leaf, finds the correct position, and writes the new entry. If the target leaf page is full, it splits. A split can cascade up the tree.</p><p>Then there's WAL. One heap insert record. Five index insertion records. If it's the first modification to a page since the last checkpoint, Postgres writes a full 8 KB page image on top of all that.</p><p>At one insert per second, this is completely invisible. At 50,000 inserts per second with five indexes, you're looking at 300,000 write operations per second. Not 50,000. 
Six times the logical write rate, minimum.</p><p>That's your write amplification number. For this table configuration: 6x. More indexes, higher multiplier.</p><h2 id="the-math-that-makes-this-concrete">The math that makes this concrete</h2><p>Take a table with five indexes and a 1 KB row. The heap tuple costs 23 bytes of header plus your 1,024 bytes of row data plus a 4-byte <code>ItemIdData</code> pointer. Each of the five B-tree index entries adds roughly 40 to 80 bytes. Then WAL: approximately 1.2 KB covering the heap insert plus all five index insertions. Add it up and you're writing roughly 2.5 to 3.5 KB for every 1 KB of logical data.</p><p>At 50K inserts/sec, that's 125 to 175 MB/sec of actual I/O for 50 MB/sec of application data. The index tax at work.</p><p>Now add two more indexes because a couple of new dashboard queries need covering indexes. You're at seven. The multiplier goes up. The WAL volume goes up. Write latency goes up. Autovacuum has more index pages to scan and maintain.</p><p>The relationship is linear per index, but the effect compounds with ingestion rate. At 1K inserts/sec, two extra indexes barely register. At 100K inserts/sec, they're a real cost.</p><p>Here's what the math looks like across different configurations:</p>
<!--kg-card-begin: html-->
<table style="border:none;border-collapse:collapse;"><colgroup><col width="64"><col width="184"><col width="184"><col width="192"></colgroup><tbody><tr style="height:39.25pt"><td style="vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;text-align: center;margin-top:12pt;margin-bottom:12pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:700;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Indexes</span></p></td><td style="vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;text-align: center;margin-top:12pt;margin-bottom:12pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:700;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Write ops/sec @ 10K inserts</span></p></td><td style="vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;text-align: center;margin-top:12pt;margin-bottom:12pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:700;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Write ops/sec @ 50K inserts</span></p></td><td style="vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;text-align: center;margin-top:12pt;margin-bottom:12pt;"><span 
style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:700;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Write ops/sec @ 100K inserts</span></p></td></tr><tr style="height:25.75pt"><td style="vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:12pt;margin-bottom:12pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">1</span></p></td><td style="vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:12pt;margin-bottom:12pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">20,000</span></p></td><td style="vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:12pt;margin-bottom:12pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">100,000</span></p></td><td style="vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:12pt;margin-bottom:12pt;"><span 
style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">200,000</span></p></td></tr><tr style="height:25.75pt"><td style="vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:12pt;margin-bottom:12pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">3</span></p></td><td style="vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:12pt;margin-bottom:12pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">40,000</span></p></td><td style="vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:12pt;margin-bottom:12pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">200,000</span></p></td><td style="vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:12pt;margin-bottom:12pt;"><span 
style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">400,000</span></p></td></tr><tr style="height:25.75pt"><td style="vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:12pt;margin-bottom:12pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">5</span></p></td><td style="vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:12pt;margin-bottom:12pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">60,000</span></p></td><td style="vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:12pt;margin-bottom:12pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">300,000</span></p></td><td style="vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:12pt;margin-bottom:12pt;"><span 
style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">600,000</span></p></td></tr><tr style="height:25.75pt"><td style="vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:12pt;margin-bottom:12pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">7</span></p></td><td style="vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:12pt;margin-bottom:12pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">80,000</span></p></td><td style="vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:12pt;margin-bottom:12pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">400,000</span></p></td><td style="vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:12pt;margin-bottom:12pt;"><span 
style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">800,000</span></p></td></tr><tr style="height:25.75pt"><td style="vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:12pt;margin-bottom:12pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">10</span></p></td><td style="vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:12pt;margin-bottom:12pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">110,000</span></p></td><td style="vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:12pt;margin-bottom:12pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">550,000</span></p></td><td style="vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:12pt;margin-bottom:12pt;"><span 
style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">1,100,000</span></p></td></tr></tbody></table>
<!--kg-card-end: html-->
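<p>The multiplication behind the table can be sketched in a few lines of Python. This is a minimal model, assuming exactly one heap write plus one leaf insertion per index for every row; the function and constant names are illustrative and not part of any Postgres tooling.</p>
<pre><code class="language-python"># Idealized model of the table above: each insert costs one heap write
# plus one B-tree leaf insertion per index. Page splits, full-page images,
# and non-B-tree index types are deliberately ignored (an assumption).

INSERT_RATES = (10_000, 50_000, 100_000)  # rows/sec, the table's columns
INDEX_COUNTS = (1, 3, 5, 7, 10)           # the table's rows


def write_ops_per_sec(inserts_per_sec, index_count):
    """Return heap writes plus index insertions per second."""
    return inserts_per_sec * (1 + index_count)


def amplification_table():
    """Build {index_count: [write ops/sec at each insert rate]}."""
    return {
        n: [write_ops_per_sec(rate, n) for rate in INSERT_RATES]
        for n in INDEX_COUNTS
    }


if __name__ == "__main__":
    # Print one row per index count, with thousands separators.
    for n, ops in amplification_table().items():
        print(n, *(format(v, ",") for v in ops))
</code></pre>
<p>Calling <code>write_ops_per_sec(50_000, 5)</code> returns 300000, which matches the five-index row at 50K inserts/sec. The model is intentionally linear, so it understates real I/O whenever page splits or full-page images come into play.</p>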
<p>The numbers in the table are approximate (real-world I/O depends on page splits, full-page writes, and your specific index types), but the pattern is clear. Each additional index is a flat tax on every insert. The tax rate doesn't change. The bill does.</p><h2 id="why-timestamp-indexes-have-a-specific-problem">Why timestamp indexes have a specific problem</h2><p>B-tree behavior for monotonically increasing keys is worse than for random keys. And most time-series tables insert in timestamp order.</p><p>With a random key distribution, new inserts scatter across the B-tree's leaf layer. Any given leaf page gets a roughly even share of new entries. Splits happen, but they're spread out.</p><p>With a timestamp key, every insert goes to the rightmost leaf page. The same page, over and over. That page fills up and splits. The new rightmost page fills up and splits. This is called a "hot right edge," and it means B-tree index maintenance for timestamp columns involves constant page splits concentrated in one area of the tree.</p><p>The old leaf pages that were once the rightmost page sit mostly empty but remain allocated. Index size grows faster than data size. The index bloat that shows up when you compare index size against table size is a direct result of this pattern, not random fragmentation.</p><p>For non-timestamp indexes on the same table (device ID, metric name, sensor type), inserts scatter across the tree instead, which means random I/O rather than sequential. So you get two different flavors of write overhead hitting the same table simultaneously: constant splits on the timestamp index, random I/O on everything else.</p><h2 id="the-feedback-loop">The feedback loop</h2><p>All of that overhead is manageable if it stays constant. The problem is that it doesn't. It self-reinforces.</p><p>You add indexes to fix slow queries. Write amplification increases. Write latency creeps up. Bloat accumulates faster. Autovacuum fires more frequently and has more index pages to clean. 
Autovacuum competes with your writes for I/O bandwidth. Write latency climbs higher.</p><p>Slower writes mean rows sit in the buffer longer. Buffer pressure increases. The query performance you were trying to protect starts degrading anyway, now from I/O contention rather than missing indexes.</p><p>The response is usually to check query plans again. Some queries have gone back to sequential scans because statistics are stale or the planner is making different cost estimates under load. So you add another index. The cycle repeats.</p><p>This loop runs slowly enough that the connection between each index addition and the eventual write degradation is hard to see. Six months can pass between the two events. By that point, you've forgotten which indexes were added and why, and the symptom looks like a completely different problem.</p><h2 id="the-diagnostic-questions">The diagnostic questions</h2><p>Before adding the next index, ask these:</p><p><strong>How many indexes does this table already have?</strong> Pull <code>pg_stat_user_indexes</code> and look at <code>idx_scan</code>. Indexes with low scan counts are paying full write overhead for queries that run rarely or never.</p><p><strong>What's the actual write rate on this table?</strong> Low ingestion rate tables can carry many indexes without much penalty. The math only gets ugly at high sustained rates. If you're inserting 100 rows/sec, ten indexes are probably fine. If you're inserting 50K rows/sec, every index counts.</p><p><strong>Is the slow query a read problem or a write problem?</strong> Adding an index to fix a slow query while write amplification is already the bottleneck treats the symptom and makes the underlying condition worse.</p><p><strong>What's the index bloat trend?</strong> Growing index size relative to table size, especially on timestamp columns, is the fingerprint of the hot right edge problem. 
You can measure it directly with <code>pgstattuple</code> or by comparing <code>pg_relation_size</code> for the index against the table over time.</p><p><strong>Could a different query shape eliminate the need for this index?</strong> Sometimes the answer is restructuring the query or adjusting the access pattern, not adding another index to support the query as written.</p><h2 id="when-youre-past-the-point-where-index-pruning-helps">When you're past the point where index pruning helps</h2><p>You can drop indexes with low idx_scan counts. You can consolidate partial indexes. You can audit and remove redundant coverage. All of that is correct and worth doing.</p><p>But for a table with continuous high-frequency ingestion, even a minimal index set still generates substantial write amplification. Three carefully chosen indexes on a 50K inserts/sec table is still 200K write operations per second. WAL volume is still 3–5x logical data volume. Autovacuum is still competing for I/O.</p><p>Index pruning buys back headroom. It doesn't change the architecture.</p><p>The write amplification problem for this class of workload is in the storage model itself. Row-based heap storage with B-tree indexes is how Postgres handles every table. It's the right design for most workloads. For sustained high-frequency, append-heavy ingestion, the overhead is intrinsic. It's not a configuration problem you can tune your way out of.</p><p>This is what changes when the storage model changes. The reason the index tax is so expensive in row-based storage is that every row is an independent write event. One heap insert, one WAL record, one B-tree traversal per index. The cost is per-row because the storage is per-row.</p><p>Columnar storage changes the unit of work. Instead of writing one row at a time, it batches thousands of row versions into a single segment before writing. One WAL record covers the whole batch. Index maintenance happens at the segment level, not the row level. 
The per-row tax that makes five indexes expensive at 50K inserts/sec gets amortized across thousands of rows per write. Write amplification drops from the 3 to 5x range to near 1:1.</p><p>That's not a tuning improvement. It's a different cost structure for the same logical operation. We covered the full architecture in <a href="https://www.tigerdata.com/blog/postgres-optimization-treadmill"><u>The Postgres Optimization Treadmill</u></a>, which walks through why these constraints exist in row-based Postgres and what it looks like when the storage layer is built for this workload pattern from the start.</p><h2 id="the-bottom-line">The bottom line</h2><p>Every index you've ever added was the right call at the time. That's not the argument here.</p><p>The point is that the index tax is a <em>real cost</em> with a specific multiplier, and that multiplier matters a lot more at 50K inserts/sec than it does at 500. If write latency is climbing on a table that looks well-indexed, pull the insert rate and count the indexes. Do the multiplication. The answer is usually sitting right there in the numbers.</p><p>And if those numbers show you're paying five or more index taxes on every row, with no signs of the data slowing down, the question isn't which indexes to drop. It's whether the per-row cost structure is the right one for the workload.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Vertical Scaling: Buying Time You Can't Afford]]></title>
            <description><![CDATA[Postgres vertical scaling works, until it doesn't. Learn why high-frequency ingestion workloads hit an architectural wall and what to do about it.]]></description>
            <link>https://www.tigerdata.com/blog/vertical-scaling-buying-time-you-cant-afford</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/vertical-scaling-buying-time-you-cant-afford</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[PostgreSQL Performance]]></category>
            <dc:creator><![CDATA[Matty Stratton]]></dc:creator>
            <pubDate>Thu, 26 Feb 2026 14:48:27 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/02/Blog-Thumbnail-1280x720.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/02/Blog-Thumbnail-1280x720.png" alt="Vertical Scaling: Buying Time You Can't Afford" /><p>Your Postgres database is struggling. Write latency is climbing, autovacuum is fighting for I/O, and the indexes you added three months ago aren't cutting it anymore. So you do the obvious thing.</p><p>You upgrade the instance. Metrics drop. Everyone exhales.</p><p>Six months later, you do it again.</p><p>Nobody puts this in a postmortem, because vertical scaling works. That's why teams keep reaching for it. But if you're running continuous high-frequency ingestion on Postgres, it's not a fix. It's a payment plan on a debt that keeps growing.</p><h2 id="the-cost-curve-doesnt-lie">The Cost Curve Doesn't Lie</h2><p>You've probably already run the numbers. At 50K inserts per second, you're adding roughly 1.6 trillion rows per year (50,000 rows/sec times about 31.5 million seconds). Your data volume curve is exponential. Your infrastructure cost moves in steps, doubling each time you provision the next tier up.</p><p>Plot both lines on the same chart. Watch them diverge.</p><p>You upgrade from 16 vCPU/64GB to 32 vCPU/128GB with provisioned IOPS (io2 at 10,000+ on AWS, say). Cost roughly doubles. You get six months of breathing room. Then the data keeps growing, and the metrics start climbing again.</p><p>So you upgrade again. The cost doubles again. Twelve months out, you're projecting another upgrade. The database line item is growing faster than the product revenue it supports.</p><p>Oof.</p><h2 id="what-youre-actually-buying">What You're Actually Buying</h2><p>More CPU gives autovacuum room to run without starving query execution. More RAM improves <code>shared_buffers</code> and OS page cache hit rates. Faster storage reduces I/O wait across the board.</p><p>All real wins. None of them touch the per-row overhead.</p><p>Here's what's actually happening underneath. 
Every row carries MVCC headers, index entries, and WAL records whether you asked for them or not. A 1KB sensor reading becomes roughly 2.5 to 3.5KB of actual I/O: a 23-byte heap tuple header, five index entries at ~60 bytes each, plus a ~1.2KB WAL record stacking on top.</p><p>At 100K inserts/sec, that's 250-350MB/sec of real I/O to move 100MB/sec of data. A bigger instance tolerates that overhead more gracefully. It does not reduce it.</p><p>So the trajectory holds. Six months of headroom, metrics creep back, another upgrade, another budget conversation. Each step costs more than the last one and buys roughly the same amount of time.</p><h2 id="the-invisible-cost-nobody-tracks">The Invisible Cost Nobody Tracks</h2><p>Here's where it gets uncomfortable. The latency graphs are one thing. Engineers watch latency graphs. Finance watches the <em>invoice</em>.</p><p>At some point the database line item becomes visible enough that someone schedules a meeting. Now you're explaining autovacuum to a person who manages a spreadsheet for a living. (That meeting is not fun. The prep work for that meeting costs engineering time you don't have.)</p><p>But that's the visible cost. The invisible one is worse.</p><p>When teams hit this pattern, senior engineers typically spend 20-30% of their time on database operations. Not firefighting. Weekly. Monitoring autovacuum lag. Tuning per-partition settings. Watching replication delay. Reviewing runbooks before anyone touches the schema. Making sure the pg_partman automation didn't silently fail again.</p><p>None of that shows up in the cloud bill. It doesn't trigger a finance meeting – it just quietly drains your best people every single week. New engineers need weeks of onboarding before they can safely operate the partitioning scheme. 
What should be a one-person schema change becomes a team event with a rollback plan.</p><p>You've built a database operations practice inside your product engineering team. That wasn't the plan.</p><h2 id="why-vertical-scaling-feels-like-its-working">Why Vertical Scaling Feels Like It's Working</h2><p>The thing that makes this pattern so persistent is that each optimization phase genuinely does help. Vertical scaling is no exception.</p><p>You add the bigger instance, and autovacuum workers stop competing with queries for CPU. Shared buffers expand, and buffer cache hit rates climb. Those io2 IOPS stop being the bottleneck. For a while, the system breathes.</p><p>But here's the thing: Postgres wasn't designed for continuous, high-frequency, append-only ingestion at scale. The design choices that make it excellent for general-purpose workloads (MVCC for concurrency, row-based heap storage, B-tree indexes, the WAL architecture) all generate overhead that multiplies when you're hammering it with hundreds of thousands of inserts per second that never pause.</p><p>Vertical scaling gives the existing architecture more room to operate. It doesn't change the architecture.</p><p>MVCC creates per-tuple overhead on data you'll never update. Row storage forces you to read all 30 columns when your query needs two. B-tree indexes mean every insert has to traverse and update every index, and at 50K inserts/sec with five indexes, that's 250K index insertions per second. WAL records every single one of those operations before touching a data page, so at 100K inserts/sec and roughly 1.2KB of WAL per row, you're generating about 120MB/sec of WAL just to do normal work.</p><p>None of those problems shrink when you add more vCPUs.</p><h2 id="how-it-shows-up-before-its-a-crisis">How It Shows Up Before It's a Crisis</h2><p>The real tell isn't in a p95 latency chart. It's the <em>pattern</em>.</p><p>You optimize. You get relief. The metrics climb back. You optimize again. 
The relief lasts a little less time than before.</p><p>Before it becomes a full crisis, it shows up in how the team is spending its time.</p><p>Optimization is on every quarterly roadmap, not as a one-time project, but as a line item, every quarter, competing with features for engineering time.</p><p>The database bill goes up 40% while user growth is 15%. Finance notices. Those numbers don't get ignored.</p><p>You ship a 2x performance improvement and data growth erases it within two quarters. The treadmill doesn't slow down – it <strong>speeds up</strong>.</p><p>And autovacuum just keeps coming up! It's in the top five processes by CPU and I/O at all hours, and tuning it is somehow <em>always</em> on someone's plate.</p><p>Two or three of these? Pay attention. Four? You're <em>already</em> in the pattern.</p><h2 id="optimization-problem-vs-architecture-problem">Optimization Problem vs. Architecture Problem</h2><p>There are two different problems that both show up as "database performance is degrading."</p><p>The first is an optimization problem. The workload fits the database design. Better indexes, query rewrites, config tuning, vertical scaling. These directly improve the trajectory, and Postgres expertise solves it. For most workloads, vanilla Postgres is the right tool and this is the right path.</p><p>The second is an <strong>architectural mismatch</strong>. The workload is hitting design tradeoffs baked into the storage engine and the write path. Optimization helps short-term, but it doesn't change the trajectory. You're working <em>around</em> the architecture instead of <em>with</em> it.</p><p>Both of these look identical from the outside: degrading query latency, climbing infrastructure costs, teams spending more time on database operations than product work. The difference only becomes obvious when you notice each fix is lasting a little less time than the last one.</p><p>Vertical scaling is the right move for the first problem. 
For the second, it's just the most expensive item on the treadmill.</p><h2 id="when-to-think-about-architecture-instead">When to Think About Architecture Instead</h2><p>If your workload is continuous high-frequency ingestion, your data is append-only, queries predominantly filter on time ranges, and you're measuring retention in months or years, you're probably dealing with an architectural mismatch, not an optimization problem.</p><p>You also don't need to replace Postgres. TimescaleDB extends vanilla Postgres with columnar compression, hypertables with automatic chunking, and a query planner that understands <a href="https://www.tigerdata.com/learn/the-best-time-series-databases-compared" rel="noreferrer">time-based access patterns</a>. You keep SQL, your extensions, your team's knowledge, and the entire Postgres ecosystem. What changes is the storage engine and write path underneath (the parts <em>actually</em> generating the overhead).</p><p>Migration complexity scales with data volume. At 10M-50M rows, it's days to two weeks. At 100M-500M rows, two to six weeks. At 1B+, you're looking at months. Those hours don't go toward product features. And there's no point on that curve where waiting makes it cheaper.</p><p>If your team is spending 20%+ of engineering time on database operations and scalability is on every quarterly roadmap, you already know something is off. The upgrade cycles don't get cheaper. They just get further apart until they don't.</p><p><em>This post is part of a series on Postgres performance limits for high-frequency data workloads. The full analysis, including a workload scoring framework and migration complexity breakdown at different scales, is in the anchor essay:</em><a href="https://www.tigerdata.com/blog/postgres-optimization-treadmill" rel="noreferrer"><em> <u>Understanding Postgres Performance Limits for Analytics on Live Data</u></em></a><em>. 
Ready to test it on your own data?</em><a href="https://console.cloud.timescale.com/signup"><em> <u>Start a free Tiger Data trial.</u></em></a></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Understanding Postgres Performance Limits for Analytics on Live Data]]></title>
            <description><![CDATA[PostgreSQL hits hard limits under analytics workloads. Here's why MVCC, WAL, and row storage compound — and what to do instead.]]></description>
            <link>https://www.tigerdata.com/blog/postgres-optimization-treadmill</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/postgres-optimization-treadmill</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Analytics]]></category>
            <category><![CDATA[PostgreSQL Performance]]></category>
            <dc:creator><![CDATA[Matty Stratton]]></dc:creator>
            <pubDate>Wed, 25 Feb 2026 19:18:16 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/02/advocacy-essay-thumbnail-with-elephant.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/02/advocacy-essay-thumbnail-with-elephant.png" alt="Understanding Postgres Performance Limits for Analytics on Live Data" /><h2 id="the-pattern-recognition-moment">The Pattern Recognition Moment</h2>
<p>You're reviewing monitoring on a normal workday. There hasn't been a new deployment, no weird traffic spike, and no schema changes. But p95 write latency has crept from 8ms to 25ms over the past month, and last week it touched 45ms. Your largest tables crossed 500M rows sometime in March and they're still climbing.</p>
<p>Six weeks of data points, all trending the same direction.</p>
<p><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/02/diagram-2-p95-write-latency.png" alt="" loading="lazy"></p>
<p>You've run Postgres in production for years. You've tuned queries, rebuilt indexes, and right-sized instances. But this time the fixes don't stick; every new index or config tweak brings the metrics back down for a few weeks, then they climb again. You can plot the trajectory out three months and know exactly where it lands.</p>
<p>So you do a proper audit: query plans, connection overhead, table stats, bloat. Everything checks out. Schema is sound, indexes cover the hot paths, and configs follow best practices. A consultant confirms the same: nothing misconfigured. But performance keeps degrading, and it correlates with data volume, not traffic.</p>
<p>You look closer at the workload. Most writes are inserts, not updates, and every row carries a timestamp. Queries almost always filter by time range. Data arrives continuously, not in batches or bursts, but as a steady stream that never pauses. You need months or years of retention, and you're not just storing this data. You're querying it under latency requirements.</p>
<p>This doesn't fit the profile of a transactional workload, and it doesn't fit a data warehouse either. It's continuous high-frequency ingestion that needs to stay operationally queryable.</p>
<p>Postgres is a brilliant general-purpose database. The same design choices that make it handle e-commerce, SaaS backends, and CMS workloads so well create compounding overhead for sustained high-frequency <a href="https://www.tigerdata.com/learn/time-series-database-what-it-is-how-it-works-and-when-you-need-one">time-series</a> ingestion with long retention. Design tradeoffs, not bugs. Baked into the architecture by intent.</p>
<p>You are not fighting misconfiguration. You are fighting architectural boundaries designed for a different workload class.</p>
<p>This piece walks through what we call the Optimization Treadmill: the sequence of phases most teams follow, each a correct response to observed symptoms, each providing temporary relief without changing the underlying trajectory. Understanding the mechanics of why the treadmill exists is what lets you recognize it early. If you recognize the scenario above, this is a common path. The question isn't whether you'll hit the ceiling. It's when, and how much runway you have left when you do.</p>
<h2 id="what-this-workload-looks-like">What This Workload Looks Like</h2>
<p>Not all high-write workloads will hit this wall. Postgres handles enormous write volumes for e-commerce, social networks, and SaaS backends without issue. The friction comes from a specific combination of six characteristics. If four or five describe your system, the optimization phases in the next section will be familiar.</p>
<p><strong>Continuous high-frequency ingestion.</strong> Thousands to hundreds of thousands of inserts per second, 24/7, with no pause: IoT sensors reporting every few seconds, financial systems processing trades in real time, or APM platforms collecting metrics from thousands of hosts. High-frequency data generation is independent of user count. Batch systems get quiet periods where the database can run maintenance, but continuous ingestion never stops. Maintenance competes directly with writes, and there is no scheduling window.</p>
<p><strong>Time-series access patterns.</strong> Nearly every row has a timestamp, and queries almost always include time range filters. "Last 30 minutes of CPU utilization," "this week compared to last week," "all transactions between two dates." This goes beyond a <code>created_at</code> column; the entire query pattern revolves around time. General-purpose indexes aren't built for this access pattern, which is why teams end up reimplementing time-based data organization through manual partitioning scripts and custom tooling.</p>
<p><strong>Append-only data.</strong> Once written, rows rarely change. Sensor readings don't get updated, financial transactions are immutable, log entries are permanent. Deletes happen in bulk (drop an entire month's partition), not row by row. MVCC exists to handle concurrent reads and writes on the same rows. Append-only workloads pay that overhead on data they never touch again. Autovacuum is running constantly just to clean up dead tuples that were never created through updates.</p>
<p><strong>Long retention.</strong> Months to years, not days or weeks. Compliance might require seven years of financial records, manufacturing teams need root cause analysis across quarters, and ML pipelines need two-plus years of training data. Shortening retention just hides architectural problems because old data ages out, while long retention means unbounded table growth. At 50K inserts per second, that's roughly 1.6 trillion rows per year (50,000 × 86,400 seconds × 365 days). After three years? About 4.7 trillion rows.</p>
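<p>The retention arithmetic is easy to check directly. A minimal sketch, assuming a steady 50K inserts per second with no quiet periods; the column aliases are illustrative:</p>
<pre><code class="language-sql">-- Rows accumulated at a constant insert rate.
-- Cast to bigint: the product overflows a 32-bit integer.
SELECT 50000::bigint * 86400 * 365     AS rows_per_year,
       50000::bigint * 86400 * 365 * 3 AS rows_after_three_years;
-- rows_per_year          = 1576800000000
-- rows_after_three_years = 4730400000000
</code></pre>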
<p><strong>Operational query requirements.</strong> This isn't cold storage or an analytics warehouse you query once a day. You need millisecond responses on the last day's data, sub-second on the last week, and reasonable performance across the full retention window. Real-time dashboards, alert systems, user-facing analytics, ad-hoc investigation, all querying the same database. Data warehouse depth with operational latency requirements.</p>
<p><strong>Sustained growth.</strong> Data volume growing 50–100%+ year over year on a predictable curve. Static workloads can be over-provisioned once and left alone, but growing workloads demand constant re-optimization. You're not solving for current scale. You're chasing projected scale, and the gap keeps widening.</p>
<p>If four or five of these apply, the next section maps the optimization path most teams follow. If your workload is standard OLTP, batch warehouse, low-volume time-series, or short-retention, the underlying issues are likely different.</p>
<p>This combination of characteristics didn't exist at scale 15 years ago. It's a product of specific infrastructure shifts: billions of connected devices generating continuous telemetry, high-frequency trading systems that treat microseconds as a competitive moat, AI pipelines that require years of operational history as training data, and observability platforms collecting metrics from every process in a distributed system. The cloud didn't just scale these workloads up. It made them continuous. Machines that never go offline generate data that never stops. That changed what operational databases are asked to do, and general-purpose engines weren't redesigned to match.</p>
<h2 id="the-optimization-path">The Optimization Path</h2>
<p>Most teams working this pattern follow roughly the same sequence. Each phase is a reasonable response to observed symptoms, but each buys 3–6 months of relief at most, adds operational complexity, and has diminishing returns. The optimizations address symptoms without changing the underlying architecture. The ceiling doesn't move. You do, until you run out of room.</p>
<h3 id="phase-1-index-optimization">Phase 1: Index optimization</h3>
<p>The trigger is predictable: query performance degrades as tables grow past 50–100M rows, or sequential scans on a 100M-row table take minutes. The textbook answer is to add B-tree indexes on timestamp columns, build composite indexes for common filter combinations, create partial indexes on hot time ranges, and run ANALYZE to refresh <code>pg_statistic</code>.</p>
<pre><code class="language-sql">-- Composite index for the most common dashboard query pattern
CREATE INDEX idx_metrics_device_time
  ON device_metrics (device_id, ts DESC);

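-- Partial index covering only the hot time range.
-- The predicate must be a constant: now() is not IMMUTABLE, so Postgres
-- rejects it in an index predicate. Recreate the index as the window moves.
CREATE INDEX idx_metrics_recent
  ON device_metrics (ts DESC)
  WHERE ts &gt;= '2025-06-01 00:00:00+00';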
</code></pre>
<p>A query that did a sequential scan across 100M rows now hits an index and returns in milliseconds. 10–100x improvement on read performance is typical. Problem solved, for now.</p>
<p>Issues start showing up as tables grow past 300M rows. Every INSERT must update every index on the table. With five indexes, each insert performs six write operations: one heap tuple write and five B-tree leaf page insertions. At 50K inserts/sec, that's 300K write operations per second. Each index insertion traverses the B-tree, potentially causing page splits that trigger additional I/O. Index sizes, tracked through <code>pg_stat_user_indexes</code>, start climbing:</p>
<pre><code class="language-sql">-- Monitoring index bloat
SELECT schemaname, relname AS tablename, indexrelname AS indexname,
       pg_size_pretty(pg_relation_size(indexrelid)) as index_size,
       idx_scan as index_scans,
       idx_tup_read,
       idx_tup_fetch
FROM pg_stat_user_indexes
WHERE schemaname = 'public'
ORDER BY pg_relation_size(indexrelid) DESC;
</code></pre>
<p>Index sizes grow faster than table sizes because B-trees don't reclaim space efficiently for append-heavy, time-ordered data. For keys that increase monotonically like timestamps, inserts concentrate on the rightmost leaf pages, resulting in repeated splits. Old leaf pages become sparse but remain allocated. You've improved read latency at the cost of write throughput, and this workload needs both.</p>
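<p>One way to watch this happen is to compare heap and index size for the table and inspect leaf density on a single index. This is a sketch rather than a complete bloat estimator: it assumes the <code>device_metrics</code> table and <code>idx_metrics_device_time</code> index from the examples above, and the second query needs the <code>pgstattuple</code> contrib extension.</p>
<pre><code class="language-sql">-- Heap size versus total index size for one table.
-- (On a partitioned parent these report 0; run against a partition instead.)
SELECT pg_size_pretty(pg_table_size('device_metrics'))   AS heap_size,
       pg_size_pretty(pg_indexes_size('device_metrics')) AS index_size,
       round(pg_indexes_size('device_metrics')::numeric
             / nullif(pg_table_size('device_metrics'), 0), 2) AS index_to_heap_ratio;

-- Leaf density of one B-tree index.
-- pgstatindex reads the whole index, so run it off-peak.
CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT avg_leaf_density, leaf_fragmentation
FROM pgstatindex('idx_metrics_device_time');
</code></pre>
<p>A ratio that keeps rising between checks, or a falling <code>avg_leaf_density</code>, is the signal that index growth is outpacing the data it covers.</p>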
<h3 id="phase-2-table-partitioning">Phase 2: Table partitioning</h3>
<p>Your largest table has crossed 800M to 1B rows, and dropping old data via DELETE causes table bloat and long-running transactions that block autovacuum. You implement time-based range partitioning (typically daily or weekly).</p>
<pre><code class="language-sql">-- Partitioned table setup
CREATE TABLE device_metrics (
    ts          timestamptz NOT NULL,
    device_id   bigint NOT NULL,
    metric      text NOT NULL,
    value       double precision
) PARTITION BY RANGE (ts);

-- Daily partitions created by cron or pg_partman
CREATE TABLE device_metrics_20250601
  PARTITION OF device_metrics
  FOR VALUES FROM ('2025-06-01') TO ('2025-06-02');
</code></pre>
<p>Implementation requires automation: cron jobs or pg_partman to create future partitions, monitoring to detect gaps where partition creation failed, and careful handling of queries that span partition boundaries. Backup and restore now operates on hundreds of individual tables, <code>pg_dump</code> time scales with partition count, and schema migrations touch every partition.</p>
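<p>The gap monitoring mentioned above can start as a pair of catalog queries. A minimal sketch, assuming the <code>device_metrics_YYYYMMDD</code> naming convention from the example; a real check would also cover default partitions and schema-qualified names:</p>
<pre><code class="language-sql">-- List the child partitions of device_metrics with their bounds.
SELECT c.oid::regclass                    AS partition_name,
       pg_get_expr(c.relpartbound, c.oid) AS partition_bounds
FROM pg_inherits i
JOIN pg_class c ON c.oid = i.inhrelid
WHERE i.inhparent = 'device_metrics'::regclass
ORDER BY c.relname;

-- Gap check: does tomorrow's partition exist yet?
-- to_regclass returns NULL when no relation has that name.
SELECT to_regclass('device_metrics_' || to_char(current_date + 1, 'YYYYMMDD'))
       IS NOT NULL AS tomorrow_partition_exists;
</code></pre>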
<p>The wins are concrete. Queries with time-range filters trigger partition pruning, and EXPLAIN shows the planner excluding irrelevant partitions:</p>
<pre><code class="language-sql">EXPLAIN SELECT avg(value) FROM device_metrics
WHERE ts &gt; now() - interval '1 hour';

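-- Scans 1-2 partitions instead of the entire table.
-- Because now() is evaluated at run time, pruning happens at executor
-- startup and the Append node reports e.g. "Subplans Removed: 498"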
</code></pre>
<p>Dropping old data becomes <code>DROP TABLE device_metrics_20240101</code> instead of a multi-hour DELETE that generates gigabytes of WAL and dead tuples.</p>
<p>What happens at 500+ partitions? The <a href="https://www.postgresql.org/docs/current/ddl-partitioning.html">PostgreSQL documentation on partitioning best practices</a> is direct about the cost: "Planning times become longer and memory consumption becomes higher when more partitions remain after the planner performs partition pruning." <code>pg_partman</code> maintenance jobs occasionally fail silently, leaving gaps. Queries spanning long ranges (quarterly reports, year-over-year comparisons) hit hundreds of partitions and regress in performance. Each active partition still has its own autovacuum overhead. The write path is faster per-partition but aggregate write load is unchanged. And the operational complexity is real. New engineers need to understand the partitioning scheme, the automation scripts, the monitoring for gaps, the procedures for backfills, and the implications for schema changes.</p>
<h3 id="phase-3-autovacuum-tuning">Phase 3: Autovacuum tuning</h3>
<p>This is where it starts to feel wrong. You're tuning a cleanup process for data you never modify. <code>n_dead_tup</code> counts are climbing on active partitions, <code>last_autovacuum</code> timestamps show vacuum running constantly but falling behind during write peaks, and <code>pg_stat_activity</code> regularly shows autovacuum workers competing for I/O.</p>
<p>Even append-only workloads generate work for autovacuum. Aborted transactions leave dead tuples. Hint-bit setting (marking tuples as known-committed or known-aborted to avoid future <code>pg_xact</code> lookups) requires dirtying pages. And since PostgreSQL 13, autovacuum triggers based on insert count (not just dead tuples) specifically to freeze tuples and update the visibility map. At high insert rates, this means autovacuum fires continuously on tables that never see a single UPDATE or DELETE.</p>
<pre><code class="language-sql">-- Per-table autovacuum settings on high-traffic partitions
ALTER TABLE device_metrics_20250601 SET (
    autovacuum_vacuum_scale_factor = 0.01,    -- default 0.2
    autovacuum_vacuum_cost_delay = 2,         -- default 2ms (20ms before PG 12)
    autovacuum_vacuum_cost_limit = 1000       -- default 200
);
</code></pre>
<pre><code># postgresql.conf adjustments
autovacuum_max_workers = 6            # default 3
autovacuum_naptime = 15s              # default 1min
maintenance_work_mem = 2GB            # default 64MB
autovacuum_vacuum_cost_delay = 2ms
autovacuum_vacuum_cost_limit = 800
</code></pre>
<p>This helps stabilize bloat, and <code>pg_stat_user_tables.n_dead_tup</code> stays under control. But autovacuum workers now consume measurable CPU and I/O continuously, and monitoring shows autovacuum in <code>pg_stat_activity</code> at all hours. During write peaks, vacuum falls behind, bloat creeps back, and query performance becomes variable. You're tuning a process that exists to clean up overhead your workload doesn't fundamentally produce, but that the storage engine creates anyway.</p>
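<p>The statistics views named above can be queried together to see whether vacuum is keeping up. A sketch, assuming partitions named <code>device_metrics_*</code>; the <code>LIMIT</code> and ordering are arbitrary choices:</p>
<pre><code class="language-sql">-- Per-table vacuum pressure (n_ins_since_vacuum requires PostgreSQL 13+).
SELECT relname, n_live_tup, n_dead_tup, n_ins_since_vacuum,
       last_autovacuum, autovacuum_count
FROM pg_stat_user_tables
WHERE relname LIKE 'device_metrics%'
ORDER BY n_ins_since_vacuum DESC
LIMIT 20;

-- Autovacuum workers running right now and how long each has been going.
SELECT pid, now() - xact_start AS running_for, query
FROM pg_stat_activity
WHERE backend_type = 'autovacuum worker';
</code></pre>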
<h3 id="phase-4-vertical-scaling">Phase 4: Vertical scaling</h3>
<p>All of your optimizations are showing diminishing returns. The next logical step is to add more resources: upgrade from 16 vCPU/64GB to 32 vCPU/128GB with provisioned IOPS storage (e.g., io2 at 10,000+ IOPS on AWS).</p>
<p>More CPU gives autovacuum workers room to operate without starving query execution. More RAM increases <code>shared_buffers</code> and OS page cache hit rates, reducing physical disk reads. Faster storage reduces I/O wait time across the board. This gives you roughly six months of headroom.</p>
<p>Math doesn't lie: the infrastructure cost doubled or tripled, but data growth is still exponential. At the current trajectory, you'll need another upgrade in about six months. The database cost line item is growing faster than the product revenue it supports.</p>
<h3 id="phase-5-read-replicas">Phase 5: Read replicas</h3>
<p>Dashboards and analytics queries compete with ingestion for CPU and I/O on the primary. You add 1–3 streaming replicas, configure pgbouncer or pgpool to route read traffic, and separate the connection pools. Immediately, write performance on the primary improves. Expensive analytical queries run against replicas without blocking ingestion.</p>
<p>The primary still carries the full write load. At sustained high insert rates generating tens of megabytes per second of WAL, replicas that fall behind accumulate WAL on the primary, consuming disk. The further behind a replica gets, the more WAL the primary must retain, and high write volume is exactly what causes replicas to fall behind in the first place. Real-time dashboards pointing at lagging replicas show stale data. You're now managing multiple Postgres instances with their own monitoring, autovacuum tuning, and connection pooling. The write bottleneck is still untouched.</p>
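<p>Both effects, replica lag and WAL retained on the primary, are visible from the primary itself. A minimal sketch using the built-in replication views; the column aliases are illustrative:</p>
<pre><code class="language-sql">-- Run on the primary: per-replica lag in bytes and as a time interval.
SELECT application_name, state,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag_bytes,
       replay_lag
FROM pg_stat_replication;

-- WAL the primary must retain for each replication slot.
SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;
</code></pre>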
<h3 id="taking-stock">Taking stock</h3>
<p>After all five phases, this is what the infrastructure looks like: partitioned tables across 500+ partitions with <code>pg_partman</code> automation and monitoring, aggressive per-table autovacuum settings under constant adjustment, instances upgraded 2–3x from original specs with provisioned IOPS, 2–3 streaming replicas with connection-level routing, and detailed runbooks covering partition management, vacuum procedures, and failover scenarios.</p>
<p>Each optimization was the right response. Each bought time. Yet the trajectory is unchanged.</p>
<p><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/02/diagram-1-latency-across-optimization-phases.png" alt="" loading="lazy"></p>
<p>Senior engineers are now spending 20–30% of their time on database operations. Quarterly planning includes a database scalability line item. New hire onboarding takes weeks before someone can safely operate the partitioning scheme. The team has become part product engineering, part DBA.</p>
<p>Is this inherent to the scale, or is it inherent to the architecture?</p>
<p>The answer matters because the two problems have different solutions. Optimization within the right architecture has a ceiling you can raise. Optimization against an architectural mismatch has a ceiling that doesn't move. Only the timeline changes. For this workload pattern, the ceiling is structural. The question was never if you'd hit it. It was always when.</p>
<h2 id="why-these-optimizations-hit-a-ceiling">Why These Optimizations Hit a Ceiling</h2>
<p>The optimization phases above aren't ineffective. Each one operates within architectural boundaries that weren't designed for this workload pattern, and those boundaries constrain how much any optimization can actually move the needle. Understanding the mechanics explains why returns diminish.</p>
<p>Postgres is a brilliant general-purpose relational database. Its design handles an enormous range of workloads well: e-commerce, content management, authentication, SaaS backends. "General-purpose" means optimized for the average case. High-frequency time-series ingestion with long retention is not the average case. Four core design decisions create this compounding overhead.</p>
<h3 id="mvcc-multi-version-concurrency-control">MVCC (Multi-Version Concurrency Control)</h3>
<p>MVCC lets readers and writers operate concurrently without lock contention. The <a href="https://www.postgresql.org/docs/current/mvcc-intro.html">PostgreSQL documentation on concurrency control</a> describes the core guarantee: "reading never blocks writing and writing never blocks reading." When a row is updated, Postgres keeps the old tuple version visible to in-flight transactions, and autovacuum later marks dead tuples as reusable. For workloads with concurrent reads and updates on shared rows, this is an excellent tradeoff.</p>
<p>For append-only ingestion, every insert still pays the full MVCC cost. Each heap tuple carries a fixed-size header (23 bytes on most machines) containing <code>t_xmin</code>, <code>t_xmax</code>, <code>t_cid</code>, <code>t_ctid</code>, <code>t_infomask</code>, <code>t_infomask2</code>, and <code>t_hoff</code>. These fields track transaction visibility, even though the row will never be updated or deleted by a transaction. Extra cost with no extra value.</p>
<p>The write amplification is easily observable. A 1KB sensor reading becomes:</p>
<ul>
<li>23-byte heap tuple header (plus alignment padding and a 4-byte <code>ItemIdData</code> pointer)</li>
<li>1,024 bytes of row data</li>
<li>5 index entries (assuming 5 indexes, ~40–80 bytes each in B-tree leaf pages)</li>
<li>~1.2KB WAL record (heap insert + index insertions)</li>
</ul>
<p>Total actual I/O: roughly 2.5–3.5KB per 1KB of logical data. At 100K inserts/sec of 1KB rows, you're writing 250–350MB/sec of actual I/O for 100MB/sec of application data. The exact ratio varies with row width, index count, and whether <code>full_page_writes</code> triggers after a checkpoint.</p>
<p><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/02/diagram-3-logical-data-vs-IO-breakdown.png" alt="" loading="lazy"></p>
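<p>Summing the components listed for the 1KB example gives a floor for the per-row I/O. This is a back-of-the-envelope sketch: 1,200 bytes stands in for the "~1.2KB" WAL record, and alignment padding, page-level overhead, and full-page writes are left out, which is presumably where the upper end of the quoted range comes from.</p>
<pre><code class="language-sql">-- Header + line pointer + row data + 5 index entries + WAL record, in bytes.
SELECT 23 + 4 + 1024 + 5 * 40 + 1200 AS low_bytes,   -- 2451
       23 + 4 + 1024 + 5 * 80 + 1200 AS high_bytes;  -- 2651
</code></pre>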
<p>Autovacuum still has work to do on append-only tables. Aborted transactions leave dead tuples, and hint-bit setting (marking tuples as known-committed or known-aborted to avoid future <code>pg_xact</code> lookups) requires dirtying pages. At high insert rates, even these minor sources of work keep autovacuum continuously active. <code>pg_stat_user_tables.n_dead_tup</code> may stay low, but <code>autovacuum_count</code> keeps climbing steadily.</p>
<h3 id="row-based-storage-with-b-tree-indexes">Row-based storage with B-tree indexes</h3>
<p>Postgres stores data as a heap of 8KB pages, each containing variable-length tuples laid out row by row. Every tuple contains all columns. B-tree indexes map key values to ctid (page number + offset) pointers into the heap.</p>
<p>For time-series analytics, this creates read amplification:</p>
<pre><code class="language-sql">SELECT avg(temperature)
FROM sensor_readings
WHERE ts &gt; now() - interval '1 hour'
  AND device_id = 42;
</code></pre>
<p>This query needs three columns: <code>ts</code>, <code>device_id</code>, and <code>temperature</code>. If the table has 30 columns, Postgres still reads all 30 for every matching row from the heap pages. Assuming similar column widths, the I/O is roughly 10x what a columnar layout would require, where only the referenced columns are read from disk.</p>
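<p>The amplification can be observed rather than estimated by asking Postgres how many pages the query touched. A sketch against the same hypothetical <code>sensor_readings</code> table:</p>
<pre><code class="language-sql">-- ANALYZE actually executes the query; BUFFERS reports pages touched.
EXPLAIN (ANALYZE, BUFFERS)
SELECT avg(temperature)
FROM sensor_readings
WHERE ts &gt; now() - interval '1 hour'
  AND device_id = 42;
-- Compare the "Buffers: shared hit=... read=..." counts (8KB pages each)
-- with the bytes the referenced columns logically occupy.
</code></pre>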
<p>Time-series data also compresses extremely well in columnar formats. Sequential timestamps delta-encode to near-zero storage (a regular interval collapses from 8 bytes per timestamp down to a single bit via delta-of-delta encoding), and repeated device IDs run-length-encode. Floating-point sensor values compress with XOR-based compression derived from Facebook's Gorilla algorithm (<a href="http://www.vldb.org/pvldb/vol8/p1816-teller.pdf">Pelkonen et al., "Gorilla: A Fast, Scalable, In-Memory Time Series Database," VLDB, 2015</a>). Columnar storage routinely achieves 10–20x compression on time-series data. Row-based heap storage can't apply any of these techniques because values from different columns are interleaved on the same page.</p>
<p>On the write side, B-tree index maintenance creates significant overhead. Each insert traverses every index's B-tree from root to leaf, finds the correct leaf page, and inserts the new entry. If the leaf page is full, it splits, which can cascade up the tree. For time-ordered data, inserts concentrate on the right edge of timestamp indexes, creating contention on a small number of leaf pages. Non-timestamp indexes (device ID, metric type) scatter inserts across the tree, causing random I/O. With five indexes on a table, every row insert performs one heap page write, five B-tree traversals and leaf page insertions, plus WAL records for each. At 50K inserts/sec, that's 50K heap writes + 250K index insertions per second.</p>
<h3 id="query-planning-overhead">Query planning overhead</h3>
<p>The Postgres planner runs a full optimization pass on every query: it enumerates possible paths, estimates costs from <code>pg_statistic</code> entries, considers index usage, evaluates join orders, and selects an execution plan. For workloads with diverse, unpredictable query patterns involving complex joins, this is the right approach.</p>
<p>For time-series workloads, query shapes are highly repetitive. The same <code>WHERE ts &gt; now() - interval '...'</code> filter runs thousands of times per second. The full planning cycle executes every time. At high query rates, planning overhead is measurable in <code>pg_stat_statements</code> (with <code>pg_stat_statements.track_planning</code> enabled) by comparing <code>total_plan_time</code> against <code>total_exec_time</code>.</p>
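<p>A query along these lines surfaces the statements where planning takes the largest share of total time. It is a sketch that assumes the <code>pg_stat_statements</code> extension is installed on PostgreSQL 13 or later with <code>pg_stat_statements.track_planning = on</code> (it is off by default); the aliases are illustrative.</p>
<pre><code class="language-sql">-- Statements ranked by cumulative planning time.
SELECT left(query, 60) AS query_start,
       calls,
       round(total_plan_time::numeric, 1) AS plan_ms,
       round(total_exec_time::numeric, 1) AS exec_ms,
       round((100 * total_plan_time
              / nullif(total_plan_time + total_exec_time, 0))::numeric, 1) AS plan_pct
FROM pg_stat_statements
ORDER BY total_plan_time DESC
LIMIT 10;
</code></pre>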
<p>Statistics maintenance creates its own cost. ANALYZE samples rows to populate <code>pg_statistic</code>, with the sample size scaled by <code>default_statistics_target</code> (default: 100, which yields roughly 30,000 sampled rows). On billion-row tables, even this sampling-based statistics collection is expensive and must run frequently to keep estimates accurate. Stale statistics provide poor cardinality estimates, leading the planner to choose sequential scans over index scans, or vice versa.</p>
<p>With hundreds of partitions, the planner must evaluate partition pruning for each partition's bounds against the query predicates. This is fast per-partition but scales linearly with partition count. At 500+ partitions, plan time for simple queries can exceed execution time.</p>
<h3 id="write-ahead-logging-wal-volume">Write-Ahead Logging (WAL) volume</h3>
<p>Every data modification generates a WAL record before it's applied to the heap or index pages. WAL writes are sequential and synchronous (fsync per commit, or per <code>wal_writer_delay</code> interval with asynchronous commit). At 100K inserts/sec, WAL generation is roughly:</p>
<ul>
<li>Heap insert records: ~100–150 bytes each = 10–15MB/sec</li>
<li>Index insert records: 5 indexes × ~60–80 bytes each = 30–40MB/sec</li>
<li>Full-page writes (after checkpoint): intermittent bursts of 8KB per dirtied page</li>
</ul>
<p>Total sustained WAL throughput: 50–100MB/sec under normal operation, spiking higher after checkpoints when <code>full_page_writes</code> triggers 8KB records for newly dirtied pages. <a href="https://www.postgresql.org/docs/current/wal-reliability.html">The PostgreSQL documentation</a> describes why: "the first modification of a data page after each checkpoint results in logging the entire page content." At those rates, that's 3–6GB/min, 180–360GB/hour.</p>
<p>WAL I/O becomes a direct throughput bottleneck. <code>pg_stat_wal</code> shows write and sync times (<code>wal_write_time</code> and <code>wal_sync_time</code>, populated when <code>track_wal_io_timing</code> is on) climbing. Replicas that can't apply WAL fast enough fall behind, and unprocessed WAL files accumulate in the primary's <code>pg_wal</code> directory, consuming disk. <code>max_wal_size</code> and checkpoint frequency become critical tuning parameters.</p>
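<p>To compare a real system with the estimates above, the same view reports cumulative WAL volume. A sketch for PostgreSQL 14 or later; it yields an average since the last statistics reset, not an instantaneous rate:</p>
<pre><code class="language-sql">-- Average WAL generation since the statistics were last reset.
SELECT wal_records,
       wal_fpi,                                  -- full-page images written
       pg_size_pretty(wal_bytes) AS wal_total,
       pg_size_pretty(wal_bytes
           / nullif(extract(epoch FROM now() - stats_reset), 0)::numeric) AS wal_per_sec
FROM pg_stat_wal;
</code></pre>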
<h3 id="the-compounding-effect">The compounding effect</h3>
<p>None of these four constraints operates in isolation. Each amplifies the others, and that's where the math gets ugly.</p>
<p>MVCC overhead creates per-tuple bloat, which accumulates faster than autovacuum can clean at high insert rates. Autovacuum competing for I/O degrades write throughput. Degraded write throughput causes queries on bloated tables to slow down, which increases pressure to add more indexes. More indexes produce more write amplification, more WAL, and more replication lag. Row storage forces read amplification on time-range queries, which creates pressure to add covering indexes. Those indexes add to the write overhead feeding back into the MVCC/autovacuum loop.</p>
<p><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/02/diagram-5-updated.png" alt="" loading="lazy"></p>
<p>At 50K inserts/sec with five indexes on a table, the steady-state database workload is: 50K heap tuple writes/sec, 250K B-tree index insertions/sec, 50–100MB/sec sustained WAL generation, continuous autovacuum activity across active partitions, and full query planning on every incoming query.</p>
<p>This is why a 16-core/64GB instance struggles with what appears to be a straightforward append-only workload.</p>
<p>Partitioning reduces per-partition table size but doesn't change the per-row overhead. Adding RAM improves buffer cache hit rates but doesn't reduce write amplification. Autovacuum tuning manages bloat but can't eliminate the cost of producing it. Each optimization operates within these constraints. None removes the constraints themselves.</p>
<p>This is the Optimization Treadmill at the mechanical level. You're not fighting configuration. You're fighting the storage model, the concurrency architecture, and the write path. All of which are designed for a workload that looks nothing like yours.</p>
<h2 id="when-to-choose-a-different-path">When to Choose a Different Path</h2>
<p>Most teams recognize this pattern 12–18 months too late. By then, the tables are massive, the partitioning scheme is deeply embedded, and migration has become a multi-month project. The difference between acting at 10M rows and acting at 1B rows is more than an order of magnitude in engineering cost.</p>
<h3 id="postgres-workload-scoring-framework">Postgres Workload Scoring Framework</h3>
<p>Go back to <a href="#what-this-workload-looks-like">the six characteristics</a>. Be honest about how many describe your system right now, and then score yourself again against where you'll be in 12–18 months.</p>
<p>If four or five apply, you're in this pattern. The optimization phases above are already in your future, or you've started them.</p>
<p>If all six apply, you're past the point of easy exits. Architectural friction is the dominant factor in your performance trajectory, and the migration cost is climbing every quarter you wait.</p>
<p>If three or fewer apply, you likely have a different problem. Standard Postgres optimization should change the trajectory.</p>
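<p>The scoring rule is simple enough to write down. The sketch below is only an illustration: the <code>applies</code> values are made-up sample answers, and the wording of each reading paraphrases the thresholds above.</p>
<pre><code class="language-sql">-- Count how many of the six characteristics apply and map to a reading.
WITH answers(characteristic, applies) AS (
    VALUES ('continuous high-frequency ingestion', true),
           ('time-series access patterns',         true),
           ('append-only data',                    true),
           ('long retention',                      true),
           ('operational query requirements',      false),
           ('sustained growth',                    true)
), score AS (
    SELECT count(*) FILTER (WHERE applies) AS matched FROM answers
)
SELECT matched,
       CASE WHEN matched = 6  THEN 'past the point of easy exits'
            WHEN matched &gt;= 4 THEN 'in the pattern'
            ELSE 'likely a different problem'
       END AS reading
FROM score;
-- With the sample answers: matched = 5, reading = 'in the pattern'
</code></pre>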
<h3 id="early-warning-signs">Early warning signs</h3>
<p>Before the pattern becomes a crisis, it shows up in how the team spends its time:</p>
<p><strong>Optimization dominates planning.</strong> 10–20% of engineering time goes to database performance, and every quarterly roadmap includes a scalability line item.</p>
<p><strong>Costs grow faster than revenue.</strong> Finance is asking why the database bill increased 40% while user growth was only 15%.</p>
<p><strong>Operational complexity accumulates.</strong> 20+ pages of runbooks, partition management scripts, monitoring for autovacuum lag, replication delay, and index bloat. New engineers need weeks of onboarding before they can safely operate the database.</p>
<p><strong>Growth outpaces optimization.</strong> You ship a 2x improvement and data growth erases it within two quarters.</p>
<p><strong>Autovacuum is a constant concern.</strong> It's in the top five processes by CPU and I/O at all hours, and tuning it is a recurring conversation.</p>
<p>Two or three of these signs mean you should be paying attention. Four or more means you're already in the pattern.</p>
<h3 id="migration-complexity-at-different-scales">Migration complexity at different scales</h3>
<p><strong>10M–50M rows.</strong> A day or two to 1–2 weeks. Simple dump/restore, or logical replication. Low risk, fast validation, easy rollback. 1–2 engineers part-time (roughly 80 engineer-hours).</p>
<p><strong>100M–500M rows.</strong> 2–6 weeks. Partition-by-partition migration. More dependencies to account for, more thorough testing required. 2–3 engineers, mostly full-time (roughly 400 engineer-hours).</p>
<p><strong>1B+ rows.</strong> 2–6 months. Hundreds or thousands of partitions. Zero-downtime required, complex rollback planning. Application-level dual-write or change-data-capture pipelines are in play. 3–5 engineers full-time plus a validation period (roughly 2,000 engineer-hours).</p>
<p>Those hours are not spent on product features. And there's no point on this curve where migration gets easier by waiting.</p>
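<p>To see which of these bands a table currently falls into, row counts and partition counts can be estimated from the catalogs without scanning any data. This is a rough sketch: it relies on planner estimates (<code>reltuples</code>), which are only as fresh as the last <code>VACUUM</code> or <code>ANALYZE</code>, and it counts direct children only, so multi-level partitioning would need a recursive query.</p>
<pre><code class="language-SQL">-- Approximate rows, partition count, and on-disk size per partitioned table.
SELECT parent.relname AS table_name,
       count(child.oid) AS partition_count,
       sum(GREATEST(child.reltuples, 0))::bigint AS approx_rows,
       pg_size_pretty(sum(pg_total_relation_size(child.oid))) AS total_size
FROM pg_inherits i
JOIN pg_class parent ON parent.oid = i.inhparent
JOIN pg_class child ON child.oid = i.inhrelid
GROUP BY parent.relname
ORDER BY approx_rows DESC;

-- Same estimate for the largest ordinary (unpartitioned) tables.
SELECT relname AS table_name,
       GREATEST(reltuples, 0)::bigint AS approx_rows,
       pg_size_pretty(pg_total_relation_size(oid)) AS total_size
FROM pg_class
WHERE relkind = 'r'
ORDER BY reltuples DESC
LIMIT 10;</code></pre>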
<h3 id="what-purpose-built-postgres-variants-means">What "purpose-built Postgres variants" means</h3>
<p>TimescaleDB is built on top of Postgres, not in place of it. The PostgreSQL wire protocol, SQL query language, extensions like PostGIS and pgvector, your application code, and your ecosystem tooling all stay the same. What changes is the storage engine and execution layer underneath.</p>
<p><strong>MVCC overhead addressed through columnar compression.</strong> The problem: every row insert in vanilla Postgres generates per-tuple MVCC headers, index entries, and WAL records regardless of whether the data will ever be updated, driving 3–5x write amplification and continuous autovacuum load. TimescaleDB's columnar storage (the <code>Columnstore</code> layer) batches up to 1,000 row versions per column into compressed arrays before writing to disk. Each batch write replaces up to a thousand individual heap tuple insertions with a single compressed segment write. The per-tuple MVCC header overhead is amortized across the batch, and autovacuum pressure drops proportionally because there is far less row-level churn to clean up. In practice, write amplification drops from the 3–5x range to near 1:1 for sustained append workloads. The <a href="https://www.tigerdata.com/docs/about/latest/whitepaper">Tiger Data architecture whitepaper</a> covers the columnar layout and compression pipeline in detail.</p>
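<p>As a rough illustration of what enabling this looks like, the sketch below converts a hypothetical append-only table into a hypertable and turns on compression. The table and column names and the seven-day interval are assumptions made for this example. Option and function names have also changed across TimescaleDB releases, so check the documentation for the version you run.</p>
<pre><code class="language-SQL">-- Hypothetical append-only table; names are illustrative only.
CREATE TABLE sensor_readings (
    time      TIMESTAMPTZ NOT NULL,
    device_id INT NOT NULL,
    value     DOUBLE PRECISION
);

-- Turn it into a hypertable partitioned on the time column.
SELECT create_hypertable('sensor_readings', 'time');

-- Enable columnar compression: batches are grouped per device and ordered by time.
ALTER TABLE sensor_readings SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'device_id',
    timescaledb.compress_orderby = 'time DESC'
);

-- Compress chunks once they are older than seven days (an arbitrary choice).
SELECT add_compression_policy('sensor_readings', INTERVAL '7 days');</code></pre>
<p>The <code>segmentby</code> column is typically the one most queries filter on alongside time, since rows sharing that value end up in the same compressed batches.</p>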
<p><strong>Row storage replaced by columnar layout for <a href="https://www.tigerdata.com/learn/the-best-time-series-databases-compared">time-series data</a>.</strong> The problem: vanilla Postgres reads all columns of every matching row even when a query needs two, creating 15x+ read amplification on wide tables, with none of the compression techniques applicable to time-series data. Rather than reading all 30 columns of a row to get two values, queries against the columnar layer read only the referenced columns from compressed column arrays. The 15x read amplification drops to near 1:1. Time-series compression (delta-of-delta for timestamps, gorilla-style XOR for floats, run-length encoding for repeated values) routinely achieves 10–20x compression ratios vs. heap storage. A dataset that occupies 1TB in vanilla Postgres often fits in 50–100GB with columnar compression enabled.</p>
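<p>Compression ratios vary a lot with data shape, so it is worth measuring on your own tables rather than relying on the typical range. A minimal sketch, assuming a hypertable named <code>sensor_readings</code> (a hypothetical name) that already has compressed chunks:</p>
<pre><code class="language-SQL">-- Before/after size and the resulting ratio for one hypertable.
SELECT pg_size_pretty(before_compression_total_bytes) AS size_before,
       pg_size_pretty(after_compression_total_bytes) AS size_after,
       round(before_compression_total_bytes::numeric
             / NULLIF(after_compression_total_bytes, 0), 1) AS compression_ratio
FROM hypertable_compression_stats('sensor_readings');</code></pre>
<p>The figures cover only chunks that have already been compressed, so the ratio will not reflect recent, still-uncompressed data.</p>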
<p><strong>Query planning overhead reduced through chunk exclusion and continuous aggregates.</strong> The problem: the Postgres planner runs a full optimization pass on every query, and with hundreds of partitions, partition pruning overhead can exceed execution time for simple queries. TimescaleDB's planner extension adds chunk exclusion that operates at a lower level than Postgres's partition pruning. Chunks are indexed by time range in a catalog table, and the planner excludes non-overlapping chunks before the standard planning phase. For query shapes that repeat thousands of times per second, this eliminates most of the per-partition pruning overhead. Continuous aggregates go further: pre-computed rollups stored as materialized views, updated incrementally as new data arrives, so dashboards querying hourly or daily aggregations hit a small summary table instead of scanning billions of raw rows.</p>
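<p>A continuous aggregate for an hourly dashboard might be sketched as follows. The hypertable <code>sensor_readings</code>, its columns, and the refresh offsets are assumptions chosen for illustration, not recommendations.</p>
<pre><code class="language-SQL">-- Hourly rollup maintained incrementally by TimescaleDB.
CREATE MATERIALIZED VIEW sensor_readings_hourly
WITH (timescaledb.continuous) AS
SELECT time_bucket(INTERVAL '1 hour', time) AS bucket,
       device_id,
       avg(value) AS avg_value,
       max(value) AS max_value
FROM sensor_readings
GROUP BY bucket, device_id;

-- Refresh the last few hours on an hourly schedule.
SELECT add_continuous_aggregate_policy('sensor_readings_hourly',
    start_offset =&gt; INTERVAL '3 hours',
    end_offset =&gt; INTERVAL '1 hour',
    schedule_interval =&gt; INTERVAL '1 hour');

-- Dashboards then read the small summary instead of the raw rows.
SELECT bucket, device_id, avg_value
FROM sensor_readings_hourly
WHERE bucket &gt;= now() - INTERVAL '7 days'
ORDER BY bucket;</code></pre>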
<p><strong>WAL volume reduced through batched ingestion.</strong> The problem: at 100K inserts/sec, vanilla Postgres generates 50–100MB/sec of WAL, creating I/O bottlenecks and causing replicas to fall behind. Lagging replicas force the primary to retain more unprocessed WAL, which consumes disk and makes the lag worse. The root cause is per-row WAL records: one per heap insert, one per index insertion. Columnar storage's batch writes generate WAL at the segment level rather than the row level. At 100K inserts/sec, WAL volume drops from 50–100MB/sec to roughly 5–15MB/sec in typical deployments, which eliminates most replication lag issues. Replicas that previously fell behind during write peaks can keep up without tuning.</p>
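<p>Before comparing against these figures, it helps to know how much WAL your own system generates. The sketch below shows two ways to estimate it. The first assumes PostgreSQL 14 or later, where <code>pg_stat_wal</code> exists, and yields an average since the last statistics reset rather than a peak. The LSN values in the second query are placeholders to be replaced with your own samples.</p>
<pre><code class="language-SQL">-- PostgreSQL 14+: average WAL generated per second since the last stats reset.
SELECT pg_size_pretty(wal_bytes) AS wal_since_reset,
       round(wal_bytes
             / extract(epoch FROM now() - stats_reset)
             / (1024 * 1024), 2) AS avg_wal_mb_per_sec
FROM pg_stat_wal;

-- Any supported version: sample the WAL position twice during a write peak...
SELECT pg_current_wal_lsn();

-- ...then diff the two samples (placeholder LSNs shown) and divide by the
-- number of seconds between them.
SELECT pg_wal_lsn_diff('0/3000060', '0/3000000') AS wal_bytes_between_samples;</code></pre>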
<p><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/02/comparison-chart.png" alt="" loading="lazy"></p>
<p><strong>Concrete numbers.</strong> Benchmark results vary by workload, but the directional data is consistent: ingestion throughput 10–20x higher than vanilla Postgres at equivalent instance size, query performance on time-range aggregations 100x+ faster with columnar storage, storage footprint 10–20x smaller with compression enabled. RTABench, a benchmark for real-time analytics workloads, publishes results showing the performance gap between vanilla PostgreSQL and TimescaleDB across real-world query patterns. <a href="https://rtabench.com/">See the benchmark results</a></p>
<h3 id="decision-framework">Decision framework</h3>
<p>Choose a specialized architecture if you score 4+ on the <a href="#when-to-choose-a-different-path">Postgres Workload Scoring Framework</a> AND you're experiencing 2+ early warning signs AND you can project continued data growth.</p>
<p>Strong indicators to act now: you're under 100M rows, you're already building custom partitioning, your team spends 15%+ of engineering time on database optimization, and you can project 500M+ rows within 12 months.</p>
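<p>The 12-month projection can be approximated from the current ingest rate. This is a deliberately naive sketch: it assumes a plain, unpartitioned table named <code>sensor_readings</code> with a <code>time</code> column (both hypothetical), linear growth, and no retention policy deleting old rows.</p>
<pre><code class="language-SQL">WITH recent AS (
    -- Rows ingested over the last 24 hours.
    SELECT count(*) AS rows_last_day
    FROM sensor_readings
    WHERE time &gt;= now() - INTERVAL '1 day'
),
total AS (
    -- Planner's estimate of the current row count (no table scan).
    SELECT GREATEST(reltuples, 0)::bigint AS approx_rows
    FROM pg_class
    WHERE oid = 'sensor_readings'::regclass
)
SELECT approx_rows,
       rows_last_day,
       approx_rows + rows_last_day * 365 AS projected_rows_in_12_months
FROM recent, total;</code></pre>
<p>If ingest is itself growing, treat the result as a lower bound.</p>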
<p>You might not need this if writes are bursty rather than continuous, retention is 7–30 days, queries don't predominantly filter on time ranges, or growth is stable and slow.</p>
<h2 id="optimization-vs-architecture">Optimization vs. Architecture</h2>
<p>There are two different problems that both show up as "database performance is degrading."</p>
<p><strong>Problem 1: Optimization within the right architecture.</strong> The workload fits the database's design. Better indexes, query rewrites, configuration tuning, and hardware upgrades directly improve the trajectory. Postgres expertise solves the problem. For most workloads, vanilla Postgres is the right choice.</p>
<p><strong>Problem 2: The Optimization Treadmill.</strong> The workload hits design tradeoffs baked into the storage engine, concurrency model, and query planner. Optimization helps in the short term but doesn't change the trajectory. Each phase buys time. None buys a different outcome. You're working around the architecture rather than with it.</p>
<p>Knowing which problem you have determines the path forward.</p>
<p>If you followed the optimization phases in this piece, you weren't doing anything wrong. Those were correct responses to the symptoms. Any experienced Postgres team would have done the same. The pattern is common precisely because the progression makes sense at each step.</p>
<p>What changes with recognition is agency. At 10M–50M rows, you can choose a purpose-built architecture in days to weeks and redirect engineering time to product work. At 100M–500M rows, migration is harder but still reasonable, taking 2–6 weeks. At 1B+, it's a multi-month project, and every quarter of delay adds complexity.</p>
<p>The broader principle applies beyond this workload. Different databases have different architectural strengths, so the best choice depends on the workload. Postgres is brilliant for general-purpose relational work. Specialized variants built on top of Postgres excel at specialized patterns. Recognizing when architecture matters more than optimization is an engineering judgment call, not a criticism of the tool.</p>
<p>Architectural fit determines your ceiling. Optimization determines where you operate relative to that ceiling. When you're hitting the ceiling repeatedly, the productive question isn't "how do we optimize better?" It's "are we operating within the right architecture?" With this workload pattern, the ceiling was always there. You just needed enough data volume to find it. Score your workload. If you're at 4+ and under 100M rows, this is the cheapest architectural decision you'll make this year. <a href="https://www.tigerdata.com/docs/about/latest/whitepaper">The whitepaper</a> covers the mechanics. The <a href="https://console.cloud.timescale.com/signup">Tiger Data free trial</a> lets you validate on your own data.</p>
<p><a href="https://timescale.ghost.io/blog/from-4-databases-to-1-how-plexigrid-replaced-influxdb-got-350x-faster-queries-tiger-data/" rel="noreferrer">New: Learn how Plexigrid moved from 4 databases to 1 with Tiger Data.</a></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to Break Your PostgreSQL IIoT Database and Learn Something in the Process]]></title>
            <description><![CDATA[Stress test your PostgreSQL IIoT database to identify bottlenecks, optimize performance, and prevent failure. Learn how to break it safely and design with margin.]]></description>
            <link>https://www.tigerdata.com/blog/how-to-break-postgresql-iiot-database-learn-something-in-process</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/how-to-break-postgresql-iiot-database-learn-something-in-process</guid>
            <category><![CDATA[IoT]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Doug Pagnutti]]></dc:creator>
            <pubDate>Wed, 18 Feb 2026 19:43:56 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/02/thumbnail--4-.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/02/thumbnail--4-.png" alt="How to Break Your PostgreSQL IIoT Database and Learn Something in the Process" /><p>As engineers, we're taught to design for reliability. We do design calculations, run simulations, build and test prototypes, and even then we recognize that these are imperfect, so we include safety factors. When it comes to the Industrial Internet of Things (IIoT) though, we rarely give the same level of scrutiny to the components that we rely on.</p><p>What if we treated our IIoT database the same way we treated the physical things we produce? We build and design a prototype database, and then <a href="https://timescale.ghost.io/blog/postgres-optimization-treadmill/" rel="noreferrer">put it through some serious testing</a>, even to failure.</p><h2 id="the-value-and-perils-of-stress-testing">The Value (and Perils) of Stress Testing</h2><p>Think of database stress testing as a destructive materials test for your data storage. You wouldn't trust a bridge made of untested steel, so don’t trust your database until you know its limits.</p><p><strong>The Value:</strong></p><ul><li><strong>Identify Bottlenecks:</strong> Stress testing reveals the weak links—what is likely to fail first? Will you run out of storage? Will your queries get bogged down? Or will you hit the dreaded ingest wall (when data comes in faster than it can be stored)?</li><li><strong>Determine Real-World Behaviour:</strong> You'll find out exactly how your database performance changes as the amount of data increases. 
Which issues will future-you struggle with?</li><li><a href="https://timescale.ghost.io/blog/postgres-optimization-treadmill/" rel="noreferrer"><strong>Optimize Configuration</strong></a><strong>:</strong> Just like you might build a few different prototypes and see how each one affects failure modes, changing your database configuration, especially when it comes to indices, can dramatically affect how it behaves. Building a rigorous stress testing framework provides a safe way to optimize your design.</li></ul><p>I hope it goes without saying, but please, please don’t run this on your production environment. Even if it’s technically a different database on the same hardware, this test can wreak havoc on your resources and crash your system. You’ve been warned.</p><h2 id="what-to-measure">What to Measure?</h2><p>There’s no point going through all the effort to break your system if you don’t learn anything. Assuming you’re using a PostgreSQL database (<a href="https://www.tigerdata.com/blog/its-2026-just-use-postgres"><u>It’s 2026, Just Use PostgreSQL</u></a>), here is a decent set of metrics to keep track of while you’re putting your database through its paces.</p><h3 id="table-size">Table Size</h3><p>The size of a PostgreSQL table is generally measured by number of rows, but the actual space on disk that it occupies is a sum of the heap (the main relational table), the indices, and the TOAST (storage for large objects).</p><p>The following query will give the number of rows as well as the size of each component of the table in bytes. The toast_size figure is approximate: the subtraction also leaves in the table's free space map and visibility map, which are usually small.</p><pre><code class="language-SQL">SELECT
      reltuples::bigint AS row_count,
      pg_relation_size('iiot_history') AS heap_size,
      pg_indexes_size('iiot_history') AS indices_size,
      pg_table_size('iiot_history') -
            pg_relation_size('iiot_history') AS toast_size
FROM pg_class WHERE relname = 'iiot_history';</code></pre><p>The row_count here comes from reltuples, the planner's estimate (refreshed by VACUUM and ANALYZE), rather than an exact count. Counting rows the standard way, with COUNT(*), requires scanning the whole table, which is going to be painfully slow when we’re building a table big enough to break things.</p><h3 id="table-performance">Table Performance</h3><p>The best way to measure table performance is to use the actual queries that your production system will use. At a minimum, this should include your batched INSERT (you always batch, right?) and at least one common SELECT. Keep in mind that for a table with N rows, query time tends to be constant, log(N), N, or worse, depending on how the indices are structured.</p><p>You can get very accurate timing info from running your queries with the prefix EXPLAIN ANALYZE, and it’s worth doing this at least once to see what the database is doing under the hood. However, I recommend running the whole test with a scripting language and then just timing the execution of that particular step.</p><h3 id="server-performance">Server Performance</h3><p>Don’t forget the engine that’s driving all this machinery. You’ll need to watch the CPU, Memory, Storage, and Network Bandwidth. People in the IT world tend to talk about headroom for a server, and that’s what you’re really looking at: how much spare capacity do you have? Your CPU and Memory usage might spike at times, but the important thing is that it’s not always running at max capacity.</p><p>There are a lot of free and paid tools to monitor these variables. I almost always do this type of test in a VM (easier to clean up the mess when it all breaks) and I like to use <a href="https://prometheus.io/"><u>Prometheus</u></a>, but honestly Perfmon in Windows or top in Linux gives you all you really need.</p><h3 id="setting-limits">Setting Limits</h3><p>It’s helpful to set some limits on these parameters so you know when to stop the test. 
For database size, it might be some measurement like a year's worth of data, or when the drive is 80% full. For ingest timing, I suggest stopping when inserting takes longer than the desired ingest frequency—this is the ingest bottleneck and something you really want to avoid in production. Scan times can be limited by the time it takes for a specific query. Maybe calculating the average value from one tag over the past hour must be less than 10s.</p><h2 id="how-to-simulate-data">How to Simulate Data?</h2><p>There are lots of ways to insert data, but it’s usually a tradeoff between how well the data represents real scenarios and how long it takes to run the test.</p><p>The following is one of my favourite methods for injecting large amounts of data into an IIoT database:</p><p>Say you have a classic IIoT history table like the following:</p><pre><code class="language-SQL">CREATE TABLE iiot_history(
	time TIMESTAMPTZ NOT NULL,
	tag_id INT NOT NULL,
	value DOUBLE PRECISION,
	PRIMARY KEY (tag_id, time)
);</code></pre><p>If you expect to ingest 10,000 tags at 1s intervals, you can use the following INSERT query to add a day’s worth of history to the back end of your table.</p><pre><code class="language-SQL">INSERT INTO iiot_history(time, tag_id, value)
	SELECT *, random() AS value
	FROM (
		-- One timestamp per second for the day before the oldest existing row.
		SELECT generate_series(
			min_date - INTERVAL '1 day',
			min_date - INTERVAL '1s',
			INTERVAL '1s') AS time
		FROM (SELECT LEAST(NOW(), MIN(time)) AS min_date
			FROM iiot_history) AS bounds
	) AS times,
		generate_series(1,10000) as tag_id;</code></pre><p>This will generate random data values for every second during a day and for every tag_id from 1 to 10,000, which is 86,400 × 10,000 = 864 million rows per run. Not exactly as interesting as real data, but enough to fill up your table.</p><p>The nice thing about this query is that you should be able to run it in parallel to your real-time data pipeline and it won’t mess with your data (although an INSERT this large is one long transaction and will compete with live traffic for I/O while it runs). It’s also easy to modify this query to inject more or fewer tags as well as change the time interval if you’re playing around with different configurations.</p><p>If you use this query, or whichever one you prefer, in a script (I usually use Python), then you can automate the whole test. Something along the lines of:</p><ol><li>Get database size</li><li>Run select queries, measure execution time</li><li>Run insert queries several times, measure and average execution time</li><li>Artificially grow database size</li><li>Repeat 1-4 until one of the failure conditions is reached.</li></ol><h2 id="how-to-interpret-results-and-what-to-expect-in-the-real-world">How to Interpret Results and What to Expect in the Real World?</h2><p>Your test results will give you some clear data points, but you still need to do some interpreting.</p><ul><li><strong>Identify the Limiting Component:</strong> Where did the database fail? If it’s a query that took too long, you might be able to speed things up with a clever index. If it’s an insert that took too long, you might be able to speed things up by removing that clever index you added earlier.</li><li><strong>Optimize:</strong> There’s a lot you can do to improve table performance before throwing the whole thing out in frustration:<ol><li><strong>Proper Indexing:</strong> Choosing an index is almost always a tradeoff. For example, indexing the tag_id column before the time column will speed up most queries, at the cost of slower inserts as the table grows. 
Indexing the time column first will avoid the ‘ingest wall’ at the cost of slower queries. Figure out which solution is best.</li><li><strong>Plan for the future:</strong> Will you need more hardware in a few months or a few years? Being able to estimate the life of your existing architecture means you won’t be caught unawares when it no longer suffices.</li><li><strong>Partitioning/Chunking:</strong> For very large tables, you may need to partition appropriately (see PostgreSQL extensions like <a href="https://www.tigerdata.com/timescaledb"><u>TimescaleDB</u></a>). How great would it be to learn you’ll need this before you actually need this.</li></ol></li><li><strong>Add a Safety Factor:</strong> If your test showed a maximum reliable throughput of 15,000 rows/sec, set your operational limit to 10,000 rows/sec. The real world has peaks, unexpected queries, and background maintenance tasks that will steal resources. Like we do with all engineering products, design with margin.</li></ul><p>If you treat your database like a prototype and really put it through its paces, you’ll get a preview of how it’ll behave in the future and make good, proactive design decisions instead of struggling in the future. Now, go break something (and learn).</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Start on Postgres, Scale on Postgres: How TimescaleDB 2.25 Continues to Improve the Way Postgres Scales]]></title>
            <description><![CDATA[Start on Postgres, scale on Postgres: TimescaleDB 2.25 delivers 289× faster queries, better chunk pruning, and lower-cost continuous aggregates at scale.]]></description>
            <link>https://www.tigerdata.com/blog/start-on-postgres-scale-on-postgres</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/start-on-postgres-scale-on-postgres</guid>
            <category><![CDATA[Announcements & Releases]]></category>
            <category><![CDATA[TimescaleDB]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Mike Freedman]]></dc:creator>
            <pubDate>Tue, 17 Feb 2026 17:33:46 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/02/timescaledb-2-25.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/02/timescaledb-2-25.png" alt="Start on Postgres, Scale on Postgres: How TimescaleDB 2.25 Continues to Improve the Way Postgres Scales" /><p>Most developers start building with Postgres because it’s simple, reliable, and flexible. You get a clear relational model, transactional semantics you can trust, and an ecosystem that lets teams move quickly without committing to a complex architecture early. The challenge is keeping that simplicity as systems grow. Higher ingest, larger datasets, and increasingly real-time analytical workloads can push teams toward a second system long before they want one.</p><p>This pressure is most visible in time-series workloads that demand real-time performance. High write rates, append-heavy tables, and repeated queries over recent windows stress both storage and execution paths. Without reducing the amount of work required per query, scale quickly becomes an architectural problem rather than a <a href="https://timescale.ghost.io/blog/postgres-optimization-treadmill/" rel="noreferrer">performance optimization</a> one, shifting effort from incremental tuning to changes in system design.</p><p>TimescaleDB is designed to change that trajectory. “Start on Postgres, scale on Postgres” is a promise, but it is grounded in a specific architectural approach: performance at scale comes from reducing the work the database must do as data grows, then parallelizing what remains. 
TimescaleDB 2.25 continues this evolution by tightening the execution and maintenance paths that dominate cost at scale, so common workloads become cheaper and operationally steadier under sustained growth.</p><p>This release focuses on three outcomes: faster queries without constant tuning, efficient scaling to larger datasets and higher ingest, and real-time analytics that stays current and trustworthy without introducing a second system.</p><h2 id="faster-postgres-queries-at-scale-with-less-tuning">Faster Postgres queries at scale, with less tuning</h2><p>Compression, chunk pruning, and columnar execution already reduce query cost by limiting how much data needs to be read and processed. In 2.25, more queries can avoid work entirely, and the planner is more consistent about selecting those cheaper plans.</p><p>A clear example is aggregation on compressed data. In earlier releases, queries using functions like <code>MIN</code>, <code>MAX</code>, <code>FIRST</code>, or <code>LAST</code> benefited from compression and metadata, but they still required scanning compressed batches and performing aggregation during execution. The scan was cheaper than a row-oriented approach, but it was still work proportional to the data touched.</p><p>In 2.25, these aggregates can often be answered directly from sparse metadata maintained for compressed chunks. The planner can choose a custom execution path that reads summaries rather than scanning or decompressing data. This is implemented via the new <code>ColumnarIndexScan</code> plan node (see <a href="https://github.com/timescale/timescaledb/pull/9088"><u>PR #9088</u></a>,<a href="https://github.com/timescale/timescaledb/pull/9103"> <u>PR #9103</u></a>, and <a href="https://github.com/timescale/timescaledb/pull/9108"><u>PR #9108</u></a>). On workloads where this applies, the 2.25 release notes report this class of queries speeding up by up to 289x. 
For teams running dashboards or monitoring queries over large compressed datasets, this can translate into dramatically faster response times with no query changes required.</p><p>The important shift here is in cost structure. Once an answer can be derived from metadata, performance is no longer tied to the number of rows stored inside a chunk. It is tied to the minimum work required to identify relevant chunks and read their summaries, which becomes more valuable as datasets grow.</p><p>A complementary improvement applies the same idea to another common pattern: time-filtered queries that do not need to materialize column values. For queries like <code>SELECT COUNT(*) FROM events WHERE time &gt; ...</code>, previously, the execution path could still require decompressing the time column to evaluate the predicate, even though the query does not need to read time values for every row. In 2.25, the time column can often be skipped entirely for these cases, reducing CPU and memory pressure while preserving the same result (see <a href="https://github.com/timescale/timescaledb/pull/9094"><u>PR #9094</u></a>). The release notes describe this pattern as up to 50x faster for the example query.</p><p>As these fast paths expand, plan stability becomes just as important as peak speed. Even when an efficient path exists, teams feel it when the planner chooses it inconsistently or when small changes in query shape lead to surprising regressions. In 2.25, planner improvements around columnar scan paths and ordering help make compression-aware execution more predictable (see <a href="https://github.com/timescale/timescaledb/pull/8986"><u>PR #8986</u></a> and <a href="https://github.com/timescale/timescaledb/pull/9133"><u>PR #9133</u></a>). 
Fewer surprises mean less time spent tuning and diagnosing why a query slowed down as data evolved.</p><h2 id="efficient-scaling-for-high-ingest-postgres-workloads">Efficient scaling for high-ingest Postgres workloads</h2><p>A hard part of scaling is not only achieving good performance at a given size, but preserving efficiency as data volume, ingest rate, and concurrency grow together over time. In practice, scaling pressure shows up in two ways. Some costs grow gradually, such as planning and execution work increasing with the number of partitions. Others appear more abruptly, when accumulated complexity makes execution brittle and small changes in data or query shape trigger different plans and sudden slowdowns.</p><p>TimescaleDB’s scaling model is designed to address both. It relies on clear boundaries: partitioning data into chunks, using metadata to prune irrelevant chunks, and compressing data to reduce the work required within each chunk. In 2.25, several refinements make these boundaries behave more efficiently and consistently under sustained growth.</p><p>One pressure point is that chunk counts rise over long retention windows, making pruning and constraint handling increasingly important. Earlier versions already used constraints and metadata to skip irrelevant chunks, but there were cases where constraint handling became more permissive than necessary, causing queries to consider more chunks than required as datasets aged. In 2.25, constraint handling improves for fully covered chunks, helping keep both planning and execution costs more tightly bounded as data volume increases (see <a href="https://github.com/timescale/timescaledb/pull/9127"><u>PR #9127</u></a>).</p><p>Planning behavior under high partition counts is another area where inefficiency and brittleness can emerge together. As hypertables accumulate thousands of chunks, planning time and plan quality can matter as much as execution speed, especially for joins and more complex query shapes. 
TimescaleDB 2.25 includes fixes for a planning performance regression on Postgres 16 and later affecting some join queries (see <a href="https://github.com/timescale/timescaledb/pull/8706"><u>PR #8706</u></a>). These changes reduce both how quickly planning cost grows and how likely it is to tip into unstable behavior as workloads evolve.</p><p>The result is more efficient scaling in practice. Costs still grow with data, but they grow more slowly and with fewer surprises, allowing Postgres to continue scaling in place rather than forcing architectural changes to manage accumulated overhead.</p><h2 id="real-time-analytics-in-postgres-without-a-split-architecture">Real-time analytics in Postgres, without a split architecture</h2><p>As refresh frequency increases and datasets grow, keeping analytics fresh inside the primary database can create background pressure. That pressure grows unless refresh and maintenance paths stay efficient. TimescaleDB has long supported real-time analytics inside Postgres through continuous aggregates, compression, and retention policies. In 2.25, the focus is on lowering the operational footprint of staying current as systems run continuously.</p><p>One improvement is compressed continuous aggregate refresh. Earlier versions supported refreshing into compressed hypertables, but the refresh path could include intermediate steps that added extra I/O and CPU work. In 2.25, direct compression on continuous aggregate refresh is enabled via a configuration option, reducing unnecessary data movement when keeping aggregates up to date (see <a href="https://github.com/timescale/timescaledb/pull/8777"><u>PR #8777</u></a> and <a href="https://github.com/timescale/timescaledb/pull/9038"><u>PR #9038</u></a>). The semantics are unchanged, but the cost of maintaining freshness is lower, especially for frequent refresh schedules.</p><p>This is complemented by refinements to batching. 
Large refresh transactions can temporarily increase WAL volume and create uneven load. In 2.25, the default <code>buckets_per_batch</code> for continuous aggregate refresh policies is adjusted to keep transactions smaller (from 1 to 10 buckets), reducing WAL holding and making refresh behavior steadier under sustained ingest (see <a href="https://github.com/timescale/timescaledb/pull/9031"><u>PR #9031</u></a>).</p><p>The release also includes incremental improvements that reduce background churn from lifecycle operations like retention and deletes on long-running datasets, along with correctness and robustness fixes for compressed and partitioned workloads. For example, support for retention policies on UUIDv7-partitioned hypertables expands the set of configurations where lifecycle management remains reliable over time (see <a href="https://github.com/timescale/timescaledb/pull/9102"><u>PR #9102</u></a>). These changes are small individually, but they matter for trust. Real-time analytics only works if results stay aligned with transactional truth as schemas and workloads evolve.</p><h2 id="closing">Closing</h2><p>TimescaleDB 2.25 continues to make Postgres a better place to run real-time analytics at scale: faster queries through less work, smoother behavior as data and ingest grow, and lower operational overhead for keeping analytics current and correct.&nbsp;</p><p>All in service of a simple yet powerful idea: <strong>start on Postgres, scale on Postgres. 
</strong><a href="https://timescale.ghost.io/blog/postgres-optimization-treadmill/" rel="noreferrer"><strong>Learn why vanilla Postgres hits performance ceilings at scale</strong></a><strong>.</strong></p><p><strong><em>To learn more, check out the </em></strong><a href="https://github.com/timescale/timescaledb/releases"><strong><em><u>full release notes</u></em></strong></a><strong><em> or </em></strong><a href="https://console.cloud.timescale.com/signup"><strong><em><u>try Tiger Cloud for free</u></em></strong></a><strong><em> and experience TimescaleDB 2.25 on your largest hypertables. </em></strong><a href="https://www.tigerdata.com/blog/from-4-databases-to-1-how-plexigrid-replaced-influxdb-got-350x-faster-queries-tiger-data" rel="noreferrer"><strong><em>Learn how Plexigrid consolidated 4 databases into Postgres and got 350x faster queries.</em></strong></a></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Six Signs That Postgres Tuning Won't Fix Your Performance Problems]]></title>
            <description><![CDATA[When Postgres tuning won't fix performance: recognize the six characteristics of time-series workloads that need TimescaleDB's purpose-built architecture.]]></description>
            <link>https://www.tigerdata.com/blog/six-signs-postgres-tuning-wont-fix-performance-problems</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/six-signs-postgres-tuning-wont-fix-performance-problems</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Matty Stratton]]></dc:creator>
            <pubDate>Thu, 12 Feb 2026 21:26:14 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/02/Postgres-Tuning-Performance-compressed.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/02/Postgres-Tuning-Performance-compressed.png" alt="Six Signs That Postgres Tuning Won't Fix Your Performance Problems" /><p>You've added indexes. You've partitioned tables. You've tuned autovacuum within an inch of its life. Performance improves for a few months, and then the dashboards go red again. Sound familiar?</p><p>If so, you're probably not doing anything wrong. You're running a workload that vanilla Postgres was never designed for, and no amount of configuration will change that.</p><p>It's not transactional. It's not a data warehouse. It's analytics on live data: high-frequency ingestion that stays operationally queryable. If you're running this pattern, you've already been through the cycle: add indexes, partition tables, tune autovacuum, upgrade instances. Each fix buys a few months. Then the metrics climb again.</p><p><strong>The longer you wait, the harder the migration. At 10M rows it takes days. At 500M rows, weeks. At 1B+, months.</strong> Recognizing the pattern early is the highest-leverage decision you can make.</p><p>This post describes six characteristics that define this workload. If four or more apply to your system, the friction is architectural, not operational. (For a deeper look at how <a href="https://www.tigerdata.com/learn/the-best-time-series-databases-compared" rel="noreferrer">purpose-built time-series architecture</a> addresses these constraints, see the <a href="https://www.tigerdata.com/docs/about/latest/whitepaper"><u>Tiger Data architecture whitepaper</u></a>.)</p><h2 id="continuous-high-frequency-ingestion">Continuous High-Frequency Ingestion</h2><p>The database is absorbing thousands to hundreds of thousands of inserts per second. Not in bursts. Not during a nightly ETL window. 
Continuously, 24/7.</p><p>Consider a semiconductor fab with 8,000 CNC machines and inspection stations on the floor, each reporting vibration, temperature, spindle speed, and tool wear every 2 seconds. That's 4,000 inserts/sec from a single facility. Add process control events, quality inspection results, and environmental monitoring across three plants, and you're at 30-50K inserts/sec before accounting for growth.</p><pre><code class="language-SQL">-- What a single station's insert stream looks like
INSERT INTO machine_telemetry (ts, station_id, metric, value)
VALUES
  (now(), 'CNC-4401', 'vibration_mm_s', 2.34),
  (now(), 'CNC-4401', 'spindle_rpm', 12045),
  (now(), 'CNC-4401', 'coolant_temp_c', 31.2),
  (now(), 'CNC-4401', 'tool_wear_pct', 67.8);
-- Multiply by 8,000 stations × 0.5 Hz × 3 facilities
</code></pre><p>This matters because Postgres needs breathing room to run maintenance. Autovacuum, index maintenance, statistics collection. Continuous ingestion means maintenance always competes with writes. There is no off-peak window.</p><h2 id="queries-revolve-around-time">Queries Revolve Around Time</h2><p>Nearly every row has a timestamp, and nearly every query filters on a time range. Last 30 minutes. This week versus last week. Everything between two dates.</p><p>A trading platform captures every order, fill, and cancellation across multiple venues. The operations team monitors execution quality in real time. The compliance team audits historical patterns. Both teams write queries that look like this:</p><pre><code class="language-SQL">-- Operations: real-time execution quality
SELECT venue,
  avg(fill_latency_us) AS avg_fill_latency_us,
  percentile_cont(0.99) WITHIN GROUP (ORDER BY fill_latency_us) AS p99_fill_latency_us
FROM executions
WHERE ts &gt; now() - interval '15 minutes'
GROUP BY venue;

-- Compliance: historical pattern detection
SELECT account_id, count(*) as cancel_count
FROM order_events
WHERE ts &gt;= '2025-01-01' AND ts &lt; '2025-04-01'  -- half-open range so all of March 31 is included
  AND event_type = 'cancel'
  AND cancel_reason = 'client_requested'
GROUP BY account_id
HAVING count(*) &gt; 500;
</code></pre><p>Time is the primary axis for both storage and retrieval. General-purpose B-tree indexes aren't built for this access pattern, which is why teams end up building manual partitioning schemes and custom tooling to get time-range queries to perform.</p><h2 id="data-is-append-only">Data Is Append-Only</h2><p>Once a row lands, it doesn't change. Sensor readings are immutable. Financial transactions don't get updated. Log entries are permanent. When data gets removed, it happens in bulk: drop an entire month's partition, not individual rows.</p><p>A wind farm operator collects turbine performance data: blade pitch, rotor speed, power output, nacelle orientation. Once recorded, these readings are facts. They never get corrected or overwritten.</p><pre><code class="language-SQL">-- This is the entire write pattern. INSERT. No UPDATE. No single-row DELETE.
INSERT INTO turbine_readings
  (ts, turbine_id, blade_pitch_deg, rotor_rpm, power_kw, wind_speed_ms)
VALUES
  (now(), 'WT-112', 12.4, 14.2, 2840, 11.3);

-- Data removal is always bulk
DROP TABLE turbine_readings_2023_q1;</code></pre><p><strong>Every row you insert carries 23 bytes of MVCC transaction metadata, on data you will never update.</strong> Autovacuum still scans these tables constantly, even though an append-only workload produces almost no dead tuples for it to clean up. At 50K inserts/sec, that's MVCC overhead on 4.3 billion rows per day that will never be modified. You're paying the full cost of a concurrency model designed for workloads that look nothing like yours.</p><h2 id="retention-is-measured-in-months-or-years">Retention Is Measured in Months or Years</h2><p>Seven years of financial records for compliance. Quarters of manufacturing data for root cause analysis. Two-plus years of training data for ML pipelines.</p><p>A pharmaceutical manufacturer tracks environmental conditions (temperature, humidity, particulate count) across cleanroom facilities to meet FDA 21 CFR Part 11 requirements. When a batch fails quality control six months after production, the investigation pulls environmental data from the exact time window the batch was in each room.</p><pre><code class="language-SQL">-- Root cause investigation: what were cleanroom conditions
-- during a batch produced 6 months ago?
SELECT room_id, avg(temp_c), max(particulate_count),
  bool_or(humidity_pct &gt; 45) as humidity_excursion
FROM cleanroom_environment
WHERE ts BETWEEN '2025-08-14 06:00' AND '2025-08-14 18:00'
  AND facility = 'building_3'
GROUP BY room_id;
</code></pre><p>Short retention hides architectural problems because old data ages out. Long retention removes that escape valve. At 50K inserts per second, that's roughly 1.6 trillion rows per year. After three years: about 4.7 trillion rows.</p><h2 id="queries-are-latency-sensitive">Queries Are Latency-Sensitive</h2><p>This data isn't sitting in cold storage waiting for a weekly report. It's being queried actively, under latency constraints.</p><p>A SaaS observability platform collects metrics from thousands of customer deployments. The product serves real-time dashboards, automated alerting, and deep-dive investigation, all from the same database. Latency expectations form a gradient:</p><pre><code class="language-SQL">-- Dashboard widget: last 5 minutes, needs &lt; 100ms response
SELECT host_id, avg(cpu_pct), max(mem_used_bytes)
FROM host_metrics
WHERE ts &gt; now() - interval '5 minutes'
  AND customer_id = 'cust_8821'
GROUP BY host_id;

-- Alert evaluation: last hour, needs &lt; 500ms
SELECT host_id, avg(cpu_pct)
FROM host_metrics
WHERE ts &gt; now() - interval '1 hour'
  AND customer_id = 'cust_8821'
GROUP BY host_id
HAVING avg(cpu_pct) &gt; 90;

-- Incident investigation: last 3 months, seconds acceptable
SELECT date_trunc('hour', ts), avg(cpu_pct), avg(mem_used_bytes)
FROM host_metrics
WHERE ts &gt; now() - interval '90 days'
  AND host_id = 'host-a3f9c'
GROUP BY 1 ORDER BY 1;
</code></pre><p>Data warehouse scope with operational latency requirements. All from a single system.</p><h2 id="growth-is-sustained">Growth Is Sustained</h2><p>Data volume growing 50-100%+ year over year on a predictable curve. Static workloads can be over-provisioned once and left alone. Growing workloads demand constant re-optimization.</p><p>A logistics company tracks GPS position, engine diagnostics, and cargo conditions across a fleet of refrigerated trucks. They started with 200 trucks. Expansion added 150 trucks in year one, another 300 in year two. Each truck reports every 10 seconds.</p><pre><code class="language-markdown">Year 1:  200 trucks ×  6 readings/min × 1,440 min/day = 1.7M rows/day
Year 2:  350 trucks ×  6 readings/min × 1,440 min/day = 3.0M rows/day
Year 3:  650 trucks ×  6 readings/min × 1,440 min/day = 5.6M rows/day

Cumulative after 3 years: (1.7M + 3.0M + 5.6M) rows/day × 365 days ≈ 3.8 billion rows
</code></pre><p>Every optimization you ship today is solving for a table size you'll blow past in six months. The treadmill doesn't stop.</p><h2 id="what-to-do-with-this">What to Do With This</h2><p>Count how many of these characteristics describe your system. If it's two or three, standard <a href="https://timescale.ghost.io/blog/postgres-optimization-treadmill/" rel="noreferrer">Postgres optimization</a> should have a real impact. The architecture fits your workload. Better indexes, smarter queries, autovacuum tuning. The usual playbook works.</p><p>If it's four or more, however, the friction is architectural, not operational. You don't need to abandon Postgres. Tiger Data extends vanilla Postgres to handle exactly this workload. You keep SQL, your extensions, your team's expertise, and the entire Postgres ecosystem. What changes is the storage engine, partitioning, and query planning underneath.</p><p>The numbers bear this out. In <a href="https://www.tigerdata.com/blog/postgresql-timescaledb-1000x-faster-queries-90-data-compression-and-much-more"><u>benchmarks against vanilla PostgreSQL at one billion rows</u></a>, TimescaleDB delivered up to 1,000x faster query performance while reducing storage by 90% through native compression. Ingest throughput stays constant past 10 billion rows, while PostgreSQL's performance degrades as indexed tables outgrow memory (throughput that starts at 100K+ rows/sec can crash to hundreds). On Azure infrastructure running <a href="https://www.tigerdata.com/blog/benchmark-results-fastest-time-series-database-azure"><u>RTABench workloads</u></a>, Tiger Cloud was 1,200x faster than vanilla PostgreSQL across 40 real-time analytics queries. These aren't synthetic edge cases. 
They're the exact query patterns this post describes: time-range filters, aggregations, selective scans on growing datasets.</p><p><em>This post is part of a series on Postgres performance limits for </em><a href="https://www.tigerdata.com/learn/time-series-database-what-it-is-how-it-works-and-when-you-need-one" rel="noreferrer"><em>high-frequency data workloads</em></a><em>. The full analysis, including a workload scoring framework and migration complexity breakdown at different scales, is in the anchor essay:</em><a href="https://www.tigerdata.com/blog/postgres-optimization-treadmill" rel="noreferrer"><em> <u>Understanding Postgres Performance Limits for Analytics on Live Data</u></em></a><em>. Ready to test it on your own data?</em><a href="https://console.cloud.timescale.com/signup"><em> <u>Start a free Tiger Data trial.</u></em></a></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Elasticsearch's Hybrid Search, Now in Postgres (BM25 + Vector + RRF)]]></title>
            <description><![CDATA[Build hybrid search in Postgres with pg_textsearch BM25, pgvectorscale vectors, and RRF. Auto-sync embeddings with pgai—no Elasticsearch pipeline needed.]]></description>
            <link>https://www.tigerdata.com/blog/elasticsearchs-hybrid-search-now-in-postgres-bm25-vector-rrf</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/elasticsearchs-hybrid-search-now-in-postgres-bm25-vector-rrf</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Raja Rao DV]]></dc:creator>
            <pubDate>Mon, 09 Feb 2026 15:34:33 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/02/hybrid-search-thumbnail.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/02/hybrid-search-thumbnail.png" alt="Elasticsearch's Hybrid Search, Now in Postgres (BM25 + Vector + RRF)" /><p>Search is one of those problems that’s deceptively hard. You think you can just<strong> </strong><code>LIKE '%query%'</code><strong> </strong>your way through it, and then you spend three months learning why that doesn’t work.</p><p>Here’s the problem: sometimes users search with exact keywords like “PostgreSQL error 23505”. Other times they search with the meaning: “why is my database slow”. Most of the time, it’s somewhere in between.</p><p>Documents are the same way. Some are full of specific terms and jargon. Others are conversational and conceptual. Most are a mix of both.</p><p>So you have queries that could be keywords or meaning, hitting documents that could be keywords or meaning. That’s four combinations:</p>
<!--kg-card-begin: html-->
<table style="border:none;border-collapse:collapse;"><colgroup><col width="223"><col width="196"><col width="225"></colgroup><tbody><tr style="height:28.5pt"><td style="border-left:solid #333333 0.75pt;border-right:solid #333333 0.75pt;border-bottom:solid #333333 0.75pt;border-top:solid #333333 0.75pt;vertical-align:top;background-color:#f0f0f0;padding:8pt 9pt 8pt 9pt;overflow:hidden;overflow-wrap:break-word;"><br></td><td style="border-left:solid #333333 0.75pt;border-right:solid #333333 0.75pt;border-bottom:solid #333333 0.75pt;border-top:solid #333333 0.75pt;vertical-align:top;background-color:#f0f0f0;padding:8pt 9pt 8pt 9pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Document has Keywords</span></p></td><td style="border-left:solid #333333 0.75pt;border-right:solid #333333 0.75pt;border-bottom:solid #333333 0.75pt;border-top:solid #333333 0.75pt;vertical-align:top;background-color:#f0f0f0;padding:8pt 9pt 8pt 9pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Document has Meaning</span></p></td></tr><tr style="height:30pt"><td style="border-left:solid #333333 0.75pt;border-right:solid #333333 0.75pt;border-bottom:solid #333333 0.75pt;border-top:solid #333333 0.75pt;vertical-align:top;padding:8pt 9pt 8pt 9pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span 
style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Query has Keywords</span></p></td><td style="border-left:solid #333333 0.75pt;border-right:solid #333333 0.75pt;border-bottom:solid #333333 0.75pt;border-top:solid #333333 0.75pt;vertical-align:top;padding:8pt 9pt 8pt 9pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">✅ BM25 nails it</span></p></td><td style="border-left:solid #333333 0.75pt;border-right:solid #333333 0.75pt;border-bottom:solid #333333 0.75pt;border-top:solid #333333 0.75pt;vertical-align:top;padding:8pt 9pt 8pt 9pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">❌ BM25 misses it</span></p></td></tr><tr style="height:30pt"><td style="border-left:solid #333333 0.75pt;border-right:solid #333333 0.75pt;border-bottom:solid #333333 0.75pt;border-top:solid #333333 0.75pt;vertical-align:top;padding:8pt 9pt 8pt 9pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Query has 
Meaning</span></p></td><td style="border-left:solid #333333 0.75pt;border-right:solid #333333 0.75pt;border-bottom:solid #333333 0.75pt;border-top:solid #333333 0.75pt;vertical-align:top;padding:8pt 9pt 8pt 9pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">❌ Vectors miss it</span></p></td><td style="border-left:solid #333333 0.75pt;border-right:solid #333333 0.75pt;border-bottom:solid #333333 0.75pt;border-top:solid #333333 0.75pt;vertical-align:top;padding:8pt 9pt 8pt 9pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">✅ Vectors nail it</span></p></td></tr></tbody></table>
<!--kg-card-end: html-->
<p>No single approach covers all four quadrants. You need both keyword search AND vector search. And you need a way to combine them intelligently.</p><p>That’s exactly what Elasticsearch does. It uses <strong>BM25</strong> for keyword ranking, <strong>vector embeddings</strong> for semantic search, and <strong>RRF (Reciprocal Rank Fusion)</strong> to merge the results into a single ranked list. This combination is called <strong>hybrid search</strong>, and it’s why Elasticsearch actually works.</p><p>But here’s the trade-off: to use Elasticsearch, you need to build a pipeline. Your data lives in Postgres, but search lives in Elasticsearch. So you’re stuck with:</p><pre><code class="language-markdown">Postgres → Kafka/Debezium → Elasticsearch</code></pre><p>That’s three systems to manage. Three things that can break. Sync jobs to maintain. Stale data to debug. And with AI agents now needing to search through docs and codebases on the fly, the pipeline problem is getting worse. You can’t easily spin up a test environment when your search lives in a completely different system.</p><p>Still, teams pay for it. <a href="https://ir.elastic.co/"><u>Over a billion dollars a year</u></a>, collectively. 
Because search that works is worth the pain.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/02/take-my-money2.png" class="kg-image" alt="Dealing with three systems to manage sync jobs" loading="lazy" width="1536" height="1024" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2026/02/take-my-money2.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2026/02/take-my-money2.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/02/take-my-money2.png 1536w" sizes="(min-width: 720px) 720px"></figure><p><strong>Here’s the good news:</strong> all three pieces of Elasticsearch’s hybrid search are now available in Postgres:&nbsp;</p><ul><li><strong>BM25</strong> via <a href="https://github.com/timescale/pg_textsearch"><u>pg_textsearch</u></a> (open source, PostgreSQL license)&nbsp;</li><li><strong>Vector search</strong> via <a href="https://github.com/timescale/pgvectorscale"><u>pgvectorscale</u></a> (high-performance DiskANN)&nbsp;</li><li><strong>RRF</strong>? That’s just SQL. No extension needed.</li></ul><p>And <a href="https://github.com/timescale/pgai"><u>pgai</u></a> eliminates the embedding pipeline entirely (no more Postgres → Kafka → Elasticsearch sync jobs). 
It automatically syncs changes to the data and updates the corresponding embeddings appropriately.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/02/before-after-hybrid-search.png" class="kg-image" alt="Before / After Elasticsearch" loading="lazy" width="1536" height="1024" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2026/02/before-after-hybrid-search.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2026/02/before-after-hybrid-search.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/02/before-after-hybrid-search.png 1536w" sizes="(min-width: 720px) 720px"></figure><p>We’ve already covered <a href="https://www.tigerdata.com/blog/you-dont-need-elasticsearch-bm25-is-now-in-postgres"><u>how BM25 works</u></a>. This blog focuses on Hybrid Search, RRF, pgai, how they all work together, why it’s elegant, and how to implement hybrid search entirely in Postgres.</p><h2 id="how-rrf-reciprocal-rank-fusion-works">How RRF (Reciprocal Rank Fusion) Works</h2><p>RRF (Reciprocal Rank Fusion) is elegantly simple. 
It’s the industry standard for combining ranked lists, and it’s what Elasticsearch uses for hybrid search.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/02/data-src-image-a5b72239-16b4-4835-ae9e-1ab12eb32fe1.png" class="kg-image" alt="" loading="lazy" width="1024" height="572" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2026/02/data-src-image-a5b72239-16b4-4835-ae9e-1ab12eb32fe1.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2026/02/data-src-image-a5b72239-16b4-4835-ae9e-1ab12eb32fe1.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/02/data-src-image-a5b72239-16b4-4835-ae9e-1ab12eb32fe1.png 1024w" sizes="(min-width: 720px) 720px"></figure><h3 id="the-problem">The Problem</h3><p>You run two searches and get two ranked lists:</p><p><strong>BM25 Results (keyword):</strong> 1. Doc A (score: 15.2) 2. Doc B (score: 12.1) 3. Doc C (score: 8.4)</p><p><strong>Vector Results (semantic):</strong> 1. Doc C (distance: 0.12) 2. Doc D (distance: 0.18) 3. Doc A (distance: 0.25)</p><p>How do you combine them? You can’t just add the scores. They’re on completely different scales. BM25 scores might be 0-50. Vector distances are 0-2.</p><h3 id="the-rrf-solution">The RRF Solution</h3><p>RRF ignores the actual scores. It only cares about <strong>rank position</strong>:</p><pre><code class="language-markdown">RRF_score = Σ (1 / (k + rank))</code></pre><p>Where: </p><ul><li><code>k</code> is a constant (typically 60) </li><li><code>rank</code> is the position (1st, 2nd, 3rd…)</li></ul><p>That’s it. Dead simple.</p><h3 id="worked-example">Worked Example</h3><p>Let’s calculate RRF scores for our documents:</p>
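<p>As a rough sketch of that arithmetic (not a production search query), the formula can be run in plain SQL. The two ranked lists are hard-coded as hypothetical <code>VALUES</code> rows, the names <code>bm25</code>, <code>vec</code>, <code>ranked</code>, and <code>rnk</code> are made up for illustration, and <code>k</code> is assumed to be 60:</p><pre><code class="language-SQL">-- Hypothetical inline data: the two ranked lists from the example above
WITH bm25(doc, rnk) AS (
  VALUES ('Doc A', 1), ('Doc B', 2), ('Doc C', 3)
),
vec(doc, rnk) AS (
  VALUES ('Doc C', 1), ('Doc D', 2), ('Doc A', 3)
),
-- Stack both lists; a document found by both searches gets two rows
ranked AS (
  SELECT doc, rnk FROM bm25
  UNION ALL
  SELECT doc, rnk FROM vec
)
-- RRF_score = sum over lists of 1 / (k + rank), with k = 60
SELECT doc,
       round(sum(1.0 / (60 + rnk)), 4) AS rrf_score
FROM ranked
GROUP BY doc
ORDER BY rrf_score DESC, doc;
</code></pre><p>Under these assumptions, Doc A and Doc C tie at roughly 0.0323 because each holds one first-place and one third-place rank. Doc B and Doc D each appear in only one list at rank 2, so they score 1/(60+2) ≈ 0.0161.</p>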
<!--kg-card-begin: html-->
<table style="border:none;border-collapse:collapse;"><colgroup><col width="146"><col width="113"><col width="109"><col width="179"><col width="151"></colgroup><tbody><tr style="height:28.5pt"><td style="border-left:solid #333333 0.75pt;border-right:solid #333333 0.75pt;border-bottom:solid #333333 0.75pt;border-top:solid #333333 0.75pt;vertical-align:top;background-color:#f0f0f0;padding:8pt 9pt 8pt 9pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Document</span></p></td><td style="border-left:solid #333333 0.75pt;border-right:solid #333333 0.75pt;border-bottom:solid #333333 0.75pt;border-top:solid #333333 0.75pt;vertical-align:top;background-color:#f0f0f0;padding:8pt 9pt 8pt 9pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">BM25 Rank</span></p></td><td style="border-left:solid #333333 0.75pt;border-right:solid #333333 0.75pt;border-bottom:solid #333333 0.75pt;border-top:solid #333333 0.75pt;vertical-align:top;background-color:#f0f0f0;padding:8pt 9pt 8pt 9pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Vector Rank</span></p></td><td style="border-left:solid #333333 
0.75pt;border-right:solid #333333 0.75pt;border-bottom:solid #333333 0.75pt;border-top:solid #333333 0.75pt;vertical-align:top;background-color:#f0f0f0;padding:8pt 9pt 8pt 9pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Calculation</span></p></td><td style="border-left:solid #333333 0.75pt;border-right:solid #333333 0.75pt;border-bottom:solid #333333 0.75pt;border-top:solid #333333 0.75pt;vertical-align:top;background-color:#f0f0f0;padding:8pt 9pt 8pt 9pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">RRF Score</span></p></td></tr><tr style="height:28.5pt"><td style="border-left:solid #333333 0.75pt;border-right:solid #333333 0.75pt;border-bottom:solid #333333 0.75pt;border-top:solid #333333 0.75pt;vertical-align:top;padding:8pt 9pt 8pt 9pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Doc A</span></p></td><td style="border-left:solid #333333 0.75pt;border-right:solid #333333 0.75pt;border-bottom:solid #333333 0.75pt;border-top:solid #333333 0.75pt;vertical-align:top;padding:8pt 9pt 8pt 9pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" 
style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">1</span></p></td><td style="border-left:solid #333333 0.75pt;border-right:solid #333333 0.75pt;border-bottom:solid #333333 0.75pt;border-top:solid #333333 0.75pt;vertical-align:top;padding:8pt 9pt 8pt 9pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">3</span></p></td><td style="border-left:solid #333333 0.75pt;border-right:solid #333333 0.75pt;border-bottom:solid #333333 0.75pt;border-top:solid #333333 0.75pt;vertical-align:top;padding:8pt 9pt 8pt 9pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">1/(60+1) + 1/(60+3)</span></p></td><td style="border-left:solid #333333 0.75pt;border-right:solid #333333 0.75pt;border-bottom:solid #333333 0.75pt;border-top:solid #333333 0.75pt;vertical-align:top;padding:8pt 9pt 8pt 9pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span 
style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">0.0323</span></p></td></tr><tr style="height:28.5pt"><td style="border-left:solid #333333 0.75pt;border-right:solid #333333 0.75pt;border-bottom:solid #333333 0.75pt;border-top:solid #333333 0.75pt;vertical-align:top;padding:8pt 9pt 8pt 9pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Doc C</span></p></td><td style="border-left:solid #333333 0.75pt;border-right:solid #333333 0.75pt;border-bottom:solid #333333 0.75pt;border-top:solid #333333 0.75pt;vertical-align:top;padding:8pt 9pt 8pt 9pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">3</span></p></td><td style="border-left:solid #333333 0.75pt;border-right:solid #333333 0.75pt;border-bottom:solid #333333 0.75pt;border-top:solid #333333 0.75pt;vertical-align:top;padding:8pt 9pt 8pt 9pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">1</span></p></td><td style="border-left:solid #333333 
0.75pt;border-right:solid #333333 0.75pt;border-bottom:solid #333333 0.75pt;border-top:solid #333333 0.75pt;vertical-align:top;padding:8pt 9pt 8pt 9pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">1/(60+3) + 1/(60+1)</span></p></td><td style="border-left:solid #333333 0.75pt;border-right:solid #333333 0.75pt;border-bottom:solid #333333 0.75pt;border-top:solid #333333 0.75pt;vertical-align:top;padding:8pt 9pt 8pt 9pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">0.0323</span></p></td></tr><tr style="height:28.5pt"><td style="border-left:solid #333333 0.75pt;border-right:solid #333333 0.75pt;border-bottom:solid #333333 0.75pt;border-top:solid #333333 0.75pt;vertical-align:top;padding:8pt 9pt 8pt 9pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Doc B</span></p></td><td style="border-left:solid #333333 0.75pt;border-right:solid #333333 0.75pt;border-bottom:solid #333333 0.75pt;border-top:solid #333333 0.75pt;vertical-align:top;padding:8pt 9pt 8pt 9pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span 
style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">2</span></p></td><td style="border-left:solid #333333 0.75pt;border-right:solid #333333 0.75pt;border-bottom:solid #333333 0.75pt;border-top:solid #333333 0.75pt;vertical-align:top;padding:8pt 9pt 8pt 9pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">-</span></p></td><td style="border-left:solid #333333 0.75pt;border-right:solid #333333 0.75pt;border-bottom:solid #333333 0.75pt;border-top:solid #333333 0.75pt;vertical-align:top;padding:8pt 9pt 8pt 9pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">1/(60+2) + 0</span></p></td><td style="border-left:solid #333333 0.75pt;border-right:solid #333333 0.75pt;border-bottom:solid #333333 0.75pt;border-top:solid #333333 0.75pt;vertical-align:top;padding:8pt 9pt 8pt 9pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">0.0161</span></p></td></tr><tr style="height:28.5pt"><td style="border-left:solid 
#333333 0.75pt;border-right:solid #333333 0.75pt;border-bottom:solid #333333 0.75pt;border-top:solid #333333 0.75pt;vertical-align:top;padding:8pt 9pt 8pt 9pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Doc D</span></p></td><td style="border-left:solid #333333 0.75pt;border-right:solid #333333 0.75pt;border-bottom:solid #333333 0.75pt;border-top:solid #333333 0.75pt;vertical-align:top;padding:8pt 9pt 8pt 9pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">-</span></p></td><td style="border-left:solid #333333 0.75pt;border-right:solid #333333 0.75pt;border-bottom:solid #333333 0.75pt;border-top:solid #333333 0.75pt;vertical-align:top;padding:8pt 9pt 8pt 9pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">2</span></p></td><td style="border-left:solid #333333 0.75pt;border-right:solid #333333 0.75pt;border-bottom:solid #333333 0.75pt;border-top:solid #333333 0.75pt;vertical-align:top;padding:8pt 9pt 8pt 9pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span 
style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">0 + 1/(60+2)</span></p></td><td style="border-left:solid #333333 0.75pt;border-right:solid #333333 0.75pt;border-bottom:solid #333333 0.75pt;border-top:solid #333333 0.75pt;vertical-align:top;padding:8pt 9pt 8pt 9pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">0.0161</span></p></td></tr></tbody></table>
<!--kg-card-end: html-->
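<p>The scores in the table above can be reproduced with plain SQL. This is a minimal sketch rather than part of the setup that follows: the <code>keyword_ranks</code> and <code>vector_ranks</code> CTEs are hypothetical stand-ins that hardcode the ranks from the table (assuming the first rank column is the keyword list and the second is the vector list), and the query needs no extensions.</p><pre><code class="language-SQL">-- Hypothetical rank lists copied from the table above
WITH keyword_ranks(doc, rank) AS (
  VALUES ('Doc A', 1), ('Doc B', 2), ('Doc C', 3)
),
vector_ranks(doc, rank) AS (
  VALUES ('Doc C', 1), ('Doc D', 2), ('Doc A', 3)
)
SELECT
  doc,
  ROUND(
    COALESCE(1.0 / (60 + k.rank), 0) +  -- 0 if the doc is missing from the keyword list
    COALESCE(1.0 / (60 + v.rank), 0),   -- 0 if the doc is missing from the vector list
    4
  ) AS rrf_score
FROM keyword_ranks k
FULL OUTER JOIN vector_ranks v USING (doc) -- keep docs that appear in only one list
ORDER BY rrf_score DESC, doc;
-- Doc A and Doc C score 0.0323; Doc B and Doc D score 0.0161</code></pre>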
<p><strong>Result:</strong> Doc A and Doc C tie for first place. Why? Because they appeared in <strong>both</strong> lists. RRF naturally boosts documents that multiple systems agree on.</p><h3 id="why-rrf-works-so-well">Why RRF Works So Well</h3><ol><li><strong>Scale-independent.</strong> It doesn’t matter if one system scores 0-50 and another scores 0-2. RRF only looks at order.</li><li><strong>Rewards consensus.</strong> If both keyword AND semantic search agree a doc is relevant, it gets boosted.</li><li><strong>Preserves outliers.</strong> A doc that only appears in one list still gets scored. Nothing is thrown away.</li><li><strong>The k=60 trick.</strong> This constant prevents the #1 result from dominating everything. With k=60, rank 1 scores 1/61 and rank 2 scores 1/62, a gap of under 2%, so a single first-place finish cannot outweigh a document that both lists rank reasonably well.</li></ol><h2 id="hybrid-search-in-postgres">Hybrid Search in Postgres</h2><p>Here’s how to implement hybrid search with RRF using <a href="https://github.com/timescale/pg_textsearch"><u>pg_textsearch</u></a> and <a href="https://github.com/timescale/pgvectorscale"><u>pgvectorscale</u></a>.</p><h3 id="setup">Setup</h3><pre><code class="language-SQL">-- Enable extensions
CREATE EXTENSION pg_textsearch;       -- Adds BM25 ranking for keyword search
CREATE EXTENSION vectorscale CASCADE; -- Adds DiskANN for fast vector search (includes pgvector)
CREATE EXTENSION ai;                  -- Adds auto-embedding generation (optional but recommended)

-- Create your table
CREATE TABLE documents (
  id SERIAL PRIMARY KEY,
  title TEXT,
  content TEXT,                       -- This column gets BM25 indexed
  embedding vector(1536)              -- This column stores OpenAI embeddings (1536 dimensions)
);

-- Create indexes
CREATE INDEX idx_bm25 ON documents 
  USING bm25(content)                 -- BM25 index on content column
  WITH (text_config = 'english');     -- Use English stemming/stopwords

CREATE INDEX idx_vector ON documents 
  USING diskann(embedding);           -- DiskANN index for fast approximate nearest neighbor</code></pre><h3 id="the-hybrid-search-query">The Hybrid Search Query</h3><pre><code class="language-SQL">WITH 
-- STEP 1: Get top 20 keyword matches using BM25
bm25_results AS (
  SELECT 
    id, 
    ROW_NUMBER() OVER (
      ORDER BY content &lt;@&gt; to_bm25query('database optimization', 'idx_bm25')
    ) as rank                         -- Assign rank 1, 2, 3... based on BM25 score
  FROM documents
  ORDER BY content &lt;@&gt; to_bm25query('database optimization', 'idx_bm25')
  LIMIT 20                            -- Only keep top 20 keyword matches
),

-- STEP 2: Get top 20 semantic matches using vector similarity
vector_results AS (
  SELECT 
    id, 
    ROW_NUMBER() OVER (
      ORDER BY embedding &lt;=&gt; $1       -- $1 is the query embedding (passed from app)
    ) as rank                         -- Assign rank 1, 2, 3... based on vector distance
  FROM documents
  ORDER BY embedding &lt;=&gt; $1           -- &lt;=&gt; is cosine distance operator
  LIMIT 20                            -- Only keep top 20 semantic matches
)

-- STEP 3: Combine both lists using RRF formula
SELECT 
  d.id,
  d.title,
  d.content,
  -- RRF: 1/(k+rank) for each list, summed together
  -- k=60 prevents top results from dominating
  COALESCE(1.0 / (60 + b.rank), 0) +  -- Score from BM25 (0 if not in BM25 results)
  COALESCE(1.0 / (60 + v.rank), 0)    -- Score from vectors (0 if not in vector results)
  as rrf_score
FROM documents d
LEFT JOIN bm25_results b ON d.id = b.id   -- Join BM25 ranks
LEFT JOIN vector_results v ON d.id = v.id -- Join vector ranks
WHERE b.id IS NOT NULL OR v.id IS NOT NULL -- Must appear in at least one list
ORDER BY rrf_score DESC               -- Highest RRF score = most relevant
LIMIT 10;                             -- Return top 10 results</code></pre><p>One query, two search types, and RRF merges their rankings into a single list. Better still, wrap the query in a function called hybrid_search so your application can run it with a single call.</p><h3 id="wrap-it-in-a-function">Wrap It in a Function</h3><pre><code class="language-SQL">-- Reusable function: call hybrid_search('your query', $embedding, 10)
CREATE OR REPLACE FUNCTION hybrid_search(
  query_text TEXT,                    -- The search query (for BM25)
  query_embedding vector(1536),       -- The query embedding (for vectors)
  match_count INT DEFAULT 10          -- How many results to return
)
RETURNS TABLE (id INT, title TEXT, content TEXT, rrf_score FLOAT)
AS $$
  WITH 
  -- BM25 keyword search
  bm25_results AS (
    SELECT id, ROW_NUMBER() OVER (
      ORDER BY content &lt;@&gt; to_bm25query(query_text, 'idx_bm25')
    ) as rank
    FROM documents
    ORDER BY content &lt;@&gt; to_bm25query(query_text, 'idx_bm25')
    LIMIT 20
  ),
  -- Vector semantic search  
  vector_results AS (
    SELECT id, ROW_NUMBER() OVER (
      ORDER BY embedding &lt;=&gt; query_embedding
    ) as rank
    FROM documents
    ORDER BY embedding &lt;=&gt; query_embedding
    LIMIT 20
  )
  -- Combine with RRF
  SELECT 
    d.id, d.title, d.content,
    COALESCE(1.0 / (60 + b.rank), 0) + 
    COALESCE(1.0 / (60 + v.rank), 0) as rrf_score
  FROM documents d
  LEFT JOIN bm25_results b ON d.id = b.id
  LEFT JOIN vector_results v ON d.id = v.id
  WHERE b.id IS NOT NULL OR v.id IS NOT NULL
  ORDER BY rrf_score DESC
  LIMIT match_count;
$$ LANGUAGE SQL;</code></pre><p><strong>Now your app code is just:</strong></p><pre><code class="language-SQL">SELECT * FROM hybrid_search('database optimization', $embedding, 10);</code></pre><h2 id="weighted-hybrid-search">Weighted Hybrid Search</h2><p>Sometimes you want to favor one search type over the other. Technical docs might benefit from stronger keyword matching. Conversational queries might need more semantic weight. To support both, expose the two weights as function parameters and set them per call.</p><pre><code class="language-SQL">-- Weighted version: control how much keyword vs semantic matters
CREATE OR REPLACE FUNCTION weighted_hybrid_search(
  query_text TEXT,                    -- The search query (for BM25)
  query_embedding vector(1536),       -- The query embedding (for vectors)
  bm25_weight FLOAT DEFAULT 0.5,      -- Weight for keyword search (0.0 to 1.0)
  vector_weight FLOAT DEFAULT 0.5,    -- Weight for semantic search (0.0 to 1.0)
  match_count INT DEFAULT 10          -- How many results to return
)
RETURNS TABLE (id INT, title TEXT, content TEXT, rrf_score FLOAT)
AS $$
  WITH 
  -- BM25 keyword search
  bm25_results AS (
    SELECT id, ROW_NUMBER() OVER (
      ORDER BY content &lt;@&gt; to_bm25query(query_text, 'idx_bm25')
    ) as rank
    FROM documents
    ORDER BY content &lt;@&gt; to_bm25query(query_text, 'idx_bm25')
    LIMIT 20
  ),
  -- Vector semantic search
  vector_results AS (
    SELECT id, ROW_NUMBER() OVER (
      ORDER BY embedding &lt;=&gt; query_embedding
    ) as rank
    FROM documents
    ORDER BY embedding &lt;=&gt; query_embedding
    LIMIT 20
  )
  SELECT 
    d.id, d.title, d.content,
    -- Weighted RRF: multiply each score by its weight
    (bm25_weight * COALESCE(1.0 / (60 + b.rank), 0)) +   -- Weighted BM25 score
    (vector_weight * COALESCE(1.0 / (60 + v.rank), 0))   -- Weighted vector score
    as rrf_score
  FROM documents d
  LEFT JOIN bm25_results b ON d.id = b.id
  LEFT JOIN vector_results v ON d.id = v.id
  WHERE b.id IS NOT NULL OR v.id IS NOT NULL
  ORDER BY rrf_score DESC
  LIMIT match_count;
$$ LANGUAGE SQL;</code></pre><p><strong>Usage:</strong></p><pre><code class="language-SQL">-- Equal weight (default): 50% keywords, 50% semantic
SELECT * FROM weighted_hybrid_search('database optimization', $embedding);

-- Favor keywords (70% BM25, 30% vectors)
-- Good for: error codes, specific terms, exact phrases
SELECT * FROM weighted_hybrid_search('error 23505', $embedding, 0.7, 0.3);

-- Favor semantic (30% BM25, 70% vectors)
-- Good for: natural language questions, conceptual queries
SELECT * FROM weighted_hybrid_search('how do I make queries faster', $embedding, 0.3, 0.7);</code></pre><h2 id="auto-sync-embeddings-no-pipelines">Auto-Sync Embeddings (No Pipelines)</h2><p>Remember the pipeline problem from the intro? Postgres → Kafka → Elasticsearch, with separate jobs to generate embeddings?</p><p><a href="https://github.com/timescale/pgai"><u>pgai</u></a> eliminates pretty much all of that.</p><p>When you create a vectorizer, pgai sets up background workers that monitor your table for changes. When a row is inserted or updated, pgai automatically:&nbsp;</p><p>1. Detects the change&nbsp;</p><p>2. Calls the embedding API (OpenAI, Cohere, or local models)&nbsp;</p><p>3. Stores the embedding in a linked table&nbsp;</p><p>4. Keeps everything in sync</p><pre><code class="language-SQL">-- Set up automatic embedding generation
SELECT ai.create_vectorizer(
  'documents'::regclass,              -- Which table to watch for changes
  loading =&gt; ai.loading_column(
    column_name =&gt; 'content'          -- Which column to generate embeddings from
  ),  
  embedding =&gt; ai.embedding_openai(
    model =&gt; 'text-embedding-3-small', -- OpenAI model to use
    dimensions =&gt; 1536                 -- Output embedding dimensions (integer, matches vector(1536))
  )
);
-- That's it! pgai now watches 'documents' and auto-generates embeddings
-- whenever content changes. No cron jobs. No sync scripts.
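
-- Optional check, added as a sketch: this assumes your pgai version exposes the
-- ai.vectorizer_status view. Its pending_items column should fall toward 0 as the
-- worker catches up on the embedding queue.
SELECT * FROM ai.vectorizer_status;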
</code></pre><p>Any change to the documents table now triggers an automatic embedding update:&nbsp;</p><p>- <strong>INSERT</strong> into documents → embedding generated for content column&nbsp;</p><p>- <strong>UPDATE</strong> the content column → embedding regenerated</p><p>- <strong>DELETE</strong> from documents → embedding removed</p><p>Embeddings are stored in a linked table (documents_embedding) that pgai creates and manages for you. The hybrid_search functions above read the embedding column on documents, so if your embeddings live in the linked table, point the vector_results CTE at that table instead.</p><p><strong>Note:</strong> On <a href="https://console.cloud.timescale.com"><u>Tiger Data</u></a>, the vectorizer worker runs automatically as a managed service. If you’re self-hosting, you’ll need to run the <a href="https://github.com/timescale/pgai"><u>pgai vectorizer worker</u></a> (a Python CLI) to process the embedding queue.</p><p>No Kafka. No Debezium. No sync jobs. No “why is my search stale?” debugging sessions at 3 AM.</p><p>Your embeddings live next to your data, updated in near real-time, managed by Postgres itself. This is what makes hybrid search in Postgres practical for production.</p><h2 id="get-started">Get Started</h2><p>Everything you need is available on <a href="https://console.cloud.timescale.com"><u>Tiger Data</u></a>:</p><pre><code class="language-SQL">-- 1. Enable extensions (one-time setup)
CREATE EXTENSION pg_textsearch;        -- BM25 keyword search
CREATE EXTENSION vectorscale CASCADE;  -- Vector search (includes pgvector)

-- 2. Create indexes on your table
CREATE INDEX idx_bm25 ON documents USING bm25(content)
  WITH (text_config = 'english');      -- BM25 index with English stemming
CREATE INDEX idx_vector ON documents 
  USING diskann(embedding);            -- Fast vector similarity index

-- 3. Search! (using the hybrid_search function from this post)
SELECT * FROM hybrid_search('your query', $embedding, 10);</code></pre><h2 id="the-bottom-line">The Bottom Line</h2><p>Elasticsearch built a billion-dollar business on BM25 and RRF. Those algorithms aren’t proprietary. They’re not even complicated. And now they run natively in Postgres.</p><p>No pipelines. No sync jobs. No extra infrastructure. Just your database doing what databases should do: storing your data and making it searchable.</p><p>The question isn’t whether Postgres <em>can</em> do hybrid search. It can. The question is: why are you still running two systems when one will do?</p><h2 id="learn-more">Learn More</h2><p><strong>Our Posts:</strong>&nbsp;</p><ul><li><a href="https://www.tigerdata.com/blog/you-dont-need-elasticsearch-bm25-is-now-in-postgres"><u>You Don’t Need Elasticsearch: BM25 is Now in Postgres</u></a>: Why BM25 beats native Postgres search</li><li><a href="https://www.tigerdata.com/blog/introducing-pg_textsearch-true-bm25-ranking-hybrid-retrieval-postgres"><u>From ts_rank to BM25. 
Introducing pg_textsearch</u></a></li></ul><p><strong>Extensions:</strong></p><ul><li><a href="https://github.com/timescale/pg_textsearch"><u>pg_textsearch</u></a>: BM25 for Postgres, open source (PostgreSQL license)&nbsp;</li><li><a href="https://github.com/timescale/pgvectorscale"><u>pgvectorscale</u></a>: High-performance vector search using DiskANN&nbsp;</li><li><a href="https://github.com/pgvector/pgvector"><u>pgvector</u></a>: The foundation for vector similarity in Postgres&nbsp;</li><li><a href="https://github.com/timescale/pgai"><u>pgai</u></a>: Auto-sync embeddings, RAG workflows</li></ul><p><strong>Background:</strong>&nbsp;</p><ul><li><a href="https://en.wikipedia.org/wiki/Okapi_BM25"><u>BM25 on Wikipedia</u></a>: The original algorithm (1994)&nbsp;</li><li><a href="https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf"><u>Reciprocal Rank Fusion paper</u></a>: The academic paper behind RRF&nbsp;</li><li><a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/rrf.html"><u>Elasticsearch Hybrid Search docs</u></a>: How Elasticsearch implements RRF (for comparison)</li></ul><p><strong>Get Started:</strong></p><ul><li><a href="https://console.cloud.timescale.com"><u>Tiger Data Console</u></a>: Managed Postgres with all extensions pre-installed&nbsp;</li><li><a href="https://www.tigerdata.com/docs/use-timescale/latest/extensions/pg-textsearch"><u>pg_textsearch Documentation</u></a></li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[It’s 2026, Just Use Postgres]]></title>
            <description><![CDATA[Stop managing multiple databases. Postgres extensions replace Elasticsearch, Pinecone, Redis, MongoDB, and InfluxDB with BM25, vectors, JSONB, and time-series in one database.]]></description>
            <link>https://www.tigerdata.com/blog/its-2026-just-use-postgres</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/its-2026-just-use-postgres</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[TimescaleDB]]></category>
            <dc:creator><![CDATA[Raja Rao DV]]></dc:creator>
            <pubDate>Mon, 02 Feb 2026 17:49:47 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/02/just-use-postgres-2026.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/02/just-use-postgres-2026.png" alt="It’s 2026, Just Use Postgres" /><p>Think of your database like your home. Your living room, bedroom, kitchen, garage: each room serves a different purpose, but they're all under one roof. You don't build a restaurant across town because you need to cook dinner. You don't rent a commercial garage to park your car.</p><p>Postgres works the same way. Search, vectors, time-series, queues, documents, all rooms in one house. Same roof, foundation, and key.</p><p>But specialized database vendors have spent years telling you otherwise. "Use the right tool for the right job," they say. It sounds reasonable and wise. And it sells a lot of databases.</p><h2 id="the-right-tool-trap">The "Right Tool" Trap</h2><p>You follow the advice. You adopt Elasticsearch for search, Pinecone for vectors, Redis for caching, MongoDB for documents, Kafka for queues, InfluxDB for time-series. Postgres handles whatever's left.</p><p>Congratulations. You now have seven databases to manage. Seven query languages to learn. Seven backup strategies to maintain. Seven security models to audit. Seven monitoring dashboards to watch. And seven things that can break at 3 AM.</p><p>When something does break? Good luck spinning up a test environment to reproduce it. You'll need synchronized snapshots across all seven systems, all at the same point in time, with seven services running in your local environment. Let us know how that goes.</p><h2 id="the-ai-era-changed-the-math">The AI Era Changed the Math</h2><p>Here's what makes this argument different in 2026: AI agents.</p><p>Think about what agents need to do. Spin up a test database with production data. Try a fix. Verify it works. Tear it down. With one database, that's a single fork command. Fork it, test it, done.</p><p>With seven databases? 
Now your agent needs to coordinate snapshots across every system, spin up seven services, configure seven connection strings, hope nothing drifts while testing, and tear it all down afterward. That's not a minor inconvenience. It's a research project. And it's why most AI coding agents quietly assume a single-database architecture.</p><p>This applies beyond agents. Every on-call engineer who needs a test environment at 3 AM faces the same coordination problem. So does every CI pipeline that needs realistic data, and every team that wants to experiment safely.</p><p>And it compounds at the organizational level. When your data lives in one database, a new engineer can understand the entire data model in a day. They can run the full stack locally with a single connection string. They can write a migration, test it against a fork of production, and ship it with confidence. When your data is scattered across seven systems, onboarding alone becomes a multi-week project. "How does search stay in sync with the main database?" is a question that shouldn't require a 45-minute architecture review to answer.</p><p>Database consolidation used to be an architectural preference. A nice-to-have. In the AI era, it's becoming a functional requirement. The teams that can fork, test, and iterate on a single database will ship faster than teams coordinating across seven.</p><h2 id="the-algorithm-reality">The Algorithm Reality</h2><p>Here's what the specialized database vendors don't want you to think too hard about: in most cases, Postgres extensions use the same core algorithms as their products.</p>
<!--kg-card-begin: html-->
<table style="border:none;border-collapse:collapse;table-layout:fixed;width:468pt"><colgroup><col><col><col><col><col></colgroup><thead><tr style="height:0pt"><th style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;" scope="col"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:700;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">What You Need</span></p></th><th style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;" scope="col"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:700;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Specialized Tool</span></p></th><th style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;" scope="col"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:700;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Postgres Extension</span></p></th><th style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid 
#000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;" scope="col"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:700;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Algorithm</span></p></th><th style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;" scope="col"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:700;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">When You Still Need the Specialist</span></p></th></tr></thead><tbody><tr style="height:0pt"><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Full-text search</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span 
style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Elasticsearch</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><a href="https://www.tigerdata.com/blog/you-dont-need-elasticsearch-bm25-is-now-in-postgres" style="text-decoration:none;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#1155cc;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:underline;-webkit-text-decoration-skip:none;text-decoration-skip-ink:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">pg_textsearch</span></a></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Both use BM25</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span 
style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">You need Kibana dashboards, complex nested aggregations, or cluster-scale search across petabytes</span></p></td></tr><tr style="height:0pt"><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Vector search</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Pinecone</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><a href="https://www.tigerdata.com/blog/pgvector-vs-pinecone" style="text-decoration:none;"><span 
style="font-size:11pt;font-family:Arial,sans-serif;color:#1155cc;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:underline;-webkit-text-decoration-skip:none;text-decoration-skip-ink:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">pgvectorscale</span></a></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Both use HNSW/DiskANN</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">You need managed multi-tenant sharding at billions of vectors</span></p></td></tr><tr style="height:0pt"><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span 
style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Time-series</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">InfluxDB</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><a href="https://github.com/timescale/timescaledb" style="text-decoration:none;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#1155cc;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:underline;-webkit-text-decoration-skip:none;text-decoration-skip-ink:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">TimescaleDB</span></a></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span 
style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Both use time partitioning</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">You're in a pure-metrics environment with no relational data and need InfluxDB's specialized ingestion path</span></p></td></tr><tr style="height:0pt"><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Documents</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span 
style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">MongoDB</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">JSONB</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Both use document indexing</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Your entire data model is schemaless and you need MongoDB's change streams 
ecosystem</span></p></td></tr><tr style="height:0pt"><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Caching</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Redis</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">UNLOGGED tables</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" 
style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Both trade durability for speed</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">You depend on pub/sub, sorted sets, Lua scripting, or sub-millisecond latency on complex data structures</span></p></td></tr><tr style="height:0pt"><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Queues</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span 
style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Kafka</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><a href="https://github.com/tembo-io/pgmq" style="text-decoration:none;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#1155cc;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:underline;-webkit-text-decoration-skip:none;text-decoration-skip-ink:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">pgmq</span></a></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Both use message queuing</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span 
style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">You're streaming events across dozens of services with consumer groups and multi-datacenter replication</span></p></td></tr><tr style="height:0pt"><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Geospatial</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Specialized GIS</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><a href="https://postgis.net/" style="text-decoration:none;"><span 
style="font-size:11pt;font-family:Arial,sans-serif;color:#1155cc;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:underline;-webkit-text-decoration-skip:none;text-decoration-skip-ink:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">PostGIS</span></a></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Industry standard since 2001</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">There's no tradeoff here. PostGIS is the reference implementation.</span></p></td></tr></tbody></table>
<!--kg-card-end: html-->
<p></p><p>The last column is the important part. Those are real, specific requirements you'll recognize because you hit a concrete wall, not because a vendor's marketing team told you to plan for it.</p><p>The benchmarks on the extensions that matter most:</p><ul><li><strong>pgvectorscale</strong> achieved 28x lower p95 latency and 16x higher throughput than Pinecone at 99% recall on a <a href="https://www.tigerdata.com/blog/pgvector-vs-pinecone"><u>50M vector benchmark</u></a></li><li><strong>TimescaleDB</strong> matches or beats <a href="https://www.tigerdata.com/blog/what-influxdb-got-wrong"><u>InfluxDB</u></a> on time-series workloads while giving you full SQL, JOINs, and ACID guarantees</li><li><strong>pg_textsearch</strong> runs the <a href="https://www.tigerdata.com/blog/you-dont-need-elasticsearch-bm25-is-now-in-postgres"><u>same BM25 ranking algorithm</u></a> that powers Elasticsearch, natively in Postgres</li></ul><p>These extensions aren't new. PostGIS has been in production since 2001. TimescaleDB since 2017. pgvector since 2021. Over <a href="https://www.aventionmedia.com/technology-installed-base/postgresql-customers-list/"><u>48,000 companies</u></a> run PostgreSQL, including Netflix, Spotify, Uber, and Discord.</p><p>So what are you actually paying for with a specialized database? In most cases: a managed service, a purpose-built UI, and the assumption that you'll outgrow Postgres. For the small percentage of teams that genuinely need cluster-scale Elasticsearch or Kafka-grade event streaming, that's a fair trade. 
For everyone else, it's an infrastructure tax on a problem they don't have.</p><h2 id="the-compounding-costs">The Compounding Costs</h2><p>Even when a specialized database has a genuine edge on a specific benchmark, you're paying for it across every other dimension of your infrastructure.</p><p><strong>Cognitive load.</strong> Your team needs to know SQL, Redis commands, Elasticsearch Query DSL, MongoDB aggregation pipelines, Kafka consumer patterns, and InfluxDB's query language. That's not specialization. That's fragmentation.</p><p><strong>Data consistency.</strong> Keeping Elasticsearch in sync with Postgres means building sync jobs. Those jobs fail silently. Data drifts. You add reconciliation logic. That fails too. Now you're maintaining data plumbing instead of building the product your company actually sells. We've seen this pattern firsthand: a team adds a second database for one workload, then spends six months building and debugging the sync pipeline between them. The second database solved a performance problem. The sync pipeline created three operational ones.</p><p><strong>SLA math.</strong> If a request depends on three systems that each offer 99.9% uptime and fail independently, combined availability is 0.999 × 0.999 × 0.999 ≈ 99.7%. That's about 26 hours of downtime per year instead of 8.8. Every additional system multiplies your failure surface.</p><p><strong>Real cost.</strong> <a href="https://www.tigerdata.com/blog/from-4-databases-to-1-how-plexigrid-replaced-influxdb-got-350x-faster-queries-tiger-data"><u>Plexigrid consolidated from four databases to one</u></a> and got 350x faster queries. <a href="https://www.tigerdata.com/blog/how-flogistix-by-flowco-reduced-infrastructure-management-costs-by-66-with-tiger-data"><u>Flogistix</u></a> cut infrastructure management costs by 66%. <a href="https://www.tigerdata.com/case-studies/latitude"><u>Latitude</u></a> saved $12,000 per month from compression alone. 
These savings compound because you're removing entire categories of operational complexity.</p><h2 id="start-with-postgres">Start With Postgres</h2><p>Start with one database. Add complexity only when you've earned the need for it.</p><p>Postgres with the right extensions handles search, vectors, time-series, documents, caching, queues, geospatial data, and scheduled jobs. The algorithms match their specialized counterparts. The extensions are battle-tested. And the operational simplicity compounds every month you don't add another system to your stack.</p><p>In the AI era, that simplicity is worth more than ever. Every agent, every CI pipeline, every on-call engineer benefits from being able to fork one database instead of coordinating seven.</p><p>Most teams adopt specialized databases before they've even tested whether Postgres can handle the workload. They add Elasticsearch because "that's what you use for search," not because they tried pg_textsearch and found it lacking. They add Redis because "everyone uses Redis for caching," not because UNLOGGED tables couldn't handle their load. The "right tool for the right job" advice sounds wise. But too often, it's a solution to a problem you don't have yet, sold by the people who profit from you having it.</p><p>Test the assumption first. Then decide.</p><p>For a hands-on guide to setting up each extension with working SQL, read the companion post: <a href="https://www.tigerdata.com/blog/postgres-extensions-cheat-sheet"><u>Postgres Extensions Cheat Sheet: Replace 7 Databases With SQL</u></a>.</p><p>All of these extensions come pre-configured on <a href="https://console.cloud.timescale.com"><u>Tiger Cloud</u></a>. Create a free database and start building.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[10 Elasticsearch Production Issues (and How Postgres Avoids Them)]]></title>
            <description><![CDATA[Why Elasticsearch is complex in production: garbage collection, shard math, data sync pipelines, and monitoring overhead. Postgres with pg_textsearch simplifies search.]]></description>
            <link>https://www.tigerdata.com/blog/10-elasticsearch-production-issues-how-postgres-avoids-them</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/10-elasticsearch-production-issues-how-postgres-avoids-them</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Raja Rao DV]]></dc:creator>
            <pubDate>Fri, 30 Jan 2026 13:53:52 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/01/elasticsearch-production-issues-postgres-compressed.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/01/elasticsearch-production-issues-postgres-compressed.png" alt="10 Elasticsearch Production Issues (and How Postgres Avoids Them)" /><p>Elasticsearch may work great in initial testing and development, but production is a different story. This post is about what happens after you ship: the JVM tuning, the shard math, the 3 AM pages, the sync pipelines that break silently. The stuff your ops team lives with.</p><p>After years of teams running Elasticsearch in production, certain patterns keep emerging. The same issues show up in blog posts, Stack Overflow questions, and incident reports. We've compiled ten of the most common ones below, with references to the engineers who've documented them. We've also added an image for each issue so you can skim quickly and compare the challenge against Postgres.</p><p><strong>TLDR:</strong> With great power comes great operational complexity.</p><h2 id="1-jvm-garbage-collection-pauses">1. 
JVM Garbage Collection Pauses</h2><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/01/devops-problem1-compressed.png" class="kg-image" alt="JVM Garbage Collection Pauses" loading="lazy" width="2000" height="1091" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2026/01/devops-problem1-compressed.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2026/01/devops-problem1-compressed.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2026/01/devops-problem1-compressed.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w2400/2026/01/devops-problem1-compressed.png 2400w" sizes="(min-width: 720px) 720px"></figure><p>Elasticsearch runs on the Java Virtual Machine (JVM). This means garbage collection (GC) is part of your life.</p><p><strong>The problem:</strong> The JVM periodically pauses every application thread to reclaim unused memory. These "stop-the-world" pauses can freeze your Elasticsearch node for seconds at a time. If the pause lasts longer than 30 seconds, the cluster assumes the node is dead and starts moving data around to compensate. Now you have a cascading failure.</p><p>Say, for example, it's Black Friday and your e-commerce site is under peak load. Traffic spikes, memory fills up, and the JVM decides it's time to clean house. Search goes unresponsive for 45 seconds. The cluster panics, starts redistributing shards, and suddenly you're dealing with a full outage instead of a brief slowdown.</p><p><strong>References:</strong></p><ul><li><a href="https://www.siriusopensource.com/en-us/blog/problems-and-operational-weaknesses-elasticsearch"><u>Sirius Open Source: Problems and Operational Weaknesses of Elasticsearch</u></a> — "GC pause time correlates with heap size. 
Larger heaps mean longer pauses."</li><li><a href="https://www.elastic.co/docs/troubleshoot/elasticsearch/fix-common-cluster-issues"><u>Elastic Docs: Fix Common Cluster Issues</u></a> — Official troubleshooting guide for JVM memory pressure</li></ul><p><strong>Why Postgres avoids this:</strong> Postgres is written in C and manages memory directly. There's no JVM, no garbage collection, and no "stop-the-world" pauses. Memory usage is predictable and grows linearly with your workload. You still need to tune <code>shared_buffers</code> and <code>work_mem</code>, but you're not fighting a runtime that can freeze your entire process.</p><h2 id="2-mapping-explosion">2. Mapping Explosion</h2><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/01/devops-problem2-compressed.png" class="kg-image" alt="Mapping Explosion" loading="lazy" width="2000" height="1091" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2026/01/devops-problem2-compressed.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2026/01/devops-problem2-compressed.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2026/01/devops-problem2-compressed.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w2400/2026/01/devops-problem2-compressed.png 2400w" sizes="(min-width: 720px) 720px"></figure><p>Elasticsearch tries to be helpful by automatically detecting the type of data you're storing. Send it a field called <code>user_id</code> with a number, and it remembers "user_id is a number." Convenient, right?</p><p><strong>The problem:</strong> This "helpful" feature becomes a nightmare with semi-structured data. 
If your application logs include arbitrary metadata keys, or your users can create custom fields, Elasticsearch creates a new mapping for each one. Thousands of fields later, your cluster state is massive, your master node is struggling, and everything slows to a crawl.</p><p>Say you're logging API requests and including the request body as JSON. One customer sends a payload with 500 unique keys. Elasticsearch dutifully creates 500 new field mappings. Multiply that by thousands of requests, and suddenly you have tens of thousands of fields. Your cluster becomes unresponsive, and you're scrambling to figure out why.</p><p><strong>References:</strong></p><ul><li><a href="https://www.siriusopensource.com/en-us/blog/problems-and-operational-weaknesses-elasticsearch"><u>Sirius Open Source</u></a> — "Mapping explosion is a common anti-pattern... the cluster state can grow to hundreds of megabytes."</li><li><a href="https://www.mindfulchase.com/explore/troubleshooting-tips/databases/advanced-troubleshooting-for-elasticsearch-performance%2C-mappings%2C-and-cluster-stability.html"><u>MindfulChase: Advanced Troubleshooting for Elasticsearch</u></a> — Mapping conflicts and silent failures</li></ul><p><strong>Why Postgres avoids this:</strong> Postgres requires you to define your schema upfront. You decide what columns exist. If you need flexible data, you use JSONB—it stores arbitrary JSON without creating new columns or bloating system catalogs. You can still index specific paths inside the JSON when needed. The schema is explicit, which means surprises are rare.</p><h2 id="3-oversharding-or-undersharding">3. 
Oversharding (or Undersharding)</h2><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/01/devops-problem3-compressed.png" class="kg-image" alt="Oversharding (or Undersharding)" loading="lazy" width="2000" height="1091" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2026/01/devops-problem3-compressed.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2026/01/devops-problem3-compressed.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2026/01/devops-problem3-compressed.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w2400/2026/01/devops-problem3-compressed.png 2400w" sizes="(min-width: 720px) 720px"></figure><p>Elasticsearch splits your data into "shards" to distribute it across machines. You have to decide how many primary shards to use when you create an index. Choose wisely, because you can't change that number later without rebuilding the index.</p><p><strong>The problem:</strong> Too many shards? Each one consumes memory, CPU, and file handles. You're wasting resources on overhead. Too few shards? You can't parallelize queries effectively, and recovery takes forever when a node dies. And here's the kicker: there's no formula. It depends on your data size, query patterns, hardware, and growth rate. It's guesswork.</p><p>Say you create an index with 5 shards because that was the default before Elasticsearch 7.0 (newer versions default to 1). Six months later, your data has grown 10x, and those 5 shards are now bottlenecks. Your only option? Create a new index with more shards and reindex all your data. 
Hope you have the disk space and time for that migration.</p><p><strong>References:</strong></p><ul><li><a href="https://pureinsights.com/blog/2025/top-7-elasticsearch-pitfalls-and-how-to-avoid-them/"><u>Pure Insights: Top 7 Elasticsearch Pitfalls</u></a> — "Improper shard allocation is one of the most common causes of performance degradation."</li><li><a href="https://www.siriusopensource.com/en-us/blog/problems-and-operational-weaknesses-elasticsearch"><u>Sirius Open Source</u></a> — "Each shard is a Lucene index with its own file handles, memory, and CPU overhead."</li></ul><p><strong>Why Postgres avoids this:</strong> Postgres doesn't shard by default—your data lives in tables on a single node. When you need to scale reads, you add replicas. When tables get large, you use declarative partitioning (by time, by tenant, etc.). These decisions can be made later, as your data grows, rather than upfront when you're guessing. And if you do need horizontal sharding someday, tools like Citus exist. But most applications never get there.</p><h2 id="4-deep-pagination-performance-cliff">4. 
Deep Pagination Performance Cliff</h2><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/01/devops-problem4-compressed.png" class="kg-image" alt="Deep Pagination Performance Cliff" loading="lazy" width="2000" height="1091" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2026/01/devops-problem4-compressed.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2026/01/devops-problem4-compressed.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2026/01/devops-problem4-compressed.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w2400/2026/01/devops-problem4-compressed.png 2400w" sizes="(min-width: 720px) 720px"></figure><p>Users click "next page" a lot. They expect page 500 to load as fast as page 1. Elasticsearch has other plans.</p><p><strong>The problem:</strong> When you ask for page 500 at 10 results per page (results 4,991-5,000), Elasticsearch doesn't skip to result 4,991. Every shard fetches and ranks its own top 5,000 hits and sends them to a coordinating node, which merges them and throws away the first 4,990. The deeper you paginate, the more work everyone does.</p><p>Say a user searches your product catalog and starts browsing. Page 1 loads in 50ms. Page 10 takes 200ms. By page 100, it's taking 2 seconds. Page 500? Timeout. Elasticsearch actually has a hard limit on <code>from + size</code> (<code>index.max_result_window</code>, default 10,000) specifically because this gets so bad.</p><p><strong>References:</strong></p><ul><li><a href="https://www.siriusopensource.com/en-us/blog/problems-and-operational-weaknesses-elasticsearch"><u>Sirius Open Source</u></a> — "Deep pagination is a well-known limitation... 
the coordinating node must aggregate results from all shards."</li><li><a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/paginate-search-results.html"><u>Elastic Docs</u></a> — Official guidance recommending <code>search_after</code> for deep pagination</li></ul><p><strong>Why Postgres avoids this:</strong> <code>LIMIT 10 OFFSET 5000</code> in Postgres doesn't have a hard ceiling like ES. The query planner handles it without needing to score every document. That said, large offsets still have overhead—keyset pagination <code>(WHERE id &gt; last_seen_id)</code> is better for deep pages. But you won't hit a wall at 10,000 results, and you won't see the dramatic slowdown ES has.</p><h2 id="5-split-brain-and-data-loss">5. Split-Brain and Data Loss</h2><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/01/devops-problem5-compressed.png" class="kg-image" alt="Split-Brain and Data Loss" loading="lazy" width="2000" height="1091" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2026/01/devops-problem5-compressed.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2026/01/devops-problem5-compressed.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2026/01/devops-problem5-compressed.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w2400/2026/01/devops-problem5-compressed.png 2400w" sizes="(min-width: 720px) 720px"></figure><p>Elasticsearch is a distributed system. Distributed systems fail in distributed ways.</p><p><strong>The problem:</strong> Imagine your cluster has 5 nodes, and a network issue splits them into two groups: 3 nodes on one side, 2 on the other. Both groups might elect their own master and start accepting writes independently. 
When the network heals, you have two conflicting versions of your data. This is "split-brain," and it can cause permanent data loss.</p><p>Say your Elasticsearch cluster spans two data centers for redundancy. A network blip between them lasts 60 seconds. Both sides elect a master. Users on both sides keep searching and indexing. When connectivity returns, some documents exist in one half, some in the other, and some have conflicting updates. Reconciling this manually is a nightmare.</p><p><strong>References:</strong></p><ul><li><a href="https://www.siriusopensource.com/en-us/blog/problems-and-operational-weaknesses-elasticsearch"><u>Sirius Open Source</u></a> — "Split-brain scenarios can occur when network partitions isolate nodes."</li><li><a href="https://en.wikipedia.org/wiki/Elasticsearch"><u>Wikipedia: Elasticsearch</u></a> — History of split-brain issues and improvements</li></ul><p><strong>Why Postgres avoids this:</strong> Postgres uses a simpler primary/replica model. There's one primary that accepts writes; replicas are read-only. This doesn't make split-brain impossible—any distributed system can have it—but the architecture makes it much harder to create accidentally. Tools like Patroni, pg_auto_failover, and managed services (like <a href="https://www.tigerdata.com/cloud"><u>Tiger Data</u></a>) handle leader election and fencing automatically. You're not managing quorum math yourself.</p><h2 id="6-eventual-consistency-surprises">6. 
Eventual Consistency Surprises</h2><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/01/devops-problem6-compressed.png" class="kg-image" alt="Eventual Consistency Surprises" loading="lazy" width="2000" height="1091" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2026/01/devops-problem6-compressed.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2026/01/devops-problem6-compressed.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2026/01/devops-problem6-compressed.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w2400/2026/01/devops-problem6-compressed.png 2400w" sizes="(min-width: 720px) 720px"></figure><p>When you save something to Elasticsearch, it's not immediately searchable. This catches a lot of teams off guard.</p><p><strong>The problem:</strong> Elasticsearch buffers writes and periodically "refreshes" to make them searchable (default: every 1 second). But under load, or with specific configurations, that delay can grow. Users create a record, immediately search for it, and get nothing. "But I just saved it!"</p><p>Say a user posts a comment on your platform. Your app saves it to Postgres (source of truth) and indexes it to Elasticsearch. The user refreshes the page to see their comment. The Postgres query shows it, but the search results don't include it yet because Elasticsearch hasn't been refreshed. The user thinks something is broken and posts again. Now you have duplicate comments.</p><p><strong>References:</strong></p><ul><li><a href="https://acethecloud.com/blog/why-you-rarely-need-elastic/"><u>AceTheCloud: Why You Rarely Need Elastic</u></a> — "Elasticsearch does not support ACID transactions... 
data may not be immediately visible."</li><li><a href="https://www.stackshare.io/stackups/elasticsearch-vs-postgresql"><u>StackShare: Elasticsearch vs PostgreSQL</u></a> — Comparison of consistency models</li></ul><p><strong>Why Postgres avoids this:</strong> Postgres is <a href="https://www.tigerdata.com/learn/understanding-acid-compliance"><u>ACID-compliant</u></a>. When your transaction commits, the data is immediately visible to all subsequent queries—on the same connection or any other. If you're using BM25 search via <a href="https://www.tigerdata.com/docs/use-timescale/latest/extensions/pg-textsearch"><u>pg_textsearch</u></a>, the index updates synchronously—no refresh interval. No eventual consistency, no "I just saved it but can't find it" bugs.</p><h2 id="7-security-misconfigurations">7. Security Misconfigurations</h2><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/01/devops-problem7-compressed.png" class="kg-image" alt="Security Misconfigurations" loading="lazy" width="2000" height="1091" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2026/01/devops-problem7-compressed.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2026/01/devops-problem7-compressed.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2026/01/devops-problem7-compressed.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w2400/2026/01/devops-problem7-compressed.png 2400w" sizes="(min-width: 720px) 720px"></figure><p>Elasticsearch has been responsible for some of the largest data breaches in history. Not because it's insecure by design, but because it's easy to misconfigure.</p><p><strong>The problem:</strong> For years, Elasticsearch shipped with no authentication by default. 
Spin up a cluster, expose port 9200, and anyone on the internet could read your data. While newer versions have improved, the damage is done: countless clusters were exposed, and security misconfigurations remain common.</p><p>Here's a real example: in 2019, researchers found an exposed Elasticsearch server containing 1.2 billion records of personal data—social media profiles, email addresses, phone numbers. The cluster had no authentication. This wasn't a sophisticated hack; someone just found an open port. Similar breaches happen regularly.</p><p><strong>References:</strong></p><ul><li><a href="https://pureinsights.com/blog/2025/top-7-elasticsearch-pitfalls-and-how-to-avoid-them/"><u>Pure Insights</u></a> — "Clusters without TLS, authentication, or access controls are vulnerable."</li><li><a href="https://hevodata.com/learn/elasticsearch-postgresql/"><u>HevoData: Elasticsearch PostgreSQL</u></a> — "By default, Elasticsearch lacks built-in authentication."</li></ul><p><strong>Why Postgres avoids this:</strong> Postgres controls every connection through <code>pg_hba.conf</code>, and out of the box it listens only on localhost. Remote clients can't connect until you explicitly allow them, so there's no "open to the internet by default" mode. SSL/TLS encryption, role-based access control, and row-level security are all built in and battle-tested over decades. Misconfiguration is still possible (any system can be misconfigured, for example by enabling <code>trust</code> authentication for remote hosts), but the secure path is the default path.</p><h2 id="8-monitoring-complexity">8. 
Monitoring Complexity</h2><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/01/devops-problem8-compressed.png" class="kg-image" alt="Monitoring Complexity" loading="lazy" width="2000" height="1091" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2026/01/devops-problem8-compressed.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2026/01/devops-problem8-compressed.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2026/01/devops-problem8-compressed.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w2400/2026/01/devops-problem8-compressed.png 2400w" sizes="(min-width: 720px) 720px"></figure><p>Elasticsearch gives you hundreds of metrics. The problem is knowing which ones matter.</p><p><strong>The problem:</strong> JVM heap usage, young GC time, old GC time, thread pool queue sizes, circuit breaker trips, pending tasks, unassigned shards, indexing rate, search latency, segment count... Elasticsearch exposes all of this, but understanding what's normal vs. concerning requires deep expertise. Most teams set up basic monitoring, miss the early warning signs, and only find out something's wrong when it's already broken.</p><p>Say your cluster has been running fine for months. One day, search latency spikes. You check CPU and memory—they look okay. Disk? Fine. What you didn't notice: the "pending tasks" queue has been growing for days, and the "circuit breaker" tripped twice last week. 
By the time you figure this out, users have been complaining for hours.</p><p><strong>References:</strong></p><ul><li><a href="https://pureinsights.com/blog/2025/top-7-elasticsearch-pitfalls-and-how-to-avoid-them/"><u>Pure Insights</u></a> — "Without proper observability, issues like JVM heap spikes or unassigned shards can escalate into outages."</li><li><a href="https://www.elastic.co/docs/troubleshoot/elasticsearch/fix-common-cluster-issues"><u>Elastic Docs</u></a> — Guide to understanding circuit breaker exceptions and task queue backlogs</li></ul><p><strong>Why Postgres avoids this:</strong> Postgres monitoring is well-understood after 30+ years. The key metrics are simpler: connections, query latency, cache hit ratio, replication lag. <code>pg_stat_statements</code> shows your slowest queries. <code>EXPLAIN ANALYZE</code> tells you exactly why. Tools like pgBadger, pg_stat_activity, and any standard APM will get you 90% of the way there. You don't need to become a Postgres internals expert to keep it healthy.</p><h2 id="9-data-pipeline-sync-issues">9. 
Data Pipeline Sync Issues</h2><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/01/devops-problem9-compressed.png" class="kg-image" alt="Data Pipeline Sync Issues" loading="lazy" width="2000" height="1091" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2026/01/devops-problem9-compressed.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2026/01/devops-problem9-compressed.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2026/01/devops-problem9-compressed.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w2400/2026/01/devops-problem9-compressed.png 2400w" sizes="(min-width: 720px) 720px"></figure><p>Your real data lives in Postgres. Elasticsearch is just a copy for search. Keeping them in sync is harder than it sounds.</p><p><strong>The problem:</strong> The typical architecture is <code>Postgres → Kafka/Debezium → Elasticsearch</code>. That's three systems. When data updates in Postgres, it flows through Kafka, gets transformed, and lands in Elasticsearch. What could go wrong? Everything. Kafka lag causes stale search results. Schema changes break the connector. A bug in your transformer corrupts data. And when something breaks, you're debugging across three different systems with three different log formats.</p><p>Say a customer updates their email address. The change saves to Postgres immediately. But the Kafka consumer is backed up, and Elasticsearch doesn't get the update for 10 minutes. During that window, any email sent via a search lookup goes to the old address. Customer support gets an angry call. 
You spend two hours figuring out where the delay happened.</p><p><strong>References:</strong></p><ul><li><a href="https://www.infoq.com/news/2025/08/instacart-elasticsearch-postgres/"><u>InfoQ: Instacart Elasticsearch Postgres</u></a> — Case study on data synchronization challenges</li><li><a href="https://medium.com/@toluaina/real-time-integration-of-postgresql-with-elasticsearch-with-pgsync-9425ffa9b4e9"><u>Medium: Real-time Integration of PostgreSQL with Elasticsearch</u></a> — Discussion of sync tooling and challenges</li></ul><p><strong>Why Postgres avoids this:</strong> If search lives in Postgres (using <a href="https://www.tigerdata.com/blog/you-dont-need-elasticsearch-bm25-is-now-in-postgres"><u>BM25 via pg_textsearch</u></a>), there's no pipeline to manage. When you update a row, the search index updates in the same transaction. No Kafka, no Debezium, no sync jobs, no lag, no "which system has the latest data?" debugging. Your source of truth and your search index are the same thing.</p><h2 id="10-infrastructure-cost">10. Infrastructure Cost</h2><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/01/devops-problem10-compressed.png" class="kg-image" alt="Infrastructure Cost" loading="lazy" width="2000" height="1091" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2026/01/devops-problem10-compressed.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2026/01/devops-problem10-compressed.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2026/01/devops-problem10-compressed.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w2400/2026/01/devops-problem10-compressed.png 2400w" sizes="(min-width: 720px) 720px"></figure><p>Elasticsearch clusters are expensive to run. 
This isn't a bug—it's the architecture.</p><p><strong>The problem:</strong> ES runs on the JVM, which means each node needs significant RAM (typically 16-64GB, with half allocated to heap). Production clusters require at least 3 nodes for high availability and quorum. Larger deployments need dedicated master nodes, coordinating nodes, and data nodes—that's a lot of instances. And because ES stores everything in Lucene segments, you need fast SSDs to avoid I/O bottlenecks. Add replicas for redundancy, and your storage costs multiply.</p><p>Say you spin up a "small" production cluster: 3 nodes, 32GB RAM each, SSDs. That's your baseline before you've even scaled. Now add a separate cluster for staging. And one for testing. Your cloud bill starts climbing, and you're not even at high traffic yet.</p><p><strong>References:</strong></p><ul><li><a href="https://www.elastic.co/pricing/"><u>Elastic Cloud Pricing</u></a> — Managed ES starts at ~$95/month for minimal configs, scales quickly</li><li><a href="https://aws.amazon.com/opensearch-service/pricing/"><u>AWS OpenSearch Pricing</u></a> — Instance costs add up fast with multi-node clusters</li><li><a href="https://www.siriusopensource.com/en-us/blog/problems-and-operational-weaknesses-elasticsearch"><u>Sirius Open Source</u></a> — "Each shard is a Lucene index with its own file handles, memory, and CPU overhead"</li></ul><p><strong>Why Postgres avoids this:</strong> Postgres runs on a single node for most workloads. No JVM overhead means lower memory requirements. You add read replicas when you actually need them, not because the architecture demands it. A single Postgres instance with pg_textsearch can handle substantial search workloads on hardware that would barely run one ES node. 
And if you're on a managed service like <a href="https://www.tigerdata.com/cloud"><u>Tiger Data</u></a>, you're paying for one database, not a distributed cluster.</p><h2 id="the-alternative-keep-search-in-postgres">The Alternative: Keep Search in Postgres</h2><p>If you've made it this far, you've probably nodded along to at least a few of these issues. Maybe all ten. The question is: what do you do about it?</p><p>Can you overcome all these issues? Yes. Hire an Elasticsearch specialist (or a team of them). Invest in training. Build expertise over years. Companies do it every day.</p><p>But here's the question: <strong>do you want to?</strong></p><p>It's like the shift from gas cars to electric cars. When you go electric, you don't just swap out the engine. You eliminate an entire category of problems: no oil changes, no transmission repairs, no spark plugs, no fuel injectors, no exhaust systems. The complexity just disappears.</p><p>That's what's happening with search. Extensions like <a href="https://github.com/timescale/pg_textsearch"><u>pg_textsearch</u></a> (BM25 ranking), <a href="https://github.com/timescale/pgvectorscale"><u>pgvectorscale</u></a> (vector search), and <a href="https://github.com/timescale/pgai"><u>pgai</u></a> (automatic embeddings) mean search can now live inside Postgres. No separate cluster. No JVM. No data pipelines. No sync jobs. 
A whole category of problems just goes away.</p><p>Every issue above has the same root cause: <strong>Elasticsearch is a separate system.</strong> Separate systems mean separate infrastructure, separate expertise, separate failure modes, and separate pipelines.</p><p>What if you didn't need a separate system?</p><p>PostgreSQL now has:</p><ul><li><strong>BM25 ranking</strong> via <a href="https://github.com/timescale/pg_textsearch"><u>pg_textsearch</u></a> (same algorithm Elasticsearch uses)</li><li><strong>Vector search</strong> via <a href="https://github.com/pgvector/pgvector"><u>pgvector</u></a> and <a href="https://github.com/timescale/pgvectorscale"><u>pgvectorscale</u></a></li><li><strong>Hybrid search</strong> combining both, with RRF (Reciprocal Rank Fusion) in pure SQL</li><li><strong>Automatic embedding sync</strong> via <a href="https://github.com/timescale/pgai"><u>pgai</u></a></li></ul><p>And for AI workflows, there's a bonus: when everything lives in Postgres, you can <a href="https://www.tigerdata.com/blog/fast-zero-copy-database-forks"><u>fork your database</u></a> and get an instant copy of your search index, vectors, and embeddings. Try doing that with Elasticsearch.</p><p>For many workloads, "just use Postgres" isn't a compromise. It's a simplification.</p><p><strong>One caveat:</strong> pg_textsearch doesn't currently support faceted search, so e-commerce with "filter by brand/price/color" is still a gap. For RAG, document search, and knowledge bases, you're covered.</p><h2 id="get-started-in-5-minutes">Get Started in 5 Minutes</h2><p>Ready to try hybrid search in Postgres? Here's all you need:</p><pre><code class="language-SQL">-- 1. Enable extensions
CREATE EXTENSION IF NOT EXISTS pg_textsearch;        -- BM25 keyword search
CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE;  -- Vector search (CASCADE also installs pgvector)
CREATE EXTENSION IF NOT EXISTS ai;                   -- Auto-embedding generation (optional)

-- 2. Create your table
CREATE TABLE documents (
  id SERIAL PRIMARY KEY,
  title TEXT,
  content TEXT,
  embedding vector(1536)
);

-- 3. Create indexes
CREATE INDEX idx_bm25 ON documents USING bm25(content)
  WITH (text_config = 'english');
CREATE INDEX idx_vector ON documents USING diskann(embedding);

-- 4. Hybrid search with RRF (one query, two search types)
WITH 
bm25_results AS (
  SELECT id, ROW_NUMBER() OVER (
    ORDER BY content &lt;@&gt; to_bm25query('your search query', 'idx_bm25')
  ) as rank
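  FROM documents
  -- Order explicitly so LIMIT keeps the 20 best matches, not an arbitrary 20 rows
  ORDER BY content &lt;@&gt; to_bm25query('your search query', 'idx_bm25')
  LIMIT 20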
),
vector_results AS (
  SELECT id, ROW_NUMBER() OVER (
    ORDER BY embedding &lt;=&gt; $1  -- $1 is your query embedding
  ) as rank
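  FROM documents
  -- Same here: order by distance so LIMIT keeps the 20 nearest vectors
  ORDER BY embedding &lt;=&gt; $1
  LIMIT 20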
)
SELECT d.*, 
  COALESCE(1.0/(60 + b.rank), 0) + COALESCE(1.0/(60 + v.rank), 0) as rrf_score
FROM documents d
LEFT JOIN bm25_results b ON d.id = b.id
LEFT JOIN vector_results v ON d.id = v.id
WHERE b.id IS NOT NULL OR v.id IS NOT NULL
ORDER BY rrf_score DESC
LIMIT 10;
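
-- 5. (Illustrative addition) How the RRF score in step 4 behaves.
-- This is only a worked-arithmetic sketch using the same constant (60) as above.
-- A document ranked 1st by BM25 and 3rd by vector search scores roughly 0.032,
-- while a document ranked 1st in only one of the two lists scores roughly 0.016,
-- so appearing in both lists outweighs a top spot in just one.
SELECT
  1.0/(60 + 1) + 1.0/(60 + 3) AS in_both_lists,
  1.0/(60 + 1)                AS in_one_list;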
</code></pre><p>That's it. BM25 + vectors + RRF, all in Postgres. No Elasticsearch, no Kafka, no sync jobs.</p><p>Try it on <a href="https://console.cloud.timescale.com"><u>Tiger Data</u></a>—all extensions come pre-installed.</p><h2 id="learn-more">Learn More</h2><ul><li><a href="https://github.com/timescale/pg_textsearch"><u>pg_textsearch on GitHub</u></a> — BM25 for Postgres (open source)</li><li><a href="https://github.com/timescale/pgvectorscale"><u>pgvectorscale on GitHub</u></a> — High-performance vector search</li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Five Features of the Tiger CLI You Aren't Using (But Should)]]></title>
            <description><![CDATA[Tiger CLI + MCP server: Let AI manage databases, fork instantly, search Postgres docs, and run queries—all from your coding assistant without context switching.]]></description>
            <link>https://www.tigerdata.com/blog/five-features-tiger-cli-you-arent-using-but-should</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/five-features-tiger-cli-you-arent-using-but-should</guid>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[AI agents]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Jacky Liang]]></dc:creator>
            <pubDate>Wed, 10 Dec 2025 16:37:07 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/12/5FeaturesofTigerCLI.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/12/5FeaturesofTigerCLI.png" alt="Five Features of the Tiger CLI You Aren't Using (But Should)" /><p>Last month, we launched <a href="https://www.tigerdata.com/blog/postgres-for-agents"><u>Agentic Postgres</u></a>, the first Postgres database designed for AI agents. It includes an MCP server that gives agents direct access to your databases, instant zero-copy forks, Postgres and TimescaleDB documentation search, and more.&nbsp;</p><figure class="kg-card kg-video-card kg-width-regular" data-kg-thumbnail="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/media/2025/10/DE84BB33-B4FE-4F6E-8398-9267033F6870-2_thumb.jpg" data-kg-custom-thumbnail="">
            <div class="kg-video-container">
                <video src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/media/2025/10/DE84BB33-B4FE-4F6E-8398-9267033F6870-2.mp4" poster="https://img.spacergif.org/v1/1920x1080/0a/spacer.png" width="1920" height="1080" loop="" autoplay="" muted="" playsinline="" preload="metadata" style="background: transparent url('https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/media/2025/10/DE84BB33-B4FE-4F6E-8398-9267033F6870-2_thumb.jpg') 50% 50% / cover no-repeat;"></video>
                <div class="kg-video-overlay">
                    <button class="kg-video-large-play-icon" aria-label="Play video">
                        <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
                            <path d="M23.14 10.608 2.253.164A1.559 1.559 0 0 0 0 1.557v20.887a1.558 1.558 0 0 0 2.253 1.392L23.14 13.393a1.557 1.557 0 0 0 0-2.785Z"></path>
                        </svg>
                    </button>
                </div>
                <div class="kg-video-player-container kg-video-hide">
                    <div class="kg-video-player">
                        <button class="kg-video-play-icon" aria-label="Play video">
                            <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
                                <path d="M23.14 10.608 2.253.164A1.559 1.559 0 0 0 0 1.557v20.887a1.558 1.558 0 0 0 2.253 1.392L23.14 13.393a1.557 1.557 0 0 0 0-2.785Z"></path>
                            </svg>
                        </button>
                        <button class="kg-video-pause-icon kg-video-hide" aria-label="Pause video">
                            <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
                                <rect x="3" y="1" width="7" height="22" rx="1.5" ry="1.5"></rect>
                                <rect x="14" y="1" width="7" height="22" rx="1.5" ry="1.5"></rect>
                            </svg>
                        </button>
                        <span class="kg-video-current-time">0:00</span>
                        <div class="kg-video-time">
                            /<span class="kg-video-duration">0:30</span>
                        </div>
                        <input type="range" class="kg-video-seek-slider" max="100" value="0">
                        <button class="kg-video-playback-rate" aria-label="Adjust playback speed">1×</button>
                        <button class="kg-video-unmute-icon" aria-label="Unmute">
                            <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
                                <path d="M15.189 2.021a9.728 9.728 0 0 0-7.924 4.85.249.249 0 0 1-.221.133H5.25a3 3 0 0 0-3 3v2a3 3 0 0 0 3 3h1.794a.249.249 0 0 1 .221.133 9.73 9.73 0 0 0 7.924 4.85h.06a1 1 0 0 0 1-1V3.02a1 1 0 0 0-1.06-.998Z"></path>
                            </svg>
                        </button>
                        <button class="kg-video-mute-icon kg-video-hide" aria-label="Mute">
                            <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
                                <path d="M16.177 4.3a.248.248 0 0 0 .073-.176v-1.1a1 1 0 0 0-1.061-1 9.728 9.728 0 0 0-7.924 4.85.249.249 0 0 1-.221.133H5.25a3 3 0 0 0-3 3v2a3 3 0 0 0 3 3h.114a.251.251 0 0 0 .177-.073ZM23.707 1.706A1 1 0 0 0 22.293.292l-22 22a1 1 0 0 0 0 1.414l.009.009a1 1 0 0 0 1.405-.009l6.63-6.631A.251.251 0 0 1 8.515 17a.245.245 0 0 1 .177.075 10.081 10.081 0 0 0 6.5 2.92 1 1 0 0 0 1.061-1V9.266a.247.247 0 0 1 .073-.176Z"></path>
                            </svg>
                        </button>
                        <input type="range" class="kg-video-volume-slider" max="100" value="100">
                    </div>
                </div>
            </div>
            
        </figure><p>Alongside Agentic Postgres, we shipped a brand new CLI: <a href="https://github.com/timescale/tiger-cli"><u>Tiger CLI</u></a>. It's how you manage your Tiger Cloud databases from your favorite terminal.&nbsp;</p><p>The basics work like you'd expect:&nbsp;</p><pre><code class="language-shell"># Install Tiger CLI 
curl -fsSL https://cli.tigerdata.com | sh

# Authenticate
tiger auth login

# Create a new database service
tiger service create --name my-database

# Connect to your database
tiger db connect

# Get your connection string
tiger db connection-string

# List all your services
tiger service list</code></pre><p>These commands cover most day-to-day workflows. But Tiger CLI has a few features that make the agentic development workflow significantly more intuitive, and you're probably not using them yet!&nbsp;</p><p>Here are five new features we launched that you aren’t using, but should:&nbsp;</p><ol><li><strong>Let your AI manage your databases</strong>: Install an MCP server that gives your AI assistant direct access to create services, run queries, and check connections</li><li><strong>Turn your AI into a Postgres expert</strong>: Skills teach your AI Postgres best practices automatically, as if it’s been writing production-grade PostgreSQL for more than a decade</li><li><strong>Fork any database in seconds</strong>: Create zero-copy clones of your database for testing migrations or spinning up staging environments</li><li><strong>Search Postgres docs from your editor</strong>: Your AI can search PostgreSQL (across all versions) and TimescaleDB documentation without leaving your IDE or CLI</li><li><strong>Run SQL queries through your AI</strong>: Execute queries against your database directly from your AI assistant</li></ol><p>Let's look at each one.</p><h2 id="let-your-ai-manage-your-databases">Let Your AI Manage Your Databases</h2><p>As anyone coding with Cursor or Claude Code knows, nothing breaks flow more than having to leave your AI agent to execute CLI commands. Every time you need to check a database connection string, list your available services, or create a new database, you need to switch context. Bouncing from your terminal to the browser and back to your IDE makes it easy to lose your flow state.&nbsp;</p><p>The Tiger CLI now includes a <a href="https://modelcontextprotocol.io/docs/getting-started/intro"><u>Model Context Protocol</u></a> (MCP) server. 
If you're using an AI coding assistant like Claude Code, Cursor, or VS Code with Copilot, you can give it direct access to your Tiger Cloud databases.</p><p>This means your AI assistant can list your services, run SQL queries, create new databases, and check connection details, all without you switching to your terminal.</p><h3 id="quick-setup">Quick Setup</h3><p>Install the MCP server for your assistant:</p><pre><code class="language-shell"># Interactive (prompts you to pick your client)
tiger mcp install

# Or specify directly
tiger mcp install claude-code
tiger mcp install cursor
tiger mcp install vscode</code></pre><p>We made it easy to install the Tiger MCP server in all of your favorite coding assistants* through an interactive prompt that guides you through the installation.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/12/SCR-20251204-nvcj-2.png" class="kg-image" alt="" loading="lazy" width="889" height="325" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2025/12/SCR-20251204-nvcj-2.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/12/SCR-20251204-nvcj-2.png 889w" sizes="(min-width: 720px) 720px"></figure><p>Restart your AI assistant after installation.&nbsp;</p><p>* We are constantly adding interactive installations for new coding assistants!&nbsp;</p><h3 id="adding-to-cursor-via-ui">Adding to Cursor via UI</h3><p>If you prefer to configure Cursor (or other IDEs) manually instead of using <code>tiger mcp install</code>:</p><ol><li>Open Cursor Settings</li><li>Look for “Tools &amp; MCP” in the left sidebar</li><li>Click "Add MCP server"</li><li>Enter the following configuration:&nbsp;<ul><li><strong>Name:</strong> <code>tiger</code></li><li><strong>Command:</strong> <code>tiger</code></li><li><strong>Arguments:</strong> <code>mcp</code>, <code>start</code></li></ul></li></ol><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/12/SCR-20251204-nrjp-2.png" class="kg-image" alt="" loading="lazy" width="630" height="402" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2025/12/SCR-20251204-nrjp-2.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/12/SCR-20251204-nrjp-2.png 630w"></figure><ol start="5"><li>Click "Save" and restart Cursor</li></ol><p>Once 
configured, you'll see the Tiger MCP server listed in your MCP servers. The green indicator shows it's connected and ready.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/12/SCR-20251204-odoy-2.png" class="kg-image" alt="" loading="lazy" width="872" height="768" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2025/12/SCR-20251204-odoy-2.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/12/SCR-20251204-odoy-2.png 872w" sizes="(min-width: 720px) 720px"></figure><h3 id="what-you-can-do">What You Can Do</h3><p>Once installed, your AI assistant has access to tools like:</p><ul><li><code>service_list</code>: List all your database services</li><li><code>service_get</code>: Get details about a specific service</li><li><code>service_create</code>: Create a new database</li><li><code>db_execute_query</code>: Run SQL queries against any service</li></ul><p>For the full list of tools available to you and your agent, see the <a href="https://github.com/timescale/tiger-cli?tab=readme-ov-file#available-mcp-tools"><u>Tiger CLI README</u></a> on GitHub.&nbsp;</p><p>For example, you can ask your AI assistant: "Show me all my Tiger Cloud services" or "Run <code>SELECT count(*) FROM events</code> on my production database."</p><p>The MCP server uses your existing CLI authentication, so there's no extra setup after <code>tiger auth login</code>.</p><h2 id="turn-your-ai-into-a-postgres-expert">Turn Your AI Into a Postgres Expert</h2><p>A new pattern of working with SQL has emerged as agentic coding has exploded in popularity.&nbsp;</p><p>You can now simply tell an LLM what you want to do with your database, and it will write the SQL for you. 
You no longer need to spend sweat, tears, or fear (such as the fear of accidentally dropping a table, although that is still <a href="https://fortune.com/2025/07/23/ai-coding-tool-replit-wiped-database-called-it-a-catastrophic-failure/"><u>completely within the realm of possibility</u></a> when working with an agent) learning how to write SQL.&nbsp;</p><p>But this has spawned a new problem: AI-generated SQL works… until it doesn’t.&nbsp;</p><p>Your schema may pass tests, your queries may run… but six months later, as your application scales to millions of users, everything slows to a crawl.&nbsp;</p><p>The problem is that LLMs are trained on millions of lines of SQL absorbed from countless blog posts of varying quality, so while they “know” SQL, they don’t know which patterns actually scale. There are also many different dialects of SQL, each with many versions.&nbsp;</p><p>It’s no wonder AI agents may not write the best SQL.&nbsp;</p><p>We’ve personally seen many common AI-generated mistakes, including:&nbsp;</p><ul><li>Using <code>VARCHAR(255)</code> instead of <code>TEXT</code> (the length limit doesn't help performance in Postgres)</li><li>Using <code>SERIAL</code> instead of <code>BIGINT GENERATED ALWAYS AS IDENTITY</code></li><li>Missing indexes on foreign key columns (Postgres doesn't create these automatically)</li><li>Using <code>TIMESTAMP</code> instead of <code>TIMESTAMPTZ</code> (timezone handling is painful to fix later)</li></ul><p>These will not raise syntax or linter errors. Your tests will still pass. But fixing these mistakes later, once your app is handling millions of users, means painful migrations, downtime, and explaining to your CEO why the database needs to go down for maintenance for all your users. 
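</p><p>As a minimal sketch of what the corrected versions of these patterns can look like (the <code>customers</code> and <code>orders</code> tables and their columns are hypothetical, chosen only for illustration):</p><pre><code class="language-sql">-- Hypothetical example tables; all names are assumptions for illustration.
CREATE TABLE customers (
    customer_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- instead of SERIAL
    email       TEXT NOT NULL UNIQUE,                             -- instead of VARCHAR(255)
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now()                -- instead of TIMESTAMP
);

CREATE TABLE orders (
    order_id    BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    customer_id BIGINT NOT NULL REFERENCES customers (customer_id),
    placed_at   TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Postgres does not index the referencing column of a foreign key automatically.
CREATE INDEX orders_customer_id_idx ON orders (customer_id);</code></pre><p>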
Sound familiar?&nbsp;</p><p>The Tiger MCP server ships with a variation of <a href="https://www.tigerdata.com/blog/free-postgres-mcp-prompt-templates"><u>Skills</u></a> (a <a href="https://www.claude.com/blog/skills"><u>standard</u></a> built by Anthropic) written by our most senior and experienced Postgres engineers. When your AI needs to design a schema, the MCP server automatically pulls the right “lessons” and then applies 30 years of Postgres best practices to your database design.&nbsp;</p><h3 id="available-skills">Available Skills</h3><p>The MCP server includes skills for common Postgres and TimescaleDB tasks:&nbsp;</p><ul><li><code>design-postgres-tables</code>: Schema design with proper types, constraints, and indexes</li><li><code>setup-timescaledb-hypertables</code>: Hypertable configuration, compression, retention policies</li><li><code>migrate-postgres-tables-to-hypertables</code>: Converting existing tables to hypertables</li><li><code>find-hypertable-candidates</code>: Identifying which tables should become hypertables</li></ul><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/12/SCR-20251204-nzjm-2.png" class="kg-image" alt="" loading="lazy" width="583" height="356"></figure><p>Your AI discovers and uses these automatically based on what you ask it to do. You don’t need to call them explicitly!&nbsp;</p><p><em>We are continually adding new Skills. <strong>Want to request new ones or contribute your own?</strong> Feel free to </em><a href="https://github.com/timescale/pg-aiguide/issues"><em><u>create an issue</u></em></a><em> in our Skills GitHub repo.&nbsp;</em></p><h2 id="fork-any-database-in-seconds">Fork Any Database in Seconds</h2><p>Testing database migrations against production data is risky, but testing against fake data is also risky because you may miss edge cases. 
You need a real copy of your database, without the cost or time of duplicating everything manually.&nbsp;</p><p>Tiger CLI lets you create instant, zero-copy forks of any database:&nbsp;</p><pre><code class="language-shell">tiger service fork &lt;service-id&gt; --name my-staging-db</code></pre><p>The fork is a point-in-time copy that shares underlying data blocks with the original. You only pay for blocks that change. This makes forks lightweight enough to spin up for a single test and throw away when you’re done.&nbsp;</p><p>Database forking is especially useful in an agentic context: if you are running multiple agents at once, you can give each agent its own fork to work on without affecting the original database. Neat!&nbsp;</p><h3 id="example-workflow">Example Workflow</h3><pre><code class="language-shell"># List your services to find the source ID
tiger service list

# Fork your production database
tiger service fork svc_abc123 --name testing-migrations

# Connect to the fork
tiger db connect --service-id &lt;new-service-id&gt;

# Run your migration, test it, then delete the fork when done
tiger service delete &lt;new-service-id&gt;</code></pre><h2 id="search-postgres-docs-from-your-editor">Search Postgres Docs From Your Editor</h2><p>Postgres has been around for <a href="https://en.wikipedia.org/wiki/PostgreSQL"><u>almost 30 years</u></a>.</p><p>The documentation is incredibly extensive, spanning 18 versions and countless individual releases. A function that exists in Postgres 18 might not exist in 15. Syntax that worked in 14 now has better alternatives in 18.&nbsp;</p><p>Most AI-powered documentation search tools don’t account for these idiosyncrasies.&nbsp;</p><p>And if you skip docs search tools and rely on an LLM’s own training data, you run into training cutoffs, which generally lag by at least six months. As a result, LLMs often suggest older methods that aren’t the most performant, or reference features that don’t exist in your version. I’ve personally run into this problem with LLMs writing React and Next.js code in my side projects!&nbsp;</p><p>The Tiger MCP server includes search over PostgreSQL documentation from versions 14 to 18 and TimescaleDB docs for time series workloads. 
Your AI assistant can search version-specific docs without leaving your coding agent.&nbsp;</p><p>Your assistant gets access to:</p><ul><li><code>semantic_search_postgres_docs</code>: Search PostgreSQL documentation (versions 14-18)</li><li><code>semantic_search_tiger_docs</code>: Search Tiger Cloud and TimescaleDB documentation</li></ul><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/12/SCR-20251204-olha-2.png" class="kg-image" alt="" loading="lazy" width="1101" height="786" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2025/12/SCR-20251204-olha-2.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2025/12/SCR-20251204-olha-2.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/12/SCR-20251204-olha-2.png 1101w" sizes="(min-width: 720px) 720px"></figure><h3 id="no-more-context-switching">No More Context Switching&nbsp;</h3><p>Instead of leaving your editor to search docs, you can ask your AI assistant directly:</p><ul><li>"How do I set up continuous aggregates in TimescaleDB?"</li><li>"What's the syntax for PostgreSQL window functions?"</li><li>"Show me how to configure compression policies"</li></ul><p>The assistant searches the actual docs and gives you accurate, up-to-date answers. You don’t even need to ask the agent to explicitly use the documentation search feature; it will just work.&nbsp;</p><p>P.S. This feature is enabled by default. If you ever need to disable it:</p><pre><code class="language-shell">tiger config set docs_mcp false</code></pre><h2 id="run-sql-queries-through-your-ai">Run SQL Queries Through Your AI</h2><p>You’re debugging an issue and need to check a row count, inspect a table’s schema, or run a quick query. 
You’re used to opening a new terminal, finding the connection string, connecting via psql, running the query, and copying the results back (often clumsily, because of text wrapping in the terminal). All these steps just to get ONE number.&nbsp;</p><p>No more.&nbsp;</p><p>Once you’ve set up the MCP server in your coding agent, your AI assistant can execute SQL queries (using the <code>db_execute_query</code> tool) directly against your databases.</p><p>This means you can stay in your editor and ask your AI:</p><ul><li>"How many events came in during the last 24 hours?"</li><li>"Show me the 10 most recent orders"</li><li>"What's the schema of the users table?"&nbsp;</li></ul><h3 id="example">Example</h3><p>Your AI writes the (performant) SQL, runs it, and returns the results. No more switching terminals, copy-pasting connection strings, or remembering your environment and the exact syntax for querying <code>information_schema</code>.&nbsp;</p><h2 id="get-started-now">Get Started Now</h2><p>Install the Tiger CLI and MCP server:</p><pre><code class="language-shell">curl -fsSL https://cli.tigerdata.com | sh
tiger auth login
tiger mcp install</code></pre><p>Then select your AI assistant (Claude Code, Cursor, VS Code, Windsurf) and you're ready to go.</p><p>Don't have a Tiger Cloud account?<a href="https://console.cloud.timescale.com/signup"> <u>Sign up for free</u></a> — no credit card required. Create your first database, then try out these CLI features.&nbsp;</p><h2 id="resources">Resources</h2><ol><li><a href="https://www.tigerdata.com/blog/postgres-for-agents"><u>Postgres for Agents</u></a>&nbsp;</li><li><a href="https://www.tigerdata.com/blog/free-postgres-mcp-prompt-templates"><u>How to Train Your Agent to Be a Postgres Expert</u></a></li></ol>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Our One-Year Journey to a Unified Postgres Data Infrastructure on AWS]]></title>
            <description><![CDATA[See how Tiger Data and AWS deliver a unified Postgres data layer with time-series, vector search, native S3 ingest, and Iceberg lakehouse integration.]]></description>
            <link>https://www.tigerdata.com/blog/one-year-journey-unified-postgres-data-infrastructure-aws</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/one-year-journey-unified-postgres-data-infrastructure-aws</guid>
            <category><![CDATA[AWS]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Vineeth Pothulapati]]></dc:creator>
            <pubDate>Tue, 02 Dec 2025 13:00:30 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/12/tigerdata-aws-thumbnail.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/12/tigerdata-aws-thumbnail.png" alt="Our One-Year Journey to a Unified Postgres Data Infrastructure on AWS" /><p>Data infrastructure requirements are changing. Modern applications combine operational transactions, time-series, events, and vector data, all while integrating with the lakehouse backbone for model training, historical analytics, and enterprise insights.</p><p>Developers building modern applications start with Postgres. It’s familiar, flexible, and handles transactions well. And every framework, ORM, and service on AWS speaks fluent Postgres.</p><p>But as workloads evolve, teams outgrow what vanilla Postgres can comfortably handle. Companies ingest more data, expand real-time analytics, store longer histories, and increasingly support AI-driven features. To accommodate new requirements, developers add specialized data stores, along with the extended pipelines required to keep data in sync. Over time, data drifts, operational burden increases, and developer velocity tanks as teams spend more time gluing databases together and maintaining complex architectures than building new applications.</p><p><strong>Developers on AWS want a data layer that unifies transactional, analytical, and agentic workloads on top of Postgres and integrates seamlessly within the broader AWS ecosystem.</strong></p><p>That’s why we’re partnering with AWS: to jointly solve data fragmentation and deliver the unified data infrastructure developers have been asking for. 
Through our <a href="https://www.tigerdata.com/blog/tiger-data-aws-forge-unified-postgres-platform-for-developers-devices-ai-agents"><u>Strategic Collaboration Agreement</u></a> announced last week, we’re working with AWS to provide this unified infrastructure to every team building modern applications.</p><p>In this blog post, we’re showing the incredible progress we have made over the last 12 months to make this architecture native to AWS including the announcement of two new releases: Tiger Lake public beta and S3 Connector GA.</p><h2 id="postgres-extended-for-modern-workloads-on-aws">Postgres, Extended for Modern Workloads on AWS</h2><p>Developers on AWS already rely on Postgres to power their operational systems. Tiger Data builds on that foundation by extending Postgres to support time-series, vector and full-text search, advanced analytics, and tight lakehouse integration, all while fitting naturally into the AWS ecosystem through our growing collaboration.</p><p>The result is Postgres that does everything developers wish Postgres could do, without introducing new query languages, new operational paradigms, or new systems to manage.</p><p>At the core of this extended engine are four pillars:</p><p><a href="https://www.tigerdata.com/timescaledb"><strong><u>TimescaleDB</u></strong></a> brings a purpose-built time-series engine directly into Postgres. It handles massive ingest rates, automatic partitioning, hybrid row–column storage with 90%+ compression, SIMD-accelerated scans, incremental materialized views, and keeps recent data fast while automatically tiering older data to Amazon S3. 
It also includes Hyperfunctions, a rich set of SQL analytics functions for things like statistical aggregates, percentile approximations, gap-filling, and downsampling, so teams can run advanced time-series analytics directly in Postgres without external systems or pipelines.</p><p><a href="https://github.com/timescale/pgvectorscale"><strong><u>pgvectorscale</u></strong></a> introduces high-performance vector search with a DiskANN index, supporting large, high-dimensional embedding workloads with fast filtering and optimized storage. Instead of deploying a separate vector database, developers can store embeddings alongside time-series and relational context, enabling AI-driven features on top of unified data.</p><p><a href="https://www.tigerdata.com/docs/use-timescale/latest/extensions/pg-textsearch"><strong><u>pg_textsearch</u></strong></a> provides modern full-text search powered by a BM25 ranking model and a memtable architecture that enables fast incremental indexing and low-latency search, right inside Postgres.</p><p>Layered on top of these engines is deep integration with the AWS platform for secure connectivity, streaming and batch ingest, observability, analytics, AI and billing, which has been a key area of focus for us over the last 12 months.</p><h2 id="a-year-of-deepening-aws-integrations">A Year of Deepening AWS Integrations</h2><p>The past 12 months of collaboration with AWS were shaped around one theme:</p><p><strong>Make Tiger Cloud feel native inside AWS architectures.</strong></p><p>We approached this holistically spanning secure connectivity, observability, billing, ingest and finally lakehouse interoperability.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/12/2025-dec-01-AWS--Diagram.png" class="kg-image" alt="Tiger Cloud feels native inside AWS architectures." 
loading="lazy" width="2000" height="1241" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2025/12/2025-dec-01-AWS--Diagram.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2025/12/2025-dec-01-AWS--Diagram.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2025/12/2025-dec-01-AWS--Diagram.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w2400/2025/12/2025-dec-01-AWS--Diagram.png 2400w" sizes="(min-width: 720px) 720px"></figure><h3 id="fast-ingest">Fast Ingest</h3><p>Ingest has been one of the biggest areas of investment this year. Across different industries, AWS customers generate time-series data in many ways: IoT fleets send telemetry through IoT Core or drop device exports into S3, financial firms stream market data through Kafka or Amazon MSK, and application teams accumulate large volumes of event or log data that regularly land in S3. 
These are different workloads and different customers, but they all shared the same pain: getting time-stamped data into Postgres required a patchwork of custom pipelines, consumers, and periodic backfills that were hard to scale and harder to maintain.</p><p>To simplify this, we spent the past year building native ingestion paths for Kafka / Amazon MSK, RDS for PostgreSQL, and Aurora PostgreSQL—all currently in beta — so that streams and operational data can flow directly into Tiger Cloud without bespoke glue code.</p><p>With our Postgres source connectors, customers can replicate existing time-series tables from RDS or Aurora into Tiger Cloud, where they’re converted into optimized hypertables for high-ingest workloads and fast queries, all without modifying their application or schema.</p><p><strong>S3 Connector Is Now Generally Available</strong></p><p><strong>Alongside these efforts, we also introduced a new S3 Connector, which has quickly become one of the most common ingest paths for AWS users. And today, we’re announcing its general availability. </strong>It continuously loads Parquet and CSV files from S3 into hypertables, handles late-arriving files, and makes historical data immediately available for real-time analytics without the Glue jobs, Lambdas, or custom ingestion services teams used to build.</p><p><strong>Taken together, these capabilities create a cleaner, more AWS-native ingest model:</strong> whether your time-series data originates from IoT devices, market data feeds, event systems, or existing Postgres deployments, it can now flow directly into Tiger Cloud without additional code or operational overhead.</p><h3 id="lakehouse-interoperability">Lakehouse Interoperability</h3><p>For many AWS customers, Amazon S3 serves as the foundation of their analytics and AI platforms, the place where governed, long-term datasets live and where engines like Athena, EMR or SageMaker expect to read data. 
Earlier this year, we introduced Tiger Lake in private beta, our Apache Iceberg integration, to make it easy for teams to expose curated operational and time-series data from Tiger Cloud directly into their S3-based lakehouse.</p><p>Tiger Lake works by using Postgres change data capture (CDC) to track every insert, update, and delete in source tables or hypertables. It then converts those changes into Iceberg-compliant commits and writes them to Amazon S3 via the S3 Tables interface. Because the output is a native Iceberg table stored in S3, AWS analytics and AI services can immediately query or train on the data using their existing tooling, with no batch exports, Spark pipelines, or glue code required. Operational changes in Tiger Cloud flow directly into the lakehouse as versioned Iceberg snapshots.</p><p><strong>Tiger Lake Is Now in Public Beta</strong></p><p><strong>Today we’re announcing that Tiger Lake is available in public beta, enabling any table or hypertable in Tiger Cloud to be continuously published as an Iceberg table on S3.</strong></p><p>Tiger Cloud powers the real-time, high-ingest workloads that vanilla Postgres struggles with, while your AWS analytics and AI stack reads the same data through Iceberg on S3. It’s a natural bridge between operational Postgres workflows and the lakehouse architectures that drive analytics, ML, and enterprise intelligence on AWS.</p><h3 id="secure-connectivity">Secure Connectivity</h3><p>Customers want their databases to plug into AWS environments in the same way they already connect to RDS, Aurora, or Redshift to ensure their data and applications remain secure. Tiger Cloud has supported VPC peering for years, making private, single-VPC deployments straightforward. At the beginning of this year, our new Transit Gateway support expanded that pattern to multi-account, multi-VPC organizations. 
Today, customers can connect Tiger Cloud using well-understood AWS constructs, without VPNs, proxies, or public endpoints.</p><h3 id="observability">Observability</h3><p>Observability has been AWS-native in Tiger Cloud for a long time. Tiger Cloud integrates directly with Amazon CloudWatch for both metrics and logs, so teams can monitor their database using the same tooling they rely on for EC2, EKS, Lambda, MSK, and the rest of their AWS environment.</p><p>Tiger Cloud streams operational metrics into CloudWatch Metrics and sends structured logs to CloudWatch Logs, making it easy to build dashboards, set alarms, and satisfy compliance requirements without new tooling.&nbsp;</p><h3 id="billing">Billing</h3><p>Finally, we wanted the commercial experience to feel as seamless as the technical one. Tiger Cloud is fully integrated into AWS Marketplace, allowing customers to use the same procurement paths they already use for other AWS services.</p><p>For teams that want to get started quickly, Tiger Cloud supports pay-as-you-go billing directly through their AWS account. There’s no new vendor onboarding or separate invoice; usage simply appears on the existing monthly AWS bill.</p><p>For larger organizations with specific architectural, security, or cost requirements, we also support private offers, giving enterprises the ability to secure annual commitments, customized pricing, and tailored deployment guidance, all handled through AWS Marketplace.</p><h2 id="case-study-speedcast%E2%80%99s-unified-architecture-on-aws">Case Study: Speedcast’s Unified Architecture on AWS</h2><p>Speedcast runs a global telecom network for remote industries, combining satellite and terrestrial links to keep ships, rigs, and NGOs online.</p><p>Previously, Speedcast had to juggle separate geospatial, relational, and time-series stores plus aging SCADA systems, stitching them together with fragile ETL pipelines that slowed insights and raised operational risk. 
With Tiger Lake in their AWS + Tiger Data stack, Speedcast dropped custom scripts and batching, using native integrations between their data lakehouse and Tiger Cloud to move toward a continuous data integration pipeline with Tiger Cloud at the center.</p><p>With Tiger Cloud as the “spider at the center of the web,” operations, data scientists, and customers now have a single, authoritative data source instead of hunting for data across systems. Platforms, dashboards, and events are powered by the same database in real time, regardless of workload pattern, with every system able to communicate with Apache Iceberg.</p><p><strong>“We stitched together Kafka, Flink, and custom code to stream data from Postgres to Iceberg. It worked, but it was fragile and high-maintenance,” said Kevin Otten, Director of Technical Architecture at Speedcast. “Tiger Lake replaces all of that with native infrastructure. It’s the architecture we wish we had from day one.”</strong></p><p>As Speedcast plans for service expansions and continues installing beyond 12,000 Starlink Terminals globally, Tiger Lake’s ingest pipeline will scale with them. For example, Speedcast can monitor usage patterns and spot emerging service-area outages in real time, before customers feel the impact. When a new service ticket is generated, Speedcast can drill into location, usage, and history with a single SQL query instead of bouncing between silos, reducing the time to resolution.&nbsp;</p><p>Read the <a href="https://www.tigerdata.com/blog/how-speedcast-built-a-global-communications-network-on-tiger-lake"><u>full case study</u></a> on our blog.</p><h2 id="bringing-your-data-workloads-together-on-aws">Bringing Your Data Workloads Together on AWS</h2><p>Modern applications shouldn’t require four different databases, half a dozen pipelines, and constant backfills just to keep data consistent. 
With Tiger Cloud and AWS, you get a unified Postgres engine that handles real-time ingest, high-performance time-series, vector search, and lakehouse integration—all inside the AWS architecture you already trust.</p><p>This is the future we’re building with AWS: simpler stacks, fewer moving parts, and one Postgres data layer for operational, analytical, and agentic workloads.</p><p>You can get started in minutes through the <a href="https://aws.amazon.com/marketplace/seller-profile?id=seller-wbtecrjp3kxpm"><u>AWS Marketplace</u></a> or sign up directly at <a href="http://tigerdata.com"><u>tigerdata.com</u></a> for free. We can’t wait to see what you build.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[We Taught AI to Write Real Postgres Code (And Open Sourced It)]]></title>
            <description><![CDATA[pg-aiguide teaches AI to write production-ready Postgres code with curated skills, semantic search, and version-aware docs. Open source and free to use.]]></description>
            <link>https://www.tigerdata.com/blog/we-taught-ai-to-write-real-postgres-code-open-sourced-it</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/we-taught-ai-to-write-real-postgres-code-open-sourced-it</guid>
            <category><![CDATA[Announcements & Releases]]></category>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Matvey Arye]]></dc:creator>
            <pubDate>Mon, 24 Nov 2025 15:00:09 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/11/2025-nov-21-thumbnail-open-source-mcp-server.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/11/2025-nov-21-thumbnail-open-source-mcp-server.png" alt="We Taught AI to Write Real Postgres Code (And Open Sourced It)" /><p>You open Claude Code or Cursor, describe your tables, and in seconds the AI hands you a Postgres schema that looks… fine. It runs. Your tests pass. You ship.</p><p>What you don’t see are the quiet little disasters tucked inside: money for prices, a BRIN index on random data, SERIAL and UUID mixed like a cocktail, timestamp without time zone because some tutorial said it was “easier”.</p><p>Fast-forward six months. You’re debugging currency-conversion bugs, chasing timezone ghosts, rewriting migrations, and adding the index that should have existed since day one. The code the AI agent wrote worked; it just wasn’t good. It was copying whatever examples it scraped from the entire internet.</p><p>And that’s the problem. It learned SQL from everywhere: Postgres, MySQL, SQLite, SQL Server, Oracle, random tutorials, and a decade of Stack Overflow answers. 
In all that noise, the nuances of idiomatic, high-quality Postgres get buried under the good, the bad, and the MySQL.</p><p>So we built something to fix that.</p><h2 id="giving-ai-the-postgres-judgment-it%E2%80%99s-missing">Giving AI the Postgres Judgment It’s Missing</h2><p>pg-aiguide gives AI coding agents the Postgres-specific judgment they’re missing. It does this with three things working together:</p><ol><li><strong>AI-optimized “skills”</strong> — curated, opinionated Postgres best practices that Claude Code and other agents can apply automatically.</li><li><strong>Semantic search across official documentation</strong> — version-aware retrieval for Postgres 15–18.</li><li><strong>Extension ecosystem docs</strong>, starting with TimescaleDB and expanding quickly.</li></ol><p>You can connect it to any AI coding agent via our public <strong>Model Context Protocol (MCP) server</strong> or with the <strong>Claude Code plugin</strong> built to take advantage of Claude’s native skill support. No accounts. No usage limits. Completely free.</p><p>The goal is simple: <strong>Make AI write correct, production-ready Postgres by default.</strong></p><p>You shouldn’t have to paste docs, correct outputs, or rely on prompt hacks. The AI should just generate better SQL the first time.</p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text"><b><strong style="white-space: pre-wrap;">Try it now </strong></b><br>You can start using pg-aiguide in less than a minute. It works with Claude Code, Codex, Cursor, Gemini CLI, Visual Studio, VS Code, Windsurf, and any other MCP-compatible editor. 
See our <a href="https://github.com/timescale/pg-aiguide/tree/main?tab=readme-ov-file#-quickstart"><u>quickstart guide</u></a> for installation instructions.</div></div><h2 id="why-database-code-quality-is-so-important">Why Database Code Quality Is So Important</h2><p>AI adoption among Postgres developers has exploded, jumping from 37% to 55% in a single year, according to the <a href="https://www.tigerdata.com/state-of-postgres/2024"><u>2024 State of Postgres</u></a> survey. But as LLMs become more general, their training data grows broader rather than more specialized, and their output becomes <em>less</em> precise.</p><p>This creates a surprising problem: a study by METR recently found that experienced developers using AI actually took <strong>19 percent longer</strong> to complete tasks. Not because the AI failed, but because the output was almost right—the kind of subtle mistakes that take the longest to spot, debug, and unwind.</p><p>With databases, “almost right” compounds quickly.</p><p>Schemas codify decisions that are painful to change later. Indexing mistakes hide until your tables grow. Datatype choices that take an hour to fix in development can take days of migrations and careful planning in production.</p><p>You can refactor application code. You can’t refactor state without consequences.</p><p>pg-aiguide stops these mistakes at the source, before they get baked into production.</p><h2 id="generic-vs-guided-sql">Generic vs. Guided SQL</h2><p>Here’s what happens when you ask a coding assistant to design a simple bookstore schema.</p><h3 id="without-pg-aiguide">Without pg-aiguide</h3><pre><code class="language-SQL">CREATE TABLE customers (
    customer_id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    email VARCHAR(255) UNIQUE NOT NULL,
    password_hash VARCHAR(255) NOT NULL,
    first_name VARCHAR(100) NOT NULL,
    last_name VARCHAR(100) NOT NULL,
    phone VARCHAR(20),
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    is_active BOOLEAN DEFAULT TRUE
);

CREATE INDEX idx_customers_email ON customers(email);

CREATE TABLE authors (
    author_id SERIAL PRIMARY KEY,
    first_name VARCHAR(100) NOT NULL,
    last_name VARCHAR(100) NOT NULL,
    biography TEXT,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);
</code></pre><p>This schema works, but has classic issues:</p><ul><li>mixed identity strategies</li><li>arbitrary <code>VARCHAR</code> limits</li><li>case-sensitive email uniqueness</li><li>timestamp columns that have defaults but no <code>NOT NULL</code>, so they can still be set to <code>NULL</code></li></ul><h3 id="with-pg-aiguide-enabled">With pg-aiguide enabled</h3><pre><code class="language-SQL">CREATE TABLE authors (
  author_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  name TEXT NOT NULL,
  bio TEXT,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX ON authors (name);

CREATE TABLE users (
  user_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  email TEXT NOT NULL UNIQUE,
  password_hash TEXT NOT NULL,
  first_name TEXT NOT NULL,
  last_name TEXT NOT NULL,
  phone TEXT,
  is_active BOOLEAN NOT NULL DEFAULT true,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE UNIQUE INDEX ON users (LOWER(email));
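-- Illustrative addition, not part of the generated schema: assuming the users
-- table above, the second insert below would fail with a unique violation,
-- because both addresses lower-case to the same value. The sample values are
-- hypothetical, and the statements are left commented out so the schema
-- script still runs as written.
-- INSERT INTO users (email, password_hash, first_name, last_name)
--   VALUES ('Ada@example.com', 'hypothetical-hash', 'Ada', 'Lovelace');
-- INSERT INTO users (email, password_hash, first_name, last_name)
--   VALUES ('ada@example.com', 'hypothetical-hash', 'Ada', 'Lovelace');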
CREATE INDEX ON users (created_at);</code></pre><p>This version uses:</p><ul><li>consistent identity columns (<code>BIGINT GENERATED ALWAYS AS IDENTITY</code>)</li><li><code>TEXT</code> instead of unnecessary <code>VARCHAR</code></li><li>non-null, time-zone-aware timestamps (<code>TIMESTAMPTZ NOT NULL DEFAULT now()</code>)</li><li>case-insensitive uniqueness enforced with an expression index on <code>LOWER(email)</code></li></ul><p>Behind the scenes, the AI used the <a href="https://github.com/timescale/pg-aiguide/blob/main/skills/design-postgres-tables/SKILL.md"><u>design-postgres-tables skill</u></a> either through the <a href="https://github.com/timescale/pg-aiguide/blob/main/API.md#view_skill"><u>view_skill</u></a> MCP tool or <a href="https://www.claude.com/blog/skills"><u>Claude’s native skills framework</u></a>. In both cases, the agent automatically discovered and applied the Postgres-optimized guidance without human intervention.</p><p>You didn’t have to prompt differently.</p><p>You didn’t have to paste in docs.</p><p><strong>pg-aiguide automatically shifts AI from “SQL that works” to “SQL you’d actually want in production.”</strong></p><h2 id="the-skills-are-the-secret-sauce">The Skills Are the Secret Sauce</h2><p>If you want AI to generate high-quality SQL, it is not enough to let it search the manual. A manual tells you what you can do, not what you should do. Skills fill that gap. They give the model judgment, not just facts.</p><p>Our skills are not trying to reteach the LLM syntax or capabilities. Instead, they give the model the context it needs to make better choices. Here is an excerpt from a real skill:</p><pre><code class="language-markup">## Postgres "Gotchas"

- **FK indexes**: Postgres **does not** auto-index FK columns. Add them.
- **No silent coercions**: length/precision overflows error out (no truncation). 
  Example: inserting 999 into `NUMERIC(2,0)` fails with error, unlike some 
  databases that silently truncate or round.
- **Heap storage**: no clustered PK by default (unlike SQL Server/MySQL InnoDB); 
  row order on disk is insertion order unless explicitly clustered.</code></pre><p>These are the kinds of details you only know once you have lived in Postgres for a while. They trip up LLMs for the same reason they trip up developers who are new (and not so new) to the database. Yet these details are exactly what allow the model to produce better SQL.</p><p>In our evaluations (currently human-vibes-driven, soon LLM-judged), a system that combines semantic search with skills consistently produces better schemas than one with semantic search alone:</p><ul><li>more appropriate data types</li><li>correct timestamp semantics</li><li>stronger indexing strategies</li><li>fewer migration pitfalls</li><li>fewer long-term performance surprises</li></ul><p>This is what “AI coding tools actually understanding Postgres” looks like.</p><h2 id="the-tools-we-provide-the-llm">The Tools We Provide the LLM</h2><p>pg-aiguide provides two core capabilities that map cleanly to how AI coding tools operate.</p><h3 id="1-skills-complete-opinionated-postgres-guidance">1. Skills: Complete, Opinionated Postgres Guidance</h3><p><code>view_skill</code> returns full, AI-optimized best practices. These aren’t tutorials and they aren’t vague prompts. They’re machine-targeted, dense, token-efficient guidance that the AI can reliably use.</p><p>For example:</p><ul><li>prefer <code>BIGINT GENERATED ALWAYS AS IDENTITY</code></li><li>don’t use <code>money</code></li><li>don’t use <code>timestamp</code> without time zone</li><li>index your foreign keys</li><li>expect errors on precision overflows</li></ul><p>Skills don’t need to be chunked—they are written so that each skill fits in context as a single complete unit.</p><p>Claude Code even supports skills natively, so the MCP server’s <code>view_skill</code> tool is disabled automatically when pg-aiguide runs as a plugin.</p><h3 id="2-semantic-search-version-aware-vector-retrieval-across-docs">2. 
Semantic Search: Version-Aware Vector Retrieval Across Docs</h3><p>The MCP tools <code>semantic_search_postgres_docs</code> and <code>semantic_search_tiger_docs</code> allow the AI to pull in the <strong>correct</strong> documentation for the Postgres version you’re targeting.</p><p>This matters because Postgres versions evolve meaningfully:</p><ul><li>Postgres 15: <code>UNIQUE NULLS NOT DISTINCT</code></li><li>Postgres 16: major changes to parallel query behavior</li><li>Postgres 17: COPY error-handling improvements</li></ul><p>Without version awareness, an AI can (and does) hallucinate features or syntax that will break your actual environment.</p><p>All of this knowledge of Postgres is chunked, embedded, and stored in Postgres itself.</p><p>We scrape official HTML docs, preserve header context, attach source URLs, and use character-bounded chunking with H1→H2→H3 breadcrumbs so each piece retains meaning of how it fits into the broader puzzle.</p><h2 id="help-us-build-the-world%E2%80%99s-best-postgres-guide-for-ai">Help Us Build the World’s Best Postgres Guide for AI</h2><p>Postgres has 35 years of engineering, craft, and hard-won lessons behind it. No single team can capture all of that. The community built the patterns, extensions, and production wisdom that make Postgres what it is. AI coding tools should reflect that depth, not spit out generic SQL lifted from outdated tutorials and old Stack Overflow posts.</p><p>pg-aiguide is our first step toward making Postgres the best database to use with AI coding assistants on purpose, not by accident. We are expanding the skill library with richer indexing guidance, full-text search skills, and documentation for essential extensions like PostGIS and pgvector. We are also adding keyword BM25 search to pair with semantic search for more accurate retrieval. 
<strong>But we need your help.</strong></p><h3 id="how-you-can-contribute">How You Can Contribute</h3><p>You can make an immediate impact:</p><ul><li>add documentation for your Postgres extension</li><li>contribute new skills that encode real, battle-tested expertise</li><li>help evaluate, refine, and stress-test existing skills</li><li>request features or report issues</li><li>improve semantic search chunking or propose new areas to index</li><li>share deep knowledge on partitioning, replication, security, or performance tuning</li></ul><p>Skills matter most. They turn years of experience into guidance the AI can use instantly. Our schema-design skill went through multiple iterations before it felt right, and we learned a ton in the process. We would love to partner with you to build skills in your area of expertise.</p><p>pg-aiguide is fully open source at github.com/timescale/pg-aiguide.&nbsp;</p><p><strong>Help us teach AI to write Postgres like an expert.</strong></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to Train Your Agent to Be a Postgres Expert]]></title>
            <description><![CDATA[Turn AI into a Postgres expert with our MCP server. Get 35 years of best practices, versioned docs, and prompt templates for production-ready schemas.]]></description>
            <link>https://www.tigerdata.com/blog/free-postgres-mcp-prompt-templates</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/free-postgres-mcp-prompt-templates</guid>
            <category><![CDATA[Announcements & Releases]]></category>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Matty Stratton]]></dc:creator>
            <pubDate>Wed, 22 Oct 2025 14:02:12 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/10/2025-Oct-21-Prompt-Template-Thumbnail.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/10/2025-Oct-21-Prompt-Template-Thumbnail.png" alt="How to Train Your Agent to Be a Postgres Expert" /><h3 id="with-prompt-templates-and-versioned-docs-we-turn-35-years-of-postgres-wisdom-into-structured-knowledge-your-agent-can-reason-with">With prompt templates and versioned docs, we turn 35 years of Postgres wisdom into structured knowledge your agent can reason with.</h3><p>Agents are the <a href="https://www.tigerdata.com/blog/postgres-for-agents" rel="noreferrer">new developer</a>. But they’re generalists.</p><p>What happens when they design your Postgres database? Your schema runs, your tests pass… and six months later your queries crawl and your costs skyrocket.</p><p>AI-generated SQL and database schemas are almost right. And that’s the problem. Fixing schema design mistakes is costlier than refactoring code. It often means multi-week migrations, downtime windows, rollback plans, and your CEO asking why the site is in maintenance mode. The root issue? LLMs lack the depth of Postgres and database expertise needed to build scalable systems. And when agents try to learn, they find documentation written for humans, not for them.</p><p>But agents don’t need more data; they need better context. They need to know what “good Postgres” actually looks like. The good news is that, given the right context and tools, agents can become instant experts. Even with Postgres.</p><p>That’s why we built an MCP server that provides 35 years of Postgres wisdom and full access to the Postgres docs, all in a format that agents can easily process.</p><p>And we think this just might be the best database MCP server ever built. While Neon, Supabase, and other Postgres companies created MCP servers as thin API wrappers, ours teaches AI how to think in Postgres. 
The Tiger MCP server gives AI tools that work automatically: no prompt engineering or manual lookups needed. You just ask. And it provides correct, idiomatic Postgres. </p><p>Our new MCP server ships with detailed prompt templates written by our senior engineers, plus versioned Postgres (15-18) docs, and <a href="https://docs.tigerdata.com/" rel="noreferrer">TimescaleDB docs</a>.&nbsp;When your AI needs to design a schema, it automatically pulls the right template and searches the docs (using hybrid search) to generate code that actually lasts. And of course, it handles the basics: start, stop, <a href="https://www.tigerdata.com/blog/fast-zero-copy-database-forks" rel="noreferrer">fork a database</a>. </p><p>But the real magic happens when you ask AI to design your schema. So let’s see what happens when AI designs a simple schema. First without the MCP server, then with it.</p><h2 id="without-the-mcp-server">Without the MCP Server</h2><pre><code class="language-SQL">CREATE TABLE customers (
    customer_id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    email VARCHAR(255) UNIQUE NOT NULL,
    password_hash VARCHAR(255) NOT NULL,
    first_name VARCHAR(100) NOT NULL,
    last_name VARCHAR(100) NOT NULL,
    phone VARCHAR(20),
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    is_active BOOLEAN DEFAULT TRUE
);

CREATE INDEX idx_customers_email ON customers(email);

CREATE TABLE authors (
    author_id SERIAL PRIMARY KEY,
    first_name VARCHAR(100) NOT NULL,
    last_name VARCHAR(100) NOT NULL,
    biography TEXT,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);
</code></pre><p>What’s wrong here:</p><ul><li>Mixed ID types (SERIAL and UUID) with no clear reason</li><li>Arbitrary VARCHAR limits instead of TEXT</li><li>Case-sensitive email uniqueness, which allows duplicate accounts that differ only by case</li><li>Timestamp columns with defaults but no NOT NULL, so they can still end up NULL (CURRENT_TIMESTAMP itself is equivalent to now() in Postgres)</li></ul><p>Someone new to Postgres may not notice these problems. Until it’s too late.</p><p>Now let’s see what AI generates with our new MCP server.</p><h2 id="with-the-mcp-server">With the MCP Server</h2><pre><code class="language-SQL">CREATE TABLE authors (
  author_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  name TEXT NOT NULL,
  bio TEXT,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX ON authors (name);

CREATE TABLE users (
  user_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  email TEXT NOT NULL UNIQUE,
  password_hash TEXT NOT NULL,
  first_name TEXT NOT NULL,
  last_name TEXT NOT NULL,
  phone TEXT,
  is_active BOOLEAN NOT NULL DEFAULT true,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE UNIQUE INDEX ON users (LOWER(email));
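-- Illustrative addition, not part of the generated schema: a lookup can only
-- use the expression index above when it applies the same expression to the
-- column. The address is a made-up example, and the query is left commented
-- out so the schema script still runs as written.
-- SELECT user_id FROM users WHERE LOWER(email) = LOWER('Ada@Example.com');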
CREATE INDEX ON users (created_at);</code></pre><p>What’s better about this?</p><ul><li>Consistent ID strategy with BIGINT GENERATED ALWAYS AS IDENTITY</li><li>TEXT instead of arbitrary VARCHAR limits</li><li>Case-insensitive email lookups</li><li>Modern timestamp handling</li></ul><p>But why does this matter?</p><p>Each of these differences creates a compounding problem. Changing datatypes in the future will require full table rewrites. Missing lowercase email handling means duplicate accounts and confused users. And time zones? Every senior developer gets the thousand-yard stare when you mention UTC conversions.</p><p>This is just with a small example; imagine what would happen with more complex schemas.</p><p>And if you don’t believe us, here’s what Claude has to say:</p><pre><code class="language-markdown">&gt; Please describe the schema you would create for an e-commerce website two times, first with the tiger mcp server disabled, then with the tiger mcp server enabled. For each time, write the schema to its own file in the current working directory. Then compare the two files and let me know which approach generated the better schema, using both qualitative and quantitative reasons. For this example, only use standard Postgres.</code></pre><figure class="kg-card kg-video-card kg-width-regular" data-kg-thumbnail="https://timescale.ghost.io/blog/content/media/2025/10/how-to-train-your-agent_thumb.jpg" data-kg-custom-thumbnail="">
            <div class="kg-video-container">
                <video src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/media/2025/10/how-to-train-your-agent.mp4" poster="https://img.spacergif.org/v1/1280x720/0a/spacer.png" width="1280" height="720" loop="" autoplay="" muted="" playsinline="" preload="metadata" style="background: transparent url('https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/media/2025/10/how-to-train-your-agent_thumb.jpg') 50% 50% / cover no-repeat;"></video>
                <div class="kg-video-overlay">
                    <button class="kg-video-large-play-icon" aria-label="Play video">
                        <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
                            <path d="M23.14 10.608 2.253.164A1.559 1.559 0 0 0 0 1.557v20.887a1.558 1.558 0 0 0 2.253 1.392L23.14 13.393a1.557 1.557 0 0 0 0-2.785Z"></path>
                        </svg>
                    </button>
                </div>
                <div class="kg-video-player-container kg-video-hide">
                    <div class="kg-video-player">
                        <button class="kg-video-play-icon" aria-label="Play video">
                            <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
                                <path d="M23.14 10.608 2.253.164A1.559 1.559 0 0 0 0 1.557v20.887a1.558 1.558 0 0 0 2.253 1.392L23.14 13.393a1.557 1.557 0 0 0 0-2.785Z"></path>
                            </svg>
                        </button>
                        <button class="kg-video-pause-icon kg-video-hide" aria-label="Pause video">
                            <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
                                <rect x="3" y="1" width="7" height="22" rx="1.5" ry="1.5"></rect>
                                <rect x="14" y="1" width="7" height="22" rx="1.5" ry="1.5"></rect>
                            </svg>
                        </button>
                        <span class="kg-video-current-time">0:00</span>
                        <div class="kg-video-time">
                            /<span class="kg-video-duration">0:50</span>
                        </div>
                        <input type="range" class="kg-video-seek-slider" max="100" value="0">
                        <button class="kg-video-playback-rate" aria-label="Adjust playback speed">1×</button>
                        <button class="kg-video-unmute-icon" aria-label="Unmute">
                            <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
                                <path d="M15.189 2.021a9.728 9.728 0 0 0-7.924 4.85.249.249 0 0 1-.221.133H5.25a3 3 0 0 0-3 3v2a3 3 0 0 0 3 3h1.794a.249.249 0 0 1 .221.133 9.73 9.73 0 0 0 7.924 4.85h.06a1 1 0 0 0 1-1V3.02a1 1 0 0 0-1.06-.998Z"></path>
                            </svg>
                        </button>
                        <button class="kg-video-mute-icon kg-video-hide" aria-label="Mute">
                            <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
                                <path d="M16.177 4.3a.248.248 0 0 0 .073-.176v-1.1a1 1 0 0 0-1.061-1 9.728 9.728 0 0 0-7.924 4.85.249.249 0 0 1-.221.133H5.25a3 3 0 0 0-3 3v2a3 3 0 0 0 3 3h.114a.251.251 0 0 0 .177-.073ZM23.707 1.706A1 1 0 0 0 22.293.292l-22 22a1 1 0 0 0 0 1.414l.009.009a1 1 0 0 0 1.405-.009l6.63-6.631A.251.251 0 0 1 8.515 17a.245.245 0 0 1 .177.075 10.081 10.081 0 0 0 6.5 2.92 1 1 0 0 0 1.061-1V9.266a.247.247 0 0 1 .073-.176Z"></path>
                            </svg>
                        </button>
                        <input type="range" class="kg-video-volume-slider" max="100" value="100">
                    </div>
                </div>
            </div>
            
        </figure><pre><code class="language-markdown">Verdict:

The Tiger MCP-enabled schema is objectively superior because:

1. Type Safety: Uses modern PostgreSQL types that prevent common errors
2. Data Integrity: 4x more constraints ensure data quality at the database level
3. Performance: Better indexing strategy with 55% more indexes, including partial and expression indexes
4. Maintainability: Better documentation, clear comments, proper naming
5. Features: More complete e-commerce functionality (discounts, full-text search, audit trails)
6. PostgreSQL Best Practices: Follows official PG17 recommendations from the documentation
7. Future-Proof: Uses modern features like GENERATED ALWAYS AS IDENTITY and NULLS NOT DISTINCT

The Tiger MCP server provided access to PostgreSQL-specific best practices documentation and design patterns that resulted in a more robust, performant, and maintainable schema.
</code></pre><h2 id="how-prompt-templates-make-it-all-possible">How Prompt Templates Make It All Possible</h2><p>Behind the scenes, the AI uses the MCP server to call <code>get_prompt_template('design_postgres')</code> to load schema design guidance. No pasted docs. No corrections. Just better code.</p><p>Prompt templates turn production wisdom into reusable guidance for AI. Instead of hunting through documentation written for humans, AI gets the information it needs in a format built for machines.</p><p>Humans and LLMs have opposite needs. Humans need narratives and memorable examples (and yes, even cat memes) to help them retain information. LLMs need to preserve context window space. That’s why prompt templates make terrible blog posts, but perfect AI guidance.</p><p>Our philosophy is: don’t re-teach what the model already knows. LLMs have seen millions of lines of SQL. They know how to write <code>CREATE TABLE</code>. What they don’t know is the 35 years of Postgres wisdom about what works well and what doesn’t.</p><p>It’s like your senior DBA whispering advice in the model’s ear.</p><p>Our schema design template (<code>design_postgres_tables</code>) doesn’t explain what a primary key is. It jumps straight to guidance:</p><p>“Prefer <code>BIGINT GENERATED ALWAYS AS IDENTITY</code>; use <code>UUID</code> only when global uniqueness is needed.”</p><p>For data types, it doesn’t teach from scratch. It just tells you what works:</p><p>“DO NOT use <code>money</code> type; DO use <code>numeric</code> instead.”</p><p>Here’s a real snippet from the template:</p><pre><code class="language-markdown">## Postgres "Gotchas"

- **FK indexes**: Postgres **does not** auto-index FK columns. Add them.
- **No silent coercions**: length/precision overflows error out (no truncation). 
  Example: inserting 999 into `NUMERIC(2,0)` fails with error, unlike some 
  databases that silently truncate or round.
- **Heap storage**: no clustered PK by default (unlike SQL Server/MySQL InnoDB); 
  row order on disk is insertion order unless explicitly clustered.</code></pre><p>These gotchas trip up LLMs the same way they trip up developers new to Postgres. We optimized these templates for machines: short, factual, and precise, packing maximum guidance into minimum tokens.</p><p>We tested the same approach on a real IoT schema design task. Without templates, the AI added forbidden configurations and missed critical optimizations. <em>With</em> templates, it generated production-ready code with compression, continuous aggregates, and tuned performance.</p><p>That’s how prompt templates work. Now let’s see how the MCP server makes it all happen.</p><h2 id="how-this-mcp-server-is-smarter-than-others">How This MCP Server Is Smarter Than Others</h2><p>While Neon, Supabase, and other Postgres companies created MCP servers as thin API wrappers, ours teaches AI how to think in Postgres. The Tiger MCP server gives AI tools that work automatically: no prompt engineering or manual lookups needed. You just ask. And it provides correct, idiomatic Postgres.</p><p><strong><code>get_prompt_template</code> provides auto-discovered expertise.</strong> Instead of having to call a template explicitly, you just say “I want to make a schema for IoT devices…” and the MCP server figures it out.</p><p>With self-discoverable templates, the AI can detect intent and load the right recipe, applying 35 years of Postgres best practices behind the scenes.</p><p><strong>The templates have real depth.</strong> No scraped snippets or boilerplate. The templates are written by senior Postgres engineers and provide opinionated, production-tested guidance tuned to sidestep the traps that seasoned DBAs have learned to avoid.</p><p><strong>Postgres-native vector retrieval adds the right context.</strong> When the AI needs more information, the MCP server searches the versioned Postgres (15-18) and TimescaleDB docs. 
And it uses Postgres itself for storage and vector search.</p><p>Versioning is critical. For example, Postgres 15 introduced UNIQUE NULLS NOT DISTINCT, while 16 improved parallel queries, and 17 changed COPY error handling. The MCP keeps AIs grounded in correct syntax every time, avoiding broken code from the wrong version.</p><p>The Tiger MCP doesn’t just wire up APIs. It teaches AI to think like a real Postgres engineer. </p><p>You don’t have to craft the perfect prompt. You just ask, and it does the right thing.</p><h2 id="see-it-for-yourself">See It For Yourself</h2><p>Install the Tiger CLI and MCP server:</p><pre><code class="language-shell">curl -fsSL https://cli.tigerdata.com | sh
tiger auth login
tiger mcp install</code></pre><p>(We also have alternative <a href="https://github.com/timescale/tiger-cli"><u>installation instructions</u></a> for the CLI tool.)</p><p>Then select your AI assistant (Claude Code, Cursor, VS Code, Windsurf, etc.) and immediately get real Postgres knowledge flowing into your AI.</p><p>This is how Postgres becomes the best database to use with AI coding tools: not by accident, not because someone pasted docs into a chat, but because the tooling now teaches AI how to think in Postgres.</p><p>Try the MCP server. Break it. <a href="https://timescaledb.slack.com/join/shared_invite/zt-38c4rrt9t-eR8I4hnb4qeGLUrL6hM3mA#/shared-invite/email"><u>Improve it</u></a>. Help us teach every AI to write real Postgres.</p><hr><p><strong>About the authors</strong></p><p><strong>Matty Stratton</strong></p><p>Matty Stratton is the Head of Developer Advocacy and Docs at Tiger Data, a well-known member of the DevOps community, founder and co-host of the popular <a href="https://www.arresteddevops.com/"><u>Arrested DevOps</u></a> podcast, and a global organizer of the <a href="https://devopsdays.org"><u>DevOpsDays</u></a> set of conferences.</p><p>Matty has over 20 years of experience in IT operations and is a sought-after speaker, presenting at Agile, DevOps, and cloud engineering events worldwide. Demonstrating his keen insight into the changing landscape of technology, he recently changed his license plate from DEVOPS to KUBECTL.</p><p>He lives in the Chicagoland area and has three awesome kids and two Australian Shepherds, whom he loves just a little bit more than he loves Diet Coke.</p><p><strong>Matvey Arye</strong></p><p><a href="https://www.linkedin.com/in/matvey-arye/"><u>Matvey Arye</u></a> is a founding engineering leader at Tiger Data (creators of TimescaleDB), the premier provider of relational database technology for time-series data and AI. 
Currently, he manages the team at Tiger Data responsible for building the go-to developer platform for AI applications.</p><p>Under his leadership, the Tiger Data engineering team has introduced partitioning, compression, and incremental materialized views for time-series data, plus cutting-edge indexing and performance innovations for AI.</p><p>Matvey earned a bachelor’s degree in engineering at The Cooper Union. He earned a doctorate in computer science at Princeton University, where his research focused on cross-continental data analysis covering issues such as networking, approximate algorithms, and performant data processing.</p><p><strong>Jacky Liang</strong></p><p><a href="https://www.linkedin.com/in/jjackyliang/"><u>Jacky Liang</u></a> is a developer advocate at Tiger Data with an obsession for AI and LLMs. He's worked at Pinecone, Oracle Cloud, and Looker Data as both a software developer and product manager, which has shaped the way he thinks about software.</p><p>He cuts through AI hype to focus on what actually works. How can we use AI to solve real problems? What tools are worth your time? How will this technology actually change how we work?</p><p>When he's not writing or speaking about AI, Jacky builds side projects and tries to keep up with the endless stream of new AI tools and research—an impossible task, but he keeps trying anyway. His model of choice is Claude Sonnet 4 and his favorite coding tool is Claude Code.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Introducing Agentic Postgres Free Plan: The Fastest Way to Experiment with AI on Postgres]]></title>
            <description><![CDATA[Experiment with AI on Postgres. The Tiger Free Plan offers database forks, vector search, and real-time analytics. No credit card required. Built for developers and agents: Agentic Postgres.]]></description>
            <link>https://www.tigerdata.com/blog/introducing-agentic-postgres-free-plan-experiment-ai-on-postgres</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/introducing-agentic-postgres-free-plan-experiment-ai-on-postgres</guid>
            <category><![CDATA[Announcements & Releases]]></category>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[AI agents]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Hien Phan]]></dc:creator>
            <pubDate>Tue, 21 Oct 2025 13:47:05 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/10/ABL--free-tier-Blog-thumbnail.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/10/ABL--free-tier-Blog-thumbnail.png" alt="Introducing Agentic Postgres Free Plan: The Fastest Way to Experiment with AI on Postgres" /><p><em>TLDR:</em>&nbsp;<em>A new chapter for Postgres starts today: Agentic Postgres, the first </em><a href="https://www.tigerdata.com/blog/postgres-for-agents" rel="noreferrer"><em>database for agents</em></a><em>, brings new architecture for the age of agents, and free access for everyone who builds with them.</em></p><p>We're launching a free tier. No credit card, no time limit, no catch.</p><p>AI development moves at the speed of thought. You’re constantly experimenting, testing a vector search strategy, forking a database to try a new schema, or spinning up an instance for a weekend project. The friction of “will this cost me money?” shouldn’t slow you down.</p><p>Today, we’re launching the Tiger Free Plan, a fully managed Postgres built for how AI development actually works: experimental, iterative, and increasingly agent-driven. It’s the same Tiger managed cloud experience developers already love, now free for every idea.</p><p>The Tiger Free Plan should be the database you reach for every time you want to test an idea.&nbsp;</p><h2 id="what-makes-this-free-plan-different">What Makes This Free Plan Different</h2><p>Most free database tiers are built for yesterday’s users. They gate core features, expire trials, or remind you what you don’t get. We built ours for how development actually happens today.</p><p>Because the <a href="https://www.tigerdata.com/blog/postgres-for-agents" rel="noreferrer">database has a new user</a>. Developers aren’t working alone anymore. They’re building alongside agents that write code, run migrations, and query data through APIs. That changes what experimentation means. 
You don’t just need a place to store data; you need a system that your tools, scripts, and agents can spin up, fork, and reason over automatically.</p><p>The Tiger Cloud Free Plan is built for that reality. It’s not designed for production. It's designed for progress, for the 90 percent of work that happens before you know if something is worth scaling.</p><p>Other free tiers optimize for control. Ours optimizes for momentum. When you’re experimenting, you shouldn’t have to think about limits, billing, or setup. You should be able to fork your database, test a schema, or try a new retrieval strategy in seconds, whether you’re doing it yourself or your agent is doing it for you.</p><p>That’s why it includes the tools modern AI workflows actually rely on. You can fork your database to test safely and recover fast, run vector search to compare retrieval and embedding strategies in real time, and analyze live data with built-in time-series and columnar features so you can see what’s happening as it happens. Because Tiger Cloud runs the same APIs across every plan, moving from free to production is instant: no migration, no rework, no context lost.</p><p>This isn’t a teaser or a trial. It's part of a new architecture for Postgres, one built for builders: humans and agents alike.&nbsp;</p><h2 id="what%E2%80%99s-included">What’s Included</h2><h3 id="compute-storage">Compute &amp; storage:</h3><ul><li>Shared compute</li><li>Up to 750 MB of storage per service</li><li>Limit of 2 free services per account</li><li>Available in us-east-1 (EU expansion coming soon)</li></ul><h3 id="features-that-matter">Features that matter:</h3><ul><li><strong>Database forks:</strong> <a href="https://www.tigerdata.com/blog/fast-zero-copy-database-forks" rel="noreferrer">Branch your database</a> like your code. 
Test safely, recover fast, or try something bold.</li><li><strong>AI-native retrieval:</strong> Build RAG and other AI-powered features with <a href="https://www.tigerdata.com/blog/introducing-pg_textsearch-true-bm25-ranking-hybrid-retrieval-postgres" rel="noreferrer">native hybrid and vector search</a> support (pgvectorscale + BM25).</li><li><strong>Real-time analytics:</strong> Hypertables, continuous aggregates, and columnar storage. Run <a href="https://assets.timescale.com/docs/downloads/tigerdata-whitepaper.pdf"><u>analytics in Postgres</u></a> without extra systems.</li><li><strong>Insights:</strong> View performance on a per-query basis over time, and get optimization recommendations, so you spend less time guessing.&nbsp;</li><li><strong>Automated Management: </strong>Upgrades, tuning, and maintenance handled for you, so your database always runs at its best.</li><li><strong>50+ Postgres Extensions. </strong>All your favorite extensions, from PostGIS to pgvector, built in and ready to go.</li><li><strong>Connection management:</strong> Simplified, secure handling that just works.</li></ul><h2 id="what-happens-when-you-hit-the-limits">What Happens When You Hit the Limits</h2><p>At 750 MB, your service switches to read-only mode. You’ll get warnings as you approach the limit, and you can:</p><ul><li>Fork to an earlier point in time (PITR - 24-hour point in time recovery)</li><li>Clean up data</li><li>Or upgrade to a paid plan and migrate in minutes</li></ul><p>Free services can be upgraded directly. When your project grows, you can quickly convert it in place to remove storage limits. Your free services coexist perfectly with your paid ones, so you can test safely alongside production.</p><p>Connection poolers are not included in the Free Plan. 
Dedicated support is not included either, but Free Plan users are encouraged to join our <a href="https://slack.timescale.com/?__hstc=231067136.9958a7ac0060b2f1fd85cea041eba3e1.1752754343277.1760925381027.1760962725002.277&amp;__hssc=231067136.5.1760962725002&amp;__hsfp=839360075"><u>community Slack</u></a> to ask questions, share ideas, and see what others are building.</p><h3 id="built-for-the-builders-developers-and-agents-alike">Built for the builders: Developers and agents alike</h3><p>If you're new to Tiger, you can start with the free plan or take a 30-day performance trial. If you're already a customer, you can add two free services right inside your existing account for testing, sandboxing, or side projects.</p><p>The Free Plan is our way of lowering the floor for experimentation. We want more developers and agents building together on Postgres without friction. It's a small change with a big goal: making it effortless to start, learn, and build.</p><p><strong>Get Started</strong></p><p>From the command line via the Tiger CLI (<a href="https://github.com/timescale/tiger-cli"><u>Download from GitHub</u></a>) </p><pre><code class="language-Shell">$ curl -fsSL https://cli.tigerdata.com | sh
$ tiger auth login
$ tiger service create
</code></pre><p>From the cloud console:</p><ol><li><a href="https://console.cloud.timescale.com/signup"><u>Sign up</u></a></li><li>Select “Free Plan” (vs. the free 30-day trial)</li><li>Create a service and get building!</li></ol><hr><h2 id="frequently-asked-questions">Frequently Asked Questions</h2><h3 id="about-the-free-plan">About the Free Plan</h3><ol><li><strong>What is included in the Free Plan?</strong><ol><li>Up to 2 free services per account</li><li>Developer experience features: database forks (24-hour PITR), Insights dashboard, pgvector, real-time analytics/time-series (hypertables, continuous aggregates, columnar storage), connection management</li><li>MFA for secure login</li><li>IP Allow list for security</li><li>Database logs</li><li>A large suite of Postgres extensions, like PostGIS and pg_cron</li><li>Available in us-east-1 (EU region coming soon)</li></ol></li><li><strong>What is excluded in the Free Plan?</strong><ol><li>Connection pooler</li><li>Compute resizing</li><li>Dedicated support (community Slack available)</li><li>High-availability configurations</li><li>Advanced security features</li><li>Advanced enterprise features</li></ol></li><li><strong>What are the specs of a free service?</strong><ol><li>Each free service includes:<ol><li>Shared compute</li><li>Storage: Up to 750 MB for user data</li><li>Backup: 24-hour point-in-time recovery for forking</li><li>Connections: Limited (exact number TBD, designed to balance usability and system health)</li></ol></li></ol></li><li><strong>Is a credit card required to create Free Plan services?</strong><ol><li>No. You can sign up and create free services without entering payment information.</li></ol></li><li><strong>What happens if I reach my storage limit?</strong><ol><li>You'll receive warnings as you approach 750 MB. Once you hit the limit, your service switches to <strong>read-only mode</strong>. 
You can:<ol><li>Fork your service to an earlier point in time (within the 24-hour PITR window)</li><li>Delete data to free up space before reaching the limit</li><li>Upgrade to a Performance or Scale plan and “Convert to a standard service without storage limits”</li></ol></li></ol></li><li><strong>Can I upgrade free services?</strong><ol><li>Yes. If your project outgrows the Free Plan, you can:<ol><li>Switch to a Performance or Scale plan</li><li>Convert your free service to a standard service to remove storage limitations and get access to higher levels of compute</li></ol></li></ol></li><li><strong>What is the difference between Free Plan and Performance Plan?</strong><ol><li>Performance Plan offers:<ol><li>Scalable compute (starting at 0.5 CPU / 2 GB RAM, up to 16 CPU / 128 GB RAM)</li><li>Storage up to 16 TB (pay-as-you-go by the GB)</li><li>Connection pooling</li><li>High availability options</li><li>Dedicated support</li><li>Virtual Private Cloud peering</li><li>Production-grade SLAs</li></ol></li><li>The Free Plan is optimized for experimentation; Performance is optimized for production.</li></ol></li><li><strong>Can I start a Performance trial if I already have a Free Plan?</strong><ol><li>Yes! Every project gets one free trial of Performance or Scale. You can start it at any point, even after creating free services. 
Your free services remain active during and after the trial.</li></ol></li><li><strong>Does my free trial automatically convert to a Free Plan when my trial expires?</strong><ol><li>If you start with a Performance or Scale trial, here's what happens when it ends:<ol><li>Trial services (on Performance/Scale) are paused, then eventually deleted</li><li>Any free services remain active</li><li>Your account converts to the Free Plan</li><li>You can create up to 2 free services (if you haven't already)</li><li>No charges occur unless you explicitly upgrade</li></ol></li></ol></li><li><strong>Do I keep my free services if I upgrade my Free Plan to Performance or Scale?</strong><ol><li>Yes! Free services are available on all plans. You can have up to 2 free services alongside your paid services.</li></ol></li><li><strong>If I am already a Tiger Data customer, can I also use the free services?</strong><ol><li>Yes! Existing Performance, Scale, and Enterprise customers can create up to 2 free services within their accounts. Use them for testing, sandboxes, or side projects.</li></ol></li><li><strong>Will I have access to free services in my paid plan?</strong><ol><li>Yes. All plans (Free, Performance, Scale, Enterprise) include access to up to 2 free services. They coexist with your development and production databases.</li></ol></li><li><strong>Should I start with the Free Plan or the Free Trial?</strong><ol><li><strong>Start with Free Plan if:</strong><ol><li>You're prototyping or experimenting</li><li>You're not sure what you're building yet</li><li>You want to test Tiger Data without commitment</li></ol></li><li><strong>Start with a Performance/Scale trial if:</strong><ol><li>You're ready to test Tiger Cloud with production-like workloads</li><li>You have large datasets to migrate</li><li>You need higher compute or storage immediately</li><li>You want to evaluate enterprise features</li></ol></li><li>You can always start a trial later. 
Each project gets one trial, and you choose when to use it.</li></ol></li><li><strong>Is Tiger Fluid Storage available on Performance and Scale plans?</strong><ol><li>No, Tiger Fluid Storage is not available on Performance and Scale plans today.</li></ol></li></ol><hr><p><strong>About the authors</strong></p><p><strong>Hien Phan</strong><br><br>An AI, Data, and Infrastructure Marketing Leader, Hien Phan is the Head of Marketing at Tiger Data. He previously led Product, Partner, and Customer Marketing at Pinecone, where he and his team launched a game-changing serverless architecture and introduced Pinecone Assistant, marking a significant leap in Pinecone's product offerings.</p><p>At Amplitude, Hien’s team championed solutions that empowered product and marketing teams to excel at product-led growth. This role sharpened his ability to drive growth through strategic marketing initiatives, solidifying the brand as an indispensable tool in the product experimentation and analytics category.</p><p>Hien lives in the Bay Area with his two lovely dogs and makes a mean roasted chicken.<br></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Postgres for Agents]]></title>
            <description><![CDATA[Agentic Postgres: the first database built for agents. Native search, instant forks, MCP integration, new CLI, and free tier. Built for agents. Designed for developers.]]></description>
            <link>https://www.tigerdata.com/blog/postgres-for-agents</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/postgres-for-agents</guid>
            <category><![CDATA[Announcements & Releases]]></category>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[AI agents]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Ajay Kulkarni]]></dc:creator>
            <pubDate>Tue, 21 Oct 2025 13:46:50 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/10/2025-ABL-Launch-Blog-Thumbnail.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/10/2025-ABL-Launch-Blog-Thumbnail.png" alt="Postgres for Agents" /><h3 id="announcing-agentic-postgres-the-first-database-built-for-agents"><em>Announcing Agentic Postgres: The first database built for agents.</em></h3><h2 id="agents-are-the-new-developer">Agents are the New Developer</h2><p>80% of Claude Code <a href="https://www.reddit.com/r/singularity/comments/1khxwjh/claude_code_wrote_80_of_its_own_code_anthropic_dev/"><u>was written by AI</u></a>. More than a quarter of all new code at Google <a href="https://arstechnica.com/ai/2024/10/google-ceo-says-over-25-of-new-google-code-is-generated-by-ai/"><u>was generated by AI</u></a> <em>one year ago</em>. It’s safe to say that in the next 12 months, the majority of all new code will be written by AI.</p><p>Agents don’t behave like humans. They behave in new ways. Software development tools need to evolve. Agents need a new kind of database made for how they work.</p><p>But what would a database for agents look like?</p><p>At Tiger, we’ve obsessed over databases for the past 10 years. We’ve built high-performance systems for time-series data, scaled Postgres across millions of workloads, and served thousands of customers and hundreds of thousands of developers around the world.&nbsp;</p><p>So when agents arrived, we felt it immediately. In our bones. This new era of computing would need its own kind of data infrastructure. 
One that still delivered power without complexity, but built for a new type of user.&nbsp;</p><p>How do agents behave?</p><ul><li>Agents don’t click, they call.</li><li>Agents don’t remember, they retrieve.</li><li>Agents can download expertise to become experts.</li><li>Agents can parallelize effortlessly, acting like a multi-threaded team.</li><li>Agents need a safe sandbox where they can play (or wreak havoc).</li><li>Agents can also hammer your infrastructure (and your budget) if you’re not careful.</li></ul><p>We started on this problem over a year ago. Multiple teams working in parallel, months of engineering and internal user feedback, rethinking everything from the storage layer to how agents actually reason.&nbsp;</p><p>Here’s what we built.&nbsp;</p><h2 id="introducing-agentic-postgres">Introducing Agentic Postgres</h2><p>Today we’re launching Agentic Postgres, the first database designed from the ground up for agents. It includes:</p><p><strong>The best database MCP server ever built</strong></p><p>Agentic Postgres includes our new <a href="https://www.tigerdata.com/blog/free-postgres-mcp-prompt-templates" rel="noreferrer">MCP server</a> that enables agents not just to interact with the database but also understand how to use it well. We’ve taken our 10+ years of Postgres experience and distilled it into a set of built-in master prompts. This gives agents safe, structured access to the database through high-level tools for schema design, query tuning, migrations, and more. The MCP server also performs native full-text and semantic search over the Postgres docs, so agents can instantly retrieve the right context as they think.&nbsp;</p><pre><code class="language-markdown">&gt; I want to create a personal assistant app. Please create a free service on Tiger. 
Then using Postgres best practices, describe the schema you would create.</code></pre><figure class="kg-card kg-video-card kg-width-regular" data-kg-thumbnail="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/media/2025/10/DE84BB33-B4FE-4F6E-8398-9267033F6870-2_thumb.jpg" data-kg-custom-thumbnail="">
            <div class="kg-video-container">
                <video src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/media/2025/10/DE84BB33-B4FE-4F6E-8398-9267033F6870-2.mp4" poster="https://img.spacergif.org/v1/1920x1080/0a/spacer.png" width="1920" height="1080" loop="" autoplay="" muted="" playsinline="" preload="metadata" style="background: transparent url('https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/media/2025/10/DE84BB33-B4FE-4F6E-8398-9267033F6870-2_thumb.jpg') 50% 50% / cover no-repeat;"></video>
                <div class="kg-video-overlay">
                    <button class="kg-video-large-play-icon" aria-label="Play video">
                        <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
                            <path d="M23.14 10.608 2.253.164A1.559 1.559 0 0 0 0 1.557v20.887a1.558 1.558 0 0 0 2.253 1.392L23.14 13.393a1.557 1.557 0 0 0 0-2.785Z"></path>
                        </svg>
                    </button>
                </div>
                <div class="kg-video-player-container kg-video-hide">
                    <div class="kg-video-player">
                        <button class="kg-video-play-icon" aria-label="Play video">
                            <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
                                <path d="M23.14 10.608 2.253.164A1.559 1.559 0 0 0 0 1.557v20.887a1.558 1.558 0 0 0 2.253 1.392L23.14 13.393a1.557 1.557 0 0 0 0-2.785Z"></path>
                            </svg>
                        </button>
                        <button class="kg-video-pause-icon kg-video-hide" aria-label="Pause video">
                            <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
                                <rect x="3" y="1" width="7" height="22" rx="1.5" ry="1.5"></rect>
                                <rect x="14" y="1" width="7" height="22" rx="1.5" ry="1.5"></rect>
                            </svg>
                        </button>
                        <span class="kg-video-current-time">0:00</span>
                        <div class="kg-video-time">
                            /<span class="kg-video-duration">0:30</span>
                        </div>
                        <input type="range" class="kg-video-seek-slider" max="100" value="0">
                        <button class="kg-video-playback-rate" aria-label="Adjust playback speed">1×</button>
                        <button class="kg-video-unmute-icon" aria-label="Unmute">
                            <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
                                <path d="M15.189 2.021a9.728 9.728 0 0 0-7.924 4.85.249.249 0 0 1-.221.133H5.25a3 3 0 0 0-3 3v2a3 3 0 0 0 3 3h1.794a.249.249 0 0 1 .221.133 9.73 9.73 0 0 0 7.924 4.85h.06a1 1 0 0 0 1-1V3.02a1 1 0 0 0-1.06-.998Z"></path>
                            </svg>
                        </button>
                        <button class="kg-video-mute-icon kg-video-hide" aria-label="Mute">
                            <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
                                <path d="M16.177 4.3a.248.248 0 0 0 .073-.176v-1.1a1 1 0 0 0-1.061-1 9.728 9.728 0 0 0-7.924 4.85.249.249 0 0 1-.221.133H5.25a3 3 0 0 0-3 3v2a3 3 0 0 0 3 3h.114a.251.251 0 0 0 .177-.073ZM23.707 1.706A1 1 0 0 0 22.293.292l-22 22a1 1 0 0 0 0 1.414l.009.009a1 1 0 0 0 1.405-.009l6.63-6.631A.251.251 0 0 1 8.515 17a.245.245 0 0 1 .177.075 10.081 10.081 0 0 0 6.5 2.92 1 1 0 0 0 1.061-1V9.266a.247.247 0 0 1 .073-.176Z"></path>
                            </svg>
                        </button>
                        <input type="range" class="kg-video-volume-slider" max="100" value="100">
                    </div>
                </div>
            </div>
            
        </figure><p><strong>Native search and retrieval</strong></p><p>Agentic Postgres comes with native full-text and semantic search built directly into the database. For semantic search, we’ve improved our existing extension pgvectorscale, for higher-throughput indexing, better recall, and lower latency at scale than pgvector.&nbsp;</p><p>For full-text search, <a href="https://www.tigerdata.com/blog/introducing-pg_textsearch-true-bm25-ranking-hybrid-retrieval-postgres" rel="noreferrer">pg_textsearch</a>, our newest Postgres extension, implements BM25 for modern ranked keyword search optimized for hybrid AI applications alongside pgvectorscale. The current preview release uses an in-memory structure for fast writes and queries. Future releases will add disk-based segments with compression and BlockMax WAND optimization, applying the same battle-tested techniques from production search engines.</p><p>Together, these extensions let agents retrieve structured data instantly without leaving Postgres.&nbsp;&nbsp;</p><pre><code class="language-markdown">&gt; Using service qilk2gqjuz, analyze user feedback with hybrid search (combining text search and semantic search). Group similar feedback by theme and show counts for each theme, using an ascii bar chart. First, look at the pg_textsearch (BM25) and pgvectorscale documentation in the Tiger docs to get the proper syntax, and then use those extensions.</code></pre><figure class="kg-card kg-video-card kg-width-regular" data-kg-thumbnail="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/media/2025/10/demo-2--1-_thumb.jpg" data-kg-custom-thumbnail="">
            <div class="kg-video-container">
                <video src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/media/2025/10/demo-2--1-.mp4" poster="https://img.spacergif.org/v1/1920x1080/0a/spacer.png" width="1920" height="1080" loop="" autoplay="" muted="" playsinline="" preload="metadata" style="background: transparent url('https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/media/2025/10/demo-2--1-_thumb.jpg') 50% 50% / cover no-repeat;"></video>
                <div class="kg-video-overlay">
                    <button class="kg-video-large-play-icon" aria-label="Play video">
                        <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
                            <path d="M23.14 10.608 2.253.164A1.559 1.559 0 0 0 0 1.557v20.887a1.558 1.558 0 0 0 2.253 1.392L23.14 13.393a1.557 1.557 0 0 0 0-2.785Z"></path>
                        </svg>
                    </button>
                </div>
                <div class="kg-video-player-container kg-video-hide">
                    <div class="kg-video-player">
                        <button class="kg-video-play-icon" aria-label="Play video">
                            <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
                                <path d="M23.14 10.608 2.253.164A1.559 1.559 0 0 0 0 1.557v20.887a1.558 1.558 0 0 0 2.253 1.392L23.14 13.393a1.557 1.557 0 0 0 0-2.785Z"></path>
                            </svg>
                        </button>
                        <button class="kg-video-pause-icon kg-video-hide" aria-label="Pause video">
                            <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
                                <rect x="3" y="1" width="7" height="22" rx="1.5" ry="1.5"></rect>
                                <rect x="14" y="1" width="7" height="22" rx="1.5" ry="1.5"></rect>
                            </svg>
                        </button>
                        <span class="kg-video-current-time">0:00</span>
                        <div class="kg-video-time">
                            /<span class="kg-video-duration">0:50</span>
                        </div>
                        <input type="range" class="kg-video-seek-slider" max="100" value="0">
                        <button class="kg-video-playback-rate" aria-label="Adjust playback speed">1×</button>
                        <button class="kg-video-unmute-icon" aria-label="Unmute">
                            <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
                                <path d="M15.189 2.021a9.728 9.728 0 0 0-7.924 4.85.249.249 0 0 1-.221.133H5.25a3 3 0 0 0-3 3v2a3 3 0 0 0 3 3h1.794a.249.249 0 0 1 .221.133 9.73 9.73 0 0 0 7.924 4.85h.06a1 1 0 0 0 1-1V3.02a1 1 0 0 0-1.06-.998Z"></path>
                            </svg>
                        </button>
                        <button class="kg-video-mute-icon kg-video-hide" aria-label="Mute">
                            <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
                                <path d="M16.177 4.3a.248.248 0 0 0 .073-.176v-1.1a1 1 0 0 0-1.061-1 9.728 9.728 0 0 0-7.924 4.85.249.249 0 0 1-.221.133H5.25a3 3 0 0 0-3 3v2a3 3 0 0 0 3 3h.114a.251.251 0 0 0 .177-.073ZM23.707 1.706A1 1 0 0 0 22.293.292l-22 22a1 1 0 0 0 0 1.414l.009.009a1 1 0 0 0 1.405-.009l6.63-6.631A.251.251 0 0 1 8.515 17a.245.245 0 0 1 .177.075 10.081 10.081 0 0 0 6.5 2.92 1 1 0 0 0 1.061-1V9.266a.247.247 0 0 1 .073-.176Z"></path>
                            </svg>
                        </button>
                        <input type="range" class="kg-video-volume-slider" max="100" value="100">
                    </div>
                </div>
            </div>
            
        </figure><p><strong>Fast, zero-copy forks</strong></p><p>At the core of Agentic Postgres is a new copy-on-write block storage layer that makes <a href="https://www.tigerdata.com/blog/fast-zero-copy-database-forks" rel="noreferrer">databases instantly forkable</a>. Every agent can spin up its own isolated environment, a full copy of production data in seconds, without duplicating data (or costs). Every fork is lightweight and efficient, so you only pay for the blocks that change. It’s perfect for experiments, benchmarks, and migrations that can run safely in parallel.&nbsp;</p><pre><code class="language-markdown">&gt; Please create a fork of gf868h9j1y using the last snapshot, and then test 3 different indexes that we should create to speed up performance, then delete the fork, and report back on your findings. Before you start run “tiger service fork --help” and “tiger service delete --help” to get the right syntax. Use MCP over psql, using the password from the local keychain.</code></pre><figure class="kg-card kg-video-card kg-width-regular" data-kg-thumbnail="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/media/2025/10/demo-3--1-_thumb.jpg" data-kg-custom-thumbnail="">
            <div class="kg-video-container">
                <video src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/media/2025/10/demo-3--1-.mp4" poster="https://img.spacergif.org/v1/1920x1080/0a/spacer.png" width="1920" height="1080" loop="" autoplay="" muted="" playsinline="" preload="metadata" style="background: transparent url('https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/media/2025/10/demo-3--1-_thumb.jpg') 50% 50% / cover no-repeat;"></video>
                <div class="kg-video-overlay">
                    <button class="kg-video-large-play-icon" aria-label="Play video">
                        <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
                            <path d="M23.14 10.608 2.253.164A1.559 1.559 0 0 0 0 1.557v20.887a1.558 1.558 0 0 0 2.253 1.392L23.14 13.393a1.557 1.557 0 0 0 0-2.785Z"></path>
                        </svg>
                    </button>
                </div>
                <div class="kg-video-player-container kg-video-hide">
                    <div class="kg-video-player">
                        <button class="kg-video-play-icon" aria-label="Play video">
                            <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
                                <path d="M23.14 10.608 2.253.164A1.559 1.559 0 0 0 0 1.557v20.887a1.558 1.558 0 0 0 2.253 1.392L23.14 13.393a1.557 1.557 0 0 0 0-2.785Z"></path>
                            </svg>
                        </button>
                        <button class="kg-video-pause-icon kg-video-hide" aria-label="Pause video">
                            <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
                                <rect x="3" y="1" width="7" height="22" rx="1.5" ry="1.5"></rect>
                                <rect x="14" y="1" width="7" height="22" rx="1.5" ry="1.5"></rect>
                            </svg>
                        </button>
                        <span class="kg-video-current-time">0:00</span>
                        <div class="kg-video-time">
                            /<span class="kg-video-duration">1:06</span>
                        </div>
                        <input type="range" class="kg-video-seek-slider" max="100" value="0">
                        <button class="kg-video-playback-rate" aria-label="Adjust playback speed">1×</button>
                        <button class="kg-video-unmute-icon" aria-label="Unmute">
                            <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
                                <path d="M15.189 2.021a9.728 9.728 0 0 0-7.924 4.85.249.249 0 0 1-.221.133H5.25a3 3 0 0 0-3 3v2a3 3 0 0 0 3 3h1.794a.249.249 0 0 1 .221.133 9.73 9.73 0 0 0 7.924 4.85h.06a1 1 0 0 0 1-1V3.02a1 1 0 0 0-1.06-.998Z"></path>
                            </svg>
                        </button>
                        <button class="kg-video-mute-icon kg-video-hide" aria-label="Mute">
                            <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
                                <path d="M16.177 4.3a.248.248 0 0 0 .073-.176v-1.1a1 1 0 0 0-1.061-1 9.728 9.728 0 0 0-7.924 4.85.249.249 0 0 1-.221.133H5.25a3 3 0 0 0-3 3v2a3 3 0 0 0 3 3h.114a.251.251 0 0 0 .177-.073ZM23.707 1.706A1 1 0 0 0 22.293.292l-22 22a1 1 0 0 0 0 1.414l.009.009a1 1 0 0 0 1.405-.009l6.63-6.631A.251.251 0 0 1 8.515 17a.245.245 0 0 1 .177.075 10.081 10.081 0 0 0 6.5 2.92 1 1 0 0 0 1.061-1V9.266a.247.247 0 0 1 .073-.176Z"></path>
                            </svg>
                        </button>
                        <input type="range" class="kg-video-volume-slider" max="100" value="100">
                    </div>
                </div>
            </div>
            
        </figure><p><strong>New CLI and free tier</strong></p><p>We’ve also built a new CLI that makes it easy to explore, fork, and build with Agentic Postgres, and we’re launching a <a href="https://www.tigerdata.com/blog/introducing-agentic-postgres-free-plan-experiment-ai-on-postgres"><u>free tier</u></a> so every developer and every agent can get hands-on right away.</p><p>All of this launches today, and you can try it with three commands in your terminal:</p><pre><code class="language-Shell"># 3 commands to install the Tiger CLI and MCP. That's it!
$ curl -fsSL https://cli.tigerdata.com | sh
$ tiger auth login
$ tiger mcp install
</code></pre><p>Then just tell your agent to spin up a new free service using MCP, or simply call <code>tiger create service</code> from the command line to get going.</p><h2 id="powered-by-fluid-storage">Powered by Fluid Storage</h2><p>Agentic Postgres is powered by <a href="https://www.tigerdata.com/blog/fluid-storage-forkable-ephemeral-durable-infrastructure-age-of-agents" rel="noreferrer">Fluid Storage</a>, our new distributed storage layer. Fluid Storage is built on a disaggregated architecture of a horizontally scalable distributed block store using local NVMe storage, a storage proxy layer that exposes copy-on-write volumes, and a user-space storage device driver.</p><p>It’s storage that looks like a local disk to Postgres yet scales like a cloud service.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/10/2025-Oct-28-fluid-storage-architecture-diagram-1.png" class="kg-image" alt="" loading="lazy" width="2000" height="1043" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2025/10/2025-Oct-28-fluid-storage-architecture-diagram-1.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2025/10/2025-Oct-28-fluid-storage-architecture-diagram-1.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2025/10/2025-Oct-28-fluid-storage-architecture-diagram-1.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w2400/2025/10/2025-Oct-28-fluid-storage-architecture-diagram-1.png 2400w" sizes="(min-width: 720px) 720px"></figure><p>As a result, Fluid Storage delivers instant forks, snapshots, and automatic scaling (up or down) without downtime or over-provisioning.  
In benchmark testing, a single volume sustains throughput over 110,000 IOPS, while retaining all of Fluid’s elasticity and copy-on-write capabilities.&nbsp;</p><p>All free services on Tiger Cloud run on Fluid Storage today, so every developer can experience its performance and flexibility firsthand.&nbsp;</p><p>And this is just the start. We’ll dive deeper into each of these (<a href="https://www.tigerdata.com/blog/free-postgres-mcp-prompt-templates" rel="noreferrer">MCP</a>, <a href="https://www.tigerdata.com/blog/introducing-pg_textsearch-true-bm25-ranking-hybrid-retrieval-postgres" rel="noreferrer">pg_textsearch</a>, <a href="https://github.com/timescale/pgvectorscale" rel="noreferrer">pgvectorscale</a>, <a href="https://www.tigerdata.com/blog/fast-zero-copy-database-forks" rel="noreferrer">forkable databases</a>, <a href="https://www.tigerdata.com/blog/fluid-storage-forkable-ephemeral-durable-infrastructure-age-of-agents" rel="noreferrer">Fluid Storage</a>, <a href="https://github.com/timescale/tiger-cli" rel="noreferrer">CLI</a>, <a href="https://www.tigerdata.com/blog/introducing-agentic-postgres-free-plan-experiment-ai-on-postgres" rel="noreferrer">free tier</a>) later this week and next.&nbsp;</p><h2 id="built-for-agents-and-developers">Built for Agents and Developers</h2><p>Agentic Postgres is built for agents, so developers can work on higher level problems.&nbsp;</p><p>Building <em>with </em>and <em>for </em>agents, we’ve learned something simple: agents are not here to replace us. They’re here to elevate us.</p><p>Agents take on the mechanical, repetitive work, freeing us to focus on what matters most: architecture, design, creativity, impact. They make us faster, smarter, and enable us to do more ambitious work than we could do alone.&nbsp;&nbsp;</p><p>The myth is that AI will replace developers. 
The truth is that developers who build with agents will replace those who don’t.&nbsp;</p><p>Agentic Postgres is for developers who want to build <em>with</em> AI. For developers who care more about working applications than disposable demos. For developers who want AI to feel like engineering, not just experimentation. </p><p>Today’s launch is just the beginning. There are still some rough edges. We’d love your help sanding them down. But expect more to come: more launches, big and small, in the weeks, months, and years ahead.</p><p>Agents are the new developers. Agentic Postgres is their new playground.</p><p><strong>Built for Agents. Designed to Elevate Developers.</strong></p><p>So let’s build. Together.</p><p>Get started today:&nbsp;</p><pre><code class="language-markdown">$ curl -fsSL https://cli.tigerdata.com | sh</code></pre>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[SkipScan in TimescaleDB: Why DISTINCT Was Slow, How We Built It, and How You Can Use It]]></title>
            <description><![CDATA[Learn how TimescaleDB's SkipScan transforms DISTINCT queries from multi-second waits to milliseconds by jumping between values instead of scanning every row.]]></description>
            <link>https://www.tigerdata.com/blog/skipscan-in-timescaledb-why-distinct-was-slow-how-we-built-it-and-how-you-can-use-it</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/skipscan-in-timescaledb-why-distinct-was-slow-how-we-built-it-and-how-you-can-use-it</guid>
            <category><![CDATA[Announcements & Releases]]></category>
            <category><![CDATA[TimescaleDB]]></category>
            <category><![CDATA[Product & Engineering]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Natalya Aksman]]></dc:creator>
            <pubDate>Fri, 19 Sep 2025 18:23:25 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/09/-------2025-Sep-10-SkipScan-on-Columnstore-in-TimescaleDB.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/09/-------2025-Sep-10-SkipScan-on-Columnstore-in-TimescaleDB.png" alt="SkipScan in TimescaleDB: Why DISTINCT Was Slow, How We Built It, and How You Can Use It" /><p>Everyone eventually asks the same “easy” questions: <em>How many distinct things do I have right now? How many active devices? Which trading pairs saw activity today? What’s the latest reading per sensor?</em> On PostgreSQL at scale, those questions turn from harmless SQL into multi‑second waits because <code>DISTINCT</code> usually means <strong>walking every qualifying row</strong> and deduplicating afterward. Even with an ordered index, the engine traverses the full range and pushes keys through a unique step—<strong><code>O(N)</code></strong> work when <strong><code>N</code></strong> is tens or hundreds of millions.</p><p>We built <strong>SkipScan</strong> to flip that script. Instead of visiting every row, SkipScan uses the B‑tree’s order to <strong>jump from one distinct value to the next</strong>: find <code>device = 1</code>, output it, then restart the index at <code>device &gt; 1</code>, and so on. The cost becomes <code><strong>O(K × log N)</strong></code>, where <code>K</code> is the number of distinct values. When <code>K ≪ N</code>, you get millisecond answers instead of coffee‑breaks.</p><p>SkipScan first landed in TimescaleDB <strong>2.2.0</strong> for rowstores. With <strong>2.20.0</strong>, we extended it to <strong>columnstore</strong> hypertables and to <strong>distinct aggregates</strong> like <code>COUNT(DISTINCT …)</code> —leveraging PostgreSQL 16’s ability to feed presorted inputs into ORDER BY/DISTINCT aggregates. With <strong>2.22.0</strong>, we support multi-column SkipScan when all column values are guaranteed to be not-null. 
What follows is the problem it solves, the design that makes columnstore skipping safe and fast, and the exact queries and plans you can reproduce.</p><h2 id="the-problem-we-had-to-solve">The Problem We Had to Solve</h2><p>Rowstores are bad enough for <code>DISTINCT</code> at large <code>N</code>. Columnstores raise the stakes: data are packed in compressed <strong>batches</strong> (≤1,000 rows) and reading a single tuple can imply decompressing a whole batch. If we naïvely walked every batch to deduplicate, compression would help storage but <strong>hurt latency</strong>.</p><p>To make <code>DISTINCT</code> fast on columnstore, the executor has to:</p><ul><li><strong>Jump by segment key</strong>—not row—so we only touch the first relevant batch per distinct value.</li><li><strong>Prune batches</strong> aggressively using min/max metadata (e.g., time windows) to avoid needless decompression.</li><li><strong>Stop early</strong> inside a batch as soon as a qualifying tuple is found.</li></ul><p>That’s the essence of SkipScan on columnstore.</p><h2 id="how-skipscan-on-columnstore-works">How SkipScan on Columnstore Works</h2><p>Columnstore chunks maintain an automatic B‑tree on <code>(segmentby, orderby)</code> (e.g., <code>(device, time DESC)</code>), plus per‑batch metadata columns like <code>_ts_meta_min_1</code> and <code>_ts_meta_max_1</code> for <code>time</code>. SkipScan hooks into that structure:</p><ol><li><strong>Seek the first batch</strong> for the current <code>device</code> using the <code>(segmentby, orderby)</code> index; for “latest per device,” this is typically the batch with the newest <code>time</code> for that device.</li><li><strong>Decompress just enough</strong>—often the first tuple—to satisfy the query (<code>DISTINCT ON (device) ... 
ORDER BY device, time DESC</code>).</li><li><strong>Restart at <code>device &gt; current_device</code></strong> to hop to the next device’s first batch.</li><li><strong>Push down filters</strong> that can be answered from metadata (e.g., <code>_ts_meta_max_1 &gt; now() - '1 hour'</code>) so entire stretches of batches are skipped without decompression.</li></ol><p>There’s one important constraint: SkipScan on columnstore applies when the <code>DISTINCT</code> key is the <strong>leading <code>segmentby</code> column </strong>(or the leading columns for multi-key SkipScan). That’s what allows safe jumps.&nbsp;</p><p><em>Note: by “leading column” we mean the first index column that the query does not pin to a single value. For an index on <code>(a,b)</code>, <code>b</code> is the leading index column in a query like “select distinct a, b from t where a=1”, because the filter fixes <code>a</code>.</em></p><p>Under the covers you’ll see a <code>Custom Scan (SkipScan)</code> node above an index scan and (when needed) a <code>DecompressChunk</code>. After emitting a distinct value, the executor rewrites its start condition from <code>col &gt;= v</code> to <code>col &gt; v</code> and re‑seeks—classic B‑tree hopscotch.</p><h2 id="a-bench-you-can-reproduce-rowstore-and-columnstore">A Bench You Can Reproduce (Rowstore and Columnstore)</h2><p><strong>Table</strong></p><pre><code class="language-SQL">CREATE TABLE metrics (
  time   timestamptz NOT NULL,
  device integer,
  value  double precision
);
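-- Sketch, not part of the original benchmark script: the compression settings
-- used further down require metrics to be a hypertable. Assumes TimescaleDB
-- 2.13 or later for the by_range() dimension builder.
SELECT create_hypertable('metrics', by_range('time'));

-- Hypothetical data load that yields a table of similar shape: 10 devices,
-- one reading per device per second over 27 days (on the order of 23 million rows).
INSERT INTO metrics (time, device, value)
SELECT t, d, random() * 100
FROM   generate_series(now() - INTERVAL '27 days', now(), INTERVAL '1 second') AS t,
       generate_series(1, 10) AS d;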
</code></pre><p><strong>Rowstore index</strong></p><pre><code class="language-SQL">CREATE INDEX ON metrics(device, time DESC);
</code></pre><p><strong>Scale</strong>: ~<strong>23,328,010</strong> rows; <code>device</code> has <strong>10</strong> distinct values.</p><p><strong>Columnstore layout</strong></p><pre><code class="language-SQL">ALTER TABLE metrics SET (
  timescaledb.compress,
  timescaledb.compress_segmentby = 'device',
  timescaledb.compress_orderby   = 'time DESC'
);
SELECT compress_chunk(ch) FROM show_chunks('metrics') ch;
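-- Optional check (a sketch; assumes the chunk_compression_stats() informational
-- function is available in your TimescaleDB version): list each chunk and
-- whether it ended up compressed.
SELECT chunk_name, compression_status
FROM   chunk_compression_stats('metrics');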
</code></pre><p>On one representative compressed chunk we see <strong>5,370</strong> batches totaling <strong>5,364,693</strong> compressed rows and an index on <code>(device, _ts_meta_min_1 DESC, _ts_meta_max_1 DESC)</code>.</p><p>Complexity notes: the <code>O(·)</code> terms are theoretical guides. PostgreSQL uses <strong>B‑trees</strong> (multi‑way), so constants, caching, and pruning matter. We use them to show scale, not to predict exact milliseconds.</p><h2 id="what-it-looks-like-in-practice">What It Looks Like in Practice</h2><p><strong>Counting devices (rowstore)</strong></p><pre><code class="language-SQL">SELECT COUNT(DISTINCT device) FROM metrics;
</code></pre><p>With SkipScan: <strong>3.430 ms</strong>.<br>Without (vanilla path): <strong>6756.322 ms</strong> (~6.8 s).<br>Same 23M‑row table; SkipScan is roughly <strong>2,000×</strong> faster because it does ~<code>K</code> log‑seeks instead of scanning all <code>N</code> entries.</p><p><strong>Latest reading per device (columnstore)</strong></p><pre><code class="language-SQL">SELECT DISTINCT ON (device) device, time, value
FROM   metrics
ORDER  BY device, time DESC;</code></pre><p>With SkipScan: <strong>4.589 ms</strong> by touching the first batch per device.<br>Without: <strong>3912.069 ms</strong>—the engine decompresses and inspects ~<strong>5.3 M</strong> tuples to deduplicate.</p><p><strong>Latest reading per device in the last hour (columnstore + metadata pruning)</strong></p><pre><code class="language-SQL">SELECT DISTINCT ON (device) device, time, value
FROM   metrics
WHERE  time &gt; now() - INTERVAL '1 hour'
ORDER  BY device, time DESC;</code></pre><p>With SkipScan: <strong>2.537 ms</strong>.<br>Without: <strong>9.236 ms</strong>.<br>The gap narrows because <code>_ts_meta_max_1</code> lets the planner discard <strong>5,330/5,370</strong> batches up front; SkipScan still wins by visiting only the <strong>first</strong> qualifying batch per device.</p><p><strong>Thresholded latest reading (columnstore; tuple‑level predicate)</strong></p><pre><code class="language-SQL">SELECT DISTINCT ON (device) *
FROM   metrics
WHERE  value &gt; 50
ORDER  BY device, time DESC;</code></pre><p>With SkipScan: <strong>45.392 ms</strong> after scanning ~<strong>3,002</strong> decompressed tuples on the first chunk.<br>Without: <strong>2145.536 ms</strong> after touching ~<strong>6,303,586</strong> tuples across chunks.<br>When a predicate can’t be answered by batch metadata, SkipScan still <strong>stops early per device</strong> once a match is found.</p><h2 id="how-this-helps-in-real-systems">How This Helps in Real Systems</h2><p>In IoT fleets, “show me the latest per device” powers every status board. On rowstores it’s a recurring spike; on columnstore without SkipScan it can become a wall. With SkipScan, the query returns in a few milliseconds by decompression‑on‑demand.</p><p>At exchanges and brokerages, <code>COUNT(DISTINCT symbol)</code> and “latest quote per pair” show up in every dashboard and alert. With 200M+ rows and hundreds of thousands of pairs, SkipScan changes the economics of distinct—turning once‑expensive questions into cheap, always‑on queries.</p><p>In crypto analytics, analysts slice by wallet, contract, or chain. Setting the <code>segmentby</code> to the entity you dedupe most often (e.g., <code>address</code>) lets SkipScan jump entity‑to‑entity, keeping answer times stable even under heavy ingest.</p><h2 id="multi-column-skipscan">Multi-Column SkipScan</h2><p>Multi-column SkipScan support was added in 2.22.0, with a caveat: it can only be chosen for queries which do not produce NULL distinct values.</p><p>That means if a query produces multiple distinct columns and doesn’t allow NULLs to be output as distinct values, then this query can be eligible for multi-column SkipScan.</p><p>Examples:&nbsp;</p><pre><code class="language-SQL">CREATE INDEX ON metrics(region, device, metric_type);
-- All distinct columns have filters which don't allow NULLs: can use SkipScan
SELECT DISTINCT ON (region, device, metric_type) *
FROM   metrics
WHERE region IN ('UK','EU','JP') AND device &gt;1 AND metric_type IS NOT NULL
ORDER  BY region, device, metric_type, time DESC;
-- Distinct columns are declared NOT NULL: can use SkipScan
CREATE TABLE metrics(region TEXT NOT NULL, device INT NOT NULL, ...);
SELECT DISTINCT ON (region, device) *
FROM   metrics
ORDER  BY region, device, time DESC;</code></pre><p>Multicolumn SkipScan skips over the current key values in a similar way to a single-column SkipScan: it freezes the values of keys preceding the current key, allows any values for keys following the current key, and searches for the first current key value which is greater than the current value.</p><p>When no more values are found for the current key, SkipScan moves to the previous key and repeats the process, or is done when there are no more previous keys.</p><p>For the above example, suppose we are currently scanning the values (<code>AUS</code>, 2, &gt;56) for the tuple (region, device, metric_type).&nbsp; We’ve found the next tuple (<code>AUS</code>, 2, 59) and we search for (<code>AUS</code>, 2, &gt;59) now.&nbsp;</p><p>If no more tuples are found, we switch the search to (<code>AUS</code>,&gt;2, “any value”), find the next tuple (<code>AUS</code>,5,3) and search for (<code>AUS</code>,5,&gt;3), etc.&nbsp;&nbsp;</p><p>If no more tuples for <code>AUS</code> are found we search for (&gt;<code>AUS</code>, “any value”, “any value”), suppose we find (<code>BR</code>,4, 15) next, and restart the search for (<code>BR</code>,4,&gt;15), etc.</p><p>The reason why we don’t use SkipScan if NULL values are possible for distinct columns is because of complexities in determining “any value” in the above examples.&nbsp;</p><p>Because PostgreSQL sorts NULLs separately from other values we have to check for NULLs and NOT NULLs separately for each key to exhaust “any value” possibilities.&nbsp;</p><p>Checking it for 2 keys means doing 4 checks for (NULL, NULL), (NULL, NOT NULL), (NOT NULL, NULL), (NOT NULL, NOT NULL).&nbsp; Checking it for M keys means 2^M complexity.</p><p>It is a lot of complexity to address for an edge-case as users usually do not care about NULL devices or regions etc.</p><p>To take advantage of multi-column SkipScan it’s enough to add IS NOT NULL check or declare distinct columns NOT NULL if no NULL 
values are anticipated in those columns. If there are strict conditions on distinct columns like “a&gt;1” or “b IN (1,2,3)” then NULLs are already not allowed for those columns and SkipScan can be applied to those columns.</p><h2 id="using-it-in-your-schema">Using It in Your Schema</h2><p>Design your layout so the column you deduplicate is <strong>first</strong>:</p><ul><li>Rowstore: create an index starting with the distinct column(s), followed by your time sort (<code>device, time DESC</code>).</li><li>Columnstore: set <code>timescaledb.compress_segmentby</code> to that column (or columns for multicolumn SkipScan) and <code>compress_orderby</code> to match your query’s sort (e.g., <code>time DESC</code>). Compress your historical chunks.</li></ul><p>Verify with <code>EXPLAIN</code>—look for <code>Custom Scan (SkipScan)</code> above the index/decompress nodes. If the planner is shy, you can nudge it:</p><pre><code class="language-SQL">SET enable_seqscan = false;                           -- prefer indexes
SET timescaledb.skip_scan_run_cost_multiplier = 0; -- bias for SkipScan (2.20+)</code></pre><p>To check whether SkipScan was used in a query, and if it was used, on which keys and whether those keys are NOT NULL, the following flag can be set:</p><pre><code class="language-SQL">SET timescaledb.debug_skip_scan_info TO true; -- (2.22+)                 
-- NULLs are excluded via "&gt;1"
SELECT DISTINCT ON (device) device, time, value
FROM   metrics
WHERE device &gt; 1
ORDER  BY device, time DESC;
-- SkipScan key info is output
INFO:  SkipScan used on metrics_device_idx(device NOT NULL)
-- NULLs are not excluded
SELECT DISTINCT ON (device) device, time, value
FROM   metrics
ORDER  BY device, time DESC;
-- SkipScan key info is output
INFO:  SkipScan used on metrics_device_idx(device NULLS LAST)
</code></pre><p>When it <strong>doesn’t</strong> apply: multiple NULL-allowing distinct keys at once, orders that don’t start with the distinct column, or a columnstore where your distinct key isn’t the <strong>leading </strong><code>segmentby</code>.</p><pre><code class="language-SQL">-- Ordered index for rowstore
CREATE INDEX ON metrics(device, time DESC);

-- Ordered index for columnstore based on segmentby + orderby settings
ALTER TABLE metrics SET (timescaledb.compress, timescaledb.compress_orderby='time DESC', timescaledb.compress_segmentby='device');
SELECT compress_chunk(ch) FROM show_chunks('metrics') ch;
</code></pre><p>These queries can use SkipScan:</p><pre><code class="language-SQL">SELECT DISTINCT ON (device) * FROM metrics;
SELECT DISTINCT ON (device) * FROM metrics ORDER BY device, time DESC;
SELECT DISTINCT ON (device, value) * FROM metrics WHERE value = 10;
SELECT count(DISTINCT device), max(DISTINCT device) FROM metrics; 
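-- To confirm the planner's choice for any query above, prefix it with EXPLAIN
-- and look for a Custom Scan (SkipScan) node in the plan. A minimal sketch,
-- using the second query as the example:
EXPLAIN (COSTS OFF)
SELECT DISTINCT ON (device) * FROM metrics ORDER BY device, time DESC;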
</code></pre><p>These queries cannot use SkipScan, either because a distinct column is not in the index, because a non-distinct aggregate is present, or because the index order doesn’t match the query order:</p><pre><code class="language-SQL">SELECT DISTINCT ON (time) * FROM metrics;
SELECT DISTINCT ON (device) * FROM metrics ORDER BY device, time;
SELECT DISTINCT ON (device, value) * FROM metrics;
SELECT count(DISTINCT device), max(device) FROM metrics; 
SELECT count(DISTINCT device), count(DISTINCT value) FROM metrics; 
</code></pre><h2 id="what-you-get">What You Get</h2><p>Millisecond‑fast deduplication on datasets measured in billions of rows; far fewer tuples decompressed per query; predictable dashboard SLOs under bursty ingest; and no application changes—just the right index or segment layout. Rowstore users have had these wins since <strong>2.2.0</strong>; columnstore and distinct‑aggregate users get the full experience on <strong>2.20.0+</strong> (with <strong>PostgreSQL 16+</strong>); and multicolumn SkipScan for not-null distinct values arrived in <strong>2.22.0</strong>.</p><p>Skip less relevant rows. Skip entire batches. <strong>Skip the scan.</strong></p><p>Sign up for a <a href="https://console.cloud.timescale.com/signup" rel="noreferrer">free trial</a> today to see the benefit for your own queries and workloads.</p><hr><p><strong>About the authors</strong></p><p><a href="https://www.linkedin.com/in/natalya-aksman-5a7b19/"><strong><u>Natalya Aksman</u></strong></a> is a senior developer at Tiger Data, focusing on query optimization and performance. Natalya joined Tiger Data recently, after working on the Vertica columnar database for many years and, before that, on Sybase IQ, one of the earliest columnar databases in the industry.&nbsp;</p><p>Natalya is most passionate about optimizing the performance of analytical databases and streamlining query planning and execution. Natalya has a Master of Applied Mathematics degree from Moscow University and a Master of Computer Science from Northeastern University in Boston.</p><p><strong>Noah Hein</strong> is a Senior Product Marketing Engineer at Tiger Data, where he helps developers understand, adopt, and succeed with the fastest PostgreSQL platform for real‑time and analytical workloads. 
Day‑to‑day, he translates deep technical capabilities—like hypertables, hypercore compression, and continuous aggregates—into clear product narratives and customer stories that drive adoption and growth.</p><p>Before joining Tiger Data, Noah spent several years on the “builder” side of the house as both a founding engineer and an educator. He co‑created Latent Space’s three‑week AI Engineering Fundamentals course and has taught hundreds of engineers how to apply LLMs in production. Noah frequently speaks on AI‑data convergence topics; at the first ever AI Engineer Summit he led the “AI Engineering 101” workshop, walking participants through hands‑on projects.</p><p>Outside of work, Noah tries to help more people land jobs with his side project JobMosaic. When he’s not crafting launch posts, you’ll find him experimenting with edge‑AI devices, tinkering with homelab Postgres clusters, or giving impromptu botany lessons to anyone who will listen.</p><p></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Introducing Direct Compress: Up to 40x Faster, Leaner Data Ingestion for Developers (Tech Preview)]]></title>
            <description><![CDATA[Direct Compress delivers up to 40x faster data ingestion for TimescaleDB by compressing time-series data in memory during insertion, eliminating background jobs.]]></description>
            <link>https://www.tigerdata.com/blog/introducing-direct-compress-up-to-40x-faster-leaner-data-ingestion-for-developers-tech-preview</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/introducing-direct-compress-up-to-40x-faster-leaner-data-ingestion-for-developers-tech-preview</guid>
            <category><![CDATA[Announcements & Releases]]></category>
            <category><![CDATA[TimescaleDB]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Product & Engineering]]></category>
            <dc:creator><![CDATA[Sven Klemm]]></dc:creator>
            <pubDate>Tue, 09 Sep 2025 13:00:52 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/09/direct-compress-thumbnail.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/09/direct-compress-thumbnail.png" alt="Introducing Direct Compress: Up to 40x Faster, Leaner Data Ingestion for Developers (Tech Preview)" /><p>Time-series and analytical data continue to grow at an unprecedented pace, and with that growth comes the challenge of efficiently storing and querying massive datasets. Traditionally, compressing this data required background jobs and additional tuning. This slowed down ingestion, added operational overhead, and delayed storage savings. </p><p>That’s why today, we're excited to announce <strong>Direct Compress</strong>, a new feature coming to TimescaleDB that compresses data during ingestion in memory, eliminating the need for traditional compression policies and improving insert performance by up to 40x.</p><p><em>Note: Direct Compress is currently available as a tech preview in TimescaleDB </em><a href="https://github.com/timescale/timescaledb/releases/tag/2.21.0"><em><u>2.21</u></em></a><em> for COPY operations, with full support for INSERT operations coming in a later version.</em></p><h2 id="the-evolution-of-timescaledb%E2%80%99s-columnstore">The Evolution of TimescaleDB’s Columnstore&nbsp;</h2><p>TimescaleDB has long been recognized for its industry-leading compression capabilities. With <a href="https://docs.tigerdata.com/use-timescale/latest/hypercore/"><u>hypercore</u></a>, TimescaleDB's hybrid row-columnar storage engine, users can achieve compression ratios of over 90% while maintaining fast query performance. 
Traditionally, the system would:</p><ol><li>Insert data in uncompressed row format</li><li>Write individual WAL records for each tuple</li><li>Later compress chunks through background policies</li></ol><p>Now, Direct Compress fundamentally changes this approach by compressing data <strong>during the ingestion process itself</strong>.</p><h2 id="what-is-direct-compress">What is Direct Compress?</h2><p>Direct Compress is a feature that allows TimescaleDB to compress data in memory as it's being ingested. Instead of writing WAL records for individual tuples, the system writes compressed batches directly to disk. This approach addresses several key challenges that developers and database administrators face when working with high-volume time-series data:</p><ul><li><strong>Excessive I/O overhead</strong>: Traditional ingestion requires writing each tuple individually to the WAL, creating significant I/O bottlenecks</li><li><strong>Dependency on compression policies</strong>: Previously, you had to wait for background compression jobs to optimize storage</li><li><strong>Insert performance limitations</strong>: Large-scale data ingestion was constrained by the overhead of individual tuple processing</li></ul><h2 id="benchmark-results-37x-improvement">Benchmark Results (37x Improvement)</h2><p>To test the per-tuple overhead, a narrow table with only one integer column was used. Direct compression provided considerable performance improvements, with the single integer table achieving 148.8 million tuples per second using 10k batch compression—a 37x improvement over uncompressed insertion. For a table with a timestamp column and 2 integer columns we achieved an insert rate of 66 million tuples per second with compression.</p><p>The schema used does have a big impact on achievable insert rate, with more complex datatypes like jsonb or wider rows having lower ingest rates. 
Parsing integer columns was found to have the least overhead compared to other datatypes, and for these benchmarks more than half of the cpu time was spent parsing input even when using binary input format. Performance scaled linearly across all thread counts until reaching the storage I/O bottleneck. During these tests we used a <a href="https://www.tigerdata.com/cloud"><u>Tiger Cloud</u></a> instance with 64 cores and EBS storage—with more optimized storage higher numbers are probably achievable. For the uncompressed tests no indexes were present on the <a href="https://docs.tigerdata.com/use-timescale/latest/hypertables/"><u>hypertables</u></a>. The 1k and 10k batch size refers to the batch size used internally during compression, not the batch size used by the client sending the data.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/09/insert-rate-vs-number-of-threads-.png" class="kg-image" alt="insert rate vs number of threads" loading="lazy" width="896" height="672" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2025/09/insert-rate-vs-number-of-threads-.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/09/insert-rate-vs-number-of-threads-.png 896w" sizes="(min-width: 720px) 720px"></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/09/Table-with-timestamp-and-2-int-columns-1.png" class="kg-image" alt="" loading="lazy" width="810" height="607" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2025/09/Table-with-timestamp-and-2-int-columns-1.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/09/Table-with-timestamp-and-2-int-columns-1.png 810w" 
sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Table with timestamp and 2 int columns</span></figcaption></figure><h2 id="key-benefits">Key Benefits</h2><h3 id="reduced-io-operations">Reduced I/O operations</h3><p>By compressing data in memory before writing to disk, Direct Compress eliminates the need to write individual WAL records for each tuple. Instead, only compressed batches are written, dramatically reducing I/O overhead.&nbsp;</p><h3 id="eliminated-policy-dependencies">Eliminated policy dependencies</h3><p>With Direct Compress, your <code>INSERT</code> operations already produce compressed chunks. This means <code>compress_chunk()</code> functions and compression policies become less critical to your workflow, simplifying your database maintenance.</p><h3 id="immediate-storage-efficiency">Immediate storage efficiency</h3><p>Unlike traditional compression that happens after ingestion, Direct Compress provides storage benefits immediately, reducing your storage footprint from the moment data arrives.</p><h2 id="how-direct-compress-works">How Direct Compress Works</h2><p>Direct Compress operates by intercepting data during the ingestion process and compressing it in memory before writing to disk. The process involves:</p><ol><li><strong>Batch Collection</strong>: Data is collected in configurable batches during <code>COPY</code> or <code>INSERT</code> operations.</li><li><strong>In-Memory Compression</strong>: Each batch is compressed using TimescaleDB's proven compression algorithms.</li><li><strong>Optimized Writing</strong>: Compressed batches are written directly to disk with minimal WAL overhead.</li></ol><p>This approach differs from traditional compression methods because it eliminates the two-step process of "ingest then compress," instead performing both operations simultaneously. <strong>Importantly, Direct Compress requires batched operations on the client side</strong> to achieve these performance benefits. 
With direct compression, data ingestion becomes limited by CPU processing rather than I/O speed.</p><p><strong>Roadmap</strong></p><ul><li><strong>COPY support</strong> (TimescaleDB 2.21 - Tech Preview)</li><li><strong>INSERT support </strong>(coming soon)</li><li><strong>Continuous aggregate support </strong>(coming soon)</li></ul><h2 id="getting-started-with-direct-compress">Getting Started with Direct Compress</h2><h3 id="prerequisites">Prerequisites</h3><p>Before using Direct Compress, ensure you have:</p><ul><li>TimescaleDB version 2.21 or later (currently in tech preview)</li><li>A hypertable with compression enabled (<a href="#basic-usage-example" rel="noreferrer"><u>see example</u></a>)</li><li>Batched client operations to make use of the feature</li></ul><h3 id="important-requirements-and-limitations">Important requirements and limitations</h3><p>Direct Compress <strong>requires batching</strong> on the client side to function effectively. It cannot be used:</p><ul><li>If the hypertable schema has unique constraints</li><li>If the hypertable has triggers</li><li>If the hypertable has continuous aggregates</li></ul><h3 id="configuration-options">Configuration options</h3><p>Direct Compress is controlled through several GUCs (Grand Unified Configuration parameters):</p><p><code>timescaledb.enable_direct_compress_copy</code><strong> (default: off)</strong></p><p>Enables the core Direct Compress feature for <code>COPY</code> operations. When enabled, chunks will be marked as unordered, so presorting is not required.</p><p><code>timescaledb.enable_direct_compress_copy_sort_batches</code><strong> (default: on)</strong></p><p>Enables per-batch sorting before writing compressed data, which can improve query performance.</p><p><code>timescaledb.enable_direct_compress_copy_client_sorted</code><strong> (default: off)</strong></p><p><strong>⚠️ DANGER</strong>: When enabled, chunks will not be marked as unordered. 
Only use this if your data is globally sorted, because queries that rely on ordering will return incorrect results if it is not. This feature distinguishes between local and global sorting. Local sorting means the data within the current batch is sorted. Global sorting means no other batch overlaps with the current batch.</p><h2 id="basic-usage-example">Basic Usage Example</h2><pre><code class="language-SQL">-- Create a hypertable with compression
CREATE TABLE sensor_data(
    time timestamptz, 
    device text, 
    value float
) WITH (
    tsdb.hypertable,
    tsdb.partition_column='time'
);

-- Enable Direct Compress
SET timescaledb.enable_direct_compress_copy = on;
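
-- Optional sketch: per-batch sorting is already on by default (see Configuration options);
-- it is set explicitly here only to make the choice visible in the session.
SET timescaledb.enable_direct_compress_copy_sort_batches = on;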

-- Use binary format for maximum performance
COPY sensor_data FROM '/tmp/sensor_data.binary' WITH (format binary);</code></pre><h2 id="best-practices-and-recommendations">Best Practices and Recommendations</h2><h3 id="1-use-binary-format">1. Use binary format</h3><p>Binary format achieves the highest insert rates. While CSV and text formats are supported, binary format provides optimal performance.</p><h3 id="2-consider-order-by-configuration">2. Consider order by configuration</h3><p>The default <code>orderby</code> configuration is <code>time DESC</code> for query optimization. However, for maximum Direct Compress benefits, consider changing this to <code>time</code> to optimize for insert performance:</p><pre><code class="language-SQL">ALTER TABLE sensor_data SET (timescaledb.orderby = 'time');</code></pre><p>This represents a trade-off between insert performance and query performance—choose based on your primary use case.</p><h3 id="3-presort-data-before-ingestion">3. Presort data before ingestion</h3><p>While TimescaleDB can do sorting as part of Direct Compress, it will take away CPU resources from other tasks.</p><h3 id="4-leverage-multiple-threads">4. Leverage multiple threads</h3><p>The benchmark results show significant benefits from parallel ingestion. Consider using multiple threads for large data imports.</p><h2 id="migration-and-compatibility">Migration and Compatibility</h2><h3 id="upgrading-existing-tables">Upgrading existing tables</h3><p>Direct Compress works with any existing hypertable that has the columnstore enabled, provided the limitations (no unique constraints, triggers, or continuous aggregates) are met.&nbsp;</p><h3 id="backward-compatibility">Backward compatibility</h3><p>Direct Compress is fully compatible with existing TimescaleDB compression features. 
You can use both traditional columnstore policies and Direct Compress simultaneously, though Direct Compress reduces the need for background compression jobs.</p><h2 id="looking-forward">Looking Forward</h2><p>Direct Compress represents a significant milestone in TimescaleDB's ongoing evolution toward real-time analytics at scale. This feature is part of our broader commitment to eliminating the traditional trade-offs between ingestion speed and storage efficiency.</p><p>Future enhancements to Direct Compress will include:</p><ul><li>Support for INSERT</li><li>Additional optimizations for unsorted data when using Direct Compress</li><li>Compatibility with continuous aggregates</li><li>Enhanced client-side tooling for optimal batching</li></ul><h2 id="try-direct-compress-today">Try Direct Compress Today</h2><p>Direct Compress brings considerable performance improvements to TimescaleDB users by eliminating the traditional ingestion bottleneck. With up to 37x faster ingestion in our benchmarks and immediate storage benefits, this feature is a game-changer for high-volume time-series applications.</p><p>Whether you're managing IoT sensor data, financial market feeds, or application monitoring metrics, Direct Compress can help you achieve unprecedented ingestion performance while reducing storage costs from day one.</p><p>We encourage you to try the tech preview of Direct Compress in your development environment and share your experiences with the community. Your feedback will help us refine this feature as we move toward full release. As always, our team is available to help you optimize your TimescaleDB deployment for your specific use case.</p><hr><p><strong>Ready to get started?</strong> Check out our <a href="https://docs.tigerdata.com/"><u>documentation</u></a> or <a href="https://www.tigerdata.com/contact"><u>contact our team</u></a> for personalized assistance with Direct Compress implementation.</p><p><em>Have questions about Direct Compress or want to share your results? 
Join the conversation in our </em><a href="https://forum.tigerdata.com/forum/"><em><u>community forum</u></em></a><em> or reach out to us on </em><a href="https://github.com/timescale/timescaledb"><em><u>GitHub</u></em></a><em>.</em></p><hr><h3 id="about-the-author">About the Author</h3><p>Sven is the tech lead for TimescaleDB, but his journey with databases started a long time ago. For over 25 years, he has been a huge fan of PostgreSQL, and it's that deep-seated passion that led him to where he is today. His work on planner optimizations and diving into the columnstore to squeeze out every bit of performance is a direct extension of his goal: to make the Postgres ecosystem even more powerful and efficient for everyone who uses it.</p><p>That long history with Postgres also informs his work on the security front. One of the projects he is most passionate about is <a href="https://github.com/timescale/pgspot"><u>pgspot</u></a>, where he gets to help build a more secure future for the database. After all these years, he has seen firsthand how a strong, trustworthy foundation is essential. To him, a great database isn't just about speed; it's about protecting the data with unwavering reliability. This blend of performance and security is what truly excites him every day.</p><p>When he's not in the weeds of database code, you can find him thinking about the bigger picture—how to make the community and product stronger, safer, and more user-friendly. He loves the challenge of taking a complex problem and finding a simple, elegant solution. His journey with Postgres has taught him that the best technology is built on a foundation of trust and a commitment to continuous improvement.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[PostgreSQL Couldn’t Handle Our Time-Series Data—TimescaleDB Crushed It]]></title>
            <description><![CDATA[In this post, Nakylai Taiirova shows how TimescaleDB solved Postgres performance issues: 83% storage reduction, 979x faster queries, and seamless SQL compatibility for time-series data.]]></description>
            <link>https://www.tigerdata.com/blog/postgresql-couldnt-handle-our-time-series-data-timescaledb-crushed-it</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/postgresql-couldnt-handle-our-time-series-data-timescaledb-crushed-it</guid>
            <category><![CDATA[Dev Q&A]]></category>
            <category><![CDATA[TimescaleDB]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Time Series Data]]></category>
            <dc:creator><![CDATA[NAKYLAI TAIIROVA]]></dc:creator>
            <pubDate>Thu, 04 Sep 2025 13:00:25 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/09/time-series-data-management-with-timescaledb.jpg">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/09/time-series-data-management-with-timescaledb.jpg" alt="PostgreSQL Couldn’t Handle Our Time-Series Data—TimescaleDB Crushed It" /><p><em>This article, written by&nbsp;</em><a href="https://www.linkedin.com/in/nakylai-taiirova-28bab859/"><em>Nakylai Taiirova</em></a><em>, was originally posted on the&nbsp;</em><a href="https://maddevs.io/writeups/time-series-data-management-with-timescaledb/"><em>Mad Devs blog</em></a><em>&nbsp;on April 10, 2025. Nakylai is a web developer with over 5 years of hands-on experience gained in multiple industries, including food &amp; goods delivery, WMS, EdTech, and entertainment. She has mastered skills in various areas of software engineering, including backend, frontend &amp; mobile development. The article is reposted here with permission. It outlines the performance tests that led to choosing TimescaleDB for a real-world e-commerce project, tracking product page views, click-through rates, and search position rankings.</em></p><hr><h2 id="introduction">Introduction</h2><p>Hi everyone! Today, I want to talk about time-series data management. We encountered this challenge while working on a project, and I will share our experience with TimescaleDB and highlight its outstanding features.</p><h2 id="understanding-time-series-data-characteristics-and-challenges">Understanding Time-Series Data: Characteristics and Challenges</h2><p>Time-series data is simply a sequence of data points collected over time. Think of it as measurements or events that have timestamps attached to them. 
It's append-only (we're mostly adding new data, not changing historical records), naturally ordered by time, and the time element itself is usually crucial for analysis.</p><p>Common examples of time-series data include: stock prices over time, weather measurements (temperature, humidity, wind speed), monthly subscriber counts on a website, sensor readings from IoT devices, etc.</p><p>In a real-world e-commerce project we recently built, our team encountered this type of data. We needed to track how many times product pages were viewed and how often users clicked on them. It was also necessary to record each product's daily position in search results since merchants paid for premium placements. This created a perfect time-series dataset–every day, we collected thousands of new records with timestamps, while the historical data remained unchanged. We used this data to show merchants evidence that higher positions actually led to more visibility and clicks.</p><h3 id="common-challenges">Common Challenges</h3><p>When working with time-series data, there are several important challenges to consider:</p><ul><li><strong>Data volume and scaling:</strong>&nbsp;High-frequency data collection leads to rapid growth, which causes traditional databases to struggle with performance.</li><li><strong>Complex aggregations:</strong>&nbsp;Queries like averages, sums, or analyzing changes over time need to be performed efficiently.</li></ul><p>This requires specialized approaches for optimal storage and querying. Large volumes of time-series data can overwhelm traditional relational databases, especially when they perform complex time-based aggregations.</p><h3 id="time-series-databases">Time-series databases</h3><p>There are several specialized database systems that exist to address these challenges, such as InfluxDB, Prometheus, Apache Druid, MongoDB, and Amazon Timestream. 
Each has its strengths and suits different use cases; however, this article focuses on TimescaleDB.</p><h2 id="timescaledb">TimescaleDB</h2><p>TimescaleDB is a time-series database built as a PostgreSQL extension. Unlike other solutions that require learning new query languages, TimescaleDB lets users continue using SQL.</p><h3 id="key-features">Key features:</h3><ul><li><strong>Postgres-based:</strong>&nbsp;Fully SQL-compatible, with support for relational capabilities.</li><li><strong>Hypertables:</strong>&nbsp;Unlike standard PostgreSQL tables that store all data in a single table, hypertables automatically partition time-series data into chunks based on time intervals. This increases query speed since queries only need to scan the relevant time chunks instead of the entire dataset.</li><li><strong>Data lifecycle management</strong>: Automatic data handling through features like:<ul><li>1. Continuous aggregation — pre-calculates common metrics (like daily averages or hourly sums) and keeps them updated automatically, so dashboards and reports run 100-1000x faster without recalculating from raw data each time.</li><li>2. Data retention policies that automatically remove old data past a certain age.&nbsp;</li><li>3. <a href="https://docs.tigerdata.com/use-timescale/latest/data-tiering/" rel="noreferrer">Tiered storage</a>&nbsp;— automatically moves older, less frequently accessed data to low-cost object storage (built on Amazon S3) while keeping recent data in high-performance storage.</li></ul></li></ul><p>Below we'll see TimescaleDB in action with some real-world examples.</p><h2 id="practical-example-intel-lab-sensor-data">Practical Example: Intel Lab Sensor Data</h2><p>For this practical example, we're using sensor data from the&nbsp;<a href="https://db.csail.mit.edu/labdata/labdata.html" rel="noreferrer">Intel Berkeley Research Lab</a>. 
This dataset contains millions of temperature, humidity, light, and voltage readings over a 2-month period from 54 sensors deployed throughout the lab. We chose this dataset because, despite its relatively small size (about 2.3 million readings), it has enough data to demonstrate the performance differences between PostgreSQL and TimescaleDB.</p><h3 id="setting-up-our-environment">Setting up our environment</h3><p>First, we created two tables to compare performance:</p><ol><li>A regular PostgreSQL table (<code>sensor_data_postgres</code>)</li><li>A TimescaleDB hypertable (<code>sensor_data_timescale</code>)</li></ol><p>Both tables have identical schemas and indexes:</p><pre><code class="language-SQL">-- Regular PostgreSQL table
CREATE TABLE sensor_data_postgres (
    time        TIMESTAMPTZ NOT NULL,
    epoch       INTEGER,
    sensor_id   INTEGER NOT NULL,
    temperature DOUBLE PRECISION,
    humidity    DOUBLE PRECISION,
    light       DOUBLE PRECISION,
    voltage     DOUBLE PRECISION
);
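
-- Sketch (assumption): the post says both tables share the same indexes but does not list them.
-- create_hypertable() adds a descending index on the time column by default, so a matching
-- index on the plain table could look like this (the index name is hypothetical):
CREATE INDEX sensor_data_postgres_time_idx ON sensor_data_postgres (time DESC);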

-- TimescaleDB table
CREATE TABLE sensor_data_timescale (
    time        TIMESTAMPTZ NOT NULL,
    epoch       INTEGER,
    sensor_id   INTEGER NOT NULL,
    temperature DOUBLE PRECISION,
    humidity    DOUBLE PRECISION,
    light       DOUBLE PRECISION,
    voltage     DOUBLE PRECISION
);

-- Convert to hypertable
SELECT create_hypertable('sensor_data_timescale', 'time');</code></pre><h3 id="performance-comparison">Performance comparison</h3><p>All performance tests were run on:</p><ul><li>MacBook Air (M1, 2020)</li><li>Apple M1 chip</li><li>16 GB RAM</li><li>PostgreSQL 14.17 (64-bit)</li><li>TimescaleDB 2.19.0</li></ul><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text"><i><em class="italic" style="white-space: pre-wrap;">Note: Performance results may vary on different hardware configurations, but the overall tendency should remain consistent across systems.</em></i></div></div><p>We loaded 1,841,828 rows and ran identical queries on both tables. Let's look at each query type and its results:</p><p><strong>1- Full range select:</strong>&nbsp;Basic retrieval of all data within a time range</p><pre><code class="language-SQL">SELECT * FROM sensor_data_postgres 
WHERE time &gt;= '2004-02-28 00:58:46' 
AND time &lt;= '2004-04-05 11:02:32';
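
-- Sketch (assumption): the post does not say how its timings were taken.
-- One common way is EXPLAIN (ANALYZE, BUFFERS), which reports the execution time
-- and, for a hypertable, lists the chunks that were actually scanned:
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM sensor_data_timescale
WHERE time &gt;= '2004-02-28 00:58:46'
AND time &lt;= '2004-04-05 11:02:32';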

-- Same query for TimescaleDB table</code></pre><p><strong>Results:</strong></p><ul><li>PostgreSQL: 226.5 ms.</li><li>TimescaleDB: 317.7 ms.</li><li>Performance: PostgreSQL is 40% faster.</li></ul><p><strong>Insights:</strong></p><ul><li>For basic SELECT queries, PostgreSQL performs better.</li><li>TimescaleDB's overhead for chunk management affects simple query performance.</li><li>This type of query is less common in time-series applications.</li></ul><p><strong>2- Time-based aggregations:</strong>&nbsp;Grouping and analyzing data by time intervals</p><pre><code class="language-SQL">-- PostgreSQL (Daily aggregation)
SELECT 
    date_trunc('day', time) as day,
    sensor_id,
    COUNT(*) as readings,
    AVG(temperature) as avg_temp,
    stddev(temperature) as temp_stddev
FROM sensor_data_postgres
WHERE time &gt;= '2004-02-28' AND time &lt;= '2004-04-05'
GROUP BY day, sensor_id
HAVING stddev(temperature) &gt; 2
ORDER BY day, sensor_id;

-- TimescaleDB (using time_bucket)
SELECT 
    time_bucket('1 day', time) as day,
    sensor_id,
    COUNT(*) as readings,
    AVG(temperature) as avg_temp,
    stddev(temperature) as temp_stddev
FROM sensor_data_timescale
WHERE time &gt;= '2004-02-28' AND time &lt;= '2004-04-05'
GROUP BY day, sensor_id
HAVING stddev(temperature) &gt; 2
ORDER BY day, sensor_id;</code></pre><p><strong>Results:</strong></p><ul><li>PostgreSQL: 443.2 ms.</li><li>TimescaleDB: 170.9 ms.</li><li>Performance: TimescaleDB takes 61% less time (about 2.6x faster).</li></ul><p><strong>Insights:</strong></p><ul><li>The core of TimescaleDB's advantage is its hypertable architecture. Hypertables automatically partition time-series data into chunks based on time intervals. When querying, TimescaleDB identifies and scans only the relevant chunks instead of the entire dataset, reducing I/O operations for time-bounded queries.</li><li>The&nbsp;<code>time_bucket</code>&nbsp;function is specifically optimized to work with this chunked architecture, which makes it more efficient than PostgreSQL's&nbsp;<code>date_trunc</code>.</li></ul><h3 id="advanced-timescaledb-features-continuous-aggregates">Advanced TimescaleDB features: continuous aggregates</h3><p>Think of continuous aggregates as a personal data assistant that works ahead of time. Instead of forcing a database to recalculate the same aggregations (like daily averages or hourly counts) every time someone views a dashboard, TimescaleDB does this work in advance and keeps it ready to serve instantly.</p><p>Imagine temperature data being tracked and updated every minute, but the dashboard only needs to show daily averages. Rather than scanning millions of raw data points each time the dashboard loads, continuous aggregates pre-calculate and store the required results. They automatically update as new data arrives.</p><p>Let's compare a regular PostgreSQL view with a TimescaleDB continuous aggregate:</p><p><strong>1- PostgreSQL View vs TimescaleDB Continuous Aggregate</strong></p><pre><code class="language-SQL">-- Regular PostgreSQL view (computed on every query)
CREATE VIEW pg_daily_sensor_stats AS
SELECT
    date_trunc('day', time) as day,
    sensor_id,
    AVG(temperature) as avg_temp,
    MIN(temperature) as min_temp,
    MAX(temperature) as max_temp,
    COUNT(*) as reading_count
FROM sensor_data_postgres
GROUP BY day, sensor_id;

-- TimescaleDB continuous aggregate (materialized and automatically refreshed)
CREATE MATERIALIZED VIEW ts_daily_sensor_stats
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('1 day', time) as day,
    sensor_id,
    AVG(temperature) as avg_temp,
    MIN(temperature) as min_temp,
    MAX(temperature) as max_temp,
    COUNT(*) as reading_count
FROM sensor_data_timescale
GROUP BY day, sensor_id;
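
-- Sketch (assumption, not from the original post): the refresh policy in this example only
-- covers recent buckets. A historical window, such as this 2004 dataset, can also be
-- refreshed manually with refresh_continuous_aggregate():
CALL refresh_continuous_aggregate('ts_daily_sensor_stats', '2004-02-28', '2004-04-06');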

-- Set up automatic refresh policy
SELECT add_continuous_aggregate_policy('ts_daily_sensor_stats',
    start_offset =&gt; INTERVAL '3 days',
    end_offset =&gt; INTERVAL '1 hour',
    schedule_interval =&gt; INTERVAL '1 hour');</code></pre><p><strong>2- Querying the aggregates</strong></p><pre><code class="language-SQL">-- Query against PostgreSQL view (computed on demand)
SELECT * FROM pg_daily_sensor_stats
WHERE day &gt;= '2004-03-01' AND day &lt;= '2004-03-31'
ORDER BY day, sensor_id;

-- Query against TimescaleDB continuous aggregate (pre-computed)
SELECT * FROM ts_daily_sensor_stats
WHERE day &gt;= '2004-03-01' AND day &lt;= '2004-03-31'
ORDER BY day, sensor_id;</code></pre><p><strong>Results when querying a month of data:</strong></p><ul><li>PostgreSQL view: 424.0 ms (computed on every query).</li><li>TimescaleDB continuous aggregate: 0.433 ms (pre-computed).</li><li>Performance: TimescaleDB is 979x faster.</li></ul><h3 id="storage-optimization">Storage optimization:</h3><p>TimescaleDB also saves storage costs while maintaining query performance. With 54 sensors taking readings every minute for months, the raw data keeps growing, and so does the storage it needs.</p><p>For newer TimescaleDB versions (v2.18.0+), Hypercore automatically handles the storage management. But for this example, we’ll take a look at a compression policy.</p><p>Setting up compression is simple with&nbsp;<a href="https://docs.tigerdata.com/use-timescale/latest/compression/compression-policy/" rel="noreferrer">compression policies</a>&nbsp;— just tell TimescaleDB to compress chunks older than a certain age (like 7 days), and it handles everything automatically in the background.</p><p>Both PostgreSQL and TimescaleDB tables start at similar sizes (around 209 MB) with our test dataset:</p><pre><code class="language-SQL">-- Check regular PostgreSQL table size
SELECT pg_total_relation_size('sensor_data_postgres')/1024/1024 as size_mb;
-- PostgreSQL size: 209 MB

-- Check hypertable size before compression
SELECT hypertable_size('sensor_data_timescale')/1024/1024 as size_mb;
-- TimescaleDB size before compression: 208 MB

-- Enable compression on table
ALTER TABLE sensor_data_timescale SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'sensor_id',
    timescaledb.compress_orderby = 'time'
);

-- Create compression policy (compress chunks older than 7 days)
SELECT add_compression_policy('sensor_data_timescale', INTERVAL '7 days');
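
-- Sketch (assumption, not from the original post): the policy runs as a background job,
-- so the size may not change right away. Existing chunks can be compressed immediately with:
SELECT compress_chunk(c) FROM show_chunks('sensor_data_timescale') AS c;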

-- Check hypertable size after compression
SELECT hypertable_size('sensor_data_timescale')/1024/1024 as size_mb;
-- TimescaleDB size after compression: 35 MB (83% reduction)</code></pre><p><strong>Insights:</strong></p><ul><li>Native compression provides significant storage savings (83% reduction)</li><li>Compression maintains query functionality and reduces storage costs</li><li>Choose compression settings strategically:<ul><li>1.&nbsp;<code>compress_orderby='time'</code>: Best for time-series data, as values typically change gradually over time.</li><li>2. <code>compress_segmentby='sensor_id'</code>: Group data by columns you frequently filter or aggregate on.</li><li>3. For example, if you often query&nbsp;<code>SELECT avg(temperature) FROM sensor_data_timescale WHERE sensor_id = 5</code>, using <code>sensor_id</code> as the segmentby column will improve query performance.</li></ul></li></ul><p><a href="https://docs.tigerdata.com/use-timescale/latest/compression/about-compression/" rel="noreferrer">Learn more about compression settings</a></p><h2 id="final-thoughts">Final Thoughts</h2><p>Our performance tests demonstrated TimescaleDB's advantages for time-series data management. While traditional PostgreSQL performs better for simple queries, TimescaleDB is the better choice for complex time-based operations, with clear performance gains on aggregation queries.</p><p>In our real-world e-commerce project, we successfully implemented TimescaleDB to track product page views, click-through rates, and search position rankings. This gave us the perfect solution for our expanding dataset—every day, we collected thousands of new timestamped records while keeping historical data intact. Using continuous aggregation, we were able to efficiently analyze how premium placements affected visibility and clicks across a 2-3 year period, providing merchants with evidence that higher positions actually increased engagement.</p><p>The storage benefits were significant, too. TimescaleDB's compression reduced our storage needs by about 30-40%. This led to real cost savings in our cloud bills. 
Our analytics platform became more responsive, and our budget healthier. These benefits matter even more when the situation involves large amounts of time-series data.</p><p>TimescaleDB is worth considering for work with growing time-series data that requires complex analysis. It offers powerful tools and accommodates SQL.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The Database Has a New User—LLMs—and They Need a Different Database]]></title>
            <description><![CDATA[Tiger Data experiments with self-describing Postgres using semantic catalogs. Early tests show 27% better SQL generation accuracy when AI agents understand schemas.]]></description>
            <link>https://www.tigerdata.com/blog/the-database-new-user-llms-need-a-different-database</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/the-database-new-user-llms-need-a-different-database</guid>
            <category><![CDATA[AI agents]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[AI]]></category>
            <dc:creator><![CDATA[Matvey Arye]]></dc:creator>
            <pubDate>Thu, 21 Aug 2025 12:59:10 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/08/experimenting-with-a-self-describing-postgresql-database-for-the-agentic-era.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/08/experimenting-with-a-self-describing-postgresql-database-for-the-agentic-era.png" alt="The Database Has a New User—LLMs—and They Need a Different Database" /><h3 id="we%E2%80%99re-experimenting-with-a-database-that-can-self-describe-and-we%E2%80%99re-starting-with-the-most-popular-one-in-the-world-postgres">We’re experimenting with a database that can self-describe. And we’re starting with the most popular one in the world: Postgres.</h3><p><strong>TL;DR:</strong> We’re experimenting with turning Postgres into a <strong>self-describing database</strong>, embedding meaning as part of the schema. By providing natural language explanations of PostgreSQL structures and logic, agents can more accurately query data, and answer a broader set of questions. In our early tests, using an LLM-generated semantic catalog improved SQL generation accuracy by up to 27%. Here's the <a href="https://github.com/timescale/pgai/tree/main/docs/semantic_catalog" rel="noreferrer">repo link</a> for reference.</p><h2 id="databases-lack-context-about-their-structures">Databases Lack Context About Their Structures</h2><p>As any developer who has ever had the (mis)fortune of working on a legacy database knows quite well, a database is not self-describing. You can’t look at a database schema and tell what’s going on. That <code>orders1</code> table can be for a purchase order or a customer order or it’s an experimental table someone created and forgot to drop 5 years ago. This is a long-standing problem with databases that people solve by talking to each other, looking at code that interacts with the database, examining git history, and screaming at the wall.&nbsp;&nbsp;</p><p>But, LLMs need to answer questions about the data just from the context provided by the database. 
No wonder they get confused.&nbsp;</p><p>Like our fearless leader <a href="https://www.tigerdata.com/blog/author/ajay"><u>Ajay Kulkarni</u></a> once said—<em>“LLMs crave (accurate) context the same way GPUs crave power.”</em></p><p>For an agent to extract insights from data in the database, it needs to understand which tables and columns to query. For example, if it doesn’t understand which table contains customer orders and which contains purchase orders, it can’t tell when a customer’s order has shipped.</p><p>In fact, in internal experiments we conducted at Tiger Data, we found that 42% of context-less LLM-generated SQL queries missed critical filters or misunderstood relationships, silently returning misleading data.&nbsp;</p><p>The larger (and thus probably more important) the database, the worse things get.</p><p>It doesn’t take a senior engineer to know that this is unacceptable.</p><h2 id="adding-context-with-semantic-catalogs">Adding Context with Semantic Catalogs</h2><p>The solution is incredibly simple: allow developers to add context about the database in the form of natural language descriptions of the schemas and business logic within Postgres, injecting much-needed factual meaning into schemas.&nbsp;</p><p>We arrived at this by putting ourselves in the place of the LLM — asking, how would we generate the query if we were in its shoes? The answer was clear: without additional context, we wouldn’t know enough to make the right call. Along the way, we tried different prompts, techniques, and even alternative LLMs, but the core issue was always the same: the model simply lacked the context necessary to generate the query. That realization led us to build a way for developers to provide that missing context directly.</p><p>That's what we’ve been experimenting with and are excited to share today.&nbsp;&nbsp;</p><p>We set out to test what happens when SQL generation is powered by our LLM-generated semantic catalog. 
The outcome was a 27% boost in accuracy compared to the control.</p><p>Let's show you how we created this self-describing Postgres database.</p><h2 id="building-a-self-describing-database">Building a Self-Describing Database</h2><p>Our thoughts for this self-describing Postgres are built around four core ideas:&nbsp;</p><ol><li><strong>Embed semantics alongside schema: </strong><br>Every table, column, function, and business rule should be described clearly—in natural language.</li><li><strong>Versioned and governed descriptions: </strong><br>Metadata should live alongside application code, version-controlled, reviewed, and governed with the same rigor.</li><li><strong>Self-correcting querying: </strong><br>Postgres itself should expose safety hints, telling agents which queries are expensive or unsafe, and provide deterministic verification mechanisms (<code>EXPLAIN</code>) to catch errors before queries run.</li><li><strong>Measure and iterate transparently: </strong><br>Developers should be able to build an evaluation suite for agent interactions and determine how well the system performed. Critically, the developers should see what errors are due to bad schema descriptions and lack of context (retrieval errors) or incorrect reasoning (logical errors).</li></ol><h2 id="what-we%E2%80%99re-experimenting-with-today">What We’re Experimenting With Today</h2><p>We’re exploring this approach through two building blocks. This post primarily covers the Semantic Catalog. Next week, we’ll publish a post about the evaluation harness.</p>
<!--kg-card-begin: html-->
<table>
        <thead>
            <tr>
                <th>Component</th>
                <th>What It Does</th>
                <th>Repo</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <td>Semantic Catalog</td>
                <td>Stores natural-language descriptions for schema elements. Supports vector search to retrieve relevant context dynamically.</td>
                <td><a href="https://github.com/timescale/pgai/tree/main/docs/semantic_catalog" target="_blank" style="text-decoration: underline">Link</a></td>
            </tr>
            <tr>
                <td>Evaluation Harness</td>
                <td>Measures query accuracy in any agentic text-to-sql system and transparently reveals errors in retrieval (bad metadata/context) versus reasoning (bad SQL generation). Helps track accuracy over time.</td>
                <td>TBD</td>
            </tr>
        </tbody>
    </table>
<!--kg-card-end: html-->
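<p>The evaluation harness is not published yet, so the following Python is only a minimal sketch of the retrieval-versus-reasoning split described in the table, not its actual implementation. The names <code>EvalCase</code>, <code>classify</code>, and <code>summarize</code>, and the rule that a missing required object counts as a retrieval error, are assumptions made for illustration.</p>
<pre><code class="language-Python">from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    """One evaluated question. All field names here are hypothetical."""
    question: str
    required_objects: frozenset   # tables, views, or columns the reference query needs
    retrieved_objects: frozenset  # objects the catalog search handed to the model
    result_matches_gold: bool     # did the generated SQL return the reference result?


def classify(case):
    """Label a case as 'correct', 'retrieval_error', or 'reasoning_error'."""
    if case.result_matches_gold:
        return "correct"
    missing = case.required_objects - case.retrieved_objects
    # Needed context never reached the model, so attribute the failure to retrieval.
    if missing:
        return "retrieval_error"
    # Everything needed was in context, so the SQL generation step went wrong.
    return "reasoning_error"


def summarize(cases):
    """Count outcomes and compute overall accuracy for a list of cases."""
    counts = {"correct": 0, "retrieval_error": 0, "reasoning_error": 0}
    for case in cases:
        counts[classify(case)] += 1
    total = len(cases)
    accuracy = counts["correct"] / total if total else 0.0
    return counts, accuracy</code></pre>
<p>Tracking the two error counts separately over time would indicate whether effort is better spent on improving descriptions (retrieval) or on prompts and models (reasoning).</p>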
<h3 id="why-isnt-the-problem-just-retrieval">Why isn't the problem just retrieval?</h3><p>Retrieval using the schema element names alone does not cut it simply because the names lack enough meaning. It doesn’t matter if you use keyword search, semantic search, or this week’s newest shiny retrieval algorithm—the semantic meaning of <code>payment_aborts</code> versus&nbsp; <code>refunds</code> is too close for any retrieval mechanism to make heads or tails of it.&nbsp;&nbsp;</p><p>Even if you injected the entire schema into the context of your prompt (burning through cash and increasing response latency in the process), it’s doubtful the model would be able to reason about&nbsp;<code>payment_aborts</code> versus&nbsp;<code>refunds</code> either.</p><p>The semantic catalog solves this by offering a structured, natural language representation of your database's metadata and business logic. In this experimental version, it supports documentation and descriptions for:</p><ul><li>Tables and views</li><li>Columns</li><li>Functions and procedures</li><li>Example queries</li><li>Facts: business logic rules or freeform context</li></ul><p>Initial descriptions can be LLM-generated and stored in human-readable YAML files designed for version control, peer review, and governance. Once reviewed, this metadata can be imported into the semantic catalog and indexed for semantic search, simplifying the process of retrieving and reasoning about data structures.</p><h3 id="performance-enhancement-through-semantic-context">Performance enhancement through semantic context</h3><p>In our early tests, using an LLM-generated semantic catalog improved SQL generation accuracy by up to 27%. 
Effective SQL generation involves two key steps:</p><ol><li><strong>Retrieving Relevant Context</strong>: Identifying necessary database components</li><li><strong>SQL Generation Reasoning</strong>: Formulating accurate SQL queries based on the retrieved context</li></ol><p>Our evaluation revealed substantial improvements from using semantic descriptions for SQL generation accuracy (from 58% to 86%) for certain schemas. A smaller improvement was noted when using semantic descriptions in retrieval recall (from 52% to 58% on one of our tests).</p><p>Results varied significantly across datasets, schemas, and models. Our current evaluations were conducted using a more cost-effective model (OpenAI's gpt-4.1-nano). We're particularly interested in insights from users experimenting with diverse datasets and more advanced models.</p><p>With these results, let’s walk through how the agent actually interacts with the database.</p><h3 id="how-queries-work-the-minimal-agent-loop">How queries work: the minimal agent loop</h3><p>Every semantic query follows this four-step flow:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/08/self-describing-postgres-diagram-vertical.png" class="kg-image" alt="Developers and agents alike gain visibility and control at each step." 
loading="lazy" width="2000" height="581" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2025/08/self-describing-postgres-diagram-vertical.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2025/08/self-describing-postgres-diagram-vertical.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2025/08/self-describing-postgres-diagram-vertical.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w2400/2025/08/self-describing-postgres-diagram-vertical.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Developers and agents alike gain visibility and control at each step.</span></figcaption></figure><p>Walking through step-by-step:</p><p><strong>Step 1: Describe your database</strong> <strong>using an LLM</strong></p><p>Generate initial YAML descriptions with an LLM:</p><pre><code class="language-Shell">pgai semantic-catalog describe -f descriptions.yaml</code></pre><p>Review and refine these descriptions, storing them with the same rigor as your application code.</p><p>The resulting structure looks like this:</p><pre><code class="language-YAML">schema: public
name: restaurant
type: table
description: Stores core information about each restaurant, including its cuisine,
  city, and rating.
columns:
- name: id
  description: Primary key identifier for the restaurant.
- name: name
  description: Lowercased name of the restaurant.
- name: food_type
  description: Lowercased name of the type of cuisine or food style offered by the restaurant.
- name: city_name
  description: Lowercased name of the city where the restaurant is located.
- name: rating
  description: Numeric rating score assigned to the restaurant, where higher than 2.5 is considered good.
...
---
type: sql_example
sql: SELECT t2.house_number, t1.name FROM LOCATION AS t2 JOIN restaurant AS t1 ON t1.id = t2.restaurant_id WHERE t1.city_name IN (SELECT city_name FROM geographic WHERE region = 'bay area') AND t1.rating &gt; 2.5
description: give me some good restaurants in the bay area ?
...
---
type: fact
description: When asking for a restaurant, provide its name (restaurant.name) and its house number (location.house_number).
...</code></pre><p><strong>Step 2: Have a human review the description and add context</strong></p><p>A developer would then review the descriptions and add additional context. Crucially, business logic that is central to database operations is often not encoded into the schema, and thus cannot be derived by the LLM. A developer needs to provide that information.</p><p>After the descriptions are reviewed, they should be treated as a core part of application code and thus should be stored in version control, be reviewed in pull-requests, etc.</p><p><strong>Step 3: Import into the catalog</strong>&nbsp;</p><p>Make descriptions available:</p><pre><code class="language-Shell">pgai semantic-catalog import -f descriptions.yaml</code></pre><p>This supports declarative configuration and continuous deployment practices.</p><p><strong>Step 4: Generate SQL</strong>&nbsp;</p><p>Generate SQL from natural language queries:</p><p>CLI:</p><pre><code class="language-Shell">pgai semantic-catalog generate-sql -p "Which passengers have experienced the most flight delays in 2024?"</code></pre><p>Python:</p><pre><code class="language-Python">response = await catalog.generate_sql(
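    con,  # assumed: connection to the database that stores the semantic catalog
    con,  # assumed: connection to the database being queried (here, the same one)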
    "openai:gpt-4.1",
    "Which passengers have experienced the most flight delays in 2024?",
)
</code></pre><h2 id="key-lessons-so-far">Key Lessons So Far</h2><h3 id="semantic-context-matters">Semantic context matters&nbsp;</h3><p>Models are far better at generating correct SQL when they have access to rich semantic information, not just schema names, but natural-language descriptions. Even a small amount of context (e.g. "the order table tracks transactions for customers buying subscription plans") can dramatically improve reliability.</p><h3 id="narrow-interfaces-build-confidence">Narrow interfaces build confidence&nbsp;</h3><p>What we’ve noticed with agentic text-to-sql systems is that to get good accuracy you need to tighten the scope of what kind of information the system has access to. The relationship looks something like this:</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/08/data-scope-generality-vs-accuracy.png" class="kg-image" alt="" loading="lazy" width="528" height="432"></figure><p>Successful agentic systems start with tight scopes. We found it valuable to restrict agents to function-level or view-level access at first, and only expand access once correctness is proven.</p><p>Postgres allows you to balance control and flexibility through three main interfaces:</p><ul><li><strong>Functions</strong>: Highly controlled but narrow scope</li><li><strong>Views</strong>: Moderate control with broader access</li><li><strong>Raw Tables</strong>: Most general but least constrained and more prone to failure</li></ul><p>We suggest beginning with tightly scoped functions, then expanding access as confidence grows.</p><h2 id="self-correcting-using-explain">Self-Correcting Using EXPLAIN</h2><p>To further improve reliability, we employ Postgres's deterministic EXPLAIN command. 
This preemptively catches query errors such as incorrect column or table names, allowing agents to self-correct and substantially increasing accuracy.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/08/Explain-plan.png" class="kg-image" alt="" loading="lazy" width="949" height="265" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2025/08/Explain-plan.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/08/Explain-plan.png 949w" sizes="(min-width: 720px) 720px"></figure><h2 id="store-the-semantic-catalog-anywhere">Store the Semantic Catalog Anywhere</h2><p>For many deployments, integrating the semantic catalog directly into your existing database provides the simplest path. Recognizing, however, the risks and complexities of altering production environments, we've also enabled the semantic catalog to be hosted independently in a separate database. This approach offers greater deployment flexibility and ensures accessibility even when you only have read-only access to your primary database.</p><h2 id="what%E2%80%99s-next-from-self-describing-to-self-learning">What’s Next: From Self-Describing to Self-Learning</h2><p>While this semantic foundation is already improving agentic querying, we're just beginning. Our roadmap focuses on:</p><ul><li><strong>Self-learning catalog:</strong> Automatically enrich metadata by analyzing queries from production environments. Let databases learn from usage to continuously enhance accuracy.</li><li><strong>Dynamic policy management:</strong> Express complex access rules—like row-level privacy policies—in natural language, automatically enforced by the database.</li></ul><p>We’ve open-sourced everything we’ve built so far. 
Check out the <a href="https://github.com/timescale/pgai/blob/main/docs/semantic_catalog/README.md"><u>README with a Quickstart</u></a> to dive in. We warmly invite your contributions, whether that means running evaluations on your schemas, proposing datasets, or challenging our assumptions. The best databases aren't just designed; they evolve through community effort.</p><hr><p><strong>About the author</strong></p><p><a href="https://www.linkedin.com/in/matvey-arye/"><u>Matvey Arye</u></a> is a founding engineering leader at Tiger Data (creators of TimescaleDB), the premier provider of relational database technology for time-series data and AI. Currently, he manages the team at Tiger Data responsible for building the go-to developer platform for AI applications.&nbsp; Under his leadership, the Tiger Data engineering team has introduced partitioning, compression, and incremental materialized views for time-series data, plus cutting-edge indexing and performance innovations for AI.&nbsp;</p><p>Matvey earned a Bachelor's degree in Engineering at The Cooper Union. He earned a Doctorate in Computer Science at Princeton University, where his research focused on cross-continental data analysis covering issues such as networking, approximate algorithms, and performant data processing.&nbsp;</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Tiger Lake: A New Architecture for Real-Time Analytical Systems and Agents]]></title>
            <description><![CDATA[Mike Freedman, CTO of Tiger Data, introduces Tiger Lake: a native Postgres–lakehouse bridge for real-time, analytical, and agentic systems.]]></description>
            <link>https://www.tigerdata.com/blog/tiger-lake-a-new-architecture-for-real-time-analytical-systems-and-agents</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/tiger-lake-a-new-architecture-for-real-time-analytical-systems-and-agents</guid>
            <category><![CDATA[Announcements & Releases]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Tiger Lake]]></category>
            <dc:creator><![CDATA[Mike Freedman]]></dc:creator>
            <pubDate>Thu, 17 Jul 2025 12:59:06 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/07/2025-july-15-tigerlake-thumbnail.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/07/2025-july-15-tigerlake-thumbnail.png" alt="Tiger Lake: A New Architecture for Real-Time Analytical Systems and Agents" /><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">🔈</div><div class="kg-callout-text">Tiger Lake is currently in public beta for scale and enterprise users.&nbsp;<a href="https://console.cloud.timescale.com/signup" target="_blank" rel="noopener noreferrer">Sign up</a>&nbsp;for Tiger Cloud to try out your use case.</div></div><p>Modern applications are becoming more dynamic, more intelligent, and more real time. Dashboards refresh with incoming telemetry. Monitoring systems respond to shifting baselines. Agents make decisions in context, not in isolation. Each depends on the same foundational requirement: the ability to unify live events with deep historical state.</p><p>Yet the data remains fragmented.</p><p>Operational systems, built on Postgres, handle ingestion and serving. Analytical systems, built on the lakehouse, handle enrichment and modeling. Connecting them means stitching together streams, pipelines, and custom jobs—each introducing latency, fragility, and cost. The result is a patchwork of systems that struggle to deliver the full picture, let alone do so in real time.</p><p>This fragmentation doesn’t just slow teams down—it limits what developers can build. You can’t deliver real-time dashboards with historical depth or ground agents in fresh operational context when the data is split by design.</p><p>This architectural divide is no longer sustainable.</p><p><a href="https://docs.tigerdata.com/use-timescale/latest/tigerlake/"><u>Tiger Lake</u></a> bridges that divide. Now in public beta, it introduces a new data loop—continuous, bidirectional, and deeply integrated—between Postgres and the lakehouse. 
It simplifies the stack, preserves open formats, and brings operational and analytical context into the same system.</p><h2 id="introducing-tiger-lake-real-time-data-full-context-systems">Introducing Tiger Lake: Real-Time Data, Full-Context Systems</h2><p>Tiger Lake eliminates the need for external pipelines, complex orchestration frameworks, and proprietary middleware. It is built directly into Tiger Cloud and integrated with Tiger Postgres, our production-grade Postgres engine for transactional, analytical, and agentic workloads.</p><p>The architecture uses open standards from end to end:</p><ul><li>Apache Iceberg tables stored in Amazon S3 Tables for lakehouse integration</li><li>Continuous replication from Postgres tables or hypertables into Iceberg</li><li>Streaming ingestion back into Postgres for low-latency serving and operations</li><li>Pushing down queries from Postgres to Iceberg for efficient rollups</li></ul><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/07/2025-july-14-tigerlake-post-diagram-1.png" class="kg-image" alt="Tiger Lake architecture diagram" loading="lazy" width="2000" height="1866" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2025/07/2025-july-14-tigerlake-post-diagram-1.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2025/07/2025-july-14-tigerlake-post-diagram-1.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2025/07/2025-july-14-tigerlake-post-diagram-1.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w2400/2025/07/2025-july-14-tigerlake-post-diagram-1.png 2400w" sizes="(min-width: 720px) 720px"></figure><p>These capabilities come built in. 
What previously required Flink jobs, DAG schedulers, and custom glue now works natively. Streaming behavior and schema compatibility are designed into the system from the start.</p><p>To understand how Tiger Lake reshapes data architecture, it helps to <a href="https://www.tigerdata.com/blog/the-database-meets-the-lakehouse-toward-a-unified-architecture-for-modern-applications"><u>revisit the medallion model</u></a> and consider how it evolves when real-time context becomes a core design principle.</p><p>You can think of it as an <strong>operational medallion architecture</strong>:</p><ul><li><strong>Bronze:</strong> Raw data lands in Iceberg-backed S3.</li><li><strong>Silver: </strong>Cleaned and validated data is replicated to Postgres.</li><li><strong>Gold:</strong> Aggregates are computed in Postgres for real-time serving, then streamed back to Iceberg for feature analysis.</li></ul><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/07/medallion-architecture.png" class="kg-image" alt="Operational medallion architecture" loading="lazy" width="2000" height="735" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2025/07/medallion-architecture.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2025/07/medallion-architecture.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2025/07/medallion-architecture.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/07/medallion-architecture.png 2234w" sizes="(min-width: 720px) 720px"></figure><p>Traditional Bronze–Silver–Gold workflows were built for batch systems. 
Tiger Lake enables a continuous flow where enrichment and serving happen in real time.</p><p>This shift transforms an overly complex pipeline into a simpler, dynamic real-time data loop. Context and data move freely between systems. Operational and analytical layers stay connected without redundant jobs or duplicated infrastructure.</p><p>All data remains native, up to date, and queryable with standard SQL. Tiger Lake supports a single write path that powers real-time applications, dashboards, and the lakehouse, using the architecture that best fits the developer. Users can write data to Postgres, then have appropriate data and rollups automatically synced to their lakehouse; conversely, users already feeding raw data into the lakehouse can automatically bring it to Postgres for operational serving. Now, applications can reason across the now and the then, without orchestration code or synchronization overhead.</p><blockquote><em>"We stitched together Kafka, Flink, and custom code to stream data from Postgres to Iceberg. It worked, but it was fragile and high-maintenance," <strong>said Kevin Otten, Director of Technical Architecture at Speedcast.</strong> "Tiger Lake replaces all of that with native infrastructure. It’s the architecture we wish we had from day one."</em></blockquote><h2 id="from-architecture-to-outcomes">From Architecture to Outcomes</h2><p>Tiger Lake enables real-time systems that were previously too complex to operate or too expensive to build.</p><h3 id="customer-facing-dashboards">Customer-facing dashboards</h3><p>Dashboards can now combine live metrics with historical aggregates in a single query. There is no need for dual stacks or stale insights. Tiger Lake supports high-throughput ingestion at production scale, powering pipelines that visualize billions of rows in real time. 
Everything lives in one system, continuously updated and instantly queryable.</p><blockquote><em>"With Tiger Lake, we finally unified our real-time and historical data," <strong>said Maxwell Carritt, Lead IoT Engineer at Pfeifer &amp; Langen.</strong> "Now we seamlessly stream from Tiger Postgres into Iceberg, giving our analysts the power to explore, model, and act on data across S3, Athena, and Tiger Data."</em></blockquote><h3 id="monitoring-systems">Monitoring systems</h3><p>With a single source of truth and a continuous data loop, alerting becomes faster and more reliable. Engineers can run one SQL query to inspect fresh telemetry and historical incidents together, improving triage speed, reducing false positives, and staying focused on what matters.</p><p>Simplifying the data plane also improves system resilience. Tiger Lake lets monitoring systems operate on the same live operational backbone, where Iceberg provides historical depth and Tiger Postgres delivers low-latency access.</p><h3 id="agents">Agents</h3><p>Tiger Lake makes grounding possible without additional infrastructure. Developers can embed recent user activity and long-term interaction history directly inside Postgres. There is no need for orchestration, vector drift management, or custom AI pipelines.</p><p>Imagine a support agent receives a new inquiry. The large body of historical support cases remains in Iceberg, while Tiger Lake automatically creates chunks and <a href="https://www.tigerdata.com/blog/a-beginners-guide-to-vector-embeddings" rel="noreferrer">vector embeddings</a> in Postgres. 
Now vector search against the operational database can answer AI chat questions quickly, while ensuring that embeddings stay fresh and up-to-date without complex orchestration pipelines.&nbsp;&nbsp;</p><p>In doing so, Tiger Lake is also a key building block in what we call Agentic Postgres, a Postgres foundation for intelligent systems that learn, decide, and act.</p><blockquote><em>"With Tiger Lake, we believe Tiger Data is setting a strong foundation for turning Postgres into the operational engine of the open lakehouse for applications,"<strong> said Ken Yoshioka, CTO, Lumia Health.</strong> "It allows us the flexibility to grow our biotech startup quickly with infrastructure designed for both analytics and agentic AI."</em></blockquote><p>Companies like Speedcast, Lumia Health, and Pfeifer &amp; Langen are already building full-context and real-time analytical systems with Tiger Lake. These architectures power industrial telemetry, agentic workflows, and real-time operations, all from a unified, continuously streaming platform.</p><h3 id="coming-soon-round-trip-intelligence">Coming soon: Round-trip intelligence</h3><ul><li><strong>Later this summer:</strong> Query Iceberg catalogs directly from within Postgres. Explore, join, and reason across lakehouse and operational data using SQL.</li><li><strong>Fall 2025: </strong>Full round-trip workflows: ingest into Postgres, enrich in Iceberg and stream results back automatically. This lets developers move from event to analysis to action in one architecture.</li></ul><h3 id="how-to-set-up-tiger-lake">How to set up Tiger Lake</h3><p>Getting started is simple. No complex orchestration or manual integrations:</p><ul><li>Create a bucket for Iceberg-compatible S3 tables.</li><li>Provide ARN permissions to Tiger Cloud.</li><li>Enable table sync in Tiger Postgres:</li></ul><pre><code class="language-SQL">ALTER TABLE my_hypertable SET (
  tigerlake.iceberg_sync = true
);</code></pre><h2 id="the-future-of-data-architecture-is-real-time-contextual-and-open">The Future of Data Architecture Is Real-Time, Contextual, and Open</h2><p>Tiger Lake introduces a new kind of architecture. It is continuous by design, scalable by default, and optimized for applications that need full context and complete data in real time.</p><p>Operational data flows into the lakehouse for enrichment and modeling. Enriched insights flow back into Postgres for low-latency serving. Applications and agents complete the loop, responding with precision and speed.</p><p>We believe this is the foundation for what comes next:</p><ul><li>Systems that unify operational use cases and internal analytics</li><li>Architectures that reduce complexity instead of compounding it</li><li>Workloads that are not just reactive but grounded in understanding</li></ul><p>You should not have to choose between context and simplicity. You should not have to patch together systems that were never designed to work together. And you should not have to replatform to evolve.</p><p>Together with next-generation storage architecture and our Postgres-native AI tooling, Tiger Lake forms the backbone of Agentic Postgres. This is a foundation built for intelligent workloads that learn, simulate, and act. We’ll share more soon.</p><p>Try it today on <a href="https://console.cloud.timescale.com/signup"><u>Tiger Cloud</u></a>, and check out the <a href="https://docs.tigerdata.com/use-timescale/latest/tigerlake/"><u>Tiger Lake docs</u></a> to get started.</p><p>— Mike</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[TimescaleDB to the Rescue - Speeding Up Statistics]]></title>
            <description><![CDATA[In this article, software engineer Kamil Ruczyński shows how TimescaleDB outperforms MySQL for time-series data using continuous aggregates and partitioning.]]></description>
            <link>https://www.tigerdata.com/blog/timescaledb-to-the-rescue-speeding-up-statistics</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/timescaledb-to-the-rescue-speeding-up-statistics</guid>
            <category><![CDATA[Dev Q&A]]></category>
            <category><![CDATA[TimescaleDB]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Kamil Ruczyński]]></dc:creator>
            <pubDate>Fri, 20 Jun 2025 12:00:35 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/06/timescaledb-to-the-rescue.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/06/timescaledb-to-the-rescue.png" alt="TimescaleDB to the Rescue - Speeding Up Statistics" /><p><em>This article, written by </em><a href="https://www.linkedin.com/in/kamilruczynski/" rel="noreferrer"><em>Kamil </em></a><a href="https://www.linkedin.com/in/kamilruczynski/" rel="noreferrer"><em>Ruczyński</em></a><em>, was originally posted&nbsp;on Apr 6, 2025&nbsp;on </em><a href="https://sarvendev.com/posts/timescale-db-to-the-rescue/" rel="noreferrer"><em>his blog</em></a><em>.&nbsp;Kamil is a Staff Software Engineer specializing in backend solutions, with a strong focus on scalability and performance. It's reposted here with permission. </em></p><p>Some time ago, I was working on improving the performance of slow statistics. The problem was that our database contained billions of rows, making data retrieval slow, even for the last seven days. From a product perspective, we needed to display data for at least 30 days and in real time. All the data was stored in MySQL without partitioning, so we had to find a better solution. Simply using a cache was not an option, as real-time data was required.</p><p>Let’s analyze the problem using a contrived example that is similar to the original one.</p><h2 id="mysql-solution">MySQL Solution<a href="https://sarvendev.com/posts/timescale-db-to-the-rescue/#mysql-solution"></a></h2><p>Let’s say that we have a table with the following structure:</p><pre><code class="language-SQL">CREATE TABLE agent_stats (
    id BIGINT AUTO_INCREMENT NOT NULL,
    agent_id INT NOT NULL,
    event_type VARCHAR(255) NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (id)
);

CREATE INDEX idx_agent_stats_created_at_agent_id_event_type ON agent_stats (created_at, agent_id, event_type);</code></pre><p>This table collects statistics about our AI agents. For now, we have two types of events:</p><ul><li>triggered</li><li>response_generated</li></ul><p>To test the performance of the database, we need to generate a lot of data. In MySQL there is no fast way to generate random data, so I used a script, which generated 24 234 964 records. It was quite slow.</p><p>And now, we want to get the number of events of each type for each agent in the last 30 days.</p><pre><code class="language-SQL">SELECT agent_id, event_type, COUNT(*) as count
FROM agent_stats
WHERE created_at &gt; '2025-02-28 00:00:00'
GROUP BY agent_id, event_type</code></pre><p>It’s very slow, and takes 11 seconds.</p><h2 id="what-is-timescaledb">What is TimescaleDB?<a href="https://sarvendev.com/posts/timescale-db-to-the-rescue/#what-is-timescaledb"></a></h2><p>TimescaleDB is an open-source time-series database built on PostgreSQL. It is designed to handle large volumes of time-series data efficiently. Basically it is a PostgreSQL extension that adds time-series capabilities to the database. It’s optimized for insertions of time-series data, and it provides features like automatic partitioning, retention policies, and continuous aggregates.</p><h2 id="timescaledb-solution">TimescaleDB Solution<a href="https://sarvendev.com/posts/timescale-db-to-the-rescue/#timescaledb-solution"></a></h2><p>So let’s try to use TimescaleDB to speed up our statistics. Creating a similar table in TimescaleDB:</p><pre><code class="language-SQL">CREATE TABLE agent_stats(
   created_at TIMESTAMPTZ NOT NULL,
   agent_id BIGINT NOT NULL CHECK (agent_id &gt; 0),
   event_type VARCHAR(255) NOT NULL
);

SELECT create_hypertable('agent_stats', 'created_at');

CREATE INDEX ON agent_stats (created_at DESC, agent_id, event_type);</code></pre><p>The&nbsp;<code>create_hypertable</code>&nbsp;function creates a hypertable, which is a TimescaleDB abstraction over a standard PostgreSQL table that adds automatic partitioning based on time.</p><p>Generating data in TimescaleDB is much more convenient and a lot faster than in MySQL, thanks to PostgreSQL’s <code>generate_series</code>.</p><pre><code class="language-SQL">INSERT INTO agent_stats (created_at, agent_id, event_type)
SELECT
    time,
    random() * 9 + 1, /* 1 - 10 */
    'triggered'
FROM
    generate_series(
        '2024-01-01 00:00:00',
        '2025-03-28 16:00:00',
        INTERVAL '1 second'
    ) AS time;

INSERT INTO agent_stats (created_at, agent_id, event_type)
SELECT
    time,
    random() * 9 + 1, /* 1 - 10 */
    'response_generated'
FROM
    generate_series(
        '2024-01-01 00:00:00',
        '2025-03-28 16:00:00',
        INTERVAL '1 second'
    ) AS time;</code></pre><p>This generates 2 records per second over a range of about 1 year and 3 months. I repeated this a few times, and within a few minutes a total of 184 675 516 records had been generated.</p><p>Now, let’s get the same data as before:</p><pre><code class="language-SQL">SELECT agent_id, event_type, COUNT(*) as count
FROM agent_stats
WHERE created_at &gt; '2025-02-28 00:00:00'
GROUP BY agent_id, event_type</code></pre><p>Keep in mind that we have much more data now, so the query is also slow. It now takes 9 seconds, although on the same amount of data as in MySQL it would be much faster because of the partitioning. Ok, so now we need to speed it up.</p><h2 id="timescaledb-continuous-aggregates">TimescaleDB Continuous Aggregates<a href="https://sarvendev.com/posts/timescale-db-to-the-rescue/#timescaledb-continuous-aggregates"></a></h2><p>Continuous aggregates are a powerful feature of TimescaleDB that allows you to pre-compute and store the results of query aggregations over time. They are based on the concept of materialized views in PostgreSQL, but they can also return data in real time.</p><p>Let’s create a new continuous aggregate for the agent_stats table:</p><pre><code class="language-SQL">CREATE MATERIALIZED VIEW hourly_agent_stats WITH (timescaledb.continuous)
AS
SELECT
    time_bucket('1 hour', created_at) as hour,
    agent_id,
    event_type,
    COUNT(1) AS occurrences
FROM agent_stats
GROUP BY
    hour,
    agent_id,
    event_type
WITH NO DATA
;</code></pre><p>We also need to create a refresh policy to keep the materialized view up to date.</p><pre><code class="language-SQL">SELECT add_continuous_aggregate_policy('hourly_agent_stats',
    start_offset =&gt; INTERVAL '1 day',
    end_offset =&gt; INTERVAL '1 hour',
    schedule_interval =&gt; INTERVAL '1 hour'
);</code></pre><p>Now we need to populate the materialized view with the data we already have in the table.</p><pre><code class="language-SQL">CALL refresh_continuous_aggregate('hourly_agent_stats', '2024-01-01', '2025-03-28');</code></pre><p>This takes some time, and in production it would probably be better to do it incrementally, e.g. one day at a time.</p><p>Here is how it looks in the database:</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/06/timescaledb-records-aggregated-by-hour.png" class="kg-image" alt="" loading="lazy" width="713" height="448" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2025/06/timescaledb-records-aggregated-by-hour.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/06/timescaledb-records-aggregated-by-hour.png 713w"></figure><p>Records in the view are aggregated by hour, so we have one record per hour per agent_id and event_type.</p><p>Now we can modify the query to take advantage of the newly created continuous aggregate:</p><pre><code class="language-SQL">SELECT agent_id, event_type, SUM(occurrences) as occurrences
FROM hourly_agent_stats
WHERE hour &gt; '2025-02-28 00:00:00'
GROUP BY agent_id, event_type</code></pre><p>This query is much faster, and it takes only 0.02 seconds.</p><p>It’s possible to get real-time data from the continuous aggregate: historical data is fetched from the materialized view, while the most recent data that has not been materialized yet (roughly the last hour in our setup, since we use a 1-hour bucket) is fetched from the source table. Depending on your TimescaleDB version, real-time aggregation may be disabled by default and need to be enabled on the view with <code>timescaledb.materialized_only = false</code>.</p><h2 id="timescaledb-retention">TimescaleDB Retention<a href="https://sarvendev.com/posts/timescale-db-to-the-rescue/#timescaledb-retention"></a></h2><p>TimescaleDB also provides a retention policy feature that allows you to automatically drop old data. Let’s say that we need only 6 months of detailed data, and 1 year of aggregated data. We can set up the following retention policies:</p><pre><code class="language-SQL">SELECT add_retention_policy('agent_stats', INTERVAL '6 MONTH');</code></pre><p>This will drop all data older than 6 months from the source table.</p><pre><code class="language-SQL">SELECT add_retention_policy('hourly_agent_stats', INTERVAL '1 YEAR');</code></pre><p>This will drop all data older than 1 year from the continuous aggregate.</p><h2 id="summary">Summary<a href="https://sarvendev.com/posts/timescale-db-to-the-rescue/#summary"></a></h2><p>We could achieve some performance improvement by using partitioning in MySQL, but it would be only a slight improvement and would require additional work. Adding TimescaleDB increases the complexity of the whole system, as it’s a new technology that needs to be maintained, but it’s a great choice for time-series data and, from the point of view of application engineers, it provides a lot of useful features. However, if you’re already using PostgreSQL, adopting TimescaleDB is easier, as you don’t need to learn a new technology. It’s just an extension.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Blocked Bloom Filters: Speeding Up Point Lookups in Tiger Postgres' Native Columnstore]]></title>
            <description><![CDATA[Learn how blocked Bloom filters in TimescaleDB 2.20 deliver up to 100× faster point lookups, speeding up columnar queries without manual tuning.]]></description>
            <link>https://www.tigerdata.com/blog/blocked-bloom-filters-speeding-up-point-lookups-in-tiger-postgres-native-columnstore</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/blocked-bloom-filters-speeding-up-point-lookups-in-tiger-postgres-native-columnstore</guid>
            <category><![CDATA[TimescaleDB]]></category>
            <category><![CDATA[Product & Engineering]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Jacky Liang]]></dc:creator>
            <pubDate>Wed, 18 Jun 2025 12:00:00 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/06/IMG_8672.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/06/IMG_8672.png" alt="Blocked Bloom Filters: Speeding Up Point Lookups in Tiger Postgres' Native Columnstore" /><div class="kg-card kg-callout-card kg-callout-card-yellow"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text">This is the first post in a technical deep dive series that explores the ways we are building the fastest Postgres.</div></div><p>Database storage is a study in locality.</p><ul><li><strong>Row stores</strong> keep all fields of a record together and operate on full rows at a time.</li><li><strong>Column stores</strong> organize each column’s values in compressed blocks, allowing operations to target specific columns.</li></ul><p>This trade-off is structural, not just cosmetic. Row layout is great for fast inserts and lookups. Column layout excels at filters, aggregations, and scans over a large number of rows, but on a smaller number of columns (as long as your query plays by the rules).</p><p>But&nbsp;here’s the catch: <strong>Columnstores are only fast at filtering when your predicate aligns with the physical sort order. </strong>If your data is ordered by time, or clustered by a <code>segmentby</code> (like customer ID or device), the engine can skip large blocks. But if you filter on an unsorted field (like a trace ID, transaction UUID, or error code) there’s often no optimization to exploit. The engine has to decompress every block and scan every value, just in case there is a match.</p><p>TimescaleDB combines both layouts into a hybrid table model, with recent records written to the rowstore and automatically migrated to the columnstore over time. This design closely aligns with real-world query patterns: fresh data is updated frequently, while older data is rarely updated and mostly used in aggregate. 
We’ve already implemented columnar mutability, but one challenge remained: sometimes you need to filter on unsorted fields across terabytes of columnar data.&nbsp;</p><p>And it’s not hypothetical, we’ve seen it in the wild many times. Dashboards hang while users query a UUID, waiting as the engine churns through thousands of compressed blocks. The data is there. The filter is simple. But the system grinds.</p><p>That isn’t acceptable. At Tiger, we’re here to deliver <strong>speed without sacrifice</strong>.</p><p>So we implemented <strong>blocked bloom filters</strong>, and our users already love them:&nbsp;</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/06/Screenshot-2025-06-19-at-8.08.32-pm.png" class="kg-image" alt="" loading="lazy" width="1612" height="216" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2025/06/Screenshot-2025-06-19-at-8.08.32-pm.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2025/06/Screenshot-2025-06-19-at-8.08.32-pm.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2025/06/Screenshot-2025-06-19-at-8.08.32-pm.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/06/Screenshot-2025-06-19-at-8.08.32-pm.png 1612w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Our community member </span><a href="https://github.com/pantonis"><u><span class="underline" style="white-space: pre-wrap;">@pantonis</span></u></a><span style="white-space: pre-wrap;"> saw </span><b><strong style="white-space: pre-wrap;">100× faster lookups</strong></b><span style="white-space: pre-wrap;"> after upgrading to TimescaleDB 2.20</span></figcaption></figure><p>P.S. 
If you're using TimescaleDB 2.20 or later (or Tiger Postgres on <a href="https://www.tigerdata.com/cloud"><u>Tiger Cloud</u></a>), Bloom Filters are already actively optimizing lookups on sparsely distributed UUIDs, enums, and text fields by up to 100x.</p><h2 id="the-challenge-of-point-lookups-in-columnar-storage">The Challenge of Point-Lookups in Columnar Storage</h2><p>If you've worked with large-scale time-series or analytics workloads, you've probably experienced this pain. You're querying 10TB of trace data to find a single ID like '550e8400-e29b-41d4-a716-446655440000'. Your database starts churning through millions of batches, reading and decompressing terabytes of data. Minutes and hours tick by. Your application times out. Users complain because they needed this report an hour ago (everything is urgent).&nbsp;</p><p>Imagine you’re looking for a needle in <strong><em>hundreds of compressed bundles of haystacks</em></strong>—you have to unbundle, loosen, and search through every single bundle because you don't know which one contains your needle.&nbsp;</p><p>This happens because columnar databases store data in sorted batches, compress each column separately, and use ordering metadata to skip irrelevant batches. This works perfectly when you're querying by the same column you sorted by:</p><pre><code class="language-SQL">-- This works well - time-based query on ordered data
SELECT * FROM metrics
  WHERE timestamp BETWEEN '2024-01-01' AND '2024-01-02';</code></pre><p>But completely breaks down for uncorrelated columns:</p><pre><code class="language-SQL">-- This is painful - random ID query on non segmented column
SELECT * FROM metrics 
  WHERE trace_id = '550e8400-e29b-41d4-a716-446655440000';</code></pre><p>The thing with UUIDs (except for UUIDv7, keep an eye out for support coming soon) is that they're completely random, so it doesn't make sense to order by them. When your data is sorted by time, but you're searching by trace ID, every batch now contains a random mix of IDs. So ordering becomes useless, and the database can't skip any batches.&nbsp;</p><h2 id="what-is-a-bloom-filter">What is a Bloom Filter?&nbsp;</h2><p>This is where bloom filters help—they're additional metadata that can efficiently answer “is this value definitely not in this batch?” without actually storing a reference to each value.</p><p>A bloom filter is a small-yet-efficient data structure that uses an array of bits and hash functions to quickly test if something might be in a set or not.&nbsp;</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/06/bloom_filter_basic.CNfSrq8t_ZWI3Wl-1.svg" class="kg-image" alt="" loading="lazy" width="498" height="474"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Bloom filter illustration courtesy of </em></i><a href="https://www.bytedrum.com/posts/bloom-filters/"><u><i><em class="italic underline" style="white-space: pre-wrap;">Bytedrum: Bloom Filters</em></i></u></a><i><em class="italic" style="white-space: pre-wrap;">, a great visual explainer.&nbsp;</em></i></figcaption></figure><p>Bloom filters can say something is "definitely not there", or "might be there", and crucially, they never say "it’s missing" when something is actually there.</p><p>Take Spotify’s playlist feature as an example. When Spotify needs to check if a song is in one of your playlists, instead of scanning through every song in every playlist (which means reading every playlist from storage—stupidly expensive with billions of songs), they use a bloom filter—a compact “summary” that can instantly say 
whether a song is “definitely not in this playlist” or “might be in this playlist”.&nbsp;</p><p>For the “might be” cases, Spotify then uses traditional seek methods to check the actual playlist data.&nbsp;</p>
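<p>As a rough illustration of the idea, here is a toy sketch in plain PostgreSQL. It is not how TimescaleDB (or Spotify) implements bloom filters: the 64-bit filter size, the two salted hashes, and the song names are all made up for the example, and <code>hashtext</code> is PostgreSQL's internal text hash function, used here only for convenience.</p><pre><code class="language-SQL">-- Toy 64-bit bloom filter with two hash functions (illustration only).
WITH playlist(song) AS (
    VALUES ('song-a'), ('song-b'), ('song-c')
),
bloom AS (
    -- Build the filter: each song sets two of the 64 bits.
    SELECT bit_or(
               (1::bigint &lt;&lt; (hashtext(song) &amp; 63))
             | (1::bigint &lt;&lt; (hashtext(song || '#2') &amp; 63))
           ) AS bits
    FROM playlist
),
probe(song) AS (
    VALUES ('song-a'), ('song-z')
)
SELECT p.song,
       CASE
           -- All of the probe's bits are set: the song may be present.
           WHEN (b.bits &amp; m.mask) = m.mask THEN 'might be in the playlist'
           -- At least one bit is missing: the song was never added.
           ELSE 'definitely not in the playlist'
       END AS verdict
FROM probe AS p
CROSS JOIN bloom AS b
CROSS JOIN LATERAL (
    -- Recompute the same two bit positions for the probed song.
    SELECT (1::bigint &lt;&lt; (hashtext(p.song) &amp; 63))
         | (1::bigint &lt;&lt; (hashtext(p.song || '#2') &amp; 63)) AS mask
) AS m;</code></pre><p>A song that was added always comes back as “might be”, because its bits are guaranteed to be set. A song that was never added usually comes back as “definitely not”, but it can collide with bits set by other songs and produce a false positive.</p>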
<!--kg-card-begin: html-->

  <iframe 
    src="https://spotify-bloom-filter.vercel.app" 
    width="100%" 
    height="1000" 
    frameborder="0"
    scrolling="yes"
    loading="lazy"
    style="display: block; border: none; background: #000;">
  </iframe>
<!--kg-card-end: html-->
<p>[Interactive Spotify Bloom Filter demo: <a href="https://spotify-bloom-filter.vercel.app/"><u>https://spotify-bloom-filter.vercel.app/</u></a>]&nbsp;</p><p>This may not sound that useful, but when dealing with massive-scale workloads of millions of playlists and billions of songs, using a bloom filter eliminates 95%+ of linear-time playlist scans, turning minutes of searching into milliseconds.&nbsp;</p><p>Obviously, no data structure is catch-free: a bloom filter may occasionally cause a playlist to be checked unnecessarily, at around a 2% false positive rate (more on this later). But this is an acceptable tradeoff, as you still get massive I/O savings across your entire system.&nbsp;</p><h2 id="how-we-added-bloom-filters-into-the-columnstore">How We Added Bloom Filters into the Columnstore</h2><p>Here's how TimescaleDB solves this: bloom filters act as a quick pre-check. Before reading any batch from disk, TimescaleDB checks a tiny bloom filter in memory that says "this ID is definitely not in this batch" or "this ID might be in this batch." This lets us skip 95%+ of batches instantly.</p><p>For the few batches that might contain your ID, TimescaleDB reads them from disk and processes them efficiently using vectorized operations (SIMD) that check many rows at once—much faster than Postgres's traditional row-by-row approach. 
But the real win is avoiding the I/O in the first place.</p><h3 id="no-manual-configuration-needed">No manual configuration needed</h3><p>When building with TimescaleDB, you <strong><em>don’t need to worry</em></strong> about when to use bloom filters or min/max indexes, because we automatically choose for you based on your column types!&nbsp;</p><p>For columns that you use in your table ordering (like timestamps and numbers used in range queries), we stick with the min/max method because we know that scans will be in order.</p><p>For random things like text fields, UUIDs, enum types (or basically anything else that supports Postgres hash indexes) we will create bloom filters automatically (as long as you have the column indexed in the rowstore with a btree, hash or brin index).&nbsp;</p><h3 id="diving-deeper%E2%80%9Cblocked-bloom-filters%E2%80%9D">Diving deeper - “blocked bloom filters”</h3><p>TimescaleDB uses a technique called a "blocked bloom filter", where each bloom filter starts at about 16 Kbit per batch (sized for up to 1,000 items) with a 2% false positive rate and uses 6 different hash functions per value.&nbsp;</p><p>The 16 Kbit size isn't random: it's derived from the target item count and false positive rate. 
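As a rough check with the textbook formulas for a plain (unblocked) bloom filter, which are an assumption here rather than something taken from the TimescaleDB source: for n = 1,000 items and a target false positive rate p = 0.02, the optimal size is m = −n · ln(p) / (ln 2)<sup>2</sup> ≈ 1,000 × 3.912 / 0.4805 ≈ 8,142 bits, and the optimal number of hash functions is k = (m / n) · ln 2 ≈ 5.6, which rounds to 6. 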
Here’s a <a href="https://hur.st/bloomfilter/"><u>handy calculator</u></a> you can try out yourself.&nbsp;</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/06/SCR-20250611-otwm-2-2.png" class="kg-image" alt="" loading="lazy" width="882" height="1004" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2025/06/SCR-20250611-otwm-2-2.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/06/SCR-20250611-otwm-2-2.png 882w" sizes="(min-width: 720px) 720px"></figure><p>For 1,000 items with a 2% false positive rate, the optimal formula gives us ~8K bits, but we round up to ~16k bits to enable our folding compression trick (we will get to this below!). This sizing ensures we get exactly the false positive rate we want while keeping the filters small enough to stay fast in memory.</p><p>The "blocked" part is a performance technique—instead of spreading hash bits all over a huge array, TimescaleDB keeps all the bits for one value within a 256-bit block. This fits nicely in your CPU cache and makes everything faster.&nbsp;</p><p>For hashing, TimescaleDB primarily uses a modern library called <a href="https://github.com/backtrace-labs/umash"><u>UMASH</u></a> that's faster than Postgres's built-in hashing, but falls back to the Postgres version for custom data types or older processors. 
There is a funny story here on interoperability that I’ll share on socials!</p><h3 id="achieving-250x-space-savings">Achieving 250x space savings&nbsp;</h3><p>Here’s where we squeezed additional space and performance benefits out of bloom filters: when a batch doesn’t have many unique values (for example [0, 0, 0, 0, 1, …, 0, 0]), TimescaleDB can compress the bloom filter by “folding” it in half using <a href="https://en.wikipedia.org/wiki/Bitwise_operation#OR"><u>bitwise OR</u></a> operations.&nbsp;</p><p>It can keep folding until the filter shrinks from 16 Kbit down to just 64 bits (8 bytes) for low-cardinality columns, that is, columns with few unique values. That is 16,384 / 64 = 256, hence the roughly 250x savings. We can do this because most of the bits in the filter are zero, so folding concentrates the few set bits without significantly increasing false positives.&nbsp;</p><h2 id="query-walkthrough">Query Walkthrough</h2><p>Here’s an example to explain how queries work step-by-step. Let’s use the following trace ID search query:&nbsp;</p><pre><code class="language-SQL">SELECT * FROM metrics 
   WHERE trace_id = 'abc123'</code></pre><ol><li>TimescaleDB first checks the bloom filters (which are likely to be cached in memory using the traditional Postgres buffer manager) for every batch in your columnstore. </li><li>For each batch, the bloom filter either says “definitely not here” or “might be here”. The database immediately skips all the “definitely not here” batches. </li><li>For the “might be here” batches, TimescaleDB reads them from disk, decompresses them, and scans the actual data (the expensive part). If it was a false positive (that 2% chance we mentioned earlier), no match is found, and the query just continues running normally.&nbsp;</li></ol><p>The key point is that false positives are acceptable: each one costs an unnecessary batch read, but we still avoid massive amounts of unnecessary I/O by not having to go through every batch from the get-go.&nbsp;</p><p>A side benefit to our implementation of bloom filters is that the bloom filter metadata is more likely to stay hot in memory. When there are concurrent workloads where different users are querying different parts of your dataset, every query can quickly eliminate most batches without touching slower, more expensive storage.&nbsp;</p><h2 id="where-bloom-filters-excel">Where Bloom Filters Excel</h2><p>Bloom filters excel at large time-series datasets queried by non-temporal identifiers.</p><p>Think about scenarios where you're storing massive amounts of data over time, but you need to find specific records using IDs, addresses, or other fixed identifiers.</p><p><strong>For financial services teams.</strong> Your customers are trying to resolve a failed payment. They enter a transaction reference number into your search. Nothing loads. Your backend query scans years of data just to return a single match, and the user waits ...</p><pre><code class="language-SQL">-- Finding financial transactions by reference
SELECT * FROM payments 
  WHERE transaction_ref = 'TXN-2024-001234';</code></pre><p><strong>For IoT platform teams.</strong> Your dashboard shows live sensor data, but one widget is blank. It’s querying a single sensor reading by ID, and your backend is scanning billions of rows to find it. Users refresh the page. Nothing. The spinner keeps spinning.</p><pre><code class="language-SQL">-- Device-specific IoT data queries
SELECT * FROM sensor_data 
  WHERE reading_id = '73e98d71-5eb7-4018-ace7-1f4490da654a';</code></pre><p><strong>For teams working on blockchain analytics. </strong>Your analytics engine scans millions of blocks to find transactions from a wallet address. API timeouts leave users thinking your service is broken.</p><pre><code class="language-SQL">-- Looking up blockchain transactions by wallet address&nbsp;&nbsp;
SELECT * FROM transactions 
  WHERE from_address = '0x742d35Cc6634C0532925a3b8D';</code></pre><p>Bloom filters turn these queries from minutes or hours into milliseconds by eliminating the need to decompress and scan billions of rows. Instead of checking every batch in your columnstore, you skip 95% or more of them and only decompress the ones that might actually contain your data.</p><h2 id="performance-numbers-find-specific-values-up-to-100x-faster">Performance Numbers: Find Specific Values Up to 100x Faster&nbsp;</h2><p>Instead of telling you why our bloom filter implementation is great, let's just show you some numbers we got after we ran our benchmarks.&nbsp;</p><pre><code>SELECT min(sent), max(sent), count(*)
  FROM hackers 
  WHERE subject = 'unsubscribe' 
  ORDER BY count(*) DESC 
  LIMIT 10;
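
-- Sketch, separate from the timed query above: to confirm that a bloom filter
-- is being applied, inspect the plan of the same lookup and look for
-- _timescaledb_functions.bloom1_contains in the output.
EXPLAIN (ANALYZE, BUFFERS)
SELECT min(sent), max(sent), count(*)
  FROM hackers
  WHERE subject = 'unsubscribe';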
</code></pre><p>Before bloom filters, this query took 12ms. With bloom filters, it dropped to 2.7ms, which is roughly <strong>4.4x faster</strong>.</p><p>Or consider this blockchain address lookup:</p><pre><code>SELECT * FROM token_transfers 
  WHERE to_address = '0xe23d4eb73b399250301fb024019a734ba9f0d9b5';
</code></pre><p>This one went from 1.065 seconds down to 171.134 ms—a <strong>6x improvement</strong>.</p><p>And let's not forget the report from our user pantonis, who saw a massive <strong>100x improvement</strong>:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/06/Screenshot-2025-06-19-at-8.08.32-pm.png" class="kg-image" alt="" loading="lazy" width="1612" height="216" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2025/06/Screenshot-2025-06-19-at-8.08.32-pm.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2025/06/Screenshot-2025-06-19-at-8.08.32-pm.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2025/06/Screenshot-2025-06-19-at-8.08.32-pm.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/06/Screenshot-2025-06-19-at-8.08.32-pm.png 1612w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Our community member </span><a href="https://github.com/pantonis"><u><span class="underline" style="white-space: pre-wrap;">@pantonis</span></u></a><span style="white-space: pre-wrap;"> saw </span><b><strong style="white-space: pre-wrap;">100× faster lookups</strong></b><span style="white-space: pre-wrap;"> after upgrading to TimescaleDB 2.20</span></figcaption></figure><h2 id="where-bloom-filters-don%E2%80%99t-work">Where Bloom Filters Don’t Work&nbsp;</h2><p>Like all data structures, bloom filters have strengths and limitations.</p><p>Bloom filters work great when you're looking for exact matches—queries that use the equals sign (=) to find specific values. They also work with standard string comparisons where the rules are consistent.&nbsp;</p><pre><code>-- These work great with bloom filters

SELECT * FROM traces WHERE trace_id = 'abc-123-def';

SELECT * FROM orders WHERE email = 'user@example.com';

SELECT * FROM transactions WHERE status = 'completed';</code></pre><p>However, bloom filters have fundamental limitations and some current implementation restrictions. By design, they can't handle "not equal" searches (<code>&lt;&gt;</code>) or range queries (<code>&lt;</code> or <code>&gt;</code>) because they only test set membership.</p><pre><code>-- These don't work with bloom filters (fundamental limitations)
SELECT * FROM traces WHERE trace_id &lt;&gt; 'abc-123-def';

SELECT * FROM users WHERE created_at &gt; '2024-01-01';

SELECT * FROM transactions WHERE amount BETWEEN 100 AND 500;</code></pre><p>Current implementation restrictions in TimescaleDB mean they also can't help with multiple value searches (like WHERE column IN (1, 2, 3)) or cross-type comparisons without explicit casting. These may change in future versions.</p><pre><code>-- These don't work yet (implementation restrictions)
SELECT * FROM traces WHERE trace_id IN 
  ('abc-123', 'def-456', 'ghi-789');

SELECT * FROM users WHERE user_id = 12345;&nbsp; -- int8 = int4 comparison

SELECT * FROM posts WHERE category = ANY(ARRAY['tech', 'science']);</code></pre><p><strong>Also, bloom filters don't help much when the value you're looking for exists in most batches.&nbsp;</strong></p><p>For example, if you're searching for a common status like <code>active</code> that appears in every batch, the bloom filter will report a potential positive for every batch, forcing TimescaleDB to decompress and check them all anyway. The bloom filter can't skip anything, so you don't get any savings.&nbsp;</p><h2 id="speed-without-sacrifice">Speed without Sacrifice</h2><p>Bloom filters in TimescaleDB are a perfect example of "it just works" optimization and our commitment to making life easier for developers working at massive data scales.</p><p>The bloom data structure automatically kicks in for the right data types and query patterns, dramatically improving performance for needle-in-haystack queries without any configuration required. It works out of the box in Tiger Cloud—no setup required other than having an index on your rowstore columns.</p><p>You can verify bloom filters are working by looking for <code>_timescaledb_functions.bloom1_contains</code> in your query execution plans. The storage overhead is minimal, typically a few hundred bytes per batch, with a maximum of 1KB. For a table with a million batches, you're looking at roughly 100MB to 1GB of bloom filter metadata. That's <strong>0.01% storage overhead</strong> for massive query speedups.</p><p>Built by developers, for developers. TimescaleDB refuses to accept the traditional trade-offs of database storage. 
We give you the speed of columnar analytics with the flexibility of point lookups, all in the same system.</p><p>Try it now on <a href="https://www.tigerdata.com/cloud"><u>Tiger Cloud</u></a>.&nbsp;</p><h3 id="additional-reading">Additional reading&nbsp;</h3><ol><li><a href="https://www.timescale.com/blog/speed-without-sacrifice-2500x-faster-distinct-queries-10x-faster-upserts-bloom-filters-timescaledb-2-20"><u>Speed Without Sacrifice: 2500x Faster Distinct Queries, 10x Faster Upserts, Bloom Filters and More in TimescaleDB 2.20</u></a>&nbsp;</li><li><a href="https://www.bytedrum.com/posts/bloom-filters/"><u>Bloom Filters: The Unsung Heroes of Computer Science</u></a></li><li><a href="https://hur.st/bloomfilter/"><u>Bloom Filter Calculator</u></a></li></ol>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Speed Without Sacrifice: Building the Modern PostgreSQL for the Analytical and Agentic Era]]></title>
            <description><![CDATA[Cofounders of Tiger Data (creators of TimescaleDB) Ajay Kulkarni and Mike Freedman discuss the company’s new name, showing how it reflects the company’s evolution.]]></description>
            <link>https://www.tigerdata.com/blog/timescale-becomes-tigerdata</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/timescale-becomes-tigerdata</guid>
            <category><![CDATA[Announcements & Releases]]></category>
            <category><![CDATA[General]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Tiger Data]]></category>
            <dc:creator><![CDATA[Ajay Kulkarni]]></dc:creator>
            <pubDate>Tue, 17 Jun 2025 14:14:45 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/06/tiger-data-hero-2.gif">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/06/tiger-data-hero-2.gif" alt="Speed Without Sacrifice: Building the Modern PostgreSQL for the Analytical and Agentic Era" /><p><em>Timescale is now Tiger Data.</em></p><p><strong>TL;DR: Eight years ago, we launched Timescale to bring time-series to PostgreSQL. Our mission was simple: help developers building time-series applications.</strong></p><p><strong>Since then, we have built a thriving business: 2,000 customers, mid 8-digit ARR (&gt;100% growth year over year), $180 million raised from top investors.&nbsp;</strong></p><p><strong>We serve companies who are building real-time analytical products and large-scale AI workloads like: Mistral, HuggingFace, Nvidia, Toyota, Tesla, NASA, JP Morgan Chase, Schneider Electric, Palo Alto Networks, and Caterpillar. These are companies building developer tools, industrial dashboards, crypto exchanges, AI-native games, financial RAG applications, and more.&nbsp;</strong></p><p><strong>We’ve quietly evolved from a time-series database into the modern PostgreSQL for today’s and tomorrow’s computing, built for performance, scale, and the agentic future. So we’re changing our name: from Timescale to Tiger Data. Not to change who we are, but to reflect who we’ve become. Tiger Data is bold, fast, and built to power the next era of software.</strong></p><h2 id="developers-thought-we-were-crazy">Developers Thought We Were Crazy</h2><p>When we started 8 years ago, SQL databases were “old fashioned.” NoSQL was the future. Hadoop, MongoDB, Cassandra, InfluxDB – these were the new, exciting NoSQL databases. PostgreSQL was old and boring.</p><p>That’s when we launched Timescale: a time-series database on PostgreSQL. Developers thought we were crazy. PostgreSQL didn’t scale. PostgreSQL wasn’t fast. Time-series needed a NoSQL database. 
Or so they said.</p><p><em>“While I appreciate PostgreSQL every day, am I the only one who thinks this is a rather bad idea?” – top HackerNews comment on our launch (</em><a href="https://news.ycombinator.com/item?id=14035416"><em><u>link</u></em></a><em>)</em></p><p>But we believed in PostgreSQL. We knew that boring could be awesome, especially with databases. And frankly, we were selfish: PostgreSQL was the only database that we wanted to use.</p><p><strong>Today, PostgreSQL has won.</strong>&nbsp;</p><p>There are no more “SQL vs. NoSQL” debates. MongoDB, Cassandra, InfluxDB, and other NoSQL databases are seen as technical dead ends. Snowflake and Databricks are acquiring PostgreSQL companies. No one talks about Hadoop. The Lakehouse has won.&nbsp;</p><p><strong>Today, agentic workloads are here.&nbsp;</strong></p><p>Agents need a fast database. We see this in our customer base: private equity firms and hedge funds using agents to help understand market movements (“How did the market respond to Apple WWDC 2025?”); industrial equipment manufacturers building chat interfaces on top of internal manuals to help field technicians; developer platforms storing agentic interactions into history tables for greater transparency and trust; and so on.</p><h2 id="what-started-as-a-heretical-idea-is-now-a-thriving-business">What Started as a Heretical Idea Is Now a Thriving Business&nbsp;</h2><p>We have also changed. We met in September 1997, during our first week at MIT. 
We soon became friends, roommates, even marathon training partners (Boston 1998).</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/06/founder-image.png" class="kg-image" alt="Tiger Data (creators of TimescaleDB) cofounders" loading="lazy" width="1790" height="435" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2025/06/founder-image.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2025/06/founder-image.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2025/06/founder-image.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/06/founder-image.png 1790w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">While our hairlines and drinks (turmeric shots!) 
have changed, our enthusiasm has not</em></i></figcaption></figure><p>That friendship became the foundation for an entrepreneurial journey that has surpassed even our boldest imaginations.&nbsp;</p><p>What started as a heretical idea is now a thriving business:</p><ul><li>2,000 customers</li><li>Mid 8-digit ARR, growing &gt;100% y/y</li><li>200 people in 25 countries</li><li>$180 million raised from top investors</li><li>60%+ gross margins</li></ul><p>Cloud usage is up 5x in the last 18 months, based on paid customers alone.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/06/2025-cloud-growth-dark-mode.png" class="kg-image" alt="Cloud usage is up 5x in the last 18 months" loading="lazy" width="2000" height="1252" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2025/06/2025-cloud-growth-dark-mode.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2025/06/2025-cloud-growth-dark-mode.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2025/06/2025-cloud-growth-dark-mode.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w2400/2025/06/2025-cloud-growth-dark-mode.png 2400w" sizes="(min-width: 720px) 720px"></figure><p>And that’s only the paid side of the story. Our open-source community is 10x-20x larger. (Based on telemetry, it’s 10x, but we estimate that at least half of all deployments have telemetry turned off.)</p><p>TimescaleDB is everywhere. It’s included in PostgreSQL offerings around the world: from Azure, Alibaba, and Huawei to Supabase, DigitalOcean, and Fly.io. 
You’ll also find it on Databricks Neon, Snowflake Crunchy Bridge, OVHCloud, Render, Vultr, Linode, Aiven, and more.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/06/2025-community-cloud-dark-mode.png" class="kg-image" alt="Community 10-20x" loading="lazy" width="2000" height="1298" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2025/06/2025-community-cloud-dark-mode.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2025/06/2025-community-cloud-dark-mode.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2025/06/2025-community-cloud-dark-mode.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w2400/2025/06/2025-community-cloud-dark-mode.png 2400w" sizes="(min-width: 720px) 720px"></figure><h2 id="we-are-tiger-data">We Are Tiger Data</h2><p>Today, we are more than a time-series database. We are powering developer tools, SaaS applications, AI-native games, financial RAG applications, and more. The majority of workloads on our Cloud product aren’t time-series. Companies are running entire applications on us. CTOs would say to us, <em>“You keep talking about how you are the best time-series database, but I see you as the best PostgreSQL.”</em>&nbsp;</p><p><strong>So we are now “Tiger Data.”</strong> We offer the fastest PostgreSQL. Speed without sacrifice.</p><p>Our cloud offering is “Tiger Cloud.” Our logo stays the same: the tiger, looking forward, focused and fast. Some things do not change. Our open source time-series <a href="https://www.tigerdata.com/blog/top-8-postgresql-extensions" rel="noreferrer">PostgreSQL extension</a> remains TimescaleDB. Our vector extension is still pgvectorscale.&nbsp;</p><p><strong>Why “Tiger”? 
</strong>The tiger has been our mascot since 2017, symbolizing the speed, power, and precision we strive for in our database. Over time, it’s become a core part of our culture: from weekly “Tiger Time” All Hands and monthly “State of the Tiger” business reviews, to welcoming new teammates as “tiger cubs” to the “jungle.” As we reflected on our products, performance, and community, we realized: we aren’t just Timescale. We’re Tiger. Today, we’re making that official.</p><p><strong>This is not a reinvention: it’s a reflection of how we already serve our customers today.</strong></p><p><strong>Polymarket</strong> uses Tiger Data to track their price history. During the last election, when trade volumes were extra high, Polymarket ramped up 4x to power over $3.7 billion worth of trades.</p><p><strong>Linktree</strong> uses Tiger Data for their premium analytics product, saving $17K per month on 12.6 TB through compression. They also compressed their time to launch, going from 2 weeks to 2 days for shipping analytical features.</p><p><strong>Titan America</strong> uses Tiger Data’s compression and continuous aggregates to reduce costs and increase visibility into their facilities for manufacturing cement, ready-mixed concrete, and related materials.&nbsp;</p><p><strong>Lucid Motors</strong> uses Tiger Data for real-time telemetry and autonomous driving analytics.&nbsp;</p><p><strong>The Financial Times </strong>runs time-sensitive analytics and <a href="https://www.tigerdata.com/learn/vector-search-vs-semantic-search" rel="noreferrer">semantic search</a>.&nbsp;</p><h2 id="tiger-is-the-fastest-postgres-for-modern-workloads">Tiger Is the Fastest Postgres for Modern Workloads</h2><p>We are building the fastest Postgres: purpose-built for the modern operational workloads where traditional <a href="https://www.tigerdata.com/learn/understanding-oltp" rel="noreferrer">OLTP</a> databases break down.&nbsp;</p><p>Operational workloads that go far beyond simple 
transactions are now the norm. They require real-time, user-facing analytics over massive <a href="https://www.tigerdata.com/learn/how-to-handle-high-cardinality-data-in-postgresql" rel="noreferrer">high-cardinality datasets</a>, from event streams to time-series to user-level behavioral data.&nbsp;</p><p>As the frontier moves further with agentic applications, the demands grow even more. These systems don’t just read and write: they observe, decide, and act. These AI applications require fast vector search across embeddings, and fast branching of data environments for experimentation and context-sensitive responses.</p><p><strong>Tiger is not a fork. It’s not a wrapper. It is PostgreSQL, extended with innovations in the database engine and cloud infrastructure to deliver speed without sacrifice.</strong></p><p><strong>How are we so fast?</strong> Because of consistent, disciplined engineering efforts to serve customer needs over several years. Here is a non-exhaustive list:&nbsp;</p><ul><li>Hypertables (2017)</li><li>Native <a href="https://www.tigerdata.com/blog/building-columnar-compression-in-a-row-oriented-database" rel="noreferrer">columnar</a> compression (2019)</li><li>Real-time materialized views for faster queries (2020)</li><li>Decoupled compute and storage (2021)</li><li>Tiered Storage to S3 Parquet (2022)</li><li>Vectorized query execution for fast analytics (2023)</li><li>Hybrid row-columnar store for faster queries on recent and historical data (2024)</li><li>Faster vector workloads on PostgreSQL via pgvectorscale (2024)</li><li>300x faster mutations (updates, upserts, deletes) to compressed columnar data (2024)</li><li>2500x faster distinct queries, 6x faster point queries on high-cardinality columns (2025)</li><li>Rapid horizontal scaling with load-balanced read replica sets (2025)</li><li>Enhanced high-performance storage up to 64 TB and 32,000 IOPS (2025)</li></ul><p><strong>Tiger brings together the familiarity and reliability of Postgres with 
the performance of purpose-built engines.</strong></p><p>We built the fastest PostgreSQL. Not because we wanted to, but because our customers wanted us to.</p><h2 id="building-the-modern-postgresql-for-the-analytical-and-agentic-era">Building the Modern PostgreSQL for the Analytical and Agentic Era</h2><p>PostgreSQL has won. The Lakehouse has won. Every application is becoming an analytical application. Agents are here, in production, and need to be fast. The future is hybrid: developers and agents, with more demanding latency and throughput needs.</p><p>In this era, modern applications must:</p><ul><li>Handle terabytes and petabytes of data</li><li>Support real-time analytics</li><li>Integrate Gen AI features</li><li>Serve both humans and software agents, across dev, test, and production lifecycles</li><li>Meet sub-second latency and high concurrency expectations</li><li>Scale across operational databases and cost-efficient lakehouses</li><li>Maintain transactional integrity</li><li>Deliver all of this reliably and cost-effectively, because data volumes grow much faster than budgets</li></ul><p>Our history to date, our time in this market, and our lived experience watching all these changes unfold in real time scream one thing to us: <strong>modern applications need a new kind of operational database.</strong>&nbsp;</p><p>One built for transactional, analytical, and agentic workloads. One that also acts as the operational serving layer for the Lakehouse. One built on Postgres.</p><p>That is what we are building.</p><p>And wow, do we have some fun product announcements queued up for the upcoming weeks and months. A more agentic PostgreSQL. A deeper integration with the Lakehouse via Iceberg. A new compressed insert approach yielding 10 million rows per second. A new type of disaggregated storage architecture with zero-copy instant forks and replicas that we are deploying in our cloud for greater performance, as a replacement for EBS. 
And more.</p><p>We can’t wait to show it all to you. But first we had to clearly communicate who we are. <strong>We are Tiger Data.&nbsp;</strong></p><h2 id="come-join-us">Come Join Us</h2><p><strong>Tiger is the Fastest PostgreSQL. </strong>The operational database platform built for transactional, analytical, and agentic workloads. The only database platform that provides Speed without Sacrifice.</p><p>This is not a rebrand, but a recommitment to our customers, to our developers, and to our core mission.</p><p>If this mission resonates with you, come join us. Give us product feedback. Spread the word. Wear the swag. Join the team.&nbsp;</p><p>It’s Go Time. 🐯🚀</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Postgres That Scales With You: Read Replica Sets and Enhanced Storage]]></title>
            <description><![CDATA[Scale your database without stress with Timescale’s new read replica sets and enhanced storage.]]></description>
            <link>https://www.tigerdata.com/blog/postgres-that-scales-with-you-read-replica-sets-and-enhanced-storage</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/postgres-that-scales-with-you-read-replica-sets-and-enhanced-storage</guid>
            <category><![CDATA[Announcements & Releases]]></category>
            <category><![CDATA[TimescaleDB]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Rahil Sondhi]]></dc:creator>
            <pubDate>Tue, 03 Jun 2025 12:59:14 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/05/2025-may-29-replicas-thumbnail.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/05/2025-may-29-replicas-thumbnail.png" alt="Postgres That Scales With You: Read Replica Sets and Enhanced Storage" /><p>Yesterday, we covered how TimescaleDB 2.20 continues to deliver on our vision of building the fastest Postgres. Continuing on that theme, today we explore how read replicas and enhanced storage enable scaling for modern applications.</p><p>Traditional database solutions often struggle in key scenarios:</p><ul><li><strong>Financial services</strong>: Millisecond delays in trading or payments due to conflicting analytical and transactional loads.</li><li><strong>IoT and telemetry</strong>: Massive sensor data overwhelms traditional setups, causing manual partitioning headaches and delayed insights.</li><li><strong>E-commerce</strong>: Traffic spikes during peak events slow performance, impacting user experience and revenue.</li></ul><p>At Timescale, we believe developers shouldn't have to choose between performance and simplicity. Your database should scale seamlessly with your needs.</p><p>That's why we're launching two new features:</p><ul><li>Rapid scaling with read replica sets.</li><li>Enhanced storage capabilities up to 64 TB and 32,000 IOPS per replica.</li></ul><p>With these enhancements, <a href="https://www.timescale.com/cloud" rel="noreferrer">Timescale Cloud</a> empowers users to scale horizontally and vertically, delivering exceptional performance, throughput, and operational simplicity.</p><h2 id="read-replica-sets-reliable-horizontal-scaling">Read Replica Sets: Reliable Horizontal Scaling</h2><p>Modern applications are read-heavy, but scaling reads typically means dealing with manual setups, custom routing logic, and constant tuning.</p><p><strong>Read replica sets</strong> simplify this. 
Point your application to a single, load-balanced endpoint and let the system automatically distribute read queries across multiple replicas—no app changes required. Using PostgreSQL’s native streaming replication, replicas stay continuously updated without slowing down your primary node.</p><h4 id="benefits-for-developers">Benefits for developers:</h4><ul><li><strong>Automatic horizontal scaling</strong> of read traffic</li><li><strong>Zero custom logic or orchestration</strong> required</li><li><strong>Improved concurrency and performance</strong> under growing loads</li></ul><h2 id="how-to-use-read-replica-sets-an-example">How to Use Read Replica Sets: An Example</h2><p>Read replica sets offer a powerful, flexible solution for managing read-heavy workloads. You can create multiple sets, each tailored to a specific need. For instance, dedicate one set with several replicas to handle high-volume read traffic from your production application, ensuring responsiveness.&nbsp;</p><p>Create a separate set for your internal analytics team to isolate their ad-hoc queries and dashboarding activities, preventing impact on production performance.</p><p>By directing read traffic to the replica set and write traffic to the primary node, you achieve a clear separation of concerns. This architecture enhances scalability. As your read load fluctuates, you can easily add or remove replica nodes within a set via the Timescale console, adjusting capacity without disrupting your application or requiring manual changes. 
This consolidation behind a single, load-balanced endpoint simplifies your architecture while scaling read throughput.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/05/2025-may-29-replicas-diagram.png" class="kg-image" alt="Read replicas diagram - horizontal scaling for efficient query routing" loading="lazy" width="2000" height="936" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2025/05/2025-may-29-replicas-diagram.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2025/05/2025-may-29-replicas-diagram.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2025/05/2025-may-29-replicas-diagram.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/05/2025-may-29-replicas-diagram.png 2165w" sizes="(min-width: 720px) 720px"></figure><p>Existing read replicas have been automatically upgraded to read replica sets with one node. Price remains the same.</p><h2 id="enhanced-storage-64-tb-storage-32000-iops">Enhanced Storage: 64&nbsp;TB Storage &amp; 32,000 IOPS&nbsp;</h2><p>But there’s more. For customers with growing datasets and demanding analytical workloads, the previous 16&nbsp;TB storage ceiling was becoming a constraint. That’s changing.</p><p>We're launching a new storage type called <strong>enhanced storage</strong>, powered by AWS <strong>EBS io2</strong> volumes. 
This new type increases both storage capacity and throughput, making it ideal for customers with mission-critical workloads.</p><p>You can now scale your Timescale Cloud service:</p><ul><li>Up to <strong>64&nbsp;TB of storage</strong> per database service</li><li>Up to <strong>32,000 IOPS</strong>, enabling high-throughput ingest and low-latency analytics</li></ul><p>You can switch to enhanced storage in the console without any downtime. This new storage type will initially be available to Enterprise customers and is part of our broader investment in serving high-scale, mission-critical applications.</p><p>This is a major step forward for demanding workloads with high ingestion rates, large query volumes, and complex analytics, such as financial data pipelines, IoT platforms, and telemetry processing.</p><h2 id="built-for-the-future-of-your-application">Built for the Future of Your Application</h2><p>These two new features are part of our broader vision: to make Timescale Cloud the most powerful and developer-friendly database for demanding applications.</p><p>With <strong>read replica sets</strong> and <strong>enhanced storage</strong>, Timescale Cloud delivers the speed and scalability needed for the most demanding workloads, while preserving the simplicity and power of Postgres.</p><h2 id="how-to-get-started">How to Get Started</h2><ul><li>Want to try <strong>read replica sets</strong>? Scale and Enterprise customers can set up read replica sets by <a href="https://console.cloud.timescale.com/login"><u>logging in to the Timescale console</u></a> &gt; Operations &gt; Read Scaling. Review our <a href="https://docs.timescale.com/use-timescale/latest/ha-replicas/read-scaling/"><u>read replica sets documentation</u></a> for implementation details.</li><li>Need to break past the 16 TB barrier or boost IOPS up to 32,000? 
Enterprise users can switch to <strong>enhanced storage</strong> by <a href="https://console.cloud.timescale.com/login"><u>logging in to the Timescale console</u></a> &gt; Operations &gt; Compute &amp; Storage.</li></ul><p>Next up: database observability simplified with Timescale Cloud.&nbsp;</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to Build a Secure, Authorized Chatbot Using Oso and Timescale]]></title>
            <description><![CDATA[A recap of the Timescale and Oso webinar on how to build a secure, authorized LLM chatbot using Oso and Timescale Vector.]]></description>
            <link>https://www.tigerdata.com/blog/how-to-build-a-secure-authorized-chatbot-using-oso-and-timescale</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/how-to-build-a-secure-authorized-chatbot-using-oso-and-timescale</guid>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Jacky Liang]]></dc:creator>
            <pubDate>Tue, 13 May 2025 13:19:26 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/05/2025-may-12-oso-and-timescale-oso-blog-thumbnail.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/05/2025-may-12-oso-and-timescale-oso-blog-thumbnail.png" alt="How to Build a Secure, Authorized Chatbot Using Oso and Timescale" /><p>The rush to integrate large language models (LLMs) into production apps has exposed a common failure mode: without proper authorization in place, they can easily expose sensitive data to the wrong users. Combine that with complex infrastructure (vector databases, sync pipelines, separate stores for embeddings and metadata), and you’re shipping a fragile system that puts user data at risk.</p><p>At <a href="https://www.timescale.com/"><u>Timescale</u></a> and <a href="https://www.osohq.com/"><u>Oso</u></a>, we think there’s a better way.</p><p>In this webinar, we show how you can build a secure, scalable AI chatbot using Postgres—and <em>only</em> Postgres—by leveraging Timescale’s <a href="https://github.com/timescale/pgai"><u>pgai library</u></a> and Oso’s <a href="https://www.osohq.com/cloud/authorization-service"><u>authorization platform as a service</u></a>.</p><figure class="kg-card kg-embed-card"><iframe width="200" height="113" src="https://www.youtube.com/embed/5GFhqVOM8UE?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" title="How to build a secure, authorized chatbot using Oso and Timescale"></iframe></figure><p>Here are the webinar highlights, summarized for you in chapters for easy reference.</p><p>(To deploy our sample app, an authorized, secure chatbot built with Oso and pgai, see this <a href="https://github.com/jackyliang/timescale-oso-rag-chatbot"><u>open-source code</u></a>.)</p><h2 id="why-most-ai-chatbot-demos-fail-in-production">Why Most AI Chatbot Demos Fail in Production</h2><p>[08:30–11:50]</p><p><strong>Why do simple 
chatbots break in production?</strong> Demo chatbots are easy: embed your docs, slap on an OpenAI API key, and you’re done.</p><p>But in a real business environment, Bob (the employee) should never see Alice’s harsh performance review feedback. Only Alice, her manager, and HR should. Sales shouldn’t see engineering tickets.&nbsp;</p><p>Without authorization boundaries, your chatbot becomes a data leak waiting to happen.</p><p>Many demos fall short because they:</p><ul><li>Expose <em>all</em> content to <em>all</em> users</li><li>Ignore org-specific permissions (e.g., team-level access control)</li><li>Assume static or role-based authorization models</li><li>Rely on dual data systems (e.g., Postgres + Vector DB), causing data synchronization difficulties</li></ul><p>The fix? Build with authorization and data consistency as first principles.</p><h2 id="why-we-combined-postgres-pgvector-and-oso">Why We Combined Postgres, pgvector, and Oso</h2><p>[13:34–17:47]</p><p>We introduced an end-to-end reference stack that solves both the <strong>data synchronization</strong> and <strong>authorization complexity</strong> problems. The solution uses:</p><ul><li><strong>Timescale + pgai</strong> for real-time, in-database vector search and updates</li><li><strong>Oso Cloud</strong> for relationship-based access controls, enforced natively via PostgreSQL</li><li><strong>No glue code</strong> or ETL scripts between systems</li></ul><p>The result: you get a secure, performant, and authorized chat system with <em>zero</em> duplicated data.</p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text"><i><em class="italic" style="white-space: pre-wrap;">“Chatbot demos are simple. Business-grade AI is hard. 
We’re going to show you how to make the hard, easy.</em></i>” — <b><strong style="white-space: pre-wrap;">Jacky, Developer Advocate, Timescale</strong></b></div></div><h2 id="real-time-vector-sync-with-pgai-vectorizer">Real-Time Vector Sync With pgai Vectorizer</h2><p>[14:33–20:45]</p><p>Instead of bolting a vector database on top of your existing Postgres database, <a href="https://github.com/timescale/pgai/blob/main/docs/vectorizer/overview.md"><u>pgai Vectorizer</u></a> keeps your embeddings <strong>automatically synchronized</strong> with your source data in Postgres.</p><ul><li>Create vectorizers via Python</li><li>Ingest from S3, Hugging Face, or existing Postgres tables</li><li>Bring your own embedding model (OpenAI, Nomic, etc.)</li><li>Chunk and embed documents with configurable rules</li><li>Never worry about mismatched records again</li></ul><pre><code class="language-SQL">SELECT ai.create_vectorizer(
  'blog'::regclass,
  loading =&gt; ai.loading_column(column_name =&gt; 'content'),
  embedding =&gt; ai.embedding_openai(model =&gt; 'text-embedding-3-small', dimensions =&gt; 768),
  destination =&gt; ai.destination_table('blog_embeddings')
);</code></pre><p>Run your vectorizer worker:</p><pre><code class="language-bash">pgai vectorizer worker -d postgresql://...</code></pre><p>No extra queues, pipelines, or lambdas needed. Just Python and Postgres.</p><h2 id="authorization-that-follows-relationships-not-just-roles">Authorization That Follows Relationships, Not Just Roles</h2><p>[21:43–28:14]</p><p>Many apps rely on <a href="https://www.osohq.com/docs/modeling-in-polar/role-based-access-control-rbac"><u>Role-Based Access Control (RBAC)</u></a>. But real-world permissions often depend on <a href="https://www.osohq.com/docs/modeling-in-polar/relationship-based-access-control-rebac"><u>relationships</u></a>:</p><ul><li>“Bob can view reviews only if he’s the owner of the document”</li><li>“Diane (HR) can see feedback others can’t”</li><li>“Support engineers can access sensitive logs only during active shifts”</li></ul><p>Oso lets you model this in code:</p><pre><code class="language-polar">resource Folder {
 roles = ["viewer"];
 permissions = ["view"];
 relations = { team: Team };

 "viewer" if "member" on "team";
 "viewer" if global "hr";
 "viewer" if is_public(resource);

 "view" if "viewer";
}
</code></pre><p>It also incorporates your Postgres data using native SQL, so you don’t need to sync users, roles, or groups into a second system.</p><h2 id="putting-it-together-authorized-retrieval-augmented-generation-rag">Putting It Together: Authorized Retrieval Augmented Generation (RAG)</h2><p>[30:44–37:32]</p><p>Here’s how the architecture works:</p><ol><li>A user (Bob or Diane) sends a question to the chatbot.</li><li>The app queries Oso to determine what data the user is <em>authorized</em> to access.</li><li>That filter is converted to a SQL query that joins source + embedding data in Timescale.</li><li>Only the authorized context is sent to the LLM (e.g., OpenAI) to generate a final response.</li></ol><p>The result: the same chatbot provides personalized, secure answers based on who’s asking—without leaking data or requiring redundant systems.</p><h2 id="what-you%E2%80%99ll-learn-from-the-demo">What You’ll Learn From the Demo</h2><p>[29:01–48:00]</p><ul><li>How to build a business-grade RAG stack without a separate vector DB</li><li>How to enforce field-level access control in LLM-based apps</li><li>How Timescale + pgai + Oso make Postgres the <em>only</em> data system you need</li><li>Why prompt engineering, chunking, and system prompts matter in retrieval quality</li><li>How to embed PDF, DOCX, and S3-based documents securely</li></ul><h2 id="next-steps">Next Steps</h2><p>We’ve open-sourced the reference app and walkthrough:</p><ul><li><a href="https://www.youtube.com/watch?v=5GFhqVOM8UE"><u>Watch the full webinar</u></a></li><li><a href="https://docs.timescale.com"><u>Explore the Timescale pgai docs</u></a></li><li><a href="https://www.osohq.com/docs"><u>Learn more about Oso Cloud</u></a></li><li><a href="https://oso-oss.slack.com/ssb/redirect"><u>Join the Oso community on Slack</u></a></li></ul><p>If you’re building AI agents, chat interfaces, or internal copilots—don’t wait to layer in security and data correctness.</p><p>Your users will thank you. 
Your auditors will too.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Document Loading, Parsing, and Cleaning in AI Applications]]></title>
            <description><![CDATA[Before writing the first line of embeddings for your AI application, you need to load, parse, and clean your data. Here’s how.]]></description>
            <link>https://www.tigerdata.com/blog/document-loading-parsing-and-cleaning-in-ai-applications</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/document-loading-parsing-and-cleaning-in-ai-applications</guid>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[AI agents]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Jacky Liang]]></dc:creator>
            <pubDate>Tue, 08 Apr 2025 17:50:10 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/04/Document-Loading--Parsing--and-Cleaning-in-AI-Applications.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/04/Document-Loading--Parsing--and-Cleaning-in-AI-Applications.png" alt="Document Loading, Parsing, and Cleaning in AI Applications" /><p>Welcome to part one of our <a href="https://www.timescale.com/blog/agentic-rag-best-practices-guide-for-building-ai-apps-with-postgresql" rel="noreferrer"><u>Agentic RAG Best Practices</u></a> series, where we cover how to load, parse, and clean documents for your agentic applications.&nbsp;</p><p>This comprehensive guide will teach you how to build effective agentic retrieval applications with PostgreSQL.</p><p>Every week, customers ask us about building AI applications. Their most pressing concern isn't advanced chunking strategies or vector databases—it's simply: "How do I clean my data before feeding it to my AI?"</p><p>It’s simple: “Garbage in, garbage out.”&nbsp;</p><p>Before worrying about writing even your first line of embeddings or retrieval code, <strong><em>you need clean data</em></strong>.</p><p>In this first guide of our agentic RAG series, we'll cover gathering the right data, extracting text from various document types, pulling valuable metadata, web scraping techniques, and effectively storing data in PostgreSQL. We'll address common challenges like fixing formatting issues and handling images in documents.&nbsp;</p><p>By the end, you'll know how to transform raw documents into clean, structured data that retrieval agents can effectively use.&nbsp;</p><p>Don’t want to read all of this and just want to apply it? We have prepared a handy preparation checklist for this topic.
</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/04/Document-Loading--Parsing--and-Cleaning-in-AI-Applications_decision-tree.png" class="kg-image" alt="A decision tree for document processing when building agentic RAG apps" loading="lazy" width="2000" height="1746" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2025/04/Document-Loading--Parsing--and-Cleaning-in-AI-Applications_decision-tree.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2025/04/Document-Loading--Parsing--and-Cleaning-in-AI-Applications_decision-tree.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2025/04/Document-Loading--Parsing--and-Cleaning-in-AI-Applications_decision-tree.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/04/Document-Loading--Parsing--and-Cleaning-in-AI-Applications_decision-tree.png 2392w" sizes="(min-width: 720px) 720px"></figure><p>One fintech customer recently shared how they spent weeks fine-tuning their RAG application with different vector databases, only to realize their poor results stemmed from simply having dirty data. "I approached the whole thing with like, I don't trust these AIs (…) So we don't ask them to make decisions. We do normal modeling to figure out what the user needs, then feed that data to the LLM and just say, 'Summarize it.'"&nbsp;</p><p>The garbage in, garbage out principle applies strongly to AI applications. 
Let's explore how to properly load, parse, and clean your data for AI use.&nbsp;</p><h2 id="gathering-the-right-data-for-ai-applications">Gathering the Right Data for AI Applications</h2><p>👉🏻 Watch the <a href="https://www.tiktok.com/@answer.hq/video/7488002948304407854"><u>one-minute video</u></a> summary.</p><p>Before even thinking about cleaning or processing your data, you need to make sure you have the right data in the first place. I know, this sounds obvious, but it’s a very important step that many teams overlook in rushing to build their shiny RAG app.&nbsp;</p><h3 id="data-selection-matters">Data selection matters</h3><p>We have seen many AI teams build state-of-the-art RAG apps that still deliver bad answers. Most of the time, there is nothing wrong with their retrieval algorithm, vector database, embedding model, or large language model. The problem is that they simply don’t have the necessary information in the knowledge base, so the LLM makes something up or gives an incomplete answer.&nbsp;</p><p>If the information doesn’t exist in your documents, your RAG app will either return nothing (the best-case scenario) or hallucinate a plausible answer (the worst-case scenario).&nbsp;</p><h3 id="choosing-the-right-data">Choosing the right data</h3><p>Before building your RAG application, ask yourself and the team these questions:</p><ol><li>What specific questions will users ask the system?&nbsp;</li><li>What documents contain factual answers to these questions?&nbsp;</li><li>What are the gaps in our current documentation?&nbsp;</li><li>Is our information up-to-date, or will it need to be regularly updated?&nbsp;</li><li>Do we have a system in place to identify information gaps as users use the app?</li></ol><p>You need to be able to confidently answer these questions.&nbsp;</p><h3 id="where-to-collect-data">Where to collect data</h3><ol><li>Internal knowledge base: Check company wikis, technical 
documentation, reports, manuals, and databases.&nbsp;</li><li>External sources: Read industry publications, research papers, and public datasets.&nbsp;</li><li>Customer interactions: Check support tickets, chat logs, FAQs, etc.&nbsp;</li><li>Real-time sources: See news feeds, market data, IoT sensor data, etc.</li><li>Intuition: You may have some ideas where certain important data lives, so trust your gut.</li></ol><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text"><b><strong style="white-space: pre-wrap;">Note:</strong></b> Make sure these documents<i><b><strong class="italic" style="white-space: pre-wrap;"> don’t contain sensitive information</strong></b></i> you don’t want your users to ask about!&nbsp;</div></div><p>Be intentional about your data sources—the higher the quality and relevancy, the better.&nbsp;</p><h3 id="ensure-data-freshness">Ensure data freshness</h3><p>Most business data isn't static—it often changes as your products, services, and policies evolve. Outdated information in your RAG system leads to incorrect answers and really hurts your customers’ trust in the AI system (I mean, look at Google’s initial rollout of Bard.)&nbsp;</p><p>Consider the following suggestions for keeping your knowledge base up-to-date:</p><ol><li>Set up consistent update schedules: This will be different depending on your business needs. It can be hourly, weekly, monthly, or even quarterly.</li><li>Implement trigger-based updates: Update content whenever the source document changes. For example, when your team updates some documentation, your system should automatically refresh the corresponding knowledge base entries.</li><li>Create document ownership: If you work in a large company, you may need to assign responsibility to other individuals or teams for specific knowledge areas to ensure data is constantly updated.</li><li>Track user feedback: Many RAG systems allow users to rate answers. 
This rating system (like a simple thumbs up or down) can help identify outdated or incorrect information that needs to be updated, added, or removed from your knowledge base.&nbsp;</li><li>Track question patterns: Continuously analyze questions that consistently receive poor ratings to identify areas where your knowledge base needs improvement.&nbsp;</li></ol><p>Stale data is one of the silent killers of answer accuracy, and no advanced RAG pipeline can fix it.&nbsp;</p><h2 id="extracting-text-from-documents">Extracting Text From Documents</h2><p>👉🏻 Watch the <a href="https://www.tiktok.com/@answer.hq/video/7488737770127494442"><u>one-minute video</u></a> summary.</p><p>By some estimates, around 85 percent of the world's data is unstructured: think PDFs, Word files, emails, PowerPoint presentations, and more. To use this data with AI, you first need to extract the raw text.</p><h2 id="using-markitdown-for-document-conversion">Using MarkItDown for Document Conversion</h2><p>Libraries like <a href="https://github.com/microsoft/markitdown"><u>MarkItDown</u></a> and <a href="https://github.com/docling-project/docling"><u>Docling</u></a> can convert PDFs and other formats to Markdown. Markdown has become one of the cleanest and most efficient formats for ingesting data into LLMs because it's nearly plaintext and token-efficient. It can also efficiently represent non-text data like tables.&nbsp;</p><p><a href="https://github.com/microsoft/markitdown?tab=readme-ov-file#python-api"><strong><u>Extract text from PDF</u></strong></a><strong> using MarkItDown&nbsp;</strong></p><pre><code>from markitdown import MarkItDown  
md = MarkItDown()  
result = md.convert("document.pdf")  
text_markdown = result.text_content  
print(text_markdown[:500])</code></pre><p><a href="https://github.com/docling-project/docling?tab=readme-ov-file#getting-started"><strong><u>Extract text from PDF with Optical Character Recognition (OCR)</u></strong></a><strong> using Docling</strong></p><pre><code># Using Docling for Document Conversion with OCR
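from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Enable OCR in the PDF pipeline (needed for scanned documents)
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True

# Initialize the converter with the OCR-enabled PDF pipeline
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

# Load the PDF (local path or URL) and convert it
result = converter.convert("document.pdf")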

# Get the converted markdown content
markdown_text = result.document.export_to_markdown()

# Preview the first 500 characters
print(markdown_text[:500])
</code></pre><p>Both snippets give you Markdown text (<code>result.text_content</code> in MarkItDown, <code>result.document.export_to_markdown()</code> in Docling), which you can easily pass into your RAG pipeline or LLM for cleaning, analysis, summarizing, or chunking.&nbsp;</p><h2 id="using-visual-language-models-for-ocr">Using Visual Language Models for OCR</h2><p>A new breed of OCR technology is being powered by visual large language models (VLLMs): models that can process not just text, but also images and PDFs. These are trained specifically for unstructured data extraction. One such VLLM making a splash is <a href="https://docs.mistral.ai/capabilities/document/"><u>Mistral OCR</u></a>.</p><p><a href="https://docs.mistral.ai/capabilities/document/"><strong><u>Extract text and images</u></strong></a><strong> (in base64) from PDF using Mistral OCR</strong></p><pre><code>import os
from mistralai import Mistral

api_key = os.environ["MISTRAL_API_KEY"]
client = Mistral(api_key=api_key)

ocr_response = client.ocr.process(
    model="mistral-ocr-latest",
    document={
        "type": "document_url",
        "document_url": "https://arxiv.org/pdf/2201.04234"
    },
    include_image_base64=True
)
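</code></pre><p>With <code>include_image_base64=True</code>, the embedded images come back as base64 strings. Here is a minimal sketch for turning one of them back into bytes you can write to disk. It assumes the string may arrive either as bare base64 or as a <code>data:</code> URI; <code>decode_image</code> is a hypothetical helper, not part of the Mistral SDK, so check the response shape in your SDK version before relying on it.</p><pre><code>import base64


def decode_image(image_base64):
    """Return raw image bytes from a base64 string or a data: URI."""
    # Assumption: a data URI looks like "data:image/png;base64,AAAA...".
    if image_base64.startswith("data:") and "," in image_base64:
        image_base64 = image_base64.split(",", 1)[1]
    return base64.b64decode(image_base64)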
</code></pre><p><a href="https://docs.mistral.ai/capabilities/document/#ocr-with-image"><strong><u>Extract from images</u></strong></a><strong> using Mistral OCR</strong></p><pre><code>import os
from mistralai import Mistral

api_key = os.environ["MISTRAL_API_KEY"]
client = Mistral(api_key=api_key)

ocr_response = client.ocr.process(
    model="mistral-ocr-latest",
    document={
        "type": "image_url",
        "image_url": "https://raw.githubusercontent.com/mistralai/cookbook/refs/heads/main/mistral/ocr/receipt.png"
    }
)
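</code></pre><p>Either call returns a response organized by page. Below is a minimal sketch for flattening it into one Markdown string, assuming each page carries a <code>markdown</code> field as the Mistral docs describe. <code>pages_to_markdown</code> is a hypothetical helper that works on plain dictionaries (for example, the response dumped to a dict), so adapt the field access to your SDK version.</p><pre><code>def pages_to_markdown(pages):
    """Join per-page Markdown into one document, skipping empty pages."""
    parts = []
    for page in pages:
        text = (page.get("markdown") or "").strip()
        if text:
            parts.append(text)
    # A blank line between pages keeps paragraphs from running together.
    return "\n\n".join(parts)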
</code></pre><p>What makes Mistral OCR unique is its exceptional performance in extracting text in multiple languages, handling text from images, representing math equations, interpreting structured tables, and other traditionally difficult tasks.&nbsp;</p><p>Other extraction tools you can experiment with include <a href="http://unstructured.io"><u>unstructured.io</u></a>,&nbsp; <a href="https://github.com/allenai/olmocr"><u>olmOCR</u></a>, or just relying on good ol’ humans to extract the data. Upwork and Fiverr are good places to begin your search for contractors.&nbsp;</p><p>Once you have this more manageable text form, you're ready for either direct ingestion into your database or metadata extraction.&nbsp;</p><h2 id="metadata-extraction">Metadata Extraction</h2><p>👉🏻 Watch the <a href="https://www.tiktok.com/@answer.hq/video/7488738094645103918" rel="noreferrer"><u>one-minute video</u></a> summary.</p><p>Documents have metadata like title, author, creation date, length, source, customer name, etc. Imagine needing to fetch all documents between Q1 and Q2 of 2025 for a financial report: you'd need to filter by date range using metadata.</p><p>If your PDFs or documents have built-in metadata (added automatically by document processors when saving or exporting), that's great! But what if they don't?&nbsp;</p><h3 id="extracting-built-in-metadata">Extracting built-in metadata</h3><p>For simple extraction of metadata embedded in the PDF itself (if available), you can use a library like PyMuPDF, which is imported as <code>fitz</code>:</p><p><a href="https://pymupdf.readthedocs.io/en/latest/document.html#Document.metadata"><strong><u>Extract built-in PDF metadata</u></strong></a><strong> using fitz</strong></p><pre><code>import fitz  
doc = fitz.open("example.pdf")  
metadata = doc.metadata  
print(metadata.get("title"), "by", metadata.get("author"))</code></pre><p>For everything else, you need…</p><h2 id="contextual-metadata-extraction">Contextual metadata extraction</h2><p>Most documents don't have native metadata. In these cases, you need a two-step workflow: first, use a PDF text extractor like Mistral OCR, then pass the raw text to another large language model to request specific information using natural language.</p><p>For example, you can use Mistral OCR to analyze each document, define what metadata you'd like to extract (title, author, etc.), and use another LLM to get the metadata information formatted in a specific way (like JSON).</p><p><a href="https://docs.mistral.ai/capabilities/document/#document-understanding"><strong><u>Extract contextual metadata from PDF</u></strong></a><strong> using Mistral OCR and Mistral Small </strong></p><pre><code>import os
from mistralai import Mistral

# Retrieve the API key from environment variables
api_key = os.environ["MISTRAL_API_KEY"]

# Specify model
model = "mistral-small-latest"

# Initialize the Mistral client
client = Mistral(api_key=api_key)

# Define the messages for the chat
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "In JSON format, extract the following metadata from the provided document: title, author, and creation date."
            },
            {
                "type": "document_url",
                "document_url": "https://arxiv.org/pdf/1805.04770"
            }
        ]
    }
]

# Get the chat response
chat_response = client.chat.complete(
    model=model,
    messages=messages
)

# Print the content of the response
print(chat_response.choices[0].message.content)

# Save the metadata somewhere for later ingest into your RAG pipeline
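</code></pre><p>Models often wrap JSON in a Markdown code fence or add a sentence around it, so it helps to parse the reply defensively before storing it. The following is a minimal sketch under that assumption; <code>parse_metadata</code> and the required field names are hypothetical and should match whatever your prompt asks for.</p><pre><code>import json


def parse_metadata(raw, required=("title", "author")):
    """Extract the outermost JSON object from an LLM reply and check its keys."""
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in model output")
    # json.loads raises a ValueError subclass if the slice is not valid JSON.
    data = json.loads(raw[start:end + 1])
    missing = [key for key in required if key not in data]
    if missing:
        raise ValueError("missing metadata fields: " + ", ".join(missing))
    return data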
</code></pre><h2 id="extracting-text-from-the-web">Extracting Text From the Web</h2><p>👉🏻 Watch the <a href="https://www.tiktok.com/@answer.hq/video/7488738426116787498" rel="noreferrer"><u>one-minute video</u></a> summary.</p><p>Not all data lives in documents, and not all is accessible via GET API requests. To get data from websites, documentation, and knowledge bases for AI applications, you need to scrape them. The ultimate goal of web scraping is to fetch only the main text content from pages, filtering out headers, footers, sidebars, ads, and tracking scripts, in an LLM-friendly format like Markdown.</p><p>In the past, this was done with libraries like requests, Selenium, BeautifulSoup, etc., and required manually setting up proxies to evade rate limiters. Thankfully, it's no longer as painful to scrape the web today (yay!).</p><p>Web scrapers generally need to do the following tasks:</p><ol><li>Crawl: Get all pages of an entire website by gathering a list of all internal and external links (essentially building a sitemap, if it’s not available on /sitemap.xml).</li><li>Scrape: Get the DOM/text content of each individual page.</li><li>Proxy<strong>: </strong>Switch to a different IP to continue on a large crawling and scraping job.&nbsp;</li><li>Clean: Extract the main useful text from the raw DOM.</li><li>Convert: Format the main text content as Markdown, TXT, JSON, etc.&nbsp;</li></ol><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text"><b><strong style="white-space: pre-wrap;">A note about raw DOM</strong></b>: Using a website's HTML is messy because it has ads, menus, and other junk. 
It also doesn't work well for React apps or other single-page applications.</div></div><h2 id="firecrawl-for-web-scraping">Firecrawl for Web Scraping</h2><p><a href="http://firecrawl.com"><u>Firecrawl</u></a> is a web scraping/crawling engine accessible via REST API, Python SDK, and a UI dashboard (currently in beta). What's great about Firecrawl is that it extracts clean page text in various formats. It can crawl an entire site and return all pages' content in one go (with advanced filtering options). It also handles all the proxying needed for large-scale crawl and scrape jobs.</p><p><strong>Crawling a website with a limit of 100 pages using </strong><a href="https://docs.firecrawl.dev/introduction#usage"><strong><u>Firecrawl Crawl REST API</u></strong></a></p><pre><code>from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-YOUR_API_KEY")

# Crawl a website:
crawl_status = app.crawl_url(
  'https://firecrawl.dev', 
  params={
    'limit': 100, 
    'scrapeOptions': {'formats': ['markdown', 'html']}
  },
  poll_interval=30
)
print(crawl_status)
</code></pre><p><strong>Scraping a URL and outputting in Markdown using </strong><a href="https://docs.firecrawl.dev/introduction#scraping"><strong><u>Firecrawl Scraping REST API</u></strong></a></p><pre><code>from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-YOUR_API_KEY")

# Scrape a website:
scrape_result = app.scrape_url('firecrawl.dev', params={'formats': ['markdown', 'html']})
print(scrape_result)</code></pre><p><strong>Custom metadata extraction in JSON using </strong><a href="https://docs.firecrawl.dev/introduction#extraction"><strong><u>Firecrawl Extraction REST API</u></strong></a></p><pre><code>from firecrawl import FirecrawlApp
from pydantic import BaseModel

# Initialize the FirecrawlApp with your API key
app = FirecrawlApp(api_key='your_api_key')

class ExtractSchema(BaseModel):
    company_mission: str
    supports_sso: bool
    is_open_source: bool
    is_in_yc: bool

data = app.scrape_url('https://docs.firecrawl.dev/', {
    'formats': ['json'],
    'jsonOptions': {
        'schema': ExtractSchema.model_json_schema(),
    }
})
print(data["json"])</code></pre><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text"><i><b><strong class="italic" style="white-space: pre-wrap;">Author’s note:</strong></b></i><i><em class="italic" style="white-space: pre-wrap;"> “Firecrawl is one of my favorite SaaS services of 2024. It has awesome docs, affordable pricing, and has a very responsive team. Most importantly, it works really well.” – Jacky Liang</em></i></div></div><h2 id="other-web-scraping-options">Other Web Scraping Options</h2><p>Firecrawl isn't the only service/library for crawling, scraping, and cleaning. Another capable service is Jina AI's Reader API, which converts a URL to LLM-friendly inputs simply by adding <code>r.jina.ai</code> in front:&nbsp;&nbsp;</p><p><strong>Fetch a webpage in clean Markdown using Jina AI’s </strong><a href="https://jina.ai/reader/"><strong><u>Reader API</u></strong></a></p><pre><code>r.jina.ai/news.ycombinator.com
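</code></pre><p>The same request from Python, as a minimal standard-library sketch. It assumes the Reader endpoint accepts the full target URL appended to <code>https://r.jina.ai/</code> and returns Markdown as plain text. <code>reader_url</code> and <code>fetch_markdown</code> are hypothetical helper names, and rate limits or API keys may apply to heavier use.</p><pre><code>from urllib.request import Request, urlopen


def reader_url(target):
    """Prefix a page URL with the Reader endpoint, adding a scheme if missing."""
    if "://" not in target:
        target = "https://" + target
    return "https://r.jina.ai/" + target


def fetch_markdown(target, timeout=30):
    """Fetch a page through the Reader API and return its text content."""
    request = Request(reader_url(target), headers={"User-Agent": "rag-loader-example"})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")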
</code></pre><p>If you want to build your own end-to-end crawling and scraping infrastructure (expert users only), developers typically use <a href="https://github.com/microsoft/playwright"><u>Playwright</u></a>, a Microsoft framework for web testing and automation. <a href="https://github.com/oxylabs/playwright-web-scraping"><u>Playwright Web Scraping</u></a> is a reliable open-source web scraping implementation using Playwright. <a href="https://github.com/mendableai/firecrawl"><u>Firecrawl</u></a> is also open source and lets you host it in your own infrastructure if you want absolute control.&nbsp;</p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text"><b><strong style="white-space: pre-wrap;">Pro tip:</strong></b> When running your own web scraping infrastructure, make sure to use proxies for your scraping server to avoid IP bans from websites you're crawling and scraping.&nbsp;</div></div><h2 id="direct-data-loading">Direct Data Loading&nbsp;</h2><p><code>pgai</code> has a handy function that lets you import datasets directly from Hugging Face with just the dataset's name:</p><p><a href="https://github.com/timescale/pgai?tab=readme-ov-file#create-a-table-run-a-vectorizer-and-perform-semantic-search"><strong><u>Load data from Hugging Face</u></strong></a><strong> using pgai</strong></p><pre><code>SELECT ai.load_dataset('wikimedia/wikipedia', '20231101.en', table_name=&gt;'wiki', batch_size=&gt;5, max_batches=&gt;1, if_table_exists=&gt;'append');
</code></pre><p><code>pgai</code> has more direct data loading goodies to come, so stay tuned.&nbsp;</p><h2 id="storing-data">Storing Data</h2><p>👉🏻 Watch the <a href="https://www.tiktok.com/@answer.hq/video/7488738803012848942" rel="noreferrer">one-minute video</a> summary.</p><p>At Timescale, we believe <a href="https://www.timescale.com/blog/vector-databases-are-the-wrong-abstraction"><u>boutique vector databases are the wrong abstraction</u></a> for AI workloads. PostgreSQL is the ideal solution for typical apps and AI apps, especially RAG applications.</p><p>Instead of using a separate vector database, you can store text embeddings inside PostgreSQL with <a href="https://github.com/pgvector/pgvector"><u>pgvector</u></a>. We recommend using <a href="https://github.com/timescale/pgai"><u>pgai</u></a> to simplify building RAG apps, as we have a handy interface called <code>create_vectorizer()</code> that automatically chunks raw text, embeds it, and continuously keeps the embeddings up-to-date.</p><p><strong>Create an AI project using </strong><a href="https://github.com/timescale/pgai?tab=readme-ov-file"><strong><u>pgvector and pgai</u></strong></a><strong> </strong></p><pre><code>-- Enable pgai and pgvector on your Postgres database
CREATE EXTENSION IF NOT EXISTS vector;
-- CASCADE also installs the extensions pgai depends on
CREATE EXTENSION IF NOT EXISTS ai CASCADE;

-- Create a table to store Wikipedia articles
CREATE TABLE wiki (
    id      TEXT PRIMARY KEY,
    url     TEXT,
    title   TEXT,
    text    TEXT
);

-- Load Wikipedia dataset directly from Hugging Face
SELECT ai.load_dataset('wikimedia/wikipedia', '20231101.en', table_name=&gt;'wiki', batch_size=&gt;5, max_batches=&gt;1, if_table_exists=&gt;'append');

-- Create a vectorizer that
-- 1. Chunks the text column using the recursive character text splitter
-- 2. Embeds each chunk using all-MiniLM (384-dimensional embeddings)
-- 3. Continuously monitors the wiki table for new incoming text
SELECT ai.create_vectorizer(
     'wiki'::regclass,
     embedding =&gt; ai.embedding_ollama('all-minilm', 384),
     chunking =&gt; ai.chunking_recursive_character_text_splitter('text'),
     formatting =&gt; ai.formatting_python_template('url: $url  title: $title $chunk')
);

-- Check the status of the vectorizer's embedding creation
SELECT * FROM ai.vectorizer_status;
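</code></pre><p>To land your own cleaned documents in the same table from Python, something like the following could work. It is a sketch, assuming the <code>wiki</code> table above and the psycopg 3 driver; <code>to_wiki_row</code>, <code>store_documents</code>, and the hash-based id are hypothetical choices, not part of pgai. The upsert means a re-ingested page overwrites its old row, which the vectorizer can then pick up.</p><pre><code>import hashlib

# Upsert so that re-ingesting a changed page updates the existing row.
UPSERT_SQL = """
INSERT INTO wiki (id, url, title, text)
VALUES (%s, %s, %s, %s)
ON CONFLICT (id) DO UPDATE
SET url = EXCLUDED.url, title = EXCLUDED.title, text = EXCLUDED.text
"""


def to_wiki_row(url, title, text):
    """Build an (id, url, title, text) tuple for the wiki table."""
    # Assumption: a short hash of the URL is an acceptable stable id.
    doc_id = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return (doc_id, url, title.strip(), text.strip())


def store_documents(conn_url, docs):
    """Upsert cleaned documents; returns the number of rows sent."""
    import psycopg  # assumption: psycopg 3, imported lazily so the helper above needs no driver

    rows = [to_wiki_row(d["url"], d["title"], d["text"]) for d in docs]
    with psycopg.connect(conn_url) as conn:
        with conn.cursor() as cur:
            cur.executemany(UPSERT_SQL, rows)
    return len(rows)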
</code></pre><p><strong>Running the </strong><a href="https://github.com/timescale/pgai?tab=readme-ov-file#quick-start"><strong><u>pgai Vectorizer worker</u></strong></a></p><p>For the vectorizer to work correctly, you need to run the pgai Vectorizer worker alongside your PostgreSQL database. This worker processes your data and creates embeddings. Set up a <code>docker-compose.yml</code> file with the following configuration:</p><pre><code>version: '3'
services:
  db:
    image: timescale/timescaledb-ha:pg17
    environment:
      POSTGRES_PASSWORD: postgres
    ports:
      - "5432:5432"
    volumes:
      - data:/home/postgres/pgdata/data
      
  vectorizer-worker:
    image: timescale/pgai-vectorizer-worker:latest
    environment:
      PGAI_VECTORIZER_WORKER_DB_URL: postgres://postgres:postgres@db:5432/postgres
      OPENAI_API_KEY: your_openai_api_key_here
    command: [ "--poll-interval", "5s" ]
    
  ollama:
    image: ollama/ollama
    
volumes:
  data:
</code></pre><p>If you're using Ollama for embeddings, as shown in our example, make sure to add the Ollama service and configure the worker:</p><pre><code>vectorizer-worker:
    environment:
      OLLAMA_HOST: http://ollama:11434</code></pre><p>Start everything with “<code>docker-compose up -d</code>” and the worker will automatically poll the database and process your vectorizer tasks. Note that you might need to adjust settings like poll intervals or concurrency depending on your specific workload needs.</p><p>And just like that, we’ve built a production-ready SQL-native retrieval pipeline that is not only powerful but <a href="https://github.com/timescale/pgai/blob/main/docs/vectorizer/overview.md"><u>extremely customizable</u></a>.&nbsp;</p><h2 id="cleaning-messy-data">Cleaning Messy Data</h2><p>Raw text extracted from websites or documents is often messy and contains content not relevant to the main text. This can include ads, navigation menus, footers, tracking scripts, or leftover HTML/CSS markup. Removing this noise is crucial to avoid feeding irrelevant text to your AI model, reduce input token size to lower costs, and increase retrieval accuracy.</p><h3 id="cleaning-webpages">Cleaning webpages</h3><p>Elements to clean include:</p><ul><li>HTML tags that aren't content: <code>&lt;script&gt;</code>, <code>&lt;style&gt;</code>, <code>&lt;nav&gt;</code>, <code>&lt;footer&gt;</code>, etc.</li><li>Advertisements or cookie banners</li><li>Repeated headers/footers on every page</li><li>Excessive whitespace, line breaks, or meaningless Unicode characters</li><li>Images or links to images</li></ul><p>The tools mentioned earlier, like Firecrawl and Jina AI's Reader API, already handle webpage data cleaning and return only the main text content.</p><p>If you have very specific requirements, you can use a browser automation framework like Playwright to get the rendered DOM, or an HTML parser like BeautifulSoup on the raw HTML, then use traditional DOM traversal or regex to clean the data. 
This approach is for experts only.</p><h3 id="cleaning-pdfs">Cleaning PDFs</h3><p>After running documents through Mistral OCR, you'll still have repeated content like page numbers, headers/footers, and repetitive line breaks. A growing technique is to automate cleaning with an LLM like Gemini Flash 2.0, which has a one-million-token context window and a reasonable cost. You can use natural language to instruct it on what to clean: removing repeated titles, sources, footnotes, etc.</p><p>You can also clean text manually if you have very specific data requirements.&nbsp;</p><h2 id="fixing-text-formatting">Fixing Text Formatting</h2><p>Sometimes text is extracted with poor formatting. You need to standardize lists, headings, and line breaks. You can prompt an LLM to rewrite text more clearly, remove gibberish or irrelevant parts, and correct inconsistencies (without adding extra commentary).</p><p>Issues to fix include:</p><ul><li>Line breaks and paragraphs: Merge lines that belong to the same paragraph. For example, replace hyphenated line breaks (<code>-\n</code>) with nothing, and replace newlines followed by lowercase letters with spaces.</li><li>Lists and bullet points: Convert fancy bullet symbols to a common format (e.g., "•" or "–" to "-"). Ensure list items have consistent formatting.</li><li>Headings and subheadings: If using Markdown, ensure headings use # syntax properly with blank lines before and after.</li><li>Whitespace and punctuation: Trim excessive whitespace, normalize quotes and dashes if needed.</li><li>Tone: Standardize tone using large context LLMs like Gemini Flash 2.0 to keep writing style consistent across your text data.</li></ul><p><strong>Clean a variety of OCR text formatting issues using regex</strong><em> (not an exhaustive example)</em></p><pre><code>import re

def clean_ocr_text(text):
    # Replace hyphenated line breaks (e.g., "exam-\nple" -&gt; "example")
    text = re.sub(r'(\w)-\n(\w)', r'\1\2', text)

    # Merge lines broken mid-sentence (a newline followed by a lowercase letter)
    text = re.sub(r'\n(?=[a-z])', ' ', text)

    # More cleaning steps go here

    return text</code></pre><h2 id="handling-images-in-documents">Handling Images in Documents&nbsp;</h2><p>Some PDFs and web pages have images containing text or are entirely scans of documents. There are several ways to handle these:</p><ol><li><strong>OCR</strong>: For scanned documents, OCR can extract text from images. Mistral OCR handles this well.</li><li><strong>MarkItDown</strong>: Modern libraries like MarkItDown integrate both OCR and vision models to generate image descriptions.</li><li><strong>BLIP</strong>: Models like <a href="https://arxiv.org/abs/2201.12086"><u>BLIP</u></a> (Bootstrapping Language-Image Pre-training) combine understanding images and generating text to give you text descriptions of images.</li><li><strong>Save image URLs</strong>: When extracting from webpages, you can save image URLs as text chunks to display in your application.</li><li><strong>Omit images</strong>: This is common but not ideal, especially if images contain crucial information like charts or diagrams.</li></ol><p><a href="https://dev.to/leapcell/deep-dive-into-microsoft-markitdown-4if5"><strong><u>Generate descriptions for images</u></strong></a><strong> using MarkItDown</strong></p><pre><code>from markitdown import MarkItDown
from openai import OpenAI

# Set up OpenAI client
client = OpenAI(api_key="your-openai-api-key")

# Initialize MarkItDown with LLM capabilities
md = MarkItDown(llm_client=client, llm_model="gpt-4o")

# Convert an image file
result = md.convert("path_to_your_image.jpg")

# Print the generated description
print(result.text_content)
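</code></pre><p>For the "save image URLs" option, here is a minimal standard-library sketch that collects image sources and alt text from a page's HTML so they can be stored next to the text chunks. <code>ImageCollector</code> and <code>image_chunks</code> are hypothetical names, and the Markdown-style chunk format is only one possible convention.</p><pre><code>from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collect (absolute URL, alt text) pairs from img tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src")
        if src:
            # Resolve relative paths against the page URL.
            self.images.append((urljoin(self.base_url, src), attributes.get("alt") or ""))


def image_chunks(html, base_url):
    """Return one Markdown-style text chunk per image found in the HTML."""
    collector = ImageCollector(base_url)
    collector.feed(html)
    return ["![{}]({})".format(alt, url) for url, alt in collector.images]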
</code></pre><p><a href="https://medium.com/%40jimwang3589/what-is-image-captioning-and-how-to-use-python-to-generate-caption-from-an-image-98a9eb6be06d"><strong><u>Generate captions for an image</u></strong></a><strong> using BLIP&nbsp;</strong></p><pre><code>from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image

# Load the pre-trained BLIP model and processor
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# Open an image file
image_path = "path_to_your_image.jpg"  # Replace with your image file path
image = Image.open(image_path).convert("RGB")  # BLIP expects three-channel RGB input

# Preprocess the image and prepare inputs for the model
inputs = processor(images=image, return_tensors="pt")

# Generate caption
outputs = model.generate(**inputs)

# Decode the generated caption
caption = processor.decode(outputs[0], skip_special_tokens=True)

print("Generated Caption:", caption)</code></pre><h2 id="summarizing-or-cleaning-text-with-an-llm">Summarizing or Cleaning Text With an LLM</h2><p>After programmatic cleaning and parsing, you may still need to refine text further:</p><ul><li>Summarize long documents into shorter summaries or key points.</li><li>Make tone and format consistent.</li></ul><p>Simple prompts like "Remove any unnecessary information (like boilerplate) from the following text and correct any errors" or "Summarize the following document in three sentences" can help. If possible, double-check this work, as LLMs can sometimes misinterpret context.</p><p><strong>Summarize a piece of text </strong><a href="https://cloud.google.com/vertex-ai/generative-ai/docs/gemini-v2"><strong><u>using Gemini Flash 2.0</u></strong></a><strong>&nbsp;</strong></p><pre><code>from google import genai
from google.genai.types import HttpOptions

client = genai.Client(http_options=HttpOptions(api_version="v1"))
response = client.models.generate_content(
    model="gemini-2.0-flash-001",
    contents="Summarize in a technical tone the following piece of text: Attention Is All You Need...",
)
print(response.text)</code></pre><h2 id="conclusion">Conclusion</h2><p>We hear it from our customers’ AI teams all the time. One team had spent months trying different vector databases, embedding models, and chunking strategies without seeing any improvement in their RAG application. After a quick data review, we discovered their PDFs were being processed with poor OCR, resulting in garbled text full of artifacts. Within a week of implementing proper data cleaning, their application performance jumped dramatically.&nbsp;</p><p>There’s a very important lesson here: Before you worry about the latest bleeding-edge GraphRAG techniques, <strong><em>make sure your data foundation is solid</em></strong>; this step starts way before you write a single line of RAG code. Clean, well-structured, and high-quality data is the foundation of any successful AI application. As we like to say, “Garbage in, garbage out.”&nbsp;</p><p>Start with good data, and the rest of your AI pipeline will fall into place. Time invested in proper document gathering, loading, parsing, and cleaning pays dividends in more accurate, relevant, and useful AI outputs.&nbsp;</p><p>In the next installment of <a href="https://www.timescale.com/blog/agentic-rag-best-practices-guide-for-building-ai-apps-with-postgresql" rel="noreferrer">RAG Best Practices</a>, we will be exploring chunking strategies, followed by embedding generation, indexing techniques, performance optimizations, and more.&nbsp;</p><h2 id="get-involved">Get Involved&nbsp;</h2><p>Whether you're new to AI or an experienced developer looking to implement agentic RAG with PostgreSQL, this series will give you the foundation you need.&nbsp;</p><p>Stay tuned for our next guide on chunking strategies, coming in two weeks. 
</p><p>In the meantime, we'd love to see you share your thoughts, questions, and suggestions on social media and Discord:</p><ul><li><strong>Join our Discord Community</strong>: Get real-time answers from the Timescale team and <a href="https://discord.com/invite/KRdHVXAmkp" rel="noreferrer">connect with other developers</a>.</li><li><strong>Follow us on social media</strong>: Stay updated with the latest from Timescale on <a href="https://twitter.com/TimescaleDB"><u>X/Twitter</u></a> and <a href="http://linkedin.com/company/timescaledb"><u>LinkedIn</u></a>.</li><li><strong>Connect with Jacky </strong>(<strong>developer advocate</strong>): Follow me for more practical AI and PostgreSQL content on<a href="https://twitter.com/jjackyliang"><u> X/Twitter</u></a>, <a href="https://threads.net/@jjackyliang"><u>Threads</u></a>, and <a href="https://www.tiktok.com/@answer.hq"><u>TikTok</u></a>.</li><li><strong>Direct questions</strong>: Have a specific question about your agentic retrieval implementation? Ask me anything at jacky (at) timescale (dot) com.</li></ul><p>We're building this guide for you, so don't hesitate to let us know what topics you'd like us to cover in future installments!&nbsp;</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[You, Too, Can Scale Postgres to 3 PB and 3 T Metrics per Day]]></title>
            <description><![CDATA[Read how we’re scaling Postgres to 3 PB of data and 3 T of metrics per day to power our query monitoring feature, Insights.]]></description>
            <link>https://www.tigerdata.com/blog/scaling-postgresql-to-petabyte-scale</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/scaling-postgresql-to-petabyte-scale</guid>
            <category><![CDATA[Engineering]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[PostgreSQL Performance]]></category>
            <dc:creator><![CDATA[Rob Kiefer]]></dc:creator>
            <pubDate>Wed, 19 Mar 2025 13:00:00 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/09/Blog--1-.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/09/Blog--1-.png" alt="Scaling PostgreSQL to Petabyte Scale: many white squares (data) into a single bigger square (Timescale Cloud service)" /><p><em>Updated Jan. 27, 2026 -</em> After launching our <a href="https://www.timescale.com/blog/database-monitoring-and-query-optimization-introducing-insights-on-timescale/"><u>Insights</u></a> feature in late 2023, <a href="https://www.timescale.com/blog/how-we-scaled-postgresql-to-350-tb-with-10b-new-records-day/"><u>our most ambitious dogfooding effort</u></a> yet, where we scaled Postgres to give users in-depth performance analytics on their database queries, we’re back with an update. And good news: we'll be sharing these metrics quarterly.&nbsp;</p><p><strong>The TL;DR?</strong> One <a href="https://www.timescale.com/products" rel="noreferrer">Tiger Cloud database service</a> is now ingesting over <strong>3</strong> <strong>trillion</strong> metrics per day and <strong>storing 3 petabytes of data</strong>, challenging all assumptions that Postgres can’t scale.</p><p>This massive operation runs entirely on Tiger Cloud using the same features available to all our users. There is no special treatment, no hidden infrastructure: you, too, could run a Postgres database at this scale with the offerings on Tiger Cloud.&nbsp;</p><h2 id="insights-recap-scaling-postgres-for-query-monitoring">Insights Recap: <a href="https://www.tigerdata.com/learn/guide-to-postgresql-scaling" rel="noreferrer">Scaling Postgres</a> for Query Monitoring</h2><p>To understand the scale of the problem we’re trying to solve, let’s quickly recap the feature being powered here by Tiger Data. 
<a href="https://docs.timescale.com/use-timescale/latest/metrics-logging/insights/"><u>Insights</u></a> provides Tiger Cloud users with comprehensive query analytics, capturing execution times, memory usage, I/O statistics, and TimescaleDB feature utilization.</p><p>This means we capture every query running in our Cloud, gather relevant statistics, and store them in a fully queryable Tiger Cloud instance. Initially collecting about a dozen metrics per query, we've since <strong>tripled</strong> that number to enhance user visibility. The data volume expands along three dimensions: growing customer base, increasing per-customer query loads, and an expanding metrics collection.</p><p>Yet we continue to track this data in Tiger Cloud, on a Postgres-based database, accomplishing Tiger Data's original goal of creating a faster, more scalable Postgres.</p><h2 id="just-give-me-the-tldr-aka-the-big-numbers">Just Give Me the TL;DR (a.k.a. the Big Numbers)</h2><p>When we <s>bragged</s> talked about <a href="https://www.timescale.com/blog/how-we-scaled-postgresql-to-350-tb-with-10b-new-records-day/"><u>building Insights</u></a>, the headline numbers were storing 350+ TBs of data and ingesting 10 billion records a day. 
Today, we’ll change the headline numbers to <strong>more than 1 quadrillion metrics recorded, almost 3 petabytes stored, and over 3 trillion daily metrics ingested.</strong></p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/01/2026-Jan-27-Scaling-Postgres-Metrics.png" class="kg-image" alt="" loading="lazy" width="1280" height="688" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2026/01/2026-Jan-27-Scaling-Postgres-Metrics.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2026/01/2026-Jan-27-Scaling-Postgres-Metrics.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/01/2026-Jan-27-Scaling-Postgres-Metrics.png 1280w" sizes="(min-width: 720px) 720px"></figure><p>Since launching the feature, we’ve refined how we measure Insights data, focusing on metrics rather than records—a record is a set of metrics for a query—with each query now capturing significantly more data points than in 2023.</p><p>That said, the numbers are impressive: we’ve moved into multi-petabyte territory, adding roughly a petabyte of data per year since launching 3 years ago. 
Today, we stand at <strong>nearly 3 petabytes, the overwhelming majority of which is efficiently stored in </strong><a href="https://docs.tigerdata.com/use-timescale/latest/data-tiering/" rel="noreferrer"><u><strong>Tiger Data’s tiered storage</strong> <strong>architecture</strong></u></a>, which is more easily accessible and <a href="https://www.timescale.com/blog/boosting-query-performance-for-tiered-data-in-postgresql"><u>query-able than ever</u></a>.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/01/2026-Jan-27-Scaling-Postgres-Table-1.png" class="kg-image" alt="" loading="lazy" width="2000" height="473" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2026/01/2026-Jan-27-Scaling-Postgres-Table-1.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2026/01/2026-Jan-27-Scaling-Postgres-Table-1.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2026/01/2026-Jan-27-Scaling-Postgres-Table-1.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w2400/2026/01/2026-Jan-27-Scaling-Postgres-Table-1.png 2400w" sizes="(min-width: 720px) 720px"></figure><p>Our daily ingest has tripled over the last year, from 1 trillion to 3 trillion metrics per day, totaling 1 quadrillion metrics collected.&nbsp;Despite this massive growth in data, queries, and more metrics per query, we still use the same size bowl for our dogfooding effort: <a href="https://www.timescale.com/products" rel="noreferrer">a vanilla Tiger Cloud instance</a>.</p><h2 id="how-we-scale-postgres-and-stay-fast">How We Scale Postgres and Stay Fast</h2><p>Much of our architecture remains the same as <a href="https://www.timescale.com/blog/how-we-scaled-postgresql-to-350-tb-with-10b-new-records-day/"><u>described in 
our original post</u></a>. We still ingest two main types of objects: a detailed “raw” record and a set of UDDSketches that represent a distribution of values for a given metric (“sketch” record).&nbsp;</p><p>A raw record contains the metrics for a single query invocation, along with some more detailed information, like a full breakdown of <a href="https://www.postgresql.org/docs/current/using-explain.html#USING-EXPLAIN-BASICS"><u>nodes</u></a> used to execute the query. Conversely, the set of UDDSketches represents multiple query invocations. This allows us to store orders of magnitude more queries’ stats than if we stored only raw records.&nbsp;</p><p>Since launching the feature, we have generally sampled fewer raw records, now only collecting about 25% of queries in this form. The node breakdown of execution can be useful in understanding how custom nodes we’ve created for TimescaleDB are performing across the fleet.</p><p>Adding new metrics to track has been straightforward—just new columns on our existing <a href="https://docs.timescale.com/use-timescale/latest/hypertables/about-hypertables/"><u>hypertables</u></a>. Because we’ve essentially tripled the amount of metrics we collect, this does put more pressure on storage. </p><p>For raw records, as previously mentioned, we have just reduced the amount of sampling while continuing to aggressively tier data. For the sketch records, we’ve also begun using tiering for this table. 
This lets us keep our active dataset for the database around 12&nbsp;TB (60&nbsp;TB of pre-compressed data before using Timescale's <a href="https://www.tigerdata.com/blog/building-columnar-compression-in-a-row-oriented-database" rel="noreferrer">row-columnar storage</a> engine), with the rest (1.5+&nbsp;PB) tiered.</p><p>To allow for aggressive tiering and quick responses to queries from our Insights page, we use <a href="https://docs.timescale.com/use-timescale/latest/continuous-aggregates/about-continuous-aggregates/"><u>continuous aggregates</u></a> (our enhanced version of Postgres materialized views) heavily. UDDSketches <a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/uddsketch#rollup"><u>“roll up”</u></a> very nicely: you can combine a set of UDDSketches into a new UDDSketch representing the entire group. This allows us to go from the ingested UDDSketches into a hierarchical continuous aggregate tree with groupings at several levels (minutes, hours, days).&nbsp;</p><p>With a bit of planning, we’ve been able to have stats available at all the granularities we need to serve users without needing to go to the original hypertables. Inserts stay fast, queries stay fast, and we can tier without fear.</p><p>In the future, we may need to deploy read replicas to scale the solution, allowing us to separate the high write ingesting and aggregation workload from the high read workload that comes from customer usage. But as it stands today, we don’t need that; we have this trillions-of-metrics-a-day pipeline running perfectly without scaling out.</p><h2 id="final-words">Final Words</h2><p>Since its inception, Insights has grown in both scale and impact, proving that Postgres, when engineered for scale, can handle immense workloads.&nbsp;</p><p>We're tracking trillions of metrics daily while storing petabytes of data, all on a Tiger Cloud instance. 
The power of Tiger Data’s tiered storage, hypertables, and continuous aggregates has allowed us to not just scale but to stay fast and efficient.&nbsp;<a href="https://www.tigerdata.com/blog/how-glooko-turns-3b-data-points-per-month-into-lifesaving-diabetes-healthcare-tiger-data" rel="noreferrer">Glooko uses the same Tiger Cloud features</a> to process 3 billion data points per month for diabetes monitoring.</p><p>If you’ve been thinking about taking your Tiger Cloud database to the next level, rest assured, we’re showing it’s entirely possible—our Cloud is your Cloud. And remember, <a href="https://www.timescale.com/support"><u>you will never walk alone</u></a>. Top-notch support is available for free for all Tiger Cloud customers, and our expert team is ready to guide you every step of the way, all the way to petabyte scale.&nbsp;</p><p>Start scaling—<a href="https://console.cloud.timescale.com/signup"><u>create a free Tiger Cloud account today</u></a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to Test Your PostgreSQL Connection]]></title>
            <description><![CDATA[Testing your PostgreSQL connection is a challenge faced by many developers. Learn how to do it.
]]></description>
            <link>https://www.tigerdata.com/blog/how-to-test-your-postgresql-connection</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/how-to-test-your-postgresql-connection</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[PostgreSQL Tips]]></category>
            <dc:creator><![CDATA[Semab Tariq]]></dc:creator>
            <pubDate>Mon, 20 Jan 2025 14:00:53 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/01/How-to-Test-Your-PostgreSQL-Connection.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/01/How-to-Test-Your-PostgreSQL-Connection.png" alt="Code snippets of three of the ways you can test your PostgreSQL connection" /><p>Let’s imagine a few scenarios:</p><ul><li>You are asked to build a monitoring system to check if a PostgreSQL instance is running smoothly. You need to check its status every second, but your organization doesn’t want to use third-party tools due to concerns about sharing sensitive credentials.</li><li>You have just installed PostgreSQL and want to make sure it’s ready to accept connections.</li><li>Your application suddenly stops connecting to the database, and you need to manually test the connection to figure out what’s wrong.</li><li>You have deployed a new version of your application. Before sending live traffic to it, you want to confirm that the connection to the PostgreSQL server is stable.</li></ul><p>In all these scenarios, the first question that might come to mind is:<strong> How can I check if PostgreSQL is running and test the connection?</strong></p><p>Don’t worry! You are not alone! 
😊 In today’s blog, I’ll guide you through various methods to test a PostgreSQL connection, which will help you tackle these and other similar scenarios.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/01/How-to-Test-Your-PostgreSQL-Connection-1.png" class="kg-image" alt="" loading="lazy" width="2000" height="885" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2025/01/How-to-Test-Your-PostgreSQL-Connection-1.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2025/01/How-to-Test-Your-PostgreSQL-Connection-1.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2025/01/How-to-Test-Your-PostgreSQL-Connection-1.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w2400/2025/01/How-to-Test-Your-PostgreSQL-Connection-1.png 2400w" sizes="(min-width: 720px) 720px"></figure><p>(If you have not done so already, install PostgreSQL by following the community guidelines for your preferred operating system <a href="https://www.postgresql.org/download/"><u>here</u></a>.)</p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">✨</div><div class="kg-callout-text">Learn how to troubleshoot and <a href="https://www.timescale.com/blog/5-common-connection-errors-in-postgresql-and-how-to-solve-them" rel="noreferrer">fix five common PostgreSQL connection errors</a>.</div></div><h2 id="how-to-verify-if-postgresql-is-ready-to-accept-connections">How to Verify if PostgreSQL Is Ready to Accept Connections</h2><p>There are several ways to check your PostgreSQL connections. 
To make things clearer, I have divided them into three sections:</p><ul><li>PostgreSQL internals: methods that don’t rely on third-party tools or software</li><li>External tools: tools outside of PostgreSQL</li><li>Programming languages: using code to check connections</li></ul><p>Let’s explore them one by one.</p><h2 id="postgresql-internals-methods-without-third-party-tools">PostgreSQL Internals: Methods Without Third-Party Tools</h2><p>Let’s explore the different ways PostgreSQL can help determine if it is ready and available for new connections.</p><h3 id="pgisready"><code>pg_isready</code></h3><p>PostgreSQL includes a built-in utility called <code>pg_isready</code>, which is available after installation. It is the tool most commonly used to check if PostgreSQL is ready to accept connections.</p><h4 id="usage">Usage</h4><pre><code class="language-SQL">pg_isready -h &lt;HOST_NAME&gt; -p &lt;PORT_NUMBER&gt; -d &lt;DATABASE_NAME&gt; -U &lt;DATABASE_USER&gt;</code></pre><p>To use <code>pg_isready</code>, specify the required values based on your PostgreSQL installation:</p><ul><li><code>&lt;HOST_NAME&gt;</code>: This can be <code>localhost</code> or the public or private IP address of the server where PostgreSQL is installed.</li><li><code>&lt;PORT_NUMBER&gt;</code>: The port number on which PostgreSQL is listening (default is 5432).</li><li><code>&lt;DATABASE_NAME&gt;</code>: The name of the database you want to check.</li><li><code>&lt;DATABASE_USER&gt;</code>: The username for the database connection.</li></ul><p>Once you have replaced the placeholders with the appropriate values, execute the command to check the PostgreSQL server’s availability.</p><h4 id="output">Output&nbsp;</h4><p>The first command shows that PostgreSQL on port 5434 is accepting connections. 
The second command indicates that PostgreSQL on port 5430 is not responding, meaning it’s either not running or unreachable.</p><pre><code>/Library/PostgreSQL/15/bin/pg_isready -h localhost -p 5434 -d postgres -U postgres

localhost:5434 - accepting connections

/Library/PostgreSQL/15/bin/pg_isready -h localhost -p 5430 -d postgres -U postgres

localhost:5430 - no response</code></pre><h3 id="psql%E2%80%94postgresql-interactive-terminal"><code>psql</code>: PostgreSQL interactive terminal</h3><p>Another common way to check if PostgreSQL is ready to accept connections is by using the <code>psql</code> command-line utility, which is also included with the PostgreSQL installation. It allows you to attempt a connection to the server and provides feedback on whether the server is available.</p><h4 id="usage-1">Usage&nbsp;</h4><pre><code>psql -h &lt;HOST_NAME&gt; -p &lt;PORT_NUMBER&gt; -d &lt;DATABASE_NAME&gt; -U &lt;DATABASE_USER&gt;</code></pre><p>Similar to <code>pg_isready</code>, to use <code>psql</code>, specify the required values based on your PostgreSQL installation.</p><p><strong>Note</strong>: After executing the above command with the correct connection parameters, you will be prompted to enter the password for the user specified with the <code>-U</code> switch. If the server is running and the password is correct, <code>psql</code> will then take you to the PostgreSQL prompt. If the server is down or unreachable, you won't be prompted for a password at all; <code>psql</code> exits with a connection error instead. If you are prompted and the login then fails, the server is up but the credentials are incorrect.</p><h4 id="output-1">Output</h4><p>If the connection is successful, you will see a message similar to this, indicating that you have connected to the database:</p><pre><code>/Library/PostgreSQL/15/bin/psql -h localhost -p 5434 -d postgres -U postgres
Password for user postgres:
psql (15.4, server 13.18)
Type "help" for help.

postgres=# \q</code></pre><p>If there’s an issue connecting, an error message will appear, such as:</p><pre><code>/Library/PostgreSQL/15/bin/psql -h localhost -p 5430 -d postgres -U postgres
psql: error: connection to server at "localhost" (::1), port 5430 failed: Connection refused
	Is the server running on that host and accepting TCP/IP connections?
connection to server at "localhost" (127.0.0.1), port 5430 failed: Connection refused
	Is the server running on that host and accepting TCP/IP connections?</code></pre><p>This helps you determine whether the PostgreSQL server is accepting connections and if the specified credentials are correct.</p><p><strong>Note</strong>: Both utilities mentioned above are available on all operating systems in the <code>&lt;POSTGRESQL_INSTALLATION_DIRECTORY&gt;/bin</code> directory.</p><h3 id="listenaddress-parameter-inside-postgresqlconf-file"><code>listen_addresses</code> parameter inside postgresql.conf file&nbsp;</h3><p>If you want to check whether your PostgreSQL server can accept connections from outside <code>localhost</code>, you need to verify the <code>listen_addresses</code> parameter in the postgresql.conf file.&nbsp;</p><p>This parameter defines the IP addresses that PostgreSQL listens on for incoming connections.</p><p>To check and modify the <code>listen_addresses</code> parameter, open the postgresql.conf file, usually located in the PostgreSQL data directory. Look for the <code>listen_addresses</code> line and verify or update its value.</p><pre><code>listen_addresses = '&lt;ADDRESS&gt;'</code></pre><p><code>&lt;ADDRESS&gt;</code>: This can be set to the following values:</p><ul><li><code>'localhost'</code>: PostgreSQL listens only on the loopback interface, so it accepts connections only from the local machine.</li><li>Specific IP addresses (e.g., <code>'192.168.1.100'</code>): PostgreSQL listens only on the listed addresses, which must belong to the server's own network interfaces.</li><li><code>'*'</code>: PostgreSQL listens on all available IP interfaces.</li></ul><p><strong>Note</strong>: After updating the <code>listen_addresses</code> parameter, a restart is required for the PostgreSQL server. Remote clients must also be allowed by a matching entry in <code>pg_hba.conf</code>.</p><h2 id="external-tools-tools-outside-of-postgresql">External Tools: Tools Outside of PostgreSQL</h2><p>We have covered some built-in methods within PostgreSQL for checking database connections. 
Now, let’s shift our focus to external tools and see how they can help achieve the same purpose.</p><h3 id="service">Service</h3><p>We can determine if PostgreSQL is running and accepting connections by checking the status of the PostgreSQL service itself. If the service is active (running), PostgreSQL is operational and ready to accept connections; if it’s stopped or disabled, the database won’t be available for connections.</p><h4 id="macos">macOS</h4><p>On macOS, you can use the <code>launchctl</code> command to check the status of PostgreSQL services.&nbsp;</p><pre><code>sudo launchctl list | grep postgres
321	0	postgresql-13
322	0	postgresql-15
324	0	postgresql-16</code></pre><p>In the example above:</p><ul><li>Three PostgreSQL services (versions 13, 15, and 16) are running.</li><li>The first column (321, 322, 324) shows the process ID (PID) of each service; a numeric PID means the service is currently running.</li><li>The 0 in the second column is the last exit status of each service, so no error has been recorded.</li></ul><h4 id="linux">Linux&nbsp;</h4><p>On Linux, the <code>service</code> utility can be used to check if PostgreSQL is running and ready to accept new connections.&nbsp;</p><p>Here's an example command:</p><pre><code>$ sudo service &lt;PostgreSQL_SERVICE_NAME&gt; status</code></pre><p>Replace <code>&lt;PostgreSQL_SERVICE_NAME&gt;</code> with the name of your PostgreSQL service (for example, <code>postgresql</code>). Sample output:&nbsp;</p><pre><code>● postgresql.service - PostgreSQL RDBMS
Loaded: loaded (/usr/lib/systemd/system/postgresql.service; enabled; preset: enabled)
Active: active (exited) since Tue 2024-12-31 15:32:48 UTC; 5min ago
Main PID: 2926 (code=exited, status=0/SUCCESS)
CPU: 2ms

Dec 31 15:32:48 ip-172-31-17-193 systemd[1]: Starting postgresql.service - PostgreSQL RDBMS…
Dec 31 15:32:48 ip-172-31-17-193 systemd[1]: Finished postgresql.service - PostgreSQL RDBMS.</code></pre><p>In the output above:</p><ul><li>The <code>Active: active (exited)</code> line confirms that the PostgreSQL service is active.</li><li>The <code>Main PID: 2926</code> line shows the process ID associated with the service.</li><li>The timestamps and logs show when the service was started and confirm that it was successfully initialized.</li></ul><p>This command is useful for determining whether PostgreSQL is operational and ready to handle incoming connections. If the service is inactive, you may need to troubleshoot or restart it.</p><h4 id="windows">Windows</h4><p>On Windows, you can use the Windows Service Manager to check if your PostgreSQL service is running. Follow these steps:</p><p>Open the Service Manager:</p><ul><li>Press Win + R to open the Run dialog.</li><li>Type <code>services.msc</code> and hit Enter.</li></ul><p>Locate the PostgreSQL service:</p><ul><li>In the Services window, scroll through the list to find the service named PostgreSQL or a similar name (e.g., postgresql-x64-15).</li></ul><p>Check the service status:</p><ul><li>Look under the Status column for the PostgreSQL service.</li><li>If the status shows Running, it means the PostgreSQL service is up and accepting connections.</li></ul><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/01/How-to-test-your-postgres-connection_windows.png" class="kg-image" alt="" loading="lazy" width="1248" height="657" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2025/01/How-to-test-your-postgres-connection_windows.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2025/01/How-to-test-your-postgres-connection_windows.png 1000w, 
https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/01/How-to-test-your-postgres-connection_windows.png 1248w" sizes="(min-width: 720px) 720px"></figure><p>In the screenshot above, you can see that PostgreSQL 17 is running, which means it is ready to accept connections.</p><h3 id="telnet">Telnet&nbsp;</h3><p>You can use telnet to check if PostgreSQL is running and accepting connections on a specific port. Simply run <code>telnet &lt;HOSTNAME&gt; &lt;PORT&gt;</code> to verify the connection status.</p><pre><code>$ telnet localhost 5432
Trying 127.0.0.1…
Connected to localhost.

$ telnet localhost 5434
Trying 127.0.0.1…
telnet: Unable to connect to remote host: Connection refused</code></pre><p>The output shows the following:</p><ul><li>For localhost 5432: The connection was successful, indicating that a service (likely PostgreSQL) is actively listening on port 5432 and accepting connections.</li><li>For localhost 5434: The connection attempt was refused, meaning there is no service (like PostgreSQL) listening on port 5434, or the service is not running.</li></ul><h3 id="netcat">Netcat</h3><p>Similar to <code>telnet</code>, you can use nc (Netcat) to check if PostgreSQL is running and accepting connections on a specific port.&nbsp;</p><pre><code>nc -zv &lt;HOSTNAME&gt; &lt;PORT&gt;</code></pre><p>After substituting the hostname and port:</p><pre><code>nc -zv localhost 5432
Connection to localhost (127.0.0.1) 5432 port [tcp/postgresql] succeeded!</code></pre><p>This indicates that the connection to PostgreSQL on port 5432 is successful.</p><h3 id="ps"><code>ps</code></h3><p>You can use the <code>ps</code> command to check if PostgreSQL is running by listing the active processes.&nbsp;</p><pre><code>$ ps -ef | grep -i postgres
postgres 3769 1 0 16:13 ? 00:00:00 /usr/lib/postgresql/17/bin/postgres -D /var/lib/postgresql/17/main -c config_file=/etc/postgresql/17/main/postgresql.conf
postgres 3770 3769 0 16:13 ? 00:00:00 postgres: 17/main: checkpointer
postgres 3771 3769 0 16:13 ? 00:00:00 postgres: 17/main: background writer
postgres 3773 3769 0 16:13 ? 00:00:00 postgres: 17/main: walwriter
postgres 3774 3769 0 16:13 ? 00:00:00 postgres: 17/main: autovacuum launcher
postgres 3775 3769 0 16:13 ? 00:00:00 postgres: 17/main: logical replication launcher</code></pre><p>The presence of multiple PostgreSQL processes, such as the <code>checkpointer</code>, <code>background writer</code>, and <code>walwriter</code>, indicates that PostgreSQL is running and ready to accept connections.</p><h2 id="using-code-to-check-connections">Using Code to Check Connections</h2><p>Finally, you can use code in various programming languages like Python, Java, or Node.js to test your PostgreSQL connections. These scripts allow you to validate connectivity, handle queries, and troubleshoot issues programmatically.</p><h3 id="python">Python</h3><p>Python can be used to test PostgreSQL connections using a library like <a href="https://www.psycopg.org/" rel="noreferrer">psycopg</a>. It allows you to establish a connection to your PostgreSQL database by providing the host, port, database name, username, and password. Once connected, you can execute simple queries like <code>SELECT 1</code> to confirm the database is accessible. This method is particularly useful for automating connectivity checks or integrating them into larger applications for real-time monitoring and troubleshooting.</p><p>Install psycopg via pip:</p><pre><code class="language-Python">pip install "psycopg[binary]"</code></pre><p>After installing psycopg, you can use the following code to test the PostgreSQL connection:</p><pre><code class="language-Python">import psycopg
try:
    # Replace placeholders with actual values
    with psycopg.connect(
        dbname="postgres",
        user="postgres",
        host="localhost",
        port=5432,
        password="your_password",
    ) as db:
        # A trivial query confirms the server answers, not just that the socket opened
        db.execute("SELECT 1")
    print("Connection successful!")
except psycopg.OperationalError as e:
    print(f"Error: Unable to connect to the database. Details: {e}")
    raise SystemExit(1)</code></pre><p>The above code will print <code>Connection successful!</code> in case of success.</p><h3 id="java">Java</h3><p>You can use Java to test your PostgreSQL connection by leveraging the JDBC API. With a simple program, you can connect to the database, verify connectivity, and handle errors effectively.</p><h4 id="step-1-install-java">Step 1: Install Java</h4><pre><code class="language-Java">sudo apt install openjdk-17-jdk  # For Linux (Ubuntu)</code></pre><h4 id="step-2-download-the-postgresql-jdbc-driver">Step 2: Download the PostgreSQL JDBC driver&nbsp;</h4><p>Use this <a href="https://jdbc.postgresql.org/"><u>link</u></a> to download the latest version of the driver.</p><h4 id="step-3-write-java-code">Step 3: Write Java code&nbsp;</h4><p>Create a new file named <code>PostgresConnectionTest.java</code> and paste this Java code into the file:</p><pre><code class="language-Java">import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class PostgresConnectionTest {
    public static void main(String[] args) {
        // Database credentials
        String url = "jdbc:postgresql://localhost:5432/testdb"; // Update as needed
        String user = "testuser"; // Replace with your username
        String password = "testpassword"; // Replace with your password

        // Test connection
        try (Connection connection = DriverManager.getConnection(url, user, password)) {
            if (connection != null) {
                System.out.println("Connected to the PostgreSQL server successfully!");
            } else {
                System.out.println("Failed to connect to the PostgreSQL server.");
            }
        } catch (SQLException e) {
            System.out.println("An error occurred while connecting to PostgreSQL:");
            e.printStackTrace();
        }
    }
}
</code></pre><h4 id="step-4-compile-the-java-code">Step 4: Compile the Java code&nbsp;&nbsp;</h4><pre><code class="language-Java">javac -cp .:postgresql-&lt;version&gt;.jar PostgresConnectionTest.java</code></pre><h4 id="step-5-test-connection">Step 5: Test connection</h4><pre><code class="language-Java">java -cp .:postgresql-42.7.4.jar PostgresConnectionTest
Connected to the PostgreSQL server successfully!</code></pre><p>As we can see, our Java program successfully connects to the PostgreSQL instance.</p><h3 id="bash">Bash</h3><p>You can test PostgreSQL connections directly from a Bash script by using the <code>psql</code> command with connection parameters.&nbsp;</p><h4 id="step-1-prepare-a-new-file">Step 1: Prepare a new file&nbsp;</h4><pre><code class="language-Bash">touch test-pg-connection.sh
chmod +x test-pg-connection.sh</code></pre><h4 id="step-2-write-code">Step 2: Write code&nbsp;</h4><pre><code class="language-Bash">#!/bin/bash
# Define your connection parameters
HOST="localhost"
PORT="5432"
USER="postgres"
DATABASE="postgres"
# Test connection
psql -h "$HOST" -p "$PORT" -U "$USER" -d "$DATABASE" -c "SELECT 1" &gt; /dev/null 2&gt;&amp;1

if [ $? -eq 0 ]; then
  echo "PostgreSQL is up and running, connection successful!"
else
  echo "Failed to connect to PostgreSQL."
fi
</code></pre><h4 id="step-3-test-the-connection">Step 3: Test the connection</h4><pre><code class="language-Bash">./test-pg-connection.sh
Password for user postgres:
PostgreSQL is up and running, connection successful!</code></pre><p>As we can see, the message is a success, which means our database is ready to accept new connections. 😎&nbsp;</p><h2 id="connect-to-postgresql">Connect to PostgreSQL</h2><p>Testing the PostgreSQL connection is simple and can be done using various tools, from built-in utilities like <code>pg_isready</code> to programming languages like Python or Java.&nbsp;For a smooth experience, it's equally important to ensure your database is ready to accept connections—in this article, we've shown you how.</p><p>Check out this article for a complete framework on <a href="https://www.timescale.com/blog/5-common-connection-errors-in-postgresql-and-how-to-solve-them" rel="noreferrer">how to troubleshoot (and fix!) common PostgreSQL connection errors</a>. And if you're looking to supercharge your PostgreSQL database for large and demanding workloads, like time series, real-time analytics, events, and vector data, <a href="https://docs.timescale.com/self-hosted/latest/install/" rel="noreferrer">give TimescaleDB a try</a>. </p><p></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Handling Billions of Rows in PostgreSQL]]></title>
            <description><![CDATA[Here’s how to scale PostgreSQL to handle billions of rows using Timescale compression (columnstore) and chunk-skipping indexes.
]]></description>
            <link>https://www.tigerdata.com/blog/handling-billions-of-rows-in-postgresql</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/handling-billions-of-rows-in-postgresql</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Semab Tariq]]></dc:creator>
            <pubDate>Fri, 10 Jan 2025 18:09:53 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/01/Handling-billions-of-rows-in-PostgreSQL_small.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/01/Handling-billions-of-rows-in-PostgreSQL_small.png" alt="A comparison table with three different methods to handle billions of rows in PostgreSQL" /><p>Handling a table with billions of rows in PostgreSQL (or any relational database) can be challenging due to the high level of data complexity, the significant amount of storage space consumed, and performance issues with more complex or analytical queries. </p><p>These challenges can all be solved by enabling columnstore (which compresses data) in Timescale and by using Timescale’s chunk-skipping indexes. Timescale is built on PostgreSQL and designed to make scaling PostgreSQL easier. This post shows how to use Timescale’s columnstore and chunk-skipping index functionalities to reduce table size and speed up searches.&nbsp;</p><p>Here’s the methodology we’ll follow. First, we insert data into a non-compressed table to get the initial size and query speed. Then, we compare these results with a compressed table. Let's dive in.</p><p>We will use PostgreSQL on <a href="https://www.timescale.com/cloud"><u>Timescale Cloud</u></a>—a fully managed database service designed to handle time-series data efficiently. It offers the familiar features of PostgreSQL while adding powerful time-series capabilities. 
</p><p>Features include automatic scaling, high availability, and various performance optimizations, making it easier for developers to store, manage, and query large volumes of time-series data without worrying about infrastructure management.&nbsp;</p><p>Here are the instance details that I used for these tests:</p><ul><li>Instance type: Time series&nbsp;</li><li>CPU: 4 cores</li><li>RAM: 16 GB</li></ul><h2 id="benchmarking-uncompressed-table">Benchmarking Uncompressed Table</h2><p>First, we create a PostgreSQL heap table named <code>sensors_uncompressed</code> in the time-series database and ingest one billion rows into it. After that, we check its statistics, including table size and <code>SELECT</code> query performance.</p><h3 id="step-1-create-a-table">Step 1: Create a table</h3><pre><code class="language-SQL">CREATE TABLE sensors_uncompressed (
  sensor_id INTEGER, 
  ts TIMESTAMPTZ NOT NULL, 
  value REAL
);
</code></pre><h3 id="step-2-create-an-index">Step 2: Create an index</h3><pre><code class="language-SQL">CREATE INDEX sensors_ts_idx_uncompressed ON sensors_uncompressed (sensor_id, ts DESC);</code></pre><h3 id="step-3-ingest-data">Step 3: Ingest data</h3><p>The dataset was placed on an AWS S3 bucket, so we used the <a href="https://github.com/timescale/timescaledb-parallel-copy"><u><code>timescaledb-parallel-copy</code></u></a> utility to ingest data inside the table. <code>timescaledb-parallel-copy</code> is a command line program for parallelizing PostgreSQL's built-in <code>COPY</code> functionality for bulk-inserting data into <a href="https://github.com/timescale/timescaledb/"><u>TimescaleDB</u></a>.</p><pre><code class="language-SQL">curl https://ts-devrel.s3.amazonaws.com/sensors.csv.gz |gunzip | timescaledb-parallel-copy -batch-size 5000 -connection $DATABASE_URI -table sensors_uncompressed -workers 4 -split '\t' </code></pre><p>Here are some statistics after successfully ingesting one billion rows into the PostgreSQL heap table.</p><ul><li>Time taken to ingest the data: 49 min 12 sec</li><li>Total table size, including index and data: 101 GB</li></ul><h3 id="step-4-running-aggregate-queries">Step 4: Running aggregate queries&nbsp;</h3><p>The goal is to compare query execution times by running various scaled aggregate queries on both compressed and uncompressed tables, observing how compressed tables perform in relation to uncompressed ones.</p><h4 id="query-1">Query 1</h4><pre><code class="language-SQL">SELECT * FROM sensors_uncompressed 
WHERE sensor_id = 0 
AND ts &gt;= '2023-12-21 07:15:00'::timestamp 
AND ts &lt;= '2023-12-21 07:16:00'::timestamp;
-- Execution Time: 38 ms
</code></pre><h4 id="query-2">Query 2</h4><pre><code class="language-SQL">SELECT sensor_id, DATE_TRUNC('day', ts) AS day, MAX(value) AS max_value, MIN(value) AS min_value 
FROM sensors_uncompressed 
WHERE ts &gt;= DATE '2023-12-21' AND ts &lt; DATE '2023-12-22'
GROUP BY sensor_id, DATE_TRUNC('day', ts) 
ORDER BY sensor_id, day;
-- Execution Time: 6 min 31 sec
</code></pre><h4 id="query-3">Query 3</h4><pre><code class="language-SQL">SELECT sensor_id, ts, value 
FROM sensors_uncompressed 
WHERE ts &gt;= '2023-12-21 07:15:00' 
AND ts &lt; '2023-12-21 07:20:00' 
ORDER BY value DESC 
LIMIT 5;
-- Execution Time: 6 min 24 sec
</code></pre><h2 id="benchmarking-compressed-hypertable">Benchmarking Compressed Hypertable</h2><p>It is now time to gather statistics for a compressed hypertable (a PostgreSQL table that automatically partitions data by time) utilizing Timescale's columnstore method.</p><h3 id="step-1-create-a-table-1">Step 1: Create a table</h3><pre><code class="language-SQL">CREATE TABLE sensors_compressed (
  sensor_id INTEGER, 
  ts TIMESTAMPTZ NOT NULL, 
  value REAL
);
</code></pre><h3 id="step-2-create-an-index-1">Step 2: Create an index</h3><pre><code class="language-SQL">CREATE INDEX sensors_ts_idx_compressed ON sensors_compressed (sensor_id, ts DESC);</code></pre><h3 id="step-3-convert-to-hypertable">Step 3: Convert to hypertable</h3><pre><code class="language-SQL">SELECT create_hypertable('sensors_compressed', by_range('ts', INTERVAL '1 hour'));</code></pre><h3 id="step-4-enable-columnstore-compression">Step 4: Enable columnstore / compression</h3><pre><code class="language-SQL">ALTER TABLE sensors_compressed SET (timescaledb.compress, timescaledb.compress_segmentby = 'sensor_id');</code></pre><h3 id="step-5-add-compression-policy">Step 5: Add compression policy</h3><pre><code class="language-SQL">SELECT add_compression_policy('sensors_compressed', INTERVAL '24 hour');</code></pre><h3 id="step-6-ingest-data">Step 6: Ingest data</h3><pre><code class="language-SQL">curl https://ts-devrel.s3.amazonaws.com/sensors.csv.gz |gunzip | timescaledb-parallel-copy -batch-size 5000 -connection $DATABASE_URI -table sensors_compressed -workers 4 -split '\t' </code></pre><p>Here are the statistics after successfully ingesting one billion rows into the hypertable with compression enabled.</p><ul><li>Time taken to ingest the data: 1 hr 03 min 21 sec</li><li>Total table size, including index and data: 5.5 GB</li></ul><h3 id="step-7-running-aggregate-queries">Step 7: Running aggregate queries</h3><h4 id="query-1-1">Query 1</h4><pre><code class="language-SQL">SELECT * FROM sensors_compressed 
WHERE sensor_id = 0 
AND ts &gt;= '2023-12-21 07:15:00'::timestamp 
AND ts &lt;= '2023-12-21 07:16:00'::timestamp;
-- Execution Time: 20 ms
</code></pre><h4 id="query-2-1">Query 2</h4><pre><code class="language-SQL">SELECT sensor_id, DATE_TRUNC('day', ts) AS day, MAX(value) AS max_value, MIN(value) AS min_value 
FROM sensors_compressed 
WHERE ts &gt;= DATE '2023-12-21' AND ts &lt; DATE '2023-12-22'
GROUP BY sensor_id, DATE_TRUNC('day', ts) 
ORDER BY sensor_id, day;
-- Execution Time: 5 min
</code></pre><h4 id="query-3-1">Query 3</h4><pre><code class="language-SQL">SELECT sensor_id, ts, value 
FROM sensors_compressed 
WHERE ts &gt;= '2023-12-21 07:15:00' 
AND ts &lt; '2023-12-21 07:20:00' 
ORDER BY value DESC 
LIMIT 5;
-- Execution Time: 4.4 sec
</code></pre><h3 id="key-takeaways">Key takeaways</h3><ul><li><strong>Storage efficiency</strong>: After enabling compression, <strong>the table size was reduced by approximately 95&nbsp;%</strong> (from 101 GB to 5.5 GB).</li><li>Query 1 is <strong>47.37&nbsp;% faster</strong> on the compressed table.</li><li>Query 2 is <strong>23&nbsp;% faster</strong> on the compressed table.</li><li>Query 3 is <strong>98.83&nbsp;% faster</strong> on the compressed table.</li></ul><p>These results demonstrate the significant advantages of using TimescaleDB's compression feature, both in terms of storage savings and improved query performance.</p><h2 id="chunk-skipping-in-timescale">Chunk-Skipping in Timescale</h2><p><a href="https://www.timescale.com/blog/boost-postgres-performance-by-7x-with-chunk-skipping-indexes/"><u>Timescale’s chunk-skipping indexes</u></a> (available as of TimescaleDB 2.16.0) can speed up PostgreSQL queries even further. Not every query filters on the partitioning column(s), and without metadata about the other columns PostgreSQL can’t exclude any partitions, which leads to slow queries. This feature lets developers use per-chunk metadata to dynamically prune and exclude partitions (called chunks) during planning or execution.</p><p>Chunk-skipping indexes optimize query performance by allowing us to bypass irrelevant chunks when searching through large datasets.&nbsp;</p><p>In TimescaleDB, data is organized into time-based chunks, each representing a subset of the overall hypertable. 
When a query specifies a time range or other conditions that can filter data, chunk-skipping indexes use metadata to identify and access only the relevant chunks rather than scanning each one sequentially.&nbsp;</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/01/Handling-billions-of-rows-in-PostgreSQL_hypertables.png" class="kg-image" alt="A diagram illustrating how hypertables partition data into smaller data partitions or chunks" loading="lazy" width="1200" height="675" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2025/01/Handling-billions-of-rows-in-PostgreSQL_hypertables.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2025/01/Handling-billions-of-rows-in-PostgreSQL_hypertables.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/01/Handling-billions-of-rows-in-PostgreSQL_hypertables.png 1200w" sizes="(min-width: 720px) 720px"></figure><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/01/Boosting-Postgres-Performace_hypertable-with-chunk-skipping-index.png" class="kg-image" alt="A diagram illustrating how a hypertable works with chunk-skipping" loading="lazy" width="1998" height="982" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2025/01/Boosting-Postgres-Performace_hypertable-with-chunk-skipping-index.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2025/01/Boosting-Postgres-Performace_hypertable-with-chunk-skipping-index.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2025/01/Boosting-Postgres-Performace_hypertable-with-chunk-skipping-index.png 1600w, 
https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/01/Boosting-Postgres-Performace_hypertable-with-chunk-skipping-index.png 1998w" sizes="(min-width: 720px) 720px"></figure><p>This targeted access minimizes disk I/O and computational overhead, making queries faster and more efficient, especially in hypertables with billions of rows.&nbsp;</p><p>Let's create a table named <code>product_orders</code> with columns for order details, such as IDs, timestamps, quantity, total, address, and statuses.</p><pre><code class="language-SQL">CREATE TABLE product_orders (
	order_id serial,
	order_date timestamptz,
	customer_id int,
	product_id int,
	quantity int,
	order_total float,
	shipping_address text,
	payment_status text,
	order_status text 
);
</code></pre><h3 id="convert-to-hypertable">Convert to hypertable</h3><p>Transform the <code>product_orders</code> table into a TimescaleDB hypertable, partitioned by <code>order_date</code> with four-day intervals.</p><pre><code class="language-SQL">SELECT create_hypertable('product_orders', 'order_date', chunk_time_interval=&gt;'4 day'::interval);
</code></pre><h3 id="ingest-data">Ingest data</h3><p>To ingest data, we will use a query that generates 50 million rows of dummy order data, simulating one order per minute starting from January 1, 2023. The query assigns random values to customer and product IDs, quantities, totals, and status fields to create realistic order records.</p><pre><code class="language-SQL">WITH time_series AS (
    SELECT generate_series(
        '2023-01-01 00:00:00'::timestamptz,
        '2023-01-01 00:00:00'::timestamptz + interval '50000000 minutes',
        '1 minute'::interval
    ) AS order_date
)
INSERT INTO product_orders (
    order_date, customer_id, product_id, quantity, order_total, 
    shipping_address, payment_status, order_status
)
SELECT
    order_date,
    (random() * 1000)::int + 1 AS customer_id,
    (random() * 100)::int + 1 AS product_id,
    (random() * 10 + 1)::int AS quantity,
    (random() * 500 + 10)::float AS order_total,
    '123 Example St, Example City' AS shipping_address,
    CASE WHEN random() &gt; 0.1 THEN 'Completed' ELSE 'Pending' END AS payment_status,
    CASE WHEN random() &gt; 0.2 THEN 'Shipped' ELSE 'Pending' END AS order_status
FROM time_series;
</code></pre><p>Once the data ingestion is complete, let's execute a simple <code>SELECT</code> statement to measure the time taken for the query to execute.</p><pre><code class="language-SQL">tsdb=&gt; select * from product_orders where order_id = 50000000;
order_id | order_date | customer_id | product_id | quantity | order_total | shipping_address | payment_status | order_status
----------+------------------------+-------------+------------+----------+-------------------+------------------------------+----------------+--------------
50000000 | 2117-01-24 12:33:00+00 | 515 | 14 | 9 | 61.00540537187403 | 123 Example St, Example City | Completed | Shipped
(1 row)
Time: 42049.154 ms (00:42.049)
</code></pre><p>Currently, there is no index on the <code>order_id</code> column, which is why the query took nearly 42 seconds to execute.</p><h3 id="add-index">Add index</h3><p>Let's see if we can reduce the 42 seconds by creating a <a href="https://www.timescale.com/learn/database-indexes-in-postgres" rel="noreferrer">B-tree index</a> on the <code>order_id</code> column.</p><pre><code class="language-SQL">create index order_id on product_orders (order_id);
</code></pre><p>After creating the index, let's rerun the <code>SELECT</code> query and check if the execution time is reduced from 42 seconds.</p><pre><code class="language-SQL">tsdb=&gt; select * from product_orders where order_id = 50000000;
order_id | order_date | customer_id | product_id | quantity | order_total | shipping_address | payment_status | order_status
----------+------------------------+-------------+------------+----------+-------------------+------------------------------+----------------+--------------
50000000 | 2117-01-24 12:33:00+00 | 515 | 14 | 9 | 61.00540537187403 | 123 Example St, Example City | Completed | Shipped
(1 row)
Time: 9684.318 ms (00:09.684)</code></pre><p>Great! After creating the index, the execution time dropped to just under 10 seconds, which is a significant improvement. Now, let's further optimize this by exploring how chunk skipping can enhance performance even more.</p><h2 id="enable-chunk-skipping-index">Enable Chunk-Skipping Index</h2><p>To take advantage of the chunk-skipping index, we first need to enable chunk skipping on the table and then compress it. This allows TimescaleDB to generate the necessary metadata for each chunk.</p><pre><code class="language-SQL">ALTER TABLE product_orders SET (timescaledb.compress);
SELECT enable_chunk_skipping('product_orders', 'order_id');
SELECT compress_chunk(show_chunks('product_orders'));
</code></pre><p>After enabling chunk skipping and enabling columnstore (which compresses data), let's rerun the same <code>SELECT</code> query to observe the performance improvement.</p><pre><code class="language-SQL">select * from product_orders where order_id = 50000000;
order_id | order_date | customer_id | product_id | quantity | order_total | shipping_address | payment_status | order_status
----------+------------------------+-------------+------------+----------+-------------------+------------------------------+----------------+--------------
50000000 | 2117-01-24 12:33:00+00 | 515 | 14 | 9 | 61.00540537187403 | 123 Example St, Example City | Completed | Shipped
(1 row)
Time: 304.133 ms
</code></pre><p>Wow! <strong>The query now executes in just 304 ms</strong>, resulting in a <strong>99.28&nbsp;%</strong> improvement compared to the initial execution time without an index and a <strong>96.86&nbsp;%</strong> performance boost compared to the PostgreSQL index. That's a significant difference!</p>
<!--kg-card-begin: html-->
<table class="bg-bg-100 min-w-full border-separate border-spacing-0 text-sm leading-[1.88888]"><thead class="border-b-border-100/50 border-b-[0.5px] text-left"><tr class="[tbody>&amp;]:odd:bg-bg-500/10"><th class="text-text-000 [&amp;:not(:first-child)]:-x-[hsla(var(--border-100) / 0.5)] font-400 px-2 [&amp;:not(:first-child)]:border-l-[0.5px]">Query Optimization Method</th><th class="text-text-000 [&amp;:not(:first-child)]:-x-[hsla(var(--border-100) / 0.5)] font-400 px-2 [&amp;:not(:first-child)]:border-l-[0.5px]">Execution Time</th><th class="text-text-000 [&amp;:not(:first-child)]:-x-[hsla(var(--border-100) / 0.5)] font-400 px-2 [&amp;:not(:first-child)]:border-l-[0.5px]">Performance Improvement (vs. No Index)</th></tr></thead><tbody><tr class="[tbody>&amp;]:odd:bg-bg-500/10"><td class="border-t-border-100/50 [&amp;:not(:first-child)]:-x-[hsla(var(--border-100) / 0.5)] border-t-[0.5px] px-2 [&amp;:not(:first-child)]:border-l-[0.5px]">No Index</td><td class="border-t-border-100/50 [&amp;:not(:first-child)]:-x-[hsla(var(--border-100) / 0.5)] border-t-[0.5px] px-2 [&amp;:not(:first-child)]:border-l-[0.5px]">42,049 ms (≈42 sec)</td><td class="border-t-border-100/50 [&amp;:not(:first-child)]:-x-[hsla(var(--border-100) / 0.5)] border-t-[0.5px] px-2 [&amp;:not(:first-child)]:border-l-[0.5px]">Baseline</td></tr><tr class="[tbody>&amp;]:odd:bg-bg-500/10"><td class="border-t-border-100/50 [&amp;:not(:first-child)]:-x-[hsla(var(--border-100) / 0.5)] border-t-[0.5px] px-2 [&amp;:not(:first-child)]:border-l-[0.5px]">With B-tree Index</td><td class="border-t-border-100/50 [&amp;:not(:first-child)]:-x-[hsla(var(--border-100) / 0.5)] border-t-[0.5px] px-2 [&amp;:not(:first-child)]:border-l-[0.5px]">9,684 ms (≈9.7 sec)</td><td class="border-t-border-100/50 [&amp;:not(:first-child)]:-x-[hsla(var(--border-100) / 0.5)] border-t-[0.5px] px-2 [&amp;:not(:first-child)]:border-l-[0.5px]">77% faster</td></tr><tr class="[tbody>&amp;]:odd:bg-bg-500/10"><td class="border-t-border-100/50 
[&amp;:not(:first-child)]:-x-[hsla(var(--border-100) / 0.5)] border-t-[0.5px] px-2 [&amp;:not(:first-child)]:border-l-[0.5px]">With Chunk-Skipping Index + Columnstore (Compression)</td><td class="border-t-border-100/50 [&amp;:not(:first-child)]:-x-[hsla(var(--border-100) / 0.5)] border-t-[0.5px] px-2 [&amp;:not(:first-child)]:border-l-[0.5px]">304 ms (0.3 sec)</td><td class="border-t-border-100/50 [&amp;:not(:first-child)]:-x-[hsla(var(--border-100) / 0.5)] border-t-[0.5px] px-2 [&amp;:not(:first-child)]:border-l-[0.5px]">99.28% faster</td></tr></tbody></table>
<!--kg-card-end: html-->
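<p>The percentages above follow from a simple ratio of the measured timings. Here is a minimal Python sketch of that arithmetic. The helper name <code>percent_faster</code> is our own illustration (it is not part of PostgreSQL or TimescaleDB), and the inputs are the <code>Time:</code> values reported by psql in this post:</p>
<pre><code class="language-Python">def percent_faster(baseline_ms, optimized_ms):
    """Return the reduction in execution time as a percentage of the baseline."""
    return (baseline_ms - optimized_ms) / baseline_ms * 100


# Timings copied from the psql output in this post (milliseconds)
no_index_ms = 42049.154
btree_ms = 9684.318
chunk_skipping_ms = 304.133

# B-tree index compared with no index
print(f"{percent_faster(no_index_ms, btree_ms):.2f}%")           # prints 76.97%
# Chunk skipping + columnstore compared with no index
print(f"{percent_faster(no_index_ms, chunk_skipping_ms):.2f}%")  # prints 99.28%
# Chunk skipping + columnstore compared with the B-tree index
print(f"{percent_faster(btree_ms, chunk_skipping_ms):.2f}%")     # prints 96.86%
</code></pre>
<p>Keep in mind that these are single measurements on one instance, so your own percentages will likely differ with hardware, caching, and data distribution.</p>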
<p>In conclusion, using TimescaleDB's key features—like hypertables, columnstore, and chunk-skipping indexes—can greatly improve PostgreSQL performance: </p><ul><li>Hypertables help you manage large amounts of data more easily while keeping everything organized. </li><li>Columnstore reduces storage space and speeds up your queries by cutting the amount of data that needs to be read. </li><li>Chunk-skipping indexes also accelerate query performance by ignoring unnecessary data. </li></ul><p>Together, these features make it easier to work with time-series data, events, and real-time analytics. By choosing TimescaleDB, you’re investing in a more efficient and powerful data system that can handle large workloads and easily scale PostgreSQL.</p><p>To get started, sign up for a <a href="https://console.cloud.timescale.com/signup" rel="noreferrer"><u>free Timescale Cloud account</u></a><u>.</u></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to Tell What Port PostgreSQL Is Running On]]></title>
            <description><![CDATA[Looking for your PostgreSQL port amidst a haze of config files? Here’s how to identify what port PostgreSQL is running on for Linux, Windows, or macOS users.]]></description>
            <link>https://www.tigerdata.com/blog/how-to-tell-what-port-postgresql-is-running-on</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/how-to-tell-what-port-postgresql-is-running-on</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Semab Tariq]]></dc:creator>
            <pubDate>Mon, 06 Jan 2025 20:25:37 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/01/How-to-Tell-What-Port-PostgreSQL-Is-Running-On_windows-2-1.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/01/How-to-Tell-What-Port-PostgreSQL-Is-Running-On_windows-2-1.png" alt="A Windows directory where you can find the port where a Postgres database is running on" /><p>Finding your PostgreSQL port might seem like a simple task, but it can sometimes feel like searching for a needle in a haystack of configuration files and system settings.&nbsp;</p><p>The challenge often stems from the fact that PostgreSQL can be installed in various ways (package managers, manual installation, Docker containers), and each method might set different default ports or store configuration information in different locations. Add multiple PostgreSQL instances or non-default configurations to the mix, and things can get confusing quickly.</p><p>In today’s article, I will try to answer some frequently asked questions about:</p><ul><li>The real need for system ports</li><li>Why use a non-default port for PostgreSQL?</li></ul><p>Next, we will explore various methods to determine the port PostgreSQL is running on across different platforms, including Linux, Windows, and macOS.</p><p>Let's get started!</p><h1 id="understanding-the-real-need-for-system-ports">Understanding the Real Need for System Ports</h1><p>In a network, the system port number plays a crucial role in ensuring that your application connects to the right service on a device. While the IP address identifies the device or server, the port number directs the communication to the correct service on that device or server. Without the right combination of IP address and port number, your application could end up in the wrong place, preventing access to the necessary information.</p><p>Ports are indeed necessary for network applications—they're like specific doors or endpoints that applications use to communicate over a network. 
Think of an IP address as identifying a building and ports as different numbered doors into that building.</p><p>The most abstract level where systems typically work with ports is at the Transport Layer (Layer 4) of the <a href="https://en.wikipedia.org/wiki/OSI_model"><u>OSI (open systems interconnection) model</u></a>, specifically with protocols like TCP (transmission control protocol) and UDP (user datagram protocol).</p><p>Let's break down the concepts of host and port with a simple example. Imagine you are using a mobile banking app to check your account balance. Here’s how the communication happens behind the scenes:</p><ul><li>When you log in to your banking app, it needs to connect to the bank’s server to retrieve your account details. However, the app doesn’t directly know the server's IP address. It queries a DNS (domain name system) to resolve the server's hostname (e.g., api.mybank.com) into its corresponding IP address.</li><li>Once the app has the server's IP address, it uses this to establish a connection. However, the IP address alone isn’t sufficient. It also needs to know the correct port to send requests to. Ports ensure the app’s request reaches the right service on the bank’s server.</li><li>The bank’s server might have an API running on port 443 for secure HTTPS traffic. 
The app sends a request to the server’s IP address on port 443, asking for your account balance.</li><li>The bank’s server listens for incoming HTTPS requests on port 443, processes the authentication and request data, retrieves your account balance, and sends the response back to the app.</li><li>Once the app receives the response, it displays your account balance on the screen, showing you the requested information.</li></ul><p>The final address would look something like this: <code>&lt;IP_ADDRESS:PORT_NUMBER&gt;</code> (e.g., 203.0.113.10:443).</p><p>Here, the IP address identifies the bank’s server, while the port number (443) ensures the request is directed to the secure HTTPS API service. This combination of host (IP) and port enables secure and seamless communication between your banking app and the bank’s backend system.</p><h2 id="which-types-of-applications-need-ports-to-operate">Which Types of Applications Need Ports to Operate?</h2><p>Applications that need to communicate over a network typically require ports to operate. Ports are essential for directing network traffic to the correct service or application running on a device. The following types of applications require ports to operate.</p><h3 id="iot-internet-of-things-devices">IoT (Internet of Things) devices</h3><p>IoT devices often use ports for communication with central hubs or servers, such as port 1883 for MQTT (Message Queuing Telemetry Transport).</p><h3 id="web-servers-httphttps">Web servers (HTTP/HTTPS)</h3><p>Web servers like Apache, Nginx, and IIS use port 80 for HTTP and port 443 for HTTPS to handle incoming web traffic.</p><h3 id="database-servers">Database servers</h3><p>Applications like PostgreSQL, MySQL, and MongoDB require specific ports to allow remote connections. For instance, PostgreSQL uses port 5432, while MySQL uses port 3306.</p><h3 id="email-servers">Email servers</h3><p>Email protocols use specific ports to handle communication between email clients and servers. 
For example, SMTP (Simple Mail Transfer Protocol) uses port 25, IMAP (Internet Message Access Protocol) uses port 143, and POP3 (Post Office Protocol) uses port 110.</p><h3 id="file-transfer-protocol-ftp">File transfer protocol (FTP)</h3><p>FTP servers require port 21 to facilitate file transfers between a client and a server.</p><h3 id="remote-desktop-applications">Remote desktop applications</h3><p>Applications like RDP (remote desktop protocol) use port 3389 to allow remote access to computers or servers.</p><p>Now that we have developed a better understanding of what a system port is and its role in network communication, let's explore why it might be necessary to change the default PostgreSQL port (5432).</p><h2 id="why-use-a-non-default-port-for-postgresql">Why Use a Non-Default Port for PostgreSQL?</h2><p>Running PostgreSQL on a non-default port can have several advantages—as seen in our previous examples. Here are some of the benefits: </p><h3 id="enhanced-security-by-hiding-the-default-port">Enhanced security by hiding the default port</h3><p>By changing the default port (5432), you add an additional layer of security. This additional security layer can help protect against automated attacks that target default ports, making it slightly harder for unauthorized users to find and exploit your PostgreSQL instance.</p><h3 id="avoiding-port-conflicts">Avoiding port conflicts</h3><p>If multiple PostgreSQL instances are running on the same server, changing the port can prevent conflicts and ensure the smooth operation of all services.</p><h3 id="network-segmentation-and-compliance">Network segmentation and compliance</h3><p>Running PostgreSQL on a different port can support network segmentation strategies, isolating services for security and performance purposes. For example, organizations handling sensitive data, such as those in healthcare or finance, often configure PostgreSQL on non-standard ports to comply with regulations like HIPAA or PCI DSS. 
</p><p>These regulations may require enhanced measures to restrict unauthorized access and ensure critical services are less predictable to potential attackers. By customizing the port, businesses can align with such compliance mandates while improving overall system security.</p><h2 id="identifying-the-postgresql-port-on-linux">Identifying the PostgreSQL Port on Linux</h2><p>There are several methods to find the port PostgreSQL is using on a Linux system. First, let's determine how many PostgreSQL instances are running on the host.</p><p><strong>Command</strong></p><pre><code>ps -ef | grep -i postgres</code></pre><p><strong>Output</strong></p><pre><code>postgres 3785 1 0 09:49 ? 00:00:00 /usr/lib/postgresql/14/bin/postgres -D /var/lib/postgresql/14/main -c config_file=/etc/postgresql/14/main/postgresql.conf

postgres 6316 1 0 09:55 ? 00:00:00 /usr/lib/postgresql/15/bin/postgres -D /var/lib/postgresql/15/main -c config_file=/etc/postgresql/15/main/postgresql.conf

postgres 7205 1 0 09:55 ? 00:00:00 /usr/lib/postgresql/17/bin/postgres -D /var/lib/postgresql/17/main -c config_file=/etc/postgresql/17/main/postgresql.conf</code></pre><p>From the above output, we can get the following information:</p><ul><li>PostgreSQL 14<ul><li>Installation directory = /usr/lib/postgresql/14/bin</li><li>Configuration file = /etc/postgresql/14/main/postgresql.conf</li><li>Process ID = 3785</li></ul></li><li>PostgreSQL 15<ul><li>Installation directory = /usr/lib/postgresql/15/bin</li><li>Configuration file = /etc/postgresql/15/main/postgresql.conf</li><li>Process ID = 6316</li></ul></li><li>PostgreSQL 17<ul><li>Installation directory = /usr/lib/postgresql/17/bin</li><li>Configuration file = /etc/postgresql/17/main/postgresql.conf</li><li>Process ID = 7205</li></ul></li></ul><p>So, in short, we have three PostgreSQL instances running on the host.</p><p>The next step is to identify the port on which each PostgreSQL instance is running.</p><p><strong>Command</strong></p><pre><code>sudo netstat -plnt | grep postgres</code></pre><p><strong>Output</strong></p><pre><code>tcp 0 0 127.0.0.1:5432 0.0.0.0:* LISTEN 3785/postgres
tcp 0 0 127.0.0.1:5433 0.0.0.0:* LISTEN 6316/postgres
tcp 0 0 127.0.0.1:5434 0.0.0.0:* LISTEN 7205/postgres</code></pre><p>Now, we can map the process IDs from the <code>netstat</code> output to the output of the previous <code>ps</code> command to retrieve the following details.</p><ul><li>Process 3785 (PostgreSQL 14) is running on port 5432.</li><li>Process 6316 (PostgreSQL 15) is running on port 5433.</li><li>Process 7205 (PostgreSQL 17) is running on port 5434.</li></ul><p>Simple, right? Easy to follow!</p><p>If the <code>netstat</code> command is not found, install the net-tools package first by running:</p><pre><code>sudo apt install net-tools -y</code></pre><p>This will ensure that you have the necessary tools (like <code>netstat</code>) available to check the ports.</p><h3 id="identifying-the-ports-without-internet">Identifying the ports without internet</h3><p>Let's assume you don't have <code>netstat</code> installed on your system, and you also don't have internet access to install the package. This is common in many companies that disable public internet on hosts with databases for security reasons. In this case, you can check the postgresql.conf file to find out the ports.</p><p>From the above <code>ps</code> command, we already have the paths of the PostgreSQL configuration files:</p><ul><li>/etc/postgresql/14/main/postgresql.conf</li><li>/etc/postgresql/15/main/postgresql.conf</li><li>/etc/postgresql/17/main/postgresql.conf</li></ul><p>You can easily determine the port for each PostgreSQL instance by checking its postgresql.conf file. Use the following commands to identify the port numbers (the line after each command is its output):</p><pre><code>cat /etc/postgresql/14/main/postgresql.conf | grep "port\ ="
port = 5432    # (change requires restart)

cat /etc/postgresql/15/main/postgresql.conf | grep "port\ ="
port = 5433    # (change requires restart)

cat /etc/postgresql/17/main/postgresql.conf | grep "port\ ="
port = 5434    # (change requires restart)</code></pre><p>From this, we can see that PostgreSQL 14 is using the default port 5432, while PostgreSQL 15 is using 5433, and PostgreSQL 17 is using 5434.</p><h3 id="identifying-the-port-from-log-files">Identifying the port from log files</h3><p>Let's assume that the PostgreSQL port was set on the command line when the server was started with the pg_ctl command, so the postgresql.conf file holds no information about it. In that case, the port setting in the postgresql.conf file might still be commented out:</p><pre><code>#port = 5432</code></pre><p>In such cases, you can find the manually configured port by checking the log files. Here are the commands you can run to determine the port (the line after each command is its output):</p><pre><code>cat /var/log/postgresql/postgresql-14-main.log | grep port
2024-12-15 09:49:59.478 UTC [3785] LOG: listening on IPv4 address "127.0.0.1", port 5432

cat /var/log/postgresql/postgresql-15-main.log | grep port
2024-12-15 09:55:41.046 UTC [6316] LOG: listening on IPv4 address "127.0.0.1", port 5433

cat /var/log/postgresql/postgresql-17-main.log | grep port
2024-12-15 09:55:45.495 UTC [7205] LOG: listening on IPv4 address "127.0.0.1", port 5434
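# A further option, sketched here as an assumption beyond the walkthrough above
# (it may help when the startup lines have already rotated out of the log):
# the fourth line of postmaster.pid in the data directory records the port of the running server.
# The data directory path is the -D argument from the earlier ps output; sudo is used
# because the data directory is normally readable only by the postgres user.
sudo sed -n '4p' /var/lib/postgresql/17/main/postmaster.pid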
</code></pre><p>By default, PostgreSQL log files are located in the <code>/var/log/postgresql</code> directory.</p><h2 id="identifying-the-postgresql-port-on-windows">Identifying the PostgreSQL Port on Windows</h2><p>On Windows, there are several methods you can use to retrieve PostgreSQL port information. Here are the details:</p><h3 id="identifying-the-port-with-the-installation-summary-log-file">Identifying the port with the installation summary log file</h3><p>On Windows, the default installation directory for PostgreSQL is <code>C:\Program Files\PostgreSQL</code>. Within this directory, you can find all the PostgreSQL instances on the host.&nbsp;</p><p>For example, in the image below, you can see that I have two PostgreSQL servers installed: versions 14 and 17.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/12/How-to-Tell-What-Port-PostgreSQL-Is-Running-On_Windows-1.png" class="kg-image" alt="" loading="lazy" width="1245" height="334" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2024/12/How-to-Tell-What-Port-PostgreSQL-Is-Running-On_Windows-1.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2024/12/How-to-Tell-What-Port-PostgreSQL-Is-Running-On_Windows-1.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/12/How-to-Tell-What-Port-PostgreSQL-Is-Running-On_Windows-1.png 1245w" sizes="(min-width: 720px) 720px"></figure><p>If we navigate to the 14 directory, we can find a file named <code>installation_summary.log</code>.&nbsp;</p><p>Inside this file, you can see the port on which PostgreSQL is running. 
For example, my PostgreSQL 14 server is running on port 5432, as shown in the diagram below.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/12/How-to-Tell-What-Port-PostgreSQL-Is-Running-On_windows-2.png" class="kg-image" alt="" loading="lazy" width="1252" height="545" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2024/12/How-to-Tell-What-Port-PostgreSQL-Is-Running-On_windows-2.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2024/12/How-to-Tell-What-Port-PostgreSQL-Is-Running-On_windows-2.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/12/How-to-Tell-What-Port-PostgreSQL-Is-Running-On_windows-2.png 1252w" sizes="(min-width: 720px) 720px"></figure><p>Similarly, PostgreSQL 17 is running on port 5433.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/12/How-to-Tell-What-Port-PostgreSQL-Is-Running-On_windows-3.png" class="kg-image" alt="" loading="lazy" width="1251" height="559" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2024/12/How-to-Tell-What-Port-PostgreSQL-Is-Running-On_windows-3.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2024/12/How-to-Tell-What-Port-PostgreSQL-Is-Running-On_windows-3.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/12/How-to-Tell-What-Port-PostgreSQL-Is-Running-On_windows-3.png 1251w" sizes="(min-width: 720px) 720px"></figure><h3 id="identifying-the-port-via-powershell">Identifying the port via PowerShell</h3><p>Another option is to use PowerShell. 
Open PowerShell and run the following command:</p><pre><code>Get-NetTCPConnection | Select-Object LocalPort, OwningProcess | ForEach-Object { $_ | Add-Member -MemberType NoteProperty -Name ProcessName -Value (Get-Process -Id $_.OwningProcess).Name -PassThru }</code></pre><p>The screenshot below shows that two PostgreSQL processes are running, along with their ports and process IDs.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/12/How-to-Tell-What-Port-PostgreSQL-Is-Running-On_powershell.png" class="kg-image" alt="" loading="lazy" width="1246" height="483" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2024/12/How-to-Tell-What-Port-PostgreSQL-Is-Running-On_powershell.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2024/12/How-to-Tell-What-Port-PostgreSQL-Is-Running-On_powershell.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/12/How-to-Tell-What-Port-PostgreSQL-Is-Running-On_powershell.png 1246w" sizes="(min-width: 720px) 720px"></figure><p>As the screenshot above shows, process ID 6628 is using port 5433. If we substitute 6628 for <code>&lt;PID&gt;</code> in the following command, we can see that this process belongs to PostgreSQL 17, so PostgreSQL 17 is using port 5433:</p><pre><code>Get-Process -Id &lt;PID&gt; | Format-List *</code></pre><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/12/How-to-Tell-What-Port-PostgreSQL-Is-Running-On_powershell-2.png" class="kg-image" alt="" loading="lazy" width="932" height="797" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2024/12/How-to-Tell-What-Port-PostgreSQL-Is-Running-On_powershell-2.png 600w, 
https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/12/How-to-Tell-What-Port-PostgreSQL-Is-Running-On_powershell-2.png 932w" sizes="(min-width: 720px) 720px"></figure><p>In the same way, we can see that PostgreSQL 14 is running on port 5432.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/12/How-to-Tell-What-Port-PostgreSQL-Is-Running-On_powershell-3.png" class="kg-image" alt="" loading="lazy" width="931" height="838" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2024/12/How-to-Tell-What-Port-PostgreSQL-Is-Running-On_powershell-3.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/12/How-to-Tell-What-Port-PostgreSQL-Is-Running-On_powershell-3.png 931w" sizes="(min-width: 720px) 720px"></figure><h2 id="identifying-the-postgresql-port-on-macos">Identifying the PostgreSQL Port on macOS</h2><p>Just like on Linux, to identify your port on macOS, let's first check how many PostgreSQL instances are running on our system by using the following command:</p><pre><code>ps -ef | grep -i postgres

502 319 1 0 19Nov24 ?? 0:14.51 /Library/PostgreSQL/15/bin/postmaster -D /Library/PostgreSQL/15/data

502 321 1 0 19Nov24 ?? 0:39.32 /Library/PostgreSQL/16/bin/postgres -D /Library/PostgreSQL/16/data</code></pre><p>I have two instances running: PostgreSQL 15 and PostgreSQL 16.</p><h3 id="identifying-the-port-with-the-installation-summary-log-file-1">Identifying the port with the installation summary log file</h3><div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-emoji">🔖</div><div class="kg-callout-text"><b><strong style="white-space: pre-wrap;">Note: </strong></b>This method is useful if PostgreSQL was installed with EnterpriseDB’s macOS installers.</div></div><p></p><p>Similar to Windows, on macOS, you will find a file called <code>installation_summary.log</code>, which shows the port number. By default, PostgreSQL is installed in the following directory on macOS:</p><pre><code>ls -lhrt /Library/PostgreSQL/
total 0
drwxr-xr-x@ 19 postgres daemon 608B Aug 4 21:48 16
drwxr-xr-x@ 18 postgres daemon 576B Nov 5 10:22 15
</code></pre><p>To find the port number, run the following command:</p><pre><code>cat /Library/PostgreSQL/15/installation_summary.log | grep "Database\ Port"
Database Port: 5433

cat /Library/PostgreSQL/16/installation_summary.log | grep "Database\ Port"
Database Port: 5432</code></pre><p>So, PostgreSQL 15 is running on port 5433, and PostgreSQL 16 is running on port 5432.</p><h3 id="identify-the-port-using-the-process-id">Identify the port using the process ID</h3><p>Another way to find the port is by using the process ID (PID).</p><p>From the <code>ps</code> command, we know that PostgreSQL 15 has the process ID 319. Now, run the following command:</p><pre><code>netstat -anv | grep 319

tcp4 0 0 *.5433 *.* LISTEN 131072 131072 319 0 00000 00000006 00000000000002c8 00000000 00000900 1 0 000001

tcp6 0 0 *.5433 *.* LISTEN 131072 131072 319 0 00000 00000006 00000000000002c7 00000000 00000800 1 0 000001</code></pre><p>This command shows that PostgreSQL 15 is running on port 5433.</p><p>For PostgreSQL 16, the process ID is 321. Run the following command:</p><pre><code>netstat -anv | grep 321

tcp4 0 0 *.5432 *.* LISTEN 131072 131072 321 0 00000 00000006 00000000000002c4 00000000 00000900 1 0 000001

tcp6 0 0 *.5432 *.* LISTEN 131072 131072 321 0 00000 00000006 00000000000002c3 00000000 00000800 1 0 000001</code></pre><p>This command shows that PostgreSQL 16 is running on port 5432.</p><h2 id="identifying-the-postgresql-port-on-timescale-cloud">Identifying the PostgreSQL Port on Timescale Cloud</h2><p>When we launch an instance on <a href="https://docs.timescale.com/"><u>Timescale Cloud</u></a>, the configuration provides a connection string that looks like this:</p><pre><code>postgres://&lt;USER_NAME&gt;:&lt;PASSWORD&gt;@&lt;HOST&gt;:38081/&lt;DATABASE&gt;?sslmode=require</code></pre><p>Here, <strong>38081</strong> is the port being used.</p><p>To check the port using the dashboard, simply select your service from the dashboard.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/12/How-to-Tell-What-Port-PostgreSQL-Is-Running-On_timescale-cloud.png" class="kg-image" alt="" loading="lazy" width="1248" height="237" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2024/12/How-to-Tell-What-Port-PostgreSQL-Is-Running-On_timescale-cloud.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2024/12/How-to-Tell-What-Port-PostgreSQL-Is-Running-On_timescale-cloud.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/12/How-to-Tell-What-Port-PostgreSQL-Is-Running-On_timescale-cloud.png 1248w" sizes="(min-width: 
720px) 720px"></figure><p>Scroll down to the <strong>Connect to your service section</strong>. Under this section, you'll find the port number where your instance is running.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/01/How-to-Tell-What-Port-PostgreSQL-Is-Running-On_timescale-cloud-connect-service.png" class="kg-image" alt="" loading="lazy" width="892" height="708" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2025/01/How-to-Tell-What-Port-PostgreSQL-Is-Running-On_timescale-cloud-connect-service.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/01/How-to-Tell-What-Port-PostgreSQL-Is-Running-On_timescale-cloud-connect-service.png 892w" sizes="(min-width: 720px) 720px"></figure><h2 id="conclusion">Conclusion</h2><p>Finding the port number for your PostgreSQL instance may seem tricky at first, but with the right tools and commands, it can be a straightforward process. Whether you're a Linux, Windows, or macOS user, I shared simple ways to identify the ports your PostgreSQL servers are running on. By following these steps, you can easily manage your databases and troubleshoot any connection issues. 
Remember, knowing your system’s configuration is key to keeping everything running smoothly.&nbsp;</p><p>If you’re bumping into connection issues with your PostgreSQL database, you may find these resources helpful:</p><ul><li><a href="https://www.timescale.com/blog/5-common-connection-errors-in-postgresql-and-how-to-solve-them"><u>5 Common Connection Errors in PostgreSQL and How to Solve Them</u></a></li><li><a href="https://www.timescale.com/blog/connecting-to-postgres-with-psql-and-pg_service-conf"><u>Connecting to PostgreSQL With psql and .pg_service.conf</u></a></li></ul><p>For more info on PostgreSQL errors, guides to scale your PostgreSQL performance, or best practices, check our <a href="https://www.timescale.com/developers"><u>Learn PostgreSQL section</u></a>.&nbsp;</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[A Sneak Peek Into the State of PostgreSQL 2024]]></title>
            <description><![CDATA[The 2024 State of PostgreSQL survey report is out! Read its first findings.
]]></description>
            <link>https://www.tigerdata.com/blog/state-of-postgresql-2024</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/state-of-postgresql-2024</guid>
            <category><![CDATA[State of PostgreSQL]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Ana Tavares]]></dc:creator>
            <pubDate>Tue, 17 Dec 2024 14:49:07 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/12/SOP2024_cover.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/12/SOP2024_cover.png" alt="The 2024 State of PostgreSQL logo: a yellow, pixelated elephant." /><p>Software trends may come and go, but PostgreSQL continues to be a bastion of resilience and innovation. With more than 35 years of active development under its belt, this relational database has established itself as a cornerstone of the open-source ecosystem—and as part of its community, we’re happy to celebrate it once more by sharing the results of the fifth edition of the State of PostgreSQL survey. 🎉</p><p>The <em>2024 State of PostgreSQL</em> survey ran for two months (September 1 through October 31), and 688 people provided responses. If you answered our questions, shared the survey on social media, or sent it to a friend—thank you! Your support is everything to us.♥️</p><p>So, sit back and grab your beverage of choice as we highlight 2024’s trends in PostgreSQL adoption, usage, and community engagement. Drum roll, please. 🥁 Here are some key insights from the <em>2024 State of PostgreSQL</em> survey:</p><ul><li><strong>Decline in new user adoption:</strong> There are fewer new PostgreSQL users. Only 4.1&nbsp;% of respondents reported less than one year of experience with PostgreSQL, down from 8.1&nbsp;% in 2023.</li><li><strong>AI tools on the rise:</strong> 55.3&nbsp;% of PostgreSQL developers now use AI tools, a sharp increase from 36.9&nbsp;% in 2023. This highlights AI's growing role in development workflows. Check our State of PostgreSQL AI blog post to learn more about how PostgreSQL users are building with AI.</li><li><strong>Versatility across use cases:</strong> 60&nbsp;% of respondents use PostgreSQL for both personal and professional projects, a significant 20&nbsp;% increase from last year.</li></ul><p>Curious about what else we uncovered? 
👀 Read on for a comprehensive dive into the State of PostgreSQL in 2024, and don’t forget to <a href="https://www.timescale.com/state-of-postgres/2024" rel="noreferrer">check out the full report</a>!</p><div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-emoji">✨</div><div class="kg-callout-text"><b><strong style="white-space: pre-wrap;">About the State of PostgreSQL</strong></b><br><i><em class="italic" style="white-space: pre-wrap;">Timescale’s love for PostgreSQL, one of the world’s most advanced open-source databases with 35+ years of history, runs deep.</em></i><a href="http://www.timescale.com/?ref=timescale.com"> <u><i><em class="italic underline" style="white-space: pre-wrap;">We built our products on PostgreSQL</em></i></u></a><i><em class="italic" style="white-space: pre-wrap;">, we love</em></i><a href="https://www.timescale.com/developers?ref=timescale.com"> <u><i><em class="italic underline" style="white-space: pre-wrap;">enabling other developers to use this reliable technology</em></i></u></a><i><em class="italic" style="white-space: pre-wrap;">, and </em></i><a href="https://www.timescale.com/blog/scaling-postgresql-to-petabyte-scale/"><u><i><em class="italic underline" style="white-space: pre-wrap;">we wouldn’t exist without it and the extensibility it provides</em></i></u></a><i><em class="italic" style="white-space: pre-wrap;">.</em></i><br><br><i><em class="italic" style="white-space: pre-wrap;">In 2019, Timescale launched the first State of PostgreSQL report, advancing our desire to provide greater insights into the vibrant and growing PostgreSQL user base. The report provides valuable insights into this great community, from whether respondents use PostgreSQL for work or personal projects (or both!) to their favorite PostgreSQL tools, features, and information sources. 
Following a one-year hiatus due to the pandemic, we resumed the annual survey in 2021.</em></i> <i><em class="italic" style="white-space: pre-wrap;">This is the</em></i><a href="https://www.timescale.com/state-of-postgres/2024" rel="noreferrer"><i><em class="italic" style="white-space: pre-wrap;"> fifth State of PostgreSQL report</em></i></a><i><em class="italic" style="white-space: pre-wrap;">. Check out the</em></i><a href="https://drive.google.com/drive/folders/14elckaNv7FLKyWhzp3JKd3tH6PvI9F45?usp=sharing&amp;ref=timescale.com"> <u><i><em class="italic underline" style="white-space: pre-wrap;">second</em></i></u></a><i><em class="italic" style="white-space: pre-wrap;">,</em></i><a href="https://s3.amazonaws.com/assets.timescale.com/resources/state_of_postgres/State_of_PostgreSQL_2022_Full_Report.pdf?ref=timescale.com"> <u><i><em class="italic underline" style="white-space: pre-wrap;">third</em></i></u></a><i><em class="italic" style="white-space: pre-wrap;">, and </em></i><a href="https://www.timescale.com/state-of-postgres/2023?ref=timescale.com"><u><i><em class="italic underline" style="white-space: pre-wrap;">fourth</em></i></u></a><i><em class="italic" style="white-space: pre-wrap;"> report editions.</em></i></div></div><h2 id="the-state-of-postgresql-2024-demographics">The State of PostgreSQL 2024 Demographics</h2><p>Let’s start by finding out who are this year’s State of PostgreSQL respondents.</p><h3 id="what-is-your-primary-geographic-location">What is your primary geographic location? 
</h3><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/12/SOP_2024_geo-location-1.png" class="kg-image" alt="A world map with the respondents geo location by percentage" loading="lazy" width="1304" height="745" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2024/12/SOP_2024_geo-location-1.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2024/12/SOP_2024_geo-location-1.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/12/SOP_2024_geo-location-1.png 1304w" sizes="(min-width: 720px) 720px"></figure><p>For another year, respondents from <strong>EMEA (Europe, Middle East, Africa)</strong> dominated the survey, representing over half of all participants. <strong>North America</strong> remains steady at 25&nbsp;%, while <strong>APAC (Asia-Pacific)</strong> saw a notable dip, dropping from 12&nbsp;% to 7&nbsp;%. These shifts underline PostgreSQL's stronghold in EMEA and opportunities for growth in APAC.</p><h3 id="how-long-have-you-been-using-postgresql">How long have you been using PostgreSQL? 
</h3><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/12/SOP2024_usage-time.png" class="kg-image" alt="A line graph on how long respondents have been using PostgreSQL" loading="lazy" width="1305" height="743" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2024/12/SOP2024_usage-time.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2024/12/SOP2024_usage-time.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/12/SOP2024_usage-time.png 1305w" sizes="(min-width: 720px) 720px"></figure><p>Seasoned PostgreSQL users are on the rise. Respondents with <strong>15+ years of experience</strong> surged to 21&nbsp;% from 11&nbsp;% in 2023, while those with <strong>10-15 years of experience</strong> also saw an uptick. However, <strong>new user adoption is declining sharply,</strong> with users having less than two years of experience dropping from 23.8&nbsp;% in 2023 to just 12.5&nbsp;% this year.</p><h3 id="what-is-your-current-profession-or-job-status">What is your current profession or job status?</h3><p>The top industries using PostgreSQL remain <strong>Software/SaaS (21&nbsp;%), Information Technology (18&nbsp;%),</strong> and <strong>Finance/Fintech (18&nbsp;%).</strong> Healthcare/Pharmaceuticals entered the top five for the first time, highlighting PostgreSQL’s growing appeal in diverse fields.</p><p>In terms of job roles, <strong>Backend Software Developers (28&nbsp;%), Fullstack Developers (16&nbsp;%),</strong> and <strong>Database Administrators (12&nbsp;%)</strong> lead the way.</p><h2 id="the-postgresql-community">The PostgreSQL Community</h2><h3 id="how-did-you-first-find-out-about-postgresql">How did you first find out about PostgreSQL?</h3><figure class="kg-card kg-image-card"><img 
src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/12/SOP2024_first-find-out-postgresql.png" class="kg-image" alt="A bar graph on how respondents first found out about PostgreSQL. Work or colleague is #1." loading="lazy" width="1680" height="1808" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2024/12/SOP2024_first-find-out-postgresql.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2024/12/SOP2024_first-find-out-postgresql.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2024/12/SOP2024_first-find-out-postgresql.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/12/SOP2024_first-find-out-postgresql.png 1680w" sizes="(min-width: 720px) 720px"></figure><p>Workplace environments are becoming a primary introduction point for PostgreSQL, with 30&nbsp;% of respondents reporting they learned about it from colleagues or work settings. This marks a small but steady increase over last year’s 28&nbsp;%. Technical forums and online communities remain stable at 6&nbsp;%, while 25&nbsp;% of respondents—second only to workplace discovery—can’t recall where they first encountered PostgreSQL.</p><h3 id="have-you-ever-contributed-to-postgresql">Have you ever contributed to PostgreSQL?</h3><p><em><strong>Note</strong>: In 2024, we added Developer Relations to the multiple-choice options.</em></p><p>This year, the question "Have you contributed to PostgreSQL?" expanded to include multiple contribution options. 
While <strong>58&nbsp;% of respondents</strong> reported no contributions, the remaining 42&nbsp;% have engaged with PostgreSQL in diverse ways:</p><ul><li><strong>Advocacy</strong>: 17&nbsp;%</li><li><strong>Bug reporting</strong>: 16&nbsp;%</li><li><strong>Hosting user groups or meetups</strong>: 11&nbsp;%</li><li><strong>Documentation</strong>: 9&nbsp;%</li></ul><p>These findings highlight the many ways developers contribute beyond just writing code, enriching the PostgreSQL ecosystem.</p><h3 id="how-would-you-rate-your-ability-to-connect-with-the-postgresql-community">How would you rate your ability to connect with the PostgreSQL community?</h3><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/12/SOP2024-ability-to-connect-to-community.png" class="kg-image" alt="A bar graph displaying the respondents' take on their ability to connect with the PostgreSQL community" loading="lazy" width="1680" height="1580" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2024/12/SOP2024-ability-to-connect-to-community.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2024/12/SOP2024-ability-to-connect-to-community.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2024/12/SOP2024-ability-to-connect-to-community.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/12/SOP2024-ability-to-connect-to-community.png 1680w" sizes="(min-width: 720px) 720px"></figure><p>Compared with last year, respondents are finding it <strong>slightly easier to connect with the community</strong>, with Medium (43&nbsp;%) and Extremely easy (18&nbsp;%) responses up by two percentage points from 2023.</p><p>A total of 384 respondents answered bonus questions, shedding 
light on what they like the most about the PostgreSQL community:</p><h3 id="in-your-experience-what%E2%80%99s-the-best-thing-about-the-postgresql-community-what-do-you-like-the-most">In your experience, what’s the best thing about the PostgreSQL community / what do you like the most? </h3><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/12/SOP2024_community-opinions.png" class="kg-image" alt="Three quotes from community members on the best thing about the PostgreSQL community" loading="lazy" width="936" height="428" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2024/12/SOP2024_community-opinions.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/12/SOP2024_community-opinions.png 936w" sizes="(min-width: 720px) 720px"></figure><h2 id="ecosystem-and-tools">Ecosystem and Tools</h2><p>PostgreSQL’s ecosystem is one of its strongest assets, offering a rich array of tools and extensions that enhance its already robust feature set. This year’s survey explored which PostgreSQL features, complementary tools, and extensions resonate most with users. 
Here's what the PostgreSQL community highlighted in 2024.</p><h3 id="what-is-your-favorite-postgresql-feature">What is your favorite PostgreSQL feature?</h3><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/12/SOP2024_fav-feature.png" class="kg-image" alt="A word cloud of the community's favorite PostgreSQL features" loading="lazy" width="935" height="373" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2024/12/SOP2024_fav-feature.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/12/SOP2024_fav-feature.png 935w" sizes="(min-width: 720px) 720px"></figure><p>When asked about their favorite PostgreSQL features, respondents overwhelmingly pointed to <strong>extensibility</strong>, which remains PostgreSQL’s standout capability. This was followed by <strong>JSON support</strong>, prized for its flexibility in managing semi-structured data, and <strong>replication</strong>, a key enabler of high availability and fault tolerance.</p><p>👉 Interested in <a href="https://www.timescale.com/learn/postgresql-database-replication-guide"><u>PostgreSQL replication</u></a>? Check out our guide.</p><h3 id="what-other-tools-do-you-use-that-complement-postgresql">What other tools do you use that complement PostgreSQL?</h3><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/12/SOP2024-other-tools.png" class="kg-image" alt="A list with the answers to the question &quot;What other tools do you use that complement PostgreSQL?&quot; In order: TimescaleDB, Redis, pgBouncer, Patroni, AWS RDS, pgAdmin, PostGIS, Grafana, Barman, Docker, Kubernetes." 
loading="lazy" width="1680" height="695" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2024/12/SOP2024-other-tools.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2024/12/SOP2024-other-tools.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2024/12/SOP2024-other-tools.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/12/SOP2024-other-tools.png 1680w" sizes="(min-width: 720px) 720px"></figure><p>Respondents highlighted the tools they most frequently use alongside PostgreSQL, with <a href="https://github.com/timescale/timescaledb"><strong><u>TimescaleDB</u></strong></a> emerging as the top choice for time-series and analytics workloads. This was followed by <strong>Redis</strong>, often used for caching and real-time applications, and <strong>PgBouncer</strong>, valued for its connection pooling capabilities.</p><p>👉 Get our <a href="https://www.timescale.com/blog/using-pgbouncer-to-improve-your-postgresql-database-performance/"><u>Support team’s advice on PgBouncer</u></a>.</p><h3 id="what-are-your-top-three-favorite-or-most-frequently-used-postgresql-extensions">What are your top three favorite or most frequently used PostgreSQL extensions?</h3><p>The top three PostgreSQL extensions this year demonstrate its versatility across industries and use cases. For the second consecutive year, <a href="https://www.timescale.com/learn/postgresql-extensions-postgis"><strong><u>PostGIS</u></strong></a> led the pack, favored for its advanced geospatial data capabilities. 
<strong>Pg_stat_statements</strong> remains a favorite for performance monitoring, while <a href="https://github.com/timescale/timescaledb"><strong><u>TimescaleDB</u></strong></a> rounds out the top three for its time-series database capabilities.</p><p>👉 Learn how to <a href="https://www.timescale.com/blog/using-pg-stat-statements-to-optimize-queries/"><u>optimize your queries with pg_stat_statements</u></a>.</p><h2 id="read-the-report">Read the Report</h2><p>We hope you enjoyed this sneak peek of our <em>State of PostgreSQL 2024 </em>survey! If you’d like to learn more insights about the PostgreSQL community<em>, </em>including why respondents chose PostgreSQL, where they go to find jobs requiring PostgreSQL experience, and <a href="https://www.timescale.com/blog/ai-state-of-postgresql-2024" rel="noreferrer">how they use AI with PostgreSQL</a>, don’t miss our complete <a href="https://www.timescale.com/state-of-postgres/2024" rel="noreferrer"><em>2024</em> <em>State of PostgreSQL </em>report</a>.&nbsp;</p><p>Finally, a genuine word of appreciation to all the remarkable partners who have collaborated with us on this survey and helped us capture the collective experience of developers using PostgreSQL:</p><p><strong>Community members</strong>: <a href="https://vyruss.org/computing/"><u>Jimmy Angelakos</u></a>, <a href="https://www.linkedin.com/in/andyatkinson/"><u>Andrew Atkinson</u></a>, <a href="https://www.linkedin.com/in/ryanbooz/"><u>Ryan Booz</u></a>, <a href="https://www.softwareandbooz.com/"><u>Software &amp; Booz</u></a>, <a href="https://www.linkedin.com/in/elizabeth-garrett-christensen?lipi=urn%3Ali%3Apage%3Ad_flagship3_profile_view_base_contact_details%3BhMC98Y4mSqi4mp%2B8PBn4Fw%3D%3D"><u>Elizabeth Christensen</u></a>, <a href="https://www.linkedin.com/in/henrietta-dombrovskaya-367b26?lipi=urn%3Ali%3Apage%3Ad_flagship3_profile_view_base_contact_details%3B2GsMvizjSs21BHfoj03Aig%3D%3D"><u>Henrietta Dombrovskaya</u></a>, <a 
href="https://floor.dev/"><u>Floor Drees</u></a>, <a href="https://pgstef.github.io/"><u>Stefan Fercot</u></a>, <a href="https://github.com/hunleyd"><u>Douglas Hunley</u></a>, <a href="https://www.linkedin.com/in/gulcinyildirim?lipi=urn%3Ali%3Apage%3Ad_flagship3_profile_view_base_contact_details%3BvOnSt0fQRlKOpC26iX6p7g%3D%3D"><u>Gülçin Yıldırım Jelinek</u></a>, <a href="https://www.linkedin.com/in/valeriakaplan?lipi=urn%3Ali%3Apage%3Ad_flagship3_profile_view_base_contact_details%3B4iExj06cSMuGxklmLGnTvQ%3D%3D"><u>Valeria Kaplan</u></a>, <a href="http://www.jk-consult.nl/"><u>Jan Karremans</u></a>, <a href="https://www.linkedin.com/in/philipmarks?lipi=urn%3Ali%3Apage%3Ad_flagship3_profile_view_base_contact_details%3Bjah4K90WTnaLYWGn9bd7ng%3D%3D"><u>Philip Marks</u></a>, <a href="https://www.linkedin.com/in/doug-ortiz-illustris?lipi=urn%3Ali%3Apage%3Ad_flagship3_profile_view_base_contact_details%3B7Y7fZ%2BnCRJiPBmXRkFBDeQ%3D%3D"><u>Doug Ortiz</u></a>, <a href="https://www.youtube.com/@techbits-dougortiz"><u>Tech Bits</u></a>, <a href="https://www.techravenconsulting.com"><u>Steven Pousty</u></a>, <a href="https://www.linkedin.com/in/anastasia-raspopina/"><u>Anastasia Raspopina</u></a>, <a href="https://www.linkedin.com/in/daniel-sarosi-2197902?lipi=urn%3Ali%3Apage%3Ad_flagship3_profile_view_base_contact_details%3BhfhOegllSu%2BcWy2Mhz%2Fl%2Bw%3D%3D"><u>Daniel Sarosi</u></a>, Jeremy Schneider, <a href="https://www.linkedin.com/in/sjstoelting?lipi=urn%3Ali%3Apage%3Ad_flagship3_profile_view_base_contact_details%3Bht3v74mjQTu215t6vtOO2A%3D%3D"><u>Stefanie Janine Stölting</u></a>, <a href="https://bonesmoses.org"><u>Shaun Thomas</u></a></p><p><strong>Companies</strong>: <a href="https://aiven.io/"><u>Aiven</u></a>, <a href="https://www.basedash.com/"><u>Basedash</u></a> <a href="https://www.cybertec-postgresql.com/en/"><u>CYBERTEC</u></a>, <a href="https://www.data-bene.io/en/"><u>Data Bene</u></a>, <a href="https://www.datacloudgaze.com/"><u>DataCloudGaze</u></a>, <a 
href="https://www.dataegret.com/"><u>Data Egret</u></a>, <a href="https://device-insight.com/en/"><u>Device Insight</u></a>, <a href="https://www.enterprisedb.com/"><u>EDB</u></a>, <a href="https://www.kmon.net/"><u>KM.ON</u></a>, <a href="https://www.mga.com.au/"><u>Mark Gurry Associates</u></a>, <a href="https://neon.tech/"><u>Neon</u></a>, <a href="https://www.paradedb.com/"><u>ParadeDB</u></a>, <a href="https://postgresweekly.com/"><u>PG Weekly</u></a>, <a href="https://plotly.com/python/"><u>Plotly</u></a>, <a href="https://proopensource.eu/"><u>ProOpenSource</u></a>, <a href="https://www.simplyblock.io/"><u>simplyblock</u></a>, <a href="https://tembo.io/"><u>Tembo</u></a>, <a href="https://www.timbira.com.br/"><u>Timbira</u></a>, <a href="https://trebellar.com/"><u>Trebellar</u></a>, <a href="https://www.umh.app/"><u>United Manufacturing Hub</u></a>, <a href="https://xata.io/"><u>Xata</u></a></p><p><strong>Communities</strong>: Barcelona PostgreSQL User Groups, <a href="https://www.linkedin.com/in/kad%C4%B1n-yaz%C4%B1l%C4%B1mc%C4%B1-9b3a89275?lipi=urn%3Ali%3Apage%3Ad_flagship3_profile_view_base_contact_details%3Br30Zn%2FN2Q6O0JMoDGoY7NA%3D%3D"><u>Kadin Yazilimci</u></a>, Madrid PostgreSQL User Groups, <a href="https://github.com/hunleyd"><u>PgDay CMH</u></a>, <a href="https://www.meetup.com/Chicago-PostgreSQL-User-Group"><u>PG Day Chicago</u></a>, <a href="https://www.meetup.com/prague-postgresql-meetup/events/"><u>Prague PostgreSQL Meetup</u></a></p><p>Thank you for amplifying our reach and enabling us to connect with more developers across various channels!</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Counter Analytics in PostgreSQL: Beyond Simple Data Denormalization]]></title>
            <description><![CDATA[Record counting on demand or denormalized counters? We break down the two and show you an alternative using PostgreSQL.]]></description>
            <link>https://www.tigerdata.com/blog/counter-analytics-in-postgresql-beyond-simple-data-denormalization</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/counter-analytics-in-postgresql-beyond-simple-data-denormalization</guid>
            <category><![CDATA[PostgreSQL Tips]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Jônatas Davi Paganini]]></dc:creator>
            <pubDate>Wed, 04 Dec 2024 21:42:37 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/12/Counter-Analytics-in-PostgreSQL-Beyond-Simple-Data-Denormalization.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/12/Counter-Analytics-in-PostgreSQL-Beyond-Simple-Data-Denormalization.png" alt="Counter Analytics in PostgreSQL: Beyond Simple Data Denormalization" /><p>If you've been working with PostgreSQL, you've probably seen memes advocating for denormalized counters instead of counting related records on demand. The debate usually looks like this:</p><pre><code class="language-SQL">-- The "don't do this" approach: counting related records on demand
SELECT COUNT(*) FROM post_likes WHERE post_id = $1;
-- The "do this instead" approach: maintaining a denormalized counter
SELECT likes_count FROM posts WHERE post_id = $1;</code></pre><p>Let's break down these approaches. In the first approach, we calculate the like count by scanning the <code>post_likes</code> table each time we need the number. In the second approach, we maintain a pre-calculated counter in the <code>posts</code> table which we update whenever someone likes or unlikes a post.</p><p>The denormalized counter approach is often recommended for OLTP (online transaction processing) workloads because it trades write overhead for read performance. Instead of executing a potentially expensive <code>COUNT</code> query that needs to scan the entire <code>post_likes</code> table, we can quickly fetch a pre-calculated number.&nbsp;</p><p>This is particularly valuable in social media applications, where like counts are frequently displayed but rarely updated—you're showing like counts on posts much more frequently than users are actually liking posts.</p><p>However, when we enter the world of time-series data and high-frequency updates, this conventional wisdom needs a second look. 
Let me share an example that made me reconsider this approach while working with a PostgreSQL database optimized for time series via the <a href="https://github.com/timescale/timescaledb"><u>TimescaleDB extension</u></a>.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/12/Counter-Analytics-in-PostgreSQL-Beyond-Simple-Data-Denormalization_Justin-Bieber.png" class="kg-image" alt="" loading="lazy" width="789" height="661" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2024/12/Counter-Analytics-in-PostgreSQL-Beyond-Simple-Data-Denormalization_Justin-Bieber.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/12/Counter-Analytics-in-PostgreSQL-Beyond-Simple-Data-Denormalization_Justin-Bieber.png 789w" sizes="(min-width: 720px) 720px"><figcaption><a href="https://www.linkedin.com/posts/moez-zhioua_ever-heard-of-the-justin-bieber-problem-in-activity-7266724225112047617-CLLe/"><u><span class="underline" style="white-space: pre-wrap;">Source</span></u></a></figcaption></figure><p>While this advice might make sense for traditional OLTP workloads, when working with time-series data in TimescaleDB, we need to take a different approach to data modeling.</p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">🔖</div><div class="kg-callout-text">To learn more about <a href="https://www.timescale.com/learn/data-modeling-on-postgresql"><u>data modeling in PostgreSQL, check out our guide</u></a>.</div></div><h2 id="counter-analytics-vs-data-denormalization-and-its-limitations">Counter Analytics vs. Data Denormalization and Its Limitations</h2><p>Let's start with a common scenario: tracking post likes in a social media application. 
The traditional data denormalization approach might look like this:</p><pre><code class="language-SQL">-- Traditional table structure
CREATE TABLE posts (
    id SERIAL PRIMARY KEY,
    content TEXT,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    likes_count INTEGER DEFAULT 0
);

CREATE TABLE post_likes (
    post_id INTEGER REFERENCES posts(id),
    user_id INTEGER,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    PRIMARY KEY (post_id, user_id)
);
</code></pre><p>With this structure, every like operation requires two writes in a single transaction, an insert and a counter update:</p><pre><code class="language-SQL">-- When a user likes a post
BEGIN;
INSERT INTO post_likes (post_id, user_id) VALUES (1, 123);
UPDATE posts SET likes_count = likes_count + 1 WHERE id = 1;
COMMIT;
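
-- Hedged sketch (not from the original post): the matching "unlike" path,
-- assuming the same posts/post_likes schema as above. The data-modifying CTE
-- decrements the counter only if a like row was actually deleted, so a
-- repeated "unlike" cannot push likes_count out of sync.
WITH removed AS (
    DELETE FROM post_likes
    WHERE post_id = 1 AND user_id = 123
    RETURNING post_id
)
UPDATE posts
SET likes_count = likes_count - 1
WHERE id IN (SELECT post_id FROM removed);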
</code></pre><h3 id="the-hidden-costs-of-data-denormalization">The hidden costs of data denormalization</h3><p>While this might seem efficient at first glance, it introduces two main problems:</p><p>1.&nbsp;&nbsp;<strong>VACUUM overhead</strong>: Every update to <code>likes_count</code> creates a new version of the row in the <code>posts</code> table. PostgreSQL's MVCC (<a href="https://timescale.ghost.io/blog/how-to-reduce-your-postgresql-database-size/#:~:text=Since%20PostgreSQL%20runs%20under%20the%20MVCC%20system"><u>multiversion concurrency control</u></a>) means old versions aren't immediately removed, so dead tuples accumulate until VACUUM reclaims them. You can watch this happen with the following query:</p><pre><code class="language-SQL">-- Check dead tuples (bloat) in the posts table
SELECT schemaname, relname, n_dead_tup, n_live_tup, last_vacuum
FROM pg_stat_user_tables
WHERE relname = 'posts';</code></pre><p>2.&nbsp;<strong>Transaction contention</strong>: Multiple concurrent likes on the same post create lock contention on the <code>posts</code> row.</p><h2 id="the-timescaledb-way-counter-analytics-for-time-series">The TimescaleDB Way: Counter Analytics for Time Series</h2><p>This is one of those cases where TimescaleDB can give PostgreSQL a helping hand. Instead of maintaining a running counter, let's leverage TimescaleDB's strengths. We’ll start by using a <a href="https://docs.timescale.com/use-timescale/latest/hypertables/"><u>hypertable</u></a> to partition the data automatically by the time column.</p><pre><code class="language-SQL">-- Create a hypertable for post_likes
CREATE TABLE post_likes (
    post_id INTEGER,
    user_id INTEGER,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    PRIMARY KEY (post_id, user_id, created_at)
);

SELECT create_hypertable('post_likes', by_range('created_at', INTERVAL '1 month'));</code></pre><p>The <a href="https://github.com/timescale/timescaledb"><u>TimescaleDB extension</u></a> will automatically split the table into child tables (partitions), in this case one per month. Here are the key advantages of adopting hypertables:</p><ul><li>Parallel computation for queries: all counts and statistics can be parallelized across the partitions.</li><li>Data lifecycle: tables partitioned by time allow you to easily compress data after X days or drop the entire partition after X months.</li><li><a href="https://timescale.ghost.io/blog/building-columnar-compression-in-a-row-oriented-database/"><u>Columnar compression</u></a> can be enabled and will work as an index to segment the data.</li></ul><p>It’s important to remember that the hypertable architecture is a <a href="https://www.timescale.com/learn/pg_partman-vs-hypertables-for-postgres-partitioning"><u>paradigm shift in database partitioning</u></a>. That’s because each partition stores its own table statistics and indexes, so policies can drop entire partitions quickly, without any extra work for VACUUM or updates.</p><h3 id="continuous-aggregates-for-efficient-counting">Continuous aggregates for efficient counting</h3><p>Parallelizing does not avoid rescanning the full dataset every time a statistic is needed. To increase efficiency, we can group data hourly and process it hour by hour. Vanilla PostgreSQL does not allow partial refreshes on materialized views, which is why Timescale developed the continuous aggregation feature.</p><p>The <a href="https://docs.timescale.com/use-timescale/latest/continuous-aggregates/"><u>continuous aggregate</u></a> will maintain pre-computed counts. 
Instead of computing counts during query time or updating every new like, we can create a materialized view with superpowers.</p><pre><code class="language-SQL">-- Create a view for hourly like counts
CREATE MATERIALIZED VIEW post_likes_hourly
WITH (timescaledb.continuous) AS
SELECT 
    post_id,
    time_bucket('1 hour', created_at) AS bucket,
    count(*) as likes_count
FROM post_likes
GROUP BY post_id, time_bucket('1 hour', created_at);

-- Set refresh policy
SELECT add_continuous_aggregate_policy('post_likes_hourly',
    start_offset =&gt; INTERVAL '3 hours',
    end_offset =&gt; INTERVAL '1 hour',
    schedule_interval =&gt; INTERVAL '1 hour');</code></pre><p>The refresh policy makes it run on a schedule and only refreshes the part that has not been computed yet. Through a “watermark” mechanism, the refresh time is stored, and the data is updated from the latest watermark point. You can read more about it in our <a href="https://timescale.ghost.io/blog/real-time-analytics-for-time-series-continuous-aggregates/"><u>dev’s intro to continuous aggregates</u></a>.</p><p>You may be thinking, “What? But what if I change the raw data?” TimescaleDB can also track it and refresh only the updated parts.</p><p>If you like this idea, you'll probably also love the ability to use continuous aggregates <a href="https://docs.timescale.com/use-timescale/latest/continuous-aggregates/hierarchical-continuous-aggregates/"><u>hierarchically</u></a>.</p><h2 id="benefits-of-the-timescaledb-approach">Benefits of the TimescaleDB Approach</h2><ol><li><strong>Efficient storage</strong>: TimescaleDB's chunking mechanism automatically partitions data by time, making fewer VACUUM operations necessary.</li><li><strong>Better concurrency</strong>: no need to update a single counter row, <a href="https://timescale.ghost.io/blog/how-timescaledb-solves-common-postgresql-problems-in-database-operations-with-data-retention-management/"><u>eliminating lock contention</u></a>.&nbsp;</li><li><strong>Rich analytics</strong>: we can easily answer complex questions.</li></ol><pre><code class="language-SQL">-- Get likes trend over time
SELECT 
    post_id,
    bucket,
    likes_count,
    sum(likes_count) OVER (PARTITION BY post_id ORDER BY bucket) as cumulative_likes
FROM post_likes_hourly
WHERE post_id = 1
ORDER BY bucket DESC;</code></pre><h3 id="performance-comparison-counter-analytics-vs-data-denormalization">Performance comparison: Counter analytics vs. data denormalization</h3><p>Let's benchmark both approaches:</p><pre><code class="language-SQL">-- Traditional approach
EXPLAIN ANALYZE
UPDATE posts SET likes_count = likes_count + 1 WHERE id = 1;

-- TimescaleDB approach
EXPLAIN ANALYZE
INSERT INTO post_likes (post_id, user_id, created_at) 
VALUES (1, 123, NOW());</code></pre><p>Under high concurrency, the TimescaleDB approach avoids contention on a single counter row, because each like is an independent insert, and it provides more analytical capabilities.</p><h2 id="best-practices-for-real-time-counts">Best Practices for Real-Time Counts</h2><p>For applications requiring <a href="https://docs.timescale.com/use-timescale/latest/continuous-aggregates/real-time-aggregates/"><u>real-time counts</u></a>, we can set the materialized view parameter <code>timescaledb.materialized_only=false</code> to enable real-time aggregation, so that queries also include rows that have not been materialized yet.</p><pre><code class="language-SQL">CREATE MATERIALIZED VIEW post_likes_hourly
WITH (timescaledb.continuous, timescaledb.materialized_only=false) AS
SELECT 
    post_id,
    time_bucket('1 hour', created_at) AS bucket,
    count(*) as likes_count
FROM post_likes
GROUP BY post_id, time_bucket('1 hour', created_at);</code></pre><p>Behind the scenes, TimescaleDB will create a hypertable for the materialized view and refresh the view according to the refresh policy. When the refresh starts, it saves a watermark to track the latest refreshed bucket.</p><p>When you query <code>post_likes_hourly</code>, it combines the materialized data with the not-yet-materialized rows from the hypertable, filtering only on the buckets greater than the watermark. This means that instead of scanning the raw dataset, the query processes only the part that has not been materialized yet.&nbsp;</p><h2 id="establishing-a-retention-policy">Establishing a Retention Policy</h2><p>Now that we have a continuous aggregate, we need to establish a <a href="https://docs.timescale.com/use-timescale/latest/data-retention/create-a-retention-policy/"><u>retention policy</u></a> to prevent the hypertable from growing indefinitely. As we're storing the data in chunks, we can set a retention policy to delete the chunks that are older than a certain period.</p><pre><code class="language-SQL">SELECT add_retention_policy('post_likes_hourly', INTERVAL '1 month');</code></pre><p>This command runs a background job that deletes chunks older than one month. The past data will be deleted in the background, and the continuous aggregate will remain up to date.</p><p>Data is removed only by dropping entire partitions. Because every partition has its own metadata, dropping one requires no statistics updates and creates no extra work for the VACUUM process.</p><h2 id="conclusion">Conclusion</h2><p>While denormalized counters might seem appealing for simple OLTP workloads, TimescaleDB's time-series capabilities offer a more scalable and maintainable solution. 
By leveraging continuous aggregates and proper time-series modeling, we can achieve better performance, richer analytics, and more reliable data management.</p><p>Remember:</p><ul><li>Use hypertables for time-series data</li><li>Leverage continuous aggregates for efficient computations</li><li>Consider the full lifecycle of your data, including retention policies</li><li>Think in terms of time-series patterns rather than traditional OLTP patterns</li></ul><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/12/Screenshot-2024-12-04-at-18.51.52.png" class="kg-image" alt="" loading="lazy" width="1078" height="1452" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2024/12/Screenshot-2024-12-04-at-18.51.52.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2024/12/Screenshot-2024-12-04-at-18.51.52.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/12/Screenshot-2024-12-04-at-18.51.52.png 1078w" sizes="(min-width: 720px) 720px"></figure><p>This approach might require a mindset shift, but the benefits in terms of scalability and maintenance make it worthwhile for time-series workloads. To give TimescaleDB a try, <a href="https://docs.timescale.com/self-hosted/latest/install/"><u>install it on your machine</u></a>. If you prefer a mature, managed PostgreSQL platform that delivers even more scalability, you can <a href="https://console.cloud.timescale.com/signup"><u>try Timescale Cloud for free</u></a>.</p><h3 id="learn-more">Learn more</h3><ul><li><a href="https://timescale.ghost.io/blog/how-to-reduce-your-postgresql-database-size/"><u>How to Reduce Your PostgreSQL Database Size</u></a></li><li><a href="https://www.timescale.com/learn/pg_partman-vs-hypertables-for-postgres-partitioning/"><u>Pg_partman vs. 
Hypertables for Postgres Partitioning | Timescale</u></a></li><li><a href="https://timescale.ghost.io/blog/how-timescaledb-solves-common-postgresql-problems-in-database-operations-with-data-retention-management/"><u>How TimescaleDB Solves Common PostgreSQL Problems in Database Operations With Data Retention Management</u></a></li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Building a Better Ruby ORM for Time Series and Analytics]]></title>
            <description><![CDATA[Seamlessly create rollups from rolled-up data (hierarchical continuous aggregates) on your Ruby On Rails application for faster time-series & analytics queries.]]></description>
            <link>https://www.tigerdata.com/blog/building-a-better-ruby-orm-for-time-series-and-analytics</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/building-a-better-ruby-orm-for-time-series-and-analytics</guid>
            <category><![CDATA[Ruby]]></category>
            <category><![CDATA[PostgreSQL, Blog]]></category>
            <category><![CDATA[Time Series Data]]></category>
            <category><![CDATA[Analytics]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Jônatas Davi Paganini]]></dc:creator>
            <pubDate>Wed, 27 Nov 2024 13:30:11 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/11/Building-a-Better-Ruby-ORM-for-Time-Series-and-Analytics_final.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/11/Building-a-Better-Ruby-ORM-for-Time-Series-and-Analytics_final.png" alt="Our timescaledb gem continuous aggregate macro: see how we're building a better Ruby ORM for time series" /><p>Rails developers know the joy of working with ActiveRecord. <a href="https://dhh.dk"><u>DHH</u></a> didn’t just give us a framework; he gave us a philosophy, an intuitive way to manage data that feels delightful. But when it comes to time-series data (think metrics, logs, or events), ActiveRecord can start to feel a little stretched. Handling huge volumes of time-stamped data efficiently for analytics? That’s a challenge it wasn’t designed to solve (and neither was PostgreSQL).</p><p>This is where <a href="https://github.com/timescale/timescaledb" rel="noreferrer">TimescaleDB</a> comes in. Built on PostgreSQL (it’s an extension), TimescaleDB is purpose-built for time series and other demanding workloads, and thanks to the <a href="https://rubygems.org/gems/timescaledb" rel="noreferrer">timescaledb gem</a>, it integrates seamlessly into Rails. You don’t have to leave behind the conventions or patterns you love; it just works alongside them.</p><p>One of TimescaleDB’s standout features is<strong> </strong><a href="https://docs.timescale.com/use-timescale/latest/continuous-aggregates/about-continuous-aggregates/" rel="noreferrer"><strong>continuous aggregates</strong></a>. Think of them as an upgrade to materialized views, automatically refreshing in the background so your data is always up-to-date and fast to query. With the new timescaledb gem<strong> continuous aggregates macro</strong>, you can define hierarchical time-based summaries in a single line of Ruby. 
It even reuses your existing ActiveRecord scopes, so you’re not duplicating logic you’ve already written.</p><p>Now, your Rails app can effortlessly handle real-time analytics dashboards or historical reports, scaling your time-series workloads while staying true to the Rails philosophy.</p><h2 id="better-time-series-data-aggregations-using-ruby-the-inspiration">Better Time-Series Data Aggregations Using Ruby: The Inspiration</h2><p>The following code snippet highlights the real-life use case that inspired me to build a continuous aggregates macro for better time-series data aggregations. It’s part of a <a href="https://github.com/rubygems/rubygems.org/pull/4979"><u>RubyGems contribution I made</u></a>, and it’s still a work in progress. However, it’s worth validating how this idea can reduce the Ruby code you’ll have to maintain.</p><h3 id="example-model">Example model</h3><pre><code class="language-Ruby">class Download &lt; ActiveRecord::Base
  extend Timescaledb::ActsAsHypertable
  include Timescaledb::ContinuousAggregatesHelper

  acts_as_hypertable time_column: 'ts'

  scope :total_downloads, -&gt; { select("count(*) as total") }
  scope :downloads_by_gem, -&gt; { select("gem_name, count(*) as total").group(:gem_name) }
  scope :downloads_by_version, -&gt; { select("gem_name, gem_version, count(*) as total").group(:gem_name, :gem_version) }

  continuous_aggregates(
    timeframes: [:minute, :hour, :day, :month],
    scopes: [:total_downloads, :downloads_by_gem, :downloads_by_version],
    refresh_policy: {
      minute: { start_offset: "10 minutes", end_offset: "1 minute", schedule_interval: "1 minute" },
      hour:   { start_offset: "4 hour",     end_offset: "1 hour",   schedule_interval: "1 hour" },
      day:    { start_offset: "3 day",      end_offset: "1 day",    schedule_interval: "1 day" },
      month:  { start_offset: "3 month",    end_offset: "1 day",  schedule_interval: "1 day" }
  })
end
</code></pre><p>The <a href="https://docs.timescale.com/use-timescale/latest/continuous-aggregates/refresh-policies/"><u><code>refresh_policy</code></u></a> will work for all the basic timeframes, but it is not mandatory and can be skipped. Now, remember that declaring the macro in the model has almost no effect until you run a migration that uses this metadata. The continuous aggregates have to be created in a database migration, by calling migration helpers that read this information. Let’s take a look at the helpers we have.</p><h3 id="the-migration-helpers">The migration helpers</h3><p>The macro only declares the continuous aggregates in the model; for the migration, it can generate the SQL code for all the views by iterating over each timeframe and scope you specify.</p><p>The <code>create_continuous_aggregates</code> and <code>drop_continuous_aggregates</code> methods are designed to be invoked during the database migration step.</p><p>So, after saving your model with the new <code>continuous_aggregates</code> definition, you can use the <code>create_continuous_aggregates</code> method to invoke the creation of all materialized views in the database. If you use <code>refresh_policy</code>, it will also add all the policies along with the aggregation. Here’s what a migration file would look like:</p><pre><code class="language-Ruby">class SetupMyAmazingCaggsMigration &lt; ActiveRecord::Migration[7.0]
  def up
    Download.create_continuous_aggregates
  end

  def down
    Download.drop_continuous_aggregates
  end
end
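
# Illustration only (plain Ruby, not the gem's internals). Assuming views are named
# "scope_per_timeframe", as in the SQL output shown in this post, the four timeframes
# and three scopes declared in the model expand to 12 views, built scope by scope
# from the smallest timeframe to the largest.
timeframes = %i[minute hour day month]
scopes = %i[total_downloads downloads_by_gem downloads_by_version]
view_names = scopes.flat_map do |scope|
  timeframes.map { |timeframe| "#{scope}_per_#{timeframe}" }
end
puts view_names.size
puts view_names.first
puts view_names.last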
</code></pre><p>It will automatically create all the continuous aggregates for all timeframes and scopes in the right dependency order. When the <code>create_continuous_aggregates</code> is called, 12 continuous aggregates will be created, starting from minute to month.</p><h3 id="the-migration-output">The migration output</h3><p>Let’s take a deep look at what the SQL behind the scenes looks like when the method <code>create_continuous_aggregates</code> is called. From the first scope, it builds the continuous aggregates, fetching the data from the raw data.</p><pre><code class="language-SQL">CREATE MATERIALIZED VIEW IF NOT EXISTS total_downloads_per_minute
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 minute', ts) as ts, count(*) as total
FROM "downloads"
GROUP BY 1
WITH NO DATA;
</code></pre><p>Every materialization occurs independently, and to happen automatically, a refresh policy needs to be added. As it was specified generically by timeframe, it now incorporates the minute refresh for the policy.</p><pre><code class="language-SQL">SELECT add_continuous_aggregate_policy('total_downloads_per_minute',
  start_offset =&gt; INTERVAL '10 minutes',
  end_offset =&gt;  INTERVAL '1 minute',
  schedule_interval =&gt; INTERVAL '1 minute');
</code></pre><p>Now, continuing the creation, it goes for the hourly level, already reusing the data from the previous materialized view.</p><pre><code class="language-SQL">CREATE MATERIALIZED VIEW IF NOT EXISTS total_downloads_per_hour
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', ts) as ts, sum(total) as total FROM "total_downloads_per_minute" 
GROUP BY 1
WITH NO DATA;
</code></pre><p>An hourly policy is also established to guarantee that it will refresh automatically. The same steps are then repeated for the daily and monthly timeframes, and afterward the whole process repeats for the other scopes.</p><pre><code class="language-SQL">SELECT add_continuous_aggregate_policy('total_downloads_per_hour',
  start_offset =&gt; INTERVAL '4 hour',
  end_offset =&gt;  INTERVAL '1 hour',
  schedule_interval =&gt; INTERVAL '1 hour');

CREATE MATERIALIZED VIEW IF NOT EXISTS total_downloads_per_day
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 day', ts) as ts, sum(total) as total FROM "total_downloads_per_hour" GROUP BY 1
WITH NO DATA;

SELECT add_continuous_aggregate_policy('total_downloads_per_day',
  start_offset =&gt; INTERVAL '3 day',
  end_offset =&gt;  INTERVAL '1 day',
  schedule_interval =&gt; INTERVAL '1 day');

CREATE MATERIALIZED VIEW IF NOT EXISTS total_downloads_per_month
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 month', ts) as ts, sum(total) as total FROM "total_downloads_per_day" GROUP BY 1
WITH NO DATA;

SELECT add_continuous_aggregate_policy('total_downloads_per_month',
  start_offset =&gt; INTERVAL '3 month',
  end_offset =&gt;  INTERVAL '1 day',
  schedule_interval =&gt; INTERVAL '1 day');

CREATE MATERIALIZED VIEW IF NOT EXISTS downloads_by_gem_per_minute
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 minute', ts) as ts, gem_name, count(*) as total FROM "downloads" GROUP BY 1, gem_name
WITH NO DATA;

SELECT add_continuous_aggregate_policy('downloads_by_gem_per_minute',
  start_offset =&gt; INTERVAL '10 minutes',
  end_offset =&gt;  INTERVAL '1 minute',
  schedule_interval =&gt; INTERVAL '1 minute');

CREATE MATERIALIZED VIEW IF NOT EXISTS downloads_by_gem_per_hour
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', ts) as ts, gem_name, sum(total) as total FROM "downloads_by_gem_per_minute" GROUP BY 1, gem_name
WITH NO DATA;

SELECT add_continuous_aggregate_policy('downloads_by_gem_per_hour',
  start_offset =&gt; INTERVAL '4 hour',
  end_offset =&gt;  INTERVAL '1 hour',
  schedule_interval =&gt; INTERVAL '1 hour');

CREATE MATERIALIZED VIEW IF NOT EXISTS downloads_by_gem_per_day
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 day', ts) as ts, gem_name, sum(total) as total FROM "downloads_by_gem_per_hour" GROUP BY 1, gem_name
WITH NO DATA;

SELECT add_continuous_aggregate_policy('downloads_by_gem_per_day',
  start_offset =&gt; INTERVAL '3 day',
  end_offset =&gt;  INTERVAL '1 day',
  schedule_interval =&gt; INTERVAL '1 day');

CREATE MATERIALIZED VIEW IF NOT EXISTS downloads_by_gem_per_month
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 month', ts) as ts, gem_name, sum(total) as total FROM "downloads_by_gem_per_day" GROUP BY 1, gem_name
WITH NO DATA;

SELECT add_continuous_aggregate_policy('downloads_by_gem_per_month',
  start_offset =&gt; INTERVAL '3 month',
  end_offset =&gt;  INTERVAL '1 day',
  schedule_interval =&gt; INTERVAL '1 day');

CREATE MATERIALIZED VIEW IF NOT EXISTS downloads_by_version_per_minute
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 minute', ts) as ts, gem_name, gem_version, count(*) as total FROM "downloads" GROUP BY 1, gem_name, gem_version
WITH NO DATA;

SELECT add_continuous_aggregate_policy('downloads_by_version_per_minute',
  start_offset =&gt; INTERVAL '10 minutes',
  end_offset =&gt;  INTERVAL '1 minute',
  schedule_interval =&gt; INTERVAL '1 minute');

CREATE MATERIALIZED VIEW IF NOT EXISTS downloads_by_version_per_hour
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', ts) as ts, gem_name, gem_version, sum(total) as total FROM "downloads_by_version_per_minute" GROUP BY 1, gem_name, gem_version
WITH NO DATA;

SELECT add_continuous_aggregate_policy('downloads_by_version_per_hour',
  start_offset =&gt; INTERVAL '4 hour',
  end_offset =&gt;  INTERVAL '1 hour',
  schedule_interval =&gt; INTERVAL '1 hour');

CREATE MATERIALIZED VIEW IF NOT EXISTS downloads_by_version_per_day
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 day', ts) as ts, gem_name, gem_version, sum(total) as total FROM "downloads_by_version_per_hour" GROUP BY 1, gem_name, gem_version
WITH NO DATA;

SELECT add_continuous_aggregate_policy('downloads_by_version_per_day',
  start_offset =&gt; INTERVAL '3 day',
  end_offset =&gt;  INTERVAL '1 day',
  schedule_interval =&gt; INTERVAL '1 day');

CREATE MATERIALIZED VIEW IF NOT EXISTS downloads_by_version_per_month
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 month', ts) as ts, gem_name, gem_version, sum(total) as total FROM "downloads_by_version_per_day" GROUP BY 1, gem_name, gem_version
WITH NO DATA;

SELECT add_continuous_aggregate_policy('downloads_by_version_per_month',
  start_offset =&gt; INTERVAL '3 month',
  end_offset =&gt;  INTERVAL '1 day',
  schedule_interval =&gt; INTERVAL '1 day');
</code></pre><p>That’s massive, right?! It’s probably too boring to read it all because the structure is almost entirely repetitive, iterating over all the scopes. The <code>continuous_aggregates</code> macro handles all of this logic by iterating over every timeframe for every scope. It reuses minute data in the hourly view and uses the same technique from hour to day and from day to month.</p><p>In contrast, writing all of these chained aggregations by hand is really error-prone. The <code>Model.drop_continuous_aggregates</code> method follows the reverse dependency path, calling <code>drop materialized view</code> from month to minute.</p><p>Continuously aggregating statistics can replace dozens of background jobs hosted by your application, avoiding serialization and deserialization work as well as extra bandwidth, I/O (input/output), and resource usage in general.</p><p>Reusing the previous timeframes makes it very fast and lightweight for the database to process. Adopting hierarchical processing also allows all processing to be done at a predictable speed because the number of rows to process will be static and dependent only on the cardinality of the data.</p><p>Processing aggregations in the database means the work happens between the database and its disk, removing the interactions between the application and the database and the network round trips that application background jobs would need to process the same data.</p><p>Now, let’s take a look at how the rollup works.</p><h2 id="hyperfunctions-integration-for-faster-time-series-analysis">Hyperfunctions Integration for Faster Time-Series Analysis</h2><p>Timescale also built a specialized extension for time-series data processing, the <a href="https://docs.timescale.com/self-hosted/latest/tooling/install-toolkit/"><u>timescaledb-toolkit</u></a>. 
It helps improve the developer experience and query performance, and most of its functions are called hyperfunctions.</p><p><a href="https://docs.timescale.com/api/latest/hyperfunctions/"><u>Hyperfunctions</u></a> are designed to make statistics on hypertables fast and reusable, allowing you to roll up granular aggregations into bigger timeframes. In the case of the Ruby library, it should work well with regular statistics functions and can also roll up the hyperfunctions already available.</p><p>The most important part of using multiple timeframes and scopes is to understand how the <code>rollup</code> scope works.</p><p>For example, if you have a scope called <code>total_downloads</code> and a timeframe of <code>day</code>, the rollup will rewrite the query to group by the day.</p><pre><code class="language-SQL">-- Original query
SELECT count(*) FROM downloads;

-- Rolled-up query
SELECT time_bucket('1 day', created_at) AS day, count(*) FROM downloads GROUP BY day;
</code></pre><p>In Ruby, the <code>rollup</code> method helps you roll up such queries more efficiently. Let’s consider the <code>total_downloads</code> scope as an example:</p><pre><code class="language-Ruby">Download.total_downloads.map(&amp;:attributes) # =&gt; [{"total"=&gt;6175}]
# SELECT count(*) as total FROM "downloads"
</code></pre><p>The rollup scope will help to group data by a specific timeframe. Let’s start with one minute:</p><pre><code class="language-Ruby">Download.total_downloads.rollup("'1 min'").map(&amp;:attributes)
# SELECT time_bucket('1 min', ts) as ts, count(*) as total FROM "downloads" GROUP BY 1
=&gt; [{"ts"=&gt;2024-04-26 00:10:00 UTC, "total"=&gt;110},
 {"ts"=&gt;2024-04-26 00:11:00 UTC, "total"=&gt;1322},
 {"ts"=&gt;2024-04-26 00:12:00 UTC, "total"=&gt;1461},
 {"ts"=&gt;2024-04-26 00:13:00 UTC, "total"=&gt;1150},
 {"ts"=&gt;2024-04-26 00:14:00 UTC, "total"=&gt;1127},
 {"ts"=&gt;2024-04-26 00:15:00 UTC, "total"=&gt;1005}]
</code></pre><p>As you can see, the <code>time_bucket</code> function is introduced, and a group by clause is also added.</p><p>If the current query uses a component like <a href="https://docs.timescale.com/api/latest/hyperfunctions/financial-analysis/candlestick_agg/"><u>candlestick_agg</u></a>, it will be able to call the <a href="https://docs.timescale.com/api/latest/hyperfunctions/financial-analysis/candlestick_agg/#rollup"><u>rollup</u></a> SQL function, and that’s where the name of the function comes from.</p><p>What if I want to sum the counters from the materialized view behind the scenes and roll up to a bigger frame? That’s when the aggregated classes join the game.</p><p>Continuous aggregates are hypertables. They’re materialized views that are periodically being updated in the background according to the refresh policy. Every aggregation can be accessed and refreshed independently.</p><h3 id="aggregates-classes">Aggregates classes</h3><p>In the previous example, the rollup was done directly in the raw data. Now, let’s explore how the <code>continuous_aggregates</code> macro creates a class for each aggregated view that is in the database. The classes can be accessed as subclasses in the model and also inherit the model as they’re fully dependent on it.</p><p>So, to access the materialized data, instead of building the query from raw data, nested classes are created with the <code>Model::ScopeNamePerTimeframe</code> naming convention.</p><pre><code class="language-Ruby">Download::TotalDownloadsPerMinute.all.map(&amp;:attributes)
# SELECT "total_downloads_per_minute".* FROM "total_downloads_per_minute"
=&gt; [{"ts"=&gt;2024-04-26 00:10:00 UTC, "total"=&gt;110},
 {"ts"=&gt;2024-04-26 00:11:00 UTC, "total"=&gt;1322},
 {"ts"=&gt;2024-04-26 00:12:00 UTC, "total"=&gt;1461},
 {"ts"=&gt;2024-04-26 00:13:00 UTC, "total"=&gt;1150},
 {"ts"=&gt;2024-04-26 00:14:00 UTC, "total"=&gt;1127},
 {"ts"=&gt;2024-04-26 00:15:00 UTC, "total"=&gt;1005}]
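
# Illustration only (plain Ruby, not the gem's code): assuming the nested class name
# is simply the camelized view name, the Model::ScopeNamePerTimeframe convention
# can be derived like this.
view_name = "total_downloads_per_minute"
class_name = view_name.split("_").map { |part| part.capitalize }.join
puts "Download::#{class_name}"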
</code></pre><p>To roll up from the materialized data, we need to consider how the data was built. So, to have the counter, we need to count rows from the hypertable raw data, but for bigger timeframes, we can just sum the counters. Here’s what it looks like if you need to roll up any scope to other timeframes:</p><pre><code class="language-Ruby">Download::TotalDownloadsPerMinute.select("sum(total) as total").rollup("'2 min'").map(&amp;:attributes)
# SELECT time_bucket('2 min', ts) as ts, sum(total) as total FROM "total_downloads_per_minute" GROUP BY 1
=&gt; [{"ts"=&gt;2024-04-26 00:12:00 UTC, "total"=&gt;2611}, {"ts"=&gt;2024-04-26 00:14:00 UTC, "total"=&gt;2132}, {"ts"=&gt;2024-04-26 00:10:00 UTC, "total"=&gt;1432}]
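
# Illustration only (plain Ruby, no database needed): the same 2-minute rollup computed
# by hand from the per-minute rows shown earlier, to check that summing the counters per
# bucket reproduces the totals returned by the query. Bucketing on epoch seconds is an
# assumption; it matches time_bucket here because the buckets align to even minutes.
require "time"

per_minute = [
  ["2024-04-26 00:10:00 UTC", 110],
  ["2024-04-26 00:11:00 UTC", 1322],
  ["2024-04-26 00:12:00 UTC", 1461],
  ["2024-04-26 00:13:00 UTC", 1150],
  ["2024-04-26 00:14:00 UTC", 1127],
  ["2024-04-26 00:15:00 UTC", 1005]
]
bucket_seconds = 120
rolled_up = per_minute
  .group_by { |ts, _total| Time.parse(ts).to_i / bucket_seconds * bucket_seconds }
  .map { |bucket, rows| [Time.at(bucket).utc, rows.sum { |_ts, total| total }] }
  .sort_by { |bucket, _total| bucket }
rolled_up.each { |bucket, total| puts "#{bucket} #{total}" }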
</code></pre><p>With the <code>rollup</code> scope, you can easily build custom scopes and regroup as you need. It supports a few statistical scenarios on rollup: it automatically detects SQL statements that contain <code>count(*) as total</code> and transforms them into <code>sum(total) as total</code>. It can also take the min of min or max of max values when it’s rolling up into larger timeframes.</p><h3 id="refresh-aggregates">Refresh aggregates</h3><p>If you need to refresh all aggregates manually in the right order, you can also use the <code>refresh_aggregates</code> method:</p><pre><code class="language-Ruby">Download.refresh_aggregates
</code></pre><h2 id="next-steps">Next steps</h2><p>That’s all, folks! I <a href="https://ideia.me/timescaledb-gem-continuous-aggregates-updates"><u>posted</u></a> a few more details in my blog during the development phase. If you have any questions or feedback, join the <a href="https://timescaledb.slack.com/archives/C04MQ3DKXEV/p1715632355486219"><u><code>#ruby</code></u></a> channel on the TimescaleDB Slack. Also, GitHub ⭐s for our <a href="https://github.com/timescale/timescaledb-ruby"><u>Ruby library</u></a> are very much welcome!</p><p>To give it a try and use the <code>continuous_aggregates</code> macro on your project, install the <a href="https://rubygems.org/gems/timescaledb" rel="noreferrer"><code>timescaledb</code></a> gem. Happy coding—but write fewer lines of code.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Benchmarking PostgreSQL Batch Ingest]]></title>
            <description><![CDATA[See what PostgreSQL batch ingest method is right for your use case: in this article, we benchmark INSERT (VALUES and UNNEST) vs. COPY (text and binary).]]></description>
            <link>https://www.tigerdata.com/blog/benchmarking-postgresql-batch-ingest</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/benchmarking-postgresql-batch-ingest</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[performance]]></category>
            <category><![CDATA[PostgreSQL Performance]]></category>
            <dc:creator><![CDATA[James Blackwood-Sewell]]></dc:creator>
            <pubDate>Tue, 26 Nov 2024 14:00:51 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/11/to_char-11111---FM9-999-999-----2-.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/11/to_char-11111---FM9-999-999-----2-.png" alt="Benchmarking PostgreSQL Batch Ingest: the output list of ingest performance" /><p>In a previous <a href="https://timescale.ghost.io/blog/tag/performance/"><u>article</u></a> in this <a href="https://timescale.ghost.io/blog/tag/performance/"><u>series</u></a>, I explored the magic of <code>INSERT...UNNEST</code> for improving PostgreSQL batch <code>INSERT</code> performance. While it’s a fantastic technique, I know it’s not the fastest option available (although it is very flexible). Originally, I hadn't intended to loop back and benchmark all the batch ingest methods, but I saw a lot of confusion out there, so I'm back, and this time I'm looking at <code>COPY</code> too. As usual for this series, it’s not going to be a long post, but it is going to be an informative one.&nbsp;</p><p>I flipped my approach for this post, comparing not just the PostgreSQL database performance in isolation but the practical performance from an application. To do this, I built a custom benchmarking tool in <a href="https://www.rust-lang.org/">Rust</a> to measure the end-to-end performance of each method. In this article, I’ll walk you through the batch ingest options you’ve got, and how they stack up (spoiler alert, the spread is over 19x!).</p><h2 id="the-introduction-batch-ingest-in-postgresql">The Introduction: Batch Ingest in PostgreSQL</h2><p>I’m defining batch ingest as writing a dataset to PostgreSQL in batches or chunks. You’d usually do this because the data is being collected in (near) real time (think a flow of IoT data from sensors) before being persisted into PostgreSQL (hopefully with <a href="https://docs.timescale.com/">TimescaleDB</a>, although that's out of scope for this post). 
</p><p>Writing a single record at a time is incredibly inefficient, so writing batches makes sense (the size probably depends on how long you can delay writing). Just to be clear, this isn't about loading a very large dataset in one go; I’d call that <a href="https://www.timescale.com/learn/testing-postgres-ingest-insert-vs-batch-insert-vs-copy">bulk ingest</a>, not batch ingest (and you'd usually do that from a file).</p><p>Broadly speaking, there are two methods for ingesting multiple values at once in PostgreSQL: <code>INSERT</code> and <code>COPY</code>. Each of these methods has a few variants, so let's look at the differences.</p><h2 id="insert-values-and-unnest">INSERT: VALUES and UNNEST</h2><p>The most common method to ingest data in PostgreSQL is the standard <code>INSERT</code> statement using the <code>VALUES</code> clause. Everyone recognizes it, and every language and ORM (object-relational mapper) can make use of it. While you can insert a single row, we are interested in batch ingest here, passing multiple values using the following syntax (this example is a batch of three for a table with seven columns).</p><pre><code class="language-SQL">INSERT INTO sensors
VALUES
    ($1,  $2,  $3,  $4,  $5,  $6,  $7),
    ($8,  $9,  $10, $11, $12, $13, $14),
    ($15, $16, $17, $18, $19, $20, $21);</code></pre><p>The other method is <code>INSERT...UNNEST</code>. Here, instead of passing a value per attribute (so <code>batch size * columns</code> total values), we pass an array of values per column.</p><pre><code class="language-SQL">INSERT INTO sensors
SELECT * FROM unnest(
    $1::int[],
    $2::timestamp[],
    $3::float8[],
    $4::float8[],
    $5::float8[],
    $6::float8[],
    $7::float8[]
);</code></pre><p>If you’re after a discussion on the difference between the two, then check out <a href="https://timescale.ghost.io/blog/boosting-postgres-insert-performance/"><u>Boosting Postgres INSERT Performance by 2x With UNNEST</u></a>.</p><p>Each of these queries can actually be sent to the database in a few ways:</p><ul><li>You could construct the query string manually with literal values in place of the <code>$</code> placeholders. I haven’t benchmarked this because it’s bad practice and can open you up to SQL injection attacks (never forget about <a href="https://www.explainxkcd.com/wiki/index.php/327:_Exploits_of_a_Mom"><u>Little Bobby Tables</u></a>).</li><li>You could use your framework to send a parameterized query (which looks like the ones above with <code>$</code> placeholders), which sends the query body and the values to Postgres as separate items. This protects against SQL injection and speeds up query parsing.</li><li>You could use a prepared statement (which would also be parameterized) to let the database know about your query ahead of time, then just send the values each time you want to run it. This provides the benefits of parameterization and also speeds up your queries by reducing the planning time.</li></ul><p>Most frameworks implement prepared statements using the binary protocol directly, but you can use the <code>PREPARE</code> and <code>EXECUTE</code> SQL commands to do the same thing from SQL.</p><p>Keep in mind that PostgreSQL has a limit of 32,767 parameterized variables in a query. So if you had seven columns, then your maximum batch size for <code>INSERT…VALUES</code> would be 4,681 (4,681 × 7 = 32,767). When you’re using <code>INSERT…UNNEST</code>, you’re only sending one parameter per column. 
Because PostgreSQL can support at most 1,600 columns, you'll never hit the limit.</p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text">When using prepared statements for batch ingest, ensure that <code spellcheck="false" style="white-space: pre-wrap;">plan_cache_mode</code> is not set to <code spellcheck="false" style="white-space: pre-wrap;">force_custom_plan</code>. This setting is designed for queries that benefit from being re-planned for each execution, which isn’t the case for batch inserts. <br><br>By default, <code spellcheck="false" style="white-space: pre-wrap;">plan_cache_mode</code> is set to <code spellcheck="false" style="white-space: pre-wrap;">auto</code>, meaning PostgreSQL will use custom plans for the first five executions before deciding whether to switch to a generic plan. To optimize performance, you could consider changing your session to <code spellcheck="false" style="white-space: pre-wrap;">force_generic_plan</code>, ensuring the query is planned just once and reused for all subsequent executions.</div></div><h2 id="copy-text-and-binary">COPY: Text and Binary</h2><p><code>COPY</code> is a PostgreSQL-specific extension to the SQL standard for bulk ingestion (strictly speaking, we are talking about <code>COPY FROM</code> here, because you can also <code>COPY TO</code>, which moves data from a table to a file). <br><br><code>COPY</code> shortcuts the process of writing multiple records in a number of ways, with two of the most critical ones being:</p><ol><li><strong>WAL MULTI-INSERT:</strong> PostgreSQL optimizes <code>COPY</code> write-ahead operations by writing <code>MULTI_INSERT</code> records to the WAL (write-ahead log) instead of logging each row individually. 
This results in less data being written to the WAL files, which means less I/O (input/output) on your database.</li><li><strong>COPY ring buffer:</strong> To avoid polluting shared buffers, <code>COPY</code> uses a dedicated ring buffer for its I/O operations. This minimizes the impact on the buffer cache used for regular queries, preserving performance for other <a href="https://www.tigerdata.com/learn/guide-to-postgresql-database-operations" rel="noreferrer">database operations</a>. So, less about raw speed and more about not being a noisy neighbor.</li></ol><p><code>COPY</code> can read data from multiple sources: local files, local commands, or standard input. For batch ingestion, standard input makes the most sense as the data can be sent directly from the client without an intermediate step. I was actually surprised by the number of people <a href="https://www.reddit.com/r/PostgreSQL/comments/1gsynek/boosting_postgres_insert_performance_by_50_with/" rel="noreferrer">who reached out on Reddit</a> following my last post saying they couldn’t use <code>COPY</code> because they would need to write out their stream of data as a file. That’s 100 percent what the <code>STDIN</code> setting is for!</p><p><code>COPY</code> can use two different protocols: text and binary.</p><ul><li>Text supports formats like CSV and involves sending text strings over the wire to be parsed by the server before ingestion. You can actually just dump raw CSV records into <code>COPY</code>.</li><li>Binary supports writing data in the native PostgreSQL format from the client, removing the need for parsing on the server side. It’s much faster but also much less flexible, with limited support in many languages. 
To do this, you need to type your data in your client so you know the format to write it in.</li></ul><p>The two variants of <code>COPY</code> we'll be testing are the text version using:</p><pre><code class="language-SQL">COPY sensors FROM STDIN;</code></pre><p>And the binary version using:</p><pre><code>COPY sensors FROM STDIN WITH (FORMAT BINARY);</code></pre><p><code>COPY</code> isn't a normal SQL statement, so it can’t exist within a larger query. It also can’t perform an upsert (like <code>INSERT … ON CONFLICT</code>). From <a href="https://www.postgresql.org/docs/current/sql-copy.html">PostgreSQL 17</a>, a text <code>COPY</code> can skip rows whose values fail data type conversion with <code>ON_ERROR ignore</code>, but that only covers malformed input, not constraint violations, so it is not a full stand-in for <code>INSERT … ON CONFLICT DO NOTHING</code>.</p><h2 id="the-setup">The Setup</h2><p>I created my own <a href="https://github.com/jamessewell/pgingester">Rust CLI tool</a> to run this benchmark. That might seem like overkill, but I did it for the following reasons:</p><ul><li>I needed something that supported <code>COPY FROM STDIN</code> and <code>COPY WITH (FORMAT BINARY)</code> directly, ruling out <a href="https://k6.io/">Grafana K6</a> and the PostgreSQL native <a href="https://www.postgresql.org/docs/current/pgbench.html">pgbench</a>.</li><li>I needed something that would let me run parameterized and prepared queries using the binary protocol directly, not using <code>PREPARE</code> and <code>EXECUTE</code>, because this is how most frameworks operate.</li><li>I wanted to measure the timing from an application's viewpoint, including data wrangling, network round-trip latency, and database operations.</li><li>I start measuring time after I've read the CSV file from disk and loaded it into a Rust data structure. This is to avoid measuring the I/O limits of the benchmark client. Batch ingest normally takes place in a stream without data being read from files. 
</li><li>I love Rust (if you’re after more Rust x PostgreSQL content, check out <a href="https://www.youtube.com/watch?v=C9TopAI1Hnk"><u>my talk </u></a>at PGConf.EU 2024)!</li></ul><p>The tool tests the insertion of data from a CSV file into the database (the default file is one million records) with multiple batch sizes and ingest methods into a table with five metric columns (the actual schema isn’t important, I just love the fact a lot of power and renewables companies use Timescale ♺):</p><pre><code class="language-sql">CREATE TABLE IF NOT EXISTS power_generation (
    generator_id INTEGER,&nbsp;
    timestamp TIMESTAMP WITH TIME ZONE,
    power_output_kw DOUBLE PRECISION,&nbsp;
    voltage DOUBLE PRECISION,
    current DOUBLE PRECISION,
    frequency DOUBLE PRECISION,
    temperature DOUBLE PRECISION
);</code></pre><p>It supports a combination of the following methods (or you can use the <code>--all</code> shortcut) for insertion over multiple batch sizes per run:</p><ul><li>Batch insert (parameterized)</li><li>Prepared batch insert</li><li><code>UNNEST</code> insert (parameterized)</li><li>Prepared <code>UNNEST</code> insert</li><li><code>COPY</code></li><li>Binary <code>COPY</code></li></ul><p>The tool supports a few options, the most important being the comma-separated list of batch sizes.</p><pre><code>Usage: pgingester [OPTIONS] [METHODS]...

Arguments:
  [METHODS]...  [possible values: insert-values, prepared-insert-values, insert-unnest, prepared-insert-unnest, copy, binary-copy]

Options:
  -b, --batch-sizes &lt;BATCH_SIZES&gt;              [default: 1000]
  -t, --transactions                           
  -c, --csv-output                             
  -a, --all                                    
  -c, --connection-string &lt;CONNECTION_STRING&gt;  [env: CONNECTION_STRING=]
  -f, --input-file &lt;INPUT_FILE&gt;                [default: ingest.csv]
  -h, --help                                   Print help
  -V, --version                                Print version</code></pre><p>I tested with a connection to a Timescale 8&nbsp;CPU/32&nbsp;GB memory server (although it was only using a single connection, so this is overkill).</p><h2 id="the-results">The Results</h2><p>Running the CLI tool with the following arguments will output a big list of ingest performance, including the relative speed for each tested method.</p><pre><code>pgingester --all --batch-sizes 1000,5000,10000,100000,1000000</code></pre><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/11/carbon--8-.png" class="kg-image" alt="" loading="lazy" width="1594" height="1116" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2024/11/carbon--8-.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2024/11/carbon--8-.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/11/carbon--8-.png 1594w" sizes="(min-width: 720px) 720px"></figure><p>I ran all queries with multiple batch sizes with the default CSV input (one million lines). The <code>Insert VALUES</code> and <code>Prepared Insert VALUES</code> queries will only run for the 1,000 batch size, as above that there are too many parameters to bind (the warnings to standard error have been removed from the output shown above).</p><p>We can draw a number of interesting conclusions from this data:</p><ol><li>With a larger batch size (anything other than 1,000), binary <code>COPY</code> is substantially faster (at least 3.6x) than anything else (19x faster than a naive parameterized <code>INSERT...VALUES</code>). This is because it doesn’t have to do any data parsing on the server side. 
The more data you load in a single batch, the more pronounced the difference will become.</li><li>Text <code>COPY</code> also performs well, but surprisingly it’s surpassed in speed by prepared statements for batches of 10,000 or fewer.</li><li>Both <code>COPY</code> variants perform poorly with batches of 1,000. Interestingly, I've seen a lot of batch ingest tools actually use this batch size.</li><li>When you’re using <code>INSERT</code> for batch ingest, prepared statements always outperform parameterized ones. If you want maximum speed, the same number of parameters regardless of the batch size, and no risk of hitting the parameter limit on larger batches, then use <code>INSERT…UNNEST</code>.</li><li><code>INSERT...UNNEST</code> at a batch size of 100,000 does a lot better against text <code>COPY</code> than I thought it would: there is actually only 3 ms in it 👀!</li></ol><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text">I ran this with a larger dataset of 100 million rows as well. Performance is <i><em class="italic" style="white-space: pre-wrap;">slightly</em></i> worse, probably because PostgreSQL is checkpointing in the background. However, the general numbers and relative speeds remain very similar.</div></div><h2 id="so-which-should-you-use">So, Which Should You Use?</h2><p>If you're looking to optimize batch ingestion in PostgreSQL, the right method depends on your specific use case, batch size, and application requirements. Here’s how the options stack up:</p><ol><li><strong>Small Batch Sizes (&lt;= 10,000 rows)</strong>: Prepared <code>INSERT...UNNEST</code> can be surprisingly competitive. Down at a batch size of 1,000, <code>COPY</code> is actually much slower.</li><li><strong>Large Batch Sizes (&gt; 10,000 rows)</strong>: For maximum throughput with larger batches, binary <code>COPY</code> is unbeatable. 
Its ability to bypass server-side parsing and its use of a dedicated ring buffer make it the top choice for high-velocity data pipelines. If you need speed, you can have larger batches, and your application can support the binary protocol, this should be your default.</li><li><strong>Ease of Implementation</strong>: If you prioritize ease of implementation or need compatibility across a wide range of tools, text <code>COPY</code> is a great middle-ground. It doesn't require complex client-side libraries and is supported in nearly every language that interacts with PostgreSQL. You can also just throw your CSV data at it.</li><li><strong>Considerations Beyond Speed</strong>:<ul><li><strong>Upserts:</strong> If you need conflict handling (<code>INSERT...ON CONFLICT</code>), <code>COPY</code> isn't an option, and you'll need to stick with <code>INSERT</code> (unless you just want to ignore errors and you're happy with text <code>COPY</code>, in which case <a href="https://www.postgresql.org/docs/current/sql-copy.html">PostgreSQL 17 </a>has your back with <code>ON_ERROR</code>).</li><li><strong>Framework support:</strong> Ensure your preferred framework supports your chosen method; <code>COPY</code> usually requires a different API to be used and binary <code>COPY</code> may require an extension library or not be supported.</li><li><strong>Batch size limits:</strong> Watch for the 32,767-parameter limit when using parameterized <code>INSERT...VALUES</code>.</li><li><strong>Memory and disk write overheads: </strong><code>COPY</code> is designed to have the least impact on your system, writing less data to disk and not polluting shared_buffers. This is actually a big consideration! In fact, both the <code>COPY</code> methods write 62&nbsp;MB of WAL for the one million row test, while <code>INSERT</code> writes 109&nbsp;MB. 
This ~1.7x rule seems to hold across any ingest size.</li></ul></li></ol><h2 id="final-thoughts-for-developers">Final Thoughts for Developers</h2><p>When it comes to PostgreSQL batch ingestion, there is no one-size-fits-all solution. Each method offers trade-offs between performance, complexity, and flexibility:</p><ul><li><strong>For maximum raw speed</strong> over larger batches, binary <code>COPY</code> is your best bet.</li><li><strong>For flexibility and ease of use </strong>over larger batches, text <code>COPY</code> balances speed with broad support.</li><li><strong>For smaller batches or compatibility-focused workflows,</strong> a prepared <code>INSERT...UNNEST</code> statement can hold its own, offering competitive speeds with maximum flexibility (but remember, if you have a heavy ingest pipeline, you risk disrupting shared_buffers, and you will be writing more to WAL).</li></ul><p>Remember, the “best” method isn’t just about ingest speed; it’s about what fits your workflow and scales with your application. Happy ingesting!</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Boosting Postgres INSERT Performance by 2x With UNNEST]]></title>
            <description><![CDATA[Read how you can double your Postgres INSERT performance using the UNNEST function.]]></description>
            <link>https://www.tigerdata.com/blog/boosting-postgres-insert-performance</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/boosting-postgres-insert-performance</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[performance]]></category>
            <category><![CDATA[PostgreSQL Performance]]></category>
            <dc:creator><![CDATA[James Blackwood-Sewell]]></dc:creator>
            <pubDate>Fri, 15 Nov 2024 17:00:33 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/11/Untitled-design--1--2.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/11/Untitled-design--1--2.png" alt="Boosting Postgres INSERT Performance by 2x With UNNEST" /><p>If you Google Postgres <code>INSERT</code> performance for long enough, you’ll find some hushed mentions of using an arcane <code>UNNEST</code> function (if you squint, it looks like a <a href="https://www.tigerdata.com/blog/building-columnar-compression-in-a-row-oriented-database" rel="noreferrer">columnar</a> insert) over a series of arrays to increase performance. Any performance gains sound good to me, but what's actually going on here?</p><p>I’ve been aware of this technique for a long time (in fact, several object-relational mappers use it under the hood), but I’ve never fully understood what's happening, and any analysis I’ve seen has always left me wondering if the gains were as much about data wrangling in the programming language used as Postgres speed. This week I decided to change that and do some testing myself.</p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text">This used to be called "Boosting Postgres INSERT Performance by 50% with UNNEST". The performance went from 2.19s to 1.03s, which is 52.97% less time, but also 113% faster.<br><br>I changed the wording in this article to 2x because I think that's always clearer (thanks /u/a3kov and /u/lobster_johnson)</div></div><h2 id="the-introduction-inserts-in-postgres">The Introduction: INSERTs in Postgres</h2><p>At Tiger Data, I work with <a href="https://timescale.ghost.io/blog/time-series-introduction/" rel="noreferrer">time-series data</a>, so I gave my analysis a time-series slant. 
I want to simulate inserting a stream of records into my database with the <code>INSERT</code> statement (yes, I know <code>COPY</code> is a thing, see the callout below), and in doing so, I want to minimize the load I create as much as possible (saving my precious CPU cycles for my real-time analytics queries).</p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text">If you’re aiming to load data into your database as quickly and efficiently as possible, check out the PostgreSQL <code spellcheck="false" style="white-space: pre-wrap;">COPY</code> command—it’s almost always faster than using regular <code spellcheck="false" style="white-space: pre-wrap;">INSERT</code>. We benchmarked <a href="https://www.tigerdata.com/blog/postgres-for-everything" rel="noreferrer">Postgres data</a> ingestion methods in an earlier post.<br><br>However, even though <code spellcheck="false" style="white-space: pre-wrap;">COPY</code> is faster, many developers still prefer <code spellcheck="false" style="white-space: pre-wrap;">INSERT</code> for its flexibility. <code spellcheck="false" style="white-space: pre-wrap;">INSERT</code> supports useful features like upserts (<code spellcheck="false" style="white-space: pre-wrap;">INSERT ... ON CONFLICT</code>), returning the inserted rows, and has better integration with language libraries. Plus, it can be part of a larger SQL query, giving you more control over the data insertion process.</div></div><p><br></p><p>Let’s take a closer look at the <code>INSERT</code> queries I tested using a batch size of 1,000, 5,000, and 10,000 records.</p><p>In one corner, we have the multi-record <code>INSERT</code> variant we all know and love, using a <code>VALUES</code> clause followed by a tuple per row in the batch. These queries look long but also pretty easy to understand.</p><pre><code class="language-SQL">INSERT INTO sensors (sensorid, ts, value)
VALUES 
  ($1, $2, $3), 
  ($4, $5, $6), 
   ..., 
  ($2998, $2999, $3000);</code></pre><p>In the other corner, we have our <code>UNNEST</code> variant, using a <code>SELECT</code> query that takes one array per column and uses the <code>UNNEST</code> function to convert them into rows at execution time.</p><pre><code class="language-SQL">INSERT INTO sensors (ts, sensorid, value)&nbsp;
&nbsp;&nbsp;SELECT *&nbsp;
&nbsp;&nbsp;FROM unnest(
&nbsp;&nbsp;&nbsp;&nbsp;$1::timestamptz[],&nbsp;
&nbsp;&nbsp;&nbsp;&nbsp;$2::text[],&nbsp;
&nbsp;&nbsp;&nbsp;&nbsp;$3::float8[]
)</code></pre><p>The <a href="https://www.postgresql.org/docs/9.2/functions-array.html" rel="noreferrer">Postgres documentation describes <code>UNNEST</code></a> as a function that <em>“expands multiple arrays (possibly of different data types) into a set of rows.”</em> This actually makes sense: it’s basically flattening a series of arrays into a row set, much like the one in an <code>INSERT .. VALUES</code> query.</p><p>One key difference is that where the first variant has <code>batch_size * num_columns</code> values in the query, the <code>UNNEST</code> variant only has <code>num_columns</code> arrays (each of which contains <code>batch_size</code> records when it’s flattened). This will be important later, so take note!</p><h2 id="the-setup">The Setup&nbsp;</h2><p>I ran the benchmark on a single TimescaleDB 4&nbsp;CPU/16&nbsp;GB memory instance (the spec isn't really important for this benchmark) with a very simple schema (the same table I used on the <a href="https://timescale.ghost.io/blog/skip-scan-under-load/" rel="noreferrer"><u>SkipScan performance post</u></a>).</p><pre><code class="language-sql">CREATE TABLE sensors (
    sensorid TEXT,
    ts TIMESTAMPTZ,
    value FLOAT8
);
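
-- Sketch (not taken from the benchmark files): the UNNEST pattern with
-- literal arrays, so it can be pasted into psql against the table above.
-- The sensor names, timestamps, and readings are made-up sample values.
INSERT INTO sensors (ts, sensorid, value)
  SELECT *
  FROM unnest(
    ARRAY['2024-11-15 17:00:00+00', '2024-11-15 17:00:01+00']::timestamptz[],
    ARRAY['sensor_1', 'sensor_2']::text[],
    ARRAY[20.5, 21.0]::float8[]
  );

-- Sketch: one way to compare planning and execution time afterwards.
-- Assumes PostgreSQL 13 or later with the pg_stat_statements extension
-- installed and pg_stat_statements.track_planning turned on (it is off by
-- default, in which case total_plan_time stays at zero).
-- Both times are reported in milliseconds.
SELECT query, calls, total_plan_time, total_exec_time
FROM pg_stat_statements
WHERE query ILIKE 'INSERT INTO sensors%'
ORDER BY total_exec_time DESC;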
</code></pre><p>I was hoping to use <a href="https://k6.io/" rel="noreferrer">Grafana k6</a> for all my performance articles, but in this case, it didn’t make sense. I don’t want to measure the time that application code takes to get my data into the format an <code>INSERT .. VALUES</code> or <code>INSERT .. UNNEST</code> statement needs (especially in TypeScript); I just want the time the database spends processing the statements and loading my data.</p><p>I fell back to using good old <a href="https://www.postgresql.org/docs/current/pgbench.html" rel="noreferrer">pgbench</a> for these tests with a static file for each <code>INSERT</code> variant and batch combination. As usual, you can find the files in the <a href="https://github.com/timescale/performance" rel="noreferrer">timescale/performance GitHub repo</a>.</p><p>I ran each of the following queries to insert one million records using a single thread:</p><ul><li><code>INSERT .. VALUES</code> with a batch size of 1,000</li><li><code>INSERT .. VALUES</code> with a batch size of 5,000</li><li><code>INSERT .. VALUES</code> with a batch size of 10,000</li><li><code>INSERT .. UNNEST</code> with a batch size of 1,000</li><li><code>INSERT .. UNNEST</code> with a batch size of 5,000</li><li><code>INSERT .. UNNEST</code> with a batch size of 10,000</li></ul><p>I used the <code>pg_stat_statements</code> (if you don’t know about this amazing extension, then do yourself a favor and <a href="https://timescale.ghost.io/blog/using-pg-stat-statements-to-optimize-queries/">look it up</a>!) statistics in the database to extract the <code>total_plan_time</code> and <code>total_exec_time</code> for each run.</p><h2 id="the-results-insert-values-vs-insert-unnest">The Results: INSERT VALUES vs. 
INSERT UNNEST</h2><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/11/Untitled-design--1--3-1.png" class="kg-image" alt="" loading="lazy" width="1200" height="502" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2024/11/Untitled-design--1--3-1.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2024/11/Untitled-design--1--3-1.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/11/Untitled-design--1--3-1.png 1200w" sizes="(min-width: 720px) 720px"></figure><p>The results were very clear: at the database layer, <strong><code>INSERT .. UNNEST</code> is 2.13x faster than </strong><code>INSERT .. VALUES</code> at a batch size of 1,000! This ratio held steady regardless of batch size (and even with multiple parallel jobs).</p><ul><li><strong>The primary savings come at query planning time.</strong> With the <code>INSERT .. VALUES</code> approach, Postgres must parse and plan each value individually (remember how many there were?). In contrast, <code>INSERT .. 
UNNEST</code> processes one array per column, which reduces the planning workload by not working with individual elements at plan time.</li><li><strong>Execution time is similar between both methods.</strong> The actual query execution time was slightly slower for <code>UNNEST</code>, which reflects the extra work that the <code>UNNEST</code> function needs to do.&nbsp;This was more than made up for by the planning gain.</li></ul><p>As you might expect, adding columns makes things even better for <code>UNNEST</code>: with 10 float columns (rather than one), it is a massive 5.02x faster. So if you've got a wide schema, you're in for even more performance gains (but I wanted to leave this article at what most people could reasonably expect).</p><p>If you’d like to see the graphs for the 5,000 and 10,000 batch sizes, then check out the <a href="https://www.tigerdata.com/blog/best-postgresql-gui-popsql-joins-timescale" rel="noreferrer">PopSQL</a> dashboard.</p><p>A reasonable response to this might be, "What if we prepared the <code>INSERT .. VALUES</code> query, would that reduce planning time and make it the winner?". Some quick tests (unfortunately, <code>pg_stat_statements</code> can't track statistics for <code>EXECUTE</code> queries on prepared statements) show that this is not the case; <code>UNNEST</code> is still king.</p><h2 id="should-i-use-unnest">Should I use UNNEST?</h2><p>There’s no question that in terms of <strong>database performance</strong>, <code>INSERT .. UNNEST</code> beats <code>INSERT .. VALUES</code> for batch inserts. By minimizing planning overhead, <code>UNNEST</code> unlocks an almost magical speed boost, making it a fantastic option for scenarios where ingestion speed is critical. 
One thing to keep in mind is that the overhead of your language and network latency often contribute just as much to the total time you see in your application, but still, your database will be working less, which is always a good thing.</p><p>As with any optimization, there’s a trade-off. The key consideration isn’t always just speed; it’s also <strong>usability</strong>. The <code>INSERT .. VALUES</code> syntax is intuitive and widely understood, making it easier to adopt and maintain, especially in teams or projects where SQL expertise varies. Pivoting to use <code>UNNEST</code> introduces complexity. You’ll need to wrangle your data into arrays, and if you’re using an ORM, you might discover it doesn’t support this pattern at all. If you're writing raw SQL, <code>UNNEST</code> might be less familiar to future developers inheriting your codebase.</p><p>And while <code>UNNEST</code> is fast, let’s not forget about <code>COPY</code>, which <a href="https://www.timescale.com/learn/testing-postgres-ingest-insert-vs-batch-insert-vs-copy" rel="noreferrer">remains the undisputed gold standard for ingestion</a>. If you don’t need features like upserts (<code>ON CONFLICT</code> clauses), <code>COPY</code> will get your data in faster, and with less overhead.</p><h2 id="final-thoughts-for-developers">Final Thoughts for Developers</h2><p>Think of <code>INSERT .. UNNEST</code> as a magic performance hack sitting squarely between traditional <code>INSERT .. VALUES</code> and <code>COPY</code>. It delivers significant speed improvements for batch ingestion while retaining the flexibility and composability of SQL <code>INSERT</code> statements. </p><p>At Tiger Data, we love exploring the edges of what Postgres can do and techniques like <code>INSERT .. UNNEST</code> remind us why. It’s elegant, fast, and underutilized, but hopefully no longer misunderstood. If you’re aiming to push your database to its limits, we highly recommend adding this pattern to your SQL toolkit. 
It’s another example of how understanding Postgres deeply can help you get the most out of your system. And if you want to optimize your PostgreSQL database for <a href="https://www.tigerdata.com/blog/time-series-introduction" rel="noreferrer">time series</a>, events, real-time analytics, or vector data, <a href="https://www.tigerdata.com/docs/self-hosted/latest/install" rel="noreferrer">take TimescaleDB out for a spin</a>.</p><p></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[PostgreSQL DISTINCT: TimescaleDB’s SkipScan Under Load]]></title>
            <description><![CDATA[We benchmarked TimescaleDB's SkipScan under load to see what effect it has on DISTINCT queries.]]></description>
            <link>https://www.tigerdata.com/blog/skip-scan-under-load</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/skip-scan-under-load</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[performance]]></category>
            <category><![CDATA[PostgreSQL Performance]]></category>
            <dc:creator><![CDATA[James Blackwood-Sewell]]></dc:creator>
            <pubDate>Thu, 07 Nov 2024 14:00:24 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/11/PostgreSQL-DISTINCT_response-times-2.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/11/PostgreSQL-DISTINCT_response-times-2.png" alt="PostgreSQL DISTINCT: TimescaleDB’s SkipScan Under Load" /><h2 id="the-introduction-distinct-queries-in-postgresql">The Introduction: DISTINCT Queries in PostgreSQL</h2><p>Let’s say you’re working with sensor data in PostgreSQL, with each reading containing a sensor ID, timestamp, and value. You want to power an application dashboard that needs to know the last known state of each sensor in your fleet. Your query might look like this:</p><pre><code class="language-SQL">SELECT DISTINCT ON (sensorid) *
FROM sensors
ORDER BY sensorid, ts DESC;</code></pre><p>The <code>DISTINCT ON</code> clause ensures only one record per sensor is selected, and because the query is ordered by descending timestamp, you’ll get the latest reading for each sensor (although you could also use a <code>WHERE</code> clause to get the latest value at another point in time). Simple enough, right?</p><p>In practice, this query pattern can be inefficient, even with proper indexing. In this post, I’ll explain why and walk through a benchmark demonstrating that TimescaleDB’s SkipScan can optimize this query by an astonishing 10,548x at p50 and 9,603x at p95.</p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text">This post is about optimizing <code spellcheck="false" style="white-space: pre-wrap;">DISTINCT</code> queries to get the last values associated with an ID quickly, if you want to estimate the cardinality of your dataset (count the unique IDs) then check out the <a href="https://docs.timescale.com/use-timescale/latest/hyperfunctions/" rel="noreferrer">timescaledb-toolkit,</a> which gives you <a href="https://docs.timescale.com/use-timescale/latest/hyperfunctions/approx-count-distincts/hyperloglog/" rel="noreferrer">hyperloglog</a></div></div><h2 id="skipscan-details">SkipScan Details</h2><p>SkipScan is one of those TimescaleDB features that flies under the radar but provides impressive performance improvements—especially given it works with both Timescale’s <a href="https://docs.timescale.com/use-timescale/latest/hypertables/about-hypertables/" rel="noreferrer">hypertables</a> and standard PostgreSQL tables (although not currently on compressed hypertables).</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/11/Postgres-DISTINCT-TimescaleDB-s-Skip-Scan-Under-Load_tweet.png" class="kg-image" alt="A tweet from Volkan Alkilic praising 
SkipScan's speed for time-series data" loading="lazy" width="923" height="268" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2024/11/Postgres-DISTINCT-TimescaleDB-s-Skip-Scan-Under-Load_tweet.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/11/Postgres-DISTINCT-TimescaleDB-s-Skip-Scan-Under-Load_tweet.png 923w" sizes="(min-width: 720px) 720px"></figure><p>As tables and indexes grow, <a href="https://www.timescale.com/learn/understanding-distinct-in-postgresql-with-examples" rel="noreferrer"><code>DISTINCT</code> queries</a> slow down in PostgreSQL because it doesn’t natively pull unique values directly from ordered indexes. Even if you have a perfect index in place, PostgreSQL will still scan the full index, filtering out duplicates only after the fact. This approach leads to a significant slowdown as tables grow larger.</p><p>SkipScan enhances the efficiency of <code>SELECT DISTINCT ON .. ORDER BY</code> queries by allowing PostgreSQL to directly jump to each new unique value within an ordered index, skipping over intermediate rows. This approach eliminates the need to scan the entire index and then deduplicate, as SkipScan directly retrieves the next distinct value, significantly accelerating query performance. 
If you're after a deep dive, <a href="https://docs.timescale.com/use-timescale/latest/query-data/skipscan/">check out the docs.</a></p><p>We’ve run <a href="https://timescale.ghost.io/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql/"><u>benchmarks on SkipScan</u></a> before, but this time, I wanted to see how it interacts in a more realistic environment with ingest and query running at the same time.</p><h2 id="the-setup">The Setup</h2><p>I set up two <a href="https://docs.timescale.com/#:~:text=What%20is%20Timescale%20Cloud%3F" rel="noreferrer">Timescale Cloud</a> instances with identical configurations (4 CPUs and 16&nbsp;GB of memory). On one instance, I disabled SkipScan (<code>SET timescaledb.skip_scan=off</code>), allowing it to default to standard PostgreSQL behavior. The other instance had SkipScan enabled to compare performance.</p><p>I created an empty test table using the following SQL (and without any TimescaleDB-specific features):</p><pre><code class="language-SQL">CREATE TABLE sensors (
  sensorid TEXT, 
  ts TIMESTAMPTZ,
  value FLOAT8);
  
CREATE UNIQUE INDEX ON sensors (sensorid, ts DESC);</code></pre><p>Using Grafana k6 (with the <a href="https://github.com/grafana/xk6-sql"><u>xk6-sql</u></a> extension), I ran the following test for twenty minutes:</p><ul><li><strong>Data ingest</strong>: Ingest ran at a target rate of 200K rows per second, using INSERT to ingest data from 1000 sensors, in batches of 1000, with up to 10 concurrent workers (watch this space for a deep dive into the performance of different PostgreSQL INSERT patterns coming soon).</li><li><strong>Query load</strong>: A <code>SELECT DISTINCT ON</code> query, running 10 times per second with up to 5 concurrent workers. This query pulls the latest reading for all 1000 sensors, simulating an application's needs.</li></ul><p>You'll remember the query from earlier:</p><pre><code class="language-SQL">SELECT DISTINCT ON (sensorid) *
FROM sensors
ORDER BY sensorid, ts DESC;</code></pre><p>If you’d like to recreate the benchmark, then check out the <a href="https://github.com/timescale/performance/tree/main"><u>GitHub repository</u></a> for the series.</p><h2 id="the-results-skipscan-vs-vanilla-postgresql">The Results: SkipScan vs. Vanilla PostgreSQL</h2><p><br>The graphs speak for themselves (please note the X axis in the query graph is a logarithmic scale), but here's a summary:</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/11/Your-paragraph-text--2-.png" class="kg-image" alt="Two line graphs, one illustrating the DISTINCT query response times at p50 and p90, and another benchmarking the data ingest performance" loading="lazy" width="2000" height="1457" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2024/11/Your-paragraph-text--2-.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2024/11/Your-paragraph-text--2-.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2024/11/Your-paragraph-text--2-.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/11/Your-paragraph-text--2-.png 2048w" sizes="(min-width: 720px) 720px"></figure><ul><li>The standard PostgreSQL server started<strong> ingesting 13% slower</strong> and couldn’t sustain the 200K/second goal (it only caught up once the <code>DISTINCT</code> queries stopped returning at the 14-minute mark).</li><li>SkipScan performed <strong>over 11x faster at p50 and p95</strong> right from the start.</li><li>By the 14-minute mark, SkipScan was <strong>10,548x faster at p50 and 9,603x faster at p95</strong> than standard PostgreSQL.</li><li>SkipScan maintained stable performance throughout the run, while PostgreSQL didn’t return any results after 14 minutes 
(RIP your dashboard). </li></ul><p>If you’d like to interact with the data then you can check out this <a href="https://www.tigerdata.com/blog/best-postgresql-gui-popsql-joins-timescale" rel="noreferrer"><u>PopSQL dashboard</u></a>.</p><h2 id="the-conclusion">The Conclusion</h2><p>SkipScan is a pretty remarkable feature, transforming underperforming <code>DISTINCT</code> queries into highly efficient operations. While there has been some discussion on adding it to PostgreSQL, TimescaleDB has your back today. Because SkipScan is not limited to hypertables, it benefits regular PostgreSQL tables as well, giving developers a performance boost just by adding the <a href="https://github.com/timescale/timescaledb" rel="noreferrer">TimescaleDB extension</a>.</p><p>In environments where you need fast, up-to-date insights—like the dashboard example with sensor data—SkipScan lets you keep pace without sacrificing performance. It’s one of those “small but mighty” features that often goes unnoticed but has an outsized impact on real-time analytics workloads.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Building Multi-Tenant RAG Applications With PostgreSQL: Choosing the Right Approach]]></title>
            <description><![CDATA[Learn how to pick the right approach to handle multi-tenancy for your use case when building multi-tenant RAG applications with PostgreSQL.]]></description>
            <link>https://www.tigerdata.com/blog/building-multi-tenant-rag-applications-with-postgresql-choosing-the-right-approach</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/building-multi-tenant-rag-applications-with-postgresql-choosing-the-right-approach</guid>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[RAG]]></category>
            <dc:creator><![CDATA[Avthar Sewrathan]]></dc:creator>
            <pubDate>Fri, 11 Oct 2024 15:51:00 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/10/Building-Multi-Tenant-RAG-Applications-With-PostgreSQL-Choosing-the-Right-Approach-1.webp">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/10/Building-Multi-Tenant-RAG-Applications-With-PostgreSQL-Choosing-the-Right-Approach-1.webp" alt="Building Multi-Tenant RAG Applications With PostgreSQL Choosing the Right Approach" /><p>If you’re building a retrieval-augmented generation (RAG) app with PostgreSQL and pgvector, you’ll probably run into the problem of handling multi-tenancy. This article explains how to pick the right approach to handle multi-tenancy for your use case.</p><h2 id="what-is-multi-tenancy">What Is Multi-Tenancy?</h2><p>Multi-tenancy is like an apartment building for software. Just as one building houses multiple tenants (families or individuals), a multi-tenant application serves multiple customers or organizations using a single instance of the software.</p><p>Multi-tenancy serves multiple "tenants" independently and securely—thereby preventing accidental or unauthorized cross-referencing of private information between different users. This means designing a system that not only understands and retrieves information effectively but also strictly adheres to user-specific data boundaries.&nbsp;</p><h2 id="multi-tenant-applications-pros-and-cons">Multi-Tenant Applications: Pros and Cons&nbsp;</h2><p>Multi-tenancy in RAG applications is vital for several key reasons, all of which deliver benefits:</p><ul><li><strong>Data isolation:</strong> This allows multiple tenants to use the same RAG system while keeping their data separate and secure. This is crucial for maintaining privacy and confidentiality.</li><li><strong>Scalability: </strong>A multi-tenant architecture enables the system to efficiently serve many users or organizations without needing separate deployments for each, leading to better resource utilization and cost-effectiveness.</li><li><strong>Customization: </strong>Different tenants often have unique needs. 
Multi-tenancy allows for customization of the RAG system for each tenant, such as using tenant-specific knowledge bases or fine-tuning models for particular domains.</li><li><strong>Compliance:</strong> In many industries, regulatory requirements mandate strict data separation. Multi-tenancy helps meet these compliance needs by ensuring data from different tenants doesn't mix.</li><li><strong>Efficient updates: </strong>With a multi-tenant system, updates and improvements can be rolled out to all tenants simultaneously, ensuring everyone benefits from the latest features and security patches.</li><li><strong>Cost-effectiveness: </strong>Sharing infrastructure and resources across tenants can significantly reduce operational costs compared to maintaining separate systems for each user or organization.</li><li><strong>Consistent performance:</strong> A well-designed multi-tenant RAG system can provide more consistent performance across all tenants, as resources are dynamically allocated based on usage.</li></ul><p>To build for long-term performance, flexibility, and efficiency, it’s also helpful to keep in mind the potential challenges of multi-tenant architectures:</p><ul><li>A multi-tenant environment allows multiple access points for users, which can increase the threat of a security breach.</li><li>Serving multiple clients in one instance of an application or database adds an extra level of complexity to the codebase and database maintenance.</li><li>Backup and restoration are more complex, so not all providers offer reliable restoration services.</li><li>The ability to offer tenant-specific customizations is limited, and balancing the shared codebase with unique tenant requirements is often necessary.</li><li>A technical problem on the provider’s end may affect all tenants simultaneously. 
This may apply to uptime, system upgrades, and other global processes.</li></ul><h2 id="postgresql-benefits-for-multi-tenant-rag-apps">PostgreSQL Benefits for Multi-Tenant RAG Apps</h2><p>PostgreSQL, enhanced with the <a href="https://www.tigerdata.com/learn/postgresql-extensions-pgvector" rel="noreferrer"><u>pgvector extension</u></a> (the popular open-source extension for vector handling in PostgreSQL), offers a robust solution for implementing multi-tenant RAG apps. Its ability to efficiently store and <a href="https://www.tigerdata.com/learn/understanding-vector-search" rel="noreferrer">search vector</a> embeddings alongside traditional data types makes it an ideal choice for organizations looking to leverage their existing infrastructure.&nbsp;</p><p>Here are the reasons <a href="https://www.tigerdata.com/blog/postgres-for-everything" rel="noreferrer">why PostgreSQL</a> is a good fit for multi-tenant RAG applications:</p><ol><li><strong>Built-in full-text search:</strong> PostgreSQL has robust full-text search capabilities, which are crucial for efficient retrieval in RAG systems.</li><li><strong>JSON support:</strong> PostgreSQL handles JSON data natively, allowing flexible storage of documents and metadata.</li><li><strong>Vector extensions:</strong> extensions like pgvector enable vector similarity searches, essential for embedding-based retrieval.</li><li><strong>Row-level security: </strong>this feature allows fine-grained access control, crucial for multi-tenant setups to ensure data isolation.</li><li><strong>Scalability:</strong> PostgreSQL can handle large datasets and concurrent users, important for growing multi-tenant applications.</li><li><strong>ACID compliance:</strong> <a href="https://www.timescale.com/learn/understanding-acid-compliance" rel="noreferrer">atomicity, consistency, isolation, and durability (ACID) compliance</a> ensures data integrity and consistency across transactions.</li><li><strong>Extensibility: </strong>custom functions 
and extensions can be added to tailor the database to specific RAG needs.</li><li><strong>Cost-effective:</strong> as an open-source solution, it can be more cost-effective than some cloud-based alternatives.</li></ol><p>Using PostgreSQL for multi-tenant RAG applications also gives you the advantage of <a href="https://cerebralvalley.ai/blog/timescale-is-making-postgresql-better-for-ai-1tiUqSzGsSn76ORZVfMwOk"><u>Timescale Cloud’s stack of open-source extensions</u></a> to easily build and scale RAG, search, and agent applications. In addition to pgvector, this stack includes <a href="https://timescale.ghost.io/blog/pgvector-is-now-as-fast-as-pinecone-at-75-less-cost/"><u>pgvectorscale</u></a> (which builds on pgvector for enhanced performance and scale) and <a href="https://www.timescale.com/ai"><u>pgai</u></a> (which brings embedding creation and large language model completions to the database, giving more <a href="https://timescale.ghost.io/blog/pgai-giving-postgresql-developers-ai-engineering-superpowers/"><u>PostgreSQL developers the skills of AI engineers</u></a>). Both extensions complement pgvector and rely on its capabilities.</p><h3 id="implementing-rag-with-postgresql">Implementing RAG with PostgreSQL</h3><p>Implementing RAG with PostgreSQL involves a multi-step process that leverages the database's vector storage capabilities. The workflow typically includes ingesting and chunking data, converting text into <a href="https://www.tigerdata.com/blog/a-beginners-guide-to-vector-embeddings" rel="noreferrer">vector embeddings</a> using an embedding model, and storing these vectors in PostgreSQL using pgvector.&nbsp;</p><p>When a user query is received, the system embeds the query and retrieves the most relevant chunks from PostgreSQL using a similarity search. This retrieved information is then combined with the user's question and any additional context to create a comprehensive prompt for the large language model (LLM). 
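</p><p>As a rough illustration of the retrieval step, a similarity search with pgvector might look like the sketch below. The <code>document_chunks</code> table, its columns, and the three-dimensional embedding are hypothetical placeholders chosen for brevity; real embeddings usually have hundreds or thousands of dimensions.</p><pre><code class="language-SQL">-- Hypothetical schema, for illustration only
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE document_chunks (
  id        bigserial PRIMARY KEY,
  content   text NOT NULL,
  embedding vector(3) NOT NULL  -- toy dimension; match your embedding model
);

-- Return the five chunks closest to the query embedding.
-- &lt;=&gt; is pgvector's cosine-distance operator.
SELECT id, content
FROM document_chunks
ORDER BY embedding &lt;=&gt; '[0.1, 0.2, 0.3]'::vector
LIMIT 5;</code></pre><p>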
The LLM processes this enriched prompt and generates a response, which is then returned to the user, providing a more accurate and contextually relevant answer.</p><h2 id="strategies-for-dealing-with-multi-tenancy-for-rag-in-postgresql">Strategies for Dealing With Multi-Tenancy for RAG in PostgreSQL</h2><p>To pick the right strategy for your multi-tenant RAG application with PostgreSQL, consider your requirements (and your users’ or customers’ requirements) for shared resources, data separation, customization, scalability, and of course, costs.&nbsp;</p><p>PostgreSQL offers four levels of multi-tenancy implementation—table, schema, logical database, and database service—each suitable for distinct use case scenarios and each with its pros and cons. Here’s a comparative overview of each level.&nbsp;&nbsp;</p><ol><li><strong>Table-level multi-tenancy</strong> gives each tenant its own table. This is simple but may lead to data isolation concerns. Table-level isolation works well for simple apps with shared data.</li><li><strong>Schema-level separation </strong>gives each tenant its own schema in the same logical database. It offers better isolation with minimal operational and cost overhead, balancing isolation and efficiency for most use cases.</li><li><strong>Logical database separation</strong> gives each tenant its own logical database in a database instance. Separation at the logical database level provides stronger separation but increases complexity. Logical databases suit clients needing stricter separation.</li><li><strong>Database service separation </strong>gives each tenant their own database service. Service-level separation is ideal for high-security scenarios or clients with unique needs. 
It offers the highest isolation but at the cost of increased resources and management overhead.</li></ol><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/10/strategies-for-multitenant-rag-in-postgres-pgvector.jpg" class="kg-image" alt="&nbsp;Overview of approaches for handling multi-tenancy in PostgreSQL" loading="lazy" width="1654" height="931" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2024/10/strategies-for-multitenant-rag-in-postgres-pgvector.jpg 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2024/10/strategies-for-multitenant-rag-in-postgres-pgvector.jpg 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2024/10/strategies-for-multitenant-rag-in-postgres-pgvector.jpg 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/10/strategies-for-multitenant-rag-in-postgres-pgvector.jpg 1654w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">&nbsp;Overview of approaches for handling multi-tenancy in PostgreSQL</span></figcaption></figure><h2 id="conclusion">Conclusion</h2><p>By carefully considering the optimal use cases, pros, and cons of each multi-tenancy approach and aligning them with your application's needs, you can create a scalable, secure, and performant RAG system in PostgreSQL. 
As RAG technologies continue to evolve, PostgreSQL's extensibility and strong community support ensure that it will remain an adaptable platform for building sophisticated multi-tenant AI applications.&nbsp;</p><p>Additionally, PostgreSQL on Timescale Cloud allows you to store your relational, <a href="https://www.tigerdata.com/blog/time-series-introduction" rel="noreferrer">time series</a>, events, semi-structured, <em>and</em> vector data in one place. This removes the operational complexity of managing a separate vector database. It can deliver <a href="https://timescale.ghost.io/blog/pgvector-vs-pinecone/"><u>performance, rich capabilities, and user experience</u></a> equal to or better than a specialized tool. <br><br><a href="https://console.cloud.timescale.com/signup" rel="noreferrer">Create a free account to try Timescale Cloud's open-source AI stack today</a> (including pgvector, <a href="https://www.tigerdata.com/blog/pgai-giving-postgresql-developers-ai-engineering-superpowers" rel="noreferrer">pgai</a>, and pgvectorscale).</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Boosting 400x Query Performance for Tiered Data on S3 in PostgreSQL]]></title>
            <description><![CDATA[Read how we made query performance up to 400x faster over tiered data while keeping your storage costs down.]]></description>
            <link>https://www.tigerdata.com/blog/boosting-query-performance-for-tiered-data-in-postgresql</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/boosting-query-performance-for-tiered-data-in-postgresql</guid>
            <category><![CDATA[Announcements & Releases]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Gayathri Ayyappan]]></dc:creator>
            <pubDate>Fri, 20 Sep 2024 13:00:02 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/09/Building-Faster-Query-Performance-for-Tiered-Data-in-PostgreSQL.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/09/Building-Faster-Query-Performance-for-Tiered-Data-in-PostgreSQL.png" alt="White squares representing data being moved between a database and the tiered cloud storage" /><p>If you’re managing a PostgreSQL database, more data often means scalability hurdles. As your data expands, the stakes rise: storage costs soar, access speed slows down, and resource efficiency looks like a distant dream. Suddenly, you find yourself asking how on Earth will you scale your PostgreSQL database without breaking the bank. Data tiering, archiving, and deleting spring to mind.<strong>&nbsp;</strong></p><p><a href="https://timescale.ghost.io/blog/scaling-postgresql-for-cheap-introducing-tiered-storage-in-timescale/"><u>Timescale's tiered storage architecture</u></a> was designed to address this challenge head-on. It enables seamless management of massive datasets by moving older, less frequently accessed data to cheaper, slower storage without sacrificing query performance.</p><p>Only available in Timescale’s PostgreSQL cloud platform, <a href="https://www.timescale.com/cloud"><u>Timescale Cloud</u></a>, this tiered storage backend allows you to move data from the high-performance tier (block storage) to the low-cost bottomless storage tier (Amazon S3 object storage). With a simple data tiering policy, you can significantly reduce storage costs while maintaining high performance for your most recent data. And you can also transparently query across all your data, regardless of whether it’s still in PostgreSQL or S3.</p><p>However, there is a downside: compared to the blazing-fast high-performance tier, querying older, tiered data can feel a lot slower. So we decided to work on a performance boost. 
Today, we’re happy to announce that <strong>querying tiered data is now up to 400x faster</strong> thanks to optimizations like chunk exclusion, row-group exclusion, and column exclusion. These improvements will help you save on storage costs without compromising query speed, even as your <a href="https://timescale.ghost.io/blog/scaling-postgresql-to-petabyte-scale/" rel="noreferrer">data grows to petabytes and beyond</a>.</p><div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-emoji">✨</div><div class="kg-callout-text">It’s a wrap for the September Launch Week! To check this week’s previous launches, head to this <a href="https://timescale.ghost.io/blog/making-postgres-faster/"><u>blog post</u></a> or our <a href="https://www.timescale.com/launch/2024?ref=timescale.com"><u>launch page</u></a>. </div></div><h2 id="what-is-data-tiering">What Is Data Tiering?</h2><p>Before getting into the improvements we developed, let’s understand why data tiering is so important for modern, data-intensive applications. Data tiering is the process of dynamically <a href="https://www.timescale.com/learn/postgres-data-management-best-practices"><u>managing the data’s lifecycle</u></a> by moving it between different storage types based on age, usage, and relevance.&nbsp;</p><p>In <strong>Timescale</strong>, this is the standard data lifecycle (although not all steps are compulsory):</p><ol><li><strong>Ingest</strong>: New data is written to the rowstore, which is organized for fast writes and reads of full rows.</li><li><strong>Columnize</strong>: After a configurable period, the data can be migrated to the columnstore for faster aggregate and analytical queries and a smaller storage footprint. 
With this <a href="https://timescale.ghost.io/blog/hyperstore-a-hybrid-row-storage-engine-for-real-time-analytics/" rel="noreferrer">hybrid row-columnar storage engine</a>, we optimize <a href="https://docs.timescale.com/use-timescale/latest/hypertables/" rel="noreferrer">hypertables</a> for analytical query processing while achieving significant storage savings with compression.</li><li><strong>Tier</strong>: Once the data ages further and is no longer needed for real-time or operational analytics, it can be automatically moved to the object-storage tier. This move allows for the cost-effectiveness of object storage without sacrificing the ability to query the data transparently.</li><li><strong>Drop</strong>: Eventually, the oldest data may be dropped based on a configured data retention policy, making space for new data and keeping your storage footprint efficient.</li></ol><h3 id="why-tiering">Why tiering?</h3><p>Data tiering becomes especially important when you’re handling massive amounts of time-series data or other demanding workloads. In these cases, databases will most likely be the main cost driver of your application due to increasing storage costs. But keeping all data in fast, high-performance storage is expensive and often unnecessary. Most organizations need to access recent data quickly while storing older data more cost-effectively.</p><p>This is where <a href="https://timescale.ghost.io/blog/scaling-postgresql-for-cheap-introducing-tiered-storage-in-timescale/"><u>Timescale Cloud’s tiered storage</u></a><strong> </strong>really makes an impact, streamlining data tiering for users. Data movement happens in the background, with minimal impact on your production database. Your data remains fully queryable throughout the process, whether it's stored on the high-performance tier or cheap object storage.</p><p>By making use of different storage types and automating the data lifecycle, you can significantly reduce costs without sacrificing performance. 
Instead of paying a premium for expensive storage for every row of data, you only keep hot data on a high-performance disk, while cold data resides in cost-effective object storage like S3.&nbsp;</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/09/image.png" class="kg-image" alt="Timescale's tiered storage backend, represented by a database with a high-performance tier and a low-cost tier" loading="lazy" width="849" height="475" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2024/09/image.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/09/image.png 849w" sizes="(min-width: 720px) 720px"></figure><p>For those working with massive datasets, particularly in industries like IoT, financial services, or monitoring, the cost and efficiency gains of tiered storage can be game-changing. And with our recent optimizations, querying tiered data doesn’t mean taking a huge performance hit—you can still run powerful, complex queries across hot and cold data.&nbsp;</p><p>Let’s see how.&nbsp;</p><h2 id="how-tiering-works-a-look-under-the-hood">How Tiering Works: A Look Under the Hood</h2><p>As mentioned, one of the key strengths of our tiered storage architecture is its automation and minimal operational overhead. Users can set up a tiering policy for a hypertable or continuous aggregate (our version of always up-to-date materialized views). </p><p>The tiering policy runs in the background, identifying data chunks that match your criteria (such as data older than a specific threshold) and adding them to a queue to be moved to Amazon S3. 
This is an asynchronous process, where chunks are moved one at a time, ensuring that your production database is not overwhelmed.</p><pre><code class="language-SQL">-- Signature: add_tiering_policy(hypertable REGCLASS, move_after INTERVAL); the values below are examples
SELECT add_tiering_policy('device_telemetry', INTERVAL '3 months');</code></pre><p>Once the data is moved to S3, it’s stored in the <a href="https://parquet.apache.org/" rel="noreferrer">Parquet format</a>, an open-source columnar format designed for efficient querying and data storage.</p><h3 id="the-parquet-format-enhancing-query-efficiency">The Parquet format: Enhancing query efficiency</h3><p>Parquet organizes data into row groups, typically batches of 100,000 rows, with each row group further divided into column chunks. This columnar structure means that only the specific columns required by your query are read from storage instead of loading entire rows, significantly improving query performance. While this is similar to the in-database columnar format, it's optimized for object storage.</p><p>For instance, if a query only needs three columns from a dataset, Parquet allows the execution engine to process just those three column chunks. This column exclusion is particularly useful when querying large datasets, as it reduces the amount of unnecessary data read from storage.</p><p>Each row group in Parquet also includes the minimum/maximum information for each column, allowing Timescale to apply row-group exclusion—the process of filtering out entire row groups that don’t match your query conditions. 
For example, if you’re filtering by <code>device_id = 1</code>, Timescale can skip any row group whose stored minimum and maximum <code>device_id</code> values show that it cannot contain <code>1</code>.</p><p>This combination of chunk exclusion, row-group exclusion, and column exclusion makes Timescale’s tiering incredibly efficient, ensuring that even cold data in S3 can be queried quickly and cost-effectively.</p><h2 id="tiered-data-query-optimization-in-action">Tiered Data Query Optimization in Action</h2><p>Let’s look at how this works with an example query on a tiered hypertable called <code>device_telemetry</code>, where <code>observed_at</code> is the time-partitioning column:</p><pre><code class="language-SQL">SELECT device_id, MAX(temperature), MIN(humidity)
FROM device_telemetry
WHERE device_id = 1
AND observed_at BETWEEN '2024-01-01' AND '2024-02-01'
GROUP BY device_id;</code></pre><p>Here’s how Timescale Cloud optimizes this query:</p><ul><li><strong>Chunk exclusion</strong>: Timescale Cloud first eliminates tiered chunks that don’t match the time filter. Let’s say this reduces the dataset to five chunks out of 1,000 that meet the <code>observed_at</code> condition.</li><li><strong>Row-group exclusion</strong>: Next, within these five chunks, it scans only the row groups that can contain <code>device_id = 1</code>. If each chunk has 20 row groups but only two are relevant, the scan shrinks to 10 row groups out of the original 20,000.</li><li><strong>Column exclusion</strong>: Since you only requested <code>device_id</code>, <code>temperature</code>, and <code>humidity</code>, it reads just these three columns from the Parquet files on S3.</li></ul><p>By applying these optimizations, the database processes far fewer rows and columns from object storage, leading to much faster query execution.</p><h3 id="performance-boosts-with-compression-settings">Performance boosts with compression settings</h3><p>Another performance enhancement comes from applying compression settings to tiered data. Timescale Cloud allows users to define compression settings for their hypertables, such as <code>compress_segmentby</code> (which determines how data is grouped) and <code>compress_orderby</code> (which defines the order of compressed data).</p><p>When moving data to S3 in Parquet format, we use these compression settings to optimize queries further. For example, if your hypertable is segmented by <code>device_id</code> and ordered by <code>observed_at</code>, Parquet will store the data in the same order, making it easier to apply row-group exclusion and reducing query time. 
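</p><p>As a minimal sketch of what such settings might look like, assuming the <code>device_telemetry</code> hypertable from the earlier example (the column choices and sort direction are illustrative, not the configuration used to measure the results in this post):</p><pre><code class="language-SQL">-- Assumed example: group compressed data by device and keep it in time order,
-- so rows for one device sit together and min/max metadata stays tight.
ALTER TABLE device_telemetry SET (
  timescaledb.compress,
  timescaledb.compress_segmentby = 'device_id',
  timescaledb.compress_orderby = 'observed_at DESC'
);</code></pre><p>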
In some cases, these optimizations have led to up to <strong>400x performance improvements</strong> for point queries.</p><h3 id="time-bucket-optimizations">Time bucket optimizations</h3><p>Timescale’s tiering also benefits from runtime optimizations for queries using non-constant expressions like <code>now()</code>, whose value is only known at execution time. Previously, chunk exclusion wouldn’t apply if a query used dynamic expressions (e.g., <code>observed_at &lt;= now() - '3 months'::interval</code>). However, with recent updates, we now apply chunk exclusion even for these queries, ensuring that tiered data is filtered efficiently without needing hard-coded time ranges.</p><p>For example, the following query can now take full advantage of chunk exclusion:</p><pre><code class="language-SQL">SELECT MAX(temperature)
FROM device_telemetry
WHERE observed_at &gt;= NOW() - '4 months'::interval
AND observed_at &lt; NOW() - '3 months'::interval
AND device_id = '0xxacedf';</code></pre><p>This improvement makes it easier to write flexible queries while still benefiting from a performance boost.</p><hr><h2 id="keep-your-data-volume-under-control">Keep Your Data Volume Under Control</h2><p>Timescale Cloud’s tiered storage delivers significant cost and performance advantages by automatically managing your data's lifecycle, freeing up your time so you can focus on your application—not your database. From chunk exclusion to row-group filtering and Parquet format optimizations, you can rest assured that your queries will run efficiently (up to 400x faster!), whether the data is stored locally or in Amazon S3.</p><p>By fine-tuning your compression settings and with optimizations like runtime chunk exclusion, you can take full advantage of tiering while maintaining top-tier query performance.&nbsp;</p><p><a href="https://docs.timescale.com/use-timescale/latest/data-tiering/"><u>Visit our docs</u></a> to learn more about our multi-tiered backend architecture and check if it fits your use case. To start saving on costs while easily accessing and querying your old data, <a href="https://console.cloud.timescale.com/signup"><u>create a free Timescale account</u></a> today.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How We Made PostgreSQL Upserts 300x Faster on Compressed Data]]></title>
            <description><![CDATA[Read how we made upsert performance 300x faster on compressed data by moving from PostgreSQL sequential scans to index scans.]]></description>
            <link>https://www.tigerdata.com/blog/how-we-made-postgresql-upserts-300x-faster-on-compressed-data</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/how-we-made-postgresql-upserts-300x-faster-on-compressed-data</guid>
            <category><![CDATA[Announcements & Releases]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[PostgreSQL Performance]]></category>
            <dc:creator><![CDATA[Ante Krešić]]></dc:creator>
            <pubDate>Wed, 18 Sep 2024 18:00:21 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/09/Blog-1.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/09/Blog-1.png" alt="Three white squares with arrows pointing to another one over a black background: How we made Postgres upserts faster " /><p>As a developer, your customers' challenges often become your own. This was precisely the case when one of our customers, <a href="https://timescale.ghost.io/blog/how-ndustrial-is-providing-fast-real-time-queries-and-safely-storing-client-data-with-97-compression/"><u>Ndustrial</u></a>, an industrial energy optimization platform, reported that they had to resort to workarounds due to suboptimal PostgreSQL upsert performance when using TimescaleDB. Naturally, we were all ears.👂</p><h2 id="how-does-postgresql-upsert-work">How Does PostgreSQL Upsert Work?</h2><p>Before explaining Ndustrial’s use case and schema, let’s talk about PostgreSQL upserts. Upserts work by checking the constraints defined on the target relation during an insertion. This process primarily relies on speculative insertion using a unique index.</p><p>When you do speculative insertion, you are basically trying to insert the new row in the index and look for conflicts. If the insert is successful, the row can be safely added.&nbsp;</p><p>However, if there is a conflict, PostgreSQL resolves it based on the <code>ON CONFLICT</code> clause specified in the insert statement. When you choose the <code>DO UPDATE</code> option, you turn the insert statement into an update, which will, in turn, modify the conflicting row with the values from the new row being inserted. A full statement might look like this:</p><pre><code class="language-SQL">INSERT INTO sensors
  VALUES (1, '2024-09-18', 15.0)
  ON CONFLICT (id, timestamp)
  DO UPDATE SET reading = EXCLUDED.reading;</code></pre><p>However, while TimescaleDB is built on PostgreSQL, things work slightly differently under the hood when using compression. So, before diving into the technical details of our optimization, it’s essential to understand a few key TimescaleDB concepts: <a href="https://www.tigerdata.com/blog/database-indexes-in-postgresql-and-timescale-cloud-your-questions-answered" rel="noreferrer">hypertables</a>, chunks, and batches.</p><h3 id="understanding-hypertables-chunks-and-batches">Understanding hypertables, chunks, and batches</h3><ul><li><strong>Hypertables</strong>: In TimescaleDB, a hypertable is a PostgreSQL table that’s automatically partitioned to handle large-scale time-series data efficiently. A hypertable is the main abstraction that allows users to work with what appears to be a single table while, behind the scenes, TimescaleDB manages the partitioning for performance and scalability.</li><li><strong>Chunks</strong>: Hypertables are divided into smaller, more manageable pieces called chunks. Each chunk is essentially a temporal partition of the hypertable; for example, each one might cover a single day. 
This partitioning allows for more efficient query performance, as operations can be limited to relevant chunks rather than the entire hypertable.</li></ul><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/09/How-we-made-postgres-inserts-faster-on-compressed-data_hypertables.png" class="kg-image" alt="" loading="lazy" width="1200" height="911" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2024/09/How-we-made-postgres-inserts-faster-on-compressed-data_hypertables.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2024/09/How-we-made-postgres-inserts-faster-on-compressed-data_hypertables.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/09/How-we-made-postgres-inserts-faster-on-compressed-data_hypertables.png 1200w" sizes="(min-width: 720px) 720px"></figure><ul><li><strong>Batches</strong>: When a hypertable chunk is compressed, it’s broken down into batches, each of which packs array representations of up to 1,000 rows of data into a single database row. These batches are then compressed together, reducing storage requirements, which in turn improves performance for certain operations through reduced I/O and vectorization.</li><li><strong>Segment_by</strong>: The optional <code>segment_by</code> parameter in TimescaleDB’s compression settings specifies the columns used to segment data within a chunk. When you compress a chunk, each batch in that chunk holds data for a single <code>segment_by</code> value. 
This increases performance for queries that filter on the <code>segment_by</code> column (<code>SELECT * FROM sensors WHERE sensor_id=100</code>) and maximizes compression gains by grouping values that are more likely to be similar.</li></ul><h2 id="the-challenge">The Challenge</h2><p>When we first examined Ndustrial's schema, we noticed they had a unique setup: all their data was first upserted into a staging table and later batch-written to a compressed hypertable. To maintain the most current view of their data, they joined these two tables.</p><p>They originally chose this approach because the upsert (<code>INSERT ON CONFLICT</code>) wasn’t performing well on compressed data, but as their data grew, this method broke down too (as well as being incredibly hard to maintain, expensive to query, and difficult to reason about). But why did upserts not work for them?</p><p>Upon inspecting the dataset, our suspicions were confirmed: they had a combination of high <code>segment_by</code> cardinality (some chunks had over 172K compressed batches) and a backfilling process that routinely wrote large amounts of data to compressed chunks. While TimescaleDB supports mutating compressed data, it's usually used for "one-off" updates, and this was happening constantly—but could we support it?</p><p>The challenge with upserts on compressed hypertables is that the necessary B-tree indexes don't exist, as the rows involved have been compressed and are no longer available in their raw, uncompressed form. To detect and resolve conflicts, we must first decompress the rows that may conflict with the row being inserted. This way, we effectively build the index at insert time and let PostgreSQL handle the speculative insertion from there.</p><p>However, this process requires identifying the compressed data batches that could potentially contain the conflicting row for each new insertion. 
Let’s see how we handled this.</p><div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-emoji">🔖</div><div class="kg-callout-text"><b><strong style="white-space: pre-wrap;">Editor’s Note: </strong></b>On top of the process we’re about to describe, the team also developed other optimizations to improve DML (<code spellcheck="false" style="white-space: pre-wrap;">INSERT</code>, <code spellcheck="false" style="white-space: pre-wrap;">UPDATE</code>, <code spellcheck="false" style="white-space: pre-wrap;">DELETE</code>) performance by 10x for compressed data. <a href="https://www.timescale.com/blog/bridging-the-gap-between-compressed-and-uncompressed-data-in-postgres" rel="noreferrer">Learn more about this optimization</a>.</div></div><p></p><h2 id="the-technical-solution-index-scans-for-faster-upserts">The Technical Solution: Index Scans for Faster Upserts</h2><p>As mentioned, the performance improvement we needed came from a single simple yet crucial change: modifying the upsert process to use already existing indexes to find all batches of rows that need to be decompressed. Here’s what we did:</p><h3 id="index-scans-on-compressed-data">Index scans on compressed data</h3><p>In TimescaleDB, when a chunk is compressed with the <code>segment_by</code>&nbsp;option, a B-tree index is automatically created on the <code>segment_by</code>&nbsp;columns and the batch sequence number. However, in the original implementation, the upsert process did not utilize this index. Instead, it relied on a sequential scan to locate potential conflicts, which was inefficient for <a href="https://www.tigerdata.com/learn/how-to-handle-high-cardinality-data-in-postgresql" rel="noreferrer">high-cardinality</a> datasets like Ndustrial’s (but was performant for most workloads we had seen previously). Why did we overlook this originally? 
We can't say, but we sure are glad we found the optimization!</p><p>The enhancement in this PR was to update the upsert mechanism to use the existing index whenever possible. By doing so, the system could now quickly locate the relevant compressed batches, dramatically reducing the time required to identify and resolve conflicts during an upsert. If the index is missing—an uncommon scenario unless manually removed—the system will fall back to a sequential scan.&nbsp;</p><p>One of the first questions we had was, “Would this trigger a regression for low-cardinality or non-segmented compressed data?” The answer was no! The only time this approach loses to a sequential scan is when the difference is so minuscule that it can barely be measured.</p><h2 id="the-result-300x-faster-upsert-performance-for-compressed-data">The Result: 300x Faster Upsert Performance for Compressed Data</h2><p>The impact of the optimization was dramatic for Ndustrial, leading to a 300x increase in upsert performance over compressed data. Here’s a breakdown of what we saw when analyzing their workload:</p><h3 id="before-optimization-v2142">Before optimization (v2.14.2)</h3><p>Originally, an <code>INSERT ON CONFLICT</code> of 10,000 rows from the staging table into the hypertable, with 10 rows causing conflicts, took <strong>427,580 ms</strong>—over seven minutes. 
This was because the process relied heavily on sequential scans to identify and resolve conflicts.&nbsp;</p><p>Looking at the flame graph of the operation, we can see that the <code>decompress_batches_for_insert</code> function accounts for more than 99% of the CPU time, of which over 99% is from getting tuples from the heap and filtering them.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/09/How-we-improved-upsert-performance-in-postgres_flamegraph-before.png" class="kg-image" alt="" loading="lazy" width="936" height="696" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2024/09/How-we-improved-upsert-performance-in-postgres_flamegraph-before.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/09/How-we-improved-upsert-performance-in-postgres_flamegraph-before.png 936w" sizes="(min-width: 720px) 720px"></figure><p>No wonder Ndustrial started investigating using the secondary table as a workaround!</p><h3 id="after-optimization-v216">After optimization (v2.16)</h3><p>After updating the <code>INSERT ON CONFLICT</code> mechanism to use index scans to locate the batches to decompress, the same operation—upserting 10,000 rows with 10 conflicts—completed in just <strong>1,149 ms</strong>, or slightly over one second (roughly 370x faster than before).</p><p>By using the index on the <code>segment_by</code> columns, the system could quickly locate the relevant compressed batches, dramatically reducing the time spent on conflict resolution and batch decompression.</p><p>The flame graph now shows a very different story, with most of the time spent in the <code>decompress_batches_for_insert</code> function now coming from retrieving the compression settings (a potential improvement for another time).</p><figure class="kg-card kg-image-card"><img 
src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/09/How-we-improved-upsert-performance-in-postgres_flamegraph-after.png" class="kg-image" alt="" loading="lazy" width="903" height="698" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2024/09/How-we-improved-upsert-performance-in-postgres_flamegraph-after.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/09/How-we-improved-upsert-performance-in-postgres_flamegraph-after.png 903w" sizes="(min-width: 720px) 720px"></figure><h2 id="conclusion">Conclusion</h2><p>The journey to optimizing upsert performance on compressed hypertables for Ndustrial highlighted the importance of understanding and addressing key bottlenecks in <a href="https://www.tigerdata.com/learn/guide-to-postgresql-database-operations" rel="noreferrer">database operations</a>.&nbsp;</p><p>By diving deep, analyzing the issue, and then making a seemingly simple yet impactful change—leveraging existing indexes for upserts—we were able to unlock a 300x performance improvement for high-cardinality workloads. This optimization not only resolved the immediate performance issues but also opened the door for Ndustrial to manage their data more efficiently and confidently. 
To learn more about other recent optimizations, <a href="https://timescale.ghost.io/blog/making-postgres-faster/" rel="noreferrer">check out this blog post</a>.</p><p>Or, as our client put it, “We've definitely appreciated working closely with Timescale on this issue and all the work they've been putting into the enhancements!” A happy customer makes for a happy developer.&nbsp;</p><p>If you, too, are looking for a fast, cost-saving PostgreSQL cloud database that can handle all your <a href="https://www.tigerdata.com/blog/time-series-introduction" rel="noreferrer">time series</a>, events, real-time analytics, and vector data, topped by a Support team that is happy to roll out wider optimizations to improve your use case, <a href="https://console.cloud.timescale.com/signup"><u>try Timescale for free</u></a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Bridging the Gap Between Compressed and Uncompressed Data in Postgres: Introducing Compression Tuple Filtering]]></title>
            <description><![CDATA[Compressed data in Postgres now performs more like uncompressed data. Discover how we achieved up to 500x faster updates & deletes on compressed data.]]></description>
            <link>https://www.tigerdata.com/blog/bridging-the-gap-between-compressed-and-uncompressed-data-in-postgres</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/bridging-the-gap-between-compressed-and-uncompressed-data-in-postgres</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Announcements & Releases]]></category>
            <category><![CDATA[Analytics]]></category>
            <category><![CDATA[PostgreSQL Performance]]></category>
            <dc:creator><![CDATA[Sven Klemm]]></dc:creator>
            <pubDate>Wed, 18 Sep 2024 13:00:59 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/09/Blog-2.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/09/Blog-2.png" alt="White data squares going through a funnel: tuple filtering enables handling compressed data as if it were uncompressed" /><p>When we introduced <a href="https://timescale.ghost.io/blog/building-columnar-compression-in-a-row-oriented-database/"><u>columnar compression for Postgres</u></a> in 2019, our goal was to help developers scale Postgres and efficiently manage growing datasets, such as IoT sensors, financial ticks, product metrics, and even vector data. Compression quickly became a game-changer, saving users significant storage costs and boosting query performance—all while keeping their data in Postgres. With many seeing over 95&nbsp;% compression rates, the impact was immediate.</p><p>But we didn’t stop there. Recognizing that many real-time analytics workloads demand flexibility for updating and backfilling data, we <a href="https://timescale.ghost.io/blog/timescaledb-2-3-improving-columnar-compression-for-time-series-on-postgresql/"><u>slowly</u></a> but <a href="https://timescale.ghost.io/blog/allowing-dml-operations-in-highly-compressed-time-series-data-in-postgresql/"><u>surely enhanced</u></a> our compression engine to support <code>INSERT</code>, <code>UPDATE</code>, and <code>DELETE</code> (DML) operations directly on compressed data. This allowed users to work with compressed data almost as easily as they do with uncompressed data.</p><p>However, it also created a problem. While we had originally intended mutating compressed chunks to be a rare event, people were now pushing its limits with frequent inserts, updates, and deletes. 
Seeing our customers go all in on this feature confirmed that we were on the right track, but we had to double down on performance.</p><p>Today, we’re proud to announce significant improvements as of TimescaleDB 2.16.0, delivering up to <strong>500x faster updates and deletes</strong> and <strong>10x faster upserts</strong> on compressed data. These optimizations make compressed data behave even more like uncompressed data—without sacrificing performance or flexibility.</p><p>Let’s dive into how we achieved these performance gains and what they mean for you. To check this week’s previous launches and keep track of upcoming ones, head to this <a href="https://timescale.ghost.io/blog/making-postgres-faster/"><u>blog post</u></a> or our <a href="https://www.timescale.com/launch/2024"><u>launch page</u></a>.&nbsp;</p><h2 id="how-we-allowed-dml-operations-on-compressed-data">How We Allowed DML Operations on Compressed Data</h2><p>To understand our latest improvements, it helps to revisit how we initially threw away the rule book and <a href="https://timescale.ghost.io/blog/allowing-dml-operations-in-highly-compressed-time-series-data-in-postgresql/"><u>enabled DML operations on compressed data</u></a>.</p><p>Working with compressed data is tricky. Imagine trying to update a zipped file. You’d need to unzip the file, make your changes, and then zip it back up. Similarly, updating or deleting data in a compressed database often involves decompressing and reprocessing large chunks (potentially full tables) of data, which can slow things down significantly.</p><p>Our solution was to segment and group records into batches of 1,000, so instead of working with millions of records at once, we operate on smaller, more manageable groups. 
We also used techniques like <strong>segment indexes</strong> and <strong>sparse indexes </strong>on batch metadata to identify only the relevant batches of compressed data that contain the values to be updated or deleted.</p><p>However, a challenge remained: if no metadata is stored for a batch, we have to assume there could be a row inside that needs to be updated. This assumption requires decompressing the batch and materializing it into an uncompressed row format, which takes up precious time and disk space.</p><h2 id="optimizing-batch-processing-in-timescaledb-2160">Optimizing Batch Processing in TimescaleDB 2.16.0</h2><p>With the release of TimescaleDB 2.16.0, we focused on reducing the number of batches that need to be decompressed and materialized. This improvement, known as <strong>compression tuple filtering</strong>, allows us to filter out unnecessary data at various processing stages, dramatically speeding up DML operations.</p><h3 id="real-world-performance-gains">Real-world performance gains</h3><p>Here’s what that looks like in practice:</p><ul><li><strong>Up to 500x faster updates and deletes</strong>: by avoiding the need to decompress and materialize irrelevant batches, DML operations like <code>UPDATE</code> and <code>DELETE</code> can now be completed significantly faster.</li><li><strong>Up to 10x faster upserts</strong>: similarly, upserts (a combination of <code>INSERT</code> and <code>UPDATE</code>) are optimized to avoid unnecessary decompression (<a href="https://timescale.ghost.io/blog/how-we-made-postgresql-upserts-300x-faster-on-compressed-data/" rel="noreferrer">and that’s not the only boost we made for upserts</a>).</li></ul><p>These gains translate to major real-world performance improvements, particularly for users dealing with large datasets that require high-frequency updates of data that has already been compressed.</p><h2 id="how-compression-tuple-filtering-works">How Compression Tuple Filtering Works</h2><p>To achieve these 
optimizations, we filter data at multiple stages during DML operations. Previously, filtering was only possible using constraints on <strong><code>segment_by</code></strong> columns or columns with metadata, such as <strong><code>orderby</code></strong> or columns with a <a href="https://timescale.ghost.io/blog/boost-postgres-performance-by-7x-with-chunk-skipping-indexes/" rel="noreferrer">chunk skipping index</a>.</p><p>With TimescaleDB 2.16.0, we’ve taken a microscope to our decompression pipeline and added a layer of inline filtering. When running DML operations, if a column doesn’t have metadata, we now apply constraints incrementally during decompression. If a batch is fully filtered out by the constraint, it’s skipped entirely and never materialized to disk or passed to Postgres for further evaluation. This saves significant resources and time by reducing the work left for later stages of the query pipeline.</p><h3 id="insert-optimizations">INSERT optimizations</h3><p>Let’s break it down by DML operation, starting with <code>INSERT</code>.</p><ul><li><strong>Regular <code>INSERT</code> operations</strong>: When inserting data into compressed chunks, no decompression is required unless you have a <code>UNIQUE</code> constraint on the <a href="https://www.tigerdata.com/blog/database-indexes-in-postgresql-and-timescale-cloud-your-questions-answered" rel="noreferrer">hypertable</a>. This makes inserts almost as fast as with uncompressed data.</li><li><strong>Inserts with <code>UNIQUE</code> constraints</strong>: When a <code>UNIQUE</code> constraint is in place, there are three possible scenarios:<ol><li><strong>No <code>ON CONFLICT</code> clause</strong>: We check for constraint violations during decompression. 
If a violation is found, it’s flagged before any data is materialized to disk.</li><li><code><strong>ON CONFLICT DO NOTHING</strong></code>: Similar to the first case, violations are flagged during decompression, and the <code>INSERT</code> is skipped without materializing data.</li><li><strong><code>ON CONFLICT DO UPDATE</code> (<code>UPSERT</code>)</strong>: In cases of conflict, we decompress only the batch containing the conflicting tuple. If no conflict is detected, there’s no need to materialize anything.</li></ol></li></ul><h3 id="updatedelete-optimizations">UPDATE/DELETE optimizations</h3><p>For <code>UPDATE</code> and <code>DELETE</code> operations, we first check if any tuples in a batch match the query constraints. If none do, the batch is skipped, and no decompression or materialization is needed. Skipping these batches leads to dramatically faster update and delete operations.</p><h3 id="real-world-comparison-before-and-after">Real-world comparison: Before and after</h3><p>Let’s look at a concrete example to illustrate the performance difference with tuple filtering.</p><p>Assume you have a hypertable with 1,000 batches, each containing 1,000 tuples, for a total of 1,000,000 rows. Now, you want to delete all rows where <code>value &lt; 10</code>. 
In this case, these rows are all contained within a single batch (maybe values lower than 10 are very rare and happen only for a short period of time).</p><p><code>DELETE FROM metrics WHERE value &lt; 10;</code></p><p>Before TimescaleDB 2.16.0, we would have to decompress all 1,000 batches and materialize 1,000,000 tuples to disk—even if only a small portion matched the query constraints.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/09/Bridging-the-Gap-Between-Compressed-and-Uncompressed-Data-in-Postgres_compression-tuple-filtering_decompression.png" class="kg-image" alt="A diagram illustrating the before: we would decompress all 1,000 batches and materialize them even if only a small part of them matched the query constraints" loading="lazy" width="1200" height="1089" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2024/09/Bridging-the-Gap-Between-Compressed-and-Uncompressed-Data-in-Postgres_compression-tuple-filtering_decompression.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2024/09/Bridging-the-Gap-Between-Compressed-and-Uncompressed-Data-in-Postgres_compression-tuple-filtering_decompression.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/09/Bridging-the-Gap-Between-Compressed-and-Uncompressed-Data-in-Postgres_compression-tuple-filtering_decompression.png 1200w" sizes="(min-width: 720px) 720px"></figure><p>In TimescaleDB 2.16.0, this changes with compression tuple filtering. 
We now only need to decompress the relevant batches, avoiding unnecessary materialization.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/09/Bridging-the-Gap-Between-Compressed-and-Uncompressed-Data-in-Postgres-Introducing-Compression-Tuple-Filtering_tuple-filtering.png" class="kg-image" alt="A diagram illustrating how tuple filtering during decompression works" loading="lazy" width="777" height="826" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2024/09/Bridging-the-Gap-Between-Compressed-and-Uncompressed-Data-in-Postgres-Introducing-Compression-Tuple-Filtering_tuple-filtering.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/09/Bridging-the-Gap-Between-Compressed-and-Uncompressed-Data-in-Postgres-Introducing-Compression-Tuple-Filtering_tuple-filtering.png 777w" sizes="(min-width: 720px) 720px"></figure><p>In our example query, this reduces the total query time from 829,517.741 ms to 1,487.494 ms, making it <strong>557x faster</strong>! As you can see, tuple filtering allows us to drastically reduce the work needed to execute this query, resulting in a huge speed-up. The size of the speed-up also varies with the query: the more tuples you discard, the faster your query will become!</p><h2 id="why-do-dml-operations-on-compressed-data-matter">Why Do DML Operations on Compressed Data Matter?</h2><p>When working with massive datasets, every millisecond counts and resource efficiency becomes crucial. The improvements introduced in TimescaleDB 2.16.0 directly address these needs by minimizing the amount of data that must be decompressed, written to disk, and then evaluated by Postgres. This not only reduces disk I/O and CPU usage but also significantly lowers execution time. The result? 
More headroom to handle larger datasets, scale your systems seamlessly, and improve overall application performance.</p><p>For developers managing frequent updates, inserts, or deletes in TimescaleDB hypertables, these optimizations mean that less thought has to be given to the current format of the data. Regardless of whether the data is currently stored on disk as rows or compressed columns, DML operations will work as expected.</p><h2 id="final-words">Final Words</h2><p>With the release of TimescaleDB 2.16.0, our <a href="https://www.tigerdata.com/blog/building-columnar-compression-in-a-row-oriented-database" rel="noreferrer">columnar</a> compression engine takes another big leap forward. Users can now benefit from up to <strong>500x faster updates and deletes</strong> and <strong>10x faster upserts</strong>—all while continuing to enjoy the storage savings and performance gains of compression.</p><p>Looking ahead, we’re committed to further enhancing our compression engine, delivering even more flexibility and performance gains. Stay tuned—there’s more to come.</p><p>Want to see these improvements in action? <a href="https://www.timescale.com"><u>Sign up for Timescale today</u></a> and experience the power of our compression engine for your real-time analytics workloads.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Hypercore: A Hybrid Row-Columnar Storage Engine for T̶̶̶i̶̶̶m̶̶̶e̶̶̶ ̶̶̶S̶̶̶e̶̶̶r̶̶̶i̶̶̶e̶̶̶s̶̶̶ Real-Time Analytics]]></title>
            <description><![CDATA[See how we improved our hybrid row-columnar storage engine—now renamed hypercore—to solve our customers’ real-time analytics needs.]]></description>
            <link>https://www.tigerdata.com/blog/hypercore-a-hybrid-row-storage-engine-for-real-time-analytics</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/hypercore-a-hybrid-row-storage-engine-for-real-time-analytics</guid>
            <category><![CDATA[Announcements & Releases]]></category>
            <category><![CDATA[Analytics]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Ramon Guiu]]></dc:creator>
            <pubDate>Mon, 16 Sep 2024 18:00:09 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/09/Hyperstore-A-Hybrid-Row-Storage-Engine-for-Time-Series-Real-Time-Analytics.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/09/Hyperstore-A-Hybrid-Row-Storage-Engine-for-Time-Series-Real-Time-Analytics.png" alt="Rows and columns over a black background. Hyperstore allows for fast ingest rates while still enabling powerful analytics" /><p>There is no arguing that Postgres has grown up over the last two decades. From a <a href="https://www.postgresql.org/docs/current/history.html#HISTORY-BERKELEY"><u>little-known academic project</u></a> to the <a href="https://survey.stackoverflow.co/2024/technology#most-popular-technologies-database"><u>most loved database two years running</u></a>, Postgres has evolved into a versatile, polyglot platform with extensions to cover nearly every use case. As developer choices become increasingly complex, using <a href="https://timescale.ghost.io/blog/postgres-for-everything/"><u>Postgres for everything</u></a> allows you to <a href="https://timescale.ghost.io/blog/how-to-collapse-your-stack-using-postgresql-for-everything/"><u>collapse your data stack</u></a>, simplifying your architecture and making your life easier.</p><p>At Timescale, our goal is simple: to make Postgres even better. <a href="https://timescale.ghost.io/blog/tag/dev-q-a/"><u>We’ve empowered hundreds of thousands of developers across industries</u></a>—including IoT, crypto, finance, developer tools, SaaS, and more—to use Postgres for their most critical applications. Companies like Toyota rely on TimescaleDB to monitor NASCAR racing cars, Postman uses it to power their API analytics, and OVHCloud has built its billing engine on TimescaleDB. 
Like many others, these companies need more than just a database—they need one that can handle high-performance workloads.</p><p>In any application, relational data like user accounts, permissions, and payment information needs to be stored and managed efficiently, and Postgres handles these tasks exceptionally well. But today’s applications often need more than just transactional consistency. They require the ability to make fast, precise decisions using large amounts of up-to-the-second data, often in mission-critical scenarios.</p><p>We have traditionally thought of these as time-series problems, and while time-series data is certainly central, we’ve come to realize that for users, it’s almost an implementation detail. The real challenge our customers' applications are solving is real-time analytics. This is where TimescaleDB shines, empowering developers to address both their relational and real-time analytics needs within the database they already know and trust: Postgres. And it achieves this thanks to its hybrid row-columnar storage engine—an automatic, efficient, and finely engineered mechanism we’re calling <strong>hypercore</strong>.&nbsp;</p><h2 id="is-your-use-case-real-time-analytics">Is Your Use Case Real-Time Analytics?</h2><p>Real-time analytics is a challenging problem to solve. It involves processing and analyzing data as it’s created, providing immediate insights so you can act on that data without delay. It’s not just about knowing what’s happened in the past; it’s about understanding what’s happening <em>right now</em>.&nbsp;</p><p>Whether you're tracking stock prices, monitoring IoT sensor data, or analyzing user behavior, the goal is to make decisions in the moment by combining live data with historical context. 
These insights are often delivered through embedded dashboards or used to drive decision engines within customer-facing applications, demanding millisecond query response times.</p><p>To achieve this level of responsiveness, real-time analytics needs a database that supports:</p><ul><li><strong>High ingest throughput:</strong> supporting sustained high insert rates, often in the hundreds of thousands of writes per second, typically through streaming ingest</li><li><strong>Low-latency ingestion:</strong> ensuring that new data is immediately available for queries&nbsp;</li><li><strong>High query performance:</strong> executing fast, targeted queries on recent data with low-latency responses for time-sensitive analytics</li><li><strong>Data updates and late-arriving data:</strong> allowing updates and late data to be added immediately, which happens in many real-world scenarios</li><li><strong>Efficient data management:</strong> making use of techniques like compression, rollups, or retention policies to improve query performance and control costs as data accumulates</li></ul><p>Now, compare this to general-purpose analytics, where large datasets are typically processed in batches, and timeliness isn’t as critical. With batch analytics, you can afford delays in data updates and query results because you’re working with historical data over longer periods. In these cases, near-instant updates and low-latency querying aren’t as crucial, and systems can tolerate delays.</p><p>But with real-time analytics, every second counts—both for ingesting new information and making that data immediately available for querying. This is where TimescaleDB excels.</p><p>TimescaleDB can meet the demands of real-time analytics due to its hybrid row-columnar storage engine: <strong>hypercore</strong>. 
This engine allows TimescaleDB to automatically handle both the high-speed ingestion of new data and the efficient querying of large datasets, all while maintaining the flexibility and performance required for real-time workloads.</p><h2 id="hypercore-a-hybrid-row-columnar-storage-engine-for-postgres">Hypercore: A Hybrid Row-Columnar Storage Engine for Postgres</h2><p>A bit of honesty up front: hypercore isn’t a new thing; it already powers our highly <a href="https://timescale.ghost.io/blog/building-columnar-compression-in-a-row-oriented-database/" rel="noreferrer">performant compression</a>. As we realized that some of our customers were more aligned with real-time analytics than <a href="https://www.tigerdata.com/blog/time-series-introduction" rel="noreferrer">time series</a>, we also recognized that compression wasn’t the killer feature for them. What mattered was the conversion from row-oriented to column-oriented storage that came with compression. So, we recently renamed the whole package as hypercore.</p><p>Hypercore is built to handle the unique challenges of real-time analytics in a way that’s both powerful and easy to use. Rather than forcing developers to choose between a transactional (<a href="https://www.tigerdata.com/learn/understanding-oltp" rel="noreferrer">OLTP</a>) database and an analytics (OLAP) database, hypercore combines the best of both worlds. It blends row-oriented and column-oriented storage formats into one system, creating a hybrid storage engine that seamlessly and automatically shifts data between the two based on how it’s used.&nbsp;</p><p>Let’s take a look at row- and column-oriented storage formats and how they differ.</p><h3 id="row-oriented-storage-vs-columnar-oriented-storage">Row-oriented storage vs. 
columnar-oriented storage</h3><p>In a <a href="https://www.timescale.com/learn/columnar-databases-vs-row-oriented-databases-which-to-choose"><u>row-oriented storage format</u></a>, data is stored sequentially by rows, meaning all the fields of a record are kept together on disk. This makes it highly efficient for transactional workloads, where operations involve reading or writing entire records, like inserting new readings or retrieving a reading by ID.&nbsp;</p><p>Row-based storage supports <a href="https://www.timescale.com/learn/understanding-acid-compliance"><u>ACID transactions</u></a> by allowing easy access, locking, and modification of entire rows, ensuring both consistency and efficient execution. However, it is less than ideal for analytical queries focusing on specific columns. Since entire rows must be read to retrieve a single column, it leads to high I/O costs and slower query performance.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/09/Hyperstore-A-Hybrid-Row-Storage-Engine-for-Time-Series-Real-Time-Analytics_data-format-row.png" class="kg-image" alt="How data is stored in a row-based layout" loading="lazy" width="1492" height="769" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2024/09/Hyperstore-A-Hybrid-Row-Storage-Engine-for-Time-Series-Real-Time-Analytics_data-format-row.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2024/09/Hyperstore-A-Hybrid-Row-Storage-Engine-for-Time-Series-Real-Time-Analytics_data-format-row.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/09/Hyperstore-A-Hybrid-Row-Storage-Engine-for-Time-Series-Real-Time-Analytics_data-format-row.png 1492w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">How data is stored in 
a row-based layout</span></figcaption></figure><p>In contrast, a <a href="https://www.timescale.com/learn/columnar-databases-vs-row-oriented-databases-which-to-choose"><u>column-oriented storage format</u></a> stores data by individual columns instead of rows, greatly improving performance for analytical queries. This structure allows the database to efficiently read only the relevant columns needed for a query, avoiding unnecessary data retrieval.&nbsp;</p><p><a href="https://www.tigerdata.com/blog/building-columnar-compression-in-a-row-oriented-database" rel="noreferrer">Columnar storage</a> is particularly efficient for aggregate operations like counting, averaging, or summing values, as each column can be scanned sequentially, resulting in faster queries. Another advantage is that columnar storage enables high compression rates. Since each column contains similar data types, <a href="https://www.tigerdata.com/blog/time-series-compression-algorithms-explained" rel="noreferrer">compression algorithms</a> can more easily identify patterns and redundancies.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/09/Hyperstore-A-Hybrid-Row-Storage-Engine-for-Time-Series-Real-Time-Analytics_data-form-column.png" class="kg-image" alt="How data is stored in a column-based layout" loading="lazy" width="1495" height="769" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2024/09/Hyperstore-A-Hybrid-Row-Storage-Engine-for-Time-Series-Real-Time-Analytics_data-form-column.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2024/09/Hyperstore-A-Hybrid-Row-Storage-Engine-for-Time-Series-Real-Time-Analytics_data-form-column.png 1000w, 
https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/09/Hyperstore-A-Hybrid-Row-Storage-Engine-for-Time-Series-Real-Time-Analytics_data-form-column.png 1495w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">How data is stored in a column-based layout</span></figcaption></figure><p>However, columnar storage struggles with workloads that involve reading or writing full rows, real-time inserts, and frequent updates. These operations require multiple columns to be accessed, compressed, or decompressed simultaneously, leading to increased I/O overhead and slower performance for these tasks.</p><p>As a developer, you want fast inserts <em>and</em> efficient analytics—that’s why we built hypercore, combining the strengths of row and column storage into one unified engine.</p><p>Here’s how hypercore's hybrid approach combines the benefits of both formats:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/10/Hypercore.png" class="kg-image" alt="Hypercore's hybrid approach: data is written to a rowstore and then automatically migrated to a columnstore, allowing fast ingest rates and powerful analytics without developer intervention or storage overhead" loading="lazy" width="1200" height="1014" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2024/10/Hypercore.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2024/10/Hypercore.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/10/Hypercore.png 1200w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Hypercore's hybrid approach: data is written to a rowstore and then automatically migrated to a columnstore, allowing fast ingest rates and powerful analytics 
without developer intervention or storage overhead</span></figcaption></figure><ul><li><strong>Fast ingest with rowstore:</strong> New data is initially written to a rowstore optimized for high-speed inserts and updates. This process ensures that real-time applications can handle rapid streams of incoming data while also allowing for mutability—upserts, updates, and deletes happen seamlessly.</li><li><strong>Efficient analytics with columnstore:</strong> As the data "cools" and becomes more suited for analytics, it is automatically migrated to a columnstore, where it’s compressed into small batches and organized for efficient, large-scale queries. This columnar format allows for fast scanning and aggregation, optimizing performance for analytical workloads while also saving significant storage space.</li><li><strong>Full mutability with transactional semantics: </strong>Regardless of where data is stored, TimescaleDB provides full ACID support and ensures your inserts and updates to the rowstore and columnstore are always consistent and available to queries as soon as they are completed, just as in a vanilla Postgres database.&nbsp;</li></ul><p>Hypercore abstracts all this complexity away from the developer. Data is ingested and stored in the most efficient format and queried transparently across the rowstore and the columnstore, without any need to manage the transition manually and without the overhead of storing data in both formats at the same time. This hybrid approach allows developers to maintain fast ingest rates while still enabling powerful analytics—without having to choose between the two.</p><h2 id="key-capabilities-of-hypercore">Key Capabilities of Hypercore</h2><p>When we first released hypercore back in 2019, we called it simply "compression." Since then, we’ve made hundreds of incremental improvements to better serve developers building real-time analytics applications. 
Just this week, we announced two major performance optimizations: <a href="https://www.timescale.com/blog/boost-postgres-performance-by-7x-with-chunk-skipping-indexes" rel="noreferrer">the introduction of skip (sparse) indexes</a> and <a href="https://www.timescale.com/blog/bridging-the-gap-between-compressed-and-uncompressed-data-in-postgres" rel="noreferrer">inline tuple filtering during decompression</a> for DML operations (inserts, deletes, and updates).</p><p>Here are four key capabilities of hypercore that deliver real value to developers: chunk micro-partitions, SIMD vectorization, skip indexes, and compression.</p><h3 id="chunk-micro-partitions-with-segmentation">Chunk micro-partitions with segmentation</h3><p>TimescaleDB automatically partitions your data into chunks, storing them first in the rowstore and later in the columnstore. Hypercore enhances this by allowing you to group data within a columnstore chunk by a segmentation key, effectively creating micro-partitions within each chunk. This speeds up queries that filter on the segmentation key, as hypercore can quickly narrow down to only the relevant micro-partitions, avoiding the need to uncompress the entire chunk. This optimization makes query execution faster and more efficient.</p><h3 id="simd-vectorization">SIMD vectorization</h3><p>SIMD (Single Instruction, Multiple Data) vectorization is a powerful optimization used to accelerate data processing by enabling the CPU to process an operation on multiple data points in one instruction. We <a href="https://timescale.ghost.io/blog/teaching-postgres-new-tricks-simd-vectorization-for-faster-analytical-queries/"><u>introduced SIMD vectorization in TimescaleDB</u></a> in 2023 to dramatically boost performance for real-time analytics. By allowing the CPU to process multiple values at once, SIMD speeds up tasks like compression, decompression, scanning, filtering, and aggregating large datasets. 
Our upcoming updates have shown up to 30x faster <code>SELECT</code> queries and 10x faster <code>DELETE</code> operations compared to TimescaleDB 2.16.0, with ongoing work to further optimize more query patterns.</p><h3 id="skip-indexes">Skip indexes</h3><p>Skip indexes allow hypercore to accelerate queries by <a href="https://www.timescale.com/blog/boost-postgres-performance-by-7x-with-chunk-skipping-indexes" rel="noreferrer">skipping over irrelevant data</a>. These indexes store metadata such as minimum and maximum values for each block. For example, if you're querying for orders with an ID greater than 10,000, the skip index allows the engine to bypass blocks where the maximum ID is less than or equal to 10,000. In the latest version of TimescaleDB, chunk-skipping indexes can be defined on the columnstore, enabling chunk exclusion for even faster query performance by pruning irrelevant chunks from the search. This exclusion dramatically reduces the data that needs to be processed, resulting in much faster analytical queries.</p><h3 id="compression">Compression</h3><p>The columnstore format is designed to group similar types of data (like timestamps or device IDs) inside our micro-partitions, enabling the use of <a href="https://timescale.ghost.io/blog/time-series-compression-algorithms-explained/" rel="noreferrer">specialized compression algorithms</a> tailored to each column. Hypercore automatically applies best-in-class, lossless compression algorithms when moving data from rowstore to columnstore, achieving up to 98&nbsp;% compression. This doesn’t just save on storage—it also speeds up query performance by reducing I/O, as there's less data to read and process during queries.</p><h2 id="conclusion">Conclusion</h2><p>Applications that deliver real-time analytics are now essential in several industries. They need to ingest massive amounts of data and provide instant insights. 
And they need to do it while still managing traditional relational data, like user accounts or payments, seamlessly. That’s where TimescaleDB’s hypercore comes in—<a href="https://timescale.ghost.io/blog/building-columnar-compression-in-a-row-oriented-database/"><u>a hybrid row-columnar storage engine finely engineered over the years</u></a> that allows you to stick with PostgreSQL even when handling the most challenging real-time analytics use cases.</p><p>With TimescaleDB and hypercore, you get the best of both worlds: fast, transactional inserts with row-based storage and blazing-fast query performance with columnar compression for analytics. You don’t need to compromise or manage multiple databases.</p><p>Want to try hypercore today? <a href="https://docs.timescale.com/self-hosted/latest/install/"><u>Download and run TimescaleDB on your machine</u></a>. Want to take it out for a spin while reaping the full benefits of a managed PostgreSQL platform with <a href="https://timescale.ghost.io/blog/scaling-postgresql-for-cheap-introducing-tiered-storage-in-timescale/"><u>automated data tiering to S3</u></a>, detailed <a href="https://timescale.ghost.io/blog/database-monitoring-and-query-optimization-introducing-insights-on-timescale/"><u>query performance insights</u></a>, an integrated SQL editor, <a href="https://timescale.ghost.io/blog/pgvector-is-now-as-fast-as-pinecone-at-75-less-cost/"><u>fast vector search</u></a>, <a href="https://timescale.ghost.io/blog/introducing-one-click-database-forking-in-timescale-cloud/"><u>one-click replicas and forks</u></a>, <a href="https://timescale.ghost.io/blog/how-high-availability-works-in-our-cloud-database/"><u>automated backups</u></a>, and more? <a href="https://console.cloud.timescale.com/signup"><u>Sign up for Timescale Cloud</u></a> (it’s free for 30 days).</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Build Search and RAG Systems on PostgreSQL Using Cohere and Pgai]]></title>
            <description><![CDATA[Enterprise-ready LLMs from Cohere, now available in the pgai PostgreSQL extension.]]></description>
            <link>https://www.tigerdata.com/blog/build-search-and-rag-systems-on-postgresql-using-cohere-and-pgai</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/build-search-and-rag-systems-on-postgresql-using-cohere-and-pgai</guid>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Avthar Sewrathan]]></dc:creator>
            <pubDate>Fri, 09 Aug 2024 18:01:55 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/08/Cohere_pgai.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/08/Cohere_pgai.png" alt="Build Search and RAG Systems on PostgreSQL Using Cohere and Pgai" /><p><em>Enterprise-ready LLMs from Cohere—now available in the pgai PostgreSQL extension.</em></p><p><a href="https://cohere.com" rel="noreferrer"><strong>Cohere</strong></a> is a leading generative AI company in the field of large language models (LLMs) and retrieval-augmented generation (RAG) systems. What sets Cohere apart from other model developers is its unwavering focus on enterprise needs and support for multiple languages.&nbsp;</p><h3 id="cohere-models">Cohere models</h3><p>Cohere has quickly gained traction among businesses and developers alike, thanks to its suite of models that cater to key steps of the AI application-building process: text embedding, result reranking, and reasoning. Here’s an overview of Cohere models:</p><ul><li><a href="https://cohere.com/embed"><strong><u>Cohere Embed</u></strong></a><strong>:</strong> A leading text representation model supporting over 100 languages, with the latest version (Embed v3) capable of evaluating document quality and relevance to queries. Cohere Embed is ideal for <a href="https://www.tigerdata.com/learn/vector-search-vs-semantic-search" rel="noreferrer">semantic search</a>, retrieval-augmented generation (RAG), clustering, and classification tasks.</li><li><a href="https://cohere.com/rerank"><strong><u>Cohere Rerank</u></strong></a><strong>:</strong> A model that significantly improves search quality for any keyword or vector search system with minimal code changes. It’s optimized for high throughput and reduced compute requirements while leveraging Cohere's embedding performance for accurate reranking. 
Cohere Rerank is system agnostic, meaning it can be used with any vector search system, including PostgreSQL.</li><li><a href="https://cohere.com/command"><strong><u>Cohere Command</u></strong></a>: A family of highly scalable language models optimized for enterprise use. Command supports RAG, multi-language use, tool use, and citations. Command R+ is the most advanced model, as it’s optimized for conversational interaction and long-context tasks. Command R is well suited for simpler RAG and single-step tool use tasks and is the most cost-effective choice out of the Command family.&nbsp;</li></ul><div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-emoji">🎉</div><div class="kg-callout-text">Today, we’re thrilled to announce that PostgreSQL developers can now harness the power of Cohere's enterprise-grade language models directly on PostgreSQL data using the pgai PostgreSQL extension.&nbsp;</div></div><h3 id="what-is-pgai">What is pgai?</h3><p><a href="https://github.com/timescale/pgai"><u>Pgai</u></a> is an open-source <a href="https://www.tigerdata.com/blog/top-8-postgresql-extensions" rel="noreferrer">PostgreSQL extension</a> that brings AI models closer to your data, simplifying tasks such as embedding creation, text classification, semantic search, and retrieval-augmented generation on data stored in PostgreSQL.</p><p><strong>Pgai supports the entire suite of Cohere Embed, Rerank, and Command models.</strong> This means that developers can now:</p><ul><li>Create embeddings using Cohere’s Embed model for data inside PostgreSQL tables, all without having to pipe data out of the database and back in.</li><li>Build hybrid search systems for higher-quality results in search and RAG applications using Cohere Rerank, combining vector search in pgvector and PostgreSQL full-text search.</li><li>Perform tasks like classification, summarization, <a href="https://www.tigerdata.com/blog/automating-data-enrichment-in-postgresql-with-openai" 
rel="noreferrer">data enrichment</a>, and other reasoning tasks on data in PostgreSQL tables using Cohere Command models.</li><li>Build highly accurate RAG systems completely in SQL, using the Cohere Embed, Rerank, and Command models together.</li></ul><p>We built pgai to <a href="https://timescale.ghost.io/blog/pgai-giving-postgresql-developers-ai-engineering-superpowers/"><u>give more PostgreSQL developers AI Engineering superpowers</u></a>. Pgai makes it easier for database developers familiar with PostgreSQL to become "AI engineers" by providing familiar SQL-based interfaces to AI functionality such as embedding creation and model usage.</p><p>While Command is the flagship model for building scalable, production-ready AI applications, <a href="https://www.tigerdata.com/blog/pgai-giving-postgresql-developers-ai-engineering-superpowers" rel="noreferrer">pgai</a> supports a variety of models in the Cohere lineup. This includes specialized embedding models that support over 100 languages, enabling developers to select the most suitable model for their specific enterprise use case.</p><p>Whether it's building internal AI knowledge assistants, security documentation AI, customer feedback analysis, or employee support systems, Cohere's models integrated with pgai and PostgreSQL offer powerful solutions.</p><p><strong>Getting started with Cohere models in pgai</strong></p><p>Ready to start building with Cohere's suite of powerful language models and PostgreSQL? Pgai is open source under the PostgreSQL License and is available for immediate use in your AI projects. You can find installation instructions on the pgai <a href="https://github.com/timescale/pgai/?ref=timescale.com"><u>GitHub repository</u></a>. 
You can also access pgai (alongside pgvector and <a href="https://timescale.ghost.io/blog/pgvector-is-now-as-fast-as-pinecone-at-75-less-cost/"><u>pgvectorscale</u></a>) on any database service on <a href="https://console.cloud.timescale.com/signup?ref=timescale.com"><u>Timescale’s Cloud PostgreSQL platform</u></a>. If you’re new to Timescale, you can get started with a <a href="https://console.cloud.timescale.com/signup?ref=timescale.com"><u>free cloud PostgreSQL database here</u></a>.</p><p>Once connected to your database, create the pgai extension by running:</p><pre><code class="language-postgresql">CREATE EXTENSION IF NOT EXISTS ai CASCADE;</code></pre><p><strong>Join the Postgres AI Community</strong></p><p>Have questions about using Cohere models with pgai? Join the <a href="https://discord.gg/KRdHVXAmkp?ref=timescale.com"><u>Postgres for AI Discord</u></a>,&nbsp; where you can share your projects, seek help, and collaborate with a community of peers.&nbsp; You can also <a href="https://github.com/timescale/pgai/issues?ref=timescale.com"><u>open an issue on the pgai GitHub</u></a> (and while you’re there, stars are always appreciated ⭐).</p><p>Next, let's explore the benefits of using Cohere models for building enterprise AI applications. And we’ll close with a real-world example of using Cohere models to build a hybrid search system, combining pgvector semantic search with PostgreSQL full-text search.&nbsp;</p><p>Let's dive in...</p><h2 id="why-use-cohere-models-for-rag-and-search-applications">Why Use Cohere Models for RAG and Search Applications?</h2><p>Cohere's suite of models offers several advantages for enterprise-grade RAG and search applications:</p><p><strong>High-quality results</strong>: Cohere models excel in understanding business language, providing relevant content, and tackling complex enterprise challenges.</p><p><strong>Multi-language support:</strong> With support for over 100 languages, Cohere models are ideal for global enterprises. 
For example, Command R+ is optimized to perform well in the following languages: English, French, Spanish, Italian, German, Brazilian Portuguese, Japanese, Korean, Simplified Chinese, and Arabic. Additionally, pre-training data has been included for the following 13 languages: Russian, Polish, Turkish, Vietnamese, Dutch, Czech, Indonesian, Ukrainian, Romanian, Greek, Hindi, Hebrew, and Persian.</p><p><strong>Flexible deployment:</strong> Cohere’s flexible deployment options allow businesses to bring Cohere's models to their own data, either on their servers or in private/commercial clouds, addressing critical security and compliance concerns. This flexibility, combined with Cohere's robust API and partnerships with major cloud providers like Amazon Web Services (through the Bedrock AI platform), ensures seamless integration and improved scalability for enterprise applications.</p><p><strong>Enterprise focus:</strong> Strategic partnerships with industry leaders like Fujitsu, Oracle, and McKinsey &amp; Company underscore Cohere's enterprise-centric approach.</p><p>Cohere models enable sophisticated enterprise AI applications such as:</p><ul><li>Investment research assistants</li><li>Support chatbots and intelligent support copilots</li><li>Executive AI assistants</li><li>Document summarization tools</li><li>Knowledge and project staffing assistants</li><li>Regulatory compliance monitoring</li><li>Sentiment analysis for brand management</li><li>Research and development assistance</li></ul><p>Cohere's models offer a powerful, scalable, and easily implementable solution for enterprise search and RAG applications, balancing high accuracy with efficient performance.</p><h2 id="tutorial-build-search-and-rag-systems-on-postgresql-using-cohere-and-pgai">Tutorial: Build Search and RAG Systems on PostgreSQL Using Cohere and Pgai</h2><p>Let’s look at an example of why Cohere's support in pgai is such a game-changer. 
We’ll perform a hybrid search over a corpus of news articles, combining PostgreSQL full-text search with pgvector’s semantic search, leveraging Cohere Embed and Rerank models in the process.</p><p>Here's what we'll cover:</p><ul><li><strong>Setting up your environment: </strong>Creating a Python virtual environment, installing necessary libraries, and setting up a PostgreSQL database with the pgai extension.</li><li><strong>Preparing your data: </strong>Creating a table to store news articles and loading sample data from the CNN/Daily Mail news dataset.</li><li><strong>Generating vector embeddings using pgai:</strong> Using Cohere's Embed model to create vector representations of text.</li><li><strong>Implementing full text and semantic search capabilities:</strong> Setting up PostgreSQL's full-text search for keyword matching and creating a vector search index for semantic search.</li><li><strong>Performing searches:</strong> Executing keyword searches using PostgreSQL's full-text search and semantic searches on embeddings using pgvector <a href="https://www.tigerdata.com/learn/hnsw-vs-diskann" rel="noreferrer">HNSW</a>.</li><li><strong>Performing hybrid search: </strong>Combining keyword and semantic search results by using Cohere's Rerank model to improve result relevance.</li></ul><p>By the end of this tutorial, you'll have a powerful hybrid search system that provides highly relevant search results, leveraging both traditional keyword search and semantic search techniques.</p><h3 id="setting-up-your-environment">Setting up your environment</h3><p>To begin, we'll set up a Python virtual environment and install the necessary libraries. This ensures a clean, isolated environment for our project.</p><pre><code class="language-bash">mkdir cohere
cd cohere
python -m venv venv
source venv/bin/activate # On Windows, use `venv\Scripts\activate`

pip install datasets "psycopg[binary]" python-dotenv
</code></pre><p><br>This setup:</p><ul><li>Creates a new directory for our project</li><li>Sets up a Python virtual environment</li><li>Activates the virtual environment</li><li>Installs the required Python packages:<ul><li><code>datasets</code>: For loading our sample dataset</li><li><code>psycopg</code>: PostgreSQL adapter for Python</li><li><code>python-dotenv</code>: For managing environment variables</li></ul></li></ul><p>Next, create a <code>.env</code> file in your project directory to store your database connection string and Cohere API key:</p><pre><code class="language-bash">DB_URL=postgres://username:password@localhost:5432/your_database
COHERE_API_KEY=your_cohere_api_key_here
</code></pre><p>Replace the placeholders with your actual database credentials and Cohere API key.</p><h3 id="setting-up-your-postgresql-database">Setting up your PostgreSQL database</h3><p>Now, let's set up our PostgreSQL database with the necessary extensions and table structure. We’ll create a table to hold our news articles. The <code>embedding</code> column will hold our embeddings which we’ll create using the Cohere Embed model. The <code>tsv</code> column will hold our full-text-search vectors.</p><p>Connect to your PostgreSQL database and run the following SQL commands:</p><pre><code class="language-postgresql">-- create the pgai extension
create extension if not exists ai cascade;

-- create a table for the news articles
-- the embedding column will store embeddings from cohere
-- the tsv column will store the full-text-search vector
create table cnn_daily_mail
( id bigint not null primary key generated by default as identity
, highlights text
, article text
, embedding vector(1024)
, tsv tsvector generated always as (to_tsvector('english', article)) stored
);

-- index the full-text-search vector
create index on cnn_daily_mail using gin (tsv);
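
-- optional sanity check (an addition to the original script, as a sketch):
-- list the table's columns to confirm that embedding and tsv were created
select column_name, data_type
from information_schema.columns
where table_name = 'cnn_daily_mail'
order by ordinal_position;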
</code></pre><p>This SQL script:</p><ul><li>Creates the <code>pgai</code> extension</li><li>Sets up a table to store our news articles, including columns for the article text, embeddings, and a full-text <a href="https://www.tigerdata.com/learn/understanding-vector-search" rel="noreferrer">search vector</a></li><li>Creates an index on the full-text search vector for efficient keyword searches</li></ul><h3 id="load-the-dataset">Load the dataset</h3><p>Now that our database is set up, we'll load sample data from the CNN/Daily Mail dataset. We'll use Python to download the dataset and efficiently insert it into our PostgreSQL table using the <code>copy</code> command with the <code>binary</code> format.</p><p>Create a new file called <code>load_data.py</code> in your project directory and add the following code:</p><pre><code class="language-python">import os
from dotenv import load_dotenv
from datasets import load_dataset
import psycopg


# Load environment variables
load_dotenv()
DB_URL = os.environ["DB_URL"]
COHERE_API_KEY = os.environ["COHERE_API_KEY"]

# Load and prepare the dataset
dataset = load_dataset("cnn_dailymail", "3.0.0")
test = dataset["test"]
test = test.shuffle(seed=42).select(range(0,1000))

# Insert data into PostgreSQL
with psycopg.connect(DB_URL) as con:
   with con.cursor(binary=True) as cur:
       with cur.copy("copy public.cnn_daily_mail (highlights, article) from stdin (format binary)") as cpy:
           cpy.set_types(['text', 'text'])
           for row in test:
               cpy.write_row((row["highlights"], row["article"]))

print("Data loading complete!")
</code></pre><p>This script:</p><ol><li>Loads environment variables from the .env file</li><li>Downloads the CNN/Daily Mail dataset</li><li>Selects a random subset of 1,000 articles from the test set</li><li>Efficiently inserts the data into our PostgreSQL table using the binary COPY command</li></ol><p>To run the script and load the data:</p><pre><code class="language-bash">python load_data.py</code></pre><p>This approach is much faster than inserting rows one by one, especially for larger datasets. After running the script, you should have 1,000 news articles loaded into your cnn_daily_mail table, ready for the next steps in our tutorial.</p><p><strong>Note</strong>: Before running this script, make sure your .env file contains the correct database URL and a Cohere API key entry. The script reads both variables and fails with a <code>KeyError</code> if either is missing.</p><h3 id="create-vector-embeddings-with-cohere-embed-and-pgai">Create vector embeddings with Cohere Embed and pgai</h3><p>In this step, we'll use Cohere's Embed model to generate <a href="https://www.tigerdata.com/blog/a-beginners-guide-to-vector-embeddings" rel="noreferrer">vector embeddings</a> for our news articles. These embeddings will enable semantic search capabilities.</p><p>First, set your Cohere API key in your PostgreSQL session:</p><pre><code class="language-postgresql">select set_config('ai.cohere_api_key', '&lt;YOUR-API-KEY&gt;', false) is not null as set_cohere_api_key;</code></pre><p>The pgai extension will use this key by default for the duration of your database session.</p><p>We'll explore two methods to generate embeddings: a simple approach and a more production-ready solution. Choose method 1 if you’re just experimenting and method 2 if you want to see how you’d run this in production.</p><p><strong>Method 1: Easy mode (for small datasets)</strong></p><p>This is the "easy-mode" approach to generating embeddings for each news article.</p><pre><code class="language-postgresql">-- this is the easy way to create the embeddings
-- but it's one LONG-running update statement
update cnn_daily_mail set embedding = ai.cohere_embed('embed-english-v3.0', article, input_type=&gt;'search_document');
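
-- optional trial run (a sketch, not part of the original tutorial):
-- to confirm the API key and model name work, you could run this statement
-- on five rows BEFORE the full, long-running update above
update cnn_daily_mail
set embedding = ai.cohere_embed('embed-english-v3.0', article, input_type=&gt;'search_document')
where id in (select id from cnn_daily_mail where embedding is null order by id limit 5);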
</code></pre><p>This method is straightforward, but each call to Cohere is relatively slow, so the single statement becomes long-running on large datasets. It also holds its row locks until it finishes, and if any call fails, the whole update rolls back.</p><p><strong>Method 2: Production mode (recommended for larger datasets)</strong></p><p>This method processes rows one at a time, allowing for better concurrency and error handling. We select and lock only a single row at a time, and we commit our work as we go along. Because the block issues <code>commit</code>, run it outside an explicit transaction block (this requires PostgreSQL 11 or later).</p><pre><code class="language-postgresql">-- this is a more production-appropriate way to generate the embeddings
-- we only lock one row at a time and we commit each row immediately
do $$
declare
    _id bigint;
    _article text;
    _embedding vector(1024);
begin
    loop
        select id, article into _id, _article
        from cnn_daily_mail
        where embedding is null
        for update skip locked
        limit 1;

        if not found then
            exit;
        end if;

        _embedding = ai.cohere_embed('embed-english-v3.0', _article, input_type=&gt;'search_document');
        update cnn_daily_mail set embedding = _embedding where id = _id;
        commit;
    end loop;
end;
$$;
</code></pre><p>Note that both methods above use the <a href="https://github.com/timescale/pgai/blob/main/docs/cohere.md#cohere_embed"><u>cohere_embed() function</u></a> provided by the pgai extension to generate embeddings for each article in the cnn_daily_mail table using the Cohere <a href="https://docs.cohere.com/docs/cohere-embed" rel="noreferrer">Embed v3 embedding model</a>.</p><p>After running one of the above methods, verify that all rows have embeddings:</p><pre><code class="language-postgresql">-- this should return 0
select count(*) from cnn_daily_mail where embedding is null;
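
-- optional progress check (a sketch added for illustration):
-- count(embedding) counts only non-null values, so this shows progress so far
select count(embedding) as embedded, count(*) as total from cnn_daily_mail;

-- vector_dims is a pgvector function; 1024 is expected for this table
select vector_dims(embedding) from cnn_daily_mail where embedding is not null limit 1;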
</code></pre><p><strong>Create a vector search index</strong></p><p>To optimize vector similarity searches, create an index on the embedding column:</p><pre><code class="language-postgresql">-- index the embeddings
create index on cnn_daily_mail using hnsw (embedding vector_cosine_ops);
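
-- optional tuning (a sketch; the value 100 is an arbitrary assumption):
-- hnsw.ef_search is a pgvector setting that trades query speed for recall;
-- set this way, it applies to the current session only
set hnsw.ef_search = 100;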
</code></pre><p>This index uses the pgvector <a href="https://www.tigerdata.com/blog/vector-database-basics-hnsw" rel="noreferrer">HNSW</a> (<a href="https://www.tigerdata.com/blog/vector-database-basics-hnsw" rel="noreferrer">hierarchical navigable small world</a>) algorithm, which is efficient for nearest-neighbor searches in high-dimensional spaces.</p><p>With these steps completed, your database is now set up for both keyword-based and semantic searches. In the next section, we'll explore how to perform these searches using <a href="https://www.tigerdata.com/blog/pgai-giving-postgresql-developers-ai-engineering-superpowers" rel="noreferrer">pgai</a> and Cohere's models.</p><h2 id="perform-a-keyword-search-in-postgresql"><br>Perform a Keyword Search in PostgreSQL</h2><div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-emoji">⚠️</div><div class="kg-callout-text">Note to reader: Unfortunately, the news dataset is quite depressing, so please forgive the negativity in the keywords and prompts below. All terms are chosen purely for illustrative and educational purposes.</div></div><p>PostgreSQL's full-text search capabilities allow us to perform complex keyword searches efficiently. We'll use the <code>tsv</code> column we created earlier, which contains the tsvector representation of each article.</p><pre><code class="language-postgresql">-- a full-text-search
-- must include "death" OR "kill"
-- must include "police" AND "car" AND "dog"
select article
from cnn_daily_mail
where tsv @@ to_tsquery('english', '(death | kill) &amp; police &amp; car &amp; dog')
;
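
-- variant (a sketch added for illustration): the same search,
-- ordered by ts_rank so the strongest keyword matches come first
select left(article, 80) as snippet, ts_rank(tsv, query) as rank
from cnn_daily_mail, to_tsquery('english', '(death | kill) &amp; police &amp; car &amp; dog') query
where tsv @@ query
order by rank desc
limit 15
;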
</code></pre><p>Let's break down this query:</p><ul><li><code>@@</code> is the text search match operator in PostgreSQL.</li><li><code>to_tsquery('english', '...')</code> converts our search terms into a tsquery type.</li><li>The search logic is: (death OR kill) AND police AND car AND dog.</li></ul><p>This query will return articles that contain:</p><ol><li>Either "death" or "kill" (or their variations)</li><li>AND "police" (or variations)</li><li>AND "car" (or variations)</li><li>AND "dog" (or variations)</li></ol><p>This method of searching is fast and efficient, especially for exact keyword matches inside the article text. However, it doesn't capture semantic meaning or handle synonyms well. In the next section, we'll explore how to perform semantic searches using the embeddings we created earlier.</p><h2 id="perform-a-semantic-search-in-postgresql-with-pgvector-pgai-and-cohere-embed">Perform a Semantic Search in PostgreSQL With pgvector, pgai, and Cohere Embed </h2><p>Semantic search allows us to find relevant articles based on the meaning of a query rather than just matching keywords. We'll use Cohere's Embed model to convert our search query into a <a href="https://www.tigerdata.com/blog/a-beginners-guide-to-vector-embeddings" rel="noreferrer">vector embedding</a>, then find the most similar article embeddings using pgvector.</p><pre><code class="language-postgresql">-- get the 15 most relevant stories to our question
with q as
(
   select ai.cohere_embed
   ('embed-english-v3.0'
   , 'Show me stories about police reports of deadly happenings involving cars and dogs.'
   , input_type=&gt;'search_query'
   ) as q
)
select article
from cnn_daily_mail
order by embedding &lt;=&gt; (select q from q limit 1)
limit 15
;
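
-- variant (a sketch added for illustration): return the cosine distance
-- next to a short snippet so you can see how close each match is
with q as
(
   select ai.cohere_embed
   ('embed-english-v3.0'
   , 'Show me stories about police reports of deadly happenings involving cars and dogs.'
   , input_type=&gt;'search_query'
   ) as q
)
select left(article, 80) as snippet
, embedding &lt;=&gt; (select q from q limit 1) as cosine_distance
from cnn_daily_mail
order by cosine_distance
limit 15
;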
</code></pre><p>Let's break this query down:</p><ol><li>We use a <a href="https://www.timescale.com/learn/how-to-use-common-table-expression-sql" rel="noreferrer">CTE (common table expression)</a> named <code>q</code> to generate the embedding for our search query using Cohere's model. Its single output column is also named <code>q</code>.</li><li><code>ai.cohere_embed()</code> is a function provided by pgai that interfaces with Cohere's API to create embeddings. See more in the <a href="https://github.com/timescale/pgai/blob/main/docs/cohere.md#cohere_embed" rel="noreferrer">pgai docs</a>.</li><li>We specify <code>input_type =&gt; 'search_query'</code> to optimize the embedding for search queries, whereas the articles were embedded with <code>'search_document'</code>.</li><li>In the main query, we order the results by the cosine distance (<code>&lt;=&gt;</code>) between each article's embedding and our query embedding, so the closest matches come first.</li><li>The <code>LIMIT 15</code> clause returns the 15 most semantically similar articles.</li></ol><p>This semantic search can find relevant articles even if they don't contain the exact words used in the query. It matches the meaning of the search query against semantically similar content.</p><h2 id="performing-hybrid-search-using-pgai-and-cohere-rerank">Performing Hybrid Search Using Pgai and Cohere Rerank</h2><p>Hybrid search combines the strengths of both keyword and semantic searches and then uses Cohere's Rerank model to improve the relevance of the results further. This approach can provide more accurate and contextually relevant results than either method alone.</p><p>Here's how to perform a hybrid search with reranking:</p><pre><code class="language-postgresql">with full_text_search as
(
   select article
   from cnn_daily_mail
   where tsv @@ to_tsquery('english', '(death | kill) &amp; police &amp; car &amp; dog')
   limit 15
)
, vector_query as
(
   select ai.cohere_embed
   ('embed-english-v3.0'
   , 'Show me stories about police reports of deadly happenings involving cars and dogs.'
   , input_type=&gt;'search_query'
   ) as query_embedding
)
, vector_search as
(
   select article
   from cnn_daily_mail
   order by embedding &lt;=&gt; (select query_embedding from vector_query limit 1)
   limit 15
)
, rerank as
(
   select ai.cohere_rerank
   ( 'rerank-english-v3.0'
   , 'Show me stories about police reports of deadly happenings involving cars and dogs.'
   , (
       select jsonb_agg(x.article)
       from
       (
           select *
           from full_text_search
           union
           select * from vector_search
       ) x
     )
   , top_n =&gt; 5
   , return_documents =&gt; true
   ) as response
)
select
  x.index
, x.document-&gt;&gt;'text' as article
, x.relevance_score
from rerank
cross join lateral jsonb_to_recordset(rerank.response-&gt;'results') x(document jsonb, index int, relevance_score float8)
order by relevance_score desc
;
</code></pre><p>Let's break down this query:</p><ol><li><code>full_text_search</code>: Performs a keyword search using PostgreSQL's full-text search capabilities.</li><li><code>vector_query</code>: Creates an embedding for our search query using Cohere's Embed model.</li><li><code>vector_search</code>: Performs a semantic search using the query embedding.</li><li>The <code>union</code> subquery inside <code>rerank</code>: Combines the results from both keyword and semantic searches, removes duplicates, and aggregates them into a JSON array with <code>jsonb_agg</code>.</li><li><code>rerank</code>: Uses Cohere's Rerank model to reorder the combined results based on relevance to the query, keeping the top five (<code>top_n =&gt; 5</code>).</li><li>The final <code>SELECT</code> statement extracts and formats the reranked results.</li></ol><p>This hybrid approach offers several advantages:</p><ul><li>It captures both exact keyword matches and semantically similar content.</li><li>The reranking step helps to prioritize the most relevant results.</li><li>It can handle cases where either keyword or semantic search alone might miss relevant articles.</li></ul><p>To use this query effectively:</p><ol><li>Adjust the <code>to_tsquery</code> parameters to match your specific keyword search needs.</li><li>Modify the natural language query in both the <code>ai.cohere_embed</code> and <code>ai.cohere_rerank</code> functions to match your search intent.</li><li>Experiment with the <code>LIMIT</code> values and the <code>top_n</code> parameter to balance between recall and precision.</li></ol><p>Remember, while this approach can provide highly relevant results, each run of the query makes two API calls to Cohere (one for embedding and one for reranking), which may impact performance for large result sets or high-volume applications.</p><p>This hybrid search method demonstrates the power of combining traditional database search techniques with advanced AI models, all within your PostgreSQL database using pgai.</p><h2 id="get-started-with-cohere-and-pgai-today">Get Started With Cohere and Pgai Today</h2><p>The integration of Cohere’s Command, Embed, 
and Rerank models into pgai marks a significant milestone in our vision to help PostgreSQL evolve into an <a href="https://timescale.ghost.io/blog/making-postgresql-a-better-ai-database/"><u>AI database</u></a>. By bringing these state-of-the-art language models directly into your database environment, you can unlock new levels of efficiency, intelligence, and innovation in your projects.</p><p>Pgai is open source under the PostgreSQL License and is available for you to use in your AI projects today. You can find installation instructions in the <a href="https://github.com/timescale/pgai/?ref=timescale.com"><u>pgai GitHub repository</u></a>. You can also access pgai on any database service on <a href="https://console.cloud.timescale.com/signup?ref=timescale.com"><u>Timescale’s cloud PostgreSQL platform</u></a>. If you’re new to Timescale, you can get started with a <a href="https://console.cloud.timescale.com/signup?ref=timescale.com"><u>free cloud PostgreSQL database here</u></a>.</p><p>Pgai is an effort to enrich the PostgreSQL ecosystem for AI. If you’d like to help, here’s how you can get involved:</p><ul><li>Got questions about using Cohere models in pgai? Join the <a href="https://discord.gg/Q8TajvAPRN?ref=timescale.com"><u>Postgres for AI Discord</u></a>, a community of developers building AI applications with PostgreSQL. Share what you’re working on, and help, or get help from, a community of peers.</li><li>Share the news with your friends and colleagues: Share our posts about pgai on <a href="https://x.com/TimescaleDB?ref=timescale.com"><u>X/Twitter</u></a>, <a href="https://www.linkedin.com/company/timescaledb/?ref=timescale.com"><u>LinkedIn</u></a>, and Threads. We promise to RT back.</li><li>Submit issues and feature requests: We encourage you to submit issues and feature requests for functionality you’d like to see, bugs you find, and suggestions you think would improve pgai. 
Head over to the <a href="https://github.com/timescale/pgai/?ref=timescale.com"><u>pgai GitHub repo</u></a> to share your ideas.</li><li>Make a contribution: We welcome community contributions for pgai. Pgai is written in Python and PL/Python. Let us know which models you want to see supported, particularly for open-source embedding and generation models. See the <a href="https://github.com/timescale/pgai/blob/main/CONTRIBUTING.md?ref=timescale.com"><u>pgai GitHub for instructions to contribute</u></a>.</li><li>Offer the pgai extension on your PostgreSQL cloud: Pgai is an open-source project under the <a href="https://github.com/timescale/pgvectorscale/blob/main/LICENSE?ref=timescale.com"><u>PostgreSQL License</u></a>. We encourage you to offer pgai on your managed PostgreSQL database-as-a-service platform and can even help you spread the word. Get in touch via our <a href="https://www.timescale.com/contact?ref=timescale.com"><u>Contact Us form</u></a> and mention pgai to discuss further.</li></ul><p>We're excited to see what you'll build with PostgreSQL and pgai!</p><p><br></p><p><br><br></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How I Learned to Love PostgreSQL on Kubernetes: Backup/Restore on Timescale]]></title>
            <description><![CDATA[Learn how we deploy PostgreSQL on Kubernetes for our TimescaleDB instances and how this deployment choice impacts the backup/restore process.
]]></description>
            <link>https://www.tigerdata.com/blog/how-i-learned-to-love-postgresql-on-kubernetes-backup-restore-on-timescale</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/how-i-learned-to-love-postgresql-on-kubernetes-backup-restore-on-timescale</guid>
            <category><![CDATA[Engineering]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Oleksii Kliukin]]></dc:creator>
            <pubDate>Fri, 28 Jun 2024 07:16:00 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/06/How-I-Learned-to-Love-Kubernetes-Update-backup-restore.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/06/How-I-Learned-to-Love-Kubernetes-Update-backup-restore.png" alt="Kubernetes containers alongside a stylized elephant head representing PostgreSQL" /><p>Your database needs reliable backups. Data loss events could occur at any point in time. A developer may drop a table by mistake, even replicated storage drives may fail and start producing read errors, software bugs may cause silent data corruption, applications may perform incorrect modifications, and so on.</p><p>You may hope this will not occur to you, but hope is not a strategy. The key to reliably preventing data loss or enabling data recovery is to perform regular backups. In this post, we’ll cover how we deploy PostgreSQL on Kubernetes (for our TimescaleDB instances) and how this deployment choice impacts our backup and restore process.&nbsp;</p><p>For added interest and to spark the debate on how to build fault tolerance into your PostgreSQL database management system, we also share how we did backup and restore testing before implementing <a href="https://timescale.ghost.io/blog/making-postgresql-backups-100x-faster-via-ebs-snapshots-and-pgbackrest/"><u>a more streamlined solution</u></a>.&nbsp;</p><h2 id="backups-in-postgresql-on-a-kubernetes-deployment">Backups in PostgreSQL on a Kubernetes Deployment</h2><p>As the ancient proverb goes: “There are two kinds of people. 
Those who do backups and those who <strong>will</strong> do backups.”</p><p>Relational databases like PostgreSQL support continuous archiving, where, in addition to the image of your data directory, the database continuously pushes changes to backup storage.</p><p>Some of the most popular open-source tools for PostgreSQL center around performing backups and restores, including<a href="https://pgbackrest.org/"> <u>pgBackRest</u></a>,<a href="https://www.pgbarman.org/"> <u>Barman</u></a>,<a href="https://github.com/wal-g/wal-g"> <u>wal-g</u></a>,<a href="https://wiki.postgresql.org/wiki/Ecosystem:Backup"> <u>and more</u></a>, which underscores the importance of doing so or, at the very least, shows that backup/restore is top of mind for many developers. And, because TimescaleDB is built on PostgreSQL, all your favorite PostgreSQL tools work perfectly well with TimescaleDB.</p><p>Most of the PostgreSQL tools mentioned above are not described as backup tools but as <a href="https://timescale.ghost.io/blog/database-backups-and-disaster-recovery-in-postgresql-your-questions-answered/" rel="noreferrer"><em>disaster recovery tools</em></a>. That is because when disaster strikes, you are not really interested in your backup but rather in the outcome of restoring it.</p><p>And sometimes you need to work really hard to recover from what otherwise would be a disaster: one of my<a href="https://www.youtube.com/watch?v=QkTmnqnUbmc"> <u>favorite talks by long-time PostgreSQL contributor Dimitri Fontaine</u></a> describes the process of recovering data from a PostgreSQL instance whose backup couldn’t be restored when needed. It’s a fascinating story, and it shows that even with the help of world-class experts, some data loss is almost certain in that situation.</p><p>We thought about how to apply this lesson to Timescale Cloud, our cloud-native platform for TimescaleDB, which is deployed on Kubernetes. 
A core tenet of Timescale Cloud is to provide a worry-free experience, especially around keeping your data safe and secure. </p><p>Behind the scenes, among other technologies, Timescale Cloud relies on encrypted Amazon Elastic Block Storage (EBS) volumes and PostgreSQL continuous archiving. You can read how we made <a href="https://timescale.ghost.io/blog/making-postgresql-backups-100x-faster-via-ebs-snapshots-and-pgbackrest/"><u>PostgreSQL backups 100x faster via EBS snapshots and pgBackRest in this post</u></a>.</p><h2 id="how-we-back-up-every-single-timescaledb-instance">How We Back Up Every Single TimescaleDB Instance</h2><p>Let's start by briefly describing how we run PostgreSQL databases on Kubernetes containers at Timescale.</p><p>We refer to a TimescaleDB instance available to our customers as a TimescaleDB <em>service</em>. (Fun fact in terminology: in PostgreSQL, this database instance is referred to as a PostgreSQL “cluster” as one can traditionally run multiple logical databases within the same PostgreSQL process, which also should not be confused with a “cluster” of a primary database and its replicas. So let’s just refer to these things as “databases” or “instances” for now.)</p><p>A TimescaleDB service is constructed from several Kubernetes components, such as pods and containers running the database software, persistent volumes holding the data, Kubernetes services, and endpoints that direct clients to the pod.</p><p>We run TimescaleDB instances in containers orchestrated by Kubernetes. We have implemented a custom TimescaleDB operator to manage a large fleet of TimescaleDB services, configuring and provisioning them automatically.</p><p>Similar to other <a href="https://kubernetes.io/docs/concepts/extend-kubernetes/operator/"><u>Kubernetes operators</u></a>, a TimescaleDB operator provides a Kubernetes custom resource definition (CRD) that describes a TimescaleDB deployment. 
The operator converts the YAML manifests defined by the TimescaleDB CRD into TimescaleDB services and manages their lifecycle.</p><p>TimescaleDB pods take advantage of<a href="https://kubernetes.io/docs/concepts/workloads/pods/#using-pods"> <u>Kubernetes sidecars</u></a>, running several containers alongside the database. One of the sidecars runs<a href="https://www.tigerdata.com/blog/making-postgresql-backups-100x-faster-via-ebs-snapshots-and-pgbackrest" rel="noreferrer"> <u>pgBackRest</u></a>, a popular PostgreSQL backup software, and provides an API to launch backups, both on-demand and periodic, triggered by Kubernetes cron jobs. In addition, the database container continuously archives changes in the form of write-ahead logging (WAL) segments, storing them on Amazon S3 in the same location as the backups.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/06/How-I-Learned-to-Love-Kubernetes-Update-backup-restore_2.png" class="kg-image" alt=" Architecture diagram of how Timescale Forge backups are done in Kubernetes." 
loading="lazy" width="2000" height="1048" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2024/06/How-I-Learned-to-Love-Kubernetes-Update-backup-restore_2.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2024/06/How-I-Learned-to-Love-Kubernetes-Update-backup-restore_2.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2024/06/How-I-Learned-to-Love-Kubernetes-Update-backup-restore_2.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/06/How-I-Learned-to-Love-Kubernetes-Update-backup-restore_2.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">How we orchestrate backups using Kubernetes and AWS S3 in Timescale</em></i></figcaption></figure><p>In addition to the TimescaleDB operator, there is another microservice whose task is to deploy TimescaleDB instances (known as “the deployer”). The deployer defines TimescaleDB resources based on users’ choices and actions within the cloud console’s UI and creates TimescaleDB manifests, letting the operator pick them up and provision running TimescaleDB services.</p><p></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/06/How-I-Learned-to-Love-Kubernetes-Update-backup-restore_3.png" class="kg-image" alt="Architecture diagram illustrating the process of deploying a TimescaleDB service in Kubernetes." 
loading="lazy" width="2000" height="815" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2024/06/How-I-Learned-to-Love-Kubernetes-Update-backup-restore_3.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2024/06/How-I-Learned-to-Love-Kubernetes-Update-backup-restore_3.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2024/06/How-I-Learned-to-Love-Kubernetes-Update-backup-restore_3.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/06/How-I-Learned-to-Love-Kubernetes-Update-backup-restore_3.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">How a TimescaleDB service is deployed via Kubernetes on Timescale</em></i></figcaption></figure><p>The deployer also watches for changes in Kubernetes objects that are part of the resulting TimescaleDB service and the manifest itself. It detects when the target service is fully provisioned or when there are changes to be made to the running service (e.g., to provision more compute resources or to upgrade to a new minor version of TimescaleDB). Finally, it also marks the service as deleted upon receiving a delete event from the manifest.</p><div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-emoji">❗</div><div class="kg-callout-text">Now that you know how we deploy TimescaleDB services using Kubernetes, here’s where things get interesting. To provide our users with greater peace of mind about the safety and reliability of their data, we used several strategies for actively and automatically testing our backups. 
<br><br>However, our <a href="https://www.timescale.com/blog/introducing-one-click-database-forking-in-timescale-cloud/" rel="noreferrer">forks</a>, <a href="https://www.timescale.com/blog/how-timescale-replication-works-enabling-postgres-ha/" rel="noreferrer">replicas</a>, and <a href="https://docs.timescale.com/use-timescale/latest/backup-restore/point-in-time-recovery/" rel="noreferrer">point-in-time recovery</a> are currently based on the parent service’s backups. This means we get more restores from the customers’ regular cloud usage than from the dedicated and selective restores to test the backups. Essentially, we can catch <a href="https://github.com/pgbackrest/pgbackrest/issues/2215"><u>systematic issues</u></a> without the need for restore tests.<br><br>Although we don’t test anymore, we still think this is an interesting idea that could serve as inspiration for developers and kick-off discussions over building fault tolerance in a database management system. So, now you have a choice to make:<br><br>- If you want to read how we evolved from testing to how replicas and forks currently work on Timescale, we explain in this article how we made <a href="https://www.timescale.com/blog/making-postgresql-backups-100x-faster-via-ebs-snapshots-and-pgbackrest/"><u>PostgreSQL backups 100x faster via EBS snapshots and pgBackRest</u></a>.<br><br>- If you want to learn how we <i><em class="italic" style="white-space: pre-wrap;">used to</em></i> do backup and restore validation on Timescale, keep reading—<b><strong style="white-space: pre-wrap;">just remember this is not how we roll anymore</strong></b>.</div></div><p></p><p></p><p></p><h2 id="how-we-used-to-run-restore-tests">How We Used to Run Restore Tests</h2><p>In the previous section, we established that the deployer and operator work together to deploy and manage a TimescaleDB service in Kubernetes, including the container running PostgreSQL and TimescaleDB and the container sidecars running pgBackRest and 
others.</p><p>Sometimes, a solution to one problem is a by-product of working on another problem. As we built Timescale, we easily implemented several features by adding the ability to clone a running service, producing a new one with identical data. That process is similar to spawning a replica of the original database, except that at some point, that replica is “detached” from the former primary and goes a separate way.</p><p>At the time, we added the ability to continuously validate backups through frequent smoke testing using a similar approach. This is how it worked: a restore test produced a new service with the data from an existing backup, relying on PostgreSQL point-in-time recovery (PITR). When a new test service was launched, it restored the base backup from Amazon S3 and replayed all pending WAL files until it reached a pre-defined point in time, where it detached into a stand-alone instance.</p><p>Under the hood, we used (and still use)<a href="https://github.com/zalando/patroni"> <u>Patroni</u></a>, a well-known PostgreSQL high-availability solution template, to replace a regular PostgreSQL bootstrap sequence with a custom one that involves restoring a backup from Amazon S3. If you want to go into the weeds of <a href="https://timescale.ghost.io/blog/how-timescale-replication-works-enabling-postgres-ha/"><u>how we enable high availability in PostgreSQL, check out this article</u></a>.</p><p>A feature of Patroni called “<a href="https://patroni.readthedocs.io/en/latest/replica_bootstrap.html#bootstrap"><u>custom bootstrap</u></a>” allows defining arbitrary initialization steps instead of relying on the PostgreSQL bootstrap command <a href="https://www.postgresql.org/docs/current/app-initdb.html" rel="noreferrer"><u><code>initdb</code></u></a>. Our custom bootstrap script also called pgBackRest, pointing it to the backup of the instance we were testing. 
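</p><p>As an illustration only, a Patroni custom bootstrap section might look roughly like the sketch below. The layout follows the Patroni documentation; the method name (<code>pgbackrest_restore</code>), the script path, the stanza name, and the target timestamp are hypothetical placeholders, not the actual values used on Timescale:</p><pre><code class="language-yaml">bootstrap:
  # Run the custom method below instead of initdb
  method: pgbackrest_restore              # hypothetical method name
  pgbackrest_restore:
    # Hypothetical script that restores the base backup,
    # for example by calling "pgbackrest restore"
    command: /scripts/restore_from_backup.sh
    keep_existing_recovery_conf: false
    no_params: false
    recovery_conf:
      # Fetch archived WAL segments during recovery ("demo" is a placeholder stanza)
      restore_command: pgbackrest --stanza=demo archive-get %f "%p"
      # Stop replaying WAL at this point in time, then promote
      # the instance so it becomes a stand-alone database
      recovery_target_time: "2024-06-28 07:00:00+00"
      recovery_target_action: promote
</code></pre><p>With a section like this, Patroni runs the configured command in place of <code>initdb</code> and then starts PostgreSQL in recovery using the settings under <code>recovery_conf</code>.</p><p>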
(Side note: my colleague&nbsp;<a href="https://twitter.com/ekief"><u>Feike Steenbergen</u></a> and I were among the initial developers of Patroni earlier in our careers, so we were quite familiar with how to incorporate it into such complex workflows.)</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/06/How-I-Learned-to-Love-Kubernetes-Update-backup-restore_1-1.png" class="kg-image" alt="Architecture diagram of how Timescale Forge performs restore tests." loading="lazy" width="2000" height="1085" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2024/06/How-I-Learned-to-Love-Kubernetes-Update-backup-restore_1-1.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2024/06/How-I-Learned-to-Love-Kubernetes-Update-backup-restore_1-1.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2024/06/How-I-Learned-to-Love-Kubernetes-Update-backup-restore_1-1.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/06/How-I-Learned-to-Love-Kubernetes-Update-backup-restore_1-1.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">How Timescale conducted restore tests to validate PostgreSQL backups</span></figcaption></figure><p>Once we had verified the backup could be restored without errors, we determined whether we had the right data. We checked two properties of the restored backup: <strong>recentness and consistency</strong>. Since the outcome of the restore is a regular TimescaleDB, those checks simply ran SQL queries against the resulting database.</p><p>Obviously, we have no visibility into users’ data to verify the restored backup is up-to-date. 
So to check for recentness, we injected a special row with the timestamp of the beginning of the restore test into a dedicated bookkeeping table in the target service. (This table was not accessible or visible to users.) The test configured PostgreSQL point-in-time recovery (PITR), setting the parameter <code>recovery_target_time</code> to match that timestamp. When the instance’s restore was completed, the scripts that Patroni ran at the post-bootstrap stage verified whether the row was there.</p><p>As a final safeguard, we verified that the restored database was internally consistent. In this context, a backup restore was consistent if it produced the same results for a set of queries as the original service it was based on at the point in time when the backup was made.</p><p>The easiest way to check for consistency was to read every object in the target database and watch for errors. If the original instance produced no errors for a particular query when the backup was made, the restore of that backup should also produce no errors. We used<a href="https://www.postgresql.org/docs/current/app-pgdump.html"> <u>pg_dump</u></a>, the built-in tool for producing SQL dumps for PostgreSQL.</p><p>By default, pg_dump reads every row in the target database and writes its SQL representation to the dump file. Since we were not interested in the dump itself, we redirected the output to /dev/null to save disk space and improve performance. We also used the “-s” flag to trigger a schema-only dump without touching data rows: there was no need to read every data row because we were only interested in checking system catalogs for consistency.</p><p>The deployer was responsible for scheduling the tests over the whole fleet. It employed an elegant hack (our favorite type of hack!) to do so by relying on certain Patroni behavior:</p><ul><li>Patroni modified Kubernetes endpoints to point PostgreSQL clients to the primary database instance. 
It updated the list of addresses and annotations in the endpoint. As a result, every endpoint was touched regularly, as Patroni ensured the primary holds the leader lock for every instance.</li><li>The deployer installed the Kubernetes informer on the endpoints of running instances, allowing it to call a custom callback every time the endpoint was created, updated, or deleted.</li><li>The OnUpdate path allowed for every running instance to evaluate whether the restore test was necessary.</li><li>The restore test instance endpoint triggered its own OnUpdate event. We used it to check the restore test status and finish the test once it was done.</li><li>The deployer recorded each observed restore test status in a <a href="https://www.tigerdata.com/blog/database-indexes-in-postgresql-and-timescale-cloud-your-questions-answered" rel="noreferrer"><u>hypertable</u></a> in the deployer database, together with the status change timestamp.</li><li>The deployer <a href="https://www.tigerdata.com/blog/database-indexes-in-postgresql-and-timescale-cloud-your-questions-answered" rel="noreferrer">hypertable</a> limited the number of in-progress tests and provided useful statistics about the tests for our monitoring.</li></ul><h2 id="summary">Summary</h2><p>Timescale is designed to provide a worry-free experience and a trustworthy environment for your critical data. We believe that developers should never have to worry about the reliability of their database, and they should have complete confidence that their data will never be lost.</p><p>Backups provide a facility to archive and store data so that it can be recovered in the future. Of course, backups are only one part of a broader strategy for ensuring reliability. 
Among other things, Timescale's use of Kubernetes has allowed us to provide a decoupled compute and storage solution for more reliable and cost-effective fault tolerance.</p><p>All writes to WAL and data volumes are replicated to multiple physical storage disks for higher durability and availability, and even if a TimescaleDB instance fails (including from hardware failures), Kubernetes can immediately spin up a new container to reconnect to its online storage volumes within tens of seconds without needing to ever take the slower path of recovering from these backups from S3.&nbsp;</p><p>So, at Timescale, we modify that ancient proverb: “There are three kinds of database developers: those who do backups, those who will do backups, and those who use Timescale and don’t have to think about them.”</p><p>If you’re new to TimescaleDB,&nbsp;<a href="https://console.cloud.timescale.com/signup"><u>create a free Timescale account</u></a> to get started with a fully managed Timescale service (free for 30 days, no credit card required).</p><p>Once you’re using TimescaleDB, or if you’re already up and running,<a href="https://www.timescale.com/community"> <u>join the Timescale community</u></a> to share your feedback, ask questions about time-series data (and databases in general), and more. We’d love to hear about your restore tests and thoughts on trade-offs of snapshots vs. point-in-time recovery.</p><p>And, if you enjoy working on hard engineering problems, share our mission, and want to join our fully remote, global team,<a href="https://www.timescale.com/careers"> <u>we’re hiring broadly across many roles</u></a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How We Made PostgreSQL as Fast as Pinecone for Vector Data]]></title>
            <description><![CDATA[Read how we equipped PostgreSQL with advanced indexing techniques, making it as fast as other specialized vector databases, like Pinecone.]]></description>
            <link>https://www.tigerdata.com/blog/how-we-made-postgresql-as-fast-as-pinecone-for-vector-data</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/how-we-made-postgresql-as-fast-as-pinecone-for-vector-data</guid>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[pgvector]]></category>
            <dc:creator><![CDATA[Matvey Arye]]></dc:creator>
            <pubDate>Tue, 11 Jun 2024 12:05:45 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/06/How-We-Made-PostgreSQL-as-Fast-as-Pinecone-for-Vector-Data_Binary-Quantization-1.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/06/How-We-Made-PostgreSQL-as-Fast-as-Pinecone-for-Vector-Data_Binary-Quantization-1.png" alt="How We Made PostgreSQL as Fast as Pinecone for Vector Data using an alternative to binary quantization " /><p>We <a href="https://timescale.ghost.io/blog/pgvector-is-now-as-fast-as-pinecone-at-75-less-cost" rel="noreferrer">recently announced the open-sourcing of pgvectorscale</a>, a new <a href="https://www.tigerdata.com/blog/top-8-postgresql-extensions" rel="noreferrer">PostgreSQL extension</a> that provides advanced indexing techniques for vector data. Pgvectorscale provides a new index method for pgvector data, significantly improving the search performance of approximate nearest neighbor (ANN) queries. These queries are key for leveraging modern <a href="https://www.tigerdata.com/blog/a-beginners-guide-to-vector-embeddings" rel="noreferrer">vector embedding</a> techniques to facilitate <a href="https://www.tigerdata.com/learn/vector-search-vs-semantic-search" rel="noreferrer"><u>semantic search</u></a>, which allows for finding things similar to a query's <em>meaning</em>. That, in turn, enables applications like retrieval-augmented generation (RAG), summarization, clustering, or general search.</p><p>In our announcement post, we described how our new StreamingDiskANN vector index allows us to <a href="https://www.tigerdata.com/search" rel="noreferrer">perform vector search faster than bespoke, purpose-built vector databases like Pinecone</a>. We also observed that if bespoke databases aren’t faster, then there is no reason to use them because they can’t possibly compete with the rich feature set and ecosystem of general-purpose databases like PostgreSQL.</p><p>In this article, we’ll go into the technical contributions that allowed us to “break the speed barrier” and create a fast vector index in PostgreSQL. 
We’ll cover three technical improvements we made:</p><ul><li><strong>Implementing the DiskANN algorithm</strong> to allow the index to be stored on SSDs instead of having to reside in memory. This vastly decreases the cost of storing large numbers of vectors since SSDs are much cheaper than RAM.</li><li><strong>Supporting streaming post-filtering,</strong> which allows for accurate retrieval even when secondary filters are applied. In contrast, the <a href="https://www.tigerdata.com/blog/vector-database-basics-hnsw" rel="noreferrer">HNSW</a> (hierarchical navigable small world) index fails to accurately retrieve data if the filters exclude the first <code>ef_search</code> vectors. Pinecone had previously complained about this problem when comparing itself to pgvector. Guess what: through the power of open source, this issue has been resolved.</li><li><strong>Developing a completely new vector quantization algorithm </strong>we call SBQ (statistical binary quantization). This algorithm provides a better accuracy vs. performance trade-off compared to existing ones like BQ (binary quantization) and PQ (product quantization).</li></ul><h2 id="enhancing-postgresql-for-vector-data">Enhancing PostgreSQL for Vector Data</h2><h3 id="implementing-the-diskann-algorithm-to-optimize-for-ssd-storage">Implementing the DiskANN algorithm to optimize for SSD storage</h3><p>The <a href="https://www.tigerdata.com/learn/understanding-diskann" rel="noreferrer"><u>DiskANN</u></a> algorithm came out of research at Microsoft. Its goal was to store a very large number of vectors (think Microsoft scale). At that scale, it was simply uneconomical to store everything in RAM. Thus, the algorithm is geared towards storing vectors on SSDs and using less RAM. 
Its details are described very well in the <a href="https://proceedings.neurips.cc/paper_files/paper/2019/file/09853c7fb1d3f8ee67a61b6bf4a7f8e6-Paper.pdf"><u>paper</u></a>, so I’ll only give a bit of intuition below.&nbsp;</p><p>The DiskANN algorithm is a graph-based search algorithm like <a href="https://www.tigerdata.com/learn/hnsw-vs-diskann" rel="noreferrer">HNSW</a>. Graph-based algorithms in this space have a well-known problem: finding an item that’s “very far” from the starting position is expensive because it requires a lot of hops.&nbsp;</p><p>HNSW solves this problem by introducing a system of layers where the first (top) layer only has “long-range” edges that quickly get you into the right vicinity and have pointers to nodes into lower levels that allow you to traverse the graph in a more fine-grained way. This solves the long-range problem but introduces more indirection through the layering system, which requires more random-access that forces the graph into RAM for good performance.&nbsp;</p><p>In contrast, DiskANN uses a single-layer graph and solves the long-range problem during graph construction by allowing for neighbor edges that refer to far-away nodes. The single-layer construction simplifies the algorithm and decreases the random access necessary during search, allowing SSDs to be used effectively.</p><h3 id="support-for-streaming-retrieval-for-accurate-metadata-filtering">Support for streaming retrieval for accurate metadata filtering</h3><p>Oftentimes, when searching for semantically similar items, you want to constrain your search with additional filters. 
For example, documents are often associated with a set of tags and you may want to constrain your search by requiring a match of the tags as well as vector similarity.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/06/How-We-Made-PostgreSQL-as-Fast-as-Pinecone-for-Vector-Data_two-stage-filtering.png" class="kg-image" alt="A diagram representing two-stage filtering" loading="lazy" width="2000" height="1657" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2024/06/How-We-Made-PostgreSQL-as-Fast-as-Pinecone-for-Vector-Data_two-stage-filtering.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2024/06/How-We-Made-PostgreSQL-as-Fast-as-Pinecone-for-Vector-Data_two-stage-filtering.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2024/06/How-We-Made-PostgreSQL-as-Fast-as-Pinecone-for-Vector-Data_two-stage-filtering.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/06/How-We-Made-PostgreSQL-as-Fast-as-Pinecone-for-Vector-Data_two-stage-filtering.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Figure 1: The problem with two-stage post-filtering is that if the matching records aren’t located in the set before the cutoff of the first stage, the final answer will be incorrect.</em></i></figcaption></figure><p>This is challenging for many HNSW-based indexes (including pgvector’s implementation) because the index retrieves a pre-set number of records from the index (set by the <code>hnsw.ef_search</code> parameter, often set to 1,000 or less) <em>before</em> applying secondary filters. 
If not enough items in the retrieved set (e.g., first 1,000 items) match the secondary filters, you will miss those results.&nbsp;</p><p>Figure 1 illustrates this problem when you use <code>hnsw.ef_search=5</code> to find the top two vectors closest to a given query <strong>and</strong> matching the tag “department=engineering”. In this scenario, the first item with the correct tag is the seventh vector closest to the query.&nbsp;</p><p>Since the vector search returns only the closest five items and none matches the tag filter, no results will be returned! This is an extreme example where no results are left over, but there will be some accuracy loss any time the retrieved set has less than k items matching the filter.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/06/How-We-Made-PostgreSQL-as-Fast-as-Pinecone-for-Vector-Data_streaming-filtering.png" class="kg-image" alt="A diagram representing streaming filtering" loading="lazy" width="2000" height="1657" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2024/06/How-We-Made-PostgreSQL-as-Fast-as-Pinecone-for-Vector-Data_streaming-filtering.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2024/06/How-We-Made-PostgreSQL-as-Fast-as-Pinecone-for-Vector-Data_streaming-filtering.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2024/06/How-We-Made-PostgreSQL-as-Fast-as-Pinecone-for-Vector-Data_streaming-filtering.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/06/How-We-Made-PostgreSQL-as-Fast-as-Pinecone-for-Vector-Data_streaming-filtering.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Figure 2: Streaming filtering produces the 
correct result by exposing a </em></i><i><code spellcheck="false" style="white-space: pre-wrap;"><em class="italic">get_next()</em></code></i><i><em class="italic" style="white-space: pre-wrap;"> function that can be called continuously until the right number of records are found. </em></i></figcaption></figure><p>In contrast, our StreamingDiskANN index has no “ef_search” type cutoff. Instead, as shown in Figure 2, it uses a streaming model that allows the index to continuously retrieve the “next closest” item for a given query, potentially even traversing the entire graph! The Postgres execution system will continuously ask for the “next closest” item until it has found the <code>LIMIT N</code> items that satisfy the additional filters. This is a form of post-filtering that suffers absolutely no accuracy degradation.</p><p>As a side note, Pinecone made a big deal of the “ef_search” type limitation to deposition pgvector in <a href="https://www.pinecone.io/blog/pinecone-vs-pgvector/"><u>their comparison</u></a>. But, with the introduction of StreamingDiskANN, this criticism no longer applies. This just shows the power of open-source projects to move quickly to mitigate limitations.</p><h3 id="statistical-binary-quantization-sbq-a-new-quantization-algorithm">Statistical binary quantization (SBQ): A new quantization algorithm</h3><p>Many vector indexes use compression to reduce the space needed for vector storage and make index traversal faster at the cost of some loss in accuracy. The common algorithms are product quantization (PQ) and binary quantization (BQ). In fact, pgvector’s <a href="https://www.tigerdata.com/learn/vector-database-basics-hnsw" rel="noreferrer">HNSW index</a> just added BQ in their <a href="https://github.com/pgvector/pgvector/blob/master/CHANGELOG.md?ref=timescale.com#070-2024-04-29" rel="noreferrer">latest 0.7.0 release</a> (hooray!).&nbsp;</p><p>The way most vector databases work to retrieve K results is as follows. 
The system first retrieves N results (N&gt;K) using the approximate quantized distances, then “corrects” for the error by rescoring. It calculates the full distance for the N results, sorts the list by the full distance, and returns the K items with the smallest distance. Yet, even with rescoring, the accuracy of the quantized distances still matters because it allows you to decrease N (and thus query faster) and improves the chances that the true nearest neighbors will be in the set of N pre-fetched results.</p><p>We took a look at the BQ algorithm and were unhappy with the amount of accuracy loss it produced. We also immediately saw some low-hanging fruit to improve it. In tinkering with the algorithm, we developed a new compression algorithm we are calling statistical binary quantization (SBQ).&nbsp;</p><p>The BQ compression algorithm transforms a floating-point vector into a binary vector in a surprisingly simple way: for each element in the vector, if the value is greater than 0.0, make the binary value 1; otherwise, set the binary value to 0. Then, the distance function simply becomes the <code>XOR</code> function (more precisely, the number of bits set in the <code>XOR</code> of the two binary vectors). Why <code>XOR</code>? 
Well, you’ll find many mathematical explanations (none of which we quite like) but the intuition we use is that the binary vector divides the space into quadrants as seen in Figure 3, and the <code>XOR</code> function is simply a count of how many planes you have to cross to get from one quadrant to another.&nbsp;</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/06/How-We-Made-PostgreSQL-as-Fast-as-Pinecone-for-Vector-Data_Binary-Quantization.png" class="kg-image" alt="A diagram representing BQ" loading="lazy" width="2000" height="1460" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2024/06/How-We-Made-PostgreSQL-as-Fast-as-Pinecone-for-Vector-Data_Binary-Quantization.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2024/06/How-We-Made-PostgreSQL-as-Fast-as-Pinecone-for-Vector-Data_Binary-Quantization.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2024/06/How-We-Made-PostgreSQL-as-Fast-as-Pinecone-for-Vector-Data_Binary-Quantization.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/06/How-We-Made-PostgreSQL-as-Fast-as-Pinecone-for-Vector-Data_Binary-Quantization.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Figure 3: BQ for three dimensions. Quadrant 1 is represented by the binary vector [1,1,1] and any vector falling into that quadrant will have a distance of 0. The distance with vectors in other quadrants increases with the number of dimensions that are different.</em></i></figcaption></figure><p>One of the immediate things that struck us as odd is that the cutoff for each dimension is always 0.0. 
This was odd because in analyzing real embeddings we’ve previously found that the mean for each dimension is not even approximately 0.0. That means the quadrants we are defining in BQ are not dividing the space of points in half and are thus missing out on opportunities for differentiation.&nbsp;</p><p>Intuitively, you want the “origin” of your cutting plane in the middle of all the action, but in BQ, it’s off to the side. The solution was quite simple: we used a learning pass to derive the mean value for each dimension and then set the float-value cutoff to the mean instead of 0.0. Thus, we set the binary value for an element to 1 if and only if the float value is greater than the mean for the dimension.</p><p>But then we noticed yet another odd thing: the compression algorithm worked better for 1,536 dimensions than for 768 dimensions. This made little sense to us because the literature strongly implies that problems with higher dimensions are harder than problems with lower dimensions (the so-called “curse of dimensionality”). But here, the opposite is true.&nbsp;</p><p>However, thinking about the quadrant analogy, this kind of made sense—we’d have fewer quadrants with 768 dimensions, and each quadrant would be bigger and thus less differentiated. So we asked ourselves, could we create more quadrants with 768 dimensions?&nbsp;</p><p>Our approach was to convert each floating-point dimension into two bits (which we later generalized). The idea was to use the mean and standard deviation to derive a z-score (a value’s distance from the mean normalized by standard deviation) and then divide the z-score into three regions. We then encode the three regions into two bits so that adjacent regions have an <code>XOR</code> distance of 1, and the distance increases with the z-score distance. In the two-bit case with three regions the encoding is 00, 01, 11.&nbsp;</p><p>Experimentally, we found that two-bit encoding really helps accuracy with the 768-dimension case. 
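</p><p>To make the two encodings described above concrete, here is a minimal Python sketch. It is not the pgvectorscale implementation (which is written in Rust). The function names (<code>learn_stats</code>, <code>encode_one_bit</code>, <code>encode_two_bit</code>, <code>xor_distance</code>) are hypothetical, and the region boundary <code>Z_CUTOFF</code> of 0.5 standard deviations is an assumption chosen purely for illustration.</p>

```python
from statistics import fmean, pstdev

# Assumption for illustration only: the boundary between the three
# z-score regions. This value is NOT taken from pgvectorscale.
Z_CUTOFF = 0.5


def learn_stats(vectors):
    """Learning pass: per-dimension mean and standard deviation."""
    columns = list(zip(*vectors))
    means = [fmean(col) for col in columns]
    # Fall back to 1.0 for a constant dimension to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def encode_one_bit(vector, means):
    """One bit per dimension: 1 if and only if the value exceeds the mean."""
    return [1 if x > m else 0 for x, m in zip(vector, means)]


def encode_two_bit(vector, means, stds):
    """Two bits per dimension: z-score regions encoded as 00, 01, 11."""
    bits = []
    for x, m, s in zip(vector, means, stds):
        z = (x - m) / s
        if z > Z_CUTOFF:
            bits.extend((1, 1))  # high region
        elif z > -Z_CUTOFF:
            bits.extend((0, 1))  # middle region
        else:
            bits.extend((0, 0))  # low region
    return bits


def xor_distance(a, b):
    """Number of differing bits, i.e. the count of ones in a XOR b."""
    return sum(x ^ y for x, y in zip(a, b))
```

<p>With this encoding, the low and middle regions differ in one bit, the middle and high regions differ in one bit, and the low and high regions differ in two bits, so the <code>XOR</code> distance grows with the z-score distance as described above.</p><p>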
Thus, by default, we use two-bit encoding for any data with less than about 900 dimensions and one-bit encoding otherwise. In one representative example on a dataset with 768 dimensions, the recall improved from 96.5&nbsp;% to 98.6&nbsp;% when switching from the one-bit to two-bit encoding, a significant improvement at such high recall levels.</p><p>In sum, these techniques help us achieve a better accuracy/performance trade-off.</p><h2 id="a-better-postgresql-for-vector-data">A Better PostgreSQL for Vector Data</h2><p>The three techniques we covered in this post allow us to develop a best-in-class index for vector data in PostgreSQL that rivals the performance of bespoke databases like Pinecone. We were able to achieve this with a small team by harnessing much of the infrastructure that PostgreSQL provides, including caching, WAL (write-ahead logging), and the associated recovery infrastructure, and a rock-solid disk writing system.&nbsp;</p><p>We wrote this in Rust using the <a href="https://github.com/pgcentralfoundation/pgrx"><u>PGRX</u></a> framework for writing Rust extensions for PostgreSQL. This further sped up development because we could rely on some of the safety guarantees that Rust and PGRX provide while developing our own safe wrappers for tricky parts of the code (like disk I/O). We think that this combination of tools is really useful and powerful for developing database features and extending the reach of PostgreSQL.&nbsp;</p><h3 id="next-steps">Next steps</h3><p>Our team has been working tirelessly in the last few months to equip PostgreSQL with these new advanced indexing techniques for vector data. Our goal is to help PostgreSQL developers become AI developers. 
But for that, we need your feedback.</p><p>Here’s how you can get involved:&nbsp;</p><ul><li><strong>Share the news with your friends and colleagues</strong>: Share our posts announcing <a href="https://www.tigerdata.com/blog/pgai-giving-postgresql-developers-ai-engineering-superpowers" rel="noreferrer">pgai</a> and pgvectorscale on <a href="https://x.com/TimescaleDB"><u>X/Twitter</u></a>, <a href="https://www.linkedin.com/company/timescaledb/"><u>LinkedIn</u></a>, and Threads. We promise to RT back.</li><li><strong>Submit issues and feature requests</strong>: We encourage you to submit issues and feature requests for functionality you’d like to see, bugs you find, and suggestions you think would improve both projects.</li><li><strong>Make a contribution</strong>: We welcome community contributions for both pgvectorscale and pgai. Pgvectorscale is written in Rust, while pgai uses Python and PL/Python. For pgai specifically, let us know which models you want to see supported, particularly for open-source embedding and generation models. <a href="https://github.com/timescale/pgai/" rel="noreferrer">See the pgai GitHub</a> for more.</li><li><strong>Offer pgvectorscale and pgai extensions on your PostgreSQL cloud</strong>: Pgvectorscale and pgai are open-source projects under the <a href="https://github.com/timescale/pgvectorscale/blob/main/LICENSE" rel="noreferrer"><u>PostgreSQL License</u></a>. We encourage you to offer pgvectorscale and pgai on your managed PostgreSQL database-as-a-service platform, and we can even help you spread the word. 
Get in touch via our <a href="https://www.timescale.com/contact"><u>Contact Us form</u></a> and mention pgai and pgvectorscale to discuss further.</li><li><strong>Use pgai and pgvectorscale today</strong>: You can find installation instructions on the <a href="https://github.com/timescale/pgai/" rel="noreferrer">pgai GitHub</a> and <a href="https://github.com/timescale/pgvectorscale/" rel="noreferrer">pgvectorscale GitHub</a> repositories, respectively. <a href="https://www.tigerdata.com/search" rel="noreferrer">You can also access both pgai and pgvectorscale on any database service on Tiger Data’s cloud PostgreSQL platform</a>. For production vector workloads, we’re offering private beta access to vector-optimized databases with pgvector and pgvectorscale on Timescale. <a href="https://timescale.typeform.com/to/H7lQ10eQ"><u>Sign up here for priority access</u></a>.</li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[PostgreSQL Data Cleaning vs. Python Data Cleaning]]></title>
            <description><![CDATA[Are you using the best tools for your PostgreSQL data cleaning tasks? Here’s an introduction to some time-saving tools you can use within PostgreSQL itself. ]]></description>
            <link>https://www.tigerdata.com/blog/postgresql-data-cleaning-vs-python</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/postgresql-data-cleaning-vs-python</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[General]]></category>
            <dc:creator><![CDATA[Miranda Auhl]]></dc:creator>
            <pubDate>Thu, 23 May 2024 13:59:00 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/05/PostgreSQL-Data-Cleaning-vs.-Python-Data-Cleaning--1-.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/05/PostgreSQL-Data-Cleaning-vs.-Python-Data-Cleaning--1-.png" alt="A neon-colored broom over a paint-splashed background, representing data cleaning" /><h2 id="introduction"><br>Introduction</h2><p>During analysis, you rarely—if ever—get to go directly from evaluating data to transforming and analyzing it. Sometimes, to properly evaluate your data, you may need to do some pre-cleaning before you get to the main data cleaning—and that’s a lot of cleaning! To accomplish all this work, you may use Excel, R, or Python, but are these the best tools for your PostgreSQL data cleaning tasks?</p><p>In this blog post, I explore some classic <strong>data cleaning</strong> scenarios and show how you can perform them <em>directly within your database</em> using <a href="https://www.timescale.com/">TimescaleDB</a> and <a href="https://www.postgresql.org/">PostgreSQL</a>, replacing the tasks that you may have done in Excel, R, or Python. TimescaleDB and PostgreSQL can't replace these tools entirely. However, they can help your data munging/cleaning tasks be more efficient and, in turn, let Excel, R, and Python shine where they do best: in visualizations, modeling, and machine learning.  </p><p>Cleaning is a critical part of the analysis process. In my experience, it can be the most grueling! By cleaning data directly within my database, I can perform many of my cleaning tasks one time rather than repetitively within a script. 
This saves me considerable time in the long run.</p><div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-emoji">🔖</div><div class="kg-callout-text">Learn how you can perform <a href="https://www.timescale.com/learn/time-series-analysis-in-r" rel="noreferrer">time-series analysis in R</a>.</div></div><p><br></p><h2 id="how-postgresql-data-cleaning-fits-in-the-data-analysis-process">How PostgreSQL Data Cleaning Fits in the Data Analysis Process</h2><p>I began this series of posts on <a href="https://timescale.ghost.io/blog/blog/speeding-up-data-analysis/">data analysis</a> by presenting the following summary of the analysis process:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Untitled.jpg" class="kg-image" alt="Image showing Evaluate -> Clean -> Transform -> Model, accompanied by icons which relate to each step" loading="lazy" width="1600" height="406" srcset="https://www.timescale.com/blog/content/images/size/w600/2022/01/Untitled.jpg 600w, https://www.timescale.com/blog/content/images/size/w1000/2022/01/Untitled.jpg 1000w, https://www.timescale.com/blog/content/images/2022/01/Untitled.jpg 1600w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Data analysis lifecycle</span></figcaption></figure><p>The first three steps of the analysis lifecycle (evaluate, clean, transform) comprise the “data munging” stages of analysis. Historically, I have done my data munging and modeling all within Python or R, these being excellent options for analysis. </p><p>However, once I was introduced to PostgreSQL and TimescaleDB, I found how efficient and fast it was to do my data munging directly within my database. 
In my previous post, I focused on showing <a href="https://timescale.ghost.io/blog/blog/how-to-evaluate-your-data-directly-within-the-database-and-make-your-analysis-more-efficient/">data evaluation</a> techniques and how you can replace tasks previously done in Python with PostgreSQL and TimescaleDB code. I now want to move on to the second step, <strong>data cleaning</strong>. Cleaning may not be the most glamorous step in the analysis process, but it is absolutely crucial to creating accurate and meaningful models.</p><p>As I mentioned <a href="https://timescale.ghost.io/blog/blog/how-to-evaluate-your-data-directly-within-the-database-and-make-your-analysis-more-efficient/">in my last post</a>, my first job out of college was at an energy and sustainability solutions company that focused on monitoring utility usage—such as electricity, water, sewage, you name it—to determine how our clients’ buildings could be more efficient. My role at this company was to perform data analysis and business intelligence tasks.</p><p>Throughout this job, I got the chance to use many popular data analysis tools including Excel, R, and Python. But once I tried using a database to perform my data munging tasks - specifically PostgreSQL and TimescaleDB - I realized how efficient and straightforward analysis, and particularly cleaning tasks, could be when done directly in a database. </p><p>Before using a database for data cleaning tasks, I would often find either columns or values that needed to be edited. I would pull the raw data from a CSV file or database, then make any adjustments to this data within my Python script. </p><p>This meant that every time I ran my Python script, I would have to wait for my machine to spend computational time setting up and cleaning my data. This means that I lost time with every run of the script. Additionally, if I wanted to share cleaned data with colleagues, I would have to run the script or pass it along to them to run. 
This extra computational time could add up depending on the project. </p><p>Instead, with PostgreSQL data cleaning, I can write a query to do this cleaning once and then store the results in a table. I no longer need to spend time cleaning and transforming data again and again with a Python script; I can just set up the cleaning process in my database and call it a day! Once I started to make PostgreSQL data cleaning changes directly within my database, I was able to skip performing cleaning tasks within Python and simply focus on jumping straight into modeling my data. </p><p>To keep this post as succinct as possible, I chose to only show side-by-side code comparisons for Python and PostgreSQL. If you have any questions about other tools or languages, please feel free to join our <a href="https://slack.timescale.com/">Slack channel</a>, where you can ask the Timescale community specific questions about Timescale or PostgreSQL functionality 😊. I’d love to hear from you!</p><p>Additionally, as we explore TimescaleDB and PostgreSQL functionality together, you may be eager to try things out right away! Which is awesome! The easiest way to get started is by signing up for <a href="https://console.cloud.timescale.com/signup" rel="noreferrer">a free 30-day trial of Timescale</a> (if you prefer self-hosting, you can always <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/install-timescaledb/self-hosted/">install and manage TimescaleDB on your own PostgreSQL instances</a>). Learn more by <a href="https://docs.timescale.com/timescaledb/latest/tutorials/">following one of our many tutorials</a>.</p><p>Now, before we dive into things and get our data, as Outkast best put it, “So fresh, So clean,” I want to quickly cover the data set I'll be using. In addition, I want to note that all the code I show will assume you have some basic knowledge of SQL. If you aren't familiar with SQL, don’t worry! 
In my last post, I included a section on <a href="https://timescale.ghost.io/blog/blog/how-to-evaluate-your-data-directly-within-the-database-and-make-your-analysis-more-efficient/#sql-basics" rel="noreferrer">SQL basics</a>.</p><h2 id="about-the-sample-dataset">About the Sample Dataset</h2><p>In my experience within the data science realm, I have done most of my data cleaning after evaluation. However, sometimes it can be beneficial to clean data, evaluate, and then clean again. The process you choose is dependent on the initial state of your data and how easy it is to evaluate. For the data set I'll use today, I'd likely do some initial cleaning before evaluation and then clean again after. Let me show you why. </p><p>I got the following <a href="https://www.kaggle.com/jaganadhg/house-hold-energy-data">IoT data set from Kaggle</a>, where a very generous individual shared their energy consumption readings from their apartment in San Jose, CA, the data incrementing every 15 minutes. While this is awesome data, it is structured a little differently than I would like. The raw data set follows this schema:</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Tables.jpg" class="kg-image" alt="Graphic showing the setup of the table. The tables name is 'energy_usage_staging'. each row contains the tables column and data types, the pairs of info are as follows ([type, text], [date, date], [start_time, time], [end_time, time], [usage, float4], [units,text], [cost, text], [notes, text])" loading="lazy" width="960" height="764" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/01/Tables.jpg 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Tables.jpg 960w" sizes="(min-width: 720px) 720px"></figure><p>and appears like this…</p><table>
<thead>
<tr>
<th>type</th>
<th>date</th>
<th>start_time</th>
<th>end_time</th>
<th>usage</th>
<th>units</th>
<th>cost</th>
<th>notes</th>
</tr>
</thead>
<tbody>
<tr>
<td>Electric usage</td>
<td>2016-10-22</td>
<td>00:00:00</td>
<td>00:14:00</td>
<td>0.01</td>
<td>kWh</td>
<td>$0.00</td>
<td></td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22</td>
<td>00:15:00</td>
<td>00:29:00</td>
<td>0.01</td>
<td>kWh</td>
<td>$0.00</td>
<td></td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22</td>
<td>00:30:00</td>
<td>00:44:00</td>
<td>0.01</td>
<td>kWh</td>
<td>$0.00</td>
<td></td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22</td>
<td>00:45:00</td>
<td>00:59:00</td>
<td>0.01</td>
<td>kWh</td>
<td>$0.00</td>
<td></td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22</td>
<td>01:00:00</td>
<td>01:14:00</td>
<td>0.01</td>
<td>kWh</td>
<td>$0.00</td>
<td></td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22</td>
<td>01:15:00</td>
<td>01:29:00</td>
<td>0.01</td>
<td>kWh</td>
<td>$0.00</td>
<td></td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22</td>
<td>01:30:00</td>
<td>01:44:00</td>
<td>0.01</td>
<td>kWh</td>
<td>$0.00</td>
<td></td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22</td>
<td>01:45:00</td>
<td>01:59:00</td>
<td>0.01</td>
<td>kWh</td>
<td>$0.00</td>
<td></td>
</tr>
</tbody>
</table>
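<p>To see why rows shaped like this are awkward to analyze as-is, here is a small Python sketch using only the standard library. It copies two rows from the sample table above as plain strings, the way they might arrive from a CSV export, and shows the conversions they need before any arithmetic or time-based work is possible. The <code>clean_row</code> helper and the dictionary layout are hypothetical and exist only for illustration.</p>

```python
from datetime import date, datetime, time

# Two rows copied from the sample table above, kept as plain strings.
raw_rows = [
    {"date": "2016-10-22", "start_time": "00:00:00", "usage": "0.01", "cost": "$0.00"},
    {"date": "2016-10-22", "start_time": "00:15:00", "usage": "0.01", "cost": "$0.00"},
]


def clean_row(row):
    """Hypothetical helper: merge date + start_time and make cost numeric."""
    timestamp = datetime.combine(
        date.fromisoformat(row["date"]),
        time.fromisoformat(row["start_time"]),
    )
    return {
        "time": timestamp,
        "usage": float(row["usage"]),
        # The leading "$" has to go before the text can become a number.
        "cost": float(row["cost"].lstrip("$")),
    }


cleaned = [clean_row(row) for row in raw_rows]
```

<p>Every run of a script like this repeats the same conversions, which is the kind of work that can instead be done once inside the database.</p>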
<p><br>In order to do any type of analysis on this data set, I want to clean it up. A few things that quickly come to mind include:</p><ul><li>The cost is stored as text (for example, <code>$0.00</code>), so it can't be summed or averaged until it's converted to a numeric type.</li><li>The time columns are split apart, which could cause some problems if I want to create plots over time or perform any type of modeling based on time.</li><li>I may also want to filter the data based on various parameters that have to do with time, such as day of the week or holiday identification (both potentially play into how energy is used within the household). </li></ul><p>In order to fix all of these things and get more valuable data evaluation and analysis, I will have to clean the incoming data! So without further ado, let’s roll up our sleeves and dig in!</p><h2 id="postgresql-data-cleaning">PostgreSQL Data Cleaning</h2><p>The techniques below are ones I've used in the past while working in data science. While these examples aren't exhaustive, I hope they will cover many of the cleaning steps you perform during your own analysis, helping to make your cleaning tasks more efficient by using PostgreSQL and TimescaleDB.</p><p>Please feel free to explore these various techniques and skip around if you need! 
There's a lot here, and I designed it to be a helpful glossary of tools that you could use as needed.</p><p>The techniques I will cover include:</p><ul><li><a href="https://timescale.ghost.io/blog/blog/postgresql-vs-python-for-data-cleaning-a-guide/#correcting-structural-issues">Correcting structural issues</a></li><li><a href="https://timescale.ghost.io/blog/blog/postgresql-vs-python-for-data-cleaning-a-guide/#creating-or-generating-relevant-data">Creating or generating relevant data</a></li><li><a href="https://timescale.ghost.io/blog/blog/postgresql-vs-python-for-data-cleaning-a-guide/#adding-data-to-a-hypertable">Adding data to a hypertable</a></li><li><a href="https://timescale.ghost.io/blog/blog/postgresql-vs-python-for-data-cleaning-a-guide/#renaming-values">Renaming columns or tables</a></li><li><a href="https://timescale.ghost.io/blog/blog/postgresql-vs-python-for-data-cleaning-a-guide/#fill-in-missing-data">Fill in missing values</a></li></ul><h3 id="note-on-cleaning-approach">Note on cleaning approach:</h3><p>There are many ways that I could approach PostgreSQL data cleaning. I could create a table then <a href="https://www.postgresql.org/docs/current/sql-altertable.html"><code>ALTER</code></a> it as I clean, I could create multiple tables as I add or change data, or I could work with <a href="https://www.postgresql.org/docs/14/sql-createview.html"><code>VIEW</code></a>s. Depending on the size of my data, any of these approaches <em>could</em> make sense. However, they will have different computational consequences.</p><p>You may have noticed above that my raw data table was called <code>energy_usage_staging</code>. This is because I decided that given the state of my raw data, it's better to place the raw data in a <em>staging table</em>, clean it using <code>VIEW</code>s, and then insert it into a more usable table as part of my cleaning process. 
</p><p>This move from the raw table to the usable table could happen even before the evaluation step of analysis. Sometimes data cleaning has to occur after AND before evaluating your data. Regardless, this data needs to be cleaned and I wanted to use the most efficient method possible. In this case, that meant using a staging table and leveraging the efficiency and power of PostgreSQL <code>VIEW</code>s, something I will talk about later.</p><p>Generally, if you're dealing with a lot of data, altering an existing table in PostgreSQL can be costly. For this post, I'll show how to build up clean data using <code>VIEW</code>s along with additional tables. This approach is more efficient because a <code>VIEW</code> stores only the query that defines it, so the staging table never has to be rewritten. It also sets up our next blog post about data transformation, which includes the use of scripts in PostgreSQL.</p><h3 id="correcting-structural-issues">Correcting structural issues</h3><p>Right off the bat, I know that I need to do some data refactoring on my raw table due to data types. Notice that we have <code>date</code> and time columns separated and <code>cost</code> is recorded as a text data type. I need to convert my separate date and time columns to a timestamp and the <code>cost</code> column to float4. But before I show that, I want to talk about why conversion to timestamp is beneficial.</p><h3 id="timescaledb-hypertables-and-why-timestamp-is-important">TimescaleDB hypertables and why timestamp is important</h3><p>If you're not familiar with the structure of <a href="https://docs.timescale.com/timescaledb/latest/overview/core-concepts/hypertables-and-chunks/">TimescaleDB hypertables</a>, they're the basis of how we efficiently query and manipulate time-series data. Timescale hypertables are partitioned based on time, and more specifically by the time column you specify upon creation of the table. </p><p>The data is partitioned by timestamp into "chunks" so that every row in the table belongs to some <em>chunk</em> based on a time range. 
We then use these time chunks to help query the rows so that you can get more efficient querying and data manipulation based on time. This image represents the difference between a normal table and our special hypertables.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Cleaning-data.jpg" class="kg-image" alt="Graphic showing a normal table vs a hypertable. The normal table just shows data in a table. The hypertable shows data in the table, but it also shows the data being &quot;grouped&quot; or &quot;chunked&quot; by day. By adding an index like structure based on time, queries can be more efficient. " loading="lazy" width="1600" height="1219" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/01/Cleaning-data.jpg 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/01/Cleaning-data.jpg 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Cleaning-data.jpg 1600w" sizes="(min-width: 720px) 720px"></figure><h4 id="changing-date-time-structure">Changing date-time structure</h4><p>Because I want to utilize TimescaleDB functionality to the fullest, such as continuous aggregates and faster time-based queries, I want to restructure the <code>energy_usage_staging</code> table's <code>date</code> and time columns. I could use the <code>date</code> column for my hypertable partitioning, however, I would have limited control over manipulating my data based on time. It is more flexible and space-efficient to have a single column with a timestamp than it is to have separate columns with date and time. I can always extract the date or time from the timestamp if I want to later!  
</p><p>Looking back at the table structure, I should be able to get a usable timestamp value from the <code>date</code> and <code>start_time</code> columns, as the <code>end_time</code> really doesn’t give me that much useful information. Thus, I want to combine these two columns to form a new timestamp column. Let’s see how I can do that using SQL. Spoiler alert: it's as simple as adding the two columns together. How cool is that?!</p><p><strong>PostgreSQL code:</strong></p><p>For my PostgreSQL data cleaning code, I can create the column without inserting it into the database. Since I want to create a NEW table from this staging one, I don’t want to add more columns or tables just yet. </p><p>Let’s first compare the original columns with our newly generated column. For this query, I simply <em>add</em> the two columns together: in PostgreSQL, adding a <code>date</code> to a <code>time</code> yields a <code>timestamp</code>. The <code>AS</code> keyword allows me to rename the column to whatever I would like, in this case, <code>time</code>.</p><p><a name="add"></a></p>
<pre><code class="language-sql">--add the date column to the start_time column
SELECT date, start_time, (date + start_time) AS time 
FROM energy_usage_staging eus;
</code></pre>
<p> Results:</p><table>
<thead>
<tr>
<th>date</th>
<th>start_time</th>
<th>time</th>
</tr>
</thead>
<tbody>
<tr>
<td>2016-10-22</td>
<td>00:00:00</td>
<td>2016-10-22 00:00:00.000</td>
</tr>
<tr>
<td>2016-10-22</td>
<td>00:15:00</td>
<td>2016-10-22 00:15:00.000</td>
</tr>
<tr>
<td>2016-10-22</td>
<td>00:30:00</td>
<td>2016-10-22 00:30:00.000</td>
</tr>
<tr>
<td>2016-10-22</td>
<td>00:45:00</td>
<td>2016-10-22 00:45:00.000</td>
</tr>
<tr>
<td>2016-10-22</td>
<td>01:00:00</td>
<td>2016-10-22 01:00:00.000</td>
</tr>
<tr>
<td>2016-10-22</td>
<td>01:15:00</td>
<td>2016-10-22 01:15:00.000</td>
</tr>
</tbody>
</table>
<p><strong>Python code:</strong></p><p>In Python, the easiest way to do this is to add a new column to the dataframe. Notice that in Python I would have to concatenate the two columns along with a defined space, then convert that column to datetime.</p><pre><code class="language-python">energy_stage_df['time'] = pd.to_datetime(energy_stage_df['date'] + ' ' + energy_stage_df['start_time'])
print(energy_stage_df[['date', 'start_time', 'time']])
</code></pre>
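<p>The PostgreSQL query works because the two columns are already typed, so adding them yields a timestamp directly. For comparison, here is a minimal Python sketch of the same idea using only the standard library. It assumes the values have already been parsed into <code>date</code> and <code>time</code> objects; the sample rows and the <code>combine_rows</code> helper are hypothetical and are not part of the original analysis.</p>

```python
from datetime import date, datetime, time

def combine_rows(rows):
    """Merge each (date, start_time) pair into one datetime value."""
    return [datetime.combine(d, t) for d, t in rows]

# Hypothetical sample rows shaped like the first staging rows.
sample_rows = [
    (date(2016, 10, 22), time(0, 0)),
    (date(2016, 10, 22), time(0, 15)),
    (date(2016, 10, 22), time(0, 30)),
]

timestamps = combine_rows(sample_rows)
for ts in timestamps:
    print(ts.isoformat(sep=" "))
```

<p>With typed values there is no string concatenation step, which is closer to what the SQL version does.</p>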
<h3 id="changing-column-data-types">Changing column data types</h3><p>Next, I want to change the data type of my cost column from text to a numeric type. Again, this is straightforward in PostgreSQL with the <a href="https://www.postgresql.org/docs/14/functions-formatting.html"><code>TO_NUMBER()</code></a> function. </p><p>The format of the function is as follows: <code>TO_NUMBER('text', 'format')</code>. The ‘format’ input is a PostgreSQL-specific template string that you build depending on what type of text you want to convert. In our case, we have a <code>$</code> symbol followed by a number in the form <code>0.00</code>. For the format string, I decided to use ‘L9G999D99’. The L lets PostgreSQL know there is a currency symbol at the beginning of the text, the 9s stand for digit positions, the G stands for a group (thousands) separator, and the D stands for a decimal point. </p><p>This template covers values up to ‘$9,999.99’, which is far more room than I need because the cost column has no values greater than 0.65. If you were planning to convert a column with larger numeric values, you would want to account for that by adding more digit positions and G separators. For example, say you have a cost column with text values like ‘$1,672,278.23’; then you would want to format the string like this: ‘L9G999G999D99’.</p><p><strong>PostgreSQL code:</strong></p><p><a name="tonumber"></a></p>
<pre><code class="language-sql">--create a new column called cost_new with the to_number() function
SELECT cost, TO_NUMBER("cost", 'L9G999D99') AS cost_new
FROM energy_usage_staging eus  
ORDER BY cost_new DESC
</code></pre>
<p>Results:</p><table>
<thead>
<tr>
<th>cost</th>
<th>cost_new</th>
</tr>
</thead>
<tbody>
<tr>
<td>$0.65</td>
<td>0.65</td>
</tr>
<tr>
<td>$0.65</td>
<td>0.65</td>
</tr>
<tr>
<td>$0.65</td>
<td>0.65</td>
</tr>
<tr>
<td>$0.57</td>
<td>0.57</td>
</tr>
<tr>
<td>$0.46</td>
<td>0.46</td>
</tr>
<tr>
<td>$0.46</td>
<td>0.46</td>
</tr>
<tr>
<td>$0.46</td>
<td>0.46</td>
</tr>
<tr>
<td>$0.46</td>
<td>0.46</td>
</tr>
</tbody>
</table>
<p><strong>Python code:</strong></p><p>For Python, I used a lambda function that systematically replaces all the ‘$’ signs with empty strings. This can be fairly inefficient.</p><pre><code class="language-python">energy_stage_df['cost_new'] = pd.to_numeric(energy_stage_df.cost.apply(lambda x: x.replace('$','')))
print(energy_stage_df[['cost', 'cost_new']])
</code></pre>
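<p>If you want to handle both the currency symbol and the thousands separators outside of pandas, the following is a minimal sketch of one way to do it. The <code>parse_cost</code> helper and its parameters are hypothetical, and using <code>Decimal</code> instead of a float is my own assumption (it avoids binary rounding of money values), not something the original analysis relies on.</p>

```python
from decimal import Decimal

def parse_cost(text, currency="$", group=","):
    """Strip the currency symbol and group separators, then parse the number."""
    cleaned = text.strip().replace(currency, "").replace(group, "")
    return Decimal(cleaned)

small = parse_cost("$0.65")
large = parse_cost("$1,672,278.23")
print(small, large)
```

<p>Unlike the <code>TO_NUMBER()</code> template, this helper does not check the shape of the input, so malformed text raises an error rather than being parsed against a fixed pattern.</p>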
<h3 id="creating-a-view">Creating a <code>VIEW</code></h3><p>Now that I know how to convert my columns, I can combine the two queries and create a <code>VIEW</code> of my new restructured table. A <a href="https://www.postgresql.org/docs/14/sql-createview.html"><code>VIEW</code></a> is a PostgreSQL object that allows you to define a query and call it by the <code>VIEW</code>’s name as if it were a table within your database. I can use the following query to generate the data I want and then create a <code>VIEW</code> that I can query as if it were a table.</p><p><strong>PostgreSQL code:</strong></p><pre><code class="language-sql">-- select the restructured columns that I want
SELECT type, 
(date + start_time) AS time, 
"usage", 
units, 
TO_NUMBER("cost", 'L9G999D99') AS cost, 
notes 
FROM energy_usage_staging
</code></pre>
<p>Results:</p><table>
<thead>
<tr>
<th>type</th>
<th>time</th>
<th>usage</th>
<th>units</th>
<th>cost</th>
<th>notes</th>
</tr>
</thead>
<tbody>
<tr>
<td>Electric usage</td>
<td>2016-10-22 00:00:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
<td></td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22 00:15:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
<td></td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22 00:30:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
<td></td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22 00:45:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
<td></td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22 01:00:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
<td></td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22 01:15:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
<td></td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22 01:30:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
<td></td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22 01:45:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
<td></td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22 02:00:00.000</td>
<td>0.02</td>
<td>kWh</td>
<td>0.00</td>
<td></td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22 02:15:00.000</td>
<td>0.02</td>
<td>kWh</td>
<td>0.00</td>
<td></td>
</tr>
</tbody>
</table>
<p>I decided to call my <code>VIEW</code> <code>energy_view</code>. Now, when I want to do further cleaning, I can just specify its name in the <code>FROM</code> statement.</p><p><a name="view"></a></p>
<pre><code class="language-sql">--create view from the query above
CREATE VIEW energy_view AS
SELECT type, 
(date + start_time) AS time, 
"usage", 
units, 
TO_NUMBER("cost", 'L9G999D99') AS cost, 
notes 
FROM energy_usage_staging
</code></pre>
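<p><strong>Python code:</strong></p><pre><code class="language-python"># copy the selected columns so later assignments change energy_df, not a slice of energy_stage_df
energy_df = energy_stage_df[['type','time','usage','units','cost_new','notes']].copy()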
energy_df.rename(columns={'cost_new':'cost'}, inplace = True)
print(energy_df.head(20))
</code></pre>
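<p>To make the behavior of a view concrete, here is a small sketch that uses Python's built-in <code>sqlite3</code> module as a stand-in for PostgreSQL. That substitution is an assumption made so the example runs without a database server, and the table and view names are hypothetical. It illustrates that a view stores a query rather than rows, so it picks up rows that are added to the staging table afterward.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (cost TEXT)")
conn.execute("INSERT INTO staging VALUES ('$0.65')")

# The view only stores this query; no cleaned rows are saved anywhere.
conn.execute(
    "CREATE VIEW cleaned_view AS "
    "SELECT CAST(REPLACE(cost, '$', '') AS REAL) AS cost FROM staging"
)

rows_before = conn.execute("SELECT cost FROM cleaned_view").fetchall()

# A row added to the staging table later shows up the next time the view is queried.
conn.execute("INSERT INTO staging VALUES ('$0.46')")
rows_after = conn.execute("SELECT cost FROM cleaned_view").fetchall()

print(len(rows_before), len(rows_after))
conn.close()
```

<p>The cleaning expression runs again on every query against the view, which is the cost you trade for not storing a second copy of the data.</p>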
<p>It is important to note that a PostgreSQL <code>VIEW</code> does not store data: its underlying query is re-run every time you query it. This is why we want to insert our <code>VIEW</code> data into a hypertable once we have the data set up just right. You can think of a <code>VIEW</code> as a named, reusable version of the <a href="https://timescale.ghost.io/blog/blog/how-to-evaluate-your-data-directly-within-the-database-and-make-your-analysis-more-efficient/#cte">CTE <code>WITH</code> <code>AS</code></a> statement I discussed in my last post.</p><p>We are now one step closer to cleaner data!</p><h3 id="creating-or-generating-relevant-data">Creating or generating relevant data</h3><p>With some quick investigation, we can see that the notes column is blank for this data set. To check this, I just need to include a <code>WHERE</code> clause that keeps only the rows where <code>notes</code> is not equal to an empty string. </p><p><strong>PostgreSQL data cleaning code (detecting blank notes):</strong></p><p><a name="where"></a></p>
<pre><code class="language-sql">SELECT * 
FROM energy_view ew
-- where notes are not equal to an empty string
WHERE notes!='';
</code></pre>
<p>The results come out empty.</p><p><strong>Python code:</strong></p><pre><code class="language-python">print(energy_df[energy_df['notes'].notnull()])
</code></pre>
<p>Since the notes are blank, I would like to replace the column with various sets of additional information that I could use later on during modeling. One thing I would like to add in particular, is a column that specifies the day of the week. To do this I can use the <code>EXTRACT()</code> command. The <a href="https://www.postgresql.org/docs/14/functions-datetime.html"><code>EXTRACT()</code></a> command is a PostgreSQL date/time function that allows you to extract various date/time elements. For our column, PostgreSQL has the specification DOW (day-of-week) which maps 0 to Sunday through to 6 for Saturday.</p><p><strong>PostgreSQL code:</strong></p><p><a name="extract"></a></p>
<pre><code class="language-sql">--extract day-of-week from the time column and cast the output to an int
SELECT *,
EXTRACT(DOW FROM time)::int AS day_of_week
FROM energy_view ew
</code></pre>
<p>Results:</p><table>
<thead>
<tr>
<th>type</th>
<th>time</th>
<th>usage</th>
<th>units</th>
<th>cost</th>
<th>notes</th>
<th>day_of_week</th>
</tr>
</thead>
<tbody>
<tr>
<td>Electric usage</td>
<td>2016-10-22 00:00:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
<td></td>
<td>6</td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22 00:15:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
<td></td>
<td>6</td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22 00:30:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
<td></td>
<td>6</td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22 00:45:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
<td></td>
<td>6</td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22 01:00:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
<td></td>
<td>6</td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22 01:15:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
<td></td>
<td>6</td>
</tr>
</tbody>
</table>
<p><strong>Python code:</strong></p><pre><code class="language-python">energy_df['day_of_week'] = energy_df['time'].dt.dayofweek
</code></pre>
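<p>One thing to watch: the Python line above does not produce the same numbers as <code>EXTRACT(DOW ...)</code>, because Python counts from Monday = 0 while PostgreSQL's DOW counts from Sunday = 0. If you need the two to agree, a small conversion is enough. This is a minimal sketch using the standard library; the <code>postgres_dow</code> helper name is hypothetical.</p>

```python
from datetime import date

def postgres_dow(d):
    """Convert Python's weekday() (Monday=0 .. Sunday=6) to DOW numbering (Sunday=0 .. Saturday=6)."""
    return (d.weekday() + 1) % 7

saturday = date(2016, 10, 22)  # the first day in the data set
sunday = date(2016, 10, 23)
print(postgres_dow(saturday), postgres_dow(sunday))
```

<p>Shifting by one and wrapping with the modulo moves Sunday from the end of the week to the start while leaving the order of the other days unchanged.</p>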
<p>Additionally, we may want to add another column that specifies if a day occurs over a weekend or weekday. To do this, I create a boolean column, where <code>true</code> represents a weekend, and <code>false</code> represents a weekday. I then apply a <a href="https://www.postgresql.org/docs/14/plpgsql-control-structures.html"><code>CASE</code></a> statement. With this command, I can specify “when-then” statements (similar to “if-then” statements in coding) where I can say <code>WHEN</code> a <code>day_of_week</code> value is <code>IN</code> the set (0,6) <code>THEN</code> the output should be <code>true</code>, <code>ELSE</code> the value should be <code>false</code>.</p><p><strong>PostgreSQL code:</strong></p><p><a name="case"></a></p>
<pre><code class="language-sql">SELECT type, time, usage, units, cost,
EXTRACT(DOW FROM time)::int AS day_of_week, 
--use the case statement to make a column true when records fall on a weekend aka 0 and 6
CASE WHEN (EXTRACT(DOW FROM time)::int) IN (0,6) then true
	ELSE false
END AS is_weekend
FROM energy_view ew
</code></pre>
<p>Results:</p><table>
<thead>
<tr>
<th>type</th>
<th>time</th>
<th>usage</th>
<th>units</th>
<th>cost</th>
<th>day_of_week</th>
<th>is_weekend</th>
</tr>
</thead>
<tbody>
<tr>
<td>Electric usage</td>
<td>2016-10-22 00:00:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
<td>6</td>
<td>true</td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22 00:15:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
<td>6</td>
<td>true</td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22 00:30:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
<td>6</td>
<td>true</td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22 00:45:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
<td>6</td>
<td>true</td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22 01:00:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
<td>6</td>
<td>true</td>
</tr>
</tbody>
</table>
<p>Fun fact: you can do the same query without a <code>CASE</code> statement, however it only works for binary columns.</p><pre><code class="language-sql">--another method to create a binary column
SELECT type, time, usage, units, cost,
EXTRACT(DOW FROM time)::int AS day_of_week, 
EXTRACT(DOW FROM time)::int IN (0,6) AS is_weekend
FROM energy_view ew
</code></pre>
<p><strong>Python code:</strong></p><p>Notice that in Python, the weekends are represented by numbers 5 and 6 vs. the PostgreSQL weekend values 0 and 6.</p><pre><code class="language-python">energy_df['is_weekend'] = np.where(energy_df['day_of_week'].isin([5,6]), 1, 0)
print(energy_df.head(20))
</code></pre>
<p>And maybe things then start getting really crazy. Maybe you want to add more parameters! </p><p>Let’s consider holidays. Now, you may be asking, “Why in the world would we do that?!” but often, people have time off during some of the holidays within the US. Since this individual lives within the US, they likely have at least <em>some </em>of the holidays off. Where there are days off, there could be a difference in energy usage. To help guide my analysis, I want to include the identification of holidays. To do this, I create another boolean column identifying when a federal holiday occurs. </p><p>I can accomplish this using TimescaleDB’s <code>time_bucket()</code> function. The <a href="https://docs.timescale.com/api/latest/hyperfunctions/time_bucket/"><code>time_bucket()</code></a> function is one of the functions I discussed in detail in my <a href="https://timescale.ghost.io/blog/blog/how-to-evaluate-your-data-directly-within-the-database-and-make-your-analysis-more-efficient/#timebucket">previous post</a>. Essentially, I need to use this function to make sure all time values within a single day get accounted for. Without using the <code>time_bucket()</code> function, I would only see changes to the row associated with the 12 a.m. time period. </p><p><strong>PostgreSQL code:</strong></p><p>After I create a holiday table, I can then use the data from it within my query. I also decided to use the non-case syntax for this query. Note that you can use either!</p><p><a name="timebucket"></a></p>
<pre><code class="language-sql">--create table for the holidays
CREATE TABLE holidays (
date date);

--insert the holidays into table
INSERT INTO holidays 
VALUES ('2016-11-11'), 
('2016-11-24'), 
('2016-12-24'), 
('2016-12-25'), 
('2016-12-26'), 
('2017-01-01'),  
('2017-01-02'), 
('2017-01-16'), 
('2017-02-20'), 
('2017-05-29'), 
('2017-07-04'), 
('2017-09-04'), 
('2017-10-09'), 
('2017-11-10'), 
('2017-11-23'), 
('2017-11-24'), 
('2017-12-24'), 
('2017-12-25'), 
('2018-01-01'), 
('2018-01-15'), 
('2018-02-19'), 
('2018-05-28'), 
('2018-07-04'), 
('2018-09-03'), 
('2018-10-08');

SELECT type, time, usage, units, cost,
EXTRACT(DOW FROM time)::int AS day_of_week, 
EXTRACT(DOW FROM time)::int IN (0,6) AS is_weekend,
-- I can then select the data from the holidays table directly within my IN statement
time_bucket('1 day', time) IN (SELECT date FROM holidays) AS is_holiday
FROM energy_view ew
</code></pre>
<p>Results:</p><table>
<thead>
<tr>
<th>type</th>
<th>time</th>
<th>usage</th>
<th>units</th>
<th>cost</th>
<th>day_of_week</th>
<th>is_weekend</th>
<th>is_holiday</th>
</tr>
</thead>
<tbody>
<tr>
<td>Electric usage</td>
<td>2016-10-22 00:00:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
<td>6</td>
<td>true</td>
<td>false</td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22 00:15:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
<td>6</td>
<td>true</td>
<td>false</td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22 00:30:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
<td>6</td>
<td>true</td>
<td>false</td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22 00:45:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
<td>6</td>
<td>true</td>
<td>false</td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22 01:00:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
<td>6</td>
<td>true</td>
<td>false</td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22 01:15:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
<td>6</td>
<td>true</td>
<td>false</td>
</tr>
</tbody>
</table>
<p><strong>Python code:</strong></p><pre><code class="language-python">holidays = ['2016-11-11', '2016-11-24', '2016-12-24', '2016-12-25', '2016-12-26', '2017-01-01',  '2017-01-02', '2017-01-16', '2017-02-20', '2017-05-29', '2017-07-04', '2017-09-04', '2017-10-09', '2017-11-10', '2017-11-23', '2017-11-24', '2017-12-24', '2017-12-25', '2018-01-01', '2018-01-15', '2018-02-19', '2018-05-28', '2018-07-04', '2018-09-03', '2018-10-08']
# compare the calendar day of each timestamp (not the day-of-week number) against the holiday dates
energy_df['is_holiday'] = np.where(energy_df['time'].dt.strftime('%Y-%m-%d').isin(holidays), 1, 0)
print(energy_df.head(20))
</code></pre>
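<p>The key step in the holiday check is truncating each timestamp to its calendar day before the lookup, which is the role <code>time_bucket('1 day', time)</code> plays in the SQL query. Here is a minimal standard-library sketch of that logic. It uses only a few of the holiday dates from the list above, and the helper names and the second sample reading are hypothetical.</p>

```python
from datetime import date, datetime

# A few of the holiday dates from the list above; extend as needed.
holidays = {date(2016, 11, 24), date(2016, 12, 24), date(2016, 12, 25)}

def is_holiday(ts):
    """Truncate the timestamp to its calendar day before the lookup."""
    return ts.date() in holidays

def is_weekend(ts):
    """Python's weekday() puts Saturday at 5 and Sunday at 6."""
    return ts.weekday() in (5, 6)

reading = datetime(2016, 12, 24, 0, 15)
ordinary = datetime(2016, 10, 25, 9, 30)
print(is_holiday(reading), is_weekend(reading), is_holiday(ordinary))
```

<p>Without the truncation, only a reading taken at exactly midnight would match a holiday date.</p>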
<p>At this point, I’m going to save this expanded table into another <code>VIEW</code> so that I can call the data without writing out the query.</p><p><strong>PostgreSQL code:</strong></p><pre><code class="language-sql">--create another view with the data from our first round of cleaning
CREATE VIEW energy_view_exp AS
SELECT type, time, usage, units, cost,
EXTRACT(DOW FROM time)::int AS day_of_week, 
EXTRACT(DOW FROM time)::int IN (0,6) AS is_weekend,
time_bucket('1 day', time) IN (select date from holidays) AS is_holiday
FROM energy_view ew
</code></pre>
<p>You may ask, “Why did you create these as boolean columns?!” A very fair question! You see, I may want to use these columns for filtering during analysis, something I commonly do during my own analysis process. In PostgreSQL, you can use boolean columns to filter things super easily. For example, say that I want to use my table query so far and show only the data that occurs over the weekend <code>AND</code> a holiday. I can do this simply by adding a <code>WHERE</code> statement along with the specified columns.</p><p><strong>PostgreSQL code:</strong></p><pre><code class="language-sql">--if you use binary columns, then you can filter with a simple WHERE statement
SELECT *
FROM energy_view_exp
WHERE is_weekend AND is_holiday
</code></pre>
<p>Results:</p><table>
<thead>
<tr>
<th>type</th>
<th>time</th>
<th>usage</th>
<th>units</th>
<th>cost</th>
<th>day_of_week</th>
<th>is_weekend</th>
<th>is_holiday</th>
</tr>
</thead>
<tbody>
<tr>
<td>Electric usage</td>
<td>2016-12-24 00:00:00.000</td>
<td>0.34</td>
<td>kWh</td>
<td>0.06</td>
<td>6</td>
<td>true</td>
<td>true</td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-12-24 00:15:00.000</td>
<td>0.34</td>
<td>kWh</td>
<td>0.06</td>
<td>6</td>
<td>true</td>
<td>true</td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-12-24 00:30:00.000</td>
<td>0.34</td>
<td>kWh</td>
<td>0.06</td>
<td>6</td>
<td>true</td>
<td>true</td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-12-24 00:45:00.000</td>
<td>0.34</td>
<td>kWh</td>
<td>0.06</td>
<td>6</td>
<td>true</td>
<td>true</td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-12-24 01:00:00.000</td>
<td>0.34</td>
<td>kWh</td>
<td>0.06</td>
<td>6</td>
<td>true</td>
<td>true</td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-12-24 01:15:00.000</td>
<td>0.34</td>
<td>kWh</td>
<td>0.06</td>
<td>6</td>
<td>true</td>
<td>true</td>
</tr>
</tbody>
</table>
<p><strong>Python code:</strong></p><pre><code class="language-python">print(energy_df[(energy_df['is_weekend']==1) &amp; (energy_df['is_holiday']==1)].head(10))
</code></pre>
<h3 id="adding-data-to-a-hypertable">Adding data to a hypertable</h3><p>Now that I have new columns ready to go and know how I would like my table to be structured, I can create a new hypertable and insert my cleaned data. In my own analysis with this data set, I might have run the cleanup to this point BEFORE evaluating my data so that the evaluation step is more meaningful. What’s great is that you can use any of these techniques for general cleaning, whether that is before or after evaluation.</p><p><strong>PostgreSQL:</strong></p><p><a name="create"></a><br>
<a name="createhyper"></a><br>
<a name="insert"></a></p>
<pre><code class="language-sql">CREATE TABLE energy_usage (
type text,
time timestamptz,
usage float4,
units text,
cost float4,
day_of_week int,
is_weekend bool,
is_holiday bool
);

--command to create a hypertable
SELECT create_hypertable('energy_usage', 'time');

INSERT INTO energy_usage 
SELECT *
FROM energy_view_exp;
</code></pre>
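<p>As a reminder of why the <code>time</code> column matters here, the following is a purely conceptual sketch of grouping rows into fixed-width time ranges. It is not how TimescaleDB implements chunks; the one-day interval, the <code>chunk_start</code> helper, and the third sample reading are assumptions chosen only to illustrate the idea of every row belonging to exactly one time range.</p>

```python
from collections import defaultdict
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def chunk_start(ts, interval):
    """Return the start of the fixed-width time range that contains ts."""
    steps = (ts - EPOCH) // interval
    return EPOCH + steps * interval

readings = [
    (datetime(2016, 10, 22, 0, 0), 0.01),
    (datetime(2016, 10, 22, 0, 15), 0.01),
    (datetime(2016, 10, 23, 8, 0), 0.02),  # hypothetical extra reading
]

chunks = defaultdict(list)
for ts, usage in readings:
    chunks[chunk_start(ts, timedelta(days=1))].append((ts, usage))

for start, rows in sorted(chunks.items()):
    print(start.date(), len(rows))
```

<p>A query restricted to one day only has to look at the rows grouped under that day, which is the intuition behind faster time-based queries.</p>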
<p>Note that if you had data continually coming in you could create a script within your database that automatically makes these changes when importing your data. That way you can have cleaned data ready to go in your database rather than processing and cleaning the data in your scripts every time you want to perform analysis. </p><p>I'll discuss this in detail in my next post. Make sure to stay tuned in if you want to know how to create scripts and keep data automatically updated!</p><h3 id="renaming-values">Renaming values</h3><p>Another valuable technique for cleaning data is being able to rename various items or remap categorical values. The importance of this skill is amplified by the <a href="https://stackoverflow.com/questions/40427943/how-do-i-change-a-single-index-value-in-pandas-dataframe">popularity of this Python data analysis question on StackOverflow</a>. The question states “How do I change a single index value in a pandas dataframe?”. Since PostgreSQL and TimescaleDB use relational table structures, renaming unique values can be fairly simple using PostgreSQL data cleaning. </p><p>When renaming specific index values within a table, you can do this “on the fly” by using PostgreSQL’s <code>CASE</code> statement within the <code>SELECT</code> query. Let’s say I don’t like Sunday being represented by a 0 in the <code>day_of_week</code> column, but would prefer it to be a 7. I can do this with the following query.</p><p><strong>PostgreSQL code:</strong></p><pre><code class="language-sql">SELECT type, time, usage, cost, is_weekend,
-- you can use case to recode column values 
CASE WHEN day_of_week = 0 THEN 7
ELSE day_of_week 
END AS day_of_week
FROM energy_usage
</code></pre>
<p><strong>Python code:</strong></p><p>One caveat: this code would make Monday = 7 because pandas’ <code>dayofweek</code> sets Monday to 0 and Sunday to 6. But this is how you would update one value within a column. You would likely not want to do this exact action; I just wanted to show the Python equivalent for reference.</p><pre><code class="language-python">energy_df.loc[energy_df['day_of_week']==0, 'day_of_week'] = 7
print(energy_df.head(250))
</code></pre>
<p>Now, let’s say that I wanted to use the names of the days of the week instead of showing numeric values. For this example, I want to ditch the <code>CASE</code> statement and create a mapping table. When you need to change various values, it will likely be more efficient to create a mapping table and then join it to this table using the <a href="https://www.postgresql.org/docs/14/queries-table-expressions.html"><code>JOIN</code></a> command.</p><p><strong>PostgreSQL:</strong></p><p><a name="join"></a></p>
<pre><code class="language-sql">--first I need to create the table
CREATE TABLE day_of_week_mapping (
day_of_week_int int,
day_of_week_name text
);

--then I want to add data to my table
INSERT INTO day_of_week_mapping
VALUES (0, 'Sunday'),
(1, 'Monday'),
(2, 'Tuesday'),
(3, 'Wednesday'),
(4, 'Thursday'),
(5, 'Friday'),
(6, 'Saturday');

--then I can join this table to my cleaning table to remap the days of the week
SELECT type, time, usage, units, cost, dowm.day_of_week_name, is_weekend
FROM energy_usage eu
LEFT JOIN day_of_week_mapping dowm ON dowm.day_of_week_int = eu.day_of_week
</code></pre>
<p>Results:</p><table>
<thead>
<tr>
<th>type</th>
<th>time</th>
<th>usage</th>
<th>units</th>
<th>cost</th>
<th>day_of_week_name</th>
<th>is_weekend</th>
</tr>
</thead>
<tbody>
<tr>
<td>Electric usage</td>
<td>2018-07-22 00:45:00.000</td>
<td>0.1</td>
<td>kWh</td>
<td>0.03</td>
<td>Sunday</td>
<td>true</td>
</tr>
<tr>
<td>Electric usage</td>
<td>2018-07-22 00:30:00.000</td>
<td>0.1</td>
<td>kWh</td>
<td>0.03</td>
<td>Sunday</td>
<td>true</td>
</tr>
<tr>
<td>Electric usage</td>
<td>2018-07-22 00:15:00.000</td>
<td>0.1</td>
<td>kWh</td>
<td>0.03</td>
<td>Sunday</td>
<td>true</td>
</tr>
<tr>
<td>Electric usage</td>
<td>2018-07-22 00:00:00.000</td>
<td>0.1</td>
<td>kWh</td>
<td>0.03</td>
<td>Sunday</td>
<td>true</td>
</tr>
<tr>
<td>Electric usage</td>
<td>2018-02-11 23:00:00.000</td>
<td>0.04</td>
<td>kWh</td>
<td>0.01</td>
<td>Sunday</td>
<td>true</td>
</tr>
</tbody>
</table>
<p><strong>Python:</strong></p><p>In this case, Python has similar mapping functions. Note that the keys follow pandas’ <code>dayofweek</code> numbering, where Monday is 0 and Sunday is 6.</p><pre><code class="language-python">energy_df['day_of_week_name'] = energy_df['day_of_week'].map({0 : 'Monday', 1 : 'Tuesday', 2: 'Wednesday', 3: 'Thursday', 4: 'Friday', 5: 'Saturday', 6: 'Sunday'})
print(energy_df.head(20))
</code></pre>
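<p>The <code>LEFT JOIN</code> in the SQL version has one property worth spelling out: rows with no match in the mapping table are kept, with a NULL name. Here is a minimal sketch of the same behavior with a plain dictionary lookup. The sample rows are hypothetical, and the keys use the PostgreSQL DOW numbering from the mapping table.</p>

```python
# Same contents as the day_of_week_mapping table (Sunday = 0).
day_of_week_mapping = {
    0: "Sunday", 1: "Monday", 2: "Tuesday", 3: "Wednesday",
    4: "Thursday", 5: "Friday", 6: "Saturday",
}

# Hypothetical rows; 9 is deliberately not a valid day number.
rows = [
    {"usage": 0.1, "day_of_week": 0},
    {"usage": 0.04, "day_of_week": 9},
]

# dict.get returns None for a missing key, like NULL from an unmatched LEFT JOIN.
joined = [
    {**row, "day_of_week_name": day_of_week_mapping.get(row["day_of_week"])}
    for row in rows
]

for row in joined:
    print(row["day_of_week"], row["day_of_week_name"])
```

<p>Keeping unmatched rows makes bad codes visible instead of silently dropping them, which is usually what you want while cleaning.</p>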
<p>Hopefully, one of these techniques will be useful for you as you approach data renaming!</p><p>Additionally, remember that if you would like to change the name of a column in your table, it is truly as easy as <code>AS</code> (I couldn’t not use such a ridiculous statement 😂). When you use the <code>SELECT</code> statement, you can rename your columns like so,</p><p><strong>PostgreSQL code:</strong></p><p><a name="as"></a></p>
<pre><code class="language-sql">SELECT type AS usage_type,
time as time_stamp,
usage,
units, 
cost AS dollar_amount
FROM energy_view_exp
LIMIT 20;
</code></pre>
<p>Results:</p><table>
<thead>
<tr>
<th>usage_type</th>
<th>time_stamp</th>
<th>usage</th>
<th>units</th>
<th>dollar_amount</th>
</tr>
</thead>
<tbody>
<tr>
<td>Electric usage</td>
<td>2016-10-22 00:00:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22 00:15:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22 00:30:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22 00:45:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
</tr>
</tbody>
</table>
<p><strong>Python code:</strong></p><p>Comparatively, renaming columns in Python can be a huge pain. This is an area where SQL is not only faster, but also just more elegant in its code.</p><pre><code class="language-python">energy_df.rename(columns={'type':'usage_type', 'time':'time_stamp', 'cost':'dollar_amount'}, inplace=True)
print(energy_df[['usage_type','time_stamp','usage','units','dollar_amount']].head(20))
</code></pre>
<h3 id="fill-in-missing-data">Fill in missing data</h3><p>Another common problem in the PostgreSQL data cleaning process is having missing data. For the dataset we are using, there are no obviously missing data points. However, it's very possible that with evaluation, we could find missing hourly data from a power outage or some other phenomenon. </p><p>This is where the gap-filling functions TimescaleDB offers could come in handy. When using algorithms, missing data can often have significant negative impacts on the accuracy or dependability of the model. Sometimes, you can navigate this problem by filling in missing data with reasonable estimates and TimescaleDB actually has built-in functions to help you do this. </p><p>For example, let’s say that you are modeling energy usage over individual days of the week and a handful of days have missing energy data due to a power outage or an issue with the sensor. We could remove the data or try to fill in the missing values with reasonable estimations. For today, let’s assume that the model I want to use would benefit more from filling in the missing values. </p><p>As an example, I created some data. I called this table <code>energy_data</code> and it is missing both <code>time</code> and <code>energy</code> readings for the timestamps between 7:45 a.m. and 11:30 a.m.</p><table>
<thead>
<tr>
<th>time</th>
<th>energy</th>
</tr>
</thead>
<tbody>
<tr>
<td>2021-01-01 07:00:00.000</td>
<td>0</td>
</tr>
<tr>
<td>2021-01-01 07:15:00.000</td>
<td>0.1</td>
</tr>
<tr>
<td>2021-01-01 07:30:00.000</td>
<td>0.1</td>
</tr>
<tr>
<td>2021-01-01 07:45:00.000</td>
<td>0.2</td>
</tr>
<tr>
<td>2021-01-01 11:30:00.000</td>
<td>0.04</td>
</tr>
<tr>
<td>2021-01-01 11:45:00.000</td>
<td>0.04</td>
</tr>
<tr>
<td>2021-01-01 12:00:00.000</td>
<td>0.03</td>
</tr>
<tr>
<td>2021-01-01 12:15:00.000</td>
<td>0.02</td>
</tr>
<tr>
<td>2021-01-01 12:30:00.000</td>
<td>0.03</td>
</tr>
<tr>
<td>2021-01-01 12:45:00.000</td>
<td>0.02</td>
</tr>
<tr>
<td>2021-01-01 13:00:00.000</td>
<td>0.03</td>
</tr>
</tbody>
</table>
<p>I can use TimescaleDB’s <a href="https://docs.timescale.com/api/latest/hyperfunctions/gapfilling-interpolation/">gapfilling hyperfunctions</a> to fill in these missing values. The <a href="https://docs.timescale.com/api/latest/hyperfunctions/gapfilling-interpolation/interpolate/"><code>interpolate()</code></a> function is another one of TimescaleDB’s hyperfunctions. It creates data points that follow a linear approximation given the data points before and after the missing range of data. Alternatively, you could use the <a href="https://docs.timescale.com/api/latest/hyperfunctions/gapfilling-interpolation/locf/"><code>locf()</code></a> hyperfunction, which carries the last recorded value forward to fill in the gap (note that locf stands for last observation carried forward). Both of these functions must be used in conjunction with the <a href="https://docs.timescale.com/api/latest/hyperfunctions/gapfilling-interpolation/time_bucket_gapfill/"><code>time_bucket_gapfill()</code></a> function. </p><p><strong>PostgreSQL code:</strong></p><p><a name="gapfill"></a></p>
<pre><code class="language-sql">SELECT
--here I specified that the data should increment by 15 mins
  time_bucket_gapfill('15 min', time) AS timestamp,
  interpolate(avg(energy)),
  locf(avg(energy))
FROM energy_data
--to use gapfill, you will have to take out any time data associated with null values. You can do this using the IS NOT NULL statement
WHERE energy IS NOT NULL AND time &gt; '2021-01-01 07:00:00.000' AND time &lt; '2021-01-01 13:00:00.000'
GROUP BY timestamp
ORDER BY timestamp;
</code></pre>
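<p>Results (the rows below are spaced 30 minutes apart, which corresponds to running the query with a <code>'30 min'</code> bucket):</p><table>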
<thead>
<tr>
<th>timestamp</th>
<th>interpolate</th>
<th>locf</th>
</tr>
</thead>
<tbody>
<tr>
<td>2021-01-01 07:00:00.000</td>
<td>0.1</td>
<td>0.10000000000000000000</td>
</tr>
<tr>
<td>2021-01-01 07:30:00.000</td>
<td>0.15</td>
<td>0.15000000000000000000</td>
</tr>
<tr>
<td>2021-01-01 08:00:00.000</td>
<td>0.13625</td>
<td>0.15000000000000000000</td>
</tr>
<tr>
<td>2021-01-01 08:30:00.000</td>
<td>0.1225</td>
<td>0.15000000000000000000</td>
</tr>
<tr>
<td>2021-01-01 09:00:00.000</td>
<td>0.10875</td>
<td>0.15000000000000000000</td>
</tr>
<tr>
<td>2021-01-01 09:30:00.000</td>
<td>0.095</td>
<td>0.15000000000000000000</td>
</tr>
<tr>
<td>2021-01-01 10:00:00.000</td>
<td>0.08125</td>
<td>0.15000000000000000000</td>
</tr>
<tr>
<td>2021-01-01 10:30:00.000</td>
<td>0.0675</td>
<td>0.15000000000000000000</td>
</tr>
<tr>
<td>2021-01-01 11:00:00.000</td>
<td>0.05375</td>
<td>0.15000000000000000000</td>
</tr>
<tr>
<td>2021-01-01 11:30:00.000</td>
<td>0.04</td>
<td>0.04000000000000000000</td>
</tr>
<tr>
<td>2021-01-01 12:00:00.000</td>
<td>0.025</td>
<td>0.02500000000000000000</td>
</tr>
<tr>
<td>2021-01-01 12:30:00.000</td>
<td>0.025</td>
<td>0.02500000000000000000</td>
</tr>
</tbody>
</table>
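<p>As a rough check on the <code>interpolate</code> and <code>locf</code> columns above, here is a minimal pure-Python sketch of the arithmetic. It is not TimescaleDB's implementation: <code>linear_fill</code> and <code>locf_fill</code> are hypothetical helper names made up for this illustration. The inputs are taken from the results table: 0.15 is the 07:30 bucket, 0.04 is the 11:30 bucket, and eight 30-minute steps separate them.</p>

```python
def linear_fill(left, right, steps):
    """Values for the empty buckets strictly between two known buckets.

    `steps` is the number of bucket widths separating `left` from `right`.
    """
    increment = (right - left) / steps
    return [left + increment * i for i in range(1, steps)]


def locf_fill(left, steps):
    """Carry the last observed value forward into every empty bucket."""
    return [left] * (steps - 1)


# 07:30 bucket = 0.15, 11:30 bucket = 0.04 (values from the results table),
# separated by eight 30-minute steps, which leaves seven empty buckets.
interpolated = [round(v, 5) for v in linear_fill(0.15, 0.04, 8)]
carried = locf_fill(0.15, 8)

print(interpolated)
print(carried)
```

<p>The seven interpolated values should line up with the 08:00 through 11:00 rows of the results table, each one dropping by 0.01375 from the previous bucket, while the carried-forward list simply repeats 0.15.</p>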
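<p><strong>Python code:</strong></p><pre><code class="language-python">import pandas as pd

# assumes energy_test_df already holds the 'time' and 'energy' columns shown above
energy_test_df['time'] = pd.to_datetime(energy_test_df['time'])
# last observation carried forward: resample to 15-minute steps, then forward-fill
energy_test_df_locf = energy_test_df.set_index('time').resample('15min').ffill().reset_index()
# linear interpolation across the same 15-minute grid
energy_test_df = energy_test_df.set_index('time').resample('15min').interpolate().reset_index()
energy_test_df['locf'] = energy_test_df_locf['energy']
print(energy_test_df)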
</code></pre>
<p><strong>Bonus:</strong></p><p>The following query shows how I could ignore the missing data instead. I wanted to include this to show you just how easy it can be to exclude null data. Alternatively, I could use a <code>WHERE</code> clause to skip the time range I'd like to ignore (the second query).</p><pre><code class="language-sql">SELECT * 
FROM energy_data 
WHERE energy IS NOT NULL;

SELECT * 
FROM energy_data
WHERE time &lt;= '2021-01-01 07:45:00.000' OR time &gt;= '2021-01-01 11:30:00.000';
</code></pre>
<h2 id="postgresql-data-cleaning-wrap-up">PostgreSQL Data Cleaning Wrap-Up</h2><p>After reading through these various techniques, I hope you feel more comfortable with exploring some of the possibilities that PostgreSQL data cleaning and TimescaleDB data cleaning provide. By cleaning data directly within my database, I am able to perform a lot of my cleaning tasks a single time rather than repetitively within a script, thus saving me time in the long run. If you're looking to save time and effort while cleaning your data for analysis, definitely consider using PostgreSQL and TimescaleDB. </p><p>In my next posts, I'll discuss techniques for transforming data using PostgreSQL and TimescaleDB. I'll then use everything we've learned together to benchmark data munging tasks in PostgreSQL and TimescaleDB vs. Python and pandas. The final blog post will walk you through the full process on a real dataset by conducting a deep-dive into data analysis with TimescaleDB (for data munging) and Python (for modeling and visualizations).</p><p>If you have questions about TimescaleDB, time-series data, or any of the functionality mentioned above, join our <a href="https://slack.timescale.com/">community Slack</a>, where you'll find an active community of time-series enthusiasts and various Timescale team members.</p><p>If you’re ready to see the power of TimescaleDB and PostgreSQL right away, you can <a href="https://console.cloud.timescale.com/signup" rel="noreferrer">sign up for a free 30-day trial</a> or install TimescaleDB and <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/install-timescaledb/self-hosted/">manage it on your current PostgreSQL instances</a>. We also have a bunch of <a href="https://docs.timescale.com/timescaledb/latest/tutorials/">great tutorials</a> to help get you started.</p><p>Until next time!</p><p><strong>Functionality Glossary:</strong></p>
<ul>
<li><a href="#add">Adding columns together</a></li>
<li><a href="#tonumber"><code>TO_NUMBER()</code></a></li>
<li><a href="#view"><code>VIEW</code></a></li>
<li><a href="#where"><code>WHERE</code></a></li>
<li><a href="#extract"><code>EXTRACT()</code></a></li>
<li><a href="#case"><code>CASE</code></a></li>
<li><a href="#timebucket"><code>time_bucket()</code></a></li>
<li><a href="#join"><code>JOIN</code></a></li>
<li><a href="#as"><code>AS</code></a></li>
<li><a href="#create"><code>CREATE TABLE</code></a></li>
<li><a href="#createhyper"><code>create_hypertable()</code></a></li>
<li><a href="#insert"><code>INSERT INTO</code></a></li>
<li><a href="#gapfill"><code>time_bucket_gapfill()</code></a></li>
</ul>
]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[What We’re Excited About PostgreSQL 17]]></title>
            <description><![CDATA[As we count the days until September’s release, here are the features we’re excited about in PostgreSQL 17.]]></description>
            <link>https://www.tigerdata.com/blog/what-were-excited-about-postgresql-17</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/what-were-excited-about-postgresql-17</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Aleksander Alekseev]]></dc:creator>
            <pubDate>Thu, 16 May 2024 12:59:30 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/05/What-we-re-excited-about-postgres-17--1-.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/05/What-we-re-excited-about-postgres-17--1-.png" alt="An elephant on a starry sky background with the number 17 next to it—what we're excited about PostgreSQL 17" /><p>The next major PostgreSQL release (PostgreSQL 17) is <a href="https://www.postgresql.org/developer/roadmap/"><u>scheduled for September</u></a>.</p><figure class="kg-card kg-image-card"><img src="https://media.tenor.com/R7dseIrp4N8AAAAC/party-the-office.gif" class="kg-image" alt="" loading="lazy" width="410" height="200"></figure><p>In 2023, PostgreSQL regained the attention it deserves as a rock-solid relational database. It was voted the <a href="https://survey.stackoverflow.co/2023/?ref=timescale.com#most-popular-technologies-database-prof"><u>most popular DB in the Stack Overflow Developer Survey</u></a> and named <a href="https://db-engines.com/en/blog_post/106"><u>database management system of the year by DB-Engines</u></a>. Here at Timescale, we also consolidated our status as fierce PostgreSQL fans: besides having built Timescale on PostgreSQL, we believe PostgreSQL is evolving as a platform and becoming the <a href="https://timescale.ghost.io/blog/postgres-for-everything/"><u>bedrock for the future of data</u></a>. So, excuse us for being a <em>bit </em>excited about PostgreSQL 17.</p><p>In its latest releases, we’ve watched PostgreSQL develop toward higher performance, scalability, security, and compatibility while introducing new features to meet the evolving needs of users and applications, especially enterprise ones. The improvements to privilege administration, logical replication, and monitoring are examples of that. 
More importantly, during this time, <a href="https://timescale.ghost.io/blog/how-and-why-to-become-a-postgresql-contributor/"><u>we contributed</u></a>, <a href="https://timescale.ghost.io/blog/what-does-a-postgresql-commitfest-manager-do-and-should-you-become-one/"><u>managed commitfests</u></a>, and created new features and products to expand it—from <a href="https://timescale.ghost.io/blog/how-we-made-real-time-data-aggregation-in-postgres-faster-by-50-000/"><u>boosting real-time aggregation by 50,000&nbsp;%</u></a> to <a href="https://docs.timescale.com/ai/latest/"><u>powering production AI applications</u></a>.</p><p>In this blog post, we gathered Timescale contributors and enthusiasts to discuss a few of the most exciting PostgreSQL 17 commits. As we count the days until September, we’ll also examine PostgreSQL’s direction for this release. Finally, we’ll share some of our commits, as we help build up PostgreSQL as a <a href="https://timescale.ghost.io/blog/postgres-for-everything/"><u>versatile development platform for everything</u></a>.&nbsp;</p><h2 id="postgresql-17-where-it-came-from-and-where-it%E2%80%99s-headed">PostgreSQL 17: Where It Came From and Where It’s Headed</h2><p>Looking at the several PostgreSQL 17 commits, <a href="https://www.linkedin.com/in/afiskon/?ref=timescale.com"><u>Aleksander Alekseev</u></a>, long-time PostgreSQL contributor and Timescaler, says significant changes to modernize PostgreSQL are underway. 
“I believe the future of Postgres is bright,” he notes, adding that “new people are <a href="https://www.postgresql.org/message-id/ccbc2cfa-7711-4a52-bd8e-8746e28550a2%40joeconway.com"><u>joining the project</u></a>.” Perhaps influenced by the new wave of contributors, the changes to PostgreSQL 17 reflect the project’s commitment to embracing modern methodologies and adapting to the ever-evolving tech landscape.</p><p>One such notable change in version 17, says Aleksander, is the decision to <a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=0b16bb8776bb834eb1ef8204ca95dd7667ab948b"><u>drop support for AIX</u></a>, an operating system developed by IBM. AIX, while historically significant, has seen declining usage in recent years, prompting PostgreSQL to reallocate resources towards supporting more widely adopted platforms. This strategic move enables PostgreSQL to focus on enhancing compatibility with modern operating systems.&nbsp;</p><p>While they may seem more focused today, the PostgreSQL community's efforts to make PostgreSQL a solid database for modern data needs were already visible in previous versions, including the current one, PostgreSQL 16. As a specific example, Aleksander mentions the transition from Autotools to the Meson build system. Autotools, a long-standing suite of tools for configuring, building, and installing software packages, has been a stalwart in the development process of PostgreSQL.&nbsp;</p><p>However, with the advent of <a href="https://mesonbuild.com/"><u>Meson</u></a>, a contemporary build system known for its simplicity, speed, and scalability, PostgreSQL managed to streamline its development workflows. 
Meson offers advantages such as improved performance, easier maintenance, and better cross-platform compatibility, which PostgreSQL currently extends to its users.</p><h2 id="what-we%E2%80%99re-excited-about-postgresql-17">What We’re Excited About PostgreSQL 17</h2><p>Now that we’ve seen where PostgreSQL 17 is headed, let’s discuss some of the commits that have caught our 👀.</p><h3 id="pgcreatesubscriber">pg_createsubscriber</h3><p>Suggested by Timescaler and PostgreSQL contributor <a href="https://br.linkedin.com/in/fabriziomello"><u>Fabrízio de Mello</u></a>, <a href="https://www.postgresql.org/docs/devel/app-pgcreatesubscriber.html"><u>pg_createsubscriber is a new PostgreSQL 17 tool</u></a> that allows users to create a new logical replica from a physical standby server. “The main advantage of this tool over a common logical replication setup is the initial data copy, which can take longer on large databases and have side effects, like autovacuum issues, due to the long-running transaction to copy data from one server to another. This tool will also help reduce the catchup phase,” explains Fabrízio.</p><h3 id="support-for-merge-partitions-and-split-partitions">Support for MERGE PARTITIONS and SPLIT PARTITIONS</h3><p>While <code>ALTER TABLE</code> is a well-known statement that changes the structure of a PostgreSQL table, PostgreSQL 17 comes along with two new commands: <code>MERGE PARTITIONS</code> and <code>SPLIT PARTITIONS</code>. As the name indicates, these new DDL commands merge or split several partitions. “The current implementation has certain limitations though,” says Aleksander. “It works as a single process and holds the <code>ACCESS EXCLUSIVE LOCK</code> on the parent table during all operations. 
This is why the new DDL commands are not advisable for large partitioned tables under a high load,” he adds.</p><h3 id="add-support-for-incremental-file-system-backup">Add support for incremental file system backup</h3><p>“This is another feature worth mentioning,” says Aleksander. Adding support for incremental file system backup in PostgreSQL enhances the database's ability to perform efficient and effective backups. Incremental backups only save changes made since the last backup (full or incremental). This significantly reduces the volume of data to be backed up compared to full backups, which capture the entire database. And since incremental backups involve less data, the backup process is faster, minimizing the impact on system performance and reducing downtime. </p><p>Developed by Robert Haas, Jakub Wartak, and Tomas Vondra, <a href="http://rhaas.blogspot.com/2024/05/hacking-on-postgresql-is-really-hard.html"><u>this commit has been struggling with stability issues</u></a>, as explained by Robert on his blog. “Hopefully it won’t be reverted (as many other commits this month),” comments Aleksander.</p><h3 id="enable-the-failover-of-logical-slots">Enable the failover of logical slots&nbsp;</h3><p>Picked by two Timescalers, Fabrízio and our head of Developer Advocacy, <a href="https://twitter.com/jamessewell"><u>James Blackwood-Sewell</u></a>, this commit by Hou Zhijie, Shveta Malik, and Ajin Cherian lets high-availability <a href="https://www.timescale.com/learn/postgresql-database-replication-guide"><u>PostgreSQL use logical replication</u></a> and not lose downstream data in case of a failover. 
Enabling the failover of logical replication slots in PostgreSQL enhances the robustness and reliability of logical replication setups by allowing logical slots to be transferred and maintained across different database instances.</p><h3 id="allow-explain-to-report-optimizer-memory-usage">Allow EXPLAIN to report optimizer memory usage</h3><p>“This commit by Ashutosh Bapat is another good one,” notes Aleksander. Allowing the <code>EXPLAIN</code> command to report optimizer memory usage in PostgreSQL provides valuable insights into the resources consumed by the query planner and optimizer during the preparation of query execution plans. “It will allow the developer to choose the query that uses less memory,” explains Aleksander. This makes it especially helpful for those trying to fine-tune PostgreSQL’s performance.</p><div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-emoji">💻</div><div class="kg-callout-text"><a href="https://www.timescale.com/learn/postgres-guides#:~:text=Overview-,Performance,-Guide%20to%20PostgreSQL"><u>If you’re struggling to improve your PostgreSQL performance, these resources will help you get the most out of your database</u></a>.</div></div><h3 id="any-on-this-list-really">Any on this list, really</h3><p><a href="https://timescale.ghost.io/blog/how-postgresql-aggregation-works-and-how-it-inspired-our-hyperfunctions-design/"><u>Bruce Momjian has always been an inspiration to us (bow tie included)</u></a>, so we can safely say that any of the contributions <a href="https://momjian.us/pgsql_docs/release-17.html#RELEASE-17-SERVER"><u>on this list</u></a>, which Aleksander describes as “overall performance improvements,” make us excited about getting our hands on the new PostgreSQL version.&nbsp;</p><h2 id="what-we-committed-to-postgresql-17">What We Committed to PostgreSQL 17</h2><p>In total, 90 commits (3.5 percent of all commits) were authored, co-authored, and/or reviewed by Timescalers during the PostgreSQL 17 cycle. 
😎 We’re not going to bother you by going over all of them, but we asked our team of upstreamers to name some of their personal favorites.</p><h3 id="the-slru-move-to-64-bit-indexes">The SLRU move to 64-bit indexes</h3><p>“Personally, I’m most excited about the series of patches that moved SLRU (simple least recently used) caches to the 64-bit indexes,” says Aleksander. While we’re not there yet, this opens the path to 64-bit XIDs, which will mitigate the problem of <a href="https://timescale.ghost.io/blog/how-to-fix-transaction-id-wraparound/"><u>XID wraparound</u></a> that certain users face under specific workloads, such as mixing long-lived OLAP (online analytical processing) transactions and <a href="https://www.tigerdata.com/learn/understanding-oltp" rel="noreferrer">OLTP</a> (online transaction processing) workloads on the same PostgreSQL instance.&nbsp;</p><h3 id="transitive-comparisons">Transitive comparisons</h3><p>Another Timescaler who contributed to PostgreSQL was database architect <a href="https://se.linkedin.com/in/matskindahl"><u>Mats Kindahl</u></a>. Mats helped with refactoring to ensure transitive comparisons in PostgreSQL, which brings several benefits to users. Transitive comparisons allow for more concise and intuitive query expressions, improve <a href="https://www.tigerdata.com/blog/best-practices-for-query-optimization-in-postgresql" rel="noreferrer">query optimization</a>, enhance index usage, and facilitate data modeling, as developers can define relationships between entities more naturally.</p><h3 id="standardexplainonequery">standard_ExplainOneQuery</h3><p>Mats also worked on the introduction of <code>standard_ExplainOneQuery</code> in PostgreSQL 17. This addition helps ensure consistent behavior when adding explain hooks, making it easier to predict and understand the effects of explain hooks on query explanation. 
Developers can focus on implementing specific hooks without worrying about the nuances of query explanation behavior, leading to more efficient development processes and facilitating query performance tuning.</p><h3 id="uuidv7">UUIDv7</h3><p>On the reviewing front, Aleksander reviewed (along with other contributors) the <a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=794f10f6b920670cb9750b043a2b2587059d5051"><u>partial merge of UUIDv7 support</u></a> authored by Andrey Borodin. “While there are several UUIDv7 implementations available, the UUIDv7 standard is currently in draft condition,” explains Aleksander, adding that PostgreSQL will only support it once the standard is finalized. Once it’s fully supported by PostgreSQL, UUIDv7 will help make time-based queries more efficient, because its leading bits encode a timestamp and values therefore sort roughly by creation time.&nbsp;</p><h2 id="expanding-postgresql">Expanding PostgreSQL</h2><p>There you have it: a reflection on the direction of PostgreSQL 17, the new updates we’re excited about, and some of the contributions we made. If, like us, you want to carry on (or start) building on PostgreSQL, give Timescale a try. Features like hypertables (automatically partitioned PostgreSQL tables), continuous aggregates (automatically refreshed materialized views), and advanced data management techniques will significantly enhance PostgreSQL's ability to manage your most demanding workloads effectively.</p><p>If you want to expand PostgreSQL’s capabilities while using the PostgreSQL you know and love, <a href="https://console.cloud.timescale.com/signup"><u>create a free Timescale account today</u></a>. </p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[13 Tips to Improve PostgreSQL Insert Performance]]></title>
            <description><![CDATA[Some of these may surprise you, but all 13 ways will improve ingest (INSERT) performance using PostgreSQL and TimescaleDB.]]></description>
            <link>https://www.tigerdata.com/blog/13-tips-to-improve-postgresql-insert-performance</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/13-tips-to-improve-postgresql-insert-performance</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[PostgreSQL Performance]]></category>
            <category><![CDATA[PostgreSQL Tips]]></category>
            <dc:creator><![CDATA[Mike Freedman]]></dc:creator>
            <pubDate>Wed, 17 Apr 2024 12:00:00 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/10/Screenshot-2023-10-12-at-4.06.02-PM.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/10/Screenshot-2023-10-12-at-4.06.02-PM.png" alt="A powerful PostgreSQL elephant, representing performance, with electric yellow rays in the background" /><p>Ingest performance is critical for many common PostgreSQL use cases, including application monitoring, application analytics, IoT monitoring, and more. <a href="https://timescale.ghost.io/blog/time-series-data/" rel="noreferrer">These use cases have something in common</a>: unlike standard relational "business" data, changes are treated as <em>inserts</em>, not overwrites. In other words, every new value becomes a <strong>new </strong>row in the database instead of replacing the row's prior value with the latest one.</p><p>If you're operating in a scenario where you need to retain all data vs. overwriting past values, optimizing the speed at which your database can ingest new data becomes essential.</p><p>At <a href="https://www.timescale.com" rel="noreferrer">Tiger Data</a> (the creators of TimescaleDB), we have a lot of experience <a href="https://www.timescale.com/learn/postgresql-performance-tuning-how-to-size-your-database" rel="noreferrer">optimizing performance</a>, so in this article, we will look at PostgreSQL inserts and how to improve their performance. We'll include the following:<br><br><strong>1. &nbsp;Useful tips for improving PostgreSQL insert performance, in general,</strong> such as moderating your use of indexes, reconsidering foreign key constraints, avoiding unnecessary UNIQUE keys, using separate disks for WAL (Write-Ahead Logging) and data, and deploying on performant disks. 
Each of these strategies can help optimize the speed at which your database ingests new data.</p><p><strong>2.&nbsp; TimescaleDB-specific insert performance tips</strong> (TimescaleDB works like PostgreSQL under the hood).</p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">💫</div><div class="kg-callout-text">Don't know what TimescaleDB is? <a href="https://www.timescale.com/learn/is-postgres-partitioning-really-that-hard-introducing-hypertables" rel="noreferrer">Read this article.</a></div></div><h2 id="postgresql-insert-overview">PostgreSQL Insert Overview</h2><p><a href="https://www.postgresql.org/docs/current/sql-insert.html" rel="noreferrer">One of PostgreSQL's fundamental commands, the <code>INSERT</code> operation</a> plays a crucial role in adding new data to a database. It adds one or more rows to a table, filling each column with specified data. When certain columns are not specified in the insert query, PostgreSQL automatically fills these columns with their default values, if any are defined. This feature ensures that the database maintains integrity and consistency, even when all column values are not provided.</p><p>The <code>INSERT</code> operation is fundamental for data ingest processes, where new data is continually added to the database. It allows for the efficient and organized storage of new information, making it accessible for querying and analysis.<br><br>Here’s a simple example of an <code>INSERT</code> query:</p><pre><code class="language-sql">INSERT INTO employees (name, position, department)
VALUES ('John Doe', 'Software Engineer', 'Development');</code></pre><p>In this example, the <code>INSERT INTO</code> statement specifies the table <code>employees</code> to which the row will be added. The columns <code>name</code>, <code>position</code>, and <code>department</code> are explicitly mentioned, indicating where the provided data should be inserted. <br><br>Following the <code>VALUES</code> keyword, the actual data to be inserted into these columns is provided in parentheses. If the <code>employees</code> table contains other columns for which default values are defined and are not included in the <code>INSERT</code> statement, PostgreSQL will automatically fill those columns with the default values.</p><h2 id="when-insert-performance-matters">When Insert Performance Matters</h2><p>The speed at which data can be ingested into a database directly impacts its utility and responsiveness, especially when <strong>reaction speed</strong> in real-time or near-real-time data processing is essential.</p><p>One prominent example of such a use case is <a href="https://timescale.ghost.io/blog/guide-to-postgres-data-management/" rel="noreferrer"><strong>time-series data management</strong></a>. Time-series data, characterized by its sequential nature, accumulates moment by moment, often originating from sensors, financial transactions, or user activity logs. <br><br>The value of time-series data lies in its timeliness and the insights that can be gleaned from analyzing patterns over time. To maintain the integrity and relevance of these insights, insert performance must be optimized to ensure data is updated consistently and without delay. 
High insert performance allows for the seamless integration of new data, preserving the chronological order and enabling accurate <a href="https://www.timescale.com/learn/real-time-analytics-in-postgres" rel="noreferrer">real-time analysis</a>.</p><p><strong>Application monitoring</strong> represents another critical area where insert performance is paramount. Effective monitoring systems rely on continuously ingesting application metrics and logs to provide an up-to-date view of the application's health and performance. Any lag in data ingest can lead to delays in detecting and responding to issues, potentially affecting user experience and system stability. Strong insert performance ensures that monitoring systems remain current, allowing for immediate action in response to any anomalies detected.</p><p><strong>Event detection applications</strong>, such as fraud detection systems, also underscore the importance of fast insert speeds. In these scenarios, the ability to rapidly ingest and process data can mean the difference between catching a fraudulent transaction as it happens and missing it entirely. </p><p>Fast data ingest enables these systems to analyze events in real time, applying algorithms to detect suspicious patterns and react promptly. The reaction speed is crucial in minimizing risk and protecting assets, highlighting the critical role of insert performance in maintaining system efficacy.</p><h2 id="improving-insert-performance">Improving Insert Performance</h2><p>The previous use cases stress the critical role of ingest speed in real-time or high-volume databases, such as those handling time series. These use cases make up most of our customer base here at Tiger Data, so we're confident in recommending these five best practices for improving ingest performance in vanilla PostgreSQL:</p><h3 id="1-use-indexes-in-moderation">1. 
Use indexes in moderation</h3><p><a href="https://www.timescale.com/learn/postgresql-performance-tuning-optimizing-database-indexes" rel="noreferrer">Having the right indexes</a> can speed up your queries, but they’re not a silver bullet. Incrementally maintaining indexes with each new row requires additional work. Check the number of indexes you’ve defined on your table (use the <code>psql</code> command <code>\d table_name</code>), and determine whether their potential query benefits outweigh the storage and insert overhead. Since every system is different, there aren’t any hard and fast rules or “magic number” of indexes—just be reasonable.</p><h3 id="2-reconsider-foreign-key-constraints">2. Reconsider foreign key constraints</h3><p>Sometimes, it's necessary to build <a href="https://www.postgresql.org/docs/current/tutorial-fk.html" rel="noreferrer">foreign keys (FK)</a> from one table to other relational tables. When you have an FK constraint, every <code>INSERT</code> will typically need to read from your referenced table, which can degrade performance. Consider denormalizing your data—we sometimes see pretty extreme use of FK constraints from a sense of “elegance” rather than engineering trade-offs.</p><h3 id="3-avoid-unnecessary-unique-keys">3. Avoid unnecessary UNIQUE keys</h3><p>Developers are often trained to specify primary keys in database tables, and many ORMs love them. Yet, many use cases—including common monitoring or time-series applications—don’t require them, as each event or sensor reading can simply be logged as a separate event by inserting it at the tail of a hypertable's current chunk during write time. </p><p>If a <code>UNIQUE</code> constraint is otherwise defined, that insert can necessitate an index lookup to determine if the row already exists, which will adversely impact the speed of your <code>INSERT</code>.</p><h3 id="4-use-separate-disks-for-wal-and-data">4. 
Use separate disks for WAL and data</h3><p>While this is a more advanced optimization that isn't always needed, if your disk becomes a bottleneck, you can further increase throughput by using a separate disk (tablespace) for the database's WAL and data.</p><h3 id="5-use-performant-disks">5. Use performant disks </h3><p>Sometimes developers deploy their database in environments with slower disks, whether due to poorly-performing HDD, remote storage area networks (SANs), or other types of configurations. And because when you insert rows, the data is durably stored in the WAL before the transaction completes, slow disks can impact insert performance. One thing to do is check your disk IOPS using the <code>ioping</code> command.<br><br>Read test:</p><pre><code>$ ioping -q -c 10 -s 8k .
--- . (hfs /dev/disk1 930.7 GiB) ioping statistics ---
9 requests completed in 208 us, 72 KiB read, 43.3 k iops, 338.0 MiB/s
generated 10 requests in 9.00 s, 80 KiB, 1 iops, 8.88 KiB/s
min/avg/max/mdev = 18 us / 23.1 us / 35 us / 6.17 us</code></pre><p>Write test:</p><pre><code>$ ioping -q -c 10 -s 8k -W .
--- . (hfs /dev/disk1 930.7 GiB) ioping statistics ---
9 requests completed in 10.8 ms, 72 KiB written, 830 iops, 6.49 MiB/s
generated 10 requests in 9.00 s, 80 KiB, 1 iops, 8.89 KiB/s
min/avg/max/mdev = 99 us / 1.20 ms / 2.23 ms / 919.3 us</code></pre><p>You should see at least thousands of read IOPS and many hundreds of write IOPS. If you are seeing far fewer, your disk hardware is likely affecting your <code>INSERT</code> performance. See if alternative storage configurations are feasible.</p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">✨</div><div class="kg-callout-text">Read our <a href="https://www.timescale.com/blog/benchmarking-postgresql-batch-ingest" rel="noreferrer">benchmark on batch ingest in PostgreSQL</a>.</div></div><h2 id="using-timescaledb-to-improve-ingest-performance">Using TimescaleDB to Improve Ingest Performance</h2><p><a href="https://www.timescale.com/performance" rel="noreferrer">TimescaleDB is built to improve query and ingest performance in PostgreSQL.</a> </p><p>The most common uses for TimescaleDB involve storing massive amounts of data for cloud infrastructure metrics, product analytics, web analytics, IoT devices, and many other use cases involving <a href="https://www.timescale.com/learn/postgresql-partition-strategies-and-more" rel="noreferrer">large PostgreSQL tables</a>. The ideal TimescaleDB scenarios are time-centric, almost solely append-only (lots of <code>INSERT</code>s), and require fast ingestion of large amounts of data within small time windows. </p><p>Here are eight more techniques for improving ingest performance with TimescaleDB:</p><h3 id="6-use-parallel-writes">6. Use parallel writes </h3><p>Each <code>INSERT</code> or <code>COPY</code> command to TimescaleDB (as in PostgreSQL) is executed as a single transaction and thus runs in a single-threaded fashion. To achieve higher ingest rates, you should execute multiple <code>INSERT</code> or <code>COPY</code> commands in parallel. 
</p><p>For help with bulk loading large CSV files in parallel, check out  TimescaleDB's <a href="https://github.com/timescale/timescaledb-parallel-copy">parallel copy command</a>.</p><p>⭐ <strong>Pro tip</strong>: make sure your client machine has enough cores to execute this parallelism (running 32 client workers on a 2 vCPU machine doesn’t help much— the workers won’t actually be executed in parallel).</p><h3 id="7-insert-rows-in-batches">7. Insert rows in batches </h3><p>To achieve higher ingest rates, you should insert your data with many rows in each <code>INSERT</code> call (or else use some bulk insert command, like COPY or our parallel copy tool). </p><p>Don't insert your data row-by-row—instead, try at least hundreds (or thousands) of rows per insert. This allows the database to spend less time on connection management, transaction overhead, SQL parsing, etc., and more time on data processing.</p><h3 id="8-properly-configure-sharedbuffers">8. Properly configure shared_buffers</h3><p>We typically recommend 25&nbsp;% of available RAM. If you install TimescaleDB via a method that runs <a href="https://github.com/timescale/timescaledb-tune"><code>timescaledb-tune</code></a>, it should automatically configure <code>shared_buffers</code> to something well-suited to your hardware specs. </p><p>Note: in some cases, typically with virtualization and constrained cgroups memory allocation, these automatically-configured settings may not be ideal. To check that your <code>shared_buffers</code> are set to within the 25&nbsp;% range,  run <code>SHOW shared_buffers</code> from your <code>psql</code> connection.</p><h3 id="9-run-our-docker-images-on-linux-hosts">9. 
Run our Docker images on Linux hosts</h3><p>If you are <a href="https://docs.timescale.com/latest/getting-started/installation/docker/installation-docker/?utm_source=timescale-13-insert-tips&amp;utm_medium=blog&amp;utm_campaign=july-2020-advocacy&amp;utm_content=install-docs-docker">running a TimescaleDB Docker container (which runs Linux)</a> on top of another Linux operating system, you're in great shape. The container is basically providing process isolation, and the overhead is minimal. </p><p>If you're running the container on a Mac or Windows machine, you'll see some performance hits for the OS virtualization, including for I/O.</p><p>If you need to run on Mac or Windows, we recommend <a href="https://docs.timescale.com/latest/getting-started/installation/?utm_source=timescale-13-insert-tips&amp;utm_medium=blog&amp;utm_campaign=july-2020-advocacy&amp;utm_content=install-docs">installing directly</a> instead of using a Docker image.</p><h3 id="10-avoid-too-many-or-too-small-chunks">10.&nbsp;Avoid too many or too small chunks</h3><p>We don't currently recommend using space partitioning. If you do use it, remember that one chunk per space partition is created for every time interval. </p><p>So, if you create 64 space partitions and daily chunks, you'll have 23,360 chunks per year (64 × 365). This may lead to a bigger performance hit during query time (due to planning overhead) than during insert time, but it's something to consider nonetheless.<br><br>Another thing to avoid is using an incorrect integer value when you specify the time interval range in <code>create_hypertable</code>. </p><p>⭐ <strong>Pro tip</strong>: </p><ul><li>If your time column uses a native timestamp type, then any integer value should be in terms of microseconds (so one day = 86400000000). We recommend using interval types ('1 day') to avoid the potential for any confusion. 
</li><li>If your time column is an integer or bigint itself,  use the appropriate range: if the integer timestamp is in seconds, use 86400; if the bigint timestamp is in nanoseconds, use 86400000000000.<br><br>In both cases, you can use <a href="https://docs.timescale.com/latest/api?utm_source=timescale-13-insert-tips&amp;utm_medium=blog&amp;utm_campaign=july-2020-advocacy&amp;utm_content=chunk-pretty-api-docs#chunk_relation_size_pretty"><code>chunk_relation_size_pretty</code></a> to make sure your chunk sizes or partition ranges seem reasonable:</li></ul><pre><code class="language-SQL">=&gt; SELECT chunk_table, ranges, total_size
FROM chunk_relation_size_pretty('hypertable_name')
ORDER BY ranges DESC LIMIT 4;
chunk_table               |                         ranges                          | total_size
-----------------------------------------+---------------------------------------------------------+------------
_timescaledb_internal._hyper_1_96_chunk | {"['2020-02-13 23:00:00+00','2020-02-14 00:00:00+00')"} | 272 MB
_timescaledb_internal._hyper_1_95_chunk | {"['2020-02-13 22:00:00+00','2020-02-13 23:00:00+00')"} | 500 MB
_timescaledb_internal._hyper_1_94_chunk | {"['2020-02-13 21:00:00+00','2020-02-13 22:00:00+00')"} | 500 MB
_timescaledb_internal._hyper_1_93_chunk | {"['2020-02-13 20:00:00+00','2020-02-13 21:00:00+00')"} | 500 MB</code></pre><h3 id="11-avoid-%E2%80%9Ctoo-large%E2%80%9D-chunks">11. Avoid “too large” chunks</h3><p>To maintain higher ingest rates, you want your latest chunk and all its associated indexes to stay in memory so that writes to the chunk and index updates merely update memory. (The write is still durable, as inserts are written to the WAL on disk before the database pages are updated.) </p><p>If your chunks are too large, then writes to even the latest chunk will start swapping to disk.</p><p>As a rule of thumb, we recommend that the latest chunks and all their indexes fit comfortably within the database's <code>shared_buffers</code>. You can check your chunk sizes via the <a href="https://docs.timescale.com/latest/api?utm_source=timescale-13-insert-tips&amp;utm_medium=blog&amp;utm_campaign=july-2020-advocacy&amp;utm_content=chunk-pretty-api-docs#chunk_relation_size_pretty"><code>chunk_relation_size_pretty</code></a> SQL function.</p><pre><code class="language-SQL">=&gt; SELECT chunk_table, table_size, index_size, toast_size, total_size
FROM chunk_relation_size_pretty('hypertable_name')
ORDER BY ranges DESC LIMIT 4;
chunk_table               | table_size | index_size | toast_size | total_size
-----------------------------------------+------------+------------+------------+------------
_timescaledb_internal._hyper_1_96_chunk | 200 MB     | 64 MB      | 8192 bytes | 272 MB
_timescaledb_internal._hyper_1_95_chunk | 388 MB     | 108 MB     | 8192 bytes | 500 MB
_timescaledb_internal._hyper_1_94_chunk | 388 MB     | 108 MB     | 8192 bytes | 500 MB
_timescaledb_internal._hyper_1_93_chunk | 388 MB     | 108 MB     | 8192 bytes | 500 MB</code></pre><p>If your chunks are too large, you can update the range for future chunks via the <a href="https://docs.timescale.com/latest/api?utm_source=timescale-13-insert-tips&amp;utm_medium=blog&amp;utm_campaign=july-2020-advocacy&amp;utm_content=set-chunk-interval-api-docs#set_chunk_time_interval"><code>set_chunk_time_interval</code></a> command. However, this does not modify the range of existing chunks (e.g., by rewriting large chunks into multiple small chunks). </p><p>For configurations where individual chunks are much larger than your available memory, we recommend dumping and reloading your hypertable data into properly sized chunks.</p><p>Keeping the latest chunk in memory applies to all active hypertables; if you are actively writing to two hypertables, the latest chunks from both should fit within <code>shared_buffers</code>.</p><h3 id="12-write-data-in-loose-time-order">12. Write data in loose time order</h3><p>When chunks are sized appropriately (see #10 and #11), the latest chunk(s) and their associated indexes are naturally maintained in memory. New rows inserted with recent timestamps will be written to these chunks and indexes already in memory. </p><p>If a row with a sufficiently old timestamp is inserted—i.e., it's an out-of-order or backfilled write—the disk pages corresponding to the older chunk (and its indexes) will need to be read in from disk. This will significantly increase write latency and lower insert throughput.</p><p>In particular, when you are loading data for the first time, try to load data in sorted, increasing timestamp order. </p><p>Be careful if you're bulk-loading data about many different servers, devices, and so forth: </p><ul><li>Do not bulk insert data sequentially by server (i.e., all data for server A, then server B, then C, and so forth). 
This will cause disk thrashing as loading each server will walk through all chunks before starting anew. </li><li>Instead, arrange your bulk load so that data from all servers are inserted in loose timestamp order (e.g., day 1 across all servers in parallel, then day 2 across all servers in parallel, etc.)</li></ul><h3 id="13-watch-row-width">13. Watch row width</h3><p>The overhead from inserting a wide row (say, 50, 100, 250 columns) is going to be much higher than inserting a narrower row (more network I/O, more parsing and data processing, larger writes to WAL, etc.). Most of our published benchmarks are using <a href="https://github.com/timescale/tsbs">TSBS</a>, which uses 12 columns per row. So you'll correspondingly see lower insert rates if you have very wide rows.</p><p>If you are considering very wide rows because you have different types of records, and each type has a disjoint set of columns, you might want to try using multiple hypertables (one per record type)—particularly if you don't often query across these types.</p><p>Additionally, JSONB records are another good option if virtually all columns are sparse. That said, if you're using sparse wide rows, use NULLs for missing records whenever possible, not default values, for the most performance gains (NULLs are much cheaper to store and query).</p><p>Finally, the cost of wide rows is actually much less once you compress rows using <a href="https://timescale.ghost.io/blog/blog/building-columnar-compression-in-a-row-oriented-database/?utm_source=timescale-13-insert-tips&amp;utm_medium=blog&amp;utm_campaign=july-2020-advocacy&amp;utm_content=1-5-release-blog">TimescaleDB’s native compression</a>.  
Rows are converted into a more columnar, compressed form, sparse columns compress extremely well, and compressed columns aren’t read from disk by queries that don’t reference them.</p><h2 id="summary">Summary</h2><p><a href="https://www.timescale.com/learn/types-of-data-supported-by-postgresql-and-timescale" rel="noreferrer">If ingest performance is critical to your use case</a>, consider using TimescaleDB. You can <a href="https://console.cloud.timescale.com/signup">get started with hosted TimescaleDB (Tiger Cloud)</a> for free today or <a href="https://docs.timescale.com/latest/getting-started/installation/?utm_source=timescale-13-insert-tips&amp;utm_medium=blog&amp;utm_campaign=july-2020-advocacy&amp;utm_content=install-docs">download TimescaleDB</a> to your own hardware. </p><p>Our approach to support is to address your whole solution, so we're here to help you achieve your desired performance results (see more details about our <a href="https://www.timescale.com/support" rel="noreferrer">Support team and ethos</a>). </p><p>Lastly, <a href="https://slack.timescale.com/">our Slack community</a> is a great place to connect with 8,000+ other developers with similar use cases, as well as myself, Tiger Data engineers, product team members, and developer advocates.</p><h3 id="keep-learning-about-improving-postgresql-performance">Keep learning about improving PostgreSQL performance</h3><p>If you're interested in improving your PostgreSQL performance, you'll find the following resources useful: </p><p><strong>👉 </strong><a href="https://www.tigerdata.com/learn/postgresql-partition-strategies-and-more" rel="noreferrer"><strong>Navigating growing PostgreSQL tables</strong></a><strong>. </strong>Are your PostgreSQL queries slowing down as your database tables grow? 
Learn about a few tactics that can get you back on track.</p><p><a href="https://www.timescale.com/learn/when-to-consider-postgres-partitioning" rel="noreferrer"><strong>👉 When to consider PostgreSQL partitioning.</strong></a><strong> </strong>Postgres partitioning can be a powerful tool to scale your database, although it’s not a one-size-fits-all solution. Learn if it's the solution you're looking for. </p><p><strong>👉 </strong>When your tables start growing, it might be time for some PostgreSQL fine-tuning. Get advice on how to optimize your database step by step: </p><ul><li><a href="https://www.timescale.com/learn/postgresql-performance-tuning-how-to-size-your-database" rel="noreferrer">Sizing your database properly (CPU, memory) </a></li><li><a href="https://www.timescale.com/learn/postgresql-performance-tuning-key-parameters" rel="noreferrer">Key PostgreSQL parameters to fine-tune (e.g., work_mem, shared_buffers)</a></li><li><a href="https://www.timescale.com/learn/postgresql-performance-tuning-optimizing-database-indexes" rel="noreferrer">Optimizing indexes</a></li><li><a href="https://www.timescale.com/learn/postgresql-performance-tuning-designing-and-implementing-database-schema" rel="noreferrer">Schema design best practices</a></li></ul><p><strong>👉</strong><a href="https://timescale.ghost.io/blog/timescale-cloud-tips-how-to-optimize-your-ingest-rate/" rel="noreferrer"><strong> Further tips on improving inserts</strong></a><strong>. </strong><br></p><h2 id="faqs-improving-postgresql-insert-performance">FAQs: Improving PostgreSQL Insert Performance</h2><p><strong>Q: How can I improve PostgreSQL insert performance when dealing with large amounts of data?</strong></p><p>A: To optimize PostgreSQL insert performance, focus on using indexes in moderation, inserting rows in batches rather than one by one, and ensuring your disk hardware is performant. If disk becomes a bottleneck, consider using separate disks for WAL and data. 
For bulk loading data, try tools like TimescaleDB's parallel copy command, which can significantly increase throughput.</p><p><strong>Q: What role do indexes and constraints play in PostgreSQL insert performance?</strong></p><p>A: While indexes speed up queries, they can slow down inserts since each new row requires index maintenance. Foreign key constraints force PostgreSQL to read from referenced tables during inserts, and <code>UNIQUE</code> constraints necessitate index lookups to check for duplicates. Consider whether these constraints are truly necessary for your use case, especially for append-only scenarios like time-series data.</p><p><strong>Q: How should I configure my hardware for optimal PostgreSQL insert performance?</strong></p><p>A: Use performant disks capable of thousands of read IOPS and hundreds of write IOPS, which you can check with the <code>ioping</code> command. Configure <code>shared_buffers</code> to approximately 25% of available RAM to ensure enough memory for caching active data. If running in containers, use Linux hosts for Docker images to minimize virtualization overhead.</p><p><strong>Q: What's the optimal approach for batch inserting data into PostgreSQL?</strong></p><p>A: Instead of row-by-row insertion, batch hundreds or thousands of rows per <code>INSERT</code> command to reduce overhead from connection management and SQL parsing. Execute multiple <code>INSERT</code> or <code>COPY</code> commands in parallel to leverage multiple cores. When bulk loading time-series data, insert in loose time order (e.g., day 1 across all servers, then day 2) rather than sequentially by server to prevent disk thrashing.</p><p><strong>Q: How can TimescaleDB improve insert performance compared to vanilla PostgreSQL?</strong></p><p>A: TimescaleDB enhances PostgreSQL insert performance through features like hypertables with automatic time-based chunking, which keeps recent data in memory for faster writes. 
It offers native compression to reduce the cost of wide rows and provides tools like parallel copy for efficient bulk loading. TimescaleDB also helps maintain appropriate chunk sizes (neither too large nor too small) to optimize memory usage and prevent unnecessary disk operations.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Self-Hosted or Cloud Database? A Countryside Reflection on Infrastructure Choices]]></title>
            <description><![CDATA[Read what country living can teach you about infrastructure choices and choosing a self-hosted vs. cloud database.]]></description>
            <link>https://www.tigerdata.com/blog/self-hosted-or-cloud-database-a-countryside-reflection-on-infrastructure-choices</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/self-hosted-or-cloud-database-a-countryside-reflection-on-infrastructure-choices</guid>
            <category><![CDATA[Cloud]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Benchmarks & Comparisons]]></category>
            <dc:creator><![CDATA[Jônatas Davi Paganini]]></dc:creator>
            <pubDate>Wed, 03 Apr 2024 15:54:16 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/04/Self-hosted-vs-cloud-database_cover--1-.webp">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/04/Self-hosted-vs-cloud-database_cover--1-.webp" alt="A developer working in a countryside landscape. A countryside reflection on infrastructure choices." /><p>The choice between using a cloud database or opting for a self-managed setup is critical for every developer as it affects the entire framework through which an organization processes its data. Funny enough, nothing has taught me more invaluable lessons about infrastructure than living in the countryside for the past four years. These experiences have shaped my mindset about what I prefer to manage myself and what I’d rather have as a service. </p><p>Much like my countryside life, which has the benefits and drawbacks of managing one's infrastructure versus relying on external services, this article aims to draw parallels to guide you through your digital infrastructure decisions. </p><p>I’ll also compare deployment options, emphasizing that the implications of this choice extend far beyond the technical: they influence your organization's agility, efficiency, and long-term scalability. I hope these insights will act as a sort of compass, directing you toward a decision that aligns with your strategic objectives and operational capabilities.</p><h2 id="self-hosting-vs-cloud-a-water-management-lesson">Self-Hosting vs. Cloud: A Water Management Lesson</h2><p>Before explaining how my water system got me thinking of databases and infrastructure choices, let me clarify the decision on the table here. Self-hosting a database involves running it on your own physical or virtual servers, requiring maintenance, security, and scalability management. 
In contrast, a cloud database is hosted and managed by a third-party cloud provider, offering scalability, automated backups, and reduced maintenance overhead, allowing developers to focus more on application development and less on infrastructure management.</p><p>The thing is, <em>water is no different from data</em>. In businesses, data flows like water through systems, streaming continuously and needing to be stored. Water and data are both crucial to the systems built around them: they are vital for the show to go on, whatever challenges they bring.</p><h3 id="system-infrastructure-as-a-water-system">System infrastructure as a water system</h3><p>In the countryside, choosing my water system was similar to selecting a well-integrated infrastructure. System infrastructure refers to the underlying framework that supports the operation of software applications. The same could be said of my water system when it comes to everyday household operations.</p><p>In the driest seasons, I was compelled to build a resilient water recycling system. After heavy rains, my large repository would be brimming, providing essentials for showers, dishes, and laundry. However, managing this infrastructure wasn’t without its challenges. Annually, I’d face issues like broken pipes, the need to pump water from the lake, water pump failures, and clogged filters.</p><p>These experiences parallel the challenges in managing business infrastructure:</p><ul><li><strong>Unexpected breakdowns</strong>: just as pipes break, systems can fail.</li><li><strong>Resource scarcity</strong>: like running out of water, businesses can face resource shortages.</li><li><strong>Maintenance needs</strong>: like repairing a water pump, systems require regular upkeep.</li><li><strong>Regular updates</strong>: comparable to changing filters, systems need continual updates.</li></ul><h2 id="self-hosting-vs-cloud-services-making-a-choice">Self-Hosting vs. 
Cloud Services: Making a Choice</h2><p>Reflecting on my rural infrastructure, self-hosting was my only option. But what about businesses? Consider these questions:</p><ul><li>Do you have the infrastructure and skills to manage emergencies anytime?</li><li>Are you prepared to invest in and maintain your infrastructure?</li></ul><p>If the answer to either question is “no,” self-hosting might be only a temporary solution.</p><h3 id="the-right-mindset-for-system-infrastructure">The right mindset for system infrastructure</h3><p>This is about more than choosing between self-managing a database and using a cloud provider; it’s about understanding your business limitations and choosing the option that sustains your business longer.</p><p>To make an informed choice, here are some key considerations you’ll need to weigh:</p><ul><li><strong>Learning costs during downtimes</strong>: Downtime is not just a technical setback—it's a period of intensive learning under pressure. Organizations must evaluate whether they have the resources and resilience to absorb the learning curve of diagnosing and resolving infrastructure failures in-house. The cost of this learning, both in terms of time and lost productivity, can be significant.</li><li><strong>Business risks during outages:</strong> Outages directly threaten your business continuity. The longer your systems are down, the greater the risk to your reputation, customer satisfaction, and revenue. Assessing the potential impact of outages is crucial in understanding whether the self-hosted approach aligns with your risk tolerance and business continuity plans.</li><li><strong>Team commitment to infrastructure responsibilities:</strong> Choosing to self-host means your team will bear the full weight of infrastructure responsibilities—from routine maintenance to emergency response. This commitment requires a dedicated, skilled team that's prepared to tackle challenges as they arise. 
Reflect on whether your team has the bandwidth and expertise to manage these tasks without detracting from their core functions.</li><li><strong>Training availability:</strong> Your team's effectiveness in managing a self-hosted infrastructure heavily relies on their ongoing education and training. Consider whether you have access to the necessary training resources to keep your team up-to-date with the latest technologies and best practices in infrastructure management.</li></ul><p>These considerations go beyond the surface-level appeal of having complete control over your infrastructure. They highlight the depth of <em>commitment</em> and <em>preparedness</em> needed to ensure that a self-hosted solution supports, rather than hinders, your organization's goals. </p><p>Below, I've outlined some of the primary areas of concern when opting for self-hosting, paired with the inevitable consequences businesses might face if they don’t prepare correctly:</p>
<!--kg-card-begin: html-->
<table style="border:none;border-collapse:collapse;"><colgroup><col width="147"><col width="463"></colgroup><tbody><tr style="height:35.5pt"><td style="border-left:solid #e3e3e3 0.75pt;border-right:solid #e3e3e3 0.75pt;border-bottom:solid #e3e3e3 0.75pt;border-top:solid #e3e3e3 0.75pt;vertical-align:bottom;background-color:#ffffff;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Challenge Area</span></p></td><td style="border-left:solid #e3e3e3 0.75pt;border-right:solid #e3e3e3 0.75pt;border-bottom:solid #e3e3e3 0.75pt;border-top:solid #e3e3e3 0.75pt;vertical-align:bottom;background-color:#ffffff;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Consequence</span></p></td></tr><tr style="height:106.75pt"><td style="border-left:solid #e3e3e3 0.75pt;border-right:solid #e3e3e3 0.75pt;border-bottom:solid #e3e3e3 0.75pt;border-top:solid #e3e3e3 0.75pt;vertical-align:middle;background-color:#ffffff;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Without a ready emergency response</span></p></td><td 
style="border-left:solid #e3e3e3 0.75pt;border-right:solid #e3e3e3 0.75pt;border-bottom:solid #e3e3e3 0.75pt;border-top:solid #e3e3e3 0.75pt;vertical-align:middle;background-color:#ffffff;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><ul style="margin-top:0;margin-bottom:0;padding-inline-start:48px;"><li dir="ltr" style="list-style-type:disc;font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;" aria-level="1"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;" role="presentation"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Eventually, a critical error will halt systems, leading to operational paralysis.</span></p></li><li dir="ltr" style="list-style-type:disc;font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;" aria-level="1"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;" role="presentation"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">The team will need to scramble to respond, undermining the stability of all operations.</span></p></li><li dir="ltr" style="list-style-type:disc;font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;" aria-level="1"><p dir="ltr" 
style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;" role="presentation"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">There will be an expensive delay while figuring out the necessary steps to recovery.</span></p></li></ul></td></tr><tr style="height:124.75pt"><td style="border-left:solid #e3e3e3 0.75pt;border-right:solid #e3e3e3 0.75pt;border-bottom:solid #e3e3e3 0.75pt;border-top:solid #e3e3e3 0.75pt;vertical-align:middle;background-color:#ffffff;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Without investment and maintenance</span></p></td><td style="border-left:solid #e3e3e3 0.75pt;border-right:solid #e3e3e3 0.75pt;border-bottom:solid #e3e3e3 0.75pt;border-top:solid #e3e3e3 0.75pt;vertical-align:middle;background-color:#ffffff;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><ul style="margin-top:0;margin-bottom:0;padding-inline-start:48px;"><li dir="ltr" style="list-style-type:disc;font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;" aria-level="1"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;" role="presentation"><span 
style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Infrastructure will become overwhelmed as operations scale, leading to performance bottlenecks.</span></p></li><li dir="ltr" style="list-style-type:disc;font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;" aria-level="1"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;" role="presentation"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Bugs and system issues will proliferate as the infrastructure expands, reducing system reliability.</span></p></li><li dir="ltr" style="list-style-type:disc;font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;" aria-level="1"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;" role="presentation"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Fragile infrastructure components will fail, precipitating emergencies and further destabilizing operations.</span></p></li></ul></td></tr></tbody></table>
<!--kg-card-end: html-->
<p>Looking at these complexities, the contrast with cloud services becomes clearer, illustrating the value of scalability, reliability, and reduced operational burdens that cloud services can offer.</p><p>You can build a resilient team for self-hosting your database, but you’ll need sufficient resources and investment. Plus, you will have to be fully transparent with your customers to build their confidence. If self-hosting isn’t feasible, let your customers know or find ways to enhance your infrastructure together.</p><p>This is where Timescale’s self-hosting support options can lend you a helping hand. 🤝</p><h2 id="self-hosting-with-timescale">Self-Hosting With Timescale</h2><p>Self-hosting does not mean you have to do everything alone. For teams taking the self-hosting route for time-series data, Timescale’s approach is about empowering control: users who want to self-host and control their infrastructure are backed by comprehensive support so that their operations run smoothly. </p><p>To do this, Timescale provides specialized support packages, tailored to production and development environments and designed to mitigate the challenges of self-hosting.</p><h4 id="timescale-production-and-development-support-packages">Timescale Production and Development Support Packages</h4><p>For organizations committed to self-hosting their time-series databases, Timescale provides a <a href="https://timescale.ghost.io/blog/empowering-control-production-and-developer-support-for-self-managed-timescaledb/"><u>tiered support system designed to address the needs of both production and development stages</u></a>. 
This support includes:</p><ul><li><strong>All email Support requests are fielded within one business day</strong>, ensuring that any queries or issues are promptly addressed, minimizing delays in troubleshooting and resolution.</li><li><strong>24x7 on-call support with a one-hour response time for severe or critical issues</strong> that threaten production environments. Timescale offers dedicated on-call support to provide real-time expertise, significantly reducing downtime.</li><li><strong>Dedicated Support portal</strong>: A centralized location for all your support needs, providing easy access to assistance and resources.</li><li><strong>Production Support as a Service:</strong> This feature offloads the burden of emergency responses and infrastructure troubleshooting from your team, allowing you to focus on core operations while relying on Timescale's expertise.</li></ul><p>For more detailed information on how TimescaleDB can support your self-hosting requirements, visit <a href="https://www.timescale.com/self-managed-support"><u>Timescale's Self-Managed Support Page</u></a>.</p><h3 id="self-managed-timescaledb-features">Self-managed TimescaleDB features</h3><p>Besides providing support for your self-hosted database, TimescaleDB enhances PostgreSQL—one of the best-known reliable databases—with features specifically designed for time-series data, making it an attractive option for self-hosting scenarios:</p><ul><li><strong>Hypertables</strong>: These are designed to handle massive datasets by automatically partitioning data across time and space while still allowing you to interact with them as though they were standard PostgreSQL tables.</li><li><strong>Continuous Aggregates</strong>: Time-series queries often require aggregating data over time intervals. 
Continuous aggregates simplify this by automatically updating incrementally, saving processing time and resources.</li><li><strong>Compression</strong>: Leveraging <a href="https://www.tigerdata.com/blog/building-columnar-compression-in-a-row-oriented-database" rel="noreferrer">columnar storage</a> and time-partitioned data structures, TimescaleDB offers efficient compression mechanisms to reduce storage costs and improve query performance.</li><li><strong>Full SQL</strong>: Unlike some NoSQL databases designed for time-series data, TimescaleDB does not compromise on the power of SQL, offering full compatibility with PostgreSQL for ease of use and flexibility.</li></ul><p>(To learn more about these features, <a href="https://docs.timescale.com/" rel="noreferrer">check out our documentation</a>.)</p><p>With Timescale, you can mitigate some of the traditional challenges associated with self-hosted databases, benefiting from a system that combines the scalability and flexibility of a conventional SQL database with the performance and efficiency required for modern <a href="https://www.tigerdata.com/learn/understanding-database-workloads-variable-bursty-and-uniform-patterns" rel="noreferrer">data workloads</a>. </p><p>But, even with the Timescale Support team by your side, managing your self-hosted setup remains a significant responsibility. It may be time to start considering an alternative: cloud services. The cloud technology era is not just upon us—it is shaping the future of data management, offering a distinct path from traditional self-hosting models. Shame I can't get a similar model for my water system.</p><p>Cloud services are specifically designed to expedite business operations and iterations. They represent an ideal solution for companies that prefer not to invest heavily in internal teams dedicated to ensuring resilience and managing infrastructure complexities. 
The suitability of cloud services for your organization hinges on several factors, including your business objectives, the current stage of your Service Level Agreements (SLAs), and your growth ambitions.</p><h2 id="the-role-of-cloud-services-in-scaling-and-infrastructure-management">The Role of Cloud Services in Scaling and Infrastructure Management</h2><p>Cloud services provide a reliable framework for scaling, enabling you to adapt quickly to changing demands without the upfront costs typically associated with physical infrastructure investments. Here are some key advantages:</p><ul><li><strong>Infrastructure investment at scale</strong>: Cloud services allow businesses to purchase infrastructure wholesale, translating to significant savings on time and personnel despite potentially higher direct spending.</li><li><strong>Security and reliability</strong>: By offloading security and reliability concerns to the cloud provider, companies can focus more on their core business functions.</li><li><strong>Transparency and control</strong>: While adopting cloud services may result in less operational transparency, the trade-off comes with access to a suite of services and support that can dramatically simplify infrastructure management.</li></ul><h2 id="timescale%E2%80%99s-cloud-services-empowering-your-data-management">Timescale’s Cloud Services: Empowering Your Data Management</h2><p>Timescale provides a comprehensive cloud solution designed to optimize time-series data management without the operational overhead of self-hosting:</p><ul><li><strong>Free Production Support</strong>: Ensures that your operations run smoothly with expert assistance readily available.</li><li><a href="https://docs.timescale.com/use-timescale/latest/data-tiering/tour-data-tiering/" rel="noreferrer"><strong>Data tiering</strong></a><strong> and </strong><a href="https://timescale.ghost.io/blog/savings-unlocked-why-we-switched-to-a-pay-for-what-you-store-database-storage-model/" 
rel="noreferrer"><strong>usage-based cost tiers</strong></a>: Optimizes your storage spending according to your actual needs, ensuring cost-efficiency.</li><li><strong>Scalability without the traditional constraints</strong>: With compute and storage decoupled, scalability becomes both cost-efficient and performance-optimized.</li><li><a href="https://timescale.ghost.io/blog/how-high-availability-works-in-our-cloud-database/" rel="noreferrer"><strong>High availability</strong></a><strong>, security, and compliance</strong>: Features like automated backups, upgrades, and end-to-end encryption ensure your data is secure, compliant, and available when needed.</li><li><strong>Insights and analytics</strong>: In-console metric visualization and <a href="https://timescale.ghost.io/blog/database-monitoring-and-query-optimization-introducing-insights-on-timescale/" rel="noreferrer">detailed query information</a> enhance your ability to monitor and improve performance.</li></ul><h2 id="the-verdict-self-hosting-vs-cloud-services">The Verdict: Self-Hosting vs. Cloud Services</h2><p>Choosing between self-hosting and cloud services boils down to a strategic decision based on your company's specific needs and goals:</p><p><strong>Self-hosting</strong> offers clarity, transparency, and guaranteed integration with your existing systems, giving you total control over your infrastructure.</p><p><strong>Cloud services</strong> streamline infrastructure investment and updates, removing the need for extensive personnel or tooling investments for security and emergencies, allowing you to concentrate on your core business.</p><p>Ultimately, these are not rigid rules but guiding principles to help you make informed decisions. Whether you opt for self-hosting or cloud services, <em>choose what aligns best with your business goals and needs</em>. 
In either scenario, Timescale provides tailored support and solutions to ensure your time-series data infrastructure is optimized, secure, and scalable.</p><p>Whether you're deciding between self-hosting and cloud services or looking for ways to optimize your current setup, the Timescale Slack Community <code>#tech-design</code> channel is an excellent resource for collective learning and support.</p><p>Join the <a href="https://slack.timescale.com"><u>Timescale Slack Community</u></a>, where you can talk to me and many other like-minded developers about their infrastructure choices. See you there! 👋</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Refining Vector Search Queries With Time Filters in Pgvector: A Tutorial]]></title>
            <description><![CDATA[Learn how to do both time-based filtering and vector search—semantic similarity search to be more precise—in a single SQL query.]]></description>
            <link>https://www.tigerdata.com/blog/refining-vector-search-queries-with-time-filters-in-pgvector-a-tutorial</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/refining-vector-search-queries-with-time-filters-in-pgvector-a-tutorial</guid>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Engineering]]></category>
            <dc:creator><![CDATA[John Pruitt]]></dc:creator>
            <pubDate>Mon, 01 Apr 2024 15:15:31 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/04/refining-vector-queries-with-time-filters-in-pgvector.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/04/refining-vector-queries-with-time-filters-in-pgvector.png" alt="A clock in a funnel in a rocky, volcano-like landscape: Refining vector search queries with time filters in pg vector" /><p><a href="https://timescale.ghost.io/blog/postgresql-as-a-vector-database-create-store-and-query-openai-embeddings-with-pgvector/" rel="noreferrer">Vector databases have emerged as a powerful solution for working with high-dimensional data</a>. These databases store vectors and perform vector searches. A vector is an array of numbers representing a data point in a multi-dimensional space, and vector search finds the most similar vectors to a query vector in this space. Vector search capabilities are powerful in machine learning and artificial intelligence, where traditional text-based search falls short.</p><p>Refining vector search queries becomes increasingly essential as data grows in volume. This is where time filters come into play. Time-based filtering allows users to narrow search results based on temporal criteria, such as creation date, modification date, or any time-related attribute associated with the data. Whether finding the most recent documents or documents from a specific time range, time-based filtering provides a crucial capability that complements similarity search.</p><p>Not all vector databases support both vector search and time-based filtering. The ability to perform both in the database unlocks efficiencies not otherwise possible. PostgreSQL with the <a href="https://github.com/pgvector/pgvector"><u>pgvector</u></a> and <a href="https://github.com/timescale/timescaledb"><u>timescaledb</u></a> extensions provides a potent platform for combining vector search with time-based filtering using SQL. In this post, we will explore vector search, time-based search, and several approaches to how one might combine the two. 
We will evaluate the speed and efficiency of these options.</p><h2 id="vector-search-queries-with-time-filters-the-roadmap">Vector Search Queries With Time Filters: The Roadmap</h2><p>In this hands-on orientation, we will load a sample dataset into a table and run time-filtered vector search queries against it. We can understand how Postgres executes each vector search query by examining the query plans. Furthermore, we can evaluate the performance of each query.</p><p>Here’s an overview of what we’ll cover:</p><p><strong>Query plans: </strong><a href="#query-plans" rel="noreferrer"><u>First</u></a>, we will provide a brief description of query plans in Postgres. If you are a seasoned query tuner, skip ahead to the setup.</p><p><strong>Setup: </strong>In <a href="#the-setup" rel="noreferrer"><u>the second section</u></a>, we will do all our prep work. We will create a Timescale service, create a table, and load it with our sample dataset.</p><p><strong>Filtering documents by time: </strong>In <a href="#filtering-documents-by-time" rel="noreferrer"><u>the next section</u></a>, we will look at time-based filtering in isolation. We will explore the power of TimescaleDB’s hypertables (which automatically partition Postgres tables into smaller data partitions or chunks) and compare them to plain Postgres tables. Feel free to skip ahead if you are already familiar with hypertables and time-based filtering.</p><p><strong>Vector search:</strong> After looking at time-based filtering in isolation, we will look at vector search in isolation. <a href="#vector-search" rel="noreferrer"><u>This section</u></a> will introduce the features the <a href="https://github.com/pgvector/pgvector"><u>pgvector</u></a> and <a href="https://www.tigerdata.com/blog/pgai-giving-postgresql-developers-ai-engineering-superpowers" rel="noreferrer"><u>pgai on Timescale</u></a> extensions provide. 
You can safely jump past this if you know how to do vector search in Postgres.</p><p><strong>Vector search and querying by time: </strong>In <a href="#vector-search-and-filtering-by-time" rel="noreferrer"><u>this section</u></a>, we will learn how to do both time-based filtering and similarity search in a single SQL query. (When the vectors contain a large language model’s embedding of textual content, vector search is a search of semantic similarity.) We will play with a few different versions of the query and see how Postgres decides to execute them.</p><p><strong>Using vector search and time filters in AI applications: </strong>Once we’ve covered the techniques for combining vector similarity search and time filters, we’ll briefly explore <a href="#using-vector-search-and-time-filters-in-ai-applications" rel="noreferrer">how to apply what we learned to build GenAI applications</a>.</p><h2 id="query-plans">Query Plans</h2><p>When discussing query tuning in PostgreSQL, an essential concept to grasp is a "query plan." <a href="https://www.postgresql.org/docs/current/using-explain.html"><u>A query plan is a set of steps</u></a> the PostgreSQL query planner generates describing how PostgreSQL will execute a SQL query. Various factors, including the query structure, the database schema, and the statistical information about the data distribution in the tables, influence the chosen query plan.</p><p>To evaluate a query plan in PostgreSQL, you can use the <a href="https://www.postgresql.org/docs/current/sql-explain.html"><u><code>EXPLAIN</code></u></a> command followed by your query. This command returns the planner's chosen execution plan without executing the query. It will show the plan's overall cost and each step's cost. Cost is a unitless measure but can be used to compare the relative performance of one query to another. For a more detailed analysis, including the execution times, you can use <code>EXPLAIN ANALYZE</code>, which executes the query. 
However, be cautious with <code>EXPLAIN ANALYZE</code> on production systems, as it can affect performance.</p><p>Don’t worry! We won’t get too bogged down in the minutiae of query plans. We will use <code>EXPLAIN ANALYZE</code> to compare the speed and efficiency of various queries, but we will keep it light.</p><p><strong>Note</strong>: Because many variables influence query plans, you may see query plans that are not exactly like the ones depicted; however, they should be similar enough to draw the same conclusions.</p><p><strong>Another note</strong>: If you want to see a visual representation of query plans, there is an excellent, free tool for this at <a href="https://explain.dalibo.com/"><u>https://explain.dalibo.com/</u></a></p><h2 id="the-setup">The Setup</h2><p>We will load a table with a sample dataset and run queries against it to explore time-based filtering and <a href="https://www.tigerdata.com/learn/vector-search-vs-semantic-search" rel="noreferrer">semantic search</a>. We will use a modified version of the Cohere <a href="https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings"><u>wikipedia-22-12-simple-embeddings</u></a> dataset hosted on <a href="https://huggingface.co/"><u>Huggingface</u></a>. It contains embeddings of <a href="https://simple.wikipedia.org/wiki/Main_Page"><u>Simple English Wikipedia</u></a> entries. </p><p>We added synthetic data: a time column, category, and tags. We loaded the data into a Postgres table and exported it to a CSV file; therefore, the format has changed. The original dataset on Huggingface is available under the Apache 2.0 license, and thus, our modified version is also subject to the Apache 2.0 license.</p><p>You will need a database if you want to follow along. 
Head to the <a href="https://console.cloud.timescale.com/signup?utm_campaign=vectorlaunch&amp;utm_source=timescale-blog&amp;utm_medium=direct&amp;utm_content=filtering-blog"><u>Timescale console</u></a> (you can try pgai on Timescale for free with a 90-day extended trial) and create a “<a href="https://www.tigerdata.com/blog/time-series-introduction" rel="noreferrer">Time Series</a> and Analytics” service. Building vector indexes can be a compute-hungry activity, so choose 2 CPU / 8 GiB Memory compute. Choose a region close to you for the best experience.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/10/Refining-Vector-Search-Queries-With-Time-Filters-in-Pgvector_Timescale-UI.png" class="kg-image" alt="The Configure your service page in the Timescale Cloud UI" loading="lazy" width="645" height="759" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2024/10/Refining-Vector-Search-Queries-With-Time-Filters-in-Pgvector_Timescale-UI.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/10/Refining-Vector-Search-Queries-With-Time-Filters-in-Pgvector_Timescale-UI.png 645w"></figure><p>We need some sample data to load into our database. The dataset we will use is <a href="https://huggingface.co/datasets/timescale/wikipedia-22-12-simple-embeddings"><u>available</u></a> on Huggingface. <a href="https://huggingface.co/datasets/timescale/wikipedia-22-12-simple-embeddings/resolve/main/wiki.csv?download=true"><u>Use this link to download the CSV file</u></a>.</p><p>Make sure you have the <a href="https://www.tigerdata.com/blog/how-to-install-psql-on-mac-ubuntu-debian-windows/" rel="noreferrer"><u><code>psql</code> client installed locally</u></a>. You will need <code>psql</code> version 16 or greater, because the <code>\bind</code> meta-command used later in this post was added in that release. 
Run <code>psql --version</code> to see what version you have.</p><p>Connect to your Timescale service using the <code>psql</code> client. Pass the connection URL as an argument. Make sure the <code>wiki.csv</code> file you downloaded is in the same directory where you launch <code>psql</code>.</p><pre><code class="language-bash">psql "&lt;connection url&gt;"</code></pre><p>We will use three extensions: timescaledb, pgvector, and timescale_vector. We need to ensure these extensions exist before creating our table.</p><pre><code class="language-PostgreSQL">create extension if not exists timescaledb;
create extension if not exists vector;
create extension if not exists timescale_vector;</code></pre><p>Run the following statement to create a table named <code>wiki</code>. The <code>embedding</code> column is of type <code>vector(768)</code> and will contain our <a href="https://platform.openai.com/docs/guides/embeddings/what-are-embeddings?ref=timescale.com"><u>embeddings</u></a>. The <a href="https://www.tigerdata.com/learn/postgresql-extensions-pgvector" rel="noreferrer">pgvector extension</a> provides the <code>vector</code> type.</p><pre><code class="language-SQL">create table public.wiki
( id int not null
, time timestamptz not null
, contents text
, meta jsonb
, embedding vector(768)
);</code></pre><p>Next, run the following command to convert the plain Postgres table into a TimescaleDB hypertable. The hypertable divides the dataset into day-sized sub-tables called chunks based on the values in the <code>time</code> column.</p><pre><code class="language-SQL">select create_hypertable
( 'public.wiki'::regclass
, 'time'::name
, chunk_time_interval=&gt;'1d'::interval
);</code></pre><p>Now that we have a place to put our data, we can load our data. Run this meta-command to load the data from the CSV file into the <code>wiki</code> table. Expect this to take a few minutes.</p><pre><code class="language-SQL">\copy wiki from 'wiki.csv' with (format csv, header on)
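-- Optional check, not part of the original walkthrough: list the chunks the load created.
-- Assumes the timescaledb_information.chunks view described in the TimescaleDB docs.
select chunk_name, range_start, range_end
from timescaledb_information.chunks
where hypertable_name = 'wiki'
order by range_start;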
</code></pre><p>Make sure Postgres has accurate statistics for our new table. Statistics will influence the query plans.</p><pre><code class="language-SQL">analyze wiki;</code></pre><p>You should find 485,859 rows in the <code>wiki</code> table if you have done this correctly.</p><pre><code class="language-SQL">select count(*) from wiki;</code></pre><p>Finally, let’s turn off parallelism to keep our query plans simple.</p><pre><code class="language-SQL">set max_parallel_workers_per_gather = 0;</code></pre><p>You are ready to go!</p><h2 id="filtering-documents-by-time">Filtering Documents by Time</h2><p>Our <code>wiki</code> table contains a <code>timestamptz</code> column named <code>time</code>. The data in this column is fake. We have added these timestamps so that we have something to illustrate filtering by time. You would use the timestamp associated with a web page or document in real-world use.</p><p>We will use the power of TimescaleDB's <a href="https://docs.timescale.com/use-timescale/latest/hypertables/about-hypertables/"><u>hypertables</u></a> to facilitate filtering by time. A hypertable is a logical table broken up into smaller physical tables called chunks. <a href="https://docs.timescale.com/use-timescale/latest/hypertables/change-chunk-intervals/"><u>Each chunk contains data falling within a time range, and the time ranges of the chunks do not overlap.</u></a> If you are familiar with <a href="https://www.postgresql.org/docs/current/ddl-partitioning.html"><u>partitioning</u></a> or <a href="https://www.postgresql.org/docs/current/ddl-inherit.html"><u>inheritance</u></a> in Postgres, you can think of hypertables in these terms.</p><p>Why is this helpful? When querying a hypertable and filtering on a given time range, TimescaleDB can do chunk exclusion: it ignores any chunks whose time ranges fall outside the bounds of the time range filter, and Postgres expends no compute resources to process the rows in the excluded chunks. 
Eliminating large chunks of the table through chunk exclusion speeds up time-filtered search.</p><p>There are 485,859 rows in the dataset. As we loaded the data, we assigned a date to each row and incremented the date each time we loaded 50,000 rows, giving us 10 chunks. Nine chunks have 50,000 rows, and the last chunk has 35,859. Below are the time range bounds and row counts for the 10 chunks.</p><pre><code class="language-text">┌─────────────┬────────────┬───────┐
│ range_start │ range_end  │ count │
├─────────────┼────────────┼───────┤
│ 2000-01-01  │ 2000-01-02 │ 50000 │
│ 2000-01-02  │ 2000-01-03 │ 50000 │
│ 2000-01-03  │ 2000-01-04 │ 50000 │
│ 2000-01-04  │ 2000-01-05 │ 50000 │
│ 2000-01-05  │ 2000-01-06 │ 50000 │
│ 2000-01-06  │ 2000-01-07 │ 50000 │
│ 2000-01-07  │ 2000-01-08 │ 50000 │
│ 2000-01-08  │ 2000-01-09 │ 50000 │
│ 2000-01-09  │ 2000-01-10 │ 50000 │
│ 2000-01-10  │ 2000-01-11 │ 35859 │
└─────────────┴────────────┴───────┘
(10 rows)</code></pre><p>The query below filters the <code>wiki</code> table on a time range. We use <code>explain analyze</code> to get Postgres to run the query and tell us what it did.</p><pre><code class="language-SQL">explain analyze
select id
from wiki
where '2000-01-04'::timestamptz &lt;= time and time &lt; '2000-01-06'::timestamptz
;</code></pre><p>The query plan shows that Postgres did a sequential scan on two chunks of the hypertable. Postgres returned 100,000 rows in total. Postgres excluded the other eight chunks and did not have to expend resources to process the rows in those chunks. Again, chunk exclusion is a hypertable’s superpower; it allows for faster and more efficient queries when filtering on a time range.</p><pre><code class="language-text">QUERY PLAN
Append&nbsp; (cost=0.00..10064.00 rows=100000 width=4) (actual time=0.005..30.934 rows=100000 loops=1)
&nbsp;&nbsp;-&gt;&nbsp; Seq Scan on _hyper_1_4_chunk&nbsp; (cost=0.00..4846.00 rows=50000 width=4) (actual time=0.005..12.453 rows=50000 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Filter: (('2000-01-04 00:00:00+00'::timestamp with time zone &lt;= "time") AND ("time" &lt; '2000-01-06 00:00:00+00'::timestamp with time zone))
&nbsp;&nbsp;-&gt;&nbsp; Seq Scan on _hyper_1_5_chunk&nbsp; (cost=0.00..4718.00 rows=50000 width=4) (actual time=0.006..12.622 rows=50000 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Filter: (('2000-01-04 00:00:00+00'::timestamp with time zone &lt;= "time") AND ("time" &lt; '2000-01-06 00:00:00+00'::timestamp with time zone))
Planning Time: 0.197 ms
Execution Time: 33.925 ms</code></pre><p>To illustrate the savings associated with the chunk exclusion, we can create a <code>wiki2</code> table exactly like <code>wiki</code> except <code>wiki2</code> is a plain table—not a hypertable.</p><p><strong>Note</strong>: the insert will take a minute or two. Be patient.</p><pre><code class="language-SQL">create table public.wiki2
( id int not null
, time timestamptz not null
, contents text
, meta jsonb
, embedding vector(768)
);

insert into wiki2 select * from wiki;

analyze wiki2;
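
-- Optional sanity check, not part of the original walkthrough:
-- both tables should report the same row count before we compare plans.
select
  (select count(*) from wiki)  as wiki_rows
, (select count(*) from wiki2) as wiki2_rows;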

explain analyze
select id
from wiki2
where '2000-01-04'::timestamptz &lt;= time and time &lt; '2000-01-06'::timestamptz
;</code></pre><p>Running the same query against <code>wiki2</code> gives us the following query plan. The plain table is more expensive and slower than the hypertable: an estimated cost of 44,853.14 versus 10,064.00, and 108.655 ms versus 33.925 ms of execution time. In this case, Postgres had to “look at” all 485,859 rows in the table to find the 100,000 that match the time filter.</p><pre><code class="language-text">QUERY PLAN
Seq Scan on wiki2&nbsp; (cost=0.00..44853.14 rows=102261 width=4) (actual time=31.259..105.586 rows=100000 loops=1)
&nbsp;&nbsp;Filter: (('2000-01-04 00:00:00+00'::timestamp with time zone &lt;= "time") AND ("time" &lt; '2000-01-06 00:00:00+00'::timestamp with time zone))
&nbsp;&nbsp;Rows Removed by Filter: 385859
Planning Time: 0.631 ms
Execution Time: 108.655 ms</code></pre><p>You may say this is not a fair comparison because any reasonable developer would put an index on <code>time</code>. Let's try that.</p><pre><code class="language-SQL">create index on wiki2 (time);</code></pre><p>Rerunning the same <code>explain analyze</code> query against <code>wiki2</code> now produces an index scan:</p><pre><code class="language-text">QUERY PLAN
Index Scan using wiki2_time_idx on wiki2&nbsp; (cost=0.42..10049.55 rows=102257 width=4) (actual time=0.013..26.817 rows=100000 loops=1)
&nbsp;&nbsp;Index Cond: (("time" &gt;= '2000-01-04 00:00:00+00'::timestamp with time zone) AND ("time" &lt; '2000-01-06 00:00:00+00'::timestamp with time zone))
Planning Time: 0.086 ms
Execution Time: 29.917 ms</code></pre><p>The table below summarizes the planner cost and execution time of the three approaches.</p><pre><code class="language-text">┌────────────┬──────────┬────────────┐
│&nbsp; &nbsp; type&nbsp; &nbsp; │ &nbsp; cost &nbsp; │&nbsp; &nbsp; time&nbsp; &nbsp; │
├────────────┼──────────┼────────────┤
│ hypertable │ 10064.00 │&nbsp; 33.925 ms │
│ plain&nbsp; &nbsp; &nbsp; │ 44853.14 │ 108.655 ms │
│ indexed&nbsp; &nbsp; │ 10049.55 │&nbsp; 29.917 ms │
└────────────┴──────────┴────────────┘</code></pre><p>At least for this dataset on this machine type, an indexed plain table is slightly cheaper and faster than a hypertable. So, why would we use a hypertable? There are a couple of reasons. Firstly, as this dataset gets larger, it becomes more likely that the hypertable will outperform the plain table. Secondly, Postgres will use only one index per table per query in most situations. We want to use a hypertable because we are not finished filtering, and we want to use a different index altogether.</p><h2 id="vector-search">Vector Search</h2><p>As a reminder, we ultimately want to combine time filtering and vector search. Let's explore vector search in isolation first. If you are already a confident user of similarity search with pgvector, feel free to <a href="https://www.tigerdata.com/blog/refining-vector-search-queries-with-time-filters-in-pgvector-a-tutorial/#vector-search-and-filtering-by-time" rel="noreferrer"><u>jump ahead</u></a>.</p><p>Our <code>wiki</code> table has a column named <code>embedding</code> of type <code>vector(768)</code>. The pgvector extension provides the <code>vector</code> datatype. This column contains the embeddings of the wiki content.</p><p>We will grab a random vector to use as a search parameter. The <code>psql</code> client makes this easy. We can select the <code>embedding</code> column from a row and use the <a href="https://www.postgresql.org/docs/current/app-psql.html#APP-PSQL-META-COMMAND-GSET"><u><code>\gset</code> meta-command</u></a> to create a variable in the psql client named <code>emb</code> containing the vector value. We can use this variable as a parameter for future queries. In this case, we will use the vector from the row with id 65272. In subsequent queries, we will search for rows where the embedding is semantically similar to the row with id 65272.</p><pre><code class="language-SQL">select embedding as emb
from wiki
where id = 65272
\gset</code></pre><p>Next, we query the <code>wiki</code> table and use <a href="https://docs.timescale.com/ai/latest/key-vector-database-concepts-for-understanding-pgvector/#vector-distance-types"><u>the <code>&lt;=&gt;</code> (cosine) distance operator </u></a>provided by pgvector and timescale_vector. This operator computes the distance between two vectors, and this distance is a representation of how semantically similar the two vectors are. If we order by the similarity distance and limit the results to 10, our query will return the 10 rows most semantically similar to the content from row 65272.</p><p><strong>Note</strong>: The <code>$1</code> in the query below denotes a query parameter, and the <a href="https://www.postgresql.org/docs/current/app-psql.html#APP-PSQL-META-COMMAND-BIND"><u><code>\bind</code> meta-command</u></a> assigns the value from our <code>emb</code> psql variable to this parameter.</p><pre><code class="language-SQL">select id, embedding &lt;=&gt; $1::vector as dist
from wiki
order by dist
limit 10
\bind :emb
;</code></pre><pre><code class="language-text">┌────────┬─────────────────────┐
│ &nbsp; id &nbsp; │&nbsp; &nbsp; &nbsp; &nbsp; dist &nbsp; &nbsp; &nbsp; &nbsp; │
├────────┼─────────────────────┤
│&nbsp; 65272 │ &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 0 │
│&nbsp; 16295 │&nbsp; 0.1115766717013853 │
│&nbsp; 16302 │ 0.11360477414644143 │
│&nbsp; 16297 │ 0.12387882580027065 │
│&nbsp; 16298 │&nbsp; 0.1299166956232466 │
│&nbsp; 34598 │ 0.13121629452715455 │
│&nbsp; 16292 │ 0.13307570937651947 │
│&nbsp; 61078 │ 0.13427528064358674 │
│ 215163 │ 0.13759063950532036 │
│&nbsp; 79614 │ 0.13770102915039784 │
└────────┴─────────────────────┘
(10 rows)</code></pre><p>As expected, row 65272 is the top result with a distance of 0, since its vector is exactly the one we borrowed for the query. Nine more results follow with increasing distance.</p><p>Looking at the query plan for this, we find that Postgres scanned the entire table and performed the distance calculation on the fly for each of the 485,859 rows. This execution took 3,441 milliseconds, considerably longer than any of the time-based filtering approaches we explored earlier.</p><pre><code class="language-SQL">explain analyze
select id, embedding &lt;=&gt; $1::vector as dist
from wiki
order by dist
limit 10
\bind :emb
;</code></pre><pre><code class="language-text">QUERY PLAN
Limit&nbsp; (cost=62004.36..62004.39 rows=10 width=12) (actual time=3440.996..3441.002 rows=10 loops=1)
&nbsp;&nbsp;-&gt;&nbsp; Sort&nbsp; (cost=62004.36..63219.01 rows=485859 width=12) (actual time=3440.994..3441.000 rows=10 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Sort Key: ((_hyper_1_1_chunk.embedding &lt;=&gt; '[...]'::vector))
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Sort Method: top-N heapsort&nbsp; Memory: 25kB
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Result&nbsp; (cost=0.00..51505.12 rows=485859 width=12) (actual time=0.030..3357.861 rows=485859 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Append&nbsp; (cost=0.00..45431.88 rows=485859 width=22) (actual time=0.015..329.158 rows=485859 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Seq Scan on _hyper_1_1_chunk&nbsp; (cost=0.00..4788.00 rows=50000 width=22) (actual time=0.015..30.596 rows=50000 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Seq Scan on _hyper_1_2_chunk&nbsp; (cost=0.00..4724.00 rows=50000 width=22) (actual time=0.013..31.533 rows=50000 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Seq Scan on _hyper_1_3_chunk&nbsp; (cost=0.00..4660.00 rows=50000 width=22) (actual time=0.012..31.167 rows=50000 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Seq Scan on _hyper_1_4_chunk&nbsp; (cost=0.00..4596.00 rows=50000 width=22) (actual time=0.012..30.917 rows=50000 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Seq Scan on _hyper_1_5_chunk&nbsp; (cost=0.00..4468.00 rows=50000 width=22) (actual time=0.013..29.304 rows=50000 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Seq Scan on _hyper_1_6_chunk&nbsp; (cost=0.00..4340.00 rows=50000 width=22) (actual time=0.012..28.831 rows=50000 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Seq Scan on _hyper_1_7_chunk&nbsp; (cost=0.00..4276.00 rows=50000 width=22) (actual time=0.012..29.065 rows=50000 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Seq Scan on _hyper_1_8_chunk&nbsp; (cost=0.00..4148.00 rows=50000 width=22) (actual time=0.013..28.186 rows=50000 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Seq Scan on _hyper_1_9_chunk&nbsp; (cost=0.00..3956.00 rows=50000 width=22) (actual time=0.012..27.490 rows=50000 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Seq Scan on _hyper_1_10_chunk&nbsp; (cost=0.00..3046.59 rows=35859 width=22) (actual time=0.011..20.635 rows=35859 loops=1)
Planning Time: 0.231 ms
Execution Time: 3441.034 ms</code></pre><p>Can we make vector search faster and cheaper? Yes, by <a href="https://docs.timescale.com/ai/latest/sql-interface-for-pgvector-and-timescale-vector/#indexing-the-vector-data-using-indexes-provided-by-pgvector-and-timescale-vector"><u>using an index</u></a>. The pgvector extension provides two types of indexes: <code>ivfflat</code> and <code>hnsw</code>. The <a href="https://timescale.ghost.io/blog/how-we-made-postgresql-the-best-vector-database/"><u>timescale_vector extension</u></a> provides a third: <code>tsv</code>. All three index types implement <a href="https://docs.timescale.com/ai/latest/key-vector-database-concepts-for-understanding-pgvector/#vector-search-indexing-approximate-nearest-neighbor-search"><u>approximate nearest-neighbor search algorithms</u></a>. None of them guarantees exact results, but they will make our searches considerably more efficient. As the number of vectors grows past a certain scale, exact searches become impractical.</p><p>This tutorial will use the <code>tsv</code> <a href="https://docs.timescale.com/ai/latest/sql-interface-for-pgvector-and-timescale-vector/#timescale-vector-index"><u>index type and its default settings</u></a>.</p><p><strong>Note</strong>: Expect creating the index to take 10 minutes or more.</p><pre><code class="language-SQL">create index on wiki using tsv (embedding);</code></pre><p>When we execute the same query again, we get slightly different results. Without the index, Postgres performs an exhaustive and exact search. With the index, Postgres performs an approximate search. We trade some measure of accuracy for speed and efficiency. We can see how fast and efficient it is by looking at the query plan.</p><pre><code class="language-SQL">select id, embedding &lt;=&gt; $1::vector as dist
from wiki
order by dist
limit 10
\bind :emb
;</code></pre><pre><code class="language-text">┌───────┬─────────────────────┐
│&nbsp; id &nbsp; │&nbsp; &nbsp; &nbsp; &nbsp; dist &nbsp; &nbsp; &nbsp; &nbsp; │
├───────┼─────────────────────┤
│ 65272 │ &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 0 │
│ 16302 │ 0.11360477414644143 │
│ 16295 │&nbsp; 0.1115766717013853 │
│ 16297 │ 0.12387882580027065 │
│ 16298 │&nbsp; 0.1299166956232466 │
│ 34598 │ 0.13121629452715455 │
│ 16292 │ 0.13307570937651947 │
│ 61078 │ 0.13427528064358674 │
│ 79614 │ 0.13770102915039784 │
│ 16307 │&nbsp; 0.1408329687219958 │
└───────┴─────────────────────┘
(10 rows)</code></pre><p>The query plan shows that the query had an estimated cost of 2720.81 and took 24.642 milliseconds to execute. The version of the query that did not use the index had an estimated cost of 62004.39 and took 3441.034 milliseconds. The index made a huge difference: the query is roughly 140 times faster and about 23 times cheaper by estimated cost. Instead of doing a sequential scan on all the chunks, Postgres did an index scan on each chunk.</p><pre><code class="language-text">QUERY PLAN
Limit&nbsp; (cost=2719.51..2720.81 rows=10 width=12) (actual time=24.317..24.578 rows=10 loops=1)
&nbsp;&nbsp;-&gt;&nbsp; Result&nbsp; (cost=2719.51..66108.97 rows=485859 width=12) (actual time=24.316..24.575 rows=10 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Merge Append&nbsp; (cost=2719.51..60035.73 rows=485859 width=22) (actual time=24.315..24.572 rows=10 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Sort Key: ((_hyper_1_1_chunk.embedding &lt;=&gt; '[...]'::vector))
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Index Scan using _hyper_1_1_chunk_wiki_embedding_idx on _hyper_1_1_chunk&nbsp; (cost=279.85..5496.65 rows=50000 width=22) (actual time=1.846..1.982 rows=7 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Order By: (embedding &lt;=&gt; '[...]'::vector)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Index Scan using _hyper_1_2_chunk_wiki_embedding_idx on _hyper_1_2_chunk&nbsp; (cost=279.85..5426.25 rows=50000 width=22) (actual time=2.247..2.361 rows=4 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Order By: (embedding &lt;=&gt; '[...]'::vector)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Index Scan using _hyper_1_3_chunk_wiki_embedding_idx on _hyper_1_3_chunk&nbsp; (cost=279.85..5355.85 rows=50000 width=22) (actual time=2.327..2.327 rows=1 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Order By: (embedding &lt;=&gt; '[...]'::vector)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Index Scan using _hyper_1_4_chunk_wiki_embedding_idx on _hyper_1_4_chunk&nbsp; (cost=279.85..5285.45 rows=50000 width=22) (actual time=2.265..2.265 rows=1 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Order By: (embedding &lt;=&gt; '[...]'::vector)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Index Scan using _hyper_1_5_chunk_wiki_embedding_idx on _hyper_1_5_chunk&nbsp; (cost=279.85..5144.65 rows=50000 width=22) (actual time=2.849..2.849 rows=1 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Order By: (embedding &lt;=&gt; '[...]'::vector)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Index Scan using _hyper_1_6_chunk_wiki_embedding_idx on _hyper_1_6_chunk&nbsp; (cost=279.85..5003.85 rows=50000 width=22) (actual time=2.625..2.625 rows=1 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Order By: (embedding &lt;=&gt; '[...]'::vector)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Index Scan using _hyper_1_7_chunk_wiki_embedding_idx on _hyper_1_7_chunk&nbsp; (cost=279.85..4933.45 rows=50000 width=22) (actual time=2.805..2.805 rows=1 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Order By: (embedding &lt;=&gt; '[...]'::vector)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Index Scan using _hyper_1_8_chunk_wiki_embedding_idx on _hyper_1_8_chunk&nbsp; (cost=279.85..4792.65 rows=50000 width=22) (actual time=2.496..2.497 rows=1 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Order By: (embedding &lt;=&gt; '[...]'::vector)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Index Scan using _hyper_1_9_chunk_wiki_embedding_idx on _hyper_1_9_chunk&nbsp; (cost=279.85..4581.45 rows=50000 width=22) (actual time=2.160..2.161 rows=1 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Order By: (embedding &lt;=&gt; '[...]'::vector)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Index Scan using _hyper_1_10_chunk_wiki_embedding_idx on _hyper_1_10_chunk&nbsp; (cost=200.69..3516.08 rows=35859 width=22) (actual time=2.689..2.689 rows=1 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Order By: (embedding &lt;=&gt; '[...]'::vector)
Planning Time: 0.288 ms
Execution Time: 24.642 ms
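</code></pre><p>The <code>Merge Append</code> node in the plan above can be pictured as a lazy k-way merge: each chunk's index scan hands back rows already ordered by distance, and the merge pulls only as many rows as the <code>Limit</code> needs. Here is a minimal Python sketch of that idea, run outside the database. The chunk names, ids, and distances are made up for illustration, and <code>merge_top_k</code> is a hypothetical helper, not part of Postgres or TimescaleDB.</p><pre><code class="language-python">import heapq
from itertools import islice

# Hypothetical per-chunk scan results as (distance, id) pairs.
# Each list is already sorted by distance, like an ordered index scan.
chunk_streams = {
    "chunk_a": [(0.00, 101), (0.13, 102), (0.20, 103)],
    "chunk_b": [(0.11, 201), (0.12, 202), (0.30, 203)],
    "chunk_c": [(0.14, 301), (0.25, 302)],
}


def merge_top_k(streams, k):
    """Merge streams that are each sorted by distance and keep the first k rows."""
    merged = heapq.merge(*streams)  # lazy k-way merge (the Merge Append role)
    return list(islice(merged, k))  # stop after k rows (the Limit role)


print(merge_top_k(chunk_streams.values(), 4))
# [(0.0, 101), (0.11, 201), (0.12, 202), (0.13, 102)]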
</code></pre><p>Notice that we got different results from the two queries. The version without the index had to calculate the exact distances on the fly—a big part of its slowness. The indexed version used an approximate nearest-neighbor algorithm. Its results were not exact, but they were delivered faster and cheaper.</p><h2 id="vector-search-and-filtering-by-time">Vector Search and Filtering by Time</h2><p>Now that we have explored filtering by time and vector search in isolation, can we combine the two in a single efficient query?</p><pre><code class="language-SQL">select id, embedding &lt;=&gt; $1::vector as dist
from wiki
where '2000-01-02'::timestamptz &lt;= time and time &lt; '2000-01-04'::timestamptz
order by dist
limit 10
\bind :emb
;</code></pre><p><strong>Note</strong>: Our results are "worse" in terms of semantic similarity because the time filter has eliminated some of the most similar rows. This outcome is expected. In exchange, the rows we get are more <em>relevant</em>, because rows with high similarity outside our time period of interest are excluded.</p><pre><code class="language-text">┌────────┬─────────────────────┐
│ &nbsp; id &nbsp; │&nbsp; &nbsp; &nbsp; &nbsp; dist &nbsp; &nbsp; &nbsp; &nbsp; │
├────────┼─────────────────────┤
│&nbsp; 65272 │ &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 0 │
│&nbsp; 61078 │ 0.13427528064358674 │
│&nbsp; 79614 │ 0.13770102915039784 │
│&nbsp; 79612 │ 0.14383225467703376 │
│&nbsp; 61080 │&nbsp; 0.1444315737261198 │
│&nbsp; 52634 │&nbsp; 0.1436081460507892 │
│&nbsp; 64473 │ 0.14383654656162292 │
│&nbsp; 61079 │&nbsp; 0.1474735568613179 │
│ 131920 │ 0.15066526846960182 │
│&nbsp; 64475 │ &nbsp; 0.154294958183531 │
└────────┴─────────────────────┘
(10 rows)</code></pre><p>The query plan reveals that Postgres only considers the two chunks that match the time filter and uses the <code>tsv</code> vector indexes associated with those chunks. We have the best of both worlds!</p><pre><code class="language-text">QUERY PLAN
Limit&nbsp; (cost=559.71..561.01 rows=10 width=12) (actual time=4.631..5.025 rows=10 loops=1)
&nbsp;&nbsp;-&gt;&nbsp; Result&nbsp; (cost=559.71..13532.11 rows=100000 width=12) (actual time=4.630..5.023 rows=10 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Merge Append&nbsp; (cost=559.71..12282.11 rows=100000 width=22) (actual time=4.628..5.019 rows=10 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Sort Key: ((_hyper_1_2_chunk.embedding &lt;=&gt; '[...]'::vector))
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Index Scan using _hyper_1_2_chunk_wiki_embedding_idx on _hyper_1_2_chunk&nbsp; (cost=279.85..5676.25 rows=50000 width=22) (actual time=2.349..2.682 rows=9 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Order By: (embedding &lt;=&gt; '[...]'::vector)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Filter: (('2000-01-02 00:00:00+00'::timestamp with time zone &lt;= "time") AND ("time" &lt; '2000-01-04 00:00:00+00'::timestamp with time zone))
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Index Scan using _hyper_1_3_chunk_wiki_embedding_idx on _hyper_1_3_chunk&nbsp; (cost=279.85..5605.85 rows=50000 width=22) (actual time=2.277..2.334 rows=2 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Order By: (embedding &lt;=&gt; '[...]'::vector)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Filter: (('2000-01-02 00:00:00+00'::timestamp with time zone &lt;= "time") AND ("time" &lt; '2000-01-04 00:00:00+00'::timestamp with time zone))
Planning Time: 0.257 ms
Execution Time: 5.056 ms</code></pre><p>Could we have achieved the same with the indexed plain table? Let's add a <code>tsv</code> index to <code>wiki2</code> and find out.</p><p><strong>Note</strong>: Expect creating the index to take 15 minutes or more.</p><pre><code class="language-SQL">create index on wiki2 using tsv (embedding);</code></pre><pre><code class="language-SQL">explain analyze
select id, embedding &lt;=&gt; $1::vector as dist
from wiki2
where '2000-01-04'::timestamptz &lt;= time and time &lt; '2000-01-06'::timestamptz
order by dist
limit 10
\bind :emb
;</code></pre><p>In most situations, Postgres can only use one index per table at a time. At least in this specific example on this machine, it chooses to use the vector index on <code>embedding</code>. It then filters on time on the fly. If you run this, you may get the same result, or Postgres may choose to do an index scan on the <code>time</code> column to filter by time and then do on-the-fly distance calculations on the vectors.</p><p>In this case, the query is as fast as the <a href="https://www.tigerdata.com/blog/database-indexes-in-postgresql-and-timescale-cloud-your-questions-answered" rel="noreferrer">hypertable</a> approach but nearly five times more expensive (an estimated cost of 2714.22 versus 561.01). Crucially, as the dataset grows, the hypertable version will increasingly outperform the plain-table version because the hypertable will use chunk exclusion.</p><pre><code class="language-text">QUERY PLAN
Limit&nbsp; (cost=2709.44..2714.22 rows=10 width=12) (actual time=3.029..5.078 rows=10 loops=1)
&nbsp;&nbsp;-&gt;&nbsp; Index Scan using wiki2_embedding_idx on wiki2&nbsp; (cost=2709.44..51574.47 rows=102257 width=12) (actual time=3.028..5.074 rows=10 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Order By: (embedding &lt;=&gt; '[...]'::vector)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Filter: (('2000-01-04 00:00:00+00'::timestamp with time zone &lt;= "time") AND ("time" &lt; '2000-01-06 00:00:00+00'::timestamp with time zone))
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Rows Removed by Filter: 73
Planning Time: 0.093 ms
Execution Time: 5.096 ms</code></pre><h3 id="other-approaches-that-did-not-perform-as-well">Other approaches that did not perform as well</h3><p>Can we force Postgres to use both indexes on one plain table? Yes! We can use a subquery. We will order by the vector distance using the vector index in the subquery. In the outer query, we will filter on time using the index on the <code>time</code> column. By joining the two, we will only get results that match both criteria. In other words, querying this way creates two result sets (one via each index), which are then intersected.</p><p><strong>Note</strong>: Technically, we cannot force Postgres to choose to use indexes. Postgres does not have built-in query hints. However, the structure of a query influences the plan. By restructuring the query, we can politely suggest the plan we want.</p><pre><code class="language-SQL">explain analyze
select w.id, x.dist
from wiki2 w
inner join
(
  select id, embedding &lt;=&gt; $1::vector as dist
  from wiki2
  order by dist
  limit 10
) x on (w.id = x.id)
where '2000-01-02'::timestamptz &lt;= time and time &lt; '2000-01-04'::timestamptz
order by x.dist
\bind :emb
;</code></pre><pre><code class="language-text">QUERY PLAN
Sort&nbsp; (cost=12868.26..12868.27 rows=2 width=12) (actual time=36.700..36.703 rows=3 loops=1)
&nbsp;&nbsp;Sort Key: x.dist
&nbsp;&nbsp;Sort Method: quicksort&nbsp; Memory: 25kB
&nbsp;&nbsp;-&gt;&nbsp; Hash Join&nbsp; (cost=2711.06..12868.25 rows=2 width=12) (actual time=5.720..36.693 rows=3 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Hash Cond: (w.id = x.id)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Index Scan using wiki2_time_idx on wiki2 w&nbsp; (cost=0.42..9784.27 rows=99553 width=4) (actual time=0.013..25.961 rows=100000 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Index Cond: (("time" &gt;= '2000-01-02 00:00:00+00'::timestamp with time zone) AND ("time" &lt; '2000-01-04 00:00:00+00'::timestamp with time zone))
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Hash&nbsp; (cost=2710.51..2710.51 rows=10 width=12) (actual time=3.111..3.112 rows=10 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Buckets: 1024&nbsp; Batches: 1&nbsp; Memory Usage: 9kB
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Subquery Scan on x&nbsp; (cost=2709.44..2710.51 rows=10 width=12) (actual time=2.693..3.096 rows=10 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Limit&nbsp; (cost=2709.44..2710.41 rows=10 width=12) (actual time=2.692..3.092 rows=10 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Index Scan using wiki2_embedding_idx on wiki2&nbsp; (cost=2709.44..50104.18 rows=485859 width=12) (actual time=2.690..3.089 rows=10 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Order By: (embedding &lt;=&gt; '[...]'::vector)
Planning Time: 0.258 ms
Execution Time: 36.737 ms</code></pre><p>The query plan shows us that Postgres used both indexes. Surprisingly, having Postgres use both indexes is slower and more costly than both previous approaches. While the subquery returns only 10 rows, the index scan on <code>time</code> in the outer query returns 100,000. Postgres then has to probe each of those 100,000 rows against a hash table built from the 10 subquery rows. Fetching the 100,000 rows and probing them against the 10 is where the bulk of the execution time goes.</p><pre><code class="language-text">┌───────┬─────────────────────┐
│&nbsp; id &nbsp; │&nbsp; &nbsp; &nbsp; &nbsp; dist &nbsp; &nbsp; &nbsp; &nbsp; │
├───────┼─────────────────────┤
│ 65272 │ &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 0 │
│ 61078 │ 0.13427528064358674 │
│ 79614 │ 0.13770102915039784 │
└───────┴─────────────────────┘
(3 rows)</code></pre><p>More importantly, we only get three results using the two-index approach. Why is this the case? The vector search subquery considers all the rows regardless of time. It returns the 10 most similar rows from the whole dataset. Unfortunately, seven of these 10 fall outside the time range we are filtering on and, therefore, get excluded from the results. This effect illustrates the classic post-filtering problem common in vector search use cases. Filtering by time after a similarity search can leave you with fewer results than expected or even none at all.</p><p>What happens if we turn this query “inside out”? We can put the time filtering in a subquery and do the vector search in the outer query like so:</p><pre><code class="language-SQL">explain analyze
select w.id, w.embedding &lt;=&gt; $1::vector as dist
from wiki2 w
where exists
(
  select 1
  from wiki2 x
  where '2000-01-02'::timestamptz &lt;= x.time and x.time &lt; '2000-01-04'::timestamptz
  and w.id = x.id
)
order by dist
limit 10
\bind :emb
;</code></pre><pre><code class="language-text">┌────────┬─────────────────────┐
│ &nbsp; id &nbsp; │&nbsp; &nbsp; &nbsp; &nbsp; dist &nbsp; &nbsp; &nbsp; &nbsp; │
├────────┼─────────────────────┤
│&nbsp; 65272 │ &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 0 │
│&nbsp; 61078 │ 0.13427528064358674 │
│&nbsp; 79614 │ 0.13770102915039784 │
│&nbsp; 52634 │&nbsp; 0.1436081460507892 │
│&nbsp; 79612 │ 0.14383225467703376 │
│&nbsp; 64473 │ 0.14383654656162292 │
│&nbsp; 61080 │&nbsp; 0.1444315737261198 │
│&nbsp; 61079 │&nbsp; 0.1474735568613179 │
│ 131920 │ 0.15066526846960182 │
│&nbsp; 64475 │ &nbsp; 0.154294958183531 │
└────────┴─────────────────────┘
(10 rows)
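</code></pre><p>The gap between the three rows returned by the join query earlier and the ten rows returned here comes down to the order of the two steps: ranking first and filtering afterwards (post-filtering) versus filtering first and ranking afterwards (pre-filtering). The following Python sketch, run outside the database, shows the effect on a handful of made-up rows. The ids, day numbers, and distances are assumptions for illustration only, and the sketch says nothing about how Postgres executes either query.</p><pre><code class="language-python">import heapq

# Hypothetical rows as (id, day, distance to the query vector).
rows = [
    (1, 2, 0.00), (2, 5, 0.05), (3, 5, 0.06), (4, 3, 0.10),
    (5, 6, 0.11), (6, 2, 0.12), (7, 7, 0.13), (8, 3, 0.20),
]
wanted_days = range(2, 4)  # the time filter: days 2 and 3
k = 3


def by_distance(row):
    return row[2]


# Post-filtering: take the k nearest rows overall, then apply the time filter.
nearest = heapq.nsmallest(k, rows, key=by_distance)
post_filtered = [row for row in nearest if row[1] in wanted_days]

# Pre-filtering: apply the time filter first, then take the k nearest rows.
in_range = [row for row in rows if row[1] in wanted_days]
pre_filtered = heapq.nsmallest(k, in_range, key=by_distance)

print(len(post_filtered), len(pre_filtered))  # 1 3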
</code></pre><pre><code class="language-text">QUERY PLAN
Limit&nbsp; (cost=58235.37..58235.39 rows=10 width=12) (actual time=1394.904..1394.909 rows=10 loops=1)
&nbsp;&nbsp;-&gt;&nbsp; Sort&nbsp; (cost=58235.37..58484.25 rows=99553 width=12) (actual time=1394.902..1394.906 rows=10 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Sort Key: ((w.embedding &lt;=&gt; '[...]'::vector))
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Sort Method: top-N heapsort&nbsp; Memory: 25kB
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Hash Semi Join&nbsp; (cost=11028.68..56084.06 rows=99553 width=12) (actual time=294.332..1380.233 rows=100000 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Hash Cond: (w.id = x.id)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Seq Scan on wiki2 w&nbsp; (cost=0.00..42423.59 rows=485859 width=22) (actual time=0.034..616.130 rows=485859 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Hash&nbsp; (cost=9784.27..9784.27 rows=99553 width=4) (actual time=40.376..40.378 rows=100000 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Buckets: 131072&nbsp; Batches: 1&nbsp; Memory Usage: 4540kB
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Index Scan using wiki2_time_idx on wiki2 x&nbsp; (cost=0.42..9784.27 rows=99553 width=4) (actual time=0.016..28.235 rows=100000 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Index Cond: (("time" &gt;= '2000-01-02 00:00:00+00'::timestamp with time zone) AND ("time" &lt; '2000-01-04 00:00:00+00'::timestamp with time zone))
Planning Time: 0.164 ms
Execution Time: 1394.960 ms
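</code></pre><p>In this plan, the <code>Sort</code> node ranks exact distances with a top-N heapsort rather than walking the vector index. A rough Python equivalent, run outside the database, looks like the sketch below. It assumes tiny made-up two-dimensional embeddings; <code>cosine_distance</code> and <code>exact_top_k</code> are hypothetical helpers written for this sketch, and <code>heapq.nsmallest</code> stands in for the top-N heapsort. The distance is computed as one minus the cosine similarity, which is what the <code>&lt;=&gt;</code> operator reports.</p><pre><code class="language-python">import heapq
import math


def cosine_distance(a, b):
    """Return 1 minus the cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical two-dimensional embeddings keyed by row id.
embeddings = {
    1: [1.0, 0.0],
    2: [0.0, 1.0],
    3: [1.0, 1.0],
    4: [-1.0, 0.0],
}
query = [1.0, 0.0]


def exact_top_k(vectors, query_vector, k):
    """Score every row exactly, then keep the k smallest distances."""
    scored = (
        (cosine_distance(vector, query_vector), row_id)
        for row_id, vector in vectors.items()
    )
    return heapq.nsmallest(k, scored)


print([row_id for _, row_id in exact_top_k(embeddings, query, 2)])  # [1, 3]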
</code></pre><p>Unfortunately, Postgres used the index on <code>time</code> but chose to ignore the vector index. It computed the exact distances on the fly instead of using an approximate nearest-neighbor search. Unsurprisingly, this approach is slower and more expensive than the single-index query plan on the plain table (1394.960 milliseconds versus 5.096 milliseconds). It is also much slower and more costly than the hypertable approach.</p><h2 id="using-vector-search-and-time-filters-in-ai-applications">Using Vector Search and Time Filters in AI Applications</h2><p>Incorporating time-based filtering with semantic search allows AI applications to offer users a more nuanced, context-aware interaction. This approach ensures that the information, products, or content presented is not just relevant but also timely, providing a more valuable user experience.</p><ol><li><strong>Content discovery platforms</strong>: Imagine a streaming service that not only recommends movies based on your viewing history but also prioritizes new releases or timely content, such as holiday-themed movies. By applying time filters to semantic search, these platforms can dynamically adjust recommendations, ensuring users are presented with content that reflects not just their interests but also their recent viewing history or seasonal trends.</li><li><strong>News aggregation</strong>: Time-based filtering can change how readers engage with information in news and content aggregation. Time filters can prioritize the latest news, ensuring readers can access the most current information and trends relevant to their query, while semantic search simultaneously sifts through the articles to find ones related to a user's interests.</li><li><strong>E-commerce</strong>: For e-commerce platforms, combining semantic search with time-based filters can enhance shopping experiences by promoting products related to a user’s recent purchases, upcoming events, or seasonal trends. 
For instance, suggesting winter sports gear as the season approaches or highlighting flash sale items within a limited time frame makes the shopping experience more relevant and timely.</li><li><strong>Academic research</strong>: Researchers often seek their field's most recent studies or articles. By integrating time filters with semantic search, academic databases can provide search results that are contextually relevant and prioritize the latest research, facilitating access to cutting-edge knowledge and discoveries.</li></ol><h2 id="next-steps">Next Steps</h2><p>So, what have we learned? We can combine vector search with time-based filtering in Postgres to retrieve more temporally relevant vectors. While we can achieve this with a plain table, we can write simpler, faster, and more efficient queries that continue to perform as your dataset grows using TimescaleDB’s hypertables.</p><p>Use pgvector on Timescale today! <a href="https://console.cloud.timescale.com/signup?utm_campaign=vectorlaunch&amp;utm_source=timescale-blog&amp;utm_medium=direct&amp;utm_content=filtering-blog"><u>Follow this link</u></a> to get a 90-day free trial!</p><p>Read more:</p><ul><li><a href="https://docs.timescale.com/ai/latest/"><u>pgai on Timescale documentation</u></a></li><li><a href="https://youtu.be/JDVU0k30cGA?si=_xoIDzbtQQ8WJen9"><u>pgvector and Timescale Vector Up and Running</u></a></li><li><a href="https://youtu.be/EYMZVfKcRzM?si=o2BHV6fEqpaxjufO"><u>LlamaIndex Webinar: Time-based retrieval for RAG (with Timescale)</u></a></li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Understanding PostgreSQL Aggregation and Hyperfunctions’ Design]]></title>
            <description><![CDATA[A discussion of aggregation in PostgreSQL and how it integrates with the design of Timescale’s hyperfunctions.]]></description>
            <link>https://www.tigerdata.com/blog/how-postgresql-aggregation-works-and-how-it-inspired-our-hyperfunctions-design</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/how-postgresql-aggregation-works-and-how-it-inspired-our-hyperfunctions-design</guid>
            <category><![CDATA[Product & Engineering]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Analytics]]></category>
            <dc:creator><![CDATA[David Kohn]]></dc:creator>
            <pubDate>Thu, 11 Jan 2024 17:06:03 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/01/PostgreSQL-Aggregation-and-Hyperfunctions-Design--1-.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/01/PostgreSQL-Aggregation-and-Hyperfunctions-Design--1-.png" alt="A group of elephants (the Postgres logo is an elephant). Understanding PostgreSQL aggregation and hyperfunctions' design." /><p>At Timescale, our goal is always to focus on the developer experience, and we take great care to design our products and APIs to be developer-friendly. This focus on developer experience is why we decided <a href="https://timescale.ghost.io/blog/blog/when-boring-is-awesome-building-a-scalable-time-series-database-on-postgresql-2900ea453ee2/">early in the design of TimescaleDB to build on top of PostgreSQL</a>. We believed then, as we do now, that building on the world’s fastest-growing database would have numerous benefits for our users.</p><p>The same logic applies to many of our features, including <a href="https://www.timescale.com/learn/time-series-data-analysis-hyperfunctions" rel="noreferrer">hyperfunctions</a>. 
Timescale's hyperfunctions, <a href="https://timescale.ghost.io/blog/introducing-hyperfunctions-new-sql-functions-to-simplify-working-with-time-series-data-in-postgresql/" rel="noreferrer">a series of SQL functions within TimescaleDB that make it easier to manipulate and analyze time-series data in PostgreSQL with fewer lines of code</a>, are designed to enhance PostgreSQL's native aggregation capabilities.</p><p>So, let's have a closer look at how PostgreSQL aggregation works and how it has influenced the design of Timescale's hyperfunctions.</p><div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-emoji">✨</div><div class="kg-callout-text">Before we start, here's a recap on <a href="https://www.timescale.com/learn/understanding-sql-aggregate-functions" rel="noreferrer">SQL aggregate functions</a> and how to use them.</div></div><h2 id="a-primer-on-postgresql-aggregation">A Primer on PostgreSQL Aggregation</h2><p>We'll start by going over PostgreSQL aggregates. But first, a little backstory.</p><p>When I first started learning about PostgreSQL five or six years ago (I was an electrochemist and was dealing with lots of battery data, as mentioned in <a href="https://timescale.ghost.io/blog/blog/what-time-weighted-averages-are-and-why-you-should-care/">my last post on time-weighted averages</a>), I ran into some performance issues. I was trying to understand better what was happening inside the database to improve its performance—and that’s when I found <a href="https://momjian.us">Bruce Momjian</a>’s talks on <a href="https://momjian.us/main/presentations/internals.html">PostgreSQL Internals Through Pictures</a>. Bruce is well-known in the community for his insightful talks (and his penchant for bow ties); his sessions were a revelation for me. </p><p>They’ve served as a foundation for my understanding of how PostgreSQL works ever since. 
He explained things so clearly, and I’ve always learned best when I can visualize what’s going on, so the “through pictures” part really helped—and stuck with—me. </p><p>So, this next bit is my attempt to channel Bruce by explaining some PostgreSQL internals through pictures. Cinch up your bow ties and get ready for some learnin'.</p>
<!--kg-card-begin: html-->
<figure>
	<iframe height="100px" width="100px" style="min-width: 100%; min-height: 330px" 		src="https://s3.amazonaws.com/blog.timescale.com/gifs/player.html?source=https://s3.amazonaws.com/blog.timescale.com/gifs/how-postgres-works/david_bowtie.mp4" frameborder="0" class="gif-stand-in">
	</iframe>
    <figcaption class="gif-caption">The author pays homage to Bruce Momjian (and looks rather pleased with himself because he’s managed to tie a bow tie on the first try).
    </figcaption>
</figure>

<!--kg-card-end: html-->
<h3 id="postgresql-aggregates-vs-functions">PostgreSQL aggregates vs. functions</h3><p>We have written about <a href="https://timescale.ghost.io/blog/blog/introducing-hyperfunctions-new-sql-functions-to-simplify-working-with-time-series-data-in-postgresql/">how we use custom functions and aggregates to extend SQL</a>, but we haven’t exactly explained the difference <em>between</em> them.</p><p>The fundamental difference between an aggregate function and a “regular” function in SQL is that an <strong>aggregate</strong> produces a single result from a <em>group</em> of related rows, while a regular <strong>function</strong> produces a result for <em>each</em> row:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/08/Aggregate-vs.-Function-2.jpg" class="kg-image" alt="A side-by-side diagram depicting an “aggregate” side and a “function” side and how each produces results. There are three individual rows on the aggregate side, with arrows that point to a single result; on the function side, there are three individual rows, with arrows that point to three different results (one per row). 
" loading="lazy" width="1804" height="752" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2021/08/Aggregate-vs.-Function-2.jpg 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2021/08/Aggregate-vs.-Function-2.jpg 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2021/08/Aggregate-vs.-Function-2.jpg 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/08/Aggregate-vs.-Function-2.jpg 1804w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">In SQL, aggregates produce a result from multiple rows, while functions produce a result per row.</span></figcaption></figure><p>This is not to say that a function can’t have inputs from multiple columns; they just have to come from the same row. </p><p>Another way to think about it is that functions often act on rows, whereas aggregates act on columns. To illustrate this, let’s consider a theoretical table <code>foo</code> with two columns:</p><pre><code class="language-SQL">CREATE TABLE foo(
	bar DOUBLE PRECISION,
	baz DOUBLE PRECISION);
</code></pre><p>And just a few values so we can easily see what’s going on:</p><pre><code class="language-SQL">INSERT INTO foo(bar, baz) VALUES (1.0, 2.0), (2.0, 4.0), (3.0, 6.0);
</code></pre><p>The function <a href="https://www.postgresql.org/docs/13/functions-conditional.html#FUNCTIONS-GREATEST-LEAST"><code>greatest()</code></a> will produce the largest of the values in columns <code>bar</code> and <code>baz</code> for each row:</p><p></p><pre><code class="language-SQL">SELECT greatest(bar, baz) FROM foo; 
 greatest 
----------
        2
        4
        6
</code></pre><p>Whereas the aggregate <a href="https://www.postgresql.org/docs/current/functions-aggregate.html"><code>max()</code></a> will produce the largest value from each column:</p><pre><code class="language-SQL">SELECT max(bar) as bar_max, max(baz) as baz_max FROM foo;

 bar_max | baz_max 
---------+---------
       3 |       6
</code></pre><p>Using the above data, here’s a picture of what happens when we aggregate something:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/08/Aggregate-vs.-Function-1-1.jpg" class="kg-image" alt="A diagram showing how the statement: `SELECT max(bar) FROM foo;` works: multiple rows with values of “bar equal to” 1.0, 2.0, and 3.0, go through the `max(bar)` aggregate to ultimately produce a result of 3.0. " loading="lazy" width="1804" height="600" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2021/08/Aggregate-vs.-Function-1-1.jpg 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2021/08/Aggregate-vs.-Function-1-1.jpg 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2021/08/Aggregate-vs.-Function-1-1.jpg 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/08/Aggregate-vs.-Function-1-1.jpg 1804w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The </span><code spellcheck="false" style="white-space: pre-wrap;"><span>max()</span></code><span style="white-space: pre-wrap;"> aggregate gets the largest value from multiple rows.</span></figcaption></figure><p>The aggregate takes inputs from multiple rows and produces a single result. That’s the main difference between it and a function, but how does it do that? Let’s look at what it’s doing under the hood.</p><h3 id="aggregate-internals-row-by-row">Aggregate internals: Row-by-row</h3><p>Under the hood, aggregates in PostgreSQL work row-by-row. 
But then, how does an aggregate know anything about the previous rows?</p><p>An aggregate stores some state about the rows it has previously seen, and as the database sees new rows, it updates that internal state.</p><p>For the <code>max()</code> aggregate we’ve been discussing, the internal state is simply the largest value we’ve collected so far. </p><p>Let’s take this step-by-step.</p><p>When we start, our internal state is <code>NULL</code> because we haven’t seen any rows yet:</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/08/1-2.jpg" class="kg-image" alt="Flowchart arrow diagram representing the max open parens bar close parens aggregate, with three rows below the arrow where bar is equal to 1.0, 2.0, and 3.0, respectively. There is a box in the arrow in which the state is equal to NULL. " loading="lazy" width="1600" height="904" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2021/08/1-2.jpg 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2021/08/1-2.jpg 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/08/1-2.jpg 1600w" sizes="(min-width: 720px) 720px"></figure><p>Then, we get our first row in:</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/08/2-2.jpg" class="kg-image" alt="" loading="lazy" width="1600" height="904" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2021/08/2-2.jpg 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2021/08/2-2.jpg 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/08/2-2.jpg 1600w" sizes="(min-width: 720px) 720px"></figure><p>Since our 
state is <code>NULL</code>, we initialize it to the first value we see:</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/08/3-1.jpg" class="kg-image" alt="The same flowchart diagram, except that row one has moved _out_ of the arrow, and the state has been updated from NULL to the 1.0, row one’s value. " loading="lazy" width="1600" height="904" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2021/08/3-1.jpg 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2021/08/3-1.jpg 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/08/3-1.jpg 1600w" sizes="(min-width: 720px) 720px"></figure><p>Now, we get our second row:</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/08/4-2.jpg" class="kg-image" alt="The same flowchart diagram, except that row two has moved into the arrow representing the max aggregate. 
" loading="lazy" width="1600" height="904" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2021/08/4-2.jpg 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2021/08/4-2.jpg 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/08/4-2.jpg 1600w" sizes="(min-width: 720px) 720px"></figure><p>And we see that the value of bar (2.0) is greater than our current state (1.0), so we update the state:</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/08/5-2.jpg" class="kg-image" alt=" The same diagram, except that row two has moved out of the max aggregate, and the state has been updated to the largest value (the value of row two, 2.0). " loading="lazy" width="1600" height="904" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2021/08/5-2.jpg 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2021/08/5-2.jpg 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/08/5-2.jpg 1600w" sizes="(min-width: 720px) 720px"></figure><p>Then, the next row comes into the aggregate:</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/08/6-1.jpg" class="kg-image" alt="The same diagram, except that the row three has moved into the arrow representing the max aggregate. 
" loading="lazy" width="1600" height="904" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2021/08/6-1.jpg 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2021/08/6-1.jpg 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/08/6-1.jpg 1600w" sizes="(min-width: 720px) 720px"></figure><p>We compare it to our current state, take the greatest value, and update our state:</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/08/7-1.jpg" class="kg-image" alt="The same diagram, expect that row three has moved out of the max aggregate, and the state has been updated to the largest value, the value of the third row, 3.0." loading="lazy" width="1600" height="904" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2021/08/7-1.jpg 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2021/08/7-1.jpg 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/08/7-1.jpg 1600w" sizes="(min-width: 720px) 720px"></figure><p>Finally, we don’t have any more rows to process, so we output our result:</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/08/8-1.jpg" class="kg-image" alt="The same diagram, now noting that there are “no more rows” to process, and including a final result, 3.0, being output at the end of the arrow. 
" loading="lazy" width="1600" height="904" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2021/08/8-1.jpg 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2021/08/8-1.jpg 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/08/8-1.jpg 1600w" sizes="(min-width: 720px) 720px"></figure><p>So, to summarize, each row comes in, gets compared to our current state, and then the state gets updated to reflect the new greatest value. Then the next row comes in, and we repeat the process until we’ve processed all our rows and output the result.</p>
<!--kg-card-begin: html-->
    <video autoplay loop muted playsinline>
      <source id="player" src="https://s3.amazonaws.com/blog.timescale.com/gifs/how-postgres-works/how_postgres_works_1.mp4" type="video/mp4"></source>
    </video>
    <figcaption class="gif-caption">The `max()` aggregation process, told in GIFs. 
    </figcaption>
<!--kg-card-end: html-->
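<p>To make the walkthrough above concrete, here is a minimal sketch of a hand-rolled version of <code>max()</code>. The names <code>my_max_step</code> and <code>my_max</code> are hypothetical and exist only for illustration; this is not how PostgreSQL's built-in <code>max()</code> is implemented. Because the step function is declared <code>STRICT</code> and the aggregate has no initial condition, PostgreSQL starts with a <code>NULL</code> state and uses the first non-null input as the initial state, which mirrors the pictures:</p><pre><code class="language-SQL">-- Hypothetical per-row step: compare the state so far with the incoming value.
CREATE FUNCTION my_max_step(state DOUBLE PRECISION, val DOUBLE PRECISION)
RETURNS DOUBLE PRECISION
LANGUAGE SQL IMMUTABLE STRICT
AS $$ SELECT greatest(state, val) $$;

-- Hypothetical aggregate: the state is a single DOUBLE PRECISION value.
-- No INITCOND is given, so the state starts out as NULL.
CREATE AGGREGATE my_max(DOUBLE PRECISION) (
    SFUNC = my_max_step,
    STYPE = DOUBLE PRECISION
);

SELECT my_max(bar) FROM foo;
</code></pre><p>On the three-row <code>foo</code> table from earlier, this should return the same value as <code>max(bar)</code>.</p>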
<p>There’s a name for the function that processes each row and updates the internal state: the <a href="https://www.postgresql.org/docs/current/sql-createaggregate.html"><strong>state transition function</strong></a> (or just “transition function” for short). The transition function for an aggregate takes the current state and the value from the incoming row as arguments and produces a new state. </p><p>It’s defined like this, where <code>current_value</code> represents values from the incoming row, <code>current_state</code> represents the current aggregate state built up over the previous rows (or NULL if we haven’t yet gotten any), and <code>next_state</code> represents the output after analyzing the incoming row:</p><pre><code class="language-SQL">next_state = transition_func(current_state, current_value)</code></pre><h3 id="aggregate-internals-composite-state">Aggregate internals: Composite state</h3><p>So, the <code>max()</code> aggregate has a straightforward state that contains just one value (the largest we’ve seen). But not all aggregates in PostgreSQL have such a simple state.</p><p>Let’s consider the aggregate for average (<code>avg</code>):</p><pre><code class="language-SQL">SELECT avg(bar) FROM foo;</code></pre><p>To refresh, an average is defined as:</p><p>\begin{equation} avg(x) = \frac{sum(x)}{count(x)}  \end{equation}</p>
<p>To calculate it, we store the sum and the count as our internal state and update our state as we process rows:</p>
<!--kg-card-begin: html-->
    <video autoplay loop muted playsinline>
      <source id="player" src="https://s3.amazonaws.com/blog.timescale.com/gifs/how-postgres-works/how_postgres_works_2.mp4" type="video/mp4"></source>
    </video>
    <figcaption class="gif-caption">The `avg()` aggregation process, told in GIFs. For `avg()`, the transition function must update a more complex state since the sum and count are stored separately at each aggregation step. 
    </figcaption>
<!--kg-card-end: html-->
<p>But, when we’re ready to output our result for <code>avg</code>, we need to divide <code>sum</code> by <code>count</code>:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/08/9-1.jpg" class="kg-image" alt="An arrow flowchart diagram similar to those before, showing the end state of the avg aggregate. The rows have moved through the aggregate, and the state is 6.0 - the sum and three - the count. There are then some question marks and an end result of 2.0." loading="lazy" width="1600" height="904" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2021/08/9-1.jpg 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2021/08/9-1.jpg 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/08/9-1.jpg 1600w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">For some aggregates, we can output the state directly – but for others, we need to perform an operation on the state before calculating our final result.</span></figcaption></figure><p>There’s another function inside the aggregate that performs this calculation: the <a href="https://www.postgresql.org/docs/current/sql-createaggregate.html"><strong>final function</strong></a>. Once we’ve processed all the rows, the final function takes the state and does whatever it needs to produce the result. </p><p>It’s defined like this, where <code>final_state</code> represents the output of the transition function after it has processed all the rows:</p><pre><code class="language-SQL">result = final_func(final_state)
</code></pre><p>And, through pictures:</p>
<!--kg-card-begin: html-->
    <video autoplay loop muted playsinline>
      <source id="player" src="https://s3.amazonaws.com/blog.timescale.com/gifs/how-postgres-works/how_postgres_works_3.mp4" type="video/mp4"></source>
    </video>
    <figcaption class="gif-caption">How the average aggregate works, told in GIFs. Here, we’re highlighting the role of the final function.
    </figcaption>
<!--kg-card-end: html-->
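<p>As a rough sketch of how the transition function and final function fit together, here is a hand-rolled average written in plain SQL. The names <code>my_avg_step</code>, <code>my_avg_final</code>, and <code>my_avg</code> are hypothetical, and keeping the state in a two-element array is just one convenient choice for illustration, not a description of the built-in <code>avg()</code>:</p><pre><code class="language-SQL">-- Hypothetical transition function: the state is the array {sum, count}.
CREATE FUNCTION my_avg_step(state DOUBLE PRECISION[], val DOUBLE PRECISION)
RETURNS DOUBLE PRECISION[]
LANGUAGE SQL IMMUTABLE STRICT
AS $$ SELECT ARRAY[state[1] + val, state[2] + 1] $$;

-- Hypothetical final function: divide sum by count once, at the very end.
CREATE FUNCTION my_avg_final(state DOUBLE PRECISION[])
RETURNS DOUBLE PRECISION
LANGUAGE SQL IMMUTABLE STRICT
AS $$ SELECT CASE WHEN state[2] = 0 THEN NULL
                  ELSE state[1] / state[2] END $$;

CREATE AGGREGATE my_avg(DOUBLE PRECISION) (
    SFUNC     = my_avg_step,
    STYPE     = DOUBLE PRECISION[],
    FINALFUNC = my_avg_final,
    INITCOND  = '{0,0}'          -- start with sum = 0 and count = 0
);

SELECT my_avg(bar) FROM foo;
</code></pre><p>The step function runs once per row, while the final function runs once per group, which is why the division lives in the final function rather than in the step.</p>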
<p>To summarize, as an aggregate scans over rows, its <strong>transition function</strong> updates its internal state. Once the aggregate has scanned all of the rows, its <strong>final function</strong> produces a result, which is returned to the user.</p><h3 id="improving-the-performance-of-aggregate-functions">Improving the performance of aggregate functions</h3><p>One interesting thing to note here: the transition function is called once for each row, whereas the final function is called only once per <em>group</em> of rows, so the transition function runs many, many more times. </p><p>Now, the transition function isn’t inherently more expensive than the final function on a per-call basis, but because there are usually orders of magnitude more rows going into the aggregate than coming out, the transition function step becomes the most expensive part very quickly. This is especially true when you have high-volume time-series data being ingested at high rates; optimizing aggregate transition function calls is important for improving performance.</p><p>Luckily, PostgreSQL already has ways to optimize aggregates.</p><h3 id="parallelization-and-the-combine-function">Parallelization and the combine function</h3><p>Because the transition function is run on each row, <a href="https://www.postgresql.org/message-id/flat/CA%2BTgmoYSL_97a--qAvdOa7woYamPFknXsXX17m0t2Pwc%2BFOvYw%40mail.gmail.com#fb9f2ae2a52ac605a4439a1879ff3c10">some enterprising PostgreSQL developers</a> asked: <em>what if we parallelized the transition function calculation?</em> </p><p>Let’s revisit our definitions for transition functions and final functions:</p><pre><code class="language-SQL">next_state = transition_func(current_state, current_value)

result = final_func(final_state)</code></pre><p>We can run this in parallel by instantiating multiple copies of the transition function and handing a subset of rows to each instance. Then, each parallel aggregate will run the transition function over the subset of rows it sees, producing multiple (partial) states, one for each parallel aggregate. But, since we need to aggregate over the <em>entire</em> data set, we can’t run the final function on each parallel aggregate separately because they only have some of the rows. </p><p>So, now we’ve ended up in a bit of a pickle: we have multiple partial aggregate states, and the final function is only meant to work on the single, final state—right before we output the result to the user. </p><p>To solve this problem, we need a new type of function that combines two partial states into one so that the final function can do its work. This is (aptly) called the <a href="https://www.postgresql.org/docs/current/sql-createaggregate.html"><strong>combine function</strong></a>. </p><p>We can run the combine function iteratively over all of the partial states that are created when we parallelize the aggregate.</p><pre><code class="language-SQL">combined_state = combine_func(partial_state_1, partial_state_2)</code></pre><p>For instance, in <code>avg</code>, the combine function will add up the counts and sums.</p>
<!--kg-card-begin: html-->
    <video autoplay loop muted playsinline>
      <source id="player" src="https://s3.amazonaws.com/blog.timescale.com/gifs/how-postgres-works/how_postgres_works_4.mp4" type="video/mp4"></source>
    </video>
    <figcaption class="gif-caption">How parallel aggregation works, told in GIFs. Here, we’re highlighting the combine function. (We’ve added a couple more rows to illustrate parallel aggregation.)
    </figcaption>
<!--kg-card-end: html-->
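<p>For readers who want to see where a combine function plugs in, here is a minimal sketch of a parallel-capable average. All of the names (<code>pavg_step</code>, <code>pavg_combine</code>, <code>pavg_final</code>, <code>pavg</code>) are hypothetical, and the <code>{sum, count}</code> array state is an assumption made for illustration. Whether the planner actually chooses a parallel plan depends on table size and server settings:</p><pre><code class="language-SQL">-- Hypothetical transition function: state is {sum, count}.
CREATE FUNCTION pavg_step(state DOUBLE PRECISION[], val DOUBLE PRECISION)
RETURNS DOUBLE PRECISION[]
LANGUAGE SQL IMMUTABLE STRICT PARALLEL SAFE
AS $$ SELECT ARRAY[state[1] + val, state[2] + 1] $$;

-- Hypothetical combine function: merge two partial states by adding
-- their sums and their counts.
CREATE FUNCTION pavg_combine(a DOUBLE PRECISION[], b DOUBLE PRECISION[])
RETURNS DOUBLE PRECISION[]
LANGUAGE SQL IMMUTABLE STRICT PARALLEL SAFE
AS $$ SELECT ARRAY[a[1] + b[1], a[2] + b[2]] $$;

-- Hypothetical final function: runs once, on the fully combined state.
CREATE FUNCTION pavg_final(state DOUBLE PRECISION[])
RETURNS DOUBLE PRECISION
LANGUAGE SQL IMMUTABLE STRICT PARALLEL SAFE
AS $$ SELECT CASE WHEN state[2] = 0 THEN NULL
                  ELSE state[1] / state[2] END $$;

CREATE AGGREGATE pavg(DOUBLE PRECISION) (
    SFUNC       = pavg_step,
    STYPE       = DOUBLE PRECISION[],
    COMBINEFUNC = pavg_combine,
    FINALFUNC   = pavg_final,
    INITCOND    = '{0,0}',
    PARALLEL    = SAFE
);

-- Combining two partial states by hand: {sum 3, count 2} and {sum 3, count 1}
-- merge into a state with sum 6 and count 3.
SELECT pavg_combine(ARRAY[3.0, 2.0]::DOUBLE PRECISION[],
                    ARRAY[3.0, 1.0]::DOUBLE PRECISION[]);
</code></pre><p>Note that the combine function never sees raw rows; it only ever merges states that the transition function has already built.</p>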
<p>Then, after we have the combined state from all of our parallel aggregates, we run the final function and get our result.</p><h3 id="deduplication">Deduplication</h3><p>Parallelization and the combine function are one way to reduce the cost of calling an aggregate, but they’re not the only way. </p><p>One other built-in <a href="https://www.tigerdata.com/blog/best-practices-for-query-optimization-in-postgresql" rel="noreferrer">PostgreSQL optimization</a> that reduces an aggregate’s cost occurs in a statement like this:</p><pre><code class="language-SQL">SELECT avg(bar), avg(bar) / 2 AS half_avg FROM foo;</code></pre><p>PostgreSQL will optimize this statement to evaluate the <code>avg(bar)</code> calculation only once and then use that result twice. </p><p>And what if we have different aggregates with the same transition function but different final functions? PostgreSQL further optimizes by running the transition function (the expensive part) once over all the rows and then running both final functions on the shared state! Pretty neat!</p><p>That’s not all that PostgreSQL aggregates can do, but it’s a pretty good tour, and it’s enough to get us where we need to go today.</p><h2 id="two-step-aggregation-in-timescaledb-hyperfunctions">Two-step Aggregation in TimescaleDB Hyperfunctions</h2><p>In TimescaleDB, we’ve implemented the two-step aggregation design pattern for our aggregate functions. This generalizes the PostgreSQL internal aggregation API and exposes it to the user via our aggregates, accessors, and rollup functions. 
(In other words, each internal <a href="https://www.tigerdata.com/blog/function-pipelines-building-functional-programming-into-postgresql-using-custom-operators" rel="noreferrer">PostgreSQL function</a> has an equivalent function in TimescaleDB hyperfunctions.)</p><p>As a refresher, when we talk about the two-step aggregation design pattern, we mean the following convention, where we have an inner aggregate call:</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/08/Inner-aggregate-call-1.jpg" class="kg-image" alt="SELECT average(time_weight('LOCF', value)) as time_weighted_average FROM foo; -- or SELECT approx_percentile(0.5, percentile_agg(value)) as median FROM bar;  With the snippets: time_weight('LOCF', value) and percentile_agg(value) highlighted." loading="lazy" width="1232" height="230" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2021/08/Inner-aggregate-call-1.jpg 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2021/08/Inner-aggregate-call-1.jpg 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/08/Inner-aggregate-call-1.jpg 1232w" sizes="(min-width: 720px) 720px"></figure><p>And an outer accessor call:</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/08/Outer-accessor-call-1.jpg" class="kg-image" alt="The same as the previous in terms of code, except the sections: average(time_weight('LOCF', value)) and approx_percentile(0.5, percentile_agg(value)) are highlighted" loading="lazy" width="1232" height="230" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2021/08/Outer-accessor-call-1.jpg 600w, 
https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2021/08/Outer-accessor-call-1.jpg 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/08/Outer-accessor-call-1.jpg 1232w" sizes="(min-width: 720px) 720px"></figure><p>The inner aggregate call returns the internal state, just like the transition function does in PostgreSQL aggregates. </p><p>The outer accessor call takes the internal state and returns a result to the user, just like the final function does in PostgreSQL. </p><p>We also have special <a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/rollup-percentile/#sample-usage"><code>rollup</code></a> functions <a href="https://docs.timescale.com/api/latest/hyperfunctions/time-weighted-averages/rollup-timeweight/">defined for each of our aggregates</a> that work much like PostgreSQL combine functions.</p>
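<p>Here is a short sketch of the three pieces used together. It assumes the TimescaleDB Toolkit extension is installed and uses a hypothetical table <code>metrics(ts timestamptz, val double precision)</code> that does not appear elsewhere in this post:</p><pre><code class="language-SQL">-- Assumption: a hypothetical table metrics(ts timestamptz, val double precision).
WITH hourly AS (
    SELECT
        date_trunc('hour', ts) AS bucket,
        percentile_agg(val)    AS pct_agg   -- aggregate: builds one state per hour
    FROM metrics
    GROUP BY 1
)
SELECT
    -- rollup merges the hourly states into one (like a combine function);
    -- approx_percentile is the accessor that turns that state into a result.
    approx_percentile(0.5, rollup(pct_agg)) AS median
FROM hourly;
</code></pre><p>The hourly states are computed once, and the rollup works on those states rather than going back to the raw rows.</p>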
<!--kg-card-begin: html-->
<span id="agg-table" />
<!--kg-card-end: html-->
<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/08/table-pg-two-step-comparison-1.jpg" class="kg-image" alt="A table with columns labeled: the PostgreSQL internal aggregation API, Two-step aggregate equivalent, and TimescaleDB hyperfunction example. In the first row, we have the transition function equivalent to the aggregate, and the examples are time_weight() and percentile_agg(). In the second row, we have the final function, equivalent to the accessor, and the examples are average() and approx_percentile(). In the third row, we have the combine function equivalent to rollup in two-step aggregates, and the example is rollup()." loading="lazy" width="1109" height="466" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2021/08/table-pg-two-step-comparison-1.jpg 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2021/08/table-pg-two-step-comparison-1.jpg 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/08/table-pg-two-step-comparison-1.jpg 1109w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">PostgreSQL internal aggregation APIs and their TimescaleDB hyperfunctions’ equivalent</span></figcaption></figure><h3 id="why-we-use-the-two-step-aggregate-design-pattern">Why we use the two-step aggregate design pattern</h3><p>There are four basic reasons we expose the two-step aggregate design pattern to users rather than leave it as an internal structure (and the last two helped us build our continuous aggregates): </p><ol><li>Allow multi-parameter aggregates to re-use state, making them more efficient.</li><li>Cleanly distinguish between parameters that affect aggregates vs. 
accessors, making performance implications easier to understand and predict.</li><li>Enable easy-to-understand rollups, with logically consistent results, in continuous aggregates and window functions (one of our most common requests on continuous aggregates).</li><li>Allow easier <em>retrospective analysis</em> of downsampled data in continuous aggregates when requirements change after the raw data is already gone.</li></ol><p>That’s a little theoretical, so let’s dive in and explain each one.</p><h2 id="efficiency-of-two-step-aggregates">Efficiency of Two-Step Aggregates<br></h2><h3 id="re-using-state"><strong>Re-using state</strong></h3><p>PostgreSQL is very good at optimizing statements (as we saw earlier in this post through pictures 🙌), but you have to give it things in a way it can understand. </p><p>For instance, <a href="https://timescale.ghost.io/blog/blog/how-postgresql-aggregation-works-and-how-it-inspired-our-hyperfunctions-design-2/#deduplication">when we talked about deduplication</a>, we saw that PostgreSQL could “figure out” when an expression occurs more than once in a query (e.g., <code>avg(bar)</code>) and only evaluate it a single time to avoid redundant work:</p><pre><code class="language-SQL">SELECT avg(bar), avg(bar) / 2 AS half_avg FROM foo;</code></pre><p>This works because the <code>avg(bar)</code> occurs multiple times without variation. </p><p>However, if I write the equation in a slightly different way and move the division <em>inside</em> the parentheses so that the expression <code>avg(bar)</code> doesn’t repeat so neatly, PostgreSQL <em>can’t</em> figure out how to optimize it:</p><pre><code class="language-SQL">SELECT avg(bar), avg(bar / 2) AS half_avg FROM foo;</code></pre><p>It doesn’t know that dividing each value by 2 before averaging gives the same result as dividing the average by 2, so it can’t tell that the two queries are equivalent. 
</p><p>This is a complicated problem for database developers to solve, and thus, as a PostgreSQL user, you need to make sure to write your query in a way that the database can understand. </p><p>Performance problems caused by statements that are equivalent, but that the database can’t recognize as equivalent (or that are equivalent only in the specific case you wrote, not in the general case), can be some of the trickiest SQL issues to diagnose as a user. </p><p>Therefore, <strong>when we design our APIs, we try to make it hard for users to write low-performance code unintentionally: in other words, the default option should be the high-performance option</strong>.</p><p>For the next bit, it’ll be useful to have a simple table defined as:</p><pre><code class="language-SQL">CREATE TABLE foo(
	ts timestamptz, 
	val DOUBLE PRECISION);</code></pre><p>Let’s look at an example of how we use two-step aggregation in the <a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/">percentile approximation hyperfunction</a> to allow PostgreSQL to optimize performance.</p><pre><code class="language-SQL">SELECT 
    approx_percentile(0.1, percentile_agg(val)) as p10, 
    approx_percentile(0.5, percentile_agg(val)) as p50, 
    approx_percentile(0.9, percentile_agg(val)) as p90 
FROM foo;</code></pre><p>...is treated as the same as:</p><pre><code class="language-SQL">SELECT 
    approx_percentile(0.1, pct_agg) as p10, 
    approx_percentile(0.5, pct_agg) as p50, 
    approx_percentile(0.9, pct_agg) as p90 
FROM 
(SELECT percentile_agg(val) as pct_agg FROM foo) pct;
</code></pre><p>This calling convention allows us to use identical aggregates so that, under the hood, PostgreSQL can deduplicate calls to the identical aggregates (and is faster as a result).</p><p>Now, let’s compare this to the one-step aggregate approach. </p><p>PostgreSQL can’t deduplicate aggregate calls here because the extra parameter in the <code>approx_percentile</code> aggregate changes with each call:</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/08/Approx_percentile.jpg" class="kg-image" alt="-- NB: THIS IS AN EXAMPLE OF AN API WE DECIDED NOT TO USE, IT DOES NOT WORK SELECT      approx_percentile(0.1, val) as p10,      approx_percentile(0.5, val) as p50,      approx_percentile(0.9, val) as p90  FROM foo; The first values in each of the approx_percentile calls are highlighted and in red." loading="lazy" width="1232" height="368" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2021/08/Approx_percentile.jpg 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2021/08/Approx_percentile.jpg 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/08/Approx_percentile.jpg 1232w" sizes="(min-width: 720px) 720px"></figure><p>So, even though all of those functions could use the same approximation built up over all the rows, PostgreSQL has no way of knowing that. The two-step aggregation approach enables us to structure our calls so that PostgreSQL can optimize our code, and it enables developers to understand when things will be more expensive and when they won't. 
Multiple different aggregates with different inputs will be expensive, whereas multiple accessors to the same aggregate will be much less expensive.</p><h3 id="cleanly-distinguishing-between-aggregateaccessor-parameters">Cleanly distinguishing between aggregate/accessor parameters</h3><p>We also chose the two-step aggregate approach because some of our aggregates can take multiple parameters or options themselves, and their accessors can also take options:</p><pre><code class="language-SQL">SELECT
    approx_percentile(0.5, uddsketch(1000, 0.001, val)) as median,--1000 buckets, 0.001 target err
    approx_percentile(0.9, uddsketch(1000, 0.001, val)) as p90, 
    approx_percentile(0.5, uddsketch(100, 0.01, val)) as less_accurate_median -- modify the terms for the aggregate to get a new approximation
FROM foo;</code></pre><p><br>That’s an example of <a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/percentile-aggregation-methods/uddsketch/"><code>uddsketch</code></a>, an <a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/percentile-aggregation-methods/#choosing-the-right-algorithm-for-your-use-case">advanced aggregation method</a> for percentile approximation that can take its own parameters. </p><p>Imagine if the parameters were jumbled together in one aggregate:</p><pre><code class="language-SQL">-- NB: THIS IS AN EXAMPLE OF AN API WE DECIDED NOT TO USE, IT DOES NOT WORK
SELECT
    approx_percentile(0.5, 1000, 0.001, val) as median
FROM foo;
</code></pre><p><br>It’d be pretty difficult to understand which argument is related to which part of the functionality.</p><p>Conversely, the two-step approach separates the arguments to the accessor vs. aggregate very cleanly, where the aggregate function is defined in parentheses within the inputs of our final function:</p><pre><code class="language-SQL">SELECT
    approx_percentile(0.5, uddsketch(1000, 0.001, val)) as median
FROM foo;
</code></pre><p>By making it clear which is which, users can know that if they change the inputs to the aggregate, they will get more (costly) aggregate nodes, while inputs to the accessor are cheaper to change. </p><p>So, those are the first two reasons we expose the API, and what it allows developers to do as a result. The last two reasons involve continuous aggregates and how they relate to hyperfunctions, so first, a quick refresher on what they are.</p><h2 id="continuous-aggregates-and-two-step-design-in-timescaledb">Continuous Aggregates and Two-Step Design in TimescaleDB</h2><p>TimescaleDB includes a feature called <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/continuous-aggregates/">continuous aggregates</a>, which are designed to make queries on very large datasets run faster. TimescaleDB's continuous aggregates continuously and incrementally store the results of an aggregation query in the background, so when you run the query, only the data that has changed needs to be computed, not the entire dataset. </p><p>In our discussion of the combine function <a href="https://timescale.ghost.io/blog/blog/how-postgresql-aggregation-works-and-how-it-inspired-our-hyperfunctions-design-2/#deduplication">above,</a> we covered how you could take the expensive work of computing the transition function over every row and split the rows over multiple parallel aggregates to speed up the calculation.</p><p>TimescaleDB continuous aggregates do something similar, except they spread the computation work over<em> time</em> rather than between parallel processes running simultaneously. The continuous aggregate computes the transition function over a subset of rows inserted some time in the past, stores the result, and then, at query time, we only need to compute over the raw data for a small section of recent time that we haven’t yet calculated. 
</p><p>When we designed TimescaleDB hyperfunctions, we wanted them to work well within continuous aggregates and even open new possibilities for users.  </p><p>Let’s say I create a continuous aggregate from the simple table above to compute the sum, average, and percentile (the latter using a hyperfunction) in 15-minute increments:</p><pre><code class="language-SQL">CREATE MATERIALIZED VIEW foo_15_min_agg
WITH (timescaledb.continuous)
AS SELECT id,
    time_bucket('15 min'::interval, ts) as bucket,
    sum(val),
    avg(val),
    percentile_agg(val)
FROM foo
GROUP BY id, time_bucket('15 min'::interval, ts);</code></pre><p>And then what if I come back and I want to re-aggregate it to hours or days rather than 15-minute buckets—or need to aggregate my data across all IDs? Which aggregates can I do that for, and which can’t I?</p><h3 id="logically-consistent-rollups">Logically consistent rollups</h3><p>One of the problems we wanted to solve with two-step aggregation was how to convey to the user when it is “okay” to re-aggregate and when it’s not. (By “okay,” I mean you would get the same result from the re-aggregated data as you would running the aggregate on the raw data directly.) </p><p>For instance:</p><pre><code class="language-SQL">SELECT sum(val) FROM tab;
-- is equivalent to:
SELECT sum(sum) 
FROM 
    (SELECT id, sum(val) 
    FROM tab
    GROUP BY id) s;</code></pre><p>But:</p><pre><code class="language-SQL">SELECT avg(val) FROM tab;
-- is NOT equivalent to:
SELECT avg(avg) 
FROM 
    (SELECT id, avg(val) 
    FROM tab
    GROUP BY id) s;
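-- A sketch of a re-aggregation that IS equivalent (not from the original post):
-- carry the sum and the count separately, then divide once at the end.
SELECT sum(s) / sum(c) 
FROM 
    (SELECT id, sum(val) as s, count(val) as c 
    FROM tab
    GROUP BY id) t;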
</code></pre><p>Why is re-aggregation okay for <code>sum</code> but not for <code>avg</code>? </p><p>Technically, it’s logically consistent to re-aggregate when:</p><ul><li>The aggregate returns the internal aggregate state. The internal aggregate state for sum is <code>(sum)</code>, whereas for average, it is <code>(sum, count)</code>, so <code>avg</code> returns a value derived from its state rather than the state itself. </li><li>The aggregate’s combine and transition functions are equivalent. For <code>sum()</code>, the states and the operations are the same. For <code>count()</code>, the <em>states</em> are the same, but the transition and combine functions <em>perform different operations </em>on them. <code>sum()</code>’s transition function adds the incoming value to the state, and its combine function adds two states together, giving a sum of sums. Conversely, <code>count()</code>’s transition function increments the state for each incoming value, but its combine function adds two states together, giving a sum of counts.</li></ul><p>But, you have to have in-depth (and sometimes rather arcane) knowledge about each aggregate’s internals to know which ones meet the above criteria and, therefore, which ones you can re-aggregate.</p><p><strong>With the two-step aggregate approach, we can convey when it is logically consistent to re-aggregate by exposing our equivalent of the combine function when the aggregate allows it.</strong></p><p>We call that function <code>rollup()</code>. <code>rollup()</code> takes multiple inputs from the aggregate and combines them into a single value. </p><p>All of our aggregates that can be combined have <code>rollup</code> functions that will combine the output of the aggregate from two different groups of rows. (Technically, <code>rollup()</code> is an aggregate function because it acts on multiple rows. For clarity, I’ll call them rollup functions to distinguish them from the base aggregate.) Then you can call the accessor on the combined output! 
</p><p>So using that continuous aggregate we created to get a 1-day re-aggregation of our <code>percentile_agg</code> becomes as simple as:</p><pre><code class="language-SQL">SELECT id, 
    time_bucket('1 day'::interval, bucket) as bucket, 
    approx_percentile(0.5, rollup(percentile_agg)) as median
FROM foo_15_min_agg
GROUP BY id, time_bucket('1 day'::interval, bucket);</code></pre><p>(We actually suggest that you create your continuous aggregates without calling the accessor function for this very reason. Then, you can just create views over top or put the accessor call in your query). </p><p>This brings us to our final reason.</p><h3 id="retrospective-analysis-using-continuous-aggregates">Retrospective analysis using continuous aggregates</h3><p>When we create a continuous aggregate, we’re defining a view of our data that we could then be stuck with for a very long time. </p><p>For example, we might have a data retention policy that deletes the underlying data after X time period. If we want to go back and re-calculate anything, it can be challenging, if not impossible, since we’ve “dropped” the data. </p><p>But we understand that in the real world, you don’t always know what you’re going to need to analyze ahead of time. </p><p>Thus, we designed hyperfunctions to use the two-step aggregate approach so they would better integrate with continuous aggregates. As a result, users store the aggregate state in the continuous aggregate view and modify accessor functions <em>without</em> requiring them to recalculate old states that might be difficult (or impossible) to reconstruct (because the data is archived, deleted, etc.). </p><p>The two-step aggregation design also allows for much greater flexibility with continuous aggregates. For instance, let’s take a continuous aggregate where we do the aggregate part of the two-step aggregation like this:</p><pre><code class="language-SQL">CREATE MATERIALIZED VIEW foo_15_min_agg
WITH (timescaledb.continuous)
AS SELECT id,
    time_bucket('15 min'::interval, ts) as bucket,
    percentile_agg(val)
FROM foo
GROUP BY id, time_bucket('15 min'::interval, ts);
</code></pre><p>When we first create the aggregate, we might only want to get the median:</p><pre><code class="language-SQL">SELECT
    approx_percentile(0.5, percentile_agg) as median
FROM foo_15_min_agg;
</code></pre><p>But then, later, we decided we wanted to know the 95th percentile as well. </p><p>Luckily, we don’t have to modify the continuous aggregate; we<strong> just modify the parameters to the accessor function in our original query to return the data we want from the aggregate state</strong>:</p><pre><code class="language-SQL">SELECT
    approx_percentile(0.5, percentile_agg) as median,
    approx_percentile(0.95, percentile_agg) as p95
FROM foo_15_min_agg;</code></pre><p>And then, if a year later, we want the 99th percentile as well, we can do that too:</p><pre><code class="language-SQL">SELECT
    approx_percentile(0.5, percentile_agg) as median,
    approx_percentile(0.95, percentile_agg) as p95,
    approx_percentile(0.99, percentile_agg) as p99
FROM foo_15_min_agg;
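-- Sketch (not from the original post): the accessor calls can also live in a
-- plain view over the continuous aggregate. The view name is hypothetical.
CREATE VIEW foo_15_min_percentiles AS
SELECT id,
    bucket,
    approx_percentile(0.5, percentile_agg) as median,
    approx_percentile(0.95, percentile_agg) as p95,
    approx_percentile(0.99, percentile_agg) as p99
FROM foo_15_min_agg;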
</code></pre><p>That’s just scratching the surface. Ultimately, our goal is to provide a high level of developer productivity that enhances other PostgreSQL and TimescaleDB features, like aggregate deduplication and continuous aggregates.</p><h2 id="example-time-weighted-average">Example: Time-Weighted Average</h2><p>To illustrate how the two-step aggregate design pattern impacts how we think about and code hyperfunctions, let’s look at the<a href="https://docs.timescale.com/api/latest/hyperfunctions/time-weighted-averages/"> time-weighted average family of functions</a>. (Our<a href="https://timescale.ghost.io/blog/blog/what-time-weighted-averages-are-and-why-you-should-care/"> What Time-Weighted Averages Are and Why You Should Care</a> post provides a lot of context for this next bit, so if you haven’t read it, we recommend doing so. You can also skip this next bit for now.)</p><p>The equation for the time-weighted average is as follows:</p><p>\begin{equation}  time\_weighted\_average = \frac{area\_under\_curve}{ \Delta T}   \end{equation}</p>
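<p>In SQL, the two-step form of that equation can be sketched as follows. This is a minimal sketch rather than a definitive query: it assumes the <code>foo</code> table used earlier, with a <code>ts</code> timestamp column and a numeric <code>val</code> column, and it assumes linear interpolation between points (<code>'Linear'</code>) is the weighting you want.</p><pre><code class="language-SQL">-- Sketch under the assumptions above: one time-weighted average per id.
SELECT id,
    average(time_weight('Linear', ts, val)) as time_weighted_avg
FROM foo
GROUP BY id;</code></pre><p>The inner aggregate accumulates the area under the curve, and the outer accessor divides it by ΔT.</p>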
<p>As we noted in the <a href="https://timescale.ghost.io/blog/blog/how-postgresql-aggregation-works-and-how-it-inspired-our-hyperfunctions-design-2/#agg-table">table above</a>:</p><ul><li><code>time_weight()</code> is TimescaleDB hyperfunctions’ aggregate and corresponds to the transition function in PostgreSQL’s internal API.</li><li><code>average()</code> is the accessor, which corresponds to the PostgreSQL final function.</li><li><code>rollup()</code> for re-aggregation corresponds to the PostgreSQL combine function.</li></ul><p>The <code>time_weight()</code> function returns an aggregate type that has to be usable by the other functions in the family.</p><p>In this case, we decided on a <code>TimeWeightSummary</code> type that is defined like so (in pseudocode):</p><pre><code class="language-SQL">TimeWeightSummary = (w_sum, first_pt, last_pt)</code></pre><p><code>w_sum</code> is the weighted sum (another name for the area under the curve), and <code>first_pt</code> and <code>last_pt</code> are the first and last (time, value) pairs in the rows that feed into the <code>time_weight()</code> aggregate. </p><p>Here’s a graphic depiction of those elements, which builds on our <a href="https://timescale.ghost.io/blog/blog/what-time-weighted-averages-are-and-why-you-should-care/#mathy-bits-how-to-derive-a-time-weighted-average">how to derive a time-weighted average theoretical description</a>:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/08/example-graph1-1.jpg" class="kg-image" alt="A graph showing value on the y-axis and time on the x-axis. There are four points:  open parens t 1 comma v 1 close parens, labeled first point to open parens t 4 comma  v 4 close parens, labeled last point. The points are spaced unevenly in time on the graph. The area under the graph is shaded, and labeled w underscore sum. 
The time axis has a brace describing the total distance between the first and last points labeled Delta T. " loading="lazy" width="692" height="589" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2021/08/example-graph1-1.jpg 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/08/example-graph1-1.jpg 692w"><figcaption><span style="white-space: pre-wrap;">Depiction of the values we store in the </span><code spellcheck="false" style="white-space: pre-wrap;"><span>TimeWeightSummary</span></code><span style="white-space: pre-wrap;"> representation.</span></figcaption></figure><p></p><p>So, the <code>time_weight()</code> aggregate does all of the calculations as it receives each of the points in our graph and builds a weighted sum for the time period (ΔT) between the first and last points it “sees.” It then outputs the <code>TimeWeightSummary</code>.</p><p>The <code>average()</code> accessor function performs simple calculations to return the time-weighted average from the <code>TimeWeightSummary</code> (in pseudocode where <code>pt.time()</code> returns the time from the point):</p><pre><code class="language-SQL">func average(TimeWeightSummary tws) 
	-&gt; float {
		delta_t = tws.last_pt.time() - tws.first_pt.time();
		time_weighted_average = tws.w_sum / delta_t;
		return time_weighted_average;
	}</code></pre><p><br>But, as we built the <code>time_weight</code> hyperfunction, ensuring the <code>rollup()</code> function worked as expected was a little more difficult and introduced constraints that impacted the design of our <code>TimeWeightSummary</code> data type. </p><p>To understand the rollup function, let’s use our graphical example and imagine the <code>time_weight()</code> function returns two <code>TimeWeightSummaries</code> from different regions of time like so:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/08/Example-graph-2-1.jpg" class="kg-image" alt="A similar graph to the previous, except that now there are two sets of shaded regions. The first is similar to the previous and is labeled with first sub 1 open parens t 1 comma v 1 close parens, last  1 open parens t 4 comma  v 4 close parens , and w underscore sum  1.  The second is similar, with points first 2 open parens t 5 comma  v 5 close parens and last 2 open parens t 8 comma  v 8 close parens and the label w underscore sum 2 on the shaded portion. " loading="lazy" width="672" height="511" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2021/08/Example-graph-2-1.jpg 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/08/Example-graph-2-1.jpg 672w"><figcaption><span style="white-space: pre-wrap;">What happens when we have multiple TimeWeightSummaries representing different regions of the graph</span></figcaption></figure><p>The <code>rollup()</code> function needs to take in and return the same <code>TimeWeightSummary</code> data type so that our <code>average()</code> accessor can understand it. 
(This mirrors how PostgreSQL’s combine function takes in two states from the transition function and then returns a single state for the final function to process.)</p><p>We also want the <code>rollup()</code> output to be the same as if we had computed the <code>time_weight()</code> over all the underlying data. The output should be a <code>TimeWeightSummary</code> representing the full region.  </p><p>The <code>TimeWeightSummary</code> we output should also account for the area in the gap between these two weighted sum states:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/08/example-graph-3-1.jpg" class="kg-image" alt="A similar picture to the previous, with the area between the points open parens t 4 comma  v 4 close parens aka last 1 and open parens t 5 comma  v 5 close parens aka first 2, down to the time axis highlighted. This is called w underscore sum gap. " loading="lazy" width="663" height="501" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2021/08/example-graph-3-1.jpg 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/08/example-graph-3-1.jpg 663w"><figcaption><span style="white-space: pre-wrap;">Mind the gap! 
(between one </span><code spellcheck="false" style="white-space: pre-wrap;"><span>TimeWeightSummary</span></code><span style="white-space: pre-wrap;"> and the next).</span></figcaption></figure><p>The gap area is easy to get because we have the last<sub>1</sub> and first<sub>2</sub> points—and it’s the same as the <code>w_sum</code> we’d get by running the <code>time_weight()</code> aggregate on them.</p><p>Thus, the overall <code>rollup()</code> function needs to do something like this (where <code>w_sum()</code> extracts the weighted sum from the <code>TimeWeightSummary</code>):</p><pre><code class="language-SQL">func rollup(TimeWeightSummary tws1, TimeWeightSummary tws2) 
	-&gt; TimeWeightSummary {
		w_sum_gap = time_weight(tws1.last_pt, tws2.first_pt).w_sum();
		w_sum_total = w_sum_gap + tws1.w_sum() + tws2.w_sum();
		return TimeWeightSummary(w_sum_total, tws1.first_pt, tws2.last_pt);
	}
</code></pre><p>Graphically, that means we’d end up with a single <code>TimeWeightSummary</code> representing the whole area:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/08/example-graph-4-1.jpg" class="kg-image" alt="Similar to the previous graphs, except that now there is only one region that has been shaded, the combined area of the w underscore sum 1, w underscore sum 2, and w underscore sum gap has become one area, w underscore sum. Only the overall first open parens t 1 comma  v 1 close parens and last open parens t 8 comma  v 8 close parens points are shown. " loading="lazy" width="672" height="505" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2021/08/example-graph-4-1.jpg 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/08/example-graph-4-1.jpg 672w"><figcaption><span style="white-space: pre-wrap;">The combined </span><code spellcheck="false" style="white-space: pre-wrap;"><span>TimeWeightSummary</span></code></figcaption></figure><p>So that’s how the two-step aggregate design approach ends up affecting the real-world implementation of our time-weighted average hyperfunctions. The above explanations are a bit condensed, but they should give you a more concrete look at how the <code>time_weight()</code> aggregate, <code>average()</code> accessor, and <code>rollup()</code> functions work.</p><h2 id="summing-it-up">Summing It Up</h2><p>Now that you’ve gotten a tour of the PostgreSQL aggregate API, how it inspired us to make the TimescaleDB hyperfunctions two-step aggregate API, and a few examples of how this works in practice, we hope you'll try it out yourself and tell us what you think :). 
</p><p>If you're currently dealing with gigantic databases, remember that you can always tier your older, infrequently accessed data to keep things running smoothly without breaking the bank—we built the perfect solution for this with our <a href="https://docs.timescale.com/use-timescale/latest/data-tiering/" rel="noreferrer">Tiered Storage architecture backend</a>. Check it out!</p><p>If you'd like to keep learning about Postgres and its community, visit our <a href="https://www.timescale.com/state-of-postgres/2023" rel="noreferrer"><em>State of PostgreSQL 2023</em> </a>report, which is full of insights on how people around the world use PostgreSQL.</p><p><strong>Going back to hyperfunctions, to get started right away, </strong><a href="https://console.cloud.timescale.com/signup"><strong>spin up a fully managed Timescale service and try it for free</strong></a><strong>. </strong>Hyperfunctions are pre-loaded on each new database service on Timescale, so after you create a new service, you’re all set to use them!</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to Migrate From AWS RDS for PostgreSQL to Timescale]]></title>
            <description><![CDATA[Database migration doesn’t have to be hard. Read this step-by-step guide on how to migrate your database from AWS RDS to Timescale.]]></description>
            <link>https://www.tigerdata.com/blog/how-to-migrate-from-aws-rds-for-postgresql-to-timescale</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/how-to-migrate-from-aws-rds-for-postgresql-to-timescale</guid>
            <category><![CDATA[Amazon RDS]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Vineeth Pothulapati]]></dc:creator>
            <pubDate>Tue, 09 Jan 2024 15:04:50 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/01/Migrating-from-Amazon-RDS-to-Timescale-Cloud_hero.jpg">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/01/Migrating-from-Amazon-RDS-to-Timescale-Cloud_hero.jpg" alt="How to Migrate From AWS RDS for PostgreSQL to Timescale" /><p>Database migration refers to the process of moving data from one database to another. It's a vital and complex operation requiring careful planning and execution. Unsurprisingly, developers tend to avoid it, as migrating a database involves changing the database architecture or transferring data from one platform or format to another.&nbsp;</p><p>At Timescale, we see many devs looking for an <a href="https://www.tigerdata.com/learn/alternatives-to-rds" rel="noreferrer">alternative to AWS RDS for PostgreSQL</a>. Drawn by performance and a transparent pricing model (no one has the time to <a href="https://timescale.ghost.io/blog/understanding-rds-pricing/"><u>understand RDS pricing</u></a>), they often land at Timescale. When trying to help them, we couldn’t help but notice how lengthy and complex it is to migrate from RDS.</p><p>But what if you had a playbook to assist you along the way? If you’re embarking on a journey through uncharted territory, having a map makes all the difference. That’s what this blog post is: a step-by-step overview of the migration process from RDS to Timescale using live migrations. This migration workflow ensures a smooth transition (even for heavy workloads) while helping you avoid common pitfalls and harness our features. Think of it as your GPS for database migration. </p><h2 id="steps-for-a-successful-database-migration">Steps for a Successful Database Migration</h2><p>Migrating a database is a complex process due to all the moving parts you’ll need to handle to ensure a seamless and safe transition of your data. 
A successful database migration typically involves several steps:</p><ul><li><strong>Assessment</strong>: understand your current database's size, complexity, and dependencies.</li><li><strong>Planning</strong>: decide on a migration strategy, schedule downtime if necessary, and prepare a backup plan in case something goes wrong.</li><li><strong>Enable hypertables: </strong>to leverage Timescale’s performance optimization, compression, continuous aggregates, and data tiering, you have to enable hypertables on your time-series tables.&nbsp;</li><li><strong>Data migration</strong>: move the old data and incoming data during migration from the old database to the new one with minimal downtime.</li><li><strong>Testing</strong>: rigorously test the new database with your applications to ensure everything works as expected.</li><li><strong>Cutover</strong>: finally, switch over from the old database to the new one, which usually involves a period of downtime. (If you follow our playbook, we ensure minimal downtime.)</li></ul><p>Our playbook covers all the bases for a successful migration and streamlines it, allowing you to skip some of these steps. For example, we devised a way for you to enable one of Timescale’s most popular features, <a href="https://docs.timescale.com/use-timescale/latest/hypertables/about-hypertables/"><u>hypertables</u></a>, in your target database while migrating your data. This means that all the data coming into your target database will be automatically partitioned when it comes into Timescale, saving you precious time.</p><p>This helpful aid is already part of our <a href="https://timescale.ghost.io/blog/migrating-a-terabyte-scale-postgresql-database-to-timescale-with-zero-downtime/"><u>live migrations</u></a> solution, a database migration workflow for Timescale based on pg_dump/pg_restore (for schema) and PostgreSQL logical decoding (for live data). 
But now, we have accelerated the process even more, reducing the number of steps and making it even more straightforward. Let’s jump into it.</p><h2 id="steps-for-migrating-from-aws-rds-to-timescale">Steps for Migrating From AWS RDS to Timescale</h2><p>For a detailed step-by-step guide to migrating your database(s) from Amazon RDS for PostgreSQL to Timescale, we recommend you read our <a href="https://docs.timescale.com/migrate/latest/playbooks/rds-timescale-live-migration/"><u>documentation</u></a>. However, here is a high-level overview of the process. You’ll have to follow these steps:</p><h3 id="1-create-a-timescale-instance">1.&nbsp;Create a Timescale instance</h3><p>Sign up or log in to <a href="https://console.cloud.timescale.com/signup"><u>Timescale Cloud</u></a> and click on "Create service." Choose a Time-series service with your desired CPU and memory plan. We recommend a minimum of 4 CPUs and 8 GB memory for migration. Click on "Download the cheatsheet" after the service is created to obtain an SQL file that contains the login details for your new service. Alternatively, you can copy the details directly from this page. After copying your password, click "I stored my password, go to service overview" at the bottom of the page. Once your service is ready to use, it will be labeled as "Running" in the Service Overview, displayed in green.</p><h3 id="2-collect-information-from-your-aws-rds-instance">2.&nbsp;Collect information from your AWS RDS instance</h3><p>The AWS RDS management console is extensive, containing numerous details. Our customers often tell us that locating the required details and taking actions in AWS is time-consuming, requiring manual navigation of various screens and documentation to find the relevant information. 
Additionally, AWS has its own intricacies, such as parameter groups, which require engineers to invest effort in understanding the concepts and gathering the required information.</p><p>To prepare an AWS RDS instance for migration, you need the following information about the instance as the first step:</p><ol><li>Endpoint</li><li>Port</li><li>Master username</li><li>Master password</li><li>VPC</li><li>DB instance parameter group</li></ol><h3 id="3-update-the-following-parameters-of-the-aws-rds-for-postgresql-instance">3.&nbsp;Update the following parameters of the AWS RDS for PostgreSQL instance</h3><p>The live migration offered by Timescale is powered by logical decoding, which consumes Write-Ahead Logs (WAL) to synchronize real-time changes happening on the source database during the migration. Therefore, you need to set the <code>wal_level</code> to <code>logical</code>.&nbsp;</p><p>Additionally, you must set <code>old_snapshot_threshold</code> to <code>-1</code> to preserve the snapshots until the migration is complete. Not setting this parameter to <code>-1</code> can cause errors in the logical decoding process, as the database performs periodic maintenance checks. You need to update the following configurations:</p><ol><li><code>wal_level</code>: set to <code>logical</code></li><li><code>old_snapshot_threshold</code>: set to <code>-1</code></li></ol><h3 id="4-prepare-the-intermediate-machine-to-start-the-migration-process">4.&nbsp;Prepare the intermediate machine to start the migration process</h3><p>To migrate a database from a source to a target, an intermediate machine is necessary. This machine is responsible for pulling data from the source database and pushing it to the target database. Specific migration tools are required to perform actions on both the source and target databases to complete the migration. 
The intermediate machine will host these migration tools throughout the entire migration process.</p><p>The recommended steps are the following:</p><ol><li>Set up an EC2 instance in the same region as your source and target databases.</li><li>Configure your EC2 instance to connect to your AWS RDS instance. This involves updating the security group of your RDS instance to allow the connection from the EC2 instance.</li></ol><h3 id="5-perform-the-database-migration">5.&nbsp;Perform the database migration</h3><p>Now that the source database (AWS RDS instance) is ready for migration, an intermediate machine is prepared to execute the migration, and the target database (your Timescale instance) is up and running, we can proceed with the actual migration process.</p><ol><li>Set the source database URI (i.e., AWS RDS instance) and the target database URI (i.e., Timescale instance).</li></ol><p><code>export TARGET=&lt;target_db_uri&gt;</code></p><p><code>export SOURCE=&lt;source_db_uri&gt;</code></p><ol start="2"><li>Execute the following command to initiate the migration. We have packaged all the necessary tools for live migration in a Docker image. Simply run the command below and wait for the migration to complete.</li></ol><pre><code class="language-SQL">docker run --rm -dit --name live-migration \
  -e PGCOPYDB_SOURCE_PGURI=$SOURCE \
  -e PGCOPYDB_TARGET_PGURI=$TARGET \
  -v ~/live-migration:/opt/timescale/ts_cdc \
  timescale/live-migration:v0.0.1
</code></pre>
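<p>Around this command, a few checks against the source database can be useful. The statements below are a sketch, meant to be run with <code>psql "$SOURCE"</code>: the first two confirm the parameters from step 3, and the last lists replication slots, which PostgreSQL uses to track logical decoding consumers. No slot name is assumed here, since slots are created by the migration tooling.</p><pre><code class="language-SQL">-- Run on the source (AWS RDS) database.
SHOW wal_level;               -- should report: logical
SHOW old_snapshot_threshold;  -- should report: -1

-- Once the container is running, a logical slot is expected to appear here.
SELECT slot_name, slot_type, active, restart_lsn
FROM pg_replication_slots;</code></pre>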
<p>During this step, the live migration workflow will copy existing data from the source database to the target database, as well as replicate ongoing changes (change data capture) from the source to the target. The copying process will occur sequentially, beginning with the existing data and then applying ongoing changes. The output logs of this step will display the migration progress and provide prompts to verify the target database before transitioning applications to use the target database as the primary.</p><ol start="3"><li>Verify the data consistency between the source and target databases once the live migration is synchronized. Once you are confident that the data consistency checks have been successful, proceed with the application switchover to the target database by promoting it as the primary database.</li></ol><p>And you’re set! Welcome to Timescale!</p><h2 id="why-choose-timescale-over-rds">Why Choose Timescale Over RDS</h2><p>If you’re reading this article because you’re contemplating a migration from RDS to Timescale, know that moving your data will not only be easy but will also come with the following benefits:</p><h3 id="44-faster-data-ingestion">44&nbsp;% faster data ingestion&nbsp;</h3><p><a href="https://timescale.ghost.io/blog/timescale-cloud-vs-amazon-rds-postgresql-up-to-350-times-faster-queries-44-faster-ingest-95-storage-savings-for-time-series-data/"><u>During our 16-thread ingestion benchmark</u></a>, where we inserted nearly one billion rows of data, we observed impressive results when running Timescale and RDS. Timescale outperformed RDS by 32&nbsp;% with 4 vCPUs and by 44&nbsp;% with 8 vCPUs. 
Both systems had the same I/O performance configured on their gp3 disks.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/01/How-to-Migrate-from-AWS-RDS-for-PostgreSQL-to-Timescale-ingest-performance.png" class="kg-image" alt="A line graph displaying Timescale's superior ingest performance compared to RDS" loading="lazy" width="1116" height="753" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2024/01/How-to-Migrate-from-AWS-RDS-for-PostgreSQL-to-Timescale-ingest-performance.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2024/01/How-to-Migrate-from-AWS-RDS-for-PostgreSQL-to-Timescale-ingest-performance.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/01/How-to-Migrate-from-AWS-RDS-for-PostgreSQL-to-Timescale-ingest-performance.png 1116w" sizes="(min-width: 720px) 720px"></figure><p></p><h3 id="up-to-350x-faster-queries">Up to 350x faster queries</h3><p>Query performance optimization is crucial in a time-series database, especially when powering real-time dashboards. The <a href="https://github.com/timescale/tsbs?ref=timescale.com"><u>Time-Series Benchmarking Suite</u></a> (TSBS), which we use to run our benchmarks, includes a variety of queries, each with its own description. For the RDS benchmark, we conducted 100 runs of each query on 4 vCPU instance types and recorded the results.</p><p>The table demonstrates that Timescale outperforms Amazon RDS consistently, often by more than 100x. In some cases, Timescale performs over 350x better, and no query type shows any degradation. The table below displays the data for 4 vCPU instances, but similar results were observed across all CPU types tested. 
With a high instance workload, you can achieve even better results.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/01/How-to-migrate-from-AWS-RDS-to-Timescale---median-query-timings-table.png" class="kg-image" alt="A table with the benchmarked queries and the Timescale and RDS times in miliseconds. Timescale outperformed RDS in all these queries." loading="lazy" width="1352" height="1824" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2024/01/How-to-migrate-from-AWS-RDS-to-Timescale---median-query-timings-table.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2024/01/How-to-migrate-from-AWS-RDS-to-Timescale---median-query-timings-table.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/01/How-to-migrate-from-AWS-RDS-to-Timescale---median-query-timings-table.png 1352w" sizes="(min-width: 720px) 720px"></figure><h3 id="95-storage-savings">95% storage savings</h3><p>In our benchmark, all data, except for the most recent partition, is compressed into our<a href="https://timescale.ghost.io/blog/building-columnar-compression-in-a-row-oriented-database/"><u> native columnar format</u></a>. This format utilizes advanced algorithms to reduce the required storage space for the CPU table significantly. Despite compression, you can still access the data as usual, but with the added advantages of a smaller size and a columnar structure.</p><p>Reducing storage usage can result in smaller volumes, lower costs, and faster access. 
We achieved a 95&nbsp;% reduction, from 159 GB to 8.6 GB—these numbers are not uncommon for real customer workloads.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/01/How-to-Migrate-from-AWS-RDS-for-PostgreSQL-to-Timescale-total-database-size.png" class="kg-image" alt="Timescale dramatically compresses database size compared to PostgreSQL: the graph shows two bars, one with 8.6 GB for TimescaleDB and another with 159 GB for Postgres" loading="lazy" width="1006" height="648" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2024/01/How-to-Migrate-from-AWS-RDS-for-PostgreSQL-to-Timescale-total-database-size.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2024/01/How-to-Migrate-from-AWS-RDS-for-PostgreSQL-to-Timescale-total-database-size.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/01/How-to-Migrate-from-AWS-RDS-for-PostgreSQL-to-Timescale-total-database-size.png 1006w" sizes="(min-width: 720px) 720px"></figure><h3 id="and-even-more-features">And even more features</h3><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/01/How-to-Migrate-from-AWS-RDS-for-PostgreSQL-to-Timescale-more-features.png" class="kg-image" alt="A summary of more Timescale benefits for time savings, performance at scale, and cost-efficiency—all built on Postgres. 
" loading="lazy" width="1123" height="558" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2024/01/How-to-Migrate-from-AWS-RDS-for-PostgreSQL-to-Timescale-more-features.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2024/01/How-to-Migrate-from-AWS-RDS-for-PostgreSQL-to-Timescale-more-features.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/01/How-to-Migrate-from-AWS-RDS-for-PostgreSQL-to-Timescale-more-features.png 1123w" sizes="(min-width: 720px) 720px"></figure><p>Check out the <a href="https://timescale.ghost.io/blog/timescale-cloud-vs-amazon-rds-postgresql-up-to-350-times-faster-queries-44-faster-ingest-95-storage-savings-for-time-series-data/" rel="noreferrer">Timescale vs. Amazon RDS PostgreSQL benchmark blog post</a> for more details.</p><h2 id="next-steps">Next Steps</h2><p>If you’re considering migrating from AWS RDS but still want to keep the PostgreSQL you know and love, this playbook provides a good overview of how quickly and easily it is to migrate to Timescale using the live migrations solution. We hope it will help you achieve your desired destination—a high-performance PostgreSQL but faster database with predictable, unambiguous billing. If you’re still on the fence, <a href="https://console.cloud.timescale.com/signup"><u>create a free Timescale account</u></a> and try it out for 30 days.</p><p>We will continue to iterate on our migration strategies—we have more aside from live migrations, targeted at different needs and use cases—to ensure that moving to Timescale is just as easy as using our cloud database platform. 
Read our documentation for more details:</p><ul><li><a href="https://docs.timescale.com/migrate/latest/playbooks/rds-timescale-live-migration/"><u>Live migrations from AWS RDS for PostgreSQL to Timescale</u></a></li><li><a href="https://docs.timescale.com/migrate/latest/playbooks/rds-timescale-pg-dump/"><u>Pg_dump/pg_restore from AWS RDS for PostgreSQL to Timescale</u></a></li><li><a href="https://docs.timescale.com/migrate/latest/live-migration/?ref=timescale.com"><u>Live migrations for any PostgreSQL to Timescale</u></a></li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Best Practices for Query Optimization on PostgreSQL]]></title>
            <description><![CDATA[Explore the optimization of PostgreSQL queries using tactics such as efficient indexing and partitioning, and judicious use of data types.]]></description>
            <link>https://www.tigerdata.com/blog/best-practices-for-query-optimization-in-postgresql</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/best-practices-for-query-optimization-in-postgresql</guid>
            <category><![CDATA[Product & Engineering]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[PostgreSQL Performance]]></category>
            <category><![CDATA[PostgreSQL Tips]]></category>
            <dc:creator><![CDATA[Team Tiger Data]]></dc:creator>
            <pubDate>Fri, 08 Dec 2023 15:18:00 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/12/Best-Practices-for-Query-Optimization-on-PostgreSQL-1.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/12/Best-Practices-for-Query-Optimization-on-PostgreSQL-1.png" alt="Best Practices for Query Optimization on PostgreSQL (a pack of elephants moving fast in neon colors)" /><p>The demands of modern applications and the exponential growth of data in today’s data-driven world have put immense pressure on databases. Traditional relational databases, including PostgreSQL, are increasingly being pushed to their limits as they struggle to cope with the sheer scale of data that needs to be processed and analyzed, requiring constant query optimization practices and performance tweaks.</p><p>Having built our product on PostgreSQL, at Timescale we’ve written extensively on the topic of tweaking your PostgreSQL database performance, from <a href="https://www.timescale.com/learn/postgresql-performance-tuning-how-to-size-your-database"><u>how to size your database</u></a>, <a href="https://www.timescale.com/learn/postgresql-performance-tuning-key-parameters"><u>key PostgreSQL parameters</u></a>, <a href="https://www.timescale.com/learn/postgresql-performance-tuning-optimizing-database-indexes"><u>database indexes</u></a>, and <a href="https://www.timescale.com/learn/postgresql-performance-tuning-designing-and-implementing-database-schema"><u>designing your schema</u></a> to <a href="https://www.timescale.com/learn/when-to-consider-postgres-partitioning"><u>PostgreSQL partitioning</u></a>.</p><p>In this article, we aim to explore best practices for enhancing query optimization in PostgreSQL. We’ll offer insights into optimizing queries, the importance of indexing, data type selection, and the implications of fluctuating data volumes and high transaction rates.</p><p>Let’s jump right in.</p><h2 id="why-is-postgresql-query-optimization-necessary">Why Is PostgreSQL Query Optimization Necessary?</h2>
<p>Optimizing your PostgreSQL queries becomes a necessity when performance issues significantly impact the efficiency and functionality of your PostgreSQL database, making your application sluggish and impacting the user experience. Before we dive into the solutions, let’s look at some of the key contributors to performance issues in PostgreSQL:</p>
<p><strong>Inefficient queries</strong>: Poorly optimized or overly complex queries act as significant bottlenecks in PostgreSQL, impeding data processing efficiency and overall database throughput. Regularly analyzing and refining these queries is crucial for maintaining optimal database performance. <a href="https://www.postgresql.org/docs/current/queries.html">Understanding and optimizing SQL queries</a> is essential for efficient, responsive database operations and for ensuring the database can handle complex <a href="https://www.tigerdata.com/learn/understanding-database-workloads-variable-bursty-and-uniform-patterns">data workloads</a> effectively.</p>
<p><strong>Insufficient indexes</strong>: Inadequate indexing can significantly slow down query execution in PostgreSQL. Strategically implementing indexes, particularly on columns that are frequently accessed, can drastically enhance performance and optimize database responsiveness. <a href="https://timescale.ghost.io/blog/use-composite-indexes-to-speed-up-time-series-queries-sql-8ca2df6b3aaa/">Effective indexing strategies</a> are not only crucial for accelerating query speeds but also play a major role in optimizing the efficiency of complex queries and large-scale data operations, ensuring a more responsive and robust database environment.</p>
<p><strong>Over-indexing</strong>: While it's true that insufficient indexing can hurt your <a href="https://www.tigerdata.com/learn/postgres-performance-best-practices">PostgreSQL performance</a>, it's equally important not to overdo it. Excessive indexes bring their own set of challenges: each additional index adds overhead to your <code>INSERT</code>s, <code>UPDATE</code>s, and <code>DELETE</code>s, consumes disk space, and can make database maintenance tasks (such as vacuuming) more time-consuming.</p>
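<p>One way to spot candidates for removal is to look for indexes that are never scanned. The query below is a minimal sketch using PostgreSQL's built-in <code>pg_stat_user_indexes</code> statistics view. The counters are cumulative since the last statistics reset, so treat a zero as a hint to investigate rather than proof that an index is unused, and keep indexes that back primary key or unique constraints regardless:</p>
<pre><code class="language-sql">-- List indexes with no recorded scans, largest first.
SELECT schemaname,
       relname      AS table_name,
       indexrelname AS index_name,
       idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
</code></pre>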
<p><strong>Inappropriate data types</strong>: Using unsuitable data types in PostgreSQL can lead to increased storage usage and slower query execution, as inappropriate types may need additional processing and can occupy more storage space than necessary. Carefully <a href="https://timescale.ghost.io/blog/best-practices-for-picking-postgresql-data-types/">selecting and optimizing data types</a> to align with the specific characteristics of the data is a critical aspect of database optimization. The right choice of data types not only influences overall database performance but also contributes to storage efficiency. Additionally, it helps in avoiding costly type conversions during <a href="https://www.tigerdata.com/learn/guide-to-postgresql-database-operations">database operations</a>, thereby streamlining data processing and retrieval.</p>
<p><strong>Fluctuating data volume</strong>: PostgreSQL's query planner relies on up-to-date data statistics to formulate efficient execution plans. Fluctuations in data volume can significantly impact these plans, potentially leading to suboptimal performance if the planner operates on outdated information. As data volumes change, it becomes crucial to regularly assess and adapt execution plans to these new conditions. Keeping the database statistics current is essential, as it enables the query planner to accurately assess the data landscape and make informed decisions, thereby optimizing query performance and ensuring the database responds effectively to varying data loads.</p>
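<p>As a rough sketch of how to check whether statistics are keeping up with your data, the built-in <code>pg_stat_user_tables</code> view reports when each table was last analyzed and how many rows have changed since. The table name <code>your_table</code> is a placeholder:</p>
<pre><code class="language-sql">-- When were planner statistics last refreshed, and how many rows changed since?
SELECT relname,
       last_analyze,
       last_autoanalyze,
       n_mod_since_analyze
FROM pg_stat_user_tables
ORDER BY n_mod_since_analyze DESC;

-- Refresh statistics manually for one table (placeholder name).
ANALYZE your_table;
</code></pre>
<p>Autovacuum normally runs <code>ANALYZE</code> for you; a manual run is mostly useful right after a bulk load or a large delete.</p>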
<p><strong>High transaction volumes</strong>: Large numbers of transactions can significantly strain PostgreSQL's resources, especially in high-traffic or data-intensive environments. Effectively leveraging <a href="https://docs.timescale.com/use-timescale/latest/ha-replicas/read-scaling/#read-replicas">read replicas</a> in PostgreSQL can substantially mitigate the impact of high transaction volumes, ensuring a more efficient and robust database environment.</p>
<p><strong>Hardware limitations</strong>: Constraints in CPU, memory, or storage can create significant bottlenecks in PostgreSQL's performance, as these hardware limitations directly affect the database's ability to process queries, handle concurrent operations, and store data efficiently. Upgrading hardware components, such as increasing CPU speed, expanding memory capacity, or adding more storage, can provide immediate improvements in performance. Additionally, optimizing resource allocation, like adjusting memory distribution for different database processes or balancing load across storage devices, can also effectively alleviate these hardware limitations.</p>
<p><strong>Lock contention</strong>: Excessive locking on tables or rows in PostgreSQL, particularly in environments that handle many concurrent queries, can lead to significant slowdowns and, in the worst case, deadlocks. This is because row-level or table-level locks restrict data access, increasing waiting times for other operations and potentially causing queuing delays. Therefore, <a href="https://timescale.ghost.io/blog/how-timescaledb-solves-common-postgresql-problems-in-database-operations-with-data-retention-management/">judicious use of locks</a> is crucial in maintaining database concurrency and ensuring smooth operation. Strategies such as using less restrictive lock types, designing transactions to minimize locked periods, and optimizing query execution plans can help reduce lock contention.</p>
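<p>To see whether lock contention is actually happening, you can ask PostgreSQL which sessions are blocked and by whom. This is a minimal sketch, assuming PostgreSQL 9.6 or later (where <code>pg_blocking_pids()</code> is available); the column aliases are our own:</p>
<pre><code class="language-sql">-- Show sessions that are currently blocked and the sessions blocking them.
SELECT blocked.pid                   AS blocked_pid,
       pg_blocking_pids(blocked.pid) AS blocking_pids,
       blocked.wait_event_type,
       now() - blocked.query_start   AS query_age,
       blocked.query                 AS blocked_query
FROM pg_stat_activity AS blocked
WHERE cardinality(pg_blocking_pids(blocked.pid)) &gt; 0;
</code></pre>
<p>An empty result means no session is waiting on another at that moment, so it is worth sampling this during peak load.</p>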
<p><strong>Lack of maintenance</strong>: Routine maintenance tasks such as vacuuming, reindexing, and updating statistics are fundamental to sustaining optimal performance in PostgreSQL databases. Vacuuming is essential for reclaiming storage space and preventing <a href="https://timescale.ghost.io/blog/how-to-fix-transaction-id-wraparound/">transaction ID wraparound issues</a>, ensuring the database remains efficient and responsive. Regular reindexing is crucial for maintaining the speed and efficiency of index-based query operations, as indexes can become fragmented over time. Additionally, keeping statistics up-to-date is vital for the query planner to make well-informed decisions, as outdated statistics can lead to suboptimal query plans. Ignoring these tasks can lead to a gradual but significant deterioration in database efficiency and reliability.</p>
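<p>The maintenance tasks above map onto a handful of standard commands. The sketch below first checks which tables have accumulated the most dead tuples, then shows the corresponding maintenance statements; <code>your_table</code> and <code>your_index</code> are placeholder names, and <code>REINDEX ... CONCURRENTLY</code> assumes PostgreSQL 12 or later:</p>
<pre><code class="language-sql">-- Tables with the most dead tuples and their last (auto)vacuum times.
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;

-- Reclaim dead tuples and refresh planner statistics in one pass.
VACUUM (ANALYZE) your_table;

-- Rebuild a bloated index without blocking writes.
REINDEX INDEX CONCURRENTLY your_index;
</code></pre>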
<h2 id="how-to-measure-query-performance-in-postgresql">How to Measure Query Performance in PostgreSQL</h2>
<h3 id="pgstatstatements">pg_stat_statements</h3>
<p>To optimize your queries, you must first identify your PostgreSQL performance bottlenecks. A simple way to do this is using <code>pg_stat_statements</code>, a <a href="https://www.tigerdata.com/blog/top-8-postgresql-extensions">PostgreSQL extension</a> that provides essential information about query performance. It records data about running queries, helping to identify performance slowdowns caused by inefficient queries, index changes, or ORM query generators. Notably, <code>pg_stat_statements</code> is enabled by default in TimescaleDB, enhancing its capability to monitor and optimize database performance out of the box.</p>
<p>You can query <code>pg_stat_statements</code> to gather various statistics such as the number of times a query has been called, total execution time, rows retrieved, and cache hit ratios:</p>
<ul>
<li>
<p><strong>Identifying long-running queries</strong>: Focus on queries with a high mean execution time, filtering on a minimum <code>calls</code> value suited to your application so that rarely executed queries don't dominate the list.</p>
</li>
<li>
<p><strong>Cache hit ratio</strong>: This metric measures how often the data needed for a query was already available in memory, which can affect query performance.</p>
</li>
<li>
<p><strong>Standard deviation in query execution time</strong>: Analyzing the standard deviation can reveal the consistency of query execution times, helping to identify queries with significant performance variability.</p>
</li>
</ul>
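<p>The three statistics above can be pulled in a single query. This is a minimal sketch that assumes PostgreSQL 13 or later, where the timing columns are named <code>mean_exec_time</code> and <code>stddev_exec_time</code> (earlier versions call them <code>mean_time</code> and <code>stddev_time</code>), and that the extension is loaded via <code>shared_preload_libraries</code> and created in the current database. The <code>calls</code> cutoff of 100 is an arbitrary value to adjust for your workload:</p>
<pre><code class="language-sql">-- Slowest queries by mean execution time, with variability and cache hit ratio.
SELECT query,
       calls,
       round(mean_exec_time::numeric, 2)   AS mean_ms,
       round(stddev_exec_time::numeric, 2) AS stddev_ms,
       rows,
       round(100.0 * shared_blks_hit
             / nullif(shared_blks_hit + shared_blks_read, 0), 2) AS cache_hit_pct
FROM pg_stat_statements
WHERE calls &gt; 100  -- arbitrary cutoff; ignore rarely executed queries
ORDER BY mean_exec_time DESC
LIMIT 20;
</code></pre>
<p>A high <code>stddev_ms</code> relative to <code>mean_ms</code> points to inconsistent execution times, and a low <code>cache_hit_pct</code> suggests the query frequently reads from disk.</p>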
<h3 id="insights-by-timescale">Insights by Timescale</h3>
<p>Timescale’s Insights (available to Timescale users at no extra cost) is a tool providing in-depth observation of PostgreSQL queries over time. It offers detailed statistics on query timing, latency, and memory and storage I/O usage, enabling users to comprehensively monitor and analyze their query and database performance.</p>
<ul>
<li>
<p><strong>Scalable query collection system</strong>: Insights is built on a scalable system that collects sanitized statistics on every query stored in Timescale, facilitating comprehensive analysis and optimization. <a href="https://timescale.ghost.io/blog/how-we-scaled-postgresql-to-350-tb-with-10b-new-records-day/">And the best part is that the team is dogfooding its own product to enable this tool</a>, expanding PostgreSQL to accommodate hundreds of TBs of data (and growing).</p>
</li>
<li>
<p><strong>Insights interface</strong>: The tool presents a graph showing the relationship between system resources (CPU, memory, disk I/O) and query latency.</p>
</li>
</ul>
<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/12/Best-Practices-for-Query-Optimization-on-PostgreSQL_Insights.png" class="kg-image" alt="Insights offers a drill-down view with finer-grain metrics for quick query optimization" loading="lazy" width="2000" height="1564" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/12/Best-Practices-for-Query-Optimization-on-PostgreSQL_Insights.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/12/Best-Practices-for-Query-Optimization-on-PostgreSQL_Insights.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2023/12/Best-Practices-for-Query-Optimization-on-PostgreSQL_Insights.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w2400/2023/12/Best-Practices-for-Query-Optimization-on-PostgreSQL_Insights.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Insights offers a drill-down view with finer-grain metrics for quick query optimization</span></figcaption></figure><ul>
<li>
<p><strong>Detailed query information</strong>: Insights provides a table of the top 50 queries based on chosen criteria, offering insights into query frequency, affected rows, and usage of Timescale features like hypertables and continuous aggregates.</p>
</li>
<li>
<p><strong>Drill-down view</strong>: Insights offers a drill-down view with finer-grain metrics, including trends in latency, buffer usage, and cache utilization.</p>
</li>
<li>
<p><strong>Real-world application</strong>: Check out the <a href="https://twitter.com/gooddaymax/status/1717573289285996889?s=20&amp;ref=timescale.com">Humblytics</a> story, which demonstrates Insights' practical application in identifying and resolving performance issues.</p>
</li>
</ul>
<h2 id="best-practices-for-query-optimization-in-postgresql">Best Practices for Query Optimization in PostgreSQL</h2>
<h3 id="understand-common-performance-bottlenecks">Understand common performance bottlenecks</h3>
<p>To identify inefficient queries so you can optimize them, analyze query execution plans using PostgreSQL's <code>EXPLAIN</code> command. This tool provides a breakdown of how your queries are executed, revealing critical details such as execution paths and the use of indexes. Look specifically for patterns such as full table scans, which suggest missing indexes, and for queries consuming high CPU or memory, which indicate room for optimization. By understanding the intricacies of the execution plan, you can pinpoint exactly where performance issues are occurring.</p>
<p>For example, you could run the following code to view the execution plan of a query:</p>
<pre><code class="language-sql">EXPLAIN SELECT * FROM your_table WHERE your_column = 'value';
</code></pre>
<p>To identify full table scans, you would look for “Seq Scan” in the output. This suggests that the query is scanning the entire table, which is often a sign that an index is missing or not being used effectively:</p>
<pre><code class="language-SQL">Seq Scan on your_table  (cost=0.00..1445.00 rows=50000 width=1024)
</code></pre>
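<p>If a sequential scan turns out to be the bottleneck, a common next step is to add an index on the filtered column and compare the plans. The sketch below reuses the placeholder names <code>your_table</code> and <code>your_column</code>; the index name is hypothetical. Keep in mind that <code>EXPLAIN ANALYZE</code> actually executes the query, so be careful when using it with statements that modify data:</p>
<pre><code class="language-sql">-- Run the query and report actual timings and buffer usage.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM your_table WHERE your_column = 'value';

-- Add an index on the filtered column without blocking writes.
CREATE INDEX CONCURRENTLY idx_your_table_your_column
    ON your_table (your_column);

-- Re-run the plan. If the planner judges the index selective enough,
-- the Seq Scan should give way to an index or bitmap scan.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM your_table WHERE your_column = 'value';
</code></pre>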
<div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-emoji">📚</div><div class="kg-callout-text"><a href="https://www.postgresql.org/docs/current/using-explain.html"><u>Check out the PostgreSQL documentation for more examples of what to look for when using </u><u><code spellcheck="false" style="white-space: pre-wrap;">EXPLAIN</code></u></a><u>.</u></div></div><h3 id="partition-your-data">Partition your data</h3>
<p>Partitioning large PostgreSQL tables is a powerful strategy to enhance their speed and efficiency. However, the process of setting up and maintaining partitioned tables can be burdensome, often requiring countless hours of manual configuration, testing, and maintenance. But there's a more efficient solution: hypertables. Available through the TimescaleDB extension and on AWS via the Timescale platform, hypertables simplify the PostgreSQL partition creation process significantly by automating the generation and management of <a href="https://www.tigerdata.com/learn/data-partitioning-what-it-is-and-why-it-matters">data partitions</a> without altering your user experience. Behind the scenes, however, hypertables work their magic, accelerating your queries and ingest operations.</p>
<p>To create a <a href="https://www.tigerdata.com/blog/database-indexes-in-postgresql-and-timescale-cloud-your-questions-answered">hypertable</a>, first create a regular PostgreSQL table:</p>
<pre><code class="language-SQL">CREATE TABLE conditions (
   time        TIMESTAMPTZ       NOT NULL,
   location    TEXT              NOT NULL,
   device      TEXT              NOT NULL,
   temperature DOUBLE PRECISION  NULL,
   humidity    DOUBLE PRECISION  NULL
);
</code></pre>
<p>Then, convert the table to a hypertable. Specify the name of the table you want to convert and the column that holds its time values.</p>
<pre><code class="language-SQL">SELECT create_hypertable('conditions', by_range('time'));
</code></pre>
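<p>Once the table is a hypertable, you keep querying it like a regular table. The sketch below, which assumes the <code>conditions</code> hypertable created above has had some data inserted, lists its chunks and runs a typical time-bounded aggregate; the one-day window and hourly bucket are arbitrary choices for illustration:</p>
<pre><code class="language-sql">-- List the chunks (partitions) TimescaleDB has created for the hypertable.
SELECT show_chunks('conditions');

-- A time filter lets the planner skip chunks outside the requested range.
SELECT time_bucket('1 hour', time) AS bucket,
       device,
       avg(temperature) AS avg_temperature
FROM conditions
WHERE time &gt; now() - INTERVAL '1 day'
GROUP BY bucket, device
ORDER BY bucket;
</code></pre>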
<div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-emoji">📚</div><div class="kg-callout-text">Want to learn more about hypertables? <a href="https://www.timescale.com/learn/pg_partman-vs-hypertables-for-postgres-partitioning"><u>Check out this comparison of pg_partman vs. hypertables.</u></a></div></div><p></p><p><br></p><p></p><h3 id="employ-partial-aggregation-for-complex-queries">Employ partial aggregation for complex queries</h3>
<p><a href="https://docs.timescale.com/use-timescale/latest/continuous-aggregates/about-continuous-aggregates/">Continuous aggregates</a> in TimescaleDB are a powerful tool to improve the performance of commonly accessed aggregate queries over large volumes of data. Continuous aggregates are based on PostgreSQL materialized views but incorporate incremental and automatic refreshes so they are always up-to-date and remain performant as the underlying dataset grows.</p>
<p>In the example below, we’re setting up a continuous aggregate for daily average temperatures, which is remarkably simple.</p>
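<pre><code class="language-sql">-- Continuous aggregates are created as materialized views (TimescaleDB 2.0 and later).
-- "conditions" is the hypertable created in the previous section.
CREATE MATERIALIZED VIEW daily_temp_avg
WITH (timescaledb.continuous)
AS
SELECT time_bucket('1 day', time) AS bucket,
       AVG(temperature) AS avg_temperature
FROM conditions
GROUP BY bucket;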
</code></pre>
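<p>To keep the aggregate refreshed automatically, you can attach a refresh policy. The following is a minimal sketch using TimescaleDB's <code>add_continuous_aggregate_policy</code>; the offsets and schedule are illustrative assumptions, not recommendations, and should be tuned to how late your data arrives and how fresh the results need to be:</p>
<pre><code class="language-sql">-- Every hour, refresh the buckets covering data between 3 days and 1 hour old.
SELECT add_continuous_aggregate_policy('daily_temp_avg',
    start_offset      =&gt; INTERVAL '3 days',
    end_offset        =&gt; INTERVAL '1 hour',
    schedule_interval =&gt; INTERVAL '1 hour');

-- Query the aggregate like any other view.
SELECT * FROM daily_temp_avg ORDER BY bucket DESC LIMIT 7;
</code></pre>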
<p>Learn how continuous aggregates can help you get real-time analytics or create a time-series graph.</p>
<h3 id="continuously-update-and-educate">Continuously update and educate</h3>
<p><a href="https://timescale.ghost.io/blog/read-before-you-upgrade-best-practices-for-choosing-your-postgresql-version/">Regularly updating PostgreSQL and TimescaleDB</a> is vital for performance and security. The upgrade process involves assessing changes, performance gains, security patches, and extension compatibility. Best practices include waiting until a new major version reaches its .2 minor release before adopting it (for stability), consistently applying minor version updates, and upgrading major versions when needed for functionality or security. Timescale further eases this process by handling minor updates automatically with no downtime and providing tools for testing major version upgrades, ensuring smooth transitions with minimal disruption.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Optimizing your queries in PostgreSQL doesn’t have to be a daunting task. While it involves understanding and addressing various factors, there is much you can do by adopting best practices, such as efficient indexing, judicious use of data types, regular database maintenance, and staying up-to-date with the latest PostgreSQL releases.</p>
<p>If you really want to extend PostgreSQL’s capabilities, <a href="https://console.cloud.timescale.com/signup">create a free Timescale account today</a>. Features such as hypertables, continuous aggregates, and advanced data management techniques significantly enhance PostgreSQL's ability to manage your demanding workloads effectively.</p>
<p>Written by <a href="https://pt.w3d.community/paulogio">Paulinho Giovannini</a></p>
]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Creating a Fast Time-Series Graph With Postgres Materialized Views]]></title>
            <description><![CDATA[Build a time-series graph or plot to quickly visualize data using Postgres materialized views and their upgraded version, continuous aggregates.]]></description>
            <link>https://www.tigerdata.com/blog/creating-a-fast-time-series-graph-with-postgres-materialized-views</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/creating-a-fast-time-series-graph-with-postgres-materialized-views</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Data Visualization]]></category>
            <dc:creator><![CDATA[Dylan Paulus]]></dc:creator>
            <pubDate>Mon, 27 Nov 2023 18:21:08 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/11/Creating-a-Fast-Time-Series-Graph-With-Postgres-Materialized-ViewsCreating-a-Fast-Time-Series-Graph-With-Postgres-Materialized-Views.jpg">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/11/Creating-a-Fast-Time-Series-Graph-With-Postgres-Materialized-ViewsCreating-a-Fast-Time-Series-Graph-With-Postgres-Materialized-Views.jpg" alt="Creating a Fast Time-Series Graph With Postgres Materialized Views" /><p>Imagine you have a massive amount of time-series data you want to explore and visualize. Seeing the latest trends, the historical patterns, and the outliers in your data can help you gain insights and make decisions. But how do you visualize and analyze time-series data effectively? How do you create graphs, plots, and other visualizations for real-time analytics showing the current state of your data and the historical changes over different time intervals? And how do you do it efficiently without sacrificing performance or accuracy?</p><p>In this article, we will see how to use PostgreSQL materialized views and Timescale’s improved version of these—continuous aggregates—to create a time-series graph that answers these questions.&nbsp;</p><div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-emoji">📊</div><div class="kg-callout-text"><a href="https://www.timescale.com/blog/what-is-a-time-series-plot-and-how-can-you-create-one/"><u>Learn more about time-series plots</u></a> or dive into <a href="https://www.timescale.com/blog/what-is-a-time-series-graph-with-examples/"><u>an explainer about time-series graphs</u></a>.</div></div><h2 id="creating-a-time-series-graph-in-postgresql">Creating a Time-Series Graph in PostgreSQL<br></h2><h3 id="method-1-creating-plots-and-graphs-directly-from-raw-data">Method 1: Creating plots and graphs directly from raw data&nbsp;</h3><p>Pretend you are a senior engineer at a company that creates devices to monitor the electrical power grid. These devices export a large amount of data—one PostgreSQL row is created per device every second. 
For this example, let's say the local power company uses one hundred devices (60,480,000 rows created per week). You want to be able to give your customers data visualizations of the <a href="https://en.wikipedia.org/wiki/Electrical_grid#:~:text=The%20demand%2C%20or%20load%20on,demand%20is%20the%20maximum%20load"><u>load on a given line</u></a> per hour, day, and week.</p><p>Our table looks like this:</p><pre><code class="language-sql">CREATE TABLE demand (
    id          serial primary key,
    amps        DOUBLE PRECISION  NOT NULL,
    location    TEXT,
    time        TIMESTAMPTZ       NOT NULL
);
</code></pre>
<p>We can populate the table with a single device's worth of dummy data by running the following <code>INSERT</code> command. Inserting 10,540,800 rows takes some time; narrow the range between the start and end timestamps to produce less data:</p><pre><code class="language-sql">INSERT INTO demand (amps, location, time) VALUES (random()*40, 'Spokane, WA', 
generate_series('2023-09-01T00:00:00+03:00'::timestamptz, '2023-12-31T23:59:59+03:00'::timestamptz, '1 second'));
</code></pre>
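<p>As a quick sanity check on the row counts quoted above, the arithmetic can be done in SQL without inserting anything. This is only a sketch: it assumes one row per second, an inclusive range, and the same start and end timestamps as the <code>INSERT</code> statement.</p><pre><code class="language-sql">-- Rows for one device: seconds between the two timestamps, plus one
-- because generate_series includes both endpoints.
SELECT
  EXTRACT(EPOCH FROM (
    '2023-12-31T23:59:59+03:00'::timestamptz
    - '2023-09-01T00:00:00+03:00'::timestamptz
  ))::bigint + 1 AS expected_rows;           -- 10540800

-- Weekly volume for one hundred devices at one row per second.
SELECT 100 * 60 * 60 * 24 * 7 AS rows_per_week;  -- 60480000
</code></pre>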
<p>Now, we can generate the data for a time-series plot by calculating the average amps per minute with the following SQL. Change <code>'1 minute'</code> to <code>'1 hour'</code>, <code>'1 day'</code>, or <code>'1 week'</code> to plot other intervals.</p><pre><code class="language-sql">SELECT 
 date_bin(interval '1 minute', time, timestamptz '2023-08-01' ) AS time_interval, 
 AVG(amps)
FROM demand
GROUP BY 1
ORDER BY 1;
</code></pre>
<figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/11/query-results.png" class="kg-image" alt="The query output" loading="lazy" width="1724" height="1422" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/11/query-results.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/11/query-results.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2023/11/query-results.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/11/query-results.png 1724w" sizes="(min-width: 720px) 720px"></figure><p>This query can take some time, depending on how much data is in the <code>demand</code> table. For the 10,540,800 rows we created, the query takes 15 seconds to execute on an Apple MacBook Pro with an 8-core Intel Core i9 and 32&nbsp;GB of RAM. That is 15 seconds to return plot data for a single device over four months! 
Imagine if we had hundreds or thousands of devices spanning over a year.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/11/explain-initial-query.png" class="kg-image" alt="The initial query" loading="lazy" width="2000" height="964" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/11/explain-initial-query.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/11/explain-initial-query.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2023/11/explain-initial-query.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/11/explain-initial-query.png 2252w" sizes="(min-width: 720px) 720px"></figure><p>Let's look at a few ways to improve the speed of our time-series plot using materialized views and continuous aggregates.</p><h3 id="method-2-using-materialized-views-to-make-graphs-more-performant">Method 2: Using materialized views to make graphs more performant&nbsp;</h3><p>In PostgreSQL, a view can be thought of as a stored query on top of a table. When we query a view, the underlying query the view was created with gets called. This gives us the ability to abstract away and simplify our queries, but a view won't do much to improve the speed of a query.&nbsp;</p><p>Somewhere between a table and a view sits the materialized view. A materialized view works similarly to a view in that you can make queries reusable. The difference is a materialized view will store the resulting data on disk—caching the data. When you use a materialized view, you don’t have to run the query again. You get the results from the disk. 
This makes your queries much faster!</p><div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-emoji">🗒️</div><div class="kg-callout-text"><a href="https://www.timescale.com/blog/how-postgresql-views-and-materialized-views-work-and-how-they-influenced-timescaledb-continuous-aggregates/"><u>Learn more about PostgreSQL materialized views and how they influenced the design of our continuous aggregates</u></a>.</div></div><p></p><p>To improve the speed of our time-series graph data, let's create a materialized view over the <code>demand</code> table.</p><pre><code class="language-sql">CREATE MATERIALIZED VIEW demand_amps_by_minute AS 
SELECT 
  date_bin(
    interval '1 minute', time, timestamptz '2023-08-01'
  ) AS time_interval, 
  AVG(amps) AS avg_amps 
FROM 
  demand 
GROUP BY 
  1 
ORDER BY 
  1;
</code></pre>
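<p>The same pattern covers the other intervals we want to chart. As a sketch only, here is an hourly variant: the view name <code>demand_amps_by_hour</code> and the <code>avg_amps</code> alias are our own choices, and <code>WITH NO DATA</code> is an optional PostgreSQL clause that defines the view without running the aggregation, so the expensive scan can be deferred to a later refresh.</p><pre><code class="language-sql">-- Hypothetical hourly variant of the view above (the name is our own).
CREATE MATERIALIZED VIEW demand_amps_by_hour AS
SELECT
  date_bin(interval '1 hour', time, timestamptz '2023-08-01') AS time_interval,
  AVG(amps) AS avg_amps
FROM demand
GROUP BY 1
WITH NO DATA;  -- define the view now, skip the expensive scan for the moment

-- Populate it later. Until this runs, SELECTs against the view raise an error.
REFRESH MATERIALIZED VIEW demand_amps_by_hour;
</code></pre>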
<p>Since creating the materialized view has to run the same average-amps-per-minute SQL, it can take some time. Once it's complete, run <code>SELECT * FROM demand_amps_by_minute;</code>. On the same MacBook Pro, the query now takes 58&nbsp;ms. Much better!</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/11/explain-materialized-view.png" class="kg-image" alt="The query plan with a materialized view" loading="lazy" width="2000" height="668" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/11/explain-materialized-view.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/11/explain-materialized-view.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2023/11/explain-materialized-view.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/11/explain-materialized-view.png 2252w" sizes="(min-width: 720px) 720px"></figure><p>This shows off the speed improvement materialized views can give us, but they come with a downside we haven't covered yet. When rows are added to, updated in, or deleted from the underlying table, we have to manually refresh the materialized view with a <code>REFRESH MATERIALIZED VIEW [materialized view name];</code> statement. 
This completely replaces the data in the materialized view by rerunning the query from its definition against the whole table.</p><p>Having to refresh your materialized views comes with a few glaring problems:</p><ul><li>If you have a steady stream of data being written to your table, which is very common with time-series data, your materialized view will be out of date again almost as soon as you refresh it.</li><li>Refreshing a materialized view comes with a performance hit, as it needs to rerun the materialized view's definition on all the data in the table.</li><li>You'll need to remember to run a refresh on your materialized views manually or maintain a cron job to do it.</li></ul><p>However, Timescale has engineered a little magic under the hood to remove all these pain points through <a href="https://docs.timescale.com/use-timescale/latest/continuous-aggregates/"><u>continuous aggregates</u></a>.</p><h3 id="method-3-creating-graphs-that-are-more-resource-efficient-and-easier-to-maintain-via-continuous-aggregates">Method 3: Creating graphs that are more resource-efficient and easier to maintain via continuous aggregates</h3><p>Timescale’s continuous aggregates have the same look and feel as materialized views, but they add some essential functionality to help you keep your graphs, plots, dashboards, or other visualizations of real-time analytics performant over time without manual maintenance.</p><p>First, continuous aggregates stay up to date automatically via a refresh policy that you define. For example, you can configure your continuous aggregate so it gets updated every 30 minutes, including your latest data. This is much more convenient than refreshing your views manually!</p><p>But the key is what happens under the hood once this refresh policy kicks in. In plain PostgreSQL materialized views, when you refresh the view, the query will be recomputed over the entire dataset. 
In other words, materialized view refreshes in plain PostgreSQL are not incremental. This makes the refresh process unnecessarily expensive, especially as your dataset grows and a large volume of data needs to be rematerialized.</p><p>Continuous aggregates fix this inefficiency: when you refresh a continuous aggregate, Timescale doesn’t drop all the old data and recompute the aggregate from scratch. Instead, the engine only runs the query against the most recent refresh period (e.g., 30 minutes) and any older data that has changed since the last refresh. This way, continuous aggregates keep your visualizations performant over time, regardless of how much your dataset grows.</p><p>Switching over to Timescale, we'll recreate our <code>demand</code> table using the same <code>CREATE TABLE</code> statement as before but leaving off the <code>id</code> column (we'll use <code>time</code> instead).</p><pre><code class="language-sql">CREATE TABLE demand (
    amps        DOUBLE PRECISION  NOT NULL,
    location    TEXT,
    time        TIMESTAMPTZ       NOT NULL
);
</code></pre>
<p>Next, we'll update <code>demand</code> to be a <a href="https://www.tigerdata.com/blog/database-indexes-in-postgresql-and-timescale-cloud-your-questions-answered" rel="noreferrer">hypertable</a>:</p><pre><code class="language-sql">SELECT create_hypertable('demand', 'time');
</code></pre>
<p>Finally, populate the <code>demand</code> table with data:</p><pre><code class="language-sql">INSERT INTO demand (amps, location, time) VALUES  (random()*40, 'Spokane, WA', 
generate_series('2023-09-01T00:00:00+03:00'::timestamptz, '2023-12-31T23:59:59+03:00'::timestamptz, '1 second'));
</code></pre>
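<p>Behind the scenes, a hypertable stores its rows in time-partitioned chunks. As an optional check that is not part of the walkthrough itself, the <code>timescaledb_information.chunks</code> view should list the chunks created for <code>demand</code>. How many you see depends on the chunk interval, which we left at its default here.</p><pre><code class="language-sql">-- List the chunks TimescaleDB created for the hypertable and their time ranges.
SELECT chunk_name, range_start, range_end
FROM timescaledb_information.chunks
WHERE hypertable_name = 'demand'
ORDER BY range_start;
</code></pre>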
<p>At last, we can create our continuous aggregate, which will work similarly to the previous materialized view.</p><pre><code class="language-sql">CREATE MATERIALIZED VIEW demand_amps_by_minute
WITH (timescaledb.continuous) AS
SELECT 
   time_bucket(INTERVAL '1 minute', time) AS bucket,
   AVG(amps)
FROM demand
GROUP BY bucket;
</code></pre>
<p>Next, we need to add a refresh policy so our continuous aggregate keeps refreshing on its own. For this example, we'll have it refresh every minute, but for your own workloads, you'll need to tune these settings to fit your needs.</p><pre><code class="language-sql">SELECT add_continuous_aggregate_policy(
	'demand_amps_by_minute', 
	start_offset =&gt; NULL, 
	end_offset =&gt; INTERVAL '1 hour',
	schedule_interval =&gt; INTERVAL '1 minute');
</code></pre>
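<p>To unpack the policy arguments: <code>start_offset</code> and <code>end_offset</code> bound the refresh window relative to the current time, so a <code>NULL</code> start means "from the beginning of the data," and an end offset of one hour leaves the most recent hour unmaterialized. <code>schedule_interval</code> controls how often the job runs. As a rough sketch, the queries below check that the job was registered and trigger a one-off refresh; the window dates are illustrative and simply cover the data we inserted.</p><pre><code class="language-sql">-- Confirm the policy job was registered and inspect its schedule and settings.
SELECT job_id, schedule_interval, config
FROM timescaledb_information.jobs
WHERE proc_name = 'policy_refresh_continuous_aggregate';

-- One-off manual refresh of a fixed window (dates chosen for illustration).
CALL refresh_continuous_aggregate('demand_amps_by_minute', '2023-09-01', '2024-01-01');
</code></pre>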
<p>If we run a <code>SELECT</code> query on <code>demand_amps_by_minute</code>, it now takes 120&nbsp;ms on my machine. That's a little slower than the plain materialized view, but still much faster than querying the table!</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/11/explain-continous-aggregate.png" class="kg-image" alt="The query plan with a continuous aggregate" loading="lazy" width="2000" height="1023" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/11/explain-continous-aggregate.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/11/explain-continous-aggregate.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2023/11/explain-continous-aggregate.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w2400/2023/11/explain-continous-aggregate.png 2400w" sizes="(min-width: 720px) 720px"></figure><p>Continuous aggregates track which data has been materialized and which hasn't by using a watermark (i.e., a pointer). When you query a continuous aggregate, you get materialized data from before the watermark and non-materialized data, aggregated at query time, from after the watermark. The watermark moves forward as the refresh policy works through the non-materialized data.</p><p>All this adds some time to the query, but we benefit from not having to manually refresh the materialized view!</p><p>Let's try it out. Insert a new row into the underlying <code>demand</code> table.</p><pre><code class="language-sql">INSERT INTO demand (amps, location, time) VALUES (100.2, 'Pullman, WA', now());
</code></pre>
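<p>To check the result, a query along these lines lists the most recent buckets (the <code>ORDER BY</code> and <code>LIMIT</code> are our own choice). One caveat, stated as an assumption about your setup: data newer than the watermark only shows up when real-time aggregation is enabled, and newer TimescaleDB releases ship with it turned off by default, so the second statement may be needed.</p><pre><code class="language-sql">-- Most recent buckets first; the bucket containing now() should be at the top.
SELECT *
FROM demand_amps_by_minute
ORDER BY bucket DESC
LIMIT 5;

-- If the new bucket is missing, real-time aggregation may be disabled.
-- This setting makes queries combine materialized and not-yet-materialized data.
ALTER MATERIALIZED VIEW demand_amps_by_minute
  SET (timescaledb.materialized_only = false);
</code></pre>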
<p>Then, if we re-query our continuous aggregate, we'll see a new bucket that reflects the newly added row.</p><h2 id="start-speeding-up-your-queries-today">Start Speeding Up Your Queries Today</h2><p>Throughout this article, we saw how to query a table directly to create a time-series graph for large amounts of data. We then improved the query performance by taking advantage of PostgreSQL's materialized views.</p><p>However, materialized views can be time-consuming to maintain. Finally, we removed the need to manually refresh the materialized view by taking advantage of Timescale's continuous aggregates. Now, it’s your turn to create your own time-series plots or <a href="https://www.timescale.com/learn/real-time-analytics-in-postgres"><u>real-time analytics</u></a> using these methods!</p><p>You can <a href="https://console.cloud.timescale.com/signup"><u>create a free Timescale account</u></a> and start speeding up your queries today.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Amazon Aurora vs. PostgreSQL: 35% Faster Ingest, Up to 16x Faster Queries, and 78% Cheaper Storage]]></title>
            <description><![CDATA[Read about our journey benchmarking Amazon Aurora. Spoiler alert: 35% faster ingest, up to 16x faster queries, less than ½ the price, zero fuss.]]></description>
            <link>https://www.tigerdata.com/blog/benchmarking-amazon-aurora-vs-postgresql</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/benchmarking-amazon-aurora-vs-postgresql</guid>
            <category><![CDATA[Benchmarks & Comparisons]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Amazon Aurora]]></category>
            <dc:creator><![CDATA[James Blackwood-Sewell]]></dc:creator>
            <pubDate>Wed, 22 Nov 2023 18:38:28 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/11/What-We-Learned-From-Benchmarking-Amazon-Aurora-PostgreSQL-Serverless-_over-1.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/11/What-We-Learned-From-Benchmarking-Amazon-Aurora-PostgreSQL-Serverless-_over-1.png" alt="What We Learned From Benchmarking Amazon Aurora PostgreSQL Serverless" /><p>At Timescale, we pride ourselves on making PostgreSQL fast. We started by extending PostgreSQL for new workloads, first for <a href="https://www.tigerdata.com/blog/time-series-introduction" rel="noreferrer">time series</a> with TimescaleDB, then with Timescale Vector, and soon in other directions (keep an 👀 out). We don’t modify PostgreSQL in any way. Our innovation comes from how we integrate with, run, and schedule databases.</p><p>Many users come to us from Amazon RDS. They start there, but as their database grows and performance suffers, they come to Timescale as a high-performance alternative. To see why, just look at our <a href="https://timescale.ghost.io/blog/timescale-cloud-vs-amazon-rds-postgresql-up-to-350-times-faster-queries-44-faster-ingest-95-storage-savings-for-time-series-data/"><u>time-series benchmark</u></a>, our <a href="https://timescale.ghost.io/blog/savings-unlocked-why-we-switched-to-a-pay-for-what-you-store-database-storage-model/" rel="noreferrer"><u>usage-based storage pricing model</u></a>, and our <a href="https://timescale.ghost.io/blog/introducing-dynamic-postgresql/"><u>response to serverless</u></a>, which gives you a better way of running non-time-series PostgreSQL workloads in the cloud without any wacky abstractions.</p><p>Amazon Aurora is another popular cloud database option. Sometimes, users start with Aurora right away; other times, they migrate from RDS to Aurora looking for a faster, more scalable PostgreSQL. But is this what they find?</p><p>This article looks at what Aurora is and why you’d use it, then presents some benchmark results that may surprise you. 
</p><h3 id="what-is-aurora-it%E2%80%99s-not-postgresql">What is Aurora? (It’s not PostgreSQL)</h3><p>Amazon Aurora is a database as a service (DBaaS) product released by AWS in 2015. The original selling point was a relational database engine custom-built to combine the performance and availability of high-end commercial databases (which we guess means Oracle and SQL Server) with the simplicity and cost-effectiveness of open-source databases (MySQL and PostgreSQL).</p><p>Originally, Amazon Aurora only supported MySQL, but PostgreSQL support was added in 2017. There have been a bunch of updates over the years, with the most important being Aurora Serverless (and then, when that fell a bit flat, Serverless v2), which aims to bring the serverless “scale to zero” model to databases.</p><p>Aurora’s key pillars have always been performance and availability. It’s marketed as faster than RDS (“up to three times the throughput of PostgreSQL”), as supporting multi-region clusters, and as highly scalable. 
Not much is known about the internals of Aurora (it’s closed-source, after all), but we do know that compute and storage have been decoupled, resulting in a cloud-native architecture that is PostgreSQL-compatible but isn’t Postgres.&nbsp;</p><h3 id="investigating-aurora">Investigating Aurora</h3><p>There are a few ways of running Aurora for PostgreSQL, and you’ll be asked two critical questions from the <strong>Create Database </strong>screen.</p><p>First up, you need to select a cluster storage configuration:</p><ul><li>Do you want to pay slightly less for your compute and stored data with an additional charge per I/O request (Aurora Standard)?</li><li>Or, do you want to pay a small premium on compute and stored data, but I/O is included (Aurora I/O-Optimized)?</li></ul><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/11/What-We-Learned-From-Benchmarking-Amazon-Aurora-PostgreSQL-Serverless_cluster-storage-confirmation.png" class="kg-image" alt="Cluster storage configuration screen in Amazon Aurora" loading="lazy" width="1972" height="878" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/11/What-We-Learned-From-Benchmarking-Amazon-Aurora-PostgreSQL-Serverless_cluster-storage-confirmation.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/11/What-We-Learned-From-Benchmarking-Amazon-Aurora-PostgreSQL-Serverless_cluster-storage-confirmation.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2023/11/What-We-Learned-From-Benchmarking-Amazon-Aurora-PostgreSQL-Serverless_cluster-storage-confirmation.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/11/What-We-Learned-From-Benchmarking-Amazon-Aurora-PostgreSQL-Serverless_cluster-storage-confirmation.png 1972w" 
sizes="(min-width: 720px) 720px"></figure><p>In our benchmark, we saw a 33&nbsp;% increase in CPU costs and a massive 125&nbsp;% increase in storage costs when moving from Standard to I/O-Optimized, although I/O-Optimized still came in cheaper once the I/O was factored in. AWS recommends using an I/O-Optimized instance if your I/O costs exceed 25&nbsp;% of your database costs.</p><p><br>I/O-Optimized turns out to be a billing construct: we saw roughly equivalent performance between the two storage configurations.</p><p>After you’ve chosen that, there’s another big decision coming up: do you want to enable Serverless v2?&nbsp;</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/11/What-We-Learned-From-Benchmarking-Amazon-Aurora-PostgreSQL-Serverless_instance-config.png" class="kg-image" alt="Instance config screen in Amazon Aurora" loading="lazy" width="1972" height="956" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/11/What-We-Learned-From-Benchmarking-Amazon-Aurora-PostgreSQL-Serverless_instance-config.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/11/What-We-Learned-From-Benchmarking-Amazon-Aurora-PostgreSQL-Serverless_instance-config.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2023/11/What-We-Learned-From-Benchmarking-Amazon-Aurora-PostgreSQL-Serverless_instance-config.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/11/What-We-Learned-From-Benchmarking-Amazon-Aurora-PostgreSQL-Serverless_instance-config.png 1972w" sizes="(min-width: 720px) 720px"></figure><p>Although three options are shown, there are really only two: Provisioned and Serverless. 
Provisioned is where you choose the instance class for your database, which comes with a fixed hourly cost. Serverless is where your costs are driven by your usage.</p><p>If you have quiet periods, Serverless might save you money; if you burst all the time, it might not. When you choose a Provisioned type, you get a familiar “choose your instance type” dialog; when you select Serverless, you get something new.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/11/What-We-Learned-From-Benchmarking-Amazon-Aurora-PostgreSQL-Serverless_capacity-range.png" class="kg-image" alt="Selecting the capacity range in Amazon Aurora" loading="lazy" width="2000" height="490" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/11/What-We-Learned-From-Benchmarking-Amazon-Aurora-PostgreSQL-Serverless_capacity-range.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/11/What-We-Learned-From-Benchmarking-Amazon-Aurora-PostgreSQL-Serverless_capacity-range.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2023/11/What-We-Learned-From-Benchmarking-Amazon-Aurora-PostgreSQL-Serverless_capacity-range.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/11/What-We-Learned-From-Benchmarking-Amazon-Aurora-PostgreSQL-Serverless_capacity-range.png 2000w" sizes="(min-width: 720px) 720px"></figure><p>So, instead of choosing a CPU and memory allocation associated with an instance, you set a range of resources in ACUs (Aurora Capacity Units), which your cluster will operate within.</p><p>So, what exactly is an ACU? That’s an excellent question and one which we still don’t entirely know the answer to. 
You can see that the description states an ACU provides “2 GiB of memory and corresponding compute and networking,” but what on Earth is <em>corresponding compute and networking</em>?</p><p>How do you compare this to Provisioned if you have no idea how many CPUs are in an ACU? Is an ACU one CPU, half a CPU, a quarter of a CPU? We actually have no idea, and we can see no way to quickly find out. The opacity was frustrating during our tests. It feels obfuscated for no good reason.</p><p>Confusion aside, the general idea is that, at any time, Amazon Aurora will use the number of ACUs (in half-ACU increments) that it needs to sustain your current workload within the range you specify. If your workload lets you scale up and down, Serverless might be a good idea. Or is it?</p><h3 id="aurora-costs">Aurora costs</h3><p>So, why isn’t everybody using Aurora? The other axis is price, and while <a href="https://www.tigerdata.com/blog/estimate-amazon-aurora-costs" rel="noreferrer"><u>Amazon Aurora pricing is significantly harder to model than RDS</u></a>, it’s definitely more expensive, with the difference soaring as you scale out replicas or multiple regions.</p><p>We thought so. We have had some interesting testimonials from customers telling us that they had lost confidence in Aurora. So, to draw our own conclusions, we started where any reasonable engineer would: we benchmarked.</p><h2 id="benchmarking-configuration">Benchmarking Configuration</h2><p>But, before we started, we had to decide what we would benchmark against. We ended up choosing the Serverless (v2) I/O-Optimized configuration because that’s what we tend to see people using in the wild when they talk to us about migration.</p><p>When deploying Amazon Aurora Serverless, we need to choose a range of ACUs (our mystery billing units). 
We wanted to compare with a Timescale 8&nbsp;CPU/32&nbsp;GB memory instance, so we selected a minimum of 8&nbsp;ACUs (16&nbsp;GB of memory) and a maximum of 16&nbsp;ACUs (32&nbsp;GB of memory). Again, this veneer over CPUs is very confusing. In a perfect world, one would hope that an ACU provides one CPU from the underlying instance type, but we just don’t know.</p><p>We used the <a href="https://github.com/timescale/tsbs"><u>Time Series Benchmark Suite</u></a> (TSBS) to benchmark Amazon Aurora for PostgreSQL against Timescale because we wanted to test a specific workload type (in this case, time series) and see how generic Aurora compares to a PostgreSQL that has been extended for that workload (and also because we ❤️ time series).</p><p><em><strong>Note:</strong> Many types of workloads are actually time series, more than you would think. This doesn’t only apply to the more traditional time-series use cases (e.g., finance) but also to workloads like energy metrics, sensor data, website events, </em><a href="https://www.timescale.com/learn/types-of-data-supported-by-postgresql-and-timescale/#:~:text=If%20you%20want%20to%20know,dealing%20with%20time%2Dseries%20data."><em><u>and others</u></em></a><em>.</em></p><p>We used the following TSBS configuration across all runs (for more info about how we run TSBS, you can see our <a href="https://timescale.ghost.io/blog/timescale-cloud-vs-amazon-rds-postgresql-up-to-350-times-faster-queries-44-faster-ingest-95-storage-savings-for-time-series-data/"><u>RDS Benchmark</u></a>):</p>
<!--kg-card-begin: html-->
<table style="border:none;border-collapse:collapse;"><colgroup><col width="191"><col width="211"><col width="222"></colgroup><tbody><tr style="height:25.6787109375pt"><td style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><br></td><td style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:700;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Timescale</span></p><br></td><td style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:700;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Amazon Aurora Serverless for PostgreSQL</span></p></td></tr><tr style="height:48pt"><td style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span 
style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:700;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">PostgreSQL version</span></p></td><td style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">15.4</span></p></td><td style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">15.3 (latest available)</span></p></td></tr><tr style="height:85.5pt"><td style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:700;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">PostgreSQL 
configuration</span></p></td><td style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">No changes</span></p></td><td style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">synchronous_commit=off</span></p><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">(to match Timescale)</span></p></td></tr><tr style="height:61.5pt"><td style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span 
style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:700;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Partitioning system</span></p></td><td style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">TimescaleDB (</span><a href="https://www.timescale.com/learn/is-postgres-partitioning-really-that-hard-introducing-hypertables" style="text-decoration:none;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#1155cc;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:underline;-webkit-text-decoration-skip:none;text-decoration-skip-ink:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">partitions configured transparently at ingest time</span></a><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">)</span></p></td><td style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span 
style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">pg_partman (partitions manually configured ahead of time)</span></p></td></tr><tr style="height:48pt"><td style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:700;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Compression into columnar</span></p></td><td style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Yes</span></p></td><td style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span 
style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Not supported</span></p></td></tr><tr style="height:48pt"><td style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:700;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Partition size</span></p></td><td colspan="2" style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">4h (each system ended up with 26 non-default partitions)</span></p></td></tr><tr style="height:48pt"><td style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span 
style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:700;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Scale (number of devices)</span></p></td><td colspan="2" style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">25,000</span></p></td></tr><tr style="height:48pt"><td style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:700;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Ingest workers&nbsp;</span></p></td><td colspan="2" style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span 
style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">16</span></p></td></tr><tr style="height:48pt"><td style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:700;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Rows ingested</span></p></td><td colspan="2" style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">868,000,000</span></p></td></tr><tr style="height:26.4287109375pt"><td style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span 
style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:700;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">TSBS profile</span></p><br></td><td colspan="2" style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">DevOps</span></p></td></tr><tr style="height:48pt"><td style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:700;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">CPU / Memory</span></p></td><td style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">8 vCPU / 32GB 
memory</span><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;"><br></span><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;"><br><br></span></p></td><td style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">8-16 ACUs&nbsp;</span></p><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">(see below for more details)</span></p></td></tr><tr style="height:48pt"><td style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span 
style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:700;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Volume size</span></p></td><td style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Dynamic</span></p></td><td style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Dynamic</span></p></td></tr><tr style="height:48pt"><td style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:700;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Disk type</span></p></td><td colspan="2" 
style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Default provisioned IOPs (no changes)</span></p></td></tr></tbody></table>
<!--kg-card-end: html-->
<h2 id="aurora-vs-postgresql-ingest-performance-comparison">Aurora vs. PostgreSQL Ingest Performance Comparison</h2><p>We weren’t expecting Timescale to compare favorably when it came to ingesting data (we know the gap between us and PostgreSQL for ingest has been narrowing as PostgreSQL native partitioning gets better). Because Aurora separates the compute and storage layers, we thought we would see some engineered gains there.</p><p>What we actually saw when we ran the benchmark—ingesting almost one billion rows—was Timescale, with 8&nbsp;CPUs, ingesting 35&nbsp;% faster than Aurora. Aurora was scaled up to 16 ACUs for the entire benchmark run (including the queries in the next section). So not only was Timescale 35&nbsp;% faster, but it was 35&nbsp;% faster with 50&nbsp;% of the CPU resources (assuming 1 CPU == 1 ACU).</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/11/What-We-Learned-From-Benchmarking-Amazon-Aurora-PostgreSQL-Serverless_ingest-speed.png" class="kg-image" alt="Ingest speed (average rows per second)" loading="lazy" width="2000" height="1222" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/11/What-We-Learned-From-Benchmarking-Amazon-Aurora-PostgreSQL-Serverless_ingest-speed.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/11/What-We-Learned-From-Benchmarking-Amazon-Aurora-PostgreSQL-Serverless_ingest-speed.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2023/11/What-We-Learned-From-Benchmarking-Amazon-Aurora-PostgreSQL-Serverless_ingest-speed.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/11/What-We-Learned-From-Benchmarking-Amazon-Aurora-PostgreSQL-Serverless_ingest-speed.png 2000w" sizes="(min-width: 720px) 720px"></figure><p>At this stage, 
some of you might be wondering why Timescale’s ingest speed jumped around the 30-minute mark. The jump happened when the platform dynamically adapted the I/O on the instance as we saw data flooding in (thanks to our amazing <a href="https://timescale.ghost.io/blog/savings-unlocked-why-we-switched-to-a-pay-for-what-you-store-database-storage-model/" rel="noreferrer">Usage Based Storage</a> implementation).</p><h2 id="aurora-vs-postgresql-query-performance-comparison">Aurora vs. PostgreSQL Query Performance Comparison</h2><p>Query performance matters with a demanding workload because your application often needs a response in real time or near real time. While the details of the TSBS query types are basically indecipherable (<a href="https://github.com/timescale/tsbs?ref=timescale.com#appendix-i-query-types-"><u>here’s a cheat sheet</u></a>), they model some common (although quite complex) time-series patterns that an application might use. Each query was run 10 times, and the average value was compared for each of our target systems.</p><p>The results here tell another very interesting story, with Timescale winning in most query categories—we were between 1.15x and 16x faster, with two queries being slightly slower. When we did a one-off test with a Timescale instance with 16&nbsp;CPUs, some queries stretched out to 81x faster, and Timescale won every category.</p><p>Why is this? Timescale optimizes for the workload by teaching the planner how to handle these analytical queries and by using our native compression, which flips the row-based PostgreSQL data into a <a href="https://www.tigerdata.com/blog/building-columnar-compression-in-a-row-oriented-database" rel="noreferrer">columnar</a> format and speeds up analysis. 
For more information about how our technology works and how it can help you, check out our <a href="https://timescale.ghost.io/blog/timescale-cloud-vs-amazon-rds-postgresql-up-to-350-times-faster-queries-44-faster-ingest-95-storage-savings-for-time-series-data/"><u>Timescale vs. Amazon RDS benchmark blog post</u></a>.&nbsp;</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/11/What-we-learned-from-benchmarling-amazon-aurora-median-query-timings.png" class="kg-image" alt="Median query timings" loading="lazy" width="1464" height="1824" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/11/What-we-learned-from-benchmarling-amazon-aurora-median-query-timings.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/11/What-we-learned-from-benchmarling-amazon-aurora-median-query-timings.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/11/What-we-learned-from-benchmarling-amazon-aurora-median-query-timings.png 1464w" sizes="(min-width: 720px) 720px"></figure><h2 id="aurora-vs-postgresql-data-size-comparison">Aurora vs. PostgreSQL Data Size Comparison</h2><p>What about the total size of the CPU table at the end of the benchmark? There were no surprises here. Amazon Aurora (even though it’s using a different storage backend to PostgreSQL) doesn’t seem to change the total table size, with it coming in at 159&nbsp;GB (the same as RDS did). 
In contrast, Timescale compresses the time-series data by 95&nbsp;% to 8.6&nbsp;GB.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/11/What-We-Learned-From-Benchmarking-Amazon-Aurora-PostgreSQL-Serverless_compression-ratio.png" class="kg-image" alt="Total database size" loading="lazy" width="2000" height="824" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/11/What-We-Learned-From-Benchmarking-Amazon-Aurora-PostgreSQL-Serverless_compression-ratio.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/11/What-We-Learned-From-Benchmarking-Amazon-Aurora-PostgreSQL-Serverless_compression-ratio.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2023/11/What-We-Learned-From-Benchmarking-Amazon-Aurora-PostgreSQL-Serverless_compression-ratio.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/11/What-We-Learned-From-Benchmarking-Amazon-Aurora-PostgreSQL-Serverless_compression-ratio.png 2000w" sizes="(min-width: 720px) 720px"></figure><h2 id="aurora-vs-postgresql-cost-comparison">Aurora vs. PostgreSQL Cost Comparison</h2><p>There is no way to sugarcoat it: Amazon Aurora Serverless is expensive. While we were benchmarking, it used 16&nbsp;ACUs constantly. First, we tried the standard Serverless product, but it charged a prohibitive amount for I/O, which is why we don’t see anyone using it for anything even remotely resembling an always-on workload. It defeats the purpose of serverless if you can’t actually ingest or query data without breaking the bank.</p><p>So, we switched to the Serverless v2 I/O-Optimized pricing, which charges a small premium on compute and storage but nothing for I/O. 
It’s supposed to help with pricing for a workload like the one we’re simulating.&nbsp;</p><p>Let’s see how Aurora I/O-Optimized really did. The bill for running this benchmark has two main components: compute and storage costs. (Although Aurora actually charges for some other facets, the costs were low in this case.) These are the results:&nbsp;</p><p><strong>Compute costs:</strong></p><ul><li>Aurora Serverless v2 I/O-Optimized costs $2.56 per hour for the 16 ACUs, which were used for the duration of the benchmark.</li><li>The Timescale 8&nbsp;vCPU instance costs $1.26 per hour <strong>(52&nbsp;% cheaper than Aurora).</strong><br></li></ul><p><strong>Storage costs:</strong></p><ul><li>Aurora Serverless v2 I/O-Optimized needed 159&nbsp;GB of storage for the CPU table and indexes, which would be billed at $34 per month.&nbsp;</li><li>Timescale needed 8.6&nbsp;GB to store the CPU table and indexes, which would be billed at $7.60 per month (<strong>78&nbsp;% cheaper than Aurora</strong>).</li></ul><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/11/What-We-Learned-From-Benchmarking-Amazon-Aurora-PostgreSQL-Serverless_instance-costs.png" class="kg-image" alt="Instance costs comparison" loading="lazy" width="1870" height="1108" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/11/What-We-Learned-From-Benchmarking-Amazon-Aurora-PostgreSQL-Serverless_instance-costs.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/11/What-We-Learned-From-Benchmarking-Amazon-Aurora-PostgreSQL-Serverless_instance-costs.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2023/11/What-We-Learned-From-Benchmarking-Amazon-Aurora-PostgreSQL-Serverless_instance-costs.png 1600w, 
https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/11/What-We-Learned-From-Benchmarking-Amazon-Aurora-PostgreSQL-Serverless_instance-costs.png 1870w" sizes="(min-width: 720px) 720px"></figure><p><strong>Timescale is 52&nbsp;% cheaper to run the machines used for the benchmark (assuming a constant workload) and 78&nbsp;% cheaper to store the data created by the benchmark.&nbsp;</strong><br></p><h2 id="our-finding">Our Finding</h2><p>The main takeaway from this benchmark was that, although Aurora Serverless is commonly used to “scale PostgreSQL” for large workloads, when compared to Timescale, it fell (very) short of doing this.</p><p>Timescale was:</p><ul><li>35&nbsp;% faster to ingest</li><li>1.15x-16x faster to query in all but two query categories</li><li>95&nbsp;% more efficient at storing data</li><li>52&nbsp;% cheaper per hour for compute</li><li>78&nbsp;% cheaper per month to store the data created</li></ul><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/11/What-We-Learned-From-Benchmarking-Amazon-Aurora-PostgreSQL-Serverless_summary-3.png" class="kg-image" alt="A summary of the benchmark results: Timescale vs. 
Amazon Aurora Serverless" loading="lazy" width="1952" height="1110" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/11/What-We-Learned-From-Benchmarking-Amazon-Aurora-PostgreSQL-Serverless_summary-3.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/11/What-We-Learned-From-Benchmarking-Amazon-Aurora-PostgreSQL-Serverless_summary-3.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2023/11/What-We-Learned-From-Benchmarking-Amazon-Aurora-PostgreSQL-Serverless_summary-3.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/11/What-We-Learned-From-Benchmarking-Amazon-Aurora-PostgreSQL-Serverless_summary-3.png 1952w" sizes="(min-width: 720px) 720px"></figure><p>While Aurora does replace PostgreSQL’s storage backend with newer (closed-source ☹️) technology, our investigation shows that Timescale beats it for large workloads in all dimensions.&nbsp;</p><p>Looking at this data, people might conclude that “Aurora isn’t for time-series workloads” or “of course a time-series database beats Aurora (a PostgreSQL database) for a time-series workload.” Both of those statements are true, but we would like to leave you with three thoughts:</p><ol><li>Timescale is PostgreSQL—in fact, it’s more PostgreSQL than Amazon Aurora.&nbsp;</li><li>Timescale is tuned for time-series workloads, but that doesn’t mean it’s not also great for general-purpose workloads.</li><li>A very high proportion of the “large tables” or “large datasets” that give PostgreSQL problems (and might cause people to look at Aurora) are organized by timestamp or an incrementing primary key (perhaps bigint)—both of which Timescale is optimized for, regardless of whether you call your data time-series or not.</li></ol><p><a href="https://console.cloud.timescale.com/signup?ref=timescale.com"><u>Create a 
free Timescale account</u></a> to get started with Timescale today.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Teaching Postgres New Tricks: SIMD Vectorization for Faster Analytical Queries]]></title>
            <description><![CDATA[Read how we supercharged Postgres with vectorization and Single Instruction, Multiple Data (SIMD) to set your analytical queries on fire.]]></description>
            <link>https://www.tigerdata.com/blog/teaching-postgres-new-tricks-simd-vectorization-for-faster-analytical-queries</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/teaching-postgres-new-tricks-simd-vectorization-for-faster-analytical-queries</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Announcements & Releases]]></category>
            <dc:creator><![CDATA[James Blackwood-Sewell]]></dc:creator>
            <pubDate>Wed, 15 Nov 2023 13:28:58 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/11/SIMD-Vectorization-for-Faster-Analytical-Queries_cover.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/11/SIMD-Vectorization-for-Faster-Analytical-Queries_cover.png" alt="A fierce neon tiger next to compressed disks: Teaching Postgres New Tricks: SIMD Vectorization for Faster Analytical Queries" /><p>After more than a year in the works, we’re proud to announce that the latest release of TimescaleDB (TimescaleDB 2.12) has added a vectorized query pipeline that makes Single Instruction, Multiple Data (SIMD) vectorization on our hybrid row columnar storage a reality for PostgreSQL. Our goal is to make common analytics queries an order of magnitude faster, making the <a href="https://survey.stackoverflow.co/2023/#section-most-popular-technologies-databases"><u>world’s most loved database</u></a> even better.</p><p>We’ve already built a mechanism for <a href="https://timescale.ghost.io/blog/building-columnar-compression-in-a-row-oriented-database/"><u>transforming your PostgreSQL tables into hybrid row columnar stores</u></a> with our native columnar compression. When you compress data you get the immediate benefit of significantly reducing storage size, and you get the secondary benefit of spending less CPU time waiting for disk reads. 
But there is another avenue for optimization that comes from columnar storage, and we are now focused on unlocking its potential to set analytical queries on fire.</p><p>(Here's a sneak preview so you can see what we're talking about.)</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/11/SIMD-Vectorization-for-Faster-Analytical-Queries_summarized-table.png" class="kg-image" alt="" loading="lazy" width="2000" height="448" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/11/SIMD-Vectorization-for-Faster-Analytical-Queries_summarized-table.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/11/SIMD-Vectorization-for-Faster-Analytical-Queries_summarized-table.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2023/11/SIMD-Vectorization-for-Faster-Analytical-Queries_summarized-table.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/11/SIMD-Vectorization-for-Faster-Analytical-Queries_summarized-table.png 2400w" sizes="(min-width: 720px) 720px"></figure><p>If you thought you had to turn to a specialized “analytics” columnar database to serve your queries, think twice. 
In this article, we walk you through how we’ve supercharged PostgreSQL with vectorization, or, to be more precise, implemented a vectorized query execution pipeline that lets us transparently unlock the power of SIMD, so you can start on Postgres, scale with Postgres, and stay with Postgres—even for your analytical workloads.&nbsp;</p><div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text">This work began shipping in TimescaleDB 2.12, which is available today, and continues in TimescaleDB 2.13, which will ship in November to all time-series services in the Timescale platform. <a href="https://console.cloud.timescale.com/signup"><u>Create an account here and try it out for 30 days.</u></a></div></div><h2 id="from-postgres-scaling-issues-to-vectorization">From <a href="https://www.tigerdata.com/learn/guide-to-postgresql-scaling" rel="noreferrer">Postgres Scaling</a> Issues to Vectorization</h2><p>The decision to implement vectorized query execution in TimescaleDB comes from a long line of initiatives aimed at improving the PostgreSQL experience and scalability. Before we get into the technical details, let’s start by discussing where developers reach the limits of Postgres and how vectorization can help.</p><p>You love Postgres (doesn’t everyone?) and chose it to power your new application because using a rock-solid, widely used database with an incredibly diverse ecosystem that supports full SQL just makes sense.</p><p>Things are going really well: development is easy, and the application launches. <a href="https://www.timescale.com/learn/types-of-data-supported-by-postgresql-and-timescale"><u>You might be working with IoT devices, sensors, event data, or financial instruments</u></a>—but whatever the use case, as time moves on, data starts piling up. All of a sudden, some of the queries that power your application mysteriously begin to get slower. Panic starts to set in. 
😱</p><p>Fast forward a few weeks or months, and something is off. You’re spending money on adding additional resources to the database and burning precious developer time trying to work out what’s broken. It doesn’t feel like anything is wrong on the application side, and tuning PostgreSQL hasn’t helped. Before you know it, someone has proposed splitting part of the workload into a different (perhaps “purpose-built”) database.&nbsp;</p><p>Complexity and tech debt rocket as the size of your tech stack balloons, your team has to learn a new database (which comes with its own set of challenges), and your application now has to deal with data from multiple siloed systems.&nbsp;</p><h2 id="teaching-postgres-new-tricks-to-make-this-journey-smoother">Teaching Postgres new tricks to make this journey smoother&nbsp;&nbsp;</h2><p>This is the painful end-state that Timescale wants to help avoid, allowing developers to scale and stay with PostgreSQL. Over the years, TimescaleDB has made PostgreSQL better with many features to help you scale smoothly, like<a href="https://www.timescale.com/learn/is-postgres-partitioning-really-that-hard-introducing-hypertables"><u> hypertables with automatic partitioning</u></a>, <a href="https://www.tigerdata.com/blog/building-columnar-compression-in-a-row-oriented-database" rel="noreferrer"><u>native columnar compression</u></a>, <a href="https://www.timescale.com/learn/real-time-analytics-in-postgres"><u>improved materialized views</u></a>, <a href="https://timescale.ghost.io/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql/"><u>query planner improvements</u></a>, and much more. 
If it holds you back in PostgreSQL, we want to tackle it.</p><p>Which brings us to today’s announcement…&nbsp;</p><p>For the past year we’ve been investigating how to extend PostgreSQL to unlock techniques used by specialized analytics databases custom-built for OnLine Analytical Processing (OLAP), even while retaining ACID transactions, full support for mutable data, and compatibility with the rest of the wonderful ecosystem. We don’t have the luxury of building a database from the ground up for raw performance (with all the trade-offs that typically entails), but we think where we have ended up offers a unique balance of performance, flexibility, and stability.</p><p>You <em>can</em> teach an old elephant new tricks and sometimes get an order of magnitude speedup when you do!</p><h2 id="columnar-storage-in-postgresql">Columnar Storage in PostgreSQL</h2><p>Before we launch into the vectorization and SIMD deep dive, we need to set the scene by explaining the other feature which makes it possible, our <a href="https://timescale.ghost.io/blog/building-columnar-compression-in-a-row-oriented-database/"><u>compressed columnar storage</u></a>.</p><p>By default, PostgreSQL stores and processes data in a way that is optimized for operating on data record by record (or row) as it’s inserted. The on-disk data files are organized by row, and queries use a row-based iterator to process that data. Paired with a B-tree index, a row-based layout is great for transactional workloads, which are more concerned with quickly ingesting and operating on individual records.&nbsp;</p><p>Databases that optimize for raw analytical performance take the opposite approach to PostgreSQL—they make some architectural trade-offs to organize writes with multiple values from one column grouped on disk. 
When a read happens, a column-based iterator is used, which means only the columns that are needed are read.&nbsp;</p><p>Column organized, or columnar, storage performs poorly when an individual record is targeted or when all columns are requested, but amazingly for the aggregate or single-column queries that are common in analytics or used for powering dashboards.</p><p>To clarify things, the following diagram shows how a row store and a column store would logically lay out data from devices measuring temperature.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/11/SIMD-Vectorization-for-Faster-Analytical-Queries_data-format-diagram.png" class="kg-image" alt="How a row store and a column store would logically lay out data from devices measuring temperature" loading="lazy" width="1998" height="817" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/11/SIMD-Vectorization-for-Faster-Analytical-Queries_data-format-diagram.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/11/SIMD-Vectorization-for-Faster-Analytical-Queries_data-format-diagram.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2023/11/SIMD-Vectorization-for-Faster-Analytical-Queries_data-format-diagram.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/11/SIMD-Vectorization-for-Faster-Analytical-Queries_data-format-diagram.png 1998w" sizes="(min-width: 720px) 720px"></figure><p></p><h2 id="row-vs-columnar-storage-why-not-both">Row vs. Columnar Storage: Why Not Both?&nbsp;</h2><p>Traditionally, you had to choose between a database that supported a row-based format optimized for transactional workloads or one that supported a column-based format targeted towards analytical ones. 
But, what we saw over and over again with our customers is that, with the same dataset, they actually wanted to be able to perform transactional-style operations on recent data and analytical operations on historical data.</p><p>Timescale is built on Postgres, so we can store data using Postgres’ native row format effortlessly. We have also built out the ability to organize data by columns through our native columnar compression (check out this <a href="https://timescale.ghost.io/blog/building-columnar-compression-in-a-row-oriented-database/"><u>recent deep dive</u></a> into the technical details). You can keep recent data in a row format and convert it to columnar format as it ages. </p><p>Both formats can be queried together seamlessly, the conversion is handled automatically in the background, and we can still support transactions and modifications on our older data (albeit less performantly).</p><p>When you’re working with columnar data, the benefit for analytical queries is immense, with some aggregate queries over columnar storage coming in <strong>5x</strong>, <strong>10x</strong>, and in some cases, even up to <strong>166x faster</strong> (due to lower I/O requirements and metadata caching) compared to row-based storage, as well as taking <strong>95&nbsp;% less space to store </strong>(due to our columnar compression) when tested using the <a href="https://github.com/timescale/tsbs"><u>Time-Series Benchmark Suite</u></a>.</p><p>But can we make this faster? Read on!</p><h2 id="vectorization-and-simd%E2%80%94oh-my">Vectorization and SIMD—Oh My!</h2><p>Now that we have data in a columnar format, we have a new world of optimization to explore, starting with vectorization and SIMD. 
Current CPUs are amazing feats of engineering, supporting SIMD instruction sets that can process multiple data points with a single instruction, both working faster and giving much better memory and cache locality.&nbsp; (The exact number they can process depends on the register size of the CPU and the data size; with a 128-bit register, each vector could hold 4 x 32-bit values, resulting in a theoretical 4x speedup.)</p><p>A regular (or scalar) CPU instruction receives two values and performs an operation on them, returning a single result. A vectorized SIMD CPU instruction processes two same-sized vectors (a.k.a. arrays) of values simultaneously, executing the same operation across both vectors to create an output vector in a single step. The magic is that the SIMD instruction takes the same amount of time as its scalar equivalent, even though it’s doing more work.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/11/SIMD-Vectorization-for-Faster-Analytical-Queries_scalar-vs-vectorized-diagram.png" class="kg-image" alt="Scalar vs. vectorized CPU instruction" loading="lazy" width="1427" height="762" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/11/SIMD-Vectorization-for-Faster-Analytical-Queries_scalar-vs-vectorized-diagram.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/11/SIMD-Vectorization-for-Faster-Analytical-Queries_scalar-vs-vectorized-diagram.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/11/SIMD-Vectorization-for-Faster-Analytical-Queries_scalar-vs-vectorized-diagram.png 1427w" sizes="(min-width: 720px) 720px"></figure><p>Implementing vectorized query execution on top of our compressed columnar storage has been a significant focus for Timescale over the last year. 
It quickly became evident that implementing a vectorized query pipeline is one of the most exciting areas for optimization we can tackle, with order-of-magnitude performance increases on the table.</p><h2 id="timescale%E2%80%99s-vectorized-query-execution-pipeline">Timescale’s Vectorized Query Execution Pipeline</h2><p>As of version 2.12, TimescaleDB supports a growing number of vectorized operations over compressed data, with many more coming in 2.13 and beyond. When we were starting, one of the biggest challenges was integrating the built-in PostgreSQL operators, which process data in row-based tuples, with our new vectorized pipeline, which would be triggered as the batch was decompressed and complete when the batch was aggregated.</p><p>This becomes very clear when we look at an aggregate query. For us to vectorize aggregation, we need to have that as part of our vectorization pipeline (and not at a higher level where PostgreSQL would normally handle it).&nbsp;</p><p>However, because a single query could be returning data from an uncompressed and a compressed chunk (our abstraction which partitions tables) at the same time, we also need to return the same type of data in both cases (even though no vectorization would take place for the uncompressed data). We did this by changing both plans' output to PostgreSQL <a href="https://www.postgresql.org/docs/current/parallel-plans.html#PARALLEL-AGGREGATION"><u>Partial Aggregate nodes</u></a> (which were actually developed for parallel aggregation) rather than raw tuples. 
PostgreSQL already knows how to deal with partial aggregates, so this gives us a common interface to work with that allows early aggregation.</p><p>The following diagram contains a query plan for an aggregation query and shows how an uncompressed chunk, a compressed chunk with vectorization disabled, and a compressed chunk with vectorization enabled all flow up to the same PostgreSQL Append node.&nbsp;</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/11/SIMD-Vectorization-for-Faster-Analytical-Queries_finalize-aggregate-diagram.png" class="kg-image" alt="A query plan for an aggregation query and shows how an uncompressed chunk, a compressed chunk with vectorization disabled, and a compressed chunk with vectorization enabled all flow up to the same PostgreSQL Append node" loading="lazy" width="1649" height="1147" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/11/SIMD-Vectorization-for-Faster-Analytical-Queries_finalize-aggregate-diagram.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/11/SIMD-Vectorization-for-Faster-Analytical-Queries_finalize-aggregate-diagram.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2023/11/SIMD-Vectorization-for-Faster-Analytical-Queries_finalize-aggregate-diagram.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/11/SIMD-Vectorization-for-Faster-Analytical-Queries_finalize-aggregate-diagram.png 1649w" sizes="(min-width: 720px) 720px"></figure><p>But doing this had an amazing side-effect: we could now do early aggregation for uncompressed chunks! 
In fact, when we committed this in TimescaleDB 2.12, we saw a consistent 10-15&nbsp;% speedup across all aggregation queries which operated on hypertables, even before we got to implementing vectorization (interestingly, a large part of this improvement comes from working with smaller datasets when aggregating, for example, smaller hash tables).</p><p>Now that we could keep PostgreSQL happy when aggregating by using Partial Aggregates, we turned our attention to the start of the pipeline. We knew that we needed to convert the compressed representation into an in-memory format, which each of our vectorization stages could use. </p><p>We chose to update our decompression node to read compressed data and output data in the Apache Arrow format, allowing us to quickly and transparently perform SIMD operations at each stage of the execution pipeline.</p><h2 id="vectorization-stages">Vectorization Stages</h2><p>So, now that we have a vectorization pipeline, we need to find operations that can benefit from SIMD to vectorize. Let’s start with an example: consider a typical dashboard query that sums one metric in five-minute buckets over the last day, on a table with all data older than one hour compressed:</p><pre><code class="language-SQL">SELECT time_bucket(INTERVAL '5 minute', timestamp) as bucket,
       metric, sum(value)
FROM metrics
WHERE metric = 1 AND
      timestamp &gt; now() - INTERVAL '1 day' 
GROUP BY bucket, metric;
</code></pre>
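<p>To make the example concrete, here is a minimal sketch of a table this query could run against. The column types, the one-hour chunk interval, and the <code>segmentby</code> choice are assumptions made for illustration, not the schema used for the benchmarks in this post:</p><pre><code class="language-SQL">-- Hypothetical schema matching the column names used in the query above.
CREATE TABLE metrics (
  timestamp  TIMESTAMPTZ      NOT NULL,
  metric     INT              NOT NULL,
  value      DOUBLE PRECISION NOT NULL
);

-- Turn the table into a hypertable partitioned on the time column.
-- One-hour chunks are an assumption, so a one-hour policy has whole chunks to compress.
SELECT create_hypertable('metrics', 'timestamp',
                         chunk_time_interval =&gt; INTERVAL '1 hour');

-- Enable columnar compression, grouping compressed batches by metric.
ALTER TABLE metrics SET (
  timescaledb.compress,
  timescaledb.compress_segmentby = 'metric'
);

-- Compress chunks once all of their data is older than one hour.
SELECT add_compression_policy('metrics', INTERVAL '1 hour');
</code></pre><p>With a setup along these lines, the most recent data stays in row format while older chunks are converted to columnar batches, which is the mix of formats the stages below have to handle.</p>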
<p>Among the many things this query does are four crucial, computationally expensive stages that can be vectorized to use SIMD:</p><ol><li>Decompressing the compressed data into the in-memory Apache Arrow format</li><li>Checking two filters in the WHERE clause, one for metric and one for time</li><li>Computing one expression using the time_bucket function</li><li>Performing aggregation using the SUM aggregate function</li></ol><p>All of these stages benefit from vectorization in a slightly different way; let’s dig into each of them.</p><h3 id="vectorized-decompression">Vectorized decompression</h3><p>We know that compression is a good thing, and when we decompress data, the CPU overhead incurred is almost always offset by the I/O savings from reading a smaller amount of data from disk. But what if we could use our CPU more efficiently to decompress data faster? In TimescaleDB 2.12, we answered that question with a <strong>3x decompression speedup</strong> when using SIMD over vectorized batches where the algorithms support it.</p><p>While we raised our decompression throughput ceiling to 1&nbsp;GB/second/core, there is more work to be done. Some parts of our modified Gorilla compression algorithm for floating-point values, as well as some custom algorithms for compressing small and repeated numbers (see <a href="https://timescale.ghost.io/blog/time-series-compression-algorithms-explained/"><u>this blog post for more algorithm details</u></a>), don’t allow full use of SIMD because of the way they lay out compressed data, with internal references or complex flow control blocking us from unlocking more performance.&nbsp;&nbsp;</p><p>Looking to the future, we have identified some new algorithms designed with SIMD in mind, which can go an order of magnitude faster, so watch this space. 
👀</p><p>On top of the speed benefits, vectorized decompression is where we convert our on-disk compression format into our in-memory Apache Arrow format that the rest of our vectorization pipeline consumes.</p><h3 id="vectorized-filters">Vectorized filters</h3><p>The next stage of query processing is applying compute-time filters from WHERE clauses. In an ideal analytical query, most of the data that doesn't match the query filters is not even read from storage. Unneeded columns are skipped, metadata is consulted to exclude entire columnar batches, and conditions are satisfied using indexes.&nbsp;</p><p>However, the real world is not ideal, and for many queries, not all conditions can be optimized like this. For example, when a filter (e.g., a WHERE clause on a time range) <em>partially</em> overlaps a compressed batch, then some of the batch (but not all of it) has to be used to calculate the result.</p><p>In this case, vectorized filters can provide another large performance boost. As Apache Arrow vectors stream out of our decompression node, we can use SIMD to check each filter condition very efficiently by comparing the stream to a vector of constants. Using the example from above (namely, <code>WHERE metric = 1 AND timestamp &gt; now() - INTERVAL '1 day'</code>), we would compare vectors of the metric column against the value 1 and then also compare vectors of the timestamp column against <code>now() - INTERVAL '1 day'</code>.</p><p>This optimization should be released in TimescaleDB 2.13, with early benchmark results against some real-world data showing up to a <strong>50&nbsp;% speedup on common queries</strong>.</p><p>But that’s not all vectorized filters can provide! 
Previously, even for compressed queries, all data was read from disk before filters were applied (a hold-over from the read-the-whole-row behavior that PostgreSQL employs by default).&nbsp;</p><p>Now that we are living in a columnar world, we can optimize this using a technique called “lazy column reads,” which reads the required columns for a batch early in the order they are defined in the WHERE clause. If any filters fail, the batch is discarded with no more I/O incurred. For queries with filters that remove a large number of full batches of records, this can result in an additional <strong>25&nbsp;% – 50&nbsp;% speedup</strong>.</p><h3 id="vectorized-expressions">Vectorized expressions</h3><p>Another important part of vectorized query pipelines is computing various expressions (projections) of columns that might be present in the query. In the simplest cases, this allows the use of the common vector CPU instructions for addition or multiplication, increasing the throughput.&nbsp;</p><p>More complex operations can benefit from handcrafted SIMD code, such as converting the string case, validating UTF-8, or even parsing JSON. More importantly, in some cases, the vectorized computation of expressions is a prerequisite for vectorizing the subsequent stages of the pipeline, such as grouping. For example, in the dashboard query we presented at the beginning of this section, we considered the grouping to be on <code>time_bucket</code>, so the result of this function must have a columnar in-memory representation to allow us to vectorize the grouping itself.</p><p>We haven’t made a start on vectorizing expressions yet, because aggregations will have a more immediate impact on analytics queries—but fear not, we will get to them!</p><h3 id="vectorized-aggregation">Vectorized aggregation</h3><p>Finally, the computation of most aggregate functions can also be vectorized to take advantage of SIMD. 
To demonstrate that this can work inside PostgreSQL as a partial aggregate, we built a high-throughput summation function that uses SIMD when working on columnar/compressed data, targeting the basic use case of <code>SELECT sum(value) FROM readings_compressed</code> (we can currently support filters on the <a href="https://docs.timescale.com/use-timescale/latest/compression/about-compression/#segment-by-columns"><strong><u>segment_by column</u></strong></a>). Without further optimization, we saw a <strong>3x speedup on compressed data</strong>. </p><p>Obviously, SUM is only one of the large set of aggregate functions that PostgreSQL provides (and TimescaleDB extends with <a href="https://docs.timescale.com/api/latest/hyperfunctions/"><u>hyperfunctions</u></a>). So, in forthcoming versions, we will optimize our approach to aggregates and deliver vectorized functions with the eventual goal of supporting the full set of built-in and hyperfunction aggregates.</p><h2 id="adding-it-all-up">Adding It All Up</h2><p>We’ve been showing you a lot of speedups, but how do they stack up in the real world?&nbsp;</p><p>We ran two simple queries, one which uses the vectorized SUM aggregate, and one which makes use of vectorized filters (unfortunately these can’t be combined at the moment). Both the queries were run on the same data (about 30 million rows) four times to show the gains from row-based, columnar (without a segment_by in this case), vectorized decompression, and then finally adding the last vectorized stage (aggregation or filter depending on the query).</p><p>We think the numbers can speak for themselves here 🔥.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/11/SIMD-Vectorization-for-Faster-Analytical-Queries_query-table-1.png" class="kg-image" alt="Vectorization made the SUM aggregate query up to 5.8x faster, and a SELECT count(*) query up to 4x faster!" 
loading="lazy" width="2000" height="1625" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/11/SIMD-Vectorization-for-Faster-Analytical-Queries_query-table-1.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/11/SIMD-Vectorization-for-Faster-Analytical-Queries_query-table-1.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2023/11/SIMD-Vectorization-for-Faster-Analytical-Queries_query-table-1.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/11/SIMD-Vectorization-for-Faster-Analytical-Queries_query-table-1.png 2400w" sizes="(min-width: 720px) 720px"></figure><h2 id="wrap-up">Wrap-Up</h2><p>Nothing gets us more excited at Timescale than finding smart solutions to hard problems which let people get more out of PostgreSQL. Since the first code for our vectorization pipeline hit Git, our internal Slack channels have been full of developer discussion about the optimizations and possibilities that vectorization on top of our columnar compression unlocks.</p><p>Looking forward, we are projecting that we can get even orders-of-magnitude performance improvements on some queries, and we’ve only started scratching the surface of what’s possible.&nbsp;</p><p>It’s an amazing time to be using Postgres. </p><p><a href="https://console.cloud.timescale.com/signup?ref=timescale.com"><u>Create a free Timescale account</u></a> to get started quickly with vectorization in TimescaleDB today.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How We Designed a Resilient Vector Embedding Creation System for PostgreSQL Data]]></title>
            <description><![CDATA[Learn the design decisions and trade-offs behind our system for creating and storing vector embeddings for data in PostgreSQL, PgVectorizer.]]></description>
            <link>https://www.tigerdata.com/blog/how-we-designed-a-resilient-vector-embedding-creation-system-for-postgresql-data</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/how-we-designed-a-resilient-vector-embedding-creation-system-for-postgresql-data</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[Engineering]]></category>
            <dc:creator><![CDATA[Matvey Arye]]></dc:creator>
            <pubDate>Fri, 10 Nov 2023 13:33:39 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/11/How-we-designed-a-resilient-embedding-system-for-PostgreSQL-Data.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/11/How-we-designed-a-resilient-embedding-system-for-PostgreSQL-Data.png" alt="A crew tuning up an elephant: How We Designed a Resilient Vector Embedding Creation System for PostgreSQL Data" /><p>Embedding data stored in a PostgreSQL table is undoubtedly useful, with applications ranging from semantic search and recommendation systems to generative AI applications and retrieval augmented generation. But creating and managing embeddings for data in PostgreSQL tables can be tricky, with many considerations and edge cases to take into account, such as keeping embeddings up to date with table updates and deletes, ensuring resilience against failures, and limiting the impact on existing systems that depend on the table.</p><p><a href="https://timescale.ghost.io/blog/a-complete-guide-to-creating-and-storing-embeddings-for-postgresql-data/" rel="noreferrer">In a previous blog post, we detailed a step-by-step guide on the process of creating and managing embeddings</a> for data residing in PostgreSQL using <a href="https://docs.timescale.com/ai/latest/pgvectorizer/" rel="noreferrer">PgVectorizer</a>, our simple and resilient embedding creation system for data residing in PostgreSQL. 
Using a blog application with data stored in a PostgreSQL database as an example, we covered how to create and keep up-to-date vector embeddings using <a href="https://timescale.ghost.io/blog/a-python-library-for-using-postgresql-as-a-vector-database-in-ai-applications/">Python</a>, <a href="https://python.langchain.com/docs/integrations/vectorstores/timescalevector">LangChain</a>, and <a href="https://www.timescale.com/ai">pgai on Timescale</a>.</p><p>In this blog post, we’ll discuss the technical design decisions and the trade-offs we made while building <a href="https://docs.timescale.com/ai/latest/pgvectorizer/" rel="noreferrer">PgVectorizer</a> to ensure simplicity, resilience, and high performance. We’ll also discuss alternative designs if you want to roll your own.</p><p>Let’s jump into it.</p><h2 id="design-of-a-high-performance-vectorizer-for-postgresql-data-pgvectorizer">Design of a High-Performance Vectorizer for PostgreSQL Data (PgVectorizer)</h2><p>First, let’s describe how the system we are building will work. Feel free to skip this section if you already read the <a href="https://timescale.ghost.io/blog/a-complete-guide-to-creating-and-storing-embeddings-for-postgresql-data/" rel="noreferrer">PgVectorizer post</a>.</p><h3 id="system-overview">System overview</h3><p>As an illustrative example, we’ll use a simple blog application storing data in PostgreSQL using a table defined as follows:</p><pre><code class="language-SQL">CREATE TABLE blog (
  id              SERIAL PRIMARY KEY NOT NULL,
  title           TEXT NOT NULL, 
  author          TEXT NOT NULL,
  contents        TEXT NOT NULL,
  category        TEXT NOT NULL,
  published_time  TIMESTAMPTZ NULL --NULL if not yet published
);
</code></pre>
<p>We want to create embeddings on the contents of the blog post so we can later use them for semantic search and to power retrieval augmented generation. Embeddings should only exist and be searchable for blogs that have been published (where the <code>published_time</code> is <code>NOT NULL</code>).&nbsp;</p><p>While building this embeddings system, we identified a number of goals that any straightforward and resilient system that creates embeddings should have:</p><ul><li><strong>No modifications to the original table. </strong>This allows systems and applications that already use this table not to be impacted by changes to the embedding system. This is especially important for legacy systems.</li><li><strong>No modification to the applications that interact with the table. </strong>Having to modify the code that alters the table may not be possible for legacy systems. It’s also poor software design because it couples systems that don’t use embeddings with code that generates the embedding.</li><li><strong>Automatically update embeddings</strong> when rows in the source table change (in this case, the blog table). This lessens the maintenance burden and contributes to worry-free software. At the same time, this update need not be instantaneous or within the same commit. For most systems, “eventual consistency” is just fine.</li><li><strong>Ensure resilience against network and service failures: </strong>Most systems generate embeddings via a call to an external system, such as the OpenAI API. 
In scenarios where the external system is down, or a network malfunction occurs, it's imperative that the remainder of your database system continues working.</li></ul><p>These guidelines were the basis of a robust architecture that we implemented using the <a href="https://github.com/timescale/python-vector">Python Vector library</a>, a library for <a href="https://timescale.ghost.io/blog/a-python-library-for-using-postgresql-as-a-vector-database-in-ai-applications/">working with vector data using PostgreSQL</a>. To complete the job successfully, we added new functionality to this library, <a href="https://docs.timescale.com/ai/latest/pgvectorizer/" rel="noreferrer">PgVectorizer</a>, to make embedding PostgreSQL data as simple as possible.</p><p>Here’s the architecture we settled on:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/11/Create-Embedding-for-PostgreSQL-Data-Architecture_Diagram-1.png" class="kg-image" alt="Reference architecture for a simple and resilient system for embedding data in an existing PostgreSQL table. We use the example use case of a blogging application, hence the table names above." 
loading="lazy" width="2000" height="1282" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/11/Create-Embedding-for-PostgreSQL-Data-Architecture_Diagram-1.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/11/Create-Embedding-for-PostgreSQL-Data-Architecture_Diagram-1.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2023/11/Create-Embedding-for-PostgreSQL-Data-Architecture_Diagram-1.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/11/Create-Embedding-for-PostgreSQL-Data-Architecture_Diagram-1.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Reference architecture for a simple and resilient system for embedding data in an existing PostgreSQL table. We use the example use case of a blogging application, hence the table names above</em></i><span style="white-space: pre-wrap;">.</span></figcaption></figure><p>In this design, we first add a trigger to the blog table that monitors for changes and, upon seeing a modification, inserts a job into the blog_work_queue table that indicates that a row in the blog table is out-of-date with its embedding.</p><p>On a fixed schedule, an embeddings creator job will poll the blog_work_queue table, and if it finds work to do, will do the following in a loop:</p><ol><li>Read and lock a row in the blog_work_queue table&nbsp;</li><li>Read the corresponding row in the blog table</li><li>Create an embedding for the data in the blog row</li><li>Write the embedding to the blog_embedding table</li><li>Delete the locked row in the&nbsp; blog_work_queue table</li></ol><p>To see this system in action, see an example of usage to <a href="https://timescale.ghost.io/blog/a-complete-guide-to-creating-and-storing-embeddings-for-postgresql-data/" 
rel="noreferrer">create and maintain embeddings in a PostgreSQL table using OpenAI, LangChain, and pgai on Timescale in this blog post</a>.</p><p>Going back to the example of our blog application table, on a high level, <a href="https://docs.timescale.com/ai/latest/pgvectorizer/" rel="noreferrer">PgVectorizer</a> has to do two things:</p><ol><li>Track changes to the blog rows to know which rows have changed.</li><li>Provide a method to process the changes to create embeddings.</li></ol><p>Both of these have to be highly concurrent and performant. Let’s see how it works.</p><h3 id="track-change-to-the-blog-table-with-the-blogworkqueue-table">Track change to the blog table with the blog_work_queue table</h3><p>You can create a simple work queue table with the following:</p><pre><code class="language-SQL">CREATE TABLE blog_embedding_work_queue (
  id  INT 
);

CREATE INDEX ON blog_embedding_work_queue(id);
</code></pre>
<p>This is a very simple table, but there is one item of note: this table has no unique key. This was done to avoid locking issues when processing the queue, but it does mean that we may have duplicates. We discuss the trade-off later in Alternative 1 below.</p><p>Then you create a trigger to track any changes made to <code>blog</code>:</p><pre><code class="language-SQL">CREATE OR REPLACE FUNCTION blog_wq_for_embedding() RETURNS TRIGGER LANGUAGE PLPGSQL AS $$ 
BEGIN 
  IF (TG_OP = 'DELETE') THEN
    INSERT INTO blog_embedding_work_queue 
      VALUES (OLD.id);
  ELSE
    INSERT INTO blog_embedding_work_queue 
      VALUES (NEW.id);
  END IF;
  RETURN NULL;
END; 
$$;

CREATE TRIGGER track_changes_for_embedding 
AFTER INSERT OR UPDATE OR DELETE
ON blog 
FOR EACH ROW EXECUTE PROCEDURE blog_wq_for_embedding();

INSERT INTO blog_embedding_work_queue 
  SELECT id FROM blog WHERE published_time IS NOT NULL;
</code></pre>
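<p>As a rough illustration of why the snippet above installs the trigger before running the backfill, here is a toy Python model of the two possible orderings. It is only a sketch under simplifying assumptions: a set stands in for the <code>blog</code> table, a list stands in for the work queue, the <code>published_time</code> filter is ignored, and <code>simulate</code> is a hypothetical helper rather than part of any library.</p><pre><code class="language-Python">def simulate(trigger_first):
    """Return the blog ids that never reach the work queue."""
    blog = {1, 2}   # ids that already exist in the blog table
    queue = []      # stand-in for the work queue (duplicates allowed)

    if trigger_first:
        # Order: install trigger, a concurrent write arrives, then backfill.
        blog.add(3)
        queue.append(3)             # the trigger fires for the new row
        queue.extend(sorted(blog))  # the backfill sees row 3 again: a duplicate
    else:
        # Order: backfill, a concurrent write arrives, then install trigger.
        queue.extend(sorted(blog))  # the backfill sees only rows 1 and 2
        blog.add(3)                 # no trigger yet, so nothing is queued

    return blog - set(queue)


print(simulate(trigger_first=True))   # no ids are missing
print(simulate(trigger_first=False))  # id 3 is missing
</code></pre>
<p>In this model, installing the trigger first costs at most a duplicate queue entry, while running the backfill first can leave a row that is never queued.</p>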
<p>The trigger inserts the ID of the blog that has changed into blog_work_queue. We install the trigger and then insert any existing blogs into the work_queue. This ordering is important to make sure that no IDs get dropped.</p><p>Now, let’s describe some alternative designs and why we rejected them.</p><h3 id="alternative-1-implement-a-primary-or-unique-key-for-the-blogworkqueue-table"><strong>Alternative 1: </strong>Implement a primary or unique key for the blog_work_queue table.</h3><p>Introducing this key would eliminate the problem of duplicate entries. However, it's not without its challenges, particularly because such a key would force us to use the <code>INSERT…ON CONFLICT DO NOTHING</code> clause to insert new IDs into the table, and that clause takes a lock on the ID in the B-tree.&nbsp;</p><p>Here's the dilemma: during the processing phase, it's necessary to delete the rows being worked on to prevent simultaneous processing. Yet, committing this deletion can only be done after the corresponding embedding has been placed into blog_embeddings. This ensures no IDs are lost if there's a disruption midway—say, if the embedding creation crashes post-deletion but before the embedding is written.</p><p>Now, if we create a unique or primary key, the transaction overseeing the deletion stays open. Consequently, this acts as a lock on those specific IDs, preventing their insertion back into the blog_work_queue for the entire duration of the embedding creation job. Given that it takes longer to create embeddings than your typical database transaction, this spells trouble. The lock would stall the trigger for the main 'blog' table, leading to a dip in the primary application's performance. Making things worse, if processing multiple rows in a batch, deadlocks become a potential problem as well.</p><p>However, the potential issues arising from occasional duplicate entries can be managed during the processing stage, as illustrated later. 
A sporadic duplicate here and there isn't a problem as it only marginally increases the amount of work the embedding job performs. This is certainly more palatable than grappling with the above-mentioned locking challenges.</p><h3 id="alternative-2-track-the-work-that-needs-to-be-done-by-adding-a-column-to-the-blog-table-to-track-whether-an-up-to-date-embedding-exists"><strong>Alternative 2:</strong> Track the work that needs to be done by adding a column to the <code>blog</code> table to track whether an up-to-date embedding exists.</h3><p>For example, we could add an <code>embedded</code> boolean column set to false on modification and flipped to true when the embedding is created. There are three reasons to reject this design:&nbsp;</p><ol><li>We don’t want to modify the <code>blog</code> table for the reasons we already mentioned above.</li><li>Efficiently getting a list of non-embedded blogs would require an additional index (or partial index) on the blog table. Every extra index slows down writes to that table.</li><li>This increases churn on the table because every modification would now be written twice (once with <code>embedded=false</code> and once with <code>embedded=true</code>), and under PostgreSQL's MVCC each of those writes creates a new row version.</li></ol><p>A separate work queue table solves these issues.</p><h3 id="alternative-3-create-the-embeddings-directly-in-the-trigger"><strong>Alternative 3:</strong> Create the embeddings directly in the trigger.</h3><p>This approach has several issues:</p><ol><li>If the embedding service is down, either the trigger needs to fail (aborting your transaction), or you need to create a backup code path that … stores the IDs that couldn’t be embedded in a queue. The latter solution gets us back to our proposed design but with more complexity bolted on top.</li><li>This trigger will probably be much slower than the rest of the database operations because of the latency required to contact an external service. 
This will slow down the rest of your database operations on the table.</li><li>It forces the user to write the embedding-creation code directly in the database. Given that the lingua franca of AI is Python and that embedding creation often requires many other libraries, this isn’t always easy or even possible (especially if running within a hosted PostgreSQL cloud environment). It’s much better to have a design where you have a choice to create embeddings inside or outside of the database.</li></ol><p>Now that we have a list of blogs that need to be embedded, let’s process the list!</p><h3 id="create-the-embeddings">Create the embeddings</h3><p>There are many ways to create embeddings. We recommend using an external Python script. This script will scan the work queue and the related blog posts, invoke an external service to create the embeddings, and then store these embeddings back into the database. Our reasoning for this strategy is as follows:</p><ul><li><strong>Choice of Python</strong>: We recommend <a href="https://timescale.ghost.io/blog/postgresql-as-a-vector-database-create-store-and-query-openai-embeddings-with-pgvector/">Python</a> because it offers a rich ecosystem for AI data tasks, highlighted by powerful LLM development and data libraries like <a href="https://blog.langchain.dev/timescale-vector-x-langchain-making-postgresql-a-better-vector-database-for-ai-applications/">LangChain</a> and <a href="https://timescale.ghost.io/blog/timescale-vector-x-llamaindex-making-postgresql-a-better-vector-database-for-ai-applications/">LlamaIndex</a>.</li><li><strong>Opting for an external script instead of PL/Python</strong>: We wanted users to have control over how they embed their data. Yet, at the same time, many Postgres cloud providers don’t allow the execution of arbitrary Python code inside the database because of security concerns. 
So, to give users flexibility in both their embedding scripts and where they host their database, we went with a design that uses external Python scripts.</li></ul><p>The jobs must be both performant and concurrency-safe. Concurrency safety guarantees that if jobs start running behind, the scheduler can start more jobs to help the system catch up and handle the load.</p><p>First, let’s see what the Python script would look like. Fundamentally, the script has three parts:</p><ol><li>Read the work queue and the blog post</li><li>Create an embedding for the blog post</li><li>Write the embedding to the blog_embedding table</li></ol><p>Steps 2 and 3 are performed by an&nbsp;<code>embed_and_write</code> callback that we define in the <a href="https://timescale.ghost.io/blog/a-complete-guide-to-creating-and-storing-embeddings-for-postgresql-data/" rel="noreferrer">PgVectorizer blog post</a>. So, let’s look more deeply at how we process the work queue.</p><h3 id="process-the-work-queue">Process the work queue</h3><p>We’ll first show you the code and then highlight the key elements at play:</p><pre><code class="language-Python">import psycopg2
import psycopg2.extras

# TIMESCALE_SERVICE_URL is assumed to hold your database connection string.
def process_queue(embed_and_write_cb, batch_size: int = 10):
    with psycopg2.connect(TIMESCALE_SERVICE_URL) as conn:
        with conn.cursor(cursor_factory=psycopg2.extras.DictCursor) as cursor:
            cursor.execute(f"""
                SELECT to_regclass('blog_embedding_work_queue')::oid; 
                """)
            table_oid = cursor.fetchone()[0]
            
            cursor.execute(f"""
                WITH selected_rows AS (
                    SELECT id
                    FROM blog_embedding_work_queue                         
                    LIMIT {int(batch_size)}
                    FOR UPDATE SKIP LOCKED
                ), 
                locked_items AS (
                    SELECT id, 
                           pg_try_advisory_xact_lock(
                                {int(table_oid)}, id) AS locked
                    FROM (
                        SELECT DISTINCT id 
                        FROM selected_rows 
                        ORDER BY id
                     ) as ids
                ),
                deleted_rows AS (
                    DELETE FROM blog_embedding_work_queue
                    WHERE id IN (
                        SELECT id 
                        FROM locked_items 
                        WHERE locked = true ORDER BY id
                   )
                )
                SELECT locked_items.id AS locked_id, blog.*
                FROM locked_items
                LEFT JOIN blog ON blog.id = locked_items.id
                WHERE locked = true
                ORDER BY locked_items.id
            """)
            res = cursor.fetchall()
            if len(res) &gt; 0:
                embed_and_write_cb(res)
            return len(res)

process_queue(embed_and_write)
</code></pre>
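<p>The claim-and-delete logic inside the query in <code>process_queue</code> can be hard to follow on a first read. As a rough mental model, the sketch below mimics it in plain Python: a list stands in for the work queue and a set stands in for the advisory locks. <code>claim_batch</code> and its arguments are hypothetical names for illustration only. The model never touches PostgreSQL, it ignores the row-level <code>FOR UPDATE SKIP LOCKED</code> step, and it never releases locks (the real advisory locks are released when the transaction ends).</p><pre><code class="language-Python">def claim_batch(queue, held_locks, batch_size=10):
    """Toy model of one claim step; mutates queue and held_locks in place."""
    selected = queue[:batch_size]            # SELECT id ... LIMIT batch_size
    claimed = []
    for blog_id in sorted(set(selected)):    # SELECT DISTINCT id ... ORDER BY id
        if blog_id not in held_locks:        # pg_try_advisory_xact_lock: never waits
            held_locks.add(blog_id)
            claimed.append(blog_id)
    # DELETE every queue row for a claimed id, including duplicates
    # that were not part of the selected batch.
    queue[:] = [i for i in queue if i not in claimed]
    return claimed


queue = [7, 3, 7, 5, 3, 9]
held_locks = {5}   # another job is already embedding blog 5
claimed = claim_batch(queue, held_locks, batch_size=4)
print(claimed, queue)  # prints [3, 7] [5, 9]
</code></pre>
<p>Blog 5 is skipped because its lock is held elsewhere, and the second queue entry for blog 3 disappears even though it sat outside the batch of four.</p>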
<p>The SQL inside <code>process_queue</code> is subtle because it is designed to be both performant and concurrency-safe, so let’s go through it:</p><ul><li><strong>Getting items off the work queue</strong>: Initially, the system retrieves a specified number of entries from the work queue, determined by the <code>batch_size</code> parameter. A FOR UPDATE lock is taken to ensure that concurrently executing scripts don’t try processing the same queue items. The SKIP LOCKED directive ensures that if any entry is currently being handled by another script, the system will skip it instead of waiting, avoiding unnecessary delays.<br></li><li><strong>Locking blog IDs</strong>: Because the work queue table can hold duplicate entries for the same blog ID, locking the selected queue rows is not enough: two jobs could each lock a different row that points to the same blog. Concurrent processing of the same ID by different jobs would be detrimental. Consider the following potential race condition:</li></ul><ol><li>Job 1 initiates and accesses a blog, retrieving version 1.</li><li>An external update to the blog occurs.</li><li>Subsequently, Job 2 begins, obtaining version 2.</li><li>Both jobs commence the embedding generation process.</li><li>Job 2 concludes, storing the embedding corresponding to blog version 2.</li><li>Job 1, upon conclusion, erroneously overwrites the version 2 embedding with the outdated version 1.</li></ol><p>While one could counter this issue by introducing explicit version tracking, it introduces considerable complexity without performance benefit. The strategy we opted for not only mitigates this issue but also prevents redundant operations and wasted work by concurrently executing scripts. </p><p>A Postgres advisory lock, prefixed with the table identifier to avoid potential overlaps with other such locks, is employed. The <code>try</code> variant, analogous to the earlier application of SKIP LOCKED, ensures the system avoids waiting on locks. The <code>ORDER BY id</code> clause helps prevent potential deadlocks. 
We’ll cover some alternatives below.</p><ul><li><strong>Cleaning up the work queue</strong>: The script then deletes all the work queue items for blogs we have successfully locked. If these queue items are visible via Multi-Version Concurrency Control (MVCC), their updates are manifested in the retrieved blog row. Note that we delete all items with the given blog ID, not only the items read when selecting the rows: this effectively handles duplicate entries for the same blog ID. It's crucial to note that this deletion only commits after invoking the embed_and_write() function and the subsequent storage of the updated embedding. This sequence ensures we don’t lose any updates even if the script fails during the embedding generation phase.</li><li><strong>Getting the blogs to process: </strong>In the last step, we fetch the blogs to process. Note the use of the left join: that allows us to retrieve the blog IDs for deleted items that won’t have a blog row. We need to track those items to delete their embeddings. In the <code>embed_and_write</code> callback, we use published_time being NULL as a sentinel for the blog being deleted (or unpublished, in which case we also want to delete the embedding).</li></ul><h3 id="alternative-4-avoid-using-advisory-locks-by-using-another-table"><strong>Alternative 4</strong>: Avoid using advisory locks by using another table.&nbsp;</h3><p>If the system already uses advisory locks and you are worried about collisions, it’s possible to use a table with a blog ID as the primary key and lock the rows. In fact, this can be the blog table itself if you are sure these locks won’t slow down any other system (remember, these locks have to be held throughout the embedding process, which can take a while). </p><p>Alternatively, you can have a blog_embedding_locks table just for this purpose. 
We didn’t suggest creating that table because we think it can get quite wasteful in terms of space, and using advisory locks avoids this overhead.&nbsp;</p><h2 id="conclusion-and-next-steps">Conclusion and Next Steps</h2><p><a href="https://timescale.ghost.io/blog/a-complete-guide-to-creating-and-storing-embeddings-for-postgresql-data/" rel="noreferrer">We introduced PgVectorizer and outlined a system adept at generating vector embeddings from data stored in PostgreSQL</a> and automatically keeping them up to date. This architecture ensures the embeddings remain synchronized with the perpetually evolving data, responding seamlessly to insertions, modifications, and deletions. </p><p>In this blog post, we gave you a behind-the-scenes look at how we created a system that boasts resilience, effectively handling potential downtimes of the embedding-generation service. Its design is adept at managing a high rate of data modifications and can seamlessly use concurrent embedding-generation processes to accommodate heightened loads.</p><p>Moreover, the paradigm of committing data to PostgreSQL and using the database to manage embedding generation in the background emerges as an easy mechanism to supervise embedding upkeep amidst data modifications. A myriad of demos and tutorials in the AI space focus singularly on the initial creation of data from documents, overlooking the intricate nuances associated with preserving data synchronization as it evolves.&nbsp;</p><p>However, in real production environments, data invariably changes, and grappling with the complexities of tracking and synchronizing these shifts is no trivial endeavor. But that’s what a database is designed to do! 
Why not just use it?</p><p>Here are some resources to continue your learning journey:</p><ul><li><a href="https://console.cloud.timescale.com/signup?utm_campaign=vectorlaunch&amp;utm_source=timescale-blog&amp;utm_medium=direct&amp;utm_content=pgvectorizer-how-we-built"><strong>Try pgai on Timescale free for 90 days</strong></a>: Store the embeddings generated from your PostgreSQL data in a fast and scalable vector database built on PostgreSQL. Learn more <a href="https://www.timescale.com/ai">about pgai on Timescale</a> and <a href="https://timescale.ghost.io/blog/how-we-made-postgresql-the-best-vector-database/">how it performs</a>.</li><li><a href="https://docs.timescale.com/ai/latest/pgvectorizer/" rel="noreferrer"><strong>Read the docs</strong></a>: Learn more about PgVectorizer and how to use it via the <a href="https://timescale.ghost.io/blog/a-python-library-for-using-postgresql-as-a-vector-database-in-ai-applications/">Python Vector library</a>.</li><li><a href="https://github.com/timescale/vector-cookbook/tree/main/pgvectorizer"><strong>Tutorial: Embedding blog data in PostgreSQL</strong></a>:<strong> </strong>Follow this step-by-step tutorial on how to create, embed, and store blog post data from a PostgreSQL table using the methods discussed in this blog post.</li><li><a href="https://timescale.ghost.io/blog/refining-vector-search-queries-with-time-filters-in-pgvector-a-tutorial/" rel="noreferrer"><strong>Tutorial: Refining Vector Search Queries With Time Filters in pgvector</strong></a><strong>: </strong>Learn how to do time-based filtering and semantic similarity search in a single SQL query.</li><li><a href="https://python.langchain.com/docs/integrations/vectorstores/timescalevector"><strong>LangChain and pgai on Timescale</strong></a><strong>: </strong>We used LangChain to illustrate document parsing and embedding creation in our examples above. 
Learn more about how to use it with pgai on Timescale for vector storage, similarity search, and hybrid search.</li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[A Complete Guide to Creating and Storing Embeddings for PostgreSQL Data]]></title>
            <description><![CDATA[Explore the power of vector embeddings and learn how to create and store them for PostgreSQL data using Python, LangChain, and pgai on Timescale.]]></description>
            <link>https://www.tigerdata.com/blog/a-complete-guide-to-creating-and-storing-embeddings-for-postgresql-data</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/a-complete-guide-to-creating-and-storing-embeddings-for-postgresql-data</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[Engineering]]></category>
            <dc:creator><![CDATA[Matvey Arye]]></dc:creator>
            <pubDate>Fri, 10 Nov 2023 13:33:28 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/11/Create-and-Store-Embeddings-PostgreSQL-data.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/11/Create-and-Store-Embeddings-PostgreSQL-data.png" alt="A mighty group of elephants working on data servers  (Creating and Storing Embeddings for PostgreSQL Data)" /><h3 id="why-use-embeddings-for-your-postgresql-data">Why use embeddings for your PostgreSQL data</h3><p><a href="https://timescale.ghost.io/blog/a-beginners-guide-to-vector-embeddings/" rel="noreferrer">Vector embeddings</a> provide a mathematical representation of data, encapsulating its semantic essence in a form that machines can readily process. While commonly associated with text, images, and audio, virtually any binary data can be converted into this format.&nbsp;</p><p>Generating embeddings from data already stored in your PostgreSQL database unlocks a multitude of applications:</p><p>Embeddings enable <a href="https://www.timescale.com/learn/vector-search-vs-semantic-search" rel="noreferrer"><strong>semantic search</strong></a>, which transcends the limitations of traditional keyword-driven methods. It doesn't just seek exact word matches; it grasps the deeper intent behind a user's query. The result? Even if search terms differ in phrasing, relevant results are surfaced. Taking advantage of <strong>hybrid search</strong>, which marries lexical and semantic search methodologies, offers users a search experience that's both rich and accurate. It's not just about finding direct matches anymore; it's about tapping into contextually and conceptually similar content to meet user needs.</p><p><strong>Recommendation systems</strong> benefit immensely from embeddings. Imagine a user who has shown interest in several articles on a singular topic. With embeddings, the recommendation engine can delve deep into the semantic essence of those articles, surfacing other database items that resonate with the same theme. 
Recommendations, thus, move beyond just the superficial layers like tags or categories and dive into the very heart of the content.</p><p>Generative AI, particularly <strong>retrieval-augmented generation (RAG)</strong>, can be powered using the data stored in a PostgreSQL database. This turns your data into more than just tabular information; it becomes context for Large Language Models (LLMs) like OpenAI’s <a href="https://openai.com/blog/new-models-and-developer-products-announced-at-devday">GPT-4 Turbo</a>, Anthropic’s <a href="https://www.anthropic.com/index/claude-2">Claude 2</a>, and open-source models like <a href="https://ai.meta.com/llama/">Llama 2</a>. When a user poses a query, relevant database content is fetched and used to supplement the query as additional context for the LLM. This helps reduce LLM hallucinations because it grounds the model's output in specific and relevant information, even if that information wasn't part of the original training data.</p><p>Furthermore, embeddings offer a robust solution for <strong>clustering</strong> data in PostgreSQL. Transforming data into these vectorized forms enables nuanced comparisons between data points in a high-dimensional space. Through algorithms like <a href="https://en.wikipedia.org/wiki/K-means_clustering">K-means</a> or <a href="https://en.wikipedia.org/wiki/Hierarchical_clustering">hierarchical clustering</a>, data can be categorized into semantic clusters, offering insights that surface-level attributes might miss. This deepens our grasp of inherent data patterns, enriching both exploration and decision-making processes.</p><p>This guide delves into the process of creating and managing embeddings for data residing in PostgreSQL using PgVectorizer, a library we developed to make managing embeddings simple. PgVectorizer both creates embeddings from your data and keeps your relational and embedding data in sync as your data changes. 
</p><p>We'll navigate through architectural considerations, set up the library, perform a sync between your relational and embedding data, and query your embeddings. To learn more about how we built this, see the companion article, where <a href="https://timescale.ghost.io/blog/how-we-designed-a-resilient-vector-embedding-creation-system-for-postgresql-data/" rel="noreferrer">we go under the hood and explore how PgVectorizer works</a>: it covers schema layout, how we designed the system for performance, concurrency, and resilience, and a few alternative design decisions.</p><p>Let’s get started!&nbsp;</p><h2 id="creating-embeddings-for-data-in-postgresql-and-keeping-them-up-to-date-with-your-tables">Creating Embeddings for Data in PostgreSQL (and Keeping Them Up-To-Date With Your Tables)</h2><p>As a running example, we’ll use a simple blog application storing data in PostgreSQL using a table defined as:</p><pre><code class="language-SQL">CREATE TABLE blog (
  id              SERIAL PRIMARY KEY NOT NULL,
  title           TEXT NOT NULL, 
  author          TEXT NOT NULL,
  contents        TEXT NOT NULL,
  category        TEXT NOT NULL,
  published_time  TIMESTAMPTZ NULL --NULL if not yet published
);

</code></pre>
<p>We want to create embeddings for the contents of each blog post so we can later use them for semantic search. Embeddings should only exist and be searchable for blogs that have been published (where the <code>published_time</code> is <code>NOT NULL</code>).&nbsp;</p><p>To make working with embeddings simple and resilient, any system that creates embeddings should have the following goals:</p><ul><li><strong>No modifications to the original table. </strong>This allows systems and applications that already use this table not to be impacted by changes to the embedding system. This is especially important for legacy systems.</li><li><strong>No modification to the applications that interact with the table. </strong>Having to modify the code that alters the table may not be possible for legacy systems. It’s also poor software design because it couples systems that don’t use embeddings with code that generates the embeddings.</li><li><strong>Automatically update embeddings</strong> when rows in the source table change (in this case, the blog table). This lessens the maintenance burden and contributes to worry-free software. At the same time, this update need not be instantaneous or within the same commit. For most systems, “eventual consistency” is just fine.</li><li><strong>Ensure resilience against network and service failures: </strong>Most systems generate embeddings via a call to an external system, such as the OpenAI API. 
In scenarios where the external system is down or a network malfunction occurs, it's imperative that the remainder of your database system continues working.</li></ul><p>These guidelines act as a robust framework for the following architecture:&nbsp;</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/11/Create-Embedding-for-PostgreSQL-Data-Architecture_Diagram.png" class="kg-image" alt="Reference architecture for a simple and resilient system for embedding data in an existing PostgreSQL table. We use the example use case of a blogging application, hence the names above." loading="lazy" width="2000" height="1282" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/11/Create-Embedding-for-PostgreSQL-Data-Architecture_Diagram.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/11/Create-Embedding-for-PostgreSQL-Data-Architecture_Diagram.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2023/11/Create-Embedding-for-PostgreSQL-Data-Architecture_Diagram.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/11/Create-Embedding-for-PostgreSQL-Data-Architecture_Diagram.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Reference architecture for a simple and resilient system for embedding data in an existing PostgreSQL table. 
We use the example use case of a blogging application, hence the names above.</em></i></figcaption></figure><p>In this design, we first add a trigger to the blog table that monitors for changes and, upon seeing a modification, inserts a job into the blog_work_queue table that indicates that a row in the blog table is out-of-date with its embedding.</p><p>On a fixed schedule, an embeddings creator job will poll the blog_work_queue table, and if it finds work to do, will do the following in a loop:</p><ol><li>Read and lock a row in the blog_work_queue table&nbsp;</li><li>Read the corresponding row in the blog table</li><li>Create an embedding for the data in the blog row</li><li>Write the embedding to the blog_embedding table</li><li>Delete the locked row in the&nbsp; blog_work_queue table</li></ol><p>Next, we’ll discuss how to implement this simply by using the <a href="https://github.com/timescale/python-vector">Python Vector library</a>, a library for <a href="https://timescale.ghost.io/blog/a-python-library-for-using-postgresql-as-a-vector-database-in-ai-applications/">working with vector data using PostgreSQL</a>.&nbsp;</p><h3 id="easily-manage-embedding-postgresql-data-using-langchain-and-the-python-vector-library">Easily manage embedding PostgreSQL data using LangChain and the Python Vector library</h3><p>We’ve added functionality to our library to make embedding PostgreSQL data as simple as possible. We call this functionality <a href="https://docs.timescale.com/ai/latest/pgvectorizer/" rel="noreferrer">PgVectorizer</a>.</p><h3 id="define-your-embedding-creation-function">Define your embedding creation function</h3><p>There are myriad ways to embed your data. We don’t want to force you to use just one pre-defined method, so we ask users to define how to embed their data. Thus, we ask you to provide us with a Python function callback to create and store embeddings from database data. 
We call this the <code>embed_and_write</code> function, and it’s best to illustrate how to write it with an example.&nbsp;</p><p>Using the blog example above, it could look like the following when embedding using <a href="https://python.langchain.com/docs/get_started/introduction">LangChain</a>, a popular framework to work with LLM applications:</p><pre><code class="language-Python">from langchain.docstore.document import Document
from langchain.text_splitter import CharacterTextSplitter
from timescale_vector import client, pgvectorizer
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores.timescalevector import TimescaleVector
from datetime import timedelta


import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv(), override=True)

TIMESCALE_SERVICE_URL = os.environ["TIMESCALE_SERVICE_URL"]

def get_document(blog):
    text_splitter = CharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
    )
    docs = []
    for chunk in text_splitter.split_text(blog['contents']):
        content = f"Author {blog['author']}, title: {blog['title']}, contents:{chunk}"
        metadata = {
            "id": str(client.uuid_from_time(blog['published_time'])),
            "blog_id": blog['id'], 
            "author": blog['author'], 
            "category": blog['category'],
            "published_time": blog['published_time'].isoformat(),
        }
        docs.append(Document(page_content=content, metadata=metadata))
    return docs

def embed_and_write(blog_instances, vectorizer):
    # Note: the vectorizer argument isn’t used in this example but it
    # provides a way to get the name of the table being embedded,
    # along with other metadata.


    embedding = OpenAIEmbeddings()
    vector_store = TimescaleVector(
        collection_name="blog_embedding",
        service_url=TIMESCALE_SERVICE_URL,
        embedding=embedding,
        time_partition_interval=timedelta(days=30),
    )

    # Delete old embeddings for all ids in the work queue. locked_id is a
    # special column that is set to the primary key of the table being
    # embedded. For items that are deleted, it is the only key that is set.
    metadata_for_delete = [{"blog_id": blog['locked_id']} for blog in blog_instances]
    vector_store.delete_by_metadata(metadata_for_delete)

    documents = []
    for blog in blog_instances:
        # skip blogs that are not published yet or are deleted
        # (published_time is None in both cases)
        if blog['published_time'] is not None:
            documents.extend(get_document(blog))

    if len(documents) == 0:
        return
    
    texts = [d.page_content for d in documents]
    metadatas = [d.metadata for d in documents]
    ids = [d.metadata["id"] for d in documents]
    vector_store.add_texts(texts, metadatas, ids)
</code></pre>
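<p>To make the delete-then-embed shape of <code>embed_and_write</code> concrete, here is a small sketch that uses plain dictionaries in place of database rows. <code>split_batch</code> is a hypothetical helper written for this illustration, not part of the Python Vector library, and the sample rows are made up.</p><pre><code class="language-Python">def split_batch(blog_instances):
    """Return (ids whose old embeddings are deleted, rows that are re-embedded)."""
    # Every row in the batch has locked_id set, even rows deleted from blog.
    ids_to_delete = [row["locked_id"] for row in blog_instances]
    # Only published rows get new embeddings; drafts and deleted rows have
    # published_time set to None.
    rows_to_embed = [
        row for row in blog_instances if row["published_time"] is not None
    ]
    return ids_to_delete, rows_to_embed


batch = [
    {"locked_id": 1, "id": 1, "published_time": "2023-11-10"},  # published
    {"locked_id": 2, "id": 2, "published_time": None},          # draft
    {"locked_id": 3, "id": None, "published_time": None},       # deleted row
]
ids_to_delete, rows_to_embed = split_batch(batch)
print(ids_to_delete)                      # all three ids
print([row["id"] for row in rows_to_embed])  # only the published blog
</code></pre>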
<p>The <code>embed_and_write()</code> function gets a list of blogs that have either been created, updated, or deleted. Its job is to bring the vector store in line with those changes. We do this in two steps:</p><ol><li>Delete all existing vectors already in the vector store for items with the same primary key. The primary key is passed in via a special “locked_id” attribute. This is necessary if rows are deleted or updated.</li><li>Create embeddings for all items that were created or updated. Deleted items will have all attributes other than “locked_id” set to <code>None</code> so that any attribute can be used as a sentinel. In the example above, we use “published_time” because we also want to skip embedding documents where “published_time” is NULL in the database.</li></ol><p>The <code>get_document()</code> function is very use-case-specific, and you’ll have to adjust this code to suit your needs. Because of the context length limitations in LLM completion queries and token length limitations in embedding generation models, you will likely need some way to split long text up into smaller chunks. Here, we use a simple <a href="https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/character_text_splitter"><code>CharacterTextSplitter</code> in LangChain</a>, but much more complex approaches are possible. In the code above, we use a simple but effective trick: add some semantic context to each chunk by prepending the author and title. The only real requirement for the metadata generation portion is including the blog_id, which we can later use to delete old embeddings for a given blog.</p><p>In the code snippet above, we use <a href="https://python.langchain.com/docs/integrations/vectorstores/timescalevector#2-similarity-search-with-time-based-filtering">time-based partitioning</a> based on published_time. This type of partitioning drastically <strong>speeds up hybrid search on time and embedding similarity</strong>. 
We partition by time by setting the UUID based on the timestamp using <a href="https://timescale.github.io/python-vector/vector.html#uuid_from_time"><code>client.uuid_from_time()</code> function</a> and by specifying <code>time_partition_interval=timedelta(days=30)</code> when creating the <a href="https://python.langchain.com/docs/integrations/vectorstores/timescalevector" rel="noreferrer">pgai on Timescale vector store</a>. This type of partitioning really speeds up search when filtering by both published_time and vector similarity. <a href="https://youtu.be/EYMZVfKcRzM?si=iW-ySVYaGCDYrZkS">See this explainer video</a> for more on how time-based partitioning works in pgai on Timescale (previously known as Timescale Vector).</p><p>Once this is written, all you have to do is call the following code on a schedule:</p><pre><code class="language-Python">vectorizer = pgvectorizer.Vectorize(service_url, 'blog')
while vectorizer.process(embed_and_write) &gt; 0:
    pass
</code></pre>
<p>This is the embedding creator job, which will sync your PostgreSQL data with a vector store. You can run this Python script on a schedule from practically anywhere:</p><ul><li>A scheduled <a href="https://timescale.ghost.io/blog/aws-lambda-for-beginners-overcoming-the-most-common-challenges/">AWS Lambda function</a></li><li>A scheduled <a href="https://developers.cloudflare.com/workers/">Cloudflare Worker</a></li><li>A <a href="https://modal.com/use-cases/job-queues">Modal function</a></li><li>A <a href="https://robocorp.com/docs/python">Robocorp automation task</a></li><li>A cron job on an EC2 instance or even on your local machine</li></ul><p>The job automatically tracks which rows within the blog table have changed and calls the <code>embed_and_write</code> function on batches of changed rows. It is performant, resilient to failures, and can be run in parallel when you have a backlog of things that need to be embedded. <a href="https://timescale.ghost.io/blog/how-we-designed-a-resilient-vector-embedding-creation-system-for-postgresql-data/" rel="noreferrer">Designing such a system to perform well is harder than it sounds</a>. But we’ve done it for you in <a href="https://docs.timescale.com/ai/latest/pgvectorizer/" rel="noreferrer">PgVectorizer</a> as part of the <a href="https://docs.timescale.com/ai/latest/python-interface-for-pgvector-and-timescale-vector/" rel="noreferrer">Python Vector library</a>!&nbsp;</p><h3 id="searching-through-your-embeddings">Searching through your embeddings</h3><p>How you use the embeddings depends on how they were generated and on your use case. 
We will illustrate some simple search applications to work with the <a href="https://python.langchain.com/docs/integrations/vectorstores/timescalevector#1-similarity-search-with-euclidean-distance-default">LangChain example</a> we gave above and also provide references for more advanced applications, such as hybrid search on metadata and time.</p><pre><code class="language-Python">TABLE_NAME = "blog_embedding"
embedding = OpenAIEmbeddings()
vector_store = TimescaleVector(
        collection_name=TABLE_NAME,
        service_url=TIMESCALE_SERVICE_URL,
        embedding=embedding,
        time_partition_interval=timedelta(days=30)
)
# Find the closest item
res = vector_store.similarity_search_with_score("Tell me about Travel to Istanbul", 1)

# Hybrid search with time
start_dt = datetime(2021, 1, 1, 0, 0, 0)
end_dt = datetime(2024, 1, 1, 0, 0, 0)
res = vector_store.similarity_search_with_score("Tell me about Travel to Istanbul", 1, start_date=start_dt, end_date=end_dt)
</code></pre>
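<p>To build intuition for what a hybrid search on time and similarity does, here is a minimal, self-contained Python sketch. It is not how pgai on Timescale or LangChain implement search: the <code>hybrid_search()</code> helper, the toy two-dimensional vectors, the sample rows, and the half-open time window are all assumptions made up for illustration. The sketch first keeps only rows whose timestamp falls inside the window and then ranks the survivors by Euclidean distance:</p><pre><code class="language-Python">import math
from datetime import datetime

# Hypothetical toy rows: (title, published_time, embedding).
ROWS = [
    ("Istanbul travel guide", datetime(2023, 5, 1), [0.9, 0.1]),
    ("Old Istanbul notes", datetime(2019, 3, 1), [0.88, 0.12]),
    ("Postgres indexing tips", datetime(2022, 7, 15), [0.1, 0.9]),
]


def hybrid_search(query_vec, start_date, end_date, k=1):
    """Keep rows inside [start_date, end_date), then return the k nearest."""
    in_window = [row for row in ROWS if start_date &lt;= row[1] &lt; end_date]
    scored = [(math.dist(query_vec, emb), title) for title, _, emb in in_window]
    return sorted(scored)[:k]


# The query vector is closest to the 2019 post, but the time filter excludes it.
result = hybrid_search([0.88, 0.12], datetime(2021, 1, 1), datetime(2024, 1, 1))
print(result[0][1])
</code></pre>
<p>Note how the time window changes the answer: by distance alone, the 2019 post is the closest match, but it falls outside the window, so the 2023 post is returned instead.</p>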
<p>There are, of course, a lot more options for search, including filters and predicates on metadata, self-query retriever options, integrations with chat and RAG, and more! We recommend reading this <a href="https://python.langchain.com/docs/integrations/vectorstores/timescalevector">LangChain tutorial</a> for more info about the above-mentioned methods!&nbsp;</p><p><strong>Note: </strong>While the above example uses LangChain, you can also swap in frameworks like <a href="https://gpt-index.readthedocs.io/en/stable/index.html">LlamaIndex</a>, a popular LLM data framework that <a href="https://gpt-index.readthedocs.io/en/stable/examples/vector_stores/Timescalevector.html">integrates well with pgai on Timescale</a>, or do DIY document parsing and embedding in Python using OpenAI’s <a href="https://platform.openai.com/docs/guides/embeddings/what-are-embeddings">text-embedding-ada-002</a> model, or an open-source embedding model like <a href="https://huggingface.co/sentence-transformers">sentence-transformers</a>, while using the <a href="https://timescale.ghost.io/blog/a-python-library-for-using-postgresql-as-a-vector-database-in-ai-applications/">Python Vector client</a> for vector search.&nbsp;</p><h2 id="conclusion-and-next-steps">Conclusion and Next Steps</h2><p>In this blog post, we have outlined a system adept at generating vector embeddings from data stored in PostgreSQL and automatically keeping them up to date. This architecture ensures the embeddings remain synchronized with the perpetually evolving data, responding seamlessly to insertions, modifications, and deletions.&nbsp;</p><p>Using PostgreSQL to handle both data storage and background embedding generation offers an interesting new paradigm for maintaining embeddings as data changes. 
Many AI demonstrations and tutorials tend to concentrate only on initial data creation from documents, often missing the complexities of keeping data and embeddings synchronized as it evolves.</p><p>If your goal is to embed and keep up-to-date data in a PostgreSQL table, then you are done! Just <a href="https://docs.timescale.com/ai/latest/pgvectorizer/" rel="noreferrer">use the PgVectorizer class</a> in the <a href="https://github.com/timescale/python-vector">Python Vector library</a> and the code above to start embedding your PostgreSQL data and leveraging semantic and hybrid search in your applications.&nbsp;</p><p>If you are curious about how the PgVectorizer library works “under the hood” and how we designed the system for high performance, <a href="https://timescale.ghost.io/blog/how-we-designed-a-resilient-vector-embedding-creation-system-for-postgresql-data/" rel="noreferrer">see our companion blog post about how we designed a resilient embeddings system for PostgreSQL data</a>, which discusses the system design decisions and trade-offs we made while building PgVectorizer above.&nbsp;</p><p>If you’d like to go straight to applying what you learned to your own data in PostgreSQL, here are some resources to continue your learning journey:<br></p><ul><li><a href="https://console.cloud.timescale.com/signup?utm_campaign=vectorlaunch&amp;utm_source=timescale-blog&amp;utm_medium=direct&amp;utm_content=pgvectorizer-guide"><strong>Try pgai on Timescale free for 90 days</strong></a>: Store the embeddings generated from your PostgreSQL data in a fast and scalable vector database built on PostgreSQL. 
Learn more <a href="https://www.timescale.com/ai">about pgai on Timescale</a>, and <a href="https://timescale.ghost.io/blog/how-we-made-postgresql-the-best-vector-database/">how it performs</a>.</li><li><a href="https://docs.timescale.com/ai/latest/pgvectorizer/" rel="noreferrer"><strong>Read the docs</strong></a>: Learn more about PgVectorizer and how to use it via the <a href="https://timescale.ghost.io/blog/a-python-library-for-using-postgresql-as-a-vector-database-in-ai-applications/">Python Vector library</a> to embed data stored in PostgreSQL tables.</li><li><a href="https://github.com/timescale/vector-cookbook/tree/main/pgvectorizer"><strong>Tutorial–Embedding blog data in PostgreSQL</strong></a>:<strong> </strong>Follow this step-by-step tutorial on creating, embedding, and storing blog post data from a PostgreSQL table using the methods discussed in this blog post.</li><li><a href="https://python.langchain.com/docs/integrations/vectorstores/timescalevector"><strong>LangChain and pgai on Timescale</strong></a><strong> (previously known as Timescale Vector): </strong>We used LangChain to illustrate document parsing and embedding creation in our examples above. Learn more about how to use it with pgai on Timescale for vector storage, similarity search, and hybrid search.</li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Boosting Postgres Performance With Prepared Statements and PgBouncer's Transaction Mode]]></title>
            <description><![CDATA[You can now boost both your application’s and Postgres’ performance by using prepared statements and PgBouncer’s transaction mode. Learn how.]]></description>
            <link>https://www.tigerdata.com/blog/boosting-postgres-performance-with-prepared-statements-and-pgbouncers-transaction-mode</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/boosting-postgres-performance-with-prepared-statements-and-pgbouncers-transaction-mode</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[PostgreSQL Performance]]></category>
            <dc:creator><![CDATA[Grant Godeke]]></dc:creator>
            <pubDate>Fri, 03 Nov 2023 11:12:50 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/11/boosting-postgres-performance-with-prepared-statements-and-pgbouncer-transaction-mode.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/11/boosting-postgres-performance-with-prepared-statements-and-pgbouncer-transaction-mode.png" alt="Boosting Postgres Performance With Prepared Statements and PgBouncer's Transaction Mode" /><p>Adopting <a href="https://en.wikipedia.org/wiki/Prepared_statement">prepared statements</a> in your application is an easy performance gain. Using prepared statements lets your application skip repeated query parsing and analysis, eliminating a substantial amount of overhead. Pairing this with a connection pooler and transaction mode can dramatically boost your Postgres database performance. Needless to say, we were excited to learn that support for prepared statements in transaction mode was introduced in the <a href="https://www.pgbouncer.org/changelog.html#pgbouncer-121x">1.21 release of PgBouncer</a>!&nbsp;</p><p>In this post, we’ll cover, at a high level, why you should enable prepared statements in your application. If you’re looking for the gory details of the implementation, <a href="https://www.crunchydata.com/blog/prepared-statements-in-transaction-mode-for-pgbouncer" rel="noreferrer">our peers over at Crunchy wrote a great post on the 1.21 release here</a>!</p><p><a href="https://timescale.ghost.io/blog/connection-pooling-on-timescale-or-why-pgbouncer-rocks/" rel="noreferrer">Connection poolers on our mature cloud platform, Timescale, use PgBouncer </a>under the hood, supporting prepared statements and transaction pooling. 
<a href="https://console.cloud.timescale.com/signup">Start a free trial today to try it out</a>, no credit card required!</p><h3 id="what-is-a-prepared-statement-and-why-does-it-boost-postgres%E2%80%99-performance">What is a prepared statement, and why does it boost Postgres’ performance?</h3><p>A <a href="https://www.postgresql.org/docs/current/sql-prepare.html#:~:text=A%20prepared%20statement%20is%20a,statement%20is%20planned%20and%20executed">prepared statement</a> is a parameterized query that you <code>PREPARE</code> on the server side, at which point the server does the query parsing, analysis, and any rewriting. You can then call this query by <code>EXECUTE</code>ing it with the corresponding arguments.&nbsp;</p><p>I like to think of prepared statements as similar to functions. You create a template of what will happen in the database and can call that template with parameters to have it happen inside the database. To over-extend the analogy, this “template” (the prepared statement) lets you pre-compile the query, potentially greatly improving the overall execution time.</p><p>In an application setting, many of your queries are probably already templated rather than arbitrary. These are ideal candidates to use as prepared statements since they repeat the same underlying query but with different values.&nbsp;</p><h3 id="how-to-make-it-work-with-my-app">How to make it work with my app?</h3><p>One limitation of the implementation of prepared statement support in <a href="https://www.tigerdata.com/blog/using-pgbouncer-to-improve-your-postgresql-database-performance" rel="noreferrer">PgBouncer</a> is that “PgBouncer tracks protocol-level named prepared statements.” Basically, rather than issuing <code>PREPARE</code> and <code>EXECUTE</code> as raw SQL, we should use the protocol-level prepared statements that libpq exposes. 
</p><p>Fortunately for you, whatever language your application uses, its object-relational mapping (ORM) implementation most likely already relies on this! 
We just need to use the corresponding API rather than writing <code>PREPARE … EXECUTE …</code> in raw SQL ourselves.</p><p>ActiveRecord, the default ORM of Rails, makes this even simpler for us by using prepared statements by default. In your <code>config/database.yml</code> file, ensure you have not altered your production environment to turn off <code>prepared_statements</code>. ActiveRecord allows up to 1,000 prepared statements by default (the <code>statement_limit</code> setting). </p><p>Since Timescale allows for 100, we recommend lowering <code>statement_limit</code> to match our <code>max_prepared_statements</code> (see the next section for more detail). Thus, your config might look something like this:</p><pre><code class="language-SQL">production:
&nbsp;&nbsp;adapter: postgresql
&nbsp;&nbsp;statement_limit: 100
</code></pre>
<p>Note that <code>prepared_statements: false</code> is absent, as we want them on (which they are by default).</p><p>For an example of what is happening under the hood, or a template for other ORMs that may not handle this automatically, consider the Ruby pg gem, which provides the <code>prepare()</code> method. Creating a prepared statement would look something like:</p><pre><code class="language-SQL">require 'pg'
conn = PG.connect(:dbname =&gt; 'tsdb')
conn.prepare('statement1', 'insert into metrics (created, type_id, value) values ($1::timestamptz, $2::int, $3::float)')
</code></pre>
<p>This uses the same table structure as our <a href="https://docs.timescale.com/tutorials/latest/energy-data/" rel="noreferrer">Energy tutorial</a>. Note that the <a href="https://deveiate.org/code/pg/PG/Connection.html#method-i-prepare">official gem documentation</a> recommends casting the values to the desired types to avoid type conflicts. These are SQL types, not Ruby types, though, since they are a part of the query.&nbsp;</p><p>To execute the query, you’d use something like:</p><pre><code class="language-SQL">conn.exec_prepared('statement1', ['2023-05-31 23:59:59.043264+00', 13, 1.78])
</code></pre>
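<p>As a mental model for why the prepare-then-execute pattern helps, the toy Python class below counts how often a query has to be parsed and analyzed. It is an illustration only: it does not talk to Postgres or PgBouncer, <code>ToyServer</code> and its methods are hypothetical names, and real servers may still do planning work at execution time. Ad hoc execution pays the parsing cost on every call, while a prepared statement pays it once:</p><pre><code class="language-Python">class ToyServer:
    """Counts parse/analysis work for ad hoc vs. prepared execution."""

    def __init__(self):
        self.parses = 0     # how many times a query text was parsed
        self.prepared = {}  # maps statement name to query text

    def execute(self, query, params):
        self.parses += 1    # ad hoc queries are parsed on every call
        return (query, params)

    def prepare(self, name, query):
        self.parses += 1    # parsed once, when the statement is prepared
        self.prepared[name] = query

    def exec_prepared(self, name, params):
        # Reuses the stored statement, so no new parse is counted.
        return (self.prepared[name], params)


query = "insert into metrics (created, type_id, value) values ($1, $2, $3)"
rows = [("2023-05-31 23:59:59+00", 13, 1.78)] * 100

adhoc = ToyServer()
for row in rows:
    adhoc.execute(query, row)

prepared = ToyServer()
prepared.prepare("statement1", query)
for row in rows:
    prepared.exec_prepared("statement1", row)

print(adhoc.parses, prepared.parses)
</code></pre>
<p>With 100 inserts, the ad hoc server parses the query 100 times and the prepared one parses it once.</p>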
<p>As a quirk of the PgBouncer implementation, we do not need to <code>DEALLOCATE</code> prepared statements. PgBouncer handles this automatically for us.&nbsp;</p><p>All you need to do is prepare each statement once per connection and then call <code>exec_prepared()</code> as often as you like while the connection stays open.</p><h3 id="what%E2%80%99s-happening-behind-the-scenes">What’s happening behind the scenes?</h3><p>Behind the scenes, PgBouncer intercepts the creation of the prepared statement and creates a cache of prepared statements for the connection pool. It looks at the name the client gave the statement (<code>statement1</code> in our example) and checks whether it already has a matching prepared statement on that connection, which is named something like <code>PGBOUNCER_1234</code>. If it does, then it will just reuse the already created statement. If not, it will automatically create it on behalf of the client in the database connection.</p><p>This implementation lets you effectively cache prepared statements across connections, which is insanely cool, giving you the full benefit of prepared statements even when using a transaction pool. For example, if your pool size is 20, and you have 100 client connections, each running the same query, this means that, at most, the query will be planned 20 times instead of the standard 100 in previous transaction mode usage. That’s pretty awesome!</p><p>One thing to be mindful of is the <code>max_prepared_statements</code> set in your connection pooler. On Timescale, the default is 100. For ideal performance, it’s recommended to keep the number of distinct prepared statements your application uses below this value. <br><br>This lets you get the most out of PgBouncer’s cache. If you exceed this value, it is not a big deal, but it may result in slightly more query planning than otherwise, as prepared statements will get deallocated from the database connection. 
For example, on Timescale, with your 101st prepared statement, the first prepared statement will be replaced by it.</p><div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-emoji">📚</div><div class="kg-callout-text"><a href="https://www.timescale.com/blog/using-pgbouncer-to-improve-your-postgresql-database-performance/" rel="noreferrer">Need some advice on how to configure PgBouncer correctly</a>? Here you go!</div></div><h2 id="final-statements">Final Statements</h2><p>Prepared statements rock! Using them can be a pretty easy performance win for your application while also boosting your Postgres database performance. The only thing to be careful about is to make sure that your application is using the libpq version of creating a prepared statement. Basically, rather than writing the raw SQL yourself, make sure you use your ORM’s API to create a prepared statement! Also, in PgBouncer, you don’t have to worry about deallocating a prepared statement ever.</p><p>If you’d like to try this out yourself, <a href="https://console.cloud.timescale.com/signup" rel="noreferrer">Timescale offers a free trial</a>—no credit card required!</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[What Is TOAST (and Why It Isn’t Enough for Data Compression in Postgres)]]></title>
            <description><![CDATA[Postgres TOAST is often seen as a data compression mechanism in PostgreSQL, but it falls short of that task. Learn how TOAST really works and why there is a better alternative.]]></description>
            <link>https://www.tigerdata.com/blog/what-is-toast-and-why-it-isnt-enough-for-data-compression-in-postgres</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/what-is-toast-and-why-it-isnt-enough-for-data-compression-in-postgres</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Team Tiger Data]]></dc:creator>
            <pubDate>Wed, 25 Oct 2023 18:48:16 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/10/What-Is-TOAST--and-Why-It-Isn-t-Enough-for-Data-Compression-in-Postgres-.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/10/What-Is-TOAST--and-Why-It-Isn-t-Enough-for-Data-Compression-in-Postgres-.png" alt="What Is TOAST (and Why It Isn’t Enough for Data Compression in Postgres): two hands holding two pieces of bread toast." /><p>If you’re working with large databases in Postgres, this story will sound familiar. As your Postgres database keeps growing, your performance starts to decline, and you begin to worry about storage space—or, to be precise, how much you’ll pay for it. You love PostgreSQL, but there’s something you wish you had: a highly effective data compression mechanism.</p><p>PostgreSQL does have somewhat of a compression mechanism: <a href="https://www.postgresql.org/docs/current/storage-toast.html">TOAST</a> 🍞. In this post, we’ll walk you through how Postgres TOAST works and the different TOASTing strategies. <br><br>As much as we enjoy a good TOAST, we’ll discuss why this is not the kind of compression feature you need for reducing the storage footprint of modern large databases—and how, as the PostgreSQL enthusiasts that we are here at Timescale, we decided to build a more suitable compression mechanism for PostgreSQL, inspired by the columnar design of NoSQL databases.&nbsp;</p><h2 id="what-is-postgres-toast">What Is Postgres TOAST?</h2><p>Even if it might reduce the size of datasets, TOAST (The Oversized Attribute Storage Technique) is not your traditional data compression mechanism. To understand  TOAST, we have to start by understanding <a href="https://www.timescale.com/blog/how-to-reduce-your-postgresql-database-size" rel="noreferrer">how PostgreSQL stores data</a>.&nbsp;</p><p>Postgres’ storage units are called pages, and pages have a fixed size (8 kB by default). 
Having a fixed page size gives Postgres many advantages: <a href="https://www.timescale.com/blog/guide-to-postgres-data-management" rel="noreferrer">data management simplicity, efficiency, and consistency</a>. But there is a downside: some data values might not fit within that page.&nbsp;</p><p>This is where TOAST comes in. TOAST refers to the automatic mechanism that PostgreSQL uses to efficiently store and manage values in Postgres that do not fit within a page. To handle such values, Postgres TOAST will, by default, compress them using an internal algorithm. If, after compression, the values are still too large, Postgres will move them to a separate table (called the TOAST table), leaving pointers in the original table.&nbsp;</p><p>(As we’ll see later in this article, you can modify this strategy as a user, for example, by telling&nbsp;Postgres to avoid compressing data in a particular column.)</p><h2 id="toast-able-data-types">TOAST-able Data Types&nbsp;</h2><p>The data types subject to TOAST are primarily variable-length ones that have the potential to exceed the size limits of a standard PostgreSQL page. 
On the other hand, fixed-length data types, like <code>integer</code>, <code>float</code>, or&nbsp; <code>timestamp</code>, are not subjected to TOAST since they fit comfortably within a page.&nbsp;</p><p>Some examples of TOAST-able data types are:&nbsp;</p><ul><li><code>json</code> and <code>jsonb</code></li><li>Large <code>text</code> strings</li><li><code>varchar</code> and <code>varchar(n)</code> (If the length specified in <code>varchar(n)</code> is small enough, then values of that column might always stay below the TOAST threshold.)</li><li><code>bytea</code> storing binary data</li><li>Geometric data like <code>path</code> and <code>polygon</code> and PostGIS types like&nbsp; <code>geometry</code> or <code>geography</code></li></ul><h2 id="how-does-postgres-toast-work">How Does Postgres TOAST Work?&nbsp;</h2><p>Understanding TOAST requires not only the concept of page size but also another Postgres storage concept: tuples. Tuples are rows in a PostgreSQL table. Typically, the TOAST mechanism kicks in if the fields within a tuple have a combined size of over approximately 2&nbsp;kB.</p><p>If you’ve been paying attention, you might wonder, “Wait, but the page size is around 8&nbsp;kB, so why is the threshold so much lower?” That’s because PostgreSQL likes to ensure it can store multiple tuples on a single page: if tuples are too large, fewer tuples fit on each page, leading to increased I/O operations and reduced performance. </p><p>Postgres also needs to keep free space to fit additional operational data: each page stores the tuple data <em>and</em> additional information for managing the data, such as item identifiers, headers, and transaction information.&nbsp;</p><p>So, when the combined size of all fields in a tuple exceeds approximately 2&nbsp;kB (or the TOAST threshold parameter, as we’ll see later), PostgreSQL takes action to ensure that the data is stored efficiently. 
TOAST handles this in two primary ways:</p><ol><li><strong>Compression.</strong> PostgreSQL can compress the large field values within the tuple to reduce their size using a <a href="https://www.tigerdata.com/blog/time-series-compression-algorithms-explained" rel="noreferrer">compression algorithm</a>. By default, if compression is sufficient to bring the tuple's total size below the threshold, the data will remain in the main table, albeit in a compressed format.</li><li><strong>Out-of-line storage.</strong> If compression alone isn't effective enough to reduce the size of the large field values, Postgres moves them to a separate TOAST table. This process is known as "out-of-line" storage because the original tuple in the main table doesn’t hold the large field values anymore. Instead, it contains a "pointer" or reference to the location of the large data in the TOAST table.&nbsp;</li></ol><p>(We’re simplifying things slightly for the purpose of this article—<a href="https://www.postgresql.org/docs/current/storage-toast.html">read the PostgreSQL documentation for a full detailed view.</a>)&nbsp;</p><h2 id="the-postgres-compression-algorithm-pglz">The Postgres Compression Algorithm: <code>pglz</code>&nbsp;</h2><p>We’ve mentioned that TOAST can compress large values in PostgreSQL. But which compression algorithm is PostgreSQL using, and how effective is it?&nbsp;</p><p>The <code>pglz</code> (PostgreSQL Lempel-Ziv) is the default internal compression algorithm used by PostgreSQL specifically tailored for TOAST. Here’s how it works in simple terms:</p><ul><li><code>pglz</code> tries to avoid repeated data. When it sees repeated data, instead of writing the same thing again, it just points back to where it wrote it before. This "avoiding repetition" helps in saving space.</li><li>As <code>pglz</code> reads through data, it remembers a bit of the recent data it has seen. 
This recent memory is the "sliding window."</li><li>As new data comes in, <code>pglz</code> checks if it has seen this data recently (within its sliding window). If yes, it writes a short reference instead of repeating the data.</li><li>If the data is new or not repeated enough times to make a reference shorter than the actual data, <code>pglz</code> just writes it down as it is.</li><li>When it's time to read the compressed data, <code>pglz</code> uses its references to fetch the original data. This process is quite direct, as it looks up the referred data and places it where it belongs.</li><li><code>pglz</code> doesn't need separate storage for its memory (the sliding window); it builds it on the go while compressing and does the same when decompressing.</li></ul><p>This implementation balances compression efficiency and speed within the TOAST mechanism. The compression rate effectiveness of <code>pglz</code> will largely depend on the nature of the data. </p><p>For example, highly repetitive data will compress much better than high entropy data (like random data). You might see compression ratios in the range of 25 to 50 percent, but this is a very general estimate—results will vary widely based on the exact nature of the data.</p><h2 id="configuring-toast">Configuring TOAST&nbsp;<br></h2><h3 id="toast-strategies">TOAST strategies&nbsp;</h3><p>By default, PostgreSQL will go through the TOAST mechanism according to the procedure explained earlier (compression first and out-of-line storage next, if compression is not enough). Still, there might be scenarios where you might want to fine-tune this behavior on a per-column basis. PostgreSQL allows you to do this using the TOAST strategies <code>EXTENDED</code>, <code>EXTERNAL</code>, <code>MAIN</code>, and&nbsp; <code>PLAIN</code>.</p><ul><li><strong><code>EXTENDED</code>: </strong>This is the default strategy. Data will be stored out of line in a separate TOAST table if it’s too large for a regular table page. 
Data will be compressed to save space before being moved to the TOAST table.</li><li><strong><code>EXTERNAL</code>: </strong>This strategy tells PostgreSQL to store the data for this column out of line if the data is too large to fit in a regular table page, and we’re asking PostgreSQL not to compress the data: the value will be moved to the TOAST table as-is.</li><li><strong><code>MAIN</code>: </strong>This strategy is a middle ground. It tries to keep data in line in the main table through compression. Out-of-line storage is still possible, but only as a last resort, when there is no other way to make the row small enough to fit on a page.&nbsp;</li><li><strong><code>PLAIN</code>: </strong>Using <code>PLAIN</code> in a column tells PostgreSQL to always store the column's data in line in the main table, ensuring it isn't moved to an out-of-line TOAST table. Take into account that if the data grows beyond the page size, the <code>INSERT</code> will fail because the data won’t fit.&nbsp;</li></ul><p>If you want to inspect the current strategies of a particular table, you can run the following:&nbsp;</p><pre><code class="language-SQL">\d+ your_table_name&nbsp;
</code></pre>
<p>You'll get an output like this: </p><pre><code class="language-sql">=&gt; \d+ example_table
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Table "public.example_table"
&nbsp;Column&nbsp; | &nbsp; &nbsp; &nbsp; Data Type &nbsp; | Modifiers | Storage&nbsp; | Stats target | Description&nbsp;
---------+------------------+-----------+----------+--------------+-------------
&nbsp;&nbsp;bar&nbsp; &nbsp; | varchar(100000)&nbsp; | &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; | extended |&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; |&nbsp;
</code></pre><p>If you wish to modify the storage setting, you can do so using the following command:&nbsp;</p><pre><code class="language-SQL">-- Sets EXTENDED as the TOAST strategy for the bar column
ALTER TABLE example_table ALTER COLUMN bar SET STORAGE EXTENDED;
</code></pre>
<h3 id="key-parameters">Key parameters&nbsp;</h3><p>Apart from the strategies above, these two parameters are also important to control TOAST behavior: &nbsp;</p><p><strong><code>TOAST_TUPLE_THRESHOLD</code></strong></p><p>This is the parameter that sets the size threshold for when TOASTing operations (compression and out-of-line storage) are considered for oversized tuples.</p><p>As previously mentioned, <code>TOAST_TUPLE_THRESHOLD</code> is set to approximately 2&nbsp;kB by default.</p><p><strong><code>TOAST_COMPRESSION_THRESHOLD</code></strong></p><p>This parameter specifies the minimum size of a value before Postgres considers compressing it during the TOASTing process.</p><p>If a value surpasses this threshold, PostgreSQL will attempt to compress it. However, just because a value is above the compression threshold, it doesn't automatically mean it will be compressed: the TOAST strategies will guide PostgreSQL on how to handle the data based on whether it was compressed and its resultant size relative to the tuple and page limits, as we’ll see in the next section.&nbsp;</p><h3 id="bringing-it-all-together">Bringing it all together</h3><p><code>TOAST_TUPLE_THRESHOLD</code> is the trigger point. When the size of a tuple's data fields combined exceeds this threshold, PostgreSQL will evaluate how to manage it based on the set TOAST strategy for its columns, considering compression and out-of-line storage. The exact actions taken will also depend on whether column data surpasses the <code>TOAST_COMPRESSION_THRESHOLD</code>.</p>
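<p>The decision flow described above can be sketched in a few lines of Python. This is a deliberately simplified model, not PostgreSQL's actual implementation: it treats a single value as if it were the whole tuple, the 2,000-byte threshold is a round-number assumption, per-page overhead is ignored, <code>zlib</code> stands in for <code>pglz</code>, and <code>toast_outcome()</code> is a hypothetical helper.</p><pre><code class="language-Python">import os
import zlib

TOAST_THRESHOLD = 2000  # assumed round number for the ~2 kB threshold
PAGE_SIZE = 8192        # default 8 kB page; per-page overhead is ignored here


def toast_outcome(value, strategy):
    """Return a label for where a single bytes value would end up."""
    if len(value) &lt;= TOAST_THRESHOLD:
        return "inline"
    if strategy == "PLAIN":
        # Never compressed or moved; a value that cannot fit on a page fails.
        return "inline" if len(value) &lt;= PAGE_SIZE else "error"
    if strategy == "EXTERNAL":
        return "out-of-line"  # moved as-is, no compression attempt
    compressed_size = len(zlib.compress(value))  # zlib is a stand-in for pglz
    if strategy == "EXTENDED":
        fits = compressed_size &lt;= TOAST_THRESHOLD
        return "inline, compressed" if fits else "out-of-line"
    if strategy == "MAIN":
        # Prefers staying in the main table; moving out is a last resort.
        fits = compressed_size &lt;= PAGE_SIZE
        return "inline, compressed" if fits else "out-of-line"
    raise ValueError(f"unknown strategy: {strategy}")


repetitive = b"abc" * 2000      # 6,000 bytes that compress very well
random_ish = os.urandom(10000)  # 10,000 random bytes that barely compress

for strategy in ("EXTENDED", "EXTERNAL", "MAIN", "PLAIN"):
    print(strategy, toast_outcome(repetitive, strategy), toast_outcome(random_ish, strategy))
</code></pre>
<p>In this toy model, the repetitive value stays in line under <code>EXTENDED</code> and <code>MAIN</code> because it compresses below the threshold, while the random value is moved out of line, or fails outright under <code>PLAIN</code>.</p>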
<!--kg-card-begin: html-->
<table style="border:none;border-collapse:collapse;"><colgroup><col width="160"><col width="307"><col width="233"><col width="233"></colgroup><tbody><tr style="height:0pt"><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;background-color:#f3f3f3;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Strategy&nbsp;</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;background-color:#f3f3f3;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Compress if tuple &gt; TOAST_COMPRESSION_THRESHOLD</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;background-color:#f3f3f3;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Store out-of-line if tuple &gt; TOAST_TUPLE_THRESHOLD</span></p></td><td 
style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;background-color:#f3f3f3;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Description&nbsp;</span></p></td></tr><tr style="height:0pt"><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">EXTENDED</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Yes</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" 
style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Yes</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Default strategy. Compresses first, then checks if out-of-line storage is needed.&nbsp;</span></p></td></tr><tr style="height:0pt"><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">MAIN</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span 
style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Yes</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Only in uncompressed form&nbsp;</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Compresses first, and if still oversized, moves to TOAST table without compression.&nbsp;</span></p></td></tr><tr style="height:0pt"><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span 
style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">EXTERNAL</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">No</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Yes</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Always moves to TOAST if oversized, without compression.</span></p></td></tr><tr style="height:0pt"><td 
style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">PLAIN</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">No</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">No</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span 
style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Data always stays in the main table. If a tuple exceeds the page size, an error occurs.&nbsp;</span></p></td></tr></tbody></table>
<!--kg-card-end: html-->
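<p>As a rough illustration of the strategies in the table above, a column's storage strategy can be inspected and changed with plain SQL. The sketch below assumes a hypothetical <code>documents</code> table with a <code>body</code> text column; neither name comes from a real schema.</p>
<pre><code class="language-SQL">-- Hypothetical table, used only for illustration
CREATE TABLE documents (
    id   bigint PRIMARY KEY,
    body text
);

-- Inspect the current strategy of each column.
-- attstorage is 'p' (PLAIN), 'm' (MAIN), 'e' (EXTERNAL), or 'x' (EXTENDED)
SELECT attname, attstorage
FROM pg_attribute
WHERE attrelid = 'documents'::regclass
  AND attnum &gt; 0
  AND NOT attisdropped;

-- Store large values of this column out of line, without compression
ALTER TABLE documents ALTER COLUMN body SET STORAGE EXTERNAL;
</code></pre>
<p>Changing the strategy only applies to values written afterward; existing rows keep their current storage until they are rewritten.</p>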
<p></p><h2 id="why-toast-isnt-enough-as-a-data-compression-mechanism-in-postgresql">Why TOAST Isn't Enough as a Data Compression Mechanism in PostgreSQL&nbsp;</h2><p>By now, you probably understand why TOAST is not the data compression mechanism you wish you had in PostgreSQL. Modern applications ingest large volumes of data every day, so databases grow (and overgrow) quickly. </p><p>Such a problem was not as prominent when our beloved Postgres was built decades ago, but today’s developers need compression solutions for reducing the storage footprint of their datasets.&nbsp;</p><p>While TOAST incorporates compression as one of its techniques, it's crucial to understand that its primary role isn't to serve as a database compression mechanism in the traditional sense. TOAST is mainly a solution to one problem: managing large values within the structural confines of a Postgres page.&nbsp;</p><p>While this approach can lead to some storage space savings due to the compression of specific large values, its primary purpose is not to optimize storage space across the board. </p><p>For example, if you have a 5 TB database made up of small tuples, TOAST won’t help you turn those 5 TB into 1 TB. While there are parameters within TOAST that can be adjusted, this won't transform TOAST into a generalized storage-saving solution. </p><p>And there are other inherent problems with using TOAST as a traditional compression mechanism in PostgreSQL; for example:  </p><ul><li>Accessing TOASTed data can add overhead, especially when the data is stored out of line. This overhead becomes more evident when many large text values or other TOAST-able values are frequently accessed.</li><li>TOAST lacks a high-level, user-friendly mechanism for dictating compression policies. It’s not built to optimize storage costs or facilitate storage management.</li><li>TOAST's compression is not designed to provide high compression ratios. 
By default, it uses a single algorithm (<code>pglz</code>), with compression rates typically ranging from 25 to 50 percent (PostgreSQL 14 and later can also use <code>lz4</code> when built with LZ4 support).&nbsp;</li></ul><h2 id="adding-columnar-compression-to-postgresql-with-timescale">Adding Columnar Compression to PostgreSQL With Timescale&nbsp;</h2><p><a href="https://docs.timescale.com/self-hosted/latest/install/" rel="noreferrer">Via the TimescaleDB extension</a>, PostgreSQL users have a better alternative. Inspired by the compression design of NoSQL databases, <a href="https://www.timescale.com/blog/building-columnar-compression-in-a-row-oriented-database" rel="noreferrer">we added columnar compression functionality to PostgreSQL</a>. This approach goes beyond PostgreSQL’s conventional row-based storage, bringing in the efficiency and performance of columnar storage. </p><p>By adding a compression policy to your large tables, <a href="https://www.timescale.com/blog/how-ndustrial-is-providing-fast-real-time-queries-and-safely-storing-client-data-with-97-compression/" rel="noreferrer"><strong>you can reduce your PostgreSQL database size by up to 10x (achieving 90+ percent compression rates)</strong></a>. &nbsp;</p><div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-emoji">💪</div><div class="kg-callout-text">Ready for a compression faceoff? Read our <a href="https://www.timescale.com/blog/postgres-toast-vs-timescale-compression/" rel="noreferrer">PostgreSQL TOAST vs. Timescale Compression</a> comparison and see the numbers for yourself!</div></div><p><br></p><p>By defining a time-based compression policy, you indicate when data should be compressed. For instance, you might choose to compress data older than seven days automatically: &nbsp;</p><pre><code class="language-SQL">-- Compress data older than 7 days
SELECT add_compression_policy('my_hypertable', INTERVAL '7 days');
</code></pre>
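<p>Note that a compression policy can only be added once compression has been enabled on the hypertable. A minimal sketch of that prerequisite step, to be run before the policy call above, assuming <code>my_hypertable</code> has hypothetical <code>device_id</code> and <code>time</code> columns (option names may differ between TimescaleDB versions):</p>
<pre><code class="language-SQL">-- Enable compression on the hypertable (column names are assumptions)
ALTER TABLE my_hypertable SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'device_id',
    timescaledb.compress_orderby   = 'time DESC'
);

-- Once the policy has run, compare on-disk size before and after compression
SELECT before_compression_total_bytes,
       after_compression_total_bytes
FROM hypertable_compression_stats('my_hypertable');
</code></pre>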
<p>Via this compression policy, Timescale will transform the table <a href="https://www.timescale.com/learn/when-to-consider-postgres-partitioning" rel="noreferrer">partitions</a> (<a href="https://www.timescale.com/learn/is-postgres-partitioning-really-that-hard-introducing-hypertables" rel="noreferrer">which in Timescale are <em>also created automatically</em></a>) into a <a href="https://www.tigerdata.com/blog/building-columnar-compression-in-a-row-oriented-database" rel="noreferrer">columnar</a> format behind the scenes, combining many rows (1,000) into an array. To boost compressibility,  Timescale will apply different compression algorithms depending on the data type:&nbsp;</p><ul><li><a href="http://www.vldb.org/pvldb/vol8/p1816-teller.pdf?ref=timescale.com">Gorilla compression</a> for floats</li><li>Delta-of-delta + <a href="https://arxiv.org/abs/1209.2137?ref=timescale.com">Simple-8b</a> with <a href="https://github.com/lemire/FastPFor/tree/c69935a1b507ea58c4cbd2f5e32d997e2c7402e9?ref=timescale.com">run-length encoding</a> compression for timestamps and other integer-like types</li><li>Whole-row dictionary compression for columns with a few repeating values (+ LZ compression on top)</li><li>LZ-based array compression for all other types</li></ul><p>This columnar compression design offers an efficient and scalable solution to the problem of large datasets in PostgreSQL. It allows you to use less storage to store more data without hurting your query performance (it actually improves it). <a href="https://docs.timescale.com/about/latest/release-notes/#timescaledb-2110-on-2023-05-22" rel="noreferrer">And in the latest versions of TimescaleDB, you can also <code>INSERT</code>, <code>DELETE</code>, and <code>UPDATE</code> directly over compressed data.&nbsp;</a></p><h2 id="keep-reading">Keep Reading</h2><p>Have we piqued your curiosity? 
Read the following blog posts to learn more about compression in Timescale: </p><ul><li><a href="https://www.timescale.com/blog/building-columnar-compression-in-a-row-oriented-database" rel="noreferrer">Building Columnar Compression in a Row-Oriented Database&nbsp;</a></li><li><a href="https://www.timescale.com/blog/how-ndustrial-is-providing-fast-real-time-queries-and-safely-storing-client-data-with-97-compression" rel="noreferrer">How Ndustrial Is Providing Fast Real-Time Queries and Safely Storing Client Data With 97 % Compression</a></li><li><a href="https://www.timescale.com/blog/allowing-dml-operations-in-highly-compressed-time-series-data-in-postgresql" rel="noreferrer">Allowing DML Operations in Highly Compressed Time-Series Data in PostgreSQL</a></li><li><a href="https://www.timescale.com/blog/time-series-compression-algorithms-explained" rel="noreferrer">Time-Series Compression Algorithms, Explained</a></li></ul><h2 id="wrap-up">Wrap-Up&nbsp;</h2><p>We hope this article helped you understand that while TOAST is a well-thought-out mechanism to manage large values within a PostgreSQL page, it’s not effective for optimizing database storage within the realm of modern applications. </p><p>If you’re looking for effective data compression that can move the needle on your storage savings, give Timescale a go. You can try our cloud platform <a href="https://www.timescale.com/blog/postgresql-timescaledb-1000x-faster-queries-90-data-compression-and-much-more" rel="noreferrer">that propels PostgreSQL to new performance heights</a>, making it faster and fiercer—<a href="https://console.cloud.timescale.com/signup">it’s free, and no credit card is required</a>—or you can add<a href="https://docs.timescale.com/self-hosted/latest/install/"> the TimescaleDB extension</a> to your self-hosted PostgreSQL database. </p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Database Backups and Disaster Recovery in PostgreSQL: Your Questions, Answered]]></title>
            <description><![CDATA[Database backups are one of the biggest pain points for developers. Use our guide to PostgreSQL backup to help you navigate your way.]]></description>
            <link>https://www.tigerdata.com/blog/database-backups-and-disaster-recovery-in-postgresql-your-questions-answered</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/database-backups-and-disaster-recovery-in-postgresql-your-questions-answered</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Team Tiger Data]]></dc:creator>
            <pubDate>Tue, 24 Oct 2023 14:12:07 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/10/Database-Backups-and-Disaster-Recovery-in-PostgreSQL.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/10/Database-Backups-and-Disaster-Recovery-in-PostgreSQL.png" alt="An elephant in a rescue helmet: learn more about Database Backups and Disaster Recovery in PostgreSQL" /><p>When we ask our community about the elementary challenges they face with their PostgreSQL production databases, we often hear about three pain points: query speed, optimizing large tables, and managing database backups. We’ve covered the first two topics in articles about <a href="https://www.timescale.com/learn/is-postgres-partitioning-really-that-hard-introducing-hypertables?ref=timescale.com"><u>partitioning</u></a><a href="https://www.timescale.com/learn/postgresql-performance-tuning-key-parameters?ref=timescale.com"> <u>and fine-tuning your database</u></a>. We’ve also discussed<a href="https://timescale.ghost.io/blog/how-to-reduce-your-postgresql-database-size/"> <u>how to reduce your database size</u></a> to better manage large tables. </p><p>In this guide, we’ll answer some of the most frequently asked questions about database backup and recovery in PostgreSQL. We’ll also discuss how we handle things in the<a href="https://www.timescale.com/?ref=timescale.com"> <u>Timescale platform</u></a>.</p><h2 id="why-are-postgresql-database-backups-important">Why Are PostgreSQL Database Backups Important?</h2><p>When we discuss backup and recovery, we’re referring to a set of processes and protocols established to safeguard your data from loss or corruption and restore it to a usable state:</p><ul><li><strong>Backups</strong> involve creating copies of your data at regular intervals, copies that encapsulate the state of your PostgreSQL database at a specific point in time.&nbsp;</li><li><strong>Recovery</strong>, on the other hand, is the process of restoring data from these backups. 
If both things are taken care of (i.e., you always have up-to-date backups and a good recovery strategy in place), your PostgreSQL database will be resilient against failure, and you’ll be protected against data loss.&nbsp;&nbsp;</li></ul><p>Effective backup management is not only about creating copies of data. It’s also about ensuring those copies are healthy, accurate, and up-to-date.&nbsp;</p><p>To define a good backup strategy for your production PostgreSQL database, you need to consider several aspects. This includes how frequently you will back up your database, where these backups will be stored, and how often you will audit them.&nbsp;</p><p>But your job isn’t finished once you get up-to-date and healthy database backups. You must also establish an effective disaster recovery protocol. No matter how careful you are, it’s a fact of database management that failures will happen sooner or later. They can be caused by outages, failed upgrades, corrupted hardware, or human error—you name it.</p><p>Your disaster recovery plan must encompass all the steps to restore data as quickly as possible after an incident. This ensures that your database is not just backed up but also recoverable in a timely and efficient manner.</p><h2 id="what-is-the-difference-between-a-physical-backup-and-a-logical-backup-in-postgresql">What Is the Difference Between a Physical Backup and a Logical Backup in PostgreSQL?&nbsp;</h2><p>In PostgreSQL, there are two main types of database backups: physical backups and logical backups.&nbsp;</p><ul><li><strong>Physical backups</strong> capture the database's state at a specific point in time. They involve copying the actual PostgreSQL database data at the file system level.&nbsp;</li><li><strong>Logical backups </strong>involve exporting specific database objects or the entire database into a human-readable SQL file format. 
A logical backup contains SQL statements to recreate the database objects and insert data.</li></ul><p>Logical backups can be highly granular, allowing for the backup of specific database objects like tables, schemas, or databases. They are also portable and can be used across different database systems or versions, making them popular for migrating small to medium databases. This is your common <code>pg_dump/pg_restore</code>.&nbsp;&nbsp;</p><p>But a main drawback of logical backups is speed. For large databases, the process of restoring from a logical backup is too slow to be useful as a sole disaster recovery mechanism (or migration mechanism, for that matter). Restoring from physical backups is faster than restoring from logical backups, and it’s exact. When putting together a disaster recovery strategy, you’ll be dealing with physical backups.</p><h2 id="a-guide-to-physical-backups-in-postgresql">A Guide to Physical Backups in PostgreSQL</h2><p>Let’s explore some essential concepts around physical backups and how they can help you recover your database in case of failure.&nbsp;</p><h3 id="file-system-backups">File system backups&nbsp;</h3><p>Physical backups are referred to as<a href="https://www.postgresql.org/docs/current/backup-file.html?ref=timescale.com"> <u>file system backups</u></a> in PostgreSQL. This refers to the process of directly copying the directories and files that PostgreSQL uses to store its data, resulting in a complete representation of the database at a specific moment in time.&nbsp;</p><p>Maintaining file system backups is an essential piece of every disaster recovery strategy and imperative in production databases. But putting together a solid disaster recovery plan requires other techniques beyond simply taking “physical” file system backups regularly. 
That’s especially true if you’re dealing with large production databases.&nbsp;</p><p>Taking physical backups of very large databases can be a rather slow and resource-intensive process that conflicts with other high-priority database tasks, affecting your overall performance. Physical backups are not enough to ensure consistency in case of failure, as they only reflect the database state at the time they were taken. To restore a database in case of failure, you’ll need another mechanism to be able to restore all the transactions that occurred between the moment the last backup was taken and the failure.&nbsp;</p><h3 id="wal-and-continuous-archiving">WAL and continuous archiving&nbsp;</h3><p>WAL stands for <a href="https://www.postgresql.org/docs/current/wal-intro.html?ref=timescale.com"><u>Write-Ahead Logging</u></a>. It’s a protocol that improves the reliability, consistency, and durability of a PostgreSQL database by logging changes before they are written to the actual database files.</p><p>WAL is key for assuring atomicity and durability in PostgreSQL transactions. By writing changes to a log before they're committed to the database, WAL ensures that either all the changes related to a transaction are made or none at all.&nbsp;</p><p>WAL is also essential for disaster recovery since, in the event of a failure, the WAL files can be replayed to bring the database back to a consistent state. The process of regularly saving and storing these WAL records in a secondary storage location, ensuring that they are preserved over the long term, is usually referred to as continuous archiving.&nbsp;</p><p>Keeping WAL records and a recent, healthy physical database backup ensures that your PostgreSQL database can be successfully restored in case of failure. 
The physical backup will get PostgreSQL to the same state as it was when the backup was taken, which hopefully was not so long ago, and the WAL files will be rolled forward to the point right before things started failing.&nbsp;</p><p>You might be wondering why it’s necessary to keep up-to-date backups if WAL can be replayed. The answer is speed. Replaying WAL during a recovery process is time-consuming, especially when dealing with large datasets with complex transactions. Backups provide a snapshot of the database at a specific point in time, enabling quick restoration up to that point.&nbsp; </p><p>In the optimal recovery scenario, you restore a recent backup (e.g., from the previous day). Then, you replay the WAL recorded after that backup to bring the database to its most recent state. You don’t want to rely on WAL to reproduce two weeks’ worth of transactions.</p><h3 id="what-is-point-in-time-recovery-pitr-in-postgresql">What is point-in-time recovery (PITR) in PostgreSQL?</h3><p>Point-in-time recovery refers to restoring a PostgreSQL database to a specific point in time chosen by the user. For example, if I perform an upgrade and, for whatever reason, decide to revert the change, I could choose to recover the database to its state on any earlier day.&nbsp;</p><p>Behind the scenes, PITR in PostgreSQL is anchored in WAL. By combining a base backup with the sequential replay of WAL, PostgreSQL can be restored to an exact moment.</p><h2 id="a-guide-to-postgresql-physical-backup-tools">A Guide to PostgreSQL Physical Backup Tools&nbsp;&nbsp;</h2><p>There are multiple tools that help with the creation of physical backups, two of the most popular being <code>pg_basebackup</code> and <code>pgBackRest</code>.&nbsp;</p><h3 id="pgbasebackup">pg_basebackup</h3><p><a href="https://www.postgresql.org/docs/current/app-pgbasebackup.html?ref=timescale.com"><u><code>pg_basebackup</code></u></a> is the native tool offered by PostgreSQL for taking physical backups. It’s straightforward and reliable. 
It allows you to efficiently copy the data directory and include the WAL files to ensure a consistent and complete backup.</p><p><code>pg_basebackup</code> has important limitations. Taking full backups of a large database can be a lengthy and resource-intensive process. A good workaround to mitigate this is to combine full backups with incremental backups. For example, you can frequently copy the data that has changed since the last full backup (e.g., once a day) and create full backups less frequently (e.g., once a week). However, <code>pg_basebackup</code> did not support incremental backups before PostgreSQL 17.&nbsp;</p><p><code>pg_basebackup</code> also has limited parallelization capabilities, which can further slow down the creation of full backups. The process is mostly manual, requiring developers to closely monitor and manage the backup operations.</p><h3 id="pgbackrest">pgBackRest</h3><p>To address the constraints of <code>pg_basebackup</code>, the PostgreSQL community built tools like <a href="https://pgbackrest.org/?ref=timescale.com"><u><code>pgBackRest</code></u></a>. 
<code>pgBackRest</code> introduces several important improvements:</p><ul><li>It supports both full and incremental backups.&nbsp;</li><li>It introduces multi-threaded operations, accelerating the backup process for larger databases.&nbsp;</li><li>It validates checksums during the backup process to ensure data integrity, offering an additional layer of security.</li><li>It supports various storage solutions, offering flexibility in how and where backups are stored.&nbsp;</li></ul><p>We use <a href="https://www.tigerdata.com/blog/making-postgresql-backups-100x-faster-via-ebs-snapshots-and-pgbackrest" rel="noreferrer"><code>pgBackRest</code></a> to manage our own backup and restore process in Timescale, although we’ve implemented some hacks to speed up the full backup process (<code>pgBackRest</code> can still be quite slow for creating backups in large databases).</p><h2 id="a-guide-to-logical-backups-in-postgresql">A Guide to Logical Backups in PostgreSQL&nbsp;</h2><p>Logical backups involve exporting data into a human-readable format, such as SQL statements. This type of backup is generally more flexible and portable, making it handy to reproduce a database in another architecture (i.e., for migrations).&nbsp; However, recovering from a logical backup is quite a slow process. That makes them practical only for migrating small to medium PostgreSQL production databases.&nbsp;</p><h3 id="pgdumppgrestore">pg_dump/pg_restore&nbsp;</h3><p>The most common way to create logical backups and restore from them is by using <code>pg_dump/pg_restore</code>:</p><ul><li><strong><code>pg_dump</code> </strong>creates logical backups of a PostgreSQL database. It generates a script file or other formats that contain SQL statements needed to reconstruct the database to the state it was at the backup time. 
You can use <code>pg_dump</code> to back up an entire database or individual tables, schemas, or other database objects.</li><li><code><strong>pg_restore</strong></code> restores databases from backups created by <code>pg_dump</code>. Just as <code>pg_dump</code> offers granularity in creating backups, <code>pg_restore</code> allows for selective restoration of specific database objects, providing flexibility in the recovery process. Note that <code>pg_restore</code> reads only the archive formats produced by <code>pg_dump</code> (custom, directory, and tar); plain-text SQL dumps are restored with <code>psql</code> instead.</li></ul><h2 id="when-should-i-use-logical-backups-and-when-should-i-use-physical-backups-in-postgresql">When Should I Use Logical Backups, and When Should I Use Physical Backups in PostgreSQL?&nbsp;</h2><p><strong>Logical backups via <code>pg_dump/pg_restore</code> are mostly useful for creating testing databases or for database migrations. </strong>In terms of migrations, if you’re operating a production database, we only recommend going the <code>pg_dump/pg_restore</code> route if your database is small (&lt;100&nbsp;GB).</p><p>Migrating larger and more complex databases via <code>pg_dump/pg_restore</code> might take your production database offline for too long. Other migration strategies, like the dual-write and backfill method, can avoid this downtime.&nbsp;</p><p><strong>Physical backups are mostly used for disaster recovery and data archiving. </strong>If you’re operating a production database, you’ll want to maintain up-to-date physical backups and WAL to recover your database when failure occurs. If your industry requires you to keep copies of your data for a certain period of time due to regulations, physical backups will be the way to go.&nbsp;</p><p>In production applications, you’ll most likely use a combination of logical and physical backups. 
For disaster recovery, physical backups will be your foundational line of defense, but logical backups can serve as additional assurance (redundancy is a good thing). For migrating large databases, you’ll most likely use a staged approach, <a href="https://docs.timescale.com/migrate/latest/dual-write-and-backfill/dual-write-from-postgres/?ref=timescale.com"><u>combining logical backups with other tactics</u></a>, and so on.</p><h2 id="what-about-replicas-in-postgresql">What About Replicas in PostgreSQL?&nbsp;</h2><p><a href="https://timescale.ghost.io/blog/how-high-availability-works-in-our-cloud-database/"><u>Replicas</u></a> are continuously updated mirrors of the primary database, capturing every transaction and modification almost instantaneously. They're not the same as backups, but their usefulness in disaster recovery is indisputable. In the event of a failure, you can promote replicas to serve as the primary database, ensuring minimal downtime while you restore the damaged database.&nbsp;Building a high-availability replica and failover mechanism generally involves the following steps:&nbsp;</p><ul><li>The primary database should be configured to allow connections from replicas.&nbsp;</li><li>Physical backups of the primary should be regularly created, e.g., using <code>pgBackRest</code>.&nbsp;</li><li>WAL capturing all changes made to the database should be shipped to the replica, for example, via streaming replication. Replication can be synchronous, where each transaction is confirmed only when both primary and replica have received it, or asynchronous, where transactions are confirmed without waiting for the replica.</li><li>Configurations for automatic failover should be established to promote a replica to become the primary database in case of a failure.</li><li>Tools and scripts should be used to monitor replication lag and ensure the replica is up-to-date.</li></ul><p>This setup can be considerably complex to maintain. 
Most providers of managed PostgreSQL databases, <a href="https://docs.timescale.com/use-timescale/latest/ha-replicas/?ref=timescale.com"><u>including Timescale</u></a>, offer fully managed replicas as one of their services. This simplifies running highly available databases.</p><h2 id="a-guide-to-database-backups-and-disaster-recovery-with-timescale">A Guide to Database Backups and Disaster Recovery with Timescale&nbsp;</h2><p>The<a href="https://console.cloud.timescale.com/signup?ref=timescale.com"> <u>Timescale platform</u></a> allows our customers to create fully managed PostgreSQL and TimescaleDB databases. That means we take care of the backup and disaster recovery process for them. Let’s run through how the platform handles backups, replication, upgrades, and restores.</p><h3 id="how-do-backups-work-in-timescale">How do backups work in Timescale?&nbsp;</h3><p>Backups in Timescale are fully automated. Using <code>pgBackRest</code> under the hood, Timescale automatically creates one full backup every week and incremental backups every day.&nbsp;</p><p>Timescale also keeps WAL files of any changes made to the database. This WAL can be replayed in the event of a failure to reproduce any transactions not captured by the last daily backup. For example, it can replay the changes made to your database during the last few hours. Timescale stores the two most recent full backups and WAL in<a href="https://aws.amazon.com/s3/?ref=timescale.com"> <u>S3</u></a> volumes.&nbsp;</p><p>On top of the full and incremental backups taken by <code>pgBackRest</code>,<a href="https://timescale.ghost.io/blog/making-postgresql-backups-100x-faster-via-ebs-snapshots-and-pgbackrest/"> <u>Timescale also takes EBS snapshots daily</u></a>. EBS snapshots create copies of the storage volume that can be restored, effectively making them backups. 
They are significantly faster than taking full backups via pgBackRest (about 100x faster).&nbsp;</p><p>By taking EBS snapshots daily (on top of the weekly full backups by <code>pgBackRest</code>), we introduce an extra layer of redundancy, ensuring that we always have a fresh snapshot that we can quickly restore if the customer experiences a critical failure that requires recovery from a full backup.&nbsp;</p><h3 id="disaster-recovery-in-timescale-what-happens-if-my-database-fails">Disaster recovery in Timescale: What happens if my database fails?&nbsp;</h3><p>Timescale is built on AWS with decoupled compute and storage, something that makes the platform especially resilient against failures. There are two classes of failures that Timescale handles distinctly: compute and storage failures.</p><h4 id="how-timescale-handles-compute-failures">How Timescale handles compute failures&nbsp;</h4><p>Compute failures are more frequent than storage failures, as they can be caused by things like unoptimized queries or other issues that result in a maxed-out CPU. To improve uptime for the customer, Timescale has developed a methodology that makes the platform recover extremely quickly from compute failures. We call this technique<a href="https://docs.timescale.com/use-timescale/latest/ha-replicas/high-availability/?ref=timescale.com"> <u>rapid recovery</u></a>.&nbsp;</p><p>Timescale decouples the compute and storage nodes. So, if the compute node fails, Timescale automatically spins up a new compute node, attaching the undamaged storage unit to it. Any WAL that was in memory then replays.&nbsp;</p><p>The length of this recovery process depends on how much WAL needs replaying. Typically, it completes in less than thirty seconds. 
Under the hood, this entire process is automated via Kubernetes.&nbsp;</p><h3 id="how-timescale-handles-storage-failures">How Timescale handles storage failures&nbsp;</h3><p>Storage failures are much less common than compute failures, but when they happen, they’re more severe. Having a<a href="https://docs.timescale.com/use-timescale/latest/ha-replicas/high-availability/?ref=timescale.com"> <u>high-availability replica</u></a> can be a lifesaver in this circumstance; while your storage is being restored, instead of experiencing downtime, your replica will automatically take over.&nbsp;</p><p>To automatically restore your damaged storage, Timescale makes use of the backups it has on storage, reproducing WAL since the last incremental backup. The figure below illustrates the process:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/10/Database-Backups-in-PostgreSQL_Timescale-backups.png" class="kg-image" alt="Recovery from backup in Timescale " loading="lazy" width="1025" height="658" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/10/Database-Backups-in-PostgreSQL_Timescale-backups.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/10/Database-Backups-in-PostgreSQL_Timescale-backups.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/10/Database-Backups-in-PostgreSQL_Timescale-backups.png 1025w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Recovery from backup in Timescale&nbsp;</em></i></figcaption></figure><h3 id="how-do-replicas-work-in-timescale-and-how-do-they-help-with-recovery">How do replicas work in Timescale, and how do they help with recovery?&nbsp;</h3><p>In Timescale, you can create two types of 
replicas:</p><ul><li><strong>Read replicas </strong>are useful for read scaling. They’re used to liberate load from your primary database in read-heavy applications, for example, if you’re powering a BI tool or doing frequent reporting. Read replicas are read-only, and you can create as many as you need.</li><li><strong>High-availability replicas</strong> are exact, up-to-date copies of your database that automatically take over operations if your primary becomes unavailable.&nbsp;</li></ul><p>We’ve been talking about the importance of backups and disaster recovery. There’s a related concept that’s also important to consider: the concept of <strong>high availability</strong>. In broad terms, a “highly available” database describes a database that’s able to stay running without significant interruption (perhaps no more than a few seconds) even in case of failure.&nbsp;</p><p>The process of recovering a large database from backup might take a while, even when you’ve done everything right. That’s why it’s handy to have a replica running. Instead of waiting for the backup and restore process to finish, when your primary database fails, your connection will automatically failover to the replica. That saves your own users any major downtime.</p><p>Failover also helps remove downtime for common operations that would normally cause a service to reset, like upgrades. In these cases, Timescale makes changes to each node sequentially so that there is always a node available.&nbsp;And speaking of upgrades…&nbsp;</p><h3 id="how-are-upgrades-handled-in-timescale">How are upgrades handled in Timescale?&nbsp;</h3><p>In Timescale, you’re running PostgreSQL databases with the TimescaleDB extension enabled. Therefore, during your Timescale experience, you’ll most likely experience three different types of upgrades:&nbsp;</p><h4 id="timescaledb-upgrades">TimescaleDB upgrades</h4><p>These refer to upgrades between TimescaleDB versions, e.g., from TimescaleDB 2.11 to TimescaleDB 2.12. 
You don’t have to worry about these. They’re backward compatible, they require no downtime, and they will happen automatically during your maintenance window. Your Timescale services always run the latest available TimescaleDB version, so you can enjoy all the new features we ship.&nbsp;</p><h4 id="postgresql-minor-version-upgrades">PostgreSQL minor version upgrades</h4><p>We always run the latest available minor version of PostgreSQL in Timescale as well, mostly for security reasons. These minor updates may contain security patches, fixes for data corruption problems, and fixes for frequent bugs.&nbsp;</p><p>Timescale automatically handles these upgrades during your maintenance window, and they are also backward compatible. However, they require a service restart, which could cause some downtime (30 seconds to a few minutes) if you don’t have a replica. We alert you ahead of time about these, so you can set your maintenance window to a low-traffic time (e.g., middle of the night) to minimize consequences.&nbsp;</p><h4 id="postgresql-major-version-upgrades">PostgreSQL major version upgrades</h4><p>These refer to upgrading, for example, from PostgreSQL 15 to 16. These upgrades are different and more serious since they’re often not backward compatible.</p><p>We can’t run these upgrades for you automatically, as this might cause issues in your application. Besides, the downtime associated with upgrading major versions of PostgreSQL can be more severe (e.g., 20 minutes). Unfortunately, in this particular case, high-availability replicas can’t help you avoid downtime.</p><p>Major PostgreSQL upgrades are always a significant lift. Timescale has some tools that will make the transition smoother. 
For example, you can initiate the upgrade process in a particular database <a href="https://docs.timescale.com/self-hosted/latest/upgrades/upgrade-pg/?ref=timescale.com#upgrade-postgresql"><u>with a click of a button</u></a>. Before doing so, you can test your upgrade in a<a href="https://docs.timescale.com/use-timescale/latest/services/service-management/?ref=timescale.com#forking-a-service"> <u>copy of your database</u></a> to make sure nothing will break and to get an accurate idea of how much downtime the upgrade will require.<a href="https://timescale.ghost.io/blog/read-before-you-upgrade-best-practices-for-choosing-your-postgresql-version/"> <u>Read this article for more information.&nbsp;</u></a></p><h3 id="can-i-do-pitr-in-timescale-ie-restore-my-database-to-a-previous-state-at-my-own-will">Can I do PITR in Timescale, i.e., restore my database to a previous state at my own will?&nbsp;</h3><p>Yes, you can! All Timescale services <a href="https://docs.timescale.com/use-timescale/latest/backup-restore/point-in-time-recovery/?ref=timescale.com"><u>allow PITR</u></a> to any point in the last three days. If you're using our<a href="https://www.timescale.com/enterprise?ref=timescale.com"> <u>Enterprise plan</u></a>, this timespan expands up to 14 days.</p><h2 id="stress-free-postgresql-backups">Stress-Free PostgreSQL Backups</h2><p>Having a solid backup and recovery strategy is top of mind for every PostgreSQL user. We hope this introductory article answers some of your questions; if you’d like to see more articles diving deeper into this topic, <a href="https://x.com/TimescaleDB?s=20&amp;ref=timescale.com"><u>tell us on Twitter/X</u></a>. </p><p>If you prefer not to worry about maintaining your backups and taking care of recovering your database when things fail, <a href="https://console.cloud.timescale.com/signup?ref=timescale.com"><u>try Timescale</u></a>, our managed PostgreSQL platform. 
It takes care of all things backups so you can focus on what matters (building and running your application) while experiencing <a href="https://timescale.ghost.io/blog/postgresql-timescaledb-1000x-faster-queries-90-data-compression-and-much-more/"><u>the performance boost of TimescaleDB</u></a>. You can start a free trial <a href="https://console.cloud.timescale.com/signup?ref=timescale.com"><u>here</u></a> (no credit card required).&nbsp;</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[When and How to Use Psycopg2]]></title>
            <description><![CDATA[This guide walks you through integrating PostgreSQL and your Python code via Psycopg2, one of the most popular PostgreSQL adapters.]]></description>
            <link>https://www.tigerdata.com/blog/when-and-how-to-use-psycopg2</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/when-and-how-to-use-psycopg2</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Python]]></category>
            <dc:creator><![CDATA[Anber Arif]]></dc:creator>
            <pubDate>Thu, 19 Oct 2023 17:24:31 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/10/Screenshot-2023-10-19-at-9.58.23-AM-1-1-1.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/10/Screenshot-2023-10-19-at-9.58.23-AM-1-1-1.png" alt="When and How to Use Psycopg2" /><p><a href="https://survey.stackoverflow.co/2023/#section-most-popular-technologies-programming-scripting-and-markup-languages">Python is one of the most popular programming languages</a>, widely used for data analytics, visualizations, and data science. This guide will walk you through how to integrate PostgreSQL and your Python code via Psycopg2, one of the most popular PostgreSQL adapters. </p><h2 id="beyond-psycopg2-connecting-to-a-postgresql-database-using-python">Beyond Psycopg2: Connecting to a PostgreSQL Database Using Python</h2><p>PostgreSQL adapters serve as bridges that enable you to interact with your PostgreSQL database directly from your application and programming language. </p><p>In the particular case of Python, these are some of the most popular PostgreSQL adapters:&nbsp;</p><h3 id="pg8000">pg8000</h3><p><a href="https://pypi.org/project/pg8000/">pg8000</a> stands out for being a pure-Python library and for its adherence to the Python Database API Specification v2.0. Unlike some adapters that rely on C extensions, pg8000 is written entirely in Python, which enhances its portability and ease of deployment across various environments. </p><p>It's particularly favored for applications that need to avoid the complexities of dealing with C extensions while maintaining efficient communication with PostgreSQL databases. 
The adapter strikes a balance between simplicity and functionality, making it an excellent choice for developers who prioritize straightforward implementation and usage.</p><p>pg8000 supports PostgreSQL 8.4 and up, as well as Python 2.6 to 2.7 and 3.2 to 3.7.</p><h3 id="asyncpg">asyncpg</h3><p><a href="https://magicstack.github.io/asyncpg/" rel="noreferrer">asyncpg</a> is a distinct database adapter renowned for its superior performance and asynchronous processing capabilities. Designed explicitly for <a href="https://docs.python.org/3/library/asyncio.html" rel="noreferrer">Python’s asyncio framework</a>, asyncpg provides non-blocking, asynchronous communication with PostgreSQL databases. This ensures that applications remain responsive and scalable, particularly under heavy loads. </p><p>Its specialization in handling concurrent database connections effectively distinguishes <a href="https://www.tigerdata.com/blog/how-to-build-applications-with-asyncpg-and-postgresql" rel="noreferrer">asyncpg</a> from other adapters, making it a go-to option for developers building high-performance, I/O-bound applications.</p><p>asyncpg supports PostgreSQL 9.2 and later versions, and it’s designed specifically for Python 3.5 and newer.</p><h3 id="sqlalchemy">SQLAlchemy</h3><p><a href="https://www.sqlalchemy.org/" rel="noreferrer">SQLAlchemy</a> is not only an adapter but more of a comprehensive SQL toolkit and Object-Relational Mapping (ORM) system for Python applications. It abstracts the complexities of database communication, allowing developers to interact with databases using Pythonic expressions. </p><p>SQLAlchemy's ORM enables developers to map Python objects to database tables, facilitating a higher-level, object-oriented perspective of database interaction. 
This feature-rich adapter is ideal for developers looking for an extensive set of tools to streamline both the basic and advanced aspects of database interaction.</p><p>SQLAlchemy can be used with a variety of databases, including PostgreSQL.</p><h3 id="psycopg2">Psycopg2</h3><p>Lastly, the focus of this blog post. Psycopg2 is a popular adapter for its comprehensive feature set, robustness, and scalability, firmly establishing itself as a favorite among Python developers interfacing with PostgreSQL. It is implemented with C extensions, which contributes to its performance efficiency. </p><p>Psycopg2 supports a range of PostgreSQL features, including server-side cursors, asynchronous notifications, and COPY commands. Furthermore, it is thread-safe and boasts connection pooling capabilities. Its widespread adoption is anchored on its reliability and compatibility with various versions of PostgreSQL and Python, making it a versatile choice for a diverse array of applications.</p><p>psycopg2 supports PostgreSQL 7.4 and up and Python versions from 2.5 to 3.7.</p><h2 id="what-is-the-difference-between-psycopg2-and-sqlalchemy">What Is the Difference Between Psycopg2 and SQLAlchemy?&nbsp;</h2><p>Psycopg2 and SQLAlchemy are very popular tools, so let’s spend a minute clarifying the difference between both.&nbsp;</p><p>As we mentioned before, both tools are fundamentally different in character. SQLAlchemy is not only an adapter but an extensive SQL toolkit and ORM. With SQLAlchemy, developers can interact with databases using high-level Python expressions. It automatically translates these Python expressions into SQL code, reducing the need for writing SQL queries. 
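</p><p>As a rough illustration of that translation, here is a minimal sketch, assuming SQLAlchemy 1.4 or later is installed. The <code>users</code> table and its columns are hypothetical and exist only for this example; no database connection is needed to render the statement.</p><pre><code class="language-python">from sqlalchemy import Column, Integer, MetaData, String, Table, select

# Hypothetical table definition, used only for illustration
metadata = MetaData()
users = Table(
    "users",
    metadata,
    Column("id", Integer, primary_key=True),
    Column("name", String),
)

# A Python expression that describes the query
stmt = select(users.c.name).where(users.c.id == 1)

# SQLAlchemy renders the expression as SQL text with a bound parameter
sql_text = str(stmt)
print(sql_text)</code></pre><p>The printed statement is ordinary SQL that SQLAlchemy generated from the Python expression, so no query string had to be written by hand.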
</p><p>This abstraction makes SQLAlchemy particularly beneficial for developers who are either less experienced with SQL or are looking for a more Pythonic way to interact with databases.</p><p>In comparison, Psycopg2 offers a closer interaction with the PostgreSQL database, enabling developers to leverage PostgreSQL’s features to the fullest. It provides detailed control over database connections and query executions, making it a favorite for those who prioritize performance and direct database interaction.</p><p>In sum, here are the main differences between Psycopg2 and SQLAlchemy:</p><ul><li>Psycopg2’s design is more straightforward, focusing on direct and efficient database interaction. SQLAlchemy, being an ORM, introduces an additional layer of abstraction, making database interactions more Pythonic and less complex.</li><li>Psycopg2 is rich in features that allow for a closer and more intricate interaction with PostgreSQL databases. SQLAlchemy offers a broader set of tools that simplify not only database connections but also query executions, mapping Python objects to database tables, and other advanced functionalities.</li><li>Psycopg2 provides developers with detailed control over SQL queries and database interactions. SQLAlchemy automates and abstracts many aspects of database communication, making it a suitable option for a less SQL-intensive experience.</li></ul><h2 id="using-psycopg2-top-advantages">Using Psycopg2: Top Advantages </h2><p>Now, let’s get into <a href="https://pypi.org/project/psycopg2/" rel="noreferrer">Psycopg2</a>!</p><p>This adapter seamlessly integrates Python and PostgreSQL, making it incredibly easy to work with these technologies in unison. It provides a set of Python modules that allow you to establish connections to PostgreSQL databases, execute SQL queries, and retrieve data easily. 
Psycopg2 adheres to the Python Database API Specification v2.0, ensuring a consistent and intuitive experience.</p><p>These are its main strengths:&nbsp;</p><ul><li><strong>Connection management. </strong>Psycopg2 excels in establishing and maintaining robust connections between Python applications and PostgreSQL databases. It helps maintain a reliable data transfer channel, enhancing the efficiency of data exchange and communication.</li><li><strong>Efficient SQL query execution. </strong>The adapter is equipped with capabilities for precise and rapid SQL query execution. It is adept at handling a variety of tasks, including data retrieval, record modifications, and executing complex operations.</li><li><strong>Real-time data synchronization. </strong>Psycopg2 helps you develop real-time applications, as it lets your Python code stay in step with changes in the PostgreSQL database. This facilitates the creation of responsive, data-driven applications that can effectively adapt to dynamic data changes.</li><li><strong>Robustness. </strong>Psycopg2 is recognized for its stability and reliability, making it a solid choice for mission-critical applications. This library handles various PostgreSQL features, complex data types, and large objects, providing the precision and reliability needed for high-stakes projects.</li><li><strong>Cross-compatibility.</strong> Psycopg2 integrates with different Python versions and PostgreSQL versions, providing cross-compatibility and versatility for your projects. This helps your Python code remain functional across various environments.</li><li><strong>Active community. </strong>Psycopg2 benefits from continuous development and a supportive community. 
This dedication to improvement ensures that Psycopg2 remains up-to-date with the latest advancements in Python and PostgreSQL, making it a reliable choice for your Python PostgreSQL interactions.</li><li><strong>Enhanced security. </strong>Psycopg2 prioritizes data security by offering robust features, including support for SSL connections. This added layer of security helps safeguard your sensitive data, maintaining data integrity and confidentiality in your transactions.</li></ul><h2 id="when-to-use-psycopg2-example-use-cases-of-using-python-and-postgresql">When to Use Psycopg2: Example Use Cases of Using Python and PostgreSQL&nbsp;</h2><h3 id="data-analytics-and-reporting">Data analytics and reporting&nbsp;</h3><p>Data analysts and scientists frequently employ Psycopg2 for seamless access to PostgreSQL databases. For instance, imagine a data analyst at a marketing firm who uses Psycopg2 to retrieve customer data from a PostgreSQL database. With this data, they can create insightful reports, analyze trends, and make data-driven decisions to enhance marketing strategies.</p><h3 id="web-development">Web development&nbsp;</h3><p>In web development, Psycopg2 is invaluable for building dynamic, database-driven websites. Consider an e-commerce website where Psycopg2 is used to manage product inventory, customer orders, and user accounts stored in a PostgreSQL database. This ensures a smooth shopping experience for customers and efficient inventory management for the business.</p><h3 id="business-applications">Business applications</h3><p>Businesses across various industries leverage Psycopg2 for mission-critical applications. For example, a financial institution may employ Psycopg2 to maintain a secure and robust database of customer transactions and accounts. 
This ensures data integrity, reliability, and swift access to financial data.</p><h3 id="iot-and-real-time-applications">IoT and real-time applications&nbsp;</h3><p>In the realm of IoT, Psycopg2 plays a crucial role in capturing and storing real-time sensor data. Imagine a smart city project that relies on Psycopg2 to collect and analyze data from various sensors, such as traffic cameras and air quality monitors. This data can be used to optimize traffic flow, improve air quality, and enhance overall city management.</p><h3 id="scientific-research">Scientific research</h3><p>Scientists and researchers utilize Psycopg2 for storing and analyzing scientific data. For instance, in a research project involving climate data, Psycopg2 could be used to store temperature and weather data in a PostgreSQL database. Researchers can then perform complex data analysis and generate climate models to better understand climate patterns.</p><div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-emoji">📚</div><div class="kg-callout-text"><a href="https://www.timescale.com/learn/building-python-apps-with-postgresql-and-psycopg3/" rel="noreferrer">Learn how to build a Python app with Postgres</a>.</div></div><h2 id="installing-psycopg2-instructions">Installing Psycopg2: Instructions&nbsp;</h2><p>Before we begin, ensure that you have the following prerequisites in place:</p><ul><li>Python installed on your system (<a href="https://www.python.org/downloads/release/python-360/">Python 3.6</a> or later).</li><li>Access to a <a href="https://www.postgresql.org/">PostgreSQL</a> database.</li></ul><h3 id="installation-steps">Installation steps</h3><p>Psycopg2 can be easily installed using <a href="https://pypi.org/project/pip/">pip</a>, Python's package manager. 
Open your terminal or command prompt and run the following command:</p><pre><code>pip install psycopg2</code></pre><p>Alternatively, you can install the pre-compiled <a href="https://pypi.org/project/psycopg2/">Psycopg2</a> package, which bundles the necessary binary dependencies so that no compiler or PostgreSQL development headers are required:</p><pre><code>pip install psycopg2-binary</code></pre><p><em>📚 Editor's note: If you're using Timescale, </em><a href="https://docs.timescale.com/quick-start/latest/python/#connect-to-timescaledb" rel="noreferrer"><em>you can also find guidelines on how to install Psycopg2 in our documentation</em></a><em>.</em></p><p><br>To ensure that Psycopg2 is correctly installed, perform the following checks:</p><ul><li>Open a Python interpreter or create a Python script and import Psycopg2. If there are no import errors, Psycopg2 is installed correctly.</li><li>You can verify the installed Psycopg2 version with the following code:<br></li></ul><pre><code>import psycopg2

print(psycopg2.__version__)</code></pre><h3 id="troubleshooting-common-installation-issues">Troubleshooting common installation issues</h3><p>While Psycopg2 installation is generally straightforward, if you encounter some issues, run through this checklist: </p><ul><li>Psycopg2 relies on PostgreSQL’s C client library (libpq). If you encounter missing library errors during installation, ensure you have the PostgreSQL development headers and libraries installed on your system. </li><li>If you are using a virtual environment for your Python development, make sure it is activated when you run the installation command. </li><li>Verify that your Python version meets Psycopg2's requirements. If you encounter compatibility issues, consider upgrading to a compatible Python version. 
</li></ul><h2 id="connecting-to-your-postgresql-database-using-psycopg2">Connecting to Your PostgreSQL Database Using Psycopg2 </h2><p>Now that you have everything installed, let's connect to your PostgreSQL database!</p><p>Here’s an example of establishing a database connection using Psycopg2: </p><pre><code class="language-python">import psycopg2
# Defining database connection parameters
db_params = {
    "host": "your_database_host",
    "database": "your_database_name",
    "user": "your_database_user",
    "password": "your_database_password",
    "port": "your_database_port"
}
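# Initializing both handles so the finally block is safe even if connect() fails
connection = None
cursor = None
try:
    # Establishing a connection to the database
    connection = psycopg2.connect(**db_params)
    # Creating a cursor object to interact with the database
    cursor = connection.cursor()
    # Performing database operations here...
except (Exception, psycopg2.Error) as error:
    print(f"Error connecting to the database: {error}")

finally:
    # Closing the cursor and connection only if they were actually created
    if cursor:
        cursor.close()
    if connection:
        connection.close()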
        print("Database connection closed.")</code></pre><p>In this code, we first import the psycopg2 module, which provides the functionality needed to interact with PostgreSQL databases from Python. The <code>db_params</code> dictionary contains the following parameters necessary to establish a database connection:</p><ul><li><code>host</code>: specifies the hostname or IP address of the database server.</li><li><code>database</code>: specifies the name of the database you want to connect to.</li><li><code>user</code>: specifies the username for authentication.</li><li><code>password</code>: specifies the password for authentication.</li><li><code>port</code>: specifies the port number to connect to. The default is 5432 for PostgreSQL.</li></ul><p>The code is wrapped in a try block, which is used to handle exceptions or errors that may occur during the database connection process. Inside the try block, <code>psycopg2.connect(**db_params)</code> is used to establish a connection to the database. The <code>**db_params</code> syntax passes the connection parameters defined in the dictionary to the connect function.</p><p>After successfully establishing a connection, a cursor object is created using <code>connection.cursor()</code>. The cursor is used to execute SQL queries and interact with the database. The except block catches any exceptions or errors that may occur during the connection process, and it prints an error message if there is an issue. 
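</p><p>The connection parameters listed above do not have to be hard-coded. As a minimal sketch, assuming the settings are exported under the standard libpq environment variable names (<code>PGHOST</code>, <code>PGDATABASE</code>, <code>PGUSER</code>, <code>PGPASSWORD</code>, <code>PGPORT</code>), a hypothetical helper such as <code>load_db_params</code> could build the same dictionary:</p><pre><code class="language-python">import os


def load_db_params(env=None):
    """Build a psycopg2-style parameter dict from environment variables."""
    # Falling back to the real process environment when no mapping is passed in
    env = os.environ if env is None else env
    return {
        "host": env.get("PGHOST", "localhost"),
        "database": env["PGDATABASE"],
        "user": env["PGUSER"],
        "password": env["PGPASSWORD"],
        # 5432 is the default PostgreSQL port
        "port": int(env.get("PGPORT", "5432")),
    }


# Example with an explicit mapping instead of the real environment
example = load_db_params(
    {
        "PGDATABASE": "your_database_name",
        "PGUSER": "your_database_user",
        "PGPASSWORD": "your_database_password",
    }
)
print(example["host"], example["port"])</code></pre><p>The resulting dictionary can be passed to <code>psycopg2.connect(**db_params)</code> in the same way as the hard-coded one.</p><p>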
The <code>finally</code> block ensures that the cursor and connection are properly closed.</p><h2 id="running-postgresql-queries-using-psycopg2">Running PostgreSQL Queries Using Psycopg2</h2><p>In this section, we assume that the database connection has already been established, so we’ll focus on showing some query examples you can run using Psycopg2.</p><h3 id="executing-select-queries-using-psycopg2">Executing <code>SELECT</code> queries using Psycopg2</h3><p>The below code shows how to execute a <code>SELECT</code> query with Psycopg2:</p><pre><code class="language-python"># Defining the SELECT query
select_query = "SELECT column1, column2 FROM your_table_name WHERE condition;"

# Executing the SELECT query
cursor.execute(select_query)

# Fetching and printing the results
result = cursor.fetchall()
for row in result:
    print(row)</code></pre><p>With this code, we achieve the following:</p><ul><li>We define the <code>SELECT</code> query and include the desired columns and a condition.</li><li>The <code>execute</code> method runs the query.</li><li>The results are fetched using <code>cursor.fetchall()</code>, and we loop through the rows to print them.</li></ul><h3 id="executing-insert-queries-using-psycopg2">Executing <code>INSERT</code> queries using Psycopg2</h3><p><code>INSERT</code> queries are executed similarly, followed by a commit to persist the change: </p><pre><code class="language-python"># Defining the INSERT query
insert_query = "INSERT INTO your_table_name (column1, column2) VALUES (%s, %s);"

# Executing the INSERT query; value1 and value2 are placeholders for your own Python values
cursor.execute(insert_query, (value1, value2))

# Committing the transaction to save changes
connection.commit()</code></pre><h3 id="executing-update-and-delete-queries-using-psycopg2">Executing <code>UPDATE</code> and <code>DELETE</code> queries using Psycopg2</h3><p>The same pattern applies to <code>UPDATE</code> and <code>DELETE</code> queries:</p><pre><code class="language-python"># Defining the UPDATE query
update_query = "UPDATE your_table_name SET column1 = new_value WHERE condition;"

# Executing the UPDATE query
cursor.execute(update_query)

# Committing the transaction to save changes
connection.commit()</code></pre><pre><code class="language-python"># Defining the DELETE query
delete_query = "DELETE FROM your_table_name WHERE condition;"

# Executing the DELETE query
cursor.execute(delete_query)

# Committing the transaction to save changes
connection.commit()</code></pre><h2 id="troubleshooting-fixing-common-psycopg2-errors">Troubleshooting: Fixing Common Psycopg2 Errors</h2><p>Now that you have everything up and running, let’s look at the errors you are most likely to encounter when using Psycopg2 with PostgreSQL databases, and how to fix each one.</p><h3 id="psycopg2operationalerror"><code>psycopg2.OperationalError</code></h3><pre><code class="language-python">import psycopg2

try:
    connection = psycopg2.connect(
        dbname="yourdbname",
        user="youruser",
        password="yourpassword",
        host="yourhost",
        port="yourport"
    )
    # Additional database operations here

except psycopg2.OperationalError as error:
    print(f"OperationalError: {error}")
</code></pre><p><br>The <code>psycopg2.OperationalError</code> is one of the most common errors you might see. It generally indicates one of these issues:</p><ul><li><strong>Connection failure.</strong> The client could not establish a connection to the PostgreSQL server. Ensure that the server is up and running and accessible from the client machine.</li><li><strong>Invalid database name or credentials. </strong>A mistyped database name or incorrect credentials can also lead to an <code>OperationalError</code>. Make sure that the database name, username, and password are correct.</li><li><strong>Networking issues. </strong>An incorrect host or port, or an unreachable network, can also trigger this error. Make sure that the network configuration is correct and that the server is reachable.</li><li><strong>Insufficient privileges. </strong>The user might not have the required privileges to connect to the specified database. Ensure the user has the necessary permissions.</li><li><strong>Server overload or downtime. </strong>The PostgreSQL server might be down or overloaded. Confirm that the server is healthy and operational.</li></ul><p>To fix the error, work through these steps, which typically resolve the issue:</p><ul><li>Ensure that the PostgreSQL server is running and listening on the correct port.</li><li>Confirm that the connection parameters (including host, port, user, password, and database name) are correct.</li><li>Ensure the network connection between the client and server is stable and that firewalls or network ACLs are not blocking the connection.</li><li>Confirm that the user has the required permissions to connect to the database.</li><li>Examine the PostgreSQL and system logs for additional details on the error.</li></ul><h3 id="psycopg2programmingerror"><br><code>psycopg2.ProgrammingError</code></h3><pre><code class="language-python">import psycopg2

try:
    connection = psycopg2.connect(
        dbname="yourdbname",
        user="youruser",
        password="yourpassword",
        host="yourhost",
        port="yourport"
    )
    cursor = connection.cursor()
    # Replace the next line with your actual SQL query
    cursor.execute("YOUR SQL QUERY HERE")
    connection.commit()

except psycopg2.ProgrammingError as error:
    print(f"ProgrammingError: {error}")
</code></pre><p>This error usually signals a problem with the structure or syntax of the SQL query being executed. These are the most common reasons:</p><ul><li><strong>SQL syntax errors.</strong> You might have a typo, a misplaced operator, or another mistake in your SQL syntax.</li><li><strong>Invalid table/column names. </strong>You might be referencing non-existent tables or columns.</li><li><strong>Incorrect data types.</strong> This error also shows up if you have mismatched data types or if you’re attempting to insert an incorrect data type into a column.</li><li><strong>Privilege issues</strong>. Lastly, you might be trying to access or manipulate database objects without the necessary privileges.</li></ul><p>To troubleshoot this error, run through this checklist:</p><ul><li>Review the SQL query for any syntax errors. Use <a href="https://github.com/sqlfluff/sqlfluff">SQL linters</a> or built-in database functions to check the syntax.</li><li>Ensure that all referenced table and column names exist and are spelled correctly.</li><li>Ensure that you are inserting data that matches the expected data types of the columns.</li><li>Verify that the database user has adequate privileges to execute the intended operations on the database, tables, or columns.</li></ul><p>Also, pay close attention to the error message, which often contains information on what’s wrong with the SQL query.</p><h3 id="psycopg2integrityerror"><code>psycopg2.IntegrityError</code></h3><pre><code class="language-python">import psycopg2

try:
    connection = psycopg2.connect(
        dbname="yourdbname",
        user="youruser",
        password="yourpassword",
        host="yourhost",
        port="yourport"
    )
    cursor = connection.cursor()
    # Replace the next line with your actual SQL query
    cursor.execute("YOUR SQL QUERY HERE")
    connection.commit()

except psycopg2.IntegrityError as error:
    print(f"IntegrityError: {error}")
    connection.rollback()
</code></pre><p><br>This error occurs when an attempted operation violates a database integrity constraint, for example:</p><ul><li>Unique constraint violation (i.e., attempting to insert a duplicate value in a column that is constrained to have only unique values).</li><li>Foreign key constraint violation (i.e., trying to insert a value in a foreign key column that does not exist in the referenced column).</li><li>Check constraint violation (i.e., inserting data that does not satisfy the check constraints of the columns).</li><li>Not null constraint violation (i.e., attempting to insert a null value into a column that is defined as <code>NOT NULL</code>).</li></ul><p>Here's how you can troubleshoot:</p><ul><li>Review the error message, as it usually contains specific information about the nature of the integrity violation.</li><li>Ensure that the data being inserted adheres to the integrity constraints defined for the table.</li><li>Review the integrity constraints on the table to understand the rules and ensure compliance.</li></ul><h3 id="psycopg2dataerror"><code>psycopg2.DataError</code><br></h3><pre><code class="language-python">import psycopg2

try:
    connection = psycopg2.connect(
        dbname="yourdbname",
        user="youruser",
        password="yourpassword",
        host="yourhost",
        port="yourport"
    )
    cursor = connection.cursor()
    # Replace the next line with your actual SQL query
    cursor.execute("YOUR SQL QUERY HERE")
    connection.commit()

except psycopg2.DataError as error:
    print(f"DataError: {error}")
    connection.rollback()
</code></pre><p>This error is raised when the type, format, or value of the data being inserted or manipulated is not compatible with the database column, for example, a numeric value that is out of range. To fix it, examine the data being inserted or updated to ensure its type, format, and value are compatible with the column’s specifications.</p><h2 id="psycopg2-best-practices">Psycopg2 Best Practices</h2><p>In addition to the troubleshooting tips we shared in the previous section, some general best practices will help you run Psycopg2 with fewer errors (and fix them quicker when they arise). </p><h3 id="logging-errors">Logging errors</h3><p>Implementing a logging mechanism in your Psycopg2 workflow makes it easier to track and analyze complex issues. This is a great practice for applications that interact with databases.</p><p>For example, consider the following code: </p><pre><code class="language-python">import psycopg2
import logging

# Configuring logging
logging.basicConfig(filename='database_errors.log', level=logging.ERROR, 
                    format='%(asctime)s:%(levelname)s:%(message)s')

# Initializing both names so the except and finally blocks can test them safely
connection = None
cursor = None

try:
    connection = psycopg2.connect(
        dbname="yourdbname",
        user="youruser",
        password="yourpassword",
        host="yourhost",
        port="yourport"
    )
    cursor = connection.cursor()
    cursor.execute("YOUR SQL QUERY HERE")
    connection.commit()

except (Exception, psycopg2.Error) as error:
    # Rolling back the transaction in case of error (only if we connected)
    if connection:
        connection.rollback()
    # Logging the error
    logging.error(f"Error: {error}")
    print(f"Error: {error}")

finally:
    # Closing the cursor and connection if they were created
    if cursor:
        cursor.close()
    if connection:
        connection.close()
</code></pre><p><br>The logging configuration uses the <code>format</code> parameter to include a timestamp, severity level, and the error message in each log entry, so your errors are recorded more comprehensively. The <code>except</code> block also rolls back the transaction in the event of an error, which avoids partial data commits that can lead to inconsistencies. </p><p>It's also essential to close database resources like cursors and connections to prevent resource leaks, and this can be handled effectively within a <code>finally</code> block.</p><h3 id="always-check-your-database-connection">Always check your database connection</h3><p>Verify that the database connection is still open before executing queries to avoid errors related to a closed connection: </p><pre><code class="language-python">import psycopg2

# Assuming 'connection' is the established database connection

try:
    # Checking if the connection is still open
    if connection.closed == 0:
        # Database operations here
        pass
    else:
        print("Database connection is closed.")
except (Exception, psycopg2.Error) as error:
    print(f"Error: {error}")</code></pre><h3 id="sanitize-your-input-to-prevent-sql-injection-attacks"><br>Sanitize your input to prevent SQL injection attacks</h3><p>A SQL injection attack exploits a security vulnerability that occurs when an attacker is able to insert malicious SQL code into a query. This can happen when an application allows user input to be included in SQL queries without proper validation or escaping. </p><p>When the database executes this malicious input, it can lead to unauthorized access, data theft, corruption, or other adverse impacts. This is something you want to protect yourself from when working with Psycopg2.</p><p>In this snippet, the user-provided input is passed to <code>cursor.execute()</code> as a query parameter instead of being formatted into the SQL string. Psycopg2 adapts and quotes the value itself, so the input is treated as data and never as SQL: </p><pre><code class="language-python">import psycopg2

# User-provided input
user_input = "'; DROP TABLE users --"

try:
    # Using a %s placeholder instead of formatting the input into the SQL string
    query = "SELECT * FROM users WHERE username = %s;"

    # Executing the query; Psycopg2 quotes and escapes the parameter
    cursor.execute(query, (user_input,))

except (Exception, psycopg2.Error) as error:
    print(f"Error: {error}")</code></pre><h2 id="using-psycopg2-and-timescaledb"><br>Using Psycopg2 and TimescaleDB</h2><p>In this section, we will cover a few examples of TimescaleDB operations that can be managed through Psycopg2. TimescaleDB is a <a href="https://www.tigerdata.com/blog/top-8-postgresql-extensions" rel="noreferrer">PostgreSQL extension</a> that boosts <a href="https://www.tigerdata.com/learn/postgres-performance-best-practices" rel="noreferrer">PostgreSQL performance</a> for <a href="https://www.timescale.com/learn/types-of-data-supported-by-postgresql-and-timescale">data-intensive applications dealing with time-series data</a>, such as IoT sensor data, energy metrics, financial tick data, and many more. </p><h3 id="creating-hypertables-using-psycopg2">Creating hypertables using Psycopg2</h3><p><a href="https://www.timescale.com/learn/is-postgres-partitioning-really-that-hard-introducing-hypertables">Hypertables</a> are the core feature of TimescaleDB. They <a href="https://www.timescale.com/learn/when-to-consider-postgres-partitioning">automatically partition data by time</a>, improving query and ingest performance and making data management more efficient.</p><p>To transform your PostgreSQL tables into hypertables using Psycopg2, run: </p><pre><code class="language-python">import psycopg2

# Assuming a table 'your_table' with a time column 'time_column'
cursor.execute("SELECT create_hypertable('your_table', 'time_column');")

# Committing the transaction to save changes
connection.commit()</code></pre><h3 id="querying-timescaledb-using-psycopg2"><br>Querying TimescaleDB using Psycopg2</h3><p>You can query TimescaleDB databases using Psycopg2 just as you query PostgreSQL. You can also use <a href="https://www.timescale.com/learn/time-series-data-analysis-hyperfunctions">the library of SQL functions</a> that TimescaleDB offers to write time-based queries more effectively. One example is <code>time_bucket</code>, which groups entries into specified time intervals and simplifies aggregating and analyzing data over time.</p><p>In the following SQL query, the <code>time_bucket</code> function creates one-hour intervals. The query then calculates the average temperature for each location within each time bucket, considering only readings above 76, which offers insights into temperature trends over time and across locations.</p><pre><code class="language-SQL">SELECT 
    time_bucket('1 hour', time) as one_hour_interval,
    location,
    AVG(temperature) as avg_temperature
FROM 
    conditions 
WHERE 
    temperature &gt; 76 
GROUP BY 
    one_hour_interval, location 
ORDER BY 
    one_hour_interval DESC, avg_temperature DESC;
</code></pre><h2 id="conclusion"><br>Conclusion</h2><p>In this guide, we've covered <a href="https://pypi.org/project/psycopg2/">Psycopg2</a> and how it can help you connect to your PostgreSQL database from your Python application.</p><p>If your PostgreSQL database could use a query performance boost, give <a href="https://www.timescale.com/">Timescale</a> a try. This PostgreSQL extension <a href="https://timescale.ghost.io/blog/postgresql-timescaledb-1000x-faster-queries-90-data-compression-and-much-more/">will make your queries faster via</a> <a href="https://www.timescale.com/?ref=timescale.com">automatic partitioning</a>, query planner enhancements, improved materialized views, <a href="https://timescale.ghost.io/blog/building-columnar-compression-in-a-row-oriented-database/">columnar compression</a>, and much more.</p><p>If you're running your PostgreSQL database on your own hardware, <a href="https://docs.timescale.com/self-hosted/latest/install/?ref=timescale.com">you can simply add the TimescaleDB extension</a>. If you prefer to try Timescale in AWS, <a href="https://console.cloud.timescale.com/signup?ref=timescale.com">create a free account on our platform</a>. 
It only takes a couple of seconds, no credit card required.&nbsp;</p><h3 id="next-steps">Next steps</h3><p>Also, don't forget to check out some of our Python resources, from time-series data analysis to OpenAI exploration:</p><ul><li><a href="https://timescale.ghost.io/blog/how-to-work-with-time-series-in-python/">How to Work With Time Series in Python?</a></li><li><a href="https://timescale.ghost.io/blog/tools-for-working-with-time-series-analysis-in-python/">Tools for Working With Time-Series Analysis in Python</a></li><li><a href="https://timescale.ghost.io/blog/postgresql-vs-python-for-data-cleaning-a-guide/">PostgreSQL vs Python for Data Cleaning: A Guide<br></a></li><li><a href="https://timescale.ghost.io/blog/do-more-with-aws-and-timescale-cloud-build-an-application-using-lambda-functions-in-python/">Do More on AWS With Timescale: Build an Application Using Lambda Functions in Python</a></li><li><a href="https://timescale.ghost.io/blog/jupyter-notebook-tutorial-setup-python-and-jupyter-notebooks-macos/">Jupyter Notebook Tutorial: Setting Up Python &amp; Jupyter Notebooks on macOS for OpenAI Exploration</a></li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The Problem With Locks and PostgreSQL Partitioning (and How to Actually Fix It)]]></title>
            <description><![CDATA[PostgreSQL locks can cause issues in partitioned tables. Read how TimescaleDB solves this using lock minimization strategies.]]></description>
            <link>https://www.tigerdata.com/blog/how-timescaledb-solves-common-postgresql-problems-in-database-operations-with-data-retention-management</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/how-timescaledb-solves-common-postgresql-problems-in-database-operations-with-data-retention-management</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[PostgreSQL Tips]]></category>
            <category><![CDATA[Hypertables]]></category>
            <dc:creator><![CDATA[Chris Travers]]></dc:creator>
            <pubDate>Thu, 12 Oct 2023 13:00:00 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/10/Screenshot-2023-10-12-at-11.52.06-AM.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/10/Screenshot-2023-10-12-at-11.52.06-AM.png" alt="A lock over a black and neon green background. The Problem With Locks and PostgreSQL Partitioning" /><p>In my career, I have frequently worked for companies with large amounts of <a href="https://www.timescale.com/learn/types-of-data-supported-by-postgresql-and-timescale" rel="noreferrer">time-partitioned data</a>, where I was a software engineer focusing on our PostgreSQL databases. </p><p>We’d already grown past the point where deleting data row by row was practical, so <a href="https://www.timescale.com/learn/when-to-consider-postgres-partitioning" rel="noreferrer">we needed to use PostgreSQL partitioning to manage data retention.</a> In brief, dropping a whole partition allows PostgreSQL to remove the entire file from disk for a subset of your data rather than going through each row and removing them individually. So it’s <em>much</em> faster. <a href="https://www.timescale.com/learn/pg_partman-vs-hypertables-for-postgres-partitioning" rel="noreferrer">But if you are doing partitioning natively in PostgreSQL, you do have to make sure to add new partitions where you’re ingesting new data and drop the old ones. </a></p><p>This was a frequent cause of outages for us, even though we had reasonably well-tested scripts for adding and removing partitions. Unfortunately, the interactions around the scripts were less well-tested, and new, frequent, long-running queries prevented the partition management scripts from getting the locks required and creating new partitions. 
We didn’t see the problem at first because we’d created partitions a few days in advance, but then we ran out of time, and with no new partitions, we couldn’t insert, and whoops, down goes the app.</p><p>These types of problems are particularly hard to debug and disentangle because they are often caused by totally unrelated pieces of code in combination with changes in load. PostgreSQL has begun to address this with newer approaches to attaching and detaching partitions with less blocking, but they’re quite complex. </p><p>I’ve seen outages caused by <a href="https://timescale.ghost.io/blog/how-to-fix-no-partition-of-relation-found-for-row/" rel="noreferrer">partitions failing to create</a>, disks filling up because partitions can’t be dropped, and normal queries pausing behind partition management code. I know how difficult these problems can be. This is why TimescaleDB's hypertables were so exciting to me when I discovered them, especially their lock minimization strategies. </p><h2 id="understanding-postgresql-locks">Understanding PostgreSQL Locks </h2><h3 id="what-are-postgresql-locks">What are PostgreSQL locks? </h3><p>PostgreSQL locks are mechanisms that control concurrent access to data in the database to ensure consistency, integrity, and isolation of database transactions. </p><p>PostgreSQL, like most other relational database management systems, is a <em>concurrent</em> system, which means that multiple queries can be processed at the same time. Locks help in managing multiple transactions attempting to access the same data simultaneously, avoiding conflicts and potential data corruption. </p><h3 id="why-are-postgresql-locks-necessary">Why are PostgreSQL locks necessary? </h3><p>Concurrency is essential for optimizing the performance and responsiveness of the database. 
However, concurrency introduces several challenges that need careful handling to ensure the database’s integrity, consistency, and reliability: </p><ul><li>When multiple queries are executed concurrently, there's a risk that one transaction might view inconsistent or uncommitted data modified by another ongoing transaction. This can lead to erroneous results and inconsistencies in the database.</li><li>Queries executing simultaneously can interfere with each other, leading to performance degradation, locking issues, or inconsistent data.</li><li>When two transactions try to modify the same data simultaneously, it can lead to conflicts, data corruption, or loss of data.</li></ul><p>Locks are necessary to prevent these problems. </p><h3 id="types-of-postgresql-locks">Types of PostgreSQL locks </h3><p><a href="https://www.postgresql.org/docs/current/explicit-locking.html">PostgreSQL supports many different types of locks</a>, but the three relevant to this article are <code>ACCESS SHARE</code>, <code>SHARE UPDATE EXCLUSIVE</code>, and <code>ACCESS EXCLUSIVE</code> locks. </p><ul><li><code>ACCESS SHARE</code> locks are the least restrictive. They are acquired for database read operations and are intended to prevent the database schema from changing, and the related caches from being cleared, while a query is running. In other words, the purpose of access share locks is to block access exclusive locks.</li><li><code>SHARE UPDATE EXCLUSIVE</code> locks allow concurrent reads and writes to a table but block operations that change the database schema in ways that might interfere with running queries. These are used for some forms of concurrent schema changes in PostgreSQL, though two concurrent transactions cannot both take this lock on the same table. For example, you cannot concurrently detach and attach the same partition to/from the same parent table in different sessions. One must complete before the other starts. 
These locks generally are used for concurrency-safe schema changes, which do not clear cached relation information.</li><li><code>ACCESS EXCLUSIVE</code> locks are the most restrictive and are intended to prevent other queries from operating across a schema change. Access exclusive locks block all locks from all other transactions on the locked table.</li></ul><h3 id="cache-invalidation-and-access-exclusive-locks">Cache invalidation and <code>ACCESS EXCLUSIVE</code> locks </h3><p>For performance reasons, PostgreSQL caches information about tables and views (which we call “relations”) and uses this cached information in query execution. This strategy is instrumental for PostgreSQL's efficiency, ensuring that data retrieval is quick and resource utilization is optimized. </p><p>A critical scenario that needs careful lock handling is when the structure of a table is altered. When the schema is altered (e.g., by adding or dropping columns, changing data types, or modifying constraints), the cached information related to that table might become outdated or inconsistent. Therefore, it needs to be invalidated and refreshed to ensure that the query execution reflects the modified schema. </p><p>To make this work, PostgreSQL takes an access exclusive lock on the table in question before the cached information for that relation can be invalidated.</p><h2 id="using-postgresql-partitioning-to-simplify-data-management">Using PostgreSQL Partitioning to Simplify Data Management </h2><p>In PostgreSQL <a href="https://www.postgresql.org/docs/current/ddl-partitioning.html" rel="noreferrer">declarative partitioning</a>, PostgreSQL tables are used both for empty parent tables and for partitions holding the data. Internally, each partition is a table, and there is mapping information used by the planner to indicate which partitions should be looked at for each query. 
This information is cached in the relation cache.</p><p>Partitioning tables by time gives you an organized structure in which data is segmented into specific time frames. This makes data management much faster, since dropping a whole partition allows PostgreSQL to remove the entire partition from disk rather than going through each row and removing them individually. </p><p>In PostgreSQL, you can follow two general approaches for managing partitions and data retention, which, as we'll see later, have different concurrency considerations and problems. </p><h3 id="approach-1-dropping-partitions">Approach #1: Dropping partitions<br></h3><p>In the first approach, we simply drop partitions from a partitioned table when we want to delete data. </p><pre><code class="language-SQL">CREATE TABLE partition_test (
    event_time timestamp,
    sensor_id bigint,
    reported_value float
) partition by range (event_time);

-- Create partition
CREATE TABLE partition_test_2022 PARTITION OF partition_test
FOR VALUES FROM ('2022-01-01 00:00:00') TO ('2023-01-01 00:00:00');

-- Drop partition
DROP TABLE partition_test_2022;</code></pre><h3 id="approach-2-concurrent-workflow">Approach #2: Concurrent workflow<br></h3><p>PostgreSQL also offers a more concurrent workflow for these operations: attaching an existing table as a partition takes a weaker lock on the parent (PostgreSQL 12 and newer), and partitions can be detached concurrently (PostgreSQL 14 and newer).</p><pre><code class="language-SQL">CREATE TABLE partition_test_2022 (LIKE partition_test);

-- ATTACH PARTITION has no CONCURRENTLY option; it takes a SHARE UPDATE EXCLUSIVE lock on the parent
ALTER TABLE partition_test ATTACH PARTITION partition_test_2022 FOR VALUES FROM ('2022-01-01 00:00:00') TO ('2023-01-01 00:00:00');
</code></pre>
<p>To remove a partition concurrently, we can:</p><pre><code class="language-SQL">ALTER TABLE partition_test DETACH PARTITION partition_test_2022 CONCURRENTLY;


DROP TABLE partition_test_2022;
</code></pre>
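<p>Retention scripts usually generate statements like the ones above for each period rather than hard-coding them. The following is a minimal Python sketch of that idea, not a definitive implementation: the helper <code>yearly_partition_sql</code> and the <code>parent_year</code> naming scheme are assumptions made for illustration, and the table name is treated as a trusted constant rather than user input.</p>

```python
from datetime import datetime


def yearly_partition_sql(parent, year):
    """Build the DDL for one yearly partition of `parent`.

    Hypothetical helper: the child naming scheme (parent_year) is an
    assumption for illustration, and `parent` must be a trusted identifier
    because it is formatted directly into the SQL text.
    """
    child = f"{parent}_{year}"
    start = datetime(year, 1, 1)
    end = datetime(year + 1, 1, 1)
    return {
        # Approach #1: create the partition directly under the parent
        "create": (
            f"CREATE TABLE {child} PARTITION OF {parent} "
            f"FOR VALUES FROM ('{start}') TO ('{end}');"
        ),
        # Approach #2: detach concurrently, then drop the detached table
        "detach": f"ALTER TABLE {parent} DETACH PARTITION {child} CONCURRENTLY;",
        "drop": f"DROP TABLE {child};",
    }


statements = yearly_partition_sql("partition_test", 2022)
for step in ("create", "detach", "drop"):
    print(statements[step])
```

<p>Note that PostgreSQL does not allow <code>DETACH PARTITION ... CONCURRENTLY</code> inside a transaction block, so a script that sends this statement through a driver such as Psycopg2 would need autocommit enabled on its connection.</p>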
<h2 id="the-problem-with-locks-and-postgresql-partitioning">The Problem With Locks and <a href="https://www.tigerdata.com/learn/when-to-consider-postgres-partitioning" rel="noreferrer">PostgreSQL Partitioning</a> </h2><p>From a database administration perspective, neither of these approaches is very safe. </p><p>In the first approach, both partition creation and dropping require an access exclusive lock on <code>partition_test</code>, meaning that once the query is issued, no other queries can run against that table until the query is concluded and the transaction committed or rolled back. The locking in each case looks like this:</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/01/Data-Retention-Access-Exclusive-Lock_img1.png" class="kg-image" alt="" loading="lazy" width="2000" height="626" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/01/Data-Retention-Access-Exclusive-Lock_img1.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/01/Data-Retention-Access-Exclusive-Lock_img1.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2023/01/Data-Retention-Access-Exclusive-Lock_img1.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/01/Data-Retention-Access-Exclusive-Lock_img1.png 2070w" sizes="(min-width: 720px) 720px"></figure><p>The concurrent approach still has to address the issue of clearing the relation cache. It does so in two stages: first, a share update exclusive lock is taken on <code>partition_test</code>, and then information is written to the catalogs indicating that the table will be removed from the partition list. 
The backend then waits until all running queries have concluded (and all transactions guaranteeing repeatable reads have concluded) before removing the table from the partition map.</p><p>This approach does not rely on locks to signal that the process is complete, only to prevent multiple concurrent updates for the status of the same set of partitions. As a result, even unrelated queries can block the detach operation. If the partition management script’s connection is interrupted for any reason, cleanup must be performed by the database administrator.</p><p>Once the partition is removed from the partition list, it is locked in access exclusive mode and dropped. The locking in this process looks like this:</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/01/Data-retention-management-stages-1-and-2_img2.png" class="kg-image" alt="" loading="lazy" width="2000" height="854" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/01/Data-retention-management-stages-1-and-2_img2.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/01/Data-retention-management-stages-1-and-2_img2.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2023/01/Data-retention-management-stages-1-and-2_img2.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/01/Data-retention-management-stages-1-and-2_img2.png 2070w" sizes="(min-width: 720px) 720px"></figure><p>In conclusion: </p><ul><li>The first approach (manually creating and dropping partitions) involves relatively quick operations but forces hard synchronization points on partitioned tables, which in time-series workloads are usually partitioned precisely because they are heavily used. 
Problems here can cause database outages fairly quickly.</li><li>The concurrent workflow doesn’t always solve these problems. In mixed-workload applications, waiting for all running queries to complete (which can include long-running automatic maintenance tasks) can lead to long delays, dropped connections, and general difficulties in actually managing data retention. Particularly under load, these operations may not perform well enough to be useful.</li></ul><h2 id="common-advice-on-how-to-fix-this-problem-and-why-its-not-the-best">Common Advice on How to Fix This Problem (and Why It's Not the Best) </h2><p>The overall problems of partition management with time-series data fall into two categories:  </p><p>1) Failure to create partitions before they are needed can block inserts.</p><p>2) Dropping partitions when needed for regulatory or cost reasons not only can fail but can also block reading and writing to the relevant tables. </p><p>If you ask for advice, you'll probably hear one of these two things: </p><h3 id="use-custom-scripts">Use custom scripts</h3><p>Many companies begin their partition-management journey with custom scripts. This has the advantage of simplicity, but the disadvantage is that the operations can require heavy locks, and teams often lack the initial knowledge of how to handle them.</p><p>Custom scripts are the most flexible approach to the lock problems of partition management because the entire toolkit (lock escalation, time-out and retry, and more) is available. This allows knowledgeable teams to build solutions that work around the existing database workloads with the best chance of success. </p><p>On the other hand, this problem is full of landmines, and teams often do not begin with the knowledge to navigate these hazards successfully.</p><p>A second major problem with custom scripts is that database workloads can change over time, and this is often out of the hands of the responsible team. 
For example, a data science team might run workloads that interfere with production in ways the software engineering teams had not considered.</p><h3 id="use-pgpartman">Use pg_partman </h3><p><a href="https://www.timescale.com/learn/pg_partman-vs-hypertables-for-postgres-partitioning" rel="noreferrer"><code>pg_partman</code> provides a general toolkit for partition management which can mitigate the problem on some workloads.</a> <code>pg_partman</code> takes a time-out-and-retry approach to partition creation and removal, meaning that—depending on the configuration and how things are run—the functions will run in an environment where a lock time-out is set. This prevents a failed lock from leading to an outage, but there is no guarantee that it will be obtained before the partitions are required. </p><p>In most cases, you can tune these features to provide reasonable assurances that problems will usually be avoided. Workloads exist that prevent the partition management functions from successfully running in such an environment.</p><p><code>pg_partman</code> is a good tool and an important contribution to this topic, but at scale and under load, it will only work in cases where you have a real opportunity to get the locks required within the lock time-out. I have personally worked in environments where important services would have to be briefly disabled to allow this to happen.</p><h2 id="how-timescaledb-solves-the-problem-of-locking-in-postgresql-partitioning">How TimescaleDB Solves the Problem of Locking in PostgreSQL Partitioning </h2><p>Instead of using PostgreSQL native partitioning, you can <a href="https://docs.timescale.com/self-hosted/latest/install/" rel="noreferrer">install the TimescaleDB extension</a> and use <a href="https://www.tigerdata.com/blog/database-indexes-in-postgresql-and-timescale-cloud-your-questions-answered" rel="noreferrer">hypertables</a>, which are PostgreSQL tables that are automatically partitioned. 
This solves the problems caused by locking since hypertables minimize locks by design.  <br></p><div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text">For those using our managed PostgreSQL service, <a href="https://www.timescale.com/cloud" rel="noreferrer">Timescale Cloud</a>, <a href="https://docs.timescale.com/about/latest/changelog/#:~:text=%F0%9F%94%90%20Current%20Lock%20Contention" rel="noreferrer">you can see current lock contention in the results section of our SQL editor</a> if a query is waiting on locks and can't complete execution.</div></div><p><br></p><p>TimescaleDB automatically partitions hypertables into chunks, organized by various partitioning criteria, usually time. This implementation is independent of PostgreSQL’s partitioning strategies and has been optimized as an independent add-on to PostgreSQL rather than a part of PostgreSQL core. TimescaleDB does not use inheritance as a table partitioning structure either, nor does TimescaleDB rely on the relation cache mentioned above for determining which chunks to scan.</p><p>Within a TimescaleDB hypertable, chunks are added transparently as needed and removed asynchronously without intrusive locks on the parent table. TimescaleDB then uses various strategies to hook into the planner and execute TimescaleDB-specific approaches to partition selection and elimination. These strategies require locking the chunk table with intrusive locks but not locking the parent. <br><br>This approach is likely to lead to some potential problems in serializable transaction isolation levels because once the underlying partition is gone, it is gone. 
In the event that a serializable transaction starts and then chunks are dropped, this will result in serialization errors or isolation violations.<br></p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/01/Data-retention-management-Access-Share-Lock-and-Exclusive-Lock_img3.png" class="kg-image" alt="" loading="lazy" width="2000" height="738" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/01/Data-retention-management-Access-Share-Lock-and-Exclusive-Lock_img3.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/01/Data-retention-management-Access-Share-Lock-and-Exclusive-Lock_img3.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2023/01/Data-retention-management-Access-Share-Lock-and-Exclusive-Lock_img3.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/01/Data-retention-management-Access-Share-Lock-and-Exclusive-Lock_img3.png 2070w" sizes="(min-width: 720px) 720px"></figure><h3 id="lock-minimization">Lock minimization</h3><p>PostgreSQL has traditionally taken the view that concurrency is not extremely important for <a href="https://www.tigerdata.com/learn/guide-to-postgresql-database-operations" rel="noreferrer">database operations</a> while Data Definition Language (DDL) commands are run. Traditionally, this has been true. Even today, DDL commands are usually run infrequently enough that the database can take the performance hit of introducing DDL commands as synchronization points. </p><p>The emerging problems of heavy PostgreSQL users today are not usually performance problems but the fact that applications are often not written with an awareness of what these added synchronization points will mean. 
In my experience, these synchronization points themselves are a significant cause of database outages among large-scale PostgreSQL users.</p><p>Timescale has been built to avoid the sort of locking problems that currently exist with PostgreSQL’s declarative partitioning simply because this is a common problem in time-series workloads.</p><p>TimescaleDB maintains its own chunk catalogs and only locks the partitions that will be removed. The catalog entry is removed, then the chunk table is locked and dropped. Only an access share lock is taken on the top-level table. This means that reads and even writes can be done to other chunks without interfering with dropping or adding chunks.</p><p>TimescaleDB’s current approach has one limitation when used under serializable transactions. Currently, if you use serializable transactions, there are certain circumstances where a transaction could go to read dropped chunks and no longer see them, resulting in a violation of the serialization guarantees. This is only a problem under very specific circumstances, but in this case, TimescaleDB behaves differently than PostgreSQL’s concurrent DDL approaches. </p><p>In general, though, you should only drop chunks when you are reasonably sure they are not going to be accessed if you use serializable transaction isolation.</p><h3 id="why-is-postgresql-not-doing-this">Why is PostgreSQL not doing this? </h3><p>TimescaleDB’s solution cannot be perfectly replicated with stock PostgreSQL at the moment because dropping partitions requires active invalidation of cached data structures, which other concurrent queries might be using. </p><p>Offering some sort of lazy invalidation infrastructure (via message queues, etc.) 
would go a long way to making some of this less painful, as would allowing more fine-grained invalidations to caching.</p><h2 id="conclusion">Conclusion</h2><p>TimescaleDB’s approach to the problem of locking is the best solution today, better than the options available in stock PostgreSQL. But it's not yet perfect; it operates between the two options given in terms of concurrency capabilities. We cannot drop a chunk that a serializable transaction has read until that transaction concludes regardless.</p><p>Getting there is likely to require some changes to how PostgreSQL caches the table and view characteristics and how this cache invalidation works. However, such improvements would help us move toward more transactional DDL. </p><p>Many <code>ALTER TABLE</code> commands are limited in concurrency largely because of these caching considerations. I think the general success of our approach here is also evidence of a need to address these limitations generally.<br><br>In the meantime, if you're planning to partition your tables, check out Timescale.  If you're running your PostgreSQL database on your own hardware,&nbsp;<a href="https://docs.timescale.com/self-hosted/latest/install/?ref=timescale.com" rel="noreferrer">you can simply add the TimescaleDB extension</a>. </p><p>If you're running managed PostgreSQL, <a href="https://console.cloud.timescale.com/signup?ref=timescale.com" rel="noreferrer">try the Timescale platform for free.</a> Besides the advantages of a mature cloud platform, Timescale Cloud will warn you about lock contention via our UI. <a href="https://docs.timescale.com/about/latest/changelog/#:~:text=%F0%9F%94%90%20Current%20Lock%20Contention" rel="noreferrer">The Timescale Console displays the current lock contention</a> in the results section of our SQL editor if a query is waiting on locks and can't complete execution.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to Reduce Your PostgreSQL Database Size]]></title>
            <description><![CDATA[Shrinking the storage used by your PostgreSQL database will help keep your costs low and improve the performance of your large tables. ]]></description>
            <link>https://www.tigerdata.com/blog/how-to-reduce-your-postgresql-database-size</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/how-to-reduce-your-postgresql-database-size</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Dylan Paulus]]></dc:creator>
            <pubDate>Fri, 06 Oct 2023 18:52:39 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/10/How-to-Reduce-Your-PostgreSQL-Database-Size.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/10/How-to-Reduce-Your-PostgreSQL-Database-Size.png" alt="A colorful vacuum cleaner over a black background to vacuum your PostgreSQL database and reduce its size" /><p>Your phone buzzes in the middle of the night. You pick it up. A monitor went off at work—your PostgreSQL database is slowly but steadily reaching its maximum storage space. You are the engineer in charge. What should you do?</p><p>Okay, if it comes down to that situation, you should remedy it ASAP by adding more storage. But you’re going to need a better long-term strategy to optimize your PostgreSQL storage use, or you’ll keep paying more and more money.</p><p>Does your PostgreSQL database really need to be that large? Is there something you can do to optimize your storage use?</p><p>This article explores several strategies that will help you reduce your PostgreSQL database size considerably and sustainably.</p><h2 id="why-is-postgresql-storage-optimization-important">Why Is PostgreSQL Storage Optimization Important?</h2><p></p><p>Perhaps you’re thinking:</p><p>“Storage is cheap these days, and optimizing a PostgreSQL database takes time and effort. I’ll just keep adding more storage.”</p><p>Or perhaps:</p><p>“My PostgreSQL provider is actually usage-based (<a href="https://timescale.ghost.io/blog/savings-unlocked-why-we-switched-to-a-pay-for-what-you-store-database-storage-model/">like Timescale</a>), and I don’t have the problem of being locked into a large disk.”</p><p>Indeed, resigning yourself to simply using more storage is the most straightforward way to tackle an increasingly growing PostgreSQL database. Are you running servers on-prem? Slap another hard drive on that bad boy. Are you running PostgreSQL in RDS? Raise the storage limits. But this comes with problems.</p><p>The first and most obvious problem is the cost. 
For example, if you’re running PostgreSQL on an EC2 instance backed by EBS volumes in AWS or in <a href="https://timescale.ghost.io/blog/understanding-amazon-rds-cost/">RDS</a>, you’ll be charged on an allocation basis. This model assumes you’ll predetermine how much disk space you’ll need in the future and then pay for it, regardless of whether you end up using it or not, and without the chance of downscaling.</p><p>In other PostgreSQL providers, when you run out of storage space, you must upgrade and pay for the next available plan or storage tier, meaning you’ll see a considerably higher bill overnight.</p><p>In a way, these issues are mitigated by usage-based models. <a href="https://timescale.ghost.io/blog/savings-unlocked-why-we-switched-to-a-pay-for-what-you-store-database-storage-model/">Timescale</a> <a href="https://www.timescale.com/pricing">charges by the amount of storage you use</a>: you don't need to worry about allocating storage or managing storage plans, which really simplifies things, and the less storage you use, the less it costs. </p><p>Usage-based models are a great incentive to actually optimize your PostgreSQL database size as much as possible since you’ll see immediate reductions in your bill. But yes, this also works the opposite way: if you neglect your storage, your storage bill will go up.</p><p>The second problem with not optimizing your PostgreSQL storage usage is that it can lead to bad performance. Queries run slower and your I/O operations increase. 
This is something that often gets overlooked, <a href="https://www.timescale.com/learn/postgresql-performance-tuning-how-to-size-your-database">but maintaining PostgreSQL storage usage is paramount to keeping large PostgreSQL tables fast</a>.</p><p>This last point deserves a deeper dive into how data is actually stored in PostgreSQL and what is causing the problem, so let’s briefly cover some essential PostgreSQL storage concepts.</p><h2 id="essential-postgresql-storage-concepts%E2%80%8C%E2%80%8C%E2%80%8C%E2%80%8C">Essential PostgreSQL Storage Concepts</h2><p><br><strong>How does PostgreSQL store data?</strong></p><p>At a high level, there are two terms you need to understand: tuples and pages. </p><ul><li>A tuple is the physical representation of an entry in a table. You'll generally see the terms tuple and row used interchangeably. Each element in a tuple corresponds to a specific column in that table, containing the actual data value for that column.</li><li>A page is the unit of storage in PostgreSQL, typically 8&nbsp;kB in size, that holds one or more tuples. PostgreSQL reads and writes data in page units. </li></ul><p>Each page in PostgreSQL consists of a page header (which contains metadata about the page, such as page layout versions, page flags, and so on) and actual data (including tuples). 
There’s also a special area called the Line Pointer Array, which provides the offsets where each tuple begins.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/10/Screenshot-2023-10-06-at-11.29.48-AM.png" class="kg-image" alt="A simple representation of a PostgreSQL page containing metadata about the page and tuples stored in the page" loading="lazy" width="1858" height="944" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/10/Screenshot-2023-10-06-at-11.29.48-AM.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/10/Screenshot-2023-10-06-at-11.29.48-AM.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2023/10/Screenshot-2023-10-06-at-11.29.48-AM.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/10/Screenshot-2023-10-06-at-11.29.48-AM.png 1858w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">A simple representation of a PostgreSQL page containing metadata about the page and tuples stored in the page</em></i></figcaption></figure><h3 id="what-happens-when-querying-data">What happens when querying data? </h3><p>When querying data, PostgreSQL utilizes the metadata to quickly navigate to the relevant page and tuple. 
The PostgreSQL query planner examines the metadata to decide the optimal path for retrieving data, for example, estimating the cost of different query paths based on the metadata information about the tables, indexes, and data distribution.</p><h3 id="what-happens-when-we-insert-delete-update-a-row-in-postgresql">What happens when we INSERT/ DELETE/ UPDATE a row in PostgreSQL?</h3><p>When a new tuple is inserted into a PostgreSQL table, it gets added to a page with enough free space to accommodate the tuple. Each tuple within a page is identified and accessed using the offset provided in the Line Pointer Array.</p><p>If a tuple inserted is too big for the available space of a page, PostgreSQL doesn't split it between two 8kB pages. Instead, it employs TOAST to compress and/or break the large values into smaller pieces. These pieces are then stored in a separate TOAST table, while the original tuple retains a pointer to this external stored data. </p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/10/Screenshot-2023-10-06-at-11.31.16-AM-1.png" class="kg-image" alt="When we insert a tuple that's too large for a single page, a new page is created. 
The tuple could be fragmented between two pages" loading="lazy" width="1858" height="726" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/10/Screenshot-2023-10-06-at-11.31.16-AM-1.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/10/Screenshot-2023-10-06-at-11.31.16-AM-1.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2023/10/Screenshot-2023-10-06-at-11.31.16-AM-1.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/10/Screenshot-2023-10-06-at-11.31.16-AM-1.png 1858w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">When we insert a tuple that's too large for a single page, a new page is created.</em></i></figcaption></figure><h3 id="what-is-a-dead-tuple">What is a dead tuple? </h3><p>A key aspect to understand (and this will influence our PostgreSQL database size, as we’ll see shortly) is that when you delete data in PostgreSQL via <code>DELETE FROM</code>,  you’re not actually deleting it but marking the rows as unavailable. These unavailable rows are usually referred to as “dead tuples.”</p><p>When you run <code>UPDATE</code>, the row you’re updating will also be marked as a dead tuple. Then, PostgreSQL will insert a new tuple with the updated column. </p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/10/Screenshot-2023-10-06-at-11.36.00-AM.png" class="kg-image" alt="A page in a Postgres table with tuples that have been deleted or updated. 
The old instances are now dead tuples" loading="lazy" width="1858" height="842" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/10/Screenshot-2023-10-06-at-11.36.00-AM.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/10/Screenshot-2023-10-06-at-11.36.00-AM.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2023/10/Screenshot-2023-10-06-at-11.36.00-AM.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/10/Screenshot-2023-10-06-at-11.36.00-AM.png 1858w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">A page in a Postgres table with tuples that have been deleted or updated. The old instances are now dead tuples</em></i></figcaption></figure><p>You might be wondering <a href="https://www.tigerdata.com/blog/postgres-for-everything" rel="noreferrer">why PostgreSQL</a> does this. Dead tuples are a compromise that reduces locking on tables when multiple connections operate concurrently and that simplifies transactions. Imagine a transaction failing halfway through its execution; it is much easier to revert a change when the old data is still available than to rewind each action in an idempotent way. Furthermore, this mechanism supports the easy and efficient implementation of rollbacks, ensuring data consistency and integrity during transactions. </p><p>The trade-off, however, is the increased database size due to the accumulation of dead tuples, necessitating regular maintenance to reclaim space and maintain performance, which brings us to table bloat.</p><h3 id="what-is-table-bloat">What is table bloat?</h3><p>When a tuple is deleted or updated, its old instance is considered a dead tuple. 
The issue with dead tuples is that they’re effectively still a tuple on disk, taking up storage space—yes, that storage page that is costing you money every month. </p><p>Table bloat refers to this excess space that dead tuples occupy in your PostgreSQL database, which not only leads to an inflated table size but also to increased I/O and slower queries. Since PostgreSQL runs under the MVCC system, it doesn't immediately purge these dead tuples from the disk. Instead, they linger until a vacuum process reclaims their space.</p><p>Table bloat also occurs when a table contains unused pages, which can accumulate as a result of operations such as mass deletes.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/10/table-bloat.png" class="kg-image" alt="A visualization of table bloat in PostgreSQL. Pages contain many dead tuples and a lot of empty space" loading="lazy" width="1698" height="544" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/10/table-bloat.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/10/table-bloat.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2023/10/table-bloat.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/10/table-bloat.png 1698w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">A visualization of table bloat in PostgreSQL. 
Pages contain many dead tuples and a lot of empty space</em></i></figcaption></figure><h3 id="what-is-vacuum">What is <code>VACUUM</code>?</h3><p>Dead tuples are cleaned up when the <code>VACUUM</code> command runs: </p><pre><code class="language-SQL">VACUUM customers;</code></pre><p><a href="https://www.postgresql.org/docs/current/sql-vacuum.html">Vacuum</a> has a lot of roles, but the relevant point for this article is that vacuum removes dead tuples once no open transaction can still see them. By itself, <code>VACUUM</code> generally does not return pages to the operating system, though. Pages allocated to a table stay allocated, although the space inside them becomes reusable after running vacuum.</p><h3 id="what-is-autovacuum">What is autovacuum? </h3><p>Postgres conveniently includes a daemon to automatically run vacuum on tables that get heavy insert, update, and delete traffic. It operates in the background, monitoring the database to identify tables with accumulating dead tuples and then initiating the vacuum process autonomously. </p><p>Autovacuum comes enabled by default, although the default thresholds PostgreSQL uses to trigger autovacuum are fairly conservative. </p><h3 id="what-is-vacuum-full">What is VACUUM FULL?</h3><p>Autovacuum helps with dead tuples, but what about unused pages? </p><p>The <code>VACUUM FULL</code> command is a more aggressive version of <code>VACUUM</code> that locks the table, removes dead tuples and empty pages, and then returns the reclaimed space to the operating system. <code>VACUUM FULL</code> can be resource-intensive and requires an exclusive lock on the table during the process. 
We’ll come back to this later.</p><p>Now that you have the necessary context, let’s jump into the advice.</p><h2 id="how-to-reduce-your-postgresql-database-size">How To Reduce Your PostgreSQL Database Size</h2><h3 id="use-timescale-compression">Use Timescale compression </h3><p>There are different ways we can compress our data to consistently save storage space. <a href="https://www.postgresql.org/docs/current/storage-toast.html">PostgreSQL has some compression mechanisms</a>, but if you want to take data compression even further, especially for time-series data, you should use <a href="https://www.tigerdata.com/blog/building-columnar-compression-in-a-row-oriented-database" rel="noreferrer">Timescale’s columnar compression</a>.</p><p>It allows you to dramatically compress data through a provided <code>add_compression_policy()</code> function. To achieve high compression rates, <a href="https://timescale.ghost.io/blog/time-series-compression-algorithms-explained/">Timescale uses various compression techniques</a> depending on data types to reduce your data footprint. Timescale also uses column stores to merge many rows into a single row, saving space.</p><p>Let's illustrate how this works with an example.</p><p>Let’s say we have a <a href="https://www.tigerdata.com/blog/database-indexes-in-postgresql-and-timescale-cloud-your-questions-answered" rel="noreferrer">hypertable</a> with a week's worth of data. Imagine that our application generally only needs data from the last day, but we must keep historical data around for reporting purposes. We could run <code>SELECT add_compression_policy('my_table', INTERVAL '24 hours');</code> which automatically compresses data in the <code>my_table</code> hypertable once it is older than 24 hours. </p><p>Timescale’s compression would combine many rows into a single row, where each column holds an array of that column's values, in segments of up to 1,000 rows. 
Visually, this would take a table that looks like this:</p><pre><code class="language-SQL">| time                   | location | temperature |
|------------------------|----------|-------------|
| 2023-09-20 00:16:00.00 | garage   | 80          |
| 2023-09-21 00:10:00.00 | attic    | 92.3        |
| 2023-09-22 00:05:00.00 | basement | 73.9        |
</code></pre><p>And compress it down to a table like this:</p><pre><code class="language-SQL">| time                                                                     | location                    | temperature               |
|--------------------------------------------------------------------------|-----------------------------|---------------------------|
| [2023-09-20 00:16:00.00, 2023-09-21 00:10:00.00, 2023-09-22 00:05:00.00] | [garage, attic, basement]   | [80, 92.3, 73.9]          |
</code></pre><p>To see exactly how much space we can save, let's run compression on a table with 400 rows, 50 rows per day for the last seven days, that looks like this:</p><pre><code class="language-SQL">CREATE TABLE conditions (
  time        TIMESTAMPTZ       NOT NULL,
  location    TEXT              NOT NULL,
  temperature DOUBLE PRECISION  NULL
);

SELECT create_hypertable('conditions', 'time');
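
-- Assumption, not shown in the original example: compression usually has to be
-- enabled on the hypertable before a compression policy can be attached.
-- Segmenting by location is an illustrative choice, not a tuned recommendation.
ALTER TABLE conditions SET (
  timescaledb.compress,
  timescaledb.compress_segmentby = 'location'
);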
</code></pre><p>Next, with compression enabled on the hypertable, we'd add a compression policy to run compression on <code>conditions</code> for rows older than one day:</p><pre><code class="language-SQL">SELECT add_compression_policy('conditions', INTERVAL '1 day');</code></pre><p>In the Timescale platform, if we navigate to the <a href="https://docs.timescale.com/use-timescale/latest/services/service-explorer/">Explorer tab</a> under Services, we’d see our table shrink from 72&nbsp;kB to 16&nbsp;kB, a 78% savings!</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/10/Screenshot-2023-10-06-at-11.50.48-AM.png" class="kg-image" alt="The Timescale console showing a 78% space reduction in table size due to compression" loading="lazy" width="1858" height="390" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/10/Screenshot-2023-10-06-at-11.50.48-AM.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/10/Screenshot-2023-10-06-at-11.50.48-AM.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2023/10/Screenshot-2023-10-06-at-11.50.48-AM.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/10/Screenshot-2023-10-06-at-11.50.48-AM.png 1858w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">The Timescale console showing a 78% space reduction in table size due to compression</em></i></figcaption></figure><p>This is a simple example, but it shows the potential that Timescale compression has to reduce storage space. 
</p><h3 id="monitor-dead-tuples">Monitor dead tuples</h3><p>A great practice to ensure you’re using as little storage as possible is to consistently monitor the number of dead tuples in each table. This is the first step towards putting together an efficient PostgreSQL storage management strategy.</p><p>To see pages and tuples in action, you can use <a href="https://www.postgresql.org/docs/current/pgstattuple.html"><code>pgstattuple()</code></a>, an extension provided by the Postgres maintainers to gain insights into how our tables manage tuples:</p><pre><code class="language-sql">CREATE EXTENSION IF NOT EXISTS pgstattuple;</code></pre><p>If you run the following query, </p><pre><code class="language-sql">SELECT * FROM pgstattuple('my_table');</code></pre><p>Postgres would give you a table of helpful information in response:</p><pre><code class="language-sql"> table_len | tuple_count | tuple_len | tuple_percent | dead_tuple_count | dead_tuple_len | dead_tuple_percent | free_space | free_percent 
-----------+-------------+-----------+---------------+-----------------+----------------+--------------------+------------+--------------
  81920000 |      500000 |  40000000 |          48.8 |           10000 |        1000000 |                1.2 |     300000 |          0.4</code></pre><ul><li><code>table_len</code> tells you the physical size of the table in bytes, including live tuples, dead tuples, and free space. It covers only the relation you pass in; indexes and TOAST tables are separate relations and are not counted.</li><li><code>dead_tuple_len</code> tells how much space is being occupied by dead tuples which can be reclaimed by vacuuming.</li><li><code>free_space</code> indicates the unused space within the allocated pages of the table. Take note that <code>free_space</code> will reset for every new page created.</li></ul><p>You can also perform calculations or transformations on the result to make the information more understandable. For example, this query calculates the ratios of dead tuples and free space to the total table length, giving you a clearer perspective on the storage efficiency of your table:</p><pre><code class="language-sql">SELECT
(dead_tuple_len * 100.0 / table_len) as dead_tuple_ratio,
(free_space * 100.0 / table_len) as free_space_ratio
FROM
pgstattuple('my_table');</code></pre><h3 id="run-autovacuum-more-frequently">Run autovacuum more frequently</h3><p>If your table is experiencing table bloat, having autovacuum run more frequently may help you free up wasted storage space.</p><p>The default thresholds and values for autovacuum are in <code>postgresql.conf</code>. Updating <code>postgresql.conf</code> will change the autovacuum behavior for the whole Postgres instance. However, this practice is generally not recommended, since some tables accumulate dead tuples much faster than others.</p><p>Instead, you should update autovacuum's settings per table. For example, consider the following query:</p><pre><code class="language-SQL">ALTER TABLE my_table SET (autovacuum_vacuum_scale_factor = 0, autovacuum_vacuum_threshold = 200);</code></pre><p>This will update <code>my_table</code> to have autovacuum run after 200 tuples have been updated or deleted. </p><p>More information about additional autovacuum settings is in the <a href="https://www.postgresql.org/docs/current/runtime-config-autovacuum.html">PostgreSQL documentation</a>. Each database and table will require different settings for how often autovacuum should run, but running vacuum often is a great way to reduce storage space.</p><p>Also, keep an eye on long-running transactions, which can prevent autovacuum from removing dead tuples. You can use PostgreSQL’s <code>pg_stat_activity</code> view to identify such transactions, canceling them if necessary to allow autovacuum to complete its operations efficiently:</p><pre><code class="language-sql">SELECT pid, NOW() - xact_start AS duration, query, state
FROM pg_stat_activity
WHERE (NOW() - xact_start) &gt; INTERVAL '5 minutes';

-- To cancel one of them, replace 12345 (a placeholder) with a pid returned by the query above
SELECT pg_cancel_backend(12345);</code></pre><p>You could also inspect long-running vacuum processes and adjust the <code>autovacuum_work_mem</code> parameter to increase the memory allocation for each autovacuum invocation, <a href="https://www.timescale.com/learn/postgresql-performance-tuning-key-parameters">as we discussed in our article about PostgreSQL fine-tuning</a>.</p><h3 id="reclaim-unused-pages">Reclaim unused pages</h3><p>Autovacuum and vacuum will free up dead tuples, but you’ll need the big guns to clean up unused pages. </p><p>As we saw previously, running <code>VACUUM FULL my_table</code> will reclaim pages, but it has a significant problem: it exclusively locks the entire table. A table running <code>VACUUM FULL</code> cannot be read or written to while the vacuum holds the lock, and the operation can take a long time to finish. This is usually an instant no-go for any production database.</p><p>The PostgreSQL community has a solution, <a href="https://github.com/reorg/pg_repack">pg_repack</a>. <code>pg_repack</code> is an extension that will clean up unused pages and bloat from a table by cloning a given table, swapping the original table with the new table, and then deleting the old table. All these operations are done with minimal exclusive locks, leading to less downtime. </p><p>At the end of the <code>pg_repack</code> process, the pages associated with the original table are deleted from storage, and the new table has only the minimum number of pages needed to store its rows, thus eliminating the table bloat.</p><h3 id="find-unused-indexes">Find unused indexes </h3><p>As we mention <a href="https://www.timescale.com/learn/postgresql-performance-tuning-optimizing-database-indexes">in this article on indexing design</a>, over-indexing is a frequent issue in many large PostgreSQL databases. Indexes consume disk space, so removing unused or underutilized indexes will help you keep your PostgreSQL database lean. 
</p><p>You can use <code>pg_stat_user_indexes</code> to spot opportunities: </p><pre><code class="language-sql">SELECT
relname AS table_name,
indexrelname AS index_name,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
idx_scan AS index_scan_count
FROM
pg_stat_user_indexes
WHERE
idx_scan &lt; 50 -- Choose a threshold that makes sense for your application.
ORDER BY
index_scan_count ASC,
pg_relation_size(indexrelid) DESC;</code></pre><p>(This query looks for indexes with fewer than 50 scans, but this is an arbitrary number. You should adjust it based on your own usage patterns.)</p><h3 id="arrange-columns-by-data-type-from-largest-to-smallest">Arrange columns by data type (from largest to smallest)</h3><p>In PostgreSQL, storage efficiency is significantly influenced by the ordering of columns, which is closely related to alignment padding determined by the size of the column types. Each data type is aligned at memory addresses that are multiples of its size. </p><p>This alignment is systematic, ensuring that data retrieval is efficient and that the architecture adheres to specific memory and storage management protocols. But this can also lead to unused space, as the alignment necessitates padding to meet the address multiple criteria.</p><p>The way to fix this is to strategically order your columns from the largest to the smallest data type in your table definitions. This practical tip will help you minimize wasted space. <a href="https://www.timescale.com/learn/postgresql-performance-tuning-designing-and-implementing-database-schema">Check out this article for a more in-depth explanation</a>.</p><h3 id="delete-old-data-regularly">Delete old data regularly </h3><p>You should always ask yourself: how long should I keep data around? Setting up data retention policies is essential for managing storage appropriately. Your users may not need data older than a year. Deleting old, unused records and indexes regularly is an easy win to reduce your database size.  </p><p>Timescale can automatically delete old data for you using <a href="https://docs.timescale.com/use-timescale/latest/data-retention/about-data-retention/">retention policies</a>. Timescale’s hypertables are <a href="https://timescale.ghost.io/blog/when-to-consider-postgres-partitioning/">automatically partitioned by time</a>, which helps a lot with data retention.  
Retention policies automatically delete partitions (which are called chunks in Timescale) once the data contained in a partition is older than a given interval. </p><p>You can <a href="https://docs.timescale.com/use-timescale/latest/data-retention/create-a-retention-policy/">create a retention policy</a> by running:</p><pre><code class="language-SQL">SELECT add_retention_policy('my_table', INTERVAL '24 hours');</code></pre><p><br>In this snippet, Timescale would delete chunks older than 24 hours from <code>my_table</code>.</p><h2 id="wrap-up">Wrap-Up</h2><p>We examined how table bloat and dead tuples can contribute to wasted storage space, which affects not only your budget but also the performance of your large PostgreSQL tables. </p><p>To keep your PostgreSQL database as small as possible, make sure to enable Timescale compression, use data retention policies, and set up a maintenance routine that periodically deletes your dead tuples and reclaims your unused pages.  </p><p>All these techniques together provide a holistic approach to maintaining a healthy PostgreSQL database and keeping your PostgreSQL database costs low.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Advice on PgBouncer From a Support Engineer]]></title>
            <description><![CDATA[PgBouncer is a great tool to scale your client connections, reduce compute overhead, and improve database performance. Here’s some advice on how to configure it correctly.
]]></description>
            <link>https://www.tigerdata.com/blog/using-pgbouncer-to-improve-your-postgresql-database-performance</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/using-pgbouncer-to-improve-your-postgresql-database-performance</guid>
            <category><![CDATA[Connection Pooling]]></category>
            <category><![CDATA[PgBouncer]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[PostgreSQL Performance]]></category>
            <dc:creator><![CDATA[Brian Muckian]]></dc:creator>
            <pubDate>Thu, 14 Sep 2023 13:26:12 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/09/pg-bouncer-timescale.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/09/pg-bouncer-timescale.png" alt="Advice on PgBouncer From a Support Engineer" /><p>Are you looking for an easy way to increase the number of client connections you can establish to your Timescale database? Would you like to reduce the compute (CPU/RAM) overhead of managing (starting-stopping-maintaining) all of those connections? </p><p>If you’re nodding your head as you read these questions, you may want to consider enabling connection pooling via <a href="https://www.pgbouncer.org/" rel="noreferrer">PgBouncer</a> for your service—or perhaps you’re using it already. As a support engineer at Timescale, I assist our customers daily with their database configuration. <a href="https://timescale.ghost.io/blog/connection-pooling-on-timescale-or-why-pgbouncer-rocks/" rel="noreferrer">This includes connection pooling, specifically PgBouncer, which is the base of our own connection pooler</a>. So I’ve learned a thing or two about it!</p>
<!--kg-card-begin: html-->
<iframe src="https://giphy.com/embed/KffdTQfewxdbKTGEJY" width="480" height="354" frameBorder="0" class="giphy-embed" allowFullScreen></iframe><p><a href="https://giphy.com/gifs/yes-jum-charlottekhm-KffdTQfewxdbKTGEJY">via GIPHY</a></p>
<!--kg-card-end: html-->
<p><br>In this blog post, I’ll lay out some tips on implementing PgBouncer while avoiding some common mistakes we’ve seen among our customers.</p><div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-emoji">🎉</div><div class="kg-callout-text"><a href="https://www.timescale.com/blog/boosting-postgres-performance-with-prepared-statements-and-pgbouncers-transaction-mode/" rel="noreferrer">You can now boost your application's (and Postgres') performance with prepared statements and PgBouncer's transaction mode</a>.</div></div><h2 id="quick-introduction-pgbouncer-in-timescale">Quick Introduction: PgBouncer in Timescale</h2><p>At Timescale, we built our implementation of connection pooling on <a href="https://www.pgbouncer.org/">PgBouncer</a>, a self-described “lightweight connection pooler for PostgreSQL,” since it is widely used in our community and performs its function well.</p><p>To start using PgBouncer in Timescale, <a href="https://docs.timescale.com/use-timescale/latest/services/connection-pooling/#add-a-connection-pooler">you first need to enable the feature in the web console</a>. Once you’ve done this, you'll be provided an additional service URI with a specified port (for PgBouncer) and will be able to connect to either the "session pool" (named "tsdb") or the "transaction pool" (called "tsdb_transaction"). You will also still be able to make direct connections (via the original service URI and port) to your database server when or if you need to.</p><p>PgBouncer maintains up to a specified number of dedicated "server" connections to PostgreSQL for each pool, while a fluctuating number of clients come and go, each getting a connection from the pool, running its queries, and gathering results. 
</p><p>When the client disconnects from the connection pool, instead of terminating and deallocating the server connection (which can be very expensive in terms of compute utilization when it happens frequently), it is returned to the pool to be used by another client. </p><p>On a sufficiently busy system, you can see a substantial benefit from PgBouncer performing this work, along with helping to manage traffic to avoid problems on your database server and free up its resources for query and storage operations.</p><p>You'll likely see the greatest improvement (reduction) in resource utilization if you are able to take full advantage of the transaction pool (tsdb_transaction). In transaction mode, server connections are added back to the pool after each transaction is completed or rolled back. They become available almost immediately to another client that may be waiting, which allows you to serve a very large number of clients with a relatively small number of dedicated server connections. 💪</p><h2 id="practical-tips-for-pgbouncer-success">Practical Tips for PgBouncer Success </h2><p>Now, let’s get into the advice. PgBouncer is an awesome tool that will give you so many benefits, but like all powerful things, it must be used with great responsibility. There are some common implementation mistakes that we see people making often—you can avoid them by following these tips:<br></p><h3 id="always-monitor-your-postgres-connection-pool">Always monitor your Postgres connection pool</h3><p>First things first: when setting up a connection pool, it is highly recommended to monitor its behavior. </p><p>To do so, you can access the PgBouncer administrative console and just specify "pgbouncer" as the pool (or database) name, like this:<br></p><pre><code class="language-SQL">psql postgres://tsdbadmin@HOST:PORT/pgbouncer?sslmode=require
</code></pre>
<p>Then, you can run many of the (read-only) <code>SHOW</code> commands:</p><pre><code class="language-SQL">pgbouncer=# show help;
NOTICE:  Console usage
DETAIL:
        SHOW HELP|CONFIG|DATABASES|POOLS|CLIENTS|SERVERS|USERS|VERSION
        SHOW PEERS|PEER_POOLS
        SHOW FDS|SOCKETS|ACTIVE_SOCKETS|LISTS|MEM|STATE
        SHOW DNS_HOSTS|DNS_ZONES
        SHOW STATS|STATS_TOTALS|STATS_AVERAGES|TOTALS
</code></pre>
<p>It can be helpful to run some of these commands interactively to get a better understanding of how the pools behave under current conditions. For example, you can see how many clients and servers are active, waiting, or being canceled with the <code>SHOW POOLS</code> command:</p><pre><code class="language-SQL">pgbouncer&gt; show pools;

     database     |   user    | cl_active | cl_waiting |
------------------+-----------+-----------+------------+
 pgbouncer        | pgbouncer |         1 |          0 |
 postgres         | postgres  |         0 |          0 |
 tsdb             | tsdbadmin |         2 |          0 |
 tsdb_transaction | tsdbadmin |        11 |          0 |
</code></pre>
<p>Finally, here is a simple bash script you can use to periodically (default every 90 seconds) collect CSV results of those <code>SHOW</code> commands up to a specified (default 20 minutes) time. This is a rudimentary—but effective—tool to observe some of those statistics change over a period of time.</p><pre><code>#!/usr/bin/env bash
export PGBOUNCER_URL="${PGBOUNCER_URL:-empty}"
export EXEC_INTERVAL="${EXEC_INTERVAL:-90}"
export END_AFTER_MIN="${END_AFTER_MIN:-20}"

echo "gathering pgbouncer metrics"
echo "pgbouncer url: $PGBOUNCER_URL"
echo "run every $EXEC_INTERVAL seconds"
echo "end after $END_AFTER_MIN minutes"

declare -a arr=(databases pools clients servers users sockets active_sockets lists dns_hosts dns_zones stats stats_totals stats_averages totals)

if [ "$PGBOUNCER_URL" != "empty" ]; then
	# Snapshot the configuration once, tagging each row with the capture time.
	psql --csv "$PGBOUNCER_URL" -c "show config;" | sed "s/$/,dt$(date +%Y%m%d%H%M%S)/" &gt; ./pb_config.csv

	# Write a header row (plus a timestamp column) to one CSV file per SHOW command.
	for key in "${arr[@]}"; do
		psql --csv "$PGBOUNCER_URL" -c "show ${key}" | sed "s/$/,at_date_time/" | head -n1 &gt; "./pb_${key}.csv"
	done

	# The marker file's age controls how long the collection loop keeps running.
	touch .get_pb_metrics
	while [ -f ".get_pb_metrics" ]; do
		for key in "${arr[@]}"; do
			psql --csv -t "$PGBOUNCER_URL" -c "show ${key}" | sed "s/$/,$(date +%Y%m%d%H%M%S)/" &gt;&gt; "./pb_${key}.csv"
		done
		sleep "$EXEC_INTERVAL"
		find .get_pb_metrics -mmin +"$END_AFTER_MIN" -type f -exec rm -fv {} \;
	done

	tar czvf ./pb_metrics.tar.gz ./pb*.csv
	rm ./pb*.csv
else
	echo "you must set PGBOUNCER_URL"
fi

</code></pre>
<p>You can save that to a file (make it executable), set a proper <code>PGBOUNCER_URL</code> environment variable, run it, and watch the tail of your favorite CSV. When it's done, collect them into tables for query if you like.<br></p><p>I know, I know, this all sounds awesome (because it is). But before you change all of your connection strings, keep reading for more best practices to ensure you get the absolute most out of this feature.</p><h3 id="avoid-session-based-features-when-using-the-transaction-pool">Avoid session-based features when using the transaction pool</h3><p>One of the mistakes we often see people making with PgBouncer is using session-based features (such as prepared statements, temporary tables, and SET commands) that can fail or produce unexpected results for customers using the transaction pool.</p><p>The <a href="https://www.pgbouncer.org/faq.html#how-to-use-prepared-statements-with-session-pooling">PgBouncer FAQ briefly discusses this</a> and then links to how to disable this feature in JDBC and PHP/PDO. <strong>Tip:</strong> You may have to configure your framework/library to prevent the use of prepared statements.</p><p>You will find that some tools, frameworks, and some "thick" client software may expect or require a connection to the database in "session" mode. For example, if you point <a href="https://github.com/dalibo/pg_activity/tree/master#readme">pg_activity</a> to your transaction pool, it may run for a short while, but you'll probably see errors similar to this:<br></p><pre><code>prepared statement "_pg3_9" does not exist
</code></pre>
<p>Similarly, <a href="https://github.com/dbcli/pgcli#a-repl-for-postgres">pgcli</a> seems to work fine for most things, but if you use the <code>\watch</code> command, you may see something such as:</p><pre><code>prepared statement "_pg3_0" already exists
</code></pre>
<p>For another example to demonstrate why you should avoid session-based features while using the transaction pool, connect psql and do this:</p><pre><code>home:/&gt; psql postgres://tsdbadmin@HOST:PORT/tsdb_transaction?sslmode=require

tsdb_transaction&gt; show search_path;
+-----------------+
| search_path     |
|-----------------|
| "$user", public |
+-----------------+

tsdb_transaction&gt; set search_path = "$user", custom, public;

tsdb_transaction&gt; show search_path;
+-------------------------+
| search_path             |
|-------------------------|
| "$user", custom, public |
+-------------------------+

tsdb_transaction&gt; select * from temp_table;
relation "temp_table" does not exist
LINE 1: select * from temp_table
                      ^

tsdb_transaction&gt; create temporary table temp_table as select 'foo' as bar;

tsdb_transaction&gt; select * from temp_table;
+-----+
| bar |
|-----|
| foo |
+-----+

tsdb_transaction&gt; \q
Goodbye!


home:/&gt; psql postgres://tsdbadmin@HOST:PORT/tsdb_transaction?sslmode=require
...

tsdb_transaction&gt; select * from temp_table;
+-----+
| bar |
|-----|
| foo |
+-----+

tsdb_transaction&gt; show search_path;
+-------------------------+
| search_path             |
|-------------------------|
| "$user", custom, public |
+-------------------------+
</code></pre>
<p>Notice how, in this example, after disconnecting and reconnecting (terminating the "client" and creating a new "client" connection) to the transaction pool, the previously created temporary table and setting are still visible to the new connection, because it was handed the same "server" connection that had been added back to the pool. <br></p><h3 id="take-max-advantage-of-the-transaction-pool-mode">Take max advantage of the transaction pool mode</h3><p>Our experience suggests that most users get the greatest benefit from connection pooling when they can take full advantage of the transaction mode pool. If your use case does not require session-based features, and if most of your transactions are very short, we recommend using the transaction pool.  </p><p>Large volumes of short-lived clients, such as event systems, IoT and sensor networks, and microservices architectures, can all benefit from transaction pools. Meanwhile, “thick clients” (like the long-running sessions in your robust BI suite) may be better served by the session pool. </p><p><strong>Our general guidelines would be:</strong> If you do not need and can avoid using any session-based features, use the transaction pool. If you do need those features, use the session pool. 
These two pools should meet most of your needs, but if you do encounter something that prevents you from using either available pool, you can still connect directly to your service.</p><p>As always, <a href="https://timescale.ghost.io/blog/how-were-raising-the-bar-on-hosted-database-support/">we’re happy to discuss this further</a>, but we hope these simple tips will save you some precious time.</p><h2 id="wrap-up">Wrap-Up</h2><p>With connection pooling via PgBouncer, we aim to provide a simple way for Timescale users to increase the number of <a href="https://www.tigerdata.com/learn/guide-to-postgresql-database-operations" rel="noreferrer">database operations</a>, manage system resources more efficiently, and improve database reliability. <br></p><p>We expect to keep learning more about the optimal implementation of PgBouncer as we move forward, and we’ll make sure to share what we learn with the community. In the meantime, please let us know in our community <a href="https://slack.timescale.com/">Slack</a>, <a href="https://forum.timescale.com/">Forum</a>, or reach out for <a href="https://www.timescale.com/support">support</a> if you have any feedback or run into any obstacles at all. Also, if you have further advice on how to avoid PgBouncer pitfalls, we’d love to hear it! <br></p><p>Excelsior!</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Pg_partman vs. Hypertables for Postgres Partitioning]]></title>
            <description><![CDATA[Are you trying to streamline your data partitioning? Check out this head-to-head comparison on pg_partman and Timescale’s hypertables.  
]]></description>
            <link>https://www.tigerdata.com/blog/pg_partman-vs-hypertables-for-postgres-partitioning</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/pg_partman-vs-hypertables-for-postgres-partitioning</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Engineering]]></category>
            <category><![CDATA[Benchmarks & Comparisons]]></category>
            <dc:creator><![CDATA[James Blackwood-Sewell]]></dc:creator>
            <pubDate>Wed, 13 Sep 2023 14:26:04 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/09/pg_partman-vs-hypertables-for-postgres-partitioning.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/09/pg_partman-vs-hypertables-for-postgres-partitioning.png" alt="The Postgres elephant vs. the Timescale tiger— representing Pg_partman vs. Hypertables for Postgres Partitioning" /><p>You all know the feeling: You’ve got one big table in your database, and it’s getting slower and slower. Your app gets bottlenecked; user experience takes a dive. These aren’t happy times.<br></p><p>When you have data streaming into PostgreSQL constantly, sooner or later you end up with these big, slow tables. Luckily, the PostgreSQL ecosystem offers a range of <a href="https://www.timescale.com/learn/database-partitioning-what-it-is-and-why-it-matters">partitioning</a> techniques to optimize the performance and maintenance of these datasets. Among these partitioning methodologies, there are two that stand out as the most popular: Timescale's <a href="https://docs.timescale.com/use-timescale/latest/hypertables/">hypertables</a> (optimized for time-based/range partitioning) and the <a href="https://github.com/pgpartman/pg_partman">pg_partman</a> extension. <br></p><p>While both approaches aim to simplify partitioning, this article explores why we believe Timescale's hypertables present compelling advantages over pg_partman. <br></p><p>📑 <a href="https://timescale.ghost.io/blog/when-to-consider-postgres-partitioning/">Check out our previous article on when to consider partitioning— if you haven’t already.</a></p><h2 id="partitioning-strategies-in-postgresql-quick-overview">Partitioning Strategies in PostgreSQL: Quick Overview</h2><p>A partitioned table comprises many non-overlapping partitions, each covering a part of your dataset. 
When you select data from a partitioned table using a <code>WHERE</code> clause with a time-based restriction, PostgreSQL is able to immediately discard all the partitions that aren’t relevant before it plans the query.<br></p><p>Because we aren’t searching through all the data, we spend less time doing I/O, and the query is faster. If the total table size or (even worse) the total index size of the unpartitioned table exceeds the amount of memory Postgres uses for cache, then the difference becomes even more significant.<br></p><p>As we introduced in our <a href="https://timescale.ghost.io/blog/when-to-consider-postgres-partitioning/">previous article on partitioning</a>, you can follow different strategies and techniques to partition your PostgreSQL tables. In terms of the types of partitioning, you could choose between:</p><ul><li><strong>Range partitioning</strong>: partitions are defined by a range of values (e.g., by month, year, or an incrementing sequence).</li><li><strong>List partitioning</strong>: partitions are defined by a list of values (e.g., by country).</li><li><strong>Hash partitioning</strong>: rows are partitioned based on the hash value of the partition key to distribute data across a fixed number of partitions evenly. <br></li></ul><p>Depending on which partitioning strategy you’re using, you can choose between different methodologies, the most common being the following: </p><ul><li>Using the <code>PARTITION BY</code> clause native in PostgreSQL. This supports the three types of partitioning (e.g., <code>PARTITION BY RANGE</code>, <code>BY LIST</code>, or <code>BY HASH</code>).</li><li>Using pg_partman, an extension that automates time-based partitioning in PostgreSQL.</li><li>Using Timescale, which goes one step further than pg_partman to automate partitioning by time via the concept of hypertables. 
<br></li></ul><p>Here, we’ll focus particularly on range partitioning (by far the most common), comparing the last two methods: pg_partman and hypertables.</p><h2 id="pgpartman-making-postgresql-partitioning-simpler">Pg_partman: Making PostgreSQL Partitioning Simpler </h2><p>The <a href="https://github.com/pgpartman/pg_partman">pg_partman</a> extension for PostgreSQL is built on the native PostgreSQL declarative approach to partitioning tables. Declarative partitioning, introduced in PostgreSQL 10, has replaced the older method of table inheritance, introducing a more intuitive and simpler approach by providing built-in support for partitioning without triggers or rules. <br></p><p>With declarative partitioning, much of the partitioning management is automated, but some tasks, such as creating new partitions, still require manual intervention unless you're using a tool like pg_partman. Pg_partman helps to automate the creation and management of partitioned tables and partitions through a SQL API. Although new partitions aren’t added and removed automatically, this can be managed by adding another extension like <a href="https://github.com/citusdata/pg_cron">pg_cron</a> to schedule jobs.<br></p><p>Without pg_partman, declarative partitioning is a lot more complicated. Pg_partman intends to simplify this process, and indeed, it does, but there are still important tasks and nuances that will require manual intervention. 
A few examples:</p><ul><li>It’s essential to ensure that the necessary partitions have been created when ingesting data to avoid a <a href="https://timescale.ghost.io/blog/how-to-fix-no-partition-of-relation-found-for-row/">No Partition of Relation Found for Row</a> error, which may block your writes.</li><li>If your workload involves sporadic or irregular data ingestions, you’ll need to ensure you aren't creating excessive, unnecessary partitions, as they could degrade query performance and lead to table bloat.</li><li>You must ensure that there are no gaps or overlaps between partitions, especially when dealing with manual partition modifications.</li><li>If you want to implement a retention policy to regularly drop old partitions, you'll need to set this up.</li><li>If you need to alter the schema of your tables, such as adding or dropping columns, you'll often have to handle these changes manually to ensure they propagate correctly to all partitions.</li></ul><h2 id="hypertables-making-postgresql-partitioning-seamless">Hypertables: Making PostgreSQL Partitioning Seamless </h2><p>If pg_partman simplifies partition management, hypertables take this simplification to the next level: they completely automate the process. <strong>If pg_partman is the general toolkit, hypertables are the product.</strong><br></p><p><a href="https://docs.timescale.com/use-timescale/latest/hypertables/about-hypertables/">Hypertables</a> are an abstraction layer that allows you to create and manage partitions (which in Timescale are called chunks) automatically without losing the ability to query as normal with SQL. Hypertables are optimized for time-based partitioning, although they also work for tables that aren’t based on time but have something similar, for example, a BIGINT primary key. <br></p><p>Hypertables are based on inheritance-based partitioning (which you’ll recall was the older method PostgreSQL used). 
While this method is harder to implement manually, it’s also more flexible, giving more granular control over the partitions. This is definitely not something that you (as an end user of partitioning) want to set up and manage, but this flexibility allows us (Timescale) to introduce some improvements over native PostgreSQL partitioning that you can directly benefit from. </p><p>What are these improvements? Let’s cover them.</p><h3 id="dynamic-partition-management-forget-about-the-%E2%80%9Cno-partition-of-relation-found-for-row%E2%80%9D-error">Dynamic partition management: Forget about the “no partition of relation found for row” error    <br></h3><p>A normal table is transformed into a Timescale hypertable using a single command (<code>create_hypertable</code>): <br></p><pre><code>CREATE TABLE conditions (
time        TIMESTAMPTZ       NOT NULL,
location    TEXT              NOT NULL,
device      TEXT              NOT NULL,
temperature DOUBLE PRECISION  NULL,
humidity    DOUBLE PRECISION  NULL
);
SELECT create_hypertable('conditions', 'time');
</code></pre>
<p>This sets up the partition column, the partition interval (seven days by default), and an index on the time column to support partitioning. Once the hypertable is created, new partitions (chunks) will be created on the fly as data flows into the hypertable. <br></p><p>As we said earlier, pg_partman can automate much of the partition creation process, but to routinely schedule this automation, you will need to integrate it with pg_cron—and you’ll have to ensure the necessary partitions are in place proactively. Without a predefined partition to host incoming data, you'll encounter the <a href="https://timescale.ghost.io/blog/how-to-fix-no-partition-of-relation-found-for-row/">No Partition of Relation Found for Row</a> error. (This is a common one.)  <br></p><p>Using Timescale eliminates the risk of partitions not existing, completely removing partition management from the list of things the database owner needs to consider. You get exactly the right number of partitions when you need them.<br></p><p>Another hidden gem of hypertables is that they’ll never create an unnecessary partition. Partitions are generated on the fly, meaning if there's no data to fit a potential partition, that partition simply won't be created. This is a good thing since each active partition adds a slight overhead during query planning.</p><h3 id="reduced-table-locking-no-need-to-worry-about-data-integrity">Reduced table locking: No need to worry about data integrity</h3><p><a href="https://timescale.ghost.io/blog/how-timescaledb-solves-common-postgresql-problems-in-database-operations-with-data-retention-management/">As we covered extensively in this post,</a> DDL operations in PostgreSQL, such as adding a new partition, inherently require locks on the table. 
This means that during the brief period the operation is being performed, other transactions trying to write (insert, update, delete) to the table can be blocked until the operation completes.</p><p>In PostgreSQL there are two methods of adding partitions: the <code>CREATE TABLE</code> statement and the <code>ALTER TABLE</code> statement. The first will block writes, while the second will not. The same two methods can be used to drop partitions, although in this case both will block writes.</p><p>When pg_partman creates these partitions as part of its maintenance job, it performs DDL operations on the table. These operations take the same locks, which can completely block writes. Other problems may also arise: the waiting time for transactions can increase, leading to unpredictably longer response times, and in systems where operations have a strict timeout, the waiting caused by locks can lead to operation failures.</p><p>Hypertables are designed to ensure that your application’s read or write operations are not interrupted. Timescale maintains its own partition catalogs and implements its own minimized locking strategy that allows reads and writes without interfering with adding or dropping partitions.</p><h3 id="easily-configurable-data-retention">Easily configurable data retention</h3><p>One of the amazing things about partitioning your data is that you can drop individual partitions instantly, which isn’t the case when writing large <code>DELETE</code> statements. </p><p>When using pg_partman, you would need to create the custom logic for removing old partitions yourself, and removing a partition will lock the master table. Also, you would need to schedule this with pg_cron or an external scheduler.<br></p><p>By contrast, setting up automatic data retention policies for hypertables is straightforward: you don’t need further code or to manage more extensions. 
It only takes one command, <a href="https://docs.timescale.com/use-timescale/latest/data-retention/create-a-retention-policy/"><code>add_retention_policy</code></a>. You can define retention periods for specific time intervals, and Timescale will automatically drop outdated partitions when it needs to:<br></p><pre><code>SELECT add_retention_policy('conditions', INTERVAL '24 hours');
</code></pre>
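<p>To check or undo what the policy does, a sketch like the following may help. It assumes the same <code>conditions</code> hypertable as above and a TimescaleDB 2.x release; the columns selected from the <code>timescaledb_information.jobs</code> view are an assumption and may differ between versions.</p><pre><code class="language-SQL">-- Preview which chunks are already older than the retention window
SELECT show_chunks('conditions', older_than =&gt; INTERVAL '24 hours');

-- Inspect the background job that add_retention_policy registered
SELECT job_id, schedule_interval, config
FROM timescaledb_information.jobs
WHERE proc_name = 'policy_retention'
  AND hypertable_name = 'conditions';

-- Remove the policy again; chunks that still exist are left untouched
SELECT remove_retention_policy('conditions');
</code></pre><p>Running the <code>show_chunks</code> query before enabling a policy is a cheap way to confirm that the interval matches the data you expect to lose.</p>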
<h3 id="query-performance-optimizations">Query performance optimizations</h3><p>Hypertables also unlock some extra features that Timescale enables for your query plans. For example, queries that reference <code>now()</code> when pruning partitions will perform better due to <a href="https://timescale.ghost.io/blog/how-we-fixed-long-running-postgresql-now-queries/"><code>now()</code></a><a href="https://timescale.ghost.io/blog/how-we-fixed-long-running-postgresql-now-queries/"> being turned into a constant,</a> and your ordered <code>DISTINCT</code> queries will benefit from <a href="https://timescale.ghost.io/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql/">SkipScan</a>.</p><h2 id="going-beyond-partitioning">Going Beyond Partitioning  </h2><p>It's worth noting that while pg_partman is more of a general-purpose partition manager for PostgreSQL, hypertables unlock a wealth of features specifically tailored for time-based (or time series) data that can get very handy for scaling your large PostgreSQL tables: </p><ul><li><a href="https://timescale.ghost.io/blog/building-columnar-compression-in-a-row-oriented-database/">Timescale compression</a> takes a hypertable and changes it from row to column-oriented. This can reduce storage utilization by up to 95 %, unlock blazing-fast analytical queries, <a href="https://timescale.ghost.io/blog/compressing-immutable-data-changing-time-series-management/">and still allow the data to be updated in place.</a></li><li><a href="https://timescale.ghost.io/blog/how-we-made-data-aggregation-better-and-faster-on-postgresql-with-timescaledb-2-7/">Continuous aggregates</a> take hypertables and let you create incrementally updated materialized views for aggregate queries. 
You define your query and get an aggregate table that is updated as historical data changes while also keeping up with your real-time data as it flows in.</li><li><a href="https://docs.timescale.com/api/latest/hyperfunctions/">Hyperfunctions</a> give you a blazing-fast full set of functions, procedures, and data types optimized for querying, aggregating, and <a href="https://timescale.ghost.io/blog/time-series-analysis/">analyzing time-series data</a>.</li><li>The <a href="https://timescale.ghost.io/blog/the-postgresql-job-scheduler-you-always-wanted-but-be-careful-what-you-ask-for/">Timescale job scheduler</a> lets you schedule any SQL or function-based job within PostgreSQL, meaning you don’t need an external scheduler or to load another extension like pg_cron.</li></ul><h2 id="conclusion">Conclusion</h2><p>Pg_partman is an amazing toolkit that greatly simplifies the management of declarative partitioning in PostgreSQL, but it is only that—a toolkit. </p><p>We believe hypertables are a complete product that makes partitioning much more streamlined. The dynamic partition management, reduced locking overhead, and automated retention policies make hypertables a better choice for applications dealing with large datasets. You will save time and worries, and you’ll unlock many other amazing features that will make it even easier to work with your large PostgreSQL tables. </p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[A Deep Dive Into PostgreSQL Vacuum Monitoring With BPFtrace]]></title>
            <description><![CDATA[Read how we developed a simple BPFtrace program to observe the execution of vacuum calls (and analyze their needed execution time) in PostgreSQL.]]></description>
            <link>https://www.tigerdata.com/blog/using-bpftrace-to-trace-postgresql-vacuum-operations</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/using-bpftrace-to-trace-postgresql-vacuum-operations</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Monitoring & Alerting]]></category>
            <dc:creator><![CDATA[Jan Nidzwetzki]]></dc:creator>
            <pubDate>Tue, 05 Sep 2023 13:00:58 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/10/Screenshot-2023-10-12-at-7.05.00-PM.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/10/Screenshot-2023-10-12-at-7.05.00-PM.png" alt="A Deep Dive Into PostgreSQL Vacuum Monitoring With BPFtrace" /><p>Observability enables engineers to quickly troubleshoot and fix issues in production. For those using the Linux kernel, the <a href="https://ebpf.io/">extended Berkeley Packet Filter (eBPF) technology</a> allows application monitoring with minimal overhead. For example, you can use <a href="https://github.com/torvalds/linux/blob/master/kernel/events/uprobes.c">UProbes</a> to trace the invocation and exit of functions in programs.</p><p>Modern database observability tools (e.g., <a href="https://github.com/iovisor/bcc/blob/master/tools/funccount.py"><code>funccount</code></a>) are built on top of eBPF. These full-fledged tools are often written in C and Python and require some development effort, even when a "quick and dirty" solution for a particular observation would be more than sufficient. With <a href="https://github.com/iovisor/bpftrace" rel="noreferrer">BPFtrace</a>, a high-level tracing language for Linux eBPF, users can create eBPF programs with only a few lines of code. </p><p>In this tutorial, we develop a simple BPFtrace program to observe the execution of vacuum calls in PostgreSQL and measure and print their execution times.</p><h2 id="the-environment">The Environment</h2><p><a href="https://www.timescale.com/learn/what-is-postgresql-and-where-did-it-come-from">PostgreSQL </a>uses <a href="https://www.postgresql.org/docs/current/sql-vacuum.html">vacuum operations</a> to reclaim space from dead (e.g., updated or deleted) tuples. <a href="https://timescale.ghost.io/blog/how-to-reduce-your-postgresql-database-size/" rel="noreferrer">For more information on how this process works and why it is an important one to monitor, read this article. 
</a></p><p>In this tutorial, we will trace the vacuum calls and determine the required time per table for the vacuum operations. We will use a PostgreSQL 14 server whose binary is located at <code>/home/jan/postgresql-sandbox/bin/REL_14_2_DEBUG/bin/postgres</code>. </p><p>In addition, the examples are executed in a database with these two tables:</p><pre><code class="language-SQL">CREATE TABLE testtable1 (
   id int NOT NULL,
   value int NOT NULL
);

CREATE TABLE testtable2 (
   id int NOT NULL,
   value int NOT NULL
);
</code></pre>
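<p>Freshly created tables are empty, so a vacuum has very little to do on them. The article does not say how the tables were filled; the following is only a sketch of one way to load sample rows and produce dead tuples, with arbitrary row counts chosen for illustration.</p><pre><code class="language-SQL">-- Fill both tables with sample rows (the row count is an arbitrary assumption)
INSERT INTO testtable1 SELECT i, i FROM generate_series(1, 100000) AS i;
INSERT INTO testtable2 SELECT i, i FROM generate_series(1, 100000) AS i;

-- Create dead tuples: updates and deletes leave old row versions behind
UPDATE testtable1 SET value = value + 1 WHERE id % 2 = 0;
DELETE FROM testtable2 WHERE id % 2 = 0;

-- Check the estimated number of live and dead tuples per table
SELECT relname, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname IN ('testtable1', 'testtable2');
</code></pre><p>The counters in <code>pg_stat_user_tables</code> are updated asynchronously, so the numbers may lag slightly behind the statements that were just executed.</p>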
<div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">❗</div><div class="kg-callout-text"><i><b><strong class="italic" style="white-space: pre-wrap;">Note:</strong></b></i><i><em class="italic" style="white-space: pre-wrap;"> Depending on the C compiler used and the optimizations applied, the symbols of internal functions (i.e., those declared as static) may not be visible. In this case, you cannot use UProbes to trace the function invocations. There are two possible solutions to address this issue: (1) remove the static modifier from the function declaration and recompile PostgreSQL, or (2) create a complete </em></i><a href="https://github.com/jnidzwetzki/pg-lock-tracer/#postgresql-build"><i><em class="italic" style="white-space: pre-wrap;">debug build</em></i></a><i><em class="italic" style="white-space: pre-wrap;"> of PostgreSQL.</em></i></div></div><h2 id="tracing-postgresql-vacuum-operations-using-funclatency">Tracing PostgreSQL Vacuum Operations Using <code>funclatency</code> </h2><p>Let’s explore the existing solutions before developing our tool to trace the vacuum operations. </p><p>The tool <a href="https://github.com/iovisor/bcc/blob/master/tools/funccount.py"><code>funclatency</code></a> is available for most Linux distributions (on Debian, it’s part of the package <em>bpfcc-tools</em> and is renamed to <em>funclatency-bpfcc</em>). It lets you trace a function's entry and exit and measure the function latency (i.e., the time it takes a function to complete).</p><p>In PostgreSQL, the function <a href="https://github.com/postgres/postgres/blob/d2bd4ba30585e65e57004b65106c79235aef9a44/src/backend/commands/vacuum.c#L1798"><code>vacuum_rel</code></a> is invoked when a vacuum operation on a relation is performed. To trace these function calls with <code>funclatency-bpfcc</code>, you need to provide the path of the PostgreSQL binary and the function name. 
For instance:‌</p><pre><code>$ sudo funclatency-bpfcc -r /home/jan/postgresql-sandbox/bin/REL_14_2_DEBUG/bin/postgres:vacuum_rel

Tracing 1 functions for "/home/jan/postgresql-sandbox/bin/REL_14_2_DEBUG/bin/postgres:vacuum_rel"... Hit Ctrl-C to end.
</code></pre><p>Afterward, an eBPF program is loaded into the Linux kernel, and a UProbe is defined and observes the beginning of the function call while another UProbe observes its exit. The latency between these two events is measured and stored.</p><p>To execute some vacuum operations, we perform the following SQL statement in a second session:</p><pre><code class="language-SQL">database=# VACUUM FULL;
VACUUM FULL
</code></pre><p>‌<br>This SQL statement triggers PostgreSQL to perform a vacuum operation of all tables of the current database—this takes some time. After the vacuum operations are done, the <code>funclatency-bpfcc</code> program can be stopped (by executing CTRL+C), ending the observation of the binary and showing the recorded execution times on the terminal.</p><pre><code>$ sudo funclatency-bpfcc -r /home/jan/postgresql-sandbox/bin/REL_14_2_DEBUG/bin/postgres:vacuum_rel
[...]
^C
Function = b'vacuum_rel' [876997]
     nsecs               : count     distribution
         0 -&gt; 1          : 0        |                                        |
         2 -&gt; 3          : 0        |                                        |
         4 -&gt; 7          : 0        |                                        |
         8 -&gt; 15         : 0        |                                        |
        16 -&gt; 31         : 0        |                                        |
        32 -&gt; 63         : 0        |                                        |
        64 -&gt; 127        : 0        |                                        |
       128 -&gt; 255        : 0        |                                        |
       256 -&gt; 511        : 0        |                                        |
       512 -&gt; 1023       : 0        |                                        |
      1024 -&gt; 2047       : 0        |                                        |
      2048 -&gt; 4095       : 0        |                                        |
      4096 -&gt; 8191       : 0        |                                        |
      8192 -&gt; 16383      : 0        |                                        |
     16384 -&gt; 32767      : 0        |                                        |
     32768 -&gt; 65535      : 0        |                                        |
     65536 -&gt; 131071     : 0        |                                        |
    131072 -&gt; 262143     : 0        |                                        |
    262144 -&gt; 524287     : 0        |                                        |
    524288 -&gt; 1048575    : 0        |                                        |
   1048576 -&gt; 2097151    : 0        |                                        |
   2097152 -&gt; 4194303    : 0        |                                        |
   4194304 -&gt; 8388607    : 2        |*                                       |
   8388608 -&gt; 16777215   : 13       |***********                             |
  16777216 -&gt; 33554431   : 44       |****************************************|
  33554432 -&gt; 67108863   : 7        |******                                  |
  67108864 -&gt; 134217727  : 1        |                                        |

avg = 22765358 nsecs, total: 1525279002 nsecs, count: 67

Detaching...
</code></pre><p>The output shows that the function <code>vacuum_rel</code> was called 67 times and that the average execution time is <code>22765358 nsecs</code> (about 22.8 ms).</p><p>In addition, a histogram of the function latency is printed, providing a lot of helpful information. However, while we now know the number and duration of vacuum calls, we still don’t know the duration of the individual vacuum calls for each relation.</p><p>This is something that this tool does not support because it does not evaluate the function parameters (e.g., the Oid of the relation that the current function invocation should vacuum). But this is something we can do with <code>bpftrace</code>.<br></p><h2 id="using-bpftrace-to-trace-vacuum-full-entries">Using BPFtrace To Trace VACUUM FULL Entries</h2><p>Let’s start with a very simple BPFtrace program that prints a line once the <code>vacuum_rel</code> function is invoked in the PostgreSQL binary. <code>bpftrace</code> is called with the eBPF program that should be loaded into the Linux kernel. The eBPF programs that are passed to BPFtrace have the following <a href="https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md#language">syntax</a>:</p><pre><code>&lt;probe1&gt; {
        &lt;Actions&gt;
}

[...]

&lt;probeN&gt; {
        &lt;Actions&gt;
}
</code></pre><p>‌The syntax to define a UProbe on a binary or library is: <code>uprobe:library_name:function_name[+offset]</code>. For instance, to define a UProbe on the function invocation of <code>vacuum_rel</code> in the binary <code>/home/jan/postgresql-sandbox/bin/REL_14_2_DEBUG/bin/postgres</code> and print the line <code>Vacuum started</code>, you can use the following BPFtrace call:</p><pre><code>$ sudo bpftrace -e '
uprobe:/home/jan/postgresql-sandbox/bin/REL_14_2_DEBUG/bin/postgres:vacuum_rel {
    printf("Vacuum started\n");
}
'

Attaching 1 probe...
Vacuum started
Vacuum started
Vacuum started
Vacuum started
Vacuum started
Vacuum started
Vacuum started
Vacuum started
Vacuum started
[...]
</code></pre><p>As soon as the <code>VACUUM FULL</code> SQL statement in PostgreSQL is executed in another terminal session, the program starts to print the message on the screen. This is a good start, but we still have less information than the existing tool <code>funclatency-bpfcc</code> provides. We're missing the latency of the function calls.</p><h2 id="tracing-vacuum-function-returns-and-latency">Tracing Vacuum Function Returns and Latency</h2><p>To measure the execution time of the function invocations, we need two things:</p><ul><li>Define a second UProbe that is invoked when the observed function returns; this can be done by using a URetProbe.</li><li>Calculate the difference between the two UProbe events.</li></ul><p>A URetProbe in BPFtrace can be defined using the same syntax <code>uretprobe:binary:function</code> as the one used for the UProbe. In addition, BPFtrace lets you create variables such as associative arrays.</p><p>We use such an array to capture the start time of a function invocation: <code>@start[tid] = nsecs;</code>. The array's key is the ID of the current thread: <code>tid</code>. So, multiple threads (and processes, as in our case with PostgreSQL) can be traced simultaneously without overwriting the start time of the last function invocation.</p><p>In the URetProbe, we take the current time and subtract the time of the function invocation (<code>nsecs - @start[tid]</code>) to get the time the function call needs. 
We also use a predicate (<code>/@start[tid]/</code>) to let BPFtrace know that we only want to execute the body of the URetProbe if this array value is defined.</p><p>Using this predicate, we avoid handling a function return without having seen the corresponding function entry (e.g., we start the BPFtrace program in the middle of a running function call, and we get only the URetProbe invocation for this function call).</p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">❗</div><div class="kg-callout-text"><i><b><strong class="italic" style="white-space: pre-wrap;">Note:</strong></b></i><i><em class="italic" style="white-space: pre-wrap;"> It is not guaranteed that BPFtrace will deliver and process the eBPF events in order. Especially when a function call is short and there are many function invocations, the events could be processed out of order (e.g., we see two function entry events followed by two function return events). In this case, function latency observations with BPFtrace become imprecise. To avoid this, we use VACUUM FULL calls instead of plain vacuum calls. These calls are </em></i><a href="https://www.postgresql.org/docs/current/sql-vacuum.html"><i><em class="italic" style="white-space: pre-wrap;">much more expensive</em></i></a><i><em class="italic" style="white-space: pre-wrap;"> since they rewrite the table. Therefore, they take longer and can be reliably observed by BPFtrace.</em></i></div></div><p></p><pre><code>$ sudo bpftrace -e '
uprobe:/home/jan/postgresql-sandbox/bin/REL_14_2_DEBUG/bin/postgres:vacuum_rel
{
        printf("Performing vacuum\n");
        @start[tid] = nsecs;
}

uretprobe:/home/jan/postgresql-sandbox/bin/REL_14_2_DEBUG/bin/postgres:vacuum_rel
/@start[tid]/
{
        printf("Vacuum call took %d ns\n", nsecs - @start[tid]);
        delete(@start[tid]);
}
'
</code></pre><p>After running this BPFtrace call and executing <code>VACUUM FULL</code> in a second session, we see the following output:</p><pre><code>Attaching 2 probes...
Performing vacuum
Vacuum call took 37486735 ns
Performing vacuum
Vacuum call took 16491130 ns
Performing vacuum
Vacuum call took 32443568 ns
Performing vacuum
Vacuum call took 17959933 ns
[...]
</code></pre><p>‌For each call of the <code>vacuum_rel</code> function in PostgreSQL, we measure the time the vacuum operation needs. However, it would also be convenient to capture the Oid or the name of the relation vacuumed by the current vacuum operation. This requires the handling of the function parameters of the observed function.</p><h2 id="handle-function-parameters">Handle Function Parameters</h2><p>The function <code>vacuum_rel</code> has the following signature in PostgreSQL 14. The first parameter is the <code>Oid</code> (an <a href="https://github.com/postgres/postgres/blob/1951d21b29939ddcb0e30a018cf413b949e40d97/src/include/postgres_ext.h#L31">unsigned int</a>) of the processed relation. The second parameter is a <code>RangeVar</code> struct, which <em>could</em> contain the name of the relation. The third parameter is a <code>VacuumParams</code> struct, which contains additional parameters for the vacuum operation, and the last parameter is a <code>BufferAccessStrategy</code>, which defines the access strategy of the used buffer.</p><pre><code>static bool vacuum_rel(Oid relid,
        RangeVar *relation,
        VacuumParams *params,
        BufferAccessStrategy bstrategy 
)
</code></pre><p>BPFtrace lets you access the function parameters using the keywords <code>arg0</code>, <code>arg1</code>, …, <code>argN</code>. To include the Oid in our log output, we only need to print the first parameter of the function.</p><pre><code>$ sudo bpftrace -e '

uprobe:/home/jan/postgresql-sandbox/bin/REL_14_2_DEBUG/bin/postgres:vacuum_rel
{
        printf("Performing vacuum of Oid %d\n", arg0);
        @start[tid] = nsecs;
}

uretprobe:/home/jan/postgresql-sandbox/bin/REL_14_2_DEBUG/bin/postgres:vacuum_rel
/@start[tid]/
{
        printf("Vacuum call took %d ns\n", nsecs - @start[tid]);
        delete(@start[tid]);
}
'
</code></pre><p>When the <code>VACUUM FULL</code> operation is executed again in a second terminal, the output looks as follows:</p><pre><code>Attaching 2 probes...
[...]
Performing vacuum of Oid 1153888
Vacuum call took 37486734 ns
Performing vacuum of Oid 1153891
Vacuum call took 49535256 ns
Performing vacuum of Oid 2619
Vacuum call took 39575635 ns
Performing vacuum of Oid 2840
Vacuum call took 40683526 ns
Performing vacuum of Oid 1247
Vacuum call took 14683600 ns
Performing vacuum of Oid 4171
Vacuum call took 20587503 ns
</code></pre><p>‌To determine which Oid belongs to which relation, the following SQL statement can be executed:</p><pre><code class="language-SQL">blog=# SELECT oid, relname FROM pg_class WHERE oid IN (1153888, 1153891);
   oid   |  relname   
---------+------------
 1153888 | testtable1
 1153891 | testtable2
(2 rows)
</code></pre><p>The result shows that Oids <code>1153888</code> and <code>1153891</code> belong to the tables <code>testtable1</code> and <code>testtable2</code>, which we have created in one of the first sections of this article. These values belong to our test environment. In your environment, different OIDs might be shown.</p><h2 id="handle-function-struct-parameters">Handle Function Struct Parameters</h2><p>So far, we have processed simple parameters with <code>bpftrace</code> (like Oids, which are unsigned integers). However, many parameters in PostgreSQL are C data structs. Furthermore, these structs can be handled in BPFtrace programs as well.</p><p>The second parameter of the <code>vacuum_rel</code> function is a RangeVar struct. This struct is <a href="https://github.com/postgres/postgres/blob/2a8b40e3681921943a2989fd4ec6cdbf8766566c/src/include/nodes/primnodes.h#L63">defined in PostgreSQL 14</a> as follows:</p><pre><code>typedef struct RangeVar
{
	NodeTag	type;
	char *catalogname;
	char *schemaname;
	char *relname;
	[...]
} RangeVar;
</code></pre><p>To process the struct, the following BPFtrace program can be used. Please note that the internal <code>NodeTag</code> data type of PostgreSQL is replaced by a simple int. </p><p>The <code>NodeTag</code> data type is an <code>enum</code>. Enums are backed by an integer data type in C. To handle this enum correctly, we could either (1) copy the enum definition into the eBPF program or (2) replace it with a data type of the same length. </p><p>To keep the BPFtrace program simple, the second option is used here. The next three struct members are char pointers, which contain the <code>catalogname</code>, the schema, and the name of the relation. The <code>schemaname</code> and the <code>relname</code> are the fields we are interested in. The struct contains more members, but these members are ignored to keep the example clear.</p><pre><code>$ sudo bpftrace -e '
struct RangeVar
{
	int type;
	char *catalogname;
	char *schemaname;
	char *relname;
};

uprobe:/home/jan/postgresql-sandbox/bin/REL_14_2_DEBUG/bin/postgres:vacuum_rel
{
        printf("[PID %d] Performing vacuum of Oid %d (%s.%s)\n", pid, arg0, str(((struct RangeVar*) arg1)-&gt;schemaname), str(((struct RangeVar*) arg1)-&gt;relname));
        @start[tid] = nsecs;
}

uretprobe:/home/jan/postgresql-sandbox/bin/REL_14_2_DEBUG/bin/postgres:vacuum_rel
/@start[tid]/
{
        printf("[PID %d] Vacuum call took %d ns\n", pid, nsecs - @start[tid]);
        delete(@start[tid]);
}
'
</code></pre><p>After the struct is defined, the members of the struct can be accessed as in a regular C program. For example: <code>((struct RangeVar*) arg1)-&gt;schemaname</code>. In addition, we print the process ID (PID) of the process that triggered the UProbe. This makes it possible to identify the process that performed the vacuum operation.</p><p>When running the following SQL statements in a second terminal:</p><pre><code class="language-SQL">VACUUM FULL public.testtable1;
VACUUM FULL public.testtable2;
</code></pre><p>The BPFtrace program shows the following output:</p><pre><code>Attaching 2 probes...
[PID 616516] Performing vacuum of Oid 1153888 (public.testtable1)
[PID 616516] Vacuum call took 23683600 ns
[PID 616516] Performing vacuum of Oid 1153891 (public.testtable2)
[PID 616516] Vacuum call took 24240837 ns</code></pre><p>The table names are extracted from the <code>RangeVar</code> data structure and shown in the output. However, this data structure is not always populated by PostgreSQL. The data structure might be empty when running <code>VACUUM FULL</code> without specifying a table name. Therefore, we use two single invocations with explicit table names to force PostgreSQL to populate this data structure.</p><h2 id="optimizing-the-bpftrace-program-using-maps">Optimizing the BPFtrace Program Using Maps</h2><p>The BPFtrace programs we have developed so far use one or more <code>printf</code> statements directly. A <code>printf</code> call is slow and reduces the throughput the BPFtrace program can monitor.</p><p>This can be optimized by storing the data in a map that is printed when BPFtrace is stopped. This postpones the output until the observation is done. To do this, we use three maps: <code>@start</code>, <code>@oid</code>, and <code>@vacuum</code>. The first two maps are populated in the UProbe event of the <code>vacuum_rel</code> function. The map <code>@start</code> contains the time when the probe is triggered, and the map <code>@oid</code> contains the Oid passed as the first function parameter.</p><p>When the function returns and the URetProbe is triggered, the <code>@vacuum</code> map is populated. The key is the Oid, and the value is the time needed to perform the vacuum operation. Also, the entries for the current thread are removed from the first two maps.</p><p>When BPFtrace exits (e.g., by pressing CTRL+C), all populated maps are printed automatically. 
By using these three maps (<code>@start</code>, <code>@oid</code>, <code>@vacuum</code>), we have separated the actual monitoring from the output: the expensive printing happens only after the monitoring is done.</p><p>In addition, in the following program, we use the two special probes <code>BEGIN</code> and <code>END</code>, which BPFtrace triggers when the observation begins and ends.</p><pre><code>$ sudo bpftrace -e '

uprobe:/home/jan/postgresql-sandbox/bin/REL_14_2_DEBUG/bin/postgres:vacuum_rel
{
        @start[tid] = nsecs;
        @oid[tid] = arg0;
}

uretprobe:/home/jan/postgresql-sandbox/bin/REL_14_2_DEBUG/bin/postgres:vacuum_rel
/@start[tid]/
{

        @vacuum[@oid[tid]] = nsecs - @start[tid];
        delete(@start[tid]);
        delete(@oid[tid]);

}

BEGIN
{
        printf("VACUUM calls are traced, press CTRL+C to stop tracing\n");
}

END 
{
        printf("\n\nNeeded time in ns to perform VACUUM FULL per Oid\n");
}
'
</code></pre><p>After BPFtrace is started, the first message is printed. After the program is stopped, the second message is printed. Furthermore, the content of the <code>@vacuum</code> map is printed. For each Oid, the needed time for the vacuum operations is shown.</p><pre><code>VACUUM calls are traced, press CTRL+C to stop tracing
^C

Needed time in ns to perform VACUUM FULL per Oid

@vacuum[1153888]: 7526823
@vacuum[1153891]: 8462672
@vacuum[2613]: 10764797
@vacuum[2995]: 11429589
@vacuum[6102]: 11436539
@vacuum[12801]: 14373934
@vacuum[6106]: 14396012
@vacuum[3118]: 14507167
@vacuum[3596]: 14695385
@vacuum[12811]: 14871237
@vacuum[3429]: 15106778
@vacuum[3350]: 15158742
@vacuum[2611]: 15432053
@vacuum[3764]: 15534169
@vacuum[2601]: 16055863
@vacuum[3602]: 16128624
@vacuum[2605]: 16405419
@vacuum[2616]: 16914195
@vacuum[3576]: 17003920
[...]
</code></pre><h2 id="conclusion">Conclusion</h2><p>This article provides a brief overview of BPFtrace. To trace the function latency of PostgreSQL vacuum calls, we used the tool <code>funclatency-bpfcc</code>. </p><p>Additionally, we utilized BPFtrace to create a tool that allows for more in-depth observation of the calls. Our BPFtrace script also takes into account the parameters of the PostgreSQL <code>vacuum_rel</code> function, enabling us to monitor the vacuum time per relation.</p><p>And speaking of vacuum, <a href="https://timescale.ghost.io/blog/how-to-fix-transaction-id-wraparound/">check out this blog post to learn more about transaction ID wraparound exhaustion and how to avoid it in PostgreSQL databases</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Making PostgreSQL Backups 100x Faster via EBS Snapshots and pgBackRest]]></title>
            <description><![CDATA[pgBackRest is an awesome tool for backup creation/restore in Postgres, but it can get slow for large databases. We mitigated this problem by incorporating EBS snapshots into our backup strategy. ]]></description>
            <link>https://www.tigerdata.com/blog/making-postgresql-backups-100x-faster-via-ebs-snapshots-and-pgbackrest</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/making-postgresql-backups-100x-faster-via-ebs-snapshots-and-pgbackrest</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Engineering]]></category>
            <dc:creator><![CDATA[Grant Godeke]]></dc:creator>
            <pubDate>Thu, 31 Aug 2023 14:16:35 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/08/elephant-armor.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/08/elephant-armor.png" alt="A supercharged PostgreSQL elephant: learn how we're making PostgreSQL backups 100x faster via EBS snapshots and pgBackRest" /><p>If you have experience running PostgreSQL in a production environment, you know that maintaining database backups is a daunting task. In the event of a catastrophic failure, data corruption, or other form of data loss, the ability to quickly restore from these backups will be vital for minimizing downtime. If you’re managing a database, maintaining your backups and getting your recovery strategy in order is probably the first item on your checklist.</p><p>Perhaps this has already given you a headache or two because<strong> creating and restoring backups for large PostgreSQL databases can be a very slow process.</strong></p><div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-emoji">🗒️</div><div class="kg-callout-text">A refresher on your basic backup and restore Postgres commands:<br><br><a href="https://www.timescale.com/learn/backup" rel="noreferrer">Postgres Backup Cheat Sheet</a><br><a href="https://www.timescale.com/learn/postgres-cheat-sheet/restore" rel="noreferrer">Postgres Restore Cheat Sheet</a></div></div><p><br></p><p>The most widely used external tool for backup operations in PostgreSQL is <a href="https://pgbackrest.org/?ref=timescale.com">pgBackRest</a>, which is very powerful and reliable. But pgBackRest can also be very time-consuming, especially for databases well over 1 TB. </p><p>The problem is exacerbated when restoring backups from production databases that continue to ingest data, thus creating more WAL (write-ahead log) that must be applied. 
In this case, a full backup and restore can take hours or even days, which can be a nightmare in production databases.</p><p>When operating our platform (<a href="https://www.timescale.com/">Timescale</a>, a cloud database platform built on PostgreSQL), we struggled with this very thing. At Timescale, we pride ourselves on making PostgreSQL faster and more scalable for large volumes of time-series data, so our customers’ databases are often large (many TBs). At first, we based our backup and restore operations entirely on pgBackRest, and we were experiencing some pain:</p><ul><li>Creating full backups was very slow. This was a problem, for example, when our customers were trying to upgrade their PostgreSQL major version within our platform, as we took a fresh, full backup after the upgrade in case there was a failure shortly after. Upgrades are already stressful, and adding a very slow backup experience was not helping. </li><li>Restoring from backups was also too slow, both restoring from the backups themselves and replaying any WAL that had accrued since the last backup. (<a href="https://docs.timescale.com/use-timescale/latest/backup-restore/backup-restore-cloud/" rel="noreferrer">In Timescale, we automatically take full and incremental backups of all our customers’ databases.</a>)</li></ul><p>In this blog post, we’re sharing how we solved this problem by combining pgBackRest with EBS snapshots. Timescale runs in AWS, so we had the advantage of cloud-native infrastructure. If you're running PostgreSQL in AWS, you can perhaps benefit from a similar approach.</p><p><strong>After introducing EBS snapshots, our backup creation and restore process got 100x faster. 
</strong>This significantly improved the experience for our customers and made things much easier for our team.</p><h2 id="quick-introduction-to-database-backups-in-postgresql-and-why-we-used-pgbackrest">Quick Introduction to Database Backups in PostgreSQL (And Why We Used pgBackRest)</h2><p>If you asked 100 engineers if they thought backups were important for production databases, they would all say "yes"—but if you then took those same 100 engineers and gave them a grade on their backups, most wouldn’t hit a pass mark. </p><p>We all collectively understand the need for backups, but it’s still hard to create an effective backup strategy, implement it, run it, and test that it’s working appropriately.</p><p>In PostgreSQL specifically, there are two ways to implement backups: <strong>logical database dumps</strong>, which contain the SQL commands needed to recreate (not restore) your database from scratch, and <strong>physical backups</strong>, which capture the files that store your database state.  </p><p>Physical backups are usually paired with a mechanism to store the constant stream of write-ahead logs (WALs), which describe all data mutations on the system. A physical backup can then be restored to get PostgreSQL to the exact same state as it was when that backup was taken, and the WAL files rolled forward to get to a specific point in time, maybe just before someone (accidentally?) dropped all your data or your disk ate itself.</p><p>Logical backups are useful to recreate databases (potentially on other architectures), but maintaining physical backups is imperative for any production workload where uptime is valued. Physical backups are exact: they can be restored quickly and provide point-in-time recovery. 
In the rest of this article, we’ll discuss physical backups.</p><h3 id="how-are-physical-backups-usually-created-in-postgresql">How are physical backups usually created in PostgreSQL?</h3><ul><li>The first option is using the <a href="https://www.postgresql.org/docs/current/app-pgbasebackup.html">pg_basebackup</a> command. <code>pg_basebackup</code> copies the data directory and optionally includes the WAL files, but before PostgreSQL 17 it didn’t support incremental backups, and it has limited parallelization capabilities. The whole process is very manual, too. If you’re using <code>pg_basebackup</code>, you’ll get the files you need to bootstrap a new database in a tarball or directory, but not much else. </li><li>Tools like <a href="https://pgbackrest.org/?ref=timescale.com">pgBackRest</a> were designed to overcome the limitations of <code>pg_basebackup</code>. pgBackRest allows for full and incremental backups, multi-threaded operations, and point-in-time recovery. It ensures data integrity by validating checksums during the backup process, supports different types of storage, and much more. In other words, pgBackRest is a robust and feature-rich tool, making it our choice for PostgreSQL backup operations.</li></ul><h2 id="the-problem-with-pgbackrest">The Problem With pgBackRest</h2><p>But pgBackRest is not perfect: it reads and backs up files, causing an additional load on your system. This can cause performance bottlenecks that complicate your backup and restore strategy, especially if you’re dealing with large databases.</p><p>Even though pgBackRest offers incremental backups and parallelization, it often gets slow when executing full backups over large data volumes or on an I/O-saturated system.  
</p><p>While you can sometimes rely on differential or incremental backups to minimize data (<a href="https://docs.timescale.com/use-timescale/latest/backup-restore/backup-restore-cloud/" rel="noreferrer">like we do in Timescale</a>), there are situations in which creating full backups is unavoidable. Backups could also be taken on standby, but at the end of the day, you’re limited by how fast you can get data off your volumes. </p><p>We shared earlier the example of full database upgrades, but we're also talking about any other kind of migration, integrity checks, archival operations, etc. In Timescale, some of our most popular platform features (like <a href="https://timescale.ghost.io/blog/introducing-one-click-database-forking-in-timescale-cloud/">forks,</a> <a href="https://timescale.ghost.io/blog/high-availability-for-your-production-environments-introducing-database-replication-in-timescale-cloud/">high-availability replicas</a>, and <a href="https://timescale.ghost.io/blog/high-availability-for-your-production-environments-introducing-database-replication-in-timescale-cloud/">read replicas</a>) imply a data restore from a full backup.</p><p>Having a long-running full backup operation in your production database is not only inconvenient, it can also conflict with other high-priority DB tasks, affecting your overall performance. This was problematic for us.</p><p>The slowness of pgBackRest was also problematic when it was time to restore from these backups. It’s very good at CPU parallelization, but when you’re trying to write terabytes of data as fast as possible, I/O will be the bottleneck. When it comes to recovery time objective or RTO, every minute counts. 
In case of major failure, you want to get that database up as soon as possible.</p><h2 id="using-ebs-snapshots-to-speed-up-the-creation-of-backups">Using EBS Snapshots to Speed Up the Creation of Backups</h2><p>To speed up the process of creating fresh full backups, we decided to replace standard pgBackRest full backups with on-demand <a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSSnapshots.html?ref=timescale.com">EBS snapshots</a>.</p><p>Our platform runs in AWS, which comes with some advantages. Using snapshots is a much more cloud-native approach to the problem of backups compared to what’s been traditionally used in PostgreSQL management. </p><p>EBS snapshots create a point-in-time copy of a particular database: this snapshot can be restored, effectively making it a backup. The key is that <strong>taking a snapshot is significantly faster than the traditional approach with pgBackRest</strong>: in our case, our p90 snapshot time decreased by over 100x. This gap gets wider the larger your database is!</p><p>How did we implement this? Basically, we did a one-to-one replacement of pgBackRest. Instead of waiting for the pgBackRest fresh full backup to complete, we now take a snapshot. We still wait for the backup to complete, but the process is significantly faster via snapshots. This way, we get the quick snapshot but also the full data copy and checksumming for datafile integrity, which pgBackRest performs.</p><p>If a user experiences a failure shortly after an upgrade, we have a fresh backup—the snapshot—that we can quickly restore (we’ll cover how we handle restores next). We still take a fresh full backup using pgBackRest (yay for redundancy), but the key difference is that this happens after the upgrade process has been fully completed. 
</p><p>If a failure has happened, the service is available to our customer quickly: we don’t have to force them to wait for the lengthy pgBackRest process to finish before being able to use their service again.</p><p>The trade-offs for adopting this approach were minimal. The only downside to consider is that, by taking snapshots, we now have redundant backups (both snapshots and full backups), so we incur additional storage costs. But what we’ve gained (both in terms of customer satisfaction and our own peace of mind) is worth the price.</p><h2 id="combining-ebs-snapshots-and-pgbackrest-for-quick-data-restore-taking-partial-snapshots-replaying-wal">Combining EBS Snapshots and pgBackRest for Quick Data Restore: Taking Partial Snapshots, Replaying WAL</h2><p>Solving the first problem we encountered with pgBackRest (i.e., slow creation of full backups) was relatively simple. We knew exactly when we needed an EBS snapshot to be created, as this process is always tied to a very specific workflow (e.g., performing a major version upgrade).</p><p>But we also wanted to explore using EBS snapshots to improve our data restore functionality. As we mentioned earlier, some popular features in the Timescale platform rely heavily on restores, including <a href="https://timescale.ghost.io/blog/introducing-one-click-database-forking-in-timescale-cloud/">creating forks,</a> <a href="https://timescale.ghost.io/blog/high-availability-for-your-production-environments-introducing-database-replication-in-timescale-cloud/">high-availability replicas</a>, and <a href="https://timescale.ghost.io/blog/high-availability-for-your-production-environments-introducing-database-replication-in-timescale-cloud/">read replicas,</a> all of which imply a data restore from a full backup.</p><p>This use case posed a slightly different and more difficult challenge since to restore from a full backup, such a backup needs to exist first, reflecting the latest state of the service. 
</p><p>To implement this, the first option we explored was taking an EBS snapshot when the user clicked “Create” for a fork, read replica, or high-availability replica, and then restoring from that snapshot. However, this process was still too slow for the end user. To get the performance we wanted, we had to think a bit beyond the naive approach and determine a way to take semi-regular snapshots across our fleet.</p><p>Fortunately, we already had a backup strategy for pgBackRest in place that we chose to mirror. Now, all Timescale services have EBS snapshots taken daily. For redundancy reasons and to verify file checksums, we still take our standard pgBackRest partial backups, but we don’t depend on them.</p><p>Once the snapshot strategy is in place, restoring data from an EBS snapshot mirrors a restore from pgBackRest very closely. We simply choose the EBS snapshot we want to restore (in the cases mentioned above, always the most recent) and then replay any WAL that has accumulated since that restore point. Here, it is important to note that <a href="https://timescale.ghost.io/blog/how-high-availability-works-in-our-cloud-database/#what-if-theres-a-failure-affecting-your-storage">we still rely on pgBackRest to do our WAL management</a>. pgBackRest works great for us here; nothing gets close in terms of parallel WAL streaming.</p><p>This EBS snapshotting and pgBackRest approach has given us great results so far. Using snapshots for restores has helped improve our product experience and provides our customers with an even higher level of reliability. Keeping pgBackRest in parallel has given us peace of mind that, in addition to snapshots, we still have a traditional backup approach that validates our data.</p><p>We’re continually improving our strategy, though, for example, by being smarter about when we snapshot, e.g., by looking at the accumulated WAL since the last snapshot to determine if we need to snapshot certain services more frequently. 
This practice helps improve restore times by reducing the amount of WAL that would need to be replayed, which is often the bottleneck in this process.</p><h2 id="on-snapshot-prewarming">On Snapshot Prewarming</h2><p>One important trade-off with this EBS snapshot approach is the balance between deployment time and initial performance. One limitation of a snapshot restore is that not all blocks are necessarily prewarmed and <a href="https://timescale.ghost.io/blog/scaling-postgresql-with-amazon-s3-an-object-storage-for-low-cost-infinite-database-scalability/">may need to be fetched from S3</a> the first time they are used, which is a slow process.</p><p>To give props to pgBackRest restore, it does not have this issue. For our platform features, our trade-off was between getting the user a running read replica (or fork or high-availability replica) as quickly as possible or making sure it was as performant as possible.</p><p>After some back and forth, we decided on our current approach on prewarming: we’re reading as much as we can for five minutes, prioritizing the most recently modified files first. The idea here is that we will warm the data the user is actively engaging with first. After five minutes, we then hand the process off to PostgreSQL to continue reading the rest of the volume at a slower pace until it is complete. For the initial warming, we use a custom <a href="https://www.educative.io/answers/what-is-a-goroutine?ref=timescale.com">goroutine</a> that reads concurrently from files.</p><h2 id="backing-it-up">Backing It Up</h2><p>We are not completely replacing our pgBackRest backup infrastructure with EBS snapshots anytime soon: it is hard to give up on the effectiveness and reliability of pgBackRest. </p><p>But by combining EBS snapshots with pgBackRest across our infrastructure, we’ve been able to mitigate its performance problem significantly, speeding up our backup creation and restore process. 
This allows us to build a better product, providing a better experience to our customers.</p><p>If you’re experiencing the same pains we were experiencing with pgBackRest, think about experimenting with something similar! It may cost you a little extra money, but it can be very much worth it.</p><p>We still have work to do on our end: we will continue to iterate on the ideal snapshotting strategy across the fleet to minimize deployment times as much as possible. We are also looking at smarter ways to prewarm the snapshots and more applications for snapshots in general.</p><p>If any of these problems interest you, <a href="https://www.timescale.com/careers/?ref=timescale.com">check out our open engineering roles</a> (we’re hiring!). And if you are a PostgreSQL user yourself, <a href="https://console.cloud.timescale.com/signup?ref=timescale.com">sign up for a free Timescale trial</a> and experience the result of EBS snapshots in action.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Best Practices for Picking PostgreSQL Data Types]]></title>
            <description><![CDATA[Learn which data types best suit your application when storing massive data volumes in PostgreSQL and TimescaleDB.]]></description>
            <link>https://www.tigerdata.com/blog/best-practices-for-picking-postgresql-data-types</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/best-practices-for-picking-postgresql-data-types</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[PostgreSQL Tips]]></category>
            <dc:creator><![CDATA[Chris Engelbert]]></dc:creator>
            <pubDate>Wed, 23 Aug 2023 15:43:17 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/08/PostgreSQL-Data-types.jpg">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/08/PostgreSQL-Data-types.jpg" alt="Developer at his desk. (Best Practices for Picking PostgreSQL Data Types)" /><p>A good, future-proof data model is one of the most challenging problems when building applications. That is specifically true when working on applications meant to store (and analyze) massive amounts of data, such as time series, log data, or event-storing ones. </p><p>Deciding what data types are best suited to store that kind of information comes down to a few factors, such as requirements on the precision of float-point values, the actual values content (such as text), compressibility, or query speed.</p><p>In this installment of the best practices series (see our posts on <a href="https://timescale.ghost.io/blog/best-practices-for-time-series-data-modeling-narrow-medium-or-wide-table-layout-2/">narrow, medium, and wide table layouts</a>, <a href="https://timescale.ghost.io/blog/best-practices-for-time-series-data-modeling-single-or-multiple-partitioned-table-s-a-k-a-hypertables/">single or partitioned hypertables</a>, and <a href="https://timescale.ghost.io/blog/best-practices-for-time-series-metadata-tables/" rel="noreferrer">metadata tables</a>), we’ll have a look at the different options in PostgreSQL and TimescaleDB regarding most of these questions. While unable to answer the requirement question, a few alternatives may be provided (such as integers instead of floating point)—but more on that later.</p><h1 id="before-we-start-compression">Before We Start: Compression</h1><p>Event-like data, such as time series, logs, and similar use cases, are notorious for ever-growing amounts of collected information. Hence, it’ll grow continuously and require disk storage. </p><p>But that’s not the only issue with big data. Querying, aggregating, and analyzing are some of the others. 
Reading this amount of data from disk requires a lot of I/O operations (<a href="https://en.wikipedia.org/wiki/IOPS">IOPS; input/output operations per second</a>), which is one of the most limiting factors in cloud environments, and even on-premises systems (due to how storage works in general). While Non-Volatile Memory Express (NVMe) and similar technologies can help you optimize for high IOPS, they’re not limitless.</p><p>That’s where compression comes in. TimescaleDB’s compression algorithms (and, to some extent, PostgreSQL’s default ones) help decrease disk space requirements and IOPS, improving cost, manageability, and query speed.</p><p>But let’s get to the actual topic: best practices for data types in TimescaleDB.</p><h2 id="basic-data-types">Basic Data Types</h2><p>PostgreSQL (and the SQL standard in general) offers a great set of basic data types, providing a suitable choice for all general use cases. However, you should be discouraged from using some of them. The <a href="https://wiki.postgresql.org/wiki/Don't_Do_This">PostgreSQL Wiki</a> provides a great list of best practices regarding data types to use or avoid. You don’t have to jump over there immediately, though; we’ll cover most of them here 🥹.</p><h3 id="nullable-columns">Nullable columns</h3><p>Looking back at the <a href="https://timescale.ghost.io/blog/best-practices-for-time-series-data-modeling-narrow-medium-or-wide-table-layout-2/">best practices on table layout</a>, medium and wide table layouts tend to have quite a few nullable columns that act as placeholders for potential values.</p><p>Due to how PostgreSQL stores nullable values that are <code>NULL</code>, those are almost free. Having hundreds of nullable columns, most being <code>NULL</code>, is not an issue. The same is true for TimescaleDB’s custom compression. 
Due to storing data in a columnar format, empty row values are almost free when compressed (null bitmap).</p><h3 id="boolean-values">Boolean values</h3><p>The <em>boolean</em> data type is a logical type with one of two possible values, <code>TRUE</code> or <code>FALSE</code>. It is normally used to record decisions or states.</p><p>There isn’t much specific to booleans in TimescaleDB. They are a very simple data type but also a great choice. Still, people often use an <em>integer</em> to represent their values as <code>1</code> or <code>0</code>. This may come in handy with narrow or medium table layouts where you want to limit the number of columns. Nothing speaks against either solution!</p><p>In terms of compressibility, booleans aren’t heavily optimized but compress fairly well with the standard compression. If you have a series of states, though, consider storing only the state changes, which removes duplicates from the dataset.</p><h2 id="floating-point-values">Floating-point values</h2><p>Floating-point data types represent real numbers with a fractional part. They are used to store all kinds of information, such as percentages, measurements like temperature or CPU usage, or statistical values.</p><p>There are two groups of such number types in PostgreSQL: the floating-point types <em>float4</em> (a.k.a. <em>real</em>) and <em>float8</em> (<em>double precision</em>), and the arbitrary-precision type <em>numeric</em>.</p><p><strong><em>Float4</em> and <em>float8</em> columns are the recommended data types.</strong> TimescaleDB will handle them specifically (during compression) and optimize their use. On the other hand, <em>numeric</em>, as an arbitrary-precision data type, isn’t optimized at all. <strong><em>Numeric</em> isn’t recommended.</strong></p><p>In general, though, due to the complexity of floating-point numbers, if you know the required precision upfront, you could use the multiply-divide trick and store them as integers, which are better optimized. 
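For example, suppose we want to store a temperature value (in Kelvin) with only two decimal places, but the value comes in as a <em>float4</em>.</p><pre><code class="language-plain">-- Incoming float4 reading
SELECT 298.151566::float4 AS original_value;
-- Multiply by 100 and round to keep two decimal places in an integer
SELECT round(298.151566::float4 * 100)::int AS stored_value;  -- 29815
-- Divide by 100 at query time to get the value back
SELECT 29815::float4 / 100 AS query_value;
</code></pre>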
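<p>The same idea can be applied at the table level. The following is only a minimal sketch under assumed names (the <code>conditions</code> table and its columns are hypothetical and not part of any real schema): the scaled reading is stored in an <em>integer</em> column and converted back to Kelvin at query time.</p><pre><code class="language-sql">-- Hypothetical table: temperature stored as hundredths of a Kelvin
CREATE TABLE conditions (
    ts                   timestamptz NOT NULL,
    device_id            integer     NOT NULL,
    temperature_centi_k  integer     NOT NULL
);

-- Scale and round once, on insert
INSERT INTO conditions (ts, device_id, temperature_centi_k)
VALUES (now(), 1, round(298.151566::float4 * 100)::int);

-- Convert back to Kelvin with two decimal places when reading
SELECT ts, device_id, temperature_centi_k / 100.0 AS temperature_k
FROM conditions;
</code></pre><p>Dividing by <code>100.0</code> yields a <em>numeric</em> result, which keeps the two decimal places exact; cast it to <em>float4</em> or <em>float8</em> if the application expects a floating-point value.</p>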
<p>This multiply-divide trick is often used in data transmission for embedded devices with a low-throughput uplink to limit the number of bytes sent.</p><h3 id="integer-values">Integer values</h3><p>Integer data types represent whole numbers (positive and negative) of various sizes (meaning valid number ranges, depending on how many bytes are used to represent them). Integer values are often used for simple values, such as counts of events or similar.</p><p><strong>All integer types (<em>int2, SmallInt</em>, <em>int4, Integer</em>, <em>int8, BigInt</em>) are recommended data types.</strong> TimescaleDB is heavily optimized to compress the values of those data types. No fewer than three compression algorithms work in tandem to get the most out of these data types. </p><p>This is why it is advised (if you know the necessary precision of a floating-point value) to store the values as integers (see <a href="#floating-point-values">floating-point values</a> for more information).<br><br>What is true for integers is also true for all serial data types (<em>serial2, SmallSerial</em>, <em>serial4, Serial</em>, <em>serial8, BigSerial</em>), as those are magical “aliases” for their integer counterparts, incorporating the automatic creation of sequences to fill in their values on insert. </p><p>That said, they use their corresponding integer data types as the column data type. However, the PostgreSQL best practices advise against using them and recommend using <a href="https://www.postgresql.org/docs/current/sql-createtable.html">identity columns</a> instead on PostgreSQL 10 and later.</p><h3 id="timestamp-time-and-date-values">Timestamp, time, and date values</h3><p>Timestamp, time, and date data types represent a specific point in time, some with more and some with less explicit information. 
All these data types have versions with and without timezone information attached (except <em>date</em>).</p><p><strong>Before going into details, I’d generally advise against any of the data types without timezones (timestamp without time zone, timestamp, time without time zone, time).</strong> Most of them are discouraged by PostgreSQL’s best practices and shouldn’t be used. </p><p>It’s a common misconception that they save storage space, and TimescaleDB doesn’t have any optimization for data types without timezones. While they work, they add a lot of casting overhead, which is implicit and, therefore, not immediately visible. That said, just don’t do it 🔥.</p><p>With that out of the way, you can use <em>date</em> and <em>time</em>, but you should consider the use case. While <em>date</em> is optimized and compressed using the same compression scheme as integers, <em>time</em> is not. In any case, you shouldn’t use either for the time-dimension column. </p><p>To store dates, you should consider using <em>timestamptz</em> with the time portion set to midnight in the necessary timezone (<code>2023-01-01T00:00:00+00</code>) to prevent any casting overhead when querying.</p><p>Likewise, you can use <em>timestamptz</em> to store a time value only. In this case, you attach the time portion to a fixed date (such as <code>1970-01-01T15:01:44+04</code>) and cast the final value back into a <em>time</em> value. Alternatively, you can store the value as an <em>integer</em> by encoding the time as the (nano)seconds since midnight or any other encoding you can come up with.</p><p><strong>That leaves us with <em>timestamptz</em> (<em>timestamp with time zone</em>). You’ve guessed it: this is the recommended data type for any kind of point-in-time storage.</strong> </p><p>This data type is highly optimized, used by all internal functionality, and employs the same compression as integers. 
Said compression is especially effective with timestamps, as they tend to have little difference between two consecutive values and compress extremely well. </p><p>Still, be aware that some frameworks, object-relational mappings (ORMs), or tools love their <em>timestamp without time zone</em> and need to be forced to be good citizens.</p><h3 id="text-values">Text values</h3><p>Text values are used to store textual values of arbitrary size. Those values can include detailed descriptions, log messages, and metric names or tags. The available data types include <em>text</em>, <em>char(n)</em>, and <em>varchar(n)</em>.</p><p>PostgreSQL’s best practices advise against using <em>char(n)</em>, as it will pad values shorter than <em>n</em> to that size and waste storage. It recommends using <em>text</em> instead.</p><p>The same is true with <em>varchar(n)</em> with a length limit. Consider using <em>varchar</em> (without length limit) or <em>text</em>.</p><p>From a TimescaleDB-specific perspective, there isn’t much to say except you may want to deduplicate long values using a separate table holding the actual value and a reference (such as a checksum on the content) and storing the reference in the hypertable.<br><br>TimescaleDB doesn’t offer any specific optimization to handle this type of data. It will, however, apply dictionary compression (lz4-based) to those text fields.</p><h3 id="byte-array-bytea-values">Byte array (bytea) values</h3><p>Byte arrays (in PostgreSQL represented by the data type <em>bytea</em>) store arbitrary large sequences of bytes, which may represent anything, from encoded machine state to binary data packets.</p><p>When looking at customer/user use cases, it is a very uncommon data type, as most data is decoded before being stored in the database. Therefore, TimescaleDB doesn’t optimize anything about this data type. 
Compression, however, should use the lz4-based algorithm.</p><p>If you have recurring, large <em>bytea</em> values, you can store them outside the actual hypertable and apply deduplication, as explained for text columns.</p><h2 id="complex-and-extension-data-types">Complex and Extension Data Types</h2><p>Compared to basic data types, complex data types commonly encode multiple values into a single column. This may include custom composite types.</p><h3 id="structural-types-json-jsonb-xml">Structural types (JSON, JSONB, XML)</h3><p>Structural data types encode complete objects or sets of information, often <a href="https://www.timescale.com/learn/what-is-data-compression-and-how-does-it-work">lossy</a> in terms of data types of the actual values. PostgreSQL supports three structural data types: <em>JSON</em>, <em>JSONB</em> (a binary representation of JSON), and <em>XML</em>. </p><p>Values often contain complex state information from machines or sensors. They are also common with narrow table layouts to compensate for the different values that need to be stored.</p><p><strong>To get it out of the way, if you want to store JSON-like data, don’t use <em>JSON</em>; use <em>JSONB</em>!</strong> It’s better regarding storage space, query speed, and anything you can think of. </p><p>The only disadvantage of <em>JSONB</em> is that you lose the original order of properties, since <em>JSONB</em> decomposes the object into a more efficient internal representation. I’m not sure I can come up with a great reason why the order should matter, and I hope you agree 😉.</p><p><strong>That said, <em>JSONB</em> is a valid choice when storing complex data.</strong> The amount of data stored should be kept under close observation, though. If you have recurring, large amounts of data inside the <em>JSONB</em> objects, it may be advisable to extract that meta information into a separate, vanilla PostgreSQL table and join it in at query time. 
An approach is shown in the <a href="#text-values">text values</a> section.</p><p>For <em>XML</em>, I don’t have enough data to give any recommendations. Actually, I cannot remember anyone ever asking anything about it. Due to my scarce experience, I wouldn’t advise using it. I guess it may behave similarly to <em>JSONB</em>, but that’s reading tea leaves.</p><p>For all the above data types, TimescaleDB won’t apply any specific optimizations, and compression will likely use the dictionary algorithm, which still yields impressive results (been there, done that).</p><p>There are a few things to be wary of when using those data types. Remember that the whole object needs to be read and parsed, and the requested path’s value extracted. That happens for every single row of data being queried. </p><p><br>One thing I learned using JSONB is that <a href="https://www.postgresql.org/docs/current/indexes-expressional.html">indexes on expressions</a> and <a href="https://www.postgresql.org/docs/current/gin-intro.html">GIN indexes</a> are your friends if you need to make selection decisions based on data inside the object (such as tags). Ensure that anything you need to make that decision is in an index, so the object doesn’t have to be read and (potentially) decompressed.</p><p><br>The second element to remember is that extracting values from those objects yields text values that need to be cast into the required data type. Even if the value is stored as a number inside the <em>JSONB</em> object, extracting it with the <code>-&gt;&gt;</code> operator returns the text representation, which then needs to be cast back into an integer or floating-point value, adding a lot of casting overhead to the process.</p><h2 id="universally-unique-identifiers-uuid">Universally Unique Identifiers: UUID</h2><p>Using <em>UUID</em> to store unique identifiers is common in the PostgreSQL world. 
They are used for various reasons, including generating values that are highly unlikely to collide on a system other than the database, “security” (meaning avoiding potentially guessable series of numbers), and others. </p><p>A <em>UUID</em> represents a (semi)random 128-bit number in the format specified in <a href="https://tools.ietf.org/html/rfc4122">RFC 4122</a> (ISO/IEC 9834-8:2005), which looks similar to <code>a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11</code>.<br><br>TimescaleDB doesn’t optimize this data type at all. It will compress it using the dictionary compression algorithm. However, there are other elements to keep in mind when using <em>UUID</em>. While they offer many advantages and simplify distributed systems, they also create some locality issues, specifically with B-tree indexes. You will find a great read on this topic in the <a href="https://www.cybertec-postgresql.com/en/unexpected-downsides-of-uuid-keys-in-postgresql/">Cybertec blog</a>.<br><br>My recommendation would be to think twice about <a href="https://pgxn.org/dist/pg_uuidv7/">using <em>UUID</em>s</a>.</p><h2 id="tree-like-structures-ltree">Tree-like structures: Ltree</h2><p>While <em>ltree</em> is an uncommon data type when storing event-like data, it can be a great fit for deduplicating recurring, larger values. <em>Ltree</em> represents a certain (sub)path in a tree, for example, a specific device in an overarching topology. Its values resemble a canonical, dot-separated path (<code>building1.floor3.room10.device12</code>).</p><p>Values of data type <em>ltree</em> aren’t handled by TimescaleDB specifically. However, due to their size, this isn’t a big deal.</p><p>As mentioned above, <em>ltree</em> values are perfect for deduplicating recurring data, such as meta information, which is unlikely to change or changes slowly. 
Combined with a hash/checksum, they can serve as a reference for looking up the full values in an external table.</p><h2 id="key-value-pairs-hstore">Key-value pairs: <code>hstore</code></h2><p><em><code>hstore</code></em> provides the capability to store key-value pairs of data, similar to what a non-nested <em>JSON(B)</em> object would offer.</p><p>From my experience, only a few people use <em>hstore,</em> and most prefer <em>JSONB</em>. One reason may be that <em>hstore</em> only offers text values. That said, I have no experience with its compression gains or speed implications. I guess a lot of what was noted on <em>JSONB</em> objects would hold for <em>hstore</em>.</p><h2 id="postgis-data-geometries-and-geographies">PostGIS data: Geometries and geographies</h2><p>Last but not least are <a href="https://postgis.net/">PostGIS’</a> data types, such as <em>geometry</em> and <em>geography</em>. PostGIS data types are used to store spatial information of any kind. Typical use cases are GPS positions, street information, or positional changes over time.</p><p><a href="https://www.timescale.com/learn/postgresql-extensions-postgis">TimescaleDB works perfectly fine with PostGIS data types</a> but doesn’t optimize them. Compression ratios, while not perfect, are still decent.</p><p>Not a lot to mention here: they work, and they work great.</p><h2 id="what-is-the-recommendation-now">What Is the Recommendation Now?</h2><p>Recommending data types is hard since many decisions depend on requirements outside the scope of this blog post. However, I hope this read provides insight into which data types to avoid, which ones are fine, and how you may optimize the merely okay ones.</p><p>To summarize, here’s a small table rating each data type on several aspects using four states: great 🟢, okay but be wary of something 🟠, avoid at all costs 🔴, and unknown (feel free to provide experience/feedback) ⚪️.</p><table>
<thead>
<tr>
<th style="text-align:right"><strong>Data Type</strong></th>
<th style="text-align:center"><strong>Query Speed</strong></th>
<th style="text-align:center"><strong>Compressibility</strong></th>
<th style="text-align:center"><strong>Recommended</strong></th>
<th><strong>Alternative</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:right">boolean</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🟢</td>
<td></td>
</tr>
<tr>
<td style="text-align:right">real, float4</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🟢</td>
<td></td>
</tr>
<tr>
<td style="text-align:right">double (precision), float8</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🟢</td>
<td></td>
</tr>
<tr>
<td style="text-align:right">numeric</td>
<td style="text-align:center">🔴</td>
<td style="text-align:center">🔴</td>
<td style="text-align:center">🔴</td>
<td>floats, integers</td>
</tr>
<tr>
<td style="text-align:right">smallint, int2</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🟢</td>
<td></td>
</tr>
<tr>
<td style="text-align:right">integer, int4</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🟢</td>
<td></td>
</tr>
<tr>
<td style="text-align:right">bigint, int8</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🟢</td>
<td></td>
</tr>
<tr>
<td style="text-align:right">smallserial, serial2</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🟠</td>
<td>int2 with identity</td>
</tr>
<tr>
<td style="text-align:right">serial, serial4</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🟠</td>
<td>int4 with identity</td>
</tr>
<tr>
<td style="text-align:right">bigserial, serial8</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🟠</td>
<td>int8 with identity</td>
</tr>
<tr>
<td style="text-align:right">timestamp without time zone, timestamp</td>
<td style="text-align:center">🔴</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🔴</td>
<td>timestamptz</td>
</tr>
<tr>
<td style="text-align:right">timestamp with time zone, timestamptz</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🟢</td>
<td></td>
</tr>
<tr>
<td style="text-align:right">time</td>
<td style="text-align:center">🔴</td>
<td style="text-align:center">⚪️</td>
<td style="text-align:center">🔴</td>
<td>timestamptz</td>
</tr>
<tr>
<td style="text-align:right">timetz</td>
<td style="text-align:center">🔴</td>
<td style="text-align:center">⚪️</td>
<td style="text-align:center">🔴</td>
<td>timestamptz</td>
</tr>
<tr>
<td style="text-align:right">date</td>
<td style="text-align:center">🔴</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🔴</td>
<td>timestamptz</td>
</tr>
<tr>
<td style="text-align:right">char(n)</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🟠</td>
<td style="text-align:center">🔴</td>
<td>text</td>
</tr>
<tr>
<td style="text-align:right">varchar(n)</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🟠</td>
<td style="text-align:center">🔴</td>
<td>text</td>
</tr>
<tr>
<td style="text-align:right">text</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🟠</td>
<td style="text-align:center">🟠</td>
<td></td>
</tr>
<tr>
<td style="text-align:right">bytea</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🟠</td>
<td style="text-align:center">🟠</td>
<td></td>
</tr>
<tr>
<td style="text-align:right">json</td>
<td style="text-align:center">🔴</td>
<td style="text-align:center">🟠</td>
<td style="text-align:center">🔴</td>
<td>jsonb</td>
</tr>
<tr>
<td style="text-align:right">jsonb</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🟠</td>
<td style="text-align:center">🟠</td>
<td></td>
</tr>
<tr>
<td style="text-align:right">xml</td>
<td style="text-align:center">⚪️</td>
<td style="text-align:center">⚪️</td>
<td style="text-align:center">⚪️</td>
<td>jsonb</td>
</tr>
<tr>
<td style="text-align:right">uuid</td>
<td style="text-align:center">🟠</td>
<td style="text-align:center">🟠</td>
<td style="text-align:center">🟠</td>
<td></td>
</tr>
<tr>
<td style="text-align:right">ltree</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🟠</td>
<td style="text-align:center">🟢</td>
<td></td>
</tr>
<tr>
<td style="text-align:right">hstore</td>
<td style="text-align:center">⚪️</td>
<td style="text-align:center">⚪️</td>
<td style="text-align:center">⚪️</td>
<td>jsonb</td>
</tr>
<tr>
<td style="text-align:right">geometry, geography (PostGIS)</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🟠</td>
<td style="text-align:center">🟢</td>
<td></td>
</tr>
</tbody>
</table>
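<p>To make the earlier <em>JSONB</em> advice a bit more concrete, here is a minimal sketch of an expression index, a GIN index, and the cast needed after extracting a value. The table <code>sensor_events</code>, its columns, and the keys inside the object are hypothetical names chosen for illustration only, and how much these indexes help will depend on your data and queries.</p><pre><code>-- Hypothetical table; all names are assumptions for illustration.
CREATE TABLE sensor_events (
    ts       timestamptz NOT NULL,
    payload  jsonb       NOT NULL
);

-- Expression index: lets the planner find rows by one extracted key.
CREATE INDEX sensor_events_device_idx
    ON sensor_events ((payload-&gt;&gt;'device_id'));

-- GIN index: supports containment (@&gt;) lookups, e.g., on tags.
CREATE INDEX sensor_events_payload_gin
    ON sensor_events USING gin (payload jsonb_path_ops);

-- The -&gt;&gt; operator returns text, so numeric values are cast back explicitly.
SELECT ts,
       (payload-&gt;&gt;'temperature')::double precision AS temperature
FROM sensor_events
WHERE payload-&gt;&gt;'device_id' = 'device12'
  AND payload @&gt; '{"tags": ["outdoor"]}';</code></pre><p>If a value such as the temperature is queried often, storing it in its own typed column instead may avoid the per-row extraction and cast altogether.</p>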
<h2 id="conclusion">Conclusion</h2><p>If you want to know more about the compression algorithms used, see the blog post one of my colleagues wrote, <a href="https://timescale.ghost.io/blog/time-series-compression-algorithms-explained/">Time-Series Compression Algorithms, Explained</a>.</p><p>Also, if you have any feedback or experience with any of the above data types or the non-mentioned ones, I’d be happy to hear from you on <a href="https://twitter.com/noctarius2k">Twitter</a> or our <a href="https://slack.timescale.com">Community Slack</a> (@Chris Engelbert)!</p><p>Finally, I’m sorry for the wall of text you’ve had to read to get here, but there’s a lot of information to share and many different data types (and those aren’t even all of them). I hope you enjoyed the read and learned something new along the way.<br><br>If you want to test Timescale right now, the easiest and fastest way to get started is to sign up for our <a href="https://console.cloud.timescale.com/signup">30-day Timescale free trial</a>. To try self-managed TimescaleDB, see the <a href="https://docs.timescale.com/self-hosted/latest/">documentation</a> for further information.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The 2023 State of PostgreSQL Survey Is Now Open!]]></title>
            <description><![CDATA[The 2023 State of PostgreSQL survey is now live! Help us learn more about this community, and check out last year’s main highlights.]]></description>
            <link>https://www.tigerdata.com/blog/the-2023-state-of-postgresql-survey-is-now-open</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/the-2023-state-of-postgresql-survey-is-now-open</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[State of PostgreSQL]]></category>
            <dc:creator><![CDATA[Team Tiger Data]]></dc:creator>
            <pubDate>Tue, 01 Aug 2023 12:58:32 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/07/2023-SOP-blog-hero.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/07/2023-SOP-blog-hero.png" alt="The 2023 State of PostgreSQL Survey Is Now Open!" /><p>Almost half (45.55 %) of the <a href="https://survey.stackoverflow.co/2023/#databases">2023 Stack Overflow Developer Survey</a> respondents who answered the question about their favorite database (76,634 in total) chose PostgreSQL, making it the most popular one. This is a testament to the quality, reliability, and performance of PostgreSQL, as well as the vibrant and diverse community that supports it.</p><p>As proud members of the PostgreSQL community, we want to continue giving back to this awesome group of data techies. We’re happy to announce that the 2023 State of PostgreSQL survey is officially live, and we are excited to hear once again from PostgreSQL users worldwide.</p><p>Over the years, we have learned a lot about the community. In 2019, we noticed that while PostgreSQL is a popular choice among organizations, <a href="https://drive.google.com/file/d/1VGWN0oCXRxX-qOiq4T-QQwu89Id07dBZ/view?usp=sharing">81 % of you use PostgreSQL for personal projects</a>. In 2021, <a href="https://www.timescale.com/state-of-postgres/2021/">the community shared that the most frequently used extension is PostGIS</a>. Last year, <a href="https://www.timescale.com/state-of-postgres/2022">17 % of the respondents said they contributed to PostgreSQL at least once</a>.</p><p>This year, we want to know how these practices have evolved, and we will explore the two magical letters of the hour: AI. We want to learn what AI tools the PostgreSQL community uses and whether AI workloads are already part of personal and work projects.</p><p>The survey results and anonymized raw data will be published in a report that will be available for free to everyone. 
The report will provide valuable insights into the PostgreSQL ecosystem and help us understand how we can collectively make PostgreSQL better.</p><p><br>For now, to whet the appetite for the 2023 report, read the highlights of last year’s findings. To download the full 2022 report, head over to <a href="https://www.timescale.com/state-of-postgres/2022/">https://www.timescale.com/state-of-postgres/2022/</a></p>
<!--kg-card-begin: html-->
<div class="gray-cta-box">
   <div class="gray-cta-box__text">
       <p><strong>The survey is open until September 15, 2023.</strong></p>
       <br>
       <p>So what are you waiting for? Take the survey now and share your voice with the PostgreSQL community!</p>
       
    </div>
    <a class="gray-cta-box__button" href="https://timescale.typeform.com/state-of-pg-23" target="_blank">
        <p><strong>Take the survey</strong></p>
    </a>
</div>
<!--kg-card-end: html-->
<h2 id="the-state-of-postgresql-in-2022">The State of PostgreSQL in 2022</h2><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/07/2023-07-25-Infographic-pdf.jpg" class="kg-image" alt="" loading="lazy" width="2000" height="4924" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/07/2023-07-25-Infographic-pdf.jpg 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/07/2023-07-25-Infographic-pdf.jpg 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2023/07/2023-07-25-Infographic-pdf.jpg 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/07/2023-07-25-Infographic-pdf.jpg 2000w" sizes="(min-width: 720px) 720px"></figure><h3 id="postgresqls-popularity-is-increasing">PostgreSQL's popularity is increasing</h3><p>The share of PostgreSQL newbies (those using the database for less than a year) has grown from 6.1&nbsp;% in 2021 to 6.4&nbsp;% in 2022.</p><h3 id="reasons-for-choosing-postgresql-over-other-databases">Reasons for choosing PostgreSQL over other databases</h3><p>Open source, reliability, and extensions are the main reasons PostgreSQL users selected in 2022. Interestingly, users’ years of experience were directly related to their answers. “Reliability” was the number one reason to choose PostgreSQL among those who have been using the database for 11-15 years, while “open source” was primarily pointed out by users with up to five years of experience.</p><h3 id="postgresql-usage-is-growing">PostgreSQL usage is growing</h3><p>Small and medium businesses (0-50 employees) use PostgreSQL a lot more than they did one year ago. 
The result is in line with a broader trend: PostgreSQL’s usage is growing, with the majority of respondents (55 %) saying that they have increased their usage of the database.</p><h3 id="postgresql-users-%E2%99%A5-documentation">PostgreSQL users ♥ documentation</h3><p>The majority of respondents (76.1 %) answered that technical documentation is their preferred way of learning about PostgreSQL, followed by long-form blog posts (51.5 %) and short-form blog posts (43.3 %). But the new generation of PostgreSQL users enjoys learning slightly differently: users with less than five years of PostgreSQL experience gravitate toward video as their first option.</p><h3 id="postgresql-users-increasingly-use-dbaas-providers-to-deploy-postgresql">PostgreSQL users increasingly use DBaaS providers to deploy PostgreSQL</h3><p>The trend that we first saw in 2021 continues in 2022. Fewer PostgreSQL users reported self-managing the database compared to previous years. More respondents are using a managed PostgreSQL service to deploy the database.<br></p><p>Many thanks to everyone who took the time to fill in the 2023 survey. If you have not done that yet, grab a cup or glass of your favorite beverage and <a href="https://timescale.typeform.com/state-of-pg-23?utm_source=xxxxx&amp;utm_medium=xxxxx&amp;utm_campaign=xxxxx&amp;utm_term=xxxxx&amp;utm_content=xxxxx">share your experience with PostgreSQL</a>.<br></p><p>Help us make the survey more representative. Share this far and wide! Post it in the company chat. Share it on social media.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Supercharge Your AI Agents With Postgres: An Experiment With OpenAI's GPT-4]]></title>
            <description><![CDATA[Read how you can work with AI agents as intermediaries between AI and databases.]]></description>
            <link>https://www.tigerdata.com/blog/supercharge-your-ai-agent-with-postgresql-an-experiment-with-openais-gpt-4</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/supercharge-your-ai-agent-with-postgresql-an-experiment-with-openais-gpt-4</guid>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[OpenAI]]></category>
            <dc:creator><![CDATA[Jônatas Davi Paganini]]></dc:creator>
            <pubDate>Wed, 26 Jul 2023 13:00:22 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/07/AI-agents.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/07/AI-agents.png" alt="The OpenAI and Postgres logos: supercharge your AI agents with Postgres by reading this blog post" /><p>Hello developers, AI enthusiasts, and everyone eager to push the boundaries of what's possible with technology! Today, we're exploring <strong>AI agents</strong> as intermediaries in a fascinating intersection of fields: Artificial Intelligence and databases.</p><h2 id="the-dawn-of-ai-agents">The Dawn of AI Agents</h2><p>AI agents are at the heart of the tech industry's ongoing revolution. As programs capable of autonomous actions in their environment, AI agents analyze, make decisions, and execute actions that drive a myriad of applications. From autonomous vehicles and voice assistants to recommendation systems and customer service bots, AI agents are changing the way we interact with technology.<br></p><p>But what if we could take it a step further? What if we could use AI to simplify how we interact with databases? Could AI agents act as intermediaries, interpreting human language and converting it into structured database queries?</p><h2 id="a-ruby-experiment-with-gpt-4">A Ruby Experiment With GPT-4</h2><p>That's exactly what we tried to achieve in a recent experiment. Leveraging OpenAI's GPT-4, a powerful language model, we conducted an experiment to see how we could use AI to interact with our databases using everyday language.<br></p><p>The experiment was built using Ruby, and you can find the <a href="https://jonatas.github.io/timescaledb/chat_gpt_tutorial">detailed explanation and code here</a>. The results were fascinating, revealing the potential power of using AI as a “middle-man” (Middle-tech? Middle-bot?) between humans and databases.<br></p><p><a href="https://asciinema.org/a/594564">Check out the videos throughout this blog post</a> to see it in action:</p>
<!--kg-card-begin: html-->
<a href="https://asciinema.org/a/594563" target="_blank"><img src="https://asciinema.org/a/594563.svg" /></a>
<!--kg-card-end: html-->
<h2 id="why-store-data-for-ai-agents">Why Store Data for AI Agents?</h2><p>Data storage is crucial for the successful application of AI, particularly for training and fine-tuning models. By storing interactions, results, and other relevant data, we can improve the performance and accuracy of our AI agents over time.<br></p><p>But data storage is not just about improving our AI; it's also about cost-effectiveness. With the OpenAI API, you pay per token, which can add up when dealing with large amounts of data. By using PostgreSQL as long-term memory for your AI agent, you can reduce the number of tokens you send to the OpenAI API, saving computational resources and money.</p>
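<p>As a rough sketch of what such long-term memory could look like, the snippet below stores each exchange in a table and later reads back only the most recent ones for a topic, so a shorter context can be sent to the API. The table <code>conversations</code>, its columns, and the topic value are hypothetical and not taken from the experiment's code; the <code>create_hypertable</code> call is optional and assumes the TimescaleDB extension is installed.</p><pre><code>-- Hypothetical schema; table and column names are assumptions.
CREATE TABLE conversations (
    ts          timestamptz NOT NULL DEFAULT now(),
    topic       text        NOT NULL,
    user_input  text        NOT NULL,
    ai_response text
);

-- Optional, requires the TimescaleDB extension: partition the table by time.
SELECT create_hypertable('conversations', 'ts');

-- Read back only the latest exchanges for one topic to build a shorter prompt.
SELECT user_input, ai_response
FROM conversations
WHERE topic = 'postgres-tuning'
ORDER BY ts DESC
LIMIT 5;</code></pre><p>How many past exchanges to include is a trade-off between context quality and token cost, so the limit of five is only a placeholder.</p>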
<!--kg-card-begin: html-->
<a href="https://asciinema.org/a/594564" target="_blank"><img src="https://asciinema.org/a/594564.svg" /></a>
<!--kg-card-end: html-->
<h2 id="postgresql-flexible-and-robust">PostgreSQL: Flexible and Robust</h2><p>PostgreSQL is a powerful, open-source relational database system. With a reputation for reliability, robustness, and performance, it's a fantastic choice for your AI's long-term memory. PostgreSQL also offers flexibility and scalability, making it suitable for projects of all sizes.<br></p><p>Whether you're conducting experiments or deploying production-ready applications, PostgreSQL's flexibility and robust nature make it an excellent companion for your AI.<br></p><p>Needless to say, we’re huge PostgreSQL enthusiasts here at Timescale, so much so that we built Timescale on PostgreSQL. Timescale works just like PostgreSQL under the hood, offering the same 100 percent SQL support (not SQL-like) and a rich ecosystem of connectors and tools but supercharging PostgreSQL for analytics, events, and <a href="https://www.tigerdata.com/blog/time-series-introduction" rel="noreferrer">time series</a> (and time-series-like workloads). <br></p><p>With additional features like <a href="https://timescale.ghost.io/blog/compressing-immutable-data-changing-time-series-management/">compression</a> and <a href="https://timescale.ghost.io/blog/an-incremental-materialized-view-on-steroids-how-we-made-continuous-aggregates-even-better/">automatically updated incremental materialized views (we call them continuous aggregates)</a>, Timescale allows you to scale PostgreSQL further for optimal performance while enjoying the best developer experience and cost-effectiveness. <br></p><p>But why all this talk about Timescale? Since every message in the conversation between human and machine happens at a point in time, I realized I was dealing with time-series data. 
Cue TimescaleDB to the rescue!<br></p><h2 id="join-the-timescale-community">Join the Timescale Community</h2><p>We're just scratching the surface of what's possible when combining AI with databases like PostgreSQL, and we'd love for you to join us on this journey.</p><p>Got a cool idea? A question? Or just want to share your thoughts on this topic? Join the Timescale Community on <a href="https://timescale.com/community/">Slack</a> and head over to the <code>#ai-llm-discussion</code> channel. Let's push the boundaries together and shape the future of AI!<br><br>Check this page to learn how to <a href="http://timescale.com/ai">power agents, chatbots, and other large language model (LLM) applications with PostgreSQL</a>. To see what my fellow Timescalers Avthar, Mat, and Sam are already building, read their post on <a href="https://timescale.ghost.io/blog/postgresql-as-a-vector-database-create-store-and-query-openai-embeddings-with-pgvector/">PostgreSQL as a Vector Database: Create, Store, and Query OpenAI Embeddings With pgvector</a>.<br></p><p>Remember, technology grows exponentially when great minds come together. See you there!</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Nearest Neighbor Indexes: What Are IVFFlat Indexes in Pgvector and How Do They Work]]></title>
            <description><![CDATA[A primer on the pgvector’s Inverted File Flat (ivfflat) algorithm for approximate nearest neighbor search. ]]></description>
            <link>https://www.tigerdata.com/blog/nearest-neighbor-indexes-what-are-ivfflat-indexes-in-pgvector-and-how-do-they-work</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/nearest-neighbor-indexes-what-are-ivfflat-indexes-in-pgvector-and-how-do-they-work</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[pgvector]]></category>
            <category><![CDATA[AI]]></category>
            <dc:creator><![CDATA[Matvey Arye]]></dc:creator>
            <pubDate>Fri, 30 Jun 2023 13:03:10 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/06/nearest-neighbor_hero_4.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/06/nearest-neighbor_hero_4.png" alt="Nearest Neighbor Indexes: What Are ivfflat Indexes in pgvector and How Do They Work" /><p>The rising popularity of ChatGPT, OpenAI, and applications of Large Language Models (LLMs) has brought the concept of approximate nearest neighbor search (ANN) to the forefront and sparked a renewed interest in vector databases due to the use of embeddings. <a href="https://platform.openai.com/docs/guides/embeddings">Embeddings</a> are mathematical representations of phrases that capture the semantic meaning as a vector of numerical values. </p><p>What makes this representation fascinating (and useful) is that phrases with similar meanings will have similar vector representations, meaning the distance between their respective vectors will be small. We recently discussed one application of these embeddings, <a href="https://timescale.ghost.io/blog/postgresql-as-a-vector-database-create-store-and-query-openai-embeddings-with-pgvector/">retrieval-augmented generation</a> (augmenting base LLMs with knowledge they weren’t trained on), but there are numerous other applications as well.</p><h3 id="semantic-similarity-search">Semantic similarity search</h3><p>One common application of embeddings is precisely <a href="https://en.wikipedia.org/wiki/Semantic_search">semantic similarity search</a>. The basic concept behind this approach is that if I have a knowledge library consisting of various phrases and I receive a question from a user, I can locate the most relevant information in my library by finding the data that is most similar to the user's query. <br></p><p>This is in contrast to lexical or full-text search, which only returns results that match the words in the query. 
The remarkable aspect of this technique is that, since the embeddings represent the semantics of the phrase rather than its specific wording, I can find pertinent information even if it is expressed using completely different words!<br></p><h3 id="the-challenge-of-speed-at-scale">The challenge of speed at scale</h3><p>Semantic similarity search involves calculating an embedding for the user's question and then searching through my library to find the K most relevant items related to that question—these are the K items whose embeddings are closest to that of the question. However, when dealing with a large library, it becomes crucial to perform this search efficiently and swiftly. In the realm of vector databases, this problem is referred to as "Finding the k nearest neighbors" (<a href="https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm">KNN</a>).<br></p><p>This post discusses a method to enhance the speed of this search when utilizing PostgreSQL and <a href="https://www.timescale.com/ai">pgvector</a> for storing <a href="https://www.tigerdata.com/blog/a-beginners-guide-to-vector-embeddings" rel="noreferrer">vector embeddings</a>: the <a href="https://github.com/pgvector/pgvector#indexing">Inverted File Flat (IVFFlat)</a> algorithm for approximate nearest neighbor search. We’ll cover why IVFFlat is useful, how it works, and best practices for using it in pgvector for fast similarity search over embeddings vectors. </p><p>Let’s go!</p><p><strong>P.S. 
</strong>If you’re looking for the fastest vector search index on PostgreSQL, <a href="https://timescale.ghost.io/blog/how-we-made-postgresql-as-fast-as-pinecone-for-vector-data/" rel="noreferrer"><u>check out pgvectorscale</u></a>.</p><h2 id="what-are-ivfflat-indexes">What Are IVFFlat Indexes?</h2><p>IVFFlat indexes, short for Inverted File Flat (“flat” because the vectors are stored as-is, without compression), are a type of vector index used in PostgreSQL's <a href="https://www.tigerdata.com/learn/postgresql-extensions-pgvector" rel="noreferrer">pgvector extension</a> to speed up similarity searches to find vectors that are close to a given query. This index type uses approximate nearest neighbor search (ANN) to provide fast searches.&nbsp;&nbsp;</p><p>These indexes work by dividing the vectors into multiple lists, known as clusters. Each cluster represents a region of similar vectors, and an inverted index is built to map each region to its corresponding vectors. When a query comes in, the nearest clusters to the query are identified and only the vectors in those clusters are searched. Thus, this approach significantly reduces the scope of similarity searches by excluding all the vectors that are not in the clusters that are close to the query.</p><p></p><h2 id="why-use-the-ivfflat-index-in-pgvector">Why Use the IVFFlat Index in Pgvector</h2><p>Searching for the k-nearest neighbors is not a novel problem for PostgreSQL. <a href="https://docs.timescale.com/use-timescale/latest/extensions/postgis/">PostGIS</a>, a <a href="https://www.tigerdata.com/blog/top-8-postgresql-extensions" rel="noreferrer">PostgreSQL extension</a> for handling location data, stores its data points as two-dimensional vectors (longitude and latitude). Finding nearby locations is a crucial query in that domain. <br></p><p>PostGIS tackles this challenge by employing an index known as an R-Tree, which yields precise results for k-nearest neighbor queries. 
Similar techniques, such as KD-Trees and Ball Trees, are also employed for this type of search in other databases.<br></p><h3 id="the-curse-of-dimensionality">"The curse of dimensionality"</h3><p>However, there's a catch. These approaches cease to be effective when dealing with data with more than approximately 10 dimensions due to the "curse of dimensionality." Cue the ominous music! Essentially, as you add more dimensions, the available space increases exponentially, resulting in exponentially sparser data. This reduced density renders existing indexing techniques that rely on partitioning the space, like the aforementioned R-Tree, KD-Trees, and Ball Trees, ineffective. (To learn more, I suggest these two videos: <a href="https://www.youtube.com/watch?v=BbYV8UfMJSA">1</a>, <a href="https://www.youtube.com/watch?v=E1_WCdUAtyE">2</a>). <br></p><p>Given that embeddings often consist of more than a thousand dimensions (OpenAI’s have 1,536), new techniques had to be developed. There are no known exact algorithms for efficiently searching in such high-dimensional spaces. Nevertheless, there are excellent <em>approximate</em> algorithms that fall into the category of approximate nearest neighbor algorithms. 
Numerous such algorithms exist, but in this article, we will delve into the Inverted File Flat or IVFFlat algorithm, which is provided by pgvector.</p><h2 id="how-the-ivfflat-index-works-in-pgvector">How the IVFFlat Index Works in pgvector</h2><p></p><h3 id="how-ivfflat-divides-the-space">How IVFFlat divides the space</h3><p>To gain an intuitive understanding of how IVFFlat works, let's consider a set of vectors represented in a two-dimensional space as the following points:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/06/nearest-neighbor-pgvector-diagram---1.png" class="kg-image" alt="A set of vectors represented as points in two dimensions" loading="lazy" width="1640" height="1040" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/06/nearest-neighbor-pgvector-diagram---1.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/06/nearest-neighbor-pgvector-diagram---1.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2023/06/nearest-neighbor-pgvector-diagram---1.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/06/nearest-neighbor-pgvector-diagram---1.png 1640w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">A set of vectors represented as points in two dimensions</em></i></figcaption></figure><p>In the IVFFlat algorithm, the first step involves applying k-means clustering to the vectors to find cluster centroids. 
In the case of the given vectors, let's assume we perform k-means clustering and identify four clusters with the following centroids.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/06/nearest-neighbor-pgvector-diagram---2-1.png" class="kg-image" alt="After k-means clustering, we identify four clusters indicated by the colored triangles." loading="lazy" width="1640" height="1040" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/06/nearest-neighbor-pgvector-diagram---2-1.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/06/nearest-neighbor-pgvector-diagram---2-1.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2023/06/nearest-neighbor-pgvector-diagram---2-1.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/06/nearest-neighbor-pgvector-diagram---2-1.png 1640w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">After k-means clustering, we identify four clusters indicated by the colored triangles</em></i></figcaption></figure><p>After computing the centroids, the next step is to assign each vector to its nearest centroid. This is accomplished by calculating the distance between the vector and each centroid and selecting the centroid with the smallest distance as the closest one. This process conceptually maps each point in space to the closest centroid based on proximity.<br></p><p>By establishing this mapping, the space becomes divided into distinct regions surrounding each centroid (technically, this kind of division is called a <a href="https://en.wikipedia.org/wiki/Voronoi_diagram">Voronoi Diagram</a>). 
Each region represents a cluster of vectors that exhibit similar characteristics or are close in semantic meaning. <br></p><p>This division enables efficient organization and retrieval of approximate nearest neighbors during subsequent search operations, as vectors within the same region are likely to be more similar to each other than those in different regions.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/06/nearest-neighbor-pgvector-diagram---3-1.png" class="kg-image" alt=" The process of assigning each vector to its closest centroid conceptually divides the space into distinct regions that surround each centroid" loading="lazy" width="1640" height="1040" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/06/nearest-neighbor-pgvector-diagram---3-1.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/06/nearest-neighbor-pgvector-diagram---3-1.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2023/06/nearest-neighbor-pgvector-diagram---3-1.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/06/nearest-neighbor-pgvector-diagram---3-1.png 1640w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The process of assigning each vector to its closest centroid conceptually divides the space into distinct regions that surround each centroid</span></figcaption></figure><h3 id="building-the-ivfflat-index-in-pgvector"><br>Building the IVFFlat index in pgvector</h3><p>IVFFlat proceeds to create an <a href="https://en.wikipedia.org/wiki/Inverted_index">inverted index</a> that maps each centroid to the set of vectors within the corresponding region. 
In pseudocode, the index can be represented as follows:</p><pre><code>inverted_index = {
  centroid_1: [vector_1, vector_2, ...],
  centroid_2: [vector_3, vector_4, ...],
  centroid_3: [vector_5, vector_6, ...],
  ...
}
</code></pre>
<p>Here, each centroid serves as a key in the inverted index, and the corresponding value is a list of vectors that belong to the region associated with that centroid. This index structure allows for efficient retrieval of vectors in a region when performing similarity searches.</p><h3 id="searching-the-ivfflat-index-in-pgvector">Searching the IVFFlat index in pgvector</h3><p>Let's imagine we have a query for the nearest neighbors to a vector represented by a question mark, as shown below:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/06/nearest-neighbor-pgvector-diagram---4-1.png" class="kg-image" alt="We want to find nearest neighbors to the vector represented by the question mark" loading="lazy" width="1640" height="1040" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/06/nearest-neighbor-pgvector-diagram---4-1.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/06/nearest-neighbor-pgvector-diagram---4-1.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2023/06/nearest-neighbor-pgvector-diagram---4-1.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/06/nearest-neighbor-pgvector-diagram---4-1.png 1640w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">We want to find nearest neighbors to the vector represented by the question mark</span></figcaption></figure><p>To find the approximate nearest neighbors using IVFFlat, the algorithm operates under the assumption that the nearest vectors will be located in the same region as the query vector. 
Based on this assumption, IVFFlat employs the following steps:</p><ol><li>Calculate the distance between the query vector (red question mark) and each centroid in the index.</li><li>Select the centroid with the smallest distance as the closest centroid to the query (the blue centroid in this example).</li><li>Retrieve the vectors associated with the region corresponding to the closest centroid from the inverted index.</li><li>Compute the distances between the query vector and each of the vectors in the retrieved set.</li><li>Select the K vectors with the smallest distances as the approximate nearest neighbors to the query.<br></li></ol><p>The use of the index in IVFFlat accelerates the search process by restricting the search to the region associated with the closest centroid. This results in a significant reduction in the number of vectors that need to be examined during the search. Specifically, if we have C clusters (centroids), the search examines, on average, only 1/C of the vectors, which reduces the number of vectors to search by a factor of C.</p><h3 id="searching-at-the-edge">Searching at the edge</h3><p>The assumption that the nearest vectors will be found in the same region as the query vector can introduce recall errors in IVFFlat. 
Consider the following query:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/06/nearest-neighbor-pgvector-diagram---5-1.png" class="kg-image" alt=" ivfflat can sometimes make errors when searching for nearest neighbors to a point at the edge of two regions of the vector space" loading="lazy" width="1640" height="1040" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/06/nearest-neighbor-pgvector-diagram---5-1.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/06/nearest-neighbor-pgvector-diagram---5-1.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2023/06/nearest-neighbor-pgvector-diagram---5-1.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/06/nearest-neighbor-pgvector-diagram---5-1.png 1640w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">IVFFlat</span><i><em class="italic" style="white-space: pre-wrap;"> can sometimes make errors when searching for nearest neighbors to a point at the edge of two regions of the vector space</em></i></figcaption></figure><p>From visual inspection, it becomes apparent that one of the light-blue vectors is closer to the query vector than any of the dark-blue vectors, despite the query vector falling within the dark-blue region. This illustrates a potential error in assuming that the nearest vectors will always be found within the same region as the query vector.<br></p><p>To mitigate this type of error, one approach is to search not only the region of the closest centroid but also the regions of the next closest R centroids. This approach expands the search scope and improves the chances of finding the true nearest neighbors. 
<br></p><p>In pgvector, this functionality is implemented through the <code>probes</code> parameter, which specifies the number of centroids to consider during the search, as described below.</p><h2 id="parameters-for-pgvector%E2%80%99s-ivfflat-implementation"><br>Parameters for Pgvector’s IVFFlat Implementation</h2><p>In the implementation of IVFFlat in pgvector, two key parameters are exposed: <code>lists</code> and <code>probes</code>.</p><h3 id="lists-parameter-in-pgvector">Lists parameter in pgvector</h3><p>The <code>lists</code> parameter determines the number of clusters created during index building (it is called <code>lists</code> because each centroid has a list of vectors in its region). Increasing this parameter reduces the number of vectors in each list and results in smaller regions.<br></p><p>It offers the following trade-offs to consider:</p><ul><li>A higher <code>lists</code> value speeds up queries by reducing the search space during query time.</li><li>However, it also decreases the region size, which can lead to more recall errors by excluding some points.</li><li>Additionally, more distance comparisons are required to find the closest centroid during step one of the query process.<br></li></ul><p>Here are some recommendations for setting the <code>lists</code> parameter:</p><ul><li>For datasets with fewer than one million rows, use <code>lists = rows / 1000</code>.</li><li>For datasets with more than one million rows, use <code>lists = sqrt(rows)</code>.</li><li>It is generally advisable to have at least 10 clusters.</li></ul><h3 id="probes-parameter-in-pgvector">Probes parameter in pgvector</h3><p>The <code>probes</code> parameter is a query-time parameter that determines the number of regions to consider during a query. By default, only the region corresponding to the closest centroid is searched (<code>probes = 1</code>). By increasing the <code>probes</code> parameter, more regions can be searched to improve recall at the cost of query speed. 
<br></p><p>The recommended value for the probes parameter is <code>probes = sqrt(lists)</code>.</p><h2 id="using-ivfflat-in-pgvector">Using IVFFlat in Pgvector</h2><p></p><h3 id="creating-an-index">Creating an index<br></h3><p>When creating an index, it is advisable to have existing data in the table, as it will be utilized by k-means to derive the centroids of the clusters.<br></p><p>The index in pgvector offers three different methods to calculate the distance between vectors: L2, inner product, and cosine. It is essential to select the same method for both the index creation and query operations. The following table illustrates the query operators and their corresponding index methods:</p>
<!--kg-card-begin: html-->
<table style="border:none;border-collapse:collapse;"><colgroup><col width="208"><col width="209"><col width="207"></colgroup><tbody><tr style="height:0pt"><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Distance type</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Query operator</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Index method</span></p></td></tr><tr style="height:0pt"><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 
5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">L2 / Euclidean</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">&lt;-&gt;</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">vector_l2_ops</span></p></td></tr><tr style="height:0pt"><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span 
style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Negative Inner product</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">&lt;#&gt;</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">vector_ip_ops</span></p></td></tr><tr style="height:0pt"><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Cosine</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 
1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">&lt;=&gt;</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">vector_cosine_ops</span></p></td></tr></tbody></table>
<!--kg-card-end: html-->
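<p>To make the three distance types in the table concrete, here is a minimal pure-Python sketch of what each one computes. The function names are illustrative assumptions, not part of pgvector; the sketch only mirrors the definitions of L2 distance, negative inner product, and cosine distance (one minus cosine similarity).</p>

```python
import math

def l2_distance(a, b):
    """Euclidean distance, the quantity the L2 operator orders by."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def negative_inner_product(a, b):
    """Inner product with the sign flipped, so that smaller still means closer."""
    return -sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    """One minus the cosine similarity; undefined for zero-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Two vectors that point in the same direction but differ in length.
a, b = [3.0, 4.0], [6.0, 8.0]
print(l2_distance(a, b))             # 5.0
print(negative_inner_product(a, b))  # -50.0
print(cosine_distance(a, b))         # 0.0
```

<p>The example vectors point in the same direction, so their cosine distance is zero even though their L2 distance is not. This is why the choice of distance type has to match what "similar" means for your embeddings.</p>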
<p><strong>Note</strong>: OpenAI <a href="https://platform.openai.com/docs/guides/embeddings/limitations-risks">recommends</a> cosine distance for its embeddings.</p><p>To create an index in pgvector using IVFFlat, use a statement of the following form:</p><pre><code class="language-SQL">CREATE INDEX ON &lt;table name&gt; USING ivfflat (&lt;column name&gt; &lt;index method&gt;) WITH (lists = &lt;lists parameter&gt;);
</code></pre>
<p>Replace <code>&lt;table name&gt;</code> with the name of your table, <code>&lt;column name&gt;</code> with the name of the column that contains the vector type, <code>&lt;index method&gt;</code> with the index method that matches your query operator in the table above, and <code>&lt;lists parameter&gt;</code> with the number of lists.</p><p>For example, if our table is named <code>embeddings</code> and our embedding vectors are in a column named <code>embedding</code>, we can create an IVFFlat index as follows:</p><pre><code class="language-SQL">CREATE INDEX ON embeddings USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);

</code></pre>
<p>Here’s a simple Python function that you can use to create an IVFFlat index with a <code>lists</code> value chosen according to the recommendations above (<code>probes</code> is a query-time setting, so it is not part of index creation):</p><pre><code class="language-Python">import math

# Maps each pgvector query operator to the matching index method.
INDEX_METHODS = {
    "&lt;-&gt;": "vector_l2_ops",
    "&lt;#&gt;": "vector_ip_ops",
    "&lt;=&gt;": "vector_cosine_ops",
}

def create_ivfflat_index(conn, table_name, column_name, query_operator="&lt;=&gt;"):
    if query_operator not in INDEX_METHODS:
        raise ValueError(f"unrecognized operator {query_operator}")
    index_method = INDEX_METHODS[query_operator]

    with conn.cursor() as cur:
        # table_name and column_name are interpolated directly: pass trusted identifiers only
        cur.execute(f"SELECT COUNT(*) as cnt FROM {table_name};")
        num_records = cur.fetchone()[0]

        # lists must be an integer: rows / 1000 up to one million rows
        # (but at least 10), and sqrt(rows) beyond that
        if num_records &gt; 1000000:
            num_lists = int(math.sqrt(num_records))
        else:
            num_lists = max(10, num_records // 1000)

        cur.execute(f'CREATE INDEX ON {table_name} USING ivfflat ({column_name} {index_method}) WITH (lists = {num_lists});')
        conn.commit()
</code></pre>
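<p>The sizing rules from the parameters section can also be written as small standalone helpers, which makes them easy to check without a database. This is a minimal sketch: the function names are hypothetical, and rounding down to a whole number is our assumption, since the recommendations do not say how to round.</p>

```python
import math

def recommended_lists(num_rows):
    """rows / 1000 up to one million rows, sqrt(rows) beyond that, never fewer than 10."""
    if num_rows > 1_000_000:
        return math.isqrt(num_rows)  # integer square root, rounded down
    return max(10, num_rows // 1000)

def recommended_probes(num_lists):
    """probes = sqrt(lists), rounded down, and at least 1."""
    return max(1, math.isqrt(num_lists))

for rows in (5_000, 500_000, 4_000_000):
    lists = recommended_lists(rows)
    print(rows, lists, recommended_probes(lists))
# 5000 10 3
# 500000 500 22
# 4000000 2000 44
```

<p>Treat these values as starting points and measure recall and latency on your own data before settling on them.</p>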
<h3 id="querying">Querying</h3><p>The index can be used whenever a query has an <code>ORDER BY</code> of the form <code>column &lt;query operator&gt; &lt;some pseudo-constant vector&gt;</code> along with a <code>LIMIT k</code>.<br></p><p><strong>Some examples</strong><br><br>Get the two vectors closest to a constant vector:</p><pre><code class="language-SQL">SELECT * FROM my_table ORDER BY embedding_column &lt;=&gt; '[1,2]' LIMIT 2;
</code></pre>
<p>This is a common usage pattern in retrieval-augmented generation (RAG) using LLMs, where we find the embedding vectors that are closest in semantic meaning to the user’s query. In that case, the constant vector would be the embedding vector representing the user’s query. </p><p>You can see an example of this in our guide to <a href="https://timescale.ghost.io/blog/postgresql-as-a-vector-database-create-store-and-query-openai-embeddings-with-pgvector/">creating, storing, and querying OpenAI embeddings with pgvector</a>, where we use this Python function to find the three most similar documents to a given user query from our database:</p><pre><code class="language-Python">import numpy as np
from pgvector.psycopg2 import register_vector

# Helper function: Get top 3 most similar documents from the database
def get_top3_similar_docs(query_embedding, conn):
    embedding_array = np.array(query_embedding)
    # Register the pgvector type on the connection so NumPy arrays can be passed as vectors
    register_vector(conn)
    # The cursor is closed automatically when the with block exits
    with conn.cursor() as cur:
        # Get the top 3 most similar documents using the cosine distance (&lt;=&gt;) operator
        cur.execute("SELECT content FROM embeddings ORDER BY embedding &lt;=&gt; %s LIMIT 3", (embedding_array,))
        top3_docs = cur.fetchall()
    return top3_docs
</code></pre>
<p>Get the two vectors closest to the vector stored in an existing row (here, the row with <code>id = 1</code>), excluding that row itself:</p><pre><code class="language-SQL">SELECT * FROM my_table WHERE id != 1 ORDER BY embedding_column &lt;=&gt; (SELECT embedding_column FROM my_table WHERE id = 1) LIMIT 2;
</code></pre>
<p><strong>Tip:</strong> The fact that PostgreSQL can use an index does not guarantee that it will! The cost-based planner evaluates query plans and may determine that a sequential scan or a different index is more efficient for a specific query. You can use the <code>EXPLAIN</code> command to see the chosen execution plan. To test the viability of using an index, you can modify planner costing parameters until you achieve the desired plan. For small datasets, setting <code>enable_seqscan = 0</code> is especially useful for testing viability because it discourages the planner from choosing sequential scans.<br></p><p>To adjust the <code>probes</code> parameter, you can set the <code>ivfflat.probes</code> variable. For instance, to set it to 5, execute the following statement in the same session before running the query:</p><pre><code class="language-sql">SET ivfflat.probes = 5;
</code></pre>
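<p>To see why raising <code>probes</code> can improve recall, here is a toy pure-Python sketch of the search procedure described earlier. It is not pgvector's implementation: the centroids are hand-picked rather than learned with k-means, the data is two-dimensional, and the function names are hypothetical. The query sits near the edge between two regions, and its true nearest neighbor lies in the neighboring region.</p>

```python
import math

def build_inverted_index(vectors, centroids):
    """Assign every vector to its nearest centroid: {centroid position: [vectors]}."""
    index = {i: [] for i in range(len(centroids))}
    for v in vectors:
        nearest = min(range(len(centroids)), key=lambda i: math.dist(v, centroids[i]))
        index[nearest].append(v)
    return index

def ivfflat_search(query, centroids, index, k=1, probes=1):
    """Scan only the `probes` regions whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda i: math.dist(query, centroids[i]))
    candidates = [v for i in order[:probes] for v in index[i]]
    return sorted(candidates, key=lambda v: math.dist(query, v))[:k]

centroids = [(0.0, 0.0), (10.0, 0.0)]  # hand-picked for the illustration
vectors = [(0.0, 1.0), (1.0, 0.0), (5.5, 0.0), (9.0, 1.0)]
index = build_inverted_index(vectors, centroids)

query = (4.5, 0.0)  # falls in the first region, close to the boundary
print(ivfflat_search(query, centroids, index, probes=1))  # [(1.0, 0.0)]
print(ivfflat_search(query, centroids, index, probes=2))  # [(5.5, 0.0)]
```

<p>With one probe, the search returns a vector at distance 3.5 and misses the true nearest neighbor at distance 1.0, which lives in the other region. With two probes, both regions are scanned and the true neighbor is found, at the cost of examining every vector in this tiny example.</p>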
<h3 id="dealing-with-data-changes">Dealing with data changes</h3><p>As your data evolves with inserts, updates, and deletes, the IVFFlat index will be updated accordingly. New vectors will be added to the index, while vectors that are no longer used will be removed. </p><p><strong>However, the clustering centroids will not be updated</strong>. Over time, this can result in a situation where the initial clustering, established during index creation, no longer accurately represents the data. This can be visualized as follows:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/06/nearest-neighbor-pgvector-diagram---6.png" class="kg-image" alt="As data gets inserted or deleted from the index, if the index is not rebuilt, the ivfflat index in pgvector can return incorrect approximate nearest neighbors due to clustering centroids no longer fitting the data well" loading="lazy" width="1640" height="1040" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/06/nearest-neighbor-pgvector-diagram---6.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/06/nearest-neighbor-pgvector-diagram---6.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2023/06/nearest-neighbor-pgvector-diagram---6.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/06/nearest-neighbor-pgvector-diagram---6.png 1640w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">As data gets inserted or deleted from the index, if the index is not rebuilt, the IVFFlat index in pgvector can return incorrect approximate nearest neighbors due to clustering centroids no longer fitting the data well</span></figcaption></figure><p>To address this issue, the only solution is to 
rebuild the index.<br></p><p>Here are two important takeaways from this issue:</p><ul><li>Build the index once you have all the representative data you want to reference in it. This is unlike most indexes, which can be built on an empty table.</li><li>It is advisable to periodically rebuild the index.<br></li></ul><p>When rebuilding the index, it is highly recommended to use the <code>CONCURRENTLY</code> option to avoid interfering with ongoing operations.<br></p><p>Thus, to rebuild the index, run the following in a cron job:</p><pre><code class="language-SQL">REINDEX INDEX CONCURRENTLY &lt;index name&gt;;
</code></pre>
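<p>If the scheduled job is a Python script rather than a plain SQL file, one detail matters: <code>REINDEX ... CONCURRENTLY</code> cannot run inside a transaction block, and psycopg2 opens a transaction implicitly unless autocommit is enabled. The sketch below assumes a psycopg2-style connection; <code>rebuild_index</code> and <code>quote_ident</code> are hypothetical helper names, and the index name is expected to be a single, unqualified identifier.</p>

```python
def quote_ident(name):
    """Quote an SQL identifier by wrapping it in double quotes and doubling any inside it."""
    return '"' + name.replace('"', '""') + '"'

def rebuild_index(conn, index_name):
    """Run REINDEX INDEX CONCURRENTLY outside a transaction block."""
    previous = conn.autocommit
    conn.autocommit = True  # assumption: the connection has no transaction open at this point
    try:
        with conn.cursor() as cur:
            cur.execute(f"REINDEX INDEX CONCURRENTLY {quote_ident(index_name)};")
    finally:
        # Restore whatever autocommit mode the caller was using
        conn.autocommit = previous
```

<p>Call it with an open connection and the name of your IVFFlat index. Because the rebuild reruns k-means on the current table contents, schedule it for a time when the table holds representative data.</p>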
<h2 id="summing-it-up">Summing It Up</h2><p>The IVFFlat algorithm in pgvector provides an efficient solution for approximate nearest neighbor search over high-dimensional data like embeddings. It works by clustering similar vectors into regions and building an inverted index to map each region to its vectors. This allows queries to focus on a subset of the data, enabling fast search. By tuning the lists and probes parameters, IVFFlat can balance speed and accuracy for a dataset.  </p><p></p><p>Overall, IVFFlat gives PostgreSQL the ability to perform fast semantic similarity search over complex data. With simple queries, applications can find the nearest neighbors to a query vector among millions of high-dimensional vectors. For natural language processing, information retrieval, and more, IVFFlat is a compelling solution. By understanding how IVFFlat divides the vector space into regions and builds its inverted index, you can optimize its performance for your needs and build powerful applications on top of it.</p><p>✨<strong>Resources for further learning:</strong> Now that you know more about the IVFFlat index in pgvector, here are some resources to further your learning journey:&nbsp;</p><ul><li>Learn about other PostgreSQL indexes for vector search, like <a href="https://timescale.ghost.io/blog/vector-database-basics-hnsw/" rel="noreferrer">HNSW</a>.</li><li>Learn how we made <a href="https://timescale.ghost.io/blog/how-we-made-postgresql-as-fast-as-pinecone-for-vector-data/" rel="noreferrer">PostgreSQL as fast as Pinecone for vector data</a>.</li><li>Follow our tutorial on <a href="https://timescale.ghost.io/blog/postgresql-as-a-vector-database-create-store-and-query-openai-embeddings-with-pgvector/" rel="noreferrer">creating, storing, and querying OpenAI embeddings using PostgreSQL as a vector database</a>. 
<a href="https://timescale.ghost.io/blog/how-to-build-llm-applications-with-pgvector-vector-store-in-langchain/" rel="noreferrer"><u>Learn how </u>to use pgvector as a vector store in LangChain</a>. <a href="https://timescale.ghost.io/blog/refining-vector-search-queries-with-time-filters-in-pgvector-a-tutorial/" rel="noreferrer">Or see how you can refine vector search queries using time filters in pgvector with a single SQL query</a>.</li></ul><p>And if you’re looking for a production-ready PostgreSQL database for your AI application’s vector, relational, and time-series data, <a href="https://www.timescale.com/ai" rel="noreferrer"><u>try Timescale Cloud</u></a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[PostgreSQL as a Vector Database: A Pgvector Tutorial]]></title>
            <description><![CDATA[Vector databases add organizational intelligence to AI. Learn how to use PostgreSQL as a vector database for retrieval-augmented generation with pgvector.]]></description>
            <link>https://www.tigerdata.com/blog/postgresql-as-a-vector-database-using-pgvector</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/postgresql-as-a-vector-database-using-pgvector</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[pgvector]]></category>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[OpenAI]]></category>
            <dc:creator><![CDATA[Avthar Sewrathan]]></dc:creator>
            <pubDate>Wed, 21 Jun 2023 18:22:10 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/06/Postgres-vector-database-and-OpenAI-embeddings-blog--1-.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/06/Postgres-vector-database-and-OpenAI-embeddings-blog--1-.png" alt="PostgreSQL as a Vector Database: Create, Store, and Query OpenAI Embeddings With pgvector. The OpenAI and Postgres logos" /><p>Vector databases enable efficient storage and searching of vector data. They are essential for developing and maintaining AI applications using large language models (LLMs).</p><p>With some help from the <a href="https://www.tigerdata.com/learn/postgresql-extensions-pgvector" rel="noreferrer"><u>pgvector extension</u></a>, you can leverage PostgreSQL as a vector database to store and query<a href="https://platform.openai.com/docs/guides/embeddings/what-are-embeddings?ref=timescale.com"> <u>OpenAI embeddings</u></a>. OpenAI embeddings are a type of data representation (in the shape of vectors, i.e., lists of numbers) used to measure the similarity of text strings for OpenAI’s models.</p><p>In this article, we work through the example of creating a chatbot to answer questions about Tiger Data (creators of TimescaleDB). The chatbot will be trained on content from the <a href="https://timescale.ghost.io/blog/tag/dev-q-a/"><u>Tiger Data Developer Q&amp;A blog posts</u></a>. 
This example will illustrate the key concepts for creating, storing, and querying OpenAI embeddings with PostgreSQL and pgvector.</p><p>This example has three parts:</p><ul><li>Part 1: How to create embeddings from content using the<a href="https://platform.openai.com/docs/api-reference?ref=timescale.com"> <u>OpenAI API</u></a>.</li><li>Part 2: How to use PostgreSQL as a vector database and store OpenAI embedding vectors using pgvector.</li><li>Part 3: How to use embeddings retrieved from a vector database to augment LLM generation.</li></ul><p>One could think of this as a “hello world” tutorial for building a chatbot that can reference a company knowledge base or developer docs.</p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">✨</div><div class="kg-callout-text"><b><strong style="white-space: pre-wrap;">Jupyter Notebook and Code:</strong></b> You can find all the code used in this tutorial in a Jupyter Notebook, as well as sample content and embeddings on the Tiger Data GitHub: <a href="https://github.com/timescale/vector-cookbook/tree/main/openai_pgvector_helloworld">timescale/vector-cookbook</a>. We recommend cloning the repo and following along by executing the code cells as you read through the tutorial.</div></div><h2 id="the-big-picture-openai-embeddings">The Big Picture: OpenAI Embeddings</h2><p>Foundational models of AI (e.g., GPT-3 or GPT-4) may be missing some information needed to provide accurate answers to certain specific questions. That’s because relevant information was not in the dataset used to train the model. (For example, the information is stored in private documents or only became available recently.) 
This lack of data may make these models unsuitable for use as a chatbot in specific information banks.</p><p><a href="https://www.promptingguide.ai/techniques/rag?ref=timescale.com"><u>Retrieval-augmented generation</u></a> (RAG) gives a simple solution; it provides additional context to the foundational model in the prompt. </p><p>This technique is powerful—it allows you to “teach” foundational models about things only you know about and use that to create a ChatGPT++ experience for your users!</p><p>But what context should you provide to the model? If you have a library of information, how can you determine what’s relevant to a given question? That is what <a href="https://www.timescale.com/blog/a-beginners-guide-to-vector-embeddings" rel="noreferrer">embeddings</a> are for. <a href="https://platform.openai.com/docs/guides/embeddings/what-are-embeddings?ref=timescale.com"><u>OpenAI embeddings</u></a> are a mathematical representation of the semantic meaning of a piece of text that allows for <em>similarity search</em>.</p><p>With this representation, when you get a user question and calculate its embedding, you can use a similarity search against data embeddings in your library to find the most relevant information. But that requires having an embedding representation of your library.&nbsp;&nbsp;</p><h3 id="what-is-a-vector-database">What is a vector database?</h3><p>A <a href="https://www.timescale.com/blog/how-to-choose-a-vector-database"><u>vector database</u></a> is a database that can handle vector data. Vector databases are useful for:</p><ul><li><a href="https://www.tigerdata.com/learn/vector-search-vs-semantic-search" rel="noreferrer"><strong>Semantic search</strong></a><strong>:</strong> Vector databases facilitate semantic search, which considers the context or meaning of search terms rather than just exact matches. 
They are useful for recommendation systems, content discovery, and question-answering systems.</li><li><strong>Efficient similarity search:</strong> Vector databases are designed for efficient high-dimensional nearest neighbor search, a task where traditional relational databases struggle.</li><li><strong>Machine learning:</strong> Vector databases store and search embeddings created by machine-learning models. This feature aids in finding items semantically similar to a given item.</li><li><strong>Multimedia data handling:</strong> Vector databases also excel in working with multimedia data (images, audio, video) by converting them into high-dimensional vectors for efficient similarity search.</li><li><strong>NLP and data combination:</strong> In natural language processing (NLP), vector databases store high-dimensional vectors representing words, sentences, or documents. They also allow a combination of traditional SQL queries with similarity searches, accommodating both structured and unstructured data.</li></ul><p>We’ll use PostgreSQL with the <a href="https://github.com/pgvector/pgvector"><u>pgvector extension</u></a> installed as our vector database. Pgvector extends PostgreSQL to handle vector data types and vector similarity search, like <a href="https://en.wikipedia.org/wiki/Nearest_neighbor_search"><u>nearest neighbor search</u></a>, which we’ll use to find the k most related embeddings in our database for a given user prompt.</p><h2 id="using-pgvector-for-a-postgresql-vector-database">Using Pgvector for a PostgreSQL Vector Database</h2><p><a href="https://www.timescale.com/learn/using-pgvector-with-python"><u>Pgvector</u></a> is an open-source extension for PostgreSQL that enables storing and searching over machine learning-generated embeddings. It provides different capabilities that allow users to identify exact and approximate nearest neighbors. 
Pgvector is designed to work seamlessly with other PostgreSQL features, including indexing and querying.</p><p>Now we’re ready to start building our chatbot!</p><h3 id="why-use-pgvector-as-a-vector-database">Why use pgvector as a vector database?</h3><p>Here are five reasons <a href="https://www.tigerdata.com/blog/postgres-for-everything" rel="noreferrer">why <u>PostgreSQL</u></a> is a good choice for storing and handling vector data:</p><ul><li><strong>Integrated solution:</strong> By using PostgreSQL as a vector database, you keep your data in one place. This can simplify your architecture by reducing the need for multiple databases or additional services.</li><li><strong>Enterprise-level robustness and operations:</strong> With a 30-year pedigree, PostgreSQL provides world-class data integrity, operations, and robustness. This includes backups, streaming replication, role-based and row-level security, and ACID compliance.</li><li><strong>Full-featured SQL:</strong> PostgreSQL supports a rich set of SQL features, including joins, subqueries, window functions, and more. This allows for powerful and complex queries that can include both traditional relational data and vector data. It also integrates with a plethora of existing data science and data analysis tools.</li><li><strong>Scalability and performance:</strong> PostgreSQL is known for its robustness and ability to handle large datasets. Using it as a vector database allows you to leverage these characteristics for vector data as well.</li><li><strong>Open source:</strong> PostgreSQL is open source, which means it's free to download and use, and you can modify it to suit your needs. It also means that it benefits from the collective input of developers all over the world, which often results in high-quality, secure, and up-to-date software. PostgreSQL has a large and active community, so help is readily available. 
There are many resources, including documentation, tutorials, forums, and more, to help you troubleshoot and optimize your PostgreSQL database.</li></ul><h2 id="setting">Setting Up</h2><ul><li>Install Python.</li><li>Install and configure a Python virtual environment. We recommend <a href="https://github.com/pyenv/pyenv">Pyenv</a>.</li><li>Install the requirements for this notebook using the following command:</li></ul><pre><code class="language-Python">pip install -r requirements.txt
</code></pre>
<p>Import all the packages we will be using:</p><pre><code class="language-Python">import openai
import os
import pandas as pd
import numpy as np
import json
import tiktoken
import psycopg2
import ast
import pgvector
import math
from psycopg2.extras import execute_values
from pgvector.psycopg2 import register_vector
</code></pre>
<p>You’ll need to <a href="https://platform.openai.com/overview">sign up for an OpenAI Developer Account</a> and create an OpenAI API Key – we recommend getting a paid account to avoid rate limiting and setting a spending cap so that you avoid any surprises with bills.</p><p>Once you have an OpenAI API key, it’s a <a href="https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety">best practice</a> to store it as an environment variable and then have your Python program read it.</p><pre><code class="language-Python">#First, run export OPENAI_API_KEY=sk-YOUR_OPENAI_API_KEY...


# Get openAI api key by reading local .env file
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) 
openai.api_key  = os.environ['OPENAI_API_KEY'] 
</code></pre>
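<p>If the variable is missing, <code>os.environ['OPENAI_API_KEY']</code> fails with a bare <code>KeyError</code>. A minimal sketch of a friendlier check follows. The helper name <code>require_env</code> is our own illustration and is not part of the OpenAI SDK or the notebook:</p>

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise a clear error.

    `require_env` is a hypothetical helper used for illustration only.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Export it in your shell or add it to a local .env file."
        )
    return value
```

<p>With this in place, the assignment above could be written as <code>openai.api_key = require_env("OPENAI_API_KEY")</code>.</p>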
<h2 id="part-1-create-embeddings-for-your-postgresql-vector-database">Part 1: Create Embeddings for Your PostgreSQL Vector Database</h2><p><a href="https://platform.openai.com/docs/guides/embeddings/what-are-embeddings">Embeddings</a> measure how related text strings are. First, we'll create embeddings using the OpenAI API on some text we want the LLM to answer questions on.</p><p>In this example, we'll use content from the Tiger Data blog, specifically from the <a href="https://timescale.ghost.io/blog/tag/dev-q-a/">Developer Q&amp;A section</a>, which features posts by Tiger Data users talking about their real-world use cases.</p><p>You can replace this blog data with any text you want to embed, such as your own company blog, developer documentation, internal knowledge base, or any other information you’d like to have a “ChatGPT-like” experience over.</p><pre><code class="language-Python"># Load your CSV file into a pandas DataFrame
df = pd.read_csv('blog_posts_data.csv')
df.head()
</code></pre>
<p>The output looks like this:</p>
<!--kg-card-begin: html-->
<table style="border:none;border-collapse:collapse;"><colgroup><col width="23"><col width="162"><col width="146"><col width="262"></colgroup><tbody><tr style="height:16.5pt"><td style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:middle;padding:3pt 6pt 3pt 6pt;overflow:hidden;overflow-wrap:break-word;"><br></td><td style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:middle;padding:3pt 6pt 3pt 6pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Title</span></p></td><td style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:middle;padding:3pt 6pt 3pt 6pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Content</span></p></td><td style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:middle;padding:3pt 6pt 3pt 6pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span 
style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">URL</span></p></td></tr><tr style="height:38.25pt"><td style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:middle;padding:3pt 6pt 3pt 6pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">0</span></p></td><td style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:top;padding:3pt 6pt 3pt 6pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">How to Build a Weather Station With Elixir, Ne...</span></p></td><td style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:top;padding:3pt 6pt 3pt 6pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">This is an installment of our “Community 
Membe...</span></p></td><td style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:top;padding:3pt 6pt 3pt 6pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">https://www.timescale.com/blog/how-to-build-a-...</span></p></td></tr><tr style="height:38.25pt"><td style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:middle;padding:3pt 6pt 3pt 6pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">1</span></p></td><td style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:top;padding:3pt 6pt 3pt 6pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">CloudQuery on Using PostgreSQL for Cloud Asset...</span></p></td><td style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:top;padding:3pt 6pt 3pt 
6pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">This is an installment of our “Community Membe...</span></p></td><td style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:top;padding:3pt 6pt 3pt 6pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">https://www.timescale.com/blog/cloudquery-on-u...</span></p></td></tr><tr style="height:38.25pt"><td style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:middle;padding:3pt 6pt 3pt 6pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">2</span></p></td><td style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:top;padding:3pt 6pt 3pt 6pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span 
style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">How a Data Scientist Is Building a Time-Series...</span></p></td><td style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:top;padding:3pt 6pt 3pt 6pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">This is an installment of our “Community Membe...</span></p></td><td style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:top;padding:3pt 6pt 3pt 6pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">https://www.timescale.com/blog/how-a-data-scie...</span></p></td></tr><tr style="height:38.25pt"><td style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:middle;padding:3pt 6pt 3pt 6pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span 
style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">3</span></p></td><td style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:top;padding:3pt 6pt 3pt 6pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">How Conserv Safeguards History: Building an En...</span></p></td><td style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:top;padding:3pt 6pt 3pt 6pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">This is an installment of our “Community Membe...</span></p></td><td style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:top;padding:3pt 6pt 3pt 6pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span 
style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">https://www.timescale.com/blog/how-conserv-saf...</span></p></td></tr><tr style="height:38.25pt"><td style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:middle;padding:3pt 6pt 3pt 6pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">4</span></p></td><td style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:top;padding:3pt 6pt 3pt 6pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">How Messari Uses Data to Open the Cryptoeconom...</span></p></td><td style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:top;padding:3pt 6pt 3pt 6pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">This is 
an installment of our “Community Membe...</span></p></td><td style="border-left:solid #000000 0.75pt;border-right:solid #000000 0.75pt;border-bottom:solid #000000 0.75pt;border-top:solid #000000 0.75pt;vertical-align:top;padding:3pt 6pt 3pt 6pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">https://www.timescale.com/blog/how-messari-use...</span></p></td></tr></tbody></table>
<!--kg-card-end: html-->
<h3 id="11-calculate-the-cost-of-embedding-data">1.1 Calculate the cost of embedding data</h3><p>It's usually a good idea to estimate how much creating embeddings for your selected content will cost before you create them. The helper functions below calculate that estimate so we can avoid surprises.</p><p>OpenAI charges per token for the embeddings you create. The total cost for the blog posts we want to embed will be less than $0.01, thanks to OpenAI’s small text embedding model, <a href="https://openai.com/index/new-embedding-models-and-api-updates/"><u>text-embedding-3-small</u></a>. This model offers not only stronger performance but also a 5X cost reduction compared to its predecessor, <a href="https://openai.com/blog/new-and-improved-embedding-model?ref=timescale.com"><u>text-embedding-ada-002</u></a>.</p><pre><code class="language-Python"># Helper functions to help us create the embeddings

# Helper func: calculate number of tokens
def num_tokens_from_string(string: str, encoding_name = "cl100k_base") -&gt; int:
    if not string:
        return 0
    # Returns the number of tokens in a text string
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

# Helper function: calculate length of essay
def get_essay_length(essay):
    word_list = essay.split()
    num_words = len(word_list)
    return num_words

# Helper function: calculate cost of embedding num_tokens
# Assumes we're using the text-embedding-3-small model,
# listed at $0.00002 per 1K tokens ($0.02 per 1M); verify the current rate
# See https://openai.com/pricing
def get_embedding_cost(num_tokens):
    return num_tokens/1000*0.00002

# Helper function: calculate total cost of embedding all content in the dataframe
def get_total_embeddings_cost():
    total_tokens = 0
    for i in range(len(df.index)):
        text = df['content'][i]
        token_len = num_tokens_from_string(text)
        total_tokens = total_tokens + token_len
    total_cost = get_embedding_cost(total_tokens)
    return total_cost

</code></pre>
<pre><code class="language-Python"># quick check on total token amount for price estimation
total_cost = get_total_embeddings_cost()
print("estimated price to embed this content = $" + str(total_cost))

</code></pre>
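<p>As a rough worked example of the arithmetic, the sketch below takes the rate as a parameter so it is easy to update when pricing changes. The function name, the <code>usd_per_million_tokens</code> parameter, and its default of $0.02 are assumptions for illustration; check the pricing page for the model you use:</p>

```python
def estimate_embedding_cost(num_tokens: int, usd_per_million_tokens: float = 0.02) -> float:
    """Estimate the embedding cost in USD for `num_tokens` tokens.

    The default rate is an assumption; check the provider's pricing page.
    """
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    return num_tokens / 1_000_000 * usd_per_million_tokens


# 250,000 tokens is an arbitrary illustrative figure,
# not a measurement of the blog dataset.
print(f"${estimate_embedding_cost(250_000):.4f}")
# $0.0050
```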
<h3 id="12-create-smaller-chunks-of-content">1.2 Create smaller chunks of content</h3><p>The OpenAI API has a <a href="https://platform.openai.com/docs/guides/embeddings/what-are-embeddings"><u>limit</u></a> on the number of tokens it can embed in a single request: 8,191, to be specific. To stay under this limit, we'll break up our text into smaller chunks. Generally, it's a best practice to “chunk” the documents you want to embed into groups of a fixed token size.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/10/Making-PostgreSQL-a-Vector-Database-pgvector-tutorial_embedding-models.png" class="kg-image" alt="A table with the performance eval of the OpenAI embedding models" loading="lazy" width="1748" height="546" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2024/10/Making-PostgreSQL-a-Vector-Database-pgvector-tutorial_embedding-models.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2024/10/Making-PostgreSQL-a-Vector-Database-pgvector-tutorial_embedding-models.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2024/10/Making-PostgreSQL-a-Vector-Database-pgvector-tutorial_embedding-models.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/10/Making-PostgreSQL-a-Vector-Database-pgvector-tutorial_embedding-models.png 1748w" sizes="(min-width: 720px) 720px"></figure><p>The precise number of tokens to include in a chunk depends on your use case and your model’s context window (the number of input tokens it can handle in a prompt).</p><p>For our purposes, we'll aim for chunks of around 512 tokens each. Chunking text is a complex topic worthy of its own blog post. Below, we’ll illustrate a simple method that we found works well. 
If you want to read about other approaches, we recommend <a href="https://python.langchain.com/docs/how_to/#text-splitters"><u>this section</u></a> of the LangChain docs.</p><p><strong>Note:</strong> If you prefer to skip this step, you can use the provided file: <a href="https://github.com/timescale/vector-cookbook/tree/main/openai_pgvector_helloworld">blog_data_and_embeddings.csv</a>, which contains the data and embeddings that you'll generate in this step.</p><p>The code below creates a new list of our blog content while retaining each chunk's metadata, such as the title and URL of the blog post it came from.</p><pre><code class="language-Python">
# Create new list with small content chunks to not hit max token limits
# Note: the maximum number of tokens for a single request is 8191
# https://platform.openai.com/docs/guides/embeddings/embedding-models

# list for chunked content and embeddings
new_list = []
# Split up the text into token sizes of around 512 tokens
for i in range(len(df.index)):
    text = df['content'][i]
    token_len = num_tokens_from_string(text)
    if token_len &lt;= 512:
        new_list.append([df['title'][i], df['content'][i], df['url'][i], token_len])
    else:
        # add content to the new list in chunks
        start = 0
        ideal_token_size = 512
        # 1 token ~ 3/4 of a word
        ideal_size = int(ideal_token_size // (4/3))
        end = ideal_size
        # split text on whitespace into words; split() with no arguments
        # already discards empty strings, so no extra filtering is needed
        words = text.split()

        total_words = len(words)
        
        #calculate iterations
        chunks = total_words // ideal_size
        if total_words % ideal_size != 0:
            chunks += 1
        
        new_content = []
        for j in range(chunks):
            if end &gt; total_words:
                end = total_words
            new_content = words[start:end]
            new_content_string = ' '.join(new_content)
            new_content_token_len = num_tokens_from_string(new_content_string)
            if new_content_token_len &gt; 0:
                new_list.append([df['title'][i], new_content_string, df['url'][i], new_content_token_len])
            start += ideal_size
            end += ideal_size

</code></pre>
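<p>The word-window logic above can also be packaged as a small reusable function. This is a sketch rather than the notebook's code: <code>chunk_words</code> and <code>words_per_chunk</code> are hypothetical names, and because it splits on whitespace only, the token count of each chunk remains approximate:</p>

```python
from typing import List


def chunk_words(text: str, words_per_chunk: int = 383) -> List[str]:
    """Split `text` into consecutive chunks of at most `words_per_chunk` words.

    383 words approximates 512 tokens under the rule of thumb that
    1 token is about 3/4 of a word.
    """
    if words_per_chunk <= 0:
        raise ValueError("words_per_chunk must be positive")
    words = text.split()  # split() with no arguments drops all whitespace runs
    return [
        " ".join(words[start:start + words_per_chunk])
        for start in range(0, len(words), words_per_chunk)
    ]


print(chunk_words("one two three four five", words_per_chunk=2))
# ['one two', 'three four', 'five']
```

<p>Each returned chunk could then be paired with its title and URL and measured with <code>num_tokens_from_string</code>, as in the loop above.</p>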
<p>Now that our text is chunked better, we can create embeddings for each chunk of text using the OpenAI API.</p><p>We’ll use this helper function to create embeddings for a piece of text:</p><pre><code class="language-Python">openai_client = openai.OpenAI()

# Helper function: get embeddings for a text
def get_embeddings(text):
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input = text.replace("\n"," ")
    )
    return response.data[0].embedding</code></pre>
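<p>The embeddings endpoint also accepts a list of strings as <code>input</code>, so several chunks can be sent in one request. Here is a hedged sketch of the batching logic: <code>embed_in_batches</code> and <code>embed_batch</code> are hypothetical names, and <code>embed_batch</code> stands for any callable that maps a list of texts to a list of vectors in the same order, which keeps the logic testable without network access:</p>

```python
from typing import Callable, List, Sequence

Vector = List[float]


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[Vector]],
    batch_size: int = 64,
) -> List[Vector]:
    """Embed `texts` in groups of `batch_size`, preserving input order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    vectors: List[Vector] = []
    for start in range(0, len(texts), batch_size):
        # Mirror the single-text helper: newlines are replaced with spaces.
        batch = [t.replace("\n", " ") for t in texts[start:start + batch_size]]
        result = embed_batch(batch)
        if len(result) != len(batch):
            raise ValueError("embed_batch returned a different number of vectors")
        vectors.extend(result)
    return vectors


# Stand-in embedder for demonstration only: one number per text (its length).
def fake_embed(batch: List[str]) -> List[Vector]:
    return [[float(len(t))] for t in batch]


print(embed_in_batches(["a", "bb\ncc", "ddd"], fake_embed, batch_size=2))
# [[1.0], [5.0], [3.0]]
```

<p>With the OpenAI client, <code>embed_batch</code> could wrap <code>openai_client.embeddings.create(model="text-embedding-3-small", input=batch)</code> and return the <code>embedding</code> of each item in <code>response.data</code>. Sorting those items by their <code>index</code> field first is a cautious way to keep them aligned with the inputs.</p>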
<p>Then, create embeddings for each chunk of content:</p><pre><code class="language-Python"># Create embeddings for each piece of content
for i in range(len(new_list)):
   text = new_list[i][1]
   embedding = get_embeddings(text)
   new_list[i].append(embedding)

# Create a new dataframe from the list
df_new = pd.DataFrame(new_list, columns=['title', 'content', 'url', 'tokens', 'embeddings'])
df_new.head()

</code></pre>
<p>The new data frame should look like this:</p>
<!--kg-card-begin: html-->
<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-0pky{border-color:inherit;text-align:left;vertical-align:top}
.tg .tg-0lax{text-align:left;vertical-align:top}
</style>
<table class="tg">
<thead>
  <tr>
    <th class="tg-0pky"></th>
    <th class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">Title</span></th>
    <th class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">Content</span></th>
    <th class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">URL</span></th>
    <th class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">Tokens</span></th>
    <th class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">Embeddings</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">0</span></td>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">How to Build a Weather Station With Elixir, Ne...</span></td>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">This is an installment of our “Community Membe...</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">https://www.timescale.com/blog/how-to-build-a-...</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">501</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">[0.021440856158733368, 0.02200360782444477, -0...</span></td>
  </tr>
  <tr>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">1</span></td>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">How to Build a Weather Station With Elixir, Ne...</span></td>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">capture weather and environmental data. In all...</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">https://www.timescale.com/blog/how-to-build-a-...</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">512</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">[0.016165969893336296, 0.011341351084411144, 0...</span></td>
  </tr>
  <tr>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">2</span></td>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">How to Build a Weather Station With Elixir, Ne...</span></td>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">command in their database migration:SELECT cre...</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">https://www.timescale.com/blog/how-to-build-a-...</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">374</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">[0.022517921403050423, -0.0019158280920237303,...</span></td>
  </tr>
  <tr>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">3</span></td>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">CloudQuery on Using PostgreSQL for Cloud Asset...</span></td>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">This is an installment of our “Community Membe...</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">https://www.timescale.com/blog/cloudquery-on-u...</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">519</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">[0.009028822183609009, -0.005185891408473253, ...</span></td>
  </tr>
  <tr>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">4</span></td>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">CloudQuery on Using PostgreSQL for Cloud Asset...</span></td>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">Architecture with CloudQuery SDK- Writing plug...</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">https://www.timescale.com/blog/cloudquery-on-u...</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">511</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">[0.02050386555492878, 0.010169642977416515, 0....</span></td>
  </tr>
</tbody>
</table>
<!--kg-card-end: html-->
<p><br>As an optional but recommended step, you can save the original blog content along with associated embeddings in a CSV file for reference later on so that you don't have to recreate embeddings if you want to reference it in another project.</p><pre><code class="language-Python"># Save the dataframe with embeddings as a CSV file
df_new.to_csv('blog_data_and_embeddings.csv', index=False)
</code></pre>
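<p>One thing to keep in mind when you reload such a file: CSV has no list type, so each embedding comes back as a string and has to be parsed before you can use it as a vector. The sketch below illustrates that round trip using only the standard library. The sample rows and the JSON serialization are assumptions for illustration and are not part of the pipeline above. If a cell held a plain Python list, the string form pandas writes can typically be parsed the same way.</p>

```python
import csv
import io
import json

# Hypothetical rows standing in for df_new; real embeddings have far more dimensions.
rows = [
    {"title": "Post A", "embeddings": [0.021, 0.022, -0.003]},
    {"title": "Post B", "embeddings": [0.016, 0.011, 0.004]},
]

# Write the rows to an in-memory CSV, serializing each embedding as JSON text.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "embeddings"])
writer.writeheader()
for row in rows:
    writer.writerow({"title": row["title"], "embeddings": json.dumps(row["embeddings"])})

# Read the CSV back: every cell is a string until we parse it.
buffer.seek(0)
restored = [
    {"title": r["title"], "embeddings": json.loads(r["embeddings"])}
    for r in csv.DictReader(buffer)
]
print(restored[0]["embeddings"])
```

<p>Parsing the column once, right after loading, keeps the rest of the code working with lists of floats rather than strings.</p>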
<h2 id="pro-tip-automating-embedding-creation-with-pgai-vectorizer">Pro Tip: Automating Embedding Creation with pgai Vectorizer</h2><p>In the section above, we showed how to manually create and manage embeddings in your own data pipeline – chunking content, calling the OpenAI API, and storing the results. While this approach helps you understand the fundamentals, in production, you may want to automate this process completely. Let’s look at how <a href="https://github.com/timescale/pgai/blob/main/docs/vectorizer.md"><u>pgai Vectorizer</u></a> can handle this entire pipeline for you!&nbsp;</p><p>Managing embeddings in production involves several challenges: keeping embeddings in sync with changing content, handling API failures, and optimally chunking text.&nbsp;</p><p><a href="https://www.tigerdata.com/blog/pgai-giving-postgresql-developers-ai-engineering-superpowers" rel="noreferrer">pgai</a> Vectorizer automates this entire process directly in PostgreSQL - similar to how PostgreSQL automatically <a href="https://docs.timescale.com/use-timescale/latest/schema-management/indexing?ref=timescale.com"><u>maintains indexes</u></a> for your tables.</p><h3 id="setting-up-pgai-vectorizer">Setting Up pgai Vectorizer</h3><p>The setup process differs depending on whether you’re using Tiger Cloud (formerly Timescale Cloud) or hosting PostgreSQL yourself.&nbsp;</p><p><strong>On Tiger Cloud</strong></p><pre><code class="language-python">-- 1. Store your OpenAI API key securely in Timescale Cloud
-- 2. Navigate to Project Settings &gt; AI Model API Keys in the Timescale Console
-- 3. The key is stored securely and not in your database
-- 4. Create the extensions
CREATE EXTENSION IF NOT EXISTS ai;</code></pre><p><strong>For self-hosted PostgreSQL</strong></p><pre><code class="language-bash">export OPENAI_API_KEY="your-api-key-here"

# Start the vectorizer worker
vectorizer-worker --connection="postgres://user:password@host:port/dbname"</code></pre><h3 id="creating-your-first-vectorizer">Creating Your First Vectorizer</h3><p>Instead of manually creating embeddings using Python, you can define a <em>vectorizer</em> that automatically generates and maintains embeddings for your content:</p><pre><code class="language-python">SELECT ai.create_vectorizer( 
   'blog_posts'::regclass,
    destination =&gt; 'blog_embeddings',
    embedding =&gt; ai.embedding_openai('text-embedding-3-small', 768),
    chunking =&gt; ai.chunking_recursive_character_text_splitter('content'),
    -- Pro tip: Add blog title as context to each chunk
    formatting =&gt; ai.formatting_python_template('$title: $chunk')
);</code></pre><p>This single SQL command:</p><ol><li>Automatically chunks your blog content</li><li>Creates embeddings for each chunk using OpenAI's API</li><li>Maintains embeddings as your content changes</li><li>Creates a view that joins your content with its embeddings</li></ol><h3 id="searching-with-vectorizer">Searching with Vectorizer</h3><p>You can then search your content the same way as before:</p><pre><code class="language-Python">SELECT 
   chunk,
   embedding &lt;=&gt; ai.openai_embed('text-embedding-3-small', 'How is Timescale used in IoT?', dimensions =&gt; 768) as distance
FROM blog_embeddings
ORDER BY distance
LIMIT 3;</code></pre><p>Vectorizer runs automatically every five minutes on <a href="https://console.cloud.timescale.com/signup?ref=timescale.com"><u>Tiger Cloud</u></a>, handling retries and keeping your embeddings up to date. For more details on setup and advanced features like <a href="https://github.com/timescale/pgai/blob/main/docs/vectorizer.md#monitor-a-vectorizer"><u>monitoring the Vectorizer</u></a>, see our pgai Vectorizer <a href="https://github.com/timescale/pgai/blob/main/docs/vectorizer.md"><u>documentation</u></a>.&nbsp;</p><h3 id="further-reading-on-rag">Further Reading on RAG</h3><p>The accuracy and cost of your RAG application depend heavily on implementation choices, from embedding model selection to chunking strategies.&nbsp;</p><p>Here are more blog posts to help you build effective RAG applications with PostgreSQL:</p><ol><li><a href="https://timescale.ghost.io/blog/vector-databases-are-the-wrong-abstraction?ref=timescale.com"><u>Vector Databases Are the Wrong Abstraction</u></a> – learn why general-purpose databases with vector extensions like <a href="https://timescale.ghost.io/blog/pgvector-is-now-as-fast-as-pinecone-at-75-less-cost/"><u>pgvectorscale</u></a> often provide better solutions than specialized vector databases</li><li><a href="https://timescale.ghost.io/blog/which-rag-chunking-and-formatting-strategy-is-best?ref=timescale.com"><u>Which RAG Chunking and Formatting Strategy Is Best?</u></a> – Explore different approaches to chunking and formatting your content for optimal retrieval-augmented generation (RAG) performance</li><li><a href="https://www.tigerdata.com/blog/which-openai-embedding-model-is-best" rel="noreferrer"><u>Which OpenAI Embedding Model Is Best?</u></a> - Compare OpenAI's embedding models to choose the right one for your use case</li></ol><h2 id="part-2-store-embeddings-in-a-postgresql-vector-database-using-pgvector">Part 2: Store Embeddings in a PostgreSQL Vector Database Using Pgvector</h2><p>Now that we 
have created embedding vectors for our blog content, the next step is to store the embedding vectors in a vector database to help us perform a fast search over many vectors.</p><h3 id="21-create-a-postgresql-database-and-install-pgvector">2.1 Create a PostgreSQL database and <a href="https://www.tigerdata.com/learn/postgresql-extensions-pgvector" rel="noreferrer">install pgvector</a></h3><p>First, we’ll create a PostgreSQL database. You can <a href="https://docs.timescale.com/getting-started/latest/services/" rel="noreferrer">create a cloud PostgreSQL database</a> in minutes for free on <a href="https://console.cloud.timescale.com/signup">Tiger Cloud</a> or use a local PostgreSQL database for this step. </p><p>Once you’ve created your PostgreSQL database, export your connection string as an environment variable, and just like the OpenAI API key, we’ll read it into our Python program from the environment file:</p><pre><code class="language-Python"># Timescale database connection string
# Found under "Service URL" of the credential cheat-sheet or "Connection Info" in the Timescale console
# In terminal, run: export TIMESCALE_CONNECTION_STRING=postgres://&lt;fill in here&gt;

connection_string  = os.environ['TIMESCALE_CONNECTION_STRING']

</code></pre>
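<p>Reading <code>os.environ[...]</code> directly raises a bare <code>KeyError</code> when the variable is missing. As a minimal sketch, the hypothetical helper below (its name <code>get_connection_string</code> is our own, not part of any library) fails with a more descriptive message instead:</p>

```python
import os

def get_connection_string(var_name="TIMESCALE_CONNECTION_STRING"):
    """Return the connection string from the environment or raise a clear error."""
    value = os.environ.get(var_name, "").strip()
    if not value:
        raise RuntimeError(
            f"{var_name} is not set; export it in your shell before running this script"
        )
    return value
```

<p>You could then write <code>connection_string = get_connection_string()</code> in place of the direct lookup.</p>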
<p>We then connect to our database using the popular <a href="https://pypi.org/project/psycopg2/?ref=timescale.com"><u>psycopg2</u></a> Python library and install the pgvector and <a href="https://github.com/timescale/pgvectorscale?tab=readme-ov-file#installation"><u>pgvectorscale</u></a> extensions (the latter provides powerful filtering and indexing capabilities) as follows:</p><pre><code class="language-Python"># Connect to PostgreSQL database in Timescale using connection string
conn = psycopg2.connect(connection_string)
cur = conn.cursor()

#install pgvector
cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
conn.commit()

#install pgvectorscale
cur.execute("CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE;")
conn.commit()
</code></pre>
<h3 id="22-connect-to-and-configure-your-vector-database">2.2 Connect to and configure your vector database</h3><p>Once we’ve installed pgvector, we use the <a href="https://github.com/pgvector/pgvector-python#psycopg-2">register_vector()</a> command to register the vector type with our connection:</p><pre><code class="language-Python"># Register the vector type with psycopg2
register_vector(conn)
</code></pre>
<p>Once we’ve connected to the database, let’s create a table that we’ll use to store embeddings along with metadata. Our table will look as follows:<br></p>
<!--kg-card-begin: html-->
<table style="border:none;border-collapse:collapse;"><colgroup><col width="44"><col width="48"><col width="60"><col width="87"><col width="84"><col width="112"></colgroup><tbody><tr style="height:0pt"><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">id</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">title&nbsp;</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">url</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 
5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">content</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">tokens</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">embedding</span></p></td></tr></tbody></table>
<!--kg-card-end: html-->
<p></p><ul><li><code>id</code> represents the unique ID of each <a href="https://www.tigerdata.com/blog/a-beginners-guide-to-vector-embeddings" rel="noreferrer">vector embedding</a> in the table.</li><li><code>title</code> is the blog title from which the content associated with the embedding is taken.</li><li><code>url</code> is the blog URL from which the content associated with the embedding is taken.</li><li><code>content</code> is the actual blog content associated with the embedding.</li><li><code>tokens</code> is the number of tokens the embedding represents.</li><li><code>embedding</code> is the vector representation of the content.<br></li></ul><p>One advantage of using PostgreSQL as a vector database is that you can easily store metadata and embedding vectors in the same database, which is helpful for supplying users with relevant information related to the response they receive, such as links to read more or the specific parts of a blog post that are relevant to them.</p><pre><code class="language-Python"># Create table to store embeddings and metadata
table_create_command = """
CREATE TABLE embeddings (
            id bigserial primary key, 
            title text,
            url text,
            content text,
            tokens integer,
            embedding vector(1536)
            );
            """

cur.execute(table_create_command)
cur.close()
conn.commit()

</code></pre>
<h3 id="23-ingest-and-store-vector-data-into-postgresql-using-pgvector">2.3 Ingest and <a href="https://www.tigerdata.com/learn/vector-store-vs-vector-database" rel="noreferrer">store vector</a> data into PostgreSQL using pgvector</h3><p>Now that we’ve created the database and created the table to house the embeddings and metadata, the final step is to insert the embedding vectors into the database. </p><p>For this step, it’s a best practice to batch insert the embeddings rather than insert them one by one.<br></p><pre><code class="language-Python">#Batch insert embeddings and metadata from dataframe into PostgreSQL database
register_vector(conn)
cur = conn.cursor()
# Prepare the list of tuples to insert
data_list = [(row['title'], row['url'], row['content'], int(row['tokens']), np.array(row['embeddings'])) for index, row in df_new.iterrows()]
# Use execute_values to perform batch insertion
execute_values(cur, "INSERT INTO embeddings (title, url, content, tokens, embedding) VALUES %s", data_list)
# Commit after we insert all embeddings
conn.commit()

</code></pre>
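<p>Because the table declares <code>embedding vector(1536)</code>, pgvector only accepts vectors with exactly 1536 dimensions, so a single malformed row can cause the batch insert to fail. A minimal sketch of a pre-insert check follows; the helper name <code>find_bad_embeddings</code> and the sample data are hypothetical.</p>

```python
EXPECTED_DIMENSIONS = 1536  # must match vector(1536) in the CREATE TABLE statement

def find_bad_embeddings(embeddings, expected=EXPECTED_DIMENSIONS):
    """Return the positions of embeddings whose length differs from the column dimension."""
    return [i for i, emb in enumerate(embeddings) if len(emb) != expected]

# Hypothetical sample: two valid vectors and one truncated vector.
sample = [[0.0] * 1536, [0.1] * 1536, [0.2] * 10]
bad_positions = find_bad_embeddings(sample)
print(bad_positions)
```

<p>In the tutorial's code, you might run such a check over <code>df_new['embeddings']</code> before building <code>data_list</code>, and fix or drop any rows it reports.</p>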
<p>Let’s sanity check by running some simple queries against our newly inserted data:</p><pre><code class="language-Python">cur.execute("SELECT COUNT(*) as cnt FROM embeddings;")
num_records = cur.fetchone()[0]
print("Number of vector records in table: ", num_records,"\n")
# Correct output should be 129

</code></pre>
<pre><code class="language-Python"># print the first record in the table, for sanity-checking
cur.execute("SELECT * FROM embeddings LIMIT 1;")
records = cur.fetchall()
print("First record in table: ", records)
</code></pre>
<h3 id="24-index-your-data-for-faster-retrieval">2.4 Index your data for faster retrieval</h3><p>In this example, we only have 129 embedding vectors, so searching through all of them is blazingly fast. But for larger datasets, you need to create indexes to speed up searching for similar embeddings, so we include the code to build the index for illustrative purposes. </p><p>While pgvector&nbsp;supports the <a href="https://www.tigerdata.com/blog/nearest-neighbor-indexes-what-are-ivfflat-indexes-in-pgvector-and-how-do-they-work" rel="noreferrer"><u>IVFFLAT</u></a> and <a href="https://www.tigerdata.com/learn/vector-database-basics-hnsw" rel="noreferrer"><u>HNSW</u></a> index types for approximate nearest neighbor (ANN) search, <a href="https://github.com/timescale/pgvectorscale"><u>pgvectorscale</u></a> offers a more cost-efficient and powerful index type for pgvector data: <a href="https://timescale.ghost.io/blog/pgvector-is-now-as-fast-as-pinecone-at-75-less-cost/"><u>StreamingDiskANN</u></a>, which we use here. </p><p>We build this index <strong>after</strong> inserting the data. A graph-based index such as StreamingDiskANN can be created at any time, but building it once over rows that are already loaded is generally faster than maintaining it during a bulk insert. </p><p>The StreamingDiskANN index has tunable parameters for both index building and querying. In our case, we use the default values of the parameters. You can read more about <a href="https://github.com/timescale/pgvectorscale?tab=readme-ov-file#tuning"><u>tuning here</u></a>.</p><pre><code class="language-Python"># Create an index on the data for faster retrieval
cur.execute('CREATE INDEX embedding_idx ON embeddings USING diskann (embedding);')
conn.commit()
</code></pre>
<h2 id="part-3-nearest-neighbor-search-using-pgvector">Part 3: Nearest Neighbor Search Using pgvector</h2><p>Given a user question, we’ll perform the following steps to use information stored in the vector database to answer their question using Retrieval Augmented Generation:</p><ol><li>Create an embedding vector for the user question.</li><li>Use pgvector to perform a vector similarity search and retrieve the <code>k</code> nearest neighbors to the question embedding from our embedding vectors representing the blog content. In our example, we’ll use k=3, finding the three most similar embedding vectors and associated content.</li><li>Supply the content retrieved from the database as additional context to the model and ask it to perform a completion task to answer the user question.</li></ol><h3 id="31-define-a-question-you-want-to-answer">3.1 Define a question you want to answer</h3><p>First, we’ll define a sample question that a user might want to answer about the blog posts stored in the database.</p><pre><code class="language-Python"># Question about Timescale we want the model to answer
input = "How is Timescale used in IoT?"
</code></pre>
<p>Since TimescaleDB is <a href="https://timescale.ghost.io/blog/visualizing-iot-data-at-scale-with-hopara-and-timescaledb/">popular for IoT sensor data</a>, a user might want to learn specifics about how they can leverage it for that use case.</p><h3 id="32-find-the-most-relevant-content-in-the-database">3.2 Find the most relevant content in the database</h3><p>Here’s the function we use to find the three nearest neighbors to the user question. Note it uses pgvector’s <code>&lt;=&gt;</code> operator, which computes the <a href="https://en.wikipedia.org/wiki/Cosine_similarity">cosine distance</a> (that is, 1 minus the <a href="https://www.tigerdata.com/learn/understanding-cosine-similarity" rel="noreferrer">cosine similarity</a>) between two embedding vectors, so smaller values mean more similar content.</p><pre><code class="language-Python"># Helper function: Get top 3 most similar documents from the database
def get_top3_similar_docs(query_embedding, conn):
    embedding_array = np.array(query_embedding)
    # Register pgvector extension
    register_vector(conn)
    cur = conn.cursor()
    # Get the top 3 most similar documents using the KNN &lt;=&gt; operator
    cur.execute("SELECT content FROM embeddings ORDER BY embedding &lt;=&gt; %s LIMIT 3", (embedding_array,))
    top3_docs = cur.fetchall()
    return top3_docs

</code></pre>
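<p>To make the ranking concrete, here is a minimal pure-Python sketch of what the query computes conceptually: a cosine distance for every row, followed by keeping the <code>k</code> smallest. The two-dimensional vectors and the helper names are hypothetical, and this exact scan is only an illustration of the idea, not how pgvector or an ANN index is implemented.</p>

```python
import heapq
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def top_k(query, rows, k=3):
    """Exact scan: rank (content, embedding) rows by distance and keep the k closest."""
    return heapq.nsmallest(k, rows, key=lambda row: cosine_distance(query, row[1]))

# Hypothetical two-dimensional embeddings, for illustration only.
rows = [
    ("about IoT sensors", [0.9, 0.1]),
    ("about finance", [0.1, 0.9]),
    ("about smart cities", [0.8, 0.3]),
    ("about gaming", [-0.5, 0.5]),
]
query = [1.0, 0.0]
closest = [content for content, _ in top_k(query, rows, k=2)]
print(closest)
```

<p>Vectors pointing in the same direction have a distance near 0, orthogonal vectors a distance of 1, and opposite vectors a distance of 2, which is why the SQL query sorts in ascending order.</p>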
<h3 id="33-define-helper-functions-to-query-openai">3.3 Define helper functions to query OpenAI</h3><p>We define a helper function to get a completion response from an OpenAI model, and we reuse the previously defined helper function, <code>get_embeddings</code>, to create an embedding for the user question. We use GPT-4o, but you can use any other model from OpenAI.</p><p>We also specify a few parameters that you can adjust to your liking: the maximum number of tokens in the model response and the model temperature, which controls the randomness of the output:</p><pre><code class="language-Python"># Helper function: get text completion from OpenAI API
# Note: the default model is gpt-4o; pass a different model name to override it
def get_completion_from_messages(messages, model="gpt-4o", temperature=0, max_tokens=1000):
    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temperature, 
        max_tokens=max_tokens, 
    )
    return response.choices[0].message.content
</code></pre>
<h3 id="33-putting-it-all-together">3.4 Putting it all together</h3><p>We’ll define a function to process the user input by retrieving the most similar documents from our database and passing the user input, along with the retrieved context, to the OpenAI model to generate a completion response.</p><p>Note that we modify the system prompt as well in order to influence the tone of the model’s response.</p><p>We pass to the model the content associated with the three most similar embeddings to the user input using the assistant role. You can also append the additional context to the user message.<br></p><pre><code class="language-Python"># Function to process input with retrieval of most similar documents from the database
def process_input_with_retrieval(user_input):
    delimiter = "```"

    #Step 1: Get documents related to the user input from database
    related_docs = get_top3_similar_docs(get_embeddings(user_input), conn)

    # Step 2: Get completion from OpenAI API
    # Set system message to help set appropriate tone and context for model
    system_message = f"""
    You are a friendly chatbot. \
    You can answer questions about timescaledb, its features and its use cases. \
    You respond in a concise, technically credible tone. \
    """

    # Prepare messages to pass to model
    # We use a delimiter to help the model understand where the user_input starts and ends
    messages = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": f"{delimiter}{user_input}{delimiter}"},
        {"role": "assistant", "content": f"Relevant Timescale case studies information: \n {related_docs[0][0]} \n {related_docs[1][0]} \n {related_docs[2][0]}"}
    ]

    final_response = get_completion_from_messages(messages)
    return final_response
</code></pre>
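<p>The function above indexes <code>related_docs[0]</code> through <code>related_docs[2]</code> directly, which raises an <code>IndexError</code> if the query returns fewer than three rows. As a hedged sketch, a hypothetical helper like <code>build_context_message</code> below (our own name, not part of the tutorial's code) assembles the context from however many rows come back:</p>

```python
def build_context_message(related_docs, header="Relevant Timescale case studies information:"):
    """Join the content column of each retrieved row, one document per line."""
    contents = [row[0] for row in related_docs]
    if not contents:
        return f"{header}\n(no matching documents found)"
    return header + "\n" + "\n".join(contents)

# cur.fetchall() returns a list of one-element tuples for a single-column SELECT.
docs = [("Chunk about IoT sensors",), ("Chunk about dashboards",)]
message = build_context_message(docs)
print(message)
```

<p>The returned string could be used as the <code>content</code> of the assistant message, which also makes it straightforward to change the number of retrieved documents later.</p>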
<p>Let’s see an example of the model’s output to our original input question:</p><pre><code class="language-Python">input = "How is Timescale used in IoT?"
response = process_input_with_retrieval(input)
print(input)
print(response)
</code></pre>
<p>Here’s the model’s response:</p><pre><code>User input: How is Timescale used in IoT?

Model Response:

TimescaleDB is widely used in IoT applications due to its ability to efficiently handle time-series data, which is a common requirement in IoT systems. Here are some ways TimescaleDB is utilized in IoT:

1. **Data Ingestion and Storage**: IoT devices generate large volumes of time-stamped data. TimescaleDB's architecture is optimized for ingesting and storing this data efficiently, allowing for high write throughput and efficient storage through features like data compression.

2. **Real-time Analytics**: TimescaleDB supports real-time analytics, enabling IoT applications to process and analyze data as it arrives. This is crucial for applications that require immediate insights, such as monitoring systems or predictive maintenance.

3. **Historical Data Analysis**: With its support for continuous aggregates, TimescaleDB allows for efficient querying of historical data. This is useful for trend analysis, reporting, and long-term data storage, which are common in IoT use cases.

4. **Integration with IoT Platforms**: TimescaleDB can be integrated with various IoT platforms and tools, such as AWS IoT, MQTT, and Grafana, to manage, visualize, and analyze IoT data seamlessly.

5. **Scalability**: As IoT deployments grow, the volume of data can increase significantly. TimescaleDB's scalability features, such as partitioning and distributed hypertables, help manage this growth effectively.

Overall, TimescaleDB provides a robust solution for managing the unique challenges of IoT data, combining the benefits of a relational database with time-series capabilities.

</code></pre><p>We can also ask the model questions about specific documents in the database, in this case about specific TimescaleDB users who have spoken about their experience in a blog post:</p><pre><code class="language-Python">input_2 = "Tell me about Edeva and Hopara. How do they use Timescale?"
response_2 = process_input_with_retrieval(input_2)
print(input_2)
print(response_2)

</code></pre>
<p>Here’s the model output for the input question above; notice how it uses specific details from the blog posts.</p><pre><code>User input: Tell me about Edeva and Hopara. How do they use Timescale?


Model Response:

### Edeva

**Use Case:**
Edeva leverages TimescaleDB to manage and analyze data from their smart city systems, particularly focusing on their Actibump dynamic speed bump. The data collected includes:

1. **IoT Event Data:** Information such as vehicle speeds and environmental conditions.
2. **Alarm Data:** Alerts for sensor malfunctions or other issues.
3. **Status Data:** Self-checks and statistical data from IoT devices.
4. **Administrative Data:** Metadata about devices, such as configuration details.

**Key Features Utilized:**
- **Continuous Aggregations:** To speed up queries and make dashboards responsive.
- **Percentile Aggregations:** For calculating accurate percentile values without querying raw data.
- **SQL Compatibility:** Simplifies onboarding for developers familiar with SQL.

**Benefits:**
- **Performance:** Transitioned from sluggish to lightning-fast dashboards.
- **Ease of Use:** Developers could quickly adapt due to SQL familiarity.
- **Scalability:** Efficiently handles large datasets, such as hundreds of millions of records.

### Hopara

**Use Case:**
Hopara uses TimescaleDB to manage and visualize time-series data for their geospatial analytics platform. The platform integrates various data sources to provide insights into spatial and temporal trends.

**Key Features Utilized:**
- **Time-Series Data Management:** Efficiently stores and queries large volumes of time-series data.
- **Geospatial Capabilities:** Leverages PostgreSQL’s PostGIS extension for spatial queries.
- **Continuous Aggregations:** To pre-compute and speed up complex queries.

**Benefits:**
- **Scalability:** Handles large datasets with ease.
- **Performance:** Fast query execution for real-time analytics.
- **Integration:** Seamless integration with existing PostgreSQL tools and extensions.

Both Edeva and Hopara benefit from TimescaleDB’s ability to handle large volumes of time-series data efficiently, providing fast query performance and ease of use through SQL compatibility.
</code></pre><h2 id="conclusion">Conclusion</h2><p><a href="https://www.timescale.com/blog/rag-is-more-than-just-vector-search" rel="noreferrer">Retrieval-augmented generation (RAG)</a> is a powerful method of building applications with LLMs that enables you to teach foundation models about things they were not originally trained on, like private documents or recently published information.</p><p>This project is an example of how to create, store, and perform similarity search on <a href="https://www.tigerdata.com/blog/open-source-vs-openai-embeddings-for-rag" rel="noreferrer">OpenAI embeddings</a>. We used PostgreSQL + <a href="https://github.com/pgvector/pgvector"><u>pgvector</u></a> + <a href="https://github.com/timescale/pgvectorscale"><u>pgvectorscale</u></a> as our vector database to efficiently store and query the embeddings, enabling precise and relevant responses.</p><h2 id="timescaledb-postgresql">TimescaleDB + PostgreSQL</h2><p>And if you’re looking for a production PostgreSQL database for your vector workloads, <a href="https://console.cloud.timescale.com/signup"><u>try Timescale</u></a>. 
It’s free for 30 days, no credit card required.</p><h3 id="further-reading">Further reading</h3><p>Here are more blog posts about RAG with PostgreSQL and different tools:</p><ul><li><a href="https://www.timescale.com/blog/rag-is-more-than-just-vector-search/"><u>RAG Is More Than Just Vector Search</u></a></li><li><a href="https://timescale.com/blog/retrieval-augmented-generation-with-claude-sonnet-3-5-and-pgvector/"><u>Retrieval-Augmented Generation With Claude Sonnet 3.5 &amp; Pgvector</u></a></li><li><a href="https://www.timescale.com/blog/build-a-fully-local-rag-app-with-postgresql-mistral-and-ollama/"><u>Build a Fully Local RAG App With PostgreSQL, Mistral, and Ollama</u></a></li><li><a href="https://www.timescale.com/blog/building-an-ai-image-gallery-advanced-rag-with-pgvector-and-claude-sonnet-3-5/"><u>Building an AI Image Gallery: Advanced RAG With Pgvector and Claude Sonnet 3.5</u></a></li></ul><p></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Implementing ASOF Joins in PostgreSQL and Timescale]]></title>
            <description><![CDATA[Read our step-by-step guide to implement ASOF joins in PostgreSQL and Timescale, and learn how to supercharge your queries with some Timescale magic.]]></description>
            <link>https://www.tigerdata.com/blog/implementing-asof-joins-in-timescale</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/implementing-asof-joins-in-timescale</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[James Blackwood-Sewell]]></dc:creator>
            <pubDate>Thu, 15 Jun 2023 14:10:00 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/06/caspar-camille-rubin-fPkvU7RDmCo-unsplash--1-.jpg">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/06/caspar-camille-rubin-fPkvU7RDmCo-unsplash--1-.jpg" alt="SQL code on computer screen: Implementing ASOF joins in Postgres and Timescale" /><h2 id="what-is-an-asof-join">What Is an <code>ASOF</code> Join?</h2><p>An <code>ASOF</code> (or "as of") join is a type of join operation used when analyzing two sets of time-series data. It essentially matches each record from one table with the nearest—but not necessarily equal—value from another table based on a chosen column. Oracle supports this out of the box using a non-standard SQL syntax, but unfortunately, PostgreSQL does not provide a built-in <code>ASOF</code> keyword.</p><p>The chosen column needs to have some concept of range for the <code>ASOF</code> operation to work. You may think of it as being the "closest value," but not exceeding the comparison. It works for string (alphabetical), integer (ordinal), float (decimal), and any other data type that has an idea of ORDER. Because timestamps are near and dear to our hearts at Timescale, we will demonstrate with time and date columns.</p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">✨</div><div class="kg-callout-text">Want to understand <a href="https://www.timescale.com/learn/postgresql-join-type-theory">how the PostgreSQL parser picks a join method or join types</a>? Check out this article!</div></div><p><br></p><p>Performing this operation in PostgreSQL takes a bit of effort. This article aims to delve deeper into <code>ASOF</code>-style joins and how to implement similar functionality in PostgreSQL by subselecting data or other join types.</p><h2 id="understanding-asof-joins">Understanding <code>ASOF</code> Joins</h2><p><code>ASOF</code> joins are a powerful tool when dealing with time-series data. 
In simple terms, an ASOF join will, for each row in the left table, find the single row in the right table with the largest key value that is still less than or equal to the key in the left table.</p><p>This is a common operation when dealing with financial data, sensor readings, or <a href="https://timescale.ghost.io/blog/time-series-data/" rel="nofollow noopener noreferrer ugc">other types of time-series data where readings might not align perfectly by timestamp</a>.</p><p>For a simple example, consider the real-world question, "What was the temperature yesterday at this time?" It is very unlikely that a temperature reading was taken yesterday at exactly the millisecond that the question is asked today. What we really want is "What was the most recent temperature reading taken yesterday at or before this time of day?"</p><p>This simple example becomes a lot more complex when we start comparing temperatures day over day, week over week, etc.</p><h2 id="implementing-asof-joins-in-timescale">Implementing <code>ASOF</code> Joins in Timescale</h2><p>Even though PostgreSQL does not directly support <code>ASOF</code> joins, you can achieve similar functionality using a combination of SQL operations. Here's a simplified step-by-step guide:</p><h3 id="step-1-prepare-your-data">Step 1: Prepare your data</h3><p>Ensure your data is in the correct format for the <code>ASOF</code> join. You'll need a timestamp or other orderable column to use as a key for the join.</p><p>Suppose you have two tables, <code>bids</code> and <code>asks</code>, each containing a timestamp column, and you want to join each bid to the most recent ask for the same instrument at or before the bid's timestamp.</p><pre><code class="language-sql">CREATE TABLE bids (
    instrument text,
    ts TIMESTAMPTZ,
    value NUMERIC
);
--
CREATE INDEX bids_instrument_ts_idx ON bids (instrument, ts DESC);
CREATE INDEX bids_ts_idx ON bids (ts);
--
CREATE TABLE asks (
    instrument text,
    ts TIMESTAMPTZ,
    value NUMERIC
);
CREATE INDEX asks_instrument_ts_idx ON asks (instrument, ts DESC);
CREATE INDEX asks_ts_idx ON asks (ts);
--
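-- For reference only, not run in this walkthrough: with the TimescaleDB
-- extension installed, converting these tables to hypertables would look
-- roughly like this (a sketch, assuming ts is the time column):
--   SELECT create_hypertable('bids', 'ts');
--   SELECT create_hypertable('asks', 'ts');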
</code></pre><p>Normally you'd make both these tables into hypertables with the <code>create_hypertable</code> function (because you're a super educated Timescale user), but in this case, we aren't going to, as we won't be inserting much data (and we also have some Timescale magic to show off 🪄).</p><h3 id="step-2-insert-some-test-data">Step 2: Insert some test data</h3><p>Next, we'll create data for four instruments, <code>AAA, BBB, NZD,</code> and <code>USD</code>.</p><pre><code class="language-sql">INSERT INTO bids (instrument, ts, value)
SELECT 
   -- random 1 of 4 instruments
  (array['AAA', 'BBB', 'NZD', 'USD'])[floor(random() * 4 + 1)], 
   -- timestamp of last month plus some seconds
  now() - interval '1 month' + g.s, 
   -- random value
  random()* 100 +1
FROM (
  -- about 2.6M seconds (2,592,000) in 30 days
  SELECT ((random() * 2592000 + 1)::text || ' s')::interval s 
  FROM generate_series(1,3000000)) g;
INSERT INTO asks (instrument, ts, value)
SELECT 
   -- random 1 of 4 instruments
  (array['AAA', 'BBB', 'NZD', 'USD'])[floor(random() * 4 + 1)], 
   -- timestamp of last month plus some seconds
  now() - interval '1 month' + g.s, 
   -- random value
  random()* 100 +1
FROM (
  -- about 2.6M seconds (2,592,000) in 30 days
  SELECT ((random() * 2592000 + 1)::text || ' s')::interval s 
  FROM generate_series(1,2000000)) g;
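-- Optional sanity check (a sketch added for illustration, not part of the
-- original walkthrough): confirm both tables were populated and see the
-- time range each one covers.
SELECT 'bids' AS tbl, count(*) AS row_count, min(ts) AS first_ts, max(ts) AS last_ts
FROM bids
UNION ALL
SELECT 'asks', count(*), min(ts), max(ts)
FROM asks;
-- Refresh planner statistics so the plan in Step 3 reflects the loaded data.
ANALYZE bids;
ANALYZE asks;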
</code></pre><h3 id="step-3-query-the-data-using-a-sub-select">Step 3: Query the data using a sub-select</h3><p>To mimic the behavior of an <code>ASOF</code> join, use a <code>SUBSELECT</code> join operation along with conditions to match rows based on your criteria. This will run the sub-query once per row returned from the target table. We need to use the <code>DISTINCT</code> clause to limit the number of rows returned to one.</p><p>This will work in vanilla Postgres, but when we are using Timescale (even though we aren't using hypertables yet), we get the benefits of a Skip Scan, which will supercharge the query (for more information on this check our <a href="https://docs.timescale.com/use-timescale/latest/query-data/skipscan/" rel="nofollow noopener noreferrer ugc">docs</a> or <a href="https://timescale.ghost.io/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql/">blog post about how Skip Scan can give you an 8,000x speed-up</a>).</p><pre><code class="language-sql">SELECT bids.ts timebid, bids.value bid,
    (SELECT DISTINCT ON (asks.instrument) value ask
    FROM asks
    WHERE asks.instrument = bids.instrument
    AND asks.ts &lt;= bids.ts
    ORDER BY instrument, ts DESC) ask
FROM bids
WHERE bids.ts &gt; now() - interval '1 week'
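;
-- (The semicolon above ends the sub-select query.)
-- Alternative sketch, added for illustration and not from the original
-- walkthrough: the same ASOF-style lookup written as a LEFT JOIN LATERAL
-- with LIMIT 1. It uses only standard PostgreSQL and can be served by the
-- (instrument, ts DESC) index on asks. Bids with no earlier ask return NULL,
-- as in the sub-select version.
-- Note: the query plan that follows this block is for the sub-select query
-- above, not for this variant.
SELECT bids.ts AS timebid, bids.value AS bid, a.value AS ask
FROM bids
LEFT JOIN LATERAL (
    SELECT asks.value
    FROM asks
    WHERE asks.instrument = bids.instrument
      AND asks.ts &lt;= bids.ts
    ORDER BY asks.ts DESC
    LIMIT 1
) a ON true
WHERE bids.ts &gt; now() - interval '1 week';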
</code></pre><pre><code class="language-sql">                              QUERY PLAN                                                                               
-------------------------------------------------------------------------
 Index Scan using bids_ts_idx on public.bids  
    (cost=0.43..188132.58 rows=62180 width=56) 
    (actual time=0.067..1700.957 rows=57303 loops=1)
   Output: bids.instrument, bids.ts, bids.value, (SubPlan 1)
   Index Cond: (bids.ts &gt; (now() - '7 days'::interval))
   SubPlan 1
     -&gt;  Unique  (cost=0.43..2.71 rows=5 width=24) 
                (actual time=0.027..0.029 rows=1 loops=57303)
           Output: asks.value, asks.instrument, asks.ts
           -&gt;  Custom Scan (SkipScan) on public.asks  
                  (cost=0.43..2.71 rows=5 width=24) 
                  (actual time=0.027..0.027 rows=1 loops=57303)
                 Output: asks.value, asks.instrument, asks.ts
                 -&gt;  Index Scan using asks_instrument_ts_idx on public.asks  
                        (cost=0.43..15996.56 rows=143152 width=24) 
                        (actual time=0.027..0.027 rows=1 loops=57303)
                       Output: asks.value, asks.instrument, asks.ts
                       Index Cond: ((asks.instrument = bids.instrument) 
                          AND (asks.ts &lt;= bids.ts))
 Planning Time: 1.231 ms
 Execution Time: 1703.821 ms

</code></pre><h3 id="conclusion">Conclusion</h3><p>While PostgreSQL does not have an <code>ASOF</code> keyword, it does offer the flexibility and functionality to perform similar operations. When you're using Timescale, things only get better with the enhancements like Skip Scan.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to Fix Transaction ID Wraparound Exhaustion]]></title>
            <description><![CDATA[Learn more about transaction ID wraparound failure and how to avoid it in PostgreSQL databases. It involves treating your database as a house: turn on your Roomba, a.k.a. autovacuum.]]></description>
            <link>https://www.tigerdata.com/blog/how-to-fix-transaction-id-wraparound</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/how-to-fix-transaction-id-wraparound</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[PostgreSQL Tips]]></category>
            <dc:creator><![CDATA[Kirk Laurence Roybal]]></dc:creator>
            <pubDate>Wed, 10 May 2023 16:45:49 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/05/How-to-fix-transaction-ID-wraparound.jpg">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/05/How-to-fix-transaction-ID-wraparound.jpg" alt="How to Fix Transaction ID Wraparound" /><p>Timescale employs <a href="https://timescale.ghost.io/blog/how-to-manage-a-commitfest/">PostgreSQL contributors</a> who are working feverishly to mitigate a problem with PostgreSQL. That problem is commonly referred to as “transaction ID wraparound” and stems from design decisions in the PostgreSQL project that have been around for decades.<br></p><p>Because this design decision was made so early in the project history, it affects all branches and forks of PostgreSQL, with Amazon RDS PostgreSQL, Greenplum, Netezza, Amazon Aurora, and many others suffering from transaction ID wraparound failures.<br></p><p>In this article, the second in a <a href="https://timescale.ghost.io/blog/how-to-fix-no-partition-of-relation-found-for-row/">series of posts tackling PostgreSQL errors or issues</a>, we’ll explain what transaction ID wraparound is, why it fails, and how you can mitigate or resolve it. But let’s start with a bit of PostgreSQL history.</p><h2 id="transaction-id-wraparound-xid-wraparound">Transaction ID Wraparound (XID Wraparound)</h2><p>To fully understand the problem of transaction ID wraparound (or XID wraparound), a bit of history is in order. The idea of a transaction counter in PostgreSQL originated as a very simple answer to transaction tracking. We need to know the order in which transactions are committed to a PostgreSQL database, so let's enumerate them. What is the simplest way to give transactions a concept of order? That would be a counter. What is the simplest counter? An integer. Tada!   <br></p><p>So, form follows function, and we have an integer counter. 
Seems like an obvious, elegant, and simple solution to the problem, doesn't it?<br></p><p>At first glance (and second and third, honestly), this rather simple solution stood up very well. Who would ever need more than 2<sup>31</sup> (just over 2 billion) transactions in flight? That was an astronomical number for 1985.</p><p></p><p>Since this is such a huge number, we should only need a single counter for the entire database cluster. That will keep the design simple, prevent the need to coordinate multiple transaction counters, and allow for efficient (just four bytes!) storage. We simply add this small counter to each row, and we know exactly what the high watermark is for every row version in the entire cluster.</p><p>This simple method is row-centric and yet cluster-wide. So, our backups are easy (we know exactly where the pointer is for the entire cluster), and the data snapshot at the beginning and end of our transaction is stable. We can easily tell within the transaction if the data has changed underneath us from another transaction.   <br></p><p>We can even play peek-a-boo with other transaction data in flight. That lets us ensure that transactions settle more reasonably, even if we are wiggling the loose electrical connectors of our transaction context a bit.<br></p><p>We can stretch that counter quite a bit by making it a ring buffer. That is, we'll use modular arithmetic for the next value rather than let it grow without bound. That way, the value after the top of the range is the bottom again. So, our counter can wrap around the top (2<sup>31</sup>) and only becomes problematic when it reaches the oldest open transaction at the bottom.   </p><p>This "oldest" transaction is an upwardly moving number also, which then wraps around the top. So, we have the head (current transaction) chasing the tail (oldest transaction) around the integer, with 2,147,483,648 spaces from the bottom to the top. 
This makes our solution even look like a puppy dog, so now it's cute as well as elegant.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/05/dachshund-cricle.gif" class="kg-image" alt="A dog chasing his tail on top of a record player—the perfect representation of transaction ID wraparound" loading="lazy" width="480" height="360"></figure><p>The idea is that this would make the counter <strong>almost</strong> infinite, as the head should never catch the tail. At that point, who could possibly need more transactions than that? Brilliant!</p><p>Transaction counters are obviously the way to go here. They just make everything work so elegantly.</p><h2 id="explanation-the-plan-in-action">Explanation: The Plan in Action</h2><p>For many years, PostgreSQL raged forward with the XID wraparound transaction design. Quite a few features were added along the way that were based on this simple counter. Backups (<code>pg_basebackup</code> and its cousins), replication (both physical and logical), indexes, visibility maps, autovacuum, and defragmentation utilities all sprouted up to enhance and support this central concept.<br></p><p>All of these things worked together brilliantly for quite some time. We didn't start seeing the stress marks in the fuselage until the hardware caught up with us. As much as PostgreSQL wants to turn a blind eye to the reality of the hardware universe, the time came upon us when systems had the capacity to create more than 2<sup>31</sup> transactions at a time.   
<br></p><p>High-speed ETL, queuing systems, IoT, and other machine-generated data could actually keep the system busy long enough that the counter could be pushed to its inherent limit.</p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-text"><i><em class="italic" style="white-space: pre-wrap;">"Everybody has a plan until I punch them in the face."</em></i> —Mike Tyson</div></div><p></p><p></p><p>These processes weren't exactly showstoppers, though. We came up with band-aids for much of it.   <br></p><p><code>COPY</code> got its own transaction context, reducing the load significantly. So did <code>VACUUM</code>.  <code>VACUUM</code> sprouted the ability to just freeze transactions without having to do a full row cleanup. That made the tail move forward a bit more quickly. External utilities gained features, individual tables gained <code>VACUUM</code> settings so they could be targeted separately.<br></p><p>Okay, that helped. But did it help enough? These features were never designed to fundamentally fix the issue. The issue is that size matters. But to be a bit more descriptive...</p><h2 id="possible-causes">Possible Causes</h2><h3 id="how-big-is-big">How big is big?</h3><p>In the early aughts, I was involved in building a data center for a private company. We spent some $5M creating a full-featured data center, complete with Halon, a 4K generator, Unisys ES7000, and a Clarion array. For the sake of our XID wraparound article, I'll focus on the Clarion array. It cost just a bit over $2M and could hold 96 drives for a whopping total of 1.6 TB! In 2002, that was incredible.<br></p><p>It doesn't seem so incredible now, does it? Kinda disappointing even. A few weeks ago, I needed an additional drive for my home backup unit. As I was walking through a Costco, I absent-mindedly threw a 2 TB drive into my cart that retailed for $69. 
It wasn't until I got home and was in the middle of installing it that it dawned on me how far we've come in the storage industry.<br></p><p>Some of the young whippersnappers don't even care about storage anymore. They think the "cloud" storage is effectively infinite. <a href="https://timescale.ghost.io/blog/scaling-postgresql-with-amazon-s3-an-object-storage-for-low-cost-infinite-database-scalability/">They're not wrong</a>.<br></p><p>To bring this around to PostgreSQL, tables with 2M rows were a big deal in 2002. Now that's not even on the radar of "big data." A VLDB (very large database) at the time was 2 TB. Now it's approaching 1 PB.<br></p><p>"A lot" of transactions in 2002 was 2M. Now, I would place that number at somewhere around 2B. Oops. Did I just say 2B? Isn't that close to the same number I said a few paragraphs ago was the limit of our transaction space? Let me see, that was 2<sup>31</sup>, which is 2,147,483,648. </p><p>Ouch.</p><h2 id="how-to-resolve-transaction-id-wraparound-failure">How to Resolve Transaction ID Wraparound Failure</h2><p>To be fair, not everybody has this problem. 2,147,483,648 is still a really big number, so a fairly small number of systems will ever reach this limit, even in the transaction environment of 2023.  </p><p>It also represents the number of transactions that are currently in flight, as the autovacuum process will latently brush away transaction counters that are no longer visible to the postmaster (<code>pg_stat_activity</code>).  But if the number of phone calls to consultants is any indication, this limitation is nonetheless becoming quite an issue. It certainly isn't going away any time soon.<br></p><p>Everybody in the PostgreSQL ecosystem is painfully aware of the limitation. This problem affects more than just the core of PostgreSQL, it affects all of the systems that have grown around it also. Do you know what it also affects? 
All the PostgreSQL-based databases, such as Amazon RDS and Aurora.<br></p><p>To make any changes to the core of PostgreSQL, all of the ramifications of those changes have to be thought out in advance. Fortunately, we have a whole community of people (some of them proudly part of our own organization) that are really, really good at thinking things out in advance.</p><p></p><p><strong>Query to show your current transaction ages:</strong></p><pre><code class="language-SQL">with overridden_tables as (
  select
    pc.oid as table_id,
    pn.nspname as scheme_name,
    pc.relname as table_name,
    pc.reloptions as options
  from pg_class pc
  join pg_namespace pn on pn.oid = pc.relnamespace
  where reloptions::text ~ 'autovacuum'
), per_database as (
  select
    coalesce(nullif(n.nspname || '.', 'public.'), '') || c.relname as relation,
    greatest(age(c.relfrozenxid), age(t.relfrozenxid)) as age,
    round(
      (greatest(age(c.relfrozenxid), age(t.relfrozenxid))::numeric *
      100 / (2 * 10^9 - current_setting('vacuum_freeze_min_age')::numeric)::numeric),
      2
    ) as capacity_used,
    c.relfrozenxid as rel_relfrozenxid,
    t.relfrozenxid as toast_relfrozenxid,
    (greatest(age(c.relfrozenxid), age(t.relfrozenxid)) &gt; 1200000000)::int as warning,
    case when ot.table_id is not null then true else false end as overridden_settings
  from pg_class c
  join pg_namespace n on c.relnamespace = n.oid
  left join pg_class t ON c.reltoastrelid = t.oid
  left join overridden_tables ot on ot.table_id = c.oid
  where c.relkind IN ('r', 'm') and not (n.nspname = 'pg_catalog' and c.relname &lt;&gt; 'pg_class')
    and n.nspname &lt;&gt; 'information_schema'
  order by 3 desc)
SELECT *
FROM per_database;
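-- Database-level view (a sketch added for illustration; not part of the
-- adapted Postgres-Checkup query above). age(datfrozenxid) is the number of
-- transactions since each database's oldest unfrozen XID. Tables whose age
-- passes autovacuum_freeze_max_age get a forced anti-wraparound vacuum.
SELECT
  datname,
  age(datfrozenxid) AS xid_age,
  current_setting('autovacuum_freeze_max_age')::bigint AS freeze_max_age
FROM pg_database
ORDER BY xid_age DESC;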
</code></pre>
<p><a href="https://gitlab.com/postgres-ai/postgres-checkup/-/blob/master/resources/checks/F002_autovacuum_wraparound.sh"><em>Adapted from Postgres-Checkup</em> </a></p><p></p><p>Many enhancements have already been made to PostgreSQL to mitigate the transaction ID wraparound problem and solve it permanently. Here are the steps on the way to the solution.</p><ul><li>The PostgreSQL system catalogs have already been enhanced to a 64-bit (eight-byte) transaction ID.</li><li>The functions and procedures of PostgreSQL have been expanded to 64-bit transaction ID parameters and outputs.</li><li>The backends (query worker processes) can deal with 64-bit transaction IDs.</li><li>Work has been done on the utilities of PostgreSQL (such as <code>pg_basebackup</code>) that previously assumed 32-bit integer transactions.</li><li>Replication, <code>VACUUM</code>, and other processes have been enhanced for 64-bit transactions.</li><li>A lot of other "stuff." Many smaller incidental fixes that were based on 32-bit assumptions needed modification.</li></ul><p>The goal of all of these changes is to eventually move to a 64-bit transaction counter for the entire system.</p><h3 id="where-do-we-go-from-here">Where do we go from here?</h3><p>There's a bit of bad news. I'm going to close my eyes while I write this, so I won't have to look at your face while you read it.<br></p><p>Updating the user tables in your database to use 64-bit transaction counters will require rewriting all of your data. Remember at the beginning, where I said the transaction counter was a per-row solution? Oh, yeah.   <br></p><p>That means that its limitations are also per row. There are only eight bytes reserved for <code>xmin</code>and eight bytes for <code>xmax</code> in the row header. 
So, every single row of data in the database is affected.<br></p><p>At some point, there will be a major version of PostgreSQL that requires a data dump, replication, <code>pg_upgrade</code> or another such process to re-create every row in the database in the new format. It is true that every major version of PostgreSQL <em>could</em> change the format of data on disk. <br></p><p>The <code>pg_upgrade</code> utility will not be able to use symlinks or hardlinks for the upgrade. These links usually allow for some efficiency while upgrading. There will be no such shortcuts when the "fix" for transaction ID wraparound is put into place.<br></p><p>Okay, now for the good news. We will all be in retirement (if not taking a dirt nap) when the next bunch of <s>suckers</s> engineers has to deal with this issue again. 2<sup>63</sup> is not double the number of transactions. It is 9,223,372,034,707,292,160 (nine quintillion) more.</p><h3 id="what-to-do-while-youre-waiting-for-infinity">What to do while you're waiting for infinity</h3><p>You can still make use of some basic mitigation strategies for transaction ID wraparound failures:</p><ul><li>Make the autovacuum process more aggressive to keep up with maintaining the database.</li><li>Use custom settings to make the autovacuum process more aggressive for the most active tables.</li><li><a href="https://www.postgresql.org/docs/current/app-vacuumdb.html">Schedule vacuumdb</a> to do additional vacuuming tasks for PostgreSQL to catch up faster.</li><li>Vacuum the <code>TOAST</code> tables separately so the autovacuum has a better chance of catching up.</li><li><code>REINDEX CONCURRENTLY</code> more frequently so that the autovacuum has less work to do.</li><li><code>CLUSTER ON INDEX</code> will re-order the data in the table to the same order as an index, thus "vacuuming" the table along the way.</li><li><code>VACUUM FULL</code>, which blocks updates while vacuuming but will finish without interruption. Let me say that again. 
There will be no writes while <code>VACUUM FULL</code> is running, and you can't interrupt it. 😠</li><li>Switch over to a secondary. The transaction counter will be reset to one when the system is restarted. (There are no transactions in flight, are there? 😄)</li><li>Use batching for  <code>INSERT</code>, <code>UPDATE</code>, and <code>DELETE</code> operations.  The counter is issued per transaction (not row), so grouping operations helps reserve counters.</li></ul><p>All of these strategies are basically the same thing. The objective is to ensure the tail number (oldest transaction) moves forward as quickly as possible. This will prevent you from ending up in a "transaction ID wraparound" scenario. <strong>🙂 ♥️ 👍</strong></p><h2 id="documentation-and-resources">Documentation and Resources</h2><ul><li>Check out the PostgreSQL documentation on <a href="https://www.postgresql.org/docs/15/routine-vacuuming.html">routine vacuuming</a> to prevent transaction ID wraparound failures.</li><li>The Timescale Docs also <a href="https://docs.timescale.com/mst/latest/troubleshooting/">troubleshoot transaction ID wraparound exhaustion</a>.</li></ul><h2 id="how-timescale-can-help">How Timescale Can Help</h2><p>While Timescale—also built on the rock-solid foundation of PostgreSQL— does not solve transaction ID wraparound failure, it can help you prevent it since our ingestion inherently batches the data by design after you<a href="https://docs.timescale.com/use-timescale/latest/ingest-data/about-timescaledb-parallel-copy/"> install <code>timescaledb-parallel-copy</code></a>.<br></p><p>Of course, you can do this for yourself with transaction blocks, but our tools will do the right thing automatically.<br></p><p>We also provide a <a href="https://timescale.ghost.io/blog/the-postgresql-job-scheduler-you-always-wanted-but-be-careful-what-you-ask-for/">general-purpose job scheduler</a> that can be useful for adding <code>VACUUM</code> and <code>CLUSTER</code> operations.<br></p><p>So, 
if you want to mitigate the chances of ever dealing with XID wraparound problems while enjoying <a href="https://timescale.ghost.io/blog/postgresql-timescaledb-1000x-faster-queries-90-data-compression-and-much-more/">superior query performance and storage savings compared to vanilla PostgreSQL</a> or <a href="https://timescale.ghost.io/blog/timescale-cloud-vs-amazon-rds-postgresql-up-to-350-times-faster-queries-44-faster-ingest-95-storage-savings-for-time-series-data/">Amazon RDS for PostgreSQL</a>, try Timescale. <a href="https://console.cloud.timescale.com/signup">Sign up now </a>(30-day free trial, no credit card required) for fast performance, seamless user experience, and the best compression ratios.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to Fix No Partition of Relation Found for Row in Postgres Databases]]></title>
            <description><![CDATA[Learn more about the “no partition of relation found for row” error and how to avoid it in PostgreSQL databases.]]></description>
            <link>https://www.tigerdata.com/blog/how-to-fix-no-partition-of-relation-found-for-row</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/how-to-fix-no-partition-of-relation-found-for-row</guid>
            <category><![CDATA[AWS]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[PostgreSQL Tips]]></category>
            <dc:creator><![CDATA[James Blackwood-Sewell]]></dc:creator>
            <pubDate>Thu, 06 Apr 2023 13:30:44 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/04/AWS-RDS-Error-message-no-partition-of-relation-found.jpg">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/04/AWS-RDS-Error-message-no-partition-of-relation-found.jpg" alt="How to Fix No Partition of Relation Found for Row in Postgres Databases" /><p></p><h2 id="error-no-partition-of-relation-found-for-row"><code>ERROR</code>: No Partition of Relation Found for Row</h2><p>The error message <code>ERROR: no partition of relation {table-name} found for row</code> is reported by PostgreSQL (and will appear in the console and the log) when a table has been configured with declarative partitioning, and data is <code>INSERTed</code> before a child table has been defined with constraints that match the data. This will cause the insert to fail, potentially losing the data which was in flight.</p><p>You will find this error message in other PostgreSQL-based databases, such as Amazon RDS for PostgreSQL and Amazon Aurora. But it can be avoided in Timescale when you use our <a href="https://docs.timescale.com/use-timescale/latest/hypertables/about-hypertables/">hypertable abstraction</a>. In this blog post, we’ll explain this database error in more detail to learn why.</p><h2 id="explanation">Explanation</h2><p>Let’s dive deeper into what causes a <code>no partition of relation found for row</code> error. When a table is partitioned using PostgreSQL declarative partitioning, it becomes a parent to which multiple child partitions can be attached. Each of these children can handle a specific non-overlapping subset of data. When partitioning by time (the most common use case), each partition would be attached for a particular date range. For example, seven daily partitions could be attached, representing the upcoming week.<br></p><p>When inserts are made into the parent table, these are transparently routed to the child table, matching the partitioning criteria. 
So an insert of a row that referenced tomorrow would be sent automatically to tomorrow’s partition. If this partition doesn’t exist, then there is a problem—there is no logical place to store this data. PostgreSQL will fail the <code>INSERT</code> and report <code>no partition of relation {table-name} found for row</code>.</p><h2 id="how-to-resolve">How to Resolve</h2><p>There are two ways around this problem, although neither is perfect. Keep reading to see the Timescale approach with <a href="https://www.tigerdata.com/blog/database-indexes-in-postgresql-and-timescale-cloud-your-questions-answered" rel="noreferrer">hypertables</a> that avoids these pitfalls.</p><p>Partitions can be made ahead of time—perhaps a scheduler could be used to create a month's worth of partitions automatically in advance. This works in theory (as long as that scheduler keeps running!) but will cause locking issues while the partitions are being created. Plus, it doesn’t account for data in the past or the far future. </p><p>A default partition can also be added that automatically catches all data that doesn’t have a home, but this is problematic, too, as it collects data that needs to eventually be moved into freshly created partitions. As the amount of orphaned data in the default partition grows, it will also slow down query times.</p><h2 id="documentation-and-resources">Documentation and Resources</h2><ul><li>Timescale <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/hypertables/">hypertables work like regular PostgreSQL tables</a> but provide a superior user experience when handling time-series data.</li><li>Need some advice on how to model your time-series data using hypertables? 
Read our best practices about choosing between a <a href="https://timescale.ghost.io/blog/best-practices-for-time-series-data-modeling-narrow-medium-or-wide-table-layout-2/">narrow, medium, or wide hypertable layout</a> and learn when to use <a href="https://timescale.ghost.io/blog/best-practices-for-time-series-data-modeling-single-or-multiple-partitioned-table-s-a-k-a-hypertables/">single or multiple hypertables</a>.<br></li></ul><h2 id="how-timescale-can-help">How Timescale Can Help</h2><p>As mentioned earlier, another solution is enabling the TimescaleDB extension and converting the table into a hypertable instead of using PostgreSQL declarative partitioning. This removes the need to worry about partitions (which in Timescale jargon are called chunks), as they are transparently made when inserts happen with no locking issues. </p><p>You’ll never have to see this error, worry about scheduling potentially disruptive partition creation, or think about default partitions ever again! </p><p>New to Timescale? <a href="https://console.cloud.timescale.com/signup">Sign up for Timescale</a> (30-day free trial, no credit card required) for fast performance, seamless user experience, and the best compression ratios.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Best Practices for Time-Series Data Modeling: Single or Multiple Partitioned Table(s) a.k.a. Hypertables]]></title>
            <description><![CDATA[Time-series data is relentless, so you know you’ll have to create one or more partitioned tables (a.k.a. Timescale hypertables) to store it. Learn how to choose the best data modeling option for your use case—single or multiple hypertables.]]></description>
            <link>https://www.tigerdata.com/blog/best-practices-for-time-series-data-modeling-single-or-multiple-partitioned-table-s-a-k-a-hypertables</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/best-practices-for-time-series-data-modeling-single-or-multiple-partitioned-table-s-a-k-a-hypertables</guid>
            <category><![CDATA[Time Series Data]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Hypertables]]></category>
            <dc:creator><![CDATA[Chris Engelbert]]></dc:creator>
            <pubDate>Thu, 09 Mar 2023 14:00:05 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/03/time-series-data-modeling-single-or-multiple-partitioned-tables_hero.jpg">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/03/time-series-data-modeling-single-or-multiple-partitioned-tables_hero.jpg" alt="A white egg (single partitioned table) amidst multiple other eggs (multiple partitioned tables)." /><p>Collecting <a href="https://timescale.ghost.io/blog/time-series-data/">time-related information</a>, or time-series data, creates massive amounts of data to manage and model. Storing it will require one or more Timescale hypertables, which are very similar to PostgreSQL partitioned tables.</p><p><a href="https://docs.timescale.com/timescaledb/latest/overview/core-concepts/hypertables-and-chunks/">Timescale hypertables work like regular PostgreSQL tables</a> but offer optimized performance and user experience for time-series data. With hypertables, data is stored in chunks, which work similarly to <a href="https://www.postgresql.org/docs/current/ddl-partitioning.html">PostgreSQL’s partitioned tables</a> but support multiple dimensions and other features. While we discussed the table layout in the <a href="https://timescale.ghost.io/blog/best-practices-for-time-series-data-modeling-narrow-medium-or-wide-table-layout-2/"><em>Narrow, Medium, or Wide Table Layout</em></a> best practices article, this time, we addressed whether you should use a single table to store all data versus using multiple tables, as well as their respective pros and cons.</p><p><a href="https://timescale.ghost.io/blog/time-series-data-why-and-how-to-use-a-relational-database-instead-of-nosql-d0cd6975e87c/">Timescale is built upon a relational database model</a>, which means it supports numerous data modeling choices or ways in which data can be organized and laid out. 
Understanding the database design choices early on is crucial to find the best combination.</p><p>I started using Timescale long before joining the company, initially using it to store IoT metrics at my own startups. We went through a few different iterations of designs, and migrations between those were everything but fun. Due to that personal experience, one of my biggest goals is to prevent others from suffering through the same.</p><h1 id="time-series-data-modeling-using-our-relational-database-experience">Time-Series Data Modeling: Using our Relational Database Experience</h1>
<p>As mentioned, Timescale uses a relational data model at its core. Being built on PostgreSQL, we understand we have many ways to store data, including in partitioned tables or just separate ones. A common pattern in the relational world is to divide data into separate tables based on their content, also often referred to as <a href="https://opentextbc.ca/dbdesign01/chapter/chapter-7-the-relational-data-model/"><em>domain</em></a> or <em>entity</em>. That means that data belonging to a set of A’s is stored in a different table than a set of B’s, representing different <em>domains</em>.</p><p>That leaves us questioning how the concepts of time-series data and relational domains fit together. Unfortunately, there is no easy answer. Our primary options are “squeezing” all data into a single table, which could have hundreds of columns (basically defining our domain around the idea of “it’s all just metrics”), or splitting data into multiple tables with fewer columns. The latter choice may slice tables in many ways, such as by metric type (temperature is different from humidity, stock symbol A is different from symbol B), customer, data type, and others, or combinations of the previous.</p><p>Both possibilities have their own set of advantages and disadvantages, which can be split into four commonly seen topics:</p>
<ol>
<li>Ease of use</li>
<li>Multi-tenancy / Privacy-related requirements (General Data Protection Regulation or GDPR / California Consumer Privacy Act or CCPA / others)</li>
<li>Schema migration or upgrading / Future-proofness</li>
<li>Tooling support</li>
</ol>
<p>I did not choose the above order by accident: the sequential importance of these questions may influence your options further down the line.</p><h1 id="pros-and-cons">Pros and Cons</h1>
<p>As said before, both design choices have pros and cons, and it’s vital to understand them before making a final data modeling decision. Given the previous set of topics, let’s start by answering a few questions.</p><p>First and foremost, how important is ease of use? That is, are you and your team up to the challenging tasks that need to be solved down the road? Potential “complications” could involve generation patterns for table names or ensuring that similar tables are all upgraded to the same schema level.</p><p>Next up, are you required to provide a stricter level of multi-tenancy, where customers are not just segregated by a customer ID but must be stored in different tables, schemas, or even databases? Is your company bound by regulations (e.g., GDPR or CCPA) under which users and customers may have the right to be forgotten? With time-series data being (normally) append-only, removing parts of the data (this specific user’s data) may be tricky.</p><p>Then we have the question of whether you expect the data schema to change frequently. A longer discussion of future-proof design for hypertables can be found in the <a href="https://timescale.ghost.io/blog/best-practices-for-time-series-data-modeling-narrow-medium-or-wide-table-layout-2/"><em>Narrow, Medium, or Wide Table Model</em></a> best practices write-up. However, the more tables you have, the more of them need to be upgraded or migrated in the future, which adds complexity.</p><p>Finally, how important is support by additional tools and frameworks, such as ORM (Object Relational Mapping) solutions? 
While I personally don’t think ORM frameworks are a great fit for time-series data (especially when using aggregations), a lot of folks out there make extensive use of them, so talking about them has its merits.</p><p>Anyhow, now that we’ve gone through those questions, let’s dig into the design choices in greater detail.</p><h1 id="single-table-designs">Single Table Designs</h1>
<p>Storing all data in a single table may initially feel irresponsible from a relational database point of view. Depending on how I slice my domain model, though, it could be a perfectly valid option. The design choice makes sense if I consider all stored data (metrics, events, IoT data, stock prices, etc.) as a single domain, namely time series.</p><p>Single tables make a few things super simple. First of all, and probably obvious to most, is querying. Everything is in the same table, and the queries select certain columns and add filters in the <code>WHERE</code> clause to select the data. That is as basic as it can get with SQL. In other words, querying data is super simple, not just easy.</p><p>Upgrading the table’s schema is equally simple. Adding or removing columns takes a single command, and all data is upgraded at the same time, so no real migration window is needed. With multiple similar tables, by contrast, you may end up in a situation where some tables are upgraded while others are simply forgotten.</p><p>Single tables can easily support multiple different values, either through multiple columns (wide table layout), a JSONB column that supports a wide range of data values (narrow table layout), or through columns based on a value’s potential data type (medium table layout). <a href="https://timescale.ghost.io/blog/best-practices-for-time-series-data-modeling-narrow-medium-or-wide-table-layout-2/">Those three options have pros and cons, though</a>. </p><pre><code>tsdb=&gt; \d
      Column    |           Type           | Collation | Nullable |      Default
----------------+--------------------------+-----------+----------+-------------------
 created        | timestamp with time zone |           | not null | now()
 point_id       | uuid                     |           | not null | gen_random_uuid()
 device_id      | uuid                     |           | not null |
 temp           | double precision         |           |          |
 hum            | double precision         |           |          |
 co2            | integer                  |           |          |
 wind_speed     | integer                  |           |          |
 wind_direction | integer                  |           |          |

|                  created | point_id | device_id | temp |  hum | co2 | wind_speed | wind_direction |
| 2022-01-01 00:00:00.0+00 |      123 |        10 | 24.7 | 57.1 | 271 |       NULL |           NULL |
</code></pre>
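<p>As a minimal sketch of the single table design, the statements below create a table with the columns from the listing above and turn it into a hypertable. The table name <code>metrics</code>, the added <code>pressure</code> column, and the device ID in the query are placeholders that do not come from the original schema. The sketch also assumes <code>gen_random_uuid()</code> is available (it is built in since PostgreSQL 13 and otherwise provided by <code>pgcrypto</code>).</p><pre><code class="language-SQL">-- Hypothetical table name; the columns mirror the listing above.
CREATE TABLE metrics (
    created        timestamptz      NOT NULL DEFAULT now(),
    point_id       uuid             NOT NULL DEFAULT gen_random_uuid(),
    device_id      uuid             NOT NULL,
    temp           double precision,
    hum            double precision,
    co2            integer,
    wind_speed     integer,
    wind_direction integer
);

-- Turn the plain table into a hypertable partitioned on the time column.
SELECT create_hypertable('metrics', 'created');

-- A schema upgrade is a single statement that covers all stored data.
-- (pressure is an invented column, used only for illustration.)
ALTER TABLE metrics ADD COLUMN pressure double precision;

-- Querying means picking columns and filtering in the WHERE clause.
SELECT created, temp, hum
FROM metrics
WHERE device_id = '00000000-0000-0000-0000-000000000010'
  AND created &gt;= now() - INTERVAL '1 day'
ORDER BY created;
</code></pre>
<p>Devices that do not report a given metric simply leave that column <code>NULL</code>, as in the sample row above.</p>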
<p>Last but not least, single tables play very nicely with tools like ORM frameworks. It is easy to connect a specific set of ORM entities to the hypertable, representing either a full record or the result of an aggregation (which may need a native query instead of an automatically generated one).</p><p>But as with everything in this world, this choice has a large downside. Since time-series data is designed around the idea of being append-only (meaning that mutating existing records only happens occasionally), it is hard to delete data. Deleting data based on specific requirements is even harder, such as a user or customer asking to have all their data deleted.</p><p>In that situation, we’d have to crawl our way through potentially years of data, removing records from all over the place. That’s not only a burden on the WAL (Write-Ahead Log) to keep track of the changes, but it also creates loads and loads of I/O, reading and writing.</p><p>The same is true if we try to store collected and calculated sets of data in the same table. With many systems often having to backfill data (for example, from devices that lost their internet connection for a while and were collecting data locally), calculated values may have to be recalculated. That means that the already stored data must be invalidated (which may mean deleted) and reinserted.</p><p>Finally, if your company provides different tiers of data retention, good luck implementing this on a single table. It is the previous two issues, but on a constant, more than ugly, basis.</p><h1 id="multiple-table-designs">Multiple Table Designs</h1>
<p>Now that we know about the pros and cons of single table design, what are the differences when we aim for multiple table designs instead?</p><p>While querying is still simple, querying multiple sets of data simultaneously may be slightly more complicated, involving <code>JOINs</code> and <code>UNIONs</code> to merge data from the different tables. Requesting multiple sets of data at the same time is often done for efficiency reasons, requiring fewer database round trips and minimizing response time. Apart from that, there isn’t a massive difference in ease of use, except for table names, but we’ll come back to that in a second.</p><p>One of the major benefits of having multiple tables, especially when sliced by the customer, user, or whatever meaningful multi-tenancy divider for your use case, is the option to quickly react to GDPR- or CCPA-related requests to destroy and remove any customer-related data. In this case, it is as easy as finding all the client’s tables and dropping them. Removing them from backups is a different story, though. 😌</p><p>The same is true with calculated and collected data. Separating those tables makes it much easier to throw away and recalculate all or parts of the data when late information arrives.</p><p>Also similar is the previously mentioned data retention. Many companies storing huge amounts of data on behalf of their customers provide different data retention policies based on how much the customer is willing to pay. Having tables sliced by customers makes it easy to set customer-specific retention policies and even change them when a customer upgrades to a higher tier. If this is something you need, the <em>multi-table design</em> is it.</p><p>However, just as with single tables, multiples have drawbacks too.</p><p>Besides the already slightly more complicated elements around querying, which are not necessarily a disadvantage, having many tables requires planning a table name schema. 
The more dimensions we bring into the game (by customer, metric type, etc.), the more complicated our naming schema needs to be. That said, we may end up with table names such as <code>&lt;&lt;customer_name&gt;&gt;__&lt;&lt;metric_type&gt;&gt;</code>. While this doesn’t sound too bad, it can get ugly fast. We’ve all been there before. 😅</p>
<pre><code>tsdb=&gt; \dt *.cust_*
                    List of relations
 Schema |            Name            | Type  |   Owner
--------+----------------------------+-------+-----------
 public | cust_mycompany_co2         | table | tsdbadmin
 public | cust_mycompany_humidity    | table | tsdbadmin
 public | cust_mycompany_temperature | table | tsdbadmin
 public | cust_timescale_co2         | table | tsdbadmin
 public | cust_timescale_humidity    | table | tsdbadmin
 public | cust_timescale_temperature | table | tsdbadmin
(6 rows)
</code></pre>
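<p>To make the retention and right-to-be-forgotten points above more concrete, here is a rough sketch against the tables in the listing. It assumes those tables are hypertables, and the retention intervals are invented examples of pricing tiers rather than recommendations.</p><pre><code class="language-SQL">-- Tier-specific retention (the intervals are invented examples).
SELECT add_retention_policy('cust_mycompany_temperature', INTERVAL '30 days');
SELECT add_retention_policy('cust_timescale_temperature', INTERVAL '1 year');

-- A customer upgrades to a higher tier: replace the policy on their table.
SELECT remove_retention_policy('cust_mycompany_temperature');
SELECT add_retention_policy('cust_mycompany_temperature', INTERVAL '1 year');

-- Right to be forgotten: drop every table that belongs to the customer.
BEGIN;
DROP TABLE cust_mycompany_co2;
DROP TABLE cust_mycompany_humidity;
DROP TABLE cust_mycompany_temperature;
COMMIT;
</code></pre>
<p>Wrapping the <code>DROP TABLE</code> statements in one transaction means the customer's data either disappears completely or not at all. The same per-table calls would have to be repeated for each of the customer's metric tables when setting retention.</p>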
<p>Tools, such as an ORM framework, may make things even more complicated. Many of those tools are not designed to support arbitrary, runtime-generated table names, making it very complicated to integrate those with this concept. Using different PostgreSQL database schemas per customer and lowering the number of dimensions may help.</p><p>There is one more complexity: upgrading and migrating tables. Due to the multiple table design, we may end up with many similar tables segregated by the additional dimensions chosen. When upgrading the table schema, we need to ensure that all those tables eventually end up in a consistent state. </p><p>However, many automatic schema migration tools do not easily support that kind of multi-table migration. This forces us to fall back on writing migration scripts, trying to find all tables matching a specific naming schema, and ensuring that all are upgraded the same way. If we miss a table, we’ll figure it out eventually, but probably when it’s too late.</p><h1 id="the-tldr">The TL;DR</h1>
<p>Now that you’ve laid it all out and answered these questions, you can look at the requirements and see where your use case fits.</p><p>Some hard requirements may make your choice obvious, while some “nice-to-have” elements may still influence the final decision and hint at what may become a harder requirement in the near or far future.</p><table>
<thead>
<tr>
<th></th>
<th style="text-align:right"><strong>Single Table Design</strong></th>
<th style="text-align:right"><strong>Multiple Table Design</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Ease of Use</strong></td>
<td style="text-align:right">Easy</td>
<td style="text-align:right">Somewhat easy</td>
</tr>
<tr>
<td><strong>Multi-Tenancy / <br>Privacy Regulations</strong></td>
<td style="text-align:right">Hard</td>
<td style="text-align:right">Easy</td>
</tr>
<tr>
<td><strong>Future-Proofness</strong></td>
<td style="text-align:right">Easy</td>
<td style="text-align:right">Somewhat hard</td>
</tr>
<tr>
<td><strong>Tooling Support</strong></td>
<td style="text-align:right">Easy</td>
<td style="text-align:right">Hard</td>
</tr>
</tbody>
</table>
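<p>The “somewhat hard” rating for future-proofness in the multiple table design mostly comes from having to apply every schema change to every matching table. One way to avoid missing a table is to loop over the catalog instead of listing tables by hand. The following is only a sketch: it assumes the <code>cust_&lt;customer&gt;_&lt;metric&gt;</code> naming shown earlier, and the column <code>sensor_firmware</code> is made up for illustration.</p><pre><code class="language-SQL">-- Add the same column to every per-customer temperature table.
DO $$
DECLARE
    tbl record;
BEGIN
    FOR tbl IN
        SELECT schemaname, tablename
        FROM pg_tables
        WHERE schemaname = 'public'
          -- \_ matches a literal underscore; a bare _ would match any character
          AND tablename LIKE 'cust\_%\_temperature'
    LOOP
        -- %I quotes identifiers safely; IF NOT EXISTS makes reruns harmless
        EXECUTE format(
            'ALTER TABLE %I.%I ADD COLUMN IF NOT EXISTS sensor_firmware text',
            tbl.schemaname, tbl.tablename
        );
    END LOOP;
END
$$;
</code></pre>
<p>Because a <code>DO</code> block runs inside a single transaction, a failure on one table rolls back the change on all of them, which keeps the tables at a consistent schema level.</p>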
<p>While the <em>single table design</em> is very easy to get started with, it may be a hard sell if you need to comply with regulations. However, the <em>multiple table design</em> is certainly more complicated to manage and use correctly.</p><h1 id="so-whats-the-best-one-for-me">So What's the Best One for Me?</h1>
<p>There’s never a one-size-fits-all answer, or as consultants love to put it: it depends.</p><p>Unlike the design choices around the table layout, it’s too complex to make a real recommendation. You can only try to follow the suggested process of answering the questions above and looking at the answers to see which points represent hard requirements, which could become hard requirements, and which are simply nice to have. That way, you’ll likely find the answer to match your use case.</p><p>Also, remember that you may have different use cases with diverging requirements, so one use case may end up being perfectly fine running as a single table design, while the other one(s) may need multiple tables.</p><p>Plus, you have the chance to mix and match the benefits of both solutions. It was kind of hinted at in the text already, but it is possible to use a simplified multiple table design (for example, per metric type) and separate the customer dimension into a PostgreSQL database schema, with each schema representing one customer.</p><p>Similarly, it is possible to use the schema to separate customers and store all metrics/events/data for that specific customer in a single hypertable. There are plenty of options, only limited by your imagination.</p><p>Whatever you end up with, try to be as future-proof as possible. 
Try to imagine what the future will hold, and if you’re unsure whether a somewhat harder requirement will become a must-have, it may be worth considering it as a hard requirement now just to be on the safe side.</p><p>If you want to start designing your hypertable database schema as soon as possible, ensuring you get the best performance and user experience for your time-series data while achieving incredible compression ratios, check out <a href="https://console.cloud.timescale.com/signup">Timescale</a>.</p><p>If you’re looking to test it locally or run it on-prem, then we’ve got you covered: have a <a href="https://docs.timescale.com/">look at our documentation</a>.</p><h3 id="learn-more">Learn more</h3><ul><li><a href="https://timescale.ghost.io/blog/best-practices-for-time-series-metadata-tables/"><strong>Best Practices for (Time-)Series Metadata Tables</strong></a></li><li><a href="https://www.timescale.com/learn/designing-your-database-schema-wide-vs-narrow-postgresql-tables" rel="noreferrer"><strong>Choosing Your Database Schema: Wide vs. Narrow PostgreSQL Tables</strong></a></li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The PostgreSQL Job Scheduler You Always Wanted (Use it With Caution)]]></title>
            <description><![CDATA[We created a job scheduler built into PostgreSQL with no external dependencies. This is the power you always wanted, but with a few caveats.]]></description>
            <link>https://www.tigerdata.com/blog/the-postgresql-job-scheduler-you-always-wanted-but-be-careful-what-you-ask-for</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/the-postgresql-job-scheduler-you-always-wanted-but-be-careful-what-you-ask-for</guid>
            <category><![CDATA[Cloud]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Job Scheduler]]></category>
            <category><![CDATA[Announcements & Releases]]></category>
            <dc:creator><![CDATA[Kirk Laurence Roybal]]></dc:creator>
            <pubDate>Thu, 19 Jan 2023 16:28:52 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/10/Screenshot-2023-10-12-at-6.22.47-PM.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/10/Screenshot-2023-10-12-at-6.22.47-PM.png" alt="The PostgreSQL Job Scheduler You Always Wanted (Use it With Caution)" /><p>As a PostgreSQL guy, it really makes you wonder why a built-in job scheduler is not a part of the core PostgreSQL project. It is one of the most requested features in the history of ever. Yet, somehow, it just isn’t there.</p><p>Essentially, a job scheduler is a process that kicks off in-database functions and procedures at specified times and runs them independently of user sessions. The benefits of having a scheduler built into the database are obvious: no dependencies, no inherent security leaks, fits in your existing high availability plan, and takes part in your data recovery plan, too.</p><p>The <a href="https://www.postgresql.org/">PostgreSQL Global Development Group</a> has been debating for years about including a built-in job scheduler. Even after the addition of background processes that would support the feature (<a href="https://www.postgresql.org/about/news/postgresql-96-released-1703/">all the way back in 9.6</a>), background job scheduling is unfortunately not a part of core PostgreSQL.</p><p>So being the PostgreSQL lovers we are at <a href="https://www.timescale.com" rel="noreferrer">Timescale</a>, <strong>we decided to build such a scheduler</strong> so that our users and customers can benefit from a job scheduler in PostgreSQL. In TimescaleDB 2.9.1, we extended it to allow you to schedule jobs with flexible intervals and <a href="https://docs.timescale.com/api/latest/informational-views/job_errors/">provide you with better visibility of error logs</a>.</p><p>The flexible intervals enable you to determine whether the next run of the job occurs based on the scheduled clock time or the end of the last job run. 
And by “better visibility” of the job logs, we mean that job errors are also logged to a table where they can be queried from within the database. These features were added to prevent overlapping job executions, provide predictable job timing, and provide better forensics.</p><p>We make extensive use of this internal scheduler for our core features, enabling us to defer <a href="https://timescale.ghost.io/blog/allowing-dml-operations-in-highly-compressed-time-series-data-in-postgresql/" rel="noreferrer">compression</a>, <a href="https://docs.timescale.com/use-timescale/latest/data-retention/about-data-retention/" rel="noreferrer">data retention</a>, and refreshing of continuous aggregates to a background process (among other things).</p><p><em>📝 Editor's note: </em><a href="https://www.timescale.com/learn/is-postgres-partitioning-really-that-hard-introducing-hypertables" rel="noreferrer"><em>Learn more about how TimescaleDB's hypertables enable all these features above as a PostgreSQL extension, plus other awesome things like automatic partitioning. </em></a></p><p>This scheduler makes Timescale much more responsive to the caller and results in more efficient processing of these tasks. For our own benefit, the job scheduler needs to be internal to the database. It also needs to be efficient, controllable, and scale with the installation.</p><p>We made all this power available to you as a PostgreSQL end user. If you're running PostgreSQL on your own hardware, you can <a href="https://docs.timescale.com/self-hosted/latest/install/" rel="noreferrer">install the TimescaleDB extension</a>. If you're running in AWS, <a href="https://console.cloud.timescale.com" rel="noreferrer">you can try our platform for free</a>. </p><h2 id="the-postgresql-job-scheduler-debate">The PostgreSQL Job Scheduler Debate</h2><p>But not so fast. 
Before you start rejoicing, let’s review the reasons that the PostgreSQL Global Development Group chose not to include a scheduler in the database. They are educational and serve as a word of caution. </p><p>Rather than rehashing the discussion list on the subject, let's summarize the obstacles that came up in the <a href="https://www.postgresql.org/list/pgsql-hackers/">mailing list</a>: </p><p><strong>PostgreSQL is multi-process, not multi-threaded.</strong> This simple fact makes having a one-to-one relationship of processes to user-defined tasks a fairly heavy implementation issue. Under normal circumstances, PostgreSQL expects to lay a process onto a CPU (affinity), load the memory through the closest non-uniform memory access (NUMA) controller, and do some fairly heavy data processing. </p><p>This works great when the expectation is that the process will be very busy the majority of the time. Schedulers do not work like that. They sit around with some cheap threads waiting to do something for the majority of the life of the thread. Just the context switching alone would make using a full-blown process very expensive.</p><p><strong>Background worker processes are a relatively small pool by design.</strong> This follows partly from the previous point, and partly from the fact that each process allocates the prescribed memory at startup. So, these processes compete with SQL query workers for CPU and memory. And the background processes have priority over both resources since they are allocated at system startup.</p><p><strong>The next issue is more semantic.</strong> There are quite a few external schedulers available. Each one of them has a different implementation of the time management system. That is, there is a question about just how exactly the job should be invoked. Should it be invoked again if it is still running from the last time? Should the job be started again based on clock time or relative to the previous job run? 
From the beginning or the end of the last run? </p><p>There are quite a few more questions of this nature, but you get the idea. No matter how the community answers these questions, somebody will complain that the implementation is the wrong answer because <code>&lt;insert silly mathematician answer here&gt;</code>.</p><h2 id="why-we-still-need-a-postgresql-job-scheduler">Why We Still Need a PostgreSQL Job Scheduler</h2><p>Timescale doesn't have the luxury of debating how many angels can dance on the head of a pin. As a database service working with large volumes of data in PostgreSQL, we face a hard requirement of background maintenance for the actions of archival, compression, and general storage. Timescale's core features, excluding <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/hyperfunctions/">hyperfunctions</a>, depend on the job scheduler.</p><p>But, rather than create a bespoke scheduler for our own purposes, we built a general-purpose scheduler with a public application programming interface.</p><p>This general-purpose scheduler is generally available as part of TimescaleDB. You may use it to set a schedule for anything you can express as a procedure or function. In PostgreSQL, that's a huge advantage because you have the full power of the <a href="https://www.tigerdata.com/blog/top-8-postgresql-extensions" rel="noreferrer">PostgreSQL extension</a> system at your disposal. That includes plug-in languages, which allow you to do anything the operating system can do.</p><p>Timescale assumes that the developer/administrator is a sane and reasonable person who can deal with a balance of complexity. 
That is longhand for "we trust you to do the right thing."</p><h2 id="with-great-power-comes-great-responsibility">With Great Power Comes Great Responsibility</h2><p>So, let's talk first about a few best design practices for using the Timescale (PostgreSQL) built-in job scheduler.</p><ol><li><strong>Keep it short.</strong> The longer a job dwells in its background process, the more jobs end up running concurrently. You are also using a process from a pool shared with other system tasks, such as sorting and sequential scans.</li><li><strong>Keep it unlocked.</strong> Try to minimize the number of exclusive locks you create while your job runs.</li><li><strong>Keep it down.</strong> The processes that you are using are shared by the system, and you are competing for resources with SQL query worker processes. Keep that in mind before you kick off hundreds or thousands of scheduled jobs.</li></ol><p>Now, assuming we are using the product fairly and judiciously, we can move on to the features and benefits of having an internal scheduler.</p><h2 id="built-in-postgresql-job-scheduler-all-the-nice-stuff">Built-In PostgreSQL Job Scheduler: All the Nice Stuff</h2><p>Now that we've covered the things that demand caution, here's a list of some of the benefits of using this scheduler: </p><ul><li>Physical streaming replication will also replicate the job schedule. When you go to switch over to your replica, everything will already be there.</li><li>You don't need a separate high-availability plan for your scheduler. 
If the system is alive, so are your scheduled jobs.</li><li>The jobs can report on their own success or failure to internal tables and the <a href="https://www.tigerdata.com/learn/what-is-audit-logging-and-how-to-enable-it-in-postgresql" rel="noreferrer">PostgreSQL log</a> file.</li><li>The jobs can do administrative functions like dropping tables and changing table structure by monitoring the existing needs and structures.</li><li>When you install Timescale, it's already there.</li></ul><p>📝<em> Editor's note: Quick reminder that you can </em><a href="https://docs.timescale.com/self-hosted/latest/install/" rel="noreferrer"><em>install the TimescaleDB extension</em></a><em> if you're running your own PostgreSQL database, or </em><a href="https://console.cloud.timescale.com" rel="noreferrer"><em>sign up for the Timescale platform</em></a><em> (free for 30 days). </em></p><h2 id="how-the-job-scheduler-works">How The Job Scheduler Works</h2><p>There is <a href="https://docs.timescale.com/use-timescale/latest/jobs/" rel="noreferrer">a quick introductory article in the Timescale documentation</a>. Click that link if you want more detailed information.</p><p>The TL;DR version is that you make a <a href="https://www.tigerdata.com/learn/understanding-postgresql-user-defined-functions" rel="noreferrer">PostgreSQL function</a> or procedure and then call the <code>add_job()</code> function to schedule it. Of course, you can remove it from the schedule using… Wait for it... <code>delete_job()</code>.</p><p>That's it. Really. All that power is at your fingertips, and all you need to know is two function signatures.</p><p>Something to be aware of while you're using the scheduler is that the job may be scheduled to repeat from the end of the last run or from the scheduled clock time (in TimescaleDB 2.9.1 and beyond). 
This allows you to ensure that the previous job has completed (by picking from the end of the run) or that the job executes at a prescribed time (making job completion your responsibility). </p><p>If you feel a bit homesick and just want to look at your adorable job, there's also:</p><pre><code class="language-SQL">SELECT * FROM timescaledb_information.jobs;
</code></pre>
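<p>Put together, the workflow described above might look like the following sketch. The procedure name <code>log_heartbeat</code>, its config payload, the start timestamp, and the job ID <code>1000</code> are placeholders. The <code>fixed_schedule</code> and <code>initial_start</code> arguments and the <code>job_errors</code> view are assumed to be available, which means TimescaleDB 2.9.1 or later.</p><pre><code class="language-SQL">-- A job is a procedure (or function) that takes a job ID and a JSONB config.
CREATE OR REPLACE PROCEDURE log_heartbeat(job_id int, config jsonb)
LANGUAGE plpgsql
AS $$
BEGIN
    RAISE NOTICE 'job % ran with config %', job_id, config;
END
$$;

-- Run hourly on the clock, aligned to initial_start (fixed schedule).
SELECT add_job(
    'log_heartbeat',
    INTERVAL '1 hour',
    config         =&gt; '{"source": "example"}',
    initial_start  =&gt; '2023-02-01 00:00:00+00',
    fixed_schedule =&gt; true
);

-- Alternative: measure the interval from the end of the previous run.
-- SELECT add_job('log_heartbeat', INTERVAL '1 hour', fixed_schedule =&gt; false);

-- Inspect the schedule of the new job.
SELECT job_id, proc_name, schedule_interval, fixed_schedule, next_start
FROM timescaledb_information.jobs
WHERE proc_name = 'log_heartbeat';

-- Look at recorded failures, if any.
SELECT job_id, proc_name, start_time, sqlerrcode, err_message
FROM timescaledb_information.job_errors
WHERE proc_name = 'log_heartbeat';

-- Remove the job; replace 1000 with the ID returned by add_job().
SELECT delete_job(1000);
</code></pre>
<p>Choosing <code>fixed_schedule =&gt; false</code> trades predictable clock times for the guarantee that a new run starts only after the previous one has finished.</p>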
<p>And, of course, for completeness, there's always <code>alter_job()</code> for rescheduling, renaming, etc.</p><p>Once your job has been created, it becomes the responsibility of the job scheduler to invoke it at the proper time. The job scheduler is a PostgreSQL background process. It wakes up every 10 seconds and checks to see if any job is scheduled in the near future.</p><p>If such a job is queued up, it will request another background process from the PostgreSQL postmaster process. The database system will provide one (if any are available). The provided process becomes responsible for the execution of your job.</p><p>This basic operation has some ramifications. We have already mentioned that we need to use these background processes sparingly for resource allocation reasons. Also, there are only a few of them available. The maximum number of background processes that can run in parallel is determined by <a href="https://www.tigerdata.com/blog/timescale-parameters-you-should-know-about-and-tune-to-maximize-your-performance" rel="noreferrer"><code>max_worker_processes</code></a>. If you need help configuring TimescaleDB background workers, <a href="https://docs.timescale.com/use-timescale/latest/configuration/advanced-parameters/#timescaledbmax_background_workers-int" rel="noreferrer">check out our documentation</a>.</p><p><a href="https://timescale.ghost.io/blog/timescale-parameters-you-should-know-about-and-tune-to-maximize-your-performance/" rel="noreferrer"><em>📝 You can also check out this blog post on tuning TimescaleDB parameters. </em></a></p><p>On my system (Kubuntu 22.04.1, PostgreSQL 14.6), the default is 43. That number is just an example, as the package manager for each distribution of PostgreSQL has discretion about the initial setting. Your mileage <strong>will</strong> vary.</p><p>Changing this parameter requires a restart, so you will need to make a judgment call about how many concurrent processes you expect to kick off. 
Add that to this base number and restart your system. Of course, a reasonable number has been added for you in Timescale. Remember the CPU and memory limitations while you are making this adjustment.</p><h2 id="what-to-do-with-a-postgresql-job-scheduler-a-few-ideas">What to Do With a PostgreSQL Job Scheduler: A Few Ideas</h2><p>The scheduler was originally created to build out-of-the-box data management features. That includes <a href="https://docs.timescale.com/timescaledb/latest/overview/core-concepts/compression/">compression</a>, <a href="https://docs.timescale.com/timescaledb/latest/overview/core-concepts/continuous-aggregates/">continuous aggregates</a>, <a href="https://docs.timescale.com/timescaledb/latest/overview/core-concepts/data-retention/">retention policy implementation</a>, <a href="https://docs.timescale.com/api/latest/hyperfunctions/downsample/">downsampling</a>, and backfilling.</p><p>You may want to use this for event notifications, sending an email, clustered index maintenance, partition creation, pruning, archiving, refreshing materialized views, or summarizing data somewhere to avoid the need for triggers. These are just a few of the obvious ideas that jump into my consciousness. You can literally do anything that the operating system allows.</p><h2 id="what-not-to-do">What <strong>Not</strong> to Do</h2><p>This would be a bad place to gum up the locking tables. That is, be sure that whatever you do here is done in a concurrent manner.</p><p><code>REINDEX INDEX CONCURRENTLY</code> is better than <code>DROP</code> / <code>CREATE INDEX</code>. <code>REFRESH MATERIALIZED VIEW CONCURRENTLY</code> is better than <code>REFRESH MATERIALIZED VIEW</code>. You get it. Use <code>CONCURRENTLY</code>, or design concurrently. 
Better yet, do things in a tiny atomic way that takes little time anyway.</p><p>Long-running transactions that create a lot of locks will interfere with the background writer, the planner, and the vacuum processes. If you crank up too many concurrent processes, you may also run out of memory. Please try to schedule everything to run in series. You’ll thank me later.</p><h2 id="well-wishes-to-the-newly-crowned-emperor">Well Wishes to the Newly Crowned Emperor</h2><p>Now you have the power to do anything your little heart desires in the background of PostgreSQL without having any external dependencies. We hope you feel empowered, awed, and a little bit special. We also hope you will use your new powers for good! </p><h2 id="try-the-updated-job-scheduler">Try the Updated Job Scheduler</h2><p>The job scheduler is available in TimescaleDB 2.9.1 and beyond. If you’re self-hosting TimescaleDB, follow the <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/upgrades/#upgrade-timescaledb">upgrade instructions</a> in our documentation. If you are using the <a href="https://www.timescale.com/cloud" rel="noreferrer">Timescale platform</a>, upgrades are automatic, meaning that you already have the scheduler at your fingertips. </p><h2 id="keep-learning">Keep Learning </h2><p>If this article has inspired you to keep going with your PostgreSQL hacking, <a href="https://www.timescale.com/learn/postgresql-performance-tuning-key-parameters" rel="noreferrer">check out our collection of articles on PostgreSQL fine tuning.</a> </p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Timescale vs. Amazon RDS PostgreSQL: Up to 350x Faster Queries, 44 % Faster Ingest, 95 % Storage Savings for Time-Series Data]]></title>
            <description><![CDATA[Why are developers migrating from Amazon RDS for PostgreSQL to Timescale to handle their time-series data workloads? Our benchmark answers the question: faster queries, faster ingest, and 95 % storage savings for time-series data.]]></description>
            <link>https://www.tigerdata.com/blog/timescale-cloud-vs-amazon-rds-postgresql-up-to-350-times-faster-queries-44-faster-ingest-95-storage-savings-for-time-series-data</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/timescale-cloud-vs-amazon-rds-postgresql-up-to-350-times-faster-queries-44-faster-ingest-95-storage-savings-for-time-series-data</guid>
            <category><![CDATA[Cloud]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Benchmarks & Comparisons]]></category>
            <dc:creator><![CDATA[James Blackwood-Sewell]]></dc:creator>
            <pubDate>Tue, 15 Nov 2022 14:19:00 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/06/2023-06-23-Amazon-RDS-timescale-hero.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/06/2023-06-23-Amazon-RDS-timescale-hero.png" alt="Timescale vs. Amazon RDS PostgreSQL: Up to 350x Faster Queries, 44 % Faster Ingest, 95 % Storage Savings for Time-Series Data" /><p>Since we launched Timescale, our cloud-hosted PostgreSQL service for time-series data and event and analytics workloads, we have seen large numbers of customers migrating onto it from the general-purpose Amazon RDS for PostgreSQL. These developers usually struggle with performance issues on ingest, sluggish real-time or historical queries, and spiraling storage costs.</p><p>They need a solution that will let them keep using PostgreSQL while not blocking them from getting value out of their time-series data. Timescale fits them perfectly, and this article will present benchmarks that help explain why.</p><p>When we talk to these customers, we often see a pattern: </p><ol><li>At the start of a project, developers choose PostgreSQL because it’s <a href="https://survey.stackoverflow.co/2022/#section-most-loved-dreaded-and-wanted-databases">a database they know and love</a>. The team is focused on shipping features, so they choose the path of least resistance—Amazon RDS for PostgreSQL.</li><li>Amazon RDS for PostgreSQL works well at first, but as the volume of time-series data in their database grows, they notice slower ingestion, sluggish query performance, and growing storage costs.</li><li>As the database becomes a bottleneck, it becomes a target for optimization. Partitioning is implemented, materialized views are configured (destroying the ability to get real-time results), and schedules are created for view refreshes and partition maintenance. Operational complexity grows, and more points of failure are introduced.</li><li>Eventually, in an effort to keep up, instance sizes are increased, and larger, faster volumes are created. 
Bills skyrocket, while the improvements are only temporary.</li><li>The database is now holding the application hostage regarding performance and AWS spending. A time-series database is discussed, but the developers and the application still rely on PostgreSQL features.<br></li></ol><p>Does it sound familiar? It’s usually at this stage when developers realize that Amazon RDS for PostgreSQL is no longer a good choice for their applications, start seeking alternatives, and come across Timescale. </p><p>Timescale runs on AWS, offering hosted PostgreSQL with added time-series superpowers. Since Timescale is still PostgreSQL and already in AWS, the transition from RDS is swift: Timescale integrates with your PostgreSQL-based application directly and plays nicely <a href="https://timescale.ghost.io/blog/do-more-on-aws-with-timescale-cloud-8-services-to-build-time-series-apps-faster/">with your AWS infrastructure</a>. </p><p>Timescale has always strived to enhance PostgreSQL with the ingestion, query performance, and cost-efficiency boosts that developers need to run their data-intensive applications, all while providing a seamless developer experience with advanced features to ease working with time-series data.</p><p>But don’t take our word for it—let the numbers speak for themselves. In this blog post, we share a benchmark comparing the performance of Timescale to Amazon RDS for PostgreSQL. 
You will find all the details of our comparison and all the information required to run the benchmark yourself using the <a href="https://github.com/timescale/tsbs">Time-Series Benchmarking Suite</a> (TSBS).</p><h2 id="time-series-data-benchmarking-a-sneak-preview">Time-Series Data Benchmarking: A Sneak Preview</h2><p>For those who can’t wait, here’s a summary: <strong>for a 160 GB dataset with almost 1 billion rows stored on a 1 TB volume, Timescale outperforms Amazon RDS for PostgreSQL with up to 44 % higher ingest rates, queries running up to 350x faster, and a 95 % smaller data footprint.</strong></p><p>When we ingested data in both Timescale and Amazon RDS for PostgreSQL (using gp3 EBS volumes for both), Timescale was <strong>34 % faster than RDS on the 4 vCPU configuration</strong> and <strong>44 % faster on the 8 vCPU configuration</strong>.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/06/2023-06-08-time-to-run-table.png" class="kg-image" alt="A diagram of our time to run ingest benchmark between Timescale Cloud and RDS for time-series data" loading="lazy" width="1176" height="580" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/06/2023-06-08-time-to-run-table.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/06/2023-06-08-time-to-run-table.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/06/2023-06-08-time-to-run-table.png 1176w" sizes="(min-width: 720px) 720px"></figure><p>When we ran a variety of time-based queries on both databases, ranging from simple aggregates and more complex rollups to last-point queries, <strong>Timescale consistently outperformed Amazon RDS for PostgreSQL in every query category, sometimes by as much as 350x</strong> (you can see all of the results in the <a
href="#benchmarking-configuration" rel="noreferrer">Benchmarking section</a>).</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/06/2023-06-08-median-time-to-run-table.png" class="kg-image" alt="A diagram of our median time to run CPU benchmark between Timescale Cloud and RDS for time-series data" loading="lazy" width="1176" height="580" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/06/2023-06-08-median-time-to-run-table.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/06/2023-06-08-median-time-to-run-table.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/06/2023-06-08-median-time-to-run-table.png 1176w" sizes="(min-width: 720px) 720px"></figure><p><strong>Timescale used 95 % less disk</strong> than Amazon RDS for PostgreSQL, thanks to Timescale’s <a href="https://timescale.ghost.io/blog/building-columnar-compression-in-a-row-oriented-database/">native columnar compression</a>, which reduced the size of the test database from 159 GB to 8.6 GB. 
Timescale's compression uses <a href="https://timescale.ghost.io/blog/time-series-compression-algorithms-explained/">best-in-class algorithms</a>, including Gorilla and delta-of-delta, to dramatically reduce the storage footprint.<br></p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/06/2023-06-08-total-database-size.png" class="kg-image" alt="A diagram of our storage savings benchmark between Timescale Cloud and RDS for time-series data" loading="lazy" width="1176" height="580" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/06/2023-06-08-total-database-size.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/06/2023-06-08-total-database-size.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/06/2023-06-08-total-database-size.png 1176w" sizes="(min-width: 720px) 720px"></figure><p>And the storage savings above don’t even consider the effect of the <a href="https://timescale.ghost.io/blog/expanding-the-boundaries-of-postgresql-announcing-a-bottomless-consumption-based-object-storage-layer-built-on-amazon-s3/">object store built on Amazon S3</a> that we just announced for Timescale. This feature is available for testing via private beta at the time of writing but is not yet ready for production use. </p><p>Still, by running one SQL command, this novel functionality will allow you to tier an unlimited amount of data to the S3 object storage layer that’s now an integral part of Timescale. This layer is columnar (it’s based on Apache Parquet), elastic (you can increase and reduce your usage), consumption-based (you pay only for what you store), and one order of magnitude cheaper than our EBS storage, with no extra charges for queries or usage. 
This feature will make scalability even more cost-efficient in Timescale, so stay tuned for some exciting benchmarks!</p><p>In the remainder of this post, we’ll deep dive into our performance benchmark comparing Amazon RDS for PostgreSQL with Timescale, detailing our methods and results for comparing ingest rates, query speed, and storage footprint. We’ll also offer insight into <em>why</em> Timescale puts up the numbers it does, with a short introduction to its vital advantages for handling time-series, events, and analytics data.</p><p>If you’d like to see how Timescale performs for your workload, <a href="https://console.cloud.timescale.com/signup">sign up for Timescale</a> today: it’s free for 30 days, there’s no credit card required to sign up, and you can spin up your first database in minutes.</p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text">More on RDS:<br>- <a href="https://timescale.ghost.io/blog/estimating-rds-costs/" rel="noreferrer">Estimating RDS Costs</a><br>- <a href="https://timescale.ghost.io/blog/understanding-rds-pricing-and-costs/" rel="noreferrer">Why Is RDS so Expensive?</a><br>- <a href="https://timescale.ghost.io/blog/alternatives-to-rds/" rel="noreferrer">Alternatives to RDS</a><br>- <a href="https://timescale.ghost.io/blog/amazon-aurora-vs-rds-understanding-the-difference/" rel="noreferrer">Amazon Aurora vs. RDS</a></div></div><h2 id="benchmarking-configuration">Benchmarking Configuration</h2><p><a href="https://timescale.ghost.io/blog/timescaledb-vs-amazon-timestream-6000x-higher-inserts-175x-faster-queries-220x-cheaper/">As with our previous Timescale benchmarks</a>, we used the <a href="https://github.com/timescale/tsbs">open-source Time-series Benchmarking Suite</a> to run our tests. Feel free to download and run it for yourself using the settings below. 
Suggestions for improvements are also welcome: comment on <a href="https://twitter.com/TimescaleDB">Twitter</a> or <a href="https://slack.timescale.com/">Timescale Slack</a> to join the conversation.</p><p>We used the following TSBS configuration across all runs:<br></p>
<!--kg-card-begin: html-->
<table style="border-collapse:collapse;"><colgroup><col width="213"><col width="235"><col width="248"></colgroup><tbody>
<tr><td style="border:1pt solid #000000;padding:5pt;vertical-align:top;"><br></td><td style="border:1pt solid #000000;padding:5pt;vertical-align:top;"><strong>Timescale</strong></td><td style="border:1pt solid #000000;padding:5pt;vertical-align:top;"><strong>Amazon RDS for PostgreSQL</strong></td></tr>
<tr><td style="border:1pt solid #000000;padding:5pt;vertical-align:top;"><strong>PostgreSQL version</strong></td><td style="border:1pt solid #000000;padding:5pt;vertical-align:top;">14.5</td><td style="border:1pt solid #000000;padding:5pt;vertical-align:top;">14.4 (latest available)</td></tr>
<tr><td style="border:1pt solid #000000;padding:5pt;vertical-align:top;"><strong>Configuration changes</strong></td><td style="border:1pt solid #000000;padding:5pt;vertical-align:top;">No changes</td><td style="border:1pt solid #000000;padding:5pt;vertical-align:top;">synchronous_commit=off (to match Timescale)</td></tr>
<tr><td style="border:1pt solid #000000;padding:5pt;vertical-align:top;"><strong>Partitioning system</strong></td><td style="border:1pt solid #000000;padding:5pt;vertical-align:top;">TimescaleDB (partitions automatically configured)</td><td style="border:1pt solid #000000;padding:5pt;vertical-align:top;">pg_partman (partitions manually configured)</td></tr>
<tr><td style="border:1pt solid #000000;padding:5pt;vertical-align:top;"><strong>Compression into columnar</strong></td><td style="border:1pt solid #000000;padding:5pt;vertical-align:top;">Yes, for older partitions</td><td style="border:1pt solid #000000;padding:5pt;vertical-align:top;">Not supported</td></tr>
<tr><td style="border:1pt solid #000000;padding:5pt;vertical-align:top;"><strong>Partition size</strong></td><td colspan="2" style="border:1pt solid #000000;padding:5pt;vertical-align:top;text-align:center;">4h (each system ended up with 26 non-default partitions)</td></tr>
<tr><td style="border:1pt solid #000000;padding:5pt;vertical-align:top;"><strong>Scale (number of devices)</strong></td><td colspan="2" style="border:1pt solid #000000;padding:5pt;vertical-align:top;text-align:center;">25,000</td></tr>
<tr><td style="border:1pt solid #000000;padding:5pt;vertical-align:top;"><strong>Ingest workers</strong></td><td colspan="2" style="border:1pt solid #000000;padding:5pt;vertical-align:top;text-align:center;">16</td></tr>
<tr><td style="border:1pt solid #000000;padding:5pt;vertical-align:top;"><strong>Rows ingested</strong></td><td colspan="2" style="border:1pt solid #000000;padding:5pt;vertical-align:top;text-align:center;">868,000,000</td></tr>
<tr><td style="border:1pt solid #000000;padding:5pt;vertical-align:top;"><strong>TSBS profile</strong></td><td colspan="2" style="border:1pt solid #000000;padding:5pt;vertical-align:top;text-align:center;">DevOps</td></tr>
<tr><td style="border:1pt solid #000000;padding:5pt;vertical-align:top;"><strong>Instance type</strong></td><td colspan="2" style="border:1pt solid #000000;padding:5pt;vertical-align:top;text-align:center;">M5 series (4 vCPU + 16 GB memory and 8 vCPU + 32 GB memory)</td></tr>
<tr><td style="border:1pt solid #000000;padding:5pt;vertical-align:top;"><strong>Disk type</strong></td><td colspan="2" style="border:1pt solid #000000;padding:5pt;vertical-align:top;text-align:center;">gp3 (16K IOPS, 1000 MiBps throughput)</td></tr>
<tr><td style="border:1pt solid #000000;padding:5pt;vertical-align:top;"><strong>Volume size</strong></td><td colspan="2" style="border:1pt solid #000000;padding:5pt;vertical-align:top;text-align:center;">1 TB</td></tr>
</tbody></table>
<!--kg-card-end: html-->
<p><br><a href="https://docs.timescale.com/use-timescale/latest/hypertables/about-hypertables/" rel="noreferrer">Hypertables</a> are the base abstraction of Timescale's time-series magic. While they work just like regular PostgreSQL tables, they boost performance and the user experience with time-series data by automatically partitioning it (large tables become smaller chunks or data partitions within a table) and allowing it to be queried more efficiently.</p><p>If you’re familiar with PostgreSQL, you may be asking questions about partitioning in RDS. In the past, we have benchmarked TimescaleDB against unpartitioned PostgreSQL simply because that’s the journey most of our customers follow. However, we inevitably get asked why we don’t compare against PostgreSQL partitioned with <a href="https://github.com/pgpartman/pg_partman">pg_partman</a>.</p><p>Pg_partman is another <a href="https://www.tigerdata.com/blog/top-8-postgresql-extensions" rel="noreferrer">PostgreSQL extension</a> that provides partition creation but doesn’t seamlessly create partitions on the fly: if someone inserts data outside of the currently created partitions, it either goes into a catch-all partition, degrading performance, or, worse still, fails. It also doesn’t provide any additional time-series functionality, planner enhancements, or compression.</p><p>We listen to these comments, so we decided to highlight Timescale's performance (and convenience) by enabling pg_partman on the RDS systems in this benchmark. After all, the extension is considered a <a href="https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/PostgreSQL_Partitions.html">best practice for partitioned tables</a> in Amazon RDS for PostgreSQL, so it was only fair we’d use it.<br></p><p>On our end, we enabled native compression on Timescale, compressing everything but the most recent chunk. To do so, we segmented by the <code>tags_id</code> column and ordered by time (descending) and the <code>usage_user</code> column. 
This is something we couldn’t reproduce in RDS since it doesn’t offer any equivalent functionality. </p><p>Almost everything else was exactly the same for both databases. We used the same data, indexes, and queries: almost one billion rows of data in which we ran a set of queries 100 times each using 16 threads. The only difference is that the Timescale queries use the <code>time_bucket()</code> function for arbitrary interval bucketing, whereas the PostgreSQL queries use extract (which performs equally well but is much less flexible).</p><p>We have split the performance data extracted from the benchmark into three sections: ingest, query, and storage footprint.</p><h2 id="ingest-performance-comparison">Ingest Performance Comparison</h2><p>As we started to run Timescale and RDS through our 16-thread ingestion benchmark to insert almost 1 billion rows of data, we began to see some amazing wins. Timescale beat RDS by 32 % with 4 vCPUs and 44 % with 8 vCPUs. Both systems had the same I/O performance configured on their gp3 disk, so we kept looking to get to the bottom of why we were winning on busy systems.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/06/2023-06-08-ingest-performance-graph.png" class="kg-image" alt="A diagram of our ingest performance during benchmark run between Timescale Cloud and RDS for time-series data" loading="lazy" width="1992" height="1362" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/06/2023-06-08-ingest-performance-graph.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/06/2023-06-08-ingest-performance-graph.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2023/06/2023-06-08-ingest-performance-graph.png 1600w, 
https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/06/2023-06-08-ingest-performance-graph.png 1992w" sizes="(min-width: 720px) 720px"></figure><p></p><p>To test the outcome without any disk I/O involvement, we used <a href="https://www.postgresql.org/docs/current/pgbench.html">pgbench</a> to run the following CPU-hungry SQL statement on 8 vCPU machines (using a scale of 1,000 and 16 jobs) and had some more interesting results straight away. </p><pre><code class="language-SQL">SELECT count(*) FROM (SELECT generate_series(1,10000000)) a
</code></pre>
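<p>For readers who want to try a similar CPU-only test, the sketch below shows one way such a run could be set up. It is not the exact command line used for this benchmark: the file name <code>cpu.sql</code>, the client count, the 60-second duration, and the use of the <code>PGDATABASE</code> environment variable are all assumptions, and only the scale of 1,000 and the 16 jobs come from the description above.</p><pre><code class="language-bash"># Save the statement from above as a custom pgbench script (hypothetical file name).
printf '%s\n' 'SELECT count(*) FROM (SELECT generate_series(1,10000000)) a;' | tee cpu.sql

# Only attempt the run when pgbench is installed and PGDATABASE names a target database.
if command -v pgbench; then
  if [ -n "${PGDATABASE:-}" ]; then
    # Initialize the standard pgbench tables at scale factor 1,000.
    pgbench -i -s 1000 "$PGDATABASE"
    # Run only the custom script: 16 clients (assumed), 16 worker threads,
    # 60 seconds (assumed), skipping the pre-run vacuum.
    pgbench -n -f cpu.sql -c 16 -j 16 -T 60 "$PGDATABASE"
  fi
fi
</code></pre>
<p>pgbench prints an average latency at the end of the run, which is the figure to compare between the two systems.</p>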
<p>Timescale was about 1.7x as fast, returning an average query latency of <strong>518 ms</strong>, while RDS returned <strong>904 ms</strong>. This gap (roughly 43 % lower latency) was consistent on both 4 vCPU and 8 vCPU instances.</p><p>Unfortunately, we can’t look inside the black box that is RDS to see what’s happening here. One hypothesis is that a large part of this difference is because Timescale gives you the exact amount of vCPU you provision <strong>for PostgreSQL</strong> (thanks, Kubernetes!), while Amazon RDS provides you a <strong>host with that many vCPUs</strong>. </p><p>This means that we (Timescale) pay for the operating system overhead on Timescale, while on RDS, you (as the user) pay for this. As instances get very busy and processes fight with the operating system for CPU (like for an ingest benchmark or when you’re crunching a lot of data), this becomes a much bigger advantage for Timescale than we had anticipated. As usual, if anybody has any other possible reasons for this difference, please reach out; we’d love to hear from you.</p><p>Our benchmark shows Timescale not only ingests data faster across the board but also provides more predictable and faster results under heavy CPU load. Not a bad feature when you want to get the most out of your instances.</p><h2 id="query-performance-comparison">Query Performance Comparison</h2><p>Query performance is something that needs to be optimized in a time-series database. When you ask for data, you often need to have it as quickly as possible, especially when you’re powering a real-time dashboard. TSBS has a wide range of queries, each with its own somewhat hard-to-decode description (you can find a <a href="https://github.com/timescale/tsbs#appendix-i-query-types-">quick primer here</a>). We ran each query 100 times on the 4 vCPU instance types (which wasn’t quick in some cases) and recorded the results. </p><p>When we look at the table of query runtimes, we can see a clear story. 
Timescale is consistently faster than Amazon RDS, often by more than 100x. In some cases, Timescale performs over 350x better, and it doesn’t perform worse for any query type. The table below shows the data for 4 vCPU instances, but results are similar across all the CPU types we tested (and of course, if your instance is very busy, you could get even better results).</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/06/2023-06-08-table.png" class="kg-image" alt="A table of the median query times in the benchmark between Timescale Cloud and RDS for time-series data" loading="lazy" width="1352" height="1824" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/06/2023-06-08-table.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/06/2023-06-08-table.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/06/2023-06-08-table.png 1352w" sizes="(min-width: 720px) 720px"></figure><p>When we examine the amount of data loaded and processed by some of the queries with the larger differences, the reason behind these improvements becomes clear. Timescale compresses data into a <a href="https://www.tigerdata.com/blog/building-columnar-compression-in-a-row-oriented-database" rel="noreferrer">columnar</a> format, which has several impacts on performance:</p><ol><li>Timescale compressed chunks group by column, not by row. When a subset of the columns for a table are required, they can be loaded individually, reducing the amount of data processed (especially for the <code>single-groupby-</code> query types).</li><li>When compressed data is loaded from disk, it takes less time, as there is simply less data to read. 
This is traded off against additional compute cycles to decompress the data, a compromise that works in our favor, as you can see in the results above.</li><li>As compressed data is smaller, more of it can be cached in shared memory, meaning even fewer reads from disk (for a great introduction to this, check out <a href="https://timescale.ghost.io/blog/database-scaling-postgresql-caching-explained/">Database Scaling: PostgreSQL Caching Explained</a> by our own Kirk Roybal).</li></ol><p>And just as a reminder, RDS had pg_partman configured for this test. This shows that while Timescale provides efficient partitioning via hypertables, we also provide a lot more than that (353x more in some instances).</p><h2 id="storage-usage-comparison">Storage Usage Comparison</h2><p>Total storage size was measured at the end of the TSBS ingest cycle as the size of the database that TSBS had been ingesting data into. For this benchmark on Timescale, all but the most recent partition of data is compressed into our <a href="https://timescale.ghost.io/blog/building-columnar-compression-in-a-row-oriented-database/">native columnar format</a>, which uses best-in-class algorithms, including Gorilla and delta-of-delta, to reduce the storage footprint for the CPU table dramatically. </p><p>After compression, you can still access the data as usual, but you get the benefits of it being smaller and the benefits of it being columnar.</p><p>Using less storage can mean smaller volumes, lower cost, and faster access (as we saw in the query results above). In the case of this benchmark, we saved about 95 %, reducing our database from 159 GB to 8.6 GB. 
And this isn’t an outlier, we often see these numbers for production workloads at real customers.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/06/2023-06-08-total-cpu-table-size.png" class="kg-image" alt="A diagram of the total database size in the benchmark between Timescale Cloud and RDS for time-series data" loading="lazy" width="1666" height="1090" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/06/2023-06-08-total-cpu-table-size.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/06/2023-06-08-total-cpu-table-size.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2023/06/2023-06-08-total-cpu-table-size.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/06/2023-06-08-total-cpu-table-size.png 1666w" sizes="(min-width: 720px) 720px"></figure><h2 id="beyond-benchmarks-a-closer-look-at-timescale">Beyond Benchmarks: A Closer Look at Timescale</h2><p>Now that we’ve examined the results of the benchmark, let’s briefly explore some of the features that make these results possible. 
This section aims to offer insight into the performance comparison above and highlight some other aspects of Timescale that will improve your developer experience when working with time-series data.</p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">✨</div><div class="kg-callout-text"><i><em class="italic" style="white-space: pre-wrap;">If you’re new to Timescale, you can also </em></i><a href="https://console.cloud.timescale.com/signup"><i><em class="italic" style="white-space: pre-wrap;">sign up for free </em></i></a><i><em class="italic" style="white-space: pre-wrap;">and follow our </em></i><a href="https://docs.timescale.com/getting-started/latest/create-hypertable/"><i><em class="italic" style="white-space: pre-wrap;">Getting Started guide</em></i></a><i><em class="italic" style="white-space: pre-wrap;">, which will introduce you to our main features in a hands-on way.</em></i></div></div><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/06/2023-06-08-chart.png" class="kg-image" alt="A diagram with the benefits of Timescale vs. 
Amazon RDS for time-series data" loading="lazy" width="1952" height="948" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/06/2023-06-08-chart.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/06/2023-06-08-chart.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2023/06/2023-06-08-chart.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/06/2023-06-08-chart.png 1952w" sizes="(min-width: 720px) 720px"></figure><h2 id="hypertables-continuous-aggregates-and-query-planner-improvements-for-performance-at-scale">Hypertables, continuous aggregates, and query planner improvements for performance at scale</h2><p>Timescale is purpose-built to provide features that handle the unique demands of time-series, analytics, and event workloads—and as we’ve seen earlier in this post, performance at scale is one of the most challenging aspects to achieve with a vanilla PostgreSQL solution. </p><p>To make PostgreSQL more scalable, we built features like <a href="https://docs.timescale.com/use-timescale/latest/hypertables/about-hypertables/" rel="noreferrer">hypertables</a> and added query planner improvements allowing you to seamlessly partition tables into high-performance chunks, ensuring that you can load and query data quickly. </p><p>While some other solutions force you to think about creating and maintaining data partitions, Timescale does this for you under the hood, as queries come in with no performance impact. 
In fact, some of Timescale’s improvements work on tables that don’t even hold time-series data, like SkipScan, which <a href="https://timescale.ghost.io/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql/">dramatically improves <code>DISTINCT</code> queries on any PostgreSQL table</a> with a matching B-tree index.</p><p>Another problem that comes with time-series data at scale is slow aggregate queries as you analyze or present data. <a href="https://timescale.ghost.io/blog/how-we-made-data-aggregation-better-and-faster-on-postgresql-with-timescaledb-2-7/">Continuous aggregates</a> let you take an often run or costly time-series query and incrementally materialize it in the background, providing real-time, up-to-date results in seconds or milliseconds rather than minutes or hours. </p><p>While this might sound similar to a materialized view, it not only reduces the load on your database but also takes into account the most recent inserts and doesn’t require any management once it’s configured.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/07/Timescale-vs.-Amazon-RDS-for-PostgreSQL_comparison-chart.png" class="kg-image" alt="A table of PostgreSQL vs. 
Timescale Cloud for our time-series data benchmark" loading="lazy" width="1800" height="2066" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2024/07/Timescale-vs.-Amazon-RDS-for-PostgreSQL_comparison-chart.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2024/07/Timescale-vs.-Amazon-RDS-for-PostgreSQL_comparison-chart.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2024/07/Timescale-vs.-Amazon-RDS-for-PostgreSQL_comparison-chart.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/07/Timescale-vs.-Amazon-RDS-for-PostgreSQL_comparison-chart.png 1800w" sizes="(min-width: 720px) 720px"></figure><h2 id="hyperfunctions-job-scheduling-and-user-defined-functions-to-build-faster">Hyperfunctions, job scheduling, and user-defined functions to build faster</h2><p>Once you have time-series data loaded, Timescale also gives you the tools to work with it, offering over 100 built-in <a href="https://timescale.ghost.io/blog/how-to-write-better-queries-for-time-series-data-analysis-using-custom-sql-functions/">hyperfunctions</a>: custom SQL functions that simplify complex time-series analysis, such as <a href="https://docs.timescale.com/api/latest/hyperfunctions/time-weighted-averages/">time-weighted averages</a>, <a href="https://docs.timescale.com/api/latest/hyperfunctions/gapfilling-interpolation/locf/">last observation carried forward</a>, and <a href="https://docs.timescale.com/api/latest/hyperfunctions/downsample/">downsampling with the LTTB or ASAP algorithms</a>, plus bucketing by hour, minute, month, and time zone with <a href="https://docs.timescale.com/api/latest/hyperfunctions/time_bucket/">time_bucket()</a> and <a href="https://docs.timescale.com/api/latest/hyperfunctions/gapfilling-interpolation/time_bucket_gapfill/">time_bucket_gapfill()</a>.</p><p>We also 
provide a <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/user-defined-actions/">built-in job scheduler</a>, which saves the effort of installing and managing another PostgreSQL extension and lets you schedule and monitor any SQL snippet or database function.</p><h2 id="direct-access-to-a-global-expert-support-team-to-assist-you-in-production">Direct access to a global, expert support team to assist you in production</h2><p>If you’re running your database in production, having direct access to a team of database experts will lift a heavy weight off your shoulders. Timescale gives all customers access to a <a href="https://timescale.ghost.io/blog/how-were-raising-the-bar-on-hosted-database-support/">world-class team of technical support</a> engineers at no extra cost, encouraging discussion on any time-series topic, even if it’s not directly related to Timescale operations. You might want some help with ingest performance, tuning advice for a tricky SQL query, or best practices on setting up your schema—we are here to help.</p><p>As a comparison, <a href="https://aws.amazon.com/premiumsupport/pricing/">deeply consultative support, general guidance, and best practices start at over $5,000 per month</a> in Amazon RDS for PostgreSQL. Lower tiers have only a community forum or receive general advice. 
So this means that you need to pay an extra $60,000 a year just for such support on AWS, while you get this for free on Timescale.</p><h2 id="native-columnar-compression-and-object-storage-for-cost-efficiency">Native columnar compression and object storage for cost efficiency</h2><p>Cost is one of the major factors when choosing any cloud database platform, and Timescale provides multiple ways to keep your spending under control.</p><p>Timescale's best-in-class <a href="https://docs.timescale.com/timescaledb/latest/overview/core-concepts/compression/">native compression</a> allows you to compress time-series data in place while still retaining the ability to query it as normal. Compressing data in Timescale often results in savings of 90 % or more (take another look at our benchmark results, which actually saw a 95 % storage footprint reduction).</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/06/2023-06-08-compression-chart.png" class="kg-image" alt="A diagram of the Timescale Cloud compression advantages for time-series data in our benchmark vs. RDS" loading="lazy" width="1346" height="332" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/06/2023-06-08-compression-chart.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/06/2023-06-08-compression-chart.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/06/2023-06-08-compression-chart.png 1346w" sizes="(min-width: 720px) 720px"></figure><p>Timescale also includes built-in features to manage<a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/data-retention/"> data retention</a>, making it easy to implement data lifecycle policies, which remove data you don’t care about quickly, easily, and without impacting your application. 
You can combine data retention policies with continuous aggregates to automatically downsample your data according to a schedule.</p><p>To help reduce costs even further, Timescale offers <a href="https://timescale.ghost.io/blog/expanding-the-boundaries-of-postgresql-announcing-a-bottomless-consumption-based-object-storage-layer-built-on-amazon-s3/">bottomless, consumption-based object storage</a> built on Amazon S3 (currently in private beta). Providing access to an object storage layer from within the database itself enables you to seamlessly tier data from the database to S3, store an unlimited amount of data, and pay only for what you store. All the while you retain the ability to query data in S3 from within the database via standard SQL.</p><h2 id="it%E2%80%99s-just-postgresql">It’s just PostgreSQL</h2><p>Last but not least, Timescale is just PostgreSQL under the hood. Timescale supports full SQL (not SQL-like or SQL-ish). You can leverage the full breadth of drivers, connectors, and extensions in the vibrant PostgreSQL ecosystem—if it works with PostgreSQL, it works with Timescale! 
</p><p>If you switch from Amazon RDS for PostgreSQL to Timescale, you won’t lose any compatibility, your application will operate the same as before (but it will probably be faster, as we’ve shown).</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/06/2023-06-08-timescale-and-postgres.png" class="kg-image" alt="The PostgreSQL and Timescale logos together: Timescale Cloud is just PostgreSQL for time-series data" loading="lazy" width="1346" height="386" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/06/2023-06-08-timescale-and-postgres.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/06/2023-06-08-timescale-and-postgres.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/06/2023-06-08-timescale-and-postgres.png 1346w" sizes="(min-width: 720px) 720px"></figure><h2 id="conclusion">Conclusion</h2><p>When you have time-series data, you need a database that can handle time-series workloads. While Amazon RDS for PostgreSQL provides a great cloud PostgreSQL experience, our benchmarks have shown that even when paired with the pg_partman extension to provide partition management, it can’t compete with Timescale. According to our tests, Timescale can be over 40 % faster to ingest data, up to 350x faster for queries, and takes 95 % less space to store data when compressed. </p><p>On top of these findings, we offer a rich collection of time-series features that weren’t used in the benchmark. You can speed queries up even further by incrementally pre-computing responses with continuous aggregates, benefit from our job scheduler, configure retention policies, use analytical hyperfunctions, speed up your non-time-series queries with features like Skip Scan, and so much more. 
</p><p><br>If you have time-series data, don’t wait until you hit that performance wall to give us a go. Spin up an account now: <a href="https://console.cloud.timescale.com/signup">you can use it for free for 30 days; no credit card required</a>.</p><h3 id="further-reading">Further reading</h3><ul><li><a href="https://timescale.ghost.io/blog/estimating-rds-costs/" rel="noreferrer">Estimating RDS Costs</a></li><li><a href="https://timescale.ghost.io/blog/understanding-rds-pricing-and-costs/" rel="noreferrer">Why Is RDS so Expensive?</a></li><li><a href="https://timescale.ghost.io/blog/alternatives-to-rds/" rel="noreferrer">Alternatives to RDS</a></li><li><a href="https://timescale.ghost.io/blog/amazon-aurora-vs-rds-understanding-the-difference/" rel="noreferrer">Amazon Aurora vs. RDS</a></li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Read Before You Upgrade: Best Practices for Choosing Your PostgreSQL Version]]></title>
            <description><![CDATA[PostgreSQL upgrades have been known to be a bit of a controversial issue in the community. In this article, we will take the mystery out of the question of when an upgrade is appropriate and how Timescale allows you to do it as swiftly as possible.]]></description>
            <link>https://www.tigerdata.com/blog/read-before-you-upgrade-best-practices-for-choosing-your-postgresql-version</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/read-before-you-upgrade-best-practices-for-choosing-your-postgresql-version</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Cloud]]></category>
            <dc:creator><![CDATA[Kirk Laurence Roybal]]></dc:creator>
            <pubDate>Fri, 11 Nov 2022 18:35:31 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/11/Best-Practices-PostgreSQL-version_Hero--1-.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/11/Best-Practices-PostgreSQL-version_Hero--1-.png" alt="Read Before You Upgrade: Best Practices for Choosing Your PostgreSQL Version" /><p>PostgreSQL has a long-standing reputation for having a miserable upgrade process. So, when the community heartily recommends that you upgrade as soon as possible to the latest and greatest PostgreSQL version, it's not really surprising that your heart sinks, your mouth goes dry, and the outright dread of another laborious job takes over.</p><p>It's almost like finishing a long hike or trying to convince somebody that Betamax was better than VHS. Eventually, you just want it to be over so you can take a nap. There's not even any joy about all the new features and speed. It's just too exhausting to generate emotion anymore.</p><p>This blog post will hopefully serve as a guide for when to pull off the old band-aid. That is, it covers when you should upgrade and which PostgreSQL version you should select as a target. By the end of this post, we will introduce you to our best practices for upgrading your PostgreSQL version in Timescale, so you can get through the upgrade as quickly and safely as possible.</p><h2 id="when-to-upgrade-postgresql-common-myths">When to Upgrade PostgreSQL: Common Myths</h2><p>The <a href="https://www.linkedin.com/company/postgresql-global-development-group">PostgreSQL Global Development Group</a> has simplified the upgrade process quite a bit with more explicit version numbering. Since a new release is now either a minor or a major version, there are only two choices: upgrade the binaries (minor version change) or upgrade the data on disk (major version change).</p><p>The developers of PostgreSQL never really had a plan in mind for when and how to upgrade. This seems a bit of a harsh statement when tools like pg_upgrade exist, but bear with me. 
These tools were meant to make upgrades <strong>possible</strong>, not to imply any particular schedule or recommendations for an upgrade plan. The actual upgrade implementation was always left as an exercise for the administrator.</p><p>Let's start with some of the community's conventional wisdom and pretend that those ideas were actually a plan of sorts.</p><h3 id="myth-1-%E2%80%9Cupgrade-as-fast-as-possible-every-time%E2%80%9D">Myth 1: “Upgrade as fast as possible, every time”</h3><p>This "plan" is based on the fear of existing bugs. It is a very Rumsfeldian plan that assumes you don't know what the bugs are, but you're certainly better off if they're fixed. This makes for a very aggressive upgrade pace and hopes for a better tomorrow rather than a stable today.</p><h3 id="myth-2-upgrade-when-you-have-to">Myth 2: "Upgrade when you have to"</h3><p>The complete opposite fear-based pseudo-plan is to stick to the existing version—come hell or high water—unless you run into an otherwise unfixable bug that affects your installation. This is based on the idea that the bugs we know are better than the bugs we don't know. Unfortunately, it ignores the bugs you don't even know exist.</p><h3 id="myth-3-%E2%80%9Cupgrade-for-every-minor-version%E2%80%9D">Myth 3: “Upgrade for every minor version”</h3><p>This is the <a href="https://www.postgresql.org/support/versioning/">general recommendation</a> of the PostgreSQL Global Development Group. The general idea is that all software has bugs, and upgrading is better than not upgrading. That is a bit over-optimistic about new bugs being introduced and kind of ignores that new features that you don’t care about have to be configured—or else.</p><p>This comes a bit closer to planning than guessing for minor versions, as the minor versions of PostgreSQL do not change the file system; they only change the binaries. 
These upgrades tend to be super heavy on bug fixes and very low on new features, which is where bugs tend to get introduced. It doesn't say anything about bugs you have actually encountered, nor does it say anything about any improvements from which you might be able to benefit.</p><h3 id="myth-4-%E2%80%9Cupgrade-when-you-have-time-to-kill%E2%80%9D">Myth 4: “Upgrade when you have time to kill”</h3><p>Probably the most dangerous plan since you will never have more time in the future and will probably never upgrade. Experience says that this is a completely silly plan that never gets implemented.</p><h3 id="myth-5-%E2%80%9Cupgrade-when-there-are-security-fixes%E2%80%9D">Myth 5: “Upgrade when there are security fixes”</h3><p>Okay, this makes some kind of sense. Unfortunately, it ignores the rest of your installation and puts the application development team into tailspin mode for your DevOps enjoyment. It is the kind of policy you end up with when the DevOps team doesn’t really care about the Apps team.</p><h2 id="when-to-upgrade-postgresql">When to Upgrade PostgreSQL </h2><p>Much of this guide is based on personal experience with PostgreSQL upgrades over the years. In some cases, the old was better than the new, and in others, the other way around. In some cases, the fixes worked immediately. In others, well, not so much.</p><p>Very few hard and fast rules can be drawn when coming up with a plan of this nature, but I'll try to bring the experience to bear in a way that helps to make a decision in the future. That being said, this is a "best practice" based on experience, not a "sure-fire thing."</p><p>As a way to reduce the amount of just sheer subjectivity and opinion around choosing the moment to upgrade, I've taken a look through the release notes of PostgreSQL. In this lookie-look, I've attempted to note where bug fixes occurred and mentally move them back to the version where they were discovered. 
Unfortunately, this task is also somewhat subjective, as I was not a part of the bug fix development or the bug discovery. So these are just educated guesses, but I hope rather good ones.</p><p>Then I looked at the mental list that I had made and thought about whether it matched my personal experience with successful versus unsuccessful upgrades. It (again) seemed a subjectively good indicator of when an upgrade succeeded or failed.</p><p>So, on to the findings.  </p><p>The first thing I noticed in my research is that the biggest upgrade failures were with a new major version containing updates to the <a href="https://www.postgresql.org/docs/current/wal-intro.html">write-ahead log (WAL)</a>. These were most notable for versions 10 and 12.</p><p>Version 10 would make a book by itself. It was a major undertaking, with quite a few subsystem rewrites. In these version upgrades, there were numerous additions to items (like WAL for hash indexes), as well as improvements and changes to the background writer to support structural changes on disk. These major updates introduced the largest number of unintended behaviors, which lasted the longest before being detected and fixed.</p><p>The next most striking failures came from logical replication between 10 and 11.  Of course, logical replication was invented for 10, so there had never been an attempt to use it for production upgrades before. This first use in the field was—how should I put it?—interesting.</p><p>After that, the bugs died down a lot but were never quite gone.</p><h2 id="upgrade-plan">Upgrade Plan</h2><p>Here is my list of questions to ask before an upgrade.</p><p>1. How big is the change? 
Was it a major refactor, and did it involve any of the following?</p><ul><li><strong>Query planner:</strong> minor.</li><li><strong>WAL:</strong> major.</li><li><strong>Background writer:</strong> major.</li><li><strong>Memory, caching, locks, or anything else managed by the parent process:</strong> minor.</li><li><strong>Index engine:</strong> major (or just rebuild all your indexes anyway).</li><li><strong>Replication:</strong> major.</li><li><strong>Logging:</strong> minor.</li><li><strong>Vacuuming:</strong> minor.</li></ul><p>2. Were there any huge performance gains?</p><p>3. Does it include major security fixes?</p><p>4. Are there major improvements or enhancements to built-in functions?</p><p>5. Do all of my extensions exist for the new version?</p><p>These are my rules of thumb for whether a new PostgreSQL version is compelling enough to upgrade to. Unfortunately, this still requires some subjective evaluation and a bit of professional knowledge. For instance, just because vacuum is a major feature, it doesn't mean it has ever been a problem with an upgrade. It <strong>could</strong> be, though, and we should look at its major changes with a bit of healthy skepticism.</p><p>This brings me to my personal procedure that has (so far) followed the above guidelines.</p><ol><li><strong>Upgrade major versions when they reach the minor version .2.</strong> That is, 10.2, 11.2, 12.2, etc. This technique avoids the most egregious bugs introduced in major versions but still allows for staying reasonably close to the current release.</li><li><strong>Upgrade minor versions as they are available.</strong> Minor upgrades have not created major issues thus far in my personal experience. The speed increases, bug fixes, security patches, and internationalization have been worth the minor risk.</li><li><strong>Upgrade immediately if your version is nearing the five-year mark</strong>. 
<a href="https://www.postgresql.org/support/versioning/" rel="noreferrer">The PostgreSQL Global Development Group releases a new major version every year</a> and supports it for five years after its release. You don't want to be left with an unsupported version.</li><li><strong>Upgrade when the security team tells you to</strong>. It doesn't happen very often, but when it does, it's a major event.</li><li><strong>Upgrade because you need functionality</strong>. Things to upgrade for: <code>CONCURRENTLY</code>, <code>SYSTEM</code>, and performance. Things not to upgrade for: functions, operators, and libraries.</li></ol><p>That's all there is to it.  </p><p>I hope this blog post has helped you decide when a new PostgreSQL version is compelling enough for you to upgrade.</p><p>Of course, this is only a general rule of thumb. If you feel compelled to upgrade for some other reason, don't let my guide tell you what <strong>not</strong> to do. It is only intended to help in the absence of any other reason to upgrade. You do you.</p><h2 id="i-am-ready-to-upgrade-now-what">I Am Ready to Upgrade. Now, What?</h2><p>So now you have followed the checklist above and determined that it’s time for you to upgrade your PostgreSQL version. If you’re running a production database, this may be easier said than done, especially if we are talking about upgrading your major version (e.g., from PostgreSQL 13 to PostgreSQL 14): </p><ul><li>Minor version upgrades of PostgreSQL (e.g., from PostgreSQL 13.1 to PostgreSQL 13.2) are always backward compatible within the same major version. That means that if you upgrade your production database, it is unlikely that anything is going to break due to the upgrade.</li><li>However, major versions of PostgreSQL are not backward compatible. 
That means that when you upgrade the PostgreSQL version of a database behind a mission-critical application, this may introduce user-facing incompatibilities which might require code changes in your application to ensure no breakage.</li></ul><p>Practical example: if you are upgrading from PostgreSQL 13 to 14, the factorial operators <code>!</code> and <code>!!</code> are no longer supported in PostgreSQL 14, nor is running the factorial function on negative numbers. What may seem like a silly example in fact illustrates that assumptions about how certain functions (or even operators) behave may break once you upgrade to a new version. </p><p>Fortunately, PostgreSQL is awesome enough to provide clear <a href="https://www.postgresql.org/docs/current/release.html">Release Notes</a> stating the changes between versions. But this doesn’t solve our problem: how do you upgrade production databases safely? </p><h2 id="timescale-to-the-rescue">Timescale to the Rescue</h2><p>This is one of the many areas in which choosing a cloud database will help. If you are self-hosting your mission-critical PostgreSQL database and want to run a major upgrade, you would first have to create a copy of your database manually, dumping your production data and restoring it in another database with the same config as your production database. </p><p>Then, you would have to upgrade this database and run your testing there. This process can take a while depending on your database's size (and if we’re talking about a time-series application, it’s probably pretty big). </p><p>Timescale makes the upgrading process way more approachable. Timescale is a database cloud for time-series applications built on TimescaleDB and PostgreSQL. In other words, this is PostgreSQL under the hood, with a sprinkle of TimescaleDB as the time-series secret sauce. 
</p><p>Timescale databases (which are called “services”) run on a particular version of TimescaleDB and PostgreSQL:</p><ul><li>As a user of Timescale, you don’t have to worry about the TimescaleDB upgrades: they will be handled automatically by the platform during a maintenance window picked by you. These upgrades are backward compatible and require no downtime.</li><li>The upgrades between minor versions of PostgreSQL are also automatically handled by the platform during your maintenance window. As we mentioned, these upgrades are also backward compatible. However, they require a service restart, which could cause a small amount of downtime (30 seconds to a few minutes) if you do not have a replica. We always alert users in advance.</li></ul><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">✨</div><div class="kg-callout-text"><b><strong style="white-space: pre-wrap;">Editor's Note: </strong></b><i><em class="italic" style="white-space: pre-wrap;">For security reasons, we always run the latest available minor version within a major version of PostgreSQL in Timescale. These minor updates may contain security patches, fixes for data corruption problems, and fixes for frequent bugs. As a managed service provider, we have to store our customers’ data as safely as possible.</em></i></div></div><p>But what about upgrades between major versions of PostgreSQL? Since these are often not backward compatible, we cannot automatically upgrade your service in Timescale from, let’s say, PostgreSQL 13 to 14, as doing so may introduce problems in your code and cause major issues! </p><p>Also, upgrading between major versions of PostgreSQL can (unfortunately but unavoidably) introduce some downtime. If you are running a mission-critical application, you want complete control over <em>when</em> that unavoidable downtime will occur. And you certainly want to test that upgrade first. 
</p><p>A database platform like Timescale can certainly help solve this issue. Upgrading your major version of Postgres will always be a decent lift, but a hosted database platform can make this process way smoother, helping you automate what can be automated and also facilitating your testing:</p><ul><li>In Timescale, you can upgrade the PostgreSQL version that’s running on your service by simply clicking a button.</li><li>You can use database forks to test your upgrade safely. With the click of a button, you can create a database fork (a.k.a. an exact copy of your database), which you can then upgrade to estimate the downtime required to upgrade your production instance.</li><li>You can also use forks to test your application changes. Once your fork is upgraded, you can run some of your production queries (you can find some of these using <a href="https://timescale.ghost.io/blog/identify-postgresql-performance-bottlenecks-with-pg_stat_statements/"><code>pg_stat_statements</code></a>) on the fork to ensure they still work on the new major version. </li></ul><p>Let’s explore this more in the next section. If you’re not using Timescale, you can create a <a href="https://console.cloud.timescale.com/signup">free account here</a>. You’ll have free access for 30 days, no credit card required.  </p><h2 id="safely-upgrading-major-postgresql-versions-in-timescale">Safely Upgrading Major PostgreSQL Versions in Timescale </h2><p>Here’s how you can safely upgrade your Timescale service:</p><ul><li>First, fork your service. Timescale allows you to fork (a.k.a. copy) your databases in one click, a fast and cost-effective process. 
You will only be charged when your fork runs, and you can immediately delete it after your testing is complete.</li><li>Now that you have a perfect copy of your production database ready for testing (with the click of a button), it’s time to click another button to tell the platform to upgrade your major PostgreSQL version automatically. You can do this in Timescale—we’ll tell you exactly how in a minute.</li><li>Once the upgrade is complete in your fork, run your tests.</li><li>In order to see how long the upgrade took on the fork, you can go to your metrics tab and check how long your service was unavailable (the grey zone in your CPU and RAM graphs). This will give you an estimate as to how long your primary service will be down when you choose to upgrade it.</li><li>When you’re sure that nothing breaks, you can upgrade your primary service. Make sure to plan accordingly! Upgrading will cause downtime, so make sure you have accounted for that as a part of your upgrade plan. </li></ul><p>Let’s see how this looks in the console. 
</p><p>First, check which TimescaleDB and PostgreSQL version your database is running on your service Overview page.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/11/Best-practices-upgrade-PostgreSQL_img-1.png" class="kg-image" alt="" loading="lazy" width="1095" height="635" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/11/Best-practices-upgrade-PostgreSQL_img-1.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/11/Best-practices-upgrade-PostgreSQL_img-1.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/11/Best-practices-upgrade-PostgreSQL_img-1.png 1095w" sizes="(min-width: 720px) 720px"></figure><p>To fork your service is as easy as going to the Operations tab and clicking on the Fork service option. This will automatically create an exact snapshot of your database.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/11/Best-practices-upgrade-PostgreSQL_img2.png" class="kg-image" alt="" loading="lazy" width="1170" height="553" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/11/Best-practices-upgrade-PostgreSQL_img2.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/11/Best-practices-upgrade-PostgreSQL_img2.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/11/Best-practices-upgrade-PostgreSQL_img2.png 1170w" sizes="(min-width: 720px) 720px"></figure><p>To upgrade your major version of PostgreSQL, go to your Maintenance tab. Under Service upgrades, you will see a Service upgrades button. 
If you click that button, your service will be updated to the next major version of Postgres (in the example below, the service would be upgraded from PostgreSQL 13.7 to PostgreSQL 14).</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/11/Best-practices-upgrade-PostgreSQL_img3.png" class="kg-image" alt="" loading="lazy" width="964" height="623" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/11/Best-practices-upgrade-PostgreSQL_img3.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/11/Best-practices-upgrade-PostgreSQL_img3.png 964w" sizes="(min-width: 720px) 720px"></figure><h2 id="your-upgrade-is-complete">Your Upgrade Is Complete</h2><p>That’s it! You can now use the latest and greatest that PostgreSQL has to offer. That said, choosing to upgrade is no small feat. Before going through the upgrade process, there is a lot to consider, and it is important to have a plan to account for the downtime you will experience. </p><p>While the upgrade process can be a bit painful, you can at least rely on Timescale to handle the technical orchestration of the upgrade. In the future, we hope to offer even better tooling to make the upgrade process entirely pain-free (but we have to walk before we can run, right?).</p><p><br><br>If you’d like to see what Timescale has to offer, <a href="https://www.timescale.com/timescale-signup">start a free trial if you haven’t already. There’s no credit card required!</a></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[A PostgreSQL Developer's Perspective: Five Interesting Patches From September's Commitfest]]></title>
            <description><![CDATA[Welcome to our new blog series! Every other month, Timescale’s developer advocate, Chris Travers, will use his PostgreSQL developer perspective to feel the pulse of the beloved database by looking into new commitfest patches.]]></description>
            <link>https://www.tigerdata.com/blog/a-postgresql-developers-perspective-five-interesting-patches-from-septembers-commitfest</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/a-postgresql-developers-perspective-five-interesting-patches-from-septembers-commitfest</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Chris Travers]]></dc:creator>
            <pubDate>Wed, 02 Nov 2022 14:14:11 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/11/PostgreSQL-Developer-Perspective-September-commitfest--1-.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/11/PostgreSQL-Developer-Perspective-September-commitfest--1-.png" alt="An elephant (PostgreSQL's mascot) working on a laptop and ticking a few boxes" /><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">🐘</div><div class="kg-callout-text"><em>The PostgreSQL community organizes patch reviews into “</em><a href="https://commitfest.postgresql.org/"><em>commitfests</em></a><em>” which last for a month at a time, every other month. In this series, our very own PostgreSQL developer advocate and expert, Chris Travers, will discuss a few patches that may be of interest to PostgreSQL users after each commitfest. This is intended to provide a sense of the progress and future of PostgreSQL as a major player in the database world.</em></div></div><p><a href="https://commitfest.postgresql.org/39/">September’s commitfest</a> is over with 65 committed patches, 40 patches returned with feedback, 177 patches moved to the next commitfest, 3 rejected, and 11 withdrawn. From a PostgreSQL developer's perspective and beyond, the patches include many improvements across a wide range of areas.</p><p>In this pilot for a new blog post series, I have selected a few patches that I find particularly interesting and helpful, and whose importance I feel I can easily communicate to a general audience of PostgreSQL users. This is by no means a comprehensive list of committed patches of interest, and in particular, patches that improve code quality or set the foundations for new features somewhere in the distant future are not included in this review. 
However, in my discussions with PostgreSQL developers, the improvements I heard the most about were type-safety improvements that are not on this list.</p><p>In this article I have selected five patches, three of which are committed and two returned with feedback, for discussion. I will focus on their utility to database users and application developers from my PostgreSQL developer point of view (POV).</p><p>Let’s start with the committed patches.</p><h2 id="postgresql-developer-pov-interesting-patches-committed">PostgreSQL Developer POV: Interesting Patches Committed</h2><h3 id="reducing-chunk-header-size-on-all-memory-context-types"><br>Reducing chunk header size on all memory context types</h3><p>PostgreSQL <a href="https://www.youtube.com/watch?v=tP2pHbKz2R0">manages memory by lifetime</a> and allocates based on either a “chunk allocator” or a “slab allocator” (the latter being designed specifically for logical replication contexts, and not relevant to this patch). </p><p>Allocation sets are arranged in a hierarchy relating to memory lifetime within the software. This saves the programmer from having to free each allocation explicitly, as the system can free everything in an allocation set at a defined point in its lifetime. For example, if we allocate memory with a lifetime related to the processing of a row, then the memory will be reused or freed when PostgreSQL moves on to process the next row. If we allocate to the lifetime of the transaction, then the memory is freed when the transaction commits or rolls back. It also means that in some cases, PostgreSQL can just reuse a chunk of memory without having to do significant processing of it other than the header.</p><p>The chunk allocator also increases the memory allocation on each subsequent call for an allocation set. The first chunk in the allocation set is 8 kB in size, and each subsequent chunk allocated doubles until one reaches 1 GiB. 
A given allocation within a set cannot span chunks, and this is why you cannot allocate (compressed or not) more than 1 GB of data within PostgreSQL in C <a href="http://postgresql.org/docs/6.3/c3903.htm">using the palloc memory allocation interfaces</a>.</p><p><a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=c6e0fe1f2a08505544c410f613839664eea9eb21">This patch</a> provides a number of important memory improvements in this regard. Although PostgreSQL memory management is well-optimized performance-wise and avoids most of the problems plaguing C programmers generally, this introduces further improvements.</p><p>Databases are environments where memory allocation efficiency matters for a number of reasons. In addition to pure memory savings, the fact that these may allow more data to fit in a given chunk means fewer malloc() calls will happen, and this is likely to produce performance improvements as well.</p><p>This patch does not affect all aspects of memory in PostgreSQL. While data coming into shared buffers or from shared buffers back to disk are not affected by this change, the data allocated for processing data extracted from shared buffers (or processed for writing to shared buffers) is affected.</p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">✨</div><div class="kg-callout-text"><strong>Editor's Note: </strong>shared_buffers are an important part of configuring your PostgreSQL instance. 
<a href="https://timescale.ghost.io/blog/database-scaling-postgresql-caching-explained/">Check out this article on what they are and how they’re designed</a>, and <a href="https://timescale.ghost.io/blog/postgresql-caching-the-postmaster-process/">read this blog post to learn how they interact with the postmaster process</a>.</div></div><p>This also shows that even in mature, very well-optimized systems, there are often gains to be made by those with the knowledge and insight to find them.</p><h3 id="handle-infinite-recursion-in-a-logical-replication-setup">Handle infinite recursion in a logical replication setup</h3><p>PostgreSQL’s logical replication is built on a publisher/subscriber model, where a publisher exposes a series of changes that a subscriber ingests and writes locally. In current releases of PostgreSQL, it is not really possible to have a loop of subscribers and publishers. Replication must always be unidirectional.</p><p>For the most part, this works well because bidirectional replication poses fundamental (mathematical) conflict resolution problems that are impossible to solve where authoritative data is required. For example, if the same row is updated on two different mutually replicating systems, there is insufficient information to determine what the final output state is or should be. For this reason, cyclic replication topologies (often called “multi-master replication”) are generally frowned upon.</p><p>A common approach to addressing this problem is to use “last update wins” as a strategy, but this approach necessarily clobbers existing updates. 
In my previous work at other companies, where I was using such complex replication topologies with other databases, we actually had to take steps to prevent conflicting updates elsewhere in the infrastructure for this reason.</p><p><a href="https://commitfest.postgresql.org/39/3610/">This patch</a> allows, for the first time, cyclic logical replication topologies. In other words, while a subscriber can already republish data it subscribed to, the original publisher could now, with new options set on the subscription, subscribe to the data republished by its own subscribers. This is done by setting an “origin” option on the subscription, which is intended to prevent replication write loops, where the same insert or update is replicated back and forth forever.</p><p>Logical replication is a completely different beast when compared to physical streaming replication. It has completely different use cases and pitfalls than the latter. It imposes very different administrative burdens as well. However, this is a massive leap forward towards a community-owned bidirectional logical replication capability, which will likely open some doors for PostgreSQL where replication topologies based on cyclic graphs (often called multi-master replication) are actually worth the significant costs.</p><h3 id="proper-planner-support-for-order-by-distinct">Proper planner support for <code>ORDER BY</code> / <code>DISTINCT</code></h3><p>PostgreSQL supports various aggregates where sort order is important, such as ntile, percentile, and other ordered set aggregates, window functions, and aggregates with the <code>DISTINCT</code> modifier. The efficiency of these operations is affected by the sort ordering of the table scans, and in current versions of PostgreSQL, the planner does not take this into account. 
As a result, <code>ORDER BY</code> and <code>DISTINCT</code> in aggregate functions can lead to unnecessary sorts, which slow down queries. </p><p>Having proper planner support for these operations is a significant performance win for anyone using such aggregates. In my experience, there are many users of these features, especially those whose workloads include both transactional and decision support queries. <a href="https://commitfest.postgresql.org/39/3164/">This patch</a> represents another significant improvement for analytic workloads on PostgreSQL.</p><p>Now let’s address the two patches that haven’t yet been committed.</p><h2 id="patches-not-yet-committed">Patches Not Yet Committed</h2><p>Not every interesting patch that caught my attention here got committed this time around. Two important patches will get discussed further as they progress through peer review and the commitfest feedback process. Both of these patches are currently listed as “returned with feedback” but are of sufficient importance or near enough to completion that they are worth watching anyway.</p><h3 id="kerberos-delegation">Kerberos Delegation</h3><p><a href="https://web.mit.edu/kerberos/">Kerberos, the authentication protocol that was developed by MIT</a> and available on many platforms before being incorporated into Microsoft’s Active Directory, has the capacity to pass delegated credentials between hosts. PostgreSQL can already accept delegated credentials for authentication but currently cannot delegate credentials itself.</p><p>So, for example, when a user accesses an internal ASP.NET application on a company intranet, the web server can authenticate the user via Kerberos, and then pass on a delegated credential to the database if needed for actual access. 
This approach allows fine-grained and redundant control over access to data with a great deal of defense in depth.</p><p>What PostgreSQL is not currently able to do is delegate Kerberos authentication, which means that Kerberos authentication cannot be used between PostgreSQL nodes over things like foreign data wrappers.</p><p>A <a href="https://commitfest.postgresql.org/39/3582/">proposed patch</a> would allow PostgreSQL to delegate Kerberos credentials to libpq connections, so that delegation could be used by the PostgreSQL foreign data wrapper, dblink, and similar extensions. While this patch has been listed as “returned with feedback,” the general sense is that this would be a really useful feature, and so I would read that status at present as a note that the current patch has some problems that need to be rethought before it can be accepted.</p><p>The fundamental difficulty here is that Kerberos session encryption does not provide forward secrecy, and therefore, when session encryption is used with credential delegation, the user could potentially break the encryption on the middle host. In the event that passwords are also used for a foreign data wrapper connection, this would render some protections for sensitive data less robust or even, in some cases (for example, where passwords are used in a passthrough way to authenticate against a third-party provider), allow for password disclosure.</p><p>The overall consensus is that this is a feature that would be very helpful in PostgreSQL, something I believe as well. The decision to return the patch is also a great testament to the degree of peer review in the community that goes into security-critical code paths. 
I expect that this feature will eventually be included with appropriate safety measures in place, and I hope it is resubmitted sooner rather than later.</p><h3 id="allows-database-specific-role-memberships">Allows database-specific role memberships</h3><p>In PostgreSQL, roles (which include users) are global to an instance of PostgreSQL (called a “cluster” in PostgreSQL terminology). This means that if you have several databases managed by the same PostgreSQL instance, the roles and role memberships are common throughout all databases.</p><p>One major difficulty in writing multi-tenant applications which use databases as the tenant boundary is that if you grant a user a role in one database, the grant applies in all databases. For example, if you have an accounting application and you have a user “chris” who has different permissions in three different databases on the same server, you cannot use a simple, consistent set of roles to manage database permissions.</p><p>One solution to this problem (which we used when I was building out such a system in LedgerSMB) is to create different roles for each database, including, for example, the database name in the role name. This leads to a certain degree of complexity and makes the role names harder to read. Another option would be to limit users to a single database. This adds complexity in a different place.</p><p>A <a href="https://commitfest.postgresql.org/39/3374/">proposed patch</a> would make it possible to grant role memberships only in a specific database. For multi-tenant applications, this would be a game changer. The patch was near acceptance when there was some further discussion on documentation. It needed to be rebased and corrected, and the initial submitter did not reply in a timely manner.  
This can happen—people can become busy or unavailable for one reason or another, and patches can end up temporarily orphaned.</p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">✨</div><div class="kg-callout-text"><strong>Editor's Note: </strong>For an insider’s perspective on commitfests, <a href="https://timescale.ghost.io/blog/how-to-manage-a-commitfest/">read this blog post on being a commitfest manager</a>, and <a href="https://timescale.ghost.io/blog/what-does-a-postgresql-commitfest-manager-do-and-should-you-become-one/">check out if you should become one</a>.</div></div><p>Given the general utility of this patch for multi-tenant applications, I would like to see this fixed and resubmitted sooner rather than later.</p><h2 id="concluding-thoughts">Concluding Thoughts</h2><p>The patches discussed here represent small but useful steps forward for PostgreSQL. These and many others make PostgreSQL a database that is improving significantly with each major release. </p><p>Commitfests such as these provide great insight into this process, the review that patches undergo, and how PostgreSQL keeps moving forward.</p><p>That’s one of my <strong>favorite things about PostgreSQL: it’s a database that is significantly improving with each major release. </strong>If you want to add even more functionality to PostgreSQL, <a href="https://www.timescale.com/timescale-signup">explore TimescaleDB</a>—<a href="https://timescale.ghost.io/blog/postgresql-timescaledb-1000x-faster-queries-90-data-compression-and-much-more/"><strong>it extends PostgreSQL</strong> with things like automatic time-based partitioning and indexing, continuous aggregations, columnar compression, and time-series functionality.</a> And if you’re using a managed service for PostgreSQL, <a href="https://console.cloud.timescale.com/">try Timescale</a>—it’s free for 30 days, no credit card required. <br></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[PostgreSQL + TimescaleDB: 1,000x Faster Queries, 90 % Data Compression, and Much More]]></title>
            <description><![CDATA[TimescaleDB expands PostgreSQL query performance by 1,000x, reduces storage utilization by 90%, and provides time-saving features for time-series and analytical applications—while still being 100% Postgres.]]></description>
            <link>https://www.tigerdata.com/blog/postgresql-timescaledb-1000x-faster-queries-90-data-compression-and-much-more</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/postgresql-timescaledb-1000x-faster-queries-90-data-compression-and-much-more</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[General]]></category>
            <category><![CDATA[Engineering]]></category>
            <category><![CDATA[Benchmarks & Comparisons]]></category>
            <dc:creator><![CDATA[Ryan Booz]]></dc:creator>
            <pubDate>Thu, 22 Sep 2022 15:32:44 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/10/Screenshot-2023-10-11-at-7.24.23-PM.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/10/Screenshot-2023-10-11-at-7.24.23-PM.png" alt="PostgreSQL + TimescaleDB: 1,000x Faster Queries, 90 % Data Compression, and Much More" />
<!--kg-card-begin: html-->
<div class="highlight">
	
    <p class="highlight__text">
        <svg width="17" height="16" viewBox="0 0 17 16" fill="none" xmlns="http://www.w3.org/2000/svg">
	</svg>
<b> Compared to PostgreSQL alone, TimescaleDB can dramatically improve query performance by 1,000x or more, reduce storage utilization by 90 %, and provide features essential for time-series and analytical applications. Some of these features even benefit non-time-series data–increasing query performance just by loading the extension. </b> 
    </p>
</div>

<!--kg-card-end: html-->
<p>PostgreSQL is today’s most advanced and most popular open-source relational database. We believe this as much today as we did <a href="https://timescale.ghost.io/blog/when-boring-is-awesome-building-a-scalable-time-series-database-on-postgresql-2900ea453ee2/">five years ago</a> when we chose PostgreSQL as the foundation of TimescaleDB because of its longevity, extensibility, and rock-solid architecture.</p><p>By loading the TimescaleDB extension into a PostgreSQL database, you can effectively “supercharge” PostgreSQL, empowering it to excel for both time-series workloads and classic transactional ones. </p><p>This article highlights how TimescaleDB improves PostgreSQL query performance at scale, increases storage efficiency (thus lowering costs), and provides developers with the tools necessary for building modern, innovative, and cost-effective time-series applications, all while retaining access to the full Postgres feature set and ecosystem.</p><p>(To show our work, this article also presents the benchmarks that compare query performance and data ingestion for one billion rows of time-series data between PostgreSQL 14.4 and TimescaleDB 2.7.2. For PostgreSQL, we benchmarked both a single table and declarative partitioning.)</p><h2 id="better-performance-at-scale">Better Performance at Scale</h2><p>With orders of magnitude better performance at scale, TimescaleDB enables developers to build on top of PostgreSQL <em>and </em>“future-proof” their applications.</p><h3 id="1000x-faster-performance-for-time-series-queries">1,000x faster performance for time-series queries</h3><p>The core concept in TimescaleDB is the notion of the “hypertable”: seamless partitioning of data while presenting the abstraction of a single, virtual table across all your data. </p><p>This partitioning enables faster queries by quickly excluding irrelevant data, as well as enabling enhancements to the query planner and execution process. 
In this way, a hypertable looks and feels just like a normal PostgreSQL table but enables a lot more.</p><p>For example, one recent query planner improvement excludes data more efficiently for relative <code>now()</code>-based queries (e.g., <code>WHERE time &gt;= now() - '1 week'::interval</code>). To be even more specific, <a href="https://timescale.ghost.io/blog/how-we-fixed-long-running-postgresql-now-queries/">these relative time predicates are constified</a> at planning time to ignore chunks that don't have data to satisfy the query. Furthermore, as the number of partitions increases, planning times can be reduced by 100x or more over vanilla PostgreSQL for the same number of partitions.</p><p>When hypertables are compressed, the amount of data that queries need to read is reduced, leading to dramatic increases in performance of 1,000x or more. For more information (including a discussion of this bar chart), keep reading for the benchmark below.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/09/single-query-latency-milliseconds.png" class="kg-image" alt="" loading="lazy" width="2000" height="1053" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/09/single-query-latency-milliseconds.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/09/single-query-latency-milliseconds.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2022/09/single-query-latency-milliseconds.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/09/single-query-latency-milliseconds.png 2070w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Query latency comparison (ms) between TimescaleDB and PostgreSQL 14.4. 
To see the complete query, scroll down.</span></figcaption></figure><p>Other enhancements in TimescaleDB apply to both hypertables and normal PostgreSQL tables, e.g., SkipScan, which <a href="https://timescale.ghost.io/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql/">dramatically improves DISTINCT queries on any PostgreSQL table</a> with a matching B-tree index regardless of whether you have time-series data or not.</p><h3 id="reduce-commonly-run-queries-to-milliseconds-even-when-the-original-query-took-minutes-or-hours">Reduce commonly run queries to milliseconds (even when the original query took minutes or hours)</h3><p>Today, nearly every time-series application reaches for rolling aggregations to query and analyze data more efficiently. The raw data could be saved per second, minute, or hour (and a plethora of other permutations in between), but what most applications display are time-based aggregates. </p><p>What's more, most time-series data applications are append-only, which means that aggregate queries return the same values over and over based on the unchanged raw data. It's much more efficient to store the results of the aggregate query and use those for analytic reporting and analysis most of the time. </p><p>Often, developers try materialized views in vanilla PostgreSQL to help. However, materialized views have two main problems with fast-changing time-series data:</p><ul><li>Materialized views <em>recreate the entire view every time the materialization process runs, </em>even if little or no data has changed.</li><li>Materialized views don't provide any data retention management. Any time you delete raw data and update the materialized view, the aggregated data is removed as well.</li></ul><p>In contrast, <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/continuous-aggregates/about-continuous-aggregates/">TimescaleDB’s continuous aggregates</a> solve both of these problems. 
They are updated automatically on the <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/continuous-aggregates/refresh-policies/">schedule you configure</a>, they can have data <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/data-retention/data-retention-with-continuous-aggregates/#about-data-retention-with-continuous-aggregates">retention policies applied separately from the underlying hypertable</a>, and they only update the portions of new data that have been modified since the last materialization was run.</p><p>When we compare using a continuous aggregate to querying the data directly, customers often see queries that might take minutes or even hours drop to milliseconds. When that query is powering a dashboard or a web page, this can be the difference between snappy and unusable.</p><h2 id="lower-storage-costs">Lower Storage Costs</h2><p>The number one driver of cost for modern time-series applications is storage. Even when storage is cheap, time-series data piles up quickly. TimescaleDB provides two methods to reduce the amount of data being stored, compression and downsampling using continuous aggregates.</p><h3 id="90-or-more-storage-savings-via-best-in-class-compression-algorithms">90&nbsp;% or more storage savings via best-in-class <a href="https://www.tigerdata.com/blog/time-series-compression-algorithms-explained" rel="noreferrer">compression algorithms</a></h3><p>The TimescaleDB hypertable is data heavily partitioned into many, many smaller partitions called “chunks.” TimescaleDB provides <a href="https://www.tigerdata.com/blog/building-columnar-compression-in-a-row-oriented-database" rel="noreferrer">native columnar compression</a> on this per-chunk basis. </p><p>As we show in the benchmark results (and as we see often in production databases), compression reduced disk consumption by over 90% compared to the same data in vanilla PostgreSQL. 
</p><p>Even better, TimescaleDB doesn't change anything about the PostgreSQL storage system to achieve this level of compression. Instead, TimescaleDB utilizes PostgreSQL storage features, namely TOAST, to transition historical data from row-store to column-store, a key component for querying long-term aggregates over individual columns.</p><p>To demonstrate the effectiveness of compression, here’s a comparison of the total size of the CPU table and indexes in TimescaleDB and in PostgreSQL.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/09/image-5.png" class="kg-image" alt="" loading="lazy" width="2000" height="1053" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/09/image-5.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/09/image-5.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2022/09/image-5.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/09/image-5.png 2070w" sizes="(min-width: 720px) 720px"></figure><p><br>With the proper <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/compression/about-compression/#enable-compression">compression policy in place</a>, hypertable chunks will be compressed automatically once all data in the chunk has aged beyond the specified time interval. </p><p>In practice, this means that a hypertable can store data as row-oriented for newer data and column-oriented for older data simultaneously. 
Having the data stored as both row and column store also matches the typical query patterns of time-series applications to help improve overall query performance—again, something we see in the benchmark results.</p><p>This reduces the storage footprint and improves query performance even further for many time-series aggregate queries. Compression is also automatic: users set a compression horizon, and then data is automatically compressed as it ages.</p><p>This also means that users can save significant costs using cloud services that provide separation of compute and storage—such as Tiger Cloud (formerly Timescale Cloud)—so that larger machines aren’t needed just for more storage. </p><h3 id="more-storage-savings-by-easily-removing-or-downsampling-data">More storage savings by easily removing or downsampling data</h3><p>With TimescaleDB, automated <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/data-retention/about-data-retention/">data retention </a>is achieved with <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/data-retention/create-a-retention-policy/">one SQL command</a>:</p><pre><code class="language-SQL">SELECT add_retention_policy('cpu', INTERVAL '7 days');
</code></pre><p>There's no further setup or extra extensions to install or configure. Each day any partitions older than 7 days will be dropped automatically. If you were to implement this in vanilla PostgreSQL you’d need to use DELETE to remove records, which is a very costly operation as it needs to scan for the data to remove. Even if you were using PostgreSQL declarative partitioning, you’d still need to automate the process yourself, wasting precious developer time, adding additional requirements, and implementing bespoke code that needs to be supported moving forward.</p><p>One can also combine continuous aggregates and data retention policies to downsample data and then drop the raw measurements, thus saving even more data storage. </p><p>Using this architecture, you can retain higher-level rollup values for a longer period of time, even after the raw data has been dropped from the database. This allows multiple different levels of granularity to be stored in the database, and provides even more ways to control storage costs.</p><h2 id="more-features-to-speed-up-development-time">More Features to Speed Up Development Time</h2><p>TimescaleDB includes more features that speed up development time. This includes a library of over 100 hyperfunctions, which make complex time-series analysis easy using SQL, such as count approximations, statistical aggregates, and more. TimescaleDB also includes a built-in, multi-purpose job scheduling engine for setting up automated workflows.</p><h3 id="library-of-over-100-hyperfunctions-that-make-complex-analysis-easy">Library of over 100 hyperfunctions that make complex analysis easy</h3><p>TimescaleDB hyperfunctions make data analysis in SQL easy. 
This library includes <a href="https://docs.timescale.com/api/latest/hyperfunctions/time-weighted-averages/">time-weighted averages</a>, <a href="https://docs.timescale.com/api/latest/hyperfunctions/gapfilling-interpolation/locf/">last observation carried forward</a>, <a href="https://docs.timescale.com/api/latest/hyperfunctions/downsample/">downsampling with LTTB or ASAP algorithms</a>, <a href="https://docs.timescale.com/api/latest/hyperfunctions/time_bucket/">time_bucket()</a>, and <a href="https://docs.timescale.com/api/latest/hyperfunctions/gapfilling-interpolation/time_bucket_gapfill/">time_bucket_gapfill()</a>. </p><p>As an example, one could get the average temperature every day for each device over the last seven days, carrying forward the last value for missing readings, with the following SQL.</p><pre><code class="language-SQL">SELECT
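  time_bucket_gapfill('1 day', time) AS day,
  device_id,
  avg(temperature) AS value,
  -- locf() carries the last observed daily average into empty buckets
  locf(avg(temperature)) AS value_locf
FROM metrics
-- time_bucket_gapfill() needs both a lower and an upper time bound
WHERE time &gt; now() - INTERVAL '1 week' AND time &lt; now()
GROUP BY day, device_id
ORDER BY day;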
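
-- A variant sketch, not from the original article: interpolate() fills empty
-- buckets linearly between neighboring values instead of repeating the last one.
-- It assumes the same metrics table as the query above.
SELECT
  time_bucket_gapfill('1 day', time) AS day,
  device_id,
  interpolate(avg(temperature)) AS value_interpolated
FROM metrics
WHERE time &gt; now() - INTERVAL '1 week' AND time &lt; now()
GROUP BY day, device_id
ORDER BY day;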
</code></pre><p>For more information on the extensive list of hyperfunctions in TimescaleDB, please visit our <a href="https://docs.timescale.com/api/latest/hyperfunctions/">API documentation</a>.</p><h3 id="built-in-job-scheduler-for-workflow-automation">Built-in job scheduler for workflow automation </h3><p>TimescaleDB provides the ability to schedule the execution of custom stored procedures with <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/user-defined-actions/">user-defined actions</a>. This feature provides access to the same job scheduler that TimescaleDB uses to run all of the native automation jobs for compression, continuous aggregates, data retention, and more. </p><p>This provides functionality similar to a third-party scheduler like <code>pg_cron</code> without needing to maintain multiple <a href="https://www.tigerdata.com/blog/top-8-postgresql-extensions" rel="noreferrer">PostgreSQL extensions</a> or databases.</p><p>We see users doing all sorts of neat stuff with user-defined actions, from calculating complex SLAs to sending event emails based on data correctness to polling tables.</p><h2 id="still-100-postgresql-and-sql">Still 100&nbsp;% PostgreSQL and SQL</h2><p>Notably, because TimescaleDB is packaged as a <a href="https://www.tigerdata.com/blog/top-8-postgresql-extensions" rel="noreferrer">PostgreSQL extension</a>, it achieves these results without forking or breaking PostgreSQL.</p><h3 id="extending-postgresql%E2%80%94not-forking-or-cloning">Extending PostgreSQL, not forking or cloning</h3><p>Postgres is popular at the moment, but a lot of that popularity is with ‘Postgres compatible’ products, which might look like Postgres, or talk like Postgres, or query somewhat like Postgres, but aren’t Postgres under the hood (and are sometimes closed-source). </p><p>TimescaleDB is just PostgreSQL. 
One can install other extensions, make full use of the type system, and benefit from the incredibly diverse Postgres ecosystem.</p><h3 id="100-sql">100&nbsp;% SQL</h3><p>Any product that can connect to PostgreSQL can query time-series data stored with TimescaleDB using the same SQL it normally would. While we provide helper functions for working with data, we do not restrict the SQL features one can use. Once in the database, users can combine <a href="https://www.tigerdata.com/blog/time-series-introduction" rel="noreferrer">time series</a> and business data as necessary.</p><h3 id="rock-solid-foundations-thanks-to-postgresql">Rock-solid foundations thanks to PostgreSQL</h3><p>PostgreSQL is not a new database: it has years of production deployments under its belt. High availability, backup and restore, and load-balancing are all solved problems. As we mentioned earlier, we chose Postgres because it was reliable, and TimescaleDB inherits that reliability.</p><h2 id="benchmarking-setup-and-results">Benchmarking Setup and Results</h2><p>This section provides details about how we tested TimescaleDB against vanilla PostgreSQL. Feel free to download the <a href="https://github.com/timescale/tsbs">Time-Series Benchmarking Suite</a> and run it for yourself. If you'd like to get started with TimescaleDB quickly, you can use Tiger Cloud, which lets you <a href="https://console.cloud.timescale.com/signup">sign up for a free 30-day trial</a>.</p><h3 id="benchmark-configuration">Benchmark configuration</h3><p>For this benchmark, all tests were run on the same m5.2xlarge EC2 instance in AWS us-east-1 with the following configuration and software versions. 
</p><ul><li>Versions: TimescaleDB version 2.7.2, community edition, and PostgreSQL 14.4</li><li>One remote client machine running TSBS, one database server, both in the same cloud data center</li><li>TSBS Client Instance: EC2 m5.4xlarge with 16 vCPU and 64&nbsp;GB memory</li><li>Database server instance: EC2 m5.2xlarge with 8 vCPU and 32&nbsp;GB memory</li><li>OS: both server and client machines ran Ubuntu 20.04</li><li>Disk size: 1&nbsp;TB of EBS GP2 storage</li><li>TSBS config: Dev-ops profile, 4,000 devices recording metrics every 10 seconds over one month.</li></ul><p>We also deliberately chose to use EBS (elastic block storage) volumes rather than attached SSDs. While benchmark performance would certainly improve with SSDs, the baseline performance using EBS is illustrative of what many self-hosted users could expect while saving some expenses by using elastic storage.</p><h3 id="database-configuration">Database configuration</h3><p>We ran only one PostgreSQL cluster on the EC2 database instance. The TimescaleDB extension was loaded via <code>shared_preload_libraries</code> but not installed into the PostgreSQL-only database.</p><p>To set sane defaults for the PostgreSQL cluster, we ran <code>timescaledb-tune</code> and set <code>synchronous_commit=off</code> in <code>postgresql.conf</code>. This is a common performance configuration for write-heavy workloads while still maintaining transactional, logged integrity. <strong>All configuration changes applied to both PostgreSQL and TimescaleDB benchmarks alike.</strong></p><h3 id="the-dataset">The dataset</h3><p>As we mentioned earlier, for this benchmark, we used the <a href="https://github.com/timescale/tsbs">Time-Series Benchmarking Suite </a>and generated data for 4,000 devices, recording metrics every 10 seconds, for one month. This generated just over one billion rows of data. 
Because TimescaleDB is a PostgreSQL extension, we could use the same data file and ingestion process, ensuring identical data in each database.</p><h3 id="timescaledb-setup">TimescaleDB setup</h3><p>TimescaleDB uses an abstraction called <a href="https://www.tigerdata.com/blog/database-indexes-in-postgresql-and-timescale-cloud-your-questions-answered" rel="noreferrer">hypertables</a> which splits large tables into smaller chunks, increasing performance and greatly easing management of large amounts of time-series data.</p><p>We also enabled native compression on TimescaleDB. We compressed everything but the most recent chunk of data, leaving it uncompressed. This configuration is a commonly recommended one where raw, uncompressed data is kept for recent time periods and older data is compressed, enabling greater query efficiency. The parameters we used to enable compression are as follows: we segmented by the <code>tags_id</code> columns and ordered by time descending and <code>usage_user</code> columns.</p><p><strong><em>All benchmark results were performed on a single PostgreSQL table and on an empty TimescaleDB hypertable created with four-hour chunks.</em></strong></p><p>(And for those thinking that we also need to compare TimescaleDB with PostgreSQL Declarative Partitioning, please read on to the end; we discuss that as well.)</p><h2 id="query-latency-deep-dive">Query Latency Deep Dive</h2><p>For this benchmark, we inserted one billion rows of data and then ran a set of queries 100 times each against the respective database. The data, indexes, and queries are exactly the same for both databases. 
The only difference is that the TimescaleDB queries use the <code>time_bucket()</code> function for doing arbitrary interval bucketing, whereas the PostgreSQL queries use the new <code>date_bin()</code> function, introduced in PostgreSQL 14.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/09/Query-latency-deep-dive--1--1.png" class="kg-image" alt="" loading="lazy" width="2000" height="1639" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/09/Query-latency-deep-dive--1--1.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/09/Query-latency-deep-dive--1--1.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2022/09/Query-latency-deep-dive--1--1.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/09/Query-latency-deep-dive--1--1.png 2070w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Query latency comparison between PostgreSQL and TimescaleDB, for different queries</span></figcaption></figure><p>The results are clear and consistently reproducible. For one billion rows of data spanning one month of time (with four-hour partitions), <strong><em>TimescaleDB consistently outperformed a vanilla PostgreSQL database running 100 queries at a time.</em></strong> </p><p>There are two main reasons for TimescaleDB's consistent query performance.</p><h3 id="compression-smaller-storage-less-work">Compression = smaller storage + less work</h3><p>In PostgreSQL (and many other databases), table data is stored in 8&nbsp;KB pages (sometimes called blocks). If a query has to read 1,000 pages to satisfy it, it reads ~8&nbsp;MB of data. 
If some of that data had to be retrieved from disk, then the query will usually be slower than if all of the data was found in memory (the reserved space known as <em>shared buffers</em> in PostgreSQL). If you’re looking for some insight into PostgreSQL caching, we have <a href="https://timescale.ghost.io/blog/database-scaling-postgresql-caching-explained/">a blog post on that</a>.</p><p>With TimescaleDB compression, queries that return the same results have to read significantly fewer pages of data (this is both because of the actual compression and because it can return single columns rather than whole rows). For all of our benchmarking queries, this also translates into higher concurrency for the benchmark duration.</p><p>Stated another way, compression typically impacts fetching historical data most because TimescaleDB can query individual columns rather than entire rows. Because less I/O is occurring for each query, TimescaleDB can handle more queries with a lower standard deviation than vanilla PostgreSQL.</p><p>Let's look at two examples of how this plays out between the two databases using two of the queries above, <code>cpu-max-all-1</code> and <code>single-groupby-1-1-12</code>.</p><h3 id="single-groupby-1-1-12"><code>single-groupby-1-1-12</code></h3><p>We selected one of the queries from the benchmark and ran it on both databases. Recall that each database has the exact same data and indexes on uncompressed data. TimescaleDB has the advantage of being able to segment and order compressed data in a way that's beneficial to typical application queries.</p><pre><code class="language-SQL">EXPLAIN (ANALYZE,BUFFERS)
SELECT time_bucket('1 minute', time) AS minute,
        max(usage_user) as max_usage_user
        FROM cpu
        WHERE tags_id IN (
          SELECT id FROM tags WHERE hostname IN ('host_249')
        ) 
        AND time &gt;= '2022-08-03 06:16:22.646325 +0000' 
        AND time &lt; '2022-08-03 18:16:22.646325 +0000'
        GROUP BY minute ORDER BY minute;</code></pre><p>When we run the <code>EXPLAIN</code> on this query and ask for <code>BUFFERS</code> to be returned, we start to get a hint of what's happening.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/09/single-query-latency-single-groupby-1-1-12.png" class="kg-image" alt="" loading="lazy" width="2000" height="1053" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/09/single-query-latency-single-groupby-1-1-12.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/09/single-query-latency-single-groupby-1-1-12.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2022/09/single-query-latency-single-groupby-1-1-12.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/09/single-query-latency-single-groupby-1-1-12.png 2070w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Query latency vs volume of data that has to be read to satisfy the query in TimescaleDB and PostgreSQL.</span></figcaption></figure><p></p><p>Two things quickly jump out when I view these results. First, the execution times are significantly lower than the benchmarking results above. Individually, these queries execute pretty fast, but PostgreSQL has to read approximately 27x more data to satisfy the query. When 16 workers request data across the time range, PostgreSQL has to do a lot more I/O, which consumes resources. TimescaleDB can simply handle a higher concurrency for the same workload. 
</p><h3 id="cpu-max-all-1"><code>cpu-max-all-1</code></h3><p>Again we can clearly see the impact of compression on the ability for TimescaleDB to handle a higher concurrent load when compared to vanilla PostgreSQL for time-series queries.</p><pre><code class="language-SQL">EXPLAIN (ANALYZE, buffers) 
SELECT
   time_bucket('3600 seconds', time) AS hour,
   max(usage_user) AS max_usage_user,
   max(usage_system) AS max_usage_system,
   max(usage_idle) AS max_usage_idle,
   max(usage_nice) AS max_usage_nice,
   max(usage_iowait) AS max_usage_iowait,
   max(usage_irq) AS max_usage_irq,
   max(usage_softirq) AS max_usage_softirq,
   max(usage_steal) AS max_usage_steal,
   max(usage_guest) AS max_usage_guest,
   max(usage_guest_nice) AS max_usage_guest_nice 
FROM cpu 
WHERE  
   tags_id IN (
      SELECT id FROM tags WHERE hostname IN ('host_249')
   )
   AND time &gt;= '2022-08-08 18:16:22.646325 +0000' 
   AND time &lt; '2022-08-09 02:16:22.646325 +0000' 
GROUP BY HOUR 
ORDER BY HOUR;</code></pre><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/09/single-query-latency-cpu-max-all-1.png" class="kg-image" alt="" loading="lazy" width="2000" height="1053" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/09/single-query-latency-cpu-max-all-1.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/09/single-query-latency-cpu-max-all-1.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2022/09/single-query-latency-cpu-max-all-1.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/09/single-query-latency-cpu-max-all-1.png 2070w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Query latency vs volume of data that has to be read to satisfy the query, in TimescaleDB and PostgreSQL</span></figcaption></figure><p>With compression, TimescaleDB does significantly less work to retrieve the same data, resulting in faster queries and higher query concurrency.</p><h3 id="time-ordered-queries-just-work-better">Time-ordered queries just work better</h3><p>TimescaleDB hypertables require a time column to partition the data. Because time is an essential (and known) part of each row and chunk, TimescaleDB can intelligently improve how the query is planned and executed to take advantage of the time component of the data.</p><p>For example, let's query for the maximum CPU usage for each minute for the last 10 minutes.</p><pre><code class="language-SQL">EXPLAIN (ANALYZE,BUFFERS)        
SELECT time_bucket('1 minute', time) AS minute, 
  max(usage_user) 
FROM cpu 
WHERE time &gt; '2022-08-14 07:12:17.568901 +0000' 
GROUP BY minute 
ORDER BY minute DESC 
LIMIT 10;
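
-- Sketch of a plain PostgreSQL counterpart, not copied from the benchmark:
-- date_bin() (PostgreSQL 14+) takes an explicit origin; the one below is an
-- arbitrary assumption chosen to align buckets to whole minutes.
EXPLAIN (ANALYZE, BUFFERS)
SELECT date_bin('1 minute', time, TIMESTAMPTZ '2000-01-01 00:00:00+00') AS minute,
  max(usage_user)
FROM cpu
WHERE time &gt; '2022-08-14 07:12:17.568901 +0000'
GROUP BY minute
ORDER BY minute DESC
LIMIT 10;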
</code></pre><p>Because TimescaleDB understands that this query is aggregating on time and the result is ordered by the time column (something each chunk is already ordering by in an index), it can use the <code>ChunkAppend</code> custom execution node. In contrast, PostgreSQL plans five workers to scan all partitions before sorting the results and finally doing a <code>GroupAggregate</code> on the time column.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/09/single-query-latency-chunkappend-1.png" class="kg-image" alt="" loading="lazy" width="2000" height="1053" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/09/single-query-latency-chunkappend-1.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/09/single-query-latency-chunkappend-1.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2022/09/single-query-latency-chunkappend-1.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/09/single-query-latency-chunkappend-1.png 2070w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Query latency vs volume of data that has to be read to satisfy the query, in TimescaleDB and PostgreSQL</span></figcaption></figure><p>TimescaleDB scans less data and doesn't need to spend time re-sorting the data that it knows is already sorted in the chunk. 
For time-series data with a known order and constraints, TimescaleDB works better for most queries than vanilla PostgreSQL.</p><h2 id="ingest-performance">Ingest Performance</h2><p>Intriguingly, ingest performance for TimescaleDB and PostgreSQL is nearly identical, a dramatic improvement for PostgreSQL given the <a href="https://timescale.ghost.io/blog/timescaledb-vs-6a696248104e/">results five years ago with PostgreSQL 9.6</a>. However, TimescaleDB still consistently finished with an average rate 3,000 to 4,000 rows/second higher than a single PostgreSQL table.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/09/insert-performance-of-1-billion-rows--1--1.png" class="kg-image" alt="" loading="lazy" width="2000" height="1349" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/09/insert-performance-of-1-billion-rows--1--1.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/09/insert-performance-of-1-billion-rows--1--1.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2022/09/insert-performance-of-1-billion-rows--1--1.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/09/insert-performance-of-1-billion-rows--1--1.png 2070w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Insert performance comparison between TimescaleDB 2.7.2 and PostgreSQL 14.4</span></figcaption></figure><p>This shows that while vast improvements have been made in PostgreSQL, TimescaleDB hypertables also continue to perform exceptionally well. Beyond the rate, the other characteristics of ingest performance are nearly identical between TimescaleDB and PostgreSQL. 
Modifying the batch size for the number of rows to insert at a time impacts each database the same: small batch sizes of a few hundred rows significantly hinder ingest performance, while batch sizes of 10,000 to 15,000 rows seem to be about optimal for this dataset.</p><h2 id="declarative-partitioning">Declarative Partitioning</h2><p>In the benchmarks above, we tested TimescaleDB against a single PostgreSQL table simply because that’s the default option that most people end up using. PostgreSQL also has support for native declarative partitioning, which has been maturing over the past few years. </p><p>For the sake of completeness, we also tested TimescaleDB against native declarative partitioning. As the graphic below shows, TimescaleDB is still 1,000x faster for some queries, with strong performance gains still showing across the board. Ingest performance was similar between TimescaleDB and declarative partitioning.</p><p>In fact, if anything, the takeaway from these tests was that while declarative partitioning has matured, the gap between using a single table and declarative partitioning has shrunk.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/04/image.png" class="kg-image" alt="" loading="lazy" width="2000" height="1639" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/04/image.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/04/image.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2023/04/image.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/04/image.png 2070w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Query latency comparison between TimescaleDB and PostgreSQL with 
declarative partitioning</span></figcaption></figure><p>Using declarative partitioning is also harder. One needs to manually pre-create partitions, ensure there are no data gaps, ensure no data is inserted outside of the partition ranges, and create more partitions as time moves on.</p><p>In contrast, with TimescaleDB, one does not need any of this. Instead, a single <code>create_hypertable</code> command is used to convert a standard table into a hypertable, and TimescaleDB takes care of the rest.</p><h2 id="conclusion">Conclusion</h2><p>TimescaleDB harnesses the power of the extension framework to supercharge PostgreSQL for time-series and analytical applications. With additional features like compression and continuous aggregates, TimescaleDB provides not only the most performant way of using time-series data in PostgreSQL but also the best developer experience. </p><p>When compared to traditional PostgreSQL, TimescaleDB enables 1,000x faster time-series queries, compresses data by 90&nbsp;%, and provides access to advanced time-series analysis tools and operational features specifically designed to ease data management. TimescaleDB also provides benefits for other types of queries with features like SkipScan, just by installing the extension.</p><p>In short, TimescaleDB extends PostgreSQL to enable developers to continue to use the database they love for time series, perform better at scale, spend less, and streamline data analysis and operations. </p><p>If you’re looking to expand your <a href="https://www.tigerdata.com/learn/building-a-scalable-database" rel="noreferrer">database scalability</a>, try our hosted service, <a href="https://www.timescale.com/cloud">Tiger Cloud</a>. 
You will get the PostgreSQL you know and love with extra features for time series (<a href="https://timescale.ghost.io/blog/how-we-made-data-aggregation-better-and-faster-on-postgresql-with-timescaledb-2-7/">continuous aggregation</a>, <a href="https://docs.timescale.com/api/latest/compression/">compression</a>, <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/data-retention/about-data-retention/#drop-data-by-chunk">automatic retention policies</a>, <a href="https://timescale.ghost.io/blog/how-postgresql-aggregation-works-and-how-it-inspired-our-hyperfunctions-design-2/">hyperfunctions</a>). Plus, a platform with <a href="https://timescale.ghost.io/blog/how-high-availability-works-in-our-cloud-database/">automated backups, high availability</a>, automatic upgrades, and much more. <a href="https://console.cloud.timescale.com/signup">You can use it for free for 30 days; no credit card required</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Database Scaling: PostgreSQL Caching Explained]]></title>
            <description><![CDATA[Caching is integral to improving PostgreSQL performance. A look at how caching works in PostgreSQL—and how to make it work even better.]]></description>
            <link>https://www.tigerdata.com/blog/database-scaling-postgresql-caching-explained</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/database-scaling-postgresql-caching-explained</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Kirk Laurence Roybal]]></dc:creator>
            <pubDate>Tue, 13 Sep 2022 14:34:51 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/09/caching-explained-timescale.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/09/caching-explained-timescale.png" alt="Database Scaling: PostgreSQL Caching Explained" /><p>Follow us, friends, as we take a journey backward in time. We're going back to 1990, when soft rock was cool, and fanny packs were still okay. But we're not going there to enjoy the music and hang out at the mall. We’re going there to talk about database scaling and PostgreSQL caching. We’re going there because that was the last time PostgreSQL made simple sense, at least when it comes to resource management.</p><p>It was a time when the network was slower than a hard drive, the hard drives were slower than memory, and memory was slower than CPU. Back then, there was no such thing as a file system cache, hard drive cache, or operating system cache. Stuff like the Linux kernel was just a gleam in Linus Torvalds's eye.</p><p>Why are we going there, you might ask? To be honest, because your poor author is a bit lazy. And because it's less likely that you'll be overwhelmed by the description we're about to give you.</p><p>PostgreSQL implemented a strategy to speed up access to data in those special years of clarity and simplicity. The basic idea was simple. Memory is faster than disk, so why not keep some of the most used stuff in memory to speed up retrieval? (Cue the sinister laughter from the future.)&nbsp;</p><p>This improvement has proven far more effective and valuable than the original authors probably envisioned. As PostgreSQL has matured over the years, the shared memory system matured with it. This most basic idea, what we commonly know as caching, continues to be very useful; in fact, it is the second most useful thing in PostgreSQL besides the working memory for each query. 
However, it is becoming less and less accurate over time, and other factors are becoming more prominent.</p><p>We are going to start at the beginning and then introduce the pesky truth—much in the same way that it blindsided the developers of PostgreSQL caching. As we go along, I can hear the advanced users of PostgreSQL. They are saying things like "except when," "unless," and "on the such-and-such platform." Yes, yes. We may or may not get around to your favorite exception to the rule. If we don't, apologies in advance for skipping it in the name of clarity and simplicity. This is not the last article we will ever write (in fact, there are already two more planned in the series, so stay tuned!). Please share your thoughts in the <a href="https://www.timescale.com/forum/c/conversation-community/events-blogs-and-live-streams/12"><u>Timescale Forum blog channel</u></a> and we'll try to get there in the next few go-arounds.</p><p>The good news is that I hope to introduce you to the concepts in a digestible format and pace. The bad news is that caching is a huge problem domain, and it will take a while to introduce you to all those concepts if you want to learn more about database scaling. Keep reading, and the information will get more useful and accurate over time.</p><h2 id="scaling-your-database-a-trip-down-shared-memory-lane">Scaling Your Database: A Trip Down Shared Memory Lane</h2><p>Back to 1990. There were basically two problems to solve to have a practical design for shared memory. The first one is that PostgreSQL is a multi-process system by design, so things happening in parallel processes can (and do) affect each other. The other is that the overhead of managing the memory system can't take more time than it would have just to retrieve the data anyway.</p><p>The good news for the first problem is that we already had a similar problem in the form of file system access. 
To solve that problem, we used an in-memory locking table.&nbsp;&nbsp;</p><p>Access to the file system is doled out by the postmaster process (the main one that creates all the other processes). Any other PostgreSQL process that wants to access a file has to ask the postmaster "pretty please" first. These locks are maintained in memory associated with the postmaster process. There isn't much reason to maintain them elsewhere because if the main process dies, the database will be unlocked. In other words, all the other processes working on files will be closed, and all locks released.</p><p>These requirements are suspiciously close to the same requirements for shared memory access. For the file system, we call this "locking" or, for memory, "latching."&nbsp; For the Postgres shared memory system, we call them "pins." Pins are very similar to locks but much simpler. All we have to care about is reading or writing to memory. So there are only two types of pins.</p><p>Now that we have the cooperation system down to two actions and a bit of memory, the next issue to solve is finding what you want when you need it. This is a simple matter of a memory scan. 
In PostgreSQL, the files on disk that store the actual table data are managed already with a page and leaf descriptor.&nbsp;</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/09/Caching.jpg" class="kg-image" alt="" loading="lazy" width="1200" height="1240" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/09/Caching.jpg 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/09/Caching.jpg 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/09/Caching.jpg 1200w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Table files in PostgreSQL</span></figcaption></figure><p>These descriptors are simply an indicator of the location of a row within a database file. The format of a page is <a href="https://www.postgresql.org/docs/current/storage-page-layout.html"><u>described in the manual</u></a>.</p><p>Curiously, in that description it also says:</p><p>&gt; <a href="https://github.com/postgres/postgres/blob/master/src/include/storage/bufpage.h"><u>All the details can be found</u></a> in src/include/storage/bufpage.h.</p><p>Which is a reference to the shared memory code. It turns out that every disk write operation is handled in memory first. The ctid (page and leaf location of the eventual location in the data files for the table) is assigned <strong>before</strong> the data is written to disk.</p><p>That allows the pages in memory to be "looked up" using the same description the file system uses, even if the data hasn't yet been written to the file system. 
Clever, eh?<br></p><div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text">We could go off on a tangent here about how the journaling system works and why a page could be in the journal (known as the write-ahead log) and memory but not written to the data file system yet. That is a topic for another day. Suffice it to say for now that durability is guaranteed by writing to the journal, so this is fine and dandy. In a future article, we'll also talk about how this buffering of writes acts as a backup for the journal. Again, that's getting ahead of ourselves.</div></div><p><br></p><p></p><p></p><h3 id="accessing-shared-memory">Accessing shared memory</h3><p>Each connection to the database is handled by a process that the PostgreSQL developers affectionately call a “backend.” This process is responsible for interpreting a query and providing the result. In some cases, it can retrieve that result from the shared memory held by the postmaster process. To access shared memory, we have to ask if the buffer system in the postmaster keeps a copy. The postmaster responds with one of two options: </p><ol><li>No, these aren’t the pages you’re looking for.</li><li>Yes, and this is what it might look like.</li></ol><p>"Might" in this case, because we are now beginning to see the effects of the first issue mentioned above. No, don't look back there; we'll repeat it here. The issue is that the processes are also affecting each other. A buffer may change based on any process still in flight acting on it. So, if we want to know that the buffer is valid, we have to read it while we "pin" it.</p><p>The semantics of this are much the same as the file system. Any number of processes may access the buffers for reading purposes. The postmaster simply keeps a running list of these processes. 
When any process comes along with a write operation and makes a change to the buffer, all of the ones that were reading it get a notice that the contents changed in flight. It is up to each "backend" (process handling connections to the user) to reread the buffer and validate that the data continues to be "interesting."&nbsp; That is, the row still matches the criteria of the query.</p><p>Since the data in shared memory is managed in pages, not rows, the particular row that a query was accessing may or may not have actually changed at all. It may have just had the misfortune of being in roughly the same place as something else that changed. It may have changed, but none of the columns that are a part of the query criteria were affected. It may have changed those columns, but in a way that still matches. Or the row may now no longer be a part of what the query was searching for. This is all up to the parallel processes handling the user query to decide.&nbsp;&nbsp;</p><p>Assuming that the data has made it through this gauntlet, it can be returned to the caller. We can reasonably assume that the row looks exactly like what would have been returned had we looked it up in the file system instead of memory. Despite having to work with a form of locking and lookup, we also presume that this was cheaper than spinning up a disk and finding and reading the data.</p><h2 id="postgresql%E2%80%99s-shared-memory-the-design-principles">PostgreSQL’s Shared Memory: The Design Principles&nbsp;</h2><p>Now that we know how the basic process of accessing shared data works, let's have a few words about why it was originally designed this way. PostgreSQL is an MVCC (multi-version concurrency control) system—another topic beyond this article's scope to explain. For the moment, we'll condense this to the point of libel. INSERT, DROP, TRUNCATE and SELECT are cheap. UPDATE, DELETE and MERGE are expensive. This is largely due to the tracking system for row updates. 
And yes, DELETE is considered an UPDATE for tracking purposes.</p><div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text">PostgreSQL actually doesn’t UPDATE or DELETE rows. For tracking purposes, it maintains a copy of every version of a row that existed. On UPDATE, it creates a new row, makes the changes in the new row, and marks the row as current for any future transactions. For DELETE, it just marks the existing row as no longer, well, uh, existing. This is called a tombstone record. It allows all transactions in flight (and future transactions) to know that the row is dead, creating yet another cleanup problem, which we’ll (hopefully) talk about in future articles.</div></div><p><br></p><p></p><p></p><p>The caching system follows the same coding paradigm to have the same performance characteristics.&nbsp; It is possible in an alternate universe that there is a cheaper solution for caching that provides "better" concurrency. That being said, the overall system is as fast as the slowest part. If the design of the caching system wildly diverged from the file system, the total response to the caller would suffer at the worst-performing points of both systems.</p><p>Also, this system is tightly integrated into the PostgreSQL query planner. A secondary system (of nearly any kind) would likely introduce inefficiencies that are far greater than any benefits would be likely to cover.</p><p>And lastly, the system acts not only as a way to return the most-sought data to the caller but also as a change buffer for data modifying queries. Multiple changes may be made to the same row before being written to the file system. The background writer (responsible for making the journal changes permanent in the data files) is intelligent enough to save the final condition of the row to disk. 
This added efficiency alone pays for a lot of the complexity of caching.</p><h3 id="cache-eviction">Cache eviction</h3><p>There are a few outstanding things to consider in the design of PostgreSQL caching. The first is that, eventually, <em>something</em> has to come along and evict the pages out of the cache. It can't just grow indefinitely. That <em>something</em> is called the background writer. We know from the design above that PostgreSQL keeps track of which processes find the data interesting. When that list of processes is at zero entries, and the data hasn't been accessed in a while, the background writer will mark the block as "reusable" by putting it in a list called the "free space map." (Yes, this is much the same as the autovacuum process for the file system).&nbsp;&nbsp;</p><p>In the future, the memory space will be overwritten by some other (presumably more active) data. It's the circle of life. Buffer eviction is a garbage collection process with no understanding of what queries are in flight or why the data was put into the buffer. It just comes along on a timer and kicks things out that haven't been active in a while.</p><h3 id="forced-eviction-considered-evil">Forced eviction considered evil</h3><p>Also, the backend processes we have already mentioned may decide that they need a lot of memory to do some huge operation and request the postmaster commit everything currently staged to disk. This is called a buffer flush, and it is immediate. It is miserable for performance to get into a position where a buffer flush is necessary. All concurrent processes will halt until the flush is completed and verified.&nbsp; &lt;== A horrible statement in a concurrent database.</p><p>The postmaster may decide to flush the buffer cache to respond to some backend process. 
This is just as horrible as the previous paragraph for the same reasons.</p><h2 id="hey-i-was-using-that">Hey, I Was Using That</h2><p>PostgreSQL is paying attention (by the autovacuum process) to which data blocks are being accessed in the file system. If these accesses reach a threshold, PostgreSQL will read the block from the disk and stick it back in the cache because it seems to answer many questions. This process is blind to the queries that access the data, the eviction process, and anything else, for that matter. It's the old late-night kung-fu flick version of sticking the commercials in there anywhere. There is no rhyme or reason to where these blocks will end up in the buffer memory space, but they seem interesting, so in you go.</p><p>In fact, the blocks in memory are effectively unordered. Because of the process of eviction and restoration, no spatial order is guaranteed (or even implied.) This means that a "cache lookup" is effectively reading the page and leaf location for each block every time the cache is accessed. The postmaster holds a list of pages in the cache in order—much like a hash index—with the memory location of each block.&nbsp; As the size of the cache increases, additional lookups in the "cache index" are implied.</p><h2 id="more-postgresql-scaling-and-caching-coming-your-way">More PostgreSQL Scaling and Caching Coming Your Way</h2><p>Now that we have the PostgreSQL caching basics behind us, in the next few articles we can fast forward and explain some of the caveats that have come up along the way:</p><ul><li>We'll explain how improvements in disk and memory have affected the assumptions made in the original design.&nbsp;</li><li>We'll look at the expense of each of the caching operations. 
We can look at the journaling system and see how it interacts with the caching system.&nbsp;</li><li>We'll examine how valuable caching is today and how it could benefit from improvements already under development.&nbsp;</li><li>We'll look at how to tune caching for better performance and determine how much more performance you're likely to get based on your hardware and software choices.</li></ul><p>Stay tuned; there's a lot more where this article came from.</p><p>In the meantime, if you’re looking to expand your <a href="https://www.tigerdata.com/learn/building-a-scalable-database" rel="noreferrer">database scalability</a>, try our hosted service, <a href="https://www.timescale.com/cloud"><u>Timescale</u></a>. You will get all the PostgreSQL juice with extra features for <a href="https://www.tigerdata.com/blog/time-series-introduction" rel="noreferrer">time series</a> (continuous aggregation, compression, automatic retention policies, hyperfunctions). Plus, a platform with automated backups, high availability, automatic upgrades, and much more. <a href="http://tsdb.co/cloud-signup"><u>You can use it for free for 30 days; no credit card required.</u></a></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[State of PostgreSQL 2022—13 Tools That Aren't psql]]></title>
            <description><![CDATA[Performance and tooling are frequently debated in the State of PostgreSQL survey, and this year was no exception. With psql remaining the number one tool for querying and admin among PostgreSQL users, we decided to compile a list of tools— that aren’t psql—to broaden your options.]]></description>
            <link>https://www.tigerdata.com/blog/state-of-postgresql-2022-13-tools-that-arent-psql</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/state-of-postgresql-2022-13-tools-that-arent-psql</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[State of PostgreSQL]]></category>
            <dc:creator><![CDATA[Ryan Booz]]></dc:creator>
            <pubDate>Tue, 26 Jul 2022 13:26:48 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/Blog-Hero.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/Blog-Hero.png" alt="State of PostgreSQL 2022—13 Tools That Aren't psql" /><p>The State of PostgreSQL 2022 survey closed a few weeks ago, and we're hard at work cleaning and analyzing the data to provide the best insights we can for the PostgreSQL community.</p><p>In the database community, however, there are usually two things that drive lots of <em>discussion</em> year after year: performance and tooling. During this year's survey, we modified the questions slightly so that we could focus on three specific use cases and the PostgreSQL tools that the community finds most helpful for each: querying and administration, development, and data visualization.</p><h2 id="postgresql-tools-what-do-we-have-against-psql">PostgreSQL Tools: What Do We Have Against psql?</h2><p>Absolutely nothing! As evidenced by the majority of respondents (69.4 %) who mentioned using psql for querying and administration, it's the ubiquitous choice for so many PostgreSQL users, and there is already good documentation and community-contributed resources (<a href="https://psql-tips.org/">https://psql-tips.org/</a> by <a href="https://www.youtube.com/watch?v=j2gHN0ItUq8">Lætitia Avrot</a> is a great example) to learn more about it.</p><p>So that got us thinking. What other tools did folks bring up often for interacting with PostgreSQL along the three use cases mentioned above?</p><p>I'm glad we asked. 😉</p><h2 id="postgresql-querying-and-administration">PostgreSQL Querying and Administration</h2><p>As we just said, psql is by far the most popular tool for interacting with PostgreSQL. 
🎉</p><p>It's clear, however, that many users with all levels of experience do trust other tools as well.</p><h3 id="query-and-administration-tools">Query and administration tools</h3><p>pgAdmin (35 %), DBeaver (26 %), DataGrip (13 %), and IntelliJ (10 %) IDEs received the most mentions. Most of these aren't surprising if you've been working with databases, PostgreSQL or not. The most popular GUIs (pgAdmin and DBeaver) are open source and freely available to use. The next most popular GUIs (DataGrip and IntelliJ) are licensed per seat. However, if your company or team already uses JetBrains tools, you might have access to these popular tools.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/Blog---10-tools-that-aren-t-psql.png" class="kg-image" alt loading="lazy" width="2000" height="1310" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/07/Blog---10-tools-that-aren-t-psql.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/07/Blog---10-tools-that-aren-t-psql.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2022/07/Blog---10-tools-that-aren-t-psql.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/Blog---10-tools-that-aren-t-psql.png 2000w" sizes="(min-width: 720px) 720px"></figure><p>What I was more interested in were the mentions that happened just after the more popular tools I expected to see. Often, it's this next set of PostgreSQL tools that has gained enough attention from community members that there's obviously a value proposition to investigate further. 
If they can be helpful to my (or your) development workflow in certain situations, I think it's worth digging a little deeper.</p><h3 id="pgcli">pgcli</h3><!--kg-card-begin: markdown--><p>First on the list is <a href="https://www.pgcli.com/">pgcli</a>, a Python-based command-line tool and one of many <a href="https://github.com/dbcli">dbcli</a> tools created for various databases. Although this is not a replacement for <code>psql</code>, it provides an interactive, auto-complete interface for writing SQL and getting results. Syntax highlighting and some basic support for psql backslash commands are included. If you love to stay in the terminal but want a little more interactivity, the dbcli tools have been around for quite some time, have a nice community of support, and might make database exploration just a little bit easier sometimes.</p>
<!--kg-card-end: markdown--><h3 id="azure-data-studio">Azure Data Studio</h3><p>Introduced as a beta in December 2017 by the Microsoft database tooling team, <a href="https://docs.microsoft.com/en-us/sql/azure-data-studio/download-azure-data-studio?view=sql-server-ver16">Azure Data Studio</a> has been built on top of the same Electron platform as Visual Studio Code. Although the primary feature set is currently geared towards SQL Server (for obvious reasons), the ability to connect to PostgreSQL has been available since 2019.</p><p>There are a couple of unique features in Azure Data Studio (ADS) that work with both SQL Server and PostgreSQL connections that I think are worth mentioning. </p><p>First, ADS includes the ability to create and run SQL-based Jupyter Notebooks. Typically you'd have to wrap your SQL inside of another runtime like Python, but ADS provides the option to select the "SQL" kernel and deals with the connection and SQL wrapping behind the scenes.</p><p>Second, ADS provides the ability to export query results to Excel without any plugins needed. While there are (seemingly) a thousand ways to quickly get a result set into CSV, producing a correctly formatted Excel file requires a plugin with almost any other tool. Regardless of how you feel about Excel, it is still the tool of choice for many data analysts, and being able to provide an Excel file easily does help sometimes.</p><p>Finally, ADS also provides some basic charting capabilities using query results. There's no need to set up a notebook and use a <a href="https://plotly.com/python/">charting library like plotly</a> if you just need to get some quick visualizations on the data. 
I've had a few hiccups with the capabilities (it's certainly not intended as a serious data analytics tool), but it can be helpful to get some quick chart images to share while exploring query data.</p><h3 id="postico">Postico</h3><p>For anyone using macOS, <a href="https://eggerapps.at/postico/">Postico</a> is a GUI application that's been recommended in many of my circles. Many folks prefer the native macOS feel and the usability and editing features that make working with PostgreSQL simple and intuitive.</p><h3 id="up-and-coming">Up and coming</h3><p>We'll leave it to you to look through the data and see what other GUI/query tools fellow PostgreSQL users are also using that might be of interest to you, but there are a few that were mentioned multiple times and even caused me to hit Google a few times to find out more. Some are free and open source, while others require licenses but provide interesting features like built-in data analytics capabilities. Whether you end up using any of these or not, it's good to see continued innovation within the tooling market, something that doesn't seem to be slowing down decades into our SQL journey. </p><ul><li><a href="https://arctype.com/">Arctype</a></li><li><a href="https://tableplus.com/">TablePlus</a></li><li><a href="https://www.aquafold.com/">Aqua Data Studio</a></li><li><a href="https://www.beekeeperstudio.io/">Beekeeper Studio</a></li></ul><h2 id="helpful-third-party-postgresql-tools-for-application-development">Helpful Third-Party PostgreSQL Tools for Application Development</h2><p>Although the GUI/administration landscape is certainly as active as ever, one of the most impactful features of PostgreSQL is how extensible it is. 
If the core application doesn't provide exactly what your application needs, there's a good chance someone (or some company) is working to provide that functionality.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/Blog---10-tools-that-aren-t-psql--1-.png" class="kg-image" alt loading="lazy" width="2000" height="1726" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/07/Blog---10-tools-that-aren-t-psql--1-.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/07/Blog---10-tools-that-aren-t-psql--1-.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2022/07/Blog---10-tools-that-aren-t-psql--1-.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/Blog---10-tools-that-aren-t-psql--1-.png 2000w" sizes="(min-width: 720px) 720px"></figure><p>The total distinct number of tools mentioned was similar to GUI/administration tools and generally fell into four categories: management features, cluster monitoring, query plan insights, and database DevOps tooling. For this blog post, we're going to focus on the first three areas.</p><h3 id="management-features">Management features</h3><p>It's not surprising that the most popular third-party PostgreSQL tools tend to be focused on daily management tasks of some sort. Two of the most popular tools in this area are mainstays in most self-hosted PostgreSQL circles.</p><p><strong>pgBouncer</strong></p><!--kg-card-begin: markdown--><p>PostgreSQL creates one new process (not thread) per connection. Without proper tuning and a right-sized server, a database can quickly become overwhelmed with unplanned spikes in usage. 
<a href="https://www.pgbouncer.org/">pgBouncer</a> is an open-source connection pooling application that helps manage connection usage for high-traffic applications.</p>
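<p>As a rough way to see how close a server is to that limit, a query along these lines counts the backend processes currently serving clients. It's only a sketch: it relies on the standard <code>pg_stat_activity</code> view and the <code>max_connections</code> setting, and the <code>backend_type</code> filter assumes PostgreSQL 10 or later.</p>
<pre><code class="language-SQL">-- Each row in pg_stat_activity corresponds to one server process.
-- Count client connections by state and show the configured limit alongside.
SELECT state,
       count(*) AS connections,
       current_setting('max_connections')::int AS max_connections
FROM pg_stat_activity
WHERE backend_type = 'client backend'
GROUP BY state
ORDER BY connections DESC;
</code></pre>
<p>A large number of <code>idle</code> connections relative to <code>max_connections</code> is often a hint that a pooler could help.</p>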
<!--kg-card-end: markdown--><p>If your database is self-hosted or your DBaaS doesn't provide some kind of connection pooling management for you, pgBouncer can be installed anywhere that makes sense with respect to your application to provide better connection management.</p><p><strong>pgBackRest</strong></p><!--kg-card-begin: markdown--><p>Database backups are essential, obviously, and PostgreSQL has always had standard tooling for backup and restore. But as databases have grown in size and application architectures have become more complex, using <code>pg_dump</code> and <code>pg_restore</code> can make it more difficult than intended to perform these tasks well.</p>
<!--kg-card-end: markdown--><p>The Crunchy Data team created <a href="https://pgbackrest.org/">pgBackRest</a> to help provide a full-fledged backup and restore system with many necessary features for enterprise workloads. Multi-threaded backup and compression, multiple repository locations, and backup resume are just a few features that make this a common and valuable tool for any PostgreSQL administrator.</p><h3 id="cluster-monitoring">Cluster monitoring</h3><p>The second group of third-party PostgreSQL tools that shows up often focuses on improved database monitoring, which includes query monitoring in most cases. There are a lot of folks tackling this problem area from many different angles, which demonstrates the continued need that many developers and administrators have when managing PostgreSQL.</p><p><strong>pgBadger</strong></p><p>PostgreSQL has a lot of settings that can be tuned and details that can be logged into server logs, but there is no built-in functionality for analyzing that data holistically. This is where pgBadger steps in to help generate useful reports from all of the data your server is logging.</p><p><a href="https://github.com/darold/pgbadger">pgBadger</a> is one of a few popular PostgreSQL tools written in <a href="https://www.perl.org/">Perl</a> (which surprises me for some reason), but the developer has gone to great lengths to not require lots of Perl-specific modules for drawing charts and graphs, instead relying on common JavaScript libraries in the rendered reports.</p><p>There's a lot to look at with pgBadger, and the larger PostgreSQL community often recommends it as a helpful, long-term debugging tool for server performance issues.</p><p><strong>pganalyze</strong></p><p><a href="https://pganalyze.com/">pganalyze</a> has grown in popularity quite a lot over the last few years. 
<a href="https://twitter.com/LukasFittl">Lukas Fittl</a> has done a great job adding new features and capabilities while also providing a number of great PostgreSQL community resources across various platforms.</p><!--kg-card-begin: markdown--><p>pganalyze is a fee-based product that uses data provided by standard plugins (<code>pg_stat_statements</code> for example) which is then consumed through a collector that sends the data to a cloud service. If you use pganalyze to query log information as well (e.g., long-running queries), then features like example problem queries and index advisor could be really helpful for your development workflow and user experience.</p>
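<p>To get a feel for the kind of data a collector like this works from, you can query <code>pg_stat_statements</code> directly. This is a minimal sketch, not pganalyze's own query: it assumes the extension is listed in <code>shared_preload_libraries</code> and that you're on PostgreSQL 13 or later, where the timing columns are named <code>total_exec_time</code> and <code>mean_exec_time</code> (older versions call them <code>total_time</code> and <code>mean_time</code>).</p>
<pre><code class="language-SQL">-- Make the view available in the current database
-- (the library itself must already be preloaded by the server).
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Top five statements by cumulative execution time, in milliseconds.
SELECT query,
       calls,
       round(total_exec_time::numeric, 2) AS total_ms,
       round(mean_exec_time::numeric, 2)  AS mean_ms
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 5;
</code></pre>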
<!--kg-card-end: markdown--><h3 id="query-plan-analysis">Query plan analysis</h3><p>No discussion about PostgreSQL would be complete without mentioning tools that help you understand EXPLAIN output better. This is an area many people struggle with, particularly when their previous experience is with another database, and a small set of common, helpful tools has been growing in popularity to help with this essential task.</p><p><strong>Depesz EXPLAIN and Dalibo EXPLAIN</strong></p><p>Both <a href="https://explain.depesz.com/">Depesz</a> and <a href="https://explain.dalibo.com/">Dalibo</a> EXPLAIN provide a quick, free platform that takes a PostgreSQL explain plan and shows which operations are making a query slow and, in some cases, offers hints to help speed things up. Also, if you let them, both tools provide a permalink to the output for you to share with others if necessary.</p><p><strong>pgMustard</strong></p><p>One of my favorite EXPLAIN tools is <a href="https://www.pgmustard.com/">pgMustard</a>, created and maintained by <a href="https://uk.linkedin.com/in/michristofides">Michael Christofides</a>. This is a paid tool, but there are a lot of unique insights and features that pgMustard provides that others currently don't. Michael is also doing great work within the community, even recently starting a <a href="http://postgres.fm">PostgreSQL podcast</a> with Nikolay Samokhvalov, <a href="https://timescale.ghost.io/blog/what-is-sql-used-for-build-environments-where-devs-can-experiment/">with whom we recently talked about all things SQL</a>.</p><h2 id="which-visualization-tools-do-you-use">Which Visualization Tools Do You Use?</h2><p>The final tooling question on the State of PostgreSQL survey asked about visualization tools that folks used. 
Without a doubt, Grafana was the top vote-getter, but that's something we could have probably guessed pretty easily.</p><p>I <em>was</em> surprised that the next two top vote-getters were for pgAdmin and DBeaver, both popular database GUI tools we mentioned earlier. In both cases, visualization capabilities are somewhat limited, so it's hard to tell exactly what kind of features are being used that would categorize them as visualization tools.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/Blog---10-tools-that-aren-t-psql-2.png" class="kg-image" alt loading="lazy" width="2000" height="676" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/07/Blog---10-tools-that-aren-t-psql-2.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/07/Blog---10-tools-that-aren-t-psql-2.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2022/07/Blog---10-tools-that-aren-t-psql-2.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/Blog---10-tools-that-aren-t-psql-2.png 2000w" sizes="(min-width: 720px) 720px"></figure><p>The next group of tools is more interesting to me and I wanted to highlight a few that might pique your interest to investigate further.</p><h3 id="qgis">QGIS</h3><p><a href="https://www.qgis.org/en/site/">QGIS</a> is a desktop application that's used to visualize spatial data, whether from PostGIS queries or other data sources. As I've had the pleasure of learning about GIS data and queries from <a href="https://twitter.com/RustProofLabs">Ryan Lambert</a> over the past few years, I've seen him use this tool for lots of valuable and interesting spatial queries. 
If you rely on PostGIS for application features and you store spatial data, take a look at how QGIS might be able to help your analysis workflow.</p><h3 id="superset">Superset</h3><p>There are a number of data visualization and dashboarding alternatives in the market, and PostgreSQL support is universally expected regardless of the tool. <a href="https://superset.apache.org/">Superset</a> is an open-source option that also has commercial support and hosting options available through Preset.io. With more than 40 chart types and a vibrant community, there's a lot to explore in the Superset ecosystem.</p><h3 id="streamlit">Streamlit</h3><p>For those developers that use Python for most of their data analysis and visualizations, <a href="https://streamlit.io/">Streamlit</a> is another popular choice that can easily fit into your existing workflow. Streamlit isn't a drag-and-drop UI for creating dashboards, but rather a programmatic interface for building and deploying data analysis applications using Python. And as of July 2022, you can deploy public data apps using Streamlit.io.</p><h2 id="what-about-you">What About You?</h2><p>There were so many interesting answers and suggestions provided by the community to these three questions. It's clear that there are a lot of people around the world working to help developers and database professionals be more productive across many common tasks.</p><p>Are there any surprises in this list or tools that you think didn't make the list? Hit us up on <a href="https://slack.timescale.com">Slack</a>, our <a href="https://www.timescale.com/forum/t/state-of-postgresql-2022-postgres-tools/735">Forum</a>, or Twitter (<a href="https://twitter.com/timescaledb">@timescaleDB</a>) to share other tools that are important to your daily PostgreSQL workflow!</p><h2 id="read-the-report">Read the Report</h2><p>Now that we’ve given you a taste of our survey results, are you curious to learn more about the PostgreSQL community? 
If you’d like more insights from the State of PostgreSQL 2022, including why respondents chose PostgreSQL, their opinion on industry events, and what information sources they would recommend to friends and colleagues, don’t miss our complete report. <a href="https://www.timescale.com/state-of-postgres/2022?utm_source=state-of-pg-2022&amp;utm_medium=blog&amp;utm_campaign=state-of-pg-2022&amp;utm_id=state-of-pg-2022&amp;utm_content=state-postgres-blog">Click here to read it and learn firsthand what the State of PostgreSQL is in 2022</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Postgres Materialized Views, The Timescale Way]]></title>
            <description><![CDATA[Continuous aggregates are one of the most popular features in TimescaleDB, but you can’t fully understand them without some context on PostgreSQL views and materialized views.]]></description>
            <link>https://www.tigerdata.com/blog/materialized-views-the-timescale-way</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/materialized-views-the-timescale-way</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[David Kohn]]></dc:creator>
            <pubDate>Thu, 14 Jul 2022 15:34:43 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_12-1.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_12-1.png" alt="Postgres Materialized Views, The Timescale Way" /><h3 id="how-postgresql-views-and-materialized-views-work-and-how-they-influenced-timescaledb-continuous-aggregates"><br>How PostgreSQL Views and Materialized Views Work and How They Influenced TimescaleDB Continuous Aggregates</h3><p>Soon after I graduated from college, I was working on starting a business and moved into a house near campus with five friends. The house was a bit run down, but the rent was cheap, and the people were great. </p><p>I lived there for a couple of years, and we developed a tradition: whenever someone would move out, we would ask them to share three pieces of wisdom with the group. Some were silly, some were serious, some were profound, none seemed remotely related to PostgreSQL or materialized views at the time, but one really has stuck with me. My friend Jason, who’d just finished his Ph.D. in Applied Physics, said the wisdom he’d learned was “Don’t squander your ignorance.” He explained that once you learn something, you end up taking it for granted and it becomes so much harder to overcome your tacit knowledge and ask simple, but important, questions.</p><p>A few months ago, I started a new role managing Engineering Education here at Timescale, and as I’ve started teaching more, Jason’s advice has never been more relevant. Teaching is all about reclaiming squandered ignorance; it’s about constantly remembering what it was like to first learn something, even when you’ve been working on it for years. The things that felt like revelations when we first learned them feel normal as we continue in the field. </p><p>So it’s been common for me that things I’ve worked most closely on can be the hardest to teach. 
Continuous aggregates are one of the most popular features of TimescaleDB and one of the ones I’m most proud of, partially because I helped design them. </p><p>We were recently working on a <a href="https://timescale.ghost.io/blog/how-we-made-data-aggregation-better-and-faster-on-postgresql-with-timescaledb-2-7/">revamp of continuous aggregates</a>, and as we were discussing the changes, I realized that whenever I’ve explained continuous aggregates, I’ve done it with all of this context about <a href="https://www.tigerdata.com/learn/guide-to-postgresql-views" rel="noreferrer">PostgreSQL views</a> and materialized views in my head.</p><p>I was lucky enough to learn this by trial and error; I had problems that forced me to learn about views and materialized views and figure out all the ways they worked and didn’t work, and I was able to bring that experience to their design when I joined Timescale, but not everyone has that luxury.</p><p><strong>This post is an attempt to distill a few lessons about what views and materialized </strong><a href="https://www.tigerdata.com/learn/guide-to-postgresql-views" rel="noreferrer"><strong>views in PostgreSQL</strong></a><strong> are, what they’re good at, where they fall short, and how we learned from them to make continuous aggregates incredible tools for time-series data analysis. If you need a summary of the main concepts around this topic, </strong><a href="https://www.timescale.com/learn/guide-to-postgresql-views" rel="noreferrer"><strong>check out our Guide on PostgreSQL views</strong></a><strong>.</strong><br><br>This post also comes out of our ongoing <a href="https://www.youtube.com/playlist?list=PLsceB9ac9MHRnmNZrCn_TWkUrCBCPR3mc">Foundations of PostgreSQL and TimescaleDB YouTube series</a>, so please, those of you who are encountering this for the first time, don’t squander your ignorance! Send me questions, no matter how basic they seem—I’d really appreciate it because it will help me become a better teacher. 
I’ve set up a <a href="https://www.timescale.com/forum/t/ask-your-questions-about-the-views-materialized-views-and-continuous-aggregates-blog-post-here/674">forum post where you can ask them</a>, or, if you’d prefer, you can ask on our <a href="https://join.slack.com/t/timescaledb/shared_invite/zt-1chdnn3cy-2j_6wpt~TzWUXN6ZygMkbg">Community Slack</a>!</p><h2 id="getting-started-with-views-and-materialized-views">Getting Started With Views and Materialized Views</h2><p>To get an understanding of <a href="https://www.postgresql.org/docs/current/tutorial-views.html" rel="noreferrer">PostgreSQL views</a>, <a href="https://www.postgresql.org/docs/current/rules-materializedviews.html" rel="noreferrer">materialized views</a>, and <a href="https://docs.timescale.com/use-timescale/latest/continuous-aggregates/about-continuous-aggregates/" rel="noreferrer">TimescaleDB continuous aggregates</a>, we’ll want some data to work with, both to demonstrate the concepts and to better understand where each of them is most useful. <br><br>I’ve used the data from our <a href="https://docs.timescale.com/getting-started/latest/#what-is-timescaledb">getting started tutorial</a> so that if you’d like to, you can follow along (you may need to change some of the dates in <code>WHERE</code> clauses, though). Our tutorial deals with financial data, but many of the insights are very broadly applicable. </p><p>Also, I won’t go through the whole thing here, but you should know that we have a <code>company</code> table and a <code>stocks_real_time</code> hypertable, defined like so:</p><pre><code class="language-SQL">CREATE TABLE company (
    symbol text NOT NULL,
    name text NOT NULL
);

CREATE TABLE stocks_real_time (
    time timestamp with time zone NOT NULL,
    symbol text NOT NULL,
    price double precision,
    day_volume integer
);
CREATE INDEX ON stocks_real_time (symbol, time);
SELECT create_hypertable('stocks_real_time', 'time');
</code></pre>
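<p>If you’d like to confirm that the conversion to a hypertable worked, a check along these lines should do it. It’s a sketch that assumes TimescaleDB 2.x, which exposes the <code>timescaledb_information.hypertables</code> view; the second statement only becomes useful after you’ve loaded data, when it tells you which dates to use in your <code>WHERE</code> clauses.</p>
<pre><code class="language-SQL">-- Confirm the table was converted (assumes TimescaleDB 2.x, which provides this view).
SELECT hypertable_name, num_chunks
FROM timescaledb_information.hypertables
WHERE hypertable_name = 'stocks_real_time';

-- After importing data: what time range does your copy of the dataset cover?
SELECT min(time) AS first_tick,
       max(time) AS last_tick
FROM stocks_real_time;
</code></pre>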
<p>Once you’ve set that up, you can <a href="https://docs.timescale.com/getting-started/latest/add-data/">import data</a>, and you should be able to follow along with the rest if you’d like.</p><h2 id="what-are-postgresql-views-why-should-i-use-them">What Are PostgreSQL Views? Why Should I Use Them?</h2><p>One thing we might want to do with this dataset is look up the name of a company. You’ll note that the <code>name</code> column only exists in the <code>company</code> table, which can be joined to the <code>stocks_real_time</code> table on the <code>symbol</code> column. A view lets us store that <code>JOIN</code> under a name, so we can query by either symbol or name, like so:</p><pre><code class="language-SQL">CREATE VIEW stocks_company AS 
SELECT s.symbol, s.price, s.time, s.day_volume, c.name 
FROM stocks_real_time s 
INNER JOIN company c ON s.symbol = c.symbol;
</code></pre>
<p>Once I’ve created a view, I can refer to it in another query:</p><pre><code class="language-SQL">SELECT symbol, price 
FROM stocks_company 
WHERE time &gt;= '2022-04-05' and time &lt;'2022-04-06';
</code></pre>
<p>But what is that actually doing under the hood? A view acts as an alias for a stored query, so PostgreSQL replaces the view <code>stocks_company</code> with the query it was defined with and runs the full resulting query. That means the query to the <code>stocks_company</code> view is the same as:</p><pre><code class="language-SQL">SELECT symbol, price 
FROM (
SELECT s.symbol, s.price, s.time, s.day_volume, c.name 
FROM stocks_real_time s 
INNER JOIN company c ON s.symbol = c.symbol) sc 
WHERE time &gt;= '2022-04-05' and time &lt;'2022-04-06';
</code></pre>
<p>We’ve manually replaced the view with the same query that we defined it with. </p><p>How can we tell that they are the same? <a href="https://www.postgresql.org/docs/current/using-explain.html" rel="noreferrer">The <code>EXPLAIN</code> command</a> tells us how PostgreSQL executes a query, and we can use it to see whether the query on the view and the query with the definition written out as a subselect produce the same plan. </p><p>Note, I know that <code>EXPLAIN</code> plans can initially seem a little intimidating. I’ve tried to make it so you don’t need to know a whole lot about <code>EXPLAIN</code> plans or the like to understand this post, <a href="#what-postgresql-materialized-views-are-and-when-to-use-them" rel="noreferrer">so if you don’t want to read them, feel free to skip over them</a>.</p><p>If we run both:</p><pre><code class="language-SQL">EXPLAIN (ANALYZE ON, BUFFERS ON) 
SELECT symbol, price 
FROM stocks_company 
WHERE time &gt;= '2022-04-05' and time &lt;'2022-04-06';
--AND
EXPLAIN (ANALYZE ON, BUFFERS ON) 
SELECT symbol, price 
FROM (
SELECT s.symbol, s.price, s.time, s.day_volume, c.name 
FROM stocks_real_time s 
INNER JOIN company c ON s.symbol = c.symbol) sc 
WHERE time &gt;= '2022-04-05' and time &lt;'2022-04-06';
</code></pre>
<p>We can see that they both produce the same query plan (though the timings might be slightly different, they’ll even out with repeated runs).</p><pre><code>Hash Join  (cost=3.68..16328.94 rows=219252 width=12) (actual time=0.110..274.764 rows=437761 loops=1)
   Hash Cond: (s.symbol = c.symbol)
   Buffers: shared hit=3667
   -&gt;  Index Scan using _hyper_5_2655_chunk_stocks_real_time_time_idx on _hyper_5_2655_chunk s  (cost=0.43..12488.79 rows=438503 width=12) (actual time=0.057..125.607 rows=437761 loops=1)
         Index Cond: (("time" &gt;= '2022-04-05 00:00:00+00'::timestamp with time zone) AND ("time" &lt; '2022-04-06 00:00:00+00'::timestamp with time zone))
         Buffers: shared hit=3666
   -&gt;  Hash  (cost=2.00..2.00 rows=100 width=4) (actual time=0.034..0.035 rows=100 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 12kB
         Buffers: shared hit=1
         -&gt;  Seq Scan on company c  (cost=0.00..2.00 rows=100 width=4) (actual time=0.006..0.014 rows=100 loops=1)
               Buffers: shared hit=1
 Planning:
   Buffers: shared hit=682
 Planning Time: 1.807 ms
 Execution Time: 290.851 ms
(15 rows)
</code></pre>
<p>The plan joins the <code>company</code> table to the relevant chunk of the <code>stocks_real_time</code> hypertable and uses an index scan to fetch the right rows. But you don’t really need to understand exactly what’s going on here to understand that they’re doing the same thing.</p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">✨</div><div class="kg-callout-text"><b><strong style="white-space: pre-wrap;">Editor's Note: </strong></b>If you’d like to learn more about EXPLAIN, I recommend taking a look at the <a href="https://www.youtube.com/watch?v=UUcVS0290nY&amp;list=PLsceB9ac9MHRnmNZrCn_TWkUrCBCPR3mc&amp;index=8">Explaining Explain session</a> that my colleague Feike Steenbergen gave a couple of weeks ago. It was awesome!</div></div><h3 id="views-hide-complexity">Views hide complexity</h3><p>The <code>JOIN</code> in our view is very simple, which means the aliased query is relatively simple, but you can imagine that as the views get more complex, it can be very helpful to have a much simpler way for a user to query the database, where they don’t have to write all the <code>JOINs</code> themselves. (You can also use special views like <a href="https://www.2ndquadrant.com/en/blog/how-do-postgresql-security_barrier-views-work/">security barrier views to grant access to data securely</a>, but that’s more than we can cover here!)</p><p>Unfortunately, hiding the complexity can also be a problem. For instance, you may or may not have noticed in our example that we <em>don’t actually need the <code>JOIN</code>!</em> The <code>JOIN</code> gets us the <code>name</code> column from the <code>company</code> table, but we’re only selecting the <code>symbol</code> and <code>price</code> columns, which come from the <code>stocks_real_time</code> table! 
If we run the query directly on the table, it can go about twice as fast by avoiding the <code>JOIN</code>:</p><pre><code>Index Scan using _hyper_5_2655_chunk_stocks_real_time_time_idx on _hyper_5_2655_chunk  (cost=0.43..12488.79 rows=438503 width=12) (actual time=0.021..72.770 rows=437761 loops=1)
  Index Cond: (("time" &gt;= '2022-04-05 00:00:00+00'::timestamp with time zone) AND ("time" &lt; '2022-04-06 00:00:00+00'::timestamp with time zone))
  Buffers: shared hit=3666
Planning:
  Buffers: shared hit=10
Planning Time: 0.243 ms
Execution Time: 140.775 ms
</code></pre>
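<p>The statement behind that plan isn’t shown above; presumably it was something close to the following sketch, which applies the same filter straight to the hypertable instead of going through the view.</p>
<pre><code class="language-SQL">-- Same columns and time filter as before, but no view and therefore no JOIN.
EXPLAIN (ANALYZE ON, BUFFERS ON)
SELECT symbol, price
FROM stocks_real_time
WHERE time &gt;= '2022-04-05' AND time &lt; '2022-04-06';
</code></pre>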
<p>If I’d written out the query, I might have seen that I didn’t need the <code>JOIN</code> (or never written it in the first place), whereas the view hides that complexity. So views can make things easier, but that convenience can lead to performance pitfalls if we’re not careful. </p><p>If we actually <code>SELECT</code> the <code>name</code> column, then we’re using the view for what it was meant for, like so:</p><pre><code class="language-SQL">SELECT name, price, symbol 
FROM stocks_company 
WHERE time &gt;= '2022-04-05' AND time &lt;'2022-04-06';
</code></pre>
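<p>If you inherit a view and aren’t sure what it’s hiding, you can ask PostgreSQL for the query it aliases. This is a small sketch using the built-in <code>pg_get_viewdef</code> function on the view from this post; the second argument just asks for pretty-printed output.</p>
<pre><code class="language-SQL">-- Show the stored query behind the view, including any JOINs it performs.
SELECT pg_get_viewdef('stocks_company'::regclass, true);
</code></pre>
<p>In <code>psql</code>, <code>\d+ stocks_company</code> prints the same definition along with the column list.</p>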
<p>So to sum up this section on views: </p><ul><li>Views are a way to store an alias for a query in the database.</li><li>PostgreSQL will replace the view name with the query you use in the view definition. </li></ul><p>Views can be good for reducing complexity for the user, so they don’t have to write out complex <code>JOINs</code>, but they can also lead to performance problems if they are overused, because hiding the complexity makes it harder to spot potential pitfalls. </p><p>One thing you’ll notice is that <strong>views can improve the user interface, but they won’t really ever improve performance</strong>, because they don’t actually run the query; they just alias it. If you want something that runs the query, you’ll need a materialized view.</p><h2 id="what-postgresql-materialized-views-are-and-when-to-use-them">What PostgreSQL Materialized Views Are and When to Use Them</h2><p>When I <a href="https://www.postgresql.org/docs/current/sql-creatematerializedview.html">create a materialized view</a>, PostgreSQL actually runs the query and stores the results. In essence, this means the materialized view acts as a <a href="https://en.wikipedia.org/wiki/Cache_(computing)">cache</a> for the query. Caching is a common way to improve performance in all sorts of computing systems. The question we might ask is: will it be helpful here? So let’s try it out and see how it goes. <br><br><a href="https://www.postgresqltutorial.com/postgresql-views/postgresql-materialized-views/" rel="noreferrer">Creating a materialized view</a> is quite easy: I can just add the <code>MATERIALIZED</code> keyword to my <code>CREATE VIEW</code> command:</p><pre><code class="language-SQL">CREATE MATERIALIZED VIEW stocks_company_mat AS 
SELECT s.symbol, s.price, s.time, s.day_volume, c.name 
FROM stocks_real_time s INNER JOIN company c ON s.symbol = c.symbol;

CREATE INDEX on stocks_company_mat (symbol, time DESC);
CREATE INDEX on stocks_company_mat (time DESC);
</code></pre>
<p>You’ll also notice that I created some indexes on the materialized view (the same ones I have on <code>stocks_real_time</code>)! That’s one of the cool things about materialized views: you can create indexes on them because under the hood they’re just tables that store the results of a query (we’ll explain that more later). </p><p>Now I can run <a href="https://www.cybertec-postgresql.com/en/how-to-interpret-postgresql-explain-analyze-output/" rel="noreferrer"><code>EXPLAIN ANALYZE</code></a> on a slightly different query, one that gets four days of data for ‘AAPL’, against both the materialized view and the regular view to understand how much this caching helps our query:</p><pre><code class="language-SQL">EXPLAIN (ANALYZE ON, BUFFERS ON) SELECT name, price FROM stocks_company_mat WHERE time &gt;= '2022-04-05' AND time &lt;'2022-04-09' AND symbol = 'AAPL';
Bitmap Heap Scan on stocks_company_mat  (cost=1494.93..56510.51 rows=92196 width=17) (actual time=11.796..46.336 rows=95497 loops=1)
  Recheck Cond: ((symbol = 'AAPL'::text) AND ("time" &gt;= '2022-04-05 00:00:00+00'::timestamp with time zone) AND ("time" &lt; '2022-04-09 00:00:00+00'::timestamp with time zone))
  Heap Blocks: exact=14632
  Buffers: shared hit=14969
  -&gt;  Bitmap Index Scan on stocks_company_mat_symbol_time_idx  (cost=0.00..1471.88 rows=92196 width=0) (actual time=9.456..9.456 rows=95497 loops=1)
        Index Cond: ((symbol = 'AAPL'::text) AND ("time" &gt;= '2022-04-05 00:00:00+00'::timestamp with time zone) AND ("time" &lt; '2022-04-09 00:00:00+00'::timestamp with time zone))
        Buffers: shared hit=337
Planning:
  Buffers: shared hit=5
Planning Time: 0.102 ms
Execution Time: 49.995 ms

EXPLAIN (ANALYZE ON, BUFFERS ON) SELECT name, price FROM stocks_company WHERE time &gt;= '2022-04-05' AND time &lt;'2022-04-09' AND symbol = 'AAPL';
Nested Loop  (cost=919.95..30791.92 rows=96944 width=19) (actual time=6.023..75.367 rows=95497 loops=1)
  Buffers: shared hit=13215
  -&gt;  Seq Scan on company c  (cost=0.00..2.25 rows=1 width=15) (actual time=0.006..0.018 rows=1 loops=1)
        Filter: (symbol = 'AAPL'::text)
        Rows Removed by Filter: 99
        Buffers: shared hit=1
  -&gt;  Append  (cost=919.95..29820.23 rows=96944 width=12) (actual time=6.013..67.491 rows=95497 loops=1)
        Buffers: shared hit=13214
        -&gt;  Bitmap Heap Scan on _hyper_5_2655_chunk s_1  (cost=919.95..11488.49 rows=49688 width=12) (actual time=6.013..22.334 rows=49224 loops=1)
              Recheck Cond: ((symbol = 'AAPL'::text) AND ("time" &gt;= '2022-04-05 00:00:00+00'::timestamp with time zone) AND ("time" &lt; '2022-04-09 00:00:00+00'::timestamp with time zone))
              Heap Blocks: exact=6583
              Buffers: shared hit=6895
(... elided for space)
Planning:
  Buffers: shared hit=30
Planning Time: 0.465 ms
Execution Time: 78.932 ms
</code></pre>
<p>Taking a look at these plans, we can see that the materialized view helps less than one might think! It sped the query up a little (about 50 ms versus 79 ms), but really, both are doing almost the same amount of work! How can I tell? Well, they touch approximately the same number of 8KB buffers (14,969 versus 13,215; see <a href="https://www.youtube.com/watch?v=JOrXRsES3mk&amp;list=PLsceB9ac9MHRnmNZrCn_TWkUrCBCPR3mc&amp;index=1">Lesson 0 of the Foundations series</a> to learn more about those), and they return the same number of rows (95,497).</p><h3 id="when-materialized-view-performance-doesnt-materialize">When materialized view performance doesn't materialize</h3><p>Why is this? Well, our <code>JOIN</code> didn’t reduce the number of rows in the query, so the materialized view <code>stocks_company_mat</code> actually has the same number of rows in it as the <code>stocks_real_time</code> hypertable!</p><pre><code class="language-SQL">SELECT 
(SELECT count(*) FROM stocks_company_mat) as rows_mat, 
(SELECT count(*) FROM stocks_real_time) as rows_tab;

 rows_mat | rows_tab 
----------+----------
  7375355 |  7375355

</code></pre>
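<p>Row counts only tell part of the story, so it can also be worth checking the on-disk footprint. The sketch below assumes TimescaleDB 2.x for <code>hypertable_size</code>, which adds up all the chunks of a hypertable; for the materialized view, the standard <code>pg_total_relation_size</code> covers the table plus its indexes. No output is shown because the numbers depend on your data.</p>
<pre><code class="language-SQL">-- Compare total size (data + indexes) of the materialized view and the hypertable.
SELECT
    pg_size_pretty(pg_total_relation_size('stocks_company_mat')) AS mat_view_total,
    pg_size_pretty(hypertable_size('stocks_real_time'))          AS hypertable_total;
</code></pre>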
<p>So, not a huge benefit, and <em>we have to store the same number of rows over again</em>. So we’re getting little benefit for a pretty large cost in terms of how much storage we have to use. Now this could have been a large benefit if we were running a very expensive function or doing a very complex <code>JOIN</code> in our materialized view definition, but we’re not, so this doesn’t save us much. </p><p>The thing about our example is that it only gets worse from here. One of the things we might want to do with our view or materialized view is use a <a href="https://www.geeksforgeeks.org/postgresql-where-clause/" rel="noreferrer"><code>WHERE</code></a> clause to filter not just on <code>symbol</code> but on the company <code>name</code>. (Maybe I don’t remember the stock symbol for a company, but I do remember its name.) Remember that the <code>name</code> column is the one we pulled in from the <code>company</code> table through the <code>JOIN</code>, so let’s run that query on both the view and materialized view and see what happens:</p><pre><code class="language-SQL">
EXPLAIN (ANALYZE ON, BUFFERS ON) SELECT name, price from stocks_company_mat WHERE time &gt;= '2022-04-05' and time &lt;'2022-04-09' AND name = 'Apple' ;
Index Scan using stocks_company_mat_time_idx on stocks_company_mat  (cost=0.43..57619.99 rows=92196 width=17) (actual time=0.022..605.268 rows=95497 loops=1)
  Index Cond: (("time" &gt;= '2022-04-05 00:00:00+00'::timestamp with time zone) AND ("time" &lt; '2022-04-09 00:00:00+00'::timestamp with time zone))
  Filter: (name = 'Apple'::text)
  Rows Removed by Filter: 1655717
  Buffers: shared hit=112577
Planning:
  Buffers: shared hit=3
Planning Time: 0.116 ms
Execution Time: 609.040 ms

EXPLAIN (ANALYZE ON, BUFFERS ON) SELECT name, price from stocks_company WHERE time &gt;= '2022-04-05' and time &lt;'2022-04-09' AND name = 'Apple' ;
Nested Loop  (cost=325.22..21879.02 rows=8736 width=19) (actual time=5.642..56.062 rows=95497 loops=1)
  Buffers: shared hit=13215
  -&gt;  Seq Scan on company c  (cost=0.00..2.25 rows=1 width=15) (actual time=0.007..0.018 rows=1 loops=1)
        Filter: (name = 'Apple'::text)
        Rows Removed by Filter: 99
        Buffers: shared hit=1
  -&gt;  Append  (cost=325.22..21540.78 rows=33599 width=12) (actual time=5.633..48.232 rows=95497 loops=1)
        Buffers: shared hit=13214
        -&gt;  Bitmap Heap Scan on _hyper_5_2655_chunk s_1  (cost=325.22..9866.59 rows=17537 width=12) (actual time=5.631..21.713 rows=49224 loops=1)
              Recheck Cond: ((symbol = c.symbol) AND ("time" &gt;= '2022-04-05 00:00:00+00'::timestamp with time zone) AND ("time" &lt; '2022-04-09 00:00:00+00'::timestamp with time zone))
              Heap Blocks: exact=6583
              Buffers: shared hit=6895
…
Planning:
  Buffers: shared hit=30
Planning Time: 0.454 ms
Execution Time: 59.558 ms
</code></pre>
<p>This time, my query on the regular view is much better! It hits far fewer buffers and returns 10x faster! This is because we created an index on <code>(symbol, time DESC)</code> for the materialized view, but not on <code>(name, time DESC)</code>, so it has to fall back to scanning the whole time range through the <code>time</code> index and filtering out the rows that don’t match (1,655,717 of them in the plan above). </p><p>The normal view, however, can use the more selective <code>(symbol, time DESC)</code> index on the <code>stocks_real_time</code> hypertable because it’s performing the <code>JOIN</code> to the <code>company</code> table, and it joins on the <code>symbol</code> column. We “enhanced” the materialized view by performing the <code>JOIN</code> and caching the results, but then we’d need to create an index on the joined column too. <br><br>So we’re learning that this query isn’t a great candidate for a materialized view, because it’s not a crazy complex time-consuming <code>JOIN</code> and doesn’t reduce the number of rows. But if we had a query that we wanted to run that would reduce the number of rows, then that would be a great candidate for a materialized view.</p><h3 id="when-materialized-views-perform-well">When materialized views perform well</h3><p>As it turns out, there’s a very common set of queries on stock data like this that does reduce the number of rows. They’re called <strong>O</strong>pen-<strong>H</strong>igh-<strong>L</strong>ow-<strong>C</strong>lose queries (OHLC), and they look something like this:</p><pre><code class="language-SQL">CREATE VIEW ohlc_view AS 
SELECT time_bucket('15 min', time) bucket, symbol, first(price, time), max(price), min(price), last(price, time) 
FROM stocks_real_time 
WHERE time &gt;= '2022-04-05' and time &lt;'2022-04-06' 
GROUP BY time_bucket('15 min', time), symbol;

CREATE MATERIALIZED VIEW ohlc_mat AS 
SELECT time_bucket('15 min', time) bucket, symbol, first(price, time), max(price), min(price), last(price, time) 
FROM stocks_real_time 
GROUP BY time_bucket('15 min', time), symbol ;

CREATE INDEX on ohlc_mat(symbol, bucket);
CREATE INDEX ON ohlc_mat(bucket);
</code></pre>
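<p>To confirm how much the aggregation shrinks the data, the same row-count comparison used earlier can be pointed at <code>ohlc_mat</code>. This is only a sketch; no output is shown because the numbers depend on the dataset.</p><pre><code class="language-SQL">-- Sketch: rows stored in the aggregated materialized view vs. the raw hypertable.
SELECT 
(SELECT count(*) FROM ohlc_mat) as rows_mat, 
(SELECT count(*) FROM stocks_real_time) as rows_tab;
</code></pre>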
<p>Here I’m aggregating a lot of rows together, so I end up storing far fewer in my materialized view. (The view doesn’t store any rows; it’s just an alias for the query.) I still created a few indexes to help speed up lookups, but they’re much smaller as well because there are far fewer rows in the output of this query. So now, if I run the same query against the normal view and the materialized view, the materialized view gives me a huge speedup!<br><br>Normal view:</p><pre><code class="language-SQL">EXPLAIN (ANALYZE ON, BUFFERS ON) 
SELECT  bucket, symbol, first, max, min, last 
FROM ohlc_view
WHERE bucket &gt;= '2022-04-05' AND bucket &lt;'2022-04-06';
Finalize GroupAggregate  (cost=39098.81..40698.81 rows=40000 width=44) (actual time=875.233..1000.171 rows=3112 loops=1)
  Group Key: (time_bucket('00:15:00'::interval, _hyper_5_2655_chunk."time")), _hyper_5_2655_chunk.symbol
  Buffers: shared hit=4133, temp read=2343 written=6433
  -&gt;  Sort  (cost=39098.81..39198.81 rows=40000 width=92) (actual time=875.212..906.810 rows=5151 loops=1)
        Sort Key: (time_bucket('00:15:00'::interval, _hyper_5_2655_chunk."time")), _hyper_5_2655_chunk.symbol
        Sort Method: quicksort  Memory: 1561kB
        Buffers: shared hit=4133, temp read=2343 written=6433
        -&gt;  Gather  (cost=27814.70..36041.26 rows=40000 width=92) (actual time=491.920..902.094 rows=5151 loops=1)
              Workers Planned: 1
              Workers Launched: 1
              Buffers: shared hit=4133, temp read=2343 written=6433
              -&gt;  Partial HashAggregate  (cost=26814.70..31041.26 rows=40000 width=92) (actual time=526.663..730.168 rows=2576 loops=2)
                    Group Key: time_bucket('00:15:00'::interval, _hyper_5_2655_chunk."time"), _hyper_5_2655_chunk.symbol
                    Planned Partitions: 128  Batches: 129  Memory Usage: 1577kB  Disk Usage: 19592kB
                    Buffers: shared hit=4133, temp read=2343 written=6433
                    Worker 0:  Batches: 129  Memory Usage: 1577kB  Disk Usage: 14088kB
                    -&gt;  Result  (cost=0.43..13907.47 rows=257943 width=28) (actual time=0.026..277.314 rows=218880 loops=2)
                          Buffers: shared hit=4060
                          -&gt;  Parallel Index Scan using _hyper_5_2655_chunk_stocks_real_time_time_idx on _hyper_5_2655_chunk  (cost=0.43..10683.19 rows=257943 width=20) (actual time=0.025..176.330 rows=218880 loops=2)
                                Index Cond: (("time" &gt;= '2022-04-05 00:00:00+00'::timestamp with time zone) AND ("time" &lt; '2022-04-06 00:00:00+00'::timestamp with time zone))
                                Buffers: shared hit=4060
Planning:
  Buffers: shared hit=10
Planning Time: 0.615 ms
Execution Time: 1003.425 ms
</code></pre>
<p>Materialized view:</p><pre><code class="language-SQL">EXPLAIN (ANALYZE ON, BUFFERS ON) 
SELECT  bucket, symbol, first, max, min, last 
FROM ohlc_mat 
WHERE bucket &gt;= '2022-04-05' AND bucket &lt;'2022-04-06';
Index Scan using ohlc_mat_bucket_idx on ohlc_mat  (cost=0.29..96.21 rows=3126 width=43) (actual time=0.009..0.396 rows=3112 loops=1)
  Index Cond: ((bucket &gt;= '2022-04-05 00:00:00+00'::timestamp with time zone) AND (bucket &lt; '2022-04-06 00:00:00+00'::timestamp with time zone))
  Buffers: shared hit=35
Planning:
  Buffers: shared hit=6
Planning Time: 0.148 ms
Execution Time: 0.545 ms
</code></pre>
<p>Well, that helped! We hit far fewer buffers (35 instead of 4,133) and scanned far fewer rows with the materialized case, and we didn’t need to perform the <a href="https://www.tutorialspoint.com/postgresql/postgresql_group_by.htm" rel="noreferrer"><code>GROUP BY</code></a> and aggregate, which removes a sort, etc. All of that means that we sped up our query dramatically, from about a second to about half a millisecond! But it’s not all rainbows and butterflies for materialized views. We haven’t yet covered one of their big problems: they get out of date!<br><br>So, if you think about a table like our stocks table, it’s a typical time-series use case, which means it looks something like this:</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_1.png" class="kg-image" alt="" loading="lazy" width="1526" height="1096" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_1.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_1.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_1.png 1526w" sizes="(min-width: 720px) 720px"></figure><p><br>We have a materialized view that we created, and it has been populated at a certain time with our query.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_2.png" class="kg-image" alt="" loading="lazy" width="1526" height="1214" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_2.png 600w, 
https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_2.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_2.png 1526w" sizes="(min-width: 720px) 720px"></figure><p><br>But then time passes. Say it’s 15 minutes later and we’ve inserted more data: the view is now out of date! We’re <code>time_bucket</code>ing by 15-minute increments, so there’s a whole set of buckets that we don’t have! </p><p>Essentially, materialized views are only as accurate as the last time they ran the query they are caching. You need to run <a href="https://www.postgresql.org/docs/current/sql-refreshmaterializedview.html" rel="noreferrer"><code>REFRESH MATERIALIZED VIEW</code></a> to ensure they are up to date. </p><p>Once we run <code>REFRESH MATERIALIZED VIEW</code>, we’ll end up with the new data in our materialized view, like so:</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_3.png" class="kg-image" alt="" loading="lazy" width="1526" height="1214" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_3.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_3.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_3.png 1526w" sizes="(min-width: 720px) 720px"></figure><p>The thing is, <code>REFRESH</code>ing a materialized view can be expensive. To see why, it helps to understand a bit more about how materialized views work and why they get out of date. 
</p><h2 id="how-materialized-views-work-and-why-they-get-out-of-date">How Materialized Views Work (and Why They Get Out of Date)</h2><p>To understand how <a href="https://www.postgresqltutorial.com/postgresql-views/postgresql-materialized-views/" rel="noreferrer">materialized views</a> get out of date and what a refresh is doing, it helps to understand a little about how they work under the hood. Essentially, when you create a materialized view, you are creating a table and populating it with the data from the query. For the <code>ohlc_mat</code> view we’ve been working with, it’s roughly equivalent to:</p><pre><code class="language-SQL">CREATE TABLE ohlc_tab AS 
SELECT time_bucket('15 min', time) bucket, symbol, first(price, time), max(price), min(price), last(price, time) 
FROM stocks_real_time 
GROUP BY time_bucket('15 min', time), symbol;
</code></pre>
<p>So, our materialized view <code>ohlc_mat</code> is storing the results of the query as it ran when we created it.</p><p>Now, what happens when I insert data into the underlying table?</p><pre><code class="language-SQL">INSERT INTO stocks_real_time VALUES (now(), 'AAPL', 170.91, NULL);
</code></pre>
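<p>One way to see the effect of an insert like the one above is to compare the newest raw timestamp with the newest materialized bucket. This is a minimal sketch, assuming <code>ohlc_mat</code> has not been refreshed since the insert:</p><pre><code class="language-SQL">-- Sketch: how far behind the raw data is the materialized view?
SELECT
  (SELECT max(time) FROM stocks_real_time) AS newest_raw_row,
  (SELECT max(bucket) FROM ohlc_mat) AS newest_materialized_bucket;
</code></pre><p>Because <code>bucket</code> holds the start of each 15-minute bucket, a gap of 15 minutes or more between the two values means at least one whole bucket is missing from the materialized view. A smaller gap can still mean the newest bucket’s aggregates are stale.</p>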
<figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_4.png" class="kg-image" alt="" loading="lazy" width="1526" height="1214" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_4.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_4.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_4.png 1526w" sizes="(min-width: 720px) 720px"></figure><p>The regular view (<code>ohlc_view</code>) will stay up to date because it’s just running the queries directly on the raw data in <code>stocks_real_time</code>. And if we’re only inserting data close to <code>now()</code>, and only querying much older data, then the materialized view will seem like it’s okay. We’ll see no change for our query from a month or two ago, but if we try to query a more recent time, we won’t have any data. If we want it up to date with more recent data, we’ll need to run:</p><pre><code class="language-SQL">REFRESH MATERIALIZED VIEW ohlc_mat;
</code></pre>
<p>When we run the refresh, what is effectively happening under the hood is that we truncate the table (remove all of its data), and then run the query again and insert the results into the table.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_5.png" class="kg-image" alt="" loading="lazy" width="1526" height="1214" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_5.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_5.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_5.png 1526w" sizes="(min-width: 720px) 720px"></figure><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_6.png" class="kg-image" alt="" loading="lazy" width="1526" height="1214" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_6.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_6.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_6.png 1526w" sizes="(min-width: 720px) 720px"></figure><p>If we were using the <code>ohlc_tab</code> table from above, the equivalent operations would be something like:</p><pre><code class="language-SQL">TRUNCATE TABLE ohlc_tab;

INSERT INTO ohlc_tab 
SELECT time_bucket('15 min', time) bucket, symbol, first(price, time), max(price), min(price), last(price, time) 
FROM stocks_real_time 
GROUP BY time_bucket('15 min', time), symbol;
</code></pre>
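<p>The <code>TRUNCATE</code>-and-<code>INSERT</code> pair mirrors the plain refresh. For reference, PostgreSQL also offers a <code>CONCURRENTLY</code> variant, sketched here. It requires a unique index on the materialized view; <code>(symbol, bucket)</code> is unique for <code>ohlc_mat</code> because the query groups by exactly those two columns. The index name is a hypothetical one chosen for this example.</p><pre><code class="language-SQL">-- Sketch: a unique index is a prerequisite for REFRESH ... CONCURRENTLY.
-- The index name is hypothetical.
CREATE UNIQUE INDEX ohlc_mat_symbol_bucket_uidx ON ohlc_mat (symbol, bucket);

-- Readers of ohlc_mat are not blocked while this runs,
-- but the query is still recomputed over the whole data set.
REFRESH MATERIALIZED VIEW CONCURRENTLY ohlc_mat;
</code></pre>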
<p>(The refresh works slightly differently when you run <code>REFRESH MATERIALIZED VIEW</code> with the <code>CONCURRENTLY</code> option, but fundamentally, it always runs the query over the entire data set, and getting into the details is beyond the scope of this post.)</p><p>The database, much like with a view, stores the query we ran, so when we run our <code>REFRESH</code> it just knows what to do. That’s great, but it’s not the most efficient: even though most of the data didn’t change, we still threw out the whole data set and reran the whole query. </p><p>While that might be okay when you’re working with, say, typical <a href="https://www.tigerdata.com/learn/understanding-oltp" rel="noreferrer">OLTP</a> data in PostgreSQL, where your updates and deletes are randomly spread around your data set, it starts to seem pretty inefficient when you’re working with time-series data, where the writes are mostly in the most recent period. </p><p>So, to sum up, we found a case where the materialized view really helps us because the output from the query is so much smaller than the number of rows we have to scan to calculate it. In our case, it was an aggregate. But we also noticed that, when we used a materialized view, the data gets out of date because we’re storing the output of the query, rather than rerunning it at query time as you do with a view.</p><p>In order to get the materialized view to be up to date, we learned that we need to <code>REFRESH</code> it, but for time-series use cases, a) you have to refresh it frequently (in our case, at least every 15 minutes or so) for it to be up to date, and b) the refresh is inefficient because we have to delete and re-materialize all the data, maybe going back months, to get the new information from just the previous 15 minutes. 
<strong>And that’s one of the main reasons we developed continuous aggregates at Timescale.</strong></p><h2 id="what-are-continuous-aggregates-why-should-i-use-them">What Are Continuous Aggregates? Why Should I Use Them?</h2><p><a href="https://docs.timescale.com/use-timescale/latest/continuous-aggregates/about-continuous-aggregates/#about-continuous-aggregates"><u>Continuous aggregates</u></a> function as incremental, automatically updated materialized views. They scan and update the results of your query in the background, ensuring that new data is incorporated and old data is adjusted based on your refresh policy. <a href="https://timescale.ghost.io/blog/continuous-aggregates-faster-queries-with-automatically-maintained-materialized-views/"><u>Check out an example here.</u></a> Keep in mind that these background refreshes do not prevent you from querying the data as you normally would.</p><p>During a refresh, only the data that has changed since the last refresh is updated. This eliminates the need to rerun the entire query to retrieve data that you already have. This incremental refresh is separate from real-time aggregation, which is described next and builds on top of it.</p><h3 id="real-time-aggregates">Real-time aggregates</h3><p><a href="https://docs.timescale.com/use-timescale/latest/continuous-aggregates/real-time-aggregates/"><u>Real-time aggregates</u></a> address the problem of maintaining the most current data using continuous aggregates. Here's an example: if you have a 15-minute refresh policy and the last refresh occurred five minutes ago, there's already a gap in your data. Real-time aggregation tackles this by filling this gap with the latest raw data from your table. It takes the most recent, not-yet-materialized raw data from the source table or view, aggregates it at query time, and combines it with the already materialized results. 
This gives you an up-to-date view of your data for <a href="https://www.timescale.com/learn/real-time-analytics-in-postgres"><u>real-time analytics</u></a>. <br></p><h2 id="how-continuous-aggregates-work-and-how-they-were-inspired-by-the-best-of-views-and-materialized-views"><strong>How Continuous Aggregates Work and How They Were Inspired by the Best of Views and Materialized Views</strong></h2><p>We’ll build up to how exactly continuous aggregates work in this section, piece by piece.</p><p>Fundamentally, when we create a continuous aggregate, we’re doing something very similar to what happens when we create a materialized view. That’s why we use a slightly modified version of the interface for creating a materialized view:</p><pre><code class="language-SQL">CREATE MATERIALIZED VIEW ohlc_cont 
WITH (timescaledb.continuous) AS 
SELECT time_bucket('15 min', time) bucket, symbol, first(price, time), max(price), min(price), last(price, time) 
FROM stocks_real_time 
GROUP BY time_bucket('15 min', time), symbol;
</code></pre>
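<p>Querying the continuous aggregate then looks just like querying the materialized view. Here is a sketch using the same one-day window as the earlier queries; it assumes the unnamed aggregate columns get the default names <code>first</code>, <code>max</code>, <code>min</code>, and <code>last</code>, as they did for <code>ohlc_mat</code>:</p><pre><code class="language-SQL">-- Sketch: read one day of 15-minute OHLC buckets from the continuous aggregate.
SELECT bucket, symbol, first, max, min, last
FROM ohlc_cont
WHERE bucket &gt;= '2022-04-05' AND bucket &lt; '2022-04-06'
ORDER BY bucket, symbol;
</code></pre>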
<p>Once we’ve created the continuous aggregate, we end up in a very similar situation to what we have with a materialized view. We have the data that was around when we created the view, but as new data gets inserted, the view will get out of date.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_6_5-1.png" class="kg-image" alt="" loading="lazy" width="1526" height="1214" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_6_5-1.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_6_5-1.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_6_5-1.png 1526w" sizes="(min-width: 720px) 720px"></figure><p>In order to keep a continuous aggregate up to date, we need a scheduled aggregation.</p><h3 id="scheduled-aggregation-of-new-data">Scheduled aggregation of new data</h3><p>We saw two main problems with materialized views that we wanted to address with scheduled aggregations:</p><ol><li>We have to manually refresh a materialized view when we want it to remain up to date.</li><li>We don’t want to rerun the query on all the old data unnecessarily; we should only run it on the new data. </li></ol><p>To schedule aggregations, we need to create a continuous aggregate policy:</p><pre><code class="language-SQL">SELECT add_continuous_aggregate_policy('ohlc_cont'::regclass, start_offset=&gt;NULL, end_offset=&gt;'15 mins'::interval,  schedule_interval=&gt;'5 mins'::interval);
</code></pre>
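<p>To check that the policy was registered and see when it will next run, TimescaleDB’s informational views can be queried. This is a sketch: the view names, column names, and procedure name are assumed from TimescaleDB 2.x and should be verified against your installed version.</p><pre><code class="language-SQL">-- Sketch: list continuous aggregate refresh jobs with their schedule and last run.
-- View, column, and procedure names are assumptions based on TimescaleDB 2.x.
SELECT j.job_id,
       j.schedule_interval,
       j.config,
       s.last_run_status,
       s.next_start
FROM timescaledb_information.jobs j
LEFT JOIN timescaledb_information.job_stats s ON s.job_id = j.job_id
WHERE j.proc_name = 'policy_refresh_continuous_aggregate';
</code></pre>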
<p>Once we’ve scheduled a continuous aggregate policy, it will run automatically according to the <code>schedule_interval</code> we’ve specified. In our case, it runs every five minutes. When it runs, it looks at the data we’ve already materialized and the new inserts and looks to see if we’ve finished at least one 15-minute bucket. If we have, it will run the query on just that next 15-minute portion and materialize the results in our continuous aggregate.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_7.png" class="kg-image" alt="" loading="lazy" width="1526" height="1214" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_7.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_7.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_7.png 1526w" sizes="(min-width: 720px) 720px"></figure><p>This means that the continuous aggregate now automatically has data from the next 15-minute period without user intervention. </p><p>And it was much more efficient. Unlike running <code>REFRESH MATERIALIZED VIEW</code>, we didn’t drop all the old data and recompute the aggregate against it, we just ran the aggregate query against the next 15-minute period and added that to our materialization. And as we move forward in time, this can keep occurring as each successive 15-minute period (or whatever period we chose for the <code>time_bucket</code> in the continuous aggregate definition) gets filled in with new data and then materialized. 
</p><p>One thing to note about this is that we keep track of where we’ve materialized up to by storing what we call a watermark, represented here by the dotted line. (<strong>NB</strong>: It’s named after the high watermark caused by a flood, not the watermark on a bank check.) So before the scheduled aggregation runs, the watermark is right after all the data we’ve materialized:</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_8.png" class="kg-image" alt="" loading="lazy" width="1526" height="1246" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_8.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_8.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_8.png 1526w" sizes="(min-width: 720px) 720px"></figure><p>That helps us locate our next bucket and ensure it’s all there before we run the aggregation. 
Once we have, we move the watermark:</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_9-1.png" class="kg-image" alt="" loading="lazy" width="1526" height="1312" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_9-1.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_9-1.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_9-1.png 1526w" sizes="(min-width: 720px) 720px"></figure><p>So our watermark represents the furthest point we’ve materialized up until now.</p><p>But, you might notice that our continuous aggregates still aren’t fully up to date and wouldn’t give us the same results as a view that ran the same query. Why?</p><ol><li>Scheduled aggregates will have some gap between when the next bucket has all of its data and when the job runs to materialize it.</li><li>We only materialize data once the next bucket is full by default, so we’re missing the partial bucket where inserts are happening right now. We might want to get partial results for that bucket (this is especially true when we’re using larger buckets).</li></ol><p>To address this, we made real-time views.</p><h3 id="real-time-views">Real-time views</h3><p>Real-time views combine the best of materialized views and normal views to give us a more up-to-date view of our data. They’re the default for continuous aggregates, so I don’t need to change how I made my continuous aggregate at all. 
However, I will admit that I elided a few things in the previous picture about how continuous aggregates work under the hood.</p><p>Real-time continuous aggregates have two parts:</p><ol><li>A <em>materialized hypertable</em>, where our already computed aggregates are stored.</li><li>And a <em>real-time view, </em>which queries both the materialized <a href="https://www.tigerdata.com/blog/database-indexes-in-postgresql-and-timescale-cloud-your-questions-answered" rel="noreferrer">hypertable</a> and the raw hypertable (in the not-yet-aggregated region) and combines the results together.</li></ol><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_10.png" class="kg-image" alt="" loading="lazy" width="1526" height="1892" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_10.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_10.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_10.png 1526w" sizes="(min-width: 720px) 720px"></figure><p>So, if you look at the view definition of the continuous aggregate, it looks like this:</p><pre><code class="language-SQL">CREATE VIEW ohlc_cont AS  SELECT _materialized_hypertable_15.bucket,
    _materialized_hypertable_15.symbol,
    _materialized_hypertable_15.first,
    _materialized_hypertable_15.max,
    _materialized_hypertable_15.min,
    _materialized_hypertable_15.last
   FROM _timescaledb_internal._materialized_hypertable_15
  WHERE _materialized_hypertable_15.bucket &lt; COALESCE(_timescaledb_internal.to_timestamp(_timescaledb_internal.cagg_watermark(15)), '-infinity'::timestamp with time zone)
UNION ALL
 SELECT time_bucket('00:15:00'::interval, stocks_real_time."time") AS bucket,
    stocks_real_time.symbol,
    first(stocks_real_time.price, stocks_real_time."time") AS first,
    max(stocks_real_time.price) AS max,
    min(stocks_real_time.price) AS min,
    last(stocks_real_time.price, stocks_real_time."time") AS last
   FROM stocks_real_time
  WHERE stocks_real_time."time" &gt;= COALESCE(_timescaledb_internal.to_timestamp(_timescaledb_internal.cagg_watermark(15)), '-infinity'::timestamp with time zone)
  GROUP BY (time_bucket('00:15:00'::interval, stocks_real_time."time")), stocks_real_time.symbol;
</code></pre>
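<p>A definition like the one above can be pulled from the database itself. Here is a minimal sketch using PostgreSQL’s <code>pg_get_viewdef()</code> function (in <code>psql</code>, <code>\d+ ohlc_cont</code> is an alternative). The number in the internal hypertable name, 15 in this example, will differ from one database to another.</p><pre><code class="language-SQL">-- Sketch: print the view definition behind the continuous aggregate.
SELECT pg_get_viewdef('ohlc_cont'::regclass, true);
</code></pre>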
<p>This view definition is two queries put together with a <a href="https://www.postgresql.org/docs/current/queries-union.html"><code>UNION ALL</code></a>: the first just selects the data straight out of the materialized hypertable where our bucket is below the watermark, and the second runs the aggregation query on the raw hypertable where our time column is at or above the watermark. </p><p>So you can see how this takes advantage of the best of both materialized views and normal views to create something that is much faster than a normal view but still up to date! </p><p>It’s not going to be as performant as just querying the already materialized data (though we do have an <a href="https://docs.timescale.com/getting-started/latest/create-cagg/create-cagg-policy/">option to allow you to do that if you want to</a>), but for most users, the last few months or even years of data is already materialized whereas only the last few minutes or days of raw data needs to be queried, so that still creates a huge speedup!</p><h3 id="invalidation-of-out-of-order-data">Invalidation of out-of-order data</h3><p>You may have noticed that I was making a big assumption in all of my diagrams. I was assuming that <em>all</em> of our inserts happen in the most recent time period. For time-series workloads, this is <em>mostly</em> true. <em>Most</em> data comes in time order. But most and all are <em>very</em> different things. 
That’s especially true for time-series workloads, where so much data is coming in that even if 99 percent of it arrives in time order, the remaining 1 percent is still a lot!</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_11-1.png" class="kg-image" alt="" loading="lazy" width="1526" height="1210" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_11-1.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_11-1.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_11-1.png 1526w" sizes="(min-width: 720px) 720px"></figure><p>And the results of the aggregate would be wrong by a meaningful amount if we simply let the out-of-order inserts (or updates or deletes) build up over time without accounting for them. This cache invalidation problem is a very common problem in computing and a very hard one! PostgreSQL materialized views solve this problem by dropping all the old data and re-materializing it every time, but we already said how inefficient that was. </p><p>The other way that many folks try to solve this sort of problem in a database like PostgreSQL is a trigger. A standard trigger would run for every row and update the aggregates for every row.</p><p>But in practice, it’s hard to get a per-row trigger to work very well, and it still would cause significant <em>write amplification</em>, meaning we’d have to write multiple times for every row we insert. </p><p>In fact, we’d need to write at least once for each row for each continuous aggregate we had on the raw hypertable. 
It would also limit the aggregates we can use to those that can be modified by a trigger, which are fewer than we might like. <br><br>So instead, we created a special kind of trigger that tracks the minimum and maximum <em>times</em> modified across all the rows in a statement and writes out the range of times that were modified to a log table. We call that an invalidation log.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_12.png" class="kg-image" alt="" loading="lazy" width="1526" height="1258" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_12.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_12.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_12.png 1526w" sizes="(min-width: 720px) 720px"></figure><p>The next time the continuous aggregate job runs, it has to do two things: it runs the normal aggregation of the next 15 minutes of data, and it runs an aggregation over each of the invalidated regions to recalculate the proper value over that period.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_13.png" class="kg-image" alt="" loading="lazy" width="1526" height="1312" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_13.png 600w, 
https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_13.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/20220711_Materialized-views-and-CAGGS_Diagram_13.png 1526w" sizes="(min-width: 720px) 720px"></figure><p>Note that this makes our continuous aggregates <a href="https://en.wikipedia.org/wiki/Eventual_consistency">eventually consistent</a> for out-of-order modifications. However, real-time views make continuous aggregates more strongly consistent for recent data (because they use a view under the hood). </p><p><br><br>While we could make more strongly consistent aggregates by joining to our log table and rerunning the aggregation in real time for the invalidated regions, we talked to users and decided that eventual consistency was good enough for most cases here. After all, this data is already late coming in. Essentially, we decided that the performance cost of doing that wasn’t worth the stronger consistency guarantee. And if a user ever needs to, they can manually refresh the continuous aggregate for the modified region by running the <a href="https://docs.timescale.com/api/latest/continuous-aggregates/refresh_continuous_aggregate/"><code>refresh_continuous_aggregate</code> procedure</a>, which updates the data in the materialized hypertable right away.</p><h3 id="data-retention">Data retention</h3><p>The final thing we wanted to accomplish with our continuous aggregates was a way to <em>keep</em> aggregated data around after dropping the raw data. This is impossible with both PostgreSQL views and materialized views. Views work directly on the raw data: if you drop it, they can’t aggregate it. 
</p><p>With materialized views it’s a bit more complicated: until you run a refresh, they keep the old data around, but once you run a refresh to pick up the data you’ve added in more recent time periods, the old data is dropped. </p><p>With continuous aggregates, the implementation is much simpler. We mentioned the invalidation trigger that fires when we modify data that we’ve already materialized. We simply ignore any events older than a certain time horizon, including the drop event. </p><p>We can also process any invalidations before dropping data so that you have the correct data materialized from right before you dropped the oldest stuff. You can configure data retention <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/data-retention/data-retention-with-continuous-aggregates/">by setting up your continuous aggregate policy correctly</a>.</p><h2 id="does-it-work">Does It Work?</h2><p>So we’ve got this whole mashup of views, materialized views, and triggers, all in an attempt to make a good set of trade-offs that works well for time-series data. So the question is: does it work? </p><p><br><br>To test that, I recreated our continuous aggregate from above without data or a policy and ran the <code>refresh_continuous_aggregate</code> procedure so that it would have approximately a month’s worth of data materialized in the aggregate, with about 30 minutes that need to go through the real-time view.</p><p><br><br>If we <strong>query for the aggregated data over the whole period from the continuous aggregate</strong>, it takes about <strong>18 ms</strong>, which is slightly slower than the <strong>5-6 ms for the fully materialized data in a materialized view</strong>, but it’s still <strong>roughly 1,000x faster than the 15 s it takes from a normal view</strong>, <strong>and we get <em>most</em> of the up-to-date benefits of a normal view from it. 
</strong>I'd be pretty happy with that trade-off.</p><p></p><p>If you are new to TimescaleDB and would like to try this out, I invite you to <a href="https://console.cloud.timescale.com/signup">sign up for Timescale</a>. It's the easiest way to get started with Timescale. It’s 100 percent free for 30 days, no credit card required, and you’ll be able to spin up a database in seconds. </p><p>You can easily host PostgreSQL tables and TimescaleDB hypertables in your Timescale demo database, create views, materialized views, and continuous aggregates, and explore the differences in performance and developer experience between them. We've put together some extra directions in this guide on <a href="https://www.timescale.com/learn/postgresql-materialized-views-and-where-to-find-them">using incremental materialized views, a.k.a. continuous aggregates</a>—we hope it helps!<br><br></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[State of PostgreSQL 2022—First Findings]]></title>
            <description><![CDATA[The results of the third State of PostgreSQL survey are almost out! While you wait for the complete report, read some of the survey’s initial findings, including where PostgreSQL users come from, their experience level, and favorite tools.]]></description>
            <link>https://www.tigerdata.com/blog/state-of-postgresql-2022-first-findings</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/state-of-postgresql-2022-first-findings</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[State of PostgreSQL]]></category>
            <dc:creator><![CDATA[Ryan Booz]]></dc:creator>
            <pubDate>Fri, 08 Jul 2022 14:59:54 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/BlogHero_First-Findings.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/BlogHero_First-Findings.png" alt="State of PostgreSQL 2022—First Findings" /><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">🚀</div><div class="kg-callout-text"><b><strong style="white-space: pre-wrap;">About the State of PostgreSQL</strong></b><br><i><em class="italic" style="white-space: pre-wrap;">Timescale’s love for PostgreSQL, one of the world’s most advanced open-source databases with 30+ years of history, runs deep. </em></i><a href="http://www.timescale.com/"><i><em class="italic" style="white-space: pre-wrap;">We built our products on PostgreSQL</em></i></a><i><em class="italic" style="white-space: pre-wrap;">, </em></i><a href="https://timescale.ghost.io/blog/the-future-of-community-in-light-of-babelfish/"><i><em class="italic" style="white-space: pre-wrap;">are proud members of the PostgreSQL community, </em></i></a><a href="https://www.youtube.com/playlist?list=PLsceB9ac9MHRnmNZrCn_TWkUrCBCPR3mc"><i><em class="italic" style="white-space: pre-wrap;">and wouldn’t exist without it and the extensibility it provides</em></i></a><i><em class="italic" style="white-space: pre-wrap;">. In 2019, Timescale launched the first State of PostgreSQL report, advancing our desire to provide greater insights into the vibrant and growing PostgreSQL user base. From the most popular programming languages and favorite features to whether respondents use PostgreSQL for work or personal projects (or both!), the State of PostgreSQL provides valuable insights into this great community. Following a one-year hiatus due to the pandemic, we resumed the annual survey in 2021. 
</em></i><a href="https://www.timescale.com/state-of-postgres-results"><i><em class="italic" style="white-space: pre-wrap;">Check out our previous reports </em></i></a><i><em class="italic" style="white-space: pre-wrap;">for more info, and keep reading to learn more about this year’s first findings. </em></i><a href="https://www.timescale.com/state-of-postgres/2022?utm_source=state-of-pg-2022&amp;utm_medium=blog&amp;utm_campaign=state-of-pg-2022&amp;utm_id=state-of-pg-2022&amp;utm_content=state-postgres-blog"><i><em class="italic" style="white-space: pre-wrap;">If you want to read the complete 2022 report, go ahead!</em></i></a></div></div><p>Earlier this year, we launched the third <em>State of PostgreSQL</em> survey. Participation reached unprecedented levels, with nearly one thousand PostgreSQL users answering our questions, more than double the previous year’s turnout! A huge thank you to you all for giving back to the community by sharing your experiences with PostgreSQL.</p><p>Gathering this information and making it open to the public is our way of helping build a better and more inclusive PostgreSQL community. 
Here are some of our initial findings.</p><h2 id="demographics"><br>Demographics<br></h2><h3 id="what-is-your-primary-geographical-location">What is your primary geographical location?</h3><p>Mirroring the 2019 and 2021 survey results, most respondents (54.4 %) are located in the EMEA (Europe, Middle East, Africa) region, followed by North America, with 25.9 %.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/What-is-your-primary-geographical-location.png" class="kg-image" alt="" loading="lazy" width="1800" height="1117" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/07/What-is-your-primary-geographical-location.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/07/What-is-your-primary-geographical-location.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2022/07/What-is-your-primary-geographical-location.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/What-is-your-primary-geographical-location.png 1800w" sizes="(min-width: 720px) 720px"></figure><h3 id="how-long-have-you-been-using-postgresql"><br><br>How long have you been using PostgreSQL?</h3><p>While most of this year’s participants have been using PostgreSQL for 3-5 years, the number of new users experimenting with the database for less than a year has grown (6.4 %). 
The majority of the surveyed PostgreSQL users—54 %— have been using PostgreSQL for six or more years.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/How-long-have-you-been-using-PostgreSQL.png" class="kg-image" alt="" loading="lazy" width="1800" height="1116" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/07/How-long-have-you-been-using-PostgreSQL.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/07/How-long-have-you-been-using-PostgreSQL.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2022/07/How-long-have-you-been-using-PostgreSQL.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/How-long-have-you-been-using-PostgreSQL.png 1800w" sizes="(min-width: 720px) 720px"></figure><h3 id="what-is-your-current-profession-or-job-status"><br><br>What is your current profession or job status?</h3><p>Most PostgreSQL users (43.3 %) work as software developers/engineers, followed by software architects (13.2 %) and database administrators or DBAs (7.3 %). However, we introduced more title options in this year’s survey, which lent a bit more nuance to the answers. 
For example, we learned that 5.8 % of respondents are consultants, while 2.4 % are researchers.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/Blog---current-profession-or-job-status_-1.png" class="kg-image" alt="" loading="lazy" width="1724" height="1116" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/07/Blog---current-profession-or-job-status_-1.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/07/Blog---current-profession-or-job-status_-1.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2022/07/Blog---current-profession-or-job-status_-1.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/Blog---current-profession-or-job-status_-1.png 1724w" sizes="(min-width: 720px) 720px"></figure><h2 id="community">Community</h2><h3 id="have-you-ever-contributed-to-postgresql"><br>Have you ever contributed to PostgreSQL?</h3><p>Among our sample of PostgreSQL users with 15 or more years of experience, 44&nbsp;% said they have contributed to PostgreSQL at least once. 
In fact, regardless of their experience, users across the board have contributed to the PostgreSQL community.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/Blog---Have-you-ever-contributed-to-PostgreSQL_.png" class="kg-image" alt="" loading="lazy" width="1801" height="1117" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/07/Blog---Have-you-ever-contributed-to-PostgreSQL_.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/07/Blog---Have-you-ever-contributed-to-PostgreSQL_.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2022/07/Blog---Have-you-ever-contributed-to-PostgreSQL_.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/Blog---Have-you-ever-contributed-to-PostgreSQL_.png 1801w" sizes="(min-width: 720px) 720px"></figure><p><br><br><br>Over 300 respondents answered bonus questions and shed light on what they like the most about the PostgreSQL community, where they see room for improvement, and what would make the community more welcoming.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/Improvement-in-PostgreSQL--1-.png" class="kg-image" alt="" loading="lazy" width="1836" height="926" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/07/Improvement-in-PostgreSQL--1-.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/07/Improvement-in-PostgreSQL--1-.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2022/07/Improvement-in-PostgreSQL--1-.png 1600w, 
https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/Improvement-in-PostgreSQL--1-.png 1836w" sizes="(min-width: 720px) 720px"></figure><p></p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/Blog---welcoming-to-newcomers_---v2.png" class="kg-image" alt="" loading="lazy" width="1837" height="926" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/07/Blog---welcoming-to-newcomers_---v2.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/07/Blog---welcoming-to-newcomers_---v2.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2022/07/Blog---welcoming-to-newcomers_---v2.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/Blog---welcoming-to-newcomers_---v2.png 1837w" sizes="(min-width: 720px) 720px"></figure><p></p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/Best-thing-about-the-community.png" class="kg-image" alt="" loading="lazy" width="1836" height="926" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/07/Best-thing-about-the-community.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/07/Best-thing-about-the-community.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2022/07/Best-thing-about-the-community.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/Best-thing-about-the-community.png 1836w" sizes="(min-width: 720px) 720px"></figure><h2 id="tools">Tools</h2><p>Three tools stood out among 
the respondents who use them for queries and administration tasks: <a href="https://www.postgresql.org/docs/current/app-psql.html"><strong>psql</strong></a> (69.4 %), <a href="https://www.pgadmin.org/"><strong>pgAdmin</strong></a> (35.3 %), and <a href="https://dbeaver.io/"><strong>DBeaver</strong></a> (26.2 %).</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/Top_3_Adim_Tools.png" class="kg-image" alt="" loading="lazy" width="1836" height="926" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/07/Top_3_Adim_Tools.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/07/Top_3_Adim_Tools.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2022/07/Top_3_Adim_Tools.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/07/Top_3_Adim_Tools.png 1836w" sizes="(min-width: 720px) 720px"></figure><h2 id="read-the-report">Read the Report</h2><p>Now that we’ve given you a taste of our survey results, are you curious to learn more about the PostgreSQL community? If you’d like to know more insights about the <em>State of PostgreSQL 2022, </em>including why respondents chose PostgreSQL, their opinion on industry events, and what information sources they would recommend to friends and colleagues, don’t miss our complete report. <a href="https://www.timescale.com/state-of-postgres/2022?utm_source=state-of-pg-2022&amp;utm_medium=blog&amp;utm_campaign=state-of-pg-2022&amp;utm_id=state-of-pg-2022&amp;utm_content=state-postgres-blog">Click here to read the report and learn firsthand what the <em>State of PostgreSQL</em> is in 2022</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Slow Grafana Performance? Learn How to Fix It Using Downsampling]]></title>
            <description><![CDATA[Learn about two common visualization problems in Grafana—slow dashboards and noisy data—and how to fix them using downsampling in TimescaleDB.]]></description>
            <link>https://www.tigerdata.com/blog/slow-grafana-performance-learn-how-to-fix-it-using-downsampling</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/slow-grafana-performance-learn-how-to-fix-it-using-downsampling</guid>
            <category><![CDATA[Data Visualization]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[PostgreSQL Tips]]></category>
            <dc:creator><![CDATA[Brian Rowe]]></dc:creator>
            <pubDate>Thu, 23 Jun 2022 13:03:01 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/06/Grafana-Downsampling-post--1-.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/06/Grafana-Downsampling-post--1-.png" alt="Slow Grafana Performance? Learn How to Fix It Using Downsampling" /><h2 id="downsampling-in-grafana">Downsampling in Grafana</h2><p>Graphs are awesome. They allow us to understand data more quickly and easily, highlighting trends that otherwise wouldn’t stand out. And Grafana, the open-source visualization tool, is a fantastic choice for creating graphs, especially for time-series data. </p><p>If you have some data that you want to analyze visually, you just hook it up to your Grafana instance, set up your query, and you’re off to the races. (If you’re new to Grafana and Timescale, don’t worry, we’ve got you covered. See our Getting Started with Grafana and TimescaleDB <a href="https://docs.timescale.com/timescaledb/latest/tutorials/grafana/">docs</a> or <a href="https://youtube.com/playlist?list=PLsceB9ac9MHTjwvV18QJnPcLrTXm_Q-Ft">videos</a> to get up and running.)</p><p>However, while Grafana is an awesome tool for generating graphs, problems still arise when we have too much data. 
Extremely large datasets can be prohibitively slow to load, leading to frustrated users or, worse, unusable dashboards.</p><p>These large <a href="https://timescale.ghost.io/blog/what-the-heck-is-time-series-data-and-why-do-i-need-a-time-series-database-dcf3b1b18563/">time-series datasets</a> are especially common in industries like financial services, the Internet of Things, and <a href="https://timescale.ghost.io/blog/observability-powered-by-sql-understand-your-systems-like-never-before-with-opentelemetry-traces-and-postgresql/">observability</a>, where data is relentless, often generated at high rates and volumes.</p><p>To better understand the problems that can occur when we have extremely large datasets, consider the example of stock ticker data and this graph showing 30 days' worth of trades for five different stocks (AAPL, TSLA, NVDA, MSFT, and AMD):</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/06/Grafana-Downsampling-2--1-.png" class="kg-image" alt="" loading="lazy" width="512" height="257"></figure><p>This graph is composed of five queries that collectively return nearly 1.3 million data points, and it takes nearly 20 seconds to load, pan, or zoom!</p><p>Even with more manageable amounts of data, our graphs can still sometimes be difficult to interpret if the data is too noisy. If the daily variance of our data is too high, it can hide the underlying trends that we're looking for. 
Consider this graph showing the volume of taxi trips taken in New York City over a two-month period:</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/06/Grafana-downsampling-3--1-.png" class="kg-image" alt="" loading="lazy" width="512" height="225"></figure><p>That spike a third of the way in may be a significant shift in volume, and those lower peaks toward the right edge might be a significant decline. It's not immediately obvious though, and certainly, this is not the powerful tool we want our graphs to be.</p><p>We can use different types of downsampling to solve the problems of slow-loading Grafana dashboards and noisy graphs, respectively. Downsampling is the practice of replacing a large set of data points with a smaller set.</p><p>We’ll implement our solutions using two of <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/hyperfunctions/">TimescaleDB’s hyperfunctions</a> for downsampling, making it easy to manipulate and analyze time-series data with fewer lines of SQL code. We’ll look at one hyperfunction for downsampling using the Largest Triangle Three Buckets or <code>lttb()</code> method, and another for downsampling using the ASAP smoothing algorithm, both of which come pre-installed with Timescale or can be accessed via the <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/hyperfunctions/install-toolkit/">timescaledb_toolkit extension</a> if you self-manage your database.</p><h2 id="example-1-load-faster-dashboards-with-lttb-downsampling">Example 1: Load faster dashboards with lttb( ) downsampling</h2><p>In our first example, which plots the prices for five stocks over a 30-day period, the problem is that we have way too much data, resulting in a slow-loading graph. 
This is because the <a href="https://docs.timescale.com/getting-started/latest/add-data/#about-the-dataset">real-time stocks dataset</a> we’re using has upwards of 10,000 points per day for each stock symbol!</p><p>Given the timeframe of our analysis (30 days), this is far more data than we need to spot a trend, and the time needed to load this graph is dominated by the cost of fetching all of the data.</p><p>To solve this problem, we need to find a way to reduce the number of data points we're getting from our data source. Unfortunately, doing this in a manner that doesn't drastically deform our graph is actually a very tricky problem. For example, let’s look at just the NVDA ticker price:</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/06/Grafana-Downsampling-4--1-.png" class="kg-image" alt="" loading="lazy" width="512" height="224"></figure><p><br>Here's what we see if we just naively take the 10-minute average for the NVDA symbol (overlaid in yellow on the original data).</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/06/Grafana-Downsampling-5--1-.png" class="kg-image" alt="" loading="lazy" width="512" height="224"></figure><p>The graph of the average (mean) roughly follows the underlying data but completely smooths away almost all of the peaks and valleys, and those are the most interesting parts of the dataset! Taking the first or last point from each bucket results in an even more skewed graph, as the outlying points have no weight unless they happen to fall in just the right spot.  </p><p>What we need is a way to capture the most interesting point from each bucket. 
To do that, we can use the <a href="https://docs.timescale.com/api/latest/hyperfunctions/downsample/lttb/"><code>lttb()</code> algorithm</a>, which gives us a downsampled graph that follows the pattern of the original graph quite closely. (As an aside, <a href="http://skemman.is/stream/get/1946/15343/37285/3/SS_MSthesis.pdf"><code>lttb()</code></a> was invented by <a href="https://is.linkedin.com/in/sveinn-steinarsson">Sveinn Steinarsson</a> in his master’s thesis.)</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/06/Grafana-Downsampling-6--1-.png" class="kg-image" alt="" loading="lazy" width="512" height="224"></figure><p>Using <a href="http://skemman.is/stream/get/1946/15343/37285/3/SS_MSthesis.pdf"><code>lttb()</code></a>, the downsampled data is barely distinguishable from the original, <strong>despite having fewer than 0.5 % of the points!</strong></p><p><a href="http://skemman.is/stream/get/1946/15343/37285/3/SS_MSthesis.pdf"><code>lttb()</code></a> works by keeping the same first and last points as the original data but dividing the rest of the data into equal intervals. For each interval, it then tries to find the most impactful point. It does this by building a triangle for each point in the interval, using the point selected from the previous interval and the average of the points in the next interval as the other two corners. These triangles are compared with one another by area. The largest resulting triangle corresponds to the point in the interval that has the largest impact on how the graph looks.</p><p>As we see above, the result is a graph that very closely resembles the original graph. What's not as obvious is that the raw data was nearly 315,000 rows that took over five seconds to pull into our dashboard.  
The <a href="http://skemman.is/stream/get/1946/15343/37285/3/SS_MSthesis.pdf"><code>lttb()</code></a> data was 1,404 rows that took less than one second to fetch.  </p><p>Here is the SQL query we used in our Grafana panel to get the <a href="http://skemman.is/stream/get/1946/15343/37285/3/SS_MSthesis.pdf"><code>lttb()</code></a> data.</p><pre><code class="language-SQL">SELECT
  time AS "time",
  value AS "NVDA lttb"
FROM unnest((
    SELECT lttb(time, price, 2 * (($__to - $__from) / $__interval_ms)::int)
    FROM stocks_real_time
    WHERE symbol = 'NVDA' AND $__timeFilter("time"))
)
  ORDER BY 1;
</code></pre>
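<p>To try the same downsampling outside Grafana (in <code>psql</code>, say), the dashboard macros can be swapped for literals. This is a minimal sketch under stated assumptions: the 30-day window and the target of 1,400 points are illustrative values chosen here rather than taken from the panel, and it assumes your copy of <code>stocks_real_time</code> has trades in that window.</p><pre><code class="language-SQL">-- Same shape as the Grafana query, with literals in place of the macros.
-- Assumptions (illustrative only): a 30-day window and about 1,400 output points.
SELECT
  time,
  value AS nvda_lttb
FROM unnest((
    SELECT lttb(time, price, 1400)  -- third argument: target number of points
    FROM stocks_real_time
    WHERE symbol = 'NVDA'
      AND time &gt; now() - INTERVAL '30 days'
))
ORDER BY time;
</code></pre>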
<p>As you can see, the real work here is done by the <a href="https://docs.timescale.com/api/latest/hyperfunctions/downsample/lttb/"><code>lttb()</code> hyperfunction</a> call in the inner <code>SELECT</code>.  This function takes the <code>time</code> and <code>price</code> columns from our table, plus a third integer argument specifying the target resolution, which is the number of points it should return.  </p><p>Unfortunately, Grafana doesn't directly expose the panel width in pixels to us, but we can approximate the resolution from the <a href="https://grafana.com/docs/grafana/latest/variables/variable-types/global-variables/#__interval"><code>$__interval</code> global variable</a> (which is approximately <code>(to - from) / resolution</code>; the query uses its millisecond form, <code>$__interval_ms</code>). For this graph, that estimate came out a bit low, hence our doubling it in the Grafana query above.</p><p>Our <code>lttb()</code> hyperfunction <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/hyperfunctions/function-pipelines/#timevectors">returns a custom <code>timevector</code> object</a>, which we pass to <code>unnest</code> to get <code>time</code>, <code>value</code> rows that Grafana can understand and plot.</p><h2 id="example-2-find-signal-from-noisy-datasets-with-asap-smoothing-downsampling">Example 2: Find signal from noisy datasets with ASAP smoothing downsampling</h2><p><code>lttb()</code> is a fantastic downsampling algorithm for giving us a subset of points that maintains the visual appearance of a graph. However, sometimes the problem is that the original graph is so noisy that the long-term trends we're trying to see are lost in the normal periodic variance of the data. 
This is the case we saw in our second example above, that of taxi data (and shown below):</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/06/Graf-Downsampling-7--1-.png" class="kg-image" alt="" loading="lazy" width="512" height="225"></figure><p>In this case, what we're interested in isn't a way of just reducing the number of points in a graph (as we saw before, that ends up with a graph that looks the same!), but doing so in a manner that smooths away the noise.</p><p>We can use a downsampling technique called<a href="https://arxiv.org/pdf/1703.00983.pdf"> Automated Smoothing for Attention Prioritization (ASAP)</a>, which was developed by <a href="https://twitter.com/kexinrong?lang=en">Kexin Rong</a> and <a href="https://www.linkedin.com/in/pbailis">Peter Bailis</a>.</p><p><a href="https://docs.timescale.com/api/latest/hyperfunctions/downsample/asap/">ASAP works by analyzing the data for intervals of high autocorrelation.</a> Think of this as finding the size of the repeating shape of a graph, so maybe 24 hours for our taxi data, or even 168 hours (one week). Once ASAP has found the range with the highest autocorrelation, it will smooth out the data by computing a rolling average using that range as the window size.</p><p>For instance, if you have perfectly regular data, ASAP should mostly smooth everything away to the underlying flat trend, as in the following example: <br></p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/06/Graf-Downsa-8--1-.png" class="kg-image" alt="" loading="lazy" width="512" height="254"></figure><p>The green line here is the raw data. It is generated as a sine wave with an interval of 20 and an offset of 100 that repeats daily. 
The yellow line is the ASAP algorithm applied to the data, showing that the graph is entirely regular noise with no interesting underlying fluctuation.</p><p>Obviously ASAP can work well on this type of synthetic data, but let's see how it does with our taxi data.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/06/Graf-Downs-9--1-.png" class="kg-image" alt="" loading="lazy" width="512" height="254"></figure><p>Here it becomes very obvious that there was a significant dip from about 11/26 to 12/03, which happens to be Thanksgiving weekend, a US public holiday weekend that occurs at the end of November every year. We can see this even more dramatically by selecting only the ASAP output and letting Grafana auto-adjust the scale:</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/06/Grafana-Downsampling-10--1-.png" class="kg-image" alt="" loading="lazy" width="512" height="252"></figure><p><br>The data for this graph is the<a href="http://www.futuredata.io.s3-website-us-west-2.amazonaws.com/asap/"> taxi trips CSV file</a>. The SQL query we're running in Grafana is this:</p><pre><code class="language-SQL">SELECT
  time AS "time",
  value AS "asap"
FROM unnest((
  SELECT asap_smooth(time, value, (($__to - $__from) / '$__interval_ms')::integer)
  FROM taxidata
  WHERE $__timeFilter("time"))
)
ORDER BY 1
</code></pre>
<p>As in example 1 above, the <code>asap_smooth</code> hyperfunction does most of the work here, taking the time and value columns, as well as a target resolution as arguments. We use the same trick from example 1 to approximate the panel width from Grafana's global variables.</p><h2 id="learn-more">Learn More</h2><p>Eager to try downsampling or learn more about other hyperfunctions? Check out our <a href="https://docs.timescale.com/api/latest/hyperfunctions/downsample/#downsample">downsample</a> and <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/hyperfunctions/#learn-hyperfunction-basics-and-install-timescale-toolkit">hyperfunctions</a> docs for more information on how hyperfunctions can help you efficiently query and analyze your data. </p><p>Looking for more Grafana guides? Here are our <a href="https://docs.timescale.com/timescaledb/latest/tutorials/grafana/">Grafana tutorials</a> and our <a href="https://timescale.ghost.io/blog/grafana-webinar-1-recap/">Grafana 101 Creating Awesome Visualizations</a> for more support on visualizations in Grafana. </p><p>If you need a database to store your time-series data and power your dashboards, try <a href="https://console.cloud.timescale.com/signup">Timescale</a>, our fast, easy-to-use, and reliable cloud-native data platform for <a href="https://www.tigerdata.com/blog/time-series-introduction" rel="noreferrer">time series</a> built on PostgreSQL. (You can sign up for a 30-day free trial, no credit card required.)</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to Write Better Queries for Time-Series Data Analysis With Custom SQL Functions]]></title>
            <description><![CDATA[There is a better way to explore and analyze time-series data with PostgreSQL: hyperfunctions. Learn how to ease your workload with faster and simpler data analysis using TimescaleDB.]]></description>
            <link>https://www.tigerdata.com/blog/how-to-write-better-queries-for-time-series-data-analysis-using-custom-sql-functions</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/how-to-write-better-queries-for-time-series-data-analysis-using-custom-sql-functions</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Analytics]]></category>
            <category><![CDATA[Time Series Data]]></category>
            <category><![CDATA[Hyperfunctions]]></category>
            <category><![CDATA[Engineering]]></category>
            <dc:creator><![CDATA[JF Joly]]></dc:creator>
            <pubDate>Thu, 23 Jun 2022 13:02:54 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/06/marc-olivier-jodoin-NqOInJ-ttqM-unsplash--1-.jpg">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/06/marc-olivier-jodoin-NqOInJ-ttqM-unsplash--1-.jpg" alt="How to Write Better Queries for Time-Series Data Analysis With Custom SQL Functions" /><h2 id="why-the-right-tools-matter-when-analyzing-time-series-data">Why the Right Tools Matter When Analyzing Time-Series Data</h2><p><strong>SQL is the lingua franca for analytics. </strong>As data proliferates, we need to find new ways to store, explore, and analyze it. <a href="https://timescale.ghost.io/blog/blog/why-sql-beating-nosql-what-this-means-for-future-of-data-time-series-database-348b777b847a/">We believe SQL is the best language for data analysis</a>. We’ve championed the benefits of SQL for several years, even when many were swapping it for custom domain-specific languages. Full SQL support was one of the <a href="https://timescale.ghost.io/blog/when-boring-is-awesome-building-a-scalable-time-series-database-on-postgresql-2900ea453ee2/">key reasons</a> we chose to build TimescaleDB on top of PostgreSQL, the <a href="https://survey.stackoverflow.co/2022/#most-popular-technologies-database-prof">most loved database among developers</a>, rather than creating a custom query language. And we were right—SQL is making a comeback (although it never really went away) and has become the universal language for data analysis, with many NoSQL databases adding SQL interfaces to keep up.</p><p>In addition, most developers are familiar with SQL, along with most data scientists, data analysts, and other professionals who work with data. Whether you've taken classes at university, done an online course, or attended a boot camp, chances are that you probably have learned a bit of SQL along the way. So you and your fellow developers already know it, making it easier for teams to onboard new members and quickly extract value from the data. 
With a proprietary language, learning the language is in itself a barrier to using the data—you’ll have to ask another team to write the queries or rely on a separate data lake.</p><p><a href="https://timescale.ghost.io/blog/blog/what-the-heck-is-time-series-data-and-why-do-i-need-a-time-series-database-dcf3b1b18563/"><strong>Time-series data</strong></a><strong> is ubiquitous. </strong>At Timescale, our mission is to serve developers worldwide and enable them to build exceptional data-driven products that measure everything that matters: software applications, industrial equipment, financial markets, blockchain activity, user actions, consumer behavior, machine learning models, climate change, and more.</p><p>And time-series data comes at you fast, sometimes generating millions of data points per second. Because of the sheer volume and rate of information, time-series data can be complex to query and analyze, even in SQL.</p><p><strong>TimescaleDB hyperfunctions make it easier to manipulate and analyze time-series datasets with fewer lines of SQL code</strong>. Hyperfunctions are purpose-built for the most common and difficult time-series and analytical queries developers write today in SQL. Using hyperfunctions makes you more productive when querying time-series data, which means you can spend less time creating reports, dashboards, and visualizations involving <a href="https://www.tigerdata.com/blog/time-series-introduction" rel="noreferrer">time series</a>, and spend more time acting on the insights that your work unearths!</p><h2 id="handling-time-series-data-meet-hyperfunctions">Handling Time-Series Data: Meet Hyperfunctions</h2><p>There are over 70 different TimescaleDB hyperfunctions ready to use today. 
Here are some of the most popular ones and how they can help you handle your time-series data:<br></p><ul><li><a href="https://timescale.ghost.io/blog/how-to-write-better-queries-for-time-series-data-analysis-using-custom-sql-functions/#solved-how-to-query-arbitrary-time-intervals-with-date_trunc"><strong>Time-based analysis</strong></a><strong>:</strong> <code>time_bucket()</code> makes time-based analysis simpler and easier by enabling you to analyze data over arbitrary time intervals using succinct queries.</li><li><a href="https://timescale.ghost.io/blog/how-to-write-better-queries-for-time-series-data-analysis-using-custom-sql-functions/#first-and-last-"><strong><code>first()</code> and <code>last()</code></strong></a> allow you to get the value of one column as ordered by another (2x faster in TimescaleDB 2.7!).</li><li><a href="https://timescale.ghost.io/blog/how-to-write-better-queries-for-time-series-data-analysis-using-custom-sql-functions/#simpler-time-weighted-averages"><strong>Time-weighted averages</strong></a><strong>:</strong> <code>time_weight()</code> and related hyperfunctions for working with time-weighted averages offer a more elegant way to get an unbiased average when working with irregularly sampled data.</li><li><a href="https://timescale.ghost.io/blog/how-to-write-better-queries-for-time-series-data-analysis-using-custom-sql-functions/#enhanced-query-readability-and-maintenance-with-function-pipelines-"><strong>Function pipelines</strong></a> enable you to analyze data by composing multiple functions, leading to a simpler, cleaner way of expressing complex logic in PostgreSQL (currently experimental).</li><li><a href="https://timescale.ghost.io/blog/how-to-write-better-queries-for-time-series-data-analysis-using-custom-sql-functions/#better-data-summaries-using-percentile-approximation"><strong>Percentile approximation</strong></a> brings percentile analysis to more workflows, enabling you to understand the distribution of your data 
efficiently (e.g., 10th percentile, median, or 50th percentile, 90th percentile, etc.) without performing expensive computations over gigantic time-series datasets. When used with<a href="https://timescale.ghost.io/blog/how-we-made-data-aggregation-better-and-faster-on-postgresql-with-timescaledb-2-7/"> continuous aggregates</a>, you can compute percentiles over any time range of your dataset in near real-time and use them for baselining and normalizing incoming data.</li><li><a href="https://timescale.ghost.io/blog/how-to-write-better-queries-for-time-series-data-analysis-using-custom-sql-functions/#easier-frequency-analysis-with-frequency-aggregates"><strong>Frequency analysis</strong></a><strong>:</strong> <code>freq_agg()</code> and related frequency analysis hyperfunctions find the most common elements in a set of widely varied values more efficiently than a brute-force calculation.</li><li><a href="https://docs.timescale.com/api/latest/hyperfunctions/histogram/"><strong>Histogram</strong></a> shows the data distribution and can offer a better understanding of the segments compared to an average (<a href="https://statisticsbyjim.com/basics/histograms/">more on histograms</a>).</li><li><a href="https://timescale.ghost.io/blog/slow-grafana-performance-learn-how-to-fix-it-using-downsampling/"><strong>Downsampling</strong></a><strong>: </strong>ASAP smoothing smooths datasets to highlight the most important features when graphed. Largest Triangle Three Buckets Downsampling or <code>lttb()</code> reduces the number of elements in a dataset while retaining important features when graphed. 
<a href="https://timescale.ghost.io/blog/slow-grafana-performance-learn-how-to-fix-it-using-downsampling/">See how to apply our downsampling hyperfunctions in Grafana.</a></li><li><a href="https://timescale.ghost.io/blog/how-to-write-better-queries-for-time-series-data-analysis-using-custom-sql-functions/#more-memory-efficient-count-distinct-queries"><strong>Memory efficient COUNT DISTINCTs</strong></a>: HyperLogLog is a probabilistic cardinality estimator that uses significantly less memory than the equivalent COUNT DISTINCT query. It is ideal for use in a continuous aggregate for large datasets.</li></ul><p>We created new SQL functions for each of these time-series analysis and manipulation capabilities. This contrasts with other efforts to improve the developer experience by introducing new SQL syntax. While introducing new syntax with new keywords and constructs may have been easier from an implementation perspective, we made the deliberate decision not to do so since we believe it leads to a worse experience for the end-user. </p><p>New SQL syntax means existing drivers, libraries, and tools may no longer work. That can leave developers with more problems than solutions as their favorite tools, libraries, or drivers may not support the new syntax or require time-consuming modifications. On the other hand, new SQL functions mean that your query will run in every visualization tool, database admin tool, or data analysis tool. 
</p><p>We have the freedom to create custom functions, aggregates, and procedures that help developers better understand and work with their data, and ensure all their drivers and interfaces still work as expected!<br><br>We will now dive into each hyperfunction category that we mentioned and give examples of when, why, and how to use them, plus resources to continue your learning.</p><p>TimescaleDB hyperfunctions come pre-loaded and ready to use on every hosted and managed database service in Timescale, the easiest way to get started with TimescaleDB. <a href="https://console.cloud.timescale.com/">Get started with a free Timescale trial</a>—no credit card required. Or <a href="https://docs.timescale.com/install/latest/">download for free</a> with TimescaleDB self-managed.</p><p>If you’d like to jump straight into using TimescaleDB hyperfunctions on a real-world dataset, <a href="https://docs.timescale.com/timescaledb/latest/tutorials/nfl-analytics/#analyze-data-using-timescaledb-continuous-aggregates-and-hyperfunctions">start our tutorial</a>, which uses hyperfunctions to uncover insights about players and teams from the NFL (American football).  </p><p>Can’t find the function you need?<strong> </strong>Open an issue on our <a href="https://github.com/timescale/timescaledb-toolkit/issues">GitHub project</a> or contact us on <a href="https://timescaledb.slack.com/">Slack</a> or via the <a href="http://timescale.com/forum/">Timescale Community Forum</a>. We love to work with our users to simplify SQL!</p><h2 id="solved-how-to-query-arbitrary-time-intervals-with-datetrunc">Solved: How to Query Arbitrary Time-Intervals With date_trunc</h2><p>When using PostgreSQL, the <a href="https://www.postgresql.org/docs/current/functions-datetime.html"><code>date_trunc</code></a><a href="https://www.postgresql.org/docs/current/functions-datetime.html"> function</a> can be useful when you want to aggregate information over an interval of time. 
<code>date_trunc</code> truncates a <code>TIMESTAMP</code> or an <code>INTERVAL</code> value based on a specified date part (e.g., hour, week, or month) and returns the truncated timestamp or interval. For example, <code>date_trunc</code> can aggregate by one second, one hour, one day, or one week. However, you often want to see aggregates by the time intervals that matter most to your use case, which may be intervals like 30 seconds, 5 minutes, 12 hours, etc. This can get pretty complicated in SQL. Just look at the query below, which analyzes taxi ride activity in five-minute time intervals:</p><p><strong>Regular PostgreSQL: Taxi rides taken every five minutes</strong></p><pre><code class="language-SQL">SELECT
  EXTRACT(hour from pickup_datetime) as hours,
  trunc(EXTRACT(minute from pickup_datetime) / 5)*5 AS five_mins,
  COUNT(*)
FROM rides
WHERE pickup_datetime &lt; '2016-01-02 00:00'
GROUP BY hours, five_mins;
</code></pre>
<p><strong>The <code>time_bucket()</code> hyperfunction makes it easy to query your data in whatever time interval is most relevant to your analysis use case</strong>. <code>time_bucket()</code> enables you to aggregate data by arbitrary time intervals (e.g., 10 seconds, 5 minutes, 6 hours, etc.), and gives you flexible groupings and offsets, instead of just second, minute, hour, and so on.</p><p>In addition to allowing more flexible time-series queries, <code>time_bucket()</code> also allows you to write these queries in a simpler way. Just look at how much simpler the query from the example above is to write and understand when using the <code>time_bucket()</code> hyperfunction:</p><p><strong>TimescaleDB hyperfunctions: Taxi rides taken every five minutes</strong></p><pre><code class="language-SQL">-- How many rides took place every 5 minutes for the first day of 2016?
SELECT time_bucket('5 minute', pickup_datetime) AS five_min, count(*)
FROM rides
WHERE pickup_datetime &lt; '2016-01-02 00:00'
GROUP BY five_min
ORDER BY five_min;
</code></pre>
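<p>As a rough sketch of the offsets mentioned above, the variant below shifts the five-minute buckets so that they start at 2.5 minutes past each boundary rather than on it. It reuses the <code>rides</code> table from the previous query and assumes that your TimescaleDB version accepts an interval offset as the third argument of <code>time_bucket()</code>, so check the API reference before relying on this exact form.</p><pre><code class="language-SQL">-- Sketch: five-minute buckets that start at :02:30, :07:30, :12:30, and so on.
-- Assumption: time_bucket() takes an interval offset as its third argument.
SELECT time_bucket('5 minutes', pickup_datetime, '2.5 minutes'::interval) AS five_min,
       count(*)
FROM rides
WHERE pickup_datetime &lt; '2016-01-02 00:00'
GROUP BY five_min
ORDER BY five_min;
</code></pre>
<p>Shifting the buckets this way can be handy when your reporting windows do not line up with the default bucket boundaries.</p>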
<p>If you’d like even more flexibility when aggregating your data, <a href="https://docs.timescale.com/api/latest/hyperfunctions/time_bucket/"><code>time_bucket</code></a> also enables you to bucket your data by years and months, in addition to second, minute, and hour time intervals. This allows you to easily do monthly cohort analysis or other multiple-month-based reports in SQL.</p><pre><code class="language-SQL">SELECT time_bucket('3 month', date '2021-08-01');
 time_bucket
----------------
 2021-07-01
(1 row)
</code></pre>
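<p>To make the monthly-report idea more concrete, here is a minimal sketch that counts rides per calendar month. It assumes the <code>rides</code> table and <code>pickup_datetime</code> column from the earlier taxi examples, and a TimescaleDB version in which <code>time_bucket</code> accepts month-based intervals.</p><pre><code class="language-SQL">-- Sketch: number of rides per calendar month.
-- Assumption: the rides table from the earlier examples is available.
SELECT time_bucket('1 month', pickup_datetime) AS month,
       count(*) AS rides
FROM rides
GROUP BY month
ORDER BY month;
</code></pre>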
<p><code>time_bucket</code> also features custom timezone support, which enables you to write queries like the one below, which illustrates using it to bucket data in the Europe/Moscow region:</p><pre><code class="language-SQL">-- note that timestamptz is displayed differently depending on the session parameters
SET TIME ZONE 'Europe/Moscow';

SELECT time_bucket('1 month', timestamptz '2001-02-03 12:34:56 MSK', timezone =&gt; 'Europe/Moscow');
     time_bucket
------------------------
 2001-02-01 00:00:00+03
</code></pre>
<p>Gaps from missing data are a common occurrence when capturing hundreds or thousands of time-series readings per second or minute. They can result from irregular sampling intervals or from an outage of some sort.</p><p>The <code>time_bucket_gapfill()</code> hyperfunction enables you to create additional rows of data in any gaps, ensuring that the returned rows are in chronological order and contiguous. To learn more about gappy data, read our blog <a href="https://timescale.ghost.io/blog/sql-functions-for-time-series-analysis/">Mind the Gap: Using SQL Functions for Time-Series Analysis</a>.</p><p>Here’s an example of <code>time_bucket_gapfill()</code> in action, where we find the daily average temperature for a certain device and use the <code>locf()</code> function to carry the last observation forward in case we have gaps in our data:</p><pre><code class="language-SQL">SELECT
  time_bucket_gapfill('1 day', time, now() - INTERVAL '1 week', now()) AS day,
  device_id,
  avg(temperature) AS value,
  locf(avg(temperature))
FROM metrics
WHERE time &gt; now () - INTERVAL '1 week'
GROUP BY day, device_id
ORDER BY day;

           day          | device_id | value | locf
------------------------+-----------+-------+------
 2019-01-10 01:00:00+01 |         1 |       |
 2019-01-11 01:00:00+01 |         1 |   5.0 |  5.0
 2019-01-12 01:00:00+01 |         1 |       |  5.0
 2019-01-13 01:00:00+01 |         1 |   7.0 |  7.0
 2019-01-14 01:00:00+01 |         1 |       |  7.0
 2019-01-15 01:00:00+01 |         1 |   8.0 |  8.0
 2019-01-16 01:00:00+01 |         1 |   9.0 |  9.0
(7 rows)
</code></pre>
<p>The last observation carried forward or <a href="https://docs.timescale.com/api/latest/hyperfunctions/gapfilling-interpolation/locf/"><code>locf(</code></a><code>)</code> function allows you to carry forward the last seen value in an aggregation group. You can only use it in an aggregation query with <code>time_bucket_gapfill</code>.</p><p>To learn more about using the <code>time_bucket</code> family of hyperfunctions, read the <a href="https://docs.timescale.com/api/latest/hyperfunctions/time_bucket/">docs</a>, and get started with our <a href="https://docs.timescale.com/timescaledb/latest/tutorials/nyc-taxi-cab/">tutorial</a>, which uses <code>time_bucket()</code> to analyze a real-world IoT dataset.</p><h2 id="simpler-time-weighted-averages">Simpler Time-Weighted Averages</h2><p>If you’re in a situation where you don't have regularly sampled data, getting a representative average over a period of time can be a complex and time-consuming query to write. For example, irregularly sampled data, and thus the need for time-weighted averages, frequently occurs in the following cases:</p><ul><li>Industrial IoT, where teams “compress” data by only sending points when the value changes.</li><li>Remote sensing, where sending data back from the edge can be costly, so you only send high-frequency data for the most critical operations.</li><li>Trigger-based systems, where the sampling rate of one sensor is affected by the reading of another (i.e., a security system that sends data more frequently when a motion sensor is triggered).</li></ul><p>Time-weighted averages are a way to get an unbiased average when you are working with irregularly sampled data.<br><br>To illustrate the value of a hyperfunction to find time-weighted averages, consider the following example of a simple table modeling freezer temperature:</p><pre><code class="language-SQL">CREATE TABLE freezer_temps (
	freezer_id int,
	ts timestamptz,
	temperature float);
</code></pre>
<p>And some irregularly sampled time-series data representing the freezer temperature:</p><pre><code class="language-SQL">INSERT INTO freezer_temps VALUES 
( 1, '2020-01-01 00:00:00+00', 4.0), 
( 1, '2020-01-01 00:05:00+00', 5.5), 
( 1, '2020-01-01 00:10:00+00', 3.0), 
( 1, '2020-01-01 00:15:00+00', 4.0), 
( 1, '2020-01-01 00:20:00+00', 3.5), 
( 1, '2020-01-01 00:25:00+00', 8.0), 
( 1, '2020-01-01 00:30:00+00', 9.0), 
( 1, '2020-01-01 00:31:00+00', 10.5), -- door opened!
( 1, '2020-01-01 00:31:30+00', 11.0), 
( 1, '2020-01-01 00:32:00+00', 15.0), 
( 1, '2020-01-01 00:32:30+00', 20.0), -- door closed
( 1, '2020-01-01 00:33:00+00', 18.5), 
( 1, '2020-01-01 00:33:30+00', 17.0), 
( 1, '2020-01-01 00:34:00+00', 15.5), 
( 1, '2020-01-01 00:34:30+00', 14.0), 
( 1, '2020-01-01 00:35:00+00', 12.5), 
( 1, '2020-01-01 00:35:30+00', 11.0), 
( 1, '2020-01-01 00:36:00+00', 10.0), -- temperature stabilized
( 1, '2020-01-01 00:40:00+00', 7.0),
( 1, '2020-01-01 00:45:00+00', 5.0);

</code></pre>
<p>Calculating the time-weighted average temperature of the freezer using regular SQL functions would look something like this:</p><p><strong>Time-weighted averages using regular SQL</strong></p><pre><code class="language-SQL">WITH setup AS (
	SELECT lag(temperature) OVER (PARTITION BY freezer_id ORDER BY ts) as prev_temp, 
		extract('epoch' FROM ts) as ts_e, 
		extract('epoch' FROM lag(ts) OVER (PARTITION BY freezer_id ORDER BY ts)) as prev_ts_e, 
		* 
	FROM  freezer_temps), 
nextstep AS (
	SELECT CASE WHEN prev_temp is NULL THEN NULL 
		ELSE (prev_temp + temperature) / 2 * (ts_e - prev_ts_e) END as weighted_sum, 
		* 
	FROM setup)
SELECT freezer_id,
    avg(temperature), -- the regular average
	sum(weighted_sum) / (max(ts_e) - min(ts_e)) as time_weighted_average 
FROM nextstep
GROUP BY freezer_id;
</code></pre>
<p>But with the TimescaleDB <code>time_weight()</code> hyperfunction, we can reduce this query, which is tedious to write and confusing to read, to a much simpler five-line query:<br></p><pre><code class="language-SQL">SELECT freezer_id, 
	avg(temperature), 
	average(time_weight('Linear', ts, temperature)) as time_weighted_average 
FROM freezer_temps
GROUP BY freezer_id;

freezer_id |  avg  | time_weighted_average 
------------+-------+-----------------------
          1 | 10.2  |     6.636111111111111
</code></pre>
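<p>The first argument to <code>time_weight()</code> selects the weighting method. As a hedged sketch, the query below compares the <code>'Linear'</code> method used above, which interpolates between neighboring readings, with <code>'LOCF'</code>, which treats each reading as holding until the next one arrives. It assumes the same <code>freezer_temps</code> table and that both methods are available in your Toolkit version.</p><pre><code class="language-SQL">-- Sketch: compare the two weighting methods on the same data.
-- 'Linear' interpolates between points; 'LOCF' carries each value forward.
SELECT freezer_id,
       average(time_weight('Linear', ts, temperature)) AS linear_twa,
       average(time_weight('LOCF', ts, temperature)) AS locf_twa
FROM freezer_temps
GROUP BY freezer_id;
</code></pre>
<p>Which method fits best depends on how the sensor reports: carrying values forward tends to suit systems that only send a point when the value changes, while linear interpolation suits continuously varying measurements.</p>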
<p>To learn more about using time-weighted average hyperfunctions, read the<a href="https://docs.timescale.com/api/latest/hyperfunctions/time-weighted-averages/"> docs</a> and see our explainer blog post: <a href="https://timescale.ghost.io/blog/what-time-weighted-averages-are-and-why-you-should-care/">What time-weighted averages are and why you should care</a>.</p><h2 id="better-data-summaries-using-percentile-approximation">Better Data Summaries Using Percentile Approximation</h2><p>Many developers choose to use averages and other summary statistics more frequently than percentiles because they are significantly “cheaper” to calculate over large time-series datasets, both in computational resources and time.</p><p>As we were designing hyperfunctions, we thought about how we could capture the benefits of percentiles (e.g., robustness to outliers, better correspondence with real-world impacts) while avoiding some of the pitfalls of calculating exact percentiles. </p><p>TimescaleDB’s<strong> percentile approximation hyperfunctions</strong> enable you to understand your data distribution efficiently (e.g., 10th percentile, median, or 50th percentile, 90th percentile, etc.) without performing expensive computations over gigantic time-series datasets.</p><p>With relatively large datasets, you can often accept some accuracy trade-offs to avoid running into issues of high memory footprint and network costs while enabling percentiles to be computed more efficiently in parallel and used on streaming data. (In this post, you can learn more about the <a href="https://timescale.ghost.io/blog/how-percentile-approximation-works-and-why-its-more-useful-than-averages/">design decisions and trade-offs made in TimescaleDB’s percentile approximation hyperfunctions design</a>.)</p><p>TimescaleDB has a whole family of <a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/">percentile approximation hyperfunctions</a>. 
The simplest way to call them is to use the <a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/percentile_agg/">percentile_agg aggregate</a> along with the<a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/approx_percentile/"> approx_percentile accessor</a>. For example, here’s how we might calculate the 10th, 50th (median), and 90th percentiles of the response time of a particular API:</p><pre><code class="language-SQL">SELECT 
    approx_percentile(0.1, percentile_agg(response_time)) as p10, 
    approx_percentile(0.5, percentile_agg(response_time)) as p50, 
    approx_percentile(0.9, percentile_agg(response_time)) as p90 
FROM responses;
</code></pre>
<p>Hyperfunctions for percentile approximation can also be used in <a href="https://timescale.ghost.io/blog/how-we-made-data-aggregation-better-and-faster-on-postgresql-with-timescaledb-2-7/">TimescaleDB’s continuous aggregates</a>, which make aggregate queries on very large datasets run faster. Continuous aggregates continuously and incrementally store the results of an aggregation query in the background. So, when you run the query, only the changed data needs to be computed, not the entire dataset.</p><p>That is a huge advantage compared to exact percentiles because you can now do things like baselining and alerting on longer periods without recalculating from scratch every time!</p><p>For example, here’s how you can use continuous aggregates to identify recent outliers and investigate potential problems. First, we create a one-hour aggregation from the <a href="https://www.tigerdata.com/blog/database-indexes-in-postgresql-and-timescale-cloud-your-questions-answered" rel="noreferrer">hypertable</a> <code>responses</code>:</p><pre><code class="language-SQL">CREATE TABLE responses(
	ts timestamptz, 
	response_time DOUBLE PRECISION);
SELECT create_hypertable('responses', 'ts');
</code></pre>
<pre><code class="language-SQL">CREATE MATERIALIZED VIEW responses_1h_agg
WITH (timescaledb.continuous)
AS SELECT 
    time_bucket('1 hour'::interval, ts) as bucket,
    percentile_agg(response_time)
FROM responses
GROUP BY time_bucket('1 hour'::interval, ts);
</code></pre>
<p>To find outliers, we can look for responses from the last 30 seconds that are slower than the 99th percentile of the previous hour:</p><pre><code class="language-SQL">SELECT * FROM responses 
WHERE ts &gt;= now()-'30s'::interval
AND response_time &gt; (
	SELECT approx_percentile(0.99, percentile_agg)
	FROM responses_1h_agg
	WHERE bucket = time_bucket('1 hour'::interval, now()-'1 hour'::interval)
);
</code></pre>
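<p>Because the continuous aggregate stores the percentile sketch rather than a finished number, the hourly rows can be combined again over longer periods. The following is a minimal sketch, assuming the <code>responses_1h_agg</code> view defined above and the Toolkit <code>rollup()</code> aggregate for percentile sketches. The column name <code>percentile_agg</code> is simply the default name the view gives to its unaliased aggregate column.</p><pre><code class="language-SQL">-- Sketch: daily 99th percentile re-aggregated from the hourly sketches.
-- Assumption: rollup() can combine the percentile_agg column of the view.
SELECT time_bucket('1 day'::interval, bucket) AS day,
       approx_percentile(0.99, rollup(percentile_agg)) AS p99
FROM responses_1h_agg
GROUP BY day
ORDER BY day;
</code></pre>
<p>This avoids going back to the raw <code>responses</code> rows when you want a coarser baseline.</p>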
<p>To learn more about using percentile approximation hyperfunctions, read the <a href="https://timescale.ghost.io/blog/how-percentile-approximation-works-and-why-its-more-useful-than-averages/">docs</a>, try our <a href="https://docs.timescale.com/timescaledb/latest/tutorials/nfl-analytics/">tutorial using real-world NFL data</a> and see our <a href="https://timescale.ghost.io/blog/how-percentile-approximation-works-and-why-its-more-useful-than-averages/">explainer blog post on why percentile approximation is more useful than averages</a>.</p><h2 id="first-and-last">first() and last()</h2><p>Another common problem is finding the first or last values for multiple time series. That often occurs in IoT scenarios, where you want to monitor devices in different locations, but each device sends back data at different times (as devices can go offline, experience connectivity issues, batch transmit data, or simply have different sampling rates).</p><p>The <code>last</code> hyperfunction allows you to get the value of one column as ordered by another. For example, <code>last(temperature, time)</code> returns the latest temperature value based on time within an aggregate group. </p><p>This way, you can more easily write queries that, for example, find the last recorded temperature at multiple locations, as each location might have different rates of data being sampled and recorded:</p><pre><code class="language-SQL">SELECT location, last(temperature, time)
  FROM conditions
  GROUP BY location;
</code></pre>
<p>Similarly, the <code>first</code> hyperfunction also allows you to get the value of one column as ordered by another. <code>first(temperature, time)</code> returns the earliest temperature value based on time within an aggregate group:</p><pre><code class="language-SQL">SELECT device_id, first(temp, time)
FROM metrics
GROUP BY device_id;
</code></pre>
<p><code>first()</code> and <code>last()</code> can also be used in more complex queries, such as finding the latest value within a specific time interval. In the example below, we find the last temperature recorded for each device in every five-minute interval over the past day:</p><pre><code class="language-SQL">SELECT device_id, time_bucket('5 minutes', time) AS interval,
  last(temp, time)
FROM metrics
WHERE time &gt; now() - INTERVAL '1 day'
GROUP BY device_id, interval
ORDER BY interval DESC;
</code></pre>
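<p>For comparison, when you only need the most recent reading from a single device, a plain <code>ORDER BY</code> with <code>LIMIT 1</code> is worth considering, since it can use an index on the time column. A minimal sketch, assuming the same <code>metrics</code> table and a hypothetical device ID of 42:</p><pre><code class="language-SQL">-- Latest reading for one device (device_id = 42 is a hypothetical value).
-- An index such as (device_id, time DESC) would let PostgreSQL answer this
-- without scanning every row of the group.
SELECT device_id, time, temp
FROM metrics
WHERE device_id = 42
ORDER BY time DESC
LIMIT 1;
</code></pre>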
<p>In TimescaleDB 2.7, we’ve made <a href="https://github.com/timescale/timescaledb/pull/3943">improvements</a> to make queries with the <code>first()</code> and <code>last()</code> hyperfunctions up to twice as fast and make memory usage near constant.</p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">🚀</div><div class="kg-callout-text"><i><b><strong class="italic" style="white-space: pre-wrap;">Note:</strong></b></i><i><em class="italic" style="white-space: pre-wrap;"> The last and first commands do not use indexes but perform a sequential scan through their groups. They are primarily used for ordered selection within a GROUP BY aggregate and not as an alternative to an ORDER BY time DESC LIMIT 1 clause to find the latest value (which uses indexes).</em></i></div></div><p>To learn more, see the docs for <a href="https://docs.timescale.com/api/latest/hyperfunctions/first/"><code>first(</code></a><code>)</code> and <a href="https://docs.timescale.com/api/latest/hyperfunctions/last/"><code>last(</code></a><code>)</code>.</p><h2 id="more-memory-efficient-count-distinct-queries">More Memory Efficient COUNT DISTINCT Queries<br></h2><p>Calculating the exact number of distinct values in a large dataset with <a href="https://www.tigerdata.com/learn/how-to-handle-high-cardinality-data-in-postgresql" rel="noreferrer">high cardinality</a> requires lots of computational resources, which can impact the query performance and experience of your database's concurrent users. </p><p>To solve this issue, TimescaleDB provides hyperfunctions to calculate <a href="https://docs.timescale.com/api/latest/hyperfunctions/approx_count_distincts/">approximate COUNT DISTINCTs</a>. Approximate count distincts do not calculate the exact cardinality of a dataset, but rather estimate the number of unique values, in order to improve compute time. 
We use <a href="https://en.wikipedia.org/wiki/HyperLogLog">HyperLogLog</a>, a probabilistic cardinality estimator that uses significantly less memory than the equivalent <code>COUNT DISTINCT</code> query.</p><p><a href="https://docs.timescale.com/api/latest/hyperfunctions/approx_count_distincts/hyperloglog/"><code>hyperloglog(</code></a><code>)</code> is an aggregate that builds an approximation object for <code>COUNT DISTINCT</code> queries. The <a href="https://docs.timescale.com/api/latest/hyperfunctions/approx_count_distincts/distinct_count/"><code>distinct_count(</code></a><code>)</code> accessor function then gets the number of distinct values from that HyperLogLog object, as illustrated in the example below, which efficiently estimates the number of unique NFTs and collections in a hypothetical NFT marketplace:<br></p><pre><code class="language-SQL">SELECT
  distinct_count(hyperloglog(32768, asset_id)) AS nft_count,
  distinct_count(hyperloglog(32768, collection_id)) AS collection_count
FROM nft_sales
WHERE payment_symbol = 'ETH' AND time &gt; NOW()-INTERVAL '3 months';
</code></pre>
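<p>As a rough sketch of how the expected error of such an estimate could be checked, assuming the same hypothetical <code>nft_sales</code> table and the <code>stderror()</code> accessor as it is named in the Toolkit API reference (verify the name against your installed version):</p><pre><code class="language-SQL">-- Relative standard error of the estimator itself.
-- For HyperLogLog this is roughly 1.04 / sqrt(number of buckets),
-- so 32768 buckets should give an error of well under 1%.
SELECT stderror(hyperloglog(32768, asset_id)) AS relative_error
FROM nft_sales;
</code></pre>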
<p>You can also use the <a href="https://docs.timescale.com/api/latest/hyperfunctions/approx_count_distincts/stderror/"><code>stderror(</code></a><code>)</code> function to estimate the relative standard error of the HyperLogLog compared to running <code>COUNT DISTINCT</code> directly. <br>To learn more about the approximate <code>COUNT DISTINCT</code> hyperfunctions, read the <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/hyperfunctions/approx-count-distincts/">docs</a>.</p><h2 id="enhanced-query-readability-and-maintenance-with-function-pipelines">Enhanced Query Readability and Maintenance With Function Pipelines</h2><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">🚀</div><div class="kg-callout-text"><i><b><strong class="italic" style="white-space: pre-wrap;">Note:</strong></b></i><i><em class="italic" style="white-space: pre-wrap;"> In the spirit of </em></i><a href="https://www.timescale.com/blog/blog/move-fast-but-dont-break-things-introducing-the-experimental-schema-with-new-experimental-features-in-timescaledb-2-4/"><i><em class="italic" style="white-space: pre-wrap;">moving fast and not breaking things</em></i></a><i><em class="italic" style="white-space: pre-wrap;">, the hyperfunctions in this section are released as experimental—please play around with them but don’t use them in production.</em></i></div></div><p>At Timescale, we’re huge fans of SQL. But as we’ve seen in many examples above, SQL can get quite unwieldy for certain kinds of analytical and time-series queries. 
Enter TimescaleDB Function Pipelines.</p><p><strong>TimescaleDB Function Pipelines</strong> radically improve the developer ergonomics of analyzing data in PostgreSQL and SQL, by applying principles from <a href="https://en.wikipedia.org/wiki/Functional_programming">functional programming</a> and popular tools like <a href="https://pandas.pydata.org/docs/index.html">Python’s Pandas</a> and <a href="https://prometheus.io/docs/prometheus/latest/querying/basics/">PromQL</a>. In short, they improve your coding productivity, making your SQL code easier for others to comprehend and maintain.</p><p>Inspired by functional programming languages, Function Pipelines enable you to analyze data by composing multiple functions, leading to a simpler, cleaner way of expressing complex logic in PostgreSQL.</p><p>And the best part: we built Function Pipelines in a fully PostgreSQL-compliant way! We did not change any SQL syntax, meaning that any tool that speaks PostgreSQL will be able to support data analysis using function pipelines.</p><p>To understand the power of TimescaleDB Function Pipelines, consider the following PostgreSQL query.</p><p><strong>Regular PostgreSQL query:</strong><br></p><pre><code class="language-SQL">SELECT device_id, 
	sum(abs_delta) as volatility
FROM (
	SELECT device_id, 
		abs(val - lag(val) OVER (PARTITION BY device_id ORDER BY ts))
        	as abs_delta 
	FROM measurements
	WHERE ts &gt;= now() - '1 day'::interval) calc_delta
GROUP BY device_id;
</code></pre>
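<p>For comparison, here is a sketch of how the same volatility calculation might be expressed with function pipelines. It assumes the experimental Toolkit functions (<code>timevector()</code>, <code>sort()</code>, <code>delta()</code>, <code>abs()</code>, <code>sum()</code>) are installed and reachable through your <code>search_path</code>; because these are experimental, names and behavior may differ between Toolkit versions:</p><pre><code class="language-SQL">-- Build a timevector per device, order it by time, take point-to-point
-- differences, make them absolute, and add them up.
SELECT device_id,
	timevector(ts, val) -&gt; sort() -&gt; delta() -&gt; abs() -&gt; sum()
		AS volatility
FROM measurements
WHERE ts &gt;= now() - '1 day'::interval
GROUP BY device_id;
</code></pre><p>Each <code>-&gt;</code> passes the result of the previous step to the next function, so the query reads in the order the work is done, without the subquery and window function.</p>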
<h2 id="supercharge-your-productivity-with-hyperfunctions-today">Supercharge Your Productivity With Hyperfunctions Today</h2><p><strong>Get started today: </strong>TimescaleDB hyperfunctions come pre-loaded and ready to use on every hosted and managed database service in Timescale, the easiest way to get started with TimescaleDB. <a href="https://console.cloud.timescale.com/signup">Get started with a free Timescale trial</a>—no credit card required. Or <a href="https://docs.timescale.com/install/latest/">download for free</a> with TimescaleDB self-managed.</p><p>If you’d like to jump straight into using TimescaleDB hyperfunctions on a real-world dataset, <a href="https://docs.timescale.com/timescaledb/latest/tutorials/nfl-analytics/#analyze-data-using-timescaledb-continuous-aggregates-and-hyperfunctions">start our tutorial</a>, which uses hyperfunctions to uncover insights about players and teams from the NFL (American football).  </p><p><strong>Learn more: </strong>If you’d like to learn more about TimescaleDB hyperfunctions and how to use them for your use case, read our <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/hyperfunctions/">How-To Guide</a> and the hyperfunctions <a href="https://docs.timescale.com/api/latest/hyperfunctions/">documentation</a>. </p><p>Can’t find the function you need? Open an issue on our <a href="https://github.com/timescale/timescaledb-toolkit/issues">GitHub project</a> or contact us on <a href="https://timescaledb.slack.com/">Slack</a> or via the <a href="http://timescale.com/forum/">Timescale Community Forum</a>. We love to work with our users to simplify SQL!</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How We Fixed Long-Running PostgreSQL now( ) Queries (and Made Them Lightning Fast)]]></title>
            <description><![CDATA[Help requests about slowdowns in PostgreSQL now( ) queries are a thing of the past. Learn how we fixed it in TimescaleDB 2.7 for lightning-fast performance (up to 400x faster!).]]></description>
            <link>https://www.tigerdata.com/blog/how-we-fixed-long-running-postgresql-now-queries</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/how-we-fixed-long-running-postgresql-now-queries</guid>
            <category><![CDATA[Announcements & Releases]]></category>
            <category><![CDATA[PostgreSQL Performance]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Sven Klemm]]></dc:creator>
            <pubDate>Wed, 22 Jun 2022 13:00:24 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/06/dog-racing-2878713_1920--1-.jpg">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/06/dog-racing-2878713_1920--1-.jpg" alt="How We Fixed Long-Running PostgreSQL now( ) Queries (and Made Them Lightning Fast)" /><p>It was just another regular Wednesday in our home offices when we received a question in the <a href="https://www.timescale.com/forum">forum</a> about a query with the Postgres now() function. A TimescaleDB user with dozens of tables of IoT data reported a slow degradation in query performance and a creeping server CPU usage. After struggling with the issue, they turned to our community for help.</p>
<!--kg-card-begin: html-->
<iframe src="https://giphy.com/embed/QBAzA0CaPCKwGg3pDs" width="480" height="270" frameBorder="0" class="giphy-embed" allowFullScreen></iframe><p><a href="https://giphy.com/gifs/groundhogday-groundhog-day-movie-QBAzA0CaPCKwGg3pDs">via GIPHY</a></p>

<!--kg-card-end: html-->
<p>That same question came up in our forum, <a href="http://timescaledb.slack.com">Community Slack</a>, and <a href="https://www.timescale.com/support">support</a> more often than we’d like. We could relate to this particular pain point because we also struggled with it in partitioned vanilla PostgreSQL. After a closer look at the user’s query, we found the usual suspect: the issue of high planning time in the presence of many chunks—<a href="https://docs.timescale.com/timescaledb/latest/overview/core-concepts/hypertables-and-chunks/">in Timescale slang, chunks are data partitions within a table</a>—and in a query using a rather common function: <code>now()</code>.</p><p>Usually, the problem with these queries is that the chunk exclusion happens late. Chunk exclusion means that data partitions that cannot contain matching rows are not even considered by the query, which speeds it up. The logic is simple: the less data a query has to go through, the faster it is.</p><p>However, the problem is that <code>now()</code>, <a href="https://www.postgresql.org/docs/current/xfunc-volatility.html">similarly to other stable functions in PostgreSQL</a>, is not considered during plan-time chunk exclusion, those precious moments in which your machine is trying to find the quickest way to execute your query while excluding some of your data partitions to further speed up the process. So, your chunks are only excluded later, at execution time, which results in higher planning time—and yes, you guessed it—slower performance.</p><p>Until now, every time this issue popped up, we knew what to do. 
We had written a wrapper function, marked as immutable, that would call the <code>now()</code> function and whose only purpose was to add the immutable marking so that PostgreSQL would consider it earlier during plan-time chunk exclusion, thus improving query performance.</p><p>Well, not anymore.</p><p><strong>Today, we’re announcing the optimization of the <code>now()</code> function with the release of TimescaleDB 2.7</strong>, which solves this problem by natively doing what our previous workaround did.</p><p>In this blog post, we’ll look at the basics of the <code>now()</code> function, explain how it works in vanilla PostgreSQL and in previous TimescaleDB versions, and wrap everything up with a description of our optimization, which evaluates <code>now()</code> expressions during plan-time chunk exclusion, significantly reducing planning time. Finally, we include a performance comparison that will blow you away (all we can say for now is “more than 400 times faster”).</p>
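<p>For context, a workaround of the kind described above could be sketched as follows. The function name <code>now_immutable</code> is hypothetical, and this illustrates the general approach, not necessarily the exact helper we used:</p><pre><code class="language-sql">-- Hypothetical wrapper: declaring it IMMUTABLE lets the planner treat the
-- call as a constant, which is what allows plan-time chunk exclusion to
-- consider it.
-- Caveat: now() is really STABLE, so a cached plan (for example, from a
-- prepared statement) can keep using a stale timestamp.
CREATE OR REPLACE FUNCTION now_immutable()
RETURNS timestamptz
LANGUAGE sql
IMMUTABLE
AS $$ SELECT now() $$;

-- Usage: the wrapper replaces now() in the time constraint.
SELECT * FROM hypertable
WHERE time &gt; now_immutable() - interval '5 minutes';
</code></pre>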
<!--kg-card-begin: html-->
<iframe src="https://giphy.com/embed/Gpu3skdN58ApO" width="480" height="382" frameBorder="0" class="giphy-embed" allowFullScreen></iframe><p><a href="https://giphy.com/gifs/funny-elephant-fast-Gpu3skdN58ApO">via GIPHY</a></p>
<!--kg-card-end: html-->
<p>If you are already a TimescaleDB user, <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/update-timescaledb/">check out our docs for instructions on how to upgrade</a>. If you are using Timescale, upgrades are automatic, so all you need to do is sit back and enjoy this very fast ride! (New to Timescale? <a href="https://console.cloud.timescale.com/signup">You can start a free 30-day trial, no credit card required</a>.)</p><h2 id="now-in-vanilla-postgresql">now( ) in Vanilla PostgreSQL</h2><p>Queries with <code>now()</code> expressions are common with time-series data to retrieve readings of the last five minutes, three hours, three days, or other time intervals. <a href="https://www.postgresql.org/docs/current/functions-datetime.html"><code>now()</code> is a function</a> that returns the current time or, more accurately, the start time of the current transaction. These queries usually only need data from the most recent partition in a <a href="https://www.tigerdata.com/blog/database-indexes-in-postgresql-and-timescale-cloud-your-questions-answered" rel="noreferrer">hypertable</a>, also called a chunk.</p><p>A query to retrieve readings from the last five minutes could look like this:</p><pre><code class="language-sql">SELECT * FROM hypertable WHERE time &gt; now() - interval '5 minutes';
</code></pre>
<p>To understand our users' slowdown, it’s vital to know that constraints in PostgreSQL can be constified at different stages in the planning process. The problem with <code>now()</code> is that it can only be constified during execution because the planning and execution times may differ. </p><p>Since <code>now()</code> is a stable function, it’s not considered for plan-time constraint exclusion; therefore, all chunks will have to be part of the planning process. For hypertables with many chunks, this query's total execution time is often dominated by planning time, resulting in poor query performance.</p><p>If we dig a little deeper with the EXPLAIN output, we can see that all chunks of the hypertable are part of the plan, painfully increasing it.<br></p><pre><code class="language-sql"> Append  (cost=0.00..1118.94 rows=1097 width=20)
   -&gt;  Seq Scan on _hyper_3_38356_chunk  (cost=0.00..1.01 rows=1 width=20)
         Filter: ("time" &gt; now())
   -&gt;  Seq Scan on _hyper_3_38357_chunk  (cost=0.00..1.01 rows=1 width=20)
         Filter: ("time" &gt; now())
   -&gt;  Seq Scan on _hyper_3_38358_chunk  (cost=0.00..1.01 rows=1 width=20)
         Filter: ("time" &gt; now())
   -&gt;  Seq Scan on _hyper_3_38359_chunk  (cost=0.00..1.01 rows=1 width=20)
         Filter: ("time" &gt; now())
   -&gt;  Seq Scan on _hyper_3_38360_chunk  (cost=0.00..1.01 rows=1 width=20)
         Filter: ("time" &gt; now())
   -&gt;  Seq Scan on _hyper_3_38361_chunk  (cost=0.00..1.01 rows=1 width=20)
         Filter: ("time" &gt; now())
</code></pre>
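<p>To check whether planning dominates a query like this on your own data, <code>EXPLAIN (ANALYZE)</code> reports planning and execution time separately. A minimal sketch, assuming a hypertable named <code>metrics1k</code> with a <code>time</code> column:</p><pre><code class="language-sql">-- ANALYZE actually runs the query; compare the "Planning Time" and
-- "Execution Time" lines at the bottom of the output.
EXPLAIN (ANALYZE)
SELECT * FROM metrics1k
WHERE time &gt; now() - interval '5 minutes';
</code></pre>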
<p>We had to do something to improve this, and so we did.</p><h2 id="now-in-timescaledb">now( ) in TimescaleDB</h2><p>As proud builders on top of PostgreSQL, we wanted to come up with a solution. Previous versions of TimescaleDB did not use the <code>now()</code> expression for plan-time constraint exclusion either.</p><p>Instead, we implemented constraint exclusion at execution time in a bid to improve query performance. If you want to learn more about how we did this, <a href="https://timescale.ghost.io/blog/implementing-constraint-exclusion-for-faster-query-performance/">check out this blog post, which offers a detailed behind-the-scenes explanation of what happens when you execute a query in PostgreSQL</a>.</p><p>While the resulting plan does look much slimmer than the original, all the chunks were still considered during planning and removed only during execution. So, even though the resulting plan looks very different (look at those 1,096 excluded chunks), the planning effort is very similar to the vanilla PostgreSQL plan.</p><pre><code class="language-sql">Custom Scan (ChunkAppend) on metrics1k  (cost=0.00..1113.45 rows=1097 width=20)
   Chunks excluded during startup: 1096
   -&gt;  Seq Scan on _hyper_3_39453_chunk  (cost=0.00..1.01 rows=1 width=20)
         Filter: ("time" &gt; now())
</code></pre>
<p>Close, but not good enough.</p><h2 id="now-were-talking">now( ) We're Talking</h2><p>With our latest release, TimescaleDB 2.7, we approached things differently, adding an optimization that would allow the evaluation of <code>now()</code> expressions during plan-time chunk exclusion.</p><p>Looking at the root of the problem, the reason why evaluating <code>now()</code> at plan time would not be correct is prepared statements. If you evaluate <code>now()</code> while planning but only execute the statement in a transaction half an hour later, the value no longer reflects the current time, <code>now()</code>.</p><p>However, <strong>a plan-time value can still be used safely for certain expressions even as time goes by.</strong> For example, a row that satisfies <code>time &gt;= now()</code> when the statement executes, whether that is right now, in 5 minutes, or in 10 hours, also satisfies the same comparison against the earlier plan-time value. So, when optimizing this, we looked for expressions that held as time passed and used those during plan-time exclusion. <br><br>The initial implementation of this feature works for intervals of hours, minutes, and seconds (e.g., <code>now() - '1 hour'</code>).<br><br>As you can see from the EXPLAIN output, chunks are no longer excluded during execution. The exclusion happens earlier, during planning, speeding up the query. Success!</p><pre><code class="language-sql"> Custom Scan (ChunkAppend) on metrics1k  (cost=0.00..1.02 rows=1 width=20)
   Chunks excluded during startup: 0
   -&gt;  Seq Scan on _hyper_3_39453_chunk  (cost=0.00..1.02 rows=1 width=20)
         Filter: (("time" &gt; '2022-05-24 12:41:31.266968+02'::timestamp with time zone) AND ("time" &gt; now()))
</code></pre>
<p>In the next TimescaleDB version, 2.8, we are removing the initial limitations of the <code>now()</code> optimization, making it also available for intervals of months and years. This means that you will be able to make the most of this improvement in a wider range of situations, as any <code>time &gt; now() - interval</code> expression will be usable during plan-time chunk exclusion.</p><pre><code class="language-sql"> Custom Scan (ChunkAppend) on metrics1k  (cost=0.00..1.02 rows=1 width=20)
   Chunks excluded during startup: 0
   -&gt;  Seq Scan on _hyper_3_39453_chunk  (cost=0.00..1.02 rows=1 width=20)
         Filter: ("time" &gt; now())
</code></pre>
<p>This code is already <a href="https://github.com/timescale/timescaledb/pull/4397">committed</a> to our <a href="https://github.com/timescale/timescaledb/pull/4393">GitHub repo</a> and will be available shortly.</p><h2 id="how-does-it-work">How Does It Work?</h2><p>But how did we make this current version happen? The optimization works by rewriting the constraint. For example:<br></p><pre><code class="language-sql">time &gt; now() - INTERVAL '5 min'
</code></pre>
<p>turns into</p><pre><code class="language-sql">(("time" &gt; (now() - '00:05:00'::interval)) AND ("time" &gt; '2022-06-10 09:58:04.224996+02'::timestamp with time zone))
</code></pre>
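<p>One way to look for the rewritten constraint on your own data is to inspect the plan of a prepared statement. A sketch, assuming TimescaleDB 2.7 or later and a hypertable named <code>metrics1k</code>; the statement name <code>recent_rows</code> is made up for this example:</p><pre><code class="language-sql">-- Prepare the statement once, then inspect the plan used to execute it.
PREPARE recent_rows AS
SELECT * FROM metrics1k
WHERE time &gt; now() - INTERVAL '5 min';

EXPLAIN EXECUTE recent_rows;

-- Clean up the prepared statement when done.
DEALLOCATE recent_rows;
</code></pre>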
<p>This means that the constified part of the constraint will be used during plan-time chunk exclusion. And, assuming that time only moves forward, the result will still be correct even in the presence of prepared statements, as the original constraint is ANDed with the constified value.</p><p>Rewriting the constraint makes the constified value available to plan-time constraint exclusion, leading to massive reductions in planning time, especially in the presence of many chunks.</p><p>So we know that this translates into faster queries. But how fast?</p><h2 id="performance-comparison%E2%80%94now-that-is-fast">Performance Comparison—now( ) That Is Fast!</h2><p>As shown in our table, the optimization’s performance improvement scales with the total number of chunks in the hypertable. The more data partitions you’re dealing with, the more you’ll notice the speed improvement—<strong>up to 401x faster in TimescaleDB 2.7</strong> for a total of 20,000 chunks when compared to the previous version.</p><p><code>now()</code> that is fast. 🔥</p><figure class="kg-card kg-image-card kg-width-wide"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/06/Screenshot-2022-06-23-at-10.19.45.png" class="kg-image" alt="" loading="lazy" width="605" height="247" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/06/Screenshot-2022-06-23-at-10.19.45.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/06/Screenshot-2022-06-23-at-10.19.45.png 605w"></figure><p><em>The table lists the total execution time of the query (at the beginning of the post) on hypertables with different numbers of chunks.</em></p><h2 id="now-go-try-it">now( ) Go Try It</h2><p>There are few things more satisfying for a developer than solving a problem for your users, especially a recurring one. Achieving such a performance optimization is just the icing on the cake. 
</p><p>If you want to experience the lightning-fast performance of PostgreSQL <code>now()</code> queries for yourself, TimescaleDB 2.7 is available for Timescale and self-managed TimescaleDB.</p><ul><li>If you are a Timescale user, you will be automatically upgraded to TimescaleDB 2.7. No action is required on your side. You can also create a free Timescale account to get <a href="https://console.cloud.timescale.com/signup">a free 30-day trial</a> (no credit card required).</li><li>If you are using TimescaleDB in your own instances, <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/update-timescaledb/">check out our docs for instructions on how to upgrade</a>.</li></ul><p>Once you’re using TimescaleDB, connect with us! You can find us in our <a href="http://slack.timescale.com/">Community Slack</a> and the <a href="http://timescale.com/forum/">Timescale Community Forum</a>. We’ll be more than happy to answer any question on query performance improvements, TimescaleDB, PostgreSQL, or other time-series issues.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How We Made Data Aggregation Better and Faster on PostgreSQL With TimescaleDB 2.7]]></title>
            <description><![CDATA[They’re so fast we can’t catch up! Check out our benchmarks with two datasets to learn how we used continuous aggregates to make queries up to 44,000x faster, while requiring 60 % less storage (on average). ]]></description>
            <link>https://www.tigerdata.com/blog/how-we-made-data-aggregation-better-and-faster-on-postgresql-with-timescaledb-2-7</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/how-we-made-data-aggregation-better-and-faster-on-postgresql-with-timescaledb-2-7</guid>
            <category><![CDATA[Engineering]]></category>
            <category><![CDATA[General]]></category>
            <category><![CDATA[Announcements & Releases]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Ryan Booz]]></dc:creator>
            <pubDate>Tue, 21 Jun 2022 12:58:38 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/06/candlesticks-2.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/06/candlesticks-2.png" alt="How We Made Data Aggregation Better and Faster on PostgreSQL With TimescaleDB 2.7" /><p><a href="https://timescale.ghost.io/blog/what-the-heck-is-time-series-data-and-why-do-i-need-a-time-series-database-dcf3b1b18563/">Time-series data</a> is the lifeblood of the analytics revolution in nearly every industry today. One of the most difficult challenges for application developers and data scientists is aggregating data efficiently without always having to query billions (or trillions) of raw data rows. Over the years, developers and databases have created numerous ways to solve this problem, usually similar to one of the following options:</p><ul><li><strong>DIY processes to pre-aggregate data and store it in regular tables</strong>. Although this provides a lot of flexibility, particularly with indexing and data retention, it's cumbersome to develop and maintain, especially when deciding how to track and update aggregates with data that arrives late or has been updated in the past.</li><li><strong>Extract, transform, and load (ETL) processes for longer-term analytics.</strong> Even today, development teams employ entire groups that specifically manage ETL processes for databases and applications because of the constant overhead of creating and maintaining the perfect process.</li><li><strong>Materialized views.</strong> While these views are flexible and easy to create, they are static snapshots of the aggregated data. Unfortunately, developers need to manage updates using triggers or cron-like applications in all current implementations. 
And in all but a very few databases, all historical data is replaced each time, preventing developers from dropping older raw data to save space and computation resources every time the data is refreshed.</li></ul><p>Most developers head down one of these paths because we learn, often the hard way, that running reports and analytic queries over the same raw data, request after request, doesn't perform well under heavy load. In truth, most raw time-series data doesn't change after it's been saved, so these complex aggregate calculations return the same results each time.</p><p>In fact, as a long-term time-series database developer, I've used all of these methods too, so that I could manage historical aggregate data to make reporting, dashboards, and analytics faster and more valuable, even under heavy usage.</p><p>I loved when customers were happy, even if it meant a significant amount of work behind the scenes maintaining that data.</p><p>But, I always wished for a more straightforward solution.</p><h2 id="how-timescaledb-improves-queries-on-aggregated-data-in-postgresql">How TimescaleDB Improves Queries on Aggregated Data in PostgreSQL</h2><p><a href="https://timescale.ghost.io/blog/continuous-aggregates-faster-queries-with-automatically-maintained-materialized-views/">In 2019, TimescaleDB introduced continuous aggregates to solve this very problem</a>, making the ongoing aggregation of massive time-series data easy and flexible. 
This is the feature that first caught my attention as a PostgreSQL developer looking to build more scalable time-series applications—precisely because I had been doing it the hard way for so long.</p><p>Continuous aggregates look and act like <a href="https://www.tigerdata.com/learn/guide-to-postgresql-views" rel="noreferrer">materialized views in PostgreSQL</a>, but with many of the additional features I was looking for (<a href="https://www.youtube.com/watch?v=1V1ADr6CKz4">if you want to learn more about views, materialized views, and continuous aggregates, check out this lesson from our Foundations of PostgreSQL and TimescaleDB course</a>). These are just some of the things they do:</p><ul><li>Automatically track changes and additions to the underlying raw data.</li><li>Provide configurable, user-defined policies to keep the materialized data up-to-date automatically.</li><li>Automatically append new data (as <a href="https://timescale.ghost.io/blog/achieving-the-best-of-both-worlds-ensuring-up-to-date-results-with-real-time-aggregation/">real-time aggregates</a> by default) before the scheduled process has materialized to disk. 
This setting is configurable.</li><li>Retain historical aggregated data even if the underlying raw data is dropped.</li><li>Can be compressed to reduce storage needs and further improve the performance of analytic queries.</li><li>Keep dashboards and reports running smoothly.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/06/ABL_CAGGS_Comparison-Chart_V2.0.png" class="kg-image" alt="Table comparing the functionality of PostgreSQL materialized views with continuous aggregates in TimescaleDB" loading="lazy" width="1800" height="2066" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/06/ABL_CAGGS_Comparison-Chart_V2.0.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/06/ABL_CAGGS_Comparison-Chart_V2.0.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2022/06/ABL_CAGGS_Comparison-Chart_V2.0.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/06/ABL_CAGGS_Comparison-Chart_V2.0.png 1800w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Table comparing the functionality of PostgreSQL materialized views with continuous aggregates in TimescaleDB</em></i></figcaption></figure><p>Once I tried continuous aggregates, I realized that TimescaleDB provided the solution that I (and many other PostgreSQL users) were looking for. With this feature, managing and analyzing massive volumes of time-series data in PostgreSQL finally felt fast and easy.</p><div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text">Want to make your queries even faster? 
Try <a href="https://www.timescale.com/blog/an-incremental-materialized-view-on-steroids-how-we-made-continuous-aggregates-even-better/" rel="noreferrer">hierarchical continuous aggregates</a>, a.k.a. continuous aggregates on top of continuous aggregates.</div></div><h2 id="what-about-other-databases">What About Other Databases?</h2><p>By now, some readers might be thinking something along these lines:</p><p><em>“Continuous aggregates may help with the management and analytics of time-series data in PostgreSQL, but that’s what NoSQL databases are for—they already provide the features you needed from the get-go. Why didn’t you try a NoSQL database?”</em></p><p>Well, I did.</p><p>There are numerous time-series and NoSQL databases on the market that attempt to solve this specific problem. I looked at (and used) many of them. But from my experience, nothing can quite match the advantages of a relational database with a feature like continuous aggregates for time-series data. These other options provide a lot of features for a myriad of use cases, but they weren't the right solution for this particular problem.</p><h3 id="what-about-mongodb">What about MongoDB?</h3><p><a href="https://www.mongodb.com/">MongoDB</a> has been the go-to for many data-intensive applications. Since version 4.2, it has included a feature called <a href="https://www.mongodb.com/docs/manual/core/materialized-views/">On-Demand Materialized Views</a>. On the surface, it works similarly to a materialized view by combining the <a href="https://www.mongodb.com/docs/manual/core/aggregation-pipeline/">Aggregation Pipeline</a> feature with a $merge operation to mimic ongoing updates to an aggregate data collection. However, there is no built-in automation for this process, and MongoDB doesn't keep track of any modifications to underlying data. 
The developer is still required to keep track of which time frames to materialize and how far back to look.</p><h3 id="what-about-influxdb">What about InfluxDB?</h3><p>For many years, <a href="https://www.influxdata.com/">InfluxDB</a> has been the destination for time-series applications. Although <a href="https://www.tigerdata.com/blog/what-is-high-cardinality" rel="noreferrer">we've discussed in other articles how InfluxDB doesn't scale effectively, particularly with high-cardinality datasets</a>, it does provide a feature called <a href="https://docs.influxdata.com/influxdb/v1.8/query_language/continuous_queries/">Continuous Queries</a>. This feature is also similar to a materialized view and goes one step further than MongoDB by automatically keeping the dataset updated. Unfortunately, it suffers from the same lack of raw data monitoring and doesn't provide nearly as much flexibility as SQL in how the datasets are created and stored.</p><h3 id="what-about-clickhouse">What about ClickHouse?</h3><p><a href="https://clickhouse.com/">ClickHouse</a>, and several recent forks like <a href="https://www.firebolt.io/">Firebolt</a>, have redefined the way some analytic workloads perform. Beyond its <a href="https://timescale.ghost.io/blog/what-is-clickhouse-how-does-it-compare-to-postgresql-and-timescaledb-and-how-does-it-perform-for-time-series-data/">impressive query performance</a>, it also provides a mechanism similar to a materialized view, backed by an <a href="https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/aggregatingmergetree/">AggregatingMergeTree</a> engine. In a sense, this provides almost real-time aggregated data because all inserts are saved to both the regular table and the materialized view. 
The biggest downside of this approach is dealing with updates or modifying the timing of the process.</p><h2 id="recent-improvements-in-continuous-aggregates-meet-timescaledb-27">Recent Improvements in Continuous Aggregates: Meet TimescaleDB 2.7</h2><p>Continuous aggregates were first introduced in <a href="https://timescale.ghost.io/blog/continuous-aggregates-faster-queries-with-automatically-maintained-materialized-views/">TimescaleDB 1.3</a>, bringing what many PostgreSQL users, including me, had been missing when combining time-series data with materialized views: automatic updates, real-time results, easy data management, and the option of using the view for downsampling.</p><p>But continuous aggregates have come a long way. One of the previous improvements was the introduction of <a href="https://timescale.ghost.io/blog/increase-your-storage-savings-with-timescaledb-2-6-introducing-compression-for-continuous-aggregates/">compression for continuous aggregates in TimescaleDB 2.6</a>. Now, we've taken it a step further with the arrival of TimescaleDB 2.7, which introduces dramatic performance improvements in continuous aggregates. <strong>They are now blazing fast: up to 44,000x faster in some queries than in previous versions. </strong></p><p>Let me give you one concrete example: <strong>in initial testing using live, real-time stock trade transaction data, typical candlestick aggregates were nearly 2,800x faster to query </strong>than in previous versions of continuous aggregates (which were already fast!).</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/06/sonic-run.gif" class="kg-image" alt="" loading="lazy" width="498" height="206"></figure><p>Later in this post, we will dig into the performance and storage improvements introduced by TimescaleDB 2.7 by presenting a complete benchmark of continuous aggregates using multiple datasets and queries. 
🔥</p><p>But the improvements don’t end here.</p><p>First, the new continuous aggregates also require 60&nbsp;% less storage (on average) than before for many common aggregates. <br><br>Second, in previous versions of TimescaleDB, continuous aggregates came with certain limitations: users, for example, could not define aggregates that used clauses like <code>DISTINCT</code>, <code>FILTER</code>, or <code>ORDER BY</code>. These limitations are now gone. TimescaleDB 2.7 ships with a completely redesigned materialization process that solves many of the previous usability issues, so you can use any aggregate function to define your continuous aggregate. <a href="https://docs.timescale.com/timescaledb/latest/overview/release-notes/">Check out our release notes for all the details on what's new.</a></p>
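<p>As a rough illustration of the lifted restrictions, the sketch below defines a continuous aggregate that uses <code>DISTINCT</code> and an ordered-set aggregate. It assumes TimescaleDB 2.7 or later and a hypertable shaped like the <code>stocks_real_time</code> table used in the benchmarks in this post, with a numeric <code>price</code> column; the view name <code>daily_symbol_stats</code> and the column aliases are hypothetical.</p><pre><code class="language-sql">-- Hypothetical example: aggregate forms that, per this post, are accepted from TimescaleDB 2.7 onward.
CREATE MATERIALIZED VIEW daily_symbol_stats
WITH (timescaledb.continuous) AS
    SELECT
        time_bucket('1 day', time) AS bucket,
        -- Aggregate with DISTINCT: number of different symbols traded per day
        COUNT(DISTINCT symbol) AS symbols_traded,
        -- Ordered-set aggregate: median trade price per day
        percentile_cont(0.5) WITHIN GROUP (ORDER BY price) AS median_price
    FROM stocks_real_time
    GROUP BY bucket;</code></pre>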
<!--kg-card-begin: html-->
<div class="highlight">
	
    <p class="highlight__text">
        <svg width="17" height="16" viewBox="0 0 17 16" fill="none" xmlns="http://www.w3.org/2000/svg">
	</svg>
✨ A big thank you to the Timescale engineers who made the improvements in continuous aggregates possible, with special mentions to Fabrízio Mello, Markos Fountoulakis, and David Kohn.
    </p>
</div>

<!--kg-card-end: html-->
<p><br>And now, the fun part.</p><h2 id="show-me-the-numbers-benchmarking-aggregate-queries">Show Me the Numbers: Benchmarking Aggregate Queries</h2><p>To test the new version of continuous aggregates, we chose two datasets that represent common time-series datasets: IoT and financial analysis.</p><ul><li><strong>IoT dataset (~1.7 billion rows): </strong>The IoT data we leveraged is the New York City Taxicab dataset that's been maintained by Todd Schneider for a number of years, and scripts are available in his <a href="https://github.com/toddwschneider/nyc-taxi-data">GitHub repository</a> to load data into PostgreSQL. Unfortunately, a week after his latest update, the transit authority that maintains the actual datasets changed their long-standing export data format from CSV to Parquet—which means the current scripts will not work. Therefore, the dataset we tested with is from data prior to that change and covers ride information from 2014 to 2021.</li><li><strong>Stock transactions dataset (~23.7 million rows): </strong>The financial dataset we used is a real-time stock trade dataset provided by <a href="https://twelvedata.com/">Twelve Data</a> and ingests ongoing transactions for the top 100 stocks by volume from February 2022 until now. Real-time transaction data is typically the source of many stock trading analysis applications requiring aggregate rollups over intervals for visualizations like <a href="https://docs.timescale.com/timescaledb/latest/tutorials/financial-candlestick-tick-data/create-candlestick-aggregates/">candlestick charts </a>and machine learning analysis. 
While our example dataset is smaller than a full-fledged financial application would maintain, it provides a working example of ongoing data ingestion using continuous aggregates, TimescaleDB <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/compression/about-compression/">native compression</a>, and <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/data-retention/about-data-retention/">automated raw data retention</a> (while keeping aggregate data for long-term analysis).</li></ul><p>You can use a sample of this data, generously provided by Twelve Data, to try all of the improvements in TimescaleDB 2.7 by following <a href="https://docs.timescale.com/timescaledb/latest/tutorials/ingest-real-time-websocket-data/">this tutorial</a>, which provides stock trade data for the last 30 days. Once you have the database set up, you can take it a step further by registering for an API key and <a href="https://docs.timescale.com/timescaledb/latest/tutorials/ingest-real-time-websocket-data/">following our tutorial to ingest ongoing transactions from the Twelve Data API</a>.</p><h3 id="creating-continuous-aggregates-using-standard-postgresql-aggregate-functions">Creating Continuous Aggregates Using Standard PostgreSQL Aggregate Functions<br></h3><p>The first thing we benchmarked was an aggregate query that uses standard PostgreSQL aggregate functions like <code>MIN()</code>, <code>MAX()</code>, and <code>AVG()</code>. For each dataset we tested, we created the same continuous aggregate in TimescaleDB 2.6.1 and 2.7, ensuring that both aggregates had computed and stored the same number of rows. </p><p><strong>IoT dataset</strong></p><p>This continuous aggregate resulted in 1,760,000 rows of aggregated data spanning seven years of data.</p><pre><code class="language-sql">CREATE MATERIALIZED VIEW hourly_trip_stats
WITH (timescaledb.continuous, timescaledb.finalized=false) 
AS
SELECT 
	time_bucket('1 hour',pickup_datetime) bucket,
	avg(fare_amount) avg_fare,
	min(fare_amount) min_fare,
	max(fare_amount) max_fare,
	avg(trip_distance) avg_distance,
	min(trip_distance) min_distance,
	max(trip_distance) max_distance,
	avg(congestion_surcharge) avg_surcharge,
	min(congestion_surcharge) min_surcharge,
	max(congestion_surcharge) max_surcharge,
	cab_type_id,
	passenger_count
FROM 
	trips
GROUP BY 
	bucket, cab_type_id, passenger_count;</code></pre><p><strong>Stock transactions dataset</strong></p><p>This continuous aggregate resulted in 950,000 rows of data at the time of testing, although these are updated as new data comes in.</p><pre><code class="language-sql">CREATE MATERIALIZED VIEW five_min_candle_delta
WITH (timescaledb.continuous) AS
    SELECT
        time_bucket('5 minute', time) AS bucket,
        symbol,
        FIRST(price, time) AS "open",
        MAX(price) AS high,
        MIN(price) AS low,
        LAST(price, time) AS "close",
        MAX(day_volume) AS day_volume,
        (LAST(price, time)-FIRST(price, time))/FIRST(price, time) AS change_pct
    FROM stocks_real_time srt
    GROUP BY bucket, symbol;
</code></pre><p>To test the performance of these two continuous aggregates, we selected the following queries, all common among our users for both the IoT and financial use cases:</p><ol><li><code>SELECT COUNT(*)</code></li><li><code>SELECT COUNT(*)</code> with <code>WHERE</code></li><li><code>ORDER BY</code></li><li><code>time_bucket</code> reaggregation</li><li><code>FILTER</code></li><li><code>HAVING</code></li></ol><p>Let’s take a look at the results.</p><h3 id="query-1-%60select-count-from%E2%80%A6%60">Query #1: <code>SELECT COUNT(*) FROM…</code></h3><p>Doing a <code>COUNT(*)</code> in PostgreSQL is a known performance bottleneck. It's one of the reasons we created the <a href="https://docs.timescale.com/api/latest/hyperfunctions/approximate_row_count/"><code>approximate_row_count()</code></a> function in TimescaleDB, which uses table statistics to provide a close approximation of the overall row count. However, it's instinctual for most users (and ourselves, if we're honest) to try to get a quick row count by doing a <code>COUNT(*)</code> query:</p><pre><code class="language-sql">-- IoT dataset
SELECT count(*) FROM hourly_trip_stats;
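-- Sketch for comparison: estimate the row count of the raw hypertable from table
-- statistics instead of scanning it. The result is approximate and depends on
-- up-to-date statistics (for example, after a recent ANALYZE).
SELECT approximate_row_count('trips');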

-- Stock transactions dataset
SELECT count(*) FROM five_min_candle_delta;</code></pre><p>Most users noticed that in previous versions of TimescaleDB, running a <code>COUNT</code> over the materialized data seemed slower than it should be. <br></p><p>Thinking about our two example datasets, both continuous aggregates reduce the overall row count from raw data by 20x or more. So, while counting rows in PostgreSQL is slow, it always felt a little slower than it had to be. The reason was that not only did PostgreSQL have to scan and count all of the rows of data, it also had to group the data a second time because of some additional data that TimescaleDB stored as part of the original design of continuous aggregates. With the new design of continuous aggregates in TimescaleDB 2.7, that second grouping is no longer required, and PostgreSQL can just query the data normally, translating into faster queries.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/06/Query--1-2.png" class="kg-image" alt="Table comparing the performance of a query with SELECT COUNT (*) in a continuous aggregate in TimescaleDB 2.6.1 and TimescaleDB 2.7" loading="lazy" width="1351" height="441" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/06/Query--1-2.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/06/Query--1-2.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/06/Query--1-2.png 1351w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Performance of a query with SELECT COUNT (*) in a continuous aggregate in TimescaleDB 2.6.1 and TimescaleDB 2.7</em></i></figcaption></figure><h3 id="query-2-select-count-based-on-the-value-of-a-column">Query #2: SELECT COUNT(*) Based on the Value of a Column</h3><p>Another common query 
that many analytic applications perform is to count the number of records where the aggregate value is within a certain range:</p><pre><code class="language-sql">-- IoT  dataset
SELECT count(*) FROM hourly_trip_stats
WHERE avg_fare &gt; 13.1
AND bucket &gt; '2018-01-01' AND bucket &lt; '2019-01-01';

-- Stock transactions dataset
SELECT count(*) FROM five_min_candle_delta
WHERE change_pct &gt; 0.02;
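
-- Sketch, assuming the finalized continuous aggregate format (TimescaleDB 2.7+):
-- index the filtered column so the planner may use an index for the predicate above
-- instead of scanning every materialized row. The index name is hypothetical.
CREATE INDEX five_min_candle_delta_change_pct_idx
    ON five_min_candle_delta (change_pct);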
</code></pre><p>In previous versions of continuous aggregates, TimescaleDB had to finalize the value before it could be filtered against the predicate value, which caused queries to perform more slowly. With the new version of continuous aggregates, PostgreSQL can now search for the value directly, <em>and</em> we can add an index to meaningful columns to speed up the query even more!</p><p>In the case of the financial dataset, we see a very significant improvement: 1,336x faster. The large change in performance can be attributed to the formula query that has to be calculated over all of the rows of data in the continuous aggregate. With the IoT dataset, we're comparing against a simple average function, but for the stock data, multiple values have to be finalized (FIRST/LAST) before the formula can be calculated and used for the filter.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/06/Query--2-2.png" class="kg-image" alt="Table comparing the performance of a query with SELECT COUNT (*) plus WHERE in a continuous aggregate in TimescaleDB 2.6.1 and TimescaleDB 2.7." 
loading="lazy" width="1351" height="441" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/06/Query--2-2.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/06/Query--2-2.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/06/Query--2-2.png 1351w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Performance of a query with SELECT COUNT (*) plus WHERE in a continuous aggregate in TimescaleDB 2.6.1 and TimescaleDB 2.7</em></i></figcaption></figure><h3 id="query-3-select-top-10-rows-by-value">Query #3: Select Top 10 Rows by Value<br></h3><p>Taking the first example a step further, it's very common to query data within a range of time and get the top rows:</p><pre><code class="language-sql">-- IoT dataset
SELECT * FROM hourly_trip_stats
ORDER BY avg_fare DESC
LIMIT 10;
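
-- Sketch: to compare against materialized-only results, switch off real-time
-- aggregation for the view and re-run the query above. Setting the option back
-- to false re-enables real-time results.
ALTER MATERIALIZED VIEW hourly_trip_stats SET (timescaledb.materialized_only = true);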

-- Stock transactions dataset
SELECT * FROM five_min_candle_delta
ORDER BY change_pct DESC 
LIMIT 10;</code></pre><p>In this case, we tested queries with the continuous aggregate set to provide <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/continuous-aggregates/real-time-aggregates/">real-time results</a> (the default for continuous aggregates) and materialized-only results. When set to real-time, TimescaleDB always queries data that's been materialized first and then appends (with a <code>UNION</code>) any newer data that exists in the raw data but that has not yet been materialized by the ongoing refresh policy. And, because it's now possible to index columns within the continuous aggregate, we added an index on the <code>ORDER BY</code> column.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/06/Query--3-1.png" class="kg-image" alt="Table comparing the performance of a query with ORDER BY in a continuous aggregate TimescaleDB 2.6.1 and TimescaleDB 2.7." loading="lazy" width="1351" height="606" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/06/Query--3-1.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/06/Query--3-1.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/06/Query--3-1.png 1351w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Performance of a query with ORDER BY in a continuous aggregate TimescaleDB 2.6.1 and TimescaleDB 2.7</em></i></figcaption></figure><p><strong>Yes, you read that correctly. Nearly 45,000x better performance on  <code>ORDER BY</code> </strong>when the query only searches through materialized data.</p><p>The dramatic difference between real-time and materialized-only queries is because of the <code>UNION</code> of both materialized and raw aggregate data. 
The PostgreSQL planner needs to union the total result before it can limit the query to 10 rows (in our example), so all of the data from both tables needs to be read and ordered first. When you only query materialized data, PostgreSQL and TimescaleDB know that they can query just the index of the materialized data.</p><p>Again, storing the finalized form of your data and indexing column values dramatically impacts the querying performance of historical aggregate data! And all of this is updated continuously over time in a non-destructive way, something that isn't possible with materialized views in vanilla PostgreSQL.</p><h3 id="query-4-timescale-hyperfunctions-to-re-aggregate-into-higher-time-buckets">Query #4: Timescale Hyperfunctions to Re-aggregate Into Higher Time Buckets</h3><p>Another example we wanted to test was the impact finalizing data values has on our suite of <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/hyperfunctions/">analytical hyperfunctions</a>. Many of the hyperfunctions we provide as part of the <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/hyperfunctions/install-toolkit/">TimescaleDB Toolkit</a> utilize custom aggregate values that allow many different values to be accessed later depending on the needs of an application or report. Furthermore, these aggregate values can be <a href="https://timescale.ghost.io/blog/how-postgresql-aggregation-works-and-how-it-inspired-our-hyperfunctions-design-2/">re-aggregated into different size time buckets</a>. This means that if the aggregate functions fit your use case, one continuous aggregate can produce results for many different <code>time_bucket</code> sizes! This is a feature many users have asked for over time, and hyperfunctions make this possible.</p><p>For this example, we only examined the New York City Taxicab dataset to benchmark the impact of finalized continuous aggregates. 
Currently, there is no aggregate hyperfunction that aligns with the OHLC values needed for the stock dataset. However, <a href="https://github.com/timescale/timescaledb-toolkit/issues/445">there is a feature request</a> for it! (😉)</p><p>Although there are currently no one-to-one hyperfunctions that provide exact replacements for our min/max/avg example, we can still observe the query improvement using a <code>tdigest</code> value for each of the columns in our original query.</p><p><strong>Original min/max/avg continuous aggregate for multiple columns:</strong></p><pre><code class="language-sql">CREATE MATERIALIZED VIEW hourly_trip_stats
WITH (timescaledb.continuous, timescaledb.finalized=false) 
AS
SELECT 
	time_bucket('1 hour',pickup_datetime) bucket,
	avg(fare_amount) avg_fare,
	min(fare_amount) min_fare,
	max(fare_amount) max_fare,
	avg(trip_distance) avg_distance,
	min(trip_distance) min_distance,
	max(trip_distance) max_distance,
	avg(congestion_surcharge) avg_surcharge,
	min(congestion_surcharge) min_surcharge,
	max(congestion_surcharge) max_surcharge,
	cab_type_id,
	passenger_count
FROM 
	trips
GROUP BY 
	bucket, cab_type_id, passenger_count
</code></pre><p><strong>Hyperfunction-based continuous aggregate for multiple columns:</strong></p><pre><code class="language-sql">CREATE MATERIALIZED VIEW hourly_trip_stats_toolkit
WITH (timescaledb.continuous, timescaledb.finalized=false) 
AS
SELECT 
	time_bucket('1 hour',pickup_datetime) bucket,
	tdigest(1,fare_amount) fare_digest,
	tdigest(1,trip_distance) distance_digest,
	tdigest(1,congestion_surcharge) surcharge_digest,
	cab_type_id,
	passenger_count
FROM 
	trips
GROUP BY 
	bucket, cab_type_id, passenger_count</code></pre><p>With the continuous aggregate created, we then queried this data in two different ways:</p><p><strong>1. We used the same <code>time_bucket()</code> size defined in the continuous aggregate, which in this example was one hour.</strong></p><pre><code class="language-sql">SELECT 
	bucket AS b,
	cab_type_id, 
	passenger_count,
	min_val(ROLLUP(fare_digest)),
	max_val(ROLLUP(fare_digest)),
	mean(ROLLUP(fare_digest))
FROM hourly_trip_stats_toolkit
WHERE bucket &gt; '2021-05-01' AND bucket &lt; '2021-06-01'
GROUP BY b, cab_type_id, passenger_count 
ORDER BY b DESC, cab_type_id, passenger_count;</code></pre><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/06/Query--4.png" class="kg-image" alt="Table comparing the performance of a query with time_bucket() in a continuous aggregate in TimescaleDB 2.6.1 and TimescaleDB 2.7 (the query uses the same bucket size as the definition of the continuous aggregate)" loading="lazy" width="1350" height="354" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/06/Query--4.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/06/Query--4.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/06/Query--4.png 1350w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Performance of a query with time_bucket() in a continuous aggregate in TimescaleDB 2.6.1 and TimescaleDB 2.7 (the query uses the same bucket size as the definition of the continuous aggregate)</em></i></figcaption></figure><p><strong>2. We re-aggregated the data from one-hour buckets into one-day buckets. </strong>This allows us to efficiently query different bucket lengths based on the original bucket size of the continuous aggregate.</p><pre><code class="language-sql">SELECT 
	time_bucket('1 day', bucket) AS b,
	cab_type_id, 
	passenger_count,
	min_val(ROLLUP(fare_digest)),
	max_val(ROLLUP(fare_digest)),
	mean(ROLLUP(fare_digest))
FROM hourly_trip_stats_toolkit
WHERE bucket &gt; '2021-05-01' AND bucket &lt; '2021-06-01'
GROUP BY b, cab_type_id, passenger_count 
ORDER BY b DESC, cab_type_id, passenger_count;</code></pre><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/06/Query--4-2.png" class="kg-image" alt="Table comparing the performance of a query with time_bucket() in a continuous aggregate in TimescaleDB 2.6.1 and TimescaleDB 2.7. The query re-aggregates the data from one-hour buckets into one-day buckets." loading="lazy" width="1350" height="354" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/06/Query--4-2.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/06/Query--4-2.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/06/Query--4-2.png 1350w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Performance of a query with time_bucket() in a continuous aggregate in TimescaleDB 2.6.1 and TimescaleDB 2.7. The query re-aggregates the data from one-hour buckets into one-day buckets</em></i></figcaption></figure><p>In this case, the speed is almost identical because the same amount of data has to be queried. But if these aggregates satisfy your data requirements, only one continuous aggregate would be necessary in many cases, rather than a different continuous aggregate for each bucket size (one minute, five minutes, one hour, etc.)</p><h3 id="query-5-pivot-queries-with-filter">Query #5: Pivot Queries With FILTER</h3><p><a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/continuous-aggregates/about-continuous-aggregates/#function-support">In previous versions of continuous aggregates, many common SQL features were not permitted</a> because of how the partial data was stored and finalized later. 
Using a PostgreSQL <code>FILTER</code> clause was one such restriction.</p><p>For example, we took the IoT dataset and created a simple <code>COUNT(*)</code> aggregate to calculate the number of taxi rides per company (<code>cab_type_id</code>) for each hour. Before TimescaleDB 2.7, you had to store this data in a narrow format, with one row in the continuous aggregate for each cab type and hour.</p><pre><code class="language-sql">CREATE MATERIALIZED VIEW hourly_ride_counts_by_type 
WITH (timescaledb.continuous, timescaledb.finalized=false) 
AS
SELECT 
	time_bucket('1 hour',pickup_datetime) bucket,
	cab_type_id,
  	COUNT(*)
FROM trips
  	WHERE cab_type_id IN (1,2)
GROUP BY 
	bucket, cab_type_id;</code></pre><p>To then query this data in a pivoted fashion, we could <code>FILTER</code> the continuous aggregate data after the fact.</p><pre><code class="language-sql">SELECT bucket,
	sum(count) FILTER (WHERE cab_type_id IN (1)) yellow_cab_count,
  	sum(count) FILTER (WHERE cab_type_id IN (2)) green_cab_count
FROM hourly_ride_counts_by_type
WHERE bucket &gt; '2021-05-01' AND bucket &lt; '2021-06-01'
GROUP BY bucket
ORDER BY bucket;</code></pre><p>In TimescaleDB 2.7, you can now store the aggregated data using a <code>FILTER</code> clause to achieve the same result in one step!</p><pre><code class="language-sql">CREATE MATERIALIZED VIEW hourly_ride_counts_by_type_new 
WITH (timescaledb.continuous) 
AS
SELECT 
	time_bucket('1 hour',pickup_datetime) bucket,
  	COUNT(*) FILTER (WHERE cab_type_id IN (1)) yellow_cab_count,
  	COUNT(*) FILTER (WHERE cab_type_id IN (2)) green_cab_count
FROM trips
GROUP BY 
	bucket;</code></pre><p>Querying this data is much simpler, too, because the data is already pivoted and finalized.</p><pre><code class="language-sql">SELECT * FROM hourly_ride_counts_by_type_new 
WHERE bucket &gt; '2021-05-01' AND bucket &lt; '2021-06-01'
ORDER BY bucket;
</code></pre><p>This saves storage (50&nbsp;% fewer rows in this case) and the CPU previously needed to finalize the <code>COUNT(*)</code> and then filter the results by <code>cab_type_id</code> on every query. We can see this in the query performance numbers.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/06/Query--5.png" class="kg-image" alt="Table comparing the performance of a query with FILTER in a continuous aggregate in TimescaleDB 2.6.1 and TimescaleDB 2.7." loading="lazy" width="1350" height="354" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/06/Query--5.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/06/Query--5.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/06/Query--5.png 1350w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Performance of a query with FILTER in a continuous aggregate in TimescaleDB 2.6.1 and TimescaleDB 2.7.</em></i></figcaption></figure><p>Being able to use <code>FILTER</code> and other SQL features improves both developer experience and flexibility long term!</p><h3 id="query-6-having-stores-significantly-less-materialized-data">Query #6: HAVING Stores Significantly Less Materialized Data</h3><p>As a final example of how the improvements to continuous aggregates will impact your day-to-day development and analytics processes, let's look at a simple query that uses a <code>HAVING</code> clause to reduce the number of rows that the aggregate stores.</p><p>In previous versions of TimescaleDB, the <code>HAVING</code> clause couldn't be applied at materialization time. Instead, it was applied after the fact to all of the aggregated data as it was finalized. 
In many cases, this dramatically affected both the speed of queries to the continuous aggregate and the amount of data stored overall.</p><p>Using our stock data as an example, let's create a continuous aggregate that only stores a row of data if the <code>change_pct</code> value is greater than 2&nbsp;% (the <code>.02</code> threshold in the <code>HAVING</code> clause). This would indicate that a stock price changed dramatically over one hour, something we don't expect to see in most hourly stock trades.</p><pre><code class="language-sql">CREATE MATERIALIZED VIEW one_hour_outliers
WITH (timescaledb.continuous) AS
    SELECT
        time_bucket('1 hour', time) AS bucket,
        symbol,
        FIRST(price, time) AS "open",
        MAX(price) AS high,
        MIN(price) AS low,
        LAST(price, time) AS "close",
        MAX(day_volume) AS day_volume,
        (LAST(price, time)-FIRST(price, time))/LAST(price, time) AS change_pct
    FROM stocks_real_time srt
    GROUP BY bucket, symbol
   HAVING (LAST(price, time)-FIRST(price, time))/LAST(price, time) &gt; .02;</code></pre><p>Once the dataset is created, we can query each aggregate to see how many rows matched our criteria.</p><pre><code class="language-sql">SELECT count(*) FROM one_hour_outliers;</code></pre><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/06/Query--6.png" class="kg-image" alt="Table comparing the performance of a query with HAVING in a continuous aggregate in TimescaleDB 2.6.1 and TimescaleDB 2.7." loading="lazy" width="1351" height="354" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/06/Query--6.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/06/Query--6.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/06/Query--6.png 1351w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Performance of a query with HAVING in a continuous aggregate in TimescaleDB 2.6.1 and TimescaleDB 2.7</em></i></figcaption></figure><p>The biggest difference here (and the one that will more negatively impact the performance of your application over time) is the storage size of this aggregated data. Because TimescaleDB 2.7 only stores rows that meet the criteria, the data footprint is significantly smaller!</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/06/Query--6-2.png" class="kg-image" alt="Table comparing the storage footprint of a continuous aggregate bucketing stock transactions by the hour in TimescaleDB 2.6.1 and TimescaleDB 2.7." 
loading="lazy" width="1351" height="354" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/06/Query--6-2.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/06/Query--6-2.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/06/Query--6-2.png 1351w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Storage footprint of a continuous aggregate bucketing stock transactions by the hour in TimescaleDB 2.6.1 and TimescaleDB 2.7</em></i></figcaption></figure><h2 id="storage-savings-in-timescaledb-27"><br>Storage Savings in TimescaleDB 2.7</h2><p>One of the final pieces of this update that excites us is how much storage will be saved over time. On many occasions, users with large datasets that contained complex equations in their continuous aggregates would join our <a href="https://slack.timescale.com/">Slack community</a> to ask why more storage is required for the rolled-up aggregate than the raw data.</p><p>In every case we've tested, the new, finalized form of continuous aggregates is smaller than the same example in previous versions of TimescaleDB, with or without a <code>HAVING</code> clause that might filter additional data out.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/06/Storage-Savings.png" class="kg-image" alt="Table comparing the storage footprint of a query with HAVING in a continuous aggregate in TimescaleDB 2.6.1 and TimescaleDB 2.7." 
loading="lazy" width="1351" height="654" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/06/Storage-Savings.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/06/Storage-Savings.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/06/Storage-Savings.png 1351w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Storage savings for different continuous aggregates in TimescaleDB 2.6.1 and TimescaleDB 2.7</em></i></figcaption></figure><h2 id="the-new-continuous-aggregates-are-a-game-changer">The New Continuous Aggregates Are a Game-Changer</h2><p>For those dealing with massive amounts of time-series data, continuous aggregates are the best way to solve a problem that has long haunted PostgreSQL users. The following list details how continuous aggregates expand materialized views:</p><ul><li>They always stay up-to-date, automatically tracking changes in the source table for targeted, efficient updates of materialized data.</li><li>You can use configurable policies to conveniently manage refresh/update interval.</li><li>You can keep your materialized data even after the raw data is dropped, allowing you to downsample your large datasets.</li><li>And you can compress older data to save space and improve analytic queries.</li></ul><p>And in TimescaleDB 2.7, continuous aggregates got much better. First, they are blazing fast: as we demonstrated with our benchmark, the performance of continuous aggregates got consistently better across queries and datasets, up to thousands of times better for common queries. 
They also got lighter, requiring an average of 60&nbsp;% less storage.</p><p>Beyond the performance improvements and storage savings, continuous aggregates now support several types of aggregate queries that were previously off-limits, such as:</p><ul><li>Aggregates with DISTINCT</li><li>Aggregates with FILTER</li><li>Aggregates with FILTER in HAVING clause</li><li>Aggregates without combine function</li><li>Ordered-set aggregates</li><li>Hypothetical-set aggregates</li></ul><p>This new version of continuous aggregates is available by default in <a href="https://docs.timescale.com/timescaledb/latest/overview/release-notes/">TimescaleDB 2.7</a>: now, when you create a new continuous aggregate, you will automatically benefit from all the latest changes. <a href="https://docs.timescale.com/timescaledb/latest/overview/release-notes/">Read our release notes for more information on TimescaleDB 2.7</a>, and for instructions on how to upgrade, <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/update-timescaledb/">check out our docs.</a></p><p>Looking to migrate your existing continuous aggregates to the new version? Now, with TimescaleDB 2.8.1, you don’t have to worry about <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/continuous-aggregates/migrate/">migrating from the old continuous aggregates to the new</a>. Say hello to our frictionless migration, an in-place upgrade that avoids disrupting queries over continuous aggregates in applications and dashboards, including when the data is no longer in the original <a href="https://www.tigerdata.com/blog/database-indexes-in-postgresql-and-timescale-cloud-your-questions-answered" rel="noreferrer">hypertable</a>.</p>
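<p>To make the list of newly supported aggregates more concrete, here is a minimal sketch of a continuous aggregate that combines a <code>DISTINCT</code> aggregate with a <code>FILTER</code> aggregate. The table name <code>stocks_real_time</code>, the <code>symbol</code> column, and the price threshold are assumptions made for illustration; only <code>time</code> and <code>price</code> appear in the examples above.</p><pre><code class="language-sql">-- Hypothetical hypertable: stocks_real_time(time, symbol, price).
-- Sketch only: assumes TimescaleDB 2.7 or later.
CREATE MATERIALIZED VIEW one_hour_symbol_stats
WITH (timescaledb.continuous) AS
SELECT time_bucket(INTERVAL '1 hour', time) AS bucket,
       symbol,
       -- Aggregate with DISTINCT
       count(DISTINCT price) AS distinct_prices,
       -- Aggregate with FILTER (the threshold of 100 is arbitrary)
       count(*) FILTER (WHERE price &gt; 100) AS trades_over_100
FROM stocks_real_time
GROUP BY bucket, symbol;</code></pre>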
<!--kg-card-begin: html-->
<div class="highlight">
	
    <p class="highlight__text">
        <svg width="17" height="16" viewBox="0 0 17 16" fill="none" xmlns="http://www.w3.org/2000/svg">
	</svg>
☁️🐯 Timescale avoids the manual work involved in updating your TimescaleDB version. Updates take place automatically during a maintenance window picked by you. 
        <a href="https://docs.timescale.com/cloud/latest/service-operations/maintenance/"								target="_blank">Learn more about maintenance and automatic version updates in Timescale,</a> 
and to test it yourself, 
                <a href="https://www.timescale.com/timescale-signup/"								target="_blank">start a free trial!</a>  
    </p>
</div>

<!--kg-card-end: html-->
]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The 2022 State of PostgreSQL Survey Is Now Open!]]></title>
            <description><![CDATA[We’re surveying the PostgreSQL community for the third year to learn more about how developers use and deploy the open-source database and surface collective trends and opportunities.]]></description>
            <link>https://www.tigerdata.com/blog/the-2022-state-of-postgresql-survey-is-now-open</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/the-2022-state-of-postgresql-survey-is-now-open</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[State of PostgreSQL]]></category>
            <dc:creator><![CDATA[Team Tiger Data]]></dc:creator>
            <pubDate>Mon, 06 Jun 2022 15:05:46 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/06/SoP-2022-Lockup.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/06/SoP-2022-Lockup.png" alt="The 2022 State of PostgreSQL Survey Is Now Open!" /><p>Our love for PostgreSQL runs deep. <a href="http://www.timescale.com">We built our products on PostgreSQL</a>, <a href="https://timescale.ghost.io/blog/the-future-of-community-in-light-of-babelfish/">are proud members of the PostgreSQL community,</a> <a href="https://www.youtube.com/playlist?list=PLsceB9ac9MHRnmNZrCn_TWkUrCBCPR3mc">and wouldn’t exist without it and the extensibility it provides</a>.</p><p>In 2019, Timescale launched the first <em>State of PostgreSQL report</em>, advancing our desire to provide greater insights into the specificities and features useful to the PostgreSQL community. Following a one-year hiatus due to the pandemic and after the 2021 survey submissions, <a href="https://timescale.ghost.io/blog/2021-state-of-postgres-survey-results/">we released the 2021 report</a>.</p><p>We are pleased to announce that the 2022 survey is now open for submissions! We are keen to learn more about how you use PostgreSQL for work and personal projects, how you deploy it, and how we can collectively improve it.</p><div class="kg-card kg-callout-card kg-callout-card-yellow"><div class="kg-callout-emoji">✨</div><div class="kg-callout-text">Help us give back to this awesome group: answer survey questions and share with other PostgreSQL users. We are excited to hear your thoughts and spark a conversation that will keep us moving forward and building better things together. 🙌 <b><strong style="white-space: pre-wrap;">We will share our report (as well as give you full and free access to the survey’s anonymized raw data) in July. </strong></b>Thank you for being a part of the community!</div></div>
<!--kg-card-begin: html-->
<div class="gray-cta-box">
    <a style="width: auto; display: flex; justify-content: center; align-items: center;" class="gray-cta-box__button" href="https://timescale.typeform.com/state-of-pg-22" target="_blank">
        <p><strong>Take the 2022 State of PostgreSQL survey </strong></p>
    </a>
</div>
<!--kg-card-end: html-->
<h2 id="the-state-of-postgresql-in-2019-and-2021">The State of PostgreSQL in 2019 and 2021</h2><p>So, what have we learned from the two years we ran our survey? You will find a few key findings here, but <a href="https://drive.google.com/drive/folders/14elckaNv7FLKyWhzp3JKd3tH6PvI9F45">check out our reports</a> for a full picture. From the most used programming languages to whether developers use PostgreSQL for work or personal projects (or both!), favorite features, and qualitative answers, <em>The State of PostgreSQL </em>paints an accurate and informative portrait of this great community. </p><h3 id="sample">Sample </h3><p>Five hundred developers answered our survey in 2019, and 445 participated two years later. In both years, respondents were mainly software developers/engineers, software architects, and database administrators from the EMEA (Europe, Middle East, Africa) region.</p><h3 id="postgresql-usage-is-growing">PostgreSQL usage is growing</h3><p>Around 67 % of developers said they were using the database “more” or “a lot more” in 2019, compared to 52 % in 2021. However, the share of participants using it “about the same” increased from 31 % in 2019 to 43 % in 2021.</p><h3 id="use-cases-building-applications-at-the-top">Use cases: Building applications at the top</h3><p>Building applications is the primary use case for PostgreSQL developers, totaling 70 % in 2019 and 67 % in 2021.</p><h3 id="community-contribution-is-increasing">Community contribution is increasing</h3><p>Code contributions are crucial to open-source software development, and PostgreSQL successfully mobilizes its community. 
In 2019, about 9 % of respondents contributed their code to the database, and 11 % claimed to do it two years later.</p><h3 id="why-do-you-use-postgresql">Why do you use PostgreSQL?</h3><p>In both surveys, developers said that reliability and SQL were the main reasons they use PostgreSQL.</p><h3 id="the-way-developers-deploy-postgresql-is-changing">The way developers deploy PostgreSQL is changing</h3><p>In 2019, 51 % of respondents deployed PostgreSQL using AWS, while 46 % relied on a self-managed data center. In 2021, the self-managed option took the lead, with 36.4 % deploying on-site, 35.3 % from a private data center, and 32.8 % on a public cloud. In 2021, AWS was the leading cloud provider, with 46.1 % of the answers.</p><p><em>If you’ve got any feedback or questions on The State of PostgreSQL, let us know on </em><a href="https://twitter.com/TimescaleDB"><em>Twitter</em></a><em> or join Timescale’s Community </em><a href="http://timescaledb.slack.com/"><em>Slack</em></a><em> and message us in </em><a href="https://timescaledb.slack.com/archives/C4GT3N90X"><em>#general</em></a><em>.</em><br><br><br></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[OpenTelemetry: Where the SQL Is Better Than the Original]]></title>
            <description><![CDATA[How does OpenTelemetry differ from previous observability tools? And can these differences open a new path to an old friend as a unified query language for telemetry data? ]]></description>
            <link>https://www.tigerdata.com/blog/opentelemetry-where-sql-is-better-than-the-original</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/opentelemetry-where-sql-is-better-than-the-original</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Monitoring & Alerting]]></category>
            <dc:creator><![CDATA[James Blackwood-Sewell]]></dc:creator>
            <pubDate>Wed, 25 May 2022 11:14:00 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/05/jorgen-haland-4yOgRb_b_i4-unsplash--1-.jpg">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/05/jorgen-haland-4yOgRb_b_i4-unsplash--1-.jpg" alt="OpenTelemetry: Where the SQL Is Better Than the Original" /><p><em>This blog post was originally published at TFiR on May 2, 2022.</em></p><p>OpenTelemetry is a familiar term to those who work in the cloud-native landscape by now. Two years after the first beta was released it still maintains an incredibly active and large community, only coming second to Kubernetes when compared to other Cloud Native Computing Foundation (CNCF) projects. </p><p>For those who aren’t so familiar, OpenTelemetry was born out of the need to provide a unified front for instrumenting code and collecting observability data—a framework that can be used to handle metrics, logs, and traces in a consistent manner, while still retaining enough flexibility to model and interact with other popular approaches (such as Prometheus and StatsD).</p><p>This article explores how OpenTelemetry differs from previous observability tools and how that point of difference opens up the potential for bringing back an old friend as the query language across all telemetry data.</p><h2 id="observability%E2%80%94then-and-now">Observability—Then and Now</h2><p>At a high level, the primary difference between OpenTelemetry and the previous generation of open-source observability tooling is one of scope. OpenTelemetry doesn’t focus on one particular signal type, and it doesn’t offer any storage or query capabilities. Instead, it spans the entire area that an application needing instrumentation cares about—the creation and transmission of signals. The benefit of this change in approach is that OpenTelemetry can offer developers a complete experience: one API and one SDK per language, which offers common concepts across metrics, logs, and traces. 
When developers need to instrument an app, they only need to use OpenTelemetry.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/05/20220422_OpenTelemetry-Kubecon-article-v2.0-2.jpg" class="kg-image" alt="" loading="lazy" width="1800" height="1176" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/05/20220422_OpenTelemetry-Kubecon-article-v2.0-2.jpg 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/05/20220422_OpenTelemetry-Kubecon-article-v2.0-2.jpg 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2022/05/20220422_OpenTelemetry-Kubecon-article-v2.0-2.jpg 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/05/20220422_OpenTelemetry-Kubecon-article-v2.0-2.jpg 1800w" sizes="(min-width: 720px) 720px"></figure><p>On top of that promise, OpenTelemetry can take streams of signals and transform them, enrich them, aggregate them or route them, interfacing with any backend which implements the OpenTelemetry specification. This opens up a host of new deployment possibilities—a pluggable storage provider per signal (Prometheus, Jaeger, and Loki, maybe), a unified storage provider for all of them, two subsets of metrics to two different backends, or everything being sent out of a Kubernetes cluster to an external endpoint.</p><p>Personally, the appeal of OpenTelemetry is very real to me—gathering telemetry data from a Kubernetes cluster using a single interface feels much more natural than maintaining multiple signal flows and potentially operators, and custom resource definitions (CRDs). 
When I think back to the pain points of getting signals out of applications and into dashboards, one of my main issues was consistently around the fractured landscape of creating, discovering, and consuming telemetry data.</p><h2 id="opentelemetry-and-the-query-babel-tower">OpenTelemetry and the Query Babel Tower</h2><p>When discussing OpenTelemetry, the question of querying signals soon comes up. It’s amazing we now have the ability to provide applications with a single interface for instrumentation, but what about when the time comes to use that information?</p><p>If we store our data in multiple silos with separate query languages, all the value we gained from shared context, linking, and common attributes is lost. Because these languages have been developed (and are still being developed) for a single signal, they reinforce the silo approach. PromQL can query metrics, but it can’t reach out to logging or tracing data. It becomes clear that a solution to this problem is needed to allow the promise of OpenTelemetry to be realized from a consumption perspective.</p><p>As it stands today, open-source solutions to this problem have mostly been offered via a user interface. For example, Grafana can allow you to click between traces and metrics that have been manually linked and correlate via time—but this soon starts to feel a bit limited.</p><h2 id="a-new-promise">A New Promise</h2><p>OpenTelemetry promises tagged attributes that could be used to join instrumentation and rich linkages between all signals. So what is the query equivalent of what OpenTelemetry promises? A unified language that can take inputs from systems that provide storage for OpenTelemetry data and allow rich joins between different signal types. </p><p>This language would need to be multi-purpose, as it needs to be able to express common queries for metrics, traces, and logs. 
Ideally, it could also express one type of signal as another when required—the rate of entries showing up in a log stream which have a type of ERROR or a trace based on the time between metric increments. </p><p>So, what would this language look like? It needs to be a well-structured query language that can support multiple different types of signal data; it needs to be able to express domain-specific functionality for each signal; it really needs to support complex and straightforward joins between data, and it needs to return data which the visualization layer can present. Other tools also need to support it, too. And hopefully, not just observability tools—integration with programming languages and business intelligence solutions would be perfect. </p><p>Designing such a language is not easy. While the simplicity of PromQL is great for most metric use cases, adding on trace and log features would almost certainly make that experience worse. Having three languages that were similar (one for each signal) and could be linked together by time and attributes at query time is a possibility, but while PromQL is a de facto standard, it seems unlikely that LogQL (Grafana Loki’s PromQL-inspired query language for logs) will show up in other products. And, at the time of writing, traces don’t have a common language. Sure we could develop those three interfacing languages, but do we need to?</p><h2 id="why-sql">Why SQL?</h2><p>Before working with observability data, I was in the Open Source database world. I think we can learn something from databases here by adopting the lingua franca of data analytics: SQL. 
Somehow, it has been pushed to the bottom of our programming language toolkit but is coming back strong due to the increasing importance of data for decision-making.</p><p>SQL is truly a language that has stood the test of time:</p><ul><li>It’s a well-defined standard built for modeling relationships and then analyzing data.</li><li>It allows easy joins between relations and is used in many, many data products.</li><li>It is supported in all major programming languages, and if tooling supports external query languages, it’s a good bet it will support SQL as one of them.</li><li>And finally, developers <em>understand</em> SQL. While it can be a bit more verbose than something like PromQL, it won’t need any language updates to support traces and metrics in addition to logs—it just needs a schema defined that models those relationships.</li></ul><p>Despite all this, SQL is a language choice that often raises eyebrows. It’s not typically a language favored by Cloud technologies and DevOps, and with the rise in the use of object-relational mapping libraries (ORMs), which abstract SQL away from developers, it’s often ignored. But, if you need to analyze different sets of data that have something in common—so they can be joined, correlated, and compared together—you use SQL. </p><p>Where we once dealt with metrics, logs, and traces in different (and usually intentionally simple) systems with no commonalities, today’s systems are becoming progressively more complex and require correlation. SQL is a perfect choice for this; in fact, this is what SQL was designed to do. 
It even lets us be sure that we can correlate data from outside of our Observability domain with our telemetry—all of a sudden, we would have the ability to pull in reference data and enrich our signals past the labels we attach at creation time.</p><p>At Timescale, we are convinced that a single, consistent query layer is the correct approach—and are investing in developing <a href="https://timescale.ghost.io/blog/important-news-about-promscale/" rel="noreferrer">Promscale</a>, a scalable backend to store all signal data which supports SQL as its native language. Whatever the solution is, we are looking forward to being able to query seamlessly across all our telemetry data, unlocking the full potential of OpenTelemetry.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Point-in-Time PostgreSQL Database and Query Monitoring With pg_stat_statements]]></title>
            <description><![CDATA[The pg_stat_statements extension strikes again. Learn how to store metrics snapshots regularly for more efficient database monitoring.]]></description>
            <link>https://www.tigerdata.com/blog/point-in-time</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/point-in-time</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[PostgreSQL Performance]]></category>
            <category><![CDATA[PostgreSQL Tips]]></category>
            <dc:creator><![CDATA[Ryan Booz]]></dc:creator>
            <pubDate>Tue, 03 May 2022 13:08:42 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/05/PIT-PostgreSQL-Database-Query-Monitoring--1-.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/05/PIT-PostgreSQL-Database-Query-Monitoring--1-.png" alt="The Grafana dashboards for different point-in-time PostgreSQL database monitoring queries on a black background" /><p>Database monitoring is a crucial part of effective data management and building high-performance applications. <a href="https://timescale.ghost.io/blog/identify-postgresql-performance-bottlenecks-with-pg_stat_statements/">In our previous blog post</a>, we discussed how to enable <code>pg_stat_statements</code> (and that it comes standard on all Timescale instances), covered what data it provides, and demonstrated a few queries that you can run to glean useful information from the metrics to help pinpoint problem queries. </p><p>We also discussed one of the few pitfalls with <code>pg_stat_statements</code>: <em>all of the data it provides is cumulative since the last server restart (or since a superuser last reset the statistics).</em></p><p>While <code>pg_stat_statements</code> can work as a go-to source of information to determine where initial problems might be occurring when the server isn’t performing as expected, the cumulative data it provides can also pose a problem when said server is struggling to keep up with the load.</p><p>The data might show that a particular application query has been called frequently and read a lot of data from the disk to return results, but that only tells part of the story. 
With cumulative data, it's impossible to answer specific questions about the state of your cluster, such as:</p><ul><li>Does it usually struggle with resources at this time of day?</li><li>Are there particular forms of the query that are slower than others?</li><li>Is it a specific database that's consuming resources more than others right now?</li><li>Is that normal given the current load?</li></ul><p>The database monitoring information that <code>pg_stat_statements</code> provides is invaluable when you need it. However, it's most helpful when the data shows trends and patterns over time to visualize the true state of your database when problems arise.</p><p>It would be much more valuable if you could transform this static, cumulative data into time-series data, regularly storing snapshots of the metrics. Once the data is stored, we can use standard SQL to query delta values of each snapshot and metric to see how each database, user, and query performed interval by interval. That also makes it much easier to pinpoint <em>when</em> a problem started and <em>what</em> query or database appears to be contributing the most.</p><p><br><br>In this blog post (a sequel to <a href="https://timescale.ghost.io/blog/using-pg-stat-statements-to-optimize-queries/" rel="noreferrer">our previous 101 on the subject</a>), we'll discuss the basic process for storing and analyzing the <code>pg_stat_statements</code> snapshot data over time using TimescaleDB features to store, manage, and query the metric data efficiently. 
</p><p>Although you could automate the storing of snapshot data with the help of a few <a href="https://www.tigerdata.com/blog/top-8-postgresql-extensions" rel="noreferrer">PostgreSQL extensions</a>, only TimescaleDB provides everything you need, including automated job scheduling, data compression, data retention, and continuous aggregates to manage the entire solution efficiently.</p><div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text">You can complement pg_stat_statements with <a href="https://www.timescale.com/blog/database-monitoring-and-query-optimization-introducing-insights-on-timescale/" rel="noreferrer">Insights</a> for great database observability at an unprecedented level of granularity. To try Insights and unlock a new, insightful understanding of your queries and performance,&nbsp;<a href="https://console.cloud.timescale.com/signup?ref=timescale.com">sign up for Timescale</a>.</div></div><h2 id="postgresql-database-monitoring-201-preparing-to-store-data-snapshots">PostgreSQL Database Monitoring 201: Preparing to Store Data Snapshots</h2><p>Before we can query and store ongoing snapshot data from <code>pg_stat_statements</code>, we need to prepare a schema and, optionally, a separate database to keep all of the information we'll collect with each snapshot.</p><p>We're choosing to be very opinionated about how we store the snapshot metric data and, optionally, how to separate some information, like the query text itself. Use our example as a building block to store the information you find most useful in your environment. You may not want to keep some metric data (e.g., <code>stddev_exec_time</code>), and that's okay. Adjust based on your database monitoring needs.</p><h3 id="create-a-metrics-database">Create a metrics database</h3><p>Recall that <code>pg_stat_statements</code> tracks statistics for every database in your PostgreSQL cluster. 
Also, any user with the appropriate permissions can query all of the data while connected to any database. Therefore, while creating a separate database is an optional step, storing this data in a separate TimescaleDB database makes it easier to filter out the queries from the ongoing snapshot collection process. </p><p>We also show the creation of a separate schema called <code>statements_history</code> to store all of the tables and procedures used throughout the examples. This allows a clean separation of this data from anything else you may want to do within this database.</p><pre><code class="language-sql">psql=&gt; CREATE DATABASE statements_monitor;

psql=&gt; \c statements_monitor;

psql=&gt; CREATE EXTENSION IF NOT EXISTS timescaledb;

psql=&gt; CREATE SCHEMA IF NOT EXISTS statements_history;
</code></pre>
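<p>As a rough illustration of why the separate database helps, the sketch below reads <code>pg_stat_statements</code> while excluding anything that ran inside <code>statements_monitor</code> itself. It assumes PostgreSQL 13 or later (for <code>total_exec_time</code>) and that the <code>pg_stat_statements</code> extension has also been created in the database you are connected to; the <code>LIMIT</code> is arbitrary.</p><pre><code class="language-sql">-- Sketch: top statements by execution time, ignoring the monitoring database.
-- Assumes CREATE EXTENSION pg_stat_statements has been run in this database.
SELECT d.datname,
       r.rolname,
       s.queryid,
       s.calls,
       s.total_exec_time
FROM pg_stat_statements AS s
JOIN pg_database AS d ON d.oid = s.dbid    -- resolve database name
JOIN pg_roles AS r ON r.oid = s.userid     -- resolve role name
WHERE d.datname &lt;&gt; 'statements_monitor'
ORDER BY s.total_exec_time DESC
LIMIT 10;</code></pre><p>Without a dedicated database, you would need some other way to tell the snapshot job's own queries apart from application queries, for example by filtering on a dedicated role.</p>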
<h3 id="create-hypertables-to-store-snapshot-data">Create hypertables to store snapshot data</h3><p>Whether you create a separate database or use an existing TimescaleDB database, we need to create the tables to store the snapshot information. For our example, we'll create three tables:</p><ul><li><strong><code>snapshots</code> <code>(hypertable)</code></strong>: a cluster-wide aggregate snapshot of all metrics for easy cluster-level monitoring</li><li><code><strong>queries</strong></code>: separate storage of query text by <code>queryid</code>, <code>rolname</code>, and <code>datname</code> </li><li><strong><code>statements</code> <code>(hypertable)</code></strong>: statistics for each query every time the snapshot is taken, grouped by <code>queryid</code>, <code>rolname</code>, and <code>datname</code> </li></ul><p>Both the <code>snapshots</code> and <code>statements</code> tables are converted to hypertables in the following SQL. Because these tables will store lots of metric data over time, making them hypertables unlocks powerful features for managing the data with compression and data retention, as well as speeding up queries that focus on specific periods of time. Finally, take note that each table is assigned a different <code>chunk_time_interval</code> based on the amount and frequency of the data that is added to it.</p><p>The <code>snapshots</code> table, for instance, will only receive one row per snapshot, which allows the chunks to be created less frequently (every four weeks) without growing too large. In contrast, the <code>statements</code> table will potentially receive thousands of rows every time a snapshot is taken, so creating chunks more frequently (every week) allows us to compress this data sooner and provides more fine-grained control over data retention. 
</p><p>The size and activity of your cluster, along with how often you run the job to take a snapshot of data, will influence what the right <code>chunk_time_interval</code> is for your system. More information about chunk sizes and best practices can be <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/hypertables/best-practices/">found in our documentation</a>.</p><pre><code class="language-sql">/*
 * The snapshots table holds the cluster-wide values
 * each time an overall snapshot is taken. There is
 * no database or user information stored. This allows
 * you to create cluster dashboards for very fast, high-level
 * information on the trending state of the cluster.
 */
CREATE TABLE IF NOT EXISTS statements_history.snapshots (
    created timestamp with time zone NOT NULL,
    calls bigint NOT NULL,
    total_plan_time double precision NOT NULL,
    total_exec_time double precision NOT NULL,
    rows bigint NOT NULL,
    shared_blks_hit bigint NOT NULL,
    shared_blks_read bigint NOT NULL,
    shared_blks_dirtied bigint NOT NULL,
    shared_blks_written bigint NOT NULL,
    local_blks_hit bigint NOT NULL,
    local_blks_read bigint NOT NULL,
    local_blks_dirtied bigint NOT NULL,
    local_blks_written bigint NOT NULL,
    temp_blks_read bigint NOT NULL,
    temp_blks_written bigint NOT NULL,
    blk_read_time double precision NOT NULL,
    blk_write_time double precision NOT NULL,
    wal_records bigint NOT NULL,
    wal_fpi bigint NOT NULL,
    wal_bytes numeric NOT NULL,
    wal_position bigint NOT NULL,
    stats_reset timestamp with time zone NOT NULL,
    PRIMARY KEY (created)
);

/*
 * Convert the snapshots table into a hypertable with a 4 week
 * chunk_time_interval. TimescaleDB will create a new chunk
 * every 4 weeks to store new data. By making this a hypertable we
 * can take advantage of other TimescaleDB features like native 
 * compression, data retention, and continuous aggregates.
 */
SELECT * FROM create_hypertable(
    'statements_history.snapshots',
    'created',
    chunk_time_interval =&gt; interval '4 weeks'
);
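
/*
 * Optional check (a sketch, not part of the original setup):
 * confirm that the hypertable was created with the expected
 * chunk_time_interval by reading the TimescaleDB information
 * view for hypertable dimensions.
 */
SELECT hypertable_name, column_name, time_interval
FROM timescaledb_information.dimensions
WHERE hypertable_schema = 'statements_history'
  AND hypertable_name = 'snapshots';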

COMMENT ON TABLE statements_history.snapshots IS
$$This table contains a full aggregate of the pg_stat_statements view
at the time of the snapshot. This allows for very fast queries that require only a high-level overview$$;
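
/*
 * Illustrative sketch (an assumption about how you might read
 * this table once snapshots exist, not part of the setup):
 * because the stored values are cumulative, subtracting the
 * previous snapshot with lag() gives per-interval deltas.
 * Deltas can be negative right after a statistics reset, so
 * compare stats_reset between rows before trusting them.
 */
SELECT created,
       calls - lag(calls) OVER (ORDER BY created) AS calls_delta,
       total_exec_time - lag(total_exec_time) OVER (ORDER BY created) AS exec_time_delta_ms,
       stats_reset
FROM statements_history.snapshots
ORDER BY created;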

/*
 * To reduce the storage requirement of saving query statistics
 * at a consistent interval, we store the query text in a separate
 * table and join it as necessary. The queryid is the identifier
 * for each query across tables.
 */
CREATE TABLE IF NOT EXISTS statements_history.queries (
    queryid bigint NOT NULL,
    rolname text,
    datname text,
    query text,
    PRIMARY KEY (queryid, datname, rolname)
);

COMMENT ON TABLE statements_history.queries IS
$$This table contains all query text, which allows us to avoid repeatedly storing the query text$$;


/*
 * Finally, we store the individual statistics for each queryid
 * each time we take a snapshot. This allows you to dig into a
 * specific interval of time and see the snapshot-by-snapshot view
 * of query performance and resource usage.
 */
CREATE TABLE IF NOT EXISTS statements_history.statements (
    created timestamp with time zone NOT NULL,
    queryid bigint NOT NULL,
    plans bigint NOT NULL,
    total_plan_time double precision NOT NULL,
    calls bigint NOT NULL,
    total_exec_time double precision NOT NULL,
    rows bigint NOT NULL,
    shared_blks_hit bigint NOT NULL,
    shared_blks_read bigint NOT NULL,
    shared_blks_dirtied bigint NOT NULL,
    shared_blks_written bigint NOT NULL,
    local_blks_hit bigint NOT NULL,
    local_blks_read bigint NOT NULL,
    local_blks_dirtied bigint NOT NULL,
    local_blks_written bigint NOT NULL,
    temp_blks_read bigint NOT NULL,
    temp_blks_written bigint NOT NULL,
    blk_read_time double precision NOT NULL,
    blk_write_time double precision NOT NULL,
    wal_records bigint NOT NULL,
    wal_fpi bigint NOT NULL,
    wal_bytes numeric NOT NULL,
    rolname text NOT NULL,
    datname text NOT NULL,
    PRIMARY KEY (created, queryid, rolname, datname),
    FOREIGN KEY (queryid, datname, rolname) REFERENCES statements_history.queries (queryid, datname, rolname) ON DELETE CASCADE
);

/*
 * Convert the statements table into a hypertable with a 1 week
 * chunk_time_interval. TimescaleDB will create a new chunk
 * every week to store new data. Because this table will receive
 * more data every time we take a snapshot, a shorter interval
 * allows us to manage compression and retention at a shorter interval
 * if needed. It also provides smaller overall chunks for querying
 * when focusing on specific time ranges.
 */
SELECT * FROM create_hypertable(
    'statements_history.statements',
    'created',
    create_default_indexes =&gt; false,
    chunk_time_interval =&gt; interval '1 week'
);

</code></pre>
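<p>To confirm that both hypertables were created with the intended chunk intervals, a quick check along the following lines may help. This is a minimal sketch that assumes a TimescaleDB 2.x installation, where the <code>timescaledb_information.dimensions</code> view is available; the schema name is the one used in the script above.</p><pre><code class="language-sql">/*
 * List the time dimension of each hypertable in the
 * statements_history schema along with its chunk interval.
 * Assumes TimescaleDB 2.x informational views.
 */
SELECT
    hypertable_name,
    column_name,
    time_interval
FROM
    timescaledb_information.dimensions
WHERE
    hypertable_schema = 'statements_history'
ORDER BY
    hypertable_name;
</code></pre>
<p>If the setup succeeded, the result should contain one row each for <code>snapshots</code> and <code>statements</code>, both partitioned on <code>created</code>, with intervals corresponding to four weeks and one week respectively.</p>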
<h3 id="create-the-snapshot-stored-procedure">Create the snapshot stored procedure</h3><p>With the tables created to store the statistics data from <code>pg_stat_statements</code>, we need to create a stored procedure that will run on a scheduled basis to collect and store the data. This is a straightforward process with <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/user-defined-actions/create-and-register/">TimescaleDB user-defined actions</a>. </p><p>A user-defined action provides a method for scheduling a custom stored procedure using the underlying scheduling engine that TimescaleDB uses for automated policies like continuous aggregate refresh and data retention. Although there are other <a href="https://www.tigerdata.com/blog/top-8-postgresql-extensions" rel="noreferrer">PostgreSQL extensions</a> for managing schedules, this feature is included with TimescaleDB by default.<br><br>First, create the stored procedure to populate the data. In this example, we use a multi-part common table expression (CTE) to fill in each table, starting with the results of the <code>pg_stat_statements</code> view.</p><pre><code class="language-sql">CREATE OR REPLACE PROCEDURE statements_history.create_snapshot(
    job_id int,
    config jsonb
)
LANGUAGE plpgsql AS
$function$
DECLARE
    snapshot_time timestamp with time zone := now();
BEGIN
    /*
     * This first CTE queries pg_stat_statements and joins
     * to the pg_roles and pg_database catalogs for more detail
     * that we will store later.
     */
    WITH statements AS (
        SELECT
            *
        FROM
            pg_stat_statements(true)
        JOIN
            pg_roles ON (userid=pg_roles.oid)
        JOIN
            pg_database ON (dbid=pg_database.oid)
        WHERE queryid IS NOT NULL
    ), 
    /*
     * We then get the individual queries out of the result
     * and store the text and queryid separately to avoid
     * storing the same query text over and over.
     */
    queries AS (
        INSERT INTO
            statements_history.queries (queryid, query, datname, rolname)
        SELECT
            queryid, query, datname, rolname
        FROM
            statements
        ON CONFLICT
            DO NOTHING
        RETURNING
            queryid
    ), 
    /*
     * This query SUMs all data from all queries and databases
     * to get high-level cluster statistics each time the snapshot
     * is taken.
     */
    snapshot AS (
        INSERT INTO
            statements_history.snapshots
        SELECT
            snapshot_time, -- same timestamp used for the per-query rows below
            sum(calls),
            sum(total_plan_time) AS total_plan_time,
            sum(total_exec_time) AS total_exec_time,
            sum(rows) AS rows,
            sum(shared_blks_hit) AS shared_blks_hit,
            sum(shared_blks_read) AS shared_blks_read,
            sum(shared_blks_dirtied) AS shared_blks_dirtied,
            sum(shared_blks_written) AS shared_blks_written,
            sum(local_blks_hit) AS local_blks_hit,
            sum(local_blks_read) AS local_blks_read,
            sum(local_blks_dirtied) AS local_blks_dirtied,
            sum(local_blks_written) AS local_blks_written,
            sum(temp_blks_read) AS temp_blks_read,
            sum(temp_blks_written) AS temp_blks_written,
            sum(blk_read_time) AS blk_read_time,
            sum(blk_write_time) AS blk_write_time,
            sum(wal_records) AS wal_records,
            sum(wal_fpi) AS wal_fpi,
            sum(wal_bytes) AS wal_bytes,
            pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0'),
            pg_postmaster_start_time()
        FROM
            statements
    )
    /*
     * And finally, we store the individual pg_stat_statement 
     * aggregated results for each query, for each snapshot time.
     */
    INSERT INTO
        statements_history.statements
    SELECT
        snapshot_time,
        queryid,
        plans,
        total_plan_time,
        calls,
        total_exec_time,
        rows,
        shared_blks_hit,
        shared_blks_read,
        shared_blks_dirtied,
        shared_blks_written,
        local_blks_hit,
        local_blks_read,
        local_blks_dirtied,
        local_blks_written,
        temp_blks_read,
        temp_blks_written,
        blk_read_time,
        blk_write_time,
        wal_records,
        wal_fpi,
        wal_bytes,
        rolname,
        datname
    FROM
        statements;

END;
$function$;
</code></pre>
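<p>Before scheduling the procedure, it can be useful to run it once by hand and confirm that rows arrive in the history tables. The following is a minimal sketch, assuming the <code>pg_stat_statements</code> extension is installed in the current database. Passing <code>NULL</code> for <code>job_id</code> and <code>config</code> is our own shortcut for a manual run; the procedure body shown above does not read either argument.</p><pre><code class="language-sql">/*
 * Take one snapshot manually. The job_id and config arguments
 * are unused by the procedure body, so NULL is passed for both.
 */
CALL statements_history.create_snapshot(NULL, NULL);

/*
 * Inspect the most recent cluster-wide snapshots.
 */
SELECT
    created,
    calls,
    total_exec_time
FROM
    statements_history.snapshots
ORDER BY
    created DESC
LIMIT 5;
</code></pre>
<p>Because <code>created</code> is the primary key of the <code>snapshots</code> table and the timestamp comes from <code>now()</code>, which is fixed for the duration of a transaction, calling the procedure twice inside a single transaction would fail with a duplicate key error. Run each manual call in its own transaction.</p>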
<p>Once you create the stored procedure, schedule it to run on an ongoing basis as a user-defined action. In the following example, we schedule snapshot data collection every minute, which may be too often for your needs. Adjust the collection schedule to suit your data capture and monitoring needs.</p><pre><code class="language-sql">/*
 * Adjust the schedule_interval based on how often
 * a snapshot of the data should be captured
 */
SELECT add_job(
    'statements_history.create_snapshot',
    schedule_interval=&gt;'1 minutes'::interval
);
</code></pre>
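<p>If one minute turns out to be too frequent, the schedule of the existing job can be changed rather than registering a new one. The sketch below looks up the job by procedure name and passes it to <code>alter_job</code>; the five-minute interval is only an illustrative value, not a recommendation from the original setup.</p><pre><code class="language-sql">/*
 * Change the snapshot job to run every 5 minutes.
 * The interval is an example value; pick one that matches
 * your monitoring needs.
 */
SELECT
    alter_job(job_id, schedule_interval =&gt; interval '5 minutes')
FROM
    timescaledb_information.jobs
WHERE
    proc_schema = 'statements_history'
    AND proc_name = 'create_snapshot';
</code></pre>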
<p>And finally, you can verify that the user-defined action job is running correctly by querying the jobs information views. If you set the <code>schedule_interval</code> to one minute (as shown above), wait a few minutes, and then ensure that <code>last_run_status</code> is <code>Success</code> and that there are zero <code>total_failures</code>.</p><pre><code>SELECT js.* FROM timescaledb_information.jobs j
 INNER JOIN timescaledb_information.job_stats js ON j.job_id = js.job_id
WHERE j.proc_name = 'create_snapshot';


Name                  |Value                        |
----------------------+-----------------------------+
hypertable_schema     |                             |
hypertable_name       |                             |
job_id                |1008                         |
last_run_started_at   |2022-04-13 17:43:15.053 -0400|
last_successful_finish|2022-04-13 17:43:15.068 -0400|
last_run_status       |Success                      |
job_status            |Scheduled                    |
last_run_duration     |00:00:00.014755              |
next_start            |2022-04-13 17:44:15.068 -0400|
total_runs            |30186                        |
total_successes       |30167                        |
total_failures        |0                            |

</code></pre>
<p>The query metrics database is set up and ready to query! Let's look at a few query examples to help you get started.</p><h2 id="querying-pgstatstatements-snapshot-data">Querying pg_stat_statements Snapshot Data</h2><p>We chose to create two statistics tables: one that aggregates the snapshot statistics for the cluster, regardless of a specific query, and another that stores statistics for each query per snapshot. The data is time-stamped using the <code>created</code> column in both tables. The rate of change for each snapshot is the difference in cumulative statistics values from one snapshot to the next.</p><p>This is accomplished in SQL using the <code>LAG</code> window function, which returns the value from the previous row (ordered by the <code>created</code> column) so that it can be subtracted from the current row.</p><h3 id="cluster-performance-over-time">Cluster performance over time</h3><p>This first example queries the <code>snapshots</code> table, which stores the aggregate total of all statistics for the entire cluster. Running this query returns the change in each statistic between consecutive snapshots, not the overall cumulative <code>pg_stat_statements</code> values.</p><pre><code class="language-sql">/*
 * This CTE queries the snapshot table (full cluster statistics)
 * to get a high-level view of the cluster state.
 * 
 * We query each row with a LAG of the previous row to retrieve
 * the delta of each value to make it suitable for graphing.
 */
WITH deltas AS (
    SELECT
        created,
        extract('epoch' from created - lag(d.created) OVER (w)) AS delta_seconds,
        d.rows - lag(d.rows) OVER (w) AS delta_rows,
        d.total_plan_time - lag(d.total_plan_time) OVER (w) AS delta_plan_time,
        d.total_exec_time - lag(d.total_exec_time) OVER (w) AS delta_exec_time,
        d.calls - lag(d.calls) OVER (w) AS delta_calls,
        d.wal_bytes - lag(d.wal_bytes) OVER (w) AS delta_wal_bytes,
        stats_reset
    FROM
        statements_history.snapshots AS d
    WHERE
        created &gt; now() - INTERVAL '2 hours'
    WINDOW
        w AS (PARTITION BY stats_reset ORDER BY created ASC)
)
SELECT
    created AS "time",
    delta_rows,
    delta_calls/delta_seconds AS calls,
    delta_plan_time/delta_seconds/1000 AS plan_time,
    delta_exec_time/delta_seconds/1000 AS exec_time,
    delta_wal_bytes/delta_seconds AS wal_bytes
FROM
    deltas
ORDER BY
    created ASC;  

time                         |delta_rows|calls               |plan_time|exec_time         |wal_bytes         |
-----------------------------+----------+--------------------+---------+------------------+------------------+
2022-04-13 15:55:12.984 -0400|          |                    |         |                  |                  |
2022-04-13 15:56:13.000 -0400|        89| 0.01666222812679749|      0.0| 0.000066054620811| 576.3131464496716|
2022-04-13 15:57:13.016 -0400|        89|0.016662253391151797|      0.0|0.0000677694667946|  591.643293413018|
2022-04-13 15:58:13.031 -0400|        89|0.016662503817796187|      0.0|0.0000666146741069| 576.3226820499345|
2022-04-13 15:59:13.047 -0400|        89|0.016662103471929153|      0.0|0.0000717084114511| 591.6379700812604|
2022-04-13 16:00:13.069 -0400|        89| 0.01666062607900462|      0.0|0.0001640335102535|3393.3363560151874|

</code></pre>
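<p>To see how the <code>LAG</code> subtraction and the <code>PARTITION BY stats_reset</code> clause behave in isolation, here is a small self-contained sketch. It uses made-up sample values in a <code>VALUES</code> list instead of the real <code>snapshots</code> table, so it can be run on any PostgreSQL instance.</p><pre><code class="language-sql">/*
 * Hypothetical cumulative "calls" counters. The last two rows
 * belong to a later stats_reset, so the counter starts over.
 */
WITH samples (created, stats_reset, calls) AS (
    VALUES
        ('2022-04-13 15:55:00+00'::timestamptz, '2022-04-13 00:00:00+00'::timestamptz, 100::bigint),
        ('2022-04-13 15:56:00+00', '2022-04-13 00:00:00+00', 160),
        ('2022-04-13 15:57:00+00', '2022-04-13 00:00:00+00', 250),
        ('2022-04-13 15:58:00+00', '2022-04-13 15:57:30+00', 20),
        ('2022-04-13 15:59:00+00', '2022-04-13 15:57:30+00', 75)
)
SELECT
    created,
    calls,
    -- Current cumulative value minus the previous one in the same reset window
    calls - lag(calls) OVER (w) AS delta_calls
FROM
    samples
WINDOW
    w AS (PARTITION BY stats_reset ORDER BY created ASC)
ORDER BY
    created ASC;
</code></pre>
<p>The deltas come out as <code>NULL</code>, 60, 90, <code>NULL</code>, and 55. The first row of each <code>stats_reset</code> partition has no previous row, which is why the first row of the cluster output above is empty. Without the partition, the fourth row would show a misleading negative delta of -230 where the counter restarted.</p>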
<h3 id="top-100-most-expensive-queries">Top 100 most expensive queries</h3><p>Getting an overview of the cluster instance is really helpful to understand the state of the whole system over time. Another useful set of data to analyze quickly is a list of the queries using the most resources of the cluster, query by query. There are many ways you could query the snapshot information for these details, and your definition of "resource-intensive" might be different than what we show, but this example gives the high-level cumulative statistics for each query over the specified time, ordered by the highest total sum of execution and planning time.</p><pre><code class="language-sql">/*
 * Individual data for each query for a specified time range,
 * which is particularly useful for zeroing in on a specific
 * query in a tool like Grafana
 */
WITH snapshots AS (
    SELECT
        max,
        -- We need at least 2 snapshots to calculate a delta. If the dashboard is currently showing
        -- a period &lt; 5 minutes, we only have 1 snapshot, and therefore no delta. In that case,
        -- we take the snapshot just before this window to still come up with useful deltas.
        CASE
            WHEN max = min
                THEN (SELECT max(created) FROM statements_history.snapshots WHERE created &lt; min)
            ELSE min
        END AS min
    FROM (
        SELECT
            max(created),
            min(created)
        FROM
            statements_history.snapshots WHERE created &gt; now() - '1 hour'::interval
            -- Grafana-based filter
            --statements_history.snapshots WHERE $__timeFilter(created)
        GROUP by
            stats_reset
        ORDER by
            max(created) DESC
        LIMIT 1
    ) AS max(max, min)
), deltas AS (
    SELECT
        rolname,
        datname,
        queryid,
        extract('epoch' from max(created) - min(created)) AS delta_seconds,
        max(total_exec_time) - min(total_exec_time) AS delta_exec_time,
        max(total_plan_time) - min(total_plan_time) AS delta_plan_time,
        max(calls) - min(calls) AS delta_calls,
        max(shared_blks_hit) - min(shared_blks_hit) AS delta_shared_blks_hit,
        max(shared_blks_read) - min(shared_blks_read) AS delta_shared_blks_read
    FROM
        statements_history.statements
    WHERE
        -- Granted, this looks odd; however, it helps the DecompressChunk node tremendously,
        -- as without these distinct filters, it would aggregate first and then filter.
        -- Now it filters while scanning, which has a huge knock-on effect on the upper
        -- plan nodes.
        (created &gt;= (SELECT min FROM snapshots) AND created &lt;= (SELECT max FROM snapshots))
    GROUP BY
        rolname,
        datname,
        queryid
)
SELECT
    rolname,
    datname,
    queryid::text,
    delta_exec_time/delta_seconds/1000 AS exec,
    delta_plan_time/delta_seconds/1000 AS plan,
    delta_calls/delta_seconds AS calls,
    delta_shared_blks_hit/delta_seconds*8192 AS cache_hit,
    delta_shared_blks_read/delta_seconds*8192 AS cache_miss,
    query
FROM
    deltas
JOIN
    statements_history.queries USING (rolname,datname,queryid)
WHERE
    delta_calls &gt; 1
    AND delta_exec_time &gt; 1
    AND query ~* $$.*$$
ORDER BY
    delta_exec_time+delta_plan_time DESC
LIMIT 100;


rolname  |datname|queryid             |exec              |plan|calls               |cache_hit         |cache_miss|query      
---------+-------+--------------------+------------------+----+--------------------+------------------+----------+-----------
tsdbadmin|tsdb   |731301775676660043  |0.0000934033977289| 0.0|0.016660922907623773| 228797.2725854585|       0.0|WITH statem...
tsdbadmin|tsdb   |-686339673194700075 |0.0000570625206738| 0.0|  0.0005647770477161|116635.62329618855|       0.0|WITH snapsh...
tsdbadmin|tsdb   |-5804362417446225640|0.0000008223159463| 0.0|  0.0005647770477161| 786.5311077312939|       0.0|-- NOTE Thi...

</code></pre>
<p>However you decide to order this data, you now have a quick result set with the text of the query and the <code>queryid</code>. With just a bit more effort, we can dig even deeper into the performance of a specific query over time.</p><p>For example, in the output from the previous query, we can see that <code>queryid=731301775676660043</code> has the longest overall execution and planning time of all queries for this period. We can use that <code>queryid</code> to dig a little deeper into the snapshot-by-snapshot performance of this specific query.</p><pre><code class="language-sql">/*
 * When you want to dig into an individual query, this takes
 * a similar approach to the "snapshot" query above, but for 
 * an individual query ID.
 */
WITH deltas AS (
    SELECT
        created,
        st.calls - lag(st.calls) OVER (query_w) AS delta_calls,
        st.plans - lag(st.plans) OVER (query_w) AS delta_plans,
        st.rows - lag(st.rows) OVER (query_w) AS delta_rows,
        st.shared_blks_hit - lag(st.shared_blks_hit) OVER (query_w) AS delta_shared_blks_hit,
        st.shared_blks_read - lag(st.shared_blks_read) OVER (query_w) AS delta_shared_blks_read,
        st.temp_blks_written - lag(st.temp_blks_written) OVER (query_w) AS delta_temp_blks_written,
        st.total_exec_time - lag(st.total_exec_time) OVER (query_w) AS delta_total_exec_time,
        st.total_plan_time - lag(st.total_plan_time) OVER (query_w) AS delta_total_plan_time,
        st.wal_bytes - lag(st.wal_bytes) OVER (query_w) AS delta_wal_bytes,
        extract('epoch' from st.created - lag(st.created) OVER (query_w)) AS delta_seconds
    FROM
        statements_history.statements AS st
    JOIN
        statements_history.snapshots USING (created)
    WHERE
        -- Adjust filters to match your queryid and time range
        created &gt; now() - interval '25 minutes'
        AND created &lt; now() + interval '25 minutes'
        AND queryid=731301775676660043
    WINDOW
        query_w AS (PARTITION BY datname, rolname, queryid, stats_reset ORDER BY created)
)
SELECT
    created AS "time",
    delta_calls/delta_seconds AS calls,
    delta_plans/delta_seconds AS plans,
    delta_total_exec_time/delta_seconds/1000 AS exec_time,
    delta_total_plan_time/delta_seconds/1000 AS plan_time,
    delta_rows/nullif(delta_calls, 0) AS rows_per_query,
    delta_shared_blks_hit/delta_seconds*8192 AS cache_hit,
    delta_shared_blks_read/delta_seconds*8192 AS cache_miss,
    delta_temp_blks_written/delta_seconds*8192 AS temp_bytes,
    delta_wal_bytes/delta_seconds AS wal_bytes,
    delta_total_exec_time/nullif(delta_calls, 0) AS exec_time_per_query,
    delta_total_plan_time/nullif(delta_plans, 0) AS plan_time_per_plan,
    delta_shared_blks_hit/nullif(delta_calls, 0)*8192 AS cache_hit_per_query,
    delta_shared_blks_read/nullif(delta_calls, 0)*8192 AS cache_miss_per_query,
    delta_temp_blks_written/nullif(delta_calls, 0)*8192 AS temp_bytes_written_per_query,
    delta_wal_bytes/nullif(delta_calls, 0) AS wal_bytes_per_query
FROM
    deltas
WHERE
    delta_calls &gt; 0
ORDER BY
    created ASC;


time                         |calls               |plans|exec_time         |plan_time|ro...
-----------------------------+--------------------+-----+------------------+---------+--
2022-04-14 14:33:39.831 -0400|0.016662115132216382|  0.0|0.0000735602224659|      0.0|  
2022-04-14 14:34:39.847 -0400|0.016662248949061972|  0.0|0.0000731468396678|      0.0|  
2022-04-14 14:35:39.863 -0400|  0.0166622286820572|  0.0|0.0000712116494436|      0.0|  
2022-04-14 14:36:39.880 -0400|0.016662015187426844|  0.0|0.0000702374920336|      0.0|  

</code></pre>
<h2 id="compression-continuous-aggregates-and-data-retention">Compression, Continuous Aggregates, and Data Retention</h2><p>This self-serve query monitoring approach doesn't strictly require TimescaleDB. You could schedule the snapshot job with other extensions or tools, and regular PostgreSQL tables could likely store the data you retain for some time without much of an issue. Still, all of this is classic time-series data, tracking the state of your PostgreSQL cluster(s) over time.</p><p>The more historical data you keep, the more useful this database monitoring solution becomes. TimescaleDB offers several features that are not available with vanilla PostgreSQL to help you manage growing time-series data and improve the efficiency of your queries and processes.</p><p><a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/compression/"><strong>Compression</strong></a> on this data is highly effective and efficient for two reasons:</p><ol><li>Most of the data is stored as integers, which compress very well (96 % or more) using our type-specific algorithms.</li><li>The data can be compressed more frequently because we never update or delete it once it has been written. This means that it's possible to store months or years of data with very little disk utilization, and queries on specific columns of data will often be significantly faster from compressed data.</li></ol><p><a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/continuous-aggregates/"><strong>Continuous aggregates</strong></a> allow you to maintain higher-level rollups over time for aggregate queries that you run often. Suppose you have dashboards that show 10-minute averages for all of this data. In that case, you could write a continuous aggregate to pre-aggregate that data for you over time without modifying the snapshot process. 
This allows you to create new aggregations after the raw data has been stored and new query opportunities come to light.</p><p>And finally, <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/data-retention/"><strong>data retention</strong></a> allows you to drop older raw data automatically once it has reached a defined age. If continuous aggregates are defined on the raw data, they will continue to show the aggregated data, which lets you maintain the level of data fidelity you need as data ages.</p><p>Together, these features provide a complete solution for storing lots of monitoring data about your cluster(s) over the long haul. See the links provided for each feature for more information.</p><h2 id="better-self-serve-query-monitoring-with-pgstatstatements">Better Self-Serve Query Monitoring With pg_stat_statements</h2><p>Everything we've discussed and shown in this post is just the beginning. With a few <a href="https://www.tigerdata.com/blog/database-indexes-in-postgresql-and-timescale-cloud-your-questions-answered" rel="noreferrer">hypertables</a> and queries, the cumulative data from <code>pg_stat_statements</code> can quickly come to life. Once the process is in place and you get more comfortable querying it, visualizing it will be very helpful. </p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/04/PIT-Query-pg_stat-img-1.gif" class="kg-image" alt="" loading="lazy" width="512" height="330"></figure><p><code>pg_stat_statements</code> is automatically enabled in all Timescale services. 
If you’re not a user yet, <a href="https://console.cloud.timescale.com/signup">you can try out Timescale for free</a> (no credit card required) to get access to a modern cloud-native database platform with <a href="https://timescale.ghost.io/blog/postgresql-timescaledb-1000x-faster-queries-90-data-compression-and-much-more/" rel="noreferrer">TimescaleDB's top performance</a>.<br><br>To complement <code>pg_stat_statements</code> for better query monitoring, check out <a href="https://timescale.ghost.io/blog/database-monitoring-and-query-optimization-introducing-insights-on-timescale/" rel="noreferrer">Insights</a>, also available for trial services.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Teaching Elephants to Fish]]></title>
            <description><![CDATA[Timescale's developer advocate Ryan Booz reflects on the PostgreSQL community and shares five ideas on how to improve it.]]></description>
            <link>https://www.tigerdata.com/blog/the-future-of-community-in-light-of-babelfish</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/the-future-of-community-in-light-of-babelfish</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[SQL]]></category>
            <category><![CDATA[AWS]]></category>
            <category><![CDATA[Events & Recaps]]></category>
            <dc:creator><![CDATA[Ryan Booz]]></dc:creator>
            <pubDate>Wed, 27 Apr 2022 16:35:09 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/04/PostgreSQL-community-elephants.jpg">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/04/PostgreSQL-community-elephants.jpg" alt="A black and white picture of a group of elephants protecting a baby elephant" /><h3 id="the-future-of-community-in-light-of-babelfish">The future of community in light of Babelfish</h3><p><em>This blog post was adapted from the PGConf NYC 2021 keynote. Originally published at </em><a href="https://postgresconf.org/blog/posts/teaching-elephants-to-fish">https://postgresconf.org</a>.</p><p>On December 1, 2020, at its annual re:Invent conference, Amazon AWS announced Babelfish, an open-source PostgreSQL translation layer that allows SQL Server applications to work natively, and transparently, with PostgreSQL. To be honest, as someone who has spent a significant part of my career using both SQL Server and PostgreSQL, this wasn't actually a very “exciting” development. </p><p>I'm not sure that most people in either community really gave it that much notice a year ago. In fact, my first thought was that Babelfish is just an oversized object-relational mapping (ORM) framework that isn't tied to any specific development language. While these tools have proven to be hugely useful to many developers, nearly every DBA has first-hand war stories that demonstrate the challenge that automated query generation can impose on a complex system. Frankly, the thought terrifies me a bit.</p><p>Until recently, however, all we knew about Babelfish was based on Amazon-published content. But in October 2021, Babelfish was finally released for public access and preview at <a href="https://babelfishpg.org/">https://babelfishpg.org/</a>.</p><p>Not long after the release was announced, I had the opportunity to participate in a video call with some members of the European PostgreSQL community for a first look at Babelfish in real life. 
It was interesting, and kind of exciting to see what worked and what didn't. However, I didn't leave that call any less concerned about the struggles my SQL Server friends will have as their management teams mandate switching to PostgreSQL using Babelfish. It also got me thinking: <em>why SQL Server?</em></p><p>I decided to look at <a href="https://db-engines.com/">DB-engines.com</a> to see if the engagement metrics that they track would shed any light on it. Although the website doesn't disclose the specific method for determining database engine rank, we know that social engagement and search engine trends play a role in the rankings.</p><p>When I zoomed in on the four "major" relational database engines, utilizing the last nine years of data, two things jumped out at me:</p><ol><li>Only PostgreSQL has seen consistent increases in popularity and engagement over the nine years of tracking.</li><li>While the three other engines have had some steady decline over the same period, SQL Server seems to have the biggest drop-off compared to Oracle and MySQL.</li></ol><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/04/Babelfish-Img-1.png" class="kg-image" alt="A database engines ranking showing how SQL Server seems to have the biggest drop off compared to Oracle and MySQL in the last few years." 
loading="lazy" width="1370" height="836" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/04/Babelfish-Img-1.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/04/Babelfish-Img-1.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/04/Babelfish-Img-1.png 1370w" sizes="(min-width: 720px) 720px"></figure><p>While I can posit many reasons why Amazon AWS chose to go directly after SQL Server for this transparent compatibility tool, the reasons to use PostgreSQL as the solution to the problem are obvious to the PostgreSQL community. </p><p>A few facts on PostgreSQL:</p><ul><li>It is the fastest-growing relational database engine on the planet.</li><li>It has a proven foundation.</li><li>It is easy to enhance through extensibility.</li></ul><p>Regardless of how we feel about this new turn of events or the potential onslaught of new support needs by users of SQL Server, there's not much we can do about it. The proverbial cat is already out of the bag.</p><p>Whether the necessary patches are included in the core PostgreSQL code or not, AWS Aurora, at least, will still offer this functionality as a service. I believe this means that over the next 2-4 years, the community will grow from a base of users who have a lot of database experience but little footing for how to approach a similar, but different, database. And regardless of how we feel about the new demands this will put on this community, it's a group of people who still want to do their job well and contribute back to the community.</p><h2 id="why-do-i-care">Why Do I Care?</h2><p>So, why do I care? Well, if you were to put my 20+ years of database experience into a word cloud of sorts, SQL Server would occupy the largest portion of space. For nearly 15 of the last 21 years, I was primarily a Microsoft data platform user. 
And while PostgreSQL has occupied the second largest (and longest) portion of my database landscape, I really came of age as a database professional within the SQL Server community. In fact, since coming back into the PostgreSQL community almost four years ago, I've continued to look for ways to foster a community and learning modeled after what I knew and experienced through SQL Server and the Professional Association of SQL Server (PASS) community. And I can say, without a doubt, that I'm not the only one looking for the same thing.</p><p>Let me ask you something then.</p><p><strong><em>When was the last time you had to join a new community?</em></strong></p><p>Is the first thing that comes to mind a technical community (data, programming language, visualization tools, etc.) or something else? Whatever community that was, how did you feel after first stepping through the proverbial door?</p><p>Did you have a crowd of people ready to cheer you on, eager to see you succeed, and ready to support you?</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/04/Picture1.gif" class="kg-image" alt="People cheering in the audience at a sports game" loading="lazy" width="480" height="270"></figure><p>Or, did you feel like an outsider looking in, trying to figure out how to find the right help, from the right people? 
Did you feel like everyone else always knew how to connect with the people around them, but you struggled to find camaraderie and resources?</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/04/Picture2.gif" class="kg-image" alt="Comedian Conan O'Brien looking inside a house through its window" loading="lazy" width="440" height="237"></figure><p>Depending on where you fall, what could have made it better for others or for yourself?</p><p>Let's bring this same thought experiment a little closer to home. What about the PostgreSQL community? What was your "onboarding" experience like, and how does that compare to the experience of some of the newest members you've met recently?</p><p>It just so happens that we have some feedback from the larger community based on <a href="https://www.timescale.com/state-of-postgres-results/"><em>The State of PostgreSQL</em></a> survey Timescale orchestrated this past April. There were <strong>500 respondents</strong> who ranged in experience level from novice, newly joined developers to users with 20+ years of experience. 
Of those 500 respondents, <strong>49.5&nbsp;%</strong> have used PostgreSQL for <strong>five years or less</strong>, and <strong>50.5&nbsp;%</strong> have used it for <strong>six years or more</strong>.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/04/bablefish-img-4.png" class="kg-image" alt="A bar graph on how long users have been using Postgres: 49.5&nbsp;% have used PostgreSQL for five years or less" loading="lazy" width="1150" height="520" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/04/bablefish-img-4.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/04/bablefish-img-4.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/04/bablefish-img-4.png 1150w" sizes="(min-width: 720px) 720px"></figure><p>It would make sense, then, that about 50&nbsp;% of the survey participants felt that it was a bit difficult to use PostgreSQL and get involved with the community.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/04/bablefish-img-5.png" class="kg-image" alt="Text boxes with users' opinions on the PostgreSQL community over a blue background" loading="lazy" width="1600" height="915" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/04/bablefish-img-5.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/04/bablefish-img-5.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/04/bablefish-img-5.png 1600w" sizes="(min-width: 720px) 720px"></figure><p>And yet, the other 50&nbsp;% of participants seem to have a very different experience with PostgreSQL the 
longer they stay connected.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/04/bablefish-img-6.png" class="kg-image" alt="Text boxes with users' opinions on the PostgreSQL community over a blue blackground" loading="lazy" width="1600" height="839" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/04/bablefish-img-6.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/04/bablefish-img-6.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/04/bablefish-img-6.png 1600w" sizes="(min-width: 720px) 720px"></figure><p>The real goal, then, is to determine ways to improve the user experience within the PostgreSQL community earlier in the cycle, rather than hoping folks stick around for more than five years so that they can begin to have a more positive outlook on the community at large.</p><p>As a developer advocate at Timescale, one of my primary responsibilities is to engage with the PostgreSQL community so that we can figure out how to tackle these issues head-on. I'm excited to contribute to the efforts, learning better methods to teach people how to use this technology well and help them be successful.</p><p>And, as a former SQL Server professional and community member, I want to prepare for what I believe will be a growing number of database professionals joining the Slack channel, Twitter conversations, and conferences trying to improve their craft and give back to the community.</p><p>To do this, we have at least two options in the months and years ahead.</p><p>The first option is to take sides. Unfortunately, this happens all too often in technical communities. Whether it's a database engine or the newest development language, this approach is an option. "We're better! 
You're not!"</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/04/bablefish-img-7.png" class="kg-image" alt="A meme from the Captain America Marvel movie showing two groups of superheroes opposing each other (PostgreSQL vs. SQL server)" loading="lazy" width="450" height="449"></figure><p>Or… we could choose to just sit around the table, share a meal, and learn from one another about how we can build a better, shared community based on the best parts of what each community offers.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/04/bablefish-img-8.png" class="kg-image" alt="A scene from the same movie showing a group of people hanging out at a restaurant table" loading="lazy" width="450" height="233"></figure><p>Quite honestly, we could ensure that we're treating the community more like our beloved namesake—Slonik, the elephant.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/04/bablefish-img-9.png" class="kg-image" alt="A group of elephants protecting a baby elephant" loading="lazy" width="450" height="282"></figure><p>It turns out that elephants are close-knit communities. They care for each other, they actively accept orphans and elephants that are in need, and they're generational, passing down knowledge and community expectations from one generation to the next. Not a bad example to follow, right?</p><h2 id="five-initial-options-for-building-a-better-postgresql-community">Five Initial Options for Building a Better PostgreSQL Community</h2><p>Let me present you with five quick, high-level thoughts on ideas we could reuse from the SQL Server community that might begin the process of improving participation and engagement.</p><h3 id="1-lead-with-empathy-and-curiosity"><strong>1. 
Lead with empathy and curiosity</strong></h3><p>What does it mean to lead with empathy and curiosity? When the posture of the community starts with empathy, it means that we remember what it was like to be new to a community ourselves. When we start conversations from a place of curiosity, we avoid choosing sides and instead join new users where they are. <br><br>Here are a few things to keep in mind in regards to the SQL Server community that might start to show up in greater numbers soon, although I think these are good things to remember regardless of the user.</p><p><strong>Expect confusion from users that already know SQL</strong>. Oftentimes these users honestly don't know that the dialect of SQL they've been using (T-SQL in this case) isn't a standard. They know concepts but not direct comparisons to the SQL standard or pl/pgsql.</p><p><strong>Remember that you were a newbie once</strong>. Remember when I asked you to think of the last new community you joined? ;-)</p><p><strong>Assume positive intent and that they have tried to search out a solution first.</strong> Again, many users coming from the SQL Server community have a good network of people and resources they've grown accustomed to (we'll touch on that next!). But that community has also fostered good habits for finding solutions and asking better questions. Assume the best!</p><p><strong>Prepare resources to meet their specific needs</strong>. This doesn't mean that we craft all documentation, blogs, and help forums for a specific user base. But, if we know many people will be joining this community with the same understanding of a feature or topic, we can provide them a better foundation for transferring their knowledge into PostgreSQL. 
It would reduce support overall and foster better community involvement.</p><p>Let me give you one really simple example from my own experience (and that of <em>many</em> SQL Server users that try PostgreSQL for the first time).</p><p>It is a common practice in nearly every T-SQL script I've written, seen, or used to create and use variables and control logic directly in the flow of a script. Because all SQL is executed by default as T-SQL, there is no need for code blocks to do data-specific logic. This really simple T-SQL example is incredibly common in the day-to-day workflow of a SQL Server DBA or developer.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/04/bablefish-img-10.png" class="kg-image" alt="" loading="lazy" width="450" height="287"></figure><p>In PostgreSQL, however, I can't do this in the midst of my migration scripts or maintenance tasks directly. This was a constant source of frustration for me during the release of my first feature at a new company that was using PostgreSQL (but had very few developers that understood databases). <em>I knew</em> what I wanted to do, but I couldn't get any IDE or migration script to do what I wanted.</p><p>Eventually, I found a post that talked about anonymous functions within a SQL script for ad hoc processing. Although I didn't like the added complexity, I could finally write my migrations correctly.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/04/bablefish-img-11.png" class="kg-image" alt="" loading="lazy" width="450" height="400"></figure><p>Having proactive examples in the documentation that acknowledge this difference could be a game-changer for developer success.</p><h3 id="2-lower-the-bar-for-entry-level-pghelp">2. 
Lower the bar for entry-level #pghelp</h3><p>More than 10 years ago, someone in the SQL Server community had the idea of using #sqlhelp Twitter hashtag to provide help. At the time, the size limit for a tweet was 140 characters so they understood it could only provide short, really succinct triage-like help. In some ways, I think this limitation has actually helped the community learn how to ask better questions that will draw valuable answers. This is evidenced by two things in my opinion.</p><p>First, not every question gets an answer. It's a community-led initiative and so the quality of the question and respect for the "free" nature of the help influence overall engagement. And second, the SQL Server community actively protects the use of this hashtag. That's not always taken kindly by outsiders, and I'm not even sure how much I agree that a community can "own" a hashtag, but it produces a valuable community around a specific technology that has proven to be helpful for thousands of users over the years.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/04/bablefish-img-12.png" class="kg-image" alt=" Twitter users chipping in on a thread with the #sqlhelp hashtag" loading="lazy" width="847" height="1172" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/04/bablefish-img-12.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/04/bablefish-img-12.png 847w" sizes="(min-width: 720px) 720px"></figure><p>Like the PostgreSQL community, SQL Server users also have a community Slack channel that is active and more long-form. 
But today, more than 10 years after the community started using it, #sqlhelp is an active channel of connecting with the larger community to give and receive timely help.</p><p>Could we do something similar with a #pghelp hashtag?</p><h3 id="3-support-new-members-by-cultivating-more-leaders">3. Support new members by cultivating more leaders</h3><p>As the PostgreSQL community grows, we can only support users if we build a growing group of leaders. A few of the leaders in the SQL Server community realized this same need over a decade ago and proactively sought ways to build new leaders, content creators, and community advocates. One successful example of this is an initiative called "T-SQL Tuesday," a worldwide monthly "blog fest."</p><p>The idea is simple. Each month, someone volunteers to be the host, they announce the topic at least a week in advance, and then anyone from around the world can contribute to the conversation by publishing a blog on that topic. Some of the topics are technical (replication, high availability, query tuning success), while some are more soft-skill-focused (best/worst SQL interview experience, how to avoid burnout in the SQL field).</p><p>As I said, it was specifically started as an initiative to get more people in the community to contribute to the conversation. Of almost any initiative that this community has undertaken in the last 10-15 years, T-SQL Tuesday has done more than anything else to cultivate new community leaders, alter careers, and bring collaboration across the globe. The most intriguing part of this for me is that it's free to run and participate in, and Microsoft has had nothing to do with it. It is completely community-led.</p><p>Starting in April 2022, a few PostgreSQL community members are going to start "PSQL Phriday," a monthly community blogging initiative. 
To learn more about it, the monthly topics, and how you can participate, watch the blogging feeds at <a href="http://planet.postgresql.org/">planet.postgresql.org</a> and monitor the #psqlphriday on Twitter. I'm excited to see this get started and can't wait to learn from many others in the community!</p><h3 id="4-seek-leaders-proactively"><strong>4. Seek leaders proactively</strong></h3><p>As new members join the community and it grows, many of them will come with a desire to contribute in some way. Some of that has happened in meaningful ways around tooling that have dramatically improved the day-to-day tasks of every SQL Server DBA.</p><p>One of the best examples of this in recent years has been the <a href="https://dbatools.io/">DBATools project</a>. This is a PowerShell toolkit of hundreds of commands that can help back up a database or migrate an entire cluster of servers, no UI necessary. It is heavily supported by the community and they're always looking for opportunities to grow, learn, and contribute. Finding these developer-focused initiatives could be a great way to enlist help and add additional support resources as the community grows.</p><h3 id="5-develop-consistent-messaging-around-community"><strong>5. Develop consistent messaging around community</strong></h3><p>Lastly, I think it would be helpful to consider ways to consistently articulate the best methods and practices for accessing help within the PostgreSQL community. Although the <a href="https://www.postgresql.org/community/">Community page</a> on the PostgreSQL.org website does list many avenues for getting help, it still requires a fair amount of cognitive load to figure out which avenue is best suited for a given need.</p><ul><li>When do I use the email lists? What if I don't want to subscribe long-term?</li><li>If I join the Slack channel, how do I best ask for help? Can I mention specific people to try and get help? 
Create new rooms?</li><li>Why would I use IRC or Discord over Slack?</li><li>Is there a #pghelp Twitter hashtag, and if so, what kind of questions are best asked there?</li></ul><p>As a new user in the PostgreSQL community, I wanted this kind of guidance because I didn't want to overuse a resource or direct questions to the wrong group of people. If we had more consistent guidance on how to interface with the community across the plethora of channels, then other leaders within the PostgreSQL community could give the same consistent message. </p><p>I appreciate how leaders like Brent Ozar, a leader in the SQL Server community, don't feel obligated to answer every question thrown their way. They have clear instructions on their blog that describe the best ways for someone to actually get effective help and they often direct them there. "Hey, thanks for reaching out. That sounds like a great question for #sqlhelp. Check out this link for instructions on how to use it!"</p><p>When people feel heard, they're more likely to stay connected and involved. Even if the answer often points them to good documentation of how to get help, they're still acknowledged and included.</p><h2 id="wrap-up">Wrap up</h2><p>I'd like to leave you with a small twist on a common adage.</p><p>"You can't pick your family… but you <em>can</em> influence who becomes your friends."</p><p>As this community grows, are we prepared to provide some new ways of engaging with them? These specific ideas might not all fit within the PostgreSQL community, but I'd be interested to hear your thoughts about ways we can better incorporate the skills and talents of the community we already have to prepare for the future. 
Feel free to reach out to me on Twitter (<a href="https://twitter.com/ryanbooz?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor">@ryanbooz</a>), through email (<a href="mailto:ryan@timescale.com">ryan@timescale.com</a>), or on our <a href="https://slack.timescale.com">Timescale Slack channel</a>. </p><p>One last thing! The 2022 State of PostgreSQL survey will open later this spring. Take some time to <a href="https://www.timescale.com/state-of-postgres-results/">review the results from last year</a>, and then sign up at the bottom of the report to be notified when the new survey is ready. The more feedback we receive, the better we can understand our community, what's working well, and what can be improved in the years to come.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How Edeva Uses Continuous Aggregations and IoT to Build Smarter Cities]]></title>
            <description><![CDATA[Swedish company Edeva is making a real impact in city life by quickly presenting decision-makers with trends based on IoT data.]]></description>
            <link>https://www.tigerdata.com/blog/how-edeva-uses-continuous-aggregations-and-iot-to-build-smarter-cities</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/how-edeva-uses-continuous-aggregations-and-iot-to-build-smarter-cities</guid>
            <category><![CDATA[Dev Q&A]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Ana Tavares]]></dc:creator>
            <pubDate>Fri, 22 Apr 2022 13:05:15 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/06/How-Edeva-Uses-Continuous-Aggregations-and-IoT-to-Build-Smarter-Cities_1.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/06/How-Edeva-Uses-Continuous-Aggregations-and-IoT-to-Build-Smarter-Cities_1.png" alt="One of Edeva's dashboard showing the percentage and number of speeders, as well as speed distribution in a specific location." /><p><em>This is an installment of our “Community Member Spotlight” series, where we invite our customers to share their work, shining a light on their success and inspiring others with new ways to use technology to solve problems.<br><br>In this edition, John Eskilsson, software architect at Edeva, shares how his team collects huge amounts of data (mainly) from IoT devices to help build safer, smarter cities and leverages continuous aggregations for lightning-fast dashboards. </em></p><p>Founded in 2009 in Linköping, <a href="https://www.edeva.se/en/">Edeva</a> is a Swedish company that creates powerful solutions for smart cities. It offers managed services and complete systems, including hardware and software platforms.</p><p>As the creators of the dynamic speed bump <a href="https://www.actibump.com/">Actibump</a> and the smart city platform <a href="https://www.edevalive.com/">EdevaLive</a>, the Edeva team works mainly for municipal, regional, and national road administrations, toll stations, environmental agencies, and law enforcement agencies. </p><p>The team also solves many other problems, from obtaining large amounts of environmental data for decision-making to developing a screening scale to help law enforcement agencies assess vehicle overloading. 
The latter, for instance, decreased the amount of time needed to control each vehicle, speeding up traffic checks and allowing law enforcement agencies to control more vehicles.</p><h2 id="about-the-team">About the Team</h2><p>The <a href="https://www.edeva.se/en/#contact">team at Edeva</a> is a small but impactful group of 11, working on everything from creating hardware IoT devices to analyzing time-series data and making it accessible to customers—and, sometimes, the public.</p><p>As a software architect, I am in charge of building the best possible solution to receive, store, analyze, visualize, and share the customers’ event data. Our team then comes together to create solutions that work and that the customer actually wants.</p><h2 id="about-the-project">About the Project</h2><p>Edeva has created a dynamic speed bump called <a href="https://www.actibump.com/">Actibump</a> and the smart city platform <a href="https://www.edevalive.com/">EdevaLive</a>. <br><br>The Actibump has been used in Sweden since 2010. Speeding vehicles activate a hatch in the road that lowers a few centimeters, creating an inverted speed bump. Providing good accessibility for public transportation, such as buses and emergency vehicles, the Actibump still ensures a safe speed for pedestrians and other vulnerable road users. It is also an environmentally friendly solution, helping decrease noise and emissions.</p><figure class="kg-card kg-embed-card"><iframe width="200" height="113" src="https://www.youtube.com/embed/3FG1IBTYXWs?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></figure><p>The Actibump can be combined with the EdevaLive system, delivering valuable remote monitoring services and statistics to Edeva’s customers. 
</p><p>Most of the data we collect is based on IoT devices:</p><p><strong>Traffic flow data</strong>: The Actibump measures the speed of oncoming traffic to decide if it needs to activate the speed bump or not. We capture radar data, among others, and send an event to our smart city platform EdevaLive. The data treats the oncoming traffic as a flow rather than a single vehicle to create the smoothest possible traffic flow.<br></p><p><strong>Vehicle classification data (weigh-in-motion)</strong>: Actibump can be configured with weigh-in-motion. This means that the lid of the speed bump is equipped with a very sensitive high-sampling scale. The scale records several weight measurements when the vehicle passes over the speed bump. This way, it can detect how many axles a vehicle has and classify the type of vehicle. At the same time, it fires off one event for each axle with the scale fingerprint so we can analyze whether the weight measurements are correct.</p><p><br><strong>Vehicle classification data (radar)</strong>: If we want to classify vehicles in places where we do not yet have an Actibump installed, we can introduce a radar that can classify vehicle types. A roadside server controls the radar, gathers its data, and pushes it to EdevaLive. <br></p><p><strong>Bike and pedestrian data</strong>: We use cameras installed above a pedestrian and cycle path. The camera can detect and count pedestrians and bicycles passing in both directions. We push this data to EdevaLive for analysis. <br></p><p><strong>Number plate data: </strong>We can use a camera to detect a vehicle's number plate. This way, we can control devices like gates to open automatically. The camera can also be used to look up the number of electric vs. petrol or diesel vehicles passing it or determine if a specific vehicle exceeds the cargo weight limit.</p><p><br><strong>Gyroscopic data</strong>: We offer a gyroscopic sensor that can gather data for acceleration in all different planes. 
This device generates a lot of data that can be uploaded to EdevaLive in batches or as a stream (if the vehicle has an Internet connection). This data is GPS-tagged and can be used to calculate jerk (the rate of change of acceleration), which gives an indication of a bus driver's working conditions, for instance. The data can also be used to calculate the wear and tear of vehicles and many other things.<br></p><p><strong>Environmental data</strong>: Monitoring environmental data in a smart city platform is important. This is why we use small portable devices that can measure the occurrence of different particle sizes in the air, CO2, and other gases. In addition, they measure the usual things like temperature, wind speed, etc. All this data is pushed to EdevaLive.</p><p><strong>Alarm data</strong>: Our IoT devices and roadside servers can send alarm information if a sensor or other part malfunctions. All this data comes to EdevaLive in the same way as a regular IoT event, but these events are only used internally so that we can react as quickly as possible if there is a problem.<br></p><p><strong>Status data</strong>: While the alarm data flags anomalies, the status data simply reports the status of the server or IoT device. The devices run self-checks and report statistical data, like disk utilization, temperature, and load. This is also just for internal use to spot trends or troubleshoot in case any problems arise. For instance, it is incredibly useful to correlate CPU load with the version number of firmware or other software versions.  <br></p><p><strong>Administrative data</strong>: This is where the power of SQL and time-series data really shines. Let’s say we added a new device, and it has a configuration object that is persisted in a regular table in Timescale. This object keeps some metadata, such as the date it was added to the system or the device's display name. 
This way, we can use a join easily to pick up metadata about the device and, at the same time, get time-series data for the events that are coming in. There is only one database connection to handle and one query to run.</p><h2 id="choosing-and-using-timescaledb">Choosing (and Using!) TimescaleDB</h2><p>We realized we needed a time-series database a few years ago when we started storing our data in MySQL. At the time, we made a move to MongoDB, and it worked well for us but required quite a bit of administration and was harder to onboard other developers.<br><br>I looked at InfluxDB but never considered it in the end because it was yet another system to learn, and we had learned that lesson with MongoDB.</p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">✨</div><div class="kg-callout-text"><i><b><strong class="italic" style="white-space: pre-wrap;">Editor’s Note:</strong></b></i><i><em class="italic" style="white-space: pre-wrap;"> For more comparisons and benchmarks, see how TimescaleDB compares to </em></i><a href="https://www.timescale.com/blog/timescaledb-vs-influxdb-for-time-series-data-timescale-influx-sql-nosql-36489299877" rel="noreferrer"><i><em class="italic" style="white-space: pre-wrap;">InfluxDB</em></i></a><i><em class="italic" style="white-space: pre-wrap;">, </em></i><a href="https://www.timescale.com/blog/how-to-store-time-series-data-mongodb-vs-timescaledb-postgresql-a73939734016" rel="noreferrer"><i><em class="italic" style="white-space: pre-wrap;">MongoDB</em></i></a><i><em class="italic" style="white-space: pre-wrap;">, </em></i><a href="https://www.timescale.com/blog/timescaledb-vs-amazon-timestream-6000x-higher-inserts-175x-faster-queries-220x-cheaper" rel="noreferrer"><i><em class="italic" style="white-space: pre-wrap;">AWS Timestream</em></i></a><i><em class="italic" style="white-space: pre-wrap;">, </em></i><a 
href="https://www.timescale.com/blog/postgresql-timescaledb-1000x-faster-queries-90-data-compression-and-much-more" rel="noreferrer"><i><em class="italic" style="white-space: pre-wrap;">vanilla PostgreSQL</em></i></a><i><em class="italic" style="white-space: pre-wrap;">, and </em></i><a href="https://www.timescale.com/learn/the-best-time-series-databases-compared" rel="noreferrer"><i><em class="italic" style="white-space: pre-wrap;">other time-series database alternatives on various vectors</em></i></a><i><em class="italic" style="white-space: pre-wrap;">, from performance and ecosystem to query language and beyond.</em></i></div></div><p><br>Learning from this journey,  I looked for a solution that plugged the gaps the previous systems couldn’t. That is when I found Timescale and discovered that there was a hosted solution. <br><br>We are a small team that creates software with a big impact. This means that we don’t really have time to put a lot of effort into tweaking and administering every tool we use, but we still like to have control.</p><blockquote class="kg-blockquote-alt">"With Timescale, our developers immediately knew how to use the product because most of them already knew SQL"</blockquote><p>Also, since TimescaleDB is basically PostgreSQL with time-series functionality on steroids, it is much easier to onboard new developers if needed. With Timescale, our developers immediately knew how to use the product because most of them already knew SQL.<br><br>Edeva uses TimescaleDB as the main database in our smart city system. Our clients can control their IoT devices (like the Actibump from EdevaLive) and—as part of that system—see the data that has been captured and quickly get an overview of trends and historical data. We offer many graphs that show data in different time spans, like day, week, month, and year. 
To get this to render really fast, we use continuous aggregations.</p><blockquote class="kg-blockquote-alt">"Timescale is basically PostgreSQL with time-series functionality on steroids, it is much easier to onboard new developers if needed"</blockquote><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/06/How-Edeva-Uses-Continuous-Aggregations-and-IoT-to-Build-Smarter-Cities_2.png" class="kg-image" alt="One of Edeva's dashboards showing graphs for total passages, percentage of speeders, number of speeders, 85 percentile, and the average number of passages per day." loading="lazy" width="774" height="814" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2024/06/How-Edeva-Uses-Continuous-Aggregations-and-IoT-to-Build-Smarter-Cities_2.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/06/How-Edeva-Uses-Continuous-Aggregations-and-IoT-to-Build-Smarter-Cities_2.png 774w" sizes="(min-width: 720px) 720px"></figure><p></p><div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-emoji">✨</div><div class="kg-callout-text"><i><b><strong class="italic" style="white-space: pre-wrap;">Editor’s Note: </strong></b></i><a href="https://www.timescale.com/blog/massive-scale-for-time-series-workloads-introducing-continuous-aggregates-for-distributed-hypertables-in-timescaledb-2-5/"><i><em class="italic" style="white-space: pre-wrap;">Learn how to use continuous aggregates for real-time analytics in PostgreSQL.</em></i></a></div></div><h2 id="current-deployment-and-future-plans"><br>Current Deployment and Future Plans</h2><p>One of the TimescaleDB features that has had the most impact on our work is continuous aggregations. It changed our dashboards from sluggish to lightning-fast. 
If we are building functionality to make data available for customers, we always aggregate it first to speed up the queries and take the load off the database. It used to take minutes to run some long-term data queries. Now, almost all queries for long-term data are subsecond.</p><blockquote>"We rely on Timescale for everything now. It's super efficient, and we've reduced query load times from 30 seconds down to almost nothing." - John Eskilsson, System Architect</blockquote><p>For example, we always struggled with showing the 85th percentile of speed over time. To get accurate percentile data, you had to calculate it from the raw data, because percentiles computed per bucket cannot simply be averaged or combined afterward. If you had 200 million events in a <a href="https://www.tigerdata.com/blog/database-indexes-in-postgresql-and-timescale-cloud-your-questions-answered" rel="noreferrer">hypertable</a> and wanted several years of data for a specific sensor, it could take a long time to deliver, and users don’t want to wait that long. </p><blockquote class="kg-blockquote-alt">"It changed our dashboards from sluggish to lightning-fast"</blockquote><p><br>Now that Timescale has introduced <a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/percentile_agg/#percentile-agg"><code>percentile_agg</code></a> and <a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/approx_percentile/"><code>approx_percentile</code></a>, we can actually query continuous aggregations and <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/hyperfunctions/percentile-approx/advanced-agg/">get reasonably accurate percentile values</a> without querying raw data.</p><div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-emoji">✨</div><div class="kg-callout-text"><i><b><strong class="italic" style="white-space: pre-wrap;">Editor’s Note: </strong></b></i><a 
href="https://www.timescale.com/blog/how-percentile-approximation-works-and-why-its-more-useful-than-averages/"><i><em class="italic" style="white-space: pre-wrap;">Percentile approximations can be more useful for large time-series data sets than averages. Read how they work in this blog post</em></i></a><i><em class="italic" style="white-space: pre-wrap;">.</em></i></div></div><p><br><br>Note that “vehicles” is a hypertable containing several hundred million records, and actibump_id is the ID of the dynamic speed bump.<br><br>This is how we build the continuous aggregate:</p><pre><code>CREATE MATERIALIZED VIEW view1
 WITH (timescaledb.continuous) AS
 SELECT actibump_id,
 timescaledb_experimental.time_bucket_ng(INTERVAL '1 month', time, 'UTC') AS bucket,
 percentile_agg(vehicle_speed_initial) AS percentile_agg
FROM vehicles
GROUP BY actibump_id, bucket
</code></pre>
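<p><em>Editor's sketch: for comparison, this is roughly what the exact calculation on raw data could look like. It is an illustration, not Edeva's code. It assumes the <code>vehicles</code> columns named in the view above and uses PostgreSQL's built-in <code>percentile_cont</code> and <code>date_trunc</code>; the sensor ID and time range are placeholder values. Because every bucket has to sort its raw speeds, a query of this shape gets slower as the hypertable grows.</em></p><pre><code>-- Illustrative sketch only: exact 85th percentile per month from raw rows.
-- Assumption: column names match the continuous aggregate definition above.
-- The sensor ID and time range are placeholder values.
SELECT date_trunc('month', time) AS month_start,
       actibump_id,
       percentile_cont(0.85) WITHIN GROUP (ORDER BY vehicle_speed_initial) AS p85
FROM vehicles
WHERE actibump_id IN ('16060022')
  AND time &gt;= '2021-02-01'
  AND time &lt; '2022-04-01'
GROUP BY month_start, actibump_id
ORDER BY month_start ASC;
</code></pre>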
<p>And this is the query that fetches the data for the graph:</p><pre><code>SELECT TIMESCALEDB_EXPERIMENTAL.TIME_BUCKET_NG(INTERVAL '1 month', bucket) AS date,
actibump_id,
APPROX_PERCENTILE(0.85, ROLLUP(PERCENTILE_AGG)) AS p85,
MAX(signpost_speed_max)
FROM vehicles_summary_1_month
WHERE actibump_id in ('16060022')
AND bucket &gt;= '2021-01-30 23:00:00'
AND bucket &lt;= '2022-04-08 21:59:59'
GROUP BY date, actibump_id
ORDER BY date ASC
</code></pre>
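<p><em>Editor's sketch: <code>percentile_agg</code> stores a mergeable sketch rather than a single number, so <code>rollup</code> can also combine the monthly buckets into coarser ones without going back to raw data. The yearly grouping below is a hypothetical variation on the query above, not part of Edeva's dashboards, and it assumes the same <code>vehicles_summary_1_month</code> columns.</em></p><pre><code>-- Illustrative sketch only: approximate yearly p85 from the monthly aggregate.
-- Assumption: the view exposes bucket, actibump_id, and percentile_agg.
SELECT date_trunc('year', bucket) AS bucket_year,
       actibump_id,
       approx_percentile(0.85, rollup(percentile_agg)) AS p85
FROM vehicles_summary_1_month
WHERE actibump_id IN ('16060022')
GROUP BY bucket_year, actibump_id
ORDER BY bucket_year ASC;
</code></pre>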
<p>Here is an example of the graph:</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/06/How-Edeva-Uses-Continuous-Aggregations-and-IoT-to-Build-Smarter-Cities_percentile-graph.png" class="kg-image" alt="The 85 percentile line graph" loading="lazy" width="1172" height="230" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2024/06/How-Edeva-Uses-Continuous-Aggregations-and-IoT-to-Build-Smarter-Cities_percentile-graph.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2024/06/How-Edeva-Uses-Continuous-Aggregations-and-IoT-to-Build-Smarter-Cities_percentile-graph.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/06/How-Edeva-Uses-Continuous-Aggregations-and-IoT-to-Build-Smarter-Cities_percentile-graph.png 1172w" sizes="(min-width: 720px) 720px"></figure><p><br>At the moment, we use PHP and Yii 2 to deploy TimescaleDB. We connect to TimescaleDB with Qlik Sense for business analytics. In Qlik Sense, you can easily connect to TimescaleDB using the PostgreSQL integration. <br><br>It is especially convenient to be able to connect to the continuous aggregations for long-term data without overloading the system with too much raw data. We often use Qlik Sense to rapidly prototype graphs that we later add to EdevaLive.</p><h2 id="advice-and-resources">Advice and Resources</h2><p>The next step for us is to come up with a good way of reducing the amount of raw data we store in TimescaleDB. We are looking at how we can integrate it with a data lake. Apart from that, we are really excited to start building even more graphs and map applications.</p><p>If you are planning to store time-series data, Timescale is the way to go. 
It makes it easy to get started because it is “just” SQL, and at the same time, you get the important features needed to work with time-series data. I recommend you have a look, especially at continuous aggregations.</p><p>Think about the whole lifecycle when you start. Will your use cases allow you to use features like compression, or do you need to think about how to store long-term data outside of TimescaleDB to make it affordable right from the start? You can always work around things as you go along, but it is good to have a plan for this before you go live.</p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">💻</div><div class="kg-callout-text"><i><em class="italic" style="white-space: pre-wrap;">If you want to learn more about how Edeva handles time-series data with Actibump and EdevaLive, the team hosts </em></i><a href="https://www.edevalive.com/#webinar"><i><em class="italic" style="white-space: pre-wrap;">virtual biweekly webinars</em></i></a><i><em class="italic" style="white-space: pre-wrap;">, or you can also </em></i><a href=" https://www.edevalive.com/#demo"><i><em class="italic" style="white-space: pre-wrap;">request a demo</em></i></a><i><em class="italic" style="white-space: pre-wrap;">.</em></i></div></div><p><br><em>We’d like to thank John and all the folks from Edeva for sharing their story. We are amazed to see how their work truly impacts the way people live and enjoy their city with a little help from time-series data. </em>🙌</p><p><em>We’re always keen to feature new community projects and stories on our blog. If you have a story or project you’d like to share, reach out on Slack (</em><a href="https://slack.timescale.com"><em>@Ana Tavares</em></a><em>), and we’ll go from there.</em><br></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Using Pg_Stat_Statements to Optimize Queries]]></title>
            <description><![CDATA[Discover how the pg_stat_statements PostgreSQL extension can help you identify problematic queries and optimize your query performance.

]]></description>
            <link>https://www.tigerdata.com/blog/using-pg-stat-statements-to-optimize-queries</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/using-pg-stat-statements-to-optimize-queries</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[PostgreSQL Performance]]></category>
            <category><![CDATA[PostgreSQL Tips]]></category>
            <category><![CDATA[Cloud]]></category>
            <category><![CDATA[Announcements & Releases]]></category>
            <dc:creator><![CDATA[Ryan Booz]]></dc:creator>
            <pubDate>Wed, 30 Mar 2022 13:15:09 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/03/pg-stat-statements-timescale-2.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/03/pg-stat-statements-timescale-2.png" alt="Identify Slow PostgreSQL Queries With pg_stat_statements: the PostgreSQL and Timescale logos on a yellow backdrop." /><p><em><code>pg_stat_statements</code> allows you to quickly identify problematic or slow Postgres queries, providing instant visibility into your database performance. Today, we're announcing that we've enabled <code>pg_stat_statements</code> by default in all Timescale services. This is part of our #AlwaysBeLaunching Cloud Week with MOAR features! </em>🐯☁️</p><p>PostgreSQL is one of the fastest-growing databases in terms of usage and community size, being backed by many dedicated developers and supported by a broad ecosystem of tooling, connectors, libraries, and visualization applications. PostgreSQL is also extensible: using <a href="https://www.tigerdata.com/blog/top-8-postgresql-extensions" rel="noreferrer">PostgreSQL extensions</a>, users can add extra functionality to PostgreSQL’s core.  Indeed, TimescaleDB itself is <a href="https://www.tigerdata.com/blog/top-8-postgresql-extensions" rel="noreferrer">packaged as a PostgreSQL extension</a>, which also plays nicely with the broad set of other PostgreSQL extensions, as we’ll see today.</p><p>Today, we’re excited to share that <code>pg_stat_statements</code>, one of the most popular and widely used PostgreSQL extensions, is now enabled by default in all Timescale services. If you’re new to Timescale, <a href="https://console.cloud.timescale.com/signup">start a free trial</a> (100&nbsp;% free for 30 days, no credit card required).</p><h3 id="what-is-pgstatstatements">What is pg_stat_statements?</h3><p><a href="https://www.postgresql.org/docs/9.4/pgstatstatements.html"><code>pg_stat_statements</code></a> is a PostgreSQL extension that records information about your running queries. 
Identifying performance bottlenecks in your database can often feel like a cat-and-mouse game. Quickly written queries, index changes, or complicated ORM query generators can (and often do) negatively impact your database and application performance. </p><h3 id="how-to-use-pgstatstatements">How to use pg_stat_statements</h3><p>As we will show you in this post, <code>pg_stat_statements</code> is an invaluable tool to help you identify which queries are performing slowly and poorly and why. For example, you can query  <code>pg_stat_statements</code> to know how many times a query has been called, the query execution time, the hit cache ratio for a query (how much data was available in memory vs. on disk to satisfy your query),  and other helpful statistics such as the standard deviation of a query execution time.</p><p>Keep reading to learn how to query <code>pg_stat_statements</code> to identify PostgreSQL slow queries and other performance bottlenecks in your Timescale database.</p><p><em>A huge thank you to Lukas Bernert, Monae Payne, and Charis Lam for taking care of all things pg_stat_statements in Timescale. </em></p><h2 id="how-to-query-pgstatstatements-in-timescale">How to Query <code>pg_stat_statements</code> in Timescale </h2><p>Querying statistics data for your Timescale database from the <code>pg_stat_statements</code> view is straightforward once you're connected to the database.</p><figure class="kg-card kg-code-card"><pre><code class="language-SQL">SELECT * FROM pg_stat_statements;

userid|dbid |queryid             |query                         
------+-----+--------------------+------------------------------
 16422|16434| 8157083652167883764|SELECT pg_size_pretty(total_by
    10|13445|                    |&lt;insufficient privilege&gt;      
 16422|16434|-5803236267637064108|SELECT game, author_handle, gu
 16422|16434|-8694415320949103613|SELECT c.oid,c.*,d.description
    10|16434|                    |&lt;insufficient privilege&gt;      
    10|13445|                    |&lt;insufficient privilege&gt;   
 ...  |...  |...                 |...  </code></pre><figcaption><p><i><em class="italic" style="white-space: pre-wrap;">Queries that the </em></i><i><code spellcheck="false" style="white-space: pre-wrap;"><em class="italic">tsdbadmin</em></code></i><i><em class="italic" style="white-space: pre-wrap;"> user does not have access to will hide query text and identifier</em></i></p></figcaption></figure><p>The view returns many columns of data (more than 30!), but if you look at the results above, one value immediately sticks out: <code>&lt;insufficient privilege&gt;</code>. </p><p><code>pg_stat_statements</code> collects data on all databases and users, which presents a security challenge if any user is allowed to query performance data. Therefore, although any user can query data from the view, only superusers and members of the <code>pg_read_all_stats</code> role can see all user-level details, including the <code>queryid</code> and <code>query</code> text. </p><p>This restriction also applies to the <code>tsdbadmin</code> user, which is created by default for all Timescale services. Although this user owns the database and has the most privileges, it is not a superuser account and cannot see the details of all other queries within the service cluster.</p><p>Therefore, it's best to filter <code>pg_stat_statements</code> data by <code>userid</code> for any queries you want to perform.</p><figure class="kg-card kg-code-card"><pre><code class="language-SQL">-- current_user will provide the rolname of the authenticated user
SELECT * FROM pg_stat_statements pss
	JOIN pg_roles pr ON (userid=oid)
WHERE rolname = current_user;


userid|dbid |queryid             |query                         
------+-----+--------------------+------------------------------
 16422|16434| 8157083652167883764|SELECT pg_size_pretty(total_by
 16422|16434|-5803236267637064108|SELECT game, author_handle, gu
 16422|16434|-8694415320949103613|SELECT c.oid,c.*,d.description
 ...  |...  |...                 |...  		 </code></pre><figcaption><p><i><em class="italic" style="white-space: pre-wrap;">Queries for only the </em></i><i><code spellcheck="false" style="white-space: pre-wrap;"><em class="italic">tsdbadmin</em></code></i><i><em class="italic" style="white-space: pre-wrap;"> user, showing all details and statistics</em></i></p></figcaption></figure><p>When you add the filter, only data that you have access to is displayed. If you have created additional accounts in your service for specific applications, you could also filter to those accounts.</p><p>To make the rest of our example queries easier to work with, we recommend that you use this base query with a common table expression (CTE). This query form will return the same data but make the rest of the query a little easier to write.</p><figure class="kg-card kg-code-card"><pre><code class="language-SQL">-- current_user will provide the rolname of the authenticated user
WITH statements AS (
SELECT * FROM pg_stat_statements pss
		JOIN pg_roles pr ON (userid=oid)
WHERE rolname = current_user
)
SELECT * FROM statements;

userid|dbid |queryid             |query                         
------+-----+--------------------+------------------------------
 16422|16434| 8157083652167883764|SELECT pg_size_pretty(total_by
 16422|16434|-5803236267637064108|SELECT game, author_handle, gu
 16422|16434|-8694415320949103613|SELECT c.oid,c.*,d.description
 ...  |...  |...                 |...   		 </code></pre><figcaption><p><i><em class="italic" style="white-space: pre-wrap;">Query that shows the same results as before, but this time with the base query in a CTE for more concise queries later</em></i></p></figcaption></figure><p>Now that we know how to query only the data we have access to, let's review a few of the columns that will be the most useful for spotting potential problems with your queries. </p><ul><li><strong><code>calls</code></strong>: the number of times this query has been called.</li><li><strong><code>total_exec_time</code></strong>: the total time spent executing the query, in milliseconds.</li><li><strong><code>rows</code></strong>: the total number of rows retrieved or affected by this query.</li><li><strong><code>shared_blks_hit</code></strong>: the number of blocks that were already in the shared buffer cache when the query needed them.</li><li><strong><code>shared_blks_read</code></strong>: the number of blocks that had to be read from the disk to satisfy all calls for this query form.</li></ul><p>Two quick reminders about the data columns above:</p><ol><li>All values are cumulative since the service was last started or since a superuser last reset the statistics manually.</li><li>All values are aggregated per query form: the query is parameterized first, and the statistics are grouped by the resulting hashed <code>queryid</code>.</li></ol><p>Using these columns of data, let's look at a few common queries that can help you narrow in on the problematic queries.</p><h2 id="long-running-postgresql-queries">Long-Running PostgreSQL Queries</h2><p>One of the quickest ways to find slow Postgres queries that merit your attention is to look at each query’s mean execution time (<code>mean_exec_time</code>). This is not a time-weighted average since the data is cumulative, but it still helps frame a relevant context for where to start.</p><p>Adjust the <code>calls</code> value to fit your specific application needs. 
Querying for higher (or lower) total number of calls can help you identify queries that aren't run often but are very expensive or queries that are run much more often than you expect and take longer to run than they should.</p><figure class="kg-card kg-code-card"><pre><code class="language-SQL">-- query the 10 longest running queries with more than 500 calls
WITH statements AS (
SELECT * FROM pg_stat_statements pss
		JOIN pg_roles pr ON (userid=oid)
WHERE rolname = current_user
)
SELECT calls, 
	mean_exec_time, 
	total_exec_time,
	query
FROM statements
WHERE calls &gt; 500
AND shared_blks_hit &gt; 0
ORDER BY mean_exec_time DESC
LIMIT 10;


calls|mean_exec_time |total_exec_time | query
-----+---------------+----------------+-----------
 2094|        346.93 |      726479.51 | SELECT time FROM nft_sales ORDER BY time ASC LIMIT $1 |
 3993|         5.728 |       22873.52 | CREATE TEMPORARY TABLE temp_table ... |
 3141|          4.79 |       15051.06 | SELECT name, setting FROM pg_settings WHERE ... |
60725|          3.64 |      221240.88 | CREATE TEMPORARY TABLE temp_table ... |   
  801|          1.33 |        1070.61 | SELECT pp.oid, pp.* FROM pg_catalog.pg_proc p  ...|
 ... |...            |...                 |  		 </code></pre><figcaption><p><i><em class="italic" style="white-space: pre-wrap;">Queries that take the most time, on average, to execute</em></i></p></figcaption></figure><p>The sample database we're using for these queries is based on the <a href="https://github.com/timescale/nft-starter-kit">NFT starter kit</a>, which allows you to ingest data on a schedule from the OpenSea API and query NFT sales data. As part of the normal process, you can see that a <code>TEMPORARY TABLE</code> is created to ingest new data and update existing records as part of a lightweight extract-transform-load process.</p><p>That query has been called 60,725 times since this service started and has taken around 3.7 minutes of total execution time to create the table. By contrast, the first query shown takes the longest, on average, to execute: around 350 milliseconds each time. It retrieves the oldest timestamp in the <code>nft_sales</code> table and has used more than 12 minutes of execution time since the server was started.</p><p>From a work perspective, finding a way to improve the performance of the first query will have a more significant impact on the overall server workload.</p><h2 id="hit-cache-ratio">Hit Cache Ratio</h2><p>Like nearly everything in computing, databases tend to perform best when data can be queried in memory rather than going to external disk storage. If PostgreSQL has to retrieve data from storage to satisfy a query, it will typically be slower than if all of the needed data was already loaded into the reserved memory space of PostgreSQL. We can measure how often a query has to do this through a value known as Hit Cache Ratio.</p><p>Hit Cache Ratio is a measurement of how often the data needed to satisfy a query was available in memory. 
A higher percentage means that the data was already available and didn't have to be read from disk, while a lower value can be an indication that there is memory pressure on the server and that it isn't able to keep up with the current workload.</p><p>If PostgreSQL has to constantly read data from disk to satisfy the same query, it means that other operations and data are taking precedence and "pushing" the data your query needs back out to disk each time. </p><p>This is a common scenario for time-series workloads because newer data is written to memory first, and if there isn't enough free buffer space, data that is used less will be evicted. If your application queries a lot of historical data, older <a href="https://www.tigerdata.com/blog/database-indexes-in-postgresql-and-timescale-cloud-your-questions-answered" rel="noreferrer">hypertable</a> chunks might not be loaded into memory and ready to quickly serve the query.</p><p>A good place to start is with queries that run often and have a Hit Cache Ratio of less than 98&nbsp;%. Do these queries tend to pull data from long periods of time? If so, that could be an indication that there's not enough RAM to keep this data in memory before it is evicted for newer data. </p><p>Depending on the application query pattern, you could improve Hit Cache Ratio by increasing server resources, tuning indexes to reduce table storage, or using <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/compression/">TimescaleDB compression</a> on older chunks that are queried regularly.</p><figure class="kg-card kg-code-card"><pre><code class="language-SQL">-- query the Hit Cache Ratio of the most frequently called queries (more than 500 calls)
WITH statements AS (
SELECT * FROM pg_stat_statements pss
		JOIN pg_roles pr ON (userid=oid)
WHERE rolname = current_user
)
SELECT calls, 
	shared_blks_hit,
	shared_blks_read,
	ROUND(shared_blks_hit/(shared_blks_hit+shared_blks_read)::NUMERIC*100, 2) AS hit_cache_ratio,
	query
FROM statements
WHERE calls &gt; 500
AND shared_blks_hit &gt; 0
ORDER BY calls DESC, hit_cache_ratio ASC
LIMIT 10;


calls | shared_blks_hit | shared_blks_read | hit_cache_ratio |query
------+-----------------+------------------+-----------------+--------------
  118|            441126|                 0|           100.00| SELECT bucket, slug, volume AS "volume (count)", volume_eth...
  261|          62006272|             22678|            99.96| SELECT slug FROM streamlit_collections_daily cagg...¶        I
 2094|         107188031|           7148105|            93.75| SELECT time FROM nft_sales ORDER BY time ASC LIMIT $1...      
  152|          41733229|                 1|            99.99| SELECT slug FROM streamlit_collections_daily cagg...¶        I
  154|          36846841|             32338|            99.91| SELECT a.img_url, a.name, MAX(s.total_price) AS price, time...

 ... |...               |...               | ...             | ...</code></pre><figcaption><p><i><em class="italic" style="white-space: pre-wrap;">The query that shows the Hit Cache Ratio of each query, including the number of buffers that were read from disk or memory to satisfy the query</em></i></p></figcaption></figure><p>This sample database isn't very active, so the overall query counts are not very high compared to what a traditional application would probably show. In our example data above, a query called more than 500 times is a "frequently used query." </p><p>We can see above that one of the most expensive queries also happens to have the lowest Hit Cache Ratio of 93.75&nbsp;%. This means that roughly 6&nbsp;% of the blocks this query needs have to be read from disk rather than served from memory. While that might not seem like a lot, your most frequently called queries should have a ratio of 99&nbsp;% or more in most cases.</p><p>If you look closely, you'll notice that this is the same query that stood out in our first example that showed how to find long-running queries. It's quickly becoming apparent that we can probably tune this query in some way to perform better. As it stands now, it's the slowest query per call, and it consistently has to read some data from disk rather than from memory.</p><h2 id="queries-with-high-standard-deviation">Queries With High Standard Deviation</h2><p>For a final example, let's consider another way to judge which queries often have the greatest opportunity for improvement: using the standard deviation of a query execution time.</p><p>Finding the slowest queries is a good place to start. However, as discussed in the blog post <a href="https://timescale.ghost.io/blog/what-time-weighted-averages-are-and-why-you-should-care/">What Time-Weighted Averages Are and Why You Should Care</a>, averages are only part of the story. 
Although <code>pg_stat_statements</code> doesn't provide a method for tracking time-weighted averages, it does track the standard deviation of execution time across all calls.</p><h3 id="how-can-this-be-helpful">How can this be helpful?</h3><p>Standard deviation measures how widely the execution time of each call varies from the overall mean. If the standard deviation value is small, then queries all take a similar amount of time to execute. If the standard deviation value is large, this indicates that the execution time of the query varies significantly from request to request.</p><p>Determining how good or bad the standard deviation is for a particular query requires more data than just the mean and standard deviation values. To make the most sense of these numbers, we at least need to add the minimum and maximum execution times to the query. By doing this, we can start to form a mental model for the overall span of execution times that the query takes.</p><p>In the example result below, we're only showing the data for one query to make it easier to read, the same <code>ORDER BY time LIMIT 1</code> query we've seen in our previous example output.</p><figure class="kg-card kg-code-card"><pre><code class="language-SQL">-- query the min, max, mean, and standard deviation of execution time for the 10 slowest queries with more than 500 calls
WITH statements AS (
SELECT * FROM pg_stat_statements pss
		JOIN pg_roles pr ON (userid=oid)
WHERE rolname = current_user
)
SELECT calls, 
	min_exec_time,
	max_exec_time, 
	mean_exec_time,
	stddev_exec_time,
	(stddev_exec_time/mean_exec_time) AS coeff_of_variance,
	query
FROM statements
WHERE calls &gt; 500
AND shared_blks_hit &gt; 0
ORDER BY mean_exec_time DESC
LIMIT 10;


Name              |Value                                                |
------------------+-----------------------------------------------------+
calls             |2094                                                 |
min_exec_time     |0.060303                                             |
max_exec_time     |1468.401726                                          |
mean_exec_time    |346.9338636657108                                    |
stddev_exec_time  |212.3896857655582                                    |
coeff_of_variance |0.612190702635494                                    |
query             |SELECT time FROM nft_sales ORDER BY time ASC LIMIT $1|	 </code></pre><figcaption><p><i><em class="italic" style="white-space: pre-wrap;">Queries showing the min, max, mean, and standard deviation of each query</em></i></p></figcaption></figure><p>In this case, we can infer a few things from these statistics:</p><ul><li>For our application, this query is called frequently (remember, more than 500 calls is a lot for this sample database).</li><li>If we look at the full range of execution time in conjunction with the mean, we see that the mean is not centered. This could imply that there are execution time outliers or that the data is skewed. Both are good reasons to investigate this query’s execution times further.</li><li>Additionally, if we look at the <code>coeff_of_variance</code> column, which is the ratio between the standard deviation and the mean (also called the coefficient of variation), we get 0.612, which is fairly high. In general, if this ratio is above 0.3, then the variation of your data is quite large. Since the data is quite varied, it seems that instead of a few outliers skewing the mean, a number of executions are taking longer than they should. This provides further confirmation that the execution time for this query should be investigated further.</li></ul><p>When I examine the output of these three queries together, this specific <code>ORDER BY time LIMIT 1</code> query seems to stick out. 
It's slower per call than most other queries, it often requires the database to retrieve data from disk, and the execution times seem to vary dramatically over time.<br><br>As long as I understood where this query was used and how the application could be impacted, I would certainly put this "first point" query on my list of things to improve.</p><h2 id="speed-up-your-postgresql-queries">Speed Up Your PostgreSQL Queries</h2><p>The <code>pg_stat_statements</code> extension is an invaluable monitoring tool, especially when you understand how statistical data can be used in the database and application context. </p><p>For example, an expensive query called a few times a day or month might not be worth the effort to tune right now. Instead, a moderately slow query called hundreds of times an hour (or more) will probably be a better use of your query tuning effort.</p><p>If you want to learn how to store metrics snapshots regularly and move from static, cumulative information to time-series data for more efficient database monitoring, check out the blog post <a href="https://timescale.ghost.io/blog/point-in-time/"><strong>Point-in-Time PostgreSQL Database and Query Monitoring With pg_stat_statements</strong></a><strong>.</strong></p><p><code>pg_stat_statements</code> is automatically enabled in all Timescale services. 
If you’re not a user yet, <a href="https://console.cloud.timescale.com/signup">you can try out Timescale for free</a> (no credit card required) to get access to a modern cloud-native database platform with <a href="https://timescale.ghost.io/blog/postgresql-timescaledb-1000x-faster-queries-90-data-compression-and-much-more/">TimescaleDB's top performance</a>, one-click <a href="https://timescale.ghost.io/blog/high-availability-for-your-production-environments-introducing-database-replication-in-timescale-cloud/">database replication</a>, <a href="https://timescale.ghost.io/blog/introducing-one-click-database-forking-in-timescale-cloud/">forking</a>, and <a href="https://timescale.ghost.io/blog/vpc-peering-from-zero-to-hero/">VPC peering</a>.<br></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Select the Most Recent Record (of Many Items) With PostgreSQL]]></title>
            <description><![CDATA[Get five methods for retrieving the most recent data for each item in your PostgreSQL database quickly and efficiently.]]></description>
            <link>https://www.tigerdata.com/blog/select-the-most-recent-record-of-many-items-with-postgresql</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/select-the-most-recent-record-of-many-items-with-postgresql</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Tutorials]]></category>
            <category><![CDATA[Engineering]]></category>
            <dc:creator><![CDATA[Ryan Booz]]></dc:creator>
            <pubDate>Fri, 04 Feb 2022 14:50:31 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/10/Screenshot-2023-10-11-at-6.56.02-PM.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/10/Screenshot-2023-10-11-at-6.56.02-PM.png" alt="Select the Most Recent Record (of Many Items) With PostgreSQL" /><p><a href="https://timescale.ghost.io/blog/time-series-data/" rel="noreferrer">Time-series data is ubiquitous in almost every application today</a>. One of the most frequent queries applications make on time-series data is to find the most recent value for a given device or item.</p><p>In this blog post, we'll explore five methods for accessing the most recent value in PostgreSQL. Each option has its advantages and disadvantages, which we'll discuss as we go.</p><p><em>Note: Throughout this post, references to a "device" or "truck" are simply placeholders to whatever your application is storing time-series data for, whether it be an air quality sensor, airplane, car, website visits, or something else. As you read, focus on the concept of each option, rather than the specific data we're using as an example.</em></p><h2 id="the-problem">The Problem</h2><p>Knowing how to query the most recent timestamp and data for a device in large time-series datasets is often a challenge for many application developers. We study the data, determine the appropriate schema, and create the indexes that <em>should</em> make the queries return quickly. </p><p><a href="https://www.timescale.com/learn/postgresql-performance-tuning-how-to-size-your-database" rel="noreferrer">When the queries aren't as fast as we expect</a>, it's easy to be confused because indexes in PostgreSQL are supposed to help your queries return quickly - correct?</p><p>In most cases, the answer to that is emphatically "true". With the appropriate index, PostgreSQL is normally <em>very</em> efficient at retrieving data for your query. 
There are always nuances that we don't have time to get into in this post (<a href="https://www.timescale.com/learn/postgresql-performance-tuning-optimizing-database-indexes" rel="noreferrer">don't create too many indexes, make sure statistics are kept up-to-date, etc.</a>), but generally speaking, the right index will dramatically improve the query performance of a SQL database, PostgreSQL included.</p><p>Quick aside:</p><p><em>Before we dive into how to efficiently find specific records in a large time-series database using indexes, I want to make sure we're talking about the same thing. For the duration of this post, all references to indexes specifically mean a <a href="https://use-the-index-luke.com/sql/anatomy/the-tree"><em>B-tree index</em></a>. These are the most common index supported by all major OLTP databases and they are very good at locating specific rows of data across tables large and small. PostgreSQL actually supports <strong>many</strong> different index types that can help for various types of queries and data (including timestamp-centric data), but from here on out, we're only talking about B-tree indexes.</em></p><h2 id="the-impact-of-indexes">The Impact of Indexes</h2><p>In our TimescaleDB <a href="https://slack.timescale.com">Slack community channel</a> and in other developer forums such as StackOverflow (<a href="https://dba.stackexchange.com/questions/177162/how-to-make-distinct-on-faster-in-postgresql">example</a>), developers often wonder why a query for the latest value is slow in PostgreSQL even when it seems like the correct index exists to make the query "fast"?</p><p>The answer to that lies in how the PostgreSQL query planner works. It doesn't always use the index exactly how you might expect as we'll discuss below. In order to demonstrate how PostgreSQL might use an index on a large time-series table, let's set the stage with a set of fictitious data. 
</p><p>For these example queries, let's pretend that our application is tracking a trucking fleet, with sensors that report data a few times a minute as long as the truck has a cell connection. Sometimes the truck loses signal which causes data to be sent a few hours or days later. Although the app would certainly be more involved and have a more complex schema for tracking both time-series and business-related data, let's focus on two of the tables.</p><p><strong>Truck</strong></p><p><br>This table tracks every truck that is part of the fleet. Even for a very large company, this table will typically contain only a few tens of thousands of rows.</p><table>
<thead>
<tr>
<th>truck_id</th>
<th>make</th>
<th>model</th>
<th>weight_class</th>
<th>date_acquired</th>
<th>active_status</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>Ford</td>
<td>Single Sleeper</td>
<td>S</td>
<td>2018-03-14</td>
<td>true</td>
</tr>
<tr>
<td>2</td>
<td>Tesla</td>
<td>Double Sleeper</td>
<td>XL</td>
<td>2019-02-18</td>
<td>false</td>
</tr>
<tr>
<td>…</td>
<td>…</td>
<td>…</td>
<td>…</td>
<td>…</td>
<td></td>
</tr>
</tbody>
</table>
<p>For the queries below, we'll pretend that this table has ~10,000 trucks, most of which are currently active and recording data a few times a minute.</p><p><strong>Truck Reading</strong></p><p><br>The reading hypertable stores all data that is delivered from every truck over time. Data typically comes in a few times a minute in time order, although older data can arrive when trucks lose their connection to cell service or transmitters break down. For this example, we'll show a wide-format table schema, and only a few columns of data to keep things simple. Many IoT applications store many types of data points for each set of readings.</p><table>
<thead>
<tr>
<th>ts</th>
<th>truck_id</th>
<th>milage</th>
<th>fuel</th>
<th>latitude</th>
<th>longitude</th>
</tr>
</thead>
<tbody>
<tr>
<td>2021-11-30 16:39:46</td>
<td>1</td>
<td>49.8</td>
<td>29</td>
<td>40.626</td>
<td>83.139</td>
</tr>
<tr>
<td>2021-11-30 16:39:46</td>
<td>2</td>
<td>33.0</td>
<td>371</td>
<td>40.056</td>
<td>78.978</td>
</tr>
<tr>
<td>2021-11-30 16:39:46</td>
<td>3</td>
<td>54.5</td>
<td>403</td>
<td>42.732</td>
<td>83.756</td>
</tr>
<tr>
<td>…</td>
<td>…</td>
<td>…</td>
<td>…</td>
<td>…</td>
<td>…</td>
</tr>
</tbody>
</table>
<p>When you create a TimescaleDB hypertable, an index on the timestamp column will be created automatically unless you specifically tell the <code>create_hypertable()</code> function not to. For the <code>truck_reading</code> table, the default index should look similar to:</p><p><code>CREATE INDEX ix_ts ON truck_reading (ts DESC);</code></p><p>This index (or at least a composite index that uses the time column first) is necessary for even the most basic queries where time is involved and is strongly recommended for hypertable chunk management. Queries involving time alone like <code>MIN(ts)</code> or <code>MAX(ts)</code> can easily be satisfied from this index.</p><p>However, if we wanted to know the minimum or maximum readings for a specific truck, PostgreSQL would have no path to quickly find that information. Consider the following query that searches for the most recent readings of a specific truck:</p><pre><code class="language-sql">SELECT * FROM truck_reading WHERE truck_id = 1234 ORDER BY ts DESC LIMIT 1;
</code></pre>
<p>If the <code>truck_reading</code> table only had the default timestamp index (<code>ix_ts</code> above), PostgreSQL would have no efficient method to get the most recent row of data for this specific truck. Instead, it has to start reading the index from the beginning (the most recent timestamp is first based on the index order) and check each row to see if it contains <code>1234</code> as the <code>truck_id</code>.</p><p>If this truck had reported recently, PostgreSQL would only have to read a few thousand rows <em>at most</em> and the query would still be "fast". If the truck hadn't recorded data in a few hours or days, PostgreSQL might have to read hundreds of thousands, or millions of rows of data, before it finds a row where <code>truck_id = 1234</code>.</p><p>To demonstrate this, we created a sample dataset with ~20 million rows of data (1 week for 10,000 trucks) and then deleted the most recent 12 hours for <code>truck_id = 1234</code>.</p><p>In the EXPLAIN output below, we can see that PostgreSQL had to walk the index from its most recent entry and FILTER out more than 1.53 million rows that did not match the <code>truck_id</code> we were searching for before it found a match. Even more alarming is the amount of data PostgreSQL had to process to correctly retrieve the one row of data we were asking for - <strong><em>~181MB of data!</em></strong> (23168 buffers x 8kB per buffer)</p><pre><code class="language-sql">QUERY PLAN
------------------------------------------------------------------
Limit  (cost=0.44..289.59 rows=1 width=52) (actual time=189.343..189.344 rows=1 loops=1)                                                       
  Buffers: shared hit=23168                                                 
  -&gt;  Index Scan using ix_ts on truck_reading  (cost=0.44..627742.58 rows=2171 width=52) (actual time=189.341..189.341 rows=1 loops=1)
        Filter: (truck_id = 1234)                                           
        Rows Removed by Filter: 1532532                                     
        Buffers: shared hit=23168                                           
Planning:                                                                   
  Buffers: shared hit=5                                                     
Planning Time: 0.116 ms                                                     
Execution Time: 189.364 ms 
</code></pre>
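<p>A plan like the one above can be reproduced by prefixing the query with <code>EXPLAIN (ANALYZE, BUFFERS)</code>. This is a minimal sketch that assumes the <code>truck_reading</code> table and <code>ix_ts</code> index described earlier; the timings and buffer counts will differ on your own data and hardware.</p><pre><code class="language-sql">-- ANALYZE executes the query and reports actual row counts and timings.
-- BUFFERS adds the number of pages (8kB each by default) touched along the way.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM truck_reading
WHERE truck_id = 1234
ORDER BY ts DESC
LIMIT 1;
</code></pre>
<p>Because <code>ANALYZE</code> really runs the statement, it is safest to use it with read-only queries like this one.</p>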
<p>If your application has to do that much work for each query, it will quickly become bottlenecked on the simplest of queries as the data grows.</p><p><strong><em>Therefore, it's essential that we have the correct index(es) for the typical query patterns of our application. </em></strong></p><p>In this example (and in many real-life applications), we should <strong><em>at least</em></strong> create one other index that includes both <code>truck_id</code> and <code>ts</code>. This will allow queries about a specific truck based on time to be searched much more efficiently. An example index would look like this:</p><pre><code class="language-sql">CREATE INDEX ix_truck_id_ts ON truck_reading (truck_id, ts DESC);
</code></pre>
<p>With this index created, PostgreSQL can find the most recent record for a specific truck very quickly, whether it reported data a few seconds or a few weeks ago. </p><p>With the same dataset as above, the same query which returns a datapoint for <code>truck_id = 1234</code> from 12 hours ago reads only<strong><em> 40kb of data</em></strong>! That's ~<strong>4600x less data</strong> that had to be read because we created the appropriate index, not to mention the sub-millisecond execution time! <em>That's bananas!</em></p><pre><code class="language-sql">QUERY PLAN                                                                                                                                               
-----------------------------------------------------
Limit  (cost=0.56..1.68 rows=1 width=52) (actual time=0.015..0.015 rows=1 loops=1)                                                                
  Buffers: shared hit=5                                                    
  -&gt;  Index Scan using ix_truck_id_ts on truck_reading  (cost=0.56..2425.55 rows=2171 width=52) (actual time=0.014..0.014 rows=1 loops=1)
        Index Cond: (truck_id = 1234)                                       
        Buffers: shared hit=5                                               
Planning:                                                                   
  Buffers: shared hit=5                                                     
Planning Time: 0.117 ms                                                     
Execution Time: 0.028 ms
                                                            
</code></pre>
<p>To be clear, both queries did use an index to search for the row. The difference is in how the indexes were used to find the data we wanted.</p><p>The first query had to FILTER the tuple because only the timestamp was part of the index. Filtering takes place <strong><em>after</em></strong> the tuple is read from disk, which means a lot more work takes place just trying to find the correct data.</p><p>In contrast, the second query used both parts of the index ( <code>truck_id</code> and <code>ts</code>) as part of the Index Condition. This means that only the rows that match the constraint are read from disk. In this case, that's a very small number and the query is much faster!</p><p>Unfortunately, even with <strong><em>both</em></strong> of these targeted indexes, there are a few common time-series SQL queries that won't perform as well as most developers expect them to. </p><p>Let's talk about why that is.</p><h3 id="open-ended-queries">Open-ended queries:</h3><p>Open-ended queries look for unique data points (first, last, most recent) without specifying a specific time-range or device constraint ( <code>truck_id</code> in our example). These types of queries leave the planner with few options, so it assumes that it will have to scan through the entire index <em>at planning time</em>. That might not be true, but PostgreSQL can't really know before it executes the query and starts looking for the data.</p><p>This is especially difficult when tables are partitioned because the actual indexes are stored independently with each table partition. Therefore, there is no global index for the entire table that identifies if a specific <code>truck_id</code> (in our case) exists in a partition. 
Once again, when the PostgreSQL planner doesn't have enough information during the planning phase, it assumes that each partition will need to be queried, typically causing increased planning time.</p><p>Consider a query like the following, which asks for the earliest reading for a specific <code>truck_id</code>:</p><pre><code class="language-sql">SELECT * FROM truck_reading WHERE truck_id=1234 ORDER BY ts LIMIT 1;
</code></pre>
<p>With the two indexes we have in place ( <code>(ts DESC)</code> and <code>(truck_id, ts DESC)</code>), it <em>feels</em> like this should be a fast query. But because the hypertable is partitioned on time, the planner initially assumes that it will have to scan each chunk. If you have a lot of partitions, the planning time will take longer. </p><p>If the <code>truck_reading</code> table is actively receiving new data, the execution of the query will still be "fast" because the answer will probably be found in the first chunk and returned quickly. But if <code>truck_id=1234</code> has <strong><em>never</em></strong> reported any data or has been offline for weeks, PostgreSQL will have to both <em>plan</em> and then <em>scan the index of</em> every chunk. The query will use the composite index on each partition to quickly determine there are no records for the truck, but it still has to take the time to plan and execute the query.</p><p>Instead, we want to avoid doing unnecessary work whenever possible and avoid the potential for this query anti-pattern.</p><h3 id="high-cardinality-queries">High-cardinality queries:</h3><p>Many queries can also be negatively impacted by increasing cardinality, becoming slower as data volumes grow and more individual items are tracked. Options 1-4 below are good examples of queries that perform well on small to medium-size datasets, but <em>often</em> become slower as volumes and cardinality increase.</p><p>These queries attempt to "step" through the time-series table by <code>truck_id</code>, taking advantage of the indexes on the hypertable. 
However, as more items need to be queried, the iteration often becomes slower because the index is too big to efficiently fit in memory, causing PostgreSQL to frequently swap data to and from disk.</p><p>Understanding that these two types of queries <em>may</em> not perform as well under every circumstance, let's examine five different methods for getting the most recent record for each item in your time-series table. In most circumstances, at least one of these options will work well for your data.</p><h2 id="development-production">Development != Production</h2><p>One quick word of warning as we jump into the SQL examples below.</p><p>It's always good to remember that your development database is unlikely to have the same volume, cardinality, and transaction throughput as your production database. Any one of the example queries we show below might perform really well on a smaller, less active database, only to perform more poorly than expected in production.</p><p>It's always best to test in an environment that is as similar to production as possible. How to do that is beyond the scope of this post, but a few options could be:</p><ul><li>Use <a href="https://timescale.ghost.io/blog/blog/introducing-one-click-database-forking-in-timescale-cloud/">one-click database forking</a> with your <a href="https://console.cloud.timescale.com/signup">Timescale</a> instance to easily make a copy of production for testing and learning. Using data as close to production as possible is usually preferred!</li><li>Back up and restore your production database to an approved location and anonymize the data, keeping similar cardinality and row statistics. 
Always <code>ANALYZE</code> the table after any data changes.</li><li>Consider reusing your schema and generating lots of high-volume, high-cardinality sample data with <code>generate_series()</code> (possibly using some of the ideas from <a href="https://timescale.ghost.io/blog/blog/how-to-create-lots-of-sample-time-series-data-with-postgresql-generate_series/">our series about generating more realistic sample data</a> inside of PostgreSQL).</li></ul><p>Whatever method you choose, always remember that a database with 1 million rows of time-series data for 100 items will act much differently from a database with 10 billion rows of time-series data for 10,000 items reporting every few seconds.</p><p>Now that we've discussed how indexes help us find the data and reviewed some of the query patterns that can be slower than usual, it's time to write some SQL and talk about when it might be appropriate to use each option.</p><h2 id="option-1-naive-group-by">Option 1: Naive GROUP BY</h2><p>SQL is a powerful language. Unfortunately, databases that support SQL often have slightly different functions for doing similar work, or simply don't support the SQL standards that would otherwise allow for efficient "last point" queries like we've been discussing.</p><p>However, in nearly every database where SQL is a supported query language, you can run this query to get the most recent time that each truck recorded data. In <em>most</em> cases, this will not perform well on large datasets because the <code>GROUP BY</code> clause prevents the indexes from being used to jump straight to each truck's latest entry.</p><pre><code class="language-sql">SELECT truck_id, max(ts) FROM truck_reading GROUP BY truck_id;
</code></pre>
<p>Because the indexes won't be used in PostgreSQL, this approach is not recommended for high-volume/high-cardinality datasets. But, it will get the result you expect, even if it's not efficient.</p><p>If you have a query like this, consider how one of the other options listed below might better fit your query pattern.</p><h2 id="option-2-lateral-join">Option 2: LATERAL JOIN</h2><p>One of the easiest pieces of advice to give for any PostgreSQL database developer is <a href="https://www.timescale.com/learn/sql-joins-summary">to learn how to use LATERAL JOINs</a>. In some other database engines (like SQL Server), these are called APPLY commands, but they do essentially the same thing—run the inner query for every row produced by the outer query. Because it is a JOIN, the inner query can utilize values from the outer query. (While this is similar to a correlated subquery, it's not the same thing.)</p><p>LATERAL JOINs are a great option when you, as the developer or administrator, know approximately how many rows the outer query will return. For a few hundred or a few thousand rows, this pattern is likely to return your "recent" record very quickly as long as the correct index is in place.</p><pre><code class="language-sql">SELECT * FROM trucks t 
INNER JOIN LATERAL (
	SELECT * FROM truck_reading 
	WHERE truck_id = t.truck_id
	ORDER BY ts DESC 
	LIMIT 1
) l ON TRUE
ORDER BY t.truck_id DESC;
</code></pre>
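<p>As a variation on the query above, the outer query can be narrowed before the inner query ever runs. The sketch below is only an illustration: it assumes <code>active_status</code> is a boolean column on <code>trucks</code> (as in the sample table) and that the page size and offset are chosen by the application.</p><pre><code class="language-sql">-- Pick one page of active trucks first, then look up a single
-- "latest" row per truck through the (truck_id, ts DESC) index.
SELECT t.truck_id, l.ts, l.milage, l.fuel
FROM (
	SELECT truck_id FROM trucks
	WHERE active_status          -- assumed boolean column
	ORDER BY truck_id
	LIMIT 100 OFFSET 500
) t
INNER JOIN LATERAL (
	SELECT ts, milage, fuel FROM truck_reading
	WHERE truck_id = t.truck_id
	ORDER BY ts DESC
	LIMIT 1
) l ON TRUE
ORDER BY t.truck_id;
</code></pre>
<p>With <code>INNER JOIN LATERAL</code>, trucks that have no readings drop out of the result. Switching to <code>LEFT JOIN LATERAL</code> would keep them, with NULLs in the reading columns.</p>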
<p>The convenient thing about a LATERAL JOIN query is that additional filtering can be applied to the outer query to identify specific items to retrieve data for. In most cases, the relational business data (<code>trucks</code>) will be a smaller table with faster query times. Paging can also be applied to the smaller table more efficiently (e.g., <code>OFFSET 500 LIMIT 100</code>), which further reduces the total work that the inner query needs to perform.</p><p>Unfortunately, one downside of a LATERAL JOIN query is that it can be susceptible to the high cardinality issue we discussed above in at least two ways.</p><p>First, if the outer query returns many more items than the inner table has data for, this query will loop over the inner table doing more work than necessary. For example, if the <code>trucks</code> table had 10,000 entries for trucks, but only 1,000 of them had ever reported readings, the inner query would run 10x more often than it needed to.</p><p>Second, even if the cardinality of the inner and outer query generally match, if that cardinality is high or the table on the inner query is very large, a LATERAL JOIN query will slow down over time as memory or I/O becomes a limiting factor. At some point, you may need to consider <strong>Option 5</strong> below as a final solution.</p><h2 id="option-3-timescaledb-skipscan">Option 3: TimescaleDB SkipScan</h2><p><em>Disclaimer: this method only works when the TimescaleDB extension is installed. If you aren’t using it yet, you can find more information on our <a href="https://docs.timescale.com/install/latest/"><em>documentation page</em></a>.</em></p><p>LATERAL JOINs are a great tool to have on hand when working with iterative queries. 
However, as we just discussed, they're not always the best choice when iterating the items of the outer query would cause the inner query to be executed often, looking for data that doesn't exist.</p><p>This is when it can be advantageous to use the reading table itself to get the distinct items and related data. In particular, this is helpful when we want to query trucks that have reported data within a period of time, for example, the last 24 hours. While we could add a filter to the inner query above ( <code>WHERE ts &gt; now() - INTERVAL '24 hours'</code>), we'd still have to iterate over every <code>truck_id</code>, some of which might not have reported data in the last 24 hours.</p><p>Because we already created the <code>ix_truck_id_ts</code> index above that is ordered by <code>truck_id</code> and <code>ts DESC</code>, a common approach that many PostgreSQL developers try is to use a <code>DISTINCT ON</code> query with PostgreSQL.</p><pre><code class="language-sql">SELECT DISTINCT ON (truck_id) * FROM truck_reading WHERE ts &gt; now() - INTERVAL '24 hours' ORDER BY truck_id, ts DESC;
</code></pre>
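<p>To check which strategy the planner picks for this query on your own system, it can be run under <code>EXPLAIN</code>. This is a hedged sketch, not a guaranteed result: on plain PostgreSQL the plan typically shows a <code>Unique</code> node above a scan of all matching rows, while with a recent TimescaleDB version and a matching index you would expect to see a node labeled <code>SkipScan</code> instead. The exact plan shape depends on your versions, indexes, and data.</p><pre><code class="language-sql">-- Compare "Buffers" and "actual time" with and without TimescaleDB,
-- and look for a SkipScan node in the plan output.
EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT ON (truck_id) *
FROM truck_reading
WHERE ts &gt; now() - INTERVAL '24 hours'
ORDER BY truck_id, ts DESC;
</code></pre>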
<p>If you try this <strong><em>without</em></strong> <a href="https://docs.timescale.com/install/latest/">TimescaleDB</a> installed, <em>it won't perform well</em> - <strong>even though we have an index that <em>appears</em> to have the data ordered correctly and easy to "jump" through! </strong>This is because, as of PostgreSQL 14, there is no feature within the query execution phase that can "walk" the index to find each unique instance of a particular key. Instead, PostgreSQL essentially reads all of the data, groups it by the <code>ON</code> columns, and then filters out all but the first row (based on order).</p><p>However, <em>with the TimescaleDB extension installed</em> (version 2.3 or greater), <strong><em>the <code>DISTINCT ON</code> query will work much more efficiently</em></strong> as long as the correct index exists and is ordered the same as the query. This is because the TimescaleDB extension <a href="https://timescale.ghost.io/blog/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql/">adds a new query node called "SkipScan"</a>, which jumps ahead in the index to the next key value as soon as it has found the row it needs for the current one. One of the best parts of SkipScan is that it works on <strong>any</strong> PostgreSQL table with a B-tree index. It doesn't <em>have</em> to be a TimescaleDB hypertable!</p><p>There are a few nuances to how the index is used, all of which are outlined in the blog post linked above.</p><h2 id="option-4-loose-index-scan">Option 4: Loose Index Scan</h2><p>If you don't (or can't) install the TimescaleDB extension, there is still a way to query the <code>truck_reading</code> table to efficiently return the timestamp of the most recent reading for each <code>truck_id</code>.</p><p>On the PostgreSQL Wiki there is a page dedicated to the <a href="https://wiki.postgresql.org/wiki/Loose_indexscan">Loose Index Scan</a>. 
It demonstrates a way to use recursive CTE queries to essentially do what the TimescaleDB SkipScan node does. It's not nearly as straightforward to write and is more difficult to return multiple rows (it's not the same as a DISTINCT query), but it does provide a way to more efficiently use the index to retrieve one row for each item.</p><p>The biggest drawback with this approach is that it's much more difficult to return multiple columns of data with the recursive CTE (and in most cases, it's simply impossible to return multiple rows). So while some developers refer to this as a Skip Scan query, it doesn't easily allow you to retrieve all of the row data for a high-volume table like the SkipScan query node that TimescaleDB provides. The example below follows the wiki pattern and walks the distinct values of a single column, <code>ts</code>.</p><pre><code class="language-sql">/*
 * Loose index scan via https://wiki.postgresql.org/wiki/Loose_indexscan
 */
WITH RECURSIVE t AS (
   SELECT min(ts) AS ts FROM truck_reading
   UNION ALL
   SELECT (SELECT min(ts) FROM truck_reading WHERE ts &gt; t.ts)
   FROM t WHERE t.ts IS NOT NULL
   )
SELECT ts FROM t WHERE ts IS NOT NULL
UNION ALL
SELECT null WHERE EXISTS(SELECT 1 FROM truck_reading WHERE ts IS NULL);
</code></pre>
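<p>The same pattern can be pointed at <code>truck_id</code> to get one "latest timestamp" per truck. The following is a sketch under the assumption that the <code>(truck_id, ts DESC)</code> index from earlier exists; the alias <code>last_ts</code> is just an illustrative name, and the query has not been benchmarked against the dataset used in this post.</p><pre><code class="language-sql">WITH RECURSIVE t AS (
   -- Anchor: the smallest truck_id, read from the start of the index
   SELECT min(truck_id) AS truck_id FROM truck_reading
   UNION ALL
   -- Recursive step: hop to the next truck_id after the previous one
   SELECT (SELECT min(truck_id) FROM truck_reading WHERE truck_id &gt; t.truck_id)
   FROM t WHERE t.truck_id IS NOT NULL
)
SELECT t.truck_id,
       -- One small index lookup per truck for its most recent timestamp
       (SELECT max(ts) FROM truck_reading r WHERE r.truck_id = t.truck_id) AS last_ts
FROM t
WHERE t.truck_id IS NOT NULL;
</code></pre>
<p>This returns only the timestamp for each truck. Fetching the rest of the row would take an additional join or a LATERAL subquery, which is part of why this approach is harder to work with than <code>DISTINCT ON</code>.</p>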
<h2 id="option-5-logging-table-and-trigger">Option 5: Logging Table and Trigger</h2><p>Sometimes, particularly with large, high-cardinality datasets, the above options aren't efficient enough for day-to-day operations. Querying for the last reading of all items, or the devices that haven't reported a value in the last 24 hours, will not meet your expectations as data volume and cardinality grow.</p><p>In this case, a better option might be to maintain a table that stores the last readings for each device as it's inserted into the raw time-series table so that your application can query a much smaller dataset for the most recent values. To track and update the logging table, we'll create a database trigger on the raw data (hyper)table.</p><p><em>"Wait a minute! Did you just say we're going to create a database trigger? Doesn't everyone say you should never use them?" </em></p><p>It's true. Triggers often get a bad rap in the SQL world, and honestly, that can often be justified. Used properly and with the correct implementation, database triggers can be tremendously useful and have minimal impact on SELECT performance. Insert and Update performance will be impacted because each transaction has to do more work. The performance hit may or may not impact your application, so testing is essential.</p><p>The SQL below provides a minimal example of how you could implement this kind of logging. There are a <strong><em>lot</em></strong> of considerations on how to best implement this option for your specific application. Thoroughly test any new processes you add to the data processing in your database.</p><p>In short, the example script below:</p><ul><li>creates a table to store the most recent data. 
If you only want to store the most recent timestamp of each truck's readings, you could instead store it in a new column on the <code>truck</code> table</li><li>sets the FILLFACTOR of the table to 90% because it will be UPDATE heavy</li><li>creates a trigger function that will insert a row if one doesn't exist for a truck, or update the values if that truck already has an entry in the table (<code>ON CONFLICT</code>)</li><li>enables the trigger on the data hypertable</li></ul><p>The key to this approach is to only track what is necessary, reducing the amount of work PostgreSQL has to do as part of the overall transaction that is ingesting raw data. If your application updates values for 100,000 devices every second (and you're tracking 50 columns of data), a different trigger approach might be necessary. If this is the kind of data volume you see regularly, we assume that you have an experienced PostgreSQL DBA on your team to help manage and maintain your application database, and to help you decide if the logging table approach will work with the available server resources.</p><pre><code class="language-sql">/*
 * The logging table alternative. The PRIMARY KEY will create an
*  index on the truck_id column to make querying for specific trucks more efficient
 */
CREATE TABLE truck_log ( 
	truck_id int PRIMARY KEY REFERENCES trucks (truck_id),
	last_time timestamp,
	milage int,
	fuel int,
	latitude float8,
	longitude float8
);

/*
* Because the table will mostly be UPDATE heavy, a slightly reduced
* FILLFACTOR can alleviate maintenance contention and reduce
* page bloat on the table.
*/
ALTER TABLE truck_log SET (fillfactor=90);

/*
 * This is the trigger function which will be executed for each row
*  of an INSERT or UPDATE. Again, YMMV, so test and adjust appropriately
 */
CREATE OR REPLACE FUNCTION create_truck_trigger_fn()
  RETURNS TRIGGER LANGUAGE PLPGSQL AS
$BODY$
BEGIN
  -- The timestamp column on truck_reading is named ts
  INSERT INTO truck_log VALUES (NEW.truck_id, NEW.ts, NEW.milage, NEW.fuel, NEW.latitude, NEW.longitude)
  ON CONFLICT (truck_id) DO UPDATE SET
   last_time=NEW.ts,
   milage=NEW.milage,
   fuel=NEW.fuel,
   latitude=NEW.latitude,
   longitude=NEW.longitude;
  RETURN NEW;
END
$BODY$;

/*
*  With the trigger function created, actually assign it to the truck_reading
*  table so that it will execute for each row
*/ 
CREATE TRIGGER create_truck_trigger
  BEFORE INSERT OR UPDATE ON truck_reading
  FOR EACH ROW EXECUTE PROCEDURE create_truck_trigger_fn();
</code></pre>
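<p>Because trucks can deliver readings hours or days late, an unconditional upsert could overwrite a newer "last" value with an older one. One possible guard, sketched here as a variant of the trigger function (it assumes the <code>truck_log</code> table defined above and that <code>ts</code> is the timestamp column on <code>truck_reading</code>), is to add a <code>WHERE</code> clause to the <code>DO UPDATE</code>. Running it with <code>CREATE OR REPLACE</code> swaps out the function body while the existing trigger stays in place.</p><pre><code class="language-sql">CREATE OR REPLACE FUNCTION create_truck_trigger_fn()
  RETURNS TRIGGER LANGUAGE PLPGSQL AS
$BODY$
BEGIN
  INSERT INTO truck_log (truck_id, last_time, milage, fuel, latitude, longitude)
  VALUES (NEW.truck_id, NEW.ts, NEW.milage, NEW.fuel, NEW.latitude, NEW.longitude)
  ON CONFLICT (truck_id) DO UPDATE SET
    last_time = EXCLUDED.last_time,
    milage    = EXCLUDED.milage,
    fuel      = EXCLUDED.fuel,
    latitude  = EXCLUDED.latitude,
    longitude = EXCLUDED.longitude
  -- Leave the stored row alone when the incoming reading is older
  WHERE truck_log.last_time &lt; EXCLUDED.last_time;
  RETURN NEW;
END
$BODY$;

-- Example read: trucks whose latest reading is more than 24 hours old
SELECT truck_id, last_time
FROM truck_log
WHERE last_time &lt; now() - INTERVAL '24 hours'
ORDER BY last_time;
</code></pre>
<p>The extra comparison adds a small amount of work to every insert, so as with the rest of this option it is worth testing against a realistic ingest rate.</p>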
<p>With these pieces in place, the new table will start receiving new rows of data and updating the last values as data is ingested. Querying this table will be much more efficient than searching through hundreds of millions of rows.</p><h2 id="review-the-options">Review The Options</h2><table>
<thead>
<tr>
<th></th>
<th>Requires matching index</th>
<th>Impacted by higher cardinality</th>
<th>Insert performance may be impacted</th>
</tr>
</thead>
<tbody>
<tr>
<td>Option 1: GROUP BY</td>
<td></td>
<td>X</td>
<td></td>
</tr>
<tr>
<td>Option 2: LATERAL JOIN</td>
<td>X</td>
<td>X</td>
<td></td>
</tr>
<tr>
<td>Option 3: TimescaleDB SkipScan</td>
<td>X</td>
<td>X</td>
<td>X (if index needs to be added)</td>
</tr>
<tr>
<td>Option 4: Recursive CTE</td>
<td>X</td>
<td>X</td>
<td>X (if index needs to be added)</td>
</tr>
<tr>
<td>Option 5: Logging table</td>
<td></td>
<td></td>
<td>X</td>
</tr>
</tbody>
</table>
<h2 id="wrap-up">Wrap Up</h2><p>Whatever approach you take, hopefully one of these options will help you take the next step to improve the performance of your application.</p><p>If you are not using TimescaleDB yet,&nbsp;<a href="https://www.timescale.com/?ref=timescale.com" rel="noreferrer">take a look</a>. It's a PostgreSQL extension that&nbsp;<a href="https://timescale.ghost.io/blog/postgresql-timescaledb-1000x-faster-queries-90-data-compression-and-much-more/" rel="noreferrer">will make your queries faster</a> via&nbsp;<a href="https://www.timescale.com/?ref=timescale.com" rel="noreferrer">automatic partitioning</a>, query planner enhancements, improved materialized views,&nbsp;<a href="https://timescale.ghost.io/blog/building-columnar-compression-in-a-row-oriented-database/" rel="noreferrer">columnar compression</a>, and much more.</p><p>If you're running your PostgreSQL database on your own hardware,&nbsp;<a href="https://docs.timescale.com/self-hosted/latest/install/?ref=timescale.com" rel="noreferrer">you can simply add the TimescaleDB extension</a>. If you prefer to try Timescale in AWS,&nbsp;<a href="https://console.cloud.timescale.com/signup?ref=timescale.com" rel="noreferrer">create a free account on our platform</a>. It only takes a couple of seconds, no credit card required!</p><p><a href="https://www.timescale.com/learn/postgresql-performance-tuning-how-to-size-your-database" rel="noreferrer"><em>PS: For more tips that will help you enhance performance in PostgreSQL, check out our collection of articles on PostgreSQL Performance Tuning</em></a><em>. </em></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to shape sample data with PostgreSQL generate_series() and SQL]]></title>
            <description><![CDATA[Learn how to quickly create recurring time-series data for charting and testing PostgreSQL and TimescaleDB functions.]]></description>
            <link>https://www.tigerdata.com/blog/how-to-shape-sample-data-with-postgresql-generate_series-and-sql</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/how-to-shape-sample-data-with-postgresql-generate_series-and-sql</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[General]]></category>
            <dc:creator><![CDATA[Ryan Booz]]></dc:creator>
            <pubDate>Thu, 20 Jan 2022 14:34:18 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/pexels-burak-kebapci-187041.jpg">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/pexels-burak-kebapci-187041.jpg" alt="How to shape sample data with PostgreSQL generate_series() and SQL" /><p><strong><em>In the lifecycle of any application, developers are often asked to create proof-of-concept features, test newly released functionality, and visualize data analysis. In many cases, the available test data is stale, not representative of normal usage, or simply doesn't exist for the feature being implemented. In situations like this, knowing how to quickly create sample time-series data with native PostgreSQL and SQL functions is a valuable skill to draw upon!</em></strong></p><p>In this three-part series on generating sample time-series data, we demonstrate how to use the built-in PostgreSQL function, <code><a href="https://www.postgresql.org/docs/current/functions-srf.html">generate_series()</a></code>, to more easily create large sets of data to help test various workloads, database features, or just to create fun samples.</p><p>In <a href="https://timescale.ghost.io/blog/blog/how-to-create-lots-of-sample-time-series-data-with-postgresql-generate_series/">part 1</a> and <a href="https://timescale.ghost.io/blog/blog/generating-more-realistic-sample-time-series-data-with-postgresql-generate_series/">part 2</a> of the series, we reviewed how <code>generate_series()</code> works, how to join multiple series using a CROSS JOIN to create large datasets quickly, and finally how to create and use custom PostgreSQL functions as part of the query to generate more realistic values for your dataset. If you haven't used <code>generate_series()</code> much before, we recommend first reading the other two posts. 
<em>The first one is an </em><a href="https://timescale.ghost.io/blog/blog/how-to-create-lots-of-sample-time-series-data-with-postgresql-generate_series/"><em>intro to the generate_series() function</em></a><em>, and the second one shows </em><a href="https://timescale.ghost.io/blog/blog/generating-more-realistic-sample-time-series-data-with-postgresql-generate_series/"><em>how to generate more realistic data</em></a><em>.</em></p><p>With those skills in hand, you can quickly and easily generate tens of millions of rows of realistic-looking data. Even still, there's one more problem that we hinted at in part 2 - all of our data, regardless of how it's formatted or constrained, is still based on the <code>random()</code> function. This means that over thousands or millions of samples, every device we create data for will likely have the same MAX() and MIN() value, and the distribution of random values over millions of rows for each device generally means that all devices will have similar average values.</p><p>This third post demonstrates a few methods for influencing how to create data that mimics a desired shape or trend. Do you need to simulate time-series values that cycle over time? What about demonstrating a counter value that resets every so often to test the <a href="https://docs.timescale.com/api/latest/hyperfunctions/counter_aggs/">counter_agg</a> hyperfunction? Are you trying to create new dashboards that display sales data over time, influenced for different months of the year when sales would ebb and flow?</p><p>Below we'll cover all of these examples to provide you with the final building blocks to create awesome sample data for all of your testing and exploration needs. Remember, however, that these examples are just the beginning. Keep playing. 
Tweak the formulas or add different relational data to influence the values that get generated so that they meet your use case.</p><h2 id="data-inception">Data inception</h2><p>Time-series data often has patterns. Weather temperatures and rainfall measurements change in a (mostly) predictable way throughout the year. Vibration measurements from an IoT device connected to an air conditioning system usually increase in the summer and decrease in the winter. Manufacturing data that measures the total units produced per hour (and the percentage of defective units) usually follows a pattern based on shift schedules and seasonal demand.</p><p>If you want to demonstrate this kind of data without having access to the production dataset, how would you go about it using <code>generate_series()</code>? SQL functions ended up being pretty handy when we discussed different methods for creating realistic-looking data in <a href="https://timescale.ghost.io/blog/blog/generating-more-realistic-sample-time-series-data-with-postgresql-generate_series/">part 2</a>. Do you think they might help here? 😉</p><h3 id="two-options-to-easily-return-the-row-number">Two options to easily return the row number</h3><p>Remember, for our purposes we're specifically talking about creating sample <em><strong>time-series data</strong></em>. Every row increases along the time axis, and if we use the multiplication formula from part 1, we can determine how many rows our sample data query will generate. Using built-in SQL functions, we can quickly start manipulating data values that change with the cycle of time. 💥</p><p>There are many reasons why it can be helpful to know the ordinal position of each row in a query result. That's why standard SQL dialects have some variation of the <code>row_number() over()</code> window function. 
This simple, yet powerful, window function allows us to return the row number of a result set, and can utilize the ORDER BY and PARTITION BY keywords to further determine the row values.</p><!--kg-card-begin: markdown--><pre><code class="language-SQL">SELECT ts, row_number() over(order by ts) AS rownum
FROM generate_series('2022-01-01','2022-01-05',INTERVAL '1 day') ts;

ts                          |rownum|
-----------------------------+------+
2022-01-01 00:00:00.000 -0500|     1|
2022-01-02 00:00:00.000 -0500|     2|
2022-01-03 00:00:00.000 -0500|     3|
2022-01-04 00:00:00.000 -0500|     4|
2022-01-05 00:00:00.000 -0500|     5|
</code></pre>
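<p>To illustrate the partitioning behavior mentioned above, here is a minimal sketch. The <code>device_id</code> column and the two-device series are hypothetical additions for illustration only; the point is that the numbering restarts at 1 inside each partition.</p>

```sql
-- Hypothetical example: two made-up devices, three daily readings each.
-- row_number() restarts at 1 within every device_id partition.
SELECT device_id, ts,
       row_number() OVER (PARTITION BY device_id ORDER BY ts) AS rownum
FROM generate_series(1, 2) AS device_id
CROSS JOIN generate_series('2022-01-01'::timestamptz,
                           '2022-01-03'::timestamptz,
                           INTERVAL '1 day') AS ts
ORDER BY device_id, ts;
```

<p>Each device gets its own 1, 2, 3 sequence, which is the same idea a paging query relies on when it numbers rows within a group.</p>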
<!--kg-card-end: markdown--><p>In a normal query, this can be useful for tasks like paging data in a web API when there is a need to consistently return values based on a common partition.</p><p>There's one problem though. <code>row_number() over()</code> requires PostgreSQL (and any other SQL database) to process the query results twice to add the values correctly. Therefore, it's very useful, but also very expensive as datasets grow.</p><p>Fortunately, <strong>PostgreSQL helps us once again</strong> for our specific use case of generating sample time-series data.</p><p>Through this series of blog posts on generating sample time-series data, we've discussed that <code>generate_series()</code> is a <a href="https://www.postgresql.org/docs/current/functions-srf.html">Set Returning Function (SRF)</a>. Like the results from a table, set data can be JOINed and queried. Additionally, PostgreSQL provides the <code>WITH ORDINALITY</code> clause that can be applied to any SRF to generate an additional, incrementing BIGINT column. The best part? It doesn't require a second pass through the data in order to generate this value!</p><!--kg-card-begin: markdown--><pre><code class="language-sql">SELECT ts AS time, rownum
FROM generate_series('2022-01-01','2022-01-05',INTERVAL '1 day') WITH ORDINALITY AS t(ts,rownum);

time                         |rownum|
-----------------------------+------+
2022-01-01 00:00:00.000 -0500|     1|
2022-01-02 00:00:00.000 -0500|     2|
2022-01-03 00:00:00.000 -0500|     3|
2022-01-04 00:00:00.000 -0500|     4|
2022-01-05 00:00:00.000 -0500|     5|
</code></pre>
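<p>Since <code>WITH ORDINALITY</code> works with any set returning function, a quick sketch with <code>unnest()</code> may help show that it isn't specific to <code>generate_series()</code>. The array values and the <code>region</code> and <code>idx</code> aliases below are made up for illustration.</p>

```sql
-- Hypothetical example: number the elements of an array as they are unnested.
SELECT region, idx
FROM unnest(ARRAY['north', 'south', 'east'])
     WITH ORDINALITY AS t(region, idx);
```

<p>The <code>idx</code> column comes back as 1, 2, 3 in the order the function emits its rows.</p>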
<!--kg-card-end: markdown--><p>Because it serves our purpose and is more efficient, the remainder of this post will use <code>WITH ORDINALITY</code>. However, remember that you <em>can</em> accomplish the same results using <code>row_number() over()</code> if that's more comfortable for you.</p><h3 id="harnessing-the-row-value">Harnessing the row value</h3><p>With increasing timestamps and an increasing integer on every row, we can begin to use other functions to create interesting data.</p><p>Remember from the previous blog posts that calling a function as part of your query executes the function for each row and returns the value. Just like a regular column, however, we don't have to actually emit that column in the final query results. Instead, the function value for that row can be used in calculating values in other columns.</p><p>As an example, let's modify the previous query. Instead of displaying the row number, let's multiply the value by 2. That is, the function value is treated as an input to a multiplication formula.</p><!--kg-card-begin: markdown--><pre><code class="language-sql">SELECT ts AS time, 2 * rownum AS rownum_by_two
FROM generate_series('2022-01-01','2022-01-05',INTERVAL '1 day') WITH ORDINALITY AS t(ts,rownum);

time                         |rownum_by_two|
-----------------------------+-------------+
2022-01-01 00:00:00.000 -0500|            2|
2022-01-02 00:00:00.000 -0500|            4|
2022-01-03 00:00:00.000 -0500|            6|
2022-01-04 00:00:00.000 -0500|            8|
2022-01-05 00:00:00.000 -0500|           10|
</code></pre>
<!--kg-card-end: markdown--><p>Easy enough, right? What else can we do with the row number value?</p><h3 id="counters-with-reset">Counters with reset</h3><p>Many time-series datasets record values that reset over time, often referred to as counters. The odometer on a car is an example. If you drive far enough, it will "roll over" to zero again and start counting upward. The same is true for many utilities, like water and electric meters, that track consumption. Eventually, the total digits will increment to the point where the counter resets and starts from zero again.</p><p>To simulate this with time-series data, we can use the incrementing row number and, after some period of time, reset the count and start over using the modulus operator (%).</p><!--kg-card-begin: markdown--><pre><code class="language-sql">-- This example resets the counter every 10 rows
WITH counter_rows AS (
	SELECT ts, 
		CASE WHEN rownum % 10 = 0 THEN 10
		     ELSE rownum % 10 END AS row_counter
	FROM generate_series(now() - INTERVAL '5 minutes', now(), INTERVAL '1 second') WITH ORDINALITY AS t(ts, rownum)
)
SELECT ts, row_counter
FROM counter_rows;


ts                           |row_counter|
-----------------------------+-----------+
2022-01-07 13:17:46.427 -0500|          1|
2022-01-07 13:17:47.427 -0500|          2|
2022-01-07 13:17:48.427 -0500|          3|
2022-01-07 13:17:49.427 -0500|          4|
2022-01-07 13:17:50.427 -0500|          5|
2022-01-07 13:17:51.427 -0500|          6|
2022-01-07 13:17:52.427 -0500|          7|
2022-01-07 13:17:53.427 -0500|          8|
2022-01-07 13:17:54.427 -0500|          9|
2022-01-07 13:17:55.427 -0500|         10|
2022-01-07 13:17:56.427 -0500|          1|
… | …
</code></pre>
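<p>The CASE expression exists because <code>rownum % 10</code> returns 0 on every tenth row. As an alternative sketch (not the form used in the rest of this post), shifting the row number by one before and after the modulus gives the same 1-to-10 counter. The fixed 25-row integer series below is only there to make the comparison easy to read.</p>

```sql
-- Compare the CASE form with an arithmetic form on 25 sample row numbers.
SELECT rownum,
       CASE WHEN rownum % 10 = 0 THEN 10
            ELSE rownum % 10 END AS case_counter,
       ((rownum - 1) % 10) + 1   AS arithmetic_counter
FROM generate_series(1, 25) AS rownum;
```

<p>Both counter columns match on every row, so either form can be used. Changing the 10 to another value changes how often the counter resets.</p>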
<!--kg-card-end: markdown--><p>By putting the CASE statement inside of the CTE, the counter data can be selected more easily to test other functions. For instance, to see how the <code><a href="https://docs.timescale.com/api/latest/hyperfunctions/counter_aggs/rate/">rate()</a></code> and <code><a href="https://docs.timescale.com/api/latest/hyperfunctions/counter_aggs/delta/">delta()</a></code> hyperfunctions work, we can use <code><a href="https://docs.timescale.com/api/latest/hyperfunctions/time_bucket/">time_bucket()</a></code> to group our 1-second readings into 1-minute buckets.</p><!--kg-card-begin: markdown--><pre><code class="language-sql">WITH counter_rows AS (
	SELECT ts, 
		CASE WHEN rownum % 10 = 0 THEN 10
		     ELSE rownum % 10 END AS row_counter
	FROM generate_series(now() - INTERVAL '5 minutes', now(), INTERVAL '1 second') WITH ORDINALITY AS t(ts, rownum)
)
SELECT time_bucket('1 minute', ts) bucket, 
  delta(counter_agg(ts,row_counter)),
  rate(counter_agg(ts, row_counter))
FROM counter_rows
GROUP BY bucket
ORDER BY bucket;

bucket                       |delta|rate|
-----------------------------+-----+----+
2022-01-07 13:25:00.000 -0500| 33.0| 1.0|
2022-01-07 13:26:00.000 -0500| 59.0| 1.0|
2022-01-07 13:27:00.000 -0500| 59.0| 1.0|
2022-01-07 13:28:00.000 -0500| 59.0| 1.0|
2022-01-07 13:29:00.000 -0500| 59.0| 1.0|
2022-01-07 13:30:00.000 -0500| 26.0| 1.0|
</code></pre>
<!--kg-card-end: markdown--><p><code>time_bucket()</code> outputs the starting time of each bucket. Based on our date math for <code>generate_series()</code>, the query produces four complete buckets of 1-minute aggregated data and two partial buckets - one for the minute we are currently in, and one for the partial minute from 5 minutes ago. We can see that the delta correctly calculates the total increase within each bucket, accounting for the counter resets along the way (59 across the 60 readings of a full minute), and the rate of change (the increment between each reading) correctly displays a unit of one.</p><p>What are some other ways we can use these PostgreSQL functions to generate different shapes of data to help you explore other features of SQL and TimescaleDB quickly?</p><h3 id="increasing-trend-over-time">Increasing trend over time</h3><p>With the knowledge of how to create an ordinal value for each row of data produced by <code>generate_series()</code>, we can explore other ways of generating useful time-series data. Because the row number value will always increase, we can easily produce a random dataset that always increases over time but has some variability to it. Consider this a very rough representation of daily website traffic over the span of two years.</p><!--kg-card-begin: markdown--><pre><code class="language-sql">SELECT ts, (10 + 10 * random()) * rownum as value FROM generate_series
       ( '2020-01-01'::date
       , '2021-12-31'::date
       , INTERVAL '1 day') WITH ORDINALITY AS t(ts, rownum);
</code></pre>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/fig1.svg" class="kg-image" alt="Line chart showing fake daily website visits per day over two years of time, rising up and to the right" loading="lazy" width="900" height="500"><figcaption>Sample daily website traffic growing over time with random daily values</figcaption></figure><p>In reality, this chart isn't very realistic or representative. Any website that gains and loses viewers upwards of 50% per day probably isn't going to have great long-term success. Don't worry, we can do better with this example after we learn about another method for creating shaped data using sine waves.</p><h3 id="simple-cycles-sine-wave">Simple cycles (sine wave)</h3><p>Using the built-in <code>sin()</code> and <code>cos()</code> PostgreSQL functions, we can generate data useful for graphing and testing functions that need a predictable data trend. This is particularly useful for testing TimescaleDB downsampling hyperfunctions like <a href="https://docs.timescale.com/api/latest/hyperfunctions/downsample/lttb/">lttb</a> or <a href="https://docs.timescale.com/api/latest/hyperfunctions/downsample/asap/">asap</a>. These functions can take tens of thousands (or millions) of data points and return a smaller, but still accurately representative dataset for graphing.</p><p>We'll start with a basic example that produces one row per day, for 30 days. For each row number value, we'll get the cosine value that can be used to graph a wave.</p><!--kg-card-begin: markdown--><pre><code class="language-sql">-- subtract 1 from the row number for the wave to start
-- at zero radians and produce a more representative chart
SELECT  ts,
 cos(rownum-1) as value
FROM generate_series('2021-01-01','2021-01-30',INTERVAL '1 day') WITH ORDINALITY AS t(ts, rownum);

ts                           |value               |
-----------------------------+--------------------+
2021-01-01 00:00:00.000 -0500|                 1.0|
2021-01-02 00:00:00.000 -0500|  0.5403023058681398|
2021-01-03 00:00:00.000 -0500| -0.4161468365471424|
2021-01-04 00:00:00.000 -0500| -0.9899924966004454|
2021-01-05 00:00:00.000 -0500| -0.6536436208636119|
2021-01-06 00:00:00.000 -0500| 0.28366218546322625|
… | …
</code></pre>
<!--kg-card-end: markdown--><p>Unfortunately, the graph of this cosine wave doesn't look all that appealing. For one month of daily data points, we only have ~6 distinct data points from peak to peak of each wave.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/basic_30day_cosine-1.svg" class="kg-image" alt="Graph showing a cosine wave using daily values for one month" loading="lazy" width="900" height="500"><figcaption>Cosine wave graph using daily values for one month</figcaption></figure><p>The reason our wave is so jagged is that the sine and cosine functions take their input in radians (based on 𝞹), not degrees. A complete cycle (peak-to-peak) on a sine wave happens from zero to 2*𝞹 (~6.28…). Therefore, every ~6 rows of data will produce a complete period in the wave - unless we find a way to modify that value.</p><p>To take control over the sine/cosine values, we need to think about how to modify the data based on the date range and interval (how many rows) and what we want the wave to look like.</p><p>This means we need to take a quick trip back to math class to talk about radians.</p><h2 id="math-class-flashback">Math class flashback!</h2><p>Step back with me for a minute to school and your favorite math subject - Algebra 2 (or Trigonometry as the case may be). 
How many hours did you spend working with graph paper (or graphing calculators) determining the amplitude, period, and shift of a sine or cosine graph?</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/20220114_Creating-sample-data-Image1--v2.0.svg" class="kg-image" alt="Sine wave graph showing period and amplitude, modified from https://www.mathsisfun.com/algebra/amplitude-period-frequency-phase-shift.html" loading="lazy"><figcaption>Sine wave period and amplitude</figcaption></figure><p>If you reach even further into your memory, you might remember this formula which allows you to modify the various aspects of a wave.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/20220114_Creating-sample-data-Image2--v2.0b.svg" class="kg-image" alt="Mathematical formula showing how to modify a sine wave period, amplitude, and shift. Formula: y = A sin(B(x+C))+D" loading="lazy" width="900" height="400"><figcaption>Mathematical formula for modifying the shape and values of a sine wave</figcaption></figure><p>There's a lot here, I know. Let's primarily focus on the two numbers that matter most for our current use case:</p><p>X = the "number of radians", which is the row number in our dataset</p><p>B = a value to multiply the row number by, to decrease the "radian" value for each row</p><p><em>(A, C, and D change the height and placement of the wave, but to start, we want to elongate each period and provide more "points" on the line to graph.)</em></p><p>Let's start with a small dataset example, generating cosine data for three months of daily timestamps with no modifications.</p><!--kg-card-begin: markdown--><pre><code class="language-sql">SELECT  ts, 
cos(rownum) as value
FROM generate_series('2021-01-01','2021-03-31',INTERVAL '1 day') WITH ORDINALITY AS t(ts, rownum);


ts                           |value                |
-----------------------------+---------------------+
2021-01-01 00:00:00.000 -0500|   0.5403023058681398|
2021-01-02 00:00:00.000 -0500|  -0.4161468365471424|
2021-01-03 00:00:00.000 -0500|  -0.9899924966004454|
2021-01-04 00:00:00.000 -0500|  -0.6536436208636119|
2021-01-05 00:00:00.000 -0500|  0.28366218546322625|
2021-01-06 00:00:00.000 -0500|    0.960170286650366|
2021-01-07 00:00:00.000 -0500|   0.7539022543433046|
2021-01-08 00:00:00.000 -0500| -0.14550003380861354|
2021-01-09 00:00:00.000 -0500|  -0.9111302618846769|
… | …
2021-03-29 00:00:00.000 -0400|   0.9993732836951247|
2021-03-30 00:00:00.000 -0400|   0.5101770449416689|
2021-03-31 00:00:00.000 -0400|  -0.4480736161291701|
</code></pre>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/basic_90day_cosine.svg" class="kg-image" alt="Graph showing a cosine wave using daily values for three months" loading="lazy" width="900" height="500"><figcaption>Cosine wave with daily data for three months</figcaption></figure><p>In this example, we see ~14 peaks in our wave because there are 90 points of data and without modification, the wave will have a period (peak-to-peak) every ~6.28 points. To lengthen the cycle, we need to perform some simple division.</p><!--kg-card-begin: markdown--><pre><code class="language-bash">[cycle modifying value] = 6.28/[total interval (rows) per cycle]
</code></pre>
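<p>The 6.28 in this formula is a rounded stand-in for 2*𝞹. As a small sketch, PostgreSQL's built-in <code>pi()</code> function can compute the same modifiers without rounding; the column aliases are just illustrative names. Over a few cycles the difference is invisible on a chart, but on very long series the rounding in 6.28 can slowly shift where the peaks land.</p>

```sql
-- 6.28 approximates 2 * pi; pi() returns the double precision constant.
SELECT 2 * pi()      AS radians_per_cycle,   -- roughly 6.2832
       2 * pi() / 30 AS monthly_modifier,    -- roughly 0.2094
       2 * pi() / 90 AS quarterly_modifier;  -- roughly 0.0698
```

<p>Either value can be multiplied by the row number inside <code>cos()</code> or <code>sin()</code>; the examples in this post keep 6.28 for readability.</p>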
<!--kg-card-end: markdown--><p>Using the same 3 months of generated daily values, let's see how to modify the data to lengthen the period of the wave.</p><h3 id="one-cycle-per-month-30-days">One cycle per month (30 days)</h3><p>If we want our daily data to cycle every 30 days, multiply our row number value by 6.28/30.</p><p>6.28/30 = .209 (the row number radians modifier)</p><!--kg-card-begin: markdown--><pre><code class="language-sql">SELECT  ts, cos(rownum * 6.28/30) as value
FROM generate_series('2021-01-01','2021-03-31',INTERVAL '1 day') WITH ORDINALITY AS t(ts, rownum);
</code></pre>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/basic_90day_cosine_3_phases.svg" class="kg-image" alt="Graph showing a cosine wave, adjusted to produce one wave period per month, three total periods" loading="lazy" width="900" height="500"><figcaption>Cosine wave using daily data for three months, adjusted to have monthly periods</figcaption></figure><h3 id="one-cycle-per-quarter-90-days">One cycle per quarter (90 days)</h3><p>6.28/90 = .07 (this is our radians modifier)</p><pre><code>SELECT  ts, cos(rownum * 6.28/90) as value
FROM generate_series('2021-01-01','2021-03-31',INTERVAL '1 day') WITH ORDINALITY AS t(ts, rownum);</code></pre><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/basic_90day_cosine_1_phase1.svg" class="kg-image" alt="Graph showing a cosine wave, adjusted to produce one wave over three months of data" loading="lazy" width="900" height="500"><figcaption>Cosine wave using daily data for three months, adjusted to have one 3-month period</figcaption></figure><p>To modify the overall length of the period, you need to modify the row number value based on the total number of rows in the result and the granularity of the timestamp.</p><p>Here are some example values that you can use to modify the wave period based on the interval used with <code>generate_series()</code>.</p><!--kg-card-begin: markdown--><table>
<thead>
<tr>
<th>generate_series() interval</th>
<th>Desired period length</th>
<th>Divide 6.28 by…</th>
</tr>
</thead>
<tbody>
<tr>
<td>daily</td>
<td>1 month</td>
<td>30</td>
</tr>
<tr>
<td>daily</td>
<td>3 months</td>
<td>90</td>
</tr>
<tr>
<td>hourly</td>
<td>1 day</td>
<td>24</td>
</tr>
<tr>
<td>hourly</td>
<td>1 week</td>
<td>168</td>
</tr>
<tr>
<td>hourly</td>
<td>1 month</td>
<td>720</td>
</tr>
<tr>
<td>minute</td>
<td>1 hour</td>
<td>60</td>
</tr>
<tr>
<td>minute</td>
<td>1 day</td>
<td>1440</td>
</tr>
</tbody>
</table>
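<p>The divisors in this table are simply the number of rows in one period. As a rough sketch, they can be derived from two intervals by comparing their lengths in seconds. The <code>step</code>, <code>period</code>, and <code>rows_per_cycle</code> names are made up for this example, and note that PostgreSQL treats a month as 30 days when converting an interval to seconds.</p>

```sql
-- Derive the "divide 6.28 by..." value from the series step and desired period.
SELECT step, period,
       EXTRACT(EPOCH FROM period) / EXTRACT(EPOCH FROM step) AS rows_per_cycle
FROM (VALUES
        (INTERVAL '1 day',    INTERVAL '1 month'),  -- 30
        (INTERVAL '1 hour',   INTERVAL '1 week'),   -- 168
        (INTERVAL '1 minute', INTERVAL '1 day')     -- 1440
     ) AS v(step, period);
```

<p>This can be handy when the interval passed to <code>generate_series()</code> isn't one of the combinations listed above, for example 15-minute readings with a weekly cycle.</p>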
<!--kg-card-end: markdown--><h3 id="modifying-the-wave-amplitude-and-shift">Modifying the wave amplitude and shift</h3><p>Another tweak we can make to our wave data is to change the amplitude (how far the peaks rise above and fall below the middle of the wave) and, as necessary, shift the wave up or down on the Y-axis.</p><p>To do this, multiply the cosine value by the maximum value you want the wave to have. For example, we can multiply the monthly cycle data by 10, which changes the overall minimum and maximum values of the data.</p><!--kg-card-begin: markdown--><pre><code class="language-sql">SELECT  ts, 10 * cos(rownum * 6.28/30) as value
FROM generate_series('2021-01-01','2021-03-31',INTERVAL '1 day') WITH ORDINALITY AS t(ts, rownum);
</code></pre>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/basic_90day_cosine_adjusted_amplitude.svg" class="kg-image" alt="Graph showing a cosine wave with adjusted amplitude from -10 to 10 on the Y-axis" loading="lazy" width="900" height="500"><figcaption>Cosine wave using daily data for three months with increased amplitude</figcaption></figure><p>Notice that the min/max values are now from -10 to 10.</p><p>We can take it one step further by adding a value to the output which will shift the final values up or down on the Y-axis. In this example, we modified the previous query by adding 10 to the value of each row which results in values from 0 to 20.</p><!--kg-card-begin: markdown--><pre><code class="language-sql">SELECT  ts, 10 + 10 * cos(rownum * 6.28/30) as value
FROM generate_series('2021-01-01','2021-03-31',INTERVAL '1 day') WITH ORDINALITY AS t(ts, rownum);
</code></pre>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/basic_90day_cosine_y_sift.svg" class="kg-image" alt="Graph showing a cosine wave with shifted min/max values on the Y-axis of zero to twenty" loading="lazy" width="900" height="500"><figcaption>Cosine wave using daily data for three months with shifted min/max values on the Y-axis</figcaption></figure><p>Why spend so much time showing you how to generate and manipulate sine or cosine wave data, especially when we rarely see repeatable data this smooth in real life? </p><p>One of the main advantages of using consistent, predictable data like this in testing is that you can easily tell if your application, charting tools, and query are working as expected. Once you begin adding in unpredictable, real-life data, it can be difficult to determine if the data, query, or application are producing unexpected results. Quickly generating known data with a specific pattern can help rule out errors with the query, at least.</p><p>The second advantage of using a known dataset is that it can be used to shape and influence the results of other queries. Earlier in this post, we demonstrated a very simplistic example of increasing website traffic by multiplying the row number and a random value. Let's look at how we can join both datasets to create a better shape for the sample website traffic data.</p><h3 id="better-website-traffic-samples">Better website traffic samples</h3><p>One of the key takeaways from this series of posts is that <code>generate_series()</code> returns a set of data that can be JOINed and manipulated like data from a regular table. Therefore, we can join together our rough "website traffic" data and our sine wave to produce a smoother, more realistic set of data to experiment with. 
SQL for the win!</p><p>Overall this is one of the more complex examples we've presented, utilizing multiple common table expressions (CTE) to break the various sets into separate tables that we can query and join. However, this also means that you can independently modify the time range and other values to change the data that is generated from this query for your own experimentation.</p><!--kg-card-begin: markdown--><pre><code class="language-sql">-- This is the generate series data
-- with a &quot;short&quot; date to join with later
WITH daily_series AS ( 
	SELECT ts, date(ts) AS day, rownum FROM generate_series
       ( '2020-01-01'
       , '2021-12-31'
       , '1 day'::interval) WITH ORDINALITY AS t(ts, rownum)
),
-- This selects the time, &quot;day&quot;, and a 
-- random value that represents our daily website visits
daily_value AS ( 
	SELECT ts, day, rownum, random() AS val
    FROM daily_series
    ORDER BY day
),
-- This cosine wave dataset has the same &quot;day&quot; values which allow 
-- it to be joined to the daily_value easily. The wave value is used to modify
-- the &quot;website&quot; value by some percentage to smooth it out 
-- in the shape of the wave.
daily_wave AS ( 
	SELECT
       day,
       -- 6.28 radians divided by 180 days (rows) to get 
       -- one peak every 6 months (twice a year)
       1 + .2 * cos(rownum * 6.28/180) as p_mod
       FROM daily_series
       ORDER BY day
)
-- (500 + 20 * val) = 500-520 visits per day before modification
-- p_mod = an adjusted cosine value that raises or lowers our data each day
-- rownum = a big incremental value for each row to quickly increase &quot;visits&quot; each day
SELECT dv.ts, (500 + 20 * val) * p_mod * rownum as value
FROM daily_value dv
	INNER JOIN daily_wave dw ON dv.DAY=dw.DAY
    order by ts;
</code></pre>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/sample_wave-based_website_traffic-2years-1.svg" class="kg-image" alt="Graph showing increasing, fake website visits, adjusted with cosine wave data to provide shape with elevated visits twice a year" loading="lazy" width="900" height="500"><figcaption>Combining wave data and random increasing data to shape the data pattern</figcaption></figure><p>Without much effort, we are able to generate a time-series dataset, use two different SQL functions, and join multiple sets together to create fun, graphical data. In this example, our traffic peaks twice a year (every ~180 days) during July and late December.</p><p>But we don't have to stop there. We can carry our website traffic example one step further by applying just a little more control over how much the data increases or decreases during certain periods.</p><p>Once again, relational data to the rescue!</p><h3 id="influence-the-pattern-with-relational-data">Influence the pattern with relational data</h3><p>As a final example, let's consider one other type of data that we can include in our queries to influence the final generated values - relational data. Although we've been using data that was created using <code>generate_series()</code> to produce some fun and interesting sample datasets, we can just as easily JOIN to other data in our database to further manipulate the final result.</p><p>There are many ways you could JOIN to and use additional data depending on your use case and the type of time-series data you're trying to mimic. 
For example:</p><ul><li><strong>IoT data from weather sensors:</strong> store the typical weekly temperature highs/lows in a database table and use those values as input to the <code>random_between()</code> function we created in post 2</li><li><strong>Stock data analysis:</strong> store the dates for quarterly disclosures and a hypothetical factor that will influence the impact on stock price moving forward</li><li><strong>Sales or website traffic:</strong> store the monthly or weekly change observed in a typical sales cycle. Do traffic or sales increase at quarter-end? What about during the end-of-year holiday season?</li></ul><p>To demonstrate this, we'll use the fictitious website traffic data from earlier in this post. Specifically, we've decided that we want to see a spike in traffic during June and December.</p><p>First, we create a regular PostgreSQL table to store the numerical month (1-12) and a float value which will be used to modify our generated data (up or down). This will allow us to tweak the overall shape for a given month.</p><!--kg-card-begin: markdown--><pre><code class="language-sql">CREATE TABLE overrides (
	m_val INT NOT NULL,
	p_inc FLOAT4 NOT null
);

INSERT INTO overrides(m_val, p_inc) VALUES 
	(1,1.04), -- 4% residual increase from December
	(2,1),
	(3,1),
	(4,1),
	(5,1),
	(6,1.10),-- June increase of 10%
	(7,1),
	(8,1),
	(9,1),
	(10,1),
	(11,1.08), -- 8% early shoppers sales/traffic growth
	(12,1.18); -- 18% holiday increase
</code></pre>
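<p>Before joining to this table, it can be worth confirming that every month has a row, because the INNER JOIN used in the queries that follow silently drops any day whose month is missing from <code>overrides</code>. The check below is a minimal sketch and not part of the original walkthrough; the alias <code>months</code> and the output column name are made up for illustration. It returns zero rows when all twelve months are present.</p><pre><code class="language-sql">-- List any month number (1-12) that has no row in the overrides table
SELECT months.m AS month_missing_from_overrides
FROM generate_series(1, 12) AS months(m)
LEFT JOIN overrides o ON o.m_val = months.m
WHERE o.m_val IS NULL
ORDER BY months.m;
</code></pre>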
<!--kg-card-end: markdown--><p>Using this simple dataset, let's first join it to the "simplistic" query that had randomly growing data over time.</p><!--kg-card-begin: markdown--><pre><code class="language-sql">WITH daily_series AS (
-- a random value that increases over time based on the row number
SELECT ts, date_part('month',ts) AS m_val, (10 + 10*random()) * rownum as value FROM generate_series
       ( '2020-01-01'::date
       , '2021-12-31'::date
       , INTERVAL '1 day') WITH ORDINALITY AS t(ts, rownum)
)
-- join to the `overrides` table to get the 'p_inc' value 
-- for the month of the current row
SELECT ts, value * p_inc AS value FROM daily_series ds
INNER JOIN overrides o ON ds.m_val=o.m_val
ORDER BY ts;
</code></pre>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/enhanced_sample_daily_website_traffic-2years.svg" class="kg-image" alt="Graph showing increasing, fake website visits, adjusted during some months using values from a relational table" loading="lazy" width="900" height="920"><figcaption>Sample website traffic for two years, modified during some months with relational data</figcaption></figure><p>Joining to the <code>overrides</code> table based on the month of each data point, we are able to multiply the percentage increase (<code>p_inc</code>) value and the fake website traffic value to influence the trend of our data during specific time periods.</p><p>Combining everything we've learned and taking this example one step further, we can enhance the cosine data query with the same monthly override values to tweak our fake, cyclical time-series data that represents growing website traffic with a more realistic shape.</p><!--kg-card-begin: markdown--><pre><code class="language-sql">-- This is the generate series data
-- with a &quot;short&quot; date to join with later
WITH daily_series AS ( 
	SELECT ts, date(ts) AS day, rownum FROM generate_series
       ( '2020-01-01'
       , '2021-12-31'
       , '1 day'::interval) WITH ORDINALITY AS t(ts, rownum)
),
-- This selects the time, &quot;day&quot;, and a 
-- random value that represents our daily website visits
-- 'm_val' will be used to join with the 'overrides' table
daily_value AS ( 
	SELECT ts, day, date_part('month',ts) as m_val, rownum, random() AS val
    FROM daily_series
    ORDER BY day
),
-- This cosine wave dataset has the same &quot;day&quot; values which allow 
-- it to be joined to the daily_value easily. The wave value is used to modify
-- the &quot;website&quot; value by some percentage to smooth it out 
-- in the shape of the wave.
daily_wave AS ( 
	SELECT
       day,
       -- 6.28 radians divided by 180 days (rows) to get 
       -- one peak every 6 months (twice a year)
       1 + .2 * cos(rownum * 6.28/180) as p_mod
       FROM daily_series
)
-- (500 + 20 * val) = 500-520 visits per day before modification
-- p_mod = an adjusted cosine value that raises or lowers our data each day
-- rownum = a big incremental value for each row to quickly increase &quot;visits&quot; each day
-- p_inc = a monthly adjustment value taken from the 'overrides' table
SELECT dv.ts, (500 + 20 * val) * p_mod * rownum * p_inc as value
FROM daily_value dv
	INNER JOIN daily_wave dw ON dv.day=dw.day
	INNER JOIN overrides o ON dv.m_val=o.m_val
ORDER BY ts;
</code></pre>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/sample_wave-based_website_traffic_enhanced-2years.svg" class="kg-image" alt="Graph showing increasing, fake website visits, combined with cosine wave data and adjusted during some months using values from a relational table" loading="lazy" width="900" height="500"><figcaption>Sample website traffic for two years combined with cosine wave data and modified during some months with relational data</figcaption></figure><h2 id="wrapping-it-up">Wrapping it up</h2><p>In this 3rd and final blog post of our series about generating sample time-series datasets, we demonstrated how to add shape and trend into your sample time-series data (e.g., increasing web traffic over time and quarterly sales cycles) using built-in SQL functions and relational data. With a little bit of math mixed in, we learned how to manipulate the pattern of generated data, which is particularly useful for visualizing time-series data and learning analytical PostgreSQL or TimescaleDB functions.</p><p>To see some of these examples in action, watch my video on creating realistic sample data:</p><figure class="kg-card kg-embed-card"><iframe width="200" height="113" src="https://www.youtube.com/embed/Ff2ltGrPGIg?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></figure><p>If you have questions about using <a href="https://www.postgresql.org/docs/current/functions-srf.html">generate_series()</a> or about TimescaleDB, please <a href="https://slack.timescale.com/">join our community Slack channel</a>, where you'll find an active community and a handful of Timescale team members most days.</p><p>If you want to try creating larger sets of sample time-series data using generate_series() and see how the exciting 
features of TimescaleDB work, <a href="https://www.timescale.com/timescale-signup">sign up for a free 30-day trial</a> or <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/install-timescaledb/self-hosted/">install and manage it on your instances</a>. (You can also learn more by <a href="https://docs.timescale.com/timescaledb/latest/tutorials/">following one of our many tutorials</a>.)</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Generating More Realistic Sample Time-Series Data With PostgreSQL generate_series()]]></title>
            <description><![CDATA[In the second post about generate_series(), learn ways to create more realistic-looking data for testing and evaluating new features in PostgreSQL and TimescaleDB.]]></description>
            <link>https://www.tigerdata.com/blog/generating-more-realistic-sample-time-series-data-with-postgresql-generate_series</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/generating-more-realistic-sample-time-series-data-with-postgresql-generate_series</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[General]]></category>
            <dc:creator><![CDATA[Ryan Booz]]></dc:creator>
            <pubDate>Thu, 11 Nov 2021 14:51:33 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/11/pierre-chatel-innocenti-F4VHOj76D0o-unsplash.jpg">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/11/pierre-chatel-innocenti-F4VHOj76D0o-unsplash.jpg" alt="Generating More Realistic Sample Time-Series Data With PostgreSQL generate_series()" /><p>In this three-part series on generating sample time-series data, we demonstrate how to use the built-in <a href="https://www.tigerdata.com/learn/understanding-postgresql-user-defined-functions" rel="noreferrer">PostgreSQL function</a>, <a href="https://www.postgresql.org/docs/current/functions-srf.html"><code>generate_series()</code></a>, to more easily create large sets of data to help test various workloads, database features, or just to create fun samples.</p><p><a href="https://timescale.ghost.io/blog/blog/how-to-create-lots-of-sample-time-series-data-with-postgresql-generate_series/">In part 1 of the series</a>, we reviewed how <code>generate_series()</code> works, including the ability to join multiple series into a larger table of time-series data - through a feature known as a CROSS (or Cartesian) JOIN. We ended the first post by showing you how to quickly calculate the number of rows a query will produce and modify the parameters for <code>generate_series()</code> to fine-tune the size and shape of the data.</p><p>However, there was one problem with the data we could produce at the end of the first post. The data that we were able to generate was very basic and not very realistic. Without more effort, using functions like <code>random()</code> to generate values doesn't provide much control over precisely what numbers are produced, so the data still feels more fake than we might want.</p>
<p>This second post will demonstrate a few ways to create more realistic-looking data beyond a column or two of random decimal values. Read on for more.</p><p><a href="https://timescale.ghost.io/blog/blog/how-to-shape-sample-data-with-postgresql-generate_series-and-sql/">Part 3</a> of this blog series adds one final tool to the mix - combining the data formatting techniques below with additional equations and relational data to shape your sample time-series output into something that more closely resembles real-life applications.</p><p>By the end of this series, you'll be ready to test almost any feature that TimescaleDB offers and create quick datasets for your testing and demos!</p><h2 id="a-brief-review-of-generateseries">A brief review of generate_series()</h2><p>In the first post, we demonstrated how <a href="https://www.postgresql.org/docs/current/functions-srf.html"><code>generate_series()</code></a> (a Set Returning Function) could quickly create a data set based on a range of numeric values or dates. The generated data is essentially an in-memory table that can quickly create large sets of sample data.</p><pre><code class="language-sql">-- create a series of values, 1 through 5, incrementing by 1
SELECT * FROM generate_series(1,5);

generate_series|
---------------|
              1|
              2|
              3|
              4|
              5|


-- generate a series of timestamps, incrementing by 1 hour
SELECT * from generate_series('2021-01-01','2021-01-02', INTERVAL '1 hour');

    generate_series     
------------------------
 2021-01-01 00:00:00+00
 2021-01-01 01:00:00+00
 2021-01-01 02:00:00+00
 2021-01-01 03:00:00+00
 2021-01-01 04:00:00+00
...
</code></pre><p>We then discussed how the result set quickly grows as we join the various sets together (along with some value-returning functions): the number of rows produced is the product of the sizes of the joined sets.</p><p>This example from the first post joined a timestamp set, a numeric set, and the <code>random()</code> function to create fake CPU data for four fake devices over time.</p><pre><code class="language-sql">-- there is an implicit CROSS JOIN between the two generate_series() sets
SELECT time, device_id, random()*100 as cpu_usage 
FROM generate_series('2021-01-01 00:00:00','2021-01-01 04:00:00',INTERVAL '1 hour') as time, 
generate_series(1,4) device_id;


time               |device_id|cpu_usage          |
-------------------+---------+-------------------+
2021-01-01 00:00:00|        1|0.35415126479989567|
2021-01-01 01:00:00|        1| 14.013393572770028|
2021-01-01 02:00:00|        1|   88.5015939122006|
2021-01-01 03:00:00|        1|  97.49037810105996|
2021-01-01 04:00:00|        1|  50.22781125586846|
2021-01-01 00:00:00|        2|  46.41196423062297|
2021-01-01 01:00:00|        2|  74.39903569177027|
2021-01-01 02:00:00|        2|  85.44087332221935|
2021-01-01 03:00:00|        2|  4.329394730750735|
2021-01-01 04:00:00|        2| 54.645873866589056|
2021-01-01 00:00:00|        3|  63.01888063314749|
2021-01-01 01:00:00|        3|  21.70606884856987|
2021-01-01 02:00:00|        3|  32.47610779097485|
2021-01-01 03:00:00|        3| 47.565982341726354|
2021-01-01 04:00:00|        3|  64.34867263419619|
2021-01-01 00:00:00|        4|   78.1768041898232|
2021-01-01 01:00:00|        4|  84.51505102850199|
2021-01-01 02:00:00|        4| 24.029611792753514|
2021-01-01 03:00:00|        4|  17.08996115345549|
2021-01-01 04:00:00|        4| 29.642690955760997|
</code></pre><p>And finally, we talked about how to calculate the total number of rows your query would generate based on the time range, the interval between timestamps, and the number of "things" for which you are creating fake data.</p><table>
<thead>
<tr>
<th>Range of readings</th>
<th>Length of interval</th>
<th>Number of "devices"</th>
<th>Total rows</th>
</tr>
</thead>
<tbody>
<tr>
<td>1 year</td>
<td>1 hour</td>
<td>4</td>
<td>35,040</td>
</tr>
<tr>
<td>1 year</td>
<td>10 minutes</td>
<td>100</td>
<td>5,256,000</td>
</tr>
<tr>
<td>6 months</td>
<td>5 minutes</td>
<td>1,000</td>
<td>52,560,000</td>
</tr>
</tbody>
</table>
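<p>The first row of the table can be verified with a quick count. This is a small sketch that assumes the year 2021 (not a leap year) and casts to <code>timestamp</code> so that daylight saving time plays no part: 365 days x 24 hourly readings x 4 devices gives 35,040 rows.</p><pre><code class="language-sql">-- Count the rows produced by one year of hourly readings for four devices
SELECT count(*) AS total_rows
FROM generate_series('2021-01-01 00:00'::timestamp,
                     '2021-12-31 23:00'::timestamp,
                     INTERVAL '1 hour') AS time,
     generate_series(1, 4) AS device_id;
</code></pre><p>Swapping in a different interval or device count is a cheap way to size a dataset before generating the full set of columns.</p>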
<p>Still, <strong><em>the main problem remains.</em></strong> Even if we can generate 50 million rows of data with a few lines of SQL, the data we generate isn't very realistic. It's all random numbers, with lots of decimals and minimal variation.</p><p>As we saw in the query above (generating fake CPU data), any columns of data that we add to the SELECT query are added to each row of the resulting set. If we add static text (like 'Hello, Timescale!'), that text is repeated for every row. Likewise, a function added as a column value is called once for each row of the final set. </p><p>That's what happened with the <code>random()</code> function in the CPU data example. Every row has a different value because the function is called separately for each row of generated data. We can use this to our advantage to begin making the data look more realistic.</p><p>With a little more thought and custom <a href="https://www.tigerdata.com/blog/function-pipelines-building-functional-programming-into-postgresql-using-custom-operators" rel="noreferrer">PostgreSQL functions</a>, we can start to bring our sample data "to life."</p><h3 id="what-is-realistic-data">What is realistic data?</h3><p>This feels like a good time to make sure we're on the same page. What do I mean by "realistic" data?</p><p>Using the basic techniques we've already discussed allows you to create a lot of data quickly. In most cases, however, you often know what the data you're trying to explore looks like. It's probably not a bunch of decimal or integer values. Even if the data you're trying to mimic <em>are</em> just numeric values, they likely have valid ranges and maybe a predictable frequency.</p><p>Take our simple example of CPU and temperature data from above. With just two fields, we have a few choices to make if we want the generated data to <em>feel</em> more realistic.</p><ul><li>Is CPU a percentage? 
Out of 100% or are we representing multi-core CPUs that can present as 200%, 400%, or 800%?</li><li>Is temperature measured in Fahrenheit or Celsius? What are reasonable values for CPU temperature in each unit? Do we store temperature with decimals or as an integer in the schema?</li><li>What if we added a "note" field to the schema for messages that our monitoring software might add to the readings from time to time? Would every reading have a note or just when a threshold was reached? Is there a special diagnostic message at the top of each hour that we need to replicate in some way?</li></ul><p>Using <code>random()</code> and static text by themselves allows us to generate <em>lots</em> of data with many columns, but it's not going to be very interesting or as useful in testing features in the database.</p><p>That's the goal of the second and third posts in this series, helping you to produce sample data that looks more like the real thing without much extra work. Yes, it <em>will</em> still be random, but it will be random within constraints that help you feel more connected to the data as you explore various aspects of time-series data.</p><p>And, by using functions, all of the work is easily reusable from table to table.</p><h3 id="walk-before-you-run">Walk before you run</h3><p>In each of the examples below, we'll approach our solutions much as we learned in elementary math class: <em>show your work</em>! It's often difficult to create a function or procedure in PostgreSQL without playing with a plain SQL statement first. This abstracts away the need to think about function inputs and outputs at the outset so that we can focus on how the SQL works to produce the value we want. </p><p>Therefore, the examples below show you how to get a value (random numbers, text, JSON, etc.) in a SELECT statement first before converting the SQL into a function that can be reused. 
This kind of iterative process is a great way to learn features of PostgreSQL, particularly when it's combined with <a href="https://www.postgresql.org/docs/current/functions-srf.html"><code>generate_series()</code></a>.</p><p>So, take one foot and put it in front of the other, and let's start creating better sample data.</p><h2 id="creating-more-realistic-numbers">Creating more realistic numbers</h2><p>In time-series data, numeric values are often the most common data type. Using a function like <code>random()</code> without any other formatting creates very… well... random (and precise) numbers with lots of decimal places. While it <em>works,</em> the values aren't <em>realistic</em>. Most users and devices aren't tracking CPU usage to 12+ decimal places. We need a way to manipulate and constrain the final value that's returned in the query.</p><p>For numeric values, PostgreSQL provides many built-in functions to modify the output. In many cases, using <code>round()</code> and <code>floor()</code> with basic arithmetic can quickly start shaping the data in a way that better fits your schema and use case.</p>
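<p>As a small illustration of those two functions, here is a sketch that uses a fixed stand-in value rather than <code>random()</code>, so the effect of each function is easy to see. The value 42.56789 and the column aliases are arbitrary choices for this example. Note that PostgreSQL's two-argument <code>round()</code> only accepts <code>numeric</code>, which is why a <code>double precision</code> value has to be cast first.</p><pre><code class="language-sql">SELECT
  v                    AS raw_value,     -- 42.56789
  round(v::numeric, 2) AS two_decimals,  -- 42.57
  floor(v)::integer    AS whole_number   -- 42
FROM (VALUES (42.56789::double precision)) AS t(v);
</code></pre>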
<p>Let's modify the example query for getting device metrics, returning values for CPU and temperature. We want to update the query to ensure that the data values are "customized" for each column, returning values within a specific range and precision. Therefore, we need to apply a standard formula to each numeric value in our SELECT query.</p><pre><code class="language-bash">Final value = random() * (max allowed value - min allowed value) + min allowed value</code></pre><p>This equation always generates a decimal value that is greater than or equal to the min value and strictly less than the max value, because <code>random()</code> returns a value in the range 0.0 &lt;= x &lt; 1.0. If <code>random()</code> returns 0, the result equals the minimum value. As <code>random()</code> approaches 1, the result approaches (but never quite reaches) the maximum value. Any other number that <code>random()</code> returns produces some output between the two.</p>
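<p>One way to convince yourself of the range is to apply the formula to a large sample and look at the extremes. The following is a rough sketch, assuming a min of 3, a max of 100, and an arbitrary sample size of 100,000 rows. The smallest value should land just above 3 and the largest just below 100, although the exact numbers differ on every run.</p><pre><code class="language-sql">-- Apply the range formula 100,000 times and report the extremes
SELECT
  min(v) AS smallest,
  max(v) AS largest
FROM (
  SELECT random() * (100 - 3) + 3 AS v
  FROM generate_series(1, 100000)
) AS samples;
</code></pre>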
<p>Depending on whether we want a decimal or integer value, we can further format the "final value" of our formula with <code>round()</code> and <code>floor()</code>.</p><p>This example produces a reading every minute for one hour for 10 devices. The cpu value will always fall between 3 and 100 (with four decimals of precision), and the temperature will always be an integer between 28 and 82 (because <code>floor()</code> rounds down, the upper bound of 83 is never produced).</p><pre><code class="language-sql">SELECT
  time,
  device_id,
  round((random()* (100-3) + 3)::NUMERIC, 4) AS cpu,
  floor(random()* (83-28) + 28)::INTEGER AS tempc
FROM 
	generate_series(now() - interval '1 hour', now(), interval '1 minute') AS time, 
	generate_series(1,10,1) AS device_id;


time                         |device_id|cpu    |tempc        |
-----------------------------+---------+-------+-------------+
2021-11-03 12:47:01.181 -0400|        1|53.7301|           61|
2021-11-03 12:48:01.181 -0400|        1|34.7655|           46|
2021-11-03 12:49:01.181 -0400|        1|78.6849|           44|
2021-11-03 12:50:01.181 -0400|        1|95.5484|           64|
2021-11-03 12:51:01.181 -0400|        1|86.3073|           82|
…|...|...|...
</code></pre><p>By using our simple formula and formatting the result correctly, the query produced the "curated" output (random as it is) we wanted.</p><h3 id="the-power-of-functions">The power of functions</h3><p>But there's also a bit of a letdown here, isn't there? Typing that formula repeatedly for each value - trying to remember the order of parameters and when I need to cast a value - will become tedious quickly. After all, you only have so many <a href="https://keysleft.com/">keystrokes left</a>. </p><p>The solution is to create and use PostgreSQL functions that can take the inputs we need, do the correct calculations, and return the formatted value that we want. There are <em>many</em> ways we could accomplish a calculation like this in a function. Use this example as a starting place for your learning and exploration.</p><p><em><strong>Note:</strong> </em>In this example, I chose to return the value from this function as a <code>numeric</code> data type because it can return values that <em>look</em> like integers (no decimals) or floats (decimals). As long as the return values are inserted into a table with the intended schema, this is a "trick" to visually see what we expect - an integer or a float. In general, the <code>numeric</code> data type will often perform worse in queries and features like compression because of how <code>numeric</code>  values are represented internally. We recommend avoiding <code>numeric</code> types in schema design whenever possible, preferring the float or integer types instead.</p><pre><code class="language-sql">/*
 * Function to create a random numeric value between two numbers
 * 
 * NOTICE: We are using the type of 'numeric' in this function in order
 * to visually return values that look like integers (no decimals) and 
 * floats (with decimals). However, if inserted into a table, the assumption
 * is that the appropriate column type is used. The `numeric` type is often
 * not the correct or most efficient type for storing numbers in a table.
 */
CREATE OR REPLACE FUNCTION random_between(min_val numeric, max_val numeric, round_to int=0) 
   RETURNS numeric AS
$$
 DECLARE
 	value NUMERIC = random() * (max_val - min_val) + min_val;
BEGIN
   IF round_to = 0 THEN 
	 RETURN floor(value);
   ELSE 
   	 RETURN round(value,round_to);
   END IF;
END
$$ language 'plpgsql';
</code></pre><p>This example function uses the minimum and maximum values provided, applies the "range" formula we discussed earlier, and finally returns a <code>numeric</code> value that either has decimals (to the specified number of digits) or not. Using this function in our query, we can simplify creating formatted values for sample data, and it cleans up the SQL, making it easier to read and use.</p><pre><code class="language-sql">SELECT
  time,
  device_id,
  random_between(3,100, 4) AS cpu,
  random_between(28,83) AS temperature_c
FROM 
	generate_series(now() - interval '1 hour', now(), interval '1 minute') AS time, 
	generate_series(1,10,1) AS device_id;
</code></pre><p>This query provides the same formatted output, but now it's much easier to repeat the process.</p><h2 id="creating-more-realistic-text">Creating more realistic text</h2><p>What about text? So far, in both articles, we've only discussed how to generate numeric data. We all know, however, that time-series data often contain more than just numeric values. Let's turn to another common data type: text. </p><p>Time-series data often contains text values. When your schema contains log messages, item names, or other identifying information stored as text, we want to generate sample text that feels more realistic, even if it's random.</p><p>Let's consider the query used earlier that creates CPU and temperature data for a set of devices. If the devices were real, the data they create might contain an intermittent status message of varying length.</p><p>To figure out how to generate this random text, we will follow the same process as before, working directly in a stand-alone SQL query before moving our solution into a reusable function. After some initial attempts (and ample Googling), I came up with this example for producing random text of variable length using a defined character set. As with the <code>random_between()</code> function above, this can be modified to suit your needs. For instance, it would be fairly easy to get unique, random hexadecimal values by limiting the set of characters and lengths. </p><p><em>Let your creativity guide you.</em></p><pre><code class="language-sql">WITH symbols(characters) as (VALUES ('ABCDEFGHIJKLMNOPQRSTUVWXYZ abcdefghijklmnopqrstuvwxyz 0123456789 {}')),
w1 AS (
	SELECT string_agg(substr(characters, floor(random() * length(characters))::INTEGER + 1, 1), '') r_text, 'g1' AS idx
	FROM symbols,
generate_series(1,10) as word(chr_idx) -- word length
	GROUP BY idx)
SELECT
  time,
  device_id,
  random_between(3,100, 4) AS cpu,
  random_between(28,83) AS temperature_c,
  w1.r_text AS note
FROM w1, generate_series(now() - interval '1 hour', now(), interval '1 minute') AS time, 
	generate_series(1,10,1) AS device_id
ORDER BY 1,2;

time                         |device_id|cpu     |temperature_c|note      |
-----------------------------+---------+--------+-------------+----------+
2021-11-03 16:49:24.218 -0400|        1| 88.3525|           50|I}3U}FIsX9|
2021-11-03 16:49:24.218 -0400|        2| 29.5313|           53|I}3U}FIsX9|
2021-11-03 16:49:24.218 -0400|        3| 97.6065|           70|I}3U}FIsX9|
2021-11-03 16:49:24.218 -0400|        4| 96.2170|           40|I}3U}FIsX9|
2021-11-03 16:49:24.218 -0400|        5| 53.2318|           82|I}3U}FIsX9|
2021-11-03 16:49:24.218 -0400|        6| 73.7244|           56|I}3U}FIsX9|
</code></pre><p>In this case, it was easier to generate a random value inside of a CTE that we could reference later in the query. However, this approach has one problem that's pretty easy to spot in the first few rows of returned data. </p><p>While the CTE does create random text of 10 characters (go ahead and run it a few times to verify), the CTE is evaluated only once per query execution, and its single row is then repeated for every row of the result. Once we transfer the query into a function, we expect to see a different value for each row.</p><p>This second example function generates "words" of random length (or no text at all in some cases). The caller can provide integers for the minimum and maximum length of the generated text. After some testing, we also added a simple randomizing feature. </p><p>Notice the IF...THEN condition that we added. Any time the generated length, divided by five, leaves a remainder of zero or one, the function returns an empty string instead of a word. There is nothing special about this approach to providing randomness to the frequency of the output, so feel free to adjust this part of the function to suit your needs.</p><pre><code class="language-sql">/*
 * Function to create random text, of varying length
 */
CREATE OR REPLACE FUNCTION random_text(min_val INT=0, max_val INT=50) 
   RETURNS text AS
$$
DECLARE 
	word_length INTEGER = floor(random() * (max_val-min_val) + min_val)::INTEGER;
	random_word TEXT = '';
BEGIN
	-- only build a word if the length, divided by 5, leaves a remainder greater than 1.
	-- This gives some randomness to when words are produced or not. Adjust for your tastes.
	IF(word_length % 5) &gt; 1 THEN
	SELECT * INTO random_word FROM (
		WITH symbols(characters) AS (VALUES ('ABCDEFGHIJKLMNOPQRSTUVWXYZ abcdefghijklmnopqrstuvwxyz 0123456789 '))
		SELECT string_agg(substr(characters, floor(random() * length(characters))::INTEGER + 1, 1), ''), 'g1' AS idx
		FROM symbols
		JOIN generate_series(1,word_length) AS word(chr_idx) on 1 = 1 -- word length
		group by idx) a;
	END IF;
	RETURN random_word;
END
$$ LANGUAGE 'plpgsql';
</code></pre><p>When we use this function to add random text to our sample time-series query, notice that the text is random in both length (at most 9 characters here, because the maximum of 10 is exclusive) and frequency.</p><pre><code class="language-sql">SELECT
  time,
  device_id,
  random_between(3,100, 4) AS cpu,
  random_between(28,83) AS temperature_c,
  random_text(2,10) AS note
FROM generate_series(now() - interval '1 hour', now(), interval '1 minute') AS time, 
	generate_series(1,10,1) AS device_id
ORDER BY 1,2;

time                         |device_id|cpu     |temperature_c|note     |
-----------------------------+---------+--------+-------------+---------+
2021-11-04 14:17:03.410 -0400|        1| 86.5780|           67|         |
2021-11-04 14:17:03.410 -0400|        2|  3.5370|           76|pCVBp AZ |
2021-11-04 14:17:03.410 -0400|        3| 59.7085|           28|kMrr     |
2021-11-04 14:17:03.410 -0400|        4| 69.6153|           46|3UdA     |
2021-11-04 14:17:03.410 -0400|        5| 33.0906|           56|d0sSUilx |
2021-11-04 14:17:03.410 -0400|        6| 44.2837|           74|         |
2021-11-04 14:17:03.410 -0400|        7| 14.2550|           81|TOgbHOU  |
</code></pre><p>Hopefully, you're starting to see a pattern. Using <code>generate_series()</code> and some custom functions can help you create time-series data of many shapes and sizes.</p><p>We've demonstrated ways to create more realistic numbers and text data because they are the primary data types used in time-series data. Are there any other data types included with time-series data that you might need to generate with your sample data? </p><p>What about JSON values?</p><h2 id="creating-sample-json">Creating sample JSON</h2><p><strong>Note:</strong> The sample queries below create JSON strings as the output with the intention that it would be inserted into a table for further testing and learning. In PostgreSQL, JSON string data can be stored in a JSON or JSONB column, each providing different features for querying and displaying the JSON data. In most circumstances, JSONB is the preferred column type because it provides more efficient storage and the ability to create indexes over the contents. The main downside is that the actual formatting of the JSON string, including the order of the keys and values, is not retained and may be difficult to reproduce exactly. To better understand the differences of when you would store JSON string data with one column type over the other, please <a href="https://www.postgresql.org/docs/current/datatype-json.html">refer to the PostgreSQL documentation</a>.</p><p>PostgreSQL has supported JSON and JSONB data types for many years. With each major release, the feature set for working with JSON and overall query performance improves. In a growing number of data models, particularly when REST or Graph APIs are involved, storing extra meta information as a JSON document can be beneficial. 
The data is available if needed while facilitating efficient queries on serialized data stored in regular columns.</p><p>We used a design pattern similar to this in our <a href="https://docs.timescale.com/timescaledb/latest/tutorials/analyze-nft-data/">NFT Starter Kit</a>. The OpenSea JSON API used as the data source for the starter kit includes many properties and values for each asset and collection. A lot of the values weren't helpful for the specific analysis in that tutorial. However, we knew that some of the values in the JSON properties could be useful in future analysis, tutorials, or demonstrations. Therefore, we stored additional metadata about assets and collections in a JSONB field to query it if needed. Still, it didn't complicate the schema design for otherwise common data like <code>name</code> and <code>asset_id</code>.</p><p>Storing data in a JSON field is also a common practice in areas like IIoT device data. Engineers usually have an agreed-upon schema to store and query metrics produced by the device, followed by a "free form" JSON column that allows engineers to send error or diagnostic data that changes over time as hardware is modified or updated.</p><p>There are several approaches to add JSON data to our sample query. One added challenge is that JSON data includes both a key and a value, along with the possibility of numerous levels of child object nesting. The approach you take will depend on how complex you want the PostgreSQL function to be and the end goal of the sample data. In this example, we'll create a function that takes an array of keys for the JSON and generates random numerical values for each key without nesting. Generating the JSON string in SQL from our values is straightforward, thanks to built-in PostgreSQL functions for reading and writing JSON strings. 
🎉</p><p>As with the other examples in this post, we'll start by using a CTE to generate a random JSON document in a stand-alone SELECT query to verify that the result is what we want. Remember, we'll observe the same issue we had earlier when generating random text in the stand-alone query because we are using a CTE. The JSON is random every time the query runs, but the string is reused for all rows in the result set. The CTE is evaluated once when the query runs, whereas a function in the SELECT list is called again for every row. Because of this, we won't observe random values in each row until we move the SQL into a function to reuse later.</p><pre><code class="language-sql">WITH random_json AS (
SELECT json_object_agg(key, random_between(1,10)) as json_data
    FROM unnest(array['a', 'b']) as u(key))
  SELECT json_data, generate_series(1,5) FROM random_json;

json_data       |generate_series|
----------------+---------------+
{"a": 6, "b": 2}|              1|
{"a": 6, "b": 2}|              2|
{"a": 6, "b": 2}|              3|
{"a": 6, "b": 2}|              4|
{"a": 6, "b": 2}|              5|
</code></pre><p>We can see that the JSON data is created using our keys (['a','b']) with numbers between 1 and 10. We just have to create a function that will create random JSON data each time it is called. For demonstration purposes, whenever this function returns a JSON document, the document contains an integer value for each key we provide. Feel free to enhance this function to return more complex documents with various data types if that's a requirement for you.</p><pre><code class="language-sql">CREATE OR REPLACE FUNCTION random_json(keys TEXT[]='{"a","b","c"}',min_val NUMERIC = 0, max_val NUMERIC = 10) 
   RETURNS JSON AS
$$
DECLARE 
	random_val NUMERIC  = floor(random() * (max_val-min_val) + min_val)::INTEGER;
	random_json JSON = NULL;
BEGIN
	-- again, this adds some randomness into the results. Remove or modify if this
	-- isn't useful for your situation
	if(random_val % 5) &gt; 1 then
		SELECT * INTO random_json FROM (
			SELECT json_object_agg(key, random_between(min_val,max_val)) as json_data
	    		FROM unnest(keys) as u(key)
		) json_val;
	END IF;
	RETURN random_json;
END
$$ LANGUAGE 'plpgsql';
</code></pre><p>With the <code>random_json()</code> function in place, we can test it in a few ways. First, we'll simply call the function directly without any parameters, which will return a JSON document with the default keys provided in the function definition ("a", "b", "c") and values from 0 to 10 (the default minimum and maximum value).</p><pre><code class="language-sql">SELECT random_json();

random_json             |
------------------------+
{"a": 7, "b": 3, "c": 8}|
</code></pre><p>Next, we'll join this to a small numeric set from <code>generate_series()</code>.</p><pre><code class="language-sql">SELECT device_id, random_json() FROM generate_series(1,5) device_id;

device_id|random_json              |
---------+-------------------------+
        1|{"a": 2, "b": 2, "c": 2} |
        2|                         |
        3|{"a": 10, "b": 7, "c": 1}|
        4|                         |
        5|{"a": 7, "b": 1, "c": 0} |
</code></pre><p>Notice two things with this example.</p><p>First, the data is different for each row, showing that the function gets called for each row and produces different numeric values each time. Second, because we kept the same random output mechanism from the <code>random_text()</code> example, not every row includes JSON.</p><p>Finally, let's add this into the sample query for generating device data that we've used throughout this article to see how to provide an array of keys ("building" and "rack") for the generated JSON data.</p><pre><code class="language-sql">SELECT
  time,
  device_id,
  random_between(3,100, 4) AS cpu,
  random_between(28,83) AS temperature_c,
  random_text(2,10) AS note,
  random_json(ARRAY['building','rack'],1,20) device_location
FROM generate_series(now() - interval '1 hour', now(), interval '1 minute') AS time, 
	generate_series(1,10,1) AS device_id
ORDER BY 1,2;


time                         |device_id|cpu     |temperature_c|note     |device_location             |
-----------------------------+---------+--------+-------------+---------+----------------------------+
2021-11-04 16:19:22.991 -0400|        1| 14.7614|           70|CTcX8 2s4|                            |
2021-11-04 16:19:22.991 -0400|        2| 62.2618|           81|x1V      |{"rack": 4, "building": 5}  |
2021-11-04 16:19:22.991 -0400|        3| 10.1214|           50|1PNb     |                            |
2021-11-04 16:19:22.991 -0400|        4| 96.3742|           29|aZpikXGe |{"rack": 12, "building": 4} |
2021-11-04 16:19:22.991 -0400|        5| 22.5327|           30|lM       |{"rack": 2, "building": 3}  |
2021-11-04 16:19:22.991 -0400|        6| 57.9773|           44|         |{"rack": 16, "building": 5} |
...
</code></pre><p>There are just so many possibilities for creating sample data with <code>generate_series()</code>, PostgreSQL functions, and some custom logic.</p><h2 id="putting-it-all-together">Putting it all together</h2><p>Let's put what we've learned into practice, using these three functions to create and insert ~1 million rows of data and then query it with the hyperfunctions <a href="https://docs.timescale.com/api/latest/hyperfunctions/time_bucket/"><code>time_bucket()</code></a>, <a href="https://docs.timescale.com/api/latest/hyperfunctions/time_bucket_ng/"><code>time_bucket_ng()</code></a>, <a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/approx_percentile/"><code>approx_percentile()</code></a> and <a href="https://docs.timescale.com/api/latest/hyperfunctions/time-weighted-averages/time_weight/"><code>time_weight()</code></a>. To do this, we'll create two tables: one will be a list of computer hosts and the second will be a <a href="https://www.tigerdata.com/blog/database-indexes-in-postgresql-and-timescale-cloud-your-questions-answered" rel="noreferrer">hypertable</a> that stores fake time-series data about the computers.</p><h3 id="step-1-create-the-schema-and-hypertable">Step 1: Create the schema and <a href="https://www.tigerdata.com/blog/database-indexes-in-postgresql-and-timescale-cloud-your-questions-answered" rel="noreferrer">hypertable</a><br></h3><pre><code class="language-sql">CREATE TABLE host (
	id int PRIMARY KEY,
	host_name TEXT,
	LOCATION jsonb
);

CREATE TABLE host_data (
	date timestamptz NOT NULL,
	host_id int NOT NULL,
	cpu double PRECISION,
	tempc int,
	status TEXT	
);

SELECT create_hypertable('host_data','date');
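
-- Optional check (an addition to the original steps, assuming
-- TimescaleDB 2.x, which provides the timescaledb_information.hypertables
-- view): confirm that host_data is now registered as a hypertable.
SELECT hypertable_name, num_dimensions
FROM timescaledb_information.hypertables
WHERE hypertable_name = 'host_data';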
</code></pre><h3 id="step-2-generate-and-insert-data">Step 2: Generate and insert data</h3><pre><code class="language-sql">-- Insert data to create fake hosts
INSERT INTO host
SELECT id, 'host_' || id::TEXT AS name, 
	random_json(ARRAY['building','rack'],1,20) AS LOCATION
FROM generate_series(1,100) AS id;


-- insert ~1.3 million records for the last 3 months
INSERT INTO host_data
SELECT date, host_id,
	random_between(5,100,3) AS cpu,
	random_between(28,90) AS tempc,
	random_text(20,75) AS status
FROM generate_series(now() - INTERVAL '3 months',now(), INTERVAL '10 minutes') AS date,
generate_series(1,100) AS host_id;
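
-- Optional sanity checks (a sketch, not part of the original steps).
-- random_json() returns NULL for some calls, so some hosts will have
-- no location and will never match a filter on building or rack.
SELECT count(*) AS total_rows FROM host_data;

SELECT count(*) AS hosts_without_location
FROM host
WHERE location IS NULL;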
</code></pre><h3 id="step-3-query-data-using-timebucket-and-timebucketng">Step 3: Query data using <code>time_bucket()</code> and <code>time_bucket_ng()</code></h3><pre><code class="language-sql">-- Using time_bucket(), query the average CPU and max tempc
SELECT time_bucket('7 days', date) AS bucket, host_name,
	avg(cpu),
	max(tempc)
FROM host_data
JOIN host ON host_data.host_id = host.id
WHERE date &gt; now() - INTERVAL '1 month'
GROUP BY 1,2
ORDER BY 1 DESC, 2;


-- try the experimental time_bucket_ng() to query data in month buckets
SELECT timescaledb_experimental.time_bucket_ng('1 month', date) AS bucket, host_name,
	avg(cpu) avg_cpu,
	max(tempc) max_temp
FROM host_data
JOIN host ON host_data.host_id = host.id
WHERE date &gt; now() - INTERVAL '3 month'
GROUP BY 1,2
ORDER BY 1 DESC, 2;
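

-- Hedged alternative (an assumption about your installed version):
-- more recent TimescaleDB releases let time_bucket() itself accept
-- month intervals, so the experimental function may not be needed.
-- Check the documentation for your version before relying on this.
SELECT time_bucket('1 month', date) AS bucket, host_name,
	avg(cpu) avg_cpu,
	max(tempc) max_temp
FROM host_data
JOIN host ON host_data.host_id = host.id
WHERE date &gt; now() - INTERVAL '3 month'
GROUP BY 1,2
ORDER BY 1 DESC, 2;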
</code></pre><h3 id="step-4-query-data-using-toolkit-hyperfunctions">Step 4: Query data using toolkit hyperfunctions</h3><pre><code class="language-sql">-- query all hosts in building 10 for 7 day buckets
-- also try the new percentile approximation function to 
-- get the p75 of data for each 7 day period
SELECT time_bucket('7 days', date) AS bucket, host_name,
	avg(cpu),
	approx_percentile(0.75,percentile_agg(cpu)) p75,
	max(tempc)
FROM host_data
JOIN host ON host_data.host_id = host.id
WHERE date &gt; now() - INTERVAL '1 month'
	AND LOCATION -&gt; 'building' = '10'
GROUP BY 1, 2
ORDER BY 1 DESC, 2;



-- To test time-weighted averages, we need to simulate missing
-- some data points in our host_data table. To do this, we'll
-- randomly select ~10% of the rows, and then delete them from the
-- host_data table.
WITH random_delete AS (SELECT date, host_id FROM host_data
	 JOIN host ON host_id = id WHERE 
	date &gt; now() - INTERVAL '2 weeks'
	ORDER BY random() LIMIT 20000
)
DELETE FROM host_data hd
USING random_delete rd
WHERE hd.date = rd.date
AND hd.host_id = rd.host_id;


-- Select the daily time-weighted average and regular average
-- of each host for building 10 for the last two weeks.
-- Notice the variation in the two numbers because of the missing data.
SELECT time_bucket('1 day',date) AS bucket,
	host_name,
	average(time_weight('LOCF',date,cpu)) weighted_avg,
	avg(cpu) 
FROM host_data
	JOIN host ON host_data.host_id = host.id
WHERE LOCATION -&gt; 'building' = '10'
AND date &gt; now() - INTERVAL '2 weeks'
GROUP BY 1,2
ORDER BY 1 DESC, 2;
</code></pre><p>In a few lines of SQL, we created 1.3 million rows of data and were able to test four different functions in TimescaleDB, all without relying on any external source. 💪</p><p>Still, you may notice one last issue with the values in our <code>host_data</code> table (even though the values are now more realistic in nature). By using <code>random()</code> as the basis for our queries, the calculated numeric values all tend to have an equal distribution within the specified range, which causes the average of the values to always be near the midpoint of that range. This makes sense statistically, but it highlights one other area of improvement to the data we generate. In the third post of this series, we'll demonstrate a few ways to influence the generated values to provide shape to the data (and even some outliers if we need them).</p>
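<p>If you'd like to see that flat distribution for yourself, one way is to compare each column's average with the midpoint of the range used to generate it. The query below is a minimal sketch, assuming the <code>host_data</code> table was filled exactly as in Step 2 (<code>cpu</code> between 5 and 100, <code>tempc</code> between 28 and 90); the column aliases are our own. Both averages should land close to their midpoints.</p><pre><code class="language-sql">-- avg(cpu) is double precision, so cast it before rounding to 2 places
SELECT round(avg(cpu)::numeric, 2) AS avg_cpu,
	(5 + 100) / 2.0 AS cpu_midpoint,
	round(avg(tempc), 2) AS avg_tempc,
	(28 + 90) / 2.0 AS tempc_midpoint
FROM host_data;
</code></pre>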
<h2 id="reviewing-our-progress">Reviewing our progress</h2><p>When using a database like TimescaleDB or testing features in PostgreSQL, generating a representative dataset is a beneficial tool to have in your SQL toolbelt. </p><p><a href="https://timescale.ghost.io/blog/blog/how-to-create-lots-of-sample-time-series-data-with-postgresql-generate_series/">In the first post</a>, we learned how to generate lots of data by combining the result sets of multiple <code>generate_series()</code> functions. Using the implicit <code>CROSS JOIN</code>, the total number of rows in the final output is the product of the row counts of each set. When one of the data sets contains timestamps, the output can be used to create time-series data for testing and querying.</p>
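<p>As a quick refresher, the row-count arithmetic is easy to check directly. This sketch uses fixed timestamps (chosen only for illustration) so the count is reproducible: one hour at one-minute intervals yields 61 timestamps, because both endpoints are included, and joining them with 10 device IDs gives 610 rows.</p><pre><code class="language-sql">-- 61 timestamps x 10 devices = 610 rows
SELECT count(*) AS total_rows
FROM generate_series('2024-01-01 00:00:00+00'::timestamptz,
                     '2024-01-01 01:00:00+00'::timestamptz,
                     interval '1 minute') AS ts,
     generate_series(1, 10) AS device_id;
</code></pre>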
<p>The problem with our initial examples was that the actual values we generated were random and lacked control over their precision - and all of the data was numeric. So in this second post, we demonstrated how to format the numeric data for a given column and generate random data of other types, like text and JSON documents. We also added an example in the text and JSON functions that created randomness in how often the values were emitted for each of those columns.</p><p>Again, all of these are building block examples for you to use, creating functions that generate the kind of data you need to test.</p><p>To see some of these examples in action, watch my video on creating realistic sample data:</p><figure class="kg-card kg-embed-card"><iframe width="200" height="113" src="https://www.youtube.com/embed/iKBH_p327vw?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></figure><p>In <a href="https://timescale.ghost.io/blog/blog/how-to-shape-sample-data-with-postgresql-generate_series-and-sql/">part 3</a> of this series, we will demonstrate how to add shape and trends into your sample time-series data (e.g., increasing web traffic over time and quarterly sales cycles) using the formatting functions in this post in conjunction with relational lookup tables and additional mathematical functions. 
Knowing how to manipulate the pattern of generated data is particularly useful for visualizing time-series data and learning analytical PostgreSQL or TimescaleDB functions.</p><p>If you have questions about using <a href="https://www.postgresql.org/docs/current/functions-srf.html"><code>generate_series()</code></a> or have any questions about TimescaleDB, please <a href="https://slack.timescale.com">join our community Slack channel</a>, where you'll find an active community and a handful of the Timescale team most days.</p><p>If you want to try creating larger sets of sample time-series data using <code>generate_series()</code> and see how the exciting features of TimescaleDB work, <a href="https://www.timescale.com/timescale-signup">sign up for a free 30-day trial</a> or <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/install-timescaledb/self-hosted/">install and manage it on your instances</a>. (You can also learn more by <a href="https://docs.timescale.com/timescaledb/latest/tutorials/">following one of our many tutorials</a>.)</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[What Is ClickHouse and How Does It Compare to PostgreSQL and TimescaleDB for Time Series?]]></title>
            <description><![CDATA[A detailed benchmark comparing the TimescaleDB and ClickHouse ingest speeds, disk space, and query response times.]]></description>
            <link>https://www.tigerdata.com/blog/what-is-clickhouse-how-does-it-compare-to-postgresql-and-timescaledb-and-how-does-it-perform-for-time-series-data</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/what-is-clickhouse-how-does-it-compare-to-postgresql-and-timescaledb-and-how-does-it-perform-for-time-series-data</guid>
            <category><![CDATA[Engineering]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Benchmarks & Comparisons]]></category>
            <dc:creator><![CDATA[Ryan Booz]]></dc:creator>
            <pubDate>Thu, 21 Oct 2021 13:43:23 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/09/Timescale-vs-clickhouse-hero.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/09/Timescale-vs-clickhouse-hero.png" alt="Timescale vs ClickHouse - the companies' logo opposing each other" /><p>Over the past year, one database we keep hearing about is ClickHouse, a column-oriented OLAP database initially built and open-sourced by Yandex. </p><p>In this detailed post, which is the culmination of three months of research and analysis, we answer the most common questions we hear, including:</p><ul><li>What is ClickHouse (including a deep dive into its architecture)</li><li>How does ClickHouse compare to PostgreSQL</li><li>How does ClickHouse compare to TimescaleDB</li><li>How does ClickHouse perform for time-series data vs. TimescaleDB</li></ul><p>At Timescale, we take our benchmarks very seriously. We find that, in our industry, there is far too much vendor-biased “benchmarketing” and not enough honest “benchmarking.” We believe developers deserve better. So we take great pains to really understand the technologies we are comparing against—and also to point out places where the other technology shines (and where TimescaleDB may fall short). </p><p>You can see this in our other detailed benchmarks vs. <a href="https://www.timescale.com/blog/timescaledb-vs-amazon-timestream-6000x-higher-inserts-175x-faster-queries-220x-cheaper/" rel="noreferrer">AWS Timestream</a> (29-minute read), <a href="https://www.timescale.com/blog/how-to-store-time-series-data-mongodb-vs-timescaledb-postgresql-a73939734016/" rel="noreferrer">MongoDB</a> (19-minute read), and <a href="https://www.timescale.com/blog/timescaledb-vs-influxdb-for-time-series-data-timescale-influx-sql-nosql-36489299877" rel="noreferrer">InfluxDB</a> (26-minute read).</p><p>We’re also database nerds at heart who really enjoy learning about and digging into other systems. 
(Which are a few reasons why these posts—including this one—are so long!)</p><p>So, to better understand the strengths and weaknesses of ClickHouse, we spent the last three months and hundreds of hours benchmarking, testing, reading documentation, and working with contributors.</p><p><em>Shout-out to Timescale engineers Alexander Kuzmenkov, who was most recently a core developer on ClickHouse, and Aleksander Alekseev, who is also a PostgreSQL contributor, who helped check our work and keep us honest with this post.</em></p><h2 id="how-clickhouse-fared-in-our-tests">How ClickHouse Fared in Our Tests</h2><p>ClickHouse is a very impressive piece of technology. In some tests, ClickHouse proved to be a blazing-fast database, able to ingest data faster than anything else we’ve tested so far (including TimescaleDB). In some complex queries, particularly those that do complex grouping aggregations, ClickHouse is hard to beat.</p><p>But nothing in databases comes for free. ClickHouse achieves these results because its developers have made specific architectural decisions. 
These architectural decisions also introduce limitations, especially when compared to PostgreSQL and TimescaleDB.</p><h3 id="clickhouse%E2%80%99s-limitationsweaknesses-include">ClickHouse’s limitations/weaknesses include:</h3><ul><li><strong>Worse query performance than TimescaleDB at nearly all queries</strong> in the <a href="https://github.com/timescale/tsbs#:~:text=The%20Time%20Series%20Benchmark%20Suite,write%20performance%20of%20various%20databases.">Time-Series Benchmark Suite</a>, except for complex aggregations.</li><li><strong>Poor inserts and much higher disk usage</strong> (e.g., 2.7x higher disk usage than TimescaleDB) at small batch sizes (e.g., 100-300 rows/batch).</li><li><strong>Non-standard SQL-like query language</strong> with several limitations (e.g., joins are discouraged, syntax is at times non-standard).</li><li><strong>Lack of other features</strong> one would expect in a robust SQL database (e.g., PostgreSQL or TimescaleDB): no transactions, no correlated sub-queries, no stored procedures, no user-defined functions, no index management beyond primary and secondary indexes, no triggers.</li><li><strong>Inability to modify or delete data at a high rate and low latency—</strong>instead have to batch deletes and updates.</li><li><strong>Batch deletes and updates happen asynchronously.</strong></li><li>Because data modification is asynchronous, <strong>ensuring consistent backups is difficult</strong>: the only way to ensure a consistent backup is to stop all writes to the database.</li><li><strong>Lack of transactions and lack of data consistency also affect other features like materialized views</strong> because the server can't atomically update multiple tables at once. If something breaks during a multi-part insert to a table with materialized views, the end result is an inconsistent state of your data.</li></ul><p>We list these shortcomings not because we think ClickHouse is a bad database. 
We actually think it’s a great database—well, to be more precise, a great database <em>for certain workloads</em>. And as a developer, you have to choose the right tool for <em>your workload</em>.</p><h3 id="why-does-clickhouse-fare-well-in-certain-cases-but-worse-in-others">Why does ClickHouse fare well in certain cases but worse in others?</h3><p>The answer is the <em>underlying architecture</em>.</p><p>Generally, in databases, there are two types of fundamental architectures, each with strengths and weaknesses: OnLine Transactional Processing (OLTP) and OnLine Analytical Processing (OLAP).</p><table>
<thead>
<tr>
<th style="text-align:center">OLTP</th>
<th style="text-align:center">OLAP</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:center">Large and small datasets</td>
<td style="text-align:center">Large datasets focused on reporting/analysis</td>
</tr>
<tr>
<td style="text-align:center">Transactional data (the raw, individual records matter)</td>
<td style="text-align:center">Pre-aggregated or transformed data to foster better reporting</td>
</tr>
<tr>
<td style="text-align:center">Many users performing varied queries and updates on data across the system</td>
<td style="text-align:center">Fewer users performing deep data analysis with few updates</td>
</tr>
<tr>
<td style="text-align:center">SQL is the primary language for interaction</td>
<td style="text-align:center">Often, but not always, utilizes a particular query language other than SQL</td>
</tr>
</tbody>
</table>
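<p>The contrast in the table above can be sketched as two query shapes. This is a hedged illustration written in PostgreSQL syntax; the <code>readings</code> table and its columns are hypothetical and are not part of the benchmark.</p><pre><code class="language-sql">-- Hypothetical table, for illustration only
CREATE TABLE readings (
	device_id int NOT NULL,
	ts timestamptz NOT NULL,
	temperature double precision
);

-- OLTP-style: touch a handful of rows, typically inside a transaction
BEGIN;
UPDATE readings
   SET temperature = 21.5
 WHERE device_id = 42
   AND ts = '2021-10-01 00:00:00+00';
COMMIT;

-- OLAP-style: scan many rows, read few columns, return a small aggregate
SELECT device_id,
	date_trunc('day', ts) AS bucket_day,
	avg(temperature) AS avg_temperature
FROM readings
GROUP BY device_id, bucket_day
ORDER BY device_id, bucket_day;
</code></pre>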
<h2 id="clickhouse-postgresql-and-timescaledb-architectures">ClickHouse, PostgreSQL, and TimescaleDB architectures</h2><p>At a high level, ClickHouse is an excellent OLAP database designed for systems of analysis. </p><p>PostgreSQL, by comparison, is a general-purpose database designed to be a versatile and reliable OLTP database for systems of record with high user engagement. </p><p>TimescaleDB is a relational database for time-series: purpose-built on PostgreSQL for time-series workloads. It combines the best of PostgreSQL plus new capabilities that increase performance, reduce cost, and provide an overall better developer experience for time series.</p><p>So, if you find yourself needing to perform fast analytical queries on mostly immutable large datasets with few users, i.e., OLAP, ClickHouse may be the better choice.</p><p>Instead, if you find yourself needing something more versatile that works well for powering applications with many users and likely frequent updates/deletes, i.e., OLTP, PostgreSQL may be the better choice.</p><p>And if your applications have time-series data (and especially if you also want the versatility of PostgreSQL), TimescaleDB is likely the best choice.</p><h2 id="time-series-benchmark-suite-results-summary-timescaledb-vs-clickhouse">Time-Series Benchmark Suite results summary (TimescaleDB vs. ClickHouse)</h2><p>We can see the impact of these architectural decisions on how TimescaleDB and ClickHouse fare with time-series workloads. </p><p>We spent hundreds of hours working with ClickHouse and TimescaleDB during this benchmark research. We tested insert loads from 100 million rows (1 billion metrics) to 1 billion rows (10 billion metrics), cardinalities from 100 to 10 million, and numerous combinations in between. We really wanted to understand how each database works across various datasets.</p><p>Overall, for inserts, we find that ClickHouse outperforms with large batch sizes but underperforms with smaller batch sizes. 
For queries, we find that ClickHouse underperforms on most queries in the benchmark suite, except for complex aggregates. </p><h3 id="insert-performance">Insert performance</h3><p>When rows are batched between 5,000 and 15,000 rows per insert, speeds are fast for both databases, with ClickHouse performing noticeably better:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/09/01-timescale-vs-clickhouse.png" class="kg-image" alt="Insert comparison between ClickHouse and TimescaleDB at cardinalities between 100 and 1 million hosts" loading="lazy" width="1842" height="1288" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/09/01-timescale-vs-clickhouse.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/09/01-timescale-vs-clickhouse.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2023/09/01-timescale-vs-clickhouse.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/09/01-timescale-vs-clickhouse.png 1842w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Performance comparison: ClickHouse outperforms TimescaleDB at all cardinalities when batch sizes are 5,000 rows or greater</span></figcaption></figure><p>However, when the batch size is smaller, the results are reversed in two ways: insert speed and disk consumption. With larger batches of 5,000 rows/batch, ClickHouse consumed ~16GB of disk during the test, while TimescaleDB consumed ~19GB (both before compression).</p><p>With smaller batch sizes, not only does TimescaleDB maintain steady insert speeds that are faster than ClickHouse between 100-300 rows/batch, but disk usage is 2.7x higher with ClickHouse. 
This difference should be expected because of the architectural design choices of each database, but it's still interesting to see.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/09/02-small-batch-insert-performance.png" class="kg-image" alt="Insert comparison of TimescaleDB and ClickHouse with small batch sizes. TimescaleDB outperforms and uses 2.7x less disk space." loading="lazy" width="1842" height="1358" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/09/02-small-batch-insert-performance.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/09/02-small-batch-insert-performance.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2023/09/02-small-batch-insert-performance.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/09/02-small-batch-insert-performance.png 1842w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Performance comparison: Timescale outperforms ClickHouse with smaller batch sizes and uses 2.7x less disk space</span></figcaption></figure><h3 id="query-performance">Query performance</h3><p>For testing query performance, we used a "standard" dataset that queries data for 4,000 hosts over a three-day period, with a total of 100 million rows. In our experience running benchmarks in the past, we found that this cardinality and row count works well as a representative dataset for benchmarking because it allows us to run many ingest and query cycles across each database in a few hours. 
</p><p>Based on ClickHouse’s reputation as a fast OLAP database, we expected ClickHouse to outperform TimescaleDB for nearly all queries in the benchmark.</p><p>When we ran TimescaleDB without compression, ClickHouse did outperform. </p><p>However, when we enabled TimescaleDB compression—which is the recommended approach—we found the opposite, with TimescaleDB outperforming nearly across the board:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/09/03-query-latency.png" class="kg-image" alt="Bar chart displaying results of query response between TimescaleDB and ClickHouse. TimescaleDB outperforms in almost every query category." loading="lazy" width="2000" height="1584" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/09/03-query-latency.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/09/03-query-latency.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2023/09/03-query-latency.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/09/03-query-latency.png 2354w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Results of query benchmarking between TimescaleDB and ClickHouse. TimescaleDB outperforms in almost every query category</span></figcaption></figure><p>(For those that want to replicate our findings or better understand why ClickHouse and TimescaleDB perform the way they do under different circumstances, please read the entire article for the full details.)</p><h3 id="cars-vs-bulldozers">Cars vs. bulldozers</h3><p>Today, we live in the golden age of databases: there are so many databases that all these lines (OLTP/OLAP/time-series/etc.) are blurring. 
Yet every database is architected differently and, as a result, has different advantages and disadvantages. As a developer, you should choose the right tool for the job. </p><p>After spending lots of time with ClickHouse, reading their docs, and working through weeks of benchmarks, we found ourselves repeating this simple analogy:</p><p><em>ClickHouse is like a bulldozer - very efficient and performant for a specific use case. PostgreSQL (and TimescaleDB) is like a car: versatile, reliable, and useful in most situations you will face in your life. </em></p><p>Most of the time, a “car” will satisfy your needs. But if you find yourself doing a lot of “construction”, by all means, get a “bulldozer.”</p><p>We aren’t the only ones who feel this way. Here is a similar opinion shared on <a href="https://news.ycombinator.com/item?id=28596718">HackerNews</a> by <em>stingraycharles </em>(whom we don’t know, but <em>stingraycharles</em> if you are reading this—we love your username):</p>
<!--kg-card-begin: html-->
<div class="quote-block">
    <p class="quote-block__text">
        "TimescaleDB has a great time-series story and an average data warehousing story; Clickhouse has a great data warehousing story, an average time-series story, and a bit meh clustering story (YMMV)."
    </p>   
</div>
<!--kg-card-end: html-->
<p>In the rest of this article, we do a deep dive into the ClickHouse architecture and then highlight some of the advantages and disadvantages of ClickHouse, PostgreSQL, and TimescaleDB that result from the architectural decisions that their developers (including us) have made. </p><p>We conclude with a more detailed time-series benchmark analysis. We also have a detailed description of our testing environment to replicate these tests yourself and verify our results.</p><p>Yes, we’re the makers of TimescaleDB, so you may not trust our analysis. If so, we ask you to hold your skepticism for the next few minutes and give the rest of this article a read. </p><p>As you (hopefully) will see, we spent a lot of time understanding ClickHouse for this comparison: first, to make sure we were conducting the benchmark the right way so that we were fair to ClickHouse, but also because we are database nerds at heart and were genuinely curious to learn how ClickHouse was built. </p><h3 id="next-steps">Next steps</h3><p>Are you curious about TimescaleDB? The easiest way to get started is by <a href="https://console.cloud.timescale.com/signup" rel="noreferrer">creating a free Timescale Cloud account</a>, which will give you access to a fully managed TimescaleDB instance (100% free for 30 days).</p><p>If you want to host TimescaleDB yourself, you can do it completely for free: <a href="https://github.com/timescale/timescaledb">visit our GitHub</a> to learn more about options, get installation instructions, and more (⭐️  are very much appreciated! 🙏)</p><p>One last thing:<a href="http://slack.timescale.com"> Join our Community Slack</a> to ask questions, get advice, and connect with other developers (we are 7,000+ and counting!). 
We, the authors of this post, are very active on all channels—as well as all our engineers, members of Team Timescale, and many passionate users.</p><h2 id="what-is-clickhouse">What Is ClickHouse?</h2><p>ClickHouse, short for “Clickstream Data Warehouse”, is a columnar OLAP database that was initially built for web analytics in Yandex Metrica. Generally, ClickHouse is known for its high insert rates, fast analytical queries, and SQL-like dialect.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/10/04-timeline--1-.jpg" class="kg-image" alt="Timeline of ClickHouse development from 2008 to 2020" loading="lazy" width="2000" height="903" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2021/10/04-timeline--1-.jpg 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2021/10/04-timeline--1-.jpg 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2021/10/04-timeline--1-.jpg 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/10/04-timeline--1-.jpg 2000w" sizes="(min-width: 720px) 720px"></figure><p><em>Timeline of ClickHouse development (</em><a href="https://clickhouse.com/blog/en/2020/the-clickhouse-community/"><em>Full history here.</em></a><em>)</em></p><p>We are fans of ClickHouse. It is a very good database built around certain architectural decisions that make it a good option for OLAP-style analytical queries. In particular, in our benchmarking with the Time Series Benchmark Suite (TSBS),<strong> ClickHouse performed better for data ingestion than any time-series database we've tested so far</strong> (TimescaleDB included) at an average of more than 600k rows/second on a single instance <em>when rows are batched appropriately</em>. 
</p><p>But nothing in databases comes for free - and as we’ll show below, this architecture also creates significant limitations for ClickHouse, making it slower for many types of time-series queries and some insert workloads. </p><p>If your application doesn't fit within the architectural boundaries of ClickHouse (or TimescaleDB, for that matter), you'll probably end up with a frustrating development experience, redoing a lot of work down the road.</p><h2 id="the-clickhouse-architecture">The ClickHouse Architecture</h2><p>ClickHouse was designed for OLAP workloads, which have specific characteristics. From the <a href="https://clickhouse.com/docs/en/">ClickHouse documentation</a>, here are some of the requirements for this type of workload: </p><ul><li>The vast majority of requests are for read access.</li><li>Data is inserted in fairly large batches (&gt; 1000 rows), not by single rows, or it is not updated at all.</li><li>Data is added to the DB but is not modified.</li><li>For reads, quite a large number of rows are processed from the DB, but only a small subset of columns.</li><li>Tables are “wide,” meaning they contain a large number of columns.</li><li>Queries are relatively rare (usually hundreds of queries per server or less per second).</li><li>For simple queries, latencies of around 50 ms are allowed.</li><li>Column values are fairly small: numbers and short strings (for example, 60 bytes per URL).</li><li>Requires high throughput when processing a single query (up to billions of rows per second per server).</li><li>Transactions are not necessary.</li><li>Low requirements for data consistency.</li><li>There is one large table per query. All tables are small, except for one.</li><li>A query result is significantly smaller than the source data. In other words, data is filtered or aggregated so the result fits in a single server’s RAM.</li></ul><p>How is ClickHouse designed for these workloads? 
Here are some of the key aspects of their architecture: </p><ul><li>Compressed, column-oriented storage</li><li>Table Engines</li><li>Indexes</li><li>Vector Computation Engine</li></ul><h3 id="compressed-column-oriented-storage"><strong>Compressed, column-oriented storage</strong></h3><p>First, ClickHouse (like nearly all OLAP databases) is column-oriented (or columnar), meaning that data for the same table column is stored together. (In contrast, in row-oriented storage, used by nearly all OLTP databases, data for the same table row is stored together.)</p><p>Column-oriented storage has a few advantages:</p><ul><li>If your query only needs to read a few columns, then reading that data is much faster (you don’t need to read entire rows, just the columns)</li><li>Storing columns of the same data type together leads to greater compressibility (although, as we have shown, it is possible to build <a href="https://www.timescale.com/blog/building-columnar-compression-in-a-row-oriented-database" rel="noreferrer">columnar compression into row-oriented storage</a>).</li></ul><h3 id="table-engines">Table engines</h3><p>To improve the storage and processing of data in ClickHouse, columnar data storage is implemented using a collection of table "engines". The table engine determines the type of table and the features that will be available for processing the data stored inside. </p><p>ClickHouse primarily uses the <strong>MergeTree table engine</strong> as the basis for how data is written and combined. 
Nearly all other table engines derive from MergeTree and allow additional functionality to be performed automatically as the data is (later) processed for long-term storage.</p><p>(Quick clarification: From this point forward, whenever we mention MergeTree, we're referring to the overall MergeTree architecture design and all table types that derive from it unless we specify a specific MergeTree type.)</p><p>At a high level, MergeTree allows data to be written and stored very quickly to multiple immutable files (called "parts" by ClickHouse). These files are later processed in the background at some point in the future and merged into a larger <strong>part</strong> with the goal of reducing the total number of <strong>parts</strong> on disk (fewer files = more efficient data reads later). This is one of the key reasons behind ClickHouse’s astonishingly high insert performance on large batches.</p><p>All columns in a table are stored in separate <strong>parts</strong> (files), and all values in each column are stored in the order of the primary key. This column separation and sorting implementation makes future data retrieval more efficient, particularly when computing aggregates on large ranges of contiguous data.</p><h3 id="indexes">Indexes</h3><p>Once the data is stored and merged into the most efficient set of <strong>parts</strong> for each column, queries need to know how to efficiently find the data. For this, ClickHouse relies on two types of indexes: the primary index and, additionally, a secondary (data skipping) index.</p><p>Unlike a traditional OLTP B-tree index, which knows how to locate any row in a table, the ClickHouse primary index is sparse in nature, meaning that it does not have a pointer to the location of every value for the primary index. Instead, because all data is stored in primary key order, <strong>the primary index stores the value of the primary key in every N-th row</strong> (called <code>index_granularity</code>, 8192 by default). 
This is done with the specific design goal of<strong> fitting the primary index into memory </strong>for extremely fast processing.</p><p>When your query patterns fit with this index style, the sparse nature can help improve query speed significantly. The one limitation is that you cannot create other indexes on specific columns to help improve a different query pattern. We'll discuss this more later.</p><h3 id="vector-computation-engine">Vector computation engine</h3><p>ClickHouse was designed with the desire to have "online" query processing in a way that other OLAP databases hadn't been able to achieve. Even with compression and columnar data storage, most other OLAP databases still rely on incremental processing to pre-compute aggregated data. It has generally been the pre-aggregated data that's provided the speed and reporting capabilities.</p><p>To overcome these limitations, ClickHouse implemented a series of vector algorithms for working with large arrays of data on a column-by-column basis. With vectorized computation, ClickHouse can specifically work with data in blocks of tens of thousands of rows (per column) for many computations. Vectorized computing also provides an opportunity to write more efficient code that utilizes modern SIMD processors and keeps code and data closer together for better memory access patterns, too.</p><p>In total, this is a great feature for working with large data sets and writing complex queries on a limited set of columns and something TimescaleDB could benefit from as we explore more opportunities to utilize columnar data.</p><p>That said, as you'll see from the benchmark results, enabling compression in TimescaleDB (which converts data into compressed columnar storage) improves the query performance of many aggregate queries in ways that are even better than ClickHouse.</p><h3 id="clickhouse-architectures-disadvantages-aka-nothing-comes-for-free">ClickHouse architecture's disadvantages (a.k.a. 
nothing comes for free)</h3><p>Nothing comes for free in database architectures. Clearly, ClickHouse is designed with a very specific workload in mind. Similarly, it is <em>not designed</em> for other types of workloads. </p><p>We can see an initial set of disadvantages from the <a href="https://clickhouse.com/docs/en/introduction/distinctive-features/">ClickHouse docs</a>:</p><ul><li>No full-fledged transactions.</li><li>Lack of ability to modify or delete already inserted data with a high rate and low latency. There are batch deletes and updates available to clean up or modify data, for example, to comply with GDPR, but not for regular workloads.</li><li>The sparse index makes ClickHouse not so efficient for point queries retrieving single rows by their keys.</li></ul><p>There are a few disadvantages that are worth going into detail:</p><ul><li>Data can’t be directly modified in a table</li><li>Some “synchronous” actions aren’t really synchronous</li><li>SQL-like, but not quite SQL</li><li>No data consistency in backups</li></ul><h3 id="mergetree-limitation-data-can%E2%80%99t-be-directly-modified-in-a-table">MergeTree limitation: Data can’t be directly modified in a table</h3><p>All tables in ClickHouse are immutable. There is no way to directly update or delete a value that's already been stored. Instead, any operations that <code>UPDATE</code> or <code>DELETE</code> data can only be accomplished through an <code>ALTER TABLE</code> statement that applies a filter and actually re-writes the entire table (<strong>part</strong> by <strong>part</strong>) in the background to update or delete the data in question. Essentially, it's just another merge operation with some filters applied. </p><p>As a result, several MergeTree table engines exist to solve this deficiency—to solve common scenarios where frequent data modifications would otherwise be necessary. 
Yet, this can lead to unexpected behavior and non-standard queries.</p><p>For example, if you need to store only the most recent reading of a value, creating a <a href="https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/collapsingmergetree/">CollapsingMergeTree table type</a> is your best option. With this table type, an additional column (called <code>Sign</code>) is added to the table, which indicates which row is the current state of an item when all other field values match. ClickHouse will then asynchronously delete rows with a <code>Sign</code> that cancel each other out (a value of 1 vs -1), leaving the most recent state in the database.</p><p>For example, consider a common database design pattern where the most recent values of a sensor are stored alongside the long-term time-series table for fast lookup. We'll call this table <strong>SensorLastReading</strong>. In ClickHouse, this table would require the following pattern to store the most recent value every time new information is stored in the database.</p><p><strong>SensorLastReading</strong></p>
<table>
<thead>
<tr>
<th>SensorID</th>
<th>Temp</th>
<th>Cpu</th>
<th>Sign</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>55</td>
<td>78</td>
<td>1</td>
</tr>
</tbody>
</table>
<p>When new data is received, you need to add <strong>2 more rows</strong> to the table, one to negate the old value and one to replace it.</p><table>
<thead>
<tr>
<th>SensorID</th>
<th>Temp</th>
<th>Cpu</th>
<th>Sign</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>55</td>
<td>78</td>
<td>1</td>
</tr>
<tr>
<td>1</td>
<td>55</td>
<td>78</td>
<td>-1</td>
</tr>
<tr>
<td>1</td>
<td>40</td>
<td>35</td>
<td>1</td>
</tr>
</tbody>
</table>
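<p>As a minimal sketch of this pattern in ClickHouse SQL, the table definition and the two inserts might look like the following. Only the column names come from the sample table above; the column types and the <code>ORDER BY</code> key are our assumptions for illustration.</p><pre><code class="language-sql">-- Hypothetical schema: column types and sorting key are assumptions.
CREATE TABLE SensorLastReading
(
    SensorID UInt32,
    Temp     Int32,
    Cpu      Int32,
    Sign     Int8   -- 1 = "state" row, -1 = "cancel" row
)
ENGINE = CollapsingMergeTree(Sign)
ORDER BY SensorID;

-- Initial state for sensor 1.
INSERT INTO SensorLastReading VALUES (1, 55, 78, 1);

-- New reading arrives: cancel the old state, then write the new one.
INSERT INTO SensorLastReading VALUES
    (1, 55, 78, -1),
    (1, 40, 35, 1);
</code></pre>
<p>Note that the application has to remember the previous values (55 and 78) in order to write the matching cancel row.</p>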
<p>At some point after this insert, ClickHouse will merge the changes, removing the two rows that cancel each other out on Sign, leaving the table with just this row:</p><table>
<thead>
<tr>
<th>SensorID</th>
<th>Temp</th>
<th>Cpu</th>
<th>Sign</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>40</td>
<td>35</td>
<td>1</td>
</tr>
</tbody>
</table>
<p>But remember, <strong>MergeTree operations are asynchronous,</strong> so queries can occur on data before something like the collapse operation has been performed. Therefore, the queries to get data out of a CollapsingMergeTree table require additional work, like multiplying rows by their <code>Sign</code>, to make sure you get the correct value any time the table is in a state that still contains duplicate data. </p><p>Here is one solution that the ClickHouse documentation provides, modified for our sample data. Notice that with numeric values, you can get the "correct" answer by multiplying all values by the <code>Sign</code> column and adding a <code>HAVING</code> clause.</p><pre><code class="language-sql">SELECT
    SensorID,
    sum(Temp * Sign) AS Temp,
    sum(Cpu * Sign) AS Cpu
FROM SensorLastReading
GROUP BY SensorID
HAVING sum(Sign) &gt; 0
</code></pre>
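<p>When summing by <code>Sign</code> doesn't fit the query (for example, with non-numeric columns), the ClickHouse documentation also describes a <code>FINAL</code> modifier that applies the collapse at query time. Here is a minimal sketch against the same sample table; the documentation warns that this form is less efficient, so treat it as an option for small tables or occasional lookups rather than a default.</p><pre><code class="language-sql">-- FINAL collapses state/cancel pairs while reading,
-- so no Sign arithmetic is needed in the query itself.
SELECT
    SensorID,
    Temp,
    Cpu
FROM SensorLastReading FINAL
</code></pre>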
<p>Again, the value here is that MergeTree tables provide really fast ingestion of data at the expense of transactions and simple concepts like UPDATE and DELETE in the way traditional applications would try to use a table like this. With ClickHouse, it's just more work to manage this kind of data workflow. </p><p>Because ClickHouse isn't an ACID database, these background modifications (or really any data manipulations) have no guarantees of ever being completed. Because there is no such thing as transaction isolation, any SELECT query that touches data in the middle of an UPDATE or DELETE modification (or a Collapse modification, as we noted above) will get whatever data is <strong><em>currently</em></strong> in each part. If the delete process, for instance, has only modified 50% of the parts for a column, queries would return outdated data from the remaining parts that have not yet been processed.</p><p>More importantly, <strong>this holds true for all data that is stored in ClickHouse, </strong>not just the large, analytical-focused tables that store something like time-series data, <em>but also the related metadata</em>. </p><p>While it's understandable that time-series data, for example, is often insert-only (and rarely updated), business-centric metadata tables almost always have modifications and updates as time passes. </p><p>Regardless, the related business data that you may store in ClickHouse to do complex joins and deeper analysis is still in a MergeTree table (or variation of a MergeTree), and therefore, updates or deletes would still require an entire rewrite (through the use of <code>ALTER TABLE</code>) any time there are modifications.</p><h3 id="distributed-mergetree-tables">Distributed MergeTree tables</h3><p>Distributed tables are another example of where asynchronous modifications might cause you to change how you query data. 
If your application writes data directly to the distributed table (rather than to different cluster nodes, which is possible for advanced users), the data is first written to the "initiator" node, which in turn copies the data to the shards in the background as quickly as possible. Because there are no transactions to verify that the data was moved as part of something like two-phase commits (available in PostgreSQL), your data might not actually be where you think it is.</p><p>There is at least one other problem with how distributed data is handled. Because ClickHouse does not support transactions and data is in a constant state of being moved, there is no guarantee of consistency in the state of the cluster nodes. Saving 100,000 rows of data to a distributed table doesn't guarantee that backups of all nodes will be consistent with one another (we'll discuss reliability in a bit). Some of that data might have been moved, and some of it might still be in transit.</p><p>Again, this is by design, so there's nothing specifically wrong with what's happening in ClickHouse! It's just something to be aware of when comparing ClickHouse to something like PostgreSQL and TimescaleDB.</p><h3 id="some-%E2%80%9Csynchronous%E2%80%9D-actions-aren%E2%80%99t-really-synchronous"><strong>Some “synchronous” actions aren’t really synchronous</strong></h3><p>Most actions in ClickHouse are not synchronous. But we found that even some of the ones labeled “synchronous” weren’t really synchronous either.</p><p>One particular example that caught us by surprise during our benchmarking was how <code>TRUNCATE</code> worked. We ran many test cycles against ClickHouse and TimescaleDB to identify how changes in row batch size, workers, and even cardinality impacted the performance of each database. At the end of each cycle, we would <code>TRUNCATE</code> the database in each server, expecting the disk space to be released quickly so that we could start the next test. 
In PostgreSQL (and other OLTP databases), this is an atomic action. As soon as the truncate is complete, the space is freed up on disk.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/pasted-image-0--1--5.png" class="kg-image" alt="Dashboard graph showing disk usage and immediate release of space after using TRUNCATE" loading="lazy" width="1524" height="540" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/01/pasted-image-0--1--5.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/01/pasted-image-0--1--5.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/pasted-image-0--1--5.png 1524w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">TRUNCATE is an atomic action in TimescaleDB/PostgreSQL and frees disk almost immediately</span></figcaption></figure><p>We expected the same thing with ClickHouse because the documentation mentions that this is a <strong>synchronous</strong> action (and most things are <strong>not</strong> synchronous in ClickHouse). It turns out, however, that the files only get marked for deletion and the disk space is freed up at a later, unspecified time in the background. 
There's no specific guarantee for when that might happen.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/pasted-image-0-10.png" class="kg-image" alt="Dashboard graph showing disk usage of ClickHouse and the time needed to free disk space after TRUNCATE" loading="lazy" width="1526" height="550" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/01/pasted-image-0-10.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/01/pasted-image-0-10.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/pasted-image-0-10.png 1526w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">TRUNCATE is an asynchronous action in ClickHouse, freeing disk at some future time</span></figcaption></figure><p>For our tests, it was a minor inconvenience. <strong>We had to add a 10-minute sleep into the testing cycle</strong> to ensure that ClickHouse had released the disk space fully. In real-world situations, like ETL processing that utilizes staging tables, a <code>TRUNCATE</code> wouldn't actually free the staging table data immediately, which could cause you to modify your current processes.</p><p>We point out a few of these scenarios simply to highlight that ClickHouse isn't a drop-in replacement for many things that a system of record (OLTP database) is generally used for in modern applications. Working effectively with asynchronously modified data can take a lot more effort.</p><h3 id="sql-like-but-not-quite-sql">SQL-like, but not quite SQL</h3><p>In many ways, ClickHouse was ahead of its time by choosing SQL as the language of choice.</p><p>ClickHouse chose early in its development to utilize SQL as the primary language for managing and querying data. 
Given the focus on data analytics, this was a smart and obvious choice, given that SQL was already widely adopted and understood for querying data. </p><p>In ClickHouse, the SQL isn't something that was added after the fact to satisfy a portion of the user community. That said, what ClickHouse provides is a SQL-like language that doesn't comply with any actual standard.</p><p>The challenges of a SQL-like query language are many. For example, retraining users who will be accessing the database (or writing applications that access the database). Another challenge is a lack of ecosystem: connectors and tools that speak SQL won’t just work out of the box—i.e., they will require some modification (and again, knowledge by the user) to work. </p><p>Overall, ClickHouse handles basic SQL queries well. </p><p>However, because the data is stored and processed in a different way from most SQL databases, there are a number of commands and functions you may expect to use from a SQL database (e.g., PostgreSQL, TimescaleDB), but which ClickHouse doesn't support or has limited support for:</p><ul><li>Not optimized for JOINs</li><li>No index management beyond the primary and secondary indexes</li><li>No recursive CTEs</li><li>No correlated subqueries or LATERAL joins</li><li>No stored procedures</li><li>No user-defined functions</li><li>No triggers<br></li></ul><p>One example that <strong>stands out about ClickHouse is that </strong><a href="https://www.timescale.com/learn/sql-joins-summary"><strong>JOINs</strong></a><strong>, by nature, are generally discouraged because the query engine lacks any ability to optimize the join of two or more tables</strong>. 
</p><p>Instead, users are encouraged to query table data with separate sub-select statements and then use something like an <a href="https://www.timescale.com/learn/what-is-a-sql-inner-join"><code>ANY INNER JOIN</code></a>, which strictly looks for <a href="https://clickhouse.tech/docs/en/sql-reference/statements/select/join/#select-join-types">unique pairs on both sides of the join</a> (avoiding a Cartesian product that can occur with standard JOIN types). There's also no caching support for the product of a JOIN, so if a table is joined multiple times, <strong>the query on that table is executed multiple times</strong>, further slowing down the query.</p><p>For example, all of the "double-groupby" queries in the TSBS group by multiple columns and then join to the tag table to get the <code>hostname</code> for the final output. Here is how that query is written for each database.</p><p><strong>TimescaleDB:</strong></p><pre><code class="language-sql">WITH cpu_avg AS (
     SELECT time_bucket('1 hour', time) as hour,
       tags_id,
       AVG(cpu_user) AS mean_cpu_user
     FROM cpu
     WHERE time &gt;= '2021-01-01T12:30:00Z' 
       AND time &lt; '2021-01-02T12:30:00Z'
     GROUP BY 1, 2
)
SELECT hour, hostname, mean_cpu_user
FROM cpu_avg
JOIN tags ON cpu_avg.tags_id = tags.id
ORDER BY hour, hostname;
</code></pre>
<p><strong>ClickHouse:</strong></p><pre><code class="language-sql">SELECT
    hour,
    id,
    mean_cpu_user
FROM
(
    SELECT
        toStartOfHour(created_at) AS hour,
        tags_id AS id,
        AVG(cpu_user) as mean_cpu_user
    FROM cpu
    WHERE (created_at &gt;= '2021-01-01T12:30:00Z') 
        AND (created_at &lt; '2021-01-02T12:30:00Z')
    GROUP BY
        hour,
        id
) AS cpu_avg
ANY INNER JOIN tags USING (id)
ORDER BY
    hour ASC,
    id;
</code></pre>
<p><strong>Reliability: no data consistency in backups</strong></p><p>One last aspect to consider as part of the ClickHouse architecture and its lack of support for transactions is that there is no data consistency in backups. As we've already shown, all data modification (even sharding across a cluster) is asynchronous. Therefore, the only way to ensure a consistent backup would be to stop all writes to the database and then make a backup. Data recovery struggles with the same limitation.</p><p>The lack of transactions and data consistency also affects other features like materialized views because the server can't atomically update multiple tables at once. If something breaks during a multi-part insert to a table with materialized views, the end result is an inconsistent state of your data.</p><p>ClickHouse is aware of these shortcomings and is certainly working on or planning updates for future releases. Some form of transaction support <a href="https://github.com/ClickHouse/ClickHouse/issues/22086">has been in discussion for some time</a>, and <a href="https://github.com/ClickHouse/ClickHouse/pull/21945">backups are in process</a> and merged into the main branch of code, although it's <a href="https://github.com/ClickHouse/ClickHouse/pull/21945#issuecomment-933598875">not yet recommended for production use</a>. But even then, it only provides limited support for transactions.</p><h2 id="clickhouse-vs-postgresql">ClickHouse vs. PostgreSQL</h2><p><em>(A proper ClickHouse vs. PostgreSQL comparison would probably take another 8,000 words. To avoid making this post even longer, we opted to provide a short comparison of the two databases—but if anyone wants to provide a more detailed comparison, we would love to read it.) </em></p><p>As we can see above, ClickHouse is a well-architected database for OLAP workloads. Conversely, PostgreSQL is a well-architected database for OLTP workloads. 
</p><p>Also, PostgreSQL isn’t just an OLTP database: it’s the fastest-growing and most loved OLTP database (<a href="https://db-engines.com/en/ranking">DB-Engines</a>, <a href="https://survey.stackoverflow.co/2024/technology#1-databases" rel="noreferrer">Stack Overflow 2024</a><a href="https://insights.stackoverflow.com/survey/2021#section-most-popular-technologies-databases"> Developer Survey</a>).</p><p>As a result, we won’t compare the performance of ClickHouse vs. PostgreSQL because—to continue our analogy from before—it would be like comparing the performance of a bulldozer vs. a car. These are two different things designed for two different purposes. </p><p>We’ve already established why ClickHouse is excellent for analytical workloads. Let’s now understand why PostgreSQL is so loved for transactional workloads: versatility, extensibility, and reliability.</p><h3 id="postgresql-versatility-and-extensibility">PostgreSQL versatility and extensibility</h3><p>Versatility is one of the distinguishing strengths of PostgreSQL. It's one of the main reasons for the recent resurgence of PostgreSQL in the wider technical community.</p><p><a href="https://www.timescale.com/blog/best-practices-for-picking-postgresql-data-types/" rel="noreferrer">PostgreSQL supports a variety of data types</a>, including arrays, JSON, and more. It supports <a href="https://www.timescale.com/blog/database-indexes-in-postgresql-and-timescale-cloud-your-questions-answered/" rel="noreferrer">various index types</a>—not just the common B-tree but also GIST, GIN, and more. Full-text search? Check. Role-based access control? Check. And, of course, full SQL.</p><p>Also, through the use of extensions, PostgreSQL can retain the things it's good at while adding specific functionality to enhance the ROI of your development efforts.</p><p>Does your application need geospatial data? Add the PostGIS extension. What about features that benefit time-series data workloads? Add TimescaleDB. 
<a href="https://www.timescale.com/learn/postgresql-extensions-pg-trgm">Could your application benefit from the ability to search using trigrams? Add pg_trgm</a>.</p><p>With all these capabilities, PostgreSQL is quite flexible, meaning it is essentially future-proof. As your application or workload changes, you will know that you can still adapt PostgreSQL to your needs.</p><p>(For one specific example of the powerful extensibility of PostgreSQL, please read how our engineering team built <a href="https://www.timescale.com/blog/function-pipelines-building-functional-programming-into-postgresql-using-custom-operators/" rel="noreferrer">functional programming into PostgreSQL using custom operators</a>.)</p><h3 id="postgresql-reliability">PostgreSQL reliability</h3><p>As developers, we’re resigned to the fact that programs crash, servers encounter hardware or power failures, and disks fail or experience corruption. You can mitigate this risk (e.g., robust software engineering practices, uninterruptible power supplies, disk RAID, etc.) but not eliminate it completely; it’s a fact of life for systems. </p><p>In response, databases are built with various mechanisms to further reduce such risk, including streaming replication to replicas, full-snapshot backup and recovery, streaming backups, robust data export tools, etc.</p><p>PostgreSQL has the benefit of 20+ years of development and usage, which has resulted in not just a reliable database but also a broad spectrum of rigorously tested tools: streaming replication for high availability and read-only replicas, pg_dump and pg_restore for full database snapshots, pg_basebackup and log shipping/streaming for incremental backups and arbitrary point-in-time recovery, pgBackRest or WAL-E for continuous archiving to cloud storage, and robust COPY FROM and COPY TO tools for quickly importing/exporting data with a variety of formats. 
</p><p>This enables PostgreSQL to offer a greater “peace of mind”—because all of the skeletons in the closet have already been found (and addressed).</p><h2 id="clickhouse-vs-timescaledb">ClickHouse vs. TimescaleDB</h2><p>TimescaleDB is the leading relational database for time series, built on PostgreSQL. It offers everything PostgreSQL has to offer, plus a full time-series database. </p><p>As a result, all of the advantages of PostgreSQL also apply to TimescaleDB, including versatility and reliability.</p><p>But TimescaleDB adds some critical capabilities that allow it to outperform Postgres for time-series data:</p><ul><li><strong>Hypertables: </strong>The foundation for many TimescaleDB features (listed below), hypertables provide <a href="https://www.timescale.com/learn/data-partitioning-what-it-is-and-why-it-matters" rel="noreferrer">automatically partitioned data</a> across time and space for more performant inserts and queries</li><li><strong>Continuous aggregates: </strong><a href="https://www.timescale.com/learn/postgresql-materialized-views-and-where-to-find-them">Intelligently updated materialized views for time-series data</a>. 
Rather than recreating the materialized view every time, TimescaleDB updates data based only on underlying changes to raw data.</li><li><strong>Columnar compression: </strong><a href="https://www.timescale.com/learn/what-is-data-compression-and-how-does-it-work">Efficient data compression of 90%+ on most time-series data</a> with dramatically improved query performance for historical, long+narrow queries.</li><li><strong>Hyperfunctions: </strong><a href="https://docs.timescale.com/api/latest/hyperfunctions/">Analytic-focused functions added to PostgreSQL to enhance time-series queries</a> with features like approximate percentiles, efficient downsampling, and two-step aggregation.</li><li><strong>Function pipelines (</strong><a href="https://www.timescale.com/blog/function-pipelines-building-functional-programming-into-postgresql-using-custom-operators/" rel="noreferrer"><strong>released this week!</strong></a><strong>): </strong>Radically improve the developer ergonomics of analyzing data in PostgreSQL and SQL by applying principles from functional programming and popular tools like Python’s Pandas and PromQL.</li></ul><h2 id="clickhouse-vs-timescaledb-performance-for-time-series-data">ClickHouse vs. TimescaleDB Performance for Time-Series Data</h2><p>Time-series data has exploded in popularity because the value of tracking and analyzing how things change over time has become evident in every industry: DevOps and IT monitoring, industrial manufacturing, financial trading and risk management, sensor data, ad tech, application eventing, smart home systems, autonomous vehicles, professional sports, and more. </p><p>It differs from more traditional business-type (OLTP) data in at least two primary ways: it is primarily insert-heavy, and the scale of the data grows at an unceasing rate. This impacts both data collection and storage, as well as how we analyze the values themselves. 
Traditional OLTP databases often can't handle millions of transactions per second or provide effective means of storing and maintaining the data.</p><p>Time-series data also differs from general analytical (OLAP) data in that queries generally have a time component, and queries rarely touch every row in the database.</p><p>Over the last few years, however, the lines between the capabilities of OLTP and OLAP databases have started to blur. For the last decade, the storage challenge was mitigated by numerous NoSQL architectures, which still failed to deal effectively with the querying and analytics that time-series data requires.</p><p>As a result, many applications try to find the right balance between the transactional capabilities of OLTP databases and the large-scale analytics provided by OLAP databases. It makes sense, therefore, that many applications would try to use ClickHouse, which offers fast ingest and analytical query capabilities for time-series data.</p><p>So, let's see how ClickHouse and TimescaleDB compare for time-series workloads using our standard <a href="https://github.com/timescale/tsbs">TSBS</a> benchmarks.</p><p><strong>Performance Benchmarks</strong></p><p>Let me start by saying that this wasn't a test we completed in a few hours and then moved on from. In fact, just yesterday, while finalizing this blog post, <strong>we installed the latest version of ClickHouse</strong> (released three days ago) and ran all of the tests again to ensure we had the best numbers possible! (benchmarking, not benchmarketing)</p><p>In preparation for the final set of tests, we ran benchmarks on both TimescaleDB and ClickHouse dozens of times each—<em>at least</em>. We tried different cardinalities, different lengths of time for the generated data, and various settings for things that we had easy control over—like "chunk_time_interval" with TimescaleDB. 
We wanted to really understand how each database would perform with typical cloud hardware and the specs that we often see in the wild.</p><p>We also acknowledge that most real-world applications don't work like the benchmark does: ingesting data first and querying it second. Separating each operation allows us to understand which settings impacted each database during different phases, allowing us to tweak benchmark settings for each database along the way to get the best performance.</p><p>Finally, we always view these benchmarking tests as an academic and self-reflective experience. That is, spending a few hundred hours working with both databases often causes us to consider ways we might improve TimescaleDB (in particular), and thoughtfully consider when we can—and should—say that another database solution is a good option for specific workloads.</p><h3 id="%E2%80%8B%E2%80%8Bmachine-configuration">​​Machine Configuration</h3><p>For this benchmark, we consciously decided to use cloud-based hardware configurations that were reasonable for a medium-sized workload typical of startups and growing businesses. In previous benchmarks, we've used bigger machines with specialized RAID storage, <strong>a very typical</strong> <strong>setup</strong> for a production database environment.</p><p>But, as time has marched on and we see more developers use Kubernetes and modular infrastructure setups without lots of specialized storage and memory optimizations, it felt more genuine to benchmark each database on instances that more closely matched what we tend to see in the wild. 
Sure, we can always throw more hardware and resources at a benchmark to inflate the numbers, but that often doesn't help convey what most real-world applications can expect.</p><p>To that end, for comparing both insert and read latency performance, we used the following setup in AWS:</p><ul><li><strong>Versions</strong>: TimescaleDB <a href="https://github.com/timescale/timescaledb/releases/tag/2.4.0">version 2.4.0</a>, community edition, with PostgreSQL 13; ClickHouse <a href="https://github.com/ClickHouse/ClickHouse/releases/tag/v21.6.5.37-stable">version 21.6.5</a> (the latest non-beta releases for both databases at the time of testing).</li><li>1 remote client machine running TSBS, 1 database server, both in the same cloud datacenter</li><li><strong>Instance size</strong>: Both client and database server ran on Amazon EC2 virtual machines (m5.4xlarge) with 16 vCPU and 64GB Memory each.</li><li><strong>OS</strong>: Both server and client machines ran Ubuntu 20.04.3</li><li><strong>Disk Size</strong>: 1TB of EBS GP2 storage</li><li><strong>Deployment method</strong>: Installed via apt-get using official sources</li></ul><h3 id="database-configuration">Database configuration</h3><p><strong>ClickHouse:</strong> No configuration changes were made to ClickHouse. We simply installed it per its documentation. There is not currently a tool like <code>timescaledb-tune</code> for ClickHouse.</p>
<p><strong>TimescaleDB:</strong> For TimescaleDB, we followed the recommendations in the Timescale documentation. Specifically, we ran <a href="https://github.com/timescale/timescaledb-tune"><code>timescaledb-tune</code></a> and accepted the configuration suggestions, which are based on the specifications of the EC2 instance. We also set <code>synchronous_commit=off</code> in <code>postgresql.conf</code>. This is a common performance setting for write-heavy workloads: a server crash can lose the most recently acknowledged commits, but the database itself remains transactionally consistent.</p>
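<p>As a rough sketch, the <code>synchronous_commit</code> setting described above can also be applied from a SQL session instead of editing <code>postgresql.conf</code> by hand. This assumes a superuser connection and is not necessarily how the benchmark machines were configured:</p>
<pre><code class="language-sql">-- Assumption: run as a superuser. ALTER SYSTEM writes the value to
-- postgresql.auto.conf, which takes precedence over postgresql.conf.
ALTER SYSTEM SET synchronous_commit = off;

-- synchronous_commit does not need a server restart; a reload is enough.
SELECT pg_reload_conf();

-- Check the active value (open a new session if the old value still shows).
SHOW synchronous_commit;
</code></pre>
<p>Because <code>synchronous_commit</code> can also be changed per session or per transaction with <code>SET</code>, an application could keep the default for critical writes and relax it only for bulk ingest.</p>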
<h3 id="insert-performance-1">Insert performance</h3><p>For insert performance, we used the following datasets and configurations. The datasets were created using <a href="https://github.com/timescale/tsbs">Time-Series Benchmarking Suite</a> with the <code>cpu-only</code> use case.</p>
<ul><li><strong>Dataset</strong>: 100-1,000,000 simulated devices generated 10 CPU metrics every 10 seconds for ~100 million reading intervals.</li><li><strong>Intervals</strong> used for each configuration are as follows: 31 days for 100 devices; 3 days for 4,000 devices; 3 hours for 100,000 devices; 30 minutes for 1,000,000</li><li><strong>Batch size</strong>: Inserts were made using a batch size of 5,000 which was used for both ClickHouse and TimescaleDB. We tried multiple batch sizes and found that in most cases there was little difference in overall insert efficiency between 5,000 and 15,000 rows per batch with each database.</li><li><strong>TimescaleDB chunk size</strong>: We set the chunk time depending on the data volume, aiming for 7-16 chunks in total for each configuration (<a href="https://docs.timescale.com/latest/using-timescaledb/hypertables?utm_source=timescale-influx-benchmark&amp;utm_medium=blog&amp;utm_campaign=july-2020-advocacy&amp;utm_content=chunks-docs#best-practices">more on chunks here</a>).<br></li></ul><p>In the end, these were the performance numbers for ingesting pre-generated time-series data from the TSBS client machine into each database using a batch size of 5,000 rows.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/09/07-clickhouse-improvement.png" class="kg-image" alt="Table showing the final insert results between ClickHouse and TimescaleDB when using larger 5,000 rows/batch" loading="lazy" width="2000" height="682" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/09/07-clickhouse-improvement.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/09/07-clickhouse-improvement.png 1000w, 
https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2023/09/07-clickhouse-improvement.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/09/07-clickhouse-improvement.png 2064w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Insert performance comparison between ClickHouse and TimescaleDB with 5,000 rows/batch</span></figcaption></figure><p>To be honest, <strong>this didn't surprise us</strong>. We've seen numerous recent blog posts about ClickHouse ingest performance, and since ClickHouse uses a different storage architecture and mechanism that doesn't include transaction support or ACID compliance, we generally expected it to be faster.</p><p>The story does change a bit, however, when you consider that ClickHouse is designed to save every "transaction" of ingested rows as separate files (to be merged later using the MergeTree architecture). It turns out that when you have much smaller batches of data to ingest, ClickHouse is significantly slower and consumes much more disk space than TimescaleDB.</p><p><em>(Ingesting 100 million rows, 4,000 hosts, 3 days of data - 22GB of raw data)</em></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/09/08-chunk-time-interval.png" class="kg-image" alt="Table showing the impact that using smaller batch sizes has on TimescaleDB and ClickHouse. 
TimescaleDB insert performance and disk usage stay steady, while ClickHouse performance is negatively impacted" loading="lazy" width="2000" height="802" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/09/08-chunk-time-interval.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/09/08-chunk-time-interval.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2023/09/08-chunk-time-interval.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/09/08-chunk-time-interval.png 2064w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Insert performance comparison between ClickHouse and TimescaleDB using smaller batch sizes, which significantly impacts ClickHouse's performance and disk usage</span></figcaption></figure><p>Do you notice something in the numbers above?</p><p>Regardless of batch size, TimescaleDB consistently consumed ~19GB of disk space with each data ingest benchmark before compression. This is a result of the <code>chunk_time_interval</code> setting, which determines how many chunks will get created for a given range of time-series data. Although ingest speeds may decrease with smaller batches, the same chunks are created for the same data, resulting in consistent disk usage patterns.</p>
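<p>To make the role of <code>chunk_time_interval</code> concrete, here is a minimal sketch. The table definition is a simplified, hypothetical stand-in for the TSBS <code>cpu</code> table, and the 8-hour interval is only an illustration of how 3 days of data could land in roughly 9 chunks; it is not the exact value used in the benchmark:</p>
<pre><code class="language-sql">-- Hypothetical, simplified stand-in for the TSBS cpu table
-- (the real table carries more metric columns).
CREATE TABLE cpu (
    time        TIMESTAMPTZ NOT NULL,
    tags_id     INTEGER,
    usage_user  DOUBLE PRECISION
);

-- Illustrative interval: 3 days of data at 8 hours per chunk gives roughly 9 chunks.
SELECT create_hypertable('cpu', 'time', chunk_time_interval =&gt; INTERVAL '8 hours');

-- After loading data: how many chunks exist, and how much disk they use in total.
SELECT count(*) AS chunk_count FROM show_chunks('cpu');
SELECT pg_size_pretty(hypertable_size('cpu')) AS total_size;
</code></pre>
<p>Since chunk boundaries depend only on the time range of the data, running the two final queries after ingesting with different batch sizes should report the same chunk count and a very similar total size.</p>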
<p>By comparison, ClickHouse storage needs are correlated to how many files need to be written (which is partially dictated by the size of the row batches being saved). It can actually take significantly more storage to save data to ClickHouse before it can be merged into larger files. Even at 500-row batches, ClickHouse consumed 1.75x more disk space than TimescaleDB for a source data file that was 22GB in size.</p><h3 id="read-latency">Read latency</h3><p>For benchmarking read latency, we used the following setup for each database (the machine configuration is the same as the one used in the Insert comparison):</p><ul>
<li><strong>Dataset</strong>: 4,000/10,000 simulated devices generated 10 CPU metrics every 10 seconds for 3 full days (100M+ reading intervals, 1B+ metrics)</li>
<li>We also enabled <a href="https://www.timescale.com/blog/building-columnar-compression-in-a-row-oriented-database/?utm_source=timescale-influx-benchmark&amp;utm_medium=blog&amp;utm_campaign=july-2020-advocacy&amp;utm_content=compression-blog">native compression</a> on TimescaleDB. We compressed everything but the most recent chunk of data. This is a commonly recommended configuration in which raw, uncompressed data is kept for recent time periods and older data is compressed, enabling greater query efficiency (see our <a href="https://docs.timescale.com/latest/using-timescaledb/compression?utm_source=timescale-influx-benchmark&amp;utm_medium=blog&amp;utm_campaign=july-2020-advocacy&amp;utm_content=compression-docs#quick-start">compression docs</a> for more). The parameters we used to enable compression are as follows: we segmented by the <code>tags_id</code> column and ordered by <code>time</code> descending, then by <code>usage_user</code>.</li>
</ul>
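<p>The compression parameters listed above could be expressed roughly as follows. This is a sketch that assumes an existing <code>cpu</code> hypertable with <code>time</code>, <code>tags_id</code>, and <code>usage_user</code> columns; the cutoff timestamp is a hypothetical value chosen so that only the most recent chunk stays uncompressed:</p>
<pre><code class="language-sql">-- Enable native compression with the segment/order settings described above.
ALTER TABLE cpu SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'tags_id',
    timescaledb.compress_orderby   = 'time DESC, usage_user'
);

-- Compress every chunk that ends before the cutoff.
-- The timestamp is a hypothetical boundary: pick the start of your newest chunk.
SELECT compress_chunk(c)
FROM show_chunks('cpu', older_than =&gt; TIMESTAMPTZ '2021-01-03 16:00:00+00') AS c;
</code></pre>
<p>In a long-running system, a compression policy would normally replace the manual <code>compress_chunk</code> call; the one-off form is shown here because the benchmark data is loaded once and then queried.</p>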
<p>On read (i.e., query) latency, the results are more complex. Unlike inserts, which primarily vary on cardinality size (and perhaps batch size), the universe of possible queries is essentially infinite, especially with a language as powerful as SQL. Often, the best way to benchmark read latency is to do it with the actual queries you plan to execute. For this case, we use a broad set of queries to mimic the most common query patterns.</p><p>The results shown below are the median from 1000 queries for each query type. Latencies in this chart are all shown as milliseconds, with an additional column showing the relative performance of TimescaleDB compared to ClickHouse (highlighted in green when TimescaleDB is faster, in blue when ClickHouse is faster).</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/09/09-clickhouse-benchmark-read-latency-performance.png" class="kg-image" alt="Table showing query response results when querying 4,000 hosts and 100 million rows of data. TimescaleDB outperforms in almost all query categories." 
loading="lazy" width="2000" height="1849" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/09/09-clickhouse-benchmark-read-latency-performance.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/09/09-clickhouse-benchmark-read-latency-performance.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2023/09/09-clickhouse-benchmark-read-latency-performance.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w2400/2023/09/09-clickhouse-benchmark-read-latency-performance.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Results of benchmarking query performance of 4,000 hosts with 100 million rows of data</span></figcaption></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/09/10-clickhouse-benchmark-read-latency-performance.png" class="kg-image" alt="Table showing query response results when querying 10,000 hosts and 100 million rows of data. TimescaleDB outperforms in almost all query categories." 
loading="lazy" width="2000" height="1815" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/09/10-clickhouse-benchmark-read-latency-performance.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/09/10-clickhouse-benchmark-read-latency-performance.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2023/09/10-clickhouse-benchmark-read-latency-performance.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w2400/2023/09/10-clickhouse-benchmark-read-latency-performance.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Results of benchmarking query performance of 10,000 hosts with 100 million rows of data</span></figcaption></figure><h3 id="simple-rollups">Simple rollups</h3><p>For simple rollups (i.e., single-groupby), when aggregating one metric across a single host for 1 or 12 hours, or multiple metrics across one or multiple hosts (either for 1 hour or 12 hours), TimescaleDB generally outperforms ClickHouse at both low and high cardinality. In particular, TimescaleDB exhibited up to 1058 % of the performance of ClickHouse on configurations with 4,000 and 10,000 devices, with 10 unique metrics being generated every read interval.</p><h3 id="aggregates">Aggregates</h3><p>When calculating a simple aggregate for 1 device, TimescaleDB consistently outperforms ClickHouse across any number of devices. In our benchmark, TimescaleDB demonstrates 156 % of the performance of ClickHouse when aggregating 8 metrics across 4,000 devices and 164 % when aggregating 8 metrics across 10,000 devices. 
Once again, TimescaleDB outperforms ClickHouse for high-end scenarios.</p><h3 id="double-rollups">Double rollups</h3><p>The one set of queries in which ClickHouse consistently bested TimescaleDB on query latency was the double rollup queries, which aggregate metrics by time and another dimension (e.g., GROUP BY time, deviceId). We'll go into a bit more detail below on why this might be, but this also wasn't completely unexpected.</p><h3 id="thresholds">Thresholds</h3><p>When selecting rows based on a threshold, TimescaleDB demonstrates 249-357&nbsp;% of the performance of ClickHouse when computing thresholds for a single device but only 130-58% of the performance of ClickHouse when computing thresholds for all devices for a random time window.</p><h3 id="complex-queries">Complex queries</h3><p>For complex queries that go beyond rollups or thresholds, the comparison is a bit more nuanced, particularly when looking at TimescaleDB. The difference is that TimescaleDB gives you control over which chunks are compressed. In most time-series applications, especially things like IoT, there's a constant need to find the most recent value of an item or a list of the top X things by some aggregation. This is what the <code>lastpoint</code> and <code>groupby-orderby-limit</code> queries benchmark.</p>
<p>As we've shown previously with other databases (<a href="https://www.timescale.com/blog/timescaledb-vs-influxdb-for-time-series-data-timescale-influx-sql-nosql-36489299877/">InfluxDB</a> and <a href="https://www.timescale.com/blog/how-to-store-time-series-data-mongodb-vs-timescaledb-postgresql-a73939734016/">MongoDB</a>), and as the ClickHouse documentation itself notes, getting individual ordered values for items is not a good use case for a MergeTree-like/OLAP database, generally because there is no ordered index that you can define for a time, key, and value. This means asking for the most recent value of an item still causes a more intensive scan of data in OLAP databases.</p>
<p>We see that expressed in our results. TimescaleDB was around 3486% faster than ClickHouse when searching for the most recent values (<code>lastpoint</code>) for each item in the database. This is because the most recent uncompressed chunk will often hold the majority of those values as data is ingested. It is a great example of why this flexibility with compression can have a significant impact on the performance of your application.</p>
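<p>For readers unfamiliar with the pattern, a <code>lastpoint</code>-style query can be sketched as below. This is an illustration rather than the exact TSBS query text, and it assumes a <code>tags</code> table with <code>id</code> and <code>hostname</code> columns alongside the <code>cpu</code> hypertable:</p>
<pre><code class="language-sql">-- For each host, fetch its single most recent reading.
-- Assumed schema: tags(id, hostname) and cpu(time, tags_id, usage_user).
SELECT t.hostname, latest.time, latest.usage_user
FROM tags AS t
CROSS JOIN LATERAL (
    SELECT c.time, c.usage_user
    FROM cpu AS c
    WHERE c.tags_id = t.id
    ORDER BY c.time DESC
    LIMIT 1
) AS latest;
</code></pre>
<p>With an index on <code>(tags_id, time DESC)</code>, each lateral lookup can stop after reading one index entry when the newest rows sit in an uncompressed chunk, which is the behavior the results above reflect.</p>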
<p>We fully admit, however, that compression doesn't always return favorable results for every query form. In the last complex query, <code>groupby-orderby-limit</code>, ClickHouse bests TimescaleDB by a significant amount, almost 15x faster. What our results didn't show is that queries that read from an uncompressed chunk (the most recent chunk) are 17x faster than ClickHouse, averaging 64ms per query. The query looks like this in TimescaleDB:</p>
<pre><code class="language-sql">SELECT time_bucket('60 seconds', time) AS minute, max(usage_user)
FROM cpu
WHERE time &lt; '2021-01-03 15:17:45.311177 +0000'
GROUP BY minute
ORDER BY minute DESC
LIMIT 5;
</code></pre>
<p>As you might guess, when the chunk is uncompressed, PostgreSQL indexes can be used to order the data by time quickly. When the chunk is compressed, the data matching the predicate (<code>WHERE time &lt; '2021-01-03 15:17:45.311177 +0000'</code> in the example above) must first be decompressed before it is ordered and searched.</p><p>When the data for a <code>lastpoint</code> query falls within an uncompressed chunk (which is often the case with near-term queries that have a predicate like <code>WHERE time &gt; now() - INTERVAL '6 hours'</code>), the results are startling.</p><p><em>(uncompressed chunk query, 4k hosts)</em></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/10/11-clickhouse-benchmark-uncompressed-chunk--1-.jpg" class="kg-image" alt="Table showing the positive impact querying uncompressed data in TimescaleDB can have, specifically the lastpoint and groupby-orderby-limit queries." 
loading="lazy" width="2000" height="307" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2021/10/11-clickhouse-benchmark-uncompressed-chunk--1-.jpg 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2021/10/11-clickhouse-benchmark-uncompressed-chunk--1-.jpg 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2021/10/11-clickhouse-benchmark-uncompressed-chunk--1-.jpg 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/10/11-clickhouse-benchmark-uncompressed-chunk--1-.jpg 2000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Query latency performance when </span><code spellcheck="false" style="white-space: pre-wrap;"><span>lastpoint</span></code><span style="white-space: pre-wrap;"> and </span><code spellcheck="false" style="white-space: pre-wrap;"><span>groupby-orderby-limit</span></code><span style="white-space: pre-wrap;"> queries use an uncompressed chunk in TimescaleDB</span></figcaption></figure><p>One of the key takeaways from this last set of queries is that the features provided by a database can have a material impact on the performance of your application. Sometimes, it just works, while other times, having the ability to fine-tune how data is stored can be a game-changer.</p><h3 id="read-latency-performance-summary">Read latency performance summary</h3><ul><li>For simple queries, TimescaleDB outperforms ClickHouse, regardless of whether native compression is used.</li><li>For typical aggregates, even across many values and items, TimescaleDB outperforms ClickHouse.</li><li>Doing more complex double rollups, ClickHouse outperforms TimescaleDB every time. To some extent, we were surprised by the gap and will continue to understand how we can better accommodate queries like this on raw time-series data. 
One solution to this disparity in a real application would be to use a continuous aggregate to pre-aggregate the data.</li><li>When selecting rows based on a threshold, TimescaleDB outperforms ClickHouse and is up to 250% faster.</li><li>For some complex queries, particularly a standard query like "lastpoint", TimescaleDB vastly outperforms ClickHouse.</li><li>Finally, depending on the time range being queried, TimescaleDB can be significantly faster (up to 1760%) than ClickHouse for grouped and ordered queries. When these kinds of queries reach further back into compressed chunks, ClickHouse outperforms TimescaleDB because more data must be decompressed to find the appropriate max() values to order by.</li></ul><h2 id="conclusion">Conclusion</h2><p>You made it to the end! Thank you for taking the time to read our detailed report.</p><p>Understanding ClickHouse and then comparing it with PostgreSQL and TimescaleDB made us appreciate that there is a lot of choice in today’s database market—but often, there is still only one <em>right tool for the job</em>. </p><p>Before deciding which to use for your application, we recommend taking a step back and analyzing your stack, your team's skills, and your needs, now and in the future. Choosing the best technology for your situation now can make all the difference down the road. You want to pick an architecture that evolves and grows with you, not one that forces you to start all over when the data starts flowing from production applications.</p><p>We’re always interested in feedback, and we’ll continue to share our insights with the greater community.</p><h3 id="want-to-learn-more-about-timescaledb">Want to learn more about TimescaleDB?<br></h3><p><a href="https://console.cloud.timescale.com/signup"><strong>Create a free account to get started</strong></a> with a fully managed TimescaleDB instance (100&nbsp;% free for 30 days).</p><p>Want to host TimescaleDB yourself? 
<a href="https://github.com/timescale/timescaledb">Visit our GitHub</a> to learn more about options, get installation instructions, and more (and, as always, ⭐️  are  appreciated!)</p><p><a href="https://slack.timescale.com/">Join our Slack community</a> to ask questions, get advice, and connect with other developers (the authors of this post, as well as our co-founders, engineers, and passionate community members, are active on all channels).</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Function Pipelines: Building Functional Programming Into PostgreSQL Using Custom Operators]]></title>
            <description><![CDATA[Function pipelines allow you to analyze data by composing multiple functions in SQL and express complex logic in PostgreSQL in a cleaner way.]]></description>
            <link>https://www.tigerdata.com/blog/function-pipelines-building-functional-programming-into-postgresql-using-custom-operators</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/function-pipelines-building-functional-programming-into-postgresql-using-custom-operators</guid>
            <category><![CDATA[Announcements & Releases]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Engineering]]></category>
            <dc:creator><![CDATA[David Kohn]]></dc:creator>
            <pubDate>Tue, 19 Oct 2021 13:33:44 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/06/Function-pipelines--1-.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2024/06/Function-pipelines--1-.png" alt="Several pipes in neon colors. Function Pipelines: Building Functional Programming Into PostgreSQL Using Custom Operators" /><p><strong>Today, we are announcing <em>function pipelines</em>, a new capability that introduces functional programming concepts inside PostgreSQL (and SQL) using custom operators.</strong></p><p><em>Function pipelines</em> radically improve the developer ergonomics of analyzing data in PostgreSQL and SQL, by applying principles from <a href="https://en.wikipedia.org/wiki/Functional_programming">functional programming</a> and popular tools like <a href="https://pandas.pydata.org/docs/index.html">Python’s Pandas</a> and <a href="https://prometheus.io/docs/prometheus/latest/querying/basics/">PromQL</a>.</p><p>At Timescale, our mission is to serve developers worldwide and enable them to build exceptional data-driven products that measure everything that matters: software applications, industrial equipment, financial markets, blockchain activity, user actions, consumer behavior, machine learning models, climate change, and more.</p><p><a href="https://timescale.ghost.io/blog/blog/why-sql-beating-nosql-what-this-means-for-future-of-data-time-series-database-348b777b847a/">We believe SQL is the best language for data analysis</a>. We’ve championed its benefits for several years, even when many were abandoning it for custom domain-specific languages. And we were right—SQL has resurged and become the universal language for data analysis, and now many NoSQL databases are adding SQL interfaces to keep up.</p><p>But SQL is not perfect and, at times, can get quite unwieldy. For example,</p><pre><code class="language-SQL">SELECT device_id, 
	sum(abs_delta) as volatility
FROM (
	SELECT device_id, 
		abs(val - lag(val) OVER (PARTITION BY device_id ORDER BY ts))
        	as abs_delta 
	FROM measurements
	WHERE ts &gt;= now() - '1 day'::interval) calc_delta
GROUP BY device_id; </code></pre><p><em>Pop quiz: What does this query do?</em> </p><p>Even if you are a SQL expert, queries like this can be quite difficult to read—and even harder to express. Complex data analysis in SQL can be hard.</p><h2 id="the-advantages-of-function-pipelines">The Advantages of Function Pipelines</h2><p><em>Function pipelines</em> let you express that same query like this:</p><pre><code class="language-SQL">SELECT device_id, 
	timevector(ts, val) -&gt; sort() -&gt; delta() -&gt; abs() -&gt; sum() 
    		as volatility
FROM measurements
WHERE ts &gt;= now() - '1 day'::interval
GROUP BY device_id;</code></pre><p>Now it is much clearer what this query is doing:</p><ul><li>It gets the last day’s data from the measurements table, grouped by <code>device_id</code>.</li><li>It sorts the data by the time column.</li><li>It calculates the delta (or change) between values.</li><li>It takes the absolute value of the delta.</li><li>Finally, it takes the sum of the results of the previous steps.</li></ul><p><strong>Function pipelines improve your own coding productivity, while also making your SQL code easier for others to comprehend and maintain.</strong></p><p>Inspired by functional programming languages, function pipelines enable you to analyze data by composing multiple functions, leading to a simpler, cleaner way of expressing complex logic in PostgreSQL.</p><p>The best part is that we built function pipelines in a way that is fully PostgreSQL compliant—we did not change any SQL syntax—meaning that any tool that speaks PostgreSQL will be able to support data analysis using function pipelines.</p><h2 id="how-we-built-function-pipelines">How We Built Function Pipelines</h2><p>How did we build this? By taking advantage of PostgreSQL's incredible extensibility, particularly <a href="https://www.postgresql.org/docs/current/sql-createtype.html">custom types</a>, <a href="https://www.postgresql.org/docs/current/sql-createoperator.html">custom operators</a>, and <a href="https://www.postgresql.org/docs/current/sql-createfunction.html">functions</a>.</p><p>In our previous example, you can see the key elements of function pipelines:</p><ul><li><strong>Custom data types</strong>: in this case, the <code>timevector</code>, which is a set of <code>(time, value)</code> pairs.</li><li><strong>Custom operator</strong>: <code>-&gt;</code>, used to <em>compose</em> and <em>apply </em>function pipeline elements to the data that comes in.</li><li>And finally, <strong>custom functions,</strong> called <em>pipeline elements. 
</em>Pipeline elements can transform and analyze <code>timevector</code>s (or other data types) in a function pipeline. For this initial release, we’ve built <em>60 custom functions! </em><strong>(</strong><a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/hyperfunctions/function-pipelines/"><strong>Full list here</strong></a><strong>)</strong>.</li></ul><p>We’ll go into more detail on function pipelines in the rest of this post, but if you just want to get started as soon as possible, <strong>the easiest way to try function pipelines is through a fully managed Timescale Cloud service</strong>. <a href="https://console.cloud.timescale.com/signup">Try it for free</a> (no credit card required) for 30 days. </p><p>Function pipelines are pre-loaded on each new database service on Timescale Cloud, available immediately—so after you’ve created a new service, you’re all set to use them!</p><p>If you prefer to manage your own database instances, <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/hyperfunctions/install-toolkit/">you can install the <code>timescaledb_toolkit</code></a> into your existing PostgreSQL installation, completely for free. </p><p>We’ve been working on this capability for a long time, but in line with our belief of “<a href="https://timescale.ghost.io/blog/blog/move-fast-but-dont-break-things-introducing-the-experimental-schema-with-new-experimental-features-in-timescaledb-2-4/">move fast but don’t break things</a>,” we’re initially releasing function pipelines as an <a href="https://docs.timescale.com/api/latest/api-tag-overview/#experimental-timescaledb-toolkit">experimental feature</a>—and we would absolutely love to <strong>get your feedback</strong>. 
You can <a href="https://github.com/timescale/timescaledb-toolkit/issues">open an issue</a> or join a <a href="https://github.com/timescale/timescaledb-toolkit/discussions">discussion thread</a> on GitHub (and, if you like what you see, a GitHub ⭐ is always welcome and appreciated, too!).</p><p><em>We’d also like to take this opportunity to give a huge shoutout to </em><a href="https://github.com/zombodb/pgx"><em><code>pgx</code>, the Rust-based framework for building PostgreSQL extensions</em></a><em>, which handles a lot of the heavy lifting for this project. We have over 600 custom types, operators, and functions in the <code>timescaledb_toolkit</code> extension at this point; managing this without <code>pgx</code> (and the ease of use that comes from working with Rust) would be a real bear of a job.</em></p><h2 id="function-pipelines-why-are-they-useful">Function Pipelines: Why Are They Useful?</h2><p>It’s October. In the northern hemisphere (where most of Team Timescale, including your author, lives), it is starting to get cold.</p><p>Now imagine a restaurant in New York City whose owners care about their customers and their customers’ comfort. And you are working on an IoT product designed to help small businesses like this one minimize their heating bill while maximizing their customers’ happiness. So you install two thermometers, one at the front measuring the temperature right by the door, and another at the back of the restaurant.</p><p>Now, as many of you may know (if you’ve ever had to sit by the door of a restaurant in the fall or winter), when someone enters, the temperature drops, and once the door is closed, the temperature warms back up. The temperature at the back of the restaurant will vary much less than at the front, right by the door. Both of them will drop slowly down to a lower setpoint during non-business hours and warm back up sometime before business hours based on the setpoints on our thermostat. 
So overall we’ll end up with a graph that looks something like this:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/10/graph--1-.jpg" class="kg-image" alt="A graph with time on the x axis and temperature on the y axis showing two curves. First a curve labeled back which starts low steadily rises stays relatively constant in the section labeled operating hours and then drops slowly back down. Second a curve labeled front which starts following the other, starts low, rises then it starts getting jumpy, drastically varying while the other stays constant during operating hours, then it falls back down. " loading="lazy" width="1600" height="986" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2021/10/graph--1-.jpg 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2021/10/graph--1-.jpg 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/10/graph--1-.jpg 1600w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">A graph of the temperature at the front (near the door) and back. The back is much steadier, while the front is more volatile. Graph is for illustrative purposes only, data is fabricated. No restaurants or restaurant patrons were harmed in the making of this post.</span></figcaption></figure><p>As we can see, the temperature by the front door varies much more than at the back of the restaurant. Another way to say this is the temperature by the front door is more <em>volatile</em>. Now, the owners of this restaurant want to measure this because frequent temperature changes means uncomfortable customers.</p><p>In order to measure volatility, we could first subtract each point from the point before to calculate a delta. 
If we add this up directly, large positive and negative deltas will cancel out. But, we only care about the magnitude of the delta, not its sign—so what we really should do is take the absolute value of the delta, and then take the total sum of the previous steps.</p><p>We now have a metric that might help us measure customer comfort, and also the efficacy of different weatherproofing methods (for example, adding one of those little vestibules that acts as a windbreak). </p><p>To track this, we collect measurements from our thermometers and store them in a table:</p><pre><code class="language-SQL">CREATE TABLE measurements(
	device_id BIGINT,
	ts TIMESTAMPTZ,
	val DOUBLE PRECISION
);
</code></pre><p>The <code>device_id</code> identifies the thermometer, <code>ts</code> is the time of the reading, and <code>val</code> is the temperature.</p><p>Using the data in our <code>measurements</code> table, let’s look at how we calculate volatility using function pipelines.</p><p><em>Note: because all of the function pipeline features are still experimental, they exist in the <code>toolkit_experimental</code> schema. Before running any of the SQL code in this post, you will need to set your <code>search_path</code> to include the experimental schema, as we do in the example below. We won’t repeat this throughout the post so as not to distract.</em></p><pre><code class="language-SQL">set search_path to toolkit_experimental, public; --still experimental, so do this to make it easier to read

SELECT device_id, 
	timevector(ts, val) -&gt; sort() -&gt; delta() -&gt; abs() -&gt; sum() 
    	as volatility
FROM measurements
WHERE ts &gt;= now()-'1 day'::interval
GROUP BY device_id;</code></pre><p>And now we have the same query that we used as our example in the introduction.</p><p>In this query, the function pipeline <code>timevector(ts, val) -&gt; sort() -&gt; delta() -&gt; abs() -&gt; sum()</code> succinctly expresses the following operations:</p><ol><li>Create <code>timevector</code>s (more detail on this later) out of the <code>ts</code> and <code>val</code> columns.</li><li>Sort each <code>timevector</code> by the time column.</li><li>Calculate the delta (or change) between each pair in the <code>timevector</code> by subtracting the previous <code>val</code> from the current.</li><li>Take the absolute value of the delta.</li><li>Take the sum of the results from the previous steps.</li></ol><p>The <code>FROM</code>, <code>WHERE</code> and <code>GROUP BY</code> clauses do the rest of the work telling us:</p><ol><li>We’re getting data <em>FROM</em> the <code>measurements</code> table</li><li><em>WHERE </em>the ts, or timestamp column, contains values over the last day</li><li>Showing one pipeline output per <code>device_id</code> (the GROUP BY column)</li></ol><p>As we noted before, if you were to do this same calculation using SQL and PostgreSQL functionality, your query would look like this:</p><pre><code class="language-SQL">SELECT device_id, 
sum(abs_delta) as volatility
FROM (
	SELECT 
		abs(val - lag(val) OVER (PARTITION BY device_id ORDER BY ts) ) 
        	as abs_delta 
	FROM measurements
	WHERE ts &gt;= now() - '1 day'::interval) calc_delta
GROUP BY device_id; 
</code></pre><p>This does the same five steps as the above but is much harder to understand. We have to use a <a href="https://www.postgresql.org/docs/current/functions-window.html">window function</a> and aggregate the results—but also, because aggregates are performed before window functions, we need to actually execute the window function in a subquery.</p><p>As we can see, function pipelines make it significantly easier to comprehend the overall analysis of our data. There’s no need to completely understand what’s going on in these functions just yet, but for now, it’s enough to understand that we’ve essentially implemented a small functional programming language inside of PostgreSQL. You can still use all of the normal, expressive SQL you’ve come to know and love. Function pipelines just add new tools to your SQL toolbox that make it easier to work with time-series data.</p><p>Some avid SQL users might find the syntax a bit foreign at first, but for many people who work in other programming languages, especially using tools like <a href="https://pandas.pydata.org/docs/index.html">Python’s Pandas Package</a>, this type of <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pipe.html">successive operation on data sets</a> will feel natural. 
</p><p>Again, this is still fully PostgreSQL-compliant: we introduce no changes to the parser or anything that should break compatibility with PostgreSQL drivers.</p><h2 id="building-function-pipelines-without-forking-postgresql">Building Function Pipelines Without Forking PostgreSQL</h2><p>We built function pipelines—without modifying the <a href="https://www.postgresql.org/docs/10/parser-stage.html">parser</a> or anything that would require a fork of PostgreSQL—by taking advantage of three of the many ways that PostgreSQL enables extensibility: <a href="https://www.postgresql.org/docs/current/sql-createtype.html">custom types</a>, <a href="https://www.postgresql.org/docs/current/sql-createfunction.html">custom functions</a>, and <a href="https://www.postgresql.org/docs/current/sql-createoperator.html">custom operators</a>.</p><ul><li><strong>Custom data types</strong>, starting with the <code>timevector</code>, which is a set of <code>(time, value)</code> pairs</li><li><strong>A custom operator</strong>: <code>-&gt;</code>, which is used to <em>compose</em> and <em>apply </em>function pipeline elements to the data that comes in.</li><li><strong>Custom functions</strong>, called <em>pipeline elements</em>, which can transform and analyze <code>timevector</code>s (or other data types) in a function pipeline (with 60 functions in this initial release)</li></ul><p>We believe that new idioms like these are exactly what PostgreSQL was <em>meant to enable</em>. That’s why it has supported custom types, functions and operators from its earliest days. (And is one of the many reasons why we love PostgreSQL.)</p><h2 id="a-custom-data-type-the-timevector">A Custom Data Type: The <code>timevector</code></h2><p>A <code>timevector</code> is a collection of <code>(time, value)</code> pairs. As of now, the times must be <code>TIMESTAMPTZ</code>s and the values must be <code>DOUBLE PRECISION</code> numbers. 
(But this may change in the future as we continue to develop this data type. If you have ideas/input, please <a href="https://github.com/timescale/timescaledb-toolkit/issues">file feature requests on GitHub</a> explaining what you’d like!)</p><p>You can think of the <code>timevector</code> as something like this: </p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/10/table1--1-.jpg" class="kg-image" alt=" A box with `timevector` on the outside, inside there is a table with one column labeled time containing multiple timestamps and another column labeled value containing floating point numbers." loading="lazy" width="950" height="880" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2021/10/table1--1-.jpg 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/10/table1--1-.jpg 950w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">A depiction of a </span><code spellcheck="false" style="white-space: pre-wrap;"><span>timevector</span></code><span style="white-space: pre-wrap;">.</span></figcaption></figure><p>One of the first questions you might ask is: how does a <code>timevector</code> relate to time-series data? (If you want to know more about time-series data, <a href="https://timescale.ghost.io/blog/time-series-data/" rel="noreferrer">we have a great blog post on that</a>,). </p><p>Let’s consider our example from above, where we were talking about a restaurant that was measuring temperatures, and we had a <code>measurements</code> table like so:</p><pre><code class="language-SQL">CREATE TABLE measurements(
	device_id BIGINT,
	ts TIMESTAMPTZ,
	val DOUBLE PRECISION
);
</code></pre><p>In this example, we can think of a single time-series dataset as all historical and future time and temperature measurements from a device. </p><p>Given this definition, we can think of a <code>timevector</code> as a <strong>finite subset of a time-series dataset</strong>. The larger time-series dataset may extend back into the past, and it may extend into the future, but the <code>timevector</code> is bounded.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/10/table2--1-.jpg" class="kg-image" alt="A box labeled time-series with a table in it. The table has one column labeled time and another labeled value like the diagram above. There is a timevector box inside the table with timestamps and values. Each column also has an arrow at the top pointing to the word past and another arrow at the bottom pointing to the word future conveying that the time-series extends into the past and future while the timevector contains a subset of the values. 
" loading="lazy" width="1200" height="1500" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2021/10/table2--1-.jpg 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2021/10/table2--1-.jpg 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/10/table2--1-.jpg 1200w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">A </span><code spellcheck="false" style="white-space: pre-wrap;"><span>timevector</span></code><span style="white-space: pre-wrap;"> is a finite subset of a time-series and contains all the </span><code spellcheck="false" style="white-space: pre-wrap;"><span>(time, value)</span></code><span style="white-space: pre-wrap;"> pairs in some region of the time-series.</span></figcaption></figure><p>In order to construct a <code>timevector</code> from the data gathered from a thermometer, we use a custom aggregate and pass in the columns we want to become our <code>(time, value)</code> pairs. We can use the <code>WHERE</code> clause to define the extent of the <code>timevector</code> (i.e., the limits of this subset), and the <code>GROUP BY</code> clause to provide identifying information about the <a href="https://www.tigerdata.com/blog/time-series-introduction" rel="noreferrer">time series</a> that are represented.</p><p>Building on our example, this is how we construct a <code>timevector</code> for each thermometer in our dataset:</p><pre><code class="language-SQL">SELECT device_id, 
	timevector(ts, val)
FROM measurements
WHERE ts &gt;= now() - '1 day'::interval
GROUP BY device_id;
</code></pre><p>But a <code>timevector</code> doesn't provide much value by itself. So now, let’s also consider some complex calculations that we can apply to the <code>timevector</code>, starting with a custom operator used to apply these functions.</p><h2 id="a-custom-operator">A Custom Operator: <code>-&gt;</code></h2><p>In function pipelines, the <code>-&gt;</code> operator is used to apply and compose multiple functions in a format that is easy to write and read.</p><p>Fundamentally, <code>-&gt;</code> means: “apply the operation on the right to the inputs on the left”, or, more simply, “do the next thing”.</p><p>We created a general-purpose operator for this because we think that too many operators meaning different things can get very confusing and difficult to read.</p><p>One thing that you’ll notice about the pipeline elements is that the arguments are in an unusual place in a statement like:</p><pre><code class="language-SQL">SELECT device_id, 
 	timevector(ts, val) -&gt; sort() -&gt; delta() -&gt; abs() -&gt; sum() 
    	as volatility
FROM measurements
WHERE ts &gt;= now() - '1 day'::interval
GROUP BY device_id;</code></pre><p>It <em>appears</em> (from the semantics) that the <code>timevector(ts, val)</code> is an argument to <code>sort()</code>, the resulting <code>timevector</code> is an argument to <code>delta()</code>, and so on.</p><p>The thing is that <code>sort()</code> (and the others) are regular function calls; they can’t see anything outside of their parentheses and don’t know about anything to their left in the statement, so we need a way to get the <code>timevector</code> into <code>sort()</code> (and the rest of the pipeline).</p><p>We solved this by taking advantage of one of the fundamental computing insights that functional programming languages use: <em>code and data are really the same thing</em>.</p><p>Each of our functions returns a special type that describes the function and its arguments. We call these types <em>pipeline elements</em> (more later).</p><p>The <code>-&gt;</code> operator then performs one of two different types of actions depending on the types on its right and left sides. It can either:</p><ol><li><em>Apply</em> a pipeline element to the left-hand argument: perform the function described by the pipeline element directly on the incoming data type.</li><li><em>Compose</em> pipeline elements into a combined element that can be applied at some point in the future (this is an optimization that allows us to apply multiple elements in a “nested” manner so that we don’t perform multiple unnecessary passes).</li></ol><p>The operator determines the action to perform based on its left and right arguments.</p><p>Let’s look at our pipeline from before: <code>timevector(ts, val) -&gt; sort() -&gt; delta() -&gt; abs() -&gt; sum()</code>. 
If you remember from before, I noted that this function pipeline performs the following steps: </p><ol><li>Create <code>timevector</code>s out of the <code>ts</code> and <code>val</code> columns.</li><li>Sort it by the time column.</li><li>Calculate the delta (or change) between each pair in the <code>timevector</code> by subtracting the previous <code>val</code> from the current.</li><li>Take the absolute value of the delta.</li><li>Take the sum of the results from the previous steps.</li></ol><p>And logically, at each step, we can think of the <code>timevector</code> being materialized and passed to the next step in the pipeline. </p><p>However, while this will produce a correct result, it’s not the most efficient way to compute this. Instead, it would be more efficient to compute as much as possible in a single pass over the data.</p><p>In order to do this, we allow not only the <em>apply</em> operation, but also the <em>compose</em> operation. Once we’ve composed a pipeline into a logically equivalent higher order pipeline with all of the elements we can choose the most efficient way to execute it internally. (Importantly, even if we have to perform each step sequentially, we don’t need to <em>materialize</em> it and pass it between each step in the pipeline so it has significantly less overhead even without other optimization).</p><h2 id="custom-functions-pipeline-elements">Custom Functions: Pipeline Elements</h2><p>Now let’s discuss the third, and final, key piece that makes up function pipelines: custom functions, or as we call them, <em>pipeline elements</em>.</p><p>We have implemented over 60 individual pipeline elements, which fall into 4 categories (with a few subcategories):</p><h3 id="timevector-transforms"><code>timevector</code> transforms</h3><p>These elements take in a <code>timevector</code> and produce a <code>timevector</code>. 
They are the easiest to compose, as they produce the same type.</p><p>Example pipeline:</p><pre><code class="language-SQL">SELECT device_id, 
	timevector(ts, val) 
    	-&gt; sort() 
        -&gt; delta() 
        -&gt; map($$ ($value^3 + $value^2 + $value * 2) $$) 
        -&gt; lttb(100) 
FROM measurements
GROUP BY device_id;</code></pre><p>Organized by sub-category:</p><h4 id="unary-mathematical">Unary mathematical</h4>
<p>Simple mathematical functions applied to the value in each point in a <code>timevector</code>. (<a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/hyperfunctions/function-pipelines/#unary-mathematical-functions">Docs link</a>)</p>
<table>
<thead>
<tr>
<th>Element</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>abs()</code></td>
<td>Computes the absolute value of each value</td>
</tr>
<tr>
<td><code>cbrt()</code></td>
<td>Computes the cube root of each value</td>
</tr>
<tr>
<td><code>ceil()</code></td>
<td>Computes the first integer greater than or equal to each value</td>
</tr>
<tr>
<td><code>floor()</code></td>
<td>Computes the first integer less than or equal to each value</td>
</tr>
<tr>
<td><code>ln()</code></td>
<td>Computes the natural logarithm of each value</td>
</tr>
<tr>
<td><code>log10()</code></td>
<td>Computes the base 10 logarithm of each value</td>
</tr>
<tr>
<td><code>round()</code></td>
<td>Computes the closest integer to each value</td>
</tr>
<tr>
<td><code>sign()</code></td>
<td>Computes +/-1 for each positive/negative value</td>
</tr>
<tr>
<td><code>sqrt()</code></td>
<td>Computes the square root for each value</td>
</tr>
<tr>
<td><code>trunc()</code></td>
<td>Computes only the integer portion of each value</td>
</tr>
</tbody>
</table>
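<p>As a rough sketch of how a unary element slots into a pipeline, the query below uses <code>sign()</code> to turn each delta into a direction and then sums the directions, giving the number of rises minus the number of falls per device. It assumes the <code>measurements</code> table and the <code>search_path</code> setting from earlier in this post, and that <code>sign()</code> behaves as described in the table above; the <code>net_direction</code> alias is only an illustrative name.</p>
<pre><code class="language-SQL">-- Sketch only: net count of rising minus falling readings per device.
-- Assumes toolkit_experimental is on the search_path (see the earlier note).
SELECT device_id,
    timevector(ts, val)
        -&gt; sort()   -- order points by time
        -&gt; delta()  -- change from the previous point
        -&gt; sign()   -- +1 for a rise, -1 for a fall
        -&gt; sum()    -- rises minus falls
        AS net_direction
FROM measurements
WHERE ts &gt;= now() - '1 day'::interval
GROUP BY device_id;</code></pre>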
<h4 id="binary-mathematical">Binary mathematical</h4>
<p>Simple mathematical functions with a scalar input applied to the value in each point in a <code>timevector</code>. (<a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/hyperfunctions/function-pipelines/#binary-mathematical-functions">Docs link</a>)</p>
<table>
<thead>
<tr>
<th>Element</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>add(N)</code></td>
<td>Computes each value plus <code>N</code></td>
</tr>
<tr>
<td><code>div(N)</code></td>
<td>Computes each value divided by <code>N</code></td>
</tr>
<tr>
<td><code>logn(N)</code></td>
<td>Computes the logarithm base <code>N</code> of each value</td>
</tr>
<tr>
<td><code>mod(N)</code></td>
<td>Computes the remainder when each number is divided by <code>N</code></td>
</tr>
<tr>
<td><code>mul(N)</code></td>
<td>Computes each value multiplied by <code>N</code></td>
</tr>
<tr>
<td><code>power(N)</code></td>
<td>Computes each value taken to the <code>N</code> power</td>
</tr>
<tr>
<td><code>sub(N)</code></td>
<td>Computes each value less <code>N</code></td>
</tr>
</tbody>
</table>
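<p>For instance, binary elements can express a unit conversion. The sketch below assumes, purely for illustration, that <code>val</code> holds degrees Celsius and reports each device's average in Fahrenheit. The second column is the plain SQL expression that the pipeline should match if the elements behave as described in the table above; both column aliases are hypothetical.</p>
<pre><code class="language-SQL">-- Sketch only: the Celsius assumption and the column aliases are illustrative.
-- Assumes toolkit_experimental is on the search_path.
SELECT device_id,
    timevector(ts, val)
        -&gt; mul(1.8)    -- scale each value
        -&gt; add(32.0)   -- then shift it
        -&gt; average()   -- aggregate the converted values
        AS avg_f_pipeline,
    avg(val * 1.8 + 32.0) AS avg_f_plain  -- built-in PostgreSQL counterpart
FROM measurements
WHERE ts &gt;= now() - '1 day'::interval
GROUP BY device_id;</code></pre>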
<h4 id="compound-transforms">Compound transforms</h4>
<p>Transforms involving multiple points inside of a <code>timevector</code>. (<a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/hyperfunctions/function-pipelines/#compound-transforms">Docs link</a>)</p>
<table>
<thead>
<tr>
<th>Element</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>delta()</code></td>
<td>Subtracts the previous value from each value</td>
</tr>
<tr>
<td><code>fill_to(interval, fill_method)</code></td>
<td>Fills gaps larger than <code>interval</code> with points at <code>interval</code> from the previous using <code>fill_method</code></td>
</tr>
<tr>
<td><code>lttb(resolution)</code></td>
<td>Downsamples a <code>timevector</code> using the largest triangle three buckets algorithm at <code>resolution</code>; requires sorted input.</td>
</tr>
<tr>
<td><code>sort()</code></td>
<td>Sorts the <code>timevector</code> by the <code>time</code> column ascending</td>
</tr>
</tbody>
</table>
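<p>A typical use of the compound transforms is preparing data for a chart. The following sketch, under the same assumptions as the earlier examples, downsamples each device's last day of readings with <code>lttb()</code>. Note the <code>sort()</code> ahead of it, since the table above says <code>lttb()</code> requires sorted input. The resolution of 100 is an arbitrary choice.</p>
<pre><code class="language-SQL">-- Sketch only: downsample each device's series before graphing it.
-- Assumes toolkit_experimental is on the search_path.
SELECT device_id,
    timevector(ts, val)
        -&gt; sort()      -- lttb() requires sorted input
        -&gt; lttb(100)   -- downsample at a resolution of 100
        -&gt; unnest()    -- return (time, value) pairs instead of a timevector
FROM measurements
WHERE ts &gt;= now() - '1 day'::interval
GROUP BY device_id;</code></pre>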
<h4 id="lambda-elements">Lambda elements</h4>
<p>These elements use lambda expressions, which allow the user to write small functions to be evaluated over each point in a <code>timevector</code>.<br>
Lambda expressions can return a <code>DOUBLE PRECISION</code> value like <code>$$ $value^2 + $value + 3 $$</code>. They can return a <code>BOOL</code> like <code>$$ $time &gt; '2020-01-01't $$</code>. They can also return a <code>(time, value)</code> pair like <code>$$ ($time + '1 day'i, sin($value) * 4)$$</code>. (<a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/hyperfunctions/function-pipelines/#lambda-elements">Docs link</a>)</p>
<p>You can apply them using the elements below:</p>
<table>
<thead>
<tr>
<th>Element</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>filter(lambda (bool) )</code></td>
<td>Removes points from the <code>timevector</code> where the lambda expression evaluates to <code>false</code></td>
</tr>
<tr>
<td><code>map(lambda (value) )</code></td>
<td>Applies the lambda expression to all the values in the <code>timevector</code></td>
</tr>
<tr>
<td><code>map(lambda (time, value) )</code></td>
<td>Applies the lambda expression to all the times and values in the <code>timevector</code></td>
</tr>
</tbody>
</table>
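<p>To make the lambda-taking elements concrete, here is a hedged sketch that first drops readings at or below zero and then rescales the rest. The lambda bodies follow the syntax shown above, but the threshold and the conversion formula are illustrative assumptions rather than anything prescribed by the toolkit.</p>
<pre><code class="language-SQL">-- Sketch only: the threshold and the formula are hypothetical choices.
-- Assumes toolkit_experimental is on the search_path.
SELECT device_id,
    timevector(ts, val)
        -&gt; sort()
        -&gt; filter($$ $value &gt; 0 $$)       -- remove points where the lambda is false
        -&gt; map($$ $value * 1.8 + 32 $$)   -- apply the lambda to each remaining value
        -&gt; unnest()
FROM measurements
WHERE ts &gt;= now() - '1 day'::interval
GROUP BY device_id;</code></pre>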
<h3 id="timevector-finalizers"><code>timevector</code> finalizers</h3><p>These elements end the <code>timevector</code> portion of a pipeline. They can either help with output or produce an aggregate over the entire <code>timevector</code>. They are an optimization barrier to composition as they (usually) produce types other than <code>timevector</code>.</p><p>Example pipelines:</p><pre><code class="language-SQL">SELECT device_id, 
	timevector(ts, val) -&gt; sort() -&gt; delta() -&gt; unnest()
FROM measurements
GROUP BY device_id;</code></pre><pre><code class="language-SQL">SELECT device_id, 
	timevector(ts, val) -&gt; sort() -&gt; delta() -&gt; time_weight()
FROM measurements
GROUP BY device_id;</code></pre><p>Finalizer pipeline elements organized by sub-category:</p><h4 id="timevector-output"><code>timevector</code> output</h4>
<p>These elements help with output, and can produce a set of <code>(time, value)</code> pairs or a materialized <code>timevector</code>. Note: this is an area where we’d love further feedback. Are there particular data formats that would be especially useful for, say, graphing that we can add? <a href="https://github.com/timescale/timescaledb-toolkit/issues">File an issue in our GitHub</a>!</p>
<table>
<thead>
<tr>
<th>Element</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>unnest( )</code></td>
<td>Produces a set of <code>(time, value)</code> pairs. You can wrap and expand the result as a composite type to produce separate columns: <code>(pipe -&gt; unnest()).*</code></td>
</tr>
<tr>
<td><code>materialize()</code></td>
<td>Materializes a <code>timevector</code> to pass to an application or other operation directly. This blocks any optimizations that would materialize it lazily.</td>
</tr>
</tbody>
</table>
<h4 id="timevector-aggregates"><code>timevector</code> aggregates</h4>
<p>Aggregate all the points in a <code>timevector</code> to produce a single value as a result. (<a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/hyperfunctions/function-pipelines/#aggregate-output-elements">Docs link</a>)</p>
<table>
<thead>
<tr>
<th>Element</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>average()</code></td>
<td>Computes the average of the values in the <code>timevector</code></td>
</tr>
<tr>
<td><code>counter_agg()</code></td>
<td>Computes the <code>counter_agg</code> aggregate over the times and values in the <code>timevector</code></td>
</tr>
<tr>
<td><code>stats_agg()</code></td>
<td>Computes a range of statistical aggregates and returns a <code>1DStatsAgg</code> over the values in the <code>timevector</code></td>
</tr>
<tr>
<td><code>sum()</code></td>
<td>Computes the sum of the values in the <code>timevector</code></td>
</tr>
<tr>
<td><code>num_vals()</code></td>
<td>Counts the points in the <code>timevector</code></td>
</tr>
</tbody>
</table>
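<p>If these finalizers behave as described, they should line up with familiar PostgreSQL aggregates. The sketch below puts the pipeline forms next to their built-in counterparts so the results can be compared side by side. The column aliases are illustrative, and the comparison assumes <code>val</code> contains no <code>NULL</code>s.</p>
<pre><code class="language-SQL">-- Sketch only: pipeline finalizers next to built-in aggregates.
-- Assumes toolkit_experimental is on the search_path and val has no NULLs.
SELECT device_id,
    timevector(ts, val) -&gt; average()  AS avg_pipeline,
    avg(val)                          AS avg_plain,
    timevector(ts, val) -&gt; sum()      AS sum_pipeline,
    sum(val)                          AS sum_plain,
    timevector(ts, val) -&gt; num_vals() AS n_pipeline,
    count(val)                        AS n_plain
FROM measurements
WHERE ts &gt;= now() - '1 day'::interval
GROUP BY device_id;</code></pre>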
<h2 id="aggregate-accessors-and-mutators">Aggregate Accessors and Mutators</h2><p>These function pipeline elements act like the accessors that I described in our previous <a href="https://timescale.ghost.io/blog/blog/how-postgresql-aggregation-works-and-how-it-inspired-our-hyperfunctions-design-2/">post on aggregates</a>. You can use them to get a value from the aggregate part of a function pipeline like so:</p><pre><code class="language-SQL">SELECT device_id, 
	timevector(ts, val) -&gt; sort() -&gt; delta() -&gt; stats_agg() -&gt; variance() 
FROM measurements
GROUP BY device_id;
</code></pre><p>But these don’t <em>just</em> work on <code>timevector</code>s; they also work on a normally produced aggregate.</p><p>When used instead of normal function accessors and mutators, they can make the syntax clearer by getting rid of nested functions like:</p><pre><code class="language-SQL">SELECT approx_percentile(0.5, percentile_agg(val)) 
FROM measurements
</code></pre><p>Instead, we can use the arrow accessor to convey the same thing: </p><pre><code class="language-SQL">SELECT percentile_agg(val) -&gt; approx_percentile(0.5) 
FROM measurements
</code></pre><p>By aggregate family:</p><h4 id="counter-aggregates">Counter aggregates</h4>
<p><a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/hyperfunctions/counter-aggregation/counter-aggs/#counter-aggregates">Counter aggregates</a> deal with resetting counters (and were stabilized in our 1.3 release this week!). Counters are a common type of metric in the application performance monitoring and metrics world. All values have resets accounted for. These elements must have a <code>CounterSummary</code> to their left when used in a pipeline, from a <code>counter_agg()</code> aggregate or pipeline element. (<a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/hyperfunctions/function-pipelines/#counter-aggregates">Docs link</a>)</p>
<table>
<thead>
<tr>
<th>Element</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>counter_zero_time()</code></td>
<td>The time at which the counter value is predicted to have been zero based on the least squares fit of the points input to the <code>CounterSummary</code> (x intercept)</td>
</tr>
<tr>
<td><code>corr()</code></td>
<td>The correlation coefficient of the least squares fit line of the adjusted counter value.</td>
</tr>
<tr>
<td><code>delta()</code></td>
<td>Computes the last - first value of the counter</td>
</tr>
<tr>
<td><code>extrapolated_delta(method)</code></td>
<td>Computes the delta extrapolated to the bounds of the range using the provided method. Bounds must have been provided in the aggregate or a <code>with_bounds</code> call</td>
</tr>
<tr>
<td><code>idelta_left()</code> / <code>idelta_right()</code></td>
<td>Computes the instantaneous difference between the second and first points (left) or last and next-to-last points (right)</td>
</tr>
<tr>
<td><code>intercept()</code></td>
<td>The y-intercept of the least squares fit line of the adjusted counter value.</td>
</tr>
<tr>
<td><code>irate_left()</code> / <code>irate_right()</code></td>
<td>Computes the instantaneous rate of change between the second and first points (left) or last and next-to-last points (right)</td>
</tr>
<tr>
<td><code>num_changes()</code></td>
<td>Number of times the counter changed values.</td>
</tr>
<tr>
<td><code>num_elements()</code></td>
<td>Number of items - any with the exact same time will have been counted only once.</td>
</tr>
<tr>
<td><code>num_resets()</code></td>
<td>Number of times the counter reset.</td>
</tr>
<tr>
<td><code>slope()</code></td>
<td>The slope of the least squares fit line of the adjusted counter value.</td>
</tr>
<tr>
<td><code>with_bounds(range)</code></td>
<td>Applies bounds using the <code>range</code> (a <code>TSTZRANGE</code>) to the <code>CounterSummary</code> if they weren’t provided in the aggregation step</td>
</tr>
</tbody>
</table>
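<p>As with the percentile example earlier, the arrow form can replace nested accessor calls on a counter aggregate. The sketch below assumes a hypothetical <code>requests</code> table whose <code>val</code> column is a counter that only increases, apart from occasional resets. Only the aggregate and element names come from this post; the table and column aliases are made up for illustration.</p>
<pre><code class="language-SQL">-- Sketch only: the requests table is a hypothetical stand-in for counter data.
-- Assumes toolkit_experimental is on the search_path.
CREATE TABLE requests(
    device_id BIGINT,
    ts TIMESTAMPTZ,
    val DOUBLE PRECISION  -- counter reading
);

SELECT device_id,
    counter_agg(ts, val) -&gt; delta()       AS total_increase,  -- last - first, resets accounted for
    counter_agg(ts, val) -&gt; num_changes() AS changes          -- times the counter changed value
FROM requests
WHERE ts &gt;= now() - '1 day'::interval
GROUP BY device_id;</code></pre>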
<h4 id="percentile-approximation">Percentile approximation</h4>
<p>These aggregate accessors deal with <a href="https://timescale.ghost.io/blog/blog/how-percentile-approximation-works-and-why-its-more-useful-than-averages/">percentile approximation</a>. For now we’ve only implemented them for <code>percentile_agg</code> and <code>uddsketch</code> based aggregates. We have not yet implemented them for <code>tdigest</code>. (<a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/hyperfunctions/function-pipelines/#percentile-approximation">Docs link</a>)</p>
<table>
<thead>
<tr>
<th>Element</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>approx_percentile(p)</code></td>
<td>The approximate value at percentile <code>p</code></td>
</tr>
<tr>
<td><code>approx_percentile_rank(v)</code></td>
<td>The approximate percentile a value <code>v</code> would fall in</td>
</tr>
<tr>
<td><code>error()</code></td>
<td>The maximum relative error guaranteed by the approximation</td>
</tr>
<tr>
<td><code>mean()</code></td>
<td>The exact average of the input values.</td>
</tr>
<tr>
<td><code>num_vals()</code></td>
<td>The number of input values</td>
</tr>
</tbody>
</table>
<h4 id="statistical-aggregates">Statistical aggregates</h4>
<p>These aggregate accessors add support for common statistical aggregates (and were stabilized in our 1.3 release this week!). They allow you to compute and <code>rollup()</code> common statistical aggregates like <code>average</code> and <code>stddev</code>, more advanced ones like <code>skewness</code>, and 2-dimensional aggregates like <code>slope</code> and <code>covariance</code>. Because there are both 1D and 2D versions of these, the accessors can have multiple forms. For instance, <code>average()</code> calculates the average on a 1D aggregate, while <code>average_y()</code> &amp; <code>average_x()</code> do so on each dimension of a 2D aggregate. (<a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/hyperfunctions/function-pipelines/#statistical-aggregates">Docs link</a>)</p>
<table>
<thead>
<tr>
<th>Element</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>average() / average_y() / average_x()</code></td>
<td>The average of the values.</td>
</tr>
<tr>
<td><code>corr()</code></td>
<td>The correlation coefficient of the least squares fit line.</td>
</tr>
<tr>
<td><code>covariance(method)</code></td>
<td>The covariance of the values using either <code>population</code> or <code>sample</code> method.</td>
</tr>
<tr>
<td><code>determination_coeff()</code></td>
<td>The determination coefficient (aka R squared) of the values.</td>
</tr>
<tr>
<td><code>kurtosis(method) / kurtosis_y(method) / kurtosis_x(method)</code></td>
<td>The kurtosis (4th moment) of the values using either <code>population</code> or <code>sample</code> method.</td>
</tr>
<tr>
<td><code>intercept()</code></td>
<td>The intercept of the least squares fit line.</td>
</tr>
<tr>
<td><code>num_vals()</code></td>
<td>The number of (non-null) values seen.</td>
</tr>
<tr>
<td><code>sum() / sum_x() / sum_y()</code></td>
<td>The sum of the values seen.</td>
</tr>
<tr>
<td><code>skewness(method) / skewness_y(method) / skewness_x(method)</code></td>
<td>The skewness (3rd moment) of the values using either <code>population</code> or <code>sample</code> method.</td>
</tr>
<tr>
<td><code>slope()</code></td>
<td>The slope of the least squares fit line.</td>
</tr>
<tr>
<td><code>stddev(method) / stddev_y(method) / stddev_x(method)</code></td>
<td>The standard deviation of the values using either <code>population</code> or <code>sample</code> method.</td>
</tr>
<tr>
<td><code>variance(method) / variance_y(method) / variance_x(method)</code></td>
<td>The variance of the values using either <code>population</code> or <code>sample</code> method.</td>
</tr>
<tr>
<td><code>x_intercept()</code></td>
<td>The x-intercept of the least squares fit line.</td>
</tr>
</tbody>
</table>
<h4 id="time-weighted-averages">Time-weighted averages</h4>
<p>(<a href="https://timescale.ghost.io/blog/blog/what-time-weighted-averages-are-and-why-you-should-care/">More info</a>)<br>
The <code>average()</code> accessor may be called on the output of a <code>time_weight()</code> like so:</p>
<pre><code class="language-SQL">SELECT time_weight('Linear', ts, val) -&gt; average()  FROM measurements;
</code></pre><h4 id="approximate-count-distinct-hyperloglog">Approximate count distinct (Hyperloglog)</h4>
<p>This is an approximation for distinct counts that was stabilized in our 1.3 release! (<a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/hyperfunctions/approx-count-distincts/">Docs link</a>) The <code>distinct_count()</code> accessor may be called on the output of a <code>hyperloglog()</code> like so:</p>
<pre><code class="language-SQL">SELECT hyperloglog(device_id) -&gt; distinct_count() FROM measurements;
</code></pre><h2 id="next-steps">Next Steps</h2><p>We hope this post helped you understand how function pipelines leverage PostgreSQL extensibility to offer functional programming concepts in a way that is fully PostgreSQL compliant, and how they can improve the ergonomics of your code, making it easier to write, read, and maintain.</p><p><a href="https://console.cloud.timescale.com/signup"><strong>You can try function pipelines today</strong></a> with a fully managed Timescale Cloud service (no credit card required, free for 30 days). Function pipelines are available now on every new database service on Timescale Cloud, so after you’ve created a new service, you’re all set to use them!</p><p>If you prefer to manage your own database instances, you can <a href="https://github.com/timescale/timescaledb-toolkit">download and install the timescaledb_toolkit extension</a> on GitHub for free, after which you’ll be able to use function pipelines.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[PostgreSQL vs Python for Data Evaluation: What, Why, and How]]></title>
            <description><![CDATA[Get a primer on using TimescaleDB and PostgreSQL to more efficiently perform your data evaluation tasks - previously done in Excel, R, or Python. Complete with short SQL refresher section, along with 1-to-1 code snippets comparing TimescaleDB and PostgreSQL code against Python code.]]></description>
            <link>https://www.tigerdata.com/blog/how-to-evaluate-your-data-directly-within-the-database-and-make-your-analysis-more-efficient</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/how-to-evaluate-your-data-directly-within-the-database-and-make-your-analysis-more-efficient</guid>
            <category><![CDATA[General]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Benchmarks & Comparisons]]></category>
            <dc:creator><![CDATA[Miranda Auhl]]></dc:creator>
            <pubDate>Fri, 01 Oct 2021 13:35:34 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/09/gabriel-crismariu-sOK9NjLArCw-unsplash.jpg">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/09/gabriel-crismariu-sOK9NjLArCw-unsplash.jpg" alt="Picture of an unfinished puzzle" /><h2 id="introduction">Introduction</h2><p>As I started writing this post, I realized that to properly show how to evaluate, clean, and transform data in the database (also known as data munging), I needed to focus on each step individually. This blog post will show you exactly how to use TimescaleDB and PostgreSQL to perform your <strong>data evaluation tasks</strong> that you may have previously done in Excel, R, or Python. TimescaleDB and PostgreSQL cannot replace these tools entirely, but they can help your data munging/evaluation tasks be more efficient and, in turn, let Excel, R, and Python shine where they do best: in visualizations, modeling, and machine learning.  </p><p>You may be asking yourself, “What exactly do you mean by <em>Evaluating</em> the data?”. When I talk about evaluating the data, I mean <em>really</em> understanding the data set you are working with. </p><p>If - in a theoretical world - I could grab a beer with my data set and talk to it about everything, that is what I would do during the evaluating step of my data analysis process. Before beginning analysis, I want to know every column, every general trend, every connection between tables, etc. 
To do this, I have to sit down and run query after query to get a solid picture of my data.</p><h3 id="recap">Recap</h3><p>If you remember, <a href="https://timescale.ghost.io/blog/blog/speeding-up-data-analysis/">in my last post</a>, I summarized the analysis process as the “data analysis lifecycle” with the following steps: Evaluate, Clean, Transform, and Model.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/09/data-analysis-lifecycle-1.jpeg" class="kg-image" alt=" Image showing Evaluate -> Clean -> Transform -> Model, accompanied by icons which relate to each step" loading="lazy" width="1838" height="466" srcset="https://blog.timescale.com/content/images/size/w600/2021/09/data-analysis-lifecycle-1.jpeg 600w, https://blog.timescale.com/content/images/size/w1000/2021/09/data-analysis-lifecycle-1.jpeg 1000w, https://blog.timescale.com/content/images/size/w1600/2021/09/data-analysis-lifecycle-1.jpeg 1600w, https://blog.timescale.com/content/images/2021/09/data-analysis-lifecycle-1.jpeg 1838w" sizes="(min-width: 720px) 720px"><figcaption>Data Analysis Lifecycle</figcaption></figure><p>As a data analyst, I found that all the tasks I performed could be grouped into these four categories, with evaluating the data as the first and, I feel, the most crucial step in the process. </p><p>My first job out of college was at an energy and sustainability solutions company that focused on monitoring all different kinds of usage - such as electricity, water, sewage, you name it - to figure out how buildings could be more efficient. They would place sensors on whatever medium you wanted to monitor to help you figure out what initiatives your group could take to be more sustainable and ultimately save costs. 
My role at this company was to perform data analysis and business intelligence tasks.</p><p>Throughout my time in this job, I got the chance to use many popular tools to evaluate my data, including Excel, R, Python, and heck, even Minitab. But once I tried using a database - and specifically PostgreSQL and TimescaleDB - I realized how efficient and straightforward evaluating work could be when done directly in a database. Tasks that took me a while to work out in pandas, after hunting for code online, could be done intuitively through SQL. Plus, most of the time the database queries were just as fast as my other code, if not faster. </p><p>Now, while I would love to show you a one-to-one comparison of my SQL code against each of these popular tools, that’s not practical. Besides, no one wants to read three examples of the same thing in a row! Thus, for comparison purposes in this blog post, I will directly show TimescaleDB and PostgreSQL functionality against Python code. Keep in mind that almost all code will likely be comparable to your Excel and R code. However, if you have any questions, feel free to hop on and join our <a href="https://slack.timescale.com/">Slack channel</a>, where you can ask the Timescale community, or me, specifics on TimescaleDB or PostgreSQL functionality 😊. I’d love to hear from you!</p><p>Additionally, as we explore TimescaleDB and PostgreSQL functionality together, you may be eager to try things out right away! Which is awesome! If so, you can <a href="https://www.timescale.com/timescale-signup">sign up for a free 30-day trial</a> or <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/install-timescaledb/self-hosted/">install and manage TimescaleDB on your own PostgreSQL instances</a>. 
(You can also learn more by <a href="https://docs.timescale.com/timescaledb/latest/tutorials/">following one of our many tutorials</a>.)</p><p>But enough of an intro, let’s get into the good stuff!</p><!--kg-card-begin: html--><center><iframe src="https://giphy.com/embed/QvwCVnX9DWdlHCnix5" width="280" height="280" frameBorder="0" class="giphy-embed" allowFullScreen></iframe><p><a href="https://giphy.com/gifs/cbc-schitts-creek-QvwCVnX9DWdlHCnix5">via GIPHY</a></p></center>
<!--kg-card-end: html--><h2 id="sql-basics">SQL basics</h2><!--kg-card-begin: markdown--><p>PostgreSQL is a database platform that uses SQL syntax to interact with the data inside it. TimescaleDB is an extension that is applied to a PostgreSQL database. To unlock the potential of PostgreSQL and TimescaleDB, you have to use SQL. So, before we jump into things, I wanted to give a basic SQL syntax refresher. <a href="https://timescale.ghost.io/blog/blog/how-to-evaluate-your-data-directly-within-the-database-and-make-your-analysis-more-efficient/#a-quick-note-on-the-data">If you are familiar with SQL, please feel free to skip this section!</a></p>
<!--kg-card-end: markdown--><p>For those of you who are newer to SQL (short for structured query language), it is the language many relational databases, including PostgreSQL, use to query data. Like <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html">pandas’ DataFrames</a> or Excel’s spreadsheets, data queried with SQL is structured as a table with columns and rows.</p><p>The <a href="https://www.postgresql.org/docs/current/sql-select.html">basics of a SQL <code>SELECT</code> command</a> can be broken down like this 👇</p><!--kg-card-begin: markdown--><pre><code class="language-sql">SELECT --columns, functions, aggregates, expressions that describe what you want to be shown in the results
FROM --if selecting data from a table in your DB, you must define the table name here
JOIN --join another table to the FROM statement table 
    ON --a column that each table shares values 
WHERE --statement to filter results where a column or expression is equivalent to some statement
GROUP BY --if SELECT or HAVING contains an aggregate, or if you want to group values on a column/expression, you must include those columns here
HAVING --similar to WHERE, this keyword helps to filter results based upon columns or expressions specifically used with a GROUP BY query
ORDER BY --allows you to specify the order in which your data is displayed
LIMIT --lets you specify the number of rows you want displayed in the output
</code></pre>
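<p>To make the clause order concrete, here is a minimal, runnable sketch that uses every keyword from the list above in a single query. It runs against Python's built-in <code>sqlite3</code> module purely as a stand-in, so no database server is needed. The <code>devices</code> and <code>readings</code> tables and their values are invented for illustration, and the statement only uses syntax that PostgreSQL should accept as well.</p>

```python
import sqlite3

# In-memory SQLite database used as a stand-in for PostgreSQL.
# Assumption: these hypothetical tables and rows exist only for this illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE devices (device_id INTEGER PRIMARY KEY, building TEXT);
    CREATE TABLE readings (device_id INTEGER, energy REAL);
    INSERT INTO devices VALUES (1, 'office'), (2, 'lab'), (3, 'garage');
    INSERT INTO readings VALUES
        (1, 4.0), (1, 6.0), (2, 1.0), (2, 2.0), (3, 9.0), (3, 0.0);
""")

clause_order_query = """
    SELECT d.building, SUM(r.energy) AS total_energy  -- what to show
    FROM readings r                                   -- main table
    JOIN devices d                                    -- second table
        ON r.device_id = d.device_id                  -- column both tables share
    WHERE r.energy != 0                               -- filter rows before grouping
    GROUP BY d.building                               -- one output row per building
    HAVING SUM(r.energy) BETWEEN 5 AND 100            -- filter groups after aggregating
    ORDER BY total_energy DESC                        -- largest total first
    LIMIT 2                                           -- return at most two rows
"""

top_buildings = conn.execute(clause_order_query).fetchall()
conn.close()

print(top_buildings)
```

<p>With these toy rows, the query returns the office (total 10.0) followed by the garage (total 9.0); the lab is dropped by <code>HAVING</code> because its total of 3.0 falls outside the range.</p>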
<!--kg-card-end: markdown--><p>You can think of your queries as <em>SELECTing </em>data <em>FROM </em>your tables within your database. You can <em>JOIN</em> multiple tables together and specify <em>WHERE</em> your data needs to be filtered or what it should be <em>GROUPed BY</em>. Do you see what I did there 😋?</p><p>This is the beauty of SQL; the names of these keywords were chosen to make your queries intuitive. Thankfully, most PostgreSQL and SQL functionality follows this same easy-to-read pattern. I have had the opportunity to teach myself many programming languages throughout my career, and SQL is by far the easiest to read, write, and construct. This intuitive nature is another excellent reason why data munging in PostgreSQL and TimescaleDB can be so efficient when compared to other methods. </p><p>Note that this list of keywords includes most of the ones you will need to start selecting data with SQL; however, it is not exhaustive. You will not need to use all these phrases for every query but likely will need at least <code>SELECT</code> and <code>FROM</code>. The queries in this blog post will always include these two keywords.</p><p>Additionally, the order of these keywords is specific. When building your queries, you need to follow the order that I used above. For any additional PostgreSQL commands you wish to use, you will have to research where they fit in the order hierarchy and follow that accordingly. </p><p>Seeing a list of commands may be somewhat helpful but is likely not enough to solidify understanding if you are like me. So let’s look at some examples!</p><p>Let’s say that I have a table in my PostgreSQL database called <code>energy_usage</code>. This table contains three columns: <code>time</code> which contains timestamp values, <code>energy</code> which contains numeric values, and <code>notes</code> which contains string values. 
As you may be able to imagine, every row of data in my <code>energy_usage</code> table will contain:</p><ul><li><code>time</code>: timestamp value saying when the reading was collected</li><li><code>energy</code>: numeric value representing how much energy was used since the last reading</li><li><code>notes</code>: string value giving additional context to each reading. <br></li></ul><p>If I wanted to look at all the data within the table, I could use the following SQL query:</p><!--kg-card-begin: markdown--><pre><code class="language-sql">SELECT time, energy, notes --I list my columns here
FROM energy_usage; -- I list my table here and end the query with a semicolon
</code></pre>
<!--kg-card-end: markdown--><p>Alternatively, SQL has a shorthand for ‘include all columns’, the operator <code>*</code>. So I could select all the data using this query as well,</p><!--kg-card-begin: markdown--><pre><code class="language-sql">SELECT *
FROM energy_usage;
</code></pre>
<!--kg-card-end: markdown--><p>What if I want to select the data and order it by the <code>time</code> column so that the earliest readings are first and the latest are last? All I need to do is include the <code>ORDER BY</code> statement and then specify the <code>time</code> column along with the specification <code>ASC</code> to let the database know I want the data in ascending order.</p><!--kg-card-begin: markdown--><pre><code class="language-sql">SELECT *
FROM energy_usage
ORDER BY time ASC; -- first I list my time column, then I specify either DESC or ASC
</code></pre>
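<p>If you would like to try these first queries without setting up a database, the following is a small sketch that recreates a toy <code>energy_usage</code> table in Python's built-in <code>sqlite3</code> module and runs the same <code>ORDER BY</code> query in both directions. SQLite is only a stand-in for PostgreSQL here, the three rows are invented, and the timestamps are stored as ISO-formatted text (an assumption that keeps them sortable).</p>

```python
import sqlite3

# Stand-in for a PostgreSQL database; the rows below are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE energy_usage (time TEXT, energy REAL, notes TEXT)")
conn.executemany(
    "INSERT INTO energy_usage VALUES (?, ?, ?)",
    [
        ("2021-01-01 02:00:00", 1.5, "weekday"),
        ("2021-01-01 00:00:00", 0.7, "weekday"),
        ("2021-01-01 01:00:00", 1.1, "weekday"),
    ],
)

# ASC puts the earliest reading first, DESC puts the latest reading first.
earliest_first = conn.execute(
    "SELECT * FROM energy_usage ORDER BY time ASC"
).fetchall()
latest_first = conn.execute(
    "SELECT * FROM energy_usage ORDER BY time DESC"
).fetchall()
conn.close()

for row in earliest_first:
    print(row)
```

<p>Even though the rows were inserted out of order, <code>ORDER BY</code> decides the order in which they come back; without it, the database makes no promise about row order.</p>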
<!--kg-card-end: markdown--><p>Hopefully, you can start to see the pattern and feel more comfortable with SQL syntax. I will be showing a lot more code snippets throughout the post, so hang tight if you still need more examples!</p><p>So now that we have a little refresher on SQL basics, let’s jump into how you can use this language along with TimescaleDB and PostgreSQL functionality to do your data evaluating tasks!</p><!--kg-card-begin: markdown--><p><a name="datanotepart"></a></p>
<!--kg-card-end: markdown--><h2 id="a-quick-note-on-the-data">A quick note on the data</h2><p>Earlier I talked about my first job as a data analyst for an IoT sustainability company. Because of this job, I tend to love IoT data sets and couldn’t pass up the chance to explore <a href="https://www.kaggle.com/srinuti/residential-power-usage-3years-data-timeseries">this IoT dataset from Kaggle</a> to show how to perform data munging tasks in PostgreSQL and TimescaleDB. </p><p>The data set contains two tables, one specifying energy consumption for a single home in Houston, Texas (called <code>power_usage</code>), and the other documenting weather conditions (called <code>weather</code>). This data is actually the same data set that I used in my previous post, so bonus points if you caught that 😊!</p><p>This data was recorded from January 2016 to December 2020. While looking at this data set, and all time-series data sets, we must consider any outside influences that could affect the data. The most obvious factor that impacts the analysis of this dataset is the COVID-19 pandemic that took place from January 9th through to December 2020. Thankfully, we will see that the individual recording this data included some notes to help categorize days affected by the pandemic. As I go through this blog series, we will see patterns associated with the data collected during the COVID-19 pandemic, so definitely keep this fact in the back of your mind as we perform various data munging analysis steps!</p><p>Here is an image explaining the two tables, their column names in red and corresponding data types in blue.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/09/Tables.jpg" class="kg-image" alt="explanation of the power_usage table and weather table. The power_usage table has four columns: startdate (timestamp), value_kWh (numeric), day_of_week (int), notes (varchar). 
Weather has date (date), day (int), temp_max (numeric), temp_avg (numeric), temp_min (numeric), dew_max (numeric), dew_avg (numeric), dew_min (numeric), hum_max	(numeric), hum_avg	(numeric), hum_min	(numeric), wind_max (numeric), wind_avg (numeric), wind_min (numeric), press_max (numeric), press_avg (numeric), press_min (numeric), precipit (numeric), day_of_week (int)" loading="lazy" width="1480" height="1472" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2021/09/Tables.jpg 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2021/09/Tables.jpg 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/09/Tables.jpg 1480w" sizes="(min-width: 720px) 720px"></figure><p>As we work through this blog post, we will use the evaluating techniques available within PostgreSQL and TimescaleDB to understand these two tables inside and out.</p><h2 id="evaluating-the-data">Evaluating the data</h2><p>As we discussed before, the first step in the data analysis lifecycle - and arguably the most critical step -  is to evaluate the data. I will go through how I would approach evaluating this IoT energy data, showing most of the techniques I have used in the past while working in data science. While these examples are not exhaustive, they will cover many of the evaluating steps you perform during your analysis, helping to make your evaluating tasks more efficient by using PostgreSQL and TimescaleDB. </p><p>The techniques that I will cover include:</p><!--kg-card-begin: markdown--><ul>
<li><a href="#technique1">Reading the raw data</a></li>
<li><a href="#technique2">Finding and observing “categorical” column values in my dataset</a></li>
<li><a href="#technique3">Sorting my data by specific columns</a></li>
<li><a href="#technique4">Displaying grouped data</a></li>
<li><a href="#technique5">Finding abnormalities in the database</a></li>
<li><a href="#technique6">Looking at general trends</a></li>
</ul>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><p><a name="technique1"></a></p>
<!--kg-card-end: markdown--><h3 id="reading-the-raw-data">Reading the raw data</h3><p>Let’s start with the simplest evaluating task: looking at the raw data.</p><p>As we learned in the SQL refresher above, we can quickly pull all the data within a table by using the <code>SELECT</code> statement with the <code>*</code> operator. Since I have two tables within my database, I will query both tables’ information by running a query for each.</p><p>PostgreSQL code:</p><!--kg-card-begin: markdown--><p><a name="select"></a></p>
<pre><code class="language-sql">-- select all the data from my power_usage table
SELECT * 
FROM power_usage pu; 
-- selects all the data from my weather table
SELECT * 
FROM weather w;
</code></pre>
<!--kg-card-end: markdown--><p>But what if I don’t necessarily need to query all my data? Since all the data is housed in the database, if I want to get a feel for the data and the column values, I could just look at a snapshot of the raw data. </p><p>While conducting analysis in Python, I often would just print a handful of rows of data to get a feel for the values. We can do this in PostgreSQL by including the <code>LIMIT</code> command within our query. To show the first 20 rows of data in my tables, I can do the following:</p><p>PostgreSQL code:</p><!--kg-card-begin: markdown--><p><a name="limit"></a></p>
<pre><code class="language-sql">-- select all the data from my power_usage table
SELECT * 
FROM power_usage pu
LIMIT 20; -- specify 20 because I only want to see 20 rows of data
-- selects all the data from my weather table
SELECT * 
FROM weather w 
LIMIT 20;
</code></pre>
<!--kg-card-end: markdown--><p>Results: Some of the rows for each table</p><!--kg-card-begin: markdown--><table>
<thead>
<tr>
<th>startdate</th>
<th>value_kwh</th>
<th>day_of_week</th>
<th>notes</th>
</tr>
</thead>
<tbody>
<tr>
<td>2016-01-06 01:00:00</td>
<td>1</td>
<td>2</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-01-06 02:00:00</td>
<td>1</td>
<td>2</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-01-06 03:00:00</td>
<td>1</td>
<td>2</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-01-06 04:00:00</td>
<td>1</td>
<td>2</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-01-06 05:00:00</td>
<td>0</td>
<td>2</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-01-06 06:00:00</td>
<td>0</td>
<td>2</td>
<td>weekday</td>
</tr>
</tbody>
</table>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><div style="width:100%;overflow:auto;">
<table>
<thead>
<tr>
<th>date</th>
<th>day</th>
<th>temp_max</th>
<th>temp_avg</th>
<th>temp_min</th>
<th>dew_max</th>
<th>dew_avg</th>
<th>dew_min</th>
<th>hum_max</th>
<th>hum_avg</th>
<th>hum_min</th>
<th>wind_max</th>
<th>wind_avg</th>
<th>wind_min</th>
<th>press_max</th>
<th>press_avg</th>
<th>press_min</th>
<th>precipit</th>
<th>day_of_week</th>
</tr>
</thead>
<tbody>
<tr>
<td>2016-01-06</td>
<td>1</td>
<td>85</td>
<td>75</td>
<td>68</td>
<td>74</td>
<td>71</td>
<td>66</td>
<td>100</td>
<td>89</td>
<td>65</td>
<td>21</td>
<td>10</td>
<td>0</td>
<td>30</td>
<td>30</td>
<td>30</td>
<td>0</td>
<td>2</td>
</tr>
<tr>
<td>2016-02-06</td>
<td>2</td>
<td>76</td>
<td>71</td>
<td>66</td>
<td>74</td>
<td>70</td>
<td>66</td>
<td>100</td>
<td>97</td>
<td>89</td>
<td>18</td>
<td>8</td>
<td>0</td>
<td>30</td>
<td>30</td>
<td>30</td>
<td>4</td>
<td>5</td>
</tr>
<tr>
<td>2016-02-07</td>
<td>2</td>
<td>95</td>
<td>86</td>
<td>76</td>
<td>76</td>
<td>73</td>
<td>69</td>
<td>94</td>
<td>67</td>
<td>43</td>
<td>12</td>
<td>6</td>
<td>0</td>
<td>30</td>
<td>30</td>
<td>30</td>
<td>0</td>
<td>6</td>
</tr>
<tr>
<td>2016-02-08</td>
<td>2</td>
<td>97</td>
<td>87</td>
<td>77</td>
<td>77</td>
<td>74</td>
<td>71</td>
<td>94</td>
<td>66</td>
<td>43</td>
<td>15</td>
<td>5</td>
<td>0</td>
<td>30</td>
<td>30</td>
<td>30</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>2016-02-09</td>
<td>2</td>
<td>95</td>
<td>85</td>
<td>77</td>
<td>75</td>
<td>74</td>
<td>70</td>
<td>90</td>
<td>70</td>
<td>51</td>
<td>16</td>
<td>7</td>
<td>0</td>
<td>30</td>
<td>30</td>
<td>30</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>2016-02-10</td>
<td>2</td>
<td>86</td>
<td>74</td>
<td>65</td>
<td>64</td>
<td>61</td>
<td>58</td>
<td>90</td>
<td>66</td>
<td>40</td>
<td>8</td>
<td>4</td>
<td>0</td>
<td>30</td>
<td>30</td>
<td>30</td>
<td>0</td>
<td>2</td>
</tr>
<tr>
<td>2016-03-06</td>
<td>3</td>
<td>79</td>
<td>72</td>
<td>68</td>
<td>72</td>
<td>70</td>
<td>68</td>
<td>100</td>
<td>94</td>
<td>72</td>
<td>18</td>
<td>5</td>
<td>0</td>
<td>30</td>
<td>30</td>
<td>30</td>
<td>3</td>
<td>6</td>
</tr>
</tbody>
</table>
</div><!--kg-card-end: markdown--><p>Python code:</p><p>In this first Python code snippet, I show the modules I needed to import and the connection code that I would have to run to access the data from my database and import it into a pandas DataFrame. </p><p>One of the challenges I faced while data munging in Python was the need to run through the entire script again and again when evaluating, cleaning, and transforming the data. This initial data pulling process usually takes a good bit of time, so it was often frustrating to run through it repeatedly. I also would have to call <code>print</code> anytime I wanted to quickly glance at an array, DataFrame, or element. These kinds of extra tasks in Python can be time-consuming, especially if you end up at the modeling stage of the analysis lifecycle with only a subset of the original data! All this to say, keep in mind that for the other code snippets within the blog, I will not repeat this setup code; however, it still impacts that code in the background. </p><p>Additionally, because I have my data housed in a TimescaleDB instance, I still need to use the <code>SELECT</code> statement to query the data from the database and read it into Python. If you use a relational database - which I explained is very beneficial to analysis in my previous post - you will have to use <em>some</em> SQL.</p><!--kg-card-begin: markdown--><pre><code class="language-python">import psycopg2
import pandas as pd
import configparser
import numpy as np
import tempfile
import matplotlib.pyplot as plt
 
## use config file for database connection information
config = configparser.ConfigParser()
config.read('env.ini')
 
## establish connection
conn = psycopg2.connect(database=config.get('USERINFO', 'DB_NAME'),
                       host=config.get('USERINFO', 'HOST'),
                       user=config.get('USERINFO', 'USER'),
                       password=config.get('USERINFO', 'PASS'),
                       port=config.get('USERINFO', 'PORT'))
 

## define the queries for copying data out of our database (using format to copy queries)                    
query_weather = &quot;select * from weather&quot;
query_power = &quot;select * from power_usage&quot;
## define function to copy the data to a csv
def copy_from_db(query, cur):
    with tempfile.TemporaryFile() as tmpfile:
        copy_sql = &quot;COPY ({query}) TO STDOUT WITH CSV {head}&quot;.format(
            query=query, head=&quot;HEADER&quot;
            )
        cur.copy_expert(copy_sql, tmpfile)
        tmpfile.seek(0)
        df = pd.read_csv(tmpfile)
        return df
## create cursor to use in function above and place data into a file
cursor = conn.cursor()
weather_df = copy_from_db(query_weather, cursor)
power_df = copy_from_db(query_power, cursor)
cursor.close()
conn.close()


print(weather_df.head(20))
print(power_df.head(20))
</code></pre>
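<p>The connection code above reads its settings from a file called <code>env.ini</code>, which is not shown in the post. As a rough sketch, the <code>config.get('USERINFO', ...)</code> calls imply a layout like the one below: a single <code>USERINFO</code> section with <code>DB_NAME</code>, <code>HOST</code>, <code>USER</code>, <code>PASS</code>, and <code>PORT</code> keys. Every value here is a placeholder I made up, and the example parses the text from a string instead of a file so that it runs on its own.</p>

```python
import configparser

# Layout implied by the config.get('USERINFO', ...) calls in the connection code.
# All values are placeholders (assumptions), not real connection details.
EXAMPLE_ENV_INI = """
[USERINFO]
DB_NAME = example_db
HOST = localhost
USER = example_user
PASS = example_password
PORT = 5432
"""

config = configparser.ConfigParser()
config.read_string(EXAMPLE_ENV_INI)  # the post uses config.read('env.ini') instead

# Collect the same five settings that psycopg2.connect() receives in the post.
connection_settings = {
    "database": config.get("USERINFO", "DB_NAME"),
    "host": config.get("USERINFO", "HOST"),
    "user": config.get("USERINFO", "USER"),
    "password": config.get("USERINFO", "PASS"),
    "port": config.get("USERINFO", "PORT"),
}

print(connection_settings["host"], connection_settings["port"])
```

<p>Note that <code>configparser</code> returns every value as a string, including the port. Keeping credentials in a separate file like this also means they stay out of the analysis script itself.</p>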
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><p><a name="technique2"></a></p>
<!--kg-card-end: markdown--><h3 id="finding-and-observing-%E2%80%9Ccategorical%E2%80%9D-column-values-in-my-dataset">Finding and observing “categorical” column values in my dataset</h3><p>Next, I think it is essential to understand any “categorical” columns - columns with a finite set of values - that I might have. This is useful in analysis because categorical data can give insight into natural groupings that often occur within a dataset. For example, I would assume that energy usage for many people is different on a weekday vs. a weekend. We can’t verify this without knowing the categorical possibilities and seeing how each could impact the data trend. </p><p>First, I want to look at my tables and the data types used for each column. Looking at the available columns in each table, I can make an educated guess that the <code>day_of_week</code>, <code>notes</code>, and <code>day</code> columns will be categorical. Let’s find out if they indeed are and how many different values exist in each. </p><p>To find all the distinct values within a column (or between multiple columns), you can use the <code>DISTINCT</code> keyword after <code>SELECT</code> in your query statement. This can be useful for several data munging tasks, such as identifying categories - which I need to do - or finding unique sets of data. </p><p>Since I want to look at the unique values within each column individually, I will run a query for each separately. If I were to run a query like this 👇</p><!--kg-card-begin: markdown--><pre><code class="language-sql">SELECT DISTINCT day_of_week, notes 
FROM power_usage pu;
</code></pre>
<!--kg-card-end: markdown--><p>I would get data like this</p><!--kg-card-begin: markdown--><table>
<thead>
<tr>
<th>day_of_week</th>
<th>notes</th>
</tr>
</thead>
<tbody>
<tr>
<td>3</td>
<td>vacation</td>
</tr>
<tr>
<td>3</td>
<td>weekday</td>
</tr>
<tr>
<td>1</td>
<td>weekday</td>
</tr>
<tr>
<td>1</td>
<td>vacation</td>
</tr>
<tr>
<td>2</td>
<td>vacation</td>
</tr>
<tr>
<td>4</td>
<td>vacation</td>
</tr>
</tbody>
</table>
<!--kg-card-end: markdown--><p>The output would show the unique <em>pairs</em> of <em>related</em> <code>day_of_week</code> and <code>notes</code> values within the table. This is why I need to include only a single column in each statement: that way, I see that individual column’s unique values and not the unique sets of values. </p><p>For these queries, I am also going to include the <code>ORDER BY</code> command to show the values of each column in ascending order. </p><p>PostgreSQL code:</p><!--kg-card-begin: markdown--><p><a name="orderby"></a></p>
<pre><code class="language-sql">-- selecting distinct values in the ‘day_of_week’ column within my power_usage table
SELECT DISTINCT day_of_week 
FROM power_usage pu 
ORDER BY day_of_week ASC;
-- selecting distinct values in the ‘notes’ column within my power_usage table
SELECT DISTINCT notes 
FROM power_usage pu 
ORDER BY notes ASC;

-- selecting distinct values in the ‘day’ column within my weather table
SELECT DISTINCT &quot;day&quot; 
FROM weather w 
ORDER BY &quot;day&quot; ASC;
-- selecting distinct values in the ‘day_of_week’ column within my weather table
SELECT DISTINCT day_of_week 
FROM weather w 
ORDER BY day_of_week ASC;
</code></pre>
<!--kg-card-end: markdown--><p>Results:</p><p>Notice that the person who recorded this data included a COVID-19 category (<code>COVID_lockdown</code>) in the <code>notes</code> column. As mentioned above, this note could be essential to finding and understanding patterns in this family's energy usage.</p><!--kg-card-begin: markdown--><table>
<thead>
<tr>
<th>day_of_week</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
</tr>
<tr>
<td>1</td>
</tr>
<tr>
<td>2</td>
</tr>
<tr>
<td>3</td>
</tr>
<tr>
<td>4</td>
</tr>
<tr>
<td>5</td>
</tr>
<tr>
<td>6</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th>notes</th>
</tr>
</thead>
<tbody>
<tr>
<td>COVID_lockdown</td>
</tr>
<tr>
<td>vacation</td>
</tr>
<tr>
<td>weekday</td>
</tr>
<tr>
<td>weekend</td>
</tr>
</tbody>
</table>
<p>(Only some of the values shown for day)</p>
<table>
<thead>
<tr>
<th>day</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
</tr>
<tr>
<td>2</td>
</tr>
<tr>
<td>3</td>
</tr>
<tr>
<td>4</td>
</tr>
<tr>
<td>5</td>
</tr>
<tr>
<td>6</td>
</tr>
<tr>
<td>7</td>
</tr>
<tr>
<td>8</td>
</tr>
<tr>
<td>9</td>
</tr>
<tr>
<td>10</td>
</tr>
</tbody>
</table>
<!--kg-card-end: markdown--><p>Python code:</p><p>In my Python code, notice that I need to print anything that I want to quickly observe. I have found this to be the quickest solution, even when compared to using the Python console in debug mode.</p><!--kg-card-begin: markdown--><pre><code class="language-python">p_day_of_the_week = power_df['day_of_week'].unique()
p_notes = power_df['notes'].unique()
w_day = weather_df['day'].unique()
w_day_of_the_week = weather_df['day_of_week'].unique()
print(sorted(p_day_of_the_week), sorted(p_notes), sorted(w_day), sorted(w_day_of_the_week))
</code></pre>
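<p>To make the difference between distinct <em>pairs</em> and distinct single-column values concrete, here is a minimal sketch in plain Python. The rows are the six sample pairs from the <code>DISTINCT day_of_week, notes</code> output above; the <code>sample_rows</code> name is only for illustration and is not part of my data set code.</p>
<pre><code class="language-python"># Six (day_of_week, notes) pairs copied from the sample output above
sample_rows = [
    (3, 'vacation'), (3, 'weekday'), (1, 'weekday'),
    (1, 'vacation'), (2, 'vacation'), (4, 'vacation'),
]

# DISTINCT over both columns keeps every unique pair
distinct_pairs = sorted(set(sample_rows))

# DISTINCT over one column at a time collapses to that column's categories
distinct_days = sorted({day for day, _ in sample_rows})
distinct_notes = sorted({note for _, note in sample_rows})

print(len(distinct_pairs), distinct_days, distinct_notes)
</code></pre>
<p>All six pairs survive the two-column version, while the single-column versions shrink to four days and two notes, which is the view I want when hunting for categories.</p>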
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><p><a name="technique3"></a></p>
<!--kg-card-end: markdown--><h3 id="sorting-my-data-by-specific-columns">Sorting my data by specific columns</h3><p>What if I want to evaluate my tables based on how specific columns were sorted? One of the top questions asked on StackOverflow for Python data analysis is "<a href="https://stackoverflow.com/questions/17141558/how-to-sort-a-dataframe-in-python-pandas-by-two-or-more-columns">How to sort a dataframe in python pandas by two or more columns?</a>". Once again, we can do this intuitively through SQL.</p><p>One of the things I'm interested in identifying is how bad weather impacts energy usage. To do this, I have to think about indicators that typically signal bad weather, which include high precipitation, high wind speed, and low pressure. To identify days with this pattern in my PostgreSQL <code>weather</code> table, I need to use the <code>ORDER BY</code> keyword, then call out each column in the order I want things sorted, specifying the <code>DESC</code> and <code>ASC</code> attributes as needed. </p><p>PostgreSQL code:</p><!--kg-card-begin: markdown--><pre><code class="language-sql">-- sort weather data by precipitation desc first, wind_avg desc second, and pressure asc third
SELECT &quot;date&quot;, precipit, wind_avg, press_avg 
FROM weather w 
ORDER BY precipit DESC, wind_avg DESC, press_avg ASC;
</code></pre>
<!--kg-card-end: markdown--><p>Results:</p><!--kg-card-begin: markdown--><table>
<thead>
<tr>
<th>date</th>
<th>precipit</th>
<th>wind_avg</th>
<th>press_avg</th>
</tr>
</thead>
<tbody>
<tr>
<td>2017-08-27</td>
<td>13</td>
<td>15</td>
<td>30</td>
</tr>
<tr>
<td>2017-08-28</td>
<td>11</td>
<td>24</td>
<td>30</td>
</tr>
<tr>
<td>2019-09-20</td>
<td>9</td>
<td>9</td>
<td>30</td>
</tr>
<tr>
<td>2017-08-08</td>
<td>6</td>
<td>5</td>
<td>30</td>
</tr>
<tr>
<td>2017-08-29</td>
<td>5</td>
<td>22</td>
<td>30</td>
</tr>
<tr>
<td>2018-08-12</td>
<td>5</td>
<td>12</td>
<td>30</td>
</tr>
<tr>
<td>2016-02-06</td>
<td>4</td>
<td>8</td>
<td>30</td>
</tr>
<tr>
<td>2018-05-07</td>
<td>4</td>
<td>7</td>
<td>30</td>
</tr>
<tr>
<td>2019-10-05</td>
<td>3</td>
<td>9</td>
<td>30</td>
</tr>
<tr>
<td>2018-03-29</td>
<td>3</td>
<td>8</td>
<td>30</td>
</tr>
<tr>
<td>2016-03-06</td>
<td>3</td>
<td>5</td>
<td>30</td>
</tr>
<tr>
<td>2018-06-19</td>
<td>2</td>
<td>12</td>
<td>30</td>
</tr>
<tr>
<td>2019-08-05</td>
<td>2</td>
<td>11</td>
<td>30</td>
</tr>
<tr>
<td>2019-10-30</td>
<td>2</td>
<td>11</td>
<td>30</td>
</tr>
</tbody>
</table>
<!--kg-card-end: markdown--><p>Python code:</p><p>I have often found the different pandas or Python functions harder to recall off the top of my head. Given how popular the StackOverflow question is, I can imagine that many of you also had to refer to Google for how to do this initially. Note that the <code>ascending</code> list has to mirror the <code>DESC</code>, <code>DESC</code>, <code>ASC</code> order used in the SQL query.</p><!--kg-card-begin: markdown--><pre><code class="language-python">sorted_weather = weather_df[['date', 'precipit', 'wind_avg', 'press_avg']].sort_values(['precipit', 'wind_avg', 'press_avg'], ascending=[False, False, True])
print(sorted_weather)
</code></pre>
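<p>If it helps to see the mixed-direction sort without pandas, here is a small sketch using only built-in Python. The rows are a shuffled handful copied from the results table above, and the <code>weather_rows</code> name is hypothetical. Negating a numeric key is one simple way to get descending order for that column while leaving another ascending.</p>
<pre><code class="language-python"># (date, precipit, wind_avg, press_avg) rows copied from the results table above
weather_rows = [
    ('2018-08-12', 5, 12, 30),
    ('2017-08-28', 11, 24, 30),
    ('2017-08-29', 5, 22, 30),
    ('2017-08-27', 13, 15, 30),
    ('2016-02-06', 4, 8, 30),
]

# precipit DESC, wind_avg DESC, press_avg ASC: negate the descending keys
sorted_rows = sorted(weather_rows, key=lambda r: (-r[1], -r[2], r[3]))
sorted_dates = [r[0] for r in sorted_rows]

print(sorted_dates)
</code></pre>
<p>The two rows with a precipitation of 5 are ordered by wind speed, highest first, which matches the order in the SQL results.</p>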
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><p><a name="technique4"></a></p>
<!--kg-card-end: markdown--><h3 id="displaying-grouped-data">Displaying grouped data</h3><p>Finding the sum of energy usage from data that records energy per hour can be instrumental in understanding data patterns. This concept boils down to performing a type of aggregation over a particular column. Between PostgreSQL and TimescaleDB, we have access to almost every type of aggregation function we could need. I will show some of these functions in this blog series, but I strongly encourage all of you to <a href="https://www.postgresql.org/docs/current/functions-aggregate.html">look up more</a> for your own use!</p><p>In the categorical section earlier, I mentioned that I suspect people could have different energy behavior patterns on weekdays vs. weekends, particularly in a single-family home in the US. Given my data set, I’m curious about this hypothesis and want to find the cumulative energy consumption across each day of the week. </p><p>To do so, I need to sum all the kWh data (<code>value_kwh</code>) in the power table, then group this data by the day of the week (<code>day_of_week</code>). In order to sum my data in PostgreSQL, I will use the <code><a href="https://www.postgresql.org/docs/current/functions-aggregate.html">SUM()</a></code> function. Because this is an aggregation function, I will have to include something that tells the database what to sum over. Since I want to know the sum of energy over each type of day, I can specify that the sum should be grouped by the <code>day_of_week</code> column using the <code>GROUP BY</code> keyword. I also added the <code>ORDER BY</code> keyword so that we could look at the weekly summed usage in order of the day. </p><p>PostgreSQL code:</p><!--kg-card-begin: markdown--><p><a name="groupby"></a></p>
<pre><code class="language-sql">-- first I select the day_of_week col, then I define SUM(value_kwh) to get the sum of value_kwh col
SELECT day_of_week, SUM(value_kwh) -- sum the value_kwh column
FROM power_usage pu 
GROUP BY day_of_week -- group by the day_of_week col
ORDER BY day_of_week ASC; -- decided to order data by the day_of_week asc
</code></pre>
<!--kg-card-end: markdown--><p>Results:</p><p>After some quick investigation, I found that the value <code>0</code> in the <code>day_of_week</code> column represents Monday. That makes <code>5</code> and <code>6</code> the weekend days, and they show the highest totals, so my hypothesis may just be right. </p><!--kg-card-begin: markdown--><table>
<thead>
<tr>
<th>day_of_week</th>
<th>sum</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>3849</td>
</tr>
<tr>
<td>1</td>
<td>3959</td>
</tr>
<tr>
<td>2</td>
<td>3947</td>
</tr>
<tr>
<td>3</td>
<td>4094</td>
</tr>
<tr>
<td>4</td>
<td>3987</td>
</tr>
<tr>
<td>5</td>
<td>4169</td>
</tr>
<tr>
<td>6</td>
<td>4311</td>
</tr>
</tbody>
</table>
<!--kg-card-end: markdown--><p>Python code:</p><p>Something to note about the pandas <code>groupby()</code> function is that the group by column in the DataFrame will become the index column in the resulting aggregated DataFrame. This can add some extra work later on.</p><!--kg-card-begin: markdown--><pre><code class="language-python">day_agg_power = power_df.groupby('day_of_week').agg({'value_kwh' : 'sum'})
print(day_agg_power)
</code></pre>
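<p>As a quick sanity check on the weekday vs. weekend hypothesis, here is a small sketch that averages the summed totals from the results table above. It assumes, as noted, that <code>0</code> is Monday, so <code>5</code> and <code>6</code> are the weekend. The variable names are hypothetical, and this is only a rough comparison of the per-day totals, not a statistical test.</p>
<pre><code class="language-python"># Summed kWh per day_of_week, copied from the results table above (0 = Monday)
kwh_by_day = {0: 3849, 1: 3959, 2: 3947, 3: 4094, 4: 3987, 5: 4169, 6: 4311}

weekday_totals = [kwh_by_day[d] for d in range(0, 5)]
weekend_totals = [kwh_by_day[d] for d in (5, 6)]

# Average total per weekday and per weekend day
weekday_avg = sum(weekday_totals) / len(weekday_totals)
weekend_avg = sum(weekend_totals) / len(weekend_totals)

print(weekday_avg, weekend_avg)
</code></pre>
<p>With these numbers, the average weekend-day total (4240.0) comes out higher than the average weekday total (3967.2), which is consistent with the hypothesis.</p>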
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><p><a name="technique5"></a></p>
<!--kg-card-end: markdown--><h3 id="finding-abnormalities-in-the-database">Finding abnormalities in the database</h3><p>Clean data is fundamental in producing accurate analysis, and abnormalities/errors can be a huge roadblock to clean data. An essential part of evaluating data is finding abnormalities to determine if an error caused them. No data set is perfect, so it is vital to hunt down any possible errors in preparation for the cleaning stage of our analysis. Let's look at one example of how to uncover issues in a dataset using our example energy data.</p><p>After looking at the raw data in my <code>power_usage</code> table, I found that the <code>notes</code> and <code>day_of_week</code> columns <strong>should be the same for each hour across a single day </strong>(there are 24 hourly readings each day, and each hour is supposed to have the same <code>notes</code> value). In my experience with data analysis, I have found that notes which need to be recorded granularly often have mistakes within them. Because of this, I wanted to investigate whether or not this pattern was consistent across all of the data.</p><p>To check this hypothesis I can use the TimescaleDB <a href="https://docs.timescale.com/api/latest/hyperfunctions/time_bucket/#time-bucket"><code>time_bucket()</code></a> function, PostgreSQL’s <code>GROUP BY</code> keyword, and <a href="https://www.postgresql.org/docs/13/queries-with.html">CTEs</a> (common table expressions). While the <code>GROUP BY</code> keyword is likely familiar to you by now, CTEs and the <code>time_bucket()</code> function are not. So, before I show the query, let’s dive into these two features.</p><p><strong>Time bucket function</strong></p><p>The <code>time_bucket()</code> function allows you to take a timestamp column like <code>startdate</code> in the <code>power_usage</code> table, and “bucket” the time based on the interval of your choice. 
For example, <code>startdate</code> is a timestamp column that shows values for each hour in a day. You could use the <code>time_bucket()</code> function on this column to “bucket” the hourly data into daily data. </p><p>Here is an image that shows how rows of the <code>startdate</code> column are bucketed into one aggregate row with <code>time_bucket('1 day', startdate)</code>.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/09/image.png" class="kg-image" alt="Image showing how hourly data from 2016-01-01 00:00:00 - 2016-01-01 23:00:00 is bucketed to 2016-01-01 00:00:00 using the time_bucket() function " loading="lazy" width="1498" height="846" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2021/09/image.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2021/09/image.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/09/image.png 1498w" sizes="(min-width: 720px) 720px"></figure><p>After using the <code>time_bucket()</code> function in my query, I will have one unique “date” value for any data recorded over a single day. Since <code>notes</code> and <code>day_of_week</code> should also be unique over each day, if I <em>group by</em> these columns, I should get a single set of (date, day_of_week, notes) values for each day. </p><p>Notice that to use <code>GROUP BY</code> in this scenario, I just list the columns I want to group on. Also, notice that I added <code>AS</code> after my <code>time_bucket()</code> function. This keyword allows you to "rename" columns. In the results, look for the <code>day</code> column, as this comes directly from my rename. </p><p>PostgreSQL code:</p><!--kg-card-begin: markdown--><p><a name="timebucket"></a></p>
<pre><code class="language-sql">-- select the date through time_bucket and get unique values for each 
-- (date, day_of_week, notes) set
SELECT 
	time_bucket(interval '1 day', startdate ) AS day,
	day_of_week,
	notes
FROM power_usage pu 
GROUP BY day, day_of_week, notes;
</code></pre>
<!--kg-card-end: markdown--><p>Results: Some of the rows</p><!--kg-card-begin: markdown--><table>
<thead>
<tr>
<th>day</th>
<th>day_of_week</th>
<th>notes</th>
</tr>
</thead>
<tbody>
<tr>
<td>2017-01-19 00:00:00</td>
<td>3</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-10-06 00:00:00</td>
<td>3</td>
<td>weekday</td>
</tr>
<tr>
<td>2017-06-04 00:00:00</td>
<td>6</td>
<td>weekend</td>
</tr>
<tr>
<td>2019-01-03 00:00:00</td>
<td>3</td>
<td>weekday</td>
</tr>
<tr>
<td>2017-10-01 00:00:00</td>
<td>6</td>
<td>weekend</td>
</tr>
<tr>
<td>2019-11-27 00:00:00</td>
<td>2</td>
<td>weekday</td>
</tr>
<tr>
<td>2017-06-15 00:00:00</td>
<td>3</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-11-16 00:00:00</td>
<td>2</td>
<td>weekday</td>
</tr>
<tr>
<td>2017-05-18 00:00:00</td>
<td>3</td>
<td>weekday</td>
</tr>
<tr>
<td>2018-07-17 00:00:00</td>
<td>1</td>
<td>weekday</td>
</tr>
<tr>
<td>2020-03-06 00:00:00</td>
<td>4</td>
<td>weekday</td>
</tr>
<tr>
<td>2018-10-14 00:00:00</td>
<td>6</td>
<td>weekend</td>
</tr>
</tbody>
</table>
<!--kg-card-end: markdown--><p>Python code:</p><p>In my Python code, I cannot just manipulate the table to print results; I actually have to create another column in the DataFrame.</p><!--kg-card-begin: markdown--><pre><code class="language-python">day_col = pd.to_datetime(power_df['startdate']).dt.strftime('%Y-%m-%d')
power_df.insert(0, 'date_day', day_col)
power_unique = power_df[['date_day', 'day_of_week', 'notes']].drop_duplicates()
print(power_unique)
</code></pre>
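<p>To show the idea behind daily bucketing in isolation, here is a minimal sketch in plain Python. The <code>bucket_to_day</code> helper is hypothetical and only approximates what <code>time_bucket(INTERVAL '1 day', startdate)</code> does for timezone-naive timestamps; it ignores the origin and time zone options that the real function supports.</p>
<pre><code class="language-python">from datetime import datetime

def bucket_to_day(ts):
    # Truncate a timestamp to midnight of the same day (hypothetical helper,
    # a simplified stand-in for a one-day time bucket)
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)

# Hypothetical hourly readings covering one full day
hourly = [datetime(2016, 1, 1, hour) for hour in range(24)]

# Every hourly timestamp collapses into the same daily bucket
buckets = {bucket_to_day(ts) for ts in hourly}

print(buckets)
</code></pre>
<p>All 24 hourly timestamps end up in a single bucket, 2016-01-01 00:00:00, just like in the image above.</p>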
<!--kg-card-end: markdown--><p>Now that we understand the <code>time_bucket()</code> function a little better, let's look at CTEs and how they help me use this bucketed data to find any errors within the <code>notes</code> column. </p><p><strong>CTEs or common table expressions</strong></p><p>Getting unique sets of data only solves half of my problem. Now I want to verify that each day is truly mapped to a single <code>day_of_week</code> and <code>notes</code> pair. This is where CTEs come in handy. With CTEs, you can build a query on top of the results of other queries. </p><p>CTEs use the following format 👇</p><!--kg-card-begin: markdown--><pre><code class="language-sql">WITH query_1 AS (
SELECT -- columns expressions
FROM table_name
)
SELECT --column expressions 
FROM query_1;
</code></pre>
<!--kg-card-end: markdown--><p><code>WITH</code> and <code>AS</code> allow you to define the first query, then in the second <code>SELECT</code> statement, you can call the results from the first query as if it were another table in the database. </p><p>To check that each day was “mapped” to a single <code>day_of_week</code> and <code>notes</code> pair, I need to aggregate the queried <code>time_bucket()</code> table above based upon the date column using another PostgreSQL aggregation function <code>COUNT()</code>. I am doing this because each day <em>should</em> only count one unique <code>day_of_week</code> and <code>notes</code> pair. If the count results in two or more, this implies that one day contains multiple <code>day_of_week</code> and <code>notes</code> pairs and thus is showing abnormal data. </p><p>Additionally, I will add a <code>HAVING</code> statement into my query so that the output only displays rows where the <code>COUNT(day)</code> is greater than one. I will also throw in an <code>ORDER BY</code> statement in case we have many different values greater than 1.</p><p>PostgreSQL code:</p><!--kg-card-begin: markdown--><p><a name="cte"></a></p>
<pre><code class="language-sql">WITH power_unique AS (
-- query from above, get unique set of (date, day_of_week, notes)
SELECT 
	time_bucket(INTERVAL '1 day', startdate ) AS day,
	day_of_week,
	notes
FROM power_usage pu 
GROUP BY day, day_of_week, notes
)
-- calls data from the query above, using the COUNT() agg function
SELECT day, COUNT(day) 
FROM power_unique
GROUP BY day
HAVING COUNT(day) &gt; 1
ORDER BY COUNT(day) DESC;
</code></pre>
<!--kg-card-end: markdown--><p>Results:</p><!--kg-card-begin: markdown--><table>
<thead>
<tr>
<th>day</th>
<th>count</th>
</tr>
</thead>
<tbody>
<tr>
<td>2017-12-27 00:00:00</td>
<td>2</td>
</tr>
<tr>
<td>2020-01-03 00:00:00</td>
<td>2</td>
</tr>
<tr>
<td>2018-06-02 00:00:00</td>
<td>2</td>
</tr>
<tr>
<td>2019-06-03 00:00:00</td>
<td>2</td>
</tr>
<tr>
<td>2020-07-01 00:00:00</td>
<td>2</td>
</tr>
<tr>
<td>2016-07-21 00:00:00</td>
<td>2</td>
</tr>
</tbody>
</table>
<!--kg-card-end: markdown--><p>Python code:</p><p>Because of the count aggregation, I needed to rename the column in my <code>agg_power_unique</code> DataFrame so that I could then sort the values.</p><!--kg-card-begin: markdown--><pre><code class="language-python">day_col = pd.to_datetime(power_df['startdate']).dt.strftime('%Y-%m-%d')
# Only add the 'date_day' column if the previous snippet has not already inserted it
if 'date_day' not in power_df.columns:
    power_df.insert(0, 'date_day', day_col)
power_unique = power_df[['date_day', 'day_of_week', 'notes']].drop_duplicates()
agg_power_unique = power_unique.groupby('date_day').agg({'date_day' : 'count'})
agg_power_unique = agg_power_unique.rename(columns={'date_day': 'count'})
print(agg_power_unique.loc[agg_power_unique['count'] &gt; 1].sort_values('count', ascending=False))
</code></pre>
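<p>The two-step shape of the CTE query (first get unique sets, then count them per day) can also be sketched with Python's built-in <code>Counter</code>. The rows below are made up for illustration and are not taken from my data set; only the date <code>2017-12-27</code> is borrowed from the results above.</p>
<pre><code class="language-python">from collections import Counter

# Hypothetical unique (day, day_of_week, notes) sets, shaped like the first query's output
unique_sets = [
    ('2017-12-26', 1, 'weekday'),
    ('2017-12-27', 2, 'weekday'),
    ('2017-12-27', 2, 'vacation'),
    ('2017-12-28', 3, 'vacation'),
]

# Second step: count rows per day, like COUNT(day) with GROUP BY day
rows_per_day = Counter(day for day, _, _ in unique_sets)

# Like the HAVING filter: every day has at least one row, so "not exactly one"
# is the same as "more than one"
suspicious_days = sorted(day for day, n in rows_per_day.items() if n != 1)

print(suspicious_days)
</code></pre>
<p>Only the day that carries two different <code>notes</code> values is flagged, which is the same kind of row the SQL query returns.</p>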
<!--kg-card-end: markdown--><p>This query reveals that I indeed have a handful of data points that seem suspicious. Specifically, the dates [2017-12-27, 2020-01-03, 2018-06-02, 2019-06-03, 2020-07-01, 2016-07-21]. I will demonstrate how to fix these date issues in a later blog post about cleaning techniques. </p><p>This example only shows one set of functions that helped me identify abnormal data through grouping and aggregation. You can use many other PostgreSQL and TimescaleDB functions to find other abnormalities in your data, such as TimescaleDB’s <code>approx_percentile()</code> function (introduced next), which can help you find outliers in numeric columns through interquartile range calculations.</p><!--kg-card-begin: markdown--><p><a name="technique6"></a></p>
<!--kg-card-end: markdown--><h3 id="looking-at-general-trends">Looking at general trends</h3><p>Arguably, one of the more critical aspects of evaluating your data is understanding the general trends. To do this, you need to get basic statistics on your data using functions like mean, interquartile range, maximum values, and others. TimescaleDB has created many optimized hyperfunctions to perform these very tasks.</p><p>To calculate these values, I am going to introduce the following TimescaleDB functions: <a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/approx_percentile/"><code>approx_percentile</code></a>, <a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/min_val/"><code>min_val</code></a>, <a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/max_val/"><code>max_val</code></a>, <a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/mean/"><code>mean</code></a>, <a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/num_vals-pct/"><code>num_vals</code></a>, <a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/percentile_agg/"><code>percentile_agg</code></a> (aggregate), and <a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/percentile-aggregation-methods/tdigest/"><code>tdigest</code></a> (aggregate).</p><p>These hyperfunctions fall under the TimescaleDB category of two-step aggregation. Timescale designed each function to either be an aggregate or accessor function (I noted which ones above were aggregate functions). 
In two-step aggregation, the more computationally expensive aggregate function is calculated first, and the accessor function is then applied to its result.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Untitled--1--1.png" class="kg-image" alt="accessor_function (aggregate_function())" loading="lazy"></figure><p>For specifics on how two-step aggregation works and why we use this convention, check out <a href="https://timescale.ghost.io/blog/blog/how-postgresql-aggregation-works-and-how-it-inspired-our-hyperfunctions-design-2/">David Kohn’s blog series on our hyperfunctions and two-step aggregation</a>. </p><p>I definitely want to understand the basic trends within the <code>power_usage</code> table for my data set. If I plan to do any type of modeling to predict future usage trends, I need to know some basic information about what this home’s usage looks like daily. </p><p>To understand the daily power usage data distribution, I’ll need to aggregate the energy usage per day. To do this, I can use the <code>time_bucket()</code> function I mentioned above, along with the <code>SUM()</code> function. </p><!--kg-card-begin: markdown--><pre><code class="language-sql">-- bucket the daily data using time_bucket, sum kWh over each bucketed day
SELECT 
	time_bucket(INTERVAL '1 day', startdate ) AS day,
	SUM(value_kwh)
FROM power_usage pu 
GROUP BY day;
</code></pre>
<!--kg-card-end: markdown--><p>I then want to find the 1st, 10th, 25th, 75th, 90th, and 99th percentiles, the median or 50th percentile, mean, minimum value, maximum value, number of readings in the table, and interquartile range of this data. Creating the query with a CTE simplifies the process by only calculating the sum of data once and reusing the value multiple times.</p><p>PostgreSQL:</p><!--kg-card-begin: markdown--><p><a name="stats"></a></p>
<pre><code class="language-sql">WITH power_usage_sum AS (
-- bucket the daily data using time_bucket, sum kWh over each bucketed day
SELECT 
	time_bucket(INTERVAL '1 day', startdate ) AS day,
	SUM(value_kwh) AS sum_kwh
FROM power_usage pu 
GROUP BY day
)
-- using two-step aggregation functions to find stats
SELECT approx_percentile(0.01,percentile_agg(sum_kwh)) AS &quot;1p&quot;,
approx_percentile(0.10,percentile_agg(sum_kwh)) AS &quot;10p&quot;,
approx_percentile(0.25,percentile_agg(sum_kwh)) AS &quot;25p&quot;,
approx_percentile(0.5,percentile_agg(sum_kwh)) AS &quot;50p&quot;,
approx_percentile(0.75,percentile_agg(sum_kwh)) AS &quot;75p&quot;,
approx_percentile(0.90,percentile_agg(sum_kwh)) AS &quot;90p&quot;,
approx_percentile(0.99,percentile_agg(sum_kwh)) AS &quot;99p&quot;,
min_val(tdigest(100, sum_kwh)),
max_val(tdigest(100, sum_kwh)),
mean(percentile_agg(sum_kwh)),
num_vals(percentile_agg(sum_kwh)),
-- you can use subtraction to create an output for the IQR
approx_percentile(0.75,percentile_agg(sum_kwh)) - approx_percentile(0.25,percentile_agg(sum_kwh)) AS iqr
FROM power_usage_sum pus;
</code></pre>
<!--kg-card-end: markdown--><p>Results:</p><!--kg-card-begin: markdown--><div style="width:100%;overflow:auto;">
<table>
<thead>
<tr>
<th>1p</th>
<th>10p</th>
<th>25p</th>
<th>50p</th>
<th>75p</th>
<th>90p</th>
<th>99p</th>
<th>min_val</th>
<th>max_val</th>
<th>mean</th>
<th>num_vals</th>
<th>iqr</th>
</tr>
</thead>
<tbody>
<tr>
<td>0.0</td>
<td>4.0028</td>
<td>6.9936</td>
<td>16.0066</td>
<td>28.9914</td>
<td>38.9781</td>
<td>56.9971</td>
<td>0.0</td>
<td>73.0</td>
<td>18.9025</td>
<td>1498.0</td>
<td>21.9978</td>
</tr>
</tbody>
</table>
</div><!--kg-card-end: markdown--><p>Python:</p><p>Something that really stumped me when initially writing this code snippet was that I had to use <code>astype(float)</code> on my <code>value_kwh</code> column to use <code>describe()</code>. I have probably spent the combined time of a day over my life trying to deal with value types being incompatible with certain functions. This is another reason why I enjoy data munging with the intuitive functionality of PostgreSQL and TimescaleDB; these types of problems just happen less often. And let me tell you, the faster and more painless data munging is, the happier I am!</p><!--kg-card-begin: markdown--><pre><code class="language-python">agg_power = power_df.groupby('date_day').agg({'value_kwh' : 'sum'})
# need to make the value_kwh column the right data type
agg_power.value_kwh = agg_power.value_kwh.astype(float)
describe = agg_power.value_kwh.describe()
percentiles = agg_power.value_kwh.quantile([.01, .1, .9, .99])
q75, q25 = np.percentile(agg_power['value_kwh'], [75 ,25])
iqr = q75 - q25
print(describe, percentiles, iqr)
</code></pre>
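<p>Since I mentioned using interquartile range calculations to look for outliers, here is a short sketch of one way to do that with the quartiles from the results table above. The 1.5 × IQR fences are a common convention that I am assuming here for illustration; the right multiplier depends on your data, and the variable names are hypothetical.</p>
<pre><code class="language-python"># Approximate quartiles of daily kWh, copied from the results table above
q25, q75 = 6.9936, 28.9914
iqr = q75 - q25

# Assumption: the conventional 1.5 * IQR fences, chosen only for illustration
lower_fence = q25 - 1.5 * iqr
upper_fence = q75 + 1.5 * iqr

print(round(iqr, 4), round(lower_fence, 4), round(upper_fence, 4))
</code></pre>
<p>Under this assumption the upper fence lands at roughly 61.99 kWh, so the maximum daily value of 73.0 from the results would be worth a closer look. The lower fence is negative, so no low-usage days would be flagged.</p>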
<!--kg-card-end: markdown--><p>Another technique you may want to use for assessing the distribution of data in a column is a histogram. Generally, creating an image is where Python and other tools shine. However, I often need to glance at a histogram to check for any blatant anomalies when evaluating data. While this one technique in TimescaleDB may not be as simple as the Python solution, I can still do this directly in my database, which can be convenient. </p><p>To create a histogram in the database, we will need to use the TimescaleDB <a href="https://docs.timescale.com/api/latest/hyperfunctions/histogram/#required-arguments"><code>histogram()</code></a> function, <a href="https://www.postgresql.org/docs/13/functions-array.html"><code>unnest()</code></a>, <a href="https://www.postgresql.org/docs/13/functions-srf.html"><code>generate_series()</code></a>, <a href="https://www.postgresql.org/docs/13/functions-string.html"><code>repeat()</code></a>, and CTEs.</p><p>The <code>histogram()</code> function takes in the column you want to analyze and produces an array object which contains the frequency values across the number of buckets plus two (one additional bucket for values below the lowest bucket and one for values above the highest bucket). You can then use PostgreSQL’s <code>unnest()</code> function to break up the array into a single column with rows equal to two plus the number of specified buckets. </p><p>Once you have a column with bucket frequencies, you can then create a histogram “image” using the PostgreSQL <code>repeat()</code> function. The first time I saw someone use the <code>repeat()</code> function in this way was in <a href="https://hakibenita.com/sql-for-data-analysis">Haki Benita’s blog post</a>, which I recommend reading if you are interested in learning more PostgreSQL analytical techniques. The <code>repeat()</code> function essentially creates a string that repeats chosen characters a specified number of times. 
To use the histogram frequency values, you pass the unnested histogram values in as the number of repetitions. </p><p>Additionally, I find it useful to know the approximate starting values for each bucket in the histogram. This gives me a better picture of which values occur and how often. To approximate the bin values, I use the PostgreSQL <code>generate_series()</code> function along with some algebra:</p><!--kg-card-begin: markdown--><pre><code class="language-sql">(generate_series(-1, [number_of_buckets]) * [max_val - min_val]::float/[number_of_buckets]::float) + [min_val]
</code></pre>
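<p>To check the algebra, here is the same bucket-start calculation as a small Python sketch. It plugs in the minimum (0) and maximum (73) from the statistics results above and assumes 30 buckets as an example; the variable names simply mirror the placeholders in the formula.</p>
<pre><code class="language-python"># min and max from the statistics results above; 30 buckets is an example choice
min_val, max_val, number_of_buckets = 0, 73, 30

# range(-1, number_of_buckets + 1) mirrors generate_series(-1, 30), which includes both ends
bucket_starts = [
    round(i * (max_val - min_val) / number_of_buckets + min_val, 2)
    for i in range(-1, number_of_buckets + 1)
]

print(len(bucket_starts), bucket_starts[:4], bucket_starts[-1])
</code></pre>
<p>This produces 32 values, one per histogram bucket including the two extra buckets, starting at -2.43, 0.0, 2.43, 4.87 and ending at 73.0.</p>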
<!--kg-card-end: markdown--><p>When I put all these techniques together, I am able to get a histogram with the following,</p><p>PostgreSQL:</p><!--kg-card-begin: markdown--><p><a name="hist"></a></p>
<pre><code class="language-sql">WITH power_usage_sum AS (
-- bucket the daily data using time_bucket, sum kWh over each bucketed day
SELECT 
	time_bucket(INTERVAL '1 day', startdate ) AS day,
	SUM(value_kwh) AS sum_kwh
FROM power_usage pu 
GROUP BY day
),
histogram AS (
-- I input the column = sum_kwh, the min value = 0, max value = 73, and number of buckets = 30
SELECT histogram(sum_kwh, 0, 73, 30)
FROM power_usage_sum w 
)
SELECT 
-- I use unnest to create the first column
   unnest(histogram) AS count, 
-- I use my approximate bucket values function
   (generate_series(-1, 30) * 73::float/30::float) + 0 AS approx_bucket_start_val,
-- I then use the repeat function to display the frequency
   repeat('■', unnest(histogram)) AS frequency
FROM histogram;
</code></pre>
<!--kg-card-end: markdown--><p>Results:</p><!--kg-card-begin: markdown--><div style="width:100%;overflow:auto;">
<table>
<thead>
<tr>
<th>count</th>
<th>approx_bucket_start_val</th>
<th>frequency</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>-2.43</td>
<td></td>
</tr>
<tr>
<td>83</td>
<td>0.0</td>
<td>■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■</td>
</tr>
<tr>
<td>104</td>
<td>2.43</td>
<td>■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■</td>
</tr>
<tr>
<td>207</td>
<td>4.87</td>
<td>■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■</td>
</tr>
<tr>
<td>105</td>
<td>7.3</td>
<td>■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■</td>
</tr>
<tr>
<td>150</td>
<td>9.73</td>
<td>■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■</td>
</tr>
<tr>
<td>76</td>
<td>12.17</td>
<td>■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■</td>
</tr>
<tr>
<td>105</td>
<td>14.6</td>
<td>■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■</td>
</tr>
<tr>
<td>62</td>
<td>17.03</td>
<td>■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■</td>
</tr>
<tr>
<td>48</td>
<td>19.47</td>
<td>■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■</td>
</tr>
<tr>
<td>77</td>
<td>21.9</td>
<td>■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■</td>
</tr>
<tr>
<td>35</td>
<td>24.33</td>
<td>■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■</td>
</tr>
<tr>
<td>83</td>
<td>26.77</td>
<td>■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■</td>
</tr>
<tr>
<td>42</td>
<td>29.2</td>
<td>■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■</td>
</tr>
<tr>
<td>72</td>
<td>31.63</td>
<td>■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■</td>
</tr>
<tr>
<td>46</td>
<td>34.07</td>
<td>■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■</td>
</tr>
<tr>
<td>51</td>
<td>36.5</td>
<td>■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■</td>
</tr>
<tr>
<td>39</td>
<td>38.93</td>
<td>■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■</td>
</tr>
<tr>
<td>32</td>
<td>41.37</td>
<td>■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■</td>
</tr>
<tr>
<td>24</td>
<td>43.8</td>
<td>■■■■■■■■■■■■■■■■■■■■■■■■</td>
</tr>
<tr>
<td>16</td>
<td>46.23</td>
<td>■■■■■■■■■■■■■■■■</td>
</tr>
<tr>
<td>17</td>
<td>48.67</td>
<td>■■■■■■■■■■■■■■■■■</td>
</tr>
<tr>
<td>4</td>
<td>51.1</td>
<td>■■■■</td>
</tr>
<tr>
<td>3</td>
<td>53.53</td>
<td>■■■</td>
</tr>
<tr>
<td>5</td>
<td>55.97</td>
<td>■■■■■</td>
</tr>
<tr>
<td>4</td>
<td>58.4</td>
<td>■■■■</td>
</tr>
<tr>
<td>5</td>
<td>60.83</td>
<td>■■■■■</td>
</tr>
<tr>
<td>1</td>
<td>63.27</td>
<td>■</td>
</tr>
<tr>
<td>0</td>
<td>65.7</td>
<td></td>
</tr>
<tr>
<td>1</td>
<td>68.13</td>
<td>■</td>
</tr>
<tr>
<td>0</td>
<td>70.57</td>
<td></td>
</tr>
<tr>
<td>1</td>
<td>73.0</td>
<td>■</td>
</tr>
</tbody>
</table>
</div><!--kg-card-end: markdown--><p>Python:</p><p>This Python code is definitely better. It’s simple and relatively painless. I wanted to show this comparison to provide an option for displaying a histogram directly in your database vs. having to pull the data into a pandas DataFrame and then displaying it. Doing the histogram in the database just helps me keep my focus while evaluating the data.</p><!--kg-card-begin: markdown--><pre><code class="language-python">plt.hist(agg_power.value_kwh, bins=30)
plt.show()
</code></pre>
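<p>To make the comparison concrete, the text-based histogram from the SQL output above can also be sketched in plain Python. This is a minimal sketch under stated assumptions, not the code used in this post: the <code>text_histogram</code> helper and the randomly generated <code>sample_kwh</code> values are hypothetical stand-ins for the <code>agg_power.value_kwh</code> column, and the equal-width bucketing only approximates what the SQL version does.</p>

```python
import random


def text_histogram(values, bins=30, symbol="■"):
    """Return one (count, bucket_start, bar) row per equal-width bucket."""
    lo, hi = min(values), max(values)
    # Fall back to a width of 1.0 when every value is identical.
    width = (hi - lo) / bins or 1.0
    counts = [0] * bins
    for v in values:
        # The maximum value is folded into the last bucket.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return [(c, round(lo + i * width, 2), symbol * c) for i, c in enumerate(counts)]


# Hypothetical stand-in for agg_power.value_kwh: 500 skewed, positive values.
rng = random.Random(0)
sample_kwh = [rng.gammavariate(4.0, 6.0) for _ in range(500)]

rows = text_histogram(sample_kwh, bins=30)
for count, bucket_start, bar in rows:
    print(str(count).rjust(4), f"{bucket_start:7.2f}", bar)
```

<p>Each printed row mirrors one row of the table above: the count, the lower edge of the bucket, and a bar with one block per value. It is handy when you want a quick look in a terminal without opening a plotting window.</p>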
<!--kg-card-end: markdown--><h2 id="wrap-up"><br>Wrap Up</h2><p>Hopefully, after reading through these evaluation techniques, you feel more comfortable exploring some of the possibilities that PostgreSQL and TimescaleDB provide. Evaluating data directly in the database often saved me time without sacrificing any functionality. If you are looking to save time and effort while evaluating your data for analysis, definitely consider using PostgreSQL and TimescaleDB. </p><p>In my next posts, I will go over techniques to clean and transform data using PostgreSQL and TimescaleDB. I'll then take everything we've learned together to benchmark data munging tasks in PostgreSQL and TimescaleDB vs. Python and pandas. The final blog post will walk you through the full process on a real dataset by conducting deep-dive data analysis with TimescaleDB (for data munging) and Python (for modeling and visualizations).</p><p>If you have questions about TimescaleDB, time-series data, or any of the functionality mentioned above, join our <a href="https://slack.timescale.com/">community Slack</a>, where you'll find an active community of time-series enthusiasts and various Timescale team members (including me!).</p><p>If you’re ready to see the power of TimescaleDB and PostgreSQL right away, you can <a href="https://www.timescale.com/timescale-signup">sign up for a free 30-day trial</a> or <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/install-timescaledb/self-hosted/">install TimescaleDB and manage it on your current PostgreSQL instances</a>. We also have a bunch of <a href="https://docs.timescale.com/timescaledb/latest/tutorials/">great tutorials</a> to help get you started.</p><p>Until next time!</p><p><strong>Functionality Glossary </strong></p><!--kg-card-begin: markdown--><ul>
<li><a href="#select"><code>SELECT</code></a></li>
<li><a href="#select"><code>FROM</code></a></li>
<li><a href="#orderby"><code>ORDER BY</code></a></li>
<li><a href="#orderby"><code>DESC</code></a></li>
<li><a href="#orderby"><code>ASC</code></a></li>
<li><a href="#limit"><code>LIMIT</code></a></li>
<li><a href="#orderby"><code>DISTINCT</code></a></li>
<li><a href="#groupby"><code>GROUP BY</code></a></li>
<li><a href="#groupby"><code>SUM()</code></a></li>
<li><a href="#timebucket"><code>time_bucket(&lt;time_interval&gt;, &lt;time_col&gt;)</code></a></li>
<li><a href="#cte">CTEs <code>WITH</code> <code>AS</code></a></li>
<li><a href="#cte"><code>COUNT()</code></a></li>
<li><a href="#stats"><code>approx_percentile()</code></a></li>
<li><a href="#stats"><code>min_val()</code></a></li>
<li><a href="#stats"><code>max_val()</code></a></li>
<li><a href="#stats"><code>mean()</code></a></li>
<li><a href="#stats"><code>num_vals()</code></a></li>
<li><a href="#stats"><code>percentile_agg()</code> [aggregate]</a></li>
<li><a href="#stats"><code>tdigest()</code> [aggregate]</a></li>
<li><a href="#hist"><code>histogram()</code></a></li>
<li><a href="#hist"><code>unnest()</code></a></li>
<li><a href="#hist"><code>generate_series()</code></a></li>
<li><a href="#hist"><code>repeat()</code></a></li>
</ul>
<!--kg-card-end: markdown--><p><br></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How Percentiles Work (and Why They're Better Than Averages)]]></title>
            <description><![CDATA[Get a primer on percentile approximations and why they're useful for analyzing large time-series datasets.]]></description>
            <link>https://www.tigerdata.com/blog/how-percentiles-work-and-why-theyre-better-than-averages</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/how-percentiles-work-and-why-theyre-better-than-averages</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Engineering]]></category>
            <category><![CDATA[Hyperfunctions]]></category>
            <dc:creator><![CDATA[David Kohn]]></dc:creator>
            <pubDate>Tue, 14 Sep 2021 15:41:30 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/09/maxim-hopman-fiXLQXAhCfk-unsplash--1-.jpg">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/09/maxim-hopman-fiXLQXAhCfk-unsplash--1-.jpg" alt="How Percentiles Work (and Why They're Better Than Averages)" /><p>In my recent post on <a href="https://timescale.ghost.io/blog/blog/what-time-weighted-averages-are-and-why-you-should-care/">time-weighted averages</a>, I described how my early career as an electrochemist exposed me to the importance of time-weighted averages, which shaped how we built them into TimescaleDB hyperfunctions. </p><p>A few years ago, soon after I started learning more about PostgreSQL internals (check out my <a href="https://timescale.ghost.io/blog/blog/how-postgresql-aggregation-works-and-how-it-inspired-our-hyperfunctions-design-2/">aggregation and two-step aggregates</a> post to learn about them yourself!), I worked on backends for an ad analytics company, where I started using TimescaleDB.</p><p>Like most companies, we cared a lot about making sure our website and API calls returned results in a reasonable amount of time for the user; we had billions of rows in our analytics databases, but we still wanted to make sure that the website was responsive and useful. </p><p>There’s a direct correlation between website performance and business results: users get bored if they have to wait too long for results, which is obviously not ideal from a business and customer loyalty perspective. 
To understand how our website performed and find ways to improve, we tracked the timing of our API calls and used API call response time as a key metric.</p><p>Monitoring an API is a common scenario and generally falls under the category of application performance monitoring (APM), but there are lots of similar scenarios in other fields, including:</p><ol><li>Predictive maintenance for industrial machines</li><li>Fleet monitoring for shipping companies</li><li>Energy and water use monitoring and anomaly detection</li></ol><p>Of course, analyzing raw (usually time-series) data only gets you so far. You want to analyze trends, understand how your system performs relative to what you and your users expect, catch and fix issues before they impact production users, and so much more. We <a href="https://timescale.ghost.io/blog/blog/introducing-hyperfunctions-new-sql-functions-to-simplify-working-with-time-series-data-in-postgresql/">built TimescaleDB hyperfunctions to help solve this problem and simplify how developers work with time-series data</a>.</p><p>For reference, hyperfunctions are a series of SQL functions that make it easier to manipulate and analyze time-series data in PostgreSQL with fewer lines of code. You can use hyperfunctions to calculate percentile approximations of data, compute time-weighted averages, downsample and smooth data, and perform faster <code>COUNT DISTINCT</code> queries using approximations. </p><p>Moreover, hyperfunctions are “easy” to use: you call a hyperfunction using the same SQL syntax you know and love.</p><p>We spoke with community members to understand their needs, and our initial release includes some of the most frequently requested functions, including <strong>percentile approximations</strong> (see <a href="https://github.com/timescale/timescaledb-toolkit/issues/41">GitHub feature request and discussion</a>). 
</p><p>They’re very useful for working with large time-series data sets because they offer the benefits of using percentiles (rather than averages or other counting statistics) while still being quick and space-efficient to compute, parallelizable, and useful with continuous aggregates and other advanced TimescaleDB features.</p><p><strong>If you’d like to get started with the </strong><a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/"><strong>percentile approximation hyperfunctions</strong></a><strong>—and many more—right away, spin up a fully managed Timescale Cloud service: </strong>create an account to <a href="https://console.cloud.timescale.com/signup">try it for free</a> for 30 days. (Hyperfunctions are pre-loaded on each new database service on Timescale Cloud, so after you create a new service, you’re all set to use them).</p><p><strong>If you prefer to manage your own database instances, you can </strong><a href="https://github.com/timescale/timescaledb-toolkit"><strong>download and install the timescaledb_toolkit extension</strong></a> on GitHub, after which you’ll be able to use percentile approximation and other hyperfunctions.</p><p>Finally, we love building in public and continually improving:</p><ul><li>If you have questions or comments on this blog post, <a href="https://github.com/timescale/timescaledb-toolkit/discussions/185">we’ve started a discussion on our GitHub page, and we’d love to hear from you</a>. And, if you like what you see, GitHub ⭐ are always welcome and appreciated, too!</li><li>You can view our <a href="https://github.com/timescale/timescaledb-toolkit">upcoming roadmap on GitHub</a> for a list of proposed features, as well as features we’re currently implementing and those that are available to use today.</li></ul><h2 id="things-i-forgot-from-7th-grade-math-percentiles-vs-averages">Things I Forgot From 7th Grade Math: Percentiles vs. 
Averages</h2><p>I probably learned about averages, medians, and modes in 7th-grade math class, but if you’re anything like me, they may periodically get lost in the cloud of “things I learned once and thought I knew, but actually, I don’t remember quite as well as I thought.”</p><p>As I was researching this piece, I found a number of good blog posts (see examples from the folks at <a href="https://www.dynatrace.com/news/blog/why-averages-suck-and-percentiles-are-great/">Dynatrace</a>, <a href="https://www.elastic.co/blog/averages-can-dangerous-use-percentile">Elastic</a>, <a href="https://blog.appsignal.com/2018/12/04/dont-be-mean-statistical-means-and-percentiles-101.html">AppSignal</a>, and <a href="https://www.optimizely.com/insights/blog/why-cdn-balancing/">Optimizely</a>) about how averages aren’t great for understanding application performance (or other similar things) and why it’s better to use percentiles.</p><p>I won’t spend too long on this, but I think it’s important to provide some background on <em>why</em> and <em>how</em> percentiles can help us better understand our data.</p><p>First off, let’s consider how percentiles and averages are defined. To understand this, let’s start by looking at a <a href="https://en.wikipedia.org/wiki/Normal_distribution"><strong>normal distribution</strong></a><strong>:</strong></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/09/Graph-1.jpg" class="kg-image" alt="A graph with a Gaussian/normal distribution, it has a peak at the center, then falls off relatively sharply and symmetrically on either side. 
" loading="lazy" width="2000" height="1182" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2021/09/Graph-1.jpg 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2021/09/Graph-1.jpg 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2021/09/Graph-1.jpg 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/09/Graph-1.jpg 2152w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">A normal, or Gaussian, distribution describes many real-world processes that fall around a given value and where the probability of finding values that are further from the center decreases. The median, average, and mode are all the same for a normal distribution, and they fall on the dotted line at the center.</span></figcaption></figure><p>The normal distribution is what we often think of when we think about statistics; it’s one of the most frequently used and often used in introductory courses. In a normal distribution, the median, the average (also known as the mean), and the mode are all the same, even though they’re defined differently.</p><p>The <strong>median</strong> is the middle value, where half of the data is above and half is below. The <strong>mean</strong> (aka average) is defined as the sum(value) / count(value),  and the <strong>mode </strong>is defined as the most common or frequently occurring value.</p><p>When we’re looking at a curve like this, the x-axis represents the value, while the y-axis represents the frequency with which we see a given value (i.e., values that are “higher” on the y-axis occur more frequently).</p><p>In a normal distribution, we see a curve centered (the dotted line) at its most frequent value, with decreasing probability of seeing values further away from the most frequent one (the most frequent value is the mode). 
Note that the normal distribution is symmetric, which means that values to the left and right of the center have the same probability of occurring.</p><p>The median, or the middle value, is also known as the 50th percentile (the middle percentile out of 100). This is the value at which 50% of the data is less than the value, and 50% is greater than the value (or equal to it).</p><p>In the graph below, half of the data is to the left (shaded in blue), and half is to the right (shaded in yellow), with the 50th percentile directly in the center.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/09/Graph-2.jpg" class="kg-image" alt="A normal distribution as in the last graph, except now the left 50% of the graph is shaded in one color, the right in a different color. The dotted line in the center is labeled median (50th percentile)" loading="lazy" width="2000" height="1182" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2021/09/Graph-2.jpg 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2021/09/Graph-2.jpg 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2021/09/Graph-2.jpg 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/09/Graph-2.jpg 2152w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">A normal distribution with the median/50th percentile depicted.</span></figcaption></figure><p>This leads us to percentiles: a <strong>percentile</strong> is defined as the value where x percent of the data falls below the value. 
</p><p>For example, if we call something “the 10th percentile,” we mean that 10% of the data is less than the value and 90% is greater than (or equal to) the value.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/09/Graph-3.jpg" class="kg-image" alt="The same as the last graph except the dotted line has shifted leftward, and now 10% of the area under the graph to the left of the line is shaded one color with 90% shaded a different color. The dotted line has been re-labeled as the 10th percentile." loading="lazy" width="2000" height="1182" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2021/09/Graph-3.jpg 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2021/09/Graph-3.jpg 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2021/09/Graph-3.jpg 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/09/Graph-3.jpg 2152w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">A normal distribution with the 10th percentile depicted.</span></figcaption></figure><p>And the 90th percentile is where 90% of the data is less than the value and 10% is greater:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/09/Graph-4.jpg" class="kg-image" alt="The same as the last graph except the line has shifted to the right and is now labeled 90th percentile, 90% of the area under the curve to the left of the line is shaded one color, the 10% to the right of the line is shaded a different color." 
loading="lazy" width="2000" height="1182" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2021/09/Graph-4.jpg 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2021/09/Graph-4.jpg 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2021/09/Graph-4.jpg 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/09/Graph-4.jpg 2152w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">A normal distribution with the 90th percentile depicted.</span></figcaption></figure><p>To calculate the 10th percentile, let’s say we have 10,000 values. We take all of the values, order them from smallest to largest, and identify the 1,001st value (where 1,000, or 10%, of the values are below it), which will be our 10th percentile.</p><p>We noted before that the median and average are the same in a normal distribution. This is because a normal distribution is <em>symmetric</em>. Thus, the points with values larger than the median are completely balanced, in both magnitude and number, by the points with values smaller than the median.</p><p>In other words, there is always the same number of points on either side of the median, but the <em>average</em> takes into account the actual value of the points. </p><p>For the median and average to be equal, the points less than the median and greater than the median must have the same distribution (i.e., there must be the same number of points that are somewhat larger and somewhat smaller and much larger and much smaller). 
(<strong>Correction:</strong> as pointed out to us in <a href="https://news.ycombinator.com/item?id=28527954">a helpful comment on Hacker News</a>, technically, this is only true for symmetric distributions; for asymmetric distributions, it may or may not be true, and you can get odd cases of asymmetric distributions where these are equal, though they are less likely!)</p><p><strong>Why is this important? </strong>The fact that the median and average are the same in the normal distribution can cause some confusion. Since a normal distribution is often one of the first things we learn, we (myself included!) can think it applies to more cases than it actually does.</p><p>It’s easy to forget or fail to realize that only the <em>median</em> guarantees that 50% of the values will be above and 50% below, while the average guarantees that 50% of the <strong>weighted</strong> values will be above and 50% below (i.e., the average is the <a href="https://en.wikipedia.org/wiki/Centroid">centroid</a>, while the median is the center).</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/09/Graph-5.jpg" class="kg-image" alt="The same graph as the one several graphs ago showing the median/50th percentile with half shaded one color and half another, except now the line is labeled as the average as well as the median. 
" loading="lazy" width="2000" height="1182" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2021/09/Graph-5.jpg 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2021/09/Graph-5.jpg 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2021/09/Graph-5.jpg 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/09/Graph-5.jpg 2152w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The average and median are the same in a normal distribution, and they split the graph exactly in half. But they aren’t calculated the same way, don’t represent the same thing, and aren’t necessarily the same in other distributions.</span></figcaption></figure><p>🙏 Shout out to the folks over at <a href="https://www.desmos.com/">Desmos</a> for their great graphing calculator, which helped make these graphs and even allowed me to make an <a href="https://www.desmos.com/calculator/ty3jt8ftgs">interactive demonstration of these concepts</a>!</p><p>But, to get out of the theoretical, let’s consider something more common in the real world, like the API response time scenario from my work at the ad analytics company.</p><h2 id="why-percentiles-are-better-than-averages-for-understanding-your-data">Why Percentiles Are Better Than Averages for Understanding Your Data</h2><p>We looked at how averages and percentiles are different—and now, we’re going to use a real-world scenario to demonstrate how using averages instead of percentiles can lead to false alarms or missed opportunities. </p><p>Why? 
Averages don’t always give you enough information to distinguish between real effects and outliers or noise, whereas percentiles can do a much better job.</p><p>Simply put, using averages can have a dramatic (and negative) impact on how values are reported, while percentiles can help you get closer to the “truth.”</p><p>If you’re looking at something like API response time, you’ll likely see a frequency distribution curve that looks something like this:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/09/Graph-6.jpg" class="kg-image" alt="A curve with frequency on the vertical axis and response time on the horizontal axis. The curve has a relatively sharp peak at the beginning labelled 250 ms, it then falls off quickly before shallowing out into a long tail" loading="lazy" width="1600" height="986" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2021/09/Graph-6.jpg 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2021/09/Graph-6.jpg 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/09/Graph-6.jpg 1600w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">A frequency distribution for API response times with a peak at 250ms (all graphs are not to scale and are meant only for demonstration purposes).</span></figcaption></figure><p>In my former role at the ad analytics company, we’d aim for most of our API response calls to finish in under half a second, and many were much, much shorter than that. When we monitored our API response times, one of the most important things we tried to understand was how users were affected by changes in the code.  
</p><p>Most of our API calls finished in under half a second, but some people used the system to get data over very long time periods or had odd configurations that meant their dashboards were a bit less responsive (though we tried to make sure those were rare!). </p><p>The type of curve that resulted is characterized as a <strong>long-tail distribution</strong> where we have a relatively large spike at 250 ms, with many of our values under that and then an exponentially decreasing number of longer response times.</p><p>We talked earlier about how the median and average are the same in symmetric curves (like the normal distribution), but a long-tail distribution is an <strong>asymmetric</strong> curve.</p><p>This means that the largest values are much larger than the middle values, while the smallest values aren’t that far from the middle values. (In the API monitoring case, you can never have an API call that takes less than zero seconds to respond, but there’s no limit to how long they can take, so you get that long tail of longer API calls).</p><p>Thus, the average and the median of a long-tail distribution start to diverge:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/09/Graph-7.jpg" class="kg-image" alt="The same curve as last time except now the median and average are labeled. The median is near the peak of the curve while the average is a bit rightward." 
loading="lazy" width="1600" height="986" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2021/09/Graph-7.jpg 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2021/09/Graph-7.jpg 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/09/Graph-7.jpg 1600w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The API response time frequency curve with the median and average labeled. Graphs are not to scale and are meant for demonstration purposes only.</span></figcaption></figure><p>In this scenario, the average is significantly larger than the median because there are enough “large” values in the long tail to make the average larger. Conversely, in some other cases, the average might be smaller than the median. </p><p>But at the ad analytics company, we found that the average didn’t give us enough information to distinguish between important changes in how our API responded to software changes vs. noise/outliers that only affected a few individuals. </p><p>In one case, we introduced a change to the code that had a new query. The query worked fine in staging, but there was a lot more data in the production system. </p><p>Once the data was “warm” (in memory), it would run quickly, but it was very slow the first time. When the query went into production, the response time was well over a second for ~10% of the calls. 
</p><p>In our frequency curve, a response time over a second (but less than 10s) for ~10% of the calls resulted in a second, smaller hump in our frequency curve and looked like this:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/09/Graph-8.jpg" class="kg-image" alt="The same curve as last time except now there is a smaller hump further to the right of the long tail of the original graph, it’s approximately one fifth the height of the original and has approximately one tenth the area. The average and median have both shifted rightward, the average more than the median. The average is nearly in line with the second hump." loading="lazy" width="1600" height="986" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2021/09/Graph-8.jpg 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2021/09/Graph-8.jpg 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/09/Graph-8.jpg 1600w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">A frequency curve showing the shift and extra hump that occurs when 10% of calls take a moderate amount of time, between 1 and 10s (graph still not to scale).</span></figcaption></figure><p>In this scenario, the average shifted a lot, while the median shifted only slightly; it’s much less impacted.</p><p>You might think that this makes the average a better metric than the median because it helped us identify the problem (API response times that were too long), and we could set up our alerting to notify us when the average shifts.</p><p>Let’s imagine that we’ve done that, and people will jump into action when the average goes above, say, 1 second.</p><p>But now, we have a few users who have started requesting 15 years of data from our UI...and those API calls take a <em>really 
long time</em>. This is because the API wasn’t really built to handle this “off-label” use.</p><p>Just a few calls from these users easily shifted the average way over our 1s threshold.</p><p>Why? The average (as a value) can be dramatically affected by outliers like this, even though they impact only a small fraction of our users. The average uses the sum of the data, so the magnitude of the outliers can have an outsized impact, whereas the median and other percentiles are based on the ordering of the data.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/09/Graph-9-1.jpg" class="kg-image" alt="The same curve without the second hump, but with a couple of hash marks on the x axis and some outliers far off to the right. These outliers are circled. The average has shifted far to the right, the same place it was with the hump but the median has remained in the same place as the original curve without hump or outliers." 
loading="lazy" width="2000" height="916" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2021/09/Graph-9-1.jpg 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2021/09/Graph-9-1.jpg 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2021/09/Graph-9-1.jpg 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/09/Graph-9-1.jpg 2000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Our curve with a few outliers, where fewer than 1% of the API call responses are over 100s (the response time has a break representing the fact that the outliers would be way to the right otherwise; still, the graph is not to scale).</span></figcaption></figure><p><strong>The point is that the average doesn’t give us a good way to distinguish between outliers and real effects and can give odd results when we have a long-tail or asymmetric distribution.</strong></p><p>Why is this important to understand?</p><p>Well, in the first case, we had a problem affecting 10% of our API calls, which could be 10% or more of our users (how could it affect more than 10% of the users? Well, if a user makes 10 calls on average, and 10% of API calls are affected, then, on average, all the users would be affected... or at least some large percentage of them).</p><p>We want to respond very quickly to that type of urgent problem, affecting a large number of users. We built alerts for it and might even get our engineers up in the middle of the night and/or revert a change.</p><p>But the second case, where “off-label” user behavior or minor bugs had a large effect on a few API calls, was much more benign. Because relatively few users are affected by these outliers, we wouldn’t want to get our engineers up in the middle of the night or revert a change. 
(Outliers can still be important to identify and understand, both for understanding user needs and for finding potential bugs in the code, but they usually <em>aren’t an emergency</em>).</p><p>Instead of using the average, we can use multiple percentiles to understand this type of behavior. Remember, unlike averages, percentiles rely on the <em>ordering</em> of the data rather than being impacted by the <em>magnitude</em> of data. If we use the 90th percentile, we <em>know</em> that 10% of users have values (API response times in our case) greater than it.</p><p>Let’s look at the 90th percentile in our original graph; it nicely captures some of the long tail behavior:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/09/Graph-10.jpg" class="kg-image" alt="Back to the original graph with the median and average, except now the 90th percentile is also drawn in. The 90th percentile is further to the right than the median or average and is around the halfway point of the long tail portion of the curve." loading="lazy" width="1600" height="986" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2021/09/Graph-10.jpg 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2021/09/Graph-10.jpg 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/09/Graph-10.jpg 1600w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Our original API response time graph showing the 90th percentile, median, and average. 
Graph not to scale.</span></figcaption></figure><p>When we have some outliers caused by a few users who’re running super long queries or a bug affecting a small group of queries, the average shifts, but the 90th percentile is hardly affected.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/09/Graph-11-1.jpg" class="kg-image" alt="Back to the graph with the outliers, except now the 90th percentile is there as well, it has remained in the same spot as the previous graph, but the average has shifted over to the right well beyond it. " loading="lazy" width="2000" height="916" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2021/09/Graph-11-1.jpg 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2021/09/Graph-11-1.jpg 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2021/09/Graph-11-1.jpg 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/09/Graph-11-1.jpg 2000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Outliers affect the average but don’t impact the 90th percentile or median. (Graph is not to scale.)</span></figcaption></figure><p>But, when the tail is increased due to a problem affecting 10% of users, we see that the 90th percentile shifts outward pretty dramatically – which enables our team to be notified and respond appropriately:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/09/Graph-12.jpg" class="kg-image" alt="The same graph as the previous one with a smaller hump, except now the 90th percentile is on there as well, it has shifted to the right this time and is further right than the average." 
loading="lazy" width="1600" height="986" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2021/09/Graph-12.jpg 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2021/09/Graph-12.jpg 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/09/Graph-12.jpg 1600w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">But when there are “real” effects from responses that impact more than 10% of users, the 90th percentile shifts dramatically (Graph not to scale.)</span></figcaption></figure><p>This (hopefully) gives you a better sense of how and why percentiles can help you identify cases where large numbers of users are affected – but not burden you with false positives that might wake engineers up and give them alarm fatigue!</p><p>So, now that we know why we might want to use percentiles rather than averages, let’s talk about how we calculate them.</p><h2 id="how-percentiles-work-in-postgresql">How Percentiles Work in PostgreSQL</h2><p>To calculate any sort of exact percentile, you take <em>all</em> your values, sort them, then find the <em>n</em>th value based on the percentile you’re trying to calculate. </p><p>To see how this works in PostgreSQL, we’ll present a simplified case of our ad analytics company’s API tracking. </p><p>We’ll start off with a table like this:</p><pre><code class="language-SQL">CREATE TABLE responses(
	ts timestamptz, 
	response_time DOUBLE PRECISION);</code></pre><p>In PostgreSQL, we can calculate a percentile over the column <code>response_time</code> using the <a href="https://www.postgresql.org/docs/current/functions-aggregate.html#FUNCTIONS-ORDEREDSET-TABLE"><code>percentile_disc</code> aggregate</a>:</p><pre><code class="language-SQL">SELECT 
	percentile_disc(0.5) WITHIN GROUP (ORDER BY response_time) as median
FROM responses;</code></pre><p>This doesn’t look the same as a normal aggregate; the <code>WITHIN GROUP (ORDER BY …)</code> is a different syntax that works on special aggregates called <a href="https://www.postgresql.org/docs/13/xaggr.html#XAGGR-ORDERED-SET-AGGREGATES">ordered-set aggregates</a>.</p><p>Here, we pass in the percentile we want (0.5 or the 50th percentile for the median) to the <code>percentile_disc</code> function, and the column that we’re evaluating (<code>response_time</code>)  goes in the order by clause.</p><p>It will be more clear why this happens when we understand what’s going on under the hood. Percentiles give a guarantee that x percent of the data will fall below the value they return. To calculate that, we need to sort all of our data in a list and then pick out the value where 50% of the data falls below it and 50% falls above it.</p><p>For those of you who read the section of our previous post on <a href="https://timescale.ghost.io/blog/blog/how-postgresql-aggregation-works-and-how-it-inspired-our-hyperfunctions-design-2/#a-primer-on-postgresql-aggregation-through-pictures">how PostgreSQL aggregates work</a>, we discussed how an aggregate like <code>avg</code> works. </p><p>As it scans each row, the transition function updates some internal state (for <code>avg</code> it’s the <code>sum</code> and the <code>count</code>), and then a final function processes the internal state to produce a result (for <code>avg</code> divide <code>sum</code> by <code>count</code>).</p>
<!--kg-card-begin: html-->
    <video autoplay loop muted playsinline>
      <source id="player" src="https://s3.amazonaws.com/blog.timescale.com/gifs/hyperfunctions-3/avg_bar.mp4" type="video/mp4"></source>
    </video>
    <figcaption class="gif-caption">A GIF showing how the avg is calculated in PostgreSQL with the sum and count as the partial state as rows are processed and a final function that divides them when we’ve finished.
    </figcaption>
<!--kg-card-end: html-->
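<p>As a minimal sketch of that two-part state, the query below rebuilds the average by hand from a sum and a count in plain PostgreSQL and shows it next to the built-in <code>avg</code>. The column aliases <code>avg_by_hand</code> and <code>avg_builtin</code> are illustrative names, not part of any API:</p><pre><code class="language-SQL">SELECT
    -- the "state" is a running sum and a running count;
    -- the "final function" is the division
    sum(response_time) / count(response_time) AS avg_by_hand,
    avg(response_time) AS avg_builtin
FROM responses;</code></pre><p>The two columns should agree up to floating-point rounding. The point of the sketch is that the state stays two numbers wide no matter how many rows are scanned.</p>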
<p>The ordered set aggregates, like <code>percentile_disc</code>, work somewhat similarly, with one exception: instead of the state being a relatively small fixed-size data structure (like <code>sum</code> and <code>count</code> for <code>avg</code>), it must keep all the values it has processed to sort them and calculate the percentile later. </p><p>Usually, PostgreSQL does this by putting the values into a data structure called a <a href="https://github.com/postgres/postgres/blob/c30f54ad732ca5c8762bb68bbe0f51de9137dd72/src/backend/utils/sort/tuplestore.c"><code>tuplestore</code></a> that stores and sorts values easily. </p><p>Then, when the final function is called, the <code>tuplestore</code> will first sort the data. Then, based on the fraction passed to <code>percentile_disc</code>, it will traverse to the correct point (0.5 of the way through the data for the median) in the sorted data and output the result.<br></p>
<!--kg-card-begin: html-->
    <video autoplay loop muted playsinline>
      <source id="player" src="https://s3.amazonaws.com/blog.timescale.com/gifs/hyperfunctions-3/percentile_disc.mp4" type="video/mp4"></source>
    </video>
    <figcaption class="gif-caption">With the percentile_disc ordered set aggregate, PostgreSQL has to store each value it sees in a tuplestore. Once it has processed all the rows, it sorts them and then goes to the right point in the sorted list to extract the percentile we need. 
    </figcaption>
<!--kg-card-end: html-->
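<p>The store, sort, and pick steps can be written out by hand. The following is only a sketch in plain PostgreSQL, not how the aggregate is implemented internally: it sorts every non-null value, tracks each row’s fractional position with the <code>cume_dist</code> window function, and keeps the first value whose position reaches 0.5. The names <code>sorted</code>, <code>cum_frac</code>, and <code>median_by_hand</code> are illustrative:</p><pre><code class="language-SQL">SELECT response_time AS median_by_hand
FROM (
    SELECT
        response_time,
        -- fraction of rows at or before this one in sorted order
        cume_dist() OVER (ORDER BY response_time) AS cum_frac
    FROM responses
    WHERE response_time IS NOT NULL  -- percentile_disc also ignores NULLs
) AS sorted
WHERE cum_frac &gt;= 0.5               -- first position at or past the requested fraction
ORDER BY response_time
LIMIT 1;</code></pre><p>On a non-empty table this should return the same value as <code>percentile_disc(0.5)</code>. Either way, every value has to be kept and sorted before the answer is known, which is the cost the next sections try to avoid.</p>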
<p>Instead of performing these expensive calculations over very large data sets, <strong>many people find that approximate percentile calculations can provide a “close enough” approximation with significantly less work</strong>...which is why we introduced percentile approximation hyperfunctions.</p><h2 id="percentile-approximation-what-it-is-and-why-we-use-it-in-hyperfunctions">Percentile Approximation: What It Is and Why We Use It in Hyperfunctions</h2><p>In my experience, people often use averages and other summary statistics more frequently than percentiles because they are significantly “cheaper” to calculate over large datasets, both in computational resources and time.</p><p>As we noted above, calculating the average in PostgreSQL has a simple, two-valued aggregate state. Even if we calculate a few additional, related functions like the standard deviation, we still just need a small, fixed number of values to calculate the function.</p><p>In contrast, to calculate the percentile, we need all of the input values in a sorted list.</p><p>This leads to a few issues:</p><ol><li><strong>Memory footprint</strong>: The algorithm has to keep these values somewhere, which means keeping values in memory until they need to write some data to disk to avoid using too much memory (this is known as “spilling to disk”). 
This produces a significant memory burden and/or majorly slows down the operation because disk accesses are orders of magnitude slower than memory.</li><li><strong>Limited Benefits from Parallelization</strong>: Even though the algorithm can sort lists in parallel, the benefits from parallelization are limited because it still needs to merge all the sorted lists into a single, sorted list in order to calculate a percentile.</li><li><strong>High network costs: </strong>In distributed systems, all the values must be passed over the network to one node to be made into a single sorted list, which is slow and costly.</li><li><strong>No true partial states</strong>: Materialization of partial states (e.g., for continuous aggregates) is not useful because the partial state is simply all the values that underlie it. This could save on sorting the lists, but the storage burden would be high and the payoff low.</li><li><strong>No streaming algorithm</strong>: For streaming data, this is completely infeasible. You still need to maintain the full list of values (similar to the materialization of partial states problem above), which means that the algorithm essentially needs to store the entire stream!</li></ol><p>All of these can be manageable when you’re dealing with relatively small data sets, while for high volume, time-series workloads, they start to become more of an issue.</p><p>But, you only need the full list of values for calculating a percentile if you want <strong><em>exact</em></strong> percentiles. 
<strong>With relatively large datasets, you can often accept some accuracy tradeoffs to avoid running into any of these issues.</strong></p><p>The problems above, and the recognition of the tradeoffs involved in weighing whether to use averages or percentiles, led to the development of multiple algorithms to <a href="https://en.wikipedia.org/wiki/Quantile#Approximate_quantiles_from_a_stream">approximate percentiles in high volume systems.</a> Most percentile approximation approaches involve some sort of modified <a href="https://en.wikipedia.org/wiki/Histogram">histogram</a> to represent the overall shape of the data more compactly, while still capturing much of the shape of the distribution.</p><p>As we were designing hyperfunctions, we thought about how we could capture the benefits of percentiles (e.g., robustness to outliers, better correspondence with real-world impacts) while avoiding some of the pitfalls that come with calculating exact percentiles (above). </p><p>Percentile approximations seemed like the right fit for working with large, time-series datasets.</p><p>The result is a whole family of <a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/">percentile approximation hyperfunctions</a>, built into TimescaleDB. The simplest way to call them is to use the <a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/percentile_agg/"><code>percentile_agg</code> aggregate</a> along with the<a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/approx_percentile/"> <code>approx_percentile</code> accessor</a>.</p><p>This query calculates approximate 10th, 50th, and 90th percentiles:</p><pre><code class="language-SQL">SELECT 
    approx_percentile(0.1, percentile_agg(response_time)) as p10, 
    approx_percentile(0.5, percentile_agg(response_time)) as p50, 
    approx_percentile(0.9, percentile_agg(response_time)) as p90 
FROM responses;
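-- Sanity-check sketch (plain PostgreSQL, exact, slower on large tables):
-- the same three percentiles from the built-in ordered-set aggregate, as one array.
SELECT
    percentile_disc(ARRAY[0.1, 0.5, 0.9]::double precision[])
        WITHIN GROUP (ORDER BY response_time) AS exact_p10_p50_p90
FROM responses;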
</code></pre><p>(If you’d like to learn more about aggregates, accessors, and two-step aggregation design patterns, check out <a href="https://timescale.ghost.io/blog/blog/how-postgresql-aggregation-works-and-how-it-inspired-our-hyperfunctions-design-2/">our primer on PostgreSQL two-step aggregation</a>.) </p><p>These percentile approximations have many benefits when compared to the normal PostgreSQL exact percentiles, especially when used for large data sets.</p><h3 id="memory-footprint">Memory footprint</h3><p>When calculating percentiles over large data sets, our percentile approximations limit the memory footprint (or need to spill to disk, as described above). </p><p>Standard percentiles create memory pressure since they build up as much of the data set in memory as possible...and then slow down when forced to spill to disk.</p><p>Conversely, hyperfunctions’ percentile approximations have fixed-size representations based on the number of buckets in their modified histograms, so they limit the amount of memory required to calculate them.</p><p>Plus, all of our percentile approximation algorithms are parallelizable, so they can be computed using multiple workers in a single node; this can provide significant speedups because ordered-set aggregates like <code>percentile_disc</code> are not parallelizable in PostgreSQL. </p><h3 id="materialization-in-continuous-aggregates">Materialization in continuous aggregates</h3><p>TimescaleDB includes a feature called <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/continuous-aggregates/">continuous aggregates</a>, designed to make queries on very large datasets run faster. </p><p>TimescaleDB continuous aggregates continuously and incrementally store the results of an aggregation query in the background, so when you run the query, only the data that has changed needs to be computed, not the entire dataset. 
</p><p>Unfortunately, exact percentiles using <code>percentile_disc</code> cannot be stored in continuous aggregates because they cannot be broken down into a partial form and would instead require storing the entire dataset inside the aggregate.</p><p>We designed our percentile approximation algorithms to be usable with continuous aggregates. They have fixed-size partial representations that can be stored and re-aggregated inside continuous aggregates. </p><p>This is a huge advantage compared to exact percentiles because now you can do things like baselining and alerting on longer periods without having to re-calculate from scratch every time. </p><p>Let’s go back to our API response time example and imagine we want to identify recent outliers to investigate potential problems.</p><p>One way to do that would be to look at everything that is, say, above the 99th percentile in the previous hour. </p><p>As a reminder, we have a table:</p><pre><code class="language-SQL">CREATE TABLE responses(
	ts timestamptz, 
	response_time DOUBLE PRECISION);
SELECT create_hypertable('responses', 'ts'); -- make it a hypertable so we can make continuous aggs</code></pre><p>First, we’ll create a one-hour aggregation:</p><pre><code class="language-SQL">CREATE MATERIALIZED VIEW responses_1h_agg
WITH (timescaledb.continuous)
AS SELECT 
    time_bucket('1 hour'::interval, ts) as bucket,
    percentile_agg(response_time)
FROM responses
GROUP BY time_bucket('1 hour'::interval, ts);</code></pre><p>Note that we don’t call the accessor function in the continuous aggregate; we only call the aggregate function. </p><p>Now, we can find the responses from the last 30&nbsp;seconds that are slower than the previous hour’s 99th percentile like so:</p><pre><code class="language-SQL">SELECT * FROM responses 
WHERE ts &gt;= now()-'30s'::interval
AND response_time &gt; (
	SELECT approx_percentile(0.99, percentile_agg)
	FROM responses_1h_agg
	WHERE bucket = time_bucket('1 hour'::interval, now()-'1 hour'::interval)
);
</code></pre><p>At the ad analytics company, we had a lot of users, so we’d have tens or hundreds of thousands of API calls every hour. </p><p>By default, we have 200 buckets in our representation, so we’re getting a large reduction in the amount of data that we store and process by using a continuous aggregate. This means that queries against it respond significantly faster. If you don’t have as much data, you’ll want to increase the size of your time buckets or decrease the fidelity of the approximation to achieve a similarly large reduction in the data you have to process.</p><p>We mentioned that we only performed the aggregate step in the continuous aggregate view definition; we didn’t use our <code>approx_percentile</code> accessor function directly in the view. We do that because we want to be able to use other accessor functions and/or the <a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/rollup-percentile/"><u><code>rollup</code></u> function</a>, which you may remember as one of the main <a href="https://timescale.ghost.io/blog/blog/how-postgresql-aggregation-works-and-how-it-inspired-our-hyperfunctions-design-2/#why-we-use-the-two-step-aggregate-design-pattern">reasons we chose the two-step aggregate approach</a>.</p><p>Let’s look at how that works. We can create a daily rollup and get the 99th percentile like this:</p><pre><code class="language-SQL">SELECT 
	time_bucket('1 day', bucket),
	approx_percentile(0.99, rollup(percentile_agg)) as p_99_daily
FROM responses_1h_agg
GROUP BY 1;</code></pre><p>We could even use the <a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/approx_percentile_rank/"><code>approx_percentile_rank</code></a> accessor function, which tells you what percentile a value would fall into.</p><p>Percentile rank is the inverse of the percentile function. Normally you ask, “What is the value at the <em>n</em>th percentile?” and the answer is a value.</p><p>With percentile rank, you ask, “What percentile would this value be in?” and the answer is a percentile.</p><p>So, using <code>approx_percentile_rank</code> allows us to see where the values that arrived in the last 5 minutes rank compared to values in the last day:</p><pre><code class="language-SQL">WITH last_day as (SELECT 
	time_bucket('1 day', bucket),
 	rollup(percentile_agg) as pct_daily
FROM responses_1h_agg
WHERE bucket &gt;= time_bucket('1 day', now()-'1 day'::interval)
GROUP BY 1)

SELECT approx_percentile_rank(response_time, pct_daily) as pct_rank_in_day
FROM responses, last_day
WHERE responses.ts &gt;= now()-'5 minutes'::interval;
</code></pre><p>This is another way continuous aggregates can be valuable.</p><p>We performed a <code>rollup</code> over a day, which just combined 24 partial states, rather than performing a full calculation over 24 hours of data with millions of data points.</p><p>We then used the <code>rollup</code> to see how that impacted just the last few minutes of data, giving us insight into how the last few minutes compare to the last 24 hours. These are just a few examples of how the percentile approximation hyperfunctions can give us some pretty nifty results and allow us to perform complex analysis relatively simply.</p><h2 id="percentile-approximation-deep-dive">Percentile Approximation Deep Dive</h2><p>Some of you may be wondering how TimescaleDB hyperfunctions’ underlying algorithms work, so let’s dive in! (For those of you who don’t want to get into the weeds, feel free to skip over this bit.)</p><h3 id="approximation-methods-and-how-they-work">Approximation methods and how they work</h3><p>We implemented two different percentile approximation algorithms as TimescaleDB hyperfunctions: <a href="https://arxiv.org/pdf/2004.08604.pdf">UDDSketch</a> and <a href="https://github.com/tdunning/t-digest">T-Digest</a>. Each is useful in different scenarios, but first, let’s understand some of the basics of how they work. </p><p>Both use a modified histogram to approximate the shape of a distribution. A histogram buckets nearby values into a group and tracks their frequency.</p><p>You often see a histogram plotted like so:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/pasted-image-0--1--2.png" class="kg-image" alt="A graph similar to the previous graphs of the response time of an API, except now the curve has been replaced by a series of black line graph boxes representing the buckets of a histogram." 
loading="lazy" width="2000" height="821" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/01/pasted-image-0--1--2.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/01/pasted-image-0--1--2.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2022/01/pasted-image-0--1--2.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/pasted-image-0--1--2.png 2048w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">A histogram representing the same data as our response time frequency curve above, you can see how the shape of the graph is similar to the frequency curve. Not to scale.</span></figcaption></figure><p>If you compare this to the frequency curve we showed above, you can see how this could provide a reasonable approximation of the  API response time vs frequency response. Essentially, a histogram has a series of bucket boundaries and a count of the number of values that fall within each bucket.</p><p>To calculate the approximate percentile for, say, the 20th percentile, you first consider the fraction of your total data that would represent it. For our 20th percentile, that would be 0.2 * <code>total_points</code>.</p><p>Once you have that value, you can then sum the frequencies in each bucket, left to right, to find at which bucket you get the value closest to 0.2 * <code>total_points</code>. </p><p>You can even interpolate between buckets to get more exact approximations when the bucket spans a percentile of interest. </p><p>When you think of a histogram, you may think of one that looks like the one above, where the buckets are all the same width. </p><p>But choosing the bucket width, especially for widely varying data, can get very difficult or lead you to store a lot of extra data. 
</p><p>In our API response time example, we could have data spanning from tens of milliseconds up to ten seconds or hundreds of seconds. </p><p>This means that the right bucket size for a good approximation of the 1st percentile, e.g., 2ms, would be WAY smaller than necessary for a good approximation of the 99th percentile.</p><p>This is why most percentile approximation algorithms use a modified histogram with a <em>variable bucket width</em>. </p><p>For instance, the UDDSketch algorithm uses logarithmically sized buckets, which might look something like this:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/pasted-image-0-6.png" class="kg-image" alt="A graph similar to the last one except the black boxes start smaller than the previous ones and increase in width as you move to the right." loading="lazy" width="2000" height="828" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/01/pasted-image-0-6.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/01/pasted-image-0-6.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2022/01/pasted-image-0-6.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/pasted-image-0-6.png 2048w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">A modified histogram showing how logarithmic buckets like the UDDSketch algorithm uses can still represent the data. 
(Note: we’d need to modify the plot to plot the frequency/bucket width so that the scale would remain similar; however, this is just for demonstration purposes and not drawn to scale).</span></figcaption></figure><p>The designers of UDDSketch used a logarithmic bucket size like this because what they care about is the <em>relative error</em>.</p>
<!--kg-card-begin: html-->
<p>For reference, absolute error is defined as the difference between the actual and the approximated value:
\begin{equation}
\text{err}_\text{absolute} = \left| v_\text{actual} - v_\text{approx} \right|
\end{equation}
   </p>
<!--kg-card-end: html-->

<!--kg-card-begin: html-->
<p>To get relative error, you divide the absolute error by the value:
    \begin{equation}
\text{err}_\text{relative} = \frac{\text{err}_\text{absolute}}{   v_\text{actual}}
\end{equation}
</p>
<!--kg-card-end: html-->
<p>If we had a constant absolute error, we might run into a situation like the following: </p><p>We ask for the 99th percentile, and the algorithm tells us it’s 10s +/- 100ms. Then, we ask for the 1st percentile, and the algorithm tells us it’s 10ms +/- 100ms.</p><p>The error for the 1st percentile is way too high!</p><p>If we instead have a constant relative error of 1% (which is what 100ms out of 10s amounts to), then for the 1st percentile we’d get 10ms +/- 100 microseconds. </p><p>This is much, much more useful. (And 10s +/- 100 microseconds is probably too tight; we likely don’t really care about 100 microseconds if we’re already at 10s.)</p><p>This is why the UDDSketch algorithm uses logarithmically sized buckets, where the width of the bucket scales with the size of the underlying data. This allows the algorithm to provide constant relative error across the full range of percentiles. </p>
<!--kg-card-begin: html-->
<p>As a result, you always know that the true value of the percentile will fall within some range \([v_\text{approx} (1-\text{err}_\text{relative}), v_\text{approx} (1+\text{err}_\text{relative})]\).</p>
<!--kg-card-end: html-->
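<p>To make the bucket-walking and logarithmic-bucket ideas above concrete, here is a rough sketch in plain PostgreSQL. It is not the UDDSketch implementation that ships with the hyperfunctions (it never caps or merges buckets, for instance). It assumes the bucketing described in the DDSketch family of papers, where a target relative error <code>alpha</code> gives a growth factor <code>gamma = (1 + alpha) / (1 - alpha)</code> and bucket <em>i</em> covers values in the range (gamma^(i-1), gamma^i]. The CTE names, <code>alpha</code>, <code>gamma</code>, and <code>approx_p20</code> are all illustrative:</p><pre><code class="language-SQL">WITH params AS (
    -- illustrative target relative error of 1%
    SELECT 0.01::double precision AS alpha
), g AS (
    SELECT (1 + alpha) / (1 - alpha) AS gamma FROM params
), buckets AS (
    -- logarithmic bucket index: bucket width grows with the value itself
    SELECT
        ceil(ln(r.response_time) / ln(g.gamma))::int AS bucket_idx,
        count(*) AS freq
    FROM responses r CROSS JOIN g
    WHERE r.response_time &gt; 0      -- ln() needs positive input
    GROUP BY 1
), running AS (
    -- sum the frequencies left to right
    SELECT
        bucket_idx,
        sum(freq) OVER (ORDER BY bucket_idx) AS cum_freq,
        sum(freq) OVER () AS total_points
    FROM buckets
)
SELECT
    -- representative value of the bucket, chosen so that any value
    -- inside the bucket is within a factor of (1 +/- alpha) of it
    2 * power(g.gamma, r.bucket_idx) / (g.gamma + 1) AS approx_p20
FROM running r CROSS JOIN g
WHERE r.cum_freq &gt;= 0.2 * r.total_points   -- first bucket reaching the 20th percentile
ORDER BY r.bucket_idx
LIMIT 1;</code></pre><p>Under these assumptions, the result should land within about 1% of the exact 20th percentile of the positive values, whether that percentile is 10ms or 10s. Values that sit exactly on a bucket boundary may fall on either side of it because of floating-point rounding, which is one reason to treat this as an illustration rather than a replacement for <code>uddsketch</code>.</p>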
<p>On the other hand, T-Digest uses buckets that are variably sized, based on where they fall in the distribution. Specifically, it uses smaller buckets at the extremes of the distribution and larger buckets in the middle.</p><p>So, it might look something like this:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/pasted-image-0--2--1.png" class="kg-image" alt="A graph similar to the previous two except now the black boxes start smaller than the first, increase in width towards the center and then decrease in width as you move to the right edge." loading="lazy" width="2000" height="867" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/01/pasted-image-0--2--1.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/01/pasted-image-0--2--1.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2022/01/pasted-image-0--2--1.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/pasted-image-0--2--1.png 2048w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">A modified histogram showing how variably sized buckets that are smaller at the extremes, like what the TDigest algorithm uses, can still represent the data (Note: for illustration purposes, not to scale.)</span></figcaption></figure><p>This histogram structure with variable-sized buckets optimizes for different things than UDDSketch. Specifically, it takes advantage of the idea that when you’re trying to understand the distribution, you likely care more about fine distinctions between extreme values than about the middle of the range. 
</p><p>For example, I usually care a lot about distinguishing the 5th percentile from the 1st or the 95th from the 99th, while I don’t care as much about distinguishing between the 50th and the 55th percentile.</p><p>The distinctions in the middle are less meaningful and interesting than the distinctions at the extremes. (Caveat: the TDigest algorithm is a bit more complex than this, and this doesn’t completely capture its behavior, but we’re trying to give a general gist of what’s going on. If you want more information, <a href="https://arxiv.org/abs/1902.04023">we recommend this paper</a>).</p><h3 id="using-advanced-approximation-methods-in-timescaledb-hyperfunctions">Using advanced approximation methods in TimescaleDB hyperfunctions</h3><p>So far in this post, we’ve only used the general-purpose <code>percentile_agg</code> aggregate. It uses the UDDSketch algorithm under the hood and is a good starting point for most users.</p><p>We’ve also provided separate <a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/percentile-aggregation-methods/uddsketch/"><code>uddsketch</code></a>  and <a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/percentile-aggregation-methods/tdigest/"><code>tdigest</code></a> aggregates to allow for more customizability.</p><p>Each takes the number of buckets as their first argument (which determines the size of the internal data structure) and <code>uddsketch</code> also has an argument for the target maximum relative error. </p><p>We can use the normal <code>approx_percentile</code> accessor function  just as we used with <code>percentile_agg</code>, so, we could compare median estimations like so:</p><pre><code class="language-SQL">SELECT 
	approx_percentile(0.5, uddsketch(200, 0.001, response_time)) as median_udd,
	approx_percentile(0.5, tdigest(200, response_time)) as median_tdig
FROM responses;</code></pre><p>Both of them also work with the <code>approx_percentile_rank</code> hyperfunction we discussed above. </p><p>If we wanted to see where 1000 would fall in our distribution, we could do something like this:</p><pre><code class="language-SQL">SELECT 
	approx_percentile_rank(1000, uddsketch(200, 0.001, response_time)) as rnk_udd,
	approx_percentile_rank(1000, tdigest(200, response_time)) as rnk_tdig
FROM responses;</code></pre><p>In addition, each approximation has some accessors that work only with that algorithm, because they depend on its underlying structure.</p><p>For instance, <code>uddsketch</code> provides an <code>error</code> accessor function. This will tell you the actual guaranteed maximum relative error based on the values that the <code>uddsketch</code> saw.</p><p>The UDDSketch algorithm guarantees a maximum relative error, while the T-Digest algorithm does not, so <code>error</code> only works with <code>uddsketch</code> (and <code>percentile_agg</code>, because it uses the <code>uddsketch</code> algorithm under the hood).</p><p>This error guarantee is one of the main reasons we chose UDDSketch as the default, because error guarantees are useful for determining whether you’re getting a good approximation.</p><p><code>tdigest</code>, on the other hand, provides <code>min_val</code> &amp; <code>max_val</code> accessor functions because it biases its buckets to the extremes and can provide the exact min and max values at no extra cost. <code>uddsketch</code> can’t provide that.</p><p>You can call these other accessors like so:</p><pre><code class="language-SQL">SELECT 
	approx_percentile(0.5, uddsketch(200, 0.001, response_time)) as median_udd,
	error(uddsketch(200, 0.001, response_time)) as error_udd,
	approx_percentile(0.5, tdigest(200, response_time)) as median_tdig,
	min_val(tdigest(200, response_time)) as min,
	max_val(tdigest(200, response_time)) as max
FROM responses;</code></pre><p>As we discussed in the last post about <a href="https://timescale.ghost.io/blog/blog/how-postgresql-aggregation-works-and-how-it-inspired-our-hyperfunctions-design-2/#why-we-use-the-two-step-aggregate-design-pattern">two-step aggregates</a>, calls to all of these aggregates are automatically deduplicated and optimized by PostgreSQL so that you can call multiple accessors with minimal extra cost. </p><p>They also both have <a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/rollup-percentile/"><code>rollup</code></a>  functions defined for them, so you can re-aggregate when they’re used in continuous aggregates or regular queries. </p><p>(Note: <code>tdigest</code> rollup can introduce some additional error or differences compared to calling the <code>tdigest</code> on the underlying data directly. In most cases, this should be negligible and would often be comparable to changing the order in which the underlying data was ingested.)</p><p>We’ve provided a few of the tradeoffs and differences between the algorithms here, but we have a <a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/percentile-aggregation-methods/#choosing-the-right-algorithm-for-your-use-case">longer discussion in the docs that can help you choose</a>. 
You can also start with the default <code>percentile_agg</code> and then experiment with different algorithms and parameters on your data to see what works best for your application.</p><h2 id="wrapping-it-up">Wrapping It Up</h2><p>We’ve provided a brief overview of percentiles, how they can be more informative than more common statistical aggregates like average, why percentile approximations exist, and a little bit of how they generally work within TimescaleDB hyperfunctions.</p><p><strong>If you’d like to get started with the </strong><a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/"><strong>percentile approximation hyperfunctions</strong></a><strong>—and many more—right away, spin up a fully managed Timescale Cloud service</strong>: create an account to <a href="https://console.cloud.timescale.com/signup">try it for free</a> for 30 days. (Hyperfunctions are pre-loaded on each new cloud database service on Timescale, so after you create a new service, you’re all set to use them).</p><p><strong>If you prefer to manage your own database instances, you can </strong><a href="https://github.com/timescale/timescaledb-toolkit"><strong>download and install the <code>timescaledb_toolkit</code> extension</strong></a> on GitHub, after which you’ll be able to use percentile approximation and other hyperfunctions.</p><p>We believe time-series data is everywhere, and making sense of it is crucial for all manner of technical problems. We built hyperfunctions to make it easier for developers to harness the power of time-series data. </p><p>We’re always looking for feedback on what to build next and would love to know how you’re using hyperfunctions, problems you want to solve, or things you think should—or could—be simplified to make analyzing time-series data in SQL that much better. 
(To contribute feedback, comment on an <a href="https://github.com/timescale/timescaledb-toolkit/issues">open issue</a> or in a <a href="https://github.com/timescale/timescaledb-toolkit/discussions">discussion thread</a> in GitHub.)</p><h2 id="faqs">FAQs</h2><h3 id="why-are-percentiles-better-than-averages-for-understanding-data">Why are percentiles better than averages for understanding data?</h3><p>Percentiles are more robust to outliers than averages. This is especially true for asymmetric distributions like API response times.&nbsp;</p><p>While an average can be dramatically skewed by a few extreme values (like API calls taking 100+ seconds), percentiles like the 90th percentile only shift when a significant portion of the data changes.&nbsp;</p><p>For example: if 10&nbsp;% of API calls slow down from 250&nbsp;ms to 5 seconds, the 90th percentile will clearly show this problem, while outliers from just a few users won't trigger false alarms.</p><h3 id="how-do-percentile-approximations-work-in-timescaledb">How do percentile approximations work in TimescaleDB?</h3><p>TimescaleDB's percentile approximation hyperfunctions use modified histograms to represent data distribution more compactly while maintaining accuracy.&nbsp;</p><p>These approximations use either UDDSketch (the default) with logarithmically sized buckets or T-Digest with variable-sized buckets that are smaller at distribution extremes.&nbsp;</p><p>You can calculate approximate percentiles using syntax like <code>SELECT approx_percentile(0.9, percentile_agg(response_time)) as p90 FROM responses;</code></p><p>This syntax will use significantly less memory and computation than exact percentiles.</p><h3 id="what-are-the-benefits-of-using-percentile-approximations-for-large-datasets">What are the benefits of using percentile approximations for large datasets?</h3><p>Percentile approximations have a fixed memory footprint regardless of dataset size, unlike exact percentiles, which must store all 
values.&nbsp;</p><p>They’re parallelizable for faster computation and can be used with continuous aggregates for incremental calculation—all while still providing accurate insights with configurable error bounds.&nbsp;</p><p>For example, let’s say you’re tracking hourly API response times using a continuous aggregate with percentile approximations. This method allows you to instantly identify recent outliers compared to historical patterns without recalculating from raw data.</p><h3 id="how-do-i-choose-between-uddsketch-and-t-digest-algorithms">How do I choose between UDDSketch and T-Digest algorithms?</h3><p>UDDSketch (the default in <code>percentile_agg</code>) guarantees a maximum relative error across all percentiles. It’s ideal when you need consistent accuracy across your distribution.&nbsp;</p><p>T-Digest optimizes for accuracy at the extremes of your distribution, making it better when you care more about precisely measuring the 1st or 99th percentiles than the middle ranges.&nbsp;</p><p>You can compare both approaches using <code>SELECT approx_percentile(0.5, uddsketch(200, 0.001, response_time)) as median_udd, approx_percentile(0.5, tdigest(200, response_time)) as median_tdig FROM responses;</code></p><h3 id="how-can-i-use-percentile-approximations-with-continuous-aggregates">How can I use percentile approximations with continuous aggregates?</h3><p>Continuous aggregates allow you to precompute and incrementally update percentile approximations, significantly improving query performance.&nbsp;</p><p><strong>Step 1</strong>: Create a continuous aggregate that stores just the percentile approximation state: <code>CREATE MATERIALIZED VIEW responses_1h_agg WITH (timescaledb.continuous) AS SELECT time_bucket('1 hour'::interval, ts) as bucket, percentile_agg(response_time) FROM responses GROUP BY 1;&nbsp;</code></p><p><strong>Step 2</strong>: Query it using accessor functions like <code>approx_percentile(0.99, percentile_agg)</code> or combine multiple time 
buckets with the rollup function to analyze trends across different time periods.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Speeding up data analysis with TimescaleDB and PostgreSQL]]></title>
            <description><![CDATA[Is your data analysis process as fast and efficient as it could be? This four-part blog series will outline common data analysis problems and how TimescaleDB and PostgreSQL fixed them by making data munging tasks within analysis fast, efficient, and easily accessible. ]]></description>
            <link>https://www.tigerdata.com/blog/speeding-up-data-analysis</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/speeding-up-data-analysis</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Analytics]]></category>
            <category><![CDATA[General]]></category>
            <dc:creator><![CDATA[Miranda Auhl]]></dc:creator>
            <pubDate>Thu, 09 Sep 2021 15:32:12 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/09/magnify.jpeg">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/09/magnify.jpeg" alt="Magnifying glass against blue background" /><p><a href="https://timescale.ghost.io/blog/time-series-data/">Time-series data</a> is everywhere, and it drives decision-making in every industry. Time-series data collectively represents how a system, process, or behavior changes over time. Understanding these changes helps us to solve complex problems across numerous industries, including <a href="https://timescale.ghost.io/blog/blog/simplified-prometheus-monitoring-for-your-entire-organization-with-promscale/">observability</a>, <a href="https://timescale.ghost.io/blog/blog/how-messari-uses-data-to-open-the-cryptoeconomy-to-everyone/">financial services</a>, <a href="https://timescale.ghost.io/blog/blog/how-meter-group-brings-a-data-driven-approach-to-the-cannabis-production-industry/">Internet of Things</a>, and even <a href="https://timescale.ghost.io/blog/blog/hacking-nfl-data-with-postgresql-timescaledb-and-sql/">professional football</a>.</p><p>Depending on the type of application they’re building, developers end up collecting millions of rows of time-series data (and sometimes millions of rows of data every day or even every hour!). Making sense of this high-volume, high-fidelity data takes a particular set of data analysis skills that aren’t often exercised as part of the classic developer skillset. To perform time-series analysis that goes beyond basic questions, developers and data analysts need specialized tools, and as <a href="https://db-engines.com/en/ranking_categories">time-series data grows in prominence</a>, the <strong>efficiency</strong> of these tools becomes even more important.</p><p>Often, data analysts’ work can be boiled down to <strong>evaluating</strong>, <strong>cleaning</strong>, <strong>transforming</strong>, and <strong>modeling</strong> data. 
In my experience, I’ve found these actions are necessary for me to gain understanding from data, and I will refer to this as the “data analysis life cycle” throughout this post.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/09/data-analysis-lifecycle.jpeg" class="kg-image" alt="Graphic showing the “data analysis lifecycle”, Evaluate -> Clean -> Transform -> Model" loading="lazy" width="1838" height="466" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2021/09/data-analysis-lifecycle.jpeg 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2021/09/data-analysis-lifecycle.jpeg 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2021/09/data-analysis-lifecycle.jpeg 1600w, https://timescale.ghost.io/blog/content/images/2021/09/data-analysis-lifecycle.jpeg 1838w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Data analysis lifecycle</span></figcaption></figure><p>Excel, R, and Python are arguably some of the most commonly used data analysis tools, and, while they are all fantastic tools, they may not be suited for every job. Speaking from experience, these tools can be especially inefficient for “data munging” at the early stages of the lifecycle; specifically, the <strong>evaluating data</strong>, <strong>cleaning data</strong>, and <strong>transforming data</strong> steps involved in pre-modeling work.</p><p>As I’ve worked with larger and more complex datasets, I’ve come to believe that databases built for specific types of data - such as time-series data - are more effective for data analysis. </p><p>For background, <a href="https://www.timescale.com/products">TimescaleDB</a> is a <em>relational </em>database for time-series data. 
If your analysis is based on time-series datasets, TimescaleDB can be a great choice not only for its scalability and dependability but also for its relational nature. Because TimescaleDB is packaged as an extension to PostgreSQL, you’ll be able to look at your time-series data alongside your relational data and get even more insight. (I recognize that as a Developer Advocate at Timescale, I might be a <em>little </em>biased 😊…)</p><p>In this blog series, I will discuss each of the three data munging steps in the analysis lifecycle in-depth and demonstrate how to use TimescaleDB as a powerful tool for your data analysis.</p><p>In this introductory post, I'll explore a few of the common frustrations that I experienced with popular data analysis tools, and from there, dive into how I’ve used TimescaleDB to help alleviate each of those pain points.</p><p>In future posts we'll look at:</p><ul><li>How TimescaleDB data analysis functionality can replace work commonly performed in Python and pandas</li><li>How TimescaleDB vs. Python and pandas compare (benchmarking a standard data analysis workflow)</li><li>How to use TimescaleDB to conduct an end-to-end, deep-dive data analysis, using real yellow taxi cab data from the <a href="https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page">New York City Taxi and Limousine Commission</a> (NYC TLC).</li></ul><p>If you are interested in trying out TimescaleDB and PostgreSQL functionality right away, <a href="https://www.timescale.com/timescale-signup">sign up for a free 30-day trial</a> or <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/install-timescaledb/self-hosted/">install and manage it on your instances</a>. 
(You can also learn more by <a href="https://docs.timescale.com/timescaledb/latest/tutorials/">following one of our many tutorials</a>.)</p><h2 id="common-data-analysis-tools-and-%E2%80%9Cthe-problem%E2%80%9D">Common data analysis tools and “the problem”</h2><p>As we’ve discussed, three of the most popular tools used for data analysis are Excel, R, and Python. While they are great tools in their own right, they are not optimized to efficiently perform every step in the analysis process. </p><p>In particular, most data scientists (including myself!) struggle with similar issues as the amount of data grows or the same analysis needs to be redone month after month.</p><p>Some of these struggles include:</p><ul><li>Data storage and access: Where is the best place to store and maintain my data for analysis?</li><li>Data size and its influence on the analysis: How can I improve efficiency for data munging tasks, especially as data scales?</li><li>Script storage and accessibility: What can I do to improve data munging script storage and maintenance?</li><li>Easily utilizing new technologies: How could I set up my data analysis toolchain to allow for easy transitions to new technologies?</li></ul><p>So buckle in, keep your arms and legs in the vehicle at all times, and let’s start looking at these problems!</p><h2 id="data-analysis-issue-1-storing-and-accessing-data">Data analysis issue #1: storing and accessing data</h2><p>To do data analysis, you need access to… data.</p>
<!--kg-card-begin: html-->
<center>
    <iframe src="https://giphy.com/embed/rIq6ASPIqo2k0" width="100px" 
            height="100px" style="min-width: 100%" frameBorder="0" class="giphy-embed" position="relative" aria-label="Image of Data from Star Trek smiling"></iframe>
    <p>
        <a href="https://giphy.com/gifs/rIq6ASPIqo2k0" position="center">via GIPHY</a>
    </p>
</center>
<!--kg-card-end: html-->
<p>Managing where that data lives and how easily you can access it is the preliminary (and often most important) step in the analysis journey. Every time I begin a new data analysis project, this is where I run into my first dilemma. Regardless of the original data source, I always ask “where is the best place to store and maintain the data as I start working through the data munging process?”</p><p>Although it's becoming more common for data analysts to use databases for storing and querying data, it's still not ubiquitous. Too often, raw data is provided in a stream of CSV files or APIs that produce JSON. While this may be manageable for smaller projects, it can quickly become overwhelming to maintain and difficult to manage from project to project. </p><p>For example, let’s consider how we might use Python as our data analysis tool of choice. </p><p>While using Python for data analysis, I have the option of ingesting data through files/APIs OR a database connection. </p><p>When I used files or APIs for querying data during analysis, I often faced questions like:</p><ul><li>Where are the files located? What happens if the URL or parameters change for an API?</li><li>What happens if duplicate files are made? And what if updates are made to one file, and not the other?</li><li>How do I best share these files with colleagues?</li><li>What happens if multiple files depend on one another?</li><li>How do I prevent incorrect data from being added to the wrong column of a CSV? (i.e., a decimal where a string should be)</li><li>What about very large files? What is the ingestion rate for a 10MB, 100MB, 1GB, or 1TB file?</li></ul><p>After running into these initial problems project after project, I knew there had to be a better solution. <strong>I knew that I needed a single source of truth for my data – and it started to become clear that a specialized SQL database might be my answer!</strong></p><p>Now, let’s consider what happens if I connect to TimescaleDB instead. 
</p><p>By importing my time-series data into TimescaleDB, I can create one source of truth for all of my data. As a result, collaborating with others becomes as simple as sharing access to the database. Any modification to the data munging process within the database means that all users see the same changes at the same time, as opposed to parsing through CSV files to verify they have the right version. </p><p>Additionally, databases can typically handle much larger data loads than a script written in Python or R. TimescaleDB was built to house, maintain, and query terabytes of data efficiently and cost-effectively (both computationally speaking AND for your wallet). With features like <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/continuous-aggregates/">continuous aggregates</a> and native <a href="https://www.tigerdata.com/blog/building-columnar-compression-in-a-row-oriented-database" rel="noreferrer">columnar compression</a>, storing and analyzing years of time-series data becomes efficient while the data stays easily accessible.</p><p>In short, data that accumulates over time, especially when it comes from different sources, can be a nightmare to maintain and access efficiently. But it doesn’t have to be.</p><h2 id="data-analysis-issue-2-maximizing-analysis-speed-and-computation-efficiency-the-bigger-the-dataset-the-bigger-the-problem">Data analysis issue #2: maximizing analysis speed and computation efficiency (the bigger the dataset, the bigger the problem)</h2><p>Excel, R, and Python are all capable of performing the first three steps of the data analysis “lifecycle”: evaluating, cleaning, and transforming data. However, these technologies are not generally optimized for speed or computational efficiency during the process. 
</p><p>In numerous projects over the years, I’ve found that as the size of my dataset increased, the process of importing, cleaning, and transforming it became more difficult, time-consuming, and, in some cases, impossible. For Python and R, parsing through large amounts of data seemed to take forever, and Excel would simply crash once it hit millions of rows.  </p><p>Things became <em>especially </em>difficult when I needed to create additional tables for things like aggregates or data transformations: some lines of code could take seconds or, in extreme cases, minutes to run depending on the size of the data, the computer I was using, or the complexity of the analysis. </p><p>While seconds or minutes may not <em>seem</em> like a lot, it adds up and amounts to hours or days of lost productivity when you’re performing analysis that needs to be run hundreds or thousands of times a month! </p><p>To illustrate, let’s look at a Python example once again. </p><p>Say I was working with <a href="https://www.kaggle.com/srinuti/residential-power-usage-3years-data-timeseries">this IoT data set taken from Kaggle</a>. The set contains two tables: one specifying energy consumption for a single home in Houston, Texas, and the other documenting weather conditions.</p><p>With Python, the first steps in my analysis would be to pull in the data and observe it. </p><p>When using Python to do this, I would run code like this 👇</p><pre><code class="language-python">import psycopg2
import pandas as pd
import configparser


## use config file for database connection information
config = configparser.ConfigParser()
config.read('env.ini')

## establish connection
conn = psycopg2.connect(database=config.get('USERINFO', 'DB_NAME'), 
                        host=config.get('USERINFO', 'HOST'), 
                        user=config.get('USERINFO', 'USER'), 
                        password=config.get('USERINFO', 'PASS'), 
                        port=config.get('USERINFO', 'PORT'))

## define the queries for selecting data out of our database                        
query_weather = 'select * from weather'
query_power = 'select * from power_usage'

## create cursor to extract data and place it into a DataFrame
cursor = conn.cursor()
cursor.execute(query_weather)
weather_data = cursor.fetchall()
cursor.execute(query_power)
power_data = cursor.fetchall()
## you will have to manually set the column names for the data frame
weather_df = pd.DataFrame(weather_data, columns=['date','day','temp_max','temp_avg','temp_min','dew_max','dew_avg','dew_min','hum_max','hum_avg','hum_min','wind_max','wind_avg','wind_min','press_max','press_avg','press_min','precipit','day_of_week'])
power_df = pd.DataFrame(power_data, columns=['startdate', 'value_kwh', 'day_of_week', 'notes'])
cursor.close()

print(weather_df.head(20))
print(power_df.head(20))
</code></pre><p>Altogether, this code took 2.718 seconds to run using my <a href="https://www.apple.com/shop/buy-mac/macbook-pro/16-inch-space-gray-2.3ghz-8-core-processor-1tb#">2019 MacBook Pro laptop with 32GB memory</a>. </p><p>But what if I run the equivalent queries with SQL directly in the database?</p><pre><code class="language-sql">select * from weather;
select * from power_usage;</code></pre><table>
<thead>
<tr>
<th>startdate</th>
<th>value_kwh</th>
<th>day_of_week</th>
<th>notes</th>
</tr>
</thead>
<tbody>
<tr>
<td>2016-01-06 01:00:00</td>
<td>1</td>
<td>2</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-01-06 02:00:00</td>
<td>1</td>
<td>2</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-01-06 03:00:00</td>
<td>1</td>
<td>2</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-01-06 04:00:00</td>
<td>1</td>
<td>2</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-01-06 05:00:00</td>
<td>0</td>
<td>2</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-01-06 06:00:00</td>
<td>0</td>
<td>2</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-01-06 07:00:00</td>
<td>0</td>
<td>2</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-01-06 08:00:00</td>
<td>0</td>
<td>2</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-01-06 09:00:00</td>
<td>0</td>
<td>2</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-01-06 10:00:00</td>
<td>0</td>
<td>2</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-01-06 11:00:00</td>
<td>1</td>
<td>2</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-01-06 12:00:00</td>
<td>0</td>
<td>2</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-01-06 13:00:00</td>
<td>0</td>
<td>2</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-01-06 14:00:00</td>
<td>0</td>
<td>2</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-01-06 15:00:00</td>
<td>0</td>
<td>2</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-01-06 16:00:00</td>
<td>1</td>
<td>2</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-01-06 17:00:00</td>
<td>4</td>
<td>2</td>
<td>weekday</td>
</tr>
</tbody>
</table>
<p>This query only took 0.342 seconds to run, almost 8x faster when compared to the Python script. </p><p>This time difference makes a lot of sense when we consider that Python must connect to a database, then run the SQL query, then parse the retrieved data, and then import it into a DataFrame. While almost three seconds is fast, this extra time for processing adds up as the script becomes more complicated and more data munging tasks are added. </p><p>Pulling in the data and observing it is only the beginning of my analysis! What happens when I need to perform a transforming task, like aggregating the data?</p><p>For this dataset, when we look at the <code>power_usage</code> table - as seen above - kWh readings are recorded every hour. If I want to do daily analysis, I have to aggregate the hourly data into “day buckets”.  </p><p>If I used Python for this aggregation, I could use something like 👇</p><pre><code class="language-python"># sum power usage by day, bucket by day
## create column for the day 
day_col = pd.to_datetime(power_df['startdate']).dt.strftime('%Y-%m-%d')
power_df.insert(0, 'date_day', day_col)
agg_power = power_df.groupby('date_day').agg({'value_kwh' : 'sum', 'day_of_week' : 'unique', 'notes' : 'unique' })
print(agg_power)</code></pre><p>...which takes 0.49 seconds to run (this does not include the time for importing our data).</p><p>Alternatively, with the TimescaleDB <a href="https://docs.timescale.com/api/latest/hyperfunctions/time_bucket/"><code>time_bucket()</code></a> function, I could do this aggregation directly in the database using the following query 👇</p><pre><code class="language-sql">select 
	time_bucket(interval '1 day', startdate ) as day,
	sum(value_kwh),
	day_of_week,
	notes
from power_usage pu 
group by day, day_of_week, notes
order by day</code></pre><table>
<thead>
<tr>
<th>day</th>
<th>sum</th>
<th>day_of_week</th>
<th>notes</th>
</tr>
</thead>
<tbody>
<tr>
<td>2016-01-06 00:00:00</td>
<td>27</td>
<td>2</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-01-07 00:00:00</td>
<td>42</td>
<td>3</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-01-08 00:00:00</td>
<td>51</td>
<td>4</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-01-09 00:00:00</td>
<td>50</td>
<td>5</td>
<td>weekend</td>
</tr>
<tr>
<td>2016-01-10 00:00:00</td>
<td>45</td>
<td>6</td>
<td>weekend</td>
</tr>
<tr>
<td>2016-01-11 00:00:00</td>
<td>22</td>
<td>0</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-01-12 00:00:00</td>
<td>12</td>
<td>1</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-02-06 00:00:00</td>
<td>32</td>
<td>5</td>
<td>weekend</td>
</tr>
<tr>
<td>2016-02-07 00:00:00</td>
<td>62</td>
<td>6</td>
<td>weekend</td>
</tr>
<tr>
<td>2016-02-08 00:00:00</td>
<td>48</td>
<td>0</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-02-09 00:00:00</td>
<td>23</td>
<td>1</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-02-10 00:00:00</td>
<td>24</td>
<td>2</td>
<td>weekday</td>
</tr>
</tbody>
</table>
<p>...which only takes 0.087 seconds and is over 5x faster than the Python script. </p><p>You can start to see a pattern here. </p><p>As mentioned above, <a href="https://docs.timescale.com/timescaledb/latest/overview/core-concepts/#why-use-timescaledb">TimescaleDB was created to efficiently query and store time-series data</a>. But simply querying data only scratches the surface of the possibilities that TimescaleDB and PostgreSQL provide. </p><p>TimescaleDB and PostgreSQL offer a wide range of tools and functionality that can replace the need for additional tools to evaluate, clean, and transform your data. Some of the TimescaleDB functionality includes continuous aggregates, compression, and <a href="https://docs.timescale.com/api/latest/hyperfunctions/">hyperfunctions</a>, all of which allow you to do nearly all data munging tasks directly within the database. </p><p>When I performed the evaluating, cleaning, and transforming steps of my analysis directly within TimescaleDB, I cut out the need to use additional tools - like Excel, R, or Python - for data munging tasks. I could pull cleaned and transformed data, ready for modeling, directly into Excel, R, or Python.</p><h2 id="data-analysis-issue-3-storing-and-maintaining-scripts-for-data-analysis">Data analysis issue #3: storing and maintaining scripts for data analysis</h2><p>Another potential downside of exclusively using Excel, R, or Python for the entire data analysis workflow is that all of the logic for analyzing the data is contained within a script file. As with juggling many different data sources, maintaining script files can be inconvenient and messy.  
</p><p>Some common issues that I - and many data analysts - run into include:</p><ul><li>Losing files</li><li>Unintentionally creating duplicate files</li><li>Changing or updating some files but not others</li><li>Needing to write and run scripts to access transformed data (see below example)</li><li>Spending time re-running scripts whenever new raw data is added (see below example)</li></ul><p>While you can use a code repository to overcome some of these issues, it will not fix the last two.  </p><p>Let’s consider our Python scenario again. </p><p>Say that I used a Python script exclusively for all my data analysis tasks. What happens if I need to export my transformed data to use in a report on energy consumption in Texas? </p><p>Likely, I would have to add some code within the script to allow for exporting the data and then run the script again to actually export it. Depending on the content of the script and how long it takes to transform the data, this could be pretty inconvenient and inefficient.</p><p>What if I also just got a bunch of new energy usage and weather data? For me to incorporate this new raw data into existing visualizations or reports, I would need to run the script again and make sure that all of my data munging tasks run as expected. </p><p>Database functions, like continuous aggregates and materialized views, can create transformed data that can be stored and queried directly from your database without running a script. Additionally, I can create policies for continuous aggregates to regularly keep this transformed data up-to-date any time raw data is modified. Because of these policies, I wouldn't have to worry about running scripts to re-transform data for use, making access to updated data efficient. 
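</p><p>As a rough sketch of what that could look like for the <code>power_usage</code> table: the statements below assume <code>power_usage</code> is already a hypertable partitioned on <code>startdate</code>, and the view name <code>power_usage_daily</code> and the policy intervals are illustrative choices for this example only:</p><pre><code class="language-sql">-- Sketch only: assumes power_usage is a hypertable on startdate.
-- power_usage_daily and the intervals below are hypothetical choices.
create materialized view power_usage_daily
with (timescaledb.continuous) as
select
	time_bucket(interval '1 day', startdate) as day,
	sum(value_kwh) as total_kwh
from power_usage
group by 1;

-- Refresh every hour, re-materializing the last 7 days of raw data
-- (up to 1 hour ago) so late or modified rows are picked up.
select add_continuous_aggregate_policy('power_usage_daily',
	start_offset =&gt; interval '7 days',
	end_offset =&gt; interval '1 hour',
	schedule_interval =&gt; interval '1 hour');</code></pre><p>With a policy like this in place, a report or dashboard could run <code>select * from power_usage_daily</code> instead of re-running a transformation script. </p><p>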
With TimescaleDB, many of the data munging tasks in the analysis lifecycle that you would normally do within your scripts can be accomplished using built-in TimescaleDB and PostgreSQL functionality.</p><h2 id="data-analysis-issue-4-easily-utilizing-new-or-additional-technologies">Data analysis issue #4: easily utilizing new or additional technologies</h2><p>Finally, the last step in the data analysis lifecycle: modeling. Whenever I wanted to use a new tool or technology to create a visualization, it was difficult to take my transformed data and use it for modeling or visualizations elsewhere.</p><p>Python, R, and Excel are all pretty great for their visualization and modeling capabilities. However, what happens when your company or team wants to adopt a <em>new</em> tool?</p><p>In my experience, this often means either adding on another step to the analysis process, or rediscovering how to perform the evaluating, cleaning, and transforming steps within the new technology.</p><p>For example, in one of my previous jobs, I was asked to convert a portion of my analysis into Power BI for business analytics purposes. Some of the visualizations my stakeholders wanted required me to access transformed data from my Python script. At the time, I had the option to export the data from my Python script or figure out how to transform the data in Power BI directly. Neither option was ideal, and both were guaranteed to take extra time. </p><p>When it comes to adopting new visualization or modeling tools, using a database for evaluating, cleaning, and transforming data can again work in your favor. Most visualization tools - such as <a href="https://grafana.com/">Grafana</a>, <a href="https://www.metabase.com/">Metabase</a>, or <a href="https://powerbi.microsoft.com/">Power BI</a> - allow users to import data from a database directly. 
</p><p>Since I can do most of my data munging tasks within TimescaleDB, adding or switching tools - such as using Power BI for dashboard capabilities - becomes as simple as connecting to my database, pulling in the munged data, and using the new tool for visualizations and modeling.</p><h2 id="wrapping-up">Wrapping up</h2><p>In summary, Excel, R, and Python are all great tools to use for analysis, but may not be the best tools for every job. Case in point: my struggles with time-series data analysis, especially on big datasets. </p><p>With TimescaleDB functionality, you can house your data and perform the evaluating, cleaning, and transforming aspects of data analysis, all directly within your database – and solve a lot of common data analysis woes in the process (which I’ve - hopefully! - demonstrated in this post).</p><p>In the blog posts to come, I’ll explore TimescaleDB and PostgreSQL functionality compared to Python, benchmark TimescaleDB performance vs. Python and pandas for data munging tasks, and conduct a deep dive into data analysis with TimescaleDB (for data munging) and Python (for modeling and visualizations). </p><p>If you have questions about TimescaleDB, time-series data, or any of the functionality mentioned above, <a href="https://slack.timescale.com/">join our <strong>community Slack</strong></a>, where you'll find an active community of time-series enthusiasts and various Timescale team members (including me!). </p><p>If you’re ready to see the power of TimescaleDB and PostgreSQL right away, you can sign up for <a href="https://www.timescale.com/timescale-signup">a free 30-day trial</a> or <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/install-timescaledb/self-hosted/">install TimescaleDB</a> and manage it on your current PostgreSQL instances. 
We also have a bunch of great <a href="https://docs.timescale.com/timescaledb/latest/tutorials/">tutorials</a> to help get you started.</p><p>Until next time!<br></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to Create (Lots of) Sample Time-Series Data With PostgreSQL generate_series()]]></title>
            <description><![CDATA[Use generate_series to see how TimescaleDB uses PostgreSQL's rock-solid foundation to build a scalable, fully extensible, powerful time-series database.]]></description>
            <link>https://www.tigerdata.com/blog/how-to-create-lots-of-sample-time-series-data-with-postgresql-generate_series</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/how-to-create-lots-of-sample-time-series-data-with-postgresql-generate_series</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Time Series Data]]></category>
            <category><![CDATA[General]]></category>
            <dc:creator><![CDATA[Ryan Booz]]></dc:creator>
            <pubDate>Thu, 26 Aug 2021 18:05:53 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/10/Screenshot-2023-10-12-at-5.55.04-PM.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/10/Screenshot-2023-10-12-at-5.55.04-PM.png" alt="How to Create (Lots of) Sample Time-Series Data With PostgreSQL generate_series()" /><p>As the makers of <a href="https://www.timescale.com" rel="noreferrer">TimescaleDB</a>, we often need to quickly create lots of sample time-series data to demonstrate a new database feature, run a benchmark, or talk about use cases internally. We also see users in our <a href="https://slack.timescale.com/">community Slack</a> asking how they can test a feature or decide if their data is a good fit for TimescaleDB. </p><p><a href="https://www.timescale.com/learn/is-postgres-partitioning-really-that-hard-introducing-hypertables" rel="noreferrer"><em>📝 If you want to know why people want to use Timescale, check this out.</em></a></p><p>Although using real data from your current application would be great (and ideal), knowing <strong>how to quickly create a representative time-series dataset using varying cardinalities and different lengths of time</strong> is a helpful - and advantageous - skill to have.</p><p>Fortunately, PostgreSQL provides a built-in function to help us create this sample data using the SQL that we already know and love - no external tools required.</p><p>In this three-part blog series, we'll go through a few ways to use the <code>generate_series()</code> function to create large datasets in PostgreSQL, including:</p><ul><li>What PostgreSQL <code>generate_series()</code> is and how to use it for basic data generation </li><li>How to create more realistic-looking time-series data with custom <a href="https://www.tigerdata.com/learn/understanding-postgresql-user-defined-functions" rel="noreferrer">PostgreSQL functions</a></li><li>Ways to create complex time-series data using additional PostgreSQL math functions and JOINs.</li></ul><p>By the end of this series, you'll be ready to 
test almost any PostgreSQL or TimescaleDB feature and create quick datasets for general testing, meetup presentations, demos, and more!</p><h2 id="intro-to-postgresql-generateseries">Intro to PostgreSQL generate_series()</h2><p><code>generate_series()</code> is a built-in <a href="https://www.tigerdata.com/blog/function-pipelines-building-functional-programming-into-postgresql-using-custom-operators" rel="noreferrer">PostgreSQL function</a> that makes it easy to create ordered tables of numbers or dates. The PostgreSQL documentation calls it a <a href="https://www.postgresql.org/docs/13/functions-srf.html">Set Returning Function</a> because it can return more than one row. </p><p>The function is simple to get started with: it takes two required arguments, the start and stop values for the generated data.</p><pre><code class="language-sql">SELECT * FROM generate_series(1,5);</code></pre><p>This will produce output that looks like this:</p><pre><code class="language-sql">generate_series|
---------------|
              1|
              2|
              3|
              4|
              5|</code></pre><p>This first example shows that <code>generate_series()</code> returns sequential numbers between a start parameter and a stop parameter. When used to generate numeric data, <code>generate_series()</code> increments the values by 1 by default. However, an optional third parameter, known as the step parameter, can be used to specify a different increment.</p><p>For example, if we wanted to generate rows that counted by two from 0 to 10, we could use this SQL instead:</p><pre><code class="language-sql">SELECT * from generate_series(0,10,2);

 generate_series 
-----------------
               0
               2
               4
               6
               8
              10</code></pre><h2 id="postgresql-generateseries-with-dates">PostgreSQL generate_series() With Dates</h2><p>Using <code>generate_series()</code> to produce a range of dates is equally straightforward. The only difference is that the third parameter, the step <code>INTERVAL</code> used to increment the date, is <strong><em>required </em></strong>for date generation, as shown here:</p><pre><code class="language-sql">SELECT * from generate_series(
    '2021-01-01',
    '2021-01-02',
    INTERVAL '1 hour'
  );

    generate_series     
------------------------
 2021-01-01 00:00:00+00
 2021-01-01 01:00:00+00
 2021-01-01 02:00:00+00
 2021-01-01 03:00:00+00
 2021-01-01 04:00:00+00
 2021-01-01 05:00:00+00
 2021-01-01 06:00:00+00
 2021-01-01 07:00:00+00
 2021-01-01 08:00:00+00
 2021-01-01 09:00:00+00
 2021-01-01 10:00:00+00
 2021-01-01 11:00:00+00
 2021-01-01 12:00:00+00
 2021-01-01 13:00:00+00
 2021-01-01 14:00:00+00
 2021-01-01 15:00:00+00
 2021-01-01 16:00:00+00
 2021-01-01 17:00:00+00
 2021-01-01 18:00:00+00
 2021-01-01 19:00:00+00
 2021-01-01 20:00:00+00
 2021-01-01 21:00:00+00
 2021-01-01 22:00:00+00
 2021-01-01 23:00:00+00
 2021-01-02 00:00:00+00
(25 rows)</code></pre><p>Notice that the returned dates are <em>inclusive</em> of the start and stop values, just as we saw with the numeric example before. The reason we got 25 rows (25 hourly timestamps rather than the 24 you might expect) is that the stop value falls exactly on a one-hour step from the start. As long as the step <code>INTERVAL</code> lands exactly on the stop value, that value is included.</p><p>However, if the step interval causes the stop value to be skipped over, it is not included in the output. For example, if we change the step <code>INTERVAL</code> above to '1 hour 25 minutes', the query returns only 17 rows, the last of which falls before the stop value.</p><pre><code class="language-sql">SELECT * from generate_series(
    '2021-01-01',
    '2021-01-02',
    INTERVAL '1 hour 25 minutes'
  );

    generate_series     
------------------------
 2021-01-01 00:00:00+00
 2021-01-01 01:25:00+00
 2021-01-01 02:50:00+00
 2021-01-01 04:15:00+00
 2021-01-01 05:40:00+00
 2021-01-01 07:05:00+00
 2021-01-01 08:30:00+00
 2021-01-01 09:55:00+00
 2021-01-01 11:20:00+00
 2021-01-01 12:45:00+00
 2021-01-01 14:10:00+00
 2021-01-01 15:35:00+00
 2021-01-01 17:00:00+00
 2021-01-01 18:25:00+00
 2021-01-01 19:50:00+00
 2021-01-01 21:15:00+00
 2021-01-01 22:40:00+00
(17 rows)</code></pre><h2 id="how-to-generate-time-series-data">How to Generate Time-Series Data</h2><p>Now that we understand how to use <code>generate_series()</code>, how do we create some <a href="https://timescale.ghost.io/blog/time-series-data/">time-series data</a> to insert into TimescaleDB for testing and visualization? </p><p>To do this, we utilize a standard feature of relational databases and SQL. Recall that <code>generate_series()</code> is a Set Returning Function that returns a "table" of data (a set) just as if we had selected it from a table. Therefore, just like when we select data from a regular table with SQL, we can add more columns of data using other functions or static values. </p><p>You may have done this at some point with a string of text that needed to be repeated for each returned row.</p><pre><code class="language-sql">SELECT 'Hello Timescale!' as myStr, * FROM generate_series(1,5);

      mystr       | generate_series 
------------------+-----------------
 Hello Timescale! |               1
 Hello Timescale! |               2
 Hello Timescale! |               3
 Hello Timescale! |               4
 Hello Timescale! |               5
(5 rows)</code></pre><p>In this example, we simply added data to the rows being returned from <code>generate_series()</code>. For every row it returned, we added a column with some static text.</p>
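<p>Because the set returned by <code>generate_series()</code> behaves like any other row source, the same kind of <code>SELECT</code> can feed an <code>INSERT</code>. The following is a minimal sketch of that idea: the temporary table <code>sample_labels</code> and its columns are hypothetical names chosen for illustration, not something defined elsewhere in this series.</p><pre><code class="language-sql">-- Hypothetical scratch table (name and columns are assumptions for this sketch).
CREATE TEMPORARY TABLE sample_labels (
    id    integer,
    label text
);

-- generate_series() supplies the rows; the literal is repeated for each one.
INSERT INTO sample_labels (id, label)
SELECT n, 'Hello Timescale!'
FROM generate_series(1, 5) AS n;

-- Confirm how many rows were stored.
SELECT count(*) AS row_count FROM sample_labels;</code></pre><p>A temporary table disappears at the end of the session, which keeps experiments like this from leaving anything behind.</p>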
<p>But, these added columns don't have to be static data! Using the <a href="https://www.postgresql.org/docs/13/functions-math.html">built-in <code>random()</code></a> function (for example), we can generate data that starts to look a little more like data we'd see when monitoring computer CPU values (i.e., realistic data to use for our demos and tests!).</p><pre><code class="language-sql">SELECT random()*100 as CPU, * FROM generate_series(1,5);

        cpu         | generate_series 
--------------------+-----------------
 48.905450626783775 |               1
  71.94031820213382 |               2
 25.210553719011486 |               3
  19.24163308357194 |               4
  8.434915599133674 |               5
(5 rows)</code></pre><p>In this example, <code>generate_series()</code> produced five rows of data; for every row, PostgreSQL also executed the <code>random()</code> function to produce a value. With a little creativity, this can become a pretty efficient method for generating lots of data quickly.</p><h2 id="performance-considerations-when-generating-sample-data">Performance Considerations When Generating Sample Data</h2><h3 id="functions">Functions</h3><p>Before getting too deep into the many ways we can combine functions with <code>generate_series()</code> to create more realistic data, let's tackle one potential downside of functions before your mind starts racing through the myriad ways to produce interesting sample data. </p><p>Functions are a powerful database tool, allowing complex code to be hidden behind a standard interface. In the example above, I don't have to know how <code>random()</code> produces a value. I just have to call it in my SQL statement (without arguments) and a random value is returned for every row.</p><p>Unfortunately, functions can slow down your query and use lots of resources if you're not careful. This is true any time you call functions in SQL, not just when creating sample data.</p><p>The reason for this inefficiency is that scalar functions are executed once for each column and row in which they are used. So, if you produce a set of 1,000 rows with <code>generate_series()</code> and then add 5 columns with data generated by functions, PostgreSQL effectively has to process 5,001 function calls: one call to <code>generate_series()</code>, which returns the initial 1,000 rows as a set, plus five calls for each of those 1,000 rows to fill the five additional columns.</p><p>The main issue is that PostgreSQL has no way to know how much "work" it will take to generate the result from each function in the query. 
Something like <code>random()</code> requires minimal effort, so calling it 5,000 or even 1 million times won't break the bank for most machines. However, if the function you're calling generates lots of temporary data internally before producing a result, be aware that it could perform more slowly and consume more resources on your database server.</p><p><em>The takeaway here is that nothing comes for free, especially when functions are called once for every column, for every row</em>. Most of the time, generating even tens of millions of rows will complete in a handful of seconds. But if you notice that a data-generating query is taking a long time, consider finding alternative ways to generate data for the columns that require more resources.</p><h3 id="transactions">Transactions</h3><p>There is a second caveat to be aware of when using a tool like <code>generate_series()</code>. All of the data is created and inserted in a single transaction. You <em>can</em> put in more time and effort to create functions and methods for breaking up the work, but understand that if you create (SELECT) one large set of data with 100 million rows, it's likely to consume a lot of memory and CPU on the server while the process occurs.</p><p>This doesn't mean it's broken or any less valid as a method for generating data. However, as we demonstrate additional ways to create more realistic data in parts 2 and 3 of this series, recognize that the more data you create, the more resources you'll need from the server unless you manage batches and transactions with a more advanced setup.</p><p>Alright, let's get back to the fun!</p><h2 id="tips-for-quickly-increasing-sample-dataset-scale">Tips for Quickly Increasing Sample Dataset Scale</h2><p>So far we've looked at how <code>generate_series()</code> works and how you can use functions to add dynamic content to the output. 
But how do we quickly get to the scale of tens of millions of rows (or more)?</p><p>This is where the concept of a Cartesian product comes into play with databases. A Cartesian product (otherwise known as a <a href="https://www.tigerdata.com/learn/what-is-a-sql-cross-join" rel="noreferrer">CROSS JOIN</a>) takes two or more result sets (rows from multiple tables or functions) and produces a new result set whose row count equals the row count of the first set multiplied by the row count of the second. </p><p>In doing so, the database outputs every row from the first table with the value of the first row from the second table. It then repeats this for each remaining row of the second table until every combination of rows has been produced. (And yes, if you join more than two tables this way, the row count just keeps multiplying, with tables processed left to right.)</p><p>Let's look at a small example using two <code>generate_series()</code> calls in the same SELECT statement.</p><pre><code class="language-sql">SELECT * from generate_series(1,10) a, generate_series(1,2) b;

a |b|
--+-+
 1|1|
 2|1|
 3|1|
 4|1|
 5|1|
 6|1|
 7|1|
 8|1|
 9|1|
10|1|
 1|2|
 2|2|
 3|2|
 4|2|
 5|2|
 6|2|
 7|2|
 8|2|
 9|2|
10|2|</code></pre><p>As you can see, PostgreSQL generated 10 rows for the first series and 2 rows for the second series. After processing and iterating over each table consecutively, the query produced a total of 20 rows (10 x 2). For generating lots of data quickly, this is about as easy as it gets!</p><p>The same process applies when using <code>generate_series()</code> to return a set of dates. If you (effectively) CROSS JOIN multiple sets (numbers or dates), the total number of rows in the final set will be the product of the row counts of all sets. Again, for generating sample time-series data, you'd be hard-pressed to find an easier method!</p><p>In this example, we generate 12 timestamps an hour apart, a random value representing CPU usage, and then a second series of four values that represent IDs for fake devices. This should produce 48 rows (i.e., 12 timestamps x 4 device IDs = 48 rows).</p><pre><code class="language-sql">SELECT time, device_id, random()*100 as cpu_usage 
FROM generate_series(
    '2021-01-01 00:00:00',
    '2021-01-01 11:00:00',
    INTERVAL '1 hour'
  ) as time, 
generate_series(1,4) device_id;


time               |device_id|cpu_usage          |
-------------------+---------+-------------------+
2021-01-01 00:00:00|        1|0.35415126479989567|
2021-01-01 01:00:00|        1| 14.013393572770028|
2021-01-01 02:00:00|        1|   88.5015939122006|
2021-01-01 03:00:00|        1|  97.49037810105996|
2021-01-01 04:00:00|        1|  50.22781125586846|
2021-01-01 05:00:00|        1|  77.93431470586931|
2021-01-01 06:00:00|        1|  45.73481750582076|
2021-01-01 07:00:00|        1|   70.7999843735724|
2021-01-01 08:00:00|        1|   4.72949831884506|
2021-01-01 09:00:00|        1|  85.29122113229981|
2021-01-01 10:00:00|        1| 14.539664281598874|
2021-01-01 11:00:00|        1|  45.95244258556228|
2021-01-01 00:00:00|        2|  46.41196423062297|
2021-01-01 01:00:00|        2|  74.39903569177027|
2021-01-01 02:00:00|        2|  85.44087332221935|
2021-01-01 03:00:00|        2|  4.329394730750735|
2021-01-01 04:00:00|        2| 54.645873866589056|
2021-01-01 05:00:00|        2|  6.544334492894777|
2021-01-01 06:00:00|        2|  39.05071228953645|
2021-01-01 07:00:00|        2|  71.07264365438404|
2021-01-01 08:00:00|        2|   72.4732704336219|
2021-01-01 09:00:00|        2| 34.533280927542975|
2021-01-01 10:00:00|        2| 26.764760864598003|
2021-01-01 11:00:00|        2|  62.32048879645227|
2021-01-01 00:00:00|        3|  63.01888063314749|
2021-01-01 01:00:00|        3|  21.70606884856987|
2021-01-01 02:00:00|        3|  32.47610779097485|
2021-01-01 03:00:00|        3| 47.565982341726354|
2021-01-01 04:00:00|        3|  64.34867263419619|
2021-01-01 05:00:00|        3|  57.74424991855476|
2021-01-01 06:00:00|        3| 55.593286571750156|
2021-01-01 07:00:00|        3|  36.92650110894995|
2021-01-01 08:00:00|        3| 53.166926049881624|
2021-01-01 09:00:00|        3| 10.009505806123897|
2021-01-01 10:00:00|        3| 58.067700285561585|
2021-01-01 11:00:00|        3|  81.58883725078034|
2021-01-01 00:00:00|        4|   78.1768041898232|
2021-01-01 01:00:00|        4|  84.51505102850199|
2021-01-01 02:00:00|        4| 24.029611792753514|
2021-01-01 03:00:00|        4|  17.08996115345549|
2021-01-01 04:00:00|        4| 29.642690955760997|
2021-01-01 05:00:00|        4|  90.83844806413275|
2021-01-01 06:00:00|        4| 6.5019080489854275|
2021-01-01 07:00:00|        4|    32.336484070672|
2021-01-01 08:00:00|        4|    55.595524107963|
2021-01-01 09:00:00|        4|   97.5442141375293|
2021-01-01 10:00:00|        4|   37.0741925805568|
2021-01-01 11:00:00|        4| 19.093927249791776|</code></pre><h2 id="choosing-a-date-range">Choosing a Date Range</h2><p>Now it's time to put all of the features and concepts together. We've seen how to use <code>generate_series()</code> to create a sample table of data (both numbers and dates), add static and dynamic content to each row, and finally how to join multiple sets of data together to produce a predictable number of rows of test data.</p><p>The final piece of the puzzle is to figure out how to create date ranges that are more dynamic using date math. Once again, PostgreSQL makes this straightforward because date math is handled automatically.</p><p>When generating sample data, you usually have an idea of the duration of time you want to generate: one month, six months, or a year, for instance. But calculating your use case's exact start and end timestamps can get pretty tedious. </p><p>Instead, it's often easier to use <code>now()</code> or a static ending timestamp (e.g., '2021-06-01 00:00:00'), and then use date math to get the starting timestamp based on your chosen interval. This makes it <em>very</em> easy to generate data for different durations simply by changing the interval. And to be clear, you can go the other direction (picking a start timestamp and adding time), but that can often lead to data in the future.</p><p>Let's look at three examples to demonstrate ways of creating dynamic date ranges.</p><h3 id="create-six-months-of-data-with-one-hour-intervals-ending-now">Create six months of data with one-hour intervals, ending now()</h3><pre><code class="language-sql">SELECT time, device_id, random()*100 as cpu_usage 
FROM generate_series(
    now() - INTERVAL '6 months',
    now(),
    INTERVAL '1 hour'
   ) as time, 
generate_series(1,4) device_id;</code></pre><p>Notice that we use the PostgreSQL function <code>now()</code> to automatically choose the ending timestamp, and then use date math (<code>- INTERVAL '6 months'</code>) to let PostgreSQL find the starting timestamp for us. With almost no effort, we can generate weeks, months, or years of time-series data by changing the <code>INTERVAL</code> we subtract from <code>now()</code>.</p><h3 id="create-one-year-of-data-with-one-hour-intervals-ending-on-a-timestamp">Create six months of data with one-hour intervals, ending on a timestamp</h3><pre><code class="language-sql">SELECT time, device_id, random()*100 as cpu_usage 
FROM generate_series(
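    -- Cast the literal explicitly: an untyped string minus an INTERVAL is otherwise read as an interval.
    '2021-08-01 00:00:00'::timestamptz - INTERVAL '6 months',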
    '2021-08-01 00:00:00',
    INTERVAL '1 hour'
  ) as time, 
generate_series(1,4) device_id;
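
-- Optional check (an illustrative addition, not part of the original example):
-- look at the first and last generated timestamps and the number of steps
-- before cross joining the series with device IDs.
SELECT min(time) AS first_ts, max(time) AS last_ts, count(*) AS steps
FROM generate_series(
    '2021-08-01 00:00:00'::timestamptz - INTERVAL '6 months',
    '2021-08-01 00:00:00'::timestamptz,
    INTERVAL '1 hour'
  ) as time;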
</code></pre><p>This example is the same as the first, but here we specify the timestamp that we want to end on. PostgreSQL can still do the date math for us (subtracting 6 months in this example), but we get control over the exact ending timestamp to use. This is particularly useful when you want the time portion of each timestamp to fall on even, rounded hours, minutes, or seconds. When we use <code>now()</code>, the range starts and ends at what feels like a random instant, so every generated timestamp is a whole number of intervals away from <code>now()</code> and inherits its odd minutes, seconds, and fractions of a second.</p><h3 id="create-1-year-of-data-with-one-hour-intervals-beginning-on-a-timestamp">Create two months of data with one-hour intervals, beginning on a timestamp</h3><p>For this last example, we're going to specify the start timestamp instead. This can be useful when you need to test a specific fiscal period or maybe some logic that deals with the turning of each year. Maybe you just need a month or two of data, but it needs to cross over from one year to the next. In that case, trying to figure out the math of how far back to start and how far to go forward might be more complicated than you want to worry about.</p><pre><code class="language-sql">SELECT time, device_id, random()*100 as cpu_usage 
FROM generate_series(
    '2020-12-15 00:00:00',
    -- The cast keeps PostgreSQL from reading the literal as an interval.
    '2020-12-15 00:00:00'::timestamptz + INTERVAL '2 months',
    INTERVAL '1 hour'
  ) as time, 
generate_series(1,4) device_id;
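
-- Variation (an illustrative addition, not one of the original examples):
-- keep a moving end point, but round it down to the hour with date_trunc()
-- so that every generated timestamp lands on the hour.
SELECT time, device_id, random()*100 as cpu_usage 
FROM generate_series(
    date_trunc('hour', now()) - INTERVAL '6 months',
    date_trunc('hour', now()),
    INTERVAL '1 hour'
  ) as time, 
generate_series(1,4) device_id;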
</code></pre><p>Each of these examples uses date math to help you find the appropriate start and end timestamps so that you can easily adjust to create more or less data while still fitting the range profile that you need. </p><p>Speaking of calculating how many rows you want to generate… 😉</p><h2 id="calculating-total-rows">Calculating Total Rows</h2><p>We now have all the tools we need to create datasets of almost any size. For time-series data specifically, it's a straightforward calculation to figure out how many rows your query will generate.</p><p><em>Total Rows = Readings per hour * Total hours * Number of "things" being tracked</em></p><p>Using this formula, we can quickly determine how many rows will be created by changing any combination of the total range (start/end timestamps), how many readings per hour for each item, or the total number of items. Let's look at a couple of examples to get an idea of how quickly you could create many rows of time-series data for your testing scenario.</p><table>
<thead>
<tr>
<th>Range of readings</th>
<th>Length of interval</th>
<th>Number of "devices"</th>
<th>Total rows</th>
</tr>
</thead>
<tbody>
<tr>
<td>1 year</td>
<td>1 hour</td>
<td>4</td>
<td>35,040</td>
</tr>
<tr>
<td>1 year</td>
<td>10 minutes</td>
<td>100</td>
<td>5,256,000</td>
</tr>
<tr>
<td>6 months</td>
<td>5 minutes</td>
<td>1,000</td>
<td>52,560,000</td>
</tr>
</tbody>
</table>
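<p>The totals in the table can be reproduced with a few lines of SQL. This is only a sketch of the arithmetic: the labels and column names are illustrative, a year is assumed to be 365 days (8,760 hours), and six months is assumed to be half of that.</p><pre><code class="language-sql">-- Total rows = readings per hour * total hours * number of devices
SELECT scenario,
       readings_per_hour * total_hours * devices AS total_rows
FROM (VALUES
    ('1 year, 1 hour, 4 devices',           1, 365 * 24,     4),
    ('1 year, 10 minutes, 100 devices',     6, 365 * 24,     100),
    ('6 months, 5 minutes, 1,000 devices', 12, 365 * 24 / 2, 1000)
) AS t(scenario, readings_per_hour, total_hours, devices);

-- Cross-check the first scenario by counting a real series for 2021.
-- Both endpoints are inclusive, so the series stops at 23:00 on the last
-- day: 8,760 hourly timestamps x 4 devices = 35,040 rows.
SELECT count(*) AS total_rows
FROM generate_series(
    '2021-01-01 00:00:00'::timestamp,
    '2021-12-31 23:00:00'::timestamp,
    INTERVAL '1 hour'
  ) AS time,
generate_series(1,4) AS device_id;</code></pre><p>Because <code>generate_series()</code> includes both endpoints, a range that spans exactly <em>N</em> intervals returns <em>N</em> + 1 timestamps, so counting a candidate range first is a cheap way to confirm its size before generating a large dataset.</p>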
<p>As you can see, the numbers start to add up <em>very</em> quickly.</p><h2 id="lets-review">Let's Review</h2><p>In this first post, we've demonstrated that it's pretty easy to generate lots of data using the PostgreSQL <code>generate_series()</code> function. We also learned that when you select multiple sets (using <code>generate_series()</code> or selecting from tables and functions), PostgreSQL produces what's known as a Cartesian product: every row of each set is combined with every row of the others, so the total row count is the product of the individual row counts. With this knowledge, we can quickly create large and diverse datasets to test various features of PostgreSQL and TimescaleDB.</p><h2 id="keep-learning-about-generating-sample-data">Keep Learning About Generating Sample Data</h2><p>This was part 1 of the three-part series: </p><ul><li><strong>Read part 2: </strong><a href="https://timescale.ghost.io/blog/blog/generating-more-realistic-sample-time-series-data-with-postgresql-generate_series/"><strong>Generating more realistic sample time-series data with PostgreSQL <code>generate_series()</code></strong></a>. Learn how to use custom user-defined functions to create more realistic-looking data to use for testing, including generated text, numbers constrained by a range, and even fake JSON data.</li><li><strong>Read part 3: </strong><a href="https://timescale.ghost.io/blog/blog/how-to-shape-sample-data-with-postgresql-generate_series-and-sql/"><strong>How to shape sample data with PostgreSQL <code>generate_series()</code> and SQL</strong></a><strong>. </strong>Learn how to add shape and trends into your sample time-series data (e.g., increasing web traffic over time and quarterly sales cycles) using the formatting functions in this post in conjunction with relational lookup tables and additional mathematical functions. 
Knowing how to manipulate the pattern of generated data is particularly useful for visualizing time-series data and learning analytical PostgreSQL or TimescaleDB functions.</li><li><strong>Watch the video: </strong></li></ul><figure class="kg-card kg-embed-card"><iframe width="200" height="113" src="https://www.youtube.com/embed/t5ULlC1MYWU?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></figure><h2 id="try-it-yourself-with-timescaledb">Try it Yourself With TimescaleDB</h2><p>If you are not using TimescaleDB yet,&nbsp;<a href="https://www.timescale.com/?ref=timescale.com" rel="noreferrer">take a look</a>. It's a <a href="https://www.tigerdata.com/blog/top-8-postgresql-extensions" rel="noreferrer">PostgreSQL extension</a> that&nbsp;<a href="https://timescale.ghost.io/blog/postgresql-timescaledb-1000x-faster-queries-90-data-compression-and-much-more/" rel="noreferrer">will make your queries faster</a> via&nbsp;<a href="https://www.timescale.com/?ref=timescale.com" rel="noreferrer">automatic partitioning</a>, query planner enhancements, improved materialized views,&nbsp;<a href="https://timescale.ghost.io/blog/building-columnar-compression-in-a-row-oriented-database/" rel="noreferrer">columnar compression</a>, and much more.&nbsp;<a href="https://www.tigerdata.com/blog/building-columnar-compression-in-a-row-oriented-database" rel="noreferrer">It also comes with awesome functionality for time-series data analysis. </a></p><p>If you're running your PostgreSQL database on your own hardware,&nbsp;<a href="https://docs.timescale.com/self-hosted/latest/install/?ref=timescale.com" rel="noreferrer">you can simply add the TimescaleDB extension</a>. If you prefer to try Timescale on AWS,&nbsp;<a href="https://console.cloud.timescale.com/signup?ref=timescale.com" rel="noreferrer">create a free account on our platform</a>. 
</p><p>See what's possible when you <a href="https://www.tigerdata.com/learn/what-is-a-sql-inner-join" rel="noreferrer">join PostgreSQL</a> with time-series superpowers!</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How Messari Uses Data to Open the Cryptoeconomy to Everyone]]></title>
            <description><![CDATA[Learn how Messari sets up its datastack to collect, calculate, and contextualize crypto metrics, break down data silos, and bring transparency to the cryptoeconomy.]]></description>
            <link>https://www.tigerdata.com/blog/how-messari-uses-data-to-open-the-cryptoeconomy-to-everyone</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/how-messari-uses-data-to-open-the-cryptoeconomy-to-everyone</guid>
            <category><![CDATA[Dev Q&A]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Adam Inoue]]></dc:creator>
            <pubDate>Wed, 25 Aug 2021 14:04:30 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/08/feature-image-Messari.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/08/feature-image-Messari.png" alt="Messari dashboard, with data available at messari.io for free: How Messari Uses Data to Open the Cryptoeconomy to Everyone" /><p><em>This is an installment of our “Community Member Spotlight” series, where we invite our customers to share their work, shining a light on their success and inspiring others with new ways to use technology to solve problems.</em></p><p><em>In this edition, </em><a href="https://twitter.com/AdamIn0ue"><em>Adam Inoue</em></a><em>, Software Engineer at Messari, joins us to share how they bring transparency to the cryptoeconomy, combining tons of data about crypto assets with real-time alerting mechanisms to give investors a holistic view of the market and ensure they never miss an important event.</em></p><p><a href="https://messari.io/">Messari</a> is a data analytics and research company on a mission to organize and contextualize information for crypto professionals. Using Messari, analysts and enterprises can analyze, research, and stay on the cutting edge of the crypto world – all while trusting the integrity of the underlying data. </p><p>This gives professionals the power to make informed decisions and take timely action. We are uniquely positioned to provide an experience that combines automated data collection (such as our quantitative <a href="https://messari.io/asset/ethereum/metrics/all">asset metrics</a> and charting tools) with <a href="https://messari.io/research">qualitative research</a> and <a href="https://messari.io/intel/list">market intelligence</a> from a global team of analysts.</p><p>Our users range from some of the most prominent analysts, investors, and individuals in the crypto industry to top platforms like Coinbase, BitGo, Anchorage, 0x, Chainalysis, Ledger, Compound, MakerDAO, and many more. 
</p><h2 id="about-the-team">About the Team</h2><p>I have over five years of experience as a backend developer in roles where I’ve primarily focused on high-throughput financial systems, financial reporting, and relational databases to support those systems. </p><p>After some COVID-related career disruptions, I started at Messari as a software engineer this past April (2021). I absolutely love it. The team is small but growing quickly, and everyone is specialized, highly informed, and at the top of their game. (Speaking of growing quickly, <a href="https://messari.io/careers">we’re hiring</a>!)</p><p>We’re still small enough to function mostly as one team. We are split into front-end and back-end development. The core of our back end is a suite of microservices written in Golang and managed by Kubernetes, and I—along with two other engineers—“own” managing the cluster and associated services. (As an aside, another reason I love Messari: we’re a fully remote team. I’m in Hawaii, and those two colleagues are in New York and London. Culturally, we also minimize meetings, which is great because we’re so distributed, <em>and</em> we end up with lots of time for deep work.)</p><p>From a site reliability standpoint, my team is responsible for all of the backend APIs that serve the live site, our <a href="https://messari.io/api">public API</a>, our real-time data ingestion, the ingestion and calculation of asset metrics, and more.</p><p>So far, I’ve mostly specialized in the ingestion of real-time market data—and that’s where TimescaleDB comes in!</p><h2 id="about-the-project">About the Project</h2><p>Much of our website is completely free to use, but we have <a href="https://messari.io/pro">Pro</a> and <a href="https://messari.io/enterprise">Enterprise</a> tiers that provide enhanced functionality. 
For example, our Enterprise version includes <a href="https://messari.io/intel/list">Intel</a>, a real-time alerting mechanism that notifies users about important events in the crypto space (e.g., forks, hacks, protocol changes, etc.) as they occur.</p><p>We collect and calculate a huge catalog of <a href="https://messari.io/asset/ethereum/metrics/all">crypto-asset metrics</a>, like price, volume, all-time cycle highs and lows, and detailed information about each currency. Handling these metrics uses a relatively low proportion of our compute resources, while real-time trade ingestion is a much more resource-intensive operation. </p><p>Our crypto price data is currently calculated based on several thousand trades per second (ingested from partners, such as <a href="https://www.kaiko.com">Kaiko</a> and <a href="https://www.gemini.com">Gemini</a>), as well as our own on-chain integrations with <a href="https://thegraph.com">The Graph</a>. We also keep exhaustive historical data that goes as far back as the dawn of Bitcoin. (You can read more about the <a href="https://www.investopedia.com/terms/b/bitcoin.asp" rel="noreferrer">history of Bitcoin</a>.)</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Untitled--1-.png" class="kg-image" alt="Messari dashboard with a dark background. The dashboard consists of four parts: 1) watchlist tracking assets like Bitcoin or Ethereum, 2) ROI chart, 3) line graph showing real-time Bitcoin price data, and 4) Intel chart, a real-time alerting mechanism." 
loading="lazy" width="2000" height="1396" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/01/Untitled--1-.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/01/Untitled--1-.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2022/01/Untitled--1-.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Untitled--1-.png 2048w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Messari dashboard, with data available at </em></i><a href="https://messari.io/"><i><em class="italic" style="white-space: pre-wrap;">messari.io</em></i></a><i><em class="italic" style="white-space: pre-wrap;"> for free</em></i></figcaption></figure><p><strong>Our data pipelines are the core of the quantitative portion of our product—and are, therefore, mission-critical. </strong>For our site to be visibly alive, the most important metric is our real-time volume-weighted average price (VWAP), although we calculate hundreds of other metrics on an hourly or daily basis. We power our real-time view through WebSocket connections to our back-end, and we keep the latest price data in memory to avoid having to make constant repeated database calls. </p><p>Everything “historical”—i.e., even as recently as five minutes ago—makes a call to our time-series endpoint. <strong>Any cache misses there will hit the database, so it’s critical that the database is highly available.</strong></p><p>We use the price data to power the quantitative views we display on our live site and directly serve our data to API users. 
Much of what we display on our live site is regularly retrieved and cached by a backend-for-frontend GraphQL server, but some of it is also retrieved by HTTP calls or WebSocket connections from one or more Go microservices.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Untitled.png" class="kg-image" alt="Messari dashboard with a dark background. The dashboard shows all available data about Ethereum, including real-time price, ROI, and key metrics like market cap and price." loading="lazy" width="2000" height="998" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/01/Untitled.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/01/Untitled.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2022/01/Untitled.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Untitled.png 2048w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The asset view of Messari dashboard, showing various price stats for a specific currency.</span></figcaption></figure><p>The accuracy of our data is extremely important because it’s public-facing and used to help our users make decisions. And, just like the rest of the crypto space, we are also scaling quickly, both in terms of our business and the amount of data we ingest. </p><h2 id="choosing-and-using-timescaledb">Choosing (and Using!) TimescaleDB</h2><p>We’re wrapping up a complete transition to TimescaleDB from <a href="https://www.influxdata.com">InfluxDB</a>. 
It would be reasonable to say that <a href="https://timescale.ghost.io/blog/timescaledb-vs-influxdb-for-time-series-data-timescale-influx-sql-nosql-36489299877/" rel="noreferrer">we used InfluxDB until it fell over</a>; we asked it to do a huge amount of ingestion and continuous aggregation, not to mention queries around the clock, to support the myriad requests our users can make. </p><p>Over time, we pushed it enough that it became less stable, so eventually, it became clear that InfluxDB wasn’t going to scale with us. Thus, <a href="https://github.com/dev-kpyc">Kevin Pyc</a> (who served as the entire backend “team” until earlier this year) became interested in TimescaleDB as a possible alternative.</p><p>The pure PostgreSQL interface and impressive performance characteristics sold him on TimescaleDB as a good option for us. </p><p>From there, the entire tech group convened and agreed to try TimescaleDB. We were aware of its performance claims but needed to test it out for ourselves for our exact use case. I began by reimplementing our real-time trade ingestion database adapter on TimescaleDB—and on every test, TimescaleDB blew my expectations out of the water.</p><p>The most significant aspects of our system are INSERT and SELECT performance.</p><ul><li>INSERTs of real-time trade data are constant, 24/7, and rarely dip below 2,000 rows per second. At peak times, they can exceed 4,500—and, of course, we expect this number to continually increase as the industry continues to grow and we see more and more trades.</li><li>SELECT performance impacts our APIs’ response time for anything we haven’t cached; we briefly cache many of the queries needed for the live site, but less common queries end up hitting the database. </li></ul><p>When we tested these with TimescaleDB, both our SELECT and INSERT performance results flatly outperformed InfluxDB. 
In testing, even though our fully managed <a href="https://www.timescale.com/products" rel="noreferrer">Timescale</a> instance is currently only located in us-east-1 and most of our infrastructure is in a us-west region, we saw an average improvement of ~40&nbsp;ms in both types of queries. Plus, we could batch-insert 500 rows of data instead of 100, with no discernible drop in execution time relative to InfluxDB. </p><p><strong>These impressive performance benchmarks, combined with the fact that we can use Postgres with foreign key relationships to derive new datasets from our existing ones (which we weren’t able to do with InfluxDB), are key differentiators for TimescaleDB.</strong></p><p>✨ <strong>Editor’s Note: </strong><em>For more comparisons and benchmarks, see how TimescaleDB compares to </em><a href="https://timescale.ghost.io/blog/timescaledb-vs-influxdb-for-time-series-data-timescale-influx-sql-nosql-36489299877/" rel="noreferrer"><em>InfluxDB</em></a><em>, </em><a href="https://timescale.ghost.io/blog/how-to-store-time-series-data-mongodb-vs-timescaledb-postgresql-a73939734016/" rel="noreferrer"><em>MongoDB</em></a><em>, </em><a href="https://timescale.ghost.io/blog/timescaledb-vs-amazon-timestream-6000x-higher-inserts-175x-faster-queries-220x-cheaper/" rel="noreferrer"><em>AWS Timestream</em></a><em>, and other </em><a href="https://www.timescale.com/learn/the-best-time-series-databases-compared" rel="noreferrer"><em>time-series database alternatives</em> <em>on various vectors</em></a><em>, from performance and ecosystem to query language and beyond. For tips on optimizing your database insert rate, see our </em><a href="https://timescale.ghost.io/blog/blog/13-tips-to-improve-postgresql-insert-performance/"><em>13 ways to improve PostgreSQL insert performance</em></a><em> blog post.</em></p><p>We are also really excited about continuous aggregates. 
We store our data at minute-level granularity, so any granularity of data above one minute is powered by continuous queries that feed a rollup table. </p><p>In InfluxDB-world, we had a few problems with continuous queries: they tended to lag a few minutes behind real-time ingestion, and, in our experience, continuous queries would occasionally fail to pick up a trade ingested out of order—for instance, one that’s half an hour old—and it wouldn’t be correctly accounted for in our rollup queries. </p><p>Switching these rollups to TimescaleDB continuous aggregates has been great; they’re never out of date, and we can gracefully refresh the proper time range whenever we receive an out-of-order batch of trades or are back-filling data.</p><p>At the time of writing, I’m still finalizing our continuous aggregate views—we had to refresh them all the way back to 2010!—but all of the other parts of our implementation are complete and have been stable for some time.</p><p>✨<em> </em><strong>Editor’s Note</strong>:<em> Check out the </em><a href="https://docs.timescale.com/use-timescale/latest/continuous-aggregates/" rel="noreferrer"><em>continuous aggregates documentation</em></a><em> and follow </em><a href="https://docs.timescale.com/tutorials/latest/blockchain-query/" rel="noreferrer"><em>the step-by-step tutorial</em></a><em> to learn how to utilize continuous aggregates for analyzing crypto data.</em></p><h2 id="current-deployment-future-plans">Current Deployment &amp; Future Plans</h2><p>As I mentioned earlier, all of the core services in our back-end are currently written in Go, and we have some projects on the periphery written in Node or Java. We don't currently need to expose TimescaleDB to any project that isn't written in Go. We use <a href="https://gorm.io">GORM</a> for most database operations, so we connect to TimescaleDB with a <code>gorm.DB</code> object.</p>
<p>We try to use GORM conventions as much as possible; for TimescaleDB-specific operations like <a href="https://docs.timescale.com/api/latest/compression/add_compression_policy/">managing compression policies</a> or the <a href="https://docs.timescale.com/use-timescale/latest/hypertables/create/"><code>create_hypertable</code> step</a> where no GORM method exists, we write out queries literally.</p>
<p>For instance, we initialize our tables using <code>repo.PrimaryDB.AutoMigrate(repo.Schema.Model)</code>, which is a GORM-specific feature, but we create new hypertables as follows:</p>
<pre><code>res := repo.PrimaryDB.Table(tableName).Exec(
		fmt.Sprintf("SELECT create_hypertable('%s', 'time', chunk_time_interval =&gt; INTERVAL '%s');",
			tableName, getChunkSize(repo.Schema.MinimumInterval)))
</code></pre>
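<p>To make the template above concrete, here is a minimal sketch of the SQL it might render. The table name <code>trades</code> and the one-day interval are assumptions for illustration, not Messari's actual schema or the real output of <code>getChunkSize</code>. The optional <code>if_not_exists</code> argument is not in the original snippet; it is shown because it lets the call be repeated safely on every migration run.</p>
<pre><code class="language-SQL">-- Hypothetical rendering of the Sprintf template for a table named "trades"
SELECT create_hypertable(
    'trades',                                  -- assumed table name
    'time',                                    -- time column, as in the Go snippet
    chunk_time_interval =&gt; INTERVAL '1 day',   -- assumed getChunkSize() result
    if_not_exists =&gt; TRUE                      -- issue a notice instead of an error on re-run
);
</code></pre>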
<p>Currently, our architecture that touches TimescaleDB looks like this:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/08/Messari---Dev-Q-A---Architecture--1-.jpg" class="kg-image" alt="The architecture diagram of Messari solution" loading="lazy" width="2000" height="1352" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2021/08/Messari---Dev-Q-A---Architecture--1-.jpg 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2021/08/Messari---Dev-Q-A---Architecture--1-.jpg 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2021/08/Messari---Dev-Q-A---Architecture--1-.jpg 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/08/Messari---Dev-Q-A---Architecture--1-.jpg 2000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The current architecture diagram</span></figcaption></figure><p>We use Prometheus for a subset of our monitoring, but for our real-time ingestion engine, we’re in an advantageous position: the system’s performance is obvious just by looking at our logs. </p><p>Whenever our database upsert backlog is longer than a few thousand rows, we log that with a timestamp to easily see how large the backlog is and how quickly we can catch up. </p><p>Our backlog tends to be shorter and more stable with TimescaleDB than it was previously—and our developer experience has improved as well. </p><p>Speaking for myself, I didn’t understand much about our InfluxDB implementation’s inner workings, but after talking it through with my teammates, it seems highly customized and hard to explain from scratch. 
The hosted TimescaleDB implementation with Timescale is much easier to understand, particularly because we can easily view the live database dashboard, complete with all our table definitions, chunks, policies, and the like.</p><p>Looking ahead, we have a lot of projects that we’re excited about! One of the big ones is that, with TimescaleDB, we’ll have a much easier time deriving metrics from multiple existing data sets. </p><p>In the past, because InfluxDB is NoSQL, linking <a href="https://www.tigerdata.com/blog/time-series-introduction" rel="noreferrer">time series</a> together to generate new, derived, or aggregated metrics was challenging. <strong>Now, we can use simple JOINs in one query to easily return all the data we need to derive a new metric</strong>. </p><p>Many other projects have to remain under wraps for now, but we think TimescaleDB will be a crucial part of our infrastructure for years to come, and we’re excited to scale with it.</p><h2 id="getting-started-advice-resources">Getting Started Advice &amp; Resources</h2><p>TimescaleDB is complex, and it's important to understand the implementation of <a href="https://www.tigerdata.com/blog/database-indexes-in-postgresql-and-timescale-cloud-your-questions-answered" rel="noreferrer">hypertables</a> quickly. To best benefit from TimescaleDB’s features, you need to think about how to chunk your hypertables, what retention and compression policies to set, and whether/how to set up continuous aggregates. 
(Chunk size deserves particular care, because that decision is hard to change later.)</p><p>In our case, the answers to three of these questions carried over from our previous InfluxDB setup: compress after 48 hours (the furthest in the past we expect to ingest a trade); retain everything; and roll up all of our price and volume data into our particular set of intervals (5m, 15m, 30m, 1h, 6h, 1d, and 1w).</p><p>The most difficult part was understanding how long our chunks should be (i.e., setting our <a href="https://docs.timescale.com/api/latest/hypertable/set_chunk_time_interval/#optional-arguments"><code>chunk_time_interval</code></a> on each hypertable). We settled on one day, mostly by default, with some particularly small metrics using one-year chunks instead.</p>
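<p>As a rough sketch of how those three decisions might translate into SQL: the table <code>trades</code>, its columns (<code>time</code>, <code>asset_id</code>, <code>price</code>, <code>volume</code>), and the view name <code>trades_5m</code> are all hypothetical, and only the 5m rollup is shown. This is an illustration under those assumptions, not our production configuration.</p>
<pre><code class="language-SQL">-- All table, column, and view names below are hypothetical.

-- 1. Compress chunks once they fall outside the 48-hour late-arrival window.
ALTER TABLE trades SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'asset_id'
);
SELECT add_compression_policy('trades', INTERVAL '48 hours');

-- 2. Roll minute-level rows up into 5-minute buckets. Storing the two sums
--    separately lets VWAP be computed as notional / volume at query time.
CREATE MATERIALIZED VIEW trades_5m
WITH (timescaledb.continuous) AS
SELECT time_bucket(INTERVAL '5 minutes', time) AS bucket,
       asset_id,
       sum(price * volume) AS notional,
       sum(volume)         AS volume
FROM trades
GROUP BY bucket, asset_id
WITH NO DATA;

-- 3. Keep the last 48 hours of buckets refreshed so late trades are picked up.
SELECT add_continuous_aggregate_policy('trades_5m',
    start_offset      =&gt; INTERVAL '48 hours',
    end_offset        =&gt; INTERVAL '5 minutes',
    schedule_interval =&gt; INTERVAL '5 minutes');

-- 4. Backfill one historical window by hand (run outside a transaction block).
CALL refresh_continuous_aggregate('trades_5m', '2010-01-01', '2011-01-01');
</code></pre>
<p>No retention policy appears because the decision was to retain everything. The remaining intervals (15m through 1w) would follow the same pattern with a different bucket width.</p>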
<p>I’m not sure these decisions would be as obvious for other use cases. </p><p>In summary, the strongest advantages of TimescaleDB are its performance and pure Postgres interface. Both of these make us comfortable recommending it across a wide range of use cases. Still, the decision shouldn’t be cavalier; we tested Timescale for several weeks before committing to the idea and finishing our implementation.</p><p><em>We’d like to thank Adam and all of the folks at Messari for sharing their story and for their effort to lower the barriers to investing in crypto assets by offering a massive number of crypto-asset metrics and a real-time alerting mechanism.</em></p><p><em>We’re always keen to feature new community projects and stories on our blog. If you have a story or project you’d like to share, reach out on Slack (</em><a href="https://timescaledb.slack.com/team/U03797BSQKT?ref=timescale.com"><em>@Ana Tavares</em></a><em>), and we’ll go from there.</em></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Hacking NFL Data With PostgreSQL, TimescaleDB, and SQL]]></title>
            <description><![CDATA[Learn how to use time-series data provided by the NFL to uncover insights.]]></description>
            <link>https://www.tigerdata.com/blog/hacking-nfl-data-with-postgresql-timescaledb-and-sql</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/hacking-nfl-data-with-postgresql-timescaledb-and-sql</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Time Series Data]]></category>
            <category><![CDATA[Analytics]]></category>
            <dc:creator><![CDATA[Attila Toth]]></dc:creator>
            <pubDate>Tue, 27 Jul 2021 13:32:25 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/07/ameer-basheer-Yzef5dRpwWg-unsplash--1-.jpg">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/07/ameer-basheer-Yzef5dRpwWg-unsplash--1-.jpg" alt="Hacking NFL Data With PostgreSQL, TimescaleDB, and SQL" /><p><em>Learn how to use time-series data provided by the NFL to uncover valuable insights into many player performance metrics—and ways to apply the same methods to improve your fantasy league team, your knowledge of the game, or your viewing experience—all with PostgreSQL, standard SQL, and freely available extensions.</em></p><p>Time-series data is everywhere, including, much to our surprise, the world of professional sports. At Timescale, we're always looking for fun ways to showcase the expanding reach of time-series data. <a href="https://docs.timescale.com/timescaledb/latest/tutorials/analyze-intraday-stocks/">Stock</a>, <a href="https://docs.timescale.com/timescaledb/latest/tutorials/analyze-cryptocurrency-data/">cryptocurrency</a>, <a href="https://docs.timescale.com/timescaledb/latest/tutorials/nyc-taxi-cab/">IoT</a>, and <a href="https://docs.timescale.com/timescaledb/latest/tutorials/promscale/">infrastructure metrics</a> data are relatively common and widely understood time-series data scenarios. Head to Twitter on any given day, search for <a href="https://twitter.com/hashtag/TimeSeries">#timeseries</a> or <a href="https://twitter.com/hashtag/TimescaleDB">#TimescaleDB</a>, and you're sure to find questions about high-frequency trading or massive-scale observability data with tools like Prometheus.</p><p>You can imagine our excitement, then, when we happened upon the <a href="https://operations.nfl.com/gameday/analytics/big-data-bowl/">NFL Big Data Bowl</a>, an annual competition that encourages the data science community to use historical player position and play data to create machine learning models. 
</p><p>Did the NFL <strong><em>really</em></strong> give access to 18+ million rows of detailed play data from every regular season NFL game?</p>
<!--kg-card-begin: html-->
<div style="width:100%;height:0;padding-bottom:100%;position:relative;"><iframe src="https://giphy.com/embed/2w3uNmvsIZ09hNzgzo" width="100px" height="100px" style="position:absolute; min-width: 100%; min-height: 100%" frameBorder="0" class="giphy-embed" allowFullScreen></iframe></div><p><a href="https://giphy.com/gifs/nfl-colts-indianapolis-2w3uNmvsIZ09hNzgzo">via GIPHY</a></p>
<!--kg-card-end: html-->
<p>For background, the National Football League (NFL) is the US professional sports league for American football, and the NFL season is followed by tens of millions of people, culminating in the annual Super Bowl (which attracts 100M+ global viewers, whether for the game or for the commercials). </p><p>Each NFL game takes place as a series of “plays,” in which the two teams try to score and prevent the other team from scoring. There are approximately 200 plays per game, with up to 15 games a week during the regular season. A healthy amount of data, but nothing unmanageable.  </p><p>So, at first glance, football game metrics might not immediately jump out as anything special. </p><p>But then the NFL did something pretty ambitious and amazing. </p><p>All <a href="https://operations.nfl.com/gameday/technology/nfl-next-gen-stats/">NFL players are equipped with RFID chips</a> that track players’ position, speed, and various other metrics, which teams use to identify trends, mitigate risks, and continuously optimize.  The NFL started tracking and storing data for every player on the field, for every play, for every game. </p><p>As a result, we now have access to a very detailed analysis of exactly how a play unfolded, how quickly various players accelerated during each play, and the play’s outcome. A traditional view of play-by-play metrics is “down and distance” and the result of the play (yards gained, whether or not there was a score, and so on). With the NFL’s dataset, we're able to mine approximately 100 data points at 100-millisecond intervals throughout the play to see speed, distance, involved players, and much more.</p><p>This isn’t ordinary data. <a href="https://timescale.ghost.io/blog/blog/what-the-heck-is-time-series-data-and-why-do-i-need-a-time-series-database-dcf3b1b18563/">This is time-series data</a>. Time-series data is a sequence of data points collected over time intervals, giving us the ability to track changes over time. 
In the case of the NFL’s dataset, we have time-series data that represents how a play changes, including the locations of the players on the field, the location of the ball, the relative acceleration of players in the field of play, and so much more.</p><p>Time-series data comes at you fast, sometimes generating millions of data points per second (<a href="https://timescale.ghost.io/blog/blog/what-the-heck-is-time-series-data-and-why-do-i-need-a-time-series-database-dcf3b1b18563/">read more about time-series data</a>). Because of the sheer volume and rate of information, time-series data can already be complex to query and analyze, which is why we built TimescaleDB, a petabyte-scale, relational database for <a href="https://www.tigerdata.com/blog/time-series-introduction" rel="noreferrer">time series</a>. </p><p>We couldn't pass up the opportunity to look at the NFL dataset with TimescaleDB, exploring ways we could peer deeper into player performance in hopes of providing insights about overall player performance in the coming season. </p><p>Read on for more information about the <a href="https://www.kaggle.com/c/nfl-big-data-bowl-2021/overview">NFL’s dataset</a> and how you can start using it, plus some sample queries to jumpstart your analysis. They may help you get more enjoyment out of the game.</p><p><strong>If you’d like to get started with NFL data, you can spin up a fully managed TimescaleDB service</strong>: create an account to <a href="https://console.cloud.timescale.com/signup">try it for free</a> for 30 days. 
The instructions later in this post will take you through how to ingest the data and start using it for analysis.</p><p>If you’re new to time-series data or just have some questions you’d like to ask about the dataset, <a href="https://slack.timescale.com">join our public Slack community</a>, where you’ll find Timescale team members and thousands of time-series enthusiasts, and we’ll be happy to help you.</p><h2 id="the-nfl-time-series-dataset">The NFL Time Series Dataset</h2><p><br>Over the last few years, the NFL and Kaggle have collaborated on the <a href="https://www.kaggle.com/c/nfl-big-data-bowl-2021/overview">NFL Big Data Bowl</a>. The goal is to use historical data to answer a predetermined genre of questions, typically producing a machine learning model that can help predict the outcome of certain plays during regular season games.</p><p>Although the 2020/2021 contest is over, the sample dataset they provided from a prior season is still available for download and analysis. The 2020/2021 competition focused on pass-play defense efficiency; therefore, only the tracking data for offensive and defensive "playmakers" is available in the dataset. No offensive or defensive linemen data is included. (You can read more about <a href="https://www.kaggle.com/c/nfl-big-data-bowl-2021/discussion/217170">last year’s winners</a>.)</p><p>(Keep watching the <a href="https://operations.nfl.com/gameday/analytics/big-data-bowl/">NFL website</a> for more information on the next Big Data Bowl.)</p><h2 id="accessing-the-data">Accessing the Data</h2><p><br>For the purposes of this blog post and accompanying tutorial, we will use <a href="https://www.kaggle.com/c/nfl-big-data-bowl-2021/overview">the sample data provided by the NFL</a>. This data is from the 2018 NFL season and is available as CSV files, including game-specific data and week-by-week tracking data for each player involved in the "offensive" part of the pass play. 
Contest participants in the next season of the contest will have access to new weekly game data.</p><p>This data is also very relational in nature, which means that SQL is a great medium to start gleaning value – without the need for Jupyter notebooks, other data science specific languages (like Python or R), or additional toolsets. </p><p>If you want to follow along - or recreate! - the queries we go through below, <a href="https://docs.timescale.com/timescaledb/latest/tutorials/nfl-analytics/">follow our tutorial</a> to set up the tables, ingest data, and start analyzing data in TimescaleDB. For those unfamiliar with TimescaleDB, it’s built on PostgreSQL, so you’ll find that all of our queries are standard SQL. If you know SQL, you’ll know how to do everything here. (Some of the more advanced query examples we provide require our new, advanced hyperfunctions, which come pre-installed with any <a href="https://console.cloud.timescale.com" rel="noreferrer">Timescale instance</a>.)</p><h2 id="lets-start-exploring-time-series-insights">Let's Start Exploring Time-Series Insights!</h2><p><br>We've provided the steps needed to ingest the dataset into TimescaleDB in the <a href="https://docs.timescale.com/timescaledb/latest/tutorials/nfl-analytics/">accompanying tutorial</a>, so we won’t go into that here. </p><p>The NFL dataset includes the following data:</p>
<ul>
<li><strong>Games</strong>: all relevant data about each game of the regular season, including date, teams, time, and location</li>
<li><strong>Players</strong>: information on each player, including what team they play for and their originating college</li>
<li><strong>Plays</strong>: a wealth of data about each pass play in the game. Helpful fields include the down, description of the play that happened, line of scrimmage, and total offensive yardage, among other details.</li>
<li><strong>Week [1-17]</strong>: for each week of the season, the NFL provides a new CSV file with the tracking data of every player, for every play (pass plays for this data). Interesting fields include X/Y position data (relative to the football field) every few hundred milliseconds throughout each play, player acceleration, and the "type" of a route that was taken. (In our tutorial, this data is imported into the <code>tracking</code> table and totals almost 20 million rows of time-series data.)</li>
</ul>
<p>In addition to the NFL dataset, we also provide some extra data from Wikipedia that includes game scores and stadium conditions for each game, which you can load as part of the tutorial. With other time-series databases, it can be difficult to combine your time-series data with any other data you may have on hand (see <a href="https://timescale.ghost.io/blog/blog/timescaledb-vs-influxdb-for-time-series-data-timescale-influx-sql-nosql-36489299877/">our TimescaleDB vs. InfluxDB comparison</a> for reference). </p><p>Because TimescaleDB is PostgreSQL with time-series superpowers, it supports JOINs, so any extra relational data you want to add for deeper analysis is just a SQL query away. In our case, we’re able to combine the NFL’s play-by-play data with weather data for each stadium.</p><p>Once you have the data ready, the world of NFL playmakers is at your fingertips, so let’s get started!</p><h2 id="the-power-of-sql">The Power of SQL</h2><p>Year after year, we see SQL listed as one of the most popular languages among developers on the <a href="https://insights.stackoverflow.com/survey/2020#technology-programming-scripting-and-markup-languages-all-respondents">Stack Overflow survey</a>. Sometimes, however, we can be lured into thinking that the only way to gain insights from relational data is to query it with powerful data analytics tools and languages, create data frames, and use specialized regression algorithms before we can do anything productive.</p><p>It often feels as though SQL is only useful for getting and storing data in applications, and that we need to leave the "heavy lifting" of analysis to more mature tools.</p><p>Not so! SQL can munge data with the best of them! 
Let's look at a first, quick example.</p><h3 id="average-yards-per-position-per-game">Average yards per position, per game</h3><p>For this first example, we'll query the <code>tracking</code> table (the player movement data from all 17 weeks of games) and join to the <code>game</code> table to determine the number of yards per player position, per game.</p>
<p>The results give you a quick overview of how many yards players at each position ran in each game. You could use this later as a baseline to see whether a specific player ran more or fewer yards than the total for their position.</p><pre><code class="language-SQL">WITH total_position_yards AS (
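	-- Total distance covered by each position in each game
	SELECT sum(dis) AS position_yards, position, gameid
	FROM tracking t
	GROUP BY position, gameid)
-- Average those per-game totals across the games played on each date
SELECT avg(position_yards) AS avg_position_yards, position, game_date
FROM game g
INNER JOIN total_position_yards tpy ON g.game_id = tpy.gameid
WHERE position IN ('QB','RB','WR','TE')
GROUP BY game_date, position;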
</code></pre>
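<p>The player-level comparison mentioned above could be sketched as follows. This assumes the <code>tracking</code> and <code>player</code> tables from the tutorial, with the same columns used by the other queries in this post; the output column names are made up for illustration.</p>
<pre><code class="language-SQL">WITH player_yards AS (
    -- Distance each skill-position player covered in each game
    SELECT player_id, position, gameid, sum(dis) AS yards
    FROM tracking
    WHERE position IN ('QB','RB','WR','TE')
    GROUP BY player_id, position, gameid
)
SELECT pl.display_name,
       py.position,
       py.gameid,
       py.yards,
       -- Total for everyone at the same position in the same game
       sum(py.yards) OVER (PARTITION BY py.position, py.gameid) AS position_total_yards,
       -- This player's share of that total
       py.yards / sum(py.yards) OVER (PARTITION BY py.position, py.gameid) AS share_of_position
FROM player_yards py
LEFT JOIN player pl ON py.player_id = pl.player_id
ORDER BY py.gameid, py.position, py.yards DESC;
</code></pre>
<p>Window functions keep one row per player per game while still exposing the position-wide total, so no second join back to the aggregate is needed.</p>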
<h3 id="number-of-plays-by-offensive-player"><br>Number of plays by offensive player</h3><p>As a season progresses and players get injured (or traded), it's helpful to know which of the available players have more playing experience, rather than those that have been sitting on the sideline for most of the season. Players with more playing time are often able to contribute to the outcome of the game.</p><p>This query finds all players that were on the offense for any play and counts how many total passing plays they have been a part of, ordered by total passing plays descending.</p><pre><code class="language-SQL">WITH snap_events AS (
-- Filter the tracking events down to snaps only
-- and resolve each player's team name
 SELECT DISTINCT player_id, t.event, t.gameid, t.playid,
   CASE
     WHEN t.team = 'away' THEN g.visitor_team
     WHEN t.team = 'home' THEN g.home_team
     ELSE NULL
     END AS team_name
 FROM tracking t
 LEFT JOIN game g ON t.gameid = g.game_id
 WHERE t.event IN ('snap_direct','ball_snap')
)
-- Count these events &amp; keep only rows where the player's team
-- had possession (i.e., the player was on offense)
SELECT a.player_id, pl.display_name, COUNT(a.event) AS play_count, a.team_name
FROM snap_events a
LEFT JOIN play p ON a.gameid = p.gameid AND a.playid = p.playid
LEFT JOIN player pl ON a.player_id = pl.player_id
WHERE a.team_name = p.possessionteam
GROUP BY a.player_id, pl.display_name, a.team_name
ORDER BY play_count DESC;
</code></pre>
<table>
<thead>
<tr>
<th>player_id</th>
<th>display_name</th>
<th>play_count</th>
<th>team_name</th>
</tr>
</thead>
<tbody>
<tr>
<td>2506109</td>
<td>Ben Roethlisberger</td>
<td>725</td>
<td>PIT</td>
</tr>
<tr>
<td>2558149</td>
<td>JuJu Smith-Schuster</td>
<td>691</td>
<td>PIT</td>
</tr>
<tr>
<td>2533031</td>
<td>Andrew Luck</td>
<td>683</td>
<td>IND</td>
</tr>
<tr>
<td>2508061</td>
<td>Antonio Brown</td>
<td>679</td>
<td>PIT</td>
</tr>
<tr>
<td>310</td>
<td>Matt Ryan</td>
<td>659</td>
<td>ATL</td>
</tr>
<tr>
<td>2506363</td>
<td>Aaron Rodgers</td>
<td>656</td>
<td>GB</td>
</tr>
<tr>
<td>2505996</td>
<td>Eli Manning</td>
<td>639</td>
<td>NYG</td>
</tr>
<tr>
<td>2543495</td>
<td>Davante Adams</td>
<td>630</td>
<td>GB</td>
</tr>
<tr>
<td>2540158</td>
<td>Zach Ertz</td>
<td>629</td>
<td>PHI</td>
</tr>
<tr>
<td>2532820</td>
<td>Kirk Cousins</td>
<td>621</td>
<td>MIN</td>
</tr>
<tr>
<td>79860</td>
<td>Matthew Stafford</td>
<td>619</td>
<td>DET</td>
</tr>
<tr>
<td>2504211</td>
<td>Tom Brady</td>
<td>613</td>
<td>NE</td>
</tr>
</tbody>
</table>
<p>If you’re familiar with American football, you might know that players are substituted in and out of the game based on game conditions. Stronger, larger players may play in some situations, while faster, more agile players may play in others.</p><p>Quarterbacks are the most “important” players on the field and tend to play more than others. By omitting quarterbacks, we can get a deeper insight into players across all other positions.</p><pre><code class="language-SQL">WITH snap_events AS (
-- Keep only the snap events from the tracking data
-- and attach each player's team name
 SELECT DISTINCT player_id, t.event, t.gameid, t.playid,
   CASE
     WHEN t.team = 'away' THEN g.visitor_team
     WHEN t.team = 'home' THEN g.home_team
     ELSE NULL
     END AS team_name
 FROM tracking t
 LEFT JOIN game g ON t.gameid = g.game_id
 WHERE t.event IN ('snap_direct','ball_snap')
)
-- Count these events &amp; keep only the rows where the player's team
-- had possession (i.e., the player was on offense) and the player is not a QB
SELECT a.player_id, pl.display_name, COUNT(a.event) AS play_count, a.team_name, pl."position"
FROM snap_events a
LEFT JOIN play p ON a.gameid = p.gameid AND a.playid = p.playid
LEFT JOIN player pl ON a.player_id = pl.player_id
WHERE a.team_name = p.possessionteam AND pl."position" != 'QB'
GROUP BY a.player_id, pl.display_name, a.team_name, pl."position"
ORDER BY play_count DESC;
</code></pre>
<p>So, now we can see the non-quarterbacks who are on offense the most in a season:</p><table>
<thead>
<tr>
<th>player_id</th>
<th>display_name</th>
<th>play_count</th>
<th>team_name</th>
<th>position</th>
</tr>
</thead>
<tbody>
<tr>
<td>2558149</td>
<td>JuJu Smith-Schuster</td>
<td>691</td>
<td>PIT</td>
<td>WR</td>
</tr>
<tr>
<td>2508061</td>
<td>Antonio Brown</td>
<td>679</td>
<td>PIT</td>
<td>WR</td>
</tr>
<tr>
<td>2543495</td>
<td>Davante Adams</td>
<td>630</td>
<td>GB</td>
<td>WR</td>
</tr>
<tr>
<td>2540158</td>
<td>Zach Ertz</td>
<td>629</td>
<td>PHI</td>
<td>TE</td>
</tr>
<tr>
<td>2541785</td>
<td>Adam Thielen</td>
<td>612</td>
<td>MIN</td>
<td>WR</td>
</tr>
<tr>
<td>2543468</td>
<td>Mike Evans</td>
<td>610</td>
<td>TB</td>
<td>WR</td>
</tr>
<tr>
<td>2555295</td>
<td>Sterling Shepard</td>
<td>610</td>
<td>NYG</td>
<td>WR</td>
</tr>
<tr>
<td>2540169</td>
<td>Robert Woods</td>
<td>604</td>
<td>LA</td>
<td>WR</td>
</tr>
<tr>
<td>2552600</td>
<td>Nelson Agholor</td>
<td>604</td>
<td>PHI</td>
<td>WR</td>
</tr>
<tr>
<td>2543488</td>
<td>Jarvis Landry</td>
<td>592</td>
<td>CLE</td>
<td>WR</td>
</tr>
<tr>
<td>2540165</td>
<td>DeAndre Hopkins</td>
<td>587</td>
<td>HOU</td>
<td>WR</td>
</tr>
<tr>
<td>2543498</td>
<td>Brandin Cooks</td>
<td>581</td>
<td>LA</td>
<td>WR</td>
</tr>
</tbody>
</table>
<h3 id="sack-percentage-by-quarterback-on-passing-plays">Sack percentage by quarterback on passing plays</h3><p>We can start to go a little deeper by extracting specific data from the <code>tracking</code> table and layering queries on top of it to make correlations. One piece of information that might be helpful in your analysis is knowing which quarterbacks are sacked most often during passing plays. In football, a “sack” is a negative play for the offense, and quarterbacks who get sacked more often tend to be lower performers overall.</p>
<p>Once you know those players, you could expand your analysis to see if they are sacked more on specific types of plays (shotgun formation) or if sacks occur more often in a specific quarter of the game (maybe the fourth quarter because the offensive line is more tired, or the team tends to be behind late in games and must pass more often).</p><p>Queries like this can quickly show you quarterbacks that are more likely to get sacked, particularly when they play a strong defensive team.</p><p>To get started, we wanted to find the sack percentage of each quarterback based on the total number of pass plays they were involved in during the regular season. To do that, we approached the tracking data by layering Common Table Expressions so that each query could build upon previous results.</p><p>First, we select the distinct list of all plays, for each quarterback (<code>qb_plays</code>). We use <code>SELECT DISTINCT…</code> because the tracking table holds multiple entries for each player, for each play. We just need one row for each play, for each quarterback.</p>
<p>With this result, we can then count the number of total plays per quarterback (<code>total_qb_plays</code>), the total number of games each quarterback played (<code>qb_games</code>) and then finally the number of pass plays the quarterback was a part of that resulted in a sack (<code>sacks</code>).</p>
<p>With that data in hand, we can finally query all of the values, do a percentage calculation, and order it by the total sack count.</p><pre><code class="language-SQL">WITH qb_plays AS (
	SELECT DISTINCT ON (POSITION, playid, gameid) POSITION, playid, player_id, gameid 
	FROM tracking t 
	WHERE POSITION = 'QB'
),
total_qb_plays AS (
	SELECT count(*) play_count, player_id FROM qb_plays
	GROUP BY player_id
),
qb_games AS (
	SELECT count(DISTINCT gameid) game_count, player_id FROM qb_plays 
	GROUP BY player_id
),
sacks AS (
	SELECT count(*) sack_count, player_id 
	FROM play p
	INNER JOIN qb_plays ON p.gameid = qb_plays.gameid AND p.playid = qb_plays.playid
	WHERE p.passresult = 'S'
	GROUP BY player_id
)
SELECT play_count, game_count, sack_count, (sack_count/play_count::float)*100 sack_percentage, display_name FROM total_qb_plays tqp
INNER JOIN qb_games qg ON tqp.player_id = qg.player_id
LEFT JOIN sacks s ON s.player_id = qg.player_id
INNER JOIN player ON tqp.player_id = player.player_id
ORDER BY sack_count DESC NULLS last;
</code></pre>
<p>If you're an ardent football fan, the results from 2018 probably don't surprise you.</p><table>
<thead>
<tr>
<th>play_count</th>
<th>game_count</th>
<th>sack_count</th>
<th>sack_percentage</th>
<th>display_name</th>
</tr>
</thead>
<tbody>
<tr>
<td>579</td>
<td>16</td>
<td>65</td>
<td>11.23</td>
<td>Deshaun Watson</td>
</tr>
<tr>
<td>602</td>
<td>16</td>
<td>55</td>
<td>9.14</td>
<td>Dak Prescott</td>
</tr>
<tr>
<td>611</td>
<td>16</td>
<td>53</td>
<td>8.67</td>
<td>Derek Carr</td>
</tr>
<tr>
<td>656</td>
<td>16</td>
<td>49</td>
<td>7.47</td>
<td>Aaron Rodgers</td>
</tr>
<tr>
<td>462</td>
<td>15</td>
<td>48</td>
<td>10.39</td>
<td>Russell Wilson</td>
</tr>
<tr>
<td>639</td>
<td>16</td>
<td>47</td>
<td>7.36</td>
<td>Eli Manning</td>
</tr>
<tr>
<td>448</td>
<td>14</td>
<td>45</td>
<td>10.04</td>
<td>Josh Rosen</td>
</tr>
<tr>
<td>659</td>
<td>16</td>
<td>43</td>
<td>6.53</td>
<td>Matt Ryan</td>
</tr>
<tr>
<td>386</td>
<td>14</td>
<td>43</td>
<td>11.14</td>
<td>Marcus Mariota</td>
</tr>
<tr>
<td>619</td>
<td>16</td>
<td>41</td>
<td>6.62</td>
<td>Matthew Stafford</td>
</tr>
<tr>
<td>621</td>
<td>15</td>
<td>38</td>
<td>6.12</td>
<td>Kirk Cousins</td>
</tr>
<tr>
<td>324</td>
<td>11</td>
<td>37</td>
<td>11.42</td>
<td>Ryan Tannehill</td>
</tr>
<tr>
<td>447</td>
<td>11</td>
<td>36</td>
<td>8.05</td>
<td>Carson Wentz</td>
</tr>
</tbody>
</table>
<p>Of course, there are a few quarterbacks that always seem to have a way of avoiding a sack.</p><table>
<thead>
<tr>
<th>play_count</th>
<th>game_count</th>
<th>sack_count</th>
<th>sack_percentage</th>
<th>display_name</th>
</tr>
</thead>
<tbody>
<tr>
<td>725</td>
<td>16</td>
<td>25</td>
<td>3.45</td>
<td>Ben Roethlisberger</td>
</tr>
<tr>
<td>682</td>
<td>16</td>
<td>22</td>
<td>3.23</td>
<td>Andrew Luck</td>
</tr>
<tr>
<td>613</td>
<td>16</td>
<td>21</td>
<td>3.43</td>
<td>Tom Brady</td>
</tr>
</tbody>
</table>
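<p>Earlier we mentioned that you could expand this analysis to see whether sacks cluster in a specific quarter. A minimal sketch of that follow-up is shown here. It reuses the <code>qb_plays</code> CTE from the query above and assumes that the <code>play</code> table has a <code>quarter</code> column, which is not shown anywhere in this post, so adjust the column name to match your schema.</p><pre><code class="language-SQL">-- Sketch only: assumes play.quarter exists (hypothetical column name)
WITH qb_plays AS (
	-- One row per play, per quarterback (same CTE as above)
	SELECT DISTINCT ON (POSITION, playid, gameid) POSITION, playid, player_id, gameid
	FROM tracking t
	WHERE POSITION = 'QB'
)
SELECT pl.display_name, p.quarter, count(*) AS sack_count
FROM play p
INNER JOIN qb_plays q ON p.gameid = q.gameid AND p.playid = q.playid
INNER JOIN player pl ON q.player_id = pl.player_id
WHERE p.passresult = 'S'
GROUP BY pl.display_name, p.quarter
ORDER BY pl.display_name, p.quarter;
</code></pre>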
<p>Now, let’s try some more “advanced” queries and analyses.</p><h2 id="faster-insights-with-postgresql-and-timescaledb">Faster Insights With PostgreSQL and TimescaleDB</h2><p>So far, the queries we've shown are interesting and help provide insights into various players throughout the season – but if you were looking closely, they're all regular SQL statements.</p><p>A season of NFL tracking data isn't like typical time-series data, however. Most of the queries we want to perform need to examine all 20 million rows in some way.</p><p>This is where a tool that's been built for time-series analysis, even when the data isn't typical time-series data, can significantly improve your ability to examine the data and save money at the same time.</p><h2 id="faster-queries-with-timescaledb-continuous-aggregates">Faster Queries With TimescaleDB Continuous Aggregates</h2><p>We noticed that we often needed to build queries that started with the <code>tracking</code> table, filtering data by specific players, positions, and games. Part of the reason is that the <code>play</code> table doesn't list all of the players who were involved in a particular play. As a result, we need to cross-reference the <code>tracking</code> table to identify the players who were involved in any given play.</p>
<p>The first example query we demonstrated - “average yards per position, per game” - is a good example of this. The query begins by summing all yards, by position, for each game.</p><p>This means that every row in <code>tracking</code> has to be read and aggregated <em>before</em> we can do any other analysis. Scanning those 20 million rows is pretty boring, repetitive, and slow work – especially compared to the analysis we want to do!</p>
<p>On our small test instance, the "average yards" query takes about 8 seconds to run. We could increase the size of the instance (which will cost us more money), or we could be smarter about how we query the data (which will cost us more time).</p><p>Instead, we can use continuous aggregates to pre-aggregate the data we're querying over and over again, which reduces the amount of work TimescaleDB needs to do every time we run the query. (Continuous aggregates are like PostgreSQL materialized views. For more info, check out our <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/continuous-aggregates/">continuous aggregates docs</a>.)</p><pre><code class="language-SQL">CREATE MATERIALIZED VIEW player_yards_by_game
WITH (timescaledb.continuous) AS
SELECT player_id, position, gameid,
 time_bucket(INTERVAL '1 day', "time") AS bucket,
 SUM(dis) AS yards
FROM tracking t
GROUP BY player_id, position, gameid, bucket;
</code></pre>
<p>After running this query and creating a continuous aggregate, we can modify that first query just slightly, using this as our basis table.</p><pre><code class="language-SQL">WITH total_position_yards AS (
	SELECT sum(yards) position_yards, POSITION, gameid 
	FROM player_yards_by_game t
	GROUP BY POSITION, gameid)
SELECT avg(position_yards), position, game_date
FROM game g
INNER JOIN total_position_yards tpy ON g.game_id = tpy.gameid
WHERE POSITION IN ('QB','RB','WR','TE')
GROUP BY game_date, POSITION
ORDER BY game_date, position;
</code></pre>
<p>We get the same result, but now the query runs in 100ms - <strong>80x faster</strong>!</p><h2 id="advanced-sql-data-analysis-with-timescaledb-hyperfunctions">Advanced SQL Data Analysis With TimescaleDB Hyperfunctions</h2><p>Finally, the more we dug into the data, the more we found we needed (or wanted) functions specifically tuned for time-series data analysis to answer the types of questions we wanted to ask.</p><p>It is for this kind of analysis that we built <a href="https://timescale.ghost.io/blog/blog/introducing-hyperfunctions-new-sql-functions-to-simplify-working-with-time-series-data-in-postgresql/">TimescaleDB hyperfunctions</a>, a series of SQL functions within TimescaleDB that make it easier to manipulate and analyze time-series data in PostgreSQL with fewer lines of code.</p><h3 id="grouping-data-into-percentiles">Grouping data into percentiles</h3><p>The NFL dataset is a great use case for <a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/">percentiles</a>. Being able to quickly find players that perform better or worse than some cohort is really powerful.</p><p>As an example, we'll use the same continuous aggregate we created earlier (total yards, per game, per player) to find the mean and median yards traveled per game for each position.</p><pre><code class="language-SQL">WITH sum_yards AS (
--Add position to the table to allow for grouping by it later
 SELECT a.player_id, display_name, SUM(yards) AS yards, p.position, gameid
 FROM player_yards_by_game a
 LEFT JOIN player p ON a.player_id = p.player_id
 GROUP BY a.player_id, display_name, p.position, gameid
)
--Find the mean and median for each position type
SELECT position, mean(percentile_agg(yards)) AS mean_yards, approx_percentile(0.5, percentile_agg(yards)) AS median_yards
FROM sum_yards
WHERE POSITION IS NOT null
GROUP BY position
ORDER BY mean_yards DESC;
</code></pre>
<table>
<thead>
<tr>
<th>position</th>
<th>mean_yards</th>
<th>median_yards</th>
</tr>
</thead>
<tbody>
<tr>
<td>FS</td>
<td>595.583433048431</td>
<td>626.388099960848</td>
</tr>
<tr>
<td>CB</td>
<td>572.3336749867212</td>
<td>592.2175990890378</td>
</tr>
<tr>
<td>WR</td>
<td>552.6508570179277</td>
<td>555.5030569048633</td>
</tr>
<tr>
<td>S</td>
<td>530.6436781609186</td>
<td>550.5961518474892</td>
</tr>
<tr>
<td>SS</td>
<td>522.5604103343453</td>
<td>551.1296628916651</td>
</tr>
<tr>
<td>MLB</td>
<td>462.70229007633407</td>
<td>490.77906906009343</td>
</tr>
<tr>
<td>ILB</td>
<td>402.7882871125599</td>
<td>403.3779668359464</td>
</tr>
<tr>
<td>OLB</td>
<td>393.40014271151847</td>
<td>390.6742117791442</td>
</tr>
<tr>
<td>QB</td>
<td>334.7025466893028</td>
<td>352.1192705472368</td>
</tr>
<tr>
<td>LB</td>
<td>328.9812527472519</td>
<td>257.72003396053884</td>
</tr>
<tr>
<td>TE</td>
<td>327.9515596330271</td>
<td>257.72003396053884</td>
</tr>
</tbody>
</table>
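<p>The median is only one point in the distribution. As a hedged sketch, the same <code>percentile_agg</code> and <code>approx_percentile</code> calls can return other percentiles by changing the first argument. The query below asks for the 25th and 75th percentiles of per-game yards for each position. The output column names (<code>p25_yards</code>, <code>p75_yards</code>) are our own illustrative choices.</p><pre><code class="language-SQL">-- Sketch only: same approach as above, with quartiles instead of the median
WITH sum_yards AS (
	-- Total yards per player, per game, with the player's position attached
	SELECT a.player_id, SUM(yards) AS yards, p.position, a.gameid
	FROM player_yards_by_game a
	LEFT JOIN player p ON a.player_id = p.player_id
	GROUP BY a.player_id, p.position, a.gameid
)
SELECT position,
	approx_percentile(0.25, percentile_agg(yards)) AS p25_yards,
	approx_percentile(0.75, percentile_agg(yards)) AS p75_yards
FROM sum_yards
WHERE POSITION IS NOT null
GROUP BY position
ORDER BY p75_yards DESC;
</code></pre>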
<h3 id="finding-extreme-outliers">Finding extreme outliers</h3><p>Finally, we can build upon this percentile query to find players at each position that run more than 95% of all other players at that position. For some positions, like wide receiver or free safety, this could help us find the “outlier” players that are able to travel the field consistently throughout a game – and make plays!</p><pre><code class="language-SQL">WITH sum_yards AS (
--Add position to the table to allow for grouping by it later
 SELECT a.player_id, display_name, SUM(yards) AS yards, p.position
 FROM player_yards_by_game a
 LEFT JOIN player p ON a.player_id = p.player_id
 GROUP BY a.player_id, display_name, p.position
),
position_percentile AS (
	SELECT POSITION, approx_percentile(0.95, percentile_agg(yards)) AS p95
	FROM sum_yards 
	GROUP BY position
)
SELECT a.POSITION, a.display_name, yards, p95
	FROM sum_yards a
	LEFT JOIN position_percentile pp ON a.POSITION = pp.position
	WHERE yards &gt;= p95
AND a.POSITION IN ('WR','FS','QB','TE')
ORDER BY position;
</code></pre>
<table>
<thead>
<tr>
<th>position</th>
<th>display_name</th>
<th>yards</th>
<th>p95</th>
</tr>
</thead>
<tbody>
<tr>
<td>FS</td>
<td>Eric Weddle</td>
<td>13869.759999999997</td>
<td>12320.288323166456</td>
</tr>
<tr>
<td>FS</td>
<td>Adrian Amos</td>
<td>12989.439999999966</td>
<td>12320.288323166456</td>
</tr>
<tr>
<td>FS</td>
<td>Tyrann Mathieu</td>
<td>12565.219999999956</td>
<td>12320.288323166456</td>
</tr>
<tr>
<td>QB</td>
<td>Aaron Rodgers</td>
<td>7422.35999999995</td>
<td>6667.51452813257</td>
</tr>
<tr>
<td>QB</td>
<td>Patrick Mahomes</td>
<td>6985.989999999952</td>
<td>6667.51452813257</td>
</tr>
<tr>
<td>QB</td>
<td>Matt Ryan</td>
<td>6759.959999999969</td>
<td>6667.51452813257</td>
</tr>
<tr>
<td>TE</td>
<td>Zach Ertz</td>
<td>13124.58999999995</td>
<td>10667.986199523099</td>
</tr>
<tr>
<td>TE</td>
<td>Jimmy Graham</td>
<td>12693.679999999982</td>
<td>10667.986199523099</td>
</tr>
<tr>
<td>TE</td>
<td>Travis Kelce</td>
<td>12218.129999999957</td>
<td>10667.986199523099</td>
</tr>
<tr>
<td>TE</td>
<td>David Njoku</td>
<td>11502.159999999965</td>
<td>10667.986199523099</td>
</tr>
<tr>
<td>TE</td>
<td>George Kittle</td>
<td>11058.099999999975</td>
<td>10667.986199523099</td>
</tr>
<tr>
<td>TE</td>
<td>Kyle Rudolph</td>
<td>10761.949999999968</td>
<td>10667.986199523099</td>
</tr>
<tr>
<td>TE</td>
<td>Jared Cook</td>
<td>10678.22999999998</td>
<td>10667.986199523099</td>
</tr>
<tr>
<td>WR</td>
<td>Antonio Brown</td>
<td>16877.559999999965</td>
<td>14271.23409723974</td>
</tr>
<tr>
<td>WR</td>
<td>Brandin Cooks</td>
<td>15510.01999999995</td>
<td>14271.23409723974</td>
</tr>
<tr>
<td>WR</td>
<td>JuJu Smith-Schuster</td>
<td>15492.76999999996</td>
<td>14271.23409723974</td>
</tr>
<tr>
<td>WR</td>
<td>Robert Woods</td>
<td>15253.179999999958</td>
<td>14271.23409723974</td>
</tr>
<tr>
<td>WR</td>
<td>Nelson Agholor</td>
<td>15180.32999999997</td>
<td>14271.23409723974</td>
</tr>
<tr>
<td>WR</td>
<td>Tyreek Hill</td>
<td>15106.609999999973</td>
<td>14271.23409723974</td>
</tr>
<tr>
<td>WR</td>
<td>Zay Jones</td>
<td>14790.589999999967</td>
<td>14271.23409723974</td>
</tr>
<tr>
<td>WR</td>
<td>Sterling Shepard</td>
<td>14673.79999999996</td>
<td>14271.23409723974</td>
</tr>
<tr>
<td>WR</td>
<td>Mike Evans</td>
<td>14620.129999999983</td>
<td>14271.23409723974</td>
</tr>
<tr>
<td>WR</td>
<td>Davante Adams</td>
<td>14574.509999999951</td>
<td>14271.23409723974</td>
</tr>
<tr>
<td>WR</td>
<td>Kenny Golladay</td>
<td>14354.499999999973</td>
<td>14271.23409723974</td>
</tr>
<tr>
<td>WR</td>
<td>Jarvis Landry</td>
<td>14281.509999999971</td>
<td>14271.23409723974</td>
</tr>
</tbody>
</table>
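<p>Because <code>approx_percentile</code> returns an approximation, you may want to sanity-check the cutoff on a dataset this size. One possible way, sketched here under the same table assumptions as the query above, is to compute the exact value alongside it with PostgreSQL's built-in <code>percentile_cont</code> ordered-set aggregate. The column aliases are illustrative.</p><pre><code class="language-SQL">-- Sketch only: compare the approximate p95 with PostgreSQL's exact percentile_cont
WITH sum_yards AS (
	-- Season total yards per player, with the player's position attached
	SELECT a.player_id, SUM(yards) AS yards, p.position
	FROM player_yards_by_game a
	LEFT JOIN player p ON a.player_id = p.player_id
	GROUP BY a.player_id, p.position
)
SELECT position,
	approx_percentile(0.95, percentile_agg(yards)) AS approx_p95,
	percentile_cont(0.95) WITHIN GROUP (ORDER BY yards) AS exact_p95
FROM sum_yards
WHERE POSITION IN ('WR','FS','QB','TE')
GROUP BY position
ORDER BY position;
</code></pre>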
<h2 id="where-can-the-data-take-you">Where Can the Data Take You?</h2><p>As you’ve seen in this example, <strong>time-series data is everywhere</strong>. Being able to harness it gives you a huge advantage, whether you’re working on a professional solution or a personal project.</p><p>We’ve shown you a few ways that time-series queries can unlock interesting insights, give you a greater appreciation for the game and its players, and (hopefully) inspire you to dig into the data yourself.</p><p><strong>To get started with the </strong><a href="https://www.kaggle.com/c/nfl-big-data-bowl-2021/overview"><strong>NFL data</strong></a><strong>:</strong></p><ul><li><strong>Spin up a fully managed TimescaleDB service</strong>: create an account to <a href="https://console.cloud.timescale.com/signup">try it for free</a> for 30 days.</li><li><a href="https://docs.timescale.com/timescaledb/latest/tutorials/nfl-analytics/">Follow our complete tutorial</a> for step-by-step instructions for preparing and ingesting the dataset, along with several more queries to help you glean insights from the dataset.</li></ul><p>If you’re new to time-series data or just have some questions about how to use TimescaleDB to analyze the NFL’s dataset, <a href="https://slack.timescale.com">join our public Slack community</a>. You’ll find Timescale engineers and thousands of time-series enthusiasts from around the world – and we’ll be happy to help you.</p><p>🙏 We’d like to thank the NFL for making this data available and the millions of passionate fans around the world who make the NFL such an exciting game to watch.</p><p>And, Geaux Saints 🏈!</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How METER Group Brings a Data-Driven Approach to the Cannabis Production Industry]]></title>
            <description><![CDATA[Learn how METER Group architected its data stack to collect and visualize massive amounts of data – and help customers make informed business decisions.]]></description>
            <link>https://www.tigerdata.com/blog/how-meter-group-brings-a-data-driven-approach-to-the-cannabis-production-industry</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/how-meter-group-brings-a-data-driven-approach-to-the-cannabis-production-industry</guid>
            <category><![CDATA[Dev Q&A]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Paolo Bergantino]]></dc:creator>
            <pubDate>Mon, 26 Jul 2021 14:42:35 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/07/174779559_978280639374264_6865881920322659010_n.jpg">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/07/174779559_978280639374264_6865881920322659010_n.jpg" alt="A person sitting on the chair surrounded by cannabis plants" /><p><em>This is an installment of our “Community Member Spotlight” series, where we invite our customers to share their work, shining a light on their success and inspiring others with new ways to use technology to solve problems.</em></p><p><em>In this edition, </em><a href="https://www.linkedin.com/in/paolobergantino/"><em>Paolo Bergantino</em></a><em>, Director of Software for the Horticulture business unit at METER Group, joins us to share how they make data accessible to their customers so that they can maximize their cannabis yield and increase efficiency and consistency between grows.</em></p><p><a href="https://aroya.io/">AROYA</a> is the leading cannabis production platform servicing the U.S. market today. AROYA is part of <a href="https://www.metergroup.com/">METER Group</a>, a scientific instrumentation company with 30+ years of expertise in developing sensors for the agriculture and food industries. We have taken this technical expertise and applied it to the cannabis market, developing a platform that allows growers to grow more efficiently and increase their yields—and to do so consistently and at scale.</p><h2 id="about-the-team">About the team</h2><p>My name is <a href="https://www.linkedin.com/in/paolobergantino/">Paolo Bergantino</a>. I have about 15 years of experience developing web applications in various stacks, and I have spent the last four here at METER Group. Currently, I am the Director of Software for the Horticulture business unit, which is in charge of the development and infrastructure of the AROYA software platform. My direct team consists of about ten engineers, 3 QA engineers, and a UI/UX Designer. 
(<a href="https://www.metergroup.com/career/">We’re also hiring!</a>)</p><h2 id="about-the-project">About the project</h2><p>AROYA is built as a React Single-Page App (SPA) that communicates with a Django/DRF back-end. In addition to using <a href="https://www.timescale.com/products">Timescale Cloud</a> for our database, we use AWS services such as EC2+ELB for our app and workers, <a href="https://aws.amazon.com/elasticache/redis/">ElastiCache for Redis</a>, <a href="https://aws.amazon.com/s3/">S3</a> for various tasks, <a href="https://docs.aws.amazon.com/iot/latest/developerguide/sqs-rule-action.html">AWS IoT/SQS</a> for handling packets from our sensors, and some other services here and there.</p><p>As I previously mentioned, AROYA was born out of our desire to build a system that leveraged our superior sensor technology in an industry that needed such a system. Cannabis worked out great in this respect, as the current legalization movement throughout the U.S. has resulted in a lot of disruption in the space. </p><p>The more we spoke to growers, the more we were struck by how much mythology there was in growing cannabis and by how little science was being applied by relatively large operations. As a company with deeply scientific roots, we found it to be a perfect match and an area where we could bring some of our knowledge to the forefront. <strong>We ultimately believe the only survivors in the space are those who can use data-driven approaches to their cultivation to maximize their yield and increase efficiency and consistency between grows. </strong></p><p>As part of the AROYA platform, we developed a wireless module (called a “nose”) that could be attached to our sensors. 
Using Bluetooth Low Energy (BLE) for low power consumption and attaching a solar panel to take advantage of the lights in a grow room, the module can run indefinitely without charging.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/07/204951218_1941930365957763_2364085903187304407_n.jpg" class="kg-image" alt="The AROYA nose device with the cannabis plants in the background" loading="lazy" width="1032" height="1290" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2021/07/204951218_1941930365957763_2364085903187304407_n.jpg 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2021/07/204951218_1941930365957763_2364085903187304407_n.jpg 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/07/204951218_1941930365957763_2364085903187304407_n.jpg 1032w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The AROYA nose in its natural habitat (</span><a href="https://www.instagram.com/p/CQbpXuXBgke/"><span style="white-space: pre-wrap;">aroya.io</span></a><span style="white-space: pre-wrap;"> Instagram)</span></figcaption></figure><p>The most critical sensor we attach to this nose is called the TEROS 12, the three-pronged sensor pictured below. It can be installed into any growing medium (like rockwool, coconut coir, soil, or mixes like perlite, pumice, or peat moss) and give insights into the temperature, water content (WC), and electrical conductivity (EC) of the medium. 
Without getting too into the weeds (pardon the pun), WC and EC, in particular, are crucial in helping growers make informed irrigation decisions that will steer the plants into the right state and ultimately maximize their yield potential.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/07/161449707_437044057364377_9165544412024707037_n.jpg" class="kg-image" alt="The white three-pronged sensor called TEROS laying on the grey shelf" loading="lazy" width="845" height="556" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2021/07/161449707_437044057364377_9165544412024707037_n.jpg 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/07/161449707_437044057364377_9165544412024707037_n.jpg 845w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The AROYA nose with a connected TEROS 12 sensor (</span><a href="https://www.instagram.com/p/CMe_cyuBHUO/"><span style="white-space: pre-wrap;">aroya.io </span></a><span style="white-space: pre-wrap;">Instagram)</span></figcaption></figure><p>We also have an ATMOS 14 sensor for measuring the climate in the rooms and <a href="https://aroya.io/cultivation/">a whole suite of sensors</a> for other use cases.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/07/201782665_1014596755960402_5100068875632858009_n-1.jpg" class="kg-image" alt="The white ATMOS sensor hanging from the ceiling with cannabis plants underneath it" loading="lazy" width="1032" height="1290" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2021/07/201782665_1014596755960402_5100068875632858009_n-1.jpg 600w, 
https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2021/07/201782665_1014596755960402_5100068875632858009_n-1.jpg 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/07/201782665_1014596755960402_5100068875632858009_n-1.jpg 1032w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">An AROYA repeater with an ATMOS 14 sensor for measuring the climate (</span><a href="https://www.instagram.com/p/CQME57tB6Ds/"><span style="white-space: pre-wrap;">aroya.io </span></a><span style="white-space: pre-wrap;">Instagram)</span></figcaption></figure><p>AROYA’s core competency is collecting this data—e.g., EC, WC, soil temperature, air temperature, etc.—and serving it to our clients in real time (or, at least “real time” for our purposes, as our typical sampling interval is 3 minutes).</p><p>Growers typically split their growing rooms into irrigation zones. We encourage them to install a statistically significant number of sensors in each room and its zones, so that AROYA gives them good and actionable feedback on the state of their room. For example, there’s a concept in cultivation called "<a href="https://aroya.io/resources/crop-steering/">crop steering</a>" that basically says that if you stress the plant in just the right way, you can "steer" it into generative or vegetative states at will and drive it to squeeze every last bit of flower. How and when you do this is crucial to doing it properly.</p><p>Our data allows growers to dial in their irrigation strategy so they can hit their target "dry back" for the plant (this is more or less the difference between the water content at the end of irrigation and the water content at the next irrigation event). Optimizing dry back is one of the biggest factors in making crop steering work, and it's basically impossible to do well without good data. 
(We provide lots of other data that helps growers make decisions, but this is one of the most important ones.)</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/07/image--1-.png" class="kg-image" alt="A line chart with dark blue background showing electrical conductivity and water content data related to a room in AROYA." loading="lazy" width="1723" height="659" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2021/07/image--1-.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2021/07/image--1-.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2021/07/image--1-.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/07/image--1-.png 1723w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Graph showing electrical conductivity (EC) and water content (WC) data related to a room in AROYA.</span></figcaption></figure><p>This can be even more important when multiple cultivars (“strains”) of cannabis are grown in the same room, as the differences between two cultivars regarding their needs and expectations can be pretty dramatic. For those unfamiliar with the field, an example might be that different cultivars "drink" water differently, and thus must be irrigated differently to achieve maximum yields. There are also "stretchy" cultivars that grow taller faster than "stocky" ones, and this also affects how they interact with the environment. 
AROYA helps not only with sensing but also with documenting and understanding these differences to improve future runs.</p><p>The most important thing from collecting all this data is making it accessible to users via graphs and visualizations in an intuitive, reliable, and accurate way, so they can make informed decisions about their cultivation.</p><p>We also have alerts and other logic that we apply to incoming data. These visualizations and business logic can happen at the sensor level, at the zone level, at the room level, or sometimes even at the facility level.</p><p>A typical use case with AROYA might be that a user logs in to their dashboard to view sensor data for a room. Initially, they view charts aggregated to the zone level, but they may decide to dig deeper into a particular zone and view the individual sensors that make up that zone. Or, vice versa, they may want to pull out and view data averaged all the way up to the room. So, as we designed our solution, we needed to ensure we could get to (and provide) the data at the right aggregation level quickly.</p><h2 id="choosing-and-using-timescaledb">Choosing and using TimescaleDB</h2><h3 id="the-initial-solution">The initial solution</h3><p>During the days of our closed alpha and beta of AROYA with early trial accounts (late 2017 through our official launch in December 2019), the amount of data coming into the system was not significant. Our nose was still being developed (and hardware development is nice and slow), so we had to make do with some legacy data loggers that METER also produces. </p><p>These data loggers only sampled every 5 minutes and, at best, reported every 15 minutes. We used <a href="https://aws.amazon.com/rds/aurora/postgresql-features/">AWS’ RDS Aurora PostgreSQL</a> service and cobbled together a set of triggers and functions that partitioned our main readings table by client facility, but no further. 
Because we have so many sensor models and data types we can collect, I chose to use a <a href="https://docs.timescale.com/timescaledb/latest/overview/data-model-flexibility/narrow-data-model/#narrow-table-model">narrow data model</a> for our main readings table. </p><p>This overall setup worked well enough at first, but as we progressed from alpha to beta and our customer base grew, it became increasingly clear that it was not a long-term solution for our <a href="https://www.tigerdata.com/blog/time-series-introduction" rel="noreferrer">time series data</a> needs. I could have expanded my self-managed system of triggers and functions and cobbled together additional partitions within a facility, but this did not seem ideal. There had to be a better way! </p><p>I started looking into specific time-series solutions. I am a bit of a home automation aficionado, and I was already familiar with InfluxDB—but <strong>I didn’t wish to split my relational data and readings data or teach my team a new query language. 
</strong></p><p><strong>TimescaleDB, being built on top of PostgreSQL, initially drew my attention: it “just worked” in every respect I could expect it to, and I could keep using the same tools I was already used to.</strong> At this point, however, I had a few reservations about some non-technical aspects of hosting TimescaleDB that prevented me from going full steam ahead with it.</p><p><em>✨  <strong>Editor’s Note:</strong> For more comparisons and benchmarks, see how TimescaleDB compares to </em><a href="https://timescale.ghost.io/blog/timescaledb-vs-influxdb-for-time-series-data-timescale-influx-sql-nosql-36489299877/" rel="noreferrer"><em>InfluxDB</em></a><em>, </em><a href="https://timescale.ghost.io/blog/how-to-store-time-series-data-mongodb-vs-timescaledb-postgresql-a73939734016/" rel="noreferrer"><em>MongoDB</em></a><em>, </em><a href="https://timescale.ghost.io/blog/timescaledb-vs-amazon-timestream-6000x-higher-inserts-175x-faster-queries-220x-cheaper/" rel="noreferrer"><em>AWS Timestream</em></a><em>, and other </em><a href="https://www.timescale.com/learn/the-best-time-series-databases-compared" rel="noreferrer"><em>time-series database alternatives</em> <em>on various vectors</em></a><em>, from performance and ecosystem to query language and beyond.</em></p><h3 id="applying-a-band-aid-and-setting-a-goal">Applying a band-aid and setting a goal</h3><p>If I am perfectly truthful, before this point, I did not have any serious requirements or standards about what I considered to be the adequate quality of service for our application. I had a bit of an “<a href="https://en.wikipedia.org/wiki/I_know_it_when_I_see_it">I know it when I see it</a>” attitude towards the whole thing. </p><p>When we had a potential client walk away during a demo due to a particularly slow-loading graph, I knew we had a problem on our hands and that we needed something really solid for the long term. 
</p><p>Still, at the time, we also needed something to get us by until we could perform a thorough evaluation of the available solutions and build something around that. At this point, I decided to set up a <a href="https://redis.io/topics/cluster-tutorial">Redis cluster</a> between RDS and our application, which stored the last 30 days of sensor data (at all the aggregation levels required) as a Pandas data frame. Any chart request coming in for data within those 30 days (which accounted for something like 90&nbsp;% of our requests) would simply hit Redis. Anything longer would cobble together the answer using both Redis and querying the database. <strong>Performance for the 90&nbsp;% use case was adequate, but it was getting increasingly dreadful as more and more historical data piled up for anything that hit the database.</strong> </p><p>At this point, I set the goalposts for what our new solution would need to meet: <em>Any chart request, which is an integral part of AROYA, needs to take less than one second for the API to serve.</em></p><h3 id="the-research-and-the-first-solution">The research and the first solution</h3><p>We looked at other databases at this point: we revisited InfluxDB, and we got into a beta of Timestream for AWS and evaluated that too. We even considered going NoSQL for the whole thing. We ran tests and benchmarks, created matrices of pros and cons, estimated costs, and the whole shebang. <strong>Nothing compared favorably to what we were able to achieve with TimescaleDB.</strong></p><p>Ultimately,<strong> the feature that really caught our attention was </strong><a href="https://docs.timescale.com/timescaledb/latest/getting-started/create-cagg/"><strong>continuous aggregates</strong></a><strong> in TimescaleDB</strong>. The way our logic works is that we see the timeframe that the user is requesting and sample our data accordingly. 
In other words, if a user fetches three months' worth of data, we would not send three months' worth of raw data to be graphed to the front end. Instead, we would bucket our data into appropriately sized buckets that would give us the right amount of data we want to display in the interface. </p><p>Although it would require quite a few views if we created continuous aggregates for every aggregation level and bucket size we cared about and then directly queried the right aggregation/bucket combination (depending on the parameters requested), that should do it, right? The answer was a resounding <strong>yes</strong>. </p><p><strong>The performance we were able to achieve using these views shattered the competition. </strong>Although I admit we were kind of “cheating” by precalculating the data, the point is that we could easily do it. Not only this but when we ran load tests on our proposed infrastructure, we were blown away by how much more traffic we could support without any service degradation. We could also eliminate all the complicated infrastructure that our Redis layer required, which was quite a load off (literally and figuratively).</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/image--4-.png" class="kg-image" alt="The chart with two lines showing the application server load before and after the TimescaleDB deployment." 
loading="lazy" width="613" height="317" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/01/image--4-.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/image--4-.png 613w"><figcaption><span style="white-space: pre-wrap;">Grafana dashboard for the internal team showing app server load average before and after deployment of initial TimescaleDB implementation.</span></figcaption></figure><p>The Achilles’ heel of this solution, an astute reader may already notice, is that we were paying for this performance in disk space. </p><p>I initially brushed this off as fair trade and moved on with my life. <strong>We found </strong><a href="https://docs.timescale.com/timescaledb/latest/getting-started/compress-data/"><strong>TimescaleDB’s compression</strong></a><strong> to be as good as advertised, which gave us 90&nbsp;%+ space savings in our underlying hypertable, </strong>but our sizable collection of uncompressed continuous aggregates grew by the day (keep reading to learn why this is a “but”...).</p><p><em>✨ <strong>Editor’s Note</strong>: We’ve put together resources about </em><a href="https://docs.timescale.com/timescaledb/latest/getting-started/create-cagg/##what-are-continuous-aggregates"><em>continuous aggregates</em></a><em> and </em><a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/compression/"><em>compression</em></a><em> to help you get started.</em></p><h3 id="the-%E2%80%9Cfinal%E2%80%9D-solution">The “final” solution</h3><p>AROYA has been on an amazing trajectory since launch, and our growth was evident in the months before and after we deployed our initial TimescaleDB implementation. Thousands upon thousands of sensors hitting the field was great for business – but bad for our disk space. 
</p><p>Our monitoring told a good story of how long our chart requests were taking, as 95%+ of them were under 1 second, and virtually all were under 2 seconds. Still, within a few months of deployment, we needed to upgrade tiers in Timescale Cloud solely to keep up with our disk usage.</p><p>We had adequate computing resources for our load, but 1 TB was no longer enough, so we doubled our total instance size to get another 1 TB. While everything was running smoothly, I felt a dark cloud overhead as our continuous aggregates grew and grew in size.</p><p>The clock was ticking, and before we knew it, we were approaching 2 TB of readings. So, we had to take action. </p><p>We had attended a webinar hosted by Timescale and heard someone make a relatively off-hand comment about rolling their own compression for continuous aggregates. This planted a seed that was all we needed to get going.</p><p>The plan was thus: first, after consulting with Timescale staff, we learned that we had way too many bucket sizes. We could use <a href="https://docs.timescale.com/api/latest/analytics/time_bucket/">TimescaleDB’s time_bucket functions</a> to do some of this on the fly without affecting performance or keeping as many continuous aggregates. That was an easy win. </p><p>Next, we split each of our current continuous aggregates into three separate components:</p><ul><li>First, we kept the original continuous aggregate.</li><li>Then, we leveraged the <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/user-defined-actions/">TimescaleDB job scheduler</a> to move and compress chunks from the original continuous aggregate into a <a href="https://docs.timescale.com/timescaledb/latest/getting-started/create-hypertable/">hypertable </a>for that specific bucket/aggregation view.</li><li>Finally, we created a plain old view that UNIONed the two and made it a transparent change for our application. 
</li></ul><p>This allowed us to compress everything but the last week of all of our continuous aggregates, and the results were as good as we could have hoped for.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/pasted-image-0-5.png" class="kg-image" alt="The line chart showing the compression of data from 1.83 TB to 700 GB" loading="lazy" width="626" height="229" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/01/pasted-image-0-5.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/pasted-image-0-5.png 626w"><figcaption><span style="white-space: pre-wrap;">The 1.83 TB database was compressed into 700 GB.</span></figcaption></figure><p><strong>We were able to take our ~1.83 TB database and compress it down to 700 GB</strong>. Not only that, about 300 GB of that is log data that’s unrelated to our main reading pipeline. </p><p>We will be migrating out this data soon, which gives us a vast amount of room to grow. (We think we can even move back to the 1 TB plan at this point, but have to test to ensure that compute doesn’t become an issue.) The growth rate of our disk usage also slowed massively, which bodes well for this solution in the long term. What’s more, there was virtually no penalty for doing this in terms of performance for any of the metrics we monitor.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/pasted-image-0--1--1.png" class="kg-image" alt="The dot plot on the dark background showing how long sampling of chart requests takes to serve." 
loading="lazy" width="624" height="232" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/01/pasted-image-0--1--1.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/pasted-image-0--1--1.png 624w"><figcaption><span style="white-space: pre-wrap;">Our monitoring shows how long sampling of chart requests takes to serve.</span></figcaption></figure><p>Ultimately TimescaleDB had wins across the board for my team. <strong>Performance was going to be the driving force behind whatever we went with, and TimescaleDB has delivered that in spades.</strong></p><h2 id="current-deployment-future-plans">Current deployment &amp; future plans</h2><p><strong>We currently ingest billions of readings every month using TimescaleDB and couldn’t be happier. </strong>Our data ingest and charting capabilities are two of the essential aspects of AROYA’s infrastructure. </p><p>While the road to get here has been a huge learning experience, our current infrastructure is straightforward and performant, and we’ve been able to rely on it to work as expected and to do the right thing. 
I am not sure I can pay a bigger compliment than that.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/pasted-image-0--2-.png" class="kg-image" alt="The architecture diagram of AROYA solution" loading="lazy" width="1600" height="764" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/01/pasted-image-0--2-.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/01/pasted-image-0--2-.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/pasted-image-0--2-.png 1600w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The current architecture diagram</span></figcaption></figure><p>We’ve recently gone live with our AROYA Analytics release, which is building upon what we’ve done to deliver deeper insights into the environment and the operations at the facilities using our service. Every step of the way, it’s been straightforward (and performant!) to calculate the metrics we need with our TimescaleDB setup. </p><h2 id="getting-started-advice-resources">Getting started advice &amp; resources</h2><p>I think it’s worth mentioning that there were many trade-offs and requirements that guided me to where AROYA is today with our use of TimescaleDB. Ultimately, my story is simply the set of decisions that led me to where we are now and people’s mileage may vary depending on their requirements. </p><p>I am sure that the set of functionality offered means that, with a little bit of creativity, TimescaleDB can work for just about any time-series use case I can think of.</p><p>The exercise we went through when iterating from our initial non-Timescale solution to Timescale was crucial to get me to be comfortable with that migration. 
Moving such a critical part of my infrastructure was scary, and it is <em>still</em> scary. </p><p>Monitoring everything you can, having redundancies, and being vigilant about any unexpected activity - even if it’s not something that may trigger an error - has helped us stay out of trouble.</p><p>We have a big <a href="https://grafana.com/">Grafana</a> dashboard on a TV in our office that displays various metrics and multiple times we’ve seen something odd and uncovered an issue that could have festered into something much more if we hadn’t dug into it right away. Finally, diligent load testing of the infrastructure and staging runs of any significant modifications have made our deployments a lot less stressful, since they instill quite a bit of confidence.</p><p><em><strong>✨ Editor’s Note:</strong> Check out </em><a href="https://www.youtube.com/playlist?list=PLsceB9ac9MHTjwvV18QJnPcLrTXm_Q-Ft"><em>Grafana 101 video series</em></a><em> and </em><a href="https://docs.timescale.com/timescaledb/latest/tutorials/grafana/"><em>Grafana tutorials</em></a><em> to learn everything from building awesome, interactive visualizations to setting up custom alerts, sharing dashboards with teammates, and solving common issues.</em></p><p>I would like to give a big shout-out to Neil Parker, who is my right-hand man in anything relating to AROYA infrastructure and did virtually all of the actual work in getting many of these things set up and running. 
I would also like to thank <a href="https://twitter.com/michaelfreedman">Mike Freedman</a> and <a href="https://www.linkedin.com/in/priscilafletcher/">Priscila Fletcher</a> from Timescale, who have given us a great bit of time and information and helped us in our journey with TimescaleDB.</p><p><em>We’d like to give a big thank you to Paolo and everyone at AROYA for sharing their story, as well as for their efforts to help transform the cannabis production industry, equipping growers with the data they need to improve their crops, make informed decisions, and beyond.</em></p><p><em>We’re always keen to feature new community projects and stories on our blog. If you have a story or project you’d like to share, reach out on Slack (</em><a href="https://timescaledb.slack.com/archives/D01TTSRCFC7"><em>@</em></a><em>Ana Tavares), and we’ll go from there.</em></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[What Time-Weighted Averages Are and Why You Should Care]]></title>
            <description><![CDATA[Learn how time-weighted averages are calculated and why they’re so powerful for data analysis.]]></description>
            <link>https://www.tigerdata.com/blog/what-time-weighted-averages-are-and-why-you-should-care</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/what-time-weighted-averages-are-and-why-you-should-care</guid>
            <category><![CDATA[Engineering]]></category>
            <category><![CDATA[General]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[David Kohn]]></dc:creator>
            <pubDate>Thu, 22 Jul 2021 13:04:23 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/07/national-cancer-institute-zz_3tCcrk7o-unsplash--1-.jpg">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/07/national-cancer-institute-zz_3tCcrk7o-unsplash--1-.jpg" alt="Color tiny dots, purple, green, blue and bright pink" /><p>Many people who work with time-series data have nice, regularly sampled datasets. Data could be sampled every few seconds, or milliseconds, or whatever they choose, but by regularly sampled, we mean the time between data points is basically constant. Computing the average value of data points over a specified time period in a regular dataset is a relatively well-understood query to compose. </p><p>But for those who don't have regularly sampled data, getting a representative average over a period of time can be a complex and time-consuming query to write. <strong>Time-weighted averages are a way to get an unbiased average when you are working with irregularly sampled data</strong>.</p><p>Time-series data comes at you fast, sometimes generating millions of data points per second (<a href="https://timescale.ghost.io/blog/blog/what-the-heck-is-time-series-data-and-why-do-i-need-a-time-series-database-dcf3b1b18563/">read more about time-series data</a>). Because of the sheer volume and rate of information, time-series data can already be complex to query and analyze, which is why we built TimescaleDB, a petabyte-scale, relational database for <a href="https://www.tigerdata.com/blog/time-series-introduction" rel="noreferrer">time series</a>.</p><p>Irregularly sampled time-series data just adds another level of complexity – and is more common than you may think. 
For example, irregularly sampled data, and thus the need for time-weighted averages, frequently occurs in:</p><ul><li><strong>Industrial IoT</strong>, where teams “compress” data by only sending points when the value changes</li><li><strong>Remote sensing</strong>, where sending data back from the edge can be costly, so you only send high-frequency data for the most critical operations</li><li><strong>Trigger-based systems</strong>, where the sampling rate of one sensor is affected by the reading of another (e.g., a security system that sends data more frequently when a motion sensor is triggered)</li><li>...and many, many more</li></ul><p>At Timescale, we’re always looking for ways to make developers’ lives easier, especially when they’re working with time-series data. To this end, <a href="https://timescale.ghost.io/blog/blog/introducing-hyperfunctions-new-sql-functions-to-simplify-working-with-time-series-data-in-postgresql/">we introduced hyperfunctions</a>, new SQL functions that simplify working with time-series data in PostgreSQL. <strong>One of these hyperfunctions enables you to </strong><a href="https://docs.timescale.com/api/latest/hyperfunctions/time-weighted-averages/#time-weighted-average-functions"><strong>compute time-weighted averages</strong></a><strong> quickly and efficiently</strong>, so you gain hours of productivity.</p><p>Read on for examples of time-weighted averages, how they’re calculated, how to use the time-weighted averages hyperfunctions in TimescaleDB, and some ideas for how you can use them to get a productivity boost for your projects, no matter the domain.</p><p><strong>If you’d like to get started with the <code>time_weight</code> hyperfunction (and many more) right away, spin up a fully managed Timescale service</strong>: create an account to <a href="https://console.cloud.timescale.com/signup">try it for free</a> for 30 days. 
Hyperfunctions are pre-loaded on each new database service on Timescale, so after you create a new service, you’re all set to use them.</p><p><strong>If you prefer to manage your own database instances, you can </strong><a href="https://github.com/timescale/timescaledb-toolkit"><strong>download and install the timescaledb_toolkit extension</strong></a> on GitHub, after which you’ll be able to use <code>time_weight</code> and other hyperfunctions.</p><p>Finally, we love building in public and continually improving:</p><ul><li>If you have questions or comments on this blog post, <a href="https://github.com/timescale/timescaledb-toolkit/discussions/185">we’ve started a discussion on our GitHub page, and we’d love to hear from you</a>. (And, if you like what you see, GitHub ⭐ are always welcome and appreciated too!)</li><li>You can view our <a href="https://github.com/timescale/timescaledb-toolkit">upcoming roadmap on GitHub</a> for a list of proposed features, as well as features we’re currently implementing and those that are available to use today.</li></ul><h2 id="what-are-time-weighted-averages">What are time-weighted averages?</h2>
<p>I’ve been a developer at Timescale for over 3 years and worked in databases for about 5 years, but I was an electrochemist before that. As an electrochemist, I worked for a battery manufacturer and saw a lot of charts like these:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/08/image.png" class="kg-image" alt="Battery discharge curve showing cell voltage on the y-axis and capacity in amp-hours on the x-axis. The curve starts high, decreases relatively rapidly through the exponential zone, then stays relatively constant for a long period through the nominal zone, after which the voltage drops quite rapidly as it reaches its fully discharged state. " loading="lazy" width="744" height="426" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2021/08/image.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/08/image.png 744w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Example battery discharge curve, which describes how long a battery can power something. (Also a prime example of where time-weighted averages are 💯 necessary) Derived from </em></i><a href="https://www.nrel.gov/docs/fy17osti/67809.pdf"><i><em class="italic" style="white-space: pre-wrap;">https://www.nrel.gov/docs/fy17osti/67809.pdf</em></i></a></figcaption></figure><p>That’s a battery discharge curve, which describes how long a battery can power something. The x-axis shows capacity in Amp-hours, and since this is a constant current discharge, the x-axis is really just a proxy for time. 
The y-axis displays voltage, which determines the battery’s power output; as you continue to discharge the battery, the voltage drops until it gets to a point where it needs to be recharged.</p><p>When we’d do R&amp;D for new battery formulations, we’d cycle many batteries many times to figure out which formulations make batteries last the longest. </p><p>If you look more closely at the discharge curve, you’ll notice that there are only two “interesting” sections:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/08/image-1.png" class="kg-image" alt="The same battery discharge curve as in the previous image but with the “interesting bits” circled, namely where the voltage decreases rapidly at the beginning and the end of the discharge curve. " loading="lazy" width="757" height="447" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2021/08/image-1.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/08/image-1.png 757w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Example battery discharge curve, calling out the “interesting bits” (the points in time where data changes rapidly)</em></i></figcaption></figure><p>These are the parts at the beginning and end of the discharge where the voltage changes rapidly. Between these two sections, there’s that long period in the middle, where the voltage hardly changes at all:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/08/image-2.png" class="kg-image" alt="The same battery discharge curve again, except now the “boring” part of the curve is highlighted, which is the middle section where the voltage hardly changes." 
loading="lazy" width="749" height="449" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2021/08/image-2.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/08/image-2.png 749w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Example battery discharge curve, calling out the “boring bits” (the points in time where the data remains fairly constant)</em></i></figcaption></figure><p>Now, when I said before that I was an electrochemist, I will admit that I was exaggerating a little bit. I knew enough about electrochemistry to be dangerous, but I worked with folks with PhDs who knew <em>a lot</em> more than I did. </p><p>But, I was often better than them at working with data, so I’d do things like programming the <a href="https://en.wikipedia.org/wiki/Potentiostat">potentiostat</a>, the piece of equipment you hook the battery up to in order to perform these tests. </p><p>For the interesting parts of the discharge cycle (those parts at the start and end), we could have the potentiostat sample at its max rate, usually a point every 10 milliseconds or so. We didn’t want to sample as many data points during the long, boring parts where the voltage didn’t change because it would mean saving lots of data with unchanging values and wasting storage.</p><p>To reduce the boring data we’d have to deal with without losing the interesting bits, we’d set up the program to sample every 3 minutes, or when the voltage changed by a reasonable amount, say more than 5 mV. </p><p>In practice, what would happen is something like this:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/08/image-3.png" class="kg-image" alt="The same battery discharge curve again, this time with data points superimposed on the image. 
The data points are spaced close together in the “interesting bits,” where the voltage changes quickly at the beginning and end of the discharge curve. The data points are spaced further apart during the “boring” part in the middle, where the voltage hardly changes at all. " loading="lazy" width="755" height="403" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2021/08/image-3.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/08/image-3.png 755w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Example battery discharge curve with data points superimposed to depict rapid sampling during the interesting bits and slower sampling during the boring bits.</em></i></figcaption></figure><p>By sampling the data in this way, we'd get more data during the interesting parts and less data during the boring middle section. That’s great!</p><p>It let us answer more interesting questions about the quickly changing parts of the curve and gave us all the information we needed about the slowly changing sections – without storing gobs of redundant data.<strong> But, here’s a question: given this dataset, how do we find the average voltage during the discharge?</strong></p><p>That question is important because it was one of the things we could compare between this discharge curve and future ones, say 10 or 100 cycles later. As a battery ages, its average voltage drops, and how much it dropped over time could tell us how well the battery’s storage capacity held up during its lifecycle – and if it could turn into a useful product. 
</p><p>The problem is that the data in the interesting bits are sampled more frequently (i.e., there are more data points for the interesting bits), which would give it more weight when calculating the average, even though it shouldn't.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/08/image-4.png" class="kg-image" alt="The same battery curve again, with the same data points superimposed and the “interesting bits” circled again, however this time noting that the “interesting bits” shouldn’t count extra even though there are more data points included in the circled area." loading="lazy" width="754" height="449" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2021/08/image-4.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/08/image-4.png 754w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Example battery discharge curve, with illustrative data points to show that while we collect more data during the interesting bits, they shouldn’t count “extra.”</em></i></figcaption></figure><p>If we just took a naive average over the whole curve, adding the value at each point and dividing by the number of points, it would mean that a change to our sampling rate could change our calculated average...even though the underlying effect was really the same! </p><p>We could easily overlook any of the differences we were trying to identify – and any clues about how we could improve the batteries could just get lost in the variation of our sampling protocol. </p><p>Now, some people will say: well, why not just sample at max rate of the potentiostat, even during the boring parts? Well, these discharge tests ran <em>really</em> long. 
They’d take 10 to 12 hours to complete, but the interesting bits could be pretty short, from seconds to minutes. If we sampled at the highest rate, one point every 10 ms or so, it would mean orders of magnitude more data to store even though we would hardly use any of it! And orders of magnitude more data would mean more cost, more time for analysis, and all sorts of other problems.</p><p>So the big question is: <strong>how do we get a representative average when we’re working with irregularly spaced data points?</strong></p><p>Let’s get theoretical for a moment here:</p><p>(This next bit is a little equation-heavy, but I think they’re <em>relatively</em> simple equations, and they map very well onto their graphical representation. I always like it when folks give me the math and graphical intuition behind the calculations – but if you want to skip ahead to just see how time-weighted average is used, the mathy bits end <a href="https://timescale.ghost.io/blog/blog/what-time-weighted-averages-are-and-why-you-should-care/#how-to-compute-time-weighted-averages-in-sql">here</a>.)</p><h2 id="mathy-bits-how-to-derive-a-time-weighted-average">Mathy Bits: How to derive a time-weighted average</h2>
<p>Let’s say we have some points like this:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/pasted-image-0-11.png" class="kg-image" alt="A graph showing value on the y-axis and time on the x-axis. There are four points:  open parens t 1 comma v 1 close parens to open parens t 4 comma  v 4 close parens spaced unevenly in time on the graph. " loading="lazy" width="1380" height="1144" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/01/pasted-image-0-11.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/01/pasted-image-0-11.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/pasted-image-0-11.png 1380w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">A theoretical, irregularly sampled time-series dataset</em></i></figcaption></figure><p>Then, the normal average would be the sum of the values, divided by the total number of points:</p><p>\begin{equation} avg = \frac{(v_1 + v_2 + v_3 + v_4)}{4}  \end{equation}</p>
<p>But, because they’re irregularly spaced, we need some way to account for that. </p><p>One way to think about it would be to get a value at every point in time, and then divide it by the total amount of time. This would be like getting the total area under the curve and dividing by the total amount of time ΔT.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/pasted-image-0--3-.png" class="kg-image" alt="The same graph as above but with the area under the curve shaded in gray. The area under the curve is drawn by drawing a line through each pair of points and then shading down to the x-axis. The total time spanned by the points from t 1 to t 4 is denoted as Delta T." loading="lazy" width="1348" height="1122" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/01/pasted-image-0--3-.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/01/pasted-image-0--3-.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/pasted-image-0--3-.png 1348w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">The area under an irregularly sampled time-series dataset</em></i></figcaption></figure><p>\begin{equation} better\_avg = \frac{area\_under\_curve}{\Delta T}  \end{equation}</p>
<p>(In this case, we’re doing a linear interpolation between the points). So, let’s focus on finding that area. The area between the first two points is a trapezoid:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/pasted-image-0--1--6.png" class="kg-image" alt="The same graph as above, except there is a trapezoid shaded in blue bounded on top by the line connecting the first two points and vertical lines connecting the points to the x-axis. The distance between the two points on the x-axis is denoted delta t 1." loading="lazy" width="1182" height="1008" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/01/pasted-image-0--1--6.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/01/pasted-image-0--1--6.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/pasted-image-0--1--6.png 1182w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">A trapezoid representing the area under the first two points</em></i></figcaption></figure><p>Which is really a rectangle plus a triangle:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/pasted-image-0--4-.png" class="kg-image" alt="The same graph as the previous, except now the trapezoid, has been divided into a rectangle and a triangle. The rectangle is the height of the first point v 1. The triangle is a right triangle with the line connecting the first two points as the hypotenuse. The distance on the y-axis between the first two points is denoted as delta v 1. 
" loading="lazy" width="1180" height="972" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/01/pasted-image-0--4-.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/01/pasted-image-0--4-.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/pasted-image-0--4-.png 1180w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">That same trapezoid broken down into a rectangle and a triangle.</em></i></figcaption></figure><p>Okay, let’s calculate that area:</p><p>\begin{equation} area = \Delta t_1 v_1 + \frac{\Delta t_1 \Delta v_1}{2}  \end{equation}</p>
<p>So just to be clear, that’s:</p><math xmlns="http://www.w3.org/1998/Math/MathML" display="block">
  <semantics>
    <mrow>
      <mi>a</mi>
      <mi>r</mi>
      <mi>e</mi>
      <mi>a</mi>
      <mo>=</mo>
      <munder>
        <mrow data-mjx-texclass="OP">
          <munder>
            <mrow>
              <mi mathvariant="normal">Δ</mi>
              <msub>
                <mi>t</mi>
                <mn>1</mn>
              </msub>
              <msub>
                <mi>v</mi>
                <mn>1</mn>
              </msub>
            </mrow>
            <mo>⏟</mo>
          </munder>
        </mrow>
        <mtext>area of rectangle</mtext>
      </munder>
      <mo>+</mo>
      <munder>
        <mrow data-mjx-texclass="OP">
          <munder>
            <mfrac>
              <mrow>
                <mi mathvariant="normal">Δ</mi>
                <msub>
                  <mi>t</mi>
                  <mn>1</mn>
                </msub>
                <mi mathvariant="normal">Δ</mi>
                <msub>
                  <mi>v</mi>
                  <mn>1</mn>
                </msub>
              </mrow>
              <mn>2</mn>
            </mfrac>
            <mo>⏟</mo>
          </munder>
        </mrow>
        <mtext>area of triangle</mtext>
      </munder>
    </mrow>
    <annotation encoding="application/x-tex">\begin{equation} area = \underbrace{\Delta t_1 v_1}_\text{area of rectangle} + \underbrace{\frac{\Delta t_1 \Delta v_1}{2}}_\text{area of triangle}  \end{equation}</annotation>
  </semantics>
</math><p>Okay. So now if we notice that:</p><p>\begin{equation} \Delta v_1 = v_2 - v_1      \end{equation}</p>
<p>We can simplify this equation pretty nicely:</p><p>Start with:<br>
\begin{equation} \Delta t_1 v_1 + \frac{\Delta t_1 (v_2 - v_1)}{2}  \end{equation}</p>
<p>Factor out: \begin{equation}(\frac{\Delta t_1}{2} ) \end{equation}</p>
<p>\begin{equation}  \frac{\Delta t_1}{2} (2v_1 + (v_2 - v_1))    \end{equation}</p>
<p>Simplify:<br>
\begin{equation}  \frac{\Delta t_1}{2} (v_1 + v_2)    \end{equation}</p>
<p>One cool thing to note is that this gives us a new way to think about this solution: it’s the average of each pair of adjacent values, weighted by the time between them:</p><p>\begin{equation}  area = \underbrace{\frac{(v_1 + v_2)}{2}}_{\text{average of    } v_1 \text{ &amp; } v_2} \Delta t_1   \end{equation}</p>
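<p>As a quick sanity check with made-up values (hypothetical, not from any real measurement), take v1 = 4, v2 = 6, and Δt1 = 10, so that Δv1 = 2. The rectangle-plus-triangle form and the averaged form should give the same area:</p><p>\begin{equation} area = 10 \cdot 4 + \frac{10 \cdot 2}{2} = 40 + 10 = 50 = \frac{(4 + 6)}{2} \cdot 10 \end{equation}</p>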
<p>It’s also equal to the area of the rectangle drawn to the midpoint between v1 and v2:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/pasted-image-0--2--2.png" class="kg-image" alt="The same graph as the previous, except that now there is a rectangle imposed on the trapezoid. The rectangle is the same width as the others and goes to a height of v 1 plus v 2 over 2. " loading="lazy" width="1198" height="948" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/01/pasted-image-0--2--2.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/01/pasted-image-0--2--2.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/pasted-image-0--2--2.png 1198w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">The area of the trapezoid and of the rectangle, drawn to the midpoint between the two points, is the same.</em></i></figcaption></figure><p>Now that we’ve derived the formula for two adjacent points, we can repeat this for every pair of adjacent points in the dataset. Then all we need to do is sum that up, and that will be the time-weighted sum, which is equal to the area under the curve. (Folks who have studied calculus may actually remember some of this from when they were learning about integrals and integral approximations!) </p><p>With the total area under the curve calculated,  all we have to do is divide the time-weighted sum by the overall  ΔT and we have our time-weighted average. 💥</p><p>Now that we've worked through our time-weighted average in theory, let’s test it out in SQL.</p><h2 id="how-to-compute-time-weighted-averages-in-sql">How to compute time-weighted averages in SQL</h2>
<p>Let’s consider the scenario of an ice cream manufacturer or shop owner who is monitoring their freezers. It turns out that ice cream needs to stay in a relatively narrow range of temperatures (~0-10℉)<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup> so that it doesn’t melt and re-freeze, causing those weird crystals that no one likes. Similarly, if ice cream gets too cold, it’s too hard to scoop.</p>
<p>The air temperature in the freezer will vary a bit more dramatically as folks open and close the door, but the ice cream temperature takes longer to change. Thus, problems (melting, pesky ice crystals) will only happen if it's exposed to extreme temperatures for a prolonged period. By measuring this data, the ice cream manufacturer can impose quality controls on each batch of product they’re storing in the freezer.</p>
<p>Taking this into account, the sensors in the freezer measure temperature in the following way: when the door is closed and we’re in the optimal range, the sensors take a measurement every 5 minutes; when the door is opened, the sensors take a measurement every 30 seconds until the door is closed, and the temperature has returned below 10℉.</p>
<p>To model that we might have a simple table like this:</p>
<pre><code class="language-SQL">CREATE TABLE freezer_temps (
	freezer_id int,
	ts timestamptz,
	temperature float);
</code></pre>
<p>And some data like this:</p>
<pre><code class="language-SQL">INSERT INTO freezer_temps VALUES 
( 1, '2020-01-01 00:00:00+00', 4.0), 
( 1, '2020-01-01 00:05:00+00', 5.5), 
( 1, '2020-01-01 00:10:00+00', 3.0), 
( 1, '2020-01-01 00:15:00+00', 4.0), 
( 1, '2020-01-01 00:20:00+00', 3.5), 
( 1, '2020-01-01 00:25:00+00', 8.0), 
( 1, '2020-01-01 00:30:00+00', 9.0), 
( 1, '2020-01-01 00:31:00+00', 10.5), -- door opened!
( 1, '2020-01-01 00:31:30+00', 11.0), 
( 1, '2020-01-01 00:32:00+00', 15.0), 
( 1, '2020-01-01 00:32:30+00', 20.0), -- door closed
( 1, '2020-01-01 00:33:00+00', 18.5), 
( 1, '2020-01-01 00:33:30+00', 17.0), 
( 1, '2020-01-01 00:34:00+00', 15.5), 
( 1, '2020-01-01 00:34:30+00', 14.0), 
( 1, '2020-01-01 00:35:00+00', 12.5), 
( 1, '2020-01-01 00:35:30+00', 11.0), 
( 1, '2020-01-01 00:36:00+00', 10.0), -- temperature stabilized
( 1, '2020-01-01 00:40:00+00', 7.0),
( 1, '2020-01-01 00:45:00+00', 5.0);
</code></pre>
<p>The period after the door opens, minutes 31-36, has a lot more data points. If we were to take the average of all the points, we would get a misleading value. The freezer was only above the threshold temperature for 5 out of 45 minutes (11% of the time period), but those minutes make up 10 out of 20 data points (50%!) because we sample freezer temperature more frequently after the door is opened.</p>
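<p>If you’d like to check that skew yourself, here is a minimal sketch in plain PostgreSQL. It assumes the <code>freezer_temps</code> table and sample rows above; the column aliases are just illustrative names. It counts how many readings sit above the 10℉ threshold and what share of all readings that is:</p>
<pre><code class="language-SQL">-- count readings above the 10℉ threshold versus all readings
SELECT freezer_id,
    count(*) FILTER (WHERE temperature &gt; 10) AS points_above,
    count(*) AS total_points,
    round(100.0 * count(*) FILTER (WHERE temperature &gt; 10) / count(*), 1) AS pct_of_points
FROM freezer_temps
GROUP BY freezer_id;
</code></pre>
<p>With the sample data, this should report 10 of 20 points (50.0%) above the threshold, even though those points cover only about 5 of the 45 minutes.</p>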
<p>To find the more accurate, time-weighted average temperature, let’s write the SQL for the formula above that handles that case. We’ll also get the normal average just for comparison’s sake. (Don’t worry if you have trouble reading it; we’ll write a much simpler version later.)</p>
<pre><code class="language-SQL">WITH setup AS (
    -- pair each reading with the previous reading from the same freezer
    SELECT lag(temperature) OVER (PARTITION BY freezer_id ORDER BY ts) AS prev_temp,
        extract('epoch' FROM ts) AS ts_e,
        extract('epoch' FROM lag(ts) OVER (PARTITION BY freezer_id ORDER BY ts)) AS prev_ts_e,
        *
    FROM freezer_temps),
nextstep AS (
    -- trapezoid area: average of the two temperatures times the seconds between them
    SELECT CASE WHEN prev_temp IS NULL THEN NULL
        ELSE (prev_temp + temperature) / 2 * (ts_e - prev_ts_e) END AS weighted_sum,
        *
    FROM setup)
SELECT freezer_id,
    avg(temperature), -- the regular average
    sum(weighted_sum) / (max(ts_e) - min(ts_e)) AS time_weighted_average -- our derived average
FROM nextstep
GROUP BY freezer_id;
</code></pre>
<pre><code class="language-SQL"> freezer_id |  avg  | time_weighted_average 
------------+-------+-----------------------
          1 | 10.2  |     6.636111111111111
</code></pre>
<p>It does return what we want, and gives us a much better picture of what happened, but it’s not exactly fun to write, is it?</p>
<p>We’ve got a few window functions in there, some case statements to deal with nulls, and several CTEs to try to make it reasonably clear what’s going on. <strong>This is the kind of thing that can really lead to code maintenance issues when people try to figure out what’s going on and tweak it.</strong></p>
<p>Code is all about managing complexity. A long, complex query to accomplish a relatively simple task makes it much less likely that the developer who comes along next (i.e., you in 3 months) will understand what’s going on, how to use it, or how to change it if they (or you!) need a different result. Or, worse, it means that the code will never get changed because people don’t quite understand what the query’s doing, and it just becomes a black box that no one wants to touch (including you).</p>
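<p>If you do end up debugging a hand-rolled query like that one, it can help to look at the individual trapezoids before they’re summed. Here’s a rough sketch, again in plain PostgreSQL against the same <code>freezer_temps</code> table (the window name <code>w</code> and the column aliases are just illustrative), that lists each interval’s duration, pair average, and area:</p>
<pre><code class="language-SQL">-- one row per reading; the first row per freezer has no previous point, so its derived columns are NULL
SELECT freezer_id,
    ts,
    temperature,
    extract('epoch' FROM ts - lag(ts) OVER w) AS seconds_since_prev,
    (lag(temperature) OVER w + temperature) / 2 AS pair_average,
    (lag(temperature) OVER w + temperature) / 2
        * extract('epoch' FROM ts - lag(ts) OVER w) AS area
FROM freezer_temps
WINDOW w AS (PARTITION BY freezer_id ORDER BY ts)
ORDER BY freezer_id, ts;
</code></pre>
<p>For the sample data, the first interval (4.0 to 5.5 over 300 seconds) contributes 4.75 × 300 = 1425 to the time-weighted sum. Adding up the <code>area</code> column and dividing by the total number of seconds gives the time-weighted average.</p>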
<h2 id="timescaledb-hyperfunctions-to-the-rescue">TimescaleDB hyperfunctions to the rescue!</h2>
<p>This is why we created <strong><a href="https://docs.timescale.com/api/latest/hyperfunctions/">hyperfunctions</a></strong>, to make complicated time-series data analysis less complex. Let’s look at what the time-weighted average freezer temperature query looks like if we use the <a href="https://docs.timescale.com/api/latest/hyperfunctions/time-weighted-averages/">hyperfunctions for computing time-weighted averages</a>:</p>
<pre><code class="language-SQL">SELECT freezer_id, 
	avg(temperature), 
	average(time_weight('Linear', ts, temperature)) as time_weighted_average 
FROM freezer_temps
GROUP BY freezer_id;
</code></pre>
<pre><code class="language-SQL"> freezer_id |  avg  | time_weighted_average 
------------+-------+-----------------------
          1 | 10.2  |     6.636111111111111
</code></pre>
<p>Isn’t that so much more concise?! Calculate a <a href="https://docs.timescale.com/api/latest/hyperfunctions/time-weighted-averages/time_weight/"><code>time_weight</code></a> with a <code>'Linear'</code> weighting method (that’s the kind of weighting derived above <sup class="footnote-ref"><a href="#fn2" id="fnref2">[2]</a></sup>), then take the average of the weighted values, and we’re done. I like that API much better (and I’d better, because I designed it!).</p>
<p>What’s more, not only do we save ourselves from writing all that SQL, but it also becomes far, far easier to <strong>compose</strong> (build up more complex analyses over top of the time-weighted average). This is a huge part of the design philosophy behind hyperfunctions; we want to make fundamental things simple so that you can easily use them to build more complex, application-specific analyses.</p>
<p>Let’s imagine we’re not satisfied with the average over our entire dataset, and we want to get the time-weighted average for every 10-minute bucket:</p>
<pre><code class="language-SQL">SELECT time_bucket('10 mins'::interval, ts) as bucket, 
	freezer_id, 
	avg(temperature), 
	average(time_weight('Linear', ts, temperature)) as time_weighted_average 
FROM freezer_temps
GROUP BY bucket, freezer_id;
</code></pre>
<p>We added a <a href="https://docs.timescale.com/api/latest/hyperfunctions/time_bucket/"><code>time_bucket</code></a>, grouped by it, and done! Let’s look at some other kinds of sophisticated analysis that hyperfunctions enable.</p>
<p>Continuing with our ice cream example, let’s say that we’ve set our threshold because we know that if the ice cream spends more than 15 minutes above 15 ℉, it’ll develop those ice crystals that make it all sandy/grainy tasting. We can use the time-weighted average in a <a href="https://www.postgresql.org/docs/current/functions-window.html">window function</a> to see if that happened:</p>
<pre><code class="language-SQL">SELECT *, 
    average(time_weight('Linear', ts, temperature) OVER fifteen_min) as rolling_twa
FROM freezer_temps
WINDOW fifteen_min AS 
    (PARTITION BY freezer_id ORDER BY ts RANGE  '15 minutes'::interval PRECEDING)
ORDER BY freezer_id, ts;
</code></pre>
<pre><code class="language-SQL">
 freezer_id |           ts           | temperature |    rolling_twa     
------------+------------------------+-------------+--------------------
          1 | 2020-01-01 00:00:00+00 |           4 |                   
          1 | 2020-01-01 00:05:00+00 |         5.5 |               4.75
          1 | 2020-01-01 00:10:00+00 |           3 |                4.5
          1 | 2020-01-01 00:15:00+00 |           4 |  4.166666666666667
          1 | 2020-01-01 00:20:00+00 |         3.5 | 3.8333333333333335
          1 | 2020-01-01 00:25:00+00 |           8 |  4.333333333333333
          1 | 2020-01-01 00:30:00+00 |           9 |                  6
          1 | 2020-01-01 00:31:00+00 |        10.5 |  7.363636363636363
          1 | 2020-01-01 00:31:30+00 |          11 |  7.510869565217392
          1 | 2020-01-01 00:32:00+00 |          15 |  7.739583333333333
          1 | 2020-01-01 00:32:30+00 |          20 |               8.13
          1 | 2020-01-01 00:33:00+00 |        18.5 |  8.557692307692308
          1 | 2020-01-01 00:33:30+00 |          17 |  8.898148148148149
          1 | 2020-01-01 00:34:00+00 |        15.5 |  9.160714285714286
          1 | 2020-01-01 00:34:30+00 |          14 |   9.35344827586207
          1 | 2020-01-01 00:35:00+00 |        12.5 |  9.483333333333333
          1 | 2020-01-01 00:35:30+00 |          11 | 11.369047619047619
          1 | 2020-01-01 00:36:00+00 |          10 | 11.329545454545455
          1 | 2020-01-01 00:40:00+00 |           7 |             10.575
          1 | 2020-01-01 00:45:00+00 |           5 |  9.741666666666667
</code></pre>
<p>The window here is over the previous 15 minutes, ordered by time. And it looks like we stayed below our ice-crystallization temperature!</p>
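<p>Rather than eyeballing the output, you could also filter for violations directly. This is a sketch under the same assumptions as the query above (the <code>freezer_temps</code> table, the toolkit’s <code>time_weight</code> and <code>average</code> functions, and the 15 ℉ threshold from our example); the subquery alias <code>r</code> is just an illustrative name:</p>
<pre><code class="language-SQL">-- keep only the rows where the rolling 15-minute time-weighted average exceeds the threshold
SELECT *
FROM (
    SELECT freezer_id,
        ts,
        average(time_weight('Linear', ts, temperature) OVER fifteen_min) AS rolling_twa
    FROM freezer_temps
    WINDOW fifteen_min AS
        (PARTITION BY freezer_id ORDER BY ts RANGE '15 minutes'::interval PRECEDING)
) r
WHERE rolling_twa &gt; 15
ORDER BY freezer_id, ts;
</code></pre>
<p>For the sample data, this should return no rows, since the highest rolling value in the table above is about 11.37.</p>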
<p>We also provide a special <a href="https://docs.timescale.com/api/latest/hyperfunctions/time-weighted-averages/rollup-timeweight/"><code>rollup</code></a> function so you can re-aggregate time-weighted values from subqueries. For instance:</p>
<pre><code class="language-SQL">SELECT average(rollup(time_weight)) as time_weighted_average 
FROM (SELECT time_bucket('10 mins'::interval, ts) as bucket, 
		freezer_id, 
		time_weight('Linear', ts, temperature)
	FROM freezer_temps
	GROUP BY bucket, freezer_id) t;
</code></pre>
<pre><code class="language-SQL">time_weighted_average 
-----------------------
    6.636111111111111
</code></pre>
<p>This gives us the same result as the first query, which computed the time-weighted average over the whole dataset, because we’re just re-aggregating the bucketed values.</p>
<p>But this is mainly there so that you can do more interesting analysis, like, say, normalizing each ten-minute time-weighted average by freezer to the overall time-weighted average.</p>
<pre><code class="language-SQL">WITH t as (SELECT time_bucket('10 mins'::interval, ts) as bucket, 
		freezer_id, 
		time_weight('Linear', ts, temperature)
	FROM freezer_temps
	GROUP BY bucket, freezer_id) 
SELECT bucket, 
	freezer_id, 
	average(time_weight) as bucketed_twa,  
	(SELECT average(rollup(time_weight)) FROM t) as overall_twa, 
	average(time_weight) / (SELECT average(rollup(time_weight)) FROM t) as normalized_twa
FROM t; 
</code></pre>
<p>This kind of feature (storing the time-weight for analysis later) is most useful in a <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/continuous-aggregates/">continuous aggregate</a>, and it just so happens that we’ve designed our time-weighted average to be usable in that context!</p>
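<p>As a rough sketch of what that could look like (not a tuned setup): the view name <code>freezer_twa_10min</code> and the column alias <code>tw</code> below are made up for illustration, and the example assumes you’re happy to convert <code>freezer_temps</code> into a hypertable, which continuous aggregates require:</p>
<pre><code class="language-SQL">-- continuous aggregates are built on hypertables, so convert the table first
SELECT create_hypertable('freezer_temps', 'ts', migrate_data =&gt; true);

-- store the partial time_weight per 10-minute bucket instead of the final average
CREATE MATERIALIZED VIEW freezer_twa_10min
WITH (timescaledb.continuous) AS
SELECT time_bucket('10 mins'::interval, ts) AS bucket,
    freezer_id,
    time_weight('Linear', ts, temperature) AS tw
FROM freezer_temps
GROUP BY bucket, freezer_id;

-- later: re-aggregate the stored values into hourly time-weighted averages
SELECT time_bucket('1 hour'::interval, bucket) AS hour,
    freezer_id,
    average(rollup(tw)) AS hourly_twa
FROM freezer_twa_10min
GROUP BY hour, freezer_id;
</code></pre>
<p>The design choice here is to store the intermediate <code>time_weight</code> value rather than the finished average, so coarser buckets can be computed later with <code>rollup</code> instead of rescanning the raw data.</p>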
<p>We’ll be going into more detail on that in a future post, so be sure to <a href="https://www.timescale.com/signup/newsletter">subscribe to our newsletter</a> so you can get notified when we publish new technical content.</p>
<h2 id="try-time-weighted-averages-today">Try time-weighted averages today</h2>
<p><strong>If you’d like to get started with the time_weight hyperfunction - and many more - right away, spin up a fully managed TimescaleDB service:</strong> create an account to <a href="http://console.cloud.timescale.com/signup">try it for free</a> for 30 days. Hyperfunctions are pre-loaded on each new database service on Timescale, so after you create a new service, you’re all set to use them!</p>
<p><strong>If you prefer to manage your own database instances, you can <a href="https://github.com/timescale/timescaledb-toolkit">download and install the timescaledb_toolkit extension</a></strong> from GitHub, after which you’ll be able to use time_weight and all other hyperfunctions.</p>
<p><strong>If you have questions or comments on this blog post,</strong> <a href="https://github.com/timescale/timescaledb-toolkit/discussions/185">we’ve started a discussion on our GitHub page, and we’d love to hear from you</a>. (And, if you like what you see, GitHub ⭐ are always welcome and appreciated too!)</p>
<ul>
<li>We love building in public, and you can view our <a href="https://github.com/timescale/timescaledb-toolkit">upcoming roadmap on GitHub</a> for a list of proposed features, features we’re currently implementing, and features available to use today.</li>
</ul>
<p>We’d like to give a special thanks to <a href="https://github.com/inselbuch">@inselbuch</a>, who <a href="https://github.com/timescale/timescaledb-toolkit/issues/46">submitted the GitHub issue</a> that got us started on this project (as well as the other folks who 👍’d it and let us know they wanted to use it.)</p>
<p>We believe time-series data is everywhere, and making sense of it is crucial for all manner of technical problems. We built hyperfunctions to make it easier for developers to harness the power of time-series data. We’re always looking for feedback on what to build next and would love to know how you’re using hyperfunctions, problems you want to solve, or things you think should - or could - be simplified to make analyzing time-series data in SQL that much better. (To contribute feedback, comment on an <a href="https://github.com/timescale/timescaledb-toolkit/issues">open issue</a> or in a <a href="https://github.com/timescale/timescaledb-toolkit/discussions">discussion thread</a> in GitHub.)</p>
<p>Lastly, in future posts, we’ll give some more context around our design philosophy, explain the decisions we’ve made around our APIs for time-weighted averages (and other features), and detail how other hyperfunctions work. So, if that’s your bag, you’re in luck – but you’ll have to wait a week or two.</p>
<hr class="footnotes-sep">
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p>I don’t know that these times or temperatures are accurate per se; however, the phenomenon of ice cream partially melting and refreezing causing larger ice crystals to form - and coarsening the ice cream as a result - is well documented. See, for instance, <a href="https://peoplegetreadybooks.com/?q=h.tviewer&amp;using_sb=status&amp;qsb=keyword&amp;qse=OqerFF92q0vIs_NOprdwmw">Harold McGee’s On Food And Cooking</a> (p 44 in the 2004 revised edition). So, just in case you are looking for advice on storing your ice cream from a blog about time-series databases: for longer-term storage, you would likely want the ice cream to be stored below 0℉. Our example is more like a scenario you’d see in an ice cream display (e.g., in an ice cream parlor or factory line) since the ice cream is kept between 0-10℉ (ideal for scooping, because lower temperatures make ice cream too hard to scoop). <a href="#fnref1" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn2" class="footnote-item"><p>We also offer <code>'LOCF'</code>, or last observation carried forward, weighting, which is best suited to cases where you record data points whenever the value changes (i.e., the old value is valid until you get a new one).</p>
<p>The derivation for that is similar, except the rectangles have the height of the first value, rather than the linear weighting we’ve discussed in this post (i.e., where we do linear interpolation between adjacent data points):</p>
<p><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/07/LOCF-Weighting.png" alt="LOCF-Weighting" loading="lazy"><br>
Rather than:<br>
<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/07/Linear-Weighting.png" alt="Linear-Weighting" loading="lazy"></p>
<p>In general, linear weighting is appropriate for cases where the sampling rate is variable, but there are no guarantees provided by the system about only providing data when it changes. LOCF works best when there’s some guarantee that your system will provide data only when it changes, and you can accurately carry the old value until you receive a new one. <a href="#fnref2" class="footnote-backref">↩︎</a></p>
</li>
</ol>
</section>
]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Introducing Hyperfunctions: New SQL Functions to Simplify Working With Time-Series Data in PostgreSQL]]></title>
            <description><![CDATA[TimescaleDB hyperfunctions are pre-built functions for the most common and difficult queries that developers write today in TimescaleDB and PostgreSQL. Hyperfunctions help developers measure what matters in time-series data, which generates massive, ever-growing streams of information.]]></description>
            <link>https://www.tigerdata.com/blog/introducing-hyperfunctions-new-sql-functions-to-simplify-working-with-time-series-data-in-postgresql</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/introducing-hyperfunctions-new-sql-functions-to-simplify-working-with-time-series-data-in-postgresql</guid>
            <category><![CDATA[Announcements & Releases]]></category>
            <category><![CDATA[Engineering]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Joshua Lockerman]]></dc:creator>
            <pubDate>Tue, 13 Jul 2021 13:02:11 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/07/ryan-stone-OlxJVn9fxz4-unsplash.jpg">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/07/ryan-stone-OlxJVn9fxz4-unsplash.jpg" alt="Introducing Hyperfunctions: New SQL Functions to Simplify Working With Time-Series Data in PostgreSQL" /><p></p><p>Today, we’re excited to launch <strong>TimescaleDB hyperfunctions</strong>, a series of SQL functions within TimescaleDB that make it easier to manipulate and analyze time-series data in PostgreSQL with fewer lines of code. You can use hyperfunctions to calculate percentile approximations of data, compute time-weighted averages, downsample and smooth data, and perform faster <code>COUNT DISTINCT</code> queries using approximations. Moreover, hyperfunctions are “easy” to use: you call a hyperfunction using the same SQL syntax you know and love.</p>
<p>At Timescale, our mission is to <a href="https://www.timescale.com/products" rel="noreferrer">enable every software developer to store, analyze, and build on top of their time-series data</a> so that they can measure what matters in their world: IoT devices, IT systems, marketing analytics, user behavior, financial metrics, and more.</p><p>We made the decision early in the design of TimescaleDB to build on top of PostgreSQL. We believed then, as we do now, that building on the <a href="https://db-engines.com/en/blog_post/85">world’s fastest-growing database</a> would have numerous benefits for our customers. Perhaps the biggest of these advantages is in developer productivity. Developers can use the tools and frameworks they know and love and bring all their skills and expertise with SQL with them.</p><p>SQL is a powerful language and we believe that by adding a specialized set of functions for time-series analysis, we can make it even better.</p><p>Today, there are nearly three million active TimescaleDB databases running mission-critical time-series workloads across industries. Time-series data comes at you fast, sometimes generating millions of data points per second. In order to measure everything that matters, you need to capture all of the data you possibly can. Because of the volume and rate of information, time-series data can be complex to query and analyze. </p><p>As we interviewed customers and learned how they analyze and manipulate time-series data, we noticed several common queries begin to take shape. Often, these queries were difficult to compose in standard SQL. TimescaleDB hyperfunctions are a series of SQL functions to address the most common, and often most difficult, queries developers write today. 
We made the decision to take the hard path ourselves so that we could give developers an easier path.</p><h3 id="hyperfunctions-included-in-this-initial-release">Hyperfunctions included in this initial release</h3><p>Today, we’re releasing several hyperfunctions, including:</p><ul>
<li><strong>Time-Weighted Average</strong> allows you to take the average over an irregularly spaced dataset that only includes changepoints.</li>
<li><strong>Percentile-Approximation</strong> brings percentile analysis to more workflows. When used with <a href="https://timescale.ghost.io/blog/blog/continuous-aggregates-faster-queries-with-automatically-maintained-materialized-views/">continuous aggregates</a>, you can compute percentiles over any time range of your dataset in near real-time and use them for baselining and normalizing incoming data. For maximum control, we provide implementations of two different approximation algorithms:
<ul>
<li><strong>UDDsketch</strong> gives formal guarantees on the accuracy of approximate percentiles, in exchange for always returning a range of possible values.</li>
<li><strong>T-Digest</strong> gives fuzzier guarantees which allow it to be more precise at the extremes of the distribution.</li>
</ul>
</li>
<li><strong>Hyperloglog</strong> enables faster approximate <code>COUNT DISTINCT</code>, making it easier to track how the cardinality of your data changes over time.</li>
<li><strong>Counter Aggregate</strong> enables working with counters in an ergonomic SQL-native manner.</li>
<li><strong>ASAP Smoothing</strong> smooths datasets to bring out the most important features when graphed.</li>
<li><strong>Largest Triangle Three Buckets Downsampling</strong> reduces the number of elements in a dataset while retaining important features when graphed.</li>
<li><strong>Stats-agg</strong> makes using rolling, cumulative and normalized statistics as easy as their standard counterparts.</li>
</ul>
<p>Note that Hyperfunctions work on TimescaleDB <a href="https://docs.timescale.com/api/latest/hypertable/">hypertables</a>, as well as regular PostgreSQL tables.</p><h3 id="new-sql-functions-not-new-sql-syntax">New SQL functions, not new SQL syntax</h3><p>We made the decision to create <strong>new SQL functions</strong> for each of the time-series analysis and manipulation capabilities above. This stands in contrast to other efforts which aim to improve the developer experience by introducing new SQL <em>syntax</em>. </p><p>While introducing new syntax with new keywords and new constructs may have been easier from an implementation perspective, we made the deliberate decision not to do so since we believe that it actually leads to a worse experience for the end-user. </p><p>New SQL syntax means that existing drivers, libraries, and tools may no longer work. This can leave developers with more problems than solutions as their favorite tools, libraries, or drivers may not support the new syntax, or may require time-consuming modifications to do so. </p><p>On the other hand, new SQL functions mean that your query will run in every visualization tool, database admin tool, or data analysis tool. We have the freedom to create custom functions, aggregates, and procedures that help developers better understand and work with their data, <strong>and </strong>ensure all their drivers and interfaces still work as expected.</p><h3 id="hyperfunctions-are-written-in-rust">Hyperfunctions are written in Rust</h3><p>Rust was our language of choice for developing the new hyperfunctions. We chose it for its superior productivity, community, and the <a href="https://github.com/zombodb/pgx">pgx software development kit</a>. We felt Rust was a more friendly language for a project like ours and would encourage more community contributions. </p><p>The inherent safety of Rust means we could focus more time on feature development rather than worrying about how the code is written. 
The extensive Rust community (💗 <a href="http://crates.io">crates.io</a>), along with excellent package-management tools, means we can use off-the-shelf solutions for common problems, leaving us more time to focus on the uncommon ones. </p><p>On the topic of community, we found the Rust community to be one of the friendliest on the internet, and its commitment to open source, open communication, and good documentation makes it an utter joy to work with. Libraries such as <a href="https://serde.rs/">Serde</a> and <a href="https://github.com/BurntSushi/quickcheck">quickcheck</a> make common tasks a breeze and let us focus on the code that’s novel to our project, instead of writing boilerplate that's already been written by thousands of others. </p><p>We’d like to shout out ZomboDB’s <a href="https://github.com/zombodb/pgx">pgx</a>, an SDK for building <a href="https://www.tigerdata.com/blog/top-8-postgresql-extensions" rel="noreferrer">Postgres extensions</a> using Rust. Pgx provides tools to generate extension scripts from Rust files and bind Rust functions to Postgres functions, as well as tools to set up, run, and test PostgreSQL instances. (For us, it’s been an amazing tool and experience with incredible benefits – we estimate that pgx has reduced our workload by at least one-third!)</p><h3 id="next-steps">Next steps</h3><p>In the rest of this post, we detail why we chose to build new SQL functions (not new SQL syntax), and explore each hyperfunction and its example usage.</p><p>But<strong> if you’d like to get started with hyperfunctions right away, the easiest way to do so is with a fully-managed TimescaleDB service</strong>. <a href="https://console.cloud.timescale.com/signup">Try it for free</a> (no credit card required) for 30 days. 
Hyperfunctions are pre-loaded on each new database service on Timescale, so after you’ve created a new service, you’re all set to use them!</p><p><strong>If you prefer to manage your own database instances, you can </strong><a href="https://github.com/timescale/timescaledb-toolkit"><strong>download and install the timescaledb_toolkit extension</strong></a> on GitHub for free, after which you’ll be able to use all the hyperfunctions listed above. </p><p>Finally, we love building in public. You can view our <a href="https://github.com/timescale/timescaledb-toolkit">upcoming roadmap on GitHub</a> for a list of proposed features, as well as features we’re currently implementing and those that are available to use today. </p><p>We also welcome feedback from the community (it helps us prioritize the features users really want). To contribute feedback, comment on an <a href="https://github.com/timescale/timescaledb-toolkit/issues">open issue</a> or in a <a href="https://github.com/timescale/timescaledb-toolkit/discussions">discussion thread</a> in GitHub.</p><p>To learn more about hyperfunctions, please continue reading.</p><h2 id="building-new-sql-functions-instead-of-reinventing-syntax">Building new SQL functions instead of reinventing syntax</h2><p>SQL is the <a href="https://insights.stackoverflow.com/survey/2020#most-popular-technologies">third most popular programming language in the world</a>. It’s the language known and loved by many software developers, data scientists, and business analysts the world over, and it's a big reason we chose to build TimescaleDB on top of PostgreSQL in the first place.</p><p>Similarly, we choose to make our APIs user-friendly without breaking full SQL compatibility. This means we can create custom functions, aggregates, and procedures but no new syntax - and all the drivers and interfaces can still work. 
You get the peace of mind that your query will run in every visualization tool, database admin tool, or data analysis tool that speaks SQL.</p><p>SQL is powerful and it’s even <a href="http://blog.coelho.net/database/2013/08/17/turing-sql-1.html">Turing complete</a>, so you can technically do anything with it. But that doesn’t mean you’d want to 😉. Our hyperfunctions are made to make complex analysis and time-series manipulation in SQL simpler, without undermining the guarantees of full SQL compatibility. We’ve spent a large amount of our time on design: prototyping, and writing out different names and calling conventions to find the ones that are clearest and easiest to use. </p><p>Our guiding philosophy is to make simple things easy and complex things possible. We enable things that <em>feel</em> like they should be straightforward, like using a single function call to calculate a time-weighted average of a single item over a time period. We also enable operations that would otherwise be prohibitively expensive (in terms of complexity to write) or would previously take too long to respond to be useful, such as calculating a rolling time-weighted average of each item normalized to the monthly average of the whole group of things.</p><p>In the same spirit, we’ve implemented a default for percentile approximation called <code>percentile_agg</code> that should work for most users, while also exposing the lower level <code>UDDsketch</code> and <code>tdigest</code> implementations for users who want to have more control and get into the weeds.</p>
<p>Another advantage of using SQL functions rather than new syntax is that we bring your code closer to your data, rather than forcing you to take your data to your code. Simply put, you can now perform more sophisticated analysis and manipulation operations on your data right inside your database, rather than creating data pipelines to funnel data into Python or other analysis libraries to conduct analysis there. </p><p>We want to make the more complex analysis simpler and easier in the database not just because we want to build a good product, but also because it’s far, far more efficient to do your analysis as close to the data as possible, and then get aggregated or other simpler results that get passed back to the user. </p><p>This is because the network transmission step is often the slowest and most expensive part of many calculations, and because the serialization and deserialization overhead can be very large as you get to large datasets. So by making these functions and all sorts of analysis simpler to perform in the database, nearer to the data, developers save time and money.</p><p>Moreover, while you could perform some of the complex analysis enabled by hyperfunctions in other languages inside the database (e.g., programs in Python or R), hyperfunctions now enable you to perform such sophisticated time-series analysis and manipulation in SQL right in your query statements, making them more accessible.</p><h2 id="hyperfunctions-released-today">Hyperfunctions released today</h2><p>Hyperfunctions refactor some of the most gnarly SQL queries for time-series data into concise, elegant functions that feel natural to any developer that knows SQL. 
Let’s walk through the hyperfunctions we’re releasing today and the ones that will be available soon.</p><p>Back in January, <a href="https://timescale.ghost.io/blog/blog/time-series-analytics-for-postgresql-introducing-the-timescale-analytics-project/">when we launched our initial hyperfunctions release</a>, we asked for feedback and input from the community. We want this to be a community-driven project, so for our 1.0 release, we’ve prioritized several features requested by community members. We’ll have a brief overview here, with a technical deep dive into each family of functions in a series of separate blog posts in the coming weeks.</p><p><strong>Time-weighted averages</strong></p><p>Time-series averages can be complicated to calculate; generally, you need to determine how long each value has been recorded in order to know how much to weigh them. While doing this in native SQL is <em>possible</em>, it is extremely error-prone and unwieldy. More damningly, the SQL needed would not work in every context. In particular, it would not work in TimescaleDB’s automatically refreshing materialized views, <a href="https://timescale.ghost.io/blog/blog/continuous-aggregates-faster-queries-with-automatically-maintained-materialized-views/">continuous aggregates</a>, so users who wanted to calculate time-weighted averages over multiple time intervals would be forced to rescan the entire dataset for each average so calculated. Our time-weighted average hyperfunction removes this complexity and can be used in continuous aggregates to make multi-interval time-weighted averages as cheap as summing a few sub-averages.</p><p>Here’s an example of using time-weighted averages for an IoT use case, specifically to find the average temperature in a set of freezers over time. 
(Notice how it takes sixteen lines of complex code to find the time-weighted average using regular SQL, compared to just five lines of code with a single <code>SELECT</code> statement when using the TimescaleDB hyperfunction):</p>
<p><strong>Time-weighted average using TimescaleDB hyperfunction</strong></p><pre><code class="language-SQL">SELECT freezer_id, 
	avg(temperature), 
	average(time_weight('Linear', ts, temperature)) as time_weighted_average 
FROM freezer_temps
GROUP BY freezer_id;
</code></pre>
<pre><code class="language-output"> freezer_id |  avg  | time_weighted_average 
------------+-------+-----------------------
          1 | 10.35 |     6.802777777777778
</code></pre>
<p><strong>Time-weighted average using regular SQL</strong></p><pre><code class="language-SQL">WITH setup AS (
	SELECT lag(temperature) OVER (PARTITION BY freezer_id ORDER BY ts) as prev_temp, 
		extract('epoch' FROM ts) as ts_e, 
		extract('epoch' FROM lag(ts) OVER (PARTITION BY freezer_id ORDER BY ts)) as prev_ts_e, 
		* 
	FROM  freezer_temps), 
nextstep AS (
	SELECT CASE WHEN prev_temp is NULL THEN NULL 
		ELSE (prev_temp + temperature) / 2 * (ts_e - prev_ts_e) END as weighted_sum, 
		* 
	FROM setup)
SELECT freezer_id, 
	avg(temperature),
	sum(weighted_sum) / (max(ts_e) - min(ts_e)) as time_weighted_average 
FROM nextstep
GROUP BY freezer_id;
</code></pre>
<pre><code class="language-output"> freezer_id |  avg  | time_weighted_average 
------------+-------+-----------------------
          1 | 10.35 |     6.802777777777778
</code></pre>
<p><strong>Percentile approximation (UDDsketch &amp; TDigest)</strong></p><p>Aggregate statistics are useful when you know the underlying distribution of your data, but for other cases, they <a href="https://en.wikipedia.org/wiki/Anscombe%27s_quartet">can be misleading</a>. For cases where they don’t work, and for more exploratory analyses looking at the ground truth, <a href="https://en.wikipedia.org/wiki/Percentile">percentiles</a> are useful. </p><p>As useful as it is, percentile analysis comes with one major downside: computing exact percentiles requires keeping the entire dataset in memory. This means that such analysis is only feasible for relatively small datasets, and even then can take longer than ideal to calculate. </p><p>The approximate-percentile hyperfunctions we’ve implemented suffer from neither of these problems: they take constant storage, and, when combined with automatically refreshing materialized views, they can produce results nearly instantaneously. This performance improvement opens up opportunities to use percentile analysis for use cases and datasets where it was previously unfeasible.</p><p>Here’s an example of using percentile approximation for a DevOps use case, where we alert on response times that are over the 95th percentile:</p><pre><code class="language-SQL">WITH "95th percentile" AS (
    SELECT approx_percentile(0.95, percentile_agg(response_time)) as threshold
    FROM response_times
)
SELECT count(*) 
FROM response_times, "95th percentile"
WHERE response_time &gt; "95th percentile".threshold;
</code></pre>
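<p>For comparison, here is a minimal sketch of the exact calculation in plain PostgreSQL, using the built-in <code>percentile_cont</code> ordered-set aggregate. The <code>response_times_demo</code> table and its generated values are hypothetical and exist only to make the snippet self-contained. The exact version has to sort every row each time it runs, which is the cost the approximation is meant to avoid:</p><pre><code class="language-SQL">-- Hypothetical demo table: 1,000 response times of 1..1000 ms.
CREATE TEMP TABLE response_times_demo AS
SELECT g::double precision AS response_time
FROM generate_series(1, 1000) AS g;

-- Exact 95th percentile: PostgreSQL sorts all rows to compute it.
WITH p95 AS (
    SELECT percentile_cont(0.95) WITHIN GROUP (ORDER BY response_time) AS threshold
    FROM response_times_demo
)
SELECT count(*) AS slow_requests
FROM response_times_demo, p95
WHERE response_time &gt; p95.threshold;
</code></pre>
<p>Unlike the sketch-based aggregate, the result of <code>percentile_cont</code> cannot be stored per time bucket and combined later, so every new time range means another full scan and sort.</p>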
<p><a href="https://docs.timescale.com/api/latest/hyperfunctions/">See our hyperfunctions docs</a> to get started today. In the coming weeks, we will be releasing a series of blog posts which detail each of the hyperfunctions released today, in the context of using them to solve a real-world problem. </p><h3 id="hyperfunctions-in-public-preview">Hyperfunctions in public preview</h3><p>In addition to the hyperfunctions released today, we’re making several hyperfunctions available for public preview. These include hyperfunctions for downsampling, smoothing, approximate count-distinct, working with counters, and working with more advanced forms of averaging. All of these are available for trial today through our experimental schema, and, with your feedback, will be made available for production usage soon. </p><p>Here’s a tour through each hyperfunction and why we created them:</p><p><strong>Graph Downsampling &amp; Smoothing</strong></p><p>We have two algorithms implemented to help downsample your data for better, faster graphing:</p><p>The first graphing algorithm for downsampling is <a href="https://github.com/timescale/timescaledb-toolkit/blob/main/docs/lttb.md"><strong>Largest triangle three bucket</strong></a><strong> (LTTB)</strong>. LTTB limits the number of points you need to send to your graphing engine while maintaining visual acuity. This means that you don’t try to plot 200,000 points on a graph that’s only 2000 pixels wide, which is inefficient in terms of network and rendering costs. 
</p><p>Given an original dataset which looks like the graph below:</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/lttb_raw.png" class="kg-image" alt="" loading="lazy" width="716" height="371" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/01/lttb_raw.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/lttb_raw.png 716w"></figure><p>We can downsample it to just 34 points with the following query using the LTTB hyperfunction:</p><pre><code>SELECT toolkit_experimental.lttb(time, val, 34)
</code></pre>
<p>The above query yields the following graph, which retains the periodic pattern of the original graph, with just 34 points of data.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/lttb_34.png" class="kg-image" alt="" loading="lazy" width="600" height="371" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/lttb_34.png 600w"></figure><p>The second graphing algorithm for downsampling is <a href="https://github.com/timescale/timescaledb-toolkit/blob/main/docs/asap.md"><strong>Automatic smoothing for attention prioritization</strong></a><strong> (ASAP smoothing). </strong>ASAP Smoothing uses optimal moving averages to smooth a graph to remove noise and make sure that trends are obvious to the user, while not over-smoothing and removing all the signals as well. This leads to vastly improved readability. </p><p>For example, the graph below displays 250 years of monthly temperature readings from England (raw data can be found <a href="http://futuredata.stanford.edu/asap/Temp.csv">here</a>):</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/ASAP_raw.png" class="kg-image" alt="" loading="lazy" width="809" height="341" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/01/ASAP_raw.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/ASAP_raw.png 809w" sizes="(min-width: 720px) 720px"></figure><p>We can run the following query using the ASAP smoothing hyperfunction:</p><pre><code>SELECT toolkit_experimental.asap_smooth(month, value, 800) FROM temperatures
</code></pre>
<p>The result is the graph below, which is much less noisy than the original and one where users can more easily spot trends.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/ASAP_smoothed.png" class="kg-image" alt="Smoothed data" loading="lazy" width="809" height="341" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/01/ASAP_smoothed.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/ASAP_smoothed.png 809w" sizes="(min-width: 720px) 720px"></figure><p><a href="https://github.com/timescale/timescaledb-toolkit/blob/main/docs/counter_agg.md"><strong>Counter Aggregates</strong></a></p><p>Metrics generally come in a few different varieties, which many systems have come to call <strong>gauges</strong> and <strong>counters</strong>. A gauge is a typical metric that can vary up or down, something like temperature or percent utilization. A counter is meant to be monotonically increasing. So it keeps track of, say, the total number of visitors to a website. The main difference in processing counters and gauges is that a decrease in the value of a counter (compared to its previous value in the <a href="https://www.tigerdata.com/blog/time-series-introduction" rel="noreferrer">time series</a>) is interpreted as a <strong>reset</strong>.  TimescaleDB’s <a href="https://github.com/timescale/timescaledb-toolkit/blob/main/docs/counter_agg.md">counter aggregate hyperfunctions</a> enable a simple and optimized analysis of these counters. </p><p>For example, despite a dataset being stored like:</p><pre><code>data
------
  10
  20
   0
   5
  15
</code></pre>
<p>We can calculate the delta (along with various other statistics) over this monotonically-increasing counter with the following query using the counter aggregate hyperfunction:</p><pre><code>SELECT toolkit_experimental.delta(
    toolkit_experimental.counter_agg(ts, val))
FROM foo;
</code></pre>
<pre><code> delta
------
  40
</code></pre>
<p><a href="https://github.com/timescale/timescaledb-toolkit/blob/main/docs/hyperloglog.md"><strong>Hyperloglog for Approximate Count Distinct</strong></a></p><p>We’ve implemented a version of the hyperloglog algorithm to do approximate count distinct queries over data in a more efficient and parallelizable fashion. If you’re an existing TimescaleDB user, you’ll be happy to hear that it works in continuous aggregates, which are automatically refreshing materialized views. </p><p><a href="https://github.com/timescale/timescaledb-toolkit/blob/main/docs/stats_agg.md"><strong>Statistical Aggregates</strong></a></p><p>Calculating rolling averages and other statistical aggregates over tumbling windows is very difficult in standard SQL because, to do it accurately, you’d need to separate out the different components (e.g., count and sum for an average) and then combine them yourself. Our statistical aggregates let you do this with a simple <code>rollup</code>.</p>
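<p>To make the “separate out the components” point concrete, here is a rough sketch in plain PostgreSQL of what you would otherwise do by hand for an average. It is not the toolkit’s implementation, and the <code>readings_demo</code> table, its columns, and its values are hypothetical. It stores a sum and a count per bucket and only divides after re-aggregating them:</p><pre><code class="language-SQL">-- Hypothetical readings: values 1..6 in two unevenly sized buckets.
CREATE TEMP TABLE readings_demo AS
SELECT g AS val,
       least(g - 1, 1) AS bucket   -- bucket 0 holds {1}, bucket 1 holds {2..6}
FROM generate_series(1, 6) AS g;

-- Step 1: keep the components of the average per bucket, not the average itself.
WITH per_bucket AS (
    SELECT bucket, sum(val) AS val_sum, count(val) AS val_count
    FROM readings_demo
    GROUP BY bucket
)
-- Step 2: re-aggregate the components, then divide once at the end.
SELECT sum(val_sum)::numeric / sum(val_count) AS overall_avg
FROM per_bucket;
</code></pre>
<p>Averaging the two per-bucket averages (1 and 4) would give 2.5, while the true average of the six values is 3.5. Carrying the components is what keeps the combined result correct, and it gets more tedious for statistics such as variance that need more than two components.</p>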
<p>To follow the progress and contribute to improving these (and future) hyperfunctions, you can view our <a href="https://github.com/timescale/timescaledb-toolkit">roadmap on GitHub.</a> Our development process is heavily influenced by community feedback, so your comments on <a href="https://github.com/timescale/timescaledb-toolkit/issues">issues</a> and <a href="https://github.com/timescale/timescaledb-toolkit/discussions">discussion threads</a> will help determine which features get prioritized, and when they’re stabilized for release.</p><h2 id="next-steps-1">Next Steps</h2><p><a href="https://console.cloud.timescale.com/signup">Try hyperfunctions today</a> with a fully-managed Timescale service (no credit card required, free for 30 days). Hyperfunctions are pre-loaded on each new database service on Timescale, so after you’ve created a new service, you’re all set to use them!</p><p>If you prefer to manage your own database instances, you can <a href="https://github.com/timescale/timescaledb-toolkit">download and install the timescaledb_toolkit extension</a> on GitHub for free, after which you’ll be able to use all the hyperfunctions listed above. </p><p>We love building in public. You can view our <a href="https://github.com/timescale/timescaledb-toolkit">upcoming roadmap on GitHub</a> for a list of proposed features, as well as features we’re currently implementing and those that are available to use today. We also welcome feedback from the community (it helps us prioritize the features users really want). To contribute feedback, comment on an <a href="https://github.com/timescale/timescaledb-toolkit/issues">open issue</a> or in a <a href="https://github.com/timescale/timescaledb-toolkit/discussions">discussion thread</a> in GitHub.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How (and why) to become a PostgreSQL contributor]]></title>
            <description><![CDATA[Contributing to open-source projects can be intimidating – and PostgreSQL is no exception. As a long-time PostgreSQL contributor, Aleks shares his hard-earned tips to help you make your first contribution, or contribute more.]]></description>
            <link>https://www.tigerdata.com/blog/how-and-why-to-become-a-postgresql-contributor</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/how-and-why-to-become-a-postgresql-contributor</guid>
            <category><![CDATA[Engineering]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Aleksander Alekseev]]></dc:creator>
            <pubDate>Thu, 17 Jun 2021 13:02:28 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/06/neil-and-zulma-scott-hjv3VO-jpnc-unsplash.jpg">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/06/neil-and-zulma-scott-hjv3VO-jpnc-unsplash.jpg" alt="How (and why) to become a PostgreSQL contributor" /><p><em>Contributing to open-source projects can be intimidating – and PostgreSQL is no exception. As a long-time PostgreSQL contributor, Aleks shares his hard-earned tips to help you make your first contribution, or start contributing more.</em></p><p>PostgreSQL is one of the <a href="https://db-engines.com/en/blog_post/85">most popular and loved databases in the world</a>. It’s no secret that we are big fans of PostgreSQL at Timescale: We’ve built TimescaleDB on top of it, we employ open-source PostgreSQL contributors (<a href="https://www.linkedin.com/in/afiskon/">like me</a>!), and we’ve developed features to make using PostgreSQL better for time-series scenarios (like <a href="https://timescale.ghost.io/blog/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql/">Skip Scan</a>, which makes certain queries in PostgreSQL 8000x faster). But, in addition to helping improve the database itself, we’re committed to the success of the PostgreSQL community at large. </p><p>Open-source is not just my passion; it’s my career. I’ve been a PostgreSQL contributor since 2016, and recently joined Timescale as a full-time open-source PostgreSQL contributor. I’ve contributed not only to PostgreSQL but also to <a href="https://github.com/insolar/insolar">Insolar</a>, <a href="https://sigrok.org/gitweb/?p=libsigrokdecode.git;a=commitdiff;h=f62e32bce4edf84dd415d3d494535d8b9a65e365">Sigrok</a>, and other open-source projects. 
I’m the author of <a href="https://github.com/afiskon/pg_protobuf">pg_protobuf</a> and <a href="https://github.com/postgrespro/zson">ZSON</a> extensions for PostgreSQL and <a href="https://github.com/afiskon/stm32-ssd1306">several</a> open-source <a href="https://github.com/afiskon/stm32-si5351">libraries</a> for STM32 microcontrollers.</p><p>I love open-source because it enables us to see what’s inside the software, learn from it, and improve it. The quality standards are higher in open-source software than in proprietary software because you can’t hide any cut corners. Last but not least, open-source software can’t refuse to sell or prolong your license because of geopolitical events or whatnot. (I encountered this at least twice in my career.)</p><p>Which brings me to the impetus for this post. Earlier this year, we ran the “State of PostgreSQL” survey to learn how people use PostgreSQL, from their community experiences to popular tools and areas to improve. </p><p>You can see the <a href="https://www.timescale.com/state-of-postgres-results">State of PostgreSQL 2021</a> report to explore all findings and trends – but one result stood out for me:<br></p><blockquote><strong>85% of respondents haven’t contributed to PostgreSQL codebase, docs, or commitfests, and only 4% have contributed several times.</strong></blockquote><p>The survey also highlighted several places where we, as a PostgreSQL community, can be more welcoming to new developers to help them use<em> and</em> contribute to PostgreSQL.</p><p>For example, one respondent said: “First code contributions can be traumatic...sometimes we’re not very welcome [sic] with new developers. We should improve...”</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/06/State-of-PG_Contributions_Updated.png" class="kg-image" alt="Graph showing contribution percentages of Postgres users." 
loading="lazy" width="1390" height="986" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2021/06/State-of-PG_Contributions_Updated.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2021/06/State-of-PG_Contributions_Updated.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/06/State-of-PG_Contributions_Updated.png 1390w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">85% of respondents in the </span><a href="https://www.timescale.com/state-of-postgres-results"><span style="white-space: pre-wrap;">2021 State of PostgreSQL survey</span></a><span style="white-space: pre-wrap;"> haven’t contributed to the PostgreSQL codebase, docs, or commitfests</span></figcaption></figure><p>That got me thinking about how we can make it easier for folks to overcome the initial fear and other barriers - be it technical difficulty, confusing processes, or lack of information - that often surround contributing to an open-source project. After all, we want more people to be a part of the PostgreSQL community and to make contributions; that’s how we make it even better.</p><p>To help more people get started, I wanted to share my observations, what I’ve learned over the past 5+ years, what I wish I knew when I started, and advice I typically give new contributors. </p><p>In my experience, little depends on the specifics of the project. So, while I’ll use PostgreSQL-specific examples, the following guidance is quite universal, whether you want to contribute to PostgreSQL for the first (or second, or third) time – or have another open-source project in mind. 
</p><p>I also included a few ways to give back to the community or help a project grow <em>beyond </em>code contributions: the important, yet easily overlooked, elements of building a sustainable, healthy open-source community.</p><h2 id="step-1-identify-your-motivation">Step #1: Identify your motivation</h2><p>One of the most important questions to ask is: “Why do you want to be an open-source contributor?” Unless you recognize and understand your motivation, it will be difficult for you to find time for the project, especially as time goes on. </p><p>Here is a list of potential reasons why you might want to start working on an open-source project:</p><p><strong>To gain a unique experience:</strong> If you're a backend developer who's been writing microservices for a while, you might look for a new challenge. Open-source software presents many (many!) such challenges and new technologies to learn.</p><p><strong>To learn the internals of your favorite operating system/ database/ language/ compiler:</strong> Understanding the internals of your favorite open-source project allows you to use it more efficiently and to learn its limitations. As an example, not many users know that running SELECT queries <a href="https://wiki.postgresql.org/wiki/Hint_Bits">may cause</a> <em>writes </em>to the disk by PostgreSQL. Or that <a href="https://www.postgresql.org/message-id/20160729141552.4062e19b%40fujitsu">creating multiple temporary tables</a> may significantly affect the performance of the entire database. Or that <code>synchronous_commit = remote_apply</code> doesn’t actually wait for replicas before committing the transaction. (The transaction is committed instantaneously. 
The user is just not notified about this, which <a href="https://www.postgresql.org/message-id/CAJ7c6TNp=9STwXX_1Rvp+OpszHqtMKXqUJcAXwgz9w95P8J56Q@mail.gmail.com">may cause problems</a>.)</p><p><strong>To work with great people:</strong> Open-source attracts some of the most talented people from around the world. There is always something you can learn from them, big or small. The original idea of the <a href="https://github.com/postgrespro/zson">ZSON extension</a> came from <a href="https://akorotkov.github.io/">Alexander Korotkov</a> and <a href="http://www.sigaev.ru/">Teodor Sigaev</a>, both PostgreSQL committers I was lucky to work with. ZSON is now the most popular project I have on GitHub (390+ stars at the moment of writing) – and <a href="https://commitfest.postgresql.org/33/3152/">there is a possibility</a> that it will be shipped with PostgreSQL by default.</p><p><strong>To make users happier: </strong>Let’s say <a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=44ca4022f3f9297bab5cbffdd97973dbba1879ed;hp=ea4b8bd6188ecb17ba37d">you contributed several lines of code</a> and, as a result, made PostgreSQL 10% faster in some scenarios. PostgreSQL is used by thousands of companies whose products are used by millions of customers. It’s satisfying to realize that your small patch made all of these users just a little bit happier, even if you might not explicitly hear from them.</p><p><strong>To boost your resume:</strong> It’s natural to seek a job that better suits you. Several years of contributing to a well-known open-source software will open new doors for you, from more technical experience to connections with various community members who you may wind up working with later.</p><p>To be fair, there are probably dozens of other reasons why you might want to start contributing to an open-source project. 
This list isn’t exhaustive, and each case is unique, but I tried to distill a few of the reasons I see again and again.</p><p>Once you’ve taken some time to reflect on and establish <em>why </em>you want to contribute, the next step is to familiarize yourself with the project’s development process.</p><h2 id="step-2-learn-about-the-development-process">Step #2: Learn about the development process</h2><p>Before starting the work on a new patch, there are several things to learn about the project:</p><ul><li>How to compile the code</li><li>How to run the tests</li><li>How to build the documentation</li><li>How to debug / profile / benchmark the code</li><li>How to format the code</li><li>How to submit a patch</li></ul><p>You can usually find this information, or most of it, in the project’s GitHub README or somewhere in the documentation. For PostgreSQL, <a href="https://www.postgresql.org/docs/current/installation.html">look at the installation docs</a>.</p><p>PostgreSQL is written in C, uses the <a href="https://en.wikipedia.org/wiki/GNU_Autotools">GNU Autotools</a> build system, and relies on Perl scripting language for testing and SGML for documentation. It uses Git as the version control system, and the repository itself is self-hosted (although there is a <a href="https://github.com/postgres/postgres">mirror on GitHub</a>).</p><p>PostgreSQL can be compiled and tested like this:</p><pre><code class="language-SQL">git clone http://git.postgresql.org/git/postgresql.git
cd postgresql
./configure --prefix=/home/user/pginstall --enable-tap-tests --enable-cassert --enable-debug
make world
make check-world</code></pre><p>The details are a little bit more complicated, though. Firstly, you have to install <a href="https://wiki.postgresql.org/wiki/Compile_and_Install_from_source_code#Build.2Binstall_the_source_code">several dependencies</a>, which is done differently depending on the operating system. </p><p>For instance, on Ubuntu 20.04 LTS you will need:</p><pre><code class="language-SQL"># for basic build
sudo apt install gcc make flex bison libreadline-dev zlib1g-dev
# to build the documentation as well
sudo apt install docbook docbook-dsssl docbook-xsl libxml2-utils \
openjade opensp xsltproc</code></pre><p>Secondly, there are several <a href="https://www.postgresql.org/message-id/559708.1618930923@sss.pgh.pa.us">common mistakes</a> that you can make, e.g., forgetting to run <code>make distclean</code> after changing the header files. </p><p>There is a <a href="https://github.com/afiskon/pgscripts">set of scripts</a> on GitHub which will help you to avoid these mistakes. </p><p>Here is how to use it:</p><pre><code class="language-SQL"># where to install Postgres (you can add this line to your ~/.bash_profile)
export PGINSTALL="/home/user/pginstall"
# build and test Postgres
./full-build.sh
# install it to $PGINSTALL
./single-install.sh
# run a few more tests against the running Postgres instance
make installcheck-world
# check the documentation:
open ~/pginstall/share/doc/postgresql/html/index.html</code></pre><p>Additionally, the PostgreSQL community uses mailing lists as the main communication channel for discussions <em>and </em>submitting and reviewing patches. </p><p>To get an idea of the types of messages, subscribe to <a href="https://www.postgresql.org/list/pgsql-hackers/">pgsql-hackers@</a> (be aware that there are many messages per day on this mailing list). Two other important mailing lists are <a href="https://www.postgresql.org/list/pgsql-general/">pgsql-general@</a> and <a href="https://www.postgresql.org/list/pgsql-bugs/">pgsql-bugs@</a>, and there are <a href="https://www.postgresql.org/list/">many others</a> for assorted topics. </p><p>(As an aside: in the State of PostgreSQL survey, in going through the <a href="https://drive.google.com/drive/u/0/folders/10SZ8XMOKkwZ848bQ5GnMf_JnlrNVmOAO">anonymized survey source data</a>, a number of responses mentioned that the mailing lists weren’t the friendliest way to track bugs and may be a barrier to getting involved. For example,<em> “Mailing lists are considered "hard" by people nowadays. Not the most welcoming interface for interacting with the community for quite a large number of people I'd expect</em>.”)</p><p>After you’ve gotten familiar with the development process, it’s time to start thinking about ideas for your first patch.</p><h2 id="step-3-identify-your-first-patch">Step #3: Identify your first patch</h2><p>Your first patch doesn’t have to be anything fancy. Here are several examples that I think make good first patches, both for PostgreSQL and as general places to start:</p><p><strong>Find and fix mistakes in comments and documentation.</strong> Start with something simple. 
In my experience, projects are bound to have <a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=2d8a1e22b109680204cb015a30e5a733a233ed64;hp=aa698d753566f68bdd54881d30b1a515b0327b0e">typos</a> and <a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=090b287fc59e7a44da8c3e0823eecdc8ea4522f2;hp=cc402116ca156babcd3ef941317f462a96277e3a">mistakes</a> in the code comments and in the documentation. Use your favorite text editor with a spell checker to find them. This is a great place to start and to overcome the fear of contributing, because there is no risk of breaking anything. Interestingly, this is exactly how I submitted <a href="https://github.com/torvalds/linux/commit/3409f9ab71d7db96eed849f49a6c8116c62dc251">my first (and so far the only) patch</a> to the Linux kernel. (My <a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=cc988fbb0bf60a83b628b5615e6bade5ae9ae6f4;hp=879d71393de001880e031">first patch</a> to PostgreSQL was rather complicated and thus not very representative.)</p><p><strong>Participate in code review, testing, and discussions.</strong> Many software developers like to write the code, but few like to review and test it. One of the most valuable contributions one can make to PostgreSQL is <a href="https://wiki.postgresql.org/wiki/Reviewing_a_Patch">being a reviewer</a>. As a reviewer, your primary task is to check that the patch compiles, passes the tests, implements the claimed functionality, and includes documentation. And, of course, it’s worth checking that it doesn’t have any obvious bugs.</p><p><strong>Find and fix a bug.</strong> Check the bug tracker, or, in the case of PostgreSQL, check the archive of the <a href="https://www.postgresql.org/list/pgsql-bugs/">pgsql-bugs@</a> mailing list. Try to reproduce the bug. If it doesn’t reproduce, it might already be fixed by another patch, or maybe the steps to reproduce it aren’t very clear. 
In any case, reply to the mailing list to let the community know what you find. If you managed to reproduce the bug, you are lucky; from there, write a corresponding test and change the code so that it passes the test.</p><p><strong>Find a bottleneck and optimize the code.</strong> After using a piece of software for a while, you discover cases when its performance is far from ideal. Use suitable tools (e.g., <a href="https://www.brendangregg.com/perf.html">perf</a> and <a href="https://www.brendangregg.com/ebpf.html">eBPF</a>) to find the bottleneck and then eliminate it. Before submitting the patch, <a href="https://www.postgresql.org/message-id/20151224111925.56e267e9%40fujitsu">make sure it doesn’t cause performance degradation</a> in some other scenarios.</p><p><strong>Write tests.</strong> Use a suitable tool to test code coverage. For PostgreSQL (C or C++), that tool would be <code>lcov</code>. With a code coverage report on hand, write a test that increases the code coverage.</p><p><strong>Improve documentation.</strong> Ideally, documentation is structured in a way that allows users to download it as a PDF and read it like a book. PostgreSQL documentation is quite good in terms of covering many - many! - topics, but could benefit from more experienced/dedicated technical writers (e.g., people who could add more sample scenarios and illustrations to help new users understand concepts). With PostgreSQL, there are <a href="https://www.postgresql.org/docs/">71 chapters and 2300+ pages in total</a>, and the pages mostly describe configuration parameters and query syntax vs. examples of how to solve concrete tasks. <a href="https://docs.freebsd.org/en/books/handbook/">The FreeBSD Handbook</a> comes to mind as a good “read it like a book” example.  </p><p><strong>Refactor the code.</strong> Refactoring has a clear goal: you rewrite the code so that it does the same thing but is more readable. 
It’s worth noting that sometimes the PostgreSQL community can be a little skeptical about the value of such patches. The reason is that the community supports the last 5 major releases of PostgreSQL, so refactorings can make backporting of bugfixes more complicated.</p><p>I recommend accumulating small wins - by submitting several “first patch”-like contributions - before you move on to more ambitious patches. This will help you get familiar with the contribution process, tools, and community – not to mention increasing your confidence.</p><p>Once you’ve settled on an idea for your first patch, you’re all set to go ahead and start contributing! (I recommend first reading the following section about common mistakes to avoid 😊.)</p><h2 id="step-4-prepare-to-contribute-how-to-avoid-common-mistakes">Step #4: Prepare to contribute: how to avoid common mistakes</h2><p>There are a few common “mistakes” people make when joining an open-source project, so I’ve compiled the following pieces of advice to help you avoid them.</p><ul><li>Start with something simple</li><li>The smaller your patch, the better the chances that it will be accepted</li><li>Listen carefully to the feedback on your patch</li><li>Always be polite (never be sarcastic, never troll, etc.)</li><li>Not all patches get accepted; don’t take it personally</li><li>Contributors are people too; most people work on open source in their spare time</li><li><a href="https://grammarly.com/">Grammarly</a> is a great help for those of us for whom English is a second language (and native speakers too)</li></ul><p>Thus far, I’ve focused my discussion on contributing patches with new code or bugfixes. 
But, submitting patches is not the be-all and end-all of contributing to open-source projects.</p><p>Next, we’ll look at how to contribute to open-source projects and the surrounding community <em>without </em>writing code.</p><h2 id="step-5-consider-contributing-beyond-code">Step #5: Consider contributing beyond code</h2><p>There are many ways to contribute to a project besides writing the code and documentation – and these non-code contributions are invaluable. </p><p>Here are several of my favorite ideas, although the list does not claim to be complete:</p><p><strong>Help newcomers.</strong> There are always people who recently started to use a given open-source project (for reference, in the State of PostgreSQL survey, almost 50% of respondents said they were new-ish to the project, with 0-5 years experience.) Usually, there is a <a href="https://www.postgresql.org/list/pgsql-novice/">mailing list</a> and/or <a href="https://postgresteam.slack.com/">Slack</a> where they can ask questions. Join the corresponding channel and help newcomers.</p><p><strong>Participate in conferences.</strong> Make a presentation on something you used, learned, or have been working on lately. Share the knowledge. For PostgreSQL, there are several popular conferences, like <a href="https://pgconf.asia/EN/">PGconf.asia</a>, <a href="https://postgreslondon.org/">Postgres London</a>, and <a href="https://pgconf.us">PGconf.us</a>, as well as many local meetups.</p><p><strong>Create a blog / podcast / YouTube channel.</strong> This article, for instance, can be considered as a small contribution to open-source. Make sure your blog, podcast, etc., is added to the prominent community news aggregator(s), so people can learn about it. For PostgreSQL, this is <a href="https://planet.postgresql.org/">PostgreSQL Planet</a>.</p><p><strong>Write a book.</strong> Writing a book is an ambitious and very time-consuming goal, but there are many ways to do so. 
For example, <a href="https://www.manning.com/">Manning</a> is a publisher well known for helping new technical writers to publish their first book; or simply make a PDF in Google Docs and distribute it for free. </p><p><strong>Participate in Google Summer of Code or Google Season of Docs.</strong> <a href="https://summerofcode.withgoogle.com/">Google Summer of Code</a> (GSoC) is a program focused on bringing student software developers into open-source development. Participate as a student or as a mentor. If you are a technical writer, consider participating in <a href="https://developers.google.com/season-of-docs">Google Season of Docs</a> (GSoD). See <a href="https://wiki.postgresql.org/wiki/GSoC">GSoC</a> and <a href="https://wiki.postgresql.org/wiki/GSoD">GSoD</a> pages on PostgreSQL Wiki for more information.</p><p><strong>Donate hardware for CI system. </strong>CI stands for <a href="https://en.wikipedia.org/wiki/Continuous_integration">continuous integration</a> — and in the PostgreSQL world, it’s called <a href="https://buildfarm.postgresql.org/cgi-bin/show_status.pl">Buildfarm</a>. The community is interested in adding unusual platforms or combinations of architecture, operating system, and compiler to the Buildfarm. As an example, currently, there is no server with <a href="https://riscv.org/">RISC-V</a> architecture. RISC-V is an open instruction set architecture (ISA) that gets support from many leading hardware manufacturers, especially after the discovery of Meltdown and Spectre vulnerabilities. 
See <a href="https://buildfarm.postgresql.org/cgi-bin/register-form.pl">Application to join PostgreSQL Buildfarm</a><strong> </strong>for more details.</p><h2 id="resources-to-learn-more">Resources to learn more</h2><p>The following additional materials are recommended for self-study in the context of contributing to PostgreSQL:</p><ul><li><a href="https://www.oreilly.com/library/view/c-in-a/9781491924174/">C in a Nutshell, 2nd Edition</a> by Peter Prinz, Tony Crawford</li><li><a href="https://autotools.io/index.html">Autotools Mythbuster</a> by Diego Elio Pettenò, David J. Cozatt</li><li><a href="https://www.amazon.com/Learning-Perl-Making-Things-Possible/dp/1491954329/">Learning Perl</a> and <a href="https://www.amazon.com/Intermediate-Perl-Beyond-Basics-Learning/dp/1449393098/">Intermediate Perl</a> by Randal L. Schwartz, et al.</li><li><a href="https://www.amazon.com/Refactoring-Improving-Design-Existing-Code/dp/0201485672/">Refactoring</a> by Martin Fowler, et al.</li><li><a href="https://www.amazon.com/Systems-Performance-Brendan-Gregg/dp/0136820158/">Systems Performance, 2nd Edition</a> by Brendan Gregg</li><li><a href="https://www.amazon.com/Performance-Tools-Addison-Wesley-Professional-Computing/dp/0136554822/">BPF Performance Tools</a> by Brendan Gregg</li><li><a href="https://www.amazon.com/Database-System-Implementation-Hector-Garcia-Molina/dp/0130402648/">Database System Implementation</a> by Hector Garcia-Molina, et al.</li><li><a href="https://afiskon.github.io/pgcon2018.html">Growing up new PostgreSQL developers</a>, by Anastasia Lubennikova and myself. 
See the bonus slides with recommended white papers.</li></ul><h2 id="final-thoughts">Final thoughts</h2><p>This post is merely a collection of advice, best practices, and various other things I’ve observed over the years to help would-be contributors make the jump from “never contributed” to “contributed once or twice” (and, ultimately, hopefully some make it to “contributed many times”).</p><p>There are <strong>many </strong>more technical and community experience topics that I didn't cover here. If you have something you’d like to read more about (e.g., debugging, profiling, and benchmarking PostgreSQL, or maybe about writing extensions), reach out to let me and the team know: <a href="mailto:aleksander@timescale.com">aleksander@timescale.com</a>. </p><p>We’re also looking for ways to help the community, share knowledge, and contribute to things community members are already working on.</p><p>Lastly, I’d be remiss if I didn’t mention that we’re hiring across multiple teams  🙌. TimescaleDB is the leading open-source relational database for time-series data. It’s packaged as a <a href="https://www.tigerdata.com/blog/top-8-postgresql-extensions" rel="noreferrer">PostgreSQL extension</a> (an extension like <code>CREATE EXTENSION</code>, not a fork, nor a set of patches). </p><p>If you know C and SQL, have experience with PostgreSQL, and want to be a full-time database developer, I encourage you to consider joining Timescale. Timescale is a remote-first company, with people located on all continents (except Antarctica – but if that’s you, we’re happy to outfit your home office with a space heater). </p><p>If you’d like to discuss technical topics with me, Timescale engineers, and other developers and community members, you can find us in the <a href="https://slack.timescale.com/">TimescaleDB Slack</a> (8K+ members).</p><p>Happy contributing 🐘🚀</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Improving DISTINCT Query Performance Up to 8,000x on PostgreSQL]]></title>
            <description><![CDATA[Learn common performance pitfalls and discover techniques to optimize your DISTINCT PostgreSQL queries.]]></description>
            <link>https://www.tigerdata.com/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[PostgreSQL Performance]]></category>
            <dc:creator><![CDATA[Sven Klemm]]></dc:creator>
            <pubDate>Thu, 06 May 2021 11:01:13 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/05/pexels-pixabay-373543.jpg">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/05/pexels-pixabay-373543.jpg" alt="Improving DISTINCT Query Performance Up to 8,000x on PostgreSQL" /><p>PostgreSQL is an amazing database, but it can struggle with certain types of queries, especially as tables approach tens and hundreds of millions of rows (or more). <a href="https://www.timescale.com/learn/understanding-distinct-in-postgresql-with-examples" rel="noreferrer"><code>DISTINCT</code> queries</a> are an example of this.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/baby-yoda-tea.gif" class="kg-image" alt="Baby Yoda drinking tea" loading="lazy" width="320" height="320"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Waiting for our DISTINCT queries to return</em></i></figcaption></figure><p>Why are <code>DISTINCT</code> queries slow on PostgreSQL when they seem to ask an "easy" question? It turns out that PostgreSQL currently lacks the ability to efficiently pull a list of unique values from an ordered index. </p><div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-emoji">🔖</div><div class="kg-callout-text">Learning PostgreSQL? <a href="https://www.timescale.com/learn/understanding-distinct-in-postgresql-with-examples" rel="noreferrer">Read the basics on DISTINCT</a>.</div></div><p></p><p><br></p><p>Even when you have an index that matches the exact order and columns for these "last-point" queries, PostgreSQL is still forced to scan the entire index to find all unique values. 
As a table grows (and <a href="https://timescale.ghost.io/blog/blog/what-the-heck-is-time-series-data-and-why-do-i-need-a-time-series-database-dcf3b1b18563/">they grow quickly with time-series data</a>), this operation keeps getting slower.</p><p>Other databases, such as MySQL, Oracle, and DB2, implement a feature called "Loose index scan," "Index Skip Scan," or “Skip Scan,” to speed up the performance of queries like this. </p><p>When a database has a feature like "Skip Scan," it can incrementally jump from one ordered value to the next without reading all of the rows in between. <em>Without</em> support for this feature, the database engine has to scan the entire ordered index and then deduplicate it at the end—which is a much slower process.</p><p>Since 2018, there have been <a href="https://commitfest.postgresql.org/19/1741/">plans to support something similar</a> in PostgreSQL. <em>(<strong>Note</strong>: We couldn’t use this implementation directly due to some limitations of what is possible within the </em><a href="https://www.tigerdata.com/blog/top-8-postgresql-extensions" rel="noreferrer"><em>Postgres extension</em></a><em> framework.)</em></p><p>Unfortunately, this patch wasn't included in the <a href="https://commitfest.postgresql.org/32/">CommitFest</a> for PostgreSQL 14, so it won't be included until PostgreSQL 15 at the earliest (i.e., no sooner than Fall 2022, at least 1.5 years from now). </p><p>We don’t want our users to have to wait that long.</p><h2 id="what-is-timescales-skipscan">What is Timescale's SkipScan?</h2><p>Today, via TimescaleDB 2.2.1, we are releasing <strong>TimescaleDB SkipScan</strong>, a custom query planner node that makes ordered <code>DISTINCT</code> queries blazing fast in PostgreSQL 🔥. 
</p><p>As you'll see in the benchmarks below, <strong>some queries performed more than</strong> <strong>8,000x better than before</strong>—and many of the SQL queries your applications and analytics tools use could also see dramatic improvements with this new feature.</p><p>This feature works in both Timescale <a href="https://www.tigerdata.com/blog/database-indexes-in-postgresql-and-timescale-cloud-your-questions-answered" rel="noreferrer">hypertables</a> and normal PostgreSQL tables. </p><p>This means that with Timescale, not only will your time-series <code>DISTINCT</code> queries be faster, but <strong>any other related queries you may have on normal PostgreSQL tables will also be faster. </strong></p><p>This is because Timescale is not just a time-series database. It’s a relational database, specifically, a relational database for <a href="https://www.tigerdata.com/blog/time-series-introduction" rel="noreferrer">time series</a>. Developers who use Timescale benefit from a purpose-built time-series database plus a classic relational (Postgres) database, all in one, with full SQL support.</p><p>And to be clear, we love PostgreSQL. We employ engineers who contribute to PostgreSQL. We contribute to the ecosystem around PostgreSQL. PostgreSQL is the world’s fastest-growing database, and we are excited to support it alongside thousands of other users and contributors.</p><p>We constantly seek to advance the state of the art with databases, and features like SkipScan are only our latest contribution to the industry. SkipScan makes Timescale and PostgreSQL better, more competitive databases overall, especially compared to MySQL, Oracle, DB2, and others. 
</p><h3 id="how-to-check-and-optimize-your-query-performance-in-postgresql">How to check (and optimize) your query performance in PostgreSQL</h3><p>If you're new to PostgreSQL and are wondering how to check your query performance in the first place (and optimize it!), we're going to leave a few helpful resources here:</p><ul><li><a href="https://www.timescale.com/forum/t/a-beginners-guide-to-explain-analyze/77">This beginner's guide to <code>EXPLAIN ANALYZE</code></a> by Michael Christofides, from one of our Timescale Community Days. And here's a blog post on <a href="https://www.timescale.com/learn/explaining-postgresql-explain" rel="noreferrer">Explaining EXPLAIN</a> in case you're more of a reader.</li></ul><figure class="kg-card kg-embed-card"><iframe width="200" height="113" src="https://www.youtube.com/embed/31EmOKBP1PY?start=1&amp;feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen="" title="A beginners guide to EXPLAIN ANALYZE – Michael Christofides"></iframe></figure><ul><li>And our <a href="https://timescale.ghost.io/blog/identify-postgresql-performance-bottlenecks-with-pg_stat_statements/">blog post on using pg_stat_statements to optimize queries</a>.</li></ul><h3 id="optimizing-distinct-query-performance-what-about-recursive-ctes">Optimizing DISTINCT query performance: What about RECURSIVE CTEs?</h3><p>If you're an experienced PostgreSQL user, you might point out that it is already possible to get reasonably fast <code>DISTINCT</code> queries via <code>RECURSIVE CTEs</code>.</p><p>As the <a href="https://wiki.postgresql.org/wiki/Loose_indexscan">PostgreSQL Wiki</a> shows, a <code>RECURSIVE CTE</code> can get you good results, but writing these kinds of queries can often feel cumbersome and unintuitive, especially for developers new to PostgreSQL:</p><pre><code class="language-sql">WITH RECURSIVE cte AS (
   (SELECT tags_id FROM cpu ORDER BY tags_id, time DESC LIMIT 1)
   UNION ALL
   SELECT (
      SELECT tags_id FROM cpu
      WHERE tags_id &gt; t.tags_id 
      ORDER BY tags_id, time DESC LIMIT 1
   )
   FROM cte t
   WHERE t.tags_id IS NOT NULL
)
SELECT * FROM cte LIMIT 50;
</code></pre><p>But even if writing a <code>RECURSIVE CTE</code> like this in day-to-day querying felt natural to you, there's a bigger problem. Most application developers, ORMs, and charting tools like Grafana or Tableau will still use the simpler, straight-forward form:</p><pre><code class="language-sql">SELECT DISTINCT ON (tags_id) * FROM cpu
WHERE tags_id &gt;=1 
ORDER BY tags_id, time DESC
LIMIT 50;</code></pre><p>In PostgreSQL, without a "Skip Scan" node, this query will perform the much slower Index Only Scan, causing your applications and graphing tools to feel clunky and slow.</p><p>Surely there's a better way, right?</p><h2 id="skipscan-is-the-way">SkipScan Is the Way</h2><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Screen-Shot-2021-04-16-at-1.45.56-PM.png" class="kg-image" alt="" loading="lazy" width="1472" height="608" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/01/Screen-Shot-2021-04-16-at-1.45.56-PM.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/01/Screen-Shot-2021-04-16-at-1.45.56-PM.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Screen-Shot-2021-04-16-at-1.45.56-PM.png 1472w" sizes="(min-width: 720px) 720px"></figure><p>SkipScan is an optimization for queries in the form of <code>SELECT DISTINCT ON (column)</code>. 
Conceptually, a SkipScan is a regular IndexScan that “skips” across an index looking for the next value that is greater than the current value:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/05/skip-scan-illustration.png" class="kg-image" alt="Illustration of how a Skip Scan search works on a Btree index" loading="lazy" width="519" height="253"><figcaption><i><em class="italic" style="white-space: pre-wrap;">SkipScan: An index scan that “skips” across an index looking for the next greater value</em></i></figcaption></figure><p>With SkipScan in Timescale/PostgreSQL, query planning and execution can now utilize a new node (displayed as <code>(SkipScan)</code> in the <code>EXPLAIN</code> output) to quickly return distinct items from a properly ordered index. </p><p>Rather than scanning the entire index with an Index Only Scan, SkipScan incrementally searches for each successive item in the ordered index. As it locates one item, the <code>(SkipScan)</code> node quickly restarts the search for the next item. This is a <em>much</em> more efficient way of finding distinct items in an ordered index. (<a href="https://github.com/timescale/timescaledb/blob/master/tsl/src/nodes/skip_scan/exec.c">See GitHub for more details.</a>)</p><h2 id="benchmarking-timescaledb-skipscan-vs-a-normal-postgresql-index-scan">Benchmarking TimescaleDB SkipScan vs. a Normal PostgreSQL Index Scan</h2><p>In every example query, <strong>Timescale with SkipScan improved query response times by at least 26x</strong>. 
</p><div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-emoji">✨</div><div class="kg-callout-text">If you don't want to go through the entire benchmark, here's a short and sweet piece on <a href="https://www.timescale.com/blog/skip-scan-under-load/" rel="noreferrer">SkipScan's performance under load</a>.</div></div><p><br></p><p>But the real surprise is <strong>how much of a difference it makes at lower cardinalities with lots of data:</strong> it is <strong>almost 8,500x faster to retrieve <em>all columns</em> for the most recent reading of each device</strong>. That's fast!</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/mandalorian-ships.gif" class="kg-image" alt="Mandalorian Razor Crest being chased by X-wing fighters" loading="lazy" width="1280" height="720" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/01/mandalorian-ships.gif 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/01/mandalorian-ships.gif 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/mandalorian-ships.gif 1280w" sizes="(min-width: 720px) 720px"></figure><p>In our tests, <strong>SkipScan is also consistently faster (by 80x or more) in our 4,000-device benchmarks</strong>. (This level of cardinality is typical for many users of Timescale.)</p><p>Before we share the full results, here is how our benchmark was set up.</p><h3 id="benchmark-setup">Benchmark setup</h3><p>To perform our benchmarks, we installed Timescale on a DigitalOcean Droplet with the following specifications. 
PostgreSQL and Timescale were installed from packages, and we applied the recommended tuning from <a href="https://github.com/timescale/timescaledb-tune"><code>timescaledb-tune</code></a>.</p><ul><li>8 Intel vCPUs</li><li>16&nbsp;GB of RAM</li><li>320&nbsp;GB NVMe SSD</li><li>Ubuntu 20.04 LTS</li><li>Postgres 12.6</li><li>TimescaleDB 2.2 <em>(The first release with SkipScan. TimescaleDB 2.2.1 primarily adds distributed hypertable support and some bug fixes.)</em></li></ul><p>To demonstrate the performance impact of SkipScan on varying degrees of cardinality, we benchmarked three separate datasets of varying sizes. To generate our datasets, we used the 'cpu-only' use case in the <a href="https://github.com/timescale/tsbs">Time Series Benchmark Suite (TSBS)</a>, which creates 10 metrics every 10 seconds for each device (identified by the <code>tag_id</code> in our benchmark queries).</p><table>
<thead>
<tr>
<th>Dataset 1</th>
<th>Dataset 2</th>
<th>Dataset 3</th>
</tr>
</thead>
<tbody>
<tr>
<td>100 devices</td>
<td>4,000 devices</td>
<td>10,000 devices</td>
</tr>
<tr>
<td>4 months of data</td>
<td>4 days of data</td>
<td>36 hours of data</td>
</tr>
<tr>
<td>~103,000,000 rows</td>
<td>~103,000,000 rows</td>
<td>~144,000,000 rows</td>
</tr>
</tbody>
</table>
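<p>To compare a query with and without SkipScan on a setup like this one, a session-level toggle can be used. The following is a minimal sketch rather than the exact benchmark script: it assumes the <code>timescaledb.enable_skipscan</code> setting (enabled by default) and the <code>cpu</code> hypertable generated by TSBS, so check the setting name against the documentation for your TimescaleDB version.</p><pre><code class="language-sql">-- Baseline: disable SkipScan for this session only
SET timescaledb.enable_skipscan TO off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT ON (tags_id) tags_id, time FROM cpu
ORDER BY tags_id, time DESC;

-- Restore the default and run the same query again;
-- if SkipScan is chosen, the plan shows Custom Scan (SkipScan) nodes
RESET timescaledb.enable_skipscan;
EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT ON (tags_id) tags_id, time FROM cpu
ORDER BY tags_id, time DESC;</code></pre><p>Running each variant several times, as in any benchmark, helps separate cold-cache from warm-cache timings.</p>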
<h3 id="additional-data-preparation">Additional data preparation</h3><p>Not all device data is up-to-date in real life because devices go offline and internet connections get interrupted. Therefore, to simulate a more realistic scenario (i.e., that some devices had stopped reporting for a period of time), we deleted rows for random devices over each of the following periods.</p><table>
<thead>
<tr>
<th>Dataset 1</th>
<th>Dataset 2</th>
<th>Dataset 3</th>
</tr>
</thead>
<tbody>
<tr>
<td>5 random devices over:</td>
<td>100 random devices over:</td>
<td>250 random devices over:</td>
</tr>
<tr>
<td>30 minutes</td>
<td>1 hour</td>
<td>10 minutes</td>
</tr>
<tr>
<td>36 hours</td>
<td>12 hours</td>
<td>1 hour</td>
</tr>
<tr>
<td>7 days</td>
<td>36 hours</td>
<td>12 hours</td>
</tr>
<tr>
<td>1 month</td>
<td>3 days</td>
<td>24 hours</td>
</tr>
</tbody>
</table>
<p>To delete the data, we used the <code>TABLESAMPLE</code> clause of Postgres. This <code>SELECT</code> feature allows you to return a random sample of rows from a table based on a percentage of the total rows. In the example below, we randomly sample 10% of the rows (<code>bernoulli(10)</code>) and then take the first 10 (<code>limit 10</code>).</p><pre><code class="language-sql">DELETE FROM cpu
WHERE tags_id IN 
  (SELECT id FROM tags tablesample bernoulli(10) LIMIT 10)
  AND time &gt;= now() - INTERVAL '30 minutes';</code></pre><p>From there, we ran each benchmarking query multiple times to accommodate for caching, with and without SkipScan enabled.</p><p>As mentioned earlier, the following two indexes were present on the hypertable for all queries.</p><pre><code class="language-sql">"cpu_tags_id_time_idx" btree (tags_id, "time" DESC)
"cpu_time_idx" btree ("time" DESC)</code></pre><h3 id="benchmark-results">Benchmark results</h3><p>Here are the results:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/05/Skip-Scan-vs-Normal.jpg" class="kg-image" alt="" loading="lazy" width="2000" height="2183" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2021/05/Skip-Scan-vs-Normal.jpg 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2021/05/Skip-Scan-vs-Normal.jpg 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2021/05/Skip-Scan-vs-Normal.jpg 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/05/Skip-Scan-vs-Normal.jpg 2000w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">TimescaleDB with SkipScan improved the query response by at least 26x, up to 8500x in some cases.</em></i></figcaption></figure><h2 id="about-the-queries-benchmarked">About the Queries Benchmarked</h2><p>For this test, we benchmarked five types of common queries:</p><h3 id="scenario-1-what-was-the-last-reported-time-of-each-device-in-a-paged-list">Scenario #1: What was the last reported time of each device in a paged list?</h3><pre><code class="language-sql">SELECT DISTINCT ON (tags_id) tags_id, time FROM cpu
ORDER BY tags_id, time DESC
LIMIT 10 OFFSET 50;</code></pre><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/SkipScan---Scenario-1.jpg" class="kg-image" alt="" loading="lazy" width="1095" height="236" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/01/SkipScan---Scenario-1.jpg 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/01/SkipScan---Scenario-1.jpg 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/SkipScan---Scenario-1.jpg 1095w" sizes="(min-width: 720px) 720px"></figure><h3 id="scenario-2-what-was-the-time-and-most-recently-reported-set-of-values-for-each-device-in-a-paged-list">Scenario #2: What was the time and most recently reported set of values for each device in a paged list?</h3><pre><code class="language-sql">SELECT DISTINCT ON (tags_id) * FROM cpu
ORDER BY tags_id, time DESC
LIMIT 10 OFFSET 50;</code></pre><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/SkipScan---Scenario-2.jpg" class="kg-image" alt="" loading="lazy" width="1095" height="236" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/01/SkipScan---Scenario-2.jpg 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/01/SkipScan---Scenario-2.jpg 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/SkipScan---Scenario-2.jpg 1095w" sizes="(min-width: 720px) 720px"></figure><h3 id="scenario-3-what-is-the-most-recent-point-for-all-reporting-devices-in-the-last-5-minutes">Scenario #3: What is the most recent point for all reporting devices in the last 5 minutes?</h3><pre><code class="language-sql">SELECT DISTINCT ON (tags_id) * FROM cpu 
WHERE time &gt;= now() - INTERVAL '5 minutes' 
ORDER BY tags_id, time DESC;</code></pre><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/SkipScan---Scenario-3.jpg" class="kg-image" alt="" loading="lazy" width="1095" height="236" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/01/SkipScan---Scenario-3.jpg 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/01/SkipScan---Scenario-3.jpg 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/SkipScan---Scenario-3.jpg 1095w" sizes="(min-width: 720px) 720px"></figure><h3 id="scenario-4-which-devices-reported-at-some-time-today-but-not-within-the-last-hour">Scenario #4: Which devices reported at some time today but not within the last hour?</h3><pre><code class="language-sql">WITH older AS (
  SELECT DISTINCT ON (tags_id) tags_id FROM cpu 
  WHERE time &gt; now() - INTERVAL '24 hours'
)                                          
SELECT * FROM older o 
WHERE NOT EXISTS (
  SELECT 1 FROM cpu 
  WHERE cpu.tags_id = o.tags_id 
  AND time &gt; now() - INTERVAL '1 hour'
);</code></pre><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/05/SkipScan---Scenario-4.jpg" class="kg-image" alt="" loading="lazy" width="2000" height="431" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2021/05/SkipScan---Scenario-4.jpg 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2021/05/SkipScan---Scenario-4.jpg 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2021/05/SkipScan---Scenario-4.jpg 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/05/SkipScan---Scenario-4.jpg 2000w" sizes="(min-width: 720px) 720px"></figure><h3 id="scenario-5-which-devices-reported-yesterday-but-not-in-the-last-24-hours">Scenario #5: Which devices reported yesterday but not in the last 24 hours?</h3><pre><code class="language-sql">WITH older AS (
  SELECT DISTINCT ON (tags_id) tags_id FROM cpu 
  WHERE time &gt; now() - INTERVAL '48 hours'
  AND time &lt; now() - INTERVAL '24 hours'
)                                          
SELECT * FROM older o 
WHERE NOT EXISTS (
  SELECT 1 FROM cpu 
  WHERE cpu.tags_id = o.tags_id 
  AND time &gt; now() - INTERVAL '24 hours'
);</code></pre><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/05/SkipScan---Scenario-5.jpg" class="kg-image" alt="" loading="lazy" width="2000" height="431" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2021/05/SkipScan---Scenario-5.jpg 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2021/05/SkipScan---Scenario-5.jpg 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2021/05/SkipScan---Scenario-5.jpg 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/05/SkipScan---Scenario-5.jpg 2000w" sizes="(min-width: 720px) 720px"></figure><h2 id="how-will-your-application-improve">How Will Your Application Improve?</h2><p>But SkipScan isn’t a theoretical improvement reserved for benchmarking blog posts 😉—it has real-world implications, and many applications we use rely on getting this data as fast as possible.</p><p>Think about the applications you use (or develop) every day. Do they retrieve paged lists of unique items from database tables to fill dropdown options (or grids of data)?</p><p>At a few thousand items, the query latency might not be very noticeable. But, as your data grows and you have millions of rows of data and tens of thousands of distinct items, that dropdown menu might take seconds—or minutes—to populate. 
</p><p>SkipScan can reduce that to tens of <em>milliseconds</em>!</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/tenor.gif" class="kg-image" alt="" loading="lazy" width="498" height="287"><figcaption><span style="white-space: pre-wrap;">Baby Yoda</span></figcaption></figure><p>Even better, SkipScan also provides a fast, efficient way of answering the question that so many people with time-series data ask every day:</p><p><em>"What was the last time and value recorded for each of my [devices / users / services / crypto and stock investments / etc]?"</em></p><p>As long as there is an index on "device_id" and "time" descending, SkipScan will retrieve the data using a query like this much more efficiently.</p><pre><code class="language-sql">SELECT DISTINCT ON (device_id) * FROM cpu 
ORDER BY device_id, time DESC;</code></pre><p>With SkipScan, your application and dashboards that rely on these types of queries will now load a whole lot faster 🚀  (see below).</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/05/4k_with_skipscan.gif" class="kg-image" alt="" loading="lazy" width="658" height="527" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2021/05/4k_with_skipscan.gif 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/05/4k_with_skipscan.gif 658w"><figcaption><i><em class="italic" style="white-space: pre-wrap;">TimescaleDB 2.2 </em></i><i><b><strong class="italic" style="white-space: pre-wrap;">with</strong></b></i><i><em class="italic" style="white-space: pre-wrap;"> SkipScan enabled runs in less than 400&nbsp;ms</em></i></figcaption></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/05/4k_without_skipscan.gif" class="kg-image" alt="" loading="lazy" width="658" height="527" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2021/05/4k_without_skipscan.gif 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/05/4k_without_skipscan.gif 658w"><figcaption><i><em class="italic" style="white-space: pre-wrap;">TimescaleDB 2.2</em></i><i><b><strong class="italic" style="white-space: pre-wrap;"> without</strong></b></i><i><em class="italic" style="white-space: pre-wrap;"> SkipScan enabled runs in 23 seconds</em></i></figcaption></figure><h2 id="how-to-use-skipscan-on-timescale">How to Use SkipScan on Timescale</h2><p>How do you get started? Upgrade to TimescaleDB 2.2.1 and set up your schema and indexing as described below. 
You should start to see immediate speed improvements in many of your <code>DISTINCT</code> queries.</p><p><strong>To ensure that a (SkipScan) node can be chosen for your query plan:</strong></p><p><strong>First, the query must use the <code>DISTINCT</code> keyword on a single column</strong>. The benchmarking queries above will give you some examples to draw from.</p><p><strong>Second, there must be an index that contains the <code>DISTINCT</code> column first, and any other <code>ORDER BY</code> columns.</strong> Specifically:</p><ul><li>The index needs to be a <code>BTREE</code> index.</li><li>The index needs to match the <code>ORDER BY</code> in your query.</li><li>The <code>DISTINCT</code> column must either be the first column of the index, or any leading column(s) must be used as constraints in your query.</li></ul><p>In practice, this means that if we use the questions from the beginning of this blog post ("retrieve a list of unique IDs in order" and "retrieve the last reading of each ID"), we would need at least one index like this (but if you're using a TimescaleDB hypertable, this likely already exists):</p><pre><code class="language-sql">"cpu_tags_id_time_idx" btree (tags_id, "time" DESC)</code></pre><p>With that index in place, you should start to see an immediate benefit if your queries look similar to the benchmarking examples above. When SkipScan is chosen for your query, the <code>EXPLAIN ANALYZE</code> output will show one or more <code>Custom Scan (SkipScan)</code> nodes similar to this:</p><pre><code>-&gt;  Unique
  -&gt;  Merge Append
    Sort Key: _hyper_8_79_chunk.tags_id, _hyper_8_79_chunk."time" DESC
     -&gt;  Custom Scan (SkipScan) on _hyper_8_79_chunk
      -&gt;  Index Only Scan using _hyper_8_79_chunk_cpu_tags_id_time_idx on _hyper_8_79_chunk
          Index Cond: (tags_id &gt; NULL::integer)
     -&gt;  Custom Scan (SkipScan) on _hyper_8_80_chunk
      -&gt;  Index Only Scan using _hyper_8_80_chunk_cpu_tags_id_time_idx on _hyper_8_80_chunk
         Index Cond: (tags_id &gt; NULL::integer)
...</code></pre><h2 id="learn-more-and-get-started">Learn More and Get Started</h2><p>If you’re new to Timescale, <a href="https://console.cloud.timescale.com/signup">create a free account</a> to get started with a fully managed TimescaleDB instance (100% free for 30 days).</p><p>If you are an existing user:</p><ul><li><strong>Timescale: </strong>TimescaleDB 2.2.1 is now the default for all new services on Timescale, and any of your existing services will be automatically upgraded during your next maintenance window.</li><li><strong>Self-managed TimescaleDB</strong>: <a href="https://docs.timescale.com/latest/update-timescaledb">Here are the upgrade instructions</a>.</li></ul><p>Join our <a href="https://slack.timescale.com">Slack Community</a> to share your results, ask questions, get advice, and connect with other developers (I, as well as our co-founders, engineers, and passionate community members, are active on all channels).</p><p>You can also <a href="https://github.com/timescale/timescaledb">visit our GitHub</a> to learn more (and, as always, ⭐️ are appreciated!).</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Time-Series Analytics for PostgreSQL: Introducing the Timescale Analytics Project]]></title>
            <description><![CDATA[We're excited to announce Timescale Analytics, a new project focused on combining all of the capabilities SQL needs to perform time-series analytics into one Postgres extension. Learn about our plans, why we're sharing it now, and ways to contribute your feedback and ideas.  ]]></description>
            <link>https://www.tigerdata.com/blog/time-series-analytics-for-postgresql-introducing-the-timescale-analytics-project</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/time-series-analytics-for-postgresql-introducing-the-timescale-analytics-project</guid>
            <category><![CDATA[Announcements & Releases]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Analytics]]></category>
            <dc:creator><![CDATA[David Kohn]]></dc:creator>
            <pubDate>Thu, 21 Jan 2021 03:16:15 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/01/alexander-andrews-4JdvOwrVzfY-unsplash.jpg">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2021/01/alexander-andrews-4JdvOwrVzfY-unsplash.jpg" alt="Time-Series Analytics for PostgreSQL: Introducing the Timescale Analytics Project" /><p>We're excited to announce Timescale Analytics, a new project focused on combining all of the capabilities SQL needs to perform time-series analytics into one <a href="https://www.tigerdata.com/blog/top-8-postgresql-extensions" rel="noreferrer">Postgres extension</a>. Learn about our plans, why we're announcing them now, and ways to contribute your feedback and ideas.  </p><p>At Timescale, our mission is to enable every software developer to store, analyze, and build on top of their time-series data so that they can measure what matters in their world: IoT devices, IT systems, marketing analytics, user behavior, financial metrics, and more. To this end, we’ve built a <a href="https://www.timescale.com/products" rel="noreferrer">petabyte-scale, time-series database</a>.</p><p>Today, we’re excited to announce the Timescale Analytics project, an initiative to make Postgres the best way to execute critical time-series queries quickly, analyze time-series data, and extract meaningful information. 
SQL is a powerful language (we're obviously big fans ourselves), and we believe that by adding a specialized set of functions for time-series analysis, we can make it even better.</p><p>The Timescale Analytics project aims to identify, build, and combine all of the functionality SQL needs to perform time-series analysis into a single extension.</p><p><strong>In other words, the Timescale Analytics extension will be a "one-stop shop" for time-series analytics in PostgreSQL, and we're looking for feedback from the community: what analytical functionality would you find most useful?</strong></p><p>We believe that it is important to develop our code in the open and are requiring radical transparency of ourselves: everything about this project, our priorities, intended features, trade-off discussions, and (tentative) roadmap, are available in <a href="https://github.com/timescale/timescale-analytics">our GitHub repository</a>.</p><p>It is our hope that working like this will make it easier for the community to interact with the project and allow us to respond quickly to community needs. </p><p>To this end, we’re announcing the project as early as possible, so we can get community feedback before we become too invested in a single direction. Over the next few weeks, we’ll be gathering thoughts on initial priorities and opening some sample PRs. Soon after that, we plan to create an initial version of the Timescale Analytics extension for you to experiment with.</p><p>Here are some examples of analytics functions we are considering adding: monotonic counters, tools for graphing, statistical sketching, and pipelining.</p><h2 id="monotonic-counters">Monotonic Counters</h2><p>A monotonically increasing counter is a type of metric often used in time-series analysis. Logically, such a counter should only ever increase, but the value is often read from an ephemeral source that can get reset back to zero at any time (due to crashes or other similar phenomena). 
To analyze data from such a source, you need to account for these resets: whenever the counter appears to decrease, you assume a reset occurred, and thus, you add the value after the reset to the value immediately prior to the reset.</p><p>Assume we have a counter that measures visitors to a website. If we were running a new marketing campaign focused on driving people to a new page on our site, we could use the change in the counter to measure the success of the campaign. While this kind of analysis can be performed in stock SQL, it quickly becomes unwieldy.</p><p>Using native SQL, such a query would look like:</p><pre><code class="language-SQL">SELECT sum(counter_reset_val) + last(counter, ts) - first(counter, ts) as counter_delta
FROM (
    SELECT *,
        CASE WHEN counter - lag(counter) OVER (ORDER BY ts ASC) &lt; 0
            THEN lag(counter) OVER (ORDER BY ts ASC)
            ELSE 0
        END as counter_reset_val
    FROM user_counter
) f;
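
-- Illustrative sketch (not part of the original proposal): the same reset
-- logic on hypothetical inline sample data. It uses only stock PostgreSQL
-- window functions and scalar subqueries in place of first()/last(), so it
-- can be run without any extension. user_counter_sample is a made-up name.
WITH user_counter_sample(ts, counter) AS (
    VALUES (1, 10), (2, 25), (3, 5), (4, 12)  -- the counter resets between ts 2 and ts 3
), f AS (
    SELECT ts, counter,
        CASE WHEN counter &lt; lag(counter) OVER (ORDER BY ts ASC)
            THEN lag(counter) OVER (ORDER BY ts ASC)  -- value lost at the reset
            ELSE 0
        END AS counter_reset_val
    FROM user_counter_sample
)
SELECT sum(counter_reset_val)
     + (SELECT counter FROM f ORDER BY ts DESC LIMIT 1)  -- last reading
     - (SELECT counter FROM f ORDER BY ts ASC LIMIT 1)   -- first reading
     AS counter_delta
FROM f;
-- counter_delta = 27: 15 before the reset (10 to 25), then 5 + 7 after it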
</code></pre><p></p><p>This is a relatively simple example, and more sophisticated queries are even more complicated.</p><p><a href="https://github.com/timescale/timescale-analytics/issues/4">One of our first proposals for capabilities to include in Timescale Analytics</a> would make this much simpler, allowing us to  write something like:</p><pre><code class="language-SQL">SELECT delta(counter_agg(counter, ts)) as counter_delta FROM user_counter;</code></pre><p></p><p>There are many examples like this: scenarios where it’s <em>possible</em> to solve the problem in stock SQL, but the resulting code is not exactly easy to write, nor pretty to read.</p><p>We believe we can solve that problem and make writing analytical SQL as easy as any other modern language.</p><h2 id="tools-for-graphing">Tools for Graphing</h2><p>When graphing time-series data, you often need to perform operations such as <a href="https://en.wikipedia.org/wiki/Change_detection">change-point analysis</a>, <a href="https://medium.com/@hayley.morrison/sampling-time-series-data-sets-fc16caefff1b">downsampling</a>, or <a href="https://dawn.cs.stanford.edu/2017/08/07/asap/">smoothing</a>. Right now, these are usually generated with a front-end service, such as <a href="https://grafana.com/">Grafana</a>, but this means the graphs you use are heavily tied to the renderer you’re using. </p><p>Moving these functions to the database offers a number of advantages:</p><ul><li>Users can choose their graphing front-end based on how well it does graphing, not on how well it does data analytics</li><li>Queries can remain consistent across all front-end tools and consumers of your data</li><li>Doing all the work in the database involves shipping a much smaller number of data points over the network</li></ul><p>Key to getting this project working is building the output formats that will work for a variety of front-ends and identifying the necessary APIs. 
If you have thoughts on the matter, please hop on our <a href="https://github.com/timescale/timescale-analytics/discussions/30">discussion threads</a>.</p><p>A fully worked-out pure-SQL example of a downsampling algorithm is too long to include inline here (for example, a worked-through version of largest-triangle-three-buckets can be found in <a href="https://medium.com/@hayley.morrison/sampling-time-series-data-sets-fc16caefff1b">this blog post</a>) – but with aggregate support could be as simple as:</p><pre><code class="language-SQL">SELECT lttb(time, value, num_buckets=&gt;500) FROM data;</code></pre><p></p><p>This could return a <code>timeseries</code> data type, which could be ingested directly into a tool like Grafana or another language, or it could be unnested to get back to the time-value pairs to send into an external tool. </p><p>These tools can then use the simplified query instead of doing their own custom analysis on your data.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Screen-Shot-2021-01-20-at-3.19.57-PM.png" class="kg-image" alt="Grafana dashboard UI, showing initial and downsampled data" loading="lazy" width="1600" height="944" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/01/Screen-Shot-2021-01-20-at-3.19.57-PM.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/01/Screen-Shot-2021-01-20-at-3.19.57-PM.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Screen-Shot-2021-01-20-at-3.19.57-PM.png 1600w" sizes="(min-width: 720px) 720px"></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Screen-Shot-2021-01-20-at-3.18.43-PM.png" class="kg-image" alt="Grafana 
dashboard UI, showing initial and downsampled data" loading="lazy" width="1600" height="944" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/01/Screen-Shot-2021-01-20-at-3.18.43-PM.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/01/Screen-Shot-2021-01-20-at-3.18.43-PM.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Screen-Shot-2021-01-20-at-3.18.43-PM.png 1600w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Example downsampling data from </span><a href="http://ecg.mit.edu/time-series/"><span style="white-space: pre-wrap;">this dataset</span></a><span style="white-space: pre-wrap;">. It keeps the large-scale features of the data, with an order of magnitude fewer data points</span></figcaption></figure><h2 id="statistical-sketching">Statistical Sketching</h2><p>Sketching algorithms, such as <a href="https://github.com/timescale/timescale-analytics/issues/1">t-digest</a>, <a href="https://github.com/timescale/timescale-analytics/issues/3">hyperloglog</a>, and <a href="https://github.com/timescale/timescale-analytics/issues/6">count-min</a>, allow us to get a quick, approximate answer for certain queries when the statistical bounds provided are acceptable.</p><p>This is even more exciting in the TimescaleDB ecosystem since it appears most of these sketches will fit nicely into <a href="https://docs.timescale.com/latest/using-timescaledb/continuous-aggregates">continuous aggregates</a>, allowing incredibly low query latency.</p><p>For instance, a continuous aggregate displaying the daily unique visitors to a website could be defined like:</p><pre><code class="language-SQL">CREATE MATERIALIZED VIEW unique_vistors
WITH (timescaledb.continuous) AS
    SELECT 
    time_bucket('1 day', time) as day, 
    hll(visitor_id) as visitors
    FROM connections
    GROUP BY time_bucket('1 day', time);
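
-- For comparison (a sketch, not part of the proposal): the exact stock-SQL
-- aggregate that the proposed hll() sketch would approximate, run against
-- the same connections table. Exact daily counts like these cannot simply
-- be summed into weekly or monthly unique counts, because a visitor seen on
-- two different days would be counted twice.
SELECT
    time_bucket('1 day', time) AS day,
    count(DISTINCT visitor_id) AS visitors
FROM connections
GROUP BY time_bucket('1 day', time);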
</code></pre><p></p><p>Such a view could be queried to get the visitors over a range of days, like so:</p><pre><code class="language-SQL">SELECT day, approx_distinct(visitors)
FROM unique_vistors
WHERE day &gt;= '2020-01-01' AND day &lt;= '2020-01-15';
</code></pre><p></p><p>Additionally, it would allow for re-aggregation to determine the number of unique visitors over a coarser time range, such as the number of monthly visitors: </p><pre><code class="language-SQL">SELECT time_bucket('30 days', day), approx_distinct(hll(visitors))
FROM unique_vistors
GROUP BY time_bucket('30 days', day);
</code></pre><p></p><h2 id="pipelining">Pipelining</h2><p>SQL queries can get long, especially when there are multiple layers of aggregation and function calls.</p><p>For instance, to write a pairwise delta at minute-granularity in TimescaleDB, we’d use something like:</p><pre><code class="language-SQL">SELECT minutes, sampled - lag(sampled) OVER (ORDER BY minutes) as delta
FROM (
    SELECT
        time_bucket_gapfill('1 minute', time) minutes,
        interpolate(first(value, time)) sampled
    FROM data
    GROUP BY time_bucket_gapfill('1 minute', time)
) interpolated;
</code></pre><p></p><p>To mitigate this, the Timescale Analytics proposal includes a unified <a href="https://github.com/timescale/timescale-analytics/discussions/10">pipeline API</a> capability that would allow us to use the much more straightforward (and elegant) query below:</p><pre><code class="language-SQL">SELECT timeseries(time, value) |&gt; sample('1 minute') |&gt; interpolate('linear') |&gt; delta() FROM data;</code></pre><p></p><p>Besides the simpler syntax, this API could also enable some powerful optimizations, such as incremental pipelines, single-pass processing, and vectorization. </p><p>This is still very much in the design phase, and we’re currently having discussions about what such an API should <a href="https://github.com/timescale/timescale-analytics/discussions/10">look like</a>, what pipeline elements <a href="https://github.com/timescale/timescale-analytics/discussions/26">are appropriate</a>, and what the textual format <a href="https://github.com/timescale/timescale-analytics/discussions/10#discussioncomment-282898">should be</a>.</p><h2 id="how-we%E2%80%99re-building-timescale-analytics">How We’re Building Timescale Analytics</h2><p>We’re building Timescale Analytics as a <a href="https://www.tigerdata.com/blog/top-8-postgresql-extensions" rel="noreferrer">PostgreSQL extension</a>. PostgreSQL's extension framework is quite powerful and allows for different levels of integration with database internals. </p><p>Timescale Analytics will be separate from the core TimescaleDB extension. This is because TimescaleDB core interfaces quite deeply into PostgreSQL’s internals—including the planner, executor, and DDL interfaces—due to the demands of time-series data storage. 
This necessitates a certain conservatism to its development process in order to ensure that updating TimescaleDB versions cannot damage existing databases, and that features interact appropriately with PostgreSQL’s core functionality.</p><p>By separating the new analytics functionality into a dedicated Timescale Analytics extension, we can vastly reduce the contact area for these new functions, enabling us to move faster without increased risk. We will be focusing on improvements that take advantage of the PostgreSQL extension hooks for creating functions, aggregates, operators, and other database objects, rather than those that require interfacing with the lower-level planning and execution infrastructure. Creating a separate extension also allows us to experiment with our build process and technologies, for instance, writing the extension <a href="https://github.com/zombodb/pgx">in Rust</a>.</p><p>More importantly, we hope using a separate extension will lower barriers for community contributions. We know that the complexity of our integrations with PostgreSQL can make it difficult to contribute to TimescaleDB proper. We believe this new project will allow for much more self-contained contributions by avoiding projects requiring deep integration with the PostgreSQL planner or executor.</p><p>So, if you’ve been wanting to contribute back but didn’t know how or are a Rustacean looking to get involved in databasing, please join us!</p><h2 id="get-involved">Get Involved</h2><p>Before the code is written is the perfect time to have a say in where the project will go. To this end, we want—and need—your feedback: what are the frustrating parts of analyzing time-series data? What takes far more code than you feel it should? 
What runs slowly or only runs quickly after seemingly arcane rewrites?</p><p>We want to solve community-wide problems and incorporate as much feedback as possible, in addition to relying on our intuition, observation, and experiences.</p><p><strong>Want to help? </strong>You can submit suggestions and help shape the direction in 3 primary ways:</p><ul><li><a href="https://github.com/timescale/timescale-analytics/discussions"><strong>Look at some of the discussions</strong></a> we’re having right now and weigh in with your opinions. Any and all comments are welcome, whether you’re an experienced developer or just learning.</li><li><a href="https://github.com/timescale/timescale-analytics/labels/proposed-feature"><strong>Check out the features</strong></a> we’re thinking of adding, and weigh in on if they’re something you want, if we’re missing something, or if there are any issues or alternatives. We are releasing nightly Docker images of our builds.</li><li><a href="https://github.com/timescale/timescale-analytics/labels/feature-request"><strong>Explore our running feature requests, add a +1, </strong></a><strong> and </strong><a href="https://github.com/timescale/timescale-analytics/issues/new?assignees=&amp;labels=feature-request&amp;template=feature-request.md&amp;title="><strong>contribute your own</strong></a>.</li></ul><p><strong>Most importantly: </strong><a href="https://github.com/timescale/timescale-analytics/discussions"><strong>share your problems</strong></a><strong>! </strong>Tell us the kinds of queries or analyses you wish were easier, the issues you run into, or the workarounds you’ve created to solve gaps. (Example datasets are especially helpful, as they concretize your problems and create a shared language in which to discuss them.)</p><p><br></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[TimescaleDB vs. Amazon Timestream: 6,000x Higher Inserts, 5-175x Faster Queries, 150-220x Cheaper]]></title>
            <description><![CDATA[Our TimescaleDB vs Amazon Timestream results surprised us, but even after testing several configurations, we found Timestream slow, expensive, and missing key database capabilities like backups, restores, updates, and deletes. ]]></description>
            <link>https://www.tigerdata.com/blog/timescaledb-vs-amazon-timestream-6000x-higher-inserts-175x-faster-queries-220x-cheaper</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/timescaledb-vs-amazon-timestream-6000x-higher-inserts-175x-faster-queries-220x-cheaper</guid>
            <category><![CDATA[Announcements & Releases]]></category>
            <category><![CDATA[Benchmarks & Comparisons]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Ryan Booz]]></dc:creator>
            <pubDate>Wed, 02 Dec 2020 19:41:47 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/08/timescale-vs-amazon-timestream-2023-08-02---Timestream---Compare---Hero.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/08/timescale-vs-amazon-timestream-2023-08-02---Timestream---Compare---Hero.png" alt="The Timescale vs Amazon Timestream logos over a black background" /><p>This post compares TimescaleDB and Amazon Timestream across quantitative and qualitative dimensions. </p><p>Yes, we are the developers of TimescaleDB, so you might quickly disregard our comparison as biased. But if you let the analysis speak for itself, you’ll find that we stay as objective as possible and aim to be fair to <a href="https://www.tigerdata.com/blog/so-long-timestream-how-and-why-to-migrate-before-its-too-late" rel="noreferrer">Amazon Timestream</a> in our testing and results reporting. </p><p>Also, if you want to check our work or run your own analysis, we provide all our testing via the <a href="https://github.com/timescale/tsbs">Time Series Benchmark Suite</a>, an open-source project that anyone can use and contribute to.</p><h2 id="about-timescaledb-and-amazon-timestream">About TimescaleDB and Amazon Timestream</h2><p><strong>TimescaleDB, </strong>first launched in <a href="https://timescale.ghost.io/blog/blog/when-boring-is-awesome-building-a-scalable-time-series-database-on-postgresql-2900ea453ee2/">April 2017</a>, is today the industry-leading relational database for <a href="https://www.tigerdata.com/blog/time-series-introduction" rel="noreferrer">time series</a>, open-source, engineered on top of PostgreSQL, and offered via download or as a fully managed service on AWS. 
</p><p>The TimescaleDB community has become the largest developer community for time-series data: tens of millions of downloads; over 500,000 active databases; organizations like AppDynamics, Bosch, Cisco, Comcast, Credit Suisse, DigitalOcean, Dow Chemical, Electronic Arts, Fujitsu, IBM, Microsoft, Rackspace, Schneider Electric, Samsung, Siemens, Uber, Walmart, Warner Music, WebEx, and thousands of others (all in addition to the PostgreSQL community and ecosystem).</p><p><strong>Amazon Timestream </strong>was first announced at AWS re:Invent <a href="https://aws.amazon.com/blogs/aws/aws-previews-and-pre-announcements-at-reinvent-2018-andy-jassy-keynote/">November 2018</a>, but its launch was delayed until <a href="https://aws.amazon.com/blogs/aws/store-and-access-time-series-data-at-any-scale-with-amazon-timestream-now-generally-available/">September 2020</a>. This is Amazon’s time-series database-as-a-service. Amazon Timestream not only shares a similar name to TimescaleDB, but also embraces SQL as its query language. Amazon Timestream customers include Autodesk, PubNub, and Trimble.</p><p>We compare TimescaleDB and Amazon Timestream across several dimensions:</p><ul><li>Insert and query performance</li><li>Cost for equivalent workloads</li><li>Backups, reliability, and tooling</li><li>Query language, ecosystem, ease-of-use</li><li>Clouds and regions supported</li></ul><p>Below is a summary of our results. For those interested, we go into much more detail later in this post.</p><h2 id="insert-performance-query-performance">Insert Performance, Query Performance</h2><p>Our results are striking. TimescaleDB outperformed Amazon Timestream 6,000x on inserts and 5-175x on queries, depending on the query type. 
In particular, there were workloads and query types easily supported by TimescaleDB that Amazon Timestream was unable to handle.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/08/timescale-vs-amazon-timestream-Datasets-for-benchmarking-table.png" class="kg-image" alt="TimescaleDB achieves 6,000 times higher inserts than Amazon Timestream" loading="lazy" width="1355" height="891" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/08/timescale-vs-amazon-timestream-Datasets-for-benchmarking-table.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/08/timescale-vs-amazon-timestream-Datasets-for-benchmarking-table.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/08/timescale-vs-amazon-timestream-Datasets-for-benchmarking-table.png 1355w" sizes="(min-width: 720px) 720px"></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/08/timescale-vs-amazon-timestream-175x-faster-queries.png" class="kg-image" alt="TimescaleDB achieved 5 to 175 times better query performance than Amazon Timestream" loading="lazy" width="1562" height="1160" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/08/timescale-vs-amazon-timestream-175x-faster-queries.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/08/timescale-vs-amazon-timestream-175x-faster-queries.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/08/timescale-vs-amazon-timestream-175x-faster-queries.png 1562w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: 
pre-wrap;">Note: Several queries’ ratios (high-cpu-all, lastpoint, groupby-orderby-limit) are “undefined” because Amazon Timestream did not finish executing them within the default 60-second timeout period that Timestream imposes, while TimescaleDB completed them in less than a single second</em></i></figcaption></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/08/timescale-vs-amazon-timestream-Query-Performance.png" class="kg-image" alt="Table showing latency ratios for various queries - run on 100 devices x 10 metrics - in milliseconds" loading="lazy" width="1710" height="1496" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/08/timescale-vs-amazon-timestream-Query-Performance.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/08/timescale-vs-amazon-timestream-Query-Performance.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2023/08/timescale-vs-amazon-timestream-Query-Performance.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/08/timescale-vs-amazon-timestream-Query-Performance.png 1710w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Results of benchmarking query performance between TimescaleDB and Amazon Timestream</em></i></figcaption></figure><p>These results were so dramatic that we did not believe them at first, and we tried a variety of workloads and settings to ensure we weren’t missing anything. </p><p><a href="https://www.reddit.com/r/aws/comments/jsgn9x/realworld_aws_timestream_ingest_performance/">We even posted on Reddit</a> to see if others had been able to get better performance with Amazon Timestream. 
Although feedback was hard to find, we weren’t the only ones seeing these performance results, as evidenced by a <a href="https://crate.io/a/amazon-timestream-first-impressions/">similar benchmark by Crate.io</a>.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Screen-Shot-2020-11-17-at-12.00.37-PM--1-.png" class="kg-image" alt="Screenshot of Amazon Timestream UI showing list of databases created during benchmarking process" loading="lazy" width="1584" height="1262" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/01/Screen-Shot-2020-11-17-at-12.00.37-PM--1-.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/01/Screen-Shot-2020-11-17-at-12.00.37-PM--1-.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Screen-Shot-2020-11-17-at-12.00.37-PM--1-.png 1584w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">We REALLY tried to get Amazon Timestream to perform better. Just look at all of the databases we created through the process!</em></i></figcaption></figure><p>After all of our attempts to achieve better Amazon Timestream performance, we were even more confused when we read a <a href="https://aws.amazon.com/blogs/database/deriving-real-time-insights-over-petabytes-of-time-series-data-with-amazon-timestream/">recent post on the AWS Database Blog</a> that discusses achieving ingest speeds of three billion metrics/hour. 
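</p><p>The arithmetic behind that figure can be sketched in plain PostgreSQL. This is only a back-of-the-envelope check, assuming the inputs reported in the AWS post (four million hosts, 26 metrics per host, one report every two minutes); the column names are ours and purely illustrative.</p><pre><code class="language-SQL">-- Assumed inputs: 4,000,000 hosts, 26 metrics per host, one report every 120 seconds.
SELECT
    round(4000000 / 120.0)      AS hosts_per_second,    -- about 33,333
    round(4000000 * 26 / 120.0) AS metrics_per_second,  -- about 866,667
    4000000::bigint * 26 * 30   AS metrics_per_hour;    -- 30 reports per host per hour: 3,120,000,000</code></pre><p>The <code>bigint</code> cast is there because the hourly product exceeds the range of PostgreSQL’s 32-bit <code>integer</code> type. 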
</p><p>Although the details of how they ingested this scale of data aren’t completely clear, it appears that each “monitored host” sent individual metrics at various intervals directly to Amazon Timestream.</p><p>To achieve three billion metrics/hour in their test, four million hosts sent 26 metrics every two minutes, an average of 33,000 hosts reporting 866,667 metrics every second. </p><p>It’s certainly impressive to support 33,000 connections per second without issue, and this demonstrates one of the key advantages that Amazon presents with a serverless architecture like Timestream. </p><p>If you have an edge-based IoT system that pre-computes metrics on thousands of edge nodes before sending them, Amazon Timestream could simplify your data collection architecture. </p><p>However, as you’ll see, if you have a more traditional client-server data-collection architecture, or <a href="https://timescale.ghost.io/blog/blog/create-a-data-pipeline-with-timescaledb-and-kafka/">one using a more common streaming pipeline with database consumers, like Apache Kafka</a>, TimescaleDB can import more than three million metrics per second from one client—and doesn’t need 33,000 clients.</p><p>Because performance benchmarking is complex, we share the details of our setup, configurations, and workload patterns later in this post, as well as instructions on how to reproduce them.</p><h2 id="cost-for-equivalent-workloads">Cost for Equivalent Workloads</h2><p>The stark difference in performance translates into a large cost differential as well. 
</p><p>To compare costs, we calculated the cost for our above insert and query workloads, which store one billion metrics in TimescaleDB and ~410 million metrics in Amazon Timestream (because we were unable to load the full one billion—more later in this post), and ran our suite of queries on top.</p><p>For the same workloads, we found that <a href="https://www.timescale.com/products">fully managed TimescaleDB</a> is 154x cheaper than Amazon Timestream (224x cheaper if you’re self-managing TimescaleDB on a virtual machine) and inserted twice as many metrics.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/08/timescale-vs-amazon-timestream-154x-cheaper-than-Amazon-Timestream.png" class="kg-image" alt="To run this test TimescaleDB took less than an hour at a cost of $2.18. The same test took a week and cost $336 in Amazon Timestream" loading="lazy" width="1382" height="746" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/08/timescale-vs-amazon-timestream-154x-cheaper-than-Amazon-Timestream.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/08/timescale-vs-amazon-timestream-154x-cheaper-than-Amazon-Timestream.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/08/timescale-vs-amazon-timestream-154x-cheaper-than-Amazon-Timestream.png 1382w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The results of the costs for equivalent workloads</span></figcaption></figure><p>We go into further details about the cost comparison later in this post.</p><h2 id="backups-reliability-and-tooling">Backups, Reliability, and Tooling</h2><p>For reliability, the differences are also striking. 
In particular, backups, reliability, and tooling feel like an afterthought with Amazon Timestream.</p><p>In the <a href="https://docs.aws.amazon.com/timestream/latest/developerguide/timestream.pdf">240-page development guide</a> for Amazon Timestream, the words “recovery” and “restore” don’t appear at all, and the word “backup” appears only once to tell the developer that there is no backup mechanism. </p><p>Instead, you can “[...]write your own application using the Timestream SDK to query data and save it to the destination of your choice” (page 100). There isn’t a mechanism or support to DELETE or UPDATE existing data.<strong> </strong></p><p><strong>The only way to remove data is to drop the entire table.</strong> Furthermore, there is no way to recover a deleted table since it is an atomic action that cannot be recovered through any Amazon API or Console.</p><p>In contrast, TimescaleDB is built on PostgreSQL, which means it inherits the 25+ years of hard, careful engineering work that the entire PostgreSQL community has done to build a rock-solid database that supports millions of mission-critical applications worldwide. </p><p>When operating TimescaleDB, one inherits all of the battle-tested tools that exist in the PostgreSQL ecosystem: <a href="https://www.postgresql.org/docs/9.6/static/app-pgdump.html">pg_dump</a>/<a href="https://www.postgresql.org/docs/9.6/static/app-pgrestore.html">pg_restore</a> and <a href="http://www.postgresql.cn/docs/9.6/app-pgbasebackup.html">pg_basebackup</a> for backup/restore, high-availability/failover tools like <a href="https://github.com/zalando/patroni">Patroni</a>, load balancing tools for clustering reads like <a href="http://www.pgpool.net/mediawiki/index.php/Main_Page">Pgpool</a>/<a href="https://www.tigerdata.com/blog/using-pgbouncer-to-improve-your-postgresql-database-performance" rel="noreferrer">pgbouncer</a>, etc. Since TimescaleDB looks and feels like PostgreSQL, there are minimal operational learning curves. 
TimescaleDB “just works,” as one would expect from PostgreSQL.</p><h2 id="query-language-ecosystem-and-ease-of-use">Query Language, Ecosystem, and Ease-Of-Use</h2><p>We applaud Amazon Timestream’s decision to adopt SQL as their query language. Even if Amazon Timestream functions like a NoSQL database in many ways, opting for SQL as the query interface lowers developers’ barrier to entry—especially when compared to other databases like MongoDB and InfluxDB.</p><p>That said, because Amazon Timestream is <strong>not </strong>a relational database, it doesn’t support normalized datasets and JOINs across tables. Also, because Amazon Timestream enforces a specific narrow table model on your data, deriving value from your data relies heavily on CASE statements and Common Table Expressions (CTEs) when requesting multiple measurement values (defined by “measurement_name”), leading to some clunky queries (see example later in this post).</p><p>TimescaleDB, on the other hand, has fully embraced all parts of the SQL language from day one—and <a href="https://docs.timescale.com/v2.0/api#analytics">extended SQL with functions custom-built to simplify time-series analysis</a>. TimescaleDB is also a relational database, allowing developers to store their metadata alongside their time-series data and JOIN across tables as necessary. Consequently, with TimescaleDB, new users have a minimal learning curve and are in full control when querying their data. </p><p>Full SQL means that TimescaleDB supports everything that SQL has to offer, including normalized datasets, cross-table JOINs, subqueries, stored procedures, and user-defined functions. 
</p><p>Supporting SQL also enables TimescaleDB to support everything in the SQL ecosystem, including Tableau, Looker, PowerBI, Apache Kafka, Apache Spark, Jupyter Notebooks, R, native libraries for every major programming language, and much more.</p><p>For example, if you already use <a href="https://docs.timescale.com/use-timescale/latest/integrations/observability-alerting/tableau/" rel="noreferrer">Tableau to visualize data</a> or Apache Spark for data processing, TimescaleDB can plug right into the existing infrastructure due to its compatible connectors. And, given its roots, TimescaleDB supports everything in the PostgreSQL ecosystem, including tools like EXPLAIN that help pinpoint why queries are slow and identify ways to improve performance.</p><p>By contrast, even though Amazon Timestream speaks a variant of SQL, it is “SQL-like,” not full SQL. Thus, tooling that normally works with SQL—e.g., the Tableau and Apache Spark examples cited above—is unable to use Amazon Timestream data unless that tool incorporates Amazon Timestream-specific drivers and its SQL-like dialect. </p><p>This means, for example, that the tooling you might normally use to help you improve query performance doesn’t currently support Amazon Timestream. And, unfortunately, the current Amazon Timestream UI doesn’t give us any clues about why queries might be performing poorly or ways to improve performance (e.g., via settings or query hints).</p><p>In short, if you use PostgreSQL and any tools or extensions with your applications, they will “just work” when connected to TimescaleDB. The same isn’t true for Amazon Timestream. </p><p>So, while adopting a SQL-like query language is a great start for Amazon Timestream, we found that it leaves a lot to be desired for a truly “easy” developer experience.</p><h2 id="cloud-offering">Cloud Offering</h2><p>Amazon Timestream is offered only as a serverless cloud service on Amazon Web Services. As of writing, it is available in three U.S. 
regions and one E.U. region. </p><p>Conversely, TimescaleDB can be run in your own infrastructure or fully managed <a href="https://www.timescale.com/products">through our cloud offering</a>, which makes TimescaleDB available on AWS in over 75 regions, with many possible region/storage/compute configurations.</p><p>With TimescaleDB, our goal is to give our customers their choice of cloud and the ability to choose the region closest to their customers and co-locate with their other workloads.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/08/timescale-vs-amazon-timestream--most-clouds-and-regions-of-any-managed-service.png" class="kg-image" alt="TimescaleDB is available in all 3 clouds and more than 75 regions. Amazon Timestream is only available in 3 regions" loading="lazy" width="1457" height="1274" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/08/timescale-vs-amazon-timestream--most-clouds-and-regions-of-any-managed-service.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/08/timescale-vs-amazon-timestream--most-clouds-and-regions-of-any-managed-service.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/08/timescale-vs-amazon-timestream--most-clouds-and-regions-of-any-managed-service.png 1457w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Cloud coverage comparison between Amazon Timestream and TimescaleDB (as of November 2020)</em></i></figcaption></figure><h2 id="what-we-like-about-amazon-timestream">What We Like About Amazon Timestream</h2><p>Despite the results of this comparison, we liked several things about Amazon Timestream. </p><p>Highest on the list is the simplicity of setting up a serverless offering. 
There aren’t any sizing decisions or knobs to tweak. Simply create a database and a table, and then start sending data. </p><p>This brings up a second advantage of a serverless architecture: even if throughput isn’t ideal from a single client, the service appears to handle thousands of connections without issue. <a href="https://docs.aws.amazon.com/timestream/latest/developerguide/timestream.pdf">According to their documentation</a>, Amazon Timestream will eventually add more resources to keep up with additional ingest or query threads. This means that an application shouldn’t be limited by the resources of one particular server for reads or writes.</p><p>Like with other NoSQL databases, some may find the schemaless nature of Amazon Timestream appealing, especially when a project is just getting off the ground. </p><p>Although schemas become more necessary as workloads grow, for both performance and data validation reasons, one of the reasons databases like MongoDB have grown in popularity is that they don’t require the same upfront planning as more traditional SQL databases.</p><p>Lastly, SQL. It shouldn’t come as a surprise that we like SQL as an efficient interface for the data we need to examine. And, although Amazon Timestream lacks some support for standard SQL dialects, most users will find it pretty straightforward to start querying data (after they understand Amazon Timestream’s narrow table model).</p><h3 id="but-why-is-amazon-timestream-so-expensive-slow-and-underwhelming">But why is Amazon Timestream so expensive, slow, and underwhelming?</h3><p>The reality is that Amazon Timestream, despite taking two years post-announcement to launch, still seems half-baked.</p><p>Why is Amazon Timestream so expensive, slow, and seemingly underdeveloped? We assume the reason is its underlying architecture. 
</p><p>Unlike other systems that we and others have benchmarked via the <a href="https://github.com/timescale/tsbs">Time-Series Benchmark Suite</a> (e.g., <a href="https://timescale.ghost.io/blog/blog/timescaledb-vs-influxdb-for-time-series-data-timescale-influx-sql-nosql-36489299877/">InfluxDB</a> and <a href="https://timescale.ghost.io/blog/blog/how-to-store-time-series-data-mongodb-vs-timescaledb-postgresql-a73939734016/">MongoDB</a>), Amazon Timestream is completely closed-source. </p><p>Based on our usage and experience with other Amazon services, we suspect that under the hood Amazon Timestream is backed by a combination of other Amazon services similar to Amazon ElastiCache, Athena, and S3. But because we cannot inspect the source code (and because Amazon does not make this sort of information public), this is just a guess.</p><p>By comparison, all of the <a href="https://github.com/timescale/timescaledb">source code for TimescaleDB</a> is available for anyone to inspect. We built TimescaleDB on top of PostgreSQL, giving it a rock-solid foundation and large ecosystem, and then spent years adding advanced capabilities to increase performance, lower costs, and improve the developer experience. </p><p>These capabilities include <a href="https://docs.timescale.com/latest/using-timescaledb/hypertables">auto-partitioning via hypertables and chunks</a>, faster queries via <a href="https://docs.timescale.com/latest/using-timescaledb/continuous-aggregates">continuous aggregates</a>, lower costs via 94&nbsp;%+ <a href="https://docs.timescale.com/use-timescale/latest/compression/about-compression/" rel="noreferrer">native compression</a>, high performance (10+ million inserts a second), and more. </p><p>We believe the real reason behind the difference between the two products is the companies building these products and how each approaches software development, community, and licensing. 
(And kind reader, this is where our bias may sneak in a little bit.)</p><h2 id="amazon-vs-timescale">Amazon vs. Timescale</h2><p>The viability of our company, Timescale, is 100&nbsp;% dependent on the quality of TimescaleDB. If we build a sub-par product, we cease to exist. </p><p>Amazon Timestream is just another of the <a href="https://en.wikipedia.org/wiki/List_of_Amazon_products_and_services">200+ services that Amazon is developing</a>. Regardless of the quality of Amazon Timestream, that team will still be supported by the rest of Amazon’s business—and if the product gets shut down, that team will find homes elsewhere within the larger company.</p><p>One can see this difference in how the two companies approach the developer community. Without a doubt, <a href="https://www.zdnet.com/article/the-top-cloud-providers-of-2020-aws-microsoft-azure-google-cloud-hybrid-saas/">Amazon Web Services is a leader in all things cloud computing</a>. However, with its enormous catalog of cloud services, many originally derived from external open-source projects, Amazon’s attention is spread over hundreds of products. </p><p>Case in point, when Amazon Timestream was announced in 2018, there was strong interest in when it would be released and how it would perform. However, after a two-year delay, with no information from Amazon, <a href="https://www.reddit.com/r/aws/comments/i5gomj/aws_timestream_still_happening_in_2020/">many gave up on waiting for the product</a>. When the product was finally released on September 30, 2020, there was <a href="https://news.ycombinator.com/item?id=24645416">very little fanfare from the community</a>.</p><p>In contrast, Timescale develops its <a href="https://github.com/timescale/timescaledb">source code</a> out in the open, and developers can reach us for help anytime directly via our <a href="https://slack.timescale.com/">Slack channel</a> (which is staffed by our engineers), whether they are a paying customer or not. 
We’ve continued to invest in our community by making all our software available for free while also serving our customers with our <a href="https://www.timescale.com/products/">hosted and fully managed cloud services</a>. </p><p>Building a high-performance, cost-effective, reliable, and easy-to-use time-series database is a hard and increasingly business-critical problem. For us, building TimescaleDB into a best-in-class time-series developer experience is an existential requirement. Without it, we cease to exist. For Amazon, Amazon Timestream is just a checkbox, another service to list on their website.</p><h3 id="when-amazon-is-forced-to-compete-on-product-quality-all-open-source-companies-have-a-shot-at-building-great-businesses">When Amazon is forced to compete on product quality, all open-source companies have a shot at building great businesses</h3><p>Amazon has a history of offering services that take advantage of the R&amp;D efforts of others: for example, <a href="https://aws.amazon.com/elasticsearch-service/">Amazon Elasticsearch Service</a>, <a href="https://aws.amazon.com/msk/">Amazon Managed Streaming for Apache Kafka</a>, <a href="https://aws.amazon.com/elasticache/redis/">Amazon ElastiCache for Redis</a>, and many others. </p><p>If Amazon wanted to launch a time-series database service that supported SQL, why did they build one from scratch and not just offer managed TimescaleDB?</p><p>Answer: our innovative licensing. The core of TimescaleDB is open-source, licensed under Apache 2. But advanced capabilities, such as compression and continuous aggregates, are licensed under the Timescale License, a source-available license that is open-source in spirit and makes all software available for free—but contains a critical restriction: preventing companies from offering that software via a hosted database-as-a-service. 
</p><p>The Timescale License is an example of a <em>“Cloud Protection License,”</em> a class of licenses that recognize that the cloud has increasingly become the dominant form of open-source commercialization. </p><p>These licenses therefore reserve the right to offer the software in the cloud for the main creator/maintainer of the project (who often contributes 99&nbsp;% of the R&amp;D effort). (Read more about <a href="https://timescale.ghost.io/blog/blog/building-open-source-business-in-cloud-era-v2/">how we're building a self-sustaining open-source business in the cloud era</a>.)</p><p>This “cloud protection” prevents Amazon from just distributing our R&amp;D and forces them to develop their own offering and compete on product quality, not just distribution. And as we can see from Amazon Timestream, building best-in-class database technologies is not easy, even for a company like Amazon.</p><p><strong>The truth is that when Amazon is forced to compete on product quality, all open-source companies have a shot at building great businesses. </strong></p><p>We welcome Amazon’s new entry to the time-series database market and appreciate that developers now have even more choices for storing and analyzing their time-series data. Competition is good for developers and helps drive further innovation. </p><p>For those who want to dig deeper into our benchmarking and comparison, we include detailed notes and methodology below.</p><p>For those who want to try Timescale, <a href="https://console.cloud.timescale.com/signup">create a free account</a> to get started with a fully managed TimescaleDB instance (100&nbsp;% free for 30 days). </p><p>Want to host TimescaleDB yourself? 
<a href="https://github.com/timescale/timescaledb">Visit our GitHub</a> to learn more about options, get installation instructions, and more (and, if you like what you see, ⭐️  are always appreciated!).</p><p>Join our <a href="http://slack.timescale.com/">Slack community</a> to ask questions, get advice, and connect with other developers (I, as well as our co-founders, engineers, and passionate community members are active on all channels).</p><h2 id="performance-comparison-details">Performance Comparison Details</h2><p>Here is a quantitative comparison of the two databases across insert and query workloads.</p><p><em>Note: We've released all the code and data used for the below benchmarks as part of the open-source Time Series Benchmark Suite (TSBS) (</em><a href="https://github.com/timescale/tsbs"><em>GitHub</em></a><em>, </em><a href="https://timescale.ghost.io/blog/blog/time-series-database-benchmarks-timescaledb-influxdb-cassandra-mongodb-bc702b72927e/?utm_source=timescale-influx-benchmark&amp;utm_medium=blog&amp;utm_campaign=july-2020-advocacy&amp;utm_content=tsbs-announcemement-blog"><em>announcement</em></a><em>), so you can reproduce our results or run your own analysis. </em></p><p><em>Typically, when we conduct performance benchmarks (for example, in our previous benchmarks versus </em><a href="https://timescale.ghost.io/blog/blog/timescaledb-vs-influxdb-for-time-series-data-timescale-influx-sql-nosql-36489299877/"><em>InfluxDB</em></a><em> and </em><a href="https://timescale.ghost.io/blog/blog/how-to-store-time-series-data-mongodb-vs-timescaledb-postgresql-a73939734016/"><em>MongoDB</em></a><em>) we use five different dataset configurations. These configurations increase metric loads and cardinalities, to simulate a breadth of time-series workloads for inserts and queries. 
</em></p><p><em>However, as you’ll see below, because of performance issues with Amazon Timestream, we were unable to look at Amazon Timestream’s performance under higher cardinalities and were limited to testing just our lowest-cardinality dataset.</em><br></p><h3 id="machine-configuration">Machine Configuration</h3>
<!--kg-card-begin: html-->
<p>
    <strong>Amazon Timestream </strong>
    <br style="display: block" />
    Amazon Timestream is a serverless offering, which means that a user cannot provision a specific service tier. The only meaningful configuration options that a user can modify are the “memory store retention” period and the “magnetic store retention” period. In Amazon Timestream, data can only be inserted into a table if the timestamp falls within the memory store retention period. Therefore, the only setting we changed before inserting data for our first test was the memory store retention period, which we set to 865 hours (~36 days) to leave padding for a slower insert rate.
</p>
<!--kg-card-end: html-->
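<p>The write-window rule described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not Amazon Timestream SDK code: the helper name <code>is_writable</code> is hypothetical, and we assume the window is measured backward from the current time.</p>

```python
from datetime import datetime, timedelta, timezone

# Value from the benchmark setup: memory store retention of 865 hours.
MEMORY_STORE_RETENTION_HOURS = 865


def is_writable(record_time: datetime, now: datetime,
                retention_hours: int = MEMORY_STORE_RETENTION_HOURS) -> bool:
    """Return True if record_time falls inside the retention window ending at now.

    Hypothetical helper: it only models the rule described in the text and
    does not call any AWS API.
    """
    oldest_allowed = now - timedelta(hours=retention_hours)
    return record_time >= oldest_allowed


now = datetime(2020, 11, 1, tzinfo=timezone.utc)        # example "current" time
retention_days = MEMORY_STORE_RETENTION_HOURS / 24      # roughly 36 days
accepted = is_writable(now - timedelta(days=30), now)   # inside the window
rejected = is_writable(now - timedelta(days=40), now)   # older than the window
```

<p>Under these assumptions, a reading that is 30 days old would be accepted and one that is 40 days old would be rejected, which is why the retention period had to exceed the span of the historical dataset.</p>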
<p><strong>It did not take long for us to realize that Amazon Timestream’s insert performance was dramatically slower than other time-series databases we’ve benchmarked</strong>. Therefore, we took extra time to test insert performance using three different Amazon EC2 instance configurations, each launched in the same region as our Amazon Timestream database:</p><ul><li>t3.medium running Ubuntu 18 LTS, 2 vCPUs, 4&nbsp;GB mem, up to 5 Gb network</li><li>c5n.2xlarge running Ubuntu 20 LTS, 8 vCPUs, 21&nbsp;GB mem, up to 25 Gb network</li><li>m5n.12xlarge running Ubuntu 18 LTS, 48 vCPUs, 192&nbsp;GB mem, 50 Gb network</li></ul><p>After numerous attempts to insert data with each of these instance types, we determined that the size of the client had no noticeable impact on insert performance. Instead, we needed to run multiple client instances to ingest more data. </p><p>In the end, we chose to write data from either 1 or 10 t3.medium clients, each running 20 threads of TSBS. In the case of 10 clients, each covered a portion of the 30 days to avoid writing duplicate data (Amazon Timestream does not support writing duplicate data).</p>
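<p>To illustrate how 10 clients can each cover their own portion of the 30 days without overlapping, here is a minimal Python sketch. It is not the TSBS implementation: the function name <code>split_time_range</code> and the example start date are assumptions made for illustration.</p>

```python
from datetime import datetime, timedelta, timezone


def split_time_range(start: datetime, end: datetime, clients: int):
    """Split [start, end) into contiguous, non-overlapping slices, one per client."""
    step = (end - start) / clients
    slices = []
    for i in range(clients):
        slice_start = start + i * step
        # Pin the last slice to the exact end to avoid rounding drift.
        slice_end = end if i == clients - 1 else start + (i + 1) * step
        slices.append((slice_start, slice_end))
    return slices


start = datetime(2020, 10, 1, tzinfo=timezone.utc)  # example start date (assumed)
end = start + timedelta(days=30)
slices = split_time_range(start, end, clients=10)    # each slice spans 3 days
```

<p>Each client would then generate and write only the readings whose timestamps fall inside its own slice, so no two clients ever send the same data point.</p>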
<!--kg-card-begin: html-->
<p>
    <strong>TimescaleDB</strong>
    <br style="display:block">
    To test the same insert and read latency performance on TimescaleDB, we used the following setup:
</p>
<!--kg-card-end: html-->
<ul><li>Version: TimescaleDB <a href="https://github.com/timescale/timescaledb/releases/tag/1.7.4">version 1.7.4</a>, with PostgreSQL 12</li><li>One remote client machine and one database server, both in the same cloud data center</li><li>Instance size: both client and database server ran on DigitalOcean virtual machines (droplets) with 32 vCPU and 192&nbsp;GB memory each</li><li>OS: both server and client machines ran Ubuntu 18.04.3</li><li>Disk Size: 4.8&nbsp;TB of disk in a raid0 configuration (EXT4 filesystem)</li><li>Deployment method: TimescaleDB was deployed using <a href="https://hub.docker.com/r/timescale/timescaledb">Docker images from the official Docker hub</a></li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/08/timescale-vs-amazon-timestream-Datasets-for-benchmarking-table-1.png" class="kg-image" alt="TimescaleDB achieves 6000 times higher inserts than Amazon Timestream" loading="lazy" width="1355" height="891" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/08/timescale-vs-amazon-timestream-Datasets-for-benchmarking-table-1.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/08/timescale-vs-amazon-timestream-Datasets-for-benchmarking-table-1.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/08/timescale-vs-amazon-timestream-Datasets-for-benchmarking-table-1.png 1355w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Results of the insert performance between TimescaleDB and Amazon Timestream</span></figcaption></figure><p><strong>In our tests, TimescaleDB outperformed Amazon Timestream by a shocking 6,000x on inserts.</strong> </p><p>The lackluster insert performance of Amazon Timestream took us by surprise, especially since we were using 
the Amazon Timestream SDK and modeling our TSBS code from examples in their documentation. </p><p>In the interest of being thorough and fair to Amazon Timestream, we tried increasing the number of clients writing data, made some code modifications to increase concurrency (in ways that weren’t necessary for TimescaleDB), and worked to eliminate any possible thread contention, and then ran the same benchmark with 10 clients on Amazon Timestream. </p><p>After this effort, we were able to increase Amazon Timestream performance to 5,250 metrics/second (across 10 clients)—but even then, TimescaleDB (with only one client and without any extra code modifications) outperformed Amazon Timestream by 600x.</p><p><em>(Hypothetically, we could have started a lot more clients to increase insert performance on Amazon Timestream (assuming no bottlenecks), but with an average ingest rate of ~523 metrics/second per client, we would have had to start ~61,000 EC2 instances at the same time to finish inserting metrics as fast as one client writing to TimescaleDB.)</em></p><p>In particular, with this low performance, we were only able to test our lowest cardinality workload, not our usual five—even though we worked at it for more than a week. This scenario attempts to insert 100 simulated devices, each generating 10 CPU metrics every 10 seconds for ~100M reading intervals (for one billion metrics). </p><p>We never actually made it to the full one billion metrics with Amazon Timestream. After nearly 40 hours of inserting data from 10 EC2 clients, we were only able to insert slightly over 410 million metrics. 
<em>(</em><a href="https://github.com/timescale/tsbs/blob/master/docs/timestream.md"><em>The dataset was created using the Time-Series Benchmarking Suite, using the cpu-only use case.</em></a><em>)</em></p><p>Let us put it another way:</p><ul><li>We first tested Amazon Timestream and TimescaleDB with one client writing data.</li><li>Then, in an attempt to be fair to Amazon Timestream, we tested it with 10 separate EC2 instances over a two-day period, inserting batches of 1,000 readings (100 hosts, 10 measurements per host) as fast as possible.</li><li>It’s also worth noting that most clients started to receive a fatal connection error from Amazon Timestream between the 28- and 32-hour marks and didn’t recover. Only one client made uninterrupted inserts for more than 40 hours before we manually stopped it. It’s possible that with some additional error checking of the Amazon Timestream SDK response, TSBS could have recovered on its own and continued to send metrics from all 10 clients.</li></ul><p>In total, this means that we inserted data into Amazon Timestream for a combined 332.5 hours across all 10 clients and wrote slightly more than 410 million metrics.</p><p><strong>TimescaleDB inserted one billion metrics from one client in just under five minutes.</strong></p><p>Amazon claims that Amazon Timestream will learn from your insert and query patterns and automatically adjust resources to increase performance. Their documentation <a href="https://docs.aws.amazon.com/timestream/latest/developerguide/writes.html">specifically warns that writes may become throttled</a>, with the only remedy being to keep inserting at the same (or higher) rate until Amazon Timestream adjusts. 
</p><p>However, in our experience, 332.5 hours of inserting data at a very consistent rate was not enough time for it to make this adjustment.</p><p><em><strong>The issue of cardinality: </strong></em>One other side-effect of Amazon Timestream taking so long to ingest data: we couldn’t compare how it performs with higher cardinalities, which are common in time-series scenarios, where we need to ingest a <em>relentless</em> stream of metrics from devices, apps, customers, and beyond. (<a href="https://timescale.ghost.io/blog/blog/what-is-high-cardinality-how-do-time-series-databases-influxdb-timescaledb-compare/">Read more about the role of cardinality in time series and how TimescaleDB solves it</a>.)<br><br>We’ve shown in <a href="https://timescale.ghost.io/blog/blog/what-is-high-cardinality-how-do-time-series-databases-influxdb-timescaledb-compare/#:~:text=In%20the%20world%20of%20databases,%E2%80%9D)%20that%20describes%20that%20data.">previous benchmarks</a> that TimescaleDB actually sees better performance relative to other time-series databases as cardinality increases, with moderate drop off in terms of absolute insert rate. TimescaleDB surpasses many other popular time-series databases, like InfluxDB, in terms of insert performance for the configurations of 4,000, 100,000, 1 million, and 10 million devices.</p><p>But again, we were unable to test this given Amazon Timestream’s (lack of) performance.</p><p><strong>Insert performance summary:</strong></p><ul><li>TimescaleDB outperforms Amazon Timestream in raw numbers that we found hard to believe. However, despite our best efforts to optimize Amazon Timestream, TimescaleDB still outperformed Amazon Timestream by 6,000x (600x if using 10 clients on Amazon Timestream to TimescaleDB’s one).</li><li>In the time it took us to make a pot of coffee, TimescaleDB inserted one billion metrics for a 31-day period. 
With Amazon Timestream, we got two nights' sleep and inserted less than half the metrics.</li><li>That said, if your insert performance is far below these benchmarks (e.g., a few thousand metrics/second), then insert performance will not be your bottleneck.</li></ul><p><strong>Full results:</strong></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2020/12/image-14.png" class="kg-image" alt="Table showing TimescaleDB vs. Amazon Timestream Insert Rate comparison ratios in metrics/second" loading="lazy" width="2000" height="1200" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2020/12/image-14.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2020/12/image-14.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2020/12/image-14.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2020/12/image-14.png 2280w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">TimescaleDB vs. Amazon Timestream Insert Rate comparison ratios</em></i></figcaption></figure><p><strong>More information on database configuration for this test:</strong></p>
<!--kg-card-begin: html-->
<p>
    <em>Batch size</em>
    <br style="display:block">
    From our research and community members’ feedback, we’ve found that larger batch sizes generally provide better insert performance. (It’s one of the reasons we created tools like Parallel COPY to help our users insert data in large batches concurrently.) 
</p>
<!--kg-card-end: html-->
<p>In our benchmarking tests for TimescaleDB, the batch size was set to 10,000, something we’ve found works well for this kind of high throughput. The batch size, however, is completely configurable and often worth customizing based on your application requirements.<br><br>Amazon Timestream, on the other hand, has a fixed batch size limit of 100 values. This limit seems to add significant overhead, and insert latency rises dramatically as the number of metrics we try to insert at one time grows. We believe this is one of the main reasons insert performance was so much slower with Amazon Timestream.<br></p>
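<p>To make the effect of batch size concrete, the sketch below groups readings into batches and counts how many write requests each limit would need. It is a simplified illustration in plain Python, not TSBS or SDK code; the helper <code>batched</code> and the sample readings are hypothetical.</p>

```python
from typing import Iterable, Iterator, List


def batched(records: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield lists of at most batch_size records, preserving order."""
    batch: List[dict] = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly smaller batch
        yield batch


# Hypothetical sample data: 25,000 readings spread over 100 hosts.
readings = [{"host": f"host_{i % 100}", "usage_user": float(i % 100)}
            for i in range(25_000)]
large_batches = list(batched(readings, 10_000))  # batch size used for TimescaleDB
small_batches = list(batched(readings, 100))     # limit described for Timestream
```

<p>With these numbers, the same 25,000 readings need 3 write requests at a batch size of 10,000 but 250 requests at a limit of 100, so per-request overhead is paid far more often.</p>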
<!--kg-card-begin: html-->
<p>
    <em>
        Additional database configurations
    </em>
    <br style="display: block">
    For TimescaleDB, we set the chunk time depending on the data volume, aiming for 7-16 chunks in total for each configuration (see our documentation for more on hypertables - "chunks"). 
</p>
<!--kg-card-end: html-->
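<p>As a rough sketch of how a chunk time could be derived from a target chunk count, the Python below divides the dataset’s time span by the number of chunks. The function name <code>chunk_interval_for</code> is hypothetical, and we assume the 31-day span mentioned for this dataset; the resulting value is the kind of interval you would pass as <code>chunk_time_interval</code> when creating a hypertable.</p>

```python
from datetime import timedelta


def chunk_interval_for(data_span: timedelta, target_chunks: int) -> timedelta:
    """Return the chunk time interval that splits data_span into target_chunks chunks."""
    return data_span / target_chunks


data_span = timedelta(days=31)                   # assumed span of the dataset
widest = chunk_interval_for(data_span, 7)        # about 4.4 days per chunk
narrowest = chunk_interval_for(data_span, 16)    # about 1.9 days per chunk
```

<p>Any interval between these two values keeps the total within the 7-16 chunk range described above.</p>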
<p><br>With Amazon Timestream, there aren’t additional settings you can tweak to try and improve insert performance—at least not that we found, given the tools provided by Amazon. </p><p>As mentioned in the machine configuration section above, we had to set the memory store retention period equal to ~36 days to ensure we could get all of our data inserted before the magnetic store retention period kicked in.</p><h2 id="query-performance-comparison">Query Performance Comparison</h2><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/08/timescale-vs-amazon-timestream-175x-faster-queries-1.png" class="kg-image" alt="TimescaleDB achieved 5 to 175 times better query performance than Amazon Timestream" loading="lazy" width="1562" height="1160" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/08/timescale-vs-amazon-timestream-175x-faster-queries-1.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/08/timescale-vs-amazon-timestream-175x-faster-queries-1.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/08/timescale-vs-amazon-timestream-175x-faster-queries-1.png 1562w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Note: Several queries’ ratios (high-cpu-all, lastpoint, groupby-orderby-limit) are “undefined” because Amazon Timestream did not finish executing them within the default 60 second timeout period that Timestream imposes, while TimescaleDB completed them in less than a single second</em></i></figcaption></figure><p>Measuring query latency is complex. Unlike inserts, which primarily vary on cardinality size, the universe of possible queries is essentially infinite, especially with a language as powerful as SQL. 
</p><p>Often, the best way to benchmark read latency is to do it with the actual queries you plan to execute. For this case, we use a broad set of queries to mimic the most common time-series query patterns.</p><p>For benchmarking query performance, we decided to use a c5n.2xlarge EC2 instance to perform the queries with Amazon Timestream. Our hope was that having more memory and network throughput available to the query application would give Amazon Timestream a better chance. The client for TimescaleDB was unchanged.</p><p>Recall that we ran these queries on Amazon Timestream with a dataset that was 40 % that of the one we ran on TimescaleDB (410 million vs. one billion metrics), owing to the insert problems we had above. </p><p>Also, because we had to set the memory store retention period to ~36 days, all the data we queried was in the fastest storage available. These two advantages <em>should</em> have given Amazon Timestream a considerable edge.</p><p><strong>That said, TimescaleDB still outperformed Amazon Timestream by 5x to 175x, depending on the query, with Amazon Timestream unable to finish several queries. </strong></p><p>The results below are the average from 1,000 queries for each query type. 
Latencies in this chart are all shown as milliseconds, with an additional column showing the relative performance of TimescaleDB compared to Amazon Timestream.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/08/timescale-vs-amazon-timestream-Query-Performance-1.png" class="kg-image" alt="Table showing latency ratios for various queries - run on 100 devices x 10 metrics - in milliseconds" loading="lazy" width="1710" height="1496" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/08/timescale-vs-amazon-timestream-Query-Performance-1.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/08/timescale-vs-amazon-timestream-Query-Performance-1.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2023/08/timescale-vs-amazon-timestream-Query-Performance-1.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/08/timescale-vs-amazon-timestream-Query-Performance-1.png 1710w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Results of benchmarking query performance between TimescaleDB and Amazon Timestream</em></i></figcaption></figure><p><strong>Results by query type:</strong></p>
<!--kg-card-begin: html-->
<p>
    <em>SIMPLE ROLLUPS</em>
    <br style="display: block">
    For simple rollups (i.e., GROUP BY queries), whether aggregating one metric across a single host for 1 or 12 hours, or multiple metrics across one or multiple hosts (again for 1 or 12 hours), TimescaleDB outperforms Amazon Timestream by 11x to 28x.
</p>
<!--kg-card-end: html-->

<!--kg-card-begin: html-->
<p>
    <em>AGGREGATES</em>
    <br style="display: block">
    When calculating a simple aggregate for one device, TimescaleDB again outperforms Amazon Timestream by a considerable margin, returning results for each of 1,000 queries more than 19x faster.
</p>
    
<!--kg-card-end: html-->

<!--kg-card-begin: html-->
<p>
    <em>DOUBLE ROLLUPS</em>
    <br style="display: block">
    For double rollups aggregating metrics by time and another dimension (e.g., GROUP BY time, deviceId), TimescaleDB again achieves significantly better performance, by 5x to 12x.
</p>
    
<!--kg-card-end: html-->
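<p>For readers unfamiliar with the term, a double rollup groups by a time bucket and a second dimension at once. The sketch below shows the shape of such a query using Python’s built-in <code>sqlite3</code> module as a stand-in; it is not the TSBS query, the table and sample rows are made up, and SQLite’s <code>strftime()</code> merely takes the place of a dedicated time-bucketing function.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cpu (ts TEXT, host TEXT, usage_user REAL)")
rows = [
    ("2020-11-01 00:05:00", "host_0", 10.0),
    ("2020-11-01 00:35:00", "host_0", 30.0),
    ("2020-11-01 00:10:00", "host_1", 50.0),
    ("2020-11-01 01:10:00", "host_0", 70.0),
]
conn.executemany("INSERT INTO cpu VALUES (?, ?, ?)", rows)

# Double rollup: group by an hourly time bucket AND by host.
# strftime() truncates each timestamp to the start of its hour.
query = """
    SELECT strftime('%Y-%m-%d %H:00:00', ts) AS hour,
           host,
           AVG(usage_user) AS avg_usage
    FROM cpu
    GROUP BY hour, host
    ORDER BY hour, host
"""
result = conn.execute(query).fetchall()
conn.close()
```

<p>Here <code>result</code> holds one row per (hour, host) pair, so the two readings for <code>host_0</code> in the first hour collapse into a single average.</p>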

<!--kg-card-begin: html-->
<p>
    <em>THRESHOLDS</em>
    <br style="display: block">
   When selecting rows based on a threshold (CPU > 90&nbsp;%), we see Amazon Timestream really begin to fall apart. Finding the last reading above 90&nbsp;% for one host is 170x faster with TimescaleDB than with Amazon Timestream. The second variation of this query, which tries to find the last reading above 90&nbsp;% for all 100 hosts (in the last 31 days), never finished in Amazon Timestream.
</p>
    
<!--kg-card-end: html-->
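<p>The single-host threshold query has a simple shape: filter on host and threshold, order by time descending, and keep one row. The following sketch uses Python’s built-in <code>sqlite3</code> module purely for illustration; the table, column names, and sample rows are assumptions and not the benchmark’s actual schema.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cpu (ts TEXT, host TEXT, usage_user REAL)")
conn.executemany(
    "INSERT INTO cpu VALUES (?, ?, ?)",
    [
        ("2020-11-01 00:00:00", "host_0", 95.0),
        ("2020-11-01 00:10:00", "host_0", 40.0),
        ("2020-11-01 00:20:00", "host_0", 97.5),
        ("2020-11-01 00:30:00", "host_0", 12.0),
        ("2020-11-01 00:00:00", "host_1", 20.0),
    ],
)

# Most recent reading above the threshold for a single host.
last_high = conn.execute(
    "SELECT ts, usage_user FROM cpu "
    "WHERE host = ? AND usage_user > ? "
    "ORDER BY ts DESC LIMIT 1",
    ("host_0", 90.0),
).fetchone()
conn.close()
```

<p>In this toy dataset the query skips the newest reading (12.0) and returns the 00:20 reading, the latest one above 90. The all-hosts variation repeats the same lookup for every host, which is where the workload grows.</p>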
<p><em>Again, to be fair and ensure our query was returning the data we expected, we did manually run one of these queries in the Amazon Timestream Query interface of the AWS Console. It would routinely finish in 30-40 seconds (which would still be 36x slower than TimescaleDB). </em></p><p><em>In addition, running 100 of these queries at a time with the benchmark suite appears to be too much for the query engine, and results for the first set of 100 queries didn’t complete after more than 10 minutes of waiting.</em></p>
<!--kg-card-begin: html-->
<p>
    <em>COMPLEX QUERIES</em>
    <br style="display: block">
   Likewise, for complex queries that go beyond rollups or thresholds, there is no comparison. TimescaleDB vastly outperforms Amazon Timestream, in most cases because Amazon Timestream never returned results for the first set of 100 queries. Just like the complex aggregate above that failed to return any results when queried in batches of 100, these complex queries never returned results with the benchmark client.
</p>
<!--kg-card-end: html-->
<p>In each case, we attempted to run the queries multiple times, ensuring that no other clients or processes were inserting or accessing data. We also ran at least one of the queries manually in the AWS Console to verify that it worked and that we got the expected results. However, when running these kinds of queries in parallel, there seems to be a major issue with Amazon Timestream being able to satisfy the requests.</p><p>For these more complex queries that return results from Amazon Timestream, TimescaleDB provides real-time responses (e.g., 10-100s of milliseconds), while Amazon Timestream sees significant human-observable delays (seconds). </p><p>And remember, this dataset only had a cardinality of 100 hosts, the lowest cardinality we typically test with the Time-Series Benchmarking Suite (we were unable to test higher-cardinality datasets because of the Amazon Timestream issues described above).</p><p>Notice that TimescaleDB exhibits 48x-175x the performance of Amazon Timestream on these complex queries, many of which are common to historical time-series analysis and monitoring.</p><p><strong>Read latency performance summary</strong></p><ul><li>For simple queries, TimescaleDB outperforms Amazon Timestream in every category.</li><li>When selecting rows based on a threshold, TimescaleDB outperforms Amazon Timestream by a significant margin, being over 175x faster.</li><li>For complex queries with even low cardinality, Amazon Timestream was unable to return results for sets of 100 queries within the 60-second default query timeout.</li><li>Concurrent query load over the same time range seems to impact Amazon Timestream in a dramatic way.</li></ul><h2 id="cost-comparison-details">Cost Comparison Details</h2><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/08/timescale-vs-amazon-timestream-154x-cheaper-than-Amazon-Timestream-1.png" class="kg-image" alt="To run this test 
TimescaleDB took less than an hour at a cost of $2.18. The same test took a week and cost $336 in Amazon Timestream" loading="lazy" width="1382" height="746" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/08/timescale-vs-amazon-timestream-154x-cheaper-than-Amazon-Timestream-1.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/08/timescale-vs-amazon-timestream-154x-cheaper-than-Amazon-Timestream-1.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/08/timescale-vs-amazon-timestream-154x-cheaper-than-Amazon-Timestream-1.png 1382w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Results of cost comparison between TimescaleDB and Amazon Timestream</span></figcaption></figure><p>These performance differences between TimescaleDB and Amazon Timestream lead to massive cost differences for the same workloads. </p><p>To compare costs, we calculated the cost for our above insert and query workloads, which store one billion metrics in TimescaleDB and ~410 million metrics in Amazon Timestream (because we could not load the full one billion), and run our suite of queries on top.</p><p>Pricing for Amazon Timestream is rather complex. In all, our bill for testing Amazon Timestream over the <strong>course of seven days</strong> cost us $336.39, <strong>which does not include any Amazon EC2 charges </strong>(which we needed for the extra clients). 
During that time, our bill shows that:</p><ul><li>We inserted 100&nbsp;GB of data (~500 million metrics total across <strong><em>all</em></strong> of our attempts to ingest data)</li><li>Stored a lot of data in memory (and we continue to be charged per hour for that data)</li><li>Queried 21 TB of data when running 25,000 real-world queries</li></ul><p>For comparison, our tests for TimescaleDB (inserts and queries) completed in far less than an hour, and our Digital Ocean droplet costs $1.50/hour. We also ran this test on Timescale, our fully managed TimescaleDB service, and it was also completed in far less than an hour, and the instance (8 vCPU, 32&nbsp;GB, 1&nbsp;TB) cost $2.18/hour.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2020/12/image-12.png" class="kg-image" alt="Screenshot of Amazon Timestream bill with various charges, totaling 336.40 USD" loading="lazy" width="2000" height="804" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2020/12/image-12.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2020/12/image-12.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1600/2020/12/image-12.png 1600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2020/12/image-12.png 2280w" sizes="(min-width: 720px) 720px"></figure><p><strong>$336.39 for Amazon Timestream vs. 
$2.18 for fully managed TimescaleDB ($1.50 if you would rather manage the instance yourself), which means that TimescaleDB is 154x cheaper than Amazon Timestream (224x cheaper if self-managed), and it loaded <em>and</em> queried over twice the number of metrics.</strong></p><h3 id="now-let%E2%80%99s-dig-a-little-deeper-into-our-amazon-timestream-bill">Now, let’s dig a little deeper into our Amazon Timestream bill</h3><p>When using Amazon Timestream, users are charged for usage in four main categories: data ingested, data stored in memory, data stored in magnetic storage, and the amount of data scanned to satisfy your queries.</p><p>Data that resides in the memory store costs $0.036 per GB per hour, while data that is eventually moved to magnetic storage costs only $0.03 per GB per month. A 30-day month has 720 hours, so for our one-month memory store setting (which was required to insert 30 days of historical data), <strong>that’s more than a 720x difference in cost for the same data</strong>.</p><p>What’s more, since Amazon Timestream doesn’t expose any information about how data is stored in the Console or otherwise, we have no idea how well-compressed this data is, or if there is anything more we could do to reduce storage.</p><p>The real surprise, however, came with querying data because the charges don’t scale with performance. Instead, you will be charged for the amount of data scanned to produce a query result, no matter how fast that result comes back. In almost all other database-as-a-service offerings, you can modify the storage, compute, or cluster size (at a known cost) for better performance.</p><p>After waiting nearly two days to insert 410 million metrics, we created the traditional set of query batches (as outlined above) and began to run our queries. <br><br>In total, we had 15 query files, each with 1,000 queries, for a total of 15,000 queries to run against both Amazon Timestream and TimescaleDB. 
</p><p>While some of the queries are certainly complex, others just ask for the most recent point for each host (and remember, this dataset only had 100 hosts).  Also, recall from our query performance comparison that a few of the most complex queries were unable to return results for just 100 queries, let alone the full 1,000 query test. </p><p><strong>With Amazon Timestream, you are still charged for the data that was read, even if the query was ultimately canceled or never returned a result.</strong></p><p>To validate results, we ran each query file twice. In the case where three of the query files failed to return results, we attempted to execute them five times, hoping for some result. </p><p>Doing the math, this means that we ran around 25,000 queries. In doing so, Amazon says that we scanned 21,598.02 GB of data, which cost $215.98. There were certainly a few other ad hoc queries performed through the AWS Console UI, but before we started running the benchmarking queries, the total cost for scanning data was about $15.00.</p><p>Furthermore, as we’ve mentioned a few times, there is no built-in support to help you identify which queries are scanning too much data and how you might improve them. 
For comparison, both Amazon Redshift and Amazon RDS provide this kind of feedback in their AWS Console interface.</p><p>When we consider some of the recent user applications that we have highlighted elsewhere on our blog, like <a href="https://timescale.ghost.io/blog/blog/how-flightaware-fuels-flight-prediction-models-with-timescaledb-and-grafana/">FlightAware</a> or <a href="https://timescale.ghost.io/blog/blog/how-clevabit-builds-data-pipeline-for-agricultural-iot/">clevabit</a>, 25,000 queries of various shapes and sizes would easily be run in a few hours or less.</p><p>While the bytes scanned might improve over time as partitioning improves, if you don’t need to scale storage beyond a few petabytes of data, it’s hard to see how this would be less costly than a fixed Compute and Storage cost.</p><h2 id="reliability-comparison-details">Reliability Comparison Details</h2><p>Another cardinal rule for a database: it cannot lose or corrupt your data. In this respect, the serverless nature of Amazon Timestream requires that you trust Amazon will not lose your data and all of the data will be stored without corruption. Usually, this is probably a pretty safe bet. In fact, many companies rely on services like Amazon S3 or Amazon Glacier to store their data as a reliable backup solution. </p><p>The problem is that we don’t know where our time-series data is stored in Amazon Timestream—because Amazon does not tell us.</p><p>This presents a specific challenge that Amazon hasn’t addressed natively: validating or backing up your data.</p><p>In their <a href="https://docs.aws.amazon.com/timestream/latest/developerguide/timestream.pdf">240-page development guide</a>, the words “recovery” and “restore” don’t appear at all, and the word “backup” appears only once to tell the developer that there isn’t a backup mechanism. 
Instead, you can “write your own application using the Timestream SDK to query data and save it to the destination of your choice” (page 100).</p><p>This is not to say that Amazon Timestream will lose or corrupt your data. As we mentioned, Amazon S3, for instance, is a widely known and used service for data storage. The issue here is that we’re unable to learn or easily verify where our data resides and how it’s protected in a service interruption.</p><p>We also found it worrisome that with Amazon Timestream, there isn’t a mechanism or support to DELETE or UPDATE existing data. <strong>The only way to remove data is to drop the entire table.</strong> Furthermore, there is no way to recover a deleted table since it is an atomic action that cannot be recovered through any Amazon API or Console. </p><p>Even if one were to write their own backup and restore utility, there is no method for importing more than the most recent year of data because of the memory store retention period limitation. </p><p>As an Amazon Timestream user, all these limitations put us in a precarious position. There’s no easy way to back up our data or restore it once we’ve accumulated more than a year's worth. Even if Amazon never loses our data, deleting an essential table of data <a href="https://www.geekwire.com/2015/starbucks-back-in-business-internal-report-blames-deleted-database-table-indicates-outage-was-global/">through human error is not uncommon</a>.</p><p>TimescaleDB uses a dramatically different design principle: build on PostgreSQL. As noted previously, this allows TimescaleDB to inherit over 25 years of dedicated engineering effort that the entire PostgreSQL community has done to build a rock-solid database that supports millions of applications worldwide. 
(In fact, this principle was at the core of our initial <a href="https://timescale.ghost.io/blog/blog/when-boring-is-awesome-building-a-scalable-time-series-database-on-postgresql-2900ea453ee2/?utm_source=timescale-influx-benchmark&amp;utm_medium=blog&amp;utm_campaign=july-2020-advocacy&amp;utm_content=timescale-launch-blog">TimescaleDB launch announcement</a>.)</p><h2 id="query-language-ecosystem-and-ease-of-use-comparison-details">Query Language, Ecosystem, and Ease-Of-Use Comparison Details</h2><p>We applaud Amazon Timestream’s decision to adopt SQL as their query language. We have always been big fans and vocal advocates of SQL, which has become the query language of choice for data infrastructure, is well-documented, and currently ranks as the <a href="https://insights.stackoverflow.com/survey/2020#most-popular-technologies">third-most commonly used programming language among developers</a> (<a href="https://timescale.ghost.io/blog/blog/why-sql-beating-nosql-what-this-means-for-future-of-data-time-series-database-348b777b847a/">see our SQL vs. NoSQL comparison</a> for more details). 
</p><p>Even if Amazon Timestream functions like a NoSQL database in many ways, opting for SQL as the query interface lowers developers’ barrier to entry—especially when compared to other databases like MongoDB and InfluxDB with their proprietary query languages.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/08/timescale-vs-amazon-timestream-Comparisson-table.png" class="kg-image" alt="Screenshot of Stack Overflow Developer survey results, showing percentage breakdown of various languages" loading="lazy" width="1290" height="1617" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2023/08/timescale-vs-amazon-timestream-Comparisson-table.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2023/08/timescale-vs-amazon-timestream-Comparisson-table.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/08/timescale-vs-amazon-timestream-Comparisson-table.png 1290w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Most popular Programming, Scripting, and Markup Languages. Source: </em></i><a href="https://insights.stackoverflow.com/survey/2020#most-popular-technologies"><i><em class="italic" style="white-space: pre-wrap;">2020 Stack Overflow Developer Surve</em></i></a><i><em class="italic" style="white-space: pre-wrap;">y</em></i></figcaption></figure><p>As discussed earlier in this article, Amazon Timestream is not a relational database, despite feeling like it could be because of the SQL query interface. It doesn’t support normalized datasets, JOINs across tables, or even some common “tricks of the trade” like a simple LATERAL JOIN and correlated subqueries. 
</p><p>Between these SQL limitations and the narrow table model that Amazon Timestream enforces on your data, writing efficient (and easily readable) queries can be a challenge. </p>
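<p>For readers unfamiliar with the pattern, here is a minimal sketch of what a LATERAL JOIN looks like in PostgreSQL (and therefore in TimescaleDB). The <code>trucks</code> and <code>truck_readings</code> tables and their columns are hypothetical names chosen for illustration; they are not part of the Timestream sample dataset:</p><pre><code class="language-SQL">-- Hypothetical schema (an assumption for this sketch):
--   trucks(truck_id, fleet)
--   truck_readings(time, truck_id, load, speed)
-- For each truck, fetch its single most recent reading.
SELECT t.truck_id, t.fleet, r.time, r.load, r.speed
FROM trucks AS t
CROSS JOIN LATERAL (
    -- The subquery can reference the outer row (t.truck_id)
    SELECT tr.time, tr.load, tr.speed
    FROM truck_readings AS tr
    WHERE tr.truck_id = t.truck_id
    ORDER BY tr.time DESC
    LIMIT 1
) AS r;</code></pre><p>The subquery is evaluated once per truck, which makes "latest value per device" lookups short to write. Without LATERAL or correlated subqueries, the same result has to be expressed some other way.</p>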
<!--kg-card-begin: html-->
<p>
    <em>Example</em>
    <br style="display: block">
    To see a brief example of how Amazon Timestream’s “narrow” table model impacts the SQL that you write, let’s look at an example given in the Timestream documentation, <a href="https://docs.aws.amazon.com/timestream/latest/developerguide/sample-queries.iot-scenarios.html" target="_blank">Queries with aggregate functions.</a>
</p>
<!--kg-card-end: html-->
<p>Specifically, we’ll look at the example to “<em>find the average load and max speed for each truck for the past week</em>”:</p><p><strong>Amazon Timestream SQL (CASE statement needed)</strong></p><pre><code class="language-SQL">SELECT
    bin(time, 1d) as binned_time,
    fleet,
    truck_id,
    make,
    model,
    AVG(
        CASE WHEN measure_name = 'load' THEN measure_value::double ELSE NULL END
    ) AS avg_load_tons,
    MAX(
        CASE WHEN measure_name = 'speed' THEN measure_value::double ELSE NULL END
    ) AS max_speed_mph
FROM "sampleDB".IoT
WHERE time &gt;= ago(7d)
AND measure_name IN ('load', 'speed')
GROUP BY fleet, truck_id, make, model, bin(time, 1d)
ORDER BY truck_id</code></pre><p><strong>TimescaleDB SQL</strong></p><pre><code class="language-SQL">SELECT
    time_bucket('1 day', time) AS binned_time,
    fleet,
    truck_id,
    make,
    model,
    AVG(load) AS avg_load_tons,
    MAX(speed) AS max_speed_mph
FROM "public".IoT
WHERE time &gt;= now() - INTERVAL '7 days'
GROUP BY fleet, truck_id, make, model, binned_time
ORDER BY truck_id</code></pre><p>If the above is any indication, even the most simple aggregate queries in Amazon Timestream require multiple levels of CASE statements and column renaming. We were unable to write this query or pivot the results more easily.</p><p>Conversely, as evidenced in the above example, with TimescaleDB, we use standard SQL syntax. Additionally, any query that already works with your PostgreSQL-supported applications will “just work.” The same isn’t true for Amazon Timestream.</p><p>So while the decision to adopt a SQL-like query language is a great start for Amazon Timestream, there is still a lot to be desired for a truly frictionless, developer-first experience.</p><h2 id="summary">Summary</h2><p>No one wants to invest in a technology only to have it limit their growth or scale in the future, let alone invest in something that's the wrong fit today.</p><p>Before making a decision, we recommend taking a step back and analyzing your stack, your team's skills, and your needs (now and in the future). It could be the difference between infrastructure that evolves and grows with you and one that forces you to start all over.</p><p>In this post, we performed a detailed comparison of TimescaleDB and Amazon Timestream. We don’t claim to be <a href="https://www.tigerdata.com/blog/so-long-timestream-how-and-why-to-migrate-before-its-too-late" rel="noreferrer">Amazon Timestream</a> experts, so we’re open to suggestions on improving this comparison—and invite you to perform your own and share your results.</p><p>In general, we aim to be as transparent as possible about our data models, methodologies, and analysis, and we welcome feedback. 
We also encourage readers to raise any concerns about the information we’ve presented to help us with benchmarking in the future.</p><p>We recognize that <a href="http://www.timescale.com/?utm_source=timescale-influx-benchmark&amp;utm_medium=blog&amp;utm_campaign=july-2020-advocacy&amp;utm_content=homepage">Timescale</a> isn’t the only time-series solution on the market. There are situations where it might not be the best time-series database choice, and we strive to be upfront in admitting where an alternate solution may be preferable.</p><p>We’re always interested in holistically evaluating our solution against others, and we’ll continue to share our insights with the greater community.</p><h2 id="want-to-learn-more-about-timescale">Want to Learn More About Timescale?</h2><p><a href="https://console.cloud.timescale.com/signup">Create a free account </a>to get started with a fully managed TimescaleDB instance (100&nbsp;% free for 30 days). </p><p>Want to host TimescaleDB yourself? <a href="https://github.com/timescale/timescaledb">Visit our GitHub</a> to learn more about options, get installation instructions, and more (and, as always, ⭐️  are appreciated!)</p><p>Join our <a href="http://slack.timescale.com/">Slack community</a> to ask questions, get advice, and connect with other developers (I, as well as our co-founders, engineers, and passionate community members are active on all channels).</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Speed Up Grafana by Auto-Switching Between Different Aggregations With Postgres]]></title>
            <description><![CDATA[Read a step-by-step guide on enabling Grafana "auto-switching" between PostgreSQL aggregations, depending on the time interval.]]></description>
            <link>https://www.tigerdata.com/blog/speed-up-grafana-autoswitching-postgresql</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/speed-up-grafana-autoswitching-postgresql</guid>
            <category><![CDATA[Data Visualization]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Avthar Sewrathan]]></dc:creator>
            <pubDate>Tue, 11 Aug 2020 14:13:02 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2020/08/Autoswtiching-header.gif">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2020/08/Autoswtiching-header.gif" alt="A Grafana GIF displaying various graphs. Speed up Grafana by auto-switching between different aggregations, using PostgreSQL" /><p><strong>Updated March 4, 2026</strong><em> - Learn how (and why) to speed up your Grafana drill-downs using PostgreSQL to allow "auto-switching" between aggregations, depending on the time interval you select.</em></p><h2 id="the-problem-grafana-is-slow-to-load-visualizations-especially-for-non-aggregated-fine-grained-data">The problem: Grafana is slow to load visualizations, especially for non-aggregated, fine-grained data</h2><p>The&nbsp;<a href="https://grafana.com/" rel="nofollow">Grafana</a>&nbsp;UI is great for drilling down into your data. However, for large amounts of data with second, millisecond, or even nanosecond time granularity, it can be frustratingly slow and result in higher resource usage.</p><p>For example, take this graph of all New York City taxi rides during the month of January 2016:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/grafana_autoswtiching_loading.gif" class="kg-image" alt="Grafana graph loading slowly" loading="lazy" width="1600" height="597" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/01/grafana_autoswtiching_loading.gif 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/01/grafana_autoswtiching_loading.gif 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/grafana_autoswtiching_loading.gif 1600w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">An example of how slow drill-downs into data can be in 
Grafana</span></figcaption></figure><p>One common workaround: instead of querying raw data and aggregating on the fly, you query and visualize data from&nbsp;<em>aggregates</em>&nbsp;of your raw data (e.g., one-minute, one-hour, or one-day rollups).</p><p>For PostgreSQL data sources, we do this by aggregating data into views and querying those instead, and for TimescaleDB, we use continuous aggregates—think “automatically refreshing materialized views” (for more, see the&nbsp;<a href="https://www.tigerdata.com/docs/use-timescale/latest/continuous-aggregates/" rel="nofollow">continuous aggregates docs</a>).</p><p>However, this often leads to several Grafana panels, each querying the same data aggregated at different granularities. For example, you might capture the same metric over time but set up aggregates at various intervals, such as in minute, hourly, and daily intervals.</p><p>This then requires three separate panels, one for each aggregated interval.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Screen-Shot-2020-07-30-at-4.42.57-PM.png" class="kg-image" alt="Three different Grafana graphs showing rides from daily and hourly aggregates as well as raw data" loading="lazy" width="1238" height="1144" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/01/Screen-Shot-2020-07-30-at-4.42.57-PM.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/01/Screen-Shot-2020-07-30-at-4.42.57-PM.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Screen-Shot-2020-07-30-at-4.42.57-PM.png 1238w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Example of three panels all showing taxi rides over January 2016 but in different time granularities (daily, hourly, and per 
minute, from top to bottom)</span></figcaption></figure><p>But what if we could use&nbsp;<em>one universal panel</em>&nbsp;that “automatically” switches between minute, hourly, daily, or any other aggregation of our data, depending on the time period we’d like to query and analyze? This would speed up queries and use resources like CPU more efficiently.</p><p>Enter the PostgreSQL&nbsp;<code>UNION ALL</code>&nbsp;operator.</p><h2 id="the-solution-use-postgres-union-all">The Solution: Use Postgres <code>UNION ALL</code></h2><p>When we use PostgreSQL as our Grafana data source, we can write a single query that automatically switches between different aggregated views of our data (e.g., daily, hourly, or weekly views) in the same Grafana visualization (!).</p><p>🔑&nbsp;<strong>The key</strong>: we (1) use&nbsp;<code>UNION ALL</code>&nbsp;to combine separate queries that pull data at different aggregations, and (2) use the&nbsp;<code>WHERE</code>&nbsp;clause of each query to switch the table (or continuous aggregate view) being queried, depending on the length of the time interval selected (either from the time picker or by highlighting the time period in a graph).</p><p>This allows us to drill arbitrarily deep into our data and makes loading the data as efficient and fast as possible, saving time and CPU resources. 
(In Grafana, drilling into data is typically done by zooming in and out, highlighting the time period of interest in the graph as shown in the image below.)</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/final_5f1f1f7b0c9e000015634870_766127.gif" class="kg-image" alt="Autoswitching between daily and hourly aggregates and raw data depending on time period selected" loading="lazy" width="1280" height="720" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/01/final_5f1f1f7b0c9e000015634870_766127.gif 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/01/final_5f1f1f7b0c9e000015634870_766127.gif 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/final_5f1f1f7b0c9e000015634870_766127.gif 1280w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Example of auto-switching between different data aggregations depending on the time interval selected. Learn how to create this example in the tutorial below.</span></figcaption></figure><h2 id="try-it-yourself-implementation-in-grafana-sample-queries">Try It Yourself: Implementation in Grafana &amp; Sample Queries</h2><p>To help you get up and running with&nbsp;<code>UNION ALL</code>, I’ve put together a short step-by-step guide and a few sample queries (which you can modify to suit your project, app, and the metrics you care about).</p><h3 id="scenario">Scenario</h3><p><a href="https://gist.github.com/antonum/94a8c6579a5ddb379588c153504f5472#scenario"></a></p><p>We’ll use the use case of monitoring IoT devices, specifically taxis equipped with sensors. 
For reference, we’ll use a dataset containing all New York City taxi ride activity for January 2016 from the&nbsp;<a href="https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page" rel="nofollow">New York Taxi and Limousine Commission</a>&nbsp;(NYC TLC).</p><h3 id="prerequisites">Prerequisites</h3><p><a href="https://gist.github.com/antonum/94a8c6579a5ddb379588c153504f5472#prerequisites"></a></p><ul><li><a href="https://www.tigerdata.com/products/" rel="nofollow">TimescaleDB instance</a>&nbsp;(Tiger Cloud or self-hosted) running PostgreSQL 11+</li><li><a href="https://grafana.com/" rel="nofollow">Grafana instance</a>&nbsp;(cloud or self-hosted)</li><li>TimescaleDB instance connected to Grafana (see&nbsp;<a href="https://www.tigerdata.com/docs/tutorials/latest/grafana/" rel="nofollow">this tutorial</a>&nbsp;for more)</li><li>Use the queries below to create two continuous aggregates with refresh policies. These will be the aggregate views we switch between in our Grafana visualization:</li></ul><p>To create daily aggregates:</p><pre><code class="language-SQL">CREATE MATERIALIZED VIEW rides_daily
WITH (timescaledb.continuous)
AS
    SELECT time_bucket('1 day', pickup_datetime) AS day, COUNT(*) AS ride_count
    FROM rides
    GROUP BY day
WITH NO DATA;

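-- Optional one-off backfill (our assumption, not part of the original tutorial):
-- the view is created WITH NO DATA, and the policy below only refreshes a
-- recent window, so the January 2016 rides may need a manual refresh.
-- Repeat for any other continuous aggregate you create.
CALL refresh_continuous_aggregate('rides_daily', '2016-01-01', '2016-02-01');
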
SELECT add_continuous_aggregate_policy('rides_daily',
    start_offset =&gt; INTERVAL '1 month',
    end_offset =&gt; INTERVAL '1 day',
    schedule_interval =&gt; INTERVAL '1 day');</code></pre><p><em>SQL query to create daily aggregates of rides during January 2016</em></p><p>This computes a rollup of the total number of rides taken on each day covered by our data (January 2016).</p><p>To create hourly aggregates:</p><pre><code class="language-SQL">CREATE MATERIALIZED VIEW rides_hourly
WITH (timescaledb.continuous)
AS
    SELECT time_bucket('1 hour', pickup_datetime) AS hour, COUNT(*) AS ride_count
    FROM rides
    GROUP BY hour
WITH NO DATA;

SELECT add_continuous_aggregate_policy('rides_hourly',
    start_offset =&gt; INTERVAL '1 month',
    end_offset =&gt; INTERVAL '1 hour',
    schedule_interval =&gt; INTERVAL '1 hour');</code></pre><p><em>SQL query to create hourly aggregates of rides during January 2016</em></p><p>This computes a rollup of the total number of rides taken in each hour covered by our data.</p><p>For more on how continuous aggregates work, see&nbsp;<a href="https://www.tigerdata.com/docs/use-timescale/latest/continuous-aggregates/" rel="nofollow">these docs</a>.</p><h3 id="example-1-auto-switch-between-daily-aggregate-hourly-aggregate-and-raw-data">Example 1: Auto-switch between daily aggregate, hourly aggregate, and raw data</h3><p><a href="https://gist.github.com/antonum/94a8c6579a5ddb379588c153504f5472#example-1-auto-switch-between-daily-aggregate-hourly-aggregate-and-raw-data"></a></p><p>In the example below, we have a query using&nbsp;<code>UNION ALL</code>, where each branch only returns rows from its table or view for a given length of the time interval selected in the Grafana UI (exposed through the&nbsp;<code>$__timeFrom</code>&nbsp;and&nbsp;<code>$__timeTo</code>&nbsp;macros in Grafana).</p><p>As the comments in the code below show, we use daily aggregates for intervals greater than 14 days, hourly aggregates for intervals between 3 and 14 days, and per-minute aggregates calculated on the fly from raw data for intervals less than 3 days:</p><p><strong>Switching between daily aggregation, hourly aggregation, and minute aggregations on raw data</strong></p><pre><code class="language-SQL">-- Use Daily aggregate for intervals greater than 14 days
SELECT day as time, ride_count, 'daily' AS metric
FROM rides_daily
WHERE  $__timeTo()::timestamp - $__timeFrom()::timestamp &gt; '14 days'::interval AND $__timeFilter(day)
UNION ALL
-- Use hourly aggregate for intervals between 3 and 14 days
SELECT hour, ride_count, 'hourly' AS metric
FROM rides_hourly
WHERE  $__timeTo()::timestamp - $__timeFrom()::timestamp BETWEEN '3 days'::interval AND '14 days'::interval AND $__timeFilter(hour)
UNION ALL
-- Use raw data (rolled up into 1-minute buckets) for intervals shorter than 3 days
SELECT * FROM
    (SELECT time_bucket('1 minute', pickup_datetime) AS time, COUNT(*) AS ride_count, 'minute' AS metric
    FROM rides
    WHERE  $__timeTo()::timestamp - $__timeFrom()::timestamp &lt; '3 days'::interval AND $__timeFilter(pickup_datetime)
    GROUP BY 1) minute
ORDER BY 1;</code></pre><p><em>Query to switch between daily aggregation, hourly aggregation, and per-minute aggregations created on the fly using raw data</em></p><p><strong>This produces the following behavior in our Grafana panels:</strong></p><p>Querying daily aggregates for intervals greater than 14 days:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Screen-Shot-2020-07-27-at-2.46.44-PM.png" class="kg-image" alt="Graph showing rides taking place in daily intervals for an interval greater than 14 days" loading="lazy" width="1600" height="752" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/01/Screen-Shot-2020-07-27-at-2.46.44-PM.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/01/Screen-Shot-2020-07-27-at-2.46.44-PM.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Screen-Shot-2020-07-27-at-2.46.44-PM.png 1600w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The graph is powered by the daily aggregate view for intervals greater than 14 days</span></figcaption></figure><p>Querying hourly aggregates for intervals between 3-14 days:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Screen-Shot-2020-07-27-at-2.47.14-PM.png" class="kg-image" alt="Graph showing rides taking place in hourly intervals for an interval between 3 and 14 days" loading="lazy" width="1600" height="749" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/01/Screen-Shot-2020-07-27-at-2.47.14-PM.png 600w, 
https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/01/Screen-Shot-2020-07-27-at-2.47.14-PM.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Screen-Shot-2020-07-27-at-2.47.14-PM.png 1600w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The graph is powered by the hourly aggregate view for intervals between 3 and 14 days</span></figcaption></figure><p>Querying raw data for intervals less than 3 days:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Screen-Shot-2020-07-27-at-2.47.36-PM.png" class="kg-image" alt="Graph showing rides taking place in minute intervals for an interval less than 3 days" loading="lazy" width="1600" height="751" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/01/Screen-Shot-2020-07-27-at-2.47.36-PM.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/01/Screen-Shot-2020-07-27-at-2.47.36-PM.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Screen-Shot-2020-07-27-at-2.47.36-PM.png 1600w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The graph is powered by rolling up raw data into 1-minute intervals on the fly for intervals of less than 3 days</span></figcaption></figure><p>This allows you to automatically switch between different aggregations of data, depending on the length of the time interval selected. 
Notice how the granularity of the data gets richer as we drill down from looking at data over the month of January to looking at data in a single day:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/final_5f1f1f7b0c9e000015634870_766127-1.gif" class="kg-image" alt="Graphing changing from daily interval to hourly interval to minute interval as we zoom in" loading="lazy" width="1280" height="720" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/01/final_5f1f1f7b0c9e000015634870_766127-1.gif 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/01/final_5f1f1f7b0c9e000015634870_766127-1.gif 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/final_5f1f1f7b0c9e000015634870_766127-1.gif 1280w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Demo of automatically switching between daily, hourly, and minute aggregations of data, depending on time interval selected</span></figcaption></figure><h3 id="example-2-auto-switch-between-daily-hourly-and-10-minute-aggregates">Example 2: Auto-switch between daily, hourly, and 10-minute aggregates</h3><p><a href="https://gist.github.com/antonum/94a8c6579a5ddb379588c153504f5472#example-2-auto-switch-between-daily-hourly-and-10-minute-aggregates"></a></p><p>Querying only from continuous aggregates allows us to speed up our dashboards even further. 
You might not want to directly query the hypertable that houses your raw data, as the queries may be slower due to things like new data being inserted into the hypertable.</p><p>The following example shows a query for switching between aggregations of different granularity without using the raw data hypertable at all (unlike Example 1, which does on-the-fly rollups of raw data).</p><p>First, let’s create 10-minute rollups of the raw data:</p><pre><code class="language-SQL">CREATE MATERIALIZED VIEW rides_10mins
WITH (timescaledb.continuous)
AS
    SELECT time_bucket('10 minutes', pickup_datetime) AS bucket, COUNT(*) AS ride_count
    FROM rides
    GROUP BY bucket
WITH NO DATA;

SELECT add_continuous_aggregate_policy('rides_10mins',
    start_offset =&gt; INTERVAL '1 month',
    end_offset =&gt; INTERVAL '10 minutes',
    schedule_interval =&gt; INTERVAL '10 minutes');</code></pre><p><em>Query to create 10-minute rollups of data in a continuous aggregate</em></p><p><strong>Switching between daily aggregation, hourly aggregation, and 10-minute aggregations (no raw data involved)</strong></p><pre><code class="language-SQL">-- Use Daily aggregate for intervals greater than 14 days
SELECT day as time, ride_count, 'daily' AS metric
FROM rides_daily
WHERE  $__timeTo()::timestamp - $__timeFrom()::timestamp &gt; '14 days'::interval AND  $__timeFilter(day)
UNION ALL
-- Use hourly aggregate for intervals between 3 and 14 days
SELECT hour, ride_count, 'hourly' AS metric
FROM rides_hourly
WHERE $__timeTo()::timestamp - $__timeFrom()::timestamp BETWEEN '3 days'::interval AND '14 days'::interval AND  $__timeFilter(hour)
UNION ALL
-- Use 10-minute aggregate for intervals between 0 and 3 days
SELECT bucket, ride_count, '10min' AS metric
FROM rides_10mins
WHERE $__timeTo()::timestamp - $__timeFrom()::timestamp &lt; '3 days'::interval AND  $__timeFilter(bucket)
ORDER BY 1;</code></pre><p><em>Query to switch between daily aggregation, hourly aggregation, and 10-minute aggregations, all using continuous aggregates</em></p><p>In this post, we saw how to use&nbsp;<code>UNION ALL</code>&nbsp;to automatically switch which aggregate view we query based on the time interval selected, so that we can do more efficient drill-downs and make Grafana faster.</p><p>You can find more information about the&nbsp;<code>UNION ALL</code>&nbsp;operator and how it works in this&nbsp;<a href="https://www.postgresqltutorial.com/postgresql-union/" rel="nofollow">PostgreSQL tutorial</a>&nbsp;(from the aptly named PostgreSQLtutorial.com) and the&nbsp;<a href="https://www.postgresql.org/docs/current/queries-union.html" rel="nofollow">official PostgreSQL documentation</a>.</p><p>That’s it! You can modify this code to change the aggregates you query, time intervals, and the metrics you want to visualize to suit your needs and projects.</p><p>Happy auto-switching!</p><h2 id="next-steps">Next Steps</h2><p>In this tutorial, we learned how to use PostgreSQL&nbsp;<code>UNION ALL</code>&nbsp;to solve a common Grafana issue: slow-loading dashboards when we want to query fine-grained raw data (like millisecond performance metrics).</p><p>The result: you create graphs that enable you to switch between different aggregations of your data automatically. This allows you to quickly drill down into your metrics, saving time&nbsp;<em>and</em>&nbsp;CPU resources!</p><p>For more resources&nbsp;<a href="https://www.tigerdata.com/blog/slow-grafana-performance-learn-how-to-fix-it-using-downsampling/" rel="nofollow">to speed up Grafana performance, learn how you can fix slow dashboards using downsampling</a>.</p><h3 id="learn-more">Learn more</h3><p><a href="https://gist.github.com/antonum/94a8c6579a5ddb379588c153504f5472#learn-more"></a></p><p>Want more Grafana tips? 
Explore our&nbsp;<a href="https://www.tigerdata.com/docs/tutorials/latest/grafana/" rel="nofollow">Grafana tutorials</a>.</p><p>Need a database to power your dashboarding and data analysis?&nbsp;<strong>Get started with a free</strong>&nbsp;<a href="https://console.cloud.timescale.com/signup" rel="nofollow"><strong>Tiger Cloud account</strong></a><strong>.</strong></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Achieving the Best of Both Worlds: Ensuring Up-To-Date Results With Real-Time Aggregation]]></title>
            <description><![CDATA[Real-time aggregates (released with TimescaleDB 1.7) build on continuous aggregates' ability to increase query speed and optimize storage. Learn what's new, details about how they work, and how to get started. ]]></description>
            <link>https://www.tigerdata.com/blog/achieving-the-best-of-both-worlds-ensuring-up-to-date-results-with-real-time-aggregation</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/achieving-the-best-of-both-worlds-ensuring-up-to-date-results-with-real-time-aggregation</guid>
            <category><![CDATA[Product & Engineering]]></category>
            <category><![CDATA[Engineering]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Sven Klemm]]></dc:creator>
            <pubDate>Thu, 07 May 2020 15:11:33 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2020/05/mana5280-dkeOcAkors4-unsplash.jpg">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2020/05/mana5280-dkeOcAkors4-unsplash.jpg" alt="Achieving the Best of Both Worlds: Ensuring Up-To-Date Results With Real-Time Aggregation" /><p>Real-time aggregates (released with TimescaleDB 1.7) build on continuous aggregates' ability to increase query speed and optimize storage. Learn what's new, details about how they work, and how to get started. </p><p>One constant across all time-series use cases is data: metrics, logs, events, sensor readings; IT and application performance monitoring, SaaS applications, IoT, martech, fintech, and more.  Lots (and lots) of data. What’s more, it typically arrives <em>continuously.</em></p><p>This need to handle large volumes of constantly generated data motivated some of our earliest TimescaleDB architectural decisions, such as its use of automated time-based partitioning and local-only indexing to achieve high insert rates.  And last year, we added type-specific <a href="https://www.tigerdata.com/blog/building-columnar-compression-in-a-row-oriented-database" rel="noreferrer">columnar</a> compression to significantly shrink the overhead involved in storing all of this data (often by 90% or higher – <a href="https://timescale.ghost.io/blog/blog/building-columnar-compression-in-a-row-oriented-database/?utm_source=timescale-real-time-aggregates-details&amp;utm_medium=blog&amp;utm_campaign=1-7-release&amp;utm_content=1-6-release-blog">see our technical description and benchmarking results</a>). 
</p><p>And another key capability in TimescaleDB, which is the focus of this post, has been <em>continuous aggregates</em>, which we first <a href="https://timescale.ghost.io/blog/blog/continuous-aggregates-faster-queries-with-automatically-maintained-materialized-views/?utm_source=timescale-real-time-aggregates-details&amp;utm_medium=blog&amp;utm_campaign=1-7-release&amp;utm_content=continuous-aggs-1-3-blog">introduced</a> in TimescaleDB 1.3.  Continuous aggregates allow one to specify a SQL query that continually processes raw data into a so-called materialized table.  </p><p>Continuous aggregates are somewhat similar to materialized views in databases, but unlike a materialized view (as in <a href="https://www.postgresql.org/docs/current/rules-materializedviews.html">PostgreSQL</a>), continuous aggregates do not need to be refreshed manually; the view will be refreshed automatically in the background as new data is added, or old data is modified. Additionally, TimescaleDB does not need to re-calculate all of the data on every refresh. Only new and/or invalidated data will be calculated. And since this re-aggregation is automatic – it executes as a background job at regular intervals – this process doesn’t add any maintenance burden to your database.</p><p>This is where most database or streaming systems that offer continuous aggregates or continuous queries give up.  
We knew we could do better.</p><p>Enter Real-Time Aggregation, introduced in TimescaleDB 1.7 (<a href="https://timescale.ghost.io/blog/blog/timescaledb-1-7-fast-continuous-aggregates-with-real-time-views-postgresql-12-support-and-more-community-features/?utm_source=timescale-real-time-aggregates-details&amp;utm_medium=blog&amp;utm_campaign=1-7-release&amp;utm_content=1-7-release-announcement-blog">see our release blog</a>).</p><h2 id="quick-background-on-continuous-aggregates">Quick Background on Continuous Aggregates</h2><p>The benefits of continuous aggregates are twofold:</p><ul><li><strong>Query performance.</strong>  By executing queries against pre-calculated results, rather than the underlying raw data, continuous aggregates can significantly improve query performance.</li><li><strong>Storage savings with </strong><a href="https://docs.timescale.com/latest/using-timescaledb/continuous-aggregates?utm_source=timescale-real-time-aggregates-details&amp;utm_medium=blog&amp;utm_campaign=1-7-release&amp;utm_content=continuous-aggs-drop-data-docs#dropping-data"><strong>downsampling</strong></a><strong><em>. </em></strong> Continuous aggregates are often combined with data retention policies for better storage management.  Raw data can be continually aggregated into a materialized table, and dropped after it reaches a certain age.  So the database may only store some fixed period of raw data (say, one week), yet store aggregate data for much longer.</li></ul><p>Consider the following example, collecting system metrics around CPU usage and storing them in a CPU metrics <a href="https://www.tigerdata.com/blog/database-indexes-in-postgresql-and-timescale-cloud-your-questions-answered" rel="noreferrer">hypertable</a>, where each row includes a timestamp, hostname, and 3 metrics around CPU usage (usage_user, usage_system, usage_iowait).  
</p><p>We collect these statistics every second per server.</p><pre><code>            time              | hostname |     usage_user     |    usage_system     |    usage_iowait
-------------------------------+----------+--------------------+---------------------+---------------------
2020-05-06 02:32:34.627143+00 | host0    | 0.5378765249290502 |  0.2958572490961302 | 0.10685818344495246
2020-05-06 02:32:34.627143+00 | host1    | 0.3175958910709298 |  0.7874926624954846 | 0.16615243032654803
2020-05-06 02:32:34.627143+00 | host2    | 0.4788377981501064 | 0.18277343256546175 |  0.7183967491020162</code></pre><p>So a query that computes an hourly histogram of CPU usage over the course of 7 days for 10 servers will process 10 servers * 60 seconds * 60 minutes * 24 hours * 7 days = 6,048,000 rows of data.</p><p>On the other hand, if we pre-compute a histogram per hour, then the same query on the continuous aggregate table will only need to process 10 servers * 24 hours * 7 days = 1,680 rows of data.</p><p>But pre-computed results in the continuous aggregate view will lag behind the latest data, as the materialization only runs at scheduled intervals.  So, both to more cheaply handle out-of-order data and to avoid excessive load, there is typically some <em>refresh lag </em>between the raw data and when it’s materialized.  In fact, this refresh lag is configurable in TimescaleDB, such that the continuous aggregation engine will not materialize data that’s newer than the refresh lag.  </p><p>(Slightly more specifically, if we compute aggregations across some <a href="https://docs.timescale.com/latest/using-timescaledb/reading-data?utm_source=timescale-real-time-aggregates-details&amp;utm_medium=blog&amp;utm_campaign=1-7-release&amp;utm_content=time-bucket-docs#time-bucket">time bucket</a>, such as hourly, then each hourly interval has a start time and end time.  TimescaleDB will only materialize data when its corresponding aggregation interval’s <em>end time</em> is older than the refresh lag. So, if we are doing hourly rollups with a 30-minute refresh lag, then we’d only perform the materialized aggregation from, say, 2:00am - 3:00am <em>after</em> 3:30am.)</p><p>So, on one hand, using a continuous aggregate view has cut down the amount of data we process at query time by 3600x (i.e., from more than 6 million rows to fewer than 2,000).  
But, in this view, we’re often missing the last hour or so of data.</p><p>While you could just make the refresh lag smaller and smaller to work around this problem, it comes at the cost of higher and higher load; unless these aggregates are recomputed on <em>every</em> new insert (expensive!), they’re fundamentally always stale.</p><h2 id="introducing-real-time-aggregation">Introducing Real-Time Aggregation</h2><p>With real-time aggregation, when you query a continuous aggregate view, rather than just getting the pre-computed aggregate from the materialized table, the query will transparently combine this pre-computed aggregate with raw data from the hypertable that’s yet to be materialized.  And, by combining raw and materialized data in this way, you get accurate and up-to-date results, while still enjoying the speedups that come from pre-computing a large portion of the result.</p><p>Let’s return to the example above.  Recall that when we created hourly rollups, we set the refresh lag to 30 minutes, so our continuous aggregate view will lag behind by 30-90 minutes.</p><p>But, when querying a view that supports real-time aggregation, the same query as before for hourly data across the past week will process and combine the results from two tables:</p><ul><li>Materialized table: 10 servers * (22 hours + 24 hours * 6 days) = 1,660 rows</li><li>Raw data: 10 servers * 60 seconds * 90 minutes = 54,000 rows  </li></ul><p>So now, with these “back of the envelope” calculations, we’ve processed a total of 55,660 rows, still well below the 6 million from before. 
Moreover, the last 90 minutes of data are more likely to already be memory resident for even better performance, given the database page caching already happening for recent data.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2020/09/image.png" class="kg-image" alt="Diagram showing how data moves to a materialized table as it ages and continuous aggregate queries execute, and how real-time aggregates combine this data with newer, not yet materialized data" loading="lazy" width="1500" height="1154" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2020/09/image.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2020/09/image.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2020/09/image.png 1500w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Real-time aggregates allow you to query your pre-calculated data </span><b><strong style="white-space: pre-wrap;">and</strong></b><span style="white-space: pre-wrap;"> newer, not yet materialized "raw" data</span></figcaption></figure><p>The above illustration shows this in practice. The database internally maintains a <strong><em>completion threshold</em></strong> as metadata, which records the point in time up to which all records from the raw table have been materialized.  This completion threshold trails the current time by at least the <em>refresh lag </em>we discussed earlier, and gets updated by the database engine whenever a background task updates the materialized view.</p><p><em>(In fact, it’s a bit more complicated given TimescaleDB’s ability to handle late data that gets written after some time region has already been materialized, i.e., behind the completion threshold.  
But we’re going to ignore how TimescaleDB tracks invalidation regions in this post.)</em></p><p>So now when processing our query covering the past seven days, the database engine will conceptually take a UNION ALL of results from the materialized table, starting at <code>now() - interval '7 days'</code> up to the completion threshold, and results from the raw table, from the completion threshold up to <code>now()</code>.</p><p>But rather than just describe this behavior, let’s walk through a concrete example and compare our query times without continuous aggregates, with vanilla continuous aggregates, and with real-time aggregation enabled.</p><p>These capabilities were developed by Timescale engineers: <a href="https://github.com/svenklemm"><em>Sven Klemm</em></a><em>, </em><a href="https://github.com/cevian"><em>Matvey Arye</em></a><em>, </em><a href="https://github.com/gayyappan"><em>Gayathri Ayyapan</em></a><em>, </em><a href="https://github.com/davidkohn88"><em>David Kohn</em></a>, and <a href="https://github.com/JLockerman"><em>Josh Lockerman</em></a>.</p><h2 id="testing-real-time-aggregation">Testing Real-Time Aggregation</h2><p>In the following, I’ve created a TimescaleDB 1.7 instance via <a href="https://www.timescale.com/products">Managed Service for TimescaleDB</a> (specifically, a “basic-100-compute-optimized” instance with PostgreSQL 12, 4 vCPU, and 100GB SSD storage), and then created the following hypertable:</p><pre><code class="language-SQL">$ psql postgres://tsdbadmin@tsdb-bb8e760-internal-90d0.a.timescaledb.io:26479/defaultdb?sslmode=require

=&gt; CREATE TABLE cpu (
      time TIMESTAMPTZ,
      hostname TEXT,
      usage_user FLOAT,
      usage_system FLOAT,
      usage_iowait FLOAT
   );

=&gt; SELECT create_hypertable ('cpu', 'time', 
      chunk_time_interval =&gt; interval '1d');</code></pre><p>I’m now going to load the hypertable with 14 days of synthetic data (which is created with the following INSERT statement):</p><pre><code class="language-SQL">=&gt; INSERT INTO cpu (
   SELECT time, hostname, random(), random(), random()
      FROM generate_series(NOW() - interval '14d', NOW(), '1s') AS time
      CROSS JOIN LATERAL (
         SELECT 'host' || host_id::text AS hostname 
            FROM generate_series(0,9) AS host_id
      ) h
   );</code></pre><p>Okay, so that inserted 12,096,010 rows of synthetic data into our hypertable of the following format, stretching from 2:32am UTC on April 22 to 2:32am UTC on May 6:</p><pre><code class="language-SQL">=&gt; SELECT * FROM cpu ORDER BY time DESC LIMIT 3;

             time              | hostname |     usage_user     |    usage_system     |    usage_iowait     
-------------------------------+----------+--------------------+---------------------+---------------------
 2020-05-06 02:32:34.627143+00 | host0    | 0.5378765249290502 |  0.2958572490961302 | 0.10685818344495246
 2020-05-06 02:32:34.627143+00 | host1    | 0.3175958910709298 |  0.7874926624954846 | 0.16615243032654803
 2020-05-06 02:32:34.627143+00 | host2    | 0.4788377981501064 | 0.18277343256546175 |  0.7183967491020162
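
-- Sketch, not part of the original session: re-deriving the row count reported above.
-- generate_series() includes both endpoints, so 14 days at 1-second steps yields
-- 14 * 86,400 + 1 timestamps, each cross-joined with 10 hosts.
=&gt; SELECT (14 * 24 * 60 * 60 + 1) * 10 AS expected_rows;  -- 12096010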


=&gt; SELECT min(time) AS start, max(time) AS end FROM cpu;

-[ RECORD 1 ]------------------------
start | 2020-04-22 02:32:34.627143+00
end   | 2020-05-06 02:32:34.627143+00</code></pre><p>Let’s now create a continuous aggregate view on this table with hourly <a href="https://docs.timescale.com/latest/api?utm_source=timescale-real-time-aggregates-details&amp;utm_medium=blog&amp;utm_campaign=1-7-release&amp;utm_content=api-docs-histograms#histogram">histograms</a>: </p><pre><code class="language-SQL">=&gt; CREATE VIEW cpu_1h 
   WITH (timescaledb.continuous, 
         timescaledb.refresh_lag = '30m',
         timescaledb.refresh_interval = '30m')
   AS
      SELECT 
         time_bucket('1 hour', time) AS hour,
         hostname, 
         histogram(usage_user, 0.0, 1.0, 5) AS hist_usage_user,
         histogram(usage_system, 0.0, 1.0, 5) AS hist_usage_system,
         histogram(usage_iowait, 0.0, 1.0, 5) AS hist_usage_iowait
      FROM cpu
      GROUP BY hour, hostname;</code></pre><p>By default, queries to this view use these real-time aggregation features.  If you want to disable real-time aggregation, set <code>materialized_only = true</code> when creating the view or by later ALTERing the view.  (See <a href="https://docs.timescale.com/latest/api?utm_source=timescale-real-time-aggregates-details&amp;utm_medium=blog&amp;utm_campaign=1-7-release&amp;utm_content=continuous-aggs-create-view-docs#continuous_aggregate-create_view">API docs here</a>.)</p><p>Now, the job scheduling framework will start to asynchronously process this view, which we can see in our <a href="https://docs.timescale.com/latest/api?utm_source=timescale-real-time-aggregates-details&amp;utm_medium=blog&amp;utm_campaign=1-7-release&amp;utm_content=continuous-aggs-stats-docs#timescaledb_information-continuous_aggregate_stats">informational view</a>.  (You can also <a href="https://docs.timescale.com/latest/api?utm_source=timescale-real-time-aggregates-details&amp;utm_medium=blog&amp;utm_campaign=1-7-release&amp;utm_content=continuous-aggs-refresh-view-docs#continuous_aggregate-refresh_view">manually force</a> the materialization to occur if needed.)  <br></p><pre><code class="language-SQL">=&gt; SELECT * FROM timescaledb_information.continuous_aggregate_stats;

-[ RECORD 1 ]
view_name              | cpu_1h
completed_threshold    | 2020-05-06 02:00:00+00
invalidation_threshold | 2020-05-06 02:00:00+00
job_id                 | 1000
last_run_started_at    | 2020-05-06 02:34:08.300524+00
last_successful_finish | 2020-05-06 02:34:09.04923+00
last_run_status        | Success
job_status             | Scheduled
last_run_duration      | 00:00:00.748706
next_scheduled_run     | 2020-05-06 03:04:09.04923+00
total_runs             | 17
total_successes        | 17
total_failures         | 0
total_crashes          | 0
</code></pre><p>From this data, we see that the materialized view includes data up to 2:00am on May 6, while from above we’ve learned that the raw data goes up to 2:32am. </p><p>Let’s try our query directly on the raw table, and use an EXPLAIN ANALYZE to both show the database plan, as well as actually execute the query and collect timing information.  (Note that in many use cases, one would offset queries from <code>now() - &lt;some interval&gt;</code>. But to ensure that we use identical datasets in our subsequent analysis, we explicitly select the interval offset from the dataset’s last timestamp.)</p><pre><code class="language-SQL">=&gt; EXPLAIN (ANALYZE, COSTS OFF)
   SELECT 
      time_bucket('1 hour', time) AS hour,
      hostname, 
      histogram(usage_user, 0.0, 1.0, 5) AS hist_usage_user,
      histogram(usage_system, 0.0, 1.0, 5) AS hist_usage_system,
      histogram(usage_iowait, 0.0, 1.0, 5) AS hist_usage_iowait
   FROM cpu
   WHERE time &gt; '2020-05-06 02:32:34.627143+00'::timestamptz - interval '7 days'
   GROUP BY hour, hostname
   ORDER BY hour DESC;

QUERY PLAN             
----------------------------------------------------------------
 Finalize GroupAggregate (actual time=1859.306..1862.331 rows=1690 loops=1)
   Group Key: (time_bucket('01:00:00'::interval, cpu."time")), cpu.hostname
   -&gt;  Gather Merge (actual time=1841.735..1849.604 rows=1881 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         -&gt;  Sort (actual time=1194.162..1194.222 rows=627 loops=3)
               Sort Key: (time_bucket('01:00:00'::interval, cpu."time")) DESC, cpu.hostname
               Sort Method: quicksort  Memory: 25kB
               Worker 0:  Sort Method: quicksort  Memory: 274kB
               Worker 1:  Sort Method: quicksort  Memory: 274kB
               -&gt;  Partial HashAggregate (actual time=1193.198..1193.594 rows=627 loops=3)
                     Group Key: time_bucket('01:00:00'::interval, cpu."time"), cpu.hostname
                     -&gt;  Parallel Custom Scan (ChunkAppend) on cpu (actual time=9.840..716.952 rows=2016000 loops=3)
                           Chunks excluded during startup: 7
                           -&gt;  Parallel Seq Scan on _hyper_1_14_chunk (actual time=14.751..199.098 rows=864000 loops=1)
                                 Filter: ("time" &gt; ('2020-05-06 02:32:34.627143+00'::timestamp with time zone - '7 days'::interval))
                           -&gt;  Parallel Seq Scan on _hyper_1_13_chunk (actual time=14.749..201.100 rows=864000 loops=1)
                                 Filter: ("time" &gt; ('2020-05-06 02:32:34.627143+00'::timestamp with time zone - '7 days'::interval))
                           -&gt;  Parallel Seq Scan on _hyper_1_12_chunk (actual time=0.025..182.591 rows=864000 loops=1)
                                 Filter: ("time" &gt; ('2020-05-06 02:32:34.627143+00'::timestamp with time zone - '7 days'::interval))
                           -&gt;  Parallel Seq Scan on _hyper_1_11_chunk (actual time=0.031..182.812 rows=864000 loops=1)
                                 Filter: ("time" &gt; ('2020-05-06 02:32:34.627143+00'::timestamp with time zone - '7 days'::interval))
                           -&gt;  Parallel Seq Scan on _hyper_1_10_chunk (actual time=0.035..183.918 rows=864000 loops=1)
                                 Filter: ("time" &gt; ('2020-05-06 02:32:34.627143+00'::timestamp with time zone - '7 days'::interval))
                           -&gt;  Parallel Seq Scan on _hyper_1_9_chunk (actual time=0.019..184.416 rows=864000 loops=1)
                                 Filter: ("time" &gt; ('2020-05-06 02:32:34.627143+00'::timestamp with time zone - '7 days'::interval))
                           -&gt;  Parallel Seq Scan on _hyper_1_8_chunk (actual time=0.823..91.605 rows=386225 loops=2)
                                 Filter: ("time" &gt; ('2020-05-06 02:32:34.627143+00'::timestamp with time zone - '7 days'::interval))
                                 Rows Removed by Filter: 45775
                           -&gt;  Parallel Seq Scan on _hyper_1_15_chunk (actual time=0.022..20.277 rows=91550 loops=1)
                                 Filter: ("time" &gt; ('2020-05-06 02:32:34.627143+00'::timestamp with time zone - '7 days'::interval))

 Planning Time: 1.917 ms
 Execution Time: 1921.753 ms</code></pre><p>Note that TimescaleDB’s constraint exclusion excluded 7 of the chunks from being queried given the WHERE predicate (as the query was for the last 7 days of the 14-day dataset), then processed the query on the remaining 8 chunks (performing a scan over 6,048,000 rows) using two parallel workers.  The query in total took just over 1.9 seconds.</p><p>Now let’s try the query on our materialized table, first turning off real-time aggregation just for this experiment: </p><pre><code class="language-SQL">=&gt; ALTER VIEW cpu_1h set (timescaledb.materialized_only = true);</code></pre><p>First, let’s look at the view definition, which defines a SELECT on the materialized hypertable with the specified GROUP BYs.  But we also see that each of the histograms calls “finalize_agg.”  TimescaleDB doesn’t precisely pre-compute and store the exact answer that’s specified in the query, but rather a <a href="https://www.postgresql.org/docs/current/xaggr.html#XAGGR-PARTIAL-AGGREGATES">partial aggregate</a> that is then “finalized” at query time, which will allow for greater parallelization and rebucketing at query time (in a future release).</p><pre><code class="language-SQL">=&gt; \d+ cpu_1h;

                                          View "public.cpu_1h"
      Column       |           Type           | Collation | Nullable | Default | Storage  | Description 
-------------------+--------------------------+-----------+----------+---------+----------+-------------
 hour              | timestamp with time zone |           |          |         | plain    | 
 hostname          | text                     |           |          |         | extended | 
 hist_usage_user   | integer[]                |           |          |         | extended | 
 hist_usage_system | integer[]                |           |          |         | extended | 
 hist_usage_iowait | integer[]                |           |          |         | extended | 

View definition:
 SELECT _materialized_hypertable_2.hour,
    _materialized_hypertable_2.hostname,
    _timescaledb_internal.finalize_agg('histogram(double precision,double precision,double precision,integer)'::text, NULL::name, NULL::name, '{{pg_catalog,float8},{pg_catalog,float8},{pg_catalog,float8},{pg_catalog,int4}}'::name[], _materialized_hypertable_2.agg_3_3, NULL::integer[]) AS hist_usage_user,
    _timescaledb_internal.finalize_agg(...) AS hist_usage_system,
    _timescaledb_internal.finalize_agg(...) AS hist_usage_iowait
   FROM _timescaledb_internal._materialized_hypertable_2
  GROUP BY _materialized_hypertable_2.hour, _materialized_hypertable_2.hostname;</code></pre><p>Now let’s run the query with vanilla continuous aggregates enabled:</p><pre><code class="language-SQL">=&gt; EXPLAIN (ANALYZE, COSTS OFF)
   SELECT * FROM cpu_1h
   WHERE hour &gt; '2020-05-06 02:32:34.627143+00'::timestamptz - interval '7 days'
   ORDER BY hour DESC;

QUERY PLAN
----------------------------------------------------------------
 Sort (actual time=3.218..3.312 rows=1670 loops=1)
   Sort Key: _materialized_hypertable_2.hour DESC
   Sort Method: quicksort  Memory: 492kB
   -&gt;  HashAggregate (actual time=1.943..2.891 rows=1670 loops=1)
         Group Key: _materialized_hypertable_2.hour, _materialized_hypertable_2.hostname
         -&gt;  Custom Scan (ChunkAppend) on _materialized_hypertable_2 (actual time=0.064..0.688 rows=1670 loops=1)
               Chunks excluded during startup: 1
               -&gt;  Seq Scan on _hyper_2_17_chunk (actual time=0.063..0.590 rows=1670 loops=1)
                     Filter: (hour &gt; ('2020-05-06 02:32:34.627143+00'::timestamp with time zone - '7 days'::interval))
                     Rows Removed by Filter: 270

 Planning Time: 0.645 ms
 Execution Time: 3.461 ms</code></pre><p>Just 4 milliseconds, after a scan of 1,670 rows in the materialized hypertable.  And let’s look at the most recent 3 rows returned for a specific host:</p><pre><code class="language-SQL">=&gt; SELECT hour, hostname, hist_usage_user
    FROM cpu_1h
    WHERE hour &gt; '2020-05-06 02:32:34.627143+00'::timestamptz - interval '7 days'         
       AND hostname = 'host0'
    ORDER BY hour DESC LIMIT 3;

          hour          | hostname |      hist_usage_user      
------------------------+----------+---------------------------
 2020-05-06 01:00:00+00 | host0    | {0,781,676,712,719,712,0}
 2020-05-06 00:00:00+00 | host0    | {0,736,714,776,689,685,0}
 2020-05-05 23:00:00+00 | host0    | {0,714,759,715,692,720,0}</code></pre><p>Note that the most recent record is from the 1:00am - 2:00am hour.</p><p>Now let’s re-enable real-time aggregation and try the same query, first showing how the real-time aggregation is defined as a UNION ALL between the materialized and raw data.</p><pre><code class="language-SQL">=&gt; ALTER VIEW cpu_1h set (timescaledb.materialized_only = false);

=&gt; \d+ cpu_1h;

                                          View "public.cpu_1h"
      Column       |           Type           | Collation | Nullable | Default | Storage  | Description 
-------------------+--------------------------+-----------+----------+---------+----------+-------------
 hour              | timestamp with time zone |           |          |         | plain    | 
 hostname          | text                     |           |          |         | extended | 
 hist_usage_user   | integer[]                |           |          |         | extended | 
 hist_usage_system | integer[]                |           |          |         | extended | 
 hist_usage_iowait | integer[]                |           |          |         | extended | 

View definition:
 SELECT _materialized_hypertable_2.hour,
    _materialized_hypertable_2.hostname,
    _timescaledb_internal.finalize_agg(...) AS hist_usage_user,
    _timescaledb_internal.finalize_agg(...) AS hist_usage_system,
    _timescaledb_internal.finalize_agg(...) AS hist_usage_iowait
   FROM _timescaledb_internal._materialized_hypertable_2
  WHERE _materialized_hypertable_2.hour &lt; COALESCE(_timescaledb_internal.to_timestamp(_timescaledb_internal.cagg_watermark(1)), '-infinity'::timestamp with time zone)
  GROUP BY _materialized_hypertable_2.hour, _materialized_hypertable_2.hostname
UNION ALL
 SELECT time_bucket('01:00:00'::interval, cpu."time") AS hour,
    cpu.hostname,
    histogram(cpu.usage_user, 0.0::double precision, 1.0::double precision, 5) AS hist_usage_user,
    histogram(cpu.usage_system, 0.0::double precision, 1.0::double precision, 5) AS hist_usage_system,
    histogram(cpu.usage_iowait, 0.0::double precision, 1.0::double precision, 5) AS hist_usage_iowait
   FROM cpu
  WHERE cpu."time" &gt;= COALESCE(_timescaledb_internal.to_timestamp(_timescaledb_internal.cagg_watermark(1)), '-infinity'::timestamp with time zone)
  GROUP BY (time_bucket('01:00:00'::interval, cpu."time")), cpu.hostname;
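
-- Sketch, not part of the original session: read the completion threshold directly.
-- Assumes the internal functions and hypertable id (1) shown in the view definition
-- above; these are internal and may differ on other instances or releases.
=&gt; SELECT _timescaledb_internal.to_timestamp(
        _timescaledb_internal.cagg_watermark(1)) AS completion_threshold;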


=&gt; EXPLAIN (ANALYZE, COSTS OFF)
   SELECT * FROM cpu_1h
   WHERE hour &gt; '2020-05-06 02:32:34.627143+00'::timestamptz - interval '7 days'
   ORDER BY hour DESC;

QUERY PLAN               
----------------------------------------------------------------
 Sort (actual time=20.871..21.055 rows=1680 loops=1)
   Sort Key: _materialized_hypertable_2.hour DESC
   Sort Method: quicksort  Memory: 495kB
   -&gt;  Append (actual time=1.842..20.536 rows=1680 loops=1)
         -&gt;  HashAggregate (actual time=1.841..2.789 rows=1670 loops=1)
               Group Key: _materialized_hypertable_2.hour, _materialized_hypertable_2.hostname
               -&gt;  Custom Scan (ChunkAppend) on _materialized_hypertable_2 (actual time=0.105..0.580 rows=1670 loops=1)
                     Chunks excluded during startup: 1
                     -&gt;  Index Scan using _hyper_2_17_chunk__materialized_hypertable_2_hour_idx on _hyper_2_17_chunk (actual time=0.104..0.475 rows=1670 loops=1)
                           Index Cond: ((hour &lt; COALESCE(_timescaledb_internal.to_timestamp(_timescaledb_internal.cagg_watermark(1)), '-infinity'::timestamp with time zone)) AND (hour &gt; ('2020-05-06 02:32:34.627143+00'::timestamp with time zone - '7 days'::interval)))
         -&gt;  HashAggregate (actual time=17.641..17.655 rows=10 loops=1)
               Group Key: time_bucket('01:00:00'::interval, cpu."time"), cpu.hostname
               -&gt;  Custom Scan (ChunkAppend) on cpu (actual time=0.165..12.297 rows=19550 loops=1)
                     Chunks excluded during startup: 14
                     -&gt;  Index Scan using _hyper_1_15_chunk_cpu_time_idx on _hyper_1_15_chunk (actual time=0.163..9.723 rows=19550 loops=1)
                           Index Cond: ("time" &gt;= COALESCE(_timescaledb_internal.to_timestamp(_timescaledb_internal.cagg_watermark(1)), '-infinity'::timestamp with time zone))
                           Filter: (time_bucket('01:00:00'::interval, "time") &gt; ('2020-05-06 02:32:34.627143+00'::timestamp with time zone - '7 days'::interval))

 Planning Time: 3.532 ms
 Execution Time: 22.905 ms
</code></pre><p>Still very fast at just over 26 milliseconds (scanning 1,670 materialized rows and 19,550 raw rows), and now the results:</p><pre><code class="language-SQL">=&gt; SELECT hour, hostname, hist_usage_user
   FROM cpu_1h
   WHERE hour &gt; '2020-05-06 02:32:34.627143+00'::timestamptz - interval '7 days'
      AND hostname = 'host0'
   ORDER BY hour DESC LIMIT 3;

          hour          | hostname |      hist_usage_user      
------------------------+----------+---------------------------
 2020-05-06 02:00:00+00 | host0    | {0,384,388,385,400,398,0}
 2020-05-06 01:00:00+00 | host0    | {0,781,676,712,719,712,0}
 2020-05-06 00:00:00+00 | host0    | {0,736,714,776,689,685,0}
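
-- Sketch, not part of the original session: total samples per hourly bucket.
-- unnest() expands each histogram array so its bucket counts can be summed.
-- From the rows above: 384+388+385+400+398 = 1955 for the partial 02:00 hour,
-- versus 3600 (one sample per second) for each complete hour.
=&gt; SELECT hour,
          (SELECT sum(b) FROM unnest(hist_usage_user) AS b) AS samples
   FROM cpu_1h
   WHERE hostname = 'host0'
   ORDER BY hour DESC LIMIT 3;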

</code></pre><p>Unlike when we were processing the materialized table without real-time aggregation, we now have up-to-date results, including data from the 2:00am - 3:00am hour.  This is because the materialized table didn’t have data from the last hour, while the real-time aggregation was able to compute that result from the raw data at query time.  You can also notice that there is less data in the most recent row (namely, each histogram bucket has about half the counts of the other rows), as this row is the aggregation of 32 minutes of raw data, not a full hour. </p><p>You can also observe these two stages of real-time aggregation in the above query plan:  the materialized hypertable is processed in the first section via <code>Custom Scan (ChunkAppend) on _materialized_hypertable_2</code>, while the underlying raw hypertable is processed in the second section via <code>Custom Scan (ChunkAppend) on cpu</code>, and each processes only the data before or after the offset specified by the completion threshold (shown with  <code>_timescaledb_internal.cagg_watermark(1)</code> in the plan).</p><p>So, in summary:  a complete, up-to-date aggregate over the data, both at a fraction of the latency of querying the raw data, and avoiding the excessive overhead of schemes that update materializations through per-row or per-statement triggers.</p><table>
<thead>
<tr>
<th>Query Type</th>
<th>Latency</th>
<th>Freshness</th>
</tr>
</thead>
<tbody>
<tr>
<td>Raw Data</td>
<td>1924 ms</td>
<td>Up-to-date</td>
</tr>
<tr>
<td>Continuous Aggregates</td>
<td>4 ms</td>
<td>Lags up to 90 minutes</td>
</tr>
<tr>
<td>Real-Time Aggregation</td>
<td>26 ms</td>
<td>Up-to-date</td>
</tr>
</tbody>
</table>
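<p>To reproduce this comparison on your own instance, a minimal sketch along the following lines may help. It assumes the <code>cpu_1h</code> view and the TimescaleDB 1.7 <code>ALTER VIEW</code> syntax used earlier in this post (other releases may use different syntax), and it simply checks the newest bucket visible in each mode:</p><pre><code class="language-SQL">-- Materialized-only: the newest bucket trails the raw data.
ALTER VIEW cpu_1h SET (timescaledb.materialized_only = true);
SELECT max(hour) AS newest_bucket FROM cpu_1h;

-- Real-time aggregation: the newest bucket also covers not-yet-materialized rows.
ALTER VIEW cpu_1h SET (timescaledb.materialized_only = false);
SELECT max(hour) AS newest_bucket FROM cpu_1h;</code></pre><p>On the dataset above, the first query would be expected to stop at the 1:00am bucket and the second to reach the 2:00am bucket, matching the results shown earlier.</p>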
<p><strong>Continuous aggregates and real-time aggregation for the win!</strong></p><h2 id="conclusions">Conclusions</h2><p>What motivated us to build TimescaleDB is the firm belief that time-series use cases need a best-in-class, flexible time-series database, with advanced capabilities specifically designed for time-series workloads.  We developed real-time aggregation for time-series use cases such as devops monitoring, real-time analytics, and IoT, where fast queries over high-volume workloads and accurate, real-time results really matter. </p><p>Real-time aggregation joins a number of advanced capabilities in TimescaleDB around data lifecycle management and time-series analytics, including automated data retention, data reordering, native compression, downsampling, and traditional continuous aggregates.</p><p>And, <strong>there’s still much more to come</strong>. Keep an eye out for our much-anticipated TimescaleDB 2.0 release, which introduces horizontal scaling to TimescaleDB for terabyte to petabyte workloads.</p><h3 id="want-to-check-out-real-time-aggregation">Want to check out real-time aggregation?</h3><ul><li>Ready to dig in? Check out our <a href="https://docs.timescale.com/latest/using-timescaledb/continuous-aggregates/?utm_source=timescale-real-time-aggregates-details&amp;utm_medium=blog&amp;utm_campaign=1-7-release&amp;utm_content=continuous-aggregates-docs">docs</a>.</li><li>Brand new to TimescaleDB?  
Get started <a href="https://docs.timescale.com/latest/getting-started/?utm_source=timescale-real-time-aggregates-details&amp;utm_medium=blog&amp;utm_campaign=1-7-release&amp;utm_content=getting-started-docs">here</a>.</li></ul><p>If you have any questions along the way, we’re always available via our <a href="https://slack.timescale.com">community Slack</a> (we’re <a href="https://timescaledb.slack.com/archives/D011A62GNR0">@mike</a> and <a href="https://timescaledb.slack.com/archives/D0137UNE550">@sven </a>, come say hi 👋).</p><p>And, if you are interested in keeping up-to-date with future TimescaleDB releases, <a href="https://www.timescale.com/signup/release-notes/?utm_source=timescale-real-time-aggregates-details&amp;utm_medium=blog&amp;utm_campaign=1-7-release&amp;utm_content=release-notes-subscribe">sign up for our Release Notes</a>.  It’s low-traffic, we promise.</p><p>Until next time, keep it real!</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to Analyze Cryptocurrency Market Data using TimescaleDB, PostgreSQL and Tableau: a Step-by-Step Tutorial]]></title>
            <description><![CDATA[This tutorial is a step-by-step guide on how to analyze a time-series cryptocurrency dataset using Postgres, TimescaleDB and Tableau]]></description>
            <link>https://www.tigerdata.com/blog/tutorials-how-to-analyze-cryptocurrency-market-data-using-timescaledb-postgresql-and-tableau-a-step-by-step-tutorial</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/tutorials-how-to-analyze-cryptocurrency-market-data-using-timescaledb-postgresql-and-tableau-a-step-by-step-tutorial</guid>
            <category><![CDATA[Tutorials]]></category>
            <category><![CDATA[Tableau]]></category>
            <category><![CDATA[SQL]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Avthar Sewrathan]]></dc:creator>
            <pubDate>Thu, 19 Sep 2019 19:55:13 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2019/09/Hero-1-1.jpg">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2019/09/Hero-1-1.jpg" alt="How to Analyze Cryptocurrency Market Data using TimescaleDB, PostgreSQL and Tableau: a Step-by-Step Tutorial" /><p>This tutorial is a step-by-step guide on how to analyze a time-series cryptocurrency dataset using Postgres, TimescaleDB and Tableau. The instructions in this tutorial were used to create this <a href="https://timescale.ghost.io/blog/blog/analyzing-bitcoin-ethereum-and-4100-other-cryptocurrencies-using-postgresql-and-timescaledb/">analysis of 4100+ cryptocurrencies</a>.</p><h2 id="overview-of-steps">Overview of steps</h2><p><strong>Step 0: Install TimescaleDB via Timescale Cloud: </strong>We’ll create a Timescale Cloud account and spin up a TimescaleDB instance.</p><p><strong>Step 1: Design the database schema: </strong>We’ll guide you through how to design a schema for cryptocurrency data to use with TimescaleDB.</p><p><strong>Step 2: Create a dataset to analyze: </strong>We’ll use the CryptoCompareAPI and Python to create a CSV file containing the data to analyze.</p><p><strong>Step 3: Load dataset into TimescaleDB: </strong>We’ll insert the data from the CSV file into TimescaleDB using pgAdmin.</p><p><strong>Step 4: Query the data in TimescaleDB: </strong>We’ll connect our data in TimescaleDB to Tableau and perform queries on the dataset.</p><p><strong>Step 5: Visualize the results: </strong>We’ll use Tableau in order to visualize the results from our queries.</p><p>You can download all files and code used in this analysis in this <a href="https://github.com/timescale/examples/tree/master/crypto_tutorial">Github repo</a>. 
Note that the <a href="https://github.com/timescale/examples/tree/master/crypto_tutorial/Cryptocurrency%20dataset%20Sept%2016%202019">dataset provided</a> tracks OHLCV price data on 4198 different cryptocurrencies (courtesy of <a href="https://www.cryptocompare.com/">CryptoCompare</a>) as of 9/16/2019. If you follow the steps below, your dataset will run up to the date on which you perform the analysis.</p><h2 id="step-0-install-timescaledb-via-timescale-cloud">Step 0: Install TimescaleDB via Timescale Cloud</h2><p>Go to <a href="http://www.timescale.com/cloud">www.timescale.com/cloud</a> and sign up for a free trial, where you will receive $300 in credits, to use a cloud-hosted and managed version of TimescaleDB. This is the easiest way to install the DB. If you prefer, you can <a href="https://docs.timescale.com/latest/getting-started/installation">install an instance yourself on your machine by following these instructions</a>. However, the instructions in this post will assume you’re using Timescale Cloud.</p><p>After you’ve created an account, log in and create a database instance (you can name it something like “crypto_database”). Then select your preferred configuration (dev-only should be enough for this analysis). After successfully creating the database instance, you should see it active.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/1.-New-Db-instance-active.png" class="kg-image" alt="" loading="lazy"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Fig 1: Timescale Cloud page showing an active TSDB instance</em></i></figcaption></figure><p>Once the instance is active, navigate to “Databases” and create a new database. I’ve called mine ‘crypto-test’. 
You should see it in the list of Databases after it’s been created.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/3.Create-New-Database.png" class="kg-image" alt="" loading="lazy"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Fig 2: Successful creation of a database in a Timescale Cloud instance</em></i></figcaption></figure><p></p><h2 id="step-1-design-the-database-schema">Step 1: Design the database schema</h2><p>Now that our database is up and running, we need some data to insert into it. Before we get data for analysis, we first need to define what kind of data we want to perform queries on. (To skip ahead, see the code in <a href="https://github.com/timescale/examples/blob/master/crypto_tutorial/schema.sql">schema.sql</a>.)</p><p>In our analysis, we have two main goals.</p><ol><li>We want to explore the price of Bitcoin and Ethereum, expressed in different fiat currencies, over time.</li><li>We want to explore the price of different cryptocurrencies, expressed in Bitcoin, over time.</li></ol><p>Examples of questions we might want to ask are:</p><ul><li>How has Bitcoin’s price in USD varied over time?</li><li>How has Ethereum’s price in ZAR varied over time?</li><li>How has Bitcoin’s trading volume in KRW increased or decreased over time?</li><li>Which crypto has had the highest trading volume in the last two weeks?</li><li>On which day was Bitcoin most profitable?</li><li>Which are the most profitable new coins from the past 3 months?</li></ul><p>Understanding the questions we want to ask of the data leads us to define a schema for our database, so that we can acquire the necessary data to populate it.<br>Our requirements lead us to four tables: three TimescaleDB <a href="https://docs.timescale.com/latest/using-timescaledb/hypertables">hypertables</a>, <em>btc_prices, crypto_prices </em>and<em> eth_prices</em>, and one relational table, 
<em>currency_info</em>.</p><p>The btc_prices hypertable contains data about Bitcoin prices in 17 different fiat currencies since 2010:</p><table>
<thead>
<tr>
<th style="text-align:center">btc_prices hypertable schema</th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:center">Field</td>
<td>Description</td>
</tr>
<tr>
<td style="text-align:center">time</td>
<td>The day-specific timestamp of the price records, with time given as the default 00:00:00+00</td>
</tr>
<tr>
<td style="text-align:center">opening_price</td>
<td>The first price at which the coin was exchanged that day</td>
</tr>
<tr>
<td style="text-align:center">highest_price</td>
<td>The highest price at which the coin was exchanged that day</td>
</tr>
<tr>
<td style="text-align:center">lowest_price</td>
<td>The lowest price at which the coin was exchanged that day</td>
</tr>
<tr>
<td style="text-align:center">closing_price</td>
<td>The last price at which the coin was exchanged that day</td>
</tr>
<tr>
<td style="text-align:center">volume_btc</td>
<td>The volume exchanged in the cryptocurrency value that day, in BTC.</td>
</tr>
<tr>
<td style="text-align:center">volume_currency</td>
<td>The volume exchanged in its converted value for that day, quoted in the corresponding fiat currency.</td>
</tr>
<tr>
<td style="text-align:center">currency_code</td>
<td>Corresponds to the fiat currency used for non-btc prices/volumes.</td>
</tr>
</tbody>
</table>
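<p>As a rough illustration of how these fields fit together, the following Python sketch turns one comma-separated line into a typed record whose keys follow the column order above. The sample line and the helper name parse_btc_price_row are made up for this example; only the column names come from the schema.</p><pre><code class="language-python">import csv
import io
from datetime import datetime

# Column order taken from the btc_prices schema described above.
BTC_PRICES_COLUMNS = [
    "time", "opening_price", "highest_price", "lowest_price",
    "closing_price", "volume_btc", "volume_currency", "currency_code",
]


def parse_btc_price_row(row):
    """Convert one CSV row (a list of 8 strings) into a typed dict."""
    if len(row) != len(BTC_PRICES_COLUMNS):
        raise ValueError("expected 8 fields, got " + str(len(row)))
    record = dict(zip(BTC_PRICES_COLUMNS, row))
    record["time"] = datetime.strptime(record["time"], "%Y-%m-%d %H:%M:%S")
    # Every column between time and currency_code is numeric.
    for name in BTC_PRICES_COLUMNS[1:7]:
        record[name] = float(record[name])
    return record


# Made-up sample line, for illustration only.
sample = io.StringIO(
    "2019-09-16 00:00:00,10300.0,10380.5,10100.2,10250.9,1500.0,15400000.0,USD\n"
)
records = [parse_btc_price_row(row) for row in csv.reader(sample)]</code></pre><p>A check like this is optional, but it can catch a misordered or short row before the data ever reaches the database.</p>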
<p><br>Similar to btc_prices, the eth_prices hypertable contains data about Ethereum prices in 17 different fiat currencies since 2015:</p><table>
<thead>
<tr>
<th style="text-align:center">eth_prices hypertable schema</th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:center">Field</td>
<td>Description</td>
</tr>
<tr>
<td style="text-align:center">time</td>
<td>The day-specific timestamp of the price records, with time given as the default 00:00:00+00</td>
</tr>
<tr>
<td style="text-align:center">opening_price</td>
<td>The first price at which the coin was exchanged that day</td>
</tr>
<tr>
<td style="text-align:center">highest_price</td>
<td>The highest price at which the coin was exchanged that day</td>
</tr>
<tr>
<td style="text-align:center">lowest_price</td>
<td>The lowest price at which the coin was exchanged that day</td>
</tr>
<tr>
<td style="text-align:center">closing_price</td>
<td>The last price at which the coin was exchanged that day</td>
</tr>
<tr>
<td style="text-align:center">volume_eth</td>
<td>The volume exchanged in the cryptocurrency value that day, in ETH.</td>
</tr>
<tr>
<td style="text-align:center">volume_currency</td>
<td>The volume exchanged in its converted value for that day, quoted in the corresponding fiat currency.</td>
</tr>
<tr>
<td style="text-align:center">currency_code</td>
<td>Corresponds to the fiat currency used for non-ETH prices/volumes.</td>
</tr>
</tbody>
</table>
<p>The crypto_prices hypertable contains data about 4198 cryptocurrencies, including Bitcoin, priced at the corresponding crypto/BTC exchange rate, since around 2012:</p><table>
<thead>
<tr>
<th style="text-align:center">crypto_prices hypertable schema</th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:center">Field</td>
<td>Description</td>
</tr>
<tr>
<td style="text-align:center">time</td>
<td>The day-specific timestamp of the price records, with time given as the default 00:00:00+00</td>
</tr>
<tr>
<td style="text-align:center">opening_price</td>
<td>The first price at which the coin was exchanged that day</td>
</tr>
<tr>
<td style="text-align:center">highest_price</td>
<td>The highest price at which the coin was exchanged that day</td>
</tr>
<tr>
<td style="text-align:center">lowest_price</td>
<td>The lowest price at which the coin was exchanged that day</td>
</tr>
<tr>
<td style="text-align:center">closing_price</td>
<td>The last price at which the coin was exchanged that day</td>
</tr>
<tr>
<td style="text-align:center">volume_crypto</td>
<td>The volume exchanged that day, quoted in the cryptocurrency itself.</td>
</tr>
<tr>
<td style="text-align:center">volume_btc</td>
<td>The volume exchanged that day, quoted in BTC.</td>
</tr>
<tr>
<td style="text-align:center">currency_code</td>
<td>Corresponds to the cryptocurrency whose prices and volumes are recorded, with prices quoted in BTC.</td>
</tr>
</tbody>
</table>
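<p>Because crypto_prices quotes every coin in BTC, a fiat price has to be derived by multiplying by the BTC price in that fiat currency for the same day, which is what btc_prices provides. Here is a minimal Python sketch of that arithmetic; the numbers are made up and the function name to_fiat is hypothetical.</p><pre><code class="language-python"># Made-up closing prices keyed by day, for illustration only.
eth_in_btc = {"2019-09-14": 0.018, "2019-09-15": 0.02}
btc_in_usd = {"2019-09-14": 10000.0, "2019-09-15": 10500.0, "2019-09-16": 10300.0}


def to_fiat(crypto_in_btc, btc_in_fiat):
    """Multiply each day's crypto/BTC price by the same day's BTC/fiat price."""
    # Only days present in both series can be converted.
    shared_days = sorted(set(crypto_in_btc).intersection(btc_in_fiat))
    return {day: crypto_in_btc[day] * btc_in_fiat[day] for day in shared_days}


eth_in_usd = to_fiat(eth_in_btc, btc_in_usd)</code></pre><p>Days that appear in only one of the two series are dropped, much as an inner join on the day would drop them in SQL.</p>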
<p>Lastly, we have the currency_info table, a regular relational table that maps each currency’s code to its name.</p><table>
<thead>
<tr>
<th style="text-align:center">currency_info table schema</th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:center">Field</td>
<td>Description</td>
</tr>
<tr>
<td style="text-align:center">currency_code</td>
<td>2-7 character abbreviation for the currency. Used in the hypertables.</td>
</tr>
<tr>
<td style="text-align:center">currency</td>
<td>English name of currency</td>
</tr>
</tbody>
</table>
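<p>A lookup table like currency_info is mostly useful in joins, where it turns a code into a readable name. The sketch below shows the idea with Python’s built-in sqlite3 module standing in for PostgreSQL so that it runs without a server; the trimmed-down column set and the sample rows are assumptions made for this illustration, not part of the tutorial’s schema.</p><pre><code class="language-python">import sqlite3

# SQLite stands in for PostgreSQL here so the sketch runs without a server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE currency_info (currency_code TEXT, currency TEXT)")
conn.execute(
    "CREATE TABLE crypto_prices (time TEXT, closing_price REAL, currency_code TEXT)"
)
conn.executemany(
    "INSERT INTO currency_info VALUES (?, ?)",
    [("ETH", "Ethereum"), ("LTC", "Litecoin")],
)
# Made-up prices, for illustration only.
conn.executemany(
    "INSERT INTO crypto_prices VALUES (?, ?, ?)",
    [("2019-09-16", 0.019, "ETH"), ("2019-09-16", 0.007, "LTC")],
)
# Resolve each price row's code to the currency's English name.
rows = conn.execute(
    "SELECT ci.currency, c.closing_price "
    "FROM crypto_prices c JOIN currency_info ci "
    "ON ci.currency_code = c.currency_code "
    "ORDER BY ci.currency"
).fetchall()
conn.close()</code></pre><p>The same join shape works against the real tables once they are created and loaded in the following steps.</p>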
<p>Once we’ve established the schema for the tables in our database, we can write the CREATE TABLE SQL statements that actually create the tables we need:</p><p>Code from <a href="https://github.com/timescale/examples/blob/master/crypto_tutorial/schema.sql"><em>schema.sql</em></a>:</p><pre><code class="language-SQL">--Schema for cryptocurrency analysis
DROP TABLE IF EXISTS "currency_info";
CREATE TABLE "currency_info"(
   currency_code   VARCHAR (10),
   currency        TEXT
);

--Schema for btc_prices table
DROP TABLE IF EXISTS "btc_prices";
CREATE TABLE "btc_prices"(
   time            TIMESTAMP WITH TIME ZONE NOT NULL,
   opening_price   DOUBLE PRECISION,
   highest_price   DOUBLE PRECISION,
   lowest_price    DOUBLE PRECISION,
   closing_price   DOUBLE PRECISION,
   volume_btc      DOUBLE PRECISION,
   volume_currency DOUBLE PRECISION,
   currency_code   VARCHAR (10)
);

--Schema for crypto_prices table
DROP TABLE IF EXISTS "crypto_prices";
CREATE TABLE "crypto_prices"(
   time            TIMESTAMP WITH TIME ZONE NOT NULL,
   opening_price   DOUBLE PRECISION,
   highest_price   DOUBLE PRECISION,
   lowest_price    DOUBLE PRECISION,
   closing_price   DOUBLE PRECISION,
   volume_crypto   DOUBLE PRECISION,
   volume_btc      DOUBLE PRECISION,
   currency_code   VARCHAR (10)
);

--Schema for eth_prices table
DROP TABLE IF EXISTS "eth_prices";
CREATE TABLE "eth_prices"(
   time            TIMESTAMP WITH TIME ZONE NOT NULL,
   opening_price   DOUBLE PRECISION,
   highest_price   DOUBLE PRECISION,
   lowest_price    DOUBLE PRECISION,
   closing_price   DOUBLE PRECISION,
   volume_eth      DOUBLE PRECISION,
   volume_currency DOUBLE PRECISION,
   currency_code   VARCHAR (10)
);

--Timescale specific statements to create hypertables for better performance
SELECT create_hypertable('btc_prices', 'time', 'opening_price', 2);
SELECT create_hypertable('eth_prices', 'time', 'opening_price', 2);
SELECT create_hypertable('crypto_prices', 'time', 'currency_code', 2);</code></pre><p>Notice the three create_hypertable calls at the end. These are TimescaleDB-specific functions that turn a regular PostgreSQL table into a hypertable partitioned on its time column. For more on hypertables, see the <a href="https://docs.timescale.com/latest/using-timescaledb/hypertables">Timescale docs</a> and this <a href="https://timescale.ghost.io/blog/blog/when-boring-is-awesome-building-a-scalable-time-series-database-on-postgresql-2900ea453ee2/">blog post</a>.</p><h2 id="step-2-create-a-dataset-to-analyze">Step 2: Create a dataset to analyze</h2><p>Now that we’ve defined the data we want, it’s time to construct a dataset containing that data. To do this, we’ll write a small Python script (to skip ahead, see <a href="https://github.com/timescale/examples/blob/master/crypto_tutorial/crypto_data_extraction.py">crypto_data_extraction.py</a>) that extracts data from <a href="https://www.cryptocompare.com/">cryptocompare.com</a> into 4 CSV files (coin_names.csv, crypto_prices.csv, btc_prices.csv and eth_prices.csv).<br></p><p>In order to get data from CryptoCompare, you’ll need to obtain an <a href="https://min-api.cryptocompare.com/pricing">API key</a>. For this analysis, the free key should be plenty.</p><p>The script consists of 5 parts:<br><strong>(1) Setup: </strong>First, we need to import some libraries to help us parse the data. Notably, we will use the Python ‘requests’ library, which makes it easy to deal with JSON data from a web API endpoint.</p><pre><code class="language-python">import requests
import json
import csv
from datetime import datetime</code></pre><p>You’ll also need your <a href="https://min-api.cryptocompare.com/pricing">CryptoCompare API key</a> as a variable. We’ve included it as a plain variable in the code below, which is not recommended for production code; you can instead store it as an environment variable or follow whatever API key management practices you normally use in production.<br></p><pre><code class="language-python">apikey = 'YOUR_CRYPTO_COMPARE_API_KEY'
#attach to end of URLstring
url_api_part = '&amp;api_key=' + apikey</code></pre><p><br><strong>(2) Get a list of all coin names to populate table <em>currency_info: </em></strong>First we use the Python requests library’s <em>get function</em> to get a JSON object containing the list of coin names and symbols on CryptoCompare. Then we convert the data to a dictionary and write each coin’s symbol and name to the CSV file ‘coin_names.csv’.</p><pre><code class="language-python">#####################################################################
#2. Populate list of all coin names
#####################################################################
#URL to get a list of coins from cryptocompare API
URLcoinslist = 'https://min-api.cryptocompare.com/data/all/coinlist'

#Get list of cryptos with their symbols
res1 = requests.get(URLcoinslist)
res1_json = res1.json()
data1 = res1_json['Data']
symbol_array = []
cryptoDict = dict(data1)

#write to CSV
with open('coin_names.csv', mode = 'w', newline = '') as test_file:
   test_file_writer = csv.writer(test_file, delimiter = ',', quotechar = '"', quoting=csv.QUOTE_MINIMAL)
   for coin in cryptoDict.values():
       name = coin['Name']
       symbol = coin['Symbol']
       symbol_array.append(symbol)
       coin_name = coin['CoinName']
       full_name = coin['FullName']
       entry = [symbol, coin_name]
       test_file_writer.writerow(entry)
print('Done getting crypto names and symbols. See coin_names.csv for result')</code></pre><p><strong>(3) Get historical prices in BTC for 4198 other cryptos to populate <em>crypto_prices: </em></strong>Once we have the list of all the coin names, we can iterate through them and pull their historical prices in BTC since they were listed on CryptoCompare. We write that information to the CSV file “crypto_prices.csv”.</p><pre><code class="language-python">#####################################################################
#3. Populate historical price for each crypto in BTC
#####################################################################
#Note: this part might take a while to run since we're populating data for 4k+ coins
#counter variable for progress made
progress = 0
num_cryptos = str(len(symbol_array))
for symbol in symbol_array:
   # get data for that currency
   URL = 'https://min-api.cryptocompare.com/data/histoday?fsym='+ symbol +'&amp;tsym=BTC&amp;allData=true' + url_api_part
   res = requests.get(URL)
   res_json = res.json()
   data = res_json['Data']
   # write required fields into csv
   with open('crypto_prices.csv', mode = 'a', newline = '') as test_file:
       test_file_writer = csv.writer(test_file, delimiter = ',', quotechar = '"', quoting=csv.QUOTE_MINIMAL)
       for day in data:
           rawts = day['time']
           ts = datetime.utcfromtimestamp(rawts).strftime('%Y-%m-%d %H:%M:%S')
           o = day['open']
           h = day['high']
           l = day['low']
           c = day['close']
           vfrom = day['volumefrom']
           vto = day['volumeto']
           entry = [ts, o, h, l, c, vfrom, vto, symbol]
           test_file_writer.writerow(entry)
   progress = progress + 1
   print('Processed ' + str(symbol))
   print(str(progress) + ' currencies out of ' +  num_cryptos + ' written to csv')
print('Done getting price data for all coins. See crypto_prices.csv for result')</code></pre><p>Notice how the fields defined in Step 1 influence the choice of what data we write to the CSV file!</p><p><strong>(4) Get historical Bitcoin prices in different fiat currencies to populate <em>btc_prices: </em></strong>We then create a list of different fiat currencies in which we want to express Bitcoin’s price. Unfortunately CryptoCompare doesn’t have a comprehensive list so we’ve hard coded (gasp!) a list of 17 popular fiat currencies.</p><p>We then iterate over the list of fiat currencies and pull the historical Bitcoin price in that currency and write it to the CSV file “btc_prices.csv”.</p><pre><code class="language-python">#####################################################################
#4. Populate BTC prices in different fiat currencies
#####################################################################
# List of fiat currencies we want to query
# You can expand this list, but CryptoCompare does not have
# a comprehensive fiat list on their site
fiatList = ['AUD', 'CAD', 'CNY', 'EUR', 'GBP', 'GOLD', 'HKD',
'ILS', 'INR', 'JPY', 'KRW', 'PLN', 'RUB', 'SGD', 'UAH', 'USD', 'ZAR']

#counter variable for progress made
progress2 = 0
for fiat in fiatList:
   # get data for bitcoin price in that fiat
   URL = 'https://min-api.cryptocompare.com/data/histoday?fsym=BTC&amp;tsym='+fiat+'&amp;allData=true' + url_api_part
   res = requests.get(URL)
   res_json = res.json()
   data = res_json['Data']
   # write required fields into csv
   with open('btc_prices.csv', mode = 'a', newline = '') as test_file:
       test_file_writer = csv.writer(test_file, delimiter = ',', quotechar = '"', quoting=csv.QUOTE_MINIMAL)
       for day in data:
           rawts = day['time']
           ts = datetime.utcfromtimestamp(rawts).strftime('%Y-%m-%d %H:%M:%S')
           o = day['open']
           h = day['high']
           l = day['low']
           c = day['close']
           vfrom = day['volumefrom']
           vto = day['volumeto']
           entry = [ts, o, h, l, c, vfrom, vto, fiat]
           test_file_writer.writerow(entry)
   progress2 = progress2 + 1
   print('processed ' + str(fiat))
   print(str(progress2) + ' currencies out of  17 written')
print('Done getting price data for btc. See btc_prices.csv for result')</code></pre><p><br><strong>(5) Get historical Ethereum prices in different fiat currencies to populate <em>eth_prices: </em></strong>Lastly, we do the same for Ethereum and the list of fiat currencies and write the results in the CSV file “eth_prices.csv”.</p><pre><code class="language-python">#####################################################################
#5. Populate ETH prices in different fiat currencies
#####################################################################
#counter variable for progress made
progress3 = 0
for fiat in fiatList:
   # get data for Ethereum price in that fiat
   URL = 'https://min-api.cryptocompare.com/data/histoday?fsym=ETH&amp;tsym='+fiat+'&amp;allData=true' + url_api_part
   res = requests.get(URL)
   res_json = res.json()
   data = res_json['Data']
   # write required fields into csv
   with open('eth_prices.csv', mode = 'a', newline = '') as test_file:
       test_file_writer = csv.writer(test_file, delimiter = ',', quotechar = '"', quoting=csv.QUOTE_MINIMAL)
       for day in data:
           rawts = day['time']
           ts = datetime.utcfromtimestamp(rawts).strftime('%Y-%m-%d %H:%M:%S')
           o = day['open']
           h = day['high']
           l = day['low']
           c = day['close']
           vfrom = day['volumefrom']
           vto = day['volumeto']
           entry = [ts, o, h, l, c, vfrom, vto, fiat]
           test_file_writer.writerow(entry)
   progress3 = progress3 + 1
   print('processed ' + str(fiat))
   print(str(progress3) + ' currencies out of  17 written')
print('Done getting price data for eth. See eth_prices.csv for result')</code></pre><p>If you’d rather not pull a fresh dataset, you’re welcome to use the <a href="https://github.com/timescale/examples/tree/master/crypto_tutorial/Cryptocurrency%20dataset%20Sept%2016%202019">dataset we already created</a>, but note that it only has data until 9/16/2019.</p><h2 id="step-3-load-dataset-into-timescaledb-using-timescale-cloud-and-pgadmin">Step 3: Load dataset into TimescaleDB, using Timescale Cloud and pgAdmin</h2><p>After following Step 2, you should have 4 CSV files (if not, download them <a href="https://github.com/timescale/examples/tree/master/crypto_tutorial/Cryptocurrency%20dataset%20Sept%2016%202019">here</a>).</p><p>The next step is to load this data into TimescaleDB in order to query it and perform our analysis. In Step 0, we created a TimescaleDB instance in Timescale Cloud, so all that’s left is to use pgAdmin to create the tables from Step 1 and transfer data from each CSV file to the relevant table.</p><p><strong>3.1 Connect to your TimescaleDB instance</strong></p><p>Download and install <a href="https://www.pgadmin.org/">pgAdmin</a>, or your favorite Postgres admin tool (utilities like <a href="http://postgresguide.com/utilities/psql.html">psql</a> also work). 
Once installed, log in to your database using the credentials on the ‘Overview’ page of your Timescale Cloud instance, as shown in Fig 3.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/2.-TSDB-Cloud-overview-creds-page-1.png" class="kg-image" alt="" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Fig 3: Timescale Cloud ‘Overview’ page to find credentials to log in</span></figcaption></figure><p>Once logged in, you should see something like Fig 4 below.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/4.-Connected-to-TSDB.png" class="kg-image" alt="" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Fig 4: Successful login to Timescale Cloud database in pgAdmin!</span></figcaption></figure><p></p><p><strong>3.2 Use the SQL code from Step 1 to create tables</strong></p><p>Now all our hard work in Step 1 comes in handy! Use the Query Tool in pgAdmin to create the tables we defined in Step 1. One way to find the tool is to navigate to your_project_name -&gt; Databases -&gt; your_db_name, then right click and select Query Tool, as Fig 5 shows below.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/QUERY-TOOL.png" class="kg-image" alt="" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Fig 5: Locating the Query Tool in pgAdmin</span></figcaption></figure><p>Next, we copy and paste the code from Step 1 (found in <a href="https://github.com/timescale/examples/blob/master/crypto_tutorial/schema.sql">schema.sql</a>) into the Query Editor and run the query. 
This creates the 4 tables with the necessary fields and converts btc_prices, eth_prices and crypto_prices into hypertables.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Example-query.png" class="kg-image" alt="" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Fig 6: Run the code from schema.sql in the Query Tool in pgAdmin to create the necessary tables</span></figcaption></figure><p>To check that our query was successful, look at the output in the Data Output pane or navigate down to your_project_name -&gt; databases -&gt; your_db_name -&gt; schemas -&gt; tables and you should see your 4 tables.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Table-created-.png" class="kg-image" alt="" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Fig 7: Successful creation of tables for analysis in pgAdmin</span></figcaption></figure><p><strong>3.3 Import the data into the database tables</strong></p><p>Now that we’ve created the tables with our desired schema, all that’s left is to insert the data from the CSV files we’ve created into the tables.</p><p>We will do this using the Import tool in pgAdmin, but you could use psql or, for better performance on large datasets, the <a href="https://github.com/timescale/timescaledb-parallel-copy">timescaledb-parallel-copy tool</a>. However, for files of the size we’ve created, the pgAdmin import tool works just fine.</p><p>To import a CSV file to a table, navigate down to your_project_name -&gt; databases -&gt; your_db_name -&gt; schemas -&gt; tables, right click on the table you’d like to insert data into, and then select ‘Import/Export’ from the menu, as shown in Fig 8 below. We’ll first insert data into the btc_prices table. 
</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Import-1.png" class="kg-image" alt="" loading="lazy"></figure><p>Once you’ve selected Import/Export, select Import, then select the csv file from your directory (in this case, btc_prices.csv), and then select comma (,) as the delimiter, as shown in Fig 9 below.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Import-2.png" class="kg-image" alt="" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Fig 9: Importing btc_prices.csv into the btc_prices table using pgAdmin</span></figcaption></figure><p>To check if this worked, right click on btc_prices table, select ‘view/edit data’ -&gt; ‘all rows’, as shown in Fig 10. Fig 11 shows that our data has successfully been inserted into the btc_prices table.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Check-1.png" class="kg-image" alt="" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Fig 10: Checking that data has been inserted into the table in pgAdmin</span></figcaption></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Check-2.png" class="kg-image" alt="" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Fig 11: Verifying that btc price data is in fact in btc_prices table using pgAdmin</span></figcaption></figure><p>Repeat these steps above for crypto_prices.csv and the crypto_prices table, eth_prices.csv and the eth_prices table and coin_names.csv and the currency_info table, respectively.</p><h2 id="step-4-query-the-data"><br>Step 4: Query the Data</h2><p>With the data needed for our 
analysis now sitting snugly in our database tables, we can perform queries on our dataset in order to answer some of the questions posed in Step 1.</p><p>The code below (from <a href="https://github.com/timescale/examples/blob/master/crypto_tutorial/crypto_queries.sql">crypto_queries.sql</a>) contains a sample list of questions and corresponding queries to answer those questions. Of course, you can add in your own questions and create Postgres queries to answer them in addition to, or in place of, the questions and queries provided.</p><p>From <a href="https://github.com/timescale/examples/blob/master/crypto_tutorial/crypto_queries.sql">crypto_queries.sql</a>:</p><pre><code class="language-SQL">--Query 1
-- How did Bitcoin price in USD vary over time?
-- BTC 7 day prices
SELECT time_bucket('7 days', time) as period,
      last(closing_price, time) AS last_closing_price
FROM btc_prices
WHERE currency_code = 'USD'
GROUP BY period
ORDER BY period;

--Query 2
-- How did BTC daily returns vary over time?
-- Which days had the worst and best returns?
-- BTC daily return
SELECT time,
      closing_price / lead(closing_price) over prices AS daily_factor
FROM (
  SELECT time,
         closing_price
  FROM btc_prices
  WHERE currency_code = 'USD'
  GROUP BY 1,2
) sub window prices AS (ORDER BY time DESC)
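
-- Sketch (not part of crypto_queries.sql): one way to answer "which days
-- had the worst and best returns?" directly. It assumes one USD row per
-- day in btc_prices. lag() over ascending time reads the same
-- previous-day price that lead() reads over descending time in Query 2,
-- and NULLIF avoids dividing by a zero price. Flip DESC to ASC for the
-- worst days; the LIMIT of 5 is an arbitrary choice.
SELECT time, daily_factor
FROM (
  SELECT time,
         closing_price / NULLIF(lag(closing_price) OVER (ORDER BY time), 0) AS daily_factor
  FROM btc_prices
  WHERE currency_code = 'USD'
) sub
WHERE daily_factor IS NOT NULL
ORDER BY daily_factor DESC
LIMIT 5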

--Query 3
-- How did the trading volume of Bitcoin vary over time in different fiat currencies?
-- BTC volume in different fiat in 7 day intervals
SELECT time_bucket('7 days', time) as period,
      currency_code,
      sum(volume_btc)
FROM btc_prices
GROUP BY currency_code, period
ORDER BY period

-- Q4
-- How did Ethereum (ETH) price in BTC vary over time?
-- ETH prices in BTC in 7 day intervals
SELECT
   time_bucket('7 days', time) AS time_period,
   last(closing_price, time) AS closing_price_btc
FROM crypto_prices
WHERE currency_code='ETH'
GROUP BY time_period
ORDER BY time_period

--Q5
-- How did ETH prices, in different fiat currencies, vary over time?
-- (using the BTC/Fiat exchange rate at the time)
-- ETH prices in fiat
SELECT time_bucket('7 days', c.time) AS time_period,
      last(c.closing_price, c.time) AS last_closing_price_in_btc,
      last(c.closing_price, c.time) * last(b.closing_price, c.time) FILTER (WHERE b.currency_code = 'USD') AS last_closing_price_in_usd,
      last(c.closing_price, c.time) * last(b.closing_price, c.time) FILTER (WHERE b.currency_code = 'EUR') AS last_closing_price_in_eur,
      last(c.closing_price, c.time) * last(b.closing_price, c.time) FILTER (WHERE b.currency_code = 'CNY') AS last_closing_price_in_cny,
      last(c.closing_price, c.time) * last(b.closing_price, c.time) FILTER (WHERE b.currency_code = 'JPY') AS last_closing_price_in_jpy,
      last(c.closing_price, c.time) * last(b.closing_price, c.time) FILTER (WHERE b.currency_code = 'KRW') AS last_closing_price_in_krw
FROM crypto_prices c
JOIN btc_prices b
   ON time_bucket('1 day', c.time) = time_bucket('1 day', b.time)
WHERE c.currency_code = 'ETH'
GROUP BY time_period
ORDER BY time_period

--Q6
--Crypto by date of first data
SELECT ci.currency_code, min(c.time)
FROM currency_info ci JOIN crypto_prices c ON ci.currency_code = c.currency_code
AND c.closing_price &gt; 0
GROUP BY ci.currency_code
ORDER BY min(c.time) DESC

--Q7
-- Number of new cryptocurrencies by day
-- Which days had the most new cryptocurrencies added?
SELECT day, COUNT(code)
FROM (
  SELECT min(c.time) AS day, ci.currency_code AS code
  FROM currency_info ci JOIN crypto_prices c ON ci.currency_code = c.currency_code
  AND c.closing_price &gt; 0
  GROUP BY ci.currency_code
  ORDER BY min(c.time)
)a
GROUP BY day
ORDER BY day DESC


--Q8
-- Which cryptocurrencies had the most transaction volume in the past 14 days?
--Crypto transaction volume during a certain time period
SELECT 'BTC' as currency_code,
       sum(b.volume_currency) as total_volume_in_usd
FROM btc_prices b
WHERE b.currency_code = 'USD'
AND now() - date(b.time) &lt; INTERVAL '14 day'
GROUP BY b.currency_code
UNION
SELECT c.currency_code as currency_code,
       sum(c.volume_btc) * avg(b.closing_price) as total_volume_in_usd
FROM crypto_prices c JOIN btc_prices b ON date(c.time) = date(b.time)
WHERE c.volume_btc &gt; 0
AND b.currency_code = 'USD'
AND now() - date(b.time) &lt; INTERVAL '14 day'
AND now() - date(c.time) &lt; INTERVAL '14 day'
GROUP BY c.currency_code
ORDER BY total_volume_in_usd DESC
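
-- Sketch (not part of crypto_queries.sql): the BTC half of Q8 with the
-- 14-day filter written against the bare time column. Comparing b.time
-- directly, rather than wrapping it in date(), generally lets PostgreSQL
-- use an index on time and lets TimescaleDB skip older chunks. With
-- daily data the result should match Q8, though the cutoff can differ
-- by up to a day when rows carry a time of day.
SELECT 'BTC' AS currency_code,
       sum(b.volume_currency) AS total_volume_in_usd
FROM btc_prices b
WHERE b.currency_code = 'USD'
AND b.time &gt; now() - INTERVAL '14 days'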

--Q9
--Which cryptocurrencies had the top daily return?
--Top crypto by daily return
WITH
   prev_day_closing AS (
SELECT
   currency_code,
   time,
   closing_price,
   LEAD(closing_price) OVER (PARTITION BY currency_code ORDER BY TIME DESC) AS prev_day_closing_price
FROM
    crypto_prices  
)
,    daily_factor AS (
SELECT
   currency_code,
   time,
   CASE WHEN prev_day_closing_price = 0 THEN 0 ELSE closing_price/prev_day_closing_price END AS daily_factor
FROM
   prev_day_closing
)
SELECT
   time,
   LAST(currency_code, daily_factor) as currency_code,
   MAX(daily_factor) as max_daily_factor
FROM
   daily_factor
GROUP BY
   TIME</code></pre><p>For this step and Step 5, we’ll use <a href="https://www.tableau.com/">Tableau</a> to run the above queries on the dataset and visualize the output. You’re welcome to use other data visualization tools like <a href="https://grafana.com/blog/2018/10/15/make-time-series-exploration-easier-with-the-postgresql/timescaledb-query-editor/">Grafana</a>, but make sure that the tool you select has a Postgres connector.</p><p>To query the data using <a href="https://www.tableau.com/">Tableau</a>, follow these steps:<br><strong>4.1 Create a new workbook: </strong>This will be used to house all the graphs for the analysis.</p><p><strong>4.2 Connect TimescaleDB to Tableau: </strong>Create a connection between Tableau and TimescaleDB running in your Timescale Cloud instance.</p><p>Connecting your TimescaleDB instance in the cloud to Tableau takes just a few clicks, thanks to Tableau’s built-in Postgres connector. To connect to your database, add a new connection and, under the ‘to a server’ section, select PostgreSQL as the connection type. Then enter your database credentials (found in the Timescale Cloud ‘Overview’ tab), like we did in Step 3.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/New-Data-Source-T.png" class="kg-image" alt="" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Fig 12: Adding a connection from Timescale Cloud in Tableau</span></figcaption></figure><p><strong>4.3 Create a new data source: </strong>Create a new data source and rename it to something unique, since by default it takes the name of your database.</p><p>You’ll need to do this for each query in the analysis, since each data source only supports one piece of custom SQL. 
One way to create many data sources against the same database is to create one as described above, duplicate it, and then change the custom SQL in each copy, since the database you’re connecting to remains the same.</p><p><strong>4.4 Query the data: </strong>Here we’ll use Tableau’s built-in SQL editor. To run a query, add custom SQL to your data source by dragging and dropping the “New Custom SQL” button onto the area that says ‘Add tables here’.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/New-Custom-SQL-T.png" class="kg-image" alt="" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Fig 13: Adding Custom SQL to a data source in Tableau</span></figcaption></figure><p>Once you’ve done that, paste the query you want into the query editor. In the example in Fig 14 below, we’ll use Query 1 from crypto_queries.sql, for historical BTC prices in USD.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Tableau-Query-Tool.png" class="kg-image" alt="" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Fig 14: Query for Historical Bitcoin prices in USD in the Tableau query editor</span></figcaption></figure><p>Once you’ve entered the query, press OK and then Update Now, and you’ll see the results in a table, as illustrated by Fig 15 below.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Query-Success-T.png" class="kg-image" alt="" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Fig 15: Results from a successful execution of Query 1 in Tableau</span></figcaption></figure><h2 id="step-5-data-visualization-in-tableau">Step 5: Data Visualization in Tableau</h2><p>Results in a table 
are only so useful; graphs are much better! So in our final step, let’s take our output from Step 4 and turn it into an interactive graph in Tableau.</p><p>To do this, create a new worksheet (or dashboard) and then select your desired data source (in our case ‘btc 7 days’), as shown in Fig 16 below.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/New-worksheet-with-data-source-1.png" class="kg-image" alt="" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Fig 16: A new Tableau worksheet linked to the data source ‘btc 7 days’</span></figcaption></figure><p>Next, locate the Dimensions and Measures pane on the left.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/dimensions-and-measures.png" class="kg-image" alt="" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Fig 17: The dimensions and measures pane in Tableau</span></figcaption></figure><p>Then, drag the period (time) dimension to the ‘Columns’ part of the worksheet and the ‘last closing price’ measure to the ‘Rows’ part. You should see something like the graph shown in Fig 18 below.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/graph-before-time-adj.png" class="kg-image" alt="" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Fig 18: Initial graph after dragging dimensions and measures in Tableau</span></figcaption></figure><p>This graph doesn’t quite have the level of detail we’re looking for, because the data points are being grouped by year. To fix this, click the drop-down arrow on period and select ‘exact date’. 
</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/how-to-adjust.png" class="kg-image" alt="" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Fig 19: Finding the exact date setting on a dimension in Tableau</span></figcaption></figure><p>This undoes the grouping by year and matches each price data point to the exact date on which that price occurred. Your graph should now look like Fig 20 below.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/graph-after-time-adjustment.png" class="kg-image" alt="" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Fig 20: Graph after selecting correct setting for ‘period’ in Tableau</span></figcaption></figure><p>From there, you can edit axis labels and colors, and even add a filter to zoom in on a specific time period. Here’s our final result, with labels added, in Fig 21 below:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/graph-after-final-editing.png" class="kg-image" alt="" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Fig 21: Final graph showing Bitcoin prices in USD from 2010-2019</span></figcaption></figure><p>We encourage you to explore different visualization formats for the results you obtain from the queries provided. For inspiration, check out the <a href="https://timescale.ghost.io/blog/analyzing-bitcoin-ethereum-and-4100-other-cryptocurrencies-using-postgresql-and-timescaledb">different visualizations we used in our analysis</a>.</p><h2 id="conclusion">Conclusion</h2><p>This tutorial showed you, step by step, one method of defining, creating, loading, and analyzing a cryptocurrency market dataset. 
Here’s a reminder of what we covered:</p><ol><li>We created a Timescale Cloud account and spun up a TimescaleDB instance.</li><li>We learned how to design a schema for cryptocurrency data for TimescaleDB and PostgreSQL.</li><li>We used the CryptoCompare API and Python to create CSV files containing the data to analyze.</li><li>We inserted the data from those CSV files into TimescaleDB using pgAdmin and Timescale Cloud.</li><li>We connected our data in TimescaleDB to Tableau and performed queries on the dataset.</li><li>We used Tableau to create graphs to visualize the results from our queries.</li></ol><p>Thank you for reading this far, and if you followed all the steps, congratulations on completing this tutorial! We hope you’ve enjoyed following along and that you’ve found this tutorial helpful.</p><p>Perhaps you’d also enjoy our <a href="https://timescale.ghost.io/blog/blog/analyzing-bitcoin-ethereum-and-4100-other-cryptocurrencies-using-postgresql-and-timescaledb/">analysis of over 4,100 cryptocurrencies produced by following this tutorial</a>.</p><p>For follow-up questions or comments, reach out to us on Twitter (<a href="https://twitter.com/timescaledb">@TimescaleDB</a> or <a href="https://twitter.com/avthars">@avthars</a>) or our community <a href="http://slack.timescale.com/">Slack channel</a>, or email me directly (avthar at timescale dot com).</p><p>Finally, if you’re interested in learning more about us, check out the <a href="https://www.timescale.com/">Timescale website</a>, <a href="https://www.timescale.com/cloud-signup">Timescale Cloud</a>, and our <a href="https://github.com/timescale/timescaledb">GitHub</a>, and let us know how we can help!</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Analyzing Bitcoin, Ethereum, and 4,100+ Other Cryptocurrencies Using PostgreSQL and TimescaleDB]]></title>
            <description><![CDATA[After a Crypto Winter in 2018, cryptocurrencies today are resurging. How can data help us better understand the Crypto Revival?]]></description>
            <link>https://www.tigerdata.com/blog/analyzing-bitcoin-ethereum-and-4100-other-cryptocurrencies-using-postgresql-and-timescaledb</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/analyzing-bitcoin-ethereum-and-4100-other-cryptocurrencies-using-postgresql-and-timescaledb</guid>
            <category><![CDATA[Product & Engineering]]></category>
            <category><![CDATA[Tutorials]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Avthar Sewrathan]]></dc:creator>
            <pubDate>Thu, 19 Sep 2019 19:55:07 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2019/09/andre-francois-mckenzie-JrjhtBJ-pGU-unsplash.jpg">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2019/09/andre-francois-mckenzie-JrjhtBJ-pGU-unsplash.jpg" alt="Analyzing Bitcoin, Ethereum, and 4,100+ Other Cryptocurrencies Using PostgreSQL and TimescaleDB" /><p><strong><em>After a Crypto Winter in 2018, cryptocurrencies today are resurging. How can data help us better understand the Crypto Revival?</em></strong></p><p>When Satoshi Nakamoto first published the Bitcoin whitepaper in 2008, they probably didn’t foresee the world of <a href="https://www.investopedia.com/terms/h/hodl.asp">hodlers</a>, <a href="https://decryptionary.com/dictionary/lambo/">lambos</a>, <a href="https://www.definitions.net/definition/Buidl">buidlers</a>, <a href="https://www.vice.com/en_us/article/ne74nw/inside-the-world-of-the-bitcoin-carnivores">bitcoin maximalist carnivores</a>, and n00bs asking “<a href="https://www.quora.com/What-does-going-to-the-moon-mean-in-cryptocurrency">wen moon</a>” in Telegram channels that their actions would create. In 11 years, crypto has gone from something completely esoteric to something seemingly everyone has heard about.</p><p>2019 has been a big year for crypto so far. Highlights include the <a href="https://www.wsj.com/articles/sec-clears-blockstack-to-hold-first-regulated-token-offering-11562794848">SEC approving its first token sale</a>, the <a href="https://www.creditkarma.com/insights/i/irs-crack-down-cryptocurrency-owners/">IRS tracking down crypto tax evaders</a>, <a href="https://www.thetradenewscrypto.com/vast-majority-endowment-funds-testing-crypto-investments/">university endowments investing in crypto</a>, and even <a href="https://techcrunch.com/2019/06/18/facebook-libra/">Facebook announcing its own cryptocurrency</a>. Just this month, we also saw <a href="https://news.ycombinator.com/item?id=20919958">over $1 billion in Bitcoin transferred in a single transaction</a>. 
All this, and more, indicates a revived interest in the crypto markets, since the highs of 2017 and lows of 2018, by everyone from institutional investors and banks to laypeople trying to side hustle.</p><p>With the crypto markets once again awash with speculation and hype, it’s important to use all the tools at our disposal to make sense of the noise. Sometimes, reading articles and email newsletters isn’t enough. You have to go directly to the data.</p><p>As the developers of <a href="https://www.timescale.com/">TimescaleDB</a>, an open-source time-series database powered by PostgreSQL, we’re data-driven people. So, we thought it would be interesting to take a <em>data-driven approach</em> to analyzing the crypto market. For this analysis, we used PostgreSQL and TimescaleDB to analyze market data about Bitcoin, Ethereum, and 4,196 other cryptocurrencies and used Tableau to visualize our results.</p><p><strong>This post shares many high-level insights about the crypto market since its inception and during recent years. 
We answer questions like:</strong></p><ul><li>How have the prices of Bitcoin and Ethereum changed in the past several years?</li><li>Which new cryptocurrencies have been the most profitable in the past 3 months?</li><li>What are the cryptocurrencies on the rise?</li><li>What was the best day to “day-trade” Bitcoin?</li><li>What countries have the highest trading volume of BTC today?</li><li>Why is Bitcoin a terrible way to pay for pizza?</li></ul><p>...and many more, as we dive into analysis of topics like Bitcoin and Ethereum price, new coin growth, trading volume, and daily returns.</p><p>We also share how powerful SQL is as a query language for analyzing time-series data, how TimescaleDB and PostgreSQL further simplify time-series data analysis, and how using the two with Tableau visualizations can surface interesting insights from your data.</p><p>For the technically curious, you can learn how to create the <a href="https://github.com/timescale/examples/tree/master/crypto_tutorial/Cryptocurrency%20dataset%20Sept%2016%202019">dataset</a> we used for this analysis, load it, and draw insights from it in this <a href="https://timescale.ghost.io/blog/tutorials/how-to-analyze-cryptocurrency-market-data-using-timescaledb-postgresql-and-tableau-a-step-by-step-tutorial/">companion tutorial post</a>. 
In the tutorial, you will find <a href="https://timescale.ghost.io/blog/tutorials/how-to-analyze-cryptocurrency-market-data-using-timescaledb-postgresql-and-tableau-a-step-by-step-tutorial/">step-by-step instructions</a> on how to <a href="https://timescale.ghost.io/blog/tutorials/how-to-analyze-cryptocurrency-market-data-using-timescaledb-postgresql-and-tableau-a-step-by-step-tutorial/">create the dataset using Python</a> (including all code we used for the analysis), how to <a href="https://timescale.ghost.io/blog/tutorials/how-to-analyze-cryptocurrency-market-data-using-timescaledb-postgresql-and-tableau-a-step-by-step-tutorial/">load the data</a> into <a href="https://portal.managed.timescale.com/login">Managed Service for TimescaleDB</a>, a cloud-hosted version of TimescaleDB, and how to connect your database in the cloud to Tableau to <a href="https://timescale.ghost.io/blog/how-to-analyze-cryptocurrency-market-data-using-timescaledb-postgresql-and-tableau-a-step-by-step-tutorial/">recreate the analysis and produce graphs</a>.</p><h2 id="about-the-data-used-for-this-analysis">About the data used for this analysis</h2><p>For this analysis, we used historical <a href="https://en.wikipedia.org/wiki/Open-high-low-close_chart">OHLCV</a> price data for over <a href="https://github.com/timescale/examples/tree/master/crypto_tutorial/Cryptocurrency%20dataset%20Sept%2016%202019">4,100 cryptocurrencies</a> from 17 July 2010 to 16 September 2019, courtesy of <a href="https://www.cryptocompare.com/">CryptoCompare</a> and their <a href="https://min-api.cryptocompare.com/">wonderful API</a>. 
While the <a href="https://github.com/timescale/examples/tree/master/crypto_tutorial/Cryptocurrency%20dataset%20Sept%2016%202019">dataset we used</a> only includes daily data, TimescaleDB <a href="https://timescale.ghost.io/blog/recap-performant-time-series-data-management-analytics-with-postgres/">easily scales to handle data from much finer-grained time periods</a>.</p><p>Some of you may recall that we did a <a href="https://timescale.ghost.io/blog/analyzing-ethereum-bitcoin-and-1200-cryptocurrencies-using-postgresql-3958b3662e51/">similar analysis on crypto back in 2017</a>, but so much has happened since then, including the addition of almost 3,000 more cryptocurrencies and Bitcoin hitting nearly $20k in price, that we had to revisit this topic. Where applicable, we’ve included graphs to focus on recent history, from 2017-2019, updating the analysis from our previous post.</p><p><em>DISCLAIMER: At Timescale, we help companies harness the power of time-series data to make sense of the past, monitor the present, and predict the future. However, nothing in this analysis should be construed as financial advice, and we accept no liability for your actions as a result of using the information contained in this post. You’re welcome to draw your own conclusions using the tools and data and take your own risks accordingly.</em></p><h2 id="so-if-you%E2%80%99d-invested-100-in-bitcoin-9-years-ago-today-it%E2%80%99d-be-worth%E2%80%A6">So if you’d invested $100 in Bitcoin 9 years ago, today it’d be worth…</h2><p>When analyzing cryptocurrencies, we have to start with the original: Bitcoin. For any beginners, Bitcoin <a href="https://www.luno.com/learn/en/article/bitcoin-as-digital-gold">can be thought of as digital gold</a>, because Bitcoin has built-in scarcity (only 21 million BTC will ever be produced), can be almost infinitely divided without losing its unit value, and is difficult to counterfeit. 
(As an aside, for those looking for an introduction to Bitcoin and other cryptocurrencies, two good places to start are <a href="https://www.lopp.net/pdf/princeton_bitcoin_book.pdf">The Princeton Bitcoin Book</a> and <a href="https://www.amazon.com/Internet-Money-Andreas-M-Antonopoulos/dp/1537000454">The Internet of Money</a>.)</p><p>Looking at historical BTC-USD prices since 2010, we see that BTC prices have slowly increased, with an almost exponential increase taking place between 2014 and 2018. </p><pre><code class="language-SQL">--Query 1
-- BTC 7 day prices
SELECT time_bucket('7 days', time) AS period,
       last(closing_price, time) AS last_closing_price
FROM btc_prices
WHERE currency_code = 'USD'
GROUP BY period
ORDER BY period</code></pre><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Q1.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 1: Bitcoin Closing price in USD from 2010-2019</strong></b></i></figcaption></figure><p>So to answer our original question (and probably create some FOMO): on 16 September 2010, the price of 1 BTC was $0.0619, so $100 would have bought you 1615.5088853 BTC. Fast forward 9 years, and that $100 would have grown to $16,476,736.67! Cue the lambos! (But of course, you probably would have sold when BTC hit $100 back on 29 July 2013 :)).</p><p><em>Generating insights like this from time-series data takes no more than a 5-line SQL query thanks to TimescaleDB’s </em><a href="https://docs.timescale.com/latest/api#time_bucket"><em>time_bucket</em></a><em> and </em><a href="https://docs.timescale.com/latest/api#last"><em>last</em></a><em> functions, which were created specifically for TimescaleDB to simplify time-series analysis.</em></p><h2 id="why-bitcoin-is-a-terrible-way-to-pay-for-pizza">Why Bitcoin is a terrible way to pay for pizza</h2><p>One thing that’s evident from the data is that Bitcoin is volatile. Bitcoin’s price volatility may mean that its main use case is being a store of value rather than a means of exchange. 
The <a href="https://finance.yahoo.com/news/bitcoin-pizza-day-celebrating-world-100035402.html">Bitcoin Pizza Guy</a>, who paid 10,000 BTC for 2 pizzas back in 2010, would probably agree, as those 10,000 BTC are worth over $100,000,000 today!</p><p>Moreover, the time period between 2017 and 2019 saw a lot of ups and downs, so let’s zoom in on that period below:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Q1-Zoom.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 2: Bitcoin Closing price in USD from July 2017 to July 2019</strong></b></i></figcaption></figure><p>As Figure 2 shows, 2017-2019 has arguably been the most exciting and perhaps also the most painful time in the Bitcoin market, with Bitcoin prices reaching a high of nearly $20,000 before crashing to under $7,000 three months later. That decline continued throughout 2018, with some calling it the first crypto bear market and others a “Crypto Winter”. However, Bitcoin’s price run since April 2019 may be the first indication of the end of the Crypto Winter, with BTC prices reaching over $12,000 in June 2019.</p><h2 id="the-best-day-to-%E2%80%9Cday-trade%E2%80%9D-bitcoin-february-26-2014">The best day to “day-trade” Bitcoin? February 26, 2014.</h2><p>To better understand Bitcoin’s volatility, let’s look at the daily return on BTC, expressed as a factor of the previous day’s closing price. This is simple to do using <a href="https://www.postgresql.org/docs/9.3/functions-window.html">PostgreSQL window functions</a>, as shown in the query below.</p><pre><code class="language-SQL">--Query 2
-- BTC daily return
SELECT time,
       closing_price / lead(closing_price) OVER prices AS daily_factor
FROM (
  SELECT time,
         closing_price
  FROM btc_prices
  WHERE currency_code = 'USD'
  GROUP BY 1,2
) sub window prices AS (ORDER BY time DESC)</code></pre><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Q2.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 3.1: BTC Daily Return from 2010-2019</strong></b></i></figcaption></figure><p>From Figure 3.1, we can see large amounts of volatility in BTC price from 2010, culminating in a huge spike in daily return factor in February 2014. In fact, within just 7 days we saw the day with the lowest daily return factor (0.428 on 20 February 2014) and the highest-ever daily return factor (4.368 on 26 February 2014)! It’s no wonder that <a href="https://www.washingtonpost.com/business/why-facebook-chose-stablecoins-as-its-path-to-crypto/2019/06/18/2fa7d738-91e7-11e9-956a-88c291ab5c38_story.html?noredirect=on">stablecoins</a>, or price-stable cryptocurrencies, are being looked into as alternatives to use as a means of exchange within apps or for everyday transactions. Imagine paying for everything you buy in Bitcoin, and the headaches that daily and weekly BTC price volatility would cause.</p><p>Once again, let’s zoom in on the 2017-2019 time period:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Q2-Zoom.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 3.2: BTC Daily Return from 2017-2019</strong></b></i></figcaption></figure><p>From Figure 3.2, we see that Bitcoin’s most profitable day since the start of 2017 occurred recently, on 25 August 2019, with a daily return of 1.99 times the previous day’s rate. 
The biggest single-day loss came on 16 January 2018, with a daily return factor of 0.8276; 14 September 2017 was the second biggest, with a daily return factor of 0.8379.</p><h2 id="bitcoin%E2%80%99s-top-countries-by-trading-volume-us-japan-south-korea-and-poland">Bitcoin’s top countries by trading volume: US, Japan, South Korea, and... Poland!?</h2><p>Cryptofever has taken the world by storm, with lots of adoption taking place outside of the USA, most notably in places like Europe and Asia. We can get a sense of how crypto is being adopted in different regions of the world by looking at Bitcoin trading volume in different fiat currencies over time.</p><pre><code class="language-SQL">--Query 3
-- BTC trading volumes by currency 
SELECT time_bucket('14 days', time) AS period,
       currency_code,
       sum(volume_btc)
FROM btc_prices
GROUP BY currency_code, period
ORDER BY period;</code></pre><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Q3.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 4: BTC trading volume in different fiat currencies</strong></b></i></figcaption></figure><p>From Figure 4 above, we can see that China saw huge amounts of Bitcoin trading volume before government intervention in mid-2017, which made buying Bitcoin illegal. We can more clearly see how drastic this effect was by looking at Figure 5 below, which is a bar graph version of Figure 4, showing the volume of BTC trade in different fiat currencies by year:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Q3--2-.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 5: BTC Trading Volume in different fiat currencies by year</strong></b></i></figcaption></figure><p>From Figure 5, we can more clearly see the rise in Chinese (CNY) Bitcoin trading activity and how government intervention in 2017 brought that to a halt. Moreover, we can see Japan (JPY) and South Korea (KRW) overtaking Europe with respect to Bitcoin trading volume, with more volume than the Euro (EUR) during 2017 and 2018. 
This confirms the <a href="https://www.bbc.com/news/business-42713314">USA, Japan and South Korea as the world’s 3 largest bitcoin markets</a>.</p><p>Furthermore, if we remove USD, CNY, JPY, KRW, and EUR from the list of fiat currencies, we can get a sense of the trend in Bitcoin adoption outside the largest markets, as shown in Figure 6 below:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Q3--5-.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 6: BTC Trading Volume in different fiat currencies by year (excluding USD, JPY, KRW, CNY, EUR)</strong></b></i></figcaption></figure><p>Note the rising BTC trade volumes in South Africa (ZAR) since 2015, the dramatic increase and subsequent decrease in trading volume from Hong Kong (HKD) in 2017, and the fact that the currency with the highest trade volume outside the big 5 is none other than the Polish Zloty (PLN)!</p><p>Figure 6.1 shows trading volumes of BTC in PLN since 2014:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Q3--6-.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 6.1: BTC trading volumes in PLN</strong></b></i></figcaption></figure><h2 id="now-if-you%E2%80%99d-bought-100-worth-of-ethereum-in-2015-today-it%E2%80%99d-be-worth">Now if you’d bought $100 worth of Ethereum in 2015, today it’d be worth...</h2><p>Ethereum is popularly regarded as the cryptocurrency with the second largest interest base after Bitcoin; however, it is fundamentally different from Bitcoin. 
While Bitcoin is considered to be digital gold, Ether is more like fuel (<a href="https://ethereum.stackexchange.com/questions/3/what-is-meant-by-the-term-gas">gas</a>) that runs transactions on the Ethereum network.</p><p>Since the <a href="https://cointelegraph.com/news/ethereum-raises-3700-btc-in-first-12-hours-of-ether-presale">first currency with which you could buy Ethereum was Bitcoin</a>, let’s take a look at historical ETH prices in BTC, shown in Figure 7 below:</p><pre><code class="language-SQL">-- Q4
-- ETH prices in BTC in 7 day intervals
SELECT  
	time_bucket('7 days', time) AS time_period,   
    last(closing_price, time) AS closing_price_btc
FROM crypto_prices
WHERE currency_code='ETH'
GROUP BY time_period
ORDER BY time_period</code></pre><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Q4.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 7: ETH Price in BTC since 2015</strong></b></i></figcaption></figure><p>Figure 7 shows us Ethereum (ETH) closing prices since 3 August 2015 in weekly intervals, expressed in BTC. Notice that 2017 was a rollercoaster year for ETH, with the currency reaching an all-time high of 0.1402 BTC on 12 June 2017 and then crashing back down to 0.0288 BTC on 4 December 2017, less than 6 months later. </p><p>Let’s take a look at recent Ethereum prices by zooming in on the period since 2017 in Figure 8 below:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Q4-Zoom.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 8: ETH Price in BTC since 1 January 2017</strong></b></i></figcaption></figure><p>From Figure 8 above, it seems that ETH then went on another bull run in early 2018, with prices reaching 0.1052 BTC in the period around 22 January 2018. Since then, it seems like ETH prices have been trending downward, with the price on 16 September 2019 reaching 0.0188 BTC. While that’s not great for investors, it may prove to be a blessing for developers and decentralized application users in Ethereum’s ecosystem, as gas costs would be cheaper, perhaps decreasing the barriers to adoption.</p><h2 id="crypto-convertibles-not-the-car-kind">Crypto Convertibles (not the car kind)</h2><p>Since most people don’t think about prices in BTC (yet), and given how volatile BTC is, it’s useful to also examine ETH prices expressed in different fiat currencies. 
There are two ways this could be done. First, we could look at ETH prices directly in different fiat currencies, like we did for Bitcoin and USD in Figure 1 above. Second, we could convert ETH prices in BTC to fiat currency prices, by looking at that day’s BTC to fiat exchange rate. For illustration purposes, we’ll use the second technique. While this may seem like a strange choice, it’s worth noting that conversions from one cryptocurrency to another and then to a fiat currency are fairly common in the cryptocurrency trading world. This is because many exchanges support buying cryptocurrencies with other cryptocurrencies (mainly BTC and ETH), but not all cryptocurrencies are purchasable directly with fiat currency.</p><p>In order to examine ETH prices in different fiat currencies in PostgreSQL, we joined two tables and used filters, as the code below illustrates:</p><pre><code class="language-SQL">--Q5
-- ETH prices in fiat
SELECT time_bucket('7 days', c.time) AS time_period,  
	last(c.closing_price, c.time) AS last_closing_price_in_btc,
    last(c.closing_price, c.time) * last(b.closing_price, c.time) FILTER (WHERE b.currency_code = 'USD') AS last_closing_price_in_usd, 
    last(c.closing_price, c.time) * last(b.closing_price, c.time) FILTER (WHERE b.currency_code = 'EUR') AS last_closing_price_in_eur,
    last(c.closing_price, c.time) * last(b.closing_price, c.time) FILTER (WHERE b.currency_code = 'CNY') AS last_closing_price_in_cny,
    last(c.closing_price, c.time) * last(b.closing_price, c.time) FILTER (WHERE b.currency_code = 'JPY') AS last_closing_price_in_jpy,  
    last(c.closing_price, c.time) * last(b.closing_price, c.time) FILTER (WHERE b.currency_code = 'KRW') AS last_closing_price_in_krw
FROM crypto_prices c
JOIN btc_prices b   
	ON time_bucket('1 day', c.time) = time_bucket('1 day', b.time)
WHERE c.currency_code = 'ETH'
GROUP BY time_period
ORDER BY time_period</code></pre><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Q5-USD-EUR.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 9.1: ETH Price in USD and EUR</strong></b></i></figcaption></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Q5-JPY.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 9.2: ETH Price in JPY</strong></b></i></figcaption></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Q5-CNY.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 9.3: ETH Price in CNY</strong></b></i></figcaption></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Q5-KRW.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 9.4: ETH Price in KRW</strong></b></i></figcaption></figure><p>One thing to notice is that Figures 9.1-9.4 all have the same shape, since they are another expression of the ETH price in BTC. The main difference in the graphs is the scale of the Y axis, as this reflects the respective currency’s BTC exchange rate. This is a result of the decision to use the BTC-Fiat exchange rate for this conversion, rather than direct ETH-Fiat prices. 
When all are plotted on the same axis, we get Figure 10 below:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Q5.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 10: ETH Price in Different Fiat Currencies</strong></b></i></figcaption></figure><p>Fortunately, you can now directly purchase ETH using fiat currency on many exchanges. So let’s look at historical ETH prices in USD, in Figure 11 below:</p><pre><code class="language-SQL">-- ETH prices in USD
SELECT time_bucket('7 days', time) as period,       
	last(closing_price, time) AS last_closing_price
FROM eth_prices
WHERE currency_code = 'USD'
GROUP BY period
ORDER BY period</code></pre><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Q10-ETH-USD.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 11: ETH Price in USD from 2015-2019</strong></b></i></figcaption></figure><p>Figure 11 above tells a similar story to that of Figure 8, as we can see the ETH bull run to $1,359 on 8 January 2018. ETH prices in USD have been trending downward since then, with the price on 16 September 2019 falling to $199.</p><p>So to answer our original question: if you had bought $100 worth of Ethereum on 16 September 2015, when the price of 1 ETH was around $0.89, your $100 would have bought you 112.3596 ETH. At the 16 September 2019 price of $199, that $100 would now have grown to $22,359.55! However, the best time to sell would’ve been at the peak in January 2018, when your 112.3596 ETH would’ve been worth $152,696.63! This just goes to show how important it is to time the market!</p><p><em>(One piece of analysis which we didn’t do, but encourage readers to do, is to examine the developer activity in the Ethereum ecosystem (e.g., GitHub commits/issues) and see if that correlates to price in some way.)</em></p><h2 id="tracking-4000-other-cryptocurrencies-starting-from-inception">Tracking 4000+ other cryptocurrencies, starting from inception</h2><p>While we can’t see when exactly coins ICO’d or first got listed on exchanges, we <em>can</em> track the date a coin first got added to CryptoCompare as a proxy for its launch date. This allows us to track the launch of different coins over time, as seen in Figure 11.</p><pre><code class="language-SQL">--Q6
--Crypto by date of first data
SELECT ci.currency_code, min(c.time)
FROM currency_info ci JOIN crypto_prices c ON ci.currency_code = c.currency_code
AND c.closing_price &gt; 0
GROUP BY ci.currency_code
ORDER BY min(c.time) DESC</code></pre><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Q6--Bar-Graph-.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 11: Number of new cryptocurrencies launched by year</strong></b></i></figcaption></figure><p>It’s easy to conclude that a bull run in BTC prices might have fueled a massive increase in developer activity in the crypto space. The evidence for this comes from the 738 new cryptos released in 2017, a year where BTC almost 10x’ed its price between January (1 BTC = $2,435 on Jan 1) and December (1 BTC = $19,345 on Dec 16). With bitcoin prices steadily increasing, it’s no wonder that hundreds of developers tried their hand at creating their own cryptocurrency in the hopes of maybe building the next Bitcoin.</p><p>However, despite the dramatic crash in BTC prices throughout most of 2018, it’s interesting to see that the number of new cryptocurrencies launched in 2018 was the highest ever, with 771 new cryptocurrency projects launching that year.</p><p>Figure 12 shows us the result of examining the number of new cryptos launched by day.</p><pre><code class="language-SQL">--Q7
-- Number of new cryptocurrencies by day
SELECT day, COUNT(code)
FROM (  
	SELECT min(c.time) AS day, ci.currency_code AS code  
    FROM currency_info ci JOIN crypto_prices c ON ci.currency_code = c.currency_code  
    AND c.closing_price &gt; 0  
   GROUP BY ci.currency_code  
   ORDER BY min(c.time))a
GROUP BY day
ORDER BY day DESC</code></pre><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Q7.png" class="kg-image" alt="" loading="lazy" width="1600" height="1092" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2022/01/Q7.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2022/01/Q7.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Q7.png 1600w" sizes="(min-width: 720px) 720px"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 12: Number of new cryptocurrencies launched by day</strong></b></i></figcaption></figure><p>Two days in particular are super interesting. On 2 December 2014, we saw data for 81 cryptocurrencies being added to CryptoCompare, and on 26 May 2017, we saw a whopping 134 new cryptos being added!</p><h2 id="these-coins-had-higher-transaction-volumes-than-bitcoin-this-month-hint-it%E2%80%99s-not-bitcoin-cash">These coins had higher transaction volumes than Bitcoin this month (Hint: It’s not Bitcoin Cash)</h2><p>With 4000+ cryptocurrencies out there and new ones coming out every day, it can be hard to pick which ones are worth paying attention to. One helpful metric for spotting coins on the rise is the transaction volume. In Figure 13, we looked at the transaction volume for all 4054 coins in our dataset over the 14 day period from September 2 to 16 2019.</p><pre><code class="language-SQL">--Q8
--Crypto transaction volume during a certain time period
SELECT 'BTC' as currency_code,      
	sum(b.volume_currency) as total_volume_in_usd
FROM btc_prices b
WHERE b.currency_code = 'USD'
AND now() - date(b.time) &lt; INTERVAL '14 day'
GROUP BY b.currency_code
UNION
SELECT c.currency_code as currency_code,      
	sum(c.volume_btc) * avg(b.closing_price) as total_volume_in_usd
FROM crypto_prices c JOIN btc_prices b ON date(c.time) = date(b.time)
WHERE c.volume_btc &gt; 0
AND b.currency_code = 'USD'
AND now() - date(b.time) &lt; INTERVAL '14 day'
AND now() - date(c.time) &lt; INTERVAL '14 day'
GROUP BY c.currency_code
ORDER BY total_volume_in_usd DESC </code></pre><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Q8.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 13: Cryptos with the most USD transaction volume from Sept 2-16 2019</strong></b></i></figcaption></figure><p>It’s surprising to see that Bitcoin, despite being the original cryptocurrency, did not have the largest transaction volume over the time period in question. That honor belonged to <a href="https://en.wikipedia.org/wiki/Tether_(cryptocurrency)">USD Tether</a>, a fiat-backed stablecoin, with Ethereum coming in second and <a href="https://litecoin.org/">Litecoin</a>, the proverbial silver to Bitcoin’s gold, coming in third. Bitcoin (BTC) had the fourth highest USD transaction volume in that 14 day period, followed by <a href="https://www.ripple.com/">Ripple</a> (XRP), a global payments system which has partnered with several banks and payment processors, and <a href="https://en.wikipedia.org/wiki/EOS.IO">EOS</a>, a smart contract platform.</p><h2 id="what-are-the-most-profitable-new-cryptocurrencies">What are the most profitable new cryptocurrencies?</h2><p>Another way of making sense of the flood of new currencies is to look at how profitable coins are, as measured by total daily return. By homing in on the currencies with the highest increase in rate by day, we can gain a different perspective on which currencies might be worth looking into further.</p><p>One question to ask is which cryptocurrency has the highest daily return during a certain time period. Figures 14.1 and 14.2 below show cryptocurrencies sorted by their maximum daily return factor between June 16 and September 16 2019.<br></p><pre><code class="language-SQL">--Q9
--Top crypto by daily return
WITH   
	prev_day_closing AS (
SELECT   
	currency_code,   
    time,   
    closing_price,   
    LEAD(closing_price) OVER (PARTITION BY currency_code ORDER BY TIME DESC) AS prev_day_closing_price
FROM    
	crypto_prices  
),   
	daily_factor AS (
SELECT   
	currency_code,   
	time,   
	CASE WHEN prev_day_closing_price = 0 THEN 0 ELSE closing_price/prev_day_closing_price END AS daily_factor
FROM   
	prev_day_closing
)
SELECT   
	time,   
	LAST(currency_code, daily_factor) as currency_code,   
    MAX(daily_factor) as max_daily_factor
FROM   
	daily_factor
GROUP BY   
	time</code></pre><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Q9--2-.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 14.1: Cryptocurrencies sorted by absolute highest daily return</strong></b></i></figcaption></figure><p>From Figure 14.1 above, we see that <a href="https://www.cryptocompare.com/coins/mixi/overview">Mixin (MIXI)</a>, an Ethereum based token, comes out on top with a maximum daily return factor of over 25 million times the previous day’s rate. <a href="https://www.cryptocompare.com/coins/bomb/overview">BOMB</a>, an experimental deflationary currency, had the second highest absolute daily return with a return factor of over 700,000. We can get a better look at the rest of the data by removing these two outliers, as seen in Figure 14.2 below:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Q9--3-.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 14.2: Cryptos sorted by absolute highest daily return (with MIXI, BOMB removed)</strong></b></i></figcaption></figure><p>From Figure 14.2, we see that other cryptos with big days in the past 3 months were <a href="https://www.cryptocompare.com/coins/xde2/overview">Double Eagle Coin (XDE2)</a>, <a href="https://www.cryptocompare.com/coins/xgr/overview">Gold Reserve (XGR)</a>, <a href="https://www.cryptocompare.com/coins/soul/overview">SoulCoin (SOUL)</a> and <a href="https://www.cryptocompare.com/coins/ccc/overview">CCCoin (CCC)</a>. 
</p><p>Furthermore, it’s interesting to note the difference in order of magnitude between the coins with the top daily return in the past 3 months, with MIXI achieving a daily return factor in the millions, BOMB in the hundreds of thousands and the next highest coins in the tens of thousands. </p><p>Another interesting thing to look at is the number of times a coin had the top daily return during a certain period of time. Figure 15 shows us the coins with the highest frequency of having the top daily return during the 3 months between 16 June and 16 September 2019.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2022/01/Q9.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 15: Top coins by most days with highest daily return during a 3 month period</strong></b></i></figcaption></figure><p>The coins with the highest frequency of having the top daily return are <a href="https://www.cryptocompare.com/coins/mixi/overview">MIXI</a> (discussed above), with 5 unique days with the top daily return; <a href="https://www.coinbase.com/price/bitether">Bitether (BTR)</a>, a cryptocurrency built on the Ethereum platform, with 3 unique days; and <a href="https://coinmarketcap.com/currencies/icechain/">IceChain (ICHX)</a>, coming in third with 2 unique days. </p><h2 id="conclusion">Conclusion</h2><p>In this post, we used the power of PostgreSQL and TimescaleDB to analyze a public cryptocurrency dataset of over 4100 cryptocurrencies over the time period 2010 to 2019. 
We examined time-series trends in Bitcoin and Ethereum prices, new coin growth, trading volume, daily returns, and more.</p><p>While our analysis aimed to provide a taste of what’s possible using PostgreSQL and TimescaleDB, we encourage you to take the tools we used and apply them to different crypto datasets and gain deeper insights!</p><p>We’ve created a <a href="https://timescale.ghost.io/blog/tutorials/how-to-analyze-cryptocurrency-market-data-using-timescaledb-postgresql-and-tableau-a-step-by-step-tutorial/">companion tutorial post</a> for those interested in re-creating this analysis or looking for a starting point to perform your own analysis. In the tutorial, you will find <a href="https://timescale.ghost.io/blog/tutorials/how-to-analyze-cryptocurrency-market-data-using-timescaledb-postgresql-and-tableau-a-step-by-step-tutorial/">step by step instructions</a> on how to create the dataset using Python (including all code we used for the analysis), how to load the data into <a href="https://portal.managed.timescale.com/login">Managed Service for TimescaleDB</a> and how to connect your database in the cloud to Tableau to recreate the analysis and produce graphs. If you do perform your own analysis, let us know what interesting insights you find! </p><p>Moreover, you can dig in to the technical side of Timescale and how we made PostgreSQL scalable for time-series data <a href="https://timescale.ghost.io/blog/time-series-data-why-and-how-to-use-a-relational-database-instead-of-nosql-d0cd6975e87c/">in this detailed post</a>. If you’re interested in experiencing the power of Timescale for your time series data, sign up for <a href="https://portal.managed.timescale.com/login">Managed Service for TimescaleDB</a>!</p><p>Please drop your comments below and share this post with others whom you think would enjoy it. 
For follow-up questions or comments, reach out to us on Twitter (<a href="https://twitter.com/timescaledb">@TimescaleDB</a> or <a href="https://twitter.com/avthars">@avthars</a>), our community <a href="http://slack.timescale.com/">Slack channel</a>, or reach out to me directly via email (avthar at timescale dot com).</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to Install psql on Mac, Ubuntu, Debian, Windows]]></title>
            <description><![CDATA[Instructions on how to get psql setup on a variety of operating systems.
]]></description>
            <link>https://www.tigerdata.com/blog/how-to-install-psql-on-mac-ubuntu-debian-windows</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/how-to-install-psql-on-mac-ubuntu-debian-windows</guid>
            <category><![CDATA[Tutorials]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Ajay Kulkarni]]></dc:creator>
            <pubDate>Thu, 22 Aug 2019 19:21:00 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/10/Screenshot-2023-10-11-at-7.08.25-PM.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2023/10/Screenshot-2023-10-11-at-7.08.25-PM.png" alt="A fierce PostgreSQL elephant (How to Install psql on Mac, Ubuntu, Debian, Windows)" /><p><code>psql</code> is a terminal-based front-end to <a href="https://www.tigerdata.com/learn/postgres-basics" rel="noreferrer">PostgreSQL</a>. It provides an interactive command-line interface to the PostgreSQL (or TimescaleDB) database. With psql, <a href="https://www.tigerdata.com/blog/connecting-to-postgres-with-psql-and-pg_service-conf" rel="noreferrer">you can type in queries interactively, issue them to PostgreSQL</a>, and see the query results. It also provides several meta-commands and various shell-like features to facilitate writing scripts and automating a wide variety of tasks.</p><figure class="kg-card kg-embed-card"><iframe width="200" height="113" src="https://www.youtube.com/embed/iPFLdGnJcDw?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen="" title="What is psql?"></iframe></figure><p>For instance, users can create, modify, and delete database objects such as tables, views, or users using psql. They can also execute SQL commands, manage data within the database, and even customize the psql environment.</p><p>Since psql is the standard command line interface for interacting with a PostgreSQL or TimescaleDB instance, this article will show you how to install it on a variety of operating systems.</p><div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text">Need a refresher on Postgres and <a href="https://www.tigerdata.com/blog/10-psql-commands-that-will-make-your-life-easier" rel="noreferrer">psql commands</a>? 
Check out our <a href="https://www.tigerdata.com/learn/postgres-cheat-sheet" rel="noreferrer">Postgres Cheat Sheet</a>.</div></div><h2 id="before-you-start">Before You Start</h2><p>Before you start, you should confirm that you don’t already have <code>psql</code> installed. In fact, if you’ve ever installed Postgres or TimescaleDB before, you likely already have <code>psql</code> installed.</p><p>To check, open a terminal and run the following:</p><pre><code class="language-shell">psql --version</code></pre><p>If <code>psql</code> isn't installed (or you want to upgrade to a newer client), keep reading.</p><h3 id="a-quick-note-on-version-compatibility">A quick note on version compatibility</h3><p>Your <code>psql</code> client version should be <strong>greater than or equal to</strong> the version of the PostgreSQL server you're connecting to. An older client against a newer server may not understand newer features and can give confusing output. Clients are backwards-compatible with older servers, but not forwards-compatible with newer ones, so when in doubt, install the most recent <code>psql</code>.</p><p>As of this writing, PostgreSQL 18 is the current major release. The instructions below default to PG 18; substitute <code>17</code> or another version if you have a specific reason to.</p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text"><a href="https://www.tigerdata.com/blog/connecting-to-postgres-with-psql-and-pg_service-conf" rel="noreferrer"><u>Forgot how to connect to your Postgres services? You can use the .pg_service.conf file, or if you're a TimescaleDB user, just click one button.</u></a></div></div><h2 id="install-on-macos-using-homebrew">Install on macOS Using Homebrew</h2><p>The cleanest way to install just the <code>psql</code> client on macOS is via Homebrew's <code>libpq</code> formula. 
(<code>libpq</code> is the official PostgreSQL client library; it ships with <code>psql</code>, <code>pg_dump</code>, <code>pg_restore</code>, and friends, without installing a server.)</p><p><strong>1. Install Homebrew</strong> if you don't already have it. Homebrew is the de facto package manager for macOS.</p><p><strong>2. Install libpq:</strong></p><pre><code class="language-shell">brew update
brew install libpq</code></pre><p><strong>3. Make psql available on your PATH.</strong> <code>libpq</code> is "keg-only" in Homebrew, which means it isn't symlinked into your <code>PATH</code> by default (so it doesn't collide with Apple's bundled <code>libpq</code>, or with a full PostgreSQL install if you also have one).</p><p>The recommended approach is to add <code>libpq</code>'s <code>bin</code> directory to your shell <code>PATH</code>. The path differs depending on your Mac's architecture:</p><ul><li><strong>Apple Silicon (M1, M2, M3, M4):</strong> Homebrew lives at <code>/opt/homebrew</code></li><li><strong>Intel Macs:</strong> Homebrew lives at <code>/usr/local</code></li></ul><p>You don't have to know which: <code>brew --prefix libpq</code> will figure it out. 
For <code>zsh</code> (the default shell since macOS Catalina):</p><pre><code class="language-shell">echo 'export PATH="'$(brew --prefix libpq)'/bin:$PATH"' &gt;&gt; ~/.zshrc
source ~/.zshrc</code></pre><p>If you're on bash, swap <code>~/.zshrc</code> for <code>~/.bash_profile</code>.</p><p><strong>Alternative: <code>brew link --force</code>.</strong> If you don't have a full PostgreSQL install on the same machine and you want libpq's tools symlinked into Homebrew's main bin directory, you can run:</p><pre><code class="language-shell">brew link --force libpq</code></pre><p>Skip this if you already have <code>postgresql</code> installed via Homebrew (it'll collide).</p><p><strong>Want the full PostgreSQL server too?</strong> Install the full PostgreSQL formula instead of <code>libpq</code>:</p><pre><code class="language-shell">brew install postgresql@18</code></pre><p>This installs the server, <code>psql</code>, and all the client tools in one shot.</p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">⭐</div><div class="kg-callout-text">Related documentation: <a href="https://www.tigerdata.com/docs/get-started/choose-your-path/install-timescaledb#tab=macos" rel="noreferrer"><u>Installing self-hosted TimescaleDB on macOS</u></a>.</div></div><h2 id="install-on-ubuntu-2204-2404-2604-and-debian-13">Install on Ubuntu (22.04, 24.04, 26.04) and Debian 13</h2><p>You have two options on apt-based distros: the default repos (easy, often a release or two behind), or the official PostgreSQL APT repository (PGDG, which gives you the current major version).</p><h3 id="option-a-default-apt-repos-quick-and-dirty%E2%80%8B">Option A: Default apt repos (quick and dirty)</h3><pre><code class="language-shell">sudo apt update
sudo apt install postgresql-client </code></pre><p>This installs whatever <code>psql</code> version Ubuntu/Debian ships in their archive. On Ubuntu 26.04 that's a recent version; on older releases you may be one or two majors behind.</p><p>Note: this installs only the client, not the PostgreSQL server.</p><h3 id="option-b-postgresql-apt-repo-recommended-for-current-versions">Option B: PostgreSQL APT repo (recommended for current versions)</h3><p>The PostgreSQL project maintains its own apt repo with up-to-date packages for every supported major version. This is the right choice if you need a specific PG version or you want to stay current.</p><pre><code class="language-shell"># Add the PGDG repo for your distribution
sudo install -d /usr/share/postgresql-common/pgdg
sudo curl -o /usr/share/postgresql-common/pgdg/apt.postgresql.org.asc \
  --fail https://www.postgresql.org/media/keys/ACCC4CF8.asc

sudo sh -c 'echo "deb [signed-by=/usr/share/postgresql-common/pgdg/apt.postgresql.org.asc] \
  https://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" \
  &gt; /etc/apt/sources.list.d/pgdg.list'

# Install psql 18 (or whichever major version you need)
sudo apt update
sudo apt install postgresql-client-18
</code></pre><p>Full instructions: <a href="https://apt.postgresql.org" rel="noreferrer">apt.postgresql.org</a>.</p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text"><b><strong style="white-space: pre-wrap;">Editor's Note</strong></b>: For information about how to connect psql to your TimescaleDB database, please refer to <a href="https://www.tigerdata.com/docs/integrate/query-administration/psql" rel="noreferrer"><u>Tiger Data documentation on connecting to your database with psql</u></a>.</div></div><h2 id="install-on-fedora-rhel-rocky-and-almalinux">Install on Fedora, RHEL, Rocky, and AlmaLinux</h2><p>For RPM-based distros, <code>psql</code> is in the standard repos:</p><pre><code class="language-shell">sudo dnf install postgresql</code></pre><p>For the current major version, use the PostgreSQL Yum repository; see <a href="https://yum.postgresql.org" rel="noreferrer">yum.postgresql.org</a> for the per-distro repo file.</p><h2 id="install-on-windows-11">Install on Windows 11</h2><p>You have several reasonable options on Windows. Pick whichever matches your workflow.</p><h3 id="option-1-winget-built-into-windows-11">Option 1: <code>winget</code> (built into Windows 11)</h3><pre><code class="language-powershell">winget install PostgreSQL.PostgreSQL.18</code></pre><h3 id="option-2-chocolatey">Option 2: Chocolatey</h3><pre><code class="language-powershell">choco install postgresql18</code></pre><h3 id="option-3-scoop">Option 3: Scoop</h3><pre><code class="language-powershell">scoop install postgresql</code></pre><h3 id="option-4-official-postgresql-installer">Option 4: Official PostgreSQL installer</h3><p>If you prefer a GUI wizard, download the EDB-built installer from <a href="https://postgresql.org" rel="noreferrer">postgresql.org</a>. 
This installs the server, <code>psql</code>, pgAdmin, and the rest.</p><h3 id="option-5-wsl-recommended-for-developers">Option 5: WSL (recommended for developers)</h3><p>If you're using WSL with Ubuntu, just follow the Ubuntu/Debian instructions above inside your WSL shell. This is often the smoothest path if you're already living in WSL for development.</p><p>After installing, you may need to add the PostgreSQL <code>bin</code> directory to your <code>PATH</code> so that <code>psql</code> works from any terminal. The installer typically asks about this; if not, the directory is something like <code>C:\Program Files\PostgreSQL\18\bin</code>.</p><h2 id="last-step-connect-to-your-postgresql-server">Last Step: Connect to Your PostgreSQL Server</h2><p>Let’s confirm that psql is installed:</p><pre><code class="language-shell">psql --version</code></pre><p>Now, in order to connect to your PostgreSQL server, we’ll need the following connection params:</p><ul><li>Hostname</li><li>Port</li><li>Username</li><li>Password</li><li>Database name</li></ul><p>There are two ways to use these params.</p><h3 id="option-1">Option 1:</h3><pre><code class="language-shell">psql -h [HOSTNAME] -p [PORT] -U [USERNAME] -W -d [DATABASENAME]</code></pre><p>Once you run that command, the prompt will ask you for your password. (The <code>-W</code> flag forces the password prompt.)</p><h3 id="option-2">Option 2:</h3><pre><code class="language-shell">psql "postgresql://[USERNAME]:[PASSWORD]@[HOSTNAME]:[PORT]/[DATABASENAME]?sslmode=verify-full"</code></pre><p>Note: the canonical scheme is <code>postgresql://</code>. The shorter <code>postgres://</code> works as an alias, but <code>postgresql://</code> is what you'll see in the official docs.</p><h3 id="a-word-on-sslmode">A word on <code>sslmode</code><br></h3><p>Don't blindly copy <code>sslmode=require</code> from old tutorials. 
Here's the actual difference:</p><ul><li><code>sslmode=require</code> encrypts the connection but does <strong>not</strong> verify the server's certificate. This leaves the connection vulnerable to man-in-the-middle attacks.</li><li><code>sslmode=verify-ca</code> verifies that the certificate was signed by a trusted CA, but not the hostname.</li><li><code>sslmode=verify-full</code> verifies the cert chain <em>and</em> the hostname. <strong>Use this for production.</strong></li></ul><p>Tiger Cloud uses certificates signed by a public CA, so <code>verify-full</code> works out of the box, with no custom CA bundle required.</p><p>If you are using the Tiger Data dashboard, this is how the connection parameters look in the <a href="https://console.cloud.tigerdata.com/signup" rel="noreferrer">Tiger Data UI</a>:</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/06/data-src-image-d12affb5-b6ef-4bec-a7e2-e14f7e2b474e.png" class="kg-image" alt="" loading="lazy" width="1300" height="1000" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2026/06/data-src-image-d12affb5-b6ef-4bec-a7e2-e14f7e2b474e.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2026/06/data-src-image-d12affb5-b6ef-4bec-a7e2-e14f7e2b474e.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2026/06/data-src-image-d12affb5-b6ef-4bec-a7e2-e14f7e2b474e.png 1300w" sizes="(min-width: 720px) 720px"></figure><p>Congrats! 
Now you are connected via <code>psql</code>.</p><h2 id="related-resources-psql-postgres">Related resources: psql + Postgres</h2><ul><li><a href="https://www.tigerdata.com/docs/integrate/query-administration/psql" rel="noreferrer"><u>Connect with psql</u></a></li><li><a href="https://www.tigerdata.com/docs/deploy/tiger-cloud/tiger-cloud-aws/tiger-cloud-extensions" rel="noreferrer"><u>PostgreSQL extensions</u></a></li><li><a href="https://www.tigerdata.com/docs/deploy/self-hosted/configuration/postgres-config" rel="noreferrer"><u>Manual PostgreSQL configuration and tuning</u></a></li><li><a href="https://www.tigerdata.com/learn/guide-to-postgresql-scaling" rel="noreferrer"><u>A Guide to Scaling PostgreSQL</u></a></li></ul><h2 id="about-timescaledb-and-postgresql-queries">About TimescaleDB and PostgreSQL Queries</h2><p>Now that you've installed psql, you are ready to start interacting with your PostgreSQL or TimescaleDB database.</p><p>TimescaleDB is built on PostgreSQL and expands its capabilities for time series, analytics, and events. It <a href="https://www.tigerdata.com/blog/postgresql-timescaledb-1000x-faster-queries-90-data-compression-and-much-more" rel="noreferrer">will make your queries faster</a> via <a href="https://www.tigerdata.com/learn/is-postgres-partitioning-really-that-hard-introducing-hypertables" rel="noreferrer">automatic partitioning</a>, query planner enhancements, improved materialized views, <a href="https://www.tigerdata.com/blog/building-columnar-compression-in-a-row-oriented-database" rel="noreferrer">columnar compression</a>, and much more. 
Plus, <a href="https://www.tigerdata.com/blog/scaling-postgresql-for-cheap-introducing-tiered-storage-in-timescale" rel="noreferrer">you can scale PostgreSQL for cheap with our multi-tiered storage backend</a>.</p><p>If you're running your PostgreSQL database on your own hardware, <a href="https://www.tigerdata.com/docs/get-started/choose-your-path/install-timescaledb" rel="noreferrer">you can simply add the TimescaleDB extension</a>. If you prefer to try TimescaleDB in AWS, <a href="https://console.cloud.tigerdata.com/signup" rel="noreferrer">create a free account on our platform today</a>. It only takes a couple of seconds, no credit card required! </p><h2 id="faqs-how-to-install-psql-on-mac-ubuntu-debian-windows">FAQs: How to Install psql on Mac, Ubuntu, Debian, Windows</h2><p><strong>Q: How do I check if psql is already installed on my system?</strong><br>You can verify if <code>psql</code> is already installed by opening your command line program and typing <code>psql --version</code>. If you've previously installed PostgreSQL or TimescaleDB, you likely already have <code>psql</code> installed, as it comes bundled with these installations.</p><p><strong>Q: How do I install psql on macOS?</strong><br>Install Homebrew, then run <code>brew update &amp;&amp; brew install libpq</code>. Because <code>libpq</code> is keg-only, add it to your <code>PATH</code> with <code>echo 'export PATH="'$(brew --prefix libpq)'/bin:$PATH"' &gt;&gt; ~/.zshrc &amp;&amp; source ~/.zshrc</code>. On Apple Silicon Macs, Homebrew lives at <code>/opt/homebrew</code>; on Intel Macs, it's at <code>/usr/local</code>. The brew <code>--prefix</code> command handles the difference for you. If you want the full PostgreSQL server, install <code>postgresql@18</code> instead.</p><p><strong>Q: How do I install psql on Ubuntu or Debian? </strong><br>For a quick install, run <code>sudo apt update &amp;&amp; sudo apt install postgresql-client</code>. 
This pulls whatever version your distro ships, which can lag behind upstream. To get the current major release (PG 18 as of May 2026), add the PostgreSQL APT repository (PGDG) and install <code>postgresql-client-18</code> from there.</p><p><strong>Q: What's the recommended way to install psql on Windows 11?</strong><br>On Windows 11 the easiest option is <code>winget install PostgreSQL.PostgreSQL.18</code>. Chocolatey (<code>choco install postgresql18</code>) and Scoop (<code>scoop install postgresql</code>) are also solid choices. If you want a GUI wizard, use the official EDB-built installer from <a href="https://postgresql.org" rel="noreferrer">postgresql.org</a>. If you already use WSL, install <code>psql</code> inside WSL Ubuntu using the Linux instructions.</p><p><strong>Q: How do I connect to a PostgreSQL server using psql? </strong><br>Use either the flag-based form, <code>psql -h [HOSTNAME] -p [PORT] -U [USERNAME] -W -d [DATABASENAME]</code> (where <code>-W</code> forces a password prompt), or the connection URI form, <code>psql "postgresql://[USERNAME]:[PASSWORD]@[HOSTNAME]:[PORT]/[DATABASENAME]?sslmode=verify-full&amp;sslrootcert=system"</code>. Both methods need your hostname, port, username, password, and database name.</p><p><strong>Q: Should I worry about psql version vs server version?</strong><br>Yes. The major version of your psql client should be greater than or equal to that of the PostgreSQL server you're connecting to. Older clients against newer servers can produce confusing output or fail on new syntax. When in doubt, install the latest psql.</p><p><strong>Q: What's the difference between <code>sslmode=require</code> and <code>sslmode=verify-full</code>?</strong><br><code>require</code> encrypts the connection but does not verify the server's certificate, leaving you exposed to man-in-the-middle attacks. 
<code>verify-full</code> encrypts the connection <em>and</em> validates that the server's certificate chains to a trusted CA <em>and</em> matches the hostname you're connecting to. Use <code>verify-full</code> for production. Tiger Cloud uses a publicly trusted CA, so <code>verify-full</code> works without extra configuration.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[OrderedAppend: An Optimization for Range Partitioning]]></title>
            <description><![CDATA[With this feature, we’ve seen up to 100x performance improvements for certain queries.]]></description>
            <link>https://www.tigerdata.com/blog/ordered-append-postgresql-optimization</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/ordered-append-postgresql-optimization</guid>
            <category><![CDATA[Product & Engineering]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Sven Klemm]]></dc:creator>
            <pubDate>Wed, 31 Jul 2019 17:00:07 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2019/07/blogorderedappend.jpg">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2019/07/blogorderedappend.jpg" alt="OrderedAppend: An Optimization for Range Partitioning" /><p>In our previous post on <a href="https://www.timescale.com/blog/implementing-constraint-exclusion-for-faster-query-performance/">implementing constraint exclusion</a>, we discussed how TimescaleDB leverages PostgreSQL’s foundation and expands on its capabilities to improve performance. Continuing with the same theme, in this post, we will discuss how we’ve added support for ordered appends, which optimize a wide range of queries, particularly those that are ordered by time.</p><p><strong>We’ve seen performance improvements up to 100x for certain queries</strong> after applying this feature, so we encourage you to keep reading!</p><h2 id="optimizing-appends-for-large-queries">Optimizing Appends for Large Queries</h2><p>PostgreSQL represents how plans should be executed using “nodes.” Various nodes may appear in an EXPLAIN output, but we want to focus specifically on Append nodes, which combine the results from multiple sources into a single result. </p><p>PostgreSQL has two standard Append nodes that commonly appear in EXPLAIN output:</p><ul><li><strong>Append:</strong> appends results of child nodes to return a unioned result</li><li><strong>MergeAppend:</strong> merges the output of child nodes by sort key; all child nodes must be sorted by that same sort key; accesses every chunk when used in TimescaleDB </li></ul><p>When MergeAppend nodes are used with TimescaleDB, we must access every chunk to determine whether it has keys that we need to merge. This is inefficient, because every chunk has to be touched even when only a few rows are needed. 
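</p><p>To reproduce plans like the ones in this post, a small test hypertable is enough. The following is a minimal sketch, not the setup used for the measurements here: only the table name <code>metrics</code> and its <code>time</code> column appear in the plans below, while the <code>device_id</code> and <code>value</code> columns, the one-day chunk interval, and the generated sample data are assumptions made for illustration.</p><pre><code class="language-SQL">-- Hypothetical schema: only "metrics" and "time" come from this post.
CREATE TABLE metrics (
    time      TIMESTAMPTZ NOT NULL,
    device_id INTEGER,
    value     DOUBLE PRECISION
);

-- Convert the table into a hypertable partitioned by time (assumed: one chunk per day).
SELECT create_hypertable('metrics', 'time', chunk_time_interval =&gt; INTERVAL '1 day');

-- Generate one row per minute over several days so that several chunks are created.
INSERT INTO metrics
SELECT t, 1, random()
FROM generate_series('2019-01-01'::timestamptz,
                     '2019-01-05 23:59'::timestamptz,
                     INTERVAL '1 minute') AS t;

-- Inspect how the ordered LIMIT query is planned and how many buffers it touches.
EXPLAIN (ANALYZE, COSTS OFF, BUFFERS, TIMING OFF, SUMMARY OFF)
SELECT * FROM metrics ORDER BY time LIMIT 1;</code></pre><p>The buffer counts you see will depend on the data, the chunk layout, and the TimescaleDB version. With a plain Merge Append plan, this query opens an index scan on every chunk, no matter how many chunks the table has. 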
</p><p>To address this issue, with the release of <a href="https://github.com/timescale/timescaledb/releases/tag/1.2.0">TimescaleDB 1.2</a>, we introduced <strong>OrderedAppend </strong>as<strong> </strong>an optimization for range partitioning. This feature optimizes a large range of queries, particularly those that are ordered by time and contain a LIMIT clause. </p><p>This optimization takes advantage of the fact that we know the range of time held in each chunk and can stop accessing chunks once we’ve found enough rows to satisfy the LIMIT clause. As mentioned above, this optimization can improve performance by up to 100x, depending on the query. </p><p>With the release of <a href="https://github.com/timescale/timescaledb/releases/tag/1.4.0">TimescaleDB 1.4</a>, we wanted to extend the cases in which OrderedAppend can be used. This meant making OrderedAppend space-partition aware and removing the LIMIT clause restriction. With these additions, more users can benefit from the performance benefits achieved through leveraging OrderedAppend.</p><h2 id="developing-query-plans-with-the-optimization">Developing Query Plans With the Optimization</h2><p>As an optimization for range partitioning, OrderedAppend eliminates sort steps because it is aware of the way data is partitioned. </p><p>Since each chunk has a known time range it covers to get sorted output, no global sort step is needed. Only local sort steps have to be completed and then appended in the correct order. If index scans are utilized, which return the output sorted, sorting can be completely avoided.</p><p><strong>For a query ordering by the time dimension with a LIMIT clause, you would normally get something like this: </strong></p><pre><code>dev=# EXPLAIN (ANALYZE,COSTS OFF,BUFFERS,TIMING OFF,SUMMARY OFF)
dev-# SELECT * FROM metrics ORDER BY time LIMIT 1;
                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------
 Limit (actual rows=1 loops=1)
   Buffers: shared hit=16
   -&gt;  Merge Append (actual rows=1 loops=1)
         Sort Key: metrics."time"
         Buffers: shared hit=16
         -&gt;  Index Scan using metrics_time_idx on metrics (actual rows=0 loops=1)
               Buffers: shared hit=1
         -&gt;  Index Scan using _hyper_1_1_chunk_metrics_time_idx on _hyper_1_1_chunk (actual rows=1 loops=1)
               Buffers: shared hit=3
         -&gt;  Index Scan using _hyper_1_2_chunk_metrics_time_idx on _hyper_1_2_chunk (actual rows=1 loops=1)
               Buffers: shared hit=3
         -&gt;  Index Scan using _hyper_1_3_chunk_metrics_time_idx on _hyper_1_3_chunk (actual rows=1 loops=1)
               Buffers: shared hit=3
         -&gt;  Index Scan using _hyper_1_4_chunk_metrics_time_idx on _hyper_1_4_chunk (actual rows=1 loops=1)
               Buffers: shared hit=3
         -&gt;  Index Scan using _hyper_1_5_chunk_metrics_time_idx on _hyper_1_5_chunk (actual rows=1 loops=1)
               Buffers: shared hit=3
</code></pre><p>You can see three pages are read from every chunk and an additional page from the parent table which contains no actual rows.</p><p><strong>With this optimization enabled, you would get a plan looking like this:</strong></p><pre><code>dev=# EXPLAIN (ANALYZE,COSTS OFF,BUFFERS,TIMING OFF,SUMMARY OFF)
dev-# SELECT * FROM metrics ORDER BY time LIMIT 1;
                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------
 Limit (actual rows=1 loops=1)
   Buffers: shared hit=3
   -&gt;  Custom Scan (ChunkAppend) on metrics (actual rows=1 loops=1)
         Order: metrics."time"
         Buffers: shared hit=3
         -&gt;  Index Scan using _hyper_1_1_chunk_metrics_time_idx on _hyper_1_1_chunk (actual rows=1 loops=1)
               Buffers: shared hit=3
         -&gt;  Index Scan using _hyper_1_2_chunk_metrics_time_idx on _hyper_1_2_chunk (never executed)
         -&gt;  Index Scan using _hyper_1_3_chunk_metrics_time_idx on _hyper_1_3_chunk (never executed)
         -&gt;  Index Scan using _hyper_1_4_chunk_metrics_time_idx on _hyper_1_4_chunk (never executed)
         -&gt;  Index Scan using _hyper_1_5_chunk_metrics_time_idx on _hyper_1_5_chunk (never executed)</code></pre><p>The first chunk already satisfies the LIMIT, so the remaining chunks are never executed, and only three pages have to be read to complete the query. TimescaleDB removes parent tables from plans like this because we know the parent table does not contain any data.</p><h2 id="mergeappend-vs-chunkappend">MergeAppend vs. ChunkAppend</h2><p>The main difference between these two examples is the type of Append node we used. In the first case, a MergeAppend node is used. In the second case, we used a ChunkAppend node (also introduced in 1.4), which is a TimescaleDB custom node that works similarly to the PostgreSQL Append node but contains additional optimizations. </p><p>The MergeAppend node implements the global sort and requires locally sorted input, all sorted by the same sort key. To produce one tuple, the MergeAppend node has to read one tuple from every chunk to decide which one to return.</p><p>For the very simple example query above, you will see 16 pages read (with MergeAppend) vs. three pages (with ChunkAppend), which is a 5x improvement over the unoptimized case if we ignore the single page from the parent table. That factor matches the number of chunks present in that <a href="https://www.tigerdata.com/blog/database-indexes-in-postgresql-and-timescale-cloud-your-questions-answered" rel="noreferrer">hypertable</a>. So for a hypertable with 100 chunks, there would be 100 times fewer pages to read to produce the result for the query.</p><p>As you can see, you gain the most benefit from OrderedAppend with a LIMIT clause, as older chunks don’t have to be touched if the required results can be satisfied from more recent chunks. This type of query is very common in time-series workloads (e.g., if you want to get the last reading from a sensor). 
However, even for queries without a LIMIT clause, this feature is beneficial because it eliminates the sorting of data.</p><h2 id="next-steps">Next Steps</h2><p>If you are interested in using OrderedAppend, make sure you have the latest version of TimescaleDB installed (<a href="https://docs.timescale.com/self-hosted/latest/install/" rel="noreferrer">installation guide</a>). You can also <a href="https://console.cloud.timescale.com/signup" rel="noreferrer">create a free Timescale account</a> (30-day trial, no credit card required) and never worry about upgrading again (we'll do it for you).</p><p>If you are brand new to TimescaleDB, <a href="https://docs.timescale.com/getting-started" rel="noreferrer">get started here</a>. Have questions? Join our <a href="http://slack.timescale.com">Slack</a> channel!</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Mind the Gap: Using SQL Functions for Time-Series Analysis]]></title>
            <description><![CDATA[Write more efficient and readable SQL queries with a new set of time-series analytic tools.]]></description>
            <link>https://www.tigerdata.com/blog/sql-functions-for-time-series-analysis</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/sql-functions-for-time-series-analysis</guid>
            <category><![CDATA[Product & Engineering]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Sven Klemm]]></dc:creator>
            <pubDate>Thu, 24 Jan 2019 20:01:11 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2019/01/20190123_TimeBucketGapFill.jpg">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2019/01/20190123_TimeBucketGapFill.jpg" alt="Mind the Gap: Using SQL Functions for Time-Series Analysis" /><p>SQL functions are reusable routines written in SQL or supported procedural languages that perform operations on input values and return a result, often used to encapsulate logic and simplify complex queries.</p><p>With the release of <a href="https://github.com/timescale/timescaledb" rel="noreferrer">TimescaleDB 1.2</a> came three new SQL functions for time-series analysis: <code>time_bucket_gapfill</code>, <code>interpolate</code>, and <code>locf</code>. Used together, these SQL functions will enable you to write more efficient and readable SQL queries for <a href="https://www.timescale.com/blog/time-series-analysis-what-is-it-how-to-use-it" rel="noreferrer">time-series analysis</a>. </p><p>The efficiency gains were so evident that we have since developed a complete set of <a href="https://www.timescale.com/learn/time-series-data-analysis-hyperfunctions" rel="noreferrer">hyperfunctions</a> for faster time-series analysis with fewer lines of code. 
You can find them in the <a href="https://docs.timescale.com/self-hosted/latest/tooling/install-toolkit/" rel="noreferrer">Timescale Toolkit</a>.</p><p>In this post, we'll discuss why you'd want to use time buckets, the related gapfilling techniques, and how they’re implemented under the hood.<em> </em>Ultimately, it's the story of how we extended SQL and the PostgreSQL query planner to create a set of highly optimized functions for time-series analysis.</p><h2 id="sql-functions-for-time-series-analysis-introduction-to-time-bucketing">SQL Functions for Time-Series Analysis: Introduction to Time Bucketing</h2><p>Many <a href="https://www.timescale.com/blog/time-series-analysis-what-is-it-how-to-use-it" rel="noreferrer">common techniques for time-series analysis</a> assume that our temporal observations are aggregated to fixed time intervals. Dashboards and most visualizations of time series rely on this technique to make sense of our raw data, turning the noise into a smoother trend line that is more easily interpretable and analytically tractable.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2019/01/timebucket-1.gif" class="kg-image" alt="" loading="lazy" width="556" height="347"></figure><p>When writing queries for this type of reporting, you need an efficient way to aggregate raw observations (often noisy and irregular) to fixed time intervals. Examples of such queries might be average temperature per hour or the average CPU utilization per five seconds.</p><p>The solution is <strong>time bucketing</strong>. The <code>time_bucket</code> function has been a core feature of TimescaleDB since the <a href="https://timescale.ghost.io/blog/when-boring-is-awesome-building-a-scalable-time-series-database-on-postgresql-2900ea453ee2">first public beta release</a>. 
With time bucketing, we can get a clear picture of the important data trends using a concise, declarative SQL query.</p><pre><code class="language-SQL">SELECT
  time_bucket('1 minute', time) as one_minute_bucket,
  avg(value) as avg_value
FROM observations
GROUP BY one_minute_bucket
ORDER BY one_minute_bucket;</code></pre><h2 id="challenges-with-time-bucketing-for-time-series">Challenges With Time Bucketing for Time Series</h2><p>The reality of time-series data engineering is not always so easy.</p><p>Consider measurements recorded at <strong>irregular sampling intervals</strong>, either intentionally, as with measurements recorded in response to external events (e.g., a motion sensor), or inadvertently, due to network problems, out-of-sync clocks, or equipment taken offline for maintenance. </p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2019/01/none.jpg" class="kg-image" alt="" loading="lazy" width="580" height="253"><figcaption><span style="white-space: pre-wrap;">Time bucket: none</span></figcaption></figure><p>We should also consider analyzing multiple measurements recorded at <strong>mismatched sampling intervals</strong>. For instance, you might collect some of your data every second and some every minute, but still need to analyze both metrics at 15-second intervals.</p><p>The <code>time_bucket</code> function will only aggregate your data to a given time bucket if there is data in it. With either mismatched or irregular sampling, a time bucket interval might come back with missing data (i.e., gaps). 
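</p><p>A quick way to see this behavior is to bucket a time range that contains a quiet period. The query below is a minimal sketch that assumes the <code>temperature</code> table with <code>time</code> and <code>value</code> columns used in the examples later in this post; the 20-minute bucket width and the time range are chosen only for illustration.</p><pre><code class="language-SQL">-- time_bucket returns a row only for buckets that contain at least one observation.
SELECT
  time_bucket('20 minutes', time) AS bucket,
  avg(value) AS avg_value,
  count(*) AS observations
FROM temperature
WHERE time &gt;= '2019-01-21 09:00'
  AND time &lt; '2019-01-21 17:00'
GROUP BY bucket
ORDER BY bucket;</code></pre><p>If no observations were recorded during some 20-minute window, that bucket is simply absent from the result: there is no row with a NULL average, just a jump between consecutive bucket timestamps. 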
</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2019/01/20mins.jpg" class="kg-image" alt="" loading="lazy" width="579" height="220"><figcaption><span style="white-space: pre-wrap;">Time bucket: 20 minutes</span></figcaption></figure><p>If your analysis requires data aggregated to contiguous time intervals, time bucketing with <strong>gapfilling</strong> solves this problem.</p><h2 id="sql-functions-time-bucketing-with-gapfilling">SQL Functions: Time Bucketing With Gapfilling</h2><p>TimescaleDB community users have access to a set of SQL functions:</p><ul><li><code>time_bucket_gapfill</code> for creating contiguous, ordered time buckets</li><li><code>interpolate</code> to perform linear interpolation between the previous and next value</li><li><code>locf</code> or <em>last observation carried forward </em>to fill in gaps with the previous known value </li></ul><h3 id="gapfilling">Gapfilling</h3><p>The new <code>time_bucket_gapfill</code> function is similar to <code>time_bucket</code> except that it guarantees a contiguous, ordered set of time buckets.</p><p>The function requires that you provide <code>start</code> and <code>finish</code> arguments to specify the time range for which you need contiguous buckets. The result set will contain additional rows in place of any gaps, ensuring that the returned rows are in chronological order and contiguous.</p><p>Let’s look at the SQL:</p><pre><code class="language-SQL">SELECT
    time_bucket_gapfill(
        '1 hour', time,
        start =&gt; '2019-01-21 9:00', 
        finish =&gt; '2019-01-21 17:00') AS hour,
    avg(value) AS avg_val
FROM temperature
GROUP BY hour;

          hour          |         avg_val
------------------------+-------------------------
 2019-01-21 09:00:00+00 |     26.5867799823790905
 2019-01-21 10:00:00+00 |    23.25141648529633607
 2019-01-21 11:00:00+00 |     21.9964633100885991
 2019-01-21 12:00:00+00 |    23.08512263446292656
 2019-01-21 13:00:00+00 |
 2019-01-21 14:00:00+00 |     27.9968220672055895
 2019-01-21 15:00:00+00 |     26.4914455532679670
 2019-01-21 16:00:00+00 |   24.07531628738616732</code></pre><p>Note that one of the hours is missing data entirely, and the average value is represented as <code>NULL</code>. Gapfilling gives us a contiguous set of time buckets but no data for those rows. That's where the <code>locf</code> and <code>interpolate</code> functions come into play.</p><h3 id="locf-or-last-observation-carried-forward">LOCF or last observation carried forward</h3><p>The “last observation carried forward” technique can be used to impute missing values by assuming the previous known value. </p><pre><code class="language-SQL">SELECT
    time_bucket_gapfill(
        '1 hour', time,
        start =&gt; '2019-01-21 9:00', 
        finish =&gt; '2019-01-21 17:00') AS hour,
  -- instead of avg(value)
  locf(avg(value))
FROM temperature
GROUP BY hour
ORDER BY hour</code></pre><p>Shown here: </p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2019/01/LOCF_-20-minutes.jpg" class="kg-image" alt="" loading="lazy" width="588" height="210"><figcaption><span style="white-space: pre-wrap;">LOCF at 20 minutes</span></figcaption></figure><h3 id="linear-interpolation">Linear interpolation</h3><p>Linear interpolation imputes missing values by assuming a line between the previous known value and the next known value.</p><pre><code class="language-SQL">SELECT
    time_bucket_gapfill(
        '1 hour', time,
        start =&gt; '2019-01-21 9:00', 
        finish =&gt; '2019-01-21 17:00') AS hour,
  -- instead of avg(value)
  interpolate(avg(value))
FROM temperature
GROUP BY hour
ORDER BY hour</code></pre><p>Shown here: </p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2019/01/inter20mins-1.jpg" class="kg-image" alt="" loading="lazy" width="550" height="199"><figcaption><span style="white-space: pre-wrap;">Interpolate at 20 minutes</span></figcaption></figure><p>These techniques are not exclusive; you can combine them as needed in a single time-bucketed query:</p><pre><code class="language-SQL">locf(avg(temperature)), interpolate(max(humidity)), avg(other_val)</code></pre><h2 id="best-practices-for-time-series-analysis-with-sql-functions">Best Practices for Time-Series Analysis With SQL Functions</h2><p>Whether you choose LOCF, interpolation, or gapfilling with NULLs depends on your assumptions about the time-series data and your analytical approach.</p><ul><li>Use <code>locf</code> if you assume your measurement changes only when you've received new data.</li><li>Use <code>interpolate</code> if you assume your continuous measurement would have a smooth, roughly linear trend if sampled at a higher rate.</li><li>Use standard aggregate functions (without <code>locf</code> or <code>interpolate</code>) if your data is not continuous on the time axis. Where there is no data, the result is assumed NULL.</li><li>If you want to assume scalar values (typically zero) in place of NULLs, you can use PostgreSQL’s coalesce function: <code>COALESCE(avg(value), 0)</code></li></ul><p>If you choose to add an explicit <code>ORDER BY</code> to your query, keep in mind that the gapfilling will sort by time in ascending order. 
Any other explicit ordering may introduce additional sorting steps in the query plan.</p><h2 id="extending-sql-for-time-series-analysis">Extending SQL for Time-Series Analysis</h2><p>The new <code>time_bucket_gapfill</code> SQL query is significantly more readable, less error-prone, more flexible regarding grouping, and faster to execute.</p><p>How does TimescaleDB achieve this? Under the hood, these are not ordinary functions but specially optimized hooks into the database query planner itself. </p><p>The <code>time_bucket_gapfill</code> function inserts a <a href="https://www.postgresql.org/docs/11/custom-scan.html">custom scan</a> node and sort node (if needed) into the query plan. This creates ordered, contiguous time buckets even if some of the buckets are missing observations. The <code>locf</code> and <code>interpolate</code> functions are not executed directly but serve as markers so that the gapfilling node can track the previous and next known values. </p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2019/01/customscan.png" class="kg-image" alt="" loading="lazy" width="1070" height="1128" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2019/01/customscan.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2019/01/customscan.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2019/01/customscan.png 1070w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Query plan visualization resulting from time_bucket_gapfill; courtesy of https://tatiyants.com/pev</span></figcaption></figure><p>The result: a semantically cleaner language for expressing time-series analysis, easier to debug, more performant, and saving the application developer from having to implement any of these 
tricks on the application side. This is another example of how Timescale is extending PostgreSQL for high-performance, general-purpose time-series data management.</p><h2 id="supercharge-your-time-series-analysis">Supercharge Your Time-Series Analysis</h2><p>Time buckets with gapfilling and the related imputation functions are available as community features under the TSL license. (For more information on the license, read this <a href="https://timescale.ghost.io/blog/how-we-are-building-an-open-source-business-a7701516a480">blog post</a>.) </p><p>If you’re interested in learning more about <a href="https://docs.timescale.com/use-timescale/latest/hyperfunctions/gapfilling-interpolation/" rel="noreferrer">gapfilling, check out our docs</a>. If you are new to TimescaleDB and ready to get started, follow the <a href="https://docs.timescale.com/self-hosted/latest/install/" rel="noreferrer">installation instructions</a>. </p><p>We encourage active TimescaleDB users to join our <a href="https://slack.timescale.com/">Slack community</a> and post any questions you may have there. Finally, if you are looking for a modern cloud-native PostgreSQL platform, <a href="https://www.timescale.com/cloud" rel="noreferrer">check out Timescale Cloud</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[TimescaleDB vs. PostgreSQL for Time Series: 20x Higher Inserts, 2,000x Faster Deletes, 1.2x-14,000x Faster Queries]]></title>
            <description><![CDATA[This is the first in a series of performance benchmarks comparing TimescaleDB to other databases for storing and analyzing time-series data.]]></description>
            <link>https://www.tigerdata.com/blog/timescaledb-vs-6a696248104e</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/timescaledb-vs-6a696248104e</guid>
            <category><![CDATA[Product & Engineering]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Benchmarks & Comparisons]]></category>
            <dc:creator><![CDATA[Rob Kiefer]]></dc:creator>
            <pubDate>Thu, 10 Aug 2017 15:00:00 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2018/12/tigerelephant.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2018/12/tigerelephant.png" alt="TimescaleDB vs. PostgreSQL for Time Series: 20x Higher Inserts, 2,000x Faster Deletes, 1.2x-14,000x Faster Queries" /><p><em>This is the first in a series of performance benchmarks comparing TimescaleDB to other databases for storing and analyzing time-series data.</em></p><p><a href="https://github.com/timescale/timescaledb" rel="noopener">TimescaleDB</a> is a new, open-source time-series database architected for fast ingest, complex queries, and ease of use. It looks like PostgreSQL to the outside world (in fact, it’s packaged as an extension), which means it inherits the rock-solid reliability, tooling, and vast ecosystem of PostgreSQL.</p><p>But for time-series data, how does it compare to PostgreSQL itself? Put another way, how does one improve on an existing database stalwart with over 20 years of development?</p><p>By building something that is <strong>more scalable</strong>, <strong>faster, </strong>and <strong>easier to use, and opening it up to a variety of new use cases.</strong></p><p>More specifically, compared to PostgreSQL, TimescaleDB exhibits:</p><ul><li><strong>20x higher inserts at scale</strong> (constant even at billions of rows)</li><li><strong>Faster queries</strong>, ranging from 1.2x to over 14,000x improvements for time-based queries</li><li><strong>2000x faster deletes</strong>, critical for implementing data retention policies</li><li><strong>New time-centric functions, </strong>making time-series manipulation in SQL even easier</li></ul><p>In short, TimescaleDB will outperform PostgreSQL for your needs if one or more of the following is true:</p><ul><li><strong>You are insert bound</strong> (typically seen at the 10s of millions of rows, depending on allocated memory)</li><li><strong>Your queries are largely time-based </strong>and involve more than a single key-value 
lookup</li><li><strong>Deleting data is a big pain for you</strong></li><li><strong>You are frustrated by the limitations of SQL for time-series analysis </strong>(examples below like: time bucketing, bi-temporality, etc.)</li></ul><p>Below is a detailed set of benchmarks that compare TimescaleDB versus PostgreSQL 9.6 across inserts, queries, deletes, and ease-of-use. (If you would like to run the benchmarks yourself, <a href="https://github.com/timescale/benchmark-postgres" rel="noopener">here is the GitHub repo</a>.)</p><p>Let’s begin.</p><p><em>(Update: at the time of this benchmark, PostgreSQL 10 was still in beta. We’ve since </em><a href="https://timescale.ghost.io/blog/time-series-data-postgresql-10-vs-timescaledb-816ee808bac5/"><em>compared Timescale to PostgreSQL 10 for working with time-series data</em></a><em>.)</em></p><h2 id="the-setup">The Setup</h2><p>Here’s the setup we used for all of these TimescaleDB vs PostgreSQL tests:</p><ul><li>Azure DS4 Standard (8 cores, 28GB memory) with network attached SSD</li><li>Both databases were given all available memory and used 8 clients concurrently</li><li>4,000 simulated devices generated 10 CPU metrics every 10 seconds for 3 full days</li><li>Inserts resulted in 1 <a href="http://docs.timescale.com/introduction/architecture" rel="noopener">(hyper)table</a> with 100M rows of data, with a second set of tests at 1B rows</li><li>For TimescaleDB, we set the <a href="http://docs.timescale.com/api/latest/">chunk size</a> to 12 hours, resulting in 6 total <a href="http://docs.timescale.com/api/latest">chunks</a> for our 100 million row dataset and 60 total chunks for our 1 billion row dataset</li></ul><h2 id="inserts-20x-faster-inserts-at-scale-constant-even-at-billions-of-rows">Inserts: 20x faster inserts at scale, constant even at billions of rows</h2><p>Many engineers start using PostgreSQL to store their time-series data because of its query power and ease of use, but are eventually forced to migrate to some 
NoSQL system when they hit a certain scale. Some skip over PostgreSQL for time-series workloads completely because of these scaling problems.</p><p>But with TimescaleDB, users can scale to billions of rows on PostgreSQL, while maintaining high, constant insert rates. This also enables users to store their relational metadata and time-series together in the same database, query them together using time-series-optimized SQL, and continue to use their favorite tools and add-ons.</p><p>Consider this graph below, which shows a dataset scaled out to 1 billion rows (on a single machine) emulating a common monitoring scenario, with database clients inserting moderately-sized batches of data containing time, a device’s tag set, and multiple numeric metrics (in this case, 10).</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2018/12/image-92.png" class="kg-image" alt="" loading="lazy" width="800" height="579" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2018/12/image-92.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2018/12/image-92.png 800w" sizes="(min-width: 720px) 720px"><figcaption><b><strong style="white-space: pre-wrap;">At scale, TimescaleDB exhibits 20x higher insert rates than PostgreSQL, which are relatively constant even as the dataset grows.</strong></b></figcaption></figure><p>PostgreSQL and Timescale start off with relatively even insert performance (~115K rows per second). As data volumes approach 100 million rows, PostgreSQL’s insert rate begins to rapidly decline. 
At 200 million rows, the insert rate in PostgreSQL averages 30K rows per second and only gets worse; at 1 billion rows, it averages 5K rows per second.</p><p>On the other hand, TimescaleDB sustains an average insert rate of 111K rows per second through 1 billion rows of data: <strong>a 20x improvement</strong>. We would have benchmarked further, but we’d be waiting on the Elephant for quite a while: PostgreSQL took almost 40 hours to insert 1 billion rows of data, while TimescaleDB took less than 3 hours.</p><p><strong>Summary</strong><br>Whenever a new row of data is inserted into PostgreSQL, the database needs to update the indexes (e.g., B-trees) for each of the table’s indexed columns. Once the indexes are too large to fit in memory (which we find typically happens when your table is in the 10s of millions of rows, depending on your allocated memory), this requires swapping one or more pages in from disk. It is this disk swapping that creates the insert bottleneck. Throwing more memory at the problem only delays the inevitable.</p><p>TimescaleDB solves this through its heavy utilization and automation of <a href="https://timescale.ghost.io/blog/time-series-data-why-and-how-to-use-a-relational-database-instead-of-nosql-d0cd6975e87c/">time-space partitioning</a>, even when running on a single machine. Essentially, all writes to recent time intervals go only to tables that remain in memory. 
This results in a consistent 20x insert performance improvement over PostgreSQL when inserting data at scale.</p><h2 id="queries-12x-to-14000x-faster-time-based-queries">Queries: 1.2x to 14,000x faster time-based queries</h2><p>Next, we looked at query performance.</p><p>In general, migrating to TimescaleDB from PostgreSQL impacts queries in one of three ways:</p><ul><li>Most simple queries (e.g., indexed lookups) that typically take &lt;20ms will be a few milliseconds slower on TimescaleDB, owing to the slightly larger planning time overhead.</li><li>More complex queries that use time-based filtering or aggregations will be anywhere from 1.2x to 5x faster on TimescaleDB.</li><li>Finally, queries where we can leverage time-ordering will be significantly faster, anywhere from 450x to more than 14,000x faster in our tests.</li></ul><p><em>Note: All performance numbers shown below are from 1000 “warm” runs (to exclude the effects of disk caching, etc.) of each query type.</em></p><h3 id="simple-queries">Simple queries</h3><p>On single-disk machines, many simple queries that just perform indexed lookups or table scans will be 1–3 milliseconds slower on TimescaleDB. For example, in our 100M-row table with indexed time, hostname, and CPU usage information, the following query (#1) will take less than 3ms on each database, but an extra 1ms on TimescaleDB:</p><p><strong>Query 1: A simple query</strong></p><pre><code>SELECT date_trunc('minute', time) AS minute,
  MAX(usage_user)
FROM cpu
WHERE hostname = 'host_731'
  AND time &gt;= '2016-01-01 02:17:08.646325 -7:00'
  AND time &lt; '2016-01-01 03:17:08.646325 -7:00'
GROUP BY minute
ORDER BY minute ASC;</code></pre><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2018/12/image-93.png" class="kg-image" alt="" loading="lazy" width="642" height="90" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2018/12/image-93.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2018/12/image-93.png 642w"></figure><p>Similarly, even if we widen the time range that the query covers (query #2), TimescaleDB lags by a few milliseconds:</p><p><strong>Query 2: A simple query over a larger period of time</strong></p><pre><code>SELECT
  date_trunc('minute', time) AS minute,
  MAX(usage_user)
FROM cpu
WHERE hostname = 'host_731'
  AND time &gt;= '2016-01-01 07:47:52'
  AND time &lt; '2016-01-01 19:47:52'
GROUP BY minute
ORDER BY minute ASC;</code></pre><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2018/12/image-94.png" class="kg-image" alt="" loading="lazy" width="643" height="86" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2018/12/image-94.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2018/12/image-94.png 643w"></figure><p><strong>Summary</strong><br>TimescaleDB incurs a penalty of a few milliseconds on planning time, due to its use of multiple tables (chunks) instead of one big table. While this is something we plan on improving in the future, the vast majority of our users so far have been content to give up the extra few milliseconds in exchange for the benefits on other, more complex time-based queries (below).</p><h3 id="time-based-queries">Time-based queries</h3><p>Time-based queries often run 1.2x-5x faster in TimescaleDB. One example is the query below, which returns the rows where user CPU usage exceeded 90 during a specific time period:</p><p><strong>Query 3: Time-based filter</strong></p><pre><code>SELECT * FROM cpu
WHERE usage_user &gt; 90.0
  AND time &gt;= '2016-01-01 00:00:00'
  AND time &lt; '2016-01-02 00:00:00';</code></pre><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2018/12/image-95.png" class="kg-image" alt="" loading="lazy" width="800" height="546" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2018/12/image-95.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2018/12/image-95.png 800w" sizes="(min-width: 720px) 720px"><figcaption><b><strong style="white-space: pre-wrap;">Query is nearly 20% faster on TimescaleDB (lower is better).</strong></b></figcaption></figure><p>Further, larger queries involving time-based aggregations (GROUP BYs), which are quite common in time-oriented analysis, will be even faster in TimescaleDB.</p><p>The following is a query (#4) that touches 33M rows and is <strong>5x faster</strong> in TimescaleDB when the entire (hyper)table is 100M rows, and around 2x faster at 1B rows:</p><p><strong>Query 4: Time-based aggregation</strong></p><pre><code>SELECT date_trunc('hour', TIME) AS hour,
  hostname,
  avg(usage_user) AS mean_usage_user 
FROM cpu
WHERE TIME &gt;= '2016-01-01 00:00:00'
  AND TIME &lt; '2016-01-02 00:00:00'
GROUP BY hour, hostname
ORDER BY hour;</code></pre><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2018/12/image-96.png" class="kg-image" alt="" loading="lazy" width="800" height="578" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2018/12/image-96.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2018/12/image-96.png 800w" sizes="(min-width: 720px) 720px"></figure><p><strong>Summary</strong></p><p>Thanks to TimescaleDB’s time-space partitioning, our dataset is naturally partitioned into chunks, each of which has its own indexes. This has several benefits when processing a query with a time predicate or a time-based aggregate. Using constraint exclusion on the various chunks, TimescaleDB can scan less data, allowing the database to choose more suitable plans: e.g., perform calculations fully in memory vs. spilling to disk, utilize smaller indexes (so it can walk across fewer index values), use better algorithms like HashAggregates vs. GroupAggregates, etc.</p><h3 id="time-ordering-based-queries">Time-ordering based queries</h3><p>Finally, queries that can reason specifically about time ordering can be <strong>much</strong> more performant in TimescaleDB, from 450x to more than 14,000x faster in our tests.</p><p>For the following query (#5), TimescaleDB introduces a time-based “merge append” optimization to minimize the number of groups that must be processed (given its knowledge that time is already ordered).</p><p><strong>Query 5: ORDER BY with LIMIT, using merge append</strong></p><pre><code>SELECT date_trunc('minute', time) 
  AS minute, max(usage_user) 
FROM cpu 
WHERE time &lt; '2016-01-01 19:47:52' 
GROUP BY minute 
ORDER BY minute DESC 
LIMIT 5;</code></pre><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2018/12/image-97.png" class="kg-image" alt="" loading="lazy" width="800" height="556" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2018/12/image-97.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2018/12/image-97.png 800w" sizes="(min-width: 720px) 720px"><figcaption><b><strong style="white-space: pre-wrap;">Query is over 450x faster on TimescaleDB (lower is better).</strong></b></figcaption></figure><p>For our 100M-row table, this results in a query that is <strong>over 450x faster</strong> than on PostgreSQL. And in another run of the same query on 1 billion rows of data, we saw an improvement of <strong>over 14,000x</strong>! (This surprised even me, and I ran the benchmarks.)</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2018/12/image-98.png" class="kg-image" alt="" loading="lazy" width="800" height="557" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2018/12/image-98.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2018/12/image-98.png 800w" sizes="(min-width: 720px) 720px"><figcaption><b><strong style="white-space: pre-wrap;">Query is over 14,000x faster on TimescaleDB (lower is better).</strong></b></figcaption></figure><p><strong>Summary</strong></p><p>These queries benefit from optimizations introduced in TimescaleDB that help to speed up query response times where time ordering is known. For example, TimescaleDB optimizes PostgreSQL’s native merge append to bring these significant efficiencies to ordered time-based aggregates. 
Such aggregates, which involve GROUP BYs, ORDER BYs, and LIMITs, appear quite regularly in time-series analysis (e.g., query #5). As such, even without a strict time range specified by the user, the database will only process the minimal set of chunks and data needed to answer this query.</p><p>More optimizations for these types of complex, time-oriented queries are in the works, and we expect to see similar improvements in the future.</p><h3 id="data-retention-2000x-faster-deletes">Data retention: 2000x faster deletes</h3><p>Time-series data often builds up very quickly, necessitating data retention policies, such as “only store raw data for one week.” TimescaleDB provides data management tools that make implementing data retention policies easy and significantly more performant than in PostgreSQL.</p><p>In fact, it is common to couple data retention policies with the use of aggregations, so one might keep two hypertables: one with raw data, the other with data rolled up into per-minute or hourly (etc.) aggregates. Then, one could define different retention policies on the two hypertables, e.g., storing the aggregated data for longer.</p><p>TimescaleDB allows efficient deletion of old data at the chunk level, rather than at the row level, via its drop_chunks() functionality.</p><pre><code>SELECT drop_chunks(interval '7 days', 'machine_readings');</code></pre><p>In the data retention benchmark below:</p><ul><li>Chunks are sized to 12 hours</li><li>100 million rows contain 6 total chunks</li><li>1 billion rows contain 60 total chunks</li><li>5 chunks were dropped from TimescaleDB and five 12-hour intervals of data were deleted from PostgreSQL.</li><li>The data retention mechanisms used for each are TimescaleDB’s drop_chunks() vs. 
the DELETE command in PostgreSQL.</li><li>In this scenario, we used <code>SELECT drop_chunks('2016-01-01T12:00:00Z'::TIMESTAMPTZ, 'cpu_ts');</code> and advanced the timestamp by 12 hours for each run.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2018/12/image-99.png" class="kg-image" alt="" loading="lazy" width="800" height="511" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2018/12/image-99.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2018/12/image-99.png 800w" sizes="(min-width: 720px) 720px"><figcaption><b><strong style="white-space: pre-wrap;">Deletes are over 2000x faster in TimescaleDB than in PostgreSQL (lower is better).</strong></b></figcaption></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2018/12/image-100.png" class="kg-image" alt="" loading="lazy" width="721" height="354" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2018/12/image-100.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2018/12/image-100.png 721w" sizes="(min-width: 720px) 720px"><figcaption><b><strong style="white-space: pre-wrap;">5 chunks were dropped from TimescaleDB and five 12-hour intervals of data were deleted from PostgreSQL.</strong></b></figcaption></figure><p><strong>Summary </strong></p><p>TimescaleDB’s drop_chunks deletes all chunks from the hypertable that only include data older than the specified duration. Because chunks are individual tables, the delete amounts to simply deleting a file from the file system, and is thus very fast, completing in 10s of milliseconds. 
Meanwhile, PostgreSQL takes on the order of minutes to complete, since it must delete individual rows of data within a table.</p><p>TimescaleDB’s approach also avoids fragmentation in the underlying database files, which in turn avoids the need for vacuuming, which can be especially expensive (i.e., time-consuming) in very large tables.</p><p>Ultimately, that is why TimescaleDB is over <strong>2000x faster</strong> than PostgreSQL when it comes to deleting old data for <a href="http://docs.timescale.com/api/data-retention" rel="noopener">data retention</a>.</p><h2 id="ease-of-use-new-functions-for-time-series-manipulation">Ease of use: New functions for time-series manipulation</h2><p>Lastly, TimescaleDB includes a number of time-oriented features that aren't found in traditional RDBMSs (e.g., PostgreSQL). These include not only passive query optimizations (e.g., the time-based merge append in query #5 above) but also special time-oriented functions that one might desire when working with time-series data yet find completely lacking in SQL.</p><p><strong>In other words, TimescaleDB optimizes SQL for easier, more effective time-series manipulation.</strong></p><p>Two of our most used functions are:</p><p><strong>(1)</strong> time_bucket, a more powerful version of the standard date_trunc function that allows for arbitrary time intervals (e.g., 5 minutes, 6 hours, etc.), as well as flexible groupings and offsets, instead of just second, minute, hour, etc. 
and</p><p><strong>(2)</strong> last​ / ​first​ aggregates (our bookend functions), which allow you to get the value of one column as ordered by another.</p><p><em>For further reading, see </em><a href="http://docs.timescale.com/introduction/timescaledb-vs-postgres" rel="noopener"><em>Time-oriented Analytics</em></a><em>.</em></p><p>To illustrate the gains in simplicity and ease of use that these functions realize for SQL users, let’s compare various workarounds for time_bucket with an actual example of time_bucket.</p><p>Stack Overflow is often a good gauge of a developer’s wants in a particular technology, so instead of suggesting one workaround over another, let’s just look there:</p><ul><li><a href="https://stackoverflow.com/questions/12045600/postgresql-sql-group-by-time-interval-with-arbitrary-accuracy-down-to-milli-sec" rel="noopener">Postgresql SQL GROUP BY time interval with arbitrary accuracy (down to milli seconds)</a></li><li><a href="https://stackoverflow.com/questions/34498590/sql-to-group-time-intervals-by-arbitrary-time-period?rq=1" rel="noopener">SQL to group time intervals by arbitrary time period</a></li><li><a href="https://stackoverflow.com/questions/7299342/what-is-the-fastest-way-to-truncate-timestamps-to-5-minutes-in-postgres" rel="noopener">What is the fastest way to truncate timestamps to 5 minutes in Postgres?</a></li></ul><p>Here is one of those <a href="https://stackoverflow.com/questions/12045600/postgresql-sql-group-by-time-interval-with-arbitrary-accuracy-down-to-milli-sec">Stack Overflow</a> workarounds for time-bucketing:</p><pre><code>-- You can generate a table of "buckets" by adding intervals created by generate_series(). 
-- This SQL statement will generate a table of five-minute buckets for the first day (the value of min(measured_at)) in your data.

select 
  (select min(measured_at)::date from measurements) + ( n    || ' minutes')::interval start_time,
  (select min(measured_at)::date from measurements) + ((n+5) || ' minutes')::interval end_time
from generate_series(0, (24*60), 5) n;

-- Wrap that statement in a common table expression, and you can join and group on it as if it were a base table.

with five_min_intervals as (
  select 
    (select min(measured_at)::date from measurements) + ( n    || ' minutes')::interval start_time,
    (select min(measured_at)::date from measurements) + ((n+5) || ' minutes')::interval end_time
  from generate_series(0, (24*60), 5) n
)
select f.start_time, f.end_time, avg(m.val) avg_val 
from measurements m
right join five_min_intervals f 
        on m.measured_at &gt;= f.start_time and m.measured_at &lt; f.end_time
group by f.start_time, f.end_time
order by f.start_time</code></pre><p>Now, compare that to TimescaleDB’s time_bucket function:</p><pre><code>SELECT time_bucket('5 minutes', time) 
       five_min, avg(cpu)
FROM metrics
GROUP BY five_min 
ORDER BY five_min DESC 
LIMIT 10;</code></pre><p>The simplicity is apparent and in practice it reduces an enormous amount of mental friction (and saves development time!).</p><p>For the same comparison against last / first functions, see these two links:</p><ul><li><a href="https://stackoverflow.com/questions/1485391/how-to-get-first-and-last-record-from-a-sql-query" rel="noopener">https://stackoverflow.com/questions/1485391/how-to-get-first-and-last-record-from-a-sql-query</a></li><li><a href="http://docs.timescale.com/api/api-timescaledb#first-last" rel="noopener">http://docs.timescale.com/api/api-timescaledb#first-last</a></li></ul><p>In short, SQL is a powerful language that is widely used for good reason, yet in many ways it is sub-optimal in its current form for time-series data manipulation. TimescaleDB is introducing new time-oriented SQL functions so that any user of SQL can work with time-series data without having to abandon a reliable database and mature ecosystem they know and love for an obtuse query language, painful data management or data integrity issues.</p><h2 id="conclusion">Conclusion</h2><p>PostgreSQL is a venerable and powerful database, yet it also lacks both functionality and performance when it comes to working with time-series data. Even so, many people use PostgreSQL to store time-series data or they use it alongside their NoSQL time-series database to store metadata.</p><p>Our goal at TimescaleDB is to build the best database for time-series-heavy applications, without complicating the stack. These applications are emerging in more and more places (e.g., IoT, logistics, finance, events), yet are insufficiently handled by existing solutions.</p><p>Up against PostgreSQL, TimescaleDB achieves <strong>20x faster inserts at scale, 1.2x-14,000x faster time-based queries, 2000x faster deletes, and offers streamlined time-series functionality</strong>. 
So, if you are storing time-series data in PostgreSQL, there is little reason <strong>not</strong> to install TimescaleDB. <a href="http://docs.timescale.com/getting-started/installation?OS=mac&amp;method=Homebrew" rel="noopener">And it’s quite easy to install</a>, even right in your existing PostgreSQL instance.</p><p>If you like what you see and need help with anything, or just have a question or comment, please <a href="mailto:support@timescale.com">email us</a> or <a href="https://slack-login.timescale.com/" rel="noopener">join our Slack group</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Time-series data: Why (and how) to use a relational database instead of NoSQL]]></title>
            <description><![CDATA[These days, time-series data applications (e.g., data center / server / microservice / container monitoring, sensor / IoT analytics, financial data analysis, etc.) are proliferating.]]></description>
            <link>https://www.tigerdata.com/blog/time-series-data-why-and-how-to-use-a-relational-database-instead-of-nosql-d0cd6975e87c</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/time-series-data-why-and-how-to-use-a-relational-database-instead-of-nosql-d0cd6975e87c</guid>
            <category><![CDATA[General]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Mike Freedman]]></dc:creator>
            <pubDate>Thu, 20 Apr 2017 14:00:00 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2018/12/warehouse.png">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2018/12/warehouse.png" alt="Time-series data: Why (and how) to use a relational database instead of NoSQL" /><p>These days, <a href="https://timescale.ghost.io/blog/time-series-data/">time-series data</a> applications (e.g., data center / server / microservice / container monitoring, sensor / IoT analytics, financial data analysis, etc.) are proliferating.</p><p>As a result, time-series databases are in fashion (<a href="https://misfra.me/2016/04/09/tsdb-list/" rel="noopener">here are 33 of them</a>). Most of these renounce the trappings of a traditional relational database and adopt what is generally known as a NoSQL model. Usage patterns are similar: <a href="https://www.percona.com/blog/2017/02/10/percona-blog-poll-database-engine-using-store-time-series-data/" rel="noopener">a recent survey</a> showed that developers preferred NoSQL to relational databases for time-series data by over 2:1.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2018/12/image-78.png" class="kg-image" alt="" loading="lazy" width="570" height="240"><figcaption><b><strong style="white-space: pre-wrap;">Relational databases include:</strong></b><span style="white-space: pre-wrap;"> MySQL, MariaDB Server, PostgreSQL. </span><b><strong style="white-space: pre-wrap;">NoSQL databases include:</strong></b><span style="white-space: pre-wrap;"> Elastic, InfluxDB, MongoDB, Cassandra, Couchbase, Graphite, Prometheus, ClickHouse, OpenTSDB, DalmatinerDB, KairosDB, RiakTS. 
</span><b><strong style="white-space: pre-wrap;">Source: </strong></b><a href="https://www.percona.com/blog/2017/02/10/percona-blog-poll-database-engine-using-store-time-series-data/"><b><strong style="white-space: pre-wrap;">Percona</strong></b></a><b><strong style="white-space: pre-wrap;">.&nbsp;</strong></b></figcaption></figure><p>Typically, the reason for adopting NoSQL time-series databases comes down to scale. While relational databases have many useful features that most NoSQL databases do not (robust secondary index support; complex predicates; a rich query language; JOINs, etc), they are difficult to scale.</p><p>And because time-series data piles up very quickly, many developers believe relational databases are ill-suited for it.</p><p>We take a different, somewhat heretical stance: relational databases can be quite powerful for time-series data. One just needs to solve the scaling problem. That is what we do in <a href="https://github.com/timescale/timescaledb" rel="noopener">TimescaleDB</a>.</p><p>When we <a href="https://timescale.ghost.io/blog/when-boring-is-awesome-building-a-scalable-time-series-database-on-postgresql-2900ea453ee2/">announced TimescaleDB two weeks ago</a>, we received a lot of positive feedback from the community. But we also heard from skeptics, who found it hard to believe that one should (or could) build a scalable time-series database on a relational database (in our case, PostgreSQL).</p><p>There are two separate ways to think about scaling: <strong>scaling up</strong> so that a single machine can store more data, and <strong>scaling out</strong> so that data can be stored across multiple machines.</p><p>Why are both important? The most common approach to scaling out across a cluster of <em>N</em> servers is to partition, or shard, a dataset into <em>N</em> partitions. 
If each server is limited in its throughput or performance (i.e., unable to scale up), then the overall cluster throughput is greatly reduced.</p><p>This post discusses <strong>scaling up</strong>. (A <strong>scaling-out</strong> post will be published at a later date.)</p><p>In particular, this post explains:</p><ul><li>Why relational databases do not normally scale up well</li><li>Why LSM trees (typically used in NoSQL databases) do not adequately meet the needs of many time-series applications</li><li>How time-series data is unique, how one can leverage those differences to overcome the scaling problem, and some performance results</li></ul><p>Our motivations are twofold: <strong>for anyone facing similar problems</strong>, to share what we’ve learned; and <strong>for those considering using TimescaleDB for time-series data</strong> (including the skeptics!), to explain some of our design decisions.</p><hr><h2 id="why-databases-do-not-normally-scale-up-well-swapping-inout-of-memory-is-expensive">Why databases do not normally scale up well: Swapping in/out of memory is expensive</h2><p>A common problem with scaling database performance on a single machine is the significant cost/performance trade-off between memory and disk. While memory is faster than disk, it is much more expensive: about 20x costlier than solid-state storage like Flash, and 100x more expensive than hard drives. Eventually, our entire dataset will not fit in memory, which is why we’ll need to write our data and indexes to disk.</p><p>This is an old, common problem for relational databases. In most relational databases, a table is stored as a collection of fixed-size pages of data (e.g., 8KB pages in PostgreSQL), on top of which the system builds data structures (such as <a href="https://en.wikipedia.org/wiki/B-tree" rel="noopener">B-trees</a>) to index the data. 
With an index, a query can quickly find a row with a specified ID (e.g., bank account number) without scanning the entire table or “walking” the table in some sorted order.</p><p>Now, if the working set of data and indexes is small, we can keep it in memory.</p><p>But if the data is sufficiently large that we can’t fit all (similarly fixed-size) pages of our B-tree in memory, then updating a random part of the tree can involve significant disk I/O as we read pages from disk into memory, modify in memory, and then write back out to disk (when evicted to make room for other B-tree pages). And a relational database like PostgreSQL keeps a B-tree (or other data structure) for <em>each</em> table index, in order for values in that index to be found efficiently. So, the problem compounds as you index more columns.</p><p>In fact, because the database only accesses the disk in page-sized boundaries, even seemingly small updates can cause these swaps to occur: To change one cell, the database may need to swap out an existing 8KB page and write it back to disk, then read in the new page before modifying it.</p><p>But why not use smaller- or variable-sized pages? There are two good reasons: minimizing disk fragmentation, and (in case of a spinning hard disk) minimizing the overhead of the “seek time” (usually 5–10ms) required in physically moving the disk head to a new location.</p><p>What about solid-state drives (SSDs)? While solutions like NAND Flash drives eliminate any physical “seek” time, they can only be read from or written to at the page-level granularity (today, typically 8KB). 
So, even to update a single byte, the SSD firmware needs to read an 8KB page from disk to its buffer cache, modify the page, then write the updated 8KB page back to a new disk block.</p><p>The cost of swapping in and out of memory can be seen in this performance graph from PostgreSQL, where insert throughput plunges with table size and increases in variance (depending on whether requests hit in memory or require (potentially multiple) fetches from disk).</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2018/12/gif3-1.gif" class="kg-image" alt="" loading="lazy" width="1067" height="600" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2018/12/gif3-1.gif 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2018/12/gif3-1.gif 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2018/12/gif3-1.gif 1067w" sizes="(min-width: 720px) 720px"></figure><p><em>Insert throughput as a function of table size for PostgreSQL 9.6.2, running with 10 workers on an Azure standard DS4 v2 (8 core) machine with SSD-based (premium LRS) storage. Clients insert individual rows into the database (each of which has 12 columns: a timestamp, an indexed randomly-chosen primary id, and 10 additional numerical metrics). 
The PostgreSQL rate starts at over 15K inserts/second, but then begins to drop significantly after 50M rows and begins to experience very high variance (including periods of only 100s of inserts/sec).</em></p><hr><h2 id="enter-nosql-databases-with-log-structured-merge-trees-and-new-problems">Enter NoSQL databases with Log-Structured Merge Trees (and new problems)</h2><p>About a decade ago, we started seeing a number of “NoSQL” storage systems address this problem via <a href="http://www.cs.umb.edu/~poneil/lsmtree.pdf" rel="noopener">Log-structured merge (LSM) trees</a>, which reduce the cost of making small writes by only performing larger append-only writes to disk.</p><p>Rather than performing “in-place” writes (where a small change to an existing page requires reading/writing that entire page from/to disk), LSM trees queue up several new updates (including deletes!) into pages and write them as a single batch to disk. In particular, all writes in an LSM tree are performed to a sorted table maintained <em>in memory</em>, which is then flushed to disk as an immutable batch once it reaches a sufficient size (as a “sorted string table”, or SSTable). This reduces the cost of making small writes.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2018/12/image-79.png" class="kg-image" alt="" loading="lazy" width="631" height="136" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2018/12/image-79.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2018/12/image-79.png 631w"><figcaption><b><strong style="white-space: pre-wrap;">In an LSM tree, all updates are first written to a sorted table in memory, and then flushed to disk as an immutable batch, stored as an SSTable, which is often indexed in memory. 
Source: </strong></b><a href="https://www.igvita.com/2012/02/06/sstable-and-log-structured-storage-leveldb/"><b><strong style="white-space: pre-wrap;">igvita.com</strong></b></a></figcaption></figure><p>This architecture — which has been adopted by many “NoSQL” databases like LevelDB, Google BigTable, Cassandra, MongoDB (WiredTiger), and <a href="https://www.outfluxdata.com">InfluxDB</a> — may seem great at first. Yet it introduces other tradeoffs: higher memory requirements and poor secondary index support.</p><p><strong>Higher-memory requirements:</strong> Unlike in a B-tree, in an LSM tree there is no single ordering: no global index to give us a sorted order over all keys. Consequently, looking up a value for a key gets more complex: first, check the memory table for the latest version of the key; otherwise, look to (potentially many) on-disk tables to find the latest value associated with that key. To avoid excessive disk I/O (and if the values themselves are large, such as the webpage content stored in Google’s BigTable), indexes for all SSTables may be kept entirely in memory, which in turn increases memory requirements.</p><p><strong>Poor secondary index support:</strong> Given that they lack any global sorted order, LSM trees do not naturally support secondary indexes. Various systems have added some additional support, such as by duplicating the data in a different order. Or, they emulate support for richer predicates by building their primary key as the concatenation of multiple values. Yet this approach comes with the cost of requiring a larger scan among these keys at query time, thus supporting only items with a limited cardinality (e.g., discrete values, not numeric ones).</p><p>There is a better approach to this problem. 
Let’s start by better understanding time-series data.</p><hr><h2 id="time-series-data-is-different">Time-series data is different</h2><p>Let’s take a step back, and look at the original problem that relational databases were designed to solve. Starting from <a href="https://en.wikipedia.org/wiki/IBM_System_R" rel="noopener">IBM’s seminal System R</a> in the mid-1970s, relational databases were employed for what became known as online transaction processing (<a href="https://www.tigerdata.com/learn/understanding-oltp" rel="noreferrer">OLTP</a>).</p><p>Under OLTP, operations are often transactional updates to various rows in a database. For example, think of a bank transfer: a user debits money from one account and credits another. This corresponds to updates to two rows (or even just two cells) of a database table. Because bank transfers can occur between any two accounts, the two rows that are modified are somewhat randomly distributed over the table.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2018/12/image-80.png" class="kg-image" alt="" loading="lazy" width="1404" height="899" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2018/12/image-80.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2018/12/image-80.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2018/12/image-80.png 1404w" sizes="(min-width: 720px) 720px"><figcaption><b><strong style="white-space: pre-wrap;">Time-series data arises from many different settings: industrial machines; transportation and logistics; DevOps, datacenter, and server monitoring; and financial applications.</strong></b></figcaption></figure><p>Now let’s consider a few examples of time-series workloads:</p><ul><li><strong>DevOps/server/container monitoring.</strong> 
The system typically collects metrics about different servers or containers: CPU usage, free/used memory, network tx/rx, disk IOPS, etc. Each set of metrics is associated with a timestamp, unique server name/ID, and a set of tags that describe an attribute of what is being collected.</li><li><strong>IoT sensor data.</strong> Each IoT device may report multiple sensor readings for each time period. As an example, for environmental and air quality monitoring this could include: temperature, humidity, barometric pressure, sound levels, measurements of nitrogen dioxide, carbon monoxide, particulate matter, etc. Each set of readings is associated with a timestamp and unique device ID, and may contain other metadata.</li><li><strong>Financial data. </strong>Financial tick data may include streams with a timestamp, the name of the security, and its current price and/or price change. Another type of financial data is payment transactions, which would include a unique account ID, timestamp, transaction amount, as well as any other metadata. (Note that this data is different than the OLTP example above: here we are recording every transaction, while the OLTP system was just reflecting the current state of the system.)</li><li><strong>Fleet/asset management.</strong> Data may include a vehicle/asset ID, timestamp, GPS coordinates at that timestamp, and any metadata.</li></ul><p>In all of these examples, the datasets are a stream of measurements that involve inserting “new data” into the database, typically to the latest time interval. 
While it’s possible for data to arrive much later than when it was generated/timestamped, either due to network/system delays or because of corrections to update existing data, this is typically the exception, not the norm.</p><p>In other words, these two workloads have very different characteristics:</p><h4 id="oltp-writes"><strong>OLTP Writes</strong></h4><ul><li>Primarily UPDATES</li><li>Randomly distributed (over the set of primary keys)</li><li>Often transactions across multiple primary keys</li></ul><h4 id="time-series-writes"><strong>Time-series Writes</strong></h4><ul><li>Primarily INSERTs</li><li>Primarily to a recent time interval</li><li>Primarily associated with both a timestamp and a separate primary key (e.g., server ID, device ID, security/account ID, vehicle/asset ID, etc.)</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2018/12/image-81.png" class="kg-image" alt="" loading="lazy" width="1600" height="774" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2018/12/image-81.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2018/12/image-81.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2018/12/image-81.png 1600w" sizes="(min-width: 720px) 720px"><figcaption><b><strong style="white-space: pre-wrap;">TimescaleDB stores each chunk in an internal database table, so indexes only grow with the size of each chunk, not the entire </strong></b><a href="https://www.tigerdata.com/blog/database-indexes-in-postgresql-and-timescale-cloud-your-questions-answered" rel="noreferrer"><b><strong style="white-space: pre-wrap;">hypertable</strong></b></a><b><strong style="white-space: pre-wrap;">. 
As inserts are largely to the more recent interval, that one remains in memory, avoiding expensive swaps to disk.</strong></b></figcaption></figure><p>Why does this matter? As we will see, one can take advantage of these characteristics to solve the scaling-up problem on a relational database.</p><hr><h2 id="a-new-way-adaptive-timespace-chunking">A new way: Adaptive time/space chunking</h2><p><em>NOTE September 2021: following publication of this post, as explained in </em><a href="https://github.com/timescale/timescaledb/issues/2574"><em>this GitHub issue</em></a><em> adaptive chunking was deprecated from latest releases of TimescaleDB. There is </em><a href="https://github.com/timescale/timescaledb/issues/3472"><em>a feature request</em></a><em> for the approach to be reinstated. You may wish to follow or upvote that request. </em></p><p>When previous approaches tried to avoid small writes to disk, they were trying to address the broader OLTP problem of UPDATEs to random locations. But as we just established, time-series workloads are different: writes are primarily INSERTS (not UPDATES), to a recent time interval (not a random location). In other words, time-series workloads are <em>append only.</em></p><p>This is interesting: it means that, if data is sorted by time, we would always be writing towards the “end” of our dataset. Organizing data by time would also allow us to keep the actual working set of database pages rather small, and maintain them in memory. And reads, which we have spent less time discussing, could also benefit: if many read queries are to recent intervals (e.g., for real-time dashboarding), then this data would be already cached in memory.</p><p>At first glance, it may seem like indexing on time would give us efficient writes and reads for free. 
But once we want any other indexes (e.g., another primary key like server/device ID, or any secondary indexes), then this naive approach would return us to making random inserts into our B-tree for that index.</p><p>There is another way, which we call “adaptive time/space chunking”. This is what we use in TimescaleDB.</p><p>Instead of just indexing by time, TimescaleDB builds distinct <em>tables </em>by splitting data according to two dimensions: the time interval <em>and</em> a primary key (e.g., server/device/asset ID). We refer to these as <em>chunks</em> to differentiate them from <em>partitions</em>, which are typically defined by splitting the primary key space. Because each of these chunks is stored as a database table itself, and the query planner is aware of the chunk’s ranges (in time and keyspace), the query planner can immediately tell to which chunk(s) an operation’s data belongs. (This applies both to inserting rows and to pruning the set of chunks that need to be touched when executing queries.)</p><p>The key benefit of this approach is that now all of our indexes are built only across these much smaller chunks (tables), rather than a single table representing the entire dataset. So if we size these chunks properly, we <em>can </em>fit the latest tables (and their B-trees) completely in memory, and avoid this swap-to-disk problem, while maintaining support for multiple indexes.</p><h2 id="approaches-to-implementing-chunking">Approaches to implementing chunking</h2><p>The two intuitive approaches to designing this time/space chunking each have significant limitations:</p><h3 id="approach-1-fixed-duration-intervals">Approach #1: Fixed-duration intervals</h3><p>Under this approach, all chunks can have fixed, identical time intervals, e.g., 1 day. This works well if the volume of data collected per interval does not change. 
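</p><p>As a minimal sketch of fixed-duration intervals, assuming the TimescaleDB extension is installed and its classic <code>create_hypertable</code> interface is available, a hypertable with one-day chunks might be declared as below. The <code>conditions</code> table and its columns are hypothetical names chosen for illustration, not something taken from this post.</p><pre><code class="language-sql">-- Hypothetical schema: a timestamp, a device key, and one metric.
CREATE EXTENSION IF NOT EXISTS timescaledb;

CREATE TABLE conditions (
    time        TIMESTAMPTZ      NOT NULL,
    device_id   TEXT             NOT NULL,
    temperature DOUBLE PRECISION
);

-- Every chunk covers a fixed one-day interval, regardless of data volume.
SELECT create_hypertable('conditions', 'time',
                         chunk_time_interval =&gt; INTERVAL '1 day');
</code></pre><p>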
However, as services become popular, their infrastructure correspondingly expands, leading to more servers and more monitoring data. Similarly, successful IoT products will deploy ever more numbers of devices. And once we start writing too much data to each chunk, we’re regularly swapping to disk (and will find ourselves back at square one). On the flip side, choosing too-small intervals to start with leads to other performance downsides, e.g., having to touch many tables at query time.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2018/12/image-82.png" class="kg-image" alt="" loading="lazy" width="699" height="204" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2018/12/image-82.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2018/12/image-82.png 699w"><figcaption><b><strong style="white-space: pre-wrap;">Each chunk has a fixed duration in time. Yet if the data volume per time increases, then eventually chunk size becomes too large to fit in memory.</strong></b></figcaption></figure><h3 id="approach-2-fixed-sized-chunks">Approach #2: Fixed-sized chunks</h3><p>With this approach, all chunks have fixed target sizes, e.g., 1GB. A chunk is written to until it reaches its maximum size, at which point it becomes “closed” and its time interval constraints become fixed. Later data falling within the chunk’s “closed” interval will still be written to the chunk, however, in order to preserve the correctness of the chunk’s time constraints.</p><p>A key challenge is that the time interval of the chunk depends on the order of data. Consider if data (even a single datapoint) arrives “early” by hours or even days, potentially due to a non-synchronized clock, or because of varying delays in systems with intermittent connectivity. 
This early datapoint will stretch out the time interval of the “open” chunk, while subsequent on-time data can drive the chunk over its target size. The insert logic for this approach is also more complex and expensive, driving down throughput for large batch writes (such as large COPY operations), as the database needs to make sure it inserts data in temporal order to determine when a new chunk should be created (even in the <em>middle</em> of an operation). Other problems exist for fixed- or max-size chunks as well, including time intervals that may not align well with data retention policies (“delete data after 30 days”).</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2018/12/image-83.png" class="kg-image" alt="" loading="lazy" width="706" height="202" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2018/12/image-83.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2018/12/image-83.png 706w"><figcaption><b><strong style="white-space: pre-wrap;">Each chunk’s time interval is fixed only once its maximum size has been reached. Yet if data arrives early, this creates a large interval for the chunk, and the chunk eventually becomes too large to fit in memory.</strong></b></figcaption></figure><p>TimescaleDB takes a third approach that couples the strengths of both approaches.</p><h3 id="approach-3-adaptive-intervals-our-current-design">Approach #3: Adaptive intervals (our current design)</h3><p><a href="https://timescale.ghost.io/blog/blog/time-series-data-why-and-how-to-use-a-relational-database-instead-of-nosql-d0cd6975e87c/#a-new-way-adaptive-timespace-chunking"><em>Please see note, above. 
</em></a></p><p>Chunks are created with a fixed interval, but the interval adapts from chunk-to-chunk based on changes in data volumes in order to hit maximum target sizes.</p><p>By avoiding open-ended intervals, this approach ensures that data arriving early doesn’t create too-long time intervals that will subsequently lead to over-large chunks. Further, like static intervals, it more naturally supports retention policies specified on time, e.g., “delete data after 30 days”. Given TimescaleDB’s time-based chunking, such policies are implemented by simply dropping chunks (tables) in the database. This means that individual <em>files </em>in the underlying file system can simply be deleted, rather than needing to delete individual <em>rows,</em> which requires erasing/invalidating portions of the underlying file. Such an approach therefore avoids fragmentation in the underlying database files, which in turn avoids the need for <a href="https://www.postgresql.org/docs/9.6/static/routine-vacuuming.html" rel="noopener">vacuuming</a>. And this vacuuming can be prohibitively expensive in very large tables.</p><p>Still, this approach ensures that chunks are sized appropriately so that the latest ones can be maintained in memory, even as data volumes may change.</p><p>Partitioning by primary key then takes each time interval and further splits it into a number of smaller chunks, which all share the same time interval but are disjoint in terms of their primary keyspace. This enables better parallelization both on servers with multiple disks — for both inserts and queries  — as well as multiple servers. 
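</p><p>The time-based retention and key-space partitioning described above can be sketched as follows. This is an illustration under stated assumptions rather than the setup used for the results in this post: the <code>metrics</code> table and its columns are hypothetical, the partition count of 4 is arbitrary, and the <code>drop_chunks</code> call assumes the TimescaleDB 2.x argument order (earlier releases ordered the arguments differently).</p><pre><code class="language-sql">-- Hypothetical table for illustration.
CREATE TABLE metrics (
    time      TIMESTAMPTZ      NOT NULL,
    server_id TEXT             NOT NULL,
    cpu       DOUBLE PRECISION
);

-- Chunk by time, then split each time interval into 4 hash partitions on server_id.
SELECT create_hypertable('metrics', 'time',
                         partitioning_column =&gt; 'server_id',
                         number_partitions   =&gt; 4,
                         chunk_time_interval =&gt; INTERVAL '1 day');

-- Without adaptive chunking, the interval can be shortened by hand as volume grows.
-- The new interval applies only to chunks created afterwards.
SELECT set_chunk_time_interval('metrics', INTERVAL '12 hours');

-- "Delete data after 30 days": drops whole chunks (tables) instead of deleting rows.
SELECT drop_chunks('metrics', older_than =&gt; INTERVAL '30 days');
</code></pre><p>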
More on these issues in a later post.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2018/12/image-84.png" class="kg-image" alt="" loading="lazy" width="709" height="421" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2018/12/image-84.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2018/12/image-84.png 709w"><figcaption><b><strong style="white-space: pre-wrap;">If the data volume per time increases, then chunk interval decreases to maintain right-sized chunks.</strong></b> <b><strong style="white-space: pre-wrap;">If data arrives early, then data is stored into a “future” chunk to maintain right-sized chunks.</strong></b></figcaption></figure><hr><h2 id="result-15x-improvement-in-insert-rate">Result: 15x improvement in insert rate</h2><p>Keeping chunks at the right size is how we achieve our INSERT results that surpass vanilla PostgreSQL, that Ajay already showed in his <a href="https://timescale.ghost.io/blog/when-boring-is-awesome-building-a-scalable-time-series-database-on-postgresql-2900ea453ee2/">earlier post</a>.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2018/12/image-85.png" class="kg-image" alt="" loading="lazy" width="1600" height="900" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2018/12/image-85.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2018/12/image-85.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2018/12/image-85.png 1600w" sizes="(min-width: 720px) 720px"><figcaption><b><strong style="white-space: pre-wrap;">Insert throughput of TimescaleDB vs. 
PostgreSQL, using the same workload as described earlier. Unlike vanilla PostgreSQL, TimescaleDB maintains a constant insert rate (of about 14.4K inserts/second, or 144K metrics/second, with very low variance), independent of dataset size.</strong></b></figcaption></figure><p>This consistent insert throughput also persists when writing large batches of rows in single operations to TimescaleDB (instead of row-by-row). Such batched inserts are common practice for databases employed in more high-scale production environments, e.g., when ingesting data from a distributed queue like Kafka. <strong>In such scenarios, a single Timescale server can ingest 130K rows (or 1.3M metrics) per second, approximately 15x that of vanilla PostgreSQL once the table has reached a couple hundred million rows.</strong></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2018/12/image-86.png" class="kg-image" alt="" loading="lazy" width="1600" height="900" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2018/12/image-86.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2018/12/image-86.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2018/12/image-86.png 1600w" sizes="(min-width: 720px) 720px"><figcaption><b><strong style="white-space: pre-wrap;">Insert throughput of TimescaleDB vs. PostgreSQL when performing INSERTs of 10,000-row batches.</strong></b></figcaption></figure><hr><h2 id="summary">Summary</h2><p>A relational database can be quite powerful for time-series data. Yet the cost of swapping in/out of memory significantly impacts its performance. 
But NoSQL approaches that implement Log-Structured Merge Trees have only shifted the problem, introducing higher memory requirements and poor secondary index support.</p><p>By recognizing that time-series data is different, we are able to organize data in a new way: adaptive time/space chunking. This minimizes swapping to disk by keeping the working data set small enough to fit inside memory, while allowing us to maintain robust primary and secondary index support (and the full feature set of PostgreSQL). And as a result, we are able to <strong>scale up </strong>PostgreSQL significantly, resulting in a 15x improvement in insert rates.</p><p>But what about performance comparisons to NoSQL databases? That post is coming soon.</p><p>In the meantime, you can download the latest version of TimescaleDB, released under the permissive Apache 2 license, on <a href="https://github.com/timescale/timescaledb" rel="noopener">GitHub</a>.</p><hr><p><em>Like this post? Interested in learning more?</em></p><p><em>Check out our </em><a href="https://github.com/timescale/timescaledb" rel="noopener"><strong><em>GitHub</em></strong></a><em>, join our </em><a href="http://slack.timescale.com/" rel="noopener"><em><strong>Slack community</strong></em></a><em>, and sign up for the community mailing list below. We’re also </em><a href="https://www.timescale.com/careers" rel="noopener"><em>hiring</em></a><em>!</em><br></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[When Boring is Awesome: Building a Scalable Time-Series Database on PostgreSQL]]></title>
            <description><![CDATA[Today we are announcing the beta release of TimescaleDB, a new open-source time-series database optimized for fast ingest and complex queries, now available on GitHub under the Apache 2 license.]]></description>
            <link>https://www.tigerdata.com/blog/when-boring-is-awesome-building-a-scalable-time-series-database-on-postgresql-2900ea453ee2</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/when-boring-is-awesome-building-a-scalable-time-series-database-on-postgresql-2900ea453ee2</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Announcements & Releases]]></category>
            <dc:creator><![CDATA[Ajay Kulkarni]]></dc:creator>
            <pubDate>Tue, 04 Apr 2017 14:00:00 GMT</pubDate>
            <media:content medium="image" url="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2018/12/gif5.gif">
            </media:content>
            <content:encoded><![CDATA[<img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2018/12/gif5.gif" alt="When Boring is Awesome: Building a Scalable Time-Series Database on PostgreSQL" /><p><em>(Update: Follow the discussion on this </em><a href="https://news.ycombinator.com/item?id=14035416" rel="noopener"><em>Hacker News thread</em></a><em>.)</em></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2018/12/image-87.png" class="kg-image" alt="" loading="lazy" width="406" height="374"><figcaption><b><strong style="white-space: pre-wrap;">TimescaleDB: SQL made scalable for time-series data.</strong></b></figcaption></figure><p><strong>Today we are announcing the beta release of </strong><a href="http://www.timescale.com/" rel="noopener"><strong>TimescaleDB</strong></a>, a new open-source time-series database optimized for fast ingest and complex queries, <a href="https://github.com/timescale/timescaledb" rel="noopener">now available on GitHub</a> under the Apache 2 license.</p><p>TimescaleDB is engineered up from PostgreSQL (packaged as an extension) and yet scales out horizontally, which means it supports normal SQL and all of the features you expect from a relational database: JOINs, secondary indexes, complex predicates and aggregates, window functions, CTEs, etc.</p><p>Key benefits:</p><ul><li><strong>Looks, feels, speaks just like PostgreSQL, </strong>including a normal SQL (not “SQL-like”) query interface. Existing clients, connectors, and BI tools (e.g., Tableau) work out of the box. Designed to be stupidly easy to use. If you know SQL (and especially PostgreSQL), you already know TimescaleDB.</li><li><strong>Scalable.</strong> Everyone “knows” that RDBMSs like <a href="https://www.timescale.com/learn/building-a-scalable-database">PostgreSQL do not scale well</a>. 
We solve that problem, introducing horizontal scale-out, automatic space/time partitioning, and distributed query optimizations for time-series data. Our latest benchmarks consistently show a greater than 15x improvement on inserts versus vanilla PostgreSQL on a dataset of 250 million rows, achieving constant insert throughput as the database grows (135K writes per second per node, where each write is a row of 10 metrics). We expect the insert rate to stay in that ballpark as the dataset grows further.</li><li><strong>Reliable.</strong> Even though TimescaleDB is new, we benefit from 20+ years of work in PostgreSQL reliability and tooling. We stand on the shoulders of giants.</li></ul><p>(For more technical details, please refer to our <a href="https://docs.timescale.com/v1.1/main">documentation</a>.)</p><p>In this age of new and shiny open source data projects, something as old as PostgreSQL can seem boring. <strong>But sometimes boring is awesome, especially when it’s your database.</strong> TimescaleDB is designed to just work, not wake you up at 3am.</p><p>If you have any kind of time-series data, and you like SQL/PostgreSQL, then please give TimescaleDB a whirl and let us know how it goes. We appreciate any feedback (and we’re pretty friendly folks).</p><p>You can install TimescaleDB via Homebrew, Docker, or from source. <a href="https://github.com/timescale/timescaledb" rel="noopener">More information on GitHub</a>.</p><p>But isn’t there already a glut of time-series databases? Did we really have to build yet another one?</p><p>Yes.</p><p>(Read on, padawan…)</p><h2 id="why-build-yet-another-time-series-database">Why build yet another time-series database?</h2><p>Seems like <a href="https://en.wikipedia.org/wiki/Time_series_database" rel="noopener">time-series databases</a> (i.e., databases optimized for data captured over time, for example, sensor data, financial data, DevOps data, etc.) 
are in vogue these days.</p><p>There have been a number of blog posts on the subject over the past few years (including these gems by <a href="https://www.xaprb.com/blog/2014/06/08/time-series-database-requirements/" rel="noopener">Baron Schwartz (2014)</a> and <a href="http://jmoiron.net/blog/thoughts-on-timeseries-databases/" rel="noopener">Jason Moiron (2015)</a>), and a plethora of new open-source time-series databases, each with its own trade-offs.</p><p>We can’t read minds, but we imagine that the developers behind each of those projects built their own time-series database because traditional RDBMSs (e.g., PostgreSQL, MySQL) didn’t scale for their needs.</p><p>That was the same problem we faced a year ago, when we needed a database to store sensor data for the IoT platform we were building at the time. We loved PostgreSQL, but “knew” that it inherently wouldn’t scale for our needs.</p><p>We tested some of the options in the list above, and saw that they all scaled pretty well, but sacrificed query power in exchange, and failed to support a number of key SQL capabilities.</p><p>So we had a choice: scalability or query power. That made us sad. We needed both.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2018/12/yoda.gif" class="kg-image" alt="" loading="lazy" width="450" height="192"><figcaption><b><strong style="white-space: pre-wrap;">Scale AND query power Yoda needs.</strong></b></figcaption></figure><p>In particular, we needed:</p><ul><li><strong>Scalable ingest, reads, and deletes.</strong> Some of our customers collected data at kilohertz rates, so we knew our dataset would get large very quickly. We needed inserts, reads, and deletes (in particular, bulk deletes) to perform well, even at scale. 
(On the other hand, we didn’t need to optimize for updates, as we expected to perform those rarely.)</li><li><strong>Performant complex queries.</strong> We needed support for complex predicates (e.g., readings where temperature and cpu are above certain thresholds), non-time based aggregates over a fixed window (e.g., total number of errors by firmware version), multiple aggregates (e.g., group by device type and time), flexible ordering (e.g., top 10 devices by usage), JOINs (e.g., join sensor data with device metadata), etc., at latencies that satisfied our APIs and dashboards.</li><li><strong>Multiple data type options. </strong>We wanted floats, integers, strings, booleans, arrays, JSON blobs. Support for geospatial data types as a bonus.</li><li><strong>An easy to use query language.</strong> We didn’t want to have to learn a completely new query language and build brand new connectors. We wanted something like SQL. Actually, we wanted SQL.</li><li><strong>A reliable and easy to operate database. </strong>We didn’t want to get woken up at 3am because our database crashed (and we sure didn’t want to lose data). We wanted something boring, something that would just work, not something fancy and experimental. We also wanted something with rich tooling and an abundant software ecosystem, so we wouldn’t always be waiting for (or needing to write) the next connector or integration for other systems we were using.</li></ul><p>When we looked at this list, we realized that we needed something like PostgreSQL. In fact,<strong> if we could only solve the </strong><a href="https://www.tigerdata.com/learn/guide-to-postgresql-scaling" rel="noreferrer"><strong>PostgreSQL scalability</strong></a><strong> problem, we’d have the perfect time-series database: scalable, easy to use, and reliable.</strong> (PostgreSQL even has support for geospatial data types and queries, via <a href="http://postgis.net/" rel="noopener">PostGIS</a>.)</p><p>Could this be possible? 
Being <a href="http://www.timescale.com/about">a group of computer science PhDs and academics</a> (including one tenured professor), we decided to find out for ourselves, and determined that time-series workloads, by their nature, lend themselves to a new database architecture that could offer both scale and SQL.</p><p>And then we built it.</p><p>And then we benchmarked it, and found that our database outperformed PostgreSQL by more than 15x on inserts on large datasets. In particular, we found that as the dataset size grows, the insert rate for PostgreSQL drops off dramatically, while our insert rate remains high:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2018/12/image-74.png" class="kg-image" alt="" loading="lazy" width="1333" height="750" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2018/12/image-74.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w1000/2018/12/image-74.png 1000w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2018/12/image-74.png 1333w" sizes="(min-width: 720px) 720px"><figcaption><b><strong style="white-space: pre-wrap;">Insert rate comparison: TimescaleDB vs vanilla PostgreSQL</strong></b></figcaption></figure><p>Note:</p><ul><li>Each row contains 10 metrics and a timestamp.</li><li>Each batch contains 10,000 rows written at once across any partitions (similar to what one would expect in production, e.g., when consuming data off of a message bus like Kafka).</li><li>The benchmark ran on an Azure Standard DS4 v2 machine (8 cores) with SSD storage (premium LRS).</li></ul><p>Finally, scale and SQL. This made us happy.</p><p>(So why build yet another time-series database? Because we had to.)</p><h2 id="really-scale-and-normal-sql-demo">Really, scale and normal SQL? 
(Demo)</h2><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2018/12/sw.gif" class="kg-image" alt="" loading="lazy" width="500" height="243"></figure><p>Get ready for the world’s most boring database demo, because TimescaleDB’s query language is just normal SQL:</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2018/12/image-75.png" class="kg-image" alt="" loading="lazy" width="778" height="729" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/size/w600/2018/12/image-75.png 600w, https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2018/12/image-75.png 778w" sizes="(min-width: 720px) 720px"></figure><p>This obviously is just a sample. For the full documentation on what kinds of queries we support, <a href="https://www.postgresql.org/docs/" rel="noopener">please refer here</a>.</p><h2 id="what%E2%80%99s-going-on-behind-the-scenes">What’s going on behind the scenes</h2><p>“So what’s the big deal? This looks just like normal PostgreSQL…”</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2018/12/gif.gif" class="kg-image" alt="" loading="lazy" width="500" height="285"><figcaption><b><strong style="white-space: pre-wrap;">Some are not easily impressed.&nbsp;</strong></b></figcaption></figure><p>There are actually 5 key things happening behind the scenes:</p><ol><li><strong>Automatic space-time partitioning:</strong> We take advantage of two major attributes of time-series workloads: that all data has a primary key and a timestamp, and that inserts are largely append-only (writes to most recent interval, infrequent updates). 
This allows us to automatically partition incoming data for a given table by time and space (primary key) into 2D “chunks” (stored internally as PostgreSQL tables), and to create new chunks on demand, all transparently to the user.</li><li><strong>Right-sized data chunks: </strong>Our engine keeps chunks right-sized and time-interval aligned so that the multiple B-trees for a table’s indexes can reside in memory during inserts, which avoids thrashing. This also allows us to delete data by dropping entire chunks, rather than needing to delete individual rows, thus avoiding expensive vacuuming operations.</li><li><strong>Distributed query optimizations:</strong> When queries arrive, we avoid querying extraneous chunks via constraint exclusion analysis, and then employ different techniques to parallelize the query across the remaining chunks efficiently.</li><li><strong>The “hypertable” abstraction:</strong> All this complexity is hidden from the user behind an abstraction we call a “hypertable”, which provides the illusion of a single table across all space and time (despite the 2D chunking). Inserts are written to the <a href="https://www.tigerdata.com/blog/database-indexes-in-postgresql-and-timescale-cloud-your-questions-answered" rel="noreferrer">hypertable</a>; the database automatically partitions the data, writes to the appropriate chunk, and creates new chunks if necessary. 
Similarly, queries are run against the hypertable; the database automatically runs the distributed query optimizations, determines the minimally necessary set of chunks to query, pushes down any further optimizations to these chunks, and returns the appropriate data.</li><li><strong>Tight integration with PostgreSQL:</strong> All this occurs tightly integrated with the PostgreSQL query parser, which allows the database to support the entire spectrum of PostgreSQL commands and then run its own query planner and optimizations.</li></ol><p>For example, each of the queries above is running against a <em>hypertable</em>, allowing the database to hide the complexity of the system from the user.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2018/12/image-76.png" class="kg-image" alt="" loading="lazy" width="600" height="344" srcset="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2018/12/image-76.png 600w"><figcaption><b><strong style="white-space: pre-wrap;">A hypertable provides the illusion of a single continuous table across all space and time chunks.</strong></b></figcaption></figure><p>But this just scratches the surface. For more on our technical architecture, take a look at <a href="https://docs.timescale.com">our documentation</a>. </p><h2 id="why-this-matters-time-series-data-is-sprouting-up-everywhere">Why this matters: time-series data is sprouting up everywhere</h2><p>“Fine”, you might say, “you guys built something that only works for your weirdo IoT backend.”</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2018/12/gif2.gif" class="kg-image" alt="" loading="lazy" width="400" height="225"></figure><p>That’s what we thought too. 
But as we made the rounds talking about our IoT platform, people would respond: “We’re building our own IoT platform, so we can’t use yours. But tell us more about this time-series database you built.”</p><p>Then, they’d add, “You know, forget IoT, we have a lot of time-series data in general. Could your database help there too?”</p><p>And strangely, we heard the same thing from our friends in other industries: they had a growing amount of time-series data and needed something better than existing databases.</p><p>Eureka. Our time-series database was solving a bigger problem.</p><p>We realized that time-series data, which used to be this niche thing within finance and DevOps, was sprouting up everywhere. <strong>We realized that fundamental shifts in computing (more sources of data, fatter pipes, cheaper storage) were creating new currents of time-series data streams. </strong>And that analyzing these new datasets across time was powerful, enabling us to monitor the present, understand historical trends, troubleshoot the past, and predict the future.</p><p>We also noticed that even traditional time-series data applications were becoming more complex: e.g., in DevOps, needing to correlate application performance across microservices; in finance, needing to monitor payment transactions and other customer interaction data in real time.</p><p><strong>There was also another trend at work: the resurgence of SQL.</strong> Recent posts like these from <a href="https://www.percona.com/blog/2017/03/27/whats-next-for-sql-databases/" rel="noopener">Percona (March 27, 2017)</a>, <a href="https://www.xaprb.com/blog/defining-moments-in-database-history/" rel="noopener">Baron Schwartz (March 19, 2017)</a>, and <a href="https://stateofprogress.blog/choose-sql-d017cfc08870" rel="noopener">Paris Kasidiaris (March 13, 2017)</a> capture the sentiment well. The pendulum is swinging back towards boring SQL. 
In fact, “NoSQL” databases seem to be rebranding themselves to mean “not only SQL”, rather than outright rejecting SQL.</p><p>We realized that our database, which sat at the intersection of the “rise of time-series data” and the “resurgence of SQL”, might actually be useful to other people.</p><p>That’s why we decided last fall to change directions and go all in on the database. After several months of heads-down work, <a href="https://github.com/timescale/timescaledb" rel="noopener">we just open sourced it last month under the Apache 2 license</a>.</p><h2 id="when-you-might-want-to-consider-alternatives">When you might want to consider alternatives</h2><p>That said, TimescaleDB can’t solve everyone’s problems. In particular, there are 3 time-series scenarios where there may be better alternatives:</p><ul><li><strong>Simple read requirements:</strong> When most of your query patterns are simple in nature (e.g., simple key-value lookups, or rollup of a single metric over time).</li><li><strong>Low available storage:</strong> When resource constraints place storage at a premium, and heavy compression is required, even at the cost of query power and ease of use. (Although this is an area of active development, and we expect TimescaleDB to improve.)</li><li><strong>Sparse and/or unstructured data:</strong> When your time-series data is especially sparse and/or generally unstructured. (But even if your data is partially structured, TimescaleDB includes a JSONB field type for the unstructured part(s). This allows you to maintain indexes on the structured parts of your data combined with the flexibility of unstructured storage.)</li></ul><h2 id="summary">Summary</h2><p>TimescaleDB is the first open source time-series database that offers normal SQL at scale. It acts like a relational database yet scales linearly for time-series data.</p><p>TimescaleDB is in active development by a team of PhDs based in New York City, Stockholm, and Los Angeles. 
A single-node version is currently available for download. A clustered version is in the works.</p><p>We scratched our own itch, and hope it now scratches yours. If it does, or you think it might and want to learn more, we’d love to hear from you at <a href="mailto:hello@timescale.com" rel="nofollow">hello@timescale.com</a>.</p><p><em>For more:</em></p><ul><li><a href="https://github.com/timescale/timescaledb" rel="noopener"><em>Installation instructions on GitHub</em></a></li><li><a href="http://docs.timescale.com"><em>High-level technical overview</em></a></li><li><a href="https://docs.timescale.com/v1.1/main"><em>Our Documentation </em></a></li></ul><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2018/12/gif3.gif" class="kg-image" alt="" loading="lazy" width="500" height="272"></figure>]]></content:encoded>
        </item>
    </channel>
</rss>