<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[Tiger Data Blog]]></title>
        <description><![CDATA[Insights, product updates, and tips from TigerData (Creators of TimescaleDB) engineers on Postgres, time series & AI. IoT, crypto, and analytics tutorials & use cases.]]></description>
        <link>https://www.tigerdata.com/blog</link>
        <image>
            <url>https://www.tigerdata.com/icon.ico</url>
            <title>Tiger Data Blog</title>
            <link>https://www.tigerdata.com/blog</link>
        </image>
        <generator>RSS for Node</generator>
        <lastBuildDate>Tue, 07 Apr 2026 09:53:59 GMT</lastBuildDate>
        <atom:link href="https://www.tigerdata.com/blog" rel="self" type="application/rss+xml"/>
        <ttl>60</ttl>
        <item>
            <title><![CDATA[Five Features of the Tiger CLI You Aren't Using (But Should)]]></title>
            <description><![CDATA[Tiger CLI + MCP server: Let AI manage databases, fork instantly, search Postgres docs, and run queries—all from your coding assistant without context switching.]]></description>
            <link>https://www.tigerdata.com/blog/five-features-tiger-cli-you-arent-using-but-should</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/five-features-tiger-cli-you-arent-using-but-should</guid>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[AI agents]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Jacky Liang]]></dc:creator>
            <pubDate>Wed, 10 Dec 2025 16:37:07 GMT</pubDate>
            <media:content medium="image" url="https://timescale.ghost.io/blog/content/images/2025/12/5FeaturesofTigerCLI.png">
            </media:content>
            <content:encoded><![CDATA[<p>Last month, we launched <a href="https://www.tigerdata.com/blog/postgres-for-agents"><u>Agentic Postgres</u></a>, the first Postgres database designed for AI agents. It includes an MCP server that gives agents direct access to your databases, instant zero-copy forks, Postgres and TimescaleDB documentation search, and more.&nbsp;</p><figure class="kg-card kg-video-card kg-width-regular" data-kg-thumbnail="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/media/2025/10/DE84BB33-B4FE-4F6E-8398-9267033F6870-2_thumb.jpg" data-kg-custom-thumbnail="">
            <div class="kg-video-container">
                <video src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/media/2025/10/DE84BB33-B4FE-4F6E-8398-9267033F6870-2.mp4" poster="https://img.spacergif.org/v1/1920x1080/0a/spacer.png" width="1920" height="1080" loop="" autoplay="" muted="" playsinline="" preload="metadata" style="background: transparent url('https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/media/2025/10/DE84BB33-B4FE-4F6E-8398-9267033F6870-2_thumb.jpg') 50% 50% / cover no-repeat;"></video>
                <div class="kg-video-overlay">
                    <button class="kg-video-large-play-icon" aria-label="Play video">
                        <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
                            <path d="M23.14 10.608 2.253.164A1.559 1.559 0 0 0 0 1.557v20.887a1.558 1.558 0 0 0 2.253 1.392L23.14 13.393a1.557 1.557 0 0 0 0-2.785Z"></path>
                        </svg>
                    </button>
                </div>
                <div class="kg-video-player-container kg-video-hide">
                    <div class="kg-video-player">
                        <button class="kg-video-play-icon" aria-label="Play video">
                            <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
                                <path d="M23.14 10.608 2.253.164A1.559 1.559 0 0 0 0 1.557v20.887a1.558 1.558 0 0 0 2.253 1.392L23.14 13.393a1.557 1.557 0 0 0 0-2.785Z"></path>
                            </svg>
                        </button>
                        <button class="kg-video-pause-icon kg-video-hide" aria-label="Pause video">
                            <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
                                <rect x="3" y="1" width="7" height="22" rx="1.5" ry="1.5"></rect>
                                <rect x="14" y="1" width="7" height="22" rx="1.5" ry="1.5"></rect>
                            </svg>
                        </button>
                        <span class="kg-video-current-time">0:00</span>
                        <div class="kg-video-time">
                            /<span class="kg-video-duration">0:30</span>
                        </div>
                        <input type="range" class="kg-video-seek-slider" max="100" value="0">
                        <button class="kg-video-playback-rate" aria-label="Adjust playback speed">1×</button>
                        <button class="kg-video-unmute-icon" aria-label="Unmute">
                            <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
                                <path d="M15.189 2.021a9.728 9.728 0 0 0-7.924 4.85.249.249 0 0 1-.221.133H5.25a3 3 0 0 0-3 3v2a3 3 0 0 0 3 3h1.794a.249.249 0 0 1 .221.133 9.73 9.73 0 0 0 7.924 4.85h.06a1 1 0 0 0 1-1V3.02a1 1 0 0 0-1.06-.998Z"></path>
                            </svg>
                        </button>
                        <button class="kg-video-mute-icon kg-video-hide" aria-label="Mute">
                            <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
                                <path d="M16.177 4.3a.248.248 0 0 0 .073-.176v-1.1a1 1 0 0 0-1.061-1 9.728 9.728 0 0 0-7.924 4.85.249.249 0 0 1-.221.133H5.25a3 3 0 0 0-3 3v2a3 3 0 0 0 3 3h.114a.251.251 0 0 0 .177-.073ZM23.707 1.706A1 1 0 0 0 22.293.292l-22 22a1 1 0 0 0 0 1.414l.009.009a1 1 0 0 0 1.405-.009l6.63-6.631A.251.251 0 0 1 8.515 17a.245.245 0 0 1 .177.075 10.081 10.081 0 0 0 6.5 2.92 1 1 0 0 0 1.061-1V9.266a.247.247 0 0 1 .073-.176Z"></path>
                            </svg>
                        </button>
                        <input type="range" class="kg-video-volume-slider" max="100" value="100">
                    </div>
                </div>
            </div>
            
        </figure><p>Alongside Agentic Postgres, we shipped a brand new CLI: <a href="https://github.com/timescale/tiger-cli"><u>Tiger CLI</u></a>. It's how you manage your Tiger Cloud databases from your favorite terminal.&nbsp;</p><p>The basics work like you'd expect:&nbsp;</p><pre><code class="language-shell"># Install Tiger CLI 
curl -fsSL https://cli.tigerdata.com | sh

# Authenticate
tiger auth login

# Create a new database service
tiger service create --name my-database

# Connect to your database
tiger db connect

# Get your connection string
tiger db connection-string
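# e.g., feed it straight to psql (illustrative; assumes psql is installed):
#   psql "$(tiger db connection-string)"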

# List all your services
tiger service list</code></pre><p>These commands cover most day-to-day workflows. But Tiger CLI has a few features that make the agentic development workflow significantly more intuitive, and you're probably not using them yet!&nbsp;</p><p>Here are five new features we launched that you aren’t using, but should:&nbsp;</p><ol><li><strong>Let your AI manage your databases</strong>: Install an MCP server that gives your AI assistant direct access to create services, run queries, and check connections</li><li><strong>Turn your AI into a Postgres expert</strong>: Skills teach your AI Postgres best practices automatically, as if it’s been writing production-grade PostgreSQL for more than a decade.&nbsp;</li><li><strong>Fork any database in seconds</strong>: Create zero-copy clones of your database for testing migrations or spinning up staging environments</li><li><strong>Search Postgres docs from your editor</strong>: Your AI can search PostgreSQL (across all versions) and TimescaleDB documentation without leaving your IDE or CLI</li><li><strong>Run SQL queries through your AI</strong>: Execute queries against your database directly from your AI assistant</li></ol><p>Let's look at each one.</p><h2 id="let-your-ai-manage-your-databases">Let Your AI Manage Your Databases</h2><p>As anyone coding with Cursor or Claude Code knows, nothing breaks flow more than having to leave your AI agent to execute CLI commands. Every time you need to check a database connection string, list your available services, or create a new database, you need to switch context. From your terminal to the browser and back to your IDE, it’s easy to fall out of your flow state.&nbsp;</p><p>The Tiger CLI now includes a <a href="https://modelcontextprotocol.io/docs/getting-started/intro"><u>Model Context Protocol</u></a> (MCP) server. 
If you're using an AI coding assistant like Claude Code, Cursor, or VS Code with Copilot, you can give it direct access to your Tiger Cloud databases.</p><p>This means your AI assistant can list your services, run SQL queries, create new databases, and check connection details, all without you switching to your terminal.</p><h3 id="quick-setup">Quick Setup</h3><p>Install the MCP server for your assistant:</p><pre><code class="language-shell"># Interactive (prompts you to pick your client)
tiger mcp install

# Or specify directly
tiger mcp install claude-code
tiger mcp install cursor
tiger mcp install vscode</code></pre><p>We made it super easy to install the Tiger MCP server in all of your favorite coding assistants* through an interactive prompt that guides you through the installation.</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2025/12/SCR-20251204-nvcj-2.png" class="kg-image" alt="" loading="lazy" width="889" height="325" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2025/12/SCR-20251204-nvcj-2.png 600w, https://timescale.ghost.io/blog/content/images/2025/12/SCR-20251204-nvcj-2.png 889w" sizes="(min-width: 720px) 720px"></figure><p>Restart your AI assistant after installation.&nbsp;</p><p>* We are constantly adding interactive installations for new coding assistants!&nbsp;</p><h3 id="adding-to-cursor-via-ui">Adding to Cursor via UI</h3><p>If you prefer to configure Cursor (or other IDEs) manually instead of using <code>tiger mcp install</code>:</p><ol><li>Open Cursor Settings</li><li>Look for “Tools &amp; MCP” on the left sidebar</li><li>Click "Add MCP server"</li><li>Enter the following configuration:&nbsp;<ul><li><strong>Name:</strong> <code>tiger</code></li><li><strong>Command:</strong> <code>tiger</code></li><li><strong>Arguments:</strong> <code>mcp</code>, <code>start</code></li></ul></li></ol><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2025/12/SCR-20251204-nrjp-2.png" class="kg-image" alt="" loading="lazy" width="630" height="402" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2025/12/SCR-20251204-nrjp-2.png 600w, https://timescale.ghost.io/blog/content/images/2025/12/SCR-20251204-nrjp-2.png 630w"></figure><ol start="5"><li>Click "Save" and restart Cursor</li></ol><p>Once configured, you'll see the Tiger MCP server listed in your MCP servers. 
The green indicator shows it's connected and ready.</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2025/12/SCR-20251204-odoy-2.png" class="kg-image" alt="" loading="lazy" width="872" height="768" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2025/12/SCR-20251204-odoy-2.png 600w, https://timescale.ghost.io/blog/content/images/2025/12/SCR-20251204-odoy-2.png 872w" sizes="(min-width: 720px) 720px"></figure><h3 id="what-you-can-do">What You Can Do</h3><p>Once installed, your AI assistant has access to tools like:</p><ul><li><code>service_list</code> — List all your database services</li><li><code>service_get</code> — Get details about a specific service</li><li><code>service_create</code> — Create a new database</li><li><code>db_execute_query</code> — Run SQL queries against any service</li></ul><p>For a full list of tools available to you and your agent, see the <a href="https://github.com/timescale/tiger-cli?tab=readme-ov-file#available-mcp-tools"><u>Tiger CLI README</u></a> on GitHub.&nbsp;</p><p>For example, you can ask your AI assistant: "Show me all my Tiger Cloud services" or "Run <code>SELECT count(*) FROM events</code> on my production database."</p><p>The MCP server uses your existing CLI authentication, so there's no extra setup after <code>tiger auth login</code>.</p><h2 id="turn-your-ai-into-a-postgres-expert">Turn Your AI Into a Postgres Expert</h2><p>A new pattern of working with SQL has emerged as agentic coding has exploded in popularity.&nbsp;</p><p>You can now simply tell an LLM what you want to do with your database, and it will write the SQL for you. 
You no longer need to shed sweat or tears (or live in fear of, say, accidentally dropping a table, although this is still <a href="https://fortune.com/2025/07/23/ai-coding-tool-replit-wiped-database-called-it-a-catastrophic-failure/"><u>completely within the realm of possibility</u></a> when working with an agent) to learn how to write SQL.&nbsp;</p><p>But this has spawned a new problem: AI-generated SQL works… until it doesn’t.&nbsp;</p><p>Your schema may pass tests, your queries may run… but six months later, as your application scales to millions of users, everything slows to a crawl.&nbsp;</p><p>The problem is that LLMs are trained on millions of lines of SQL, absorbed from a billion blog posts (of varying quality), so while they “know” SQL, they don’t know which patterns actually scale. There are also hundreds of different SQL dialects, with dozens of versions each.&nbsp;</p><p>It’s no wonder AI agents may not write the best SQL.&nbsp;</p><p>We’ve personally seen many common AI-generated mistakes, including:&nbsp;</p><ul><li>Using <code>VARCHAR(255)</code> instead of <code>TEXT</code> (the length limit doesn't help performance in Postgres)</li><li>Using <code>SERIAL</code> instead of <code>BIGINT GENERATED ALWAYS AS IDENTITY</code></li><li>Missing indexes on foreign key columns (Postgres doesn't create these automatically)</li><li>Using <code>TIMESTAMP</code> instead of <code>TIMESTAMPTZ</code> (timezone handling is painful to fix later)</li></ul><p>These mistakes won't raise syntax or linter errors, and your tests will still pass. But fixing them later, once your app is handling millions of users, means painful migrations, downtime, and explaining to your CEO why the database needs to go down for maintenance. 
Sound familiar?&nbsp;</p><p>The Tiger MCP server ships with a variation of <a href="https://www.tigerdata.com/blog/free-postgres-mcp-prompt-templates"><u>Skills</u></a> (a <a href="https://www.claude.com/blog/skills"><u>standard</u></a> built by Anthropic) written by our most senior and experienced Postgres engineers. When your AI needs to design a schema, the MCP server automatically pulls the right “lessons” and then applies 30 years of Postgres best practices to your database design.&nbsp;</p><h3 id="available-skills">Available Skills</h3><p>The MCP server includes skills for common Postgres and TimescaleDB tasks:&nbsp;</p><ul><li><code>design-postgres-tables</code>: Schema design with proper types, constraints, and indexes</li><li><code>setup-timescaledb-hypertables</code>: Hypertable configuration, compression, retention policies</li><li><code>migrate-postgres-tables-to-hypertables</code>: Converting existing tables to hypertables</li><li><code>find-hypertable-candidates</code>: Identifying which tables should become hypertables</li></ul><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2025/12/SCR-20251204-nzjm-2.png" class="kg-image" alt="" loading="lazy" width="583" height="356"></figure><p>Your AI discovers and uses these automatically based on what you ask it to do. You don’t need to call them explicitly!&nbsp;</p><p><em>We are continually adding new Skills. <strong>Want to request new ones or contribute your own?</strong> Feel free to </em><a href="https://github.com/timescale/pg-aiguide/issues"><em><u>create an issue</u></em></a><em> in our Skills GitHub repo.&nbsp;</em></p><h2 id="fork-any-database-in-seconds">Fork Any Database in Seconds</h2><p>Testing database migrations against production data is risky, but testing against fake data is risky too: you may miss edge cases. 
You need a real copy of your database, but without the cost or time of duplicating everything manually.&nbsp;</p><p>Tiger CLI lets you create instant, zero-copy forks of any database:&nbsp;</p><pre><code class="language-shell">tiger service fork &lt;service-id&gt; --name my-staging-db</code></pre><p>The fork is a point-in-time copy that shares underlying data blocks with the original. You only pay for blocks that change. This makes forks lightweight enough to spin up for a single test and throw away when you’re done.&nbsp;</p><p>Database forking is especially useful in an agentic context: if you are running multiple agents at once, each agent can work on its own fork without affecting the original database. Neat!&nbsp;</p><h3 id="example-workflow">Example Workflow</h3><pre><code class="language-shell"># List your services to find the source ID
tiger service list

# Fork your production database
tiger service fork --source-id svc_abc123 --name testing-migrations

# Connect to the fork
tiger db connect --service-id &lt;new-service-id&gt;

# Run your migration, test it, then delete the fork when done
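# e.g., apply a migration script to the fork with psql (illustrative; the file
# name is hypothetical, and --service-id usage mirrors tiger db connect above):
#   psql "$(tiger db connection-string --service-id &lt;new-service-id&gt;)" -f migration.sql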
tiger service delete &lt;new-service-id&gt;</code></pre><h2 id="search-postgres-docs-from-your-editor">Search Postgres Docs From Your Editor</h2><p>Postgres has been around for <a href="https://en.wikipedia.org/wiki/PostgreSQL"><u>almost 30 years</u></a>.</p><p>The documentation is incredibly extensive, spanning 18 versions and countless individual releases. A function that exists in Postgres 18 might not exist in 15. Syntax that worked in 14 now has better alternatives in 18.&nbsp;</p><p>Most AI-powered documentation search tools don’t account for these idiosyncrasies.&nbsp;</p><p>And if you skip docs search tools and rely on an LLM’s own training data, you run into training cutoffs, which generally lag by at least six months. LLMs often suggest older, less performant methods, or reference features that don’t exist in your version. This is a major problem I’ve personally experienced with LLMs writing React and Next.js in my side projects!&nbsp;</p><p>The Tiger MCP server includes search over PostgreSQL documentation from versions 14 to 18 and TimescaleDB docs for time series workloads. 
Your AI assistant can search version-specific docs without leaving your coding agent.&nbsp;</p><p>Your assistant gets access to:</p><ul><li><code>semantic_search_postgres_docs</code>: Search PostgreSQL documentation (versions 14-18)</li><li><code>semantic_search_tiger_docs</code>: Search Tiger Cloud and TimescaleDB documentation</li></ul><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2025/12/SCR-20251204-olha-2.png" class="kg-image" alt="" loading="lazy" width="1101" height="786" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2025/12/SCR-20251204-olha-2.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2025/12/SCR-20251204-olha-2.png 1000w, https://timescale.ghost.io/blog/content/images/2025/12/SCR-20251204-olha-2.png 1101w" sizes="(min-width: 720px) 720px"></figure><h3 id="no-more-context-switching">No More Context Switching&nbsp;</h3><p>Instead of leaving your editor to search docs, you can ask your AI assistant directly:</p><ul><li>"How do I set up continuous aggregates in TimescaleDB?"</li><li>"What's the syntax for PostgreSQL window functions?"</li><li>"Show me how to configure compression policies"</li></ul><p>The assistant searches the actual docs and gives you accurate, up-to-date answers. You don’t even need to ask the agent to explicitly use the documentation search feature; it will just work.&nbsp;</p><p>P.S. This feature is enabled by default. If you ever need to disable it:</p><pre><code class="language-shell">tiger config set docs_mcp false</code></pre><h2 id="run-sql-queries-through-your-ai">Run SQL Queries Through Your AI</h2><p>You’re debugging an issue and need to check a row count, inspect a table’s schema, or run a quick query. You’re used to having to open a new terminal, dig up the connection string, connect via psql, run the query, and copy the results back (often jankily, thanks to text wrapping in the terminal). 
All these steps just for getting ONE number.&nbsp;</p><p>No more.&nbsp;</p><p>Once you’ve set up the MCP in your coding agent, your AI assistant can execute SQL queries (using the <code>db_execute_query</code> tool) directly against your databases.</p><p>This means you can stay in your editor and ask your AI:</p><ul><li>"How many events came in during the last 24 hours?"</li><li>"Show me the 10 most recent orders"</li><li>"What's the schema of the users table?"&nbsp;</li></ul><h3 id="example">Example</h3><p>Your AI writes the (performant) SQL, runs it, and returns the results. No more terminal switching, copy-pasting connection strings, or remembering the exact syntax for your <code>information_schema</code> queries.&nbsp;</p><h2 id="get-started-now">Get Started Now</h2><p>Install the Tiger CLI and MCP server:</p><pre><code class="language-shell">curl -fsSL https://cli.tigerdata.com | sh
tiger auth login
tiger mcp install</code></pre><p>Then select your AI assistant (Claude Code, Cursor, VS Code, Windsurf) and you're ready to go.</p><p>Don't have a Tiger Cloud account? <a href="https://console.cloud.timescale.com/signup"><u>Sign up for free</u></a> — no credit card required. Create your first database, then try out these CLI features.&nbsp;</p><h2 id="resources">Resources</h2><ol><li><a href="https://www.tigerdata.com/blog/postgres-for-agents"><u>Postgres for Agents</u></a>&nbsp;</li><li><a href="https://www.tigerdata.com/blog/free-postgres-mcp-prompt-templates"><u>How to Train Your Agent to Be a Postgres Expert</u></a></li></ol>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to Train Your Agent to Be a Postgres Expert]]></title>
            <description><![CDATA[Turn AI into a Postgres expert with our MCP server. Get 35 years of best practices, versioned docs, and prompt templates for production-ready schemas.]]></description>
            <link>https://www.tigerdata.com/blog/free-postgres-mcp-prompt-templates</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/free-postgres-mcp-prompt-templates</guid>
            <category><![CDATA[Announcements & Releases]]></category>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Matty Stratton]]></dc:creator>
            <pubDate>Wed, 22 Oct 2025 14:02:12 GMT</pubDate>
            <media:content medium="image" url="https://timescale.ghost.io/blog/content/images/2025/10/2025-Oct-21-Prompt-Template-Thumbnail.png">
            </media:content>
            <content:encoded><![CDATA[<h3 id="with-prompt-templates-and-versioned-docs-we-turn-35-years-of-postgres-wisdom-into-structured-knowledge-your-agent-can-reason-with">With prompt templates and versioned docs, we turn 35 years of Postgres wisdom into structured knowledge your Agent can reason with.</h3><p>Agents are the <a href="https://www.tigerdata.com/blog/postgres-for-agents" rel="noreferrer">new developer</a>. But they’re generalists.&nbsp;</p><p>What happens when they design your Postgres database? Your schema runs, your tests pass… and six months later your queries crawl and your costs skyrocket.&nbsp;</p><p>AI-generated SQL and database schemas are almost right. And that’s the problem. Fixing schema design mistakes is costlier than refactoring code. It often means multi-week migrations, downtime windows, rollback plans, and your CEO asking why the site is in maintenance mode. The root issue? LLMs don’t have the depth of Postgres and database expertise to let them build scalable systems. And when agents try to learn, they find documentation written for humans, not for them.&nbsp;</p><p>But agents don’t need more data; they need better context. They need to know what “good Postgres” actually looks like. The good news is that, given the right context and tools, agents can become instant experts. Even with Postgres.&nbsp;</p><p>That’s why we built an MCP server that provides 35 years of Postgres wisdom and full access to the Postgres docs, all in a format that agents can easily process.&nbsp;</p><p>And we think this just might be the best database MCP server ever built. While Neon, Supabase, and other Postgres companies created MCP servers as thin API wrappers, ours teaches AI how to think in Postgres. The Tiger MCP server gives AI tools that work automatically: no prompt engineering or manual lookups needed. You just ask. And it provides correct, idiomatic Postgres. 
</p><p>Our new MCP server ships with detailed prompt templates written by our senior engineers, plus versioned Postgres (15-18) docs, and <a href="https://docs.tigerdata.com/" rel="noreferrer">TimescaleDB docs</a>.&nbsp;When your AI needs to design a schema, it automatically pulls the right template and searches the docs (using hybrid search) to generate code that actually lasts. And of course, it handles the basics: start, stop, <a href="https://www.tigerdata.com/blog/fast-zero-copy-database-forks" rel="noreferrer">fork a database</a>. </p><p>But the real magic happens when you ask AI to design your schema. So let’s see what happens when AI designs a simple schema. First without the MCP server, then with it.</p><h2 id="without-the-mcp-server">Without the MCP Server</h2><pre><code class="language-SQL">CREATE TABLE customers (
    customer_id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    email VARCHAR(255) UNIQUE NOT NULL,
    password_hash VARCHAR(255) NOT NULL,
    first_name VARCHAR(100) NOT NULL,
    last_name VARCHAR(100) NOT NULL,
    phone VARCHAR(20),
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    is_active BOOLEAN DEFAULT TRUE
);

CREATE INDEX idx_customers_email ON customers(email);

CREATE TABLE authors (
    author_id SERIAL PRIMARY KEY,
    first_name VARCHAR(100) NOT NULL,
    last_name VARCHAR(100) NOT NULL,
    biography TEXT,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);
</code></pre><p>What’s wrong here:&nbsp;</p><ul><li>Mixed ID types (<code>SERIAL</code> and <code>UUID</code>) with no clear reason</li><li>Arbitrary <code>VARCHAR</code> limits instead of <code>TEXT</code></li><li>Case-sensitive email index → duplicate accounts</li><li><code>CURRENT_TIMESTAMP</code> instead of the more idiomatic <code>now()</code></li></ul><p>Someone new to Postgres may not notice these problems. Until it’s too late.</p><p>Now let’s see what AI generates with our new MCP server.</p><h2 id="with-the-mcp-server">With the MCP Server</h2><pre><code class="language-SQL">CREATE TABLE authors (
  author_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  name TEXT NOT NULL,
  bio TEXT,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX ON authors (name);

CREATE TABLE users (
  user_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  email TEXT NOT NULL UNIQUE,
  password_hash TEXT NOT NULL,
  first_name TEXT NOT NULL,
  last_name TEXT NOT NULL,
  phone TEXT,
  is_active BOOLEAN NOT NULL DEFAULT true,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE UNIQUE INDEX ON users (LOWER(email));
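-- Illustrative: the expression index above makes case-insensitive email
-- lookups indexable, e.g.
--   SELECT user_id FROM users WHERE LOWER(email) = LOWER('Alice@Example.com');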
CREATE INDEX ON users (created_at);</code></pre><p>What’s better about this?</p><ul><li>Consistent ID strategy with BIGINT GENERATED ALWAYS AS IDENTITY</li><li>TEXT instead of arbitrary VARCHAR limits</li><li>Case-insensitive email lookups</li><li>Modern timestamp handling</li></ul><p>But why does this matter?</p><p>Each of these differences creates a compounding problem. Changing datatypes in the future will require full table rewrites. Missing lowercase email handling means duplicate accounts and confused users. And time zones? Every senior developer gets the thousand-yard stare when you mention UTC conversions.</p><p>This is just with a small example; imagine what would happen with more complex schemas.</p><p>And if you don’t believe us, here’s what Claude has to say:</p><pre><code class="language-markdown">&gt; Please describe the schema you would create for an e-commerce website two times, first with the tiger mcp server disabled, then with the tiger mcp server enabled. For each time, write the schema to its own file in the current working directory. Then compare the two files and let me know which approach generated the better schema, using both qualitative and quantitative reasons. For this example, only use standard Postgres.</code></pre><figure class="kg-card kg-video-card kg-width-regular" data-kg-thumbnail="https://timescale.ghost.io/blog/content/media/2025/10/how-to-train-your-agent_thumb.jpg" data-kg-custom-thumbnail="">
            <div class="kg-video-container">
                <video src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/media/2025/10/how-to-train-your-agent.mp4" poster="https://img.spacergif.org/v1/1280x720/0a/spacer.png" width="1280" height="720" loop="" autoplay="" muted="" playsinline="" preload="metadata" style="background: transparent url('https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/media/2025/10/how-to-train-your-agent_thumb.jpg') 50% 50% / cover no-repeat;"></video>
                <div class="kg-video-overlay">
                    <button class="kg-video-large-play-icon" aria-label="Play video">
                        <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
                            <path d="M23.14 10.608 2.253.164A1.559 1.559 0 0 0 0 1.557v20.887a1.558 1.558 0 0 0 2.253 1.392L23.14 13.393a1.557 1.557 0 0 0 0-2.785Z"></path>
                        </svg>
                    </button>
                </div>
                <div class="kg-video-player-container kg-video-hide">
                    <div class="kg-video-player">
                        <button class="kg-video-play-icon" aria-label="Play video">
                            <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
                                <path d="M23.14 10.608 2.253.164A1.559 1.559 0 0 0 0 1.557v20.887a1.558 1.558 0 0 0 2.253 1.392L23.14 13.393a1.557 1.557 0 0 0 0-2.785Z"></path>
                            </svg>
                        </button>
                        <button class="kg-video-pause-icon kg-video-hide" aria-label="Pause video">
                            <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
                                <rect x="3" y="1" width="7" height="22" rx="1.5" ry="1.5"></rect>
                                <rect x="14" y="1" width="7" height="22" rx="1.5" ry="1.5"></rect>
                            </svg>
                        </button>
                        <span class="kg-video-current-time">0:00</span>
                        <div class="kg-video-time">
                            /<span class="kg-video-duration">0:50</span>
                        </div>
                        <input type="range" class="kg-video-seek-slider" max="100" value="0">
                        <button class="kg-video-playback-rate" aria-label="Adjust playback speed">1×</button>
                        <button class="kg-video-unmute-icon" aria-label="Unmute">
                            <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
                                <path d="M15.189 2.021a9.728 9.728 0 0 0-7.924 4.85.249.249 0 0 1-.221.133H5.25a3 3 0 0 0-3 3v2a3 3 0 0 0 3 3h1.794a.249.249 0 0 1 .221.133 9.73 9.73 0 0 0 7.924 4.85h.06a1 1 0 0 0 1-1V3.02a1 1 0 0 0-1.06-.998Z"></path>
                            </svg>
                        </button>
                        <button class="kg-video-mute-icon kg-video-hide" aria-label="Mute">
                            <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
                                <path d="M16.177 4.3a.248.248 0 0 0 .073-.176v-1.1a1 1 0 0 0-1.061-1 9.728 9.728 0 0 0-7.924 4.85.249.249 0 0 1-.221.133H5.25a3 3 0 0 0-3 3v2a3 3 0 0 0 3 3h.114a.251.251 0 0 0 .177-.073ZM23.707 1.706A1 1 0 0 0 22.293.292l-22 22a1 1 0 0 0 0 1.414l.009.009a1 1 0 0 0 1.405-.009l6.63-6.631A.251.251 0 0 1 8.515 17a.245.245 0 0 1 .177.075 10.081 10.081 0 0 0 6.5 2.92 1 1 0 0 0 1.061-1V9.266a.247.247 0 0 1 .073-.176Z"></path>
                            </svg>
                        </button>
                        <input type="range" class="kg-video-volume-slider" max="100" value="100">
                    </div>
                </div>
            </div>
            
        </figure><pre><code class="language-markdown">Verdict:

The Tiger MCP-enabled schema is objectively superior because:

1. Type Safety: Uses modern PostgreSQL types that prevent common errors
2. Data Integrity: 4x more constraints ensure data quality at the database level
3. Performance: Better indexing strategy with 55% more indexes, including partial and expression indexes
4. Maintainability: Better documentation, clear comments, proper naming
5. Features: More complete e-commerce functionality (discounts, full-text search, audit trails)
6. PostgreSQL Best Practices: Follows official PG17 recommendations from the documentation
7. Future-Proof: Uses modern features like GENERATED ALWAYS AS IDENTITY and NULLS NOT DISTINCT

The Tiger MCP server provided access to PostgreSQL-specific best practices documentation and design patterns that resulted in a more robust, performant, and maintainable schema.
</code></pre><h2 id="how-prompt-templates-make-it-all-possible">How Prompt Templates Make It All Possible</h2><p>Behind the scenes, AI uses the MCP server to call <code>get_prompt_template('design_postgres')</code> to load schema design guidance. No pasted docs. No corrections. Just better code.</p><p>Prompt templates turn production wisdom into reusable guidance for AI. Instead of hunting through documentation written for humans, AI gets the information it needs in a format built for machines.</p><p>It comes down to the fact that humans and LLMs have opposite needs. Humans need narratives and memorable examples (and yes, even cat memes) to help them retain information. LLMs need to preserve context window space. That’s why prompt templates make terrible blog posts, but perfect AI guidance.</p><p>Our philosophy is: don't re-teach what the model already knows. LLMs have seen millions of lines of SQL. They know how to write CREATE TABLE. What they don’t know is the 35 years of Postgres wisdom about what works well and what doesn’t.</p><p>It's like your senior DBA whispering advice in the model's ear.</p><p>Our schema design template, <code>design_postgres_tables</code>, doesn’t explain what a primary key is. It jumps straight to guidance:</p><p>“Prefer <code>BIGINT GENERATED ALWAYS AS IDENTITY</code>; use <code>UUID</code> only when global uniqueness is needed.”</p><p>For data types, it doesn’t teach from scratch. It just tells you what works:</p><p>“DO NOT use <code>money</code> type; DO use <code>numeric</code> instead.”</p><p>Here’s a real snippet from the template:</p><pre><code class="language-markdown">## Postgres "Gotchas"

- **FK indexes**: Postgres **does not** auto-index FK columns. Add them.
- **No silent coercions**: length/precision overflows error out (no truncation). 
  Example: inserting 999 into `NUMERIC(2,0)` fails with error, unlike some 
  databases that silently truncate or round.
- **Heap storage**: no clustered PK by default (unlike SQL Server/MySQL InnoDB); 
  row order on disk is insertion order unless explicitly clustered.</code></pre><p>These gotchas trip up LLMs the same way they trip up developers new to Postgres. We optimized these templates for machines: short, factual, and precise, packing maximum guidance into minimum tokens.&nbsp;</p><p>We tested the same approach on a real IoT schema design task. Without templates, the AI added forbidden configurations and missed critical optimizations. <em>With</em> templates, it generated production-ready code with compression, continuous aggregates, and tuned performance.</p><p>That’s how prompt templates work. Now let’s see how the MCP server makes it all happen.</p><h2 id="how-this-mcp-server-is-smarter-than-others">How This MCP Server is Smarter Than Others</h2><p>While Neon, Supabase, and other Postgres companies created MCP servers as thin API wrappers, ours teaches AI how to think in Postgres. The Tiger MCP server gives AI tools that work automatically: no prompt engineering or manual lookups needed. You just ask. And it provides correct, idiomatic Postgres.</p><p><strong><code>get_prompt_template</code> provides auto-discovered expertise. </strong>Instead of having to call a template explicitly, you just say “I want to make a schema for IoT devices…” and the MCP server figures it out. </p><p>With self-discoverable templates, the AI can detect intent and load the right recipe, applying 35 years of Postgres best practices behind the scenes. </p><p><strong>The templates have real depth. </strong>No scraped snippets or boilerplate. The templates are written by senior Postgres engineers and provide opinionated, production-tested guidance tuned to avoid the traps that seasoned DBAs know well.</p><p><strong>Postgres-native vector retrieval adds the right context.</strong> When the AI needs more information, the MCP server searches the versioned Postgres (15-18) and TimescaleDB docs. 
And it uses Postgres itself for storage and vector search.</p><p>Versioning is critical. For example, Postgres 15 introduced UNIQUE NULLS NOT DISTINCT, while 16 improved parallel queries, and 17 changed COPY error handling. The MCP keeps AIs grounded in correct syntax every time, avoiding broken code from the wrong version.</p><p>The Tiger MCP doesn’t just wire up APIs. It teaches AI to think like a real Postgres engineer. </p><p>You don’t have to craft the perfect prompt. You just ask, and it does the right thing.</p><h2 id="see-it-for-yourself">See It For Yourself</h2><p>Install the Tiger CLI and MCP server:</p><pre><code class="language-shell">curl -fsSL https://cli.tigerdata.com | sh
tiger auth login
tiger mcp install</code></pre><p>(We also have alternative <a href="https://github.com/timescale/tiger-cli"><u>installation instructions</u></a> for the CLI tool.)</p><p>Then select your AI assistant (Claude Code, Cursor, VS Code, Windsurf, etc.) and immediately get real Postgres knowledge flowing into your AI.</p><p>This is how Postgres becomes the best database to use with AI coding tools: not by accident, not because someone pasted docs into a chat, but because the tooling now teaches AI how to think in Postgres.&nbsp;</p><p>Try the MCP server. Break it. <a href="https://timescaledb.slack.com/join/shared_invite/zt-38c4rrt9t-eR8I4hnb4qeGLUrL6hM3mA#/shared-invite/email"><u>Improve it</u></a>. Help us teach every AI to write real Postgres.</p><hr><p><strong>About the authors</strong></p><p><strong>Matty Stratton</strong></p><p>Matty Stratton is the Head of Developer Advocacy and Docs at Tiger Data, a well-known member of the DevOps community, founder and co-host of the popular <a href="https://www.arresteddevops.com/"><u>Arrested DevOps</u></a> podcast, and a global organizer of the <a href="https://devopsdays.org"><u>DevOpsDays</u></a> set of conferences.</p><p>Matty has over 20 years of experience in IT operations and is a sought-after speaker internationally, presenting at Agile, DevOps, and cloud engineering-focused events worldwide. Demonstrating his keen insight into the changing landscape of technology, he recently changed his license plate from DEVOPS to KUBECTL.</p><p>He lives in the Chicagoland area and has three awesome kids and two Australian Shepherds, whom he loves just a little bit more than he loves Diet Coke.</p><p><strong>Matvey Arye</strong></p><p><a href="https://www.linkedin.com/in/matvey-arye/"><u>Matvey Arye</u></a> is a founding engineering leader at Tiger Data (creators of TimescaleDB), the premier provider of relational database technology for time-series data and AI. 
Currently, he manages the team at Tiger Data responsible for building the go-to developer platform for AI applications.&nbsp;</p><p>Under his leadership, the Tiger Data engineering team has introduced partitioning, compression, and incremental materialized views for time-series data, plus cutting-edge indexing and performance innovations for AI.&nbsp;</p><p>Matvey earned a Bachelor's degree in Engineering at The Cooper Union. He earned a Doctorate in Computer Science at Princeton University, where his research focused on cross-continental data analysis, covering issues such as networking, approximate algorithms, and performant data processing.&nbsp;</p><p><strong>Jacky Liang</strong></p><p><a href="https://www.linkedin.com/in/jjackyliang/"><u>Jacky Liang</u></a> is a developer advocate at Tiger Data with an obsession with AI and LLMs. He's worked at Pinecone, Oracle Cloud, and Looker Data as both a software developer and product manager, which has shaped the way he thinks about software.&nbsp;</p><p>He cuts through AI hype to focus on what actually works. How can we use AI to solve real problems? What tools are worth your time? How will this technology actually change how we work?&nbsp;</p><p>When he's not writing or speaking about AI, Jacky builds side projects and tries to keep up with the endless stream of new AI tools and research—an impossible task, but he keeps trying anyway. His model of choice is Claude Sonnet 4 and his favorite coding tool is Claude Code.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Blocked Bloom Filters: Speeding Up Point Lookups in Tiger Postgres' Native Columnstore]]></title>
            <description><![CDATA[Learn how blocked Bloom filters in TimescaleDB 2.20 deliver up to 100× faster point lookups, speeding up columnar queries without manual tuning.]]></description>
            <link>https://www.tigerdata.com/blog/blocked-bloom-filters-speeding-up-point-lookups-in-tiger-postgres-native-columnstore</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/blocked-bloom-filters-speeding-up-point-lookups-in-tiger-postgres-native-columnstore</guid>
            <category><![CDATA[TimescaleDB]]></category>
            <category><![CDATA[Product & Engineering]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Jacky Liang]]></dc:creator>
            <pubDate>Wed, 18 Jun 2025 12:00:00 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2025/06/IMG_8672.png">
            </media:content>
<content:encoded><![CDATA[<div class="kg-card kg-callout-card kg-callout-card-yellow"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text">This is the first post in a technical deep dive series that explores the ways we are building the fastest Postgres.</div></div><p>Database storage is a study in locality.</p><ul><li><strong>Row stores</strong> keep all fields of a record together, with operations only on full rows at a time.</li><li><strong>Column stores</strong> organize each column’s values in compressed blocks, allowing operations to target specific columns.</li></ul><p>This trade-off is structural, not just cosmetic. Row layout is great for fast inserts and lookups. Column layout excels at filters, aggregations, and scans over a large number of rows, but on a smaller number of columns (as long as your query plays by the rules).</p><p>But here’s the catch: <strong>Columnstores are only fast at filtering when your predicate aligns with the physical sort order. </strong>If your data is ordered by time, or clustered by a <code>segmentby</code> (like customer ID or device), the engine can skip large blocks. But if you filter on an unsorted field (like a trace ID, transaction UUID, or error code), there’s often no optimization to exploit. The engine has to decompress every block and scan every value, just in case there is a match.</p><p>TimescaleDB combines both layouts into a hybrid table model, with recent records written to the rowstore and automatically migrated to the columnstore over time. This design closely aligns with real-world query patterns: fresh data is updated frequently, while older data is rarely updated and mostly used in aggregate. We’ve already implemented columnar mutability, but one challenge remained: sometimes you need to filter on unsorted fields across terabytes of columnar data.&nbsp;</p><p>And it’s not hypothetical; we’ve seen it in the wild many times. 
Dashboards hang while users query a UUID, waiting as the engine churns through thousands of compressed blocks. The data is there. The filter is simple. But the system grinds.</p><p>That isn’t acceptable. At Tiger, we’re here to deliver <strong>speed without sacrifice</strong>.</p><p>So we implemented <strong>blocked bloom filters</strong>, and our users already love them:&nbsp;</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2025/06/Screenshot-2025-06-19-at-8.08.32-pm.png" class="kg-image" alt="" loading="lazy" width="1612" height="216" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2025/06/Screenshot-2025-06-19-at-8.08.32-pm.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2025/06/Screenshot-2025-06-19-at-8.08.32-pm.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2025/06/Screenshot-2025-06-19-at-8.08.32-pm.png 1600w, https://timescale.ghost.io/blog/content/images/2025/06/Screenshot-2025-06-19-at-8.08.32-pm.png 1612w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Our community member </span><a href="https://github.com/pantonis"><u><span class="underline" style="white-space: pre-wrap;">@pantonis</span></u></a><span style="white-space: pre-wrap;"> saw </span><b><strong style="white-space: pre-wrap;">100× faster lookups</strong></b><span style="white-space: pre-wrap;"> after upgrading to TimescaleDB 2.20</span></figcaption></figure><p>P.S. 
If you're using TimescaleDB 2.20 or later (or Tiger Postgres on <a href="https://www.tigerdata.com/cloud"><u>Tiger Cloud</u></a>), bloom filters are already speeding up lookups on sparsely distributed UUIDs, enums, and text fields by up to 100x.</p><h2 id="the-challenge-of-point-lookups-in-columnar-storage">The Challenge of Point Lookups in Columnar Storage</h2><p>If you've worked with large-scale time-series or analytics workloads, you've probably experienced this pain. You're querying 10TB of trace data to find a single ID like '550e8400-e29b-41d4-a716-446655440000'. Your database starts churning through millions of batches, reading and decompressing terabytes of data. Minutes and hours tick by. Your application times out. Users complain because they needed this report an hour ago (everything is urgent).&nbsp;</p><p>Imagine you’re looking for a needle in <strong><em>hundreds of compressed bundles of haystacks</em></strong>—you have to unbundle, loosen, and search through every single bundle because you don't know which one contains your needle.&nbsp;</p><p>This happens because columnar databases store data in sorted batches, compress each column separately, and use ordering metadata to skip irrelevant batches. This works perfectly when you're querying by the same column you sorted by:</p><pre><code class="language-SQL">-- This works well - time-based query on ordered data
SELECT * FROM metrics
  WHERE timestamp BETWEEN '2024-01-01' AND '2024-01-02';</code></pre><p>But it completely breaks down for uncorrelated columns:</p><pre><code class="language-SQL">-- This is painful - random ID query on a non-segmented column
SELECT * FROM metrics 
  WHERE trace_id = '550e8400-e29b-41d4-a716-446655440000';</code></pre><p>The thing with UUIDs (except for UUIDv7, keep an eye out for support coming soon) is that they're completely random; it doesn't make sense to order them. When your data is sorted by time but you're searching by trace ID, every batch now contains a random mix of IDs. So ordering becomes useless, and the database can't skip any batches.&nbsp;</p><h2 id="what-is-a-bloom-filter">What is a Bloom Filter?&nbsp;</h2><p>This is where bloom filters help—they're additional metadata that can efficiently answer "is this value definitely not in this batch?" without actually storing a reference to each value.</p><p>A bloom filter is a small-yet-efficient data structure that uses an array of bits and hash functions to quickly test if something might be in a set or not.&nbsp;</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2025/06/bloom_filter_basic.CNfSrq8t_ZWI3Wl-1.svg" class="kg-image" alt="" loading="lazy" width="498" height="474"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Bloom filter illustration courtesy of </em></i><a href="https://www.bytedrum.com/posts/bloom-filters/"><u><i><em class="italic underline" style="white-space: pre-wrap;">Bytedrum: Bloom Filters</em></i></u></a><i><em class="italic" style="white-space: pre-wrap;">, a great visual explainer.&nbsp;</em></i></figcaption></figure><p>Bloom filters can say something is "definitely not there" or "might be there", and crucially, they never say "it’s missing" when something is actually there.</p><p>Using Spotify’s playlist feature as an example, when Spotify needs to check if a song is in one of your playlists, instead of scanning through every song in every playlist (which means reading every playlist from storage—stupidly expensive with billions of songs), they use a bloom filter—a compact “summary” that can instantly say whether a song is “definitely not in 
this playlist” or “might be in this playlist”.&nbsp;</p><p>For the “might be” cases, Spotify then uses traditional seek methods to check the actual playlist data.&nbsp;</p>
<!--kg-card-begin: html-->

  <iframe 
    src="https://spotify-bloom-filter.vercel.app" 
    width="100%" 
    height="1000" 
    frameborder="0"
    scrolling="yes"
    loading="lazy"
    style="display: block; border: none; background: #000;">
  </iframe>
<!--kg-card-end: html-->
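<p>The two-step lookup the demo above illustrates can be sketched in a few lines of Python. This is a hypothetical, self-contained example (the function names and batch layout are invented; in TimescaleDB the pre-check and scan happen inside the engine): a batch is scanned only when its filter says the value might be present.</p>

```python
import hashlib

def bit_positions(value, m=1024, k=4):
    # Stand-in bloom hash: k salted digests -> a set of k bit positions.
    return {int.from_bytes(hashlib.sha256(f"{i}:{value}".encode()).digest()[:8],
                           "big") % m
            for i in range(k)}

def build_filter(batch):
    # A batch's filter is the union of the positions of all its values.
    positions = set()
    for value in batch:
        positions |= bit_positions(value)
    return positions

def lookup(batches, filters, value):
    """Return (matches, scanned): only batches whose filter says
    'might be here' pay for the expensive membership scan."""
    matches, scanned = [], 0
    for batch, flt in zip(batches, filters):
        if not bit_positions(value) <= flt:
            continue  # "definitely not here": skip this batch entirely
        scanned += 1  # analogous to decompressing and scanning the batch
        if value in batch:
            matches.append(batch)
    return matches, scanned
```

<p>With, say, 50 playlists of 20 songs each, looking up one song typically scans only the single playlist that can contain it and skips the other 49, which is exactly the I/O saving described below.</p>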
<p>[Interactive Spotify Bloom Filter demo: <a href="https://spotify-bloom-filter.vercel.app/"><u>https://spotify-bloom-filter.vercel.app/</u></a>]&nbsp;</p><p>This may not sound that useful, but when dealing with massive-scale workloads of millions of playlists and billions of songs, using a bloom filter eliminates 95%+ of linear-time playlist scans, turning minutes of searching into milliseconds.&nbsp;</p><p>Obviously, no data structure is catch-free: it may occasionally check a playlist unnecessarily, at around a 2% false positive rate (more on this later). But this is an acceptable tradeoff as you still get massive I/O savings across your entire system.&nbsp;</p><h2 id="how-we-added-bloom-filters-into-the-columnstore">How We Added Bloom Filters into the Columnstore</h2><p>Here's how TimescaleDB solves this: bloom filters act as a quick pre-check. Before reading any batch from disk, TimescaleDB checks a tiny bloom filter in memory that says "this ID is definitely not in this batch" or "this ID might be in this batch." This lets us skip 95%+ of batches instantly.</p><p>For the few batches that might contain your ID, TimescaleDB reads them from disk and processes them efficiently using vectorized operations (SIMD) that check many rows at once—much faster than Postgres's traditional row-by-row approach. 
But the real win is avoiding the I/O in the first place.</p><h3 id="no-manual-configuration-needed">No manual configuration needed</h3><p>When building with TimescaleDB, you <strong><em>don’t need to worry</em></strong> about when to use bloom filters or min/max indexes, because we automatically choose for you based on your column types!&nbsp;</p><p>For columns that you use in your table ordering (like timestamps and numbers used in range queries), we stick with the min/max method because we know that scans will be in order.</p><p>For random things like text fields, UUIDs, enum types (or basically anything else that supports Postgres hash indexes), we will create bloom filters automatically (as long as you have the column indexed in the rowstore with a btree, hash, or brin index).&nbsp;</p><h3 id="diving-deeper%E2%80%9Cblocked-bloom-filters%E2%80%9D">Diving deeper - “blocked bloom filters”</h3><p>TimescaleDB uses a technique called a "blocked bloom filter", where each bloom filter starts at about 16 kilobits (2 KB) per batch (sized for up to 1,000 items) with a 2% false positive rate and uses 6 different hash functions per value.&nbsp;</p><p>This size isn’t random—it’s calculated based on math. Here’s a <a href="https://hur.st/bloomfilter/"><u>handy calculator</u></a> you can try out yourself.&nbsp;</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2025/06/SCR-20250611-otwm-2-2.png" class="kg-image" alt="" loading="lazy" width="882" height="1004" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2025/06/SCR-20250611-otwm-2-2.png 600w, https://timescale.ghost.io/blog/content/images/2025/06/SCR-20250611-otwm-2-2.png 882w" sizes="(min-width: 720px) 720px"></figure><p>For 1,000 items with a 2% false positive rate, the optimal formula gives us ~8K bits, but we round up to ~16k bits to enable our folding compression trick (we will get to this below!). 
This sizing ensures we get exactly the false positive rate we want while keeping the filters small enough to stay fast in memory.</p><p>The "blocked" part is a performance technique—instead of spreading hash bits all over a huge array, TimescaleDB keeps all the bits for one value within a 256-bit block. This fits nicely in your CPU cache and makes everything faster.&nbsp;</p><p>For hashing, TimescaleDB primarily uses a modern library called <a href="https://github.com/backtrace-labs/umash"><u>UMASH</u></a> that's faster than Postgres's built-in hashing, but falls back to the Postgres version for custom data types or older processors. There is a funny story here on interoperability that I’ll share on socials!</p><h3 id="achieving-250x-space-savings">Achieving 250x space savings&nbsp;</h3><p>Here’s where we squeezed additional space and performance benefits out of bloom filters: when a batch doesn’t have many unique values (for example [0, 0, 0, 0, 1, …, 0, 0]), TimescaleDB can compress the bloom filter by “folding” it in half using <a href="https://en.wikipedia.org/wiki/Bitwise_operation#OR"><u>bitwise OR</u></a> operations.&nbsp;</p><p>It can keep folding until the filter shrinks from 16 kilobits down to just 64 bits (8 bytes) for columns with few unique values, also known as low-cardinality columns. We can do this because most of the bits in the batch are zero, so folding concentrates the few set bits without significantly increasing false positives.&nbsp;</p><h2 id="query-walkthrough">Query Walkthrough</h2><p>Here’s an example to explain how queries work step-by-step. Let’s use the following trace ID search query:&nbsp;</p><pre><code class="language-SQL">SELECT * FROM metrics 
   WHERE trace_id = 'abc123'</code></pre><ol><li>TimescaleDB first checks the bloom filters (which are likely to be cached in memory using the traditional Postgres buffer manager) for every batch in your columnstore. </li><li>For each batch, the bloom filter either says “definitely not here” or “might be here”. The database immediately skips all the “definitely not here” batches. </li><li>For the “might be here” batches, TimescaleDB reads them from disk, decompresses them, and scans the actual data (the expensive part). If it was a false positive (that 2% chance we mentioned earlier), no match gets found, and the query just continues running normally.&nbsp;</li></ol><p>The key here is that false positives are okay and don’t impact performance, because even when we hit one false positive, we are still avoiding massive amounts of unnecessary I/O by not having to go through every batch from the get-go.&nbsp;</p><p>A side benefit of our implementation of bloom filters is that the bloom filter metadata is more likely to stay hot in memory. When there are concurrent workloads where different users are querying different parts of your dataset, every query can quickly eliminate most batches without touching slower, more expensive storage.&nbsp;</p><h2 id="where-bloom-filters-excel">Where Bloom Filters Excel</h2><p>Bloom filters excel at large time-series datasets queried by non-temporal identifiers.</p><p>Think about scenarios where you're storing massive amounts of data over time, but you need to find specific records using IDs, addresses, or other fixed identifiers.</p><p><strong>For financial services teams.</strong> Your customers are trying to resolve a failed payment. They enter a transaction reference number into your search. Nothing loads. Your backend query scans years of data just to return a single match, and the user waits ...</p><pre><code class="language-SQL">-- Finding financial transactions by reference
SELECT * FROM payments 
  WHERE transaction_ref = 'TXN-2024-001234';</code></pre><p><strong>For IoT platform teams.</strong> Your dashboard shows live sensor data, but one widget is blank. It’s querying a single sensor reading by ID, and your backend is scanning billions of rows to find it. Users refresh the page. Nothing. The spinner keeps spinning.</p><pre><code class="language-SQL">-- Device-specific IoT data queries
SELECT * FROM sensor_data 
  WHERE reading_id = '73e98d71-5eb7-4018-ace7-1f4490da654a';</code></pre><p><strong>For teams working on blockchain analytics. </strong>Your analytics engine scans millions of blocks to find transactions from a wallet address. API timeouts leave users thinking your service is broken.</p><pre><code class="language-SQL">-- Looking up blockchain transactions by wallet address
SELECT * FROM transactions 
  WHERE from_address = '0x742d35Cc6634C0532925a3b8D';</code></pre><p>Bloom filters turn these queries from minutes or hours into milliseconds by eliminating the need to decompress and scan billions of rows. Instead of checking every batch in your columnstore, you skip 95% of them and only decompress the ones that might actually contain your data.</p><h2 id="performance-numbers-find-specific-values-up-to-100x-faster">Performance Numbers: Find Specific Values Up to 100x Faster&nbsp;</h2><p>Instead of telling you why our bloom filter implementation is great, let's just show you some numbers we got after we ran our benchmarks.&nbsp;</p><pre><code>SELECT min(sent), max(sent), count(*)
  FROM hackers 
  WHERE subject = 'unsubscribe' 
  ORDER BY count(*) DESC 
  LIMIT 10;
</code></pre><p>Before bloom filters, this took 12ms. With bloom filters, it dropped to 2.7ms—that's <strong>3.5x faster</strong>.</p><p>Or consider this blockchain address lookup:</p><pre><code>SELECT * FROM token_transfers 
  WHERE to_address = '0xe23d4eb73b399250301fb024019a734ba9f0d9b5';
</code></pre><p>This one went from 1.065 seconds down to 171.134 ms—a <strong>6x improvement</strong>.</p><p>And let's not forget the report from our user pantonis, who saw a massive <strong>100x improvement</strong>:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2025/06/Screenshot-2025-06-19-at-8.08.32-pm.png" class="kg-image" alt="" loading="lazy" width="1612" height="216" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2025/06/Screenshot-2025-06-19-at-8.08.32-pm.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2025/06/Screenshot-2025-06-19-at-8.08.32-pm.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2025/06/Screenshot-2025-06-19-at-8.08.32-pm.png 1600w, https://timescale.ghost.io/blog/content/images/2025/06/Screenshot-2025-06-19-at-8.08.32-pm.png 1612w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Our community member </span><a href="https://github.com/pantonis"><u><span class="underline" style="white-space: pre-wrap;">@pantonis</span></u></a><span style="white-space: pre-wrap;"> saw </span><b><strong style="white-space: pre-wrap;">100× faster lookups</strong></b><span style="white-space: pre-wrap;"> after upgrading to TimescaleDB 2.20</span></figcaption></figure><h2 id="where-bloom-filters-don%E2%80%99t-work">Where Bloom Filters Don’t Work&nbsp;</h2><p>Like all data structures, bloom filters have strengths and limitations.</p><p>Bloom filters work great when you're looking for exact matches—queries that use the equals sign (=) to find specific values. They also work with standard string comparisons where the rules are consistent.&nbsp;</p><pre><code>-- These work great with bloom filters

SELECT * FROM traces WHERE trace_id = 'abc-123-def';

SELECT * FROM orders WHERE email = 'user@example.com';

SELECT * FROM transactions WHERE status = 'completed';</code></pre><p>However, bloom filters have fundamental limitations and some current implementation restrictions. By design, they can't handle "not equal" searches (<code>&lt;&gt;</code>) or range queries (<code>&lt;</code> or <code>&gt;</code>) because they only test set membership.</p><pre><code>-- These don't work with bloom filters (fundamental limitations)
SELECT * FROM traces WHERE trace_id &lt;&gt; 'abc-123-def';

SELECT * FROM users WHERE created_at &gt; '2024-01-01';

SELECT * FROM transactions WHERE amount BETWEEN 100 AND 500;</code></pre><p>Current implementation restrictions in TimescaleDB mean they also can't help with multiple value searches (like WHERE column IN (1, 2, 3)) or cross-type comparisons without explicit casting. These may change in future versions.</p><pre><code>-- These don't work yet (implementation restrictions)
SELECT * FROM traces WHERE trace_id IN 
  ('abc-123', 'def-456', 'ghi-789');

SELECT * FROM users WHERE user_id = 12345;  -- int8 = int4 comparison

SELECT * FROM posts WHERE category = ANY(ARRAY['tech', 'science']);</code></pre><p><strong>Also, bloom filters don't help much when the value you're looking for exists in most batches.&nbsp;</strong></p><p>For example, if you're searching for a common status like <code>active</code> that appears in every batch, the bloom filter will report a potential positive for every batch, forcing TimescaleDB to decompress and check them all anyway. The bloom filter can't skip anything, so you don't get any savings.&nbsp;</p><h2 id="speed-without-sacrifice">Speed without Sacrifice</h2><p>Bloom filters in TimescaleDB are a perfect example of "it just works" optimization and our commitment to making life easier for developers working at massive data scales.</p><p>The bloom data structure automatically kicks in for the right data types and query patterns, dramatically improving performance for needle-in-haystack queries without any configuration required. It works out of the box in Tiger Cloud—no setup required other than having an index on your rowstore columns.</p><p>You can verify bloom filters are working by looking for <code>_timescaledb_functions.bloom1_contains</code> in your query execution plans. The storage overhead is minimal, typically a few hundred bytes per batch, with a maximum of 1KB. For a table with a million batches, you're looking at roughly 100MB to 1GB of bloom filter metadata. That's <strong>0.01% storage overhead</strong> for massive query speedups.</p><p>Built by developers, for developers. TimescaleDB refuses to accept the traditional trade-offs of database storage. 
We give you the speed of columnar analytics with the flexibility of point lookups, all in the same system.</p><p>Try it now on <a href="https://www.tigerdata.com/cloud"><u>Tiger Cloud</u></a>.&nbsp;</p><h3 id="additional-reading">Additional reading&nbsp;</h3><ol><li><a href="https://www.timescale.com/blog/speed-without-sacrifice-2500x-faster-distinct-queries-10x-faster-upserts-bloom-filters-timescaledb-2-20"><u>Speed Without Sacrifice: 2500x Faster Distinct Queries, 10x Faster Upserts, Bloom Filters and More in TimescaleDB 2.20</u></a>&nbsp;</li><li><a href="https://www.bytedrum.com/posts/bloom-filters/"><u>Bloom Filters: The Unsung Heroes of Computer Science</u></a></li><li><a href="https://hur.st/bloomfilter/"><u>Bloom Filter Calculator</u></a></li></ol>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to Build a Secure, Authorized Chatbot Using Oso and Timescale]]></title>
            <description><![CDATA[Timescale and Oso webinar recap where you'll learn how to build a secure, authorized LLM chatbot using Oso and Timescale Vector. ]]></description>
            <link>https://www.tigerdata.com/blog/how-to-build-a-secure-authorized-chatbot-using-oso-and-timescale</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/how-to-build-a-secure-authorized-chatbot-using-oso-and-timescale</guid>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Jacky Liang]]></dc:creator>
            <pubDate>Tue, 13 May 2025 13:19:26 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2025/05/2025-may-12-oso-and-timescale-oso-blog-thumbnail.png">
            </media:content>
            <content:encoded><![CDATA[<p>The rush to integrate large language models (LLMs) into production apps has exposed a common failure mode: without proper authorization in place, they can easily expose sensitive data to the wrong users. Combine that with complex infrastructure (vector databases, sync pipelines, separate stores for embeddings and metadata), and you’re shipping a fragile system that puts user data at risk.</p><p>At <a href="https://www.timescale.com/"><u>Timescale</u></a> and <a href="https://www.osohq.com/"><u>Oso</u></a>, we think there’s a better way.</p><p>In this webinar, we show how you can build a secure, scalable AI chatbot using Postgres—and <em>only</em> Postgres—by leveraging Timescale’s <a href="https://github.com/timescale/pgai"><u>pgai library</u></a> and Oso’s <a href="https://www.osohq.com/cloud/authorization-service"><u>authorization platform as a service</u></a>.</p><figure class="kg-card kg-embed-card"><iframe width="200" height="113" src="https://www.youtube.com/embed/5GFhqVOM8UE?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" title="How to build a secure, authorized chatbot using Oso and Timescale"></iframe></figure><p>Here are the webinar highlights, summarized for you in chapters for easy reference.</p><p>(To deploy our sample app for authorized secure chatbot built using Oso and pgai, see this <a href="https://github.com/jackyliang/timescale-oso-rag-chatbot"><u>open-source code</u></a>.)</p><h2 id="why-most-ai-chatbot-demos-fail-in-production">Why Most AI Chatbot Demos Fail in Production</h2><p>[08:30–11:50]</p><p><strong>Why do simple chatbots break in production?</strong> Demo chatbots are easy: embed your docs, slap on an OpenAI API key, and you’re done.</p><p>But in a real business environment, Bob (the employee) should never see Alice’s harsh performance 
review feedback. Only Alice, their manager and HR should. Sales shouldn’t see engineering tickets.&nbsp;</p><p>Without authorization boundaries, your chatbot becomes a data leak waiting to happen.</p><p>Many demos fall short because they:</p><ul><li>Expose <em>all</em> content to <em>all</em> users</li><li>Ignore org-specific permissions (e.g., team-level access control)</li><li>Assume static or role-based authorization models</li><li>Rely on dual data systems (e.g., Postgres + Vector DB), causing data synchronization difficulties.</li></ul><p>The fix? Build with authorization and data consistency as first principles.</p><h2 id="why-we-combined-postgres-pgvector-and-oso">Why We Combined Postgres, pgvector, and Oso</h2><p>[13:34–17:47]</p><p>We introduced an end-to-end reference stack that solves both the <strong>data synchronization and</strong> <strong>authorization complexity problem</strong>. The solution uses:</p><ul><li><strong>Timescale + pgai</strong> for real-time, in-database vector search and updates</li><li><strong>Oso Cloud</strong> for relationship-based access controls, enforced natively via PostgreSQL</li><li><strong>No glue code</strong> or ETL scripts between systems</li></ul><p>The result: you get a secure, performant, and authorized chat system with <em>zero</em> duplicated data.</p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text"><i><em class="italic" style="white-space: pre-wrap;">“Chatbot demos are simple. Business-grade AI is hard. 
We’re going to show you how to make the hard, easy.</em></i>” — <b><strong style="white-space: pre-wrap;">Jacky, Developer Advocate, Timescale</strong></b></div></div><h2 id="real-time-vector-sync-with-pgai-vectorizer">Real-Time Vector Sync With pgai Vectorizer</h2><p>[14:33–20:45]</p><p>Instead of bolting a vector database on top of your existing Postgres database, <a href="https://github.com/timescale/pgai/blob/main/docs/vectorizer/overview.md"><u>pgai Vectorizer</u></a> keeps your embeddings <strong>automatically synchronized</strong> with your source data in Postgres.</p><ul><li>Create vectorizers via Python</li><li>Ingest from S3, Hugging Face, or existing Postgres tables</li><li>Bring your own embedding model (OpenAI, Nomic, etc.)</li><li>Chunk and embed documents with configurable rules</li><li>Never worry about mismatched records again</li></ul><pre><code>SELECT ai.create_vectorizer(
  'blog'::regclass,
  loading =&gt; ai.loading_column(column_name =&gt; 'content'),
  embedding =&gt; ai.embedding_openai(model =&gt; 'text-embedding-3-small', dimensions =&gt; 768),
  destination =&gt; ai.destination_table('blog_embeddings')
);</code></pre><p>Run your vectorizer worker:</p><pre><code class="language-SQL">pgai vectorizer worker -d postgresql://...</code></pre><p>No extra queues, pipelines, or lambdas needed. Just Python and Postgres.</p><h2 id="authorization-that-follows-relationships-not-just-roles">Authorization That Follows Relationships, Not Just Roles</h2><p>[21:43–28:14]</p><p>Many apps rely on <a href="https://www.osohq.com/docs/modeling-in-polar/role-based-access-control-rbac"><u>Role-Based Access Control (RBAC)</u></a>. But real-world permissions often depend on <a href="https://www.osohq.com/docs/modeling-in-polar/relationship-based-access-control-rebac"><u>relationships</u></a>:</p><ul><li>“Bob can view reviews only if he’s the owner of the document”</li><li>“Diane (HR) can see feedback others can’t”</li><li>“Support engineers can access sensitive logs only during active shifts”</li></ul><p>Oso lets you model this in code:</p><pre><code class="language-polar">resource Folder{
 roles = ["viewer"];
 permissions = ["view"];
 relations = { team: Team };

 "viewer" if "member" on "team";
 "viewer" if global "hr";
 "viewer" if is_public(resource);

 "view" if "viewer";
}
</code></pre><p>It also incorporates your Postgres data using native SQL, so you don’t need to sync users, roles, or groups into a second system.</p><h2 id="putting-it-together-authorized-retrieval-augmented-generation-rag">Putting It Together: Authorized Retrieval Augmented Generation (RAG)</h2><p>[30:44–37:32]</p><p>Here’s how the architecture works:</p><ol><li>A user (Bob or Diane) sends a question to the chatbot.</li><li>The app queries Oso to determine what data the user is <em>authorized</em> to access.</li><li>That filter is converted to a SQL query that joins source + embedding data in Timescale.</li><li>Only the authorized context is sent to the LLM (e.g., OpenAI) to generate a final response.</li></ol><p>The result: the same chatbot provides personalized, secure answers based on who’s asking—without leaking data or requiring redundant systems.</p><h2 id="what-you%E2%80%99ll-learn-from-the-demo">What You’ll Learn From the Demo</h2><p>[29:01–48:00]</p><ul><li>How to build a business-grade RAG stack without a separate vector DB</li><li>How to enforce field-level access control in LLM-based apps</li><li>How Timescale + pgai + Oso make Postgres the <em>only</em> data system you need</li><li>Why prompt engineering, chunking, and system prompts matter in retrieval quality</li><li>How to embed PDF, DOCX, and S3-based documents securely</li></ul><h2 id="next-steps">Next Steps</h2><p>We’ve open-sourced the reference app and walkthrough:</p><ul><li><a href="https://www.youtube.com/watch?v=5GFhqVOM8UE"><u>Watch the full webinar</u></a></li><li><a href="https://docs.timescale.com"><u>Explore the Timescale pgai docs</u></a></li><li><a href="https://www.osohq.com/docs"><u>Learn more about Oso Cloud</u></a></li><li><a href="https://oso-oss.slack.com/ssb/redirect"><u>Join the Oso community on Slack</u></a></li></ul><p>If you’re building AI agents, chat interfaces, or internal copilots—don’t wait to layer in security and data correctness.</p><p>Your users will thank you. 
Your auditors will too.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Document Loading, Parsing, and Cleaning in AI Applications]]></title>
            <description><![CDATA[Before writing the first line of embeddings for your AI application, you need to load, parse, and clean your data. Here’s how.]]></description>
            <link>https://www.tigerdata.com/blog/document-loading-parsing-and-cleaning-in-ai-applications</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/document-loading-parsing-and-cleaning-in-ai-applications</guid>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[AI agents]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Jacky Liang]]></dc:creator>
            <pubDate>Tue, 08 Apr 2025 17:50:10 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2025/04/Document-Loading--Parsing--and-Cleaning-in-AI-Applications.png">
            </media:content>
<content:encoded><![CDATA[<p>Welcome to part one of our <a href="https://www.timescale.com/blog/agentic-rag-best-practices-guide-for-building-ai-apps-with-postgresql" rel="noreferrer"><u>Agentic RAG Best Practices</u></a> series, where we cover how to load, parse, and clean documents for your agentic applications.&nbsp;</p><p>This comprehensive guide will teach you how to build effective agentic retrieval applications with PostgreSQL.</p><p>Every week, customers ask us about building AI applications. Their most pressing concern isn't advanced chunking strategies or vector databases—it's simply: "How do I clean my data before feeding it to my AI?"</p><p>It’s simple: “Garbage in, garbage out.”&nbsp;</p><p>Before worrying about writing even your first line of embeddings or retrieval code, <strong><em>you need clean data</em></strong>.</p><p>In this first guide of our agentic RAG series, we'll cover gathering the right data, extracting text from various document types, pulling valuable metadata, web scraping techniques, and effectively storing data in PostgreSQL. We'll address common challenges like fixing formatting issues and handling images in documents.&nbsp;</p><p>By the end, you'll know how to transform raw documents into clean, structured data that retrieval agents can effectively use.&nbsp;</p><p>Don’t want to read all of this and just want to apply it? We have prepared a handy-dandy preparation checklist for this topic.
</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2025/04/Document-Loading--Parsing--and-Cleaning-in-AI-Applications_decision-tree.png" class="kg-image" alt="A decision tree for document processing when building agentic RAG apps" loading="lazy" width="2000" height="1746" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2025/04/Document-Loading--Parsing--and-Cleaning-in-AI-Applications_decision-tree.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2025/04/Document-Loading--Parsing--and-Cleaning-in-AI-Applications_decision-tree.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2025/04/Document-Loading--Parsing--and-Cleaning-in-AI-Applications_decision-tree.png 1600w, https://timescale.ghost.io/blog/content/images/2025/04/Document-Loading--Parsing--and-Cleaning-in-AI-Applications_decision-tree.png 2392w" sizes="(min-width: 720px) 720px"></figure><p>One fintech customer recently shared how they spent weeks fine-tuning their RAG application with different vector databases, only to realize their poor results stemmed from simply having dirty data. "I approached the whole thing with like, I don't trust these AIs (. … ) So we don't ask them to make decisions. We do normal modeling to figure out what the user needs, then feed that data to the LLM and just say, 'Summarize it.'"&nbsp;</p><p>The garbage in, garbage out principle applies strongly to AI applications. Let's explore how to properly load, parse, and clean your data for AI use.&nbsp;</p><h2 id="gathering-the-right-data-for-ai-applications">Gathering the Right Data for AI Applications</h2><p>👉🏻 Watch the <a href="https://www.tiktok.com/@answer.hq/video/7488002948304407854"><u>one-minute video</u></a> summary.</p><p>Before even thinking about cleaning or processing your data, you need to make sure you have the right data in the first place. 
I know, this sounds obvious, but it’s a very important step that many teams overlook in rushing to build their shiny RAG app.&nbsp;</p><h3 id="data-selection-matters">Data selection matters</h3><p>We have seen many AI teams build state-of-the-art RAG apps that still deliver bad answers. Most of the time, there is nothing wrong with their retrieval algorithm, vector database, embedding model, or large language model. The problem is that they simply don’t have the necessary information in the knowledge base, so the LLM made something up instead or provided insufficient answers.&nbsp;</p><p>In most cases, if the information doesn’t exist in your documents, your RAG app should either return nothing (the best-case scenario), or the LLM will simply hallucinate a plausible answer (this is the worst-case scenario).&nbsp;</p><h3 id="choosing-the-right-data">Choosing the right data</h3><p>Before building your RAG application, ask yourself and the team these questions:</p><ol><li>What specific questions will users ask the system?&nbsp;</li><li>What documents contain factual information to these questions?&nbsp;</li><li>What are the gaps in our current documentation?&nbsp;</li><li>Is our information up-to-date, or will it need to be regularly updated?&nbsp;</li><li>Do we have a system in place to identify information gaps as users use the app?</li></ol><p>You need to be able to confidently answer these questions.&nbsp;</p><h3 id="where-to-collect-data">Where to collect data</h3><ol><li>Internal knowledge base: Check company wikis, technical documentation, reports, manuals, and databases.&nbsp;</li><li>External sources: Read industry publications, research papers, and public datasets.&nbsp;</li><li>Customer interactions: Check support tickets, chat logs, FAQs, etc.&nbsp;</li><li>Real-time sources: See news feeds, market data, IoT sensor data, etc.</li><li>Intuition: You may have some ideas where certain important data lives, so trust your gut.</li></ol><div class="kg-card 
kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text"><b><strong style="white-space: pre-wrap;">Note:</strong></b> Make sure these documents<i><b><strong class="italic" style="white-space: pre-wrap;"> don’t contain sensitive information</strong></b></i> you don’t want your users to ask about!&nbsp;</div></div><p>Be intentional about your data sources—the higher the quality and relevancy, the better.&nbsp;</p><h3 id="ensure-data-freshness">Ensure data freshness</h3><p>Most business data isn't static—it often changes as your products, services, and policies evolve. Outdated information in your RAG system leads to incorrect answers and really hurts your customers’ trust in the AI system (I mean, look at Google’s initial rollout of Bard.)&nbsp;</p><p>Consider the following suggestions for keeping your knowledge base up-to-date:</p><ol><li>Set up consistent update schedules: This will be different depending on your business needs. It can be hourly, weekly, monthly, or even quarterly.</li><li>Implement trigger-based updates: Update content whenever the source document changes. For example, when your team updates some documentation, your system should automatically refresh the corresponding knowledge base entries.</li><li>Create document ownership: If you work in a large company, you may need to assign responsibility to other individuals or teams for specific knowledge areas to ensure data is constantly updated.</li><li>Track user feedback: Many RAG systems allow users to rate answers. 
This rating system (like a simple thumbs up and down) can help identify outdated or incorrect information that needs to be updated, added, or removed from your knowledge base.&nbsp;</li><li>Track question patterns: Continuously analyze questions that consistently receive poor ratings to identify areas where your knowledge base needs improvement.&nbsp;</li></ol><p>Data freshness is one of the silent killers of data accuracy—no advanced RAG pipeline can fix this.&nbsp;</p><h2 id="extracting-text-from-documents">Extracting Text From Documents</h2><p>👉🏻 Watch the <a href="https://www.tiktok.com/@answer.hq/video/7488737770127494442"><u>one-minute video</u></a> summary.</p><p>Approximately 85 percent of the world's data is unstructured: think PDFs, Word files, emails, PowerPoint presentations, and more. To use this data with AI, you first need to extract the raw text.</p><h2 id="using-markitdown-for-document-conversion">Using MarkItDown for Document Conversion</h2><p>Libraries like <a href="https://github.com/microsoft/markitdown"><u>MarkItDown</u></a> and <a href="https://github.com/docling-project/docling"><u>Docling</u></a> can convert PDFs and other formats to Markdown. Markdown has become one of the cleanest and most efficient formats for ingesting data into LLMs because it's nearly plaintext and token-efficient. It can also efficiently represent non-text data like tables.&nbsp;</p><p><a href="https://github.com/microsoft/markitdown?tab=readme-ov-file#python-api"><strong><u>Extract text from PDF</u></strong></a><strong> using MarkItDown&nbsp;</strong></p><pre><code>from markitdown import MarkItDown  
md = MarkItDown()  
result = md.convert("document.pdf")  
text_markdown = result.text_content  
print(text_markdown[:500])</code></pre><p><a href="https://github.com/docling-project/docling?tab=readme-ov-file#getting-started"><strong><u>Extract text from PDF with Optical Character Recognition (OCR)</u></strong></a><strong> using Docling</strong></p><pre><code># Using Docling for Document Conversion with OCR
from docling.document_converter import DocumentConverter

# Initialize the converter
converter = DocumentConverter()

# Load PDF and extract text with OCR enabled
result = converter.convert(
    "document.pdf",  # Can be local path or URL
    enable_ocr=True  # Enable OCR for scanned documents
)

# Get the converted markdown content
markdown_text = result.document.export_to_markdown()

# Preview the first 500 characters
print(markdown_text[:500])
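
# Persist the extracted Markdown so later chunking/embedding steps can reuse it
# (plain-Python sketch; pick whatever path fits your pipeline)
with open("document.md", "w", encoding="utf-8") as f:
    f.write(markdown_text)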
</code></pre><p>The code above returns an object with <code>text_content</code> containing the markdown text, which you can easily pass into your RAG pipeline or LLM for cleaning, analysis, summarizing, or chunking.&nbsp;</p><h2 id="using-visual-language-models-for-ocr">Using Visual Language Models for OCR</h2><p>A new breed of OCR technology is being powered by visual large language models (VLLMs): models that can process not just text, but also images and PDFs. These are trained specifically for unstructured data extraction. One such VLLM making a splash is <a href="https://docs.mistral.ai/capabilities/document/"><u>Mistral OCR</u></a>.</p><p><a href="https://docs.mistral.ai/capabilities/document/"><strong><u>Extract text and images</u></strong></a><strong> (in base64) from PDF using Mistral OCR</strong></p><pre><code>import os
from mistralai import Mistral

api_key = os.environ["MISTRAL_API_KEY"]
client = Mistral(api_key=api_key)

ocr_response = client.ocr.process(
    model="mistral-ocr-latest",
    document={
        "type": "document_url",
        "document_url": "https://arxiv.org/pdf/2201.04234"
    },
    include_image_base64=True
)
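
# Collect the per-page Markdown into one document
# (field names here follow Mistral's documented OCR response shape:
# a pages list whose entries carry a markdown field)
full_markdown = "\n\n".join(page.markdown for page in ocr_response.pages)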
</code></pre><p><a href="https://docs.mistral.ai/capabilities/document/#ocr-with-image"><strong><u>Extract from images</u></strong></a><strong> using Mistral OCR</strong></p><pre><code>import os
from mistralai import Mistral

api_key = os.environ["MISTRAL_API_KEY"]
client = Mistral(api_key=api_key)

ocr_response = client.ocr.process(
    model="mistral-ocr-latest",
    document={
        "type": "image_url",
        "image_url": "https://raw.githubusercontent.com/mistralai/cookbook/refs/heads/main/mistral/ocr/receipt.png"
    }
)
</code></pre><p>What makes Mistral OCR unique is its exceptional performance in extracting text in multiple languages, handling text from images, representing math equations, interpreting structured tables, and other traditionally difficult tasks.&nbsp;</p><p>Other extraction tools you can experiment with include <a href="http://unstructured.io"><u>unstructured.io</u></a>,&nbsp; <a href="https://github.com/allenai/olmocr"><u>olmOCR</u></a>, or just relying on good ol’ humans to extract the data—Upwork or Fiverr is a good place to begin your search for contractors.&nbsp;</p><p>Once you have this more manageable text form, you're ready for either direct ingestion into your database or metadata extraction.&nbsp;</p><h2 id="metadata-extraction">Metadata Extraction</h2><p>👉🏻 Watch the <a href="https://www.tiktok.com/@answer.hq/video/7488738094645103918" rel="noreferrer"><u>one-minute video</u></a> summary.</p><p>All documents contain metadata like title, author, creation date, length, source, customer name, etc. Imagine needing to fetch all documents between Q1 and Q2 of 2025 for a financial report—you'd need to filter by date range using metadata.</p><p>If your PDFs or documents have built-in metadata (added automatically by document processors when saving or exporting), that's great! But what if they don't?&nbsp;</p><h3 id="extracting-built-in-metadata">Extracting built-in metadata</h3><p>For simple metadata extraction from actual PDF data (if available), you can use a library like fitz:</p><p><a href="https://pymupdf.readthedocs.io/en/latest/document.html#Document.metadata"><strong><u>Extract built-in PDF metadata</u></strong></a><strong> using fitz</strong></p><pre><code>import fitz  
doc = fitz.open("example.pdf")  
metadata = doc.metadata  
print(metadata.get("title"), "by", metadata.get("author"))</code></pre><p>For everything else, you need…</p><h2 id="contextual-metadata-extraction">Contextual metadata extraction</h2><p>Most documents don't have native metadata. In these cases, you need a two-step workflow: first, use a PDF text extractor like Mistral OCR, then pass the raw text to another large language model to request specific information using natural language.</p><p>For example, you can use Mistral OCR to analyze each document, define what metadata you'd like to extract (title, author, etc.), and use another LLM to get the metadata information formatted in a specific way (like JSON).</p><p><a href="https://docs.mistral.ai/capabilities/document/#document-understanding"><strong><u>Extract contextual metadata from PDF</u></strong></a><strong> using Mistral OCR and Mistral Small </strong></p><pre><code>import os
from mistralai import Mistral

# Retrieve the API key from environment variables
api_key = os.environ["MISTRAL_API_KEY"]

# Specify model
model = "mistral-small-latest"

# Initialize the Mistral client
client = Mistral(api_key=api_key)

# Define the messages for the chat
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "In JSON format, extract the following metadata from the provided document: title, author, and creation date."
            },
            {
                "type": "document_url",
                "document_url": "https://arxiv.org/pdf/1805.04770"
            }
        ]
    }
]

# Get the chat response
chat_response = client.chat.complete(
    model=model,
    messages=messages
)

# Print the content of the response
print(chat_response.choices[0].message.content)
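
# The model replies with a JSON string; parse it into a dict before storing
# (note: some models wrap JSON in Markdown code fences, so strip those first)
import json
metadata = json.loads(chat_response.choices[0].message.content)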

# Save the metadata somewhere for later ingest into your RAG pipeline
</code></pre><h2 id="extracting-text-from-the-web">Extracting Text From the Web</h2><p>👉🏻 Watch the <a href="https://www.tiktok.com/@answer.hq/video/7488738426116787498" rel="noreferrer"><u>one-minute video</u></a> summary.</p><p>Not all data lives in documents, and not all is accessible via GET API requests. To get data from websites, documentation, and knowledge bases for AI applications, you need to scrape them. The ultimate goal of web scraping is to fetch only the main text content from pages, filtering out headers, footers, sidebars, ads, and tracking scripts, in an LLM-friendly format like Markdown.</p><p>In the past, this was done with libraries like requests, Selenium, BeautifulSoup, etc., and required manually setting up proxies to evade rate limiters. Thankfully, it's no longer as painful to scrape the web today (yay!).</p><p>Web scrapers generally need to do the following tasks:</p><ol><li>Crawl: Get all pages of an entire website by gathering a list of all internal and external links (essentially building a sitemap, if it’s not available on /sitemap.xml).</li><li>Scrape: Get the DOM/text content of each individual page.</li><li>Proxy<strong>: </strong>Switch to a different IP to continue on a large crawling and scraping job.&nbsp;</li><li>Clean: Extract the main useful text from the raw DOM.</li><li>Convert: Format the main text content as Markdown, TXT, JSON, etc.&nbsp;</li></ol><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text"><b><strong style="white-space: pre-wrap;">A note about raw DOM</strong></b>: Using a website's HTML is messy because it has ads, menus, and other junk. 
It also doesn't work well for React apps or other single-page applications.</div></div><h2 id="firecrawl-for-web-scraping">Firecrawl for Web Scraping</h2><p><a href="http://firecrawl.com"><u>Firecrawl</u></a> is a web scraping/crawling engine accessible via REST API, Python SDK, and a UI dashboard (currently in beta). What's great about Firecrawl is that it extracts clean page text in various formats. It can crawl an entire site and return all pages' content in one go (with advanced filtering options). It also handles all the proxying needed for large-scale crawl and scrape jobs.</p><p><strong>Crawling a website with a limit of 100 pages using </strong><a href="https://docs.firecrawl.dev/introduction#usage"><strong><u>Firecrawl Crawl REST API</u></strong></a></p><pre><code>from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-YOUR_API_KEY")

# Crawl a website:
crawl_status = app.crawl_url(
  'https://firecrawl.dev', 
  params={
    'limit': 100, 
    'scrapeOptions': {'formats': ['markdown', 'html']}
  },
  poll_interval=30
)
print(crawl_status)
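
# Each crawled page comes back with its content in the requested formats.
# (Response shape is an assumption based on Firecrawl's docs: a 'data' list
# of pages, each with 'markdown' content and page 'metadata'.)
for page in crawl_status.get('data', []):
    print(page['metadata']['sourceURL'], len(page['markdown']))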
</code></pre><p><strong>Scraping a URL and outputting in Markdown using </strong><a href="https://docs.firecrawl.dev/introduction#scraping"><strong><u>Firecrawl Scraping REST API</u></strong></a></p><pre><code>from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-YOUR_API_KEY")

# Scrape a website:
scrape_result = app.scrape_url('firecrawl.dev', params={'formats': ['markdown', 'html']})
print(scrape_result)</code></pre><p><strong>Custom metadata extraction in JSON using </strong><a href="https://docs.firecrawl.dev/introduction#extraction"><strong><u>Firecrawl Extraction REST API</u></strong></a></p><pre><code>from firecrawl import FirecrawlApp
from pydantic import BaseModel, Field

# Initialize the FirecrawlApp with your API key
app = FirecrawlApp(api_key='your_api_key')

class ExtractSchema(BaseModel):
    company_mission: str
    supports_sso: bool
    is_open_source: bool
    is_in_yc: bool

data = app.scrape_url('https://docs.firecrawl.dev/', params={
    'formats': ['json'],
    'jsonOptions': {
        'schema': ExtractSchema.model_json_schema(),
    }
})
print(data["json"])</code></pre><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text"><i><b><strong class="italic" style="white-space: pre-wrap;">Author’s note:</strong></b></i><i><em class="italic" style="white-space: pre-wrap;"> “Firecrawl is one of my favorite SaaS services of 2024. It has awesome docs, affordable pricing, and has a very responsive team. Most importantly, it works really well.” – Jacky Liang</em></i></div></div><h2 id="other-web-scraping-options">Other Web Scraping Options</h2><p>Firecrawl isn't the only service/library for crawling, scraping, and cleaning. Another capable service is Jina AI's Reader API, which converts a URL to LLM-friendly inputs simply by adding <code>r.jina.ai</code> in front:&nbsp;&nbsp;</p><p><strong>Fetch a webpage in clean Markdown using Jina AI’s </strong><a href="https://jina.ai/reader/"><strong><u>Reader API</u></strong></a></p><pre><code>r.jina.ai/news.ycombinator.com
</code></pre><p>If you want to build your own end-to-end crawling and scraping infrastructure (expert users only), developers typically use <a href="https://github.com/microsoft/playwright"><u>Playwright</u></a>, a Microsoft framework for web testing and automation. <a href="https://github.com/oxylabs/playwright-web-scraping"><u>Playwright Web Scraping</u></a> is a reliable open-source web scraping implementation using Playwright. <a href="https://github.com/mendableai/firecrawl"><u>Firecrawl</u></a> is also open source and lets you host it in your own infrastructure if you want absolute control.&nbsp;</p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text"><b><strong style="white-space: pre-wrap;">Pro tip:</strong></b> When running your own web scraping infrastructure, make sure to use proxies for your scraping server to avoid IP bans from websites you're crawling and scraping.&nbsp;</div></div><h2 id="direct-data-loading">Direct Data Loading&nbsp;</h2><p><code>pgai</code> has a handy function that lets you import datasets directly from Hugging Face with just the dataset's name:</p><p><a href="https://github.com/timescale/pgai?tab=readme-ov-file#create-a-table-run-a-vectorizer-and-perform-semantic-search"><strong><u>Load data from Hugging Face</u></strong></a><strong> using pgai</strong></p><pre><code>SELECT ai.load_dataset('wikimedia/wikipedia', '20231101.en', table_name=&gt;'wiki', batch_size=&gt;5, max_batches=&gt;1, if_table_exists=&gt;'append');
</code></pre><p><code>pgai</code> has more direct data loading goodies to come—stay tuned.&nbsp;</p><h2 id="storing-data">Storing Data</h2><p>👉🏻 Watch the <a href="https://www.tiktok.com/@answer.hq/video/7488738803012848942" rel="noreferrer">one-minute video</a> summary.</p><p>At Timescale, we believe <a href="https://www.timescale.com/blog/vector-databases-are-the-wrong-abstraction"><u>boutique vector databases are the wrong abstraction</u></a> for AI workloads. PostgreSQL is the ideal solution for typical apps and AI apps, especially RAG applications.</p><p>Instead of using a separate vector database, you can store text embeddings inside PostgreSQL with <a href="https://github.com/pgvector/pgvector"><u>pgvector</u></a>. We recommend using <a href="https://github.com/timescale/pgai"><u>pgai</u></a> to simplify building RAG apps, as we have a handy interface called <code>create_vectorizer()</code> that automatically embeds raw text, chunks it, and continuously keeps it up-to-date.</p><p><strong>Create an AI project using </strong><a href="https://github.com/timescale/pgai?tab=readme-ov-file"><strong><u>pgvector and pgai</u></strong></a><strong> </strong></p><pre><code>-- Enable pgai and pgvector on your Postgres database
CREATE EXTENSION IF NOT EXISTS vector;  
CREATE EXTENSION IF NOT EXISTS ai;  

-- Create a table to store Wikipedia articles
CREATE TABLE wiki (
    id      TEXT PRIMARY KEY,
    url     TEXT,
    title   TEXT,
    text    TEXT
);

-- Load the Wikipedia dataset directly from Hugging Face
SELECT ai.load_dataset('wikimedia/wikipedia', '20231101.en', table_name=&gt;'wiki', batch_size=&gt;5, max_batches=&gt;1, if_table_exists=&gt;'append');
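
-- Optional sanity check: confirm the rows loaded (with batch_size=&gt;5
-- and max_batches=&gt;1 above, this should be a small sample)
SELECT count(*) FROM wiki;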

-- Create a vectorizer that:
-- 1. Chunks the text column with a recursive character text splitter
-- 2. Embeds each chunk with all-minilm (384-dimensional embeddings)
-- 3. Continuously watches the wiki table for incoming text
SELECT ai.create_vectorizer(
     'wiki'::regclass,
     embedding =&gt; ai.embedding_ollama('all-minilm', 384),
     chunking =&gt; ai.chunking_recursive_character_text_splitter('text'),
     formatting =&gt; ai.formatting_python_template('url: $url  title: $title $chunk')
);

-- Check the status of the vectorizer's embedding creation
SELECT * FROM ai.vectorizer_status;
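
-- Illustrative next step, assuming pgai's default naming: once the
-- vectorizer has processed the table, you can run semantic search
-- against the auto-created "wiki_embedding" view, embedding the query
-- text with the same Ollama model
SELECT title, chunk
FROM wiki_embedding
ORDER BY embedding &lt;=&gt; ai.ollama_embed('all-minilm', 'What is PostgreSQL?')
LIMIT 5;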
</code></pre><p><strong>Running the </strong><a href="https://github.com/timescale/pgai?tab=readme-ov-file#quick-start"><strong><u>pgai Vectorizer worker</u></strong></a></p><p>For the vectorizer to work correctly, you need to run the pgai Vectorizer worker alongside your PostgreSQL database. This worker processes your data and creates embeddings. Set up a <code>docker-compose.yml</code> file with the following configuration:</p><pre><code>version: '3'
services:
  db:
    image: timescale/timescaledb-ha:pg17
    environment:
      POSTGRES_PASSWORD: postgres
    ports:
      - "5432:5432"
    volumes:
      - data:/home/postgres/pgdata/data
      
  vectorizer-worker:
    image: timescale/pgai-vectorizer-worker:latest
    environment:
      PGAI_VECTORIZER_WORKER_DB_URL: postgres://postgres:postgres@db:5432/postgres
      OPENAI_API_KEY: your_openai_api_key_here
    command: [ "--poll-interval", "5s" ]
    
  ollama:
    image: ollama/ollama
    
volumes:
  data:
</code></pre><p>If you're using Ollama for embeddings, as shown in our example, point the vectorizer worker at the Ollama service by setting <code>OLLAMA_HOST</code> in its environment:</p><pre><code>vectorizer-worker:
    environment:
      OLLAMA_HOST: http://ollama:11434</code></pre><p>Start everything with “<code>docker-compose up -d</code>” and the worker will automatically poll the database and process your vectorizer tasks. Note that you might need to adjust settings like poll intervals or concurrency depending on your specific workload needs.</p><p>And just like that, we’ve built a production-ready SQL-native retrieval pipeline that is not only powerful but <a href="https://github.com/timescale/pgai/blob/main/docs/vectorizer/overview.md"><u>extremely customizable</u></a>.&nbsp;</p><h2 id="cleaning-messy-data">Cleaning Messy Data</h2><p>Raw text extracted from websites or documents is often messy and contains content not relevant to the main text. This can include ads, navigation menus, footers, tracking scripts, or leftover HTML/CSS markup. Removing this noise is crucial to avoid feeding irrelevant text to your AI model, reduce input token size to lower costs, and increase retrieval accuracy.</p><h3 id="cleaning-webpages">Cleaning webpages</h3><p>Elements to clean include:</p><ul><li>HTML tags that aren't content: <code>&lt;script&gt;</code>, <code>&lt;style&gt;</code>, <code>&lt;nav&gt;</code>, <code>&lt;footer&gt;</code>, etc.</li><li>Advertisements or cookie banners</li><li>Repeated headers/footers on every page</li><li>Excessive whitespace, line breaks, or meaningless Unicode characters</li><li>Images or links to images</li></ul><p>The tools mentioned earlier, like Firecrawl and Jina AI's Reader API, already handle webpage data cleaning and return only the main text content.</p><p>If you have very specific requirements, you can use web automation frameworks like Playwright or BeautifulSoup to get the raw DOM, then use traditional DOM traversal or regex to clean the data. 
This approach is for experts only.</p><h3 id="cleaning-pdfs">Cleaning PDFs</h3><p>After running documents through Mistral OCR, you'll still have repeated content like page numbers, headers/footers, and repetitive line breaks. A growing technique is to use an LLM like Gemini 2.0 Flash, which has a million-token context window and a reasonable cost, to automate cleaning. You can use natural language to instruct it on what to clean: removing repeated titles, sources, footnotes, etc.</p><p>You can also clean text manually if you have very specific data requirements.&nbsp;</p><h2 id="fixing-text-formatting">Fixing Text Formatting</h2><p>Sometimes text is extracted with poor formatting, so you need to standardize lists, headings, and line breaks. You can prompt an LLM to rewrite text more clearly, remove gibberish or irrelevant parts, and correct inconsistencies (without adding extra commentary).</p><p>Issues to fix include:</p><ul><li>Line breaks and paragraphs: Merge lines that belong to the same paragraph. For example, replace hyphenated line breaks (<code>-\n</code>) with nothing, and replace newlines followed by lowercase letters with spaces.</li><li>Lists and bullet points: Convert fancy bullet symbols to a common format (e.g., "•" or "–" to "-"). Ensure list items have consistent formatting.</li><li>Headings and subheadings: If using Markdown, ensure headings use # syntax properly with blank lines before and after.</li><li>Whitespace and punctuation: Trim excessive whitespace, and normalize quotes and dashes if needed.</li><li>Tone: Standardize tone using large-context LLMs like Gemini 2.0 Flash to keep writing style consistent across your text data.</li></ul><p><strong>Clean a variety of OCR text formatting issues using regex</strong><em> (not an exhaustive example)</em></p><pre><code>import re

def clean_ocr_text(text):
    # Replace hyphenated line breaks (e.g., "exam-\nple" -&gt; "example")
    text = re.sub(r'(\w)-\n(\w)', r'\1\2', text)

    # Merge lines that are broken in the middle of sentences
    text = re.sub(r'\n(?=\w)', ' ', text)

    # Normalize fancy bullet characters to a plain "-"
    text = re.sub(r'[•–]\s*', '- ', text)

    # Collapse runs of spaces/tabs and excessive blank lines
    text = re.sub(r'[ \t]{2,}', ' ', text)
    text = re.sub(r'\n{3,}', '\n\n', text)

    return text</code></pre><h2 id="handling-images-in-documents">Handling Images in Documents&nbsp;</h2><p>Some PDFs and web pages have images containing text or are entirely scans of documents. There are several ways to handle these:</p><ol><li><strong>OCR</strong>: For scanned documents, OCR can extract text from images. Mistral OCR handles this well.</li><li><strong>MarkItDown</strong>: Modern libraries like MarkItDown integrate both OCR and vision models to generate image descriptions.</li><li><strong>BLIP</strong>: Models like <a href="https://arxiv.org/abs/2201.12086"><u>BLIP</u></a> (Bootstrapping Language-Image Pre-training) combine understanding images and generating text to give you text descriptions of images.</li><li><strong>Save image URLs</strong>: When extracting from webpages, you can save image URLs as text chunks to display in your application.</li><li><strong>Omit images</strong>: This is common but not ideal, especially if images contain crucial information like charts or diagrams.</li></ol><p><a href="https://dev.to/leapcell/deep-dive-into-microsoft-markitdown-4if5"><strong><u>Generate descriptions for images</u></strong></a><strong> using MarkItDown</strong></p><pre><code>from markitdown import MarkItDown
from openai import OpenAI

# Set up OpenAI client
client = OpenAI(api_key="your-openai-api-key")

# Initialize MarkItDown with LLM capabilities
md = MarkItDown(llm_client=client, llm_model="gpt-4o")

# Convert an image file
result = md.convert("path_to_your_image.jpg")

# Print the generated description
print(result.text_content)
</code></pre><p><a href="https://medium.com/%40jimwang3589/what-is-image-captioning-and-how-to-use-python-to-generate-caption-from-an-image-98a9eb6be06d"><strong><u>Generate captions for an image</u></strong></a><strong> using BLIP&nbsp;</strong></p><pre><code>from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image

# Load the pre-trained BLIP model and processor
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# Open an image file
image_path = "path_to_your_image.jpg"  # Replace with your image file path
image = Image.open(image_path)

# Preprocess the image and prepare inputs for the model
inputs = processor(images=image, return_tensors="pt")

# Generate caption
outputs = model.generate(**inputs)

# Decode the generated caption
caption = processor.decode(outputs[0], skip_special_tokens=True)

print("Generated Caption:", caption)</code></pre><h2 id="summarizing-or-cleaning-text-with-an-llm">Summarizing or Cleaning Text With an LLM</h2><p>After programmatic cleaning and parsing, you may still need to refine text further:</p><ul><li>Condense long documents into short summaries or key points.</li><li>Make tone and format consistent.</li></ul><p>Simple prompts like "Remove any unnecessary information (like boilerplate) from the following text and correct any errors" or "Summarize the following document in three sentences" can help. If possible, double-check this work, as LLMs can sometimes misinterpret context.</p><p><strong>Summarize a piece of text </strong><a href="https://cloud.google.com/vertex-ai/generative-ai/docs/gemini-v2"><strong><u>using Gemini 2.0 Flash</u></strong></a><strong>&nbsp;</strong></p><pre><code>from google import genai
from google.genai.types import HttpOptions

client = genai.Client(http_options=HttpOptions(api_version="v1"))
response = client.models.generate_content(
    model="gemini-2.0-flash-001",
    contents="Summarize in a technical tone the following piece of text: Attention Is All You Need...",
)
print(response.text)</code></pre><h2 id="conclusion">Conclusion</h2><p>We hear it from our customers’ AI teams all the time. They had spent months trying different vector databases, embedding models, and chunking strategies without seeing any improvement in their RAG application. After a quick data review, we discovered their PDFs were being processed with poor OCR, resulting in garbled text full of artifacts. Within a week of implementing proper data cleaning, their application performance jumped dramatically.&nbsp;</p><p>There’s a very important lesson here: Before you worry about the latest bleeding-edge GraphRAG techniques, <strong><em>make sure your data foundation is solid</em></strong>; this step starts way before you write a single line of RAG code. Clean, well-structured, and high-quality data is the foundation of any successful AI application. As we like to say, “Garbage in, garbage out.”&nbsp;</p><p>Start with good data, and the rest of your AI pipeline will fall into place. You need to invest time in proper document gathering, loading, parsing, and cleaning, which will pay dividends in more accurate, relevant, and useful AI outputs.&nbsp;</p><p>In the next installment of <a href="https://www.timescale.com/blog/agentic-rag-best-practices-guide-for-building-ai-apps-with-postgresql" rel="noreferrer">RAG Best Practices</a>, we will be exploring chunking strategies, followed by embedding generation, indexing techniques, performance optimizations, and more.&nbsp;</p><h2 id="get-involved">Get Involved&nbsp;</h2><p>Whether you're new to AI or an experienced developer looking to implement agentic RAG with PostgreSQL, this series will give you the foundation you need.&nbsp;</p><p>Stay tuned for our next guide on chunking strategies, coming in two weeks.
</p><p>In the meantime, we'd love to see you share your thoughts, questions, and suggestions on social media and Discord:</p><ul><li><strong>Join our Discord Community</strong>: Get real-time answers from the Timescale team and <a href="https://discord.com/invite/KRdHVXAmkp" rel="noreferrer">connect with other developers</a>.</li><li><strong>Follow us on social media</strong>: Stay updated with the latest from Timescale on <a href="https://twitter.com/TimescaleDB"><u>X/Twitter</u></a> and <a href="http://linkedin.com/company/timescaledb"><u>LinkedIn</u></a>.</li><li><strong>Connect with Jacky </strong>(<strong>developer advocate</strong>): Follow me for more practical AI and PostgreSQL content on<a href="https://twitter.com/jjackyliang"><u> X/Twitter</u></a>, <a href="https://threads.net/@jjackyliang"><u>Threads</u></a>, and <a href="https://www.tiktok.com/@answer.hq"><u>TikTok</u></a>.</li><li><strong>Direct questions</strong>: Have a specific question about your agentic retrieval implementation? Ask me anything at jacky (at) timescale (dot) com.</li></ul><p>We're building this guide for you, so don't hesitate to let us know what topics you'd like us to cover in future installments!&nbsp;</p>]]></content:encoded>
        </item>
    </channel>
</rss>