<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[Tiger Data Blog]]></title>
        <description><![CDATA[Insights, product updates, and tips from TigerData (Creators of TimescaleDB) engineers on Postgres, time series & AI. IoT, crypto, and analytics tutorials & use cases.]]></description>
        <link>https://www.tigerdata.com/blog</link>
        <image>
            <url>https://www.tigerdata.com/icon.ico</url>
            <title>Tiger Data Blog</title>
            <link>https://www.tigerdata.com/blog</link>
        </image>
        <generator>RSS for Node</generator>
        <lastBuildDate>Tue, 07 Apr 2026 09:53:59 GMT</lastBuildDate>
        <atom:link href="https://www.tigerdata.com/blog" rel="self" type="application/rss+xml"/>
        <ttl>60</ttl>
        <item>
            <title><![CDATA[Five Features of the Tiger CLI You Aren't Using (But Should)]]></title>
            <description><![CDATA[Tiger CLI + MCP server: Let AI manage databases, fork instantly, search Postgres docs, and run queries—all from your coding assistant without context switching.]]></description>
            <link>https://www.tigerdata.com/blog/five-features-tiger-cli-you-arent-using-but-should</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/five-features-tiger-cli-you-arent-using-but-should</guid>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[AI agents]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Jacky Liang]]></dc:creator>
            <pubDate>Wed, 10 Dec 2025 16:37:07 GMT</pubDate>
            <media:content medium="image" url="https://timescale.ghost.io/blog/content/images/2025/12/5FeaturesofTigerCLI.png">
            </media:content>
            <content:encoded><![CDATA[<p>Last month, we launched <a href="https://www.tigerdata.com/blog/postgres-for-agents"><u>Agentic Postgres</u></a>, the first Postgres database designed for AI agents. It includes an MCP server that gives agents direct access to your databases, instant zero-copy forks, Postgres and TimescaleDB documentation search, and more.&nbsp;</p><figure class="kg-card kg-video-card kg-width-regular" data-kg-thumbnail="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/media/2025/10/DE84BB33-B4FE-4F6E-8398-9267033F6870-2_thumb.jpg" data-kg-custom-thumbnail="">
            <div class="kg-video-container">
                <video src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/media/2025/10/DE84BB33-B4FE-4F6E-8398-9267033F6870-2.mp4" poster="https://img.spacergif.org/v1/1920x1080/0a/spacer.png" width="1920" height="1080" loop="" autoplay="" muted="" playsinline="" preload="metadata" style="background: transparent url('https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/media/2025/10/DE84BB33-B4FE-4F6E-8398-9267033F6870-2_thumb.jpg') 50% 50% / cover no-repeat;"></video>
                <div class="kg-video-overlay">
                    <button class="kg-video-large-play-icon" aria-label="Play video">
                        <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
                            <path d="M23.14 10.608 2.253.164A1.559 1.559 0 0 0 0 1.557v20.887a1.558 1.558 0 0 0 2.253 1.392L23.14 13.393a1.557 1.557 0 0 0 0-2.785Z"></path>
                        </svg>
                    </button>
                </div>
                <div class="kg-video-player-container kg-video-hide">
                    <div class="kg-video-player">
                        <button class="kg-video-play-icon" aria-label="Play video">
                            <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
                                <path d="M23.14 10.608 2.253.164A1.559 1.559 0 0 0 0 1.557v20.887a1.558 1.558 0 0 0 2.253 1.392L23.14 13.393a1.557 1.557 0 0 0 0-2.785Z"></path>
                            </svg>
                        </button>
                        <button class="kg-video-pause-icon kg-video-hide" aria-label="Pause video">
                            <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
                                <rect x="3" y="1" width="7" height="22" rx="1.5" ry="1.5"></rect>
                                <rect x="14" y="1" width="7" height="22" rx="1.5" ry="1.5"></rect>
                            </svg>
                        </button>
                        <span class="kg-video-current-time">0:00</span>
                        <div class="kg-video-time">
                            /<span class="kg-video-duration">0:30</span>
                        </div>
                        <input type="range" class="kg-video-seek-slider" max="100" value="0">
                        <button class="kg-video-playback-rate" aria-label="Adjust playback speed">1×</button>
                        <button class="kg-video-unmute-icon" aria-label="Unmute">
                            <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
                                <path d="M15.189 2.021a9.728 9.728 0 0 0-7.924 4.85.249.249 0 0 1-.221.133H5.25a3 3 0 0 0-3 3v2a3 3 0 0 0 3 3h1.794a.249.249 0 0 1 .221.133 9.73 9.73 0 0 0 7.924 4.85h.06a1 1 0 0 0 1-1V3.02a1 1 0 0 0-1.06-.998Z"></path>
                            </svg>
                        </button>
                        <button class="kg-video-mute-icon kg-video-hide" aria-label="Mute">
                            <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
                                <path d="M16.177 4.3a.248.248 0 0 0 .073-.176v-1.1a1 1 0 0 0-1.061-1 9.728 9.728 0 0 0-7.924 4.85.249.249 0 0 1-.221.133H5.25a3 3 0 0 0-3 3v2a3 3 0 0 0 3 3h.114a.251.251 0 0 0 .177-.073ZM23.707 1.706A1 1 0 0 0 22.293.292l-22 22a1 1 0 0 0 0 1.414l.009.009a1 1 0 0 0 1.405-.009l6.63-6.631A.251.251 0 0 1 8.515 17a.245.245 0 0 1 .177.075 10.081 10.081 0 0 0 6.5 2.92 1 1 0 0 0 1.061-1V9.266a.247.247 0 0 1 .073-.176Z"></path>
                            </svg>
                        </button>
                        <input type="range" class="kg-video-volume-slider" max="100" value="100">
                    </div>
                </div>
            </div>
            
        </figure><p>Alongside Agentic Postgres, we shipped a brand new CLI: <a href="https://github.com/timescale/tiger-cli"><u>Tiger CLI</u></a>. It's how you manage your Tiger Cloud databases from your favorite terminal.&nbsp;</p><p>The basics work like you'd expect:&nbsp;</p><pre><code class="language-shell"># Install Tiger CLI 
curl -fsSL https://cli.tigerdata.com | sh

# Authenticate
tiger auth login

# Create a new database service
tiger service create --name my-database

# Connect to your database
tiger db connect

# Get your connection string
tiger db connection-string
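# e.g., feed it straight to psql (illustrative; assumes psql is installed):
#   psql "$(tiger db connection-string)"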

# List all your services
tiger service list</code></pre><p>These commands cover most day-to-day workflows. But Tiger CLI has a few features that make the agentic development workflow significantly more intuitive, and you're probably not using them yet!&nbsp;</p><p>Here are five new features we launched that you aren’t using, but should:&nbsp;</p><ol><li><strong>Let your AI manage your databases</strong>: Install an MCP server that gives your AI assistant direct access to create services, run queries, and check connections</li><li><strong>Turn your AI into a Postgres expert</strong>: Skills teach your AI Postgres best practices automatically, as if it’s been writing production-grade PostgreSQL for more than a decade.&nbsp;</li><li><strong>Fork any database in seconds</strong>: Create zero-copy clones of your database for testing migrations or spinning up staging environments</li><li><strong>Search Postgres docs from your editor</strong>: Your AI can search PostgreSQL (across all versions) and TimescaleDB documentation without leaving your IDE or CLI</li><li><strong>Run SQL queries through your AI</strong>: Execute queries against your database directly from your AI assistant</li></ol><p>Let's look at each one.</p><h2 id="let-your-ai-manage-your-databases">Let Your AI Manage Your Databases</h2><p>As anyone coding with Cursor or Claude Code knows, nothing breaks flow more than having to leave your AI agent to execute CLI commands. Every time you need to check a database connection string, list your available services, or create a new database, you need to switch context. From your terminal to the browser and back to your IDE, it’s easy to fall out of your flow state.&nbsp;</p><p>The Tiger CLI now includes a <a href="https://modelcontextprotocol.io/docs/getting-started/intro"><u>Model Context Protocol</u></a> (MCP) server. 
If you're using an AI coding assistant like Claude Code, Cursor, or VS Code with Copilot, you can give it direct access to your Tiger Cloud databases.</p><p>This means your AI assistant can list your services, run SQL queries, create new databases, and check connection details, all without you switching to your terminal.</p><h3 id="quick-setup">Quick Setup</h3><p>Install the MCP server for your assistant:</p><pre><code class="language-shell"># Interactive (prompts you to pick your client)
tiger mcp install

# Or specify directly
tiger mcp install claude-code
tiger mcp install cursor
tiger mcp install vscode</code></pre><p>We made it super easy to install the Tiger MCP server in all of your favorite coding assistants* through an interactive prompt that guides you through the installation.</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2025/12/SCR-20251204-nvcj-2.png" class="kg-image" alt="" loading="lazy" width="889" height="325" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2025/12/SCR-20251204-nvcj-2.png 600w, https://timescale.ghost.io/blog/content/images/2025/12/SCR-20251204-nvcj-2.png 889w" sizes="(min-width: 720px) 720px"></figure><p>Restart your AI assistant after installation.&nbsp;</p><p>* We are constantly adding interactive installations for new coding assistants!&nbsp;</p><h3 id="adding-to-cursor-via-ui">Adding to Cursor via UI</h3><p>If you prefer to configure Cursor (or other IDEs) manually instead of using <code>tiger mcp install</code>:</p><ol><li>Open Cursor Settings</li><li>Look for “Tools &amp; MCP” on the left sidebar</li><li>Click "Add MCP server"</li><li>Enter the following configuration:&nbsp;<ul><li><strong>Name:</strong> <code>tiger</code></li><li><strong>Command:</strong> <code>tiger</code></li><li><strong>Arguments:</strong> <code>mcp</code>, <code>start</code></li></ul></li></ol><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2025/12/SCR-20251204-nrjp-2.png" class="kg-image" alt="" loading="lazy" width="630" height="402" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2025/12/SCR-20251204-nrjp-2.png 600w, https://timescale.ghost.io/blog/content/images/2025/12/SCR-20251204-nrjp-2.png 630w"></figure><ol start="5"><li>Click "Save" and restart Cursor</li></ol><p>Once configured, you'll see the Tiger MCP server listed in your MCP servers. 
The green indicator shows it's connected and ready.</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2025/12/SCR-20251204-odoy-2.png" class="kg-image" alt="" loading="lazy" width="872" height="768" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2025/12/SCR-20251204-odoy-2.png 600w, https://timescale.ghost.io/blog/content/images/2025/12/SCR-20251204-odoy-2.png 872w" sizes="(min-width: 720px) 720px"></figure><h3 id="what-you-can-do">What You Can Do</h3><p>Once installed, your AI assistant has access to tools like:</p><ul><li><code>service_list</code> — List all your database services</li><li><code>service_get</code> — Get details about a specific service</li><li><code>service_create</code> — Create a new database</li><li><code>db_execute_query</code> — Run SQL queries against any service</li></ul><p>For a full list of tools available to you and your agent, see the <a href="https://github.com/timescale/tiger-cli?tab=readme-ov-file#available-mcp-tools"><u>Tiger CLI README</u></a> on GitHub.&nbsp;</p><p>For example, you can ask your AI assistant: "Show me all my Tiger Cloud services" or "Run <code>SELECT count(*) FROM events</code> on my production database."</p><p>The MCP server uses your existing CLI authentication, so there's no extra setup after <code>tiger auth login</code>.</p><h2 id="turn-your-ai-into-a-postgres-expert">Turn Your AI Into a Postgres Expert</h2><p>A new pattern of working with SQL has emerged as agentic coding has exploded in popularity.&nbsp;</p><p>You can now simply tell an LLM what you want to do with your database, and it will write the SQL for you. 
You no longer need to shed sweat or tears (or live in fear of, say, accidentally dropping a table, although this is still <a href="https://fortune.com/2025/07/23/ai-coding-tool-replit-wiped-database-called-it-a-catastrophic-failure/"><u>completely within the realm of possibility</u></a> when working with an agent) to learn how to write SQL.&nbsp;</p><p>But this has spawned a new problem: AI-generated SQL works… until it doesn’t.&nbsp;</p><p>Your schema may pass tests, your queries may run… but six months later, as your application scales to millions of users, everything slows to a crawl.&nbsp;</p><p>The problem is that LLMs are trained on millions of lines of SQL, absorbed from a billion blog posts (of varying quality), so while they “know” SQL, they don’t know which patterns actually scale. There are also hundreds of different SQL dialects, with dozens of versions each.&nbsp;</p><p>It’s no wonder AI agents may not write the best SQL.&nbsp;</p><p>We’ve personally seen many common AI-generated mistakes, including:&nbsp;</p><ul><li>Using <code>VARCHAR(255)</code> instead of <code>TEXT</code> (the length limit doesn't help performance in Postgres)</li><li>Using <code>SERIAL</code> instead of <code>BIGINT GENERATED ALWAYS AS IDENTITY</code></li><li>Missing indexes on foreign key columns (Postgres doesn't create these automatically)</li><li>Using <code>TIMESTAMP</code> instead of <code>TIMESTAMPTZ</code> (timezone handling is painful to fix later)</li></ul><p>These mistakes won't raise syntax or linter errors, and your tests will still pass. But fixing them later, once your app is handling millions of users, means painful migrations, downtime, and explaining to your CEO why the database needs to go down for maintenance. 
Sound familiar?&nbsp;</p><p>The Tiger MCP server ships with a variation of <a href="https://www.tigerdata.com/blog/free-postgres-mcp-prompt-templates"><u>Skills</u></a> (a <a href="https://www.claude.com/blog/skills"><u>standard</u></a> built by Anthropic) written by our most senior and experienced Postgres engineers. When your AI needs to design a schema, the MCP server automatically pulls the right “lessons” and then applies 30 years of Postgres best practices to your database design.&nbsp;</p><h3 id="available-skills">Available Skills</h3><p>The MCP server includes skills for common Postgres and TimescaleDB tasks:&nbsp;</p><ul><li><code>design-postgres-tables</code>: Schema design with proper types, constraints, and indexes</li><li><code>setup-timescaledb-hypertables</code>: Hypertable configuration, compression, retention policies</li><li><code>migrate-postgres-tables-to-hypertables</code>: Converting existing tables to hypertables</li><li><code>find-hypertable-candidates</code>: Identifying which tables should become hypertables</li></ul><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2025/12/SCR-20251204-nzjm-2.png" class="kg-image" alt="" loading="lazy" width="583" height="356"></figure><p>Your AI discovers and uses these automatically based on what you ask it to do. You don’t need to call them explicitly!&nbsp;</p><p><em>We are continually adding new Skills. <strong>Want to request new ones or contribute your own?</strong> Feel free to </em><a href="https://github.com/timescale/pg-aiguide/issues"><em><u>create an issue</u></em></a><em> in our Skills GitHub repo.&nbsp;</em></p><h2 id="fork-any-database-in-seconds">Fork Any Database in Seconds</h2><p>Testing database migrations against production data is risky, but testing against fake data is risky too: you may miss edge cases. 
You need a real copy of your database, but without the cost or time of duplicating everything manually.&nbsp;</p><p>Tiger CLI lets you create instant, zero-copy forks of any database:&nbsp;</p><pre><code class="language-shell">tiger service fork &lt;service-id&gt; --name my-staging-db</code></pre><p>The fork is a point-in-time copy that shares underlying data blocks with the original. You only pay for blocks that change. This makes forks lightweight enough to spin up for a single test and throw away when you’re done.&nbsp;</p><p>Database forking is especially useful in an agentic context: if you are running multiple agents at once, each agent can work on its own fork without affecting the original database. Neat!&nbsp;</p><h3 id="example-workflow">Example Workflow</h3><pre><code class="language-shell"># List your services to find the source ID
tiger service list

# Fork your production database
tiger service fork --source-id svc_abc123 --name testing-migrations

# Connect to the fork
tiger db connect --service-id &lt;new-service-id&gt;

# Run your migration, test it, then delete the fork when done
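# e.g., apply a migration script to the fork with psql (illustrative; the file
# name is hypothetical, and --service-id usage mirrors tiger db connect above):
#   psql "$(tiger db connection-string --service-id &lt;new-service-id&gt;)" -f migration.sql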
tiger service delete &lt;new-service-id&gt;</code></pre><h2 id="search-postgres-docs-from-your-editor">Search Postgres Docs From Your Editor</h2><p>Postgres has been around for <a href="https://en.wikipedia.org/wiki/PostgreSQL"><u>almost 30 years</u></a>.</p><p>The documentation is incredibly extensive, spanning 18 versions and countless individual releases. A function that exists in Postgres 18 might not exist in 15. Syntax that worked in 14 now has better alternatives in 18.&nbsp;</p><p>Most AI-powered documentation search tools don’t account for these idiosyncrasies.&nbsp;</p><p>And if you skip docs search tools and rely on an LLM’s own training data, you run into training cutoffs, which generally lag by at least six months. LLMs often suggest older, less performant methods, or reference features that don’t exist in your version. This is a major problem I’ve personally experienced with LLMs writing React and Next.js in my side projects!&nbsp;</p><p>The Tiger MCP server includes search over PostgreSQL documentation from versions 14 to 18 and TimescaleDB docs for time series workloads. 
Your AI assistant can search version-specific docs without leaving your coding agent.&nbsp;</p><p>Your assistant gets access to:</p><ul><li><code>semantic_search_postgres_docs</code>: Search PostgreSQL documentation (versions 14-18)</li><li><code>semantic_search_tiger_docs</code>: Search Tiger Cloud and TimescaleDB documentation</li></ul><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2025/12/SCR-20251204-olha-2.png" class="kg-image" alt="" loading="lazy" width="1101" height="786" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2025/12/SCR-20251204-olha-2.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2025/12/SCR-20251204-olha-2.png 1000w, https://timescale.ghost.io/blog/content/images/2025/12/SCR-20251204-olha-2.png 1101w" sizes="(min-width: 720px) 720px"></figure><h3 id="no-more-context-switching">No More Context Switching&nbsp;</h3><p>Instead of leaving your editor to search docs, you can ask your AI assistant directly:</p><ul><li>"How do I set up continuous aggregates in TimescaleDB?"</li><li>"What's the syntax for PostgreSQL window functions?"</li><li>"Show me how to configure compression policies"</li></ul><p>The assistant searches the actual docs and gives you accurate, up-to-date answers. You don’t even need to ask the agent to explicitly use the documentation search feature; it will just work.&nbsp;</p><p>P.S. This feature is enabled by default. If you ever need to disable it:</p><pre><code class="language-shell">tiger config set docs_mcp false</code></pre><h2 id="run-sql-queries-through-your-ai">Run SQL Queries Through Your AI</h2><p>You’re debugging an issue and need to check a row count, inspect a table’s schema, or run a quick query. You’re used to having to open a new terminal, dig up the connection string, connect via psql, run the query, and copy the results back (often jankily, thanks to text wrapping in the terminal). 
All these steps just for getting ONE number.&nbsp;</p><p>No more.&nbsp;</p><p>Once you’ve set up the MCP in your coding agent, your AI assistant can execute SQL queries (using the <code>db_execute_query</code> tool) directly against your databases.</p><p>This means you can stay in your editor and ask your AI:</p><ul><li>"How many events came in during the last 24 hours?"</li><li>"Show me the 10 most recent orders"</li><li>"What's the schema of the users table?"&nbsp;</li></ul><h3 id="example">Example</h3><p>Your AI writes the (performant) SQL, runs it, and returns the results. No more terminal switching, copy-pasting connection strings, or remembering the exact syntax for your <code>information_schema</code> queries.&nbsp;</p><h2 id="get-started-now">Get Started Now</h2><p>Install the Tiger CLI and MCP server:</p><pre><code class="language-shell">curl -fsSL https://cli.tigerdata.com | sh
tiger auth login
tiger mcp install</code></pre><p>Then select your AI assistant (Claude Code, Cursor, VS Code, Windsurf) and you're ready to go.</p><p>Don't have a Tiger Cloud account? <a href="https://console.cloud.timescale.com/signup"><u>Sign up for free</u></a> — no credit card required. Create your first database, then try out these CLI features.&nbsp;</p><h2 id="resources">Resources</h2><ol><li><a href="https://www.tigerdata.com/blog/postgres-for-agents"><u>Postgres for Agents</u></a>&nbsp;</li><li><a href="https://www.tigerdata.com/blog/free-postgres-mcp-prompt-templates"><u>How to Train Your Agent to Be a Postgres Expert</u></a></li></ol>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to Train Your Agent to Be a Postgres Expert]]></title>
            <description><![CDATA[Turn AI into a Postgres expert with our MCP server. Get 35 years of best practices, versioned docs, and prompt templates for production-ready schemas.]]></description>
            <link>https://www.tigerdata.com/blog/free-postgres-mcp-prompt-templates</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/free-postgres-mcp-prompt-templates</guid>
            <category><![CDATA[Announcements & Releases]]></category>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Matty Stratton]]></dc:creator>
            <pubDate>Wed, 22 Oct 2025 14:02:12 GMT</pubDate>
            <media:content medium="image" url="https://timescale.ghost.io/blog/content/images/2025/10/2025-Oct-21-Prompt-Template-Thumbnail.png">
            </media:content>
            <content:encoded><![CDATA[<h3 id="with-prompt-templates-and-versioned-docs-we-turn-35-years-of-postgres-wisdom-into-structured-knowledge-your-agent-can-reason-with">With prompt templates and versioned docs, we turn 35 years of Postgres wisdom into structured knowledge your Agent can reason with.</h3><p>Agents are the <a href="https://www.tigerdata.com/blog/postgres-for-agents" rel="noreferrer">new developer</a>. But they’re generalists.&nbsp;</p><p>What happens when they design your Postgres database? Your schema runs, your tests pass… and six months later your queries crawl and your costs skyrocket.&nbsp;</p><p>AI-generated SQL and database schemas are almost right. And that’s the problem. Fixing schema design mistakes is costlier than refactoring code. It often means multi-week migrations, downtime windows, rollback plans, and your CEO asking why the site is in maintenance mode. The root issue? LLMs don’t have the depth of Postgres and database expertise to let them build scalable systems. And when agents try to learn, they find documentation written for humans, not for them.&nbsp;</p><p>But agents don’t need more data; they need better context. They need to know what “good Postgres” actually looks like. The good news is that, given the right context and tools, agents can become instant experts. Even with Postgres.&nbsp;</p><p>That’s why we built an MCP server that provides 35 years of Postgres wisdom and full access to the Postgres docs, all in a format that agents can easily process.&nbsp;</p><p>And we think this just might be the best database MCP server ever built. While Neon, Supabase, and other Postgres companies created MCP servers as thin API wrappers, ours teaches AI how to think in Postgres. The Tiger MCP server gives AI tools that work automatically: no prompt engineering or manual lookups needed. You just ask. And it provides correct, idiomatic Postgres. 
</p><p>Our new MCP server ships with detailed prompt templates written by our senior engineers, plus versioned Postgres (15-18) docs, and <a href="https://docs.tigerdata.com/" rel="noreferrer">TimescaleDB docs</a>.&nbsp;When your AI needs to design a schema, it automatically pulls the right template and searches the docs (using hybrid search) to generate code that actually lasts. And of course, it handles the basics: start, stop, <a href="https://www.tigerdata.com/blog/fast-zero-copy-database-forks" rel="noreferrer">fork a database</a>. </p><p>But the real magic happens when you ask AI to design your schema. So let’s see what happens when AI designs a simple schema. First without the MCP server, then with it.</p><h2 id="without-the-mcp-server">Without the MCP Server</h2><pre><code class="language-SQL">CREATE TABLE customers (
    customer_id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    email VARCHAR(255) UNIQUE NOT NULL,
    password_hash VARCHAR(255) NOT NULL,
    first_name VARCHAR(100) NOT NULL,
    last_name VARCHAR(100) NOT NULL,
    phone VARCHAR(20),
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    is_active BOOLEAN DEFAULT TRUE
);

CREATE INDEX idx_customers_email ON customers(email);

CREATE TABLE authors (
    author_id SERIAL PRIMARY KEY,
    first_name VARCHAR(100) NOT NULL,
    last_name VARCHAR(100) NOT NULL,
    biography TEXT,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);
</code></pre><p>What’s wrong here:&nbsp;</p><ul><li>Mixed ID types (<code>SERIAL</code> and <code>UUID</code>) with no clear reason</li><li>Arbitrary <code>VARCHAR</code> limits instead of <code>TEXT</code></li><li>Case-sensitive email index → duplicate accounts</li><li><code>CURRENT_TIMESTAMP</code> instead of the more idiomatic <code>now()</code></li></ul><p>Someone new to Postgres may not notice these problems. Until it’s too late.</p><p>Now let’s see what AI generates with our new MCP server.</p><h2 id="with-the-mcp-server">With the MCP Server</h2><pre><code class="language-SQL">CREATE TABLE authors (
  author_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  name TEXT NOT NULL,
  bio TEXT,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX ON authors (name);

CREATE TABLE users (
  user_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  email TEXT NOT NULL UNIQUE,
  password_hash TEXT NOT NULL,
  first_name TEXT NOT NULL,
  last_name TEXT NOT NULL,
  phone TEXT,
  is_active BOOLEAN NOT NULL DEFAULT true,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE UNIQUE INDEX ON users (LOWER(email));
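-- Illustrative: the expression index above makes case-insensitive email
-- lookups indexable, e.g.
--   SELECT user_id FROM users WHERE LOWER(email) = LOWER('Alice@Example.com');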
CREATE INDEX ON users (created_at);</code></pre><p>What’s better about this?</p><ul><li>Consistent ID strategy with BIGINT GENERATED ALWAYS AS IDENTITY</li><li>TEXT instead of arbitrary VARCHAR limits</li><li>Case-insensitive email lookups</li><li>Modern timestamp handling</li></ul><p>But why does this matter?</p><p>Each of these differences creates a compounding problem. Changing datatypes in the future will require full table rewrites. Missing lowercase email handling means duplicate accounts and confused users. And time zones? Every senior developer gets the thousand-yard stare when you mention UTC conversions.</p><p>This is just with a small example; imagine what would happen with more complex schemas.</p><p>And if you don’t believe us, here’s what Claude has to say:</p><pre><code class="language-markdown">&gt; Please describe the schema you would create for an e-commerce website two times, first with the tiger mcp server disabled, then with the tiger mcp server enabled. For each time, write the schema to its own file in the current working directory. Then compare the two files and let me know which approach generated the better schema, using both qualitative and quantitative reasons. For this example, only use standard Postgres.</code></pre><figure class="kg-card kg-video-card kg-width-regular" data-kg-thumbnail="https://timescale.ghost.io/blog/content/media/2025/10/how-to-train-your-agent_thumb.jpg" data-kg-custom-thumbnail="">
            <div class="kg-video-container">
                <video src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/media/2025/10/how-to-train-your-agent.mp4" poster="https://img.spacergif.org/v1/1280x720/0a/spacer.png" width="1280" height="720" loop="" autoplay="" muted="" playsinline="" preload="metadata" style="background: transparent url('https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/media/2025/10/how-to-train-your-agent_thumb.jpg') 50% 50% / cover no-repeat;"></video>
                <div class="kg-video-overlay">
                    <button class="kg-video-large-play-icon" aria-label="Play video">
                        <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
                            <path d="M23.14 10.608 2.253.164A1.559 1.559 0 0 0 0 1.557v20.887a1.558 1.558 0 0 0 2.253 1.392L23.14 13.393a1.557 1.557 0 0 0 0-2.785Z"></path>
                        </svg>
                    </button>
                </div>
                <div class="kg-video-player-container kg-video-hide">
                    <div class="kg-video-player">
                        <button class="kg-video-play-icon" aria-label="Play video">
                            <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
                                <path d="M23.14 10.608 2.253.164A1.559 1.559 0 0 0 0 1.557v20.887a1.558 1.558 0 0 0 2.253 1.392L23.14 13.393a1.557 1.557 0 0 0 0-2.785Z"></path>
                            </svg>
                        </button>
                        <button class="kg-video-pause-icon kg-video-hide" aria-label="Pause video">
                            <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
                                <rect x="3" y="1" width="7" height="22" rx="1.5" ry="1.5"></rect>
                                <rect x="14" y="1" width="7" height="22" rx="1.5" ry="1.5"></rect>
                            </svg>
                        </button>
                        <span class="kg-video-current-time">0:00</span>
                        <div class="kg-video-time">
                            /<span class="kg-video-duration">0:50</span>
                        </div>
                        <input type="range" class="kg-video-seek-slider" max="100" value="0">
                        <button class="kg-video-playback-rate" aria-label="Adjust playback speed">1×</button>
                        <button class="kg-video-unmute-icon" aria-label="Unmute">
                            <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
                                <path d="M15.189 2.021a9.728 9.728 0 0 0-7.924 4.85.249.249 0 0 1-.221.133H5.25a3 3 0 0 0-3 3v2a3 3 0 0 0 3 3h1.794a.249.249 0 0 1 .221.133 9.73 9.73 0 0 0 7.924 4.85h.06a1 1 0 0 0 1-1V3.02a1 1 0 0 0-1.06-.998Z"></path>
                            </svg>
                        </button>
                        <button class="kg-video-mute-icon kg-video-hide" aria-label="Mute">
                            <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
                                <path d="M16.177 4.3a.248.248 0 0 0 .073-.176v-1.1a1 1 0 0 0-1.061-1 9.728 9.728 0 0 0-7.924 4.85.249.249 0 0 1-.221.133H5.25a3 3 0 0 0-3 3v2a3 3 0 0 0 3 3h.114a.251.251 0 0 0 .177-.073ZM23.707 1.706A1 1 0 0 0 22.293.292l-22 22a1 1 0 0 0 0 1.414l.009.009a1 1 0 0 0 1.405-.009l6.63-6.631A.251.251 0 0 1 8.515 17a.245.245 0 0 1 .177.075 10.081 10.081 0 0 0 6.5 2.92 1 1 0 0 0 1.061-1V9.266a.247.247 0 0 1 .073-.176Z"></path>
                            </svg>
                        </button>
                        <input type="range" class="kg-video-volume-slider" max="100" value="100">
                    </div>
                </div>
            </div>
            
        </figure><pre><code class="language-markdown">Verdict:

The Tiger MCP-enabled schema is objectively superior because:

1. Type Safety: Uses modern PostgreSQL types that prevent common errors
2. Data Integrity: 4x more constraints ensure data quality at the database level
3. Performance: Better indexing strategy with 55% more indexes, including partial and expression indexes
4. Maintainability: Better documentation, clear comments, proper naming
5. Features: More complete e-commerce functionality (discounts, full-text search, audit trails)
6. PostgreSQL Best Practices: Follows official PG17 recommendations from the documentation
7. Future-Proof: Uses modern features like GENERATED ALWAYS AS IDENTITY and NULLS NOT DISTINCT

The Tiger MCP server provided access to PostgreSQL-specific best practices documentation and design patterns that resulted in a more robust, performant, and maintainable schema.
</code></pre><h2 id="how-prompt-templates-make-it-all-possible">How Prompt Templates Make It All Possible</h2><p>Behind the scenes, AI uses the MCP server to call <code>get_prompt_template('design_postgres')</code> to load schema design guidance. No pasted docs. No corrections. Just better code.</p><p>Prompt templates turn production wisdom into reusable guidance for AI. Instead of hunting through documentation written for humans, AI gets the information it needs in a format built for machines.</p><p>It comes down to the fact that humans and LLMs have opposite needs. Humans need narratives and memorable examples (and yes, even cat memes) to help them retain information. LLMs need to preserve context window space. That’s why prompt templates make terrible blog posts, but perfect AI guidance.</p><p>Our philosophy is: don't re-teach what the model already knows. LLMs have seen millions of lines of SQL. They know how to write CREATE TABLE. What they don’t know is the 35 years of Postgres wisdom about what works well and what doesn’t.</p><p>It's like your senior DBA whispering advice in the model's ear.</p><p>Our schema design template, <code>design_postgres_tables</code>, doesn’t explain what a primary key is. It jumps straight to guidance:</p><p>“Prefer <code>BIGINT GENERATED ALWAYS AS IDENTITY</code>; use <code>UUID</code> only when global uniqueness is needed.”</p><p>For data types, it doesn’t teach from scratch. It just tells you what works:</p><p>“DO NOT use <code>money</code> type; DO use <code>numeric</code> instead.”</p><p>Here’s a real snippet from the template:</p><pre><code class="language-markdown">## Postgres "Gotchas"

- **FK indexes**: Postgres **does not** auto-index FK columns. Add them.
- **No silent coercions**: length/precision overflows error out (no truncation). 
  Example: inserting 999 into `NUMERIC(2,0)` fails with error, unlike some 
  databases that silently truncate or round.
- **Heap storage**: no clustered PK by default (unlike SQL Server/MySQL InnoDB); 
  row order on disk is insertion order unless explicitly clustered.</code></pre><p>These gotchas trip up LLMs the same way they trip up developers new to Postgres. We optimized these templates for machines: short, factual, and precise, packing maximum guidance into minimum tokens.&nbsp;</p><p>We tested the same approach on a real IoT schema design task. Without templates, the AI added forbidden configurations and missed critical optimizations. <em>With</em> templates, it generated production-ready code with compression, continuous aggregates, and tuned performance.</p><p>That’s how prompt templates work. Now let’s see how the MCP server makes it all happen.</p><h2 id="how-this-mcp-server-is-smarter-than-others">How This MCP Server is Smarter Than Others</h2><p>While Neon, Supabase, and other Postgres companies created MCP servers as thin API wrappers, ours teaches AI how to think in Postgres. The Tiger MCP server gives AI tools that work automatically: no prompt engineering or manual lookups needed. You just ask. And it provides correct, idiomatic Postgres.</p><p><strong><code>get_prompt_template</code> provides auto-discovered expertise. </strong>Instead of having to call a template explicitly, you just say “I want to make a schema for IoT devices…” and the MCP server figures it out. </p><p>With self-discoverable templates, the AI can detect intent and load the right recipe, applying 35 years of Postgres best practices behind the scenes. </p><p><strong>The templates have real depth. </strong>No scraped snippets or boilerplate. The templates are written by senior Postgres engineers and provide opinionated, production-tested guidance tuned to avoid the traps that seasoned DBAs know well.</p><p><strong>Postgres-native vector retrieval adds the right context.</strong> When the AI needs more information, the MCP server searches the versioned Postgres (15-18) and TimescaleDB docs. 
And it uses Postgres itself for storage and vector search.</p><p>Versioning is critical. For example, Postgres 15 introduced UNIQUE NULLS NOT DISTINCT, while 16 improved parallel queries, and 17 changed COPY error handling. The MCP keeps AIs grounded in correct syntax every time, avoiding broken code from the wrong version.</p><p>The Tiger MCP doesn’t just wire up APIs. It teaches AI to think like a real Postgres engineer. </p><p>You don’t have to craft the perfect prompt. You just ask, and it does the right thing.</p><h2 id="see-it-for-yourself">See It For Yourself</h2><p>Install the Tiger CLI and MCP server:</p><pre><code class="language-shell">curl -fsSL https://cli.tigerdata.com | sh
tiger auth login
tiger mcp install</code></pre><p>(We also have alternative <a href="https://github.com/timescale/tiger-cli"><u>installation instructions</u></a> for the CLI tool.)</p><p>Then select your AI assistant (Claude Code, Cursor, VS Code, Windsurf, etc.) and immediately get real Postgres knowledge flowing into your AI.</p><p>This is how Postgres becomes the best database to use with AI coding tools: not by accident, not because someone pasted docs into a chat, but because the tooling now teaches AI how to think in Postgres.&nbsp;</p><p>Try the MCP server. Break it. <a href="https://timescaledb.slack.com/join/shared_invite/zt-38c4rrt9t-eR8I4hnb4qeGLUrL6hM3mA#/shared-invite/email"><u>Improve it</u></a>. Help us teach every AI to write real Postgres.</p><hr><p><strong>About the authors</strong></p><p><strong>Matty Stratton</strong></p><p>Matty Stratton is the Head of Developer Advocacy and Docs at Tiger Data, a well-known member of the DevOps community, founder and co-host of the popular <a href="https://www.arresteddevops.com/"><u>Arrested DevOps</u></a> podcast, and a global organizer of the <a href="https://devopsdays.org"><u>DevOpsDays</u></a> set of conferences.</p><p>Matty has over 20 years of experience in IT operations and is a sought-after speaker internationally, presenting at Agile, DevOps, and cloud engineering-focused events worldwide. Demonstrating his keen insight into the changing landscape of technology, he recently changed his license plate from DEVOPS to KUBECTL.</p><p>He lives in the Chicagoland area and has three awesome kids and two Australian Shepherds, whom he loves just a little bit more than he loves Diet Coke.</p><p><strong>Matvey Arye</strong></p><p><a href="https://www.linkedin.com/in/matvey-arye/"><u>Matvey Arye</u></a> is a founding engineering leader at Tiger Data (creators of TimescaleDB), the premier provider of relational database technology for time-series data and AI. 
Currently, he manages the team at Tiger Data responsible for building the go-to developer platform for AI applications.&nbsp;</p><p>Under his leadership, the Tiger Data engineering team has introduced partitioning, compression, and incremental materialized views for time-series data, plus cutting-edge indexing and performance innovations for AI.&nbsp;</p><p>Matvey earned a Bachelor's degree in Engineering at The Cooper Union. He earned a Doctorate in Computer Science at Princeton University, where his research focused on cross-continental data analysis, covering issues such as networking, approximate algorithms, and performant data processing.&nbsp;</p><p><strong>Jacky Liang</strong></p><p><a href="https://www.linkedin.com/in/jjackyliang/"><u>Jacky Liang</u></a> is a developer advocate at Tiger Data with an obsession with AI and LLMs. He's worked at Pinecone, Oracle Cloud, and Looker Data as both a software developer and product manager, which has shaped the way he thinks about software.&nbsp;</p><p>He cuts through AI hype to focus on what actually works. How can we use AI to solve real problems? What tools are worth your time? How will this technology actually change how we work?&nbsp;</p><p>When he's not writing or speaking about AI, Jacky builds side projects and tries to keep up with the endless stream of new AI tools and research—an impossible task, but he keeps trying anyway. His model of choice is Claude Sonnet 4 and his favorite coding tool is Claude Code.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Blocked Bloom Filters: Speeding Up Point Lookups in Tiger Postgres' Native Columnstore]]></title>
            <description><![CDATA[Learn how blocked Bloom filters in TimescaleDB 2.20 deliver up to 100× faster point lookups, speeding up columnar queries without manual tuning.]]></description>
            <link>https://www.tigerdata.com/blog/blocked-bloom-filters-speeding-up-point-lookups-in-tiger-postgres-native-columnstore</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/blocked-bloom-filters-speeding-up-point-lookups-in-tiger-postgres-native-columnstore</guid>
            <category><![CDATA[TimescaleDB]]></category>
            <category><![CDATA[Product & Engineering]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Jacky Liang]]></dc:creator>
            <pubDate>Wed, 18 Jun 2025 12:00:00 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2025/06/IMG_8672.png">
            </media:content>
<content:encoded><![CDATA[<div class="kg-card kg-callout-card kg-callout-card-yellow"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text">This is the first post in a technical deep dive series that explores the ways we are building the fastest Postgres.</div></div><p>Database storage is a study in locality.</p><ul><li><strong>Row stores</strong> keep all fields of a record together, with operations only on full rows at a time.</li><li><strong>Column stores</strong> organize each column’s values in compressed blocks, allowing operations to target specific columns.</li></ul><p>This trade-off is structural, not just cosmetic. Row layout is great for fast inserts and lookups. Column layout excels at filters, aggregations, and scans over a large number of rows, but on a smaller number of columns (as long as your query plays by the rules).</p><p>But here’s the catch: <strong>Columnstores are only fast at filtering when your predicate aligns with the physical sort order. </strong>If your data is ordered by time, or clustered by a <code>segmentby</code> (like customer ID or device), the engine can skip large blocks. But if you filter on an unsorted field (like a trace ID, transaction UUID, or error code), there’s often no optimization to exploit. The engine has to decompress every block and scan every value, just in case there is a match.</p><p>TimescaleDB combines both layouts into a hybrid table model, with recent records written to the rowstore and automatically migrated to the columnstore over time. This design closely aligns with real-world query patterns: fresh data is updated frequently, while older data is rarely updated and mostly used in aggregate. We’ve already implemented columnar mutability, but one challenge remained: sometimes you need to filter on unsorted fields across terabytes of columnar data.&nbsp;</p><p>And it’s not hypothetical; we’ve seen it in the wild many times. 
Dashboards hang while users query a UUID, waiting as the engine churns through thousands of compressed blocks. The data is there. The filter is simple. But the system grinds.</p><p>That isn’t acceptable. At Tiger, we’re here to deliver <strong>speed without sacrifice</strong>.</p><p>So we implemented <strong>blocked bloom filters</strong>, and our users already love them:&nbsp;</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2025/06/Screenshot-2025-06-19-at-8.08.32-pm.png" class="kg-image" alt="" loading="lazy" width="1612" height="216" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2025/06/Screenshot-2025-06-19-at-8.08.32-pm.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2025/06/Screenshot-2025-06-19-at-8.08.32-pm.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2025/06/Screenshot-2025-06-19-at-8.08.32-pm.png 1600w, https://timescale.ghost.io/blog/content/images/2025/06/Screenshot-2025-06-19-at-8.08.32-pm.png 1612w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Our community member </span><a href="https://github.com/pantonis"><u><span class="underline" style="white-space: pre-wrap;">@pantonis</span></u></a><span style="white-space: pre-wrap;"> saw </span><b><strong style="white-space: pre-wrap;">100× faster lookups</strong></b><span style="white-space: pre-wrap;"> after upgrading to TimescaleDB 2.20</span></figcaption></figure><p>P.S. 
If you're using TimescaleDB 2.20 or later (or Tiger Postgres on <a href="https://www.tigerdata.com/cloud"><u>Tiger Cloud</u></a>), bloom filters are already speeding up lookups on sparsely distributed UUIDs, enums, and text fields by up to 100x.</p><h2 id="the-challenge-of-point-lookups-in-columnar-storage">The Challenge of Point Lookups in Columnar Storage</h2><p>If you've worked with large-scale time-series or analytics workloads, you've probably experienced this pain. You're querying 10TB of trace data to find a single ID like '550e8400-e29b-41d4-a716-446655440000'. Your database starts churning through millions of batches, reading and decompressing terabytes of data. Minutes and hours tick by. Your application times out. Users complain because they needed this report an hour ago (everything is urgent).&nbsp;</p><p>Imagine you’re looking for a needle in <strong><em>hundreds of compressed bundles of haystacks</em></strong>—you have to unbundle, loosen, and search through every single bundle because you don't know which one contains your needle.&nbsp;</p><p>This happens because columnar databases store data in sorted batches, compress each column separately, and use ordering metadata to skip irrelevant batches. This works perfectly when you're querying by the same column you sorted by:</p><pre><code class="language-SQL">-- This works well - time-based query on ordered data
SELECT * FROM metrics
  WHERE timestamp BETWEEN '2024-01-01' AND '2024-01-02';</code></pre><p>But it completely breaks down for uncorrelated columns:</p><pre><code class="language-SQL">-- This is painful - random ID query on a non-segmented column
SELECT * FROM metrics 
  WHERE trace_id = '550e8400-e29b-41d4-a716-446655440000';</code></pre><p>The thing with UUIDs (except for UUIDv7, keep an eye out for support coming soon) is that they're completely random; it doesn't make sense to order them. When your data is sorted by time but you're searching by trace ID, every batch now contains a random mix of IDs. So ordering becomes useless, and the database can't skip any batches.&nbsp;</p><h2 id="what-is-a-bloom-filter">What is a Bloom Filter?&nbsp;</h2><p>This is where bloom filters help—they're additional metadata that can efficiently answer "is this value definitely not in this batch?" without actually storing a reference to each value.</p><p>A bloom filter is a small-yet-efficient data structure that uses an array of bits and hash functions to quickly test if something might be in a set or not.&nbsp;</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2025/06/bloom_filter_basic.CNfSrq8t_ZWI3Wl-1.svg" class="kg-image" alt="" loading="lazy" width="498" height="474"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Bloom filter illustration courtesy of </em></i><a href="https://www.bytedrum.com/posts/bloom-filters/"><u><i><em class="italic underline" style="white-space: pre-wrap;">Bytedrum: Bloom Filters</em></i></u></a><i><em class="italic" style="white-space: pre-wrap;">, a great visual explainer.&nbsp;</em></i></figcaption></figure><p>Bloom filters can say something is "definitely not there" or "might be there", and crucially, they never say "it’s missing" when something is actually there.</p><p>Using Spotify’s playlist feature as an example, when Spotify needs to check if a song is in one of your playlists, instead of scanning through every song in every playlist (which means reading every playlist from storage—stupidly expensive with billions of songs), they use a bloom filter—a compact “summary” that can instantly say whether a song is “definitely not in 
this playlist” or “might be in this playlist”.&nbsp;</p><p>For the “might be” cases, Spotify then uses traditional seek methods to check the actual playlist data.&nbsp;</p>
<!--kg-card-begin: html-->

  <iframe 
    src="https://spotify-bloom-filter.vercel.app" 
    width="100%" 
    height="1000" 
    frameborder="0"
    scrolling="yes"
    loading="lazy"
    style="display: block; border: none; background: #000;">
  </iframe>
<!--kg-card-end: html-->
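<p>The two-step lookup the demo above illustrates can be sketched in a few lines of Python. This is a hypothetical, self-contained example (the function names and batch layout are invented; in TimescaleDB the pre-check and scan happen inside the engine): a batch is scanned only when its filter says the value might be present.</p>

```python
import hashlib

def bit_positions(value, m=1024, k=4):
    # Stand-in bloom hash: k salted digests -> a set of k bit positions.
    return {int.from_bytes(hashlib.sha256(f"{i}:{value}".encode()).digest()[:8],
                           "big") % m
            for i in range(k)}

def build_filter(batch):
    # A batch's filter is the union of the positions of all its values.
    positions = set()
    for value in batch:
        positions |= bit_positions(value)
    return positions

def lookup(batches, filters, value):
    """Return (matches, scanned): only batches whose filter says
    'might be here' pay for the expensive membership scan."""
    matches, scanned = [], 0
    for batch, flt in zip(batches, filters):
        if not bit_positions(value) <= flt:
            continue  # "definitely not here": skip this batch entirely
        scanned += 1  # analogous to decompressing and scanning the batch
        if value in batch:
            matches.append(batch)
    return matches, scanned
```

<p>With, say, 50 playlists of 20 songs each, looking up one song typically scans only the single playlist that can contain it and skips the other 49, which is exactly the I/O saving described below.</p>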
<p>[Interactive Spotify Bloom Filter demo: <a href="https://spotify-bloom-filter.vercel.app/"><u>https://spotify-bloom-filter.vercel.app/</u></a>]&nbsp;</p><p>This may not sound that useful, but when dealing with massive-scale workloads of millions of playlists and billions of songs, using a bloom filter eliminates 95%+ of linear-time playlist scans, turning minutes of searching into milliseconds.&nbsp;</p><p>Obviously, no data structure is catch-free: it may occasionally check a playlist unnecessarily, at around a 2% false positive rate (more on this later). But this is an acceptable tradeoff as you still get massive I/O savings across your entire system.&nbsp;</p><h2 id="how-we-added-bloom-filters-into-the-columnstore">How We Added Bloom Filters into the Columnstore</h2><p>Here's how TimescaleDB solves this: bloom filters act as a quick pre-check. Before reading any batch from disk, TimescaleDB checks a tiny bloom filter in memory that says "this ID is definitely not in this batch" or "this ID might be in this batch." This lets us skip 95%+ of batches instantly.</p><p>For the few batches that might contain your ID, TimescaleDB reads them from disk and processes them efficiently using vectorized operations (SIMD) that check many rows at once—much faster than Postgres's traditional row-by-row approach. 
But the real win is avoiding the I/O in the first place.</p><h3 id="no-manual-configuration-needed">No manual configuration needed</h3><p>When building with TimescaleDB, you <strong><em>don’t need to worry</em></strong> about when to use bloom filters or min/max indexes, because we automatically choose for you based on your column types!&nbsp;</p><p>For columns that you use in your table ordering (like timestamps and numbers used in range queries), we stick with the min/max method because we know that scans will be in order.</p><p>For random things like text fields, UUIDs, enum types (or basically anything else that supports Postgres hash indexes), we will create bloom filters automatically (as long as you have the column indexed in the rowstore with a btree, hash, or brin index).&nbsp;</p><h3 id="diving-deeper%E2%80%9Cblocked-bloom-filters%E2%80%9D">Diving deeper - “blocked bloom filters”</h3><p>TimescaleDB uses a technique called a "blocked bloom filter", where each bloom filter starts at about 16 kilobits (2 KB) per batch (sized for up to 1,000 items) with a 2% false positive rate and uses 6 different hash functions per value.&nbsp;</p><p>This size isn’t random—it’s calculated based on math. Here’s a <a href="https://hur.st/bloomfilter/"><u>handy calculator</u></a> you can try out yourself.&nbsp;</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2025/06/SCR-20250611-otwm-2-2.png" class="kg-image" alt="" loading="lazy" width="882" height="1004" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2025/06/SCR-20250611-otwm-2-2.png 600w, https://timescale.ghost.io/blog/content/images/2025/06/SCR-20250611-otwm-2-2.png 882w" sizes="(min-width: 720px) 720px"></figure><p>For 1,000 items with a 2% false positive rate, the optimal formula gives us ~8K bits, but we round up to ~16k bits to enable our folding compression trick (we will get to this below!). 
This sizing ensures we get exactly the false positive rate we want while keeping the filters small enough to stay fast in memory.</p><p>The "blocked" part is a performance technique—instead of spreading hash bits all over a huge array, TimescaleDB keeps all the bits for one value within a 256-bit block. This fits nicely in your CPU cache and makes everything faster.&nbsp;</p><p>For hashing, TimescaleDB primarily uses a modern library called <a href="https://github.com/backtrace-labs/umash"><u>UMASH</u></a> that's faster than Postgres's built-in hashing, but falls back to the Postgres version for custom data types or older processors. There is a funny story here on interoperability that I’ll share on socials!</p><h3 id="achieving-250x-space-savings">Achieving 250x space savings&nbsp;</h3><p>Here’s where we squeezed additional space and performance benefits out of bloom filters: when a batch doesn’t have many unique values (for example [0, 0, 0, 0, 1, …, 0, 0]), TimescaleDB can compress the bloom filter by “folding” it in half using <a href="https://en.wikipedia.org/wiki/Bitwise_operation#OR"><u>bitwise OR</u></a> operations.&nbsp;</p><p>It can keep folding until the filter shrinks from 16 kilobits down to just 64 bits (8 bytes) for columns with few unique values, also known as low-cardinality columns. We can do this because most of the bits in the batch are zero, so folding concentrates the few set bits without significantly increasing false positives.&nbsp;</p><h2 id="query-walkthrough">Query Walkthrough</h2><p>Here’s an example to explain how queries work step-by-step. Let’s use the following trace ID search query:&nbsp;</p><pre><code class="language-SQL">SELECT * FROM metrics 
   WHERE trace_id = 'abc123'</code></pre><ol><li>TimescaleDB first checks the bloom filters (which are likely to be cached in memory using the traditional Postgres buffer manager) for every batch in your columnstore. </li><li>For each batch, the bloom filter either says “definitely not here” or “might be here”. The database immediately skips all the “definitely not here” batches. </li><li>For the “might be here” batches, TimescaleDB reads them from disk, decompresses them, and scans the actual data (the expensive part). If it was a false positive (that 2% chance we mentioned earlier), no match gets found, and the query just continues running normally.&nbsp;</li></ol><p>The key here is that false positives are okay and don’t impact performance, because even when we hit one false positive, we are still avoiding massive amounts of unnecessary I/O by not having to go through every batch from the get-go.&nbsp;</p><p>A side benefit of our implementation of bloom filters is that the bloom filter metadata is more likely to stay hot in memory. When there are concurrent workloads where different users are querying different parts of your dataset, every query can quickly eliminate most batches without touching slower, more expensive storage.&nbsp;</p><h2 id="where-bloom-filters-excel">Where Bloom Filters Excel</h2><p>Bloom filters excel at large time-series datasets queried by non-temporal identifiers.</p><p>Think about scenarios where you're storing massive amounts of data over time, but you need to find specific records using IDs, addresses, or other fixed identifiers.</p><p><strong>For financial services teams.</strong> Your customers are trying to resolve a failed payment. They enter a transaction reference number into your search. Nothing loads. Your backend query scans years of data just to return a single match, and the user waits ...</p><pre><code class="language-SQL">-- Finding financial transactions by reference
SELECT * FROM payments 
  WHERE transaction_ref = 'TXN-2024-001234';</code></pre><p><strong>For IoT platform teams.</strong> Your dashboard shows live sensor data, but one widget is blank. It’s querying a single sensor reading by ID, and your backend is scanning billions of rows to find it. Users refresh the page. Nothing. The spinner keeps spinning.</p><pre><code class="language-SQL">-- Device-specific IoT data queries
SELECT * FROM sensor_data 
  WHERE reading_id = '73e98d71-5eb7-4018-ace7-1f4490da654a';</code></pre><p><strong>For teams working on blockchain analytics. </strong>Your analytics engine scans millions of blocks to find transactions from a wallet address. API timeouts leave users thinking your service is broken.</p><pre><code class="language-SQL">-- Looking up blockchain transactions by wallet address
SELECT * FROM transactions 
  WHERE from_address = '0x742d35Cc6634C0532925a3b8D';</code></pre><p>Bloom filters turn these queries from minutes or hours into milliseconds by eliminating the need to decompress and scan billions of rows. Instead of checking every batch in your columnstore, you skip 95% of them and only decompress the ones that might actually contain your data.</p><h2 id="performance-numbers-find-specific-values-up-to-100x-faster">Performance Numbers: Find Specific Values Up to 100x Faster&nbsp;</h2><p>Instead of telling you why our bloom filter implementation is great, let's just show you some numbers we got after we ran our benchmarks.&nbsp;</p><pre><code>SELECT min(sent), max(sent), count(*)
  FROM hackers 
  WHERE subject = 'unsubscribe' 
  ORDER BY count(*) DESC 
  LIMIT 10;
</code></pre><p>Before bloom filters, this took 12ms. With bloom filters, it dropped to 2.7ms—that's <strong>3.5x faster</strong>.</p><p>Or consider this blockchain address lookup:</p><pre><code>SELECT * FROM token_transfers 
  WHERE to_address = '0xe23d4eb73b399250301fb024019a734ba9f0d9b5';
</code></pre><p>This one went from 1.065 seconds down to 171.134 ms—a <strong>6x improvement</strong>.</p><p>And let's not forget the report from our user pantonis, who saw a massive <strong>100x improvement</strong>:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2025/06/Screenshot-2025-06-19-at-8.08.32-pm.png" class="kg-image" alt="" loading="lazy" width="1612" height="216" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2025/06/Screenshot-2025-06-19-at-8.08.32-pm.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2025/06/Screenshot-2025-06-19-at-8.08.32-pm.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2025/06/Screenshot-2025-06-19-at-8.08.32-pm.png 1600w, https://timescale.ghost.io/blog/content/images/2025/06/Screenshot-2025-06-19-at-8.08.32-pm.png 1612w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Our community member </span><a href="https://github.com/pantonis"><u><span class="underline" style="white-space: pre-wrap;">@pantonis</span></u></a><span style="white-space: pre-wrap;"> saw </span><b><strong style="white-space: pre-wrap;">100× faster lookups</strong></b><span style="white-space: pre-wrap;"> after upgrading to TimescaleDB 2.20</span></figcaption></figure><h2 id="where-bloom-filters-don%E2%80%99t-work">Where Bloom Filters Don’t Work&nbsp;</h2><p>Like all data structures, bloom filters have strengths and limitations.</p><p>Bloom filters work great when you're looking for exact matches—queries that use the equals sign (=) to find specific values. They also work with standard string comparisons where the rules are consistent.&nbsp;</p><pre><code>-- These work great with bloom filters

SELECT * FROM traces WHERE trace_id = 'abc-123-def';

SELECT * FROM orders WHERE email = 'user@example.com';

SELECT * FROM transactions WHERE status = 'completed';</code></pre><p>However, bloom filters have fundamental limitations and some current implementation restrictions. By design, they can't handle "not equal" searches (<code>&lt;&gt;</code>) or range queries (<code>&lt;</code> or <code>&gt;</code>) because they only test set membership.</p><pre><code>-- These don't work with bloom filters (fundamental limitations)
SELECT * FROM traces WHERE trace_id &lt;&gt; 'abc-123-def';

SELECT * FROM users WHERE created_at &gt; '2024-01-01';

SELECT * FROM transactions WHERE amount BETWEEN 100 AND 500;</code></pre><p>Current implementation restrictions in TimescaleDB mean they also can't help with multiple value searches (like WHERE column IN (1, 2, 3)) or cross-type comparisons without explicit casting. These may change in future versions.</p><pre><code>-- These don't work yet (implementation restrictions)
SELECT * FROM traces WHERE trace_id IN 
  ('abc-123', 'def-456', 'ghi-789');

SELECT * FROM users WHERE user_id = 12345;  -- int8 = int4 comparison

SELECT * FROM posts WHERE category = ANY(ARRAY['tech', 'science']);</code></pre><p><strong>Also, bloom filters don't help much when the value you're looking for exists in most batches.&nbsp;</strong></p><p>For example, if you're searching for a common status like <code>active</code> that appears in every batch, the bloom filter will report a potential positive for every batch, forcing TimescaleDB to decompress and check them all anyway. The bloom filter can't skip anything, so you don't get any savings.&nbsp;</p><h2 id="speed-without-sacrifice">Speed without Sacrifice</h2><p>Bloom filters in TimescaleDB are a perfect example of "it just works" optimization and our commitment to making life easier for developers working at massive data scales.</p><p>The bloom data structure automatically kicks in for the right data types and query patterns, dramatically improving performance for needle-in-haystack queries without any configuration required. It works out of the box in Tiger Cloud—no setup required other than having an index on your rowstore columns.</p><p>You can verify bloom filters are working by looking for <code>_timescaledb_functions.bloom1_contains</code> in your query execution plans. The storage overhead is minimal, typically a few hundred bytes per batch, with a maximum of 1KB. For a table with a million batches, you're looking at roughly 100MB to 1GB of bloom filter metadata. That's <strong>0.01% storage overhead</strong> for massive query speedups.</p><p>Built by developers, for developers. TimescaleDB refuses to accept the traditional trade-offs of database storage. 
We give you the speed of columnar analytics with the flexibility of point lookups, all in the same system.</p><p>Try it now on <a href="https://www.tigerdata.com/cloud"><u>Tiger Cloud</u></a>.&nbsp;</p><h3 id="additional-reading">Additional reading&nbsp;</h3><ol><li><a href="https://www.timescale.com/blog/speed-without-sacrifice-2500x-faster-distinct-queries-10x-faster-upserts-bloom-filters-timescaledb-2-20"><u>Speed Without Sacrifice: 2500x Faster Distinct Queries, 10x Faster Upserts, Bloom Filters and More in TimescaleDB 2.20</u></a>&nbsp;</li><li><a href="https://www.bytedrum.com/posts/bloom-filters/"><u>Bloom Filters: The Unsung Heroes of Computer Science</u></a></li><li><a href="https://hur.st/bloomfilter/"><u>Bloom Filter Calculator</u></a></li></ol>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to Build a Secure, Authorized Chatbot Using Oso and Timescale]]></title>
            <description><![CDATA[Timescale and Oso webinar recap where you'll learn how to build a secure, authorized LLM chatbot using Oso and Timescale Vector. ]]></description>
            <link>https://www.tigerdata.com/blog/how-to-build-a-secure-authorized-chatbot-using-oso-and-timescale</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/how-to-build-a-secure-authorized-chatbot-using-oso-and-timescale</guid>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Jacky Liang]]></dc:creator>
            <pubDate>Tue, 13 May 2025 13:19:26 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2025/05/2025-may-12-oso-and-timescale-oso-blog-thumbnail.png">
            </media:content>
            <content:encoded><![CDATA[<p>The rush to integrate large language models (LLMs) into production apps has exposed a common failure mode: without proper authorization in place, they can easily expose sensitive data to the wrong users. Combine that with complex infrastructure (vector databases, sync pipelines, separate stores for embeddings and metadata), and you’re shipping a fragile system that puts user data at risk.</p><p>At <a href="https://www.timescale.com/"><u>Timescale</u></a> and <a href="https://www.osohq.com/"><u>Oso</u></a>, we think there’s a better way.</p><p>In this webinar, we show how you can build a secure, scalable AI chatbot using Postgres—and <em>only</em> Postgres—by leveraging Timescale’s <a href="https://github.com/timescale/pgai"><u>pgai library</u></a> and Oso’s <a href="https://www.osohq.com/cloud/authorization-service"><u>authorization platform as a service</u></a>.</p><figure class="kg-card kg-embed-card"><iframe width="200" height="113" src="https://www.youtube.com/embed/5GFhqVOM8UE?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" title="How to build a secure, authorized chatbot using Oso and Timescale"></iframe></figure><p>Here are the webinar highlights, summarized for you in chapters for easy reference.</p><p>(To deploy our sample app for authorized secure chatbot built using Oso and pgai, see this <a href="https://github.com/jackyliang/timescale-oso-rag-chatbot"><u>open-source code</u></a>.)</p><h2 id="why-most-ai-chatbot-demos-fail-in-production">Why Most AI Chatbot Demos Fail in Production</h2><p>[08:30–11:50]</p><p><strong>Why do simple chatbots break in production?</strong> Demo chatbots are easy: embed your docs, slap on an OpenAI API key, and you’re done.</p><p>But in a real business environment, Bob (the employee) should never see Alice’s harsh performance 
review feedback. Only Alice, their manager and HR should. Sales shouldn’t see engineering tickets.&nbsp;</p><p>Without authorization boundaries, your chatbot becomes a data leak waiting to happen.</p><p>Many demos fall short because they:</p><ul><li>Expose <em>all</em> content to <em>all</em> users</li><li>Ignore org-specific permissions (e.g., team-level access control)</li><li>Assume static or role-based authorization models</li><li>Rely on dual data systems (e.g., Postgres + Vector DB), causing data synchronization difficulties.</li></ul><p>The fix? Build with authorization and data consistency as first principles.</p><h2 id="why-we-combined-postgres-pgvector-and-oso">Why We Combined Postgres, pgvector, and Oso</h2><p>[13:34–17:47]</p><p>We introduced an end-to-end reference stack that solves both the <strong>data synchronization and</strong> <strong>authorization complexity problem</strong>. The solution uses:</p><ul><li><strong>Timescale + pgai</strong> for real-time, in-database vector search and updates</li><li><strong>Oso Cloud</strong> for relationship-based access controls, enforced natively via PostgreSQL</li><li><strong>No glue code</strong> or ETL scripts between systems</li></ul><p>The result: you get a secure, performant, and authorized chat system with <em>zero</em> duplicated data.</p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text"><i><em class="italic" style="white-space: pre-wrap;">“Chatbot demos are simple. Business-grade AI is hard. 
We’re going to show you how to make the hard, easy.</em></i>” — <b><strong style="white-space: pre-wrap;">Jacky, Developer Advocate, Timescale</strong></b></div></div><h2 id="real-time-vector-sync-with-pgai-vectorizer">Real-Time Vector Sync With pgai Vectorizer</h2><p>[14:33–20:45]</p><p>Instead of bolting a vector database on top of your existing Postgres database, <a href="https://github.com/timescale/pgai/blob/main/docs/vectorizer/overview.md"><u>pgai Vectorizer</u></a> keeps your embeddings <strong>automatically synchronized</strong> with your source data in Postgres.</p><ul><li>Create vectorizers via Python</li><li>Ingest from S3, Hugging Face, or existing Postgres tables</li><li>Bring your own embedding model (OpenAI, Nomic, etc.)</li><li>Chunk and embed documents with configurable rules</li><li>Never worry about mismatched records again</li></ul><pre><code>SELECT ai.create_vectorizer(
  'blog'::regclass,
  loading =&gt; ai.loading_column(column_name =&gt; 'content'),
  embedding =&gt; ai.embedding_openai(model =&gt; 'text-embedding-3-small', dimensions =&gt; 768),
  destination =&gt; ai.destination_table('blog_embeddings')
);</code></pre><p>Run your vectorizer worker:</p><pre><code class="language-SQL">pgai vectorizer worker -d postgresql://...</code></pre><p>No extra queues, pipelines, or lambdas needed. Just Python and Postgres.</p><h2 id="authorization-that-follows-relationships-not-just-roles">Authorization That Follows Relationships, Not Just Roles</h2><p>[21:43–28:14]</p><p>Many apps rely on <a href="https://www.osohq.com/docs/modeling-in-polar/role-based-access-control-rbac"><u>Role-Based Access Control (RBAC)</u></a>. But real-world permissions often depend on <a href="https://www.osohq.com/docs/modeling-in-polar/relationship-based-access-control-rebac"><u>relationships</u></a>:</p><ul><li>“Bob can view reviews only if he’s the owner of the document”</li><li>“Diane (HR) can see feedback others can’t”</li><li>“Support engineers can access sensitive logs only during active shifts”</li></ul><p>Oso lets you model this in code:</p><pre><code class="language-polar">resource Folder{
 roles = ["viewer"];
 permissions = ["view"];
 relations = { team: Team };

 "viewer" if "member" on "team";
 "viewer" if global "hr";
 "viewer" if is_public(resource);

 "view" if "viewer";
}
</code></pre><p>It also incorporates your Postgres data using native SQL, so you don’t need to sync users, roles, or groups into a second system.</p><h2 id="putting-it-together-authorized-retrieval-augmented-generation-rag">Putting It Together: Authorized Retrieval Augmented Generation (RAG)</h2><p>[30:44–37:32]</p><p>Here’s how the architecture works:</p><ol><li>A user (Bob or Diane) sends a question to the chatbot.</li><li>The app queries Oso to determine what data the user is <em>authorized</em> to access.</li><li>That filter is converted to a SQL query that joins source + embedding data in Timescale.</li><li>Only the authorized context is sent to the LLM (e.g., OpenAI) to generate a final response.</li></ol><p>The result: the same chatbot provides personalized, secure answers based on who’s asking—without leaking data or requiring redundant systems.</p><h2 id="what-you%E2%80%99ll-learn-from-the-demo">What You’ll Learn From the Demo</h2><p>[29:01–48:00]</p><ul><li>How to build a business-grade RAG stack without a separate vector DB</li><li>How to enforce field-level access control in LLM-based apps</li><li>How Timescale + pgai + Oso make Postgres the <em>only</em> data system you need</li><li>Why prompt engineering, chunking, and system prompts matter in retrieval quality</li><li>How to embed PDF, DOCX, and S3-based documents securely</li></ul><h2 id="next-steps">Next Steps</h2><p>We’ve open-sourced the reference app and walkthrough:</p><ul><li><a href="https://www.youtube.com/watch?v=5GFhqVOM8UE"><u>Watch the full webinar</u></a></li><li><a href="https://docs.timescale.com"><u>Explore the Timescale pgai docs</u></a></li><li><a href="https://www.osohq.com/docs"><u>Learn more about Oso Cloud</u></a></li><li><a href="https://oso-oss.slack.com/ssb/redirect"><u>Join the Oso community on Slack</u></a></li></ul><p>If you’re building AI agents, chat interfaces, or internal copilots—don’t wait to layer in security and data correctness.</p><p>Your users will thank you. 
Your auditors will too.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Document Loading, Parsing, and Cleaning in AI Applications]]></title>
            <description><![CDATA[Before writing the first line of embeddings for your AI application, you need to load, parse, and clean your data. Here’s how.]]></description>
            <link>https://www.tigerdata.com/blog/document-loading-parsing-and-cleaning-in-ai-applications</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/document-loading-parsing-and-cleaning-in-ai-applications</guid>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[AI agents]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Jacky Liang]]></dc:creator>
            <pubDate>Tue, 08 Apr 2025 17:50:10 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2025/04/Document-Loading--Parsing--and-Cleaning-in-AI-Applications.png">
            </media:content>
<content:encoded><![CDATA[<p>Welcome to part one of our <a href="https://www.timescale.com/blog/agentic-rag-best-practices-guide-for-building-ai-apps-with-postgresql" rel="noreferrer"><u>Agentic RAG Best Practices</u></a> series, where we cover how to load, parse, and clean documents for your agentic applications.&nbsp;</p><p>This comprehensive guide will teach you how to build effective agentic retrieval applications with PostgreSQL.</p><p>Every week, customers ask us about building AI applications. Their most pressing concern isn't advanced chunking strategies or vector databases—it's simply: "How do I clean my data before feeding it to my AI?"</p><p>It’s simple: “Garbage in, garbage out.”&nbsp;</p><p>Before worrying about writing even your first line of embeddings or retrieval code, <strong><em>you need clean data</em></strong>.</p><p>In this first guide of our agentic RAG series, we'll cover gathering the right data, extracting text from various document types, pulling valuable metadata, web scraping techniques, and effectively storing data in PostgreSQL. We'll address common challenges like fixing formatting issues and handling images in documents.&nbsp;</p><p>By the end, you'll know how to transform raw documents into clean, structured data that retrieval agents can effectively use.&nbsp;</p><p>Don’t want to read all of this and just want to apply it? We have prepared a handy-dandy preparation checklist for this topic.
</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2025/04/Document-Loading--Parsing--and-Cleaning-in-AI-Applications_decision-tree.png" class="kg-image" alt="A decision tree for document processing when building agentic RAG apps" loading="lazy" width="2000" height="1746" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2025/04/Document-Loading--Parsing--and-Cleaning-in-AI-Applications_decision-tree.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2025/04/Document-Loading--Parsing--and-Cleaning-in-AI-Applications_decision-tree.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2025/04/Document-Loading--Parsing--and-Cleaning-in-AI-Applications_decision-tree.png 1600w, https://timescale.ghost.io/blog/content/images/2025/04/Document-Loading--Parsing--and-Cleaning-in-AI-Applications_decision-tree.png 2392w" sizes="(min-width: 720px) 720px"></figure><p>One fintech customer recently shared how they spent weeks fine-tuning their RAG application with different vector databases, only to realize their poor results stemmed from simply having dirty data. "I approached the whole thing with like, I don't trust these AIs (. … ) So we don't ask them to make decisions. We do normal modeling to figure out what the user needs, then feed that data to the LLM and just say, 'Summarize it.'"&nbsp;</p><p>The garbage in, garbage out principle applies strongly to AI applications. Let's explore how to properly load, parse, and clean your data for AI use.&nbsp;</p><h2 id="gathering-the-right-data-for-ai-applications">Gathering the Right Data for AI Applications</h2><p>👉🏻 Watch the <a href="https://www.tiktok.com/@answer.hq/video/7488002948304407854"><u>one-minute video</u></a> summary.</p><p>Before even thinking about cleaning or processing your data, you need to make sure you have the right data in the first place. 
I know, this sounds obvious, but it’s a very important step that many teams overlook in rushing to build their shiny RAG app.&nbsp;</p><h3 id="data-selection-matters">Data selection matters</h3><p>We have seen many AI teams build state-of-the-art RAG apps that still deliver bad answers. Most of the time, there is nothing wrong with their retrieval algorithm, vector database, embedding model, or large language model. The problem is that they simply don’t have the necessary information in the knowledge base, so the LLM made something up instead or provided insufficient answers.&nbsp;</p><p>In most cases, if the information doesn’t exist in your documents, your RAG app should either return nothing (the best-case scenario), or the LLM will simply hallucinate a plausible answer (this is the worst-case scenario).&nbsp;</p><h3 id="choosing-the-right-data">Choosing the right data</h3><p>Before building your RAG application, ask yourself and the team these questions:</p><ol><li>What specific questions will users ask the system?&nbsp;</li><li>What documents contain factual information to these questions?&nbsp;</li><li>What are the gaps in our current documentation?&nbsp;</li><li>Is our information up-to-date, or will it need to be regularly updated?&nbsp;</li><li>Do we have a system in place to identify information gaps as users use the app?</li></ol><p>You need to be able to confidently answer these questions.&nbsp;</p><h3 id="where-to-collect-data">Where to collect data</h3><ol><li>Internal knowledge base: Check company wikis, technical documentation, reports, manuals, and databases.&nbsp;</li><li>External sources: Read industry publications, research papers, and public datasets.&nbsp;</li><li>Customer interactions: Check support tickets, chat logs, FAQs, etc.&nbsp;</li><li>Real-time sources: See news feeds, market data, IoT sensor data, etc.</li><li>Intuition: You may have some ideas where certain important data lives, so trust your gut.</li></ol><div class="kg-card 
kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text"><b><strong style="white-space: pre-wrap;">Note:</strong></b> Make sure these documents<i><b><strong class="italic" style="white-space: pre-wrap;"> don’t contain sensitive information</strong></b></i> you don’t want your users to ask about!&nbsp;</div></div><p>Be intentional about your data sources—the higher the quality and relevancy, the better.&nbsp;</p><h3 id="ensure-data-freshness">Ensure data freshness</h3><p>Most business data isn't static—it often changes as your products, services, and policies evolve. Outdated information in your RAG system leads to incorrect answers and really hurts your customers’ trust in the AI system (I mean, look at Google’s initial rollout of Bard.)&nbsp;</p><p>Consider the following suggestions for keeping your knowledge base up-to-date:</p><ol><li>Set up consistent update schedules: This will be different depending on your business needs. It can be hourly, weekly, monthly, or even quarterly.</li><li>Implement trigger-based updates: Update content whenever the source document changes. For example, when your team updates some documentation, your system should automatically refresh the corresponding knowledge base entries.</li><li>Create document ownership: If you work in a large company, you may need to assign responsibility to other individuals or teams for specific knowledge areas to ensure data is constantly updated.</li><li>Track user feedback: Many RAG systems allow users to rate answers. 
This rating system (like a simple thumbs up and down) can help identify outdated or incorrect information that needs to be updated, added, or removed from your knowledge base.&nbsp;</li><li>Track question patterns: Continuously analyze questions that consistently receive poor ratings to identify areas where your knowledge base needs improvement.&nbsp;</li></ol><p>Data freshness is one of the silent killers of data accuracy—no advanced RAG pipeline can fix this.&nbsp;</p><h2 id="extracting-text-from-documents">Extracting Text From Documents</h2><p>👉🏻 Watch the <a href="https://www.tiktok.com/@answer.hq/video/7488737770127494442"><u>one-minute video</u></a> summary.</p><p>Approximately 85 percent of the world's data is unstructured: think PDFs, Word files, emails, PowerPoint presentations, and more. To use this data with AI, you first need to extract the raw text.</p><h2 id="using-markitdown-for-document-conversion">Using MarkItDown for Document Conversion</h2><p>Libraries like <a href="https://github.com/microsoft/markitdown"><u>MarkItDown</u></a> and <a href="https://github.com/docling-project/docling"><u>Docling</u></a> can convert PDFs and other formats to Markdown. Markdown has become one of the cleanest and most efficient formats for ingesting data into LLMs because it's nearly plaintext and token-efficient. It can also efficiently represent non-text data like tables.&nbsp;</p><p><a href="https://github.com/microsoft/markitdown?tab=readme-ov-file#python-api"><strong><u>Extract text from PDF</u></strong></a><strong> using MarkItDown&nbsp;</strong></p><pre><code>from markitdown import MarkItDown  
md = MarkItDown()  
result = md.convert("document.pdf")  
text_markdown = result.text_content  
print(text_markdown[:500])</code></pre><p><a href="https://github.com/docling-project/docling?tab=readme-ov-file#getting-started"><strong><u>Extract text from PDF with Optical Character Recognition (OCR)</u></strong></a><strong> using Docling</strong></p><pre><code># Using Docling for Document Conversion with OCR
from docling.document_converter import DocumentConverter

# Initialize the converter
converter = DocumentConverter()

# Load PDF and extract text with OCR enabled
result = converter.convert(
    "document.pdf",  # Can be local path or URL
    enable_ocr=True  # Enable OCR for scanned documents
)

# Get the converted markdown content
markdown_text = result.document.export_to_markdown()

# Preview the first 500 characters
print(markdown_text[:500])
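
# Persist the extracted Markdown so later chunking/embedding steps can reuse it
# (plain-Python sketch; pick whatever path fits your pipeline)
with open("document.md", "w", encoding="utf-8") as f:
    f.write(markdown_text)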
</code></pre><p>The code above returns an object with <code>text_content</code> containing the markdown text, which you can easily pass into your RAG pipeline or LLM for cleaning, analysis, summarizing, or chunking.&nbsp;</p><h2 id="using-visual-language-models-for-ocr">Using Visual Language Models for OCR</h2><p>A new breed of OCR technology is being powered by visual large language models (VLLMs): models that can process not just text, but also images and PDFs. These are trained specifically for unstructured data extraction. One such VLLM making a splash is <a href="https://docs.mistral.ai/capabilities/document/"><u>Mistral OCR</u></a>.</p><p><a href="https://docs.mistral.ai/capabilities/document/"><strong><u>Extract text and images</u></strong></a><strong> (in base64) from PDF using Mistral OCR</strong></p><pre><code>import os
from mistralai import Mistral

api_key = os.environ["MISTRAL_API_KEY"]
client = Mistral(api_key=api_key)

ocr_response = client.ocr.process(
    model="mistral-ocr-latest",
    document={
        "type": "document_url",
        "document_url": "https://arxiv.org/pdf/2201.04234"
    },
    include_image_base64=True
)
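
# Collect the per-page Markdown into one document
# (field names here follow Mistral's documented OCR response shape:
# a pages list whose entries carry a markdown field)
full_markdown = "\n\n".join(page.markdown for page in ocr_response.pages)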
</code></pre><p><a href="https://docs.mistral.ai/capabilities/document/#ocr-with-image"><strong><u>Extract from images</u></strong></a><strong> using Mistral OCR</strong></p><pre><code>import os
from mistralai import Mistral

api_key = os.environ["MISTRAL_API_KEY"]
client = Mistral(api_key=api_key)

ocr_response = client.ocr.process(
    model="mistral-ocr-latest",
    document={
        "type": "image_url",
        "image_url": "https://raw.githubusercontent.com/mistralai/cookbook/refs/heads/main/mistral/ocr/receipt.png"
    }
)
</code></pre><p>What makes Mistral OCR unique is its exceptional performance in extracting text in multiple languages, handling text from images, representing math equations, interpreting structured tables, and other traditionally difficult tasks.&nbsp;</p><p>Other extraction tools you can experiment with include <a href="http://unstructured.io"><u>unstructured.io</u></a>,&nbsp; <a href="https://github.com/allenai/olmocr"><u>olmOCR</u></a>, or just relying on good ol’ humans to extract the data—Upwork or Fiverr is a good place to begin your search for contractors.&nbsp;</p><p>Once you have this more manageable text form, you're ready for either direct ingestion into your database or metadata extraction.&nbsp;</p><h2 id="metadata-extraction">Metadata Extraction</h2><p>👉🏻 Watch the <a href="https://www.tiktok.com/@answer.hq/video/7488738094645103918" rel="noreferrer"><u>one-minute video</u></a> summary.</p><p>All documents contain metadata like title, author, creation date, length, source, customer name, etc. Imagine needing to fetch all documents between Q1 and Q2 of 2025 for a financial report—you'd need to filter by date range using metadata.</p><p>If your PDFs or documents have built-in metadata (added automatically by document processors when saving or exporting), that's great! But what if they don't?&nbsp;</p><h3 id="extracting-built-in-metadata">Extracting built-in metadata</h3><p>For simple metadata extraction from actual PDF data (if available), you can use a library like fitz:</p><p><a href="https://pymupdf.readthedocs.io/en/latest/document.html#Document.metadata"><strong><u>Extract built-in PDF metadata</u></strong></a><strong> using fitz</strong></p><pre><code>import fitz  
doc = fitz.open("example.pdf")  
metadata = doc.metadata  
print(metadata.get("title"), "by", metadata.get("author"))</code></pre><p>For everything else, you need…</p><h2 id="contextual-metadata-extraction">Contextual metadata extraction</h2><p>Most documents don't have native metadata. In these cases, you need a two-step workflow: first, use a PDF text extractor like Mistral OCR, then pass the raw text to another large language model to request specific information using natural language.</p><p>For example, you can use Mistral OCR to analyze each document, define what metadata you'd like to extract (title, author, etc.), and use another LLM to get the metadata information formatted in a specific way (like JSON).</p><p><a href="https://docs.mistral.ai/capabilities/document/#document-understanding"><strong><u>Extract contextual metadata from PDF</u></strong></a><strong> using Mistral OCR and Mistral Small </strong></p><pre><code>import os
from mistralai import Mistral

# Retrieve the API key from environment variables
api_key = os.environ["MISTRAL_API_KEY"]

# Specify model
model = "mistral-small-latest"

# Initialize the Mistral client
client = Mistral(api_key=api_key)

# Define the messages for the chat
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "In JSON format, extract the following metadata from the provided document: title, author, and creation date."
            },
            {
                "type": "document_url",
                "document_url": "https://arxiv.org/pdf/1805.04770"
            }
        ]
    }
]

# Get the chat response
chat_response = client.chat.complete(
    model=model,
    messages=messages
)

# Print the content of the response
print(chat_response.choices[0].message.content)
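
# The model replies with a JSON string; parse it into a dict before storing
# (note: some models wrap JSON in Markdown code fences, so strip those first)
import json
metadata = json.loads(chat_response.choices[0].message.content)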

# Save the metadata somewhere for later ingest into your RAG pipeline
</code></pre><h2 id="extracting-text-from-the-web">Extracting Text From the Web</h2><p>👉🏻 Watch the <a href="https://www.tiktok.com/@answer.hq/video/7488738426116787498" rel="noreferrer"><u>one-minute video</u></a> summary.</p><p>Not all data lives in documents, and not all is accessible via GET API requests. To get data from websites, documentation, and knowledge bases for AI applications, you need to scrape them. The ultimate goal of web scraping is to fetch only the main text content from pages, filtering out headers, footers, sidebars, ads, and tracking scripts, in an LLM-friendly format like Markdown.</p><p>In the past, this was done with libraries like requests, Selenium, BeautifulSoup, etc., and required manually setting up proxies to evade rate limiters. Thankfully, it's no longer as painful to scrape the web today (yay!).</p><p>Web scrapers generally need to do the following tasks:</p><ol><li>Crawl: Get all pages of an entire website by gathering a list of all internal and external links (essentially building a sitemap, if it’s not available on /sitemap.xml).</li><li>Scrape: Get the DOM/text content of each individual page.</li><li>Proxy<strong>: </strong>Switch to a different IP to continue on a large crawling and scraping job.&nbsp;</li><li>Clean: Extract the main useful text from the raw DOM.</li><li>Convert: Format the main text content as Markdown, TXT, JSON, etc.&nbsp;</li></ol><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text"><b><strong style="white-space: pre-wrap;">A note about raw DOM</strong></b>: Using a website's HTML is messy because it has ads, menus, and other junk. 
It also doesn't work well for React apps or other single-page applications.</div></div><h2 id="firecrawl-for-web-scraping">Firecrawl for Web Scraping</h2><p><a href="http://firecrawl.com"><u>Firecrawl</u></a> is a web scraping/crawling engine accessible via REST API, Python SDK, and a UI dashboard (currently in beta). What's great about Firecrawl is that it extracts clean page text in various formats. It can crawl an entire site and return all pages' content in one go (with advanced filtering options). It also handles all the proxying needed for large-scale crawl and scrape jobs.</p><p><strong>Crawling a website with a limit of 100 pages using </strong><a href="https://docs.firecrawl.dev/introduction#usage"><strong><u>Firecrawl Crawl REST API</u></strong></a></p><pre><code>from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-YOUR_API_KEY")

# Crawl a website:
crawl_status = app.crawl_url(
  'https://firecrawl.dev', 
  params={
    'limit': 100, 
    'scrapeOptions': {'formats': ['markdown', 'html']}
  },
  poll_interval=30
)
print(crawl_status)
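
# Each crawled page comes back with its content in the requested formats.
# (Response shape is an assumption based on Firecrawl's docs: a 'data' list
# of pages, each with 'markdown' content and page 'metadata'.)
for page in crawl_status.get('data', []):
    print(page['metadata']['sourceURL'], len(page['markdown']))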
</code></pre><p><strong>Scraping a URL and outputting in Markdown using </strong><a href="https://docs.firecrawl.dev/introduction#scraping"><strong><u>Firecrawl Scraping REST API</u></strong></a></p><pre><code>from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-YOUR_API_KEY")

# Scrape a website:
scrape_result = app.scrape_url('firecrawl.dev', params={'formats': ['markdown', 'html']})
print(scrape_result)</code></pre><p><strong>Custom metadata extraction in JSON using </strong><a href="https://docs.firecrawl.dev/introduction#extraction"><strong><u>Firecrawl Extraction REST API</u></strong></a></p><pre><code>from firecrawl import FirecrawlApp
from pydantic import BaseModel, Field

# Initialize the FirecrawlApp with your API key
app = FirecrawlApp(api_key='your_api_key')

class ExtractSchema(BaseModel):
    company_mission: str
    supports_sso: bool
    is_open_source: bool
    is_in_yc: bool

data = app.scrape_url('https://docs.firecrawl.dev/', params={
    'formats': ['json'],
    'jsonOptions': {
        'schema': ExtractSchema.model_json_schema(),
    }
})
print(data["json"])</code></pre><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text"><i><b><strong class="italic" style="white-space: pre-wrap;">Author’s note:</strong></b></i><i><em class="italic" style="white-space: pre-wrap;"> “Firecrawl is one of my favorite SaaS services of 2024. It has awesome docs, affordable pricing, and has a very responsive team. Most importantly, it works really well.” – Jacky Liang</em></i></div></div><h2 id="other-web-scraping-options">Other Web Scraping Options</h2><p>Firecrawl isn't the only service/library for crawling, scraping, and cleaning. Another capable service is Jina AI's Reader API, which converts a URL to LLM-friendly inputs simply by adding <code>r.jina.ai</code> in front:&nbsp;&nbsp;</p><p><strong>Fetch a webpage in clean Markdown using Jina AI’s </strong><a href="https://jina.ai/reader/"><strong><u>Reader API</u></strong></a></p><pre><code>r.jina.ai/news.ycombinator.com
</code></pre><p>If you want to build your own end-to-end crawling and scraping infrastructure (expert users only), developers typically use <a href="https://github.com/microsoft/playwright"><u>Playwright</u></a>, a Microsoft framework for web testing and automation. <a href="https://github.com/oxylabs/playwright-web-scraping"><u>Playwright Web Scraping</u></a> is a reliable open-source web scraping implementation using Playwright. <a href="https://github.com/mendableai/firecrawl"><u>Firecrawl</u></a> is also open source and lets you host it in your own infrastructure if you want absolute control.&nbsp;</p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text"><b><strong style="white-space: pre-wrap;">Pro tip:</strong></b> When running your own web scraping infrastructure, make sure to use proxies for your scraping server to avoid IP bans from websites you're crawling and scraping.&nbsp;</div></div><h2 id="direct-data-loading">Direct Data Loading&nbsp;</h2><p><code>pgai</code> has a handy function that lets you import datasets directly from Hugging Face with just the dataset's name:</p><p><a href="https://github.com/timescale/pgai?tab=readme-ov-file#create-a-table-run-a-vectorizer-and-perform-semantic-search"><strong><u>Load data from Hugging Face</u></strong></a><strong> using pgai</strong></p><pre><code>SELECT ai.load_dataset('wikimedia/wikipedia', '20231101.en', table_name=&gt;'wiki', batch_size=&gt;5, max_batches=&gt;1, if_table_exists=&gt;'append');
</code></pre><p><code>pgai</code> has more direct data loading goodies to come—stay tuned.&nbsp;</p><h2 id="storing-data">Storing Data</h2><p>👉🏻 Watch the <a href="https://www.tiktok.com/@answer.hq/video/7488738803012848942" rel="noreferrer">one-minute video</a> summary.</p><p>At Timescale, we believe <a href="https://www.timescale.com/blog/vector-databases-are-the-wrong-abstraction"><u>boutique vector databases are the wrong abstraction</u></a> for AI workloads. PostgreSQL is the ideal solution for typical apps and AI apps, especially RAG applications.</p><p>Instead of using a separate vector database, you can store text embeddings inside PostgreSQL with <a href="https://github.com/pgvector/pgvector"><u>pgvector</u></a>. We recommend using <a href="https://github.com/timescale/pgai"><u>pgai</u></a> to simplify building RAG apps, as we have a handy interface called <code>create_vectorizer()</code> that automatically embeds raw text, chunks it, and continuously keeps it up-to-date.</p><p><strong>Create an AI project using </strong><a href="https://github.com/timescale/pgai?tab=readme-ov-file"><strong><u>pgvector and pgai</u></strong></a><strong> </strong></p><pre><code>-- Enable pgai and pgvector on your Postgres database
CREATE EXTENSION IF NOT EXISTS vector;  
CREATE EXTENSION IF NOT EXISTS ai;  

-- Create a table to store Wikipedia articles
CREATE TABLE wiki (
    id      TEXT PRIMARY KEY,
    url     TEXT,
    title   TEXT,
    text    TEXT
);

-- Load the Wikipedia dataset directly from Hugging Face
SELECT ai.load_dataset('wikimedia/wikipedia', '20231101.en', table_name=&gt;'wiki', batch_size=&gt;5, max_batches=&gt;1, if_table_exists=&gt;'append');
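
-- Optional sanity check: confirm the rows loaded (with batch_size=&gt;5
-- and max_batches=&gt;1 above, this should be a small sample)
SELECT count(*) FROM wiki;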

-- Create a vectorizer that:
-- 1. Chunks the text column with a recursive character text splitter
-- 2. Embeds each chunk with all-minilm (384-dimensional embeddings)
-- 3. Continuously watches the wiki table for incoming text
SELECT ai.create_vectorizer(
     'wiki'::regclass,
     embedding =&gt; ai.embedding_ollama('all-minilm', 384),
     chunking =&gt; ai.chunking_recursive_character_text_splitter('text'),
     formatting =&gt; ai.formatting_python_template('url: $url  title: $title $chunk')
);

-- Check the status of the vectorizer's embedding creation
SELECT * FROM ai.vectorizer_status;
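
-- Illustrative next step, assuming pgai's default naming: once the
-- vectorizer has processed the table, you can run semantic search
-- against the auto-created "wiki_embedding" view, embedding the query
-- text with the same Ollama model
SELECT title, chunk
FROM wiki_embedding
ORDER BY embedding &lt;=&gt; ai.ollama_embed('all-minilm', 'What is PostgreSQL?')
LIMIT 5;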
</code></pre><p><strong>Running the </strong><a href="https://github.com/timescale/pgai?tab=readme-ov-file#quick-start"><strong><u>pgai Vectorizer worker</u></strong></a></p><p>For the vectorizer to work correctly, you need to run the pgai Vectorizer worker alongside your PostgreSQL database. This worker processes your data and creates embeddings. Set up a <code>docker-compose.yml</code> file with the following configuration:</p><pre><code>version: '3'
services:
  db:
    image: timescale/timescaledb-ha:pg17
    environment:
      POSTGRES_PASSWORD: postgres
    ports:
      - "5432:5432"
    volumes:
      - data:/home/postgres/pgdata/data
      
  vectorizer-worker:
    image: timescale/pgai-vectorizer-worker:latest
    environment:
      PGAI_VECTORIZER_WORKER_DB_URL: postgres://postgres:postgres@db:5432/postgres
      OPENAI_API_KEY: your_openai_api_key_here
    command: [ "--poll-interval", "5s" ]
    
  ollama:
    image: ollama/ollama
    
volumes:
  data:
</code></pre><p>If you're using Ollama for embeddings, as shown in our example, point the vectorizer worker at the Ollama service by setting <code>OLLAMA_HOST</code> in its environment:</p><pre><code>vectorizer-worker:
    environment:
      OLLAMA_HOST: http://ollama:11434</code></pre><p>Start everything with “<code>docker-compose up -d</code>” and the worker will automatically poll the database and process your vectorizer tasks. Note that you might need to adjust settings like poll intervals or concurrency depending on your specific workload needs.</p><p>And just like that, we’ve built a production-ready SQL-native retrieval pipeline that is not only powerful but <a href="https://github.com/timescale/pgai/blob/main/docs/vectorizer/overview.md"><u>extremely customizable</u></a>.&nbsp;</p><h2 id="cleaning-messy-data">Cleaning Messy Data</h2><p>Raw text extracted from websites or documents is often messy and contains content not relevant to the main text. This can include ads, navigation menus, footers, tracking scripts, or leftover HTML/CSS markup. Removing this noise is crucial to avoid feeding irrelevant text to your AI model, reduce input token size to lower costs, and increase retrieval accuracy.</p><h3 id="cleaning-webpages">Cleaning webpages</h3><p>Elements to clean include:</p><ul><li>HTML tags that aren't content: <code>&lt;script&gt;</code>, <code>&lt;style&gt;</code>, <code>&lt;nav&gt;</code>, <code>&lt;footer&gt;</code>, etc.</li><li>Advertisements or cookie banners</li><li>Repeated headers/footers on every page</li><li>Excessive whitespace, line breaks, or meaningless Unicode characters</li><li>Images or links to images</li></ul><p>The tools mentioned earlier, like Firecrawl and Jina AI's Reader API, already handle webpage data cleaning and return only the main text content.</p><p>If you have very specific requirements, you can use web automation frameworks like Playwright or BeautifulSoup to get the raw DOM, then use traditional DOM traversal or regex to clean the data. 
This approach is for experts only.</p><h3 id="cleaning-pdfs">Cleaning PDFs</h3><p>After running documents through Mistral OCR, you'll still have repeated content like page numbers, headers/footers, and repetitive line breaks. A growing technique is to use an LLM like Gemini 2.0 Flash, which has a million-token context window and a reasonable cost, to automate cleaning. You can use natural language to instruct it on what to clean: removing repeated titles, sources, footnotes, etc.</p><p>You can also clean text manually if you have very specific data requirements.&nbsp;</p><h2 id="fixing-text-formatting">Fixing Text Formatting</h2><p>Sometimes text is extracted with poor formatting, so you need to standardize lists, headings, and line breaks. You can prompt an LLM to rewrite text more clearly, remove gibberish or irrelevant parts, and correct inconsistencies (without adding extra commentary).</p><p>Issues to fix include:</p><ul><li>Line breaks and paragraphs: Merge lines that belong to the same paragraph. For example, replace hyphenated line breaks (<code>-\n</code>) with nothing, and replace newlines followed by lowercase letters with spaces.</li><li>Lists and bullet points: Convert fancy bullet symbols to a common format (e.g., "•" or "–" to "-"). Ensure list items have consistent formatting.</li><li>Headings and subheadings: If using Markdown, ensure headings use # syntax properly with blank lines before and after.</li><li>Whitespace and punctuation: Trim excessive whitespace, and normalize quotes and dashes if needed.</li><li>Tone: Standardize tone using large-context LLMs like Gemini 2.0 Flash to keep writing style consistent across your text data.</li></ul><p><strong>Clean a variety of OCR text formatting issues using regex</strong><em> (not an exhaustive example)</em></p><pre><code>import re

def clean_ocr_text(text):
    # Replace hyphenated line breaks (e.g., "exam-\nple" -&gt; "example")
    text = re.sub(r'(\w)-\n(\w)', r'\1\2', text)

    # Merge lines that are broken in the middle of sentences
    text = re.sub(r'\n(?=\w)', ' ', text)

    # Normalize fancy bullet characters to a plain "-"
    text = re.sub(r'[•–]\s*', '- ', text)

    # Collapse runs of spaces/tabs and excessive blank lines
    text = re.sub(r'[ \t]{2,}', ' ', text)
    text = re.sub(r'\n{3,}', '\n\n', text)

    return text</code></pre><h2 id="handling-images-in-documents">Handling Images in Documents&nbsp;</h2><p>Some PDFs and web pages have images containing text or are entirely scans of documents. There are several ways to handle these:</p><ol><li><strong>OCR</strong>: For scanned documents, OCR can extract text from images. Mistral OCR handles this well.</li><li><strong>MarkItDown</strong>: Modern libraries like MarkItDown integrate both OCR and vision models to generate image descriptions.</li><li><strong>BLIP</strong>: Models like <a href="https://arxiv.org/abs/2201.12086"><u>BLIP</u></a> (Bootstrapping Language-Image Pre-training) combine understanding images and generating text to give you text descriptions of images.</li><li><strong>Save image URLs</strong>: When extracting from webpages, you can save image URLs as text chunks to display in your application.</li><li><strong>Omit images</strong>: This is common but not ideal, especially if images contain crucial information like charts or diagrams.</li></ol><p><a href="https://dev.to/leapcell/deep-dive-into-microsoft-markitdown-4if5"><strong><u>Generate descriptions for images</u></strong></a><strong> using MarkItDown</strong></p><pre><code>from markitdown import MarkItDown
from openai import OpenAI

# Set up OpenAI client
client = OpenAI(api_key="your-openai-api-key")

# Initialize MarkItDown with LLM capabilities
md = MarkItDown(llm_client=client, llm_model="gpt-4o")

# Convert an image file
result = md.convert("path_to_your_image.jpg")

# Print the generated description
print(result.text_content)
</code></pre><p><a href="https://medium.com/%40jimwang3589/what-is-image-captioning-and-how-to-use-python-to-generate-caption-from-an-image-98a9eb6be06d"><strong><u>Generate captions for an image</u></strong></a><strong> using BLIP&nbsp;</strong></p><pre><code>from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image

# Load the pre-trained BLIP model and processor
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# Open an image file
image_path = "path_to_your_image.jpg"  # Replace with your image file path
image = Image.open(image_path)

# Preprocess the image and prepare inputs for the model
inputs = processor(images=image, return_tensors="pt")

# Generate caption
outputs = model.generate(**inputs)

# Decode the generated caption
caption = processor.decode(outputs[0], skip_special_tokens=True)

print("Generated Caption:", caption)</code></pre><h2 id="summarizing-or-cleaning-text-with-an-llm">Summarizing or Cleaning Text With an LLM</h2><p>After programmatic cleaning and parsing, you may still need to refine text further:</p><ul><li>Condense long documents into short summaries or key points.</li><li>Make tone and format consistent.</li></ul><p>Simple prompts like "Remove any unnecessary information (like boilerplate) from the following text and correct any errors" or "Summarize the following document in three sentences" can help. If possible, double-check this work, as LLMs can sometimes misinterpret context.</p><p><strong>Summarize a piece of text </strong><a href="https://cloud.google.com/vertex-ai/generative-ai/docs/gemini-v2"><strong><u>using Gemini 2.0 Flash</u></strong></a><strong>&nbsp;</strong></p><pre><code>from google import genai
from google.genai.types import HttpOptions

client = genai.Client(http_options=HttpOptions(api_version="v1"))
response = client.models.generate_content(
    model="gemini-2.0-flash-001",
    contents="Summarize in a technical tone the following piece of text: Attention Is All You Need...",
)
print(response.text)</code></pre><h2 id="conclusion">Conclusion</h2><p>We hear it from our customers’ AI teams all the time. They had spent months trying different vector databases, embedding models, and chunking strategies without seeing any improvement in their RAG application. After a quick data review, we discovered their PDFs were being processed with poor OCR, resulting in garbled text full of artifacts. Within a week of implementing proper data cleaning, their application performance jumped dramatically.&nbsp;</p><p>There’s a very important lesson here: Before you worry about the latest bleeding-edge GraphRAG techniques, <strong><em>make sure your data foundation is solid</em></strong>; this step starts way before you write a single line of RAG code. Clean, well-structured, and high-quality data is the foundation of any successful AI application. As we like to say, “Garbage in, garbage out.”&nbsp;</p><p>Start with good data, and the rest of your AI pipeline will fall into place. You need to invest time in proper document gathering, loading, parsing, and cleaning, which will pay dividends in more accurate, relevant, and useful AI outputs.&nbsp;</p><p>In the next installment of <a href="https://www.timescale.com/blog/agentic-rag-best-practices-guide-for-building-ai-apps-with-postgresql" rel="noreferrer">RAG Best Practices</a>, we will be exploring chunking strategies, followed by embedding generation, indexing techniques, performance optimizations, and more.&nbsp;</p><h2 id="get-involved">Get Involved&nbsp;</h2><p>Whether you're new to AI or an experienced developer looking to implement agentic RAG with PostgreSQL, this series will give you the foundation you need.&nbsp;</p><p>Stay tuned for our next guide on chunking strategies, coming in two weeks.
</p><p>In the meantime, we'd love to see you share your thoughts, questions, and suggestions on social media and Discord:</p><ul><li><strong>Join our Discord Community</strong>: Get real-time answers from the Timescale team and <a href="https://discord.com/invite/KRdHVXAmkp" rel="noreferrer">connect with other developers</a>.</li><li><strong>Follow us on social media</strong>: Stay updated with the latest from Timescale on <a href="https://twitter.com/TimescaleDB"><u>X/Twitter</u></a> and <a href="http://linkedin.com/company/timescaledb"><u>LinkedIn</u></a>.</li><li><strong>Connect with Jacky </strong>(<strong>developer advocate</strong>): Follow me for more practical AI and PostgreSQL content on<a href="https://twitter.com/jjackyliang"><u> X/Twitter</u></a>, <a href="https://threads.net/@jjackyliang"><u>Threads</u></a>, and <a href="https://www.tiktok.com/@answer.hq"><u>TikTok</u></a>.</li><li><strong>Direct questions</strong>: Have a specific question about your agentic retrieval implementation? Ask me anything at jacky (at) timescale (dot) com.</li></ul><p>We're building this guide for you, so don't hesitate to let us know what topics you'd like us to cover in future installments!&nbsp;</p>]]></content:encoded>
        </item>
    </channel>
</rss>