<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[Tiger Data Blog]]></title>
        <description><![CDATA[Insights, product updates, and tips from TigerData (Creators of TimescaleDB) engineers on Postgres, time series & AI. IoT, crypto, and analytics tutorials & use cases.]]></description>
        <link>https://www.tigerdata.com/blog</link>
        <image>
            <url>https://www.tigerdata.com/icon.ico</url>
            <title>Tiger Data Blog</title>
            <link>https://www.tigerdata.com/blog</link>
        </image>
        <generator>RSS for Node</generator>
        <lastBuildDate>Tue, 07 Apr 2026 11:15:24 GMT</lastBuildDate>
        <atom:link href="https://www.tigerdata.com/blog" rel="self" type="application/rss+xml"/>
        <ttl>60</ttl>
        <item>
            <title><![CDATA[We Taught AI to Write Real Postgres Code (And Open Sourced It)]]></title>
            <description><![CDATA[pg-aiguide teaches AI to write production-ready Postgres code with curated skills, semantic search, and version-aware docs. Open source and free to use.]]></description>
            <link>https://www.tigerdata.com/blog/we-taught-ai-to-write-real-postgres-code-open-sourced-it</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/we-taught-ai-to-write-real-postgres-code-open-sourced-it</guid>
            <category><![CDATA[Announcements & Releases]]></category>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Matvey Arye]]></dc:creator>
            <pubDate>Mon, 24 Nov 2025 15:00:09 GMT</pubDate>
            <media:content medium="image" url="https://timescale.ghost.io/blog/content/images/2025/11/2025-nov-21-thumbnail-open-source-mcp-server.png">
            </media:content>
            <content:encoded><![CDATA[<p>You open Claude Code or Cursor, describe your tables, and in seconds the AI hands you a Postgres schema that looks… fine. It runs. Your tests pass. You ship.</p><p>What you don’t see are the quiet little disasters tucked inside: money for prices, a BRIN index on random data, SERIAL and UUID mixed like a cocktail, timestamp without time zone because some tutorial said it was “easier”.</p><p>Fast-forward six months. You’re debugging currency-conversion bugs, chasing timezone ghosts, rewriting migrations, and adding the index that should have existed since day one. The code the AI agent wrote worked; it just wasn’t good. It was copying whatever examples it scraped from the entire internet.</p><p>And that’s the problem. It learned SQL from everywhere: Postgres, MySQL, SQLite, SQL Server, Oracle, random tutorials, and a decade of Stack Overflow answers. In all that noise, the nuances of idiomatic, high-quality Postgres get buried under the good, the bad, and the MySQL.</p><p>So we built something to fix that.</p><h2 id="giving-ai-the-postgres-judgment-it%E2%80%99s-missing">Giving AI the Postgres Judgment It’s Missing</h2><p>pg-aiguide gives AI coding agents the Postgres-specific judgment they’re missing. It does this with three things working together:</p><ol><li><strong>AI-optimized “skills”</strong> — curated, opinionated Postgres best practices that Claude Code and other agents can apply automatically.</li><li><strong>Semantic search across official documentation</strong> — version-aware retrieval for Postgres 15–18.</li><li><strong>Extension ecosystem docs</strong>, starting with TimescaleDB and expanding quickly.</li></ol><p>You can connect it to any AI coding agent via our public <strong>Model Context Protocol (MCP) server</strong> or with the <strong>Claude Code plugin</strong> built to take advantage of Claude’s native skill support. No accounts. No usage limits. 
Completely free.</p><p>The goal is simple: <strong>Make AI write correct, production-ready Postgres by default.</strong></p><p>You shouldn’t have to paste docs, correct outputs, or rely on prompt hacks. The AI should just generate better SQL the first time.</p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text"><b><strong style="white-space: pre-wrap;">Try it now </strong></b><br>You can start using pg-aiguide in less than a minute. It works with Claude Code, Codex, Cursor, Gemini CLI, Visual Studio, VS Code, Windsurf, and any other MCP-compatible editors. See our <a href="https://github.com/timescale/pg-aiguide/tree/main?tab=readme-ov-file#-quickstart"><u>quickstart guide</u></a> for installation instructions.</div></div><h2 id="why-database-code-quality-is-so-important">Why Database Code Quality Is So Important</h2><p>AI adoption among Postgres developers has exploded, jumping from 37% to 55% in a single year, according to the <a href="https://www.tigerdata.com/state-of-postgres/2024"><u>2024 State of Postgres</u></a> survey. But as LLMs get more general, their training data has become broad rather than specialized, and their output has become <em>less</em> precise.</p><p>This creates a surprising problem: a study by METR recently found that experienced developers using AI actually took <strong>19 percent longer</strong> to complete tasks. Not because the AI failed, but because the output was almost right—riddled with the kind of subtle mistakes that take the longest to spot, debug, and unwind.</p><p>With databases, “almost right” compounds quickly.&nbsp;</p><p>Schemas codify decisions that are painful to change later. Indexing mistakes hide until your tables grow. Datatype choices that take an hour to fix in development can take days of migrations and careful planning in production.</p><p>You can refactor application code. 
You can’t refactor state without consequences.</p><p>pg-aiguide stops these mistakes at the source, before they get baked into production.</p><h2 id="generic-vs-guided-sql">Generic vs. Guided SQL</h2><p>Here’s what happens when you ask a coding assistant to design a simple bookstore schema.</p><h3 id="without-pg-aiguide">Without pg-aiguide</h3><pre><code class="language-SQL">CREATE TABLE customers (
    customer_id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    email VARCHAR(255) UNIQUE NOT NULL,
    password_hash VARCHAR(255) NOT NULL,
    first_name VARCHAR(100) NOT NULL,
    last_name VARCHAR(100) NOT NULL,
    phone VARCHAR(20),
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    is_active BOOLEAN DEFAULT TRUE
);

CREATE INDEX idx_customers_email ON customers(email);

CREATE TABLE authors (
    author_id SERIAL PRIMARY KEY,
    first_name VARCHAR(100) NOT NULL,
    last_name VARCHAR(100) NOT NULL,
    biography TEXT,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);
</code></pre><p>This schema works, but has classic issues:</p><ul><li>mixed identity strategies</li><li>arbitrary <code>VARCHAR</code> limits</li><li>case-sensitive email uniqueness</li><li>timestamp defaults that behave inconsistently across systems</li></ul><h3 id="with-pg-aiguide-enabled">With pg-aiguide enabled</h3><pre><code class="language-SQL">CREATE TABLE authors (
  author_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  name TEXT NOT NULL,
  bio TEXT,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX ON authors (name);

CREATE TABLE users (
  user_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  email TEXT NOT NULL UNIQUE,
  password_hash TEXT NOT NULL,
  first_name TEXT NOT NULL,
  last_name TEXT NOT NULL,
  phone TEXT,
  is_active BOOLEAN NOT NULL DEFAULT true,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE UNIQUE INDEX ON users (LOWER(email));
CREATE INDEX ON users (created_at);</code></pre><p>This version uses:</p><ul><li>consistent identity columns (<code>BIGINT GENERATED ALWAYS AS IDENTITY</code>)</li><li><code>TEXT</code> instead of unnecessary <code>VARCHAR</code></li><li>correct timestamp handling (<code>timestamptz + now()</code>)</li><li>case-insensitive uniqueness enforced properly</li></ul><p>Behind the scenes, the AI used the <a href="https://github.com/timescale/pg-aiguide/blob/main/skills/design-postgres-tables/SKILL.md"><u>design_postgres_tables skill</u></a> either through the <a href="https://github.com/timescale/pg-aiguide/blob/main/API.md#view_skill"><u>view_skill</u></a> MCP tool or <a href="https://www.claude.com/blog/skills"><u>Claude’s native skills framework</u></a>. In both cases, the agent automatically discovered and applied the Postgres-optimized guidance without human intervention.</p><p>You didn’t have to prompt differently. </p><p>You didn’t have to paste in docs.</p><p><strong>pg-aiguide automatically shifts AI from “SQL that works” to “SQL you’d actually want in production.”</strong></p><h2 id="the-skills-are-the-secret-sauce">The Skills Are the Secret Sauce</h2><p>If you want AI to generate high-quality SQL, it is not enough to let it search the manual. A manual tells you what you can do, not what you should do. Skills fill that gap. They give the model judgment, not just facts.</p><p>Our skills are not trying to reteach the LLM syntax or capabilities. Instead, they give the model the context it needs to make better choices. Here is an excerpt from a real skill.</p><pre><code class="language-markup">## Postgres "Gotchas"

- **FK indexes**: Postgres **does not** auto-index FK columns. Add them.
- **No silent coercions**: length/precision overflows error out (no truncation). 
  Example: inserting 999 into `NUMERIC(2,0)` fails with error, unlike some 
  databases that silently truncate or round.
- **Heap storage**: no clustered PK by default (unlike SQL Server/MySQL InnoDB); 
  row order on disk is insertion order unless explicitly clustered.</code></pre><p>These are the kinds of details you only know once you have lived in Postgres for a while. They trip up LLMs for the same reason they trip up developers who are new (and not so new) to the database. Yet these details are exactly what allow the model to produce better SQL.</p><p>In our evaluations (currently human-vibes-driven, soon LLM-judged), schema quality improves consistently when we compare a system with just semantic search to one that includes both semantic search and skills:</p><ul><li>more appropriate data types</li><li>correct timestamp semantics</li><li>stronger indexing strategies</li><li>fewer migration pitfalls</li><li>fewer long-term performance surprises</li></ul><p>This is what “AI coding tools actually understanding Postgres” looks like.</p><h2 id="the-tools-we-provide-the-llm">The Tools We Provide the LLM</h2><p>pg-aiguide provides two core capabilities that map cleanly to how AI coding tools operate.</p><h3 id="1-skills-complete-opinionated-postgres-guidance">1. Skills: Complete, Opinionated Postgres Guidance</h3><p><code>view_skill</code> returns full, AI-optimized best practices. These aren’t tutorials, and they aren’t vague prompts. They’re machine-targeted, dense, token-efficient guidance that the AI can reliably use.</p><p>For example:</p><ul><li>prefer <code>BIGINT GENERATED ALWAYS AS IDENTITY</code></li><li>don’t use <code>money</code></li><li>don’t use <code>timestamp</code> without timezone</li><li>index your foreign keys</li><li>expect errors on precision overflows</li></ul><p>Skills don’t need to be chunked—they are written so that each skill fits in context as a single complete unit.</p><p>Claude Code even supports skills natively, so the MCP server’s <code>view_skill</code> tool is disabled automatically when running as a plugin.</p><h3 id="2-semantic-search-version-aware-vector-retrieval-across-docs">2. 
Semantic Search: Version-Aware Vector Retrieval Across Docs</h3><p>The MCP tools <code>semantic_search_postgres_docs</code> and <code>semantic_search_tiger_docs</code> allow the AI to pull in the <strong>correct</strong> documentation for the Postgres version you’re targeting.</p><p>This matters because Postgres versions evolve meaningfully:</p><ul><li>Postgres 15: <code>UNIQUE NULLS NOT DISTINCT</code></li><li>Postgres 16: major changes to parallel query behavior</li><li>Postgres 17: COPY error-handling improvements</li></ul><p>Without version awareness, an AI can (and does) hallucinate features or syntax that will break your actual environment.</p><p>All of this knowledge of Postgres is chunked, embedded, and stored in Postgres itself.</p><p>We scrape official HTML docs, preserve header context, attach source URLs, and use character-bounded chunking with H1→H2→H3 breadcrumbs so each piece retains a sense of how it fits into the broader whole.</p><h2 id="help-us-build-the-world%E2%80%99s-best-postgres-guide-for-ai">Help Us Build the World’s Best Postgres Guide for AI</h2><p>Postgres has 35 years of engineering, craft, and hard-won lessons behind it. No single team can capture all of that. The community built the patterns, extensions, and production wisdom that make Postgres what it is. AI coding tools should reflect that depth, not spit out generic SQL lifted from outdated tutorials and old Stack Overflow posts.</p><p>pg-aiguide is our first step toward making Postgres the best database to use with AI coding assistants on purpose, not by accident. We are expanding the skill library with richer indexing guidance, full-text search skills, and documentation for essential extensions like PostGIS and pgvector. We are also adding keyword BM25 search to pair with semantic search for more accurate retrieval. 
<strong>But we need your help.</strong></p><h3 id="how-you-can-contribute">How You Can Contribute</h3><p>You can make an immediate impact:</p><ul><li>add documentation for your Postgres extension</li><li>contribute new skills that encode real, battle-tested expertise</li><li>help evaluate, refine, and stress-test existing skills</li><li>request features or report issues</li><li>improve semantic search chunking or propose new areas to index</li><li>share deep knowledge on partitioning, replication, security, or performance tuning</li></ul><p>Skills matter most. They turn years of experience into guidance the AI can use instantly. Our schema-design skill went through multiple iterations before it felt right, and we learned a ton in the process. We would love to partner with you to build skills in your area of expertise.</p><p>pg-aiguide is fully open source at github.com/timescale/pg-aiguide.&nbsp;</p><p><strong>Help us teach AI to write Postgres like an expert.</strong></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to Train Your Agent to Be a Postgres Expert]]></title>
            <description><![CDATA[Turn AI into a Postgres expert with our MCP server. Get 35 years of best practices, versioned docs, and prompt templates for production-ready schemas.]]></description>
            <link>https://www.tigerdata.com/blog/free-postgres-mcp-prompt-templates</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/free-postgres-mcp-prompt-templates</guid>
            <category><![CDATA[Announcements & Releases]]></category>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Matty Stratton]]></dc:creator>
            <pubDate>Wed, 22 Oct 2025 14:02:12 GMT</pubDate>
            <media:content medium="image" url="https://timescale.ghost.io/blog/content/images/2025/10/2025-Oct-21-Prompt-Template-Thumbnail.png">
            </media:content>
            <content:encoded><![CDATA[<h3 id="with-prompt-templates-and-versioned-docs-we-turn-35-years-of-postgres-wisdom-into-structured-knowledge-your-agent-can-reason-with">With prompt templates and versioned docs, we turn 35 years of Postgres wisdom into structured knowledge your agent can reason with.</h3><p>Agents are the <a href="https://www.tigerdata.com/blog/postgres-for-agents" rel="noreferrer">new developer</a>. But they’re generalists.&nbsp;</p><p>What happens when they design your Postgres database? Your schema runs, your tests pass… and six months later your queries crawl and your costs skyrocket.&nbsp;</p><p>AI-generated SQL and database schemas are almost right. And that’s the problem. Fixing schema design mistakes is costlier than refactoring code. It often means multi-week migrations, downtime windows, rollback plans, and your CEO asking why the site is in maintenance mode. The root issue? LLMs lack the deep Postgres and database expertise needed to build scalable systems. And when agents try to learn, they find documentation written for humans, not for them.&nbsp;</p><p>But agents don’t need more data; they need better context. They need to know what “good Postgres” actually looks like. The good news is that, given the right context and tools, agents can become instant experts. Even with Postgres.&nbsp;</p><p>That’s why we built an MCP server that provides 35 years of Postgres wisdom and full access to the Postgres docs, all in a format that agents can easily process.&nbsp;</p><p>And we think this just might be the best database MCP server ever built. While Neon, Supabase, and other Postgres companies created MCP servers as thin API wrappers, ours teaches AI how to think in Postgres. The Tiger MCP server gives AI tools that work automatically: no prompt engineering or manual lookups needed. You just ask. And it provides correct, idiomatic Postgres. 
</p><p>Our new MCP server ships with detailed prompt templates written by our senior engineers, plus versioned Postgres (15-18) docs, and <a href="https://docs.tigerdata.com/" rel="noreferrer">TimescaleDB docs</a>.&nbsp;When your AI needs to design a schema, it automatically pulls the right template and searches the docs (using hybrid search) to generate code that actually lasts. And of course, it handles the basics: start, stop, <a href="https://www.tigerdata.com/blog/fast-zero-copy-database-forks" rel="noreferrer">fork a database</a>. </p><p>But the real magic happens when you ask AI to design your schema. So let’s see what happens when AI designs a simple schema. First without the MCP server, then with it.</p><h2 id="without-the-mcp-server">Without the MCP Server</h2><pre><code class="language-SQL">CREATE TABLE customers (
    customer_id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    email VARCHAR(255) UNIQUE NOT NULL,
    password_hash VARCHAR(255) NOT NULL,
    first_name VARCHAR(100) NOT NULL,
    last_name VARCHAR(100) NOT NULL,
    phone VARCHAR(20),
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    is_active BOOLEAN DEFAULT TRUE
);

CREATE INDEX idx_customers_email ON customers(email);

CREATE TABLE authors (
    author_id SERIAL PRIMARY KEY,
    first_name VARCHAR(100) NOT NULL,
    last_name VARCHAR(100) NOT NULL,
    biography TEXT,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);
</code></pre><p>What’s wrong here:</p><ul><li>Mixed ID types (SERIAL and UUID) with no clear reason</li><li>Arbitrary VARCHAR limits instead of TEXT</li><li>Case-sensitive email index → duplicate accounts</li><li>CURRENT_TIMESTAMP instead of the more idiomatic now()</li></ul><p>Someone new to Postgres may not notice these problems. Until it’s too late.</p><p>Now let’s see what AI generates with our new MCP server.</p><h2 id="with-the-mcp-server">With the MCP Server</h2><pre><code class="language-SQL">CREATE TABLE authors (
  author_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  name TEXT NOT NULL,
  bio TEXT,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX ON authors (name);

CREATE TABLE users (
  user_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  email TEXT NOT NULL UNIQUE,
  password_hash TEXT NOT NULL,
  first_name TEXT NOT NULL,
  last_name TEXT NOT NULL,
  phone TEXT,
  is_active BOOLEAN NOT NULL DEFAULT true,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE UNIQUE INDEX ON users (LOWER(email));
CREATE INDEX ON users (created_at);</code></pre><p>What’s better about this?</p><ul><li>Consistent ID strategy with BIGINT GENERATED ALWAYS AS IDENTITY</li><li>TEXT instead of arbitrary VARCHAR limits</li><li>Case-insensitive email lookups</li><li>Modern timestamp handling</li></ul><p>But why does this matter?</p><p>Each of these differences creates a compounding problem. Changing datatypes in the future will require full table rewrites. Missing lowercase email handling means duplicate accounts and confused users. And time zones? Every senior developer gets the thousand-yard stare when you mention UTC conversions.</p><p>This is just a small example; imagine what would happen with more complex schemas.</p><p>And if you don’t believe us, here’s what Claude has to say:</p><pre><code class="language-markdown">&gt; Please describe the schema you would create for an e-commerce website two times, first with the tiger mcp server disabled, then with the tiger mcp server enabled. For each time, write the schema to its own file in the current working directory. Then compare the two files and let me know which approach generated the better schema, using both qualitative and quantitative reasons. For this example, only use standard Postgres.</code></pre><figure class="kg-card kg-video-card kg-width-regular" data-kg-thumbnail="https://timescale.ghost.io/blog/content/media/2025/10/how-to-train-your-agent_thumb.jpg" data-kg-custom-thumbnail="">
            <div class="kg-video-container">
                <video src="https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/media/2025/10/how-to-train-your-agent.mp4" poster="https://img.spacergif.org/v1/1280x720/0a/spacer.png" width="1280" height="720" loop="" autoplay="" muted="" playsinline="" preload="metadata" style="background: transparent url('https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/media/2025/10/how-to-train-your-agent_thumb.jpg') 50% 50% / cover no-repeat;"></video>
                <div class="kg-video-overlay">
                    <button class="kg-video-large-play-icon" aria-label="Play video">
                        <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
                            <path d="M23.14 10.608 2.253.164A1.559 1.559 0 0 0 0 1.557v20.887a1.558 1.558 0 0 0 2.253 1.392L23.14 13.393a1.557 1.557 0 0 0 0-2.785Z"></path>
                        </svg>
                    </button>
                </div>
                <div class="kg-video-player-container kg-video-hide">
                    <div class="kg-video-player">
                        <button class="kg-video-play-icon" aria-label="Play video">
                            <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
                                <path d="M23.14 10.608 2.253.164A1.559 1.559 0 0 0 0 1.557v20.887a1.558 1.558 0 0 0 2.253 1.392L23.14 13.393a1.557 1.557 0 0 0 0-2.785Z"></path>
                            </svg>
                        </button>
                        <button class="kg-video-pause-icon kg-video-hide" aria-label="Pause video">
                            <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
                                <rect x="3" y="1" width="7" height="22" rx="1.5" ry="1.5"></rect>
                                <rect x="14" y="1" width="7" height="22" rx="1.5" ry="1.5"></rect>
                            </svg>
                        </button>
                        <span class="kg-video-current-time">0:00</span>
                        <div class="kg-video-time">
                            /<span class="kg-video-duration">0:50</span>
                        </div>
                        <input type="range" class="kg-video-seek-slider" max="100" value="0">
                        <button class="kg-video-playback-rate" aria-label="Adjust playback speed">1×</button>
                        <button class="kg-video-unmute-icon" aria-label="Unmute">
                            <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
                                <path d="M15.189 2.021a9.728 9.728 0 0 0-7.924 4.85.249.249 0 0 1-.221.133H5.25a3 3 0 0 0-3 3v2a3 3 0 0 0 3 3h1.794a.249.249 0 0 1 .221.133 9.73 9.73 0 0 0 7.924 4.85h.06a1 1 0 0 0 1-1V3.02a1 1 0 0 0-1.06-.998Z"></path>
                            </svg>
                        </button>
                        <button class="kg-video-mute-icon kg-video-hide" aria-label="Mute">
                            <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
                                <path d="M16.177 4.3a.248.248 0 0 0 .073-.176v-1.1a1 1 0 0 0-1.061-1 9.728 9.728 0 0 0-7.924 4.85.249.249 0 0 1-.221.133H5.25a3 3 0 0 0-3 3v2a3 3 0 0 0 3 3h.114a.251.251 0 0 0 .177-.073ZM23.707 1.706A1 1 0 0 0 22.293.292l-22 22a1 1 0 0 0 0 1.414l.009.009a1 1 0 0 0 1.405-.009l6.63-6.631A.251.251 0 0 1 8.515 17a.245.245 0 0 1 .177.075 10.081 10.081 0 0 0 6.5 2.92 1 1 0 0 0 1.061-1V9.266a.247.247 0 0 1 .073-.176Z"></path>
                            </svg>
                        </button>
                        <input type="range" class="kg-video-volume-slider" max="100" value="100">
                    </div>
                </div>
            </div>
            
        </figure><pre><code class="language-markdown">Verdict:

The Tiger MCP-enabled schema is objectively superior because:

1. Type Safety: Uses modern PostgreSQL types that prevent common errors
2. Data Integrity: 4x more constraints ensure data quality at the database level
3. Performance: Better indexing strategy with 55% more indexes, including partial and expression indexes
4. Maintainability: Better documentation, clear comments, proper naming
5. Features: More complete e-commerce functionality (discounts, full-text search, audit trails)
6. PostgreSQL Best Practices: Follows official PG17 recommendations from the documentation
7. Future-Proof: Uses modern features like GENERATED ALWAYS AS IDENTITY and NULLS NOT DISTINCT

The Tiger MCP server provided access to PostgreSQL-specific best practices documentation and design patterns that resulted in a more robust, performant, and maintainable schema.
</code></pre><h2 id="how-prompt-templates-make-it-all-possible">How Prompt Templates Make It All Possible</h2><p>Behind the scenes, AI uses the MCP server to call <code>get_prompt_template('design_postgres')</code> to load schema design guidance. No pasted docs. No corrections. Just better code.</p><p>Prompt templates turn production wisdom into reusable guidance for AI. Instead of hunting through documentation written for humans, AI gets the information it needs in a format built for machines.</p><p>It comes down to the fact that humans and LLMs have opposite needs. Humans need narratives and memorable examples (and yes, even cat memes) to help them retain information. LLMs need to preserve context window space. That’s why prompt templates make terrible blog posts, but perfect AI guidance.</p><p>Our philosophy is: don't re-teach what the model already knows. LLMs have seen millions of lines of SQL. They know how to write CREATE TABLE. What they don’t know is the 35 years of Postgres wisdom about what works well and what doesn’t.</p><p>It's like your senior DBA whispering advice in the model's ear.</p><p>Our schema design template (<code>design_postgres_tables</code>) doesn’t explain what a primary key is. It jumps straight to guidance:</p><p>“Prefer <code>BIGINT GENERATED ALWAYS AS IDENTITY</code>; use <code>UUID</code> only when global uniqueness is needed.”</p><p>For data types, it doesn’t teach from scratch. It just tells you what works:</p><p>“DO NOT use <code>money</code> type; DO use <code>numeric</code> instead.”</p><p>Here’s a real snippet from the template:</p><pre><code class="language-markdown">## Postgres "Gotchas"

- **FK indexes**: Postgres **does not** auto-index FK columns. Add them.
- **No silent coercions**: length/precision overflows error out (no truncation). 
  Example: inserting 999 into `NUMERIC(2,0)` fails with error, unlike some 
  databases that silently truncate or round.
- **Heap storage**: no clustered PK by default (unlike SQL Server/MySQL InnoDB); 
  row order on disk is insertion order unless explicitly clustered.</code></pre><p>These gotchas trip up LLMs the same way they trip up developers new to Postgres. We optimized these templates for machines: short, factual, and precise, packing maximum guidance into minimum tokens.&nbsp;</p><p>We tested the same approach on a real IoT schema design task. Without templates, the AI added forbidden configurations and missed critical optimizations. <em>With</em> templates, it generated production-ready code with compression, continuous aggregates, and tuned performance.</p><p>That’s how prompt templates work. Now let’s see how the MCP server makes it all happen.</p><h2 id="how-this-mcp-server-is-smarter-than-others">How This MCP Server Is Smarter Than Others</h2><p>As noted above, where other Postgres companies built MCP servers as thin API wrappers, the Tiger MCP server teaches AI how to think in Postgres, giving it tools that work automatically, with no prompt engineering or manual lookups needed.</p><p><strong><code>get_prompt_template</code> provides auto-discovered expertise.</strong> Instead of having to call a template explicitly, you just say “I want to make a schema for IoT devices…” and the MCP server figures it out. </p><p>With self-discoverable templates, the AI can detect intent and load the right recipe, applying 35 years of Postgres best practices behind the scenes. </p><p><strong>The templates have real depth.</strong> No scraped snippets or boilerplate. The templates are written by senior Postgres engineers and provide opinionated, production-tested guidance tuned to sidestep every trap that seasoned DBAs know to avoid.</p><p><strong>Postgres-native vector retrieval adds the right context.</strong> When the AI needs more information, the MCP server searches the versioned Postgres (15-18) and TimescaleDB docs. 
And it uses Postgres itself for storage and vector search.</p><p>Versioning is critical. For example, Postgres 15 introduced UNIQUE NULLS NOT DISTINCT, while 16 improved parallel queries, and 17 changed COPY error handling. The MCP keeps AIs grounded in correct syntax every time, avoiding broken code from the wrong version.</p><p>The Tiger MCP doesn’t just wire up APIs. It teaches AI to think like a real Postgres engineer. </p><p>You don’t have to craft the perfect prompt. You just ask, and it does the right thing.</p><h2 id="see-it-for-yourself">See It For Yourself</h2><p>Install the Tiger CLI and MCP server:</p><pre><code class="language-shell">curl -fsSL https://cli.tigerdata.com | sh
tiger auth login
tiger mcp install</code></pre><p>(We also have alternative <a href="https://github.com/timescale/tiger-cli"><u>installation instructions</u></a> for the CLI tool.)</p><p>Then select your AI assistant (Claude Code, Cursor, VS Code, Windsurf, etc.) and immediately get real Postgres knowledge flowing into your AI.</p><p>This is how Postgres becomes the best database to use with AI coding tools: not by accident, not because someone pasted docs into a chat, but because the tooling now teaches AI how to think in Postgres.&nbsp;</p><p>Try the MCP server. Break it. <a href="https://timescaledb.slack.com/join/shared_invite/zt-38c4rrt9t-eR8I4hnb4qeGLUrL6hM3mA#/shared-invite/email"><u>Improve it</u></a>. Help us teach every AI to write real Postgres.</p><hr><p><strong>About the authors</strong></p><p><strong>Matty Stratton</strong></p><p>Matty Stratton is the Head of Developer Advocacy and Docs at Tiger Data, a well-known member of the DevOps community, founder and co-host of the popular <a href="https://www.arresteddevops.com/"><u>Arrested DevOps</u></a> podcast, and a global organizer of the <a href="https://devopsdays.org"><u>DevOpsDays</u></a> set of conferences.</p><p>Matty has over 20 years of experience in IT operations and is a sought-after speaker internationally, presenting at Agile, DevOps, and cloud engineering-focused events worldwide. Demonstrating his keen insight into the changing landscape of technology, he recently changed his license plate from DEVOPS to KUBECTL.</p><p>He lives in the Chicagoland area and has three awesome kids and two Australian Shepherds, whom he loves just a little bit more than he loves Diet Coke.</p><p><strong>Matvey Arye</strong></p><p><a href="https://www.linkedin.com/in/matvey-arye/"><u>Matvey Arye</u></a> is a founding engineering leader at Tiger Data (creators of TimescaleDB), the premier provider of relational database technology for time-series data and AI. 
Currently, he manages the team at Tiger Data responsible for building the go-to developer platform for AI applications.&nbsp;</p><p>Under his leadership, the Tiger Data engineering team has introduced partitioning, compression, and incremental materialized views for time-series data, plus cutting-edge indexing and performance innovations for AI.&nbsp;</p><p>Matvey earned a Bachelor’s degree in Engineering at The Cooper Union. He earned a Doctorate in Computer Science at Princeton University, where his research focused on cross-continental data analysis covering issues such as networking, approximate algorithms, and performant data processing.&nbsp;</p><p><strong>Jacky Liang</strong></p><p><a href="https://www.linkedin.com/in/jjackyliang/"><u>Jacky Liang</u></a> is a developer advocate at Tiger Data with an AI and LLMs obsession. He's worked at Pinecone, Oracle Cloud, and Looker Data as both a software developer and product manager, which has shaped the way he thinks about software.&nbsp;</p><p>He cuts through AI hype to focus on what actually works. How can we use AI to solve real problems? What tools are worth your time? How will this technology actually change how we work?&nbsp;</p><p>When he's not writing or speaking about AI, Jacky builds side projects and tries to keep up with the endless stream of new AI tools and research—an impossible task, but he keeps trying anyway. His model of choice is Claude Sonnet 4 and his favorite coding tool is Claude Code.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The Database Has a New User—LLMs—and They Need a Different Database]]></title>
            <description><![CDATA[Tiger Data experiments with self-describing Postgres using semantic catalogs. Early tests show 27% better SQL generation accuracy when AI agents understand schemas.
]]></description>
            <link>https://www.tigerdata.com/blog/the-database-new-user-llms-need-a-different-database</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/the-database-new-user-llms-need-a-different-database</guid>
            <category><![CDATA[AI agents]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[AI]]></category>
            <dc:creator><![CDATA[Matvey Arye]]></dc:creator>
            <pubDate>Thu, 21 Aug 2025 12:59:10 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2025/08/experimenting-with-a-self-describing-postgresql-database-for-the-agentic-era.png">
            </media:content>
            <content:encoded><![CDATA[<h3 id="we%E2%80%99re-experimenting-with-a-database-that-can-self-describe-and-we%E2%80%99re-starting-with-the-most-popular-one-in-the-world-postgres">We’re experimenting with a database that can self-describe. And we’re starting with the most popular one in the world: Postgres.</h3><p><strong>TL;DR:</strong> We’re experimenting with turning Postgres into a <strong>self-describing database</strong>, embedding meaning as part of the schema. By providing natural language explanations of PostgreSQL structures and logic, agents can more accurately query data and answer a broader set of questions. In our early tests, using an LLM-generated semantic catalog improved SQL generation accuracy by up to 27%. Here's the <a href="https://github.com/timescale/pgai/tree/main/docs/semantic_catalog" rel="noreferrer">repo link</a> for reference.</p><h2 id="databases-lack-context-about-their-structures">Databases Lack Context About Their Structures</h2><p>As any developer who has ever had the (mis)fortune of working on a legacy database knows quite well, a database is not self-describing. You can’t look at a database schema and tell what’s going on. That <code>orders1</code> table could hold purchase orders or customer orders, or it could be an experimental table someone created and forgot to drop five years ago. This is a long-standing problem with databases that people solve by talking to each other, looking at code that interacts with the database, examining git history, and screaming at the wall.&nbsp;</p><p>But LLMs need to answer questions about the data using only the context provided by the database. 
No wonder they get confused.&nbsp;</p><p>As our fearless leader <a href="https://www.tigerdata.com/blog/author/ajay"><u>Ajay Kulkarni</u></a> once said—<em>“LLMs crave (accurate) context the same way GPUs crave power.”</em></p><p>For agents to extract insights from data in the database, they need to understand which columns and tables to query. For example, if an agent doesn’t know which table contains customer orders versus purchase orders, it can’t tell when a customer’s order has shipped.</p><p>In fact, in internal experiments we conducted at Tiger Data, we found that 42% of context-less LLM-generated SQL queries missed critical filters or misunderstood relationships, silently returning misleading data.&nbsp;</p><p>The larger (and thus probably more important) the database, the worse things get.</p><p>It doesn’t take a senior engineer to know that this is unacceptable.</p><h2 id="adding-context-with-semantic-catalogs">Adding Context with Semantic Catalogs</h2><p>The solution is incredibly simple: allow developers to add context about the database in the form of natural language descriptions of the schemas and business logic within Postgres, injecting much-needed factual meaning into schemas.&nbsp;</p><p>We arrived at this by putting ourselves in the place of the LLM — asking, how would we generate the query if we were in its shoes? The answer was clear: without additional context, we wouldn’t know enough to make the right call. Along the way, we tried different prompts, techniques, and even alternative LLMs, but the core issue was always the same: the model simply lacked the context necessary to generate the query. That realization led us to build a way for developers to provide that missing context directly.</p><p>That's what we’ve been experimenting with and are excited to share today.&nbsp;</p><p>We set out to test what happens when SQL generation is powered by our LLM-generated semantic catalog. 
The outcome was a 27% boost in accuracy compared to the control.</p><p>Let's show you how we created this self-describing Postgres database.</p><h2 id="building-a-self-describing-database">Building a Self-Describing Database</h2><p>Our thoughts for this self-describing Postgres are built around four core ideas:&nbsp;</p><ol><li><strong>Embed semantics alongside schema: </strong><br>Every table, column, function, and business rule should be described clearly—in natural language.</li><li><strong>Versioned and governed descriptions: </strong><br>Metadata should live alongside application code, version-controlled, reviewed, and governed with the same rigor.</li><li><strong>Self-correcting querying: </strong><br>Postgres itself should expose safety hints, telling agents which queries are expensive or unsafe, and provide deterministic verification mechanisms (<code>EXPLAIN</code>) to catch errors before queries run.</li><li><strong>Measure and iterate transparently: </strong><br>Developers should be able to build an evaluation suite for agent interactions and determine how well the system performed. Critically, the developers should see what errors are due to bad schema descriptions and lack of context (retrieval errors) or incorrect reasoning (logical errors).</li></ol><h2 id="what-we%E2%80%99re-experimenting-with-today">What We’re Experimenting With Today</h2><p>We’re exploring this approach through two building blocks. This post primarily covers the Semantic Catalog. Next week, we’ll publish a post about the evaluation harness.</p>
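<p>To make the third idea (self-correcting querying) concrete, here is a minimal sketch of the verify-and-retry loop. Everything here is illustrative: the helper names are hypothetical, and the validator is a stub standing in for a real <code>EXPLAIN</code> round-trip to a live Postgres connection. It shows the control flow, not the pgai implementation:</p><pre><code class="language-Python"># Hypothetical sketch of the self-correcting query loop (idea 3 above).
# A real implementation would run EXPLAIN against Postgres; here the
# validator is stubbed so the control flow is visible end to end.

KNOWN_COLUMNS = {"orders": ["id", "customer_id", "shipped_at"]}

def explain(sql):
    """Stub for EXPLAIN: raise if the query references an unknown column."""
    for table, cols in KNOWN_COLUMNS.items():
        for word in sql.replace(",", " ").split():
            if word.startswith(table + "."):
                col = word.split(".")[1]
                if col not in cols:
                    raise ValueError(f"column {col} does not exist")

def generate_sql(prompt, error=None):
    """Stub for the LLM call: corrects the query once it sees the error."""
    if error is None:
        return "SELECT orders.ship_date FROM orders"  # first attempt guesses wrong
    return "SELECT orders.shipped_at FROM orders"     # retry with the error in context

def self_correcting_sql(prompt, max_attempts=3):
    error = None
    for _ in range(max_attempts):
        sql = generate_sql(prompt, error)
        try:
            explain(sql)        # deterministic check before any execution
            return sql
        except ValueError as exc:
            error = str(exc)    # feed the error back to the model
    raise RuntimeError("could not produce a valid query")</code></pre><p>The point of using <code>EXPLAIN</code> here is that it is deterministic: wrong column or table names surface before the query ever executes, so the agent gets a concrete error message to correct rather than a silently wrong result.</p>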
<!--kg-card-begin: html-->
<table>
        <thead>
            <tr>
                <th>Component</th>
                <th>What It Does</th>
                <th>Repo</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <td>Semantic Catalog</td>
                <td>Stores natural-language descriptions for schema elements. Supports vector search to retrieve relevant context dynamically.</td>
                <td><a href="https://github.com/timescale/pgai/tree/main/docs/semantic_catalog" target="_blank" style="text-decoration: underline">Link</a></td>
            </tr>
            <tr>
                <td>Evaluation Harness</td>
                <td>Measures query accuracy in any agentic text-to-SQL system and transparently reveals errors in retrieval (bad metadata/context) versus reasoning (bad SQL generation). Helps track accuracy over time.</td>
                <td>TBD</td>
            </tr>
        </tbody>
    </table>
<!--kg-card-end: html-->
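<p>The retrieval step in the Semantic Catalog row above can be pictured with a toy sketch. Everything in it is illustrative: the hand-made three-dimensional vectors stand in for real embedding-model output, and the actual catalog stores its embeddings in Postgres and searches them there (via pgvector) rather than in a Python dictionary:</p><pre><code class="language-Python"># Toy sketch of semantic-catalog retrieval: the question and each stored
# description are embedded, and the closest descriptions are returned as
# context for SQL generation. Tiny hand-made vectors stand in for real
# embedding output.
import math

CATALOG = {
    "payment_aborts": ("Payments cancelled before capture; no money moved.", [0.9, 0.1, 0.0]),
    "refunds": ("Money returned to the customer after a captured payment.", [0.1, 0.9, 0.0]),
    "restaurant": ("Core information about each restaurant.", [0.0, 0.0, 1.0]),
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def retrieve(question_vec, k=2):
    """Return the names of the k schema elements whose descriptions best match."""
    ranked = sorted(CATALOG, key=lambda name: cosine(question_vec, CATALOG[name][1]), reverse=True)
    return ranked[:k]</code></pre><p>Note that the similarity is computed over the natural-language descriptions, not the bare names: <code>payment_aborts</code> and <code>refunds</code> embed almost identically as names, and it is the descriptions that give the search something to distinguish.</p>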
<h3 id="why-isnt-the-problem-just-retrieval">Why isn't the problem just retrieval?</h3><p>Retrieval using the schema element names alone does not cut it simply because the names lack enough meaning. It doesn’t matter if you use keyword search, semantic search, or this week’s newest shiny retrieval algorithm—the semantic meaning of <code>payment_aborts</code> versus&nbsp; <code>refunds</code> is too close for any retrieval mechanism to make heads or tails of it.&nbsp;&nbsp;</p><p>Even if you injected the entire schema into the context of your prompt (burning through cash and increasing response latency in the process), it’s doubtful the model would be able to reason about&nbsp;<code>payment_aborts</code> versus&nbsp;<code>refunds</code> either.</p><p>The semantic catalog solves this by offering a structured, natural language representation of your database's metadata and business logic. In this experimental version, it supports documentation and descriptions for:</p><ul><li>Tables and views</li><li>Columns</li><li>Functions and procedures</li><li>Example queries</li><li>Facts: business logic rules or freeform context</li></ul><p>Initial descriptions can be LLM-generated and stored in human-readable YAML files designed for version control, peer review, and governance. Once reviewed, this metadata can be imported into the semantic catalog and indexed for semantic search, simplifying the process of retrieving and reasoning about data structures.</p><h3 id="performance-enhancement-through-semantic-context">Performance enhancement through semantic context</h3><p>In our early tests, using an LLM-generated semantic catalog improved SQL generation accuracy by up to 27%. 
Effective SQL generation involves two key steps:</p><ol><li><strong>Retrieving Relevant Context</strong>: Identifying necessary database components</li><li><strong>SQL Generation Reasoning</strong>: Formulating accurate SQL queries based on the retrieved context</li></ol><p>Our evaluation revealed substantial improvements from using semantic descriptions for SQL generation accuracy (from 58% to 86%) for certain schemas. A smaller improvement was noted when using semantic descriptions in retrieval recall (from 52% to 58% on one of our tests).</p><p>Results varied significantly across datasets, schemas, and models. Our current evaluations were conducted using a more cost-effective model (OpenAI's gpt-4.1-nano). We're particularly interested in insights from users experimenting with diverse datasets and more advanced models.</p><p>With these results, let’s walk through how the agent actually interacts with the database.</p><h3 id="how-queries-work-the-minimal-agent-loop">How queries work: the minimal agent loop</h3><p>Every semantic query follows this four-step flow:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2025/08/self-describing-postgres-diagram-vertical.png" class="kg-image" alt="Developers and agents alike gain visibility and control at each step." 
loading="lazy" width="2000" height="581" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2025/08/self-describing-postgres-diagram-vertical.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2025/08/self-describing-postgres-diagram-vertical.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2025/08/self-describing-postgres-diagram-vertical.png 1600w, https://timescale.ghost.io/blog/content/images/size/w2400/2025/08/self-describing-postgres-diagram-vertical.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Developers and agents alike gain visibility and control at each step.</span></figcaption></figure><p>Walking through step-by-step:</p><p><strong>Step 1: Describe your database</strong> <strong>using an LLM</strong></p><p>Generate initial YAML descriptions with an LLM:</p><pre><code class="language-Shell">pgai semantic-catalog describe -f descriptions.yaml</code></pre><p>Review and refine these descriptions, storing them with the same rigor as your application code.</p><p>The resulting structure looks like this:</p><pre><code class="language-YAML">schema: public
name: restaurant
type: table
description: Stores core information about each restaurant, including its cuisine,
  city, and rating.
columns:
- name: id
  description: Primary key identifier for the restaurant.
- name: name
  description: Lowercased name of the restaurant.
- name: food_type
  description: Lowercased name of the type of cuisine or food style offered by the restaurant.
- name: city_name
  description: Lowercased name of the city where the restaurant is located.
- name: rating
  description: Numeric rating score assigned to the restaurant, where higher than 2.5 is considered good.
...
---
type: sql_example
sql: SELECT t2.house_number, t1.name FROM LOCATION AS t2 JOIN restaurant AS t1 ON t1.id = t2.restaurant_id WHERE t1.city_name IN (SELECT city_name FROM geographic WHERE region = 'bay area') AND t1.rating &gt; 2.5
description: give me some good restaurants in the bay area ?
...
---
type: fact
description: When asking for a restaurant, provide its name (restaurant.name) and its house number (location.house_number).
...</code></pre><p><strong>Step 2: Have a human review the description and add context</strong></p><p>A developer would then review the descriptions and add additional context. Crucially, business logic that is central to database operations is often not encoded into the schema, and thus cannot be derived by the LLM. A developer needs to provide that information.</p><p>After the descriptions are reviewed, they should be treated as a core part of application code and thus should be stored in version control, be reviewed in pull-requests, etc.</p><p><strong>Step 3: Import into the catalog</strong>&nbsp;</p><p>Make descriptions available:</p><pre><code class="language-Shell">pgai semantic-catalog import -f descriptions.yaml</code></pre><p>This supports declarative configuration and continuous deployment practices.</p><p><strong>Step 4: Generate SQL</strong>&nbsp;</p><p>Generate SQL from natural language queries:</p><p>CLI:</p><pre><code class="language-Shell">pgai semantic-catalog generate-sql -p "Which passengers have experienced the most flight delays in 2024?"</code></pre><p>Python:</p><pre><code class="language-Python">response = await catalog.generate_sql(
    con,
    con,
    "openai:gpt-4.1",
    "Which passengers have experienced the most flight delays in 2024?",
)
</code></pre><h2 id="key-lessons-so-far">Key Lessons So Far</h2><h3 id="semantic-context-matters">Semantic context matters&nbsp;</h3><p>Models are far better at generating correct SQL when they have access to rich semantic information, not just schema names, but natural-language descriptions. Even a small amount of context (e.g. "the order table tracks transactions for customers buying subscription plans") can dramatically improve reliability.</p><h3 id="narrow-interfaces-build-confidence">Narrow interfaces build confidence&nbsp;</h3><p>What we’ve noticed with agentic text-to-sql systems is that to get good accuracy you need to tighten the scope of what kind of information the system has access to. The relationship looks something like this:</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2025/08/data-scope-generality-vs-accuracy.png" class="kg-image" alt="" loading="lazy" width="528" height="432"></figure><p>Successful agentic systems start with tight scopes. We found it valuable to restrict agents to function-level or view-level access at first, and only expand access once correctness is proven.</p><p>Postgres allows you to balance control and flexibility through three main interfaces:</p><ul><li><strong>Functions</strong>: Highly controlled but narrow scope</li><li><strong>Views</strong>: Moderate control with broader access</li><li><strong>Raw Tables</strong>: Most general but least constrained and more prone to failure</li></ul><p>We suggest beginning with tightly scoped functions, then expanding access as confidence grows.</p><h2 id="self-correcting-using-explain">Self-Correcting Using EXPLAIN</h2><p>To further improve reliability, we employ Postgres's deterministic EXPLAIN command. 
This preemptively catches query errors such as incorrect column or table names, allowing agents to self-correct and substantially increasing accuracy.</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2025/08/Explain-plan.png" class="kg-image" alt="" loading="lazy" width="949" height="265" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2025/08/Explain-plan.png 600w, https://timescale.ghost.io/blog/content/images/2025/08/Explain-plan.png 949w" sizes="(min-width: 720px) 720px"></figure><h2 id="store-the-semantic-catalog-anywhere">Store the Semantic Catalog Anywhere</h2><p>For many deployments, integrating the semantic catalog directly into your existing database provides the simplest path. Recognizing, however, the risks and complexities of altering production environments, we've also enabled the semantic catalog to be hosted independently in a separate database. This approach offers greater deployment flexibility and ensures accessibility even when you only have read-only access to your primary database.</p><h2 id="what%E2%80%99s-next-from-self-describing-to-self-learning">What’s Next: From Self-Describing to Self-Learning</h2><p>While this semantic foundation is already improving agentic querying, we're just beginning. Our roadmap focuses on:</p><ul><li><strong>Self-learning catalog:</strong> Automatically enrich metadata by analyzing queries from production environments. Let databases learn from usage to continuously enhance accuracy.</li><li><strong>Dynamic policy management:</strong> Express complex access rules—like row-level privacy policies—in natural language, automatically enforced by the database.</li></ul><p>We’ve open-sourced everything we’ve built so far. 
Check out the <a href="https://github.com/timescale/pgai/blob/main/docs/semantic_catalog/README.md"><u>README with a Quickstart</u></a> to dive in, and we warmly invite your contributions, whether it's running evaluations on your schemas, proposing datasets, or challenging our assumptions. The best databases aren't just designed; they evolve through community effort.</p><hr><p><strong>About the author</strong></p><p><a href="https://www.linkedin.com/in/matvey-arye/"><u>Matvey Arye</u></a> is a founding engineering leader at Tiger Data (creators of TimescaleDB), the premier provider of relational database technology for time-series data and AI. Currently, he manages the team at Tiger Data responsible for building the go-to developer platform for AI applications.&nbsp; Under his leadership, the Tiger Data engineering team has introduced partitioning, compression, and incremental materialized views for time-series data, plus cutting-edge indexing and performance innovations for AI.&nbsp;</p><p>Matvey earned a Bachelor’s degree in Engineering at The Cooper Union. He earned a Doctorate in Computer Science at Princeton University, where his research focused on cross-continental data analysis covering issues such as networking, approximate algorithms, and performant data processing.&nbsp;</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How We Made PostgreSQL as Fast as Pinecone for Vector Data]]></title>
            <description><![CDATA[Read how we equipped PostgreSQL with advanced indexing techniques, making it as fast as other specialized vector databases, like Pinecone.]]></description>
            <link>https://www.tigerdata.com/blog/how-we-made-postgresql-as-fast-as-pinecone-for-vector-data</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/how-we-made-postgresql-as-fast-as-pinecone-for-vector-data</guid>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[pgvector]]></category>
            <dc:creator><![CDATA[Matvey Arye]]></dc:creator>
            <pubDate>Tue, 11 Jun 2024 12:05:45 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2024/06/How-We-Made-PostgreSQL-as-Fast-as-Pinecone-for-Vector-Data_Binary-Quantization-1.png">
            </media:content>
            <content:encoded><![CDATA[<p>We’ve <a href="https://timescale.ghost.io/blog/pgvector-is-now-as-fast-as-pinecone-at-75-less-cost" rel="noreferrer">recently announced the open-sourcing of pgvectorscale</a>, a new <a href="https://www.tigerdata.com/blog/top-8-postgresql-extensions" rel="noreferrer">PostgreSQL extension</a> that provides advanced indexing techniques for vector data. Pgvectorscale provides a new index method for pgvector data, significantly improving the search performance of approximate nearest neighbor (ANN) queries. These queries are key for leveraging modern <a href="https://www.tigerdata.com/blog/a-beginners-guide-to-vector-embeddings" rel="noreferrer">vector embedding</a> techniques to facilitate <a href="https://www.tigerdata.com/learn/vector-search-vs-semantic-search" rel="noreferrer"><u>semantic search</u></a>, which allows for finding things similar to a query's <em>meaning. </em>That, in turn, enables applications like retrieval-augmented generation (RAG), summarization, clustering, or general search.</p><p>In our announcement post, we described how our new StreamingDiskANN vector index allows us to <a href="https://www.tigerdata.com/search" rel="noreferrer">perform vector search faster than bespoke purpose-built databases created for this purpose—like Pinecone</a>. We also observed that if bespoke databases aren’t faster, then there is no reason to use them because they can’t possibly compete with the rich feature set and ecosystem of general-purpose databases like PostgreSQL.</p><p>In this article we’ll go into the technical contributions that allowed us to “break the speed barrier” and create a fast vector index in PostgreSQL. We’ll cover three technical improvements we made:</p><ul><li><strong>Implementing the DiskANN algorithm</strong> to allow the index to be stored on SSDs instead of having to reside in memory. 
This vastly decreases the cost of storing large amounts of vectors since SSDs are much cheaper than RAM.&nbsp;</li><li><strong>Supporting streaming post-filtering,</strong> which allows for accurate retrieval even when secondary filters are applied. In contrast, the <a href="https://www.tigerdata.com/blog/vector-database-basics-hnsw" rel="noreferrer">HNSW</a> (hierarchical navigable small world) index fails to accurately retrieve data if the filters exclude the first <code>ef_search</code> vectors. Pinecone had previously complained about this problem when comparing itself to pgvector. Guess what: through the power of open source, this issue has been resolved.&nbsp;</li><li><strong>Developing a completely new vector quantization algorithm </strong>we call SBQ (statistical binary quantization). This algorithm provides a better accuracy vs. performance trade-off compared to existing ones like BQ (binary quantization) and PQ (product quantization).</li></ul><h2 id="enhancing-postgresql-for-vector-data">Enhancing PostgreSQL for Vector Data</h2><h3 id="implementing-the-diskann-algorithm-to-optimize-for-ssd-storage">Implementing the DiskANN algorithm to optimize for SSD storage</h3><p>The <a href="https://www.tigerdata.com/learn/understanding-diskann" rel="noreferrer"><u>DiskANN</u></a> algorithm was developed by researchers at Microsoft. Its goal was to store a very large number of vectors (think Microsoft scale). At that scale, it was simply uneconomical to store everything in RAM. Thus, the algorithm is geared towards storing vectors on SSDs and using less RAM. Its details are described very well in the <a href="https://proceedings.neurips.cc/paper_files/paper/2019/file/09853c7fb1d3f8ee67a61b6bf4a7f8e6-Paper.pdf"><u>paper</u></a>, so I’ll only give a bit of intuition below.&nbsp;</p><p>The DiskANN algorithm is a graph-based search algorithm like <a href="https://www.tigerdata.com/learn/hnsw-vs-diskann" rel="noreferrer">HNSW</a>. 
Graph-based algorithms in this space have a well-known problem: finding an item that’s “very far” from the starting position is expensive because it requires a lot of hops.&nbsp;</p><p>HNSW solves this problem by introducing a system of layers where the first (top) layer only has “long-range” edges that quickly get you into the right vicinity and have pointers to nodes into lower levels that allow you to traverse the graph in a more fine-grained way. This solves the long-range problem but introduces more indirection through the layering system, which requires more random-access that forces the graph into RAM for good performance.&nbsp;</p><p>In contrast, DiskANN uses a single-layer graph and solves the long-range problem during graph construction by allowing for neighbor edges that refer to far-away nodes. The single-layer construction simplifies the algorithm and decreases the random access necessary during search, allowing SSDs to be used effectively.</p><h3 id="support-for-streaming-retrieval-for-accurate-metadata-filtering">Support for streaming retrieval for accurate metadata filtering</h3><p>Oftentimes, when searching for semantically similar items, you want to constrain your search with additional filters. 
For example, documents are often associated with a set of tags and you may want to constrain your search by requiring a match of the tags as well as vector similarity.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2024/06/How-We-Made-PostgreSQL-as-Fast-as-Pinecone-for-Vector-Data_two-stage-filtering.png" class="kg-image" alt="A diagram representing two-stage filtering" loading="lazy" width="2000" height="1657" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2024/06/How-We-Made-PostgreSQL-as-Fast-as-Pinecone-for-Vector-Data_two-stage-filtering.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2024/06/How-We-Made-PostgreSQL-as-Fast-as-Pinecone-for-Vector-Data_two-stage-filtering.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2024/06/How-We-Made-PostgreSQL-as-Fast-as-Pinecone-for-Vector-Data_two-stage-filtering.png 1600w, https://timescale.ghost.io/blog/content/images/2024/06/How-We-Made-PostgreSQL-as-Fast-as-Pinecone-for-Vector-Data_two-stage-filtering.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Figure 1: The problem with two-stage post-filtering is that if the matching records aren’t located in the set before the cutoff of the first stage, the final answer will be incorrect.</em></i></figcaption></figure><p>This is challenging for many HNSW-based indexes (including pgvector’s implementation) because the index retrieves a pre-set number of records from the index (set by the <code>hnsw.ef_search</code> parameter, often set to 1,000 or less) <em>before</em> applying secondary filters. 
If not enough items in the retrieved set (e.g., first 1,000 items) match the secondary filters, you will miss those results.&nbsp;</p><p>Figure 1 illustrates this problem when you use <code>hnsw.ef_search=5</code> to find the top two vectors closest to a given query <strong>and</strong> matching the tag “department=engineering”. In this scenario, the first item with the correct tag is the seventh vector closest to the query.&nbsp;</p><p>Since the vector search returns only the closest five items and none matches the tag filter, no results will be returned! This is an extreme example where no results are left over, but there will be some accuracy loss any time the retrieved set has less than k items matching the filter.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2024/06/How-We-Made-PostgreSQL-as-Fast-as-Pinecone-for-Vector-Data_streaming-filtering.png" class="kg-image" alt="A diagram representing streaming filtering" loading="lazy" width="2000" height="1657" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2024/06/How-We-Made-PostgreSQL-as-Fast-as-Pinecone-for-Vector-Data_streaming-filtering.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2024/06/How-We-Made-PostgreSQL-as-Fast-as-Pinecone-for-Vector-Data_streaming-filtering.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2024/06/How-We-Made-PostgreSQL-as-Fast-as-Pinecone-for-Vector-Data_streaming-filtering.png 1600w, https://timescale.ghost.io/blog/content/images/2024/06/How-We-Made-PostgreSQL-as-Fast-as-Pinecone-for-Vector-Data_streaming-filtering.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Figure 2: Streaming filtering produces the correct result by exposing a </em></i><i><code spellcheck="false" style="white-space: pre-wrap;"><em class="italic">get_next()</em></code></i><i><em class="italic" style="white-space: 
">
pre-wrap;"> function that can be called continuously until the right number of records are found. </em></i></figcaption></figure><p>In contrast, our StreamingDiskANN index has no “ef_search” type cutoff. Instead, as shown in Figure 2, it uses a streaming model that allows the index to continuously retrieve the “next closest” item for a given query, potentially even traversing the entire graph! The Postgres execution system will continuously ask for the “next closest” item until it has matched the <code>LIMIT N</code> items that satisfy the additional filters. This is a form of post-filtering that suffers absolutely no accuracy degradation.</p><p>As a side note, Pinecone made a big deal of the “ef_search” type limitation to deposition pgvector in <a href="https://www.pinecone.io/blog/pinecone-vs-pgvector/"><u>their comparison</u></a>. But with the introduction of StreamingDiskANN, this criticism no longer applies. This just shows the power of open-source projects to move quickly to mitigate limitations.</p><h3 id="statistical-binary-quantization-sbq-a-new-quantization-algorithm">Statistical binary quantization (SBQ): A new quantization algorithm</h3><p>Many vector indexes use compression to reduce the space needed for vector storage and make index traversal faster at the cost of some loss in accuracy. The common algorithms are product quantization (PQ) and binary quantization (BQ). In fact, pgvector’s <a href="https://www.tigerdata.com/learn/vector-database-basics-hnsw" rel="noreferrer">HNSW index</a> just added BQ in their <a href="https://github.com/pgvector/pgvector/blob/master/CHANGELOG.md?ref=timescale.com#070-2024-04-29" rel="noreferrer">latest 0.7.0 release</a> (hooray!).&nbsp;</p><p>The way most vector databases work to retrieve K results is as follows. The system first retrieves N results (N&gt;K) using the approximate quantized distances, then “corrects” for the error by rescoring. 
It calculates the full distance for the N results, sorts the list by the full distance, and returns the K items with the smallest distance. Yet, even with rescoring, accuracy is important because it allows you to decrease N (and thus query faster) and improve the chances that the accurate result will be in the set of N pre-fetched results.</p><p>We took a look at the BQ algorithm and were unhappy with the amount of accuracy loss it produced. We also immediately saw some low-hanging fruit to improve it. In tinkering with the algorithm, we developed a new compression algorithm we are calling statistical binary quantization (SBQ).&nbsp;</p><p>The BQ compression algorithm transforms a floating-point vector into a binary vector in a surprisingly simple way: for each element in the vector, if the value is greater than 0.0, make the binary value 1; otherwise, set the binary value to 0. Then, the distance function simply becomes the <code>XOR</code> function. Why <code>XOR</code>? Well, you’ll find many mathematical explanations (none of which we quite like) but the intuition we use is that the binary vector divides the space into quadrants as seen in Figure 3, and the <code>XOR</code> function is simply a count of how many planes you have to cross to get from one quadrant to another.&nbsp;</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2024/06/How-We-Made-PostgreSQL-as-Fast-as-Pinecone-for-Vector-Data_Binary-Quantization.png" class="kg-image" alt="A diagram representing BQ" loading="lazy" width="2000" height="1460" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2024/06/How-We-Made-PostgreSQL-as-Fast-as-Pinecone-for-Vector-Data_Binary-Quantization.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2024/06/How-We-Made-PostgreSQL-as-Fast-as-Pinecone-for-Vector-Data_Binary-Quantization.png 1000w, 
https://timescale.ghost.io/blog/content/images/size/w1600/2024/06/How-We-Made-PostgreSQL-as-Fast-as-Pinecone-for-Vector-Data_Binary-Quantization.png 1600w, https://timescale.ghost.io/blog/content/images/2024/06/How-We-Made-PostgreSQL-as-Fast-as-Pinecone-for-Vector-Data_Binary-Quantization.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Figure 3: BQ for three dimensions. Quadrant 1 is represented by the binary vector [1,1,1] and any vector falling into that quadrant will have a distance of 0. The distance to vectors in other quadrants increases with the number of dimensions that are different.</em></i></figcaption></figure><p>One of the immediate things that struck us as odd is that the cutoff for each dimension is always 0.0. This was odd because in analyzing real embeddings we’ve previously found that the mean for each dimension is not even approximately 0.0. That means the quadrants we are defining in BQ are not dividing the space of points in half and are thus missing out on opportunities for differentiation.&nbsp;</p><p>Intuitively, you want the “origin” of your cutting plane in the middle of all the action, but in BQ, it’s off to the side. The solution was quite simple: we used a learning pass to derive the mean value for each dimension and then set the float-value cutoff to the mean instead of 0.0. Thus, we set the binary value for an element to 1 if and only if the float value is greater than the mean for the dimension.</p><p>But then we noticed yet another odd thing: the compression algorithm worked better for 1,536 dimensions than for 768 dimensions. This made little sense to us because the literature strongly implies that problems with higher dimensions are harder than those with lower dimensions (the so-called “curse of dimensionality”).
But here, the opposite is true.&nbsp;</p><p>However, thinking about the quadrant analogy, this kind of made sense—we’d have fewer quadrants with 768 dimensions, and each quadrant would be bigger and thus less differentiated. So we asked ourselves, could we create more quadrants with 768 dimensions?&nbsp;</p><p>Our approach was to convert each floating-point dimension into two bits (which we later generalized). The idea was to use the mean and standard deviation of each dimension to derive a z-score (a value’s distance from the mean normalized by standard deviation) and then divide the z-score into three regions. We then encode the three regions into two bits so that adjacent regions have an <code>XOR</code> distance of 1, and the distance increases with the z-score distance. In the two-bit case with three regions, the encoding is 00, 01, 11.&nbsp;</p><p>Experimentally, we found that two-bit encoding really helps accuracy in the 768-dimension case. Thus, by default, we use two-bit encoding for any data with fewer than about 900 dimensions and one-bit encoding otherwise. In one representative example on a dataset with 768 dimensions, the recall improved from 96.5&nbsp;% to 98.6&nbsp;% when switching from the one-bit to two-bit encoding, a significant improvement at such high recall levels.</p><p>In sum, these techniques help us achieve a better accuracy/performance trade-off.</p><h2 id="a-better-postgresql-for-vector-data">A Better PostgreSQL for Vector Data</h2><p>The three techniques we covered in this post allow us to develop a best-in-class index for vector data in PostgreSQL that rivals the performance of bespoke databases like Pinecone.
We were able to achieve this with a small team by harnessing much of the infrastructure that PostgreSQL provides, including caching, WAL (write-ahead logging) with its associated recovery infrastructure, and a rock-solid disk-writing system.&nbsp;</p><p>We wrote this in Rust using the <a href="https://github.com/pgcentralfoundation/pgrx"><u>PGRX</u></a> framework for writing Rust extensions for PostgreSQL. This further sped up development because we could rely on some of the safety guarantees that Rust and PGRX provide while developing our own safe wrappers for tricky parts of the code (like disk I/O). We think that this combination of tools is really useful and powerful for developing database features and extending the reach of PostgreSQL.&nbsp;</p><h3 id="next-steps">Next steps</h3><p>Our team has been working tirelessly over the last few months to equip PostgreSQL with these new advanced indexing techniques for vector data. Our goal is to help PostgreSQL developers become AI developers. But for that, we need your feedback.</p><p>Here’s how you can get involved:&nbsp;</p><ul><li><strong>Share the news with your friends and colleagues</strong>: Share our posts announcing <a href="https://www.tigerdata.com/blog/pgai-giving-postgresql-developers-ai-engineering-superpowers" rel="noreferrer">pgai</a> and pgvectorscale on <a href="https://x.com/TimescaleDB"><u>X/Twitter</u></a>, <a href="https://www.linkedin.com/company/timescaledb/"><u>LinkedIn</u></a>, and Threads. We promise to RT back.</li><li><strong>Submit issues and feature requests</strong>: We encourage you to submit issues and feature requests for functionality you’d like to see, bugs you find, and suggestions you think would improve both projects.</li><li><strong>Make a contribution</strong>: We welcome community contributions for both pgvectorscale and pgai. Pgvectorscale is written in Rust, while pgai uses Python and PL/Python.
For pgai specifically, let us know which models you want to see supported, particularly for open-source embedding and generation models. <a href="https://github.com/timescale/pgai/" rel="noreferrer">See the pgai GitHub</a> for more.</li><li><strong>Offer pgvectorscale and pgai extensions on your PostgreSQL cloud</strong>: Pgvectorscale and pgai are open-source projects under the <a href="https://github.com/timescale/pgvectorscale/blob/main/LICENSE" rel="noreferrer"><u>PostgreSQL License</u></a>. We encourage you to offer pgvectorscale and pgai on your managed PostgreSQL database-as-a-service platform, and we can even help you spread the word. Get in touch via our <a href="https://www.timescale.com/contact"><u>Contact Us form</u></a> and mention pgai and pgvectorscale to discuss further.</li><li><strong>Use pgai and pgvectorscale today</strong>: You can find installation instructions on the <a href="https://github.com/timescale/pgai/" rel="noreferrer">pgai GitHub</a> and <a href="https://github.com/timescale/pgvectorscale/" rel="noreferrer">pgvectorscale GitHub</a> repositories, respectively. <a href="https://www.tigerdata.com/search" rel="noreferrer">You can also access both pgai and pgvectorscale on any database service on Tiger Data’s cloud PostgreSQL platform</a>. For production vector workloads, we’re offering private beta access to vector-optimized databases with pgvector and pgvectorscale on Timescale. <a href="https://timescale.typeform.com/to/H7lQ10eQ"><u>Sign up here for priority access</u></a>.</li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How We Designed a Resilient Vector Embedding Creation System for PostgreSQL Data]]></title>
            <description><![CDATA[Learn the design decisions and trade-offs behind our system for creating and storing vector embeddings for data in PostgreSQL, PgVectorizer.]]></description>
            <link>https://www.tigerdata.com/blog/how-we-designed-a-resilient-vector-embedding-creation-system-for-postgresql-data</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/how-we-designed-a-resilient-vector-embedding-creation-system-for-postgresql-data</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[Engineering]]></category>
            <dc:creator><![CDATA[Matvey Arye]]></dc:creator>
            <pubDate>Fri, 10 Nov 2023 13:33:39 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2023/11/How-we-designed-a-resilient-embedding-system-for-PostgreSQL-Data.png">
            </media:content>
            <content:encoded><![CDATA[<p>Embedding data stored in a PostgreSQL table is undoubtedly useful—with applications ranging from semantic search and recommendation systems to generative AI applications and retrieval-augmented generation. But creating and managing embeddings for data in PostgreSQL tables can be tricky, with many considerations and edge cases to take into account, such as keeping embeddings up to date with table updates and deletes, ensuring resilience against failures, and minimizing the impact on existing systems that depend on the table.</p><p><a href="https://timescale.ghost.io/blog/a-complete-guide-to-creating-and-storing-embeddings-for-postgresql-data/" rel="noreferrer">In a previous blog post, we detailed a step-by-step guide on the process of creating and managing embeddings</a> for data residing in PostgreSQL using <a href="https://docs.timescale.com/ai/latest/pgvectorizer/" rel="noreferrer">PgVectorizer</a>—our simple and resilient embedding creation system. Using a blog application with data stored in a PostgreSQL database as an example, we covered how to create and keep up-to-date vector embeddings using <a href="https://timescale.ghost.io/blog/a-python-library-for-using-postgresql-as-a-vector-database-in-ai-applications/">Python</a>, <a href="https://python.langchain.com/docs/integrations/vectorstores/timescalevector">LangChain</a>, and <a href="https://www.timescale.com/ai">pgai on Timescale</a>.</p><p>In this blog post, we’ll discuss the technical design decisions and the trade-offs we made while building <a href="https://docs.timescale.com/ai/latest/pgvectorizer/" rel="noreferrer">PgVectorizer</a> to ensure simplicity, resilience, and high performance.
We’ll also discuss alternative designs if you want to roll your own.</p><p>Let’s jump into it.</p><h2 id="design-of-a-high-performance-vectorizer-for-postgresql-data-pgvectorizer">Design of a High-Performance Vectorizer for PostgreSQL Data (PgVectorizer)</h2><p>First, let’s describe how the system we are building will work. Feel free to skip this section if you already read the <a href="https://timescale.ghost.io/blog/a-complete-guide-to-creating-and-storing-embeddings-for-postgresql-data/" rel="noreferrer">PgVectorizer post</a>.</p><h3 id="system-overview">System overview</h3><p>As an illustrative example, we’ll use a simple blog application storing data in PostgreSQL using a table defined as follows:</p><pre><code class="language-SQL">CREATE TABLE blog (
  id              SERIAL PRIMARY KEY NOT NULL,
  title           TEXT NOT NULL, 
  author          TEXT NOT NULL,
  contents        TEXT NOT NULL,
  category        TEXT NOT NULL,
  published_time  TIMESTAMPTZ NULL --NULL if not yet published
);
</code></pre>
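<p>The embeddings themselves will live in a separate table (referred to below as blog_embedding) rather than in the blog table itself. The exact schema is managed by PgVectorizer, but a minimal sketch, assuming the pgvector extension and 1,536-dimensional embeddings such as OpenAI's, might look like this (if the contents are split into chunks, one blog row may map to several embedding rows):</p><pre><code class="language-SQL">-- Illustrative only: the real table is created and managed by PgVectorizer.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE blog_embedding (
  id         INT NOT NULL,          -- id of the corresponding blog row
  embedding  VECTOR(1536) NOT NULL  -- dimension depends on the embedding model
);
</code></pre>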
<p>We want to create embeddings on the contents of the blog post so we can later use them for semantic search and to power retrieval-augmented generation. Embeddings should only exist and be searchable for blogs that have been published (where the <code>published_time</code> is <code>NOT NULL</code>).&nbsp;</p><p>While building this embeddings system, we identified a number of goals that any straightforward, resilient embedding-creation system should have:</p><ul><li><strong>No modifications to the original table. </strong>This allows systems and applications that already use this table not to be impacted by changes to the embedding system. This is especially important for legacy systems.</li><li><strong>No modification to the applications that interact with the table. </strong>Having to modify the code that alters the table may not be possible for legacy systems. It’s also poor software design because it couples systems that don’t use embeddings with code that generates the embedding.</li><li><strong>Automatically update embeddings</strong> when rows in the source table change (in this case, the blog table). This lessens the maintenance burden and contributes to worry-free software. At the same time, this update need not be instantaneous or within the same commit. For most systems, “eventual consistency” is just fine.</li><li><strong>Ensure resilience against network and service failures: </strong>Most systems generate embeddings via a call to an external system, such as the OpenAI API.
In scenarios where the external system is down, or a network malfunction occurs, it's imperative that the remainder of your database system continues working.</li></ul><p>These guidelines were the basis of a robust architecture that we implemented using the <a href="https://github.com/timescale/python-vector">Python Vector library</a>, a library for <a href="https://timescale.ghost.io/blog/a-python-library-for-using-postgresql-as-a-vector-database-in-ai-applications/">working with vector data using PostgreSQL</a>. To complete the job successfully, we added new functionality to this library—<a href="https://docs.timescale.com/ai/latest/pgvectorizer/" rel="noreferrer">PgVectorizer</a>—to make embedding PostgreSQL data as simple as possible.</p><p>Here’s the architecture we settled on:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2023/11/Create-Embedding-for-PostgreSQL-Data-Architecture_Diagram-1.png" class="kg-image" alt="Reference architecture for a simple and resilient system for embedding data in an existing PostgreSQL table. We use the example use case of a blogging application, hence the table names above." loading="lazy" width="2000" height="1282" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2023/11/Create-Embedding-for-PostgreSQL-Data-Architecture_Diagram-1.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2023/11/Create-Embedding-for-PostgreSQL-Data-Architecture_Diagram-1.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2023/11/Create-Embedding-for-PostgreSQL-Data-Architecture_Diagram-1.png 1600w, https://timescale.ghost.io/blog/content/images/2023/11/Create-Embedding-for-PostgreSQL-Data-Architecture_Diagram-1.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Reference architecture for a simple and resilient system for embedding data in an existing PostgreSQL table.
We use the example use case of a blogging application, hence the table names above</em></i><span style="white-space: pre-wrap;">.</span></figcaption></figure><p>In this design, we first add a trigger to the blog table that monitors for changes and, upon seeing a modification, inserts a job into the blog_work_queue table that indicates that a row in the blog table is out-of-date with its embedding.</p><p>On a fixed schedule, an embeddings creator job will poll the blog_work_queue table, and if it finds work to do, will do the following in a loop:</p><ol><li>Read and lock a row in the blog_work_queue table&nbsp;</li><li>Read the corresponding row in the blog table</li><li>Create an embedding for the data in the blog row</li><li>Write the embedding to the blog_embedding table</li><li>Delete the locked row in the blog_work_queue table</li></ol><p>To see this system in action, see an example of usage to <a href="https://timescale.ghost.io/blog/a-complete-guide-to-creating-and-storing-embeddings-for-postgresql-data/" rel="noreferrer">create and maintain embeddings in a PostgreSQL table using OpenAI, LangChain, and pgai on Timescale in this blog post</a>.</p><p>Going back to the example of our blog application table, on a high level, <a href="https://docs.timescale.com/ai/latest/pgvectorizer/" rel="noreferrer">PgVectorizer</a> has to do two things:</p><ol><li>Track changes to the blog rows to know which rows have changed.</li><li>Provide a method to process the changes to create embeddings.</li></ol><p>Both of these have to be highly concurrent and performant. Let’s see how it works.</p><h3 id="track-change-to-the-blog-table-with-the-blogworkqueue-table">Track changes to the blog table with the blog_work_queue table</h3><p>You can create a simple work queue table with the following:</p><pre><code class="language-SQL">CREATE TABLE blog_embedding_work_queue (
  id  INT 
);

CREATE INDEX ON blog_embedding_work_queue(id);
</code></pre>
<p>This is a very simple table, but there is one item of note: this table has no unique key. This was done to avoid locking issues when processing the queue, but it does mean that we may have duplicates. We discuss the trade-off later in Alternative 1 below.</p><p>Then you create a trigger to track any changes made to <code>blog</code>:</p><pre><code class="language-SQL">CREATE OR REPLACE FUNCTION blog_wq_for_embedding() RETURNS TRIGGER LANGUAGE PLPGSQL AS $$ 
BEGIN 
  IF (TG_OP = 'DELETE') THEN
    INSERT INTO blog_embedding_work_queue 
      VALUES (OLD.id);
  ELSE
    INSERT INTO blog_embedding_work_queue 
      VALUES (NEW.id);
  END IF;
  RETURN NULL;
END; 
$$;

CREATE TRIGGER track_changes_for_embedding 
AFTER INSERT OR UPDATE OR DELETE
ON blog 
FOR EACH ROW EXECUTE PROCEDURE blog_wq_for_embedding();

INSERT INTO blog_embedding_work_queue 
  SELECT id FROM blog WHERE published_time is NOT NULL;
</code></pre>
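<p>With the trigger in place, any change to blog should now enqueue the row’s ID. A quick sanity check (the sample values below are illustrative):</p><pre><code class="language-SQL">INSERT INTO blog (title, author, contents, category, published_time)
VALUES ('Hello', 'Jane Doe', 'My first post', 'announcements', now());

-- The trigger fired, so the new blog's id is now queued for embedding:
SELECT id FROM blog_embedding_work_queue;
</code></pre>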
<p>The trigger inserts the ID of the blog that has changed into blog_work_queue. We install the trigger and then insert the IDs of any existing published blogs into the work queue. This ordering is important to make sure that no IDs get dropped.</p><p>Now, let’s describe some alternative designs and why we rejected them.</p><h3 id="alternative-1-implement-a-primary-or-unique-key-for-the-blogworkqueue-table"><strong>Alternative 1: </strong>Implement a primary or unique key for the blog_work_queue table.</h3><p>Introducing this key would eliminate the problem of duplicate entries. However, it's not without its challenges, particularly because such a key would force us to use the <code>INSERT…ON CONFLICT DO NOTHING</code> clause to insert new IDs into the table, and that clause takes a lock on the ID in the B-tree.&nbsp;</p><p>Here's the dilemma: during the processing phase, it's necessary to delete the rows being worked on to prevent simultaneous processing. Yet, committing this deletion can only be done after the corresponding embedding has been placed into blog_embeddings. This ensures no IDs are lost if there's a disruption midway—say, if the embedding creation crashes post-deletion but before the embedding is written.</p><p>Now, if we create a unique or primary key, the transaction overseeing the deletion stays open. Consequently, this acts as a lock on those specific IDs, preventing their insertion back into the blog_work_queue for the entire duration of the embedding creation job. Given that it takes longer to create embeddings than your typical database transaction, this spells trouble. The lock would stall the trigger for the main 'blog' table, leading to a dip in the primary application's performance. Making things worse, when processing multiple rows in a batch, deadlocks become a potential problem as well.</p><p>However, the potential issues arising from occasional duplicate entries can be managed during the processing stage, as illustrated later.
A sporadic duplicate here and there isn't a problem as it only marginally increases the amount of work the embedding job performs. This is certainly more palatable than grappling with the above-mentioned locking challenges.</p><h3 id="alternative-2-track-the-work-that-needs-to-be-done-by-adding-a-column-to-the-blog-table-to-track-whether-an-up-to-date-embedding-exists"><strong>Alternative 2:</strong> Track the work that needs to be done by adding a column to the <code>blog</code> table to track whether an up-to-date embedding exists.</h3><p>For example, we could add an <code>embedded</code> boolean column set to false on modification and flipped to true when the embedding is created. There are three reasons to reject this design:&nbsp;</p><ol><li>We don’t want to modify the <code>blog</code> table for the reasons we already mentioned above.</li><li>Efficiently getting a list of non-embedded blogs would require an additional index (or partial index) on the blog table. This would slow down other operations.</li><li>This increases churn on the table because every modification would now be written twice (once with embedded=false and once with embedded=true) due to the MVCC nature of PostgreSQL.</li></ol><p>A separate work queue table solves these issues.</p><h3 id="alternative-3-create-the-embeddings-directly-in-the-trigger">Alternative 3: Create the embeddings directly in the trigger.</h3><p>This approach has several issues:</p><ol><li>If the embedding service is down, either the trigger needs to fail (aborting your transaction), or you need to create a backup code path that … stores the IDs that couldn’t be embedded in a queue. The latter solution gets us back to our proposed design but with more complexity bolted on top.</li><li>This trigger will probably be much slower than the rest of the database operations because of the latency required to contact an external service.
This will slow down the rest of your database operations on the table.</li><li>It forces the user to write the embedding-creation code directly in the database. Given that the lingua franca of AI is Python and that embedding creation often requires many other libraries, this isn’t always easy or even possible (especially if running within a hosted PostgreSQL cloud environment). It’s much better to have a design where you have a choice to create embeddings inside or outside of the database.</li></ol><p>Now that we have a list of blogs that need to be embedded, let’s process the list!</p><h3 id="create-the-embeddings">Create the embeddings</h3><p>There are many ways to create embeddings. We recommend using an external Python script. This script will scan the work queue and the related blog posts, invoke an external service to craft the embeddings, and then store these embeddings back into the database. Our reasoning for this strategy is as follows:</p><ul><li><strong>Choice of Python</strong>: We recommend<a href="https://timescale.ghost.io/blog/postgresql-as-a-vector-database-create-store-and-query-openai-embeddings-with-pgvector/"> Python</a> because it offers a rich, unmatched ecosystem for AI data tasks, highlighted by powerful LLM development and data libraries like <a href="https://blog.langchain.dev/timescale-vector-x-langchain-making-postgresql-a-better-vector-database-for-ai-applications/">LangChain</a> and <a href="https://timescale.ghost.io/blog/timescale-vector-x-llamaindex-making-postgresql-a-better-vector-database-for-ai-applications/">LlamaIndex</a>.</li><li><strong>Opting for an external script instead of PL/Python</strong>: We wanted users to have control over how they embed their data. Yet, at the same time, many Postgres cloud providers don’t allow the execution of arbitrary Python code inside the database because of security concerns.
So, to allow users to have flexibility in both their embedding scripts as well as where they host their database, we went with a design that used external Python scripts.</li></ul><p>The jobs must be both performant and concurrency-safe. Concurrency guarantees that if jobs start running behind, the schedulers can start more jobs to help the system catch up and handle the load.</p><p>We’ll go through how to set up each of those methods later, but first, let’s see what the Python script would look like. Fundamentally, the script has three parts:</p><ol><li>Read the work queue and the blog post</li><li>Create an embedding for the blog post</li><li>Write the embedding to the blog_embedding table</li></ol><p>Steps 2 and 3 are performed by an&nbsp;<code>embed_and_write</code> callback that we define in the <a href="https://timescale.ghost.io/blog/a-complete-guide-to-creating-and-storing-embeddings-for-postgresql-data/" rel="noreferrer">PgVectorizer blog post</a>. So, let’s look more deeply at how we process the work queue.</p><h3 id="process-the-work-queue">Process the work queue</h3><p>We’ll first show you the code and then highlight the key elements at play:</p><pre><code class="language-python">import psycopg2
import psycopg2.extras

def process_queue(embed_and_write_cb, batch_size: int = 10):
    with psycopg2.connect(TIMESCALE_SERVICE_URL) as conn:
        with conn.cursor(cursor_factory=psycopg2.extras.DictCursor) as cursor:
            cursor.execute(f"""
                SELECT to_regclass('blog_embedding_work_queue')::oid; 
                """)
            table_oid = cursor.fetchone()[0]
            
            cursor.execute(f"""
                WITH selected_rows AS (
                    SELECT id
                    FROM blog_embedding_work_queue                         
                    LIMIT {int(batch_size)}
                    FOR UPDATE SKIP LOCKED
                ), 
                locked_items AS (
                    SELECT id, 
                           pg_try_advisory_xact_lock(
                                {int(table_oid)}, id) AS locked
                    FROM (
                        SELECT DISTINCT id 
                        FROM selected_rows 
                        ORDER BY id
                     ) as ids
                ),
                deleted_rows AS (
                    DELETE FROM blog_embedding_work_queue
                    WHERE id IN (
                        SELECT id 
                        FROM locked_items 
                        WHERE locked = true ORDER BY id
                   )
                )
                SELECT locked_items.id as locked_id, blog.*
                FROM locked_items
                LEFT JOIN blog ON blog.id = locked_items.id
                WHERE locked = true
                ORDER BY locked_items.id
            """)
            res = cursor.fetchall()
            if len(res) &gt; 0:
                embed_and_write_cb(res)
            return len(res)

process_queue(embed_and_write)
</code></pre>
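<p>For completeness, here is a sketch of what the <code>embed_and_write</code> callback might look like. It is illustrative only: the real callback is defined in the PgVectorizer blog post, and the <code>get_embedding</code> helper, the blog_embedding table, and the column names are assumptions for this sketch:</p><pre><code class="language-python">def embed_and_write(rows):
    # Runs on its own connection, so the embeddings commit before the
    # queue deletion in process_queue() does.
    with psycopg2.connect(TIMESCALE_SERVICE_URL) as conn:
        with conn.cursor() as cursor:
            for row in rows:
                blog_id = row["locked_id"]
                # Replace any stale embedding for this blog.
                cursor.execute(
                    "DELETE FROM blog_embedding WHERE id = %s", (blog_id,))
                # A NULL blog row (deleted) or NULL published_time (unpublished)
                # means no embedding should exist; the DELETE above suffices.
                if row["id"] is None or row["published_time"] is None:
                    continue
                # get_embedding() stands in for a call to an external
                # embedding service such as the OpenAI API.
                vector = get_embedding(row["contents"])
                cursor.execute(
                    "INSERT INTO blog_embedding (id, embedding) VALUES (%s, %s)",
                    (blog_id, vector))
</code></pre>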
<p>The SQL code in the above snippet is subtle because it is designed to be both performant and concurrency-safe, so let’s go through it:</p><ul><li><strong>Getting items off the work queue</strong>: Initially, the system retrieves a specified number of entries from the work queue, determined by the batch_size parameter. A FOR UPDATE lock is taken to ensure that concurrently executing scripts don’t try processing the same queue items. The SKIP LOCKED directive ensures that if any entry is currently being handled by another script, the system will skip it instead of waiting, avoiding unnecessary delays.<br></li><li><strong>Locking blog IDs</strong>: Due to the possibility of duplicate entries for the same blog_id within the work-queue table, simply locking said table is insufficient. Concurrent processing of the same ID by different jobs would be detrimental. Consider the following potential race condition:</li></ul><ol><li>Job 1 initiates and accesses a blog, retrieving version 1.</li><li>An external update to the blog occurs.</li><li>Subsequently, Job 2 begins, obtaining version 2.</li><li>Both jobs commence the embedding generation process.</li><li>Job 2 concludes, storing the embedding corresponding to blog version 2.</li><li>Job 1, upon conclusion, erroneously overwrites the version 2 embedding with the outdated version 1.</li></ol><p>While one could counter this issue by introducing explicit version tracking, it introduces considerable complexity without performance benefit. The strategy we opted for not only mitigates this issue but also prevents redundant operations and wasted work by concurrently executing scripts. </p><p>A Postgres advisory lock, prefixed with the table identifier to avoid potential overlaps with other such locks, is employed. The <code>try</code> variant, analogous to the earlier application of SKIP LOCKED, ensures the system avoids waiting on locks. The inclusion of the ORDER BY id clause helps prevent potential deadlocks.
We’ll cover some alternatives below.</p><ul><li><strong>Cleaning up the work queue</strong>: The script then deletes all the work queue items for blogs we have successfully locked. Because those queue items are visible to us under Multi-Version Concurrency Control (MVCC), the updates that enqueued them are also visible in the retrieved blog rows. Note that we delete all items with the given blog ID, not only the items read when selecting the rows: this effectively handles duplicate entries for the same blog ID. It's crucial to note that this deletion only commits after invoking the embed_and_write() function and the subsequent storage of the updated embedding. This sequence ensures we don’t lose any updates even if the script fails during the embedding generation phase.</li><li><strong>Getting the blogs to process: </strong>In the last step, we fetch the blogs to process. Note the use of the left join: that allows us to retrieve the blog IDs for deleted items that won’t have a blog row. We need to track those items to delete their embeddings. In the <code>embed_and_write</code> callback, we use published_time being NULL as a sentinel for the blog being deleted (or unpublished, in which case we also want to delete the embedding).</li></ul><h3 id="alternative-4-avoid-using-advisory-locks-by-using-another-table"><strong>Alternative 4</strong>: Avoid using advisory locks by using another table.&nbsp;</h3><p>If the system already uses advisory locks and you are worried about collisions, it’s possible to use a table with a blog ID as the primary key and lock the rows. In fact, this can be the blog table itself if you are sure these locks won’t slow down any other system (remember, these locks have to be held throughout the embedding process, which can take a while). </p><p>Alternatively, you can have a blog_embedding_locks table just for this purpose.
We didn’t suggest creating that table because we think it can get quite wasteful in terms of space, and using advisory locks avoids this overhead.&nbsp;</p><h2 id="conclusion-and-next-steps">Conclusion and Next Steps</h2><p><a href="https://timescale.ghost.io/blog/a-complete-guide-to-creating-and-storing-embeddings-for-postgresql-data/" rel="noreferrer">We introduced PgVectorizer and outlined a system adept at generating vector embeddings from data stored in PostgreSQL</a> and automatically keeping them up to date. This architecture ensures the embeddings remain synchronized with the perpetually evolving data, responding seamlessly to insertions, modifications, and deletions. </p><p>In this blog post, we gave you a behind-the-scenes look at how we created a system that boasts resilience, effectively handling potential downtimes of the embedding-generation service. Its design is adept at managing a high rate of data modifications and can seamlessly use concurrent embedding-generation processes to accommodate heightened loads.</p><p>Moreover, the paradigm of committing data to PostgreSQL and using the database to manage embedding generation in the background emerges as an easy mechanism to supervise embedding upkeep amidst data modifications. A myriad of demos and tutorials in the AI space focus singularly on the initial creation of data from documents, overlooking the intricate nuances associated with preserving data synchronization as it evolves.&nbsp;</p><p>However, in real production environments, data invariably changes, and grappling with the complexities of tracking and synchronizing these shifts is no trivial endeavor. But that’s what a database is designed to do! 
Why not just use it?</p><p>Here are some resources to continue your learning journey:</p><ul><li><a href="https://console.cloud.timescale.com/signup?utm_campaign=vectorlaunch&amp;utm_source=timescale-blog&amp;utm_medium=direct&amp;utm_content=pgvectorizer-how-we-built"><strong>Try pgai on Timescale free for 90 days</strong></a>: Store the embeddings generated from your PostgreSQL data in a fast and scalable vector database built on PostgreSQL. Learn more <a href="https://www.timescale.com/ai">about pgai on Timescale</a> and <a href="https://timescale.ghost.io/blog/how-we-made-postgresql-the-best-vector-database/">how it performs</a>.</li><li><a href="https://docs.timescale.com/ai/latest/pgvectorizer/" rel="noreferrer"><strong>Read the docs</strong></a>: Learn more about PgVectorizer and how to use it via the <a href="https://timescale.ghost.io/blog/a-python-library-for-using-postgresql-as-a-vector-database-in-ai-applications/">Python Vector library</a>.</li><li><a href="https://github.com/timescale/vector-cookbook/tree/main/pgvectorizer"><strong>Tutorial: Embedding blog data in PostgreSQL</strong></a>:<strong> </strong>Follow this step-by-step tutorial on how to create, embed, and store blog post data from a PostgreSQL table using the methods discussed in this blog post.</li><li><a href="https://timescale.ghost.io/blog/refining-vector-search-queries-with-time-filters-in-pgvector-a-tutorial/" rel="noreferrer"><strong>Tutorial: Refining Vector Search Queries With Time Filters in pgvector</strong></a><strong>: </strong>Learn how to do time-based filtering and semantic similarity search in a single SQL query.</li><li><a href="https://python.langchain.com/docs/integrations/vectorstores/timescalevector"><strong>LangChain and pgai on Timescale</strong></a><strong>: </strong>We used LangChain to illustrate document parsing and embedding creation in our examples above. 
Learn more about how to use it with pgai on Timescale for vector storage, similarity search, and hybrid search.</li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[A Complete Guide to Creating and Storing Embeddings for PostgreSQL Data]]></title>
            <description><![CDATA[Explore the power of vector embeddings and learn how to create and store them for PostgreSQL data using Python, LangChain, and pgai on Timescale.]]></description>
            <link>https://www.tigerdata.com/blog/a-complete-guide-to-creating-and-storing-embeddings-for-postgresql-data</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/a-complete-guide-to-creating-and-storing-embeddings-for-postgresql-data</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[Engineering]]></category>
            <dc:creator><![CDATA[Matvey Arye]]></dc:creator>
            <pubDate>Fri, 10 Nov 2023 13:33:28 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2023/11/Create-and-Store-Embeddings-PostgreSQL-data.png">
            </media:content>
            <content:encoded><![CDATA[<h3 id="why-use-embeddings-for-your-postgresql-data">Why use embeddings for your PostgreSQL data</h3><p><a href="https://timescale.ghost.io/blog/a-beginners-guide-to-vector-embeddings/" rel="noreferrer">Vector embeddings</a> provide a mathematical representation of data, encapsulating its semantic essence in a form that machines can readily process. While commonly associated with text, images, and audio, virtually any binary data can be converted into this format.&nbsp;</p><p>Generating embeddings from data already stored in your PostgreSQL database unlocks a multitude of applications:</p><p>Embeddings enable <a href="https://www.timescale.com/learn/vector-search-vs-semantic-search" rel="noreferrer"><strong>semantic search</strong></a>, which transcends the limitations of traditional keyword-driven methods. It doesn't just seek exact word matches; it grasps the deeper intent behind a user's query. The result? Even if search terms differ in phrasing, relevant results are surfaced. Taking advantage of <strong>hybrid search</strong>, which marries lexical and semantic search methodologies, offers users a search experience that's both rich and accurate. It's not just about finding direct matches anymore; it's about tapping into contextually and conceptually similar content to meet user needs.</p><p><strong>Recommendation systems</strong> benefit immensely from embeddings. Imagine a user who has shown interest in several articles on a singular topic. With embeddings, the recommendation engine can delve deep into the semantic essence of those articles, surfacing other database items that resonate with the same theme. Recommendations, thus, move beyond just the superficial layers like tags or categories and dive into the very heart of the content.</p><p>Generative AI, particularly <strong>retrieval-augmented generation (RAG)</strong>, can be powered using the data stored in a PostgreSQL database. 
This turns your data into more than just tabular information; it becomes context for Large Language Models (LLMs) like OpenAI’s <a href="https://openai.com/blog/new-models-and-developer-products-announced-at-devday">GPT-4 Turbo</a>, Anthropic’s <a href="https://www.anthropic.com/index/claude-2">Claude 2</a>, and open-source models like <a href="https://ai.meta.com/llama/">Llama 2</a>. When a user poses a query, relevant database content is fetched and used to supplement the query as additional context for the LLM. This helps reduce LLM hallucinations, as it ensures the model's output is more grounded in specific and relevant information, even if it wasn't part of the original training data.</p><p>Furthermore, embeddings offer a robust solution for <strong>clustering</strong> data in PostgreSQL. Transforming data into these vectorized forms enables nuanced comparisons between data points in a high-dimensional space. Through algorithms like <a href="https://en.wikipedia.org/wiki/K-means_clustering">K-means</a> or <a href="https://en.wikipedia.org/wiki/Hierarchical_clustering">hierarchical clustering</a>, data can be categorized into semantic clusters, offering insights that surface-level attributes might miss. This deepens our grasp of inherent data patterns, enriching both exploration and decision-making processes.</p><p>This guide delves into the process of creating and managing embeddings for data residing in PostgreSQL using PgVectorizer, a library we developed to make managing embeddings simple. PgVectorizer both creates embeddings from your data and keeps your relational and embedding data in sync as your data changes. </p><p>We'll navigate through architectural considerations, set up the library, perform a sync between your relational and embedding data, and query your embeddings. 
To learn more about how we built this, see our companion article, where <a href="https://timescale.ghost.io/blog/how-we-designed-a-resilient-vector-embedding-creation-system-for-postgresql-data/" rel="noreferrer">we go under the hood and explore how PgVectorizer works</a>: it covers the schema layout, how we designed the system for performance, concurrency, and resilience, and a few alternative design decisions we considered.</p><p>Let’s get started!&nbsp;</p><h2 id="creating-embeddings-for-data-in-postgresql-and-keeping-them-up-to-date-with-your-tables">Creating Embeddings for Data in PostgreSQL (and Keeping Them Up-To-Date With Your Tables)</h2><p>As a running example, we’ll use a simple blog application storing data in PostgreSQL using a table defined as:</p><pre><code class="language-SQL">CREATE TABLE blog (
  id              SERIAL PRIMARY KEY NOT NULL,
  title           TEXT NOT NULL, 
  author          TEXT NOT NULL,
  contents        TEXT NOT NULL,
  category        TEXT NOT NULL,
  published_time  TIMESTAMPTZ NULL --NULL if not yet published
);
</code></pre>
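<p>To make the running example concrete, here are two hypothetical rows (all values are invented for illustration). Note that the second row is an unpublished draft, so its <code>published_time</code> is <code>NULL</code>:</p>

```sql
-- A published post: this row should eventually get embeddings.
INSERT INTO blog (title, author, contents, category, published_time)
VALUES ('Travel to Istanbul', 'Ada', 'Istanbul is ...', 'travel',
        '2023-10-01 12:00:00+00');

-- An unpublished draft: published_time is NULL, so no embedding yet.
INSERT INTO blog (title, author, contents, category, published_time)
VALUES ('Unfinished draft', 'Ada', 'TODO ...', 'travel', NULL);
```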
<p>We want to create embeddings on the contents of the blog post so we can later use them for semantic search. Embeddings should only exist and be searchable for blogs that have been published (where the <code>published_time</code> is <code>NOT NULL</code>).&nbsp;</p><p>To make working with embeddings simple and resilient, any system that creates embeddings should have the following goals:</p><ul><li><strong>No modifications to the original table. </strong>This allows systems and applications that already use this table not to be impacted by changes to the embedding system. This is especially important for legacy systems.</li><li><strong>No modification to the applications that interact with the table. </strong>Having to modify the code that alters the table may not be possible for legacy systems. It’s also poor software design because it couples systems that don’t use embeddings with code that generates the embeddings.</li><li><strong>Automatically update embeddings</strong> when rows in the source table change (in this case, the blog table). This lessens the maintenance burden and contributes to worry-free software. At the same time, this update need not be instantaneous or within the same commit. For most systems, “eventual consistency” is just fine.</li><li><strong>Ensure resilience against network and service failures: </strong>Most systems generate embeddings via a call to an external system, such as the OpenAI API. 
In scenarios where the external system is down or a network malfunction occurs, it's imperative that the remainder of your database system continues working.</li></ul><p>These guidelines act as a robust framework for the following architecture:&nbsp;</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2023/11/Create-Embedding-for-PostgreSQL-Data-Architecture_Diagram.png" class="kg-image" alt="Reference architecture for a simple and resilient system for embedding data in an existing PostgreSQL table. We use the example use case of a blogging application, hence the names above." loading="lazy" width="2000" height="1282" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2023/11/Create-Embedding-for-PostgreSQL-Data-Architecture_Diagram.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2023/11/Create-Embedding-for-PostgreSQL-Data-Architecture_Diagram.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2023/11/Create-Embedding-for-PostgreSQL-Data-Architecture_Diagram.png 1600w, https://timescale.ghost.io/blog/content/images/2023/11/Create-Embedding-for-PostgreSQL-Data-Architecture_Diagram.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Reference architecture for a simple and resilient system for embedding data in an existing PostgreSQL table. 
We use the example use case of a blogging application, hence the names above.</em></i></figcaption></figure><p>In this design, we first add a trigger to the <code>blog</code> table that monitors for changes and, upon seeing a modification, inserts a job into the <code>blog_work_queue</code> table indicating that a row in the blog table is out-of-date with its embedding.</p><p>On a fixed schedule, an embeddings creator job will poll the <code>blog_work_queue</code> table, and if it finds work to do, will do the following in a loop:</p><ol><li>Read and lock a row in the <code>blog_work_queue</code> table</li><li>Read the corresponding row in the <code>blog</code> table</li><li>Create an embedding for the data in the blog row</li><li>Write the embedding to the <code>blog_embedding</code> table</li><li>Delete the locked row in the <code>blog_work_queue</code> table</li></ol><p>Next, we’ll discuss how to implement this simply by using the <a href="https://github.com/timescale/python-vector">Python Vector library</a>, a library for <a href="https://timescale.ghost.io/blog/a-python-library-for-using-postgresql-as-a-vector-database-in-ai-applications/">working with vector data using PostgreSQL</a>.&nbsp;</p><h3 id="easily-manage-embedding-postgresql-data-using-langchain-and-the-python-vector-library">Easily manage embedding PostgreSQL data using LangChain and the Python Vector library</h3><p>We’ve added functionality to our library to make embedding PostgreSQL data as simple as possible. We call this functionality <a href="https://docs.timescale.com/ai/latest/pgvectorizer/" rel="noreferrer">PgVectorizer</a>.</p><h3 id="define-your-embedding-creation-function">Define your embedding creation function</h3><p>There are myriad ways to embed your data. We don’t want to force you to use just one pre-defined method, so we ask users to define how to embed their data. Thus, we ask you to provide us with a Python function callback to create and store embeddings from database data. 
We call this the <code>embed_and_write</code> function, and it’s best to illustrate how to write it with an example.&nbsp;</p><p>Using the blog example above, it could look like the following when embedding using <a href="https://python.langchain.com/docs/get_started/introduction">LangChain</a>, a popular framework to work with LLM applications:</p><pre><code class="language-Python">from langchain.docstore.document import Document
from langchain.text_splitter import CharacterTextSplitter
from timescale_vector import client, pgvectorizer
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores.timescalevector import TimescaleVector
from datetime import timedelta


import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv(), override=True)

TIMESCALE_SERVICE_URL = os.environ["TIMESCALE_SERVICE_URL"]

def get_document(blog):
    text_splitter = CharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
    )
    docs = []
    for chunk in text_splitter.split_text(blog['contents']):
        content = f"Author {blog['author']}, title: {blog['title']}, contents:{chunk}"
        metadata = {
            "id": str(client.uuid_from_time(blog['published_time'])),
            "blog_id": blog['id'], 
            "author": blog['author'], 
            "category": blog['category'],
            "published_time": blog['published_time'].isoformat(),
        }
        docs.append(Document(page_content=content, metadata=metadata))
    return docs

def embed_and_write(blog_instances, vectorizer):
    # Note: the vectorizer argument isn’t used in this example but it
    # provides a way to get the name of the table being embedded,
    # along with other metadata.


    embedding = OpenAIEmbeddings()
    vector_store = TimescaleVector(
        collection_name="blog_embedding",
        service_url=TIMESCALE_SERVICE_URL,
        embedding=embedding,
        time_partition_interval=timedelta(days=30),
    )

    # Delete old embeddings for all ids in the work queue. locked_id is a
    # special column set to the primary key of the table being embedded.
    # For deleted items, it is the only key that is set.
    metadata_for_delete = [{"blog_id": blog['locked_id']} for blog in blog_instances]
    vector_store.delete_by_metadata(metadata_for_delete)

    documents = []
    for blog in blog_instances:
        # skip blogs that are not published yet, or are deleted (in which case it will be NULL)
        if blog['published_time'] is not None:
            documents.extend(get_document(blog))

    if len(documents) == 0:
        return
    
    texts = [d.page_content for d in documents]
    metadatas = [d.metadata for d in documents]
    ids = [d.metadata["id"] for d in documents]
    vector_store.add_texts(texts, metadatas, ids)
</code></pre>
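<p>The bookkeeping at the start of <code>embed_and_write</code> can be distilled into a small pure function. The sketch below is illustrative only (the helper name <code>partition_work_items</code> is ours, not part of the library), with plain dictionaries standing in for the rows handed to the callback:</p>

```python
def partition_work_items(blog_instances):
    """Split a work-queue batch into deletions and rows to (re)embed.

    For deleted rows, every attribute except 'locked_id' is None, so any
    other column can serve as a deletion sentinel; using 'published_time'
    also lets us skip drafts that are not yet published.
    """
    # Old embeddings are always removed first, keyed by the primary key
    # of the source row (exposed as 'locked_id').
    to_delete = [{"blog_id": b["locked_id"]} for b in blog_instances]
    # Only published, still-existing rows get fresh embeddings.
    to_embed = [b for b in blog_instances if b["published_time"] is not None]
    return to_delete, to_embed

batch = [
    {"locked_id": 1, "published_time": "2023-10-01", "contents": "..."},
    {"locked_id": 2, "published_time": None, "contents": None},  # deleted/draft
]
to_delete, to_embed = partition_work_items(batch)
print(to_delete)  # old embeddings are removed for both rows
print(to_embed)   # only the published row gets re-embedded
```

<p>This mirrors the delete-then-reinsert order in <code>embed_and_write</code>: deleting by primary key first makes the operation idempotent, so a crashed batch can simply be retried.</p>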
<p>The <code>embed_and_write()</code> function gets a list of blogs that have either been created, updated, or deleted. Its job is to update the vector store with the new blogs. We do this in two steps:</p><ol><li>Delete all existing vectors already in the vector store for items with the same primary key. The primary key is passed in via a special “locked_id” attribute. This is necessary if rows are deleted or updated.</li><li>Create embeddings for all items that were updated or deleted. Deleted items will have all attributes other than “locked_id” set to <code>None</code> so that any attribute can be used as a sentinel. In the example above, we use “published_time” because we also want to skip embedding documents where “published_time” is NULL in the database.</li></ol><p>The <code>get_document()</code> function is very use-case-specific, and you’ll have to adjust this code to suit your needs. Because of the context length limitations in LLM completion queries and token length limitations in embedding generation models, you will likely need some way to split long text up into smaller chunks. Here, we use a simple <a href="https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/character_text_splitter"><code>CharacterTextSplitter</code> in LangChain</a>, but much more complex approaches are possible. In the code above, we use a simple but effective trick: add some semantic context to each chunk by prepending the author and category. The only real requirement for the metadata generation portion is including the blog_id, which we can later use to delete old embeddings for a given blog.</p><p>In the code snippet above, we use <a href="https://python.langchain.com/docs/integrations/vectorstores/timescalevector#2-similarity-search-with-time-based-filtering">time-based partitioning</a> based on published_time. This type of partitioning drastically <strong>speeds up hybrid search on time and embedding similarity</strong>. 
We partition by time by deriving the UUID from the timestamp with the <a href="https://timescale.github.io/python-vector/vector.html#uuid_from_time"><code>client.uuid_from_time()</code> function</a> and by specifying <code>time_partition_interval=timedelta(days=30)</code> when creating the <a href="https://python.langchain.com/docs/integrations/vectorstores/timescalevector" rel="noreferrer">pgai on Timescale vector store</a>. <a href="https://youtu.be/EYMZVfKcRzM?si=iW-ySVYaGCDYrZkS">See this explainer video</a> for more on how time-based partitioning works in pgai on Timescale (previously known as Timescale Vector).</p><p>Once this is written, all you have to do is call the following code on a schedule:</p><pre><code class="language-Python">vectorizer = pgvectorizer.Vectorize(TIMESCALE_SERVICE_URL, 'blog')
while vectorizer.process(embed_and_write) &gt; 0:
    pass
</code></pre>
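<p>When running this loop from a scheduler, you may also want to guard against transient failures of the embedding service. Below is one way to wrap such a loop with retries; it's a sketch under our own names (<code>drain_work_queue</code> is not part of the library), where <code>process_batch()</code> plays the role of <code>vectorizer.process(embed_and_write)</code>:</p>

```python
import time

def drain_work_queue(process_batch, max_retries=3, base_delay=1.0):
    """Call process_batch() until it reports no remaining work.

    process_batch() returns the number of items handled (0 means the
    queue is empty). Transient failures are retried with exponential
    backoff instead of crashing the whole job; unfinished work simply
    stays in the work-queue table.
    """
    processed, retries = 0, 0
    while True:
        try:
            n = process_batch()
        except Exception:
            retries += 1
            if retries > max_retries:
                raise  # persistent failure: let the scheduler alert/retry
            time.sleep(base_delay * 2 ** (retries - 1))
            continue
        retries = 0
        if n == 0:
            return processed
        processed += n

# Example with a fake batch processor that handles 2, then 1, then 0 items:
counts = iter([2, 1, 0])
print(drain_work_queue(lambda: next(counts)))  # prints 3
```

<p>Because unfinished work stays in the work-queue table, giving up after <code>max_retries</code> is safe: the next scheduled run picks up where this one left off.</p>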
<p>This is the embedding creator job, which will sync your PostgreSQL data with a vector store. You can run this Python script on a schedule from practically anywhere:</p><ul><li>A scheduled <a href="https://timescale.ghost.io/blog/aws-lambda-for-beginners-overcoming-the-most-common-challenges/">AWS Lambda function</a></li><li>A scheduled <a href="https://developers.cloudflare.com/workers/">Cloudflare worker</a></li><li>A <a href="https://modal.com/use-cases/job-queues">Modal function</a></li><li>A <a href="https://robocorp.com/docs/python">Robocorp automation task</a></li><li>A cron job on an EC2 instance or even on your local machine</li></ul><p>It will automatically track which rows within the blog table have changed and call the <code>embed_and_write</code> function on batches of changed rows. It is performant, resilient to failures, and can be run in parallel when you have a backlog of things that need to be embedded. <a href="https://timescale.ghost.io/blog/how-we-designed-a-resilient-vector-embedding-creation-system-for-postgresql-data/" rel="noreferrer">Designing such a system to perform well is harder than it sounds</a>. But we’ve done it for you in <a href="https://docs.timescale.com/ai/latest/pgvectorizer/" rel="noreferrer">PgVectorizer</a> as part of the <a href="https://docs.timescale.com/ai/latest/python-interface-for-pgvector-and-timescale-vector/" rel="noreferrer">Python Vector library</a>!&nbsp;</p><h3 id="searching-through-your-embeddings">Searching through your embeddings</h3><p>How you use the embeddings depends on how they were generated, as well as on your use case. 
We will illustrate some simple search applications to work with the <a href="https://python.langchain.com/docs/integrations/vectorstores/timescalevector#1-similarity-search-with-euclidean-distance-default">LangChain example</a> we gave above and also provide references for more advanced applications, such as hybrid search on metadata and time.</p><pre><code class="language-Python"># Continues the setup from above; datetime is needed for the time filters.
from datetime import datetime, timedelta

TABLE_NAME = "blog_embedding"
embedding = OpenAIEmbeddings()
vector_store = TimescaleVector(
        collection_name=TABLE_NAME,
        service_url=TIMESCALE_SERVICE_URL,
        embedding=embedding,
        time_partition_interval=timedelta(days=30)
)
# find closest item
res = vector_store.similarity_search_with_score("Tell me about Travel to Istanbul", 1)

# hybrid search with a time filter
start_dt = datetime(2021, 1, 1, 0, 0, 0)
end_dt = datetime(2024, 1, 1, 0, 0, 0)
res = vector_store.similarity_search_with_score("Tell me about Travel to Istanbul", 1, start_date=start_dt, end_date=end_dt)
</code></pre>
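<p>Conceptually, <code>similarity_search_with_score</code> embeds the query and returns the stored documents whose vectors are closest to it. Here is a toy brute-force version of that idea (the real store uses an index rather than scanning every vector, and the helper below is ours, not the library's API):</p>

```python
import math

def cosine_distance(a, b):
    """Cosine distance: 0.0 for identical directions, up to 2.0 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

def nearest_documents(query_vec, store, k=1):
    """Return the k closest (document, score) pairs; a lower score is closer."""
    scored = [(doc, cosine_distance(query_vec, vec)) for doc, vec in store]
    return sorted(scored, key=lambda pair: pair[1])[:k]

# Tiny fake store of (document, embedding-vector) pairs in two dimensions.
store = [
    ("Travel to Istanbul", [0.9, 0.1]),
    ("PostgreSQL performance tips", [0.1, 0.9]),
]
print(nearest_documents([1.0, 0.0], store, k=1))
```

<p>Time filters like <code>start_date</code>/<code>end_date</code> then simply restrict which stored vectors are considered before the distance comparison, which is why time-based partitioning makes such hybrid queries fast.</p>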
<p>There are, of course, a lot more options for search, including filters and predicates on metadata, self-query retriever options, integrations with chat and RAG, and more! We recommend reading this <a href="https://python.langchain.com/docs/integrations/vectorstores/timescalevector">LangChain tutorial</a> for more info about the above-mentioned methods!&nbsp;</p><p><strong>Note: </strong>While the above example uses LangChain, you can also swap in frameworks like <a href="https://gpt-index.readthedocs.io/en/stable/index.html">LlamaIndex</a>, a popular LLM data framework that <a href="https://gpt-index.readthedocs.io/en/stable/examples/vector_stores/Timescalevector.html">integrates well with pgai on Timescale</a>, or do DIY document parsing and embedding in Python using OpenAI’s <a href="https://platform.openai.com/docs/guides/embeddings/what-are-embeddings">text-embedding-ada-002</a> model, or an open-source embedding model like <a href="https://huggingface.co/sentence-transformers">sentence-transformers</a>, while using the <a href="https://timescale.ghost.io/blog/a-python-library-for-using-postgresql-as-a-vector-database-in-ai-applications/">Python Vector client</a> for vector search.&nbsp;</p><h2 id="conclusion-and-next-steps">Conclusion and Next Steps</h2><p>In this blog post, we have outlined a system adept at generating vector embeddings from data stored in PostgreSQL and automatically keeping them up to date. This architecture ensures the embeddings remain synchronized with the perpetually evolving data, responding seamlessly to insertions, modifications, and deletions.&nbsp;</p><p>Using PostgreSQL to handle both data storage and background embedding generation offers an interesting new paradigm for maintaining embeddings as data changes. 
Many AI demonstrations and tutorials tend to concentrate only on initial data creation from documents, often missing the complexities of keeping data and embeddings synchronized as the data evolves.</p><p>If your goal is to embed data in a PostgreSQL table and keep those embeddings up to date, then you are done! Just <a href="https://docs.timescale.com/ai/latest/pgvectorizer/" rel="noreferrer">use the PgVectorizer class</a> in the <a href="https://github.com/timescale/python-vector">Python Vector library</a> and the code above to start embedding your PostgreSQL data and leveraging semantic and hybrid search in your applications.&nbsp;</p><p>If you are curious about how the PgVectorizer library works “under the hood” and how we designed the system for high performance, <a href="https://timescale.ghost.io/blog/how-we-designed-a-resilient-vector-embedding-creation-system-for-postgresql-data/" rel="noreferrer">see our companion blog post about how we designed a resilient embeddings system for PostgreSQL data</a>, which discusses the system design decisions and trade-offs we made while building PgVectorizer.&nbsp;</p><p>If you’d like to go straight to applying what you learned to your own data in PostgreSQL, here are some resources to continue your learning journey:<br></p><ul><li><a href="https://console.cloud.timescale.com/signup?utm_campaign=vectorlaunch&amp;utm_source=timescale-blog&amp;utm_medium=direct&amp;utm_content=pgvectorizer-guide"><strong>Try pgai on Timescale free for 90 days</strong></a>: Store the embeddings generated from your PostgreSQL data in a fast and scalable vector database built on PostgreSQL. 
Learn more <a href="https://www.timescale.com/ai">about pgai on Timescale</a>, and <a href="https://timescale.ghost.io/blog/how-we-made-postgresql-the-best-vector-database/">how it performs</a>.</li><li><a href="https://docs.timescale.com/ai/latest/pgvectorizer/" rel="noreferrer"><strong>Read the docs</strong></a>: Learn more about PgVectorizer and how to use it via the <a href="https://timescale.ghost.io/blog/a-python-library-for-using-postgresql-as-a-vector-database-in-ai-applications/">Python Vector library</a> to embed data stored in PostgreSQL tables.</li><li><a href="https://github.com/timescale/vector-cookbook/tree/main/pgvectorizer"><strong>Tutorial–Embedding blog data in PostgreSQL</strong></a>:<strong> </strong>Follow this step-by-step tutorial on creating, embedding, and storing blog post data from a PostgreSQL table using the methods discussed in this blog post.</li><li><a href="https://python.langchain.com/docs/integrations/vectorstores/timescalevector"><strong>LangChain and pgai on Timescale</strong></a><strong> (previously known as Timescale Vector): </strong>We used LangChain to illustrate document parsing and embedding creation in our examples above. Learn more about how to use it with pgai on Timescale for vector storage, similarity search, and hybrid search.</li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Nearest Neighbor Indexes: What Are IVFFlat Indexes in Pgvector and How Do They Work]]></title>
            <description><![CDATA[A primer on the pgvector’s Inverted File Flat (ivfflat) algorithm for approximate nearest neighbor search. ]]></description>
            <link>https://www.tigerdata.com/blog/nearest-neighbor-indexes-what-are-ivfflat-indexes-in-pgvector-and-how-do-they-work</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/nearest-neighbor-indexes-what-are-ivfflat-indexes-in-pgvector-and-how-do-they-work</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[pgvector]]></category>
            <category><![CDATA[AI]]></category>
            <dc:creator><![CDATA[Matvey Arye]]></dc:creator>
            <pubDate>Fri, 30 Jun 2023 13:03:10 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2023/06/nearest-neighbor_hero_4.png">
            </media:content>
            <content:encoded><![CDATA[<p>The rising popularity of ChatGPT, OpenAI, and applications of Large Language Models (LLMs) has brought the concept of approximate nearest neighbor search (ANN) to the forefront and sparked a renewed interest in vector databases due to the use of embeddings. <a href="https://platform.openai.com/docs/guides/embeddings">Embeddings</a> are mathematical representations of phrases that capture the semantic meaning as a vector of numerical values. </p><p>What makes this representation fascinating—and useful—is that phrases with similar meanings will have similar vector representations, meaning the distance between their respective vectors will be small. We recently discussed one application of these embeddings, <a href="https://timescale.ghost.io/blog/postgresql-as-a-vector-database-create-store-and-query-openai-embeddings-with-pgvector/">retrieval-augmented generation</a>—augmenting base LLMs with knowledge that it wasn’t trained on—but there are numerous other applications as well.</p><h3 id="semantic-similarity-search">Semantic similarity search</h3><p>One common application of embeddings is precisely <a href="https://en.wikipedia.org/wiki/Semantic_search">semantic similarity search</a>. The basic concept behind this approach is that if I have a knowledge library consisting of various phrases and I receive a question from a user, I can locate the most relevant information in my library by finding the data that is most similar to the user's query. <br></p><p>This is in contrast to lexical or full-text search, which only returns exact matches for the query. 
The remarkable aspect of this technique is that, since the embeddings represent the semantics of the phrase rather than its specific wording, I can find pertinent information even if it is expressed using completely different words!<br></p><h3 id="the-challenge-of-speed-at-scale">The challenge of speed at scale</h3><p>Semantic similarity search involves calculating an embedding for the user's question and then searching through my library to find the K most relevant items related to that question—these are the K items whose embeddings are closest to that of the question. However, when dealing with a large library, it becomes crucial to perform this search efficiently and swiftly. In the realm of vector databases, this problem is referred to as "Finding the k nearest neighbors" (<a href="https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm">KNN</a>).<br></p><p>This post discusses a method to enhance the speed of this search when utilizing PostgreSQL and <a href="https://www.timescale.com/ai">pgvector</a> for storing <a href="https://www.tigerdata.com/blog/a-beginners-guide-to-vector-embeddings" rel="noreferrer">vector embeddings</a>: the <a href="https://github.com/pgvector/pgvector#indexing">Inverted File Flat (IVFFlat)</a> algorithm for approximate nearest neighbor search. We’ll cover why IVFFlat is useful, how it works, and best practices for using it in pgvector for fast similarity search over embeddings vectors. </p><p>Let’s go!</p><p><strong>P.S. 
</strong>If you’re looking for the fastest vector search index on PostgreSQL, <a href="https://timescale.ghost.io/blog/how-we-made-postgresql-as-fast-as-pinecone-for-vector-data/" rel="noreferrer"><u>check out pgvectorscale</u></a>.</p><h2 id="what-are-ivfflat-indexes">What Are IVFFlat Indexes?</h2><p>IVFFlat indexes, short for Inverted File with Flat Compression, are a type of vector index used in PostgreSQL's <a href="https://www.tigerdata.com/learn/postgresql-extensions-pgvector" rel="noreferrer">pgvector extension</a> to speed up similarity searches to find vectors that are close to a given query. This index type uses approximate nearest neighbor search (ANNS) to provide fast searches.&nbsp;&nbsp;</p><p>These indexes work by dividing the vectors into multiple lists, known as clusters. Each cluster represents a region of similar vectors, and an inverted index is built to map each region to its corresponding vectors. When a query comes in, the nearest clusters to the query are identified and only the vectors in those clusters are searched. Thus, this approach significantly reduces the scope of similarity searches by excluding all the vectors that are not in the clusters that are close to the query.</p><p></p><h2 id="why-use-the-ivfflat-index-in-pgvector">Why Use the IVFFlat Index in Pgvector</h2><p>Searching for the k-nearest neighbors is not a novel problem for PostgreSQL. <a href="https://docs.timescale.com/use-timescale/latest/extensions/postgis/">PostGIS</a>, a <a href="https://www.tigerdata.com/blog/top-8-postgresql-extensions" rel="noreferrer">PostgreSQL extension</a> for handling location data, stores its data points as two-dimensional vectors (longitude and latitude). Locating nearby locations is a crucial query in that domain. <br></p><p>PostGIS tackles this challenge by employing an index known as an R-Tree, which yields precise results for k-nearest neighbor queries. 
Similar techniques, such as KD-Trees and Ball Trees, are also employed for this type of search in other databases.<br></p><h3 id="the-curse-of-dimensionality">"The curse of dimensionality"</h3><p>However, there's a catch. These approaches cease to be effective when dealing with data larger than approximately 10 dimensions due to the "curse of dimensionality." Cue the ominous music! Essentially, as you add more dimensions, the available space increases exponentially, resulting in exponentially sparser data. This reduced density renders existing indexing techniques, like the aforementioned R-Tree, KD-Trees, and Ball Trees, which rely on partitioning the space, ineffective. (To learn more, I suggest these two videos: <a href="https://www.youtube.com/watch?v=BbYV8UfMJSA">1</a>, <a href="https://www.youtube.com/watch?v=E1_WCdUAtyE">2</a>). <br></p><p>Given that embeddings often consist of more than a thousand dimensions—OpenAI’s are 1,536—new techniques had to be developed. There are no known exact algorithms for efficiently searching in such high-dimensional spaces. Nevertheless, there are excellent <em>approximate</em> algorithms that fall into the category of approximate nearest neighbor algorithms. 
Numerous such algorithms exist, but in this article, we will delve into the Inverted File Flat or IVFFlat algorithm, which is provided by pgvector.</p><h2 id="how-the-ivfflat-index-works-in-pgvector">How the IVFFlat Index Works in pgvector</h2><p></p><h3 id="how-ivfflat-divides-the-space">How IVFFlat divides the space</h3><p>To gain an intuitive understanding of how IVFFlat works, let's consider a set of vectors represented in a two-dimensional space as the following points:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2023/06/nearest-neighbor-pgvector-diagram---1.png" class="kg-image" alt="A set of vectors represented as points in two dimensions" loading="lazy" width="1640" height="1040" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2023/06/nearest-neighbor-pgvector-diagram---1.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2023/06/nearest-neighbor-pgvector-diagram---1.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2023/06/nearest-neighbor-pgvector-diagram---1.png 1600w, https://timescale.ghost.io/blog/content/images/2023/06/nearest-neighbor-pgvector-diagram---1.png 1640w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">A set of vectors represented as points in two dimensions</em></i></figcaption></figure><p>In the IVFFlat algorithm, the first step involves applying k-means clustering to the vectors to find cluster centroids. In the case of the given vectors, let's assume we perform k-means clustering and identify four clusters with the following centroids.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2023/06/nearest-neighbor-pgvector-diagram---2-1.png" class="kg-image" alt="After k-means clustering, we identify four clusters indicated by the colored triangles." 
loading="lazy" width="1640" height="1040" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2023/06/nearest-neighbor-pgvector-diagram---2-1.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2023/06/nearest-neighbor-pgvector-diagram---2-1.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2023/06/nearest-neighbor-pgvector-diagram---2-1.png 1600w, https://timescale.ghost.io/blog/content/images/2023/06/nearest-neighbor-pgvector-diagram---2-1.png 1640w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">After k-means clustering, we identify four clusters indicated by the colored triangles</em></i></figcaption></figure><p>After computing the centroids, the next step is to assign each vector to its nearest centroid. This is accomplished by calculating the distance between the vector and each centroid and selecting the centroid with the smallest distance as the closest one. This process conceptually maps each point in space to the closest centroid based on proximity.<br></p><p>By establishing this mapping, the space becomes divided into distinct regions surrounding each centroid (technically, this kind of division is called a <a href="https://en.wikipedia.org/wiki/Voronoi_diagram">Voronoi Diagram</a>). Each region represents a cluster of vectors that exhibit similar characteristics or are close in semantic meaning. 
<br></p><p>This division enables efficient organization and retrieval of approximate nearest neighbors during subsequent search operations, as vectors within the same region are likely to be more similar to each other than those in different regions.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2023/06/nearest-neighbor-pgvector-diagram---3-1.png" class="kg-image" alt=" The process of assigning each vector to its closest centroid conceptually divides the space into distinct regions that surround each centroid" loading="lazy" width="1640" height="1040" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2023/06/nearest-neighbor-pgvector-diagram---3-1.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2023/06/nearest-neighbor-pgvector-diagram---3-1.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2023/06/nearest-neighbor-pgvector-diagram---3-1.png 1600w, https://timescale.ghost.io/blog/content/images/2023/06/nearest-neighbor-pgvector-diagram---3-1.png 1640w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The process of assigning each vector to its closest centroid conceptually divides the space into distinct regions that surround each centroid</span></figcaption></figure><h3 id="building-the-ivfflat-index-in-pgvector"><br>Building the IVFFlat index in pgvector</h3><p>IVFFlat proceeds to create an <a href="https://en.wikipedia.org/wiki/Inverted_index">inverted index</a> that maps each centroid to the set of vectors within the corresponding region. In pseudocode, the index can be represented as follows:</p><pre><code>inverted_index = {
  centroid_1: [vector_1, vector_2, ...],
  centroid_2: [vector_3, vector_4, ...],
  centroid_3: [vector_5, vector_6, ...],
  ...
}
</code></pre>
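<p>To make this structure concrete, here is a small, self-contained Python sketch (a toy illustration with names of our choosing, not pgvector's actual implementation) that assigns each vector to its nearest centroid and builds the inverted index. The centroids are assumed to come from a prior k-means step:</p>

```python
def squared_distance(a, b):
    # squared Euclidean distance between two equal-length sequences
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_inverted_index(vectors, centroids):
    """Toy illustration (not pgvector internals): map each centroid's
    index to the list of vector indices that fall in its region."""
    inverted_index = {c: [] for c in range(len(centroids))}
    for i, v in enumerate(vectors):
        # assign the vector to its nearest centroid
        nearest = min(range(len(centroids)),
                      key=lambda c: squared_distance(v, centroids[c]))
        inverted_index[nearest].append(i)
    return inverted_index

# four 2-D vectors forming two obvious clusters, with one centroid each
vectors = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9)]
centroids = [(0.05, 0.1), (5.05, 4.95)]
print(build_inverted_index(vectors, centroids))  # {0: [0, 1], 1: [2, 3]}
```

<p>The real index stores the vectors themselves (the "flat" part of IVFFlat), but the mapping from centroid to region is the same idea.</p>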
<p>Here, each centroid serves as a key in the inverted index, and the corresponding value is a list of vectors that belong to the region associated with that centroid. This index structure allows for efficient retrieval of vectors in a region when performing similarity searches.</p><h3 id="searching-the-ivfflat-index-in-pgvector">Searching the IVFFlat index in pgvector</h3><p>Let's imagine we have a query for the nearest neighbors to a vector represented by a question mark, as shown below:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2023/06/nearest-neighbor-pgvector-diagram---4-1.png" class="kg-image" alt="We want to find nearest neighbors to the vector represented by the question mark" loading="lazy" width="1640" height="1040" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2023/06/nearest-neighbor-pgvector-diagram---4-1.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2023/06/nearest-neighbor-pgvector-diagram---4-1.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2023/06/nearest-neighbor-pgvector-diagram---4-1.png 1600w, https://timescale.ghost.io/blog/content/images/2023/06/nearest-neighbor-pgvector-diagram---4-1.png 1640w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">We want to find nearest neighbors to the vector represented by the question mark</span></figcaption></figure><p>To find the approximate nearest neighbors using IVFFlat, the algorithm operates under the assumption that the nearest vectors will be located in the same region as the query vector. 
Based on this assumption, IVFFlat employs the following steps:</p><ol><li>Calculate the distance between the query vector (red question mark) and each centroid in the index.</li><li>Select the centroid with the smallest distance as the closest centroid to the query (the blue centroid in this example).</li><li>Retrieve the vectors associated with the region corresponding to the closest centroid from the inverted index.</li><li>Compute the distances between the query vector and each of the vectors in the retrieved set.</li><li>Select the K vectors with the smallest distances as the approximate nearest neighbors to the query.<br></li></ol><p>The use of the index in IVFFlat accelerates the search process by restricting the search to the region associated with the closest centroid. This results in a significant reduction in the number of vectors that need to be examined during the search. Specifically, if we have C clusters (centroids), on average, we can reduce the number of vectors to search by a factor of 1/C.</p><h3 id="searching-at-the-edge">Searching at the edge</h3><p>The assumption that the nearest vectors will be found in the same region as the query vector can introduce recall errors in IVFFlat. 
Consider the following query:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2023/06/nearest-neighbor-pgvector-diagram---5-1.png" class="kg-image" alt=" ivfflat can sometimes make errors when searching for nearest neighbors to a point at the edge of two regions of the vector space" loading="lazy" width="1640" height="1040" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2023/06/nearest-neighbor-pgvector-diagram---5-1.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2023/06/nearest-neighbor-pgvector-diagram---5-1.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2023/06/nearest-neighbor-pgvector-diagram---5-1.png 1600w, https://timescale.ghost.io/blog/content/images/2023/06/nearest-neighbor-pgvector-diagram---5-1.png 1640w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">IVFFlat</span><i><em class="italic" style="white-space: pre-wrap;"> can sometimes make errors when searching for nearest neighbors to a point at the edge of two regions of the vector space</em></i></figcaption></figure><p>From visual inspection, it becomes apparent that one of the light-blue vectors is closer to the query vector than any of the dark-blue vectors, despite the query vector falling within the dark-blue region. This illustrates a potential error in assuming that the nearest vectors will always be found within the same region as the query vector.<br></p><p>To mitigate this type of error, one approach is to search not only the region of the closest centroid but also the regions of the next closest R centroids. This approach expands the search scope and improves the chances of finding the true nearest neighbors. 
<br></p><p>In pgvector, this functionality is implemented through the `probes` parameter, which specifies the number of centroids to consider during the search, as described below.</p><h2 id="parameters-for-pgvector%E2%80%99s-ivfflat-implementation"><br>Parameters for Pgvector’s IVFFlat Implementation</h2><p>In the implementation of IVFFlat in pgvector, two key parameters are exposed: lists and probes.</p><h3 id="lists-parameter-in-pgvector">Lists parameter in pgvector</h3><p>The <code>lists</code> parameter determines the number of clusters created during index building (It’s called lists because each centroid has a list of vectors in its region). Increasing this parameter reduces the number of vectors in each list and results in smaller regions.<br></p><p>It offers the following trade-offs to consider:</p><ul><li>Higher <code>lists</code> value speeds up queries by reducing the search space during query time.</li><li>However, it also decreases the region size, which can lead to more recall errors by excluding some points.</li><li>Additionally, more distance comparisons are required to find the closest centroid during step one of the query process.<br></li></ul><p>Here are some recommendations for setting the <code>lists</code> parameter:</p><ul><li>For datasets with less than one million rows, use <code>lists =  rows / 1000</code>.</li><li>For datasets with more than one million rows, use <code>lists = sqrt(rows)</code>.</li><li>It is generally advisable to have at least 10 clusters.</li></ul><h3 id="probes-parameter-in-pgvector">Probes parameter in pgvector</h3><p>The probes parameter is a query-time parameter that determines the number of regions to consider during a query. By default, only the region corresponding to the closest centroid is searched. By increasing the probes parameter, more regions can be searched to improve recall at the cost of query speed. 
<br></p><p>The recommended value for the probes parameter is <code>probes = sqrt(lists)</code>.</p><h2 id="using-ivfflat-in-pgvector">Using IVFFlat in Pgvector</h2><p></p><h3 id="creating-an-index">Creating an index<br></h3><p>When creating an index, it is advisable to have existing data in the table, as it will be utilized by k-means to derive the centroids of the clusters.<br></p><p>The index in pgvector offers three different methods to calculate the distance between vectors: L2, inner product, and cosine. It is essential to select the same method for both the index creation and query operations. The following table illustrates the query operators and their corresponding index methods:</p>
<!--kg-card-begin: html-->
<table style="border:none;border-collapse:collapse;"><colgroup><col width="208"><col width="209"><col width="207"></colgroup><tbody><tr style="height:0pt"><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Distance type</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Query operator</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Index method</span></p></td></tr><tr style="height:0pt"><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 
5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">L2 / Euclidean</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">&lt;-&gt;</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">vector_l2_ops</span></p></td></tr><tr style="height:0pt"><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span 
style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Negative Inner product</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">&lt;#&gt;</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">vector_ip_ops</span></p></td></tr><tr style="height:0pt"><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Cosine</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 
1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">&lt;=&gt;</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">vector_cosine_ops</span></p></td></tr></tbody></table>
<!--kg-card-end: html-->
<p><strong>Note</strong>: OpenAI <a href="https://platform.openai.com/docs/guides/embeddings/limitations-risks">recommends</a> cosine distance for its embeddings.</p><p>To create an index in pgvector using IVFFlat, you can use a statement of the following form:</p><pre><code class="language-SQL">CREATE INDEX ON &lt;table name&gt; USING ivfflat (&lt;column name&gt; &lt;index method&gt;) WITH (lists = &lt;lists parameter&gt;);
</code></pre>
<p>Replace <code>&lt;table name&gt;</code> with the name of your table and <code>&lt;column name&gt;</code> with the name of the column that contains the vector type.</p><p>For example, if our table is named <code>embeddings</code> and our embedding vectors are in a column named <code>embedding</code>, we can create an IVFFlat index as follows:</p><pre><code class="language-SQL">CREATE INDEX ON embeddings USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);

</code></pre>
<p>Here’s a simple Python function that you can use to create an IVFFlat index with the correct <code>lists</code> parameter as discussed above:</p><pre><code class="language-Python">import math

def create_ivfflat_index(conn, table_name, column_name, query_operator="&lt;=&gt;"):
    index_method = "invalid"
    if query_operator == "&lt;-&gt;":
        index_method = "vector_l2_ops"
    elif query_operator == "&lt;#&gt;":
        index_method = "vector_ip_ops"
    elif query_operator == "&lt;=&gt;":
        index_method = "vector_cosine_ops"
    else:
        raise ValueError(f"unrecognized operator {query_operator}")

    with conn.cursor() as cur:
        cur.execute(f"SELECT COUNT(*) as cnt FROM {table_name};")
        num_records = cur.fetchone()[0]

        # heuristics: rows / 1000 (minimum 10) below one million rows, sqrt(rows) above
        num_lists = max(num_records // 1000, 10)
        if num_records &gt; 1000000:
            num_lists = int(math.sqrt(num_records))

        cur.execute(f'CREATE INDEX ON {table_name} USING ivfflat ({column_name} {index_method}) WITH (lists = {num_lists});')
        conn.commit()
</code></pre>
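<p>The same sizing heuristics can also be captured as standalone helpers, which is handy when you create indexes by hand. This is a sketch of the recommendations above; the function names are our own:</p>

```python
import math

def recommended_lists(num_rows):
    """lists = rows / 1000 (minimum 10) below one million rows, sqrt(rows) above."""
    if num_rows > 1_000_000:
        return int(math.sqrt(num_rows))
    return max(num_rows // 1000, 10)

def recommended_probes(lists):
    """probes = sqrt(lists), the suggested query-time starting point."""
    return max(int(math.sqrt(lists)), 1)

print(recommended_lists(500_000))    # 500
print(recommended_lists(4_000_000))  # 2000
print(recommended_probes(100))       # 10
```

<p>Note that <code>lists</code> is fixed at index-build time, while <code>probes</code> is a query-time setting you can tune per session, as described in the Querying section.</p>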
<h3 id="querying">Querying</h3><p>An index can be used whenever there is an ORDER BY of the form <code>column &lt;query operator&gt; &lt;some pseudo-constant vector&gt;</code> along with a <code>LIMIT k</code> clause.<br></p><p><strong>Some examples</strong><br><br>Get the closest two vectors to a constant vector:</p><pre><code class="language-SQL">SELECT * FROM my_table ORDER BY embedding_column &lt;=&gt; '[1,2]' LIMIT 2;
</code></pre>
<p>This is a common usage pattern in retrieval-augmented generation using LLMs, where we find the embedding vectors that are closest in semantic meaning to the user’s query. In that case, the constant vector would be the embedding vector representing the user’s query. </p><p>You can see an example of this in our guide to <a href="https://timescale.ghost.io/blog/postgresql-as-a-vector-database-create-store-and-query-openai-embeddings-with-pgvector/">creating, storing, and querying OpenAI embeddings with pgvector</a>, where we use this Python function to find the three most similar documents to a given user query from our database:</p><pre><code class="language-Python">import numpy as np
from pgvector.psycopg2 import register_vector

# Helper function: Get top 3 most similar documents from the database
def get_top3_similar_docs(query_embedding, conn):
    embedding_array = np.array(query_embedding)
    # Register pgvector extension
    register_vector(conn)
    cur = conn.cursor()
    # Get the top 3 most similar documents using the KNN &lt;=&gt; operator
    cur.execute("SELECT content FROM embeddings ORDER BY embedding &lt;=&gt; %s LIMIT 3", (embedding_array,))
    top3_docs = cur.fetchall()
    return top3_docs
</code></pre>
<p>Get the closest vector to some row:</p><pre><code class="language-SQL">SELECT * FROM my_table WHERE id != 1 ORDER BY embedding_column &lt;=&gt; (SELECT embedding_column FROM my_table WHERE id = 1) LIMIT 2;
</code></pre>
<p><strong>Tip:</strong> PostgreSQL's ability to use an index does not guarantee its usage! The cost-based planner evaluates query plans and may determine that a sequential scan or a different index is more efficient for a specific query. You can use the EXPLAIN command to see the chosen execution plan. To test the viability of using an index, you can modify planner costing parameters until you achieve the desired plan. For small datasets, setting <code>enable_seqscan = 0</code> can be especially advantageous for testing viability as it avoids sequential scans.<br></p><p>To adjust the probes parameter, you can set the <code>ivfflat.probes</code> variable. For instance, to set it to '5', execute the following statement before running the query:</p><pre><code class="language-sql">SET ivfflat.probes = 5;
</code></pre>
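<p>Putting the pieces together, the probe-expanded search can be sketched in plain Python. This is a toy model with names of our choosing, not pgvector's implementation: rank the centroids, keep the <code>probes</code> closest regions, and then rank only the vectors in those regions:</p>

```python
import heapq

def squared_distance(a, b):
    # squared Euclidean distance between two equal-length sequences
    return sum((x - y) ** 2 for x, y in zip(a, b))

def ivfflat_search(query, centroids, inverted_index, vectors, k=2, probes=1):
    """Toy IVFFlat-style search: examine only the regions of the `probes`
    closest centroids, then return the indices of the k nearest vectors."""
    # steps 1-2: rank centroids by distance to the query, keep `probes` of them
    closest_regions = heapq.nsmallest(
        probes, range(len(centroids)),
        key=lambda c: squared_distance(query, centroids[c]))
    # step 3: gather candidate vectors from those regions via the inverted index
    candidates = [i for c in closest_regions for i in inverted_index[c]]
    # steps 4-5: rank candidates by distance and keep the k closest
    return heapq.nsmallest(
        k, candidates, key=lambda i: squared_distance(query, vectors[i]))

vectors = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9)]
centroids = [(0.05, 0.1), (5.05, 4.95)]
inverted_index = {0: [0, 1], 1: [2, 3]}
print(ivfflat_search((4.9, 5.1), centroids, inverted_index, vectors, k=1))  # [2]
```

<p>With <code>probes=1</code> only the blue-region analogue is scanned; raising <code>probes</code> widens the candidate set, mirroring the recall/speed trade-off of <code>ivfflat.probes</code>.</p>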
<h3 id="dealing-with-data-changes">Dealing with data changes</h3><p>As your data evolves with inserts, updates, and deletes, the IVFFlat index will be updated accordingly. New vectors will be added to the index, while no longer-used vectors will be removed. </p><p><strong>However, the clustering centroids will not be updated</strong>. Over time, this can result in a situation where the initial clustering, established during index creation, no longer accurately represents the data. This can be visualized as follows:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2023/06/nearest-neighbor-pgvector-diagram---6.png" class="kg-image" alt="As data gets inserted or deleted from the index, if the index is not rebuilt, the ivfflat index in pgvector can return incorrect approximate nearest neighbors due to clustering centroids no longer fitting the data well" loading="lazy" width="1640" height="1040" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2023/06/nearest-neighbor-pgvector-diagram---6.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2023/06/nearest-neighbor-pgvector-diagram---6.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2023/06/nearest-neighbor-pgvector-diagram---6.png 1600w, https://timescale.ghost.io/blog/content/images/2023/06/nearest-neighbor-pgvector-diagram---6.png 1640w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">As data gets inserted or deleted from the index, if the index is not rebuilt, the IVFFlat index in pgvector can return incorrect approximate nearest neighbors due to clustering centroids no longer fitting the data well</span></figcaption></figure><p>To address this issue, the only solution is to rebuild the index.<br></p><p>Here are two important takeaways from this issue:</p><ul><li>Build the index once you have all the representative data you want to reference in it. 
This is unlike most indexes, which can be built on an empty table.</li><li>It is advisable to periodically rebuild the index.<br></li></ul><p>When rebuilding the index, it is highly recommended to use the <code>CONCURRENTLY</code> option to avoid interfering with ongoing operations.<br></p><p>Thus, to rebuild the index, run the following (e.g., from a cron job):</p><pre><code class="language-SQL">REINDEX INDEX CONCURRENTLY &lt;index name&gt;;
</code></pre>
<h2 id="summing-it-up">Summing It Up</h2><p>The IVFFlat algorithm in pgvector provides an efficient solution for approximate nearest neighbor search over high-dimensional data like embeddings. It works by clustering similar vectors into regions and building an inverted index to map each region to its vectors. This allows queries to focus on a subset of the data, enabling fast search. By tuning the lists and probes parameters, IVFFlat can balance speed and accuracy for a dataset.  </p><p></p><p>Overall, IVFFlat gives PostgreSQL the ability to perform fast semantic similarity search over complex data. With simple queries, applications can find the nearest neighbors to a query vector among millions of high-dimensional vectors. For natural language processing, information retrieval, and more, IVFFlat is a compelling solution. By understanding how IVFFlat divides the vector space into regions and builds its inverted index, you can optimize its performance for your needs and build powerful applications on top of it.</p><p>✨<strong>Resources for further learning:</strong> Now that you know more about the IVFFlat index in pgvector, here are some resources to further your learning journey:&nbsp;</p><ul><li>Learn about other PostgreSQL indexes for vector search, like <a href="https://timescale.ghost.io/blog/vector-database-basics-hnsw/" rel="noreferrer">HNSW</a>.</li><li>Learn how we made <a href="https://timescale.ghost.io/blog/how-we-made-postgresql-as-fast-as-pinecone-for-vector-data/" rel="noreferrer">PostgreSQL as fast as Pinecone for vector data</a>.</li><li>Follow our tutorial on <a href="https://timescale.ghost.io/blog/postgresql-as-a-vector-database-create-store-and-query-openai-embeddings-with-pgvector/" rel="noreferrer">creating, storing, and querying OpenAI embeddings using PostgreSQL as a vector database</a>. 
</li><li><a href="https://timescale.ghost.io/blog/how-to-build-llm-applications-with-pgvector-vector-store-in-langchain/" rel="noreferrer">Learn how to use pgvector as a vector store in LangChain</a>.</li><li><a href="https://timescale.ghost.io/blog/refining-vector-search-queries-with-time-filters-in-pgvector-a-tutorial/" rel="noreferrer">See how you can refine vector search queries using time filters in pgvector with a single SQL query</a>.</li></ul><p>And if you’re looking for a production-ready PostgreSQL database for your AI application’s vector, relational, and time-series data, <a href="https://www.timescale.com/ai" rel="noreferrer"><u>try Timescale Cloud</u></a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[PostgreSQL as a Vector Database: A Pgvector Tutorial]]></title>
            <description><![CDATA[Vector databases add organizational intelligence to AI. Learn how to use PostgreSQL as a vector database for retrieval-augmented generation with pgvector.]]></description>
            <link>https://www.tigerdata.com/blog/postgresql-as-a-vector-database-using-pgvector</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/postgresql-as-a-vector-database-using-pgvector</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[pgvector]]></category>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[OpenAI]]></category>
            <dc:creator><![CDATA[Avthar Sewrathan]]></dc:creator>
            <pubDate>Wed, 21 Jun 2023 18:22:10 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2023/06/Postgres-vector-database-and-OpenAI-embeddings-blog--1-.png">
            </media:content>
            <content:encoded><![CDATA[<p>Vector databases enable efficient storage and searching of vector data. They are essential for developing and maintaining AI applications using large language models (LLMs).</p><p>With some help from the <a href="https://www.tigerdata.com/learn/postgresql-extensions-pgvector" rel="noreferrer"><u>pgvector extension</u></a>, you can leverage PostgreSQL as a vector database to store and query<a href="https://platform.openai.com/docs/guides/embeddings/what-are-embeddings?ref=timescale.com"> <u>OpenAI embeddings</u></a>. OpenAI embeddings are a type of data representation (in the shape of vectors, i.e., lists of numbers) used to measure the similarity of text strings for OpenAI’s models.</p><p>In this article, we work through the example of creating a chatbot to answer questions about Tiger Data (creators of TimescaleDB). The chatbot will be trained on content from the <a href="https://timescale.ghost.io/blog/tag/dev-q-a/"><u>Tiger Data Developer Q&amp;A blog posts</u></a>. 
This example will illustrate the key concepts for creating, storing, and querying OpenAI embeddings with PostgreSQL and pgvector.</p><p>This example has three parts:</p><ul><li>Part 1: How to create embeddings from content using the<a href="https://platform.openai.com/docs/api-reference?ref=timescale.com"> <u>OpenAI API</u></a>.</li><li>Part 2: How to use PostgreSQL as a vector database and store OpenAI embedding vectors using pgvector.</li><li>Part 3: How to use embeddings retrieved from a vector database to augment LLM generation.</li></ul><p>One could think of this as a “hello world” tutorial for building a chatbot that can reference a company knowledge base or developer docs.</p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">✨</div><div class="kg-callout-text"><b><strong style="white-space: pre-wrap;">Jupyter Notebook and Code:</strong></b> You can find all the code used in this tutorial in a Jupyter Notebook, as well as sample content and embeddings on the Tiger Data GitHub: <a href="https://github.com/timescale/vector-cookbook/tree/main/openai_pgvector_helloworld">timescale/vector-cookbook</a>. We recommend cloning the repo and following along by executing the code cells as you read through the tutorial.</div></div><h2 id="the-big-picture-openai-embeddings">The Big Picture: OpenAI Embeddings</h2><p>Foundational models of AI (e.g., GPT-3 or GPT-4) may be missing some information needed to provide accurate answers to certain specific questions. That’s because relevant information was not in the dataset used to train the model. (For example, the information is stored in private documents or only became available recently.) 
This lack of data may make these models unsuitable for use as a chatbot in specific information banks.</p><p><a href="https://www.promptingguide.ai/techniques/rag?ref=timescale.com"><u>Retrieval-augmented generation</u></a> (RAG) gives a simple solution; it provides additional context to the foundational model in the prompt. </p><p>This technique is powerful—it allows you to “teach” foundational models about things only you know about and use that to create a ChatGPT++ experience for your users!</p><p>But what context should you provide to the model? If you have a library of information, how can you determine what’s relevant to a given question? That is what <a href="https://www.timescale.com/blog/a-beginners-guide-to-vector-embeddings" rel="noreferrer">embeddings</a> are for. <a href="https://platform.openai.com/docs/guides/embeddings/what-are-embeddings?ref=timescale.com"><u>OpenAI embeddings</u></a> are a mathematical representation of the semantic meaning of a piece of text that allows for <em>similarity search</em>.</p><p>With this representation, when you get a user question and calculate its embedding, you can use a similarity search against data embeddings in your library to find the most relevant information. But that requires having an embedding representation of your library.&nbsp;&nbsp;</p><h3 id="what-is-a-vector-database">What is a vector database?</h3><p>A <a href="https://www.timescale.com/blog/how-to-choose-a-vector-database"><u>vector database</u></a> is a database that can handle vector data. Vector databases are useful for:</p><ul><li><a href="https://www.tigerdata.com/learn/vector-search-vs-semantic-search" rel="noreferrer"><strong>Semantic search</strong></a><strong>:</strong> Vector databases facilitate semantic search, which considers the context or meaning of search terms rather than just exact matches. 
They are useful for recommendation systems, content discovery, and question-answering systems.</li><li><strong>Efficient similarity search:</strong> Vector databases are designed for efficient high-dimensional nearest neighbor search, a task where traditional relational databases struggle.</li><li><strong>Machine learning:</strong> Vector databases store and search embeddings created by machine-learning models. This feature aids in finding items semantically similar to a given item.</li><li><strong>Multimedia data handling:</strong> Vector databases also excel in working with multimedia data (images, audio, video) by converting them into high-dimensional vectors for efficient similarity search.</li><li><strong>NLP and data combination:</strong> In natural language processing (NLP), vector databases store high-dimensional vectors representing words, sentences, or documents. They also allow a combination of traditional SQL queries with similarity searches, accommodating both structured and unstructured data.</li></ul><p>We’ll use PostgreSQL with the <a href="https://github.com/pgvector/pgvector"><u>pgvector extension</u></a> installed as our vector database. Pgvector extends PostgreSQL to handle vector data types and vector similarity search, like <a href="https://en.wikipedia.org/wiki/Nearest_neighbor_search"><u>nearest neighbor search</u></a>, which we’ll use to find the k most related embeddings in our database for a given user prompt.</p><h2 id="using-pgvector-for-a-postgresql-vector-database">Using Pgvector for a PostgreSQL Vector Database</h2><p><a href="https://www.timescale.com/learn/using-pgvector-with-python"><u>Pgvector</u></a> is an open-source extension for PostgreSQL that enables storing and searching over machine learning-generated embeddings. It provides different capabilities that allow users to identify exact and approximate nearest neighbors. 
Pgvector is designed to work seamlessly with other PostgreSQL features, including indexing and querying.</p><p>Now we’re ready to start building our chatbot!</p><h3 id="why-use-pgvector-as-a-vector-database">Why use pgvector as a vector database?</h3><p>Here are five reasons <a href="https://www.tigerdata.com/blog/postgres-for-everything" rel="noreferrer">why <u>PostgreSQL</u></a> is a good choice for storing and handling vector data:</p><ul><li><strong>Integrated solution:</strong> By using PostgreSQL as a vector database, you keep your data in one place. This can simplify your architecture by reducing the need for multiple databases or additional services.</li><li><strong>Enterprise-level robustness and operations:</strong> With a 30-year pedigree, PostgreSQL provides world-class data integrity, operations, and robustness. This includes backups, streaming replication, role-based and row-level security, and ACID compliance.</li><li><strong>Full-featured SQL:</strong> PostgreSQL supports a rich set of SQL features, including joins, subqueries, window functions, and more. This allows for powerful and complex queries that can include both traditional relational data and vector data. It also integrates with a plethora of existing data science and data analysis tools.</li><li><strong>Scalability and performance:</strong> PostgreSQL is known for its robustness and ability to handle large datasets. Using it as a vector database allows you to leverage these characteristics for vector data as well.</li><li><strong>Open source:</strong> PostgreSQL is open source, which means it's free to download and use, and you can modify it to suit your needs. It also means that it benefits from the collective input of developers all over the world, which often results in high-quality, secure, and up-to-date software. PostgreSQL has a large and active community, so help is readily available. 
There are many resources, including documentation, tutorials, and forums, to help you troubleshoot and optimize your PostgreSQL database.</li></ul><h2 id="setting">Setting Up Your Environment</h2><ul><li>Install Python.</li><li>Install and configure a Python virtual environment. We recommend <a href="https://github.com/pyenv/pyenv">Pyenv</a>.</li><li>Install the requirements for this notebook using the following command:</li></ul><pre><code class="language-Python">pip install -r requirements.txt
</code></pre>
<p>Import all the packages we will be using:</p><pre><code class="language-Python">import openai
import os
import pandas as pd
import numpy as np
import json
import tiktoken
import psycopg2
import ast
import pgvector
import math
from psycopg2.extras import execute_values
from pgvector.psycopg2 import register_vector
</code></pre>
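<p>The <code>register_vector</code> adapter imported above translates between Python lists (or NumPy arrays) and pgvector’s text representation, which is simply a bracketed list of floats. A minimal sketch of that format (our own illustrative helper, no live database required):</p>

```python
# pgvector sends and accepts vectors as a bracketed text literal,
# e.g. '[0.1,0.2,0.3]'. register_vector() handles this translation
# automatically; the helper below only illustrates the format.

def to_vector_literal(values):
    """Format a sequence of floats as a pgvector text literal."""
    return "[" + ",".join(str(float(v)) for v in values) + "]"

print(to_vector_literal([0.1, 0.2, 0.3]))  # prints [0.1,0.2,0.3]
```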
<p>You’ll need to <a href="https://platform.openai.com/overview">sign up for an OpenAI Developer Account</a> and create an OpenAI API key. We recommend upgrading to a paid account to avoid rate limiting, and setting a spending cap so there are no billing surprises.</p><p>Once you have an OpenAI API key, it’s a <a href="https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety">best practice</a> to store it as an environment variable and have your Python program read it:</p><pre><code class="language-Python"># First, run: export OPENAI_API_KEY=sk-YOUR_OPENAI_API_KEY...


# Get openAI api key by reading local .env file
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) 
openai.api_key  = os.environ['OPENAI_API_KEY'] 
</code></pre>
<h2 id="part-1-create-embeddings-for-your-postgresql-vector-database">Part 1: Create Embeddings for Your PostgreSQL Vector Database</h2><p><a href="https://platform.openai.com/docs/guides/embeddings/what-are-embeddings">Embeddings</a> measure how related text strings are. First, we'll create embeddings using the OpenAI API on some text we want the LLM to answer questions on.</p><p>In this example, we'll use content from the Tiger Data blog, specifically from the <a href="https://timescale.ghost.io/blog/tag/dev-q-a/">Developer Q&amp;A section</a>, which features posts by Tiger Data users talking about their real-world use cases.</p><p>You can replace this blog data with any text you want to embed, such as your own company blog, developer documentation, internal knowledge base, or any other information you’d like to have a “ChatGPT-like” experience over.</p><pre><code class="language-Python"># Load your CSV file into a pandas DataFrame
df = pd.read_csv('blog_posts_data.csv')
df.head()
</code></pre>
<p>The output looks like this:</p>
<!--kg-card-begin: html-->
<table class="tg">
<thead>
  <tr>
    <th class="tg-0pky"></th>
    <th class="tg-0pky">Title</th>
    <th class="tg-0pky">Content</th>
    <th class="tg-0lax">URL</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-0pky">0</td>
    <td class="tg-0pky">How to Build a Weather Station With Elixir, Ne...</td>
    <td class="tg-0pky">This is an installment of our “Community Membe...</td>
    <td class="tg-0lax">https://www.timescale.com/blog/how-to-build-a-...</td>
  </tr>
  <tr>
    <td class="tg-0pky">1</td>
    <td class="tg-0pky">CloudQuery on Using PostgreSQL for Cloud Asset...</td>
    <td class="tg-0pky">This is an installment of our “Community Membe...</td>
    <td class="tg-0lax">https://www.timescale.com/blog/cloudquery-on-u...</td>
  </tr>
  <tr>
    <td class="tg-0pky">2</td>
    <td class="tg-0pky">How a Data Scientist Is Building a Time-Series...</td>
    <td class="tg-0pky">This is an installment of our “Community Membe...</td>
    <td class="tg-0lax">https://www.timescale.com/blog/how-a-data-scie...</td>
  </tr>
  <tr>
    <td class="tg-0pky">3</td>
    <td class="tg-0pky">How Conserv Safeguards History: Building an En...</td>
    <td class="tg-0pky">This is an installment of our “Community Membe...</td>
    <td class="tg-0lax">https://www.timescale.com/blog/how-conserv-saf...</td>
  </tr>
  <tr>
    <td class="tg-0pky">4</td>
    <td class="tg-0pky">How Messari Uses Data to Open the Cryptoeconom...</td>
    <td class="tg-0pky">This is an installment of our “Community Membe...</td>
    <td class="tg-0lax">https://www.timescale.com/blog/how-messari-use...</td>
  </tr>
</tbody>
</table>
<!--kg-card-end: html-->
<h3 id="11-calculate-the-cost-of-embedding-data">1.1 Calculate the cost of embedding data</h3><p>It's a good idea to estimate what creating embeddings for your selected content will cost before you start. The helper functions below compute that estimate so there are no surprises.</p><p>OpenAI charges on a per-token basis for embeddings. The total cost for the blog posts we want to embed will be less than $0.01, thanks to OpenAI’s small text embedding model, <a href="https://openai.com/index/new-embedding-models-and-api-updates/"><u>text-embedding-3-small</u></a>, which offers stronger performance at one-fifth the cost of its predecessor, <a href="https://openai.com/blog/new-and-improved-embedding-model?ref=timescale.com"><u>text-embedding-ada-002</u></a>.</p><pre><code class="language-Python"># Helper functions to estimate the cost of creating embeddings

# Helper func: calculate number of tokens
def num_tokens_from_string(string: str, encoding_name: str = "cl100k_base") -&gt; int:
    """Return the number of tokens in a text string."""
    if not string:
        return 0
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(string))

# Helper function: calculate length of essay
def get_essay_length(essay):
    word_list = essay.split()
    num_words = len(word_list)
    return num_words

# Helper function: calculate cost of embedding num_tokens
# Assumes the text-embedding-3-small model ($0.02 per 1M tokens
# at the time of writing -- see https://openai.com/pricing)
def get_embedding_cost(num_tokens):
    return num_tokens / 1_000_000 * 0.02

# Helper function: calculate total cost of embedding all content in the dataframe
def get_total_embeddings_cost():
    total_tokens = 0
    for i in range(len(df.index)):
        text = df['content'][i]
        token_len = num_tokens_from_string(text)
        total_tokens = total_tokens + token_len
    total_cost = get_embedding_cost(total_tokens)
    return total_cost

</code></pre>
<pre><code class="language-Python"># quick check on total token amount for price estimation
total_cost = get_total_embeddings_cost()
print("estimated price to embed this content = $" + str(total_cost))

</code></pre>
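<p>As a quick sanity check on the formula, here’s a back-of-the-envelope calculation. The rate below assumes text-embedding-3-small’s price of $0.02 per million tokens at the time of writing; check OpenAI’s pricing page for current rates.</p>

```python
# Back-of-the-envelope embedding cost. Assumes $0.02 per 1M tokens
# (text-embedding-3-small at the time of writing); check
# https://openai.com/pricing for current rates.

PRICE_PER_MILLION_TOKENS = 0.02

def embedding_cost(num_tokens: int) -> float:
    return num_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS

# e.g., ~100 blog posts at ~1,500 tokens each:
print(f"${embedding_cost(100 * 1500):.4f}")  # prints $0.0030
```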
<h3 id="12-create-smaller-chunks-of-content">1.2 Create smaller chunks of content</h3><p>The OpenAI API has a maximum token <a href="https://platform.openai.com/docs/guides/embeddings/what-are-embeddings"><u>limit</u></a> that it can create an embedding for in a single request: 8,191 to be specific. To get around this limit, we'll break up our text into smaller chunks. Generally, it's a best practice to “chunk” the documents you want to create embeddings into groups of a fixed token size.</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2024/10/Making-PostgreSQL-a-Vector-Database-pgvector-tutorial_embedding-models.png" class="kg-image" alt="A table with the performance eval of the OpenAI embedding models" loading="lazy" width="1748" height="546" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2024/10/Making-PostgreSQL-a-Vector-Database-pgvector-tutorial_embedding-models.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2024/10/Making-PostgreSQL-a-Vector-Database-pgvector-tutorial_embedding-models.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2024/10/Making-PostgreSQL-a-Vector-Database-pgvector-tutorial_embedding-models.png 1600w, https://timescale.ghost.io/blog/content/images/2024/10/Making-PostgreSQL-a-Vector-Database-pgvector-tutorial_embedding-models.png 1748w" sizes="(min-width: 720px) 720px"></figure><p>The precise number of tokens to include in a chunk depends on your use case and your model’s context window—the number of input tokens it can handle in a prompt.</p><p>For our purposes, we'll aim for chunks of around 512 tokens each. Chunking text up is a complex topic worthy of its own blog post. We’ll illustrate a simple method we found to work well below. 
If you want to read about other approaches, we recommend <a href="https://python.langchain.com/docs/how_to/#text-splitters"><u>this section</u></a> of the LangChain docs.</p><p><strong>Note:</strong> If you prefer to skip this step, you can use the provided file, <a href="https://github.com/timescale/vector-cookbook/tree/main/openai_pgvector_helloworld">blog_data_and_embeddings.csv</a>, which contains the data and embeddings you’ll generate in this step.</p><p>The code below creates a new list of our blog content chunks while retaining metadata such as the blog title and URL each chunk came from.</p><pre><code class="language-Python">
# Create new list with small content chunks to not hit max token limits
# Note: the maximum number of tokens for a single request is 8191
# https://platform.openai.com/docs/guides/embeddings/embedding-models

# list for chunked content and embeddings
new_list = []
# Split up the text into token sizes of around 512 tokens
for i in range(len(df.index)):
    text = df['content'][i]
    token_len = num_tokens_from_string(text)
    if token_len &lt;= 512:
        new_list.append([df['title'][i], df['content'][i], df['url'][i], token_len])
    else:
        # add content to the new list in chunks
        start = 0
        ideal_token_size = 512
        # 1 token ~ 3/4 of a word
        ideal_size = int(ideal_token_size // (4/3))
        end = ideal_size
        #split text by spaces into words
        words = text.split()


        total_words = len(words)
        
        #calculate iterations
        chunks = total_words // ideal_size
        if total_words % ideal_size != 0:
            chunks += 1
        
        new_content = []
        for j in range(chunks):
            if end &gt; total_words:
                end = total_words
            new_content = words[start:end]
            new_content_string = ' '.join(new_content)
            new_content_token_len = num_tokens_from_string(new_content_string)
            if new_content_token_len &gt; 0:
                new_list.append([df['title'][i], new_content_string, df['url'][i], new_content_token_len])
            start += ideal_size
            end += ideal_size

</code></pre>
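<p>The chunking loop above can be distilled into a small standalone helper. This sketch (names are our own) uses the same rule of thumb that one token is roughly three-quarters of a word, so a 512-token budget maps to about 384 words per chunk:</p>

```python
# Word-based chunker: approximates a token budget by word count,
# using the rule of thumb that 1 token ~ 3/4 of a word.

def chunk_by_words(text, max_tokens=512):
    words = text.split()
    words_per_chunk = max_tokens * 3 // 4  # 512 tokens ~ 384 words
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]

chunks = chunk_by_words("lorem ipsum " * 1000)  # 2,000 words of input
print(len(chunks))  # prints 6 (five 384-word chunks plus an 80-word remainder)
```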
<p>Now that our text is split into appropriately sized chunks, we can create an embedding for each chunk using the OpenAI API.</p><p>We’ll use this helper function to create embeddings for a piece of text:</p><pre><code class="language-Python">openai_client = openai.OpenAI()

# Helper function: get embeddings for a text
def get_embeddings(text):
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input = text.replace("\n"," ")
    )
    return response.data[0].embedding</code></pre>
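<p>Since we’ll call this once per chunk, it can be worth guarding against transient failures such as rate limits. Here’s a simple, generic retry-with-backoff wrapper (our own sketch, not part of the OpenAI SDK; production code should catch the SDK’s specific exceptions such as <code>openai.RateLimitError</code> rather than a bare <code>Exception</code>):</p>

```python
import time

# Generic retry wrapper with exponential backoff. This is our own
# sketch, not part of the OpenAI SDK; in real code, catch specific
# transient errors (e.g. openai.RateLimitError) instead of Exception.

def with_retries(fn, max_attempts=5, base_delay=1.0):
    def wrapped(*args, **kwargs):
        for attempt in range(max_attempts):
            try:
                return fn(*args, **kwargs)
            except Exception:
                if attempt == max_attempts - 1:
                    raise  # out of retries: re-raise the last error
                time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    return wrapped

# Usage (hypothetical): get_embeddings_safe = with_retries(get_embeddings)
```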
<p>And then create embeddings for each chunk of content:</p><pre><code class="language-Python"># Create embeddings for each piece of content
for i in range(len(new_list)):
    text = new_list[i][1]
    embedding = get_embeddings(text)
    new_list[i].append(embedding)

# Create a new dataframe from the list
df_new = pd.DataFrame(new_list, columns=['title', 'content', 'url', 'tokens', 'embeddings'])
df_new.head()

</code></pre>
<p>The new data frame should look like this:</p>
<!--kg-card-begin: html-->
<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-0pky{border-color:inherit;text-align:left;vertical-align:top}
.tg .tg-0lax{text-align:left;vertical-align:top}
</style>
<table class="tg">
<thead>
  <tr>
    <th class="tg-0pky"></th>
    <th class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">Title</span></th>
    <th class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">Content</span></th>
    <th class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">URL</span></th>
    <th class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">Tokens</span></th>
    <th class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">Embeddings</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">0</span></td>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">How to Build a Weather Station With Elixir, Ne...</span></td>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">This is an installment of our “Community Membe...</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">https://www.timescale.com/blog/how-to-build-a-...</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">501</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">[0.021440856158733368, 0.02200360782444477, -0...</span></td>
  </tr>
  <tr>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">1</span></td>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">How to Build a Weather Station With Elixir, Ne...</span></td>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">capture weather and environmental data. In all...</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">https://www.timescale.com/blog/how-to-build-a-...</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">512</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">[0.016165969893336296, 0.011341351084411144, 0...</span></td>
  </tr>
  <tr>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">2</span></td>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">How to Build a Weather Station With Elixir, Ne...</span></td>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">command in their database migration:SELECT cre...</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">https://www.timescale.com/blog/how-to-build-a-...</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">374</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">[0.022517921403050423, -0.0019158280920237303,...</span></td>
  </tr>
  <tr>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">3</span></td>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">CloudQuery on Using PostgreSQL for Cloud Asset...</span></td>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">This is an installment of our “Community Membe...</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">https://www.timescale.com/blog/cloudquery-on-u...</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">519</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">[0.009028822183609009, -0.005185891408473253, ...</span></td>
  </tr>
  <tr>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">4</span></td>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">CloudQuery on Using PostgreSQL for Cloud Asset...</span></td>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">Architecture with CloudQuery SDK- Writing plug...</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">https://www.timescale.com/blog/cloudquery-on-u...</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">511</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">[0.02050386555492878, 0.010169642977416515, 0....</span></td>
  </tr>
</tbody>
</table>
<!--kg-card-end: html-->
<p><br>As an optional but recommended step, save the original blog content along with its embeddings to a CSV file. That way, you won't have to recreate the embeddings if you want to reuse them in another project.</p><pre><code class="language-Python"># Save the dataframe with embeddings as a CSV file
df_new.to_csv('blog_data_and_embeddings.csv', index=False)
</code></pre>
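<p>One gotcha if you reload that CSV later: pandas writes the embedding lists out as strings, so you need to parse them back into numeric lists before reuse. Here's a minimal, self-contained sketch of the round trip (it uses a tiny stand-in dataframe; the filename matches the example above):</p><pre><code class="language-Python">import ast
import pandas as pd

# Tiny stand-in for df_new: one row with a title and an embedding list
df_demo = pd.DataFrame({'title': ['example post'],
                        'embeddings': [[0.016, 0.011, -0.002]]})
df_demo.to_csv('blog_data_and_embeddings.csv', index=False)

# When reloaded, the embeddings column comes back as a string, not a list
df_loaded = pd.read_csv('blog_data_and_embeddings.csv')
print(isinstance(df_loaded.loc[0, 'embeddings'], str))  # True

# Parse each value back into a list of floats before reuse
df_loaded['embeddings'] = df_loaded['embeddings'].apply(ast.literal_eval)
print(df_loaded.loc[0, 'embeddings'])  # [0.016, 0.011, -0.002]
</code></pre>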
<h2 id="pro-tip-automating-embedding-creation-with-pgai-vectorizer">Pro Tip: Automating Embedding Creation with pgai Vectorizer</h2><p>In the section above, we showed how to manually create and manage embeddings in your own data pipeline – chunking content, calling the OpenAI API, and storing the results. While this approach helps you understand the fundamentals, in production, you may want to automate this process completely. Let’s look at how <a href="https://github.com/timescale/pgai/blob/main/docs/vectorizer.md"><u>pgai Vectorizer</u></a> can handle this entire pipeline for you!&nbsp;</p><p>Managing embeddings in production involves several challenges: keeping embeddings in sync with changing content, handling API failures, and optimally chunking text.&nbsp;</p><p><a href="https://www.tigerdata.com/blog/pgai-giving-postgresql-developers-ai-engineering-superpowers" rel="noreferrer">pgai</a> Vectorizer automates this entire process directly in PostgreSQL, much like PostgreSQL automatically <a href="https://docs.timescale.com/use-timescale/latest/schema-management/indexing?ref=timescale.com"><u>maintains indexes</u></a> for your tables.</p><h3 id="setting-up-pgai-vectorizer">Setting Up pgai Vectorizer</h3><p>The setup process differs depending on whether you’re using Tiger Cloud (formerly Timescale Cloud) or hosting PostgreSQL yourself.&nbsp;</p><p><strong>On Tiger Cloud</strong></p><pre><code class="language-sql">-- 1. Store your OpenAI API key securely in Timescale Cloud
-- 2. Navigate to Project Settings &gt; AI Model API Keys in the Timescale Console
-- 3. The key is stored securely and not in your database
-- 4. Create the extensions
CREATE EXTENSION IF NOT EXISTS ai;</code></pre><p><strong>For self-hosted PostgreSQL</strong></p><pre><code class="language-bash">export OPENAI_API_KEY="your-api-key-here"

# Start the vectorizer worker
vectorizer-worker --connection="postgres://user:password@host:port/dbname"</code></pre><h3 id="creating-your-first-vectorizer">Creating Your First Vectorizer</h3><p>Instead of manually creating embeddings using Python, you can define a <em>vectorizer</em> that automatically generates and maintains embeddings for your content:</p><pre><code class="language-sql">SELECT ai.create_vectorizer( 
   'blog_posts'::regclass,
    destination =&gt; 'blog_embeddings',
    embedding =&gt; ai.embedding_openai('text-embedding-3-small', 768),
    chunking =&gt; ai.chunking_recursive_character_text_splitter('content'),
    -- Pro tip: Add blog title as context to each chunk
    formatting =&gt; ai.formatting_python_template('$title: $chunk')
);</code></pre><p>This single SQL command:</p><ol><li>Automatically chunks your blog content</li><li>Creates embeddings for each chunk using OpenAI's API</li><li>Maintains embeddings as your content changes</li><li>Creates a view that joins your content with its embeddings</li></ol><h3 id="searching-with-vectorizer">Searching with Vectorizer</h3><p>You can then search your content the same way as before:</p><pre><code class="language-sql">SELECT 
   chunk,
   embedding &lt;=&gt; ai.openai_embed('text-embedding-3-small', 'How is Timescale used in IoT?') as distance
FROM blog_embeddings
ORDER BY distance
LIMIT 3;</code></pre><p>Vectorizer runs automatically every five minutes on <a href="https://console.cloud.timescale.com/signup?ref=timescale.com"><u>Tiger Cloud</u></a>, handling retries and keeping your embeddings up to date. For more details on setup and advanced features like <a href="https://github.com/timescale/pgai/blob/main/docs/vectorizer.md#monitor-a-vectorizer"><u>monitoring the Vectorizer</u></a>, see our pgai Vectorizer <a href="https://github.com/timescale/pgai/blob/main/docs/vectorizer.md"><u>documentation</u></a>.&nbsp;</p><h3 id="further-reading-on-rag">Further Reading on RAG</h3><p>The accuracy and cost of your RAG application depend heavily on implementation choices, from embedding model selection to chunking strategy.&nbsp;</p><p>Here are more blog posts to help you build effective RAG applications with PostgreSQL:</p><ol><li><a href="https://timescale.ghost.io/blog/vector-databases-are-the-wrong-abstraction?ref=timescale.com"><u>Vector Databases Are the Wrong Abstraction</u></a> – Learn why general-purpose databases with vector extensions like <a href="https://timescale.ghost.io/blog/pgvector-is-now-as-fast-as-pinecone-at-75-less-cost/"><u>pgvectorscale</u></a> often provide better solutions than specialized vector databases</li><li><a href="https://timescale.ghost.io/blog/which-rag-chunking-and-formatting-strategy-is-best?ref=timescale.com"><u>Which RAG Chunking and Formatting Strategy Is Best?</u></a> – Explore different approaches to chunking and formatting your content for optimal retrieval-augmented generation (RAG) performance</li><li><a href="https://www.tigerdata.com/blog/which-openai-embedding-model-is-best" rel="noreferrer"><u>Which OpenAI Embedding Model Is Best?</u></a> – Compare OpenAI's embedding models to choose the right one for your use case</li></ol><h2 id="part-2-store-embeddings-in-a-postgresql-vector-database-using-pgvector">Part 2: Store Embeddings in a PostgreSQL Vector Database Using Pgvector</h2><p>Now that we
have created embedding vectors for our blog content, the next step is to store the embedding vectors in a vector database to help us perform a fast search over many vectors.</p><h3 id="21-create-a-postgresql-database-and-install-pgvector">2.1 Create a PostgreSQL database and <a href="https://www.tigerdata.com/learn/postgresql-extensions-pgvector" rel="noreferrer">install pgvector</a></h3><p>First, we’ll create a PostgreSQL database. You can <a href="https://docs.timescale.com/getting-started/latest/services/" rel="noreferrer">create a cloud PostgreSQL database</a> in minutes for free on <a href="https://console.cloud.timescale.com/signup">Tiger Cloud</a> or use a local PostgreSQL database for this step. </p><p>Once you’ve created your PostgreSQL database, export your connection string as an environment variable, and just like the OpenAI API key, we’ll read it into our Python program from the environment file:</p><pre><code class="language-Python"># Timescale database connection string
# Found under "Service URL" of the credential cheat-sheet or "Connection Info" in the Timescale console
# In terminal, run: export TIMESCALE_CONNECTION_STRING=postgres://&lt;fill in here&gt;

connection_string = os.environ['TIMESCALE_CONNECTION_STRING']

</code></pre>
<p>We then connect to our database using the popular <a href="https://pypi.org/project/psycopg2/?ref=timescale.com"><u>psycopg2</u></a> Python library and install the pgvector and <a href="https://github.com/timescale/pgvectorscale?tab=readme-ov-file#installation"><u>pgvectorscale</u></a> extensions (the latter provides powerful filtering and indexing capabilities) as follows:</p><pre><code class="language-Python"># Connect to PostgreSQL database in Timescale using connection string
conn = psycopg2.connect(connection_string)
cur = conn.cursor()

# Install the pgvector extension
cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
conn.commit()

# Install the pgvectorscale extension
cur.execute("CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE;")
conn.commit()
</code></pre>
<h3 id="22-connect-to-and-configure-your-vector-database">2.2 Connect to and configure your vector database</h3><p>Once we’ve installed pgvector, we use the <a href="https://github.com/pgvector/pgvector-python#psycopg-2">register_vector()</a> command to register the vector type with our connection:</p><pre><code class="language-Python"># Register the vector type with psycopg2
register_vector(conn)
</code></pre>
<p>Once we’ve connected to the database, let’s create a table that we’ll use to store embeddings along with metadata. Our table will look as follows:<br></p>
<!--kg-card-begin: html-->
<table style="border:none;border-collapse:collapse;"><colgroup><col width="44"><col width="48"><col width="60"><col width="87"><col width="84"><col width="112"></colgroup><tbody><tr style="height:0pt"><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">id</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">title&nbsp;</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">url</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 
5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">content</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">tokens</span></p></td><td style="border-left:solid #000000 1pt;border-right:solid #000000 1pt;border-bottom:solid #000000 1pt;border-top:solid #000000 1pt;vertical-align:top;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">embedding</span></p></td></tr></tbody></table>
<!--kg-card-end: html-->
<p></p><ul><li><code>id</code> represents the unique ID of each <a href="https://www.tigerdata.com/blog/a-beginners-guide-to-vector-embeddings" rel="noreferrer">vector embedding</a> in the table.</li><li><code>title</code> is the blog title from which the content associated with the embedding is taken.</li><li><code>url</code> is the blog URL from which the content associated with the embedding is taken.</li><li><code>content</code> is the actual blog content associated with the embedding.</li><li><code>tokens</code> is the number of tokens the embedding represents.</li><li><code>embedding</code> is the vector representation of the content.<br></li></ul><p>One advantage of using PostgreSQL as a vector database is that you can easily store metadata and embedding vectors in the same database, which makes it easy to supply the user with information related to the response they receive, such as links to read more or the specific parts of a blog post most relevant to them.</p><pre><code class="language-Python"># Create table to store embeddings and metadata
table_create_command = """
CREATE TABLE embeddings (
            id bigserial primary key, 
            title text,
            url text,
            content text,
            tokens integer,
            embedding vector(1536)
            );
            """

cur.execute(table_create_command)
cur.close()
conn.commit()

</code></pre>
<h3 id="23-ingest-and-store-vector-data-into-postgresql-using-pgvector">2.3 Ingest and <a href="https://www.tigerdata.com/learn/vector-store-vs-vector-database" rel="noreferrer">store vector</a> data into PostgreSQL using pgvector</h3><p>Now that we’ve created the database and the table to house the embeddings and metadata, the final step is to insert the embedding vectors into the database. </p><p>For this step, it’s a best practice to batch insert the embeddings rather than insert them one by one.<br></p><pre><code class="language-Python"># Batch insert embeddings and metadata from dataframe into PostgreSQL database
register_vector(conn)
cur = conn.cursor()
# Prepare the list of tuples to insert
data_list = [(row['title'], row['url'], row['content'], int(row['tokens']), np.array(row['embeddings'])) for index, row in df_new.iterrows()]
# Use execute_values to perform batch insertion
execute_values(cur, "INSERT INTO embeddings (title, url, content, tokens, embedding) VALUES %s", data_list)
# Commit after we insert all embeddings
conn.commit()

</code></pre>
<p>Let’s sanity check by running some simple queries against our newly inserted data:</p><pre><code class="language-Python">cur.execute("SELECT COUNT(*) as cnt FROM embeddings;")
num_records = cur.fetchone()[0]
print("Number of vector records in table: ", num_records,"\n")
# Correct output should be 129

</code></pre>
<pre><code class="language-Python"># print the first record in the table, for sanity-checking
cur.execute("SELECT * FROM embeddings LIMIT 1;")
records = cur.fetchall()
print("First record in table: ", records)
</code></pre>
<h3 id="24-index-your-data-for-faster-retrieval">2.4 Index your data for faster retrieval</h3><p>In this example, we only have 129 embedding vectors, so searching through all of them is blazingly fast. But for larger datasets, you need to create indexes to speed up searching for similar embeddings, so we include the code to build the index for illustrative purposes. </p><p>While pgvector&nbsp;supports the <a href="https://www.tigerdata.com/blog/nearest-neighbor-indexes-what-are-ivfflat-indexes-in-pgvector-and-how-do-they-work" rel="noreferrer"><u>IVFFLAT</u></a> and <a href="https://www.tigerdata.com/learn/vector-database-basics-hnsw" rel="noreferrer"><u>HNSW</u></a> index types for approximate nearest neighbor (ANN) search, <a href="https://github.com/timescale/pgvectorscale"><u>pgvectorscale</u></a> offers a more cost-efficient and powerful index type for pgvector data: <a href="https://timescale.ghost.io/blog/pgvector-is-now-as-fast-as-pinecone-at-75-less-cost/"><u>StreamingDiskANN</u></a>, which we use here. </p><p>You always want to build this index <strong>after</strong> you have inserted the data, as the index needs to discover clusters in your data to be effective, and it does this only when first building the index. </p><p>The StreamingDiskANN index has tunable parameters for both index build and query behavior, depending on your goal. In our case, we use the default parameter values. You can read more about <a href="https://github.com/timescale/pgvectorscale?tab=readme-ov-file#tuning"><u>tuning here</u></a>.</p><pre><code class="language-Python"># Create an index on the data for faster retrieval
cur.execute('CREATE INDEX embedding_idx ON embeddings USING diskann (embedding);')
conn.commit()
</code></pre>
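<p>If the defaults don't fit your workload, the index also accepts build-time options and a query-time setting. The sketch below is hedged: the parameter names (<code>num_neighbors</code>, <code>search_list_size</code>, <code>diskann.query_search_list_size</code>) are taken from the pgvectorscale README, so verify them against the version you have installed before running this:</p><pre><code class="language-Python"># Hypothetical tuning sketch -- verify the parameter names against your
# installed pgvectorscale version before running
index_sql = (
    "CREATE INDEX embedding_idx ON embeddings "
    "USING diskann (embedding) "
    "WITH (num_neighbors = 50, search_list_size = 100);"  # build-time options
)
# Per-session trade-off between query accuracy and speed
query_sql = "SET diskann.query_search_list_size = 100;"

# Run against your database after the data is inserted:
# cur.execute(index_sql)
# conn.commit()
</code></pre>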
<h2 id="part-3-nearest-neighbor-search-using-pgvector">Part 3: Nearest Neighbor Search Using pgvector</h2><p>Given a user question, we’ll perform the following steps to use information stored in the vector database to answer their question using Retrieval Augmented Generation:</p><ol><li>Create an embedding vector for the user question.</li><li>Use pgvector to perform a vector similarity search and retrieve the <code>k</code> nearest neighbors to the question embedding from our embedding vectors representing the blog content. In our example, we’ll use k=3, finding the three most similar embedding vectors and associated content.</li><li>Supply the content retrieved from the database as additional context to the model and ask it to perform a completion task to answer the user question.</li></ol><h3 id="31-define-a-question-you-want-to-answer">3.1 Define a question you want to answer</h3><p>First, we’ll define a sample question that a user might want to answer about the blog posts stored in the database.</p><pre><code class="language-Python"># Question about Timescale we want the model to answer
input = "How is Timescale used in IoT?"
</code></pre>
<p>Since TimescaleDB is <a href="https://timescale.ghost.io/blog/visualizing-iot-data-at-scale-with-hopara-and-timescaledb/">popular for IoT sensor data</a>, a user might want to learn specifics about how they can leverage it for that use case.</p><h3 id="32-find-the-most-relevant-content-in-the-database">3.2 Find the most relevant content in the database</h3><p>Here’s the function we use to find the three nearest neighbors to the user question. Note it uses pgvector’s <code>&lt;=&gt;</code> operator, which computes the <a href="https://en.wikipedia.org/wiki/Cosine_similarity">cosine distance</a> (one minus the <a href="https://www.tigerdata.com/learn/understanding-cosine-similarity" rel="noreferrer">cosine similarity</a>) between two embedding vectors.</p><pre><code class="language-Python"># Helper function: Get top 3 most similar documents from the database
def get_top3_similar_docs(query_embedding, conn):
    embedding_array = np.array(query_embedding)
    # Register pgvector extension
    register_vector(conn)
    cur = conn.cursor()
    # Get the top 3 most similar documents using the KNN &lt;=&gt; operator
    cur.execute("SELECT content FROM embeddings ORDER BY embedding &lt;=&gt; %s LIMIT 3", (embedding_array,))
    top3_docs = cur.fetchall()
    return top3_docs

</code></pre>
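<p>To make the distance semantics concrete, here's a small self-contained NumPy sketch: the value that <code>&lt;=&gt;</code> orders by behaves like one minus the cosine similarity, so vectors pointing the same way score 0 and orthogonal vectors score 1:</p><pre><code class="language-Python">import numpy as np

def cosine_distance(a, b):
    # 1 - cosine similarity, mirroring pgvector's cosine-distance ordering
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_distance([1.0, 0.0], [2.0, 0.0]))  # same direction -> 0.0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # orthogonal -> 1.0
</code></pre>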
<h3 id="33-define-helper-functions-to-query-openai">3.3 Define helper functions to query OpenAI</h3><p>We define a helper function to get a completion response from an OpenAI model, and we reuse the previously defined <code>get_embeddings</code> helper to create an embedding for the user question. We use GPT-4o, but you can use any other model from OpenAI.</p><p>We also specify a few parameters you can modify to your liking, such as the maximum number of tokens in the model response and the model temperature, which controls the randomness of the output:</p><pre><code class="language-Python"># Helper function: get text completion from OpenAI API
# Note: we default to gpt-4o, but you can pass in any OpenAI chat model
def get_completion_from_messages(messages, model="gpt-4o", temperature=0, max_tokens=1000):
    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temperature, 
        max_tokens=max_tokens, 
    )
    return response.choices[0].message.content
</code></pre>
<h3 id="33-putting-it-all-together">3.3 Putting it all together</h3><p>We’ll define a function to process the user input by retrieving the most similar documents from our database and passing the user input, along with the relevant retrieved context to the OpenAI model to provide a completion response to.</p><p>Note that we modify the system prompt as well in order to influence the tone of the model’s response.</p><p>We pass to the model the content associated with the three most similar embeddings to the user input using the assistant role. You can also append the additional context to the user message.<br></p><pre><code class="language-Python"># Function to process input with retrieval of most similar documents from the database
def process_input_with_retrieval(user_input):
    delimiter = "```"

    #Step 1: Get documents related to the user input from database
    related_docs = get_top3_similar_docs(get_embeddings(user_input), conn)

    # Step 2: Get completion from OpenAI API
    # Set system message to help set appropriate tone and context for model
    system_message = f"""
    You are a friendly chatbot. \
    You can answer questions about timescaledb, its features and its use cases. \
    You respond in a concise, technically credible tone. \
    """

    # Prepare messages to pass to model
    # We use a delimiter to help the model understand where the user_input starts and ends
    messages = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": f"{delimiter}{user_input}{delimiter}"},
        {"role": "assistant", "content": f"Relevant Timescale case studies information: \n {related_docs[0][0]} \n {related_docs[1][0]} \n {related_docs[2][0]}"}
    ]

    final_response = get_completion_from_messages(messages)
    return final_response
</code></pre>
<p>Let’s see an example of the model’s output to our original input question:</p><pre><code class="language-Python">input = "How is Timescale used in IoT?"
response = process_input_with_retrieval(input)
print(input)
print(response)
</code></pre>
<p>Here’s the model’s response:</p><pre><code>User input: How is Timescale used in IoT?

Model Response:

TimescaleDB is widely used in IoT applications due to its ability to efficiently handle time-series data, which is a common requirement in IoT systems. Here are some ways TimescaleDB is utilized in IoT:

1. **Data Ingestion and Storage**: IoT devices generate large volumes of time-stamped data. TimescaleDB's architecture is optimized for ingesting and storing this data efficiently, allowing for high write throughput and efficient storage through features like data compression.

2. **Real-time Analytics**: TimescaleDB supports real-time analytics, enabling IoT applications to process and analyze data as it arrives. This is crucial for applications that require immediate insights, such as monitoring systems or predictive maintenance.

3. **Historical Data Analysis**: With its support for continuous aggregates, TimescaleDB allows for efficient querying of historical data. This is useful for trend analysis, reporting, and long-term data storage, which are common in IoT use cases.

4. **Integration with IoT Platforms**: TimescaleDB can be integrated with various IoT platforms and tools, such as AWS IoT, MQTT, and Grafana, to manage, visualize, and analyze IoT data seamlessly.

5. **Scalability**: As IoT deployments grow, the volume of data can increase significantly. TimescaleDB's scalability features, such as partitioning and distributed hypertables, help manage this growth effectively.

Overall, TimescaleDB provides a robust solution for managing the unique challenges of IoT data, combining the benefits of a relational database with time-series capabilities.

</code></pre><p>We can also ask the model questions about specific documents in the database, in this case about specific TimescaleDB users who have spoken about their experience in a blog post:</p><pre><code class="language-Python">input_2 = "Tell me about Edeva and Hopara. How do they use Timescale?"
response_2 = process_input_with_retrieval(input_2)
print(input_2)
print(response_2)

</code></pre>
<p>Here’s the model output for the input question above; notice how it uses specific details from the blog posts.</p><pre><code>User input: Tell me about Edeva and Hopara. How do they use Timescale?


Model Response:

### Edeva

**Use Case:**
Edeva leverages TimescaleDB to manage and analyze data from their smart city systems, particularly focusing on their Actibump dynamic speed bump. The data collected includes:

1. **IoT Event Data:** Information such as vehicle speeds and environmental conditions.
2. **Alarm Data:** Alerts for sensor malfunctions or other issues.
3. **Status Data:** Self-checks and statistical data from IoT devices.
4. **Administrative Data:** Metadata about devices, such as configuration details.

**Key Features Utilized:**
- **Continuous Aggregations:** To speed up queries and make dashboards responsive.
- **Percentile Aggregations:** For calculating accurate percentile values without querying raw data.
- **SQL Compatibility:** Simplifies onboarding for developers familiar with SQL.

**Benefits:**
- **Performance:** Transitioned from sluggish to lightning-fast dashboards.
- **Ease of Use:** Developers could quickly adapt due to SQL familiarity.
- **Scalability:** Efficiently handles large datasets, such as hundreds of millions of records.

### Hopara

**Use Case:**
Hopara uses TimescaleDB to manage and visualize time-series data for their geospatial analytics platform. The platform integrates various data sources to provide insights into spatial and temporal trends.

**Key Features Utilized:**
- **Time-Series Data Management:** Efficiently stores and queries large volumes of time-series data.
- **Geospatial Capabilities:** Leverages PostgreSQL’s PostGIS extension for spatial queries.
- **Continuous Aggregations:** To pre-compute and speed up complex queries.

**Benefits:**
- **Scalability:** Handles large datasets with ease.
- **Performance:** Fast query execution for real-time analytics.
- **Integration:** Seamless integration with existing PostgreSQL tools and extensions.

Both Edeva and Hopara benefit from TimescaleDB’s ability to handle large volumes of time-series data efficiently, providing fast query performance and ease of use through SQL compatibility.
</code></pre><h2 id="conclusion">Conclusion</h2><p><a href="https://www.timescale.com/blog/rag-is-more-than-just-vector-search" rel="noreferrer">Retrieval-augmented generation (RAG)</a> is a powerful method of building applications with LLMs that enables you to teach foundation models about things they were not originally trained on, like private documents or recently published information.</p><p>This project is an example of how to create, store, and perform similarity search on <a href="https://www.tigerdata.com/blog/open-source-vs-openai-embeddings-for-rag" rel="noreferrer">OpenAI embeddings</a>. We used PostgreSQL + <a href="https://github.com/pgvector/pgvector"><u>pgvector</u></a> + <a href="https://github.com/timescale/pgvectorscale"><u>pgvectorscale</u></a> as our vector database to efficiently store and query the embeddings, enabling precise and relevant responses.</p><h2 id="timescaledb-postgresql">TimescaleDB + PostgreSQL</h2><p>And if you’re looking for a production PostgreSQL database for your vector workloads, <a href="https://console.cloud.timescale.com/signup"><u>try Timescale</u></a>. 
It’s free for 30 days, no credit card required.</p><h3 id="further-reading">Further reading</h3><p>Here are more blog posts about RAG with PostgreSQL and different tools:</p><ul><li><a href="https://www.timescale.com/blog/rag-is-more-than-just-vector-search/"><u>RAG Is More Than Just Vector Search</u></a></li><li><a href="https://timescale.com/blog/retrieval-augmented-generation-with-claude-sonnet-3-5-and-pgvector/"><u>Retrieval-Augmented Generation With Claude Sonnet 3.5 &amp; Pgvector</u></a></li><li><a href="https://www.timescale.com/blog/build-a-fully-local-rag-app-with-postgresql-mistral-and-ollama/"><u>Build a Fully Local RAG App With PostgreSQL, Mistral, and Ollama</u></a></li><li><a href="https://www.timescale.com/blog/building-an-ai-image-gallery-advanced-rag-with-pgvector-and-claude-sonnet-3-5/"><u>Building an AI Image Gallery: Advanced RAG With Pgvector and Claude Sonnet 3.5</u></a></li></ul><p></p>]]></content:encoded>
        </item>
    </channel>
</rss>