<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[Tiger Data Blog]]></title>
        <description><![CDATA[Insights, product updates, and tips from TigerData (Creators of TimescaleDB) engineers on Postgres, time series & AI. IoT, crypto, and analytics tutorials & use cases.]]></description>
        <link>https://www.tigerdata.com/blog</link>
        <image>
            <url>https://www.tigerdata.com/icon.ico</url>
            <title>Tiger Data Blog</title>
            <link>https://www.tigerdata.com/blog</link>
        </image>
        <generator>RSS for Node</generator>
        <lastBuildDate>Tue, 07 Apr 2026 09:51:30 GMT</lastBuildDate>
        <atom:link href="https://www.tigerdata.com/blog" rel="self" type="application/rss+xml"/>
        <ttl>60</ttl>
        <item>
            <title><![CDATA[Build Search and RAG Systems on PostgreSQL Using Cohere and Pgai]]></title>
            <description><![CDATA[Enterprise-ready LLMs from Cohere, now available in the pgai PostgreSQL extension.]]></description>
            <link>https://www.tigerdata.com/blog/build-search-and-rag-systems-on-postgresql-using-cohere-and-pgai</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/build-search-and-rag-systems-on-postgresql-using-cohere-and-pgai</guid>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Avthar Sewrathan]]></dc:creator>
            <pubDate>Fri, 09 Aug 2024 18:01:55 GMT</pubDate>
            <media:content medium="image" url="https://timescale.ghost.io/blog/content/images/2024/08/Cohere_pgai.png"/>
            <content:encoded><![CDATA[<p><em>Enterprise-ready LLMs from Cohere—now available in the pgai PostgreSQL extension.</em></p><p><a href="https://cohere.com" rel="noreferrer"><strong>Cohere</strong></a> is a leading generative AI company in the field of large language models (LLMs) and retrieval-augmented generation (RAG) systems. What sets Cohere apart from other model developers is its unwavering focus on enterprise needs and support for multiple languages.&nbsp;</p><h3 id="cohere-models">Cohere models</h3><p>Cohere has quickly gained traction among businesses and developers alike, thanks to its suite of models that cater to key steps of the AI application-building process: text embedding, result reranking, and reasoning. Here’s an overview of Cohere models:</p><ul><li><a href="https://cohere.com/embed"><strong><u>Cohere Embed</u></strong></a><strong>:</strong> A leading text representation model supporting over 100 languages, with the latest version (Embed v3) capable of evaluating document quality and relevance to queries. Cohere Embed is ideal for <a href="https://www.tigerdata.com/learn/vector-search-vs-semantic-search" rel="noreferrer">semantic search</a>, retrieval-augmented generation (RAG), clustering, and classification tasks.</li><li><a href="https://cohere.com/rerank"><strong><u>Cohere Rerank</u></strong></a><strong>:</strong> A model that significantly improves search quality for any keyword or vector search system with minimal code changes. It’s optimized for high throughput and reduced compute requirements while leveraging Cohere's embedding performance for accurate reranking. Cohere rerank is system agnostic, meaning it can be used with any vector search system, including PostgreSQL.</li><li><a href="https://cohere.com/command"><strong><u>Cohere Command</u></strong></a>: A family of highly scalable language models optimized for enterprise use. Command supports RAG, multi-language use, tool use, and citations. 
Command R+ is the most advanced model, as it’s optimized for conversational interaction and long-context tasks. Command R is well suited for simpler RAG and single-step tool use tasks and is the most cost-effective choice out of the Command family.&nbsp;</li></ul><div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-emoji">🎉</div><div class="kg-callout-text">Today, we’re thrilled to announce that PostgreSQL developers can now harness the power of Cohere's enterprise-grade language models directly on PostgreSQL data using the pgai PostgreSQL extension.&nbsp;</div></div><h3 id="what-is-pgai">What is pgai?</h3><p><a href="https://github.com/timescale/pgai"><u>Pgai</u></a> is an open-source <a href="https://www.tigerdata.com/blog/top-8-postgresql-extensions" rel="noreferrer">PostgreSQL extension</a> that brings AI models closer to your data, simplifying tasks such as embedding creation, text classification, semantic search, and retrieval-augmented generation on data stored in PostgreSQL.</p><p><strong>Pgai supports the entire suite of Cohere Embed, Rerank, and Command models.</strong> This means that developers can now:</p><ul><li>Create embeddings using Cohere’s Embed model for data inside PostgreSQL tables, all without having to pipe data out of and back into the database.</li><li>Build hybrid search systems for higher quality results in search and RAG applications using Cohere Rerank, combining vector search in pgvector and PostgreSQL full-text search.</li><li>Perform tasks like classification, summarization, <a href="https://www.tigerdata.com/blog/automating-data-enrichment-in-postgresql-with-openai" rel="noreferrer">data enrichment</a>, and other reasoning tasks on data in PostgreSQL tables using Cohere Command models.</li><li>Build highly accurate RAG systems completely in SQL, leveraging the Cohere Embed, Rerank, and Command models together.</li></ul><p>We built pgai to <a 
href="https://timescale.ghost.io/blog/pgai-giving-postgresql-developers-ai-engineering-superpowers/"><u>give more PostgreSQL developers AI Engineering superpowers</u></a>. Pgai makes it easier for database developers familiar with PostgreSQL to become "AI engineers" by providing familiar SQL-based interfaces to AI functionalities like embedding creation and model usage.</p><p>While Command is the flagship model for building scalable, production-ready AI applications, <a href="https://www.tigerdata.com/blog/pgai-giving-postgresql-developers-ai-engineering-superpowers" rel="noreferrer">pgai</a> supports a variety of models in the Cohere lineup. This includes specialized embedding models that support over 100 languages, enabling developers to select the most suitable model for their specific enterprise use case. </p><p>Whether it's building internal AI knowledge assistants, security documentation AI, customer feedback analysis, or employee support systems, Cohere's models integrated with pgai and PostgreSQL offer powerful solutions.</p><p><strong>Getting started with Cohere models in pgai</strong></p><p>Ready to start building with Cohere's suite of powerful language models and PostgreSQL? Pgai is open source under the PostgreSQL License and is available for immediate use in your AI projects. You can find installation instructions on the pgai <a href="https://github.com/timescale/pgai/?ref=timescale.com"><u>GitHub repository</u></a>. You can also access pgai (alongside pgvector and <a href="https://timescale.ghost.io/blog/pgvector-is-now-as-fast-as-pinecone-at-75-less-cost/"><u>pgvectorscale</u></a>) on any database service on <a href="https://console.cloud.timescale.com/signup?ref=timescale.com"><u>Timescale’s Cloud PostgreSQL platform</u></a>. 
If you’re new to Timescale, you can get started with a <a href="https://console.cloud.timescale.com/signup?ref=timescale.com"><u>free cloud PostgreSQL database here</u></a>.</p><p>Once connected to your database, create the pgai extension by running:</p><pre><code class="language-postgresql">CREATE EXTENSION IF NOT EXISTS ai CASCADE;</code></pre><p><strong>Join the Postgres AI Community</strong></p><p>Have questions about using Cohere models with pgai? Join the <a href="https://discord.gg/KRdHVXAmkp?ref=timescale.com"><u>Postgres for AI Discord</u></a>, where you can share your projects, seek help, and collaborate with a community of peers. You can also <a href="https://github.com/timescale/pgai/issues?ref=timescale.com"><u>open an issue on the pgai GitHub</u></a> (and while you’re there, stars are always appreciated ⭐).</p><p>Next, let's explore the benefits of using Cohere models for building enterprise AI applications. And we’ll close with a real-world example of using Cohere models to build a hybrid search system, combining pgvector semantic search with PostgreSQL full-text search.</p><p>Let's dive in...</p><h2 id="why-use-cohere-models-for-rag-and-search-applications">Why Use Cohere Models for RAG and Search Applications?</h2><p>Cohere's suite of models offers several advantages for enterprise-grade RAG and search applications:</p><p><strong>High-quality results</strong>: Cohere models excel in understanding business language, providing relevant content, and tackling complex enterprise challenges.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXcxGJoF6tCFIRf-wLKvkqUKCRp7S7WekDS6qQuySntrZp68pBzl8VxVeIjelY54kcjpJd9wmBvBS9TjMdWkXAeceM6V40w42zsR7IbWlZCT1OOyePr0wehPrJCa0DxilDhcrbM4iXDMElXNxl8NYymFew?key=6JbOGbzT9OJGF_xgkuNFdg" class="kg-image" alt="" loading="lazy" width="1600" height="704"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Impact of Cohere 
Rerank:</strong></b></i><i><em class="italic" style="white-space: pre-wrap;"> Semi-structured retrieval accuracy based on Recall@5 on TMDB-5k-Movies, WikiSQL, nq-tables, and Cohere annotated datasets (higher is better). Source: </em></i><a href="https://cohere.com/blog/rerank-3"><u><i><em class="italic underline" style="white-space: pre-wrap;">Cohere blog</em></i></u></a><i><em class="italic" style="white-space: pre-wrap;">.</em></i></figcaption></figure><p><strong>Multi-language support:</strong> With support for over 100 languages, Cohere models are ideal for global enterprises. For example, Command R+ is optimized to perform well in the following languages: English, French, Spanish, Italian, German, Brazilian Portuguese, Japanese, Korean, Simplified Chinese, and Arabic. Additionally, pre-training data has been included for the following 13 languages: Russian, Polish, Turkish, Vietnamese, Dutch, Czech, Indonesian, Ukrainian, Romanian, Greek, Hindi, Hebrew, and Persian.</p><p><strong>Flexible deployment:</strong> Cohere’s flexible deployment options allow businesses to bring Cohere's models to their own data, either on their servers or in private/commercial clouds, addressing critical security and compliance concerns. 
This flexibility, combined with Cohere's robust API and partnerships with major cloud providers like Amazon Web Services (through the Bedrock AI platform), ensures seamless integration and improved scalability for enterprise applications.</p><p><strong>Enterprise focus:</strong> Strategic partnerships with industry leaders like Fujitsu, Oracle, and McKinsey &amp; Company underscore Cohere's enterprise-centric approach.</p><p>Cohere models enable sophisticated enterprise AI applications such as:</p><ul><li>Investment research assistants</li><li>Support chatbots and intelligent support copilots</li><li>Executive AI assistants</li><li>Document summarization tools</li><li>Knowledge and project staffing assistants</li><li>Regulatory compliance monitoring</li><li>Sentiment analysis for brand management</li><li>Research and development assistance</li></ul><p>Cohere's models offer a powerful, scalable, and easily implementable solution for enterprise search and RAG applications, balancing high accuracy with efficient performance.</p><h2 id="tutorial-build-search-and-rag-systems-on-postgresql-using-cohere-and-pgai">Tutorial: Build Search and RAG Systems on PostgreSQL Using Cohere and Pgai</h2><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXcDydTpU8JCBmeEgJZFg_NMrdqAUeWGAoM8bfvt_iSVgWHcaDZNYx-ePIOV3i1nLg2ht_NBAaMJd02A6gD4Ao2xtZ12AbzYzBa_jD2La5VaRJmdn6DlWEF7VzIBP_jI32CEdMhMo47uv41RXM1IzT-xyZhH?key=6JbOGbzT9OJGF_xgkuNFdg" class="kg-image" alt="" loading="lazy" width="1600" height="1048"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Process diagram showing how pgai combines full text, semantic search</em></i><span style="white-space: pre-wrap;">,</span><i><em class="italic" style="white-space: pre-wrap;"> and Cohere’s models to find the most relevant results to a user query using hybrid search.</em></i></figcaption></figure><p>Let’s look at an example of why Cohere's support in pgai is such a 
game-changer. We’ll perform a hybrid search over a corpus of news articles, combining PostgreSQL full-text search with pgvector’s semantic search, leveraging Cohere Embed and Rerank models in the process.</p><p>Here's what we'll cover:</p><ul><li><strong>Setting up your environment: </strong>Creating a Python virtual environment, installing necessary libraries, and setting up a PostgreSQL database with the pgai extension.</li><li><strong>Preparing your data: </strong>Creating a table to store news articles and loading sample data from the CNN/Daily Mail news dataset.</li><li><strong>Generating vector embeddings using pgai:</strong> Using Cohere's Embed model to create vector representations of text.</li><li><strong>Implementing full text and semantic search capabilities:</strong> Setting up PostgreSQL's full-text search for keyword matching and creating a vector search index for semantic search.</li><li><strong>Performing searches:</strong> Executing keyword searches using PostgreSQL's full-text search and semantic searches on embeddings using pgvector <a href="https://www.tigerdata.com/learn/hnsw-vs-diskann" rel="noreferrer">HNSW</a>.</li><li><strong>Performing hybrid search: </strong>Combining keyword and semantic search results by using Cohere's Rerank model to improve result relevance.</li></ul><p>By the end of this tutorial, you'll have a powerful hybrid search system that provides highly relevant search results, leveraging both traditional keyword search and semantic search techniques.</p><h3 id="setting-up-your-environment">Setting up your environment</h3><p>To begin, we'll set up a Python virtual environment and install the necessary libraries. This ensures a clean, isolated environment for our project.</p><pre><code class="language-bash">mkdir cohere
cd cohere
python -m venv venv
source venv/bin/activate # On Windows, use `venv\Scripts\activate`

pip install datasets "psycopg[binary]" python-dotenv
</code></pre><p>This setup:</p><ul><li>Creates a new directory for our project</li><li>Sets up a Python virtual environment</li><li>Activates the virtual environment</li><li>Installs the required Python packages:<ul><li><code>datasets</code>: For loading our sample dataset</li><li><code>psycopg</code>: PostgreSQL adapter for Python</li><li><code>python-dotenv</code>: For managing environment variables</li></ul></li></ul><p>Next, create a <code>.env</code> file in your project directory to store your database connection string and Cohere API key:</p><pre><code class="language-bash">DB_URL=postgres://username:password@localhost:5432/your_database
COHERE_API_KEY=your_cohere_api_key_here
</code></pre><p>Replace the placeholders with your actual database credentials and Cohere API key.</p><h3 id="setting-up-your-postgresql-database">Setting up your PostgreSQL database</h3><p>Now, let's set up our PostgreSQL database with the necessary extensions and table structure. We’ll create a table to hold our news articles. The <code>embedding</code> column will hold our embeddings which we’ll create using the Cohere Embed model. The <code>tsv</code> column will hold our full-text-search vectors.</p><p>Connect to your PostgreSQL database and run the following SQL commands:</p><pre><code class="language-postgresql">-- create the pgai extension
create extension if not exists ai cascade;

-- create a table for the news articles
-- the embedding column will store embeddings from cohere
-- the tsv column will store the full-text-search vector
create table cnn_daily_mail
( id bigint not null primary key generated by default as identity
, highlights text
, article text
, embedding vector(1024)
, tsv tsvector generated always as (to_tsvector('english', article)) stored
);

-- index the full-text-search vector
create index on cnn_daily_mail using gin (tsv);
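
-- optional sanity check (a hypothetical throwaway row, removed afterwards):
-- the generated tsv column is populated automatically on insert, e.g.
-- insert into cnn_daily_mail (highlights, article) values ('test', 'The police car stopped.');
-- select tsv from cnn_daily_mail where highlights = 'test';
--   returns 'car':3 'polic':2 'stop':4  (stemmed lexemes with positions; 'the' is a stopword)
-- delete from cnn_daily_mail where highlights = 'test';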
</code></pre><p>This SQL script:</p><ul><li>Creates the <code>pgai</code> extension</li><li>Sets up a table to store our news articles, including columns for the article text, embeddings, and a full-text <a href="https://www.tigerdata.com/learn/understanding-vector-search" rel="noreferrer">search vector</a></li><li>Creates an index on the full-text search vector for efficient keyword searches</li></ul><h3 id="load-the-dataset">Load the dataset</h3><p>Now that our database is set up, we'll load sample data from the CNN/Daily Mail dataset. We'll use Python to download the dataset and efficiently insert it into our PostgreSQL table using the <code>copy</code> command with the <code>binary</code> format.</p><p>Create a new file called <code>load_data.py</code> in your project directory and add the following code:</p><pre><code class="language-python">import os
from dotenv import load_dotenv
from datasets import load_dataset
import psycopg


# Load environment variables
load_dotenv()
DB_URL = os.environ["DB_URL"]
COHERE_API_KEY = os.environ["COHERE_API_KEY"]

# Load and prepare the dataset
dataset = load_dataset("cnn_dailymail", "3.0.0")
test = dataset["test"]
test = test.shuffle(seed=42).select(range(0,1000))

# Insert data into PostgreSQL
with psycopg.connect(DB_URL) as con:
   with con.cursor(binary=True) as cur:
       with cur.copy("copy public.cnn_daily_mail (highlights, article) from stdin (format binary)") as cpy:
           cpy.set_types(['text', 'text'])
           for row in test:
               cpy.write_row((row["highlights"], row["article"]))

print("Data loading complete!")
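
# Optional sanity check (assumes the cnn_daily_mail table created earlier):
# confirm that the expected 1,000 rows actually arrived before moving on.
with psycopg.connect(DB_URL) as con:
    row_count = con.execute("select count(*) from public.cnn_daily_mail").fetchone()[0]
    print(f"Rows in cnn_daily_mail: {row_count}")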
</code></pre><p>This script:</p><ol><li>Loads environment variables from the <code>.env</code> file</li><li>Downloads the CNN/Daily Mail dataset</li><li>Selects a random subset of 1,000 articles from the test set</li><li>Efficiently inserts the data into our PostgreSQL table using the binary COPY command</li></ol><p>To run the script and load the data:</p><pre><code class="language-bash">python load_data.py</code></pre><p>This approach is much faster than inserting rows one by one, especially for larger datasets. After running the script, you should have 1,000 news articles loaded into your cnn_daily_mail table, ready for the next steps in our tutorial.</p><p><strong>Note</strong>: Before running this script, make sure your <code>.env</code> file is properly set up with the correct database URL.</p><h3 id="create-vector-embeddings-with-cohere-embed-and-pgai">Create vector embeddings with Cohere Embed and pgai</h3><p>In this step, we'll use Cohere's Embed model to generate <a href="https://www.tigerdata.com/blog/a-beginners-guide-to-vector-embeddings" rel="noreferrer">vector embeddings</a> for our news articles. These embeddings will enable semantic search capabilities.</p><p>First, set your Cohere API key in your PostgreSQL session:</p><pre><code class="language-postgresql">select set_config('ai.cohere_api_key', '&lt;YOUR-API-KEY&gt;', false) is not null as set_cohere_api_key</code></pre><p>The pgai extension will use this key by default for the duration of your database session.</p><p>We'll explore two methods to generate embeddings: a simple approach and a more production-ready solution. Choose method 1 if you’re just experimenting and method 2 if you want to see how you’d run this in production.</p><p><strong>Method 1: Easy mode (for small datasets)</strong></p><p>This is the "easy-mode" approach to generating embeddings for each news article.&nbsp;</p><pre><code class="language-postgresql">-- this is the easy way to create the embeddings
-- but it's one LONG-running update statement
update cnn_daily_mail set embedding = ai.cohere_embed('embed-english-v3.0', article, input_type=&gt;'search_document');
</code></pre><p>This method is straightforward, but each call to Cohere is relatively slow, so this single <code>UPDATE</code> statement can run for a long time on large datasets.</p><p><strong>Method 2: Production mode (recommended for larger datasets)</strong></p><p>This method processes rows one at a time, allowing for better concurrency and error handling. We select and lock only a single row at a time, and we commit our work as we go along.</p><pre><code class="language-postgresql">-- this is a more production-appropriate way to generate the embeddings
-- we only lock one row at a time and we commit each row immediately
do $$
declare
    _id bigint;
    _article text;
    _embedding vector(1024);
begin
    loop
        select id, article into _id, _article
        from cnn_daily_mail
        where embedding is null
        for update skip locked
        limit 1;

        if not found then
            exit;
        end if;

        _embedding = ai.cohere_embed('embed-english-v3.0', _article, input_type=&gt;'search_document');
        update cnn_daily_mail set embedding = _embedding where id = _id;
        commit;
    end loop;
end;
$$;
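
-- while the loop runs, you can watch its progress from another session, e.g.:
-- select count(*) filter (where embedding is not null) as embedded,
--        count(*) as total
-- from cnn_daily_mail;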
</code></pre><p>Note that both methods above use the <a href="https://github.com/timescale/pgai/blob/main/docs/cohere.md#cohere_embed"><u>cohere_embed() function</u></a> provided by the pgai extension to generate embeddings for each article in the cnn_daily_mail table using the Cohere <a href="https://docs.cohere.com/docs/cohere-embed" rel="noreferrer">Embed v3 embedding model</a>.</p><p>After running one of the above methods, verify that all rows have embeddings:</p><pre><code class="language-postgresql">-- this should return 0
select count(*) from cnn_daily_mail where embedding is null;
</code></pre><p><strong>Create a vector search index</strong></p><p>To optimize vector similarity searches, create an index on the embedding column:</p><pre><code class="language-postgresql">-- index the embeddings
create index on cnn_daily_mail using hnsw (embedding vector_cosine_ops);
</code></pre><p>This index uses the pgvector <a href="https://www.tigerdata.com/blog/vector-database-basics-hnsw" rel="noreferrer">HNSW</a> (<a href="https://www.tigerdata.com/blog/vector-database-basics-hnsw" rel="noreferrer">hierarchical navigable small world</a>) algorithm, which is efficient for nearest-neighbor searches in high-dimensional spaces.</p><p>With these steps completed, your database is now set up for both keyword-based and semantic searches. In the next section, we'll explore how to perform these searches using <a href="https://www.tigerdata.com/blog/pgai-giving-postgresql-developers-ai-engineering-superpowers" rel="noreferrer">pgai</a> and Cohere's models.</p><h2 id="perform-a-keyword-search-in-postgresql">Perform a Keyword Search in PostgreSQL</h2><div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-emoji">⚠️</div><div class="kg-callout-text">Note to reader: Unfortunately, the news dataset is quite depressing, so please forgive the negativity in the keywords and prompts below. All terms are chosen purely for illustrative and educational purposes.</div></div><p>PostgreSQL's full-text search capabilities allow us to perform complex keyword searches efficiently. We'll use the <code>tsv</code> column we created earlier, which contains the tsvector representation of each article.</p><pre><code class="language-postgresql">-- a full-text-search
-- must include "death" OR "kill"
-- must include "police" AND "car" AND "dog"
select article
from cnn_daily_mail
where article @@ to_tsquery('english', '(death | kill) &amp; police &amp; car &amp; dog')
;
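
-- note: to_tsquery stems its terms, so 'kill' also matches 'killed' and
-- 'killing'; you can test a match against arbitrary text directly, e.g.:
-- select to_tsvector('english', 'A dog was killed when a police car crashed')
--        @@ to_tsquery('english', '(death | kill) &amp; police &amp; car &amp; dog');  -- true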
</code></pre><p>Let's break down this query:</p><ul><li><code>@@</code> is the text search match operator in PostgreSQL.</li><li><code>to_tsquery('english', '...')</code> converts our search terms into a tsquery type.</li><li>The search logic is: (death OR kill) AND police AND car AND dog.</li></ul><p>This query will return articles that contain:</p><ol><li>Either "death" or "kill" (or their variations)</li><li>AND "police" (or variations)</li><li>AND "car" (or variations)</li><li>AND "dog" (or variations)</li></ol><p>This method of searching is fast and efficient, especially for exact keyword matches inside the article text. However, it doesn't capture semantic meaning or handle synonyms well. In the next section, we'll explore how to perform semantic searches using the embeddings we created earlier.</p><h2 id="perform-a-semantic-search-in-postgresql-with-pgvector-pgai-and-cohere-embed">Perform a Semantic Search in PostgreSQL With pgvector, pgai, and Cohere Embed </h2><p>Semantic search allows us to find relevant articles based on the meaning of a query rather than just matching keywords. We'll use Cohere's Embed model to convert our search query into a <a href="https://www.tigerdata.com/blog/a-beginners-guide-to-vector-embeddings" rel="noreferrer">vector embedding</a>, then find the most similar article embeddings using pgvector.</p><pre><code class="language-postgresql">-- get the 15 most relevant stories to our question
with q as
(
   select ai.cohere_embed
   ('embed-english-v3.0'
   , 'Show me stories about police reports of deadly happenings involving cars and dogs.'
   , input_type=&gt;'search_query'
   ) as q
)
select article
from cnn_daily_mail
order by embedding &lt;=&gt; (select q from q limit 1)
limit 15
;
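
-- note: &lt;=&gt; is pgvector's cosine distance operator (1 - cosine similarity):
-- 0 means identical direction, 1 orthogonal, 2 opposite, e.g.:
-- select '[1,0]'::vector &lt;=&gt; '[0,1]'::vector;  -- returns 1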
</code></pre><p>Let's break this query down:</p><ol><li>We use a <a href="https://www.timescale.com/learn/how-to-use-common-table-expression-sql" rel="noreferrer">CTE (Common Table Expression) </a>named <code>q</code> to generate the embedding for our search query using Cohere's Embed model (the CTE's single output column is also named <code>q</code>).</li><li><code>ai.cohere_embed()</code> is a function provided by pgai that interfaces with Cohere's API to create embeddings. See more in the <a href="https://github.com/timescale/pgai/blob/main/docs/cohere.md#cohere_embed" rel="noreferrer">pgai docs</a>.</li><li>We specify <code>input_type =&gt; 'search_query'</code> to optimize the embedding for search queries.</li><li>In the main query, we order the results by the cosine distance (<code>&lt;=&gt;</code>) between each article's embedding and our query embedding.</li><li>The <code>LIMIT 15</code> clause returns the top 15 most semantically similar articles.</li></ol><p>This semantic search can find relevant articles even if they don't contain the exact words used in the query. It understands the context and meaning of the search query and matches it with semantically similar content.</p><h2 id="performing-hybrid-search-using-pgai-and-cohere-rerank">Performing Hybrid Search Using Pgai and Cohere Rerank&nbsp;</h2><p>Hybrid search combines the strengths of both keyword and semantic searches and then uses Cohere's Rerank model to improve the relevance of the results further. This approach can provide more accurate and contextually relevant results than either method alone.</p><p>Here's how to perform a hybrid search with reranking:</p><pre><code class="language-postgresql">with full_text_search as
(
   select article
   from cnn_daily_mail
   where article @@ to_tsquery('english', '(death | kill) &amp; police &amp; car &amp; dog')
   limit 15
)
, vector_query as
(
   select ai.cohere_embed
   ('embed-english-v3.0'
   , 'Show me stories about police reports of deadly happenings involving cars and dogs.'
   , input_type=&gt;'search_query'
   ) as query_embedding
)
, vector_search as
(
   select article
   from cnn_daily_mail
   order by embedding &lt;=&gt; (select query_embedding from vector_query limit 1)
   limit 15
)
, rerank as
(
   select ai.cohere_rerank
   ( 'rerank-english-v3.0'
   , 'Show me stories about police reports of deadly happenings involving cars and dogs.'
   , (
       select jsonb_agg(x.article)
       from
       (
           select *
           from full_text_search
           union
           select * from vector_search
       ) x
     )
   , top_n =&gt; 5
   , return_documents =&gt; true
   ) as response
)
select
  x.index
, x.document-&gt;&gt;'text' as article
, x.relevance_score
from rerank
cross join lateral jsonb_to_recordset(rerank.response-&gt;'results') x(document jsonb, index int, relevance_score float8)
order by relevance_score desc
;
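
-- note: ai.cohere_rerank returns a single jsonb response; each element of
-- response-&gt;'results' carries "index" (the document's position in the input
-- array), "relevance_score", and, because return_documents =&gt; true, a
-- "document" object whose 'text' field holds the article, as unpacked above.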
</code></pre><p>Let's break down this query:</p><ol><li><code>full_text_search</code>: Performs a keyword search using PostgreSQL's full-text search capabilities.</li><li><code>vector_query</code>: Creates an embedding for our search query using Cohere's Embed model.</li><li><code>vector_search</code>: Performs a semantic search using the query embedding.</li><li><code>rerank</code>: Combines the keyword and semantic search results with a <code>union</code>, then uses Cohere's Rerank model to reorder them by relevance to the query.</li><li>The final <code>SELECT</code> statement extracts and formats the reranked results.</li></ol><p>This hybrid approach offers several advantages:</p><ul><li>It captures both exact keyword matches and semantically similar content.</li><li>The reranking step helps to prioritize the most relevant results.</li><li>It can handle cases where either keyword or semantic search alone might miss relevant articles.</li></ul><p>To use this query effectively:</p><ol><li>Adjust the <code>to_tsquery</code> parameters to match your specific keyword search needs.</li><li>Modify the natural language query in both the <code>ai.cohere_embed</code> and <code>ai.cohere_rerank</code> functions to match your search intent.</li><li>Experiment with the <code>LIMIT</code> values and the <code>top_n</code> parameter to balance between recall and precision.</li></ol><p>Remember, while this approach can provide highly relevant results, it does involve multiple API calls to Cohere (for embedding and reranking), which may impact performance for large result sets or high-volume applications.&nbsp;</p><p>This hybrid search method demonstrates the power of combining traditional database search techniques with advanced AI models, all within your PostgreSQL database using pgai.</p><h2 id="get-started-with-cohere-and-pgai-today">Get Started With Cohere and Pgai Today</h2><p>The integration of Cohere’s Command, Embed, 
and Rerank models into pgai marks a significant milestone in our vision to help PostgreSQL evolve into an <a href="https://timescale.ghost.io/blog/making-postgresql-a-better-ai-database/"><u>AI database</u></a>. By bringing these state-of-the-art language models directly into your database environment, you can unlock new levels of efficiency, intelligence, and innovation in your projects.</p><p>Pgai is open source under the PostgreSQL License and is available for you to use in your AI projects today. You can find installation instructions on the <a href="https://github.com/timescale/pgai/?ref=timescale.com"><u>pgai GitHub repository</u></a>. You can also access pgai on any database service on <a href="https://console.cloud.timescale.com/signup?ref=timescale.com"><u>Timescale’s cloud PostgreSQL platform</u></a>. If you’re new to Timescale, you can get started with a <a href="https://console.cloud.timescale.com/signup?ref=timescale.com"><u>free cloud PostgreSQL database here</u></a>.</p><p>Pgai is an effort to enrich the PostgreSQL ecosystem for AI. If you’d like to help, here’s how you can get involved:</p><ul><li>Got questions about using Cohere models in pgai? Join the <a href="https://discord.gg/Q8TajvAPRN?ref=timescale.com"><u>Postgres for AI Discord</u></a>, a community of developers building AI applications with PostgreSQL. Share what you’re working on, and help or get helped by a community of peers.</li><li>Share the news with your friends and colleagues: Share our posts about pgai on <a href="https://x.com/TimescaleDB?ref=timescale.com"><u>X/Twitter</u></a>, <a href="https://www.linkedin.com/company/timescaledb/?ref=timescale.com"><u>LinkedIn</u></a>, and Threads. We promise to RT back.</li><li>Submit issues and feature requests: We encourage you to submit issues and feature requests for functionality you’d like to see, bugs you find, and suggestions you think would improve pgai. 
Head over to the <a href="https://github.com/timescale/pgai/?ref=timescale.com"><u>pgai GitHub repo</u></a> to share your ideas.</li><li>Make a contribution: We welcome community contributions for pgai. Pgai is written in Python and PL/Python. Let us know which models you want to see supported, particularly for open-source embedding and generation models. See the <a href="https://github.com/timescale/pgai/blob/main/CONTRIBUTING.md?ref=timescale.com"><u>pgai GitHub for instructions to contribute</u></a>.</li><li>Offer the pgai extension on your PostgreSQL cloud: Pgai is an open-source project under the <a href="https://github.com/timescale/pgvectorscale/blob/main/LICENSE?ref=timescale.com"><u>PostgreSQL License</u></a>. We encourage you to offer pgai on your managed PostgreSQL database-as-a-service platform and can even help you spread the word. Get in touch via our <a href="https://www.timescale.com/contact?ref=timescale.com"><u>Contact Us form</u></a> and mention pgai to discuss further.</li></ul><p>We're excited to see what you'll build with PostgreSQL and pgai!</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Refining Vector Search Queries With Time Filters in Pgvector: A Tutorial]]></title>
            <description><![CDATA[Learn how to do both time-based filtering and vector search—semantic similarity search to be more precise—in a single SQL query.]]></description>
            <link>https://www.tigerdata.com/blog/refining-vector-search-queries-with-time-filters-in-pgvector-a-tutorial</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/refining-vector-search-queries-with-time-filters-in-pgvector-a-tutorial</guid>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Engineering]]></category>
            <dc:creator><![CDATA[John Pruitt]]></dc:creator>
            <pubDate>Mon, 01 Apr 2024 15:15:31 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2024/04/refining-vector-queries-with-time-filters-in-pgvector.png">
            </media:content>
            <content:encoded><![CDATA[<p><a href="https://timescale.ghost.io/blog/postgresql-as-a-vector-database-create-store-and-query-openai-embeddings-with-pgvector/" rel="noreferrer">Vector databases have emerged as a powerful solution for working with high-dimensional data</a>. These databases store vectors and perform vector searches. A vector is an array of numbers representing a data point in a multi-dimensional space, and vector search finds the most similar vectors to a query vector in this space. Vector search capabilities are powerful in machine learning and artificial intelligence, where traditional text-based search falls short.</p><p>Refining vector search queries becomes increasingly essential as data grows in volume. This is where time filters come into play. Time-based filtering allows users to narrow search results based on temporal criteria, such as creation date, modification date, or any time-related attribute associated with the data. Whether finding the most recent documents or documents from a specific time range, time-based filtering provides a crucial capability that complements similarity search.</p><p>Not all vector databases support both vector search and time-based filtering. The ability to perform both in the database unlocks efficiencies not otherwise possible. PostgreSQL with the <a href="https://github.com/pgvector/pgvector"><u>pgvector</u></a> and <a href="https://github.com/timescale/timescaledb"><u>timescaledb</u></a> extensions provides a potent platform for combining vector search with time-based filtering using SQL. In this post, we will explore vector search, time-based search, and several approaches to how one might combine the two. 
We will evaluate the speed and efficiency of these options.</p><h2 id="vector-search-queries-with-time-filters-the-roadmap">Vector Search Queries With Time Filters: The Roadmap</h2><p>In this hands-on orientation, we will load a sample dataset into a table and run time-filtered vector search queries against it. We can understand how Postgres executes each vector search query by examining the query plans. Furthermore, we can evaluate the performance of each query.</p><p>Here’s an overview of what we’ll cover:</p><p><strong>Query plans: </strong><a href="#query-plans" rel="noreferrer"><u>First</u></a>, we will provide a brief description of query plans in Postgres. If you are a seasoned query tuner, skip ahead to the setup.</p><p><strong>Setup: </strong>In <a href="#the-setup" rel="noreferrer"><u>the second section</u></a>, we will do all our prep work. We will create a Timescale service, create a table, and load it with our sample dataset.</p><p><strong>Filtering documents by time: </strong>In <a href="#filtering-documents-by-time" rel="noreferrer"><u>the next section</u></a>, we will look at time-based filtering in isolation. We will explore the power of TimescaleDB’s hypertables (which automatically partition Postgres tables into smaller data partitions or chunks) and compare them to plain Postgres tables. Feel free to skip ahead if you are already familiar with hypertables and time-based filtering.</p><p><strong>Vector search:</strong> After looking at time-based filtering in isolation, we will look at vector search in isolation. <a href="#vector-search" rel="noreferrer"><u>This section</u></a> will introduce the features the <a href="https://github.com/pgvector/pgvector"><u>pgvector</u></a> and <a href="https://www.tigerdata.com/blog/pgai-giving-postgresql-developers-ai-engineering-superpowers" rel="noreferrer"><u>pgai on Timescale</u></a> extensions provide. 
You can safely jump past this if you know how to do vector search in Postgres.</p><p><strong>Vector search and querying by time: </strong>In <a href="#vector-search-and-filtering-by-time" rel="noreferrer"><u>this section</u></a>, we will learn how to do both time-based filtering and similarity search in a single SQL query. (When the vectors contain a large language model’s embedding of textual content, vector search is a search of semantic similarity.) We will play with a few different versions of the query and see how Postgres decides to execute them.</p><p><strong>Using vector search and time filters in AI applications: </strong>Once we’ve covered the techniques for combining vector similarity search and time filters, we’ll briefly explore <a href="#using-vector-search-and-time-filters-in-ai-applications" rel="noreferrer">how to apply what we learned to build GenAI applications</a>.</p><h2 id="query-plans">Query Plans</h2><p>When discussing query tuning in PostgreSQL, an essential concept to grasp is a "query plan." <a href="https://www.postgresql.org/docs/current/using-explain.html"><u>A query plan is a set of steps</u></a> the PostgreSQL query planner generates describing how PostgreSQL will execute a SQL query. Various factors, including the query structure, the database schema, and the statistical information about the data distribution in the tables, influence the chosen query plan.</p><p>To evaluate a query plan in PostgreSQL, you can use the <a href="https://www.postgresql.org/docs/current/sql-explain.html"><u><code>EXPLAIN</code></u></a> command followed by your query. This command returns the planner's chosen execution plan without executing the query. It will show the plan's overall cost and each step's cost. Cost is a unitless measure but can be used to compare the relative performance of one query to another. For a more detailed analysis, including the execution times, you can use <code>EXPLAIN ANALYZE</code>, which executes the query. 
However, be cautious with <code>EXPLAIN ANALYZE</code> on production systems, as it can affect performance.&nbsp;</p><p>Don’t worry! We won’t get too bogged down in the minutiae of query plans. We will use <code>EXPLAIN ANALYZE</code> to compare the speed and efficiency of various queries, but we will keep it light.</p><p><strong>Note</strong>: Because many variables influence query plans, you may see query plans that are not exactly like the ones depicted; however, they should be similar enough to draw the same conclusions.</p><p><strong>Another note</strong>: If you want to see a visual representation of query plans, there is an excellent, free tool for this at <a href="https://explain.dalibo.com/"><u>https://explain.dalibo.com/</u></a>.</p><h2 id="the-setup">The Setup</h2><p>We will load a table with a sample dataset and run queries against it to explore time-based filtering and <a href="https://www.tigerdata.com/learn/vector-search-vs-semantic-search" rel="noreferrer">semantic search</a>. We will use a modified version of the Cohere <a href="https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings"><u>wikipedia-22-12-simple-embeddings</u></a> dataset hosted on <a href="https://huggingface.co/"><u>Huggingface</u></a>. It contains embeddings of <a href="https://simple.wikipedia.org/wiki/Main_Page"><u>Simple English Wikipedia</u></a> entries. </p><p>We added synthetic data: a time column, category, and tags. We loaded the data into a Postgres table and exported it to a CSV file; therefore, the format has changed. The original dataset on Huggingface is available under the Apache 2.0 license, and thus, our modified version is also subject to the Apache 2.0 license.</p><p>You will need a database if you want to follow along. 
Head to the <a href="https://console.cloud.timescale.com/signup?utm_campaign=vectorlaunch&amp;utm_source=timescale-blog&amp;utm_medium=direct&amp;utm_content=filtering-blog"><u>Timescale console</u></a> (you can try pgai on Timescale for free with a 90-day extended trial) and create a “<a href="https://www.tigerdata.com/blog/time-series-introduction" rel="noreferrer">Time Series</a> and Analytics” service. Building vector indexes can be a compute-hungry activity, so choose 2 CPU / 8 GiB Memory compute. Choose a region close to you for the best experience.</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2024/10/Refining-Vector-Search-Queries-With-Time-Filters-in-Pgvector_Timescale-UI.png" class="kg-image" alt="The Configure your service page in the Timescale Cloud UI" loading="lazy" width="645" height="759" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2024/10/Refining-Vector-Search-Queries-With-Time-Filters-in-Pgvector_Timescale-UI.png 600w, https://timescale.ghost.io/blog/content/images/2024/10/Refining-Vector-Search-Queries-With-Time-Filters-in-Pgvector_Timescale-UI.png 645w"></figure><p>We need some sample data to load into our database. The dataset we will use is <a href="https://huggingface.co/datasets/timescale/wikipedia-22-12-simple-embeddings"><u>available</u></a> on Huggingface. <a href="https://huggingface.co/datasets/timescale/wikipedia-22-12-simple-embeddings/resolve/main/wiki.csv?download=true"><u>Use this link to download the CSV file</u></a>.</p><p>Make sure you have the <a href="https://www.tigerdata.com/blog/how-to-install-psql-on-mac-ubuntu-debian-windows/" rel="noreferrer"><u><code>psql</code> client installed locally</u></a>. You will need <code>psql</code> version 16 or greater. Run <code>psql --version</code> to see what version you have.</p><p>Connect to your Timescale service using the <code>psql</code> client. Pass the connection URL as an argument. 
Make sure the <code>wiki.csv</code> file you downloaded is in the same directory where you launch <code>psql</code>.</p><pre><code class="language-bash">psql "&lt;connection url&gt;"</code></pre><p>We will use three extensions: timescaledb, pgvector, and timescale_vector. (In SQL, the pgvector extension is named <code>vector</code>.) We need to ensure these extensions exist before creating our table.</p><pre><code class="language-SQL">create extension if not exists timescaledb;
create extension if not exists vector;
create extension if not exists timescale_vector;</code></pre><p>Run the following statement to create a table named <code>wiki</code>. The <code>embedding</code> column is of type <code>vector(768)</code> and will contain our <a href="https://platform.openai.com/docs/guides/embeddings/what-are-embeddings?ref=timescale.com"><u>embeddings</u></a>. The <a href="https://www.tigerdata.com/learn/postgresql-extensions-pgvector" rel="noreferrer">pgvector extension</a> provides the <code>vector</code> type.</p><pre><code class="language-SQL">create table public.wiki
( id int not null
, time timestamptz not null
, contents text
, meta jsonb
, embedding vector(768)
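-- 768 dimensions, matching the embeddings shipped with this dataset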
);</code></pre><p>Next, run the following command to convert the plain Postgres table into a TimescaleDB hypertable. The hypertable divides the dataset into day-sized sub-tables called chunks based on the values in the <code>time</code> column.</p><pre><code class="language-SQL">select create_hypertable
( 'public.wiki'::regclass
, 'time'::name
, chunk_time_interval=&gt;'1d'::interval
);</code></pre><p>Now that we have a place to put our data, we can load our data. Run this meta-command to load the data from the CSV file into the <code>wiki</code> table. Expect this to take a few minutes.</p><pre><code class="language-SQL">\copy wiki from 'wiki.csv' with (format csv, header on)
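-- note: \copy is a psql meta-command, so wiki.csv is read on the client and
-- streamed to the server; the file only needs to exist on your local machine.
-- to sanity-check the chunks afterward, a sketch using TimescaleDB's
-- informational view (column names per timescaledb_information.chunks):
--   select chunk_name, range_start, range_end
--   from timescaledb_information.chunks
--   where hypertable_name = 'wiki'
--   order by range_start;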
</code></pre><p>Make sure Postgres has accurate statistics for our new table. Statistics will influence the query plans.</p><pre><code class="language-SQL">analyze wiki;</code></pre><p>You should find 485,859 rows in the <code>wiki</code> table if you have done this correctly.</p><pre><code class="language-SQL">select count(*) from wiki;</code></pre><p>Finally, let’s turn off parallelism to keep our query plans simple.</p><pre><code class="language-SQL">set max_parallel_workers_per_gather = 0;</code></pre><p>You are ready to go!</p><h2 id="filtering-documents-by-time">Filtering Documents by Time</h2><p>Our <code>wiki</code> table contains a <code>timestamptz</code> column named <code>time</code>. The data in this column is fake. We have added these timestamps so that we have something to illustrate filtering by time. You would use the timestamp associated with a web page or document in real-world use.</p><p>We will use the power of TimescaleDB's <a href="https://docs.timescale.com/use-timescale/latest/hypertables/about-hypertables/"><u>hypertables</u></a> to facilitate filtering by time. A hypertable is a logical table broken up into smaller physical tables called chunks. <a href="https://docs.timescale.com/use-timescale/latest/hypertables/change-chunk-intervals/"><u>Each chunk contains data falling within a time range, and the time ranges of the chunks do not overlap.</u></a> If you are familiar with <a href="https://www.postgresql.org/docs/current/ddl-partitioning.html"><u>partitioning</u></a> or <a href="https://www.postgresql.org/docs/current/ddl-inherit.html"><u>inheritance</u></a> in Postgres, you can think of hypertables in these terms.</p><p>Why is this helpful? When querying a hypertable and filtering on a given time range, TimescaleDB can do chunk exclusion: it ignores any chunks whose time ranges fall outside the bounds of the time range filter, and Postgres expends no compute resources to process the rows in the excluded chunks. 
Eliminating large chunks of the table through chunk exclusion speeds up time-filtered search.</p><p>There are 485,859 rows in the dataset. As we loaded the data, we assigned a date to each row and incremented the date each time we loaded 50,000 rows, giving us 10 chunks. Nine chunks have 50,000 rows, and the last chunk has 35,859. Below are the time range bounds and row counts for the 10 chunks.</p><pre><code class="language-text">┌─────────────┬────────────┬───────┐
│ range_start │ range_end  │ count │
├─────────────┼────────────┼───────┤
│ 2000-01-01  │ 2000-01-02 │ 50000 │
│ 2000-01-02  │ 2000-01-03 │ 50000 │
│ 2000-01-03  │ 2000-01-04 │ 50000 │
│ 2000-01-04  │ 2000-01-05 │ 50000 │
│ 2000-01-05  │ 2000-01-06 │ 50000 │
│ 2000-01-06  │ 2000-01-07 │ 50000 │
│ 2000-01-07  │ 2000-01-08 │ 50000 │
│ 2000-01-08  │ 2000-01-09 │ 50000 │
│ 2000-01-09  │ 2000-01-10 │ 50000 │
│ 2000-01-10  │ 2000-01-11 │ 35859 │
└─────────────┴────────────┴───────┘
(10 rows)</code></pre><p>The query below filters the <code>wiki</code> table on a time range. We use <code>explain analyze</code> to get Postgres to run the query and tell us what it did.</p><pre><code class="language-SQL">explain analyze
select id
from wiki
where '2000-01-04'::timestamptz &lt;= time and time &lt; '2000-01-06'::timestamptz
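-- the half-open range [2000-01-04, 2000-01-06) lines up with exactly two
-- day-sized chunks, so chunk exclusion should let Postgres skip the other eight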
;</code></pre><p>The query plan shows that Postgres did a sequential scan on two chunks of the hypertable. Postgres returned 100,000 rows in total. Postgres excluded the other eight chunks and did not have to expend resources to process the rows in those chunks. Again, chunk exclusion is a hypertable’s superpower; it allows for faster and more efficient queries when filtering on a time range.</p><pre><code class="language-text">QUERY PLAN
Append&nbsp; (cost=0.00..10064.00 rows=100000 width=4) (actual time=0.005..30.934 rows=100000 loops=1)
&nbsp;&nbsp;-&gt;&nbsp; Seq Scan on _hyper_1_4_chunk&nbsp; (cost=0.00..4846.00 rows=50000 width=4) (actual time=0.005..12.453 rows=50000 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Filter: (('2000-01-04 00:00:00+00'::timestamp with time zone &lt;= "time") AND ("time" &lt; '2000-01-06 00:00:00+00'::timestamp with time zone))
&nbsp;&nbsp;-&gt;&nbsp; Seq Scan on _hyper_1_5_chunk&nbsp; (cost=0.00..4718.00 rows=50000 width=4) (actual time=0.006..12.622 rows=50000 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Filter: (('2000-01-04 00:00:00+00'::timestamp with time zone &lt;= "time") AND ("time" &lt; '2000-01-06 00:00:00+00'::timestamp with time zone))
Planning Time: 0.197 ms
Execution Time: 33.925 ms</code></pre><p>To illustrate the savings associated with the chunk exclusion, we can create a <code>wiki2</code> table exactly like <code>wiki</code> except <code>wiki2</code> is a plain table—not a hypertable.</p><p><strong>Note</strong>: the insert will take a minute or two. Be patient.</p><pre><code class="language-SQL">create table public.wiki2
( id int not null
, time timestamptz not null
, contents text
, meta jsonb
, embedding vector(768)
);

insert into wiki2 select * from wiki;

analyze wiki2;

explain analyze
select id
from wiki2
where '2000-01-04'::timestamptz &lt;= time and time &lt; '2000-01-06'::timestamptz
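-- same predicate as before, but wiki2 is a plain table with no chunks to
-- exclude, so Postgres must consider every one of the 485,859 rows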
;</code></pre><p>Running the same query against <code>wiki2</code> gives us the following query plan. Using the plain table is more expensive and slower than using the hypertable. In this case, Postgres had to “look at” all 485,859 rows in the table to find the 100,000 that match the time filter.&nbsp;&nbsp;</p><pre><code class="language-text">QUERY PLAN
Seq Scan on wiki2&nbsp; (cost=0.00..44853.14 rows=102261 width=4) (actual time=31.259..105.586 rows=100000 loops=1)
&nbsp;&nbsp;Filter: (('2000-01-04 00:00:00+00'::timestamp with time zone &lt;= "time") AND ("time" &lt; '2000-01-06 00:00:00+00'::timestamp with time zone))
&nbsp;&nbsp;Rows Removed by Filter: 385859
Planning Time: 0.631 ms
Execution Time: 108.655 ms</code></pre><p>You may say this is not a fair comparison because any reasonable developer would put an index on <code>time</code>. Let's try that.</p><pre><code class="language-SQL">create index on wiki2 (time);</code></pre><p>Re-running the same <code>explain analyze</code> query against <code>wiki2</code> now uses the index:</p><pre><code class="language-text">QUERY PLAN
Index Scan using wiki2_time_idx on wiki2&nbsp; (cost=0.42..10049.55 rows=102257 width=4) (actual time=0.013..26.817 rows=100000 loops=1)
&nbsp;&nbsp;Index Cond: (("time" &gt;= '2000-01-04 00:00:00+00'::timestamp with time zone) AND ("time" &lt; '2000-01-06 00:00:00+00'::timestamp with time zone))
Planning Time: 0.086 ms
Execution Time: 29.917 ms</code></pre><p>Comparing the three approaches:</p><pre><code class="language-text">┌────────────┬──────────┬────────────┐
│&nbsp; &nbsp; type&nbsp; &nbsp; │ &nbsp; cost &nbsp; │&nbsp; &nbsp; time&nbsp; &nbsp; │
├────────────┼──────────┼────────────┤
│ hypertable │ 10064.00 │&nbsp; 33.925 ms │
│ plain&nbsp; &nbsp; &nbsp; │ 44853.14 │ 108.655 ms │
│ indexed&nbsp; &nbsp; │ 10049.55 │&nbsp; 29.917 ms │
└────────────┴──────────┴────────────┘</code></pre><p>At least for this dataset on this machine type, an indexed plain table is slightly cheaper and faster than a hypertable. So, why would we use a hypertable? There are a couple of reasons. Firstly, as this dataset gets larger, it becomes more likely that the hypertable will outperform the plain table. Secondly, Postgres will use only one index per table per query in most situations. We want to use a hypertable because we are not finished filtering, and we want to use a different index altogether.</p><h2 id="vector-search">Vector Search</h2><p>As a reminder, we ultimately want to combine time filtering and vector search. Let's explore vector search in isolation first. If you are already a confident user of similarity search with pgvector, feel free to <a href="https://www.tigerdata.com/blog/refining-vector-search-queries-with-time-filters-in-pgvector-a-tutorial/#vector-search-and-filtering-by-time" rel="noreferrer"><u>jump ahead</u></a>.</p><p>Our <code>wiki</code> table has a column named <code>embedding</code> of type <code>vector(768)</code>. The pgvector extension provides the <code>vector</code> datatype. This column contains the embeddings of the wiki content.</p><p>We will grab a random vector to use as a search parameter. The <code>psql</code> client makes this easy. We can select the <code>embedding</code> column from a row and use the <a href="https://www.postgresql.org/docs/current/app-psql.html#APP-PSQL-META-COMMAND-GSET"><u><code>\gset</code> meta-command</u></a> to create a variable in the psql client named <code>emb</code> containing the vector value. We can use this variable as a parameter for future queries. In this case, we will use the vector from the row with id 65272. In subsequent queries, we will search for rows where the embedding is semantically similar to the row with id 65272.</p><pre><code class="language-SQL">select embedding as emb
from wiki
where id = 65272
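-- \gset (below) turns each output column into a psql variable named after its
-- alias, so this query leaves the vector in the client-side variable :emb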
\gset</code></pre><p>Next, we query the <code>wiki</code> table and use <a href="https://docs.timescale.com/ai/latest/key-vector-database-concepts-for-understanding-pgvector/#vector-distance-types"><u>the <code>&lt;=&gt;</code> (cosine) distance operator </u></a>provided by pgvector and timescale_vector. This operator computes the distance between two vectors, and this distance is a representation of how semantically similar the two vectors are. If we order by the similarity distance and limit the results to 10, our query will return the 10 rows most semantically similar to the content from row 65272.</p><p><strong>Note</strong>: The <code>$1</code> in the query below denotes a query parameter, and the <a href="https://www.postgresql.org/docs/current/app-psql.html#APP-PSQL-META-COMMAND-BIND"><u><code>\bind</code> meta-command</u></a> assigns the value from our <code>emb</code> psql variable to this parameter.</p><pre><code class="language-SQL">select id, embedding &lt;=&gt; $1::vector as dist
from wiki
order by dist
limit 10
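-- \bind (new in psql 16) supplies the value of :emb for the $1 parameter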
\bind :emb
;</code></pre><pre><code class="language-text">┌────────┬─────────────────────┐
│ &nbsp; id &nbsp; │&nbsp; &nbsp; &nbsp; &nbsp; dist &nbsp; &nbsp; &nbsp; &nbsp; │
├────────┼─────────────────────┤
│&nbsp; 65272 │ &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 0 │
│&nbsp; 16295 │&nbsp; 0.1115766717013853 │
│&nbsp; 16302 │ 0.11360477414644143 │
│&nbsp; 16297 │ 0.12387882580027065 │
│&nbsp; 16298 │&nbsp; 0.1299166956232466 │
│&nbsp; 34598 │ 0.13121629452715455 │
│&nbsp; 16292 │ 0.13307570937651947 │
│&nbsp; 61078 │ 0.13427528064358674 │
│ 215163 │ 0.13759063950532036 │
│&nbsp; 79614 │ 0.13770102915039784 │
└────────┴─────────────────────┘
(10 rows)</code></pre><p>As expected, row 65272 is the top result with exactly the vector we borrowed for the query. Nine more results follow with increasing distance.</p><p>Looking at the query plan for this, we find that Postgres scanned the entire table and performed the distance calculation on the fly for each of the 485,859 rows. This execution took 3,441 milliseconds, considerably longer than any time-based filtering approaches we had previously explored.</p><pre><code class="language-SQL">explain analyze
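-- row 65272 scored dist = 0 above: the cosine distance between a vector and
-- itself is zero; explain analyze shows what that exact search costs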
select id, embedding &lt;=&gt; $1::vector as dist
from wiki
order by dist
limit 10
\bind :emb
;</code></pre><pre><code class="language-text">QUERY PLAN
Limit&nbsp; (cost=62004.36..62004.39 rows=10 width=12) (actual time=3440.996..3441.002 rows=10 loops=1)
&nbsp;&nbsp;-&gt;&nbsp; Sort&nbsp; (cost=62004.36..63219.01 rows=485859 width=12) (actual time=3440.994..3441.000 rows=10 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Sort Key: ((_hyper_1_1_chunk.embedding &lt;=&gt; '[...]'::vector))
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Sort Method: top-N heapsort&nbsp; Memory: 25kB
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Result&nbsp; (cost=0.00..51505.12 rows=485859 width=12) (actual time=0.030..3357.861 rows=485859 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Append&nbsp; (cost=0.00..45431.88 rows=485859 width=22) (actual time=0.015..329.158 rows=485859 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Seq Scan on _hyper_1_1_chunk&nbsp; (cost=0.00..4788.00 rows=50000 width=22) (actual time=0.015..30.596 rows=50000 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Seq Scan on _hyper_1_2_chunk&nbsp; (cost=0.00..4724.00 rows=50000 width=22) (actual time=0.013..31.533 rows=50000 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Seq Scan on _hyper_1_3_chunk&nbsp; (cost=0.00..4660.00 rows=50000 width=22) (actual time=0.012..31.167 rows=50000 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Seq Scan on _hyper_1_4_chunk&nbsp; (cost=0.00..4596.00 rows=50000 width=22) (actual time=0.012..30.917 rows=50000 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Seq Scan on _hyper_1_5_chunk&nbsp; (cost=0.00..4468.00 rows=50000 width=22) (actual time=0.013..29.304 rows=50000 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Seq Scan on _hyper_1_6_chunk&nbsp; (cost=0.00..4340.00 rows=50000 width=22) (actual time=0.012..28.831 rows=50000 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Seq Scan on _hyper_1_7_chunk&nbsp; (cost=0.00..4276.00 rows=50000 width=22) (actual time=0.012..29.065 rows=50000 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Seq Scan on _hyper_1_8_chunk&nbsp; (cost=0.00..4148.00 rows=50000 width=22) (actual time=0.013..28.186 rows=50000 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Seq Scan on _hyper_1_9_chunk&nbsp; (cost=0.00..3956.00 rows=50000 width=22) (actual time=0.012..27.490 rows=50000 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Seq Scan on _hyper_1_10_chunk&nbsp; (cost=0.00..3046.59 rows=35859 width=22) (actual time=0.011..20.635 rows=35859 loops=1)
Planning Time: 0.231 ms
Execution Time: 3441.034 ms</code></pre><p>Can we make vector search faster and cheaper? Yes, we can do it by <a href="https://docs.timescale.com/ai/latest/sql-interface-for-pgvector-and-timescale-vector/#indexing-the-vector-data-using-indexes-provided-by-pgvector-and-timescale-vector"><u>using an index</u></a>. The pgvector extension provides two types of indexes: <code>ivfflat</code> and <code>hnsw</code>. The <a href="https://timescale.ghost.io/blog/how-we-made-postgresql-the-best-vector-database/"><u>timescale_vector extension</u></a> provides a third: <code>tsv</code>. All three index types implement <a href="https://docs.timescale.com/ai/latest/key-vector-database-concepts-for-understanding-pgvector/#vector-search-indexing-approximate-nearest-neighbor-search"><u>approximate nearest-neighbor search algorithms</u></a>. None of them will produce exact results, but they will make our searches considerably more efficient. As the number of vectors grows past a certain scale, exact searches become impractical.</p><p>This tutorial will use the <code>tsv</code> <a href="https://docs.timescale.com/ai/latest/sql-interface-for-pgvector-and-timescale-vector/#timescale-vector-index"><u>index type and its default settings</u></a>.</p><p><strong>Note</strong>: Expect creating the index to take 10 minutes or more.</p><pre><code class="language-SQL">create index on wiki using tsv (embedding);</code></pre><p>You will note that we get slightly different results when we execute the same query. Without the index, Postgres performs an exhaustive and exact search. With the index, Postgres performs an approximate search. We trade some measure of accuracy for speed and efficiency. We can see how fast and efficient it is by looking at the query plan.</p><pre><code class="language-SQL">select id, embedding &lt;=&gt; $1::vector as dist
from wiki
order by dist
limit 10
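-- same query as before, but now answered by the tsv index: an approximate
-- (ANN) search that trades a little recall for a large speedup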
\bind :emb
;</code></pre><pre><code class="language-text">┌───────┬─────────────────────┐
│&nbsp; id &nbsp; │&nbsp; &nbsp; &nbsp; &nbsp; dist &nbsp; &nbsp; &nbsp; &nbsp; │
├───────┼─────────────────────┤
│ 65272 │ &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 0 │
│ 16295 │&nbsp; 0.1115766717013853 │
│ 16302 │ 0.11360477414644143 │
│ 16297 │ 0.12387882580027065 │
│ 16298 │&nbsp; 0.1299166956232466 │
│ 34598 │ 0.13121629452715455 │
│ 16292 │ 0.13307570937651947 │
│ 61078 │ 0.13427528064358674 │
│ 79614 │ 0.13770102915039784 │
│ 16307 │&nbsp; 0.1408329687219958 │
└───────┴─────────────────────┘
(10 rows)</code></pre><p>The query plan shows that the query cost 2720.81 and took 24.642 milliseconds to execute. The version of the query that did not use the index cost 62004.39 and took 3441.034 milliseconds. The index made a huge difference! Instead of doing a sequential scan on all the chunks, Postgres did an index scan on each chunk.</p><pre><code class="language-text">QUERY PLAN
Limit&nbsp; (cost=2719.51..2720.81 rows=10 width=12) (actual time=24.317..24.578 rows=10 loops=1)
&nbsp;&nbsp;-&gt;&nbsp; Result&nbsp; (cost=2719.51..66108.97 rows=485859 width=12) (actual time=24.316..24.575 rows=10 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Merge Append&nbsp; (cost=2719.51..60035.73 rows=485859 width=22) (actual time=24.315..24.572 rows=10 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Sort Key: ((_hyper_1_1_chunk.embedding &lt;=&gt; '[...]'::vector))
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Index Scan using _hyper_1_1_chunk_wiki_embedding_idx on _hyper_1_1_chunk&nbsp; (cost=279.85..5496.65 rows=50000 width=22) (actual time=1.846..1.982 rows=7 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Order By: (embedding &lt;=&gt; '[...]'::vector)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Index Scan using _hyper_1_2_chunk_wiki_embedding_idx on _hyper_1_2_chunk&nbsp; (cost=279.85..5426.25 rows=50000 width=22) (actual time=2.247..2.361 rows=4 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Order By: (embedding &lt;=&gt; '[...]'::vector)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Index Scan using _hyper_1_3_chunk_wiki_embedding_idx on _hyper_1_3_chunk&nbsp; (cost=279.85..5355.85 rows=50000 width=22) (actual time=2.327..2.327 rows=1 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Order By: (embedding &lt;=&gt; '[...]'::vector)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Index Scan using _hyper_1_4_chunk_wiki_embedding_idx on _hyper_1_4_chunk&nbsp; (cost=279.85..5285.45 rows=50000 width=22) (actual time=2.265..2.265 rows=1 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Order By: (embedding &lt;=&gt; '[...]'::vector)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Index Scan using _hyper_1_5_chunk_wiki_embedding_idx on _hyper_1_5_chunk&nbsp; (cost=279.85..5144.65 rows=50000 width=22) (actual time=2.849..2.849 rows=1 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Order By: (embedding &lt;=&gt; '[...]'::vector)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Index Scan using _hyper_1_6_chunk_wiki_embedding_idx on _hyper_1_6_chunk&nbsp; (cost=279.85..5003.85 rows=50000 width=22) (actual time=2.625..2.625 rows=1 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Order By: (embedding &lt;=&gt; '[...]'::vector)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Index Scan using _hyper_1_7_chunk_wiki_embedding_idx on _hyper_1_7_chunk&nbsp; (cost=279.85..4933.45 rows=50000 width=22) (actual time=2.805..2.805 rows=1 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Order By: (embedding &lt;=&gt; '[...]'::vector)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Index Scan using _hyper_1_8_chunk_wiki_embedding_idx on _hyper_1_8_chunk&nbsp; (cost=279.85..4792.65 rows=50000 width=22) (actual time=2.496..2.497 rows=1 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Order By: (embedding &lt;=&gt; '[...]'::vector)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Index Scan using _hyper_1_9_chunk_wiki_embedding_idx on _hyper_1_9_chunk&nbsp; (cost=279.85..4581.45 rows=50000 width=22) (actual time=2.160..2.161 rows=1 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Order By: (embedding &lt;=&gt; '[...]'::vector)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Index Scan using _hyper_1_10_chunk_wiki_embedding_idx on _hyper_1_10_chunk&nbsp; (cost=200.69..3516.08 rows=35859 width=22) (actual time=2.689..2.689 rows=1 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Order By: (embedding &lt;=&gt; '[...]'::vector)
Planning Time: 0.288 ms
Execution Time: 24.642 ms
</code></pre><p>Notice that we got different results from the two queries. The version without the index had to calculate the exact distances on the fly—a big part of its slowness. The indexed version used an approximate nearest-neighbor algorithm. Its results were not exact, but they were delivered faster and cheaper.</p><h2 id="vector-search-and-filtering-by-time">Vector Search and Filtering by Time</h2><p>Now that we have explored filtering by time and vector search in isolation, can we combine the two in a single efficient query?</p><pre><code class="language-SQL">select id, embedding &lt;=&gt; $1::vector as dist
from wiki
where '2000-01-02'::timestamptz &lt;= time and time &lt; '2000-01-04'::timestamptz
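-- the time predicate lets TimescaleDB's planner skip (exclude) any chunks whose
-- time range falls entirely outside [2000-01-02, 2000-01-04)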
order by dist
limit 10
\bind :emb
;</code></pre><p><strong>Note</strong>: Our results are "worse" in terms of raw semantic similarity because the time filter has eliminated some of the most similar rows. This outcome is expected. In exchange, we get more <em>relevant</em> rows, since rows with high similarity that fall outside our time period of interest are excluded.</p><pre><code class="language-text">┌────────┬─────────────────────┐
│ &nbsp; id &nbsp; │&nbsp; &nbsp; &nbsp; &nbsp; dist &nbsp; &nbsp; &nbsp; &nbsp; │
├────────┼─────────────────────┤
│&nbsp; 65272 │ &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 0 │
│&nbsp; 61078 │ 0.13427528064358674 │
│&nbsp; 79614 │ 0.13770102915039784 │
│&nbsp; 79612 │ 0.14383225467703376 │
│&nbsp; 61080 │&nbsp; 0.1444315737261198 │
│&nbsp; 52634 │&nbsp; 0.1436081460507892 │
│&nbsp; 64473 │ 0.14383654656162292 │
│&nbsp; 61079 │&nbsp; 0.1474735568613179 │
│ 131920 │ 0.15066526846960182 │
│&nbsp; 64475 │ &nbsp; 0.154294958183531 │
└────────┴─────────────────────┘
(10 rows)</code></pre><p>The query plan reveals that Postgres considers only the two chunks that match the time filter and uses the <code>tsv</code> vector index on each of those chunks. We have the best of both worlds!</p><pre><code class="language-text">QUERY PLAN
Limit&nbsp; (cost=559.71..561.01 rows=10 width=12) (actual time=4.631..5.025 rows=10 loops=1)
&nbsp;&nbsp;-&gt;&nbsp; Result&nbsp; (cost=559.71..13532.11 rows=100000 width=12) (actual time=4.630..5.023 rows=10 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Merge Append&nbsp; (cost=559.71..12282.11 rows=100000 width=22) (actual time=4.628..5.019 rows=10 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Sort Key: ((_hyper_1_2_chunk.embedding &lt;=&gt; '[...]'::vector))
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Index Scan using _hyper_1_2_chunk_wiki_embedding_idx on _hyper_1_2_chunk&nbsp; (cost=279.85..5676.25 rows=50000 width=22) (actual time=2.349..2.682 rows=9 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Order By: (embedding &lt;=&gt; '[...]'::vector)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Filter: (('2000-01-02 00:00:00+00'::timestamp with time zone &lt;= "time") AND ("time" &lt; '2000-01-04 00:00:00+00'::timestamp with time zone))
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Index Scan using _hyper_1_3_chunk_wiki_embedding_idx on _hyper_1_3_chunk&nbsp; (cost=279.85..5605.85 rows=50000 width=22) (actual time=2.277..2.334 rows=2 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Order By: (embedding &lt;=&gt; '[...]'::vector)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Filter: (('2000-01-02 00:00:00+00'::timestamp with time zone &lt;= "time") AND ("time" &lt; '2000-01-04 00:00:00+00'::timestamp with time zone))
Planning Time: 0.257 ms
Execution Time: 5.056 ms</code></pre><p>Could we have achieved the same with the indexed plain table? Let's add a <code>tsv</code> index to <code>wiki2</code> and find out.</p><p><strong>Note</strong>: Expect creating the index to take 15 minutes or more.</p><pre><code class="language-SQL">create index on wiki2 using tsv (embedding);</code></pre><pre><code class="language-SQL">explain analyze
select id, embedding &lt;=&gt; $1::vector as dist
from wiki2
where '2000-01-04'::timestamptz &lt;= time and time &lt; '2000-01-06'::timestamptz
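-- note: wiki2 is a plain (non-hypertable) table, so no chunk exclusion applies here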
order by dist
limit 10
\bind :emb
;</code></pre><p>For a query like this, Postgres will typically use only one index per table. (It can combine multiple indexes with bitmap scans, but not for an ordered approximate nearest-neighbor scan.) In this specific example on this machine, it chooses the vector index on <code>embedding</code> and filters on time on the fly. If you run this yourself, you may get the same plan, or Postgres may instead do an index scan on the <code>time</code> column to filter by time and then compute the distances on the fly.</p><p>In this case, the query is as fast as the <a href="https://www.tigerdata.com/blog/database-indexes-in-postgresql-and-timescale-cloud-your-questions-answered" rel="noreferrer">hypertable</a> approach but nearly five times more expensive. Crucially, as the dataset grows, the hypertable version will increasingly outperform the plain-table version because the hypertable will use chunk exclusion.</p><pre><code class="language-text">QUERY PLAN
Limit&nbsp; (cost=2709.44..2714.22 rows=10 width=12) (actual time=3.029..5.078 rows=10 loops=1)
&nbsp;&nbsp;-&gt;&nbsp; Index Scan using wiki2_embedding_idx on wiki2&nbsp; (cost=2709.44..51574.47 rows=102257 width=12) (actual time=3.028..5.074 rows=10 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Order By: (embedding &lt;=&gt; '[...]'::vector)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Filter: (('2000-01-04 00:00:00+00'::timestamp with time zone &lt;= "time") AND ("time" &lt; '2000-01-06 00:00:00+00'::timestamp with time zone))
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Rows Removed by Filter: 73
Planning Time: 0.093 ms
Execution Time: 5.096 ms</code></pre><h3 id="other-approaches-that-did-not-perform-as-well">Other approaches that did not perform as well</h3><p>Can we force Postgres to use both indexes on one plain table? Yes! We can use a subquery. In the subquery, we order by the vector distance using the vector index. In the outer query, we filter on time using the index on the <code>time</code> column. By joining the two, we get only the results that match both criteria. In other words, querying this way creates two resultsets, one via each index, and these resultsets are then intersected.</p><p><strong>Note</strong>: Strictly speaking, we cannot force Postgres to use particular indexes; Postgres does not support query hints. However, the structure of a query influences the plan, so by restructuring the query we can politely suggest the plan we want.</p><pre><code class="language-SQL">explain analyze
select w.id, x.dist
from wiki2 w
inner join
(
  select id, embedding &lt;=&gt; $1::vector as dist
  from wiki2
  order by dist
  limit 10
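  -- only these 10 nearest neighbors reach the time filter in the outer query;
  -- raising this limit (over-fetching) is one common way to soften post-filtering losses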
) x on (w.id = x.id)
where '2000-01-02'::timestamptz &lt;= time and time &lt; '2000-01-04'::timestamptz
order by x.dist
\bind :emb
;</code></pre><pre><code class="language-text">QUERY PLAN
Sort&nbsp; (cost=12868.26..12868.27 rows=2 width=12) (actual time=36.700..36.703 rows=3 loops=1)
&nbsp;&nbsp;Sort Key: x.dist
&nbsp;&nbsp;Sort Method: quicksort&nbsp; Memory: 25kB
&nbsp;&nbsp;-&gt;&nbsp; Hash Join&nbsp; (cost=2711.06..12868.25 rows=2 width=12) (actual time=5.720..36.693 rows=3 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Hash Cond: (w.id = x.id)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Index Scan using wiki2_time_idx on wiki2 w&nbsp; (cost=0.42..9784.27 rows=99553 width=4) (actual time=0.013..25.961 rows=100000 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Index Cond: (("time" &gt;= '2000-01-02 00:00:00+00'::timestamp with time zone) AND ("time" &lt; '2000-01-04 00:00:00+00'::timestamp with time zone))
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Hash&nbsp; (cost=2710.51..2710.51 rows=10 width=12) (actual time=3.111..3.112 rows=10 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Buckets: 1024&nbsp; Batches: 1&nbsp; Memory Usage: 9kB
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Subquery Scan on x&nbsp; (cost=2709.44..2710.51 rows=10 width=12) (actual time=2.693..3.096 rows=10 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Limit&nbsp; (cost=2709.44..2710.41 rows=10 width=12) (actual time=2.692..3.092 rows=10 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Index Scan using wiki2_embedding_idx on wiki2&nbsp; (cost=2709.44..50104.18 rows=485859 width=12) (actual time=2.690..3.089 rows=10 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Order By: (embedding &lt;=&gt; '[...]'::vector)
Planning Time: 0.258 ms
Execution Time: 36.737 ms</code></pre><p>The query plan shows that Postgres used both indexes. Surprisingly, using both indexes is slower and more costly than both previous approaches. While the subquery returns only 10 rows, the time-index scan returns 100,000. Postgres builds a hash table from the 10 subquery rows and then probes it with each of the 100,000 rows from the time-range scan. Fetching those 100,000 rows and probing the hash table with them is where the bulk of the execution time goes.</p><pre><code class="language-text">┌───────┬─────────────────────┐
│&nbsp; id &nbsp; │&nbsp; &nbsp; &nbsp; &nbsp; dist &nbsp; &nbsp; &nbsp; &nbsp; │
├───────┼─────────────────────┤
│ 65272 │ &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 0 │
│ 61078 │ 0.13427528064358674 │
│ 79614 │ 0.13770102915039784 │
└───────┴─────────────────────┘
(3 rows)</code></pre><p>More importantly, we only get three results using the two-index approach. Why is this the case? The vector search subquery considers all the rows regardless of time. It returns the 10 most similar rows from the whole dataset. Unfortunately, seven of these 10 fall outside the time range we are filtering on and, therefore, get excluded from the results. This effect illustrates the classic post-filtering problem common in vector search use cases. Filtering by time after a similarity search can leave you with fewer results than expected or even none at all.</p><p>What happens if we turn this query “inside out”? We can put the time filtering in a subquery and do the vector search in the outer query like so:</p><pre><code class="language-SQL">explain analyze
select w.id, w.embedding &lt;=&gt; $1::vector as dist
from wiki2 w
where exists
(
  select 1
  from wiki2 x
  where '2000-01-02'::timestamptz &lt;= x.time and x.time &lt; '2000-01-04'::timestamptz
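  -- the EXISTS semi-join restricts the outer scan to ids inside the time range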
  and w.id = x.id
)
order by dist
limit 10
\bind :emb
;</code></pre><pre><code class="language-text">┌────────┬─────────────────────┐
│ &nbsp; id &nbsp; │&nbsp; &nbsp; &nbsp; &nbsp; dist &nbsp; &nbsp; &nbsp; &nbsp; │
├────────┼─────────────────────┤
│&nbsp; 65272 │ &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 0 │
│&nbsp; 61078 │ 0.13427528064358674 │
│&nbsp; 79614 │ 0.13770102915039784 │
│&nbsp; 52634 │&nbsp; 0.1436081460507892 │
│&nbsp; 79612 │ 0.14383225467703376 │
│&nbsp; 64473 │ 0.14383654656162292 │
│&nbsp; 61080 │&nbsp; 0.1444315737261198 │
│&nbsp; 61079 │&nbsp; 0.1474735568613179 │
│ 131920 │ 0.15066526846960182 │
│&nbsp; 64475 │ &nbsp; 0.154294958183531 │
└────────┴─────────────────────┘
(10 rows)
</code></pre><pre><code class="language-text">QUERY PLAN
Limit&nbsp; (cost=58235.37..58235.39 rows=10 width=12) (actual time=1394.904..1394.909 rows=10 loops=1)
&nbsp;&nbsp;-&gt;&nbsp; Sort&nbsp; (cost=58235.37..58484.25 rows=99553 width=12) (actual time=1394.902..1394.906 rows=10 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Sort Key: ((w.embedding &lt;=&gt; '[...]'::vector))
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Sort Method: top-N heapsort&nbsp; Memory: 25kB
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Hash Semi Join&nbsp; (cost=11028.68..56084.06 rows=99553 width=12) (actual time=294.332..1380.233 rows=100000 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Hash Cond: (w.id = x.id)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Seq Scan on wiki2 w&nbsp; (cost=0.00..42423.59 rows=485859 width=22) (actual time=0.034..616.130 rows=485859 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Hash&nbsp; (cost=9784.27..9784.27 rows=99553 width=4) (actual time=40.376..40.378 rows=100000 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Buckets: 131072&nbsp; Batches: 1&nbsp; Memory Usage: 4540kB
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-&gt;&nbsp; Index Scan using wiki2_time_idx on wiki2 x&nbsp; (cost=0.42..9784.27 rows=99553 width=4) (actual time=0.016..28.235 rows=100000 loops=1)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Index Cond: (("time" &gt;= '2000-01-02 00:00:00+00'::timestamp with time zone) AND ("time" &lt; '2000-01-04 00:00:00+00'::timestamp with time zone))
Planning Time: 0.164 ms
Execution Time: 1394.960 ms
</code></pre><p>Unfortunately, Postgres used the index on time but chose to ignore the vector index. It computed the exact distances on the fly instead of using an approximate nearest-neighbor search. Unsurprisingly, this approach is slower and more expensive than the single-index query plan on the plain table. It is also much slower and more costly than the hypertable approach.</p><h2 id="using-vector-search-and-time-filters-in-ai-applications">Using Vector Search and Time Filters in AI Applications</h2><p>Incorporating time-based filtering with semantic search allows AI applications to offer users a more nuanced, context-aware interaction. This approach ensures that the information, products, or content presented is not just relevant but also timely, providing a more valuable user experience.</p><ol><li><strong>Content discovery platforms</strong>: Imagine a streaming service that not only recommends movies based on your viewing history but also prioritizes new releases or timely content, such as holiday-themed movies. By applying time filters to semantic search, these platforms can dynamically adjust recommendations, ensuring users are presented with content that's relevant not just to their interests but also to their recent viewing history or seasonal trends.</li><li><strong>News aggregation</strong>: Time-based filtering can revolutionize how readers engage with information in news and content aggregation. Time filters can prioritize the latest news, ensuring readers can access the most current information and trends relevant to their query, while semantic search simultaneously sifts through the articles to find ones related to a user's interests.</li><li><strong>E-commerce</strong>: For e-commerce platforms, combining semantic search with time-based filters can enhance shopping experiences by promoting products related to a user’s recent purchases, upcoming events, or seasonal trends. 
For instance, suggesting winter sports gear as the season approaches or highlighting flash sale items within a limited time frame makes the shopping experience more relevant and timely.</li><li><strong>Academic research</strong>: Researchers often seek their field's most recent studies or articles. By integrating time filters with semantic search, academic databases can provide search results that are contextually relevant and prioritize the latest research, facilitating access to cutting-edge knowledge and discoveries.</li></ol><h2 id="next-steps">Next Steps</h2><p>So, what have we learned? We can combine vector search with time-based filtering in Postgres to retrieve more temporally relevant vectors. While we can achieve this with a plain table, we can write simpler, faster, and more efficient queries that continue to perform as your dataset grows using TimescaleDB’s hypertables.</p><p>Use pgvector on Timescale today! <a href="https://console.cloud.timescale.com/signup?utm_campaign=vectorlaunch&amp;utm_source=timescale-blog&amp;utm_medium=direct&amp;utm_content=filtering-blog"><u>Follow this link</u></a> to get a 90-day free trial!</p><p>Read more:</p><ul><li><a href="https://docs.timescale.com/ai/latest/"><u>pgai on Timescale documentation</u></a></li><li><a href="https://youtu.be/JDVU0k30cGA?si=_xoIDzbtQQ8WJen9"><u>pgvector and Timescale Vector Up and Running</u></a></li><li><a href="https://youtu.be/EYMZVfKcRzM?si=o2BHV6fEqpaxjufO"><u>LlamaIndex Webinar: Time-based retrieval for RAG (with Timescale)</u></a></li></ul>]]></content:encoded>
        </item>
    </channel>
</rss>