---
title: "Semantic Search with OpenAI and PostgreSQL in 10 Minutes"
published: 2025-01-15T08:29:18.000-05:00
updated: 2025-03-24T15:45:51.000-04:00
excerpt: "No AI expertise required: Learn how to do semantic search with OpenAI and PostgreSQL in 10 minutes."
tags: AI
authors: Team Tiger Data
---

> **TimescaleDB is now Tiger Data.**

From chatbots to automotive experiences, we’re witnessing large language models (LLMs) revolutionizing industries and driving the AI adoption race. If you’re familiar with [retrieval-augmented generation (RAG](https://www.timescale.com/blog/which-rag-chunking-and-formatting-strategy-is-best)), you know it’s a game-changer for giving LLMs the context they need to excel in various applications. This context comes from digging into vast amounts of data using search algorithms, with semantic search leading the pack.

In the early days of RAG, semantic search was exclusive to vector databases, making it seem like a tool only AI engineers could wield. But thanks to plugins like [pgai](https://github.com/timescale/pgai) and [pgvector](https://github.com/pgvector/pgvector) alongside platforms like OpenAI, semantic search has broken free from those boundaries. Now, you can unlock this powerful capability in popular databases like PostgreSQL, putting it directly in your hands—no AI expertise required.

## Semantic Search and Vector Embeddings

[Semantic search, or dense vector search](https://www.timescale.com/learn/understanding-semantic-search), is our go-to strategy for finding the most relevant results by focusing on word associations and meanings. Instead of relying on plain keywords, it uses dense vectors—high-dimensional arrays filled with meaningful data.

The magic happens by calculating the similarity between these embeddings, helping us uncover related information lightning-fast. One popular method is [**cosine similarity**](https://www.timescale.com/learn/understanding-cosine-similarity), which measures how close two vectors are to each other. It’s as simple as this formula:

![](https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/03/cosine-formula-blog-thumbnail.png)

The closer the result is to 1, the more similar the vectors—and the better the search results.

In this article, we’ll set up and perform a semantic search using **pgai**, **pgvector**, and **OpenAI** in just 10 minutes. Let the countdown begin!

## Prerequisites

Before proceeding, we’ll need an API key from [OpenAI](https://platform.openai.com/docs/overview). Save this API key, preferably as an environment variable. You can find detailed instructions in the GitHub [repository](https://github.com/timescale/pgai/tree/main).

Install and import the required libraries:

```
!pip install pandas pyarrow pgvector openai psycopg2
```

```Python
import pandas as pd
import openai
import psycopg2
import os
from ast import literal_eval
```

Now, initialize the OpenAI client.

```Python
client = openai.OpenAI(api_key= OPENAI_API_KEY)
```

## Database Setup

You can set up the database either locally or in the cloud. If setting it up locally, ensure that PGAdmin is installed and properly configured. For this tutorial, we’ll use a [Timescale Cloud](https://console.cloud.timescale.com) instance. Visit Timescale Cloud and set up your instance to get started.

![](https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/01/Semantic-Search-with-OpenAI-and-PostgreSQL-in-10-Minutes_create-service.png)

Make sure to enable the AI and vector options, then create the service.

![](https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/01/Semantic-Search-with-OpenAI-and-PostgreSQL-in-10-Minutes_db-semantic-search.png)

Once the instance is up and running, retrieve your credentials from the page.

![](https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/01/Semantic-Search-with-OpenAI-and-PostgreSQL-in-10-Minutes_connect-to-service.png)

Now, establish the client connection. 

```Python
conn = psycopg2.connect(DATABASE_CONNECTION_STRING)
```

You’ll need to enable the extensions to start working with pgvector and pgai.

```Python
def setup_database():
   pgvector = """CREATE EXTENSION IF NOT EXISTS vector"""
   pgai = """CREATE EXTENSION IF NOT EXISTS ai CASCADE"""
   pgvectorscale = """CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE"""

   with conn.cursor() as cursor:
      cursor.execute(pgvector)
      cursor.execute(pgvectorscale)
      cursor.execute(pgai)

setup_database()
```

## Dataset and Vector Embeddings

Create the database table using the following SQL query:

```SQL
CREATE TABLE IF NOT EXISTS movies (
    id bigint primary key generated by default as identity,
    title TEXT NOT NULL,
    overview TEXT,
    genres TEXT,
    producer TEXT,
    "cast" TEXT
);
```

We’ll be using the [Cohere Movies](https://huggingface.co/datasets/Cohere/movies) dataset. To load the dataset, we can simply use the [load\_dataset](https://github.com/timescale/pgai/blob/main/docs/load_dataset_from_huggingface.md) utility provided by pgai, which allows you to load datasets from Hugging Face's datasets library directly into your PostgreSQL database.

```Python
# Use ai.load_dataset to load the dataset
with conn.cursor() as cur:
    # Load the 'movies' dataset from Hugging Face
    cur.execute("""
        SELECT ai.load_dataset(
            name => 'Cohere/movies',
            table_name => 'movies',
            if_table_exists => 'drop',
            field_types => '{"title": "TEXT", "overview": "TEXT", "genres": "TEXT", "producer": "TEXT", "cast": "TEXT"}'::jsonb
        );
    """)
    conn.commit()

print("Dataset loaded successfully into the 'movies' table.")
```

We will also set up [pgai Vectorizer](https://github.com/timescale/pgai/blob/main/docs/vectorizer.md) to automatically generate embeddings using OpenAI’s `text-embedding-3-small` embedding model. Here’s how you can do it: 

```Python
with conn.cursor() as cur:
  cur.execute("""
  SELECT ai.create_vectorizer(
      'movies'::regclass,
      destination => 'movies_embeddings_vectorized',
      embedding => ai.embedding_openai('text-embedding-3-small', 1536),
      chunking => ai.chunking_recursive_character_text_splitter('overview'),
      formatting => ai.formatting_python_template(
          '$title: $chunk'
      )
  );
  """)
print("Vectorizer created successfully.")
```

With that, you’re all set to perform a vector search across your new database. You can also visualize the database using your Timescale dashboard. 

![](https://storage.ghost.io/c/6b/cb/6bcb39cf-9421-4bd1-9c9d-fa7b6755ba0e/content/images/2025/01/Semantic-Search-with-OpenAI-and-PostgreSQL-in-10-Minutes_schema.png)

## Semantic Search

Define the function for performing semantic search.

```SQL

def semantic_search(conn, query):
    sql = """
    WITH query_embedding AS (
        SELECT openai_embed(
           'text-embedding-3-small'
           , %s
           , _api_key=>%s
       ) AS embedding
    )
    SELECT title, overview
    FROM movies, query_embedding
    ORDER BY movies.embedding <#> query_embedding.embedding
    LIMIT 5
    """
    with conn.cursor() as cur:
        cur.execute(sql, (query,))
        return cur.fetchall()
```

Here’s a quick walkthrough of what the SQL query is doing:

1.  Converts the user query into vector embeddings.
2.  Compares the cosine distance between the user query embedding and the embeddings in the database.
3.  Retrieves and returns the top five most similar records from the database.

You can also experiment with other distance functions for similarity searches.

![](https://lh7-rt.googleusercontent.com/docsz/AD_4nXck_1xPbJCvrR4l0Bvzh8iij_tFhex4UTrnPKT0pb8DywJ6rD5T3o25wt-AZj1KuFf38Sbqh8lFSr9MDsbcNbeuPQm7yiYgdnUOivrznKUt8glg2FngvM0qzAgsvzRcPI4JXyyw?key=V-LOXkaPOHkZJ_fJK2GMaGLL)

Now, test this function with your custom query.

```
query = "A thrilling mystery with unexpected twists and turns."
results = semantic_search(conn, query)
```

```
[
    ('Shutter Island',
     'World War II soldier-turned-U.S. Marshal Teddy Daniels investigates the disappearance of a patient from a hospital for the criminally insane, but his efforts are compromised by his troubling visions and also by a mysterious doctor.'),
    ('The Sixth Sense',
     'A psychological thriller about an eight year old boy named Cole Sear who believes he can see into the world of the dead. A child psychologist named Malcolm Crowe comes to Cole to help him deal with his problem, learning that he really can see ghosts of dead people.'),
    ('Angels & Demons',
     'Harvard symbologist Robert Langdon investigates a mysterious symbol seared into the chest of a murdered physicist. He discovers evidence of the unimaginable, the rebirth of an ancient secret brotherhood known as the Illuminati, the most powerful underground organization ever to walk the earth.'),


('Gone Girl',
     'With his wife's disappearance having become the focus of an intense media circus, a man sees the spotlight turned on him when it's suspected that he may not be innocent.'),


    ('The Girl with the Dragon Tattoo',
     'This English-language adaptation of the Swedish novel by Stieg Larsson follows a disgraced journalist, Mikael Blomkvist, as he investigates the disappearance of a weary patriarch's niece from 40 years ago. He is aided by the pierced, tattooed, punk computer hacker named Lisbeth Salander. As they work together in the investigation, Blomkvist and Salander uncover immense corruption beyond anything they have ever imagined.')
]
```

## Wrapping Up

That brings us to the end of this tutorial. Here’s a quick recap of what we covered. We learned how to set up [pgvector](https://github.com/pgvector/pgvector) and [pgai](https://github.com/timescale/pgai) (with [pgai Vectorizer](https://github.com/timescale/pgai/blob/main/docs/vectorizer.md)) for our database and how to convert data into embeddings using OpenAI embedding models. Finally, we prepared a query and performed a semantic search on the database—all in under 10 minutes.

Handle all your embedding operations in PostgreSQL—no specialized databases needed. Install [pgai Vectorizer](https://github.com/timescale/pgai/blob/main/docs/vectorizer.md) and try it yourself (⭐s much appreciated!).