---
title: "Pgai Vectorizer Meets Python: Integrating SQLAlchemy and Alembic"
published: 2025-02-20T09:00:11.000-05:00
updated: 2025-02-20T09:00:11.000-05:00
excerpt: "You can now seamlessly integrate pgai Vectorizer and Python using SQLAlchemy and Alembic to automate embedding creation and management. Learn how."
tags: AI
authors: Hervé Ishimwe
---

> **TimescaleDB is now Tiger Data.**

The launch of [pgai Vectorizer](https://github.com/timescale/pgai/blob/main/docs/vectorizer/quick-start.md) has transformed how developers incorporate vector embeddings into their applications. With one SQL command, they [can automate embedding creation and management, ensuring synchronization with evolving source data](https://www.timescale.com/blog/how-to-automatically-create-update-embeddings-in-postgresql)—gone are the traditional manual and time-consuming processes. 

However, not all developers interact with PostgreSQL directly through SQL. Many rely on integration libraries and frameworks that abstract SQL, offering a more familiar way to manage databases. This blog post will show you how to seamlessly integrate pgai Vectorizer with Python, one of the most widely used and popular programming languages. We’ll use [SQLAlchemy](https://www.sqlalchemy.org/), the Python object-relational mapper (ORM) and SQL toolkit, and [Alembic](https://alembic.sqlalchemy.org/en/latest/), its database migration tool. 

We’ll walk you through the example of a blog application with different blog posts whose titles and content are stored in a table. The goal is to set up a vectorizer that generates and manages embeddings of the blog posts’ content. 

## Pgai Vectorizer: Automated Embedding Creation & Management

Pgai Vectorizer simplifies embedding workflows by automating several steps:

-   **Chunking and formatting strategies** to generate high-quality embeddings
-   **Integration with multiple embedding providers**, including OpenAI, Cohere, Ollama, and LiteLLM
-   **Multiple embedding configurations** for [experimentation with different embedding models](https://www.timescale.com/blog/open-source-vs-openai-embeddings-for-rag) to find the best embedding model for your use case
-   **Automatic synchronization** between source data and embeddings, reducing development overhead

Let’s dive into the steps to add it to our example setup using SQLAlchemy and Alembic! 

## Installing and Importing the Pgai Python Library 

To begin, install the `pgai`’s Python library with SQLAlchemy support:

```Python
pip install "pgai[sqlalchemy]"
```

## Creating a Vectorizer Using Alembic 

To integrate pgai Vectorizer,  define a migration script using [Alembic](https://alembic.sqlalchemy.org/en/latest/). First, register pgai’s Alembic operations in your `env.py`:

```Python
from pgai.alembic import register_operations

register_operations()
```

If you are using Alembic’s autogenerate feature to generate migrations, make sure also to add the following code to your `env.py` to ensure that the tables generated and managed under `pgai` Alembic operations are excluded from the autogenerate process:

```
def include_object(object, name, type_, reflected, compare_to):
    if type_ == "table" and name in target_metadata.info.get("pgai_managed_tables", set()):
        return False
    return True

context.configure(
      connection=connection,
      target_metadata=target_metadata,
      include_object=include_object
  )
```

Assuming the blog posts are stored in a table named `blog_posts`, we create a migration script to generate embeddings using OpenAI’s `text-embedding-3-small` as the embedding model: 

```
from alembic import op
from pgai.vectorizer.configuration import (
    EmbeddingOpenaiConfig,
    ChunkingCharacterTextSplitterConfig,
    FormattingPythonTemplateConfig
)


def upgrade() -> None:
    op.create_vectorizer(
        source="blog_posts",
        target_table ='blog_posts_embedding_store',
        view_table ='blog_posts_embedding',
        embedding=EmbeddingOpenaiConfig(
            model='text-embedding-3-small',
            dimensions=1536
        ),
        chunking=ChunkingRecursiveCharacterTextSplitterConfig(
            chunk_column='content',
            chunk_size=800,
            chunk_overlap=400,
            separators=['.', ' ']
        ),
        formatting=FormattingPythonTemplateConfig(template='$title - $chunk')
    )


def downgrade() -> None:
    op.drop_vectorizer(target_table="blog_posts_embedding_store", drop_all=True)
```

A few things to note from the code above:

-   The create\_vectorizer function creates **the embedding table** and **a view** joining the embedding and source tables. The parameters `target_table` and `view_table` define the table and view names, respectively. 
-   Use the function `ChunkingRecursiveCharacterTextSplitterConfig` to indicate the **column from the source table whose content will be used to generate the embeddings**. We can also define different parameters for splitting the content to fit the context window of the embedding model. 

These chunks are also padded with extra information (i.e., **the blog post title**) following the pattern defined in the function, `FormattingPythonTemplateConfig`, which is one of the ways to enrich the text provided to the embedding models and embeddings generated. 

You can find more parameters in [Vectorizer’s SQL API reference.](https://github.com/timescale/pgai/blob/main/docs/vectorizer/api-reference.md)

## Interacting With the Embeddings Through SQLAlchemy

Pgai’s Python library provides the preconfigured SQLAlchemy relationship, [`vectorizer_relationship`](https://github.com/timescale/pgai/blob/main/docs/vectorizer/python-integration.md), through all ORM interactions with the vectorizers are done.  

```
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column
from pgai.sqlalchemy import vectorizer_relationship

class Base(DeclarativeBase):
    pass

class BlogPost(Base):
    __tablename__ = "blog_posts"

    id: Mapped[int] = mapped_column(primary_key=True)
    title: Mapped[str]
    content: Mapped[str]

    # Add vector embeddings for the content field
    content_embeddings = vectorizer_relationship(
        dimensions=1536
    )
```

Besides the dimensions parameter, `vectorizer_relationship` also supports more parameters, including:

-   `target_schema`: the schema of the embedding table; if not specified, the embedding table inherits the parent (source table)’s schema. 
-   `target_table`: the name of the embedding table.

You can treat this relationship as any SQLAlchemy relationship, which you can [configure by setting `parent_kwargs` parameters](https://github.com/timescale/pgai/blob/main/docs/vectorizer/python-integration.md#configuration). 

### Querying embeddings

This relationship allows you to access different embedding properties defined in the embedding table as well as join embedding queries with regular SQL queries as follows:

```Python
# Filtering over the blog posts and embeddings at the same time
results = (
    session.query(BlogPost, BlogPost.content_embeddings)
    .join(BlogPost.content_embeddings)
    .filter(BlogPost.title.ilike("%search term%"))
    .all()
)

# Accessing embedding properties
for post, embedding in results:
    print(post.title)         
    print(embedding.embedding)      # The vector embedding
    print(embedding.chunk)          # The text chunk
    print(embedding.chunk_seq)      # The sequence number of the chunk
    print(embedding.embedding_uiid) # The UUID number of the embedding
```

### Semantic search

Through this relationship, we can do a semantic search as follows using `cosine_distance`, one of the distance comparators from the [pgvector-python](https://github.com/pgvector/pgvector-python/blob/master/pgvector/sqlalchemy/vector.py) library: 

```Python
from sqlalchemy import func, text

similar_posts = (
    session.query(BlogPost.content_embeddings)
    .order_by(
        BlogPost.content_embeddings.embedding.cosine_distance(
            func.ai.openai_embed( # using pgai's embedding functions 
                "text-embedding-3-small",
                "search query",
                text("dimensions => 1536")
            )	      
        )
    )
    .limit(5)
    .all()
)
```

Let’s break it down:

-   The function `func.ai.openai_embed`, referring to pgai’s OpenAI embedding function, generates the embedding for the search query, which is then used to search over the embedding generated by the vectorizer. 

You can also provide your own query embeddings instead of using pgai’s embedding functionalities:

```Python
similar_posts = (
    session.query(BlogPost.content_embeddings)
    .order_by(
        BlogPost.content_embeddings.embedding.cosine_distance(
            [3, 1, 2,...]
        )
    )
    .limit(5)
    .all()
)
```

## Conclusion

In this blog post, we observed how pgai Vectorizer streamlines embedding generation and management by creating a vectorizer through Alembic, querying embeddings, and performing semantic search! This Python integration allows developers to work with familiar tools while enabling powerful AI-driven features with minimal effort. 

If you’re building AI applications, explore [Timescale’s complete open-source AI stack](https://www.timescale.com/ai) or head to [pgai](https://github.com/timescale/pgai)’s GitHub repository to start bringing AI workflows into PostgreSQL—no need to leave your database. If you find it helpful, we would love your support—leave us a ⭐! 

### Further reading

To learn more about our mission of simplifying your AI stack and other integrations, check out these resources:

-   [Stop Over-Engineering AI Apps: The Case for Boring Technologies](https://www.timescale.com/blog/stop-over-engineering-ai-apps)
-   [Expanding pgai Vectorizer: SQLAlchemy and LiteLLM Make Vector Search Simple](https://www.timescale.com/blog/sqlalchemy-and-litellm-make-vector-search-simple)
-   [One Line of SQL, All the LiteLLM Embeddings](https://www.timescale.com/blog/one-line-of-sql-all-the-litellm-embeddings)