---
title: Key vector concepts for pgvector | Tiger Data Docs
description: Learn how pgvector stores embeddings in ordinary tables, how nearest-neighbor queries and distance operators work, and why approximate indexes matter on Tiger Cloud.
---

This page explains the ideas behind **pgvector** in PostgreSQL: the **`vector`** column type, typical table shapes, **k**-nearest-neighbor queries, distance metrics, and **approximate nearest neighbor (ANN)** indexes. For how these pieces fit with **pgvectorscale** and Tiger Cloud, see **[Understand pgvector and pgvectorscale](/docs/learn/search/pgvector-pgvectorsearch/index.md)**.

## The `vector` type and table design

Embeddings live in normal tables: you add a column whose type is provided by the **pgvector** extension (for example `vector(1536)` when your model outputs 1,536 dimensions). A common pattern is to store each embedding next to the text it represents and optional **JSONB** metadata:

```sql
CREATE TABLE IF NOT EXISTS document_embedding (
  id BIGINT PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
  document_id BIGINT REFERENCES document (id),
  metadata JSONB,
  contents TEXT,
  embedding vector(1536)
);
```

The row has a primary key, an optional foreign key to a parent document, metadata, the chunk text (if you keep it in the database), and the embedding vector.

### Why a separate table instead of a column on `document`

Embedding models and LLMs enforce **context limits**. Long documents are usually **chunked**; each chunk gets its own embedding. That is naturally a **one-to-many** relationship (one document, many embedding rows), modeled with `document_id` (or similar) on the embedding table.

If you only store vectors and not source text, you can drop the foreign key or keep a loose link inside **JSONB**. Both patterns are common.
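
As an illustration, populating the table might look like the following (a sketch: the metadata keys and chunk text are hypothetical, and `$1` stands for an embedding your client computed):

```sql
-- Hypothetical: one chunk of a parent document. $1 is the embedding
-- produced by your model client; pgvector also accepts a text literal
-- such as '[0.1, 0.2, ...]' cast to vector.
INSERT INTO document_embedding (document_id, metadata, contents, embedding)
VALUES (42, '{"chunk": 0, "source": "handbook.md"}', 'First chunk of text...', $1);
```

A long document simply produces several such rows, one per chunk, all pointing at the same `document_id`.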

## Nearest-neighbor queries

The usual question is: **which stored vectors are closest to this query vector?** That is **k**-nearest neighbors (**k-NN**). In SQL you order by a distance expression and cap rows with **`LIMIT`**.

Here `$1` is a parameter holding the query embedding. The **`<=>`** operator computes **cosine distance**; for an index to accelerate this query, the index must be built with the matching cosine operator class (see [Distance metrics](#distance-metrics)):

```sql
SELECT *
FROM document_embedding
ORDER BY embedding <=> $1
LIMIT 10;
```

This returns the ten rows whose embeddings are most similar to the query under that metric. Because this is PostgreSQL, you can add **`WHERE`** filters (including on **JSONB**), **JOIN**s, and everything else you already use.
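
For example, a filtered search that joins back to the parent table might look like this (the `lang` metadata key is hypothetical):

```sql
-- Rank by cosine distance, but only among English chunks,
-- and return the parent document id alongside each match.
SELECT d.id AS document_id,
       e.contents,
       e.embedding <=> $1 AS distance
FROM document_embedding AS e
JOIN document AS d ON d.id = e.document_id
WHERE e.metadata @> '{"lang": "en"}'
ORDER BY e.embedding <=> $1
LIMIT 10;
```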

## Distance metrics

The example above uses **cosine distance** via **`<=>`**. pgvector also provides **L2 distance** (**`<->`**) and **negative inner product** (**`<#>`**; the inner product is negated so that smaller still means closer, matching an ascending **`ORDER BY`**). Your **index** must be built with the operator class that matches the operator you use in **`ORDER BY`**, and you should follow your **embedding provider’s** guidance (many text models are tuned for cosine or inner product on **unit-normalized** vectors).

In practice, **cosine distance on normalized vectors** is a solid default for many text-embedding pipelines.

### Cosine distance

Cosine-based scores measure how aligned two directions are in high-dimensional space, independent of vector length. For **unit vectors** (length 1), cosine distance equals one minus the inner product, so the two produce the same ranking; pgvector’s operator-class docs spell out the exact definitions for **`<=>`** and the other operators. For background on cosine similarity, see a standard reference such as [Cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) (Wikipedia).

### Inner product

The **inner product** (dot product) scales with vector **length** as well as angle. With **unit** vectors, ordering by inner product exactly matches ordering by cosine similarity, because the inner product of two unit vectors *is* their cosine similarity.

### Euclidean (L2) distance

**L2** is straight-line distance in the embedding space. It is sensitive to magnitude and is widely used in clustering and geometry-heavy workflows.

### When rankings match

For **unit-length** embeddings, cosine distance, negative inner product, and Euclidean distance yield the **same ranking** of neighbors (up to ties), because each is a monotone function of the inner product between query and candidate. That is why many teams standardize on **normalization + cosine** unless a model vendor documents a different contract.
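
This equivalence is easy to check numerically. A minimal Python sketch (hypothetical vectors, straightforward definitions of the three metrics) that ranks a few normalized vectors against a normalized query:

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def rank(vectors, query, dist):
    # Indices of `vectors` sorted from nearest to farthest.
    return sorted(range(len(vectors)), key=lambda i: dist(vectors[i], query))

# On unit vectors, cosine distance reduces to 1 minus the dot product.
cos_dist = lambda a, b: 1 - sum(x * y for x, y in zip(a, b))
neg_ip = lambda a, b: -sum(x * y for x, y in zip(a, b))
l2 = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

vectors = [normalize(v) for v in ([1, 2], [3, 1], [-1, 4], [2, 2])]
query = normalize([1, 1])

# All three metrics agree on the ordering of normalized vectors.
print(rank(vectors, query, cos_dist))
print(rank(vectors, query, neg_ip))
print(rank(vectors, query, l2))
```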

## Why approximate indexes exist

Classic multidimensional indexes (think **R-tree**-style ideas) work well at **low** dimensionality. Embeddings are often **hundreds or thousands** of dimensions, where exact structures stop scaling (**curse of dimensionality**). Exact **k-NN** over huge tables without an index is too expensive for interactive search.

**ANN** indexes trade a small amount of recall for much faster queries. pgvector ships **HNSW** and **IVFFlat**; **pgvectorscale** adds **StreamingDiskANN** for large or disk-heavy workloads. Deep dive: **[pgvectorscale reference](/docs/reference/pgvectorscale/index.md)**.
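
As a sketch, creating the pgvector index variants looks like this (assuming the `document_embedding` table above and cosine distance; parameter values are illustrative starting points, not tuned recommendations):

```sql
-- HNSW: the operator class must match the ORDER BY operator
-- (vector_cosine_ops pairs with <=>).
CREATE INDEX ON document_embedding
USING hnsw (embedding vector_cosine_ops);

-- IVFFlat: `lists` partitions the data; pgvector's docs suggest
-- sizing it relative to row count, so treat 100 as a placeholder.
CREATE INDEX ON document_embedding
USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
```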

## Index options on Tiger Cloud

On Tiger Cloud you typically use **pgvector** for the **`vector`** type and for **HNSW** / **IVFFlat**. **StreamingDiskANN** comes from **pgvectorscale** (it builds on pgvector). The table is a coarse comparison (your data size, recall, and update pattern matter more than any one row):

| Index                                | Build time | Typical query speed        | Rebuild after many updates?                           |
| ------------------------------------ | ---------- | -------------------------- | ----------------------------------------------------- |
| **StreamingDiskANN** (pgvectorscale) | Fast       | Often strongest at scale   | Usually **no** (check release notes for your version) |
| **HNSW** (pgvector)                  | Fast       | Fast                       | Usually **no**                                        |
| **IVFFlat** (pgvector)               | Fastest    | Often slowest of the three | Often **yes** (lists can go stale)                    |

For methodology and benchmark context, see the discussion in **[pgvectorscale reference](/docs/reference/pgvectorscale/index.md)** and **[Understand pgvector and pgvectorscale](/docs/learn/search/pgvector-pgvectorsearch/index.md)**.

### Practical default

If you are on Tiger Cloud and expect **large** embedding tables or strict latency goals, evaluate **StreamingDiskANN** from pgvectorscale alongside **HNSW**. For smaller datasets or simpler ops, **HNSW** alone is often enough.
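
If you evaluate StreamingDiskANN, index creation is similar in shape to the pgvector indexes (a sketch; assumes pgvectorscale is enabled on your service, and options may vary by version, so check the pgvectorscale reference):

```sql
-- Enable pgvectorscale (CASCADE also installs pgvector if needed).
CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE;

-- StreamingDiskANN index for cosine distance on the embedding column.
CREATE INDEX ON document_embedding
USING diskann (embedding vector_cosine_ops);
```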

## Next steps

| Goal                                                | Where to read                                                                                                                                                                                                           |
| --------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Tiger Cloud stack (pgvector + pgvectorscale + pgai) | **[Understand pgvector and pgvectorscale](/docs/learn/search/pgvector-pgvectorsearch/index.md)**                                                                                                                        |
| StreamingDiskANN parameters, filtered search        | **[pgvectorscale reference](/docs/reference/pgvectorscale/index.md)**                                                                                                                                                   |
| Enable extensions and walk through a flow           | **[Create a chatbot using pgvector](/docs/deploy/tiger-cloud/tiger-cloud-aws/tiger-cloud-extensions/pgvector/index.md)** ([Azure](/docs/deploy/tiger-cloud/tiger-cloud-azure/tiger-cloud-extensions/pgvector/index.md)) |
| Lexical (BM25) search and hybrid ideas              | **[Understand pg\_textsearch and BM25 search](/docs/learn/search/using-pg-textsearch/index.md)**                                                                                                                        |
