By Team Tiger Data · 5 min read · Aug 28, 2025

AI · Product & Engineering · SQL

Table of contents

1. Evaluating AI for SQL: To Build text-to-sql, You Need an Eval Suite First
2. Introducing: text-to-sql-eval
3. Bring Your Own Data
4. Already Used to Improve Our Own text-to-sql Suite
5. Get Started

LLMs Are New Database Users. Now We Need a Way to Measure Them: Meet text-to-sql-eval


TL;DR: We’re open-sourcing our text-to-SQL evaluation suite, purpose-built to both assess and enhance text-to-SQL systems for PostgreSQL. AI is only as strong as its data, so it’s not enough to just benchmark your agent’s database access: the results also need to be actionable so you can improve the system over time.

Evaluating AI for SQL: To Build text-to-sql, You Need an Eval Suite First

In our previous article, we argued that large language models (LLMs) are a new kind of database user and that they need a different kind of database: one that is self-describing. But once you accept LLMs as database users, a new challenge emerges. How do we measure whether they are succeeding? To build reliable text-to-SQL systems, accuracy needs to be both defined and measured.

At Tiger Data, we began by experimenting with a semantic catalog to give LLMs the context they crave. But we quickly realized that context alone is not enough. We also needed to measure whether our system returned the right results and to pinpoint where things went wrong when it did not.

Existing evaluation tools fell short. They were designed to score a system’s performance, not to help you improve the text-to-SQL system itself. The difference between a pure benchmark and an improvement system is granularity: instead of only checking whether the LLM generated the same query or the same data, you also need finer-grained detail, such as whether a failure came from retrieving the wrong schema objects or from reasoning errors during SQL generation.
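To make the granularity argument concrete, here is a minimal sketch of a per-question eval record that attributes a failure to retrieval versus SQL generation. The field names are illustrative, not the suite's actual schema:

```python
# Illustrative per-question eval record; field names are hypothetical,
# not the schema used by text-to-sql-eval.
record = {
    "question": "Which sensor reported the highest temperature last week?",
    "retrieved_tables": {"sensors"},            # objects the system fetched
    "golden_tables": {"sensors", "readings"},   # objects the golden query uses
    "result_match": False,                      # did the returned data match?
}

# If the golden query needs tables the system never retrieved, the failure
# is attributable to retrieval rather than to SQL reasoning.
missing = record["golden_tables"] - record["retrieved_tables"]
failure_stage = "retrieval" if missing else "sql_generation"
```

A record like this turns a single pass/fail score into a pointer at the stage worth debugging.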

Along the way, we fixed other problems present in popular text-to-SQL benchmarks: many did not work with PostgreSQL, and the datasets were often flawed, with incorrect golden queries or schemas that did not resemble real-world production environments.

So we built our own evaluation system. One that measures the accuracy of any text-to-SQL pipeline, helps identify the source of failure, and provides a foundation for improvement. Today, we are open-sourcing it as text-to-sql-eval so the community can measure, debug, and improve their systems as well.

Introducing: text-to-sql-eval

We’re open-sourcing the same tool we use internally to test and improve our text-to-SQL agents. It’s flexible, extensible, and ready to evaluate your data, schema, and use case.

Key features

  • Support for many tools & models: Evaluate any LLM or text-to-SQL system—ours, yours, or open-source.
  • PostgreSQL focused: Purpose-built for PostgreSQL, because every database is different—and those differences matter when evaluating correctness.
  • Evaluate with Your Data: Quickly create a test dataset from your existing schema and data. Alternatively, use one of the included PostgreSQL versions of popular text-to-SQL benchmark datasets.
  • LLM as a Judge: Determining if an LLM-generated query yields the same data as a golden query isn't always straightforward. Differing column aliases, data rendered in different formats, redundant or extraneous columns, sort order—with effort, humans can spot these issues and judge two queries to be equivalent. Our suite offers a deterministic test, but it often produces false negatives, especially on more challenging questions. Therefore, we include an optional LLM-as-a-judge feature to re-evaluate failures more like a human would and provide both sets of results.
  • Track results over time: Store and visualize eval runs via TimescaleDB or the included UI.
  • Tools for improving text-to-SQL systems: When your text-to-SQL solution is underperforming, you need to be able to narrow down the source of the issue. Knowing that you’re passing 60% of the eval set is great, but without more information, how do you know where to spend your time to achieve 70%? We offer three distinct operational modes to help debug issues with schema object retrieval versus reasoning:
    1. Asking the tool to retrieve the relevant schema objects. This mode measures end-to-end text-to-SQL accuracy; a failure here may stem from either retrieval or SQL generation.
    2. Providing the full schema directly to the LLM. This gives us an upper bound on the accuracy of the result given the power of the model.
    3. Supplying the "golden tables" to guide query generation. This mode pinpoints failure in reasoning since retrieval is known to be correct. 

The gap between modes 1 and 2 lets us estimate how much of the failure is attributable to retrieval rather than SQL generation.
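The deterministic-comparison false negatives that motivate the LLM-as-a-judge feature above are easy to reproduce: two queries can return the same data under different sort orders (and, in practice, different column aliases), and a strict row-by-row comparison will call them unequal. A minimal sketch:

```python
# Result sets from a golden query and a generated query: identical data,
# different sort order.
golden    = [("alice", 3), ("bob", 1)]   # e.g. ORDER BY name
generated = [("bob", 1), ("alice", 3)]   # e.g. ORDER BY count

strict_equal = golden == generated                      # False: order differs
normalized_equal = sorted(golden) == sorted(generated)  # True after sorting
```

Normalization catches easy cases like row order, but harder mismatches (extra or renamed columns, formatting differences) are where re-evaluating failures with a judge model earns its keep.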

How it works

After cloning the repo and installing uv, getting started is easy:

uv sync
cp .env.sample .env

Edit your .env with the settings for your preferred LLM provider and DB. To get a local PG instance:

docker run -d --name text-to-sql-eval \
    -p 127.0.0.1:5555:5432 \
    -e POSTGRES_HOST_AUTH_METHOD=trust \
    timescale/timescaledb-ha:pg17
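The container above exposes Postgres on 127.0.0.1:5555 with trust authentication (no password) and the default postgres user, so the matching connection URL for your .env looks like the following. The variable name is illustrative; use whichever key your .env.sample expects:

```python
# Connection URL matching the Docker flags above: localhost port 5555,
# default "postgres" user, trust auth so no password is needed.
db_url = "postgresql://postgres@127.0.0.1:5555/postgres"
```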

Load your datasets and evaluate your text-to-sql system with:

uv run python3 -m suite load
uv run python3 -m suite setup pgai 
uv run python3 -m suite eval pgai text_to_sql

(The example above evaluates pgai’s text-to-sql system, but you can easily test a different system by adding it to the suite/agents directory.)
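If you do add your own system, the general contract is "question in, SQL out." The sketch below is hypothetical; the actual interface is whatever the modules in suite/agents implement, so copy one of the bundled agents rather than this:

```python
# Hypothetical agent sketch; the real entry-point signature is defined
# by the existing modules in suite/agents.
def text_to_sql(question: str, schema_context: str) -> str:
    """Return a SQL string answering `question` over `schema_context`."""
    # A real agent would prompt an LLM with the schema context here;
    # this placeholder only demonstrates the shape of the contract.
    return f"SELECT 1 AS placeholder -- question: {question}"
```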

View results with generate-report, or explore them in the web UI:

uv run flask --app suite.eval_site run

Runs can be persisted to PostgreSQL and viewed in the web app.

Bring Your Own Data

Running the eval on your database requires a set of natural-language questions with corresponding “golden queries”. Creating a sizable set of question-answer pairs for a dataset is a time-consuming chore, but we found that LLMs can significantly aid the process, and we have open-sourced a companion repo for generating these on your own dataset: timescale/text-to-sql-generator.
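For reference, a question/golden-query pair is conceptually just the following. The field names are illustrative; see the generator repo's export format for the real schema:

```python
# Illustrative question/golden-query pair; the format exported by
# timescale/text-to-sql-generator may differ.
pair = {
    "question": "How many orders were placed in 2024?",
    "golden_query": (
        "SELECT count(*) FROM orders "
        "WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01'"
    ),
}
```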

After installing it with uv sync and setting your DB_URL in your .env file, you can generate questions with:

uv run python3 -m generator generate

and then export them with:

uv run python3 -m generator export

See the timescale/text-to-sql-generator repo for detailed instructions.

Already Used to Improve Our Own text-to-sql Suite

This suite is already powering our internal efforts across:

  • Benchmarking different models and prompting strategies
  • Evaluating schema-specific performance
  • Tracking accuracy regressions

And now, we’re excited to see what the community does with it.

Get Started

Explore the code, docs, and examples on GitHub:

👉 Evaluation suite: timescale/text-to-sql-eval

👉 LLM-assisted question generator: timescale/text-to-sql-generator

👉 Try out our text-to-SQL tool built on top of this eval suite: timescale/pgai

Whether you're fine-tuning your own agents or validating out-of-the-box models, text-to-sql-eval helps you move from experimentation to production with confidence.

Join us in raising the bar for AI correctness. Start running real evaluations today—and let’s build something developers can trust.
