Aug 28, 2025
Posted by Team TigerData
TL;DR: We’re open-sourcing our text-to-SQL evaluation suite, purpose-built to both assess and enhance text-to-SQL systems for PostgreSQL. AI is only as strong as its data, so it’s not enough to just benchmark your agent’s database access: you also need the results to be actionable so you can improve your system over time.
In our previous article, we argued that large language models (LLMs) are a new kind of database user and that they need a different kind of database: one that is self-describing. But once you accept LLMs as database users, a new challenge emerges. How do we measure whether they are succeeding? To build reliable text-to-SQL systems, accuracy needs to be both defined and measured.
At TigerData, we began by experimenting with a semantic catalog to give LLMs the context they crave. But we quickly realized that context alone is not enough. We also needed to measure whether our system returned the right results and to pinpoint where things went wrong when it did not.
Existing evaluation tools fell short. The main problem was that they were built to score a system’s performance, not to help you improve the text-to-SQL system itself. The difference between a pure benchmark and an improvement system is granularity: instead of only checking whether the LLM generated the same query or returned the same data, you also need finer-grained detail, such as whether a failure came from retrieving the wrong schema objects or from reasoning through SQL generation.
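As a rough illustration of the kind of per-question record that makes failures attributable (a sketch of the idea, not the suite’s actual output format):

from dataclasses import dataclass, field

@dataclass
class QuestionResult:
    # Hypothetical per-question record; text-to-sql-eval's real output format may differ.
    question: str                    # natural-language question
    golden_sql: str                  # known-correct query
    generated_sql: str               # what the agent produced
    results_match: bool              # did both queries return the same data?
    retrieved_objects: list[str] = field(default_factory=list)  # schema objects the agent fetched
    required_objects: list[str] = field(default_factory=list)   # schema objects the golden query needs

    def failure_stage(self) -> str:
        """Attribute a failure to retrieval or to SQL generation."""
        if self.results_match:
            return "ok"
        if set(self.required_objects) - set(self.retrieved_objects):
            return "retrieval"   # the agent never saw the tables it needed
        return "generation"      # it had the context but wrote the wrong SQL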
Along the way we fixed other problems present in popular text-to-SQL benchmarks: many did not work with PostgreSQL, and the datasets were often flawed, with incorrect golden queries or schemas that did not resemble real-world production environments.
So we built our own evaluation system. One that measures the accuracy of any text-to-SQL pipeline, helps identify the source of failure, and provides a foundation for improvement. Today, we are open-sourcing it as text-to-sql-eval so the community can measure, debug, and improve their systems as well.
text-to-sql-eval
We’re open-sourcing the same tool we use internally to test and improve our text-to-SQL agents. It’s flexible, extensible, and ready to evaluate your data, schema, and use case.
The gap between modes 1 and 2 lets us reason about how much of the failure is due to problems in retrieval.
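One way to read that gap, assuming mode 1 hands the agent the correct schema context while mode 2 runs the full pipeline including retrieval (a sketch with made-up numbers, not measured results):

# Sketch with made-up numbers: the accuracy gap between the two modes
# estimates how many questions fail because of retrieval rather than
# SQL generation.
mode1_accuracy = 0.82   # generation with golden schema context supplied
mode2_accuracy = 0.68   # full pipeline, including retrieval
retrieval_gap = mode1_accuracy - mode2_accuracy
print(f"Roughly {retrieval_gap:.0%} of questions fail due to retrieval")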
After cloning the repo and installing uv, getting started is easy:
uv sync
cp .env.sample .env
Edit your .env with the settings for your preferred LLM provider and database. To get a local PostgreSQL instance:
docker run -d --name text-to-sql-eval \
-p 127.0.0.1:5555:5432 \
-e POSTGRES_HOST_AUTH_METHOD=trust \
timescale/timescaledb-ha:pg17
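With that container running, the relevant .env entries might look something like this (the key names here are illustrative; .env.sample lists the exact variables the suite expects):

# Illustrative values only -- copy the real key names from .env.sample.
DB_URL=postgres://postgres@127.0.0.1:5555/postgres   # trust auth, no password
OPENAI_API_KEY=...                                    # or your preferred LLM provider's key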
Load your datasets and evaluate your text-to-SQL system with:
uv run python3 -m suite load
uv run python3 -m suite setup pgai
uv run python3 -m suite eval pgai text_to_sql
(The example above evaluates pgai’s text-to-SQL system, but you can easily test a different system by adding it to the suite/agents directory.)
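The existing modules in suite/agents are the authoritative template, but conceptually an agent just turns a question into SQL. A hypothetical adapter might be as small as this (the file name, function name, and signature are assumptions, not the suite’s actual interface):

# suite/agents/my_agent.py (hypothetical) -- mirror an existing agent in
# suite/agents for the interface the harness actually expects.
def generate_sql(question: str, db_url: str) -> str:
    """Turn a natural-language question into SQL for the target database."""
    # Replace this stub with a call to your own text-to-SQL system: an HTTP
    # API, a local model, an agent framework, anything that returns SQL.
    # The eval suite executes the returned query and compares its result set
    # against the golden query's output.
    return "SELECT 1;"  # placeholder so the sketch runs end to end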
View results with generate-report, or explore them in the web UI:
uv run flask --app suite.eval_site run
Runs can be persisted to PostgreSQL and viewed in the web app.
Running the eval on your database requires a set of natural-language questions with corresponding “golden queries”. Creating a sizable set of question-answer pairs for a dataset is a time-consuming chore, but we found that LLMs can significantly speed up the process, and we have open-sourced a companion repo for generating these pairs on your own dataset: timescale/text-to-sql-generator.
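For a sense of the shape of the data, a single pair looks roughly like this (the table and column names are invented; the repos define the actual on-disk format):

# Illustrative question / golden-query pair; see the dataset files in
# text-to-sql-eval for the real storage format.
pair = {
    "question": "How many orders did each customer place in 2024?",
    "golden_sql": """
        SELECT customer_id, count(*) AS order_count
        FROM orders
        WHERE ordered_at >= '2024-01-01' AND ordered_at < '2025-01-01'
        GROUP BY customer_id;
    """,
}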
After installing it with uv sync and setting your DB_URL in your .env file, you can generate questions with:
uv run python3 -m generator generate
and then export them with:
uv run python3 -m generator export
See the timescale/text-to-sql-generator repo for detailed instructions.
This suite is already powering our internal text-to-SQL efforts, and now we’re excited to see what the community does with it.
Explore the code, docs, and examples on GitHub:
👉 evaluation suite: timescale/text-to-sql-eval
👉 LLM-assisted question generator: timescale/text-to-sql-generator
Try out our text-to-SQL tool built on top of this eval suite: 👉 timescale/pgai
Whether you're fine-tuning your own agents or validating out-of-the-box models, text-to-sql-eval helps you move from experimentation to production with confidence.
Join us in raising the bar for AI correctness. Start running real evaluations today—and let’s build something developers can trust.