Aug 28, 2025
Posted by Team TigerData
TL;DR: We’re open-sourcing our text-to-SQL evaluation suite, purpose-built to both assess and enhance text-to-SQL systems for PostgreSQL. AI is only as strong as its data, so it’s not enough to just benchmark your agent’s database access: you also need the results to be actionable so you can improve your system over time.
In our previous article, we argued that large language models (LLMs) are a new kind of database user and that they need a different kind of database: one that is self-describing. But once you accept LLMs as database users, a new challenge emerges. How do we measure whether they are succeeding? To build reliable text-to-SQL systems, accuracy needs to be both defined and measured.
At TigerData, we began by experimenting with a semantic catalog to give LLMs the context they crave. But we quickly realized that context alone is not enough. We also needed to measure whether our system returned the right results and to pinpoint where things went wrong when it did not.
Existing evaluation tools fell short. The main problem was that they were built to score a system’s performance, not to help you improve the text-to-SQL system itself. The difference between a pure benchmark and an improvement system is granularity: instead of only checking whether the LLM generated the same query or returned the same data, you also need finer-grained detail, such as whether a failure came from retrieving the wrong schema objects or from reasoning through SQL generation.
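As a rough illustration of the kind of per-question record that makes failures attributable (a sketch of the idea, not the suite’s actual output format):

from dataclasses import dataclass, field

@dataclass
class QuestionResult:
    # Hypothetical per-question record; text-to-sql-eval's real output format may differ.
    question: str                    # natural-language question
    golden_sql: str                  # known-correct query
    generated_sql: str               # what the agent produced
    results_match: bool              # did both queries return the same data?
    retrieved_objects: list[str] = field(default_factory=list)  # schema objects the agent fetched
    required_objects: list[str] = field(default_factory=list)   # schema objects the golden query needs

    def failure_stage(self) -> str:
        """Attribute a failure to retrieval or to SQL generation."""
        if self.results_match:
            return "ok"
        if set(self.required_objects) - set(self.retrieved_objects):
            return "retrieval"   # the agent never saw the tables it needed
        return "generation"      # it had the context but wrote the wrong SQL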
Along the way we fixed other problems present in popular text-to-SQL benchmarks: many did not work with PostgreSQL, and the datasets were often flawed, with incorrect golden queries or schemas that did not resemble real-world production environments.
So we built our own evaluation system. One that measures the accuracy of any text-to-SQL pipeline, helps identify the source of failure, and provides a foundation for improvement. Today, we are open-sourcing it as text-to-sql-eval so the community can measure, debug, and improve their systems as well.
text-to-sql-eval
We’re open-sourcing the same tool we use internally to test and improve our text-to-SQL agents. It’s flexible, extensible, and ready to evaluate your data, schema, and use case.
The gap between modes 1 and 2 lets us reason about how much of the failure is due to problems in retrieval.
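One way to read that gap, assuming mode 1 hands the agent the correct schema context while mode 2 runs the full pipeline including retrieval (a sketch with made-up numbers, not measured results):

# Sketch with made-up numbers: the accuracy gap between the two modes
# estimates how many questions fail because of retrieval rather than
# SQL generation.
mode1_accuracy = 0.82   # generation with golden schema context supplied
mode2_accuracy = 0.68   # full pipeline, including retrieval
retrieval_gap = mode1_accuracy - mode2_accuracy
print(f"Roughly {retrieval_gap:.0%} of questions fail due to retrieval")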
After cloning the repo and installing uv, getting started is easy:
uv sync
cp .env.sample .env
Edit your .env with the settings for your preferred LLM provider and database. To get a local PostgreSQL instance:
docker run -d --name text-to-sql-eval \
-p 127.0.0.1:5555:5432 \
-e POSTGRES_HOST_AUTH_METHOD=trust \
timescale/timescaledb-ha:pg17
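With that container running, the relevant .env entries might look something like this (the key names here are illustrative; .env.sample lists the exact variables the suite expects):

# Illustrative values only -- copy the real key names from .env.sample.
DB_URL=postgres://postgres@127.0.0.1:5555/postgres   # trust auth, no password
OPENAI_API_KEY=...                                    # or your preferred LLM provider's key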
Load your datasets and evaluate your text-to-SQL system with:
uv run python3 -m suite load
uv run python3 -m suite setup pgai
uv run python3 -m suite eval pgai text_to_sql
(The example above evaluates pgai’s text-to-SQL system, but you can easily test a different system by adding it to the suite/agents directory.)
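The existing modules in suite/agents are the authoritative template, but conceptually an agent just turns a question into SQL. A hypothetical adapter might be as small as this (the file name, function name, and signature are assumptions, not the suite’s actual interface):

# suite/agents/my_agent.py (hypothetical) -- mirror an existing agent in
# suite/agents for the interface the harness actually expects.
def generate_sql(question: str, db_url: str) -> str:
    """Turn a natural-language question into SQL for the target database."""
    # Replace this stub with a call to your own text-to-SQL system: an HTTP
    # API, a local model, an agent framework, anything that returns SQL.
    # The eval suite executes the returned query and compares its result set
    # against the golden query's output.
    return "SELECT 1;"  # placeholder so the sketch runs end to end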
View results with generate-report, or explore them in the web UI:
uv run flask --app suite.eval_site run
Runs can be persisted to PostgreSQL and viewed in the web app.
Running the eval on your database requires a set of natural-language questions with corresponding “golden queries”. Creating a sizable set of question-answer pairs for a dataset is a time-consuming chore, but we found that LLMs can significantly speed up the process, and we have open-sourced a companion repo for generating these pairs on your own dataset: timescale/text-to-sql-generator.
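For a sense of the shape of the data, a single pair looks roughly like this (the table and column names are invented; the repos define the actual on-disk format):

# Illustrative question / golden-query pair; see the dataset files in
# text-to-sql-eval for the real storage format.
pair = {
    "question": "How many orders did each customer place in 2024?",
    "golden_sql": """
        SELECT customer_id, count(*) AS order_count
        FROM orders
        WHERE ordered_at >= '2024-01-01' AND ordered_at < '2025-01-01'
        GROUP BY customer_id;
    """,
}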
After installing it with uv sync and setting your DB_URL in your .env file, you can generate questions with:
uv run python3 -m generator generate
and then export them with:
uv run python3 -m generator export
See the timescale/text-to-sql-generator repo for detailed instructions.
This suite is already powering our internal text-to-SQL efforts, and now we’re excited to see what the community does with it.
Explore the code, docs, and examples on GitHub:
👉 evaluation suite: timescale/text-to-sql-eval
👉 LLM-assisted question generator: timescale/text-to-sql-generator
Try out our text-to-SQL tool built on top of this eval suite: 👉 timescale/pgai
Whether you're fine-tuning your own agents or validating out-of-the-box models, text-to-sql-eval helps you move from experimentation to production with confidence.
Join us in raising the bar for AI correctness. Start running real evaluations today—and let’s build something developers can trust.