Text-to-SQL with Dataherald and Yellowbrick
Dataherald and gpt-4o interacting with Yellowbrick

Natural language to SQL transformation is a very active field of research and product development, boosted by the advancing capabilities of LLMs. In keeping with the GenAI hype, ambitious claims are made about the potential of NL2SQL, particularly the idea that it will increase access to the valuable business data in relational databases by 10-100x, once the hurdle of having to know SQL is removed.

I've spent time experimenting with NL2SQL backed by an LLM, evaluating what it's capable of and, more importantly, what it can't do. What spurred me to look into this was the recent open sourcing of a product in this area, Dataherald. Pulling directly from Dataherald's GitHub page:

Dataherald is a natural language-to-SQL engine built for enterprise-level question answering over relational data. It allows you to set up an API from your database that can answer questions in plain English.

Dataherald's APIs allow one to build a repository of target database schemas, validated queries, schema annotations and instructions, and then inject this as context into an LLM prompt. The idea is that the additional context will help the LLM generate more accurate SQL queries. Dataherald will also execute the SQL query generated from a natural language question against the target database, using agents created through its integration with LangChain.
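
To make that concrete, here's a minimal sketch of driving Dataherald's REST API against a Yellowbrick target. The endpoint paths, field names and connection details below are assumptions based on Dataherald's documented API and may differ between versions:

```python
import requests

# Assumed base URL of a locally running Dataherald engine.
API = "http://localhost"

# 1. Register the target database. Yellowbrick speaks the PostgreSQL wire
#    protocol, so a PostgreSQL-style URI is shown here (the exact dialect
#    prefix used by the Yellowbrick extension is an assumption).
conn = requests.post(f"{API}/api/v1/database-connections", json={
    "alias": "yellowbrick_noaa",
    "connection_uri": "postgresql://user:password@yb-host:5432/noaa",
}).json()

# 2. Scan the schema so Dataherald collects the table/column metadata
#    it later injects into the LLM prompt.
requests.post(f"{API}/api/v1/table-descriptions/sync-schemas", json={
    "db_connection_id": conn["id"],
})

# 3. Ask a plain-English question; Dataherald generates the SQL and a
#    LangChain agent executes it against Yellowbrick.
answer = requests.post(f"{API}/api/v1/questions", json={
    "db_connection_id": conn["id"],
    "question": "How many weather stations are there in North Carolina?",
}).json()
print(answer)
```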

Dataherald currently supports the following database targets: PostgreSQL, SQL Server, Databricks, Snowflake, Redshift, BigQuery, Athena, MariaDB and ClickHouse. I thought it would be an interesting exercise to extend the Dataherald code to add support for Yellowbrick. Previously, I created a connector for LangChain to enable Yellowbrick to be used as a vector store for similarity searches, so I decided to integrate this capability into Dataherald too, as well as adding Yellowbrick as a target database.
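
On the vector store side, the LangChain integration looks roughly like this. A sketch, assuming the langchain-community Yellowbrick class; the connection string, table name and metadata are placeholders:

```python
from langchain_community.vectorstores import Yellowbrick
from langchain_openai import OpenAIEmbeddings

# Yellowbrick is PostgreSQL wire-compatible, so a postgres-style
# connection string works (credentials here are placeholders).
connection_string = "postgres://user:password@yb-host:5432/noaa"

vector_store = Yellowbrick(
    OpenAIEmbeddings(),       # embedding model used to vectorize the text
    connection_string,
    "golden_sql_embeddings",  # table in which the embeddings are stored
)

# Store a validated NL question and retrieve the closest match later.
vector_store.add_texts(
    ["Find the longest dry spell at a weather station"],
    metadatas=[{"sql_id": "golden-001"}],  # hypothetical reference key
)
matches = vector_store.similarity_search("longest period with no rain", k=1)
```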

With my Dataherald extension, anyone with data in Yellowbrick can ask natural language questions of their data and get answers via SQL queries that run against their databases. Dataherald and the LLM do the heavy lifting of converting the question into SQL based on the context provided by Dataherald, and then LangChain agents execute the query on Yellowbrick.

Yellowbrick comes with a set of sample datasets, including NOAA weather observations. The NOAA dataset contains two main tables of interest: the observation fact table, which contains 9.2Bn records representing weather measurements over the past 300 years, and the stations dimension table, which lists the 125,000 weather stations where the observations were made. The SQL queries generated ran on a single-node, 16 vCPU Yellowbrick compute cluster in AWS.

I ran experiments against the NOAA dataset using Dataherald backed by gpt-4o. I didn't manually add any context to the schema, which meant that only the metadata Dataherald automatically collects when scanning tables and columns was added to the prompt. This metadata includes table and column names, data types and sample column values.
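
For a sense of what that looks like, here's an illustrative snippet of the kind of schema context that ends up in the prompt. The table and column names are hypothetical, not the actual NOAA schema in Yellowbrick:

```
Table: noaa.weather_observation
  Columns: station_id (varchar, e.g. 'USW00013722'),
           observation_date (date, e.g. '2024-06-15'),
           prcp (integer, e.g. 0)
Table: noaa.weather_station
  Columns: station_id (varchar),
           station_name (varchar, e.g. 'CARY, NC US')
```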

The first question I asked was:

Find the start and end dates of the longest consecutive period with no rain in Cary, NC in June 2024

Dataherald and gpt-4o really struggled with this. Although the combination successfully generated and executed SQL each time, it took 30 attempts of asking the same question over and over before it finally hit upon a SQL query that produced the right answer.

The start and end dates of the longest consecutive period with no rain in Cary, NC in June 2024 are from June 11, 2024, to June 30, 2024.

I verified the answer on the NOAA page for the Cary weather station. One success in 30 attempts: an accuracy of around 3%. It's clear that without much context, gpt-4o does a poor job of generating an accurate SQL query from a plain English question. Looking over the LLM's failed attempts, it was clearly struggling to figure out how to craft the SQL needed to calculate consecutive periods between events. It needed a helping hand, and so I set up instructions, injected into the prompt, explaining how to calculate consecutive periods between events, along with a couple of pointers on date formats and weather station names.
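
The canonical SQL pattern for consecutive runs, the technique the LLM needed nudging toward, is "gaps and islands": number the qualifying days and group them by the difference between the date and the row number, so each unbroken run shares a single anchor value. Here's my own sketch of that pattern applied to the dry-spell question, not the LLM's output, with hypothetical NOAA table and column names:

```sql
-- Gaps-and-islands sketch for "longest run of consecutive dry days".
-- Table and column names (weather_observation, prcp, ...) are hypothetical.
WITH dry_days AS (
    SELECT o.observation_date
    FROM noaa.weather_observation o
    JOIN noaa.weather_station s ON s.station_id = o.station_id
    WHERE s.station_name ILIKE '%CARY%'
      AND o.observation_date BETWEEN DATE '2024-06-01' AND DATE '2024-06-30'
      AND COALESCE(o.prcp, 0) = 0
),
runs AS (
    -- Consecutive dates share the same (date - row_number) anchor.
    SELECT observation_date,
           observation_date
             - CAST(ROW_NUMBER() OVER (ORDER BY observation_date) AS INT)
             AS run_anchor
    FROM dry_days
)
SELECT MIN(observation_date) AS start_date,
       MAX(observation_date) AS end_date
FROM runs
GROUP BY run_anchor
ORDER BY COUNT(*) DESC
LIMIT 1;
```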

Once provided with this guidance, the accuracy went up to around 80%. Not bad. How can I improve the accuracy further? One way is to "cheat" and inject the correct NL/SQL pair into the prompt from the start. This is easy to do by labelling the successful query as a validated Golden SQL query in Dataherald. Dataherald stores the NL question and a reference to the validated SQL text as an embedding in Yellowbrick, performs a vector similarity search to find the SQL queries that most closely match the given question, and then provides the matching SQL as context to the LLM prompt.
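
Registering a Golden SQL record is a single API call. Again a sketch; the endpoint and field names are assumptions based on Dataherald's documented API:

```python
import requests

API = "http://localhost"

# Label a validated NL/SQL pair as Golden SQL. Dataherald embeds the
# question text (here, into the Yellowbrick vector store) so that similar
# future questions can retrieve it. Endpoint/field names are assumptions.
requests.post(f"{API}/api/v1/golden-sqls", json=[{
    "db_connection_id": "<connection-id>",
    "prompt_text": "Find the start and end dates of the longest consecutive "
                   "period with no rain in Cary, NC in June 2024",
    "sql": "WITH dry_days AS (...) SELECT ...",  # the validated query text
}])
```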

Unsurprisingly, doing this yields the correct answer from the LLM 100% of the time. A more interesting experiment is to test whether knowing how to join tables, find weather stations and handle dates given one example yields more accurate answers to an adjacent weather question. Let’s try with the following question:

Find the start and end dates of the longest period of rain at Portland International Airport, Oregon in February 2024.

The system provided the correct answer, this time in tabular format, which the NOAA website confirms as correct.

It turns out the "cheat method" of adding examples of successful queries can yield more accurate results than spending lots of time annotating the database schema with metadata descriptions, as this paper points out. That's not to say that providing the LLM with information about the schema is time wasted; it's just that the more significant boosts to accuracy come from feeding it prior SQL examples.

If you really double down on the Golden SQL examples and schema annotation, then the accuracy starts to get quite interesting. Take, for example, the TPC-DS dataset. Yellowbrick comes with the 1 TB scale variant, along with 99 SQL queries that answer business questions against this synthetic retail dataset. This is a more complex snowflake schema, comprising 24 fact and dimension tables. The TPC-DS documentation provides a natural language description of each of the 99 queries, so the NL/SQL pairs can easily be loaded as embeddings into Yellowbrick using Dataherald. With this large corpus of validated example queries, gpt-4o becomes rather good at generating SQL and answers. For example, I can ask the question:
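
Loading that corpus is then just a loop over the 99 pairs. A sketch, assuming the queries and their natural-language descriptions have been extracted to local files (the file layout here is hypothetical):

```python
import json
import requests

API = "http://localhost"

# Hypothetical layout: tpcds/query_01.sql .. tpcds/query_99.sql, plus a
# descriptions.json mapping query number -> the NL description taken from
# the TPC-DS documentation.
with open("tpcds/descriptions.json") as f:
    descriptions = json.load(f)

golden_sqls = []
for num, prompt_text in descriptions.items():
    with open(f"tpcds/query_{int(num):02d}.sql") as f:
        golden_sqls.append({
            "db_connection_id": "<connection-id>",
            "prompt_text": prompt_text,
            "sql": f.read(),
        })

# Same (assumed) golden-sqls endpoint as before; it accepts a batch.
requests.post(f"{API}/api/v1/golden-sqls", json=golden_sqls)
```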

How many men's watches were sold in Florida through the web in 2020 and what was the average age of the buyer?

The right answer is returned first time:

Number of men's watches sold: 18,520. Average age of the buyer: 42.24 years

And the SQL generated to arrive at this answer was correct.
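
For a flavor of what such a query involves, here is my own illustrative reconstruction using the standard TPC-DS table and column names. This is not the LLM's verbatim output, and the i_category/i_class filter values in particular are assumptions about the data:

```sql
-- Illustrative only: standard TPC-DS names; the category/class values
-- identifying "men's watches" are assumptions.
SELECT SUM(ws.ws_quantity)        AS watches_sold,
       AVG(2020 - c.c_birth_year) AS avg_buyer_age
FROM web_sales ws
JOIN item i              ON i.i_item_sk = ws.ws_item_sk
JOIN date_dim d          ON d.d_date_sk = ws.ws_sold_date_sk
JOIN customer c          ON c.c_customer_sk = ws.ws_bill_customer_sk
JOIN customer_address ca ON ca.ca_address_sk = ws.ws_bill_addr_sk
WHERE d.d_year = 2020
  AND ca.ca_state = 'FL'
  AND i.i_category = 'Men'
  AND i.i_class = 'watches';
```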

And another example:

Find the top 5 most profitable and least returned items through the store channel in North Carolina in 1999

The SQL generated for this one looked ok too!

Thanks for reading this far. In this post, we've tinkered with Dataherald and seen how it performs at NL2SQL given different levels of additional context about the schema and example SQL workloads. We've also seen how Dataherald can be integrated with the Yellowbrick Data Platform, using it both as a target database for executing generated SQL queries and as a vector store from which Golden SQL examples are retrieved.

As we discovered, the accuracy of NL2SQL can vary wildly. The vision of NL2SQL democratizing access to the crown jewels of a company's data estate is a huge stretch, and probably an unrealistic goal given the complexity of most enterprises' data warehouses: it's simply too easy for current LLMs to generate the wrong answer. I think this capability has application today as an exploratory tool and as a SQL co-pilot that generates a starting point tailored to a customer's schema and data, but it won't kill the SQL language or the SQL jobs market any time soon.

As a sign-off, take a look at the short recorded session showing Dataherald, gpt-4o and Yellowbrick in action.
