Testing SQL the hard way
Context
It's not uncommon for Data Science workflows to have large chunks of SQL. Maybe you have a sequence of queries that you run every day to produce dashboards, or maybe you have a bunch of queries that spit out features that you feed into machine learning algorithms. If you're a Data Scientist, you're bound to have written a SELECT statement at some point in your day!
There's an online wiki for programming languages called Progopedia which lists SQL as "not a programming language": this perfectly characterises everything that is strange about these words that we write, execute, and rely on to get our insights right every single day. SQL doesn't have all the sensible things that real programming languages have to help us feel confident that they are doing what we think they're doing. Primarily, it lacks a straightforward abstraction for "unit" testing.
About a year and a half ago, a colleague Jack and I were wrangling with a query that had to be right: we are, after all, working in a regulated industry. We were dealing with a bunch of queries that were orchestrated together with Airflow. The main problem we had was that we could not rely on the input to our query being right, and we needed to prevent this query from quietly succeeding if anything in its results was wrong.
We went the entire journey from manually validating entries in our table to automating the detection and reporting of inconsistencies, so that we could flag or fix errors that were upstream of our query: this post is a heavily simplified example of what we did. In case you're wondering, I've called the post "testing SQL the hard way" because we have now abandoned this approach (and replaced it with dbt). But it had some fun lessons, so here goes!
A table of users
Let's look at a toy example: let's say that we want to make a table that has some key information about all of a company's customers. I've written all of the snippets below in BigQuery SQL, so that you can copy/paste them into the console and play around with them yourself.
We start by building up some mock data. Imagine you have a JSON payload (an "event") each time a user creates an account, with the user_id, the timestamp of when the account was created, and a bunch of other critical fields. Something like this:
WITH account_created_events AS (
  SELECT *
  FROM UNNEST([
    "{\"user_id\": \"user_1\", \"account_created\": \"2020-01-01\"}",
    "{\"user_id\": \"user_2\", \"account_created\": \"2020-01-02\"}",
    "{\"user_id\": \"user_3\", \"account_created\": \"2020-01-03\"}",
    "{\"user_id\": \"user_4\", \"account_created\": \"2020-01-04\"}"
  ]) AS event
),
When users finish some set of actions, we get another event to tell us that the user has completed their signup. Here's some more mock data, for those same users:
signup_completed_events AS (
  SELECT *
  FROM UNNEST([
    "{\"user_id\": \"user_1\", \"signup_completed\": \"2020-01-02\"}",
    "{\"user_id\": \"user_2\", \"signup_completed\": \"2020-01-03\"}",
    "{\"user_id\": \"user_3\", \"signup_completed\": \"2020-01-04\"}",
    "{\"user_id\": \"user_4\", \"signup_completed\": \"2020-01-05\"}"
  ]) AS event
),
We can take those two events and create a table with one row per user:
-- Extract the account creation columns
accounts_created AS (
  SELECT
    JSON_EXTRACT_SCALAR(event, "$.user_id") AS user_id,
    TIMESTAMP(JSON_EXTRACT_SCALAR(event, "$.account_created")) AS account_created
  FROM account_created_events
),

-- Extract the signup completion columns
signups_completed AS (
  SELECT
    JSON_EXTRACT_SCALAR(event, "$.user_id") AS user_id,
    TIMESTAMP(JSON_EXTRACT_SCALAR(event, "$.signup_completed")) AS signup_completed
  FROM signup_completed_events
)

-- Join them together
SELECT
  user_id,
  account_created,
  signup_completed,
  TIMESTAMP_DIFF(signup_completed, account_created, HOUR) AS signup_time
FROM accounts_created
LEFT JOIN signups_completed USING (user_id)
And here are the results:
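```
user_id | account_created         | signup_completed        | signup_time
--------+-------------------------+-------------------------+------------
user_1  | 2020-01-01 00:00:00 UTC | 2020-01-02 00:00:00 UTC | 24
user_2  | 2020-01-02 00:00:00 UTC | 2020-01-03 00:00:00 UTC | 24
user_3  | 2020-01-03 00:00:00 UTC | 2020-01-04 00:00:00 UTC | 24
user_4  | 2020-01-04 00:00:00 UTC | 2020-01-05 00:00:00 UTC | 24
```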
Perfect! This is the ideal analytics table to answer a ton of different basic analytic questions: How many users do we have? How many of them signed up today? How long does it take users, on average, to complete signup?
This is also the type of table that could be extended with tens (or even hundreds) of columns, each representing unique pieces of information about each user. It is the type of canonical data table that could grow to having tens of contributors and many tens of use cases. As long as this table maintains its core structural assumption (one row per user), we'll be good to go.
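For instance, if the results above are materialised as a users table (the same name the validation queries later in this post assume), the average signup time falls out of a one-liner:

```sql
-- Average hours between account creation and signup completion,
-- assuming the query above is materialised as a `users` table.
SELECT AVG(signup_time) AS avg_signup_time_hours
FROM users
```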
How could this go wrong?
Someone completes the signup flow twice
Let's imagine that a customer, for whatever reason, completes the signup flow twice. Maybe there's a bug in the app and the customer uninstalls and reinstalls it, or maybe there was a bug or failure in the backend service that is emitting those events (whatever, it doesn't matter!).
Nothing has changed in the analytics code base, but now we have two signup completed events for that user:
signup_completed_events AS (
  SELECT *
  FROM UNNEST([
    "{\"user_id\": \"user_1\", \"signup_completed\": \"2020-01-02\"}",
    "{\"user_id\": \"user_2\", \"signup_completed\": \"2020-01-03\"}",
    "{\"user_id\": \"user_3\", \"signup_completed\": \"2020-01-04\"}",
    "{\"user_id\": \"user_4\", \"signup_completed\": \"2020-01-05\"}",
    "{\"user_id\": \"user_3\", \"signup_completed\": \"2020-02-04\"}" -- Yikes!
  ]) AS event
),
This tiniest of errors, something that is largely invisible to the person writing the SQL, breaks the table: user_3 appears twice. Not only that, but your stats on signup completion rates have shot through the roof!
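Concretely, the join now produces one row per signup event, so the output contains two rows for user_3:

```
user_id | account_created         | signup_completed        | signup_time
--------+-------------------------+-------------------------+------------
user_3  | 2020-01-03 00:00:00 UTC | 2020-01-04 00:00:00 UTC | 24
user_3  | 2020-01-03 00:00:00 UTC | 2020-02-04 00:00:00 UTC | 768
```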
Herein lies the problem: a tiny issue in the data has propagated itself through the LEFT JOINs in the analytics code base, and has completely skewed some metrics that you are using.
Of course, you could argue that this is fairly trivial to fix: we'd need to add a pre-processing step on the signup completed events that picks the one we care about more, and spits out one signup event per user (see the sketch below). But what if this table has tens or hundreds of columns? What if the person contributing a new column is not the person who added the signup stats column, and so doesn't know about this? What if two of these types of problems manifest at the same time? What if this table is input to some downstream queries, and all they know is that their table is now wrong?
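For reference, a minimal sketch of that pre-processing step: assuming (and it is just an assumption) that the earliest event is the one we care about, the signups_completed CTE in the query above could dedupe with a GROUP BY:

```sql
-- Hypothetical dedupe: keep only the earliest signup_completed
-- event per user, so the join stays one row per user.
signups_completed AS (
  SELECT
    JSON_EXTRACT_SCALAR(event, "$.user_id") AS user_id,
    MIN(TIMESTAMP(JSON_EXTRACT_SCALAR(event, "$.signup_completed"))) AS signup_completed
  FROM signup_completed_events
  GROUP BY user_id
)
```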
These problems are more of a headache to discover and diagnose than they are to fix. Indeed, they may first manifest as a question about the metrics ("why has our average signup time gone through the roof?") rather than about the data.
"Unit" testing a table in two steps
This post is not about fixing those bugs: it is only about an approach that we used to use to detect these types of problems with some sort of meaningful error message. At its core, this method relies on enumerating your assumptions about the structure of the table, and then triggering an error in the query execution if any of them are violated.
In our example, our key assumption was that the table has one row per user.
Step 1. Separately from the query above, we wrote a query that should create an empty users_table_errors table if the users table's assumptions were not broken:
WITH user_validation AS (
  SELECT
    user_id,
    COUNT(*) AS num_rows
  FROM users
  GROUP BY 1
  HAVING num_rows > 1 -- Should never be the case, amirite?
)

SELECT
  "Duplicated user_ids" AS failure_type,
  COUNT(DISTINCT user_id) AS num_failures
FROM user_validation
-- Union all with any other validations (e.g. validating time stamps)
These types of queries are an extremely useful way of (a) documenting what you expect, and (b) creating a table of all of the (in this case) user ids that don't match your expectations, along with how many times each one is duplicated.
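As a sketch of what the comment in the query above hints at, a second (hypothetical) validation could check that the timestamps are ordered sensibly, and be UNION ALL'd onto the first:

```sql
-- Hypothetical extra validation: users whose signup completed before
-- their account was even created. UNION ALL this onto the query above.
SELECT
  "Signups completed before account created" AS failure_type,
  COUNT(DISTINCT user_id) AS num_failures
FROM users
WHERE signup_completed < account_created
```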
Step 2. Now that we have a way to identify errors, we need a way to stop our query from completing successfully if any errors are found. The first way we did this was to force BigQuery to compute something it couldn't if it found errors: we would literally encode a one divided by zero (a reconstruction is sketched below). Shortly after, we found a debugging function buried in the BigQuery documentation.
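The divide-by-zero trick looked something like this (a reconstruction, not the exact query we ran): make the denominator zero whenever a validation has failures, so BigQuery aborts with a division-by-zero error.

```sql
-- Fails with a "division by zero" error if any validation recorded
-- failures; returns 1 for every validation row otherwise.
SELECT 1 / IF(num_failures = 0, 1, 0) AS all_good
FROM users_table_errors
```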
If a validations table is not empty, we used BigQuery's ERROR() function to stop the query instead! It looks like this:
SELECT
  ERROR(CONCAT("ERROR: ", CAST(num_failures AS STRING), " ", failure_type)) AS error
FROM users_table_errors
WHERE num_failures != 0
If you run this in the BigQuery console, it pops up with an alert containing exactly that error message.
The final workflow would run (1) the original query, (2) the validation query, and then (3) the error query. This approach really accelerated our ability to diagnose and fix errors; it de-coupled the SQL that did the work from the SQL that did the validation, and it automatically documented all of our assumptions about a given table.
For our example above: we'd see an error message that reads ERROR: 1 Duplicated user_ids; we'd pop open the users_table_errors table and see that it has 1 entry with user_3; and voilà, we'd already be on track to finding and resolving the problem.
Why is this the "hard way" of testing?
We got a lot of value out of the approach above; the major trade-off was that we had to write (nearly) double the amount of SQL. This is not too dissimilar from what happens in other programming languages, where we can write unit tests.
However, these are definitely not unit tests: we would primarily discover errors in the data when the analytics pipeline was running, and had to make choices as to whether we wanted our (fairly huge) graph of queries to halt running completely on every error, or continue with warnings. We had started designing ways to quantify the severity of errors (both in terms of columns and in terms of number of affected rows), to run queries on samples of data, and more.
By the end of 2019, though, the Data Science team at Monzo migrated to using data build tool, or dbt. If you want to read about why we think that was a good choice, Stephen's post here describes the things that we like about it. It helped us to automate a lot of the workflow that I've described above, and more.
Does this solve everything we need to ensure reliable, consistent, and scalable analytics queries? Absolutely not. The experience of writing this kind of "code" is still very much characterised by "try running it and see if the results look sensible", which is a far cry from where it could be. But it's a great step in the right direction.