Taming the Beast: Cleaning Up Messy SQL Scripts

Disclaimer: There is no one-size-fits-all solution for cleaning up messy SQL scripts.

Taming the Beast: Cleaning Up Messy SQL Scripts

If you have worked as a data engineer or analytics engineer, you have probably inherited some nasty looking SQL scripts either by former team members or by business users. This often occurs when there's a lack of a review process for scripts before automation.

The situation is that the SQL script runs once a day/week/month, and as time goes by, the script takes longer and longer to complete running. Sometimes it takes more than one hour. This is problematic for many reasons: it uses the resources inefficiently, which ends up impacting other scheduled jobs and ad-hoc queries; this can lead to inefficient resource utilization and a lack of control over how data is accessed.

We don't know how exactly this messy SQL script came to be. What we know is it exists, and we have to deal with it. So the question is "How do we go about fixing this SQL script with hundreds of lines of code that takes more than one hour to run?" While there is no quick fix and no one-size-fits-all solution, I will give you a framework to deal with this issue.

  1. Understand the business use case and the output

First, talk to the stakeholders and understand the business goal of this script. Does it create new metrics? Does it measure a process? Does it create a list of clients that do X and Y? It's important that you understand the output of the query, but it's equally important that you understand the business problem it solves. There might be another script or dashboard, which accomplishes the very same thing or something similar. Often times, stakeholders from different teams do not talk to each other.

It's crucial you understand the data grain, the dimensions, and the metrics involved. Understanding the output is also important, because there might be a simpler, more direct way to get the same output. Doing this would allow you to deprecate the script and to move on with your life.

  1. Make a list of the tables the script references

Make a list of all the tables used in the script and the number of times you hit each table. Table names should be self explanatory; however, that's not always the case. If it helps you to write notes next to each table, do it. You, as the engineer and owner of these data sets, know better than anyone else what these tables are. You also know which table is best for each use case.

Make sure to not hit the same table an unnecessary number of times, when you can simplify this by using the proper filters, or by using CTEs or temp tables.

3. What do the CTEs and the temp tables accomplish?

If the script has CTEs or temp tables, write down what each of them do and what the output looks like. Write down the grain, the dimensions, the metrics (if any). This will help you break down the script into digestible chunks.

  1. Get rid of subqueries

If you can avoid subqueries, please do. Replace them with joins, CTEs or with temp tables, when appropriate. If you absolutely must use a subquery, please write comments. And avoid full table scans too! Doing full scans will cost your team in time and in the actual bill.

  1. If things get messy, use a separate editor, and replace

If you're dizzy after looking at all that messy SQL, open a new editor, copy the salvageable parts, and rewrite the parts that need some clean up. This might make you stay on track.

Please do not attempt to rewrite the entire job in one sitting. It might not be possible. I've learned over the years that taking a break and going for a walk/run is one of the best ways to clear your mind. When you go back to the script or problem at hand, you might have a new perspective on how to tackle it.

  1. Reduce the scans on the same tables

Can you reduce the number of time you query the same table? For example, if you're scanning table the fact sales table to get the total sales in one CTE, and later you go back to the same sales table to get the total count of orders in another CTE; you should get all metrics from the same table in one scan.

Can you replace some tables with lighter tables? For example, the author of this messy script queries a huge fact table to get a unique list of advertisers when they can accomplish the same by querying the advertisers dimension table.

Conclusion

As I mentioned above, this is not a one-size-fits-all approach; however, it has helped me untangle the messiest of scripts in my career. Additionally, it has taught my teams lessons about having guardrails around jobs and dashboards authored by people outside our team. It has also given us authority to push back when stakeholders want to automate unrevised scripts.