In this post I’m going to share some best practices for formating SQL that I’ve learned and used over the last 20 years. Many of these tips are things that I’ve collected along the way out of frustration with inherting code, needing to perform updates, chasing down bugs, and performing code reviews for others. Also, when it comes to training new employees, I have found that it is much more difficult for people to understand the domain and code when dealing with poorly written SQL.
In each section of this article, I’ll try to focus in on one area of practice. I’ll provide you with an example of a bad practice and example(s) of best practices along with my reasons behind writing code a specific way. As you’re reading you might think, “Well that isn’t as performant of code” and sometimes this might be the case because I’m trying to hone in on a specific point and avoid excess potential for confusion. Also, sometimes I prefer to have slightly less performant code (depending on how often it is used, the purpose of the code, and the performance hit) if it makes the code easier to read and maintain.
I’m guessing that some people reading this might say, “Well my code executes just fine so I don’t care. I understand what it does.” To that (because I’ve heard these comments many times over the years) I would challenge that assertion. It takes significantly more time to find an read sloppy “bad practice” code than it does when working with clean code. These practices will also help to reduce bugs and your code will be appreciated by everyone that reads it.
When we’re talking about programming languages, you can pretty much write statements as you see fit and the code will execute. Obviously there are some exceptions to this (such as indentations in Python) but that’s for another day. What this means is that when I’m talking about SQL best practices, one could make the argument that the this is just a matter of opinion and writing style.
How you write it matters. Think about a book for example. Sure, you could write an entire book without paragraphs, line returns, standard spacing between sentences, and more. Yes, it would still be a book and yes, the reader could probably understand it. However, how easily would it be for you find a specific section if there weren’t paragraphs? How easy would it be to have confidence that you’re actually in the right section if you were asked to edit something? For these reasons any many more, I’m outlining some best practices.
Please note that while all of these practices could be considered “opinions”, they are practices that I’ve used after 20 years of making mistakes, challenges with efficiently reading, editing, and understanding code, and watching my team members face the same challenges. I’ll try my best to tell you about the practice and the logic behind using them. I’ll leave it up to you to consider these practices, see what works for you, and what works for your team.
The first thing that I’d like to talk about is formatting. Code should be well formatted and visually appealing, which makes it very easy to read. Having well formatted code we pay off when it comes to debugging, troubleshooting, and modifying your code.
When it comes to formatting, there are a number of things that should be considered such as intentations, alignment, comma positions, and text case. If you made it this far in the reading you’ve probably noticed how well aligned my code is and that most items are found on a single line. This make a tremendous differece in readability.
Below is an example of code that has been written and is quite unreadable. In this example you’ll notice a number of things go against best practices such as:
- number of items per line
- alignment issues
- trailing commas
- poor or lack of aliasing
- lack of comments
- grouping by number instead of name
- position of aggregate functions in select statement
- multiple hard to detect bugs
In this article I’ll discuss these practices and more to help you write clean and bug-free code that you can be proud of.
Bad Practices - Many Issues
Look at the code below and compare to the code above. Which one is easier to read? Which one provides the best context about what the intent of the code is and what some of the conditional values mean? Which version provides a cleaner UI that will allow you to quickly spot bugs or avoid bugs all together? I’m hoping that you’ve agreed that the code below serves to overcome all of the challenges found in the above code.
The counter argument that I’ve heard people say is, “Well, you had to write more lines of code.” This is irrelevant. The computer doesn’t care and it’s a couple of key strokes to have a line return and spaces or tabs.
In the following sections I will discuss each of these issues and provide examples of the good and bad practices.
If you look at the above “best practice” code, notice how well everything is aligned to the left. All commas, spaces, and indentations make the code very easy to read.
One Item Per Line
My general rule is one item per line. This could be one element in your select statement or one condition in an join statement or one case statement. Again, look at the above code on formatting to see the difference of readibility when writing with one item per line. The key here is to be consistent. I’ve seen code that is written with one item per line but then every so often there will be a join clause that has an and and an or statment that are on the same line. This can be immensely frustrating when reading and debugging because it is very easy to overlook the addition condition because it was written to the same line.
Bad Practice - Multiple Case Conditions on a sigle line
Here we see a case statement that is all in one line. This is a bad practice because it make the code hard to read and quickly pick up on all of the conditions that are being evaluated. Also, it is really challenging if not impossible to properly comment the code. I know that in the example, ‘main_reporting’ isn’t descriptive and doesn’t appear consistent with the other values but hey, sometimes you’re told to output values this way and can’t logic to others.
Best Practice - Multiple Case Conditions on Multiple Lines
Here we see a case statement that is written on multiple lines with comments to help provide clarity.
Please comment your code. I’ll write another post about code comments in the future but comments are important. I feel like I see a post on LinkedIn or some other site on a daily basis where someone says something along the lines of, “You don’t need code comments. The code is a commment. What’s the matter? Don’t you know how to read code?” Seriously, I’ve heard this sort of thing for years. But here’s the reality. While code is a language and if proficient in the language, a reader can understand what the code is doing. But the code never tells the reader why someone wanted to code to function that way. The possibilities are endless as to why someone wanted to code to work a certain way. Sometimes you could be coding around a bug in the back-end data or maybe there is business logic that dictates how the code should function.
While it’s true that you could read the code and possibly look up the documentation on certain tables, it’s that a lot more work than typing a few characters. Below are some examples of good and bad commenting practices.
Bad Commenting - No In-line Comments
Look at the code below. We can see that the code only wants to return results where u.id > 1000. That’s pretty obvious in this very simple example. But the more important question is why did someone do this?
Maybe they are test users prior to u.id = 1000. Or maybe the code is filtering out all users that are from Michigan, because for some reason someone thought that all users less than u.id 1000 are from the state of Michigan. That might sound like a horrible idea to actually have code that would be written that way but it executes all the same. The point here is that as new users, we don’t know and chances are that in six months you probably won’t know either.
Better Commenting - In-line Comments
Here we have an in-line comment that tells us a bit more about why we added the u.id > 1000 condition. We evidently have test users that should be scrubbed out of the result set.
Bad Commenting - No Block Comments
Look at the code below. We can see here that the query is going to return users that are considered to be non-test users. The in-line comment helps us to understand that the desire is to scrub out these test users from the result set. But we had to read a few lines of code. You may be saying, well, it isn’t worth a comment block at the top. It’s just 8 lines of code and it’s obvious what is happening. Simpe or not, the reader doesn’t know the why behind this code. But what if the code wasn’t as simple? You’d surely appreciate some comments. For these reasons and to have better planning before you start writing code (more on that later),
Better Commenting - Block Comments
Below is the same code that we just looked at by now we are telling the user why we want to run this code and things to look out for.
Common Table Expressions (CTE)
Common table expressions or CTEs are a way of creating an in-memory table of your query results. This table can then be used throughout the rest of your SQL script. The benefit to using a CTE is that you can reduce code duplication, make your code more readable, and increase your ability to perform QA checks on your results.
Also note the really good block comment.
Example of Code Not Using CTE's
In the code below we can see that there are two sub queries that are returning resuls. These two subqueries are then joined together to produce the final result set. While this code will execute, there are a few concerns:
It’s really hard to perform a QA on the sub queries and inspect the results. For example, what if we wanted to run a some counts on the number of users that have multiple records for default screens? We can’t easily just execute some sql against the sub query. We’d have to copy/paste the sub query and then modify it to perform this qa. It would be much better if we could avoid changing code during our QA process.
If we need to utilize this users sub-query elsewhere in our code, we’d have to re-write or copy/paste that block of code to other places in our script. This would not be a DRY (don’t repeat yourself) process and exposes more potential to bugs. How so? Assume for a moment that you’ve used the users subquery in 5 locations in your script. Also assume that the code that your working with is not easy to read because it doesn’t follow best practices. If you are asked to update the code to add another condition to scrub out additional test users, there’s a good chance that you could miss adding this condition to at least one of the 5 uses of the subquery.
More cycles on the database. Each time that the subquery is execute it performs table scans to return results. With our users subquery contining wildcard conditions, the database is going to have a fair amount of work to do. It’s much cheaper (CPU cycles and dollars if you’re using cloud databases) to perform the subquery once, store it in memory and then just re-use the result set as needed in your code.
More complex to read the entire block of code and understand what is being performed and why. While it is possible to scroll through the code, it may be hard to easily comprehend what is happening. Genreally speaking, if you have to vertically scroll your code on your monitor, your code is too way too long and should be refactored to smaller components.
Example of Using CTEs
Below we see an example of using CTEs. While CTE’s can be great and help to overcome some of the challenges that we previously pointed out, CTEs typically don’t stay in memory after the final result set has been displayed.
For example, if you were to run this entire block of code, it would execute. But then if you wanted to select all of the results from the users CTE a few minutes later, this data wouldn’t be availble to query. To get around this issue you can use volatile tables or temporary in-memory tables that typically live as long as your session (database connection) remains active. More on this topic later.
Using "select *"
You should never write queries with “select *”. I think the only exception to this rule is if you are trying to inspect a table and in such a case, you should always limit the number of results that are returned. Writing queries this way is a bad idea of many reasons:
Database performance. Returning columns that aren’t needed is more expensive than querying only the columns that you care about.
Challenges debugging. Assuming that you’re using CTEs as described in the previous section, it can be very challening to trace the origins of certain attributes.
Tables change. Even if you actually need to select all columns, there’s no guarantee that your table won’t change over time. And as the table changes you’ll be querying new data that was never intended which could possibly break code elsewhere, cause confusion, or impact database performance and costs.
Bad Practice - Using Select *
Best Practice - Selecting Only Desired Elements
Aliasing is very important to help readers understand where elements reside and what tables are being used. When aliases aren’t used or poor naming convenstions are used, complexity is increased, and the reading/comprehension of code is reduced.
Bad Practice - No Alias Used on Fields
Below you can see that the tables have an alias of ‘u’ and ‘p’ but the selected elements don’t utilize the alias. This can be very frustrating and can cause run-time errors if more than one of the tables contains a field with the same name. Eg. user_id is found in both the ‘users’ table and the ‘perferences’ table.
Best Practice - Alias Used on Fields
Below you can see that the tables and selected elements utlize the table alias name. This makes the code very readable for the end user. Even if you only have a single table it is a good practice to use an alias on the table and field name. Good habbits make for good code.
Bad Practice - Generic Alias on CTE
Below we have a CTE that has been created but the table name that has been assigned is called ‘cte’. This is a very generic name and tells the end-user absolutely nothing about the data in the table. If you were a user reading the select statement that follows the CTE, you wouldn’t have any indication of what table was being used.
Best Practice - Specific Alias on CTE
Below we have a CTE that has been created with a more descriptive name. The name give the user some indication of what data is contained within the able.
Leading vs Lagging Commas
In select statements I prefer to have a leading comma as opposed to a trailing comma and this is one of those cases where I would say my personal opinion comes into play. I’ve seen a lot of people write their selected elements with trailing comma’s and in other languages it is common practice to have a trailing comma. However, with other languages, it isn’t common to have a large number of arguments past into a function whereas in SQL it’s quite common to have a large number of elements being selected (and declared in the code). While you may think that I’m being overly opinionated about this usage, here’s a couple of reasons why I find leading comma’s to be beneficial.
Clean looking UI. When you look at the best practice example, look at how nicely the commas are aligned. It is very easy to see that a comma is missing and avoid a run-time error when compared to using trailing commas
No confusion when working with longer case statements that wrap lines. Looking at the example below, it’s difficult to tell if the end of the line is the end of an element or statement or if it is specifying the end of an argument that is being passed into a function.
To add some additional frustration around this bad practice of trailing commas, BigQuery query formatter actually reformats your code to display everything a trailing comma. :(
Bad Practice - Trailing Comma
In this example we can see that leading commas are not used, making it very hard to spot the missing comma.
Best Practice - Leading Comma
In this example we can see that all of the commas are aligned, making it easy to ensure that no comma is ever missing.
Bad Practice - Trailing Comma Confusion
In this example we can see that leading commas are not used. We have a line break that ends with a comma so it is vary hard to tell if this ‘max’ line is really one single statement or if it is part of a longer statement. One could argue that you shouldn’t use line breaks like this in your code and while I support line breaks in the right places (because it makes code easier to read), you would still encounter the same issue when looking at your word-wrapped code in your editor or in a Git diff comparison.
Best Practice - Leading Comma with Indentation on Wrapped Text
In the example below it’s easy to see that a comma is missing. You might immediately wonder if someone forgot the comma, but because it’s so easy to read code with leading commas, there’s a higher probabilty that the missing leading comma is by design and there isn’t actually a comma that should be in front of the ‘cast’ statement. Also, by adding indentation to the cast statement, the code appears to have a more obvious intent as to why a comma isn’t necessary.
In a lot of SQL code in the old days, this was pretty common practice and it might have to do with the fact that SQL has been around for very long time, probably longer than most text editors with syntax highligting. Today, most people are using (or should be using) syntax highlighting in their editors so the capitalization shouldn’t be as necessary to spot the reserve words.
While I have a personal opinion to not use uppercase you may disagree and have a different opinion. Here’s my justification:
- I don’t like my code yelling at me. In social context and written communication, using uppercase is synonymous with yelling so I try to avoid writing this way.
- The code doesn’t read as fluidly. Psychologic studies have shown that word recognition is easier with lower case words that upper case words. This is because there is more shape variation with lower case words compared to upper case words. Reading speed can decrease by 13-20% when all upper case is used. Also, when you mix upper case and lower case words
- It’s extra key strokes to hold down the shift key while I type or lock/unlock the caps key. Not a good argument, I know.
Bad Practice - Uppercase
In this example we can see that reserve words are uppercased and all other words are lowercased.
Best Practice - Lowercase
In this example we can see that all of the commas are aligned, making it easy to ensure that no comma is ever missing.
Group By - Numbers vs Explicit Fields
I almost always perform my group by with explicit field names as opposed to the position number in the select statement. While this doesn’t have any bering on the results I have found that it saves me time when when it comes to typing and debugging. Normally I’d prefer to just use the numbers because it’s less text on the page but it has caused too many issues and time spent chasing down run-time bugs.
Bad Practice - Group by position number
Here you can see that we are performing a group-by with position numbers based on the selected items. What I don’t like about this practice is that:
- If someone puts an aggreate function(s) anywhere but the first or last item in the select statement then you have to skip a position number in the group by. This creates frustrations if you later decide to re-order your selected elements.
- You have to count out how many elements you have, minus the aggregated items, and then manually type the position number.
Best Practice - Group by field name
Here you can see that we are performing a group-by with explicit field names. While it looks like this would be a lot more typing and work versus using position numbers, it is actually faster to type than using numbers. How so? Because all that you have to do is copy what is in the select statement (minus the aggregated field) and paste those values in the group by. When you use numbers you actually have to type the numbers.