In one of my previous articles I briefly discussed how software development teams function and why they have a need for a code repository and version control system. If you haven’t had a chance to read it I recommend checking it out to get a better understanding of software development teams versus analytics teams. In this article we’ll dive right into what a code repository and version control system is and the benefit of using one. In another article we’ll talk about how to connect your code repository to all of the other tools and techniques to help build a stellar analytics team. So let’s begin!
While you may be new to code repositories and version control systems, they have been around for a long time. Some were pretty good and most had quite a few issues. If you’re old enough you may remember SVN, SourceSafe, AccuRev, and others, but Git seems to have come out on top of the code repository wars and is used in pretty much every software development environment these days. I won’t dive into the technical reasons why Git is better than these older solutions because it really doesn’t matter: you more than likely won’t have a choice in using something other than Git. What is important is that you have a solid understanding of how powerful Git can be for your team and company. So what is a code repository anyway?
A code repository (or repo for short) is simply a collection of code. Think of this like the files that you have stored on your laptop or on a USB drive (if anyone uses those things anymore). Just like your laptop stores a bunch of files, a code repository does the same thing. The difference is that the repository is a more official storage directory for your code. For example, on your laptop you can have all sorts of random files sitting in your downloads directory or on your desktop. If you’re slightly more organized, maybe you put all of your pictures into one folder and all of your SQL files into another folder. A code repository is just an explicit location that you’ve designated to store all of the code for your team or company.
The second part of Git is the version control system and this is where Git becomes an amazing tool. If Git was only a repository, it would probably just be a simple directory on your laptop or in the cloud and you can already create folders to store things, so what would be the point? But fortunately, we have version control that is tightly integrated with our repository. So what is version control?
Version control is a way of creating save points (known as commits) for the content of your files. If you’re into video games, you can think of a commit as a save point: if you screw up pretty badly in the game or you get killed, you can go back to that save point and restart from there instead of starting the entire game from the very beginning. If my video game example fell a bit flat, you can always think about these save points as little shortcuts to points in time. Consider it a virtual time machine for every version that you’ve chosen to save. For example, if you were writing a resume you might start out by creating a rough draft with a large number of bullets about all of your accomplishments. But after creating your first draft you might find that you have 5 pages of content and need to reduce it down to 2 pages. You end up deleting a lot of bullets, combining others, and moving the education, employment, and certification sections around for better visibility. Next you have some friends review your draft, and they go in and change a bunch of the wording to make it sound more professional and fix some of your grammatical errors. Unfortunately, you really don’t have a history of all of the content that changed.
Assume for a moment your friend came over to your house and started editing the file on your laptop. When your friend has completed all of the edits and has saved and closed the file, you’ll no longer have insight into the old content. Maybe your friend changed quite a few things that you don’t agree with, or maybe your friend is a dog that happens to step on your keyboard a lot and screw up the content in your files. Either way, you can’t go back to that old version of the document once it has been saved and closed. But with a version control system like Git you can go back in time and see a snapshot of what that file looked like before your friend started making changes. What’s more, you can easily compare changes from version to version (more on this later). Some people like to think of version control systems as a never-ending undo button, but I feel that this would be an inaccurate description. The version control system is more like a time machine that can drop you at specific points in time (if used correctly) versus having to hold down the undo button indefinitely until you end up in the old historical state.
To further enhance the capabilities of the version control system, each commit has a comment attached to it to help you understand what was changed and why. In our example above, we could have created a commit before asking our friend to review the document and then another commit after they made their changes. With the first commit we might have commented that “this is our initial first draft version” and on the second commit we might have commented that “this is draft number 2 where brian reduced the document from 5 pages to 2 pages to conform with general resume page length”. Over time these comments help to tell a story about all of the changes that have occurred and, if done correctly, they will provide the context as to why something changed, not just what changed.
Now that you have the basic concepts of how a version control system works, we can discuss how to use the version control system within your analytics team. I’ll briefly touch on the key points of using Git and how most teams will probably want to get started with a simple approach. As your team grows, and depending on the level of control that you need, you may find yourself slightly altering some of the practices around “branching” and accepting/rejecting changes to your code.
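Here’s a minimal sketch of what those two save points might look like on the command line. The file name, the author identity, and the commit messages are placeholders that I made up for illustration, and the repository lives in a throwaway temporary directory.

```shell
# Create a scratch repository in a temporary directory (all names are placeholders).
repo=$(mktemp -d)
cd "$repo"
git init -q
# Identity stored in this repository only; the name and email are made up.
git config user.name "Brian Example"
git config user.email "brian@example.com"

# Save point 1: the rough draft.
printf 'Brian\n- accomplishment one\n- accomplishment two\n' > resume.txt
git add resume.txt
git commit -q -m "Initial first draft of resume"

# Save point 2: the trimmed draft, with a message that explains why.
printf 'Brian\n- accomplishment one\n' > resume.txt
git add resume.txt
git commit -q -m "Second draft: trimmed resume to fit typical page length"

# List the save points, newest first, one line per commit.
git log --oneline
```

The last command prints one line per commit, so the story of the document reads from the bottom up.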
When it comes to using Git there are a few different things to keep in mind. First, Git is often used interchangeably with GitHub and it shouldn’t be. Git is the version control system and GitHub is a website that hosts Git repositories and puts a pretty-looking GUI front end on top of them. Sure, you can use Git via the command line to perform almost all of the functions that you need, but GitHub can sometimes offer a better way to view the contents of your repository and all of the metadata associated with it. It should also be noted that while many companies use GitHub as their front end, there are other choices such as Bitbucket that offer similar functionality. Then you have desktop applications such as Sublime Merge and many others that provide a non-web interface for Git. To avoid getting into the weeds on all of these tools, I’ll just talk about Git as your desktop client and GitHub as the web interface for the time being.
There are a few key pieces of information that you’ll want to understand about Git before you start using it. Without a firm grasp of these concepts you’ll probably find Git confusing, which means you’ll probably end up abandoning it before you ever get to use all of its wonderful features. We’ve already discussed some examples in the previous section of how Git allows you to create a historical save point, otherwise known as a commit. But let’s dig a bit deeper into how Git is configured. I’m going to talk about this from a general business perspective in a way that most people will probably understand. Keep in mind that while doing so, I might gloss over a few technical exceptions to the rules or details that aren’t important to know at this time.
Your Git configuration is going to utilize a repository as we mentioned. When you use and configure Git on your laptop you’ll be using what is referred to as your “local” repository. You can perform all of the Git commands that you need to for version control on your local machine. However, since you’re probably going to be using Git in a team setting where you’re sharing code with other team members we’ll want to have some sort of shared space. So first, let’s talk about this shared space and then circle back on how your laptop interacts with this shared space.
For the shared repository, think of this like a shared drive on your network or a cloud storage drive that you use. In Git terminology, this is referred to as the “remote”. The way that this works is you have a “remote” repository that is stored in the cloud and anyone that has the appropriate permissions can access any of the files in that repository.
With this shared space created, we need to have a way for our laptop to access those files on the “remote” repository. To do this we will “clone” the “remote” repository to our “local” machine (your laptop). The language and commands were selected with the specific meaning in mind so hopefully they will help with the understanding of how this works. When you “clone” the files from the “remote”, you’re basically saying, “Copy everything that is currently in the cloud storage to my laptop”. If you’ve ever used Google Drive (web version) along with the Google Drive Desktop version you’ll find that Git pretty much works in the same way.
When you configure your Google account you have all of your files stored in the cloud and you can open your browser and navigate to all of your files in Google Drive. However, if you want to take your laptop to some remote part of the world where you don’t have internet access, you’ll want to have a copy of those files locally on your laptop. This is where you install the Google Drive Desktop app. It automatically “clones” or copies all of the files from the cloud down to your laptop for offline access. Behind the scenes Google Drive is automatically keeping file changes in sync. For example, if you open a file on your laptop and make/save changes, that updated file is automatically copied back to the cloud. Google Drive also works in the other direction: you can make changes to files via your web browser and those files are automatically synced back to your laptop.
Hopefully the Google Drive example resonates. The basic concept of how Git works is roughly the same but there are some differences that you’ll want to keep in mind. While Git stores files in the cloud on the “remote” and you copy or “clone” the files to your “local” just like you do with Google Drive, the process of copying files and keeping them in sync is completely manual. It has to be this way for some very good reasons but I’ll skip over those for now. The reason behind this manual process will make more sense as we discuss how to create versions of, or “commit”, files. Just remember that anything that we do on our laptop (the “local”) is going to require manual effort to “push” the changes to the “remote”. It also means that any changes that happen in the cloud (the “remote”) will have to be manually “pulled” to your local. What does this mean in practice?
Let’s go back to our resume example and assume that we created a repository and we’re just going to put our resume into the repository. We obviously will have access to this repository but we also want to give our friend access to the repository because our friend is going to help us out by editing the file. Sure we could email the file back and forth but that could cause confusion, lost or misplaced files, duplicate files, and we’d lose the benefits of version control. So version control it is.
We will start out by creating the resume on our “local” but when our friend goes to check the repository they don’t see the file. This is because all that we have done is create the file on our laptop and we haven’t “pushed” the file to the “remote”. So we’ll commit the file and then manually issue a command (called push) to push it to the “remote”. Now the file is in the cloud and our friend will be able to find it. But to get the file from the cloud to his laptop he’ll have to manually sync his laptop with the cloud. To do this he’ll issue a “pull” command which will download all of the latest files. When he does this he’ll have a copy of our resume on his laptop. Now let’s say he makes edits to the resume. The same process that we just described will work in reverse when it comes to us being able to see his changes. He will make an edit but it will only be saved locally. He’ll have to “push” his changes to the “remote” and after he does, we’ll have to “pull” what is on the “remote” down to our laptop.
One other thing to keep in mind as you wrap your head around this process: the “remote” or cloud is where all of the files should reside. I say “should” because the absolute latest files and changes aren’t always going to be on the “remote” due to the manual sync process. If someone doesn’t “push” their files to the remote then for everyone else on the team that uses the repository, it’s like those files don’t actually exist. The same is true if you don’t “pull” the changes from the “remote”. For example, if your friend pushed his changes to the “remote” and you never manually “pull” those changes down to your laptop, it is like they never existed. There are some important technical details that we’ll need to dive into regarding this but for now just remember things this way.
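The round trip described above can be sketched roughly like this. To keep the example self-contained I’m assuming a bare repository in a temporary directory as a stand-in for the “remote” (in practice it would live on GitHub), and the directory names `me` and `friend`, the identities, and the file contents are all made up.

```shell
# Scratch area; every path and name below is a placeholder.
work=$(mktemp -d)
cd "$work"

# A bare repository on disk stands in for the shared "remote".
git init -q --bare remote.git

# You clone the (still empty) remote, add the resume, commit, and push.
git clone -q remote.git me
cd me
git config user.name "Me"
git config user.email "me@example.com"
echo "My resume, first draft" > resume.txt
git add resume.txt
git commit -q -m "Add first draft of resume"
git push -q -u origin HEAD    # -u remembers where this branch pushes to and pulls from
cd ..

# Your friend clones the remote, edits the resume, commits, and pushes.
git clone -q remote.git friend
cd friend
git config user.name "Friend"
git config user.email "friend@example.com"
echo "Reworded for a more professional tone" >> resume.txt
git commit -q -am "Reword resume for a more professional tone"
git push -q
cd ..

# Until you pull, your copy does not contain your friend's edit.
cd me
git pull -q
cat resume.txt
```

Nothing moves between the three locations unless someone runs clone, push, or pull, which is the manual sync that separates Git from the Google Drive example.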
Now that you know a bit more about how Git is structured and some of the interaction that you’ll have with Git, let’s install Git on your laptop and get you connected to a repository so that you can learn with some real-world examples.
Installing and Configuring Git
For these instructions I’m going to assume that everyone is using a Mac. The Windows instructions aren’t much different and a quick Google search will probably turn up hundreds of links that will help you get installed and up and running.
Now that you have Git installed and somewhat configured, you’ll need to create a new repository or connect to an existing repository. But before we do that, you’ll probably want to configure your SSH key, which basically gives your computer permission to talk to the repository. Think of it as a stored password so you aren’t constantly prompted to enter a username and password every time you want to issue a Git command. To do this we’ll first start by creating the SSH key on your laptop and then we’ll copy that key over to GitHub.
Your laptop should now be connected to GitHub. The next step will be to create a repository on GitHub or to connect to one.
Creating a new repository
Connecting to a repository
That’s it. You’re connected to a repository and you’re ready to get started using Git.
With Git installed and connected to a repository you’ll want to get familiar with the common commands and functions that you’ll need to use. We already briefly covered the concept of the push and pull commands but here is a short list of what you’ll need to use on a frequent basis.
The Commit Process
Earlier we discussed how Git works and that you can create save points (called commits) in your code so that you can look back at specific points in time. These commits are basically different versions of your code and they’re where the version in Git’s version control system comes into play. Let’s dive a little deeper into what a commit is and how it works.
Thinking back to our Google Drive example, every time that you click the save button (or Google auto-saves your file), the file can be automatically synced from your local machine to the cloud. With Git, there is a multi-step process and a distinct difference between saving your actual file and creating a commit or save point. For example, if you are working on your resume as a Word document and you click the save button, your file has been physically saved on your laptop. However, while Git can detect that you’ve saved the file and that it’s different from the old file, a commit hasn’t actually been created until you explicitly create one.
Let’s say that you created your resume file and it only has your name and no other content on the page and then you click the save button. The file has been saved and if you close Word and re-open your file, your contents will re-appear. However, this version hasn’t been created in Git. So now if you update your resume file by filling out all of your job history, creating 5 pages of content, when you click the save button you will again have saved your file on your computer. If you now decide to go into Git and create a commit, you’ll have created your first save point. This means that if you ever wanted to revert back to the version of your file where you just had your name you won’t be able to do so because you never created a commit, which would have created a version of the file that only had your name on it.
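To see the difference between saving and committing, here’s a small sketch. Unlike the story above, it does commit the name-only version first, so that version stays recoverable. The file name, identity, and contents are placeholders, and I’m using a plain text file rather than a Word document to keep things simple.

```shell
# Scratch repository; all names and contents are placeholders.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Brian Example"
git config user.email "brian@example.com"

# First version: just a name, committed so it becomes a save point.
echo "Brian" > resume.txt
git add resume.txt
git commit -q -m "Start resume with name only"

# Saving the file again changes it on disk but creates no new commit.
echo "Job history: analyst, senior analyst" >> resume.txt
git status --short     # lists resume.txt as modified but not committed
git log --oneline      # still only one commit

# Only an explicit commit records the new version.
git commit -q -am "Add job history"
git log --oneline      # now two commits
```

Because the name-only version was committed before the job history was added, it remains in the history and can be looked at again later.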
This is very important to keep in mind as you make changes to your files. If you are working on a piece of code that has many associated changes you’ll want to ensure that you are making commits for each one of the changes that you are performing. For example, on your resume after creating your first draft you may have noticed two issues: 1. you accidentally listed an incorrect phone number and 2. for the specific job that you’re applying for, you should omit a couple of your key accomplishment bullet points. When you make these changes it would be best if you change the phone number, commit the change, then remove the bullet points and commit that change. While this may feel like a little overhead, it will pay dividends in the long run if you ever have to revert to a previous version of code. To help explain this, let’s assume that you lumped both of these changes under one commit.
You have your current version of the resume with the updated phone number and the omitted bullet points. A few months go by and you are applying for a new job. You remember that a few months ago, when applying for a different job, you deleted a few bullets from your resume because they weren’t relevant for that job. However, this new job that you’re applying for is actually very related to the bullet points that you deleted. Since you have version control, you know that you can quickly pull up the historical version that has those old bullet points. The catch is that the same commit that removed the bullets also fixed your phone number, so going back to the version before it brings the incorrect phone number back too. Had the two changes been separate commits, you could undo only the bullet removal.
How frequently you commit is somewhat debatable. We established that it is a good practice to use individual commits for different changes that you are making, but what if you’re working on a piece of code that will take a long time to develop or just has a lot of lines? While I don’t have a hard and fast rule that I follow, here are a few items to consider to decide what works best for you.
The less often you commit, the more risk you have of not being able to go back to a specific point in time. Let’s assume that you take the approach of only committing code once per day. On the surface it would probably be a safe assumption that you’re committing too infrequently. If you’ve been writing code for 8 hours it would be very hard to believe that you’ve only been working on one small block of code, so your commit is probably already going to span many areas of development. However, if this is a brand new file that you’re working with, maybe you’re just trying to get everything working before you commit. This isn’t a bad approach but sometimes it helps to commit even if your file is just a work in progress. The reason is that it creates a save point that you can go back to if you start getting into a confusing state or really botch your code. Another reason is that if your laptop dies you would end up losing all of your work if you didn’t commit (and push) your changes.
Committing only when you have working code. Some people only like to commit their code when they have a working and tested version. This can also be a decent approach because you can avoid committing buggy code or unfinished work. However, there are a few caveats to this. If you are working on a project for a week or two and you’re waiting to complete the work, you’re probably waiting too long to commit. There are too many changes that you’ve made (along with the reasons that you made them) and if you ever need to revert to a previous state in your code history you’ll essentially have to erase a week or two of work. Also, even if your code is not in a perfect state, you can avoid causing problems for other developers by just not pushing your commits to the remote until your code is finished and tested. Another option is to use a different branch for your code (more on this later).
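Here’s a rough sketch of how the separate-commit approach pays off. The phone numbers, bullet text, and identity are invented for illustration. In a real history you would usually pass `git revert` the SHA of the commit you want to undo (found with `git log`); I use `HEAD` here only because the bullet removal happens to be the most recent commit.

```shell
# Scratch repository; all names and contents are placeholders.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Brian Example"
git config user.email "brian@example.com"

printf 'Phone: 555-0100\n- Led migration project\n- Built quarterly dashboard\n' > resume.txt
git add resume.txt
git commit -q -m "First draft of resume"

# Change 1: fix the phone number, committed on its own.
printf 'Phone: 555-0199\n- Led migration project\n- Built quarterly dashboard\n' > resume.txt
git commit -q -am "Fix incorrect phone number"

# Change 2: drop a bullet for this application, committed on its own.
printf 'Phone: 555-0199\n- Led migration project\n' > resume.txt
git commit -q -am "Remove dashboard bullet; not relevant to this role"

# Months later: undo only the bullet removal. The phone fix stays in place.
git revert --no-edit HEAD
cat resume.txt
```

The revert adds a new commit that restores the bullet, so the history still shows that it was once removed and why.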
My practice of when to commit depends on the work that I perform. If I am working on a brand new project that doesn’t rely on anyone else’s code (which is typically the case for analysts working with SQL), I tend to work throughout the day and commit at the end of the day so that I can capture my notes of what I did, where I left off, and what I need to do next. If I’m working with a different language where I’m modifying multiple files in shared libraries, I will commit more frequently to explain why I’m making all of the modifications to other functions. The reason for this is by the end of the day I’m not really going to remember why I needed to modify an existing function and those reasons are going to be extremely important to other developers down the road. The same holds true if I’m performing modifications to existing files as an analyst. As I fix individual issues within a file I make the commit and move on to fixing the next issue.
If nothing else, just remember what the purpose of the commit and commit message is. They serve as a save point for someone to go back to and they provide context for what changed and why the change was necessary. If you find that your commit message or summary is too general in nature then you might not be committing often enough. Put yourself in someone else’s shoes and ask, why were all of these changes made? If your commit message doesn’t accurately explain the reason behind all of the changes that you made then you’re creating confusion for others and possibly yourself when you revisit your code 6 months from now.
Git is an amazing tool but there are a few best practices that need to be adhered to in order to ensure that it adds value to the team. One of these practices concerns the comments that go with your commits. Every time you commit your changes you’ll need to specify a message (comments). For the most part, there isn’t any enforcement around how many characters you have to use, nor is there any guarantee that your comments are actually valuable. What this means is that it’s possible for you or someone else to use comments such as “committing my code” or even “ … “. These comments are basically useless to you and anyone else that might be reading through the commit history. This would probably be a good time to explain how the Git commits might be used so that you can understand the value of a good commit message and the challenges created by poor commit messages.
Git contains a history of every change that has ever been committed. Any file that is created, added, deleted, moved, or modified is tracked within Git. Every change or set of changes that you choose to save as a single save point or commit will be captured and logged. Along with all of these changes will be a unique identifier (called a SHA), the author of the commit, a timestamp, and a comment. During the development process, all of these attributes will become very important. For example, let’s say that you have a SQL file (e.g., quarterly_report.sql) that is used to generate your company’s quarterly report. Over the course of time this file has been modified to account for additional products that your company started creating. We already know what would happen if changes were made without using version control but luckily your team has been using version control so all of the changes have been tracked. Now let’s assume that when we created the report this quarter, something looks wrong. The historical figures that were in the report last quarter now look different in the report this quarter. This should never happen so you suspect that there’s a bug someplace but you don’t know that for certain. Off you go to look at the commit history to see if you can find a clue.
You run “git log” and you see the git commit history and here’s what you find:
You proceed to bang your head against the wall. The last commit message is “updated file”, which is obvious. We can look at the actual changes to the file and see that the file was updated. Files are always updated because that’s what Git tracks. Unfortunately, this commit message doesn’t really tell us what has changed in the file so we’re going to have to inspect the changes by doing a diff comparison between the original version and this new version. So you look at the file changes and you see that an exchange rate value was changed in the file. Unfortunately, you have no idea why this was changed and you start to ask a few questions about this change:
- was this changed because the exchange rate has been wrong since the initial commit?
- did the actual currency exchange rate change recently (which is a pretty common occurrence in the business world) and we just needed to have the change reflected in the file?
- if the rate did change, did someone forget to account for the fact that this would change the output of the historical performance of this report?
- did someone accidentally change this rate when they were making other changes to the file?
- was someone testing the code to see what the theoretical impact would be on a change in rates and they accidentally committed the code to the repository?
We can think about all of these possibilities but at the end of the day we don’t really know why the code was changed, and this is one of the most important parts of the commit message. When we commit code we need to describe why we are making the code change and we need to be descriptive enough about the change that it would make perfect sense to someone who is brand new to your company.
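As a sketch of that investigation, the commands below list a file’s history and then show what the latest commit changed. The repository is a throwaway one that I build on the spot: the SQL statement, the `sales` table, the starting rate of 1.30, and the identity are all hypothetical stand-ins, not the contents of any real report.

```shell
# Scratch repository; the query, table name, and rates are made up.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Analyst Example"
git config user.email "analyst@example.com"

echo "SELECT amount * 1.30 AS amount_usd FROM sales;" > quarterly_report.sql
git add quarterly_report.sql
git commit -q -m "Add quarterly report query"

echo "SELECT amount * 1.35 AS amount_usd FROM sales;" > quarterly_report.sql
git commit -q -am "updated file"

# Short SHA, author, date, and message of each commit that touched the file.
git log --format='%h  %an  %ad  %s' --date=short -- quarterly_report.sql

# The actual line-by-line change made by the most recent commit.
git show HEAD -- quarterly_report.sql
```

The diff tells you that the rate went from one number to another, but only the commit message could have told you why.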
A better commit message would have been: “Updating quarterly_report.sql to update the exchange rate to 1.35, which was the actual exchange rate back on 2018-01-01. When we originally wrote the code we accidentally used the incorrect rate. This incorrect rate is throwing off our actual financial account balance. This change will impact the historical quarterly reports that have been produced by accounting. They are aware of this change and have requested this change in our code”. With this message, we can quickly understand:
- What file was changed
- Who requested/needed the change
- What the problem was
- Why we need to make the change
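One way to write a message like that from the command line is to pass `-m` twice: the first becomes a short summary line and the second becomes a longer body paragraph. This is just a sketch; the query text, the `sales` table, the starting rate of 1.30, and the identity are hypothetical, and the message is a condensed version of the one above.

```shell
# Scratch repository; the query, table name, and starting rate are made up.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Analyst Example"
git config user.email "analyst@example.com"

echo "SELECT amount * 1.30 AS amount_usd FROM sales;" > quarterly_report.sql
git add quarterly_report.sql
git commit -q -m "Add quarterly report query"

# The fix, with a summary line plus a body that explains the reason.
echo "SELECT amount * 1.35 AS amount_usd FROM sales;" > quarterly_report.sql
git commit -q -a \
  -m "Correct exchange rate to 1.35 in quarterly_report.sql" \
  -m "The original code used an incorrect rate. 1.35 was the actual rate on 2018-01-01. This changes historical quarterly report figures; accounting is aware and requested the change."

# Summary lines only, then the full message of the latest commit.
git log --oneline
git log -1 --format=%B
```

Keeping the first line short means the one-line history stays readable, while the full explanation is still there for anyone who opens the commit.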
Git provides the ability to create multiple branches of code. For most people starting out, sticking with the single default branch (historically called ‘master’, though GitHub and many newer setups now default to ‘main’) is the best option until you really understand how branching works. While Git has a history of every change ever committed to any code, it is possible to create a mess pretty quickly if you aren’t careful with how you use branches. But at a high level, by creating branches, you can have multiple versions of code being written at the same time and you can isolate those versions on their own branch. This helps with software release management to ensure a clean code base. For example, let’s say that we work for Apple on the iTunes product. This code would be in a repository. Now let’s assume that Apple would like us to completely change the look of iTunes. We know that it will take about a year to do the work. Also, we know that we have some pretty big bug fixes that we’ll need to release every month until we release the new look of iTunes. What we could do is create a branch that is specifically for the new look and a different branch for the bug fixes that we’ll release next month. Why would we do this?
We have two options for dealing with our code: 1. we could write the code and just not save/commit/push the code or 2. we could put the code into different branches. Hopefully it’s pretty obvious that option #1 really isn’t a valid option. Would we really want to have a team of developers that writes code on their individual laptops but not let them share or save their code? Maybe the better question is why can’t they save and share their code without a new branch? Well, remember that Git allows you to save your code and push your code to the remote so that it is shared with everyone. If the “new development team” starts pushing changes for a release that won’t happen until next year, the “bug fix team” will be forced to inherit those changes. This creates a really big problem because the “new development team” changes aren’t complete and shouldn’t be released. Basically we’d have teams stepping on each other’s work. The solution for this is to create branches for the code.
By branching, the new development team can work in isolation on their project for the next year and the bug fix team can work on new bug fixes every month and release those fixes without being impacted by whatever happens on the new development team. Now, the new development team will need to ensure that all of the bug fixes are also pulled into their code but this is a pretty easy task that Git allows for. For the sake of this article I’ll leave you with just this information about branches.
Hopefully you’ve found this article useful for getting started with Git and understanding how it can help you in your environment. In a future article we’ll talk more about how Git is connected to all of your project tasks that are managed in an issue tracking system such as Jira.
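A bare-bones sketch of that setup might look like the following. The branch names `new-look` and `bug-fixes`, the file names, the contents, and the identity are all invented for this example, and everything happens in one throwaway local repository rather than across two teams.

```shell
# Scratch repository; branch names, files, and contents are placeholders.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Dev Example"
git config user.email "dev@example.com"

echo "player v1" > app.txt
git add app.txt
git commit -q -m "Initial release"

# Two branches that both start from the released code.
git branch new-look
git branch bug-fixes

# The redesign team works on its own branch.
git checkout -q new-look
echo "new look, in progress" > ui.txt
git add ui.txt
git commit -q -m "Start redesign of the interface"

# The bug fix team works separately and never sees ui.txt.
git checkout -q bug-fixes
echo "player v1, crash on startup fixed" > app.txt
git commit -q -am "Fix crash on startup"

# The redesign branch pulls in the bug fix so it does not fall behind.
git checkout -q new-look
git merge -q --no-edit bug-fixes
ls
cat app.txt
```

After the merge, the `new-look` branch has both the unfinished redesign and the bug fix, while `bug-fixes` still contains only code that is ready to release.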