At its core, git is surprisingly simple. It's just a content-addressable filesystem with version control built on top. Nearly everything in git boils down to three types of objects: blobs, trees, and commits. (There is a fourth, the annotated tag, which we can ignore for now.)
Think of blobs as the raw contents of your files. Trees? They're just fancy directories that point to blobs and other trees. And commits are snapshots that point to trees, capturing your project at a moment in time. That's it. Every git feature you've ever used is built from these basic building blocks.
Each object gets a unique ID (a SHA-1 hash), and git stores them in what is essentially a key-value database. Simple, but incredibly powerful.
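To make "content-addressable" concrete, here is a toy sketch of a key-value store whose keys are derived from the values themselves. It is an in-memory stand-in, not git's on-disk format: the ToyStore class and its put/get methods are hypothetical names, and the key is a plain SHA-1 of the raw bytes, whereas real git also hashes a small header.

```python
import hashlib


class ToyStore:
    """In-memory content-addressable store (illustrative only)."""

    def __init__(self):
        self._db = {}

    def put(self, data: bytes) -> str:
        # The key is computed from the content, so identical content
        # always maps to the same key and is stored only once.
        key = hashlib.sha1(data).hexdigest()
        self._db[key] = data
        return key

    def get(self, key: str) -> bytes:
        return self._db[key]

    def __len__(self):
        return len(self._db)


store = ToyStore()
first = store.put(b"hello")
second = store.put(b"hello")   # same content, same key
other = store.put(b"hello!")   # one byte differs, unrelated key

assert first == second
assert first != other
assert len(store) == 2
assert store.get(first) == b"hello"
```

Nothing ever asks the store where to put a value. The content decides its own address, which is why storing the same file twice costs nothing extra.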
After enough years in an Aeron chair to nearly melt my body into gelatin, I've found that (for me at least) the best way to truly understand a tool is to build your very own bad version from the ground up.
Through the process I will attempt to
Let's start where any git workflow starts. When you run git init, git creates a .git directory with a specific structure. We'll do the same:
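    import os

    class GitRepository:
        def __init__(self, path="."):
            self.git_dir = os.path.join(path, ".git")
            self.objects_dir = os.path.join(self.git_dir, "objects")
            os.makedirs(self.objects_dir, exist_ok=True)
            os.makedirs(os.path.join(self.git_dir, "refs"), exist_ok=True)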
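As a standalone illustration of that structure, here is a minimal sketch that builds the layout in a temporary directory. The helper name init_repo and the default branch name are assumptions. Real git init also writes files such as config and description, which are skipped here.

```python
import os
import tempfile


def init_repo(path, default_branch="main"):
    """Create a bare-minimum .git layout under `path` (hypothetical helper)."""
    git_dir = os.path.join(path, ".git")
    # objects/ holds the object database; refs/ holds branch and tag pointers.
    for sub in ("objects", os.path.join("refs", "heads"), os.path.join("refs", "tags")):
        os.makedirs(os.path.join(git_dir, sub), exist_ok=True)
    # HEAD is a plain text file naming the current branch.
    # The branch name is an assumption; real git takes it from configuration.
    with open(os.path.join(git_dir, "HEAD"), "w") as f:
        f.write(f"ref: refs/heads/{default_branch}\n")
    return git_dir


with tempfile.TemporaryDirectory() as tmp:
    git_dir = init_repo(tmp)
    created = sorted(os.listdir(git_dir))
    with open(os.path.join(git_dir, "HEAD")) as f:
        head = f.read()

assert created == ["HEAD", "objects", "refs"]
```

HEAD, objects and refs are the pieces the rest of this walkthrough relies on. Everything else git init produces can wait.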
The objects directory is where the magic begins. It's git's database, where every version of every file is stored (sorta... but more on that later). There is no sense in storing something if you can't retrieve it later, so we need to understand how git identifies content. A barebones implementation of what git is doing internally could look something like:
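    import hashlib

    def hash_object(data: bytes, obj_type: str = "blob") -> str:
        header = f"{obj_type} {len(data)}\0".encode()
        full_data = header + data
        return hashlib.sha1(full_data).hexdigest()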
This extremely simple function is crucial. Content is prefixed with its type and size, delimited with a null-byte, then hashed with the SHA-1 algorithm. This hash becomes the object's unique identifier. Every file you've ever committed has gone through a process like this.
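One way to convince yourself the scheme is right is to check it against object IDs that real git is known to produce. The sketch below uses a hypothetical helper named git_object_id. The expected values are the widely documented IDs of the empty blob, the blob for "hello world" plus a newline, and the empty tree.

```python
import hashlib


def git_object_id(data: bytes, obj_type: str = "blob") -> str:
    """Hash `data` as described above: type, size, null byte, then content."""
    header = f"{obj_type} {len(data)}".encode() + b"\x00"
    return hashlib.sha1(header + data).hexdigest()


# Known IDs from real git:
#   git hash-object -t blob /dev/null             (empty blob)
#   echo 'hello world' | git hash-object --stdin  (blob with trailing newline)
#   git hash-object -t tree /dev/null             (empty tree)
empty_blob = git_object_id(b"")
hello_blob = git_object_id(b"hello world\n")
empty_tree = git_object_id(b"", "tree")

assert empty_blob == "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"
assert hello_blob == "3b18e512dba79e4c8300dd08aeb37f8e728b8dad"
assert empty_tree == "4b825dc642cb6eb9a060e54bf8d69288fbee4904"

# The header matters: hashing the raw bytes alone gives a different ID.
assert hashlib.sha1(b"hello world\n").hexdigest() != hello_blob
```

The empty blob and the empty tree have identical (empty) content but different IDs, because the type is part of what gets hashed.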
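Now that we know what it takes to reliably address our data, we can lay out our object type and store our first blob object:
    class GitObject:
        """Base class for Git objects (blob, tree, commit)"""
        def __init__(self, obj_type: str, content: bytes):
            self.type = obj_type
            self.content = content

        def serialize(self) -> bytes:
            header = f"{self.type} {len(self.content)}\0".encode()
            return header + self.content

        def hash(self) -> str:
            return hashlib.sha1(self.serialize()).hexdigest()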
Let's go ahead and add some plumbing to the GitRepository class so it can store and retrieve an object:
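    import zlib

    class GitRepository:
        ...

        def object_path(self, oid: str) -> str:
            # Determine the object path from its hash
            return os.path.join(self.objects_dir, oid[:2], oid[2:])

        def write_object(self, obj: GitObject) -> str:
            # Compress the object
            oid = obj.hash()
            compressed = zlib.compress(obj.serialize())
            # Write the compressed object
            path = self.object_path(oid)
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "wb") as f:
                f.write(compressed)
            return oid

        def read_object(self, oid: str) -> GitObject:
            with open(self.object_path(oid), "rb") as f:
                raw = zlib.decompress(f.read())
            # Parse the header to gather obj details
            null_index = raw.index(b"\0")
            obj_type, size = raw[:null_index].decode().split(" ")
            content = raw[null_index + 1:]
            return GitObject(obj_type, content)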
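The on-disk format is easy to inspect by hand. The standalone sketch below writes one loose object into a temporary directory and reads it back, following the layout real git uses for loose objects: the first two hex characters of the ID name a subdirectory, the remaining 38 name the file, and the file body is the zlib-compressed header plus content. The helper names write_loose and read_loose are hypothetical.

```python
import hashlib
import os
import tempfile
import zlib


def write_loose(objects_dir: str, obj_type: str, content: bytes) -> str:
    """Store content as a loose object and return its ID."""
    raw = f"{obj_type} {len(content)}".encode() + b"\x00" + content
    oid = hashlib.sha1(raw).hexdigest()
    path = os.path.join(objects_dir, oid[:2], oid[2:])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(zlib.compress(raw))
    return oid


def read_loose(objects_dir: str, oid: str):
    """Return (type, content) for a loose object."""
    with open(os.path.join(objects_dir, oid[:2], oid[2:]), "rb") as f:
        raw = zlib.decompress(f.read())
    # Split on the first null byte only; the content may contain more.
    header, _, content = raw.partition(b"\x00")
    obj_type, size = header.decode().split(" ")
    # The size in the header should match the payload length.
    if int(size) != len(content):
        raise ValueError(f"corrupt object {oid}: size mismatch")
    return obj_type, content


with tempfile.TemporaryDirectory() as objects_dir:
    oid = write_loose(objects_dir, "blob", b"hello world\n")
    fanout_dir, file_name = oid[:2], oid[2:]
    on_disk = os.listdir(os.path.join(objects_dir, fanout_dir))
    obj_type, content = read_loose(objects_dir, oid)

assert on_disk == [file_name]
assert (obj_type, content) == ("blob", b"hello world\n")
```

Splitting the ID into a two-character directory and a 38-character file name keeps any single directory from collecting every object in the repository.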
Fantastiche! We can now, in theory, write an object to the database and read it back out again, which makes this a rudimentary content-addressable storage system. Using this foundation, we can start implementing some instance methods that are analogous to git's plumbing commands. Maybe the simplest of these is git hash-object. This command takes an object type, a path to a file, and a flag that controls whether the hashed contents are also written into the object database. (We'll skip the foreplay and just treat the input as raw contents for now. In real git the path does matter, though not because it is hashed: it determines which filters, such as line-ending conversion, are applied before the blob is hashed and stored.)
On the other end of that simple workflow is the plumbing command git cat-file. Given an oid, it performs a lookup in the object database and returns either the contents themselves or some metadata, such as the content size, the object type, or whether the given oid exists at all.
Our analogues will look like this:
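    class GitRepository:
        ...

        def hash_object(self, data: bytes, obj_type: str = "blob", write: bool = True) -> str:
            # Simply new up an object, write it to the object database, and receive an OID
            obj = GitObject(obj_type, data)
            return self.write_object(obj) if write else obj.hash()

        def cat_file(self, oid: str, mode: str = "content"):
            obj = self.read_object(oid)
            if mode == "type":
                return obj.type
            elif mode == "size":
                return len(obj.content)
            else:
                return obj.content


    repo = GitRepository("demo-repo")
    data = b"hello world\n"
    oid = repo.hash_object(data)
    content = repo.cat_file(oid)
    assert content == data  # Our roundtrip works!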
This very limited set of tools and plumbing commands will actually lay the foundation for nearly all of the abilities that store or read information from the object database in our git-analogue. In the next section, we will take the next bite out of the git object model and implement our own trees.