2024-11-22
read time: 4 min
cat ~/posts/git/git-from-scratch-part-1.md

Git at its core

At its core, git is surprisingly simple. It's just a content-addressable filesystem with version control built on top. Everything in git boils down to just three types of objects: blobs, trees, and commits.

Think of blobs as the raw contents of your files. Trees? They're just fancy directories that point to blobs and other trees. And commits are snapshots that point to trees, capturing your project at a moment in time. That's it. Every git feature you've ever used is built from these basic building blocks.

Each object gets a unique ID (a SHA-1 hash), and git stores them in what is essentially a key-value database. Simple, but incredibly powerful.
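
To make that concrete, here's a toy sketch (nothing git-specific yet, and obviously not how git lays things out on disk) of a content-addressable key-value store. The key isn't something you choose; it's derived from the content itself:

import hashlib

# A toy content-addressable store: keys are computed from the values themselves
objects = {}

def put(content: bytes) -> str:
    key = hashlib.sha1(content).hexdigest()
    objects[key] = content
    return key

oid = put(b"print('hello')")
assert objects[oid] == b"print('hello')"  # same bytes in, same key out, every time

Store the same content twice and you get the same key and a single entry for free. Git's object database is this idea, plus a small header and some compression.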

Why build git from scratch?

After enough years in an Aeron chair to nearly melt my body into gelatin, I've found that (for me at least) the best way to truly understand a tool is to build your very own bad version of it from the ground up.

Through this process, I will attempt to:

  1. Explain what's really happening when git magically secrets away your data
  2. Give details on why certain git operations are blazing fast while others turn your laptop into your very own personal space heater
  3. Learn more about some deeper computer science concepts along the way

From zero to blob

Let's start where any git workflow starts. When you run git init, git creates a .git directory with a specific structure. We'll do the same:

import hashlib
import os
import zlib

# hashlib and zlib come into play shortly, for hashing and compressing objects

class GitRepository:
    def __init__(self, path):
        self.worktree = path
        self.git_dir = os.path.join(path, ".git")
        self.objects_dir = os.path.join(self.git_dir, "objects")
        self.refs_dir = os.path.join(self.git_dir, "refs")

        if not os.path.exists(self.git_dir):
            os.makedirs(self.git_dir)
            os.makedirs(self.objects_dir)
            os.makedirs(self.refs_dir)
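
Pointing this at a fresh directory gives us the skeleton we'll build on (real git init also drops in HEAD, config, hooks, and a few other files, but this is all we need for now):

repo = GitRepository("my-repo")
# my-repo/
# └── .git/
#     ├── objects/   <- the object database
#     └── refs/      <- where branches and tags will eventually live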

The objects directory is where the magic begins. It's git's database, where every version of every file is stored (sorta... but more on that later). There is no sense in storing something if you can't retrieve it later, so we need to understand how git identifies content. A barebones implementation of what git is doing internally could look something like:

def hash_object(obj_type, data):
    # data is raw bytes; the header records the object's type and its size in bytes
    header = f"{obj_type} {len(data)}\0"
    store = header.encode() + data
    return hashlib.sha1(store).hexdigest()

This extremely simple function is crucial. Content is prefixed with its type and size, delimited with a null-byte, then hashed with the SHA-1 algorithm. This hash becomes the object's unique identifier. Every file you've ever committed has gone through a process like this.
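
As a quick sanity check, here's what that looks like for a small piece of content, reusing the hash_object function above. Because the formula is identical, the digest matches what git hash-object --stdin would print for the same bytes:

content = b"Hello, World!"
# The bytes that actually get hashed: b"blob 13\x00Hello, World!"
oid = hash_object("blob", content)
print(oid)  # a 40-character hex digest that uniquely names this content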

Now that we know what it takes to reliably address our data, we can lay out our object type and store our first blob object:

class GitObject:
    """Base class for Git objects (blob, tree, commit)"""
    def __init__(self, data, obj_type="blob"):
        self.data = data
        self.obj_type = obj_type
        
    def serialize(self):
        header = f"{self.obj_type} {len(self.data)}\0"
        return header.encode() + self.data
        
    @property
    def oid(self):
        return hashlib.sha1(self.serialize()).hexdigest()
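
Creating our first blob is now a one-liner, and the id is stable: hash the same content a thousand times and you get the same oid every time.

blob = GitObject(b"Hello, Git!")
print(blob.serialize())  # b'blob 11\x00Hello, Git!'
print(blob.oid)          # the SHA-1 of those serialized bytes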

Let's go ahead and add some plumbing to the GitRepository class so it can store an object and read it back:

class GitRepository:
    ...
    def _get_oid_path(self, oid):
        # Objects live at .git/objects/<first 2 chars>/<remaining 38 chars> of the oid
        return os.path.join(self.objects_dir, oid[:2], oid[2:])

    def write_object(self, obj):
        # Compress the object
        compressed = zlib.compress(obj.serialize())
        
        path = self._get_oid_path(obj.oid)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        
        # Write the compressed object
        with open(path, "wb") as f:
            f.write(compressed)
        
    def read_object(self, oid):
        with open(self._get_oid_path(oid), "rb") as f:
            data = zlib.decompress(f.read())
        
        # Parse the header to gather obj details
        null_index = data.index(b'\0')
        header = data[:null_index].decode()
        obj_type, size = header.split()
        content = data[null_index + 1:]

        return GitObject(content, obj_type)

Fantastiche, we can theoretically write an object to the database and read it back out again. We have successfully implemented a rudimentary content-addressable storage system. Using this foundation, we can now start implementing some instance methods analogous to git's plumbing commands, maybe the simplest of which is git hash-object. This command takes an object type, a path to a file (we'll skip the foreplay and just treat this as the "contents" for now, though in real git the originating path can influence how the blob is hashed and stored, since path-matched filters like line-ending conversion may rewrite the content first), and a flag that controls whether or not we want to write the hashed contents into the object database.

On the other end of that simple workflow is the plumbing command git cat-file. Given an oid, it performs a lookup in the object database and returns either the contents themselves or some metadata, such as the content size, the object type, or whether a given oid exists at all.

Our analogues will look like this:

class GitRepository:
    ...
    def hash_object(self, data, obj_type="blob", store=True):
        # Simply new up an object, write it to the object database, and receive an OID
        obj = GitObject(data, obj_type)
        if store:
            self.write_object(obj)

        return obj.oid

    def cat_file(self, oid, format="content"):
        obj = self.read_object(oid)
        match format:
            case "content":
                return obj.data
            case "type":
                return obj.obj_type
            case "size":
                return len(obj.data)

repo = GitRepository("my-repo")
content = b"Hello, Git!"
oid = repo.hash_object(content)
retrieved_data = repo.cat_file(oid)
assert retrieved_data == content  # Our roundtrip works!
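
If you peek at the filesystem afterwards, you can see where the blob landed. Git fans objects out into subdirectories named after the first two characters of the hash, so no single directory ends up holding every object:

path = repo._get_oid_path(oid)
print(path)  # my-repo/.git/objects/<first 2 chars of oid>/<remaining 38 chars>
assert os.path.exists(path)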

This very limited set of tools and plumbing commands will actually lay the foundation for nearly all of the operations that store or read information from the object database in our git-analogue. In the next section, we will take the next bite out of the git object model and implement our own trees.
