Git: Recalibrating

We are going to take a step back from what we saw in the previous articles in this series. Until now we assumed certain things to keep the explanations simple, but those assumptions would break any further commits to our knowledge. So let us amend our previous commits and correct the places where we were not so accurate.

The stuff here is very basic, yet it may make you feel like

All I read about Git has been a lie!!

What is Git?

Well, Git is a version control system. But in a more basic sense, Git is a mini file system inside another file system. The contents of files are stored in this file system as blobs, which are arranged in a particular way. There are also trees, which are analogous to the directories in our file systems. But why? We will see.

How does Git store changes?

I couldn’t have been more wrong on this topic. Git does not follow the diff-based approach, although that is the approach Git lets us think it is using. The way Git presents changes to us (for example, the diffs printed by git show or git log -p) makes the user believe that Git uses an ‘efficient’ diff-based model. In reality, Git's object model does not record diffs or deltas. Git saves each version of a file whole, in a compressed format. Every time we add and commit a file in the repository, Git generates a checksum (a SHA-1 hash by default) for that file and stores the compressed file under a name equal to that checksum. (Git may later delta-compress objects when it packs them into packfiles, but that is a storage optimization beneath the same snapshot model.)
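
The checksum described above can be reproduced outside Git. What follows is a minimal sketch, assuming the default SHA-1 object format; the function name git_blob_id is made up for this illustration and is not part of Git.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the object ID Git would assign to a blob with this content."""
    # Git hashes a short header, "blob <size>" plus a NUL byte, followed by
    # the raw file bytes. SHA-1 is assumed (Git's default object format).
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    # Two different versions of a file get two unrelated IDs.
    print(git_blob_id(b"first version\n"))
    print(git_blob_id(b"second version\n"))
```

The result for any file can be compared with what git hash-object prints for the same file. Note that the ID depends only on the contents, never on the file name.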

The main point to note here is that although Git stores every version as is, it does not store unchanged files again. Instead, a new commit reuses the existing blobs of the unchanged files. If this is not clear yet, it will be when we see more about branching.
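
Why unchanged files cost nothing extra follows from naming objects by their checksum. Here is a toy sketch of that idea, not Git's real code: ObjectStore and snapshot are hypothetical names, and an in-memory dict stands in for the .git/objects folder.

```python
import hashlib


class ObjectStore:
    """Toy content-addressed store: objects are keyed by their checksum."""

    def __init__(self):
        self.objects = {}

    def put(self, content: bytes) -> str:
        header = b"blob " + str(len(content)).encode("ascii") + b"\0"
        oid = hashlib.sha1(header + content).hexdigest()
        # Identical content always maps to the same key, so it is kept once.
        self.objects.setdefault(oid, content)
        return oid


def snapshot(store: ObjectStore, files: dict) -> dict:
    """Record every file of a version and return a name -> checksum mapping."""
    return {name: store.put(data) for name, data in files.items()}


if __name__ == "__main__":
    store = ObjectStore()
    first = snapshot(store, {"file": b"v1\n", "file2": b"unchanged\n"})
    second = snapshot(store, {"file": b"v2\n", "file2": b"unchanged\n"})
    print(first["file2"] == second["file2"])  # both snapshots share one blob
    print(len(store.objects))                 # number of distinct blobs kept
```

Both snapshots list file2, but the store holds its contents only once, because the second put finds the checksum already present.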

Git calls commits ‘snapshots’. Every commit records the state of all the files at that version as a snapshot. Let us see an example.

Example

There are 4 commits in a repository with 2 files and a folder:

new_repo$ git log --oneline
 b4798cb Fourth commit
 71764ed fajfgj
 677358d second file
 d6ed37b first file

new_repo$ ls
 another_folder file file2

Git assigns a checksum to every version of every file. It also assigns a checksum to every tree. A tree is an object that lists the name, mode and checksum of every file and sub folder in a folder. Git also assigns a checksum to every commit.

The git cat-file command is helpful for reading Git objects, be it a version of a file (called a blob), a tree or a commit.

When we read the details of the latest commit checksum shown in the log, here is what we get:

 new_repo$ git cat-file -p b4798cb 
  tree 529d516426af9681a7ae009d48c7ddba8a8dd88a 
  parent 71764ed6cfb9006ce889a2842f5236ad7b5d3b6c 
  author harish1996  1524984326 +0530 
  committer harish1996  1524984326 +0530 

 Fourth commit 

Each commit points to a tree. It also records its parent commit(s), except for the very first commit, which has none. Finally, every commit carries some metadata: the author, the committer, their timestamps and the commit message.
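
To make the layout of a commit object concrete, here is a small sketch that splits the text printed by git cat-file -p into its parts. The function parse_commit is a hypothetical helper written for this article, and SAMPLE is copied from the transcript above as it appears here.

```python
SAMPLE = (
    "tree 529d516426af9681a7ae009d48c7ddba8a8dd88a\n"
    "parent 71764ed6cfb9006ce889a2842f5236ad7b5d3b6c\n"
    "author harish1996  1524984326 +0530\n"
    "committer harish1996  1524984326 +0530\n"
    "\n"
    "Fourth commit\n"
)


def parse_commit(text: str) -> dict:
    """Split a commit object's text into headers and message."""
    # Headers and message are separated by the first blank line.
    headers, _, message = text.partition("\n\n")
    commit = {"tree": None, "parents": [], "author": None,
              "committer": None, "message": message.strip()}
    for line in headers.splitlines():
        key, _, value = line.partition(" ")
        if key == "parent":
            # A merge commit has several parent lines; a root commit has none.
            commit["parents"].append(value)
        elif key in ("tree", "author", "committer"):
            commit[key] = value
    return commit


if __name__ == "__main__":
    commit = parse_commit(SAMPLE)
    print(commit["tree"])
    print(commit["parents"])
    print(commit["message"])
```

The parents entry is a list because a commit can have zero, one or several parents, while it always has exactly one tree.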

 new_repo$ git cat-file -p 529d516426af9681a7ae009d48c7ddba8a8dd88a 
  040000 tree e5cba352c904d0b512dba1ac6329e78d4cca3af5 another_folder 
  100644 blob 04591727d3423f9ab78a59031463039bc00bcd97 file 
  100644 blob 476ebfc61574a7c67895974a8da3b5bd5f744518 file2 

If we read the contents of the tree, we find that it consists of other trees and blobs that make up a particular version of the working directory. The tree entry inside this tree object is a subtree, which corresponds to a sub folder of the folder.

new_repo$ git cat-file -p 04591727d3423f9ab78a59031463039bc00bcd97
 afjkasfhjfkhdsfjksdfgdshello its me!!

On reading one of the blobs, we find that it contains the entire file as is. In an earlier commit, the file named ‘file’ had different contents. In that case, the tree points to a different blob, which holds that version of the file.

 new_repo$ git cat-file -p d6ed37b 
  tree ab40298f83e3164254389e66d10ad4808e3fe6e5 
  author harish1996  1524982942 +0530 
  committer harish1996  1524982942 +0530 

 first file 

 new_repo$ git cat-file -p ab40298f83e3164254389e66d10ad4808e3fe6e5 
  100644 blob 9a2d30c5e58145728cad8f9c4661646f1bd74cc9 file 

 new_repo$ git cat-file -p 9a2d30c5e58145728cad8f9c4661646f1bd74cc9 
  afjkasfhjfkhdsfjksdfgds 

As we can see, this object holds the previous version of the same file.
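
A tree gets its checksum the same way a blob does, from its own contents, which are the entries it lists. The sketch below assumes the SHA-1 object format and the stored tree layout (mode, name, a NUL byte, then the 20 raw bytes of the entry's ID); git_tree_id is a made-up name for this illustration.

```python
import hashlib


def git_tree_id(entries):
    """entries: list of (mode, name, hex_id) tuples, as printed by cat-file -p."""
    def normalise(entry):
        mode, name, hex_id = entry
        # cat-file -p prints directory modes as 040000; the stored form is 40000.
        return mode.lstrip("0"), name, hex_id

    def sort_key(entry):
        mode, name, _ = entry
        # Git sorts entries by name, comparing directories as if they ended in "/".
        return (name + "/" if mode == "40000" else name).encode()

    body = b""
    for mode, name, hex_id in sorted(map(normalise, entries), key=sort_key):
        # Each entry is "<mode> <name>", a NUL byte, then the 20 raw ID bytes.
        body += mode.encode() + b" " + name.encode() + b"\0" + bytes.fromhex(hex_id)
    header = b"tree " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


if __name__ == "__main__":
    # The single entry listed in the tree of the first commit above.
    first_commit_tree = [
        ("100644", "file", "9a2d30c5e58145728cad8f9c4661646f1bd74cc9"),
    ]
    print(git_tree_id(first_commit_tree))
```

Comparing the printed value with the tree ID shown for commit d6ed37b is a quick way to check the sketch. Since a tree's ID depends on the IDs of its entries, changing one file changes its blob ID, the ID of every tree above it, and therefore the commit.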

The objects are stored inside the .git/objects folder. The first 2 characters of the checksum name a sub folder inside .git/objects, and the remaining 38 characters name the object file inside that sub folder.

new_repo$ ls .git/objects/ 
  04 2b 3d 47 52 67 71 9a ab b4 d6 e5 f3 info pack 

 new_repo$ ls .git/objects/04/ 
  591727d3423f9ab78a59031463039bc00bcd97 
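
The mapping from a checksum to a file on disk can be written down in a few lines. This is a sketch that assumes the object is still stored loose (not yet moved into a packfile) and zlib-compressed; loose_object_path and read_loose_object are hypothetical helper names, and the demo fabricates its own object in a temporary folder instead of touching a real repository.

```python
import os
import tempfile
import zlib


def loose_object_path(git_dir: str, oid: str) -> str:
    """First 2 hex characters name the sub folder, the other 38 the file."""
    return os.path.join(git_dir, "objects", oid[:2], oid[2:])


def read_loose_object(git_dir: str, oid: str):
    """Return (type, content) of a loose object."""
    with open(loose_object_path(git_dir, oid), "rb") as fh:
        raw = zlib.decompress(fh.read())
    # The decompressed data is "<type> <size>", a NUL byte, then the content.
    header, _, content = raw.partition(b"\0")
    kind, _, size = header.partition(b" ")
    if int(size) != len(content):
        raise ValueError("size in header does not match content length")
    return kind.decode("ascii"), content


if __name__ == "__main__":
    oid = "04591727d3423f9ab78a59031463039bc00bcd97"
    print(loose_object_path(".git", oid))
    # Demo with a fabricated object, so no real repository is needed.
    with tempfile.TemporaryDirectory() as fake_git_dir:
        path = loose_object_path(fake_git_dir, oid)
        os.makedirs(os.path.dirname(path))
        with open(path, "wb") as fh:
            fh.write(zlib.compress(b"blob 6\0hello\n"))
        print(read_loose_object(fake_git_dir, oid))
```

In day-to-day use git cat-file remains the right tool, since it also finds objects that have been packed.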

The remarkable fact about Git is that even though it saves each and every version of every file, the stored objects are usually much smaller than the files themselves, because Git compresses them. For files used for programming, which are mostly text, compression works well.

 new_repo$ ls -l file3 
  -rw-rw-r-- 1 harish harish 904 Apr 29 13:11 file3 

 new_repo$ git log --oneline 
  f6338e0 Sixth commit 
  5c0595d fifth file 
  b4798cb Fourth commit 
  71764ed fajfgj 
  677358d second file 
  d6ed37b first file 

 new_repo$ git cat-file -p f6338e0 
  tree a123e35fb541a9e06a6c17e6a94c9225e09de483 
  parent 5c0595dd537a82d656b3ff81b379fcb6b895c3bc 
  author harish1996  1524987694 +0530 
  committer harish1996  1524987694 +0530 

 Sixth commit 

 new_repo$ git cat-file -p a123e35fb541a9e06a6c17e6a94c9225e09de483 
  040000 tree e5cba352c904d0b512dba1ac6329e78d4cca3af5 another_folder 
  100644 blob 788bb3a9c950dda038f93202985130a3e91d1f57 file 
  100644 blob 476ebfc61574a7c67895974a8da3b5bd5f744518 file2 
  100644 blob fc3cdf997028034ceab14b7f2013425de2c098da file3 

 new_repo$ ls -l .git/objects/fc/3cdf997028034ceab14b7f2013425de2c098da 
  -r--r--r-- 1 harish harish 362 Apr 29 13:11 .git/objects/fc/3cdf997028034ceab14b7f2013425de2c098da 

We can see the difference in size: file3 is 904 bytes in the working directory, whereas the object Git stores for it is only 362 bytes, although both hold the same contents.
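
The effect is easy to reproduce with zlib, the compression library Git uses for its objects. This is only a sketch: stored_size is a hypothetical helper, the sample text is invented, and the exact numbers depend on the content and on the compression level.

```python
import zlib


def stored_size(content: bytes) -> int:
    """Approximate on-disk size of a loose blob holding this content."""
    # The header is compressed together with the content.
    raw = b"blob " + str(len(content)).encode("ascii") + b"\0" + content
    return len(zlib.compress(raw))


# Invented, repetitive text standing in for a typical source file.
source = b"for i in range(10):\n    print('value', i)\n" * 20

if __name__ == "__main__":
    print("working copy:", len(source), "bytes")
    print("stored blob: ", stored_size(source), "bytes")
```

Source code tends to repeat keywords, identifiers and indentation, which is why it compresses well. Content that is already compressed, such as images or archives, gains little.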

I think we have established here that Git does not store diffs. Instead, it stores a compressed copy of every version of every file. This understanding will make branching easier to follow.
