We need to take a step back from what we saw in the previous articles in this series. Until now we assumed certain things for simplicity, to make understanding easier. But those assumptions will break any further commits to our knowledge. So let us amend our previous commits and correct the assumptions where we have not been so accurate.
The material is so basic that it will make you feel like:
All I read about Git has been a lie!!
What is Git?
Well, Git is a version control system. But in a more basic sense, Git is a mini file system inside another file system. The contents of files are stored in this file system as blobs, which are arranged in a certain way. And there are trees, which are analogous to the directories in our file systems. But why? We will see.
How does Git store changes?
I couldn't have been more wrong on this topic. Git does not follow the diff-based approach, although that is the approach Git lets us think it is using. The log and commit output is presented in a way that makes the user believe Git uses an 'efficient' diff-based approach. In reality, Git's object model does not save diffs or deltas. Git saves whole files in a compressed format. Every time we add and commit a file, Git computes a checksum (a SHA-1 hash) from the file's contents and stores the compressed file under a name equal to that checksum. (Packfiles, which Git creates later to save space, do use delta compression, but that is a storage optimization beneath the same object model.)
The main point to note here is that, although Git stores every version as is, it does not store unchanged files again. Instead, a new commit reuses the objects already stored for the unchanged files. If this is not clear yet, it will be when we look at branching.
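To make the two points above concrete, here is a minimal Python sketch of a content-addressed store. The `ObjectStore` class and its `put` method are hypothetical names invented for illustration, not part of Git. The one detail borrowed from Git itself is that a blob's checksum is the SHA-1 of a short header (`blob <size>` plus a NUL byte) followed by the content.

```python
import hashlib
import zlib


def blob_id(content: bytes) -> str:
    """Return the SHA-1 ID Git would give a blob with this content."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


class ObjectStore:
    """Hypothetical in-memory stand-in for .git/objects (not Git's code)."""

    def __init__(self):
        self.objects = {}  # maps object ID -> compressed bytes

    def put(self, content: bytes) -> str:
        oid = blob_id(content)
        # Identical content always hashes to the same ID,
        # so an unchanged file is stored only once.
        if oid not in self.objects:
            header = b"blob %d\0" % len(content)
            self.objects[oid] = zlib.compress(header + content)
        return oid


store = ObjectStore()
first = store.put(b"hello world\n")    # file as of one commit
second = store.put(b"hello world\n")   # same file, unchanged, in the next commit
third = store.put(b"hello world!\n")   # edited file in a later commit

print(first == second)      # True
print(len(store.objects))   # 2
```

Three versions were "committed", but only two objects exist, because the unchanged version maps to the ID that was already stored.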
Git calls commits 'snapshots'. It records the state of the files at every version in the form of a snapshot. Let us see an example.
Here is a repository with 4 commits, 2 files and a folder:
    new_repo$ git log --oneline
    b4798cb Fourth commit
    71764ed fajfgj
    677358d second file
    d6ed37b first file
    new_repo$ ls
    another_folder  file  file2
Git assigns a checksum to every version of every file. It also assigns a checksum to every tree. A tree is an object that holds the names and checksums of all the files and subfolders in a folder. Git also assigns a checksum to every commit.
The git cat-file command is helpful for reading Git objects, be it a version of a file (called a blob), a tree, or a commit.
When we read the commit whose checksum appears at the top of the log, here is what we get:
    new_repo$ git cat-file -p b4798cb
    tree 529d516426af9681a7ae009d48c7ddba8a8dd88a
    parent 71764ed6cfb9006ce889a2842f5236ad7b5d3b6c
    author harish1996 1524984326 +0530
    committer harish1996 1524984326 +0530

    Fourth commit
Each commit points to a tree. It also records its parent commit(s), and carries some metadata such as the author, the committer and the commit message.
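The structure of a commit can be sketched in a few lines of Python. The `parse_commit` helper below is a hypothetical name made up for this illustration, and it is deliberately simplified: it assumes every header fits on one line, which holds for the commit shown here but not for, say, signed commits.

```python
def parse_commit(text):
    """Split the text of a commit object into header fields and message."""
    head, _, message = text.partition("\n\n")
    commit = {"tree": None, "parents": [], "message": message.strip()}
    for line in head.splitlines():
        key, _, value = line.partition(" ")
        if key == "tree":
            commit["tree"] = value
        elif key == "parent":
            # A merge commit has more than one parent line.
            commit["parents"].append(value)
        else:
            commit[key] = value  # author, committer, ...
    return commit


# Commit text copied from the cat-file output above.
raw = """tree 529d516426af9681a7ae009d48c7ddba8a8dd88a
parent 71764ed6cfb9006ce889a2842f5236ad7b5d3b6c
author harish1996 1524984326 +0530
committer harish1996 1524984326 +0530

Fourth commit
"""

commit = parse_commit(raw)
print(commit["tree"])
print(commit["parents"])
print(commit["message"])
```

The very first commit of a repository has no parent line at all, so its list of parents would simply be empty.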
    new_repo$ git cat-file -p 529d516426af9681a7ae009d48c7ddba8a8dd88a
    040000 tree e5cba352c904d0b512dba1ac6329e78d4cca3af5    another_folder
    100644 blob 04591727d3423f9ab78a59031463039bc00bcd97    file
    100644 blob 476ebfc61574a7c67895974a8da3b5bd5f744518    file2
If we read the contents of the tree, we find that it consists of the blobs and other trees that make up a particular version of the working directory. The tree entry inside this tree object is a subtree, which corresponds to a subfolder of the folder.
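Since a tree is itself an object with a checksum, it may help to see roughly how that checksum comes about. The sketch below is an approximation under stated assumptions: `tree_id` is a hypothetical helper, and it sorts entries by plain name, whereas Git compares directory names as if they ended in a slash. Each entry is serialized as the mode, a space, the name, a NUL byte, and the 20 raw bytes of the child's ID.

```python
import hashlib


def tree_id(entries):
    """Compute a tree's SHA-1 from (mode, name, hex_id) entries."""
    body = b""
    # Simplified ordering: plain sort by name. Real Git treats directory
    # names as if they ended in "/", which only matters in edge cases.
    for mode, name, hex_id in sorted(entries, key=lambda e: e[1].encode()):
        body += mode.encode() + b" " + name.encode() + b"\0"
        body += bytes.fromhex(hex_id)  # 20 raw bytes, not 40 hex characters
    return hashlib.sha1(b"tree %d\0" % len(body) + body).hexdigest()


# Entries copied from the listing above. Git stores the directory mode as
# "40000" internally; cat-file -p pads it to "040000" for display.
entries = [
    ("40000", "another_folder", "e5cba352c904d0b512dba1ac6329e78d4cca3af5"),
    ("100644", "file", "04591727d3423f9ab78a59031463039bc00bcd97"),
    ("100644", "file2", "476ebfc61574a7c67895974a8da3b5bd5f744518"),
]

print(tree_id(entries))
print(tree_id([]))  # ID of a tree with no entries
```

Because the tree's ID depends on the IDs of its children, changing a single file changes its blob ID, which changes the ID of every tree above it, which in turn changes the commit ID.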
    new_repo$ git cat-file -p 04591727d3423f9ab78a59031463039bc00bcd97
    afjkasfhjfkhdsfjksdfgdshello its me!!
On reading one of the blobs, we find that it contains the entire file as is. In an earlier commit, the file named 'file' had different contents. In that case, the tree points to a different blob, which holds that version of the file.
    new_repo$ git cat-file -p d6ed37b
    tree ab40298f83e3164254389e66d10ad4808e3fe6e5
    author harish1996 1524982942 +0530
    committer harish1996 1524982942 +0530

    first file
    new_repo$ git cat-file -p ab40298f83e3164254389e66d10ad4808e3fe6e5
    100644 blob 9a2d30c5e58145728cad8f9c4661646f1bd74cc9    file
    new_repo$ git cat-file -p 9a2d30c5e58145728cad8f9c4661646f1bd74cc9
    afjkasfhjfkhdsfjksdfgds
As we can see, this object holds the previous version of the same file.
The objects are stored inside the .git/objects folder. The first 2 characters of the checksum name a subfolder inside .git/objects, and the remaining 38 characters name the file inside that subfolder.
    new_repo$ ls .git/objects/
    04  2b  3d  47  52  67  71  9a  ab  b4  d6  e5  f3  info  pack
    new_repo$ ls .git/objects/04/
    591727d3423f9ab78a59031463039bc00bcd97
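The layout shown above can be sketched in Python as follows. Both function names are hypothetical, and the reader only handles loose objects like the ones in this listing, not objects that have been moved into the pack folder. A loose object is a zlib-compressed file whose decompressed form starts with a header such as `blob 12`, then a NUL byte, then the content.

```python
import os
import zlib


def loose_object_path(oid, git_dir=".git"):
    """Map an object ID to its file: first 2 characters are the folder."""
    return os.path.join(git_dir, "objects", oid[:2], oid[2:])


def read_loose_object(path):
    """Decompress a loose object file and return (type, content)."""
    with open(path, "rb") as f:
        raw = zlib.decompress(f.read())
    header, _, body = raw.partition(b"\0")
    kind, size = header.split(b" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match content length")
    return kind.decode(), body


# Path for the blob ID used in the listing above.
print(loose_object_path("04591727d3423f9ab78a59031463039bc00bcd97"))
```

Run inside a real repository, `read_loose_object(loose_object_path(some_id))` should return roughly what `git cat-file -t` and `git cat-file -p` report for a blob.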
The remarkable fact about Git is that, even though it saves each and every version of the working directory, the stored objects are much smaller than the original files. Git compresses files, and for the kind of text files used in programming, compression works well. (In the listing below, gc is an alias for git cat-file -p.)
    new_repo$ ls -l file3
    -rw-rw-r-- 1 harish harish 904 Apr 29 13:11 file3
    new_repo$ git log --oneline
    f6338e0 Sixth commit
    5c0595d fifth file
    b4798cb Fourth commit
    71764ed fajfgj
    677358d second file
    d6ed37b first file
    new_repo$ gc f6338e0
    tree a123e35fb541a9e06a6c17e6a94c9225e09de483
    parent 5c0595dd537a82d656b3ff81b379fcb6b895c3bc
    author harish1996 1524987694 +0530
    committer harish1996 1524987694 +0530

    Sixth commit
    new_repo$ gc a123e35fb541a9e06a6c17e6a94c9225e09de483
    040000 tree e5cba352c904d0b512dba1ac6329e78d4cca3af5    another_folder
    100644 blob 788bb3a9c950dda038f93202985130a3e91d1f57    file
    100644 blob 476ebfc61574a7c67895974a8da3b5bd5f744518    file2
    100644 blob fc3cdf997028034ceab14b7f2013425de2c098da    file3
    new_repo$ ls -l .git/objects/fc/3cdf997028034ceab14b7f2013425de2c098da
    -r--r--r-- 1 harish harish 362 Apr 29 13:11 .git/objects/fc/3cdf997028034ceab14b7f2013425de2c098da
We can see the difference in size: file3 is 904 bytes in the working directory, while the object that stores it is only 362 bytes, even though both hold the same content.
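The effect is easy to reproduce with zlib, the compression library Git uses for its objects. The snippet below is only an illustration with made-up content: the `source` text is an invented, repetitive piece of code, so the exact sizes it prints will not match the numbers above.

```python
import zlib

# Made-up, repetitive source text standing in for a program file.
source = b"for i in range(10):\n    print(i)\n" * 30

compressed = zlib.compress(source)

print(len(source))       # size of the original bytes
print(len(compressed))   # size after compression, much smaller here

# Compression is lossless: the original comes back byte for byte.
print(zlib.decompress(compressed) == source)  # True
```

Source code tends to repeat keywords, identifiers and indentation, which is the kind of redundancy this compression exploits.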
I think we have established here that Git does not store diffs. Instead, it stores a compressed copy of every version of every file. This understanding will make branching easier to follow.