Git: Understanding the Index File

Git index file (.git/index) is a binary file having the following format: a 12-byte header, a number of sorted index entries, extensions, and a SHA-1 checksum. Now let's create a new Git repository...

Today, I want to share with you what is Git Index. The index is a binary file (generally kept in .git/index) containing a sorted list of path names, each with permissions and the SHA1 of a blob object; git ls-files can show you the contents of the index. Please note that words index, stage, and cache are the same thing in Git: they are used interchangeably. I’m using macOS for the examples in this post.

This blog post will go through the following steps:

0. Prerequisites

Create a project app and initialize as Git project. Then, create a readme file and add it to index.

~ $ mkdir app && cd app
app $ git init
app (master #) $ echo 'Hello world' > README.md
app (master #%) $ git add README.md
app (master +) $

1. Understand Index via git-ls-files

Now, see the indexed files using git-ls-files tool:

app (master +) $ git ls-files --stage
100644 802992c4220de19a90767f3000a79a31b98d0df7 0	README.md

The flag --stage shows staged contents’ mode bits, object name and stage number in the output. So in our case, the output means:

  • The read-me file mode bits are 100644, a regular non-executable file.
  • The SHA-1 value of blob README.md
  • The stage number (slots), useful during merge conflict handling.
  • The object name.

Mode bits, the 6 digits, is the octal notation of the file permission. It can also be read in binary form. Mode bits 100644 (0b1000000110100100) means this file is a regular file, readable and writable by owner of the file, readable by other users in the owner group, and readable by others. Part (a) is a 4-bit object type, valid values in binary are 1000 (regular file), 1010 (symbolic link) and 1110 (gitlink); Part (b) is 3-bit unused; Part (c) is 9-bit unix permission. Only 0755 and 0644 are valid for regular files. Symbolic links and gitlinks have value 0 in this field. These notions are defined in Git index-format.txt.

1000 000 110100100
(a)  (b)    (c)

The different stage numbers are not really used during git-add command. They are used for handling merge conflicts. In a nutshell:

  • Slot 0: “normal”, un-conflicted, all-is-well entry.
  • Slot 1: “base”, the common ancestor version.
  • Slot 2: “ours”, the target (HEAD) version.
  • Slot 3: “theirs”, the being-merged-in version.

2. Inside .git/index

Now, let’s make a bits (binary digits) dump for the Git index file .git/index:

app (master +) $ xxd -b -c 4 .git/index
00000000: 01000100 01001001 01010010 01000011  DIRC
00000004: 00000000 00000000 00000000 00000010  ....
00000008: 00000000 00000000 00000000 00000001  ....
0000000c: 01011010 11100100 00101011 00100000  Z.+
00000010: 00011101 10110001 00001001 01100010  ...b
00000014: 01011010 11100100 00101011 00100000  Z.+
00000018: 00011101 10110001 00001001 01100010  ...b
0000001c: 00000001 00000000 00000000 00000101  ....
00000020: 00000000 01011111 00001111 00110100  ._.4
00000024: 00000000 00000000 10000001 10100100  ....
00000028: 00000000 00000000 00000001 11110101  ....
0000002c: 00000000 00000000 00000000 00010100  ....
00000030: 00000000 00000000 00000000 00001100  ....
00000034: 10000000 00101001 10010010 11000100  .)..
00000038: 00100010 00001101 11100001 10011010  "...
0000003c: 10010000 01110110 01111111 00110000  .v.0
00000040: 00000000 10100111 10011010 00110001  ...1
00000044: 10111001 10001101 00001101 11110111  ....
00000048: 00000000 00001001 01010010 01000101  ..RE
0000004c: 01000001 01000100 01001101 01000101  ADME
00000050: 00101110 01101101 01100100 00000000  .md.
00000054: 11100110 10111010 11001111 10000001  ....
00000058: 01111101 11100111 10011001 11100100  }...
0000005c: 00010010 10110101 00111001 01101011  ..9k
00000060: 00010111 00001111 00010000 01100011  ...c
00000064: 01110001 00001110 11101000 01011011  q..[

or a hex dump version:

app (master +) $ xxd .git/index
00000000: 4449 5243 0000 0002 0000 0001 5ae4 2b20  DIRC........Z.+
00000010: 1db1 0962 5ae4 2b20 1db1 0962 0100 0005  ...bZ.+ ...b....
00000020: 005f 0f34 0000 81a4 0000 01f5 0000 0014  ._.4............
00000030: 0000 000c 8029 92c4 220d e19a 9076 7f30  .....).."....v.0
00000040: 00a7 9a31 b98d 0df7 0009 5245 4144 4d45  ...1......README
00000050: 2e6d 6400 e6ba cf81 7de7 99e4 12b5 396b  .md.....}.....9k
00000060: 170f 1063 710e e85b                      ...cq..[

The index file contains several information:

  1. A 12-byte header.
  2. A number of sorted index entries.
  3. Extensions. They are identified by signature.
  4. 160-bit SHA-1 over the content of the index file before this checksum.

2.1. Header

As you can see, the index starts with a 12-byte header:

hex: 4449 5243 0000 0002 0000 0001

A 12-byte header consisting of a 4-byte signature “DIRC” (0x44495243) standing for DirCache; 4-byte version number “2” (0x00000002), which is the current version of Git index format; 32-bit number of index entries “1” (0x00000001).

2.2. Index Entry

Before going further, we also need the statistic data about README.md:

app (master +) $ stat -f 'mtime=%m ctime=%c' README.md | tr ' ' '\n'
mtime=1524902688
ctime=1524902688
app (master +) $ stat -t '%FT%T' README.md
16777221 8596164404 -rw-r--r-- 1 mincong staff 0 12 "2018-04-28T17:09:29" "2018-04-28T10:04:48" "2018-04-28T10:04:48" "2018-04-28T10:04:48" 4194304 8 0 README.md

In this index entry for README.md, it consists:

  • 64-bit ctime

    hex: 5ae4 2b20 1db1 0962
    

    32-bit ctime seconds, the last time a file’s metadata changed; and 32-bit ctime nanosecond fractions. Here it means “2018-04-28T10:04:48” (0x5AE42B20). According to the HFS+ volume format spec, it only stores timestamps to a granularity of one second. In other words, the nanosecond fractions is unused.

  • 64-bit mtime

    hex: 5ae4 2b20 1db1 0962
    

    32-bit mtime seconds, the last time a file’s data changed; and 32-bit mtime nanosecond fractions. Here it means “2018-04-28T10:04:48” (0x5ae42b20). According to the HFS+ volume format spec, it only stores timestamps to a granularity of one second. In other words, the nanosecond fractions is unused.

  • 32-bit dev

    hex: 0100 0005
    

    The device (16777221) upon which file README.md resides.

  • 32-bit ino

    hex: 005f 0f34
    

    The file inode number (8596164404). The overflowed part seems to be truncated.

  • 32-bit mode

    hex: 0000 81a4
    bin: 00000000 00000000 10000001 10100100
    

    The hexadecimal representation of file permission 100644 (octal).

  • 32-bit uid

    hex: 0000 01f5
    

    The user identifier of the current user. The mine is 501 (0x1F5):

    $ id -u
    501
    
  • 32-bit gid

    hex: 0000 0014
    

    The group identifier of the current user. The mine is 20 (0x14):

    $ id -g
    20
    
  • 32-bit file size

    hex: 0000 000c
    

    The file size is 12 bytes (0xC).

  • 160-bit SHA-1 Object ID

    hex: 8029 92c4 220d e19a 9076
         7f30 00a7 9a31 b98d 0df7
    

    A 20-byte (160-bit) SHA-1 over the content of the index file before this checksum. This is the blob id of the README.md file—we’ve seen it previously via git-ls-files command, do you remember?

  • A 16-bit ‘flags’ field split into (high to low bits)

    hex: 0009
    bin: 00000000 00001001
    

    1-bit assume-valid flag (false); 1-bit extended flag (must be zero in version 2); 2-bit stage (during merge); 12-bit name length if the length is less than 0xFFF, otherwise 0xFFF is stored in this field. The value is 9 in decimal, or 0x9.

  • (Version 3 or later)… Ignore, we are in version 2.

    hex: 52
    bin: 00000000 01010010
    
  • File Path (length=9+1)

    hex: 5245 4144 4d45 2e6d 6400
    bin: 01010010 01000101 01010010 01000101
         01001101 01000101 00101110 01101101
         01100100 00000000
    

    The index entry path name relative to top level directory (without leading slash). Here, it’s value is “README.md” (length=9), with one NUL byte. See below.

  • NUL byte(s) 1 NUL byte to pad the entry to a multiple of eight bytes while keeping the name NUL-terminated. In our case, the index entry starts with 64-bit ctime. In total, it has 72 bytes:

    hex: 5ae4 2b20 1db1 0962 5ae4 2b20 1db1 0962
         0100 0005 005f 0f34 0000 81a4 0000 01f5
         0000 0014 0000 000c 8029 92c4 220d e19a
         9076 7f30 00a7 9a31 b98d 0df7 0009 5245
         4144 4d45 2e6d 6400
    

2.3 Extensions

Empty here

2.4 SHA-1 Index Checksum

hex: e6ba cf81 7de7 99e4 12b5
     396b 170f 1063 710e e85b

160-bit SHA-1 over the content of the index file before this checksum. If the index is corrupted, you might meet an error like “index file corrupt”:

$ git status
error: bad index file sha1 signature
fatal: index file corrupt

3. Why Index?

Git index, or Git cache, has 3 important properties:

  1. The index contains all the information necessary to generate a single (uniquely determined) tree object.
  2. The index enables fast comparisons between the tree object it defines and the working tree.
  3. It can efficiently represent information about merge conflicts between different tree objects, allowing each pathname to be associated with sufficient information about the trees involved that you can create a three-way merge between them.

"Pro Git (2nd Edition)" contains everything you need to know about Git, written by Scott Chacon and Ben Straub. The print version is available on Amazon: https://amzn.to/31CKi27


References