Today, I want to share what the Git index is. The index is a binary file
(generally kept in .git/index) containing a sorted list of path names, each
with permissions and the SHA-1 of a blob object; git ls-files can show you the
contents of the index. Please note that the words index, stage, and cache refer
to the same thing in Git: they are used interchangeably. I’m using macOS for the
examples in this post.
This blog post will go through the following steps:
0. Prerequisites
Create a project named app and initialize it as a Git repository. Then, create a
README file and add it to the index.
~ $ mkdir app && cd app
app $ git init
app (master #) $ echo 'Hello world' > README.md
app (master #%) $ git add README.md
app (master +) $
1. Understand Index via git-ls-files
Now, list the indexed files using the git ls-files tool:
app (master +) $ git ls-files --stage
100644 802992c4220de19a90767f3000a79a31b98d0df7 0 README.md
The flag --stage shows the staged contents’ mode bits, object name, and stage
number in the output. So in our case, the output means:
- 100644: the mode bits of the README file, a regular non-executable file.
- 802992c4220de19a90767f3000a79a31b98d0df7: the object name, that is, the SHA-1 of the blob holding the contents of README.md.
- 0: the stage number (slot), useful during merge conflict handling.
- README.md: the path name.
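The object name in that output can be reproduced outside Git. Here is a minimal Python sketch, assuming the usual blob encoding (the header "blob <size>", a NUL byte, then the raw file contents) and a SHA-1 repository. The function name blob_id is made up for this illustration and is not part of Git.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Return the SHA-1 object name Git gives a blob with this content."""
    # A blob is hashed as b"blob <size in bytes>\0" followed by the content.
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    # echo 'Hello world' writes the text plus a trailing newline (12 bytes).
    print(blob_id(b"Hello world\n"))
```

Run against the README.md created in the prerequisites, the result should match the object name printed by git ls-files --stage.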
The mode bits, written as six octal digits, encode the object type and the file
permission. They can also be read in binary form. Mode bits 100644
(0b1000000110100100) mean this file is a regular file, readable and writable by
the owner of the file, readable by other users in the owner’s group, and
readable by others. In the breakdown below, part (a) is a 4-bit object type,
whose valid values in binary are 1000 (regular file), 1010 (symbolic link) and
1110 (gitlink); part (b) is 3 unused bits; part (c) is the 9-bit Unix permission.
Only 0755 and 0644 are valid for regular files. Symbolic links and gitlinks
have value 0 in this field. These notions are defined in Git’s
index-format.txt.
1000 000 110100100
(a) (b) (c)
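The same split can be done with a few shifts and masks. This is a small Python sketch of the layout described above; split_mode is a hypothetical helper name, not a Git function.

```python
from typing import Tuple


def split_mode(mode: int) -> Tuple[int, int, int]:
    """Split a 16-bit index mode into (object type, unused bits, permission)."""
    object_type = (mode >> 12) & 0b1111   # part (a): 4 bits
    unused = (mode >> 9) & 0b111          # part (b): 3 bits
    permission = mode & 0b111111111       # part (c): 9 bits
    return object_type, unused, permission


if __name__ == "__main__":
    object_type, unused, permission = split_mode(0o100644)
    # Print the three parts in binary, then the permission in octal.
    print(f"{object_type:04b} {unused:03b} {permission:09b}")
    print(oct(permission))
```

Passing 0o120000 (symbolic link) or 0o160000 (gitlink) gives the object types 1010 and 1110 with a permission of 0.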
The different stage numbers are not really used by the git add command. They are used for handling merge conflicts. In a nutshell:
- Slot 0: “normal”, un-conflicted, all-is-well entry.
- Slot 1: “base”, the common ancestor version.
- Slot 2: “ours”, the target (HEAD) version.
- Slot 3: “theirs”, the being-merged-in version.
2. Inside .git/index
Now, let’s make a bit-level (binary digit) dump of the Git index file .git/index:
app (master +) $ xxd -b -c 4 .git/index
00000000: 01000100 01001001 01010010 01000011 DIRC
00000004: 00000000 00000000 00000000 00000010 ....
00000008: 00000000 00000000 00000000 00000001 ....
0000000c: 01011010 11100100 00101011 00100000 Z.+
00000010: 00011101 10110001 00001001 01100010 ...b
00000014: 01011010 11100100 00101011 00100000 Z.+
00000018: 00011101 10110001 00001001 01100010 ...b
0000001c: 00000001 00000000 00000000 00000101 ....
00000020: 00000000 01011111 00001111 00110100 ._.4
00000024: 00000000 00000000 10000001 10100100 ....
00000028: 00000000 00000000 00000001 11110101 ....
0000002c: 00000000 00000000 00000000 00010100 ....
00000030: 00000000 00000000 00000000 00001100 ....
00000034: 10000000 00101001 10010010 11000100 .)..
00000038: 00100010 00001101 11100001 10011010 "...
0000003c: 10010000 01110110 01111111 00110000 .v.0
00000040: 00000000 10100111 10011010 00110001 ...1
00000044: 10111001 10001101 00001101 11110111 ....
00000048: 00000000 00001001 01010010 01000101 ..RE
0000004c: 01000001 01000100 01001101 01000101 ADME
00000050: 00101110 01101101 01100100 00000000 .md.
00000054: 11100110 10111010 11001111 10000001 ....
00000058: 01111101 11100111 10011001 11100100 }...
0000005c: 00010010 10110101 00111001 01101011 ..9k
00000060: 00010111 00001111 00010000 01100011 ...c
00000064: 01110001 00001110 11101000 01011011 q..[
or a hex dump version:
app (master +) $ xxd .git/index
00000000: 4449 5243 0000 0002 0000 0001 5ae4 2b20 DIRC........Z.+
00000010: 1db1 0962 5ae4 2b20 1db1 0962 0100 0005 ...bZ.+ ...b....
00000020: 005f 0f34 0000 81a4 0000 01f5 0000 0014 ._.4............
00000030: 0000 000c 8029 92c4 220d e19a 9076 7f30 .....).."....v.0
00000040: 00a7 9a31 b98d 0df7 0009 5245 4144 4d45 ...1......README
00000050: 2e6d 6400 e6ba cf81 7de7 99e4 12b5 396b .md.....}.....9k
00000060: 170f 1063 710e e85b ...cq..[
The index file consists of several parts:
- A 12-byte header.
- A number of sorted index entries.
- Extensions. They are identified by signature.
- 160-bit SHA-1 over the content of the index file before this checksum.
2.1. Header
As you can see, the index starts with a 12-byte header:
hex: 4449 5243 0000 0002 0000 0001
The header consists of a 4-byte signature “DIRC” (0x44495243), standing for
DirCache; a 4-byte version number “2” (0x00000002), which is the index format
version used here (Git currently supports versions 2, 3, and 4); and a 32-bit
number of index entries, “1” (0x00000001).
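To make the header layout concrete, here is a short Python sketch that decodes those 12 bytes with the standard struct module. The bytes are copied from the hex dump in this post; parse_header is a name invented for the example.

```python
import struct

# Header bytes copied from the hex dump in this post.
HEADER = bytes.fromhex("4449 5243 0000 0002 0000 0001")


def parse_header(data: bytes):
    """Return (signature, version, entry_count) from the first 12 bytes."""
    # ">4sII": big-endian, a 4-byte string, then two unsigned 32-bit integers.
    signature, version, entry_count = struct.unpack(">4sII", data[:12])
    if signature != b"DIRC":
        raise ValueError("not a Git index file")
    return signature, version, entry_count
```

Calling parse_header(HEADER) yields the signature b"DIRC", version 2, and one entry. The same function should work on the first 12 bytes of any .git/index read in binary mode.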
2.2. Index Entry
Before going further, we also need the stat data for README.md:
app (master +) $ stat -f 'mtime=%m ctime=%c' README.md | tr ' ' '\n'
mtime=1524902688
ctime=1524902688
app (master +) $ stat -t '%FT%T' README.md
16777221 8596164404 -rw-r--r-- 1 mincong staff 0 12 "2018-04-28T17:09:29" "2018-04-28T10:04:48" "2018-04-28T10:04:48" "2018-04-28T10:04:48" 4194304 8 0 README.md
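The same numbers can be collected programmatically. The Python sketch below gathers them with os.stat and keeps only the low 32 bits of each value. That truncation is an assumption on my side: it matches the inode seen in this post, but I have not checked it against every platform. The name stat_fields and the dictionary keys are illustrative only.

```python
import os

MASK32 = 0xFFFFFFFF  # the index stores each stat field in 32 bits


def stat_fields(path: str) -> dict:
    """Collect the stat values of a file, truncated to 32 bits each."""
    st = os.stat(path)
    return {
        # Timestamps are split into whole seconds and a nanosecond fraction.
        "ctime_s": (st.st_ctime_ns // 10**9) & MASK32,
        "ctime_ns": st.st_ctime_ns % 10**9,
        "mtime_s": (st.st_mtime_ns // 10**9) & MASK32,
        "mtime_ns": st.st_mtime_ns % 10**9,
        "dev": st.st_dev & MASK32,
        "ino": st.st_ino & MASK32,
        # The mode is left out on purpose: as noted earlier, Git only
        # records 0644 or 0755 for regular files, not the raw st_mode.
        "uid": st.st_uid & MASK32,
        "gid": st.st_gid & MASK32,
        "size": st.st_size & MASK32,
    }
```

For example, stat_fields("README.md") returns a dictionary whose values can be compared, field by field, with the bytes of the index entry.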
The index entry for README.md consists of:
- 64-bit ctime
  hex: 5ae4 2b20 1db1 0962
  32-bit ctime seconds, the last time the file’s metadata changed, followed by 32-bit ctime nanosecond fractions. Here the seconds value (0x5ae42b20) means “2018-04-28T10:04:48”. According to the HFS+ volume format spec, HFS+ only stores timestamps to a granularity of one second, in which case the nanosecond fraction carries no information. In this dump, however, the fraction is non-zero (0x1db10962, that is, 498141538 ns).
- 64-bit mtime
  hex: 5ae4 2b20 1db1 0962
  32-bit mtime seconds, the last time the file’s data changed, followed by 32-bit mtime nanosecond fractions. The value is identical to the ctime: “2018-04-28T10:04:48” (0x5ae42b20), and the same remark about the nanosecond fraction applies.
- 32-bit dev
  hex: 0100 0005
  The device (16777221) on which the file README.md resides.
- 32-bit ino
  hex: 005f 0f34
  The file inode number. stat reports 8596164404, which does not fit in 32 bits, so only the low 32 bits are stored: 8596164404 mod 2^32 = 6229812 (0x005f0f34).
- 32-bit mode
  hex: 0000 81a4
  bin: 00000000 00000000 10000001 10100100
  The file mode 100644 (octal), which is 0x81a4 in hexadecimal.
- 32-bit uid
  hex: 0000 01f5
  The user identifier of the file’s owner, here the current user. Mine is 501 (0x1f5):
  $ id -u
  501
- 32-bit gid
  hex: 0000 0014
  The group identifier of the file’s group, here the current user’s group. Mine is 20 (0x14):
  $ id -g
  20
- 32-bit file size
  hex: 0000 000c
  The file size is 12 bytes (0xc).
- 160-bit SHA-1 object ID
  hex: 8029 92c4 220d e19a 9076 7f30 00a7 9a31 b98d 0df7
  The 20-byte (160-bit) SHA-1 of the object this entry represents. This is the blob ID of the README.md file, the same value we saw previously in the git ls-files output.
- 16-bit flags field
  hex: 0009
  bin: 00000000 00001001
  Split into (high to low bits): a 1-bit assume-valid flag (false); a 1-bit extended flag (must be zero in version 2); a 2-bit stage (during merge); and a 12-bit name length if the length is less than 0xFFF, otherwise 0xFFF is stored in this field. Here the name length is 9 (0x9) and all other bits are zero.
- 16-bit extended flags (version 3 or later)
  Absent here, because we are in version 2. The next byte in the dump, 0x52 (“R”), already belongs to the path.
- File path (length=9+1)
  hex: 5245 4144 4d45 2e6d 6400
  bin: 01010010 01000101 01000001 01000100 01001101 01000101 00101110 01101101 01100100 00000000
  The index entry path name relative to the top-level directory (without leading slash). Here, its value is “README.md” (length=9), followed by one NUL byte.
- NUL byte(s)
  1 to 8 NUL bytes, as many as needed to pad the entry to a multiple of eight bytes while keeping the name NUL-terminated. Here a single NUL byte is enough: the fixed fields take 62 bytes and the name 9 bytes, so one NUL byte brings the entry to 72 bytes.
The index entry starts with the 64-bit ctime. In total, it has 72 bytes:
hex: 5ae4 2b20 1db1 0962 5ae4 2b20 1db1 0962 0100 0005 005f 0f34 0000 81a4 0000 01f5 0000 0014 0000 000c 8029 92c4 220d e19a 9076 7f30 00a7 9a31 b98d 0df7 0009 5245 4144 4d45 2e6d 6400
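The field-by-field reading above can be checked with a short Python sketch that decodes these 72 bytes. It assumes a version 2 entry and a path shorter than 0xFFF bytes. parse_entry and the dictionary keys are names chosen for this example, not anything defined by Git.

```python
import struct

# The 72-byte entry copied from the hex dump in this post.
ENTRY = bytes.fromhex(
    "5ae42b20 1db10962 5ae42b20 1db10962 01000005 005f0f34"
    "000081a4 000001f5 00000014 0000000c"
    "802992c4220de19a90767f3000a79a31b98d0df7"
    "0009"
    "524541444d452e6d6400"
)


def parse_entry(data: bytes) -> dict:
    """Decode one version 2 index entry that starts at data[0]."""
    # Ten big-endian unsigned 32-bit integers hold the stat data (40 bytes).
    (ctime_s, ctime_ns, mtime_s, mtime_ns,
     dev, ino, mode, uid, gid, size) = struct.unpack(">10I", data[:40])
    object_id = data[40:60].hex()              # 20-byte SHA-1
    (flags,) = struct.unpack(">H", data[60:62])
    name_length = flags & 0x0FFF               # low 12 bits
    name = data[62:62 + name_length].decode("utf-8")
    # 62 fixed bytes + name + 1 to 8 NUL bytes, giving a multiple of 8.
    entry_length = (62 + name_length + 8) // 8 * 8
    return {
        "ctime_s": ctime_s, "ctime_ns": ctime_ns,
        "mtime_s": mtime_s, "mtime_ns": mtime_ns,
        "dev": dev, "ino": ino, "mode": mode,
        "uid": uid, "gid": gid, "size": size,
        "object_id": object_id,
        "assume_valid": bool(flags & 0x8000),  # highest bit
        "extended": bool(flags & 0x4000),      # must be 0 in version 2
        "stage": (flags >> 12) & 0b11,         # 2-bit merge stage
        "name": name,
        "entry_length": entry_length,
    }
```

With parse_entry(ENTRY), the decoded inode is 6229812, the low 32 bits of the 8596164404 reported by stat, and entry_length is 72, which is where the next section of the file begins.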
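2.3. Extensions
There are no extensions in this index: the entry ends at offset 0x54, and the 20 bytes that remain are the checksum.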
2.4. SHA-1 Index Checksum
hex: e6ba cf81 7de7 99e4 12b5 396b 170f 1063 710e e85b
A 160-bit SHA-1 over the content of the index file before this checksum. If the index is corrupted, you might see an error like “index file corrupt”:
$ git status
error: bad index file sha1 signature
fatal: index file corrupt
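The check behind that error can be approximated in a few lines. The Python sketch below recomputes the SHA-1 over everything except the last 20 bytes and compares it with the stored value. It assumes a SHA-1 index read fully into memory; both function names are hypothetical.

```python
import hashlib


def index_checksum_ok(data: bytes) -> bool:
    """Check the trailing SHA-1 of a complete index file held in memory."""
    # The last 20 bytes are the checksum of everything that precedes them.
    body, stored = data[:-20], data[-20:]
    return hashlib.sha1(body).digest() == stored


def check_index_file(path: str = ".git/index") -> bool:
    """Read an index file from disk and verify its trailing checksum."""
    with open(path, "rb") as f:
        return index_checksum_ok(f.read())
```

Run from the top level of a repository, check_index_file() should return True for a healthy index, and False if any byte in the file has been altered.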
3. Why Index?
The Git index, or Git cache, has three important properties:
- The index contains all the information necessary to generate a single (uniquely determined) tree object.
- The index enables fast comparisons between the tree object it defines and the working tree.
- It can efficiently represent information about merge conflicts between different tree objects, allowing each pathname to be associated with sufficient information about the trees involved that you can create a three-way merge between them.
References
- Git: index-format
- Stack Overflow: How to read the mode field of git-ls-tree’s output
- Stack Overflow: How do contents of git index evolve during a merge (and what’s in the index after a failed merge)?
- Stack Overflow: How to resolve “Error: bad index – Fatal: index file corrupt” when using Git
- Microsoft: DevOps - Git Internals: Architecture and Index Files
- Wikipedia: Null-terminated string
- Git Community Book: The Git Index
- Apple: HFS+ Volume Format