Btrees

Btrees Introduction

Btrfs uses a single set of btree manipulation code for all metadata in the filesystem. For performance or organizational purposes, the trees are broken up into a few different types, and each type of tree will hold a few different types of keys. The super block holds pointers to the tree roots of the tree of tree roots and the chunk tree.

Tree of Tree roots

This tree is used for indexing and finding the root of most of the other trees in the filesystem. It attaches names to subvolumes and snapshots, and stores the location of the extent allocation tree root. It also stores pointers to all of the subvolumes or snapshots that are being deleted by the transaction code. This allows the deletion to pick up where it left off after a crash.

Chunk Tree

The chunk tree does all of the logical to physical block address mapping for the filesystem, and it stores information about all of the devices in the FS. In order to bootstrap lookup in the chunk tree, the super block also duplicates the chunk items needed to resolve blocks in the chunk tree. Over time, the chunk tree will be split into multiple roots to allow access of larger storage pools.

There are back references from the chunk items to the extent tree that allocated them. Only a single extent tree can allocate extents out of a given chunk.

Two types of key are stored in the chunk tree:

DEV_ITEM (where the offset field is the internal devid), which contain information on all of the underlying block devices in the filesystem
CHUNK_ITEM (where the offset field is the start of the chunk as a virtual address), which maps a section of the virtual address space (a chunk) into physical storage.

Device Allocation Tree

The device allocation tree records which parts of each physical device have been allocated into chunks. This is a relatively small tree that is only updated as new chunks are allocated. It stores back references to the chunk tree that allocated each physical extent on the device.

Extent Allocation Tree

The extent allocation tree records byte ranges that are in use, maintains reference counts on each extent and records back references to the tree or file that is using each extent. Logical block groups are created inside the extent allocation tree, and these reference large logical extents from the chunk tree.

Each block group can only store a specific type of extent. This might include metadata, or mirrored metadata, or striped data blocks etc.

Currently there is only one extent allocation tree shared by all the other trees. This will change in order to scale better under load.

Keys for the extent tree use the start of the extent as the objectid. A BLOCK_GROUP_ITEM key will be followed by the EXTENT_ITEM keys for extents within that block group.

FS Trees

These store files and directories, and all of the normal metadata you would expect to find in a filesystem. There is one root for each subvolume or snapshot, but snapshots will share blocks between roots.

Keys in FS trees always use the inode number of the filesystem object as the objectid.

Each object will have one or more of:

Inode.
Inode ref, indicating what name this object is known as, and in which directory.
For files, a set of extent information, indicating where on the filesystem this file’s data is.
For directories, two sequences of dir_items, one indexed by a hash of the object name, and one indexed by a unique sequential index number.

Checksum Tree

The checksum tree stores block checksums. Every 4k block of data stored on disk has a checksum associated with it. The “offset” part of the keys in the checksum tree indicates the start of the checksummed data on disk. The value stored with the key is a sequence of (currently 4-byte) checksums, for the 4k blocks starting at the offset.