This is the high-level design for a new device mapper target that turns media designed for FAT32, such as SD cards or USB keys, into a block device that Linux file systems such as ext3 can use efficiently. The goal is to improve both the performance and the expected lifetime of the medium. See Flash Card Survey for the characteristics of the typical cards that this design is based on.

Media Format

The logical medium is organized into 4 KB pages. A two-level page table maps each logical page number to its physical location. Pages containing page indexes (indirect pages) can be freely mixed with data pages; the only exception is the first-level table, which we call the superblock.

Each indirect page contains 1024 32-bit page indexes, so it exactly fills one 4 KB page. Each index holds the number of a physical 4 KB page, with the first page on the medium numbered 0. The page number 0xffffffff marks an unallocated page.
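The two-level lookup can be sketched in C as follows. This is a minimal in-memory sketch, not the target's implementation: it assumes the indirect pages are already read into a flat array indexed by physical page number, that entries are in host byte order, and that a superblock slot may itself hold 0xffffffff for a 4 MB range that has no indirect page yet. The names `fdm_lookup`, `ENTRIES_PER_PAGE` and `UNALLOCATED` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define ENTRIES_PER_PAGE 1024u        /* 4096-byte page / 4-byte index */
#define UNALLOCATED      0xffffffffu

/* Translate a logical page number into a physical page number.
 * sb:     superblock array of indirect page locations, nr_sb entries
 * medium: the device viewed as consecutive 4 KB pages of 1024 words;
 *         only pages named in sb are interpreted as indirect pages */
uint32_t fdm_lookup(const uint32_t *sb, size_t nr_sb,
                    const uint32_t *medium, uint32_t lpn)
{
    uint32_t l1 = lpn / ENTRIES_PER_PAGE;   /* which 4 MB logical range */
    uint32_t l2 = lpn % ENTRIES_PER_PAGE;   /* which page inside that range */

    if (l1 >= nr_sb || sb[l1] == UNALLOCATED)
        return UNALLOCATED;                 /* range has no indirect page */
    /* second level: the entry may itself be UNALLOCATED */
    return medium[(size_t)sb[l1] * ENTRIES_PER_PAGE + l2];
}
```

Because each indirect page holds exactly 1024 entries, the split is a division and a remainder by 1024, equivalently a shift by 10 and a mask of 0x3ff.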

The superblock contains a variable number of indirect page indexes, each covering 4 MB (1024 pages of 4 KB) of logical address range. Their number depends on the size of the logical device, which is always smaller than the physical device so that garbage collection has spare space to work with.
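The superblock size follows directly from the 4 MB granularity. The sketch below also includes a hypothetical sizing policy (`fdm_logical_size`, with the spare fraction given in thousandths) purely to illustrate that the logical device is smaller than the physical one; the design above does not fix how much space is held back, and both function names are assumptions.

```c
#include <stdint.h>

#define BYTES_PER_SB_ENTRY (1024ull * 4096ull)   /* one indirect page maps 4 MB */

/* Number of indirect page indexes the superblock needs for a logical
 * device of logical_bytes, rounded up to whole 4 MB ranges. */
uint64_t fdm_sb_entries(uint64_t logical_bytes)
{
    return (logical_bytes + BYTES_PER_SB_ENTRY - 1) / BYTES_PER_SB_ENTRY;
}

/* Hypothetical sizing policy: hold back spare_permille/1000 of the
 * physical device for garbage collection and round the rest down to
 * whole 4 MB ranges. Returns 0 for an out-of-range spare fraction. */
uint64_t fdm_logical_size(uint64_t physical_bytes, unsigned spare_permille)
{
    if (spare_permille == 0 || spare_permille >= 1000)
        return 0;
    uint64_t spare = physical_bytes / 1000 * spare_permille;
    uint64_t logical = physical_bytes - spare;
    return logical - logical % BYTES_PER_SB_ENTRY;
}
```

For example, a 7680 MB logical device needs 1920 indirect page indexes; at 4 bytes each that is 7680 bytes, so the index table alone spans two 4 KB pages.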

The superblock may contain a free page bitmap, which makes it easy to find space for the next write. Alternatively, the free page bitmap can be built in memory by reading all indirect pages.
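Rebuilding the bitmap in memory means marking every page reachable from the superblock. A minimal sketch, assuming the indirect pages are already in memory in a flat array indexed by physical page number, and that a caller-supplied number of pages at the start of the medium (the hypothetical `reserved` parameter) covers the superblock's allocation group. A set bit means the page is in use; pages left clear are free. `fdm_build_bitmap` is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ENTRIES_PER_PAGE 1024u
#define UNALLOCATED      0xffffffffu

static void mark_used(uint8_t *bitmap, uint32_t page, uint32_t nr_phys_pages)
{
    if (page < nr_phys_pages)          /* ignore out-of-range (corrupt) indexes */
        bitmap[page / 8] |= (uint8_t)(1u << (page % 8));
}

/* Rebuild the page bitmap: set bit = in use, clear bit = free.
 * The first 'reserved' pages are always marked in use. */
void fdm_build_bitmap(uint8_t *bitmap, uint32_t nr_phys_pages,
                      const uint32_t *sb, size_t nr_sb,
                      const uint32_t *medium, uint32_t reserved)
{
    memset(bitmap, 0, ((size_t)nr_phys_pages + 7) / 8);
    for (uint32_t p = 0; p < reserved; p++)
        mark_used(bitmap, p, nr_phys_pages);

    for (size_t i = 0; i < nr_sb; i++) {
        uint32_t ind = sb[i];
        if (ind == UNALLOCATED || ind >= nr_phys_pages)
            continue;
        mark_used(bitmap, ind, nr_phys_pages);          /* the indirect page itself */
        const uint32_t *entries = medium + (size_t)ind * ENTRIES_PER_PAGE;
        for (uint32_t j = 0; j < ENTRIES_PER_PAGE; j++)
            if (entries[j] != UNALLOCATED)
                mark_used(bitmap, entries[j], nr_phys_pages);  /* live data page */
    }
}
```

This reads every indirect page once, which is the cost that keeping the bitmap in the superblock avoids.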

Further, the superblock may contain an array recording the age of each physical allocation group, which can be used for static wear leveling. The age of an allocation group is taken from an increasing 32-bit count of the allocation groups written over the lifetime of the medium.

Finally, the superblock records its own age, an increasing 32-bit number that serves as its version number, along with a checksum.

The superblock is stored in the first allocation group on the medium, but its address within that allocation group may change. When the medium is mounted, only the superblock with the highest version number and a valid checksum is used.
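Mount-time selection of the superblock might look like this. The design only says that the superblock carries a version and a checksum; the on-page layout (version in bytes 0-3, checksum in bytes 4-7, both little-endian), the use of CRC-32, and the names `fdm_sb_seal` and `fdm_pick_superblock` are assumptions of this sketch, and version wraparound is ignored.

```c
#include <stddef.h>
#include <stdint.h>

#define FDM_PAGE 4096u

static uint32_t get_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

static void put_le32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)v;
    p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16);
    p[3] = (uint8_t)(v >> 24);
}

/* Bitwise CRC-32 (IEEE 802.3, reflected polynomial 0xEDB88320). */
static uint32_t crc32_update(uint32_t crc, const uint8_t *p, size_t n)
{
    crc = ~crc;
    while (n--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Checksum over the version and the payload, skipping bytes 4-7. */
static uint32_t sb_crc(const uint8_t *page)
{
    return crc32_update(crc32_update(0, page, 4), page + 8, FDM_PAGE - 8);
}

/* Stamp a superblock page with its version and checksum before writing. */
void fdm_sb_seal(uint8_t *page, uint32_t version)
{
    put_le32(page, version);
    put_le32(page + 4, sb_crc(page));
}

/* Scan n candidate pages from the first allocation group and return the
 * index of the valid superblock with the highest version, or -1. */
long fdm_pick_superblock(const uint8_t *area, size_t n)
{
    long best = -1;
    uint32_t best_ver = 0;

    for (size_t i = 0; i < n; i++) {
        const uint8_t *pg = area + i * FDM_PAGE;
        if (get_le32(pg + 4) != sb_crc(pg))
            continue;                 /* torn write or not a superblock */
        uint32_t ver = get_le32(pg);
        if (best < 0 || ver > best_ver) {
            best = (long)i;
            best_ver = ver;
        }
    }
    return best;
}
```

A superblock whose write was interrupted fails the checksum, so the scan falls back to the newest older copy that is still intact.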

Block allocation

To accommodate media that can only write efficiently to a small number of allocation groups (typically 1 to 4) at a time, a single allocation group is chosen to receive incoming pages. Writing those pages requires updating the indirect pages, which may be written interleaved with the data pages or to consecutive addresses in another allocation group, depending on the characteristics of the underlying device.

Writing indirect pages in turn requires updating the superblock, because each indirect page is written to a new location on the medium every time.
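The cascade from data pages to indirect pages to the superblock can be modeled with a small in-memory state. A sketch, assuming indirect pages are written interleaved with the data through a single write pointer in the open allocation group (one of the two placements described above), a fixed four-slot superblock, and no actual I/O; `struct fdm_state`, `fdm_write_data` and `fdm_flush_indirect` are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

#define ENTRIES_PER_PAGE 1024u
#define UNALLOCATED      0xffffffffu
#define FDM_SB_SLOTS     4u          /* tiny logical device: 4 x 4 MB */

struct fdm_state {
    uint32_t sb[FDM_SB_SLOTS];                     /* indirect page locations */
    uint32_t ind[FDM_SB_SLOTS][ENTRIES_PER_PAGE];  /* cached indirect pages */
    bool     ind_dirty[FDM_SB_SLOTS];
    bool     sb_dirty;
    uint32_t write_ptr;              /* next free page in the open AG */
};

/* Out-of-place data write: the page goes to the next free location and
 * only the cached indirect page changes. Returns the physical page. */
uint32_t fdm_write_data(struct fdm_state *s, uint32_t lpn)
{
    uint32_t l1 = lpn / ENTRIES_PER_PAGE;
    uint32_t l2 = lpn % ENTRIES_PER_PAGE;

    if (l1 >= FDM_SB_SLOTS)
        return UNALLOCATED;                 /* beyond the logical device */
    uint32_t phys = s->write_ptr++;
    s->ind[l1][l2] = phys;                  /* old copy, if any, is now garbage */
    s->ind_dirty[l1] = true;
    return phys;
}

/* Each dirty indirect page is itself written to a new location, which
 * changes its superblock entry, so the superblock becomes dirty too. */
void fdm_flush_indirect(struct fdm_state *s)
{
    for (uint32_t i = 0; i < FDM_SB_SLOTS; i++) {
        if (!s->ind_dirty[i])
            continue;
        s->sb[i] = s->write_ptr++;
        s->ind_dirty[i] = false;
        s->sb_dirty = true;
    }
}
```

Rewriting the same logical page twice before a flush costs two data pages but only one indirect page write, which is one reason to batch updates.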

Writes are batched into write units of multiple pages whose size depends on the medium; typical media need a minimum of 16 KB to 512 KB per access to write efficiently. If less data is available, the write may be padded with zeros, or the remaining pages of the write unit can be skipped so that the next write starts at a fresh write unit.
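The flush arithmetic can be sketched as below, assuming the write-unit size is a nonzero multiple of 4 KB. `fdm_plan_flush` counts the write units issued and the zero padding needed when the last unit is padded; `fdm_next_write_page` gives the write pointer when the remainder is skipped instead. Both names are hypothetical.

```c
#include <stdint.h>

#define FDM_PAGE 4096u

struct fdm_batch {
    uint32_t units;     /* write units issued */
    uint32_t padding;   /* trailing pages filled with zeros */
};

/* Plan a flush of 'pending' 4 KB pages with write units of wu_bytes. */
struct fdm_batch fdm_plan_flush(uint32_t pending, uint32_t wu_bytes)
{
    uint32_t wu_pages = wu_bytes / FDM_PAGE;
    struct fdm_batch b;

    b.units = (pending + wu_pages - 1) / wu_pages;   /* round up */
    b.padding = b.units * wu_pages - pending;
    return b;
}

/* Skipping instead of padding: after writing 'pending' pages at
 * write_ptr, round up to the start of the next write unit. */
uint32_t fdm_next_write_page(uint32_t write_ptr, uint32_t pending,
                             uint32_t wu_bytes)
{
    uint32_t wu_pages = wu_bytes / FDM_PAGE;
    uint32_t end = write_ptr + pending;

    return (end + wu_pages - 1) / wu_pages * wu_pages;
}
```

With 16 KB write units, flushing 10 pages issues three write units and either pads 2 pages or advances the write pointer from 10 to 12. Either way, the next write starts on a write-unit boundary.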

On a medium that can write efficiently to more than two allocation groups at a time, the block allocator may use the extra groups to separate data from metadata, where the file system passes that distinction down.

Garbage collection

Partially filled allocation groups require garbage collection, which is done in one of two ways. For allocation groups that contain a mix of empty and full write units but no partially filled ones, new data is written to the empty write units from start to end, filling the allocation group up; this typically causes the medium to do a single read-modify-write of the allocation group.
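The choice between the two strategies depends only on how many live pages each write unit of an allocation group holds. A sketch, assuming a per-write-unit live-page count is available (for example derived from the page bitmap); the outcomes `FDM_GC_NONE` and `FDM_GC_FREE` are edge cases added for completeness, not part of the design text, and all names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

enum fdm_gc_action {
    FDM_GC_NONE,        /* every write unit full: nothing to collect */
    FDM_GC_FREE,        /* every write unit empty: the AG is free */
    FDM_GC_FILL_EMPTY,  /* empty and full units only: fill empty units in order */
    FDM_GC_COMPACT,     /* partial units present: read and rewrite their pages */
};

/* live[i] is the number of live pages in write unit i of one allocation
 * group; each write unit holds wu_pages pages. */
enum fdm_gc_action fdm_gc_classify(const uint32_t *live, size_t nr_units,
                                   uint32_t wu_pages)
{
    size_t empty = 0, full = 0;

    for (size_t i = 0; i < nr_units; i++) {
        if (live[i] == 0)
            empty++;
        else if (live[i] == wu_pages)
            full++;
        else
            return FDM_GC_COMPACT;   /* one partial unit decides it */
    }
    if (empty == nr_units)
        return FDM_GC_FREE;
    if (full == nr_units)
        return FDM_GC_NONE;
    return FDM_GC_FILL_EMPTY;
}
```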

Any partially filled write units should be read by the block allocator, and their live pages written out together with new pages to form full write units. Write units freed up this way must not be reused until the superblock has been updated to reflect the new location of every moved page.
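One way to enforce the rule that freed write units stay unused until the superblock is updated is to tag each freed unit with the superblock version that records the move, and release it only after that version has been written. The fixed-size queue, the version-tagging scheme, and all names here are assumptions of this sketch.

```c
#include <stddef.h>
#include <stdint.h>

#define FDM_MAX_PENDING 64

struct fdm_pending {
    uint32_t unit[FDM_MAX_PENDING];     /* write units freed by compaction */
    uint32_t need_sb[FDM_MAX_PENDING];  /* superblock version recording the move */
    size_t   count;
};

/* Queue 'unit' as freed; it becomes reusable once a superblock with
 * version >= next_sb_version is on the medium. Returns -1 if full. */
int fdm_defer_free(struct fdm_pending *p, uint32_t unit, uint32_t next_sb_version)
{
    if (p->count == FDM_MAX_PENDING)
        return -1;                      /* caller must commit a superblock */
    p->unit[p->count] = unit;
    p->need_sb[p->count] = next_sb_version;
    p->count++;
    return 0;
}

/* After superblock version 'committed' is written, move every unit whose
 * relocation is now durable into out (capacity cap); return how many. */
size_t fdm_release(struct fdm_pending *p, uint32_t committed,
                   uint32_t *out, size_t cap)
{
    size_t n = 0, keep = 0;

    for (size_t i = 0; i < p->count; i++) {
        if (p->need_sb[i] <= committed && n < cap) {
            out[n++] = p->unit[i];
        } else {                        /* still pending: compact in place */
            p->unit[keep] = p->unit[i];
            p->need_sb[keep] = p->need_sb[i];
            keep++;
        }
    }
    p->count = keep;
    return n;
}
```

If the queue fills up, the caller commits a superblock and calls `fdm_release` before freeing more write units.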

Wear leveling

There are two kinds of wear leveling. Dynamic wear leveling means that the device keeps a pool of unused blocks of uniform size; any newly written data is stored in a block taken from the pool, and the block that previously held that data is returned to the pool. This spreads writes evenly across all blocks that are written to at least occasionally, but not to blocks that never get updated. Static wear leveling also returns blocks that are never written to back to the pool, by moving their data to a block that has been written to more often.

The medium is expected to perform dynamic wear leveling itself, in units of the allocation groups that are written to. Since most media do not do static wear leveling, the device mapper target can do it by recording the age of each allocation group in the superblock and moving the oldest allocation group, i.e. the one least recently written to, to a new location. Since the data in an old allocation group can be expected to be static, it should not be mixed with new pages or with indirect pages, which are likely to require garbage collection sooner. When a partially filled allocation group is moved by wear leveling, it should be filled up with pages from the next oldest allocation group.
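Age tracking and victim selection for static wear leveling can be sketched as follows. `fdm_stamp` records the lifetime counter when an allocation group is written; `fdm_pick_static_victim` returns the least recently written group, skipping allocation group 0 (which holds the superblock) and the group currently open for writes. The `threshold` parameter, which suppresses moves until the oldest group lags the counter by a given amount, is an assumption of this sketch, as are all names.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wear-leveling state mirroring the superblock's age array. */
struct fdm_wear {
    uint32_t *age;       /* age[ag]: counter value when ag was last written */
    size_t    nr_ags;
    uint32_t  counter;   /* allocation groups written over the medium's life */
};

/* Record that allocation group 'ag' has just been written. */
void fdm_stamp(struct fdm_wear *w, size_t ag)
{
    w->age[ag] = ++w->counter;
}

/* Return the least recently written AG other than AG 0 and open_ag, or
 * nr_ags if none lags the counter by at least 'threshold' writes. */
size_t fdm_pick_static_victim(const struct fdm_wear *w, size_t open_ag,
                              uint32_t threshold)
{
    size_t victim = w->nr_ags;

    for (size_t ag = 1; ag < w->nr_ags; ag++) {
        if (ag == open_ag)
            continue;
        if (victim == w->nr_ags || w->age[ag] < w->age[victim])
            victim = ag;
    }
    if (victim == w->nr_ags || w->counter - w->age[victim] < threshold)
        return w->nr_ags;
    return victim;
}
```

After moving the victim, the caller would stamp its new location with `fdm_stamp`, so it naturally drops to the back of the queue.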

WorkingGroups/KernelArchived/Projects/FlashDeviceMapper (last modified 2013-01-14 19:44:05)