Release Note

The investigation of existing file system characteristics completes the work started with the survey of flash memory cards. It lets us proceed towards specific changes in those file systems where appropriate, give recommendations to our users regarding file systems, mount options and memory cards, and will inform the decision on the approach for the flash device mapper.


This investigation is required before the dependent work described below can proceed.

User stories

  • Arnd knows everything one can possibly find out about flash memory cards, but doesn't know what it means for the file systems running on them. With this investigation, that knowledge can be put to much better use.
  • Venkat wants to implement the flash device remapper, but there are too many options for how to implement that. The result of the investigation makes it easier to form an opinion on what to do there.
  • Ted wants to make sure his file system works best on low-end flash devices, but he needs feedback on where the specific problems are with it.


Assumptions

  • The WorkingGroups/Kernel/Projects/FlashCardSurvey results are representative of all current and future low-end flash devices and can be easily modeled.

  • The workloads we come up with are representative of what users really care about.
  • A blktrace output file contains a list of write requests that depends on the workload and file system but is independent of the characteristics of the drive, so we can apply the same trace to another device and see how it would react.


A new tool will be written that predicts the behavior of a flash memory card based on a blktrace output. The tool predicts when the memory card would perform a garbage collection and how many blocks have to be written internally as part of that garbage collection. The ratio of the number of writes performed by the underlying device to the number of writes issued by the OS is the write amplification, which is reported as the tool's output and can be compared across combinations of drives and workloads.
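A minimal skeleton of such a predictor might look like the following sketch. The `CardModel` interface, the toy single-block model, and all names here are assumptions made for illustration; they are not the actual tool.

```python
# Illustrative skeleton of the proposed predictor (all names are assumed).
# A card model consumes write requests and counts internal GC activity;
# the driver then computes write amplification from the totals.

class CardModel:
    """Base interface for a simulated flash card."""
    def __init__(self, erase_block_size):
        self.erase_block_size = erase_block_size
        self.gc_count = 0

    def write(self, offset, length):
        raise NotImplementedError

class SingleBlockModel(CardModel):
    """Toy model: one open erase block; switching blocks forces a GC."""
    def __init__(self, erase_block_size):
        super().__init__(erase_block_size)
        self.current = None

    def write(self, offset, length):
        eb = offset // self.erase_block_size
        if self.current is not None and eb != self.current:
            self.gc_count += 1   # close the old block, copy it out
        self.current = eb

def write_amplification(model, trace):
    """trace: iterable of (byte_offset, byte_length) write requests."""
    host_bytes = 0
    for offset, length in trace:
        model.write(offset, length)
        host_bytes += length
    # Device-side bytes: host writes plus whole erase blocks rewritten by GC.
    device_bytes = host_bytes + model.gc_count * model.erase_block_size
    return device_bytes / host_bytes
```

A real model would be considerably more detailed (pages, reordering rules, caches); the point of the interface split is that the same trace can be replayed against different card models.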


See below (need to update)

Old notes from LDS-o

Block allocation in file systems

Block allocation in the main file systems (btrfs, ext4, nilfs2). This also needs to know about erase block sizes, unless we plan to rely on the remapper from 7.1 indefinitely. This starts with finding out the current allocation patterns using blktrace and coming up with individual per-fs ways to improve those patterns.

The first step is to quantify the write amplification for each of the relevant block based file systems, for a number of real-world workloads. Possible workloads include

  • distribution installation (debootstrap)
  • parallel kernel compile
  • git clone/checkout
  • streaming data write

We will need a new analysis tool that looks at the raw blktrace data file and counts the number of blocks being written while estimating the number of garbage collection cycles for a medium with the characteristics recorded in the flash card survey. The input to the tool should include

  • blktrace data set
  • erase block size
  • page size
  • algorithm used in the card (purely linear, or random writes allowed)
  • number of open erase blocks supported
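The input list above could map onto a command-line interface along these lines; every flag name and default below is invented for illustration, not part of any existing tool.

```python
import argparse

# Hypothetical command-line interface mirroring the tool's input list.
# All option names and defaults are invented for illustration.
def build_parser():
    p = argparse.ArgumentParser(
        description='Predict flash card GC behavior from a blktrace data set')
    p.add_argument('trace', help='blktrace data set to replay')
    p.add_argument('--erase-block-size', type=int, default=4 * 1024 * 1024,
                   help='erase block size in bytes (assumed default: 4 MiB)')
    p.add_argument('--page-size', type=int, default=32 * 1024,
                   help='page size in bytes')
    p.add_argument('--algorithm', choices=['linear', 'random'],
                   default='linear',
                   help='write algorithm used in the card')
    p.add_argument('--open-blocks', type=int, default=1,
                   help='number of concurrently open erase blocks')
    return p
```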

Since the performance is mainly determined by the number of garbage collection cycles, the main output should be that number, along with the write amplification factor, defined as (number of GCs * erase block size / bytes written).
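As a worked example of that formula (all trace numbers below are invented for illustration):

```python
# Worked example of the write amplification formula from the text:
#   WA = number_of_GCs * erase_block_size / bytes_written
# The numbers are invented for illustration.

erase_block_size = 4 * 1024 * 1024      # 4 MiB erase block (assumed)
bytes_written    = 1 * 1024 ** 3        # 1 GiB written by the file system
gc_cycles        = 1000                 # GCs predicted by the tool

write_amplification = gc_cycles * erase_block_size / bytes_written
print(write_amplification)              # prints 3.90625
```

In this example the card internally writes almost four times the data the OS issued, which is exactly the kind of number we want to compare across file systems and workloads.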

With that data, we can hopefully draw specific conclusions, for example:

  • With XXXfs, you need at least N open erase blocks to avoid the worst-case write amplification.
  • XXXfs is always better than YYYfs.
  • YYYfs has a specific problem; slightly changing its block allocation improves it by a factor of C.
  • We do/don't need the flash remapper for optimum performance.
  • If there were a Linux Foundation qualification program for flash devices, it should require N erase blocks in random access for a card to be recommended for use with Linux.

Further action on this depends on the specific results.

LDS-o notes from etherpad

2. Fix file system block allocator to understand erase blocks

  • Can't do that yet, but want to understand what current FS implementations do:
    • Run blktrace
    • Run simulations on known existing SD cards
    • Come up with conclusions on what to run and not run
    • Long-term goal: come up with requirements of what is needed from SD cards for Linux
      • and possibly have a certification program via Linux Foundation
    • For next cycle: fix specific problems we see
  • Card algorithms to be modeled:
    • purely linear access, force GC for pages that have been written, keep track of open erase-blocks
    • block remapping within one erase-block, a block may be multiple pages (e.g. 64KB), write every block in the erase-block once before GC, but in any order.
    • Data logging, write every block in the erase-block in any order, clean up later. Every erase block with random data ends up being written twice. Linear data only gets written once.
    • Cache for small (block or smaller) writes, otherwise linear access within the erase block (see above). Needs to GC the cache eventually, but blocks written multiple times into the cache only need to be GC'd once. Less important, only used in expensive USB sticks and CF cards, not SD cards.
  • Input arguments to the tool
    • Erase block size
    • Number of concurrently open erase blocks (aside from potential cache)
    • Algorithm (one of the four above)
    • Block size (depends on algorithm)
  • Possible extensions to model later, not goals for now
    • Slowdown for sub-page writes (requires knowing page size from table)
    • FAT area optimization
  • Measurements to look at: Collect many blktrace outputs from various places, including
    • ext2, ext3, ext4, btrfs, nilfs2, logfs
    • Different options for ext4, to be provided by Surbi
    • Workloads (possibly): streaming write, untar, kernel build, debootstrap, boot, blktrace your daily desktop experience.
    • Aged file systems
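The first card algorithm above ("purely linear access") could be modeled roughly as follows. The class and its GC rules are an assumed interpretation of these notes, not a validated card model; out-of-order forward writes are simplified to padding.

```python
# Rough model of the first card algorithm in the notes: purely linear
# access within an erase block, a forced GC when an already-written page
# is rewritten, and a limited number of concurrently open erase blocks.
# This is an assumed interpretation of the etherpad notes.

class LinearCard:
    def __init__(self, erase_block_pages, open_blocks):
        self.ebp = erase_block_pages      # pages per erase block
        self.max_open = open_blocks       # concurrently open erase blocks
        self.next_page = {}               # open erase block -> next writable page
        self.gc_count = 0

    def write_page(self, page):
        eb, idx = divmod(page, self.ebp)
        if self.next_page.get(eb, 0) > idx:
            # Rewriting an already-written page: the card must GC the block.
            self.gc_count += 1
            self.next_page[eb] = 0
        if eb not in self.next_page and len(self.next_page) >= self.max_open:
            # Too many open blocks: close (GC) the oldest one.
            oldest = next(iter(self.next_page))
            del self.next_page[oldest]
            self.gc_count += 1
        # Skipped pages are treated as padded; only linear progress is tracked.
        self.next_page[eb] = idx + 1
```

The other three algorithms (block remapping, data logging, small-write cache) would slot in as alternative classes with the same `write_page` entry point.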

Personal notes

  • Create script for running test cases on a dedicated partition with blktrace gathering stats
  • Modify blktrace to generate raw dump of media write stats, i.e. {position,length} tuples
  • Run test script to gather media write stats for test cases on a subset of filesystems
  • Write skeleton for simulation tool to process media write stats
  • Implement simulation of card algorithm "pure linear access"
  • Implement simulation of card algorithm "block remapping within one erase-block"
  • Implement simulation of card algorithm "data logging"
  • Implement simulation of card algorithm "cache for small writes"
  • Create 'algorithm exposer' tool to generate media write patterns which expose simulated algorithms
  • Validate simulations by comparing their results to timings on real hardware of test case runs and 'algorithm exposer'
  • Automate creation of an aged file system
  • Gather media write stats for all automated test cases on all filesystem types and options
  • Repeat above on real ARM hardware, both uniprocessor and SMP
  • Repeat selected test runs so we can determine noise from real differences
  • Create 'desktop system monitor' script for gathering media write stats and partition mount info from a desktop PC
  • Get volunteers to run 'desktop system monitor'
  • Lots of analysis
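Until blktrace itself dumps raw {position,length} tuples, they can be scraped from blkparse's default text output. The sketch below assumes the default "dev cpu seq time pid action rwbs sector + blocks" line layout and keeps only completed writes; a production tool would more likely consume the binary trace directly.

```python
import re

# Assumed parser for blkparse's default text output (one event per line),
# extracting (position, length) tuples for completed write requests.
EVENT = re.compile(
    r'^\s*\d+,\d+\s+\d+\s+\d+\s+[\d.]+\s+\d+\s+'
    r'(?P<action>\S+)\s+(?P<rwbs>\S+)\s+(?P<sector>\d+)\s+\+\s+(?P<blocks>\d+)'
)

def parse_writes(lines):
    """Yield (start_sector, sector_count) for completed write requests."""
    for line in lines:
        m = EVENT.match(line)
        if m and m.group('action') == 'C' and 'W' in m.group('rwbs'):
            yield int(m.group('sector')), int(m.group('blocks'))
```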

I had imagined that I wouldn't try to integrate any tools into the blktrace tool itself, at least initially, because of the time it would take to comprehend the code and do things cleanly.


WorkingGroups/KernelArchived/Specs/investigate-block-allocation-in-fs (last modified 2013-01-14 19:36:45)