
An f2fs teardown

October 10, 2012

This article was contributed by Neil Brown

When a techno-geek gets a new toy there must always be an urge to take it apart and see how it works. Practicalities (and warranties) sometimes suppress that urge, but in the case of f2fs and this geek, the urge was too strong. What follows is the result of taking apart this new filesystem to see how it works.

f2fs (interestingly not "f3s") is the "flash-friendly file system", a new filesystem for Linux recently announced by engineers from Samsung. Unlike jffs2 and logfs, f2fs is not targeted at raw flash devices, but rather at the specific hardware that is commonly available to consumers — SSDs, eMMC, SD cards, and other flash storage with an FTL (flash translation layer) already built in. It seems that as hardware gets smarter, we need to make even more clever software to manage that "smartness". Does this sound like parenting to anyone else?

f2fs is based on the log-structured filesystem (LFS) design — which is hardly surprising given the close match between the log-structuring approach and the needs of flash. For those not familiar with log-structured design, the key elements are:

  1. That it requires copy-on-write, so data is always written to previously unused space.
  2. That free space is managed in large regions which are written to sequentially. When the number of free regions gets low, data that is still live is coalesced from several regions into one free region, thus creating more free regions. This process is known as "cleaning" and the overhead it causes is one of the significant costs of log structuring.
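To make the cleaning process just described concrete, here is a toy sketch in C of how a cleaner might choose a victim region and migrate its live blocks. This illustrates the general LFS technique only; it is not f2fs's actual code, and the structures are invented for the example.

    /* Toy model of LFS cleaning -- illustrative only. */
    #include <stdbool.h>
    #include <stddef.h>

    #define BLOCKS_PER_REGION 512

    struct region {
        bool live[BLOCKS_PER_REGION]; /* which blocks still hold live data */
        int  live_count;
    };

    /* Pick the region with the fewest live blocks: cheapest to clean. */
    static struct region *pick_victim(struct region *r, int nregions)
    {
        struct region *victim = NULL;

        for (int i = 0; i < nregions; i++)
            if (!victim || r[i].live_count < victim->live_count)
                victim = &r[i];
        return victim;
    }

    /* Copy each live block into the currently open region; afterward
     * the victim is entirely free and can be rewritten sequentially. */
    static void clean(struct region *victim, struct region *open)
    {
        for (int b = 0; b < BLOCKS_PER_REGION; b++) {
            if (!victim->live[b])
                continue;
            /* relocate_block(victim, b, open) would also update the
             * index that points at the moved block (hypothetical). */
            victim->live[b] = false;
            victim->live_count--;
            open->live_count++;
        }
    }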

As the FTL typically uses a log-structured design to provide the wear-leveling and write-gathering that flash requires, this means that there are two log structures active on the device — one in the firmware and one in the operating system. f2fs is explicitly designed to make use of this fact and leaves a number of tasks to the FTL while focusing primarily on those tasks that it is well positioned to perform. So, for example, f2fs makes no effort to distribute writes evenly across the address space to provide wear-leveling.

The particular value that f2fs brings, which can justify it being "flash friendly", is that it provides large-scale write gathering so that when lots of blocks need to be written at the same time they are collected into large sequential writes which are much easier for the FTL to handle. Rather than creating a single large write, f2fs actually creates up to six in parallel. As we shall see, these are assigned different sorts of blocks with different life expectancies. Grouping blocks with similar life expectancies together tends to make the garbage collection process required by the LFS less expensive.

The "large-scale" is a significant qualifier — f2fs doesn't always gather writes into contiguous streams, only almost always. Some metadata, and occasionally even some regular data, is written via random single-block writes. This would be anathema for a regular log-structured filesystem, but f2fs chooses to avoid a lot of complexity by just doing small updates when necessary and leaving the FTL to make those corner cases work.

Before getting into the details of how f2fs does what it does, a brief list of some of the things it doesn't do is in order.

A feature that we might expect from a copy-on-write filesystem is cheap snapshots, as they can be achieved by simply not freeing up the old copy. f2fs does not provide these and cannot in its current form due to its two-location approach to some metadata, which will be detailed later.

Other features that are missing are usage quotas, NFS export, and the "security" flavor of extended attributes (xattrs). Each of these could probably be added with minimal effort if they are needed, though integrating quotas correctly with the crash recovery would be the most challenging. We shouldn't be surprised to see some of these in a future release.

Blocks, segments, sections, and zones

Like most filesystems, f2fs is composed of blocks. All blocks are 4K in size, though the code implicitly ties the block size to the system page size, so it is unlikely to work on systems with larger page sizes as is possible with IA64 and PowerPC. The block addresses are 32 bits so the total number of addressable bytes in the filesystem is at most 2^(32+12) bytes or 16 terabytes. This is probably not a limitation — for current flash hardware at least.

Blocks are collected into "segments". A segment is 512 blocks or 2MB in size. The documentation describes this as a default, but this size is fairly deeply embedded in the code. Each segment has a segment summary block which lists the owner (file plus offset) of each block in the segment. The summary is primarily used when cleaning to determine which blocks need to be relocated and how to update the index information after the relocation. One block can comfortably store summary information for 512 blocks (with a bit of extra space which has other uses), so 2MB is the natural size for a segment. Larger would be impractical and smaller would be wasteful.
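To see why the arithmetic works out, note that a per-block summary entry needs little more than the owning node and an offset within it. The structure below is a sketch loosely modeled on the on-disk format, with illustrative field names:

    #include <stdint.h>

    /* One entry per block in a segment: who owns it and where.
     * A sketch loosely modeled on f2fs's on-disk summary entry. */
    struct summary_entry {
        uint32_t node_id;     /* inode or index node owning the block */
        uint16_t ofs_in_node; /* which address slot within that node */
        uint8_t  version;     /* consulted during crash recovery */
    } __attribute__((packed)); /* 7 bytes */

    /* 512 entries * 7 bytes = 3584 bytes, which leaves roughly 512
     * bytes of a 4K summary block spare -- the "bit of extra space
     * which has other uses" mentioned above. */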

Segments are collected into sections. There is genuine flexibility in the size of a section, though it must be a power of two. A section corresponds to a "region" in the outline of log structuring given above. A section is normally filled from start to end before looking around for another section, and the cleaner processes one section at a time. The default size when using the mkfs utility is 2^0, or one segment per section.

f2fs has six sections "open" for writing at any time with different sorts of data being written to each one. The different sections allow for file content (data) to be kept separate from indexing information (nodes), and for those to be divided into "hot", "warm", and "cold" according to various heuristics. For example, directory data is treated as hot and kept separate from file data because they have different life expectancies. Data that is cold is expected to remain unchanged for quite a long time, so a section full of cold blocks is likely to not require any cleaning. Nodes that are hot are expected to be updated soon, so if we wait a little while, a section that was full of hot nodes will have very few blocks that are still live and thus will be cheap to clean.
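In code form, the six open logs look roughly like the following. The enumeration mirrors the names used in the f2fs source; the classification function is a simplified sketch of the heuristics, not the kernel's logic:

    #include <stdbool.h>

    /* The six concurrently open logs; names follow the f2fs source. */
    enum log_type {
        CURSEG_HOT_DATA,  /* directory blocks: change frequently */
        CURSEG_WARM_DATA, /* ordinary file data */
        CURSEG_COLD_DATA, /* data relocated by cleaning, multimedia */
        CURSEG_HOT_NODE,  /* node blocks for directories */
        CURSEG_WARM_NODE, /* node blocks for regular files */
        CURSEG_COLD_NODE, /* indirect index blocks */
        NR_CURSEG_TYPE,
    };

    /* Sketch of the heuristic: directory data is hot, data being
     * moved by the cleaner is cold, everything else is warm. */
    static enum log_type classify_data(bool is_dir, bool being_cleaned)
    {
        if (is_dir)
            return CURSEG_HOT_DATA;
        if (being_cleaned)
            return CURSEG_COLD_DATA;
        return CURSEG_WARM_DATA;
    }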

Sections are collected into zones. There may be any (integer) number of sections in a zone though the default is again one. The sole purpose of zones is to try to keep these six open sections in different parts of the device. The theory seems to be that flash devices are often made from a number of fairly separate sub-devices each of which can process IO requests independently and hence in parallel. If zones are sized to line up with the sub-devices, then the six open sections can all handle writes in parallel and make best use of the device.

These zones, full of sections of segments of blocks, make up the "main" area of the filesystem. There is also a "meta" area which contains a variety of different metadata such as the segment summary blocks already mentioned. This area is not managed following normal log-structured lines and so leaves more work for the FTL to do. Hopefully it is small enough that this isn't a problem.

There are three approaches to management of writes in this area. First, there is a small amount of read-only data (the superblock) which is never written once the filesystem has been created. Second, there are the segment summary blocks which have already been mentioned. These are simply updated in place. This can lead to uncertainty as to the "correct" contents for a block after a crash; however, for segment summaries this is not an actual problem. The information in a summary is checked for validity before it is used, and if there is any chance that information is missing, it will be recovered from other sources during the recovery process.

The third approach involves allocating twice as much space as is required so that each block has two different locations it can exist in, a primary and a secondary. Only one of these is "live" at any time and the copy-on-write requirement of an LFS is met by simply writing to the non-live location and updating the record of which is live. This approach to metadata is the main impediment to providing snapshots. f2fs does a small amount of journaling of updates to this last group while creating a checkpoint, which might ease the task for the FTL somewhat.

Files, inodes, and indexing

Most modern filesystems seem to use B-trees or similar structures for managing indexes to locate the blocks in a file. In fact they are so fundamental to btrfs that it takes its name from that data structure. f2fs doesn't. Many filesystems reduce the size of the index by the use of "extents" which provide a start and length of a contiguous list of blocks rather than listing all the addresses explicitly. Again, f2fs doesn't (though it does maintain one extent per inode as a hint).

Rather, f2fs uses an indexing tree that is very reminiscent of the original Unix filesystem and descendants such as ext3. The inode contains a list of addresses for the early blocks in the file, then some addresses for indirect blocks (which themselves contain more addresses) as well as some double- and triple-indirect blocks. While ext3 has 12 direct addresses and one each of the indirection addresses, f2fs has 929 direct addresses, two each of indirect and double-indirect addresses, and a single triple-indirect address. This allows the addressing of nearly 4TB for a file, or one-quarter of the maximum filesystem size.
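A back-of-the-envelope check of that 4TB figure, assuming each 4K index block holds about 1018 four-byte block addresses (an assumed value; the exact count is reduced by a per-block footer):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        const uint64_t blk     = 4096; /* block size in bytes */
        const uint64_t per_blk = 1018; /* addresses per index block (assumed) */
        const uint64_t direct  = 929;  /* direct addresses in the inode */

        uint64_t blocks = direct
            + 2 * per_blk                  /* two indirect blocks */
            + 2 * per_blk * per_blk        /* two double-indirect blocks */
            + per_blk * per_blk * per_blk; /* one triple-indirect block */

        printf("max file size: %llu bytes (~%.2f TiB)\n",
               (unsigned long long)(blocks * blk),
               (double)(blocks * blk) / (1ULL << 40));
        return 0;
    }

When run, this prints a little under 4 TiB, consistent with the "nearly 4TB" figure and the one-quarter ratio against the 16TB filesystem limit.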

While this scheme has some costs — which is why other filesystems have discarded it — it has a real benefit for an LFS. As f2fs does not use extents, the index tree for a given file has a fixed and known size. This means that when blocks are relocated through cleaning, it is impossible for changes in available extents to cause the indexing tree to get bigger — which could be embarrassing when the point of cleaning is to free space. logfs, another reasonably modern log-structured filesystem for flash, uses much the same arrangement for much the same reason.

Obviously, all this requires a slightly larger inode than ext3 uses. Copy-on-write is rather awkward for objects that are smaller than the block size so f2fs reserves a full 4K block for each inode which provides plenty of space for indexing. It even provides space to store the (base) name of the file, or one of its names, together with the inode number of the parent. This simplifies the recovery of recently-created files during crash recovery and reduces the number of blocks that need to be written for such a file to be safe.

Given that the inode is so large, one would expect that small files and certainly small symlinks would be stored directly in the inode, rather than just storing a single block address and storing the data elsewhere. However f2fs doesn't do that. Most likely the reality is that it doesn't do it yet. It is an easy enough optimization to add, so it's unlikely to remain absent for long.

As already mentioned, the inode contains a single extent that is a summary of some part of the index tree. It says that some range of blocks in the file are contiguous in storage and gives the address of this range. The filesystem attempts to keep the largest extent recorded here and uses it to speed up address lookups. For the common case of a file being written sequentially without any significant pause, this should result in the entire file being in that one extent, and make lookups in the index tree unnecessary.
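The extent hint thus amounts to a single range check before any walk of the index tree. A minimal sketch, with illustrative names:

    #include <stdint.h>
    #include <stdbool.h>

    /* The one cached extent: file blocks [fofs, fofs+len) are stored
     * contiguously at device blocks [blk_addr, blk_addr+len). */
    struct extent_hint {
        uint32_t fofs;     /* starting block offset within the file */
        uint32_t blk_addr; /* starting block address on the device */
        uint32_t len;      /* number of contiguous blocks */
    };

    /* Return true and set *addr if the hint covers this file block,
     * avoiding the indexing-tree lookup entirely. */
    static bool lookup_extent(const struct extent_hint *e,
                              uint32_t file_blk, uint32_t *addr)
    {
        if (file_blk >= e->fofs && file_blk - e->fofs < e->len) {
            *addr = e->blk_addr + (file_blk - e->fofs);
            return true;
        }
        return false; /* fall back to the index tree */
    }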

Surprisingly, it doesn't seem there was enough space to store 64-bit timestamps, so instead of nanosecond resolution for several centuries in the future, it only provides single-second resolution until some time in 2038. This oversight was raised on linux-kernel and may well be addressed in a future release.

One of the awkward details of any copy-on-write filesystem is that whenever a block is written, its address is changed, so its parent in the indexing tree must change and be relocated, and so on up to the root of the tree. The logging nature of an LFS means that roll-forward during recovery can rebuild recent changes to the indexing tree so all the changes do not have to be written immediately, but they do have to be written eventually, and this just makes more work for the cleaner.

This is another area where f2fs makes use of its underlying FTL and takes a short-cut. Among the contents of the "meta" area is a NAT — a Node Address Table. Here "node" refers to inodes and to indirect indexing blocks, as well as blocks used for xattr storage. When the address of an inode is stored in a directory, or an index block is stored in an inode or another index block, it isn't the block address that is stored, but rather an offset into the NAT. The actual block address is stored in the NAT at that offset. This means that when a data block is written, we still need to update and write the node that points to it. But writing that node only requires updating the NAT entry. The NAT is part of the metadata that uses two-location journaling (thus depending on the FTL for write-gathering) and so does not require further indexing.
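Schematically, the indirection works as in the sketch below (invented structures, not the kernel's): inodes and index blocks store node numbers, and only the NAT maps a node number to an actual block address:

    #include <stdint.h>

    #define NR_NODES (1 << 20) /* illustrative NAT capacity */

    static uint32_t nat[NR_NODES]; /* node number -> block address */

    /* Every reference to a node resolves through the NAT... */
    static uint32_t node_addr(uint32_t node_num)
    {
        return nat[node_num];
    }

    /* ...so when a node block is rewritten at a new location, as
     * copy-on-write requires, only its NAT entry changes.  The inode
     * or index block referring to that node is untouched, because it
     * holds the node number, not the block address. */
    static void node_relocated(uint32_t node_num, uint32_t new_addr)
    {
        nat[node_num] = new_addr;
    }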

Directories

An LFS doesn't really impose any particular requirements on the layout of a directory, except to change as few blocks as possible, which is generally good for performance anyway. So we can assess f2fs's directory structure on an equal footing with other filesystems. The primary goal is to provide fast lookup by file name, and to provide a stable address for each name that can be reported using telldir().

The original Unix filesystem (once it had been adjusted for 256-byte file names) used the same directory scheme as ext2 — sequential search through a file full of directory entries. This is simple and effective, but doesn't scale well to large directories.

More modern filesystems such as ext3, xfs, and btrfs use various schemes involving B-trees, sometimes indexed by a hash of the file name. One of the problems with B-trees is that nodes sometimes need to be split and this causes some directory entries to be moved around in the file. This results in extra challenges to provide stable addresses for telldir() and is probably the reason that telldir() is often called out for being a poor interface.

f2fs uses some sequential searching and some hashing to provide a scheme that is simple, reasonably efficient, and trivially provides stable telldir() addresses. A lot of the hashing code is borrowed from ext3, however f2fs omits the use of a per-directory seed. This seed is a secret random number which ensures that the hash values used are different in each directory, so they are not predictable. Using such a seed provides protection against hash-collision attacks. While these might be unlikely in practice, they are so easy to prevent that this omission is a little surprising.

It is easiest to think of the directory structure as a series of hash tables stored consecutively in a file. Each hash table has a number of fairly large buckets. A lookup proceeds from the first hash table to the next, at each stage performing a linear search through the appropriate bucket, until either the name is found or the last hash table has been searched. During the search, any free space in a suitable bucket is recorded in case we need to create the name.

The first hash table has exactly one bucket which is two blocks in size, so for the first few hundred entries, a simple linear search is used. The second hash table has two buckets, then four, then eight and so on until the 31st table with about a billion buckets, each two blocks in size. Subsequent hash tables — should you need that many — all have the same number of buckets as the 31st, but now they are four blocks in size.
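The geometry of this series of hash tables can be captured in a few lines. This is a sketch of the scheme as just described, not the constants from the f2fs source:

    #include <stdint.h>

    /* Table n (counting from 0) has 2^n buckets, capped at the 2^30
     * or so of the 31st table; later tables keep that bucket count
     * but double the bucket size from two blocks to four. */
    static uint32_t buckets_in_table(int table)
    {
        return table < 31 ? (uint32_t)1 << table : (uint32_t)1 << 30;
    }

    static uint32_t blocks_per_bucket(int table)
    {
        return table < 31 ? 2 : 4;
    }

    /* A lookup hashes the name once, then probes one bucket per
     * table until the name is found or tables are exhausted. */
    static uint32_t bucket_for(uint32_t hash, int table)
    {
        return hash % buckets_in_table(table);
    }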

The result is that a linear search of several hundred entries can be required, possibly progressing through quite a few blocks if the directory is very large. The length of this search increases only as the logarithm of the number of entries in the directory, so it scales fairly well. This is certainly better than a purely sequential search, but seems like it could be a lot more work than is really necessary. It does, however, guarantee that only one block needs to be updated for each addition or deletion of a file name, and, since entries are never moved, the offset in the file is a stable address for telldir() — both valuable features.

Superblocks, checkpoints, and other metadata

All filesystems have a superblock and f2fs is no different. However it does make a clear distinction between those parts of the superblock which are read-only and those which can change. These are kept in two separate data structures.

The f2fs_super_block, which is stored in the second block of the device, contains only read-only data. Once the filesystem is created, this is never changed. It describes how big the filesystem is, how big the segments, sections, and zones are, how much space has been allocated for the various parts of the "meta" area, and other little details.

The rest of the information that you might expect to find in a superblock, such as the amount of free space, the address of the segments that should be written to next, and various other volatile details, are stored in an f2fs_checkpoint. This "checkpoint" is one of the metadata types that follows the two-location approach to copy-on-write — there are two adjacent segments both of which store a checkpoint, only one of which is current. The checkpoint contains a version number so that when the filesystem is mounted, both can be read and the one with the higher version number is taken as the live version.
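Mount-time selection between the two checkpoint copies is then simple: read both, validate each, and take the one with the higher version number. A minimal sketch, assuming an invented checkpoint structure:

    #include <stdint.h>
    #include <stddef.h>
    #include <stdbool.h>

    /* Illustrative header only; the real checkpoint carries much
     * more (free-segment counts, active-log positions, checksums). */
    struct checkpoint {
        uint64_t version;
        /* ... the volatile "superblock" state ... */
    };

    /* Pick the live checkpoint.  A real implementation validates a
     * checksum before trusting either copy. */
    static const struct checkpoint *live_checkpoint(
            const struct checkpoint *a, bool a_valid,
            const struct checkpoint *b, bool b_valid)
    {
        if (a_valid && b_valid)
            return a->version > b->version ? a : b;
        if (a_valid)
            return a;
        return b_valid ? b : NULL; /* NULL: nothing mountable */
    }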

We have already mentioned the Node Address Table (NAT) and Segment Summary Area (SSA), which occupy the meta area along with the superblock (SB) and checkpoints (CP). The one other item of metadata is the Segment Info Table or SIT.

The SIT stores 74 bytes per segment and is kept separate from the segment summaries because it is much more volatile. It primarily keeps track of which blocks are still in active use so that the segment can be reused when it has no active blocks, or can be cleaned when the active block count gets low.

When updates are required to the NAT or the SIT, f2fs doesn't make them immediately, but stores them in memory until the next checkpoint is written. If there are relatively few updates, they are not written out to their final home but are instead journaled in some spare space in segment summary blocks that are normally written at the same time. If the total number of updates required to segment summary blocks is sufficiently small, even they are not written, and the SIT, NAT, and SSA updates are all journaled with the checkpoint block — which is always written during a checkpoint. Thus, while f2fs feels free to leave some work to the FTL, it tries to be friendly and only performs random block updates when it really has to. When f2fs does need to perform random block updates it will perform several of them at once, which might ease the burden on the FTL a little.
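That cascade of destinations might be summarized as follows; this is a simplification of the logic just described, not f2fs code, and the slot counts are invented parameters:

    /* Where do pending NAT/SIT updates go at checkpoint time? */
    enum destination { IN_CHECKPOINT, IN_SUMMARY_BLOCKS, IN_PLACE };

    static enum destination journal_destination(int pending,
                                                int cp_spare_slots,
                                                int ssa_spare_slots)
    {
        if (pending <= cp_spare_slots)
            return IN_CHECKPOINT;     /* piggy-back on the checkpoint
                                         block, written anyway */
        if (pending <= ssa_spare_slots)
            return IN_SUMMARY_BLOCKS; /* journal in summary blocks that
                                         are being written anyway */
        return IN_PLACE;              /* last resort: random single-block
                                         updates to the NAT/SIT areas */
    }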

Knowing when to give up

Handling filesystem-full conditions in traditional filesystems is relatively easy. If no space is left, you just return an error. With a log-structured filesystem, it isn't that easy. There might be a lot of free space, but it might all be in different sections and so it cannot be used until those sections are "cleaned", with the live data packed more densely into fewer sections. It usually makes sense to over-provision a log-structured filesystem so there are always free sections to copy data to for cleaning.

The FTL takes exactly this approach and will over-provision to both allow for cleaning and to allow for parts of the device failing due to excessive wear. As the FTL handles over-provisioning internally there is little point in f2fs doing it as well. So when f2fs starts running out of space, it essentially gives up on the whole log-structured idea and just writes randomly wherever it can. Inodes and index blocks are still handled carefully and there is a small amount of over-provisioning for them, but data is just updated in place, or written to any free block that can be found. Thus you can expect performance of f2fs to degrade when the filesystem gets close to full, but that is common to a lot of filesystems so it isn't a big surprise.

Would I buy one?

f2fs certainly seems to contain a number of interesting ideas, and a number of areas for possible improvement — both attractive attributes. Whether reality will match the promise remains to be seen. One area of difficulty is that the shape of an f2fs (such as section and zone size) needs to be tuned to the particular flash device and its FTL; vendors are notoriously secretive about exactly how their FTL works. f2fs also requires that the flash device is comfortable having six or more concurrently "open" write areas. This may not be a problem for Samsung, but does present some problems for your average techno-geek — though Arnd Bergmann has done some research that may prove useful. If this leads to people reporting performance results based on experiments where the f2fs isn't tuned properly to the storage device, it could be harmful for the project as a whole.

f2fs contains a number of optimizations which aim to ease the burden on the FTL. It would be very helpful to know how often these actually result in a reduction in the number of writes. That would help confirm that they are a good idea, or suggest that further refinement is needed. So, some gathering of statistics about how often the various optimizations fire would help increase confidence in the filesystem.

f2fs seems to have been written without much expectation of highly parallel workloads. In particular, all submissions of write requests are performed under a single semaphore. So f2fs probably isn't the filesystem to use for big-data processing on 256-core work-horses. It should be fine on mobile computing devices for a few more years though.

And finally, lots of testing is required. Some preliminary performance measurements have been posted, but to get a fair comparison you really need an "aged" filesystem and a large mix of workloads. Hopefully someone will make the time to do the testing.

Meanwhile, would I use it? Given that my phone is as much a toy to play with as a tool to use, I suspect that I would. However, I would make sure I had reliable backups first. But then ... I probably should do that anyway.



An f2fs teardown

Posted Oct 11, 2012 1:55 UTC (Thu) by mbcook (guest, #5517) [Link] (1 responses)

I'd just like to thank you guys for this article. When the announcement hit Slashdot and a few other places a couple of days ago, I was really disappointed that coverage basically fit into the "Samsung did something, here's a Git repository" mold.

Thanks for the informative discussion of how f2fs works, and why it may have been designed that way.

An f2fs teardown

Posted Oct 13, 2012 9:58 UTC (Sat) by liljencrantz (guest, #28458) [Link]

Real journalism takes time.

An f2fs teardown

Posted Oct 11, 2012 9:03 UTC (Thu) by hthoma (subscriber, #4743) [Link] (1 responses)

> The block addresses are 32 bits so the total number of addressable bytes in the filesystem is at most 2^(32+12) bytes or 16 terabytes. This is probably not a limitation — for current flash hardware at least.

16 TB is not a limitation? OK, you can not buy a 16 TB SSD now, but you can buy one with 512 GB. That is only a factor of 32. And it will probably not take too long to close that gap.

So this sounds really a lot like: "640 kB of RAM is enough."

An f2fs teardown

Posted Oct 11, 2012 12:06 UTC (Thu) by cladisch (✭ supporter ✭, #50193) [Link]

As far as I can see, f2fs is primarily intended for less intelligent flash devices, like those in smartphones, where 16 TB indeed is enough for the foreseeable future.

If there ever is a smartphone with 32 TB of flash storage, I'd guess that it will also have a flash controller at least as good as those in today's top-end SSDs, which aren't as bad with 'normal' file systems.

f2fs versus jffs2

Posted Oct 11, 2012 9:37 UTC (Thu) by flok (guest, #17768) [Link] (2 responses)

What would be the advantage of f2fs over jffs2?

f2fs versus jffs2

Posted Oct 11, 2012 10:04 UTC (Thu) by drago01 (subscriber, #50715) [Link] (1 responses)

It is designed for flash devices that expose themselves as block devices (like SSDs) and not for raw flash devices as jffs2 .. so they aren't really comparable.

f2fs versus jffs2

Posted Nov 16, 2012 15:31 UTC (Fri) by oak (guest, #2786) [Link]

And JFFS2 works really badly on filesystems whose sizes are in GBs, as it AFAIK keeps the whole fs structure in RAM. From a scalability point of view, UBIFS would be a more reasonable comparison.

An f2fs teardown

Posted Oct 11, 2012 19:16 UTC (Thu) by skitching (guest, #36856) [Link] (6 responses)

Thanks for the great analysis Neil.

Unfortunately, it is sad to see that f2fs is *so* coupled to an underlying FTL. I presume that where you say "leaves it to the FTL", this means that f2fs is simply rewriting data at a fixed address, and relying on the FTL to relocate that address to perform wear-leveling. And given that flash is supposed to be efficient at random writes (in units of an erase block), it is a shame that f2fs spends so much effort grouping writes into larger operations to help the FTL layer perform.

So we have a choice of filesystems that only work well on "raw" flash (jffs2, ubifs, logfs) or a filesystem that only works well on managed flash (f2fs).

AIUI, FTL is a solution invented to make it possible to put filesystems like FAT32 onto a flash device without quickly killing it. Are we really still going to be stuck with FTL decades from now?

If the industry is going to provide a "raw" interface to removable flash media in the near future, then f2fs will only be a short-term solution, yes? And interestingly, Samsung are exactly the people who could get a "raw flash interface" happening...

An f2fs teardown

Posted Oct 11, 2012 21:11 UTC (Thu) by arnd (subscriber, #8866) [Link] (4 responses)

A few replies to each of your comments:

* Wear leveling usually works by having a pool of available erase blocks in the drive. When you write to a new location, the drive takes one block out of that pool and writes the data there. When the drive thinks you are done writing to one block, it cleans up any partially written data and puts a different block back into the pool.

* f2fs tries to group writes into larger operations of at least page size (16KB or more) to be efficient; current FTLs are horribly bad at 4KB page-size writes. It also tries to fill erase blocks (multiples of 2MB) in the order that the devices can handle.

* logfs actually works on block devices but hasn't been actively worked on over the last few years. f2fs also promises better performance by using only 6 erase blocks concurrently rather than 12 in the case of logfs. A lot of the underlying principles are the same though.

* The "industry" is moving away from raw flash interfaces towards eMMC and related technologies (UFS, SD, ...). We are not going back to raw flash any time soon, which is unfortunate for a number of reasons but also has a few significant advantages. Having the FTL take care of bad block management and wear leveling is one such advantage, at least if they get it right.

An f2fs teardown

Posted Oct 16, 2012 21:01 UTC (Tue) by travelsn (guest, #48694) [Link] (3 responses)

But are there processors out there that can boot from FTL-enabled flash? Don't we still need raw NAND for the boot device?

An f2fs teardown

Posted Oct 16, 2012 23:53 UTC (Tue) by neilbrown (subscriber, #359) [Link] (2 responses)

> But are there processor out there that can boot from FTL enabled flash

The OMAP3 in my phone boots from the micro-SD card. It reads from the blocks that a file would be stored in if it were the first file copied onto a newly formatted VFAT partition (it doesn't parse the FAT, it just *knows* what to read).

So yes: processors can boot from all sorts of things.

An f2fs teardown

Posted Oct 17, 2012 12:14 UTC (Wed) by etienne (guest, #25256) [Link]

> So yes: processors can boot from all sorts of things.

In fact the OMAP3 boots from an internal ROM and can chain-load from other boot devices.
It would be so nice if that ROM also contained a description of all the devices on this particular system-on-chip...

An f2fs teardown

Posted Dec 22, 2012 22:38 UTC (Sat) by marcH (subscriber, #57642) [Link]

> It reads from the blocks that a file would be stored in if it were the first file copied onto a newly formatted VFAT partition (it doesn't parse the FAT, it just *knows* what to read).

To avoid this kludge, starting from version 4.3, e-MMC chips feature two 128K boot partitions with a simplified access procedure: simplified for ROMs.

JEDEC specifications are free (registration required)

An f2fs teardown

Posted Oct 11, 2012 21:15 UTC (Thu) by neilbrown (subscriber, #359) [Link]

One point that I didn't emphasise in the article, and maybe should have, is that those locations where f2fs writes to fixed addresses are all at the start of the device, the same region that a FAT filesystem uses for the "FAT" (File Allocation Table).
As the FAT is also updated by writes to fixed addresses, that part of the device is often optimised for that sort of access pattern.

So if you manage to align the "meta" area to the "fat"-dedicated area, and align the sections and zones as already described, f2fs should be fully optimised for the device.

As for whether we'll be stuck with FTL for decades: that is really a question of economics. For a better product to appear, there needs to be a big enough market.

An f2fs teardown

Posted Oct 11, 2012 20:01 UTC (Thu) by rfrancoise (subscriber, #15508) [Link]

Fantastic article, thanks.

An f2fs teardown

Posted Oct 12, 2012 8:21 UTC (Fri) by cmccabe (guest, #60281) [Link] (10 responses)

Great article!

I'm a little disturbed by the many arbitrary low limits in the filesystem. 16 TB max? Less than 4 TB max for a file? Timestamps only up to 2038?

I mean, sure, good design requires tradeoffs. But I thought the point of this filesystem was that it would become some kind of long-lived standard for how we accessed embedded flash devices, sort of like how FAT32 is now. We would probably not even be talking about replacing FAT32 on flash devices, despite its many inefficiencies and limitations, if it didn't have the 2TB limit.

Or am I misreading this, and it's simply about avoiding the FAT tax and getting some additional performance in the bargain?

An f2fs teardown

Posted Oct 12, 2012 16:06 UTC (Fri) by Aissen (guest, #59976) [Link] (4 responses)

I'm not sure the primary usage is to prevent the "FAT tax". Sure, it could be useful on SD cards to replace exFAT. But I think the primary goal is to replace ext4 for eMMCs embedded in smartphones (and tablets, or any other smart device). This limitation could then make (a little bit of) sense. With current technology we have 64GB eMMCs, with 128GB in the pipes. With capacity doubling every 2 years, it would take ~15 years to reach the filesystem limit. Let's hope that by then non-volatile memory use will be pervasive.

The thing I don't understand is why work isn't done to make btrfs fit this use case. It already has less write amplification than ext4 or xfs due to its COW nature (I think Arnd Bergmann did some research on that). It would use the years of experience and higher performance of btrfs (vs a newly developed filesystem). It would also fit the Linux philosophy of running on anything from the tiniest devices to TOP500 computers.

Is it because btrfs has a high CPU overhead? Consumes lots of disk space? Or just because every btrfs developer is working on "big data" server-side use cases?

An f2fs teardown

Posted Oct 14, 2012 18:38 UTC (Sun) by cmccabe (guest, #60281) [Link] (3 responses)

> The thing I don't understand, is why work isn't
> done to make btrfs fit this use case. It already
> has less write amplification than ext4 or xfs
> due to it's COW nature (I think Arnd Bergmann
> did some research on that).

It's not obvious that btrfs is the best choice for SSDs. Ted T'so posted some information on this earlier: http://lwn.net/Articles/470553/

There is currently some work going into btrfs to make it a better match for SSDs. That would probably make an interesting LWN article of its own. Also keep in mind that the type of SSD you see on a desktop is much different than what you see in a mobile phone. The firmware is much fancier and so an optimization for one may be a pessimization for the other.

An f2fs teardown

Posted Oct 15, 2012 8:18 UTC (Mon) by Aissen (guest, #59976) [Link]

In the link you point to (very interesting BTW), Ted says that btrfs will be at a disadvantage in "fsync()-happy workloads". So it varies between workloads.

I didn't use the word "SSD", and that's because (as you said) it might refer to different things. I talked about eMMCs and SD cards, which are the target use case of f2fs, and used in mobile phones.
In some use cases, btrfs might be the best choice, according to Arnd's year old research:
http://www.youtube.com/watch?feature=player_detailpage&... (wasn't able to find the updated slides).

An f2fs teardown

Posted Oct 17, 2012 14:26 UTC (Wed) by arnd (subscriber, #8866) [Link] (1 responses)

I believe btrfs has improved significantly in this area, but its design means that it won't be as good as f2fs on the media that f2fs optimizes for. The issue with b-tree updates that Ted mentions in the link is something that f2fs avoids by having another level of indirection that is not copy-on-write, and btrfs suffers more from fragmentation because it intentionally does not garbage-collect.

On a lot of flash devices, btrfs starts out significantly faster than ext4 after a fresh mkfs, but it's possible that btrfs performance degrades more as the file system fragments with aging. I don't have any data to back that up though.

An f2fs teardown

Posted Nov 16, 2012 15:33 UTC (Fri) by oak (guest, #2786) [Link]

Nobody mentioned compression, but I think BTRFS can use e.g. LZO compression. What's the situation with that?

An f2fs teardown

Posted Oct 16, 2012 18:22 UTC (Tue) by tomstdenis (guest, #86984) [Link] (4 responses)

4TB max for a file is not a problem.

Let's look at your typical use case [e.g. cell phone]. Max download speeds are in the 5-50Mbit/sec range realistically. It'd take over a week of straight downloading at 50Mbit/sec constantly to fill that up.

If that were a 720p quality video it'd play for 4+ days straight...

An f2fs teardown

Posted Oct 17, 2012 18:13 UTC (Wed) by intgr (subscriber, #39733) [Link] (1 responses)

> 4TB max for a file is not a problem. [...] Max download speeds are in the 5-50Mbit/sec range realistically.

Famous last words.

2GB max for a file wasn't a problem in 1996 when they designed FAT 32, either. It would take over 5 days to fill that over a 33.6 kbaud modem in those days.

Now I can plug an HDMI-capable cellphone into a 1080p TV and stream multi-gigabyte Bluray rips over Wi-Fi. Yet I can't store them on the SD card because someone thought "it would never be a problem".

An f2fs teardown

Posted Oct 28, 2012 17:20 UTC (Sun) by khim (subscriber, #9252) [Link]

This is an interesting comment. Note that FAT32 was explicitly designed as a stop-gap solution for Windows96 (and then retrofitted into Windows95OSR2 when Windows96 became first Windows97, then Windows98). The long-term solution was supposed to be Windows 2000 (and later Windows XP) and it worked like a charm.

But then FAT32 was used for a totally unrelated task (USB sticks) and this is where its limitations became problematic... and since Microsoft wants to monopolize this market too, instead of FAT32X we've gotten exFAT... which is, of course, not supported by many, many things because its implementation is not free: exFAT is heavily patented.

Moral? F2FS limitations are fine for what it's designed for, but if we try to use it for some unrelated tasks... we may be in trouble.

An f2fs teardown

Posted Oct 18, 2012 4:12 UTC (Thu) by cyanit (guest, #86671) [Link] (1 responses)

It is; how about a 5TB disk image/virtual disk on a virtualized server that has a RAID array of 10 512GB SSDs? (the SSDs would only cost around $6000)

Not to mention the fact that files can be sparse.

An f2fs teardown

Posted Oct 18, 2012 14:23 UTC (Thu) by arnd (subscriber, #8866) [Link]

f2fs isn't really optimized for SSDs at all. The largest media today that it actually targets are USB sticks of maybe 128GB that are both slow and expensive. Rather than using a RAID of 40 USB sticks and f2fs, I would always recommend getting a bunch of SSDs and using btrfs on them.

An f2fs teardown

Posted Oct 12, 2012 16:16 UTC (Fri) by andresfreund (subscriber, #69562) [Link]

Nice article btw, more of it please!

An f2fs teardown

Posted Oct 17, 2012 17:12 UTC (Wed) by jond (subscriber, #37669) [Link] (1 responses)

Great article. I have a spare microsd I want to use to expand my archive storage. This seems to be just what I need.

An f2fs teardown

Posted Oct 18, 2012 12:40 UTC (Thu) by arnd (subscriber, #8866) [Link]

I'm not sure if you were joking here, but since you put no smiley in your message, let me warn you about two reasons why you really should not consider this:

* The file system is not stable or merged yet, and will very likely see incompatible changes to the on-disk layout. Even if you don't run into bugs that cause your data to get destroyed, you won't be able to read the data anymore with the version of the file system that eventually gets merged.

* A lot of SD cards are not sufficient in their hardware characteristics to support f2fs. Have a look at https://wiki.linaro.org/WorkingGroups/Kernel/Projects/Fla...; all SD cards with a number less than 7 in the "# open AUs linear" column, or cards that don't have a power-of-two erase block size, will not work correctly with f2fs. It's not worse than using ext4 or btrfs on the same devices, but you should not do that either, at least not if you are storing important data.

An f2fs teardown

Posted Sep 27, 2014 10:29 UTC (Sat) by Praveen_Pandey (guest, #99082) [Link]

Great article !!


Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds