
Finding the proper scope of a file collapse operation

By Jonathan Corbet
March 5, 2014
System call design is never easy; there are often surprising edge cases that developers fail to consider as they settle on an interface. System calls involving filesystems seem to be especially prone to this kind of problem, since the complexity and variety of filesystem implementations means that there may be any number of surprises waiting for a developer who wants to create a new file-oriented operation. Some of these surprises can be seen in the discussion of a proposed addition to the fallocate() system call.

fallocate() is concerned with the allocation of space within a file; its initial purpose was to allow an application to allocate blocks to a file prior to writing them. This type of preallocation ensures that the needed space is available before trying to write the data that goes there; it can also help filesystem implementations lay out the allocated space more efficiently on disk. Later on, the FALLOC_FL_PUNCH_HOLE operation was added to deallocate blocks within a file, leaving a hole in the file.

In February, Namjae Jeon proposed a new fallocate() operation called FALLOC_FL_COLLAPSE_RANGE; this proposal included implementations for the ext4 and xfs filesystems. Like the hole-punching operation, it removes data from a file, but there is a difference: rather than leaving a hole in the file, this operation moves all data beyond the affected range to the beginning of that range, shortening the file as a whole. The immediate user for this operation would appear to be video editing applications, which could use it to quickly and efficiently remove a segment of a video file. If the removed range is block-aligned (which would be a requirement, at least for some filesystems), the removal could be effected by changing the file's extent maps, with no actual copying of data required. Given that files containing video data can be large, it is not hard to understand why an efficient "cut" operation would be attractive.

So what kinds of questions arise with an operation like this? One could start with the interaction with the mmap() system call, which maps a file into a process's address space. The proposed implementation works by removing all pages from the affected range to the end of the file from the page cache; dirty pages are written back to disk first. That will prevent the immediate loss of data that may have been written via a mapping, and will get rid of any memory pages that will be after the end of the file once the operation is complete. But it could be a surprise for a process that does not expect the contents of a file to shift around underneath its mapping. That is not expected to be a huge problem; as Dave Chinner pointed out, the types of applications that would use the collapse operation do not generally access their files via mmap(). Beyond that, applications that are surprised by a collapsed file may well be unable to deal with other modifications even in the absence of a collapse operation.

But, as Hugh Dickins noted, there is a related problem: in the tmpfs filesystem, all files live in the page cache and look a lot like a memory mapping. Since the page cache is the backing store, removing file pages from the page cache is unlikely to lead to a happy ending. So, before tmpfs could support the collapse operation, a lot more effort would have to go into making things play well with the page cache. Hugh was not sure that there would ever be a need for this operation in tmpfs, but, he said, solving the page cache issues for tmpfs would likely lead to a more robust implementation for other filesystems as well.

Hugh also wondered whether the uni-directional collapse operation should, instead, be designed to work in both directions:

I'm a little sad at the name COLLAPSE, but probably seven months too late to object. It surprises me that you're doing all this work to deflate a part of the file, without the obvious complementary work to inflate it - presumably all those advertisers whose ads you're cutting out, will come back to us soon to ask for inflation, so that they have somewhere to reinsert them.

Andrew Morton went a little further, suggesting that a simple "move these blocks from here to there" system call might be the best idea. But Dave took a dim view of that suggestion, worrying that it would introduce a great deal of complexity and difficult corner cases:

IOWs, collapse range is a simple operation, "move arbitrary blocks from here to there" is a nightmare both from the specification and the implementation points of view.

Andrew disagreed, claiming that a more general interface was preferable and that the problems could be overcome, but nobody else supported him on this point. So, chances are, the operation will remain confined to collapsing chunks out of files; a separate "insert" operation may be added in the future, should an interesting use case for it be found.

Meanwhile, there is one other behavioral question to answer: what happens if the region to be removed from the file reaches to the end of the file? The current patch set returns EINVAL in that situation, with the idea that a call to truncate() should be used instead. Ted Ts'o asked whether such operations should just be turned directly into truncate() calls, but Dave is set against that idea. A collapse operation that includes the end of the file, he said, is almost certainly buggy; it is better to return an error in that case.

There are also, evidently, some interesting security issues that could come up if a collapse operation were allowed to include the end of the file. Filesystems can allocate blocks beyond the end of the file; indeed, fallocate() can be used to explicitly request that behavior. Those blocks are typically not zeroed out by the filesystem; instead, they are kept inaccessible so that whatever stale data is contained there cannot be read. Without a great deal of care, a collapse implementation that allowed the range to go beyond the end of the file could end up exposing that data, especially if the operation were to be interrupted (by a system crash, perhaps) in the middle. Rather than set that sort of trap for filesystem developers, Dave would prefer to disallow the risky operations from the beginning, especially since there does not appear to be any real need to support them.

So the end result of all this discussion is that the FALLOC_FL_COLLAPSE_RANGE operation is likely to go into the kernel essentially unchanged. It will not have all the capabilities that some developers would have liked to see, but it will support one useful feature that should help to accelerate a useful class of applications. Whether this will be enough for the long term remains to be seen; system call API design is hard. But, should additional features be needed in the future, new FALLOC_FL commands can be created to make them available in a compatible way.

Index entries for this article
Kernel: fallocate()



Finding the proper scope of a file collapse operation

Posted Mar 6, 2014 3:11 UTC (Thu) by neilbrown (subscriber, #359) [Link] (6 responses)

I'm a bit confused by the use case - which video file formats have frames aligned on 4K boundaries and so could make use of this?

Surely a video editor maintains a separate file with references into various source materials, and then renders the whole when done.... but maybe I have an overly simplistic understanding of video editing.

Finding the proper scope of a file collapse operation

Posted Mar 6, 2014 10:55 UTC (Thu) by Yorick (guest, #19241) [Link] (3 responses)

No, your understanding is basically correct. The whole idea very much sounds like a solution in search of a problem. We all know how difficult it is to get speculative API designs right.

Finding the proper scope of a file collapse operation

Posted Mar 6, 2014 23:31 UTC (Thu) by PaulWay (subscriber, #45600) [Link] (2 responses)

In my view, this is a general problem - programs want to be able to insert, delete and move data around anywhere inside a file. This is a standard mode of operation for any editor - video, audio, text, graphics, metadata, binary data, database, etc. And not just 4k blocks, or sector blocks - arbitrary quantities.

It's just that operating systems traditionally have not supported this, because it's easier to take the problem of mapping data in the file to data on the disk and pass it straight on to the user-space program - especially when you're trying to write an operating system to fit in the kinds of memory spaces we had thirty years ago.

So every program tries to implement this differently, based on its own different requirements. And each one makes compromises, and reinvents the wheel, and makes the same mistakes, and doesn't try to solve the whole problem.

Now, to support arbitrary chunk sizes requires considerable work anywhere you implement it. Maybe it requires significant rethinking of the file system itself. Maybe this won't even be available in older filesystems. It doesn't mean that it isn't a useful thing to do.

Maybe a good way to start is to write a new library that provides these functions, as well as the traditional operations, as a file handle. If it's running on a traditional file system without support for arbitrary range operations, then it implements its own using internal metadata. The file on disk won't be exactly the same as the raw data, but at this stage it's really more for internal operations - in the same way as Audacity's internal save format is not really useable as a playback format in other programs.

Hmmmmmm - libnewfile - another project for me to try... :-)

Then the library would recognise the features of the file system it was writing to and use these for greater efficiency. And in the rainbow unicorn chocolate pony future, when the file system supported these operations natively, the implementation in libnewfile would simply pass them straight through to the file system.

But IMO saying "we can't implement this - we've been doing it this way for the last thirty years" is never a convincing argument.

Have fun,

Paul

Finding the proper scope of a file collapse operation

Posted Mar 7, 2014 10:53 UTC (Fri) by etienne (guest, #25256) [Link]

> support arbitrary chunk sizes [in filesystem]

Maybe it will be done as soon as the hard disk and its DMA subsystem support arbitrary chunk sizes? Some video format are in blocks of 188 bytes.
I am not saying that ultimately the page cache shall not support reasonable multiple power-of-two block sizes.

Finding the proper scope of a file collapse operation

Posted Mar 13, 2014 22:01 UTC (Thu) by ohcamacj (guest, #45012) [Link]

I actually wrote something like that a few years ago, when I was a little kid. I called it offset_fs, instead of libnewfile. If a filename ended in -desc, it would be interpreted as
#comments
/path/to/file1: offset,length
/path/to/file2: offset,length
#etc
and the concatenation of the specified file slices would be obtained upon reading it back.

It's been useful occasionally.

src: http://ohcamacj.dyndns.org:8088/~jonathan/files-1e2a/offs...

However, it's been so long, I forgot how to compile it. -D_FILE_OFFSET_BITS=64 and -lpthread and maybe one or two other options. Sorry.

Finding the proper scope of a file collapse operation

Posted Mar 6, 2014 11:01 UTC (Thu) by mathstuf (subscriber, #69389) [Link]

If the referenced frames are keyframes, it might be easier/faster to extract the frames than seek to the keyframe, decode N frames and then start extracting frames. (Just speculation.)

Finding the proper scope of a file collapse operation

Posted Mar 6, 2014 19:36 UTC (Thu) by kleptog (subscriber, #1183) [Link]

Most video formats have framing and I imagine they have some kind of comment frame. So you could use fallocate() to get rid of any 4k-aligned blocks you want rid of, and write in a comment frame for the remaining bytes.

Finding the proper scope of a file collapse operation

Posted Mar 6, 2014 8:37 UTC (Thu) by fishface60 (subscriber, #88700) [Link] (4 responses)

I've got the sneaking suspicion that btrfs may already be able to do everything that's required, since it has an ioctl for cloning a range of data in a file, which can work on the same file.

I don't know if it'll handle moving nicely, but even if it doesn't, you can create a new copy of the file without the unwanted range, by just cloning the data before and after it into the new file, then replacing your old file with the new one.

Finding the proper scope of a file collapse operation

Posted Mar 6, 2014 12:08 UTC (Thu) by HIGHGuY (subscriber, #62277) [Link] (3 responses)

I would think Andrew's proposal is doable...

Insert range:
- add sufficient data to end of file using fallocate()
- move range into position

Remove range:
- move range to end of file
- truncate()

With regards to mmap(), tmpfs, etc. you don't need to care about writeback of dirty data, mmap resize, etc. You just swap the mapping of pages around and have truncate() and fallocate() do the hard stuff.

But then again, I'm not hindered by any knowledge on the subject...

Finding the proper scope of a file collapse operation

Posted Mar 6, 2014 14:58 UTC (Thu) by dgm (subscriber, #49227) [Link] (1 response)

It seems that the basic code is already there. Why not just add a syscall to shuffle blocks around?

Finding the proper scope of a file collapse operation

Posted Mar 6, 2014 18:23 UTC (Thu) by Funcan (subscriber, #44209) [Link]

It's there for some filesystems. Not for most.

Finding the proper scope of a file collapse operation

Posted Mar 17, 2014 1:38 UTC (Mon) by vomlehn (guest, #45588) [Link]

I very much like Andrew's approach:

  • The mmap() interpretation becomes trivial, since the blocks simply appear in the new location, while the former blocks are still available until you do the truncate().
  • There is no issue with allocated but unused blocks at the end of the file, since you haven't changed the file size and the truncate() already handles that case.
  • There is no issue with a collapse extending to the end of the file, since truncate() handles that too.

With so many issues with the interface as proposed, and a very simple alternative, it's not clear to me why you'd do anything else.

Finding the proper scope of a file collapse operation

Posted Mar 13, 2014 2:18 UTC (Thu) by abket (guest, #95676) [Link]

This functionality DEFINITELY needs to be generalized to move data between two arbitrary files and ranges (of course it can fail if the specific arrangement is not supported).

It's absurd to just have an ad-hoc "insert hole".


Copyright © 2014, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds