Finding the proper scope of a file collapse operation
fallocate() is concerned with the allocation of space within a file; its initial purpose was to allow an application to allocate blocks to a file prior to writing them. This type of preallocation ensures that the needed space is available before trying to write the data that goes there; it can also help filesystem implementations lay out the allocated space more efficiently on disk. Later on, the FALLOC_FL_PUNCH_HOLE operation was added to deallocate blocks within a file, leaving a hole in the file.
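As a rough illustration of the two operations described so far, the sketch below wraps preallocation and hole punching in small C helpers. Only fallocate() and the FALLOC_FL_ flags are real interfaces; the helper names (preallocate(), punch_hole(), range_is_zero()) are invented for this example, and error handling is kept to a minimum.

```c
#define _GNU_SOURCE
#include <fcntl.h>          /* fallocate() */
#include <linux/falloc.h>   /* FALLOC_FL_PUNCH_HOLE, FALLOC_FL_KEEP_SIZE */
#include <sys/types.h>      /* off_t, ssize_t */
#include <unistd.h>         /* pread() */

/* Hypothetical helper: allocate blocks for [0, size) ahead of any writes.
 * With mode 0 the file is also extended to `size` if it was shorter.
 * Returns 0 on success, -1 with errno set on failure. */
int preallocate(int fd, off_t size)
{
    return fallocate(fd, 0, 0, size);
}

/* Hypothetical helper: deallocate the blocks behind [offset, offset + len),
 * leaving a hole. FALLOC_FL_PUNCH_HOLE must be combined with
 * FALLOC_FL_KEEP_SIZE, so the file size does not change. */
int punch_hole(int fd, off_t offset, off_t len)
{
    return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     offset, len);
}

/* Hypothetical helper: returns 1 if every byte in the range reads back as
 * zero (as a hole does), 0 if not, -1 on a read error or unexpected EOF. */
int range_is_zero(int fd, off_t offset, size_t len)
{
    unsigned char buf[4096];

    while (len > 0) {
        size_t want = len < sizeof buf ? len : sizeof buf;
        ssize_t got = pread(fd, buf, want, offset);
        if (got <= 0)
            return -1;
        for (ssize_t i = 0; i < got; i++)
            if (buf[i] != 0)
                return 0;
        offset += got;
        len -= (size_t)got;
    }
    return 1;
}
```

Not every filesystem implements every mode; a caller should be prepared for fallocate() to fail with EOPNOTSUPP and fall back to ordinary writes.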
In February, Namjae Jeon proposed a new fallocate() operation called FALLOC_FL_COLLAPSE_RANGE; this proposal included implementations for the ext4 and xfs filesystems. Like the hole-punching operation, it removes data from a file, but there is a difference: rather than leaving a hole in the file, this operation moves all data beyond the affected range to the beginning of that range, shortening the file as a whole. The immediate user for this operation would appear to be video editing applications, which could use it to quickly and efficiently remove a segment of a video file. If the removed range is block-aligned (which would be a requirement, at least for some filesystems), the removal could be effected by changing the file's extent maps, with no actual copying of data required. Given that files containing video data can be large, it is not hard to understand why an efficient "cut" operation would be attractive.
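To make the difference from hole punching concrete, here is a minimal sketch of how an application might invoke the proposed operation, together with the arithmetic that describes where each byte ends up. The flag name comes from the proposal; the fallback value 0x08 is the one found in mainline linux/falloc.h and is only used when the installed headers lack the flag. The helper names collapse_range() and position_after_collapse() are invented for this example.

```c
#define _GNU_SOURCE
#include <fcntl.h>          /* fallocate() */
#include <linux/falloc.h>   /* FALLOC_FL_* flag values */
#include <sys/types.h>      /* off_t */

/* Assumption: headers that predate the flag do not define it; 0x08 is the
 * value used by mainline kernels. */
#ifndef FALLOC_FL_COLLAPSE_RANGE
#define FALLOC_FL_COLLAPSE_RANGE 0x08
#endif

/* Hypothetical helper: remove [offset, offset + len) from the file and
 * shift everything after it down by `len` bytes. Filesystems may require
 * both values to be block-aligned. Returns 0 or -1 with errno set. */
int collapse_range(int fd, off_t offset, off_t len)
{
    return fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, offset, len);
}

/* Hypothetical helper: where does the byte at old position `pos` live
 * after collapsing [offset, offset + len)? Returns -1 if it was removed. */
off_t position_after_collapse(off_t pos, off_t offset, off_t len)
{
    if (pos < offset)
        return pos;          /* before the cut: unchanged */
    if (pos < offset + len)
        return -1;           /* inside the cut: gone */
    return pos - len;        /* after the cut: shifted down */
}
```

For a cut of 8192 bytes at offset 4096, for instance, the byte that used to sit at offset 12288 becomes the byte at offset 4096, and the file ends up 8192 bytes shorter.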
So what kinds of questions arise with an operation like this? One could start with the interaction with the mmap() system call, which maps a file into a process's address space. The proposed implementation works by removing all pages from the affected range to the end of the file from the page cache; dirty pages are written back to disk first. That will prevent the immediate loss of data that may have been written via a mapping, and will get rid of any memory pages that will be after the end of the file once the operation is complete. But it could be a surprise for a process that does not expect the contents of a file to shift around underneath its mapping. That is not expected to be a huge problem; as Dave Chinner pointed out, the types of applications that would use the collapse operation do not generally access their files via mmap(). Beyond that, applications that are surprised by a collapsed file may well be unable to deal with other modifications even in the absence of a collapse operation.
But, as Hugh Dickins noted, there is a related problem: in the tmpfs filesystem, all files live in the page cache and look a lot like a memory mapping. Since the page cache is the backing store, removing file pages from the page cache is unlikely to lead to a happy ending. So, before tmpfs could support the collapse operation, a lot more effort would have to go into making things play well with the page cache. Hugh was not sure that there would ever be a need for this operation in tmpfs, but, he said, solving the page cache issues for tmpfs would likely lead to a more robust implementation for other filesystems as well.
Hugh also wondered whether the uni-directional collapse operation should, instead, be designed to work in both directions, so that a range could be inserted into a file as well as removed from it.
Andrew Morton went a little further, suggesting that a simple "move these blocks from here to there" system call might be the best idea. But Dave took a dim view of that suggestion, worrying that it would introduce a great deal of complexity and difficult corner cases.
Andrew disagreed, claiming that a more general interface was preferable and that the problems could be overcome, but nobody else supported him on this point. So, chances are, the operation will remain confined to collapsing chunks out of files; a separate "insert" operation may be added in the future, should an interesting use case for it be found.
Meanwhile, there is one other behavioral question to answer: what happens if the region to be removed from the file reaches to the end of the file? The current patch set returns EINVAL in that situation, with the idea that a call to truncate() should be used instead. Ted Ts'o asked whether such operations should just be turned directly into truncate() calls, but Dave is set against that idea. A collapse operation that includes the end of the file, he said, is almost certainly buggy; it is better to return an error in that case.
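Since the kernel refuses a collapse that reaches the end of the file, the decision is left to the application. The sketch below shows one way a caller might sort a removal request into "collapse", "truncate", or "invalid" before making any system call. The names classify_removal() and remove_range() are hypothetical, and turning the end-of-file case into ftruncate() is a policy choice made here for illustration; an application that treats such a request as a bug would return an error instead.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>          /* fallocate() */
#include <linux/falloc.h>   /* FALLOC_FL_* flag values */
#include <sys/stat.h>       /* fstat() */
#include <sys/types.h>      /* off_t */
#include <unistd.h>         /* ftruncate() */

/* Assumption: older headers may lack the flag; 0x08 is the mainline value. */
#ifndef FALLOC_FL_COLLAPSE_RANGE
#define FALLOC_FL_COLLAPSE_RANGE 0x08
#endif

enum removal_kind { REMOVAL_INVALID, REMOVAL_COLLAPSE, REMOVAL_TRUNCATE };

/* Hypothetical helper: decide how a request to remove
 * [offset, offset + len) from a file of `file_size` bytes must be served. */
enum removal_kind classify_removal(off_t file_size, off_t offset, off_t len)
{
    if (offset < 0 || len <= 0 || offset >= file_size)
        return REMOVAL_INVALID;
    if (len >= file_size - offset)
        return REMOVAL_TRUNCATE;   /* range reaches or passes end of file */
    return REMOVAL_COLLAPSE;       /* range lies strictly inside the file */
}

/* Hypothetical wrapper: the end-of-file case is handled explicitly with
 * ftruncate() rather than being handed to the collapse operation. */
int remove_range(int fd, off_t offset, off_t len)
{
    struct stat st;

    if (fstat(fd, &st) != 0)
        return -1;
    switch (classify_removal(st.st_size, offset, len)) {
    case REMOVAL_COLLAPSE:
        return fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, offset, len);
    case REMOVAL_TRUNCATE:
        return ftruncate(fd, offset);
    default:
        errno = EINVAL;
        return -1;
    }
}
```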
There are also, evidently, some interesting security issues that could come up if a collapse operation were allowed to include the end of the file. Filesystems can allocate blocks beyond the end of the file; indeed, fallocate() can be used to explicitly request that behavior. Those blocks are typically not zeroed out by the filesystem; instead, they are kept inaccessible so that whatever stale data is contained there cannot be read. Without a great deal of care, a collapse implementation that allowed the range to go beyond the end of the file could end up exposing that data, especially if the operation were to be interrupted (by a system crash, perhaps) in the middle. Rather than set that sort of trap for filesystem developers, Dave would prefer to disallow the risky operations from the beginning, especially since there does not appear to be any real need to support them.
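The "blocks beyond the end of the file" case is easy to produce from user space. The following sketch, with invented helper names reserve_past_eof() and size_and_allocation(), requests such blocks with FALLOC_FL_KEEP_SIZE and then reports the logical size next to the allocated space. It assumes a filesystem that supports this mode; how quickly st_blocks reflects the allocation varies between filesystems.

```c
#define _GNU_SOURCE
#include <fcntl.h>          /* fallocate() */
#include <linux/falloc.h>   /* FALLOC_FL_KEEP_SIZE */
#include <sys/stat.h>       /* fstat() */
#include <sys/types.h>      /* off_t */

/* Hypothetical helper: allocate `extra` bytes of blocks starting at the
 * current end of file. With FALLOC_FL_KEEP_SIZE the file size is left
 * alone, so the new blocks exist but cannot be read. */
int reserve_past_eof(int fd, off_t extra)
{
    struct stat st;

    if (fstat(fd, &st) != 0)
        return -1;
    return fallocate(fd, FALLOC_FL_KEEP_SIZE, st.st_size, extra);
}

/* Hypothetical helper: report the logical size and the space allocated
 * on disk (st_blocks counts 512-byte units). */
int size_and_allocation(int fd, off_t *size, off_t *allocated)
{
    struct stat st;

    if (fstat(fd, &st) != 0)
        return -1;
    *size = st.st_size;
    *allocated = (off_t)st.st_blocks * 512;
    return 0;
}
```

After a successful reserve_past_eof(), the reported size is unchanged while the allocation may be larger; it is exactly that gap between the two numbers that a careless collapse implementation could expose.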
So the end result of all this discussion is that the FALLOC_FL_COLLAPSE_RANGE operation is likely to go into the kernel essentially unchanged. It will not have all the capabilities that some developers would have liked to see, but it will support one feature that should help to accelerate a useful class of applications. Whether this will be enough for the long term remains to be seen; system call API design is hard. But, should additional features be needed in the future, new FALLOC_FL commands can be created to make them available in a compatible way.
Index entries for this article | |
---|---|
Kernel | fallocate() |
Posted Mar 6, 2014 3:11 UTC (Thu)
by neilbrown (subscriber, #359)
[Link] (6 responses)
Surely a video editor maintains a separate file with references into various source materials, and then renders the whole when done.... but maybe I have an overly simplistic understanding of video editing.
Posted Mar 6, 2014 10:55 UTC (Thu)
by Yorick (guest, #19241)
[Link] (3 responses)
Posted Mar 6, 2014 23:31 UTC (Thu)
by PaulWay (subscriber, #45600)
[Link] (2 responses)
It's just that operating systems traditionally have not supported this, because it's easier to take the problem of mapping data in the file to data on the disk and pass it straight on to the user-space program - especially when you're trying to write an operating system to fit in the kinds of memory spaces we had thirty years ago.
So every program tries to implement this differently, based on its own different requirements. And each one makes compromises, and reinvents the wheel, and makes the same mistakes, and doesn't try to solve the whole problem.
Now, to support arbitrary chunk sizes requires considerable work anywhere you implement it. Maybe it requires significant rethinking of the file system itself. Maybe this won't even be available in older filesystems. It doesn't mean that it isn't a useful thing to do.
Maybe a good way to start is to write a new library that provides these functions, as well as the traditional operations, as a file handle. If it's running on a traditional file system without support for arbitrary range operations, then it implements its own using internal metadata. The file on disk won't be exactly the same as the raw data, but at this stage it's really more for internal operations - in the same way as Audacity's internal save format is not really useable as a playback format in other programs.
Hmmmmmm - libnewfile - another project for me to try... :-)
Then the library would recognise the features of the file system it was writing to and use these for greater efficiency. And in the rainbow unicorn chocolate pony future, when the file system supported these operations natively, the implementation in libnewfile would simply pass them straight through to the file system.
But IMO saying "we can't implement this - we've been doing it this way for the last thirty years" is never a convincing argument.
Have fun,
Paul
Posted Mar 7, 2014 10:53 UTC (Fri)
by etienne (guest, #25256)
[Link]
Maybe it will be done as soon as the hard disk and its DMA subsystem support arbitrary chunk sizes? Some video formats use blocks of 188 bytes.
Posted Mar 13, 2014 22:01 UTC (Thu)
by ohcamacj (guest, #45012)
[Link]
It's been useful occasionally.
src: http://ohcamacj.dyndns.org:8088/~jonathan/files-1e2a/offs...
However, it's been so long, I forgot how to compile it. -D_FILE_OFFSET_BITS=64 and -lpthread and maybe one or two other options. Sorry.
Posted Mar 6, 2014 11:01 UTC (Thu)
by mathstuf (subscriber, #69389)
[Link]
Posted Mar 6, 2014 19:36 UTC (Thu)
by kleptog (subscriber, #1183)
[Link]
Posted Mar 6, 2014 8:37 UTC (Thu)
by fishface60 (subscriber, #88700)
[Link] (4 responses)
I don't know if it'll handle moving nicely, but even if it doesn't, you can create a new copy of the file without the required range, by just cloning the data before and after into it, then replace your old file with the new one.
Posted Mar 6, 2014 12:08 UTC (Thu)
by HIGHGuY (subscriber, #62277)
[Link] (3 responses)
Insert range:
Remove range:
With regards to mmap(), tmpfs, etc. you don't need to care about writeback of dirty data, mmap resize, etc. You just swap the mapping of pages around and have truncate() and fallocate() do the hard stuff.
But then again, I'm not hindered by any knowledge on the subject...
Posted Mar 6, 2014 14:58 UTC (Thu)
by dgm (subscriber, #49227)
[Link] (1 responses)
Posted Mar 6, 2014 18:23 UTC (Thu)
by Funcan (subscriber, #44209)
[Link]
Posted Mar 17, 2014 1:38 UTC (Mon)
by vomlehn (guest, #45588)
[Link]
I very much like Andrew's approach: with so many issues with the interface as proposed, and a very simple alternative, it's not clear to me why you'd do anything else.
Posted Mar 13, 2014 2:18 UTC (Thu)
by abket (guest, #95676)
[Link]
It's absurd to just have an ad-hoc "insert hole".
I am not saying that ultimately the page cache shall not support reasonable multiple power-of-two block sizes.
#comments
/path/to/file1: offset,length
/path/to/file2: offset,length
#etc
and the concatenation of the specified file slices would be obtained upon reading it back.
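For what it's worth, a reader for that kind of slice list is not much code. The sketch below parses one line of the format shown above; the struct slice type, the parse_slice_line() name, and the 255-byte path limit are all made up for this example, and it assumes the last colon on a line separates the path from "offset,length".

```c
#include <stdio.h>    /* sscanf() */
#include <string.h>   /* strrchr(), memcpy() */

/* Hypothetical in-memory form of one "path: offset,length" entry. */
struct slice {
    char path[256];
    long long offset;
    long long length;
};

/* Parse one line of the slice list.
 * Returns 1 if `out` was filled in, 0 for a comment or blank line,
 * and -1 if the line is malformed. */
int parse_slice_line(const char *line, struct slice *out)
{
    const char *colon;
    size_t path_len;
    long long offset, length;
    char trailing;

    while (*line == ' ' || *line == '\t')
        line++;
    if (*line == '#' || *line == '\n' || *line == '\0')
        return 0;

    /* Use the last colon so that paths may themselves contain colons. */
    colon = strrchr(line, ':');
    if (colon == NULL)
        return -1;
    path_len = (size_t)(colon - line);
    if (path_len == 0 || path_len >= sizeof out->path)
        return -1;

    /* Exactly two numbers must follow; a third conversion means junk. */
    if (sscanf(colon + 1, " %lld , %lld %c", &offset, &length, &trailing) != 2)
        return -1;
    if (offset < 0 || length < 0)
        return -1;

    memcpy(out->path, line, path_len);
    out->path[path_len] = '\0';
    out->offset = offset;
    out->length = length;
    return 1;
}
```

Reading the virtual file back would then be a loop over the parsed entries, copying `length` bytes from `offset` of each named file in turn.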
Insert range:
- add sufficient data to end of file using fallocate()
- move range into position
Remove range:
- move range to end of file
- truncate()