statx() v3

By Jonathan Corbet
November 30, 2016

Some developments just take a long time to truly come to fruition. That has proved to be the case for the proposed statx() system call — at least, the "long time" part has, even if we may still be waiting for "fruition". By most accounts, though, this extension to the stat() system call would appear to be getting closer to being ready. Recent patches show the current state of statx() and where the remaining sticking points are.

The stat() system call, which returns metadata about a file, has a long history, having made its debut in the Version 1 Unix release in 1971. It has changed little in the following 45 years, even though the rest of the operating system has changed around it. Thus, it's unsurprising that stat() tends to fall short of current requirements. It cannot represent much of the information that is now relevant to files, including generation and version numbers, file creation time, encryption status, whether the file is stored on a remote server, and so on. It gives the caller no choice about which information to obtain, possibly forcing expensive operations to obtain data that the application does not need. The timestamp fields have year-2038 problems. And so on.

David Howells has been sporadically working on replacing stat() since 2010; his version 3 patch (counting since he restarted the effort earlier this year) came out on November 23. While the proposed statx() system call looks much the same as it did when we looked at it in May, there have been a few changes.

The prototype for statx() is still:

    int statx(int dfd, const char *filename, unsigned atflag, unsigned mask,
	      struct statx *buffer);

Normally, dfd is a file descriptor identifying a directory, and filename is the name of the file of interest; that file is expected to be found relative to the given directory. If filename is passed as NULL, then dfd is interpreted as referring directly to the file being queried. Thus, statx() supersedes the functionality of both stat() and fstat().

The atflag argument modifies the behavior of the system call. It handles a couple of flags that already exist in current kernels: AT_SYMLINK_NOFOLLOW to return information about a symbolic link rather than following it, and AT_NO_AUTOMOUNT to prevent the automounting of remote filesystems. A set of new flags just for statx() controls the synchronization of data with remote servers, allowing applications to adjust the balance between I/O activity and accurate results. AT_STATX_FORCE_SYNC will force a synchronization with a remote server, even if the local kernel thinks its information is current, while AT_STATX_DONT_SYNC inhibits queries to the remote server, yielding fast results that may be out-of-date or entirely unavailable.

The atflag parameter, thus, controls what statx() will do to obtain the data; mask, instead, controls which data is obtained. The available flags here allow the application to request file permissions, type, number of links, ownership, timestamps, and more. The special value STATX_BASIC_STATS returns everything stat() would, while STATX_ALL returns everything available. Reducing the amount of information requested might reduce the amount of I/O required to execute the system call, but some reviewers worry that developers will just use STATX_ALL to avoid the need to think about it.

The final argument, buffer, contains a structure to be filled with the relevant information; in this version of the patch this structure looks like:

    struct statx {
	__u32	stx_mask;	/* What results were written [uncond] */
	__u32	stx_blksize;	/* Preferred general I/O size [uncond] */
	__u64	stx_attributes;	/* Flags conveying information about the file [uncond] */
	__u32	stx_nlink;	/* Number of hard links */
	__u32	stx_uid;	/* User ID of owner */
	__u32	stx_gid;	/* Group ID of owner */
	__u16	stx_mode;	/* File mode */
	__u16	__spare0[1];
	__u64	stx_ino;	/* Inode number */
	__u64	stx_size;	/* File size */
	__u64	stx_blocks;	/* Number of 512-byte blocks allocated */
	__u64	__spare1[1];
	struct statx_timestamp	stx_atime;	/* Last access time */
	struct statx_timestamp	stx_btime;	/* File creation time */
	struct statx_timestamp	stx_ctime;	/* Last attribute change time */
	struct statx_timestamp	stx_mtime;	/* Last data modification time */
	__u32	stx_rdev_major;	/* Device ID of special file [if bdev/cdev] */
	__u32	stx_rdev_minor;
	__u32	stx_dev_major;	/* ID of device containing file [uncond] */
	__u32	stx_dev_minor;
	__u64	__spare2[14];	/* Spare space for future expansion */
    };

Here, stx_mask indicates which fields are actually valid; it will be the intersection of the information requested by the application and what the filesystem is able to provide. stx_attributes contains flags describing the state of the file; they indicate whether the file is compressed, encrypted, immutable, append-only, not to be included in backups, or an automount point.

The timestamp fields contain this structure:

    struct statx_timestamp {
	__s64	tv_sec;
	__s32	tv_nsec;
	__s32	__reserved;
    };

The __reserved field was added in the version 3 patch as the result of one of the strongest points of disagreement in recent discussions about statx(). Dave Chinner suggested that, at some point in the future, nanosecond resolution may no longer be adequate; he said that the interface should be able to handle femtosecond timestamps. He was mostly alone on that point; other participants, such as Alan Cox, said that the speed of light will ensure that we never need timestamps below nanosecond resolution. Chinner insisted, though, so Howells added the __reserved field with the idea that it can be pressed into service should the need arise in the future.
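One side effect of the __reserved field is that the three members add up to exactly 16 bytes, with no implicit padding. The sketch below mirrors the layout using the standard fixed-width types; the names demo_timestamp and demo_timestamp_cmp() are hypothetical and are not part of the patch.

```c
#include <assert.h>   /* static_assert */
#include <stdint.h>

/* Hypothetical mirror of the v3 statx_timestamp, using fixed-width types. */
struct demo_timestamp {
	int64_t tv_sec;      /* seconds */
	int32_t tv_nsec;     /* nanoseconds within the second */
	int32_t reserved;    /* unused; stands in for __reserved */
};

/* 8 + 4 + 4 bytes: the compiler has no padding to add. */
static_assert(sizeof(struct demo_timestamp) == 16,
	      "timestamp is expected to be 16 bytes");

/* Return -1, 0, or 1 as a is earlier than, equal to, or later than b. */
int demo_timestamp_cmp(const struct demo_timestamp *a,
		       const struct demo_timestamp *b)
{
	if (a->tv_sec != b->tv_sec)
		return a->tv_sec < b->tv_sec ? -1 : 1;
	if (a->tv_nsec != b->tv_nsec)
		return a->tv_nsec < b->tv_nsec ? -1 : 1;
	return 0;
}
```

The comparison ignores the reserved member, which is how current code would have to treat it until the field is given a meaning.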

Chinner had a number of other objections about the interface, some of which have not yet been addressed. These include the definition of the STATX_ATTR_ flags, which shadow a set of existing flags used with the FS_IOC_GETFLAGS and FS_IOC_SETFLAGS ioctl() calls. Reusing the flags allows a micro-optimization of the statx() code but, Chinner says, it perpetuates some interface mistakes made in the past. Ted Ts'o offered similar advice when reviewing a 2015 version of the patch set, but version 3 retains the same flag definitions.

The largest of Chinner's objections, though, may well be the absence of a comprehensive set of tests for statx(). This code, he said, should not go in until those tests are provided:

Quite frankly, I think this has to be an unconditional requirement for such generic, expandable new syscall functionality - either we get test coverage for it before merge, or we don't merge it. We've demonstrated time and time again that shit doesn't work if it's not tested and cannot be widely verified by independent filesystem developers.

This position has been echoed by others (Michael Kerrisk, for example) recently. The kernel does have a long history of merging new system calls that do not work as advertised, with corresponding pain resulting later on. Howells will likely end up providing such tests, but not yet:

Given the amount of bikeshedding that's taken place on this, I'm glad I *haven't* done the testsuite yet - it would have much more than doubled the amount of work. I *still* don't know what the final form is going to be.

The rate of change of the patch set does seem to be slowing so, perhaps, its final form is beginning to come into focus. The history of this work suggests that it would not be wise to predict its merging in the near future, though. The stat() system call has been with us for a long time; it's reasonable to expect that statx() will last for just as long. A bit of extra "bikeshedding" to get the interface right seems understandable in that context.


statx() v3

Posted Dec 1, 2016 12:18 UTC (Thu) by tdz (subscriber, #58733) [Link] (10 responses)

INAP (I'm not a physicist), but nano means 10^-9, while Planck time, the smallest measurable time interval, is near 5 * 10^-44 seconds. So at some point there might be a need for something more fine-grained than nano. :p

Asking more seriously, why not store the bits for sub-nano precision in the area at the end of the statx structure? There's plenty of space available. It would be harder to extract the complete time stamp from |struct statx|, but |struct timespec| could be used instead of introducing yet another time-stamp type.

statx() v3

Posted Dec 1, 2016 19:18 UTC (Thu) by jlayton (subscriber, #31672) [Link]

Personally...one of the major design goals here is to have an interface that is extendable. If we find the need in the future for higher resolution timestamps, then we can always extend the structure at that point. What we really need to do at this point is to make sure that we don't hamstring ourselves, such that we can't do that later.

So I think David's approach of making _just_ a skeletal statx call that is sufficient for emulating stat is the right approach. That allows us to both debate new attributes individually, and demonstrate that the interface is indeed cleanly extendable.

statx() v3

Posted Dec 1, 2016 19:39 UTC (Thu) by Tara_Li (guest, #26706) [Link] (7 responses)

Ok, what time frame *should* we be looking for? As someone mentioned, light speed is a pretty limiting factor here. Femtosecond is around the scale of the atoms themselves - attoseconds being below nuclear size. I might could see picoseconds, but really - femtoseconds??? Sure, the switches might actually operate on a femtosecond scale, but they still have to communicate that switching outside of the switch itself. And if we manage to bring in quantum entanglement as the mode of communication - well, the whole rest of the OS is likely to need re-writing, anyway.

statx() v3

Posted Dec 1, 2016 21:06 UTC (Thu) by excors (subscriber, #95769) [Link] (4 responses)

Surely relativity becomes a problem too. Defining a clock at a single point on your chip is okay, but once you start distributing it to multiple points across the chip (which'll take many picoseconds) you can no longer say the clock ticks at all those points occur simultaneously, even if you try to compensate for propagation delay - simultaneity depends on your frame of reference. It'll be meaningless to compare high-precision timestamps unless you know how fast the chip was travelling.

(But it does seem sensible for timestamps to be at least as precise as CPU cycle counters so that you can losslessly map between them, and nanoseconds aren't good enough for that.)

statx() v3

Posted Apr 4, 2017 22:27 UTC (Tue) by cwillu (guest, #67268) [Link] (3 responses)

It's unlikely that the components of the chip will have differing natural choices for their frame of reference, given their status as "components of the chip".

statx() v3

Posted Apr 5, 2017 15:06 UTC (Wed) by nybble41 (subscriber, #55106) [Link] (2 responses)

That's true for linear motion, but if the chip is *rotating* then different points on the chip could have different frames of reference—though at that size it would have to be very rapid rotation indeed to cause significant relativistic effects. If my calculations are correct, a dilation factor of just 1.000000000001 (one part in one trillion) would require a relative velocity of 56 megameters per second (about 1.4% of light speed). If the chip has an overall size of 25x25mm and thus a maximum radius of about 16mm, it would need to spin at over 4200 revolutions per second to achieve that velocity differential between the center and the corners. (That drops to about 1900 RPS if the center of rotation is one of the corners rather than the center of the chip, for a radius of ~35mm.)

statx() v3

Posted Apr 5, 2017 15:19 UTC (Wed) by nybble41 (subscriber, #55106) [Link] (1 responses)

> would require a relative velocity of 56 megameters per second (about 1.4% of light speed)

Strike that: this result was for an earlier calculation with a much higher dilation factor (1.0001). The rotation required for that tangential velocity would be several orders of magnitude higher. The correct velocity for a dilation of 1+10^-12 (4200 RPS @ 16mm) is only 424 m/s or about 0.00014% of light speed, which is still more than fast enough to put this concern to rest.
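The corrected figures can be checked with a few lines of arithmetic. This sketch assumes the usual low-speed approximation for the Lorentz factor, gamma - 1 ≈ v^2 / (2 c^2), and a point moving in a circle at speed v = 2 * pi * r * f; the function and macro names are hypothetical.

```c
/* Speed of light in vacuum, metres per second (exact by definition). */
#define LIGHT_SPEED_M_PER_S 299792458.0
#define DEMO_PI 3.14159265358979323846

/* Tangential speed (m/s) of a point radius_m from the axis at revs_per_s. */
double tangential_speed(double radius_m, double revs_per_s)
{
	return 2.0 * DEMO_PI * radius_m * revs_per_s;
}

/* Low-speed approximation of (gamma - 1): v^2 / (2 c^2). */
double dilation_excess(double speed_m_per_s)
{
	double beta = speed_m_per_s / LIGHT_SPEED_M_PER_S;

	return 0.5 * beta * beta;
}
```

Feeding a 16mm radius and roughly 4200 revolutions per second into tangential_speed() gives a speed close to 424 m/s, and dilation_excess() of that speed comes out near 10^-12, in line with the numbers quoted above.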

statx() v3

Posted Apr 5, 2017 17:17 UTC (Wed) by excors (subscriber, #95769) [Link]

4200 rps isn't unreasonably fast - turbochargers in cars apparently spin at around that speed.

But there's no need for the two parts of the chip to be in different reference frames anyway. Imagine a high-precision timestamp is encoded in a signal emitted from the exact center of the chip, which travels outwards at the speed of light and is received at two points on opposite edges of the chip a few picoseconds later. According to an observer in the same frame of reference as the chip, that signal would reach both points simultaneously (since it's travelling the same distance to both, at a constant speed). The receivers can correct for the transmission delay (based on the known speed and distances) and get perfectly synchronised high-precision timestamps across the whole chip.

Then imagine the chip is flying past you from left to right, very fast. You see the signal is emitted from the center of the chip and travels outwards at the speed of light relative to you (which is bizarre but apparently true). A few picoseconds later the chip has moved slightly to the right, so the left edge is closer to the point of emission than the right edge is. That means you'll see the signal reach the left edge of the chip first, since it has less distance to travel and its speed is the same in all directions. As far as you're concerned, the "perfectly synchronised" chip-wide timestamps are no longer synchronised - the clock on the left edge says it's noticeably later than the clock on the right edge does.

(This is with no acceleration or rotation or gravity. I won't even pretend to know anything about general relativity and what problems will occur if your chip is orbiting a small black hole, but I suspect the problems will be bad.)

There's not much point having attosecond-precision timestamps if that precision is much smaller than the uncertainty introduced by relativity.

Once you abandon the fiction of absolute time, all you really need to care about is causality - if two events occur inside each other's light cones then you want to be able to assign them distinct timestamps, where those timestamps have a total ordering that's a superset of the partial ordering of causality, and you end up with something like a Lamport timestamp.
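For readers unfamiliar with the idea, a Lamport timestamp can be sketched in a few lines. This is a minimal illustration with hypothetical names, not anything proposed for statx(): each node keeps a counter, bumps it on every local event, and on receiving a message jumps past the sender's stamp.

```c
#include <stdint.h>

/* Hypothetical per-node logical clock (a Lamport clock). */
struct lamport_clock {
	uint64_t counter;
};

/* A local event, or sending a message: advance and return the stamp. */
uint64_t lamport_tick(struct lamport_clock *clk)
{
	return ++clk->counter;
}

/* Receiving a message stamped `remote`: jump past it, then advance. */
uint64_t lamport_receive(struct lamport_clock *clk, uint64_t remote)
{
	if (remote > clk->counter)
		clk->counter = remote;
	return ++clk->counter;
}
```

Any event that causally follows another gets a larger stamp. Two unrelated events on different nodes can still receive the same number, so a total order needs a tie-breaker such as a node identifier.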

statx() v3

Posted Dec 2, 2016 8:39 UTC (Fri) by tdz (subscriber, #58733) [Link]

My point was that there's currently no use case for statx_timestamp, so why introduce it? If the reserved fields go into the statx structure instead, they can be allocated for sub-nano time frames when there's an actual need. If we won't ever use sub-nano precision, the reserved bits can be used for something else. For current user space, timespec will be fine; future user space can extract any additional bits and assemble a new high-res time stamp structure by itself.

I don't know what the shortest useful time frame would be. I think it makes sense to have time stamps that can represent individual clock ticks of the processor. With GHz CPUs we're already close to nano granularity. IIRC, distributed clocks require at least twice the resolution of the implemented time scale to reliably distinguish two adjacent points in time.

statx() v3

Posted Dec 8, 2016 13:57 UTC (Thu) by welinder (guest, #4699) [Link]

Using speed of light in vacuum, we have...

1s ~ 300Mm
1ms ~ 300km
1us ~ 300m
1ns ~ 300mm
1ps ~ 300um
1fs ~ 300nm
1as ~ 300pm

Atom size is on the order of 100-500pm, i.e., attosecond range.

See http://www.wolframalpha.com/input/?i=femto+second+times+s...
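The table is just distance = c * t, with c rounded to 300,000 km/s. A small sketch (the function name is hypothetical) makes it easy to try other intervals:

```c
/* Speed of light in vacuum, metres per second (exact by definition). */
#define LIGHT_SPEED_M_PER_S 299792458.0

/* Distance in metres that light covers in `seconds` of travel in vacuum. */
double light_distance_m(double seconds)
{
	return LIGHT_SPEED_M_PER_S * seconds;
}
```

Calling light_distance_m(1e-9) gives just under 0.3 metres, matching the 1ns row; signals in real conductors travel more slowly than this, so the figures are upper bounds.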

statx() v3

Posted Dec 4, 2016 6:04 UTC (Sun) by markh (subscriber, #33984) [Link]

A structure containing only 64-bit seconds and 32-bit nanoseconds would still be 16 bytes on x86_64 due to alignment, so the bits couldn't be used for something else. Worse, that would make the structure a different size on 32-bit x86.

I think they want to ensure that there is only one version of struct statx, for both 32-bit and 64-bit environments, whether off_t is 32-bit or 64-bit, whether time_t is 32-bit or 64-bit, and so on. That means that time_t, struct timespec, struct timeval, or any other type that may vary in size, cannot be used.

The alternative is to store all fields of each timestamp in separate non-adjacent fields within struct statx, but it is certainly simpler if they are together and a single pointer can be used to point to a particular timestamp.

statx() v3

Posted Dec 1, 2016 19:27 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (6 responses)

It might be nice to use a simple list of tagged values instead of hard-coded structures. That'd be a completely future-proof interface.

I never liked the fixed structures, even with "reserved" space. It always eventually runs out.
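To make the idea concrete, here is a minimal sketch of what a tagged-value buffer could look like. Everything in it is hypothetical (the header layout, the tag numbers, and the function names); it only shows that a reader can skip tags it does not recognize by using the length stored with each record.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record header; the payload bytes follow immediately. */
struct demo_tag_hdr {
	uint32_t tag;   /* which attribute this record carries */
	uint32_t len;   /* payload length in bytes */
};

/* Append one record; returns the new used size, or 0 if it does not fit. */
size_t demo_tag_put(unsigned char *buf, size_t used, size_t cap,
		    uint32_t tag, const void *val, uint32_t len)
{
	struct demo_tag_hdr h = { tag, len };

	if (used > cap || cap - used < sizeof(h) ||
	    cap - used - sizeof(h) < len)
		return 0;
	memcpy(buf + used, &h, sizeof(h));
	memcpy(buf + used + sizeof(h), val, len);
	return used + sizeof(h) + len;
}

/* Find a record by tag; returns its payload and length, or NULL. */
const unsigned char *demo_tag_find(const unsigned char *buf, size_t used,
				   uint32_t tag, uint32_t *len_out)
{
	size_t off = 0;
	struct demo_tag_hdr h;

	while (used - off >= sizeof(h)) {
		memcpy(&h, buf + off, sizeof(h));
		off += sizeof(h);
		if (h.len > used - off)
			return NULL;      /* truncated record */
		if (h.tag == tag) {
			*len_out = h.len;
			return buf + off;
		}
		off += h.len;             /* skip tags we do not know */
	}
	return NULL;
}
```

The cost is visible even in this small sketch: every lookup walks the buffer, and payloads have to be copied out with memcpy() because nothing guarantees their alignment.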

statx() v3

Posted Dec 2, 2016 6:26 UTC (Fri) by johill (subscriber, #25196) [Link] (3 responses)

I was thinking something similar, but it does mean having to pass in the length of the output buffer, and possibly truncating when the application provided a buffer that was too small.

At that point, there's little value in having tagged values - just the "make the struct bigger if userspace can deal with it" part should be enough?

statx() v3

Posted Dec 2, 2016 7:06 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

Well, Windows API designers had the right idea - the first word of the output buffer is the length of the actual data. Anything that doesn't fit is truncated, but you can simply retry the request.

Also it would pose a problem only for the "GIVE ME ALL THE DATA!!!" requests; in most cases users will probably just specify a pre-determined set of tags.

statx() v3

Posted Dec 2, 2016 7:12 UTC (Fri) by johill (subscriber, #25196) [Link] (1 responses)

Sure, you do need that - not just for retrying but also for knowing which fields are actually *valid*, so you can run on older kernels.

I think my main point was that there's no value in tagging things - the kernel will have to provide all "old" fields, even if a new field supersedes an old one, for compatibility, so the tagged data can't ever be removed. Hence, just making the struct size variable in this way should be enough.

statx() v3

Posted Dec 2, 2016 7:18 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

> I think my main point was that there's no value in tagging things - the kernel will have to provide all "old" fields, even if a new field supersedes an old one, for compatibility, so the tagged data can't ever be removed.

Uhmm... Why?

Newer clients simply won't be asking for the old tag. Sure, you'd have to keep the code in the kernel to provide the old tag for the older clients but that's it.

The tagged structure can be much more flexible and predictable. This would make it easier to use them for stuff like statx_multi() to get information about multiple files at the same time.

statx() v3

Posted Dec 2, 2016 7:28 UTC (Fri) by mm7323 (subscriber, #87386) [Link] (1 responses)

That would replace the stx_mask too.

But if you are having to parse tag-value lists, the code is going to be much slower on both sides. System calls already have enough overhead.

statx() v3

Posted Dec 2, 2016 7:30 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

Not necessarily. If your tag lists are fixed then it can just be parsed as a fixed-layout structure.

The kernel-side would be more complicated, but I doubt it'll make a huge difference.

statx() v3

Posted Dec 5, 2016 12:43 UTC (Mon) by k3ninho (subscriber, #50375) [Link] (2 responses)

> "I'm glad I *haven't* done the testsuite yet - it would have much more than doubled the amount of work. I *still* don't know what the final form is going to be."

From the perspective of having worked in a test-driven proprietary development shop and (in other settings) having faced down 'we don't know what this software should do' causing problems for our processes, this comment seems like terrible engineering. I understand that the drive behind contributing to free software and open source projects is that it solves a problem you have, or allows you to scratch an itch you have and, that notwithstanding, these patches are investigatory work rather than ready for production use. The idea that making a test suite would double the amount of work is a finger-in-the-air estimate, given the final form isn't mapped out and its scope estimated, and lands as a weak excuse for not making an effort.

If it takes twice as long because there are classes of bugs eliminated during the conception and design or there are issues highlighted by a test suite that subsequently are fixed, that's work to engineer a high-quality solution. If seeking comments from the mailing list changed the interfaces (it didn't between v2 and v3), then this work is investigative and in the open. But the patch seems to be settling on a final form -- and final bug count.

K3n.

statx() v3

Posted Dec 6, 2016 9:51 UTC (Tue) by NAR (subscriber, #1313) [Link] (1 responses)

Not to mention that sometimes writing the testcase reveals API-level problems (interface is clumsy, hard to use, very hard to test, misses something important, etc.).

statx() v3

Posted Dec 6, 2016 23:14 UTC (Tue) by k3ninho (subscriber, #50375) [Link]

>[S]ometimes writing the testcase reveals API-level problems (interface is clumsy, hard to use, very hard to test, misses something important, etc.).

That sounds right. What are we, the end-users of this patch, going to do? Take it and its interface when it's not right?!?

If it is going to take another 'n' months of labour to get it to the point where there's a list of sensible users of the interfaces in terms of their reasonable insights into the working system, and that comes with shaking out the issues with each and finally understanding the end-to-end of 'system has a problem with stat()' as well as 'userspace has a problem with stat()'...?

I can't imagine a world where 'release early and make no changes' is a better answer than 'release often and incrementally improve'.

Caveat: Even with a healthy set of git habits, rebasing remains something you have to stay on top of.

K3n.