Tracking Issue for `os_str_slice` #118485

blyxxyz · 2023-11-30T19:02:14Z

Feature gate: #![feature(os_str_slice)]

This is a tracking issue for an API for taking substrings of OsStr, which in combination with OsStr::as_encoded_bytes() would make it possible to implement most string operations in (portable) safe code.

Public API

impl OsStr {
    pub fn slice_encoded_bytes<R: ops::RangeBounds<usize>>(&self, range: R) -> &Self;
}

Steps / History

ACP: ACP: A substring API for OsStr libs-team#306
Implementation: Add substring API for OsStr #118484, Move OsStr::slice_encoded_bytes validation to platform modules #118569
Final comment period (FCP)¹
Stabilization PR

Unresolved Questions

Currently this enforces the same requirements on all platforms, but on Unix (and some other platforms) the internal encoding of OsStr is already fully specified to be arbitrary bytes by means of the OsStrExt trait. Should we:
- Relax the requirements on these platforms,
- Keep enforcing the lowest common denominator invariants, or
- Use strict checks now, but leave the possibility of relaxing them later?

https://std-dev-guide.rust-lang.org/feature-lifecycle/stabilization.html ↩

The text was updated successfully, but these errors were encountered:

epage · 2023-11-30T19:29:00Z

Currently this enforces the same requirements on all platforms, but on Unix (and some other platforms) the internal encoding of OsStr is already fully specified to be arbitrary bytes by means of the OsStrExt trait. Should we:

This was discussed somewhat in the ACP (rust-lang/libs-team#306). Copying over the parts I find relevant :)

Starting with "lowest common denominator invariants" can always be relaxed later, as we'd be switching cases from asserting to not-asserting.

The other encoded_bytes functions are only documented for their cross-platform nature and the conversions docs point people to OsStrExt for platform-specific assumptions. I think being consistent with those would be good

epage · 2023-11-30T19:48:44Z

Looking over #118484, this looks like it'll have a lot of overhead to enforce the invariants when building higher level operations on top that would go away with native support for those operations. This is mostly an observation as I'm not sure what else we can do for now while pattern API support is at a stand-still.

blyxxyz · 2023-11-30T20:34:00Z

There's a fast path for splitting on ASCII, which you'll always take when doing traditional options parsing¹. My hunch is that that's the common case in general, that even if part of the string is Unicode or raw bytes you'll typically be hunting for some bit of ASCII syntax.

It can be made a lot faster on Windows (see #118484 (comment)), and I want to try that if I can get a local Windows environment set up. (EDIT: the wtf8 tests do run on Linux, maybe that's enough to work it out.)

If we relax the requirements for Unix then it just becomes a normal slicing operation there. That's one reason that feels attractive to me. Otherwise we have to keep doing something like the current validation.

The ASCII fast path can be made faster by skipping bounds checks and giving it its own function. Here's what I get by mucking around in compiler explorer:

(https://rust.godbolt.org/z/hx9r7nvhc)

But I don't know how far it's worth going, and if we choose relaxed checks then we can just get rid of it.

Though PowerShell does support Unicode dashes. ↩

blyxxyz · 2023-12-01T07:55:19Z

With strict checks it would technically be optimal for user code to have a function like this:

use std::ffi::OsStr;

fn slice_os_str(s: &OsStr, start: usize, end: usize) -> &OsStr {
    #[cfg(all(target_vendor = "fortanix", target_env = "sgx"))]
    use std::os::fortanix_sgx::ffi::OsStrExt;
    #[cfg(target_os = "hermit")]
    use std::os::hermit::ffi::OsStrExt;
    #[cfg(target_os = "solid")]
    use std::os::solid::ffi::OsStrExt;
    #[cfg(unix)]
    use std::os::unix::ffi::OsStrExt;
    #[cfg(target_os = "wasi")]
    use std::os::wasi::ffi::OsStrExt;
    #[cfg(target_os = "xous")]
    use std::os::xous::ffi::OsStrExt;

    #[cfg(any(
        unix,
        target_os = "wasi",
        target_os = "hermit",
        all(target_vendor = "fortanix", target_env = "sgx"),
        target_os = "solid",
        target_os = "xous"
    ))]
    return OsStr::from_bytes(&s.as_bytes()[start..end]);

    #[cfg(not(any(
        unix,
        target_os = "wasi",
        target_os = "hermit",
        all(target_vendor = "fortanix", target_env = "sgx"),
        target_os = "solid",
        target_os = "xous"
    )))]
    return s.slice_encoded_bytes(start..end);
}

And that's a little sad.

from_encoded_bytes_unchecked() is unsafe and not even Miri can check that you're using it correctly, so it makes sense to be as straightforward and restrictive as possible there. But I don't think we need to have as many worries for this function.

So all in all I'm back to preferring to skip the check on these platforms.

epage · 2023-12-01T14:24:30Z

Regarding performance, the question I asked myself is why we can't have a subset of the Pattern API that doesn't take OsStr patterns. Most of the times slice_encoded_bytes would be needed would be reduced by the Pattern API. From what I can tell from the Tracking Issue is that OsStr patterns is what held things up. I noticed that the ACP linked to #109350 which does exactly this.

blyxxyz · 2023-12-01T16:46:09Z

My thought is that we should have such an API but that this method will be easier to get stabilized. It's a small and unopinionated MVP.

epage · 2023-12-01T16:59:11Z

That doesn't make them mutually exclusive. My point was that for more performance critical code, we can look to #109350 while we can have slice_encoded_bytes uphold invariants (and users can always create your example slice_os_str above).

#109350 mirrors an API on one type into another, is related to an approved RFC, and trims down the biggest, most contentious part of that RFC. My hope is that it can be a relatively quick to stabilize API. It hasn't gotten much attention but I'm looking into that.

BurntSushi · 2023-12-02T13:30:06Z

With respect to the restrictions, I am generally inclined toward @blyxxyz's point of view here where we shouldn't need them on Unix because its representation is already set in stone. With that said, keeping uniform restrictions does make the behavior easier to explain and reason about. (I think y'all mentioned that.) But I think most importantly for me anyway, if we start with uniform restrictions, we can always relax them later when we've got some solid use cases motivating us to do that. For std, I am generally inclined to the conservative posture because of our unique stability constraints.

Add substring API for `OsStr` This adds a method for taking a substring of an `OsStr`, which in combination with [`OsStr::as_encoded_bytes()`](https://doc.rust-lang.org/std/ffi/struct.OsStr.html#method.as_encoded_bytes) makes it possible to implement most string operations in safe code. API: ```rust impl OsStr { pub fn slice_encoded_bytes<R: ops::RangeBounds<usize>>(&self, range: R) -> &Self; } ``` Motivation, examples and research at rust-lang/libs-team#306. Tracking issue: rust-lang#118485 cc `@epage` r? libs-api

…ark-Simulacrum Move `OsStr::slice_encoded_bytes` validation to platform modules This delegates OS string slicing (`OsStr::slice_encoded_bytes`) validation to the underlying platform implementation. For now that results in increased performance and better error messages on Windows without any changes to semantics. In the future we may want to provide different semantics for different platforms. The existing implementation is still used on Unix and most other platforms and is now optimized a little better. Tracking issue: rust-lang#118485 cc `@epage,` `@BurntSushi`

Rollup merge of rust-lang#118569 - blyxxyz:platform-os-str-slice, r=Mark-Simulacrum Move `OsStr::slice_encoded_bytes` validation to platform modules This delegates OS string slicing (`OsStr::slice_encoded_bytes`) validation to the underlying platform implementation. For now that results in increased performance and better error messages on Windows without any changes to semantics. In the future we may want to provide different semantics for different platforms. The existing implementation is still used on Unix and most other platforms and is now optimized a little better. Tracking issue: rust-lang#118485 cc `@epage,` `@BurntSushi`

clarfonthey · 2024-06-19T21:21:11Z

Was poking around OsStr's existing methods and found that this appears to be the only one that really allows a safe alternative to from_encoded_bytes_unchecked.

One thing that feels obvious as an alternative here: why is there no from_encoded_bytes that is checked? Example of a use case here:

For a project of mine, I'm looking to try and use OsStr instead of Vec<u8> to represent environment variables to enable Windows support, and it seems like a good path forward is to use as_encoded_bytes. However, there's not a really good way to do splitting by character without unsafe code. I'd like to convert to encoded bytes, split on characters like : or ; to split up PATH variables, and then merge them back together as OsStrs, but I would like to do so with a checked version instead of an unchecked version for safety.

This could maybe also be proposed as an alternative to this function, to allow for better safe code that relies on this method. It's effectively the same as processing any other UTF-8 string as bytes and then recombining the pieces into valid strings, except it's a superset of UTF-8 instead. So, it feels like this would be a reasonable alternative.

blyxxyz · 2024-06-19T21:47:33Z

I covered from_encoded_bytes in the ACP:

A checked, safe from_encoded_bytes(&[u8]) method. This is a natural counterpart to from_encoded_bytes_unchecked(&[u8]). But:

It would encourage relying on the unspecified internal encoding and using it for interchange.

It would require validating the entire string, which may take O(n) time, while a slicing method can run in O(1) time.

It would require implementing a WTF-8 validator, which to my knowledge doesn't exist yet. (The original wtf8 crate deliberately does not implement this.)

OsStr was originally designed to encapsulate the internal encoding as much as possible, you couldn't really touch it. as_encoded_bytes() and from_encoded_bytes_unchecked() broke that, and from_encoded_bytes() would break it even further, which may or may not be acceptable. Most instances of as_encoded_bytes() in the wild look like they're abusing it unfortunately.

It might still be reasonable but slice_encoded_bytes() seemed less problematic.

clarfonthey · 2024-06-20T00:11:17Z

I mean, that's extremely fair. I hadn't fully read the ACP before commenting, so, I probably should have done that.

I guess that my main aversion here is that this API essentially relies on doing something that Rust really dislikes, which is keeping track of a bunch of indices into a string instead of just splitting the string itself. Although I suppose that you could just slice twice to split the string and keep doing that as you go, although it would be awkward. Will have to mess around and see if the code is alright.

blyxxyz · 2024-06-20T15:31:34Z

It is a little clunky but I think not fatally so.

Tracking indices:

let path = OsStr::new("foo:bar:baz");
let mut parts = Vec::new();

let mut idx = 0;
for part in path.as_encoded_bytes().split(|&b| b == b':') {
    parts.push(path.slice_encoded_bytes(idx..idx + part.len()));
    idx += part.len() + 1;
}

Slicing as you go:

let mut rest = path;
while let Some(idx) = rest.as_encoded_bytes().iter().position(|&b| b == b':') {
    parts.push(rest.slice_encoded_bytes(..idx));
    rest = rest.slice_encoded_bytes(idx + 1..);
}
parts.push(rest);

If you need these operations often you can write a helper like clap_lex does: https://github.com/clap-rs/clap/blob/9c1153d6b4b68de72249693ec6b75497d4ce9793/clap_lex/src/ext.rs
AFAIK you can implement basically the whole string API on top of OsStr::as_encoded_bytes/OsStr::slice_encoded_bytes/OsString::push.

I'm not sure about the balance between using this directly and using it as a safe primitive for a fully-featured helper. I plan to experiment on uutils as a case study.

An API along the lines of slice::utf8_chunks() is another possibility. I suspect it'd be awkward for the typical OsStr use cases but I haven't seen it explored.

Skgland · 2024-06-20T18:03:24Z

For a project of mine, I'm looking to try and use OsStr instead of Vec<u8> to represent environment variables to enable Windows support, and it seems like a good path forward is to use as_encoded_bytes. However, there's not a really good way to do splitting by character without unsafe code. I'd like to convert to encoded bytes, split on characters like : or ; to split up PATH variables, and then merge them back together as OsStrs, but I would like to do so with a checked version instead of an unchecked version for safety.

Not sure if this covers all your use cases, but are you aware of std::env::split_paths and std::env::join_paths, which perform PATH splitting and joining.

clarfonthey · 2024-06-21T06:51:32Z

For a project of mine, I'm looking to try and use OsStr instead of Vec<u8> to represent environment variables to enable Windows support, and it seems like a good path forward is to use as_encoded_bytes. However, there's not a really good way to do splitting by character without unsafe code. I'd like to convert to encoded bytes, split on characters like : or ; to split up PATH variables, and then merge them back together as OsStrs, but I would like to do so with a checked version instead of an unchecked version for safety.

Not sure if this covers all your use cases, but are you aware of std::env::split_paths and std::env::join_paths, which perform PATH splitting and joining.

Wasn't aware, but those only support PATH and there are many other path variables that use other conventions. And my main point still stands of wanting a good way to easily deal with these without relying on unsafe code.

That said, the example @blyxxyz gave does look like it works well here.

blyxxyz added C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. labels Nov 30, 2023

blyxxyz mentioned this issue Nov 30, 2023

Add substring API for OsStr #118484

Merged

blyxxyz mentioned this issue Dec 3, 2023

Move OsStr::slice_encoded_bytes validation to platform modules #118569

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tracking Issue for `os_str_slice` #118485

Tracking Issue for `os_str_slice` #118485

blyxxyz commented Nov 30, 2023 •

edited

Loading

epage commented Nov 30, 2023

epage commented Nov 30, 2023

blyxxyz commented Nov 30, 2023 •

edited

Loading

blyxxyz commented Dec 1, 2023 •

edited

Loading

epage commented Dec 1, 2023

blyxxyz commented Dec 1, 2023

epage commented Dec 1, 2023

BurntSushi commented Dec 2, 2023

clarfonthey commented Jun 19, 2024

blyxxyz commented Jun 19, 2024

clarfonthey commented Jun 20, 2024

blyxxyz commented Jun 20, 2024

Skgland commented Jun 20, 2024 •

edited

Loading

clarfonthey commented Jun 21, 2024

Tracking Issue for os_str_slice #118485

Tracking Issue for os_str_slice #118485

Comments

blyxxyz commented Nov 30, 2023 • edited Loading

Public API

Steps / History

Unresolved Questions

Footnotes

epage commented Nov 30, 2023

epage commented Nov 30, 2023

blyxxyz commented Nov 30, 2023 • edited Loading

Footnotes

blyxxyz commented Dec 1, 2023 • edited Loading

epage commented Dec 1, 2023

blyxxyz commented Dec 1, 2023

epage commented Dec 1, 2023

BurntSushi commented Dec 2, 2023

clarfonthey commented Jun 19, 2024

blyxxyz commented Jun 19, 2024

clarfonthey commented Jun 20, 2024

blyxxyz commented Jun 20, 2024

Skgland commented Jun 20, 2024 • edited Loading

clarfonthey commented Jun 21, 2024

Tracking Issue for `os_str_slice` #118485

Tracking Issue for `os_str_slice` #118485

blyxxyz commented Nov 30, 2023 •

edited

Loading

blyxxyz commented Nov 30, 2023 •

edited

Loading

blyxxyz commented Dec 1, 2023 •

edited

Loading

Skgland commented Jun 20, 2024 •

edited

Loading