Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracking Issue for os_str_slice #118485

Open
2 of 4 tasks
blyxxyz opened this issue Nov 30, 2023 · 14 comments
Open
2 of 4 tasks

Tracking Issue for os_str_slice #118485

blyxxyz opened this issue Nov 30, 2023 · 14 comments
Labels
C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@blyxxyz
Copy link
Contributor

blyxxyz commented Nov 30, 2023

Feature gate: #![feature(os_str_slice)]

This is a tracking issue for an API for taking substrings of OsStr, which in combination with OsStr::as_encoded_bytes() would make it possible to implement most string operations in (portable) safe code.

Public API

impl OsStr {
    pub fn slice_encoded_bytes<R: ops::RangeBounds<usize>>(&self, range: R) -> &Self;
}

Steps / History

Unresolved Questions

  • Currently this enforces the same requirements on all platforms, but on Unix (and some other platforms) the internal encoding of OsStr is already fully specified to be arbitrary bytes by means of the OsStrExt trait. Should we:
    • Relax the requirements on these platforms,
    • Keep enforcing the lowest common denominator invariants, or
    • Use strict checks now, but leave the possibility of relaxing them later?

Footnotes

  1. https://std-dev-guide.rust-lang.org/feature-lifecycle/stabilization.html

@blyxxyz blyxxyz added C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. labels Nov 30, 2023
@epage
Copy link
Contributor

epage commented Nov 30, 2023

Currently this enforces the same requirements on all platforms, but on Unix (and some other platforms) the internal encoding of OsStr is already fully specified to be arbitrary bytes by means of the OsStrExt trait. Should we:

This was discussed somewhat in the ACP (rust-lang/libs-team#306). Copying over the parts I find relevant :)

Starting with "lowest common denominator invariants" can always be relaxed later, as we'd be switching cases from asserting to not-asserting.

The other encoded_bytes functions are only documented for their cross-platform nature and the conversions docs point people to OsStrExt for platform-specific assumptions. I think being consistent with those would be good

@epage
Copy link
Contributor

epage commented Nov 30, 2023

Looking over #118484, this looks like it'll have a lot of overhead to enforce the invariants when building higher level operations on top that would go away with native support for those operations. This is mostly an observation as I'm not sure what else we can do for now while pattern API support is at a stand-still.

@blyxxyz
Copy link
Contributor Author

blyxxyz commented Nov 30, 2023

There's a fast path for splitting on ASCII, which you'll always take when doing traditional options parsing1. My hunch is that that's the common case in general, that even if part of the string is Unicode or raw bytes you'll typically be hunting for some bit of ASCII syntax.

It can be made a lot faster on Windows (see #118484 (comment)), and I want to try that if I can get a local Windows environment set up. (EDIT: the wtf8 tests do run on Linux, maybe that's enough to work it out.)

If we relax the requirements for Unix then it just becomes a normal slicing operation there. That's one reason that feels attractive to me. Otherwise we have to keep doing something like the current validation.

The ASCII fast path can be made faster by skipping bounds checks and giving it its own function. Here's what I get by mucking around in compiler explorer:
image
(https://rust.godbolt.org/z/hx9r7nvhc)

But I don't know how far it's worth going, and if we choose relaxed checks then we can just get rid of it.

Footnotes

  1. Though PowerShell does support Unicode dashes.

@blyxxyz
Copy link
Contributor Author

blyxxyz commented Dec 1, 2023

With strict checks it would technically be optimal for user code to have a function like this:

use std::ffi::OsStr;

fn slice_os_str(s: &OsStr, start: usize, end: usize) -> &OsStr {
    #[cfg(all(target_vendor = "fortanix", target_env = "sgx"))]
    use std::os::fortanix_sgx::ffi::OsStrExt;
    #[cfg(target_os = "hermit")]
    use std::os::hermit::ffi::OsStrExt;
    #[cfg(target_os = "solid")]
    use std::os::solid::ffi::OsStrExt;
    #[cfg(unix)]
    use std::os::unix::ffi::OsStrExt;
    #[cfg(target_os = "wasi")]
    use std::os::wasi::ffi::OsStrExt;
    #[cfg(target_os = "xous")]
    use std::os::xous::ffi::OsStrExt;

    #[cfg(any(
        unix,
        target_os = "wasi",
        target_os = "hermit",
        all(target_vendor = "fortanix", target_env = "sgx"),
        target_os = "solid",
        target_os = "xous"
    ))]
    return OsStr::from_bytes(&s.as_bytes()[start..end]);

    #[cfg(not(any(
        unix,
        target_os = "wasi",
        target_os = "hermit",
        all(target_vendor = "fortanix", target_env = "sgx"),
        target_os = "solid",
        target_os = "xous"
    )))]
    return s.slice_encoded_bytes(start..end);
}

And that's a little sad.

from_encoded_bytes_unchecked() is unsafe and not even Miri can check that you're using it correctly, so it makes sense to be as straightforward and restrictive as possible there. But I don't think we need to have as many worries for this function.

So all in all I'm back to preferring to skip the check on these platforms.

@epage
Copy link
Contributor

epage commented Dec 1, 2023

Regarding performance, the question I asked myself is why we can't have a subset of the Pattern API that doesn't take OsStr patterns. Most of the times slice_encoded_bytes would be needed would be reduced by the Pattern API. From what I can tell from the Tracking Issue is that OsStr patterns is what held things up. I noticed that the ACP linked to #109350 which does exactly this.

@blyxxyz
Copy link
Contributor Author

blyxxyz commented Dec 1, 2023

My thought is that we should have such an API but that this method will be easier to get stabilized. It's a small and unopinionated MVP.

@epage
Copy link
Contributor

epage commented Dec 1, 2023

That doesn't make them mutually exclusive. My point was that for more performance critical code, we can look to #109350 while we can have slice_encoded_bytes uphold invariants (and users can always create your example slice_os_str above).

#109350 mirrors an API on one type into another, is related to an approved RFC, and trims down the biggest, most contentious part of that RFC. My hope is that it can be a relatively quick to stabilize API. It hasn't gotten much attention but I'm looking into that.

@BurntSushi
Copy link
Member

With respect to the restrictions, I am generally inclined toward @blyxxyz's point of view here where we shouldn't need them on Unix because its representation is already set in stone. With that said, keeping uniform restrictions does make the behavior easier to explain and reason about. (I think y'all mentioned that.) But I think most importantly for me anyway, if we start with uniform restrictions, we can always relax them later when we've got some solid use cases motivating us to do that. For std, I am generally inclined to the conservative posture because of our unique stability constraints.

bors added a commit to rust-lang-ci/rust that referenced this issue Dec 2, 2023
Add substring API for `OsStr`

This adds a method for taking a substring of an `OsStr`, which in combination with [`OsStr::as_encoded_bytes()`](https://doc.rust-lang.org/std/ffi/struct.OsStr.html#method.as_encoded_bytes) makes it possible to implement most string operations in safe code.

API:
```rust
impl OsStr {
    pub fn slice_encoded_bytes<R: ops::RangeBounds<usize>>(&self, range: R) -> &Self;
}
```
Motivation, examples and research at rust-lang/libs-team#306.

Tracking issue: rust-lang#118485

cc `@epage`
r? libs-api
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Feb 18, 2024
…ark-Simulacrum

Move `OsStr::slice_encoded_bytes` validation to platform modules

This delegates OS string slicing (`OsStr::slice_encoded_bytes`) validation to the underlying platform implementation. For now that results in increased performance and better error messages on Windows without any changes to semantics. In the future we may want to provide different semantics for different platforms.

The existing implementation is still used on Unix and most other platforms and is now optimized a little better.

Tracking issue: rust-lang#118485

cc `@epage,` `@BurntSushi`
rust-timer added a commit to rust-lang-ci/rust that referenced this issue Feb 18, 2024
Rollup merge of rust-lang#118569 - blyxxyz:platform-os-str-slice, r=Mark-Simulacrum

Move `OsStr::slice_encoded_bytes` validation to platform modules

This delegates OS string slicing (`OsStr::slice_encoded_bytes`) validation to the underlying platform implementation. For now that results in increased performance and better error messages on Windows without any changes to semantics. In the future we may want to provide different semantics for different platforms.

The existing implementation is still used on Unix and most other platforms and is now optimized a little better.

Tracking issue: rust-lang#118485

cc `@epage,` `@BurntSushi`
@clarfonthey
Copy link
Contributor

Was poking around OsStr's existing methods and found that this appears to be the only one that really allows a safe alternative to from_encoded_bytes_unchecked.

One thing that feels obvious as an alternative here: why is there no from_encoded_bytes that is checked? Example of a use case here:

For a project of mine, I'm looking to try and use OsStr instead of Vec<u8> to represent environment variables to enable Windows support, and it seems like a good path forward is to use as_encoded_bytes. However, there's not a really good way to do splitting by character without unsafe code. I'd like to convert to encoded bytes, split on characters like : or ; to split up PATH variables, and then merge them back together as OsStrs, but I would like to do so with a checked version instead of an unchecked version for safety.

This could maybe also be proposed as an alternative to this function, to allow for better safe code that relies on this method. It's effectively the same as processing any other UTF-8 string as bytes and then recombining the pieces into valid strings, except it's a superset of UTF-8 instead. So, it feels like this would be a reasonable alternative.

@blyxxyz
Copy link
Contributor Author

blyxxyz commented Jun 19, 2024

I covered from_encoded_bytes in the ACP:

A checked, safe from_encoded_bytes(&[u8]) method. This is a natural counterpart to from_encoded_bytes_unchecked(&[u8]). But:

  • It would encourage relying on the unspecified internal encoding and using it for interchange.

  • It would require validating the entire string, which may take O(n) time, while a slicing method can run in O(1) time.

  • It would require implementing a WTF-8 validator, which to my knowledge doesn't exist yet. (The original wtf8 crate deliberately does not implement this.)

OsStr was originally designed to encapsulate the internal encoding as much as possible, you couldn't really touch it. as_encoded_bytes() and from_encoded_bytes_unchecked() broke that, and from_encoded_bytes() would break it even further, which may or may not be acceptable. Most instances of as_encoded_bytes() in the wild look like they're abusing it unfortunately.

It might still be reasonable but slice_encoded_bytes() seemed less problematic.

@clarfonthey
Copy link
Contributor

I mean, that's extremely fair. I hadn't fully read the ACP before commenting, so, I probably should have done that.

I guess that my main aversion here is that this API essentially relies on doing something that Rust really dislikes, which is keeping track of a bunch of indices into a string instead of just splitting the string itself. Although I suppose that you could just slice twice to split the string and keep doing that as you go, although it would be awkward. Will have to mess around and see if the code is alright.

@blyxxyz
Copy link
Contributor Author

blyxxyz commented Jun 20, 2024

It is a little clunky but I think not fatally so.

Tracking indices:

let path = OsStr::new("foo:bar:baz");
let mut parts = Vec::new();

let mut idx = 0;
for part in path.as_encoded_bytes().split(|&b| b == b':') {
    parts.push(path.slice_encoded_bytes(idx..idx + part.len()));
    idx += part.len() + 1;
}

Slicing as you go:

let mut rest = path;
while let Some(idx) = rest.as_encoded_bytes().iter().position(|&b| b == b':') {
    parts.push(rest.slice_encoded_bytes(..idx));
    rest = rest.slice_encoded_bytes(idx + 1..);
}
parts.push(rest);

If you need these operations often you can write a helper like clap_lex does: https://github.com/clap-rs/clap/blob/9c1153d6b4b68de72249693ec6b75497d4ce9793/clap_lex/src/ext.rs
AFAIK you can implement basically the whole string API on top of OsStr::as_encoded_bytes/OsStr::slice_encoded_bytes/OsString::push.

I'm not sure about the balance between using this directly and using it as a safe primitive for a fully-featured helper. I plan to experiment on uutils as a case study.


An API along the lines of slice::utf8_chunks() is another possibility. I suspect it'd be awkward for the typical OsStr use cases but I haven't seen it explored.

@Skgland
Copy link
Contributor

Skgland commented Jun 20, 2024

For a project of mine, I'm looking to try and use OsStr instead of Vec<u8> to represent environment variables to enable Windows support, and it seems like a good path forward is to use as_encoded_bytes. However, there's not a really good way to do splitting by character without unsafe code. I'd like to convert to encoded bytes, split on characters like : or ; to split up PATH variables, and then merge them back together as OsStrs, but I would like to do so with a checked version instead of an unchecked version for safety.

Not sure if this covers all your use cases, but are you aware of std::env::split_paths and std::env::join_paths, which perform PATH splitting and joining.

@clarfonthey
Copy link
Contributor

For a project of mine, I'm looking to try and use OsStr instead of Vec<u8> to represent environment variables to enable Windows support, and it seems like a good path forward is to use as_encoded_bytes. However, there's not a really good way to do splitting by character without unsafe code. I'd like to convert to encoded bytes, split on characters like : or ; to split up PATH variables, and then merge them back together as OsStrs, but I would like to do so with a checked version instead of an unchecked version for safety.

Not sure if this covers all your use cases, but are you aware of std::env::split_paths and std::env::join_paths, which perform PATH splitting and joining.

Wasn't aware, but those only support PATH and there are many other path variables that use other conventions. And my main point still stands of wanting a good way to easily deal with these without relying on unsafe code.

That said, the example @blyxxyz gave does look like it works well here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

5 participants