Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracking Issue for str::{floor, ceil}_char_boundary #93743

Open
1 of 3 tasks
clarfonthey opened this issue Feb 7, 2022 · 9 comments
Open
1 of 3 tasks

Tracking Issue for str::{floor, ceil}_char_boundary #93743

clarfonthey opened this issue Feb 7, 2022 · 9 comments
Labels
C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@clarfonthey
Copy link
Contributor

clarfonthey commented Feb 7, 2022

Feature gate: #![feature(round_char_boundary)]

This is a tracking issue for str::{floor, ceil}_char_boundary.

Public API

impl str {
    // Returns the character boundary at or immediately before `index`
    fn floor_char_boundary(&self, index: usize) -> usize;

    // Returns the character boundary at or immediately after `index`
    fn ceil_char_boundary(&self, index: usize) -> usize;
}

Steps / History

Unresolved Questions

@clarfonthey clarfonthey added C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. labels Feb 7, 2022
@nagisa
Copy link
Member

nagisa commented Feb 12, 2022

Have methods that return str slices instead of indices been considered as an alternative? I didn't really see any discussion to that effect in the issue. Given that these indices are only really usable for slicing into the containing string these were calculated from, it isn't all that obvious to me if this is the right API, as opposed to split_at_???_char_boundary_{,mut}.

@clarfonthey
Copy link
Contributor Author

So, that would work for purely code-based splitting, but if a user were to perform more-complicated splitting by something like graphemes or words, they'd want to be able to look at the string both before and after the truncation point, not just at that point.

Since some cases required the index specifically, I decided to keep the API surface small. But adding extra variants that directly slice the string could be very useful too, like the ones you offered.

@Stargateur
Copy link
Contributor

I don't see why choice to panic instead of return Option<usize> or Result<usize, ()>

@clarfonthey
Copy link
Contributor Author

The main reason I chose that was that floor_char_boundary can never panic, and I wanted parity for the API with ceil_char_boundary. Do you have a case that would benefit from such an API?

@Stargateur
Copy link
Contributor

Stargateur commented Jul 20, 2022

One can panic the other not is not parity either. I advice to avoid panic when possible, while there is already a number of method that panic instead of return an Option or a Result, adding new item should probably follow better guide line, specially since Rust could be use in Kernel linux, having method that panic on primitive type would automatically ban them from use.

On this case I wonder if floor_char_boundary make sense, char_indices doesn't include the "last" boundary cause there is no character. I would advice to change method floor_char_boundary to return None when the index would result in len and remove panic of other method too using option instead.

Or something else that remove the panic.

@clarfonthey
Copy link
Contributor Author

So, the reason why one can panic and one can't is basically because of the fact that indexes can't go below zero, and so you can always round down to 0, which is always a valid char boundary. Here, the context is that if you want to limit the size of a string in bytes, you can "round down" from that to a valid index, such that you can slice up to that.

For ceil... the general idea, if I'm being honest, is to round "up" the recommended number of bytes. So, for example, if you want to limit to 1000 bytes, but are fine allowing up to 1004 to include the last character. I guess in this case, it would be okay to return len instead of panicking, but I don't know it this would make sense in all cases.

Definitely open to suggestions on what you think is best, but IMHO the flooring version should definitely never panic, and always return a valid index.

@angelorodem
Copy link

I think that removal of panic from this function is a must, the usage is to correctly find the bounds of a char and prevent panicking when doing string processing, it seems not good that a function to prevent that can indeed panic.

@clarfonthey
Copy link
Contributor Author

clarfonthey commented Jun 7, 2023

So, I was poking around the uses of ceil_char_boundary on GitHub: https://github.com/search?q=ceil_char_boundary+lang%253Arust

and am thinking that maybe the solution to a non-panicking version of ceil_char_boundary is to just return the length of the string if the provided index is out of bounds. The semantics feel a bit muddy, but this feels like the most logical way based upon how people are using it. What do people think about this idea?

Effectively, if you pass an index past the end of the string, it's treated as the end of the string, and rounds "up" to the end of the string.

@eddyb
Copy link
Member

eddyb commented Jul 13, 2023

So, that would work for purely code-based splitting, but if a user were to perform more-complicated splitting by something like graphemes or words, they'd want to be able to look at the string both before and after the truncation point, not just at that point.

Since some cases required the index specifically, I decided to keep the API surface small. But adding extra variants that directly slice the string could be very useful too, like the ones you offered.

I believe there is some confusion here. The "(split) index" is not lost during split_at.

s.split_at(i) returns (&s[..i], &s[i..]) and so:

  • s.split_at(i).0.len() == i
  • s.split_at(i).1.len() == s.len() - i

That is, it's strictly more information than the original i, and for the methods @nagisa suggested (which I'd call split_at_char_boundary_{before,after}), since you can keep the original s yourself, and get the split position by .0.len(), it's just more ergonomic to also have the two &str parts.


Also because {floor,ceil}_char_boundary already do all the checks split_at require, it should be 100% free to have split_at_char_boundary_{before,after} which return (&str, &str) instead of usize (with the first &str containing verbatim the usize of {floor,ceil}_char_boundary, as its length, as explained above).

bors added a commit to rust-lang-ci/rust that referenced this issue Aug 15, 2023
…boundary, r=m-ou-se

Don't panic in ceil_char_boundary

Implementing the alternative mentioned in this comment: rust-lang#93743 (comment)

Since `floor_char_boundary` will always work (rounding down to the length of the string is possible), it feels best for `ceil_char_boundary` to not panic either. However, the semantics of "rounding up" past the length of the string aren't very great, which is why the method originally panicked in these cases.

Taking into account how people are using this method, it feels best to simply return the end of the string in these cases, so that the result is still a valid char boundary.
github-actions bot pushed a commit to rust-lang/miri that referenced this issue Aug 16, 2023
… r=m-ou-se

Don't panic in ceil_char_boundary

Implementing the alternative mentioned in this comment: rust-lang/rust#93743 (comment)

Since `floor_char_boundary` will always work (rounding down to the length of the string is possible), it feels best for `ceil_char_boundary` to not panic either. However, the semantics of "rounding up" past the length of the string aren't very great, which is why the method originally panicked in these cases.

Taking into account how people are using this method, it feels best to simply return the end of the string in these cases, so that the result is still a valid char boundary.
thomcc pushed a commit to tcdi/postgrestd that referenced this issue Oct 17, 2023
… r=m-ou-se

Don't panic in ceil_char_boundary

Implementing the alternative mentioned in this comment: rust-lang/rust#93743 (comment)

Since `floor_char_boundary` will always work (rounding down to the length of the string is possible), it feels best for `ceil_char_boundary` to not panic either. However, the semantics of "rounding up" past the length of the string aren't very great, which is why the method originally panicked in these cases.

Taking into account how people are using this method, it feels best to simply return the end of the string in these cases, so that the result is still a valid char boundary.
lnicola pushed a commit to lnicola/rust-analyzer that referenced this issue Apr 7, 2024
… r=m-ou-se

Don't panic in ceil_char_boundary

Implementing the alternative mentioned in this comment: rust-lang/rust#93743 (comment)

Since `floor_char_boundary` will always work (rounding down to the length of the string is possible), it feels best for `ceil_char_boundary` to not panic either. However, the semantics of "rounding up" past the length of the string aren't very great, which is why the method originally panicked in these cases.

Taking into account how people are using this method, it feels best to simply return the end of the string in these cases, so that the result is still a valid char boundary.
RalfJung pushed a commit to RalfJung/rust-analyzer that referenced this issue Apr 27, 2024
… r=m-ou-se

Don't panic in ceil_char_boundary

Implementing the alternative mentioned in this comment: rust-lang/rust#93743 (comment)

Since `floor_char_boundary` will always work (rounding down to the length of the string is possible), it feels best for `ceil_char_boundary` to not panic either. However, the semantics of "rounding up" past the length of the string aren't very great, which is why the method originally panicked in these cases.

Taking into account how people are using this method, it feels best to simply return the end of the string in these cases, so that the result is still a valid char boundary.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

5 participants