Optimize core::str::Chars::count
#90414
Conversation
is there any reason to think that counting mixed-language strings would be something worth benchmarking?
An optimization PR without benchmarks isn't great. Different strings have different performance characteristics in certain implementations of this, and assuming English/ASCII is not good.
I entirely agree, but... ok, rereading the actual non-ASCII strings a little closer, I see they each have a few ASCII substrings (mostly "rust"), so my point is mostly moot.
So far the algorithms themselves look pretty good. Couple nits inline, but this seems like something I would be comfortable approving regardless.
I'll try to get to this next week, sorry for the delay
If SIMD can be used, my crate might be useful.
Realistically, it's hard to use SIMD in libcore at the moment. The portable-simd work will help, but only for functions which can be vectorized well using the baseline target settings that libcore is compiled with. Note that libcore cannot perform runtime feature detection (feature detection lives in `std`, not `core`).

I did an SSE2 SIMD version here when experimenting — https://github.com/thomcc/standalone-char-count-bench/blob/master/src/lib.rs#L215-L265 — but came to the conclusion that this was already pushing its luck in terms of complexity and amount of code, and having both would not be worth it.
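To make that constraint concrete, here is a rough sketch (not code from this PR or from the standard library; `char_count_avx2` is a hypothetical name) of the runtime dispatch an ordinary crate built on `std` could do, and which libcore cannot:

```rust
// Illustrative sketch only -- not code from this PR or from the standard
// library. `char_count_avx2` is a hypothetical routine, named purely to show
// where runtime dispatch would happen in an ordinary crate built on `std`.
#[cfg(target_arch = "x86_64")]
fn char_count(s: &str) -> usize {
    if std::is_x86_feature_detected!("avx2") {
        // return unsafe { char_count_avx2(s) }; // hypothetical SIMD path
    }
    // Portable fallback -- and the only kind of code libcore itself can
    // contain, since the detection macro above lives in `std`.
    s.chars().count()
}
```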
Ping from triage:
(Force-pushed from 7287a87 to 2bf129a)
Updated, rebased, and responded to comments. New benchmarks:
The story they tell seems relatively consistent: a very slight (1-2ns) hit for tiny strings, no real difference for small strings, and a significant improvement of around 4x-5x for medium sizes and above.
@rustbot ready
r=me after the nit is addressed. Not sure if you have bors powers, so adding a delegate flag. Once you've pushed a fix for the nit, add a comment with
@bors delegate+
✌️ @thomcc can now approve this pull request
📌 Commit 41f8214 has been approved by
I think this warrants a perf run before merging, because the compiler could be dealing with lots of tiny strings and this could potentially end up being a net negative. @bors try @rust-timer queue
🙅 Please do not
oh well, then at least @bors rollup=never
@bors rollup=never
Good point. I'll be interested to see the results. FWIW, I don't really think the difference here exists in real code. I only see it in rust-lang/rust (not the dummy crate I've been using to avoid having to go through slow stdlib builds), and even then it seems like the version inside libcore is consistently penalized somehow -- I feel like it's either some artifact of benchmarking, or my setup. Actually, now that I write that, I realize that I probably need to check my setup.

Of course, the perf run will reveal the ultimate truth -- there are a lot of reasons why these benchmarks could be unrealistic (real-world usage will have harder-to-predict branches, as just one example). Still, I'm glad this looks better for small inputs, since they can be somewhat common.
I wonder if we could build std and dependencies with incremental but the final benchmark crate in non-incremental mode.
⌛ Testing commit 41f8214 with merge e2e1d636dc37facc46d2ab58b0879f6092f5b24b... |
💥 Test timed out |
@bors retry
☀️ Test successful - checks-actions
Finished benchmarking commit (f624427): comparison url. Summary: This benchmark run shows 10 relevant improvements 🎉 but 2 relevant regressions 😿 to instruction counts.
If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf. Next steps: if you can justify the regressions found in this perf run, please indicate this with @rustbot label: +perf-regression
Seems like a larger and more consistent improvement than regression -- surprised we do this enough in the compiler to be visible at all. I actually do have some thoughts on how to improve it further for the small string case (specifically, the trick I used in
Since this is library code and can also be used by the crates being tested, it could also mean that the compiler spends a different amount of time handling this code rather than being directly affected by it. And then there's inlining/CGU/whatever perturbation noise.
I wrote this a while ago after seeing this function as a bottleneck in a profile, but never got around to contributing it. I saw it again, and so here it is. The implementation is fairly complex, but I tried to explain what's happening at both a high level (in the header comment for the file), and in line comments in the impl. Hopefully it's clear enough.
This implementation (`case00_cur_libcore` in the benchmarks below) is somewhat consistently around 4x to 5x faster than the old implementation (`case01_old_libcore` in the benchmarks below), for a wide variety of workloads, without regressing performance on any of the workload sizes I've tried.

I also improved the benchmarks for this code, so that they explicitly check text in different languages and of different sizes (err, the cross product of language x size). The results of the benchmarks are here:
Benchmark results
I also added fairly thorough tests for different sizes and alignments. These complete on my machine in 0.02s, which is surprising given how thorough they are, but they do seem able to catch bugs in the implementation when I introduce them deliberately. (I haven't run the tests on a 32-bit machine since before I reworked the code a little, though, so... hopefully I'm not about to embarrass myself.)
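As a rough illustration of what such a size/alignment sweep can look like (this is not the test added by the PR; the sample text and reference count below are invented for the sketch), one can compare `chars().count()` against a trivially correct byte-classification count across many slice lengths and starting offsets:

```rust
// Illustrative only; not the test code from this PR. The reference count
// classifies bytes directly (a byte starts a char iff it is not a UTF-8
// continuation byte), and the sweep varies both the starting offset
// (alignment) and the length of the slice under test.
#[test]
fn chars_count_matches_byte_classification() {
    let text = "rust Rust русский 日本語 🦀 ".repeat(32);
    let bytes = text.as_bytes();
    for start in 0..8usize {
        for end in start..bytes.len() {
            // Skip slices that do not fall on char boundaries.
            if let Ok(s) = core::str::from_utf8(&bytes[start..end]) {
                let reference = s.bytes().filter(|&b| (b as i8) >= -0x40).count();
                assert_eq!(s.chars().count(), reference);
            }
        }
    }
}
```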
This uses similar SWAR-style techniques to the `is_ascii` impl I contributed in #74066, so I'm going to request review from the same person who reviewed that one. That said, I'm not particularly picky, and might not have the correct syntax for requesting a review from someone (so it goes).

r? @nagisa
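To make the SWAR reference above concrete, here is a minimal sketch of the underlying idea. It is an illustration only, not the implementation in this PR (the real code is more involved), and the names `char_count_swar`, `repeat_byte`, and `non_continuation_count` are made up for the sketch. For valid UTF-8, the number of chars equals the number of bytes that are not continuation bytes (bytes of the form 0b10xxxxxx), and that count can be accumulated one usize-sized word at a time without any SIMD:

```rust
const WORD_BYTES: usize = core::mem::size_of::<usize>();

/// Broadcast a byte into every byte lane of a usize (0x01 -> 0x0101...01).
const fn repeat_byte(b: u8) -> usize {
    (usize::MAX / 255) * (b as usize)
}

/// Number of non-continuation bytes in one word.
fn non_continuation_count(word: usize) -> usize {
    // A continuation byte has bit 7 set and bit 6 clear.
    let bit7 = word & repeat_byte(0x80);
    let bit6 = (word << 1) & repeat_byte(0x80); // bit 6 shifted up into bit 7
    let cont = bit7 & !bit6;                    // 0x80 in each continuation lane
    let per_lane = cont >> 7;                   // 0x01 in each continuation lane
    // Horizontal sum of the lanes via multiplication: the top byte of the
    // product holds the sum (at most WORD_BYTES, so no lane can overflow).
    let continuation_bytes = per_lane.wrapping_mul(repeat_byte(0x01)) >> (usize::BITS - 8);
    WORD_BYTES - continuation_bytes
}

fn char_count_swar(s: &str) -> usize {
    let bytes = s.as_bytes();
    // Split into an unaligned head, aligned usize words, and a tail.
    // Safety: any byte pattern is a valid usize, so `align_to` is fine here.
    let (head, words, tail) = unsafe { bytes.align_to::<usize>() };
    // Scalar handling for the head and tail: count non-continuation bytes.
    let scalar = |chunk: &[u8]| chunk.iter().filter(|&&b| (b as i8) >= -0x40).count();
    let mut count = scalar(head) + scalar(tail);
    for &word in words {
        count += non_continuation_count(word);
    }
    count
}
```

A real implementation would likely also batch many words together so the horizontal-sum multiply is amortized, and take care not to regress tiny strings; that is roughly where the extra complexity mentioned above comes in.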