Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize core::str::Chars::count #90414

Merged
merged 5 commits into from
Feb 6, 2022
Merged

Conversation

thomcc
Copy link
Member

@thomcc thomcc commented Oct 30, 2021

I wrote this a while ago after seeing this function as a bottleneck in a profile, but never got around to contributing it. I saw it again, and so here it is. The implementation is fairly complex, but I tried to explain what's happening at both a high level (in the header comment for the file), and in line comments in the impl. Hopefully it's clear enough.

This implementation (case00_cur_libcore in the benchmarks below) is somewhat consistently around 4x to 5x faster than the old implementation (case01_old_libcore in the benchmarks below), for a wide variety of workloads, without regressing performance on any of the workload sizes I've tried.

I also improved the benchmarks for this code, so that they explicitly check text in different languages and of different sizes (err, the cross product of language x size). The results of the benchmarks are here:

Benchmark results
test str::char_count::emoji_huge::case00_cur_libcore       ... bench:      20,216 ns/iter (+/- 3,673) = 17931 MB/s
test str::char_count::emoji_huge::case01_old_libcore       ... bench:     108,851 ns/iter (+/- 12,777) = 3330 MB/s
test str::char_count::emoji_huge::case02_iter_increment    ... bench:     329,502 ns/iter (+/- 4,163) = 1100 MB/s
test str::char_count::emoji_huge::case03_manual_char_len   ... bench:     223,333 ns/iter (+/- 14,167) = 1623 MB/s
test str::char_count::emoji_large::case00_cur_libcore      ... bench:         293 ns/iter (+/- 6) = 19331 MB/s
test str::char_count::emoji_large::case01_old_libcore      ... bench:       1,681 ns/iter (+/- 28) = 3369 MB/s
test str::char_count::emoji_large::case02_iter_increment   ... bench:       5,166 ns/iter (+/- 85) = 1096 MB/s
test str::char_count::emoji_large::case03_manual_char_len  ... bench:       3,476 ns/iter (+/- 62) = 1629 MB/s
test str::char_count::emoji_medium::case00_cur_libcore     ... bench:          48 ns/iter (+/- 0) = 14750 MB/s
test str::char_count::emoji_medium::case01_old_libcore     ... bench:         217 ns/iter (+/- 4) = 3262 MB/s
test str::char_count::emoji_medium::case02_iter_increment  ... bench:         642 ns/iter (+/- 7) = 1102 MB/s
test str::char_count::emoji_medium::case03_manual_char_len ... bench:         445 ns/iter (+/- 3) = 1591 MB/s
test str::char_count::emoji_small::case00_cur_libcore      ... bench:          18 ns/iter (+/- 0) = 3777 MB/s
test str::char_count::emoji_small::case01_old_libcore      ... bench:          23 ns/iter (+/- 0) = 2956 MB/s
test str::char_count::emoji_small::case02_iter_increment   ... bench:          66 ns/iter (+/- 2) = 1030 MB/s
test str::char_count::emoji_small::case03_manual_char_len  ... bench:          29 ns/iter (+/- 1) = 2344 MB/s
test str::char_count::en_huge::case00_cur_libcore          ... bench:      25,909 ns/iter (+/- 39,260) = 13299 MB/s
test str::char_count::en_huge::case01_old_libcore          ... bench:     102,887 ns/iter (+/- 3,257) = 3349 MB/s
test str::char_count::en_huge::case02_iter_increment       ... bench:     166,370 ns/iter (+/- 12,439) = 2071 MB/s
test str::char_count::en_huge::case03_manual_char_len      ... bench:     166,332 ns/iter (+/- 4,262) = 2071 MB/s
test str::char_count::en_large::case00_cur_libcore         ... bench:         281 ns/iter (+/- 6) = 19160 MB/s
test str::char_count::en_large::case01_old_libcore         ... bench:       1,598 ns/iter (+/- 19) = 3369 MB/s
test str::char_count::en_large::case02_iter_increment      ... bench:       2,598 ns/iter (+/- 167) = 2072 MB/s
test str::char_count::en_large::case03_manual_char_len     ... bench:       2,578 ns/iter (+/- 55) = 2088 MB/s
test str::char_count::en_medium::case00_cur_libcore        ... bench:          44 ns/iter (+/- 1) = 15295 MB/s
test str::char_count::en_medium::case01_old_libcore        ... bench:         201 ns/iter (+/- 51) = 3348 MB/s
test str::char_count::en_medium::case02_iter_increment     ... bench:         322 ns/iter (+/- 40) = 2090 MB/s
test str::char_count::en_medium::case03_manual_char_len    ... bench:         319 ns/iter (+/- 5) = 2109 MB/s
test str::char_count::en_small::case00_cur_libcore         ... bench:          15 ns/iter (+/- 0) = 2333 MB/s
test str::char_count::en_small::case01_old_libcore         ... bench:          14 ns/iter (+/- 0) = 2500 MB/s
test str::char_count::en_small::case02_iter_increment      ... bench:          30 ns/iter (+/- 1) = 1166 MB/s
test str::char_count::en_small::case03_manual_char_len     ... bench:          30 ns/iter (+/- 1) = 1166 MB/s
test str::char_count::ru_huge::case00_cur_libcore          ... bench:      16,439 ns/iter (+/- 3,105) = 19777 MB/s
test str::char_count::ru_huge::case01_old_libcore          ... bench:      89,480 ns/iter (+/- 2,555) = 3633 MB/s
test str::char_count::ru_huge::case02_iter_increment       ... bench:     217,703 ns/iter (+/- 22,185) = 1493 MB/s
test str::char_count::ru_huge::case03_manual_char_len      ... bench:     157,330 ns/iter (+/- 19,188) = 2066 MB/s
test str::char_count::ru_large::case00_cur_libcore         ... bench:         243 ns/iter (+/- 6) = 20905 MB/s
test str::char_count::ru_large::case01_old_libcore         ... bench:       1,384 ns/iter (+/- 51) = 3670 MB/s
test str::char_count::ru_large::case02_iter_increment      ... bench:       3,381 ns/iter (+/- 543) = 1502 MB/s
test str::char_count::ru_large::case03_manual_char_len     ... bench:       2,423 ns/iter (+/- 429) = 2096 MB/s
test str::char_count::ru_medium::case00_cur_libcore        ... bench:          42 ns/iter (+/- 1) = 15119 MB/s
test str::char_count::ru_medium::case01_old_libcore        ... bench:         180 ns/iter (+/- 4) = 3527 MB/s
test str::char_count::ru_medium::case02_iter_increment     ... bench:         402 ns/iter (+/- 45) = 1579 MB/s
test str::char_count::ru_medium::case03_manual_char_len    ... bench:         280 ns/iter (+/- 29) = 2267 MB/s
test str::char_count::ru_small::case00_cur_libcore         ... bench:          12 ns/iter (+/- 0) = 2666 MB/s
test str::char_count::ru_small::case01_old_libcore         ... bench:          12 ns/iter (+/- 0) = 2666 MB/s
test str::char_count::ru_small::case02_iter_increment      ... bench:          19 ns/iter (+/- 0) = 1684 MB/s
test str::char_count::ru_small::case03_manual_char_len     ... bench:          14 ns/iter (+/- 1) = 2285 MB/s
test str::char_count::zh_huge::case00_cur_libcore          ... bench:      15,053 ns/iter (+/- 2,640) = 20067 MB/s
test str::char_count::zh_huge::case01_old_libcore          ... bench:      82,622 ns/iter (+/- 3,602) = 3656 MB/s
test str::char_count::zh_huge::case02_iter_increment       ... bench:     230,456 ns/iter (+/- 7,246) = 1310 MB/s
test str::char_count::zh_huge::case03_manual_char_len      ... bench:     220,595 ns/iter (+/- 11,624) = 1369 MB/s
test str::char_count::zh_large::case00_cur_libcore         ... bench:         227 ns/iter (+/- 65) = 20792 MB/s
test str::char_count::zh_large::case01_old_libcore         ... bench:       1,136 ns/iter (+/- 144) = 4154 MB/s
test str::char_count::zh_large::case02_iter_increment      ... bench:       3,147 ns/iter (+/- 253) = 1499 MB/s
test str::char_count::zh_large::case03_manual_char_len     ... bench:       2,993 ns/iter (+/- 400) = 1577 MB/s
test str::char_count::zh_medium::case00_cur_libcore        ... bench:          36 ns/iter (+/- 5) = 16388 MB/s
test str::char_count::zh_medium::case01_old_libcore        ... bench:         142 ns/iter (+/- 18) = 4154 MB/s
test str::char_count::zh_medium::case02_iter_increment     ... bench:         379 ns/iter (+/- 37) = 1556 MB/s
test str::char_count::zh_medium::case03_manual_char_len    ... bench:         364 ns/iter (+/- 51) = 1620 MB/s
test str::char_count::zh_small::case00_cur_libcore         ... bench:          11 ns/iter (+/- 1) = 3000 MB/s
test str::char_count::zh_small::case01_old_libcore         ... bench:          11 ns/iter (+/- 1) = 3000 MB/s
test str::char_count::zh_small::case02_iter_increment      ... bench:          20 ns/iter (+/- 3) = 1650 MB/s

I also added fairly thorough tests for different sizes and alignments. This completes on my machine in 0.02s, which is surprising given how thorough they are, but it seems to detect bugs in the implementation. (I haven't run the tests on a 32 bit machine yet since before I reworked the code a little though, so... hopefully I'm not about to embarrass myself).

This uses similar SWAR-style techniques to the is_ascii impl I contributed in #74066, so I'm going to request review from the same person who reviewed that one. That said am not particularly picky, and might not have the correct syntax for requesting a review from someone (so it goes).

r? @nagisa

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Oct 30, 2021
@fbstj
Copy link
Contributor

fbstj commented Oct 31, 2021

is there any reason to think that counting mixed-language strings would be something worth benchmarking?

@thomcc
Copy link
Member Author

thomcc commented Oct 31, 2021

An optimization PR without benchmarks isn’t great. Different strings have different performance characteristics in certain implementations of this, and assuming english/ascii is not good.

@fbstj
Copy link
Contributor

fbstj commented Oct 31, 2021

I entirely agree but-... ok rereading the actual non-ASCII strings a little closer I see they each have a few ascii substrings (mostly "rust") so my point is mostly moot

@apiraino apiraino added the T-libs Relevant to the library team, which will review and decide on the PR/issue. label Nov 11, 2021
Copy link
Member

@nagisa nagisa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So far the algorithms themselves look pretty good. Couple nits inline, but this seems like something I would be comfortable approving regardless.

library/alloc/tests/str.rs Show resolved Hide resolved
library/core/benches/str/char_count.rs Outdated Show resolved Hide resolved
library/core/src/str/count.rs Outdated Show resolved Hide resolved
library/core/src/str/count.rs Outdated Show resolved Hide resolved
@nagisa nagisa added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Nov 19, 2021
Copy link
Member Author

@thomcc thomcc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try to get to this next week, sorry for the delay

library/core/src/str/count.rs Outdated Show resolved Hide resolved
library/core/src/str/count.rs Outdated Show resolved Hide resolved
@aobatact
Copy link

aobatact commented Dec 8, 2021

If simd can be used, my crate might be useful.
https://github.com/aobatact/faster_chars_count

@thomcc
Copy link
Member Author

thomcc commented Dec 8, 2021

Realistically it's hard to use SIMD in libcore at the moment. the portable-simd work will help, but only for functions which can be vectorized well using the baseline target settings that libcore is compiled with.

Note that libcore cannot perform runtime feature detection (feature detection is in std:: and probably cannot be moved to core for various reasons), gets compiled with baseline features for the target, and only gets recompiled if the user uses -Zbuild-std.

I did a SSE21 SIMD version here when experimenting — https://github.com/thomcc/standalone-char-count-bench/blob/master/src/lib.rs#L215-L265, but came to the conclusion that this was already pushing its luck in terms of complexity and amount of code, and having both would not be worth it.

Footnotes

  1. Realistically we can't use anything later than SSE2 for the reasons mentioned above.

@JohnCSimon
Copy link
Member

Ping from triage:
@thomcc - can you please post your status on this PR? No posts have been made since early December.

@thomcc
Copy link
Member Author

thomcc commented Feb 1, 2022

Updated, rebased, responded to comments. New benchmarks1:

test str::char_count::emoji_huge::case00_libcore                   ... bench:      17,974 ns/iter (+/- 3,188) = 20167 MB/s
test str::char_count::emoji_huge::case01_filter_count_cont_bytes   ... bench:      97,335 ns/iter (+/- 7,869) = 3724 MB/s
test str::char_count::emoji_huge::case02_iter_increment            ... bench:     187,396 ns/iter (+/- 23,980) = 1934 MB/s
test str::char_count::emoji_huge::case03_manual_char_len           ... bench:     204,138 ns/iter (+/- 12,988) = 1775 MB/s
test str::char_count::emoji_large::case00_libcore                  ... bench:         258 ns/iter (+/- 17) = 21953 MB/s
test str::char_count::emoji_large::case01_filter_count_cont_bytes  ... bench:       1,540 ns/iter (+/- 100) = 3677 MB/s
test str::char_count::emoji_large::case02_iter_increment           ... bench:       2,896 ns/iter (+/- 172) = 1955 MB/s
test str::char_count::emoji_large::case03_manual_char_len          ... bench:       3,188 ns/iter (+/- 45) = 1776 MB/s
test str::char_count::emoji_medium::case00_libcore                 ... bench:          45 ns/iter (+/- 1) = 15733 MB/s
test str::char_count::emoji_medium::case01_filter_count_cont_bytes ... bench:         195 ns/iter (+/- 11) = 3630 MB/s
test str::char_count::emoji_medium::case02_iter_increment          ... bench:         366 ns/iter (+/- 3) = 1934 MB/s
test str::char_count::emoji_medium::case03_manual_char_len         ... bench:         409 ns/iter (+/- 2) = 1731 MB/s
test str::char_count::emoji_small::case00_libcore                  ... bench:          18 ns/iter (+/- 0) = 3777 MB/s
test str::char_count::emoji_small::case01_filter_count_cont_bytes  ... bench:          21 ns/iter (+/- 1) = 3238 MB/s
test str::char_count::emoji_small::case02_iter_increment           ... bench:          36 ns/iter (+/- 2) = 1888 MB/s
test str::char_count::emoji_small::case03_manual_char_len          ... bench:          26 ns/iter (+/- 0) = 2615 MB/s
test str::char_count::emoji_tiny::case00_libcore                   ... bench:           7 ns/iter (+/- 0) = 1142 MB/s
test str::char_count::emoji_tiny::case01_filter_count_cont_bytes   ... bench:           5 ns/iter (+/- 0) = 1600 MB/s
test str::char_count::emoji_tiny::case02_iter_increment            ... bench:           5 ns/iter (+/- 0) = 1600 MB/s
test str::char_count::emoji_tiny::case03_manual_char_len           ... bench:           3 ns/iter (+/- 0) = 2666 MB/s
test str::char_count::en_huge::case00_libcore                      ... bench:      17,351 ns/iter (+/- 3,868) = 19859 MB/s
test str::char_count::en_huge::case01_filter_count_cont_bytes      ... bench:      81,169 ns/iter (+/- 8,799) = 4245 MB/s
test str::char_count::en_huge::case02_iter_increment               ... bench:     159,609 ns/iter (+/- 32,168) = 2158 MB/s
test str::char_count::en_huge::case03_manual_char_len              ... bench:     160,180 ns/iter (+/- 39,412) = 2151 MB/s
test str::char_count::en_large::case00_libcore                     ... bench:         217 ns/iter (+/- 44) = 24811 MB/s
test str::char_count::en_large::case01_filter_count_cont_bytes     ... bench:       1,213 ns/iter (+/- 201) = 4438 MB/s
test str::char_count::en_large::case02_iter_increment              ... bench:       2,602 ns/iter (+/- 305) = 2069 MB/s
test str::char_count::en_large::case03_manual_char_len             ... bench:       2,678 ns/iter (+/- 457) = 2010 MB/s
test str::char_count::en_medium::case00_libcore                    ... bench:          37 ns/iter (+/- 3) = 18189 MB/s
test str::char_count::en_medium::case01_filter_count_cont_bytes    ... bench:         156 ns/iter (+/- 40) = 4314 MB/s
test str::char_count::en_medium::case02_iter_increment             ... bench:         340 ns/iter (+/- 81) = 1979 MB/s
test str::char_count::en_medium::case03_manual_char_len            ... bench:         325 ns/iter (+/- 47) = 2070 MB/s
test str::char_count::en_small::case00_libcore                     ... bench:          13 ns/iter (+/- 3) = 2692 MB/s
test str::char_count::en_small::case01_filter_count_cont_bytes     ... bench:          14 ns/iter (+/- 0) = 2500 MB/s
test str::char_count::en_small::case02_iter_increment              ... bench:          27 ns/iter (+/- 0) = 1296 MB/s
test str::char_count::en_small::case03_manual_char_len             ... bench:          29 ns/iter (+/- 2) = 1206 MB/s
test str::char_count::en_tiny::case00_libcore                      ... bench:           7 ns/iter (+/- 0) = 1142 MB/s
test str::char_count::en_tiny::case01_filter_count_cont_bytes      ... bench:           5 ns/iter (+/- 0) = 1600 MB/s
test str::char_count::en_tiny::case02_iter_increment               ... bench:           7 ns/iter (+/- 0) = 1142 MB/s
test str::char_count::en_tiny::case03_manual_char_len              ... bench:           7 ns/iter (+/- 0) = 1142 MB/s
test str::char_count::ru_huge::case00_libcore                      ... bench:      16,067 ns/iter (+/- 4,995) = 20235 MB/s
test str::char_count::ru_huge::case01_filter_count_cont_bytes      ... bench:      75,264 ns/iter (+/- 13,786) = 4319 MB/s
test str::char_count::ru_huge::case02_iter_increment               ... bench:     140,164 ns/iter (+/- 33,212) = 2319 MB/s
test str::char_count::ru_huge::case03_manual_char_len              ... bench:     127,110 ns/iter (+/- 27,164) = 2557 MB/s
test str::char_count::ru_large::case00_libcore                     ... bench:         201 ns/iter (+/- 33) = 25273 MB/s
test str::char_count::ru_large::case01_filter_count_cont_bytes     ... bench:       1,178 ns/iter (+/- 224) = 4312 MB/s
test str::char_count::ru_large::case02_iter_increment              ... bench:       2,164 ns/iter (+/- 651) = 2347 MB/s
test str::char_count::ru_large::case03_manual_char_len             ... bench:       1,967 ns/iter (+/- 411) = 2582 MB/s
test str::char_count::ru_medium::case00_libcore                    ... bench:          36 ns/iter (+/- 6) = 17638 MB/s
test str::char_count::ru_medium::case01_filter_count_cont_bytes    ... bench:         152 ns/iter (+/- 22) = 4177 MB/s
test str::char_count::ru_medium::case02_iter_increment             ... bench:         280 ns/iter (+/- 141) = 2267 MB/s
test str::char_count::ru_medium::case03_manual_char_len            ... bench:         224 ns/iter (+/- 68) = 2834 MB/s
test str::char_count::ru_small::case00_libcore                     ... bench:          13 ns/iter (+/- 0) = 2461 MB/s
test str::char_count::ru_small::case01_filter_count_cont_bytes     ... bench:          11 ns/iter (+/- 0) = 2909 MB/s
test str::char_count::ru_small::case02_iter_increment              ... bench:          16 ns/iter (+/- 1) = 2000 MB/s
test str::char_count::ru_small::case03_manual_char_len             ... bench:          12 ns/iter (+/- 0) = 2666 MB/s
test str::char_count::ru_tiny::case00_libcore                      ... bench:           9 ns/iter (+/- 0) = 1111 MB/s
test str::char_count::ru_tiny::case01_filter_count_cont_bytes      ... bench:           5 ns/iter (+/- 1) = 2000 MB/s
test str::char_count::ru_tiny::case02_iter_increment               ... bench:           5 ns/iter (+/- 1) = 2000 MB/s
test str::char_count::ru_tiny::case03_manual_char_len              ... bench:           4 ns/iter (+/- 0) = 2500 MB/s
test str::char_count::zh_huge::case00_libcore                      ... bench:      12,775 ns/iter (+/- 3,100) = 23646 MB/s
test str::char_count::zh_huge::case01_filter_count_cont_bytes      ... bench:      70,663 ns/iter (+/- 8,390) = 4274 MB/s
test str::char_count::zh_huge::case02_iter_increment               ... bench:     109,605 ns/iter (+/- 4,938) = 2756 MB/s
test str::char_count::zh_huge::case03_manual_char_len              ... bench:     212,944 ns/iter (+/- 18,815) = 1418 MB/s
test str::char_count::zh_large::case00_libcore                     ... bench:         192 ns/iter (+/- 31) = 24583 MB/s
test str::char_count::zh_large::case01_filter_count_cont_bytes     ... bench:       1,083 ns/iter (+/- 243) = 4358 MB/s
test str::char_count::zh_large::case02_iter_increment              ... bench:       1,403 ns/iter (+/- 188) = 3364 MB/s
test str::char_count::zh_large::case03_manual_char_len             ... bench:       2,810 ns/iter (+/- 440) = 1679 MB/s
test str::char_count::zh_medium::case00_libcore                    ... bench:          36 ns/iter (+/- 8) = 16388 MB/s
test str::char_count::zh_medium::case01_filter_count_cont_bytes    ... bench:         138 ns/iter (+/- 55) = 4275 MB/s
test str::char_count::zh_medium::case02_iter_increment             ... bench:         210 ns/iter (+/- 27) = 2809 MB/s
test str::char_count::zh_medium::case03_manual_char_len            ... bench:         414 ns/iter (+/- 4) = 1425 MB/s
test str::char_count::zh_small::case00_libcore                     ... bench:          16 ns/iter (+/- 1) = 2250 MB/s
test str::char_count::zh_small::case01_filter_count_cont_bytes     ... bench:          12 ns/iter (+/- 1) = 3000 MB/s
test str::char_count::zh_small::case02_iter_increment              ... bench:          12 ns/iter (+/- 2) = 3000 MB/s
test str::char_count::zh_small::case03_manual_char_len             ... bench:          14 ns/iter (+/- 1) = 2571 MB/s
test str::char_count::zh_tiny::case00_libcore                      ... bench:           7 ns/iter (+/- 2) = 1285 MB/s
test str::char_count::zh_tiny::case01_filter_count_cont_bytes      ... bench:           5 ns/iter (+/- 0) = 1800 MB/s
test str::char_count::zh_tiny::case02_iter_increment               ... bench:           4 ns/iter (+/- 2) = 2250 MB/s
test str::char_count::zh_tiny::case03_manual_char_len              ... bench:           3 ns/iter (+/- 2) = 3000 MB/s

The story it tells seems relatively consistent: very slight (1-2ns) hit for tiny strings, no real difference for small strings, and medium and above is a significant improvement of around 4x-5x.

Footnotes

  1. Caveat: the measurements on my machine can be somewhat noisy, as this is a laptop and can get thermally throttled. That said, I quit all other applications aside from my terminal while running these.

@thomcc
Copy link
Member Author

thomcc commented Feb 1, 2022

@rustbot ready

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Feb 1, 2022
library/core/src/str/count.rs Outdated Show resolved Hide resolved
@nagisa
Copy link
Member

nagisa commented Feb 5, 2022

r=me after the nit is addressed. Not sure if you have bors powers, so adding a delegate flag. Once you're pushed a fix for the nit, add a comment with @bors r=nagisa.

@nagisa
Copy link
Member

nagisa commented Feb 5, 2022

@bors delegate+

@bors
Copy link
Contributor

bors commented Feb 5, 2022

✌️ @thomcc can now approve this pull request

@bors
Copy link
Contributor

bors commented Feb 5, 2022

📌 Commit 41f8214 has been approved by nagisa

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Feb 5, 2022
@the8472
Copy link
Member

the8472 commented Feb 5, 2022

The story it tells seems relatively consistent: very slight (1-2ns) hit for tiny strings

I think this warrants a perf run before merging because the compiler could be dealing with lots of tiny strings and this could potentially end up being a net-negative.

@bors try @perf-timer queue

@bors
Copy link
Contributor

bors commented Feb 5, 2022

🙅 Please do not try after a pull request has been r+ed. If you need to try, unapprove (r-) it first.

@the8472
Copy link
Member

the8472 commented Feb 5, 2022

oh well, then at least @bors rollup=never

@nagisa
Copy link
Member

nagisa commented Feb 5, 2022

@bors rollup=never

@thomcc
Copy link
Member Author

thomcc commented Feb 5, 2022

Good point. I'll be interested to see the results.

FWIW I don't really think the difference here exists in real code. I only see it in rust-lang/rust (not the dummy crate I've been using to avoid having to go through slow stdlib builds), and even then it seems like the version inside libcore is consistently penalized somehow -- I feel like it's either some artifact of benchmarking, or my setup.

Actually, tow that I write that, I realize that I probably need to check my config.toml values. Disabling rust.incremental, rust.debug-info-std, ... which makes things seem more sensible. Perhaps these should give a warning if these are enabled when running stdlib benchmarks? Anyway, results at bottom of file, but it feel like it's within noise for the tiny inputs now1.

Of course, the perf run will reveal the ultimate truth -- there are a lot of reasons why these benchmarks could be unrealistic (real world usage will have harder to predict branches, as just one example). Still, I'm glad this looks better for small inputs, since they can be somewhat common.

test str::char_count::emoji_huge::case00_libcore                   ... bench:      17,345 ns/iter (+/- 3,875) = 20899 MB/s
test str::char_count::emoji_huge::case01_filter_count_cont_bytes   ... bench:      94,203 ns/iter (+/- 18,805) = 3848 MB/s
test str::char_count::emoji_huge::case02_iter_increment            ... bench:     128,222 ns/iter (+/- 20,453) = 2827 MB/s
test str::char_count::emoji_huge::case03_manual_char_len           ... bench:     174,969 ns/iter (+/- 57,786) = 2071 MB/s
test str::char_count::emoji_large::case00_libcore                  ... bench:         227 ns/iter (+/- 42) = 24951 MB/s
test str::char_count::emoji_large::case01_filter_count_cont_bytes  ... bench:       1,338 ns/iter (+/- 166) = 4233 MB/s
test str::char_count::emoji_large::case02_iter_increment           ... bench:       1,976 ns/iter (+/- 347) = 2866 MB/s
test str::char_count::emoji_large::case03_manual_char_len          ... bench:       2,751 ns/iter (+/- 538) = 2058 MB/s
test str::char_count::emoji_medium::case00_libcore                 ... bench:          34 ns/iter (+/- 5) = 20823 MB/s
test str::char_count::emoji_medium::case01_filter_count_cont_bytes ... bench:         172 ns/iter (+/- 20) = 4116 MB/s
test str::char_count::emoji_medium::case02_iter_increment          ... bench:         253 ns/iter (+/- 57) = 2798 MB/s
test str::char_count::emoji_medium::case03_manual_char_len         ... bench:         380 ns/iter (+/- 86) = 1863 MB/s
test str::char_count::emoji_small::case00_libcore                  ... bench:          10 ns/iter (+/- 3) = 6800 MB/s
test str::char_count::emoji_small::case01_filter_count_cont_bytes  ... bench:          17 ns/iter (+/- 3) = 4000 MB/s
test str::char_count::emoji_small::case02_iter_increment           ... bench:          32 ns/iter (+/- 14) = 2125 MB/s
test str::char_count::emoji_small::case03_manual_char_len          ... bench:          22 ns/iter (+/- 5) = 3090 MB/s
test str::char_count::emoji_tiny::case00_libcore                   ... bench:           4 ns/iter (+/- 0) = 2000 MB/s
test str::char_count::emoji_tiny::case01_filter_count_cont_bytes   ... bench:           3 ns/iter (+/- 0) = 2666 MB/s
test str::char_count::emoji_tiny::case02_iter_increment            ... bench:           3 ns/iter (+/- 0) = 2666 MB/s
test str::char_count::emoji_tiny::case03_manual_char_len           ... bench:           2 ns/iter (+/- 0) = 4000 MB/s
test str::char_count::en_huge::case00_libcore                      ... bench:      15,425 ns/iter (+/- 4,318) = 22338 MB/s
test str::char_count::en_huge::case01_filter_count_cont_bytes      ... bench:      82,838 ns/iter (+/- 12,770) = 4159 MB/s
test str::char_count::en_huge::case02_iter_increment               ... bench:     121,184 ns/iter (+/- 23,056) = 2843 MB/s
test str::char_count::en_huge::case03_manual_char_len              ... bench:     146,755 ns/iter (+/- 24,069) = 2347 MB/s
test str::char_count::en_large::case00_libcore                     ... bench:         239 ns/iter (+/- 81) = 22527 MB/s
test str::char_count::en_large::case01_filter_count_cont_bytes     ... bench:       1,327 ns/iter (+/- 167) = 4057 MB/s
test str::char_count::en_large::case02_iter_increment              ... bench:       1,902 ns/iter (+/- 324) = 2830 MB/s
test str::char_count::en_large::case03_manual_char_len             ... bench:       2,103 ns/iter (+/- 266) = 2560 MB/s
test str::char_count::en_medium::case00_libcore                    ... bench:          34 ns/iter (+/- 7) = 19794 MB/s
test str::char_count::en_medium::case01_filter_count_cont_bytes    ... bench:         197 ns/iter (+/- 57) = 3416 MB/s
test str::char_count::en_medium::case02_iter_increment             ... bench:         500 ns/iter (+/- 178) = 1346 MB/s
test str::char_count::en_medium::case03_manual_char_len            ... bench:         280 ns/iter (+/- 65) = 2403 MB/s
test str::char_count::en_small::case00_libcore                     ... bench:           9 ns/iter (+/- 1) = 3888 MB/s
test str::char_count::en_small::case01_filter_count_cont_bytes     ... bench:          11 ns/iter (+/- 3) = 3181 MB/s
test str::char_count::en_small::case02_iter_increment              ... bench:          27 ns/iter (+/- 4) = 1296 MB/s
test str::char_count::en_small::case03_manual_char_len             ... bench:          19 ns/iter (+/- 3) = 1842 MB/s
test str::char_count::en_tiny::case00_libcore                      ... bench:           4 ns/iter (+/- 1) = 2000 MB/s
test str::char_count::en_tiny::case01_filter_count_cont_bytes      ... bench:           4 ns/iter (+/- 0) = 2000 MB/s
test str::char_count::en_tiny::case02_iter_increment               ... bench:           6 ns/iter (+/- 1) = 1333 MB/s
test str::char_count::en_tiny::case03_manual_char_len              ... bench:           5 ns/iter (+/- 1) = 1600 MB/s
test str::char_count::ru_huge::case00_libcore                      ... bench:      14,648 ns/iter (+/- 3,487) = 22195 MB/s
test str::char_count::ru_huge::case01_filter_count_cont_bytes      ... bench:      82,164 ns/iter (+/- 19,440) = 3956 MB/s
test str::char_count::ru_huge::case02_iter_increment               ... bench:     121,055 ns/iter (+/- 18,603) = 2685 MB/s
test str::char_count::ru_huge::case03_manual_char_len              ... bench:      93,980 ns/iter (+/- 19,891) = 3459 MB/s
test str::char_count::ru_large::case00_libcore                     ... bench:         209 ns/iter (+/- 50) = 24306 MB/s
test str::char_count::ru_large::case01_filter_count_cont_bytes     ... bench:       1,211 ns/iter (+/- 362) = 4194 MB/s
test str::char_count::ru_large::case02_iter_increment              ... bench:       2,659 ns/iter (+/- 514) = 1910 MB/s
test str::char_count::ru_large::case03_manual_char_len             ... bench:       2,822 ns/iter (+/- 427) = 1800 MB/s
test str::char_count::ru_medium::case00_libcore                    ... bench:          34 ns/iter (+/- 3) = 18676 MB/s
test str::char_count::ru_medium::case01_filter_count_cont_bytes    ... bench:         155 ns/iter (+/- 51) = 4096 MB/s
test str::char_count::ru_medium::case02_iter_increment             ... bench:         362 ns/iter (+/- 136) = 1754 MB/s
test str::char_count::ru_medium::case03_manual_char_len            ... bench:         216 ns/iter (+/- 64) = 2939 MB/s
test str::char_count::ru_small::case00_libcore                     ... bench:           7 ns/iter (+/- 3) = 4571 MB/s
test str::char_count::ru_small::case01_filter_count_cont_bytes     ... bench:           9 ns/iter (+/- 4) = 3555 MB/s
test str::char_count::ru_small::case02_iter_increment              ... bench:          16 ns/iter (+/- 10) = 2000 MB/s
test str::char_count::ru_small::case03_manual_char_len             ... bench:          19 ns/iter (+/- 3) = 1684 MB/s
test str::char_count::ru_tiny::case00_libcore                      ... bench:           5 ns/iter (+/- 0) = 2000 MB/s
test str::char_count::ru_tiny::case01_filter_count_cont_bytes      ... bench:           5 ns/iter (+/- 1) = 2000 MB/s
test str::char_count::ru_tiny::case02_iter_increment               ... bench:           4 ns/iter (+/- 0) = 2500 MB/s
test str::char_count::ru_tiny::case03_manual_char_len              ... bench:           3 ns/iter (+/- 0) = 3333 MB/s
test str::char_count::zh_huge::case00_libcore                      ... bench:      15,208 ns/iter (+/- 1,737) = 19863 MB/s
test str::char_count::zh_huge::case01_filter_count_cont_bytes      ... bench:      76,816 ns/iter (+/- 16,023) = 3932 MB/s
test str::char_count::zh_huge::case02_iter_increment               ... bench:     135,379 ns/iter (+/- 23,298) = 2231 MB/s
test str::char_count::zh_huge::case03_manual_char_len              ... bench:     188,367 ns/iter (+/- 72,483) = 1603 MB/s
test str::char_count::zh_large::case00_libcore                     ... bench:         188 ns/iter (+/- 53) = 25106 MB/s
test str::char_count::zh_large::case01_filter_count_cont_bytes     ... bench:       1,295 ns/iter (+/- 39) = 3644 MB/s
test str::char_count::zh_large::case02_iter_increment              ... bench:       2,404 ns/iter (+/- 450) = 1963 MB/s
test str::char_count::zh_large::case03_manual_char_len             ... bench:       3,031 ns/iter (+/- 579) = 1557 MB/s
test str::char_count::zh_medium::case00_libcore                    ... bench:          33 ns/iter (+/- 3) = 17878 MB/s
test str::char_count::zh_medium::case01_filter_count_cont_bytes    ... bench:         163 ns/iter (+/- 3) = 3619 MB/s
test str::char_count::zh_medium::case02_iter_increment             ... bench:         205 ns/iter (+/- 42) = 2878 MB/s
test str::char_count::zh_medium::case03_manual_char_len            ... bench:         420 ns/iter (+/- 13) = 1404 MB/s
test str::char_count::zh_small::case00_libcore                     ... bench:          10 ns/iter (+/- 2) = 3600 MB/s
test str::char_count::zh_small::case01_filter_count_cont_bytes     ... bench:          11 ns/iter (+/- 0) = 3272 MB/s
test str::char_count::zh_small::case02_iter_increment              ... bench:          10 ns/iter (+/- 1) = 3600 MB/s
test str::char_count::zh_small::case03_manual_char_len             ... bench:          17 ns/iter (+/- 0) = 2117 MB/s
test str::char_count::zh_tiny::case00_libcore                      ... bench:           5 ns/iter (+/- 1) = 1800 MB/s
test str::char_count::zh_tiny::case01_filter_count_cont_bytes      ... bench:           4 ns/iter (+/- 0) = 2250 MB/s
test str::char_count::zh_tiny::case02_iter_increment               ... bench:           3 ns/iter (+/- 0) = 3000 MB/s
test str::char_count::zh_tiny::case03_manual_char_len              ... bench:           3 ns/iter (+/- 0) = 3000 MB/s

Footnotes

  1. Which is what you'd expect when the functions are the same, except one has a branch that will be predicted correctly 100% of the time in the benchmark.

@the8472
Copy link
Member

the8472 commented Feb 5, 2022

which makes things seem more sensible. Perhaps these should give a warning if these are enabled when running stdlib benchmarks?

I wonder if we could build std and dependencies with incremental but the final benchmark crate in non-incremental mode.

@bors
Copy link
Contributor

bors commented Feb 5, 2022

⌛ Testing commit 41f8214 with merge e2e1d636dc37facc46d2ab58b0879f6092f5b24b...

@bors bors mentioned this pull request Feb 5, 2022
@bors
Copy link
Contributor

bors commented Feb 6, 2022

💥 Test timed out

@bors bors added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. labels Feb 6, 2022
@the8472
Copy link
Member

the8472 commented Feb 6, 2022

@bors retry

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Feb 6, 2022
@bors
Copy link
Contributor

bors commented Feb 6, 2022

⌛ Testing commit 41f8214 with merge f624427...

@bors
Copy link
Contributor

bors commented Feb 6, 2022

☀️ Test successful - checks-actions
Approved by: nagisa
Pushing f624427 to master...

@bors bors added the merged-by-bors This PR was explicitly merged by bors. label Feb 6, 2022
@bors bors merged commit f624427 into rust-lang:master Feb 6, 2022
@rustbot rustbot added this to the 1.60.0 milestone Feb 6, 2022
@rust-timer
Copy link
Collaborator

Finished benchmarking commit (f624427): comparison url.

Summary: This benchmark run shows 10 relevant improvements 🎉 but 2 relevant regressions 😿 to instruction counts.

  • Average relevant regression: 0.4%
  • Average relevant improvement: -0.9%
  • Largest improvement in instruction counts: -1.4% on incr-unchanged builds of encoding check
  • Largest regression in instruction counts: 0.5% on incr-full builds of html5ever debug

If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.

Next Steps: If you can justify the regressions found in this perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please open an issue or create a new PR that fixes the regressions, add a comment linking to the newly created issue or PR, and then add the perf-regression-triaged label to this PR.

@rustbot label: +perf-regression

@rustbot rustbot added the perf-regression Performance regression. label Feb 6, 2022
@thomcc
Copy link
Member Author

thomcc commented Feb 6, 2022

Seems like a larger and more consistent improvement than regression -- surprised we do this enough in the compiler to be visible at all.

I actually do have some thoughts on how to improve it further for the small string case (specifically, the trick I used in is_ascii to allow the head and tail to be handled with unaligned reads is a lot more of a pain here, but is still possible if we build a mask to handle the overlap), but I don't know when I'll get to it. (Quite possibly not for a while unless this is regression seems more serious to others than it seems to me)

@the8472
Copy link
Member

the8472 commented Feb 6, 2022

surprised we do this enough in the compiler to be visible at all

Since this is library code and can also be used by the crates being tested it could also mean that the compiler spends a different amount of time handling this code rather than being directly affected by it. And then there's inlining/CGU/whatever perturbation noise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
merged-by-bors This PR was explicitly merged by bors. perf-regression Performance regression. S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.