rustdoc-search: use set ops for ranking and filtering #118402

notriddle · 2023-11-28T06:07:48Z

This commit adds ranking and quick filtering to type-based search, improving performance and having it order results based on their type signatures.

Preview

Profiler output: https://notriddle.com/rustdoc-html-demo-6/profile-8/index.html

Preview: https://notriddle.com/rustdoc-html-demo-6/ranking-and-filtering-v2/std/index.html

Motivation

If I write a query like str -> String, a lot of functions come up. That's to be expected, but String::from should come up on top, and it doesn't right now. This is because the sorting algorithm is based on the function's name, and doesn't consider the type signature at all. slice::join even comes up above it!

To fix this, the sorting should take into account the function's signature, and the closer match should come up on top.

Guide-level description

When searching by type signature, types with a "closer" match will show up above types that match less precisely.

Reference-level explanation

Function signature search works in three major phases:

A compact "fingerprint," based on the bloom filter technique, is used to check for matches and to estimate the distance. It sometimes has false positive matches, but it also operates on 128-bit contiguous memory and requires no backtracking, so it performs a lot better than real unification.

The fingerprint represents the set of items in the type signature, but it does not represent nesting, and it ignores when the same item appears more than once.

The result is rejected if any query bits are absent in the function, or if the distance is higher than the current maximum and 200 results have already been found.

The second step performs unification. This is where nesting and true bag semantics are taken into account, and it has no false positives. It uses a recursive, backtracking algorithm.

The result is rejected if any query elements are absent in the function.
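
The fingerprint check described above can be sketched roughly as follows. This is a minimal illustration, not the code from `search.js`: the layout (three 32-bit filter words plus one word holding the distinct-item count), the `hashString` helper, and the number of bits set per item are all assumptions made for the example.

```javascript
// Hypothetical layout: words 0..2 hold a 96-bit bloom filter,
// word 3 holds the number of distinct items (the "distance").
const FILTER_WORDS = 3;
const FILTER_BITS = FILTER_WORDS * 32;
const HASHES_PER_ITEM = 2; // arbitrary choice for this sketch

// Seeded FNV-1a string hash; illustrative only, not rustdoc's hash.
function hashString(s, seed) {
    let h = (0x811c9dc5 ^ seed) >>> 0;
    for (let i = 0; i < s.length; i += 1) {
        h ^= s.charCodeAt(i);
        h = Math.imul(h, 0x01000193) >>> 0;
    }
    return h;
}

// Build a fingerprint from the items of a signature.
// Nesting is not represented, and repeated items are ignored.
function buildFingerprint(items) {
    const fp = new Uint32Array(FILTER_WORDS + 1);
    const distinct = new Set(items);
    for (const item of distinct) {
        for (let k = 0; k < HASHES_PER_ITEM; k += 1) {
            const bit = hashString(item, k) % FILTER_BITS;
            fp[bit >>> 5] |= 1 << (bit & 31);
        }
    }
    fp[FILTER_WORDS] = distinct.size;
    return fp;
}

// False when some query bit is absent in the function.
// True means "possibly a match": false positives can happen.
function mayMatch(queryFp, fnFp) {
    for (let i = 0; i < FILTER_WORDS; i += 1) {
        if ((queryFp[i] & ~fnFp[i]) !== 0) {
            return false;
        }
    }
    return true;
}

// The estimated distance is just the stored distinct-item count.
function estimatedDistance(fnFp) {
    return fnFp[FILTER_WORDS];
}
```

A row that passes `mayMatch` still has to go through unification, since only that step is free of false positives.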

Drawbacks

This makes the code bigger.

More than that, this design is a subtle trade-off. It makes the cases I've tested against measurably faster, but it's not clear how well this extends to other crates with potentially more functions and fewer types.

The more complex things get, the more important it is to gather a good set of data to test with (this is arguably more important than the actual benchmarking infrastructure right now).

Rationale and alternatives

Throwing a bloom filter in front makes it faster.

More than that, it tries to take a tactic where the system can not only check for potential matches, but also gets an accurate distance function without needing to do unification. That way it can skip unification even on items that have the needed elems, as long as they have more items than the currently found maximum.
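
A rough sketch of how a cheap distance lets the search skip unification once the result list is full. The names `ResultSet`, `canSkip`, and `search`, and the shape of the `rows` objects, are hypothetical; the `unify` callback stands in for the real unification step.

```javascript
const MAX_RESULTS = 200;

class ResultSet {
    constructor(limit = MAX_RESULTS) {
        this.limit = limit;
        this.items = [];
        this.maxDistance = 0;
    }

    // True when enough results exist and this candidate is
    // farther away than the worst result found so far.
    canSkip(distance) {
        return this.items.length >= this.limit && distance > this.maxDistance;
    }

    add(name, distance) {
        this.items.push({ name, distance });
        if (distance > this.maxDistance) {
            this.maxDistance = distance;
        }
    }
}

// rows: [{ name, distance }], where distance comes from the fingerprint.
// unify: the expensive exact check, only called when it can matter.
function search(rows, unify, limit = MAX_RESULTS) {
    const results = new ResultSet(limit);
    let unifyCalls = 0;
    for (const row of rows) {
        if (results.canSkip(row.distance)) {
            continue;
        }
        unifyCalls += 1;
        if (unify(row)) {
            results.add(row.name, row.distance);
        }
    }
    results.items.sort((a, b) => a.distance - b.distance);
    return { results: results.items, unifyCalls };
}
```

With a limit of 2 and rows at distances 1, 2, 5, 1, the third row is skipped without calling `unify`, because two results already exist and 5 is larger than the current maximum of 2.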

If I didn't want to be able to cheaply do set operations on the fingerprint, a cuckoo filter is supposed to have better performance. But the nice bit-banging set intersection doesn't work AFAIK.

I also looked into minhashing, but since it's actually an unbiased estimate of the similarity coefficient, I'm not sure how it could be used to skip unification (I wouldn't know if the estimate was too low or too high).

This function actually uses the number of distinct items as its "distance function." This should give the same results that it would have gotten from a Jaccard Distance $1-\frac{|F\cap{}Q|}{|F\cup{}Q|}$, while being cheaper to compute. This is because:

The function $F$ must be a superset of the query $Q$, so their union is just $F$ and the intersection is $Q$, and it can be reduced to $1-\frac{|Q|}{|F|}$.
There are no magic thresholds. These values are only being used to compare against each other while sorting (and, if 200 results are found, to compare with the maximum match). This means we only care if one value is bigger than the other, not what its actual value is, and since $Q$ is the same for everything, it can be safely left out, reducing the formula to $1-\frac{1}{|F|} = \frac{|F|}{|F|}-\frac{1}{|F|} = \frac{|F|-1}{|F|}$. This grows whenever $|F|$ grows, so, since the values are only being compared with each other, $|F|$ is fine.
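
As a quick sanity check of that argument, the sketch below sorts a few made-up signatures (each a superset of the query) once by Jaccard distance and once by plain set size. The function names and the sample data are invented for the illustration.

```javascript
// Jaccard distance between two Sets: 1 - |F ∩ Q| / |F ∪ Q|.
function jaccardDistance(f, q) {
    let intersection = 0;
    for (const x of q) {
        if (f.has(x)) {
            intersection += 1;
        }
    }
    const union = f.size + q.size - intersection;
    return 1 - intersection / union;
}

const query = new Set(["str", "String"]);

// Hypothetical candidates; every one contains all query items.
const candidates = [
    { name: "wide", items: new Set(["str", "String", "Vec", "Option", "Result"]) },
    { name: "exact", items: new Set(["str", "String"]) },
    { name: "middle", items: new Set(["str", "String", "Vec"]) },
];

const byJaccard = [...candidates]
    .sort((a, b) => jaccardDistance(a.items, query) - jaccardDistance(b.items, query))
    .map(c => c.name);

const bySize = [...candidates]
    .sort((a, b) => a.items.size - b.items.size)
    .map(c => c.name);

console.log(byJaccard.join(","), "|", bySize.join(","));
```

Both orderings come out as exact, middle, wide, which is why counting distinct items is enough when every candidate already contains the query.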

Prior art

This is significantly different from how Hoogle does it.
It doesn't account for order, and it has no special account for nesting, though Box<t> is still two items, while t is only one.
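
To illustrate the item counting, here is a small sketch that flattens a type tree into its set of distinct names. The `{ name, generics }` shape and the `collectItems` helper are assumptions for the example, not rustdoc's internal representation.

```javascript
// Collect the distinct item names in a type, ignoring order and nesting.
function collectItems(type, out = new Set()) {
    out.add(type.name);
    for (const generic of type.generics || []) {
        collectItems(generic, out);
    }
    return out;
}

const boxed = { name: "Box", generics: [{ name: "t" }] };
const bare = { name: "t" };

console.log(collectItems(boxed).size); // Box and t
console.log(collectItems(bare).size); // only t
```

Here `Box<t>` contributes two items and `t` contributes one, so the bare `t` counts as the closer match for a query of `t`.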

This should give the same results that it would have gotten from a Jaccard Distance $1-\frac{|A\cap{}B|}{|A\cup{}B|}$, while being cheaper to compute.

Unresolved questions

[] and (), the slice/array and tuple/union operators, are ignored while building the signature for the query. This is because they match more than one thing, making them ambiguous. Unfortunately, this also makes them a performance cliff. Is this likely to be a problem?

Right now, the system just stashes the type distance into the same field that Levenshtein distance normally goes in. This means exact query matches show up on top (for example, if you have a function like fn nothing(a: Nothing, b: i32), then searching for nothing will show it on top even if there's another function with fn bar(x: Nothing) that's technically a closer match in type signature).

Future possibilities

It should be possible to adopt more sorting criteria to act as a tie breaker, which could be determined during unification.

rustbot · 2023-11-28T06:07:57Z

r? @fmease

(rustbot has picked a reviewer for you, use r? to override)

rustbot · 2023-11-28T06:07:59Z

Some changes occurred in HTML/CSS/JS.

cc @GuillaumeGomez, @jsha

notriddle · 2023-11-28T06:08:13Z

r? @GuillaumeGomez

src/librustdoc/html/static/js/search.js

GuillaumeGomez · 2023-12-10T11:11:22Z

Overall, I have no issue with this change. But for future optimizations that require increasing code size, I'd really like us to have a wide enough test suite to ensure that they don't negatively impact other cases. Considering this, I'm wondering if we shouldn't start by writing these performance checks (even if not run in rustc's CI) before landing any new performance improvement PR. What do you think?

notriddle · 2023-12-10T12:52:26Z

That does seem like a good idea, yeah.

Any suggestions about where to start?

notriddle · 2023-12-11T06:38:06Z

I've added a couple more benchmarks:

stm32f4
ripgrep

I've also added more queries to the arti benchmark, and improved the bloom filter hashing (it gets a lot fewer false positives now, and the many nearly identical queries now all run in similar time).

src/librustdoc/html/static/js/search.js

GuillaumeGomez · 2023-12-12T19:29:28Z

Thanks! r=me once CI passes

…s, r=notriddle Clean up variables in `search.js` While reviewing rust-lang#118402, I saw a few small clean ups that were needed, mostly about variables creation. r? `@notriddle`

notriddle · 2023-12-12T21:05:21Z

@bors r=GuillaumeGomez

bors · 2023-12-12T21:05:24Z

📌 Commit 34b7265 has been approved by GuillaumeGomez

It is now in the queue for this repository.

Rollup merge of rust-lang#118886 - GuillaumeGomez:clean-up-search-vars, r=notriddle

Clean up variables in `search.js`

While reviewing rust-lang#118402, I saw a few small clean ups that were needed, mostly about variables creation. r? `@notriddle`

bors · 2023-12-13T07:25:38Z

☔ The latest upstream changes (presumably #118900) made this pull request unmergeable. Please resolve the merge conflicts.

This function dates back to 9a45c9d and seems to have been made obsolete when `addIntoResult` grew the ability to check the levenshtein distance matching with commit ba824ec.

The hash changes are based on some tests with `arti` and various specific queries, aimed at reducing the false positive rate.

Sorting the query elements so that generics always come first is instead aimed at reducing the number of Map operations on mgens, assuming if the bloom filter does find a false positive, it'll be able to reject the row without having to track a mapping.

- https://hur.st/bloomfilter/?n=3&p=&m=96&k=6

  Different functions have different amounts of inputs, and unification isn't very slow anyway, so figuring out a single ideal number of hash functions is nasty, but 6 keeps things low even up to 10 inputs.

- https://web.archive.org/web/20210927123933/https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.72.2442&rep=rep1&type=pdf

  This is the `h1` and `h2`, both derived from `h0`.
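
The following sketch shows one way to get 6 bit positions in a 96-bit filter from a single base hash, by splitting it into two values and combining them as `h1 + i * h2`. The FNV-1a base hash and the way `h1` and `h2` are split out of `h0` are assumptions for illustration; the commit only says both are derived from `h0`. The `falsePositiveRate` helper uses the standard bloom filter estimate $(1 - e^{-kn/m})^k$.

```javascript
const M = 96; // filter bits, as in the calculator link (m=96)
const K = 6; // hash functions, as in the calculator link (k=6)

// Base hash (FNV-1a); a stand-in, not the hash used by rustdoc.
function h0(s) {
    let h = 0x811c9dc5;
    for (let i = 0; i < s.length; i += 1) {
        h ^= s.charCodeAt(i);
        h = Math.imul(h, 0x01000193);
    }
    return h >>> 0;
}

// Derive K bit positions from one base hash via double hashing.
function bitPositions(s) {
    const h = h0(s);
    const h1 = h & 0xffff; // low half (hypothetical split)
    const h2 = (h >>> 16) | 1; // high half, forced odd so it is never zero
    const positions = [];
    for (let i = 0; i < K; i += 1) {
        positions.push((h1 + i * h2) % M);
    }
    return positions;
}

// Expected false positive rate with n items inserted.
function falsePositiveRate(n, m = M, k = K) {
    return Math.pow(1 - Math.exp(-(k * n) / m), k);
}

console.log(bitPositions("String"));
console.log(falsePositiveRate(3), falsePositiveRate(10));
```

With these parameters the estimate stays far below one percent for 3 inputs and is around one percent for 10 inputs.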

notriddle · 2023-12-13T18:46:48Z

@bors r=GuillaumeGomez

Rebased with the variable changes.

bors · 2023-12-13T18:46:50Z

📌 Commit bec6672 has been approved by GuillaumeGomez

It is now in the queue for this repository.

bors · 2023-12-13T21:45:05Z

⌛ Testing commit bec6672 with merge eeff92a...

bors · 2023-12-13T23:42:16Z

☀️ Test successful - checks-actions
Approved by: GuillaumeGomez
Pushing eeff92a to master...

rust-timer · 2023-12-14T05:00:27Z

Finished benchmarking commit (eeff92a): comparison URL.

Overall result: no relevant changes - no action needed

@rustbot label: -perf-regression

Instruction count

This benchmark run did not return any relevant results for this metric.

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	2.4%	[0.7%, 4.4%]	4
Regressions ❌ (secondary)	-	-	0
Improvements ✅ (primary)	-	-	0
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	2.4%	[0.7%, 4.4%]	4

Cycles

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	0.8%	[0.8%, 0.8%]	1
Regressions ❌ (secondary)	3.3%	[2.2%, 4.1%]	3
Improvements ✅ (primary)	-0.5%	[-0.5%, -0.5%]	1
Improvements ✅ (secondary)	-2.0%	[-2.0%, -2.0%]	1
All ❌✅ (primary)	0.2%	[-0.5%, 0.8%]	2

Binary size

This benchmark run did not return any relevant results for this metric.

Bootstrap: 671.752s -> 671.988s (0.04%)
Artifact size: 312.39 MiB -> 312.40 MiB (0.00%)

rustbot assigned fmease Nov 28, 2023

rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-rustdoc Relevant to the rustdoc team, which will review and decide on the PR/issue. labels Nov 28, 2023

rustbot assigned GuillaumeGomez and unassigned fmease Nov 28, 2023

notriddle force-pushed the notriddle/ranking-and-filtering branch from c56f632 to 969de46 Compare November 28, 2023 07:03

notriddle force-pushed the notriddle/ranking-and-filtering branch from 969de46 to 19111af Compare December 10, 2023 01:31

notriddle force-pushed the notriddle/ranking-and-filtering branch from 19111af to bada62b Compare December 10, 2023 02:18

GuillaumeGomez reviewed Dec 10, 2023

notriddle force-pushed the notriddle/ranking-and-filtering branch from f2579f2 to e62031e Compare December 11, 2023 06:35