rustdoc-search: use set ops for ranking and filtering #118402

Merged
merged 4 commits into from
Dec 13, 2023


notriddle
Contributor

@notriddle notriddle commented Nov 28, 2023

This commit adds ranking and quick filtering to type-based search, which improves performance and orders results by how closely their type signatures match the query.

Preview

Profiler output: https://notriddle.com/rustdoc-html-demo-6/profile-8/index.html

Preview: https://notriddle.com/rustdoc-html-demo-6/ranking-and-filtering-v2/std/index.html

Motivation

If I write a query like str -> String, a lot of functions come up. That's to be expected, but String::from should come up on top, and it doesn't right now. This is because the sorting algorithm is based on the function's name, and doesn't consider the type signature at all. slice::join even comes up above it!

To fix this, the sorting should take into account the function's signature, and the closer match should come up on top.

Guide-level description

When searching by type signature, functions with a "closer" match will show up above functions that match less precisely.

Reference-level explanation

Function signature search works in two major phases:

  • A compact "fingerprint," based on the bloom filter technique, is used to check for matches and to estimate the distance. It sometimes has false positive matches, but it operates on 128 bits of contiguous memory and requires no backtracking, so it performs a lot better than real unification.

    The fingerprint represents the set of items in the type signature, but it does not represent nesting, and it ignores when the same item appears more than once.

    The result is rejected if any query bits are absent in the function, or if the distance is higher than the current maximum and 200 results have already been found.

  • The second step performs unification. This is where nesting and true bag semantics are taken into account, and it has no false positives. It uses a recursive, backtracking algorithm.

    The result is rejected if any query elements are absent in the function.
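
The fingerprint phase can be sketched roughly as follows. This is a minimal illustration, not the code from `search.js`: the names `hashName`, `buildFingerprint`, and `compareFingerprints` are hypothetical, the string hash is a toy stand-in, and the layout (three 32-bit filter words plus one word for the distinct-item count) is an assumption chosen to fit the 128 bits mentioned above.

```javascript
// Hypothetical sketch, not the actual search.js implementation.
const BLOOM_WORDS = 3; // 96 filter bits; a fourth word holds the item count

// Toy 32-bit FNV-1a string hash, used here only as a stand-in.
function hashName(name) {
    let h = 0x811c9dc5;
    for (let i = 0; i < name.length; i += 1) {
        h ^= name.charCodeAt(i);
        h = Math.imul(h, 0x01000193);
    }
    return h >>> 0;
}

// Builds a fingerprint from the distinct items of a signature.
// Nesting and repeated items are deliberately not represented.
function buildFingerprint(items) {
    const fp = new Uint32Array(BLOOM_WORDS + 1);
    const distinct = new Set(items);
    for (const item of distinct) {
        const bit = hashName(item) % (BLOOM_WORDS * 32);
        fp[bit >>> 5] |= 1 << (bit & 31);
    }
    fp[BLOOM_WORDS] = distinct.size;
    return fp;
}

// Returns null if any query bit is absent from the function's filter.
// Otherwise returns the function's distinct-item count as the distance.
function compareFingerprints(fnFp, queryFp) {
    for (let w = 0; w < BLOOM_WORDS; w += 1) {
        if (((fnFp[w] & queryFp[w]) ^ queryFp[w]) !== 0) {
            return null;
        }
    }
    return fnFp[BLOOM_WORDS];
}
```

A non-null result only means the function might match. Two different items can hash to the same bit, so unification still has to confirm the candidate.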

Drawbacks

This makes the code bigger.

More than that, this design is a subtle trade-off. It makes the cases I've tested against measurably faster, but it's not clear how well this extends to other crates with potentially more functions and fewer types.

The more complex things get, the more important it is to gather a good set of data to test with (this is arguably more important than the actual benchmarking infrastructure right now).

Rationale and alternatives

Throwing a bloom filter in front makes it faster.

More than that, the fingerprint lets the system not only check for potential matches, but also get an accurate distance function without needing to do unification. That way it can skip unification even on items that have the needed elements, as long as they have more items than the current maximum.
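
A rough sketch of that early-rejection rule, under stated assumptions: `filterCandidates`, the candidate shape `{ name, distance }`, and the `unify` callback are all hypothetical names for illustration, and `distance` is taken to be `null` when the fingerprint check has already failed.

```javascript
// Hypothetical sketch of skipping unification using only the fingerprint distance.
// `unify` stands in for the expensive, exact, backtracking check.
function filterCandidates(candidates, unify, maxResults) {
    const results = [];
    let maxDistance = 0;
    let unifyCalls = 0;
    for (const cand of candidates) {
        // Rejected by the fingerprint: some query bit was missing.
        if (cand.distance === null) {
            continue;
        }
        // Enough results already, and this one would rank below all of them.
        if (results.length >= maxResults && cand.distance > maxDistance) {
            continue;
        }
        unifyCalls += 1;
        if (!unify(cand)) {
            continue;
        }
        results.push(cand);
        maxDistance = Math.max(maxDistance, cand.distance);
    }
    // Closer matches (smaller distance) sort first.
    results.sort((a, b) => a.distance - b.distance);
    return { results, unifyCalls };
}
```

The point of the design is visible in `unifyCalls`: a candidate that passes the bit check but is already too far away never reaches the expensive step.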

If I didn't want to be able to cheaply do set operations on the fingerprint, a cuckoo filter is supposed to have better performance. But the nice bit-banging set intersection doesn't work AFAIK.

I also looked into minhashing, but since it's actually an unbiased estimate of the similarity coefficient, I'm not sure how it could be used to skip unification (I wouldn't know if the estimate was too low or too high).

This function actually uses the number of distinct items as its "distance function." This should give the same results that it would have gotten from a Jaccard Distance $1-\frac{|F\cap{}Q|}{|F\cup{}Q|}$, while being cheaper to compute. This is because:

  • The function $F$ must be a superset of the query $Q$, so their union is just $F$ and their intersection is $Q$, and the distance can be reduced to $1-\frac{|Q|}{|F|}$.

  • There are no magic thresholds. These values are only being used to compare against each other while sorting (and, if 200 results are found, to compare with the maximum match). This means we only care if one value is bigger than the other, not what its actual value is, and since $Q$ is the same for everything, it can be safely left out, reducing the formula to $1-\frac{1}{|F|} = \frac{|F|}{|F|}-\frac{1}{|F|} = \frac{|F|-1}{|F|}$. This grows monotonically with $|F|$, so, since the values are only being compared with each other, $|F|$ is fine.
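
The ordering argument can be checked with a small sketch. The function names here are hypothetical, and the inputs are plain item counts rather than real signatures. Because $Q \subseteq F$, the Jaccard distance only depends on $|Q|$ and $|F|$.

```javascript
// With Q a subset of F: |F ∩ Q| = |Q| and |F ∪ Q| = |F|.
function jaccardDistance(querySize, fnSize) {
    return 1 - querySize / fnSize;
}

// Rank candidate functions (given by their distinct-item counts)
// using the full Jaccard distance against a fixed query size.
function rankByJaccard(querySize, fnSizes) {
    return [...fnSizes].sort(
        (a, b) => jaccardDistance(querySize, a) - jaccardDistance(querySize, b)
    );
}

// Rank the same candidates using only |F|.
function rankBySize(fnSizes) {
    return [...fnSizes].sort((a, b) => a - b);
}
```

For a fixed query size, both rankings agree, which is why the cheaper count is enough for sorting.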

Prior art

This is significantly different from how Hoogle does it.
It doesn't account for order, and it gives no special treatment to nesting, though Box<t> still counts as two items, while t is only one.

This should give the same results that it would have gotten from a Jaccard Distance $1-\frac{|A\cap{}B|}{|A\cup{}B|}$, while being cheaper to compute.

Unresolved questions

[] and (), the slice/array and tuple/unit operators, are ignored while building the signature for the query. This is because they match more than one thing, making them ambiguous. Unfortunately, this also makes them a performance cliff. Is this likely to be a problem?

Right now, the system just stashes the type distance into the same field that the Levenshtein distance normally goes in. This means exact query matches show up on top (for example, if you have a function like fn nothing(a: Nothing, b: i32), then searching for nothing will show it on top even if there's another function with fn bar(x: Nothing) that's technically a closer match in type signature).

Future possibilities

It should be possible to adopt more sorting criteria to act as a tie breaker, which could be determined during unification.

@rustbot
Collaborator

rustbot commented Nov 28, 2023

r? @fmease

(rustbot has picked a reviewer for you, use r? to override)

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-rustdoc Relevant to the rustdoc team, which will review and decide on the PR/issue. labels Nov 28, 2023
@rustbot
Collaborator

rustbot commented Nov 28, 2023

Some changes occurred in HTML/CSS/JS.

cc @GuillaumeGomez, @jsha

@notriddle
Contributor Author

r? @GuillaumeGomez

@rustbot rustbot assigned GuillaumeGomez and unassigned fmease Nov 28, 2023
@rust-log-analyzer

This comment has been minimized.

@notriddle notriddle force-pushed the notriddle/ranking-and-filtering branch from c56f632 to 969de46 Compare November 28, 2023 07:03
@notriddle notriddle force-pushed the notriddle/ranking-and-filtering branch from 969de46 to 19111af Compare December 10, 2023 01:31
@rust-log-analyzer

This comment has been minimized.

@notriddle notriddle force-pushed the notriddle/ranking-and-filtering branch from 19111af to bada62b Compare December 10, 2023 02:18
@GuillaumeGomez
Member

Overall, I have no issue with this change. But for future optimizations that require increasing code size, I'd really like us to have a wide enough test suite to ensure that they don't negatively impact other cases. Considering this, I'm wondering if we should start by writing these performance checks (even if not run in rustc's CI) before landing any new performance improvement PR. What do you think?

@notriddle
Contributor Author

notriddle commented Dec 10, 2023

That does seem like a good idea, yeah.

Any suggestions about where to start?

@notriddle notriddle force-pushed the notriddle/ranking-and-filtering branch from f2579f2 to e62031e Compare December 11, 2023 06:35
@notriddle
Contributor Author

I've added a couple more benchmarks:

  • stm32f4
  • ripgrep

I've also added more queries to the arti benchmark, and improved the bloom filter hashing (it gets a lot fewer false positives now, and the many nearly identical queries now all run in similar time).

@rust-log-analyzer

This comment has been minimized.

@notriddle notriddle force-pushed the notriddle/ranking-and-filtering branch 2 times, most recently from 7bb1b2a to eb02d91 Compare December 11, 2023 18:20
@rust-log-analyzer

This comment has been minimized.

@notriddle notriddle force-pushed the notriddle/ranking-and-filtering branch from eb02d91 to 7b1f80c Compare December 11, 2023 19:28
@rust-log-analyzer

This comment has been minimized.

@GuillaumeGomez
Member

Thanks! r=me once CI pass

matthiaskrgr added a commit to matthiaskrgr/rust that referenced this pull request Dec 12, 2023
…s, r=notriddle

Clean up variables in `search.js`

While reviewing rust-lang#118402, I saw a few small cleanups that were needed, mostly around variable creation.

r? `@notriddle`
GuillaumeGomez added a commit to GuillaumeGomez/rust that referenced this pull request Dec 12, 2023
@notriddle
Contributor Author

@bors r=GuillaumeGomez

@bors
Contributor

bors commented Dec 12, 2023

📌 Commit 34b7265 has been approved by GuillaumeGomez

It is now in the queue for this repository.

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Dec 12, 2023
workingjubilee added a commit to workingjubilee/rustc that referenced this pull request Dec 13, 2023
rust-timer added a commit to rust-lang-ci/rust that referenced this pull request Dec 13, 2023
Rollup merge of rust-lang#118886 - GuillaumeGomez:clean-up-search-vars, r=notriddle

@bors
Contributor

bors commented Dec 13, 2023

☔ The latest upstream changes (presumably #118900) made this pull request unmergeable. Please resolve the merge conflicts.

@bors bors added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. labels Dec 13, 2023
This function dates back to 9a45c9d and
seems to have been made obsolete when `addIntoResult` grew the ability to
check the levenshtein distance matching with commit
ba824ec.
The hash changes are based on some tests with `arti` and various
specific queries, aimed at reducing the false positive rate.

Sorting the query elements so that generics always come first is
instead aimed at reducing the number of Map operations on mgens,
assuming if the bloom filter does find a false positive, it'll
be able to reject the row without having to track a mapping.

- https://hur.st/bloomfilter/?n=3&p=&m=96&k=6

  Different functions have different amounts of inputs, and
  unification isn't very slow anyway, so figuring out a single
  ideal number of hash functions is nasty, but 6 keeps things
  low even up to 10 inputs.

- https://web.archive.org/web/20210927123933/https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.72.2442&rep=rep1&type=pdf

  This is the `h1` and `h2`, both derived from `h0`.
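
As a rough illustration of deriving several bit positions from two hashes: the sketch below uses the sizes from the calculator link above (96 bits, 6 hash functions), but the mixer, the constants, and the names `mix32`, `bitPositions`, `addToFilter`, and `mightContain` are assumptions for this example and not the functions used in `search.js`.

```javascript
// Hypothetical sketch of double hashing for a small Bloom filter.
const FILTER_BITS = 96;
const HASH_COUNT = 6;

// Generic 32-bit integer mixer, used here only as a stand-in.
function mix32(x) {
    x = Math.imul(x ^ (x >>> 16), 0x45d9f3b);
    x = Math.imul(x ^ (x >>> 16), 0x45d9f3b);
    return (x ^ (x >>> 16)) >>> 0;
}

// Derives h1 and h2 from h0, then combines them as h1 + i * h2
// to get HASH_COUNT bit positions without computing more hashes.
function bitPositions(itemId) {
    const h0 = itemId >>> 0;
    const h1 = mix32(h0);
    const h2 = mix32(h0 ^ 0x9e3779b9);
    const positions = [];
    for (let i = 0; i < HASH_COUNT; i += 1) {
        positions.push((h1 + i * h2) % FILTER_BITS);
    }
    return positions;
}

// Sets the item's bits in a filter stored as three 32-bit words.
function addToFilter(filter, itemId) {
    for (const pos of bitPositions(itemId)) {
        filter[pos >>> 5] |= 1 << (pos & 31);
    }
}

// True if every bit for the item is set (may be a false positive).
function mightContain(filter, itemId) {
    return bitPositions(itemId).every(
        (pos) => (filter[pos >>> 5] & (1 << (pos & 31))) !== 0
    );
}
```

Usage would look like `const f = new Uint32Array(3); addToFilter(f, 42); mightContain(f, 42);`, where the last call returns `true`.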
@notriddle notriddle force-pushed the notriddle/ranking-and-filtering branch from 34b7265 to bec6672 Compare December 13, 2023 17:38
@notriddle
Contributor Author

@bors r=GuillaumeGomez

Rebased with the variable changes.

@bors
Contributor

bors commented Dec 13, 2023

📌 Commit bec6672 has been approved by GuillaumeGomez

It is now in the queue for this repository.

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Dec 13, 2023
@bors
Contributor

bors commented Dec 13, 2023

⌛ Testing commit bec6672 with merge eeff92a...

@bors
Contributor

bors commented Dec 13, 2023

☀️ Test successful - checks-actions
Approved by: GuillaumeGomez
Pushing eeff92a to master...

@bors bors added the merged-by-bors This PR was explicitly merged by bors. label Dec 13, 2023
@bors bors merged commit eeff92a into rust-lang:master Dec 13, 2023
12 checks passed
@notriddle notriddle deleted the notriddle/ranking-and-filtering branch December 13, 2023 23:42
@rustbot rustbot added this to the 1.76.0 milestone Dec 13, 2023
@rust-timer
Collaborator

Finished benchmarking commit (eeff92a): comparison URL.

Overall result: no relevant changes - no action needed

@rustbot label: -perf-regression

Instruction count

This benchmark run did not return any relevant results for this metric.

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 2.4% | [0.7%, 4.4%] | 4 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | 2.4% | [0.7%, 4.4%] | 4 |

Cycles

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 0.8% | [0.8%, 0.8%] | 1 |
| Regressions ❌ (secondary) | 3.3% | [2.2%, 4.1%] | 3 |
| Improvements ✅ (primary) | -0.5% | [-0.5%, -0.5%] | 1 |
| Improvements ✅ (secondary) | -2.0% | [-2.0%, -2.0%] | 1 |
| All ❌✅ (primary) | 0.2% | [-0.5%, 0.8%] | 2 |

Binary size

This benchmark run did not return any relevant results for this metric.

Bootstrap: 671.752s -> 671.988s (0.04%)
Artifact size: 312.39 MiB -> 312.40 MiB (0.00%)

celinval added a commit to celinval/rust-dev that referenced this pull request Jun 4, 2024
Update Rust toolchain from nightly-2023-12-13 to nightly-2023-12-14
without any other source changes.
This is an automatically generated pull request. If any of the CI checks
fail, manual intervention is required. In such a case, review the
changes at https://github.com/rust-lang/rust from
rust-lang@3340d49
up to
rust-lang@eeff92a.
The log for this commit range is:
rust-lang@eeff92ad32 Auto merge of
rust-lang#118402 - notriddle:notriddle/ranking-and-filtering, r=GuillaumeGomez
rust-lang@a90372c6e8 Auto merge of
rust-lang#118213 - Urgau:check-cfg-diagnostics-rustc-cargo, r=petrochenkov
rust-lang@2862500152 Auto merge of
rust-lang#118919 - matthiaskrgr:rollup-02udckl, r=matthiaskrgr
rust-lang@bec6672984 rustdoc-search:
clean up handleSingleArg type handling
rust-lang@9dfcf131b3 rustdoc-search:
better hashing, faster unification
rust-lang@9a9695a052 rustdoc-search: use
set ops for ranking and filtering
rust-lang@fd1d256d61 rustdoc-search:
remove the now-redundant `validateResult`
rust-lang@251d1af0d2 Rollup merge of
rust-lang#118906 - Kobzol:bootstrap-is-windows, r=petrochenkov
rust-lang@666353e7ba Rollup merge of
rust-lang#118883 - HosseinAssaran:patch-1, r=fmease
rust-lang@1dd36119d0 Rollup merge of
rust-lang#118871 - tmiasko:coroutine-maybe-uninit-fields, r=compiler-errors
rust-lang@dbc6ec6636 Rollup merge of
rust-lang#118759 - compiler-errors:bare-unit-structs, r=petrochenkov
rust-lang@f6617d050d Remove dangling
check-cfg ui tests files
rust-lang@5345a166fe Add more suggestion
to unexpected cfg names and values
rust-lang@7176b8babd Auto merge of
rust-lang#118894 - dtolnay:bootstrapwrite, r=onur-ozkan
rust-lang@c3def263a4 Auto merge of
rust-lang#118870 - Enselic:rustc_passes-query-stability, r=compiler-errors
rust-lang@56d25ba5ea Auto merge of
rust-lang#118500 - ZetaNumbers:tcx_hir_refactor, r=petrochenkov
rust-lang@2fdd9eda0c Auto merge of
rust-lang#118534 - RalfJung:extern-type-size-of-val, r=WaffleLapkin
rust-lang@066e6ffa02 Fix LLD thread flag
selection for Windows targets
rust-lang@c5208518fa Add
`TargetSelection::is_windows` method
rust-lang@f651b436ce Auto merge of
rust-lang#117050 - c410-f3r:here-we-go-again, r=petrochenkov
rust-lang@9f1bfe53b6 Auto merge of
rust-lang#118900 - workingjubilee:rollup-wkv9hq1, r=workingjubilee
rust-lang@f9078a40ee Rollup merge of
rust-lang#118891 - compiler-errors:async-gen-blocks, r=eholk
rust-lang@4583a0134f Rollup merge of
rust-lang#118889 - matthiaskrgr:compl_2023_2, r=WaffleLapkin
rust-lang@df0686b629 Rollup merge of
rust-lang#118887 - smoelius:patch-1, r=Nilstrieb
rust-lang@2f937c720d Rollup merge of
rust-lang#118886 - GuillaumeGomez:clean-up-search-vars, r=notriddle
rust-lang@5308733112 Rollup merge of
rust-lang#118885 - matthiaskrgr:compl_2023, r=compiler-errors
rust-lang@89d4a9bee9 Rollup merge of
rust-lang#118884 - matthiaskrgr:auszweimacheins, r=Nadrieril
rust-lang@18e0966f39 Rollup merge of
rust-lang#118873 - lukas-code:fix_waker_getter_tracking_issue_number,
r=workingjubilee
rust-lang@0430782d1d Rollup merge of
rust-lang#118872 - GuillaumeGomez:codeblock-attr-lint, r=notriddle
rust-lang@a33f1a3d3a Rollup merge of
rust-lang#118864 - farnoy:masked-load-store-fixes, r=workingjubilee
rust-lang@2d1d443d7f Rollup merge of
rust-lang#118858 - mu001999:dead_code/clean, r=cuviper
rust-lang@77d1699756 Auto merge of
rust-lang#116438 - ChrisDenton:truncate, r=thomcc
rust-lang@b30e94b7bb Unbreak non-unix
non-windows bootstrap
rust-lang@1d78ce681e Actually parse async
gen blocks correctly
rust-lang@2a1acc26a0 Update
compiler/rustc_pattern_analysis/src/constructor.rs
rust-lang@3795cc8eb0 more
clippy::complexity fixes
rust-lang@046f2dea33 Typo
rust-lang@58327c10c5 Add a test for a
codeblock with multiple invalid attributes
rust-lang@f1342f30a5 Clean up variables
in `search.js`
rust-lang@d707461a1a clippy::complexity
fixes
rust-lang@6892fcd690 simplify merging of
two vecs
rust-lang@a2ffff0708 Change a typo
mistake in the-doc-attribute.md
rust-lang@f813ccd784 also add a Miri test
rust-lang@edcb7aba6b also test projecting
to some sized fields at non-zero offset in structs with an extern type
tail
rust-lang@a47416beb5 test that both
size_of_val and align_of_val panic
rust-lang@bb0fd665a8 Follow guidelines
for lint suggestions
rust-lang@98aa20b0a7 Add test for `rustX`
codeblock attribute
rust-lang@d3cb25f4cf Add `rustX` check to
codeblock attributes lint
rust-lang@24f009c5e5 Move some methods
from `tcx.hir()` to `tcx`
rust-lang@04f3adb4a7 fix `waker_getters`
tracking issue number
rust-lang@e9b16cc2c5 rustc_passes:
Enforce `rustc::potential_query_instability` lint
rust-lang@95b5a80f47 Fix alignment passed
down to LLVM for simd_masked_load
rust-lang@fb32eb3529 Clean up
CodeBlocks::next code
rust-lang@df227f78c6 make it more clear
what comments refer to; avoid dangling unaligned references
rust-lang@b9c9b3e7a2 remove a cranelift
test that doesn't make sense any more
rust-lang@9ef1e35166 reject projecting to
fields whose offset we cannot compute
rust-lang@b1613ebc43 codegen: panic when
trying to compute size/align of extern type
rust-lang@6c0dbb8cc6 Remove dead codes in
core
rust-lang@a48cebc4b8 Coroutine variant
fields can be uninitialized
rust-lang@d473bdfdc3 Support bare unit
structs in destructuring assignments
rust-lang@0278505691 Attempt to try to
resolve blocking concerns
rust-lang@c6f7aa0eea Make File::create
work on Windows hidden files

Co-authored-by: celinval <[email protected]>
Labels
merged-by-bors This PR was explicitly merged by bors. S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. T-rustdoc Relevant to the rustdoc team, which will review and decide on the PR/issue.
7 participants