Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracking issue for entry-like methods on HashSet #60896

Open
cuviper opened this issue May 16, 2019 · 42 comments
Open

Tracking issue for entry-like methods on HashSet #60896

cuviper opened this issue May 16, 2019 · 42 comments
Labels
A-collections Area: std::collections. B-unstable Blocker: Implemented in the nightly compiler and unstable. C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. Libs-Tracked Libs issues that are tracked on the team's project board. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@cuviper
Copy link
Member

cuviper commented May 16, 2019

  • HashSet::get_or_insert
  • HashSet::get_or_insert_with

These provide a simplification of the Entry API for HashSet, with
names chosen to match the similar methods on Option.

ref: #60894

@cuviper cuviper added T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. B-unstable Blocker: Implemented in the nightly compiler and unstable. C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. labels May 16, 2019
@bluss
Copy link
Member

bluss commented Oct 4, 2019

(For get_or_insert_with) Even though there is a borrow relation, there is no guarantee that the lookup key and inserted key are equivalent, I guess that's an acceptable loss? But it would be important to point out in that case.

Consider s[..1].to_string() as function

@cuviper
Copy link
Member Author

cuviper commented Oct 4, 2019

Even though there is a borrow relation, there is no guarantee that the lookup key and inserted key are equivalent, I guess that's an acceptable loss?

Borrow does require equivalence, stating, "In particular Eq, Ord and Hash must be equivalent for borrowed and owned values: x.borrow() == y.borrow() should give the same result as x == y." So I think that's fine for get_or_insert, at least.

For get_or_insert_with, yes, it's trickier -- the user-provided function could return anything. We could choose to (debug) assert that the value returned is still equal. Or we can just make this part of the documented contract, on your honor, just like the relationship between Borrow, Eq, etc. There's no concern with memory safety if you violate this, but the collection will probably act strangely having inserted a different value in that place.

@lnicola
Copy link
Member

lnicola commented Oct 4, 2019

We could choose to (debug) assert that the value returned is still equal.

That might not be too useful if std is compiled without debug assertions. Agreed otherwise, it's the user's responsibility to do something sane.

@cuviper
Copy link
Member Author

cuviper commented Oct 4, 2019

That might not be too useful if std is compiled without debug assertions.

AIUI generic code gets the debug-assertion setting of the place where it's monomorphized.

@RustyYato
Copy link
Contributor

@cuviper that doesn't sound right because macros are expanded way before monomorphozation

@cuviper
Copy link
Member Author

cuviper commented Oct 4, 2019

Oh, maybe it's only overflow checks that are affected that way. Nevermind then.

@bluss
Copy link
Member

bluss commented Oct 4, 2019

It seems like a misuse of the entry, to insert a different key than was looked up for (a misuse std should not allow). But I can't find a way to "break" this using some test code. From a cursory look I think the current implementation can't be fooled, it hashes and inserts it as if it is a new key.

Is that intended? It seems like that fact supercharges this API - get_or_insert_with, but it depends on implementation details. I'd assume the vacant entry should be allowed to cache the hash and bucket location if it wanted to? @Amanieu

@Amanieu
Copy link
Member

Amanieu commented Oct 7, 2019

You can abuse this to insert the same element multiple times in a set: https://play.rust-lang.org/?version=nightly&mode=debug&edition=2018&gist=c4d27e6b11b646030d1370351887ebc1

Although this may arguably be a bug in RawEntry which allows you to do it.

@Amanieu
Copy link
Member

Amanieu commented Oct 8, 2019

We have two options here:

  • Add an assert which checks that the new key is equal to the key used for the lookup.
  • Allow sets to contain duplicate keys but consider it a logic error (the underlying hashmap handles duplicate keys just fine).

@bluss
Copy link
Member

bluss commented Oct 9, 2019

The current implementation has a benign behaviour in this scenario, and that's a problem, because users can rely on it, and then we are locked in to the implementation specific behaviour (and can no longer change the hashmap impl).

The following should also be an option on the table:

  • Remove get_or_insert_with because it makes it easy to do the wrong thing to the hashmap, and/or accidentally rely on implementation details

An assert that defends against relying on implementation details is also an option. (Exact handling of duplicate keys or if this insert overwrites or makes a new key).

@Amanieu
Copy link
Member

Amanieu commented Oct 9, 2019

Note that HashMap is required to safely handle duplicate keys since it needs to at least not trigger UB if the key has a broken Eq or Hash implementation (which only uses safe code).

@bluss
Copy link
Member

bluss commented Oct 9, 2019

I agree. There are several places misuse of get_or_insert_with can end up. Inserting a key that's not equivalent to the lookup key has no negative effects right now (assuming it's not a duplicate of anything), but it could have with a different hashmap. (it could result in a key that's impossible to look up). That's an example of implementation specific behaviour that's best to not expose IMO.

@cuviper
Copy link
Member Author

cuviper commented Oct 9, 2019

I expect most of the time folks would just want a lazy to_owned, so we could add that:

    pub fn get_or_insert_owned<Q: ?Sized>(&mut self, value: &Q) -> &T
        where T: Borrow<Q>,
              Q: Hash + Eq + ToOwned<Owned = T>,
    {
        self.map.raw_entry_mut().from_key(value)
            .or_insert_with(|| (value.to_owned(), ())).0
    }

We would only be relying on the contract of equality between borrowed and owned values.

@cuviper
Copy link
Member Author

cuviper commented Oct 9, 2019

There are some T: Borrow<Q> relationships that don't have ToOwned in reverse though, like if the key is Box, Rc, Arc, etc.

@yoann-arseneau
Copy link

There are a few missing features if the goal is to emulate HashMap's entry-like functions. The current implementation can't easily spoof Hash and Eq, and it can't fail gracefully.

Consider the following entry-based HashMap code

match map.raw_entry_mut().from_hash(hash, |k| is_same(k)) {
    RawMutEntry::Occupied(entry) =>
        entry.get(),
    RawMutEntry::Vacant(entry) =>
        entry.insert(produce_key()?, produce_value()?).0
}

Versus the following entry-based HashSet code

set.get_or_insert_with(like_a_value, |v| produce_value().unwrap())

In the HashMap implementation you can completely spoof Hash and Eq upon lookup and you do not need an actual key and value until insertion. Furthermore, you can fail gracefully however you like (including return and ? operator) since you control the program flow.

In the HashSet implementation, you need a Q: Borrow<T> (may need to create a single-use type) and the producer function must either insert successfully or exit exotically (panic!, exit, abort).

Allowing for graceful failure could be as simple as the following slight changes.

fn get_or_insert_with<Q: ?Sized, F>(&mut self, value: &Q, f: F) -> Option<&T>
where
    T: Borrow<Q>,
    Q: Hash + Eq,
    F: FnOnce(&Q) -> Option<T>

-or-

fn get_or_insert_with<Q: ?Sized, F, E>(&mut self, value: &Q, f: F) -> Result<&T, E>
where
    T: Borrow<Q>,
    Q: Hash + Eq,
    F: FnOnce(&Q) -> Result<T, E>

Spoofing Hash and Eq would require a significant change to the argument list or creating a new method.

@cuviper
Copy link
Member Author

cuviper commented Dec 16, 2019

Note that the HashMap::raw_entry[_mut] API is still unstable too. The goal here is like plain entry.

@yoann-arseneau
Copy link

That is a very good point; spoofing is clearly out of scope.

The ability to fail gracefully, however, is part of entry.

match map.entry(key) {
    Entry::Occupied(entry) =>
        entry.get(),
    Entry::Vacant(entry) =>
        entry.insert(produce_value()?)
}

One might expect the following to work:

match set.entry(like_a_value) {
    Entry::Occupied(entry) =>
        entry.get(),
    Entry::Vacant(entry) =>
        entry.insert(produce_value()?)
}

It would be much closer to the entry feature of HashMap, and would allow more flexibility, such as failing gracefully.

@cuviper
Copy link
Member Author

cuviper commented Dec 16, 2019

I think that will also fall afoul of the concerns about get_or_insert_with inserting a value that doesn't actually hash or compare equally. I've opened #67358 for get_or_insert_owned, and I'm starting to think the more flexible options should be left to a HashSet::raw_entry_mut(), where "raw" indicates the potential for misuse (but not unsafety).

@yoann-arseneau
Copy link

It seems that the presence of get_or_insert_with made me misunderstand the scope of this issue, which supports the notion that it doesn't belong.

Centril added a commit to Centril/rust that referenced this issue Jan 9, 2020
…sKalbertodt

Add HashSet::get_or_insert_owned

This is an extension for tracking issue rust-lang#60896. The more-general `get_or_insert_with` has potential for misuse, so we might remove it, but I think `get_or_insert_owned` covers most use cases.
JohnTitor added a commit to JohnTitor/rust that referenced this issue Jan 9, 2020
…sKalbertodt

Add HashSet::get_or_insert_owned

This is an extension for tracking issue rust-lang#60896. The more-general `get_or_insert_with` has potential for misuse, so we might remove it, but I think `get_or_insert_owned` covers most use cases.
@jonas-schievink jonas-schievink added the A-collections Area: std::collections. label Apr 21, 2020
@KodrAus KodrAus added the Libs-Tracked Libs issues that are tracked on the team's project board. label Jul 29, 2020
@jonas-schievink
Copy link
Contributor

It seems that these methods do not handle the following case without looking up x twice, unlike the entry API for maps:

  • Is x in the set?
  • If yes, return
  • If not, does $condition hold?
    • If yes, insert x into the set
    • If not, return

@cuviper
Copy link
Member Author

cuviper commented Aug 23, 2020

@jonas-schievink true, these are only like an unconditional map.entry(_).or_insert(_).

@Stargateur
Copy link
Contributor

why not just copy entry API ?

@cuviper
Copy link
Member Author

cuviper commented Nov 3, 2020

@Stargateur that's possible, but it would be a much larger API footprint to review, with a whole new Entry type for sets.

Previously, the standard HashSet directly wrapped HashMap and used the raw entry API of maps for these new methods. Now HashSet wraps hashbrown::HashSet, so extensive changes here will have to be implemented in hashbrown first.

@Stargateur
Copy link
Contributor

Stargateur commented Nov 3, 2020

I don't think technical detail such as this should be a problem, hashbrown is suppose to help not to be a problem for rust std. In the end, I did HashMap<T, ()> that quite disappointing to end up doing that.

I see no good reason to not do a entry API for hashset, with few change.

Currently, this issue don't solve my need.

@dylanede
Copy link
Contributor

What about the safe (wrt. logic errors) subset of the HashSet::get_or_insert_with use cases of inserting from a Cow<T>? This would be equivalent to a more efficient version of the current stable code:

if !set.contains(cow_data.as_ref()) {
    set.insert(cow_data.into_owned());
}

With HashSet::get_or_insert_with this would become

set.get_or_insert_with(cow_data.as_ref(), || cow_data.into_owned());

But perhaps there should be a specialised version of this, e.g. HashSet::get_or_insert_cow.

@Dushistov
Copy link
Contributor

New "insert" methods for some reason not return was item inserted.
Can they return tuple ?

 // get_or_insert(&mut self, value: T) -> (&T, bool)
 let (item, insert_done) = self.get_or_insert(item);

@matklad
Copy link
Member

matklad commented Dec 17, 2021

The current signature

pub fn get_or_insert(&mut self, value: T) -> &T

feels a bit odd. It at least needs to be

pub fn get_or_insert(&mut self, value: T) -> &mut T

Even in this form, that's an artificially restricted API. There's a fully elaborate, albeit unwieldy form, which both indicates whether insert happened and returns value if it didn't

pub fn get_or_insert(&mut self, value: T) -> (&mut T, Option<T>)

@Stargateur
Copy link
Contributor

Stargateur commented Dec 17, 2021

@matklad I think you forget it's a HashSet you are no allow to mutate the value and even with interior mutability it's a programming error to make mutation that would change the Hash result:

It is a logic error for an item to be modified in such a way that the item’s hash, as determined by the Hash trait, or its equality, as determined by the Eq trait, changes while it is in the set. This is normally only possible through Cell, RefCell, global state, I/O, or unsafe code. The behavior resulting from such a logic error is not specified, but will not result in undefined behavior. This could include panics, incorrect results, aborts, memory leaks, and non-termination.

but I agree it's should return the previous value:

pub fn get_or_insert(&mut self, value: T) -> (&T, Option<T>)

Is there any function in std that do something similar to copy the order semantic in the tuple ? (cur, previous) vs (previous, cur)

@matklad
Copy link
Member

matklad commented Dec 17, 2021

There are cases where mutating an item doesn't change it's equality. Something like this:

struct Person {
  /// Unique integer id, Eq, Hash and Ord are defined in terms of ids 
  id: u32, 
  email: String,
}

impl PartialEq for Person {
    fn eq(&self, other: &Person) -> bool { self.id == other.id }
}

Storing a HashMap<K, V> as HashSet<KV> commes up surprisingly often -- a lot of times when you want to store a collection of entities indexed by some attribute, an attribute is already a field of an entity.

Though, I see that we never expose &mut access to keys anywhere. I was sure that we have HashSet::get_mut already, which is not the case. I agree that it would be bad to set a precedent for exclusive access in an unrelated API.

@Stargateur
Copy link
Contributor

MmMmMmM, good point.

But look like very error prone, user could easily make mistake since they got a mut ref so easily, the only reason to do that would be to be able to use HashSet specific function like intersection(), I always wonder why these function was not also available for HashMap.

a lot of times when you want to store a collection of entities indexed by some attribute, an attribute is already a field of an entity.

I would really advice to just split or just clone the Id and use a HashMap.

@kornelski
Copy link
Contributor

@Stargateur that's possible, but it would be a much larger API footprint to review, with a whole new Entry type for sets.

I think having an Entry API would be great. It's non-trivial, but it already exists for HashMap, so for me it's a "sunk cost" of learning how to use it. Now I expect it to exist for consistency with HashMap, especially that HashSet is like HashMap<K, ()>.

@mxgrey
Copy link

mxgrey commented Mar 22, 2022

I think an Entry API would be a really elegant way to achieve what many commenters are asking for without relying on tuples that have funny looking signatures. Having the consistency with HashMap makes it especially compelling.

I have a use a case where I'd like to check for an existing equivalent entry in the HashSet before deciding whether I want to replace that entry or not. The current HashSet implementation would require me to perform the hashing function and lookup twice. But the Entry API in the HashMap has perfect efficiency and semantics, so for now I'm going to use a HashMap where I manually hash my objects and convert them into (key, value) pairs. It would be nice to switch to a HashSet if/when it has a suitable API.

@cuviper
Copy link
Member Author

cuviper commented Jun 16, 2022

FYI, I've opened a pull request for a set Entry API in hashbrown, which may be propagated to std:
rust-lang/hashbrown#342

@Nemo157
Copy link
Member

Nemo157 commented Jul 13, 2022

I would really like a non-getting variant of get_or_insert_owned, for cases like a HashSet<PathBuf> where you want to try inserting a &Path and detect whether it already exists

pub fn insert_owned<Q: ?Sized>(&mut self, value: &Q) -> bool
        where T: Borrow<Q>,
              Q: Hash + Eq + ToOwned<Owned = T>,

and if the alternative of a full Entry API is followed, then I think Entry::or_insert should return a bool for consistency with HashSet::insert (though, I forgot Entry doesn't allow the borrowed form, so this wouldn't help for this case).

@shepmaster
Copy link
Member

shepmaster commented Jul 13, 2022

@Nemo157 I think that’s the usual use case for https://doc.rust-lang.org/std/collections/hash_map/struct.HashMap.html#method.raw_entry_mut, right? That would indicate that there should be a “raw entry” equivalent for sets as well as a “regular entry” equivalent.

@sgued
Copy link
Contributor

sgued commented Oct 2, 2022

Looking at rust-lang/hashbrown#342 it seems to me that the proposed entry API doesn't solve the issue of having to clone the element if you want to obtain a reference to it after inserting it.

@Amanieu
Copy link
Member

Amanieu commented Apr 8, 2024

In #123657 I am removing get_or_insert_with since there is no chance that this method will be accepted as-is.

This method is unsound because it allows inserting a key at the "wrong" position in a HashSet, which could result in it not appearing it future lookups or being inserted multiple times in the set.

Instead, HashSet::get_or_insert and HashSet::get_or_insert_owned should be preferred.

@cuviper
Copy link
Member Author

cuviper commented Apr 8, 2024

This method is unsound because it allows inserting a key at the "wrong" position in a HashSet, which could result in it not appearing it future lookups or being inserted multiple times in the set.

We need a better word than "unsound", because it doesn't expose memory unsafety in itself. It does however pose a problem if unsafe code can't rely on a well-behaved set for known-good key types. (i.e. no interior mutability or user types)

@Stargateur
Copy link
Contributor

Stargateur commented Apr 9, 2024

What about: get_or_insert_with could allow user to make the state of a HashMap invalid that could result in duplicate key or not found key and so we must remove it.

@CarlKCarlK
Copy link

The get_or_insert_owned method is almost perfect for me. I'd love a version of that let me know if the entry was already there. Here is my work around.

#[inline]
fn add_and_already_there(set_: &mut HashSet<String>, item: &str) -> bool {
    let initial_length = set_.len();
    set_.get_or_insert_owned(item);
    set_.len() != initial_length
}

I'm implementing the CVM counting algorithm that has been getting much attention.

@Neutron3529
Copy link
Contributor

Neutron3529 commented Jul 1, 2024

I've found the signature converts a &mut self to &T, which converts a mutable borrow to the immutable version, which should not be used, since before you dropped the &T, &mut self is considered as borrowed exclusively.

Such code should be fine, but compiler refused to compile it, since we cannot call contains while the hashset is borrowed exclusively.

#![feature(hash_set_entry)]
use std::collections::HashSet;

fn main() {
    let mut hs = HashSet::new();
    let str1 = "hello";
    let a = hs.get_or_insert(str1);
    if hs.contains(str1) {
        println!("{a}")
    }
}

I'm requesting a hybrid borrow mode in IRLO, which allow both borrow self exclusively or normally.

@ia0
Copy link
Contributor

ia0 commented Jul 6, 2024

I'm requesting a hybrid borrow mode in IRLO, which allow both borrow self exclusively or normally.

Copying my reply here too. I believe this problem could be fixed by providing a loss-less get_or_insert function:

/// Inserts a value in the hash set, if not already present.
///
/// Always returns a shared reference to the set and a shared reference to the entry.
/// The entry reference is either the inserted one when the result is `Ok`, or the existing
/// one when the result is `Err`. In this last case, the non-inserted entry is returned too.
fn get_or_insert(&mut self, value: T) -> (&Self, Result<&T, (&T, T)>);

crepererum added a commit to crepererum/arrow-datafusion that referenced this issue Aug 6, 2024
Fix accidental $O(n^2)$ in `collect_columns`.

There are the following ways to insert a clone into a hash set:

- **clone before check:** Basically `set.insert(x.clone())`. That's
  rather expensive if you have a lot of duplicates.
- **iter & clone:** That's what we do right now, but that's in $O(n^2)$.
- **check & insert:** Basically `if !set.contains(x) {set.insert(x.clone())}`
  That requires two hash probes though.
- **entry-based API:** Sadly the stdlib doesn't really offer any
  handle/entry-based APIs yet (see rust-lang/rust#60896 ),
  but `hashbrown` does, so we can use the nice
  `set.get_or_insert_owned(x)` which will only clone the reference if it
  doesn't exists yet and only hashes once.

We now use the last approach.
alamb pushed a commit to apache/datafusion that referenced this issue Aug 6, 2024
Fix accidental $O(n^2)$ in `collect_columns`.

There are the following ways to insert a clone into a hash set:

- **clone before check:** Basically `set.insert(x.clone())`. That's
  rather expensive if you have a lot of duplicates.
- **iter & clone:** That's what we do right now, but that's in $O(n^2)$.
- **check & insert:** Basically `if !set.contains(x) {set.insert(x.clone())}`
  That requires two hash probes though.
- **entry-based API:** Sadly the stdlib doesn't really offer any
  handle/entry-based APIs yet (see rust-lang/rust#60896 ),
  but `hashbrown` does, so we can use the nice
  `set.get_or_insert_owned(x)` which will only clone the reference if it
  doesn't exists yet and only hashes once.

We now use the last approach.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-collections Area: std::collections. B-unstable Blocker: Implemented in the nightly compiler and unstable. C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. Libs-Tracked Libs issues that are tracked on the team's project board. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests