Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hasher::write should clarify its "whole unit" behaviour #94026

Closed
scottmcm opened this issue Feb 15, 2022 · 22 comments · Fixed by #94598
Closed

Hasher::write should clarify its "whole unit" behaviour #94026

scottmcm opened this issue Feb 15, 2022 · 22 comments · Fixed by #94598
Labels
T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@scottmcm
Copy link
Member

scottmcm commented Feb 15, 2022

Inspired by https://users.rust-lang.org/t/hash-prefix-collisions/71823/10?u=scottmcm

Hash::hash_slice has a bunch of text clarifying that h.hash_slice(&[a, b]); h.hash_slice(&[c]); is not guaranteed to be the same as h.hash_slice(&[a]); h.hash_slice(&[b, c]);.

However, Hasher::write is unclear whether that same rule applies to it. It's very clear that .write(&[a]) is not the same as .write_u8(a), but not whether the same sequence of bytes to write is supposed to be the same thing, even if they're in different groupings, like h.write(&[a, b]); h.write(&[c]); vs h.write(&[a]); h.write(&[b, c]);.

This is important for the same kind of things as the VecDeque example mentioned on hash_slice. If I have a circular byte buffer, is it legal for its Hash to just .write the two parts? Or does it need to write_u8 all the individual bytes since two circular buffers should compare equal regardless of where the split happens to be?

Given that Hash for str and Hash for [T] are doing prefix-freedom already, it feels to me like write should not be doing it again.

Also, our SipHasher implementation is going out of its way to maintain the "different chunking of writes is fine":

fn write(&mut self, msg: &[u8]) {
let length = msg.len();
self.length += length;
let mut needed = 0;
if self.ntail != 0 {
needed = 8 - self.ntail;
// SAFETY: `cmp::min(length, needed)` is guaranteed to not be over `length`
self.tail |= unsafe { u8to64_le(msg, 0, cmp::min(length, needed)) } << (8 * self.ntail);
if length < needed {
self.ntail += length;
return;
} else {
self.state.v3 ^= self.tail;
S::c_rounds(&mut self.state);
self.state.v0 ^= self.tail;
self.ntail = 0;
}
}
// Buffered tail is now flushed, process new input.
let len = length - needed;
let left = len & 0x7; // len % 8
let mut i = needed;
while i < len - left {
// SAFETY: because `len - left` is the biggest multiple of 8 under
// `len`, and because `i` starts at `needed` where `len` is `length - needed`,
// `i + 8` is guaranteed to be less than or equal to `length`.
let mi = unsafe { load_int_le!(msg, i, u64) };
self.state.v3 ^= mi;
S::c_rounds(&mut self.state);
self.state.v0 ^= mi;
i += 8;
}
// SAFETY: `i` is now `needed + len.div_euclid(8) * 8`,
// so `i + left` = `needed + len` = `length`, which is by
// definition equal to `msg.len()`.
self.tail = unsafe { u8to64_le(msg, i, left) };
self.ntail = left;
}

So it seems to me like this has been the expected behaviour the whole time. And if not, we should optimize SipHasher to be faster.

cc #80303 which lead to this text in hash_slice.

@scottmcm scottmcm added T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. I-libs-api-nominated Nominated for discussion during a libs-api team meeting. labels Feb 15, 2022
@tczajka
Copy link

tczajka commented Feb 15, 2022

While we are at it, it would be good to also clarify what prefix-freedom means in the presence of a mix of write calls, write_u64 calls, etc when the type is also Eq.

I propose the following rules for Hash::hash:

  • Logically compress the sequence of Hasher method calls by combining consecutive write calls, concatenating the byte slices. Calls to other Hasher methods are not combined.
  • If x == y, the compressed sequences of calls must be identical.
  • If x != y:
    • The compressed sequences of calls must (or just "should"?) be different.
    • Additionally, after the initial sequence of identical calls, the next call must be to the same Hasher method with different arguments.
    • Additionally, if the first different call is to write, the two (compressed) byte sequence arguments must be such that neither is a prefix of the other.

There should also be analogous rules for Hash::hash_slice but I'm not sure what they should be, because currently Hash documentation documentation states that it's not OK to hash VecDeque by using two calls to hash_slice. Is there a reason for this? I think it should be OK, i.e. the Hash contract should say that hashing two slices is equivalent to hashing a concatenated slice.

Either way, I think for Hash::hash_slice a decision should be made one way or the other. Either:

  • We treat hash_slice(&[a]); hash_slice(&[b]); as equivalent to hash_slice(&[a, b]), in which case the documentation of Hash should state that as a requirement. Or:
  • We treat them as different, in which case the default implementation of hash_slice should change, because it currently causes a collision on such calls.

@Amanieu
Copy link
Member

Amanieu commented Feb 16, 2022

I strongly disagree with the notion that h.write(&[a, b]); h.write(&[c]); and h.write(&[a]); h.write(&[b, c]); are required to result in the same hash value. This would prevent a lot of optimizations on hashers, in particular with the use of unaligned memory access.

Consider the case of a 7-byte slice: this can be hashed by reading the first 4 bytes and last 4 bytes (with 1 byte of overlap) and hashing those two values. This is much more efficient than hashing each byte individually or buffering the bytes until they form a full word.

So it seems to me like this has been the expected behaviour the whole time. And if not, we should optimize SipHasher to be faster.

We should absolutely optimize SipHasher to be faster.

@tczajka
Copy link

tczajka commented Feb 16, 2022

Consider the case of a 7-byte slice: this can be hashed by reading the first 4 bytes and last 4 bytes (with 1 byte of overlap) and hashing those two values.

Wouldn't that mean that strings "abcdefg" and "abcddefg" always hash to the same value? If so, that would make a poor Hasher.

@Mark-Simulacrum
Copy link
Member

I think more generally, beyond the optimization for unaligned memory access, it seems easy to assume a simple hasher -- e.g., FxHash from rustc, is not going to keep a buffer around to keep 'partial' writes prior to inserting them into the hash function. If you write a partial slice, it'll still end up hashing a full usize -- so writing a series of slices rather than one large one can have a large impact.

(Effectively, this is a form of zero-padding the input buffer to fit a 8-byte block).

@tczajka
Copy link

tczajka commented Feb 16, 2022

If multiple calls to write are not concatenated, the section about "prefix collisions" in Hash documentation really needs rewriting, because it becomes very unclear what it means for one sequence of calls to be a prefix of another sequence of calls.

It's very clear that .write(&[a]) is not the same as .write_u8(a)

impl Hash for str currently assumes that it is the same. It calls write_u8(0xff) as the end marker rather than write(&[0xff]). If it's not the same thing, it's a bug.

@RalfJung
Copy link
Member

impl Hash for str currently assumes that it is the same. It calls write_u8(0xff) as the end marker rather than write(&[0xff]). If it's not the same thing, it's a bug.

Why that? As long as write_u8(0xff) always hashes the same way, the impl Hash for str seems fine.
(Incidentally, I just wondered what that 0xff is for anyway. There is no comment explaining it so the reader has to resort to guessing.)

@bjorn3
Copy link
Member

bjorn3 commented Feb 20, 2022

0xff can never exist in a valid UTF-8 string. The only bit patterns used in the UTF-8 encoding of any codepoint are 0xxxxxxx, 10xxxxxx, 110xxxxx, 1110xxxx and 11110xxx. 11111111 is not valid.

@RalfJung
Copy link
Member

That doesn't explain why write_u8 is called at all. My guess is that it serves to give a byte slice and str with the same data different hashes, but (a) it's just a guess, and (b) that still doesn't explain why one would want those hashes to differ in the first place.

@tczajka
Copy link

tczajka commented Feb 20, 2022

impl Hash for str currently assumes that it is the same. It calls write_u8(0xff) as the end marker rather than write(&[0xff]). If it's not the same thing, it's a bug.

Why that? As long as write_u8(0xff) always hashes the same way, the impl Hash for str seems fine.

It is required for the property that unequal values write different sequences to the Hasher (at least for standard types).

For instance, suppose that a hasher (say, SipHasher) were to treat write_u8(0xff) the same way as it treats write(&[0x41]).

Then this would cause a guaranteed collision between ("AAA", "AAA") and ("AA", "AAAA") regardless of the random seed inside SipHasher, destroying its DoS protection.

@bjorn3
Copy link
Member

bjorn3 commented Feb 20, 2022

That doesn't explain why write_u8 is called at all.

To ensure that hashing abc and the def gives a different hash from first hashing abcd and then ef.

@scottmcm
Copy link
Member Author

Hmm, if hashers don't merge writes that's a shame since the nice \xFF trick for str ends up not really being any better, since if it'll do a whole block for the one byte anyway, and thus doesn't matter compared to using the length.

(And the AHash approach of length-prehashing on every write makes the \xFF pointless, so I feel like that's wrong for AHash to do that regardless.)

@RalfJung
Copy link
Member

RalfJung commented Feb 20, 2022

To ensure that hashing abc and the def gives a different hash from first hashing abcd and then ef.

That sounds like a job for the slice hash function (that str calls), not something str should do. And indeed that function hashes the length, so the write_u8 is unnecessary to achieve the goal you state.

Then this would cause a guaranteed collision between ("AAA", "AAA") and ("AA", "AAAA") regardless of the random seed inside SipHasher, destroying its DoS protection.

No, it wouldn't, since the lengths of the strings are also hashed.

@tczajka
Copy link

tczajka commented Feb 20, 2022

That sounds like a job for the slice hash function (that str calls)

This is not true. The str Hash implementation doesn't call the slice Hash implementation, it calls Hasher write and write_u8 methods directly.

@tczajka
Copy link

tczajka commented Feb 20, 2022

Hmm, if hashers don't merge writes that's a shame since the nice \xFF trick for str ends up not really being any better, since if it'll do a whole block for the one byte anyway, and thus doesn't matter compared to using the length.

It's worse than that -- if it doesn't merge the FF byte with the previous write but instead does something else on a whole block, you might lose the prefix free property for strings (depending on the details of how it gets expanded to a block and what the block hash is).

@RalfJung
Copy link
Member

Oh, I missed that it calls write instead of hashing the byte slice -- my fault, sorry.

Maybe they should just be hashed as byte slices though, that would resolve all the concerns above... wouldn't it?

@scottmcm
Copy link
Member Author

Well if the hasher does merge writes, like SipHash does, then the write approach is great, since hashing "abc" can just hash one 4-byte block instead of needing to hash the potentally-8-byte length in addition to the string itself.

But yes, just using the normal hash for [u8] would certainly resolve the prefix concerns.

@tczajka
Copy link

tczajka commented Feb 20, 2022

Maybe they should just be hashed as byte slices though, that would resolve all the concerns above... wouldn't it?

It would solve the issue for str, but if writes are not merged, then the prefix-free property guidance/requirement in Hash documentation would need to be strengthened. The first different calls to write would already have to be prefix free for two unequal values. Based on the current docs, implementors of Hash may be assuming that just the whole sequences of bytes written must be prefix free.

For instance, if somebody implements their own custom version of String, they might assume the 0xff trick works.

It is also convenient to have writes merged for other reasons, as described in the original ticket description. For instance, if the custom String stores the string chunked into pieces, it's more efficient to write each piece with a single call rather than allocating memory for the merge or calling write_u8 separately for each byte.

If on the other hand writes are defined to behave same when merged, then I see no reason why calls to Hash::hash_slice couldn't also be required to behave same when merged. I think it could be specified that hash_slice must do the same thing as the default implementation, except possibly in a more efficient way.

@joshtriplett joshtriplett added S-waiting-on-team Status: Awaiting decision from the relevant subteam (see the T-<team> label). and removed I-libs-api-nominated Nominated for discussion during a libs-api team meeting. labels Mar 2, 2022
@joshtriplett
Copy link
Member

cc @Amanieu for summary of libs discussion and reply

@Amanieu
Copy link
Member

Amanieu commented Mar 4, 2022

It seems that the key question here is whether the responsibility for ensuring that ([a, b], [c]) and ([a], [b, c]) have different hashes resides in the Hash impl or the Hasher impl.

  1. If Hash is responsible then it must hash a length or terminator separately. This is already the case today with the Hash impl of slices hashing the length of the slice and the Hash impl of str hashing an extra 0xff byte (which can't be part of the string since it is not valid UTF-8).
  2. If Hasher is responsible then it must hash the length of the byte slice passed to write. A non-secure hasher like FxHash does not need to do this since it doesn't guarantee protection against adversarial hash values. This would then allow us to remove the length hash from [T] and the terminator hash from str.

If we were to choose the first option then nothing needs to change except some documentation.

If we were to choose the second option then all hashers that guarantee HashDoS resistance will be required to hash the length of each slice passed to write. We would then be able to remove the length hash from [T] and the terminator hash from str. Note that the trick in str with 0xff cannot be used in this case: it only works with str because this is not a valid UTF-8 sequence, but the Hasher has no knowledge of the constraints on the contents of the byte slice.

I personally favor the second option since it would allow for faster hashing in cases where HashDoS resistance is not required (FxHash).

There might be some performance loss on SipHash when hashing strings since a full 8-byte length needs to be hashed instead of a single terminator character, but there is a way around this. We could add an unstable write_str method to Hasher which defaults to calling write. Hashers could then choose to use a specialized implementation which takes the UTF-8-ness of the bytes into account.


Incidentally I noticed an optimization for SipHash that would apply no matter which approach is taken. SipHash hashes the low 8 bits of the total length but this doesn't actually guarantee prefix-freedom: playground. This should be removed in favor of a full length hash (either in Hash or Hasher).

@scottmcm
Copy link
Member Author

scottmcm commented Mar 4, 2022

I've made a proposal partway between those two approaches as PR #94598

I think just leaving it to Hasher::write isn't great, since something like VecDeque can't use that, and it would be nice for a Color's hash to be able to just call write with the 3-byte array, and not get a length prefix even in a dos-resistant hasher since it doesn't need it. (Notably not via the array or slice hash, since that is forced to have the length prefix because of Borrow.) That implies to me there should be a method that a Hash implementation can call, and which a Hasher can then make do nothing if it wants to.

Edit: I think the method @bjorn3 hypothesizes below is what I called write_length_prefix in the PR.

@bjorn3
Copy link
Member

bjorn3 commented Mar 4, 2022

If Hasher is responsible then it must hash the length of the byte slice passed to write. A non-secure hasher like FxHash does not need to do this since it doesn't guarantee protection against adversarial hash values. This would then allow us to remove the length hash from [T] and the terminator hash from str.

On the other hand this would require siphasher (used for hash tables by default) to hash the length, which would slow it down as lengths now get hashed even if not necessary for disambiguation. Maybe we could have a third option where the Hash impl is responsible for calling a new method with the disambiguation data and the Hasher can then decide to hash it or not depending on if it needs HashDoS protection.

Edit: race with @scottmcm

@tczajka
Copy link

tczajka commented Mar 4, 2022

@Amanieu

SipHash hashes the low 8 bits of the total length but this doesn't actually guarantee prefix-freedom: playground.

You're misreading that code. SipHash currently merges all write calls, so this is not a collision, this is working as expected. SipHash makes sure these sequences behave exactly same.

A non-secure hasher like FxHash does not need to do this since it doesn't guarantee protection against adversarial hash values.

This idea is basically to skip over some relevant information when hashing. To me this seems to defeat the whole purpose of hashing. It's like only hashing the first half of a (i32, i32) pair for performance. Is this really such a performance win? It could be a big performance loss due to avoidable collisions.

@joshtriplett joshtriplett removed the S-waiting-on-team Status: Awaiting decision from the relevant subteam (see the T-<team> label). label Apr 27, 2022
@bors bors closed this as completed in 8c4fc9d May 6, 2022
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue May 8, 2022
…anieu

Further elaborate the lack of guarantees from `Hasher`

I realized that I got too excited in rust-lang#94598 by adding new methods, and forgot to do the documentation to really answer the core question in rust-lang#94026.

This PR just has that doc update.

r? `@Amanieu`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants