Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use the niche optimization if other variant are small enough #70477

Closed
wants to merge 7 commits into from

Conversation

ogoffart
Copy link
Contributor

Reduce the size of enums which have one bigger variant with a niche

I guess this should maybe be behind a compiler flag at first?
cc @eddyb

@rust-highfive
Copy link
Collaborator

r? @matthewjasper

(rust_highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Mar 27, 2020
@eddyb
Copy link
Member

eddyb commented Mar 27, 2020

r? @eddyb

@rust-highfive rust-highfive assigned eddyb and unassigned matthewjasper Mar 27, 2020
Comment on lines 325 to 327
// Rotate index array to put the largest niche first.
// Since it is already the first amongst the types with the same alignement,
// this will just move some of the potential padding within the structure.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heh this is pretty clever.

@eddyb
Copy link
Member

eddyb commented Mar 27, 2020

This seems to address #46213, or at least partially?
I'm really happy someone is taking this on, I keep not getting to it 😞.


Something I've been meaning to do since last year is to solve #63866 by first computing the unoptimized enum layout, and only then, the niche one, and only using the niche one if it's strictly smaller or is equal in size but has strictly more available values in its largest niche.

That way we don't have to come up with clever conditions for when we should use a niche layout, as we will simply not use it if it's not helpful in size or in terms of its largest niche.

Something else to keep in mind is that a niche layout has more complex codegen than a tagged one, so it might optimize worse, even if they are otherwise equal on size and largest niche.


@ogoffart So would you be interested in changing the code to work like that (probably in a separate PR, that we can test on its own for perf regressions), i.e. only use an optimized (niche) layout if it's a benefit over the unoptimized (tagged) one?

I think it might allow this PR to be simpler, or at the very least, reduce the risk of accidentally increasing some types' sizes, to 0.

dataful_variant = None;
break 'variants;
if !max_size.is_aligned(align.abi) {
// Don't perform niche optimisation if there is padding anyway
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this fixes #63866. If it detects that the variant would contains padding, the padding will be used for the discriminant.

Copy link
Member

@eddyb eddyb Mar 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps, but it seems harder to reason about than having two candidates for a layout and picking the best one out of them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking computing the second candidate would be a waste of time.

@ogoffart
Copy link
Contributor Author

right, this fixes #63866 and #46213, I did not look for these issue before.

So would you be interested in changing the code to work like that (probably in a separate PR, that we can test on its own for perf regressions), i.e. only use an optimized (niche) layout if it's a benefit over the unoptimized (tagged) one?

I am not sure I understand this. I believe the current implementation will only use the Niche if it reduces the size of the enum.


Anyway, I think there is more to do in the codegen because many test are failling. They passes in stage 1 (which i was running locally) but fail in stage 2 (on the CI) valgrind reports some use of uninitialized value. I need to investigate.

@ogoffart
Copy link
Contributor Author

I adjusted the ABI to fix the miscompilation issue, and added a test for issue #63866

@bors
Copy link
Contributor

bors commented Mar 30, 2020

☔ The latest upstream changes (presumably #70536) made this pull request unmergeable. Please resolve the merge conflicts.

@bors
Copy link
Contributor

bors commented Apr 1, 2020

☔ The latest upstream changes (presumably #70672) made this pull request unmergeable. Please resolve the merge conflicts.

@eddyb
Copy link
Member

eddyb commented Apr 2, 2020

I am not sure I understand this. I believe the current implementation will only use the Niche if it reduces the size of the enum.

Perhaps, but there's no formal proof of this and it's impossible to exhausively test, so we'd have to rely on convincing ourselves, and that's the main reason I kept putting this off: the conditions were too complex and hard to reason about, and they were also very easy to invalidate by other changes.

I want to land this, but I'm not comfortable with it unless one of these two things happen:

  • the implementation gets a formal model, where we can prove things about all possible cases
  • we check instead of assuming the conditions are sufficient

I'm thinking computing the second candidate would be a waste of time.

We can measure this, it's not uncommon in rustc to check all possibilities.
Bonus: it would remove the need for complex logic that tries to approximate the comparison between the two candidates, and we could focus on things like using SmallVec instead of Vec, instead.

@ogoffart
Copy link
Contributor Author

ogoffart commented Apr 2, 2020

With this patch (when struct_reordering_opt is true,) we only perform the niche optimization if the sum of the size of all fields of the bigger variant is a multiple of the alignment. The ordering of member make sure there is not going to be any padding between these members, so therefore there is no padding place for the discriminant if not using the niche. So using the niche will always lead to a smaller size for the enum.

Maybe i should add a debug_assert that this is the case?

@ogoffart
Copy link
Contributor Author

ogoffart commented Apr 2, 2020

(I added an assert that there was no padding in the layout, and so that the niche must be smaller than not using a niche)

@eddyb
Copy link
Member

eddyb commented Apr 3, 2020

Ah so the logic here is that you're being conservative about niches and only using them when it would be impossible to add a tag without increasing the size of the type?

I think this makes sense in the common case of byte tags, but what about this example?

Am I right that this PR makes both Enum<u16> and Enum<NicheU16> have 6 bytes?

@ogoffart
Copy link
Contributor Author

ogoffart commented Apr 3, 2020

Am I right that this PR makes both Enum<u16> and Enum<NicheU16> have 6 bytes?

Yes, that was right. I fixed that now.
I indeed forgot to consider cases where there are more than 256 variants.

Now I understand better what you mean by "impossible to exhausively test". But I think it is fine to leave some corner case such as enum with huge amount of variants.

For example, an Option<(NicheU16, u8)> would still not use the niche so its the bigger niche would "only" have 254 entries available.

@eddyb
Copy link
Member

eddyb commented Apr 3, 2020

For example, an Option<(NicheU16, u8)> would still not use the niche so its the bigger niche would "only" have 254 entries available.

(NicheU16, u8) is the single field of Option<(NicheU16, u8)>'s Some variant and it has size 4, alignment 2, meaning it would use the niche, not add a tag, with your logic (so same outcome as today).

Comment on lines 942 to 919
if padding_bits >= 32 || count >> padding_bits == 0 {
// Don't perform niche optimisation if there is enough padding bits for the discrimiment
dataful_variant = None;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is a more correct version of the "only use niche if it's physically impossible to fit a tag anywhere", right?

It's still conservative in that it uses the tag layout unless it knows for sure that the niche layout would be beneficial, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that's right.
I changed it again to be sure than the padding is actually bigger than the biggest niche so we'd still use the niche in Option2<NicheU16, u8> (which i meant instead of Option<(NicheU16, u8)> from my previous comment which is its own type as you said)

@eddyb
Copy link
Member

eddyb commented Apr 3, 2020

@ogoffart Okay, can you try this example?

I think this is the first example that actually tackles the differences in alignment sorting between structs and enum variants (note that with Dataful(u8, u8, X),, both are 8).

This is the kind of thing that I would have a hard time reasoning about in my head when I was looking into this, and why I believe it to be much saner to just try both layouts and pick the best.

@ogoffart
Copy link
Contributor Author

ogoffart commented Apr 3, 2020

size_of::<Enum<u32>>() = 12
size_of::<Enum<NicheU32>>() = 8

Same as before the patch. We use the niche because 32 bit is bigger than the 16 bits padding.

Note that size_of::<Enum<u32>>() is not 8 because we indeed disable the reordering optimisation if the discriminant need alignement. Why is that? That's a problem.

@eddyb
Copy link
Member

eddyb commented Apr 3, 2020

If you want to keep things simple to reason about, we could limit niche optimizations to situations in which the tag would be a single byte, because we know the fields are sorted ascending in that case, and so any padding would "absorb" the tag, so it's easier to reason about than the general case.

It seems like a practical compromise to say that we don't optimize enums with too many variants (specifically >= 256), although I worry it might result in downstream (e.g. codegen or miri) hidden assumptions that will cause bugs if we ever generalize it again.

cc @rust-lang/compiler we might want to discuss this more broadly.

@eddyb
Copy link
Member

eddyb commented Apr 3, 2020

Same as before the patch. We use the niche because 32 bit is bigger than the 16 bits padding.

Why "32-bit"? Are you saying it's because the niche has a lot more invalid values?
In that case, try this example, I only wanted to increase the size of that field, not change the niche.

@eddyb eddyb added I-nominated T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Apr 3, 2020
@nikomatsakis
Copy link
Contributor

@eddyb can you summarize a bit what you'd like input on? There's a lot of backscroll here =) It sounds like it would make things easier to limit niche optimizations to cases with <256 variants, and you're asking if that's ok?

We don't currently make many guarantees about the niche optimization that I'm aware of, but the unsafe-code-optimizations draft did require that "Option-like" enums would be optimized, but that's it. Option-like means exactly 2 variants.

@ogoffart
Copy link
Contributor Author

Note that the pr 71045 only compute the two layout for enum with one single data varient, While this compute them also for enum with several data variant. Also it first need to perform the estimate.

@ogoffart
Copy link
Contributor Author

There are some performance regressions. The difference could be attributed to:

  1. the extra cost of estimating what is the largest and second largest variant.
  2. the extra cost of computing the second layout including for the case where there is a second largest variant.
  3. generally the fact that using this niche optimization for more structure makes the whole compiler slower.

If the reason are step 1 or 2, it can be mitigated or live with.
But if this is because the niche optimization actually make things slower it is perhaps not worth it, unless there might be do something really wrong in the generated code and that could be improved.

@ogoffart
Copy link
Contributor Author

Could also be

  1. the niche optimization cause the use of different code path in llvm(or rustc) that enable different set of optimizations which may make the compile time slower but may improve the speed of the resulting binary (or not)

trying some benchmark on the generated code would give us more information about which of these hypotheses is right.

@eddyb
Copy link
Member

eddyb commented Apr 12, 2020

So the results on that separate PR are in, and they're neutral: #71045 (comment).

@ogoffart I did do something you didn't, to make it cheaper: I avoided calling tcx.intern_layout at all until the very end. Can you try the same, maybe that alleviates some of the difference?

@ogoffart
Copy link
Contributor Author

@ogoffart I did do something you didn't, to make it cheaper: I avoided calling tcx.intern_layout at all until the very end

Since i do early return, the intern_layout is only done at the very end (on the actual return line, for both case).
So i don't see how that can make any difference.

What could makes a difference is that i compute the niche layout for more enums, while your test pr only computes it for the enum for which we were already doing niche optimization before (the ones with a single data variant)

Reduce the size of enums which have one bigger variant with a niche
Since we do not care much about the case where we can use the niche, but that
the fields cannot be re-ordered
Instead of using estimation euristic to find out if we should do one or
the other layout, always perform the non-niche layout and, when possible,
the niche layout, and compare the resulting layout.

In my opinion, this is less efficient, but was requested in the review
because it is hard to proof that the estimation is correct in every case.
@ogoffart
Copy link
Contributor Author

I did some micro benchmark to understand the difference between the niche and the tag.
matching a enum with a niche. When there is only two variant, there is practically no difference as llvm is able to optimize the extra check away.
But if there are more than two match arms, LLVM won't optimize the code that compute the discriminant, and that has an impact on performance.

The code to get the discriminant from a bool niche has this prologue:

mov    (%rdi),%al
add    $0xfe,%al
movzbl %al,%ecx
lea    0x1(%rcx),%rdx
xor    %eax,%eax
cmp    $0x2,%cl
cmovb  %rdx,%rax

compared to simply movzbl (%rdi),%eax when not using the niche.

But the cmovb and the "normalization" of the discriminant is useless.

This could be something that LLVM could be improved on.
But i'm not sure this alone could explain the few percent regression we observe on some bechmark.

Maybe the best would be to not generate a "get discriminant" when doing a match. This would also give simpler code to LLVM to optimize with less instructions (which may also explain some of the slowdown. But the result does not seem to be worse for optimized build than for check).
This should probably be changed in librustc_mir_build/build/matches/test.rs in Builder::perform_test but at this point i don't know if we already have the layout and can generate a different code for the niche.

@crlf0710 crlf0710 added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Apr 24, 2020
chrysn added a commit to chrysn-pull-requests/Slots that referenced this pull request Apr 27, 2020
This does not yet realize the space savings it promises in the test case
due to missing optimizations[1][2] in Rust, but should eventually manage
to be barely larger than the payload arrays.

[1]: rust-lang/rust#46213
[2]: rust-lang/rust#70477
@eddyb
Copy link
Member

eddyb commented May 1, 2020

But i'm not sure this alone could explain the few percent regression we observe on some bechmark.

The benchmarks are compile times, not runtime. So any slowdowns they detect are the compiler doing more work, not necessarily any codegen'd code being inefficient.

(I'm really sorry I haven't come back to this already, the amount of different things happening in Rust got really overwhelming over the past few weeks)


Maybe the best would be to not generate a "get discriminant" when doing a match.

We used to have a variant-based Switch terminator in MIR, before we replaced it with a SwitchInt one. Maybe codegen could be clever and special-case SwitchInt(Discriminant(...)).
But I don't see how switching over a niche can be simplified except for a 2-case SwitchInt, anything else couldn't use a single comparison.

@ogoffart
Copy link
Contributor Author

ogoffart commented May 1, 2020

The benchmarks are compile times, not runtime. So any slowdowns they detect are the compiler doing more work, not necessarily any codegen'd code being inefficient.

I beg to differ. I believe it is both. AFAIK, the compiler is compiled with itself, so if the change include some big regression (or improvement) in code generation, it would therefore also be slower (or faster) to compile.

Maybe codegen could be clever and special-case SwitchInt(Discriminant(...)).
But I don't see how switching over a niche can be simplified except for a 2-case SwitchInt, anything else couldn't use a single comparison.

Yes, having a SwitchInt(Discriminant(...)) would allow the codegen to be smarter:

currently, this is generated (pseudo code)

let discriminent = if (niche in [begin...end]) { niche - offset } else { data_variant };
match(discriminent) {
   data_variant => ...,
   other_variant1 => ...
   other_variant2 => ...
}

And llvm is not good enough to remove the if or the -offset in all case.

What could be generated is directly

match(niche) {
   other_variant1+offset => ...
   other_variant2+offset => ...
    _ => ...   // (for the data_variant)
   // begin..end  =>  ...    // alternative if the switch is not exhaustive
}

(where other_variantx+offset is a constant)

Then the switch looks almost exactly the same as for the tag variant, (and maybe would also be easier or LLVM to process)


Anyway, my free time period is over and i won't have time to continue on this patch.
Feel free to close this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants