
Possible performance regressions on some CPUs after #3449 (C fast loops) #3762

Closed

iksaif opened this issue Sep 18, 2023 · 8 comments

@iksaif
iksaif commented Sep 18, 2023

Describe the bug
It seems that #3449 introduced a performance regression on modern CPUs. This is particularly problematic on aarch64/arm64, since those platforms have no assembly implementation and will always use the fast C loop functions.

To Reproduce

```sh
wget https://github.com/facebook/zstd/releases/download/v1.1.3/github_users_sample_set.tar.zst
zstd -d github_users_sample_set.tar.zst
tar xf github_users_sample_set.tar
wget https://mattmahoney.net/dc/silesia.zip
unzip silesia.zip
tar -cf silesia.tar dickens mozilla mr nci ooffice osdb reymont samba sao webster x-ray xml

CFLAGS="-O2 -march=native" make -j8; ./zstd -b1 -r github ; ./zstd -b1e1 --compress-literals --zstd=tlen=131072 silesia.tar
```

On Darwin: (Apple clang version 14.0.3 (clang-1403.0.22.14.1))
On Linux: (gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0)

Expected behavior
Performance should not degrade between v1.5.2 and v1.5.4

Screenshots and charts

  • b1: zstd -b1 -r github

  • b2: zstd -b1e1 --compress-literals --zstd=tlen=131072 silesia.tar

  • 🟢 - best performance for this machine

  • 🔴 - worst performance for this machine

| Machine | Version | github | silesia |
|---------|---------|--------|---------|
| m1 | v1.5.2 | 🟢 529.5 MB/s | 🔴 722.5 MB/s |
| m1 | v1.5.4 | 🔴 412.3 MB/s | 🟢 723.4 MB/s |
| m1 | v1.5.4 + disableFast | 🟢 529.4 MB/s | 🟢 722.4 MB/s |
| m1 | dev | 🔴 410.6 MB/s | 🟢 722.7 MB/s |
| c7g Graviton 3 | 51b123d | 🟢 483.1 MB/s | 🟢 1128.2 MB/s |
| c7g | v1.5.2 | 🟢 436.1 MB/s | 🟢 1122.5 MB/s |
| c7g | v1.5.4 | 🔴 349.6 MB/s | 🔴 703.7 MB/s |
| c7g | v1.5.4 + disableFast | 🟢 436.7 MB/s | 🟢 1125.5 MB/s |
| c7g | dev | 🔴 349.3 MB/s | 🔴 720.4 MB/s |
| c6i Intel(R) Xeon(R) Platinum 8375C | 51b123d (pre Huffman ASM) | 🟢 492.4 MB/s | 🟠 1232.3 MB/s |
| c6i | v1.5.2 | 🔴 393.4 MB/s | 🟢 1589.8 MB/s |
| c6i | v1.5.4 | 🔴 398.7 MB/s | 🟢 1579.3 MB/s |
| c6i | v1.5.4 + disableAsm | 🔴 404.0 MB/s | 🔴 907.0 MB/s |
| c6i | v1.5.4 + disableAsm + disableFast | 🟢 494.3 MB/s | 🟠 1235.2 MB/s |
| c6i | dev | 🔴 397.4 MB/s | 🟢 1585.9 MB/s |
| c6a AMD EPYC 7R13 | 51b123d (pre Huffman ASM) | 🟢 519.3 MB/s | 🟠 1209.2 MB/s |
| c6a | v1.5.2 | 🔴 397.9 MB/s | 🟢 1467.3 MB/s |
| c6a | v1.5.4 | 🔴 397.6 MB/s | 🟢 1466.8 MB/s |
| c6a | v1.5.4 + disableAsm | 🔴 396.8 MB/s | 🔴 687.4 MB/s |
| c6a | v1.5.4 + disableAsm + disableFast | 🟢 476.0 MB/s | 🟠 1301.4 MB/s |
| c6a | v1.5.5 | 🔴 396.1 MB/s | 🟢 1469.0 MB/s |
| c6a | dev | 🔴 391.8 MB/s | 🟢 1466.3 MB/s |

Summary:

  • v1.5.4 + disableAsm always gives the worst performance for both tests - and this version uses the generic C fast loops
  • v1.5.4 + disableAsm + disableFast is slightly slower than the ASM function for the second benchmark but faster for the first one
  • For aarch64, the first benchmark was always faster with v1.5.2
  • For amd64, the first benchmark was faster before the ASM function was introduced
  • The results from [huf] Add generic C versions of the fast decoding loops #3449 can be replicated only on Apple silicon with the second benchmark

Desktop (please complete the following information):
See above

@iksaif
Author

iksaif commented Sep 18, 2023

CC: @terrelln - since you authored the original PR, you are probably interested in this.

I haven't looked at why things might be slower, but given these results it might be worth offering an option to not build the generic C versions of the fast decoding loops, since it's unclear that they offer a significant performance boost on modern CPUs and compilers.

Also, maybe I did something wrong in these tests; I'm happy to re-run them if necessary.

@iksaif iksaif changed the title Performance regressions on some CPUs after #3449 (C fast loops) Possible performance regressions on some CPUs after #3449 (C fast loops) Sep 18, 2023
@iksaif
Author

iksaif commented Sep 18, 2023

This also relates to:

iksaif added a commit to iksaif/zstd that referenced this issue Sep 19, 2023
facebook#3762 seems to show
that it doesn't perform as well as we thought it would in many
cases. It makes sense to at least allow users to disable them
at build time and runtime.
@terrelln terrelln self-assigned this Sep 28, 2023
@terrelln
Contributor

Thanks for the report @iksaif!

I will spend some time investigating next week.

At first glance, the performance on c7g Graviton 3 looks bad across the board. But everything else has only regressed in the GitHub case, which is a bunch of small files. That suggests we need to either improve the fast C & ASM loops for small literals sections, or automatically fall back to the old code paths for small literals sections.

@iksaif
Copy link
Author

iksaif commented Oct 14, 2023

👋 Any findings? Thanks!

terrelln pushed a commit to terrelln/zstd that referenced this issue Nov 20, 2023
gcc in the linux kernel was not unrolling the inner loops of the Huffman
decoder, which was destroying decoding performance. The compiler was
generating crazy code with all sorts of branches. I suspect because of
Spectre mitigations, but I'm not certain. Once the loops were manually
unrolled, performance was restored.

Additionally, when gcc couldn't prove that the variable left shift in
the 4X2 decode loop wasn't greater than 63, it inserted checks to verify
it. To fix this, mask `entry.nbBits & 0x3F`, which allows gcc to elide
this check. This is a no-op, because `entry.nbBits` is guaranteed to be
less than 64.

Lastly, introduce the `HUF_DISABLE_FAST_DECODE` macro to disable the
fast C loops for Issue facebook#3762. So if, even after this change, there is
a performance regression, users can opt out at compile time.
terrelln pushed a commit to terrelln/zstd that referenced this issue Nov 20, 2023
* Rename `ilimit` to `ilowest` and set it equal to `src` instead of
  `src + 6 + 8`. This is safe because the fast decoding loops guarantee
  to never read below `ilowest` already. This allows the fast decoder to
  run for at least two more iterations, because it consumes at most 7
  bytes per iteration.
* Continue the fast loop all the way until the number of safe iterations
  is 0. Initially, I thought that towards the end, the computation of how
  many safe iterations remain might become expensive. But it ends up being
  slower to decode each of the 4 streams individually, which makes sense.

This drastically speeds up the Huffman decoder on the `github` dataset
for the issue raised in facebook#3762, measured with `zstd -b1e1r github/`.

| Decoder  | Speed before | Speed after |
|----------|--------------|-------------|
| Fallback | 477 MB/s     | 477 MB/s    |
| Fast C   | 384 MB/s     | 492 MB/s    |
| Assembly | 385 MB/s     | 501 MB/s    |

We can also look at the speed delta for different block sizes of silesia
using `zstd -b1e1r silesia.tar -B#`.

| Decoder  | -B1K ∆ | -B2K ∆ | -B4K ∆ | -B8K ∆ | -B16K ∆ | -B32K ∆ | -B64K ∆ | -B128K ∆ |
|----------|--------|--------|--------|--------|---------|---------|---------|----------|
| Fast C   | +11.2% | +8.2%  | +6.1%  | +4.4%  | +2.7%   | +1.5%   | +0.6%   | +0.2%    |
| Assembly | +12.5% | +9.0%  | +6.2%  | +3.6%  | +1.5%   | +0.7%   | +0.2%   | +0.03%   |
@terrelln
Contributor

Hi @iksaif, sorry for the delay, but I have several updates:

  1. PR [huf] Improve fast C & ASM performance on small data #3827 drastically speeds up the fast C & assembly Huffman decoders on very small data like in the github corpus.
  2. PR [huf] Improve fast huffman decoding speed in linux kernel #3826 may improve c7g Graviton 3 performance. It manually unrolls the inner loops, because I've found that in certain scenarios the compiler wasn't unrolling the inner loops. If that was the case for c7g, then I would expect significant performance improvements.
  3. PR [huf] Improve fast huffman decoding speed in linux kernel #3826 also provides the ability to disable the fast C decoding loops at compile time by defining the C macro HUF_DISABLE_FAST_DECODE, in case you still see subpar performance after these PRs.

Please let me know if you see any more issues with performance after these PRs, and we will look into them. I can't guarantee a super fast turnaround time, but we always try to handle outstanding issues before we make releases.

@iksaif
Author

iksaif commented Nov 21, 2023

Thanks for the update, I'll try this out in the next few weeks!

@iksaif
Author

iksaif commented Nov 22, 2023

I can confirm that the latest version from git is back to v1.5.2 levels of performance on c7g!

```
 1# 9114 files       :   7484607 ->   2603666 (x2.875),  183.3 MB/s,  435.3 MB/s
 1#silesia.tar       : 211957760 -> 132055991 (x1.605),  700.0 MB/s, 1143.6 MB/s
```

@iksaif iksaif closed this as completed Nov 22, 2023
@terrelln
Contributor

Great, I'm glad to hear it! Looks like it was running into the loop unrolling problem.
