Tracking Issue for bigint helper methods #85532
I'd like a mul_mod, as shown in #85017, because I think you can't implement it efficiently without asm, and it's a basic building block for power_mod and other things. |
Another set of methods that could be useful that I'll probably offer implementations for at some point:

```rust
/// `(self << rhs) | carry`
fn carrying_shl(self, rhs: u32, carry: Self) -> (Self, Self);
/// `(self >> rhs) | carry`
fn borrowing_shr(self, rhs: u32, carry: Self) -> (Self, Self);
/// `self << rhs`
fn widening_shl(self, rhs: u32) -> (Self, Self);
/// `self >> rhs`
fn widening_shr(self, rhs: u32) -> (Self, Self);
```

Essentially, return the two halves of a rotation, i.e. |
From @scottmcm in the original PR:
|
Why don't we add |
Mostly effort implementing them efficiently. In the meantime, you can do it with four calls to the |
I was very confused by this function name at first, since borrowing in Rust usually refers to references. I am not a native speaker, but I do formal mathematical work in English professionally, and I had never before heard the term "borrowing" in the context of subtraction. So I think this, at least, needs some explanation in the docs. (I would have expected something like

The current docs for some of the other methods could probably also be improved: they talk about not having the "ability to overflow", which makes it sound like not overflowing is a bad thing. |
The word "borrow" here comes from the terminology for a full subtractor. I am thinking that maybe the borrowing_sub function could be removed altogether: the same effect can be obtained from carrying_add by making the first carrying_add in the chain have a set carry bit and then bitwise-NOTing every rhs. This fact could be put in the documentation of carrying_add. |
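This identity can be checked on stable Rust today. A minimal sketch, where `carrying_add` is a local stand-in for the unstable method and `borrowing_sub_via_add` is a made-up name for the derived subtraction:

```rust
// Stable-Rust stand-in for the proposed carrying_add.
fn carrying_add(a: u64, b: u64, carry: bool) -> (u64, bool) {
    let (s, c1) = a.overflowing_add(b);
    let (s, c2) = s.overflowing_add(carry as u64);
    (s, c1 | c2)
}

// a - b - borrow == a + !b + (1 - borrow), so a borrow chain is a carry
// chain with every rhs bit-inverted and the initial carry set.
fn borrowing_sub_via_add(a: u64, b: u64, borrow: bool) -> (u64, bool) {
    let (diff, carry) = carrying_add(a, !b, !borrow);
    (diff, !carry)
}

fn main() {
    assert_eq!(borrowing_sub_via_add(5, 3, false), (2, false));
    assert_eq!(borrowing_sub_via_add(3, 5, false), (u64::MAX - 1, true));
    assert_eq!(borrowing_sub_via_add(0, 0, true), (u64::MAX, true));
}
```

Note the borrow-out is the inverted carry-out, which matches how some ISAs (e.g. ARM's SBC) treat the carry flag during subtraction.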
Considering how the primary goal of these methods is to be as efficient as possible, usually optimising down to a single instruction, I don't think it'd be reasonable to just get rid of subtraction in favour of telling everyone to use addition instead. Definitely open to changing the name, though. |
These helper methods will not be very useful to me unless they are implemented for every kind of integer. Here is an implementation of a widening multiplication-addition for u128:

I have tested this with my crate. Edit: there is a version of this that uses the Karatsuba trick to use 3 multiplications instead of 4, but it incurs extra summations and branches, and is not as parallel. For typical desktop processors the above should be the fastest. |
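The commenter's implementation was not captured above; as a hypothetical illustration (my own sketch, not their code, and `widening_mul_add_u128` is a made-up name), such a function can be built from four 64×64→128 partial products:

```rust
// Sketch: widening multiply-add for u128, computing a * b + c and returning
// (low, high). The mathematical maximum, (2^128 - 1)^2 + (2^128 - 1), fits
// in 256 bits, so the result never overflows.
fn widening_mul_add_u128(a: u128, b: u128, c: u128) -> (u128, u128) {
    const MASK: u128 = (1 << 64) - 1;
    let (a0, a1) = (a & MASK, a >> 64);
    let (b0, b1) = (b & MASK, b >> 64);
    // four 64x64 -> 128 partial products
    let p00 = a0 * b0;
    let p01 = a0 * b1;
    let p10 = a1 * b0;
    let p11 = a1 * b1;
    // sum the middle column; each overflow is worth 2^64 in the high half
    let (mid, c1) = p01.overflowing_add(p10);
    let (mid, c2) = mid.overflowing_add(p00 >> 64);
    let lo = (p00 & MASK) | (mid << 64);
    let hi = p11 + (mid >> 64) + ((c1 as u128 + c2 as u128) << 64);
    // finally fold in the addend c, carrying into the high half
    let (lo, cc) = lo.overflowing_add(c);
    (lo, hi + cc as u128)
}

fn main() {
    assert_eq!(widening_mul_add_u128(3, 4, 5), (17, 0));
    assert_eq!(widening_mul_add_u128(1 << 127, 2, 0), (0, 1));
    assert_eq!(
        widening_mul_add_u128(u128::MAX, u128::MAX, u128::MAX),
        (0, u128::MAX)
    );
}
```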
I would make a PR for that. |
Some alternative signatures include |
I would also change up the documentation headers for the
I specifically note |
|
But unsigned overflow and signed overflow are different. For example, on x86_64, unsigned and signed integers share addition and subtraction instructions, but unsigned overflow is detected using the carry flag while signed overflow is detected using the overflow flag. As a concrete example:

Edit: I think I had misread your comment and thought the middle part of your comment was the current doc, not your suggestion, so it looks like I completely misinterpreted your final comment. |
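One way to see the distinction (my own illustration): the same 8-bit addition can wrap for the unsigned interpretation (carry flag) without overflowing the signed one (overflow flag), and vice versa, which `overflowing_add` exposes per type:

```rust
fn main() {
    // 200 + 100 wraps past 255 as unsigned bytes: the carry flag would be set.
    assert_eq!(200u8.overflowing_add(100), (44, true));
    // The same bit patterns as signed bytes are -56 + 100 = 44: no signed overflow.
    assert_eq!((200u8 as i8).overflowing_add(100u8 as i8), (44, false));
    // Conversely, 100 + 100 overflows i8 (200 > 127) but not u8.
    assert_eq!(100i8.overflowing_add(100), (-56, true));
    assert_eq!(100u8.overflowing_add(100), (200, false));
}
```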
Yes signed and unsigned overflow are different, but the |
I think all of these are good suggestions, and as mentioned earlier, these changes definitely should go in a PR if you have the time. One important thing to note is that so far the APIs here seem good, but the documentation definitely could use some work. Although if there's a bigger case for changing the subtraction behaviour to be more in line with what's expected (the existing behaviour is mostly modelled after the x86 instructions adc and sbb), then I'm for that.

That said, the main goal is to make it relatively painless to write correct code that compiles down to the right instructions in release mode, so I would say we should make sure that happens regardless of what's done. I would have added an explicit test for that but I honestly don't know how. |
…riplett Add more text and examples to `carrying_{add|mul}` `feature(bigint_helper_methods)` tracking issue rust-lang#85532 cc `@clarfonthey`
Multiplication, and carry-less multiplication, are inherently widening operations. Unfortunately, at the time of writing, the types in Rust don't capture this well, being built around fixed-width wrapping multiplication. Rust's stdlib can rely on compiler-level optimizations to clean up the performance issues from unnecessarily wide multiplications, but this becomes a bit of an issue for our library, especially for u64 types, since we rely on intrinsics, which may be hard for compilers to optimize around. This commit adds widening_mul, based on a proposal to add widening_mul to Rust's primitive types: rust-lang/rust#85532. It also makes several other tweaks to how xmul is provided, moving more arch-level details into xmul while still limiting when it is emitted.
Given the importance of these methods for efficient big-integer handling, I wonder if there's a plan to collaborate with the LLVM community to provide such compiler intrinsics. Maybe Rust is not the only language that needs them. |
```llvm
; (i just copied the llvm-ir output for `u128::from(u64a) * u128::from(u64b)` and doubled the width)
define { i128, i128 } @widening_mul(i128 noundef %a, i128 noundef %b) unnamed_addr #0 {
start:
  %_4 = zext i128 %a to i256
  %_5 = zext i128 %b to i256
  %m = mul nuw i256 %_5, %_4 ; <----
  %_6 = trunc i256 %m to i128
  %_8 = lshr i256 %m, 128
  %_7 = trunc i256 %_8 to i128
  %0 = insertvalue { i128, i128 } poison, i128 %_6, 0
  %1 = insertvalue { i128, i128 } %0, i128 %_7, 1
  ret { i128, i128 } %1
}
```

so I don't see much rationale from the LLVM side to provide anything more. |
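For reference, the Rust source pattern that lowers to IR of this shape, shown at u64 width where the next-larger type exists (my own illustration):

```rust
// u64 widening multiply via the next-larger type; LLVM typically lowers this
// to a single widening multiply (e.g. one mulq on x86_64).
fn widening_mul_u64(a: u64, b: u64) -> (u64, u64) {
    let wide = a as u128 * b as u128;
    (wide as u64, (wide >> 64) as u64)
}

fn main() {
    // (2^64 - 1)^2 == 2^128 - 2^65 + 1, i.e. low limb 1, high limb 2^64 - 2
    assert_eq!(widening_mul_u64(u64::MAX, u64::MAX), (1, u64::MAX - 1));
    assert_eq!(widening_mul_u64(1 << 63, 2), (0, 1));
}
```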
Note that one of the stated goals for having this in the standard library is that we can add intrinsics for them if needed, but so far we haven't yet. We just need to have a solid API before we stabilise, and can continue tweaking performance over time. |
How about multiplication of signed with unsigned values? Also, I find the order of return values a bit confusing. Why was it chosen this way around?

Could we maybe make the return value a struct like this?

```rust
#[repr(C)]
pub struct Widening<H, L> {
    pub hi: H,
    pub lo: L,
}
```

And maybe have the order of the fields also depend on the endianness of the system, such that it can be transmuted directly into the corresponding wider integer type (that might require some padding as well)? |
There is no

BTW they can just be forwarded to the unsigned version like this (making use of e.g.

```rust
fn widening_mul_signed_unsigned(a: i64, b: u64) -> (u64, i64) {
    let (lo, mut hi) = (a as u64).widening_mul(b);
    if a < 0 {
        // wrapping_sub: the intermediate may wrap, which is intended here
        hi = hi.wrapping_sub(b);
    }
    (lo, hi as _)
}

fn widening_mul_signed_signed(a: i64, b: i64) -> (u64, i64) {
    // note: for `i64 * i64` it is more efficient to just cast them to i128.
    let (lo, hi) = (a as u64).widening_mul(b as u64);
    let mut hi = hi as i64;
    if a < 0 {
        hi = hi.wrapping_sub(b);
    }
    if b < 0 {
        hi = hi.wrapping_sub(a);
    }
    (lo, hi)
}
```
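Since `widening_mul` is still unstable, the sign-correction identity above can be checked on stable at 32-bit width against a plain i64 reference (the function name here is illustrative, not part of any API):

```rust
// 32-bit analogue of the snippet above, checked against i64 arithmetic.
fn widening_mul_signed_unsigned_32(a: i32, b: u32) -> (u32, i32) {
    // zero-extended 32x32 -> 64 product
    let wide = (a as u32 as u64) * (b as u64);
    let (lo, mut hi) = (wide as u32, (wide >> 32) as u32);
    if a < 0 {
        // undo the 2^32 * b excess introduced by zero-extending a
        hi = hi.wrapping_sub(b);
    }
    (lo, hi as i32)
}

fn main() {
    for &a in &[i32::MIN, -7, -1, 0, 1, 12345, i32::MAX] {
        for &b in &[0u32, 1, 99, u32::MAX] {
            let (lo, hi) = widening_mul_signed_unsigned_32(a, b);
            // reassemble the 64-bit result: high limb is signed, low is unsigned
            let got = ((hi as i64) << 32) | (lo as i64);
            assert_eq!(got, a as i64 * b as i64);
        }
    }
}
```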
+1 returning a named struct |
AFAICT on LLVM, power-of-2-sized integer types never have internal padding as long as they're at least 8 bits. This is why rust/library/core/src/num/int_macros.rs Lines 3394 to 3398 in 3cdcdaf
|
Yes: two's-complement biginteger multiply; see the comments I left on the corresponding ACP (most of that issue's comments): rust-lang/libs-team#228 |
That is technically possible but highly inefficient with current LLVM, since last I checked LLVM won't convert that branchy code sequence to a signed×[un]signed multiply. |
Note that my implementation is not the same as #85532 (comment). Although in the signed×signed case it is not as efficient as reusing

And for the signed×unsigned case I'd argue it is more efficient than

Anyway, the algorithm is only interesting for implementing

```asm
playground::widening_mul_signed_unsigned:
# %bb.0:
	movq %rsi, %rax
	mulq %rdi
	sarq $63, %rdi
	andq %rsi, %rdi
	subq %rdi, %rdx
	retq
# -- End function

playground::widening_mul_signed_signed:
# %bb.0:
	movq %rsi, %rax
	mulq %rdi
	movq %rdi, %rcx
	sarq $63, %rcx
	andq %rsi, %rcx
	sarq $63, %rsi
	andq %rdi, %rsi
	addq %rcx, %rsi
	subq %rsi, %rdx
	retq
# -- End function

# these two use i128::mul
playground::widening_mul_signed_signed_2:
# %bb.0:
	movq %rsi, %rax
	imulq %rdi
	retq
# -- End function

playground::widening_mul_signed_unsigned_2: # @playground::widening_mul_signed_unsigned_2
# %bb.0:
	movq %rsi, %rax
	mulq %rdi
	sarq $63, %rdi
	imulq %rsi, %rdi
	addq %rdi, %rdx
	retq
# -- End function

# this uses a * b == sgn(a) * sgn(b) * abs(a) * abs(b)
playground::widening_mul_signed_signed_3: # @playground::widening_mul_signed_signed_3
# %bb.0:
	movq %rdi, %rcx
	negq %rcx
	cmovsq %rdi, %rcx
	movq %rsi, %rax
	negq %rax
	cmovsq %rsi, %rax
	mulq %rcx
	xorl %ecx, %ecx
	movq %rax, %r8
	negq %r8
	sbbq %rdx, %rcx
	xorq %rdi, %rsi
	cmovsq %r8, %rax
	cmovsq %rcx, %rdx
	retq
# -- End function
```
|
On RISC-V, which has a signed-unsigned mul, casting to |
That suggests LLVM should improve the x86_64 i128::mul code generation. |
Speaking about unsigned integers, is there a reason why it is

and not

as having that additional carry can't overflow either? (For any

Or are the bigint helper functions more about providing low-level access to helpful instructions some architectures have than about the "no overflow" (in this case) guarantee? |
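The observation that a second carry still fits follows from the identity (2^N − 1)² + 2·(2^N − 1) = 2^2N − 1; a quick check at 8-bit width (my own illustration):

```rust
fn main() {
    // a*b + c + d with every operand at its maximum exactly fills the double width:
    let m = u8::MAX as u16; // 255
    assert_eq!(m * m + m + m, u16::MAX); // 255*255 + 255 + 255 == 65535
}
```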
I've seen before that you can fit a second carry, but usually in bigint libraries we just have a single carry to deal with in multiply-add chains. For long multiplication there are additions in two directions, but a loop can only handle one chain at a time and is usually adding to a temporary, handling that carry separately with |
@typetetris I don't know whether this will resonate, but I've been thinking of

So the goal here is to provide that primitive as the obvious way to write it in math libraries, without needing to know about the best way to represent it in the backend in use. For example, LLVM represents wide multiplication by casting to a larger type and doing multiplication on that, but cranelift has

So unless there's a particular use for the extra carry, I don't think it makes sense here, even though it's certainly a nice observation that another carry can fit. Like if the extra carry solved the slightly-wider intermediate result problem in Karatsuba then we should absolutely offer it (after all, it's easy to optimize out a |
Because it fits with |
The return type of

This is entirely different for the

The two multiplication functions should really return a named struct. |
@AaronKutch and @scottmcm Thanks for your explanations. I just had something like

```rust
// result = result + lhs * rhs
pub fn classic_mul_add_in_place(result: &mut [u64], lhs: &[u64], rhs: &[u64]) {
    debug_assert!(
        result.len() > lhs.len() + rhs.len(),
        "{} <= {} + {}",
        result.len(),
        lhs.len(),
        rhs.len()
    );
    debug_assert!(result[lhs.len() + rhs.len()] < u64::MAX);
    for (rhs_pos, rhs_leg) in rhs.iter().copied().enumerate() {
        let mut carry = 0u64;
        for (lhs_leg, result_place) in lhs
            .iter()
            .copied()
            .chain(std::iter::repeat(0u64))
            .zip(result[rhs_pos..].iter_mut())
        {
            let (new_digit, new_carry) = carrying_mul(rhs_leg, lhs_leg, *result_place, carry);
            *result_place = new_digit;
            carry = new_carry;
        }
        debug_assert_eq!(carry, 0u64);
    }
}
```

in a toy project of mine (certainly buggy, slow, and not idiomatic!). The

If you ever find a way to handle the slightly wider intermediate multiplication in Karatsuba, please let me know. At the moment I just handle the carries from the additions of the high and low parts separately, leading to some code bloat.

Edit1: |
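The 4-argument carrying_mul used in that loop isn't the currently proposed API (which takes a single carry); a stand-in built on u128 (my own sketch) shows why both extra addends are safe:

```rust
// a * b + add + carry, returned as (low, high). Since
// (2^64 - 1)^2 + 2*(2^64 - 1) == 2^128 - 1, the sum cannot overflow u128.
fn carrying_mul(a: u64, b: u64, add: u64, carry: u64) -> (u64, u64) {
    let wide = a as u128 * b as u128 + add as u128 + carry as u128;
    (wide as u64, (wide >> 64) as u64)
}

fn main() {
    // every operand at its maximum saturates both output limbs exactly
    assert_eq!(
        carrying_mul(u64::MAX, u64::MAX, u64::MAX, u64::MAX),
        (u64::MAX, u64::MAX)
    );
    assert_eq!(carrying_mul(2, 3, 4, 5), (15, 0));
}
```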
Thank you @typetetris! I found

E.g., here is @kennytm's function, adapted to use

```rust
pub fn u64_widening_mul2(x: u64, y: u64) -> u128 {
    let a = (x >> u32::BITS) as u32;
    let b = x as u32;
    let c = (y >> u32::BITS) as u32;
    let d = y as u32;
    let (p1, p2) = widening_mul(b, d);
    let (p2, p31) = carrying_mul(b, c, p2);
    let (p2, p32) = carrying_mul(a, d, p2);
    let (p3, p4) = carrying2_mul(a, c, p31, p32);
    u128::from(p1) | u128::from(p2) << 32 | u128::from(p3) << 64 | u128::from(p4) << 96
}
```

Even the assembly seems better than for the initial version with explicit

@scottmcm @AaronKutch I'm not sure if this affects the addition of this method to std. |
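The free functions widening_mul / carrying_mul / carrying2_mul in that snippet are presumably the commenter's own u32 helpers; under that assumption (definitions below are my reconstruction), the whole thing checks out against native u128 multiplication:

```rust
fn widening_mul(a: u32, b: u32) -> (u32, u32) {
    let w = a as u64 * b as u64;
    (w as u32, (w >> 32) as u32)
}

fn carrying_mul(a: u32, b: u32, carry: u32) -> (u32, u32) {
    let w = a as u64 * b as u64 + carry as u64;
    (w as u32, (w >> 32) as u32)
}

// two addends still fit: (2^32 - 1)^2 + 2*(2^32 - 1) == 2^64 - 1
fn carrying2_mul(a: u32, b: u32, add: u32, carry: u32) -> (u32, u32) {
    let w = a as u64 * b as u64 + add as u64 + carry as u64;
    (w as u32, (w >> 32) as u32)
}

pub fn u64_widening_mul2(x: u64, y: u64) -> u128 {
    let a = (x >> u32::BITS) as u32;
    let b = x as u32;
    let c = (y >> u32::BITS) as u32;
    let d = y as u32;
    let (p1, p2) = widening_mul(b, d);
    let (p2, p31) = carrying_mul(b, c, p2);
    let (p2, p32) = carrying_mul(a, d, p2);
    let (p3, p4) = carrying2_mul(a, c, p31, p32);
    u128::from(p1) | u128::from(p2) << 32 | u128::from(p3) << 64 | u128::from(p4) << 96
}

fn main() {
    for &x in &[0u64, 1, 0xDEAD_BEEF, u64::MAX] {
        for &y in &[0u64, 1, 12345, u64::MAX] {
            assert_eq!(u64_widening_mul2(x, y), x as u128 * y as u128);
        }
    }
}
```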
Btw I didn't find a way to generate optimal code for

Tool: https://godbolt.org/

Rust code:

```rust
#![feature(const_bigint_helper_methods)]
#![feature(bigint_helper_methods)]

#[no_mangle]
pub fn u128_widening_mul(x: u128, y: u128, result: &mut [u128; 2]) {
    let a = (x >> 64) as u64;
    let b = x as u64;
    let c = (y >> 64) as u64;
    let d = y as u64;
    let (p1, p2) = b.widening_mul(d);
    let (p2, p31) = b.carrying_mul(c, p2);
    let (p2, p32) = a.carrying_mul(d, p2);
    let (p3, p4o) = p31.overflowing_add(p32);
    let (p3, p4) = a.carrying_mul(c, p3);
    let p4 = p4.wrapping_add(p4o.into());
    result[0] = u128::from(p1) | u128::from(p2) << 64;
    result[1] = u128::from(p3) | u128::from(p4) << 64;
}
```

Output:

```asm
u128_widening_mul:
	umulh x8, x2, x0
	mul x10, x3, x0
	umulh x9, x3, x0
	mul x12, x2, x1
	adds x8, x8, x10
	umulh x11, x2, x1
	cinc x9, x9, hs
	mul x14, x3, x1
	adds x8, x8, x12
	umulh x10, x3, x1
	cinc x11, x11, hs
	mul x13, x2, x0
	adds x12, x9, x11
	adds x12, x14, x12
	cinc x10, x10, hs
	cmn x9, x11
	stp x13, x8, [x4]
	cinc x8, x10, hs
	stp x12, x8, [x4, #16]
	ret
```

But using Zig 0.12.0 it was able to generate optimal code:

```zig
export fn u128_widening_mul(a: u128, b: u128, result: *[2]u128) void {
    const value: u256 = @mulWithOverflow(@as(u256, a), @as(u256, b))[0];
    result[0] = @intCast(value);
    result[1] = @intCast(@shlWithOverflow(value, 128)[0]);
}
```

Output:

```asm
u128_widening_mul:
	umulh x8, x2, x0
	stp xzr, xzr, [x4, #16]
	mul x9, x2, x0
	madd x8, x2, x1, x8
	madd x8, x3, x0, x8
	stp x9, x8, [x4]
	ret
```
|
This is kind of an aside to the discussion, but I greatly appreciate everyone discussing ways of optimising these functions for various targets, and ways of optimising using them for said targets. Very much reaffirms my assumption from the beginning that getting these right is very complicated and generally depends a lot on compiler internals, which is why they should exist in the standard library IMHO. |
@Lohann fyi you can share the link to your exact godbolt setup in the top right corner, here: https://godbolt.org/z/r19nKaGh5. But thanks for mentioning this. I think the above point still stands, that we would like to figure out the best API before adding methods that may require intrinsics (for i128/u128). |
@clarfonthey @tgross35 Makes sense. About the API, I also think that

```rust
/// Calculates the complete product self * rhs without the possibility to overflow.
pub trait WideningMul<Rhs = Self> {
    type Output;

    #[must_use]
    fn widening_mul(self, rhs: Rhs) -> (Self::Output, Self::Output);
}
```

I implemented this trait for all primitives, except |
@Lohann are you sure the result of Zig makes sense?

```asm
u128_widening_mul:
; export fn u128_widening_mul(a: u128, b: u128, result: *[2]u128) void
;
; - x0 = lower 64-bit of `a`
; - x1 = upper 64-bit of `a`
; - x2 = lower 64-bit of `b`
; - x3 = upper 64-bit of `b`
; - x4 = pointer to `result`
	umulh x8, x2, x0
	; x8 := upper64(x2 * x0)
	stp xzr, xzr, [x4, #16]
	; x4[1] = 0
	mul x9, x2, x0
	; x9 := lower64(x2 * x0)
	madd x8, x2, x1, x8
	; x8 += lower64(x2 * x1)
	madd x8, x3, x0, x8
	; x8 += lower64(x3 * x0)
	stp x9, x8, [x4]
	; x4[0] = x8 << 64 | x9
	ret
```

it looks like it is simply computing

```
result[0] = a * b
result[1] = 0
```

I think any compiler targeting aarch64 that generated code for u128*u128 in significantly fewer than 18 instructions (which is the output of directly using the LLVM IR from #85532 (comment)) should be treated as having a bug. |
@kennytm good catch, you are right: there was a bug in my Zig code, I was doing a shift left instead of a shift right. Fixed 🤦

```zig
export fn u128_widening_mul(a: u128, b: u128, result: *[2]u128) void {
    const x: u256 = @intCast(a);
    const y: u256 = @intCast(b);
    const value: u256 = @mulWithOverflow(x, y)[0];
    result[0] = @truncate(value);
    result[1] = @truncate(value >> 128);
}
```

The Rust code only generates one instruction more than the Zig code, which is weird because Zig also uses LLVM, but that's not bad. I can go forward with this implementation, thank you sir! |
One potential (weird) solution might be implementing an internal 256-bit integer which isn't exposed publicly and doesn't have all the operations implemented, but could be used on LLVM's side to generate better code here. Or just going all out and doing generic integers like Zig has, and keeping them internal until a public API is settled on. Both probably require an MCP. |
I think we don't need full-blown types -- an intrinsic can lower to LLVM |
@typetetris @nickkuk Your messages made me realize something: the way to justify

If you want to send a PR adding it to nightly, I'm willing to approve it, though it'll need naming feedback from libs-api at some point before it would have a chance at stabilizing. (I could see

@clarfonthey We have intrinsics with fallback bodies now, so we can add a poor version in Rust in a way that can be overridden by LLVM (to emit |
Feature gate: `#![feature(bigint_helper_methods)]` and `#![feature(const_bigint_helper_methods)]`

This is a tracking issue for the following methods on integers:

- carrying_add
- borrowing_sub
- carrying_mul
- widening_mul

These methods are intended to help centralise the effort required for creating efficient big integer implementations by offering a few methods which would otherwise require special compiler intrinsics or custom assembly code to do efficiently. They do not alone constitute big integer implementations themselves, but are necessary building blocks for a larger implementation.
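As a sketch of the intended use (with `carrying_add` emulated on stable for illustration), chaining the carry through the limbs gives multi-limb addition:

```rust
// Stable-Rust stand-in for the unstable carrying_add.
fn carrying_add(a: u64, b: u64, carry: bool) -> (u64, bool) {
    let (s, c1) = a.overflowing_add(b);
    let (s, c2) = s.overflowing_add(carry as u64);
    (s, c1 | c2)
}

// acc += rhs over little-endian u64 limbs; returns the final carry-out.
fn add_in_place(acc: &mut [u64], rhs: &[u64]) -> bool {
    let mut carry = false;
    for (a, &b) in acc.iter_mut().zip(rhs) {
        let (sum, c) = carrying_add(*a, b, carry);
        *a = sum;
        carry = c;
    }
    carry
}

fn main() {
    // (2^128 - 1) + 1 == 2^128: both limbs clear, carry propagates out the top
    let mut acc = [u64::MAX, u64::MAX];
    assert!(add_in_place(&mut acc, &[1, 0]));
    assert_eq!(acc, [0, 0]);
}
```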
Public API
Steps / History
- `widening_mul` RFC: rfcs#2417
- Add carrying_add, borrowing_sub, widening_mul, carrying_mul methods to integers: #85017
- Reimplement `carrying_add` and `borrowing_sub` for signed integers: #93873 ("without the ability to overflow" can be confusing.)
Unresolved Questions
- Should there be a `widening_mul` that simply returns the next-larger type? What would we do for `u128`/`i128`?