
Tracking Issue for bigint helper methods #85532

Open · 4 of 8 tasks
clarfonthey opened this issue May 21, 2021 · 77 comments

Labels
C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@clarfonthey
Contributor

clarfonthey commented May 21, 2021

Feature gate: #![feature(bigint_helper_methods)] and #![feature(const_bigint_helper_methods)]

This is a tracking issue for the following methods on integers:

  • carrying_add
  • borrowing_sub
  • carrying_mul
  • widening_mul

These methods are intended to help centralise the effort required for creating efficient big integer implementations, by offering a few methods which would otherwise require special compiler intrinsics or custom assembly code in order to do efficiently. They do not alone constitute big integer implementations themselves, but are necessary building blocks for a larger implementation.

Public API

// On unsigned integers:

/// `self + rhs + carry` (full adder)
fn carrying_add(self, rhs: Self, carry: bool) -> (Self, bool);

/// `self - rhs - carry` (full "subtractor")
fn borrowing_sub(self, rhs: Self, carry: bool) -> (Self, bool);

/// `self * rhs + carry` (multiply-accumulate)
fn carrying_mul(self, rhs: Self, carry: Self) -> (Self, Self);

/// `self * rhs` (wide multiplication, same as `self.carrying_mul(rhs, 0)`)
fn widening_mul(self, rhs: Self) -> (Self, Self);


// On signed integers:

/// `self + rhs + carry` (full adder)
fn carrying_add(self, rhs: Self, carry: bool) -> (Self, bool);

/// `self - rhs - carry` (full "subtractor")
fn borrowing_sub(self, rhs: Self, carry: bool) -> (Self, bool);
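
For illustration, here is a minimal sketch (not part of the proposed API) of how the unsigned carrying_add is intended to chain across the limbs of a little-endian big integer; it assumes the feature gate above and equal-length slices:

// Adds two little-endian multi-limb numbers, returning the final carry.
// Requires #![feature(bigint_helper_methods)]. Illustrative only.
fn add_limbs(a: &[u64], b: &[u64], out: &mut [u64]) -> bool {
    let mut carry = false;
    for i in 0..out.len() {
        let (sum, c) = a[i].carrying_add(b[i], carry);
        out[i] = sum;
        carry = c;
    }
    carry
}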

Steps / History

Unresolved Questions

  • Should these be implemented using compiler intrinsics? LLVM currently has no equivalents, so we'd have to custom-build some.
  • Should an alternative API be provided for widening_mul that simply returns the next-larger type? What would we do for u128/i128?
  • What should the behaviour be for signed integers? Should there be implementations for signed integers at all?
  • Is the "borrowing" terminology worth it for subtraction, or should we simply call that "carrying" as well for consistency?
  • Are there other methods that should be added in addition to the existing ones?
clarfonthey added the C-tracking-issue and T-libs-api labels on May 21, 2021
@leonardo-m

Are there other methods that should be added in addition to the existing ones?

I'd like a mul_mod, as shown in #85017, because I think you can't implement it efficiently without asm, and it's a basic building block for power_mod and other things.
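
For reference, a hedged sketch of the semantics being asked for, shown for u64 by widening through u128 (the name mul_mod is the commenter's suggestion, not an accepted API). The point of the request is that no such wider built-in type exists for u128 itself:

/// Computes `(a * b) % m` without intermediate overflow by widening to u128.
/// Illustrative only; panics if `m == 0`.
fn mul_mod_u64(a: u64, b: u64, m: u64) -> u64 {
    ((a as u128 * b as u128) % (m as u128)) as u64
}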

@clarfonthey
Contributor Author

clarfonthey commented May 21, 2021

Another set of methods that could be useful that I'll probably offer implementations for at some point:

/// `(self << rhs) | carry`
fn carrying_shl(self, rhs: u32, carry: Self) -> (Self, Self); 

/// `(self >> rhs) | carry`
fn borrowing_shr(self, rhs: u32, carry: Self) -> (Self, Self);

/// `self << rhs`
fn widening_shl(self, rhs: u32) -> (Self, Self);

/// `self >> rhs`
fn widening_shr(self, rhs: u32) -> (Self, Self);

Essentially, return the two halves of a rotation, i.e. x.widening_shl(y) is the same as (x << y, x >> (BITS - y)) and similarly for widening_shr. Not sure whether they should allow rhs == BITS or not, but presumably they wouldn't for consistency with existing shift methods.
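
A minimal sketch of the described semantics for widening_shl on u64; how rhs == 0 and rhs >= BITS should be handled is still open, so the choices below are assumptions only:

/// `(x << rhs, x >> (BITS - rhs))`, with `rhs == 0` special-cased so that the
/// second shift never becomes an invalid shift by `BITS`. Illustrative only.
fn widening_shl_u64(x: u64, rhs: u32) -> (u64, u64) {
    assert!(rhs < u64::BITS);
    if rhs == 0 {
        (x, 0)
    } else {
        (x << rhs, x >> (u64::BITS - rhs))
    }
}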

@clarfonthey
Contributor Author

From @scottmcm in the original PR:

Some prior art I happened upon: https://docs.rs/cranelift-codegen/0.74.0/cranelift_codegen/ir/trait.InstBuilder.html#method.isub_bin

Same as isub with an additional borrow flag input. Computes:

   a = x - (y + b_{in}) \pmod 2^B

@photino

photino commented Sep 7, 2021

Why don't we add carrying_mul and widening_mul for i128/u128 as well?

@clarfonthey
Contributor Author

Mostly the effort of implementing them efficiently. In the meantime, you can do it with four calls to the u64 version, or three if you want to be fancy.

@RalfJung
Member

RalfJung commented Sep 9, 2021

fn borrowing_sub(self, rhs: Self, carry: bool) -> (Self, bool);

I was very confused by this function name at first, since borrowing in Rust usually refers to references. I am not a native speaker, but I do formal mathematical work in English professionally, and yet I had never before heard the term "borrowing" in the context of subtraction. So I think this, at least, needs some explanation in the docs. (I would have expected something like carrying_sub, but maybe that is nonsense for a native speaker.)

The current docs for some of the other methods could probably also be improved: they talk about not having the "ability to overflow", which makes it sound like not overflowing is a bad thing.

@AaronKutch
Contributor

The word borrow here comes from the terminology for a full subtractor. I am thinking that maybe the borrowing_sub function could be removed altogether. The same effect that borrowing_sub has can be obtained from carrying_add by making the first carrying_add in the chain have a set carry bit, and then bitnot every rhs. This fact could be put in the documentation of carrying_add.
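
One way to write that equivalence as code, assuming the sbb-style semantics discussed in this thread (result = self - rhs - borrow, flag set on unsigned underflow) and #![feature(bigint_helper_methods)] for carrying_add; this is a sketch, not the standard library's implementation:

/// In two's complement, `a - b - borrow == a + !b + (1 - borrow)`, and the
/// outgoing borrow is the inverted carry of that addition.
fn borrowing_sub_via_carrying_add(a: u64, b: u64, borrow: bool) -> (u64, bool) {
    let (diff, carry) = a.carrying_add(!b, !borrow);
    (diff, !carry)
}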

@clarfonthey
Contributor Author

The word borrow here comes from the terminology for a full subtractor. I am thinking that maybe the borrowing_sub function could be removed altogether. The same effect that borrowing_sub has can be obtained from carrying_add by making the first carrying_add in the chain have a set carry bit, and then bitnot every rhs. This fact could be put in the documentation of carrying_add.

Considering how the primary goal of these methods is to be as efficient as possible, usually optimising down to a single instruction, I don't think it'd be reasonable to just get rid of subtraction in favour of telling everyone to use addition instead. Definitely open to changing the name, though.

@AaronKutch
Contributor

AaronKutch commented Sep 9, 2021

These helper methods will not be very useful to me unless they are implemented for every kind of integer. Here is an implementation for a widening multiplication-addition for u128:

/// Extended multiply-addition of `(lhs * rhs) + add`. The result is returned as a tuple of the wrapping part and the
/// overflow part. No numerical overflow is possible even if all three arguments are set to their max values.
pub const fn widen_mul_add(lhs: u128, rhs: u128, add: u128) -> (u128, u128) {
    //                       [rhs_hi]  [rhs_lo]
    //                       [lhs_hi]  [lhs_lo]
    //                     X___________________
    //                       [------tmp0------]
    //             [------tmp1------]
    //             [------tmp2------]
    //     [------tmp3------]
    //                       [-------add------]
    // +_______________________________________
    //                       [------sum0------]
    //     [------sum1------]

    let lhs_lo = lhs as u64;
    let rhs_lo = rhs as u64;
    let lhs_hi = (lhs.wrapping_shr(64)) as u64;
    let rhs_hi = (rhs.wrapping_shr(64)) as u64;
    let tmp0 = (lhs_lo as u128).wrapping_mul(rhs_lo as u128);
    let tmp1 = (lhs_lo as u128).wrapping_mul(rhs_hi as u128);
    let tmp2 = (lhs_hi as u128).wrapping_mul(rhs_lo as u128);
    let tmp3 = (lhs_hi as u128).wrapping_mul(rhs_hi as u128);
    // tmp1 and tmp2 straddle the boundary. We have to handle three carries
    let (sum0, carry0) = tmp0.overflowing_add(tmp1.wrapping_shl(64));
    let (sum0, carry1) = sum0.overflowing_add(tmp2.wrapping_shl(64));
    let (sum0, carry2) = sum0.overflowing_add(add as u128);
    let sum1 = tmp3
        .wrapping_add(tmp1.wrapping_shr(64))
        .wrapping_add(tmp2.wrapping_shr(64))
        .wrapping_add(carry0 as u128)
        .wrapping_add(carry1 as u128)
        .wrapping_add(carry2 as u128);
    (sum0, sum1)
}

I have tested this with my crate awint.

edit: There is a version of this that uses the Karatsuba trick to use 3 multiplications instead of 4, but it incurs extra summations, branches, and is not as parallel. For typical desktop processors the above should be the fastest.

@clarfonthey
Contributor Author

I would make a PR for that.

@AaronKutch
Contributor

Some alternative signatures include u128::widen_mul_add(lhs, rhs, add), lhs.widen_mul_add(rhs, add), or add.widen_mul_add(lhs, rhs). In awint my general purpose mul-add function is mul_add_triop which uses the third signature but takes self mutably and add-assigns lhs * rhs. I'm not sure which is best.

@AaronKutch
Contributor

I would also change up the documentation headers for the carrying_add function to say

Extended addition of `self + rhs + carry`. The booleans are interpreted as a single bit
integer of value 0 or 1. If unsigned overflow occurs, then the boolean in the tuple
returns 1. The output carry can be chained into the input carry of another carrying add,
which allows for arbitrarily large additions to be calculated.

I specifically note unsigned overflow, because that happens for both signed and unsigned
integers because of how two's complement works.

@AaronKutch
Contributor

borrowing_sub should be left in with its naming, but its documentation could be

Extended subtraction of `self - rhs - borrow`. The "borrowing" here refers to borrowing in the full subtractor sense.
The booleans are interpreted as a single bit integer of value 0 or 1. If unsigned overflow occurs, then the boolean
in the tuple returns 1. The output carry can be chained into the input carry of another borrowing subtract,
which allows for arbitrarily large subtraction to be calculated.

@tspiteri
Contributor

tspiteri commented Sep 10, 2021

I specifically note unsigned overflow, because that happens for both signed and unsigned
integers because of how two's complement works.

But unsigned overflow and signed overflow are different. For example, on x86_64, while unsigned and signed integers share addition and subtraction instructions, unsigned overflow is detected using the carry flag while signed overflow is detected using the overflow flag.

As a concrete example: 127i8 + 1 causes signed overflow but not unsigned overflow. So the carry flag should be false/0.

Edit: I think I had misread your comment and thought the middle part of your comment was the current doc, not your suggestion, so it looks like I completely misinterpreted your final comment.

@AaronKutch
Contributor

Yes, signed and unsigned overflow are different, but carrying_add as implemented for both unsigned and signed integers uses unsigned overflow, because of how two's-complement carrying works. Someone using i64::carrying_add might think that the carry-out bit follows the bit of i64::overflowing_add when in actuality it follows u64::overflowing_add. So in the documentation I would put emphasis on _unsigned_ overflow.
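
A small illustration of that distinction, using only the stable overflowing_* methods so nothing about the unstable signed carrying_add is assumed:

// 127i8 + 1 overflows as a signed value, but its bit pattern does not carry out:
assert_eq!(127i8.overflowing_add(1), (-128, true));
assert_eq!((127i8 as u8).overflowing_add(1), (128, false));
// -1i8 + 1 is the opposite: no signed overflow, but the bit pattern carries out:
assert_eq!((-1i8).overflowing_add(1), (0, false));
assert_eq!((-1i8 as u8).overflowing_add(1), (0, true));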

@clarfonthey
Contributor Author

clarfonthey commented Sep 14, 2021

I think all of these are good suggestions, and as mentioned earlier, these changes definitely should go in a PR if you have the time. I think one important thing to note is that so far the APIs here seem good, but the documentation definitely could use some work. Although if there's a bigger case for changing the subtraction behaviour to be more in line with what's expected (the existing behaviour is mostly modelled after the x86 instructions adc and sbb), then I'm for that.

That said, the main goal is to make it relatively painless to write correct code that compiles down to the right instructions in release mode, so, I would say we should make sure that happens regardless of what's done. I would have added an explicit test for that but I honestly don't know how.

matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Nov 4, 2021
…riplett

Add more text and examples to `carrying_{add|mul}`

`feature(bigint_helper_methods)` tracking issue rust-lang#85532

cc `@clarfonthey`
JohnTitor added a commit to JohnTitor/rust that referenced this issue Nov 5, 2021
…riplett

Add more text and examples to `carrying_{add|mul}`

`feature(bigint_helper_methods)` tracking issue rust-lang#85532

cc `@clarfonthey`
geky added a commit to geky/gf256 that referenced this issue Nov 8, 2021
Multiplication, and carry-less multiplication, are inherently a widening
operation. Unfortunately, at the time of writing, the types in Rust
don't capture this well, being built around fixed-width wrapping
multiplication.

Rust's stdlib can rely on compiler-level optimizations to clean up
performance issues from unnecessarily-wide multiplications, but this
becomes a bit of an issue for our library, especially for u64 types,
since we rely on intrinsics, which may be hard for compilers to
optimize around.

This commit adds widening_mul, based on a proposal to add widening_mul
to Rust's primitive types:
rust-lang/rust#85532

As well as several other tweaks to how xmul is provided, moving more
arch-level details into xmul, but still limiting when it is emitted.
@Xuanmizhen

Given the importance of these methods for efficient big integer handling, I wonder if there's a plan to collaborate with the LLVM community to provide such compiler intrinsics. Maybe Rust is not the only language that needs them.

@kennytm
Member

kennytm commented Feb 18, 2024

carrying_add and carrying_sub are not much different from llvm.{uadd,sadd,usub,ssub}.with.overflow.

widening_mul can be done by the mul instruction on i256 (LLVM supports arbitrary precision arithmetic for +, -, *)

; (i just copy the llvm-ir output for `u128::from(u64a) * u128::from(u64b)` and double the width)
define { i128, i128 } @widening_mul(i128 noundef %a, i128 noundef %b) unnamed_addr #0 {
start:
  %_4 = zext i128 %a to i256
  %_5 = zext i128 %b to i256
  %m = mul nuw i256 %_5, %_4      ; <----
  %_6 = trunc i256 %m to i128
  %_8 = lshr i256 %m, 128
  %_7 = trunc i256 %_8 to i128
  %0 = insertvalue { i128, i128 } poison, i128 %_6, 0
  %1 = insertvalue { i128, i128 } %0, i128 %_7, 1
  ret { i128, i128 } %1
}

so I don't see much rationale from the LLVM side to provide anything more.

@clarfonthey
Contributor Author

Note that one of the stated goals for having this in the standard library is that we can add intrinsics for them if needed, but so far we haven't yet. We just need to have a solid API before we stabilise, and can continue tweaking performance over time.

@BattyBoopers

BattyBoopers commented Mar 18, 2024

How about multiplication of signed with unsigned values?
That operation only makes sense as a widening multiply: e.g. almost any i32 * u32 would overflow an i32 if the u32 is greater than i32::MAX, so you might as well cast both to i32 in the non-widening case. (Not exactly any: there are some corner cases, such as 0 * x and -1 * (i32::MAX + 1), that do not overflow, but they still work as expected after doing that cast.)

Also I find the order of return values a bit confusing. Why was it chosen this way around?

This returns the low-order (wrapping) bits and the high-order (overflow) bits of the result as two separate values, in that order.

Could we maybe make the return value a struct like this

#[repr(C)]
pub struct Widening<H, L> {
   pub hi: H,
   pub lo: L,
}

And maybe have the order of fields also be dependent on the endianness of the system, such that it can be transmuted directly into the corresponding wider integer type (that might require some padding as well)?

@kennytm
Member

kennytm commented Mar 18, 2024

How about multiplication of signed with unsigned values?

There is no widening_mul of signed with signed values either (see rationale above #85532 (comment)). Is there a use case of signed×signed and signed×unsigned aside from MulDiv?

BTW they can just be forwarded to the unsigned version like this (making use of e.g. $-a \times b = (-a + 2^{64}) \times b - 2^{64} \times b$):

fn widening_mul_signed_unsigned(a: i64, b: u64) -> (u64, i64) {
    let (lo, mut hi) = (a as u64).widening_mul(b);
    if a < 0 {
        hi -= b;
    }
    (lo, hi as _)
}

fn widening_mul_signed_signed(a: i64, b: i64) -> (u64, i64) {
    // note: for `i64 * i64` it is more efficient to just cast them to i128. 
    let (lo, hi) = (a as u64).widening_mul(b as u64);
    let mut hi = hi as i64;
    if a < 0 {
        hi -= b;
    }
    if b < 0 {
        hi -= a;
    }
    (lo, hi)
}

Could we maybe make the return value a struct like this

+1 on returning a named struct,
but -1 on making it #[repr(C)] (or any kind of special #[repr]).

@programmerjake
Member

And maybe have the order of fields also be dependent on the endianess of the system, such that it can be transmuted directly into the corresponding wider integer type (that might require some padding as well)?

AFAICT on LLVM, power-of-2-sized integer types never have internal padding as long as they're at least 8 bits wide. This is why to_ne_bytes is just a transmute for all current Rust integer types:

pub const fn to_ne_bytes(self) -> [u8; mem::size_of::<Self>()] {
    // SAFETY: integers are plain old datatypes so we can always transmute them to
    // arrays of bytes
    unsafe { mem::transmute(self) }
}

@programmerjake
Member

How about multiplication of signed with unsigned values?

There is no widening_mul of signed with signed values either (see rationale above #85532 (comment)). Is there a use case of signed×signed and signed×unsigned aside from MulDiv?

Yes: two's-complement biginteger multiply; see the comments I left on the corresponding ACP (most of that issue's comments): rust-lang/libs-team#228

@programmerjake
Member

BTW they can just be forwarded to the unsigned version like this (making use of e.g. −a×b = (−a + 2^64)×b − 2^64×b):

that is technically possible but highly inefficient with current LLVM since last I checked LLVM won't convert that branchy code sequence to a signed-[un]signed multiply.

@kennytm
Member

kennytm commented Mar 18, 2024

@programmerjake

that is technically possible but highly inefficient with current LLVM since last I checked LLVM won't convert that branchy code sequence to a signed-[un]signed multiply.

Note that my implementation is not the same as #85532 (comment). Although in the signed×signed case it is not as efficient as reusing i128::mul (compiling down to a single imul instruction), there are no jumps in the generated ASM.

And for the signed×unsigned case I'd argue it is more efficient than i128::mul.

Anyway, the algorithm is only interesting for implementing i128::widening_mul(i128); otherwise, simply casting to the wider integer type to multiply and letting LLVM figure out the optimization path should be best.

playground::widening_mul_signed_unsigned:
# %bb.0:
	movq	%rsi, %rax
	mulq	%rdi
	sarq	$63, %rdi
	andq	%rsi, %rdi
	subq	%rdi, %rdx
	retq
                                        # -- End function

playground::widening_mul_signed_signed:
# %bb.0:
	movq	%rsi, %rax
	mulq	%rdi
	movq	%rdi, %rcx
	sarq	$63, %rcx
	andq	%rsi, %rcx
	sarq	$63, %rsi
	andq	%rdi, %rsi
	addq	%rcx, %rsi
	subq	%rsi, %rdx
	retq
                                        # -- End function

# these two use i128::mul

playground::widening_mul_signed_signed_2:
# %bb.0:
	movq	%rsi, %rax
	imulq	%rdi
	retq
                                        # -- End function

playground::widening_mul_signed_unsigned_2: # @playground::widening_mul_signed_unsigned_2
# %bb.0:
	movq	%rsi, %rax
	mulq	%rdi
	sarq	$63, %rdi
	imulq	%rsi, %rdi
	addq	%rdi, %rdx
	retq
                                        # -- End function

# this uses a * b == sgn(a) * sgn(b) * abs(a) * abs(b)

playground::widening_mul_signed_signed_3: # @playground::widening_mul_signed_signed_3
# %bb.0:
	movq	%rdi, %rcx
	negq	%rcx
	cmovsq	%rdi, %rcx
	movq	%rsi, %rax
	negq	%rax
	cmovsq	%rsi, %rax
	mulq	%rcx
	xorl	%ecx, %ecx
	movq	%rax, %r8
	negq	%r8
	sbbq	%rdx, %rcx
	xorq	%rdi, %rsi
	cmovsq	%r8, %rax
	cmovsq	%rcx, %rdx
	retq
                                        # -- End function

@programmerjake
Member

on RISC-V, which has signed-unsigned mul, casting to i128 is more efficient in all cases:
https://rust.godbolt.org/z/vWc7a3WqP

@kennytm
Member

kennytm commented Mar 19, 2024

That suggests LLVM should improve its x86_64 i128::mul code generation.

@typetetris

Speaking about unsigned integers, is there a reason why it is

/// `self * rhs + carry` (multiply-accumulate)
fn carrying_mul(self, rhs: Self, carry: Self) -> (Self, Self);

and not

/// `self * rhs + carry1 + carry2` (multiply-accumulate)
fn carrying_mul(self, rhs: Self, carry1: Self, carry2: Self) -> (Self, Self);

since having that additional carry can't overflow either?

(For any B one has (B-1) * (B-1) + (B-1) + (B-1) = (B-1) * B + (B-1). B=2^64 for u64 and (B-1) = u64::MAX for example.)

Or are the bigint helper functions more about providing low-level access to helpful instructions that some architectures have, rather than about the "no overflow" guarantee (in this case)?
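
A quick sanity check of that identity, scaled down to B = 256 (u8 limbs widened to u16) so the arithmetic is easy to see; purely illustrative:

// (B-1)*(B-1) + (B-1) + (B-1) == B*B - 1, so two extra carries still fit
// in the double-width result.
let max = u8::MAX as u16;
assert_eq!(max * max + max + max, u16::MAX);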

@AaronKutch
Contributor

I've seen before that you can fit a second carry, but usually in bigint libraries we just have a single carry to deal with in multiply-add chains. For long multiplication there are additions in two directions, but a loop can only handle one chain at a time and is usually adding to a temporary and handling that carry separately with carrying_add.

@scottmcm
Member

@typetetris I don't know whether this will resonate, but I've been thinking of carrying_add like a Full Adder, as opposed to the existing overflowing_add that's just a Half Adder. When you're combining multiple of them for a larger-width operation, the full one is what you want.

So the goal here is to provide that primitive as the obvious way to write it in math libraries, without needing to know about the best way to represent it in the backend in use. For example, LLVM represents wide multiplication by casting to a larger type and doing multiplication on that, but cranelift has mul and mulhi instructions, and there's no way for a library to know its codegen backend. Much better for core to have this essential piece that does the right thing. Having biginteger math in core is probably not the right choice -- there are too many options for how to do it -- but it should make it easy for other crates to make a u256 or BigInteger.

So unless there's a particular use for the extra carry, I don't think it makes sense here, even though it's certainly a nice observation that another carry can fit. Like if the extra carry solved the slightly-wider intermediate result problem in Karatsuba then we should absolutely offer it (after all, it's easy to optimize out a + 0 for people that don't need it), but at least from a quick look I don't see a way for the extra carry to handle that.

@scottmcm
Member

Also I find the order of return values a bit confusing. Why was it chosen this way around?

Because it fits with overflowing_*, in that .0 is the wrapping_* result and the extra information is in .1.

@kennytm
Member

kennytm commented Apr 25, 2024

The return types of .overflowing_*(), .carrying_add() and .borrowing_sub() are (integer, bool), and there's no way to confuse the meaning of the two fields because they have different types.

This is entirely different for the .widening_mul() and .carrying_mul() operations, which currently return (integer, integer), and it's hard to tell which one is the low part and which is the high part. The name .widening_mul() does not even contain the word "overflowing" to relate it to the other .overflowing_*() functions and make sense of the ordering.

The two multiplication functions should really return a named struct.

@typetetris

typetetris commented Apr 26, 2024

@AaronKutch and @scottmcm Thanks for your explanations. I just had something like

// result = result + lhs * rhs
pub fn classic_mul_add_in_place(result: &mut [u64], lhs: &[u64], rhs: &[u64]) {
    debug_assert!(
        result.len() > lhs.len() + rhs.len(),
        "{} <= {} + {}",
        result.len(),
        lhs.len(),
        rhs.len()
    );
    debug_assert!(result[lhs.len() + rhs.len()] < u64::MAX);

    for (rhs_pos, rhs_leg) in rhs.iter().copied().enumerate() {
        let mut carry = 0u64;
        for (lhs_leg, result_place) in lhs
            .iter()
            .copied()
            .chain(std::iter::repeat(0u64))
            .zip(result[rhs_pos..].iter_mut())
        {
            let (new_digit, new_carry) = carrying_mul(rhs_leg, lhs_leg, *result_place, carry);
            *result_place = new_digit;
            carry = new_carry;
        }
        debug_assert_eq!(carry, 0u64);
    }
}

in a toy project of mine (certainly buggy, slow and not idiomatic or something!). The carrying_mul here simply does the "widening" trick of calculating in u128, so no additional carrying_add is needed. I didn't measure, though, whether it would make a difference on my machine.

If you ever find a way to handle the slightly wider intermediate multiplication in Karatsuba, let me know, please. At the moment I just handle the carries from the additions of high and low part separately leading to some code bloat.

Edit1: std::iter::repeat instead of std::iter::once
Edit2: The slices in the function are least significant "digit" first.

@nickkuk
Contributor

nickkuk commented May 20, 2024

Thank you @typetetris!

I found the carrying2_mul function with two carries very useful, even for long multiplication.

E.g., here is @kennytm's function, adapted to use carrying2_mul:

pub fn u64_widening_mul2(x: u64, y: u64) -> u128 {
    let a = (x >> u32::BITS) as u32;
    let b = x as u32;
    let c = (y >> u32::BITS) as u32;
    let d = y as u32;
    let (p1, p2) = widening_mul(b, d);
    let (p2, p31) = carrying_mul(b, c, p2);
    let (p2, p32) = carrying_mul(a, d, p2);
    let (p3, p4) = carrying2_mul(a, c, p31, p32);
    u128::from(p1) | u128::from(p2) << 32 | u128::from(p3) << 64 | u128::from(p4) << 96
}

Even the assembly seems to be better than for the initial version with an explicit overflowing_add: https://godbolt.org/z/aaqzYxsd3

@scottmcm @AaronKutch I'm not sure if this affects the addition of this method to std.

@Lohann

Lohann commented Jul 11, 2024

Btw I didn't find a way to generate optimal code for u128 widening multiplication for aarch64 targets.

Tool: https://godbolt.org/
Target: aarch64-unknown-linux-gnu
compiler options: -Copt-level=3 -Clto=fat -Ccodegen-units=1 -Cpanic=abort -Cstrip=symbols -Zlocation-detail=none

Rust code:

#![feature(const_bigint_helper_methods)]
#![feature(bigint_helper_methods)]

#[no_mangle]
pub fn u128_widening_mul(x: u128, y: u128, result: &mut [u128; 2]) {
    let a = (x >> 64) as u64;
    let b = x as u64;
    let c = (y >> 64) as u64;
    let d = y as u64;
    let (p1, p2) = b.widening_mul(d);
    let (p2, p31) = b.carrying_mul(c, p2);
    let (p2, p32) = a.carrying_mul(d, p2);
    let (p3, p4o) = p31.overflowing_add(p32);
    let (p3, p4) = a.carrying_mul(c, p3);
    let p4 = p4.wrapping_add(p4o.into());
    result[0] = u128::from(p1) | u128::from(p2) << 64;
    result[1] = u128::from(p3) | u128::from(p4) << 64;
}

output:

u128_widening_mul:
        umulh   x8, x2, x0
        mul     x10, x3, x0
        umulh   x9, x3, x0
        mul     x12, x2, x1
        adds    x8, x8, x10
        umulh   x11, x2, x1
        cinc    x9, x9, hs
        mul     x14, x3, x1
        adds    x8, x8, x12
        umulh   x10, x3, x1
        cinc    x11, x11, hs
        mul     x13, x2, x0
        adds    x12, x9, x11
        adds    x12, x14, x12
        cinc    x10, x10, hs
        cmn     x9, x11
        stp     x13, x8, [x4]
        cinc    x8, x10, hs
        stp     x12, x8, [x4, #16]
        ret

But using Zig 0.12.0 it was able to generate optimal code:
compiler options: -target aarch64-linux -O ReleaseFast -dead_strip -dead_strip_dylibs

export fn u128_widening_mul(a: u128, b: u128, result: *[2]u128) void {
    const value: u256 = @mulWithOverflow(@as(u256, a), @as(u256, b))[0];
    result[0] = @intCast(value);
    result[1] = @intCast(@shlWithOverflow(value, 128)[0]);
}

output:

u128_widening_mul:
        umulh   x8, x2, x0
        stp     xzr, xzr, [x4, #16]
        mul     x9, x2, x0
        madd    x8, x2, x1, x8
        madd    x8, x3, x0, x8
        stp     x9, x8, [x4]
        ret

@clarfonthey
Contributor Author

This is kind of an aside to the discussion, but I greatly appreciate everyone discussing ways of optimising these functions for various targets, and ways of optimising using them for said targets. Very much reaffirms my assumption from the beginning that getting these right is very complicated and generally depends a lot on compiler internals, which is why they should exist in the standard library IMHO.

@tgross35
Contributor

tgross35 commented Jul 11, 2024

@Lohann fyi you can share the link to your exact godbolt setup in the top right corner, here: https://godbolt.org/z/r19nKaGh5.

But thanks for mentioning this. Zig supports arbitrary-width integers so it just emits much better LLVM IR in your example, compared to Rust where it seems like LLVM can't optimize through the math. (Correction: nope, it's just doing a standard 128-bit multiply and dropping the wide part, see a few comments down. Looks like it isn't even telling LLVM to use i256.)

I think the above point still stands, that we would like to figure out the best API before adding methods that may require intrinsics (for i128/u128).

@Lohann

Lohann commented Jul 11, 2024

@clarfonthey @tgross35 Makes sense.

About the API, I also think that u128 must implement the same widening_mul primitive; otherwise it makes it difficult to write macros that work for all primitives. For example, I'm working on a PR for num-traits to support widening multiplication; the trait looks like this so far:

/// Calculates the complete product self * rhs without the possibility to overflow.
pub trait WideningMul<Rhs = Self> {
    type Output;

    #[must_use]
    fn widening_mul(self, rhs: Rhs) -> (Self::Output, Self::Output);
}

I implemented this trait for all primitives except u128, for which I can't find a way to generate optimal code.
rust-num/num-traits#331
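
To illustrate why u128 is the odd one out, here is a hedged sketch of how the trait above could be implemented for a smaller primitive by widening through the next larger built-in type, which is exactly what u128 lacks (illustrative only, not the actual num-traits PR code):

impl WideningMul for u32 {
    type Output = u32;

    fn widening_mul(self, rhs: u32) -> (u32, u32) {
        // Widen through u64, then split back into (low, high) halves.
        let wide = self as u64 * rhs as u64;
        (wide as u32, (wide >> 32) as u32)
    }
}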

@kennytm
Member

kennytm commented Jul 11, 2024

@Lohann are you sure the result of Zig makes sense?
u128_widening_mul:

; export fn u128_widening_mul(a: u128, b: u128, result: *[2]u128) void
;
; - x0 = lower 64-bit of `a`
; - x1 = upper 64-bit of `a`
; - x2 = lower 64-bit of `b`
; - x3 = upper 64-bit of `b`
; - x4 = pointer to `result`

        umulh   x8, x2, x0

; x8 := upper64(x2 * x0)

        stp     xzr, xzr, [x4, #16]

; x4[1] = 0

        mul     x9, x2, x0

; x9 := lower64(x2 * x0)

        madd    x8, x2, x1, x8

; x8 += lower64(x2 * x1)

        madd    x8, x3, x0, x8

; x8 += lower64(x3 * x0)

        stp     x9, x8, [x4]

; x4[0] = x8 << 64 | x9

        ret

it looks like it is simply computing

result[0] = a * b
result[1] = 0

I think any compiler targeting aarch64 that generated code for u128*u128 in significantly fewer than 18 instructions (which is the output of directly using the LLVM IR from #85532 (comment)) should be considered buggy.

@Lohann

Lohann commented Jul 11, 2024

@kennytm good catch, you are right: there was a bug in my Zig code, I was doing a shift left instead of a shift right. Fixed 🤦.

export fn u128_widening_mul(a: u128, b: u128, result: *[2]u128) void {
    const x: u256 = @intCast(a);
    const y: u256 = @intCast(b);
    const value: u256 = @mulWithOverflow(x, y)[0];
    result[0] = @truncate(value);
    result[1] = @truncate(value >> 128);
}

The Rust code only generates one more instruction than the Zig code, which is weird because Zig also uses LLVM, but that's not bad; I can go forward with this implementation, thank you sir!
https://godbolt.org/z/39Mq5serc

@clarfonthey
Contributor Author

One potential (weird) solution might be implementing an internal 256-bit integer which isn't exposed publicly and doesn't have all the operations implemented, but could be used on LLVM's side to generate better code here.

Or just going all out and doing generic integers like zig has, and keeping them internal until a public API is settled on.

Both probably require an MCP.

@cuviper
Member

cuviper commented Jul 11, 2024

I think we don't need full-blown types -- an intrinsic can lower to LLVM i256 internally, and hopefully similar on other backends.

@scottmcm
Member

@typetetris @nickkuk Your messages made me realize something: the way to justify carrying2_mul is not to talk about it as two carries, but that it's mul_add_carry. Basically, it's what you need for the wide version of z += a * b;. And, conveniently, LLVM already knows that both adds in the wider type can't overflow https://llvm.godbolt.org/z/6r7bxc39E.

If you want to send a PR adding it to nightly, I'm willing to approve it, though it'll need naming feedback from libs-api at some point before it would have a chance at stabilizing. (I could see carrying_mul_add, mul_add_carry, mul_add_with_carry, or more, so pick whatever you think is best justified.)
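
For concreteness, a hedged sketch of the operation under discussion, written for u64 by widening through u128 (the name carrying_mul_add is just one of the candidates listed above, not a settled API):

/// `self * rhs + add + carry`, returned as (low, high). The double-width sum
/// cannot overflow, since (B-1)^2 + 2*(B-1) = B^2 - 1.
fn carrying_mul_add_u64(a: u64, b: u64, add: u64, carry: u64) -> (u64, u64) {
    let wide = (a as u128) * (b as u128) + (add as u128) + (carry as u128);
    (wide as u64, (wide >> 64) as u64)
}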

@clarfonthey We have intrinsics with fallback bodies now, so we can add a poor version in Rust in a way that can be overridden by LLVM (to emit i256 operations like https://llvm.godbolt.org/z/qjfEajb7Y), to let LLVM see it as well as possible but without forcing all the other backends to implement it too.
