I'm trying to produce code (currently using clang++-3.8) that adds two numbers consisting of multiple machine words. To simplify things for the moment I'm only adding 128bit numbers, but I'd like to be able to generalise this.
First some typedefs:
typedef unsigned long long unsigned_word;
typedef __uint128_t unsigned_128;
And a "result" type:
struct Result
{
unsigned_word lo;
unsigned_word hi;
};
The first function, f
, takes two pairs of unsigned words and returns a result, by as an intermediate step putting both of these 64 bit words into a 128 bit word before adding them, like so:
Result f (unsigned_word lo1, unsigned_word hi1, unsigned_word lo2, unsigned_word hi2)
{
Result x;
unsigned_128 n1 = lo1 + (static_cast<unsigned_128>(hi1) << 64);
unsigned_128 n2 = lo2 + (static_cast<unsigned_128>(hi2) << 64);
unsigned_128 r1 = n1 + n2;
x.lo = r1 & ((static_cast<unsigned_128>(1) << 64) - 1);
x.hi = r1 >> 64;
return x;
}
This actually gets inlined quite nicely like so:
movq 8(%rsp), %rsi
movq (%rsp), %rbx
addq 24(%rsp), %rsi
adcq 16(%rsp), %rbx
Now, instead I've written a simpler function using the clang multi-precision primatives, as below:
static Result g (unsigned_word lo1, unsigned_word hi1, unsigned_word lo2, unsigned_word hi2)
{
Result x;
unsigned_word carryout;
x.lo = __builtin_addcll(lo1, lo2, 0, &carryout);
x.hi = __builtin_addcll(hi1, hi2, carryout, &x.carry);
return x;
}
This produces the following assembly:
movq 24(%rsp), %rsi
movq (%rsp), %rbx
addq 16(%rsp), %rbx
addq 8(%rsp), %rsi
adcq $0, %rbx
In this case, there's an extra add. Instead of doing an ordinary add
on the lo-words, then an adc
on the hi-words, it just add
s the hi-words, then add
s the lo-words, then does an adc
on the hi-word again with an argument of zero.
This may not look too bad, but when you try this with larger words (say 192bit, 256bit) you soon get a mess of or
s and other instructions dealing with the carries up the chain, instead of a simple chain of add
, adc
, adc
, ... adc
.
The multi-precision primitives seem to be doing a terrible job at exactly what they're intended to do.
So what I'm looking for is code that I could generalise to any length (no need to do it, just enough so I can work out how to), which clang produces additions in an manner with is as efficient as what it does with it's built in 128 bit type (which unfortunately I can't easily generalise). I presume this should just a chain of adc
s, but I'm welcome to arguments and code that it should be something else.