# On Multiplications with Unsaturated Limbs

This post is about a rather technical coding strategy choice that arises when implementing cryptographic algorithms on some elliptic curves, namely how to represent elements of the base field. We will be discussing Curve25519 implementations, in particular as part of Ed25519 signatures, as specified in RFC 8032. The most widely used Rust implementation of these operations is the curve25519-dalek library. My own research library is crrl, also written in plain Rust (no assembly); it is meant for research purposes, but I write it using all best practices for production-level implementations, e.g. it is fully constant-time and offers an API amenable to integration into various applications.

The following table measures performance of Ed25519 signature generation and verification with these libraries, using various backend implementations for operations in the base field (integers modulo 2^{255} – 19), on two test platforms (64-bit x86, and 64-bit RISC-V):

Implementation | x86 (Intel “Coffee Lake”) | RISC-V (SiFive U74) | |||

Library | Backend | sign | verify | sign | verify |

crrl | m64 | 49130 | 108559 | 202021 | 412764 |

m51 | 70149 | 148653 | 158928 | 304902 | |

curve25519-dalek | simd | 59553 | 116243 | – | – |

serial | 59599 | 169621 | 180142 | 449980 | |

fiat | 70552 | 198289 | 187220 | 429755 |

*Ed25519 performance (in clock cycles), on x86 and RISC-V.*

Test platforms are the following:

**x86:**an Intel Core i5-8259U CPU, running at 2.30 GHz (TurboBoost is disabled). This uses “Coffee Lake” cores (one of the late variants of the “Skylake” line). Operating system is Linux (Ubuntu 22.04), in 64-bit mode.**RISC-V:**a StarFive VisionFive2 board with a StarFive JH7110 CPU, running at 1 GHz. The CPU contains four SiFive U74 cores and implements the I, M, C, Zba and Zbb architecture extensions (and some others which are not relevant here). Operating system is again Linux (Ubuntu 23.04), in 64-bit mode.

In both cases, the current Rust “stable” version is used (1.72.0, from 2023-08-23), and compilation uses the environment variable RUSTFLAGS=”-C target-cpu=native” to allow the compiler to use all opcodes supported by the current platform. The computation is performed over a single core, with measurements averaged over randomized inputs. The CPU cycle counter is used. Figures above are listed with many digits, but in practice there is a bit of natural variance due to varying inputs (signature *verification* is not constant-time, since it uses only public data) and, more generally, because of the effect of various operations also occurring within the CPU (e.g. management mode, cache usage from other cores, interruptions from hardware…), so that the measured values should be taken with a grain of salt (roughly speaking, differences below about 3% are not significant).

crrl and curve25519-dalek differ a bit in how they use internal tables to speed up computations; in general, crrl tables are smaller, and crrl performs fewer point additions but more point doublings. For signature verification, crrl implements the Antipa *et al* optimization with Lagrange’s algorithm for lattice basis reduction, but curve25519-dalek does not. The measurements above show that crrl’s strategy works (i.e. it is a tad faster than curve25519-dalek) (note: not listed above is the fact that curve25519-dalek supports batch signature verification with a substantially lower per-signature cost; crrl does not implement that feature yet). The point of this post is not to boast about how crrl is faster; its good performance should be taken as an indication that it is decently optimized and thus a correct illustration of the effect of its implementation strategy choices. Indeed, the interesting part is how the different backends compare to each other, on the two tested architectures.

curve25519-dalek has three backends:

**serial:**Field elements are split over 5 limbs of 51 bits; that is, value*x*is split into five values*x*_{0}to*x*_{4}, such that*x*=*x*_{0}+ 2^{51}*x*_{1}+ 2^{102}*x*_{2}+ 2^{153}*x*_{3}+ 2^{204}*x*_{4}. Importantly, limb values are held in 64-bit words and may somewhat exceed 2^{51}(within some limits, to avoid overflows during computations). The representation is redundant, in that a given field element*x*accepts many different representations; a normalization step is applied when necessary (e.g. when serializing curve points into bytes).**fiat:**The fiat backend is a wrapper around the fiat-crypto library, which uses basically the same implementation strategy as the serial backend, but through automatic code generation that includes a correctness proof. In other words, the fiat backend is guaranteed through the magic of mathematics to always return the correct result, while in all other library backends listed here, the guarantee is “only” through the non-magic of code auditors (including myself) poring over the code for hours in search of issues, and not finding any (in practice all the code referenced here is believed correct).**simd:**AVX2 opcodes are used to perform arithmetic operations on four field elements in parallel; each element is split over ten limbs of 25 and 26 bits each. curve25519-dalek selects that backend whenever possible, i.e. on x86 systems which have AVX2 (such as an Intel “Coffee Lake”), but of course it is not available on RISC-V.

crrl has two backends:

**m51:**A “51-bit limbs” backend similar to curve25519-dalek’s “serial”, though with somewhat different choices for the actual operations (this is detailed later on).**m64:**Field elements are split over four 64-bit limbs, held in 64-bit types. By nature, such limbs cannot exceed their 64-bit range. The representation is still slightly redundant in that overall values may use the complete 256-bit range, so that each field element has two or three possible representations (a final reduction modulo 2^{255}– 19 is performed before serializing).

The “m64” backend could be deemed to be the most classical, in that such a representation would be what was preferred for big integer computations in, say, the 1990s. It minimizes the number of multiplication opcode invocations during a field element multiplication (with two 4-limb operands, 16 register multiplications are used), but also implies quite a lot of carry propagation. See for instance this excerpt of the implementation of field element multiplication in crrl’s m64 backend:

let (e0, e1) = umull(a0, b0);

let (e2, e3) = umull(a1, b1);

let (e4, e5) = umull(a2, b2);

let (e6, e7) = umull(a3, b3);

let (lo, hi) = umull(a0, b1);

let (e1, cc) = addcarry_u64(e1, lo, 0);

let (e2, cc) = addcarry_u64(e2, hi, cc);

let (lo, hi) = umull(a0, b3);

let (e3, cc) = addcarry_u64(e3, lo, cc);

let (e4, cc) = addcarry_u64(e4, hi, cc);

let (lo, hi) = umull(a2, b3);

let (e5, cc) = addcarry_u64(e5, lo, cc);

let (e6, cc) = addcarry_u64(e6, hi, cc);

let (e7, _) = addcarry_u64(e7, 0, cc);

let (lo, hi) = umull(a1, b0);

let (e1, cc) = addcarry_u64(e1, lo, 0);

let (e2, cc) = addcarry_u64(e2, hi, cc);

let (lo, hi) = umull(a3, b0);

let (e3, cc) = addcarry_u64(e3, lo, cc);

let (e4, cc) = addcarry_u64(e4, hi, cc);

let (lo, hi) = umull(a3, b2);

let (e5, cc) = addcarry_u64(e5, lo, cc);

let (e6, cc) = addcarry_u64(e6, hi, cc);

let (e7, _) = addcarry_u64(e7, 0, cc);

The addcarry_u64() calls above implement the “add with carry” operation, which, on x86, map to the ADC opcode (or, on that core, the ADCX or ADOX opcodes).

When Ed25519 signatures were invented, in 2011, the Intel CPUs *du jour* (Intel “Westmere” core) were not very good at carry propagation; they certainly supported the ADC opcode, but with a relatively high latency (2 cycles), and that made the classical code somewhat slow. The use of 51-bit limbs allowed a different code, which, in curve25519-dalek’s serial backend, looks like this:

let b1_19 = b[1] * 19;

let b2_19 = b[2] * 19;

let b3_19 = b[3] * 19;

let b4_19 = b[4] * 19;

// Multiply to get 128-bit coefficients of output

let c0: u128 = m(a[0], b[0]) + m(a[4], b1_19) + m(a[3], b2_19) + m(a[2], b3_19) + m(a[1], b4_19);

let mut c1: u128 = m(a[1], b[0]) + m(a[0], b[1]) + m(a[4], b2_19) + m(a[3], b3_19) + m(a[2], b4_19);

let mut c2: u128 = m(a[2], b[0]) + m(a[1], b[1]) + m(a[0], b[2]) + m(a[4], b3_19) + m(a[3], b4_19);

let mut c3: u128 = m(a[3], b[0]) + m(a[2], b[1]) + m(a[1], b[2]) + m(a[0], b[3]) + m(a[4], b4_19);

let mut c4: u128 = m(a[4], b[0]) + m(a[3], b[1]) + m(a[2], b[2]) + m(a[1], b[3]) + m(a[0] , b[4]);

This code excerpt computes the result over five limbs which can now range over close to 128 bits, and some extra high part propagation (not shown above) is needed to shrink limbs down to 51 bits or so. As we see here, there are now 25 individual multiplications (the m() function), since there are five limbs per input. There still are ADC opcodes in there! They are somewhat hidden behind the additions: these additions are over Rust’s u128 type, a 128-bit integer type that internally uses two registers, so that each addition implies one ADC opcode. However, these carry propagations can occur mostly in parallel (there are five independent dependency chains here), and that maps well to the abilities of a Westmere core. On such cores, the “serial” backend is faster than crrl’s m64. However, in later x86 CPUs from Intel (starting with the Haswell core), support for add-with-carry opcodes was substantially improved, and the classical method with 64-bit limbs again gained the upper hand. This was already noticed by Nath and Sarkar in 2018 and this explains why crrl’s m64 backend, on our x86 test system, is faster than curve25519-dalek’s serial and fiat backends, and even a bit faster than the AVX2-powered simd backend.

### RISC-V

Now enters the RISC-V platform. RISC-V is an open architecture which has been designed with what can be viewed as “pure RISC philosophy”, with a much reduced instruction set. It is inspired from the older DEC Alpha, including in particular a large number of integer registers (32), one of which being fixed to the value zero, and, most notably, no carry flag at all. An “add-with-carry” operation, which adds together two 64-bit inputs *x* and *y* and an input carry *c*, and outputs a 64-bit result *z* and an output carry *d*, now requires no fewer than five instructions:

- Add
*x*and*y*, into*z*(ADD). - Compare
*z*to*x*(SLTU): if*z*is strictly lower, then the addition “wrapped around”; the comparison output (0 or 1) is written into*d*. - Add
*c*to*z*(ADD). - Compare
*z*to*c*(SLTU) for another potential “wrap around”, with a 0 or 1 value written into another register*t*. - Add
*t*to*d*(ADD).

(I cannot *prove* that it is not doable in fewer RISC-V instructions; if there is a better solution please tell me.)

Thus, the add-with-carry is not only a high-latency sequence, but it also requires quite a few instructions, and instruction throughput may be a bottleneck. Out test platform (SiFive U74 core) is not well documented, but some cursory tests show the following:

- Multiplication opcodes have a throughput of one per cycle, and a latency of three cycles (this seems constant-time). As per the RISC-V specification (“M” extension), a 64×64 multiplication with a 128-bit result requires two separate opcodes (MUL returns the low 64 bits of the result, MULHU returns the high 64 bits). There is a recommended code sequence for when the two opcodes relate to the same operands, but this does not appear to be leveraged by this particular CPU.
- For “simple” operations such as ADD or SLTU, the CPU may schedule up to two instructions in the same cycle, but the exact conditions for this to happen are unclear, and each instruction still has a 1-cycle latency.

Under such conditions, a 5-instruction add-with-carry will need a minimum of 2.5 cycles (in terms of throughput). The main output (*z*) is available with a latency of 2 cycles, but the output carry has a latency of 4 cycles. A “partial” add-with-carry with no input carry is cheaper (an ADD and a SLTU), and so is an add-with-carry with no output carry (two ADDs), but these are still relatively expensive. The high latency is similar to the Westmere situation, but the throughput cost is new. For that RISC-V platform, we need to avoid not only long dependency chains of carry propagation, but we should also endeavour to do fewer carry propagations. Another operation which is similarly expensive is the split of a 115-bit value (held in a 128-bit variable) into a low (51 bits) and high (64 bits) parts. The straightforward Rust code looks like this (from curve25519-dalek):

let carry: u64 = (c4 >> 51) as u64;

out[4] = (c4 as u64) & LOW_51_BIT_MASK;

On x86, the 128-bit value is held in two registers; the low part is a simple bitwise AND with a constant, and the high part is extracted with a single SHLD opcode, that can extract a chunk out of the concatenation of two input registers. On RISC-V, there is no shift opcode with two input registers (not counting the shift count); instead, the extraction of the high part (called carry in the code excerpt above) requires three instructions: two single-register shifts (SHR, SHL) and one bitwise OR to combine the results.

In order to yield better performance on RISC-V, the crrl m51 backend does things a bit differently:

let a0 = a0 << 6;

let b0 = b0 << 7;

// ...

let (c00, h00) = umull(a0, b0);

let d0 = c00 >> 13;

Here, the input limbs are pre-shifted (by 6 or 7 bits) so that the products are shifted by 13 bits. In that case, the boundary between the low and high parts falls exactly on the boundary between the two registers that receive the multiplication result; the extraction of the high part becomes free! The low part is obtained with a single opcode (a right shift of the low register by 13 bits). Moreover, instead of performing 128-bit additions, crrl’s m51 code adds the low and high parts separately:

let d0 = c00 >> 13;

let d1 = (c01 >> 13)

+ (c10 >> 13);

let d2 = (c02 >> 13)

+ (c11 >> 13)

+ (c20 >> 13);

// ...

let h0 = h00;

let h1 = h01 + h10;

let h2 = h02 + h11 + h20;

In that way, all add-with-carry operations are avoided. This makes crrl’s m51 code somewhat slower than curve2519-dalek’s serial backend on x86, but it quite improves the performance on RISC-V.

### Conclusion

The discussion above is about a fairly technical point. In the grand scheme of things, the differences in performance between the various implementation strategies is not great; there are not many usage contexts where a speed difference of less than 30% in computing or verifying Ed25519 signatures has any relevance to overall application performance. But, insofar as such things matter, the following points are to be remembered:

- Modern large CPUs (for laptops and servers) are good at handling add-with-carry, and for them the classical “64-bit limbs” format tends to be the fastest.
- Some smaller CPUs will be happier with 51-bit limbs. However, there is no one-size-fits-all implementation strategy: for some CPUs, the main issue is the latency of add-with-carry, while for some others, in particular RISC-V systems, the instruction throughput is the bottleneck.