DivSqrtRecFN_small: add optional constant-time mode (closes skipCycle2 timing channel) by ShowMeTheStack · Pull Request #90 · ucb-bar/berkeley-hardfloat

ShowMeTheStack · 2026-05-08T12:06:54Z

DivSqrtRecFN_small: add optional constant-time mode (closes skipCycle2 timing channel)

Summary

Adds a new option bit divSqrtOpt_constTime = 32 to the consts object (declared in common.scala). When set, DivSqrtRecFN_small pads every divide / sqrt to a fixed worst-case latency of sigWidth + 5 cycles, eliminating the operand-dependent skipCycle2 fast path. The option is OFF by default; existing instantiations are unchanged, bit for bit and cycle for cycle.

Motivation

The iterative SRT divider in DivSqrtRecFN_small contains a documented performance optimisation, skipCycle2, that shortens the iteration count by one cycle when the partial significand at iter 3 satisfies a specific bit pattern (most commonly when the divisor's mantissa field is exactly zero, i.e. the divisor is a power of two).

The optimisation is also a well-defined timing channel: an attacker who controls the divisor can deterministically trigger or avoid the fast path. The resulting one-cycle differential per divide survives synthesis, place-and-route, and silicon fabric. This was validated on a Z-7020 ZedBoard build of this exact module via the open xc7 toolchain (Yosys 0.52 + nextpnr-xilinx + prjxray) at 25 MHz: every power-of-two (PoT) divisor cell measures 55 cycles and every non-PoT cell measures 56 cycles, deterministically and with no jitter. See the VecLeak paper [1].

The channel is not a hidden bug: skipCycle2 is named in the source. Its security implication, however, is not documented, and downstream users who route this divider through a security-sensitive FP path are exposed to a network-observable timing channel. Most directly, any FALCON post-quantum signing implementation that dispatches its inner-loop reciprocal ($c/\sigma$ in the Klein-Peikert-Ducas tree sampler) through hardware fdiv.d will inherit the leak. In particular, an RVV-vectorised FALCON-512 build on the default Saturn-on-Rocket parameter set (which routes vfdiv.vv through the host Rocket FPU's Hardfloat divider) is exploitable in 2,188 queries under deterministic Verilator measurement, rising to ~10⁷ queries under typical co-tenant noise.
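To make the channel concrete, here is a minimal behavioural sketch in Python, not the Chisel source. It assumes the simplified trigger described above (the fast path is taken exactly when the divisor's binary64 fraction field is zero) and reuses the 55/56-cycle figures quoted from the ZedBoard measurement. The function names are illustrative, and the real skipCycle2 condition is a bit pattern on the partial significand rather than this shortcut.

```python
import struct

FAST_CYCLES = 55  # power-of-two divisor, figure quoted in the text
SLOW_CYCLES = 56  # any other divisor, figure quoted in the text


def fraction_bits(x: float) -> int:
    """Return the 52-bit fraction field of an IEEE-754 binary64 value."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    return bits & ((1 << 52) - 1)


def modelled_div_cycles(divisor: float) -> int:
    """Simplified model: a zero fraction field selects the fast path.

    Special values (zero, infinity) also have a zero fraction field and are
    handled separately in real hardware; they are ignored in this sketch.
    """
    return FAST_CYCLES if fraction_bits(divisor) == 0 else SLOW_CYCLES


def attacker_guess_is_pot(observed_cycles: int) -> bool:
    """An observer recovers one bit about the divisor from the latency."""
    return observed_cycles == FAST_CYCLES


if __name__ == "__main__":
    for d in (0.5, 2.0, 1024.0, 3.0, 1.5, 0.1):
        cycles = modelled_div_cycles(d)
        print(f"{d!r:>8}: {cycles} cycles, guessed PoT = {attacker_guess_is_pot(cycles)}")
```

Under these assumptions the observer learns whether the divisor is a power of two from a single cycle count, which is the one-bit-per-divide leak the constant-time mode is meant to remove.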

Mechanism

A small counter wrapper inside the same module:

ct_counter is set to sigWidth + 5 on dispatch (= 58 cycles on FP64, where sigWidth = 53).
ct_counter decrements every cycle; ct_busy is held high until it reaches 1.
inReady is gated by !(ctEnable && ct_busy), so the consumer cannot dispatch a new operation while we're still padding.
The raw inner divider's rawOutValid_div / rawOutValid_sqrt pulses are latched into ct_pending_* flags as they fire (typically at iter 55 or 56 depending on operand class).
When ct_counter reaches 1, we emit outValid_* from the shadow ct_pending_* flag, and clear it.
The arithmetic result io.out is fed unchanged from the inner RoundRawFNToRecFN.
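The steps above can be sketched as a cycle-level model. This is a hedged Python approximation of the described behaviour, not the patch's Chisel code: the variable names mirror ct_counter and ct_pending from the text, the function name padded_latency is hypothetical, and the assumption that the counter first holds sigWidth + 5 in the cycle after dispatch is mine.

```python
def padded_latency(raw_latency: int, sig_width: int = 53) -> int:
    """Return the cycle (counted from dispatch) at which outValid is emitted.

    raw_latency is the cycle at which the inner divider's rawOutValid pulse
    fires, e.g. 55 or 56 on FP64 according to the text.
    """
    pad = sig_width + 5
    if not 0 < raw_latency < pad:
        raise ValueError("inner pulse must fire before the padding expires")

    ct_counter = pad      # loaded on dispatch
    ct_pending = False    # shadow flag for the inner outValid pulse
    cycle = 0
    while True:
        cycle += 1
        if cycle == raw_latency:
            ct_pending = True          # latch the inner pulse as it fires
        if ct_counter == 1:
            assert ct_pending          # the result must already be waiting
            return cycle               # emit outValid from the shadow flag
        ct_counter -= 1                # inReady stays low while counting


if __name__ == "__main__":
    for raw in (55, 56):
        print(f"raw pulse at cycle {raw} -> outValid at cycle {padded_latency(raw)}")
```

In this model both operand classes complete at the same cycle, because the emission time depends only on sig_width and never on when the inner pulse arrived.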

When (options & divSqrtOpt_constTime) == 0:

ctEnable evaluates to false.B.
inReady and inValid thread through unchanged.
outValid_div / outValid_sqrt are sourced directly from the inner module's pulses (via Mux(ctEnable, ..., divSqrtRecFNToRaw.io.rawOutValid_*)).
The unused shadow registers are dead-code-eliminated by synthesis.
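The gating described above is an ordinary bit-flag test. As a minimal illustration in Python (the constant values 16 and 32 come from this PR; the Python names and the out_valid helper are hypothetical stand-ins for the Chisel Mux):

```python
DIVSQRT_OPT_TWO_BITS_PER_CYCLE = 16  # existing option bit
DIVSQRT_OPT_CONST_TIME = 32          # new option bit proposed in this PR


def ct_enable(options: int) -> bool:
    """True when the constant-time bit is set in the options word."""
    return (options & DIVSQRT_OPT_CONST_TIME) != 0


def out_valid(options: int, raw_pulse: bool, padded_pulse: bool) -> bool:
    """Select the padded pulse when enabled, else pass the inner pulse through."""
    return padded_pulse if ct_enable(options) else raw_pulse


if __name__ == "__main__":
    for opts in (0, 16, 32, 16 | 32):
        print(f"options={opts:2d} ctEnable={ct_enable(opts)}")
```

Because the two option bits occupy different positions, they can be combined freely; only the value of bit 5 decides whether the wrapper is active.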

Worst-case latency for FP64 with divSqrtOpt_constTime ON: 58 cycles (the divider's longest path plus 1-cycle slack). Both PoT and non-PoT operand classes report 58 cycles.

Cost (constant-time mode ON)

Synthesis through Yosys 0.64 + sky130hd + OpenSTA, on the FP64 variant DivSqrtRecFN_small_e11_s53:

| Metric | Vanilla (option OFF) | Patched (option ON) | Delta |
| --- | --- | --- | --- |
| Total area | 16,357 µm² | 16,765 µm² | +2.49 % |
| Sequential area | 5,876 µm² (387 FFs) | 6,081 µm² (397 FFs) | +3.48 % (+10 FFs) |
| Combinational area | 10,481 µm² | 10,684 µm² | +1.94 % |
| Clock period | 14.79 ns | 15.96 ns | +7.91 % |
| `fmax` | 67.6 MHz | 62.6 MHz | −7.38 % |

Sequential cost: one 8-bit counter (ct_counter) + two 1-bit pending flags = 10 added FFs, exactly as predicted. When the option is OFF, all of this logic is dead-code-eliminated and the design is bit-equivalent to current master.

Validation

Verilator/microbench (cycle-exact). Re-running bench_vfdiv_rhs_sweep on the patched divider with the option ON shows a uniform 32,755 cycles across the 30-cell mantissa × exponent operand grid (vs the pre-patch range of 12,285 cycles). With the option OFF, the original 55/56-cycle PoT-vs-non-PoT differential is preserved.
Z-7020 silicon (Yosys + nextpnr-xilinx + prjxray, ZedBoard). The 30-cell sweep on the patched divider with the option ON returns the same cycle count for every operand class.
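The pass criterion behind both checks is that the spread (max minus min) of cycle counts across the operand grid is zero. The Python sketch below illustrates that criterion only. The 5 × 6 grid values are hypothetical stand-ins for the 30-cell sweep, and the per-operation latencies are modelled from the 55/56/58-cycle figures in this description rather than measured, so it does not reproduce the microbenchmark totals quoted above.

```python
from itertools import product

# Hypothetical 5 x 6 grid: fraction fields (52-bit) and unbiased exponents.
FRACTIONS = (0x0, 0x1, 0x8000000000000, 0x5555555555555, 0xFFFFFFFFFFFFF)
EXPONENTS = (-3, -2, -1, 0, 1, 2)


def modelled_cycles(fraction: int, const_time: bool) -> int:
    """Per-divide latency model using the figures quoted in the text."""
    if const_time:
        return 58                       # sigWidth + 5 on FP64
    return 55 if fraction == 0 else 56  # PoT fast path vs. everything else


def sweep(const_time: bool) -> dict:
    """Cycle count for every (fraction, exponent) cell of the grid."""
    return {
        (f, e): modelled_cycles(f, const_time)
        for f, e in product(FRACTIONS, EXPONENTS)
    }


def spread(cells: dict) -> int:
    """Difference between the slowest and fastest cell; 0 means uniform."""
    return max(cells.values()) - min(cells.values())


if __name__ == "__main__":
    print("cells:", len(sweep(True)))
    print("spread, option OFF:", spread(sweep(False)))
    print("spread, option ON: ", spread(sweep(True)))
```

The same spread function could be applied to measured cycle counts from Verilator or hardware in place of the modelled values.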

Backward compatibility

The constant-time path is gated on a new option bit that defaults to OFF. Consumers that don't set divSqrtOpt_constTime observe unchanged behaviour: same arithmetic, same cycle counts, same area, same fmax. No public API change.

Reviewer notes

The added FFs are minimal (8-bit counter + 2 1-bit flags) and the mechanism is local to DivSqrtRecFN_small.scala. No surrounding Hardfloat module is touched.
The new option follows the same divSqrtOpt_* naming convention as the existing divSqrtOpt_twoBitsPerCycle (= 16); we use = 32 (the next power-of-two bit).
Existing tests pass unchanged when the option is OFF (the constant-time path doesn't alter the arithmetic result either way).
This patch was prepared alongside the VecLeak side-channel paper [1]; the security context, microarchitectural analysis, and silicon-validation evidence appear in that paper.

[1] VecLeak: A Cycle-Exact Operand-Dependent Timing Channel in Open-Source RISC-V Vector Floating-Point Hardware. Forthcoming, MDPI Chips 2026 (arXiv preprint to follow).

…2 timing channel)

Summary
-------
Adds a new option bit `divSqrtOpt_constTime = 32` to common.scala. When set, DivSqrtRecFN_small pads every divide / sqrt to a fixed worst-case latency of `sigWidth + 5` cycles, eliminating the operand-dependent skipCycle2 fast path. The option is OFF by default; existing instantiations are bit-exact and cycle-exact unchanged.

Motivation
----------
The iterative SRT divider in DivSqrtRecFN_small contains a documented performance optimisation, `skipCycle2`, that shortens the iteration count by one cycle when the partial significand at iter 3 satisfies a specific bit pattern (most commonly when the divisor's mantissa is exactly zero, i.e. a power of two).

This is a documented performance feature, but it is also a well-defined operand-dependent timing channel: an attacker who controls the divisor can deterministically trigger or avoid the fast path, and the resulting per-divide cycle differential survives synthesis, place-and-route and silicon fabric (validated on a Z-7020 ZedBoard build of this exact module via the open xc7 toolchain in the VecLeak paper [1]).

For the avoidance of doubt this is not a "hidden bug": the bit is named `skipCycle2` in the source. The contribution of this patch is to provide a constant-time mode that closes the channel, suitable for downstream users who route this divider through a security-sensitive FP path (notably FALCON post-quantum signing implementations that dispatch their inner-loop reciprocals through hardware fdiv.d).

Mechanism
---------
A small (~10-FF, ~200-LUT) counter wrapper inside the same module pads every divide / sqrt to a fixed worst-case latency of `sigWidth + 5` cycles (= 58 cyc on FP64). The raw inner divider's outValid pulses are latched into shadow ct_pending_* flags as they fire and re-emitted when the counter reaches 1; inReady is held low for the duration so the upstream consumer cannot dispatch a new operation and overwrite the latched result. The arithmetic result is unchanged.

The new behaviour is gated on `(options & divSqrtOpt_constTime) != 0`. When the option is OFF, ctEnable evaluates to 0.B, the muxes select the inner pulses directly, and synthesis dead-code-eliminates the unused shadow counter / pending flags. Existing instantiations therefore observe no change in area, fmax, or per-cycle behaviour.

Synthesis cost (constant-time mode ON; Yosys + sky130hd + OpenSTA; FP64 variant DivSqrtRecFN_small_e11_s53):

vanilla: 16,357 um^2 / 67.6 MHz fmax
patched: 16,765 um^2 / 62.6 MHz fmax
delta: +2.49% area / -7.4% fmax

Reference
---------
[1] VecLeak: A Cycle-Exact Operand-Dependent Timing Channel in Open-Source RISC-V Vector Floating-Point Hardware (forthcoming, MDPI Chips 2026).

Signed-off-by: Jyotiprakash Mishra <mail@jyotiprakash.org>

aswaterman · 2026-05-08T22:03:12Z

RISC-V floating-point instructions are not part of the Zkt data-value-independent-timing extension and should not be used to process e.g. keys.

I think this is better handled by external logic rather than a change to hardfloat. The case can be detected and the result delayed for an extra cycle without modification to the core logic.
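One possible reading of this suggestion, as a behavioural Python sketch rather than anything proposed verbatim in the thread: logic outside the divider notices that the valid pulse arrived at the fast-path cycle and holds it for one more cycle. The 55/56-cycle figures are taken from the PR description; the function names and the list-of-booleans pulse representation are hypothetical.

```python
FAST_CYCLES = 55  # fast-path latency quoted in the PR description
SLOW_CYCLES = 56  # normal latency quoted in the PR description


def pulse(cycle: int, length: int = SLOW_CYCLES) -> list:
    """A single valid pulse; index i corresponds to cycle i + 1 after dispatch."""
    return [c == cycle for c in range(1, length + 1)]


def arrival_cycle(valid: list) -> int:
    """Cycle at which the (single) valid pulse is high."""
    return valid.index(True) + 1


def equalise(raw_valid: list, fast_cycles: int = FAST_CYCLES) -> list:
    """Delay the pulse by one cycle only when it arrived on the fast path."""
    if arrival_cycle(raw_valid) == fast_cycles:
        return [False] + list(raw_valid)   # hold the result one extra cycle
    return list(raw_valid) + [False]       # pass through, same overall length


if __name__ == "__main__":
    for raw in (FAST_CYCLES, SLOW_CYCLES):
        out = arrival_cycle(equalise(pulse(raw)))
        print(f"inner pulse at cycle {raw} -> external valid at cycle {out}")
```

This equalises only the two latency classes discussed in this thread and leaves the Hardfloat module untouched, which is the trade-off the comment argues for.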

ShowMeTheStack wants to merge 1 commit into ucb-bar:master from ShowMeTheStack:vecleak-divsqrt-constant-time
