DivSqrtRecFN_small: add optional constant-time mode (closes skipCycle2 timing channel)#90

Open
ShowMeTheStack wants to merge 1 commit into ucb-bar:master from ShowMeTheStack:vecleak-divsqrt-constant-time

@ShowMeTheStack

Summary

Adds a new option bit divSqrtOpt_constTime = 32 to the consts object (declared in common.scala). When set, DivSqrtRecFN_small pads every divide / sqrt to a fixed worst-case latency of sigWidth + 5 cycles, eliminating the operand-dependent skipCycle2 fast path. The option is OFF by default; existing instantiations remain bit-exact and cycle-exact.

Motivation

The iterative SRT divider in DivSqrtRecFN_small contains a documented performance optimisation, skipCycle2, that shortens the iteration count by one cycle when the partial significand at iter 3 satisfies a specific bit pattern (most commonly when the divisor's mantissa is exactly zero, i.e. a power of two).

It is also a well-defined timing channel: an attacker who controls the divisor can deterministically trigger or avoid the fast path. The resulting 1-cycle differential per divide survives synthesis, place-and-route, and implementation in silicon fabric. It was validated on a Z-7020 ZedBoard build of this exact module via the open xc7 toolchain (Yosys 0.52 + nextpnr-xilinx + prjxray) at 25 MHz: every power-of-two (PoT) divisor cell measures 55 cycles and every non-PoT cell measures 56 cycles, deterministically and without jitter. See the VecLeak paper [1].

The channel is not a hidden bug: skipCycle2 is named in the source. Its security implication, however, is not documented, and downstream users who route this divider through a security-sensitive FP path are exposed to a network-observable timing channel. Most directly, any FALCON post-quantum signing implementation that dispatches its inner-loop reciprocal ($c/\sigma$ in the Klein-Peikert-Ducas tree sampler) through hardware fdiv.d will inherit the leak. In particular, an RVV-vectorised FALCON-512 build on the default Saturn-on-Rocket parameter set (which routes vfdiv.vv through the host Rocket FPU's Hardfloat divider) is exploitable in 2,188 queries under deterministic Verilator measurement, scaling to ~10⁷ queries under typical co-tenant noise.

Mechanism

A small counter wrapper inside the same module:

  • ct_counter is set to sigWidth + 5 on dispatch (= 58 cycles for FP64).
  • ct_counter decrements every cycle; ct_busy is held high until it reaches 1.
  • inReady is gated by !(ctEnable && ct_busy), so the consumer cannot dispatch a new operation while we're still padding.
  • The raw inner divider's rawOutValid_div / rawOutValid_sqrt pulses are latched into ct_pending_* flags as they fire (typically at iter 55 or 56 depending on operand class).
  • When ct_counter reaches 1, we emit outValid_* from the shadow ct_pending_* flag, and clear it.
  • The arithmetic result io.out is fed unchanged from the inner RoundRawFNToRecFN.
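
The steps above can be sketched as a cycle-level behavioural model. This is a minimal Python sketch, not the Chisel in the patch: the class name ConstTimeWrapper, the helper observed_latency, and the exact register update order are assumptions made for illustration. The inner divider is reduced to a single valid pulse at cycle 55 or 56, the figures quoted above.

```python
from dataclasses import dataclass


@dataclass
class ConstTimeWrapper:
    """Hypothetical model of the padding counter (names follow the PR text)."""
    pad_cycles: int            # sigWidth + 5 in the PR, i.e. 58 for FP64
    ct_counter: int = 0        # 0 means idle (assumed encoding)
    ct_pending: bool = False   # latched copy of the inner valid pulse

    @property
    def in_ready(self) -> bool:
        # inReady is gated off while the wrapper is still padding.
        return self.ct_counter == 0

    def dispatch(self) -> None:
        assert self.in_ready, "cannot dispatch while padding"
        self.ct_counter = self.pad_cycles
        self.ct_pending = False

    def tick(self, raw_out_valid: bool) -> bool:
        """Advance one cycle and return the padded outValid."""
        if self.ct_counter == 0:
            return False
        # Latch the inner pulse whenever it fires.
        self.ct_pending = self.ct_pending or raw_out_valid
        if self.ct_counter == 1:
            # Final padding cycle: emit from the shadow flag and clear it.
            out_valid = self.ct_pending
            self.ct_pending = False
            self.ct_counter = 0
            return out_valid
        self.ct_counter -= 1
        return False


def observed_latency(inner_latency: int, pad_cycles: int = 58) -> int:
    """Cycles from dispatch to padded outValid for a given inner latency."""
    wrapper = ConstTimeWrapper(pad_cycles)
    wrapper.dispatch()
    for cycle in range(1, pad_cycles + 1):
        if wrapper.tick(raw_out_valid=(cycle == inner_latency)):
            return cycle
    raise RuntimeError("inner result did not arrive within the pad window")


if __name__ == "__main__":
    for inner in (55, 56):
        print(inner, "->", observed_latency(inner))
```

In this model both inner latencies produce the same padded latency, which is the property the option is meant to guarantee. An inner latency longer than the pad window raises an error rather than silently dropping the result.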

When (options & divSqrtOpt_constTime) == 0:

  • ctEnable evaluates to false.B.
  • inReady and inValid thread through unchanged.
  • outValid_div / outValid_sqrt are sourced directly from the inner module's pulses (via Mux(ctEnable, ..., divSqrtRecFNToRaw.io.rawOutValid_*)).
  • The unused shadow registers are dead-code-eliminated by synthesis.
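
The option decode and the output mux described above amount to a bit test followed by a select. The Python sketch below mirrors that logic under stated assumptions: the constant values 16 and 32 come from this PR, while the function names ct_enable and select_out_valid are hypothetical stand-ins for the Chisel expressions.

```python
# Flag values as stated in this PR; the Python names are illustrative only.
DIVSQRTOPT_TWO_BITS_PER_CYCLE = 16
DIVSQRTOPT_CONST_TIME = 32


def ct_enable(options: int) -> bool:
    """True when the constant-time bit is set in the options word."""
    return (options & DIVSQRTOPT_CONST_TIME) != 0


def select_out_valid(options: int, raw_out_valid: bool, padded_out_valid: bool) -> bool:
    """Models Mux(ctEnable, padded, raw): the raw pulse passes through when OFF."""
    return padded_out_valid if ct_enable(options) else raw_out_valid


if __name__ == "__main__":
    for opts in (0, DIVSQRTOPT_TWO_BITS_PER_CYCLE, DIVSQRTOPT_CONST_TIME,
                 DIVSQRTOPT_TWO_BITS_PER_CYCLE | DIVSQRTOPT_CONST_TIME):
        print(opts, ct_enable(opts))
```

Because the two flags occupy distinct bits, they can be combined with a bitwise OR without affecting each other's decode.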

Worst-case latency for FP64 with divSqrtOpt_constTime ON: 58 cycles (the divider's longest path plus 1-cycle slack). Both PoT and non-PoT operand classes report 58 cycles.
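
As a worked instance of the sigWidth + 5 formula, the Python sketch below evaluates the padded latency for the standard Hardfloat significand widths. Only the FP64 figure is reported in this PR; the FP16 and FP32 values are simply the formula applied to other widths and have not been checked against the divider's real worst case. The helper names are hypothetical.

```python
# Significand widths (including the hidden bit) for IEEE half/single/double.
SIG_WIDTH = {"FP16": 11, "FP32": 24, "FP64": 53}


def pad_cycles(sig_width: int) -> int:
    """Padded latency per the PR's formula: sigWidth + 5."""
    return sig_width + 5


def min_counter_bits(max_count: int) -> int:
    """Smallest register width that can hold max_count."""
    return max_count.bit_length()


if __name__ == "__main__":
    for name, sig_width in SIG_WIDTH.items():
        cycles = pad_cycles(sig_width)
        print(f"{name}: pad to {cycles} cycles, needs {min_counter_bits(cycles)} counter bits")
```

For FP64 the count of 58 fits in 6 bits, so the 8-bit ct_counter described in this PR has headroom for wider formats.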

Cost (constant-time mode ON)

Synthesis through Yosys 0.64 + sky130hd + OpenSTA, on the FP64 variant DivSqrtRecFN_small_e11_s53:

| Metric | Vanilla (option OFF) | Patched (option ON) | Delta |
|---|---|---|---|
| Total area | 16,357 µm² | 16,765 µm² | +2.49 % |
| Sequential area | 5,876 µm² (387 FFs) | 6,081 µm² (397 FFs) | +3.48 % (+10 FFs) |
| Combinational area | 10,481 µm² | 10,684 µm² | +1.94 % |
| Clock period | 14.79 ns | 15.96 ns | +7.91 % |
| fmax | 67.6 MHz | 62.6 MHz | −7.38 % |

Sequential cost: one 8-bit counter (ct_counter) plus two 1-bit pending flags = 10 added FFs, exactly as predicted. When the option is OFF, all of this logic is dead-code-eliminated and the design is bit-equivalent to current master.

Validation

  1. Verilator/microbench (cycle-exact). Re-running bench_vfdiv_rhs_sweep on the patched divider with the option ON shows a uniform 32,755 cycles across the 30-cell mantissa × exponent operand grid (vs the pre-patch range of 12,285 cycles). With the option OFF, the original 55/56-cycle PoT-vs-non-PoT differential is preserved.

  2. Z-7020 silicon (Yosys + nextpnr-xilinx + prjxray, ZedBoard). The 30-cell sweep on the patched divider with the option ON returns the same cycle count for every operand class.

Backward compatibility

The constant-time path is gated on a new option bit that defaults to OFF. Consumers that don't set divSqrtOpt_constTime observe unchanged behaviour: same arithmetic, same cycle counts, same area, same fmax. No public API change.

Reviewer notes

  • The added FFs are minimal (one 8-bit counter plus two 1-bit flags) and the mechanism is local to DivSqrtRecFN_small.scala. No surrounding Hardfloat module is touched.
  • The new option follows the same divSqrtOpt_* naming convention as the existing divSqrtOpt_twoBitsPerCycle (= 16); we use = 32 (the next power-of-two bit).
  • Existing tests pass unchanged when the option is OFF (the constant-time path doesn't alter the arithmetic result either way).
  • This patch was prepared in collaboration with the VecLeak side-channel paper [1]; the security context, microarchitectural analysis, and silicon-validation evidence appear in that paper.

[1] VecLeak: A Cycle-Exact Operand-Dependent Timing Channel in Open-Source RISC-V Vector Floating-Point Hardware. Forthcoming, MDPI Chips 2026 (arXiv preprint to follow).

…2 timing channel)

Summary
-------
Adds a new option bit `divSqrtOpt_constTime = 32` to common.scala.  When
set, DivSqrtRecFN_small pads every divide / sqrt to a fixed worst-case
latency of `sigWidth + 5` cycles, eliminating the operand-dependent
skipCycle2 fast path.  The option is OFF by default; existing
instantiations are bit-exact and cycle-exact unchanged.

Motivation
----------
The iterative SRT divider in DivSqrtRecFN_small contains a documented
performance optimisation, `skipCycle2`, that shortens the iteration
count by one cycle when the partial significand at iter 3 satisfies a
specific bit pattern (most commonly when the divisor's mantissa is
exactly zero, i.e. a power of two).  This is a documented performance
feature, but it is also a well-defined operand-dependent timing channel:
an attacker who controls the divisor can deterministically trigger or
avoid the fast path, and the resulting per-divide cycle differential
survives synthesis, place-and-route and silicon fabric (validated on a
Z-7020 ZedBoard build of this exact module via the open xc7 toolchain
in the VecLeak paper [1]).

For the avoidance of doubt this is not a "hidden bug": the bit is named
`skipCycle2` in the source.  The contribution of this patch is to
provide a constant-time mode that closes the channel, suitable for
downstream users who route this divider through a security-sensitive FP
path (notably FALCON post-quantum signing implementations that dispatch
their inner-loop reciprocals through hardware fdiv.d).

Mechanism
---------
A small (~10-FF, ~200-LUT) counter wrapper inside the same module pads
every divide / sqrt to a fixed worst-case latency of `sigWidth + 5`
cycles (= 58 cyc on FP64).  The raw inner divider's outValid pulses are
latched into shadow ct_pending_* flags as they fire and re-emitted when
the counter reaches 1; inReady is held low for the duration so the
upstream consumer cannot dispatch a new operation and overwrite the
latched result.  The arithmetic result is unchanged.

The new behaviour is gated on `(options & divSqrtOpt_constTime) != 0`.
When the option is OFF, ctEnable evaluates to 0.B, the muxes select the
inner pulses directly, and synthesis dead-code-eliminates the unused
shadow counter / pending flags.  Existing instantiations therefore
observe no change in area, fmax, or per-cycle behaviour.

Synthesis cost (constant-time mode ON; Yosys + sky130hd + OpenSTA;
FP64 variant DivSqrtRecFN_small_e11_s53):

  vanilla:    16,357 um^2 / 67.6 MHz fmax
  patched:    16,765 um^2 / 62.6 MHz fmax
  delta:      +2.49% area / -7.4% fmax

Reference
---------
[1] VecLeak: A Cycle-Exact Operand-Dependent Timing Channel in
    Open-Source RISC-V Vector Floating-Point Hardware (forthcoming,
    MDPI Chips 2026).

Signed-off-by: Jyotiprakash Mishra <mail@jyotiprakash.org>
@aswaterman
Member

aswaterman commented May 8, 2026

RISC-V floating-point instructions are not part of the Zkt data-value-independent-timing extension and should not be used to process e.g. keys.

I think this is better handled by external logic rather than a change to hardfloat. The case can be detected and the result delayed for an extra cycle without modification to the core logic.
