DivSqrtRecFN_small: add optional constant-time mode (closes skipCycle2 timing channel)#90
Open
ShowMeTheStack wants to merge 1 commit into
Summary
-------
Adds a new option bit `divSqrtOpt_constTime = 32` to common.scala. When
set, DivSqrtRecFN_small pads every divide / sqrt to a fixed worst-case
latency of `sigWidth + 5` cycles, eliminating the operand-dependent
skipCycle2 fast path. The option is OFF by default, so existing
instantiations remain bit-exact and cycle-exact.
Motivation
----------
The iterative SRT divider in DivSqrtRecFN_small contains a documented
performance optimisation, `skipCycle2`, that shortens the iteration
count by one cycle when the partial significand at iteration 3 matches
a specific bit pattern (most commonly when the divisor's mantissa is
exactly zero, i.e. the divisor is a power of two). It is also a
well-defined operand-dependent timing channel: an attacker who controls
the divisor can deterministically trigger or avoid the fast path, and
the resulting one-cycle difference per divide survives synthesis,
place-and-route and silicon fabric (validated on a Z-7020 ZedBoard
build of this exact module via the open xc7 toolchain in the VecLeak
paper [1]).
For the avoidance of doubt this is not a "hidden bug": the bit is named
`skipCycle2` in the source. The contribution of this patch is to
provide a constant-time mode that closes the channel, suitable for
downstream users who route this divider through a security-sensitive FP
path (notably FALCON post-quantum signing implementations that dispatch
their inner-loop reciprocals through hardware fdiv.d).
Mechanism
---------
A small (~10-FF, ~200-LUT) counter wrapper inside the same module pads
every divide / sqrt to a fixed worst-case latency of `sigWidth + 5`
cycles (= 58 cycles on FP64). The raw inner divider's outValid pulses are
latched into shadow ct_pending_* flags as they fire and re-emitted when
the counter reaches 1; inReady is held low for the duration so the
upstream consumer cannot dispatch a new operation and overwrite the
latched result. The arithmetic result is unchanged.
The new behaviour is gated on `(options & divSqrtOpt_constTime) != 0`.
When the option is OFF, ctEnable evaluates to false.B, the muxes select the
inner pulses directly, and synthesis dead-code-eliminates the unused
shadow counter / pending flags. Existing instantiations therefore
observe no change in area, fmax, or per-cycle behaviour.
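The wrapper described above can be sketched as a toy cycle model. This is a behavioural illustration in Python, not the Chisel RTL from the patch: the class name `PaddedDivider`, the method `observed_latency`, and the convention that dispatch happens at cycle 0 are all assumptions made for the example. The inner latencies of 55 and 56 cycles are the FP64 figures quoted in this PR.

```python
from dataclasses import dataclass

SIG_WIDTH_FP64 = 53  # FP64 significand width, including the hidden bit


@dataclass
class PaddedDivider:
    """Toy cycle model of the padding wrapper (hypothetical, not the RTL)."""

    sig_width: int = SIG_WIDTH_FP64
    ct_enable: bool = True  # models (options & divSqrtOpt_constTime) != 0

    def observed_latency(self, inner_latency: int) -> int:
        """Return the cycle on which the consumer sees outValid.

        inner_latency is the cycle on which the raw inner divider pulses
        its outValid, counted from dispatch at cycle 0 (an assumption).
        """
        if not self.ct_enable:
            # Option OFF: the inner pulse is passed straight through.
            return inner_latency

        ct_counter = self.sig_width + 5  # loaded on dispatch
        ct_pending = False               # shadow flag for the inner pulse
        cycle = 0
        while True:
            cycle += 1
            if cycle == inner_latency:
                ct_pending = True        # latch the raw outValid pulse
            if ct_counter == 1:
                # Counter expired: re-emit the latched pulse.
                if not ct_pending:
                    raise ValueError("inner divider slower than padded latency")
                return cycle
            ct_counter -= 1


if __name__ == "__main__":
    padded = PaddedDivider(ct_enable=True)
    raw = PaddedDivider(ct_enable=False)
    for inner in (55, 56):
        print(inner, raw.observed_latency(inner), padded.observed_latency(inner))
```

In this model both operand classes report 58 cycles with the option ON, while the OFF path reproduces the inner latency unchanged. The `ValueError` branch marks the one condition the padding relies on: the padded latency must be at least the inner divider's worst case.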
Synthesis cost (constant-time mode ON; Yosys + sky130hd + OpenSTA;
FP64 variant DivSqrtRecFN_small_e11_s53):
            area          fmax
  vanilla:  16,357 um^2   67.6 MHz
  patched:  16,765 um^2   62.6 MHz
  delta:    +2.49%        -7.4%
Reference
---------
[1] VecLeak: A Cycle-Exact Operand-Dependent Timing Channel in
Open-Source RISC-V Vector Floating-Point Hardware (forthcoming,
MDPI Chips 2026).
Signed-off-by: Jyotiprakash Mishra <mail@jyotiprakash.org>
Member
RISC-V floating-point instructions are not part of the Zkt data-value-independent-timing extension and should not be used to process secrets such as keys. I think this is better handled by external logic rather than a change to hardfloat. The case can be detected and the result delayed for an extra cycle without modification to the core logic.
DivSqrtRecFN_small: add optional constant-time mode (closes skipCycle2 timing channel)
Summary
Adds a new option bit `divSqrtOpt_constTime = 32` to `consts.scala` (declared in `common.scala`). When set, `DivSqrtRecFN_small` pads every divide / sqrt to a fixed worst-case latency of `sigWidth + 5` cycles, eliminating the operand-dependent `skipCycle2` fast path. The option is OFF by default; existing instantiations are unchanged, bit-exact and cycle-exact.

Motivation
The iterative SRT divider in `DivSqrtRecFN_small` contains a documented performance optimisation, `skipCycle2`, that shortens the iteration count by one cycle when the partial significand at iteration 3 satisfies a specific bit pattern (most commonly when the divisor's mantissa is exactly zero, i.e. the divisor is a power of two).

This is a documented performance feature. It is also a well-defined timing channel: an attacker who controls the divisor can deterministically trigger or avoid the fast path. The resulting one-cycle difference per divide survives synthesis, place-and-route, and silicon fabric. It was validated on a Z-7020 ZedBoard build of this exact module via the open xc7 toolchain (Yosys 0.52 + nextpnr-xilinx + prjxray) at 25 MHz: every cell with a power-of-two (PoT) divisor measures 55 cycles and every non-PoT cell measures 56 cycles, deterministically and without jitter. See the VecLeak paper [1].
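To make the two operand classes concrete, here is a minimal sketch that sorts FP64 divisors into the PoT and non-PoT classes by inspecting the IEEE 754 fraction field. The function names are hypothetical, and restricting the check to normal numbers is a simplifying assumption: the PR does not say how subnormal, zero, infinite, or NaN divisors interact with `skipCycle2`. The 55 and 56 cycle figures are the unpatched Z-7020 measurements quoted above.

```python
import struct

FP64_FRACTION_BITS = 52
FP64_FRACTION_MASK = (1 << FP64_FRACTION_BITS) - 1
FP64_EXPONENT_MASK = 0x7FF


def is_pot_divisor(x: float) -> bool:
    """True if x is a normal FP64 value whose fraction field is all zero.

    Assumption: only normal numbers are classified as PoT here; zero,
    subnormals, infinities and NaN are reported as non-PoT.
    """
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    exponent = (bits >> FP64_FRACTION_BITS) & FP64_EXPONENT_MASK
    fraction = bits & FP64_FRACTION_MASK
    return fraction == 0 and 0 < exponent < FP64_EXPONENT_MASK


def unpatched_cycles(divisor: float) -> int:
    """Per-divide cycle count reported in this PR for the unpatched divider."""
    return 55 if is_pot_divisor(divisor) else 56


if __name__ == "__main__":
    for d in (0.5, 1.0, 2.0, 3.0, 1.5):
        print(d, is_pot_divisor(d), unpatched_cycles(d))
```

The sign bit is ignored, so `-4.0` falls in the same class as `4.0`.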
The channel is not a hidden bug: `skipCycle2` is named in the source. But the security implication is not, and downstream users who route this divider through a security-sensitive FP path are exposed to a network-observable timing channel. Most directly: any FALCON post-quantum signing implementation that dispatches its inner-loop reciprocal ($c/\sigma$ in the Klein-Peikert-Ducas tree sampler) through hardware `fdiv.d` will inherit the leak. In particular, an RVV-vectorised FALCON-512 build on the default Saturn-on-Rocket parameter set (which routes `vfdiv.vv` through the host Rocket FPU's Hardfloat divider) is exploitable in 2,188 queries on Verilator-deterministic measurement, scaling gracefully to ~10⁷ queries under typical co-tenant noise.

Mechanism
A small counter wrapper inside the same module:

- `ct_counter` is set to `sigWidth + 5` on dispatch (= 58 cycles on FP64).
- `ct_counter` decrements every cycle; `ct_busy` is held high until it reaches 1.
- `inReady` is gated by `!(ctEnable && ct_busy)`, so the consumer cannot dispatch a new operation while we're still padding.
- `rawOutValid_div` / `rawOutValid_sqrt` pulses are latched into `ct_pending_*` flags as they fire (typically at iteration 55 or 56, depending on operand class).
- When `ct_counter` reaches 1, we emit `outValid_*` from the shadow `ct_pending_*` flag and clear it.
- `io.out` is fed unchanged from the inner `RoundRawFNToRecFN`.

When `(options & divSqrtOpt_constTime) == 0`:

- `ctEnable` evaluates to `false.B`.
- `inReady` and `inValid` thread through unchanged.
- `outValid_div` / `outValid_sqrt` are sourced directly from the inner module's pulses (via `Mux(ctEnable, ..., divSqrtRecFNToRaw.io.rawOutValid_*)`).

Worst-case latency for FP64 with `divSqrtOpt_constTime` ON: 58 cycles (the divider's longest path plus 1-cycle slack). Both PoT and non-PoT operand classes report 58 cycles.

Cost (constant-time mode ON)
Synthesis through Yosys 0.64 + sky130hd + OpenSTA, on the FP64 variant `DivSqrtRecFN_small_e11_s53`:

            area          fmax
  vanilla:  16,357 um^2   67.6 MHz
  patched:  16,765 um^2   62.6 MHz
  delta:    +2.49%        -7.4%

Sequential cost: 8-bit counter (`ct_counter`) + two 1-bit pending flags = 10 added FFs, exactly as predicted. When the option is OFF, all of this is removed by dead-code elimination and the design is bit-equivalent to current `master`.

Validation
Verilator/microbench (cycle-exact). Re-running `bench_vfdiv_rhs_sweep` on the patched divider with the option ON shows a uniform 32,755 cycles across the 30-cell mantissa × exponent operand grid (vs the pre-patch range of 12,285 cycles). With the option OFF, the original 55/56-cycle PoT-vs-non-PoT differential is preserved.

Z-7020 silicon (Yosys + nextpnr-xilinx + prjxray, ZedBoard). The 30-cell sweep on the patched divider with the option ON returns the same cycle count for every operand class.
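The pass criterion behind both sweeps is simply that the spread of cycle counts across operand classes is zero. A minimal sketch of that check follows. The helper name `cycle_spread` and the two-entry dictionaries are illustrative assumptions, not the actual `bench_vfdiv_rhs_sweep` harness or its 30-cell output; the per-divide figures (55/56 with the option OFF, 58 with it ON) are the ones stated in this PR.

```python
def cycle_spread(cycles_by_class: dict) -> int:
    """Return max - min of the measured cycle counts; 0 means constant time."""
    values = list(cycles_by_class.values())
    if not values:
        raise ValueError("no measurements supplied")
    return max(values) - min(values)


# Per-divide cycle counts quoted in this PR for FP64 (illustrative layout).
option_off = {"pot": 55, "non_pot": 56}
option_on = {"pot": 58, "non_pot": 58}

if __name__ == "__main__":
    print("spread, option OFF:", cycle_spread(option_off))
    print("spread, option ON: ", cycle_spread(option_on))
```

A non-zero spread with the option ON would indicate that some operand class still escapes the padding.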
Backward compatibility
The constant-time path is gated on a new option bit that defaults to OFF. Consumers that don't set `divSqrtOpt_constTime` observe unchanged behaviour: same arithmetic, same cycle counts, same area, same fmax. No public API change.

Reviewer notes
- The wrapper logic lives in `DivSqrtRecFN_small.scala`. No surrounding Hardfloat module is touched.
- The option follows the same `divSqrtOpt_*` naming convention as the existing `divSqrtOpt_twoBitsPerCycle` (= 16); we use `= 32` (the next power-of-two bit).

[1] VecLeak: A Cycle-Exact Operand-Dependent Timing Channel in Open-Source RISC-V Vector Floating-Point Hardware. Forthcoming, MDPI Chips 2026 (arXiv preprint to follow).