Skip to content

global_evaluate: faithful parallel evaluate() (fix out-of-domain mislocation)#222

Merged
lmoresi merged 3 commits into
developmentfrom
bugfix/global-evaluate-parallel-extrapolation
Jun 12, 2026
Merged

global_evaluate: faithful parallel evaluate() (fix out-of-domain mislocation)#222
lmoresi merged 3 commits into
developmentfrom
bugfix/global-evaluate-parallel-extrapolation

Conversation

@lmoresi

@lmoresi lmoresi commented Jun 6, 2026

Copy link
Copy Markdown
Member

Problem

global_evaluate's parallel path silently returned wrong values for query points outside the (old) domain. The evaluation-swarm migrate routes unclaimed points by nearest rank centre-of-mass and strands them on an arbitrary rank, which extrapolates from a geometrically-far local cell. Deterministic reproduction (rotation gate, linear field T=x, np=5): a point at x=+0.64 read −0.42 (opposite side), max_err=1.06. This corrupted mesh-variable transfer on parallel mover-adapted meshes.

Fix

Make global_evaluate a faithful parallel evaluate() — interpolate inside, extrapolate from the true nearest cell outside, flag inside/outside. A best-claim out-of-domain fallback in global_evaluate_nd:

  • allgather the (small, boundary-layer) extrapolated set,
  • each rank reports nearest-local-cell distance + its local rbf extrapolation,
  • Allreduce(MIN dist / MIN rank / SUM winner-value) picks the globally-nearest rank's value.

Only unconditional collectives + a local rbf_evaluate (never the collective FE interpolation, which would desync per-rank → hang). O(boundary points), no dense global tree. Deadlock-safe by construction.

Default on; GE_LOCAL_FALLBACK=0 restores legacy. Serial path unchanged (gated mpi.size>1).

Validation

Rotation gate, T=x, np=5: max_err 1.06 → 0.003 (bit-identical to serial 0.003); turning the fallback off reproduces 1.06. tier-A green; used throughout the parallel adaptive-convection runs.

Underworld development team with AI support from Claude Code

…location

global_evaluate's parallel path had no correct handling for query points
outside the (old) domain: the evaluation-swarm migrate routes unclaimed points
by nearest rank centre-of-mass and strands them on an arbitrary rank, which
then extrapolates from a geometrically-far cell -> silently-wrong values,
parallel-only (e.g. an annulus boundary point reading the opposite side). This
corrupted mesh-variable transfer on parallel mover-adapted meshes.

Restore the serial evaluate() contract (interpolate inside / extrapolate from
the TRUE nearest cell outside / flag inside-outside) with a best-claim
out-of-domain fallback in global_evaluate_nd: allgather the (small,
boundary-layer) extrapolated set; every rank reports its nearest-local-cell
distance + its LOCAL rbf extrapolation; Allreduce(MIN dist / MIN rank / SUM
winner value) picks the globally-nearest rank's value. Only unconditional
collectives + local rbf_evaluate (never the collective FE interpolation, which
would desync) -> deadlock-safe. O(boundary points), no dense global tree.

Default on; GE_LOCAL_FALLBACK=0 restores legacy. Serial unchanged (gated
mpi.size>1). Validated: deterministic-rotation gate, linear field T=x, np=5,
max_err 1.06 -> 0.003 (== serial).

Underworld development team with AI support from Claude Code
Copilot AI review requested due to automatic review settings June 6, 2026 08:51

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates global_evaluate_nd to correct MPI-parallel evaluation for points that end up outside any rank’s owned cells after the swarm migrate round-trip, aiming to make parallel behavior match the serial evaluate() contract (interpolate in-domain; extrapolate just outside; return inside/outside via check_extrapolated) and to avoid deadlocks by using only unconditional collectives and rank-local RBF evaluation.

Changes:

  • Adds a parallel “best-claim” out-of-domain fallback: allgather stranded points, compute per-rank distance/value, and Allreduce to select the globally-best rank’s extrapolation.
  • Adds an environment-variable escape hatch (GE_LOCAL_FALLBACK) to disable the new fallback and restore legacy behavior.
  • Expands the global_evaluate_nd docstring/comments to document the parallel contract and deadlock-safety constraints.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +593 to +603
ext_vals, ext_flag = evaluate_nd(
expr, all_ext, rbf=True, evalf=False, verbose=False,
check_extrapolated=True,)
ext_vals = np.ascontiguousarray(
np.asarray(ext_vals, dtype=np.double).reshape((n_ext_total,) + expr_shape))
ext_flag = np.asarray(ext_flag).reshape(n_ext_total).astype(np.int32)

# Nearest-local-cell distance for every point (local kd-tree query).
mesh._build_kd_tree_index()
dist2, _ = mesh._centroid_index.query(all_ext, k=1, sqr_dists=True)
dist2 = np.ascontiguousarray(np.asarray(dist2, dtype=np.double).ravel())
Comment thread src/underworld3/function/_function.pyx Outdated
Comment on lines +574 to +576
import os
if uw.mpi.size > 1 and os.environ.get("GE_LOCAL_FALLBACK", "1") not in (
"0", "off", "false", "no", ""):
Comment on lines +371 to +374
Contract: this is a faithful *parallel* counterpart of :func:`evaluate` —
a query point is interpolated wherever in the mesh it lands (on any rank),
a point just outside the mesh is extrapolated from its true nearest cell,
and ``check_extrapolated`` returns an inside/outside flag per point. The
Comment on lines +533 to +537
# ------------------------------------------------------------------
# Out-of-domain extrapolation — keep the parallel result a faithful
# match for the serial ``evaluate()`` contract: interpolate a point
# wherever it lands across ranks, extrapolate a point just outside the
# mesh, and flag inside/outside.
lmoresi added 2 commits June 6, 2026 16:28
…K env)

The parallel out-of-domain best-claim fallback was controlled only by the
GE_LOCAL_FALLBACK environment variable (Copilot review flag). Promote it to a
real local_fallback=True kwarg threaded through global_evaluate ->
_global_evaluate_impl -> global_evaluate_nd. The kwarg is the supported
control surface; the env var, if explicitly set, still overrides it — an
operator escape hatch retained because of the parallel-deadlock debugging
history (a no-code kill switch for the added collectives). Serial is untouched
and bit-identical regardless of the kwarg.

Underworld development team with AI support from Claude Code
- GE_LOCAL_FALLBACK parsing is now case/whitespace-insensitive (.strip().lower()),
  so False/OFF/No disable the fallback as expected.
- clarify the contract docstring: out-of-domain points extrapolate from the
  globally nearest cell by a centroid-distance heuristic (lowest rank wins ties),
  not necessarily the exact nearest cell.

Underworld development team with AI support from Claude Code
@lmoresi

lmoresi commented Jun 12, 2026

Copy link
Copy Markdown
Member Author

Pushed fixes for the review nits:

  • GE_LOCAL_FALLBACK case-sensitivity — now normalized with .strip().lower(), so False/OFF/No disable the fallback. (Also: local_fallback is now a proper kwarg, not env-only.)
  • "true nearest cell" docstring — reworded to reflect the actual centroid-distance heuristic (lowest rank wins ties in parallel).

On the other two threads:

  • ext_flag always-true — this is correct by construction, not a bug: the points fed to the rbf fallback (all_ext) are exactly the points no rank could locate in-cell, so flagging them all as extrapolated is right, and writing return_mask back to True for them is a no-op. Happy to drop the redundant ext_flag plumbing as cleanup if preferred.
  • Missing regression test — agreed this is the real gap. A parallel out-of-domain test (asserting rank-independence) would lock the behaviour; flagging for a follow-up unless we want it in this PR.

@lmoresi lmoresi merged commit 58385b9 into development Jun 12, 2026
1 check passed
@lmoresi lmoresi deleted the bugfix/global-evaluate-parallel-extrapolation branch June 13, 2026 00:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants