
perf: hot-path optimizations across entity ops, relationships, and lifecycle (18 workloads +4-35% faster, 0 regressions) #317

Open

Pyseph wants to merge 6 commits into Ukendio:main from Pyseph:perf/hot-path-optimizations

Pyseph commented Jun 12, 2026


Note

Authorship disclosure: these optimizations were researched, implemented, and benchmarked by Claude (Fable 5, Anthropic's model) running in Claude Code — including the failed experiments documented below. My role (@Pyseph) was direction, supervision, and independent verification: I ran every Studio benchmark round on real hardware and reviewed each round's results before the next began. Every commit carries a Co-Authored-By: Claude trailer. Flagging this up front so reviewers can calibrate scrutiny accordingly — though the work was held to the repo's own standard: nothing landed without surviving the full test suite, a >=80% win-rate A/B verdict, and a recompile confirmation.

Summary

Six self-contained performance commits to src/jecs.luau, developed and verified against this exact base commit (3087601). 18 benchmarked workloads get faster, 0 get slower, and behavior is bit-for-bit compatible — the full test suite (139 cases) passes after every individual commit, and luau-analyze stays clean throughout.

Final A/B numbers, measured in Roblox Studio under --!native --!optimize 2 (methodology below):

| workload | faster by | won rounds |
| --- | --- | --- |
| world:parent | +34.9% | 100% |
| world:contains | +34.7% | 100% |
| world:target | +32.8% | 100% |
| remove a component from a 5-component entity | +24.8% | 100% |
| world:delete (ChildOf child / plain 4-component) | +19.5% / +19.2% | 100% |
| world:remove | +18.0% | 100% |
| world:add (tag) | +16.4% | 93% |
| world:children (16k parents × 4 children) | +15.0% | 85% |
| world:add ChildOf (exclusive pair) | +14.4% | 100% |
| add a 5th component to a 4-component entity | +14.0% | 100% |
| world:has | +11.2% | 90% |
| world:clear | +9.8% | 100% |
| world:set (pair) | +8.9% | 100% |
| world:get (4 comps / 1 comp) | +8.3% / +7.1% | 97% / 82% |
| world:set (add / overwrite) | +6.2% / +4.2% | 83% / 88% |

Query iteration is deliberately untouched (byte-identical) — see "Failed experiments" for why that is a feature of this PR, not an omission.

Methodology

All numbers come from an A/B harness that loads the optimized copy and this base commit as two independent ModuleScripts and runs identical workloads against both, interleaved round-by-round with alternating order so GC/scheduler drift hits both sides equally. The reported time is the best-of-60-rounds (the run least disturbed by GC), and a result only counts as a win when the delta clears 1.5% and the optimized side wins ≥80% of paired rounds.

Two disciplines mattered more than expected:

  • Per-build bias is real. Two independently compiled copies of byte-identical source showed stable ±2–3% deltas (occasionally more) from native code layout alone. Every claimed win was therefore confirmed across a forced recompile — a result that moves with the layout is an artifact, one that survives is real.
  • Win-rate gates beat deltas. Allocation-heavy workloads (spawning, mass deletion) have best-of deltas that swing wildly with GC; the share-of-rounds-won is far more stable and is what separated real wins from noise.
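
The two gates described above can be combined into one verdict function. The following is a minimal sketch, not the actual harness: the name `verdict`, the constants, and the way the delta is derived (relative reduction in best-of time) are assumptions made for illustration, and the real runner may define "faster by" differently.

```luau
-- Hypothetical verdict helper; names and the delta formula are assumptions.
local MIN_DELTA = 0.015 -- the best-of delta must clear 1.5%
local MIN_WIN_RATE = 0.80 -- the optimized side must win >= 80% of paired rounds

local function verdict(base_times: { number }, opt_times: { number })
	assert(#base_times == #opt_times and #base_times > 0, "need paired rounds")

	local best_base, best_opt = math.huge, math.huge
	local wins = 0
	for round = 1, #base_times do
		local b, o = base_times[round], opt_times[round]
		-- Best-of timing: keep the round least disturbed by GC on each side.
		best_base = math.min(best_base, b)
		best_opt = math.min(best_opt, o)
		-- Win rate: compare the two sides within the same paired round.
		if o < b then
			wins += 1
		end
	end

	local delta = (best_base - best_opt) / best_base
	local win_rate = wins / #base_times
	local faster = delta > MIN_DELTA and win_rate >= MIN_WIN_RATE
	return faster, delta, win_rate
end
```

With this shape, a workload whose best-of delta is large but whose win rate sits below the gate is reported as noise rather than as a win, which is the behavior the second bullet argues for.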

Correctness was held to "bit-for-bit": beyond the test suite, the changes preserve pinned-but-obscure semantics — setting a value on a tag still errors after the tag was added (the frozen NULL_ARRAY write is the error mechanism), on_remove hooks may move the entity mid-remove and the operation re-reads record.archetype afterward, exclusive-pair replacement fires hooks on warm edge-cache hits, and generation arithmetic (including the wrap at 2^16) is value-identical for every input.

What each commit does

  1. Inline hot-path helper calls in entity operations. Luau's -O2 inliner only handles simple local functions, so entity_index_try_get_unsafe, fetch, the new_entity/archetype_append/inner_archetype_move chain, and the pair-math helpers were real call frames on every get/has/set/add/remove — frequently costing more than the work they wrapped. The fused move also removes a duplicated dst_entities store the old chain made.
  2. Per-transition move plans + target/parent inlining. Archetype moves resolved every column's destination through columns_map per entity; the first move along a (from → to) transition now caches a column-mapping plan that later moves replay with array reads. A gate keeps sources with <3 columns on the direct loop — without it, single-column removes regressed (measured, see below). world_parent becomes a specialized reader with the ChildOf wildcard id precomputed per world.
  3. world_delete/world_targets inlining. The delete path dropped ~6 call frames, including the whole archetype_delete body and a value-identical unified form of ECS_GENERATION_INC.
  4. world_contains inlining + each/children specialization. contains was three calls deep for a boolean.
  5. Delete probe-skip + counts-lookup elimination + clear inlining. The biggest single idea, adapted from flecs: ordinary entities are never used as a component id, pair relation, or pair target, so world_delete's three wildcard component_index probes always miss for them. id_record_create (the only producer of such records) now marks referenced entities in world.entity_used_as; deleting unmarked entities skips the probes and pair arithmetic entirely, and the mark is cleared on delete so recycled ids start clean. Stale marks are safe (they just fall back to probe-and-miss); missing marks cannot occur. Separately, records[aid] exists iff counts[aid] ≥ 1, so target/parent's index-0 path drops a redundant hash lookup — this took target from +14.5% to +32.8%.
  6. Fused liveness check (record.entity). Handle validation drops from a dependent two-load chain (record.dense → dense_array[dense] → compare) to a single field compare, maintained at every alive-making site. This lifted every operation a further +2–12% over commit 5.
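
The probe skip from commit 5 can be illustrated with a toy world. This is a sketch under stated assumptions, not the code in src/jecs.luau: the table shapes, the `probes` counter, and the function signatures are hypothetical, and the 24-bit mask is taken from the commit message's "low 24 bits".

```luau
-- Illustrative sketch only: a toy world, not the jecs implementation.
local ENTITY_MASK = 2 ^ 24 -- assumption: the low 24 bits carry the entity index

local function new_world()
	return {
		component_index = {}, -- id -> id record
		entity_used_as = {}, -- entity index -> true once an id record references it
		probes = 0, -- counts wildcard probes so the skip is observable
	}
end

-- Stand-in for id_record_create: the only producer of records,
-- so it is the only place where marks need to be set.
local function id_record_create(world, id, referenced_entities)
	world.component_index[id] = { size = 0 }
	for _, entity in referenced_entities do
		world.entity_used_as[entity % ENTITY_MASK] = true
	end
end

local function delete(world, entity)
	local key = entity % ENTITY_MASK
	if world.entity_used_as[key] then
		-- Slow path: the three wildcard component_index probes would run here.
		-- They are only counted in this sketch.
		world.probes += 3
	end
	-- ... row swap and record cleanup would follow here ...
	world.entity_used_as[key] = nil -- cleared at the tail so recycled ids start clean
end
```

An entity that was never passed through `id_record_create` takes the fast path and leaves `probes` untouched; a stale mark would only cost the old probe-and-miss work.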

Mistakes made and lessons learned

These cost real time and are worth recording for anyone optimizing this codebase later:

  • Query iterator "improvements" measure slower under native codegen. Restructuring the per-entity next() closures to read the boxed row-counter upvalue once instead of three times (and to fold column reads into the return expression) looks strictly better in bytecode terms — and consistently measured slower for high arities (the 8-component case lost 100% of rounds). Native codegen schedules the original's up-front column loads better than the "cleverer" shape. The iterators in this PR are byte-identical to base because every alternative was tried and lost.
  • Archetype list order affects iteration speed. Swapping query_archetypes from hash-walking idr.records to scanning the (cheaper) idr.cache array changed the order of compatible_archetypes — and cached-query iteration regressed 7% while wildcard queries regressed ~20%. The scan was never the cost; the order was. Reverted. (The table.create presizing in commit 5 keeps hash order exactly for this reason.)
  • Cache fetches have to amortize. The move-plan cache as first written regressed single-column removes from +18% to +3.6% because the plan fetch (~4 reads) costs more than the one lookup it saves. The <3-column gate in commit 2 is load-bearing, not decoration.
  • Table literal key order is a performance interface. Adding record.entity as the first constructor key reproducibly regressed the 4-component get path by 3–5% across recompiles — the insertion order shifted which keys win the record table's internal hash slots. Appending it last fixed it. If you add fields to hot record-shaped tables, append them.
  • A microbenchmark can lie in both directions. A CLI harness (luau 0.724, --codegen) predicted Studio results case-for-case for five rounds, then showed a phantom −3.4% on one case in round 6 that Studio measured as +8.3%. Conversely, byte-identical code routinely read ±3% in either harness. Nothing here was landed on a single environment's say-so.
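
The "cache fetches have to amortize" lesson is easier to see in code. Below is a minimal sketch of a per-transition move plan with the small-source gate. The archetype shape, the `plans` field, and `move_row` are hypothetical names for illustration; only the copy step is shown, and the swap-remove of the source row is omitted.

```luau
-- Illustrative sketch only; field names are hypothetical.
local MIN_PLANNED_COLUMNS = 3 -- sources with fewer columns keep the direct loop

local function new_archetype(id, component_ids)
	local columns, columns_map = {}, {}
	for i, cid in component_ids do
		local column = {}
		columns[i] = column
		columns_map[cid] = column
	end
	return { id = id, ids = component_ids, columns = columns, columns_map = columns_map, plans = {} }
end

local function move_row(from, from_row, to, to_row)
	local src_columns = from.columns
	if #src_columns < MIN_PLANNED_COLUMNS then
		-- Direct loop: one columns_map lookup per column, no plan fetch.
		for i, column in src_columns do
			local dst = to.columns_map[from.ids[i]]
			if dst then
				dst[to_row] = column[from_row]
			end
		end
		return
	end

	local plan = from.plans[to.id]
	if not plan then
		-- First move along this transition: resolve every destination once.
		plan = table.create(#src_columns)
		for i, cid in from.ids do
			plan[i] = to.columns_map[cid] or false
		end
		from.plans[to.id] = plan
	end
	-- Later moves replay the plan with plain array reads.
	for i, column in src_columns do
		local dst = plan[i]
		if dst then
			dst[to_row] = column[from_row]
		end
	end
end
```

Note that the branch is on the destination column, never on the stored value, so a stored `false` is still copied.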

What was deliberately not done

  • SoA entity records (parallel dense/row/archetype arrays instead of per-entity tables) are the largest remaining win. They would make world:entity() allocation-free, but they break the exported Record/jecs.record API shape, so they belong in a design discussion rather than this PR.
  • Iterator closure elimination in targets/each: every variant breaks the public directly-callable () -> Id contract or nested iteration.
  • Spawn-path allocation: the per-entity record table is inherent to the current API and the workload is GC-jitter-dominated anyway.

Test plan

  • luau test/tests.luau — 139/139 after each commit (also run with -O2).
  • luau-analyze src/jecs.luau — clean after each commit.
  • Studio A/B benchmark suite (29 workloads spanning reads, writes, archetype moves, relationships, hierarchy iteration, lifecycle): 18 FASTER / 0 SLOWER at the tip; the remainder are byte-identical code paths or documented GC-jitter cases reading as noise.

The benchmark harness itself (runner with interleaved best-of/win-rate verdicts plus the workload suite) lives in a separate repo and can be contributed too if there's interest.

🤖 Generated with Claude Code

Pyseph and others added 6 commits June 12, 2026 18:58
Luau's -O2 inliner only handles simple local functions, so the helper
calls inside every world:get/has/set/add/remove were real call frames
at runtime - often costing more than the work they wrapped:

- entity_index_try_get_unsafe -> direct sparse read + dense-slot
  generation check in each hot prologue
- fetch() inlined per component in world_get (branch on the column
  table, never on the value: stored values may be `false`)
- new_entity/archetype_append/inner_archetype_move fused into a single
  inner_entity_move (also removes a duplicated dst_entities store; the
  old chain wrote the entity into the destination twice)
- ECS_IS_PAIR/ECS_PAIR_FIRST/ECS_PAIR wildcard-id math in the
  world_set/world_add pair branches -> 3 arithmetic ops, zero calls
- archetype_traverse_remove edge-hit path inlined in world_remove;
  world.component_index field read -> captured upvalue
- single-id fast path in world_has; ecs_assert -> plain branch in the
  spawn paths

Measured in Roblox Studio under --!native --!optimize 2 against this
exact base commit (interleaved A/B rounds, best-of-60 timing, >=80%
win-rate gate): remove +18.2% (won 100% of rounds), set pair +7.7%
(100%), add ChildOf +6.6% (98%), set add +5.7% (98%), add tag +4.7%
(92%), get 1 comp +4.5% (88%), get 4 comps +3.9% (88%), set overwrite
+2.4% (83%). Query iteration is intentionally untouched: restructuring
the iterator closures measured as a regression under native codegen
and was reverted before landing.

No behavior changes: the full test suite passes, and pinned semantics
(tag-set must add before erroring, on_remove re-entrancy re-reads
record.archetype, exclusive-pair hook ordering) are preserved exactly.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…e world_parent

Archetype moves relocate every column the entity owns, resolving each
source column's destination through columns_map - one hash lookup per
column per moved entity, repeated identically for every entity taking
the same (from -> to) transition. The first move along a transition now
builds a small plan (parallel arrays: source column -> destination
column or false) cached on the source archetype; later moves replay it
with plain array reads.

The plan fetch itself costs ~4 reads, which measurably regressed
single-column moves (remove dropped from +18% to +3.6% in A/B runs), so
sources with fewer than 3 columns keep the direct loop. Plans into a
destroyed archetype are dropped via its edges in archetype_destroy;
a source archetype's own plans die with it; archetype ids are never
reused, so stale keys cannot alias.

world_target gets the same treatment as the round-1 entity ops
(inlined entity lookup, ECS_PAIR wildcard arithmetic, inlined
entity_index_get_alive/ECS_PAIR_SECOND), and world_parent no longer
delegates to world_target: it is a specialized reader with the ChildOf
wildcard id precomputed once per world.

Studio A/B vs base (same methodology as the previous commit), measured
on workloads where entities carry 4 data components: remove 5th
component +24.1% (won 100%), add 5th component +13.0% (98%),
parent +28.2% (100%), target +15.9% (100%). All previous wins held.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
world_delete paid ~6 call frames of bookkeeping per deletion:

- entity lookup helper -> inlined sparse read + generation check
- both wildcard probe ids (ECS_PAIR x2) -> inline arithmetic
- the archetype_delete call chain (row swap, swapped-record lookup,
  fast-delete column loops) -> inlined, re-reading record.archetype and
  record.row after the on_remove hooks exactly like the call did
- ECS_GENERATION_INC -> a unified formula that is value-identical to
  the original's two branches for every input, including the wrap to
  generation 0 at 2^16 (exercised by the 65535-iteration recycling
  test)
- world.component_index / world.archetypes field reads -> the world
  upvalues (world_cleanup reassigns the upvalue and the field together,
  so they cannot diverge)

world_targets gets the world_target treatment, plus ECS_PAIR_SECOND
inlined inside the per-target iterator closure.

Studio A/B vs base: delete (4 components) +3.8% (won 98%), delete of a
ChildOf child +3.8% (98%), targets +7.5% best-of, target +14.5% (100%),
parent +26.7% (100%). The delete deltas here are modest; the follow-up
commit that skips the wildcard probes entirely is where deletion gets
its large win.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
world_contains was three calls deep (entity_index_is_alive ->
entity_index_try_get -> try_get_any) to answer a boolean that
validation-heavy code (replication, networking) asks per entity per
message. It is now a direct sparse read + dense bounds + generation
check with identical semantics.

world_each/world_children move inside world_new so component_index and
archetypes are upvalues rather than per-call world field reads, and
world_children computes pair(ChildOf, parent) as a single addition
against a per-world constant instead of an ECS_PAIR call. The iterator
body is unchanged. world_entity's desired-id path drops its ECS_ID call
for the inline modulo.
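
A rough sketch of the per-world constant follows. The pair encoding used
here (target in the low 24 bits, relation in the next 24, plus a 2^48
offset) is an assumption for illustration; the real constants live in
src/jecs.luau. The names new_world and children_pair are hypothetical.

```luau
-- Assumed encoding, for illustration only.
local ENTITY_MASK = 2 ^ 24
local PAIR_OFFSET = 2 ^ 48

-- General form: strips generations from both halves on every call.
local function pair(relation, target)
	return (target % ENTITY_MASK) + (relation % ENTITY_MASK) * ENTITY_MASK + PAIR_OFFSET
end

local function new_world(child_of)
	-- Everything that does not depend on the target, computed once per world.
	local child_of_base = (child_of % ENTITY_MASK) * ENTITY_MASK + PAIR_OFFSET
	return {
		children_pair = function(parent)
			-- One addition against the constant; the modulo strips the generation.
			return child_of_base + parent % ENTITY_MASK
		end,
	}
end
```

Under this assumed layout both forms produce the same id for any parent,
with or without generation bits.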

Studio A/B vs base: contains +34.7% (won 100%); children iteration
(16k parents x 4 children, the realistic per-frame walk shape) +15.0%
(85%, high-jitter case). The desired-id change reads as noise on an
allocation-dominated workload and is kept as a value-identical call
removal.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…kups; inline world_clear

Three independent wins:

1. world_delete probe skip (the flecs EcsEntityIsId/IsTarget idea
   adapted to Luau): every delete built two 2^48-magnitude wildcard
   pair keys and probed component_index three times - and for virtually
   every game entity all three probes miss, because ordinary entities
   are never used as a component id, pair relation, or pair target.
   id_record_create is the only producer of component_index entries, so
   it now marks referenced entities (by low 24 bits) in
   world.entity_used_as; deleting an unmarked entity skips the pair
   arithmetic, all three probes, and the cleanup blocks. The mark is
   cleared at the delete tail so recycled ids start clean. A stale mark
   only falls back to the old probe-and-miss behavior; a missing mark
   cannot occur.

2. world_target/world_parent: records[archetype_id] exists iff
   counts[archetype_id] >= 1 (they are created and removed strictly
   together), so the index-0 path - the dominant call shape - needs
   only the records lookup. counts is consulted only for explicit
   nonzero indexes.

3. world_clear: same inlining recipe as world_delete (entity lookup +
   the archetype_delete body). The now-orphaned archetype_delete /
   archetype_fast_delete helpers are removed. query_archetypes also
   presizes its result with table.create(idr.size) + counter append
   (idr.size is an upper bound; iteration order over idr.records is
   unchanged).

Studio A/B vs base: delete (4 components) +19.2%, up from +3.8% before
this commit (won 100%), delete of a ChildOf child +19.5% (100%),
clear +9.8% (100%), target +32.8% (100%), parent +34.9% (100%).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Every hot operation validates its entity handle with a dependent
two-load chain: read record.dense, index dense_array with it, compare.
Each record now stores the canonical generation-carrying id it
currently belongs to, maintained in lockstep with dense_array at every
alive-making site (ENTITY_INDEX_NEW_ID, world_entity's fresh /
desired-id / promote branches, entity_index_ensure, and the
world_delete tail, which stores the bumped generation). The prologues
of get/has/set/add/remove/target/targets/parent/delete/clear become a
single field compare; contains keeps its strict alive_count bound but
also loses the dense_array load.
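
The difference between the two checks can be sketched on a toy entity
index. This is not the real layout: new_index, spawn, and kill are
hypothetical helpers, the 24-bit mask is an assumption, and kill omits
the dense swap and generation wrap that a real delete performs.

```luau
-- Illustrative sketch; the layout of the real entity index may differ.
local ENTITY_MASK = 2 ^ 24 -- assumption: low 24 bits are the index

local function new_index()
	return { sparse = {}, dense_array = {}, alive_count = 0 }
end

local function spawn(index, id)
	local dense = index.alive_count + 1
	index.alive_count = dense
	index.dense_array[dense] = id
	-- The `entity` field is written last in the constructor.
	local record = { dense = dense, row = 0, entity = id }
	index.sparse[id % ENTITY_MASK] = record
	return record
end

-- Simplified delete tail: bump the generation in both places, in lockstep.
local function kill(index, id)
	local record = index.sparse[id % ENTITY_MASK]
	local bumped = id + ENTITY_MASK
	index.dense_array[record.dense] = bumped
	record.entity = bumped
end

-- Old shape: two dependent loads before the compare.
local function is_alive_chained(index, id)
	local record = index.sparse[id % ENTITY_MASK]
	return record ~= nil and index.dense_array[record.dense] == id
end

-- Fused shape: one field compare on the record itself.
local function is_alive_fused(index, id)
	local record = index.sparse[id % ENTITY_MASK]
	return record ~= nil and record.entity == id
end
```

As long as every site that writes dense_array also writes record.entity,
the two checks agree for every handle.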

Records that were pre-populated but never made alive (world:range
prefill) leave the field nil, which reproduces the old nil-compare
failure exactly - including the existing quirk where a
recycled-but-not-yet-spawned id passes the unsafe check.

One non-obvious detail: the field is appended LAST in the record
constructors. Inserting it first changed which keys win the table's
internal hash slots and reproducibly regressed the 4-component get
path by 3-5% across recompiles; appended last, the pre-existing hot
fields keep their slots.

Head-to-head A/B against the previous commit (5 independent builds):
has +5-12.6%, add tag +5.8%, parent +5.4%, set overwrite +5.2%,
clear +4%, get 1 +3.8%, delete child +3.2%, target +2.1%. Final Studio
verdicts vs base for the full series: 18 workloads FASTER, 0 SLOWER.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Ukendio commented Jun 12, 2026

Owner

To temper expectations: I will not be accepting this PR as-is. However, there are still many things to learn here that I think are valuable. We will keep it open so that it is easy to index in the future, and so we can iterate on the things that can be improved in jecs.

Some notes:

  1. This seems overly eager to inline things that Luau already inlines. For simplicity's sake, we will not be doing that.
  2. The cached column pointers (move plans) are an interesting optimization, but their cost versus benefit has yet to be proven in real games, outside of limited benchmarks that do not really test an explosion of fragmentation and the many different edges that archetypes can have. They also make it more difficult to reclaim archetypes in the future, if we want to do that, as we generally prefer not to work with pointers too often. That said, it is an interesting optimization that we should at least give the benefit of the doubt.
  3. The entity_used_as map fragments the way we want to think of IDs, which is that most are accessed homogeneously through the component_index. Specializing those used for pairs runs counter to our goals.
  4. I find the idea of putting the entity in the record interesting, and generally a non-issue apart from growing the memory usage of entity records by 25%, which is ultimately insignificant. However, it does feel a bit tacky, which I need to grumble on to articulate further.
