
memory leak reproducer in go and rust #47

Open
h-2 wants to merge 1 commit into statsig-io:main from h-2:memory_leak

Conversation


@h-2 h-2 commented Apr 28, 2026

We have seen significant, permanent growth of memory usage when rolling out statsig in our test instances and especially controlled production deployments. This is currently blocking most uses of Statsig at DeepL.

With the help of Claude, we have created reproducers in Go and Rust. More details in the follow-up comments.


h-2 commented Apr 28, 2026

Memory leak in statsig-server-core with per-request user IDs

Running on v0.19.3 (HEAD) via the Go binding, I'm seeing unbounded RSS growth when each evaluation call uses a distinct userID — the typical server-side pattern where every incoming request carries a different end-user identity.

Clean A/B result, same setup, same 10,000 iterations:
Test                                      Initial RSS   Final RSS   Δ                       Result
TestMemoryLeak (fixed userID)             35.57 MB      44.12 MB    +8.55 MB (+24%)         PASS
TestMemoryLeakPerRequestUsers (unique)    43.67 MB      8.36 GB     +8.31 GB (+19,494%)     FAIL

Both tests call GetFeatureGate / GetDynamicConfig / GetExperiment / GetLayer / GetClientInitResponse on the same gates/configs. The only difference is whether the user argument is a singleton or constructed fresh per iteration. That one change makes RSS grow roughly 1000x faster.

The upstream TestMemoryLeak in statsig-go/test/memory_leak_test.go does call createUser(t, i) inside its hot loop but discards the result (_ = createUser(...)) and queries a fixed user instead — so it never exercises per-userID SDK state. The attached repro test is essentially the same loop with that one fix.

Caveat: the per-iteration growth (~830 KB) is inflated by createUser's longDummyString() (~100 KB custom attribute per user). The absolute magnitude is therefore exaggerated; the per-userID retention pattern is the real finding.

Likely cause: per-user state retained in the exposure logger / evaluation cache without eviction. This matches what we're seeing in production at high authenticated-user cardinality.

Happy to share the test file — it's a ~70-line drop-in at statsig-go/test/memory_leak_per_request_users_test.go.


h-2 commented Apr 28, 2026

Two questions answered:

  1. Go-side user destruction — correct

StatsigUserBuilder.Build() registers a finalizer at statsig_user.go:110-111 that calls statsig_user_release(obj.ref). GetFeatureGate / GetDynamicConfig / etc. pass only the uint64 ref by value — nothing Go-side retains a reference beyond the call. So if the leak were purely the Go side forgetting to release, we'd see the Rust-level test pass cleanly. It doesn't:

  2. Direct Rust reproducer — leak is in the Rust core

Wrote a Rust integration test that uses statsig_rust::Statsig / StatsigUser directly, no FFI, no Go binding. Same 10,000 iterations, unique user_id per iteration, same 5 evaluation calls:

Test                                              Initial RSS  Final RSS  Δ                      Result
Go — fixed user                                   35.6 MB      44.1 MB    +8.5 MB (+24%)         PASS
Go — unique user per call (100 KB custom attr)    43.7 MB      8.36 GB    +8.31 GB (+19,494%)    FAIL
Rust — unique user per call (minimal user)        10.5 MB      78.1 MB    +67.6 MB (+646%)       FAIL

Two useful conclusions:

• The leak is definitively in the Rust core. It reproduces without any binding layer. This eliminates Go / FFI / C++ wrapper as possible culprits.
• It scales with user object size. The Go test's ~830 KB/user leak was inflated by longDummyString() (~100 KB custom attr per user); minimal Rust users leak ~6.7 KB/user. Both confirm per-userID state retention; the absolute number just depends on what's inside the user.

Per-user growth with a bare StatsigUser is ~6.7 KB. As a rough back-of-envelope: at the steady rate of distinct account UUIDs we see per pod, that is the right order of magnitude for the ~50 MiB/h/pod growth we observe in production.
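(For scale: 50 MiB/h divided by ~6.7 KB per unique user works out to roughly 7,500 distinct user IDs per hour per pod, i.e. about two new IDs per second — hardly an extreme rate for authenticated traffic.)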

Two independent reproducers attached:

• statsig-go/test/memory_leak_per_request_users_test.go — Go binding
• statsig-rust/tests/memory_leak_per_request_users_tests.rs — Rust core (the decisive one)

The Rust one makes it impossible to dismiss this as a binding issue, and it will run in Statsig's own Rust CI.
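
For reference, the hot loop of the Rust reproducer looks roughly like the following. This is a sketch, not the attached file, and the statsig_rust method names are assumptions transliterated from the Go binding:

for i in 0..10_000 {
    // The only difference from the passing test: a fresh user per iteration.
    let user = StatsigUser::with_user_id(format!("user_{i}"));

    let _ = statsig.get_feature_gate(&user, "test_gate");
    let _ = statsig.get_dynamic_config(&user, "test_config");
    let _ = statsig.get_experiment(&user, "test_experiment");
    let _ = statsig.get_layer(&user, "test_layer");
    let _ = statsig.get_client_init_response(&user);
}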


h-2 commented Apr 28, 2026

Follow-up: heaptrack analysis pinpoints the retention site

Ran the Rust reproducer (10k iter) under heaptrack to localize the memory growth.

Framing
This is better described as long-lived retention / high-water growth than a classic leak — dropping the client would free it. The concern is that in a long-running server process the client never drops, so retention is indistinguishable from leakage in practice.

What the capture shows

• Sawtooth in the "Consumed" timeline — peaks of ~25 MB drop back cleanly after each event-batch serialize + flush. Event flushing works correctly.
• Rising baseline under the sawtooth — climbs from ~5 MB to ~10 MB over 13 s. That's the real signal.
• Bottom-Up view filtered to statsig_rust — "leaked" columns are in the hundreds of kB; "peak" is MB-scale. The heap is freed on Drop; it just accumulates while the client is alive.

Root cause: ExposureSampling::should_dedupe_exposure

exposure_sampling.rs:122-130

fn should_dedupe_exposure(&self, sampling_key: &ExposureSamplingKey) -> bool {
    let mut dedupe_set = write_lock_or_return!(TAG, self.exposure_dedupe_set, false);
    if dedupe_set.contains(sampling_key) {
        return true;
    }
    dedupe_set.insert(sampling_key.clone());   // no insert-time bound
    false
}

ExposureSamplingKey (line 311) includes user_values_hash, so every unique user adds a new entry. Reset is deferred to the event-logger background loop (try_reset_all_sampling, event_logger.rs:234), and the reset itself only calls clear() when the TTL expires or len() > 100_000 (exposure_sampling.rs:197).
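
To make the growth window concrete, here is a paraphrased sketch of that reset path — field and constant names are stand-ins, not the actual exposure_sampling.rs code:

use std::collections::HashSet;
use std::sync::RwLock;
use std::time::Instant;

const MAX_KEYS: usize = 100_000; // stand-in for the hardcoded SAMPLING_MAX_KEYS

struct SamplingState {
    // Keys include user_values_hash, so cardinality tracks unique users.
    dedupe_set: RwLock<HashSet<u64>>,
    last_reset: RwLock<Instant>,
    ttl_ms: u128, // corresponds to the remotely configurable exposure_dedupe_ttl_ms
}

impl SamplingState {
    // Only runs on the event-logger background loop. Between ticks nothing
    // bounds the set, and even this reset is conditional.
    fn try_reset(&self) {
        let mut set = self.dedupe_set.write().unwrap();
        let mut last_reset = self.last_reset.write().unwrap();
        if last_reset.elapsed().as_millis() >= self.ttl_ms || set.len() > MAX_KEYS {
            set.clear(); // frees entries, keeps the peak bucket allocation
            *last_reset = Instant::now();
        }
    }
}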

Two secondary notes:

• clear() removes entries but does not shrink the bucket allocation — so even after a reset, the HashSet keeps its peak capacity. This reinforces the "baseline never drops back down" observation in the heaptrack timeline.
• spec_sampling_set has the same clear-on-loop shape but is keyed only by (spec_name_hash, rule_id_hash), so it's bounded by gate count rather than user count — much less likely to be a meaningful high-cardinality contributor. Disregard it for this analysis.

Suggested fix

The priority is to enforce the bound at insert time, not on a background tick:

• Replace AHashSet<ExposureSamplingKey> with a bounded LRU (e.g. lru::LruCache) so eviction happens on insert and can't drift above the cap (a minimal sketch follows this list).
• Or pair the set with a per-entry TTL (expiring map) instead of one last_reset timestamp for the whole set.
• Without insert-time bounds, a hot server will keep growing between background-loop resets, and even after clear() the hash table keeps the peak allocation.
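
As an illustration of the first option, here is a minimal sketch of insert-time-bounded dedupe built on the lru crate. BoundedDedupe and the u64 key are stand-ins, not proposed names for an actual patch:

use std::num::NonZeroUsize;
use lru::LruCache;

// A dedupe set that can never exceed `max_keys` entries, because eviction
// happens on insert instead of on a background tick.
struct BoundedDedupe {
    // The u64 key stands in for ExposureSamplingKey (or its hash).
    seen: LruCache<u64, ()>,
}

impl BoundedDedupe {
    fn new(max_keys: usize) -> Self {
        Self {
            seen: LruCache::new(NonZeroUsize::new(max_keys).expect("cap must be non-zero")),
        }
    }

    // Returns true if the key was already seen; otherwise records it.
    // `get` also refreshes recency, so hot keys stay resident while one-off
    // user hashes age out under sustained high cardinality.
    fn should_dedupe(&mut self, key: u64) -> bool {
        if self.seen.get(&key).is_some() {
            return true;
        }
        self.seen.put(key, ());
        false
    }
}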

Note: exposure_dedupe_ttl_ms is already remotely configurable via SDK config, so the TTL knob isn't entirely absent. SAMPLING_MAX_KEYS is hardcoded, though — exposing it via StatsigOptions would be a low-risk stopgap before a deeper structural fix.
