Memory leak reproducer in Go and Rust #47
Conversation
**Memory leak in statsig-server-core with per-request user IDs**

Running on v0.19.3 (HEAD) via the Go binding, I'm seeing unbounded RSS growth when each evaluation call uses a distinct userID — the typical server-side pattern where every incoming request carries a different end-user identity.

Both tests call GetFeatureGate / GetDynamicConfig / GetExperiment / GetLayer / GetClientInitResponse on the same gates/configs. The only difference is whether the user argument is a singleton or constructed fresh per iteration. That one change makes RSS grow roughly 1000x faster.

The upstream TestMemoryLeak in statsig-go/test/memory_leak_test.go does call createUser(t, i) inside its hot loop but discards the result (`_ = createUser(...)`) and queries a fixed user instead — so it never exercises per-userID SDK state. The attached repro test is essentially the same loop with that one fix.

Caveat: the per-iteration growth (~830 KB) is inflated by createUser's longDummyString() (~100 KB custom attribute), so the absolute magnitude is exaggerated; the retain-per-userID pattern is the real finding.

Likely cause: per-user state retained in the exposure logger / evaluation cache without eviction. This matches what we're seeing in production at high authenticated-user cardinality.

Happy to share the test file — it's a ~70-line drop-in at statsig-go/test/memory_leak_per_request_users_test.go.
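To make the singleton-vs-fresh-user distinction concrete, here is a standalone toy model of the suspected pattern: state keyed by user ID with no eviction. This is an illustration, not the SDK's actual code; `retained_entries` and the `HashSet` are hypothetical stand-ins. A fixed user ID keeps the retained state at one entry, while a fresh ID per iteration grows it linearly.

```rust
use std::collections::HashSet;

/// Toy stand-in for per-user SDK state retained without eviction
/// (hypothetical; not the actual statsig-server-core data structure).
fn retained_entries(iterations: usize, fresh_user_per_request: bool) -> usize {
    let mut seen_users: HashSet<String> = HashSet::new();
    for i in 0..iterations {
        let user_id = if fresh_user_per_request {
            format!("user-{i}") // distinct end-user per request
        } else {
            "user-0".to_string() // singleton test user
        };
        // Simulates an evaluation call recording per-user state.
        seen_users.insert(user_id);
    }
    seen_users.len()
}

fn main() {
    // Singleton user: retained state stays at one entry.
    println!("singleton:   {}", retained_entries(10_000, false));
    // Fresh user ID each iteration: retained state grows linearly.
    println!("per-request: {}", retained_entries(10_000, true));
}
```

The upstream TestMemoryLeak effectively measures the first loop; the attached repro measures the second.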
Two questions answered:
1. StatsigUserBuilder.Build() registers a finalizer at statsig_user.go:110-111 that calls statsig_user_release(obj.ref). GetFeatureGate / GetDynamicConfig / etc. pass only the uint64 ref by value — nothing Go-side retains a reference beyond the call. So if the leak were purely the Go side forgetting to release, we'd see the Rust-level test pass cleanly. It doesn't.

2. Wrote a Rust integration test that uses statsig_rust::Statsig / StatsigUser directly, no FFI, no Go binding. Same 10,000 iterations, unique user_id per iteration, same 5 evaluation calls. Two useful conclusions:

• The leak is definitively in the Rust core: it reproduces without any binding layer, which eliminates Go, the FFI layer, and the C++ wrapper as possible culprits.
• Per-user growth on a bare StatsigUser is ~6.7 KB. Rough back-of-envelope against production: at the steady rate of distinct account UUIDs per pod, this is in the right order of magnitude for the ~50 MiB/h/pod we're seeing.

Two independent reproducers attached:
• statsig-go/test/memory_leak_per_request_users_test.go — Go binding

The Rust one makes it impossible to dismiss as a binding issue, and it will run in their own Rust CI.
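The back-of-envelope can be made explicit. Assuming ~6.7 KiB retained per distinct user, the distinct-user rate needed to fully explain ~50 MiB/h/pod works out to roughly 7,600 users per hour, about 2 per second. The actual production rate is not stated here, so this is a plausibility check only:

```rust
fn main() {
    let per_user_kib = 6.7; // measured retained growth per bare StatsigUser
    let target_mib_per_hour = 50.0; // observed production RSS growth per pod

    // Distinct users per hour needed to account for the observed growth.
    let users_per_hour = target_mib_per_hour * 1024.0 / per_user_kib;
    let users_per_second = users_per_hour / 3600.0;

    println!("~{users_per_hour:.0} distinct users/h (~{users_per_second:.1}/s)");
}
```

A few distinct authenticated users per second per pod is an ordinary load for a server-side SDK, which is why the magnitudes line up.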
**Follow-up: heaptrack analysis pinpoints the retention site**

Ran the Rust reproducer (10k iterations) under heaptrack to localize the memory growth.

Framing

What the capture shows:
• Sawtooth in the "Consumed" timeline — peaks of ~25 MB drop back cleanly after each event batch serialize + flush. Event flushing works correctly.

Root cause:

```rust
fn should_dedupe_exposure(&self, sampling_key: &ExposureSamplingKey) -> bool {
    let mut dedupe_set = write_lock_or_return!(TAG, self.exposure_dedupe_set, false);
    if dedupe_set.contains(sampling_key) {
        return true;
    }
    dedupe_set.insert(sampling_key.clone()); // no insert-time bound
    false
}
```
Two secondary notes: •

Suggested fix: the priority is enforcing a bound on insert, not on a background tick:
• Replace

Note:
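A minimal sketch of what an insert-time bound could look like, using a plain std HashSet. The type name, the cap value, and the clear-on-cap eviction policy are all assumptions for illustration, not the SDK's actual design; a real fix might prefer LRU or time-based eviction over a full clear:

```rust
use std::collections::HashSet;

const MAX_DEDUPE_ENTRIES: usize = 100_000; // illustrative cap, tune to memory budget

/// Dedupe set that enforces its bound on the insert path rather than on a
/// background tick. Hypothetical sketch, not the statsig-server-core API.
struct BoundedDedupeSet {
    entries: HashSet<String>,
}

impl BoundedDedupeSet {
    fn new() -> Self {
        Self { entries: HashSet::new() }
    }

    /// Returns true if the key was already present (i.e. should dedupe).
    fn should_dedupe(&mut self, sampling_key: &str) -> bool {
        if self.entries.contains(sampling_key) {
            return true;
        }
        // Insert-time bound: reset before the set can grow past the cap,
        // so a burst of distinct users between flushes cannot grow it unboundedly.
        if self.entries.len() >= MAX_DEDUPE_ENTRIES {
            self.entries.clear();
        }
        self.entries.insert(sampling_key.to_string());
        false
    }
}

fn main() {
    let mut set = BoundedDedupeSet::new();
    // One million distinct sampling keys, as a high-cardinality workload would produce.
    for i in 0..1_000_000 {
        set.should_dedupe(&format!("user-{i}:gate"));
    }
    // Memory stays bounded regardless of distinct-key cardinality.
    println!("entries after 1M distinct keys: {}", set.entries.len());
}
```

The key property is that the bound holds at every insert, independent of any flush or background timer, which is exactly what the unbounded `dedupe_set.insert(...)` above lacks.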
We have seen significant, permanent growth of memory usage when rolling out Statsig in our test instances and especially in controlled production deployments. This is currently blocking most uses of Statsig at DeepL.

With the help of Claude, we have created reproducers in Go and Rust. More details in the follow-up messages.