vfs: add per-instance inotify watch and event-queue caps#13188
Open
ibondarenko1 wants to merge 1 commit into
Open
vfs: add per-instance inotify watch and event-queue caps#13188ibondarenko1 wants to merge 1 commit into
ibondarenko1 wants to merge 1 commit into
Conversation
Inotify in pkg/sentry/vfs/inotify.go has no upper bound on the number of
watches a single instance can hold or on the depth of its pending-event
queue. AddWatch grows i.watches (line 313) and the target's Watches.ws
(line 433) without a size check; queueEvent (line 275) appends to
i.events without checking length.
Linux fs/notify/inotify/inotify_user.c caps both. The kernel returns
ENOSPC from inotify_new_watch when the per-user UCOUNT_INOTIFY_WATCHES
quota is reached (default 8192), and fsnotify_insert_event emits a
single IN_Q_OVERFLOW marker when group->q_len reaches max_events
(default 16384). Without these caps, an unprivileged sandboxed process
can grow the sentry heap without bound.
Witness (kali, runsc release-20260406.0, alpine 3.20 sandbox):
Baseline VmRSS = 52 MB.
inotify_add_watch x 200000 distinct dirs from one inotify fd.
Post-flood VmRSS = 510 MB. No ENOSPC returned at any step.
Sustainable growth rate approximately 4 MB per second.
Default sentry memory caps would OOM within minutes.
Add two per-instance caps matching the Linux default values:
maxInotifyWatchesPerInstance = 8192
maxInotifyQueuedEvents = 16384
AddWatch now returns (int32, error) and returns ENOSPC once
len(i.watches) reaches the cap. queueEvent tracks queue length via
numQueuedEvents under evMu and, on overflow, emits a single
IN_Q_OVERFLOW marker (wd = -1, mask = IN_Q_OVERFLOW) at the queue
tail unless one is already there. Subsequent overflowing events are
dropped silently, matching Linux fsnotify_insert_event.
A separate Linux limit, fs.inotify.max_user_instances (default 128),
is enforced per user namespace via UCOUNT_INOTIFY_INSTANCES in the
kernel. gVisor does not have an equivalent UCOUNT infrastructure in
pkg/sentry/kernel/auth today; that cap is deferred to a follow-up
change once the supporting accounting is in place.
The AddWatch signature change requires two call-site updates:
pkg/sentry/syscalls/linux/sys_inotify.go - propagate the error to
the inotify_add_watch(2) caller.
pkg/sentry/fsimpl/kernfs/kernfs_test.go - existing tests use t.Fatal
on unexpected errors.
Adds two regression tests in pkg/sentry/vfs/inotify_test.go:
TestInotifyAddWatchReturnsENOSPCAtCap
TestInotifyQueueOverflowEmitsMarker
Tested:
gofmt -l pkg/sentry/vfs/inotify.go pkg/sentry/vfs/inotify_test.go pkg/sentry/syscalls/linux/sys_inotify.go pkg/sentry/fsimpl/kernfs/kernfs_test.go
(clean)
Related: CVE-2023-7258 (gVisor mount-point ref-counting DoS, CWE-400,
CVSS 4.8) is the class precedent. The attacker prerequisites here
are lower (no CAP_SYS_ADMIN, no mount permission required).
19badfc to
5f413d5
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
pkg/sentry/vfs/inotify.gohas no per-instance cap on the number of watches an*Inotifycan hold or on the depth of its pending-event queue.AddWatchextendsi.watches(line 313) andWatches.ws(line 433) without a size check, andqueueEvent(line 275) appends toi.eventswithout checking length.Linux
fs/notify/inotify/inotify_user.ccaps both. The kernel returnsENOSPCfrominotify_new_watchwhen the per-userUCOUNT_INOTIFY_WATCHESquota is reached (default 8192).fsnotify_insert_eventemits a singleIN_Q_OVERFLOWmarker whengroup->q_lenreachesmax_events(default 16384). gVisor accepts both without bound.This PR adds two per-instance caps matching the Linux default values:
AddWatchreturnsENOSPConcelen(i.watches)reaches the cap.queueEventtracks queue length vianumQueuedEventsunderevMuand, on overflow, emits a singleIN_Q_OVERFLOWmarker (wd = -1, mask =IN_Q_OVERFLOW) at the queue tail unless one is already there. Subsequent overflowing events are dropped silently, matchingfsnotify_insert_event.Affected code at HEAD (
503ea178ff)pkg/sentry/vfs/inotify.golines 326-353 before the change:pkg/sentry/vfs/inotify.golines 275-293 before the change:Witness
Reproducer (alpine:3.20 container with
--runtime=runsc, runsc release-20260406.0):Measured sentry
VmRSS:No syscall returned
ENOSPC. Linux would have stopped at watch #8192 withENOSPC. Approximately 2.3 KB sentry heap per watch. Sustainable consumption rate approximately 4 MB per second; a 512 MB sentry caps within minutes.Linux reference
fs/notify/inotify/inotify_user.cdefinesinotify_table:{ .procname = "max_user_watches", .data = &init_user_ns.ucount_max[UCOUNT_INOTIFY_WATCHES], .maxlen = sizeof(long), .mode = 0644, .proc_handler = proc_doulongvec_minmax, ... }, { .procname = "max_queued_events", .data = &inotify_max_queued_events, ... }Default
max_user_watchesranges from 8192 to 1048576 depending on system RAM. Defaultmax_queued_eventsis 16384. gVisor adopts the lower-bound conservative defaults.inotify_new_watchenforces the watch quota:fsnotify_insert_eventenforces the queue quota by inserting an overflow marker oncegroup->q_len >= group->max_events.Change
pkg/sentry/vfs/inotify.gomaxInotifyWatchesPerInstanceandmaxInotifyQueuedEventsconstants nearinotifyEventBaseSize.numQueuedEvents intfield toInotify, protected byevMu.AddWatchsignature changes from(int32)to(int32, error). Returnslinuxerr.ENOSPConcelen(i.watches) >= maxInotifyWatchesPerInstance.queueEventincrementsnumQueuedEventsafterPushBack. On overflow, emits a singleIN_Q_OVERFLOWmarker at the tail if one is not already present and returns without queuing the would-be event.numQueuedEventswhen an event is removed viai.events.Remove(event).pkg/sentry/syscalls/linux/sys_inotify.goAddWatcherror to theinotify_add_watch(2)syscall return.pkg/sentry/fsimpl/kernfs/kernfs_test.goAddWatchcall sites updated to use the new(wd, err)return;t.Fatalon unexpected error.pkg/sentry/vfs/inotify_test.go(new file).Out of scope for this PR
fs.inotify.max_user_instances(Linux default 128) is enforced per user namespace viaUCOUNT_INOTIFY_INSTANCES. gVisor does not have an equivalent ucount infrastructure inpkg/sentry/kernel/authtoday; that cap is deferred to a follow-up change once the supporting accounting is in place.The cap added here is per-Inotify-instance. Linux is per-user across all instances of a user. The per-instance cap covers a narrower axis than Linux's per-user cap; a process holding multiple Inotify instances can still exceed the Linux equivalent total. Once UCOUNT-like accounting lands in pkg/sentry/kernel/auth, a per-user cap can be added on top of this per-instance one.
Test plan
gofmt -l pkg/sentry/vfs/inotify.go pkg/sentry/vfs/inotify_test.go pkg/sentry/syscalls/linux/sys_inotify.go pkg/sentry/fsimpl/kernfs/kernfs_test.goreturns clean.ENOSPCreturned.ENOSPCat watch fix(sec): upgrade github.com/opencontainers/runc to 1.1.2 #8192 (deferred to CI; local Bazel chain has an unrelatedcannot find 'ld'issue on the protobuf tool host-link in my environment).TestInotifyAddWatchReturnsENOSPCAtCapandTestInotifyQueueOverflowEmitsMarkerininotify_test.go.Related
CVE-2023-7258 (gVisor mount-point ref-counting DoS, CWE-400 Uncontrolled Resource Consumption, CVSS 4.8 Medium, fixed in gVisor commit
6a112c60a257dadac59962e0bc9e9b5aee70b5b6) is the class precedent. The required attacker prerequisites there were higher (root user inside sandbox with mount permission). The inotify gap addressed here is reachable from an unprivileged sandbox process with no special capability.Notes
This PR is hardening only. It does not claim a CVE and does not request a CVE. The CVE precedent is cited to document class lineage and to surface the conservative defaults rationale.