vfs: add per-instance inotify watch and event-queue caps #13188

Open

ibondarenko1 wants to merge 1 commit into google:master from ibondarenko1:hardening/inotify-resource-caps

Conversation

ibondarenko1 commented on May 14, 2026

Summary

pkg/sentry/vfs/inotify.go has no per-instance cap on the number of watches an *Inotify can hold or on the depth of its pending-event queue. AddWatch extends i.watches (line 313) and Watches.ws (line 433) without a size check, and queueEvent (line 275) appends to i.events without checking length.

Linux fs/notify/inotify/inotify_user.c caps both. The kernel returns ENOSPC from inotify_new_watch when the per-user UCOUNT_INOTIFY_WATCHES quota is reached (default 8192). fsnotify_insert_event emits a single IN_Q_OVERFLOW marker when group->q_len reaches max_events (default 16384). gVisor accepts both without bound.

This PR adds two per-instance caps matching the Linux default values:

maxInotifyWatchesPerInstance = 8192
maxInotifyQueuedEvents       = 16384

AddWatch returns ENOSPC once len(i.watches) reaches the cap. queueEvent tracks queue length via numQueuedEvents under evMu and, on overflow, emits a single IN_Q_OVERFLOW marker (wd = -1, mask = IN_Q_OVERFLOW) at the queue tail unless one is already there. Subsequent overflowing events are dropped silently, matching fsnotify_insert_event.

Affected code at HEAD (503ea178ff)

pkg/sentry/vfs/inotify.go lines 326-353 before the change:

// AddWatch constructs a new inotify watch and adds it to the target. It
// returns the watch descriptor returned by inotify_add_watch(2).
//
// The caller must hold a reference on target.
func (i *Inotify) AddWatch(target *Dentry, mask uint32) int32 {
    i.mu.Lock()
    defer i.mu.Unlock()

    ws := target.Watches()
    if existing := ws.Lookup(i.id); existing != nil {
        ...
        return existing.wd
    }

    w := i.newWatchLocked(target, ws, mask)
    return w.wd
}
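
For orientation, a sketch of the capped version as described in the Summary; placing the check after the existing-watch lookup and the exact error plumbing are illustrative rather than the final diff:

func (i *Inotify) AddWatch(target *Dentry, mask uint32) (int32, error) {
    i.mu.Lock()
    defer i.mu.Unlock()

    ws := target.Watches()
    if existing := ws.Lookup(i.id); existing != nil {
        ...
        return existing.wd, nil
    }

    // New: refuse to grow past the per-instance cap, mirroring the
    // ENOSPC that Linux returns from inotify_new_watch.
    if len(i.watches) >= maxInotifyWatchesPerInstance {
        return 0, linuxerr.ENOSPC
    }

    w := i.newWatchLocked(target, ws, mask)
    return w.wd, nil
}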

pkg/sentry/vfs/inotify.go lines 275-293 before the change:

func (i *Inotify) queueEvent(ev *Event) {
    i.evMu.Lock()

    if last := i.events.Back(); last != nil {
        if ev.equals(last) {
            i.evMu.Unlock()
            return
        }
    }

    i.events.PushBack(ev)
    i.evMu.Unlock()

    i.queue.Notify(waiter.ReadableEvents)
}
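
And a sketch of the overflow path in queueEvent, assuming the file's existing newEvent helper and the Event mask/wd fields; whether the marker itself counts against numQueuedEvents is an assumption here:

func (i *Inotify) queueEvent(ev *Event) {
    i.evMu.Lock()

    if last := i.events.Back(); last != nil {
        if ev.equals(last) {
            i.evMu.Unlock()
            return
        }
    }

    // New: at the cap, queue a single IN_Q_OVERFLOW marker (wd = -1) at
    // the tail unless one is already there, then drop events silently.
    if i.numQueuedEvents >= maxInotifyQueuedEvents {
        if last := i.events.Back(); last == nil || last.mask != linux.IN_Q_OVERFLOW {
            i.events.PushBack(newEvent(-1, "", linux.IN_Q_OVERFLOW, 0))
            i.numQueuedEvents++
            i.evMu.Unlock()
            i.queue.Notify(waiter.ReadableEvents)
            return
        }
        i.evMu.Unlock()
        return
    }

    i.events.PushBack(ev)
    i.numQueuedEvents++
    i.evMu.Unlock()

    i.queue.Notify(waiter.ReadableEvents)
}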

Witness

Reproducer (alpine:3.20 container with --runtime=runsc, runsc release-20260406.0):

docker run -d --runtime=runsc --name=v alpine:3.20 sleep 7200
# alpine:3.20 does not ship python3; install it first (assumes network access in the sandbox)
docker exec v apk add --no-cache python3
docker exec v python3 -c '
import ctypes, os, tempfile
libc = ctypes.CDLL(None)
inotify_init = libc.inotify_init
inotify_add_watch = libc.inotify_add_watch
inotify_add_watch.argtypes = [ctypes.c_int, ctypes.c_char_p, ctypes.c_uint32]
fd = inotify_init()
base = tempfile.mkdtemp()
for i in range(200000):
    d = os.path.join(base, "d%d" % i)
    os.mkdir(d)
    wd = inotify_add_watch(fd, d.encode(), 0x0FFF)
    assert wd >= 0
'

Measured sentry VmRSS:

Stage                     VmRSS    Delta
Baseline (sandbox idle)   52 MB    -
After 20000 watches       95 MB    +43 MB
After 50000 watches       180 MB   +128 MB
After 200000 watches      510 MB   +458 MB

No syscall returned ENOSPC. Linux would have stopped at watch #8192 with ENOSPC. That is approximately 2.3 KB of sentry heap per watch (458 MB / 200000 watches). The sustained growth rate is approximately 4 MB per second, so a sentry with a 512 MB memory limit is exhausted within minutes.

Linux reference

fs/notify/inotify/inotify_user.c defines inotify_table:

{
    .procname = "max_user_watches",
    .data     = &init_user_ns.ucount_max[UCOUNT_INOTIFY_WATCHES],
    .maxlen   = sizeof(long),
    .mode     = 0644,
    .proc_handler = proc_doulongvec_minmax,
    ...
},
{
    .procname = "max_queued_events",
    .data     = &inotify_max_queued_events,
    ...
}

Default max_user_watches ranges from 8192 to 1048576 depending on system RAM. Default max_queued_events is 16384. gVisor adopts the lower-bound conservative defaults.
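
Translated to the constants this PR adds near inotifyEventBaseSize:

const (
    // maxInotifyWatchesPerInstance mirrors the Linux floor default for
    // fs.inotify.max_user_watches, applied here per Inotify instance.
    maxInotifyWatchesPerInstance = 8192

    // maxInotifyQueuedEvents mirrors the Linux default for
    // fs.inotify.max_queued_events.
    maxInotifyQueuedEvents = 16384
)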

inotify_new_watch enforces the watch quota:

if (!inc_inotify_watches(group->inotify_data.ucounts)) {
    inotify_remove_from_idr(group, tmp_i_mark);
    ret = -ENOSPC;
    goto out_err;
}

fsnotify_insert_event enforces the queue quota by inserting an overflow marker once group->q_len >= group->max_events.

Change

  1. pkg/sentry/vfs/inotify.go

    • Add maxInotifyWatchesPerInstance and maxInotifyQueuedEvents constants near inotifyEventBaseSize.
    • Add numQueuedEvents int field to Inotify, protected by evMu.
    • AddWatch signature changes from (int32) to (int32, error). Returns linuxerr.ENOSPC once len(i.watches) >= maxInotifyWatchesPerInstance.
    • queueEvent increments numQueuedEvents after PushBack. On overflow, emits a single IN_Q_OVERFLOW marker at the tail if one is not already present and returns without queuing the would-be event.
    • The reader loop decrements numQueuedEvents when an event is removed via i.events.Remove(event).
  2. pkg/sentry/syscalls/linux/sys_inotify.go

    • Propagate the new AddWatch error to the inotify_add_watch(2) syscall return (see the sketch after this list).
  3. pkg/sentry/fsimpl/kernfs/kernfs_test.go

    • Two AddWatch call sites updated to use the new (wd, err) return; t.Fatal on unexpected error.
  4. pkg/sentry/vfs/inotify_test.go (new file) with the two regression tests listed under Test plan.
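
The call-site change in item 2 is mechanical. A sketch, with the handler's argument handling elided and variable names assumed from the existing InotifyAddWatch:

// pkg/sentry/syscalls/linux/sys_inotify.go, inside InotifyAddWatch:
wd, err := ino.AddWatch(d.Dentry(), mask)
if err != nil {
    return 0, nil, err
}
return uintptr(wd), nil, nil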

Out of scope for this PR

fs.inotify.max_user_instances (Linux default 128) is enforced per user namespace via UCOUNT_INOTIFY_INSTANCES. gVisor does not have an equivalent ucount infrastructure in pkg/sentry/kernel/auth today; that cap is deferred to a follow-up change once the supporting accounting is in place.

The cap added here is per-Inotify-instance, whereas Linux's quota is per user across all of that user's instances. The per-instance cap therefore covers a narrower axis: a process holding multiple Inotify instances can still exceed the Linux per-user total. Once UCOUNT-like accounting lands in pkg/sentry/kernel/auth, a per-user cap can be added on top of this per-instance one.

Test plan

  • gofmt -l pkg/sentry/vfs/inotify.go pkg/sentry/vfs/inotify_test.go pkg/sentry/syscalls/linux/sys_inotify.go pkg/sentry/fsimpl/kernfs/kernfs_test.go returns clean.
  • Witness reproduced before the change: sentry RSS grew from 52 MB to 510 MB on 200000 watches with no ENOSPC returned.
  • Witness rerun after the change should return ENOSPC at watch #8192 (deferred to CI; the local Bazel chain has an unrelated cannot find 'ld' issue on the protobuf tool host link in my environment).
  • Regression tests added: TestInotifyAddWatchReturnsENOSPCAtCap and TestInotifyQueueOverflowEmitsMarker in inotify_test.go; a sketch of the latter follows.
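
A white-box sketch of the shape of the second test (package vfs; the zero-value Inotify construction, the newEvent call, and the assumption that the marker counts toward numQueuedEvents may all differ from the committed inotify_test.go):

func TestInotifyQueueOverflowEmitsMarker(t *testing.T) {
    i := &Inotify{}
    // Distinct cookies defeat the tail-coalescing check in queueEvent.
    for n := 0; n < maxInotifyQueuedEvents+100; n++ {
        i.queueEvent(newEvent(1, "", linux.IN_CREATE, uint32(n)))
    }
    // Exactly one IN_Q_OVERFLOW marker (wd = -1) must sit at the tail.
    last := i.events.Back()
    if last == nil || last.mask != linux.IN_Q_OVERFLOW || last.wd != -1 {
        t.Fatalf("expected IN_Q_OVERFLOW marker at queue tail, got %+v", last)
    }
    if got, want := i.numQueuedEvents, maxInotifyQueuedEvents+1; got != want {
        t.Fatalf("numQueuedEvents = %d, want %d", got, want)
    }
}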

Related

CVE-2023-7258 (gVisor mount-point ref-counting DoS, CWE-400 Uncontrolled Resource Consumption, CVSS 4.8 Medium, fixed in gVisor commit 6a112c60a257dadac59962e0bc9e9b5aee70b5b6) is the class precedent. The required attacker prerequisites there were higher (root user inside sandbox with mount permission). The inotify gap addressed here is reachable from an unprivileged sandbox process with no special capability.

Notes

This PR is hardening only. It does not claim a CVE and does not request a CVE. The CVE precedent is cited to document class lineage and to surface the conservative defaults rationale.

Commit message

Inotify in pkg/sentry/vfs/inotify.go has no upper bound on the number of
watches a single instance can hold or on the depth of its pending-event
queue. AddWatch grows i.watches (line 313) and the target's Watches.ws
(line 433) without a size check; queueEvent (line 275) appends to
i.events without checking length.

Linux fs/notify/inotify/inotify_user.c caps both. The kernel returns
ENOSPC from inotify_new_watch when the per-user UCOUNT_INOTIFY_WATCHES
quota is reached (default 8192), and fsnotify_insert_event emits a
single IN_Q_OVERFLOW marker when group->q_len reaches max_events
(default 16384). Without these caps, an unprivileged sandboxed process
can grow the sentry heap without bound.

Witness (kali, runsc release-20260406.0, alpine 3.20 sandbox):

  Baseline VmRSS = 52 MB.
  inotify_add_watch x 200000 distinct dirs from one inotify fd.
  Post-flood VmRSS = 510 MB. No ENOSPC returned at any step.
  Sustainable growth rate approximately 4 MB per second.
  Default sentry memory caps would OOM within minutes.

Add two per-instance caps matching the Linux default values:

  maxInotifyWatchesPerInstance = 8192
  maxInotifyQueuedEvents       = 16384

AddWatch now returns (int32, error) and returns ENOSPC once
len(i.watches) reaches the cap. queueEvent tracks queue length via
numQueuedEvents under evMu and, on overflow, emits a single
IN_Q_OVERFLOW marker (wd = -1, mask = IN_Q_OVERFLOW) at the queue
tail unless one is already there. Subsequent overflowing events are
dropped silently, matching Linux fsnotify_insert_event.

A separate Linux limit, fs.inotify.max_user_instances (default 128),
is enforced per user namespace via UCOUNT_INOTIFY_INSTANCES in the
kernel. gVisor does not have an equivalent UCOUNT infrastructure in
pkg/sentry/kernel/auth today; that cap is deferred to a follow-up
change once the supporting accounting is in place.

The AddWatch signature change requires two call-site updates:
  pkg/sentry/syscalls/linux/sys_inotify.go - propagate the error to
    the inotify_add_watch(2) caller.
  pkg/sentry/fsimpl/kernfs/kernfs_test.go - existing tests use t.Fatal
    on unexpected errors.

Adds two regression tests in pkg/sentry/vfs/inotify_test.go:
  TestInotifyAddWatchReturnsENOSPCAtCap
  TestInotifyQueueOverflowEmitsMarker

Tested:
  gofmt -l pkg/sentry/vfs/inotify.go pkg/sentry/vfs/inotify_test.go \
      pkg/sentry/syscalls/linux/sys_inotify.go \
      pkg/sentry/fsimpl/kernfs/kernfs_test.go
  (clean)

Related: CVE-2023-7258 (gVisor mount-point ref-counting DoS, CWE-400,
CVSS 4.8) is the class precedent. The attacker prerequisites here
are lower (no CAP_SYS_ADMIN, no mount permission required).
ibondarenko1 force-pushed the hardening/inotify-resource-caps branch from 19badfc to 5f413d5 on May 16, 2026 at 07:34