fix: avoid page_table_check BUG on time namespace VVAR page #1578

Open
smira wants to merge 1 commit into main from fix-page-table-check-timens-vvar

@smira smira commented Jun 8, 2026

What

Adds a kernel patch (0006-mm-page_table_check-do-not-track-special-PFN-mapped-PTEs.patch) that stops page_table_check from tracking special (PFN-mapped) PTEs.

Why

Reported in siderolabs/talos#13496: nodes panic and reboot under heavy container/CI churn with:

kernel BUG at mm/page_table_check.c:143!
__page_table_check_zero
__free_frozen_pages
free_time_ns
free_nsproxy
do_exit / do_group_exit

Root cause

In 6.18.y the vDSO [vvar] mapping is VM_PFNMAP and its pages are installed into userspace via vmf_insert_pfn(), which produces special PTEs. On x86 and arm64 pte_user_accessible_page() only checks the PRESENT/USER bits and does not exclude pte_special(), so page_table_check accounts for these PFN mappings in the per-page map counters, even though they are not rmap-managed (vm_normal_page() returns NULL).

Most vvar pages live in the kernel image and are never freed, so the stray accounting is invisible. The time-namespace VVAR page is the exception: it is a real alloc_page() page freed by __free_page() in free_time_ns() when the last task of a time namespace exits. The unbalanced special-PTE accounting leaves a non-zero file_map_count, which trips the BUG_ON() in __page_table_check_zero() at free time.

This is why:

  • it is arch-independent (x86_64 + arm64 here, riscv via syzbot),
  • only the timens page trips it (it's the only one in this set that is freed),
  • it only affects Talos / kernels with CONFIG_PAGE_TABLE_CHECK enabled,
  • it is triggered by CLONE_NEWTIME churn (runc / docker-init / tini),
  • and page_table_check=off is a complete workaround.

Fix

Skip special PTEs in both the set and clear paths of page_table_check, so the counters stay balanced (always zero) for PFN-mapped pages. Special PTEs have no struct-page rmap semantics and should never have been tracked. pte_special() is generically available (no-op false on arches without ARCH_HAS_PTE_SPECIAL), so it is a single arch-independent change.

Mainline sidesteps this since 05988dba1179 ("vdso/datastore: Allocate data pages dynamically", v7.0) switched the mapping to VM_MIXEDMAP + vmf_insert_page() with balanced struct-page accounting, but 6.18.y still uses the PFNMAP path.

Testing

  • Patch applies cleanly to the 6.18.34 source built by this repo (patch -p1).
  • Functional validation: reproduce on an affected node (heavy CLONE_NEWTIME/CI workload with CONFIG_PAGE_TABLE_CHECK active) and confirm the mm/page_table_check.c BUG no longer fires, without page_table_check=off.

The fix should also be submitted upstream to linux-mm / linux-stable (6.18.y).

Copilot AI review requested due to automatic review settings June 8, 2026 12:52
@github-project-automation github-project-automation Bot moved this to To Do in Planning Jun 8, 2026
@talos-bot talos-bot moved this from To Do to In Review in Planning Jun 8, 2026

Copilot AI left a comment

Pull request overview

Adds an out-of-tree Linux kernel patch to prevent page_table_check from tracking PFN-mapped special PTEs (e.g., the [vvar] mapping installed via vmf_insert_pfn()), addressing a reported mm/page_table_check.c BUG when the time-namespace VVAR page is freed.

Changes:

  • Introduces a new kernel patch file to skip pte_special() entries in both the set and clear paths of page_table_check.
  • Prevents unbalanced per-page map counter accounting for special/PFN-mapped PTEs that don’t have normal rmap/struct page semantics.


@github-project-automation github-project-automation Bot moved this from In Review to Approved in Planning Jun 8, 2026
Backport a kernel patch to stop page_table_check from tracking special
(PFN-mapped) PTEs.

The vDSO "[vvar]" mapping is VM_PFNMAP and its pages are installed with
vmf_insert_pfn(), producing special PTEs. pte_user_accessible_page() on
x86/arm64 does not exclude special PTEs, so page_table_check accounts
these PFN mappings. The time-namespace VVAR page is a real alloc_page()
that is freed in free_time_ns() when the last task in a time namespace
exits; the unbalanced accounting leaves a non-zero map count and trips
the BUG_ON() in __page_table_check_zero():

  kernel BUG at mm/page_table_check.c:143!
  __page_table_check_zero / __free_frozen_pages / free_time_ns /
  free_nsproxy / do_exit

This is hit under heavy container/CI churn (CLONE_NEWTIME via runc /
docker-init / tini) on both amd64 and arm64, since the Talos kernel
enables CONFIG_PAGE_TABLE_CHECK. Mainline sidesteps it in v7.0 by
switching the mapping to VM_MIXEDMAP + vmf_insert_page(), but 6.18.y
still uses the PFNMAP path.

See: siderolabs/talos#13496

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
@smira smira force-pushed the fix-page-table-check-timens-vvar branch from dd10f0c to e30cbb5 Compare June 8, 2026 12:59
| `0003-net-macb-add-TX-stall-watchdog-to-recover-from-lost-.patch` | macb: per-queue `delayed_work` watchdog that calls `macb_tx_restart()` if tx_tail hasn't advanced. v2 uses a `bool tx_stall_tail_moved` flag (pelwell-suggested form) instead of a tx_tail snapshot, gates the check on `netif_carrier_ok()` to eliminate a boot-time false positive, and wraps the stall-warn in `if (printk_ratelimit()) netdev_warn(...)` so events stay observable while bounded | v2 submitted to netdev | [v2 thread](https://lore.kernel.org/netdev/20260514215459.36109-1-lukasz@raczylo.com/T/) · [v2 patch 3 build-fix](https://lore.kernel.org/netdev/20260515095336.92237-1-lukasz@raczylo.com/T/) |
| `0004-PCI-prevent-shrink-bridge-window.patch` | PCI: prevent `adjust_bridge_window()` from shrinking a bridge window below the size required by `pbus_size_mem()` — fixes large-BAR / eGPU resource starvation | Merged to mainline v6.19, candidate for 6.18.y stable backport | [lore patch](https://patch.msgid.link/20260219153951.68869-1-ilpo.jarvinen@linux.intel.com) |
| `0005-slab-backport-flex-allocator-helpers.patch` | Incomplete backport to 6.18.x breaking the DRBD build | Cherry-picked from mainline, drop when upgrading ||
| `006-mm-page_table_check-do-not-track-special-PFN-mapped-PTEs.patch` | mm/page_table_check: do not track special (PFN-mapped) PTEs | Linux 7.0 is not affected, but 6.18.x. is ||

Suggested change
| `006-mm-page_table_check-do-not-track-special-PFN-mapped-PTEs.patch` | mm/page_table_check: do not track special (PFN-mapped) PTEs | Linux 7.0 is not affected, but 6.18.x. is ||
| `0006-mm-page_table_check-do-not-track-special-PFN-mapped-PTEs.patch` | mm/page_table_check: do not track special (PFN-mapped) PTEs | Linux 7.0 is not affected, but 6.18.x. is ||

smira commented Jun 8, 2026

Verified the fix with the following reproducer:

// SPDX-License-Identifier: GPL-2.0
//
// Reproducer for: kernel BUG at mm/page_table_check.c (page_table_check)
//   __page_table_check_zero / __free_frozen_pages / free_time_ns /
//   free_nsproxy / do_exit
//
// See https://github.com/siderolabs/talos/issues/13496 and the syzbot report
// syzbot+2b5fe617654be3d8848b@syzkaller.appspotmail.com (riscv).
//
// Root cause (6.18.y): the vDSO "[vvar]" mapping is VM_PFNMAP and its pages
// are installed with vmf_insert_pfn(), producing *special* PTEs. On x86/arm64
// pte_user_accessible_page() does not exclude pte_special(), so
// page_table_check tracks these PFN mappings. The time-namespace VVAR page is
// a real alloc_page() that is freed by __free_page() in free_time_ns(); the
// unbalanced special-PTE accounting leaves a non-zero map count and trips the
// BUG_ON() in __page_table_check_zero() when the namespace is destroyed.
//
// This program reproduces the real-world trigger (runc / docker-init / tini):
// it churns time namespaces in parallel, faults the timens VVAR page via the
// vDSO clock, and forks inside the namespace to exercise copy of the special
// PFNMAP PTEs, then tears everything down.
//
// REQUIREMENTS
//   * Kernel built with CONFIG_PAGE_TABLE_CHECK and the check *active*
//     (boot with page_table_check=on, or CONFIG_PAGE_TABLE_CHECK_ENFORCED=y
//     as on Talos). On a kernel without the check active this is a no-op.
//   * CONFIG_TIME_NS=y.
//   * Run as root (or with CAP_SYS_ADMIN) so unshare(CLONE_NEWTIME) succeeds.
//
// EXPECTED RESULT
//   * Unpatched, affected kernel: kernel BUG / oops in __page_table_check_zero
//     (on Talos this panics and reboots the node). May take from seconds to a
//     few minutes; raising the worker count and CPU pressure speeds it up.
//   * Patched kernel: runs indefinitely with no crash. Stop with Ctrl-C.
//
// BUILD
//   cc -O2 -Wall -o repro_timens_ptc repro_timens_ptc.c
// RUN
//   ./repro_timens_ptc            # workers = 2 * nproc
//   ./repro_timens_ptc 64         # explicit worker count

#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#ifndef CLONE_NEWTIME
#define CLONE_NEWTIME 0x00000080
#endif

// Fault in / read the (timens) VVAR page through the vDSO fast path.
static void touch_vdso_clock(void)
{
        struct timespec ts;
        int i;

        for (i = 0; i < 8; i++) {
                clock_gettime(CLOCK_MONOTONIC, &ts);
                clock_gettime(CLOCK_BOOTTIME, &ts);
                clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
        }
}

// Runs inside a freshly created time namespace.
static void in_timens(void)
{
        pid_t p;

        touch_vdso_clock();     /* map the timens VVAR page at the TIME slot */

        p = fork();             /* copy the VM_PFNMAP special PTEs to the child */
        if (p == 0) {
                touch_vdso_clock();
                _exit(0);
        }
        if (p > 0)
                waitpid(p, NULL, 0);

        touch_vdso_clock();
        _exit(0);               /* unmap on exit */
}

static void one_round(void)
{
        pid_t c = fork();

        if (c == 0) {
                /* Create a fresh time namespace for this subtree's children. */
                if (unshare(CLONE_NEWTIME) != 0)
                        _exit(2);

                pid_t g = fork();       /* grandchild enters the new timens */
                if (g == 0)
                        in_timens();    /* does not return */
                if (g > 0)
                        waitpid(g, NULL, 0);

                /*
                 * Child exits here: it drops the last reference to the time
                 * namespace (its ->time_ns_for_children), so free_nsproxy() ->
                 * free_time_ns() -> __free_page(vvar_page) runs. On an affected
                 * kernel page_table_check BUGs on the stray map count.
                 */
                _exit(0);
        }
        if (c > 0)
                waitpid(c, NULL, 0);
}

int main(int argc, char **argv)
{
        int workers, i;
        pid_t t;
        int st;

        workers = (argc > 1) ? atoi(argv[1])
                             : (int)sysconf(_SC_NPROCESSORS_ONLN) * 2;
        if (workers < 1)
                workers = 1;

        /* Sanity check: can we create a time namespace at all? */
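        t = fork();
        if (t < 0) {
                /* Without a child, waitpid() would leave st uninitialized. */
                perror("fork");
                return 1;
        }
        if (t == 0)
                _exit(unshare(CLONE_NEWTIME) == 0 ? 0 : 2);
        waitpid(t, &st, 0);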
        if (!WIFEXITED(st) || WEXITSTATUS(st) == 2) {
                fprintf(stderr,
                        "cannot create CLONE_NEWTIME namespace: need root/CAP_SYS_ADMIN "
                        "and CONFIG_TIME_NS=y\n");
                return 1;
        }

        fprintf(stderr,
                "spawning %d workers churning time namespaces;\n"
                "needs CONFIG_PAGE_TABLE_CHECK active. Runs until the kernel BUGs "
                "(Ctrl-C to stop on a fixed kernel).\n",
                workers);

        for (i = 0; i < workers; i++) {
                pid_t w = fork();

                if (w == 0) {
                        for (;;)
                                one_round();
                        _exit(0);
                }
        }

        for (;;) {
                pid_t r = wait(NULL);

                if (r < 0 && errno == ECHILD)
                        break;
        }
        return 0;
}

Without the fix, I can crash the kernel after a few minutes; with the fix, the reproducer ran for an hour without issues.
