fix: avoid page_table_check BUG on time namespace VVAR page #1578
Open
smira wants to merge 1 commit into
Pull request overview
Adds an out-of-tree Linux kernel patch to prevent page_table_check from tracking PFN-mapped special PTEs (e.g., the [vvar] mapping installed via vmf_insert_pfn()), addressing a reported mm/page_table_check.c BUG when the time-namespace VVAR page is freed.
Changes:
- Introduces a new kernel patch file to skip `pte_special()` entries in both the set and clear paths of `page_table_check`.
- Prevents unbalanced per-page map counter accounting for special/PFN-mapped PTEs that don't have normal rmap/`struct page` semantics.
frezbo
approved these changes
Jun 8, 2026
Backport a kernel patch to stop page_table_check from tracking special (PFN-mapped) PTEs.

The vDSO "[vvar]" mapping is VM_PFNMAP and its pages are installed with vmf_insert_pfn(), producing special PTEs. pte_user_accessible_page() on x86/arm64 does not exclude special PTEs, so page_table_check accounts these PFN mappings. The time-namespace VVAR page is a real alloc_page() that is freed in free_time_ns() when the last task in a time namespace exits; the unbalanced accounting leaves a non-zero map count and trips the BUG_ON() in __page_table_check_zero():

  kernel BUG at mm/page_table_check.c:143!
  __page_table_check_zero / __free_frozen_pages / free_time_ns /
  free_nsproxy / do_exit

This is hit under heavy container/CI churn (CLONE_NEWTIME via runc / docker-init / tini) on both amd64 and arm64, since the Talos kernel enables CONFIG_PAGE_TABLE_CHECK. Mainline sidesteps it in v7.0 by switching the mapping to VM_MIXEDMAP + vmf_insert_page(), but 6.18.y still uses the PFNMAP path.

See: siderolabs/talos#13496

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
dd10f0c to e30cbb5
shanduur
reviewed
Jun 8, 2026
| `0003-net-macb-add-TX-stall-watchdog-to-recover-from-lost-.patch` | macb: per-queue `delayed_work` watchdog that calls `macb_tx_restart()` if tx_tail hasn't advanced. v2 uses a `bool tx_stall_tail_moved` flag (pelwell-suggested form) instead of a tx_tail snapshot, gates the check on `netif_carrier_ok()` to eliminate a boot-time false positive, and wraps the stall-warn in `if (printk_ratelimit()) netdev_warn(...)` so events stay observable while bounded | v2 submitted to netdev | [v2 thread](https://lore.kernel.org/netdev/20260514215459.36109-1-lukasz@raczylo.com/T/) · [v2 patch 3 build-fix](https://lore.kernel.org/netdev/20260515095336.92237-1-lukasz@raczylo.com/T/) |
| `0004-PCI-prevent-shrink-bridge-window.patch` | PCI: prevent `adjust_bridge_window()` from shrinking a bridge window below the size required by `pbus_size_mem()` — fixes large-BAR / eGPU resource starvation | Merged to mainline v6.19, candidate for 6.18.y stable backport | [lore patch](https://patch.msgid.link/20260219153951.68869-1-ilpo.jarvinen@linux.intel.com) |
| `0005-slab-backport-flex-allocator-helpers.patch` | Incomplete backport to 6.18.x breaking the DRBD build | Cherry-picked from mainline, drop when upgrading ||
| `006-mm-page_table_check-do-not-track-special-PFN-mapped-PTEs.patch` | mm/page_table_check: do not track special (PFN-mapped) PTEs | Linux 7.0 is not affected, but 6.18.x. is ||
Member
Suggested change
| `006-mm-page_table_check-do-not-track-special-PFN-mapped-PTEs.patch` | mm/page_table_check: do not track special (PFN-mapped) PTEs | Linux 7.0 is not affected, but 6.18.x. is ||
| `0006-mm-page_table_check-do-not-track-special-PFN-mapped-PTEs.patch` | mm/page_table_check: do not track special (PFN-mapped) PTEs | Linux 7.0 is not affected, but 6.18.x. is ||
Verified the fix with the following reproducer:

// SPDX-License-Identifier: GPL-2.0
//
// Reproducer for: kernel BUG at mm/page_table_check.c (page_table_check)
// __page_table_check_zero / __free_frozen_pages / free_time_ns /
// free_nsproxy / do_exit
//
// See https://github.com/siderolabs/talos/issues/13496 and the syzbot report
// syzbot+2b5fe617654be3d8848b@syzkaller.appspotmail.com (riscv).
//
// Root cause (6.18.y): the vDSO "[vvar]" mapping is VM_PFNMAP and its pages
// are installed with vmf_insert_pfn(), producing *special* PTEs. On x86/arm64
// pte_user_accessible_page() does not exclude pte_special(), so
// page_table_check tracks these PFN mappings. The time-namespace VVAR page is
// a real alloc_page() that is freed by __free_page() in free_time_ns(); the
// unbalanced special-PTE accounting leaves a non-zero map count and trips the
// BUG_ON() in __page_table_check_zero() when the namespace is destroyed.
//
// This program reproduces the real-world trigger (runc / docker-init / tini):
// it churns time namespaces in parallel, faults the timens VVAR page via the
// vDSO clock, and forks inside the namespace to exercise copy of the special
// PFNMAP PTEs, then tears everything down.
//
// REQUIREMENTS
// * Kernel built with CONFIG_PAGE_TABLE_CHECK and the check *active*
// (boot with page_table_check=on, or CONFIG_PAGE_TABLE_CHECK_ENFORCED=y
// as on Talos). On a kernel without the check active this is a no-op.
// * CONFIG_TIME_NS=y.
// * Run as root (or with CAP_SYS_ADMIN) so unshare(CLONE_NEWTIME) succeeds.
//
// EXPECTED RESULT
// * Unpatched, affected kernel: kernel BUG / oops in __page_table_check_zero
// (on Talos this panics and reboots the node). May take from seconds to a
// few minutes; raising the worker count and CPU pressure speeds it up.
// * Patched kernel: runs indefinitely with no crash. Stop with Ctrl-C.
//
// BUILD
// cc -O2 -Wall -o repro_timens_ptc repro_timens_ptc.c
// RUN
// ./repro_timens_ptc # workers = 2 * nproc
// ./repro_timens_ptc 64 # explicit worker count
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>
#ifndef CLONE_NEWTIME
#define CLONE_NEWTIME 0x00000080
#endif

// Fault in / read the (timens) VVAR page through the vDSO fast path.
static void touch_vdso_clock(void)
{
    struct timespec ts;
    int i;

    for (i = 0; i < 8; i++) {
        clock_gettime(CLOCK_MONOTONIC, &ts);
        clock_gettime(CLOCK_BOOTTIME, &ts);
        clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
    }
}

// Runs inside a freshly created time namespace.
static void in_timens(void)
{
    pid_t p;

    touch_vdso_clock(); /* map the timens VVAR page at the TIME slot */
    p = fork();         /* copy the VM_PFNMAP special PTEs to the child */
    if (p == 0) {
        touch_vdso_clock();
        _exit(0);
    }
    if (p > 0)
        waitpid(p, NULL, 0);
    touch_vdso_clock();
    _exit(0); /* unmap on exit */
}

static void one_round(void)
{
    pid_t c = fork();

    if (c == 0) {
        /* Create a fresh time namespace for this subtree's children. */
        if (unshare(CLONE_NEWTIME) != 0)
            _exit(2);
        pid_t g = fork(); /* grandchild enters the new timens */
        if (g == 0)
            in_timens(); /* does not return */
        if (g > 0)
            waitpid(g, NULL, 0);
        /*
         * Child exits here: it drops the last reference to the time
         * namespace (its ->time_ns_for_children), so free_nsproxy() ->
         * free_time_ns() -> __free_page(vvar_page) runs. On an affected
         * kernel page_table_check BUGs on the stray map count.
         */
        _exit(0);
    }
    if (c > 0)
        waitpid(c, NULL, 0);
}

int main(int argc, char **argv)
{
    int workers, i;
    pid_t t;
    int st;

    workers = (argc > 1) ? atoi(argv[1])
                         : (int)sysconf(_SC_NPROCESSORS_ONLN) * 2;
    if (workers < 1)
        workers = 1;

    /* Sanity check: can we create a time namespace at all? */
    t = fork();
    if (t == 0)
        _exit(unshare(CLONE_NEWTIME) == 0 ? 0 : 2);
    waitpid(t, &st, 0);
    if (!WIFEXITED(st) || WEXITSTATUS(st) == 2) {
        fprintf(stderr,
                "cannot create CLONE_NEWTIME namespace: need root/CAP_SYS_ADMIN "
                "and CONFIG_TIME_NS=y\n");
        return 1;
    }

    fprintf(stderr,
            "spawning %d workers churning time namespaces;\n"
            "needs CONFIG_PAGE_TABLE_CHECK active. Runs until the kernel BUGs "
            "(Ctrl-C to stop on a fixed kernel).\n",
            workers);

    for (i = 0; i < workers; i++) {
        pid_t w = fork();
        if (w == 0) {
            for (;;)
                one_round();
            _exit(0);
        }
    }

    /* Reap workers until none are left. */
    for (;;) {
        pid_t r = wait(NULL);
        if (r < 0 && errno == ECHILD)
            break;
    }
    return 0;
}

I can crash the kernel without the fix after a few minutes; with the fix it ran for an hour without issues.
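The reproducer is a no-op unless page_table_check is actually active, so a quick preflight check can save a wasted run. The sketch below only inspects the `page_table_check=` boot parameter in a command-line string such as the contents of `/proc/cmdline`. The names `ptc_setting`, `ptc_parse_bool`, `ptc_from_cmdline` and `ptc_from_proc` are hypothetical helpers, not kernel or libc APIs. The boolean parsing is an approximation of how the kernel reads boolean parameters. An absent parameter says nothing on its own, because the default depends on `CONFIG_PAGE_TABLE_CHECK_ENFORCED`.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical result type: what the boot command line says, if anything. */
enum ptc_setting { PTC_UNSET, PTC_ON, PTC_OFF };

/* Approximate boolean parsing: y/t/1/on => on, n/f/0/off => off. */
enum ptc_setting ptc_parse_bool(const char *v)
{
    switch (v[0]) {
    case 'y': case 'Y': case 't': case 'T': case '1':
        return PTC_ON;
    case 'n': case 'N': case 'f': case 'F': case '0':
        return PTC_OFF;
    case 'o': case 'O':
        if (v[1] == 'n' || v[1] == 'N')
            return PTC_ON;
        if (v[1] == 'f' || v[1] == 'F')
            return PTC_OFF;
        break;
    }
    return PTC_UNSET; /* unparseable value: treated as not specified */
}

/* Scan a command-line string; the last valid occurrence wins. */
enum ptc_setting ptc_from_cmdline(const char *cmdline)
{
    static const char key[] = "page_table_check=";
    enum ptc_setting result = PTC_UNSET;
    const char *p = cmdline;

    assert(cmdline != NULL);
    while ((p = strstr(p, key)) != NULL) {
        /* Only accept the key at the start of a token. */
        if (p == cmdline || p[-1] == ' ' || p[-1] == '\t') {
            enum ptc_setting s = ptc_parse_bool(p + strlen(key));
            if (s != PTC_UNSET)
                result = s;
        }
        p += strlen(key);
    }
    return result;
}

/* Convenience wrapper that reads /proc/cmdline (Linux only). */
enum ptc_setting ptc_from_proc(void)
{
    char buf[4096];
    size_t n;
    FILE *f = fopen("/proc/cmdline", "r");

    if (f == NULL)
        return PTC_UNSET;
    n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    return ptc_from_cmdline(buf);
}
```

A `PTC_OFF` result means the workaround mentioned in this PR is in effect and the reproducer is not expected to crash. `PTC_UNSET` needs the kernel config to interpret. The sketch does not handle the dash/underscore equivalence that kernel parameter names allow.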
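What
Adds a kernel patch (`0006-mm-page_table_check-do-not-track-special-PFN-mapped-PTEs.patch`) that stops `page_table_check` from tracking special (PFN-mapped) PTEs.

Why
Reported in siderolabs/talos#13496: nodes panic and reboot under heavy container/CI churn with `kernel BUG at mm/page_table_check.c:143!`, reached via `__page_table_check_zero` / `__free_frozen_pages` / `free_time_ns` / `free_nsproxy` / `do_exit`.
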
Root cause
In 6.18.y the vDSO `[vvar]` mapping is `VM_PFNMAP` and its pages are installed into userspace via `vmf_insert_pfn()`, which produces special PTEs. On x86 and arm64, `pte_user_accessible_page()` only checks the PRESENT/USER bits and does not exclude `pte_special()`, so `page_table_check` accounts these PFN mappings in the per-page map counters, even though they are not rmap-managed (`vm_normal_page()` returns `NULL`).

Most vvar pages live in the kernel image and are never freed, so the stray accounting is invisible. The time-namespace VVAR page is the exception: it is a real `alloc_page()` page freed by `__free_page()` in `free_time_ns()` when the last task of a time namespace exits. The unbalanced special-PTE accounting leaves a non-zero `file_map_count`, which trips the `BUG_ON()` in `__page_table_check_zero()` at free time.

This is why:
- `CONFIG_PAGE_TABLE_CHECK` enabled,
- `CLONE_NEWTIME` churn (runc / docker-init / tini),
- `page_table_check=off` is a complete workaround.

Fix
Skip special PTEs in both the set and clear paths of `page_table_check`, so the counters stay balanced (always zero) for PFN-mapped pages. Special PTEs have no struct-page rmap semantics and should never have been tracked. `pte_special()` is generically available (no-op `false` on arches without `ARCH_HAS_PTE_SPECIAL`), so it is a single arch-independent change.

Mainline sidesteps this since `05988dba1179` ("vdso/datastore: Allocate data pages dynamically", v7.0) switched the mapping to `VM_MIXEDMAP` + `vmf_insert_page()` with balanced struct-page accounting, but 6.18.y still uses the PFNMAP path.

Testing
- […] `patch -p1`).
- […] `CLONE_NEWTIME`/CI workload with `CONFIG_PAGE_TABLE_CHECK` active) and confirm the `mm/page_table_check.c` BUG no longer fires, without `page_table_check=off`.

The fix should also be submitted upstream to linux-mm / linux-stable (6.18.y).
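The argument in the Fix section can be illustrated with a toy userspace model; this is a sketch, not the kernel patch or kernel code. Every name in it (`model_page`, `model_pte`, `model_tracked`, `model_scenario`, and so on) is hypothetical. It assumes only what the description states: a per-page counter goes up on a tracked set and down on a tracked clear, and a non-zero counter at free time is fatal. It does not try to say which kernel path leaves the counter unbalanced; the `sets` and `clears` arguments simply let that imbalance be dialled in.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the per-page counter kept by the checker. */
struct model_page {
    int file_map_count;
};

/* Toy PTE carrying only the two properties the argument depends on. */
struct model_pte {
    bool user_accessible; /* present and user bits set */
    bool special;         /* installed through a PFN insert */
};

/* Decide whether the checker accounts this PTE at all. */
static bool model_tracked(struct model_pte pte, bool skip_special)
{
    if (skip_special && pte.special)
        return false; /* fixed behaviour: special PTEs are ignored */
    return pte.user_accessible;
}

static void model_pte_set(struct model_page *pg, struct model_pte pte,
                          bool skip_special)
{
    if (model_tracked(pte, skip_special))
        pg->file_map_count++;
}

static void model_pte_clear(struct model_page *pg, struct model_pte pte,
                            bool skip_special)
{
    if (model_tracked(pte, skip_special))
        pg->file_map_count--;
}

/* Free-time check: false models the BUG_ON() firing. */
static bool model_page_free_ok(const struct model_page *pg)
{
    return pg->file_map_count == 0;
}

/*
 * Install `sets` special user PTEs for one page, remove `clears` of them
 * through an accounted path, then report whether the page may be freed.
 */
bool model_scenario(int sets, int clears, bool skip_special)
{
    struct model_page pg = { 0 };
    struct model_pte vvar = { .user_accessible = true, .special = true };
    int i;

    assert(sets >= 0 && clears >= 0 && clears <= sets);
    for (i = 0; i < sets; i++)
        model_pte_set(&pg, vvar, skip_special);
    for (i = 0; i < clears; i++)
        model_pte_clear(&pg, vvar, skip_special);
    return model_page_free_ok(&pg);
}
```

With `skip_special` false, any mismatch between sets and clears leaves a stray count and the free-time check fails. With `skip_special` true, the counter never moves for special PTEs, so the check passes regardless of the mismatch. This is the "always zero" property the description relies on.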