@@ -0,0 +1,79 @@
From: Andrey Smirnov <andrey.smirnov@siderolabs.com>
Date: Mon, 8 Jun 2026 00:00:00 +0000
Subject: [PATCH] mm/page_table_check: do not track special (PFN-mapped) PTEs

The vDSO data store ("[vvar]") special mapping is created as a VM_PFNMAP
mapping and its pages are installed into userspace with vmf_insert_pfn(),
which produces *special* PTEs (pte_special()). On x86 and arm64 (and
riscv) pte_user_accessible_page() only tests the PRESENT/USER bits and
does not exclude special PTEs, so page_table_check accounts these PFN
mappings in the per-page anon/file map counters even though they are not
rmap-managed pages (vm_normal_page() returns NULL for them).

Most of these data pages live in the kernel image and are never freed, so
the stray accounting is invisible. The time-namespace VVAR page is the
exception: it is a real alloc_page() page that is released with
__free_page() in free_time_ns() when the last task of a time namespace
exits. Across the map / unmap / vdso_join_timens() zap transitions the
special-PTE accounting is not balanced for this page, so a non-zero
file_map_count survives to the free path and trips:

kernel BUG at mm/page_table_check.c:143!
__page_table_check_zero+...
__free_frozen_pages+...
free_time_ns+...
free_nsproxy+...
do_exit / do_group_exit

This reproduces under heavy container/CI churn (rapid creation and
teardown of time namespaces via CLONE_NEWTIME, e.g. runc / docker-init /
tini) on x86_64 and arm64, and was independently reported by syzbot on
riscv. It only manifests when CONFIG_PAGE_TABLE_CHECK is active.

Special PTEs have no struct-page rmap semantics and should never have been
tracked by page_table_check. Skip them in both the set and clear paths so
the counters stay balanced (always zero) for PFN-mapped pages, regardless
of how the architecture defines pte_user_accessible_page(). pte_special()
is available generically (a no-op returning false on architectures
without ARCH_HAS_PTE_SPECIAL), so this is a single, arch-independent fix.

Mainline sidesteps this since commit 05988dba1179 ("vdso/datastore:
Allocate data pages dynamically", v7.0) switched the mapping to
VM_MIXEDMAP + vmf_insert_page() with balanced struct-page accounting, but
6.18.y still uses the PFNMAP path and needs this fix.

Reported-by: syzbot+2b5fe617654be3d8848b@syzkaller.appspotmail.com
Link: https://github.com/siderolabs/talos/issues/13496
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
---
mm/page_table_check.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)

--- a/mm/page_table_check.c
+++ b/mm/page_table_check.c
@@ -150,7 +150,15 @@
if (&init_mm == mm)
return;

- if (pte_user_accessible_page(pte)) {
+ /*
+ * PFN-mapped (special) PTEs - e.g. the vDSO/time-namespace "[vvar]"
+ * mapping installed via vmf_insert_pfn() - are not rmap-managed and
+ * must not be tracked here. Tracking them can leave a non-zero map
+ * count on a struct page that is later freed (the time namespace VVAR
+ * page in free_time_ns()), tripping the BUG_ON() in
+ * __page_table_check_zero().
+ */
+ if (pte_user_accessible_page(pte) && !pte_special(pte)) {
page_table_check_clear(pte_pfn(pte), PAGE_SIZE >> PAGE_SHIFT);
}
}
@@ -205,7 +213,7 @@

for (i = 0; i < nr; i++)
__page_table_check_pte_clear(mm, ptep_get(ptep + i));
- if (pte_user_accessible_page(pte))
+ if (pte_user_accessible_page(pte) && !pte_special(pte))
page_table_check_set(pte_pfn(pte), nr, pte_write(pte));
}
EXPORT_SYMBOL(__page_table_check_ptes_set);
2 changes: 2 additions & 0 deletions kernel/build/patches/README.md
@@ -4,3 +4,5 @@
| `0002-net-macb-insert-PCIe-read-barrier-before-TX-completi.patch` | macb: insert non-destructive PCIe read barrier (`queue_readl(queue, IMR)`) before `macb_tx_complete_pending()` in `macb_tx_poll()`. Replaces the v1 ISR-read form which was destructive on read-clear silicon (RP1) — that read silently consumed RCOMP / ROVR / TXUBR bits, causing silent RX-completion loss at moderate-to-heavy load | v2 submitted to netdev | [v2 thread](https://lore.kernel.org/netdev/20260514215459.36109-1-lukasz@raczylo.com/T/) |
| `0003-net-macb-add-TX-stall-watchdog-to-recover-from-lost-.patch` | macb: per-queue `delayed_work` watchdog that calls `macb_tx_restart()` if tx_tail hasn't advanced. v2 uses a `bool tx_stall_tail_moved` flag (pelwell-suggested form) instead of a tx_tail snapshot, gates the check on `netif_carrier_ok()` to eliminate a boot-time false positive, and wraps the stall-warn in `if (printk_ratelimit()) netdev_warn(...)` so events stay observable while bounded | v2 submitted to netdev | [v2 thread](https://lore.kernel.org/netdev/20260514215459.36109-1-lukasz@raczylo.com/T/) · [v2 patch 3 build-fix](https://lore.kernel.org/netdev/20260515095336.92237-1-lukasz@raczylo.com/T/) |
| `0004-PCI-prevent-shrink-bridge-window.patch` | PCI: prevent `adjust_bridge_window()` from shrinking a bridge window below the size required by `pbus_size_mem()` — fixes large-BAR / eGPU resource starvation | Merged to mainline v6.19, candidate for 6.18.y stable backport | [lore patch](https://patch.msgid.link/20260219153951.68869-1-ilpo.jarvinen@linux.intel.com) |
| `0005-slab-backport-flex-allocator-helpers.patch` | slab: backport the flex allocator helpers; the incomplete backport in 6.18.x breaks the DRBD build | Cherry-picked from mainline, drop when upgrading | |
| `0006-mm-page_table_check-do-not-track-special-PFN-mapped-PTEs.patch` | mm/page_table_check: do not track special (PFN-mapped) PTEs | Linux 7.0 is not affected, but 6.18.x is | [submission](https://lore.kernel.org/linux-mm/20260608155758.1220420-1-andrey.smirnov@siderolabs.com/T#u) |
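
The effect of the `!pte_special(pte)` guard in patch 0006 can be illustrated with a small user-space model. This is only a sketch, not kernel code: every `toy_*` name is hypothetical, the counter stands in for the per-page `file_map_count`, and the "set, set, clear" sequence is an assumed example of unbalanced accounting rather than the real map / unmap / `vdso_join_timens()` sequence.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a PTE: only the two properties the patch tests. */
struct toy_pte {
	bool user_accessible;	/* models pte_user_accessible_page() */
	bool special;		/* models pte_special() */
};

/* Toy stand-in for the per-page counter that must be zero at free time. */
struct toy_page {
	int file_map_count;
};

/* Decide whether a PTE is accounted; skip_special models the patched check. */
static bool toy_tracked(struct toy_pte pte, bool skip_special)
{
	if (skip_special && pte.special)
		return false;
	return pte.user_accessible;
}

static void toy_pte_set(struct toy_page *page, struct toy_pte pte,
			bool skip_special)
{
	if (toy_tracked(pte, skip_special))
		page->file_map_count++;
}

static void toy_pte_clear(struct toy_page *page, struct toy_pte pte,
			  bool skip_special)
{
	if (toy_tracked(pte, skip_special))
		page->file_map_count--;
}

/*
 * Hypothetical unbalanced sequence for a special PTE: one more "set" than
 * "clear". Returns the count that would be seen when the page is freed.
 */
int toy_leftover_count(bool skip_special)
{
	struct toy_page page = { 0 };
	struct toy_pte vvar = { .user_accessible = true, .special = true };

	toy_pte_set(&page, vvar, skip_special);
	toy_pte_set(&page, vvar, skip_special);
	toy_pte_clear(&page, vvar, skip_special);

	return page.file_map_count;
}
```

In this model, `toy_leftover_count(false)` leaves a non-zero count (the situation that trips the `BUG_ON()` at free time), while `toy_leftover_count(true)` stays at zero because special PTEs are never counted in either direction.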