diff --git a/kernel/build/patches/0006-mm-page_table_check-do-not-track-special-PFN-mapped-PTEs.patch b/kernel/build/patches/0006-mm-page_table_check-do-not-track-special-PFN-mapped-PTEs.patch
new file mode 100644
index 000000000..4b23629b0
--- /dev/null
+++ b/kernel/build/patches/0006-mm-page_table_check-do-not-track-special-PFN-mapped-PTEs.patch
@@ -0,0 +1,79 @@
+From: Andrey Smirnov
+Date: Mon, 8 Jun 2026 00:00:00 +0000
+Subject: [PATCH] mm/page_table_check: do not track special (PFN-mapped) PTEs
+
+The vDSO data store ("[vvar]") special mapping is created as a VM_PFNMAP
+mapping and its pages are installed into userspace with vmf_insert_pfn(),
+which produces *special* PTEs (pte_special()). On x86 and arm64 (and
+riscv) pte_user_accessible_page() only tests the PRESENT/USER bits and
+does not exclude special PTEs, so page_table_check accounts these PFN
+mappings in the per-page anon/file map counters even though they are not
+rmap-managed pages (vm_normal_page() returns NULL for them).
+
+Most of these data pages live in the kernel image and are never freed, so
+the stray accounting is invisible. The time-namespace VVAR page is the
+exception: it is a real alloc_page() page that is released with
+__free_page() in free_time_ns() when the last task of a time namespace
+exits. Across the map / unmap / vdso_join_timens() zap transitions the
+special-PTE accounting is not balanced for this page, so a non-zero
+file_map_count survives to the free path and trips:
+
+  kernel BUG at mm/page_table_check.c:143!
+  __page_table_check_zero+...
+  __free_frozen_pages+...
+  free_time_ns+...
+  free_nsproxy+...
+  do_exit / do_group_exit
+
+This reproduces under heavy container/CI churn (rapid creation and
+teardown of time namespaces via CLONE_NEWTIME, e.g. runc / docker-init /
+tini) on x86_64 and arm64, and was independently reported by syzbot on
+riscv. It only manifests when CONFIG_PAGE_TABLE_CHECK is active.
+
+Special PTEs have no struct-page rmap semantics and must never have been
+tracked by page table check. Skip them in both the set and clear paths so
+the counters stay balanced (always zero) for PFN-mapped pages, regardless
+of how the architecture defines pte_user_accessible_page(). pte_special()
+is available generically (a no-op returning false on architectures
+without ARCH_HAS_PTE_SPECIAL), so this is a single, arch-independent fix.
+
+Mainline sidesteps this since commit 05988dba1179 ("vdso/datastore:
+Allocate data pages dynamically", v7.0) switched the mapping to
+VM_MIXEDMAP + vmf_insert_page() with balanced struct-page accounting, but
+6.18.y still uses the PFNMAP path and needs this fix.
+
+Reported-by: syzbot+2b5fe617654be3d8848b@syzkaller.appspotmail.com
+Link: https://github.com/siderolabs/talos/issues/13496
+Signed-off-by: Andrey Smirnov
+---
+ mm/page_table_check.c | 12 ++++++++++--
+ 1 file changed, 10 insertions(+), 2 deletions(-)
+
+--- a/mm/page_table_check.c
++++ b/mm/page_table_check.c
+@@ -150,7 +150,15 @@
+ 	if (&init_mm == mm)
+ 		return;
+
+-	if (pte_user_accessible_page(pte)) {
++	/*
++	 * PFN-mapped (special) PTEs - e.g. the vDSO/time-namespace "[vvar]"
++	 * mapping installed via vmf_insert_pfn() - are not rmap-managed and
++	 * must not be tracked here. Tracking them can leave a non-zero map
++	 * count on a struct page that is later freed (the time namespace VVAR
++	 * page in free_time_ns()), tripping the BUG_ON() in
++	 * __page_table_check_zero().
++	 */
++	if (pte_user_accessible_page(pte) && !pte_special(pte)) {
+ 		page_table_check_clear(pte_pfn(pte), PAGE_SIZE >> PAGE_SHIFT);
+ 	}
+ }
+@@ -205,7 +213,7 @@
+
+ 	for (i = 0; i < nr; i++)
+ 		__page_table_check_pte_clear(mm, ptep_get(ptep + i));
+-	if (pte_user_accessible_page(pte))
++	if (pte_user_accessible_page(pte) && !pte_special(pte))
+ 		page_table_check_set(pte_pfn(pte), nr, pte_write(pte));
+ }
+ EXPORT_SYMBOL(__page_table_check_ptes_set);
diff --git a/kernel/build/patches/README.md b/kernel/build/patches/README.md
index 49f7f11f2..58e3823ca 100644
--- a/kernel/build/patches/README.md
+++ b/kernel/build/patches/README.md
@@ -4,3 +4,5 @@
 | `0002-net-macb-insert-PCIe-read-barrier-before-TX-completi.patch` | macb: insert non-destructive PCIe read barrier (`queue_readl(queue, IMR)`) before `macb_tx_complete_pending()` in `macb_tx_poll()`. Replaces the v1 ISR-read form which was destructive on read-clear silicon (RP1) — that read silently consumed RCOMP / ROVR / TXUBR bits, causing silent RX-completion loss at moderate-to-heavy load | v2 submitted to netdev | [v2 thread](https://lore.kernel.org/netdev/20260514215459.36109-1-lukasz@raczylo.com/T/) |
 | `0003-net-macb-add-TX-stall-watchdog-to-recover-from-lost-.patch` | macb: per-queue `delayed_work` watchdog that calls `macb_tx_restart()` if tx_tail hasn't advanced. v2 uses a `bool tx_stall_tail_moved` flag (pelwell-suggested form) instead of a tx_tail snapshot, gates the check on `netif_carrier_ok()` to eliminate a boot-time false positive, and wraps the stall-warn in `if (printk_ratelimit()) netdev_warn(...)` so events stay observable while bounded | v2 submitted to netdev | [v2 thread](https://lore.kernel.org/netdev/20260514215459.36109-1-lukasz@raczylo.com/T/) · [v2 patch 3 build-fix](https://lore.kernel.org/netdev/20260515095336.92237-1-lukasz@raczylo.com/T/) |
 | `0004-PCI-prevent-shrink-bridge-window.patch` | PCI: prevent `adjust_bridge_window()` from shrinking a bridge window below the size required by `pbus_size_mem()` — fixes large-BAR / eGPU resource starvation | Merged to mainline v6.19, candidate for 6.18.y stable backport | [lore patch](https://patch.msgid.link/20260219153951.68869-1-ilpo.jarvinen@linux.intel.com) |
+| `0005-slab-backport-flex-allocator-helpers.patch` | slab: backport flex allocator helpers (an incomplete backport to 6.18.x breaks the DRBD build) | Cherry-picked from mainline, drop when upgrading | |
+| `0006-mm-page_table_check-do-not-track-special-PFN-mapped-PTEs.patch` | mm/page_table_check: do not track special (PFN-mapped) PTEs | Submitted to linux-mm; Linux 7.0 is not affected, but 6.18.x is | [submission](https://lore.kernel.org/linux-mm/20260608155758.1220420-1-andrey.smirnov@siderolabs.com/T#u) |
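
Review note: the accounting argument in the 0006 commit message can be illustrated with a small user-space model. This is only a sketch, not kernel code. Every `toy_*` name is hypothetical, and the "two sets, one clear" sequence is an assumed stand-in for whatever imbalance the map / unmap / zap transitions actually produce. It shows that once special PTEs are filtered out in both the set and clear paths, the counter cannot be non-zero when the page is freed.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins; none of these are the kernel's real types. */
struct toy_page {
	int file_map_count;
};

struct toy_pte {
	bool user_accessible;	/* models pte_user_accessible_page() */
	bool special;		/* models pte_special() */
	struct toy_page *page;
};

/* With skip_special set, mirrors the patched condition:
 * pte_user_accessible_page(pte) && !pte_special(pte). */
static bool toy_tracked(struct toy_pte pte, bool skip_special)
{
	return pte.user_accessible && !(skip_special && pte.special);
}

static void toy_pte_set(struct toy_pte pte, bool skip_special)
{
	if (toy_tracked(pte, skip_special))
		pte.page->file_map_count++;
}

static void toy_pte_clear(struct toy_pte pte, bool skip_special)
{
	if (toy_tracked(pte, skip_special)) {
		/* A clear without a prior set would itself be a bug. */
		assert(pte.page->file_map_count > 0);
		pte.page->file_map_count--;
	}
}

/* Stand-in for the free-time check: no tracked mappings may remain. */
static bool toy_page_safe_to_free(const struct toy_page *page)
{
	return page->file_map_count == 0;
}

/* Assumed unbalanced sequence on one special PTE: two sets, one clear.
 * Returns true if the page would pass the free-time check afterwards. */
static bool toy_run(bool skip_special)
{
	struct toy_page vvar = { 0 };
	struct toy_pte pte = {
		.user_accessible = true,
		.special = true,
		.page = &vvar,
	};

	toy_pte_set(pte, skip_special);
	toy_pte_set(pte, skip_special);
	toy_pte_clear(pte, skip_special);
	return toy_page_safe_to_free(&vvar);
}
```

Calling `toy_run(false)` models the unpatched behaviour and leaves a stray count on the page. Calling `toy_run(true)` models the patched behaviour, where the special PTE is never counted, so the page is safe to free no matter how the set and clear calls interleave.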