mm/slub: Add disable_canary kernel cmdline#112

Open
nbouchinet-anssi wants to merge 109 commits into anthraxx:7.0 from nbouchinet-anssi:nbouchinet-anssi/7.0-canary_disable

Conversation

@nbouchinet-anssi


This commit adds a kernel command-line option to disable the slab canary.
Since the introduction of sheaves, the canary patch has grown in complexity and we have encountered various crashes, such as #111.

This option lets users disable the canary without disabling the patchset completely or falling back to an obsolete linux-hardened version.
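
For illustration only, here is a userspace sketch of how a boolean `disable_canary` token could be recognized on a command line. The flag name `canary_disabled` and both helper functions are hypothetical; the PR text does not show the actual implementation, which in the kernel would more likely be registered through a mechanism such as `__setup()` or `early_param()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical flag; the variable used by the real patch is not shown here. */
static bool canary_disabled;

/* Return true if `token` appears as a whole space-separated word in `cmdline`. */
static bool cmdline_has_token(const char *cmdline, const char *token)
{
	size_t len = strlen(token);
	const char *p = cmdline;

	assert(len > 0);	/* an empty token would never advance the scan */

	while ((p = strstr(p, token)) != NULL) {
		bool starts_word = (p == cmdline) || (p[-1] == ' ');
		bool ends_word = (p[len] == '\0') || (p[len] == ' ');

		if (starts_word && ends_word)
			return true;
		p += len;
	}
	return false;
}

/* Set the flag from a full command-line string (userspace model only). */
static void parse_canary_option(const char *cmdline)
{
	canary_disabled = cmdline_has_token(cmdline, "disable_canary");
}
```

The whole-word match is a design choice of this sketch, so that a longer parameter that merely contains the string does not switch the canary off by accident.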

thestinger and others added 30 commits May 27, 2026 02:42
Signed-off-by: Daniel Micay <danielmicay@gmail.com>
Signed-off-by: Daniel Micay <danielmicay@gmail.com>
Signed-off-by: Daniel Micay <danielmicay@gmail.com>
Signed-off-by: Daniel Micay <danielmicay@gmail.com>
Signed-off-by: Levente Polyak <levente@leventepolyak.net>
Signed-off-by: Daniel Micay <danielmicay@gmail.com>
Signed-off-by: Daniel Micay <danielmicay@gmail.com>
Signed-off-by: Daniel Micay <danielmicay@gmail.com>
Signed-off-by: Daniel Micay <danielmicay@gmail.com>
It can make sense to disable this to reduce attack surface / complexity.
Signed-off-by: Daniel Micay <danielmicay@gmail.com>
Signed-off-by: Daniel Micay <danielmicay@gmail.com>
Signed-off-by: Daniel Micay <danielmicay@gmail.com>
anthraxx and others added 27 commits May 27, 2026 02:54
Orthogonal to the other sysctl proc functions, expose the variant that
checks CAP_SYS_ADMIN on write, for consumption in external subsystems'
sysctl tables.

Signed-off-by: Levente Polyak <levente@leventepolyak.net>
[nicolas.bouchinet@ssi.gouv.fr: Constify the ctl_table argument as in commit 78eb4ea]
Signed-off-by: Nicolas Bouchinet <nicolas.bouchinet@ssi.gouv.fr>
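
As a rough userspace model of the "check CAP_SYS_ADMIN on write" variant described above: reads are always served, while writes are refused without the privilege and are then bounds-checked. `struct int_knob`, `knob_access()` and the `has_cap_sys_admin` parameter are hypothetical stand-ins for the sysctl table entry, the handler and `capable(CAP_SYS_ADMIN)`; this is not the kernel handler itself.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for a sysctl entry holding a bounded integer. */
struct int_knob {
	int value;
	int min;
	int max;
};

/*
 * Model of the "sysadmin" handler variant: a write needs the privilege,
 * a read does not. On read, *val receives the current value; on write,
 * *val is the requested value.
 */
static int knob_access(struct int_knob *knob, bool write, int *val,
		       bool has_cap_sys_admin)
{
	assert(knob != NULL && val != NULL);

	if (write && !has_cap_sys_admin)
		return -EPERM;

	if (!write) {
		*val = knob->value;
		return 0;
	}

	if (*val < knob->min || *val > knob->max)
		return -EINVAL;

	knob->value = *val;
	return 0;
}
```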
Based on the public grsecurity patches.

[thibaut.sautereau@ssi.gouv.fr: Adapt to sysctl code refactoring]
Signed-off-by: Thibaut Sautereau <thibaut.sautereau@ssi.gouv.fr>
Signed-off-by: Levente Polyak <levente@leventepolyak.net>
[thibaut.sautereau@ssi.gouv.fr: Adapt to sysctl code refactoring]
Signed-off-by: Nicolas Bouchinet <nicolas.bouchinet@oss.cyber.gouv.fr>
This moves the USB-related sysctl knobs to their own USB-local sysctl table
in order to clean up the global sysctl table and to allow the knobs to be
exported and referenced appropriately when building the USB components
as dedicated modules.

Signed-off-by: Levente Polyak <levente@leventepolyak.net>
The userspace API is left intact for compatibility.

Signed-off-by: Levente Polyak <levente@leventepolyak.net>
This patch adds struct user_namespace *owner_user_ns to the tty_struct.
Then it is set to current_user_ns() in the alloc_tty_struct function.

This is done to facilitate capability checks against the original user
namespace that allocated the tty.

E.g. ns_capable(tty->owner_user_ns,CAP_SYS_ADMIN)

This, combined with the use of user namespaces, will allow hardening
protections to be built to mitigate container escapes that utilize TTY
ioctls such as TIOCSTI.

See: https://bugzilla.redhat.com/show_bug.cgi?id=1411256

Acked-by: Serge Hallyn <serge@hallyn.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Matt Brown <matt@nmatt.com>
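
To make the `ns_capable(tty->owner_user_ns, CAP_SYS_ADMIN)` idea above concrete, here is a deliberately simplified userspace model. All type and function names are hypothetical, and the real kernel check has further rules (for instance around namespace ownership) that are omitted. A task counts as capable over a target namespace only if the target is its own namespace or a descendant of it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified user namespace: only the parent link matters here. */
struct user_ns_model {
	const struct user_ns_model *parent;
};

struct task_model {
	const struct user_ns_model *ns;	/* namespace the task lives in */
	bool cap_sys_admin;		/* capability held in that namespace */
};

struct tty_model {
	const struct user_ns_model *owner_user_ns;
};

/* Record the allocating task's namespace, as the commit message describes. */
static void tty_alloc_model(struct tty_model *tty, const struct task_model *t)
{
	assert(tty != NULL && t != NULL);
	tty->owner_user_ns = t->ns;
}

/* Walk up from the target namespace until the task's own namespace is met. */
static bool ns_capable_model(const struct task_model *t,
			     const struct user_ns_model *target)
{
	for (const struct user_ns_model *ns = target; ns; ns = ns->parent)
		if (ns == t->ns)
			return t->cap_sys_admin;
	return false;
}
```

In this model, root inside a child namespace is not capable over a tty that was allocated in the parent namespace, which is the property the container-escape hardening relies on.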
This introduces the tiocsti_restrict sysctl, whose default is controlled
via CONFIG_SECURITY_TIOCSTI_RESTRICT. When activated, this control
restricts all TIOCSTI ioctl calls from non CAP_SYS_ADMIN users.

This patch depends on patch 1/2

This patch was inspired from GRKERNSEC_HARDEN_TTY.

This patch would have prevented
https://bugzilla.redhat.com/show_bug.cgi?id=1411256 under the following
conditions:
* non-privileged container
* container run inside new user namespace

Possible effects on userland:

There could be a few user programs that would be affected by this
change.
See: <https://codesearch.debian.net/search?q=ioctl%5C%28.*TIOCSTI>
Notable programs are: agetty, csh, xemacs and tcsh.

However, I still believe that this change is worth it given that the
Kconfig defaults to n. This will be a feature that is turned on for the
same reason that people activate it when using grsecurity. Users of this
opt-in feature will realize that they are choosing security over some OS
features like unprivileged TIOCSTI ioctls, as should be clear in the
Kconfig help message.

Threat Model/Patch Rationale:

From grsecurity's config for GRKERNSEC_HARDEN_TTY:

 | There are very few legitimate uses for this functionality and it
 | has made vulnerabilities in several 'su'-like programs possible in
 | the past.  Even without these vulnerabilities, it provides an
 | attacker with an easy mechanism to move laterally among other
 | processes within the same user's compromised session.

So if one process within a tty session becomes compromised, it can follow
that additional processes, thought to be in different security
boundaries, are compromised as a result. When using a program like su
or sudo, these additional processes could be in a tty session where TTY
file descriptors are indeed shared across privilege boundaries.

This is also an excellent writeup about the issue:
<http://www.halfdog.net/Security/2012/TtyPushbackPrivilegeEscalation/>

When user namespaces are in use, the check for the capability
CAP_SYS_ADMIN is done against the user namespace that originally opened
the tty.

Acked-by: Serge Hallyn <serge@hallyn.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Matt Brown <matt@nmatt.com>
Signed-off-by: Thibaut Sautereau <thibaut.sautereau@ssi.gouv.fr>
Signed-off-by: Levente Polyak <levente@leventepolyak.net>
Signed-off-by: Daniel Micay <danielmicay@gmail.com>
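
The gate that the tiocsti_restrict commit message describes can be sketched in userspace as follows. `struct tty_model`, `tiocsti_model()` and the `capable_in_owner_ns` flag are hypothetical; the flag stands in for the result of the CAP_SYS_ADMIN check against the namespace that opened the tty, and the queue-full error code is an arbitrary choice of this model.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Models the tiocsti_restrict sysctl: 0 = unrestricted, 1 = restricted. */
static int tiocsti_restrict;

/* Hypothetical stand-in for a tty input queue. */
struct tty_model {
	char input[8];
	size_t len;
};

/* Push one character into the input queue, subject to the restriction. */
static int tiocsti_model(struct tty_model *tty, char ch,
			 bool capable_in_owner_ns)
{
	assert(tty != NULL);

	if (tiocsti_restrict && !capable_in_owner_ns)
		return -EPERM;
	if (tty->len == sizeof(tty->input))
		return -EAGAIN;	/* queue full; arbitrary for this model */

	tty->input[tty->len++] = ch;
	return 0;
}
```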
With 46c7dd5 ("modpost: always show verbose warning for section
mismatch"), sec_mismatch_verbose was removed which would have printed
errors for all writable function pointers during compilation if it
hadn't been "#if 0"ed out for quite some time now.

Let's introduce a new DEBUG_WRITABLE_FUNCTION_POINTERS_VERBOSE Kconfig
option to cleanly control this linux-hardened functionality.

Signed-off-by: Thibaut Sautereau <thibaut.sautereau@ssi.gouv.fr>
Signed-off-by: Levente Polyak <levente@leventepolyak.net>
Commit a9cd410 ("mm/page_alloc.c: memory hotplug: free pages as
higher order") changed `static void __init __free_pages_boot_core()`
into `void __free_pages_core()`, causing the following section mismatch
warning at compile time:

    WARNING: vmlinux.o(.text+0x180fe4): Section mismatch in reference from the function __free_pages_core() to the variable .meminit.data:extra_latent_entropy
    The function __free_pages_core() references the variable __meminitdata extra_latent_entropy.
    This is often because __free_pages_core lacks a __meminitdata annotation or the annotation of extra_latent_entropy is wrong.

This commit is an attempt at fixing this issue. I'm not sure it's OK as
we are accessing pages that are still managed by the memblock allocator.
The prefetching part is not an issue as it only affects struct pages.

Signed-off-by: Thibaut Sautereau <thibaut.sautereau@ssi.gouv.fr>
[levente@leventepolyak.net: most of core MM initialization moved to mm/mm_init.c]
Signed-off-by: Levente Polyak <levente@leventepolyak.net>
[nicolas.bouchinet@ssi.gouv.fr: MAX_ORDER has been renamed to MAX_PAGE_ORDER (see 5e0a760)]
Signed-off-by: Nicolas Bouchinet <nicolas.bouchinet@ssi.gouv.fr>
This has required some rework during the port to 5.13, due to
da844b7 ("kasan, mm: integrate slab init_on_alloc with HW_TAGS"),
and the patch is actually much simpler now since we do not need to
unpoison objects anymore.

Signed-off-by: Levente Polyak <levente@leventepolyak.net>
Signed-off-by: Thibaut Sautereau <thibaut.sautereau@ssi.gouv.fr>
[nicolas.bouchinet@ssi.gouv.fr: pre/post-alloc hooks moved from mm/slab.h to mm/slub.c (see 6011be5)]
Signed-off-by: Nicolas Bouchinet <nicolas.bouchinet@ssi.gouv.fr>
This is modified from Brad Spengler/PaX Team's code in the last public
patch of grsecurity/PaX based on my understanding of the code. Changes
or omissions from the original code are mine and don't reflect the
original grsecurity/PaX code.

TCP simultaneous connect adds a weakness in Linux's implementation of
TCP that allows two clients to connect to each other without either
entering a listening state. The weakness allows an attacker to easily
prevent a client from connecting to a known server provided the source
port for the connection is guessed correctly.

As the weakness could be used to prevent an antivirus or IPS from
fetching updates, or prevent an SSL gateway from fetching a CRL, it
should be eliminated.

This creates a net.ipv4.tcp_simult_connect sysctl that when disabled,
disables TCP simultaneous connect.

Reviewed-by: Thibaut Sautereau <thibaut.sautereau@ssi.gouv.fr>
Reviewed-by: Levente Polyak <levente@leventepolyak.net>
Signed-off-by: Levente Polyak <levente@leventepolyak.net>
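
A toy state machine may help show what disabling simultaneous connect means for a socket in SYN-SENT. The enum, the `tcp_simult_connect` variable and `synsent_rcv_model()` are hypothetical names for this sketch. Sequence and acknowledgment number validation is left out entirely, and the actual patch may structure the check differently.

```c
#include <assert.h>
#include <stdbool.h>

enum model_tcp_state {
	MODEL_SYN_SENT,
	MODEL_SYN_RECV,
	MODEL_ESTABLISHED,
};

/* Models net.ipv4.tcp_simult_connect: 1 = allowed, 0 = disabled. */
static int tcp_simult_connect = 1;

/* Next state of a SYN-SENT socket after receiving a segment. */
static enum model_tcp_state synsent_rcv_model(bool syn, bool ack)
{
	if (syn && ack)
		return MODEL_ESTABLISHED;	/* ordinary SYN-ACK from a server */

	/* Bare SYN from the peer: simultaneous open, if permitted. */
	if (syn && !ack && tcp_simult_connect)
		return MODEL_SYN_RECV;

	return MODEL_SYN_SENT;			/* segment is discarded */
}
```

With the knob at 0, a bare SYN no longer moves the socket out of SYN-SENT, so a guessed source port alone is not enough to derail the pending connection in this model.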
When disabled, unprivileged users will not be able to create
new overlayfs mounts. This reduces the attack surface when
unprivileged user namespace mounts, such as those used by
rootless containers, are not required.

Signed-off-by: Levente Polyak <levente@leventepolyak.net>
Trigger BUG when kfence encounters data corruption of kfence managed
objects. This allows a finer-grained control instead of globally
enabling panic_on_warn.

Signed-off-by: Levente Polyak <levente@leventepolyak.net>
Before commit d0fe47c ("slub: add back check for free nonslab
objects"), freeing a non-slab object used to trigger a BUG if
CONFIG_DEBUG_VM was enabled. Now it only warns, which I think is not
enough for such a memory corruption. Let's restore the previous
behaviour, but tie it to CONFIG_BUG_ON_DATA_CORRUPTION as suggested by
Levente.

After page folios were introduced in v5.17, this patch was adapted to
trigger a bug when the order of the folio is zero instead of when the
page is not a compound page, which is not equivalent but respects the
semantics of the conversion to page folios and follows the change made
to the WARN_ON_ONCE beneath.

Suggested-by: Levente Polyak <levente@leventepolyak.net>
Signed-off-by: Thibaut Sautereau <thibaut.sautereau@ssi.gouv.fr>
[nicolas.bouchinet@ssi.gouv.fr: kfree moved from mm/slab_common.c to mm/slub.c (see b774d3e)]
Signed-off-by: Nicolas Bouchinet <nicolas.bouchinet@ssi.gouv.fr>
This forces processes to have `CAP_SYS_ADMIN` in order to use io_uring or
to be in the io_uring_group.

The patch alters the sysctl value range so that once it is set to "2" it
cannot be lowered again.

The io_uring_group sysctl option is set to -1 by default; users should
define a proper group and set the sysctl accordingly if they want it configured.

Signed-off-by: Nicolas Bouchinet <nicolas.bouchinet@ssi.gouv.fr>
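
The "cannot be lowered once set to 2" behaviour can be modeled as a write handler whose lower bound depends on the current value. The names below are hypothetical, and the meaning of 0, 1 and 2 follows the upstream `io_uring_disabled` sysctl documentation; how the patch actually adjusts the bounds is not shown in this PR text.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* 0 = allowed for all, 1 = privileged or group members only, 2 = disabled. */
static int io_uring_disabled_model;

/* Write handler: once the value is 2, only 2 is accepted. */
static int io_uring_disabled_write(int new_val)
{
	int min = (io_uring_disabled_model == 2) ? 2 : 0;

	if (new_val < min || new_val > 2)
		return -EINVAL;

	io_uring_disabled_model = new_val;
	return 0;
}

/* Permission check; io_uring_group == -1 means no group is configured. */
static bool io_uring_allowed_model(bool cap_sys_admin, int task_gid,
				   int io_uring_group)
{
	if (io_uring_disabled_model == 0)
		return true;
	if (io_uring_disabled_model == 2)
		return false;
	if (cap_sys_admin)
		return true;
	return io_uring_group != -1 && task_gid == io_uring_group;
}
```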
Since we expose proc_dointvec_minmax_sysadmin, add it to sanity checking
functions.

Signed-off-by: Nicolas Bouchinet <nicolas.bouchinet@ssi.gouv.fr>
Signed-off-by: Levente Polyak <levente@leventepolyak.net>
With the introduction of barns and sheaves, slab objects are used to prefill or
refill sheaves, which are caches of small objects taking the form of an
array of pointers to slab objects.

Sheaves are then used for quick allocation and free, which consist of
shrinking and growing the array index.
Thus, there are two views of the allocation state of those objects: while
the slab allocator sees them as allocated, the sheaf allocator
sees them as free and then allocates them.

We thus need to adapt the slab canary patch to avoid sanitizing
object allocations and frees that go through this array.

A follow-up patch will add a per-sheaf random canary value, which should
allow better tracking of object overflows.

Signed-off-by: Levente Polyak <levente@leventepolyak.net>
Signed-off-by: Nicolas Bouchinet <nicolas.bouchinet@ssi.gouv.fr>
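
A minimal userspace model of the sheaf described above: an array of object pointers plus a fill index, where allocation shrinks the index and free grows it. `SHEAF_CAPACITY`, the struct layout and the function names are assumptions for illustration; the in-kernel sheaf carries more state than this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SHEAF_CAPACITY 4	/* arbitrary size for this model */

struct sheaf_model {
	unsigned int size;		/* number of cached objects */
	void *objects[SHEAF_CAPACITY];	/* pointers to slab objects */
};

/* Allocation: pop the last cached pointer, or NULL if the sheaf is empty. */
static void *sheaf_alloc(struct sheaf_model *s)
{
	assert(s != NULL);
	if (s->size == 0)
		return NULL;
	return s->objects[--s->size];
}

/* Free: push the pointer back, or report failure if the sheaf is full. */
static bool sheaf_free(struct sheaf_model *s, void *obj)
{
	assert(s != NULL);
	if (s->size == SHEAF_CAPACITY)
		return false;
	s->objects[s->size++] = obj;
	return true;
}
```

Every pointer sitting in `objects[]` is "free" from the sheaf's point of view while still being "allocated" from the slab's point of view, which is the double bookkeeping the commit message refers to.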
Sheaf allocation is an allocation cache that uses pre-allocated slab
objects for faster free and allocation from a sheaf array.
This patch adds a sheaf canary in order to detect small overflows and
double-free of sheaf objects.

Signed-off-by: Levente Polyak <levente@leventepolyak.net>
Signed-off-by: Nicolas Bouchinet <nicolas.bouchinet@ssi.gouv.fr>
- Invert the canary logic: we now only track objects in their inactive
  state. Allocated objects are always tagged as random_active. Free
  objects are tagged as sheaf_random_inactive or random_inactive,
  depending on whether they are in a sheaf or in a slab freelist.
  The logic inversion should make the patch far more stable.

- Fixes slab_debug canary crash in early allocation state when the
  bootstrap sheaf is in use.

- Fixes slabobj_ext offset computation when stored in objects.

- Always instrument sheaf_canary, even when slab_debug is active.

- Fixes canary mismatch in some free paths.

- Adapt canary to new alloc/free paths.

- Fixes kmem_cache_refill_sheaf instrumentation.

Signed-off-by: Levente Polyak <levente@leventepolyak.net>
Signed-off-by: Nicolas Bouchinet <nicolas.bouchinet@ssi.gouv.fr>
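
The three tags named in the list above (random_active, random_inactive, sheaf_random_inactive) can be illustrated with a small userspace model. It assumes, purely for the sake of the sketch, that a canary is the object's address XORed with a per-cache secret. The struct layout and helper names are hypothetical and not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Per-cache secrets, one per object state. */
struct cache_secrets {
	uintptr_t random_active;
	uintptr_t random_inactive;
	uintptr_t sheaf_random_inactive;
};

enum obj_state { OBJ_ACTIVE, OBJ_SLAB_FREE, OBJ_SHEAF_FREE };

/* An object reduced to its canary word. */
struct object_model {
	uintptr_t canary;
};

/* Expected canary for this object in the given state (XOR is an assumption). */
static uintptr_t canary_for(const struct cache_secrets *c,
			    const struct object_model *o, enum obj_state st)
{
	uintptr_t secret = c->random_active;

	if (st == OBJ_SLAB_FREE)
		secret = c->random_inactive;
	else if (st == OBJ_SHEAF_FREE)
		secret = c->sheaf_random_inactive;

	return (uintptr_t)o ^ secret;
}

static void set_canary(const struct cache_secrets *c, struct object_model *o,
		       enum obj_state st)
{
	o->canary = canary_for(c, o, st);
}

static bool check_canary(const struct cache_secrets *c,
			 const struct object_model *o, enum obj_state st)
{
	return o->canary == canary_for(c, o, st);
}

/* Free into a sheaf: the object must currently carry the active tag. */
static bool free_to_sheaf(const struct cache_secrets *c, struct object_model *o)
{
	if (!check_canary(c, o, OBJ_ACTIVE))
		return false;	/* double free or corrupted canary */
	set_canary(c, o, OBJ_SHEAF_FREE);
	return true;
}
```

A second `free_to_sheaf()` on the same object fails in this model because the object already carries the sheaf-inactive tag rather than the active one.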
With canary_debug, a canary mismatch prints the expected canary values
and the one that was actually encountered.

Signed-off-by: Levente Polyak <levente@leventepolyak.net>
Signed-off-by: Nicolas Bouchinet <nicolas.bouchinet@ssi.gouv.fr>
Signed-off-by: Nicolas Bouchinet <nicolas.bouchinet@ssi.gouv.fr>
Explicitly define CONST_CAST_TREE
for gcc-16, as it was removed in gcc trunk.

see commits
  c3d96ff9e916c02584aa081f03ab999292efbb50
  458c7926d48959abcb2c1adaa22458e27459a551

Link: https://www.spinics.net/lists/kernel/msg6111050.html
Signed-off-by: Levente Polyak <levente@leventepolyak.net>
Signed-off-by: Nicolas Bouchinet <nicolas.bouchinet@ssi.gouv.fr>
10 participants