Conversation

leobalter

Thanks for sharing linux in github! The beer is free too!

@suissa

suissa commented Sep 5, 2011

beer rulez!

@0xRoch

0xRoch commented Sep 5, 2011

epic

@raphaelcosta

Hahaha nice!

@ei-grad

ei-grad commented Sep 5, 2011

github is too github...

@ghost

ghost commented Sep 5, 2011

Oh my... EPIC!

@zenorocha

#win!

@hugobarauna

Unecessary =p

@fellix

fellix commented Sep 5, 2011

Unecessary * 2

This is not Orkut =/

@RusAlex

RusAlex commented Sep 5, 2011

Thanks for linux!

@sobrinho

sobrinho commented Sep 5, 2011

totally unnecessary, congratz!

@Gunni

Gunni commented Sep 5, 2011

yeah make pull requests either vanish or be a link to https://github.com/torvalds/linux/tree/master/Documentation/development-process

@lucasrenan

This is not Orkut =/ 2

@Spaceghost

I am thoroughly disappoint.

@CruzBishop
Contributor

...This is just crazy

@mbt

mbt commented Sep 5, 2011

@torvalds I will volunteer to help clean up spam requests if there is a way to do so.

@Spaceghost

Great way to introduce someone very prominent in the open source
community to github. Brilliant way to get your chuckles...

@leobalter
Author

No more beers for you, going back to BSD

@leobalter leobalter closed this Sep 6, 2011
@ebraminio

"No more beers for you, going back to BSD" :D

@Spaceghost

@diegoviola you might want to cool it a bit. We're not a lynch mob, the goal was to stop having joke pull requests started on @torvalds repository. Save the 'saving the world' bit for later. :)

@Spaceghost

@diegoviola, you're cool. Just something we all might want to keep in
mind. There are no enemies here, at least none I can see.

@ghost

ghost commented Sep 7, 2011

The amount of social networking b.s. for an operating system kernel's source code repository IS TOO DAMN HIGH.

@mvanveen

mvanveen commented Sep 7, 2011

+1

stefanha pushed a commit to stefanha/linux that referenced this pull request Oct 30, 2011
Add mount options backupuid and backupgid.

These options allow an authenticated user to access files, including their
ACLs, with the intent to back them up, even when that user lacks access
permission but holds the "Backup files and directories" user right on them
(by virtue of being part of the built-in group Backup Operators).

When the mount option backupuid is specified, the cifs client restricts
the use of backup intents to the user whose effective user id matches the
id specified with the mount option.

When the mount option backupgid is specified, the cifs client restricts
the use of backup intents to users who belong to the group id specified
with the mount option.

If an authenticated user is not part of the built-in group Backup Operators
at the server, access to such files is denied, even if allowed by the client.
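
For illustration only, a minimal userspace sketch of the policy the two
options describe (a standalone program with invented names; the real
check lives inside the cifs client):

#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical model: backup intent is granted when the caller's
 * effective uid matches backupuid, or when one of the caller's
 * groups matches backupgid. */
static bool backup_intent_allowed(uid_t backupuid, gid_t backupgid)
{
    gid_t groups[64];
    int n, i;

    if (geteuid() == backupuid)
        return true;
    n = getgroups(64, groups);
    for (i = 0; i < n; i++)
        if (groups[i] == backupgid)
            return true;
    return false;
}

int main(void)
{
    printf("backup intent allowed: %d\n",
           backup_intent_allowed(1000, 1000));
    return 0;
}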

Signed-off-by: Shirish Pargaonkar <[email protected]>
Reviewed-by: Jeff Layton <[email protected]>
Signed-off-by: Steve French <[email protected]>
iksaif pushed a commit to iksaif/platform-drivers-x86 that referenced this pull request Nov 6, 2011
This patch validates the sdev pointer in scsi_dh_activate before proceeding further.

Without this check we might see the panic as below. I have seen this
panic multiple times.

Call trace:

 #0 [ffff88007d647b50] machine_kexec at ffffffff81020902
 #1 [ffff88007d647ba0] crash_kexec at ffffffff810875b0
 #2 [ffff88007d647c70] oops_end at ffffffff8139c650
 #3 [ffff88007d647c90] __bad_area_nosemaphore at ffffffff8102dd15
 #4 [ffff88007d647d50] page_fault at ffffffff8139b8cf
    [exception RIP: scsi_dh_activate+0x82]
    RIP: ffffffffa0041922  RSP: ffff88007d647e00  RFLAGS: 00010046
    RAX: 0000000000000000  RBX: 0000000000000000  RCX: 00000000000093c5
    RDX: 00000000000093c5  RSI: ffffffffa02e6640  RDI: ffff88007cc88988
    RBP: 000000000000000f   R8: ffff88007d646000   R9: 0000000000000000
    R10: ffff880082293790  R11: 00000000ffffffff  R12: ffff88007cc88988
    R13: 0000000000000000  R14: 0000000000000286  R15: ffff880037b845e0
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
 #5 [ffff88007d647e38] run_workqueue at ffffffff81060268
 #6 [ffff88007d647e78] worker_thread at ffffffff81060386
 #7 [ffff88007d647ee8] kthread at ffffffff81064436
 #8 [ffff88007d647f48] kernel_thread at ffffffff81003fba
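
The shape of the fix, per the description, is an early validation of the
pointer. A hedged, self-contained C analogue (invented type and names,
not the actual scsi_dh code):

#include <errno.h>
#include <stdio.h>

struct device_model { int id; };

/* Validate the pointer before dereferencing it: the queued work can
 * run after the object it refers to has gone away, so reject the
 * request instead of oopsing. */
static int activate(struct device_model *sdev)
{
    if (!sdev)
        return -ENODEV;
    printf("activating device %d\n", sdev->id);
    return 0;
}

int main(void)
{
    struct device_model dev = { 7 };
    activate(&dev);
    activate(NULL); /* now rejected with -ENODEV instead of panicking */
    return 0;
}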

Signed-off-by: Babu Moger <[email protected]>
Cc: [email protected]
Signed-off-by: James Bottomley <[email protected]>
baerwolf pushed a commit to baerwolf/linux-stephan that referenced this pull request Nov 12, 2011
commit a18a920 upstream.

This patch validates the sdev pointer in scsi_dh_activate before proceeding further.

Without this check we might see the panic as below. I have seen this
panic multiple times.

Call trace:

 #0 [ffff88007d647b50] machine_kexec at ffffffff81020902
 #1 [ffff88007d647ba0] crash_kexec at ffffffff810875b0
 #2 [ffff88007d647c70] oops_end at ffffffff8139c650
 #3 [ffff88007d647c90] __bad_area_nosemaphore at ffffffff8102dd15
 #4 [ffff88007d647d50] page_fault at ffffffff8139b8cf
    [exception RIP: scsi_dh_activate+0x82]
    RIP: ffffffffa0041922  RSP: ffff88007d647e00  RFLAGS: 00010046
    RAX: 0000000000000000  RBX: 0000000000000000  RCX: 00000000000093c5
    RDX: 00000000000093c5  RSI: ffffffffa02e6640  RDI: ffff88007cc88988
    RBP: 000000000000000f   R8: ffff88007d646000   R9: 0000000000000000
    R10: ffff880082293790  R11: 00000000ffffffff  R12: ffff88007cc88988
    R13: 0000000000000000  R14: 0000000000000286  R15: ffff880037b845e0
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
 #5 [ffff88007d647e38] run_workqueue at ffffffff81060268
 #6 [ffff88007d647e78] worker_thread at ffffffff81060386
 #7 [ffff88007d647ee8] kthread at ffffffff81064436
 #8 [ffff88007d647f48] kernel_thread at ffffffff81003fba

Signed-off-by: Babu Moger <[email protected]>
Signed-off-by: James Bottomley <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
torvalds pushed a commit that referenced this pull request Dec 15, 2011
If the pte mapping in generic_perform_write() is unmapped between
iov_iter_fault_in_readable() and iov_iter_copy_from_user_atomic(), the
"copied" parameter to ->write_end can be zero. ext4 couldn't cope with
that when delayed allocation was enabled. This patch skips the
i_disksize enlargement logic if copied is zero and no new data was
appended to the inode.

 gdb> bt
 #0  0xffffffff811afe80 in ext4_da_should_update_i_disksize (file=0xffff88003f606a80, mapping=0xffff88001d3824e0, pos=0x1\
 08000, len=0x1000, copied=0x0, page=0xffffea0000d792e8, fsdata=0x0) at fs/ext4/inode.c:2467
 #1  ext4_da_write_end (file=0xffff88003f606a80, mapping=0xffff88001d3824e0, pos=0x108000, len=0x1000, copied=0x0, page=0\
 xffffea0000d792e8, fsdata=0x0) at fs/ext4/inode.c:2512
 #2  0xffffffff810d97f1 in generic_perform_write (iocb=<value optimized out>, iov=<value optimized out>, nr_segs=<value o\
 ptimized out>, pos=0x108000, ppos=0xffff88001e26be40, count=<value optimized out>, written=0x0) at mm/filemap.c:2440
 #3  generic_file_buffered_write (iocb=<value optimized out>, iov=<value optimized out>, nr_segs=<value optimized out>, p\
 os=0x108000, ppos=0xffff88001e26be40, count=<value optimized out>, written=0x0) at mm/filemap.c:2482
 #4  0xffffffff810db5d1 in __generic_file_aio_write (iocb=0xffff88001e26bde8, iov=0xffff88001e26bec8, nr_segs=0x1, ppos=0\
 xffff88001e26be40) at mm/filemap.c:2600
 #5  0xffffffff810db853 in generic_file_aio_write (iocb=0xffff88001e26bde8, iov=0xffff88001e26bec8, nr_segs=<value optimi\
 zed out>, pos=<value optimized out>) at mm/filemap.c:2632
 #6  0xffffffff811a71aa in ext4_file_write (iocb=0xffff88001e26bde8, iov=0xffff88001e26bec8, nr_segs=0x1, pos=0x108000) a\
 t fs/ext4/file.c:136
 #7  0xffffffff811375aa in do_sync_write (filp=0xffff88003f606a80, buf=<value optimized out>, len=<value optimized out>, \
 ppos=0xffff88001e26bf48) at fs/read_write.c:406
 #8  0xffffffff81137e56 in vfs_write (file=0xffff88003f606a80, buf=0x1ec2960 <Address 0x1ec2960 out of bounds>, count=0x4\
 000, pos=0xffff88001e26bf48) at fs/read_write.c:435
 #9  0xffffffff8113816c in sys_write (fd=<value optimized out>, buf=0x1ec2960 <Address 0x1ec2960 out of bounds>, count=0x\
 4000) at fs/read_write.c:487
 #10 <signal handler called>
 #11 0x00007f120077a390 in __brk_reservation_fn_dmi_alloc__ ()
 #12 0x0000000000000000 in ?? ()
 gdb> print offset
 $22 = 0xffffffffffffffff
 gdb> print idx
 $23 = 0xffffffff
 gdb> print inode->i_blkbits
 $24 = 0xc
 gdb> up
 #1  ext4_da_write_end (file=0xffff88003f606a80, mapping=0xffff88001d3824e0, pos=0x108000, len=0x1000, copied=0x0, page=0\
 xffffea0000d792e8, fsdata=0x0) at fs/ext4/inode.c:2512
 2512                    if (ext4_da_should_update_i_disksize(page, end)) {
 gdb> print start
 $25 = 0x0
 gdb> print end
 $26 = 0xffffffffffffffff
 gdb> print pos
 $27 = 0x108000
 gdb> print new_i_size
 $28 = 0x108000
 gdb> print ((struct ext4_inode_info *)((char *)inode-((int)(&((struct ext4_inode_info *)0)->vfs_inode))))->i_disksize
 $29 = 0xd9000
 gdb> down
 2467            for (i = 0; i < idx; i++)
 gdb> print i
 $30 = 0xd44acbee

This is 100% reproducible with some autonuma development code tuned in
a very aggressive manner (not the normal way even for knumad) which
makes "exotic" changes to the ptes. It wouldn't normally trigger, but I
don't see why it can't happen normally if the page is added to the swap
cache between the two faults, leading to "copied" being zero (which
then hangs in ext4). So it should be fixed. It is especially possible
with lumpy reclaim (albeit disabled if compaction is enabled), as that
would ignore the young bits in the ptes.
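
As a model of the skip logic (hypothetical helper; the pos and
i_disksize constants are taken from the gdb session above):

#include <stdbool.h>
#include <stdio.h>

/* Illustrative: only consider growing i_disksize when the write
 * actually copied bytes and the result reaches past the current
 * on-disk size; copied == 0 means nothing new was appended. */
static bool should_update_disksize(long copied, long pos, long i_disksize)
{
    return copied > 0 && pos + copied > i_disksize;
}

int main(void)
{
    /* the faulting case from the report: copied == 0 -> skip (0) */
    printf("%d\n", should_update_disksize(0x0, 0x108000, 0xd9000));
    /* a normal 4 KiB write past i_disksize -> update (1) */
    printf("%d\n", should_update_disksize(0x1000, 0x108000, 0xd9000));
    return 0;
}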

Signed-off-by: Andrea Arcangeli <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
Cc: [email protected]
elettronicagf pushed a commit to elettronicagf/kernel-omap3 that referenced this pull request Dec 16, 2011
Cancel idle timer in musb_platform_exit.

The idle timer could trigger after the clock had been disabled, leading
to a kernel panic when MUSB_DEVCTL is accessed in musb_do_idle on 2.6.37.

The fault below is no longer triggered on 2.6.38-rc4 (the clock is
disabled later, and only if compiled as a module, and the offending
memory access has moved), but the timer should be cancelled nonetheless.

Rebooting... musb_hdrc musb_hdrc: remove, state 4
usb usb1: USB disconnect, address 1
musb_hdrc musb_hdrc: USB bus 1 deregistered
Unhandled fault: external abort on non-linefetch (0x1028) at 0xfa0ab060
Internal error: : 1028 [#1] PREEMPT
last sysfs file: /sys/kernel/uevent_seqnum
Modules linked in:
CPU: 0    Not tainted  (2.6.37+ #6)
PC is at musb_do_idle+0x24/0x138
LR is at musb_do_idle+0x18/0x138
pc : [<c02377d8>]    lr : [<c02377cc>]    psr: 80000193
sp : cf2bdd80  ip : cf2bdd80  fp : c048a20c
r10: c048a60c  r9 : c048a40c  r8 : cf85e110
r7 : cf2bc000  r6 : 40000113  r5 : c0489800  r4 : cf85e110
r3 : 00000004  r2 : 00000006  r1 : fa0ab000  r0 : cf8a7000
Flags: Nzcv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment user
Control: 10c5387d  Table: 8faac019  DAC: 00000015
Process reboot (pid: 769, stack limit = 0xcf2bc2f0)
Stack: (0xcf2bdd80 to 0xcf2be000)
dd80: 00000103 c0489800 c02377b4 c005fa34 00000555 c0071a8c c04a3858 cf2bdda8
dda0: 00000555 c048a00c cf2bdda8 cf2bdda8 1838beb0 00000103 00000004 cf2bc000
ddc0: 00000001 00000001 c04896c8 0000000a 00000000 c005ac14 00000001 c003f32c
dde0: 00000000 00000025 00000000 cf2bc000 00000002 00000001 cf2bc000 00000000
de00: 00000001 c005ad08 cf2bc000 c002e07c c03ec039 ffffffff fa200000 c0033608
de20: 00000001 00000000 cf852c14 cf81f200 c045b714 c045b708 cf2bc000 c04a37e8
de40: c0033c04 cf2bc000 00000000 00000001 cf2bde68 cf2bde68 c01c3abc c004f7d8
de60: 60000013 ffffffff c0033c04 00000000 01234567 fee1dead 00000000 c006627c
de80: 00000001 c00662c8 28121969 c00663ec cfa38c40 cf9f6a00 cf2bded0 cf9f6a0c
dea0: 00000000 cf92f000 00008914 c02cd284 c04a55c8 c028b398 c00715c0 becf24a8
dec0: 30687465 00000000 00000000 00000000 00000002 1301a8c0 00000000 00000000
dee0: 00000002 1301a8c0 00000000 00000000 c0450494 cf527920 00011f10 cf2bdf08
df00: 00011f10 cf2bdf10 00011f10 cf2bdf18 c00f0b44 c004f7e8 cf2bdf18 cf2bdf18
df20: 00011f10 cf2bdf30 00011f10 cf2bdf38 cf401300 cf486100 00000008 c00d2b28
df40: 00011f10 cf401300 00200200 c00d3388 00011f10 cfb63a88 cfb63a80 c00c2f08
df60: 00000000 00000000 cfb63a80 00000000 cf0a3480 00000006 c0033c04 cfb63a80
df80: 00000000 c00c0104 00000003 cf0a3480 cfb63a80 00000000 00000001 00000004
dfa0: 00000058 c0033a80 00000000 00000001 fee1dead 28121969 01234567 00000000
dfc0: 00000000 00000001 00000004 00000058 00000001 00000001 00000000 00000001
dfe0: 4024d200 becf2cb0 00009210 4024d218 60000010 fee1dead 00000000 00000000
[<c02377d8>] (musb_do_idle+0x24/0x138) from [<c005fa34>] (run_timer_softirq+0x1a8/0x26c)
[<c005fa34>] (run_timer_softirq+0x1a8/0x26c) from [<c005ac14>] (__do_softirq+0x88/0x138)
[<c005ac14>] (__do_softirq+0x88/0x138) from [<c005ad08>] (irq_exit+0x44/0x98)
[<c005ad08>] (irq_exit+0x44/0x98) from [<c002e07c>] (asm_do_IRQ+0x7c/0xa0)
[<c002e07c>] (asm_do_IRQ+0x7c/0xa0) from [<c0033608>] (__irq_svc+0x48/0xa8)
Exception stack(0xcf2bde20 to 0xcf2bde68)
de20: 00000001 00000000 cf852c14 cf81f200 c045b714 c045b708 cf2bc000 c04a37e8
de40: c0033c04 cf2bc000 00000000 00000001 cf2bde68 cf2bde68 c01c3abc c004f7d8
de60: 60000013 ffffffff
[<c0033608>] (__irq_svc+0x48/0xa8) from [<c004f7d8>] (sub_preempt_count+0x0/0xb8)
Code: ebf86030 e5940098 e594108c e5902010 (e5d13060)
---[ end trace 3689c0d808f9bf7c ]---
Kernel panic - not syncing: Fatal exception in interrupt
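
The underlying pattern is cancel-before-teardown. A hedged userspace
analogue using POSIX timers (an illustration of the ordering only, not
the musb code; link with -lrt on older glibc):

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static int *state;

/* If the timer were still armed when state is freed, this callback
 * would be a use-after-free -- the analogue of musb_do_idle touching
 * MUSB_DEVCTL after the clock is gone. */
static void idle_cb(union sigval sv)
{
    (void)sv;
    printf("idle timer fired, state=%d\n", *state);
}

int main(void)
{
    timer_t t;
    struct sigevent sev = { 0 };
    struct itimerspec its = { 0 };

    state = malloc(sizeof *state);
    *state = 42;

    sev.sigev_notify = SIGEV_THREAD;
    sev.sigev_notify_function = idle_cb;
    timer_create(CLOCK_MONOTONIC, &sev, &t);
    its.it_value.tv_sec = 1;
    timer_settime(t, 0, &its, NULL);

    /* teardown: cancel the timer *before* freeing what it uses */
    timer_delete(t);
    free(state);
    return 0;
}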

Cc: [email protected]
Signed-off-by: Johan Hovold <[email protected]>
Signed-off-by: Felipe Balbi <[email protected]>
Signed-off-by: Sriramakrishnan A G <[email protected]>
tworaz pushed a commit to tworaz/linux that referenced this pull request Jan 9, 2012
[ Upstream commit e226930 ]

This code has been broken forever, but in several different and
creative ways.

So far as I can work out, the R6040 MAC filter has 4 exact-match
entries, the first of which the driver uses for its assigned unicast
address, plus a 64-entry hash-based filter for multicast addresses
(maybe unicast as well?).

The original version of this code would write the first 4 multicast
addresses as exact-match entries from offset 1 (bug #1: there is no
entry 4 so this could write to some PHY registers).  It would fill the
remainder of the exact-match entries with the broadcast address (bug #2:
this would overwrite the last used entry).  If more than 4 multicast
addresses were configured, it would set up the hash table, write some
random crap to the MAC control register (bug #3) and finally walk off
the end of the list when filling the exact-match entries (bug #4).

All of this seems to be pointless, since it sets the promiscuous bit
when the interface is made promiscuous or if >4 multicast addresses
are enabled, and never clears it (bug #5, masking bug #2).

The recent(ish) changes to the multicast list fixed bug #4, but
completely removed the limit on iteration over the exact-match entries
(bug #6).

Bug #4 was reported as
<https://bugzilla.kernel.org/show_bug.cgi?id=15355> and more recently
as <http://bugs.debian.org/600155>.  Florian Fainelli attempted to fix
these in commit 3bcf822, but that
actually dealt with bugs #1-3, bug #4 having been fixed in mainline at
that point.

That commit fixes the most important current bug, #6.
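
A hedged sketch of the bounded walk that fixing bugs #4 and #6 implies
(a pure model, not the driver code): stop at whichever runs out first,
the address list or the exact-match slots, and fall back to the hash
filter or the promiscuous bit on overflow.

#include <stdio.h>

/* exact-match entries 1..3; entry 0 holds the assigned unicast address */
#define MCAST_SLOTS 3

int main(void)
{
    const char *mc_list[] = {
        "01:00:5e:00:00:01", "01:00:5e:00:00:02",
        "01:00:5e:00:00:03", "01:00:5e:00:00:04",
    };
    int n = (int)(sizeof mc_list / sizeof mc_list[0]);
    int slot = 1, i;

    /* bounded on BOTH the list length and the slot count */
    for (i = 0; i < n && slot <= MCAST_SLOTS; i++, slot++)
        printf("exact-match slot %d <- %s\n", slot, mc_list[i]);

    if (n > MCAST_SLOTS)
        printf("too many addresses: use hash filter / promiscuous bit\n");
    return 0;
}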

Signed-off-by: Ben Hutchings <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
mj22226 pushed a commit to mj22226/linux that referenced this pull request Sep 26, 2025
The block layer validates buffer alignment using the device's
dma_alignment value. If dma_alignment is smaller than
logical_block_size (bp_block) - 1, misaligned buffers incorrectly pass
validation and propagate to the lower-level driver.

This patch adjusts dma_alignment to be at least logical_block_size - 1,
ensuring that misaligned buffers are properly rejected at the block
layer and do not reach the DASD driver unnecessarily.
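
To see why the adjustment matters, a small model (assuming, as in the
block layer, that dma_alignment acts as a mask checked with a bitwise
AND; the values are invented):

#include <stdint.h>
#include <stdio.h>

/* A buffer passes the alignment check when (addr & mask) == 0. */
static int buf_ok(uintptr_t addr, unsigned int mask)
{
    return (addr & mask) == 0;
}

int main(void)
{
    unsigned int lbs = 4096;         /* logical_block_size (bp_block) */
    unsigned int old_mask = 8 - 1;   /* dma_alignment smaller than lbs - 1 */
    unsigned int new_mask = lbs - 1; /* adjusted as the patch describes */
    uintptr_t buf = 0x10000 + 512;   /* sub-block misaligned buffer */

    printf("old mask: %s\n", buf_ok(buf, old_mask) ? "pass" : "reject");
    printf("new mask: %s\n", buf_ok(buf, new_mask) ? "pass" : "reject");
    return 0;
}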

Fixes: 2a07bb6 ("s390/dasd: Remove DMA alignment")
Reviewed-by: Stefan Haberland <[email protected]>
Cc: [email protected] torvalds#6.11+
Signed-off-by: Jaehoon Kim <[email protected]>
Signed-off-by: Stefan Haberland <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
davidhildenbrand added a commit to davidhildenbrand/linux that referenced this pull request Sep 26, 2025
…ge_order()

Patch series "mm: MM owner tracking for large folios (!hugetlb) +
CONFIG_NO_PAGE_MAPCOUNT", v3.

Let's add an "easy" way to decide -- without false positives, without
page-mapcounts and without page table/rmap scanning -- whether a large
folio is "certainly mapped exclusively" into a single MM, or whether it
"maybe mapped shared" into multiple MMs.

Use that information to implement Copy-on-Write reuse, to convert
folio_likely_mapped_shared() to folio_maybe_mapped_shared(), and to
introduce a kernel config option that lets us not use+maintain per-page
mapcounts in large folios anymore.

The bigger picture was presented at LSF/MM [1].

This series is effectively a follow-up on my early work [2], which
implemented a more precise, but also more complicated, way to identify
whether a large folio is "mapped shared" into multiple MMs or "mapped
exclusively" into a single MM.

1 Patch Organization
====================

Patch #1 -> #6: make more room in order-1 folios, so we have two
                "unsigned long" available for our purposes

Patch #7 -> #11: preparations

Patch #12: MM owner tracking for large folios

Patch #13: COW reuse for PTE-mapped anon THP

Patch #14: folio_maybe_mapped_shared()

Patch #15 -> #20: introduce and implement CONFIG_NO_PAGE_MAPCOUNT

2 MM owner tracking
===================

We assign each MM a unique ID ("MM ID"), to be able to squeeze more
information in our folios.  On 32bit we use 15-bit IDs, on 64bit we use
31-bit IDs.

For each large folio, we now store two MM-ID+mapcount ("slot")
combinations:
* mm0_id + mm0_mapcount
* mm1_id + mm1_mapcount

On 32bit, we use a 16-bit per-MM mapcount, on 64bit an ordinary 32bit
mapcount.  This way, we require 2x "unsigned long" on 32bit and 64bit for
both slots.
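
A hypothetical 64bit layout sketch of the two slots (field names
invented for illustration; the kernel's actual encoding differs in
detail):

#include <stdint.h>
#include <stdio.h>

/* Two MM-ID + mapcount slots in 2x unsigned long on 64bit: 31-bit MM
 * IDs, 32-bit per-MM mapcounts, and one spare bit per word (usable,
 * e.g., for a bit-spinlock). */
struct folio_mm_slots {
    uint64_t mm0_id       : 31;
    uint64_t mm0_spare    : 1;
    uint64_t mm0_mapcount : 32;
    uint64_t mm1_id       : 31;
    uint64_t mm1_spare    : 1;
    uint64_t mm1_mapcount : 32;
};

int main(void)
{
    /* 2x 8 bytes, i.e. the "2x unsigned long" mentioned above */
    printf("slots occupy %zu bytes\n", sizeof(struct folio_mm_slots));
    return 0;
}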

Paired with the large mapcount, we can reliably identify whether one of
these MMs is the current owner (-> owns all mappings) or even holds all
folio references (-> owns all mappings, and all references are from
mappings).

As long as only two MMs map folio pages at a time, we can reliably and
precisely identify whether a large folio is "mapped shared" or "mapped
exclusively".

Any additional MM that starts mapping the folio while there are no free
slots becomes an "untracked MM".  If one such "untracked MM" is the last
one mapping a folio exclusively, we will not detect the folio as "mapped
exclusively" but instead as "maybe mapped shared".  (exception: only a
single mapping remains)

So that's where the approach gets imprecise.
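
In code terms, the decision rule sketched above might look like this
(a hedged model; the real folio_maybe_mapped_shared() handles more
cases):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Sketch: a folio is certainly exclusive to mm when one tracked slot
 * belongs to mm and that slot accounts for the folio's entire large
 * mapcount; anything else must conservatively report "maybe shared". */
static bool maybe_mapped_shared(uint32_t mm, const uint32_t slot_id[2],
                                const uint32_t slot_mc[2], uint32_t large_mc)
{
    int i;

    for (i = 0; i < 2; i++)
        if (slot_id[i] == mm && slot_mc[i] == large_mc)
            return false; /* certainly mapped exclusively */
    return true; /* maybe shared (or an untracked MM is involved) */
}

int main(void)
{
    uint32_t ids[2] = { 7, 0 }, mcs[2] = { 512, 0 };

    /* MM 7 holds all 512 mappings -> not shared (0) */
    printf("%d\n", maybe_mapped_shared(7, ids, mcs, 512));
    /* some mappings belong elsewhere -> maybe shared (1) */
    printf("%d\n", maybe_mapped_shared(7, ids, mcs, 520));
    return 0;
}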

For now, we use a bit-spinlock to sync the large mapcount + slots, and
make sure we do keep the machinery fast, to not degrade (un)map
performance drastically: for example, we make sure to only use a single
atomic (when grabbing the bit-spinlock), like we would already perform
when updating the large mapcount.

3 CONFIG_NO_PAGE_MAPCOUNT
=========================

Patches #15 -> #20 spell out and document exactly what is affected when
not maintaining the per-page mapcounts in large folios anymore.

Most importantly, as we cannot maintain folio->_nr_pages_mapped anymore
when (un)mapping pages, we'll account a complete folio as mapped if a
single page is mapped.  In addition, we'll not detect partially mapped
anonymous folios as such in all cases yet.

Likely less relevant changes include that we might now under-estimate the
USS (Unique Set Size) of a process, but never over-estimate it.

The goal is to make CONFIG_NO_PAGE_MAPCOUNT the default at some point, to
then slowly make it the only option, as we learn about real-life impacts
and possible ways to mitigate them.

4 Performance
=============

Detailed performance numbers were included in v1 [3], and not that much
changed between v1 and v2.

I did plenty of measurements on different systems in the meantime, which
all revealed slightly different results.

The pte-mapped-folio micro-benchmarks [4] are fairly sensitive to code
layout changes on some systems.  Especially the fork() benchmark started
being more-shaky-than-before on recent kernels for some reason.

In summary, with my micro-benchmarks:

* Small folios are not impacted.

* CoW performance seems to be mostly unchanged across all folio sizes.

* CoW reuse performance of large folios now matches CoW reuse
  performance of small folios, because we now actually implement the CoW
  reuse optimization.  On an Intel Xeon Silver 4210R I measured a ~65%
  reduction in runtime, on an arm64 system I measured ~54% reduction.

* munmap() performance improves with CONFIG_NO_PAGE_MAPCOUNT.  I saw
  double-digit % reduction (up to ~30% on an Intel Xeon Silver 4210R and
  up to ~70% on an AmpereOne A192-32X) with larger folios.  The larger the
  folios, the larger the performance improvement.

* munmap() performance degrades very slightly (a couple percent) without
  CONFIG_NO_PAGE_MAPCOUNT for smaller folios.  For larger folios, there
  seems to be no change at all.

* fork() performance improves with CONFIG_NO_PAGE_MAPCOUNT.  I saw
  double-digit % reduction (up to ~20% on an Intel Xeon Silver 4210R and
  up to ~10% on an AmpereOne A192-32X) with larger folios.  The larger the
  folios, the larger the performance improvement.

* While fork() performance without CONFIG_NO_PAGE_MAPCOUNT seems to be
  almost unchanged on some systems, I saw some degradation for smaller
  folios on the AmpereOne A192-32X.  I did not investigate the details
  yet, but I suspect code layout changes or suboptimal code placement /
  inlining.

I'm not too worried about the fork() micro-benchmarks for smaller folios
given how shaky the results are lately and by how much we improved fork()
performance recently.

I also ran case-anon-cow-rand and case-anon-cow-seq, part of
vm-scalability, to assess the scalability and the impact of the
bit-spinlock.  My measurements on a 2-socket 10-core Intel Xeon Silver
4210R CPU revealed no significant changes.

Similarly, running these benchmarks with 2 MiB THPs enabled on the
AmpereOne A192-32X with 192 cores, I got < 1% difference with < 1% stdev,
which is nice.

So far, I did not get my hands on a similarly large system with multiple
sockets.

I found no other fitting scalability benchmarks that seem to really hammer
on concurrent mapping/unmapping of large folio pages like
case-anon-cow-seq does.

5 Concerns
==========

5.1 Bit spinlock
----------------

I'm not quite happy about the bit-spinlock, but so far it does not seem to
affect scalability in my measurements.

If it ever becomes a problem we could either investigate improving the
locking, or simply stop the MM tracking once there are "too many
mappings" and assume that the folio is "mapped shared" until it is
freed.

This would be similar (but slightly different) to the "0,1,2,stopped"
counting idea Willy had at some point.  Adding that logic to "stop
tracking" adds more code to the hot path, so I avoided that for now.

5.2 folio_maybe_mapped_shared()
-------------------------------

I documented the change from folio_likely_mapped_shared() to
folio_maybe_mapped_shared() quite extensively.  If we run into surprises,
I have some ideas on how to resolve them.  For now, I think we should be
fine.

5.3 Added code to map/unmap hot path
------------------------------------

So far, it looks like the added code on the rmap hot path does not really
seem to matter much in the bigger picture.  I'd like to further reduce it
(and possibly improve fork() performance further), but I don't easily see
how right now.  Well, and I am out of puff 🙂

Having that said, alternatives I considered (e.g., per-MM per-folio
mapcount) would add a lot more overhead to these hot paths.

6 Future Work
=============

6.1 Large mapcount
------------------

It would be very handy if the large mapcount would count how often folio
pages are actually mapped into page tables: a PMD on x86-64 would count
512 times.  Calculating the average per-page mapcount will be easy, and
remapping (PMD->PTE) folios would get even faster.

That would also remove the need for the entire mapcount (except for
PMD-sized folios for memory statistics reasons ...), and allow for mapping
folios larger than PMDs (e.g., 4 MiB) easily.

We likely would also have to take the same number of folio references to
make our folio_mapcount() == folio_ref_count() work, and we'd want to be
able to avoid mapcount+refcount overflows: this could already become an
issue with pte-mapped PUD-sized folios (fsdax).

One approach we discussed in the THP cabal meeting is (1) extending the
mapcount for large folios to 64bit (at least on 64bit systems) and (2)
keeping the refcount at 32bit, but (3) having exactly one reference if
the mapcount != 0.

It should be doable, but there are some corner cases to consider on the
unmap path; it is something that I will be looking into next.

6.2 hugetlb
-----------

I'd love to make use of the same tracking also for hugetlb.

The real problem is PMD table sharing: getting a page mapped by MM X and
unmapped by MM Y will not work.  With mshare, that problem should not
exist (all mapping/unmapping will be routed through the mshare MM).

[1] https://lwn.net/Articles/974223/
[2] https://lore.kernel.org/linux-mm/[email protected]/T/
[3] https://lkml.kernel.org/r/[email protected]
[4] https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/pte-mapped-folio-benchmarks.c

This patch (of 20):

Let's factor it out into a simple helper function.  This helper will also
come in handy when working with code where we know that our folio is
large.

Maybe in the future we'll have the order readily available for small and
large folios; in that case, folio_large_order() would simply translate to
folio_order().

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Lance Yang <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Andy Lutomirks^H^Hski <[email protected]>
Cc: Borislav Betkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Liam Howlett <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Matthew Wilcow (Oracle) <[email protected]>
Cc: Michal Koutn <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: tejun heo <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Zefan Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
(cherry picked from commit 6220ea5)
Signed-off-by: David Hildenbrand <[email protected]>
davidhildenbrand added a commit to davidhildenbrand/linux that referenced this pull request Sep 26, 2025
…ge_order()

Patch series "mm: MM owner tracking for large folios (!hugetlb) +
CONFIG_NO_PAGE_MAPCOUNT", v3.

Let's add an "easy" way to decide -- without false positives, without
page-mapcounts and without page table/rmap scanning -- whether a large
folio is "certainly mapped exclusively" into a single MM, or whether it
"maybe mapped shared" into multiple MMs.

Use that information to implement Copy-on-Write reuse, to convert
folio_likely_mapped_shared() to folio_maybe_mapped_share(), and to
introduce a kernel config option that lets us not use+maintain per-page
mapcounts in large folios anymore.

The bigger picture was presented at LSF/MM [1].

This series is effectively a follow-up on my early work [2], which
implemented a more precise, but also more complicated, way to identify
whether a large folio is "mapped shared" into multiple MMs or "mapped
exclusively" into a single MM.

1 Patch Organization
====================

Patch #1 -> torvalds#6: make more room in order-1 folios, so we have two
                "unsigned long" available for our purposes

Patch torvalds#7 -> torvalds#11: preparations

Patch torvalds#12: MM owner tracking for large folios

Patch torvalds#13: COW reuse for PTE-mapped anon THP

Patch torvalds#14: folio_maybe_mapped_shared()

Patch torvalds#15 -> torvalds#20: introduce and implement CONFIG_NO_PAGE_MAPCOUNT

2 MM owner tracking
===================

We assign each MM a unique ID ("MM ID"), to be able to squeeze more
information in our folios.  On 32bit we use 15-bit IDs, on 64bit we use
31-bit IDs.

For each large folios, we now store two MM-ID+mapcount ("slot")
combinations:
* mm0_id + mm0_mapcount
* mm1_id + mm1_mapcount

On 32bit, we use a 16-bit per-MM mapcount, on 64bit an ordinary 32bit
mapcount.  This way, we require 2x "unsigned long" on 32bit and 64bit for
both slots.

Paired with the large mapcount, we can reliably identify whether one of
these MMs is the current owner (-> owns all mappings) or even holds all
folio references (-> owns all mappings, and all references are from
mappings).

As long as only two MMs map folio pages at a time, we can reliably and
precisely identify whether a large folio is "mapped shared" or "mapped
exclusively".

Any additional MM that starts mapping the folio while there are no free
slots becomes an "untracked MM".  If one such "untracked MM" is the last
one mapping a folio exclusively, we will not detect the folio as "mapped
exclusively" but instead as "maybe mapped shared".  (exception: only a
single mapping remains)

So that's where the approach gets imprecise.

For now, we use a bit-spinlock to sync the large mapcount + slots, and
make sure we do keep the machinery fast, to not degrade (un)map
performance drastically: for example, we make sure to only use a single
atomic (when grabbing the bit-spinlock), like we would already perform
when updating the large mapcount.

3 CONFIG_NO_PAGE_MAPCOUNT
=========================

patch torvalds#15 -> torvalds#20 spell out and document what exactly is affected when not
maintaining the per-page mapcounts in large folios anymore.

Most importantly, as we cannot maintain folio->_nr_pages_mapped anymore
when (un)mapping pages, we'll account a complete folio as mapped if a
single page is mapped.  In addition, we'll not detect partially mapped
anonymous folios as such in all cases yet.

Likely less relevant changes include that we might now under-estimate the
USS (Unique Set Size) of a process, but never over-estimate it.

The goal is to make CONFIG_NO_PAGE_MAPCOUNT the default at some point, to
then slowly make it the only option, as we learn about real-life impacts
and possible ways to mitigate them.

4 Performance
=============

Detailed performance numbers were included in v1 [3], and not that much
changed between v1 and v2.

I did plenty of measurements on different systems in the meantime, that
all revealed slightly different results.

The pte-mapped-folio micro-benchmarks [4] are fairly sensitive to code
layout changes on some systems.  Especially the fork() benchmark started
being more-shaky-than-before on recent kernels for some reason.

In summary, with my micro-benchmarks:

* Small folios are not impacted.

* CoW performance seems to be mostly unchanged across all folios sizes.

* CoW reuse performance of large folios now matches CoW reuse
  performance of small folios, because we now actually implement the CoW
  reuse optimization.  On an Intel Xeon Silver 4210R I measured a ~65%
  reduction in runtime, on an arm64 system I measured ~54% reduction.

* munmap() performance improves with CONFIG_NO_PAGE_MAPCOUNT.  I saw
  double-digit % reduction (up to ~30% on an Intel Xeon Silver 4210R and
  up to ~70% on an AmpereOne A192-32X) with larger folios.  The larger the
  folios, the larger the performance improvement.

* munmao() performance very slightly (couple percent) degrades without
  CONFIG_NO_PAGE_MAPCOUNT for smaller folios.  For larger folios, there
  seems to be no change at all.

* fork() performance improves with CONFIG_NO_PAGE_MAPCOUNT.  I saw
  double-digit % reduction (up to ~20% on an Intel Xeon Silver 4210R and
  up to ~10% on an AmpereOne A192-32X) with larger folios.  The larger the
  folios, the larger the performance improvement.

* While fork() performance without CONFIG_NO_PAGE_MAPCOUNT seems to be
  almost unchanged on some systems, I saw some degradation for smaller
  folios on the AmpereOne A192-32X.  I did not investigate the details
  yet, but I suspect code layout changes or suboptimal code placement /
  inlining.

I'm not to worried about the fork() micro-benchmarks for smaller folios
given how shaky the results are lately and by how much we improved fork()
performance recently.

I also ran case-anon-cow-rand and case-anon-cow-seq part of
vm-scalability, to assess the scalability and the impact of the
bit-spinlock.  My measurements on a two 2-socket 10-core Intel Xeon Silver
4210R CPU revealed no significant changes.

Similarly, running these benchmarks with 2 MiB THPs enabled on the
AmpereOne A192-32X with 192 cores, I got < 1% difference with < 1% stdev,
which is nice.

So far, I did not get my hands on a similarly large system with multiple
sockets.

I found no other fitting scalability benchmarks that seem to really hammer
on concurrent mapping/unmapping of large folio pages like
case-anon-cow-seq does.

5 Concerns
==========

5.1 Bit spinlock
----------------

I'm not quite happy about the bit-spinlock, but so far it does not seem to
affect scalability in my measurements.

If it ever becomes a problem we could either investigate improving the
locking, or simply stopping the MM tracking once there are "too many
mappings" and simply assume that the folio is "mapped shared" until it was
freed.

This would be similar (but slightly different) to the "0,1,2,stopped"
counting idea Willy had at some point.  Adding that logic to "stop
tracking" adds more code to the hot path, so I avoided that for now.

5.2 folio_maybe_mapped_shared()
-------------------------------

I documented the change from folio_likely_mapped_shared() to
folio_maybe_mapped_shared() quite extensively.  If we run into surprises,
I have some ideas on how to resolve them.  For now, I think we should be
fine.

5.3 Added code to map/unmap hot path
------------------------------------

So far, it looks like the added code on the rmap hot path does not really
seem to matter much in the bigger picture.  I'd like to further reduce it
(and possibly improve fork() performance further), but I don't easily see
how right now.  Well, and I am out of puff 🙂

Having that said, alternatives I considered (e.g., per-MM per-folio
mapcount) would add a lot more overhead to these hot paths.

6 Future Work
=============

6.1 Large mapcount
------------------

It would be very handy if the large mapcount would count how often folio
pages are actually mapped into page tables: a PMD on x86-64 would count
512 times.  Calculating the average per-page mapcount will be easy, and
remapping (PMD->PTE) folios would get even faster.

That would also remove the need for the entire mapcount (except for
PMD-sized folios for memory statistics reasons ...), and allow for mapping
folios larger than PMDs (e.g., 4 MiB) easily.

We likely would also have to take the same number of folio references to
make our folio_mapcount() == folio_ref_count() work, and we'd want to be
able to avoid mapcount+refcount overflows: this could already become an
issue with pte-mapped PUD-sized folios (fsdax).

One approach we discussed in the THP cabal meeting is (1) extending the
mapcount for large folios to 64bit (at least on 64bit systems) and (2)
keeping the refcount at 32bit, but (3) having exactly one reference if the
the mapcount != 0.

It should be doable, but there are some corner cases to consider on the
unmap path; it is something that I will be looking into next.

6.2 hugetlb
-----------

I'd love to make use of the same tracking also for hugetlb.

The real problem is PMD table sharing: getting a page mapped by MM X and
unmapped by MM Y will not work.  With mshare, that problem should not
exist (all mapping/unmapping will be routed through the mshare MM).

[1] https://lwn.net/Articles/974223/
[2] https://lore.kernel.org/linux-mm/[email protected]/T/
[3] https://lkml.kernel.org/r/[email protected]
[4] https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/pte-mapped-folio-benchmarks.c

This patch (of 20):

Let's factor it out into a simple helper function.  This helper will also
come in handy when working with code where we know that our folio is
large.

Maybe in the future we'll have the order readily available for small and
large folios; in that case, folio_large_order() would simply translate to
folio_order().

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Lance Yang <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Andy Lutomirks^H^Hski <[email protected]>
Cc: Borislav Betkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Liam Howlett <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Matthew Wilcow (Oracle) <[email protected]>
Cc: Michal Koutn <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: tejun heo <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Zefan Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
(cherry picked from commit 6220ea5)
Signed-off-by: David Hildenbrand <[email protected]>
davidhildenbrand added a commit to davidhildenbrand/linux that referenced this pull request Sep 26, 2025
…ge_order()

Patch series "mm: MM owner tracking for large folios (!hugetlb) +
CONFIG_NO_PAGE_MAPCOUNT", v3.

Let's add an "easy" way to decide -- without false positives, without
page-mapcounts and without page table/rmap scanning -- whether a large
folio is "certainly mapped exclusively" into a single MM, or whether it
"maybe mapped shared" into multiple MMs.

Use that information to implement Copy-on-Write reuse, to convert
folio_likely_mapped_shared() to folio_maybe_mapped_share(), and to
introduce a kernel config option that lets us not use+maintain per-page
mapcounts in large folios anymore.

The bigger picture was presented at LSF/MM [1].

This series is effectively a follow-up on my early work [2], which
implemented a more precise, but also more complicated, way to identify
whether a large folio is "mapped shared" into multiple MMs or "mapped
exclusively" into a single MM.

1 Patch Organization
====================

Patch #1 -> torvalds#6: make more room in order-1 folios, so we have two
                "unsigned long" available for our purposes

Patch torvalds#7 -> torvalds#11: preparations

Patch torvalds#12: MM owner tracking for large folios

Patch torvalds#13: COW reuse for PTE-mapped anon THP

Patch torvalds#14: folio_maybe_mapped_shared()

Patch torvalds#15 -> torvalds#20: introduce and implement CONFIG_NO_PAGE_MAPCOUNT

2 MM owner tracking
===================

We assign each MM a unique ID ("MM ID"), to be able to squeeze more
information in our folios.  On 32bit we use 15-bit IDs, on 64bit we use
31-bit IDs.

For each large folios, we now store two MM-ID+mapcount ("slot")
combinations:
* mm0_id + mm0_mapcount
* mm1_id + mm1_mapcount

On 32bit, we use a 16-bit per-MM mapcount, on 64bit an ordinary 32bit
mapcount.  This way, we require 2x "unsigned long" on 32bit and 64bit for
both slots.

Paired with the large mapcount, we can reliably identify whether one of
these MMs is the current owner (-> owns all mappings) or even holds all
folio references (-> owns all mappings, and all references are from
mappings).

As long as only two MMs map folio pages at a time, we can reliably and
precisely identify whether a large folio is "mapped shared" or "mapped
exclusively".

Any additional MM that starts mapping the folio while there are no free
slots becomes an "untracked MM".  If one such "untracked MM" is the last
one mapping a folio exclusively, we will not detect the folio as "mapped
exclusively" but instead as "maybe mapped shared".  (exception: only a
single mapping remains)

So that's where the approach gets imprecise.

For now, we use a bit-spinlock to sync the large mapcount + slots, and
make sure we do keep the machinery fast, to not degrade (un)map
performance drastically: for example, we make sure to only use a single
atomic (when grabbing the bit-spinlock), like we would already perform
when updating the large mapcount.

3 CONFIG_NO_PAGE_MAPCOUNT
=========================

patch torvalds#15 -> torvalds#20 spell out and document what exactly is affected when not
maintaining the per-page mapcounts in large folios anymore.

Most importantly, as we cannot maintain folio->_nr_pages_mapped anymore
when (un)mapping pages, we'll account a complete folio as mapped if a
single page is mapped.  In addition, we'll not detect partially mapped
anonymous folios as such in all cases yet.

Likely less relevant changes include that we might now under-estimate the
USS (Unique Set Size) of a process, but never over-estimate it.

The goal is to make CONFIG_NO_PAGE_MAPCOUNT the default at some point, to
then slowly make it the only option, as we learn about real-life impacts
and possible ways to mitigate them.

4 Performance
=============

Detailed performance numbers were included in v1 [3], and not that much
changed between v1 and v2.

I did plenty of measurements on different systems in the meantime, that
all revealed slightly different results.

The pte-mapped-folio micro-benchmarks [4] are fairly sensitive to code
layout changes on some systems.  Especially the fork() benchmark started
being more-shaky-than-before on recent kernels for some reason.

In summary, with my micro-benchmarks:

* Small folios are not impacted.

* CoW performance seems to be mostly unchanged across all folios sizes.

* CoW reuse performance of large folios now matches CoW reuse
  performance of small folios, because we now actually implement the CoW
  reuse optimization.  On an Intel Xeon Silver 4210R I measured a ~65%
  reduction in runtime, on an arm64 system I measured ~54% reduction.

* munmap() performance improves with CONFIG_NO_PAGE_MAPCOUNT.  I saw
  double-digit % reduction (up to ~30% on an Intel Xeon Silver 4210R and
  up to ~70% on an AmpereOne A192-32X) with larger folios.  The larger the
  folios, the larger the performance improvement.

* munmao() performance very slightly (couple percent) degrades without
  CONFIG_NO_PAGE_MAPCOUNT for smaller folios.  For larger folios, there
  seems to be no change at all.

* fork() performance improves with CONFIG_NO_PAGE_MAPCOUNT.  I saw
  double-digit % reduction (up to ~20% on an Intel Xeon Silver 4210R and
  up to ~10% on an AmpereOne A192-32X) with larger folios.  The larger the
  folios, the larger the performance improvement.

* While fork() performance without CONFIG_NO_PAGE_MAPCOUNT seems to be
  almost unchanged on some systems, I saw some degradation for smaller
  folios on the AmpereOne A192-32X.  I did not investigate the details
  yet, but I suspect code layout changes or suboptimal code placement /
  inlining.

I'm not to worried about the fork() micro-benchmarks for smaller folios
given how shaky the results are lately and by how much we improved fork()
performance recently.

I also ran case-anon-cow-rand and case-anon-cow-seq part of
vm-scalability, to assess the scalability and the impact of the
bit-spinlock.  My measurements on a two 2-socket 10-core Intel Xeon Silver
4210R CPU revealed no significant changes.

Similarly, running these benchmarks with 2 MiB THPs enabled on the
AmpereOne A192-32X with 192 cores, I got < 1% difference with < 1% stdev,
which is nice.

So far, I did not get my hands on a similarly large system with multiple
sockets.

I found no other fitting scalability benchmarks that seem to really hammer
on concurrent mapping/unmapping of large folio pages like
case-anon-cow-seq does.

5 Concerns
==========

5.1 Bit spinlock
----------------

I'm not quite happy about the bit-spinlock, but so far it does not seem to
affect scalability in my measurements.

If it ever becomes a problem we could either investigate improving the
locking, or simply stopping the MM tracking once there are "too many
mappings" and simply assume that the folio is "mapped shared" until it was
freed.

This would be similar (but slightly different) to the "0,1,2,stopped"
counting idea Willy had at some point.  Adding that logic to "stop
tracking" adds more code to the hot path, so I avoided that for now.

5.2 folio_maybe_mapped_shared()
-------------------------------

I documented the change from folio_likely_mapped_shared() to
folio_maybe_mapped_shared() quite extensively.  If we run into surprises,
I have some ideas on how to resolve them.  For now, I think we should be
fine.

5.3 Added code to map/unmap hot path
------------------------------------

So far, it looks like the added code on the rmap hot path does not really
seem to matter much in the bigger picture.  I'd like to further reduce it
(and possibly improve fork() performance further), but I don't easily see
how right now.  Well, and I am out of puff 🙂

Having that said, alternatives I considered (e.g., per-MM per-folio
mapcount) would add a lot more overhead to these hot paths.

6 Future Work
=============

6.1 Large mapcount
------------------

It would be very handy if the large mapcount would count how often folio
pages are actually mapped into page tables: a PMD on x86-64 would count
512 times.  Calculating the average per-page mapcount will be easy, and
remapping (PMD->PTE) folios would get even faster.

That would also remove the need for the entire mapcount (except for
PMD-sized folios for memory statistics reasons ...), and allow for mapping
folios larger than PMDs (e.g., 4 MiB) easily.

We likely would also have to take the same number of folio references to
make our folio_mapcount() == folio_ref_count() work, and we'd want to be
able to avoid mapcount+refcount overflows: this could already become an
issue with pte-mapped PUD-sized folios (fsdax).

One approach we discussed in the THP cabal meeting is (1) extending the
mapcount for large folios to 64bit (at least on 64bit systems) and (2)
keeping the refcount at 32bit, but (3) having exactly one reference if the
the mapcount != 0.

It should be doable, but there are some corner cases to consider on the
unmap path; it is something that I will be looking into next.

6.2 hugetlb
-----------

I'd love to make use of the same tracking also for hugetlb.

The real problem is PMD table sharing: getting a page mapped by MM X and
unmapped by MM Y will not work.  With mshare, that problem should not
exist (all mapping/unmapping will be routed through the mshare MM).

[1] https://lwn.net/Articles/974223/
[2] https://lore.kernel.org/linux-mm/[email protected]/T/
[3] https://lkml.kernel.org/r/[email protected]
[4] https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/pte-mapped-folio-benchmarks.c

This patch (of 20):

Let's factor it out into a simple helper function.  This helper will also
come in handy when working with code where we know that our folio is
large.

Maybe in the future we'll have the order readily available for small and
large folios; in that case, folio_large_order() would simply translate to
folio_order().

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Lance Yang <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Andy Lutomirks^H^Hski <[email protected]>
Cc: Borislav Betkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Liam Howlett <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Matthew Wilcow (Oracle) <[email protected]>
Cc: Michal Koutn <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: tejun heo <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Zefan Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
(cherry picked from commit 6220ea5)
Signed-off-by: David Hildenbrand <[email protected]>
davidhildenbrand added a commit to davidhildenbrand/linux that referenced this pull request Sep 26, 2025
…ge_order()

Patch series "mm: MM owner tracking for large folios (!hugetlb) +
CONFIG_NO_PAGE_MAPCOUNT", v3.

Let's add an "easy" way to decide -- without false positives, without
page-mapcounts and without page table/rmap scanning -- whether a large
folio is "certainly mapped exclusively" into a single MM, or whether it
"maybe mapped shared" into multiple MMs.

Use that information to implement Copy-on-Write reuse, to convert
folio_likely_mapped_shared() to folio_maybe_mapped_share(), and to
introduce a kernel config option that lets us not use+maintain per-page
mapcounts in large folios anymore.

The bigger picture was presented at LSF/MM [1].

This series is effectively a follow-up on my early work [2], which
implemented a more precise, but also more complicated, way to identify
whether a large folio is "mapped shared" into multiple MMs or "mapped
exclusively" into a single MM.

1 Patch Organization
====================

Patch #1 -> torvalds#6: make more room in order-1 folios, so we have two
                "unsigned long" available for our purposes

Patch torvalds#7 -> torvalds#11: preparations

Patch torvalds#12: MM owner tracking for large folios

Patch torvalds#13: COW reuse for PTE-mapped anon THP

Patch torvalds#14: folio_maybe_mapped_shared()

Patch torvalds#15 -> torvalds#20: introduce and implement CONFIG_NO_PAGE_MAPCOUNT

2 MM owner tracking
===================

We assign each MM a unique ID ("MM ID"), to be able to squeeze more
information in our folios.  On 32bit we use 15-bit IDs, on 64bit we use
31-bit IDs.

For each large folios, we now store two MM-ID+mapcount ("slot")
combinations:
* mm0_id + mm0_mapcount
* mm1_id + mm1_mapcount

On 32bit, we use a 16-bit per-MM mapcount, on 64bit an ordinary 32bit
mapcount.  This way, we require 2x "unsigned long" on 32bit and 64bit for
both slots.

Paired with the large mapcount, we can reliably identify whether one of
these MMs is the current owner (-> owns all mappings) or even holds all
folio references (-> owns all mappings, and all references are from
mappings).

As long as only two MMs map folio pages at a time, we can reliably and
precisely identify whether a large folio is "mapped shared" or "mapped
exclusively".

Any additional MM that starts mapping the folio while there are no free
slots becomes an "untracked MM".  If one such "untracked MM" is the last
one mapping a folio exclusively, we will not detect the folio as "mapped
exclusively" but instead as "maybe mapped shared".  (exception: only a
single mapping remains)

So that's where the approach gets imprecise.

For now, we use a bit-spinlock to sync the large mapcount + slots, and
make sure we do keep the machinery fast, to not degrade (un)map
performance drastically: for example, we make sure to only use a single
atomic (when grabbing the bit-spinlock), like we would already perform
when updating the large mapcount.

3 CONFIG_NO_PAGE_MAPCOUNT
=========================

patch torvalds#15 -> torvalds#20 spell out and document what exactly is affected when not
maintaining the per-page mapcounts in large folios anymore.

Most importantly, as we cannot maintain folio->_nr_pages_mapped anymore
when (un)mapping pages, we'll account a complete folio as mapped if a
single page is mapped.  In addition, we'll not detect partially mapped
anonymous folios as such in all cases yet.

Likely less relevant changes include that we might now under-estimate the
USS (Unique Set Size) of a process, but never over-estimate it.

The goal is to make CONFIG_NO_PAGE_MAPCOUNT the default at some point, to
then slowly make it the only option, as we learn about real-life impacts
and possible ways to mitigate them.

4 Performance
=============

Detailed performance numbers were included in v1 [3], and not that much
changed between v1 and v2.

I did plenty of measurements on different systems in the meantime, that
all revealed slightly different results.

The pte-mapped-folio micro-benchmarks [4] are fairly sensitive to code
layout changes on some systems.  Especially the fork() benchmark started
being more-shaky-than-before on recent kernels for some reason.

In summary, with my micro-benchmarks:

* Small folios are not impacted.

* CoW performance seems to be mostly unchanged across all folios sizes.

* CoW reuse performance of large folios now matches CoW reuse
  performance of small folios, because we now actually implement the CoW
  reuse optimization.  On an Intel Xeon Silver 4210R I measured a ~65%
  reduction in runtime, on an arm64 system I measured ~54% reduction.

* munmap() performance improves with CONFIG_NO_PAGE_MAPCOUNT.  I saw
  double-digit % reduction (up to ~30% on an Intel Xeon Silver 4210R and
  up to ~70% on an AmpereOne A192-32X) with larger folios.  The larger the
  folios, the larger the performance improvement.

* munmao() performance very slightly (couple percent) degrades without
  CONFIG_NO_PAGE_MAPCOUNT for smaller folios.  For larger folios, there
  seems to be no change at all.

* fork() performance improves with CONFIG_NO_PAGE_MAPCOUNT.  I saw
  double-digit % reduction (up to ~20% on an Intel Xeon Silver 4210R and
  up to ~10% on an AmpereOne A192-32X) with larger folios.  The larger the
  folios, the larger the performance improvement.

* While fork() performance without CONFIG_NO_PAGE_MAPCOUNT seems to be
  almost unchanged on some systems, I saw some degradation for smaller
  folios on the AmpereOne A192-32X.  I did not investigate the details
  yet, but I suspect code layout changes or suboptimal code placement /
  inlining.

I'm not to worried about the fork() micro-benchmarks for smaller folios
given how shaky the results are lately and by how much we improved fork()
performance recently.

I also ran case-anon-cow-rand and case-anon-cow-seq part of
vm-scalability, to assess the scalability and the impact of the
bit-spinlock.  My measurements on a two 2-socket 10-core Intel Xeon Silver
4210R CPU revealed no significant changes.

Similarly, running these benchmarks with 2 MiB THPs enabled on the
AmpereOne A192-32X with 192 cores, I got < 1% difference with < 1% stdev,
which is nice.

So far, I did not get my hands on a similarly large system with multiple
sockets.

I found no other fitting scalability benchmarks that seem to really hammer
on concurrent mapping/unmapping of large folio pages like
case-anon-cow-seq does.

5 Concerns
==========

5.1 Bit spinlock
----------------

I'm not quite happy about the bit-spinlock, but so far it does not seem to
affect scalability in my measurements.

If it ever becomes a problem we could either investigate improving the
locking, or simply stopping the MM tracking once there are "too many
mappings" and simply assume that the folio is "mapped shared" until it was
freed.

This would be similar (but slightly different) to the "0,1,2,stopped"
counting idea Willy had at some point.  Adding that logic to "stop
tracking" adds more code to the hot path, so I avoided that for now.

5.2 folio_maybe_mapped_shared()
-------------------------------

I documented the change from folio_likely_mapped_shared() to
folio_maybe_mapped_shared() quite extensively.  If we run into surprises,
I have some ideas on how to resolve them.  For now, I think we should be
fine.

5.3 Added code to map/unmap hot path
------------------------------------

So far, it looks like the added code on the rmap hot path does not really
seem to matter much in the bigger picture.  I'd like to further reduce it
(and possibly improve fork() performance further), but I don't easily see
how right now.  Well, and I am out of puff 🙂

Having that said, alternatives I considered (e.g., per-MM per-folio
mapcount) would add a lot more overhead to these hot paths.

6 Future Work
=============

6.1 Large mapcount
------------------

It would be very handy if the large mapcount counted how often folio
pages are actually mapped into page tables: a PMD mapping on x86-64 would
count 512 times.  Calculating the average per-page mapcount would be easy,
and remapping (PMD->PTE) folios would get even faster.

That would also remove the need for the entire mapcount (except for
PMD-sized folios for memory statistics reasons ...), and allow for mapping
folios larger than PMDs (e.g., 4 MiB) easily.

We likely would also have to take the same number of folio references to
make our folio_mapcount() == folio_ref_count() work, and we'd want to be
able to avoid mapcount+refcount overflows: this could already become an
issue with pte-mapped PUD-sized folios (fsdax).
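
Back-of-the-envelope, assuming 4 KiB base pages: a PUD-sized 1 GiB folio
spans 2^18 pages, so each full PTE mapping adds 2^18 references; a 32-bit
refcount would then saturate after only about 2^31 / 2^18 = 8192 such
mappings.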

One approach we discussed in the THP cabal meeting is (1) extending the
mapcount for large folios to 64bit (at least on 64bit systems) and (2)
keeping the refcount at 32bit, but (3) having exactly one reference if the
mapcount != 0.

It should be doable, but there are some corner cases to consider on the
unmap path; it is something that I will be looking into next.

6.2 hugetlb
-----------

I'd love to make use of the same tracking also for hugetlb.

The real problem is PMD table sharing: getting a page mapped by MM X and
unmapped by MM Y will not work.  With mshare, that problem should not
exist (all mapping/unmapping will be routed through the mshare MM).

[1] https://lwn.net/Articles/974223/
[2] https://lore.kernel.org/linux-mm/[email protected]/T/
[3] https://lkml.kernel.org/r/[email protected]
[4] https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/pte-mapped-folio-benchmarks.c

This patch (of 20):

Let's factor it out into a simple helper function.  This helper will also
come in handy when working with code where we know that our folio is
large.

Maybe in the future we'll have the order readily available for small and
large folios; in that case, folio_large_order() would simply translate to
folio_order().
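
As a sketch, the helper boils down to reading the order stored in the
first tail page (the mask below matches the current encoding; treat the
exact bits as illustrative):

  static inline unsigned int folio_large_order(const struct folio *folio)
  {
          return folio->_flags_1 & 0xff;
  }

  static inline unsigned int folio_order(const struct folio *folio)
  {
          if (!folio_test_large(folio))
                  return 0;
          return folio_large_order(folio);
  }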

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Lance Yang <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Liam Howlett <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michal Koutný <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Zefan Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
(cherry picked from commit 6220ea5)
Signed-off-by: David Hildenbrand <[email protected]>
guidosarducci added a commit to guidosarducci/linux that referenced this pull request Sep 27, 2025
 - treat tailcall count as 32-bit for access and update
 - change out_offset scope from file to function
 - minor format/structure changes for consistency

Testing: (skipping fentry, fexit, freplace)
========

root@qemu-armhf:/usr/libexec/kselftests-bpf# modprobe test_bpf test_suite=test_tail_calls
test_bpf: #0 Tail call leaf jited:1 967 PASS
test_bpf: #1 Tail call 2 jited:1 1427 PASS
test_bpf: #2 Tail call 3 jited:1 2373 PASS
test_bpf: #3 Tail call 4 jited:1 2304 PASS
test_bpf: #4 Tail call load/store leaf jited:1 1684 PASS
test_bpf: #5 Tail call load/store jited:1 2249 PASS
test_bpf: #6 Tail call error path, max count reached jited:1 22538 PASS
test_bpf: #7 Tail call count preserved across function calls jited:1 1055668 PASS
test_bpf: #8 Tail call error path, NULL target jited:1 513 PASS
test_bpf: #9 Tail call error path, index out of range jited:1 392 PASS
test_bpf: test_tail_calls: Summary: 10 PASSED, 0 FAILED, [10/10 JIT'ed]

root@qemu-armhf:/usr/libexec/kselftests-bpf# ./test_progs -n 397/1-12,17-18,23-24,27-31
397/1   tailcalls/tailcall_1:OK
397/2   tailcalls/tailcall_2:OK
397/3   tailcalls/tailcall_3:OK
397/4   tailcalls/tailcall_4:OK
397/5   tailcalls/tailcall_5:OK
397/6   tailcalls/tailcall_6:OK
397/7   tailcalls/tailcall_bpf2bpf_1:OK
397/8   tailcalls/tailcall_bpf2bpf_2:OK
397/9   tailcalls/tailcall_bpf2bpf_3:OK
397/10  tailcalls/tailcall_bpf2bpf_4:OK
397/11  tailcalls/tailcall_bpf2bpf_5:OK
397/12  tailcalls/tailcall_bpf2bpf_6:OK
397/17  tailcalls/tailcall_poke:OK
397/18  tailcalls/tailcall_bpf2bpf_hierarchy_1:OK
397/23  tailcalls/tailcall_bpf2bpf_hierarchy_2:OK
397/24  tailcalls/tailcall_bpf2bpf_hierarchy_3:OK
397/27  tailcalls/tailcall_failure:OK
397/28  tailcalls/reject_tail_call_spin_lock:OK
397/29  tailcalls/reject_tail_call_rcu_lock:OK
397/30  tailcalls/reject_tail_call_preempt_lock:OK
397/31  tailcalls/reject_tail_call_ref:OK
397     tailcalls:OK
Summary: 1/21 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Tony Ambardar <[email protected]>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Sep 29, 2025
Before disabling SR-IOV via config space accesses to the parent PF,
sriov_disable() first removes the PCI devices representing the VFs.

Since commit 9d16947 ("PCI: Add global pci_lock_rescan_remove()"),
such removal operations are serialized against concurrent remove and
rescan using pci_rescan_remove_lock.  However, no such locking was ever
added to sriov_disable().  In particular, when commit 18f9e9d
("PCI/IOV: Factor out sriov_add_vfs()") factored out the PCI device
removal into sriov_del_vfs(), there was still no locking around the
pci_iov_remove_virtfn() calls.

On s390 the lack of serialization in sriov_disable() may cause double
remove and list corruption with the below (amended) trace being observed:

  PSW:  0704c00180000000 0000000c914e4b38 (klist_put+56)
  GPRS: 000003800313fb48 0000000000000000 0000000100000001 0000000000000001
	00000000f9b520a8 0000000000000000 0000000000002fbd 00000000f4cc9480
	0000000000000001 0000000000000000 0000000000000000 0000000180692828
	00000000818e8000 000003800313fe2c 000003800313fb20 000003800313fad8
  #0 [3800313fb20] device_del at c9158ad5c
  #1 [3800313fb88] pci_remove_bus_device at c915105ba
  #2 [3800313fbd0] pci_iov_remove_virtfn at c9152f198
  #3 [3800313fc28] zpci_iov_remove_virtfn at c90fb67c0
  #4 [3800313fc60] zpci_bus_remove_device at c90fb6104
  #5 [3800313fca0] __zpci_event_availability at c90fb3dca
  #6 [3800313fd08] chsc_process_sei_nt0 at c918fe4a2
  #7 [3800313fd60] crw_collect_info at c91905822
  #8 [3800313fe10] kthread at c90feb390
  #9 [3800313fe68] __ret_from_fork at c90f6aa64
  #10 [3800313fe98] ret_from_fork at c9194f3f2.

This is because, in addition to sriov_disable() removing the VFs, the
platform also generates hot-unplug events for the VFs.  These are the
reverse of the hotplug events generated by sriov_enable() and handled via
pdev->no_vf_scan.  While the event processing takes
pci_rescan_remove_lock and checks whether the struct pci_dev still exists,
the lack of synchronization makes this check racy.

Other races may also be possible, of course, but given how long this lack
of locking has persisted, observable races seem very rare.  Even on s390
the list corruption was only observed with certain devices, since the
platform events are only triggered by config accesses after the removal;
as long as the removal finished synchronously, they would not race.
Either way the locking is missing, so fix this by adding it to the
sriov_del_vfs() helper.

Just like PCI rescan-remove, locking is also missing in sriov_add_vfs()
including for the error case where pci_stop_and_remove_bus_device() is
called without the PCI rescan-remove lock being held. Even in the non-error
case, adding new PCI devices and buses should be serialized via the PCI
rescan-remove lock. Add the necessary locking.
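
A minimal sketch of the fixed helper, assuming the sriov_del_vfs() shape
from the factoring commit above (loop details are illustrative):

  static void sriov_del_vfs(struct pci_dev *dev)
  {
          struct pci_sriov *iov = dev->sriov;
          int i;

          /* serialize VF removal against concurrent rescan/remove */
          pci_lock_rescan_remove();
          for (i = 0; i < iov->num_VFs; i++)
                  pci_iov_remove_virtfn(dev, i);
          pci_unlock_rescan_remove();
  }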

Fixes: 18f9e9d ("PCI/IOV: Factor out sriov_add_vfs()")
Signed-off-by: Niklas Schnelle <[email protected]>
Signed-off-by: Bjorn Helgaas <[email protected]>
Reviewed-by: Benjamin Block <[email protected]>
Reviewed-by: Farhan Ali <[email protected]>
Reviewed-by: Julian Ruess <[email protected]>
Cc: [email protected]
Link: https://patch.msgid.link/[email protected]
torvalds pushed a commit that referenced this pull request Sep 29, 2025
The generic/736 xfstest fails for the HFS case:

BEGIN TEST default (1 test): hfs Mon May 5 03:18:32 UTC 2025
DEVICE: /dev/vdb
HFS_MKFS_OPTIONS:
MOUNT_OPTIONS: MOUNT_OPTIONS
FSTYP -- hfs
PLATFORM -- Linux/x86_64 kvm-xfstests 6.15.0-rc4-xfstests-g00b827f0cffa #1 SMP PREEMPT_DYNAMIC Fri May 25
MKFS_OPTIONS -- /dev/vdc
MOUNT_OPTIONS -- /dev/vdc /vdc

generic/736 [03:18:33][ 3.510255] run fstests generic/736 at 2025-05-05 03:18:33
_check_generic_filesystem: filesystem on /dev/vdb is inconsistent
(see /results/hfs/results-default/generic/736.full for details)
Ran: generic/736
Failures: generic/736
Failed 1 of 1 tests

The HFS volume becomes corrupted after the test run:

sudo fsck.hfs -d /dev/loop50
** /dev/loop50
Using cacheBlockSize=32K cacheTotalBlock=1024 cacheSize=32768K.
Executing fsck_hfs (version 540.1-Linux).
** Checking HFS volume.
The volume name is untitled
** Checking extents overflow file.
** Checking catalog file.
** Checking catalog hierarchy.
** Checking volume bitmap.
** Checking volume information.
invalid MDB drNxtCNID
Master Directory Block needs minor repair
(1, 0)
Verify Status: VIStat = 0x8000, ABTStat = 0x0000 EBTStat = 0x0000
CBTStat = 0x0000 CatStat = 0x00000000
** Repairing volume.
** Rechecking volume.
** Checking HFS volume.
The volume name is untitled
** Checking extents overflow file.
** Checking catalog file.
** Checking catalog hierarchy.
** Checking volume bitmap.
** Checking volume information.
** The volume untitled was repaired successfully.

The main reason for the issue is the absence of logic that corrects
mdb->drNxtCNID/HFS_SB(sb)->next_id (the next unused CNID) after deleting
a record in the Catalog File.  This patch introduces a
hfs_correct_next_unused_CNID() method that implements the necessary
logic.  For the Catalog File's record delete operation, the function
checks whether (deleted_CNID + 1) == next_unused_CNID and, if so, finds
and sets the new value of next_unused_CNID.
sudo ./check generic/736
FSTYP -- hfs
PLATFORM -- Linux/x86_64 hfsplus-testing-0001 6.15.0+ #6 SMP PREEMPT_DYNAMIC Tue Jun 10 15:02:48 PDT 2025
MKFS_OPTIONS -- /dev/loop51
MOUNT_OPTIONS -- /dev/loop51 /mnt/scratch

generic/736 33s
Ran: generic/736
Passed all 1 tests

sudo fsck.hfs -d /dev/loop50
** /dev/loop50
Using cacheBlockSize=32K cacheTotalBlock=1024 cacheSize=32768K.
Executing fsck_hfs (version 540.1-Linux).
** Checking HFS volume.
The volume name is untitled
** Checking extents overflow file.
** Checking catalog file.
** Checking catalog hierarchy.
** Checking volume bitmap.
** Checking volume information.
** The volume untitled appears to be OK

Signed-off-by: Viacheslav Dubeyko <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Viacheslav Dubeyko <[email protected]>
lougovsk pushed a commit to lougovsk/linux that referenced this pull request Sep 30, 2025
Another day, another syzkaller bug. KVM erroneously allows userspace to
pend vCPU events for a vCPU that hasn't been initialized yet, leading to
KVM interpreting a bunch of uninitialized garbage for routing /
injecting the exception.

In one case the injection code and the hyp disagree on whether the vCPU
has a 32bit EL1 and put the vCPU into an illegal mode for AArch64,
tripping the BUG() in exception_target_el() during the next injection:

  kernel BUG at arch/arm64/kvm/inject_fault.c:40!
  Internal error: Oops - BUG: 00000000f2000800 [#1]  SMP
  CPU: 3 UID: 0 PID: 318 Comm: repro Not tainted 6.17.0-rc4-00104-g10fd0285305d #6 PREEMPT
  Hardware name: linux,dummy-virt (DT)
  pstate: 21402009 (nzCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
  pc : exception_target_el+0x88/0x8c
  lr : pend_serror_exception+0x18/0x13c
  sp : ffff800082f03a10
  x29: ffff800082f03a10 x28: ffff0000cb132280 x27: 0000000000000000
  x26: 0000000000000000 x25: ffff0000c2a99c20 x24: 0000000000000000
  x23: 0000000000008000 x22: 0000000000000002 x21: 0000000000000004
  x20: 0000000000008000 x19: ffff0000c2a99c20 x18: 0000000000000000
  x17: 0000000000000000 x16: 0000000000000000 x15: 00000000200000c0
  x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
  x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000000
  x8 : ffff800082f03af8 x7 : 0000000000000000 x6 : 0000000000000000
  x5 : ffff800080f621f0 x4 : 0000000000000000 x3 : 0000000000000000
  x2 : 000000000040009b x1 : 0000000000000003 x0 : ffff0000c2a99c20
  Call trace:
   exception_target_el+0x88/0x8c (P)
   kvm_inject_serror_esr+0x40/0x3b4
   __kvm_arm_vcpu_set_events+0xf0/0x100
   kvm_arch_vcpu_ioctl+0x180/0x9d4
   kvm_vcpu_ioctl+0x60c/0x9f4
   __arm64_sys_ioctl+0xac/0x104
   invoke_syscall+0x48/0x110
   el0_svc_common.constprop.0+0x40/0xe0
   do_el0_svc+0x1c/0x28
   el0_svc+0x34/0xf0
   el0t_64_sync_handler+0xa0/0xe4
   el0t_64_sync+0x198/0x19c
  Code: f946bc01 b4fffe61 9101e020 17fffff2 (d4210000)

Reject the ioctls outright as no sane VMM would call these before
KVM_ARM_VCPU_INIT anyway.  Even if it did, the exception would've been
thrown away by the eventual reset of the vCPU's state.
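
A minimal sketch of the rejection, assuming the existing arm64 helper
kvm_vcpu_initialized(); the exact placement in the vCPU ioctl handler is
illustrative:

  if (!kvm_vcpu_initialized(vcpu))
          return -ENOEXEC;        /* KVM_GET/SET_VCPU_EVENTS before VCPU_INIT */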

Cc: [email protected] # 6.17
Fixes: b7b27fa ("arm/arm64: KVM: Add KVM_GET/SET_VCPU_EVENTS")
Signed-off-by: Oliver Upton <[email protected]>
Message-Id: <[email protected]>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Oct 1, 2025
The test starts a workload and then opens events.  If the events fail
to open, for example because of perf_event_paranoid, the go pipe of the
workload is leaked and the file descriptor leak check fails when the
test exits.  To avoid this, cancel the workload when opening the events
fails.
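
A minimal sketch of the error path, assuming perf's existing
evlist__cancel_workload() helper (the surrounding names are
illustrative):

  err = evlist__open(evlist);
  if (err < 0) {
          pr_debug("perf_evlist__open: %s\n", strerror(-err));
          /* reap the forked workload so its pipe fds are not leaked */
          evlist__cancel_workload(evlist);
          goto out_delete_evlist;
  }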

Before:
```
$ perf test -vv 7
  7: PERF_RECORD_* events & perf_sample fields:
 --- start ---
test child forked, pid 1189568
Using CPUID GenuineIntel-6-B7-1
 ------------------------------------------------------------
perf_event_attr:
  type                    	   0 (PERF_TYPE_HARDWARE)
  config                  	   0xa00000000 (cpu_atom/PERF_COUNT_HW_CPU_CYCLES/)
  disabled                	   1
 ------------------------------------------------------------
sys_perf_event_open: pid 0  cpu -1  group_fd -1  flags 0x8
sys_perf_event_open failed, error -13
 ------------------------------------------------------------
perf_event_attr:
  type                             0 (PERF_TYPE_HARDWARE)
  config                           0xa00000000 (cpu_atom/PERF_COUNT_HW_CPU_CYCLES/)
  disabled                         1
  exclude_kernel                   1
 ------------------------------------------------------------
sys_perf_event_open: pid 0  cpu -1  group_fd -1  flags 0x8 = 3
 ------------------------------------------------------------
perf_event_attr:
  type                             0 (PERF_TYPE_HARDWARE)
  config                           0x400000000 (cpu_core/PERF_COUNT_HW_CPU_CYCLES/)
  disabled                         1
 ------------------------------------------------------------
sys_perf_event_open: pid 0  cpu -1  group_fd -1  flags 0x8
sys_perf_event_open failed, error -13
 ------------------------------------------------------------
perf_event_attr:
  type                             0 (PERF_TYPE_HARDWARE)
  config                           0x400000000 (cpu_core/PERF_COUNT_HW_CPU_CYCLES/)
  disabled                         1
  exclude_kernel                   1
 ------------------------------------------------------------
sys_perf_event_open: pid 0  cpu -1  group_fd -1  flags 0x8 = 3
Attempt to add: software/cpu-clock/
..after resolving event: software/config=0/
cpu-clock -> software/cpu-clock/
 ------------------------------------------------------------
perf_event_attr:
  type                             1 (PERF_TYPE_SOFTWARE)
  size                             136
  config                           0x9 (PERF_COUNT_SW_DUMMY)
  sample_type                      IP|TID|TIME|CPU
  read_format                      ID|LOST
  disabled                         1
  inherit                          1
  mmap                             1
  comm                             1
  enable_on_exec                   1
  task                             1
  sample_id_all                    1
  mmap2                            1
  comm_exec                        1
  ksymbol                          1
  bpf_event                        1
  { wakeup_events, wakeup_watermark } 1
 ------------------------------------------------------------
sys_perf_event_open: pid 1189569  cpu 0  group_fd -1  flags 0x8
sys_perf_event_open failed, error -13
perf_evlist__open: Permission denied
 ---- end(-2) ----
Leak of file descriptor 6 that opened: 'pipe:[14200347]'
 ---- unexpected signal (6) ----
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
    #0 0x565358f6666e in child_test_sig_handler builtin-test.c:311
    #1 0x7f29ce849df0 in __restore_rt libc_sigaction.c:0
    #2 0x7f29ce89e95c in __pthread_kill_implementation pthread_kill.c:44
    #3 0x7f29ce849cc2 in raise raise.c:27
    #4 0x7f29ce8324ac in abort abort.c:81
    #5 0x565358f662d4 in check_leaks builtin-test.c:226
    #6 0x565358f6682e in run_test_child builtin-test.c:344
    #7 0x565358ef7121 in start_command run-command.c:128
    #8 0x565358f67273 in start_test builtin-test.c:545
    #9 0x565358f6771d in __cmd_test builtin-test.c:647
    #10 0x565358f682bd in cmd_test builtin-test.c:849
    #11 0x565358ee5ded in run_builtin perf.c:349
    #12 0x565358ee6085 in handle_internal_command perf.c:401
    #13 0x565358ee61de in run_argv perf.c:448
    #14 0x565358ee6527 in main perf.c:555
    #15 0x7f29ce833ca8 in __libc_start_call_main libc_start_call_main.h:74
    #16 0x7f29ce833d65 in __libc_start_main@@GLIBC_2.34 libc-start.c:128
    #17 0x565358e391c1 in _start perf[851c1]
  7: PERF_RECORD_* events & perf_sample fields                       : FAILED!
```

After:
```
$ perf test 7
  7: PERF_RECORD_* events & perf_sample fields                       : Skip (permissions)
```

Fixes: 16d00fe ("perf tests: Move test__PERF_RECORD into separate object")
Signed-off-by: Ian Rogers <[email protected]>
Tested-by: Arnaldo Carvalho de Melo <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Athira Rajeev <[email protected]>
Cc: Chun-Tse Shao <[email protected]>
Cc: Howard Chu <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Clark <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
ericwoud added a commit to ericwoud/linux that referenced this pull request Oct 7, 2025
Also see the debugging info sent by Florian on the mailing list:
"[RFC PATCH v3 nf-next] selftests: netfilter: Add bridge_fastpath.sh"

net/bridge/br_private.h:1627 suspicious rcu_dereference_protected() usage!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
7 locks held by socat/410:
 #0: ffff88800d7a9c90 (sk_lock-AF_INET){+.+.}-{0:0}, at: inet_stream_connect+0x43/0xa0
 #1: ffffffff9a779900 (rcu_read_lock){....}-{1:3}, at: __ip_queue_xmit+0x62/0x1830
 #2: ffffffff9a779900 (rcu_read_lock){....}-{1:3}, at: ip_output+0x57/0x3c0
 #3: ffffffff9a779900 (rcu_read_lock){....}-{1:3}, at: ip_finish_output2+0x263/0x17d0
 #4: ffffffff9a779900 (rcu_read_lock){....}-{1:3}, at: process_backlog+0x38a/0x14b0
 #5: ffffffff9a779900 (rcu_read_lock){....}-{1:3}, at: netif_receive_skb_internal+0x83/0x330
 #6: ffffffff9a779900 (rcu_read_lock){....}-{1:3}, at: nf_hook.constprop.0+0x8a/0x440

stack backtrace:
CPU: 0 UID: 0 PID: 410 Comm: socat Not tainted 6.17.0-rc7-virtme #1 PREEMPT(full)
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Call Trace:
 <IRQ>
 dump_stack_lvl+0x6f/0xb0
 lockdep_rcu_suspicious.cold+0x4f/0xb1
 br_vlan_fill_forward_path_pvid+0x32c/0x410 [bridge]
 br_fill_forward_path+0x7a/0x4d0 [bridge]
 ...
Leo-Yan pushed a commit to Leo-Yan/linux that referenced this pull request Oct 8, 2025
These iterations require the read lock; otherwise RCU
lockdep will splat:

=============================
WARNING: suspicious RCU usage
6.17.0-rc3-00014-g31419c045d64 #6 Tainted: G           O
-----------------------------
drivers/base/power/main.c:1333 RCU-list traversed in non-reader section!!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
5 locks held by rtcwake/547:
 #0: 00000000643ab418 (sb_writers#6){.+.+}-{0:0}, at: file_start_write+0x2b/0x3a
 #1: 0000000067a0ca88 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x181/0x24b
 #2: 00000000631eac40 (kn->active#3){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x191/0x24b
 #3: 00000000609a1308 (system_transition_mutex){+.+.}-{4:4}, at: pm_suspend+0xaf/0x30b
 #4: 0000000060c0fdb0 (device_links_srcu){.+.+}-{0:0}, at: device_links_read_lock+0x75/0x98

stack backtrace:
CPU: 0 UID: 0 PID: 547 Comm: rtcwake Tainted: G           O        6.17.0-rc3-00014-g31419c045d64 #6 VOLUNTARY
Tainted: [O]=OOT_MODULE
Stack:
 223721b3a80 6089eac6 00000001 00000001
 ffffff00 6089eac6 00000535 6086e528
 721b3ac0 6003c294 00000000 60031fc0
Call Trace:
 [<600407ed>] show_stack+0x10e/0x127
 [<6003c294>] dump_stack_lvl+0x77/0xc6
 [<6003c2fd>] dump_stack+0x1a/0x20
 [<600bc2f8>] lockdep_rcu_suspicious+0x116/0x13e
 [<603d8ea1>] dpm_async_suspend_superior+0x117/0x17e
 [<603d980f>] device_suspend+0x528/0x541
 [<603da24b>] dpm_suspend+0x1a2/0x267
 [<603da837>] dpm_suspend_start+0x5d/0x72
 [<600ca0c9>] suspend_devices_and_enter+0xab/0x736
 [...]

Add the fourth argument to the iteration to annotate
this and avoid the splat.
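
A minimal sketch of the annotation, assuming the suppliers-list walk in
dpm_async_suspend_superior() (the loop body is illustrative):

  list_for_each_entry_rcu(link, &dev->links.suppliers, c_node,
                          device_links_read_lock_held())
          dpm_async_queue(link->supplier);        /* illustrative callee */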

Fixes: 0679963 ("PM: sleep: Make async suspend handle suppliers like parents")
Fixes: ed18738 ("PM: sleep: Make async resume handle consumers like children")
Signed-off-by: Johannes Berg <[email protected]>
Link: https://patch.msgid.link/20250826134348.aba79f6e6299.I9ecf55da46ccf33778f2c018a82e1819d815b348@changeid
Signed-off-by: Rafael J. Wysocki <[email protected]>
Leo-Yan pushed a commit to Leo-Yan/linux that referenced this pull request Oct 8, 2025
Commit 0e2f80a ("fs/dax: ensure all pages are idle prior to
filesystem unmount") introduced a WARN_ON_ONCE to capture whether
the filesystem has removed all DAX entries or not, and applied the
fix to xfs and ext4.

Apply the missed fix on erofs to fix the runtime warning:

[  5.266254] ------------[ cut here ]------------
[  5.266274] WARNING: CPU: 6 PID: 3109 at mm/truncate.c:89 truncate_folio_batch_exceptionals+0xff/0x260
[  5.266294] Modules linked in:
[  5.266999] CPU: 6 UID: 0 PID: 3109 Comm: umount Tainted: G S                  6.16.0+ #6 PREEMPT(voluntary)
[  5.267012] Tainted: [S]=CPU_OUT_OF_SPEC
[  5.267017] Hardware name: Dell Inc. OptiPlex 5000/05WXFV, BIOS 1.5.1 08/24/2022
[  5.267024] RIP: 0010:truncate_folio_batch_exceptionals+0xff/0x260
[  5.267076] Code: 00 00 41 39 df 7f 11 eb 78 83 c3 01 49 83 c4 08 41 39 df 74 6c 48 63 f3 48 83 fe 1f 0f 83 3c 01 00 00 43 f6 44 26 08 01 74 df <0f> 0b 4a 8b 34 22 4c 89 ef 48 89 55 90 e8 ff 54 1f 00 48 8b 55 90
[  5.267083] RSP: 0018:ffffc900013f36c8 EFLAGS: 00010202
[  5.267095] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[  5.267101] RDX: ffffc900013f3790 RSI: 0000000000000000 RDI: ffff8882a1407898
[  5.267108] RBP: ffffc900013f3740 R08: 0000000000000000 R09: 0000000000000000
[  5.267113] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  5.267119] R13: ffff8882a1407ab8 R14: ffffc900013f3888 R15: 0000000000000001
[  5.267125] FS:  00007aaa8b437800(0000) GS:ffff88850025b000(0000) knlGS:0000000000000000
[  5.267132] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  5.267138] CR2: 00007aaa8b3aac10 CR3: 000000024f764000 CR4: 0000000000f52ef0
[  5.267144] PKRU: 55555554
[  5.267150] Call Trace:
[  5.267154]  <TASK>
[  5.267181]  truncate_inode_pages_range+0x118/0x5e0
[  5.267193]  ? save_trace+0x54/0x390
[  5.267296]  truncate_inode_pages_final+0x43/0x60
[  5.267309]  evict+0x2a4/0x2c0
[  5.267339]  dispose_list+0x39/0x80
[  5.267352]  evict_inodes+0x150/0x1b0
[  5.267376]  generic_shutdown_super+0x41/0x180
[  5.267390]  kill_block_super+0x1b/0x50
[  5.267402]  erofs_kill_sb+0x81/0x90 [erofs]
[  5.267436]  deactivate_locked_super+0x32/0xb0
[  5.267450]  deactivate_super+0x46/0x60
[  5.267460]  cleanup_mnt+0xc3/0x170
[  5.267475]  __cleanup_mnt+0x12/0x20
[  5.267485]  task_work_run+0x5d/0xb0
[  5.267499]  exit_to_user_mode_loop+0x144/0x170
[  5.267512]  do_syscall_64+0x2b9/0x7c0
[  5.267523]  ? __lock_acquire+0x665/0x2ce0
[  5.267535]  ? __lock_acquire+0x665/0x2ce0
[  5.267560]  ? lock_acquire+0xcd/0x300
[  5.267573]  ? find_held_lock+0x31/0x90
[  5.267582]  ? mntput_no_expire+0x97/0x4e0
[  5.267606]  ? mntput_no_expire+0xa1/0x4e0
[  5.267625]  ? mntput+0x24/0x50
[  5.267634]  ? path_put+0x1e/0x30
[  5.267647]  ? do_faccessat+0x120/0x2f0
[  5.267677]  ? do_syscall_64+0x1a2/0x7c0
[  5.267686]  ? from_kgid_munged+0x17/0x30
[  5.267703]  ? from_kuid_munged+0x13/0x30
[  5.267711]  ? __do_sys_getuid+0x3d/0x50
[  5.267724]  ? do_syscall_64+0x1a2/0x7c0
[  5.267732]  ? irqentry_exit+0x77/0xb0
[  5.267743]  ? clear_bhb_loop+0x30/0x80
[  5.267752]  ? clear_bhb_loop+0x30/0x80
[  5.267765]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  5.267772] RIP: 0033:0x7aaa8b32a9fb
[  5.267781] Code: c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 f3 0f 1e fa 31 f6 e9 05 00 00 00 0f 1f 44 00 00 f3 0f 1e fa b8 a6 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 e9 83 0d 00 f7 d8
[  5.267787] RSP: 002b:00007ffd7c4c9468 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[  5.267796] RAX: 0000000000000000 RBX: 00005a61592a8b00 RCX: 00007aaa8b32a9fb
[  5.267802] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00005a61592b2080
[  5.267806] RBP: 00007ffd7c4c9540 R08: 00007aaa8b403b20 R09: 0000000000000020
[  5.267812] R10: 0000000000000001 R11: 0000000000000246 R12: 00005a61592a8c00
[  5.267817] R13: 0000000000000000 R14: 00005a61592b2080 R15: 00005a61592a8f10
[  5.267849]  </TASK>
[  5.267854] irq event stamp: 4721
[  5.267859] hardirqs last  enabled at (4727): [<ffffffff814abf50>] __up_console_sem+0x90/0xa0
[  5.267873] hardirqs last disabled at (4732): [<ffffffff814abf35>] __up_console_sem+0x75/0xa0
[  5.267884] softirqs last  enabled at (3044): [<ffffffff8132adb3>] kernel_fpu_end+0x53/0x70
[  5.267895] softirqs last disabled at (3042): [<ffffffff8132b5f4>] kernel_fpu_begin_mask+0xc4/0x120
[  5.267905] ---[ end trace 0000000000000000 ]---
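
A minimal sketch of the missed fix, mirroring what xfs and ext4 do on
inode eviction (hook placement in erofs is illustrative):

  static void erofs_evict_inode(struct inode *inode)
  {
          /* wait until all DAX page-cache entries are gone */
          if (IS_DAX(inode))
                  dax_break_layout_final(inode);
          truncate_inode_pages_final(&inode->i_data);
          clear_inode(inode);
  }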

Fixes: bde708f ("fs/dax: always remove DAX page-cache entries when breaking layouts")
Signed-off-by: Yuezhang Mo <[email protected]>
Reviewed-by: Friendy Su <[email protected]>
Reviewed-by: Daniel Palmer <[email protected]>
Reviewed-by: Gao Xiang <[email protected]>
Signed-off-by: Gao Xiang <[email protected]>
kuba-moo pushed a commit to linux-netdev/testing that referenced this pull request Oct 8, 2025
net/bridge/br_private.h:1627 suspicious rcu_dereference_protected() usage!
other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
7 locks held by socat/410:
 #0: ffff88800d7a9c90 (sk_lock-AF_INET){+.+.}-{0:0}, at: inet_stream_connect+0x43/0xa0
 #1: ffffffff9a779900 (rcu_read_lock){....}-{1:3}, at: __ip_queue_xmit+0x62/0x1830
 [..]
 #6: ffffffff9a779900 (rcu_read_lock){....}-{1:3}, at: nf_hook.constprop.0+0x8a/0x440

Call Trace:
 lockdep_rcu_suspicious.cold+0x4f/0xb1
 br_vlan_fill_forward_path_pvid+0x32c/0x410 [bridge]
 br_fill_forward_path+0x7a/0x4d0 [bridge]

Use the correct helper: the non-_rcu variant requires the RTNL mutex.
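
A minimal sketch of the fix in br_vlan_fill_forward_path_pvid()
(surrounding code is illustrative):

  struct net_bridge_vlan_group *vg;

  /* the fast-path lookup runs under rcu_read_lock(), not RTNL, so the
   * _rcu accessor must be used; br_vlan_group() asserts RTNL */
  vg = br_vlan_group_rcu(br);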

Fixes: bcf2766 ("net: bridge: resolve forwarding path for VLAN tag actions in bridge devices")
Signed-off-by: Eric Woudstra <[email protected]>
Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Oct 9, 2025
The following lockdep splat was observed while kernel auto-online a CXL
memory region:

[   51.926183] ======================================================
[   51.933441] WARNING: possible circular locking dependency detected
[   51.940701] 6.17.0djtest+ #53 Tainted: G        W
[   51.947290] ------------------------------------------------------
[   51.954553] systemd-udevd/3334 is trying to acquire lock:
[   51.960938] ffffffff90346188 (hmem_resource_lock){+.+.}-{4:4}, at: hmem_register_resource+0x31/0x50
[   51.971429]
               but task is already holding lock:
[   51.978548] ffffffff90338890 ((node_chain).rwsem){++++}-{4:4}, at: blocking_notifier_call_chain+0x2e/0x70
[   51.989621]
               which lock already depends on the new lock.

[   51.999605]
               the existing dependency chain (in reverse order) is:
[   52.008539]
               -> #6 ((node_chain).rwsem){++++}-{4:4}:
[   52.016195]        down_read+0x45/0x190
[   52.020789]        blocking_notifier_call_chain+0x2e/0x70
[   52.027131]        node_notify+0x1f/0x30
[   52.031809]        online_pages+0xc1/0x330
[   52.036684]        memory_subsys_online+0x22a/0x280
[   52.042431]        device_online+0x50/0x90
[   52.047298]        state_store+0x9b/0xa0
[   52.051956]        dev_attr_store+0x18/0x30
[   52.056907]        sysfs_kf_write+0x4e/0x70
[   52.061854]        kernfs_fop_write_iter+0x187/0x260
[   52.067673]        vfs_write+0x21f/0x590
[   52.072313]        ksys_write+0x73/0xf0
[   52.076854]        __x64_sys_write+0x1d/0x30
[   52.081874]        x64_sys_call+0x7d/0x1d80
[   52.086797]        do_syscall_64+0x6c/0x2f0
[   52.091717]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   52.098198]
               -> #5 (mem_hotplug_lock){++++}-{0:0}:
[   52.105512]        percpu_down_write+0x4b/0x260
[   52.110825]        try_online_node+0x21/0x50
[   52.115844]        cpu_up+0x43/0xd0
[   52.119989]        cpuhp_bringup_mask+0x60/0xa0
[   52.125305]        bringup_nonboot_cpus+0x76/0x110
[   52.130912]        smp_init+0x2e/0x90
[   52.135235]        kernel_init_freeable+0x19a/0x300
[   52.140930]        kernel_init+0x1e/0x140
[   52.145635]        ret_from_fork+0x159/0x200
[   52.150633]        ret_from_fork_asm+0x1a/0x30
[   52.155826]
               -> #4 (cpu_hotplug_lock){++++}-{0:0}:
[   52.163081]        __cpuhp_state_add_instance+0x51/0x200
[   52.169238]        iova_domain_init_rcaches+0x1ed/0x200
[   52.175301]        iommu_setup_dma_ops+0x1b4/0x500
[   52.180877]        bus_iommu_probe+0xd2/0x180
[   52.185954]        iommu_device_register+0x9f/0xe0
[   52.191530]        intel_iommu_init+0xd3b/0xf20
[   52.196810]        pci_iommu_init+0x16/0x40
[   52.201695]        do_one_initcall+0x5c/0x2d0
[   52.206767]        kernel_init_freeable+0x281/0x300
[   52.212432]        kernel_init+0x1e/0x140
[   52.217109]        ret_from_fork+0x159/0x200
[   52.222082]        ret_from_fork_asm+0x1a/0x30
[   52.227253]
               -> #3 (&group->mutex){+.+.}-{4:4}:
[   52.234196]        __mutex_lock+0xa9/0x11e0
[   52.239066]        mutex_lock_nested+0x1f/0x30
[   52.244236]        __iommu_probe_device+0x28c/0x5e0
[   52.249893]        probe_iommu_group+0x2f/0x50
[   52.255064]        bus_for_each_dev+0x7e/0xd0
[   52.260126]        bus_iommu_probe+0x3f/0x180
[   52.265190]        iommu_device_register+0x9f/0xe0
[   52.270751]        intel_iommu_init+0xd3b/0xf20
[   52.276016]        pci_iommu_init+0x16/0x40
[   52.280892]        do_one_initcall+0x5c/0x2d0
[   52.285956]        kernel_init_freeable+0x281/0x300
[   52.291613]        kernel_init+0x1e/0x140
[   52.296284]        ret_from_fork+0x159/0x200
[   52.301253]        ret_from_fork_asm+0x1a/0x30
[   52.306421]
               -> #2 (iommu_probe_device_lock){+.+.}-{4:4}:
[   52.314333]        __mutex_lock+0xa9/0x11e0
[   52.319201]        mutex_lock_nested+0x1f/0x30
[   52.324372]        iommu_probe_device+0x21/0x70
[   52.329638]        iommu_bus_notifier+0x2c/0x80
[   52.334903]        notifier_call_chain+0x4b/0x110
[   52.340357]        blocking_notifier_call_chain+0x4a/0x70
[   52.346594]        bus_notify+0x3b/0x50
[   52.351079]        device_add+0x65d/0x8b0
[   52.355750]        platform_device_add+0xf8/0x250
[   52.361205]        platform_device_register_full+0x154/0x1f0
[   52.367739]        platform_device_register_simple.constprop.0.isra.0+0x37/0x50
[   52.376119]        efisubsys_init+0xaf/0x570
[   52.381090]        do_one_initcall+0x5c/0x2d0
[   52.386152]        kernel_init_freeable+0x281/0x300
[   52.391809]        kernel_init+0x1e/0x140
[   52.396481]        ret_from_fork+0x159/0x200
[   52.401450]        ret_from_fork_asm+0x1a/0x30
[   52.406620]
               -> #1 (&(&priv->bus_notifier)->rwsem){++++}-{4:4}:
[   52.415109]        down_read+0x45/0x190
[   52.419593]        blocking_notifier_call_chain+0x2e/0x70
[   52.425828]        bus_notify+0x3b/0x50
[   52.430311]        device_add+0x65d/0x8b0
[   52.434981]        platform_device_add+0xf8/0x250
[   52.440435]        __hmem_register_resource+0x70/0xc0
[   52.446279]        hmem_register_resource+0x3b/0x50
[   52.451923]        hmat_register_target+0x3c/0x190
[   52.457488]        hmat_init+0x13f/0x370
[   52.462067]        do_one_initcall+0x5c/0x2d0
[   52.467132]        kernel_init_freeable+0x281/0x300
[   52.472790]        kernel_init+0x1e/0x140
[   52.477464]        ret_from_fork+0x159/0x200
[   52.482433]        ret_from_fork_asm+0x1a/0x30
[   52.487604]
               -> #0 (hmem_resource_lock){+.+.}-{4:4}:
[   52.495030]        __lock_acquire+0x14a4/0x2290
[   52.500290]        lock_acquire+0xdd/0x2f0
[   52.505070]        __mutex_lock+0xa9/0x11e0
[   52.509944]        mutex_lock_nested+0x1f/0x30
[   52.515115]        hmem_register_resource+0x31/0x50
[   52.520771]        hmat_register_target+0x3c/0x190
[   52.526319]        hmat_callback+0x6b/0x80
[   52.531098]        notifier_call_chain+0x4b/0x110
[   52.536552]        blocking_notifier_call_chain+0x4a/0x70
[   52.542788]        node_notify+0x1f/0x30
[   52.547369]        online_pages+0x288/0x330
[   52.552246]        memory_subsys_online+0x22a/0x280
[   52.557902]        device_online+0x50/0x90
[   52.562669]        state_store+0x9b/0xa0
[   52.567247]        dev_attr_store+0x18/0x30
[   52.572123]        sysfs_kf_write+0x4e/0x70
[   52.576998]        kernfs_fop_write_iter+0x187/0x260
[   52.582750]        vfs_write+0x21f/0x590
[   52.587327]        ksys_write+0x73/0xf0
[   52.591811]        __x64_sys_write+0x1d/0x30
[   52.596779]        x64_sys_call+0x7d/0x1d80
[   52.601653]        do_syscall_64+0x6c/0x2f0
[   52.606528]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   52.612968]
               other info that might help us debug this:

[   52.622356] Chain exists of:
                 hmem_resource_lock --> mem_hotplug_lock --> (node_chain).rwsem

[   52.635550]  Possible unsafe locking scenario:

[   52.642495]        CPU0                    CPU1
[   52.647752]        ----                    ----
[   52.653014]   rlock((node_chain).rwsem);
[   52.657589]                                lock(mem_hotplug_lock);
[   52.664701]                                lock((node_chain).rwsem);
[   52.672015]   lock(hmem_resource_lock);
[   52.676497]
                *** DEADLOCK ***

[   52.683541] 8 locks held by systemd-udevd/3334:
[   52.688801]  #0: ff36b6d49fbf0410 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0x73/0xf0
[   52.697870]  #1: ff36b6d4ece03a88 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x12c/0x260
[   52.708210]  #2: ff36b6d4ece1cbb8 (kn->active#62){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x141/0x260
[   52.718645]  #3: ffffffff90333cc8 (device_hotplug_lock){+.+.}-{4:4}, at: lock_device_hotplug_sysfs+0x1b/0x50
[   52.729863]  #4: ff36b6d4ece4b108 (&dev->mutex){....}-{4:4}, at: device_online+0x23/0x90
[   52.739130]  #5: ffffffff900664d0 (cpu_hotplug_lock){++++}-{0:0}, at: mem_hotplug_begin+0x12/0x30
[   52.749288]  torvalds#6: ffffffff9024c810 (mem_hotplug_lock){++++}-{0:0}, at: mem_hotplug_begin+0x1e/0x30
[   52.759446]  torvalds#7: ffffffff90338890 ((node_chain).rwsem){++++}-{4:4}, at: blocking_notifier_call_chain+0x2e/0x70
[   52.770860]
               stack backtrace:
[   52.776068] CPU: 0 UID: 0 PID: 3334 Comm: systemd-udevd Tainted: G        W           6.17.0djtest+ #53 PREEMPT(voluntary)
[   52.776071] Tainted: [W]=WARN
[   52.776072] Hardware name: Intel Corporation AvenueCity/AvenueCity, BIOS BHSDCRB1.IPC.3545.P03.2509232237 09/23/2025
[   52.776073] Call Trace:
[   52.776074]  <TASK>
[   52.776076]  dump_stack_lvl+0x72/0xa0
[   52.776080]  dump_stack+0x14/0x1a
[   52.776082]  print_circular_bug.cold+0x188/0x1c6
[   52.776084]  check_noncircular+0x12f/0x160
[   52.776087]  ? __lock_acquire+0x486/0x2290
[   52.776089]  ? __lock_acquire+0x486/0x2290
[   52.776091]  __lock_acquire+0x14a4/0x2290
[   52.776095]  lock_acquire+0xdd/0x2f0
[   52.776096]  ? hmem_register_resource+0x31/0x50
[   52.776100]  ? hmem_register_resource+0x31/0x50
[   52.776101]  __mutex_lock+0xa9/0x11e0
[   52.776104]  ? hmem_register_resource+0x31/0x50
[   52.776104]  ? __kernfs_create_file+0xb5/0x110
[   52.776110]  mutex_lock_nested+0x1f/0x30
[   52.776112]  ? mutex_lock_nested+0x1f/0x30
[   52.776114]  hmem_register_resource+0x31/0x50
[   52.776115]  hmat_register_target+0x3c/0x190
[   52.776119]  hmat_callback+0x6b/0x80
[   52.776120]  notifier_call_chain+0x4b/0x110
[   52.776123]  blocking_notifier_call_chain+0x4a/0x70
[   52.776125]  node_notify+0x1f/0x30
[   52.776126]  online_pages+0x288/0x330
[   52.776129]  memory_subsys_online+0x22a/0x280
[   52.776132]  device_online+0x50/0x90
[   52.776134]  state_store+0x9b/0xa0
[   52.776136]  dev_attr_store+0x18/0x30
[   52.776137]  sysfs_kf_write+0x4e/0x70
[   52.776139]  kernfs_fop_write_iter+0x187/0x260
[   52.776142]  vfs_write+0x21f/0x590
[   52.776146]  ksys_write+0x73/0xf0
[   52.776148]  __x64_sys_write+0x1d/0x30
[   52.776150]  x64_sys_call+0x7d/0x1d80
[   52.776152]  do_syscall_64+0x6c/0x2f0
[   52.776154]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   52.776156] RIP: 0033:0x7f11142fda57
[   52.776158] Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[   52.776160] RSP: 002b:00007ffd0bd530f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[   52.776163] RAX: ffffffffffffffda RBX: 000000000000000e RCX: 00007f11142fda57
[   52.776164] RDX: 000000000000000e RSI: 00007ffd0bd537c0 RDI: 0000000000000006
[   52.776166] RBP: 00007ffd0bd537c0 R08: 00007f11143f70a0 R09: 00007ffd0bd53190
[   52.776167] R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000e
[   52.776168] R13: 000055814e03e780 R14: 000000000000000e R15: 00007f11143f69e0
[   52.776171]  </TASK>

This lock ordering can lead to deadlock: there are instances where
hmem_resource_lock is taken after (node_chain).rwsem, and vice versa.
Narrow the scope of hmem_resource_lock in hmem_register_resource() to
break the circular dependency; the lock is only needed while hmem_active
is being updated.
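
A rough sketch of the narrowing (all demo_* names below are hypothetical
stand-ins for illustration, not the actual dax_hmem code):

	static DEFINE_MUTEX(demo_resource_lock);
	static bool demo_active;

	/* Assumed helper standing in for the platform-device registration. */
	static void demo_do_register(int target_nid, struct resource *r);

	static void demo_register_resource(int target_nid, struct resource *r)
	{
		/* Narrowed scope: only the demo_active update is guarded... */
		mutex_lock(&demo_resource_lock);
		demo_active = true;
		mutex_unlock(&demo_resource_lock);

		/* ...while device registration, which fires bus notifiers
		 * that end up taking (node_chain).rwsem and friends, now
		 * runs with the lock dropped.
		 */
		demo_do_register(target_nid, r);
	}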

Fixes: 7dab174 ("dax/hmem: Move hmem device registration to dax_hmem.ko")
Signed-off-by: Dave Jiang <[email protected]>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Oct 9, 2025
Holding dev_mutex across c4iw_remove() during module exit can lead to a
lockdep warning and potential deadlock. The RDMA core takes global
locks (e.g. devices_rwsem) inside ib_unregister_device(), which may
conflict with the locking order used elsewhere in the driver.

 ======================================================
 WARNING: possible circular locking dependency detected
 6.12.0-124.5.1.el10_1.x86_64+debug #1 Not tainted
 ------------------------------------------------------
 rmmod/3524 is trying to acquire lock:
 ffffffffc1c0dd18 (devices_rwsem){++++}-{4:4}, at: disable_device+0xaf/0x240 [ib_core]

 but task is already holding lock:
 ffff889104e44708 (&device->unregistration_lock){+.+.}-{4:4}, at: __ib_unregister_device+0x209/0x460 [ib_core]

 which lock already depends on the new lock.

 the existing dependency chain (in reverse order) is:

 -> #6 (&device->unregistration_lock){+.+.}-{4:4}:
        __lock_acquire+0x559/0xb80
        lock_acquire.part.0+0xbe/0x270
        __mutex_lock+0x18b/0x12b0
        __ib_unregister_device+0x209/0x460 [ib_core]
        ib_unregister_device+0x25/0x30 [ib_core]
        c4iw_remove+0xce/0xda [iw_cxgb4]
        c4iw_exit_module+0x7d/0xe0 [iw_cxgb4]
        __do_sys_delete_module.isra.0+0x33a/0x540
        do_syscall_64+0x92/0x180
        entry_SYSCALL_64_after_hwframe+0x76/0x7e

 -> #5 (dev_mutex){+.+.}-{4:4}:
        __lock_acquire+0x559/0xb80
        lock_acquire.part.0+0xbe/0x270
        __mutex_lock+0x18b/0x12b0
        c4iw_uld_add+0x137/0x500 [iw_cxgb4]
        uld_attach+0x908/0xd80 [cxgb4]
        cxgb4_uld_alloc_resources.part.0+0x364/0x1120 [cxgb4]
        cxgb4_register_uld+0x10c/0x400 [cxgb4]
        c4iw_init_module+0x77/0x80 [iw_cxgb4]
        do_one_initcall+0xa5/0x260
        do_init_module+0x238/0x7c0
        init_module_from_file+0xdf/0x150
        idempotent_init_module+0x230/0x770
        __x64_sys_finit_module+0xbe/0x130
        do_syscall_64+0x92/0x180
        entry_SYSCALL_64_after_hwframe+0x76/0x7e

 -> #4 (uld_mutex){+.+.}-{4:4}:
        __lock_acquire+0x559/0xb80
        lock_acquire.part.0+0xbe/0x270
        __mutex_lock+0x18b/0x12b0
        cxgb_up+0x24/0xee0 [cxgb4]
        cxgb_open+0x7e/0x250 [cxgb4]
        __dev_open+0x241/0x420
        __dev_change_flags+0x44c/0x660
        dev_change_flags+0x80/0x160
        do_setlink+0x1acd/0x23e0
        __rtnl_newlink+0xb07/0xe40
        rtnl_newlink+0x62/0x90
        rtnetlink_rcv_msg+0x2f3/0xb20
        netlink_rcv_skb+0x13d/0x3b0
        netlink_unicast+0x42e/0x720
        netlink_sendmsg+0x765/0xc20
        ____sys_sendmsg+0x974/0xc60
        ___sys_sendmsg+0xfd/0x180
        __sys_sendmsg+0xe8/0x190
        do_syscall_64+0x92/0x180
        entry_SYSCALL_64_after_hwframe+0x76/0x7e

 -> #3 (rtnl_mutex){+.+.}-{4:4}:
        __lock_acquire+0x559/0xb80
        lock_acquire.part.0+0xbe/0x270
        __mutex_lock+0x18b/0x12b0
        ib_get_eth_speed+0xee/0x9d0 [ib_core]
        ib_query_port+0x140/0x1f0 [ib_core]
        ib_setup_port_attrs+0x1a5/0x4c0 [ib_core]
        add_one_compat_dev+0x4bd/0x7b0 [ib_core]
        rdma_dev_init_net+0x257/0x3e0 [ib_core]
        ops_init+0x109/0x300
        setup_net+0x1c4/0x730
        copy_net_ns+0x23b/0x540
        create_new_namespaces+0x358/0x920
        unshare_nsproxy_namespaces+0x8a/0x1b0
        ksys_unshare+0x2df/0x740
        __x64_sys_unshare+0x31/0x40
        do_syscall_64+0x92/0x180
        entry_SYSCALL_64_after_hwframe+0x76/0x7e

 -> #2 (&device->compat_devs_mutex){+.+.}-{4:4}:
        __lock_acquire+0x559/0xb80
        lock_acquire.part.0+0xbe/0x270
        __mutex_lock+0x18b/0x12b0
        add_one_compat_dev+0xe0/0x7b0 [ib_core]
        rdma_dev_init_net+0x257/0x3e0 [ib_core]
        ops_init+0x109/0x300
        setup_net+0x1c4/0x730
        copy_net_ns+0x23b/0x540
        create_new_namespaces+0x358/0x920
        unshare_nsproxy_namespaces+0x8a/0x1b0
        ksys_unshare+0x2df/0x740
        __x64_sys_unshare+0x31/0x40
        do_syscall_64+0x92/0x180
        entry_SYSCALL_64_after_hwframe+0x76/0x7e

 -> #1 (rdma_nets_rwsem){++++}-{4:4}:
        __lock_acquire+0x559/0xb80
        lock_acquire.part.0+0xbe/0x270
        down_read+0xa3/0x4b0
        enable_device_and_get+0x26b/0x350 [ib_core]
        ib_register_device+0x1c3/0x4f0 [ib_core]
        bnxt_re_ib_init+0x433/0x530 [bnxt_re]
        bnxt_re_add_device+0x60d/0x760 [bnxt_re]
        bnxt_re_probe+0xcf/0x140 [bnxt_re]
        auxiliary_bus_probe+0xa1/0xf0
        really_probe+0x1e0/0x8a0
        __driver_probe_device+0x18c/0x370
        driver_probe_device+0x4a/0x120
        __driver_attach+0x194/0x4a0
        bus_for_each_dev+0x106/0x190
        bus_add_driver+0x2a1/0x4d0
        driver_register+0x1a5/0x360
        __auxiliary_driver_register+0x152/0x240
        c4iw_init_module+0x43/0x80 [iw_cxgb4]
        do_one_initcall+0xa5/0x260
        do_init_module+0x238/0x7c0
        init_module_from_file+0xdf/0x150
        idempotent_init_module+0x230/0x770
        __x64_sys_finit_module+0xbe/0x130
        do_syscall_64+0x92/0x180
        entry_SYSCALL_64_after_hwframe+0x76/0x7e

 -> #0 (devices_rwsem){++++}-{4:4}:
        check_prev_add+0xf1/0xce0
        validate_chain+0x481/0x560
        __lock_acquire+0x559/0xb80
        lock_acquire.part.0+0xbe/0x270
        down_write+0x99/0x220
        disable_device+0xaf/0x240 [ib_core]
        __ib_unregister_device+0x26f/0x460 [ib_core]
        ib_unregister_device+0x25/0x30 [ib_core]
        c4iw_remove+0xce/0xda [iw_cxgb4]
        c4iw_exit_module+0x7d/0xe0 [iw_cxgb4]
        __do_sys_delete_module.isra.0+0x33a/0x540
        do_syscall_64+0x92/0x180
        entry_SYSCALL_64_after_hwframe+0x76/0x7e

 other info that might help us debug this:

 Chain exists of:
   devices_rwsem --> dev_mutex --> &device->unregistration_lock

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(&device->unregistration_lock);
                                lock(dev_mutex);
                                lock(&device->unregistration_lock);
   lock(devices_rwsem);

  *** DEADLOCK ***

This patch fixes the issue by moving all uld_ctx entries from the global
uld_ctx_list to a temporary local list while holding dev_mutex, then
releasing the mutex before calling c4iw_remove() and freeing the
contexts. This avoids the lock inversion while still preventing races
on the shared list.
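
The shape of that change, as a minimal sketch (the demo_* names and the
context layout are assumptions for illustration, not the driver's exact
code):

	static DEFINE_MUTEX(demo_dev_mutex);
	static LIST_HEAD(demo_uld_ctx_list);

	struct demo_uld_ctx {
		struct list_head entry;
	};

	/* Assumed teardown helper; may take RDMA core locks internally. */
	static void demo_remove(struct demo_uld_ctx *ctx);

	static void demo_exit_module(void)
	{
		struct demo_uld_ctx *ctx, *tmp;
		LIST_HEAD(tmp_list);

		/* Detach everything from the shared list under the mutex... */
		mutex_lock(&demo_dev_mutex);
		list_splice_init(&demo_uld_ctx_list, &tmp_list);
		mutex_unlock(&demo_dev_mutex);

		/* ...then tear the entries down with the mutex dropped, so
		 * ib_unregister_device() can take devices_rwsem without any
		 * inversion against dev_mutex.
		 */
		list_for_each_entry_safe(ctx, tmp, &tmp_list, entry) {
			list_del(&ctx->entry);
			demo_remove(ctx);
			kfree(ctx);
		}
	}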

Signed-off-by: Kamal Heib <[email protected]>
guidosarducci added a commit to guidosarducci/linux that referenced this pull request Oct 10, 2025
 - treat tailcall count as 32-bit for access and update
 - change out_offset scope from file to function
 - minor format/structure changes for consistency

Testing: (skipping fentry, fexit, freplace)
========

root@qemu-armhf:/usr/libexec/kselftests-bpf# modprobe test_bpf test_suite=test_tail_calls
test_bpf: #0 Tail call leaf jited:1 967 PASS
test_bpf: #1 Tail call 2 jited:1 1427 PASS
test_bpf: #2 Tail call 3 jited:1 2373 PASS
test_bpf: #3 Tail call 4 jited:1 2304 PASS
test_bpf: #4 Tail call load/store leaf jited:1 1684 PASS
test_bpf: #5 Tail call load/store jited:1 2249 PASS
test_bpf: #6 Tail call error path, max count reached jited:1 22538 PASS
test_bpf: #7 Tail call count preserved across function calls jited:1 1055668 PASS
test_bpf: #8 Tail call error path, NULL target jited:1 513 PASS
test_bpf: #9 Tail call error path, index out of range jited:1 392 PASS
test_bpf: test_tail_calls: Summary: 10 PASSED, 0 FAILED, [10/10 JIT'ed]

root@qemu-armhf:/usr/libexec/kselftests-bpf# ./test_progs -n 397/1-12,17-18,23-24,27-31
397/1   tailcalls/tailcall_1:OK
397/2   tailcalls/tailcall_2:OK
397/3   tailcalls/tailcall_3:OK
397/4   tailcalls/tailcall_4:OK
397/5   tailcalls/tailcall_5:OK
397/6   tailcalls/tailcall_6:OK
397/7   tailcalls/tailcall_bpf2bpf_1:OK
397/8   tailcalls/tailcall_bpf2bpf_2:OK
397/9   tailcalls/tailcall_bpf2bpf_3:OK
397/10  tailcalls/tailcall_bpf2bpf_4:OK
397/11  tailcalls/tailcall_bpf2bpf_5:OK
397/12  tailcalls/tailcall_bpf2bpf_6:OK
397/17  tailcalls/tailcall_poke:OK
397/18  tailcalls/tailcall_bpf2bpf_hierarchy_1:OK
397/23  tailcalls/tailcall_bpf2bpf_hierarchy_2:OK
397/24  tailcalls/tailcall_bpf2bpf_hierarchy_3:OK
397/27  tailcalls/tailcall_failure:OK
397/28  tailcalls/reject_tail_call_spin_lock:OK
397/29  tailcalls/reject_tail_call_rcu_lock:OK
397/30  tailcalls/reject_tail_call_preempt_lock:OK
397/31  tailcalls/reject_tail_call_ref:OK
397     tailcalls:OK
Summary: 1/21 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Tony Ambardar <[email protected]>