Putting some beer in the freezer #6
Conversation
beer rulez!
epic
Hahaha nice!
github is too github...
Oh my... EPIC!
#win!
Unecessary =p
Unecessary * 2 This is not Orkut =/
Thanks for linux!
totally unnecessary, congratz!
yeah make pull requests either vanish or be a link to https://github.com/torvalds/linux/tree/master/Documentation/development-process
This is not Orkut =/ 2
I am thoroughly disappoint. |
...This is just crazy |
@torvalds I will volunteer to help clean up spam requests if there is a way to do so. |
Great way to introduce someone very prominent in the open source |
No more beers for you, going back to BSD |
"No more beers for you, going back to BSD" :D |
@diegoviola you might want to cool it a bit. We're not a lynch mob, the goal was to stop having joke pull requests started on @torvalds repository. Save the 'saving the world' bit for later. :) |
@diegoviola, you're cool. Just something we all might want to keep in mind.
The amount of social networking b.s. for an operating system kernel's source code repository IS TOO DAMN HIGH. |
+1 |
Add mount options backupuid and backupgid.

They allow an authenticated user who may not have access permission to files, but who holds the "Backup files and directories" user right on them (by virtue of being part of the built-in group Backup Operators), to access those files with the intent to back them up, including their ACLs.

When the mount option backupuid is specified, the cifs client restricts the use of backup intents to the user whose effective user id matches the uid specified with the mount option. When the mount option backupgid is specified, the cifs client restricts the use of backup intents to users whose effective group id matches the gid specified with the mount option.

If an authenticated user is not part of the built-in group Backup Operators at the server, access to such files is denied, even if allowed by the client.

Signed-off-by: Shirish Pargaonkar <[email protected]>
Reviewed-by: Jeff Layton <[email protected]>
Signed-off-by: Steve French <[email protected]>
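The gating the message describes reduces to a simple uid/gid check. A minimal sketch under assumed stand-in names (the struct and function here are hypothetical, not the cifs client's; the real code keeps these values in its per-mount superblock info):

```c
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical stand-in for the per-mount backup options. */
struct backup_mount_opts {
	bool  backupuid_set;	/* "backupuid=" given at mount time */
	uid_t backupuid;
	bool  backupgid_set;	/* "backupgid=" given at mount time */
	gid_t backupgid;
};

/* May a caller with these credentials open files with backup intent? */
static bool backup_intent_allowed(const struct backup_mount_opts *opts,
				  uid_t euid, gid_t egid)
{
	if (opts->backupuid_set && euid == opts->backupuid)
		return true;
	if (opts->backupgid_set && egid == opts->backupgid)
		return true;
	/* note: the server still denies users outside Backup Operators */
	return false;
}
```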
This patch validates the sdev pointer in scsi_dh_activate before proceeding further. Without this check we might see the panic as below. I have seen this panic multiple times.

Call trace:
#0 [ffff88007d647b50] machine_kexec at ffffffff81020902
#1 [ffff88007d647ba0] crash_kexec at ffffffff810875b0
#2 [ffff88007d647c70] oops_end at ffffffff8139c650
#3 [ffff88007d647c90] __bad_area_nosemaphore at ffffffff8102dd15
#4 [ffff88007d647d50] page_fault at ffffffff8139b8cf
    [exception RIP: scsi_dh_activate+0x82]
    RIP: ffffffffa0041922  RSP: ffff88007d647e00  RFLAGS: 00010046
    RAX: 0000000000000000  RBX: 0000000000000000  RCX: 00000000000093c5
    RDX: 00000000000093c5  RSI: ffffffffa02e6640  RDI: ffff88007cc88988
    RBP: 000000000000000f   R8: ffff88007d646000   R9: 0000000000000000
    R10: ffff880082293790  R11: 00000000ffffffff  R12: ffff88007cc88988
    R13: 0000000000000000  R14: 0000000000000286  R15: ffff880037b845e0
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
#5 [ffff88007d647e38] run_workqueue at ffffffff81060268
#6 [ffff88007d647e78] worker_thread at ffffffff81060386
#7 [ffff88007d647ee8] kthread at ffffffff81064436
#8 [ffff88007d647f48] kernel_thread at ffffffff81003fba

Signed-off-by: Babu Moger <[email protected]>
Cc: [email protected]
Signed-off-by: James Bottomley <[email protected]>
commit a18a920 upstream.

This patch validates the sdev pointer in scsi_dh_activate before proceeding further. Without this check we might see the panic as below. I have seen this panic multiple times.

Call trace:
#0 [ffff88007d647b50] machine_kexec at ffffffff81020902
#1 [ffff88007d647ba0] crash_kexec at ffffffff810875b0
#2 [ffff88007d647c70] oops_end at ffffffff8139c650
#3 [ffff88007d647c90] __bad_area_nosemaphore at ffffffff8102dd15
#4 [ffff88007d647d50] page_fault at ffffffff8139b8cf
    [exception RIP: scsi_dh_activate+0x82]
    RIP: ffffffffa0041922  RSP: ffff88007d647e00  RFLAGS: 00010046
    RAX: 0000000000000000  RBX: 0000000000000000  RCX: 00000000000093c5
    RDX: 00000000000093c5  RSI: ffffffffa02e6640  RDI: ffff88007cc88988
    RBP: 000000000000000f   R8: ffff88007d646000   R9: 0000000000000000
    R10: ffff880082293790  R11: 00000000ffffffff  R12: ffff88007cc88988
    R13: 0000000000000000  R14: 0000000000000286  R15: ffff880037b845e0
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
#5 [ffff88007d647e38] run_workqueue at ffffffff81060268
#6 [ffff88007d647e78] worker_thread at ffffffff81060386
#7 [ffff88007d647ee8] kthread at ffffffff81064436
#8 [ffff88007d647f48] kernel_thread at ffffffff81003fba

Signed-off-by: Babu Moger <[email protected]>
Signed-off-by: James Bottomley <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
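Both messages (mainline fix and its stable backport) describe the same class of guard: bail out before dereferencing a device that was torn down while the activation work sat on a workqueue (frames #5-#8 above). A hedged sketch with hypothetical stand-in types, not the actual scsi_dh code:

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures involved. */
struct scsi_device { int sdev_state; };
struct scsi_dh_data { struct scsi_device *sdev; };

/* Sketch: validate the sdev pointer before proceeding with activation. */
static int scsi_dh_activate_sketch(struct scsi_dh_data *dh)
{
	if (dh == NULL || dh->sdev == NULL)
		return -ENODEV;		/* device already gone; skip activation */

	/* ... proceed with the device handler's activate callback on dh->sdev ... */
	return 0;
}
```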
If the pte mapping in generic_perform_write() is unmapped between iov_iter_fault_in_readable() and iov_iter_copy_from_user_atomic(), the "copied" parameter to ->write_end can be zero. ext4 couldn't cope with it with delayed allocations enabled. This skips the i_disksize enlargement logic if copied is zero and no new data was appended to the inode.

gdb> bt
#0  0xffffffff811afe80 in ext4_da_should_update_i_disksize (file=0xffff88003f606a80, mapping=0xffff88001d3824e0, pos=0x108000, len=0x1000, copied=0x0, page=0xffffea0000d792e8, fsdata=0x0) at fs/ext4/inode.c:2467
#1  ext4_da_write_end (file=0xffff88003f606a80, mapping=0xffff88001d3824e0, pos=0x108000, len=0x1000, copied=0x0, page=0xffffea0000d792e8, fsdata=0x0) at fs/ext4/inode.c:2512
#2  0xffffffff810d97f1 in generic_perform_write (iocb=<value optimized out>, iov=<value optimized out>, nr_segs=<value optimized out>, pos=0x108000, ppos=0xffff88001e26be40, count=<value optimized out>, written=0x0) at mm/filemap.c:2440
#3  generic_file_buffered_write (iocb=<value optimized out>, iov=<value optimized out>, nr_segs=<value optimized out>, pos=0x108000, ppos=0xffff88001e26be40, count=<value optimized out>, written=0x0) at mm/filemap.c:2482
#4  0xffffffff810db5d1 in __generic_file_aio_write (iocb=0xffff88001e26bde8, iov=0xffff88001e26bec8, nr_segs=0x1, ppos=0xffff88001e26be40) at mm/filemap.c:2600
#5  0xffffffff810db853 in generic_file_aio_write (iocb=0xffff88001e26bde8, iov=0xffff88001e26bec8, nr_segs=<value optimized out>, pos=<value optimized out>) at mm/filemap.c:2632
#6  0xffffffff811a71aa in ext4_file_write (iocb=0xffff88001e26bde8, iov=0xffff88001e26bec8, nr_segs=0x1, pos=0x108000) at fs/ext4/file.c:136
#7  0xffffffff811375aa in do_sync_write (filp=0xffff88003f606a80, buf=<value optimized out>, len=<value optimized out>, ppos=0xffff88001e26bf48) at fs/read_write.c:406
#8  0xffffffff81137e56 in vfs_write (file=0xffff88003f606a80, buf=0x1ec2960 <Address 0x1ec2960 out of bounds>, count=0x4000, pos=0xffff88001e26bf48) at fs/read_write.c:435
#9  0xffffffff8113816c in sys_write (fd=<value optimized out>, buf=0x1ec2960 <Address 0x1ec2960 out of bounds>, count=0x4000) at fs/read_write.c:487
#10 <signal handler called>
#11 0x00007f120077a390 in __brk_reservation_fn_dmi_alloc__ ()
#12 0x0000000000000000 in ?? ()
gdb> print offset
$22 = 0xffffffffffffffff
gdb> print idx
$23 = 0xffffffff
gdb> print inode->i_blkbits
$24 = 0xc
gdb> up
#1  ext4_da_write_end (file=0xffff88003f606a80, mapping=0xffff88001d3824e0, pos=0x108000, len=0x1000, copied=0x0, page=0xffffea0000d792e8, fsdata=0x0) at fs/ext4/inode.c:2512
2512            if (ext4_da_should_update_i_disksize(page, end)) {
gdb> print start
$25 = 0x0
gdb> print end
$26 = 0xffffffffffffffff
gdb> print pos
$27 = 0x108000
gdb> print new_i_size
$28 = 0x108000
gdb> print ((struct ext4_inode_info *)((char *)inode-((int)(&((struct ext4_inode_info *)0)->vfs_inode))))->i_disksize
$29 = 0xd9000
gdb> down
2467                    for (i = 0; i < idx; i++)
gdb> print i
$30 = 0xd44acbee

This is 100% reproducible with some autonuma development code tuned in a very aggressive manner (not normal way even for knumad) which does "exotic" changes to the ptes. It wouldn't normally trigger but I don't see why it can't happen normally if the page is added to swap cache in between the two faults leading to "copied" being zero (which then hangs in ext4). So it should be fixed. Especially possible with lumpy reclaim (albeit disabled if compaction is enabled) as that would ignore the young bits in the ptes.

Signed-off-by: Andrea Arcangeli <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
Cc: [email protected]
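The shape of the fix follows directly from the gdb session: with copied == 0 the computed "end" index underflows to 0xffffffffffffffff, so the i_disksize logic must be skipped entirely. A simplified sketch under assumed stand-in types (the real code is in fs/ext4/inode.c and does considerably more):

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types in the trace above. */
typedef long long loff_t;
struct inode { loff_t i_disksize; };

/*
 * Sketch: if ->write_end is entered with copied == 0 (the page was
 * unmapped between fault-in and the atomic copy), leave i_disksize
 * alone instead of walking buffers with a bogus end index.
 */
static void da_write_end_sketch(struct inode *inode, loff_t pos, size_t copied)
{
	loff_t new_i_size = pos + copied;

	if (copied == 0)
		return;			/* nothing appended to the inode */

	if (new_i_size > inode->i_disksize)
		inode->i_disksize = new_i_size;	/* simplified enlargement */
}
```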
Cancel idle timer in musb_platform_exit.

The idle timer could trigger after clock had been disabled, leading to kernel panic when MUSB_DEVCTL is accessed in musb_do_idle on 2.6.37. The fault below is no longer triggered on 2.6.38-rc4 (clock is disabled later, and only if compiled as a module, and the offending memory access has moved) but the timer should be cancelled nonetheless.

Rebooting...
musb_hdrc musb_hdrc: remove, state 4
usb usb1: USB disconnect, address 1
musb_hdrc musb_hdrc: USB bus 1 deregistered
Unhandled fault: external abort on non-linefetch (0x1028) at 0xfa0ab060
Internal error: : 1028 [#1] PREEMPT
last sysfs file: /sys/kernel/uevent_seqnum
Modules linked in:
CPU: 0    Not tainted  (2.6.37+ #6)
PC is at musb_do_idle+0x24/0x138
LR is at musb_do_idle+0x18/0x138
pc : [<c02377d8>]    lr : [<c02377cc>]    psr: 80000193
sp : cf2bdd80  ip : cf2bdd80  fp : c048a20c
r10: c048a60c  r9 : c048a40c  r8 : cf85e110
r7 : cf2bc000  r6 : 40000113  r5 : c0489800
r4 : cf85e110  r3 : 00000004  r2 : 00000006  r1 : fa0ab000  r0 : cf8a7000
Flags: Nzcv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment user
Control: 10c5387d  Table: 8faac019  DAC: 00000015
Process reboot (pid: 769, stack limit = 0xcf2bc2f0)
Stack: (0xcf2bdd80 to 0xcf2be000)
dd80: 00000103 c0489800 c02377b4 c005fa34 00000555 c0071a8c c04a3858 cf2bdda8
dda0: 00000555 c048a00c cf2bdda8 cf2bdda8 1838beb0 00000103 00000004 cf2bc000
ddc0: 00000001 00000001 c04896c8 0000000a 00000000 c005ac14 00000001 c003f32c
dde0: 00000000 00000025 00000000 cf2bc000 00000002 00000001 cf2bc000 00000000
de00: 00000001 c005ad08 cf2bc000 c002e07c c03ec039 ffffffff fa200000 c0033608
de20: 00000001 00000000 cf852c14 cf81f200 c045b714 c045b708 cf2bc000 c04a37e8
de40: c0033c04 cf2bc000 00000000 00000001 cf2bde68 cf2bde68 c01c3abc c004f7d8
de60: 60000013 ffffffff c0033c04 00000000 01234567 fee1dead 00000000 c006627c
de80: 00000001 c00662c8 28121969 c00663ec cfa38c40 cf9f6a00 cf2bded0 cf9f6a0c
dea0: 00000000 cf92f000 00008914 c02cd284 c04a55c8 c028b398 c00715c0 becf24a8
dec0: 30687465 00000000 00000000 00000000 00000002 1301a8c0 00000000 00000000
dee0: 00000002 1301a8c0 00000000 00000000 c0450494 cf527920 00011f10 cf2bdf08
df00: 00011f10 cf2bdf10 00011f10 cf2bdf18 c00f0b44 c004f7e8 cf2bdf18 cf2bdf18
df20: 00011f10 cf2bdf30 00011f10 cf2bdf38 cf401300 cf486100 00000008 c00d2b28
df40: 00011f10 cf401300 00200200 c00d3388 00011f10 cfb63a88 cfb63a80 c00c2f08
df60: 00000000 00000000 cfb63a80 00000000 cf0a3480 00000006 c0033c04 cfb63a80
df80: 00000000 c00c0104 00000003 cf0a3480 cfb63a80 00000000 00000001 00000004
dfa0: 00000058 c0033a80 00000000 00000001 fee1dead 28121969 01234567 00000000
dfc0: 00000000 00000001 00000004 00000058 00000001 00000001 00000000 00000001
dfe0: 4024d200 becf2cb0 00009210 4024d218 60000010 fee1dead 00000000 00000000
[<c02377d8>] (musb_do_idle+0x24/0x138) from [<c005fa34>] (run_timer_softirq+0x1a8/0x26c)
[<c005fa34>] (run_timer_softirq+0x1a8/0x26c) from [<c005ac14>] (__do_softirq+0x88/0x138)
[<c005ac14>] (__do_softirq+0x88/0x138) from [<c005ad08>] (irq_exit+0x44/0x98)
[<c005ad08>] (irq_exit+0x44/0x98) from [<c002e07c>] (asm_do_IRQ+0x7c/0xa0)
[<c002e07c>] (asm_do_IRQ+0x7c/0xa0) from [<c0033608>] (__irq_svc+0x48/0xa8)
Exception stack(0xcf2bde20 to 0xcf2bde68)
de20: 00000001 00000000 cf852c14 cf81f200 c045b714 c045b708 cf2bc000 c04a37e8
de40: c0033c04 cf2bc000 00000000 00000001 cf2bde68 cf2bde68 c01c3abc c004f7d8
de60: 60000013 ffffffff
[<c0033608>] (__irq_svc+0x48/0xa8) from [<c004f7d8>] (sub_preempt_count+0x0/0xb8)
Code: ebf86030 e5940098 e594108c e5902010 (e5d13060)
---[ end trace 3689c0d808f9bf7c ]---
Kernel panic - not syncing: Fatal exception in interrupt

Cc: [email protected]
Signed-off-by: Johan Hovold <[email protected]>
Signed-off-by: Felipe Balbi <[email protected]>
Signed-off-by: Sriramakrishnan A G <[email protected]>
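The remedy named in the first sentence is a standard teardown-ordering pattern: synchronously cancel the timer before disabling the clock it depends on. A hedged, non-compilable-against-the-kernel sketch; the stand-in types below are hypothetical, while del_timer_sync() and clk_disable() are real kernel APIs of that era:

```c
/* Hypothetical stand-ins so this fragment is self-contained. */
struct timer_list { int pending; };
struct clk;
extern int  del_timer_sync(struct timer_list *timer);	/* kernel timer API */
extern void clk_disable(struct clk *clk);		/* kernel clock API */

static struct timer_list musb_idle_timer;

/* Sketch: cancel the idle timer before the clock it relies on goes away. */
static void musb_platform_exit_sketch(struct clk *musb_clock)
{
	del_timer_sync(&musb_idle_timer);	/* timer can no longer fire */
	clk_disable(musb_clock);		/* now safe to stop the clock */
}
```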
[ Upstream commit e226930 ]

This code has been broken forever, but in several different and creative ways.

So far as I can work out, the R6040 MAC filter has 4 exact-match entries, the first of which the driver uses for its assigned unicast address, plus a 64-entry hash-based filter for multicast addresses (maybe unicast as well?).

The original version of this code would write the first 4 multicast addresses as exact-match entries from offset 1 (bug #1: there is no entry 4 so this could write to some PHY registers). It would fill the remainder of the exact-match entries with the broadcast address (bug #2: this would overwrite the last used entry). If more than 4 multicast addresses were configured, it would set up the hash table, write some random crap to the MAC control register (bug #3) and finally walk off the end of the list when filling the exact-match entries (bug #4).

All of this seems to be pointless, since it sets the promiscuous bit when the interface is made promiscuous or if >4 multicast addresses are enabled, and never clears it (bug #5, masking bug #2).

The recent(ish) changes to the multicast list fixed bug #4, but completely removed the limit on iteration over the exact-match entries (bug #6). Bug #4 was reported as <https://bugzilla.kernel.org/show_bug.cgi?id=15355> and more recently as <http://bugs.debian.org/600155>. Florian Fainelli attempted to fix these in commit 3bcf822, but that actually dealt with bugs #1-3, bug #4 having been fixed in mainline at that point.

This patch fixes the most important current bug #6.

Signed-off-by: Ben Hutchings <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
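Bug #6 is an unbounded walk over a 4-entry hardware table. A minimal sketch of the bounded iteration the fix restores, under assumed names (the array and function here are illustrative, not the r6040 driver's):

```c
#include <stdint.h>

#define R6040_EXACT_ENTRIES 4	/* per the text: 4 exact-match slots */

struct mac_addr { uint8_t octet[6]; };

/*
 * Sketch: slot 0 holds the assigned unicast address, so at most 3
 * multicast addresses fit in the exact-match table; anything beyond
 * that must go to the 64-entry hash filter or the promiscuous bit.
 */
static int fill_exact_match(struct mac_addr slots[R6040_EXACT_ENTRIES],
			    const struct mac_addr *mc_list, int n_mc)
{
	int i;

	for (i = 0; i < n_mc && i + 1 < R6040_EXACT_ENTRIES; i++)
		slots[i + 1] = mc_list[i];	/* never index past entry 3 */

	return i;	/* number of addresses actually installed */
}
```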
The block layer validates buffer alignment using the device's dma_alignment value. If dma_alignment is smaller than logical_block_size (bp_block) - 1, misaligned buffers incorrectly pass validation and propagate to the lower-level driver.

This patch adjusts dma_alignment to be at least logical_block_size - 1, ensuring that misaligned buffers are properly rejected at the block layer and do not reach the DASD driver unnecessarily.

Fixes: 2a07bb6 ("s390/dasd: Remove DMA alignment")
Reviewed-by: Stefan Haberland <[email protected]>
Cc: [email protected] # 6.11+
Signed-off-by: Jaehoon Kim <[email protected]>
Signed-off-by: Stefan Haberland <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
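The arithmetic behind the fix: dma_alignment is a mask (alignment minus one), and it must be at least logical_block_size - 1 or block-misaligned buffers slip through. A runnable sketch of that clamp (function name is illustrative, not the kernel's):

```c
#include <assert.h>
#include <stdio.h>

/* Sketch: raise the alignment mask to at least logical_block_size - 1. */
static unsigned int effective_dma_alignment(unsigned int dma_alignment,
					    unsigned int logical_block_size)
{
	unsigned int min_mask = logical_block_size - 1;

	return dma_alignment > min_mask ? dma_alignment : min_mask;
}

int main(void)
{
	/* 4096-byte blocks: a 511-byte mask would let 512-aligned buffers through. */
	assert(effective_dma_alignment(511, 4096) == 4095);
	printf("effective mask: %u\n", effective_dma_alignment(511, 4096));
	return 0;
}
```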
…ge_order()

Patch series "mm: MM owner tracking for large folios (!hugetlb) + CONFIG_NO_PAGE_MAPCOUNT", v3.

Let's add an "easy" way to decide -- without false positives, without page-mapcounts and without page table/rmap scanning -- whether a large folio is "certainly mapped exclusively" into a single MM, or whether it "maybe mapped shared" into multiple MMs.

Use that information to implement Copy-on-Write reuse, to convert folio_likely_mapped_shared() to folio_maybe_mapped_shared(), and to introduce a kernel config option that lets us not use+maintain per-page mapcounts in large folios anymore.

The bigger picture was presented at LSF/MM [1]. This series is effectively a follow-up on my early work [2], which implemented a more precise, but also more complicated, way to identify whether a large folio is "mapped shared" into multiple MMs or "mapped exclusively" into a single MM.

1 Patch Organization
====================

Patch #1 -> #6: make more room in order-1 folios, so we have two "unsigned long" available for our purposes
Patch #7 -> #11: preparations
Patch #12: MM owner tracking for large folios
Patch #13: COW reuse for PTE-mapped anon THP
Patch #14: folio_maybe_mapped_shared()
Patch #15 -> #20: introduce and implement CONFIG_NO_PAGE_MAPCOUNT

2 MM owner tracking
===================

We assign each MM a unique ID ("MM ID"), to be able to squeeze more information in our folios. On 32bit we use 15-bit IDs, on 64bit we use 31-bit IDs.

For each large folio, we now store two MM-ID+mapcount ("slot") combinations:
* mm0_id + mm0_mapcount
* mm1_id + mm1_mapcount

On 32bit, we use a 16-bit per-MM mapcount, on 64bit an ordinary 32bit mapcount. This way, we require 2x "unsigned long" on 32bit and 64bit for both slots.

Paired with the large mapcount, we can reliably identify whether one of these MMs is the current owner (-> owns all mappings) or even holds all folio references (-> owns all mappings, and all references are from mappings).

As long as only two MMs map folio pages at a time, we can reliably and precisely identify whether a large folio is "mapped shared" or "mapped exclusively".

Any additional MM that starts mapping the folio while there are no free slots becomes an "untracked MM". If one such "untracked MM" is the last one mapping a folio exclusively, we will not detect the folio as "mapped exclusively" but instead as "maybe mapped shared". (exception: only a single mapping remains) So that's where the approach gets imprecise.

For now, we use a bit-spinlock to sync the large mapcount + slots, and make sure we do keep the machinery fast, to not degrade (un)map performance drastically: for example, we make sure to only use a single atomic (when grabbing the bit-spinlock), like we would already perform when updating the large mapcount.

3 CONFIG_NO_PAGE_MAPCOUNT
=========================

Patches #15 -> #20 spell out and document what exactly is affected when not maintaining the per-page mapcounts in large folios anymore.

Most importantly, as we cannot maintain folio->_nr_pages_mapped anymore when (un)mapping pages, we'll account a complete folio as mapped if a single page is mapped. In addition, we'll not detect partially mapped anonymous folios as such in all cases yet.

Likely less relevant changes include that we might now under-estimate the USS (Unique Set Size) of a process, but never over-estimate it.

The goal is to make CONFIG_NO_PAGE_MAPCOUNT the default at some point, to then slowly make it the only option, as we learn about real-life impacts and possible ways to mitigate them.

4 Performance
=============

Detailed performance numbers were included in v1 [3], and not that much changed between v1 and v2. I did plenty of measurements on different systems in the meantime, and they all revealed slightly different results.

The pte-mapped-folio micro-benchmarks [4] are fairly sensitive to code layout changes on some systems. Especially the fork() benchmark started being more-shaky-than-before on recent kernels for some reason.

In summary, with my micro-benchmarks:

* Small folios are not impacted.

* CoW performance seems to be mostly unchanged across all folio sizes.

* CoW reuse performance of large folios now matches CoW reuse performance of small folios, because we now actually implement the CoW reuse optimization. On an Intel Xeon Silver 4210R I measured a ~65% reduction in runtime, on an arm64 system I measured ~54% reduction.

* munmap() performance improves with CONFIG_NO_PAGE_MAPCOUNT. I saw double-digit % reduction (up to ~30% on an Intel Xeon Silver 4210R and up to ~70% on an AmpereOne A192-32X) with larger folios. The larger the folios, the larger the performance improvement.

* munmap() performance very slightly (couple percent) degrades without CONFIG_NO_PAGE_MAPCOUNT for smaller folios. For larger folios, there seems to be no change at all.

* fork() performance improves with CONFIG_NO_PAGE_MAPCOUNT. I saw double-digit % reduction (up to ~20% on an Intel Xeon Silver 4210R and up to ~10% on an AmpereOne A192-32X) with larger folios. The larger the folios, the larger the performance improvement.

* While fork() performance without CONFIG_NO_PAGE_MAPCOUNT seems to be almost unchanged on some systems, I saw some degradation for smaller folios on the AmpereOne A192-32X. I did not investigate the details yet, but I suspect code layout changes or suboptimal code placement / inlining. I'm not too worried about the fork() micro-benchmarks for smaller folios given how shaky the results are lately and by how much we improved fork() performance recently.

I also ran case-anon-cow-rand and case-anon-cow-seq, part of vm-scalability, to assess the scalability and the impact of the bit-spinlock. My measurements on a 2-socket 10-core Intel Xeon Silver 4210R CPU revealed no significant changes. Similarly, running these benchmarks with 2 MiB THPs enabled on the AmpereOne A192-32X with 192 cores, I got < 1% difference with < 1% stdev, which is nice.

So far, I did not get my hands on a similarly large system with multiple sockets. I found no other fitting scalability benchmarks that seem to really hammer on concurrent mapping/unmapping of large folio pages like case-anon-cow-seq does.

5 Concerns
==========

5.1 Bit spinlock
----------------

I'm not quite happy about the bit-spinlock, but so far it does not seem to affect scalability in my measurements. If it ever becomes a problem we could either investigate improving the locking, or simply stop the MM tracking once there are "too many mappings" and assume that the folio is "mapped shared" until it is freed. This would be similar (but slightly different) to the "0,1,2,stopped" counting idea Willy had at some point. Adding that logic to "stop tracking" adds more code to the hot path, so I avoided that for now.

5.2 folio_maybe_mapped_shared()
-------------------------------

I documented the change from folio_likely_mapped_shared() to folio_maybe_mapped_shared() quite extensively. If we run into surprises, I have some ideas on how to resolve them. For now, I think we should be fine.

5.3 Added code to map/unmap hot path
------------------------------------

So far, it looks like the added code on the rmap hot path does not really seem to matter much in the bigger picture. I'd like to further reduce it (and possibly improve fork() performance further), but I don't easily see how right now. Well, and I am out of puff 🙂

Having said that, alternatives I considered (e.g., per-MM per-folio mapcount) would add a lot more overhead to these hot paths.

6 Future Work
=============

6.1 Large mapcount
------------------

It would be very handy if the large mapcount would count how often folio pages are actually mapped into page tables: a PMD on x86-64 would count 512 times. Calculating the average per-page mapcount will be easy, and remapping (PMD->PTE) folios would get even faster.

That would also remove the need for the entire mapcount (except for PMD-sized folios for memory statistics reasons ...), and allow for mapping folios larger than PMDs (e.g., 4 MiB) easily.

We likely would also have to take the same number of folio references to make our folio_mapcount() == folio_ref_count() work, and we'd want to be able to avoid mapcount+refcount overflows: this could already become an issue with pte-mapped PUD-sized folios (fsdax).

One approach we discussed in the THP cabal meeting is (1) extending the mapcount for large folios to 64bit (at least on 64bit systems) and (2) keeping the refcount at 32bit, but (3) having exactly one reference if the mapcount != 0. It should be doable, but there are some corner cases to consider on the unmap path; it is something that I will be looking into next.

6.2 hugetlb
-----------

I'd love to make use of the same tracking also for hugetlb. The real problem is PMD table sharing: getting a page mapped by MM X and unmapped by MM Y will not work. With mshare, that problem should not exist (all mapping/unmapping will be routed through the mshare MM).

[1] https://lwn.net/Articles/974223/
[2] https://lore.kernel.org/linux-mm/[email protected]/T/
[3] https://lkml.kernel.org/r/[email protected]
[4] https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/pte-mapped-folio-benchmarks.c

This patch (of 20):

Let's factor it out into a simple helper function. This helper will also come in handy when working with code where we know that our folio is large.

Maybe in the future we'll have the order readily available for small and large folios; in that case, folio_large_order() would simply translate to folio_order().

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Lance Yang <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Andy Lutomirks^H^Hski <[email protected]>
Cc: Borislav Betkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Liam Howlett <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Matthew Wilcow (Oracle) <[email protected]>
Cc: Michal Koutn <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: tejun heo <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Zefan Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
(cherry picked from commit 6220ea5)
Signed-off-by: David Hildenbrand <[email protected]>
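The two-slot scheme from section 2 can be modeled compactly: a folio is "certainly mapped exclusively" for a given MM only when that MM sits in a tracked slot and its per-MM mapcount accounts for every mapping. An illustrative C model under assumed names (this is not the kernel's struct folio layout; the 64-bit widths follow the commit text):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the two MM-ID+mapcount slots plus the large mapcount. */
struct folio_mm_track {
	uint32_t mm0_id, mm0_mapcount;
	uint32_t mm1_id, mm1_mapcount;
	uint32_t large_mapcount;	/* total mappings of the folio */
};

/*
 * Sketch of the folio_maybe_mapped_shared() decision: report "exclusive"
 * only when one tracked slot belongs to the asking MM and covers all
 * mappings; any other state (including untracked MMs) is "maybe shared".
 */
static bool folio_maybe_mapped_shared_sketch(const struct folio_mm_track *f,
					     uint32_t mm_id)
{
	if (f->mm0_id == mm_id && f->mm0_mapcount == f->large_mapcount)
		return false;	/* this MM owns all mappings */
	if (f->mm1_id == mm_id && f->mm1_mapcount == f->large_mapcount)
		return false;
	return true;		/* conservatively assume shared */
}
```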
…ge_order() Patch series "mm: MM owner tracking for large folios (!hugetlb) + CONFIG_NO_PAGE_MAPCOUNT", v3. Let's add an "easy" way to decide -- without false positives, without page-mapcounts and without page table/rmap scanning -- whether a large folio is "certainly mapped exclusively" into a single MM, or whether it "maybe mapped shared" into multiple MMs. Use that information to implement Copy-on-Write reuse, to convert folio_likely_mapped_shared() to folio_maybe_mapped_share(), and to introduce a kernel config option that lets us not use+maintain per-page mapcounts in large folios anymore. The bigger picture was presented at LSF/MM [1]. This series is effectively a follow-up on my early work [2], which implemented a more precise, but also more complicated, way to identify whether a large folio is "mapped shared" into multiple MMs or "mapped exclusively" into a single MM. 1 Patch Organization ==================== Patch #1 -> torvalds#6: make more room in order-1 folios, so we have two "unsigned long" available for our purposes Patch torvalds#7 -> torvalds#11: preparations Patch torvalds#12: MM owner tracking for large folios Patch torvalds#13: COW reuse for PTE-mapped anon THP Patch torvalds#14: folio_maybe_mapped_shared() Patch torvalds#15 -> torvalds#20: introduce and implement CONFIG_NO_PAGE_MAPCOUNT 2 MM owner tracking =================== We assign each MM a unique ID ("MM ID"), to be able to squeeze more information in our folios. On 32bit we use 15-bit IDs, on 64bit we use 31-bit IDs. For each large folios, we now store two MM-ID+mapcount ("slot") combinations: * mm0_id + mm0_mapcount * mm1_id + mm1_mapcount On 32bit, we use a 16-bit per-MM mapcount, on 64bit an ordinary 32bit mapcount. This way, we require 2x "unsigned long" on 32bit and 64bit for both slots. Paired with the large mapcount, we can reliably identify whether one of these MMs is the current owner (-> owns all mappings) or even holds all folio references (-> owns all mappings, and all references are from mappings). As long as only two MMs map folio pages at a time, we can reliably and precisely identify whether a large folio is "mapped shared" or "mapped exclusively". Any additional MM that starts mapping the folio while there are no free slots becomes an "untracked MM". If one such "untracked MM" is the last one mapping a folio exclusively, we will not detect the folio as "mapped exclusively" but instead as "maybe mapped shared". (exception: only a single mapping remains) So that's where the approach gets imprecise. For now, we use a bit-spinlock to sync the large mapcount + slots, and make sure we do keep the machinery fast, to not degrade (un)map performance drastically: for example, we make sure to only use a single atomic (when grabbing the bit-spinlock), like we would already perform when updating the large mapcount. 3 CONFIG_NO_PAGE_MAPCOUNT ========================= patch torvalds#15 -> torvalds#20 spell out and document what exactly is affected when not maintaining the per-page mapcounts in large folios anymore. Most importantly, as we cannot maintain folio->_nr_pages_mapped anymore when (un)mapping pages, we'll account a complete folio as mapped if a single page is mapped. In addition, we'll not detect partially mapped anonymous folios as such in all cases yet. Likely less relevant changes include that we might now under-estimate the USS (Unique Set Size) of a process, but never over-estimate it. 
The goal is to make CONFIG_NO_PAGE_MAPCOUNT the default at some point, to then slowly make it the only option, as we learn about real-life impacts and possible ways to mitigate them. 4 Performance ============= Detailed performance numbers were included in v1 [3], and not that much changed between v1 and v2. I did plenty of measurements on different systems in the meantime, that all revealed slightly different results. The pte-mapped-folio micro-benchmarks [4] are fairly sensitive to code layout changes on some systems. Especially the fork() benchmark started being more-shaky-than-before on recent kernels for some reason. In summary, with my micro-benchmarks: * Small folios are not impacted. * CoW performance seems to be mostly unchanged across all folios sizes. * CoW reuse performance of large folios now matches CoW reuse performance of small folios, because we now actually implement the CoW reuse optimization. On an Intel Xeon Silver 4210R I measured a ~65% reduction in runtime, on an arm64 system I measured ~54% reduction. * munmap() performance improves with CONFIG_NO_PAGE_MAPCOUNT. I saw double-digit % reduction (up to ~30% on an Intel Xeon Silver 4210R and up to ~70% on an AmpereOne A192-32X) with larger folios. The larger the folios, the larger the performance improvement. * munmao() performance very slightly (couple percent) degrades without CONFIG_NO_PAGE_MAPCOUNT for smaller folios. For larger folios, there seems to be no change at all. * fork() performance improves with CONFIG_NO_PAGE_MAPCOUNT. I saw double-digit % reduction (up to ~20% on an Intel Xeon Silver 4210R and up to ~10% on an AmpereOne A192-32X) with larger folios. The larger the folios, the larger the performance improvement. * While fork() performance without CONFIG_NO_PAGE_MAPCOUNT seems to be almost unchanged on some systems, I saw some degradation for smaller folios on the AmpereOne A192-32X. I did not investigate the details yet, but I suspect code layout changes or suboptimal code placement / inlining. I'm not to worried about the fork() micro-benchmarks for smaller folios given how shaky the results are lately and by how much we improved fork() performance recently. I also ran case-anon-cow-rand and case-anon-cow-seq part of vm-scalability, to assess the scalability and the impact of the bit-spinlock. My measurements on a two 2-socket 10-core Intel Xeon Silver 4210R CPU revealed no significant changes. Similarly, running these benchmarks with 2 MiB THPs enabled on the AmpereOne A192-32X with 192 cores, I got < 1% difference with < 1% stdev, which is nice. So far, I did not get my hands on a similarly large system with multiple sockets. I found no other fitting scalability benchmarks that seem to really hammer on concurrent mapping/unmapping of large folio pages like case-anon-cow-seq does. 5 Concerns ========== 5.1 Bit spinlock ---------------- I'm not quite happy about the bit-spinlock, but so far it does not seem to affect scalability in my measurements. If it ever becomes a problem we could either investigate improving the locking, or simply stopping the MM tracking once there are "too many mappings" and simply assume that the folio is "mapped shared" until it was freed. This would be similar (but slightly different) to the "0,1,2,stopped" counting idea Willy had at some point. Adding that logic to "stop tracking" adds more code to the hot path, so I avoided that for now. 
5.2 folio_maybe_mapped_shared() ------------------------------- I documented the change from folio_likely_mapped_shared() to folio_maybe_mapped_shared() quite extensively. If we run into surprises, I have some ideas on how to resolve them. For now, I think we should be fine. 5.3 Added code to map/unmap hot path ------------------------------------ So far, it looks like the added code on the rmap hot path does not really seem to matter much in the bigger picture. I'd like to further reduce it (and possibly improve fork() performance further), but I don't easily see how right now. Well, and I am out of puff 🙂 Having that said, alternatives I considered (e.g., per-MM per-folio mapcount) would add a lot more overhead to these hot paths. 6 Future Work ============= 6.1 Large mapcount ------------------ It would be very handy if the large mapcount would count how often folio pages are actually mapped into page tables: a PMD on x86-64 would count 512 times. Calculating the average per-page mapcount will be easy, and remapping (PMD->PTE) folios would get even faster. That would also remove the need for the entire mapcount (except for PMD-sized folios for memory statistics reasons ...), and allow for mapping folios larger than PMDs (e.g., 4 MiB) easily. We likely would also have to take the same number of folio references to make our folio_mapcount() == folio_ref_count() work, and we'd want to be able to avoid mapcount+refcount overflows: this could already become an issue with pte-mapped PUD-sized folios (fsdax). One approach we discussed in the THP cabal meeting is (1) extending the mapcount for large folios to 64bit (at least on 64bit systems) and (2) keeping the refcount at 32bit, but (3) having exactly one reference if the the mapcount != 0. It should be doable, but there are some corner cases to consider on the unmap path; it is something that I will be looking into next. 6.2 hugetlb ----------- I'd love to make use of the same tracking also for hugetlb. The real problem is PMD table sharing: getting a page mapped by MM X and unmapped by MM Y will not work. With mshare, that problem should not exist (all mapping/unmapping will be routed through the mshare MM). [1] https://lwn.net/Articles/974223/ [2] https://lore.kernel.org/linux-mm/[email protected]/T/ [3] https://lkml.kernel.org/r/[email protected] [4] https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/pte-mapped-folio-benchmarks.c This patch (of 20): Let's factor it out into a simple helper function. This helper will also come in handy when working with code where we know that our folio is large. Maybe in the future we'll have the order readily available for small and large folios; in that case, folio_large_order() would simply translate to folio_order(). Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Lance Yang <[email protected]> Reviewed-by: Kirill A. 
Shutemov <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Andy Lutomirks^H^Hski <[email protected]> Cc: Borislav Betkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jann Horn <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Liam Howlett <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Matthew Wilcow (Oracle) <[email protected]> Cc: Michal Koutn <[email protected]> Cc: Muchun Song <[email protected]> Cc: tejun heo <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Zefan Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]> (cherry picked from commit 6220ea5) Signed-off-by: David Hildenbrand <[email protected]>
…ge_order() Patch series "mm: MM owner tracking for large folios (!hugetlb) + CONFIG_NO_PAGE_MAPCOUNT", v3. Let's add an "easy" way to decide -- without false positives, without page-mapcounts and without page table/rmap scanning -- whether a large folio is "certainly mapped exclusively" into a single MM, or whether it "maybe mapped shared" into multiple MMs. Use that information to implement Copy-on-Write reuse, to convert folio_likely_mapped_shared() to folio_maybe_mapped_share(), and to introduce a kernel config option that lets us not use+maintain per-page mapcounts in large folios anymore. The bigger picture was presented at LSF/MM [1]. This series is effectively a follow-up on my early work [2], which implemented a more precise, but also more complicated, way to identify whether a large folio is "mapped shared" into multiple MMs or "mapped exclusively" into a single MM. 1 Patch Organization ==================== Patch #1 -> torvalds#6: make more room in order-1 folios, so we have two "unsigned long" available for our purposes Patch torvalds#7 -> torvalds#11: preparations Patch torvalds#12: MM owner tracking for large folios Patch torvalds#13: COW reuse for PTE-mapped anon THP Patch torvalds#14: folio_maybe_mapped_shared() Patch torvalds#15 -> torvalds#20: introduce and implement CONFIG_NO_PAGE_MAPCOUNT 2 MM owner tracking =================== We assign each MM a unique ID ("MM ID"), to be able to squeeze more information in our folios. On 32bit we use 15-bit IDs, on 64bit we use 31-bit IDs. For each large folios, we now store two MM-ID+mapcount ("slot") combinations: * mm0_id + mm0_mapcount * mm1_id + mm1_mapcount On 32bit, we use a 16-bit per-MM mapcount, on 64bit an ordinary 32bit mapcount. This way, we require 2x "unsigned long" on 32bit and 64bit for both slots. Paired with the large mapcount, we can reliably identify whether one of these MMs is the current owner (-> owns all mappings) or even holds all folio references (-> owns all mappings, and all references are from mappings). As long as only two MMs map folio pages at a time, we can reliably and precisely identify whether a large folio is "mapped shared" or "mapped exclusively". Any additional MM that starts mapping the folio while there are no free slots becomes an "untracked MM". If one such "untracked MM" is the last one mapping a folio exclusively, we will not detect the folio as "mapped exclusively" but instead as "maybe mapped shared". (exception: only a single mapping remains) So that's where the approach gets imprecise. For now, we use a bit-spinlock to sync the large mapcount + slots, and make sure we do keep the machinery fast, to not degrade (un)map performance drastically: for example, we make sure to only use a single atomic (when grabbing the bit-spinlock), like we would already perform when updating the large mapcount. 3 CONFIG_NO_PAGE_MAPCOUNT ========================= patch torvalds#15 -> torvalds#20 spell out and document what exactly is affected when not maintaining the per-page mapcounts in large folios anymore. Most importantly, as we cannot maintain folio->_nr_pages_mapped anymore when (un)mapping pages, we'll account a complete folio as mapped if a single page is mapped. In addition, we'll not detect partially mapped anonymous folios as such in all cases yet. Likely less relevant changes include that we might now under-estimate the USS (Unique Set Size) of a process, but never over-estimate it. 
The goal is to make CONFIG_NO_PAGE_MAPCOUNT the default at some point, to then slowly make it the only option, as we learn about real-life impacts and possible ways to mitigate them. 4 Performance ============= Detailed performance numbers were included in v1 [3], and not that much changed between v1 and v2. I did plenty of measurements on different systems in the meantime, that all revealed slightly different results. The pte-mapped-folio micro-benchmarks [4] are fairly sensitive to code layout changes on some systems. Especially the fork() benchmark started being more-shaky-than-before on recent kernels for some reason. In summary, with my micro-benchmarks: * Small folios are not impacted. * CoW performance seems to be mostly unchanged across all folios sizes. * CoW reuse performance of large folios now matches CoW reuse performance of small folios, because we now actually implement the CoW reuse optimization. On an Intel Xeon Silver 4210R I measured a ~65% reduction in runtime, on an arm64 system I measured ~54% reduction. * munmap() performance improves with CONFIG_NO_PAGE_MAPCOUNT. I saw double-digit % reduction (up to ~30% on an Intel Xeon Silver 4210R and up to ~70% on an AmpereOne A192-32X) with larger folios. The larger the folios, the larger the performance improvement. * munmao() performance very slightly (couple percent) degrades without CONFIG_NO_PAGE_MAPCOUNT for smaller folios. For larger folios, there seems to be no change at all. * fork() performance improves with CONFIG_NO_PAGE_MAPCOUNT. I saw double-digit % reduction (up to ~20% on an Intel Xeon Silver 4210R and up to ~10% on an AmpereOne A192-32X) with larger folios. The larger the folios, the larger the performance improvement. * While fork() performance without CONFIG_NO_PAGE_MAPCOUNT seems to be almost unchanged on some systems, I saw some degradation for smaller folios on the AmpereOne A192-32X. I did not investigate the details yet, but I suspect code layout changes or suboptimal code placement / inlining. I'm not to worried about the fork() micro-benchmarks for smaller folios given how shaky the results are lately and by how much we improved fork() performance recently. I also ran case-anon-cow-rand and case-anon-cow-seq part of vm-scalability, to assess the scalability and the impact of the bit-spinlock. My measurements on a two 2-socket 10-core Intel Xeon Silver 4210R CPU revealed no significant changes. Similarly, running these benchmarks with 2 MiB THPs enabled on the AmpereOne A192-32X with 192 cores, I got < 1% difference with < 1% stdev, which is nice. So far, I did not get my hands on a similarly large system with multiple sockets. I found no other fitting scalability benchmarks that seem to really hammer on concurrent mapping/unmapping of large folio pages like case-anon-cow-seq does. 5 Concerns ========== 5.1 Bit spinlock ---------------- I'm not quite happy about the bit-spinlock, but so far it does not seem to affect scalability in my measurements. If it ever becomes a problem we could either investigate improving the locking, or simply stopping the MM tracking once there are "too many mappings" and simply assume that the folio is "mapped shared" until it was freed. This would be similar (but slightly different) to the "0,1,2,stopped" counting idea Willy had at some point. Adding that logic to "stop tracking" adds more code to the hot path, so I avoided that for now. 
5.2 folio_maybe_mapped_shared() ------------------------------- I documented the change from folio_likely_mapped_shared() to folio_maybe_mapped_shared() quite extensively. If we run into surprises, I have some ideas on how to resolve them. For now, I think we should be fine. 5.3 Added code to map/unmap hot path ------------------------------------ So far, it looks like the added code on the rmap hot path does not really seem to matter much in the bigger picture. I'd like to further reduce it (and possibly improve fork() performance further), but I don't easily see how right now. Well, and I am out of puff 🙂 Having that said, alternatives I considered (e.g., per-MM per-folio mapcount) would add a lot more overhead to these hot paths. 6 Future Work ============= 6.1 Large mapcount ------------------ It would be very handy if the large mapcount would count how often folio pages are actually mapped into page tables: a PMD on x86-64 would count 512 times. Calculating the average per-page mapcount will be easy, and remapping (PMD->PTE) folios would get even faster. That would also remove the need for the entire mapcount (except for PMD-sized folios for memory statistics reasons ...), and allow for mapping folios larger than PMDs (e.g., 4 MiB) easily. We likely would also have to take the same number of folio references to make our folio_mapcount() == folio_ref_count() work, and we'd want to be able to avoid mapcount+refcount overflows: this could already become an issue with pte-mapped PUD-sized folios (fsdax). One approach we discussed in the THP cabal meeting is (1) extending the mapcount for large folios to 64bit (at least on 64bit systems) and (2) keeping the refcount at 32bit, but (3) having exactly one reference if the the mapcount != 0. It should be doable, but there are some corner cases to consider on the unmap path; it is something that I will be looking into next. 6.2 hugetlb ----------- I'd love to make use of the same tracking also for hugetlb. The real problem is PMD table sharing: getting a page mapped by MM X and unmapped by MM Y will not work. With mshare, that problem should not exist (all mapping/unmapping will be routed through the mshare MM). [1] https://lwn.net/Articles/974223/ [2] https://lore.kernel.org/linux-mm/[email protected]/T/ [3] https://lkml.kernel.org/r/[email protected] [4] https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/pte-mapped-folio-benchmarks.c This patch (of 20): Let's factor it out into a simple helper function. This helper will also come in handy when working with code where we know that our folio is large. Maybe in the future we'll have the order readily available for small and large folios; in that case, folio_large_order() would simply translate to folio_order(). Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Lance Yang <[email protected]> Reviewed-by: Kirill A. 
Shutemov <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Andy Lutomirks^H^Hski <[email protected]> Cc: Borislav Betkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jann Horn <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Liam Howlett <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Matthew Wilcow (Oracle) <[email protected]> Cc: Michal Koutn <[email protected]> Cc: Muchun Song <[email protected]> Cc: tejun heo <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Zefan Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]> (cherry picked from commit 6220ea5) Signed-off-by: David Hildenbrand <[email protected]>
…ge_order() Patch series "mm: MM owner tracking for large folios (!hugetlb) + CONFIG_NO_PAGE_MAPCOUNT", v3. Let's add an "easy" way to decide -- without false positives, without page-mapcounts and without page table/rmap scanning -- whether a large folio is "certainly mapped exclusively" into a single MM, or whether it "maybe mapped shared" into multiple MMs. Use that information to implement Copy-on-Write reuse, to convert folio_likely_mapped_shared() to folio_maybe_mapped_share(), and to introduce a kernel config option that lets us not use+maintain per-page mapcounts in large folios anymore. The bigger picture was presented at LSF/MM [1]. This series is effectively a follow-up on my early work [2], which implemented a more precise, but also more complicated, way to identify whether a large folio is "mapped shared" into multiple MMs or "mapped exclusively" into a single MM. 1 Patch Organization ==================== Patch #1 -> torvalds#6: make more room in order-1 folios, so we have two "unsigned long" available for our purposes Patch torvalds#7 -> torvalds#11: preparations Patch torvalds#12: MM owner tracking for large folios Patch torvalds#13: COW reuse for PTE-mapped anon THP Patch torvalds#14: folio_maybe_mapped_shared() Patch torvalds#15 -> torvalds#20: introduce and implement CONFIG_NO_PAGE_MAPCOUNT 2 MM owner tracking =================== We assign each MM a unique ID ("MM ID"), to be able to squeeze more information in our folios. On 32bit we use 15-bit IDs, on 64bit we use 31-bit IDs. For each large folios, we now store two MM-ID+mapcount ("slot") combinations: * mm0_id + mm0_mapcount * mm1_id + mm1_mapcount On 32bit, we use a 16-bit per-MM mapcount, on 64bit an ordinary 32bit mapcount. This way, we require 2x "unsigned long" on 32bit and 64bit for both slots. Paired with the large mapcount, we can reliably identify whether one of these MMs is the current owner (-> owns all mappings) or even holds all folio references (-> owns all mappings, and all references are from mappings). As long as only two MMs map folio pages at a time, we can reliably and precisely identify whether a large folio is "mapped shared" or "mapped exclusively". Any additional MM that starts mapping the folio while there are no free slots becomes an "untracked MM". If one such "untracked MM" is the last one mapping a folio exclusively, we will not detect the folio as "mapped exclusively" but instead as "maybe mapped shared". (exception: only a single mapping remains) So that's where the approach gets imprecise. For now, we use a bit-spinlock to sync the large mapcount + slots, and make sure we do keep the machinery fast, to not degrade (un)map performance drastically: for example, we make sure to only use a single atomic (when grabbing the bit-spinlock), like we would already perform when updating the large mapcount. 3 CONFIG_NO_PAGE_MAPCOUNT ========================= patch torvalds#15 -> torvalds#20 spell out and document what exactly is affected when not maintaining the per-page mapcounts in large folios anymore. Most importantly, as we cannot maintain folio->_nr_pages_mapped anymore when (un)mapping pages, we'll account a complete folio as mapped if a single page is mapped. In addition, we'll not detect partially mapped anonymous folios as such in all cases yet. Likely less relevant changes include that we might now under-estimate the USS (Unique Set Size) of a process, but never over-estimate it. 
The goal is to make CONFIG_NO_PAGE_MAPCOUNT the default at some point, to then slowly make it the only option, as we learn about real-life impacts and possible ways to mitigate them. 4 Performance ============= Detailed performance numbers were included in v1 [3], and not that much changed between v1 and v2. I did plenty of measurements on different systems in the meantime, that all revealed slightly different results. The pte-mapped-folio micro-benchmarks [4] are fairly sensitive to code layout changes on some systems. Especially the fork() benchmark started being more-shaky-than-before on recent kernels for some reason. In summary, with my micro-benchmarks: * Small folios are not impacted. * CoW performance seems to be mostly unchanged across all folios sizes. * CoW reuse performance of large folios now matches CoW reuse performance of small folios, because we now actually implement the CoW reuse optimization. On an Intel Xeon Silver 4210R I measured a ~65% reduction in runtime, on an arm64 system I measured ~54% reduction. * munmap() performance improves with CONFIG_NO_PAGE_MAPCOUNT. I saw double-digit % reduction (up to ~30% on an Intel Xeon Silver 4210R and up to ~70% on an AmpereOne A192-32X) with larger folios. The larger the folios, the larger the performance improvement. * munmao() performance very slightly (couple percent) degrades without CONFIG_NO_PAGE_MAPCOUNT for smaller folios. For larger folios, there seems to be no change at all. * fork() performance improves with CONFIG_NO_PAGE_MAPCOUNT. I saw double-digit % reduction (up to ~20% on an Intel Xeon Silver 4210R and up to ~10% on an AmpereOne A192-32X) with larger folios. The larger the folios, the larger the performance improvement. * While fork() performance without CONFIG_NO_PAGE_MAPCOUNT seems to be almost unchanged on some systems, I saw some degradation for smaller folios on the AmpereOne A192-32X. I did not investigate the details yet, but I suspect code layout changes or suboptimal code placement / inlining. I'm not to worried about the fork() micro-benchmarks for smaller folios given how shaky the results are lately and by how much we improved fork() performance recently. I also ran case-anon-cow-rand and case-anon-cow-seq part of vm-scalability, to assess the scalability and the impact of the bit-spinlock. My measurements on a two 2-socket 10-core Intel Xeon Silver 4210R CPU revealed no significant changes. Similarly, running these benchmarks with 2 MiB THPs enabled on the AmpereOne A192-32X with 192 cores, I got < 1% difference with < 1% stdev, which is nice. So far, I did not get my hands on a similarly large system with multiple sockets. I found no other fitting scalability benchmarks that seem to really hammer on concurrent mapping/unmapping of large folio pages like case-anon-cow-seq does. 5 Concerns ========== 5.1 Bit spinlock ---------------- I'm not quite happy about the bit-spinlock, but so far it does not seem to affect scalability in my measurements. If it ever becomes a problem we could either investigate improving the locking, or simply stopping the MM tracking once there are "too many mappings" and simply assume that the folio is "mapped shared" until it was freed. This would be similar (but slightly different) to the "0,1,2,stopped" counting idea Willy had at some point. Adding that logic to "stop tracking" adds more code to the hot path, so I avoided that for now. 
5.2 folio_maybe_mapped_shared() ------------------------------- I documented the change from folio_likely_mapped_shared() to folio_maybe_mapped_shared() quite extensively. If we run into surprises, I have some ideas on how to resolve them. For now, I think we should be fine. 5.3 Added code to map/unmap hot path ------------------------------------ So far, it looks like the added code on the rmap hot path does not really seem to matter much in the bigger picture. I'd like to further reduce it (and possibly improve fork() performance further), but I don't easily see how right now. Well, and I am out of puff 🙂 Having that said, alternatives I considered (e.g., per-MM per-folio mapcount) would add a lot more overhead to these hot paths. 6 Future Work ============= 6.1 Large mapcount ------------------ It would be very handy if the large mapcount would count how often folio pages are actually mapped into page tables: a PMD on x86-64 would count 512 times. Calculating the average per-page mapcount will be easy, and remapping (PMD->PTE) folios would get even faster. That would also remove the need for the entire mapcount (except for PMD-sized folios for memory statistics reasons ...), and allow for mapping folios larger than PMDs (e.g., 4 MiB) easily. We likely would also have to take the same number of folio references to make our folio_mapcount() == folio_ref_count() work, and we'd want to be able to avoid mapcount+refcount overflows: this could already become an issue with pte-mapped PUD-sized folios (fsdax). One approach we discussed in the THP cabal meeting is (1) extending the mapcount for large folios to 64bit (at least on 64bit systems) and (2) keeping the refcount at 32bit, but (3) having exactly one reference if the the mapcount != 0. It should be doable, but there are some corner cases to consider on the unmap path; it is something that I will be looking into next. 6.2 hugetlb ----------- I'd love to make use of the same tracking also for hugetlb. The real problem is PMD table sharing: getting a page mapped by MM X and unmapped by MM Y will not work. With mshare, that problem should not exist (all mapping/unmapping will be routed through the mshare MM). [1] https://lwn.net/Articles/974223/ [2] https://lore.kernel.org/linux-mm/[email protected]/T/ [3] https://lkml.kernel.org/r/[email protected] [4] https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/pte-mapped-folio-benchmarks.c This patch (of 20): Let's factor it out into a simple helper function. This helper will also come in handy when working with code where we know that our folio is large. Maybe in the future we'll have the order readily available for small and large folios; in that case, folio_large_order() would simply translate to folio_order(). Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Lance Yang <[email protected]> Reviewed-by: Kirill A. 
Shutemov <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Andy Lutomirks^H^Hski <[email protected]> Cc: Borislav Betkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jann Horn <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Liam Howlett <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Matthew Wilcow (Oracle) <[email protected]> Cc: Michal Koutn <[email protected]> Cc: Muchun Song <[email protected]> Cc: tejun heo <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Zefan Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]> (cherry picked from commit 6220ea5) Signed-off-by: David Hildenbrand <[email protected]>
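As context for the helper factored out here: for a large folio, the order is stored in the first tail page, so the helper boils down to a masked read. A sketch of its likely shape (the exact field layout varies across kernel versions):

```c
/* Sketch: the low byte of _flags_1 (the first tail page's flags word)
 * holds the folio order; only valid for large folios. */
static inline unsigned int folio_large_order(const struct folio *folio)
{
	return folio->_flags_1 & 0xff;
}
```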
- treat tailcall count as 32-bit for access and update - change out_offset scope from file to function - minor format/structure changes for consistency Testing: (skipping fentry, fexit, freplace) ======== root@qemu-armhf:/usr/libexec/kselftests-bpf# modprobe test_bpf test_suite=test_tail_calls test_bpf: #0 Tail call leaf jited:1 967 PASS test_bpf: #1 Tail call 2 jited:1 1427 PASS test_bpf: #2 Tail call 3 jited:1 2373 PASS test_bpf: #3 Tail call 4 jited:1 2304 PASS test_bpf: #4 Tail call load/store leaf jited:1 1684 PASS test_bpf: #5 Tail call load/store jited:1 2249 PASS test_bpf: torvalds#6 Tail call error path, max count reached jited:1 22538 PASS test_bpf: torvalds#7 Tail call count preserved across function calls jited:1 1055668 PASS test_bpf: torvalds#8 Tail call error path, NULL target jited:1 513 PASS test_bpf: torvalds#9 Tail call error path, index out of range jited:1 392 PASS test_bpf: test_tail_calls: Summary: 10 PASSED, 0 FAILED, [10/10 JIT'ed] root@qemu-armhf:/usr/libexec/kselftests-bpf# ./test_progs -n 397/1-12,17-18,23-24,27-31 397/1 tailcalls/tailcall_1:OK 397/2 tailcalls/tailcall_2:OK 397/3 tailcalls/tailcall_3:OK 397/4 tailcalls/tailcall_4:OK 397/5 tailcalls/tailcall_5:OK 397/6 tailcalls/tailcall_6:OK 397/7 tailcalls/tailcall_bpf2bpf_1:OK 397/8 tailcalls/tailcall_bpf2bpf_2:OK 397/9 tailcalls/tailcall_bpf2bpf_3:OK 397/10 tailcalls/tailcall_bpf2bpf_4:OK 397/11 tailcalls/tailcall_bpf2bpf_5:OK 397/12 tailcalls/tailcall_bpf2bpf_6:OK 397/17 tailcalls/tailcall_poke:OK 397/18 tailcalls/tailcall_bpf2bpf_hierarchy_1:OK 397/23 tailcalls/tailcall_bpf2bpf_hierarchy_2:OK 397/24 tailcalls/tailcall_bpf2bpf_hierarchy_3:OK 397/27 tailcalls/tailcall_failure:OK 397/28 tailcalls/reject_tail_call_spin_lock:OK 397/29 tailcalls/reject_tail_call_rcu_lock:OK 397/30 tailcalls/reject_tail_call_preempt_lock:OK 397/31 tailcalls/reject_tail_call_ref:OK 397 tailcalls:OK Summary: 1/21 PASSED, 0 SKIPPED, 0 FAILED Signed-off-by: Tony Ambardar <[email protected]>
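As background on what these selftests exercise: a BPF tail call jumps into another program through a BPF_MAP_TYPE_PROG_ARRAY map, and the JIT must maintain a tail-call counter to cap the chain length. A minimal, hypothetical program pair (not taken from this patch) looks like:

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
	__uint(max_entries, 1);
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(__u32));
} jmp_table SEC(".maps");

SEC("tc")
int target_prog(struct __sk_buff *skb)
{
	return 0;	/* tail-call destination */
}

SEC("tc")
int entry_prog(struct __sk_buff *skb)
{
	/* Does not return on success; the JIT's tail-call count bounds
	 * how many times programs may chain into each other. */
	bpf_tail_call(skb, &jmp_table, 0);
	return 1;	/* reached only if the tail call fails */
}

char _license[] SEC("license") = "GPL";
```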
Before disabling SR-IOV via config space accesses to the parent PF, sriov_disable() first removes the PCI devices representing the VFs. Since commit 9d16947 ("PCI: Add global pci_lock_rescan_remove()") such removal operations are serialized against concurrent remove and rescan using the pci_rescan_remove_lock. No such locking was ever added in sriov_disable(), however. In particular, when commit 18f9e9d ("PCI/IOV: Factor out sriov_add_vfs()") factored out the PCI device removal into sriov_del_vfs(), there was still no locking around the pci_iov_remove_virtfn() calls. On s390 the lack of serialization in sriov_disable() may cause double remove and list corruption with the below (amended) trace being observed: PSW: 0704c00180000000 0000000c914e4b38 (klist_put+56) GPRS: 000003800313fb48 0000000000000000 0000000100000001 0000000000000001 00000000f9b520a8 0000000000000000 0000000000002fbd 00000000f4cc9480 0000000000000001 0000000000000000 0000000000000000 0000000180692828 00000000818e8000 000003800313fe2c 000003800313fb20 000003800313fad8 #0 [3800313fb20] device_del at c9158ad5c #1 [3800313fb88] pci_remove_bus_device at c915105ba #2 [3800313fbd0] pci_iov_remove_virtfn at c9152f198 #3 [3800313fc28] zpci_iov_remove_virtfn at c90fb67c0 #4 [3800313fc60] zpci_bus_remove_device at c90fb6104 #5 [3800313fca0] __zpci_event_availability at c90fb3dca torvalds#6 [3800313fd08] chsc_process_sei_nt0 at c918fe4a2 torvalds#7 [3800313fd60] crw_collect_info at c91905822 torvalds#8 [3800313fe10] kthread at c90feb390 torvalds#9 [3800313fe68] __ret_from_fork at c90f6aa64 torvalds#10 [3800313fe98] ret_from_fork at c9194f3f2. This is because in addition to sriov_disable() removing the VFs, the platform also generates hot-unplug events for the VFs. These are the reverse of the hotplug events generated by sriov_enable(), which are handled via pdev->no_vf_scan. And while the event processing takes pci_rescan_remove_lock and checks whether the struct pci_dev still exists, the lack of synchronization makes this checking racy. Other races may also be possible, of course, though given that this lack of locking persisted for so long, observable races seem very rare. Even on s390 the list corruption was only observed with certain devices, since the platform events are only triggered by config accesses after the removal, so as long as the removal finished synchronously they would not race. Either way, the locking is missing, so fix this by adding it to the sriov_del_vfs() helper. PCI rescan-remove locking is likewise missing in sriov_add_vfs(), including for the error case where pci_stop_and_remove_bus_device() is called without the PCI rescan-remove lock being held. Even in the non-error case, adding new PCI devices and buses should be serialized via the PCI rescan-remove lock. Add the necessary locking. Fixes: 18f9e9d ("PCI/IOV: Factor out sriov_add_vfs()") Signed-off-by: Niklas Schnelle <[email protected]> Signed-off-by: Bjorn Helgaas <[email protected]> Reviewed-by: Benjamin Block <[email protected]> Reviewed-by: Farhan Ali <[email protected]> Reviewed-by: Julian Ruess <[email protected]> Cc: [email protected] Link: https://patch.msgid.link/[email protected]
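A sketch of the shape of the fix described above, with the removal loop simplified; the real sriov_del_vfs() differs in detail:

```c
/* Simplified sketch: serialize VF removal against concurrent rescan/remove. */
static void sriov_del_vfs(struct pci_dev *dev)
{
	struct pci_sriov *iov = dev->sriov;
	int i;

	pci_lock_rescan_remove();
	for (i = 0; i < iov->num_VFs; i++)
		pci_iov_remove_virtfn(dev, i);
	pci_unlock_rescan_remove();
}
```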
The generic/736 xfstest fails for the HFS case: BEGIN TEST default (1 test): hfs Mon May 5 03:18:32 UTC 2025 DEVICE: /dev/vdb HFS_MKFS_OPTIONS: MOUNT_OPTIONS: MOUNT_OPTIONS FSTYP -- hfs PLATFORM -- Linux/x86_64 kvm-xfstests 6.15.0-rc4-xfstests-g00b827f0cffa #1 SMP PREEMPT_DYNAMIC Fri May 25 MKFS_OPTIONS -- /dev/vdc MOUNT_OPTIONS -- /dev/vdc /vdc generic/736 [03:18:33][ 3.510255] run fstests generic/736 at 2025-05-05 03:18:33 _check_generic_filesystem: filesystem on /dev/vdb is inconsistent (see /results/hfs/results-default/generic/736.full for details) Ran: generic/736 Failures: generic/736 Failed 1 of 1 tests The HFS volume becomes corrupted after the test run: sudo fsck.hfs -d /dev/loop50 ** /dev/loop50 Using cacheBlockSize=32K cacheTotalBlock=1024 cacheSize=32768K. Executing fsck_hfs (version 540.1-Linux). ** Checking HFS volume. The volume name is untitled ** Checking extents overflow file. ** Checking catalog file. ** Checking catalog hierarchy. ** Checking volume bitmap. ** Checking volume information. invalid MDB drNxtCNID Master Directory Block needs minor repair (1, 0) Verify Status: VIStat = 0x8000, ABTStat = 0x0000 EBTStat = 0x0000 CBTStat = 0x0000 CatStat = 0x00000000 ** Repairing volume. ** Rechecking volume. ** Checking HFS volume. The volume name is untitled ** Checking extents overflow file. ** Checking catalog file. ** Checking catalog hierarchy. ** Checking volume bitmap. ** Checking volume information. ** The volume untitled was repaired successfully. The main reason for the issue is the absence of logic that corrects mdb->drNxtCNID/HFS_SB(sb)->next_id (the next unused CNID) after deleting a record in the Catalog File. This patch introduces an hfs_correct_next_unused_CNID() method that implements the necessary logic. For a Catalog File record delete operation, the function checks whether (deleted_CNID + 1) == next_unused_CNID and, if so, finds and sets the new value of next_unused_CNID. sudo ./check generic/736 FSTYP -- hfs PLATFORM -- Linux/x86_64 hfsplus-testing-0001 6.15.0+ #6 SMP PREEMPT_DYNAMIC Tue Jun 10 15:02:48 PDT 2025 MKFS_OPTIONS -- /dev/loop51 MOUNT_OPTIONS -- /dev/loop51 /mnt/scratch generic/736 33s Ran: generic/736 Passed all 1 tests sudo fsck.hfs -d /dev/loop50 ** /dev/loop50 Using cacheBlockSize=32K cacheTotalBlock=1024 cacheSize=32768K. Executing fsck_hfs (version 540.1-Linux). ** Checking HFS volume. The volume name is untitled ** Checking extents overflow file. ** Checking catalog file. ** Checking catalog hierarchy. ** Checking volume bitmap. ** Checking volume information. ** The volume untitled appears to be OK Signed-off-by: Viacheslav Dubeyko <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Viacheslav Dubeyko <[email protected]>
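A hedged sketch of the correction logic as described (the function name is from the patch; hfs_find_highest_cnid() and the exact locking are hypothetical):

```c
/* Illustrative pseudologic: after deleting the catalog record with ID
 * deleted_cnid, shrink next_id if that ID was the last one handed out. */
static void hfs_correct_next_unused_CNID(struct super_block *sb, u32 deleted_cnid)
{
	if (deleted_cnid + 1 == HFS_SB(sb)->next_id) {
		/* Hypothetical helper: scan the catalog for the highest
		 * in-use CNID and make next_id point just past it. */
		HFS_SB(sb)->next_id = hfs_find_highest_cnid(sb) + 1;
	}
}
```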
Another day, another syzkaller bug. KVM erroneously allows userspace to pend vCPU events for a vCPU that hasn't been initialized yet, leading to KVM interpreting a bunch of uninitialized garbage for routing / injecting the exception. In one case the injection code and the hyp disagree on whether the vCPU has a 32bit EL1 and put the vCPU into an illegal mode for AArch64, tripping the BUG() in exception_target_el() during the next injection: kernel BUG at arch/arm64/kvm/inject_fault.c:40! Internal error: Oops - BUG: 00000000f2000800 [#1] SMP CPU: 3 UID: 0 PID: 318 Comm: repro Not tainted 6.17.0-rc4-00104-g10fd0285305d torvalds#6 PREEMPT Hardware name: linux,dummy-virt (DT) pstate: 21402009 (nzCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--) pc : exception_target_el+0x88/0x8c lr : pend_serror_exception+0x18/0x13c sp : ffff800082f03a10 x29: ffff800082f03a10 x28: ffff0000cb132280 x27: 0000000000000000 x26: 0000000000000000 x25: ffff0000c2a99c20 x24: 0000000000000000 x23: 0000000000008000 x22: 0000000000000002 x21: 0000000000000004 x20: 0000000000008000 x19: ffff0000c2a99c20 x18: 0000000000000000 x17: 0000000000000000 x16: 0000000000000000 x15: 00000000200000c0 x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000 x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000000 x8 : ffff800082f03af8 x7 : 0000000000000000 x6 : 0000000000000000 x5 : ffff800080f621f0 x4 : 0000000000000000 x3 : 0000000000000000 x2 : 000000000040009b x1 : 0000000000000003 x0 : ffff0000c2a99c20 Call trace: exception_target_el+0x88/0x8c (P) kvm_inject_serror_esr+0x40/0x3b4 __kvm_arm_vcpu_set_events+0xf0/0x100 kvm_arch_vcpu_ioctl+0x180/0x9d4 kvm_vcpu_ioctl+0x60c/0x9f4 __arm64_sys_ioctl+0xac/0x104 invoke_syscall+0x48/0x110 el0_svc_common.constprop.0+0x40/0xe0 do_el0_svc+0x1c/0x28 el0_svc+0x34/0xf0 el0t_64_sync_handler+0xa0/0xe4 el0t_64_sync+0x198/0x19c Code: f946bc01 b4fffe61 9101e020 17fffff2 (d4210000) Reject the ioctls outright as no sane VMM would call these before KVM_ARM_VCPU_INIT anyway. Even if it did the exception would've been thrown away by the eventual reset of the vCPU's state. Cc: [email protected] # 6.17 Fixes: b7b27fa ("arm/arm64: KVM: Add KVM_GET/SET_VCPU_EVENTS") Signed-off-by: Oliver Upton <[email protected]> Message-Id: <[email protected]>
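The rejection presumably amounts to an early-out before any vCPU state is touched; a sketch (the wrapper is hypothetical, -ENOEXEC being what arm64 KVM conventionally returns for ioctls on an uninitialized vCPU):

```c
/* Sketch: refuse vCPU-event ioctls until userspace has done KVM_ARM_VCPU_INIT. */
static int vcpu_set_events_checked(struct kvm_vcpu *vcpu,
				   struct kvm_vcpu_events *events)
{
	if (!kvm_vcpu_initialized(vcpu))
		return -ENOEXEC;	/* no KVM_ARM_VCPU_INIT yet */

	return __kvm_arm_vcpu_set_events(vcpu, events);
}
```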
The test starts a workload and then opens events. If the events fail to open, for example because of perf_event_paranoid, the gopipe of the workload is leaked and the file descriptor leak check fails when the test exits. To avoid this cancel the workload when opening the events fails. Before: ``` $ perf test -vv 7 7: PERF_RECORD_* events & perf_sample fields: --- start --- test child forked, pid 1189568 Using CPUID GenuineIntel-6-B7-1 ------------------------------------------------------------ perf_event_attr: type 0 (PERF_TYPE_HARDWARE) config 0xa00000000 (cpu_atom/PERF_COUNT_HW_CPU_CYCLES/) disabled 1 ------------------------------------------------------------ sys_perf_event_open: pid 0 cpu -1 group_fd -1 flags 0x8 sys_perf_event_open failed, error -13 ------------------------------------------------------------ perf_event_attr: type 0 (PERF_TYPE_HARDWARE) config 0xa00000000 (cpu_atom/PERF_COUNT_HW_CPU_CYCLES/) disabled 1 exclude_kernel 1 ------------------------------------------------------------ sys_perf_event_open: pid 0 cpu -1 group_fd -1 flags 0x8 = 3 ------------------------------------------------------------ perf_event_attr: type 0 (PERF_TYPE_HARDWARE) config 0x400000000 (cpu_core/PERF_COUNT_HW_CPU_CYCLES/) disabled 1 ------------------------------------------------------------ sys_perf_event_open: pid 0 cpu -1 group_fd -1 flags 0x8 sys_perf_event_open failed, error -13 ------------------------------------------------------------ perf_event_attr: type 0 (PERF_TYPE_HARDWARE) config 0x400000000 (cpu_core/PERF_COUNT_HW_CPU_CYCLES/) disabled 1 exclude_kernel 1 ------------------------------------------------------------ sys_perf_event_open: pid 0 cpu -1 group_fd -1 flags 0x8 = 3 Attempt to add: software/cpu-clock/ ..after resolving event: software/config=0/ cpu-clock -> software/cpu-clock/ ------------------------------------------------------------ perf_event_attr: type 1 (PERF_TYPE_SOFTWARE) size 136 config 0x9 (PERF_COUNT_SW_DUMMY) sample_type IP|TID|TIME|CPU read_format ID|LOST disabled 1 inherit 1 mmap 1 comm 1 enable_on_exec 1 task 1 sample_id_all 1 mmap2 1 comm_exec 1 ksymbol 1 bpf_event 1 { wakeup_events, wakeup_watermark } 1 ------------------------------------------------------------ sys_perf_event_open: pid 1189569 cpu 0 group_fd -1 flags 0x8 sys_perf_event_open failed, error -13 perf_evlist__open: Permission denied ---- end(-2) ---- Leak of file descriptor 6 that opened: 'pipe:[14200347]' ---- unexpected signal (6) ---- iFailed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read 
build ID for //anon Failed to read build ID for //anon #0 0x565358f6666e in child_test_sig_handler builtin-test.c:311 #1 0x7f29ce849df0 in __restore_rt libc_sigaction.c:0 #2 0x7f29ce89e95c in __pthread_kill_implementation pthread_kill.c:44 #3 0x7f29ce849cc2 in raise raise.c:27 #4 0x7f29ce8324ac in abort abort.c:81 #5 0x565358f662d4 in check_leaks builtin-test.c:226 torvalds#6 0x565358f6682e in run_test_child builtin-test.c:344 torvalds#7 0x565358ef7121 in start_command run-command.c:128 torvalds#8 0x565358f67273 in start_test builtin-test.c:545 torvalds#9 0x565358f6771d in __cmd_test builtin-test.c:647 torvalds#10 0x565358f682bd in cmd_test builtin-test.c:849 torvalds#11 0x565358ee5ded in run_builtin perf.c:349 torvalds#12 0x565358ee6085 in handle_internal_command perf.c:401 torvalds#13 0x565358ee61de in run_argv perf.c:448 torvalds#14 0x565358ee6527 in main perf.c:555 torvalds#15 0x7f29ce833ca8 in __libc_start_call_main libc_start_call_main.h:74 torvalds#16 0x7f29ce833d65 in __libc_start_main@@GLIBC_2.34 libc-start.c:128 torvalds#17 0x565358e391c1 in _start perf[851c1] 7: PERF_RECORD_* events & perf_sample fields : FAILED! ``` After: ``` $ perf test 7 7: PERF_RECORD_* events & perf_sample fields : Skip (permissions) ``` Fixes: 16d00fe ("perf tests: Move test__PERF_RECORD into separate object") Signed-off-by: Ian Rogers <[email protected]> Tested-by: Arnaldo Carvalho de Melo <[email protected]> Cc: Adrian Hunter <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: Athira Rajeev <[email protected]> Cc: Chun-Tse Shao <[email protected]> Cc: Howard Chu <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: James Clark <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Kan Liang <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Namhyung Kim <[email protected]> Cc: Peter Zijlstra <[email protected]> Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
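The fix pattern is roughly the following (a sketch; evlist__cancel_workload() is assumed to be the teardown counterpart of the workload-prepare call used by the test):

```c
	/* Sketch: on open failure, reap the already-forked workload so its
	 * go-pipe fd does not leak past the test's fd-leak check. */
	err = evlist__open(evlist);
	if (err < 0) {
		if (err == -EACCES || err == -EPERM)
			err = TEST_SKIP;	/* e.g. perf_event_paranoid */
		evlist__cancel_workload(evlist);
		goto out_delete_evlist;
	}
```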
Also see the debugging info sent by Florian on the mailing list in "[RFC PATCH v3 nf-next] selftests: netfilter: Add bridge_fastpath.sh": net/bridge/br_private.h:1627 suspicious rcu_dereference_protected() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 7 locks held by socat/410: #0: ffff88800d7a9c90 (sk_lock-AF_INET){+.+.}-{0:0}, at: inet_stream_connect+0x43/0xa0 #1: ffffffff9a779900 (rcu_read_lock){....}-{1:3}, at: __ip_queue_xmit+0x62/0x1830 #2: ffffffff9a779900 (rcu_read_lock){....}-{1:3}, at: ip_output+0x57/0x3c0 #3: ffffffff9a779900 (rcu_read_lock){....}-{1:3}, at: ip_finish_output2+0x263/0x17d0 #4: ffffffff9a779900 (rcu_read_lock){....}-{1:3}, at: process_backlog+0x38a/0x14b0 #5: ffffffff9a779900 (rcu_read_lock){....}-{1:3}, at: netif_receive_skb_internal+0x83/0x330 torvalds#6: ffffffff9a779900 (rcu_read_lock){....}-{1:3}, at: nf_hook.constprop.0+0x8a/0x440 stack backtrace: CPU: 0 UID: 0 PID: 410 Comm: socat Not tainted 6.17.0-rc7-virtme #1 PREEMPT(full) Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 Call Trace: <IRQ> dump_stack_lvl+0x6f/0xb0 lockdep_rcu_suspicious.cold+0x4f/0xb1 br_vlan_fill_forward_path_pvid+0x32c/0x410 [bridge] br_fill_forward_path+0x7a/0x4d0 [bridge] ...
These iterations require the read lock, otherwise RCU lockdep will splat: ============================= WARNING: suspicious RCU usage 6.17.0-rc3-00014-g31419c045d64 torvalds#6 Tainted: G O ----------------------------- drivers/base/power/main.c:1333 RCU-list traversed in non-reader section!! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 5 locks held by rtcwake/547: #0: 00000000643ab418 (sb_writers#6){.+.+}-{0:0}, at: file_start_write+0x2b/0x3a #1: 0000000067a0ca88 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x181/0x24b #2: 00000000631eac40 (kn->active#3){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x191/0x24b #3: 00000000609a1308 (system_transition_mutex){+.+.}-{4:4}, at: pm_suspend+0xaf/0x30b #4: 0000000060c0fdb0 (device_links_srcu){.+.+}-{0:0}, at: device_links_read_lock+0x75/0x98 stack backtrace: CPU: 0 UID: 0 PID: 547 Comm: rtcwake Tainted: G O 6.17.0-rc3-00014-g31419c045d64 torvalds#6 VOLUNTARY Tainted: [O]=OOT_MODULE Stack: 223721b3a80 6089eac6 00000001 00000001 ffffff00 6089eac6 00000535 6086e528 721b3ac0 6003c294 00000000 60031fc0 Call Trace: [<600407ed>] show_stack+0x10e/0x127 [<6003c294>] dump_stack_lvl+0x77/0xc6 [<6003c2fd>] dump_stack+0x1a/0x20 [<600bc2f8>] lockdep_rcu_suspicious+0x116/0x13e [<603d8ea1>] dpm_async_suspend_superior+0x117/0x17e [<603d980f>] device_suspend+0x528/0x541 [<603da24b>] dpm_suspend+0x1a2/0x267 [<603da837>] dpm_suspend_start+0x5d/0x72 [<600ca0c9>] suspend_devices_and_enter+0xab/0x736 [...] Add the fourth argument to the iteration to annotate this and avoid the splat. Fixes: 0679963 ("PM: sleep: Make async suspend handle suppliers like parents") Fixes: ed18738 ("PM: sleep: Make async resume handle consumers like children") Signed-off-by: Johannes Berg <[email protected]> Link: https://patch.msgid.link/20250826134348.aba79f6e6299.I9ecf55da46ccf33778f2c018a82e1819d815b348@changeid Signed-off-by: Rafael J. Wysocki <[email protected]>
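For reference, the "fourth argument" is the optional cond parameter of list_for_each_entry_rcu(), which tells lockdep what, other than rcu_read_lock(), makes the traversal safe. Sketched against the lock named in the splat:

```c
/* Sketch: annotate an RCU list walk that is protected by device_links_srcu
 * rather than by rcu_read_lock(). */
list_for_each_entry_rcu(link, &dev->links.suppliers, node,
			device_links_read_lock_held())
	process_supplier(link);	/* hypothetical per-link work */
```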
Commit 0e2f80a("fs/dax: ensure all pages are idle prior to filesystem unmount") introduced the WARN_ON_ONCE to capture whether the filesystem has removed all DAX entries or not and applied the fix to xfs and ext4. Apply the missed fix on erofs to fix the runtime warning: [ 5.266254] ------------[ cut here ]------------ [ 5.266274] WARNING: CPU: 6 PID: 3109 at mm/truncate.c:89 truncate_folio_batch_exceptionals+0xff/0x260 [ 5.266294] Modules linked in: [ 5.266999] CPU: 6 UID: 0 PID: 3109 Comm: umount Tainted: G S 6.16.0+ torvalds#6 PREEMPT(voluntary) [ 5.267012] Tainted: [S]=CPU_OUT_OF_SPEC [ 5.267017] Hardware name: Dell Inc. OptiPlex 5000/05WXFV, BIOS 1.5.1 08/24/2022 [ 5.267024] RIP: 0010:truncate_folio_batch_exceptionals+0xff/0x260 [ 5.267076] Code: 00 00 41 39 df 7f 11 eb 78 83 c3 01 49 83 c4 08 41 39 df 74 6c 48 63 f3 48 83 fe 1f 0f 83 3c 01 00 00 43 f6 44 26 08 01 74 df <0f> 0b 4a 8b 34 22 4c 89 ef 48 89 55 90 e8 ff 54 1f 00 48 8b 55 90 [ 5.267083] RSP: 0018:ffffc900013f36c8 EFLAGS: 00010202 [ 5.267095] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 [ 5.267101] RDX: ffffc900013f3790 RSI: 0000000000000000 RDI: ffff8882a1407898 [ 5.267108] RBP: ffffc900013f3740 R08: 0000000000000000 R09: 0000000000000000 [ 5.267113] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 [ 5.267119] R13: ffff8882a1407ab8 R14: ffffc900013f3888 R15: 0000000000000001 [ 5.267125] FS: 00007aaa8b437800(0000) GS:ffff88850025b000(0000) knlGS:0000000000000000 [ 5.267132] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 5.267138] CR2: 00007aaa8b3aac10 CR3: 000000024f764000 CR4: 0000000000f52ef0 [ 5.267144] PKRU: 55555554 [ 5.267150] Call Trace: [ 5.267154] <TASK> [ 5.267181] truncate_inode_pages_range+0x118/0x5e0 [ 5.267193] ? save_trace+0x54/0x390 [ 5.267296] truncate_inode_pages_final+0x43/0x60 [ 5.267309] evict+0x2a4/0x2c0 [ 5.267339] dispose_list+0x39/0x80 [ 5.267352] evict_inodes+0x150/0x1b0 [ 5.267376] generic_shutdown_super+0x41/0x180 [ 5.267390] kill_block_super+0x1b/0x50 [ 5.267402] erofs_kill_sb+0x81/0x90 [erofs] [ 5.267436] deactivate_locked_super+0x32/0xb0 [ 5.267450] deactivate_super+0x46/0x60 [ 5.267460] cleanup_mnt+0xc3/0x170 [ 5.267475] __cleanup_mnt+0x12/0x20 [ 5.267485] task_work_run+0x5d/0xb0 [ 5.267499] exit_to_user_mode_loop+0x144/0x170 [ 5.267512] do_syscall_64+0x2b9/0x7c0 [ 5.267523] ? __lock_acquire+0x665/0x2ce0 [ 5.267535] ? __lock_acquire+0x665/0x2ce0 [ 5.267560] ? lock_acquire+0xcd/0x300 [ 5.267573] ? find_held_lock+0x31/0x90 [ 5.267582] ? mntput_no_expire+0x97/0x4e0 [ 5.267606] ? mntput_no_expire+0xa1/0x4e0 [ 5.267625] ? mntput+0x24/0x50 [ 5.267634] ? path_put+0x1e/0x30 [ 5.267647] ? do_faccessat+0x120/0x2f0 [ 5.267677] ? do_syscall_64+0x1a2/0x7c0 [ 5.267686] ? from_kgid_munged+0x17/0x30 [ 5.267703] ? from_kuid_munged+0x13/0x30 [ 5.267711] ? __do_sys_getuid+0x3d/0x50 [ 5.267724] ? do_syscall_64+0x1a2/0x7c0 [ 5.267732] ? irqentry_exit+0x77/0xb0 [ 5.267743] ? clear_bhb_loop+0x30/0x80 [ 5.267752] ? 
clear_bhb_loop+0x30/0x80 [ 5.267765] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 5.267772] RIP: 0033:0x7aaa8b32a9fb [ 5.267781] Code: c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 f3 0f 1e fa 31 f6 e9 05 00 00 00 0f 1f 44 00 00 f3 0f 1e fa b8 a6 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 e9 83 0d 00 f7 d8 [ 5.267787] RSP: 002b:00007ffd7c4c9468 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 [ 5.267796] RAX: 0000000000000000 RBX: 00005a61592a8b00 RCX: 00007aaa8b32a9fb [ 5.267802] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00005a61592b2080 [ 5.267806] RBP: 00007ffd7c4c9540 R08: 00007aaa8b403b20 R09: 0000000000000020 [ 5.267812] R10: 0000000000000001 R11: 0000000000000246 R12: 00005a61592a8c00 [ 5.267817] R13: 0000000000000000 R14: 00005a61592b2080 R15: 00005a61592a8f10 [ 5.267849] </TASK> [ 5.267854] irq event stamp: 4721 [ 5.267859] hardirqs last enabled at (4727): [<ffffffff814abf50>] __up_console_sem+0x90/0xa0 [ 5.267873] hardirqs last disabled at (4732): [<ffffffff814abf35>] __up_console_sem+0x75/0xa0 [ 5.267884] softirqs last enabled at (3044): [<ffffffff8132adb3>] kernel_fpu_end+0x53/0x70 [ 5.267895] softirqs last disabled at (3042): [<ffffffff8132b5f4>] kernel_fpu_begin_mask+0xc4/0x120 [ 5.267905] ---[ end trace 0000000000000000 ]--- Fixes: bde708f ("fs/dax: always remove DAX page-cache entries when breaking layouts") Signed-off-by: Yuezhang Mo <[email protected]> Reviewed-by: Friendy Su <[email protected]> Reviewed-by: Daniel Palmer <[email protected]> Reviewed-by: Gao Xiang <[email protected]> Signed-off-by: Gao Xiang <[email protected]>
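The fix presumably mirrors what the cited series did for xfs and ext4: break DAX layouts before the final truncate on inode eviction. A heavily hedged sketch, assuming the dax_break_layout_final() helper from that series and a simplified evict path:

```c
/* Sketch only: drop DAX page-cache entries before the final truncate,
 * so truncate_folio_batch_exceptionals() no longer trips its WARN_ON_ONCE. */
static void evict_inode_dax_aware(struct inode *inode)
{
	if (IS_DAX(inode))
		dax_break_layout_final(inode);

	truncate_inode_pages_final(&inode->i_data);
	clear_inode(inode);
}
```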
net/bridge/br_private.h:1627 suspicious rcu_dereference_protected() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 7 locks held by socat/410: #0: ffff88800d7a9c90 (sk_lock-AF_INET){+.+.}-{0:0}, at: inet_stream_connect+0x43/0xa0 #1: ffffffff9a779900 (rcu_read_lock){....}-{1:3}, at: __ip_queue_xmit+0x62/0x1830 [..] torvalds#6: ffffffff9a779900 (rcu_read_lock){....}-{1:3}, at: nf_hook.constprop.0+0x8a/0x440 Call Trace: lockdep_rcu_suspicious.cold+0x4f/0xb1 br_vlan_fill_forward_path_pvid+0x32c/0x410 [bridge] br_fill_forward_path+0x7a/0x4d0 [bridge] Use the correct helper; the non-_rcu variant requires the RTNL mutex. Fixes: bcf2766 ("net: bridge: resolve forwarding path for VLAN tag actions in bridge devices") Signed-off-by: Eric Woudstra <[email protected]> Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: NipaLocal <nipa@local>
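The class of fix the splat calls for: inside an RCU read-side path, fetch the VLAN group with the _rcu accessor instead of the RTNL-protected one. Sketched, since the actual call site is in br_vlan_fill_forward_path_pvid():

```c
/* Sketch: the two bridge VLAN-group accessors differ in their lockdep
 * expectations. */
vg = br_vlan_group(br);		/* asserts RTNL; splats under rcu_read_lock() */
vg = br_vlan_group_rcu(br);	/* correct inside an RCU read-side section */
```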
The following lockdep splat was observed while the kernel auto-onlined a CXL memory region:

[ 51.926183] ======================================================
[ 51.933441] WARNING: possible circular locking dependency detected
[ 51.940701] 6.17.0djtest+ #53 Tainted: G W
[ 51.947290] ------------------------------------------------------
[ 51.954553] systemd-udevd/3334 is trying to acquire lock:
[ 51.960938] ffffffff90346188 (hmem_resource_lock){+.+.}-{4:4}, at: hmem_register_resource+0x31/0x50
[ 51.971429] but task is already holding lock:
[ 51.978548] ffffffff90338890 ((node_chain).rwsem){++++}-{4:4}, at: blocking_notifier_call_chain+0x2e/0x70
[ 51.989621] which lock already depends on the new lock.
[ 51.999605] the existing dependency chain (in reverse order) is:
[ 52.008539] -> #6 ((node_chain).rwsem){++++}-{4:4}:
[ 52.016195]        down_read+0x45/0x190
[ 52.020789]        blocking_notifier_call_chain+0x2e/0x70
[ 52.027131]        node_notify+0x1f/0x30
[ 52.031809]        online_pages+0xc1/0x330
[ 52.036684]        memory_subsys_online+0x22a/0x280
[ 52.042431]        device_online+0x50/0x90
[ 52.047298]        state_store+0x9b/0xa0
[ 52.051956]        dev_attr_store+0x18/0x30
[ 52.056907]        sysfs_kf_write+0x4e/0x70
[ 52.061854]        kernfs_fop_write_iter+0x187/0x260
[ 52.067673]        vfs_write+0x21f/0x590
[ 52.072313]        ksys_write+0x73/0xf0
[ 52.076854]        __x64_sys_write+0x1d/0x30
[ 52.081874]        x64_sys_call+0x7d/0x1d80
[ 52.086797]        do_syscall_64+0x6c/0x2f0
[ 52.091717]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 52.098198] -> #5 (mem_hotplug_lock){++++}-{0:0}:
[ 52.105512]        percpu_down_write+0x4b/0x260
[ 52.110825]        try_online_node+0x21/0x50
[ 52.115844]        cpu_up+0x43/0xd0
[ 52.119989]        cpuhp_bringup_mask+0x60/0xa0
[ 52.125305]        bringup_nonboot_cpus+0x76/0x110
[ 52.130912]        smp_init+0x2e/0x90
[ 52.135235]        kernel_init_freeable+0x19a/0x300
[ 52.140930]        kernel_init+0x1e/0x140
[ 52.145635]        ret_from_fork+0x159/0x200
[ 52.150633]        ret_from_fork_asm+0x1a/0x30
[ 52.155826] -> #4 (cpu_hotplug_lock){++++}-{0:0}:
[ 52.163081]        __cpuhp_state_add_instance+0x51/0x200
[ 52.169238]        iova_domain_init_rcaches+0x1ed/0x200
[ 52.175301]        iommu_setup_dma_ops+0x1b4/0x500
[ 52.180877]        bus_iommu_probe+0xd2/0x180
[ 52.185954]        iommu_device_register+0x9f/0xe0
[ 52.191530]        intel_iommu_init+0xd3b/0xf20
[ 52.196810]        pci_iommu_init+0x16/0x40
[ 52.201695]        do_one_initcall+0x5c/0x2d0
[ 52.206767]        kernel_init_freeable+0x281/0x300
[ 52.212432]        kernel_init+0x1e/0x140
[ 52.217109]        ret_from_fork+0x159/0x200
[ 52.222082]        ret_from_fork_asm+0x1a/0x30
[ 52.227253] -> #3 (&group->mutex){+.+.}-{4:4}:
[ 52.234196]        __mutex_lock+0xa9/0x11e0
[ 52.239066]        mutex_lock_nested+0x1f/0x30
[ 52.244236]        __iommu_probe_device+0x28c/0x5e0
[ 52.249893]        probe_iommu_group+0x2f/0x50
[ 52.255064]        bus_for_each_dev+0x7e/0xd0
[ 52.260126]        bus_iommu_probe+0x3f/0x180
[ 52.265190]        iommu_device_register+0x9f/0xe0
[ 52.270751]        intel_iommu_init+0xd3b/0xf20
[ 52.276016]        pci_iommu_init+0x16/0x40
[ 52.280892]        do_one_initcall+0x5c/0x2d0
[ 52.285956]        kernel_init_freeable+0x281/0x300
[ 52.291613]        kernel_init+0x1e/0x140
[ 52.296284]        ret_from_fork+0x159/0x200
[ 52.301253]        ret_from_fork_asm+0x1a/0x30
[ 52.306421] -> #2 (iommu_probe_device_lock){+.+.}-{4:4}:
[ 52.314333]        __mutex_lock+0xa9/0x11e0
[ 52.319201]        mutex_lock_nested+0x1f/0x30
[ 52.324372]        iommu_probe_device+0x21/0x70
[ 52.329638]        iommu_bus_notifier+0x2c/0x80
[ 52.334903]        notifier_call_chain+0x4b/0x110
[ 52.340357]        blocking_notifier_call_chain+0x4a/0x70
[ 52.346594]        bus_notify+0x3b/0x50
[ 52.351079]        device_add+0x65d/0x8b0
[ 52.355750]        platform_device_add+0xf8/0x250
[ 52.361205]        platform_device_register_full+0x154/0x1f0
[ 52.367739]        platform_device_register_simple.constprop.0.isra.0+0x37/0x50
[ 52.376119]        efisubsys_init+0xaf/0x570
[ 52.381090]        do_one_initcall+0x5c/0x2d0
[ 52.386152]        kernel_init_freeable+0x281/0x300
[ 52.391809]        kernel_init+0x1e/0x140
[ 52.396481]        ret_from_fork+0x159/0x200
[ 52.401450]        ret_from_fork_asm+0x1a/0x30
[ 52.406620] -> #1 (&(&priv->bus_notifier)->rwsem){++++}-{4:4}:
[ 52.415109]        down_read+0x45/0x190
[ 52.419593]        blocking_notifier_call_chain+0x2e/0x70
[ 52.425828]        bus_notify+0x3b/0x50
[ 52.430311]        device_add+0x65d/0x8b0
[ 52.434981]        platform_device_add+0xf8/0x250
[ 52.440435]        __hmem_register_resource+0x70/0xc0
[ 52.446279]        hmem_register_resource+0x3b/0x50
[ 52.451923]        hmat_register_target+0x3c/0x190
[ 52.457488]        hmat_init+0x13f/0x370
[ 52.462067]        do_one_initcall+0x5c/0x2d0
[ 52.467132]        kernel_init_freeable+0x281/0x300
[ 52.472790]        kernel_init+0x1e/0x140
[ 52.477464]        ret_from_fork+0x159/0x200
[ 52.482433]        ret_from_fork_asm+0x1a/0x30
[ 52.487604] -> #0 (hmem_resource_lock){+.+.}-{4:4}:
[ 52.495030]        __lock_acquire+0x14a4/0x2290
[ 52.500290]        lock_acquire+0xdd/0x2f0
[ 52.505070]        __mutex_lock+0xa9/0x11e0
[ 52.509944]        mutex_lock_nested+0x1f/0x30
[ 52.515115]        hmem_register_resource+0x31/0x50
[ 52.520771]        hmat_register_target+0x3c/0x190
[ 52.526319]        hmat_callback+0x6b/0x80
[ 52.531098]        notifier_call_chain+0x4b/0x110
[ 52.536552]        blocking_notifier_call_chain+0x4a/0x70
[ 52.542788]        node_notify+0x1f/0x30
[ 52.547369]        online_pages+0x288/0x330
[ 52.552246]        memory_subsys_online+0x22a/0x280
[ 52.557902]        device_online+0x50/0x90
[ 52.562669]        state_store+0x9b/0xa0
[ 52.567247]        dev_attr_store+0x18/0x30
[ 52.572123]        sysfs_kf_write+0x4e/0x70
[ 52.576998]        kernfs_fop_write_iter+0x187/0x260
[ 52.582750]        vfs_write+0x21f/0x590
[ 52.587327]        ksys_write+0x73/0xf0
[ 52.591811]        __x64_sys_write+0x1d/0x30
[ 52.596779]        x64_sys_call+0x7d/0x1d80
[ 52.601653]        do_syscall_64+0x6c/0x2f0
[ 52.606528]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 52.612968] other info that might help us debug this:
[ 52.622356] Chain exists of: hmem_resource_lock --> mem_hotplug_lock --> (node_chain).rwsem
[ 52.635550] Possible unsafe locking scenario:
[ 52.642495]        CPU0                    CPU1
[ 52.647752]        ----                    ----
[ 52.653014]   rlock((node_chain).rwsem);
[ 52.657589]                                lock(mem_hotplug_lock);
[ 52.664701]                                lock((node_chain).rwsem);
[ 52.672015]   lock(hmem_resource_lock);
[ 52.676497] *** DEADLOCK ***
[ 52.683541] 8 locks held by systemd-udevd/3334:
[ 52.688801] #0: ff36b6d49fbf0410 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0x73/0xf0
[ 52.697870] #1: ff36b6d4ece03a88 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x12c/0x260
[ 52.708210] #2: ff36b6d4ece1cbb8 (kn->active#62){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x141/0x260
[ 52.718645] #3: ffffffff90333cc8 (device_hotplug_lock){+.+.}-{4:4}, at: lock_device_hotplug_sysfs+0x1b/0x50
[ 52.729863] #4: ff36b6d4ece4b108 (&dev->mutex){....}-{4:4}, at: device_online+0x23/0x90
[ 52.739130] #5: ffffffff900664d0 (cpu_hotplug_lock){++++}-{0:0}, at: mem_hotplug_begin+0x12/0x30
[ 52.749288] #6: ffffffff9024c810 (mem_hotplug_lock){++++}-{0:0}, at: mem_hotplug_begin+0x1e/0x30
[ 52.759446] #7: ffffffff90338890 ((node_chain).rwsem){++++}-{4:4}, at: blocking_notifier_call_chain+0x2e/0x70
[ 52.770860] stack backtrace:
[ 52.776068] CPU: 0 UID: 0 PID: 3334 Comm: systemd-udevd Tainted: G W 6.17.0djtest+ #53 PREEMPT(voluntary)
[ 52.776071] Tainted: [W]=WARN
[ 52.776072] Hardware name: Intel Corporation AvenueCity/AvenueCity, BIOS BHSDCRB1.IPC.3545.P03.2509232237 09/23/2025
[ 52.776073] Call Trace:
[ 52.776074] <TASK>
[ 52.776076] dump_stack_lvl+0x72/0xa0
[ 52.776080] dump_stack+0x14/0x1a
[ 52.776082] print_circular_bug.cold+0x188/0x1c6
[ 52.776084] check_noncircular+0x12f/0x160
[ 52.776087] ? __lock_acquire+0x486/0x2290
[ 52.776089] ? __lock_acquire+0x486/0x2290
[ 52.776091] __lock_acquire+0x14a4/0x2290
[ 52.776095] lock_acquire+0xdd/0x2f0
[ 52.776096] ? hmem_register_resource+0x31/0x50
[ 52.776100] ? hmem_register_resource+0x31/0x50
[ 52.776101] __mutex_lock+0xa9/0x11e0
[ 52.776104] ? hmem_register_resource+0x31/0x50
[ 52.776104] ? __kernfs_create_file+0xb5/0x110
[ 52.776110] mutex_lock_nested+0x1f/0x30
[ 52.776112] ? mutex_lock_nested+0x1f/0x30
[ 52.776114] hmem_register_resource+0x31/0x50
[ 52.776115] hmat_register_target+0x3c/0x190
[ 52.776119] hmat_callback+0x6b/0x80
[ 52.776120] notifier_call_chain+0x4b/0x110
[ 52.776123] blocking_notifier_call_chain+0x4a/0x70
[ 52.776125] node_notify+0x1f/0x30
[ 52.776126] online_pages+0x288/0x330
[ 52.776129] memory_subsys_online+0x22a/0x280
[ 52.776132] device_online+0x50/0x90
[ 52.776134] state_store+0x9b/0xa0
[ 52.776136] dev_attr_store+0x18/0x30
[ 52.776137] sysfs_kf_write+0x4e/0x70
[ 52.776139] kernfs_fop_write_iter+0x187/0x260
[ 52.776142] vfs_write+0x21f/0x590
[ 52.776146] ksys_write+0x73/0xf0
[ 52.776148] __x64_sys_write+0x1d/0x30
[ 52.776150] x64_sys_call+0x7d/0x1d80
[ 52.776152] do_syscall_64+0x6c/0x2f0
[ 52.776154] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 52.776156] RIP: 0033:0x7f11142fda57
[ 52.776158] Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[ 52.776160] RSP: 002b:00007ffd0bd530f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 52.776163] RAX: ffffffffffffffda RBX: 000000000000000e RCX: 00007f11142fda57
[ 52.776164] RDX: 000000000000000e RSI: 00007ffd0bd537c0 RDI: 0000000000000006
[ 52.776166] RBP: 00007ffd0bd537c0 R08: 00007f11143f70a0 R09: 00007ffd0bd53190
[ 52.776167] R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000e
[ 52.776168] R13: 000055814e03e780 R14: 000000000000000e R15: 00007f11143f69e0
[ 52.776171] </TASK>

This lock ordering can cause a deadlock: there are paths where hmem_resource_lock is taken after (node_chain).rwsem, and others where the order is reversed. Narrow the scope of hmem_resource_lock in hmem_register_resource() to break the circular dependency; the lock is only needed while hmem_active is being modified.

Fixes: 7dab174 ("dax/hmem: Move hmem device registration to dax_hmem.ko")
Signed-off-by: Dave Jiang <[email protected]>
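A sketch of that narrowed locking, hand-written here for illustration (the names hmem_resource_lock, hmem_active, and __hmem_register_resource() follow drivers/dax/hmem, but the bodies are an assumption, not the actual upstream diff):

  /*
   * Illustrative sketch of narrowing the critical section. Only the
   * hmem_active resource tree needs hmem_resource_lock; the device
   * registration path -- which fires blocking notifier chains and
   * closed the cycle above -- runs with the lock already dropped.
   */
  static DEFINE_MUTEX(hmem_resource_lock);

  void hmem_register_resource(int target_nid, struct resource *res)
  {
          struct resource *new;

          mutex_lock(&hmem_resource_lock);
          new = __request_region(&hmem_active, res->start,
                                 resource_size(res), "hmem", 0);
          mutex_unlock(&hmem_resource_lock);      /* drop before registering */

          if (new)
                  __hmem_register_resource(target_nid, new);
  }

The point of the shape is that no lock acquired inside the notifier call chain can ever nest under hmem_resource_lock, so the #6 -> ... -> #0 chain above cannot form.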
net/bridge/br_private.h:1627 suspicious rcu_dereference_protected() usage!

other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
7 locks held by socat/410:
 #0: ffff88800d7a9c90 (sk_lock-AF_INET){+.+.}-{0:0}, at: inet_stream_connect+0x43/0xa0
 #1: ffffffff9a779900 (rcu_read_lock){....}-{1:3}, at: __ip_queue_xmit+0x62/0x1830
 [..]
 #6: ffffffff9a779900 (rcu_read_lock){....}-{1:3}, at: nf_hook.constprop.0+0x8a/0x440

Call Trace:
 lockdep_rcu_suspicious.cold+0x4f/0xb1
 br_vlan_fill_forward_path_pvid+0x32c/0x410 [bridge]
 br_fill_forward_path+0x7a/0x4d0 [bridge]

Use the correct helper: the non-_rcu variant requires the RTNL mutex.

Fixes: bcf2766 ("net: bridge: resolve forwarding path for VLAN tag actions in bridge devices")
Signed-off-by: Eric Woudstra <[email protected]>
Signed-off-by: Florian Westphal <[email protected]>
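For context, the bridge keeps paired accessors for the RCU-managed VLAN group pointer, and the fix amounts to calling the RCU-safe one from the forward-path code. A minimal sketch of that pair, modeled on net/bridge/br_private.h but simplified here:

  /*
   * Sketch of the accessor pair (modeled on net/bridge/br_private.h,
   * simplified). The plain variant asserts the RTNL mutex via
   * rtnl_dereference(); the _rcu variant is for packet-path readers
   * that run under rcu_read_lock() only -- the context the splat
   * above shows for br_fill_forward_path().
   */
  static inline struct net_bridge_vlan_group *
  nbp_vlan_group(const struct net_bridge_port *p)
  {
          return rtnl_dereference(p->vlgrp);      /* writer side: RTNL held */
  }

  static inline struct net_bridge_vlan_group *
  nbp_vlan_group_rcu(const struct net_bridge_port *p)
  {
          return rcu_dereference(p->vlgrp);       /* reader side: rcu_read_lock() */
  }

br_vlan_fill_forward_path_pvid() runs in the second category, so the _rcu accessor is the one lockdep expects there.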
Holding dev_mutex across c4iw_remove() during module exit can lead to a lockdep warning and potential deadlock: the RDMA core takes global locks (e.g. devices_rwsem) inside ib_unregister_device(), which may conflict with the locking order used elsewhere in the driver.

======================================================
WARNING: possible circular locking dependency detected
6.12.0-124.5.1.el10_1.x86_64+debug #1 Not tainted
------------------------------------------------------
rmmod/3524 is trying to acquire lock:
ffffffffc1c0dd18 (devices_rwsem){++++}-{4:4}, at: disable_device+0xaf/0x240 [ib_core]

but task is already holding lock:
ffff889104e44708 (&device->unregistration_lock){+.+.}-{4:4}, at: __ib_unregister_device+0x209/0x460 [ib_core]

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #6 (&device->unregistration_lock){+.+.}-{4:4}:
       __lock_acquire+0x559/0xb80
       lock_acquire.part.0+0xbe/0x270
       __mutex_lock+0x18b/0x12b0
       __ib_unregister_device+0x209/0x460 [ib_core]
       ib_unregister_device+0x25/0x30 [ib_core]
       c4iw_remove+0xce/0xda [iw_cxgb4]
       c4iw_exit_module+0x7d/0xe0 [iw_cxgb4]
       __do_sys_delete_module.isra.0+0x33a/0x540
       do_syscall_64+0x92/0x180
       entry_SYSCALL_64_after_hwframe+0x76/0x7e

-> #5 (dev_mutex){+.+.}-{4:4}:
       __lock_acquire+0x559/0xb80
       lock_acquire.part.0+0xbe/0x270
       __mutex_lock+0x18b/0x12b0
       c4iw_uld_add+0x137/0x500 [iw_cxgb4]
       uld_attach+0x908/0xd80 [cxgb4]
       cxgb4_uld_alloc_resources.part.0+0x364/0x1120 [cxgb4]
       cxgb4_register_uld+0x10c/0x400 [cxgb4]
       c4iw_init_module+0x77/0x80 [iw_cxgb4]
       do_one_initcall+0xa5/0x260
       do_init_module+0x238/0x7c0
       init_module_from_file+0xdf/0x150
       idempotent_init_module+0x230/0x770
       __x64_sys_finit_module+0xbe/0x130
       do_syscall_64+0x92/0x180
       entry_SYSCALL_64_after_hwframe+0x76/0x7e

-> #4 (uld_mutex){+.+.}-{4:4}:
       __lock_acquire+0x559/0xb80
       lock_acquire.part.0+0xbe/0x270
       __mutex_lock+0x18b/0x12b0
       cxgb_up+0x24/0xee0 [cxgb4]
       cxgb_open+0x7e/0x250 [cxgb4]
       __dev_open+0x241/0x420
       __dev_change_flags+0x44c/0x660
       dev_change_flags+0x80/0x160
       do_setlink+0x1acd/0x23e0
       __rtnl_newlink+0xb07/0xe40
       rtnl_newlink+0x62/0x90
       rtnetlink_rcv_msg+0x2f3/0xb20
       netlink_rcv_skb+0x13d/0x3b0
       netlink_unicast+0x42e/0x720
       netlink_sendmsg+0x765/0xc20
       ____sys_sendmsg+0x974/0xc60
       ___sys_sendmsg+0xfd/0x180
       __sys_sendmsg+0xe8/0x190
       do_syscall_64+0x92/0x180
       entry_SYSCALL_64_after_hwframe+0x76/0x7e

-> #3 (rtnl_mutex){+.+.}-{4:4}:
       __lock_acquire+0x559/0xb80
       lock_acquire.part.0+0xbe/0x270
       __mutex_lock+0x18b/0x12b0
       ib_get_eth_speed+0xee/0x9d0 [ib_core]
       ib_query_port+0x140/0x1f0 [ib_core]
       ib_setup_port_attrs+0x1a5/0x4c0 [ib_core]
       add_one_compat_dev+0x4bd/0x7b0 [ib_core]
       rdma_dev_init_net+0x257/0x3e0 [ib_core]
       ops_init+0x109/0x300
       setup_net+0x1c4/0x730
       copy_net_ns+0x23b/0x540
       create_new_namespaces+0x358/0x920
       unshare_nsproxy_namespaces+0x8a/0x1b0
       ksys_unshare+0x2df/0x740
       __x64_sys_unshare+0x31/0x40
       do_syscall_64+0x92/0x180
       entry_SYSCALL_64_after_hwframe+0x76/0x7e

-> #2 (&device->compat_devs_mutex){+.+.}-{4:4}:
       __lock_acquire+0x559/0xb80
       lock_acquire.part.0+0xbe/0x270
       __mutex_lock+0x18b/0x12b0
       add_one_compat_dev+0xe0/0x7b0 [ib_core]
       rdma_dev_init_net+0x257/0x3e0 [ib_core]
       ops_init+0x109/0x300
       setup_net+0x1c4/0x730
       copy_net_ns+0x23b/0x540
       create_new_namespaces+0x358/0x920
       unshare_nsproxy_namespaces+0x8a/0x1b0
       ksys_unshare+0x2df/0x740
       __x64_sys_unshare+0x31/0x40
       do_syscall_64+0x92/0x180
       entry_SYSCALL_64_after_hwframe+0x76/0x7e

-> #1 (rdma_nets_rwsem){++++}-{4:4}:
       __lock_acquire+0x559/0xb80
       lock_acquire.part.0+0xbe/0x270
       down_read+0xa3/0x4b0
       enable_device_and_get+0x26b/0x350 [ib_core]
       ib_register_device+0x1c3/0x4f0 [ib_core]
       bnxt_re_ib_init+0x433/0x530 [bnxt_re]
       bnxt_re_add_device+0x60d/0x760 [bnxt_re]
       bnxt_re_probe+0xcf/0x140 [bnxt_re]
       auxiliary_bus_probe+0xa1/0xf0
       really_probe+0x1e0/0x8a0
       __driver_probe_device+0x18c/0x370
       driver_probe_device+0x4a/0x120
       __driver_attach+0x194/0x4a0
       bus_for_each_dev+0x106/0x190
       bus_add_driver+0x2a1/0x4d0
       driver_register+0x1a5/0x360
       __auxiliary_driver_register+0x152/0x240
       c4iw_init_module+0x43/0x80 [iw_cxgb4]
       do_one_initcall+0xa5/0x260
       do_init_module+0x238/0x7c0
       init_module_from_file+0xdf/0x150
       idempotent_init_module+0x230/0x770
       __x64_sys_finit_module+0xbe/0x130
       do_syscall_64+0x92/0x180
       entry_SYSCALL_64_after_hwframe+0x76/0x7e

-> #0 (devices_rwsem){++++}-{4:4}:
       check_prev_add+0xf1/0xce0
       validate_chain+0x481/0x560
       __lock_acquire+0x559/0xb80
       lock_acquire.part.0+0xbe/0x270
       down_write+0x99/0x220
       disable_device+0xaf/0x240 [ib_core]
       __ib_unregister_device+0x26f/0x460 [ib_core]
       ib_unregister_device+0x25/0x30 [ib_core]
       c4iw_remove+0xce/0xda [iw_cxgb4]
       c4iw_exit_module+0x7d/0xe0 [iw_cxgb4]
       __do_sys_delete_module.isra.0+0x33a/0x540
       do_syscall_64+0x92/0x180
       entry_SYSCALL_64_after_hwframe+0x76/0x7e

other info that might help us debug this:

Chain exists of: devices_rwsem --> dev_mutex --> &device->unregistration_lock

Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&device->unregistration_lock);
                               lock(dev_mutex);
                               lock(&device->unregistration_lock);
  lock(devices_rwsem);

*** DEADLOCK ***

This patch fixes the issue by moving all uld_ctx entries from the global uld_ctx_list to a temporary local list while holding dev_mutex, then releasing the mutex before calling c4iw_remove() and freeing the contexts. This prevents the lock inversion while still avoiding races on the shared list.

Signed-off-by: Kamal Heib <[email protected]>
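The pattern behind the fix is the usual one for this kind of inversion: detach entries from the shared list under the lock, then do the heavy unregistration on a private list with the lock dropped. A simplified sketch using the iw_cxgb4 names from the trace (the bodies are illustrative, not the literal diff):

  /*
   * Simplified sketch of the exit path described above. dev_mutex only
   * guards the list manipulation; c4iw_remove() -- which acquires RDMA
   * core locks such as devices_rwsem -- runs with dev_mutex released.
   */
  static void __exit c4iw_exit_module(void)
  {
          struct uld_ctx *ctx, *tmp;
          LIST_HEAD(remove_list);

          mutex_lock(&dev_mutex);
          list_splice_init(&uld_ctx_list, &remove_list);
          mutex_unlock(&dev_mutex);

          list_for_each_entry_safe(ctx, tmp, &remove_list, entry) {
                  if (ctx->dev)
                          c4iw_remove(ctx);
                  kfree(ctx);
          }

          cxgb4_unregister_uld(CXGB4_ULD_RDMA);
  }

list_splice_init() leaves uld_ctx_list empty, so any concurrent reader that takes dev_mutex sees a consistent (empty) list rather than half-torn-down entries.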
- treat tailcall count as 32-bit for access and update
- change out_offset scope from file to function
- minor format/structure changes for consistency

Testing: (skipping fentry, fexit, freplace)
========

root@qemu-armhf:/usr/libexec/kselftests-bpf# modprobe test_bpf test_suite=test_tail_calls
test_bpf: #0 Tail call leaf jited:1 967 PASS
test_bpf: #1 Tail call 2 jited:1 1427 PASS
test_bpf: #2 Tail call 3 jited:1 2373 PASS
test_bpf: #3 Tail call 4 jited:1 2304 PASS
test_bpf: #4 Tail call load/store leaf jited:1 1684 PASS
test_bpf: #5 Tail call load/store jited:1 2249 PASS
test_bpf: #6 Tail call error path, max count reached jited:1 22538 PASS
test_bpf: #7 Tail call count preserved across function calls jited:1 1055668 PASS
test_bpf: #8 Tail call error path, NULL target jited:1 513 PASS
test_bpf: #9 Tail call error path, index out of range jited:1 392 PASS
test_bpf: test_tail_calls: Summary: 10 PASSED, 0 FAILED, [10/10 JIT'ed]

root@qemu-armhf:/usr/libexec/kselftests-bpf# ./test_progs -n 397/1-12,17-18,23-24,27-31
397/1 tailcalls/tailcall_1:OK
397/2 tailcalls/tailcall_2:OK
397/3 tailcalls/tailcall_3:OK
397/4 tailcalls/tailcall_4:OK
397/5 tailcalls/tailcall_5:OK
397/6 tailcalls/tailcall_6:OK
397/7 tailcalls/tailcall_bpf2bpf_1:OK
397/8 tailcalls/tailcall_bpf2bpf_2:OK
397/9 tailcalls/tailcall_bpf2bpf_3:OK
397/10 tailcalls/tailcall_bpf2bpf_4:OK
397/11 tailcalls/tailcall_bpf2bpf_5:OK
397/12 tailcalls/tailcall_bpf2bpf_6:OK
397/17 tailcalls/tailcall_poke:OK
397/18 tailcalls/tailcall_bpf2bpf_hierarchy_1:OK
397/23 tailcalls/tailcall_bpf2bpf_hierarchy_2:OK
397/24 tailcalls/tailcall_bpf2bpf_hierarchy_3:OK
397/27 tailcalls/tailcall_failure:OK
397/28 tailcalls/reject_tail_call_spin_lock:OK
397/29 tailcalls/reject_tail_call_rcu_lock:OK
397/30 tailcalls/reject_tail_call_preempt_lock:OK
397/31 tailcalls/reject_tail_call_ref:OK
397 tailcalls:OK
Summary: 1/21 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Tony Ambardar <[email protected]>
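As background for the first change listed above, the guard the JIT emits around a tail call boils down to a bounds check plus a counted limit. A hand-written C rendering under the 32-bit treatment (illustrative only; the real thing is emitted arm instructions, and only MAX_TAIL_CALL_CNT, from include/linux/bpf.h, is a real kernel name):

  /*
   * Illustrative C equivalent of the tail-call guard a JIT emits
   * (hand-written sketch, not the emitted code). Treating the count
   * as 32-bit lets arm32 use single-register load/compare/store
   * operations instead of a 64-bit register pair.
   */
  static int tail_call_guard(u32 *tcc_ptr, u32 index, u32 max_entries)
  {
          u32 cnt = *tcc_ptr;             /* 32-bit access */

          if (index >= max_entries)
                  return -1;              /* out of range: fall through */
          if (cnt >= MAX_TAIL_CALL_CNT)
                  return -1;              /* limit reached: fall through */
          *tcc_ptr = cnt + 1;             /* 32-bit update */
          return 0;                       /* OK to jump to target program */
  }

The "max count reached" and "count preserved across function calls" cases in the test log above exercise exactly this counter.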
Thanks for sharing Linux on GitHub! The beer is free too!