Bugzilla – Bug 1214304
Backport arm64: module: rework module VA range selection
Last modified: 2024-06-25 17:54:09 UTC
This bug addresses a design issue: when KASLR is disabled, a single 128 MB region is shared between the kernel and modules. In the presence of large modules, e.g. the NVIDIA GPU driver, this region can saturate quickly and prevent other modules from loading. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3e35d303ab7d22c4b6597e56ba46ee7cc61f3a5a
This fix needs to be backported to SLES 15 SP5.
In my opinion, this is not necessarily a bug per se (perhaps more of a performance issue?). From the bug report I assume that kernel modules exceeding roughly 128 MB fail to load or, conversely, that once the 128 MB region devoted to module loading has been filled by big modules, any attempt to load further (smaller) modules fails because the region is already full. This does not seem to be the case with the current 5.14 (SLE15-SP5) kernel: after a first (failed) attempt to allocate the module in the 128 MB region, it falls back to the vmalloc area and succeeds. Indeed, I created a simple test module with several hundred megabytes in its .text section, and it can be insmod-ed successfully. Of course, being allocated in vmalloc memory means calls to exported symbols go through PLT indirection instead of a local branch, and this could impact performance a little (still not noticeable, I would say). So I am wondering whether the backport request is really about having a bigger reserved module space per se (e.g. from a performance point of view) or a consequence of some actually observed error. Can you please post any error messages related to this issue, if any?
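As background on why PLTs enter the picture at all: the 128 MB module region mirrors the reach of an arm64 direct branch. B/BL carry a signed 26-bit word offset, giving ±128 MiB; a module placed farther from the kernel text cannot branch to exported symbols directly and needs PLT indirection. A quick back-of-the-envelope check (plain Python, illustrative only):

```python
# arm64 B/BL instructions carry a signed 26-bit immediate counted in
# 4-byte instructions, so a direct branch reaches +/- 2^25 * 4 bytes.
IMM_BITS = 26
reach = (1 << (IMM_BITS - 1)) * 4  # bytes, one direction

# That is exactly the 128 MB module region discussed above: modules
# inside it can branch to kernel symbols directly, while modules that
# fall back to the distant vmalloc area need PLTs
# (CONFIG_ARM64_MODULE_PLTS).
assert reach == 128 * 1024 * 1024
```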
This is a functional bug that was observed when testing with the NVIDIA GPU driver (which is quite large). Here are some threads discussing the issue: https://lore.kernel.org/all/20230326170756.3021936-1-sdonthineni@nvidia.com/ https://lore.kernel.org/all/20230404135437.2744866-1-ardb@kernel.org/
(In reply to Matt Ochs from comment #3)
> This is a functional bug that was observed when testing with the NVIDIA GPU
> driver (which is quite large).
>
> Here are some threads discussing the issue:
>
> https://lore.kernel.org/all/20230326170756.3021936-1-sdonthineni@nvidia.com/
>
> https://lore.kernel.org/all/20230404135437.2744866-1-ardb@kernel.org/

The patch you're proposing deals with dropping the following snippet (wrt our current kernel source in SP5):

-	if (!p && IS_ENABLED(CONFIG_ARM64_MODULE_PLTS) &&
-	    (IS_ENABLED(CONFIG_KASAN_VMALLOC) ||
-	     (!IS_ENABLED(CONFIG_KASAN_GENERIC) &&
-	      !IS_ENABLED(CONFIG_KASAN_SW_TAGS))))

According to our .config, this conditional currently evaluates to true when the kernel cannot allocate module memory in the reserved region (128 MB), and it then proceeds to allocate it (as per the 'if' body) from the 2 GB vmalloc-ed area. So the code should already work, as already tested with a huge (> 200 MB) module. Since the reports you mentioned have to do with Ubuntu, my understanding is that the Ubuntu kernel quite surely has a different config than ours (not to mention different kernel code), so it may hit the bug by not falling back to the 2 GB area. Can we by any chance reproduce this bug on an actual SLE15-SP5, or is it just speculation?
Yes, I can recreate this on SLES 15 SP5. A large module will load fine with KASLR enabled but fails to load when KASLR is disabled.

----------------------------------------------------------------------------
# cat /etc/os-release
NAME="SLES"
VERSION="15-SP5"
VERSION_ID="15.5"
PRETTY_NAME="SUSE Linux Enterprise Server 15 SP5"
ID="sles"
ID_LIKE="suse"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sles:15:sp5"
DOCUMENTATION_URL="https://documentation.suse.com/"

# uname -a
Linux host 5.14.21-150500.55.19-64kb #1 SMP PREEMPT_DYNAMIC Tue Aug 8 22:15:01 UTC 2023 (9908c29) aarch64 aarch64 aarch64 GNU/Linux

# grep CONFIG_ARM64_MODULE_PLTS /boot/config-$(uname -r)
CONFIG_ARM64_MODULE_PLTS=y
# grep CONFIG_KASAN_VMALLOC /boot/config-$(uname -r)
# grep CONFIG_KASAN_GENERIC /boot/config-$(uname -r)
# grep CONFIG_KASAN_SW_TAGS /boot/config-$(uname -r)

# size bingo.ko
     text    data       bss       dec       hex filename
134218249     976 134217728 268436953 100005d9 bingo.ko

----------------------------------------------------------------------------
# dmesg | grep -i kaslr
[    4.863533] KASLR enabled
# for x in {1..10}; do echo $x; insmod bingo.ko; lsmod | grep bingo; rmmod bingo; done
1
bingo                 268697600  0
2
bingo                 268697600  0
3
bingo                 268697600  0
4
bingo                 268697600  0
5
bingo                 268697600  0
6
bingo                 268697600  0
7
bingo                 268697600  0
8
bingo                 268697600  0
9
bingo                 268697600  0
10
bingo                 268697600  0

----------------------------------------------------------------------------
# dmesg | grep -i kaslr
[    0.000000] Kernel command line: BOOT_IMAGE=/boot/Image-5.14.21-150500.55.19-64kb root=UUID=be6abf85-d3cc-44e1-8199-340c3db4f902 splash=silent modprobe.blacklist=ast,nouveau mitigations=auto quiet security=apparmor nokaslr
[    0.000000] Unknown kernel command line parameters "nokaslr BOOT_IMAGE=/boot/Image-5.14.21-150500.55.19-64kb splash=silent", will be passed to user space.
[    4.850619] KASLR disabled on command line
[   23.709228] nokaslr
# for x in {1..10}; do echo $x; insmod bingo.ko; lsmod | grep bingo; rmmod bingo; done
1
insmod: ERROR: could not insert module bingo.ko: Cannot allocate memory
rmmod: ERROR: Module bingo is not currently loaded
2
insmod: ERROR: could not insert module bingo.ko: Cannot allocate memory
rmmod: ERROR: Module bingo is not currently loaded
3
insmod: ERROR: could not insert module bingo.ko: Cannot allocate memory
rmmod: ERROR: Module bingo is not currently loaded
4
insmod: ERROR: could not insert module bingo.ko: Cannot allocate memory
rmmod: ERROR: Module bingo is not currently loaded
5
insmod: ERROR: could not insert module bingo.ko: Cannot allocate memory
rmmod: ERROR: Module bingo is not currently loaded
6
insmod: ERROR: could not insert module bingo.ko: Cannot allocate memory
rmmod: ERROR: Module bingo is not currently loaded
7
insmod: ERROR: could not insert module bingo.ko: Cannot allocate memory
rmmod: ERROR: Module bingo is not currently loaded
8
insmod: ERROR: could not insert module bingo.ko: Cannot allocate memory
rmmod: ERROR: Module bingo is not currently loaded
9
insmod: ERROR: could not insert module bingo.ko: Cannot allocate memory
rmmod: ERROR: Module bingo is not currently loaded
10
insmod: ERROR: could not insert module bingo.ko: Cannot allocate memory
rmmod: ERROR: Module bingo is not currently loaded
host-10-176-223-220:/tmp/bingo #

Example of errors from dmesg:
[  260.183475] alloc_vmap_area: 48 callbacks suppressed
[  260.183492] vmap allocation for size 268763136 failed: use vmalloc=<size> to increase size
[  260.183498] warn_alloc: 2 callbacks suppressed
[  260.183498] insmod: vmalloc error: size 268697600, vm_struct allocation failed, mode:0xcc0(GFP_KERNEL), nodemask=(null),cpuset=/,mems_allowed=0-3
So I've tried with the same OS version you have (I've noted that you're on the 64K page version). The config and kernel code responsible for module loading already fall back to the 2 GB vmalloc area if the reserved 128 MB section is not enough, as stated in the previous comments. Besides the text section, your module also has a huge bss, so I extended both text and bss in my test module to be over 200 MB each, and it can still be loaded successfully even with KASLR disabled, so it's probably something more involved happening during initialization in your specific module. Given the reported errors, two things that come to mind are:
- a huge allocation in the vmalloc memory area from the designated module_init function
- the vmalloc area could already be filled by other kernel modules that didn't fit in the reserved memory

It would be beneficial to take a look at your supportconfig; can you please produce the archive by running it and attach it to this ticket? Also, can we by any chance have a copy of bingo.ko?
In case it is feasible to upload bingo.ko to us, please use one of the anonymous ftp servers below, depending on which one is nearer to your location:

FTP EMEA: support-ftp.emea.suse.com/incoming/
FTP US:   support-ftp.us.suse.com/incoming/

user: anonymous
pwd: 1214304 (anything will do, but this is the case number and will make the upload easier to find)
Created attachment 868978 [details] nokaslr bingo module load failure supportconfig
(In reply to Andrea della Porta from comment #6) > It would be beneficial to take a look at your supportconfig, can you please > produce the archive by running it and attach to this ticket? Also, can we > have by any chance a copy of bingo.ko? I have attached a supportconfig taken on the 64k kernel + nokaslr after the module load failures. As requested, I uploaded the module source and binary to the ftp server (support-ftp.us.suse.com/incoming/1214304/bz1214304_bingo_module.tar.xz).
Many thanks for the attachments. I've tried both your binary module as is and, in a second run, recompiled from source, and both work, with KASLR disabled and the same kernel as yours.
I suspect there are some other modules big enough to fill the vmalloc area dedicated to module loading. This is in part confirmed by the following line from the supportconfig log:

[   39.828654][ T2348] vmap allocation for size 57278464 failed: use vmalloc=<size> to increase size

In at least one run (presumably an attempt prior to the one in which supportconfig was taken), some module also failed to load: judging from the size, it seems really close to nvidia.ko, as in the following line from lsmod:

nvidia              57212928  2 nvidia_uvm,nvidia_modeset

One strange thing to note here is that, from cat /proc/meminfo:

VmallocUsed:      643904 kB

so there should be plenty of space for the bingo.ko module (or any other, see nvidia.ko, etc.) to be allocated.
To narrow down the possible culprits, I would suggest rmmod-ing the drivers you deem 'big' (e.g. nvidia, nvidia_uvm and nvidia_modeset would be a good starting point) and then seeing whether bingo.ko can be loaded.
(In reply to Andrea della Porta from comment #10) > Many thanks for the attachments. I've tried both your binary module as is > and, in second run, recompiled it from the source and both works, with kaslr > disabled and same kernel as yours. > I suspect that there some other modules big enough to fill the vmalloc area > dedicated to module load. This is in part confirmed by the following line > from supportconfig log: > > [ 39.828654][ T2348] vmap allocation for size 57278464 failed: use > vmalloc=<size> to increase size > > in at least one run (presumably in a previous attempt with respect to the > one in which supportconfig has been taken) some module also failed to be > loaded: judging from the size it seems to be really close to the nvidia.ko > as in the following line from lsmod: > > nvidia 57212928 2 nvidia_uvm,nvidia_modeset > > One strange thing to note here is that, from cat /proc/meminfo: > > VmallocUsed: 643904 kB > > so there should be plenty of space form the bingo.ko module (or any other, > see nvidia.ko. etc) to be allocated. > To narrow down the possible culprits, I would suggest rmmod-ing the driver > you deem as 'big' (e.g. nvidia, nvidia_uvm and nvidia_modeset would be a > good starting point) and then see if bingo.ko could be loaded. I computed a total of the module sizes prior to attempting to load the bingo module. Without the large NVIDIA drivers it shows ~34MB consumed. That seems reasonable for 92 modules. And still, the bingo module fails to load. There must be a disconnect somewhere if you are able to load the bingo module with KASLR disabled while I continue to experience load failures using the same kernel. 
# uname -a
Linux host 5.14.21-150500.55.19-64kb #1 SMP PREEMPT_DYNAMIC Tue Aug 8 22:15:01 UTC 2023 (9908c29) aarch64 aarch64 aarch64 GNU/Linux
# dmesg | grep -i kaslr
[    0.000000] Kernel command line: BOOT_IMAGE=/boot/Image-5.14.21-150500.55.19-64kb root=UUID=be6abf85-d3cc-44e1-8199-340c3db4f902 splash=silent modprobe.blacklist=ast,nouveau mitigations=auto quiet security=apparmor nokaslr
[    0.000000] Unknown kernel command line parameters "nokaslr BOOT_IMAGE=/boot/Image-5.14.21-150500.55.19-64kb splash=silent", will be passed to user space.
[    4.846729] KASLR disabled on command line
[   24.039114] nokaslr
# lsmod | awk 'NR != 1 {x = x + $2} END {print "Total: "x}'
Total: 35586048
# lsmod | grep nvi
# lsmod | wc -l
93
# cat /proc/meminfo | grep -i vmalloc
VmallocTotal:   133143461888 kB
VmallocUsed:          595328 kB
VmallocChunk:              0 kB
# insmod bingo.ko
insmod: ERROR: could not insert module bingo.ko: Cannot allocate memory
One notable difference between your system under test and mine is that you have 72× the processor cores I have here. To further narrow down the possible sources of the issue, can you please do a run with nr_cpus=16 (or some other number you deem appropriate, much lower than 288) on the kernel command line? Sometimes per-CPU variables can get quite heavy on memory. Thanks
(In reply to Andrea della Porta from comment #12)
> One notably difference between your system under test and mine is that you
> have 72x the processor core than I have here. In order to further shrink the
> possible sources of the issue, please can you have a run setting nr_cpu=16
> (or some other number you deem as appropriate much lower than 288) on the
> kernel command line? Sometimes per-cpu variables can get quite heavy on
> memory. Thanks

Issue still recreates with 16 CPUs:

# uname -a
Linux host 5.14.21-150500.55.19-64kb #1 SMP PREEMPT_DYNAMIC Tue Aug 8 22:15:01 UTC 2023 (9908c29) aarch64 aarch64 aarch64 GNU/Linux
# cat /proc/cmdline
BOOT_IMAGE=/boot/Image-5.14.21-150500.55.19-64kb root=UUID=f379263d-4eaa-4e0d-a4e7-5e26a893e7a8 splash=silent modprobe.blacklist=ast,nouveau mitigations=auto quiet security=apparmor nokaslr nr_cpus=16
# dmesg | grep -i kaslr
[    0.000000] Kernel command line: BOOT_IMAGE=/boot/Image-5.14.21-150500.55.19-64kb root=UUID=f379263d-4eaa-4e0d-a4e7-5e26a893e7a8 splash=silent modprobe.blacklist=ast,nouveau mitigations=auto quiet security=apparmor nokaslr nr_cpus=16
[    0.000000] Unknown kernel command line parameters "nokaslr BOOT_IMAGE=/boot/Image-5.14.21-150500.55.19-64kb splash=silent", will be passed to user space.
[ 0.559935] KASLR disabled on command line [ 18.657380] nokaslr # lscpu Architecture: aarch64 CPU op-mode(s): 64-bit Byte Order: Little Endian CPU(s): 16 On-line CPU(s) list: 0-15 Vendor ID: ARM Model: 0 Thread(s) per core: 1 Core(s) per socket: 16 Socket(s): 1 Stepping: r0p0 Frequency boost: disabled CPU max MHz: 3411.0000 CPU min MHz: 81.0000 BogoMIPS: 2000.00 Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 s m4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ss bs sb paca pacg dcpodp sve2 sveaes svepmull svebitperm s vesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh bti Caches (sum of all): L1d: 1 MiB (16 instances) L1i: 1 MiB (16 instances) L2: 16 MiB (16 instances) L3: 114 MiB (1 instance) NUMA: NUMA node(s): 36 NUMA node0 CPU(s): 0-15 # cat /proc/meminfo | grep -i vmal VmallocTotal: 133143461888 kB VmallocUsed: 300928 kB VmallocChunk: 0 kB # lsmod | wc -l 93 # lsmod | awk 'NR != 1 {x = x + $2} END {print "Total: "x}' Total: 35586048 # insmod ./bingo.ko insmod: ERROR: could not insert module ./bingo.ko: Cannot allocate memory # insmod ./bingo.ko insmod: ERROR: could not insert module ./bingo.ko: Cannot allocate memory # insmod ./bingo.ko insmod: ERROR: could not insert module ./bingo.ko: Cannot allocate memory # insmod ./bingo.ko insmod: ERROR: could not insert module ./bingo.ko: Cannot allocate memory # dmesg | tail -100 [ 596.087364] Node 0 DMA: 1*64kB (U) 1*128kB (M) 2*256kB (UM) 2*512kB (UM) 2*1024kB (UM) 2*2048kB (UM) 2*4096kB (UM) 2*8192kB (UM) 2*16384kB (UM) 2*32768kB (UM) 1*65536kB (U) 1*131072kB (M) 2*262144kB (UM) 2*524288kB (M) = 1900224kB [ 596.087377] Node 0 Normal: 1*64kB (M) 2*128kB (ME) 1*256kB (U) 2*512kB (UE) 2*1024kB (UE) 0*2048kB 1*4096kB (M) 2*8192kB (UM) 2*16384kB (ME) 2*32768kB (UM) 3*65536kB (UME) 3*131072kB (UME) 1*262144kB (E) 229*524288kB (M) = 121036352kB [ 596.087387] Node 1 Normal: 8*64kB (UM) 10*128kB (UE) 6*256kB (UME) 6*512kB (UME) 4*1024kB (UME) 
5*2048kB (UME) 2*4096kB (UM) 2*8192kB (ME) 2*16384kB (UE) 2*32768kB (UE) 3*65536kB (UME) 3*131072kB (ME) 1*262144kB (E) 237*524288kB (M) = 125251840kB [ 596.087398] Node 2 Normal: 9*64kB (UM) 1*128kB (U) 4*256kB (UME) 4*512kB (UME) 2*1024kB (ME) 4*2048kB (ME) 2*4096kB (ME) 2*8192kB (ME) 1*16384kB (E) 1*32768kB (E) 3*65536kB (UME) 3*131072kB (ME) 1*262144kB (E) 237*524288kB (M) = 125195968kB [ 596.087409] Node 3 Normal: 14*64kB (UM) 1*128kB (E) 4*256kB (UME) 6*512kB (UME) 4*1024kB (UME) 3*2048kB (ME) 2*4096kB (UM) 2*8192kB (ME) 2*16384kB (UE) 2*32768kB (UE) 4*65536kB (UME) 2*131072kB (ME) 1*262144kB (E) 237*524288kB (M) = 125180928kB [ 596.087422] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=16777216kB [ 596.087424] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=524288kB [ 596.087425] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [ 596.087425] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=16777216kB [ 596.087426] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=524288kB [ 596.087427] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [ 596.087427] Node 2 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=16777216kB [ 596.087428] Node 2 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=524288kB [ 596.087428] Node 2 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [ 596.087429] Node 3 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=16777216kB [ 596.087429] Node 3 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=524288kB [ 596.087430] Node 3 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [ 596.087430] 10260 total pagecache pages [ 596.087431] 0 pages in swap cache [ 596.087432] Swap cache stats: add 0, delete 0, find 0/0 [ 596.087432] Free swap = 2097408kB [ 596.087433] Total 
swap = 2097408kB [ 596.087433] 7856920 pages RAM [ 596.087434] 0 pages HighMem/MovableOnly [ 596.087434] 10704 pages reserved [ 596.087435] 0 pages cma reserved [ 596.087436] 0 pages hwpoisoned [ 605.867859] vmap allocation for size 268763136 failed: use vmalloc=<size> to increase size [ 606.736477] vmap allocation for size 268763136 failed: use vmalloc=<size> to increase size [ 606.736496] warn_alloc: 1 callbacks suppressed [ 606.736501] insmod: vmalloc error: size 268697600, vm_struct allocation failed, mode:0xcc0(GFP_KERNEL), nodemask=(null),cpuset=/,mems_allowed=0-3 [ 606.736516] CPU: 8 PID: 1788 Comm: insmod Tainted: G O X 5.14.21-150500.55.19-64kb #1 SLE15-SP5 ed0c1d696f2124381beeeb4097a1cd4052878149 [ 606.736520] Hardware name: NVIDIA Grace Hopper x4 P4496/UT2.1 DP Chassis, BIOS 00001900 20230724 [ 606.736522] Call trace: [ 606.736524] dump_backtrace+0x0/0x240 [ 606.736529] show_stack+0x20/0x40 [ 606.736530] dump_stack_lvl+0x68/0x84 [ 606.736535] dump_stack+0x18/0x34 [ 606.736537] warn_alloc+0x124/0x1c0 [ 606.736541] __vmalloc_node_range+0x350/0x400 [ 606.736544] module_alloc+0x13c/0x180 [ 606.736546] layout_and_allocate+0x8d8/0xbc0 [ 606.736549] load_module+0x614/0x2340 [ 606.736550] __do_sys_finit_module+0xc0/0x140 [ 606.736551] __arm64_sys_finit_module+0x24/0x40 [ 606.736552] invoke_syscall+0x74/0x100 [ 606.736554] el0_svc_common.constprop.4+0xa4/0x1c0 [ 606.736556] do_el0_svc+0x2c/0xc0 [ 606.736557] el0_svc+0x24/0x40 [ 606.736558] el0t_64_sync_handler+0x94/0xc0 [ 606.736560] el0t_64_sync+0x198/0x19c [ 606.736562] Mem-Info: [ 606.736607] active_anon:1059 inactive_anon:1855 isolated_anon:0 active_file:4729 inactive_file:4273 isolated_file:0 unevictable:5 dirty:4 writeback:0 slab_reclaimable:851 slab_unreclaimable:4423 mapped:581 shmem:1243 pagetables:160 bounce:0 free:7790074 free_pcp:6250 free_cma:0 [ 606.736610] Node 0 active_anon:67776kB inactive_anon:118720kB active_file:302656kB inactive_file:273472kB unevictable:320kB isolated(anon):0kB 
isolated(file):0kB mapped:37184kB dirty:256kB writeback:0kB shmem:79552kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB kernel_stack:22976kB pagetables:10112kB all_unreclaimable? no [ 606.736612] Node 1 active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB kernel_stack:64kB pagetables:0kB all_unreclaimable? no [ 606.736615] Node 2 active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB kernel_stack:0kB pagetables:64kB all_unreclaimable? no [ 606.736617] Node 3 active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB kernel_stack:0kB pagetables:64kB all_unreclaimable? 
no [ 606.736620] Node 0 DMA free:1900224kB boost:0kB min:101312kB low:126592kB high:151872kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:2097152kB managed:2031488kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB [ 606.736624] lowmem_reserve[]: 0 0 7519 7519 7519 [ 606.736626] Node 0 Normal free:121035776kB boost:0kB min:6146688kB low:7683328kB high:9219968kB reserved_highatomic:0KB active_anon:67776kB inactive_anon:118720kB active_file:302656kB inactive_file:273472kB unevictable:320kB writepending:256kB present:123415424kB managed:123233152kB mlocked:320kB bounce:0kB free_pcp:389888kB local_pcp:20544kB free_cma:0kB [ 606.736629] lowmem_reserve[]: 0 0 0 0 0 [ 606.736634] Node 1 Normal free:125251840kB boost:0kB min:6268992kB low:7836224kB high:9403456kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:125776768kB managed:125653696kB mlocked:0kB bounce:0kB free_pcp:4352kB local_pcp:0kB free_cma:0kB [ 606.736637] lowmem_reserve[]: 0 0 0 0 0 [ 606.736641] Node 2 Normal free:125195968kB boost:0kB min:6268992kB low:7836224kB high:9403456kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:125776768kB managed:125653696kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB [ 606.736643] lowmem_reserve[]: 0 0 0 0 0 [ 606.736645] Node 3 Normal free:125180928kB boost:0kB min:6265600kB low:7832000kB high:9398400kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:125776768kB managed:125585792kB mlocked:0kB bounce:0kB free_pcp:5760kB local_pcp:0kB free_cma:0kB [ 606.736647] lowmem_reserve[]: 0 0 0 0 0 [ 606.736649] Node 0 DMA: 1*64kB (U) 1*128kB (M) 2*256kB (UM) 2*512kB (UM) 2*1024kB (UM) 2*2048kB (UM) 2*4096kB (UM) 
2*8192kB (UM) 2*16384kB (UM) 2*32768kB (UM) 1*65536kB (U) 1*131072kB (M) 2*262144kB (UM) 2*524288kB (M) = 1900224kB [ 606.736660] Node 0 Normal: 2*64kB (UM) 2*128kB (ME) 1*256kB (U) 2*512kB (UE) 1*1024kB (E) 0*2048kB 1*4096kB (M) 2*8192kB (UM) 2*16384kB (ME) 2*32768kB (UM) 3*65536kB (UME) 3*131072kB (UME) 1*262144kB (E) 229*524288kB (M) = 121035392kB [ 606.736671] Node 1 Normal: 8*64kB (UM) 10*128kB (UE) 6*256kB (UME) 6*512kB (UME) 4*1024kB (UME) 5*2048kB (UME) 2*4096kB (UM) 2*8192kB (ME) 2*16384kB (UE) 2*32768kB (UE) 3*65536kB (UME) 3*131072kB (ME) 1*262144kB (E) 237*524288kB (M) = 125251840kB [ 606.736683] Node 2 Normal: 9*64kB (UM) 1*128kB (U) 4*256kB (UME) 4*512kB (UME) 2*1024kB (ME) 4*2048kB (ME) 2*4096kB (ME) 2*8192kB (ME) 1*16384kB (E) 1*32768kB (E) 3*65536kB (UME) 3*131072kB (ME) 1*262144kB (E) 237*524288kB (M) = 125195968kB [ 606.736694] Node 3 Normal: 14*64kB (UM) 1*128kB (E) 4*256kB (UME) 6*512kB (UME) 4*1024kB (UME) 3*2048kB (ME) 2*4096kB (UM) 2*8192kB (ME) 2*16384kB (UE) 2*32768kB (UE) 4*65536kB (UME) 2*131072kB (ME) 1*262144kB (E) 237*524288kB (M) = 125180928kB [ 606.736708] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=16777216kB [ 606.736709] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=524288kB [ 606.736710] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [ 606.736710] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=16777216kB [ 606.736711] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=524288kB [ 606.736711] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [ 606.736712] Node 2 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=16777216kB [ 606.736712] Node 2 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=524288kB [ 606.736713] Node 2 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [ 606.736713] Node 3 
hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=16777216kB [ 606.736714] Node 3 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=524288kB [ 606.736714] Node 3 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [ 606.736715] 10263 total pagecache pages [ 606.736715] 0 pages in swap cache [ 606.736716] Swap cache stats: add 0, delete 0, find 0/0 [ 606.736717] Free swap = 2097408kB [ 606.736717] Total swap = 2097408kB [ 606.736718] 7856920 pages RAM [ 606.736719] 0 pages HighMem/MovableOnly [ 606.736719] 10704 pages reserved [ 606.736720] 0 pages cma reserved [ 606.736720] 0 pages hwpoisoned [ 607.397424] vmap allocation for size 268763136 failed: use vmalloc=<size> to increase size
With KASLR enabled (on AArch64 in 5.14.21-150500.55.19), the randomized area [module_alloc_base, module_alloc_base+SZ_2G] should not overlap with any other vmalloc allocation. With KASLR disabled, module_alloc_base starts at _etext-SZ_128M, and so the range [module_alloc_base, module_alloc_base+SZ_2G] is also occupied by vmlinux and by additional vmalloc allocations done in [VMALLOC_START, VMALLOC_END]. My suspicion is that these allocations exhausted the module area, which causes the observed failure.

@Matt, could you please collect and provide the file /proc/vmallocinfo after inserting the test module fails, so we can confirm/reject this idea?
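The layout described above can be sketched numerically. The Python below is only illustrative; the _etext value is a hypothetical example chosen to make the arithmetic concrete, not taken from a real build:

```python
SZ_128M = 128 << 20
SZ_2G = 2 << 30

# Hypothetical end-of-kernel-text address, for illustration only.
_etext = 0xffff800010e90000

# With KASLR disabled, the module window starts 128M below _etext...
module_alloc_base = _etext - SZ_128M
# ...and extends 2G upwards, straight through generic vmalloc space,
# so ordinary vmalloc/ioremap allocations can land inside it and
# exhaust it even while VmallocTotal looks nearly empty.
module_alloc_end = module_alloc_base + SZ_2G

assert module_alloc_base == 0xffff800008e90000
assert module_alloc_end == 0xffff800088e90000
```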
Created attachment 869098 [details] /proc/vmallocinfo before insmod failure
Created attachment 869099 [details] /proc/vmallocinfo after insmod failure
(In reply to Petr Pavlu from comment #14)
> With enabled KASRL (on AArch64 in 5.14.21-150500.55.19), the randomized
> area [module_alloc_base, module_alloc_base+SZ_2G] should not overlap
> with any other vmalloc allocation. With KASLR disabled,
> module_alloc_base starts at _etext-SZ_128M and so the range
> [module_alloc_base, module_alloc_base+SZ_2G] is occupied also by vmlinux
> and additional vmalloc allocations done at [VMALLOC_START, VMALLOC_END].
> My suspicion is that these allocations exhausted the module area which
> causes the observed failure.
>
> @Matt, could you please collect and provide file /proc/vmallocinfo after
> inserting the test module fails so we can confirm/reject this idea?

I have attached before and after snapshots of /proc/vmallocinfo when KASLR is disabled.
(In reply to Matt Ochs from comment #17)
> > @Matt, could you please collect and provide file /proc/vmallocinfo after
> > inserting the test module fails so we can confirm/reject this idea?
>
> I have attached before and after snapshots of /proc/vmallocinfo when KASLR
> is disabled.

The 2 GB vmalloc memory from which new modules are allocated should have the range:

start: ffff800008e90000 - end: ffff800088e90000

From the attached vmallocinfo log, there seem to be some huge allocations from PCI-related ioremapped regions. In particular, the last of the following even seems to extend past the end mark:

0xffff800040000000-0xffff800050010000 268500992 pci_ecam_create+0x130/0x280 phys=0x0000600010000000 ioremap
0xffff800060000000-0xffff800070010000 268500992 pci_ecam_create+0x130/0x280 phys=0x0000610010000000 ioremap
0xffff800080000000-0xffff800090010000 268500992 pci_ecam_create+0x130/0x280 phys=0x0000628010000000 ioremap
.....

Matt, do you know what these mappings could be related to? Can you please provide the output of 'lspci -v'?
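As a sanity check, the addresses quoted above can be compared directly; this small Python snippet just replays the arithmetic with the values from the vmallocinfo excerpt:

```python
# Module allocation window quoted above (KASLR disabled).
mod_start = 0xffff800008e90000
mod_end = 0xffff800088e90000  # mod_start + 2 GiB

# pci_ecam_create ioremaps from the attached vmallocinfo.
ecam = [
    (0xffff800040000000, 0xffff800050010000),
    (0xffff800060000000, 0xffff800070010000),
    (0xffff800080000000, 0xffff800090010000),
]

for start, end in ecam:
    # Each mapping is 256 MiB + 64 KiB...
    assert end - start == 256 * 2**20 + 64 * 1024
    # ...and starts inside the 2 GiB module window, carving it up.
    assert mod_start < start < mod_end

# The last one indeed runs past the end of the window.
assert ecam[-1][1] > mod_end
```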
> Matt, do you know what these mappings can be related to? Can you please
> provide the output of 'lspci -v'?

Sorry, don't bother with lspci; I found that info in the supportconfig. There are quite a few PCI devices in there, so it seems reasonable that they eat up a lot of memory.
Matt, can you please comment out the hugebss definition in your bingo.c and try loading this new module again? If it loads correctly, can you please also attach the 'cat /proc/vmallocinfo' output? Many thanks
Created attachment 869183 [details] /proc/vmallocinfo before insmod without hugebss
Created attachment 869184 [details] /proc/vmallocinfo after insmod without hugebss successful load
Created attachment 869185 [details] /proc/vmallocinfo after secondary module insmod without hugebss failed load
(In reply to Andrea della Porta from comment #20) > Matt, can you please comment out the hugebss definition in your bingo.c and > try to load it again with this new module? If it loads correctly, can you > please also attach 'cat /proc/vmallocinfo' output? Many thanks Without the huge .BSS I can load the bingo module when KASLR is disabled. However, subsequent loads of additional bingo modules (leaving the first bingo module loaded) continue to experience the memory allocation failure. I have attached three additional outputs from /proc/vmallocinfo: - Snapshot before loading bingo - Snapshot after successful bingo load - Snapshot after failed bingo2 load
(In reply to Matt Ochs from comment #24)
> (In reply to Andrea della Porta from comment #20)
> > Matt, can you please comment out the hugebss definition in your bingo.c and
> > try to load it again with this new module? If it loads correctly, can you
> > please also attach 'cat /proc/vmallocinfo' output? Many thanks
>
> Without the huge .BSS I can load the bingo module when KASLR is disabled.
> However, subsequent loads of additional bingo modules (leaving the first
> bingo module loaded) continue to experience the memory allocation failure. I
> have attached three additional outputs from /proc/vmallocinfo:
> - Snapshot before loading bingo
> - Snapshot after successful bingo load
> - Snapshot after failed bingo2 load

So the scenario here is that vmalloc memory is fragmented into blocks aligned on 512 MiB boundaries, each of which amounts to 256 MiB + 64 KB of effectively allocated memory. These are the result of some PCI driver ioremapping device memory into the kernel virtual address space (roughly 256 MiB each), with ioremap aligning these blocks to the next power of two (hence the 512 MiB boundary). Vmalloc can allocate physically discontiguous memory, and of course all those interleaved holes of slightly less than 256 MiB could in total accommodate even the largest module, but the point is that the virtual memory *has* to be contiguous, and this is not the case here, since your original bingo module is slightly bigger than 256 MiB while each hole is slightly smaller than 256 MiB. This is why the bingo module without the bss segment can be loaded while the original (which also has the bss segment) can't. The fact that loading a second instance of the bingo module fails is honestly unexpected, since there should still be at least one hole between 0xffff800050010000 and 0xffff800060000000. You just changed obj-m += bingo2.o and recompiled, right?
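The size mismatch is easy to verify with the numbers from the logs (plain Python; the no-bss figure reuses the .text size from the earlier `size bingo.ko` output, not a fresh measurement):

```python
MiB = 1 << 20

# Each pci_ecam mapping occupies 256 MiB + 64 KiB but is placed on a
# 512 MiB boundary by ioremap's power-of-two alignment, so the largest
# contiguous hole left between two neighbouring mappings is:
block_align = 512 * MiB
block_size = 256 * MiB + 64 * 1024
hole = block_align - block_size          # slightly under 256 MiB

# The original bingo.ko needs a single contiguous vmap of this size
# ("vmap allocation for size 268763136 failed" in dmesg):
bingo_vmap = 268763136
assert bingo_vmap > hole                 # too big for any single hole

# Without the huge .bss the module is roughly its .text alone
# (~128 MiB per the size output), which does fit in a hole:
bingo_text_only = 134218249
assert bingo_text_only < hole
```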
Some points of discussion, in order to find a solution to this:

MODULE PERSPECTIVE
- I'm not a PCI expert, but are you really sure there is a need to ioremap such a huge region (256MiB)? These mappings seem to come from ECAM configuration memory, and if I remember correctly those are a few kilobytes in size. Of course I don't know how that PCI driver works internally, but maybe this is a point worth investigating, since it would defragment the vmalloc area a lot (benefiting not only module loading).
- Module stacking: maybe you don't really need a monolithic driver (i.e. one with .bss, .data and .text all in a single entity) and could opt for several smaller drivers that depend on one another. That way the memory allocation is optimized, filling the remaining holes better.

KERNEL PERSPECTIVE
- An interesting discussion is the following: https://linux-arm-kernel.infradead.narkive.com/a0Qwraeu/rfc-patch-resend-mm-vmalloc-remove-ioremap-align-constraint
  A long time ago a patch was proposed that would limit fragmentation by allowing a less restrictive alignment on ioremap (after all, on a 64K-page aarch64 kernel the maximum alignment coincides with a huge page, i.e. 512MiB), but it needs careful evaluation since it may have non-obvious hidden downsides, especially on large-scale machines.
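To illustrate the KERNEL PERSPECTIVE point, here is a toy model (not the actual kernel allocator code) of how the power-of-two ioremap alignment inflates the VA footprint of a 256MiB+64K mapping, compared with the page-granular alignment the RFC patch argues for; a 64K page size is assumed, matching the kernel-64kb flavour used in this bug:

```python
MiB = 1 << 20
KiB = 1 << 10

def next_pow2(n: int) -> int:
    """Smallest power of two >= n: the alignment ioremap currently picks."""
    return 1 << (n - 1).bit_length()

mapping = 256 * MiB + 64 * KiB      # just past a power of two...
print(next_pow2(mapping) // MiB)    # 512: ...so the alignment doubles

# VA wasted per mapping under each alignment policy:
waste_pow2 = next_pow2(mapping) - mapping
waste_page = (-mapping) % (64 * KiB)   # page-granular alignment, 64K pages
print(waste_pow2 // MiB, waste_page)   # 255 0: ~255MiB wasted vs none
```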
To add a couple of details to my previous comment, I'd just like to explain better why I proposed that workaround. Unfortunately the originally proposed patch changes a macro definition (VMALLOC_VSIZE) in a public kernel header. This means that other entities (especially drivers) may include it. Applying that patch to the SLE15-SP5 kernel can therefore break compatibility with third parties. On the other hand, the patch in the discussion I referenced can backfire in ways that are non-trivial to identify. The advice, then, is to try to limit the memory fragmentation, which would benefit the entire system and alleviate memory pressure. If you think it could be useful and feasible, feel free to share any details about the PCI driver; maybe we can elaborate further on what to do with those huge allocations.
(In reply to Andrea della Porta from comment #26)
> To add a couple of details to my previous comment, I'd just like to explain
> better why I proposed that workaround. Unfortunately the originally proposed
> patch changes a macro definition (VMALLOC_VSIZE) in public kernel header.
> This means that other entities (especially drivers) may include that.
> Applying that patch to SLE15-SP5 kernel can therefore break compatibility
> with third party. On the other hand, the patch in the discussion I've
> proposed can backfire in non trivial to identify way.
> The advise then is to try to limit the memory fragmentation, that would
> benefit the entire system and alleviate memory pressure. If you think it
> could be useful and feasible, feel free to share any details about the PCI
> driver, maybe we can elaborate further about what to do with that huge
> allocations.

Understood regarding the binary compat concern. Would this patch then be a candidate for SLES15-SP6?
(In reply to Matt Ochs from comment #27)
> (In reply to Andrea della Porta from comment #26)
> > To add a couple of details to my previous comment, I'd just like to explain
> > better why I proposed that workaround. Unfortunately the originally proposed
> > patch changes a macro definition (VMALLOC_VSIZE) in public kernel header.
> > This means that other entities (especially drivers) may include that.
> > Applying that patch to SLE15-SP5 kernel can therefore break compatibility
> > with third party. On the other hand, the patch in the discussion I've
> > proposed can backfire in non trivial to identify way.
> > The advise then is to try to limit the memory fragmentation, that would
> > benefit the entire system and alleviate memory pressure. If you think it
> > could be useful and feasible, feel free to share any details about the PCI
> > driver, maybe we can elaborate further about what to do with that huge
> > allocations.
>
> Understood regarding the binary compat concern. Would this patch then be a
> candidate for SLES15-SP6?

Yes, we're currently considering including it in SP6. I'll ask Petr to also take a look in case of hidden side effects.
(In reply to Andrea della Porta from comment #28)
> Yes, we're currently considering to include it in SP6, I'll ask Petr to also
> take a look in case of hidden side effects

Looks ok to me for 15-SP6 and ALP. I agree with the concern about binary compatibility in earlier SPs. It is unlikely that an external module would rely on MODULES_VSIZE or derived definitions, but it can't quite be ruled out.
I think it would be acceptable for this not to be resolved until SLES 15 SP6, given that KASLR is enabled by default on SLES. Should a customer encounter this, we would advise enabling KASLR or exploring a reordering of module loading to minimize fragmentation.
Hi Matt, I've prepared a backport of the patch for SLES15-SP5 in the following kernel:
https://download.opensuse.org/repositories/home:/aporta:/branches:/bsc1214304/pool/aarch64/kernel-64kb-5.14.21-150500.1.1.TEST.g9908c29.aarch64.rpm
Can you please give it a try? You can install it by downloading it on the target and running:
rpm -i --force kernel-64kb-5.14.21-150500.1.1.TEST.g9908c29.aarch64.rpm
Please note that this kernel is just for testing purposes and is meant to be run with the 'nokaslr' kernel param at boot. Do not use this kernel with KASLR enabled (nor for anything other than testing this particular issue), since it could behave incorrectly or even crash. Please provide the output of cat /proc/vmallocinfo after insmod-ing bingo.ko. Many thanks
Created attachment 869576 [details] /proc/vmallocinfo before first insmod
Created attachment 869577 [details] /proc/vmallocinfo after first insmod
Created attachment 869578 [details] /proc/vmallocinfo after 10 insmod rmmod loops
Was able to test the kernel and confirmed the issue is resolved when KASLR is disabled, using the hugebss+hugetext bingo module. I have attached 3 text files with various snapshots of vmallocinfo.
--------------------------------------------------------------------------------
# uname -a
Linux host 5.14.21-150500.1.1.TEST.g9908c29-64kb #1 SMP PREEMPT_DYNAMIC Tue Aug 8 22:15:01 UTC 2023 (9908c29) aarch64 aarch64 aarch64 GNU/Linux
# cat /proc/cmdline
BOOT_IMAGE=/boot/Image-5.14.21-150500.1.1.TEST.g9908c29-64kb root=UUID=f379263d-4eaa-4e0d-a4e7-5e26a893e7a8 splash=silent modprobe.blacklist=ast,nouveau mitigations=auto quiet security=apparmor nokaslr
# dmesg | grep -i kaslr
[ 0.000000] Kernel command line: BOOT_IMAGE=/boot/Image-5.14.21-150500.1.1.TEST.g9908c29-64kb root=UUID=f379263d-4eaa-4e0d-a4e7-5e26a893e7a8 splash=silent modprobe.blacklist=ast,nouveau mitigations=auto quiet security=apparmor nokaslr
[ 0.000000] Unknown kernel command line parameters "nokaslr BOOT_IMAGE=/boot/Image-5.14.21-150500.1.1.TEST.g9908c29-64kb splash=silent", will be passed to user space.
[ 4.853445] KASLR disabled on command line
[ 24.140514] nokaslr
# lscpu
Architecture:            aarch64
CPU op-mode(s):          64-bit
Byte Order:              Little Endian
CPU(s):                  288
On-line CPU(s) list:     0-287
Vendor ID:               ARM
Model:                   0
Thread(s) per core:      1
Core(s) per socket:      72
Socket(s):               4
Stepping:                r0p0
Frequency boost:         disabled
CPU max MHz:             3429.0000
CPU min MHz:             81.0000
BogoMIPS:                2000.00
Flags:                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh bti
Caches (sum of all):
  L1d:                   18 MiB (288 instances)
  L1i:                   18 MiB (288 instances)
  L2:                    288 MiB (288 instances)
  L3:                    456 MiB (4 instances)
NUMA:
  NUMA node(s):          36
  NUMA node0 CPU(s):     0-71
  NUMA node1 CPU(s):     72-143
  NUMA node2 CPU(s):     144-215
  NUMA node3 CPU(s):     216-287
# insmod ./bingo.ko
# lsmod | grep bingo
bingo 268697600 0
# rmmod bingo
# for x in {1..10}; do echo $x; insmod bingo.ko; lsmod | grep bin; rmmod bingo; done
1
bingo 268697600 0
2
bingo 268697600 0
3
bingo 268697600 0
4
bingo 268697600 0
5
bingo 268697600 0
6
bingo 268697600 0
7
bingo 268697600 0
8
bingo 268697600 0
9
bingo 268697600 0
Patches have been accepted in SP6 and tested successfully on SP5. Marking the ticket as resolved. Thanks
Tested on SLES 15 SP6 kernel.