Bugzilla – Bug 1212446
[ALP] Regression in Linux 6.4 in network throughput on localhost with netperf
Last modified: 2024-03-24 02:10:17 UTC
Testing on the performance team grid revealed a regression in Linux 6.3 when compared to Linux 6.0 or Linux 6.2. The size of the regression differs between CPU generations and is largest for the smallest buffer sizes (64, 128 and 256 bytes). There is a regression in TCP throughput on all Intel and AMD CPUs in the grid. Typically, the regression in TCP throughput relative to Linux 6.0 is largest in Linux 6.3:

> bing1 Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz
> netperf-tcp
>                      6.0.0             6.1.0             6.2.0             6.3.0
> Hmean 64 2174.41 ( 0.00%) 2195.87 ( 0.99%) 2160.47 ( -0.64%) 2031.93 ( -6.55%)
> Hmean 128 4145.44 ( 0.00%) 4198.09 ( 1.27%) 4109.10 ( -0.88%) 3895.21 ( -6.04%)
> Hmean 256 6614.00 ( 0.00%) 6618.56 ( 0.07%) 6536.76 ( -1.17%) 6234.29 ( -5.74%)
> netperf-udp
>                      6.0.0             6.1.0             6.2.0             6.3.0
> Hmean send-64 376.66 ( 0.00%) 382.07 ( 1.44%) 385.53 ( 2.35%) 371.48 ( -1.38%)
> Hmean send-128 745.62 ( 0.00%) 762.27 ( 2.23%) 767.15 ( 2.89%) 724.00 ( -2.90%)
> Hmean send-256 1438.62 ( 0.00%) 1461.07 ( 1.56%) 1483.66 ( 3.13%) 1435.92 ( -0.19%)

On timon1, the regression in TCP throughput relative to Linux 6.0 is largest in Linux 6.2:

> timon1 Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz
> netperf-tcp
>                      6.0.0             6.1.0             6.2.0             6.3.0
> Hmean 64 2426.77 ( 0.00%) 2387.60 ( -1.61%) 2224.76 ( -8.32%) 2334.96 ( -3.78%)
> Hmean 128 4792.27 ( 0.00%) 4745.29 ( -0.98%) 4349.13 ( -9.25%) 4619.73 ( -3.60%)
> Hmean 256 9128.85 ( 0.00%) 9038.75 ( -0.99%) 8310.65 ( -8.96%) 8793.32 ( -3.68%)
> netperf-udp
>                      6.0.0             6.1.0             6.2.0             6.3.0
> Hmean send-64 421.00 ( 0.00%) 426.69 ( 1.35%) 416.60 ( -1.05%) 410.74 ( -2.44%)
> Hmean send-128 841.56 ( 0.00%) 853.79 ( 1.45%) 834.17 ( -0.88%) 826.06 ( -1.84%)
> Hmean send-256 1671.16 ( 0.00%) 1698.31 ( 1.62%) 1613.82 ( -3.43%) 1645.20 ( -1.55%)

There is not much of a regression on zazu:

> zazu Intel(R) Xeon(R) Gold 6330N CPU @ 2.20GHz
> netperf-tcp
>                      6.0.0             6.1.0             6.2.0             6.3.0
> Hmean 64 2324.22 ( 0.00%) 2273.60 ( -2.18%) 2344.58 ( 0.88%) 2261.83 ( -2.68%)
> Hmean 128 4571.85 ( 0.00%) 4512.50 ( -1.30%) 4651.17 ( 1.73%) 4465.59 ( -2.32%)
> Hmean 256 8749.81 ( 0.00%) 8613.48 ( -1.56%) 8857.62 ( 1.23%) 8500.16 ( -2.85%)
> netperf-udp
>                      6.0.0             6.1.0             6.2.0             6.3.0
> Hmean send-64 401.22 ( 0.00%) 404.38 ( 0.79%) 408.49 ( 1.81%) 400.54 ( -0.17%)
> Hmean send-128 802.57 ( 0.00%) 814.79 ( 1.52%) 819.56 ( 2.12%) 805.08 ( 0.31%)
> Hmean send-256 1588.79 ( 0.00%) 1610.26 ( 1.35%) 1606.07 ( 1.09%) 1607.41 ( 1.17%)

On simba1, there is a small regression between Linux 6.2 and Linux 6.3:

> simba1 Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz
> netperf-tcp
>                      6.0.0             6.1.0             6.2.0             6.3.0
> Hmean 64 2354.56 ( 0.00%) 2425.80 ( 3.03%) 2427.61 ( 3.10%) 2334.02 ( -0.87%)
> Hmean 128 4631.12 ( 0.00%) 4815.27 ( 3.98%) 4802.97 ( 3.71%) 4592.37 ( -0.84%)
> Hmean 256 8869.65 ( 0.00%) 9121.16 ( 2.84%) 9138.31 ( 3.03%) 8781.06 ( -1.00%)
> netperf-udp
>                      6.0.0             6.1.0             6.2.0             6.3.0
> Hmean send-64 401.07 ( 0.00%) 411.89 ( 2.70%) 418.60 ( 4.37%) 404.90 ( 0.95%)
> Hmean send-128 804.47 ( 0.00%) 817.67 ( 1.64%) 838.14 ( 4.19%) 812.19 ( 0.96%)
> Hmean send-256 1631.65 ( 0.00%) 1627.56 ( -0.25%) 1690.66 ( 3.62%) 1629.13 ( -0.15%)

The regression is larger on AMD machines:

> deandre AMD EPYC 9654
> netperf-tcp
>                      6.0.0             6.1.0             6.2.0             6.3.0
> Hmean 64 2661.12 ( 0.00%) 2637.21 ( -0.90%) 2599.42 ( -2.32%) 2272.47 ( -14.60%)
> Hmean 128 5008.94 ( 0.00%) 4970.45 ( -0.77%) 4875.40 ( -2.67%) 4339.79 ( -13.36%)
> Hmean 256 9425.00 ( 0.00%) 9307.01 ( -1.25%) 9214.31 ( -2.24%) 8281.36 ( -12.13%)
> netperf-udp
>                      6.0.0             6.1.0             6.2.0             6.3.0
> Hmean send-64 509.80 ( 0.00%) 500.06 ( -1.91%) 518.16 ( 1.64%) 528.36 ( 3.64%)
> Hmean send-128 1010.04 ( 0.00%) 976.28 ( -3.34%) 1011.92 ( 0.19%) 1026.64 ( 1.64%)
> Hmean send-256 2005.17 ( 0.00%) 1957.72 ( -2.37%) 2030.43 ( 1.26%) 2027.69 ( 1.12%)
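For reference, throughput numbers of this kind can be collected with a plain netperf run against a local netserver. The loop below is only an illustrative sketch (the grid's actual harness and options are not shown in this report); -m sets the send message size that the tables call "buffer size":

> $ netserver                      # start the receiver once; it listens on localhost by default
> $ for size in 64 128 256; do netperf -t TCP_STREAM -H 127.0.0.1 -l 60 -- -m $size; done
> $ for size in 64 128 256; do netperf -t UDP_STREAM -H 127.0.0.1 -l 60 -- -m $size; done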
Is the same regression observed with 6.4 as well?
(In reply to Michal Hocko from comment #1)
> Is the same regression observed with 6.4 as well?

There is a regression between 6.0 and 6.4 reported by the performance team grid. I have run tests to get profiles and I cannot reproduce the regression between 6.0 and 6.3; 6.0 and 6.3 perform similarly in my tests. I am working on providing an explanation. My current hypothesis is that the higher throughput reported under 6.0 (as seen in comment 0) is caused by an older CPU firmware version. The 6.0 results are from Oct 2022 for some machines.
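The firmware hypothesis can at least be narrowed down by recording the microcode revision on the machines at test time; a minimal check using standard interfaces (not the grid's tooling):

> $ grep -m1 microcode /proc/cpuinfo       # microcode revision currently loaded
> $ dmesg | grep -i microcode              # shows whether an update was applied at boot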
Since Linux 6.4 has become the base kernel version for ALP, here are new results from the grid. An even greater regression was reported between 6.0 and 6.4 on Intel machines:

> bing1 Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz
> netperf-tcp
> Buffer size          6.0.0             6.1.0             6.2.0             6.3.0             6.4.0
> Hmean 64 2174.41 ( 0.00%) 2195.87 ( 0.99%) 2160.47 ( -0.64%) 2031.93 ( -6.55%) 1882.90 (-13.41%)
> Hmean 128 4145.44 ( 0.00%) 4198.09 ( 1.27%) 4109.10 ( -0.88%) 3895.21 ( -6.04%) 3591.05 (-13.37%)
> Hmean 256 6614.00 ( 0.00%) 6618.56 ( 0.07%) 6536.76 ( -1.17%) 6234.29 ( -5.74%) 5972.07 ( -9.71%)
> Hmean 1024 18933.25( 0.00%) 18837.52(-0.51%) 18768.16( -0.87%) 18105.33( -4.37%) 17333.93( -8.45%)
> Hmean 2048 29859.54( 0.00%) 29450.69(-1.37%) 29755.55( -0.35%) 28757.20( -3.69%) 27493.41( -7.92%)
> Hmean 3312 36009.95( 0.00%) 35733.79(-0.77%) 35875.23( -0.37%) 35329.09( -1.89%) 34048.70( -5.45%)
> Hmean 4096 38902.10( 0.00%) 38733.64(-0.43%) 38933.69( 0.08%) 38205.69( -1.79%) 37144.18( -4.52%)
> Hmean 8192 43571.81( 0.00%) 43677.82( 0.24%) 43713.05( 0.32%) 43833.88( 0.60%) 42969.59( -1.38%)
> Hmean 16384 49109.21( 0.00%) 49354.75( 0.50%) 48900.01( -0.43%) 48834.18( -0.56%) 48456.58( -1.33%)

The size of the regression did not change on Zen 3 and Zen 4 machines:

> deandre AMD EPYC 9654
> netperf-tcp
> Buffer size          6.0.0             6.1.0             6.2.0             6.3.0             6.4.0
> Hmean 64 2661.12 ( 0.00%) 2637.21 ( -0.90%) 2599.42 ( -2.32%) 2272.47 ( -14.60%) 2314.51 ( -13.02%)
> Hmean 128 5008.94 ( 0.00%) 4970.45 ( -0.77%) 4875.40 ( -2.67%) 4339.79 ( -13.36%) 4432.17 ( -11.51%)
> Hmean 256 9425.00 ( 0.00%) 9307.01 ( -1.25%) 9214.31 ( -2.24%) 8281.36 ( -12.13%) 8413.18 ( -10.74%)
> Hmean 1024 23218.85( 0.00%) 23170.74( -0.21%) 23094.18( -0.54%) 21394.38( -7.86%) 21548.80( -7.19%)
> Hmean 2048 30340.41( 0.00%) 29622.27( -2.37%) 29366.87( -3.21%) 28710.75( -5.37%) 28659.39( -5.54%)
> Hmean 3312 34198.17( 0.00%) 33705.03( -1.44%) 33503.50( -2.03%) 33133.14( -3.11%) 33091.18( -3.24%)
> Hmean 4096 34650.74( 0.00%) 34016.23( -1.83%) 33981.56( -1.93%) 33731.23( -2.65%) 33664.32( -2.85%)
> Hmean 8192 39109.60( 0.00%) 39751.87( 1.64%) 39548.66( 1.12%) 39832.60( 1.85%) 39513.34( 1.03%)
> Hmean 16384 41406.87( 0.00%) 41785.97( 0.92%) 42002.20( 1.44%) 42301.29( 2.16%) 41712.63( 0.74%)

The regression shows up in tests with small buffer sizes, which probably means the number of syscalls made to transfer data matters. At first, I could not reproduce the regression. I ran tests on the grid:

> bing2 Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz
> netperf-tcp
> Buffer size          6.0.0              6.3.0              6.4.0
> Hmean 64 2002.62 ( 0.00%) 1990.11 * -0.62%* 1984.88 * -0.89%*
> Hmean 256 6200.69 ( 0.00%) 6192.91 ( -0.13%) 6200.72 ( 0.00%)
> Hmean 1024 17962.48 ( 0.00%) 18035.54 * 0.41%* 18113.20 * 0.84%*
> Hmean 2048 28521.23 ( 0.00%) 28582.40 ( 0.21%) 28874.90 * 1.24%*
> Hmean 3312 35073.78 ( 0.00%) 35030.83 ( -0.12%) 35436.52 * 1.03%*
> Hmean 4096 37939.84 ( 0.00%) 38058.32 ( 0.31%) 38064.09 ( 0.33%)
> Hmean 8192 43290.77 ( 0.00%) 43526.58 * 0.54%* 43571.70 * 0.65%*
> Hmean 16384 48739.96 ( 0.00%) 49239.02 * 1.02%* 48894.49 ( 0.32%)
> Hmean 65507 54710.35 ( 0.00%) 54913.36 * 0.37%* 54810.55 ( 0.18%)

All the kernels were configured with the same config file.
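Assuming the vanilla kernels install their configs under /boot (the file names below are only illustrative), the claim that all kernels used the same config can be spot-checked with a simple diff:

> $ diff /boot/config-6.0.0-vanilla /boot/config-6.4.0-vanilla && echo "configs identical"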
Then, I ran tests on bing3 and configured the kernels with the config file used by the grid (Marvin) at the time when the automated tests were executed:

> bing3 Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz
> netperf-tcp
> Buffer size          6.0.0              6.3.0              6.4.0
> Hmean 64 2092.17 ( 0.00%) 1975.73 * -5.57%* 1908.17 * -8.79%*
> Hmean 128 4013.45 ( 0.00%) 3761.81 * -6.27%* 3637.21 * -9.37%*
> Hmean 256 6430.60 ( 0.00%) 6138.37 * -4.54%* 5984.11 * -6.94%*
> Hmean 1024 18543.25 ( 0.00%) 17917.13 * -3.38%* 17525.97 * -5.49%*
> Hmean 2048 29406.79 ( 0.00%) 28518.33 * -3.02%* 27887.43 * -5.17%*
> Hmean 3312 35692.60 ( 0.00%) 34918.12 * -2.17%* 34434.63 * -3.52%*
> Hmean 4096 38387.12 ( 0.00%) 37950.87 * -1.14%* 37444.05 * -2.46%*
> Hmean 8192 43465.85 ( 0.00%) 43607.88 ( 0.33%) 43156.65 * -0.71%*
> Hmean 16384 48765.28 ( 0.00%) 48803.45 ( 0.08%) 48708.08 ( -0.12%)

With a buffer size of 64 bytes, the netperf process is the bottleneck; the CPU running the netperf process is busier:

> # Samples: 361K of event 'bus-cycles'
> # Event count (approx.): 7662575500
> # Overhead  Samples      Period      Command    Shared Object
> 44.06% 135495 3375795595 netperf [kernel.vmlinux]
> 36.79% 116500 2819168481 netserver [kernel.vmlinux]
> 2.90% 8934 222555650 netperf libc-2.31.so
> 2.90% 9179 222106330 netserver libc-2.31.so
> 1.79% 5678 137359526 netserver netserver
> 1.64% 5045 125704051 netperf netperf

The runtime corresponding to the event count of the netperf process:

> (3375795595 + 222555650 + 125704051) / 25000000 = 148.962 seconds

The runtime corresponding to the event count of the netserver process:

> (2819168481 + 222106330 + 137359526) / 25000000 = 127.145 seconds

The benchmark takes 150 seconds. A profile diff (in bus-cycles) for the test with 64-byte buffers on bing3 (focusing just on the netperf process):

> test bing3 tcp 64 6.0.0-vanilla 6.3.0-vanilla
> # Util1        Util2          Diff                  Command  Shared Object      Symbol                             CPU
> 0 63,821,807 63,821,807 (100.0%) netperf [kernel.kallsyms] tomoyo_socket_sendmsg_permission all
> 25,663,830 65,043,891 39,380,061 (153.4%) netperf [kernel.kallsyms] security_socket_sendmsg all
> 37,467,695 70,775,306 33,307,611 ( 88.9%) netperf [kernel.kallsyms] ipv4_mtu all
> 29,730,655 57,734,946 28,004,291 ( 94.2%) netperf [kernel.kallsyms] sock_sendmsg all
> 0 24,880,980 24,880,980 (100.0%) netperf [kernel.kallsyms] bpf_lsm_socket_sendmsg all
> 77,630,759 94,037,965 16,407,206 ( 21.1%) netperf [kernel.kallsyms] __virt_addr_valid all
> 15,696,586 31,583,730 15,887,144 (101.2%) netperf [kernel.kallsyms] __x64_sys_sendto all
> 0 12,148,230 12,148,230 (100.0%) netperf [kernel.kallsyms] tomoyo_sock_family.part.2 all
> 232,113,276 219,415,697 -12,697,579 ( 5.5%) netperf [kernel.kallsyms] copy_user_enhanced_fast_string all
> 174,153,324 159,965,349 -14,187,975 ( 8.1%) netperf [kernel.kallsyms] _raw_spin_lock_bh all
> 222,492,825 207,008,349 -15,484,476 ( 7.0%) netperf libc-2.31.so send all
> 110,015,924 87,089,875 -22,926,049 ( 20.8%) netperf [kernel.kallsyms] __sys_sendto all
> 149,397,565 123,525,466 -25,872,099 ( 17.3%) netperf [kernel.kallsyms] read_tsc all
> 121,571,240 87,804,285 -33,766,955 ( 27.8%) netperf netperf send_tcp_stream all
> 130,482,864 93,990,619 -36,492,245 ( 28.0%) netperf [kernel.kallsyms] __check_object_size all

The functions executed under 6.3.0-vanilla but not under 6.0.0-vanilla (tomoyo_socket_sendmsg_permission, bpf_lsm_socket_sendmsg, tomoyo_sock_family.part.2) belong to various LSMs and carry out mandatory access control:

> 40.38% 0.96% 2928 72924113 netperf [kernel.vmlinux] [k] entry_SYSCALL_64_after_hwframe
> |--39.43%--entry_SYSCALL_64_after_hwframe
> |          do_syscall_64
> |          |--36.62%--__x64_sys_sendto
> |          |           --36.25%--__sys_sendto
> |          |                      |--33.19%--sock_sendmsg
> |          |                      |          |--28.12%--tcp_sendmsg
> |          |                      |          |--3.93%--security_socket_sendmsg
> |          |                      |          |          |--1.29%--aa_sk_perm
> |          |                      |          |          |           --0.05%--__cond_resched
> |          |                      |          |          |--1.13%--tomoyo_socket_sendmsg_permission
> |          |                      |          |          |           --0.17%--tomoyo_sock_family.part.2
> |          |                      |          |          |--0.29%--bpf_lsm_socket_sendmsg
> |          |                      |          |          |--0.17%--apparmor_socket_sendmsg
> |          |                      |          |           --0.07%--tomoyo_socket_sendmsg

A profile diff (in cycles) for the test with 64-byte buffers on bing2 did not show any difference in functions from LSMs because they were executed under both kernels:

> test bing2 tcp 64 6.0.0-vanilla 6.3.0-vanilla
> # Util1            Util2              Diff                 Command  Shared Object      Symbol              CPU
> 3,108,910,575 6,819,941,295 3,711,030,720 (119.4%) netperf netperf 0x000000000000a479 all
> 6,848,558,780 9,157,610,739 2,309,051,959 ( 33.7%) netperf [kernel.kallsyms] ipv4_mtu all
> 48,174,509,222 50,367,436,898 2,192,927,676 ( 4.6%) netperf [kernel.kallsyms] tcp_sendmsg_locked all

The pertinent difference in the kernel config files used by Marvin at the time when the automated tests were executed on the grid:

> $ diff -u kconfig-6.0.0-vanilla.txt kconfig-6.3.0-vanilla.txt
> -CONFIG_LSM="integrity,apparmor"
> +CONFIG_LSM="landlock,lockdown,yama,loadpin,safesetid,integrity,apparmor,selinux,smack,tomoyo,bpf"

The default initialization of LSMs changed to include many more frameworks, which comes with a performance penalty. The commit in kernel-source that changed it:

> commit 720c38318edb68b138a4bc4c86bb8ff0fbcda672
> Author: Jeff Mahoney <jeffm@suse.com>
> Date:   Thu Dec 8 14:32:18 2022 -0500
>
>     config: update CONFIG_LSM defaults (bsc#1205603).
>
>     CONFIG_LSM determines what the default order of LSM usage is. The
>     default order is set based on whether AppArmor or SELinux is preferred
>     in the config (we still prefer AppArmor). The default set has changed
>     over time and we haven't updated it, leading to things like bpf LSMs
>     not working out of the box.
>
>     This change just updates CONFIG_LSM to what the default would be now.
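The profile diffs above look post-processed; a plain perf workflow that yields comparable per-kernel data might be the sketch below (the event name and options are standard perf usage, but the exact invocation used on the grid is an assumption):

> $ perf record -a -e bus-cycles -o perf-6.0.data -- sleep 150    # while netperf runs under 6.0
> $ perf record -a -e bus-cycles -o perf-6.3.data -- sleep 150    # repeat after booting 6.3
> $ perf report -i perf-6.3.data --sort comm,dso,symbol
> $ perf diff perf-6.0.data perf-6.3.data                         # per-symbol delta between the kernels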
To check the analysis, I ran a test on bing3 and passed lsm=integrity,apparmor to the kernel:

> bing3 Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz
> netperf-tcp
> Buffer size          6.0.0              6.3.0              6.4.0
> Hmean 64 2098.80 ( 0.00%) 2084.18 * -0.70%* 2084.71 * -0.67%*
> Hmean 128 4023.99 ( 0.00%) 3967.67 * -1.40%* 3944.35 * -1.98%*
> Hmean 256 6423.86 ( 0.00%) 6342.55 * -1.27%* 6398.97 ( -0.39%)
> Hmean 1024 18715.06 ( 0.00%) 18495.83 ( -1.17%) 18411.75 * -1.62%*
> Hmean 2048 29395.03 ( 0.00%) 29113.37 * -0.96%* 29203.98 * -0.65%*
> Hmean 3312 35556.39 ( 0.00%) 35577.92 ( 0.06%) 35247.40 * -0.87%*
> Hmean 4096 38169.92 ( 0.00%) 38732.39 * 1.47%* 38263.48 ( 0.25%)
> Hmean 8192 43519.06 ( 0.00%) 44287.92 * 1.77%* 44016.39 * 1.14%*
> Hmean 16384 49240.75 ( 0.00%) 49579.35 ( 0.69%) 49324.33 ( 0.17%)

The regression is gone when the LSM settings are reverted. Next, I quantified the impact of individual LSMs. The kernel names in the following tables contain suffixes that denote the lsm argument specified on the kernel command line:

> -prev - the kernel was passed lsm=integrity,apparmor
> -new  - the kernel was passed lsm=landlock,lockdown,yama,loadpin,safesetid,integrity,apparmor,selinux,smack,tomoyo,bpf
> -nt   - (no tomoyo) the kernel was passed lsm=landlock,lockdown,yama,loadpin,safesetid,integrity,apparmor,selinux,smack,bpf
> -ntnb - (no tomoyo, no bpf) the kernel was passed lsm=landlock,lockdown,yama,loadpin,safesetid,integrity,apparmor,selinux,smack

The combined regression caused by the tomoyo and bpf LSM callbacks is more than 7% under Linux 6.4 on bing3:

> bing3 Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz
> netperf-tcp
>                      6.4.0              6.4.0              6.4.0              6.4.0
> Buffer size          vanilla-prev       vanilla-new        vanilla-nt         vanilla-ntnb
> Hmean 64 2094.09 ( 0.00%) 1938.45 * -7.43%* 2035.36 * -2.80%* 2092.30 ( -0.09%)
> Hmean 128 4003.31 ( 0.00%) 3686.51 * -7.91%* 3881.77 * -3.04%* 3987.24 ( -0.40%)
> Hmean 256 6460.78 ( 0.00%) 6050.71 * -6.35%* 6243.52 * -3.36%* 6445.97 ( -0.23%)
> Hmean 1024 18545.31 ( 0.00%) 17607.09 * -5.06%* 18245.47 * -1.62%* 18490.91 ( -0.29%)
> Hmean 2048 29272.75 ( 0.00%) 27989.47 * -4.38%* 28951.70 * -1.10%* 29071.39 * -0.69%*
> Hmean 3312 35352.42 ( 0.00%) 34361.17 * -2.80%* 35131.84 * -0.62%* 35413.87 ( 0.17%)
> Hmean 4096 38033.06 ( 0.00%) 37305.99 * -1.91%* 38122.59 ( 0.24%) 38195.74 ( 0.43%)
> Hmean 8192 43336.07 ( 0.00%) 42755.58 * -1.34%* 43357.50 ( 0.05%) 43684.97 * 0.81%*
> Hmean 16384 49016.40 ( 0.00%) 48399.16 * -1.26%* 49182.73 ( 0.34%) 49696.99 * 1.39%*

Half of the regression is caused by the tomoyo LSM callback, tomoyo_socket_sendmsg:

> 40.38% 0.96% 2928 72924113 netperf [kernel.vmlinux] [k] entry_SYSCALL_64_after_hwframe
> |--39.43%--entry_SYSCALL_64_after_hwframe
> |          do_syscall_64
> |          |--36.62%--__x64_sys_sendto
> |          |           --36.25%--__sys_sendto
> |          |                      |--33.19%--sock_sendmsg
> |          |                      |          |--28.12%--tcp_sendmsg
> |          |                      |          |--3.93%--security_socket_sendmsg
> |          |                      |          |          |--1.29%--aa_sk_perm
> |          |                      |          |          |           --0.05%--__cond_resched
> |          |                      |          |          |--1.13%--tomoyo_socket_sendmsg_permission
> |          |                      |          |          |           --0.17%--tomoyo_sock_family.part.2
> |          |                      |          |          |--0.29%--bpf_lsm_socket_sendmsg
> |          |                      |          |          |--0.17%--apparmor_socket_sendmsg
> |          |                      |          |           --0.07%--tomoyo_socket_sendmsg

I forced tomoyo_sock_family() to be inlined and the overhead of the tomoyo LSM callback got reduced:

> bing3 Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz
> netperf-tcp
>                      6.4.0              6.4.0              6.4.0              6.4.0
> Buffer size          tomoyo-prev        tomoyo-new         tomoyo-nt          tomoyo-ntnb
> Hmean 64 2095.18 ( 0.00%) 1992.49 * -4.90%* 2025.82 * -3.31%* 2104.16 ( 0.43%)
> Hmean 128 4003.76 ( 0.00%) 3776.71 * -5.67%* 3874.75 * -3.22%* 4030.11 ( 0.66%)
> Hmean 256 6501.12 ( 0.00%) 6175.40 * -5.01%* 6294.64 * -3.18%* 6477.90 * -0.36%*
> Hmean 1024 18695.36 ( 0.00%) 17814.94 * -4.71%* 18213.32 * -2.58%* 18564.24 * -0.70%*
> Hmean 2048 29582.21 ( 0.00%) 28305.37 * -4.32%* 28738.64 * -2.85%* 29219.64 * -1.23%*
> Hmean 3312 35647.01 ( 0.00%) 34719.50 * -2.60%* 35178.56 * -1.31%* 35482.23 * -0.46%*
> Hmean 4096 38442.34 ( 0.00%) 37997.83 * -1.16%* 37834.10 * -1.58%* 38303.13 ( -0.36%)
> Hmean 8192 43605.34 ( 0.00%) 43638.23 ( 0.08%) 43337.54 * -0.61%* 43545.55 ( -0.14%)
> Hmean 16384 49108.32 ( 0.00%) 49181.45 ( 0.15%) 48724.71 * -0.78%* 49373.56 ( 0.54%)

So far, I have been using the 15sp4 compiler (gcc 7.5). I found out that gcc 12.3 inlines tomoyo_sock_family() even without the inline function specifier. This makes me dubious about submitting a patch inlining tomoyo_sock_family() upstream. It can be assumed that ALP will employ a recent gcc for compiling the kernel.
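Whether a given compiler ended up inlining tomoyo_sock_family() can be checked directly on a running kernel: if the helper was emitted out of line, it (or a .part. fragment, as seen in the profile above) shows up in kallsyms:

> $ grep tomoyo_sock_family /proc/kallsyms
> # a tomoyo_sock_family (or .part.N) entry means the helper was emitted out of line;
> # no output means the compiler inlined it everywhere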
I ran tests to quantify the impact of the tomoyo and bpf LSMs on multiple machines. The kernel names in the following tables contain suffixes that denote the lsm argument specified on the kernel command line:

> -nt   - (no tomoyo) the kernel was passed lsm=landlock,lockdown,yama,loadpin,safesetid,integrity,apparmor,selinux,smack,bpf
> -ntnb - (no tomoyo, no bpf) the kernel was passed lsm=landlock,lockdown,yama,loadpin,safesetid,integrity,apparmor,selinux,smack

For the smallest buffer sizes (64 and 128), we see that the performance impact of the tomoyo LSM is between 3% and 4% on most machines and much more on armani, which is a Zen 3 machine:

> armani AMD EPYC 7713 64-Core Processor 2.0GHz
> netperf-udp
>                      6.4.8              6.4.8              6.4.8
>                      alp-230807         alp-230807-nt      alp-230807-ntnb
> Hmean send-64 343.23 ( 0.00%) 372.52 * 8.53%* 397.71 * 15.87%*
> Hmean send-128 678.93 ( 0.00%) 734.28 * 8.15%* 781.65 * 15.13%*
> Hmean send-256 1352.18 ( 0.00%) 1472.93 * 8.93%* 1556.23 * 15.09%*
> Hmean send-1024 5187.59 ( 0.00%) 5529.80 * 6.60%* 5749.17 * 10.83%*
> netperf-tcp
>                      6.4.8              6.4.8              6.4.8
>                      alp-230807         alp-230807-nt      alp-230807-ntnb
> Hmean 64 1991.98 ( 0.00%) 2170.00 * 8.94%* 2248.64 * 12.88%*
> Hmean 128 3790.39 ( 0.00%) 4119.44 * 8.68%* 4234.52 * 11.72%*
> Hmean 256 7207.25 ( 0.00%) 7797.34 * 8.19%* 8062.61 * 11.87%*
> Hmean 1024 24067.65 ( 0.00%) 25532.09 * 6.08%* 26176.27 * 8.76%*
> bing2 Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz
> netperf-udp
>                      6.4.8              6.4.8              6.4.8
>                      alp-230807         alp-230807-nt      alp-230807-ntnb
> Hmean send-64 358.16 ( 0.00%) 366.27 * 2.26%* 378.73 * 5.74%*
> Hmean send-128 705.29 ( 0.00%) 708.79 ( 0.50%) 738.74 * 4.74%*
> Hmean send-256 1403.63 ( 0.00%) 1407.67 ( 0.29%) 1468.94 * 4.65%*
> Hmean send-1024 5335.08 ( 0.00%) 5373.77 ( 0.73%) 5560.37 * 4.22%*
> netperf-tcp
>                      6.4.8              6.4.8              6.4.8
>                      alp-230807         alp-230807-nt      alp-230807-ntnb
> Hmean 64 1990.97 ( 0.00%) 2069.85 * 3.96%* 2118.47 * 6.40%*
> Hmean 128 3815.90 ( 0.00%) 3953.75 * 3.61%* 4056.83 * 6.31%*
> Hmean 256 6456.85 ( 0.00%) 6557.91 ( 1.57%) 6800.38 * 5.32%*
> Hmean 1024 18094.43 ( 0.00%) 18734.71 * 3.54%* 18767.92 * 3.72%*
> hardy2 Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
> netperf-udp
>                      6.4.8              6.4.8              6.4.8
>                      alp-230807         alp-230807-nt      alp-230807-ntnb
> Hmean send-64 266.20 ( 0.00%) 273.39 * 2.70%* 283.43 * 6.47%*
> Hmean send-128 525.04 ( 0.00%) 542.57 * 3.34%* 550.44 * 4.84%*
> Hmean send-256 1054.81 ( 0.00%) 1085.81 * 2.94%* 1102.03 * 4.48%*
> Hmean send-1024 3968.91 ( 0.00%) 3976.68 ( 0.20%) 4255.94 * 7.23%*
> netperf-tcp
>                      6.4.8              6.4.8              6.4.8
>                      alp-230807         alp-230807-nt      alp-230807-ntnb
> Hmean 64 779.06 ( 0.00%) 806.93 * 3.58%* 812.85 * 4.34%*
> Hmean 128 1499.69 ( 0.00%) 1569.62 * 4.66%* 1585.93 * 5.75%*
> Hmean 256 2866.62 ( 0.00%) 2900.68 * 1.19%* 2981.46 * 4.01%*
> Hmean 1024 10005.45 ( 0.00%) 10216.24 * 2.11%* 10420.25 * 4.15%*
> simba2 Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz
> netperf-udp
>                      6.4.8              6.4.8              6.4.8
>                      alp-230807         alp-230807-nt      alp-230807-ntnb
> Hmean send-64 392.51 ( 0.00%) 410.86 * 4.67%* 415.17 * 5.77%*
> Hmean send-128 804.86 ( 0.00%) 826.45 * 2.68%* 831.27 * 3.28%*
> Hmean send-256 1583.82 ( 0.00%) 1647.26 * 4.01%* 1664.68 * 5.11%*
> Hmean send-1024 5933.02 ( 0.00%) 6073.28 * 2.36%* 6138.85 * 3.47%*
> netperf-tcp
>                      6.4.8              6.4.8              6.4.8
>                      alp-230807         alp-230807-nt      alp-230807-ntnb
> Hmean 64 2313.46 ( 0.00%) 2401.68 * 3.81%* 2381.45 * 2.94%*
> Hmean 128 4523.62 ( 0.00%) 4708.94 * 4.10%* 4679.95 * 3.46%*
> Hmean 256 8576.52 ( 0.00%) 8933.48 * 4.16%* 8904.31 * 3.82%*
> Hmean 1024 25304.54 ( 0.00%) 25345.88 ( 0.16%) 25637.67 * 1.32%*
> toto AMD EPYC 7601 32-Core Processor 2.2GHz
> netperf-udp
>                      6.4.8              6.4.8              6.4.8
>                      alp-230807         alp-230807-nt      alp-230807-ntnb
> Hmean send-64 243.28 ( 0.00%) 253.77 * 4.31%* 274.97 * 13.03%*
> Hmean send-128 491.07 ( 0.00%) 511.98 * 4.26%* 548.04 * 11.60%*
> Hmean send-256 976.39 ( 0.00%) 1013.91 * 3.84%* 1099.14 * 12.57%*
> Hmean send-1024 3535.11 ( 0.00%) 3796.68 * 7.40%* 3963.32 * 12.11%*
> netperf-tcp
>                      6.4.8              6.4.8              6.4.8
>                      alp-230807         alp-230807-nt      alp-230807-ntnb
> Hmean 64 1110.09 ( 0.00%) 1164.61 * 4.91%* 1186.33 * 6.87%*
> Hmean 128 2128.86 ( 0.00%) 2205.09 * 3.58%* 2300.50 * 8.06%*
> Hmean 256 4068.26 ( 0.00%) 4249.99 * 4.47%* 4368.44 * 7.38%*
> Hmean 1024 12118.48 ( 0.00%) 12663.98 * 4.50%* 12961.32 * 6.95%*

Disabling the bpf LSM yields further performance improvements, but it may need to stay enabled, see bug 1205603.

An LSM's callbacks are executed only if that particular LSM is enabled. There are two kernel parameters that determine which LSMs get enabled at boot: security and lsm. The security parameter selects one major LSM; the other major LSMs get disabled. The major LSMs are apparmor, selinux, smack and tomoyo. The security parameter and major LSMs are considered a legacy approach to selecting LSMs. The lsm parameter specifies the order in which LSMs are initialized; the first exclusive LSM in the list gets enabled and the remaining exclusive LSMs get disabled. The exclusive LSMs are apparmor, selinux and smack. After changing CONFIG_LSM to

> CONFIG_LSM="landlock,lockdown,yama,loadpin,safesetid,integrity,apparmor,selinux,smack,tomoyo,bpf"

tomoyo got enabled on performance grid machines because the performance grid uses only basic kernel parameters:

> [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-6.4.0-vanilla root=UUID=065c0e9b-68c0-4a4f-a88b-f3b44a49d390 sysrq_always_enabled panic=100 console=tty0 console=ttyS0,115200
> [    3.105614] LSM: initializing lsm=lockdown,capability,landlock,yama,apparmor,tomoyo,bpf,integrity
> [    3.106596] landlock: Up and running.
> [    3.109907] Yama: becoming mindful.
> [    3.113261] AppArmor: AppArmor initialized
> [    3.116575] TOMOYO Linux initialized
> [    3.119912] LSM support for eBPF active

So, we see the performance impact of both the tomoyo and bpf LSMs on grid machines (in addition to apparmor, which had been there all along).

SLES passes security=apparmor to the kernel, which means that apparmor will be enabled and selinux, smack and tomoyo will be disabled. Should a user delete the security=apparmor parameter, the value of CONFIG_LSM will determine the LSMs that will be enabled:

1. In 15sp5, CONFIG_LSM="integrity,apparmor" means that apparmor will be the only major LSM enabled.
2. In 15sp6, CONFIG_LSM="landlock,lockdown,yama,loadpin,safesetid,integrity,apparmor,selinux,smack,tomoyo,bpf" means that tomoyo will get enabled in addition to apparmor.

I recommend changing CONFIG_LSM for 15sp6 so that deleting the security=apparmor parameter does not cause a regression in throughput in cases where a small amount of data is transferred between processes and many syscalls are made.
I suggest removing all major LSMs apart from apparmor:

> CONFIG_LSM="landlock,lockdown,yama,loadpin,safesetid,integrity,apparmor,bpf"

As for ALP, the security=selinux parameter is passed to the kernel in ALP Dolomite Milestone 2:

> [    0.020910] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-6.4.1-1-default root=UUID=33032718-8cbe-4dab-b06a-f62c9b3ed10b rd.timeout=60 rd.retry=45 console=ttyS0,115200 console=tty0 security=selinux selinux=1 quiet net.ifnames=0 ignition.platform.id=qemu
> [    0.039846] LSM: initializing lsm=lockdown,capability,landlock,yama,selinux,bpf,integrity
> [    0.039854] landlock: Up and running.
> [    0.039855] Yama: becoming mindful.
> [    0.039859] SELinux:  Initializing.
> [    0.039869] LSM support for eBPF active

This means selinux will be enabled and the other major LSMs - apparmor, smack and tomoyo - will be disabled. Again, should someone remove security=selinux from the kernel command line, the value of CONFIG_LSM would cause tomoyo to get enabled in addition to selinux after rebooting the machine. As we know, enabling tomoyo results in a slight performance regression in certain cases. For ALP, I suggest removing all major LSMs apart from selinux:

> CONFIG_LSM="landlock,lockdown,yama,loadpin,safesetid,integrity,selinux,bpf"

More details:

Debugging output for LSM initialization under ALP:

> [    0.063179] LSM: legacy security=selinux
> [    0.063180] LSM: CONFIG_LSM=landlock,lockdown,yama,loadpin,safesetid,integrity,apparmor,selinux,smack,tomoyo,bpf
> [    0.063181] LSM: boot arg lsm= *unspecified*
> [    0.063182] LSM: early started: lockdown (enabled)
> [    0.063183] LSM: first ordered: capability (enabled)
> [    0.063184] LSM: security=selinux disabled: tomoyo (only one legacy major LSM)
> [    0.063185] LSM: security=selinux disabled: apparmor (only one legacy major LSM)
> [    0.063186] LSM: builtin ordered: landlock (enabled)
> [    0.063187] LSM: builtin ignored: lockdown (not built into kernel)
> [    0.063187] LSM: builtin ordered: yama (enabled)
> [    0.063188] LSM: builtin ignored: loadpin (not built into kernel)
> [    0.063189] LSM: builtin ignored: safesetid (not built into kernel)
> [    0.063190] LSM: builtin ordered: apparmor (disabled)
> [    0.063190] LSM: builtin ordered: selinux (enabled)
> [    0.063191] LSM: builtin ignored: smack (not built into kernel)
> [    0.063192] LSM: builtin ordered: tomoyo (disabled)
> [    0.063192] LSM: builtin ordered: bpf (enabled)
> [    0.063193] LSM: last ordered: integrity (enabled)
> [    0.063194] LSM: exclusive chosen: selinux
> [    0.063195] LSM: initializing lsm=lockdown,capability,landlock,yama,selinux,bpf,integrity
> [    0.063199] LSM: cred blob size       = 32
> [    0.063199] LSM: file blob size       = 24
> [    0.063200] LSM: inode blob size      = 72
> [    0.063200] LSM: ipc blob size        = 8
> [    0.063201] LSM: msg_msg blob size    = 4
> [    0.063201] LSM: superblock blob size = 80
> [    0.063202] LSM: task blob size       = 8
> [    0.063205] LSM: initializing capability
> [    0.063206] LSM: initializing landlock
> [    0.063207] landlock: Up and running.
> [    0.063207] LSM: initializing yama
> [    0.063208] Yama: becoming mindful.
> [    0.063212] LSM: initializing selinux
> [    0.063213] SELinux:  Initializing.
> [    0.063221] LSM: initializing bpf
> [    0.063223] LSM support for eBPF active
> [    0.063224] LSM: initializing integrity

Commits responsible for LSM selection:

> v5.0-rc1-1-g47008e5161fa LSM: Introduce LSM_FLAG_LEGACY_MAJOR
>     This adds a flag for the current "major" LSMs to distinguish them when
>     we have a universal method for ordering all LSMs. It's called "legacy"
>     since the distinction of "major" will go away in the blob-sharing world.
> v5.0-rc1-11-g14bd99c821f7 LSM: Separate idea of "major" LSM from "exclusive" LSM
>     In order to both support old "security=" Legacy Major LSM selection, and
>     handling real exclusivity, this creates LSM_FLAG_EXCLUSIVE and updates
>     the selection logic to handle them.
> v5.0-rc1-38-ga5e2fe7ede12 TOMOYO: Update LSM flags to no longer be exclusive
>     With blob sharing in place, TOMOYO is no longer an exclusive LSM, so it
>     can operate separately now. Mark it as such.
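As a side note, the verbose "LSM: ..." initialization lines quoted above are not printed by default; to my knowledge they are gated behind the lsm.debug kernel parameter (treat the exact parameter name as an assumption and check Documentation/admin-guide/kernel-parameters.txt), and the decisions can be read back from dmesg:

> $ grep -o 'lsm.debug' /proc/cmdline      # confirm the debug switch was on the command line
> $ dmesg | grep 'LSM:'                    # the initialization log quoted above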
is there a runtime toggle to enable tomoyo?
Yes, there is. The lsm parameter can be passed to the kernel on its command line:

> lsm=landlock,lockdown,yama,loadpin,safesetid,integrity,apparmor,tomoyo,bpf

The value of the lsm parameter can be derived from whatever is in /sys/kernel/security/lsm. Alternatively, to use tomoyo as a major LSM, security=tomoyo can be passed to the kernel.
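On a SUSE-style installation using GRUB, one way to make this persistent is sketched below (the /etc/default/grub variable and the grub2-mkconfig path are the usual SUSE defaults; adjust for the installation at hand):

> $ # append the desired value to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, e.g.
> $ #   lsm=landlock,lockdown,yama,loadpin,safesetid,integrity,apparmor,tomoyo,bpf
> $ grub2-mkconfig -o /boot/grub2/grub.cfg
> $ reboot
> $ cat /sys/kernel/security/lsm           # after reboot, tomoyo should appear in the list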
my question was aiming towards having tomoyo built into the kernel but disabled by default. so the user could enable it via kernel cmdline.
(In reply to Marcus Rückert from comment #8)
> my question was aiming towards having tomoyo built into the kernel but
> disabled by default. so the user could enable it via kernel cmdline.

This is exactly what I have just described. I suspect we have a misunderstanding when it comes to the terms I used. Compile-time options are always in capitals; kernel parameters passed to the kernel on its command line are in lower case and always called "kernel parameters". On the kernel command line, tomoyo can be enabled with the lsm= kernel parameter or the security= kernel parameter. The CONFIG_SECURITY_TOMOYO compile-time option is set to yes, so tomoyo is built into the kernel. Disabling tomoyo by default is accomplished by tweaking the CONFIG_LSM compile-time option. Sorry, but "runtime toggle" isn't part of the terminology, so I am not going to use it. See
Documentation/admin-guide/kernel-parameters.rst
Documentation/admin-guide/kernel-parameters.txt
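To illustrate the distinction, "built in but disabled by default" can be verified on a running system like this (the /boot config path is the common layout; /proc/config.gz works too if CONFIG_IKCONFIG_PROC is set):

> $ grep CONFIG_SECURITY_TOMOYO= /boot/config-$(uname -r)     # =y means tomoyo is compiled in
> $ grep -qw tomoyo /sys/kernel/security/lsm && echo "tomoyo active" || echo "built in, not enabled"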
Two blind alleys:

1. Despite the encouraging results on bing3, testing on the grid did not show any substantial improvement after inlining tomoyo_sock_family() and tomoyo_kernel_service().

2. There is one difference between 6.3 and 6.4 that attracted my attention: copy_user_generic() was revamped to check for the FSRM feature (a bit flag reported in cpuid output) instead of the ERMS and REP_GOOD features. A profile diff (in bus-cycles) shows time is spent in rep_movs_alternative() and copyin() instead of copy_user_enhanced_fast_string():

> test bing3 tcp 64 6.3.0-vanilla-new 6.4.0-vanilla-new
> # Util1        Util2          Diff                  Command  Shared Object      Symbol                             CPU
> 0 116,798,963 116,798,963 (100.0%) netperf [kernel.kallsyms] rep_movs_alternative all
> 0 107,566,905 107,566,905 (100.0%) netperf [kernel.kallsyms] copyin all
> 50,512,400 73,443,792 22,931,392 ( 45.4%) netperf [kernel.kallsyms] ipv4_mtu all
> 90,326,233 111,343,210 21,016,977 ( 23.3%) netperf netperf send_tcp_stream all
> 75,108,050 95,527,384 20,419,334 ( 27.2%) netperf [kernel.kallsyms] aa_sk_perm all
> 39,850,798 55,473,237 15,622,439 ( 39.2%) netperf [kernel.kallsyms] sock_sendmsg all
> 23,117,865 35,803,490 12,685,625 ( 54.9%) netperf [kernel.kallsyms] skb_page_frag_refill all
> 33,210,582 20,175,432 -13,035,150 ( 39.2%) netperf [kernel.kallsyms] inet_sendmsg all
> 113,199,306 97,356,180 -15,843,126 ( 14.0%) netperf [kernel.kallsyms] __check_object_size all
> 135,873,271 82,815,536 -53,057,735 ( 39.0%) netperf [kernel.kallsyms] _copy_from_iter all
> 216,323,164 0 -216,323,164 (100.0%) netperf [kernel.kallsyms] copy_user_enhanced_fast_string all

The compiler inlined copyin() in 6.3.0-vanilla, whereas it was not inlined in 6.4.0-vanilla. I tested inlining copyin(), copyout() and copyout_nofault(). Tests on the grid did not show any substantial improvement on any machine.
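Whether a particular machine advertises FSRM (and hence which copy path 6.4 takes) can be read from the CPU feature flags; the flag names below are the standard /proc/cpuinfo spellings:

> $ grep -m1 -o -w -E 'fsrm|erms|rep_good' /proc/cpuinfo
> # if 'fsrm' is absent, 6.4 falls back to rep_movs_alternative(), which matches the profile above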