Bug 1212446 - [ALP] Regression in Linux 6.4 in network throughput on localhost with netperf
Summary: [ALP] Regression in Linux 6.4 in network throughput on localhost with netperf
Status: IN_PROGRESS
Alias: None
Product: ALP Bedrock
Classification: SUSE ALP - SUSE Adaptable Linux Platform
Component: Kernel
Version: unspecified
Hardware: x86-64
OS: openSUSE Tumbleweed
Priority: P2 - High
Severity: Normal
Target Milestone: ---
Assignee: Jiri Wiesner
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-06-16 12:46 UTC by Jiri Wiesner
Modified: 2024-03-24 02:10 UTC
CC: 5 users

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Description Jiri Wiesner 2023-06-16 12:46:45 UTC
Testing on the performance team grid revealed a regression in Linux 6.3 when compared to Linux 6.0 or Linux 6.2. The size of the regression differs between CPU generations and is largest for the smallest buffer sizes (64, 128 and 256 bytes). There is a regression in TCP throughput on all Intel and AMD CPUs in the grid. In the tables below, the percentages are relative to the 6.0.0 baseline. Typically, the regression in TCP throughput is largest between Linux 6.0 and Linux 6.3:
> bing1 Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz
> netperf-tcp
>                          6.0.0                    6.1.0                   6.2.0                   6.3.0
> Hmean         64      2174.41 ( 0.00%)        2195.87 ( 0.99%)        2160.47 ( -0.64%)       2031.93 ( -6.55%)
> Hmean         128     4145.44 ( 0.00%)        4198.09 ( 1.27%)        4109.10 ( -0.88%)       3895.21 ( -6.04%)
> Hmean         256     6614.00 ( 0.00%)        6618.56 ( 0.07%)        6536.76 ( -1.17%)       6234.29 ( -5.74%)
> netperf-udp
>                                 6.0.0                    6.1.0                   6.2.0                   6.3.0
> Hmean         send-64         376.66  ( 0.00%)        382.07  ( 1.44%)        385.53  ( 2.35%)        371.48  ( -1.38%)
> Hmean         send-128        745.62  ( 0.00%)        762.27  ( 2.23%)        767.15  ( 2.89%)        724.00  ( -2.90%)
> Hmean         send-256        1438.62 ( 0.00%)        1461.07 ( 1.56%)        1483.66 ( 3.13%)        1435.92 ( -0.19%)
On timon1, the size of the regression in TCP throughput is largest between Linux 6.0 and Linux 6.2:
> timon1 Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz
> netperf-tcp
>                          6.0.0                    6.1.0                   6.2.0                   6.3.0
> Hmean         64      2426.77 ( 0.00%)        2387.60 ( -1.61%)       2224.76 ( -8.32%)       2334.96 ( -3.78%)
> Hmean         128     4792.27 ( 0.00%)        4745.29 ( -0.98%)       4349.13 ( -9.25%)       4619.73 ( -3.60%)
> Hmean         256     9128.85 ( 0.00%)        9038.75 ( -0.99%)       8310.65 ( -8.96%)       8793.32 ( -3.68%)
> netperf-udp
>                                 6.0.0                    6.1.0                   6.2.0                   6.3.0
> Hmean         send-64         421.00  ( 0.00%)        426.69  ( 1.35%)        416.60  ( -1.05%)       410.74  ( -2.44%)
> Hmean         send-128        841.56  ( 0.00%)        853.79  ( 1.45%)        834.17  ( -0.88%)       826.06  ( -1.84%)
> Hmean         send-256        1671.16 ( 0.00%)        1698.31 ( 1.62%)        1613.82 ( -3.43%)       1645.20 ( -1.55%)
There is not much of a regression on zazu:
> zazu Intel(R) Xeon(R) Gold 6330N CPU @ 2.20GHz
> netperf-tcp
>                          6.0.0                    6.1.0                   6.2.0                   6.3.0
> Hmean         64      2324.22 ( 0.00%)        2273.60 ( -2.18%)       2344.58 ( 0.88%)        2261.83 ( -2.68%)
> Hmean         128     4571.85 ( 0.00%)        4512.50 ( -1.30%)       4651.17 ( 1.73%)        4465.59 ( -2.32%)
> Hmean         256     8749.81 ( 0.00%)        8613.48 ( -1.56%)       8857.62 ( 1.23%)        8500.16 ( -2.85%)
> netperf-udp
>                                 6.0.0                    6.1.0                   6.2.0                   6.3.0
> Hmean         send-64         401.22  ( 0.00%)        404.38  ( 0.79%)        408.49  ( 1.81%)        400.54  ( -0.17%)
> Hmean         send-128        802.57  ( 0.00%)        814.79  ( 1.52%)        819.56  ( 2.12%)        805.08  ( 0.31%)
> Hmean         send-256        1588.79 ( 0.00%)        1610.26 ( 1.35%)        1606.07 ( 1.09%)        1607.41 ( 1.17%)
On simba1, there is a small regression between Linux 6.2 and Linux 6.3:
> simba1 Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz
> netperf-tcp
>                          6.0.0                    6.1.0                   6.2.0                   6.3.0
> Hmean         64      2354.56 ( 0.00%)        2425.80 ( 3.03%)        2427.61 ( 3.10%)        2334.02 ( -0.87%)
> Hmean         128     4631.12 ( 0.00%)        4815.27 ( 3.98%)        4802.97 ( 3.71%)        4592.37 ( -0.84%)
> Hmean         256     8869.65 ( 0.00%)        9121.16 ( 2.84%)        9138.31 ( 3.03%)        8781.06 ( -1.00%)
> netperf-udp
>                                 6.0.0                    6.1.0                   6.2.0                   6.3.0
> Hmean         send-64         401.07  ( 0.00%)        411.89  ( 2.70%)        418.60  ( 4.37%)        404.90  ( 0.95%)
> Hmean         send-128        804.47  ( 0.00%)        817.67  ( 1.64%)        838.14  ( 4.19%)        812.19  ( 0.96%)
> Hmean         send-256        1631.65 ( 0.00%)        1627.56 ( -0.25%)       1690.66 ( 3.62%)        1629.13 ( -0.15%)
The regression is larger on AMD machines:
> deandre AMD EPYC 9654
> netperf-tcp
>                          6.0.0                    6.1.0                   6.2.0                   6.3.0
> Hmean         64      2661.12 ( 0.00%)        2637.21 ( -0.90%)       2599.42 ( -2.32%)       2272.47 ( -14.60%)
> Hmean         128     5008.94 ( 0.00%)        4970.45 ( -0.77%)       4875.40 ( -2.67%)       4339.79 ( -13.36%)
> Hmean         256     9425.00 ( 0.00%)        9307.01 ( -1.25%)       9214.31 ( -2.24%)       8281.36 ( -12.13%)
> netperf-udp
>                                 6.0.0                    6.1.0                   6.2.0                   6.3.0
> Hmean         send-64         509.80  ( 0.00%)        500.06  ( -1.91%)       518.16  ( 1.64%)        528.36  ( 3.64%)
> Hmean         send-128        1010.04 ( 0.00%)        976.28  ( -3.34%)       1011.92 ( 0.19%)        1026.64 ( 1.64%)
> Hmean         send-256        2005.17 ( 0.00%)        1957.72 ( -2.37%)       2030.43 ( 1.26%)        2027.69 ( 1.12%)
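For reference, an Hmean row reports the harmonic mean of the throughput samples (presumably across the benchmark's iterations) for the given buffer size. A minimal sketch of that aggregate, assuming positive samples - an illustration, not the benchmark harness's actual code:
> #include <stddef.h>
>
> /* Harmonic mean of n positive throughput samples - the aggregate the
>  * tables above report as "Hmean" (illustration, not the harness code). */
> static double hmean(const double *samples, size_t n)
> {
>         double recip_sum = 0.0;
>         size_t i;
>
>         for (i = 0; i < n; i++)
>                 recip_sum += 1.0 / samples[i];
>         return (double)n / recip_sum;
> }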
Comment 1 Michal Hocko 2023-07-20 09:19:04 UTC
Is the same regression observed with 6.4 as well?
Comment 2 Jiri Wiesner 2023-07-20 11:22:37 UTC
(In reply to Michal Hocko from comment #1)
> Is the same regression observed with 6.4 as well?
There is a regression between 6.0 and 6.4 reported by the performance team grid. I have run tests to collect profiles, but I cannot reproduce the regression between 6.0 and 6.3; the two kernels perform similarly in my tests. I am working on providing an explanation. My current hypothesis is that the higher throughput reported under 6.0 (as seen in comment 0) was measured with an older CPU firmware version - the 6.0 results date from October 2022 for some machines.
Comment 3 Jiri Wiesner 2023-07-24 15:10:34 UTC
Since Linux 6.4 has become the base kernel version for ALP, here are new results from the grid. An even greater regression was reported between 6.0 and 6.4 on Intel machines:
> bing1 Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz
> netperf-tcp
>           Buffer size    6.0.0                    6.1.0                   6.2.0                   6.3.0                   6.4.0
> Hmean         64      2174.41 ( 0.00%)        2195.87 ( 0.99%)        2160.47 ( -0.64%)       2031.93 ( -6.55%)       1882.90 (-13.41%)
> Hmean         128     4145.44 ( 0.00%)        4198.09 ( 1.27%)        4109.10 ( -0.88%)       3895.21 ( -6.04%)       3591.05 (-13.37%)
> Hmean         256     6614.00 ( 0.00%)        6618.56 ( 0.07%)        6536.76 ( -1.17%)       6234.29 ( -5.74%)       5972.07 ( -9.71%)
> Hmean         1024    18933.25( 0.00%)        18837.52(-0.51%)        18768.16( -0.87%)       18105.33( -4.37%)       17333.93( -8.45%)
> Hmean         2048    29859.54( 0.00%)        29450.69(-1.37%)        29755.55( -0.35%)       28757.20( -3.69%)       27493.41( -7.92%)
> Hmean         3312    36009.95( 0.00%)        35733.79(-0.77%)        35875.23( -0.37%)       35329.09( -1.89%)       34048.70( -5.45%)
> Hmean         4096    38902.10( 0.00%)        38733.64(-0.43%)        38933.69(  0.08%)       38205.69( -1.79%)       37144.18( -4.52%)
> Hmean         8192    43571.81( 0.00%)        43677.82( 0.24%)        43713.05(  0.32%)       43833.88(  0.60%)       42969.59( -1.38%)
> Hmean         16384   49109.21( 0.00%)        49354.75( 0.50%)        48900.01( -0.43%)       48834.18( -0.56%)       48456.58( -1.33%)
The size of the regression did not change on Zen 3 and Zen 4 machines:
> deandre AMD EPYC 9654
> netperf-tcp
>           Buffer size    6.0.0                    6.1.0                   6.2.0                   6.3.0                   6.4.0
> Hmean         64      2661.12 ( 0.00%)        2637.21 ( -0.90%)       2599.42 ( -2.32%)       2272.47 ( -14.60%)      2314.51 ( -13.02%)
> Hmean         128     5008.94 ( 0.00%)        4970.45 ( -0.77%)       4875.40 ( -2.67%)       4339.79 ( -13.36%)      4432.17 ( -11.51%)
> Hmean         256     9425.00 ( 0.00%)        9307.01 ( -1.25%)       9214.31 ( -2.24%)       8281.36 ( -12.13%)      8413.18 ( -10.74%)
> Hmean         1024    23218.85( 0.00%)        23170.74( -0.21%)       23094.18( -0.54%)       21394.38( -7.86%)       21548.80( -7.19%)
> Hmean         2048    30340.41( 0.00%)        29622.27( -2.37%)       29366.87( -3.21%)       28710.75( -5.37%)       28659.39( -5.54%)
> Hmean         3312    34198.17( 0.00%)        33705.03( -1.44%)       33503.50( -2.03%)       33133.14( -3.11%)       33091.18( -3.24%)
> Hmean         4096    34650.74( 0.00%)        34016.23( -1.83%)       33981.56( -1.93%)       33731.23( -2.65%)       33664.32( -2.85%)
> Hmean         8192    39109.60( 0.00%)        39751.87( 1.64%)        39548.66( 1.12%)        39832.60( 1.85%)        39513.34( 1.03%)
> Hmean         16384   41406.87( 0.00%)        41785.97( 0.92%)        42002.20( 1.44%)        42301.29( 2.16%)        41712.63( 0.74%)
The regression shows up in tests with small buffer sizes, which suggests that the number of syscalls made to transfer the data matters.
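To make the syscall-count point concrete, here is a minimal sketch of a netperf-style stream sender (an illustration, not netperf's actual code). With 64-byte buffers, moving 1 GiB takes roughly 16.8 million send() calls, and each call pays the full syscall-entry and LSM-hook overhead:
> #include <stddef.h>
> #include <sys/socket.h>
>
> /* Sketch of a netperf-style stream sender: one send() per buffer, so
>  * the per-syscall overhead (syscall entry, LSM hooks, socket locking)
>  * is paid once per buf_len bytes. Assumes a connected TCP socket and
>  * buf_len <= sizeof(buf). */
> static long send_stream(int sock, size_t buf_len, size_t total_bytes)
> {
>         static char buf[65536];         /* zero-filled payload */
>         long syscalls = 0;
>         size_t sent;
>
>         for (sent = 0; sent < total_bytes; sent += buf_len) {
>                 if (send(sock, buf, buf_len, 0) < 0)
>                         return -1;
>                 syscalls++;     /* 64-byte buffers: ~16.8M calls per GiB */
>         }
>         return syscalls;
> }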
 
At first, I could not reproduce the regression. I ran tests on the grid:
> bing2 Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz
> netperf-tcp 
>        Buffer size               6.0.0                  6.3.0                  6.4.0
> Hmean     64        2002.62 (   0.00%)     1990.11 *  -0.62%*     1984.88 *  -0.89%* 
> Hmean     256       6200.69 (   0.00%)     6192.91 (  -0.13%)     6200.72 (   0.00%)
> Hmean     1024     17962.48 (   0.00%)    18035.54 *   0.41%*    18113.20 *   0.84%*
> Hmean     2048     28521.23 (   0.00%)    28582.40 (   0.21%)    28874.90 *   1.24%*
> Hmean     3312     35073.78 (   0.00%)    35030.83 (  -0.12%)    35436.52 *   1.03%*
> Hmean     4096     37939.84 (   0.00%)    38058.32 (   0.31%)    38064.09 (   0.33%)
> Hmean     8192     43290.77 (   0.00%)    43526.58 *   0.54%*    43571.70 *   0.65%*
> Hmean     16384    48739.96 (   0.00%)    49239.02 *   1.02%*    48894.49 (   0.32%)
> Hmean     65507    54710.35 (   0.00%)    54913.36 *   0.37%*    54810.55 (   0.18%)
All the kernels were configured with the same config file. (In these comparisons, values marked *...* are flagged as statistically significant by the comparison script, while values in parentheses are not.) Then, I ran tests on bing3 and configured the kernels with the config file used by the grid (Marvin) at the time the automated tests were executed:
> bing3 Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz
> netperf-tcp
>        Buffer size               6.0.0                  6.3.0                  6.4.0
> Hmean     64        2092.17 (   0.00%)     1975.73 *  -5.57%*     1908.17 *  -8.79%*
> Hmean     128       4013.45 (   0.00%)     3761.81 *  -6.27%*     3637.21 *  -9.37%*
> Hmean     256       6430.60 (   0.00%)     6138.37 *  -4.54%*     5984.11 *  -6.94%*
> Hmean     1024     18543.25 (   0.00%)    17917.13 *  -3.38%*    17525.97 *  -5.49%*
> Hmean     2048     29406.79 (   0.00%)    28518.33 *  -3.02%*    27887.43 *  -5.17%*
> Hmean     3312     35692.60 (   0.00%)    34918.12 *  -2.17%*    34434.63 *  -3.52%*
> Hmean     4096     38387.12 (   0.00%)    37950.87 *  -1.14%*    37444.05 *  -2.46%*
> Hmean     8192     43465.85 (   0.00%)    43607.88 (   0.33%)    43156.65 *  -0.71%*
> Hmean     16384    48765.28 (   0.00%)    48803.45 (   0.08%)    48708.08 (  -0.12%)

With a buffer size of 64 bytes, the netperf process is the bottleneck. The CPU running the netperf process is busier:
> # Samples: 361K of event 'bus-cycles'
> # Event count (approx.): 7662575500
> # Overhead       Samples        Period  Command          Shared Object
>     44.06%        135495    3375795595  netperf          [kernel.vmlinux]
>     36.79%        116500    2819168481  netserver        [kernel.vmlinux]
>      2.90%          8934     222555650  netperf          libc-2.31.so
>      2.90%          9179     222106330  netserver        libc-2.31.so
>      1.79%          5678     137359526  netserver        netserver
>      1.64%          5045     125704051  netperf          netperf
The runtime corresponding to the event count of the netperf process:
> (3375795595 + 222555650 + 125704051) / 25000000 = 148.962 seconds
The runtime corresponding to the event count of the netserver process:
> (2819168481 + 222106330 + 137359526) / 25000000 = 127.145 seconds
(Dividing by 25,000,000 assumes the bus-cycles event ticks at 25 MHz.) The benchmark takes 150 seconds, so the CPU running netperf is busy for virtually the entire run while the CPU running netserver still has idle headroom.
 
A profile diff (in bus-cycles) for the test with 64-byte buffers on bing3 (focusing just on the netperf process):
> test bing3  tcp  64  6.0.0-vanilla  6.3.0-vanilla
> #      Util1         Util2                   Diff  Command          Shared Object        Symbol                         CPU
>            0    63,821,807    63,821,807 (100.0%)  netperf          [kernel.kallsyms]    tomoyo_socket_sendmsg_permission all
>   25,663,830    65,043,891    39,380,061 (153.4%)  netperf          [kernel.kallsyms]    security_socket_sendmsg        all
>   37,467,695    70,775,306    33,307,611 ( 88.9%)  netperf          [kernel.kallsyms]    ipv4_mtu                       all
>   29,730,655    57,734,946    28,004,291 ( 94.2%)  netperf          [kernel.kallsyms]    sock_sendmsg                   all
>            0    24,880,980    24,880,980 (100.0%)  netperf          [kernel.kallsyms]    bpf_lsm_socket_sendmsg         all
>   77,630,759    94,037,965    16,407,206 ( 21.1%)  netperf          [kernel.kallsyms]    __virt_addr_valid              all
>   15,696,586    31,583,730    15,887,144 (101.2%)  netperf          [kernel.kallsyms]    __x64_sys_sendto               all
>            0    12,148,230    12,148,230 (100.0%)  netperf          [kernel.kallsyms]    tomoyo_sock_family.part.2      all
>  232,113,276   219,415,697   -12,697,579 (  5.5%)  netperf          [kernel.kallsyms]    copy_user_enhanced_fast_string all
>  174,153,324   159,965,349   -14,187,975 (  8.1%)  netperf          [kernel.kallsyms]    _raw_spin_lock_bh              all
>  222,492,825   207,008,349   -15,484,476 (  7.0%)  netperf          libc-2.31.so         send                           all
>  110,015,924    87,089,875   -22,926,049 ( 20.8%)  netperf          [kernel.kallsyms]    __sys_sendto                   all
>  149,397,565   123,525,466   -25,872,099 ( 17.3%)  netperf          [kernel.kallsyms]    read_tsc                       all
>  121,571,240    87,804,285   -33,766,955 ( 27.8%)  netperf          netperf              send_tcp_stream                all
>  130,482,864    93,990,619   -36,492,245 ( 28.0%)  netperf          [kernel.kallsyms]    __check_object_size            all
The functions executed under 6.3.0-vanilla but not under 6.0.0-vanilla (tomoyo_socket_sendmsg_permission, bpf_lsm_socket_sendmsg, tomoyo_sock_family.part.2) belong to various LSMs and carry out mandatory access control:
>   40.38%   0.96%     2928   72924113 netperf     [kernel.vmlinux]      [k] entry_SYSCALL_64_after_hwframe
>       |--39.43%--entry_SYSCALL_64_after_hwframe
>       |     do_syscall_64
>       |     |--36.62%--__x64_sys_sendto
>       |     |      --36.25%--__sys_sendto
>       |     |           |--33.19%--sock_sendmsg
>       |     |           |     |--28.12%--tcp_sendmsg
>       |     |           |     |--3.93%--security_socket_sendmsg
>       |     |           |     |     |--1.29%--aa_sk_perm
>       |     |           |     |     |      --0.05%--__cond_resched
>       |     |           |     |     |--1.13%--tomoyo_socket_sendmsg_permission
>       |     |           |     |     |      --0.17%--tomoyo_sock_family.part.2
>       |     |           |     |     |--0.29%--bpf_lsm_socket_sendmsg
>       |     |           |     |     |--0.17%--apparmor_socket_sendmsg
>       |     |           |     |      --0.07%--tomoyo_socket_sendmsg
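The cost of these callbacks scales with the number of enabled LSMs because the framework walks a list of registered hooks on every syscall. A simplified sketch of the dispatch, modeled on the kernel's security/security.c (the upstream code wraps this loop in the call_int_hook() macro; details vary by kernel version):
> int security_socket_sendmsg(struct socket *sock, struct msghdr *msg, int size)
> {
>         struct security_hook_list *p;
>         int rc;
>
>         /* Every enabled LSM that registered a socket_sendmsg callback runs
>          * on every sendmsg() - apparmor, tomoyo, bpf, ... A denial by any
>          * one of them fails the whole operation. */
>         hlist_for_each_entry(p, &security_hook_heads.socket_sendmsg, list) {
>                 rc = p->hook.socket_sendmsg(sock, msg, size);
>                 if (rc != 0)
>                         return rc;
>         }
>         return 0;
> }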
 
A profile diff (in cycles) for the test with 64-byte buffers on bing2 did not show any difference in functions from LSMs because they were executed under both kernels:
> test bing2  tcp  64  6.0.0-vanilla  6.3.0-vanilla
> #        Util1           Util2                     Diff  Command          Shared Object        Symbol                         CPU
>  3,108,910,575   6,819,941,295   3,711,030,720 (119.4%)  netperf          netperf              0x000000000000a479             all
>  6,848,558,780   9,157,610,739   2,309,051,959 ( 33.7%)  netperf          [kernel.kallsyms]    ipv4_mtu                       all
> 48,174,509,222  50,367,436,898   2,192,927,676 (  4.6%)  netperf          [kernel.kallsyms]    tcp_sendmsg_locked             all
 
The pertinent difference between the kernel config files used by Marvin at the time the automated tests were executed on the grid:
> $ diff -u kconfig-6.0.0-vanilla.txt kconfig-6.3.0-vanilla.txt
> -CONFIG_LSM="integrity,apparmor"
> +CONFIG_LSM="landlock,lockdown,yama,loadpin,safesetid,integrity,apparmor,selinux,smack,tomoyo,bpf"
The default initialization of LSMs changed to include many more frameworks, which comes with a performance penalty. The commit in kernel-source that changed it:
> commit 720c38318edb68b138a4bc4c86bb8ff0fbcda672
> Author: Jeff Mahoney <jeffm@suse.com>
> Date:   Thu Dec 8 14:32:18 2022 -0500
>     config: update CONFIG_LSM defaults (bsc#1205603).
>     CONFIG_LSM determines what the default order of LSM usage is.  The
>     default order is set based on whether AppArmor or SELinux is preferred
>     in the config (we still prefer AppArmor).  The default set has changed
>     over time and we haven't updated it, leading to things like bpf LSMs
>     not working out of the box.
>     This change just updates CONFIG_LSM to what the default would be now.
Comment 4 Jiri Wiesner 2023-08-02 15:17:40 UTC
To check the analysis, I ran a test on bing3 and passed lsm=integrity,apparmor to the kernel:
> bing3 Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz
> netperf-tcp
>        Buffer size               6.0.0                  6.3.0                  6.4.0
> Hmean     64        2098.80 (   0.00%)     2084.18 *  -0.70%*     2084.71 *  -0.67%*
> Hmean     128       4023.99 (   0.00%)     3967.67 *  -1.40%*     3944.35 *  -1.98%*
> Hmean     256       6423.86 (   0.00%)     6342.55 *  -1.27%*     6398.97 (  -0.39%)
> Hmean     1024     18715.06 (   0.00%)    18495.83 (  -1.17%)    18411.75 *  -1.62%*
> Hmean     2048     29395.03 (   0.00%)    29113.37 *  -0.96%*    29203.98 *  -0.65%*
> Hmean     3312     35556.39 (   0.00%)    35577.92 (   0.06%)    35247.40 *  -0.87%*
> Hmean     4096     38169.92 (   0.00%)    38732.39 *   1.47%*    38263.48 (   0.25%)
> Hmean     8192     43519.06 (   0.00%)    44287.92 *   1.77%*    44016.39 *   1.14%*
> Hmean     16384    49240.75 (   0.00%)    49579.35 (   0.69%)    49324.33 (   0.17%)
The regression is gone when the LSM settings are reverted.
 
Next, I quantified the impact of individual LSMs. The kernel names in the following tables contain suffixes that denote the lsm argument specified on the kernel command line:
> -prev - the kernel was passed lsm=integrity,apparmor
> -new - the kernel was passed lsm=landlock,lockdown,yama,loadpin,safesetid,integrity,apparmor,selinux,smack,tomoyo,bpf
> -nt - (no tomoyo) the kernel was passed lsm=landlock,lockdown,yama,loadpin,safesetid,integrity,apparmor,selinux,smack,bpf
> -ntnb - (no tomoyo, no bpf) the kernel was passed lsm=landlock,lockdown,yama,loadpin,safesetid,integrity,apparmor,selinux,smack
The combined regression caused by the tomoyo and bpf LSM callbacks is more than 7% under Linux 6.4 on bing3:
> bing3 Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz
> netperf-tcp
>                                  6.4.0                  6.4.0                  6.4.0                  6.4.0
>       Buffer size         vanilla-prev            vanilla-new             vanilla-nt           vanilla-ntnb
> Hmean     64        2094.09 (   0.00%)     1938.45 *  -7.43%*     2035.36 *  -2.80%*     2092.30 (  -0.09%)
> Hmean     128       4003.31 (   0.00%)     3686.51 *  -7.91%*     3881.77 *  -3.04%*     3987.24 (  -0.40%)
> Hmean     256       6460.78 (   0.00%)     6050.71 *  -6.35%*     6243.52 *  -3.36%*     6445.97 (  -0.23%)
> Hmean     1024     18545.31 (   0.00%)    17607.09 *  -5.06%*    18245.47 *  -1.62%*    18490.91 (  -0.29%)
> Hmean     2048     29272.75 (   0.00%)    27989.47 *  -4.38%*    28951.70 *  -1.10%*    29071.39 *  -0.69%*
> Hmean     3312     35352.42 (   0.00%)    34361.17 *  -2.80%*    35131.84 *  -0.62%*    35413.87 (   0.17%)
> Hmean     4096     38033.06 (   0.00%)    37305.99 *  -1.91%*    38122.59 (   0.24%)    38195.74 (   0.43%)
> Hmean     8192     43336.07 (   0.00%)    42755.58 *  -1.34%*    43357.50 (   0.05%)    43684.97 *   0.81%*
> Hmean     16384    49016.40 (   0.00%)    48399.16 *  -1.26%*    49182.73 (   0.34%)    49696.99 *   1.39%*
 
Half of the regression is caused by the tomoyo LSM callback, tomoyo_socket_sendmsg:
> 40.38%   0.96%     2928   72924113 netperf     [kernel.vmlinux]      [k] entry_SYSCALL_64_after_hwframe
>     |--39.43%--entry_SYSCALL_64_after_hwframe
>     |     do_syscall_64
>     |     |--36.62%--__x64_sys_sendto
>     |     |      --36.25%--__sys_sendto
>     |     |           |--33.19%--sock_sendmsg
>     |     |           |     |--28.12%--tcp_sendmsg
>     |     |           |     |--3.93%--security_socket_sendmsg
>     |     |           |     |     |--1.29%--aa_sk_perm
>     |     |           |     |     |      --0.05%--__cond_resched
>     |     |           |     |     |--1.13%--tomoyo_socket_sendmsg_permission
>     |     |           |     |     |      --0.17%--tomoyo_sock_family.part.2
>     |     |           |     |     |--0.29%--bpf_lsm_socket_sendmsg
>     |     |           |     |     |--0.17%--apparmor_socket_sendmsg
>     |     |           |     |      --0.07%--tomoyo_socket_sendmsg
I forced tomoyo_sock_family() to be inlined, and the overhead of the tomoyo LSM callback was reduced:
> bing3 Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz
> netperf-tcp
>                                  6.4.0                  6.4.0                  6.4.0                  6.4.0
>       Buffer size          tomoyo-prev             tomoyo-new              tomoyo-nt            tomoyo-ntnb
> Hmean     64        2095.18 (   0.00%)     1992.49 *  -4.90%*     2025.82 *  -3.31%*     2104.16 (   0.43%)
> Hmean     128       4003.76 (   0.00%)     3776.71 *  -5.67%*     3874.75 *  -3.22%*     4030.11 (   0.66%)
> Hmean     256       6501.12 (   0.00%)     6175.40 *  -5.01%*     6294.64 *  -3.18%*     6477.90 *  -0.36%*
> Hmean     1024     18695.36 (   0.00%)    17814.94 *  -4.71%*    18213.32 *  -2.58%*    18564.24 *  -0.70%*
> Hmean     2048     29582.21 (   0.00%)    28305.37 *  -4.32%*    28738.64 *  -2.85%*    29219.64 *  -1.23%*
> Hmean     3312     35647.01 (   0.00%)    34719.50 *  -2.60%*    35178.56 *  -1.31%*    35482.23 *  -0.46%*
> Hmean     4096     38442.34 (   0.00%)    37997.83 *  -1.16%*    37834.10 *  -1.58%*    38303.13 (  -0.36%)
> Hmean     8192     43605.34 (   0.00%)    43638.23 (   0.08%)    43337.54 *  -0.61%*    43545.55 (  -0.14%)
> Hmean     16384    49108.32 (   0.00%)    49181.45 (   0.15%)    48724.71 *  -0.78%*    49373.56 (   0.54%)
So far, I had been using the 15sp4 compiler (gcc 7.5). I found out that gcc 12.3 inlines tomoyo_sock_family() even without an inline function specifier, which makes me doubtful that a patch forcing the inlining is worth submitting upstream. It can be assumed that ALP will employ a recent gcc for compiling the kernel.
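For reference, the change I tested amounts to forcing the helper inline. A sketch (the helper lives in security/tomoyo/network.c in recent trees; the body is abridged and may differ between kernel versions):
> /* security/tomoyo/network.c (sketch): forcing the helper to be inlined
>  * removes a function call from the tomoyo_socket_sendmsg_permission()
>  * path; gcc 12.3 already makes this decision on its own. */
> static __always_inline bool tomoyo_sock_family(struct sock *sk)
> {
>         if (tomoyo_kernel_service())
>                 return false;   /* kernel sockets are exempt */
>         switch (sk->sk_family) {
>         case PF_INET:
>         case PF_INET6:
>         case PF_UNIX:
>                 return true;
>         default:
>                 return false;
>         }
> }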
Comment 5 Jiri Wiesner 2023-08-08 16:40:25 UTC
I ran tests to quantify the impact of the tomoyo and bpf LSMs on multiple machines. The kernel names in the following tables contain suffixes that denote the lsm argument specified on the kernel command line:
> -nt - (no tomoyo) the kernel was passed lsm=landlock,lockdown,yama,loadpin,safesetid,integrity,apparmor,selinux,smack,bpf
> -ntnb - (no tomoyo, no bpf) the kernel was passed lsm=landlock,lockdown,yama,loadpin,safesetid,integrity,apparmor,selinux,smack
For the smallest buffer sizes (64 and 128 bytes), the performance impact of the tomoyo LSM is between 3% and 4% on most machines, and much larger on armani, which is a Zen 3 machine:
> armani AMD EPYC 7713 64-Core Processor 2.0GHz
> netperf-udp 
>                                       6.4.8                  6.4.8                  6.4.8 
>                                  alp-230807          alp-230807-nt        alp-230807-ntnb
> Hmean     send-64         343.23 (   0.00%)      372.52 *   8.53%*      397.71 *  15.87%*
> Hmean     send-128        678.93 (   0.00%)      734.28 *   8.15%*      781.65 *  15.13%*
> Hmean     send-256       1352.18 (   0.00%)     1472.93 *   8.93%*     1556.23 *  15.09%*
> Hmean     send-1024      5187.59 (   0.00%)     5529.80 *   6.60%*     5749.17 *  10.83%*
> netperf-tcp
>                                  6.4.8                  6.4.8                  6.4.8 
>                             alp-230807          alp-230807-nt        alp-230807-ntnb
> Hmean     64        1991.98 (   0.00%)     2170.00 *   8.94%*     2248.64 *  12.88%*
> Hmean     128       3790.39 (   0.00%)     4119.44 *   8.68%*     4234.52 *  11.72%*
> Hmean     256       7207.25 (   0.00%)     7797.34 *   8.19%*     8062.61 *  11.87%*
> Hmean     1024     24067.65 (   0.00%)    25532.09 *   6.08%*    26176.27 *   8.76%*
> bing2 Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz
> netperf-udp
>                                       6.4.8                  6.4.8                  6.4.8
>                                  alp-230807          alp-230807-nt        alp-230807-ntnb 
> Hmean     send-64         358.16 (   0.00%)      366.27 *   2.26%*      378.73 *   5.74%*
> Hmean     send-128        705.29 (   0.00%)      708.79 (   0.50%)      738.74 *   4.74%*
> Hmean     send-256       1403.63 (   0.00%)     1407.67 (   0.29%)     1468.94 *   4.65%*
> Hmean     send-1024      5335.08 (   0.00%)     5373.77 (   0.73%)     5560.37 *   4.22%*
> netperf-tcp
>                                  6.4.8                  6.4.8                  6.4.8
>                             alp-230807          alp-230807-nt        alp-230807-ntnb
> Hmean     64        1990.97 (   0.00%)     2069.85 *   3.96%*     2118.47 *   6.40%*
> Hmean     128       3815.90 (   0.00%)     3953.75 *   3.61%*     4056.83 *   6.31%*
> Hmean     256       6456.85 (   0.00%)     6557.91 (   1.57%)     6800.38 *   5.32%*
> Hmean     1024     18094.43 (   0.00%)    18734.71 *   3.54%*    18767.92 *   3.72%*
> hardy2 Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
> netperf-udp
>                                       6.4.8                  6.4.8                  6.4.8
>                                  alp-230807          alp-230807-nt        alp-230807-ntnb
> Hmean     send-64         266.20 (   0.00%)      273.39 *   2.70%*      283.43 *   6.47%*
> Hmean     send-128        525.04 (   0.00%)      542.57 *   3.34%*      550.44 *   4.84%*
> Hmean     send-256       1054.81 (   0.00%)     1085.81 *   2.94%*     1102.03 *   4.48%*
> Hmean     send-1024      3968.91 (   0.00%)     3976.68 (   0.20%)     4255.94 *   7.23%* 
> netperf-tcp
>                                  6.4.8                  6.4.8                  6.4.8
>                             alp-230807          alp-230807-nt        alp-230807-ntnb
> Hmean     64         779.06 (   0.00%)      806.93 *   3.58%*      812.85 *   4.34%*
> Hmean     128       1499.69 (   0.00%)     1569.62 *   4.66%*     1585.93 *   5.75%*
> Hmean     256       2866.62 (   0.00%)     2900.68 *   1.19%*     2981.46 *   4.01%*
> Hmean     1024     10005.45 (   0.00%)    10216.24 *   2.11%*    10420.25 *   4.15%*
> simba2 Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz
> netperf-udp
>                                       6.4.8                  6.4.8                  6.4.8
>                                  alp-230807          alp-230807-nt        alp-230807-ntnb
> Hmean     send-64         392.51 (   0.00%)      410.86 *   4.67%*      415.17 *   5.77%*
> Hmean     send-128        804.86 (   0.00%)      826.45 *   2.68%*      831.27 *   3.28%*
> Hmean     send-256       1583.82 (   0.00%)     1647.26 *   4.01%*     1664.68 *   5.11%*
> Hmean     send-1024      5933.02 (   0.00%)     6073.28 *   2.36%*     6138.85 *   3.47%*
> netperf-tcp
>                                  6.4.8                  6.4.8                  6.4.8
>                             alp-230807          alp-230807-nt        alp-230807-ntnb
> Hmean     64        2313.46 (   0.00%)     2401.68 *   3.81%*     2381.45 *   2.94%*
> Hmean     128       4523.62 (   0.00%)     4708.94 *   4.10%*     4679.95 *   3.46%*
> Hmean     256       8576.52 (   0.00%)     8933.48 *   4.16%*     8904.31 *   3.82%*
> Hmean     1024     25304.54 (   0.00%)    25345.88 (   0.16%)    25637.67 *   1.32%*
> toto AMD EPYC 7601 32-Core Processor 2.2GHz
> netperf-udp
>                                       6.4.8                  6.4.8                  6.4.8
>                                  alp-230807          alp-230807-nt        alp-230807-ntnb
> Hmean     send-64         243.28 (   0.00%)      253.77 *   4.31%*      274.97 *  13.03%*
> Hmean     send-128        491.07 (   0.00%)      511.98 *   4.26%*      548.04 *  11.60%*
> Hmean     send-256        976.39 (   0.00%)     1013.91 *   3.84%*     1099.14 *  12.57%*
> Hmean     send-1024      3535.11 (   0.00%)     3796.68 *   7.40%*     3963.32 *  12.11%*
> netperf-tcp
>                                  6.4.8                  6.4.8                  6.4.8
>                             alp-230807          alp-230807-nt        alp-230807-ntnb
> Hmean     64        1110.09 (   0.00%)     1164.61 *   4.91%*     1186.33 *   6.87%*
> Hmean     128       2128.86 (   0.00%)     2205.09 *   3.58%*     2300.50 *   8.06%*
> Hmean     256       4068.26 (   0.00%)     4249.99 *   4.47%*     4368.44 *   7.38%*
> Hmean     1024     12118.48 (   0.00%)    12663.98 *   4.50%*    12961.32 *   6.95%*
Disabling the bpf LSM yields further performance improvements, but it may need to stay enabled; see bug 1205603.

When it comes to LSMs, the callbacks of an LSM are executed only if that particular LSM is enabled. Two kernel parameters determine which LSMs get enabled when the kernel boots: security and lsm. The security parameter selects one major LSM, and the other major LSMs get disabled; the major LSMs are apparmor, selinux, smack and tomoyo. The security parameter and the notion of major LSMs are considered a legacy approach to selecting LSMs. The lsm parameter specifies the order in which LSMs are initialized: the first exclusive LSM in the list gets enabled and the remaining exclusive LSMs get disabled. The exclusive LSMs are apparmor, selinux and smack; tomoyo is no longer exclusive (see the commits quoted below), so it initializes alongside whichever exclusive LSM wins.
 
After changing CONFIG_LSM to
> CONFIG_LSM="landlock,lockdown,yama,loadpin,safesetid,integrity,apparmor,selinux,smack,tomoyo,bpf"
tomoyo got enabled on performance grid machines because the performance grid uses only basic kernel parameters:
> [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-6.4.0-vanilla root=UUID=065c0e9b-68c0-4a4f-a88b-f3b44a49d390 sysrq_always_enabled panic=100 console=tty0 console=ttyS0,115200
> [    3.105614] LSM: initializing lsm=lockdown,capability,landlock,yama,apparmor,tomoyo,bpf,integrity
> [    3.106596] landlock: Up and running.
> [    3.109907] Yama: becoming mindful.
> [    3.113261] AppArmor: AppArmor initialized
> [    3.116575] TOMOYO Linux initialized
> [    3.119912] LSM support for eBPF active
So, we see the performance impact of both the tomoyo and bpf LSMs on grid machines (in addition to apparmor, which had been there all along).

SLES passes security=apparmor to the kernel, which means that apparmor will be enabled and selinux, smack and tomoyo will be disabled. Should a user delete the security=apparmor parameter, the value of CONFIG_LSM will determine which LSMs get enabled:
1. In 15sp5, CONFIG_LSM="integrity,apparmor" means that apparmor will be the only major LSM enabled.
2. In 15sp6, CONFIG_LSM="landlock,lockdown,yama,loadpin,safesetid,integrity,apparmor,selinux,smack,tomoyo,bpf" means that tomoyo will get enabled in addition to apparmor.
I recommend changing CONFIG_LSM for 15sp6 so that deleting the security=apparmor parameter does not cause a regression in throughput for workloads that transfer small amounts of data between processes and make many syscalls. I suggest removing all major LSMs apart from apparmor:
CONFIG_LSM="landlock,lockdown,yama,loadpin,safesetid,integrity,apparmor,bpf"
 
As for ALP, the security=selinux parameter is passed to the kernel in ALP Dolomite Milestone 2:
> [    0.020910] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-6.4.1-1-default root=UUID=33032718-8cbe-4dab-b06a-f62c9b3ed10b rd.timeout=60 rd.retry=45 console=ttyS0,115200 console=tty0 security=selinux selinux=1 quiet net.ifnames=0 ignition.platform.id=qemu
> [    0.039846] LSM: initializing lsm=lockdown,capability,landlock,yama,selinux,bpf,integrity
> [    0.039854] landlock: Up and running.
> [    0.039855] Yama: becoming mindful.
> [    0.039859] SELinux:  Initializing.
> [    0.039869] LSM support for eBPF active
This means selinux will be enabled and the other major LSMs - apparmor, smack and tomoyo - will be disabled. Again, should someone remove security=selinux from the kernel command line, the value of CONFIG_LSM would cause tomoyo to get enabled in addition to selinux after rebooting the machine. As we know, enabling tomoyo results in a slight performance regression in certain cases. For ALP, I suggest removing all major LSMs apart from selinux:
CONFIG_LSM="landlock,lockdown,yama,loadpin,safesetid,integrity,selinux,bpf"

More details:
Debugging output for LSM initialization under ALP (printed when the kernel boots with the lsm.debug parameter):
> [    0.063179] LSM: legacy security=selinux
> [    0.063180] LSM:   CONFIG_LSM=landlock,lockdown,yama,loadpin,safesetid,integrity,apparmor,selinux,smack,tomoyo,bpf
> [    0.063181] LSM: boot arg lsm= *unspecified*
> [    0.063182] LSM:   early started: lockdown (enabled)
> [    0.063183] LSM:   first ordered: capability (enabled)
> [    0.063184] LSM: security=selinux disabled: tomoyo (only one legacy major LSM)
> [    0.063185] LSM: security=selinux disabled: apparmor (only one legacy major LSM)
> [    0.063186] LSM: builtin ordered: landlock (enabled)
> [    0.063187] LSM: builtin ignored: lockdown (not built into kernel)
> [    0.063187] LSM: builtin ordered: yama (enabled)
> [    0.063188] LSM: builtin ignored: loadpin (not built into kernel)
> [    0.063189] LSM: builtin ignored: safesetid (not built into kernel)
> [    0.063190] LSM: builtin ordered: apparmor (disabled)
> [    0.063190] LSM: builtin ordered: selinux (enabled)
> [    0.063191] LSM: builtin ignored: smack (not built into kernel)
> [    0.063192] LSM: builtin ordered: tomoyo (disabled)
> [    0.063192] LSM: builtin ordered: bpf (enabled)
> [    0.063193] LSM:    last ordered: integrity (enabled)
> [    0.063194] LSM: exclusive chosen:   selinux
> [    0.063195] LSM: initializing lsm=lockdown,capability,landlock,yama,selinux,bpf,integrity
> [    0.063199] LSM: cred blob size       = 32
> [    0.063199] LSM: file blob size       = 24
> [    0.063200] LSM: inode blob size      = 72
> [    0.063200] LSM: ipc blob size        = 8
> [    0.063201] LSM: msg_msg blob size    = 4
> [    0.063201] LSM: superblock blob size = 80
> [    0.063202] LSM: task blob size       = 8
> [    0.063205] LSM: initializing capability
> [    0.063206] LSM: initializing landlock
> [    0.063207] landlock: Up and running.
> [    0.063207] LSM: initializing yama
> [    0.063208] Yama: becoming mindful.
> [    0.063212] LSM: initializing selinux
> [    0.063213] SELinux:  Initializing.
> [    0.063221] LSM: initializing bpf
> [    0.063223] LSM support for eBPF active
> [    0.063224] LSM: initializing integrity
 
Commits responsible for LSM selection:
> v5.0-rc1-1-g47008e5161fa LSM: Introduce LSM_FLAG_LEGACY_MAJOR
>     This adds a flag for the current "major" LSMs to distinguish them when
>     we have a universal method for ordering all LSMs. It's called "legacy"
>     since the distinction of "major" will go away in the blob-sharing world.
> v5.0-rc1-11-g14bd99c821f7 LSM: Separate idea of "major" LSM from "exclusive" LSM
>     In order to both support old "security=" Legacy Major LSM selection, and
>     handling real exclusivity, this creates LSM_FLAG_EXCLUSIVE and updates
>     the selection logic to handle them.
> v5.0-rc1-38-ga5e2fe7ede12 TOMOYO: Update LSM flags to no longer be exclusive
>     With blob sharing in place, TOMOYO is no longer an exclusive LSM, so it
>     can operate separately now. Mark it as such.
Comment 6 Marcus Rückert 2023-08-08 20:04:25 UTC
Is there a runtime toggle to enable tomoyo?
Comment 7 Jiri Wiesner 2023-08-09 07:52:57 UTC
Yes, there is. The lsm parameter can be passed to the kernel on its command line:
lsm=landlock,lockdown,yama,loadpin,safesetid,integrity,apparmor,tomoyo,bpf
The value of the lsm parameter can be derived from whatever is in /sys/kernel/security/lsm. Alternatively, to use tomoyo as a major LSM, security=tomoyo can be passed to the kernel.
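For example, on one of the grid machines the file should list the same set that the initialization message in comment 5 shows (illustrative output):
> $ cat /sys/kernel/security/lsm
> lockdown,capability,landlock,yama,apparmor,tomoyo,bpf,integrity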
Comment 8 Marcus Rückert 2023-08-09 09:54:52 UTC
My question was aiming at having tomoyo built into the kernel but disabled by default, so that the user could enable it via the kernel command line.
Comment 9 Jiri Wiesner 2023-08-09 10:17:57 UTC
(In reply to Marcus Rückert from comment #8)
> My question was aiming at having tomoyo built into the kernel but disabled
> by default, so that the user could enable it via the kernel command line.
This is exactly what I have just described. I suspect we have a misunderstanding about the terms I used. Compile-time options are always in capitals; kernel parameters passed on the kernel command line are in lower case and are always called "kernel parameters". On the kernel command line, tomoyo can be enabled with the lsm= kernel parameter or the security= kernel parameter. The CONFIG_SECURITY_TOMOYO compile-time option is set to yes, so tomoyo is built into the kernel. Disabling tomoyo by default is accomplished by tweaking the CONFIG_LSM compile-time option.

Sorry, but "runtime toggle" isn't part of the terminology so I am not going to use it. See,
Documentation/admin-guide/kernel-parameters.rst
Documentation/admin-guide/kernel-parameters.txt
Comment 10 Jiri Wiesner 2023-08-11 17:18:37 UTC
Two blind alleys:
Despite the encouraging results on bing3, testing on the grid did not show any substantial improvements after inlining tomoyo_sock_family() and tomoyo_kernel_service().

There is one difference between 6.3 and 6.4 that attracted my attention: copy_user_generic() was revamped to check for the FSRM feature (Fast Short REP MOVSB, a CPU feature bit reported by CPUID) instead of the ERMS and REP_GOOD features. A profile diff (in bus-cycles) shows that time is spent in rep_movs_alternative() and copyin() instead of copy_user_enhanced_fast_string():
> test bing3  tcp  64  6.3.0-vanilla-new  6.4.0-vanilla-new
> #      Util1         Util2                   Diff  Command          Shared Object        Symbol                         CPU
>            0   116,798,963   116,798,963 (100.0%)  netperf          [kernel.kallsyms]    rep_movs_alternative           all
>            0   107,566,905   107,566,905 (100.0%)  netperf          [kernel.kallsyms]    copyin                         all
>   50,512,400    73,443,792    22,931,392 ( 45.4%)  netperf          [kernel.kallsyms]    ipv4_mtu                       all
>   90,326,233   111,343,210    21,016,977 ( 23.3%)  netperf          netperf              send_tcp_stream                all
>   75,108,050    95,527,384    20,419,334 ( 27.2%)  netperf          [kernel.kallsyms]    aa_sk_perm                     all
>   39,850,798    55,473,237    15,622,439 ( 39.2%)  netperf          [kernel.kallsyms]    sock_sendmsg                   all
>   23,117,865    35,803,490    12,685,625 ( 54.9%)  netperf          [kernel.kallsyms]    skb_page_frag_refill           all
>   33,210,582    20,175,432   -13,035,150 ( 39.2%)  netperf          [kernel.kallsyms]    inet_sendmsg                   all
>  113,199,306    97,356,180   -15,843,126 ( 14.0%)  netperf          [kernel.kallsyms]    __check_object_size            all
>  135,873,271    82,815,536   -53,057,735 ( 39.0%)  netperf          [kernel.kallsyms]    _copy_from_iter                all
>  216,323,164             0  -216,323,164 (100.0%)  netperf          [kernel.kallsyms]    copy_user_enhanced_fast_string all
The compiler inlined copyin() in 6.3.0-vanilla, whereas it was not inlined in 6.4.0-vanilla. I tested inlining copyin(), copyout() and copyout_nofault(), but tests on the grid failed to show any substantial improvement on any machine.
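For context, the 6.4 rewrite selects the copy routine with an ALTERNATIVE keyed on X86_FEATURE_FSRM. A simplified sketch of copy_user_generic() from arch/x86/include/asm/uaccess_64.h (abridged, not the verbatim source; the real function also wires up exception handling):
> /* Simplified sketch: on FSRM-capable CPUs the body is a bare "rep movsb";
>  * on older CPUs the instruction is patched at boot to call
>  * rep_movs_alternative(), which is the symbol that shows up in the 6.4
>  * profile in place of copy_user_enhanced_fast_string(). */
> static __always_inline unsigned long
> copy_user_generic(void *to, const void *from, unsigned long len)
> {
>         stac();                         /* open the user-access window */
>         asm volatile(
>                 ALTERNATIVE("rep movsb",
>                             "call rep_movs_alternative",
>                             ALT_NOT(X86_FEATURE_FSRM))
>                 : "+c" (len), "+D" (to), "+S" (from)
>                 : : "memory", "rax");
>         clac();                         /* close it again */
>         return len;                     /* bytes left uncopied */
> }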