Bugzilla – Bug 1178359
Unexplainable high load average
Last modified: 2022-01-21 12:12:02 UTC
Created attachment 843219 [details]
fig1

Overview:
We've witnessed an unusually high system load average on several recent Leap 15.2 virtual machine builds running under a Nutanix AHV hypervisor.

Steps to reproduce:
Difficult to reproduce, as the problem seems to start only after several days of uptime, if at all. I have seen it happen on 4 systems so far.

Actual results:
On an affected VM, after some time (days or weeks), the system load average starts jumping up in steps. On one system, the load average went from ~0.01 (normal) to ~0.5, then to ~1.0, and so on, until it now sits at ~5.5, over a period of several days. The attached image (fig 1) charts the 5-minute load average over time (note: the unit on the chart is %, where 100% = nproc). A reboot resets the problem, although I have yet to see whether it returns.

Some output from another VM; this one is essentially as close to a bare 15.2 installation as we have:

# cat /proc/loadavg
2.04 2.01 2.00 1/228 4319

As the output below shows, there are no processes waiting to run or in uninterruptible sleep, nothing is waiting for IO, and there is no swap activity:

# vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff   cache   si   so    bi    bo   in   cs us sy  id wa st
 0  0  19200 248944   1060 1337940    0    0     3    27   11   20  1  0  99  0  0
 0  0  19200 248912   1060 1337940    0    0     0     0  349  260  1  0 100  0  0
 0  0  19200 248912   1060 1337940    0    0     0     0  318  249  0  0 100  0  0
 0  0  19200 248912   1060 1337940    0    0     0     0  338  254  0  0 100  0  0
 0  0  19200 248976   1060 1337940    0    0     0     0  287  246  0  1 100  0  0
 0  0  19200 248944   1060 1337940    0    0     0     0  262  214  1  0 100  0  0
 0  0  19200 248912   1060 1337940    0    0     0     0  320  248  0  1 100  0  0
 0  0  19200 248944   1060 1337940    0    0     0     0  315  243  1  0 100  0  0
 0  0  19200 248912   1060 1337940    0    0     0     0  284  236  0  0 100  0  0
 0  0  19200 248912   1060 1337940    0    0     0     0  362  276  0  0 100  0  0
 0  0  19200 248944   1060 1337940    0    0     0     0  377  270  0  0 100  0  0
 0  0  19200 248912   1060 1337940    0    0     0     0  335  233  0  0 100  0  0
 0  0  19200 248912   1060 1337940    0    0     0     0  316  271  0  0  99  0  0
 0  0  19200 248944   1060 1337940    0    0     0     0  321  258  0  0 100  0  0
 0  0  19200 248912   1060 1337940    0    0     0     0  297  228  0  0  99  0  0
 0  0  19200 248944   1060 1337940    0    0     0     0  319  241  0  1 100  0  0
 0  0  19200 248912   1060 1337940    0    0     0     0  309  250  1  0 100  0  0
 0  0  19200 248912   1060 1337940    0    0     0     0  328  249  0  0 100  0  0
 0  0  19200 248944   1060 1337940    0    0     0     0  339  267  0  0  99  0  0
 0  0  19200 248912   1060 1337940    0    0     0     0  360  245  0  0 100  0  0
 0  0  19200 248944   1060 1337940    0    0     0     0  288  238  0  0 100  0  0
 0  0  19200 248944   1060 1337940    0    0     0     0  338  246  0  0 100  0  0
 0  0  19200 248944   1060 1337940    0    0     0     0  348  263  1  0 100  0  0
 0  0  19200 248660   1060 1338132    0    0     0  1288  326  288  0  0 100  0  0
 0  0  19200 248660   1060 1338132    0    0     0     0  341  258  0  1 100  0  0
 0  0  19200 248692   1060 1338132    0    0     0     0  369  267  0  0 100  0  0
 0  0  19200 248692   1060 1338132    0    0     0     0  330  265  0  0 100  0  0
 0  0  19200 248660   1060 1338132    0    0     0     0  337  263  0  0  99  0  0
 0  0  19200 248692   1060 1338132    0    0     0     0  331  258  0  0 100  0  0
 0  0  19200 248660   1060 1338132    0    0     0     0  309  223  0  0 100  0  0
 0  0  19200 248912   1060 1338132    0    0     0     0  306  262  0  0 100  0  0

Expected results:
The system load average stays within expected levels, which allows for effective monitoring.

Build:
Linux 5.3.18-lp152.26-default #1 SMP Mon Jun 29 14:58:38 UTC 2020 (2a0430f) x86_64 x86_64 x86_64 GNU/Linux
Nutanix AHV virtual machine, Intel(R) Xeon(R) Gold 6150 CPU, variable memory/core count for VMs.

Additional builds and platforms:
We have not witnessed this on Leap 15.1.
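For reference, the two checks behind the numbers above can be reproduced with standard tools; a minimal sketch, assuming only procps/coreutils (nproc, awk, ps).

The 5-minute load average as a percentage of online CPUs (100% = nproc), i.e. the unit used in the attached chart:

# awk -v n="$(nproc)" '{ printf "load5 = %.1f%% of %d CPUs\n", $2 / n * 100, n }' /proc/loadavg

And a listing of any tasks in uninterruptible sleep (state D), which could legitimately raise the load average; on the affected VMs this prints nothing:

# ps -eo state,pid,comm | awk '$1 == "D"'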
We've seen a similar bogus loadavg problem on recent Tumbleweed and Leap 15.2 kernels, but that was on certain ARM64 bare-metal machines. This one is on an x86-64 VM, and with a much older kernel (released in July, corresponding to SLE15-SP2 commit 72557bb644c5), so I'm not quite sure whether it's the same problem. Adding Mel and Rudi to Cc, who have already been involved with the other bug.
Could you test with the latest openSUSE-15.2 KOTD? http://download.opensuse.org/repositories/Kernel:/openSUSE-15.2/standard/ It contains the recent fix for the loadavg bug. Although that bug has so far hit Arm and similar platforms, it might affect a specific hypervisor, too.
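Roughly, assuming the kernel-default flavour is in use (the repo alias "kotd" below is arbitrary, and zypper may ask you to accept the repository signing key), the test would look like:

# zypper ar -f http://download.opensuse.org/repositories/Kernel:/openSUSE-15.2/standard/ kotd
# zypper ref
# zypper in -r kotd kernel-default
# reboot

After the reboot, confirm the running kernel with "uname -r", then watch whether the load average still creeps up after a few days of uptime.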
No response, closing.