Bugzilla – Bug 1196438
Consider switching to 300HZ default
Last modified: 2024-02-12 17:36:43 UTC
I recently stumbled over the 300 HZ option in the kernel configuration. Currently we are using 250 HZ with IDLE_HZ. As far as I can see, the HZ setting determines when jiffies advance, so it influences the granularity of a couple of things like kvmclock, timers and so on. Also, USER_HZ is 100 HZ, which is the granularity at which some metrics are exported to user space. 300 is divisible without remainder by 100, unlike 250.

So it appears there are good reasons for switching the default, and it is "just" 20% more timer interrupts than before, so it should not be a huge issue. I also did a test build, and a 300 HZ kernel is a few bytes smaller than a 250 HZ kernel, indicating that the compiler can optimize away a few things.

I am seeing an increase in a few small functions, and I'm looking into making the code size increase go away with a source level tweak. I've benchmarked both versions in a micro benchmark that does a billion invocations of both, and while the code is larger, it runs in the same runtime (+/- 3%, which I consider my benchmark noise level) on a Ryzen Zen 2+.

According to the Kconfig description of the 300 HZ option, it appears to be recommended more for multimedia use cases because it is divisible without remainder for common rates, like 30 fps, 60 fps, 120 fps, 44.1 kHz and others that are often needed. This is imho not only usable for desktops, but also for servers that are running multimedia-related applications.

I've seen that Fedora-like distributions use 1000 HZ, Arch Linux uses 300 HZ and Debian defaults to 250 HZ. So there is no clear trend. I'd be fine with 300 or 1000 HZ.

Comments? I would like to hear your feedback before sending a change to the Tumbleweed configs.
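The divisibility and interrupt-rate arithmetic in this comment can be checked with a few lines of Python. This is only an illustrative sketch: the function names are made up for this example and are not kernel identifiers, and the list of HZ values is simply the set mentioned in this report.

```python
USER_HZ = 100  # granularity of tick-based metrics exported to user space


def ticks_per_user_tick(hz, user_hz=USER_HZ):
    """Return (quotient, remainder) of hz divided by user_hz."""
    return divmod(hz, user_hz)


def tick_period_ms(hz):
    """Length of one jiffy in milliseconds."""
    return 1000.0 / hz


def relative_interrupt_increase(old_hz, new_hz):
    """Fractional increase in periodic tick interrupts when moving old_hz -> new_hz."""
    return (new_hz - old_hz) / old_hz


for hz in (100, 250, 300, 1000):
    quotient, remainder = ticks_per_user_tick(hz)
    print(f"HZ={hz}: tick period {tick_period_ms(hz):.3f} ms, "
          f"HZ/USER_HZ = {quotient} remainder {remainder}")

print(f"250 -> 300: {relative_interrupt_increase(250, 300):.0%} more ticks")
```

Only 250 leaves a remainder against USER_HZ among the values discussed here, which is the point about evenly converting jiffies to user-space ticks.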
see https://lists.opensuse.org/archives/list/kernel@lists.opensuse.org/thread/QN2AFMCXQGGHF2I6FSQYVH6AXRX4WPIQ/ for context
(In reply to Dirk Mueller from comment #1)
> see
> https://lists.opensuse.org/archives/list/kernel@lists.opensuse.org/thread/QN2AFMCXQGGHF2I6FSQYVH6AXRX4WPIQ/
> for context

A battery of scheduler tests comparing HZ=250 (default x86-64 on master) versus HZ=300 will be queued. Note that as the queue already has a substantial number of tests on it, it'll take several days before there are results to evaluate.
(In reply to Mel Gorman from comment #2)
> (In reply to Dirk Mueller from comment #1)
> > see
> > https://lists.opensuse.org/archives/list/kernel@lists.opensuse.org/thread/QN2AFMCXQGGHF2I6FSQYVH6AXRX4WPIQ/
> > for context
>
> Battery of scheduler tests comparing HZ=250 (default x86-64 on master)
> versus HZ=300 will be queued in. Note that as the queue already has a
> substantial number of tests on it, it'll take several days before there are
> results to evaluate.

In general the performance results are not bad. There were a few outliers showing major regressions, but they were not consistent across machines, and it was not clear that they were specifically due to the change in HZ. They are most likely noise (the benchmarks in question are not always consistent).

The most obvious impact is the least surprising one: more interrupts are delivered. As more CPUs become active for a workload that is scaling from 1 CPU to many CPUs, the interrupts increase relative to the number of CPUs in the system. This may generate a few bugs, particularly for systems where there are many active CPUs. However, it can be trivially checked by monitoring /proc/interrupts over time and checking if the increases are exclusively tick related.

Overall, I think this is safe enough to enable. It'll generate some noise when/if SLE adopts it, but it'll be manageable.
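The /proc/interrupts check mentioned above could be sketched roughly as follows. This is a minimal illustration, not a tool used for the tests in this report: the helper names are hypothetical, and treating the `LOC` row ("Local timer interrupts") as the tick-related counter is an assumption that holds for x86; other architectures label their timer rows differently.

```python
import time


def parse_interrupts(text):
    """Map each /proc/interrupts row label to its count summed over all CPUs."""
    lines = text.strip().splitlines()
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        counts = []
        for token in rest.split()[:ncpu]:
            if not token.isdigit():
                break  # reached the description columns
            counts.append(int(token))
        totals[label.strip()] = sum(counts)
    return totals


def interrupt_deltas(before, after):
    """Return only the rows whose totals changed between two snapshots."""
    return {label: after[label] - before.get(label, 0)
            for label in after
            if after[label] != before.get(label, 0)}


def sample(path="/proc/interrupts", interval=10.0):
    """Take two snapshots `interval` seconds apart and return the deltas."""
    with open(path) as f:
        before = parse_interrupts(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_interrupts(f.read())
    return interrupt_deltas(before, after)
```

Calling `sample()` on a kernel built with each HZ value while running the same workload, and comparing the `LOC` delta against the other rows, gives a rough picture of whether the extra interrupts come only from the tick.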
(In reply to Dirk Mueller from comment #0)
> I recently stumbled over the 300 HZ option in the kernel
> configuration. Currently we are using 250 HZ with IDLE_HZ. The HZ
> setting is used as far as I can see to determine when jiffies are
> advancing, so it influences the granularity of a couple of things like
> kvmclock, timers and so on. Also USER_HZ is 100 HZ which is the
> granularity to which some metrics are exported to user space. 300 is
> divisible without remainder by 100, unlike 250.
>
> so it appears there are good reasons for switching the default, and it
> is "just" 20% more timer interrupts than before, so it should not be a
> huge issue. I also did a test build and a 300 HZ kernel is by a few
> bytes smaller than a 250 HZ kernel, indicating that the compiler can
> optimize away a few things.
>
> I am seeing an increase in a few small functions, and I'm looking into
> making the code size increase go away with a source level tweak. I've
> benchmarked both versions in a micro benchmark that does a billion
> invocations of both, and while the code is larger, it runs in exactly
> the same runtime (+/- 3% which I consider my benchmark noise level) on
> a Ryzen Zen 2+.
>
> In the Kconfig description of 300 HZ option, it appears this is more
> recommended for multimedia usecases because it is divisible without
> remainder for common rates, like 30 (fps), 60 fps , 120 fps, 44.1khZ
> and others that are often needed.
> This is imho not only usable for desktop, but also for servers that
> are using multimedia related applications.
>
> I've seen that fedora-like distributions use 1000 HZ, arch linux uses
> 300 Hz and debian defaults to 250 HZ. So there is no clear trend. I'd
> be fine with 300 or 1000 HZ.
>
> Comments? Would like to hear your feedback before sending a change to
> the Tumbleweed configs.

This may be disappointing, but I don't have a strong opinion on this.
The 300 HZ option was introduced in 2006 for frame rate reasons (https://lwn.net/Articles/208411/). Nowadays we have hrtimers for precise deadlines and dynticks for power consumption. In any case, I don't mind either 250 or 300 HZ. Both are good tradeoffs.
So anyone going to push the change ;)? Or do we stay with 250?

FWIW:

$ git grep CONFIG_HZ_[0-9].*=y config
config/arm64/default:CONFIG_HZ_100=y
config/armv6hl/default:CONFIG_HZ_100=y
config/armv7hl/default:CONFIG_HZ_200=y
config/i386/pae:CONFIG_HZ_250=y
config/ppc64/default:CONFIG_HZ_100=y
config/ppc64le/default:CONFIG_HZ_100=y
config/riscv64/default:CONFIG_HZ_100=y
config/s390x/default:CONFIG_HZ_100=y
config/s390x/zfcpdump:CONFIG_HZ_250=y
config/x86_64/default:CONFIG_HZ_250=y

So some architectures even have 100.
Submitted [master 06fab9d372f] config: align all architectures on CONFIG_HZ=300 (bsc#1196438) to users/dmueller/master/for-next
FYI: the non-default value 300 requires at least one kernel fix, for a sysctl value: https://lore.kernel.org/lkml/20230719103743.4775-2-chrubis@suse.cz/
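For readers who do not want to open the patch: the sketch below illustrates the kind of integer rounding problem that a non-divisor HZ such as 300 can expose when a jiffies-based timeslice is converted to milliseconds. It is written from my understanding of the linked fix for the initial round-robin timeslice sysctl value and is an assumption, not a copy of the kernel code; the function names are hypothetical and Python's `//` stands in for C integer division.

```python
MSEC_PER_SEC = 1000


def rr_timeslice_jiffies(hz):
    """A 100 ms round-robin timeslice expressed in jiffies."""
    return 100 * hz // 1000


def timeslice_ms_divide_first(hz):
    """Convert to ms by dividing first: 1000 // 300 truncates to 3."""
    return (MSEC_PER_SEC // hz) * rr_timeslice_jiffies(hz)


def timeslice_ms_multiply_first(hz):
    """Convert to ms by multiplying first, which avoids the truncation."""
    return (MSEC_PER_SEC * rr_timeslice_jiffies(hz)) // hz


for hz in (100, 250, 300, 1000):
    print(f"HZ={hz}: divide-first={timeslice_ms_divide_first(hz)} ms, "
          f"multiply-first={timeslice_ms_multiply_first(hz)} ms")
```

With HZ values that divide 1000 evenly, both orderings agree, which would explain why such a problem stays hidden with 100, 250 or 1000 and only shows up with 300.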
So at last, neither ALP nor Leap will adopt 300. So I think we should switch the TW back to 250 to be consistent with all our other distros.
At least we found and fixed bugs in the mainline kernel [1] [2]. I don't know why Leap and ALP don't switch, but I wonder if the mainline kernel could consider switching the default to 300.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c7fcb99877f9f542c918509b2801065adcaf46fa
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c1fc6484e1fb7cc2481d169bfef129a1b0676abe
(In reply to Jiri Slaby from comment #11)
> So at last, neither ALP nor Leap will adopt 300. So I think we should switch
> the TW back to 250 to be consistent with all our other distros.

I think it is expected that there is no change in SLE 15 or Leap. I am surprised about ALP, though: what were the reasons for that decision, and how do they weigh against the argument for an HZ that is compatible with the 100 HZ userland expectation (100, 300 or 1000)?
(In reply to Jiri Slaby from comment #11)
> So at last, neither ALP nor Leap will adopt 300. So I think we should switch
> the TW back to 250 to be consistent with all our other distros.

So for SLE the consensus is to keep CONFIG_HZ=250. Cyril recommended keeping CONFIG_HZ=250, citing the bug fixed by commit c7fcb99877f9, and Mel recommended keeping SLE and ALP aligned, which implies CONFIG_HZ=250 for ALP as well. So I agree we should revert the change. Objections?
(In reply to Yousaf Kaukab from comment #17)
> Objections?

Just setting a timeout: if there are none, I will revert TW to 250 after 8th Dec (next Fri).
(In reply to Yousaf Kaukab from comment #17)
> So for SLE the consensus is to keep CONFIG_HZ=250.

I assume you mean SLE15 here? I think the discussion should be open ended with SLE16.

> Cyril recommended to keep CONFIG_HZ=250 citing the bug fixed by commit
> c7fcb99877f9 and Mel recommended to keep SLE and ALP aligned which implies
> CONFIG_HZ=250 for ALP as well.

I'm sorry, but that means there can be no progress on anything anywhere ever. The fact of the matter is that we are improving the config continuously based on feedback.

> So I agree we should revert the change. Objections?

Well, what about the original suggestion to use a value that is evenly divisible for the user space metrics? I checked again, and RHEL9/10 is going with 100 HZ. So if your concern is that the number of tick interrupts is causing performance issues, why not go to the *same* value across all architectures, and pick a value that is evenly divisible by 100 (the userspace HZ)? Like 100? (Or 300, for that matter.)

I picked 300 because it was closest to the original 250, which I thought would raise the least concern overall. There were no conclusive benchmarks showing it would be a problem. Why can we not run with that to gather partner and customer feedback prior to release?