Bug 1196438 - Consider switching to 300HZ default
Summary: Consider switching to 300HZ default
Status: IN_PROGRESS
Alias: None
Product: openSUSE Tumbleweed
Classification: openSUSE
Component: Kernel
Version: Current
Hardware: Other Other
Importance: P5 - None : Normal
Target Milestone: ---
Assignee: openSUSE Kernel Bugs
QA Contact: E-mail List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-02-24 12:46 UTC by Dirk Mueller
Modified: 2024-02-12 17:36 UTC
CC List: 13 users

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Description Dirk Mueller 2022-02-24 12:46:56 UTC
I recently stumbled over the 300 HZ option in the kernel
configuration. Currently we are using 250 HZ with IDLE_HZ. As far as
I can see, the HZ setting determines how often jiffies advance, so it
influences the granularity of a couple of things like kvmclock, timers
and so on. Also, USER_HZ is 100, which is the granularity at which
some metrics are exported to user space. 300 is divisible by 100
without remainder, unlike 250.
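
A rough sketch of the divisibility argument, assuming a plain
floor conversion from jiffies to USER_HZ ticks (the kernel's real
conversion helpers are more involved; the function name here is
hypothetical). It counts how many jiffies land in each exported tick
over one second:

```python
from collections import Counter

USER_HZ = 100  # granularity of values exported to user space


def jiffies_per_user_tick(hz):
    """Return the distinct numbers of jiffies that map onto one USER_HZ tick.

    Assumes a simple floor conversion: user_tick = jiffies * USER_HZ // hz.
    """
    buckets = Counter(j * USER_HZ // hz for j in range(hz))
    return sorted(set(buckets.values()))


for hz in (250, 300, 1000):
    print(f"HZ={hz}: jiffies per USER_HZ tick = {jiffies_per_user_tick(hz)}")
```

With 250 the exported ticks alternate between 2 and 3 jiffies, while
300 and 1000 give a constant 3 and 10 respectively.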

So it appears there are good reasons for switching the default, and it
is "just" 20% more timer interrupts than before, so it should not be a
huge issue. I also did a test build, and a 300 HZ kernel is a few
bytes smaller than a 250 HZ kernel, indicating that the compiler can
optimize away a few things.

I am seeing an increase in a few small functions, and I'm looking into
making the code size increase go away with a source-level tweak. I've
benchmarked both versions in a microbenchmark that does a billion
invocations of each, and while the code is larger, it runs in the same
time (+/- 3%, which I consider my benchmark noise level) on a Ryzen
Zen 2+.

In the Kconfig description of the 300 HZ option, it appears this is
recommended more for multimedia use cases because it divides evenly
by common frame rates like 30 fps and 60 fps (though not 120 fps), and
44.1 kHz is in turn a whole multiple of 300.
This is imho useful not only for desktops, but also for servers that
run multimedia-related applications.
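
To make the multimedia argument checkable, here is a small sketch.
The helper names are hypothetical, and the 25 fps (PAL) entry is my
own addition to the rates mentioned above. A frame rate "fits" if a
frame period is a whole number of ticks, and an audio rate "fits" if
a tick holds a whole number of samples:

```python
def whole_ticks_per_frame(hz, fps):
    """True if one frame period is a whole number of timer ticks."""
    return hz % fps == 0


def whole_samples_per_tick(hz, sample_rate):
    """True if one timer tick holds a whole number of audio samples."""
    return sample_rate % hz == 0


FRAME_RATES = (25, 30, 60, 120)  # 25 fps (PAL) is an assumption of mine
AUDIO_RATE = 44100               # 44.1 kHz

for hz in (250, 300, 1000):
    fits = [fps for fps in FRAME_RATES if whole_ticks_per_frame(hz, fps)]
    audio = whole_samples_per_tick(hz, AUDIO_RATE)
    print(f"HZ={hz}: frame rates that fit: {fits}, 44.1 kHz fits: {audio}")
```

Note that 120 fps does not fit 300 HZ under this definition
(300 / 120 = 2.5), while 30 fps, 60 fps and 44.1 kHz do.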

I've seen that Fedora-like distributions use 1000 HZ, Arch Linux uses
300 HZ and Debian defaults to 250 HZ. So there is no clear trend. I'd
be fine with 300 or 1000 HZ.

Comments? Would like to hear your feedback before sending a change to
the Tumbleweed configs.
Comment 2 Mel Gorman 2022-02-24 15:02:57 UTC
(In reply to Dirk Mueller from comment #1)
> see
> https://lists.opensuse.org/archives/list/kernel@lists.opensuse.org/thread/QN2AFMCXQGGHF2I6FSQYVH6AXRX4WPIQ/
> for context

A battery of scheduler tests comparing HZ=250 (default x86-64 on master) versus HZ=300 will be queued. Note that as the queue already has a substantial number of tests on it, it'll take several days before there are results to evaluate.
Comment 3 Mel Gorman 2022-03-03 10:59:31 UTC
(In reply to Mel Gorman from comment #2)

In general the performance results are not bad. There were a few outliers showing major regressions, but they were not consistent across machines, nor was it clear that they were specifically due to the change in HZ; they are most likely noise (the benchmarks in question are not always consistent). The most obvious impact is the least surprising one -- more interrupts are delivered. As more CPUs become active for a workload that is scaling from 1 CPU to many CPUs, the interrupts increase relative to the number of CPUs in the system. This may generate a few bugs, particularly for systems where there are many active CPUs. However, it can be trivially checked by monitoring /proc/interrupts over time and checking if the increases are exclusively tick related.
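
One possible way to do that check, as a minimal sketch rather than a finished tool: take two snapshots of /proc/interrupts and list which rows grew. The function names are hypothetical, and the parsing assumes the usual layout (a header row naming the CPUs, then one row per interrupt source with one counter per CPU). On x86 the tick normally shows up in the LOC ("Local timer interrupts") row.

```python
import time


def parse_interrupts(text):
    """Return {label: count summed over all CPUs} from /proc/interrupts text."""
    lines = text.strip().splitlines()
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        counts = []
        # Rows such as ERR carry fewer counters than there are CPUs,
        # so stop at the first token that is not a number.
        for token in rest.split()[:ncpus]:
            if not token.isdigit():
                break
            counts.append(int(token))
        totals[label.strip()] = sum(counts)
    return totals


def deltas(before, after):
    """Rows whose count increased between two snapshots, largest first."""
    grown = [(k, after[k] - before.get(k, 0)) for k in after]
    return sorted((kv for kv in grown if kv[1] > 0), key=lambda kv: -kv[1])


def sample(interval=10.0, path="/proc/interrupts"):
    """Snapshot the file twice, `interval` seconds apart, and diff the totals."""
    with open(path) as f:
        before = parse_interrupts(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_interrupts(f.read())
    return deltas(before, after)
```

Running `sample()` on a HZ=250 and a HZ=300 kernel under the same workload and comparing the rows should show whether anything other than the timer row changed noticeably.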

Overall, I think this is safe enough to enable. It'll generate some noise when/if SLE adopts it but it'll be manageable.
Comment 4 Frederic Weisbecker 2022-03-07 11:40:25 UTC
(In reply to Dirk Mueller from comment #0)

This may be disappointing, but I don't have a strong opinion on this. The 300 HZ option was introduced in 2006 for the sake of video frame rates (https://lwn.net/Articles/208411/). Nowadays we have hrtimers for precise deadlines and dynticks for power consumption.

In any case I don't mind either 250 or 300 Hz. Both are good tradeoffs.
Comment 5 Jiri Slaby 2023-05-11 05:29:19 UTC
So anyone going to push the change ;)? Or do we stay with 250?

FWIW:
$ git grep CONFIG_HZ_[0-9].*=y config
config/arm64/default:CONFIG_HZ_100=y
config/armv6hl/default:CONFIG_HZ_100=y
config/armv7hl/default:CONFIG_HZ_200=y
config/i386/pae:CONFIG_HZ_250=y
config/ppc64/default:CONFIG_HZ_100=y
config/ppc64le/default:CONFIG_HZ_100=y
config/riscv64/default:CONFIG_HZ_100=y
config/s390x/default:CONFIG_HZ_100=y
config/s390x/zfcpdump:CONFIG_HZ_250=y
config/x86_64/default:CONFIG_HZ_250=y

So some architectures even have 100.
Comment 6 Dirk Mueller 2023-05-12 08:10:44 UTC
Submitted 

[master 06fab9d372f] config: align all architectures on CONFIG_HZ=300 (bsc#1196438)

to users/dmueller/master/for-next
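
For a sense of scale, here is a rough sketch of what aligning on 300 means per config, using the values listed in comment 5. It assumes the tick rate scales linearly with HZ on a busy CPU and ignores dynticks; the helper name is hypothetical.

```python
# CONFIG_HZ values before the change, taken from the git grep in comment 5.
OLD_HZ = {
    "arm64/default": 100,
    "armv6hl/default": 100,
    "armv7hl/default": 200,
    "i386/pae": 250,
    "ppc64/default": 100,
    "ppc64le/default": 100,
    "riscv64/default": 100,
    "s390x/default": 100,
    "s390x/zfcpdump": 250,
    "x86_64/default": 250,
}
NEW_HZ = 300


def tick_increase_percent(old_hz, new_hz=NEW_HZ):
    """Relative increase in tick interrupts on a non-idle CPU, in percent."""
    return (new_hz - old_hz) * 100 / old_hz


for flavor, hz in OLD_HZ.items():
    print(f"{flavor:18} {hz:4} -> {NEW_HZ}: +{tick_increase_percent(hz):.0f}% ticks, "
          f"tick period {1000 / hz:.2f} ms -> {1000 / NEW_HZ:.2f} ms")
```

So the "just 20%" from the description only applies to the configs that were at 250; the ones at 100 see three times as many ticks.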
Comment 8 Petr Vorel 2023-07-19 11:28:15 UTC
FYI, at least the non-default 300 requires a kernel fix for a sysctl value:
https://lore.kernel.org/lkml/20230719103743.4775-2-chrubis@suse.cz/
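
As far as I understand the linked patch, the problem is an integer rounding error in the initial value of the sched_rr_timeslice_ms sysctl when MSEC_PER_SEC is not a multiple of HZ. The sketch below is my paraphrase of that arithmetic in Python, not code copied from the kernel; the function names are hypothetical.

```python
MSEC_PER_SEC = 1000


def rr_timeslice_jiffies(hz):
    """Default round-robin timeslice: 100 ms expressed in jiffies."""
    return 100 * hz // 1000


def sysctl_initial_old(hz):
    """Divide first, then multiply: loses precision when 1000 % hz != 0."""
    return (MSEC_PER_SEC // hz) * rr_timeslice_jiffies(hz)


def sysctl_initial_fixed(hz):
    """Multiply first, then divide: keeps the intended 100 ms."""
    return (MSEC_PER_SEC * rr_timeslice_jiffies(hz)) // hz


for hz in (100, 250, 300, 1000):
    print(f"HZ={hz}: old={sysctl_initial_old(hz)} ms, "
          f"fixed={sysctl_initial_fixed(hz)} ms")
```

Under these assumptions only HZ=300 is affected: the old ordering yields 90 ms instead of 100 ms, which matches a 10% rounding error.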
Comment 11 Jiri Slaby 2023-10-18 07:20:48 UTC
So at last, neither ALP nor Leap will adopt 300. So I think we should switch the TW back to 250 to be consistent with all our other distros.
Comment 12 Petr Vorel 2023-10-18 07:43:12 UTC
At least we found and fixed bugs in the mainline kernel [1] [2]. I don't know the reason why Leap and ALP don't switch, but I wonder if the mainline kernel could consider switching the default to 300.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c7fcb99877f9f542c918509b2801065adcaf46fa
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c1fc6484e1fb7cc2481d169bfef129a1b0676abe
Comment 13 Dirk Mueller 2023-10-18 07:50:08 UTC
(In reply to Jiri Slaby from comment #11)
> So at last, neither ALP nor Leap will adopt 300. So I think we should switch
> the TW back to 250 to be consistent with all our other distros.

I think it is expected to not have a change in SLE 15 or Leap. I am surprised about ALP: what were the reasons for that, weighed against the reason to have a HZ that is compatible with the 100 HZ userland expectation (100, 300 or 1000)?
Comment 17 Yousaf Kaukab 2023-11-29 12:41:33 UTC
(In reply to Jiri Slaby from comment #11)
> So at last, neither ALP nor Leap will adopt 300. So I think we should switch
> the TW back to 250 to be consistent with all our other distros.

So for SLE the consensus is to keep CONFIG_HZ=250. Cyril recommended to keep CONFIG_HZ=250 citing the bug fixed by commit c7fcb99877f9 and Mel recommended to keep SLE and ALP aligned which implies CONFIG_HZ=250 for ALP as well. So I agree we should revert the change. Objections?
Comment 18 Jiri Slaby 2023-11-30 06:52:53 UTC
(In reply to Yousaf  Kaukab from comment #17)
> Objections?

Just setting a timeout: If not, let me revert TW to 250 after 8th Dec (next Fri).
Comment 19 Dirk Mueller 2023-11-30 15:45:14 UTC
(In reply to Yousaf  Kaukab from comment #17)

> So for SLE the consensus is to keep CONFIG_HZ=250.

I assume you mean SLE15 here? I think the discussion should be open ended with SLE16. 

> Cyril recommended to keep  CONFIG_HZ=250 citing the bug fixed by commit c7fcb99877f9 and Mel
> recommended to keep SLE and ALP aligned which implies CONFIG_HZ=250 for ALP
> as well. 

I'm sorry, but that means there can be no progress on anything anywhere ever. But the fact of the matter is that we are improving the config continuously based on feedback. 


> So I agree we should revert the change. Objections?

Well, what about the original suggestion to use a value that is evenly divisible for the user space metrics?
I checked again, and RHEL9/10 is going with 100 HZ. So if your concern is that the number of tick interrupts is causing performance issues, why not go to the *same* value across all architectures, and pick a value that is evenly divisible by 100 (the userspace HZ)? Like 100? (or 300, for that matter)

I picked 300 because it was closest to the original 250, which I thought would be of least concern overall. There were no conclusive benchmarks showing it would be a problem. Why can we not run with that to see partner and customer feedback prior to release?