Bug 1226963 - [SLERT-15-SP6-RC2] process_scheduler_cfs_cyclictest_rt regression vs SLERT-15-SP5-GM baseline
Status: RESOLVED WONTFIX
Alias: None
Product: SUSE Linux Enterprise Real Time 15 SP6
Classification: SUSE Linux Enterprise Real Time Extension
Component: Kernel
Version: unspecified
Hardware: Other Other
: P2 - High : Normal
Target Milestone: ---
Assignee: Kernel Bugs
QA Contact: E-mail List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-06-25 16:05 UTC by Kostas Peletidis
Modified: 2024-07-14 08:24 UTC
CC List: 14 users

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
Cyclictest verbose results with mitigations=auto for two hosts (10.97 MB, application/x-bzip2)
2024-07-13 12:57 UTC, Kostas Peletidis
Cyclictest verbose results with mitigations disabled for two hosts (10.99 MB, application/x-bzip2)
2024-07-13 12:58 UTC, Kostas Peletidis

Description Kostas Peletidis 2024-06-25 16:05:44 UTC
[Issue]
A regression was detected that is similar to those seen in the last couple of milestones: the dashboard statistics flag the regressions as large enough to worry about, but in absolute terms they are on the order of a few microseconds.

On ph042, the difference between the mean latencies for GM and RC2 is just over 5 microseconds and the difference between their maximum values is 9 microseconds.

On vh011, the differences in both the mean and the maximum values are about 3 microseconds.

In the past (see Bug 1223321) it was concluded that a difference of a few microseconds is not necessarily grounds for a "FAIL" verdict. On the other hand, there is no clear justification for a "PASS" verdict either. An expert's opinion would be much appreciated, hence this bug report. Mel has kindly agreed to investigate.
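For reference, the comparison described above can be sketched as follows. This is a minimal illustration only: the function name `latency_deltas` is hypothetical, and the sample values are made up rather than taken from the dashboard.

```python
from statistics import mean


def latency_deltas(baseline_us, candidate_us):
    """Return (mean delta, max delta) in microseconds, candidate minus baseline."""
    mean_delta = mean(candidate_us) - mean(baseline_us)
    max_delta = max(candidate_us) - max(baseline_us)
    return mean_delta, max_delta


# Made-up latency samples in microseconds, NOT measurements from ph042 or vh011.
gm_samples = [4, 5, 6, 5]
rc2_samples = [9, 10, 12, 9]

mean_delta, max_delta = latency_deltas(gm_samples, rc2_samples)
print(f"mean delta: {mean_delta} us, max delta: {max_delta} us")
```

A positive delta means the candidate build is slower than the baseline. Whether a delta of a few microseconds warrants a "FAIL" verdict is still a judgment call.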


[Environment]
ph042 : Intel Silver 4110 Skylake-SP
vh011 : Intel Silver 4110 Skylake-SP


[Logs]
ph042
-----
http://10.67.134.100/sleperf_dashboard/details.html?suite=qa_test_cyclictest&_case=process_scheduler_cfs_cyclictest_rt&q_tenv_id=7750&q_role_name=RealTime&r_tenv_id_role_name_pair=7272-RealTime:7291-RealTime:7439-RealTime:7692-RealTime&r_tenv_ids=7272-7291-7439-7692&build=RC2&arch=x86_64&release=SLERT-15-SP6&category=misc&category_value=null&machine=ph042.qa2.suse.asia&software_tag=baremetal&software_sub_tag=default

vh011
-----
http://10.67.134.100/sleperf_dashboard/details.html?suite=qa_test_cyclictest&_case=process_scheduler_cfs_cyclictest_rt&q_tenv_id=7751&q_role_name=RealTime&r_tenv_id_role_name_pair=7271-RealTime:7292-RealTime:7440-RealTime:7691-RealTime&r_tenv_ids=7271-7292-7440-7691&build=RC2&arch=x86_64&release=SLERT-15-SP6&category=misc&category_value=null&machine=vh011.qa2.suse.asia&software_tag=baremetal&software_sub_tag=default
Comment 1 Mel Gorman 2024-06-26 12:25:08 UTC
cyclictest can be an exception as the test is really concerned with microsecond differences. It can still be tricky as the reporting in your configuration is per-cpu so individual CPUs can have outliers. In this particular case though, my biggest concern is that the security mitigations appear to differ. SP5 is not mitigating Meltdown while SP6 is. While cyclictest should not have overhead due to Meltdown, it still has some so the security mitigations applied must match across all tests.

Thanks for reporting, Kostas. Even if this turns out to be a difference in mitigation, it's best to be cautious about cyclictest for RT tests.
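One possible way to check that mitigations match between two installations is to compare the contents of /sys/devices/system/cpu/vulnerabilities on each. The sketch below assumes that approach. The helper names `read_mitigations` and `diff_mitigations` are hypothetical, and the status strings in the example are illustrative, not captured from the SP5 or SP6 hosts.

```python
from pathlib import Path

# Standard sysfs location where the kernel reports per-vulnerability status.
VULN_DIR = Path("/sys/devices/system/cpu/vulnerabilities")


def read_mitigations(vuln_dir=VULN_DIR):
    """Map each vulnerability name to the status string reported by the kernel."""
    if not vuln_dir.is_dir():
        return {}
    return {
        entry.name: entry.read_text().strip()
        for entry in sorted(vuln_dir.iterdir())
        if entry.is_file()
    }


def diff_mitigations(a, b):
    """Return {name: (status_a, status_b)} for every entry that differs."""
    names = sorted(set(a) | set(b))
    return {n: (a.get(n), b.get(n)) for n in names if a.get(n) != b.get(n)}


# Illustrative values only, NOT read from the hosts in this bug.
sp5 = {"meltdown": "Vulnerable", "mds": "Mitigation: Clear CPU buffers"}
sp6 = {"meltdown": "Mitigation: PTI", "mds": "Mitigation: Clear CPU buffers"}

mismatch = diff_mitigations(sp5, sp6)
for name, (old, new) in mismatch.items():
    print(f"{name}: {old!r} vs {new!r}")
```

On a live system, the output of `read_mitigations()` would be saved on each installation and the two results passed to `diff_mitigations`. An empty result would suggest the runs are comparable as far as mitigations go.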
Comment 2 Mel Gorman 2024-07-11 12:50:36 UTC
Is there still a regression if mitigations are disabled or matched?
Comment 3 Kostas Peletidis 2024-07-11 15:09:46 UTC
Hi Mel. I haven't tried turning off mitigations yet; that is going to be the next thing to do.

However, now that the new cyclictest case (the one without any test load) has been deployed, I have some results to share with you. There are some spikes that push the mean latency to double digits, which suggests that hackbench may not be responsible for the latencies we have been seeing with SP6 builds.

Next step: Retest with mitigations set to off.

[Logs]
ph042
-----
http://10.67.134.100/sleperf_dashboard/details.html?suite=qa_test_cyclictest&_case=process_scheduler_cfs_cyclictest_none_rt&q_tenv_id=7750&q_role_name=RealTime&r_tenv_id_role_name_pair=7291-RealTime&r_tenv_ids=7291&build=RC2&arch=x86_64&release=SLERT-15-SP6&category=misc&category_value=null&machine=ph042.qa2.suse.asia&software_tag=baremetal&software_sub_tag=default


vh011
-----
http://10.67.134.100/sleperf_dashboard/details.html?suite=qa_test_cyclictest&_case=process_scheduler_cfs_cyclictest_none_rt&q_tenv_id=7751&q_role_name=RealTime&r_tenv_id_role_name_pair=7292-RealTime&r_tenv_ids=7292&build=RC2&arch=x86_64&release=SLERT-15-SP6&category=misc&category_value=null&machine=vh011.qa2.suse.asia&software_tag=baremetal&software_sub_tag=default
Comment 4 Mel Gorman 2024-07-12 07:30:02 UTC
OK, the averages have barely moved, indicating that there were occasional large outliers but not enough to be worried about. What I'd normally do is look at a "fine-grained" report using the -v switch to cyclictest. This is verbose enough that it distorts the results, so by default I run both tests in separate runs. The fine-grained data can be used to see the distribution and how often outliers occur. However, the difference in absolute time could simply be showing C-state or CPU idle polling jitter, with idle polling being the more likely explanation as SP5 polls for longer than SP6 when idle and waiting on new task activity. The difference in absolute time is small enough that I think we can close this as WONTFIX.
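A rough sketch of how the fine-grained data could be summarised is shown below. It assumes that each verbose sample line has the form thread:count:latency with latency in microseconds, which should be verified against the actual logs. The function names and the 20 microsecond threshold are hypothetical, and the sample lines are invented.

```python
from collections import defaultdict


def parse_verbose(lines):
    """Collect latency samples per thread from 'thread:count:latency' lines.

    Lines that do not match that shape (headers, comments, blanks) are skipped.
    """
    per_thread = defaultdict(list)
    for line in lines:
        parts = line.strip().split(":")
        if len(parts) != 3:
            continue
        try:
            thread, _count, latency_us = (int(p) for p in parts)
        except ValueError:
            continue
        per_thread[thread].append(latency_us)
    return dict(per_thread)


def outlier_summary(latencies, threshold_us):
    """Report how many samples exceed threshold_us and the worst case seen."""
    over = [v for v in latencies if v > threshold_us]
    return {
        "samples": len(latencies),
        "max": max(latencies),
        "over": len(over),
        "fraction_over": len(over) / len(latencies),
    }


# Invented sample lines, NOT taken from the attached logs.
sample = [
    "# header text",
    "       0:       0:       3",
    "       0:       1:       4",
    "       1:       0:       5",
    "       0:       2:      40",
    "",
]

parsed = parse_verbose(sample)
summary = outlier_summary(parsed[0], threshold_us=20)
print(summary)
```

Counting how often samples exceed a threshold per thread separates a handful of isolated spikes from a shifted distribution, which the mean alone cannot show.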
Comment 5 Kostas Peletidis 2024-07-13 12:57:16 UTC
Created attachment 876039 [details]
Cyclictest verbose results with mitigations=auto for two hosts
Comment 6 Kostas Peletidis 2024-07-13 12:58:27 UTC
Created attachment 876040 [details]
Cyclictest verbose results with mitigations disabled for two hosts
Comment 7 Kostas Peletidis 2024-07-13 13:05:40 UTC
(In reply to Mel Gorman from comment #2)
> Is there still a regression if mitigations are disabled or matched?

Hi Mel,

I am happy to close this ticket as WONTFIX.

For completeness, I have attached a couple of log archives from hosts ph042 and vh011, with and without mitigations.

Attachment 876039 [details] - Verbose cyclictest results with mitigations=auto
Attachment 876040 [details] - Verbose cyclictest results without mitigations
Comment 8 Jeffrey Cheung 2024-07-14 08:24:38 UTC
Closing the bug report as suggested.