Bugzilla – Bug 1226963
[SLERT-15-SP6-RC2] process_scheduler_cfs_cyclictest_rt regression vs SLERT-15-SP5-GM baseline
Last modified: 2024-07-14 08:24:38 UTC
[Issue]
A regression was detected that is similar to those seen in the last couple of milestones: as far as the dashboard statistics are concerned, the regressions are big enough to worry about, but in absolute terms they are on the order of a few microseconds. On ph042, the difference between the mean latencies for GM and RC2 is just over 5 microseconds and the difference between their maximum values is 9 microseconds. On vh011, the differences in the mean and in the maximum values are each about 3 microseconds.

In the past (see Bug 1223321) it was concluded that a difference of a few microseconds is not necessarily grounds for a "FAIL" verdict. On the other hand, there is no clear justification for a "PASS" verdict either. An expert's opinion would be much appreciated, hence this bug report. Mel has kindly agreed to investigate.

[Environment]
ph042 : Intel Silver 4110 Skylake-SP
vh011 : Intel Silver 4110 Skylake-SP

[Logs]
ph042
-----
http://10.67.134.100/sleperf_dashboard/details.html?suite=qa_test_cyclictest&_case=process_scheduler_cfs_cyclictest_rt&q_tenv_id=7750&q_role_name=RealTime&r_tenv_id_role_name_pair=7272-RealTime:7291-RealTime:7439-RealTime:7692-RealTime&r_tenv_ids=7272-7291-7439-7692&build=RC2&arch=x86_64&release=SLERT-15-SP6&category=misc&category_value=null&machine=ph042.qa2.suse.asia&software_tag=baremetal&software_sub_tag=default

vh011
-----
http://10.67.134.100/sleperf_dashboard/details.html?suite=qa_test_cyclictest&_case=process_scheduler_cfs_cyclictest_rt&q_tenv_id=7751&q_role_name=RealTime&r_tenv_id_role_name_pair=7271-RealTime:7292-RealTime:7440-RealTime:7691-RealTime&r_tenv_ids=7271-7292-7440-7691&build=RC2&arch=x86_64&release=SLERT-15-SP6&category=misc&category_value=null&machine=vh011.qa2.suse.asia&software_tag=baremetal&software_sub_tag=default
cyclictest can be an exception as the test really is concerned with microsecond differences. It can still be tricky because the reporting in your configuration is per-CPU, so individual CPUs can have outliers. In this particular case though, my biggest concern is that the security mitigations appear to differ: SP5 is not mitigating Meltdown while SP6 is. While cyclictest should not have overhead due to Meltdown, it still has some, so the security mitigations applied must match across all tests. Thanks for reporting, Kostas. Even if this turns out to be a difference in mitigations, it's best to be cautious about cyclictest for RT tests.
Is there still a regression if mitigations are disabled or matched?
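One way to check whether the mitigations match is to compare the kernel's own report on each host. The sketch below is illustrative only: it assumes the standard sysfs directory /sys/devices/system/cpu/vulnerabilities is present on both kernels, and the helper names `read_mitigations` and `diff_mitigations` are hypothetical, not part of any existing test harness.

```python
from pathlib import Path

# Standard sysfs location where the kernel reports per-vulnerability status.
VULN_DIR = Path("/sys/devices/system/cpu/vulnerabilities")


def read_mitigations(vuln_dir=VULN_DIR):
    """Return {vulnerability_name: status_string} for the local host."""
    return {
        entry.name: entry.read_text().strip()
        for entry in sorted(Path(vuln_dir).iterdir())
        if entry.is_file()
    }


def diff_mitigations(baseline, candidate):
    """Return {name: (baseline_status, candidate_status)} for entries that differ."""
    names = sorted(set(baseline) | set(candidate))
    return {
        name: (baseline.get(name), candidate.get(name))
        for name in names
        if baseline.get(name) != candidate.get(name)
    }


if __name__ == "__main__":
    # Print the local status so it can be saved and compared across builds.
    if VULN_DIR.is_dir():
        for name, status in read_mitigations().items():
            print(f"{name}: {status}")
```

Running this on the SP5 and SP6 installations and feeding both dictionaries to `diff_mitigations` would list any entry, such as meltdown, whose status differs between the two builds.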
Hi Mel. I haven't tried turning off mitigations yet; that is going to be the next thing to do. However, now that the new cyclictest case (the one without any test load) has been deployed, I have some results to share with you. There are some spikes that push the mean latency into double digits, which suggests that hackbench may not be responsible for the latencies we have been seeing with SP6 builds.

Next step: Retest with mitigations set to off.

[Logs]
ph042
-----
http://10.67.134.100/sleperf_dashboard/details.html?suite=qa_test_cyclictest&_case=process_scheduler_cfs_cyclictest_none_rt&q_tenv_id=7750&q_role_name=RealTime&r_tenv_id_role_name_pair=7291-RealTime&r_tenv_ids=7291&build=RC2&arch=x86_64&release=SLERT-15-SP6&category=misc&category_value=null&machine=ph042.qa2.suse.asia&software_tag=baremetal&software_sub_tag=default

vh011
-----
http://10.67.134.100/sleperf_dashboard/details.html?suite=qa_test_cyclictest&_case=process_scheduler_cfs_cyclictest_none_rt&q_tenv_id=7751&q_role_name=RealTime&r_tenv_id_role_name_pair=7292-RealTime&r_tenv_ids=7292&build=RC2&arch=x86_64&release=SLERT-15-SP6&category=misc&category_value=null&machine=vh011.qa2.suse.asia&software_tag=baremetal&software_sub_tag=default
Ok, the averages have barely moved, indicating that there were occasional large outliers but not enough to be worried about. What I'd normally do is look at a "fine-grained" report using the -v switch to cyclictest. This is verbose enough that it distorts the results, so by default I run both tests in separate runs. The fine-grained data can be used to see the distribution and how often outliers occur. However, the difference in absolute time could simply be showing C-state or CPU idle polling jitter, with idle polling being the more likely explanation as SP5 polls for longer than SP6 when idle waiting on new task activity. The difference in absolute time is small enough that I think we can close this as WONTFIX.
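As a rough illustration of how the fine-grained data could be used to look at the distribution and outlier frequency, here is a minimal sketch. It assumes each verbose sample line has the shape "thread: cycle: latency" with the latency in microseconds; that layout should be checked against the actual attachments before relying on it. The function names and the outlier threshold are hypothetical.

```python
import re
from collections import defaultdict
from statistics import mean

# Assumed verbose sample line: "<thread>: <cycle>: <latency_us>"
SAMPLE_RE = re.compile(r"^\s*(\d+):\s*(\d+):\s*(-?\d+)\s*$")


def parse_verbose(lines):
    """Group latency samples (assumed microseconds) by thread index."""
    samples = defaultdict(list)
    for line in lines:
        match = SAMPLE_RE.match(line)
        if match:  # anything that is not a sample line is skipped
            thread, _cycle, latency = match.groups()
            samples[int(thread)].append(int(latency))
    return dict(samples)


def summarize(latencies, outlier_us):
    """Return mean, max, and how often samples exceed outlier_us."""
    outliers = sum(1 for value in latencies if value > outlier_us)
    return {
        "count": len(latencies),
        "mean": mean(latencies),
        "max": max(latencies),
        "outliers": outliers,
        "outlier_fraction": outliers / len(latencies),
    }
```

Calling `summarize` per thread on the output of `parse_verbose` shows whether a shifted mean comes from a handful of large spikes or from a general shift in latency. Because the verbose run itself perturbs the measurements, the numbers are better suited to comparing distribution shapes than absolute values.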
Created attachment 876039 [details] Cyclictest verbose results with mitigations=auto for two hosts
Created attachment 876040 [details] Cyclictest verbose results with mitigations disabled for two hosts
(In reply to Mel Gorman from comment #2)
> Is there still a regression if mitigations are disabled or matched?

Hi Mel, I am happy to close this ticket as WONTFIX. For completeness, I have attached a couple of log archives from hosts ph042 and vh011, with and without mitigations.

Attachment 876039 [details] - Verbose cyclictest results with mitigations=auto
Attachment 876040 [details] - Verbose cyclictest results without mitigations
Closing the bug report as suggested.