Bug 1223798

Summary: Nvidia: Grace Perf-stat tool does not support ipc
Product: [openSUSE] PUBLIC SUSE Linux Enterprise Desktop 15 SP5
Reporter: Carol Soto <csoto>
Component: Kernel
Assignee: Tony Jones <tonyj>
Status: RESOLVED WORKSFORME
QA Contact:
Severity: Normal
Priority: P5 - None
CC: afaerber, arm-bugs, csoto, ddavis, ivan.ivanov, mbenes, mochs, petr.tesarik, tonyj, yousaf.kaukab
Version: unspecified
Target Milestone: ---
Hardware: aarch64
OS: SLES 15
Whiteboard:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---

Description Carol Soto 2024-05-03 03:16:34 UTC
We would like to request a backport of these 2 patches:

perf vendor events arm64: Update N2 and V2 metrics and events using Arm telemetry repo
https://github.com/torvalds/linux/commit/4473949074c35072f598bd525ae51d5455f05745

perf parse-events: Make legacy events lower priority than sysfs/JSON
https://github.com/torvalds/linux/commit/a24d9d9dc096fc0d0bd85302c9a4fe4fe3b1107b

One way to check that the patches are working is to run this command and look for this output:
sudo perf stat -v -a -M ipc ls /
Using CPUID 0x00000000410fd4f0
metric expr INST_RETIRED / CPU_CYCLES for ipc
found event INST_RETIRED
found event CPU_CYCLES
Parsing metric events '{INST_RETIRED/metric-id=INST_RETIRED/,CPU_CYCLES/metric-id=CPU_CYCLES/}:W'
INST_RETIRED -> armv8_pmuv3_0/metric-id=INST_RETIRED,INST_RETIRED/
CPU_CYCLES -> armv8_pmuv3_0/metric-id=CPU_CYCLES,CPU_CYCLES/
Matched metric-id INST_RETIRED to INST_RETIRED
Matched metric-id CPU_CYCLES to CPU_CYCLES
Control descriptor is not initialized
bin  bin.usr-is-merged	boot  cdrom  dev  etc  home  lib  lib.usr-is-merged  lost+found  media	mnt  opt  proc	root  run  sbin  sbin.usr-is-merged  snap  srv	swap.img  sys  tmp  usr  var
INST_RETIRED: 19656830 442994656 442994656
CPU_CYCLES: 33765594 442994656 442994656

 Performance counter stats for 'system wide':

          19656830      INST_RETIRED                     #      0.6 per cycle  ipc            
          33765594      CPU_CYCLES                                                            

       0.002799165 seconds time elapsed
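
The ipc value in the last column is the ratio of the two counters, as the "metric expr INST_RETIRED / CPU_CYCLES" line indicates. A minimal sketch of that arithmetic, using the counts from the output above. The helper name ipc_from_counts is made up for illustration and is not part of perf:

```shell
#!/bin/sh
# Hypothetical helper (not part of perf): divide instructions by cycles.
# Usage: ipc_from_counts <INST_RETIRED> <CPU_CYCLES>
ipc_from_counts() {
    awk -v inst="$1" -v cyc="$2" 'BEGIN {
        # Avoid a division by zero if no cycles were counted.
        if (cyc == 0) { print "n/a"; exit }
        printf "%.2f\n", inst / cyc
    }'
}

ipc_from_counts 19656830 33765594
```

With these counts the sketch prints 0.58, which perf shows rounded to one decimal as 0.6.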



We would like this to work on SLES 15 SP5 and SP6.

When I tried this command on SLES 15 SP5, I got this:
# sudo perf stat -v -a -M ipc ls /
Using CPUID 0x00000000410fd4f0
Cannot find metric or group `ipc'

 Usage: perf stat [<options>] [<command>]

    -M, --metrics <metric/metric group list>
                          monitor specified metrics or metric groups (separated by ,)
# rpm -qa | grep perf
gperf-3.1-1.27.aarch64
perf-5.14.21-150500.50.44.aarch64
# rpm -qa | grep kernel-64k
kernel-64kb-devel-5.14.21-150500.55.39.1.aarch64
kernel-64kb-5.14.21-150500.55.39.1.aarch64
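
Before running perf stat, it can help to check whether the installed perf knows the metric at all. A rough sketch, assuming the output of "perf list metric" contains the metric name as a separate word; the layout differs between perf versions, so this is only a loose word match. has_metric is a hypothetical helper, not a perf feature:

```shell
#!/bin/sh
# Hypothetical helper: succeed if the metric name given as $1 appears as a
# whole word anywhere in the listing read from stdin.
has_metric() {
    grep -qw -- "$1"
}

# Intended use on a real system (output layout varies by perf version):
#   perf list metric 2>/dev/null | has_metric ipc && echo "ipc available"
```

A word match can also hit a description that happens to mention the name, so a positive result is a hint rather than proof.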
Comment 1 Tony Jones 2024-05-10 20:14:52 UTC



(In reply to Carol Soto from comment #0)
> We would to request to backport these 2 patches:
> 
> perf vendor events arm64: Update N2 and V2 metrics and events using Arm
> telemetry repo
> https://github.com/torvalds/linux/commit/
> 4473949074c35072f598bd525ae51d5455f05745

this is already in SP6 and is not an issue to backport to SP5

> perf parse-events: Make legacy events lower priority than sysfs/JSON
> https://github.com/torvalds/linux/commit/
> a24d9d9dc096fc0d0bd85302c9a4fe4fe3b1107b

I am confused.   Reading the backing thread for the above commit,  the issue was tracked down to 5ea8f2ccffb23983f02012a2731464586b10fbf3 "perf parse-events: Support hardware events as terms" which caused a v6.5->v6.6 regression.

This commit is not present in SP5.

For SP6 (which is pending release) my concern with this is that it is changing the existing default behavior of the tool.  Right now legacy events are always given priority.   Now with this patch if a PMU is specified, the legacy event will have lower priority.
Comment 2 Carol Soto 2024-05-10 21:19:06 UTC
(In reply to Tony Jones from comment #1)

For SP6, I can check once a kernel with the first commit is out. Is it in the beta already, or do I need to wait for the release? When I opened this bugzilla I had only checked SP5.
Thanks
Carol
Comment 3 Tony Jones 2024-05-10 22:20:22 UTC
(In reply to Carol Soto from comment #2)

$ git log patches.suse/perf-vendor-events-arm64-Update-N2-and-V2-metrics-and-events-using-Arm-telemetry-repo.patch | cat

commit cf0943cb836f8b865f9a6e7cb63e1d39b1c42079
Author: Tony Jones <tonyj@suse.de>
Date:   Sun Jan 14 23:55:52 2024 -0800

    perf vendor events arm64: Update N2 and V2 metrics and
    events using Arm telemetry repo (perf-v6.7 (jsc#PED-6012
    jsc#PED-6121)).

$ grep Git-commit: patches.suse/perf-vendor-events-arm64-Update-N2-and-V2-metrics-and-events-using-Arm-telemetry-repo.patch
Git-commit: 4473949074c35072f598bd525ae51d5455f05745

The above fix was, wrt SP6, in Snapshot-202402-1 and also in beta4
Comment 4 Carol Soto 2024-05-10 22:30:37 UTC
(In reply to Tony Jones from comment #3)

Thanks. I will look for a system on Monday, give it a try, and let you know.
Carol
Comment 5 Tony Jones 2024-05-11 00:09:18 UTC
(In reply to Tony Jones from comment #3)

> Git-commit: 4473949074c35072f598bd525ae51d5455f05745
> 
> The above fix was, wrt SP6, in Snapshot-202402-1 and also in beta4

WRT the second patch:

What hardware are you using? a24d9d9dc096fc0d0bd85302c9a4fe4fe3b1107b seems to be focused on fixing an issue on Apple (icestorm) ARM64 systems, but in your error output you are referencing a different PMU.
Comment 7 Carol Soto 2024-05-13 16:00:55 UTC
(In reply to Tony Jones from comment #5)
> (In reply to Tony Jones from comment #3)
> 
> > Git-commit: 4473949074c35072f598bd525ae51d5455f05745
> > 
> > The above fix was, wrt SP6, in Snapshot-202402-1 and also in beta4
> 
> WRT the second patch.  
> 
> What hardware are you using as a24d9d9dc096fc0d0bd85302c9a4fe4fe3b1107b
> seems to be focused on fixing an issue on Apple (icestorm) ARM64 systems but
> in your error output you are referencing a different PMU.

I just tried with this SLES 15 SP6 kernel: 6.4.0-150600.9-64kb.
The first patch is included, so it is OK:
sudo perf stat -v -a -M ipc ls /
Using CPUID 0x00000000410fd4f0
metric expr INST_RETIRED / CPU_CYCLES for ipc
found event INST_RETIRED
found event CPU_CYCLES
Parsing metric events '{INST_RETIRED/metric-id=INST_RETIRED/,CPU_CYCLES/metric-id=CPU_CYCLES/}:W'
INST_RETIRED -> armv8_pmuv3_0/metric-id=INST_RETIRED,INST_RETIRED/
CPU_CYCLES -> armv8_pmuv3_0/metric-id=CPU_CYCLES,CPU_CYCLES/
Matched metric-id INST_RETIRED to INST_RETIRED
Matched metric-id CPU_CYCLES to CPU_CYCLES
Control descriptor is not initialized
bin  boot  dev	etc  home  lib	lib64  mnt  opt  proc  root  run  sbin	selinux  srv  sys  tmp	usr  var
INST_RETIRED: 13383342 391883968 391883968
CPU_CYCLES: 31992934 391883968 391883968

 Performance counter stats for 'system wide':

        13,383,342      INST_RETIRED                     #      0.4 per cycle  ipc            
        31,992,934      CPU_CYCLES                                                            

       0.002505120 seconds time elapsed


As for the second patch, maybe it is as you said and we do not need it. I cannot see the issue; this command ran OK:
sudo taskset -c 0 perf stat -e armv8_pmuv3_0/cycles/ -e armv8_pmuv3_0/cycles/ -e cycles ls
bin

 Performance counter stats for 'ls':

         1,778,300      armv8_pmuv3_0/cycles/                                                 
         1,778,292      armv8_pmuv3_0/cycles/                                                 
         1,778,292      cycles                                                                

       0.000854048 seconds time elapsed

       0.000854000 seconds user
       0.000000000 seconds sys

Thanks 
Carol
Comment 8 Tony Jones 2024-05-13 16:23:12 UTC
(In reply to Carol Soto from comment #7)
> (In reply to Tony Jones from comment #5)
> > (In reply to Tony Jones from comment #3)
> > 
> > > Git-commit: 4473949074c35072f598bd525ae51d5455f05745
> > > 
> > > The above fix was, wrt SP6, in Snapshot-202402-1 and also in beta4
> > 
> > WRT the second patch.  
> > 
> > What hardware are you using as a24d9d9dc096fc0d0bd85302c9a4fe4fe3b1107b
> > seems to be focused on fixing an issue on Apple (icestorm) ARM64 systems but
> > in your error output you are referencing a different PMU.
> 
> I just tried with this SLES 15 SP6 kernel 6.4.0-150600.9-64kb. 

The version of the perf userspace tool (rpm -q perf) is more important than the kernel version in terms of whether the fix (the first) is present.

> The second patch maybe is like you said we donot need it. I can not see the
> issue, this command ran ok. 
> sudo taskset -c 0 perf stat -e armv8_pmuv3_0/cycles/ -e
> armv8_pmuv3_0/cycles/ -e cycles ls
> bin

So is there an issue with SP6? Sorry, I cannot tell from your reply.

Do you need 4473949074c3 backported to SP5? Please note that SP5+4473949074c3
is very different from what is in SP6, so for this specific hardware it's unknown whether the above would be sufficient.
Comment 9 Carol Soto 2024-05-13 16:31:36 UTC
(In reply to Tony Jones from comment #8)
> (In reply to Carol Soto from comment #7)
> > (In reply to Tony Jones from comment #5)
> > > (In reply to Tony Jones from comment #3)
> > > 
> > > > Git-commit: 4473949074c35072f598bd525ae51d5455f05745
> > > > 
> > > > The above fix was, wrt SP6, in Snapshot-202402-1 and also in beta4
> > > 
> > > WRT the second patch.  
> > > 
> > > What hardware are you using as a24d9d9dc096fc0d0bd85302c9a4fe4fe3b1107b
> > > seems to be focused on fixing an issue on Apple (icestorm) ARM64 systems but
> > > in your error output you are referencing a different PMU.
> > 
> > I just tried with this SLES 15 SP6 kernel 6.4.0-150600.9-64kb. 
> 
> The version of the perf userspace tool (rpm -q perf) is more important than
> the kernel version in terms of whether the fix (the first) is present.

This is the perf version that I tried with SLES 15 SP6. 
rpm -qa | grep perf
perf-6.4.0.git18573.c37d66c4fd-150600.1.1.aarch64



> 
> > The second patch maybe is like you said we donot need it. I can not see the
> > issue, this command ran ok. 
> > sudo taskset -c 0 perf stat -e armv8_pmuv3_0/cycles/ -e
> > armv8_pmuv3_0/cycles/ -e cycles ls
> > bin
> 
> So is there an issue with SP6.  Sorry from your reply I cannot tell.
> 
> Do you need 4473949074c3 ackporting to SP5?   Please note that
> SP5+4473949074c3
> is very different from what is in SP6 so for this specific hardware it's
> unknown whether the above would be sufficient.

Yes, just backporting 4473949074c3 to SP5.

thanks
Carol
Comment 10 Tony Jones 2024-05-13 16:54:59 UTC
(In reply to Carol Soto from comment #9)

> This is the perf version that I tried with SLES 15 SP6. 
> rpm -qa | grep perf
> perf-6.4.0.git18573.c37d66c4fd-150600.1.1.aarch64

That will be fine.

> Yes just backporting 4473949074c3 backporting to SP5. 

I will prepare a test package for you to try.
Comment 11 Tony Jones 2024-05-13 20:29:04 UTC
(In reply to Carol Soto from comment #9)
> Yes just backporting 4473949074c3 backporting to SP5. 

As I alluded to in comment 8,  there is more to this than just backporting this commit.  There is no json support for arm/neoverse-n2-v2 in SP5.
Comment 12 Carol Soto 2024-05-14 14:57:52 UTC
(In reply to Tony Jones from comment #11)
> (In reply to Carol Soto from comment #9)
> > Yes just backporting 4473949074c3 backporting to SP5. 
> 
> As I alluded to in comment 8,  there is more to this than just backporting
> this commit.  There is no json support for arm/neoverse-n2-v2 in SP5.

Yes, I noticed that when I ran sudo perf stat -v -a -M ipc ls / on SLES 15 SP5: the perf there is missing commits. Would it still be possible to make this command work on SLES 15 SP5?
Thanks
Carol
Comment 13 Tony Jones 2024-05-16 01:27:29 UTC
(In reply to Carol Soto from comment #12)
> Will still possible to make
> this command work with SLES 15 SP5? 

It depends on what changes are needed.   I will try to borrow our matching hardware and see what is involved.   If it is complex you will likely need an ECO.   If it's simple it can be handled by this bugzilla.
Comment 14 Carol Soto 2024-05-16 02:21:26 UTC
(In reply to Tony Jones from comment #13)
> (In reply to Carol Soto from comment #12)
> > Will still possible to make
> > this command work with SLES 15 SP5? 
> 
> It depends on what changes are needed.   I will try to borrow our matching
> hardware and see what is involved.   If it is complex you will likely need
> an ECO.   If it's simple it can be handled by this bugzilla.

Thanks so much for the info. 
Carol
Comment 15 Tony Jones 2024-05-22 00:12:56 UTC
I looked at this some more.

(In reply to Carol Soto from comment #0)
> We would to request to backport these 2 patches:
> 
> perf vendor events arm64: Update N2 and V2 metrics and events using Arm
> telemetry repo
> https://github.com/torvalds/linux/commit/
> 4473949074c35072f598bd525ae51d5455f05745

As I mentioned, this patch isn't sufficient, as SP5 lacks any of the prior support for arm/neoverse-n2-v2. So all of the patches it depends on will also be needed.

In addition we also need all the base metric support in 'arm64/sbsa.json' namely:
a9ff64e5a0421914c6b23e4505d9384b8c745b5a
556fd664d666c0cc9d5b0d52851b0480c51cf59e
ab3744007d51420dd63d5323acbe7abbb843ba63

Also needed is the jevents general metric support:
5b51e47a3f1d7619b424b4b89b5d19569a462b09

This general metric support change uses the revised Python-based JSON generator, whereas SP5 has the previous C-based generator. So this would need to be handled as well.

Bottom line, the scope of this is IMO unsuitable for a bugzilla.

We have an ECO process through which significant changes can be requested and SUSE can evaluate them.
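
To scope a backport like this, one can check which of the upstream commits listed above already have a corresponding patch in a kernel-source checkout. A sketch, assuming patches live under patches.suse/ and carry a "Git-commit:" header as shown in comment 3. missing_commits is a hypothetical helper name, and a commit it reports as present may still need adapting for SP5:

```shell
#!/bin/sh
# Hypothetical helper: print each upstream SHA that no patch file in the
# given directory references in a "Git-commit:" header.
# Usage: missing_commits <patch-dir> <sha>...
missing_commits() {
    dir=$1
    shift
    for sha in "$@"; do
        # -r: search the directory, -q: exit status only, -s: hide errors
        if ! grep -rqs "^Git-commit: $sha" "$dir"; then
            echo "$sha"
        fi
    done
}

# Example, run from the top of a kernel-source checkout:
#   missing_commits patches.suse \
#       a9ff64e5a0421914c6b23e4505d9384b8c745b5a \
#       556fd664d666c0cc9d5b0d52851b0480c51cf59e \
#       ab3744007d51420dd63d5323acbe7abbb843ba63 \
#       5b51e47a3f1d7619b424b4b89b5d19569a462b09
```

The check only looks at patch headers, so it says nothing about whether the json generator differences mentioned above have been handled.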
Comment 16 Carol Soto 2024-05-22 14:53:28 UTC
(In reply to Tony Jones from comment #15)

Hi
Thanks so much for looking into this. The changes are in SP6. We will tell our team that if they really want this on SP5, we will have to go through the ECO process.
Carol
Comment 17 Tony Jones 2024-05-22 15:41:15 UTC
(In reply to Carol Soto from comment #16)

> The changes are in SP6

Correct.  I verified it is working in SP6.   Most of the changes were in v6.3
and SP6 perf is based on v6.7

> communicate to our team if they really want this on SP5 then we will have to
> the ECO process.

Yes, you will need to provide suitable rationale as to why you need this feature and then we can scope the work and decide.

Thanks!
Comment 18 Carol Soto 2024-05-22 16:19:21 UTC
(In reply to Tony Jones from comment #17)

Please feel free to move the bugzilla to the right state. I'm OK if we say it is resolved in SLES 15 SP6.
Thanks
Carol
Comment 19 Tony Jones 2024-05-22 16:42:40 UTC
(In reply to Carol Soto from comment #18)

Closing; this works in SP6. Nvidia will engage in the ECO process if they require this hardware enablement in SP5.