Bug 1196060

Summary: systemtap fails to build simple example from docs
Product: [openSUSE] openSUSE Tumbleweed Reporter: James Fehlig <jfehlig>
Component: KernelAssignee: openSUSE Kernel Bugs <kernel-bugs>
Status: RESOLVED FIXED QA Contact: E-mail List <qa-bugs>
Severity: Normal    
Priority: P2 - High CC: jfehlig, martin.liska, mliska, rguenther, tiwai, tonyj
Version: Current   
Target Milestone: ---   
Hardware: Other   
OS: Other   
Whiteboard:
Found By: --- Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---

Description James Fehlig 2022-02-16 22:43:09 UTC
On a fairly recent Tumbleweed with kernel 5.16.8-1, I'm unable to execute the simple "Initial Testing" script described in the systemtap documentation

https://sourceware.org/systemtap/SystemTap_Beginners_Guide/using-systemtap.html#testing

# stap -v -e 'probe vfs.read {printf("read performed\n"); exit()}'
Pass 1: parsed user script and 479 library scripts using 100620virt/90764res/7076shr/83800data kb, in 120usr/30sys/186real ms.
WARNING: kbuild exited with status: 2
WARNING: kbuild exited with status: 2
...
WARNING: kbuild exited with status: 2
^CWARNING: kbuild exited with signal: 15 (Terminated)

I let the above command run for a few minutes before killing it. The WARNING was emitted ~1/sec. The following kernel packages are installed

kernel-macros-5.16.8-1.1.noarch
kernel-default-5.16.8-1.1.x86_64
kernel-devel-5.16.8-1.1.noarch
kernel-source-5.16.8-1.1.noarch
kernel-default-debuginfo-5.16.8-1.1.x86_64
kernel-default-devel-5.16.8-1.1.x86_64
Comment 1 Tony Jones 2022-02-17 17:03:36 UTC
Please try the version in devel:tools.    It is in staging right now for Tumbleweed.
Comment 2 Tony Jones 2022-02-17 17:06:09 UTC
(In reply to Tony Jones from comment #1)
> Please try the version in devel:tools.    It is in staging right now for
> Tumbleweed.

It is now in openSUSE:Factory.   systemtap version 4.6

Also please always report what version of systemtap you are using stap -V or rpm -q systemtap.
Comment 3 James Fehlig 2022-02-17 22:12:39 UTC
Version 4.6 fixes the issue, thanks! Of course I had grander plans with which I'm still having problems. I'd like to dump the VMSA of a AMD SEV-ES guest with a script along the lines of

# cat sev-es-vmsa.stp
function dump_vmsa(addr:long) {
  printf("VMSA\n")
  for (i = 0; i < 4096 ; i+= 64) {
    printf("%.64M\n", addr + i);
  }
}
probe module("kvm_amd").statement("__sev_launch_update_vmsa@arch/x86/kvm/svm/sev.c:625") {
  dump_vmsa($svm->vmsa)
}

But trying to run this results in

# stap -v sev-es-vmsa.stp
Pass 1: parsed user script and 483 library scripts using 110832virt/101100res/7292shr/93924data kb, in 160usr/20sys/175real ms.
semantic error: resolution failed in DWARF builder

semantic error: while resolving probe point: identifier 'module' at sev-es-vmsa.stp:7:7
        source: probe module("kvm_amd").statement("__sev_launch_update_vmsa@arch/x86/kvm/svm/sev.c:625") {
                      ^
semantic error: no match

Pass 2: analyzed script: 0 probes, 0 functions, 0 embeds, 0 globals using 112928virt/104800res/8712shr/96020data kb, in 770usr/40sys/814real ms.
Pass 2: analysis failed.  [man error::pass2]

Not sure why it doesn't like the 'module' identifier. I tried something a bit simpler with the same result

# stap -v -e 'probe module("ext4").function("*") { }'
Pass 1: parsed user script and 483 library scripts using 110828virt/100900res/7092shr/93920data kb, in 150usr/20sys/173real ms.
semantic error: resolution failed in DWARF builder

semantic error: while resolving probe point: identifier 'module' at <input>:1:7
        source: probe module("ext4").function("*") { }
                      ^
semantic error: no match

Pass 2: analyzed script: 0 probes, 0 functions, 0 embeds, 0 globals using 112924virt/104584res/8504shr/96016data kb, in 780usr/30sys/813real ms.
Pass 2: analysis failed.  [man error::pass2]
Comment 4 James Fehlig 2022-02-17 22:15:09 UTC
(In reply to Tony Jones from comment #2)
> Also please always report what version of systemtap you are using stap -V or
> rpm -q systemtap.

Thanks for the tip. FYI, prior to updating to 4.6

# stap -V
Systemtap translator/driver (version 4.2/0.186, non-git sources)
Copyright (C) 2005-2019 Red Hat, Inc. and others
This is free software; see the source for copying conditions.
tested kernel versions: 2.6.32 ... 5.4-rc6
enabled features: AVAHI BPF LIBSQLITE3 NLS NSS
Comment 5 Tony Jones 2022-02-22 22:08:17 UTC
ok.  definitely a regression here vs SP3.  I'll look at it.

# stap -vv --poison-cache -l 'module("ext4").function("*")'
...
derive-probes (location #0): module("ext4").function("*") of keyword at <input>:1:1
semantic error: resolution failed in DWARF builder
   thrown from: elaborate.cxx:1093
semantic error: resolution failed in DWARF builder
   thrown from: elaborate.cxx:1093
semantic error: while resolving probe point: identifier 'module' at <input>:1:7
   thrown from: elaborate.cxx:1081
        source: probe module("ext4").function("*") {}
                      ^

semantic error: no match
   thrown from: elaborate.cxx:1044
Comment 6 Tony Jones 2022-02-24 03:02:37 UTC
rguenther so looking at elfutils I see: "checking for zstd... no"

is this an issue since kernel-default is packaging ko's with .zst suffix

# rpm -ql kernel-default | grep ext4
/usr/lib/modules/5.16.10-1-default/kernel/fs/ext4
/usr/lib/modules/5.16.10-1-default/kernel/fs/ext4/ext4.ko.zst

This isn't the issue for this bug as systemtap/elfutils should be opening /usr/lib/debug/usr/lib/modules/5.16.10-1-default/kernel/fs/ext4/ext4.ko.debug from kernel-default-debuginfo

However 'strace stap --poison-cache -l 'module("ext4").function("write_mmp_block")'

shows it's not accessing this debug file at all.

build-id is consistent which has been a cause of these issues before

# eu-readelf -a /usr/lib/debug/usr/lib/modules/5.16.10-1-default/kernel/fs/ext4/ext4.ko.debug | grep -A 1 GNU_BUILD_ID
  GNU                   20  GNU_BUILD_ID
    Build ID: 771c20d08943980047dc325b9101eb04164d1956

# eu-readelf -a /lib/modules/5.16.10-1-default/kernel/fs/ext4/ext4.ko.zst | grep -A 1 GNU_BUILD_ID
eu-readelf: failed reading '/lib/modules/5.16.10-1-default/kernel/fs/ext4/ext4.ko.zst': not a valid ELF file

# eu-readelf -a /lib/modules/5.16.10-1-default/kernel/fs/ext4/ext4.ko | grep -A 1 GNU_BUILD_ID
  GNU                   20  GNU_BUILD_ID
    Build ID: 771c20d08943980047dc325b9101eb04164d1956


Hmmn.  I shall keep investigating.

This is an issue on Factory and SP4.
Comment 7 Tony Jones 2022-02-24 03:09:20 UTC
# strace -o /tmp/log stap --poison-cache -l 'module("ext4").function("write_mmp_block")'

# grep ext4 /tmp/log
execve("/bin/stap", ["stap", "--poison-cache", "-l", "module(\"ext4\").function(\"write_m"...], 0x7ffc2a83d928 /* 62 vars */) = 0
newfstatat(AT_FDCWD, "/lib/modules/5.16.10-1-default/kernel/fs/ext4", {st_mode=S_IFDIR|0755, st_size=4096, ...}, 0) = 0
openat(AT_FDCWD, "/lib/modules/5.16.10-1-default/kernel/fs/ext4", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
newfstatat(AT_FDCWD, "/lib/modules/5.16.10-1-default/kernel/fs/ext4/ext4.ko", {st_mode=S_IFREG|0644, st_size=1972179, ...}, 0) = 0
newfstatat(AT_FDCWD, "/lib/modules/5.16.10-1-default/kernel/fs/ext4/ext4.ko.zst", {st_mode=S_IFREG|0644, st_size=476572, ...}, 0) = 0


[I manually uncompressed ext4.ko.zst -> ext4.ko]
Comment 8 Richard Biener 2022-02-24 10:26:51 UTC
(In reply to Tony Jones from comment #5)
> ok.  definitely a regression here vs SP3.  I'll look at it.
> 
> # stap -vv --poison-cache -l 'module("ext4").function("*")'
> ...
> derive-probes (location #0): module("ext4").function("*") of keyword at
> <input>:1:1
> semantic error: resolution failed in DWARF builder
>    thrown from: elaborate.cxx:1093
> semantic error: resolution failed in DWARF builder
>    thrown from: elaborate.cxx:1093
> semantic error: while resolving probe point: identifier 'module' at
> <input>:1:7
>    thrown from: elaborate.cxx:1081
>         source: probe module("ext4").function("*") {}
>                       ^
> 
> semantic error: no match
>    thrown from: elaborate.cxx:1044

Is this with the same kernel/module .ko or comparing results with different .kos (from SP3 and Tumbleweed)?

There might be changes in the debug info that systemtap does not recognize,
I'm not sure if the above diagnostics are a sign of that - somebody more
familiar with systemtap should see what

'semantic error: resolution failed in DWARF builder'

actually means.  It could be it just fails to lookup a symbol, but maybe
it's more.

There's both the compiler and eventually dwz coming into play - to rule out
the latter can you try disabling dwz?  (I think you can do #!BuildIgnore: dwz when building the kernel and modules)

You can also try if it's some new DWARF5 that confuses sytemtap by
forcing GCC to use DWARF4 - you somehow need to inject -gdwarf-4 into the
kernel/modules build to test that.

Eventually it's also elfutils (CCing maintainer) which systemtap seems to
use for DWARF parsing (libdw, that is).
Comment 9 Tony Jones 2022-02-24 15:25:03 UTC
(In reply to Richard Biener from comment #8)
> (In reply to Tony Jones from comment #5)
> > ok.  definitely a regression here vs SP3.  I'll look at it.
> > 
> > # stap -vv --poison-cache -l 'module("ext4").function("*")'
> > ...
> > derive-probes (location #0): module("ext4").function("*") of keyword at
> > <input>:1:1
> > semantic error: resolution failed in DWARF builder
> >    thrown from: elaborate.cxx:1093
> > semantic error: resolution failed in DWARF builder
> >    thrown from: elaborate.cxx:1093
> > semantic error: while resolving probe point: identifier 'module' at
> > <input>:1:7
> >    thrown from: elaborate.cxx:1081
> >         source: probe module("ext4").function("*") {}
> >                       ^
> > 
> > semantic error: no match
> >    thrown from: elaborate.cxx:1044
> 
> Is this with the same kernel/module .ko or comparing results with different
> .kos (from SP3 and Tumbleweed)?

Same OS in all cases.  Also SP3 works fine.

> Eventually it's also elfutils (CCing maintainer) which systemtap seems to
> use for DWARF parsing (libdw, that is).

Sorry I thought you were the libdw maintainer.

The problem appears to be debugging a kernel module ("module.function") in systemtap syntax when the main .ko for the module has been stripped and the debugging info is in a separate debuginfo.

- systemtap is not having an issue debugging the primary kernel (aka "kernel.function works fine").

- systemtap is not having an issue debugging modules for a hand built kernel (here "module.function" works).

So I don't think it's a DWARF5 vs DWARF4 issue.  I don't think dwz is an issue (though I am curious why there is no support for it in elfutils). 

As I said in previous comment, the debuginfo version of the ext4.ko is where the debug info resides yet is not being referenced at all (per strace log).

I'll look more at systemtap today.  The ability to debug what is going on at this level isn't great.
Comment 10 Tony Jones 2022-02-25 17:25:11 UTC
I opened https://bugzilla.suse.com/show_bug.cgi?id=1196510 against elfutils to add zst support.  

That isn't the issue but needs fixing IMO.

Still debugging systemtap (setup_dwfl_report_kernel_p appears to not be working correctly but unclear if it's systemtap or elfutils).
Comment 11 Tony Jones 2022-02-25 18:59:05 UTC
(In reply to Tony Jones from comment #10)
> I opened https://bugzilla.suse.com/show_bug.cgi?id=1196510 against elfutils
> to add zst support.  
> 
> That isn't the issue but needs fixing IMO.
> 
> Still debugging systemtap (setup_dwfl_report_kernel_p appears to not be
> working correctly but unclear if it's systemtap or elfutils).

ok, the issue is due to elfutils.

systemtap is building offline_search_names based off modules.dep,  which contains the path with the zst suffix,  but elfutils can't process this when asked.

I was manually uncompressing the zst but since stap was still not finding it I assumed there was some other bug but there isn't as stap is basing it's search on what it finds in the modules.dep,  aka still trying to process the zst version.

If I hand edit the modules.dep removing the zst extension systemtap now works.

@mliska, can we please fix the above elfutils bug please,  in factory and SP4
Comment 12 Tony Jones 2022-02-25 22:55:23 UTC
# using updated libdw (now in Base:System, hopefully factory soon; also pushed for SP4:GA)

stap -l 'module("ext4").function("*")'  | head -10
module("ext4").function("ClearPageChecked@../include/linux/page-flags.h:433")
module("ext4").function("ClearPageDirty@../include/linux/page-flags.h:423")
module("ext4").function("ClearPageError@../include/linux/page-flags.h:419")
module("ext4").function("ClearPageUptodate@../include/linux/page-flags.h:667")
module("ext4").function("ERR_PTR@../include/linux/err.h:24")
module("ext4").function("EXT4_SB@../fs/ext4/ext4.h:1757")
module("ext4").function("INIT_LIST_HEAD@../include/linux/list.h:35")
module("ext4").function("IS_ERR@../include/linux/err.h:34")
module("ext4").function("IS_ERR_OR_NULL@../include/linux/err.h:39")
module("ext4").function("PTR_ERR@../include/linux/err.h:29")
Comment 13 Martin Liška 2022-02-27 09:30:46 UTC
> @mliska, can we please fix the above elfutils bug please,  in factory and SP4

Thanks for the submit request you did and the investigation!
Can we close this as it's fixed in devel project now?
Comment 14 Tony Jones 2022-02-28 15:09:49 UTC
(In reply to Martin Liška from comment #13)
> > @mliska, can we please fix the above elfutils bug please,  in factory and SP4
> 
> Thanks for the submit request you did and the investigation!
> Can we close this as it's fixed in devel project now?

I was hoping to see it accepted in SP4.
Comment 15 James Fehlig 2022-02-28 15:18:35 UTC
Although I still have problems with my own script, things seem to be working fine with the updated systemtap and libdw.
Comment 16 James Fehlig 2022-02-28 15:24:10 UTC
(In reply to Tony Jones from comment #14)
> I was hoping to see it accepted in SP4.

Yes, would definitely be nice to have a working systemtap in SP4. I'll bump the priority.
Comment 17 Tony Jones 2022-03-02 19:54:42 UTC
fix for elfutils accepted to SP4,  closing.