Bug 1190434 - crash is unable to find debuginfo of usrmerged kernel
Summary: crash is unable to find debuginfo of usrmerged kernel
Status: CONFIRMED
Alias: None
Product: openSUSE Tumbleweed
Classification: openSUSE
Component: Kernel
Version: Current
Hardware: Other Other
Importance: P3 - Medium : Normal
Target Milestone: ---
Assignee: David Mair
QA Contact: E-mail List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: 1029961
Reported: 2021-09-13 08:51 UTC by Fabian Vogt
Modified: 2024-04-05 16:06 UTC

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
dmair: needinfo?


Description Fabian Vogt 2021-09-13 08:51:44 UTC
With a usrmerged kernel, vmlinux(.xz) is in the modules directory for the kernel:

/usr/lib/modules/5.14.1-1-default/vmlinux.xz

This means the debug info ends up here:

/usr/lib/debug/usr/lib/modules/5.14.1-1-default/vmlinux.debug

Crash only looks at the old location though, and fails:

please wait... (uncompressing /usr/lib/modules/5.14.1-1-default/vmlinux.xz)
                                                                            
readmem: read_diskdump() 
NOTE: gnu_debuglink file: vmlinux.debug
crc32: f73ed3b0
/usr/lib/modules/5.14.1-1-default//vmlinux.debug: not readable/found
/usr/lib/modules/5.14.1-1-default//.debug/vmlinux.debug: not readable/found
/usr/lib/debug/boot/vmlinux.debug: not readable/found
crash: /var/tmp/vmlinux.xz_DZPxr5: no debugging data available
crash: vmlinux.debug: debuginfo file not found

crash: either install the appropriate kernel debuginfo package, or
       copy vmlinux.debug to this machine
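
For illustration, the missing lookup can be sketched as mirroring the kernel's directory under the debug root. This is a minimal sketch, not crash's actual search code: mirrored_debug_path() is a hypothetical helper, and the debug root is passed in as a parameter rather than taken from crash.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of crash): build the debuginfo path that
 * mirrors the kernel's directory under a debug root.
 * Returns 0 on success, -1 if kernel_path is not absolute or the result
 * does not fit into out.
 */
static int mirrored_debug_path(const char *debug_root, const char *kernel_path,
                               const char *debuglink, char *out, size_t outlen)
{
    const char *slash = strrchr(kernel_path, '/');
    int dirlen, n;

    if (kernel_path[0] != '/' || slash == NULL)
        return -1;

    /* Directory part of the kernel path, without the trailing '/'. */
    dirlen = (int)(slash - kernel_path);

    n = snprintf(out, outlen, "%s%.*s/%s",
                 debug_root, dirlen, kernel_path, debuglink);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

Called with the debug root /usr/lib/debug, the kernel path /usr/lib/modules/5.14.1-1-default/vmlinux.xz and the debuglink name vmlinux.debug, this produces the location quoted above where the debug info ends up.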
Comment 1 Ludwig Nussel 2021-10-06 15:23:19 UTC
https://build.opensuse.org/request/show/923547

The search path is quite messy. There seem to be at least two code paths, and there are references to /usr/src/redhat, for example. Reassigning to the maintainer to decide whether the aforementioned fix is good enough.
Comment 2 Fabian Vogt 2021-12-16 09:48:09 UTC
Ping - any news here? The suggested SR was declined.
Comment 3 Fabian Vogt 2022-02-25 12:01:27 UTC
Ping again.
Comment 4 Fabian Vogt 2022-04-01 11:37:25 UTC
Ping.
Comment 5 Ludwig Nussel 2022-04-04 08:41:58 UTC
It looks like I rebased the patch and made it conditional, but apparently never submitted it again. Not sure what happened. Anyway: https://build.opensuse.org/request/show/966765
Comment 6 openQA Review 2022-04-19 00:03:09 UTC
This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: jeos-extra@64bit_virtio-2G
https://openqa.opensuse.org/tests/2303420#step/kdump_and_crash/1

To prevent further reminder comments one of the following options should be followed:
1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
3. The bugref in the openQA scenario is removed or replaced, e.g. `label:wontfix:boo1234`

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.
Comment 7 openQA Review 2022-06-02 02:24:34 UTC
This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: jeos-extra@64bit_virtio-2G
https://openqa.opensuse.org/tests/2398376#step/kdump_and_crash/1

To prevent further reminder comments one of the following options should be followed:
1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
3. The bugref in the openQA scenario is removed or replaced, e.g. `label:wontfix:boo1234`

Expect the next reminder at the earliest in 56 days if nothing changes in this ticket.
Comment 8 Petr Cervinka 2022-08-24 12:29:52 UTC
I believe that it is still failing, any idea about SR https://build.opensuse.org/request/show/966765 ?
Comment 9 OBSbugzilla Bot 2022-09-02 13:20:03 UTC
This is an autogenerated message for OBS integration:
This bug (1190434) was mentioned in
https://build.opensuse.org/request/show/1000890 Factory / crash
Comment 10 Dominique Leuenberger 2022-11-07 16:45:32 UTC
The UsrMerge patch has been included in Tumbleweed as part of
  https://build.opensuse.org/request/show/1032553#tab-pane-changes
in snapshot 20221027

This has changed the error now, as can be seen in openQA:
  https://openqa.opensuse.org/tests/2857536#step/kdump_and_crash/60

> Dwarf error: wrong version in compilation unit header (is 5, should be 2, 3, or 4)
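
As a rough illustration of what that message checks: the version is a 2-byte field that follows the unit length at the start of each compilation unit header. The sketch below assumes little-endian data and is not code from crash or gdb; cu_header_version() and version_supported_2_to_4() are hypothetical names, and the second function only models a reader limited to versions 2 to 4, as the message implies.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Read the DWARF version from the start of a compilation unit header.
 * Assumed layout: little-endian; unit_length is 4 bytes (32-bit DWARF) or
 * 0xffffffff followed by an 8-byte length (64-bit DWARF); the 2-byte
 * version follows. Returns -1 if the buffer is too short.
 */
static int cu_header_version(const uint8_t *p, size_t len)
{
    size_t off = 4;

    if (len < 4)
        return -1;
    if (p[0] == 0xff && p[1] == 0xff && p[2] == 0xff && p[3] == 0xff)
        off = 12;   /* 64-bit DWARF: escape value plus 8-byte length */
    if (len < off + 2)
        return -1;
    return p[off] | (p[off + 1] << 8);
}

/* Models a reader that only accepts DWARF versions 2, 3 and 4. */
static int version_supported_2_to_4(int version)
{
    return version >= 2 && version <= 4;
}
```

A header carrying version 5 is read correctly by the first function and rejected by the second, which matches the "is 5, should be 2, 3, or 4" wording.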
Comment 17 David Mair 2022-12-01 19:47:10 UTC
I reproduced the following with Tumbleweed's current crash version 7.3.1 and kernel 6.0.8-1:

> Dwarf error: wrong version in compilation unit header (is 5, should be 2, 3, or 4)

I'm working on a crash 8.0.2 upgrade and as I expected, that error is resolved with crash 8.0.2 and kernel 6.0.8-1 vmcore. However, I do have a subsequent problem where crash can't read the linux banner from core memory leading to a claimed mismatched vmcore and vmlinux even though they have filenames with the same version number.

Still working on it...
Comment 18 David Mair 2022-12-02 17:38:36 UTC
(In reply to David Mair from comment #17)
> I reproduced the following with Tumbleweed's current crash version 7.3.1 and
> kernel 6.0.8-1:
> 
> > Dwarf error: wrong version in compilation unit header (is 5, should be 2, 3, or 4)
> 
> I'm working on a crash 8.0.2 upgrade and as I expected, that error is
> resolved with crash 8.0.2 and kernel 6.0.8-1 vmcore. However, I do have a
> subsequent problem where crash can't read the linux banner from core memory
> leading to a claimed mismatched vmcore and vmlinux even though they have
> filenames with the same version number.
> 

The good news is that new problem of mine is known to upstream crash and has an accepted patch. I'm trying it today.
Comment 19 David Mair 2022-12-03 19:20:07 UTC
(In reply to David Mair from comment #18)
> (In reply to David Mair from comment #17)
> > I reproduced the following with Tumbleweed's current crash version 7.3.1 and
> > kernel 6.0.8-1:
> > 
> > > Dwarf error: wrong version in compilation unit header (is 5, should be 2, 3, or 4)
> > 
> > I'm working on a crash 8.0.2 upgrade and as I expected, that error is
> > resolved with crash 8.0.2 and kernel 6.0.8-1 vmcore. However, I do have a
> > subsequent problem where crash can't read the linux banner from core memory
> > leading to a claimed mismatched vmcore and vmlinux even though they have
> > filenames with the same version number.
> > 
> 
> The good news is that new problem of mine is known to upstream crash and has
> an accepted patch. I'm trying it today.

Well, I'm afraid not...

I have built a version of crash, 8.0.2, that appears to support DWARF 5. However, it fails when verifying that the kernel and the coredump are the same kernel (6.0.x) version. The upstream fix for the reported error message ("linux_banner is an invalid address") is both:

1) Already present in the source of crash I'm building; and
2) A change to support all representations of a symbol in the initialized data section of the kernel binary (the patch added 'd' location type support alongside the existing 'D' location type). BUT, when I dump the symbols from the kernel binary, linux_banner is in the 'D' initialized data section anyway, so the patch wasn't needed for the Tumbleweed kernel (the difference depends on the compiler used).

It should be noted that when creating a coredump on Tumbleweed (kernel 6.0.8 and 6.0.10), makedumpfile reports that "this kernel isn't supported by makedumpfile". My end result in crash is that a valid symbol in the kernel binary, linux_banner, has a kernel address in the coredump (accessed via KVADDR() in crash) that is considered invalid, which prevents verifying that the Linux version in the binary and in the coredump are the same. On that basis I offer the opinion that makedumpfile is not up to date for kernel 6, as observed when using a crash version from a home project that is up to date for kernel 6.

Unless someone corrects my reading of the current error I observe in crash 8.0.2, I believe that at least makedumpfile needs to be fixed as well as crash to wholly resolve the reported problem. Input welcome...
Comment 20 David Mair 2022-12-07 19:25:41 UTC
I have a home project version of crash that appears to decode the DWARF 5 debug format. However, I don't believe that is enough to resolve the problem with coredumps on Tumbleweed.

When I trigger a coredump with kernel 6.0.10-1-default, the following is displayed twice during creation of the coredump, after the kdump kernel begins to start:

> The kernel version is unsupported
> The makedumpfile operation may be incomplete

Then, when I attempt to use the created coredump file the apparent memory for the symbol linux_banner appears not to be present in the coredump. The result being that crash will not start because it can't verify the kernel binary and dumped kernel are the same version.

As a matter of care I repeated the test multiple times, using all the kernel dump format options that create a file. In every case crash reports the coredump has missing memory and fails to start.

Given that coredump creation reported that the kernel version is "unsupported" and that the makedumpfile operation may be "incomplete", and that crash appears to find memory missing from the coredump, I think we need to consider why the attempt to create the coredump behaves the way it does before further attempts to fix crash.
Comment 21 David Mair 2022-12-07 19:43:12 UTC
On my test system kexec is the latest upstream release and makedumpfile is one release behind the latest upstream (each as present in Tumbleweed, patched to date). The test system is a kvm/qemu x86_64 VM, though I doubt that would cause coredump creation to report that the kernel version itself is unsupported. Instruction is welcome.
Comment 24 Fabian Vogt 2023-05-03 08:46:33 UTC
The current error message is:

WARNING: invalid linux_banner pointer: 65762078756e694c
crash: /var/tmp/vmlinux.xz_CSsIpp and /var/crash/2023-05-02-00:13/vmcore do not match!

I did some debugging and found that upstream crash works just fine, so this is caused by one of the downstream modifications. I bisected the patches and ended up with crash-debuginfo-compressed.patch being the culprit.

The cause is this part:

--- crash-7.2.7.orig/symbols.c
+++ crash-7.2.7/symbols.c
@@ -203,9 +203,9 @@ symtab_init(void)
 	 *  Pull a bait-and-switch on st->bfd if we've got a separate
          *  .gnu_debuglink file that matches the CRC. Not done for kerntypes.
 	 */
-	if (!(LKCD_KERNTYPES()) &&
-	    !(bfd_get_file_flags(st->bfd) & HAS_SYMS)) {
-		if (!check_gnu_debuglink(st->bfd))
+	if (!(LKCD_KERNTYPES())) {
+		if (!check_gnu_debuglink(st->bfd) &&
+		    !(bfd_get_file_flags(st->bfd) & HAS_SYMS))
 			no_debugging_data(FATAL);
 	}
 	
Previously it only ran check_gnu_debuglink if the loaded BFD had no symbols. check_gnu_debuglink replaces (!) st->bfd with the file that gnu_debuglink points to.

With the patch applied it unconditionally runs check_gnu_debuglink even if the loaded BFD has symbols. This is invalid!

The file pointed to by gnu_debuglink does not have text or data included, only debug info. It is not a full replacement for the previously loaded vmlinux.

Without data, vmlinux.debug has different types for symbols, like using "B" for linux_banner because there just isn't a .data section in that file. This confuses the code, resulting in the misleading error message.

Crash without that patch works just fine, because gdb itself handles gnu_debuglink properly already.
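
To make the difference between the two forms concrete, here is a small model of when check_gnu_debuglink() gets called. It is only a sketch under the reading above: the two function names are made up for illustration, and the booleans stand in for LKCD_KERNTYPES() and for the HAS_SYMS flag test on the loaded BFD.

```c
#include <stdbool.h>

/*
 * Original form: both tests sit in the outer if, so the debuglink lookup
 * runs only when the loaded file has no symbols of its own.
 */
static bool original_runs_debuglink(bool lkcd_kerntypes, bool has_syms)
{
    return !lkcd_kerntypes && !has_syms;
}

/*
 * Patched form: check_gnu_debuglink() is the first operand of the inner
 * &&, so it runs whenever kerntypes is off, regardless of HAS_SYMS.
 */
static bool patched_runs_debuglink(bool lkcd_kerntypes, bool has_syms)
{
    (void)has_syms;   /* only consulted after the call has already happened */
    return !lkcd_kerntypes;
}
```

The two forms disagree only for a vmlinux that already has symbols (has_syms true, kerntypes off), which is the case that breaks here.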
Comment 25 Fabian Vogt 2023-06-28 11:41:39 UTC
Still broken: https://openqa.opensuse.org/tests/3388914#step/kdump_and_crash/60

Any news?
Comment 26 David Mair 2023-06-28 16:34:32 UTC
(In reply to Fabian Vogt from comment #25)
> Still broken:
> https://openqa.opensuse.org/tests/3388914#step/kdump_and_crash/60
> 
> Any news?

I'm working on that for reasons other than this bug:

> WARNING: invalid linux_banner pointer: 65762078756e694c
> crash: /var/tmp/vmlinux.xz_kZfkHM and /var/crash/2023-06-28-05-53/vmcore do
> not match!

The second item can be ignored: whether the vmlinux and vmcore match is actually unknown, because the real problem is the reportedly invalid linux_banner, which is demonstrably valid.

The linux_banner is a string variable beginning with the text "Linux version". If you read the hex value 65762078756e694c byte by byte in memory (little-endian) order, it is actually the ASCII text "Linux ve", the first 8 bytes of the banner text.
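
The byte-order observation can be checked with a few lines of C. This is just a standalone sketch; u64_to_le_text() is a hypothetical helper and not something from crash.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Write the 8 bytes of value in little-endian (x86_64 memory) order into
 * out, followed by a NUL terminator. out must hold at least 9 bytes.
 */
static void u64_to_le_text(uint64_t value, char out[9])
{
    for (size_t i = 0; i < 8; i++)
        out[i] = (char)((value >> (8 * i)) & 0xff);
    out[8] = '\0';
}
```

Passing 0x65762078756e694c gives the bytes 4c 69 6e 75 78 20 76 65, which spell "Linux ve".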

It happens in initializing crash from the coredump when it attempts to do the following:

* Get the address in the coredump of the linux_banner variable
* De-reference the linux_banner variable to get the address of the banner text
* Compare the banner text with "Linux version"

i.e. it expects to perform two memory de-references, the location of the variable and de-reference the variable to get the address of the banner text. However, after the first de-reference the data in linux_banner as a named object in the coredump is already the banner text.

The problem is that the read from the coredump of the linux_banner variable reads from the linux_banner text, not from the linux_banner variable so the second de-reference is of an invalid kernel address because it is ASCII text.

I have hacked a workaround for it, but with the coredump I have, the result is still unusable for other reasons (excluded pages), so I need to investigate with another coredump (of which I have several).

crash can be started in this scenario with limited functionality using:

> crash --minimal <coredump> <kernel> <...>

However, the limited set of functionality --minimal provides probably makes it worthless, e.g. memory read and disassemble are usable but gdb commands like bt and switch process/CPU are not usable.
Comment 27 David Mair 2023-06-28 17:08:25 UTC
The following error:

> WARNING: invalid linux_banner pointer: 65762078756e694c
> crash: /var/tmp/vmlinux.xz_CSsIpp and /var/crash/2023-05-02-00:13/vmcore
> do not match!

Occurs due to this:

>     if (!(sp = symbol_search("linux_banner")))
>         error(FATAL, "linux_banner symbol does not exist?\n");
>     else
>         if ((sp->type == 'R') || (sp->type == 'r') ||
>                 (THIS_KERNEL_VERSION >= LINUX(2,6,11) &&
>                 (sp->type == 'D' || sp->type == 'd')) ||
>                 (machine_type("ARM") && sp->type == 'T') ||
>                 (machine_type("ARM64")))
>             linux_banner = symbol_value("linux_banner");
>         else
>             get_symbol_data("linux_banner", sizeof(ulong),

I've re-arranged that slightly to separate the two conditionals; in crash it appears as a single if / else if / else chain but should operate as above. It is followed by:

>     if (!IS_KVADDR(linux_banner))
>         error(WARNING, "invalid linux_banner pointer: %lx\n",
>             linux_banner);
> 
>     if (!accessible(linux_banner))
>         goto bad_match;

The test for a valid banner comes after this, but we take the goto bad_match above because 0x65762078756e694c is ASCII text and does not pass IS_KVADDR() (causing the WARNING "invalid linux_banner pointer"). It is not an accessible memory address, so we goto bad_match even though we are looking at the value of a correct linux_banner.

You can see the two de-references I described in the code above. symbol_search() gets a structure with a value member that is the address in the coredump of the "linux_banner" symbol. We load that successfully from the coredump and take the else path in the first conditional. The "type" member of the structure has the value 'R' (character) in the coredump I'm looking at, so we take the true path of the second conditional in the top piece of code and should use the symbol_value() function to set linux_banner from the value member of an instance of the same structure as sp. That value is NOT 0x65762078756e694c, and if I hack a patch for the above code to force:

> linux_banner = sp->value

rather than use symbol_value("linux_banner") then I DO NOT get the failure at the top of this message...though with the coredump I have I still can't load it due to pages excluded from it for per-CPU tasks.
Comment 28 David Mair 2023-06-28 17:27:34 UTC
In the source examples below I re-post some from comment #27 where I correct a line I ended with a comma but the source ends with a semi-colon.

To show why I am bogged down by this, here's the source for symbol_value():

> ulong
> symbol_value(char *symbol)
> {
>         struct syment *sp;
> 
>         if (!(sp = symbol_search(symbol)))
>                 error(FATAL, "cannot resolve \"%s\"\n", symbol);
> 
>         return(sp->value);
> }

This sets its own struct syment sp in an identical way to the following source sample from comment #27, except that the use of symbol_value() in the following sets linux_banner to the ASCII text 0x65762078756e694c, which is the value linux_banner holds a pointer to:

>     if (!(sp = symbol_search("linux_banner")))
>         error(FATAL, "linux_banner symbol does not exist?\n");
>     else
>         if ((sp->type == 'R') || (sp->type == 'r') ||
>                 (THIS_KERNEL_VERSION >= LINUX(2,6,11) &&
>                 (sp->type == 'D' || sp->type == 'd')) ||
>                 (machine_type("ARM") && sp->type == 'T') ||
>                 (machine_type("ARM64")))
>             linux_banner = symbol_value("linux_banner");
>         else
>             get_symbol_data("linux_banner", sizeof(ulong), &linux_banner);

Yet, if I insert the single line linux_banner = sp->value instead of that symbol_value call and use the sp structure pointer we already have the error doesn't happen:

>     if (!(sp = symbol_search("linux_banner")))
>         error(FATAL, "linux_banner symbol does not exist?\n");
>     else
>         if ((sp->type == 'R') || (sp->type == 'r') ||
>                 (THIS_KERNEL_VERSION >= LINUX(2,6,11) &&
>                 (sp->type == 'D' || sp->type == 'd')) ||
>                 (machine_type("ARM") && sp->type == 'T') ||
>                 (machine_type("ARM64")))
>             linux_banner = sp->value;
>         else
>             get_symbol_data("linux_banner", sizeof(ulong), &linux_banner);

So far, that makes no sense!

I believe calling get_symbol_data() would do it but if we took the false path in the nested conditional the change I show in the true path would never be executed.

Hence I am deep debugging at present and it's quite nasty for all the reasons I've described...
Comment 29 David Mair 2023-06-28 20:45:11 UTC
For the record, the current discussion is a different bug from the original report. However...

After debugging in gdb, the following error:

> WARNING: invalid linux_banner pointer: 65762078756e694c
> crash: /var/tmp/vmlinux.xz_CSsIpp and /var/crash/2023-05-02-00:13/vmcore
> do not match!

Occurs in this:

>     if (!(sp = symbol_search("linux_banner")))
>         error(FATAL, "linux_banner symbol does not exist?\n");
>     else
>         if ((sp->type == 'R') || (sp->type == 'r') ||
>                 (THIS_KERNEL_VERSION >= LINUX(2,6,11) &&
>                 (sp->type == 'D' || sp->type == 'd')) ||
>                 (machine_type("ARM") && sp->type == 'T') ||
>                 (machine_type("ARM64")))
>             linux_banner = symbol_value("linux_banner");
>         else
>             get_symbol_data("linux_banner", sizeof(ulong), &linux_banner);

The reason, in the case of the coredump I'm looking at, is that sp->type for linux_banner is 'B', so we take the final else branch and call get_symbol_data() when we should set linux_banner = symbol_value() instead. I can test my case, but I won't commit it until the others I know of who are experiencing this report that a patch testing for sp->type == 'B' works for their versions. I don't have a patched crash 7 for now, but I have a teammate who should be able to check the result using a crash 8.0.3 build I just made with a fix.
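
The type test quoted above can be modelled as a plain predicate, which makes it easy to see which branch a given symbol type takes. This is a sketch only: banner_value_is_address() is a hypothetical name, and its boolean parameters stand in for the THIS_KERNEL_VERSION >= LINUX(2,6,11) comparison and the machine_type() checks.

```c
#include <stdbool.h>

/*
 * Model of the quoted conditional. true means "use the symbol's value
 * directly as the banner address" (the symbol_value() branch); false means
 * "read a pointer from that address" (the get_symbol_data() branch).
 */
static bool banner_value_is_address(char type, bool kernel_ge_2_6_11,
                                    bool is_arm, bool is_arm64)
{
    return type == 'R' || type == 'r' ||
           (kernel_ge_2_6_11 && (type == 'D' || type == 'd')) ||
           (is_arm && type == 'T') ||
           is_arm64;
}
```

On x86_64 with a current kernel, types 'R' and 'D' take the direct branch while 'B' does not, so a 'B' symbol ends up on the pointer-read path, where the first 8 bytes of the banner text get treated as an address.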
Comment 30 Fabian Vogt 2023-06-29 06:22:07 UTC
(In reply to David Mair from comment #27)
> The following error:
> 
> > WARNING: invalid linux_banner pointer: 65762078756e694c
> > crash: /var/tmp/vmlinux.xz_CSsIpp and /var/crash/2023-05-02-00:13/vmcore
> > do not match!
> 
> Occurs due to this:

I already analyzed that in comment 24. The cause is that it tries to use the wrong BFD for reading the symbol, so instead of the address it gets the value:

> With the patch applied it unconditionally runs check_gnu_debuglink even if the loaded BFD has symbols. This is invalid!
>
> The file pointed to by gnu_debuglink does not have text or data included, only debug info. It is not a full replacement for the previously loaded vmlinux.
Comment 31 David Mair 2023-06-29 16:29:33 UTC
(In reply to Fabian Vogt from comment #30)
> (In reply to David Mair from comment #27)
> > The following error:
> > 
> > > WARNING: invalid linux_banner pointer: 65762078756e694c
> > > crash: /var/tmp/vmlinux.xz_CSsIpp and /var/crash/2023-05-02-00:13/vmcore
> > > do not match!
> > 
> > Occurs due to this:
> 
> I already analyzed that in comment 24. The cause is that it tries
> to use the wrong BFD for reading the symbol, so instead of the address it
> gets the value:
> 
> > With the patch applied it unconditionally runs check_gnu_debuglink even if the loaded BFD has symbols. This is invalid!
> >
> > The file pointed to by gnu_debuglink does not have text or data included, only debug info. It is not a full replacement for the previously loaded vmlinux.

Fair point. I'll reconsider my finding that the type of linux_banner is 'B' when running crash in gdb.
Comment 32 David Mair 2023-07-05 16:55:09 UTC
(In reply to Fabian Vogt from comment #24)
> The current error message is:
> 
> WARNING: invalid linux_banner pointer: 65762078756e694c
> crash: /var/tmp/vmlinux.xz_CSsIpp and /var/crash/2023-05-02-00:13/vmcore do
> not match!
> 
> I did some debugging and found that upstream crash works just fine, so this
> is caused by one of the downstream modifications. I bisected the patches and
> ended up with crash-debuginfo-compressed.patch being the culprit.
> 
> The cause is this part:
> 
> --- crash-7.2.7.orig/symbols.c
> +++ crash-7.2.7/symbols.c
> @@ -203,9 +203,9 @@ symtab_init(void)
>  	 *  Pull a bait-and-switch on st->bfd if we've got a separate
>           *  .gnu_debuglink file that matches the CRC. Not done for
> kerntypes.
>  	 */
> -	if (!(LKCD_KERNTYPES()) &&
> -	    !(bfd_get_file_flags(st->bfd) & HAS_SYMS)) {
> -		if (!check_gnu_debuglink(st->bfd))
> +	if (!(LKCD_KERNTYPES())) {
> +		if (!check_gnu_debuglink(st->bfd) &&
> +		    !(bfd_get_file_flags(st->bfd) & HAS_SYMS))
>  			no_debugging_data(FATAL);
>  	}
...

Okay, sorry, I've been off for a few days plus a public holiday. I just compared our sources with upstream and upstream doesn't use a patch like that and the current upstream source has the same block of symtab_init() in the format that is removed (the - lines of the patch). It concerns me a little that:

a) The above patch had an intended purpose, presumably cases of compressed debuginfo; and
b) To my plain view, the nesting versus combined conditionals in the patch (the + lines) looks wrong for even something approximating the intended purpose

I'll test my own coredump using a crash without the above patch, if it works I'll post a link to my OBS workspace and if I get confirmed results I'll submit a removal of the above patch for the report case of Tumbleweed and then work on applying it to other versions. I have to combine it with other Tumbleweed crash submits but I should have a test build today and the whole business should take a few days depending on test response times.
Comment 33 David Mair 2023-07-05 16:59:29 UTC
For the record, the original submit of the patch I'm removing was in 2011, according to the changelog. It isn't proven, but its intent may long since have been addressed upstream if a build without it doesn't have the reported problem.
Comment 34 David Mair 2023-07-06 17:25:08 UTC
I'll have to delay a submit. My experience of building crash suggests a problem in kernel-devel library that causes build to fail...at the time I am posting.
Comment 35 Michal Suchanek 2023-07-07 08:20:52 UTC
Which kernel-devel version?
Comment 36 Fabian Vogt 2023-08-18 14:52:12 UTC
There was a crash submission to oS:F recently but it seems unrelated.
What's the current state?
Comment 37 David Mair 2023-08-18 15:29:10 UTC
(In reply to Fabian Vogt from comment #36)
> There was a crash submission to oS:F recently but it seems unrelated.
> What's the current state?

The comment #24 problem is being worked on, and I expect a fix in the next few days.

The crash version in Tumbleweed is already 8.0.3, I patched it earlier in the week for a glibc update. However, the error message at comment #24 still occurs. As indicated, one of our own patches to crash causes it. The rest of this week after the glibc patch I've been diagnosing that and the re-ordering of the conditional as shown in comment #24 appears to be the cause.

In the patched order there are some coredumps for which Factory crash fails with a linux_banner error, because the linux_banner pointer is taken from the string text of the linux_banner rather than from the pointer to it. For the same coredump, the problem does not happen with the original conditional order. I am planning to debug both versions today to see if there is a reason to be concerned about reverting the conditional ordering.
Comment 38 David Mair 2023-08-18 15:33:57 UTC
The submission earlier this week was for bsc#1214232, crash fails to build from source on Factory.
Comment 39 David Mair 2023-08-18 18:29:44 UTC
The patch crash-debuginfo-compressed.patch contains the following modification to symtab_init() in symbols.c:

-	if (!(LKCD_KERNTYPES()) &&
-	    !(bfd_get_file_flags(st->bfd) & HAS_SYMS)) {
-		if (!check_gnu_debuglink(st->bfd))
+	if (!(LKCD_KERNTYPES())) {
+		if (!check_gnu_debuglink(st->bfd) &&
+		    !(bfd_get_file_flags(st->bfd) & HAS_SYMS))
 			no_debugging_data(FATAL);
 	}

I have a coredump with which crash fails to start, reporting:

WARNING: invalid linux_banner pointer: 65762078756e694c
crash:/path/to/kernel and /path/to/vmcore do not match!

If the above section of crash-debuginfo-compressed.patch is reverted to the original conditional ordering, that does not occur and crash starts.

I'm running crash in gdb to form an opinion on whether reverting that section of crash-debuginfo-compressed.patch is safe as a solution.
Comment 40 David Mair 2023-08-18 20:36:46 UTC
(In reply to David Mair from comment #39)
> The patch crash-debuginfo-compressed.patch contains the following
> modification to symtab_init() in symbols.c:
> 
> -	if (!(LKCD_KERNTYPES()) &&
> -	    !(bfd_get_file_flags(st->bfd) & HAS_SYMS)) {
> -		if (!check_gnu_debuglink(st->bfd))
> +	if (!(LKCD_KERNTYPES())) {
> +		if (!check_gnu_debuglink(st->bfd) &&
> +		    !(bfd_get_file_flags(st->bfd) & HAS_SYMS))
>  			no_debugging_data(FATAL);
>  	}
> 
> I have a coredump that when used with crash it fails to start reporting:
> 
> WARNING: invalid linux_banner pointer: 65762078756e694c
> crash:/path/to/kernel and /path/to/vmcore do not match!
> 
> If the above section of crash-debuginfo-compressed.patch is reversed to the
> original conditional ordering that does not occur and crash starts.
> 
> I'm running crash in gdb to form an opinion on whether reverting that
> section of crash-debuginfo-compressed.patch is safe as a solution.

Optimizations are obfuscating some of what I can see but...

When running the patched form:

	if (!(LKCD_KERNTYPES())) {
		if (!check_gnu_debuglink(st->bfd) &&
		    !(bfd_get_file_flags(st->bfd) & HAS_SYMS))
			no_debugging_data(FATAL);
	}

The first condition succeeds, we enter the block and run check_gnu_debuglink(). The linux_banner entry in the symbol table is considered invalid; its type is bss segment.

When running the pre-patched form:

	if (!(LKCD_KERNTYPES()) &&
		!(bfd_get_file_flags(st->bfd) & HAS_SYMS)) {
			if (!check_gnu_debuglink(st->bfd))
	 			no_debugging_data(FATAL);
 	}

LKCD_KERNTYPES() is also zero and there's no evidence of use of bfd_get_file_flags(), except that we don't enter the conditional block, don't use check_gnu_debuglink(), and use the sections known to the kernel, using the outcome of bfd_read_minisymbols() to generate the symbol table. linux_banner has the 'd' type and crash loads.

The likely ultimate state of the patched version is that the symbol table has entries from check_gnu_debuglink() followed by entries from bfd_read_minisymbols() and only the first one is used and it has the wrong type for linux_banner in the first entry but not for the second entry.

Fixing that is pretty nasty, we set the wrong type for the symbol and that could be fixed.

We could remove entries obtained from check_gnu_debuglink() if we have the same named entry obtained from bfd_read_minisymbols() but that means searching for a duplicate for every symbol obtained using check_gnu_debuglink() and I don't know how it will perform.

We could drop the results of check_gnu_debuglink() if bfd_read_minisymbols() has more than zero results but that assumes there are no cases in one but not the other.

For now I'm looking at how the rest of crash-debuginfo-compressed.patch works; it modifies check_gnu_debuglink()...where the wrong type gets set for the symbol.
Comment 41 David Mair 2023-08-19 00:23:57 UTC
The linux_banner error happens due to parsing of symbols from the debuginfo file.

In symbols.c, given the two versions of the conditional patched by crash-debuginfo-compressed.patch (i.e. the not patched and patched versions of the conditional) then:

NOT PATCHED:
* st->bfd is set to a file-descriptor from a file open of the kernel executable

PATCHED:
* st->bfd is set to a file-descriptor from a file-open of the debuginfo file

The logic requires a bit of careful thinking. If there is a discoverable .debug file for the kernel executable being used, then even if the crash command line specifies no debuginfo file, the patched code will use the discoverable (or a supplied) debuginfo file, and the behavior will be the same (incorrect) whether or not you provided a debuginfo file. This explains some cases where I've seen crash not fail with a linux_banner error: I didn't specify the debuginfo file, it wasn't discoverable either, and so st->bfd was the kernel executable.

The symbols are then read from whatever st->bfd is a file-descriptor for. In the case of reading from the debuginfo file, all symbol table entries will have type 'B' or 'b'. In the case of reading from the kernel executable the symbol table entries have a variety of types. There are about 78000 symbols in both cases. linux_banner is type 'D' from the kernel executable and crash loads, linux_banner is type 'B' from the debuginfo and crash fails reporting a bad linux_banner.

If I understand the purpose of crash-debuginfo-compressed.patch, it isn't well named. I believe it's intended to handle the use of a compressed kernel executable without de-compressing it (the symbol table information will be taken from the debuginfo file rather than the kernel executable). This explains behavior variations I've seen with and without a compressed kernel executable (the behavior differences also depended on whether a debuginfo file could be discovered).

Using a mapfile with crash should prevent the attempt to use the debuginfo file but it isn't really a fix.

I have to say that:

a) It appears it's not the fault of crash that all entries in the kernel debuginfo file are 'B' or 'b'; and
b) It would be nice to use the kernel binary in preference to the debuginfo file if the kernel has directly accessible symbol information. I believe this would require that crash always be used with a de-compressed kernel binary. So, forcing that isn't a complete fix either.

I'd be interested in knowing why kernel compilation results in all symbols having the type bss segment in debuginfo when the kernel binary has varying types for the same symbols. A better fix would be a type-correct debuginfo file from a kernel build.
Comment 42 Michal Suchanek 2023-08-21 06:26:12 UTC
Kernel compilation does not result in debuginfo files at all.

Kernel compilation results in vmlinux and module files with debuginfo directly included.

This debuginfo is then separated and compressed in rpm scripts by multiple tools.

I don't know if the debuginfo is broken to start with and it does not matter because the correct symbol map is included with it or if it is broken during the rpm debuginfo splitting.
Comment 43 David Mair 2023-08-21 15:20:36 UTC
Replace kernel "compile" with "kernel package build". It doesn't matter in the end, the kernel binary and debuginfo file from the packages for the same kernel have differing types for almost all symbols causing crash to report an error on loading...
Comment 44 David Mair 2023-08-21 15:56:55 UTC
(In reply to David Mair from comment #43)
> Replace kernel "compile" with "kernel package build". It doesn't matter in
> the end, the kernel binary and debuginfo file from the packages for the same
> kernel have differing types for almost all symbols causing crash to report
> an error on loading...

NB: This doesn't happen for every kernel debuginfo file, most are generated such that crash doesn't fail when using them to build the symbol table. A small number appear to have no type distinctions for symbols, all symbols are the same type and wrong type in the case of linux_banner in my experience. Therefore, it appears to occur based on kernel package build results.
Comment 45 David Mair 2023-08-21 18:26:01 UTC
bsc#723639 introduced the patch that causes the problem parsing linux_banner when all symbol types are 'B' or 'b' in the debuginfo file.

I am debugging crash itself to verify that the invalid type for linux_banner is stored in the debuginfo file and not a default because no type can be found. I also have a proposed "better" way to parse linux_banner from upstream that I will test.
Comment 46 David Mair 2023-08-21 19:52:34 UTC
Upstream provided an alternative method of testing the linux_banner that I've tested against my failing case. This would be a fix for comment #20 and comment #24, using the latest upstream crash plus one proposed patch specifically for linux_banner testing.

As I have one test case, if there is any interest in testing it please say. I will commit my patches and it could be tested from my home project on OBS.

Is there anything else to be resolved that I'm missing?
Comment 47 David Mair 2023-09-20 19:31:55 UTC
(In reply to Fabian Vogt from comment #24)
> The current error message is:
> 
> WARNING: invalid linux_banner pointer: 65762078756e694c
> crash: /var/tmp/vmlinux.xz_CSsIpp and /var/crash/2023-05-02-00:13/vmcore do
> not match!

I have a patch for this that's been verified on three coredumps where the behavior is seen and is ACKed and committed upstream. I will be submitting the fix this week for openSUSE.
Comment 48 Petr Cervinka 2023-11-24 10:14:26 UTC
I can confirm from QA side that crash version 8.0.4 works fine on Tumbleweed now.
Comment 49 Petr Vorel 2023-11-24 10:54:42 UTC
For the record, it has been fixed in build 20231117; the last broken build was 20231116.