Bugzilla – Full Text Bug Listing
| Summary: | crash fails to find MAGIC_START and refuses to load vmcore | | |
|---|---|---|---|
| Product: | SUSE Linux Enterprise Server 15 SP3 (PUBLIC) | Reporter: | Ganapathi CH <ch.ganapathi> |
| Component: | Other | Assignee: | David Mair <dmair> |
| Status: | IN_PROGRESS --- | QA Contact: | E-mail List <qa-bugs> |
| Severity: | Normal | | |
| Priority: | P5 - None | CC: | ajesh.issac, ch.ganapathi |
| Version: | SLES15SP3Maint-Upd | | |
| Target Milestone: | unspecified | | |
| Hardware: | x86-64 | | |
| OS: | SLES 12 | | |
| Whiteboard: | | | |
| Found By: | Third Party Developer/Partner | Services Priority: | |
| Business Priority: | | Blocker: | --- |
| Marketing QA Status: | --- | IT Deployment: | --- |
| Attachments: | vmcore from SLES-12-SP5 that cannot be opened using crash | | |
Description
Ganapathi CH
2024-04-27 06:59:03 UTC
The kernel vmcore is from SLES-12-SP5. However, the SUSE bug filing does not allow selecting SLES-12-SP5 as the product. How can this vmcore be opened using crash? I tried on SLES-12-SP5 with kernel-default-4.12.14-122.144.1.x86_64.rpm, but with no success. Please help me open the vmcore using crash.

(In reply to Ganapathi CH from comment #1)
> The kernel vmcore is from SLES-12-SP5. However, the SUSE bug filing does not
> allow to select SLES-12-SP5 as the product.
>
> How can this vmcore be opened using crash?
>
> I tried with SLES-12-SP5 with kernel-default-4.12.14-122.144.1.x86_64.rpm.
> But no success.
>
> Please help me to open the vmcore using crash.

The best first choice is to use a later version of SLES (or openSUSE) as the place where you run crash, e.g. SLES15-SP5 (or the most recent openSUSE version). Using a given release of crash on a SUSE product to debug coredumps from previous versions of SUSE products is supported, and the latest crash release on _any_ product is crash 8.0.4, which is substantially more useful and has broader kernel support than the very old crash 7.2.1.

However, while that might give you immediate functionality, the same statement combined with your report implies that the kernel on SLES12-SP5 has been updated such that a crash update is required on SLES12-SP5 for normal kernel debugging. My preferred approach would be to generalize the use of crash 8 versions on supported products when this type of thing happens and, internally, the source code management of crash for builds was recently separated from the source code management for general packages in individual products. However, whether crash 8 can be released for SLES12-SP5 requires verification before consideration and cannot give you a quick solution, whereas the approach suggested above can.

The same result with crash 8.0.4.
Looks like crash is unable to find the string 'linux_banner' in the vmcore. But why?

```
# crash -d vmcore -f vmlinux-4.12.14-122.144-default.gz usr/lib/debug/boot/vmlinux-4.12.14-122.144-default.debug

crash 8.0.4
Copyright (C) 2002-2022  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011, 2020-2022  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
Copyright (C) 2015, 2021  VMware, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.

GNU gdb (GDB) 10.2
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...

WARNING: could not find MAGIC_START!
crash: usr/lib/debug/boot/vmlinux-4.12.14-122.144-default.debug and /proc/kcore do not match!

Usage:
  crash [OPTION]... NAMELIST MEMORY-IMAGE[@ADDRESS]     (dumpfile form)
  crash [OPTION]... [NAMELIST]                          (live system form)

Enter "crash -h" for details.
```

Thank you for testing and the feedback. I've previously worked a bug where crash could not find the banner.
It was a long time ago but, as I recall, it was a bit strange. I'll start researching it on Tuesday (2024-04-30) morning.

A way to get crash to start under failure circumstances is to use the --minimal argument on the crash command line. A large amount of the startup verification is not performed, but the result is a very limited set of functionality in crash. Going back to crash on SLES12-SP5, you should be able to start it in --minimal mode and then use something like:

```
crash> set debug 8
crash> rd linux_banner 20
crash> rd -a linux_banner
```

It should demonstrate the path through memory (and the coredump) at debug level 8. If linux_banner is discoverable, the commands should dump the memory containing it and it should identify the kernel and version number, e.g. Linux version 4.12.14-122.144.

The failure case I was referring to in comment #4 was that the "type" of the symbol linux_banner caused the banner string itself to be dereferenced as a pointer, and since it is text, treating it as an address means it would be unlikely to represent any address in system memory. The address of linux_banner would appear to be something like 0x4C6C6E7578205665 (or 0x65562078756E6C4C): the first 8 bytes of linux_banner ASCII text treated as a pointer to the actual linux_banner content. That doesn't appear to be the case from your debug launch of crash, but it's worth noting; output from the following might help in minimal mode:

```
crash> info linux_banner
```

You could also decompress the kernel image up front so its content is read directly rather than decompressed on access.

If crash will not start when using --minimal on the command line then it is likely the coredump (vmcore) is damaged or faulty in some way. That can happen if it is obtained using a method other than kdump, e.g. a Xen memory image or VMware's equivalent. There have been some compatibility issues between such memory images and a kdump memory image.
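For reference, a minimal-mode launch using the file names quoted earlier in this bug might look like the following. This is a sketch only; the paths are the ones the reporter used and would need adjusting to the local layout:

```shell
# Start crash with most startup verification skipped; functionality is
# very limited, but it may load a coredump that a normal start rejects.
# Paths follow the earlier comments in this bug; adjust as needed.
crash --minimal \
    usr/lib/debug/boot/vmlinux-4.12.14-122.144-default.debug \
    vmcore
```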
If it is a kdump memory image then, if storage space permits, it couldn't hurt to use the dump level that includes the most memory in the vmcore image, i.e. dump level zero (no pages filtered). You can set that in YaST or directly in /etc/sysconfig/kdump with the option KDUMP_DUMPLEVEL="<value>". Even on a system with a huge amount of memory, using KDUMP_DUMPLEVEL="0" is a reasonable elimination case for the inability to find objects.

As a last resort with the kdump model, ensuring the dump file format by using KDUMP_DUMPFORMAT="ELF" in /etc/sysconfig/kdump is a worthwhile elimination, combined with the use of KDUMP_DUMPLEVEL="0".

Beyond that, for debugging it I'd need access to a coredump exhibiting the problem (it would save time to have the matching kernel/debuginfo files as well), but let's take that up after your experience with everything in the previous paragraphs of this post.

crash --minimal can open the vmcore. However, with very minimal commands supported it is not useful for diagnosing the problem. Here is the output of linux_banner from crash --minimal:

```
crash> rd -a linux_banner
<addr: ffffffff89c000c0 count: 9223372036854775807 flag: 2080 (KVADDR)>
<readmem: ffffffff89c000c0, KVADDR, "ascii", 1, (FOE), 7ffc758f2910>
<read_diskdump: addr: ffffffff89c000c0 paddr: debc000c0 cnt: 1>
read_diskdump: paddr/pfn: debc000c0/debc00 -> physical page is cached: debc00000
```

So the vmcore seems to be good, or else --minimal also wouldn't have worked, I guess. But the limited scope of commands is hindering any further progress. I have also attached the crash folder from /var/crash/ so you can check the actual vmcore. I have not attached the vmlinux-4...debug file as it is too big (even after compressing with tar jcf).

I know --minimal mode has very limited functionality; I said so in comment #5. I was asking you to try it only to get the diagnostic information when reading linux_banner.
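The kdump elimination settings suggested above can be sketched as a sysconfig fragment. The file path and variable names are the ones given in the comment; regenerating the kdump initrd via `systemctl restart kdump` is an assumption about the SLES setup:

```shell
# /etc/sysconfig/kdump -- elimination settings from the comment above:
#   level 0 = no pages filtered out of the dump,
#   ELF     = plain uncompressed dump format.
KDUMP_DUMPLEVEL="0"
KDUMP_DUMPFORMAT="ELF"

# After editing, the kdump service must pick up the new configuration,
# typically (assumed service name):
#   systemctl restart kdump
```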
Thank you. Plainly linux_banner is accessible as an identity, though that doesn't of itself demonstrate the coredump has no faults. In fact, I'm not sure I agree the rd -a command worked; I don't see the ASCII dump of memory from the address in linux_banner in comment #6. The command should have ended by outputting something like this:

```
crash> rd -a linux_banner
...the details about what's being parsed due to the debug level, then...
c082a020:  Linux version 4.8.23-144.232.x86_64 (suseckbuild@hs20-bc2-4.buil
c082a05c:  dab.suse.com) (gcc version 4.4.4 20100726 (SUSE LLC 4.8.2-09)
c082a098:  (GCC) ) #1 SMP Tue May 9 20:25:37 CET 2021
```

That is a mock-up, and obviously the address dumped would not be c082a020 in your case; the kernel version, builder and compiler version would be different as well. But it would be very similar, and your output appears to have shown nothing of the content of linux_banner. The identifier linux_banner is known, and its address, but reading from the referenced location appears to have produced no output.

I see the material you attached when you created the bug. I'll investigate using crash in gdb and aim to diagnose it, then report back. It may take me two or three days. Debugging crash as it debugs the kernel requires careful consideration of whether what is being observed is runtime activity in crash and/or interpretation of the coredump as a runtime image of the kernel.
Here's the end of the debug output I get with crash 8.0.4 on openSUSE Tumbleweed using your attached files, decompressing the kernel binary and using the rpm version of vmlinux-4.12.14-122.144-default.debug that I expected to match the dumped vmlinux-4.12.14-122.144-default:

```
<readmem: ffffffff8a04ede8, KVADDR, "page_offset_base", 8, (FOE|Q), 55ad639e24c8>
<read_diskdump: addr: ffffffff8a04ede8 paddr: dec04ede8 cnt: 8>
read_diskdump: paddr/pfn: dec04ede8/dec04e -> cache physical page: dec04e000
read_diskdump/cache_page: descriptor with zero offset found at paddr/pfn/pos: dec04e000/dec04e/4262ba
read_diskdump: READ_ERROR: cannot cache page: dec04e000
crash: page incomplete: kernel virtual address: ffffffff8a04ede8 type: "page_offset_base"
```

Crucially, the value of page_offset_base (as I understand it) would be required to read kernel text, including the content of a symbol like linux_banner. From the point of view of diagnosis only:

* The readmem fails
* The attempt to read the value is from kernel virtual address ffffffff8a04ede8
* 8 bytes are read from physical address dec04ede8
* dec04ede8 is read into a cached physical page located at dec04e000
* Caching the page fails due to a zero offset found at the physical address
* Therefore, page_offset_base cannot be read from the cached page

I can't even start crash in --minimal mode using this coredump. This does NOT match the behavior you observe; e.g. your debug output for page_offset_base is:

```
<readmem: ffffffff8a04ede8, KVADDR, "page_offset_base", 8, (FOE), d45a28>
<read_diskdump: addr: ffffffff8a04ede8 paddr: dec04ede8 cnt: 8>
read_diskdump: paddr/pfn: dec04ede8/dec04e -> cache physical page: dec04e000
```

* The kernel virtual address is the same
* The physical address is the same
* The read to cache succeeds and subsequent steps of the initialization take place for your use

On the basis of this I'm not convinced the attached coredump is complete. It would help to have a checksum of it (e.g.
sha256 or sha512) so that I can confirm I am working on the same file as you. Nevertheless, the behavior I see is not entirely different from yours: I see crash unable to read a basic memory reference point, you see it unable to read a slightly less significant one, but both cases are failures to read a symbol's value. The fact that you were able to start crash in --minimal mode proves nothing about the integrity of the coredump; --minimal excludes a lot of the normal initialization tests and tries to permit basic access to the coredump as a memory image. IOW, if the initialization required for crash to debug the coredump as a runtime image is not performed but crash loads (--minimal), it only proves that addressing some objects in the coredump works, not that the coredump has enough integrity to be useful with crash.

I strongly recommend retrying with a coredump created using KDUMP_DUMPLEVEL="0". It's not suggested as a solution, but rather a way to exclude some potential problem cases that might cause the observed symptoms. Given that I can't reproduce the problem you see but I do see a similar one (failure to read a more significant reference than in your observation), I have little to go on other than the appearance that the coredump does not have the required integrity for normal use with crash. That might have happened when it was created, possibly due to the way it was filtered when created, or possibly in the manner it was made available to you. For any of these cases I would always want to start with a coredump created at kdump level zero and then work down from that. I would also like to know what happens with a coredump triggered deliberately on an equivalent SLES12-SP5 system that is operating normally.

crash --minimal didn't open the vmcore for me either on openSUSE Tumbleweed. However, it can be opened in --minimal mode on a SLES12-SP5 server.
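The requested checksum can be produced with `sha256sum` from GNU coreutils. A small self-contained demonstration of the technique (the real vmcore path is hypothetical and shown only in the trailing comment):

```shell
# Demonstrate the checksum technique on a sample file; the digest of
# the ASCII string "abc" is a well-known SHA-256 test vector.
printf 'abc' > /tmp/vmcore.sample
sha256sum /tmp/vmcore.sample
# prints: ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad  /tmp/vmcore.sample

# Run the same against the real coredump, e.g. (hypothetical path):
#   sha256sum /var/crash/<timestamp>/vmcore
```

Both sides can then compare digests to confirm they hold byte-identical files.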
(In reply to Ganapathi CH from comment #10)
> crash --minimal mode didn't open for me too on openSUSE Tumbleweed. However,
> it can be opened in --minimal mode on a SLES12-SP5 server.

Thanks for the verification. It is still as I said: --minimal demonstrates little about the integrity of the coredump while allowing limited access to its content. The evidence still suggests to me that the coredump could be bad. For progress as a crash bug it has to be verified that the problem exists in a coredump taken from a system that is not in the process of failing/aborting/oopsing/etc., i.e. a triggered coredump from an idle server running normally, and the starting point has to be that said coredump has no content filtered (kdump level zero). If that coredump can be opened then it is likely the experience you are having with the attached coredump is not a crash bug, and alternative configuration of kdump might help in the real problem case.

I'm working on a test VM to experiment with kdump/crash. I recommend you perform your own test case as described in comment #11, but I'll post my results if they are needed. I also need to note that the current kernel version for SLES12-SP5 is 4.12.14-122.201.1 and that the problem is reported for kernel 4.12.14-122.144, which is from December 2022 as I understand it and substantially outdated.

(In reply to David Mair from comment #12)
> I'm working on a test VM to try to experiment with kdump/crash. I recommend
> you perform your own test case as described in comment #11 but I'll post my
> results if they are needed.

OK.

FOR THE RECORD: I cannot reproduce the reported problem using crash 7.2.1 on SLES12-SP5 patched to-date, using a triggered kdump coredump from an idle but operational SLES12-SP5 server with the current maintenance kernel version 4.12.14-122.201. Crash loads the coredump just fine and reaches the crash> console prompt without error.
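The "triggered coredump" referred to above can be produced on a test system via the kernel's magic SysRq interface. This is a standard Linux mechanism rather than anything from this bug, and it immediately panics the machine, so it should only be used on a system prepared for that:

```shell
# Deliberately panic the kernel so kdump captures a coredump of the
# otherwise-idle system. WARNING: this crashes the machine on purpose.
systemctl status kdump           # first verify the kdump service is armed
echo 1 > /proc/sys/kernel/sysrq  # enable all SysRq functions
echo c > /proc/sysrq-trigger     # trigger the crash; the vmcore is
                                 # written under /var/crash after reboot
```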
The kdump level used was the install default of 23, which I understand excludes only the following:

* Pages filled with zeros
* Cache pages
* Cache private pages
* Free pages

The dump format was LZO Compressed, which makes the only difference between the format of my coredump and the supplied one that I included user data pages. Which, as I mentioned, is the default filtering. My kdump on a 6GB host was configured with the following startup configuration:

* Low Memory: 72MiB
* High Memory: 165MiB

The coredump triggered from the idle, non-failing server with 6GiB of memory has a compressed size of 111MiB using the enabled filters and compression detailed above.

I repeated the test with all page type filters enabled (which is KDUMP_DUMPLEVEL 31). It's still a patched to-date system, but now using the same KDUMP_DUMPLEVEL as the reported problem coredump. I still had no problem opening the newly created coredump using crash. The coredump this time was 56MiB.

It's an extremely crude value, but the ratio of physical memory to coredump size I see for an LZO compressed coredump is as follows, by KDUMP_DUMPLEVEL:

* KDUMP_DUMPLEVEL 23: 55.6 to 1
* KDUMP_DUMPLEVEL 31: 110.5 to 1

If this were a credibly consistent ratio across SLES12-SP5 systems (which is NOT really a fact, but is a reasonable assumption for guidance) then I would estimate the physical memory is between 24GiB and 32GiB on the system the supplied 233MiB problem coredump came from.

I will now transfer the problem coredump to my test VM to see if there is any difference trying crash with it on a patched to-date SLES12-SP5 (though with the problem system's kernel binary and debuginfo).
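The ratio arithmetic above can be reproduced from the rounded sizes quoted in the comment (6 GiB of RAM; 111 MiB and 56 MiB dumps; 233 MiB problem dump). The results differ slightly from the quoted ratios, which were presumably computed from exact byte counts, and the back-estimate assumes the level-31 ratio applies to the customer system:

```shell
# Memory-to-coredump ratios from the rounded sizes quoted above.
awk 'BEGIN { printf "level 23: %.1f to 1\n", 6*1024/111 }'  # ~55.4 to 1
awk 'BEGIN { printf "level 31: %.1f to 1\n", 6*1024/56  }'  # ~109.7 to 1

# Back-estimate of physical memory for the supplied 233 MiB dump,
# assuming the level-31 ratio holds:
awk 'BEGIN { printf "estimated RAM: ~%.0f GiB\n", 233*110.5/1024 }'  # ~25 GiB
```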
...and using the supplied problem coredump on a patched to-date (May 2024) SLES12-SP5, where the coredump is from a December 2022 kernel (version 4.12.14-122.144), using the matching kernel 4.12.14-122.144 binary and debuginfo for the dump of the kernel 4.12.14-122.144 system, I observe the reported problem: the value of the linux_banner symbol is output as a 64-bit zero integer:

```
ffffffff89c000c0: 0000000000000000
```

This is not valid: the page was read from the coredump, and the value at the location in the coredump representing the location of the linux_banner symbol is zero, and neither the text of linux_banner nor a pointer address to the linux_banner text.

Going further, using:

```
crash> rd 0xffffffff89c000c0 4096
```

to browse the page supposedly containing linux_banner: every byte in the page has zero value, i.e. it is a zero page. The kdump dump level was configured to exclude the byte content of zero pages from the coredump; therefore the real content of the page containing linux_banner was either damaged when inserted into the coredump by kdump, or improperly resolved by kdump such that a zero page that was not the page containing linux_banner was treated as that page and flagged in the coredump as a zero page. None of these scenarios suggest a problem in crash, and they further support my opinion that the coredump is invalid and should not be used with crash. Combine these with the fact that I can dump the core on a stable system with dump level 23 and level 31, then use it with crash on a patched to-date SLES12-SP5.
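As a general technique (not something used in this bug), a suspect vmcore can also be probed without fully loading it in crash; to the best of my knowledge the options below exist in current crash and makedumpfile releases, but availability may vary by version:

```shell
# Print the OSRELEASE string recorded in the dump header, without
# loading the dump; a garbage or empty value suggests a bad header.
crash --osrelease vmcore

# Extract the kernel log buffer straight from the vmcore using the
# matching debuginfo vmlinux; a failure here also points at a dump
# whose pages cannot be resolved.
makedumpfile --dump-dmesg -x vmlinux-4.12.14-122.144-default.debug vmcore dmesg.txt
```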
* I recommend the supplied coredump be considered invalid based on the above analysis.
* For progress I strongly recommend patching the problem system to-date and attempting to reproduce the problem underlying the creation of the coredump.
* If the underlying problem that caused the core to be dumped is reproduced on a patched to-date SLES12-SP5 and the coredump still cannot be opened by crash, then it is likely the underlying cause corrupts memory sufficiently that a dumped core is of little use.
* If that is observed then I strongly recommend that the kdump dump level be set to zero (no filtered pages, all memory in the coredump), the underlying problem be reproduced again and the unfiltered coredump be examined.
* If crash still fails to load that unfiltered coredump then I strongly recommend that a coredump be triggered on an idle instance of the system exhibiting the underlying problem (i.e. get a coredump of a working system that the underlying problem can be reproduced on). This should be done as soon as possible after Linux boot is complete (i.e. a system that has just come up and done almost nothing but boot to an idle state).
* If crash does not load that coredump we should continue talking, but I suspect that a coredump of a just-booted, idle, patched to-date SLES12-SP5 will not be damaged, matching my observations on a test system; and if you can use that coredump with crash then it is improbable that crash has a bug causing the reported problem, rather something causes the coredump to be invalid when created.

Thanks for the analysis and the suggestions. Yes, I agree this is unlikely to be a crash issue; maybe kdump or some other issue. That coredump was from a customer. I have pinged the customer but haven't heard back. If the customer responds I will suggest getting the coredump with kdump level 0 and see if I can open that. If not, I will ask for a triggered coredump on the idle server and verify. You can close this bug for now.
If the customer responds and I get a coredump as suggested but still cannot open it, I will reach out again. Hope that is fine.