Bugzilla – Bug 944659
After installing Leap M2, 9 out of 10 times, the system hangs with black screen, just after the bootloader hands over
Last modified: 2016-09-04 21:10:34 UTC
After installing Leap M2, the system does not start *most of the time*. Just after the bootloader messages go away, there is some activity where I see a keyboard LED, either FNLOCK or NUMLOCK, flash, and after that the system hangs with a black screen (it never switches to what I would describe as the Alt-Ctrl-F7 screen).

This happens 9 out of 10 times, whether I just reset or cold reboot the system. I made numerous changes in the BIOS after this occurred to try to resolve it (including of course to set settings to Save) but the behaviour is still the same.

As I noticed the keyboard LED activity (FNLOCK or NUMLOCK) after the bootloader screen goes away, I tried several keyboards, with the same result. I also noticed that *sometimes*, after the bootloader message goes away, FNLOCK goes ON; if I am fast enough to switch it back OFF, the system will start.

This is AMD with Radeon:

home-server:~ # /sbin/lspci -nnk | grep -i vga -A2
01:05.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] RS880 [Radeon HD 4290] [1002:9714]
        Subsystem: Gigabyte Technology Co., Ltd Device [1458:d000]
        Kernel driver in use: radeon

I suspect some sort of driver bug or timing issue, but that is speculation.

Reproducible: Always

Steps to Reproduce:
1. Reboot the system
2. Bootloader shows its message
3. Black screen

Actual Results: The system hangs just after the bootloader messages go away. This happens 9 out of 10 times. The times when the system starts, it behaves normally.

Expected Results: The system switches to the F7 display (as if I hit Alt-Ctrl-F7) and shows the login screen.

I described more details in the Details section. As I mentioned, this is an on-chip Radeon card (see the lspci output above).
Additional Comment: ------------------- Also, this is fully reproducible even with all devices disconnected (keyboard, audio, and other USB devices) except the display. (So while some input on the keyboard at the exact point of kernel start that I commented on may help, the keyboard is definitely not the cause.)
Could you check whether boot works fine with the nomodeset boot option? If yes, the problem is likely in the radeon kernel driver. Also (without the nomodeset option), try to boot without the quiet boot option. This might show the error messages by chance.
Thanks for following up. Describing what I did: In DEFAULT_APPEND in sysconfig->bootloader

    resume=/dev/disk/by-id/ata-WDC_WD5001AALS-00L3B2_WD-WMASY3398749-part1 splash=silent quiet showopts

I first added nomodeset and there was no difference. On the attempts where the system started, the displayed part of the screen was narrower, but it only booted intermittently, as always. Then, in the original version, I removed "quiet": no messages showed up and there was no difference in behaviour either, just a black screen.
That's surprising. Basically nomodeset skips KMS so essentially the whole graphics driver will be skipped. Your result implies that it's no graphics driver issue. Hmmm... Could you run "hwinfo --all" once when you could boot, and attach the output to Bugzilla? Also, attach the output of "dmesg" after boot, too. Other things to check: does the installation DVD boot properly at every time? What about the rescue system on DVD? Is it only the installed system that hangs?
BTW, you don't have to modify /etc/sysconfig/bootloader or else for testing a boot option temporarily. Just press 'e' at GRUB boot menu and it goes to the edit mode. Add the boot option to the line "linux /boot/vmlinuz...." and boot further via Ctrl-X or F10 key.
Created attachment 646716 [details] hwinfo --al for bug_944659
Created attachment 646717 [details] dmesg for bug 944659
Thanks for the info about pressing 'e' in GRUB. I knew there was a way but did not remember. I am adding two attachments with the results of:

    hwinfo --all > hwinfo_bug_944659
    dmesg > dmesg_bug_944659
I need to answer your other questions: 1. does the installation DVD boot properly at every time? : (It's on USB) Yes, it did boot every time I tried (I now have the BIOS set to hard disk first - do you want me to retry it?) 2. What about the rescue system on DVD? Is it only the installed system that hangs? : Not sure I understand - I have only one OS installed; this is on a SSD with 2 partitions, /dev/sda1 for the OS (no further partitioning) and /dev/sda2 for swap. I have not tried to boot into the rescue on the USB - do you want me to try? Thanks again for the follow up.
I also have an update here: Let me look into two things:

1. When I said that I changed DEFAULT_APPEND in sysconfig->bootloader and removed the "quiet" option - I did that, it is still gone. But it appears this did not actually affect the bootloader. The reason I think so is that when you pointed out the 'e' option, I tried that and removed "quiet", and the system did boot (one try only, so this is not conclusive). Most importantly, the boot process looked different: it does show all the messages (which removing "quiet" from DEFAULT_APPEND did not achieve). Is there a way for me to remove "quiet" permanently so I can test more?

2. I noticed my SECURE_BOOT is set to "yes" - I plan to experiment with setting it to "no".

I will not have time to experiment more until my evening, but you should get more info by tomorrow. Thanks.
(In reply to milan zimmermann from comment #9)
> 2. What about the rescue system on DVD? Is it only the installed system
> that hangs? : Not sure I understand - I have only one OS installed; this is
> on a SSD with 2 partitions, /dev/sda1 for the OS (no further partitioning)
> and /dev/sda2 for swap. I have not tried to boot into the rescue on the USB
> - do you want me to try?

Yes, I meant the rescue boot item from the installation DVD.

(In reply to milan zimmermann from comment #10)
> I also have an update here: Let me look into two things:
>
> 1. When I said that I changed DEFAULT_APPEND in sysconfig->bootloader and
> removed the "quiet" option - I did that, it is still gone.

Changing /etc/sysconfig/bootloader alone is not enough; you'll have to refresh the grub configuration, too. This can be done via the YaST bootloader dialog. Or, it'll be easier to edit /etc/default/grub instead (edit $GRUB_CMDLINE_LINUX_DEFAULT), then update the real grub config via

    /usr/sbin/grub2-mkconfig -o /boot/grub2/grub.cfg

> But it appears
> this did not actually affect the bootloader. The reason I think that is when
> you pointed out the 'e' option, I tried that removed "quiet", and the system
> did boot (one try only so this is not conclusive) but most importantly, the
> boot process looked different, it does show all the messages (which removing
> "quiet" from the DEFAULT_APPEND).

Right, the purpose of removing the quiet option is to see the kernel log at the hang.

> 2. I noticed my SECURE_BOOT is set to "yes" - I plan to experiment with
> setting it to "no".

Oh, that's an interesting point. Yes, please investigate it, too. Thanks.

If you find out that the hang happens far before the kernel starts showing many messages, you might try to pass the dis_ucode_ldr boot option. It's just to be sure, though.
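Since an edited boot option only matters if it actually reaches the kernel, a quick check after each reboot can save a round of testing. The following is a minimal sketch, not part of the original exchange: the helper name has_boot_option is made up for illustration, and it only assumes that the running kernel's options can be read from /proc/cmdline.

```shell
#!/bin/sh
# has_boot_option CMDLINE OPTION
# Hypothetical helper: succeeds if OPTION appears as a whole,
# space-separated word in the kernel command line string CMDLINE.
has_boot_option() {
    case " $1 " in
        *" $2 "*) return 0 ;;
    esac
    return 1
}

# Example on the running system (only if /proc/cmdline is readable):
if [ -r /proc/cmdline ]; then
    for opt in quiet nomodeset acpi=off; do
        if has_boot_option "$(cat /proc/cmdline)" "$opt"; then
            echo "$opt: present on the running kernel command line"
        else
            echo "$opt: not present"
        fi
    done
fi
```

If an option was added to /etc/default/grub but does not show up here after a reboot, the generated grub.cfg was probably not refreshed, or a different menu entry was booted.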
Ok, I have a few things to follow up on (USB boot, dis_ucode_ldr), but let me comment on what hopefully is at least a step that enables further investigation.

What I did so far based on your reply:
- Removed quiet from GRUB_CMDLINE_LINUX_DEFAULT (which I could not find earlier)
- In sysconfig yast, set SECURE_BOOT to "no"
- Ran /usr/sbin/grub2-mkconfig -o /boot/grub2/grub.cfg (which I was already doing after changes before)
- Rebooted
- The system still hangs most of the time, so it seems SECURE_BOOT does not affect this
- Removing "quiet" has an effect - I can now see messages when the system hangs.

How can I deliver the messages to this report? I took a camera picture of the messages, but the result is too big for an attachment. Can I email it to your email (if so, what is it)? Or is there a log this stuff goes to?

I see no clear errors there, except (this may be a message) "radeon: ... registered panic notifier". The last message is "systemd-journald ... Received request to flush runtime journal from PID 1"

Thanks.
I have some interesting follow ups: 1. Adding the dis_ucode_ldr did not help 2. When testing your request to boot from the USB image, into "rescue system", I noticed F5 (Kernel) allows to set "NO ACPI". So I tried: - with "NO ACPI" system booted 5 times in 5 tries (100% success) - with "Default" system booted 0 times in 5 tries (0% success) Where do I go from here? I did try to set acpi=off in boot and rebuilt grub.cfg but that *did not help* - system booted 0 times in 5 tries. This is confusing me to no end ... Do you have any suggestions? Thanks
(In reply to milan zimmermann from comment #12)
> How can I deliver the messages to this report? I took a camera picture of
> the messages, but the result is too big for an attachment. Can I email it to
> your email (if so, what is it)? Or is there a log this stuff goes to?

It's no problem to attach a picture on Bugzilla. Use an attachment. The size doesn't matter unless it's over 100MB or so :)

> I see no clear errors there, except (this may be a message) "radeon: ...
> registered panic notifier". The last message is ""systemd-journald ...
> Received request to flush runtime journal from PID 1"

Maybe your first test with nomodeset wasn't performed properly? Could you retry with the nomodeset boot option? This will result in lower (or no) graphics with the VESA fb, but it should at least be working.
Attaching the boot-fail screen capture. I will re-do my nomodeset testing again tonight, will make absolutely sure it makes it to boot.cfg. Thanks for all your help here.
Created attachment 646863 [details] Error message when boot hangs
I tested with nomodeset, making sure to run mkconfig and that the option appears in /boot/grub2/grub.cfg. It did not help: the system failed to boot 5 times in 6 tries. In the successful boot, the screen resolution was lower and the display narrower and stretched, all indications that nomodeset did kick in.

But interestingly, in dmesg with nomodeset, there is an error:

[ 3.167331] pata_jmicron 0000:05:00.1: enabling device (0000 -> 0001)
[ 3.167740] [drm] VGACON disable radeon kernel modesetting.
[ 3.167758] [drm:radeon_init [radeon]] *ERROR* No UMS support in radeon module!

I am attaching the full dmesg.

Overall, from some 1000 boots or so I did during the last week, the only reliably working setting was when I boot using USB and, in F5 (Kernel), set "No ACPI". But setting acpi=off in the bootloader does not have the same effect, as I noted.

Not sure where to take it next, but thanks very much for your help so far.

BTW, would you have some idea how to find what actual setting is set when I select "No ACPI" in the USB boot in F5 (Kernel)?
Created attachment 646902 [details] dmesg_after_first_successful_boot_with_nomodeset
BTW, there is a link I found which appears to describe a very similar problem with an earlier version: https://forums.opensuse.org/showthread.php/501077-No-UMS-support-in-radeon-module-no-X The person ended up with the proprietary driver. I'd still like to give the free radeon driver a few more tries; maybe it will help others too if we find a solution. I am not entirely dead in the water, as the system does boot after a sufficient number of attempts.
(In reply to milan zimmermann from comment #17)
> I tested with nomodeset making sure to mkconfig and that it appears in
> /boot/grub2/grub.cfg. It did not help, failed to boot 5 times in 6 tries. In
> the succesful boot, the screen resolution was lower, the display narrower
> and stretched, all indications that the nomodeset did kick in.
>
> But interestingly, in dmesg with the nomodeset, there is an error:
>
> [ 3.167331] pata_jmicron 0000:05:00.1: enabling device (0000 -> 0001)
> [ 3.167740] [drm] VGACON disable radeon kernel modesetting.
> [ 3.167758] [drm:radeon_init [radeon]] *ERROR* No UMS support in radeon
> module!

This is OK, the expected result.

> I am attaching the full dmesg.
>
> Overall, from some 1000 boots or so I did during the last week, the only
> reliably working setting was when I boot using USB and in F5 (Kernel), set
> "No ACPI" . But setting acpi=off in the bootloader does not have the same
> effect as I noted.

acpi=off supposedly disables some devices indirectly, so this might help to avoid the bad point.

> Not sure where to take it next, but thanks very much for your help so far.
>
> BTW, Would you have some idea how to find what actual setting is set when I
> select "No ACPI" in the USB boot in F5 (Kernel)?

You can take a look at /proc/cmdline.

Judging from the boot screen you attached in comment 16, this doesn't seem like a crash of any driver. Now that I have read through the kernel log, the possible hit after the last dying message is acpi-cpufreq. Could you try to blacklist it, e.g. by adding the following line to /etc/modprobe.d/99-local.conf?

    blacklist acpi-cpufreq

Then reboot and retest. Check the dmesg output to verify whether acpi-cpufreq was loaded. If a message with "acpi-cpufreq" appears, the blacklist didn't work -- as a temporary test, just remove the module from the /lib/modules/$VERSION/kernel/drivers/cpufreq directory.
1. ----------
FWIW I agree this is likely nothing to do with the video driver, or any driver, likely more some kernel module.

2. ----------
I added the blacklist acpi-cpufreq line to /etc/modprobe.d/99-local.conf, then rebooted several times. The boot still hangs most of the time. When it starts:

home-server:~ # dmesg | grep cpufreq
home-server:~ #

Does that mean the blacklist worked and I do not have to remove the ko file? In my system, I have:

home-server:~ # ls /lib/modules/4.1.6-9-desktop/kernel/drivers/cpufreq/
acpi-cpufreq.ko          cpufreq_conservative.ko  cpufreq_stats.ko      powernow-k8.ko
amd_freq_sensitivity.ko  cpufreq_powersave.ko     cpufreq_userspace.ko

Do you think it would make sense to start deleting those one by one and see if any of them helps? This is an AMD system.

3. --------
Also, having booted to "rescue system" with acpi=off (which works 100%), I see:

cat /proc/cmdline
acpi=off initrd=initrd splash=silent rescue=1 install=hd:///

But if I set acpi=off in my hard disk boot, there is no difference and it mostly hangs. Is it possible the boot/hang behavior depends on the initrd, which is likely different on the USB vs my system?
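An empty "dmesg | grep cpufreq" suggests the module did not announce itself, but the list of loaded modules is a more direct check of whether a blacklist entry took effect. The sketch below is an illustration only: the helper name module_loaded is invented here, and it relies on the fact that /proc/modules (the data behind lsmod) lists module names with underscores, so acpi-cpufreq shows up as acpi_cpufreq.

```shell
#!/bin/sh
# module_loaded NAME [FILE]
# Hypothetical helper: succeeds if kernel module NAME is listed in FILE,
# which defaults to /proc/modules. Hyphens in NAME are converted to
# underscores because that is how loaded modules are listed.
module_loaded() {
    name=$(printf '%s\n' "$1" | sed 's/-/_/g')
    file=${2:-/proc/modules}
    awk -v m="$name" '$1 == m { found = 1 } END { exit !found }' "$file"
}

# Example on the running system (only if /proc/modules is readable):
if [ -r /proc/modules ]; then
    for mod in acpi-cpufreq powernow-k8; do
        if module_loaded "$mod"; then
            echo "$mod is loaded"
        else
            echo "$mod is not loaded"
        fi
    done
fi
```

If acpi-cpufreq is reported as not loaded and the hang still occurs, the blacklist worked and that module is unlikely to be the trigger, so removing the .ko file would probably not add information.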
So after the last set of experiments, I realized I had not looked at dmesg for a while. There is an ACPI warning, pasted below. Interestingly, it is just below where the system would normally hang! Googling for the ACPI warning, there are some recent reports around it, at least the first two reporting a hang or lockup of X:

https://bugs.freedesktop.org/attachment.cgi?id=117111&action=edit
https://lists.debian.org/debian-x/2015/06/msg00095.html
https://forums.gentoo.org/viewtopic-p-7781756.html?sid=0c2eb14b626e7550d394345772ba601a#top

--------
[ 3.782610] hid-generic 0003:046D:C01B.0002: input,hidraw1: USB HID v1.10 Mouse [Logitech USB-PS/2 Optical Mouse] on usb-0000:00:13.0-1/input0
[ 3.798793] input: HOLTEK USB Keyboard as /devices/pci0000:00/0000:00:12.0/usb4/4-5/4-5:1.1/0003:04D9:A085.0003/input/input3
[ 3.849456] hid-generic 0003:04D9:A085.0003: input,hidraw2: USB HID v1.10 Mouse [HOLTEK USB Keyboard] on usb-0000:00:12.0-5/input1
[ 4.002437] Switched to clocksource tsc
[ 4.376601] [drm] ib test on ring 5 succeeded
[ 4.377028] [drm] Radeon Display Connectors
[ 4.377034] [drm] Connector 0:
[ 4.377039] [drm] VGA-1
[ 4.377044] [drm] DDC: 0x7e40 0x7e40 0x7e44 0x7e44 0x7e48 0x7e48 0x7e4c 0x7e4c
[ 4.377051] [drm] Encoders:
[ 4.377055] [drm] CRT1: INTERNAL_KLDSCP_DAC1
[ 4.377059] [drm] Connector 1:
[ 4.377063] [drm] DVI-D-1
[ 4.377067] [drm] HPD1
[ 4.377072] [drm] DDC: 0x7e50 0x7e50 0x7e54 0x7e54 0x7e58 0x7e58 0x7e5c 0x7e5c
[ 4.377079] [drm] Encoders:
[ 4.377083] [drm] DFP1: INTERNAL_KLDSCP_LVTMA
[ 4.441898] [drm] fb mappable at 0xD0359000
[ 4.441906] [drm] vram apper at 0xD0000000
[ 4.441911] [drm] size 8294400
[ 4.441915] [drm] fb depth is 24
[ 4.441919] [drm] pitch is 7680
[ 4.442036] fbcon: radeondrmfb (fb0) is primary device
[ 4.447652] Console: switching to colour frame buffer device 240x67
[ 4.454777] radeon 0000:01:05.0: fb0: radeondrmfb frame buffer device
[ 4.454813] radeon 0000:01:05.0: registered panic notifier
[ 4.457650] [drm] Initialized radeon 2.42.0 20080528 for 0000:01:05.0 on minor 0
[ 4.476789] PM: Starting manual resume from disk
[ 4.476827] PM: Hibernation image partition 8:2 present
[ 4.476828] PM: Looking for hibernation image.
[ 4.477083] PM: Image not found (code -22)
[ 4.477087] PM: Hibernation image not present or could not be loaded.
[ 4.493388] BTRFS info (device sda1): use ssd allocation scheme
[ 4.493433] BTRFS info (device sda1): disk space caching is enabled
[ 4.692211] scsi 10:0:0:0: Direct-Access Lexar USB Flash Drive 1100 PQ: 0 ANSI: 6
[ 4.692509] sd 10:0:0:0: Attached scsi generic sg2 type 0
[ 4.694186] sd 10:0:0:0: [sdb] 15671296 512-byte logical blocks: (8.02 GB/7.47 GiB)
[ 4.694949] sd 10:0:0:0: [sdb] Write Protect is off
[ 4.694987] sd 10:0:0:0: [sdb] Mode Sense: 22 00 00 00
[ 4.695717] sd 10:0:0:0: [sdb] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[ 4.700850] sdb: sdb1 sdb2
[ 4.705350] sd 10:0:0:0: [sdb] Attached SCSI removable disk
[ 4.891513] systemd-journald[205]: Received SIGTERM from PID 1 (systemd).
[ 5.058420] BTRFS info (device sda1): disk space caching is enabled
[ 5.079292] random: nonblocking pool is initialized
[ 5.085206] systemd-journald[448]: Received request to flush runtime journal from PID 1
^^^^^^ INTERESTINGLY, THIS IS ROUGHLY WHERE THE PROCESS TYPICALLY HANGS ^^^^^
[ 5.159629] shpchp: Standard Hot Plug PCI Controller Driver version: 0.4
[ 5.162267] wmi: Mapper loaded
[ 5.165034] ACPI Warning: SystemIO range 0x0000000000000B00-0x0000000000000B07 conflicts with OpRegion 0x0000000000000B00-0x0000000000000B0F (\SOR1) (20150410/utaddress-254)
[ 5.165047] ACPI: If an ACPI driver is available for this device, you should use it instead of the native driver
[ 5.165473] sp5100_tco: SP5100/SB800 TCO WatchDog Timer Driver v0.05
[ 5.165540] sp5100_tco: PCI Revision ID: 0x42
[ 5.165575] sp5100_tco: Using 0xfed80b00 for watchdog MMIO address
[ 5.165589] sp5100_tco: Last reboot was not triggered by watchdog.
[ 5.166422] sp5100_tco: initialized (0xffffc900018c2b00). heartbeat=60 sec (nowayout=0)
[ 5.174157] input: Power Button as /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0C0C:00/input/input4
[ 5.174426] ACPI: Power Button [PWRB]
[ 5.174534] input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input5
[ 5.174582] ACPI: Power Button [PWRF]
[ 5.198804] xhci_hcd 0000:02:00.0: xHCI Host Controller li
Would anyone have an idea what I can do next to help get over this issue? It seems fairly serious and worth looking into further, but I have already tried all I can think of. Is waiting for the beta the suggested option? Thanks
Sorry I've been on vacation. And, yes, now the best option is to try beta1 installation. If this still shows the same issue, please try the kernel in OBS Kernel:openSUSE-42.1 repo, which is based on 4.1.8 now.
Thanks for following up. The issue is the same after update to 42.1 Beta. After many long searches I came to a hard to prove and (of course quite possibly incorrect) conclusion this has something to do with the Audio support in the kernel; there are dmesg like snd_hda_codec_hdmi hdaudioC1D0: HDMI ATI/AMD: no speaker allocation for ELD just at the time the system normally hangs. Next Steps ----- I will install the kernel you suggested - I suppose just zypper up (with the online repos enabled?) - and report
(In reply to milan zimmermann from comment #25)
> Thanks for following up.
>
> The issue is the same after update to 42.1 Beta.
>
> After many long searches I came to a hard to prove and (of course quite
> possibly incorrect) conclusion this has something to do with the Audio
> support in the kernel; there are dmesg like
>
> snd_hda_codec_hdmi hdaudioC1D0: HDMI ATI/AMD: no speaker allocation for ELD
>
> just at the time the system normally hangs.

This is an utterly harmless message, normally seen when a monitor without a speaker is plugged in, and it must be just a coincidence that it is the last one shown. I can say that because I am the upstream maintainer of the sound subsystem :)

What we really need is to figure out whether this is really a kernel hang. If yes, what kind of hang.

Since you don't get any kernel Oops or panic message, it doesn't look like a normal kernel hang due to a kernel bug, but either a hardware hang (a hardware defect or a hang caused by a driver bug) or some bad task blocking the whole system. As acpi=off seems to cure it, the odds favor the former.

If so, it's tough to figure out. You need to start from a minimal system that works reliably by disabling as many hardware components as possible, then enable them piece by piece until it hits the issue again.

Or you can try older kernels. For example, the 3.11.x kernel in openSUSE-13.1, 3.12.x for SLE12, 3.16.x for openSUSE-13.2. If any older kernel works, we may try bisection to spot the regression.
(In reply to Takashi Iwai from comment #26)
> (In reply to milan zimmermann from comment #25)
> > Thanks for following up.
> >
> > The issue is the same after update to 42.1 Beta.
> >
> > After many long searches I came to a hard to prove and (of course quite
> > possibly incorrect) conclusion this has something to do with the Audio
> > support in the kernel; there are dmesg like
> >
> > snd_hda_codec_hdmi hdaudioC1D0: HDMI ATI/AMD: no speaker allocation for ELD
> >
> > just at the time the system normally hangs.
>
> This is an utterly harmless message, found normally when plugging with a
> monitor without a speaker, and it must be just coincidence that this is seen
> at last. I can say it because I am the upstream maintainer of sound
> subsystem :)

Great, thanks, I will not push in that direction.

> What we really need is to figure out whether this is really a kernel hang.
> If yes, what kind of hang.
>
> Since you don't get any kernel Oops or panic message, it doesn't look like a
> normal kernel hang due to a kernel bug, but either a hardware hang (hardware
> defect or hang by a driver bug) or some bad task blocking the whole system.
> As acpi=off seems curing, the odd is more to the former.

Regarding acpi=off: To be precise, having booted to "rescue system" with acpi=off works 100%. But if I set acpi=off in my hard disk boot, there is no difference and it mostly hangs.

> If so, it's tough to figure out. You need to start from a minimal system
> that works reliably by disabling the hardware components as much as
> possible, then enable piece by piece until it hits the issue again.

As best I can tell, I did everything I can think of. I have disabled devices in the BIOS. I have pulled every plug, USB and otherwise, including the monitor, and when I plug the monitor back in, the system shows the hang message.

> Or you can try older kernels. For example, 3.11.x kernel in openSUSE-13.1,
> 3.12.x for SLE12, 3.16.x for openSUSE-13.2. If any older kernel works, we
> may try bisection to spot out the regression.

I think trying 13.2 is worth it. I do not want to go to 13.1, as I use btrfs and am not sure it is supported there, but if it is, I can try that. Would you have advice on how to add the 13.2 repo so that the kernel is taken from it? Thanks
So I installed the kernel from this repo: http://download.opensuse.org/repositories/Kernel:/openSUSE-13.2/standard the version is 3.16.7-100. Out of 10 boots (5 power off/on, 5 resets) it booted 10 times - 100%. (Going back to 42.1 kernel failed 3 out of 4). So I am now reasonably convinced it is a Kernel regression somewhere along the way. Attaching the dmesg from the 3.16.7-100 boot. Can I provide more info?
Created attachment 649549 [details] dmesg from 3.16.7. So far, the system boots successfully 100% of time with this kernel.
There are a few other kernels in OBS home:tiwai:kernel:3.17, home:tiwai:kernel:3.18, ... up to :4.0. You can try it until you hit the same problem. 3.17.x might be already problematic, from my very wild guess.
Just added http://download.opensuse.org/repositories/home:/tiwai:/kernel:/3.17/standard/ as a repo, will switch the kernel and test it.
(In reply to milan zimmermann from comment #31) > Just added > http://download.opensuse.org/repositories/home:/tiwai:/kernel:/3.17/standard/ > as a repo, will switch the kernel and test it. This kernel (3.17.6-1.g12b7bf1-desktop) booted 5 times out of five. That is probably enough text for this one, but will try a few more tomorrow. Dmesg attached. Will test 3.18 and 3.19 next - tomorrow after I get some work done, it is way after midnight here, will report here.
Created attachment 649557 [details] dmesg from 3.17 kernel. worked 5 out of 5
**It seems the problem starts with kernel 3.18** Ok, so next I installed http://download.opensuse.org/repositories/home:/tiwai:/kernel:/3.18/standard/ and the boot issue started happening. Out of 7 attempts, 5 failures and 2 successes. I am attaching dmesg from 3.18 (after a successful boot) and also the differences between 3.18 (left) and 3.17 (right). There are a number of interesting things in the 3.18 dmesg that do not exist in 3.17, but I have no idea about their significance. What can we do next?
Created attachment 649706 [details] dmesg from 3.18, the boot problems start here.
Created attachment 649707 [details] dmesg diff 3.18 (left) and 3.17 (right) - result of: diff <(cat ~/tmp/dmesg-kernel-3.18.6-1.1 | sed 's/.*] //') <(cat ~/tmp/dmesg-kernel-3.17.6-1.g12b7bf1-deskto | sed 's/.*] //') > ~/tmp/dmesg_diff_
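For anyone repeating this comparison: the sed expression 's/.*] //' used for the attachment is greedy, so it also removes tags such as "[drm] " that follow the timestamp. A minimal sketch of a variant that strips only the leading timestamp is shown below; the function names strip_dmesg_timestamps and dmesg_diff are made up for illustration, and temporary files are used instead of process substitution so that it also runs in a plain POSIX shell.

```shell
#!/bin/sh
# strip_dmesg_timestamps [FILE...]
# Removes only the leading "[ seconds.micros] " timestamp from each line,
# leaving later bracketed tags such as "[drm]" untouched.
strip_dmesg_timestamps() {
    sed 's/^\[ *[0-9][0-9.]*] //' "$@"
}

# dmesg_diff LEFT RIGHT
# Hypothetical helper: diffs two saved dmesg logs with timestamps removed.
# Returns the exit status of diff (0 = identical, 1 = different).
dmesg_diff() {
    left=$(mktemp)
    right=$(mktemp)
    strip_dmesg_timestamps "$1" > "$left"
    strip_dmesg_timestamps "$2" > "$right"
    status=0
    diff "$left" "$right" || status=$?
    rm -f "$left" "$right"
    return $status
}

# Usage, with two saved logs (file names here are placeholders):
#   dmesg_diff dmesg-3.18.txt dmesg-3.17.txt > dmesg_diff.txt
```

Lines that differ only in timing then drop out of the diff, which makes driver messages that are new in one kernel easier to spot.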
As a note, I deduced in the past that if the boot hangs, it corresponds to the time between 4 seconds and 6 seconds into dmesg. Interestingly, the 3.18 boot has this at that time:

[ 4.106244] radeon 0000:01:05.0: registered panic notifier
[ 4.112156] [drm] Initialized radeon 2.40.0 20080528 for 0000:01:05.0 on minor 0
[ 6.116215] floppy0: no floppy controllers found
[ 6.139987] PM: Starting manual resume from disk

Not sure that means anything though.
Hmm, I spoke too early, 3.17 has the same message but earlier. I will stop speculating and leave it to the pros; in any case the issue clearly seems to have started in 3.18. Thanks.
I should summarize as I added some noise: - From all I can tell, the bug of unsuccessful boots has been introduced between kernels 3.17.6-1.g12b7bf1-desktop (works) and 3.18.6-1.1 (often hangs on boot). Dmesg attached. Thanks
Hi Takashi, First I appreciate your help on this item. I have been on 3.17.6-1.g12b7bf1-desktop five days, booting as a test several times a day, without a single boot issue. So it does appear there is some kind of regression from 3.17.6-1.g12b7bf1-desktop to 3.18.6-1.1 (which does have this boot issue). Can I do something else to help resolve? Perhaps installing a different distro with kernel 3.18 or higher on a USB? Or ask on the factory mailing list if others have similar issue? My hardware is about 4 years old, both the motherboard and CPU came out new then, so I'd wild guess that hardware of the same type 2-4 years old would have the same issue, so it seems worth following up for others on similar hardware. Thanks Milan
Sorry for the long latency of my response, as I've been traveling often lately. And thanks for narrowing down the range.

From this point on, we need more detailed tests from you: namely, compiling the kernel yourself and finding the regression point via git bisect.

As a start, install the gcc, git-core and make packages if not yet present. Then clone the linux kernel git repo from upstream:

    % mkdir /somewhere/kernel
    % cd /somewhere/kernel
    % git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

where /somewhere/kernel is any directory you prefer to put the repo and build in. Note that this will fetch a lot of data from the net.

Then, boot with the 3.17 kernel. Then, check out v3.17, and do localmodconfig.

    % cd /somewhere/kernel/linux
    % git checkout v3.17
    % make localmodconfig

This might ask you some questions; usually just press RETURN for each to choose the default. In the end, you'll have a ".config" file matching your running system. Edit this file with a text editor and change CONFIG_LOCALVERSION to a different value, e.g. "-bisect".

After the configuration, do make. Better to use -j4 or similar to build in parallel.

    % make -j4

Once everything is done, install it. Become root, then install the modules and the kernel.

    % su
    % make modules_install
    % make install

Then reboot to this kernel, and confirm that it still works.

The next step is to check 3.18, which should be broken.

    % cd /somewhere/kernel/linux
    % git checkout v3.18

At this point, better to back up your old config

    % cp .config .config.3.17

and run make oldconfig to adjust it to 3.18.

    % make oldconfig

Rebuild, reinstall, and retest. Confirm that 3.18 is really broken.

Now we start the git bisection. Boot any good working kernel. Maybe good to back up the 3.18 config again before starting the bisection.

    % cd /somewhere/kernel/linux
    % cp .config .config.3.18

OK, here we go.

    % git bisect start
    % git bisect good v3.17
    % git bisect bad v3.18

This will lead you to some commit between 3.17 and 3.18. Do make localmodconfig, rebuild, reinstall, and retest. If this test kernel works, give "git bisect good". Otherwise reboot with a working kernel, and give "git bisect bad". Then git will lead you to yet another point, and you rebuild, reinstall and retest. Repeat this until you hit the regression point. You can see the current bisection progress at any time via "git bisect log".

Note that git bisect sometimes jumps to an old kernel version suddenly. This is normal, as the commit is actually based on that version. So, when installing and testing the kernel, be careful about which kernel you're testing.

While testing, you can remove the old installed kernels. I don't know a good default way to do that, but you can just remove the /lib/modules/$VERSION directory as well as /boot/vmlinuz-$VERSION, /boot/initrd-$VERSION and /boot/System.map-$VERSION. After cleaning up, run the command below as root

    /usr/sbin/grub2-mkconfig -o /boot/grub2/grub.cfg

to refresh the GRUB menu.

Oh, and after the bisection is done, run "git bisect reset" to reset the bisection state. Don't forget to save your bisection procedure from "git bisect log" beforehand.
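Because the hang is intermittent, a single successful boot is weak evidence that a bisection step is "good": the 3.18 test above failed 5 times out of 7, so a bad kernel still boots roughly 2 times in 7. The sketch below estimates how many clean boots to require before typing "git bisect good". It is an illustration under stated assumptions, not part of the original instructions: the function names are invented, and it assumes each boot hangs independently with the same fixed probability.

```shell
#!/bin/sh
# false_good_probability FAIL_RATE BOOTS
# Chance that a kernel which hangs with probability FAIL_RATE on each boot
# nevertheless survives BOOTS consecutive boots: (1 - FAIL_RATE)^BOOTS.
false_good_probability() {
    awk -v p="$1" -v n="$2" 'BEGIN { printf "%.4f\n", (1 - p) ^ n }'
}

# boots_needed FAIL_RATE MAX_RISK
# Smallest number of consecutive clean boots for which the chance of
# mislabelling a bad kernel as good is at most MAX_RISK.
boots_needed() {
    awk -v p="$1" -v r="$2" 'BEGIN {
        if (p <= 0) exit 1          # a kernel that never hangs cannot be caught
        n = 0; q = 1
        while (q > r) { q *= (1 - p); n++ }
        print n
    }'
}

# Example: hang rate of 5/7 as observed with 3.18, accepting a 1% risk
# of a wrong "good" verdict per bisection step.
boots_needed 0.714 0.01
```

A single failed boot, on the other hand, is enough to mark a step "bad", since the good kernels in this report never hung. A wrong "good" verdict sends the bisection into the wrong half of the history, so a few extra reboots per step are usually cheaper than repeating the bisection.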
Thanks Takashi for the detail description. I will work on this this weekend, and report back. I am reading your instructions, and also about git bisection and I think I know what to do and how. If the process takes too long, I may have to finish it next weekend though, will see how it goes. If I run into things I cannot resolve, I will ask here. Thanks.
this problem has been around for a while and only seems to manifest on milan's machine (as far as we know).
Milan, does this issue persist with the latest kernel? Thanks for the update.
No response, closing. Please reopen if there is anything new.
Hi, I am sorry for my late response here. (I was away for a month mid-summer and then there was too much email.) I can report **this issue is indeed fixed in Leap 42.2 Alpha 3**. I installed Alpha 3 about 2 weeks ago, and the system booted every single time (around 30 attempts or more). Thanks for the great work of the openSUSE team. Also for all the work and help Takashi put in last fall to help me. Takashi, I apologize for not finishing the git bisection process last fall/winter. After spending two weekends I gave up, and kept using your older working kernel till now. Thanks Milan
Fixed in 42.2 Alpha 3
OK, thanks for your information update.