Bug 1172541 - Mysterious crashes with kernel 5.6.14, suspect KVM
Status: RESOLVED WORKSFORME
Classification: openSUSE
Product: openSUSE Tumbleweed
Component: Kernel
Version: Current
Hardware: x86-64
OS: All
Priority: P5 - None
Severity: Normal
Target Milestone: ---
Assigned To: openSUSE Kernel Bugs E-mail List
Reported: 2020-06-05 05:24 UTC by James Carter
Modified: 2022-01-17 08:50 UTC
CC List: 4 users

Description James Carter 2020-06-05 05:24:17 UTC
Version: kernel-default-5.6.14-1.1.x86_64 and aarch64 for Tumbleweed

I did a dist-upgrade on 2020-06-02.  After I waited for 
download.opensuse.org to recover from its illness, the upgrade went
smoothly.  But after I rebooted all the machines, 5 to 10 minutes
later two of them went catatonic, and a third (my laptop) froze up
a few minutes after being booted up two days later.  After I power cycled
the gateway machine, it again froze about five minutes after booting. 

On those hosts I reverted (by the grub menu) to
kernel-default-5.6.12-1.3.x86_64 and aarch64.  There were no further
catatonic incidents.  The other hosts continued (on 5.6.14) for up to
36 hours and did not go catatonic (but I eventually reverted them also).
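
(For the record, the revert can also be done non-interactively; a rough
sketch, assuming the standard GRUB2 tools on openSUSE, and the menu entry
title below is only illustrative and must be copied from the machine's
actual /boot/grub2/grub.cfg:)

    # Boot the older kernel on the next reboot only; the entry title
    # must match one listed in /boot/grub2/grub.cfg (this one is made up).
    grub2-reboot "Advanced options for openSUSE Tumbleweed>openSUSE Tumbleweed, with Linux 5.6.12-1-default"
    reboot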

Here are the symptoms:  On two machines I was working on the console
(XFCE) when it happened.  It seemed perfectly normal, but suddenly
keystrokes had no effect and continuously updating apps (load meter
etc.) ceased updating.  On the gateway, network through-traffic was no
longer forwarded.  After I rebooted it I checked syslog (debug level).
After the last normal message (e.g. DHCP lease renewed) there was a
burst of \0's, and the dmesg dump appeared immediately after, starting
with "microcode updated early to revision..." and "Jun  4 08:47:03
jacinth kernel: [    0.000000] Linux version 5.6.12-1-default
(geeko@buildhost) (gcc version 9.3.1 20200406 [revision
6db837a5288ee3ca5ec504fbd5a765817e556ac2] (SUSE Linux)) #1 SMP Tue May
12 17:44:12 UTC 2020 (9bff61b)", i.e. the very first expected messages
from dmesg.  (I carefully booted it into 5.6.12 not 5.6.14.) I have not
been able to find any useful symptoms that might shed light on what is
killing the machines. 

Machines running 5.6.12 did not crash, both before the dist-upgrade and
after I reverted.  

For what it's worth, here is some data about the various machines.  
These went catatonic: 
jacinth     Gateway, directory server, never sleeps.  Intel NUC6CAYH,
            Celeron(R) CPU J3455 @ 1.50GHz, Intel HD Graphics 500
            (i915). It runs hostapd as a wireless access point, driver
            8812au from
            rtl8812au-kmp-default-5.6.4.2+git20200318.49e98ff_k5.6.12_1-1.7.x86_64
xena        Laptop, wireless, powered off when the human is sleeping.  
            Acer Aspire A515-54-51D1, Intel(R) Core(TM) i5-8265U 
            CPU @ 1.60GHz, Intel UHD Graphics 620 (Whiskey Lake) (i915)
diamond     Normal desktop.  When it crashed nobody was using it.  At
            night it hibernates.  Intel NUC7i5BNH, Core(TM) i5-7260U CPU
            @ 2.20GHz, Intel Iris Plus Graphics 640 (i915)

These survived up to 36 hours on 5.6.14 (called "non-catatonic" below):
claude      Webserver (1.5 hits/min). VM (KVM) on Jacinth, x86_64, 
            video=cirrus
holly       Desktop replacement.  Raspberry Pi-3B, Broadcom BCM2837 
            @1.2GHz and VideoCore IV (vc4).  aarch64.  RPi's can't sleep.  
iris        Audio-video player (17kbyte/sec), hibernates when unused.  
            Intel NUC6CAYH, Celeron(R) CPU J3455 @ 1.50GHz (like Jacinth),
            Intel HD Graphics 500 (i915)
oso         Development VM (KVM) on Diamond, x86_64, video=cirrus
petra       Development VM (KVM) on Xena, x86_64, video=cirrus
surya       Cloud server, VM (KVM) at Linode, never sleeps :-), Intel(R) 
            Xeon(R) CPU E5-2680 v3 @ 2.50GHz, no GPU at all.  

Thinking that X-Windows activity might be correlated with catatonia,
for about 6 hours I set the non-catatonic machines (except Surya, which
has no console) to show screensaver eye candy continuously.  I was wrong;
none of them crashed.

Iris hardware is identical to Jacinth, yet it did not go catatonic while
Jacinth failed more times than any other host.  

I don't really know what effective action could be taken, beyond trying
to guess which kernel patch may have been the culprit.  Anyway, thank
you for whatever you can do with this info.
Comment 1 Takashi Iwai 2020-06-05 08:40:19 UTC
Unfortunately it's very hard to guess what's going on there from that
information alone.
We have already moved to the 5.7 kernel, and the update is on its way to TW.
Could you try the 5.7 kernel from the OBS Kernel:stable repo and see whether the issue persists?
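
Something like the following should pull it in (a rough sketch; the repo
URL follows the usual OBS download pattern and may need adjusting):

    # Add the Kernel:stable repository and install its kernel-default
    zypper addrepo -f \
        https://download.opensuse.org/repositories/Kernel:/stable/standard/ \
        kernel-stable
    zypper refresh
    zypper install --from kernel-stable kernel-default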

If a crash happens, it'd be great if you can get a trace of Oops or whatever.
Comment 2 James Carter 2020-06-07 00:10:57 UTC
I installed kernel-default-5.7.0-3.1.gad96a07.x86_64 from OBS 
Kernel:stable.  Unfortunately the most revealing machines could not use
it: Xena uses EFI boot (it can't be switched to "legacy"), and the
kernel's signature is not acceptable to the EFI signature checker.
Jacinth is the Wi-Fi
access point and needs rtl8812au-kmp-default-5.6.4.2+$git.k$version, which
is not available for kernel 5.7.0.  Anyway the other machines have been
running about 24 hours with no catatonia.  I'll check back when the
production version is out.  

I was reading changelogs at
https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-$version.
5.6.13 and 5.6.16 had nothing relevant; 5.6.15 had a fix for i915 that
also seemed irrelevant.  5.6.14 had a hit:
Commit f18d3beac6bc2e1bddfba202f5327200acbda54c by Chris Wilson.
This is a fix (possibly gone awry) for the following issue and 2 others:
https://gitlab.freedesktop.org/drm/intel/-/issues/1746, opened by Arnout
Engelen ("1 month ago", i.e. early May 2020): "i915_gem_evict_something
stalls".  He was running kernel 5.7.0-rc1 and got what sounds a lot like
my problem, including no useful log output.  He and Chris Wilson worked
out which kconfig items to turn on (requiring a kernel rebuild), and he
got 400 MB of log files which he gzipped and posted :-)  He
cross-references two seemingly relevant bugs from kernels 5.5.2 and 5.3.8.
I never saw the actual changelog for 5.7.0-rc1.

Having got a hit for i915, and since I was becoming impatient reading
all the non-hits, I decided to limit the search to i915 only.  The
next one was filed under "5.7".
Commit e7cea7905815ac938e6e90b0cb6b91bcd22f6a15
Linus pulls a bunch of DRM fixes, including i915 fixes, among them
something from Chris Wilson closely related to the one above under 5.6.14.
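
(For reference, the narrowed search can be scripted roughly like this; a
sketch only, using one of the ChangeLog files from the URL above:)

    # Pull one changelog and show only the stanzas that mention i915
    # (adjust the version as needed).
    curl -s https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.6.14 \
        | grep -i -B 2 -A 10 i915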

I'm a little confused by the timing and version interlocking, but I'm
calling it like this: Arnout Engelen is running NixOS with a
self-compiled kernel, which was 5.7.0-rc1.  Chris Wilson came up with a
fix which landed in 5.6.14 and bit me.  I'm guessing that something was
wrong with the fix, and Chris Wilson improved it (I never saw the 
changelog stanza for this).  Linus eventually got it in time for 5.7.0
(post-rc1).

I hope I haven't fixated on a plausible but irrelevant issue, and the
real problem is in some of the changes that I skipped over.  

About the stack traceback: if I read the changelog right, in rare cases
it's possible for the i915 driver to get into an infinite loop looking
for tasks to evict from an assignment table.  That would explain sudden
catatonia.  But there's no "oops" and nothing is written to syslog or
dmesg.
Comment 3 Takashi Iwai 2020-06-08 13:26:43 UTC
OK, then let's wait until 5.7 kernel gets merged to TW snapshot.
Comment 4 James Carter 2020-06-24 00:07:07 UTC
I installed kernel-default-5.7.1-1.2.x86_64 (except aarch64 on one 
machine).  The three machines that froze up on
kernel-default-5.6.14-1.1.x86_64, also froze up on 5.7.1.  I kept the
other six on 5.7.1 for about 48 hours with no freezeups.  Again there
was no trace of the problem in the logs when it happened.

I have two machines (Intel NUC6CAYH, Celeron J3455, Intel HD Graphics
500, 2x4Gb RAM, Realtek RTL8111/8168/8411 NIC) with identical hardware,
for a quick replacement if the more critical one fails: this is the main
router, music library, directory server, etc.  The other is a leaf node
that does audio performance.  The router is always the first to freeze
up, whereas the leaf node has never frozen yet, despite substantial net
traffic for the music.  I suspect that hardware is irrelevant, but it's
way too soon to blow off this possibility.  I suspect that the roles
exercise a vulnerable data path on the router, but there is no evidence
of this either.

However, when I put 5.7.1 on my "non-failing" hosts I got a rash of
connection failures.  This is like chasing ghosts: here's the issue that
I worked on first, since it was the most mission-critical.  I don't
seriously expect anyone to figure it out, but I'm getting it on record
in case it provides a clue.  My publicly exposed webserver is on a VM
(running 5.7.1 on the Celeron).  IPv4+6 connections from the wild side
come into the main router (running 5.6.12) and get DNAT to the
webserver.  HTTP and HTTPS on IPv4 work.  IPv6 from the wild side used
to work but does not now; with the "comorbid" network issues, IPv6 is too
complicated to give any clues.  A tester on the webserver VM attempts
https://www.jfcarter.net/ and it times out, both IPv4 and IPv6.  I did
lots of troubleshooting including tcpdumps and reboots (into 5.7.1). The
client would initiate a TLS connection with SNI, the server would send
its cert which the client would accept as valid, but the server then
just sat there, not sending a TLS session ticket.  Finally I tried a
scorched earth solution: rebooted all the leaf nodes and VMs into
5.6.12.  The net issues miraculously vanished (not seen for 2 days), and
specifically, web service was back to normal.
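
(For reference, the stall was visible from the client side roughly like
this; a sketch of the kind of check I was doing, not a verbatim
transcript:)

    # Attempt a TLS handshake with SNI against the webserver.  In the
    # failing state the server certificate arrives but the handshake
    # never completes and the command eventually times out.
    openssl s_client -connect www.jfcarter.net:443 \
        -servername www.jfcarter.net </dev/null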

It looks like kernel 5.7.2 is now out.  I'm going to install it on one
vulnerable machine and see if it freezes.  But in the meantime I'm going
to learn to use git and I'll try to use the bisect feature to find the
commit that messed it up.  I wish there were a symptom that would appear
immediately and leave the machine alive enough that you could log in
remotely and boot back to 5.6.12.  But the world isn't arranged for our
convenience.  
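
(For the record, the bisect would go roughly like this; a sketch only,
since each step needs a kernel build, install, reboot, and then a long
wait to see whether the host freezes:)

    # Bisect between the last good and first bad releases in the stable
    # tree (URL is the usual kernel.org stable repository).
    git clone https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
    cd linux
    git bisect start
    git bisect bad v5.6.14     # first version seen to freeze
    git bisect good v5.6.12    # last version known to be good
    # ...build, install, boot, test, then mark each candidate:
    git bisect good            # or: git bisect bad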

If I ever accidentally uninstall 5.6.12 I'll have a real problem.
Several times I've wanted to revert a bad update, but my only recourse
was either to wait for SuSE to release a fixed version or to fix it
myself, e.g. https://bugzilla.opensuse.org/show_bug.cgi?id=1172256 .
The present issue (bug 1172541) finally got me moving to set up an
archive of all RPMs installed on any of my machines, keeping no longer
installed versions for a month or two, in case of reversion.  Did I
reimplement the wheel, that is, does SuSE have such an archive server?
If not, maybe you should.
Comment 5 Takashi Iwai 2020-06-25 13:36:01 UTC
FYI, Tumbleweed has some archives of the former snapshot releases, available at
  http://download.opensuse.org/history/
But it doesn't keep them forever, only for a short period.  You'd need to use zypper addlock to keep a certain package locally.
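
For example, something like this (a rough sketch; zypper ll lists the
locks and zypper rl removes one):

    # Prevent zypper dup from touching the kernel-default package
    zypper addlock kernel-default
    zypper locks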
Comment 6 James Carter 2020-06-26 23:12:56 UTC
"Good" news from a test: I have these two identical machines, one with a
VM and one without; the VM host freezes reliably and the leaf node has
never been seen to freeze.  OK, on the leaf node (Iris) I installed
qemu-kvm-5.0.0-3.1.x86_64 libvirt-daemon-6.4.0-1.1.x86_64 and numerous
friends (the latest) and created a new VM called Orion, basically 
copying the XML and software selection from my other VMs.  For testing
I ran a console session on the guest (VNC through virt-viewer) and set 
the guest's screensaver to show random eye candy, giving the guest and
host consistent but not excessive net traffic and CPU load.  (I have the
impression that the eye candy is not necessary; just being booted is
enough; but that's not assured and it's more prudent to load up the
guest, to maybe avoid a false negative.)  
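
(Roughly, the cloning went like this; a sketch with illustrative names,
since the copied XML needs the usual edits before it can be defined:)

    # Copy an existing guest definition and register it as Orion.
    # The name, uuid, MAC address and disk image path in orion.xml
    # must be changed before defining; file and domain names here are
    # only placeholders.
    virsh dumpxml petra > orion.xml
    # edit orion.xml: <name>, <uuid>, <mac address=...>, <source file=...>
    virsh define orion.xml
    virsh start orion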

With kernel 5.6.12 on the host and 5.7.1 on the guest, it ran for 14 
hours without freezing.  With 5.7.1 on both the host and the guest, it
ran for 5 hours and the host froze up, being rebooted by its watchdog
timer.  I can't tell if the guest had any issues just before the host
froze; I was in bed at the time.  As usual, the log files on both the
host and guest look perfectly normal, then end with a block of \0, 
and continue on with normal-looking output from the rebooted session.  

So this does support the hypothesis that machines running kernel 5.6.14+
(here, 5.7.1) will freeze if and only if they are host to a VM.  It does
not support the alternative fact that the recent update to libvirt and
qemu may have made the host stop freezing.  In the upgrades to 5.6.14
and kernel-default-5.7.0-3.1.gad96a07.x86_64, two of the VM hosts would
freeze within 5 minutes of booting (tried several times each), but the
hosts that are less mission-critical and more convenient for testing take
a long time to freeze, if they freeze at all.

It's going to be hard to get developers to pay attention to this problem
if they can't reproduce it in-house, and I'm going to try to put
together an easily describable software and VM configuration that will
freeze up. 

@tiwai, I retrieved kernel-default-5.6.12-1.3.x86_64 from 
http://download.opensuse.org/history/ , thanks for the link!
Comment 7 Miroslav Beneš 2020-10-16 13:00:31 UTC
James, is there anything new?  TW is on the 5.8.x kernel now (and soon 5.9.x), so there may be some improvement.

CCing Joerg. Joerg, I think comment 6 explains the problem well, but there is unfortunately nothing much to base further investigation on.
Comment 8 Joerg Roedel 2020-10-19 12:27:10 UTC
Do any of the machines where you can reproduce the bug have a serial port to catch the kernel log output while the machine freezes?  That would be very helpful for tracking the issue down.

Thanks, Joerg
Comment 9 James Carter 2020-10-22 00:43:30 UTC
@miroslav: During the summer I tried to investigate but the problem is,
the thing won't fail reliably, which makes bisection very hard.  And
then USA politics went into underdrive, which took a lot of time and
focus away from "important" work like the kernel bug.

I did a test last night (2020-10-21 AM).  First I tried
kernel-default-5.8.14-1.2, but unbound wouldn't start (see bug 1177715).
I reverted to 5.7.9-1-default (on the VM host), started the guest, gave
it some minimal work (other host logs in with X-Windows propagation,
guest pings various hosts and shows a graph of the results with
X-Windows), and let it run.  Every 3 hours both hosts do some monitoring
and auditing tasks which fork off up to 10 tests in parallel, 50-60
tests overall.  I've noticed that the host is most likely to crash
during these tests, but my attempts to point the finger of blame at a
specific test were fruitless.  The tests at midnight went OK (no
complaints, no crash), but 03:04 was the last message in syslog on both
machines, and the host crashed (no watchdog).  So kernel 5.7.9 doesn't
have a fix.

It's interesting that neither the host nor the guest's syslog has a
block of \0's after the last message, while back in June they did.  That
means that both then and now the log file's inode (its size) gets written
out, showing how much data had been appended, but formerly the data blocks
themselves were not yet synced to disc, while now they are synced more
promptly.
rsyslogd has an option to sync after every write, which I didn't turn
on, and perhaps I should.  [Done.]
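
(For reference, this is controlled per file action in /etc/rsyslog.conf;
a sketch, the selector and path below are only illustrative:)

    # Legacy-syntax sketch: enable syncing after each write.  Without
    # $ActionFileEnableSync on, rsyslogd never syncs; with it, a file
    # action syncs unless its path is prefixed with "-".
    $ActionFileEnableSync on
    *.*;mail.none                           /var/log/messages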

@joerg: None of my machines has a serial port.  I could try buying a
pair of USB serial dongles, with a wire between.  If I get this working
I'll post the result.  USB is useless while booting, but if the machine
is fully functional until the crash, the USB dongle should send at least
something.
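
(If I do try the dongles, the setup would be roughly as follows; a sketch
only, and it assumes the failing host's kernel has USB-serial console
support (CONFIG_USB_SERIAL_CONSOLE), which the distro kernel may well not
have:)

    # On the failing host: add to the kernel command line (e.g. in
    # GRUB_CMDLINE_LINUX in /etc/default/grub), keeping the normal console:
    #   console=tty0 console=ttyUSB0,115200n8
    # On the receiving machine: capture whatever arrives, logging to
    # screenlog.0 in the current directory.
    screen -L /dev/ttyUSB0 115200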

I upgraded the guest's packages, specifically libvirt, qemu-x86 and
numerous friends, to get recent important-seeming bugfixes that had not
been installed on the mothballed guest due to political distractions. 
I'll try the test again.  Wish me luck, I'm going to need it.
Comment 10 Miroslav Beneš 2022-01-14 13:28:54 UTC
This fell through the cracks, sorry about that. 

Is there anything new?  TW has the 5.16 kernel now, so the problem may have just disappeared.
Comment 11 James Carter 2022-01-17 05:22:44 UTC
@Miroslav, the problem hasn't happened again.  I think you're right, 
2 years of kernel development have either fixed it or at least made
it no longer fail on the things I do.  I think we could close this bug.
Comment 12 Miroslav Beneš 2022-01-17 08:50:54 UTC
ok, thanks for the feedback. Closing.