Bug 799475 - kernel bug in …/mm/slab.c:3175 and subsequent deep freeze
Summary: kernel bug in …/mm/slab.c:3175 and subsequent deep freeze
Status: RESOLVED WONTFIX
Alias: None
Product: openSUSE 12.2
Classification: openSUSE
Component: Kernel
Version: Final
Hardware: x86-64 openSUSE 12.2
Priority: P5 - None
Severity: Critical
Target Milestone: ---
Assignee: E-mail List
QA Contact: E-mail List
 
Reported: 2013-01-19 12:55 UTC by Anton Samsonov
Modified: 2014-08-08 20:29 UTC

Attachments
Photograph of the first two crashes (39.68 KB, image/png)
2013-01-19 12:57 UTC, Anton Samsonov
Photograph of the third crash (42.43 KB, image/png)
2013-01-19 12:57 UTC, Anton Samsonov
Photograph of a crash similar to the original one (47.69 KB, image/png)
2013-01-20 17:02 UTC, Anton Samsonov
Photograph of an alternate crash with the same “nohz=off highres=off” options as above (41.14 KB, image/png)
2013-01-20 17:04 UTC, Anton Samsonov
Photograph of the final screen of the same boot sequence as on previous picture (39.98 KB, image/png)
2013-01-20 17:05 UTC, Anton Samsonov
Photograph of a crash with “powersaved=off nohz=off highres=off processor.max_cstate=1” options (37.75 KB, image/png)
2013-01-20 17:07 UTC, Anton Samsonov

Description Anton Samsonov 2013-01-19 12:55:17 UTC
User-Agent:       Mozilla/5.0 (Windows NT 6.1; WOW64; rv:18.0) Gecko/20100101 Firefox/18.0

Preface. I use openSUSE occasionally, not on a day-to-day basis. To make things worse, my desktop has Intel ICH10R fake-RAID, so I'm not usually surprised when the mirror drops to “verify” state after a graceful openSUSE shutdown or after pressing the Reset button during a stalled shutdown, or when openSUSE turns into a pumpkin several months after installation and refuses to start until I reinstall it. This time it looked the same: after applying recent updates (a month ago, I guess), openSUSE stopped loading and fell back to single-user mode, perhaps after a failed /home mount operation; it's hard to tell from the intermixed systemd output.

But today I looked more carefully at the earlier messages in the scrollback buffer and noticed that there was actually a crash report up there. As this report was not written to any file (at least I didn't find one), I wrote down the first lines:

> kernel bug in …/mm/slab.c:3175
> invalid opcode: 0000
> pid = mount, sig = SEGV
> Trace:
> ... __kmalloc+0x153/0x190
> ... ext4_kvzalloc+0x1d/0x60 (numbers in this line are the same as in later reports)
> ... ext4_fill_super+0x1556/0x2840
> .....
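To compare the hand-copied transcript above with later traces, the `symbol+0xoffset/0xsize` entries can be pulled out mechanically. The following is only a minimal sketch: the helper names `parse_trace` and `common_frames` are made up for illustration, and it assumes the trace text has been typed in or otherwise turned into plain text (the attachments here are photographs).

```python
import re

# Matches a call-trace entry such as "__kmalloc+0x153/0x190":
# symbol name, offset into the function, and function size, both in hex.
TRACE_ENTRY = re.compile(
    r"(?P<symbol>[A-Za-z_][\w.]*)\+0x(?P<offset>[0-9a-fA-F]+)/0x(?P<size>[0-9a-fA-F]+)"
)


def parse_trace(text):
    """Return a list of (symbol, offset, size) tuples found in text."""
    return [
        (m.group("symbol"), int(m.group("offset"), 16), int(m.group("size"), 16))
        for m in TRACE_ENTRY.finditer(text)
    ]


def common_frames(trace_a, trace_b):
    """Return the frames of trace_a that also appear, unchanged, in trace_b."""
    seen = set(trace_b)
    return [frame for frame in trace_a if frame in seen]
```

Frames whose offset and size match across two reports were most likely hit at the same instruction of the same kernel build; frames that differ point to a different code path or a different kernel.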

Then it turned out that booting in failsafe mode allowed me to enter KDE without a problem. Having found no information on this specific bug, I decided to apply the current updates as well. Although no kernel updates were available at that time, the situation worsened. At first, half of the attempts to boot in normal mode produced 3 crash reports and ended in a deep freeze; the other half still dropped to single-user mode after 1 crash report. But after some more attempts the 3-crash path prevailed completely, so I now have no other option but to photograph the screen (see the attachments).

It now proceeds as follows:
1. The boot process starts as usual, mounting some partitions and starting some services.
2. The original …/mm/slab.c:3175 crash occurs (at least I suppose so, because the call traces look similar) and is immediately followed by another one, as depicted in trace01.png.
3. The computer freezes in a way that looks like 100 % load on a processor core: generally unresponsive, but it may occasionally react to scrollback keys.
4. After 45 seconds, the third crash report is displayed, as depicted in trace02.png, and the computer finally hangs in what looks like an idle halt.
The resulting freeze is so deep that SysRq keys stop working and I have to press the Reset button.

The failsafe mode still worked, though, so I finally figured out which options do the trick:
> nohz=off highres=off
When both are present, the system boots to KDE just fine, as it always did earlier. When either of these options is missing, the 3-crash issue is sure to occur.
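Since the outcome depends on both options being present, it can help to confirm what the running kernel was actually booted with. This is a minimal sketch, assuming a Linux system where `/proc/cmdline` holds the boot command line; the function names are hypothetical.

```python
def missing_boot_options(cmdline, required=("nohz=off", "highres=off")):
    """Return the required options that do not appear as whole words in cmdline."""
    present = set(cmdline.split())
    return [opt for opt in required if opt not in present]


def read_cmdline(path="/proc/cmdline"):
    """Read the kernel command line of the running system (Linux only)."""
    with open(path) as f:
        return f.read().strip()
```

On a booted system, `missing_boot_options(read_cmdline())` returns an empty list when both options took effect, which rules out a typo in the boot loader entry.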

Further details. All filesystems, including the mentioned /home, are ext4 residing on LVM2 volumes, with the exception of /boot, which is on a raw partition. I have made no changes to the disks in recent months or years, nor to other hardware (except for the video card, but I doubt it could influence a filesystem driver or timer module).

Any hints on how to debug this?

Reproducible: Always
Comment 1 Anton Samsonov 2013-01-19 12:57:04 UTC
Created attachment 520988 [details]
Photograph of the first two crashes
Comment 2 Anton Samsonov 2013-01-19 12:57:47 UTC
Created attachment 520989 [details]
Photograph of the third crash
Comment 3 Anton Samsonov 2013-01-20 17:00:57 UTC
There has been a development. Today it didn't boot even with the “nohz=off highres=off” options, resulting in 1 crash report followed by a deep freeze (see trace03.png in the new attachments). Although I could not scroll back to see the first lines, this looked very similar to what I was experiencing originally, before I applied the updates yesterday, though the numbers after the function names were different from the very first text transcript posted here. The other difference was that it fell into a deep freeze instead of entering single-user mode.

I retried, just in case I had misspelled the options. The outcome was a new story: after quickly displaying a couple of crash reports, it got stuck for a while on the message about “fsck /boot” (trace04.png), then after 30 seconds it displayed another crash report, and after 23 more seconds it finally halted on the watchdog timer (trace05.png).

The strangest thing here is that fsck reported the /boot partition (labeled “LinuxBoot”) as having 130'560 files, while in reality it has 361 files and 11 folders. I must also add that in previous attempts, when a single-user prompt was available but unable to shut down and I used the REISUB magic keys to force a reboot, new messages about fsck and mount appeared after sending SIGTERM, and again after SIGKILL.
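The file and folder figures above can be reproduced with a short script, so they can be set against whatever fsck prints. This is a minimal sketch; `count_entries` is a hypothetical helper. One caveat, stated as an assumption about what the photographed screen showed: the one-line e2fsck summary normally has the form `used/total files`, so a large figure there may be the total inode count of the filesystem and not the number of files in use.

```python
import os


def count_entries(root):
    """Count files and directories below root (root itself is not counted).

    os.walk does not descend into symlinked directories by default,
    but it does cross into other filesystems mounted below root.
    """
    files = dirs = 0
    for _dirpath, dirnames, filenames in os.walk(root):
        dirs += len(dirnames)
        files += len(filenames)
    return files, dirs
```

For example, `count_entries("/boot")` returns a `(files, folders)` pair for the mounted /boot partition.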

Booting with the complete set of failsafe options still worked, though, so I then started to check other configurations. Trying to boot with “powersaved=off nohz=off highres=off processor.max_cstate=1” resulted in a previously unseen crash referring to CPU cache tuning, but still caused by some ext4 handling routines (see trace06.png). I was finally able to boot with
> apm=off edd=off powersaved=off nohz=off highres=off processor.max_cstate=1 nomodeset
or something like this; I don't remember exactly which one of “edd” or “nomodeset” was excluded.
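To avoid having to remember which option was left out, the candidate command lines can be generated and noted down before testing. This is a minimal leave-one-out sketch; `drop_one_variants` is a hypothetical helper, and the option list is the one quoted above.

```python
def drop_one_variants(options):
    """For each option, return (dropped option, command line without it)."""
    return [
        (dropped, " ".join(opt for opt in options if opt != dropped))
        for dropped in options
    ]


# The option set quoted above, taken here as the known-good baseline.
KNOWN_GOOD = (
    "apm=off edd=off powersaved=off nohz=off highres=off "
    "processor.max_cstate=1 nomodeset"
).split()
```

Booting each line from `drop_one_variants(KNOWN_GOOD)` in turn and recording whether it crashes shows which options are actually needed; it does not detect options that only matter in combination.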

At this point I could start to consider the possibility of a real hardware failure, if openSUSE were the only operating system on this computer. But my primary OS is Windows 7, and it boots just fine and [almost] never screws up the fake-RAID on shutdown, and it runs demanding modern videogames as well as CPU-, GPU- and RAM-intensive BOINC computations that are cross-validated against other nodes. Of course, this is not strong proof, but I'm more inclined towards a software bug, taking into account that the situation worsened each time after updating openSUSE.
Comment 4 Anton Samsonov 2013-01-20 17:02:42 UTC
Created attachment 520996 [details]
Photograph of a crash similar to the original one
Comment 5 Anton Samsonov 2013-01-20 17:04:38 UTC
Created attachment 520997 [details]
Photograph of an alternate crash with the same “nohz=off highres=off” options as above
Comment 6 Anton Samsonov 2013-01-20 17:05:57 UTC
Created attachment 520998 [details]
Photograph of the final screen of the same boot sequence as on previous picture
Comment 7 Anton Samsonov 2013-01-20 17:07:23 UTC
Created attachment 520999 [details]
Photograph of a crash with “powersaved=off nohz=off highres=off processor.max_cstate=1” options
Comment 8 Anton Samsonov 2013-04-20 07:45:18 UTC
Just to be sure it was not a hardware failure, I occasionally ran MemTest86+ and several live-boot systems, including Debian, Mint, Clonezilla, Fedora, Chakra (Arch), as well as OpenSUSE Factory. Some of those systems activated my logical volumes automatically, so I mounted and fsck'ed them, without any error, as well as the /boot partition. Not to mention that the main system, Windows 7, still functioned perfectly all the time, running heavy videogames, decoding HD video, crunching BOINC tasks for CPU and GPU, doing large file processing with results identical to other machines.

So I will now install OpenSUSE 12.3 from scratch. Let's see how quickly it reaches the deteriorated state, too.
Comment 9 Jeff Mahoney 2013-07-15 20:20:08 UTC
Sorry for the delay. The screenshots you've taken are, unfortunately, all of secondary crashes and can't be used for debugging. The system is already in an unstable state when they occur.

Has your experience with 12.3 been any better?
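On the point about secondary crashes: the report that matters is the first one in the boot, which had scrolled away by the time the photographs were taken. If the console output can be captured as text (for example over a serial console or netconsole, which is an assumption and not something set up in this report), the first crash line can be located with a sketch like the one below. The marker list is an assumption based on common kernel message wording, and `first_crash` is a hypothetical helper.

```python
import re

# Assumed markers for the start of a kernel crash report.
CRASH_MARKERS = re.compile(
    r"kernel BUG (at|in)|invalid opcode:|general protection fault|Oops:",
    re.IGNORECASE,
)


def first_crash(log_text):
    """Return (line number, line) of the first crash marker, or None."""
    for lineno, line in enumerate(log_text.splitlines(), start=1):
        if CRASH_MARKERS.search(line):
            return lineno, line
    return None
```

Everything from that line up to the end of its call trace is the part worth attaching; later reports in the same boot are likely to be fallout from the first.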
Comment 10 Anton Samsonov 2013-07-20 08:58:36 UTC
(In reply to comment #9)

> Has your experience with 12.3 been any better?

The experience is always the same (more or less): for several months after the installation, everything works just fine, but at some point it deteriorates to an unusable state in which only a single-user prompt is available. That doesn't help, as the system either crashes or hangs on transition to higher runlevels. By the time this happens, a new version of openSUSE is usually available, so, after several attempts to fix the problem, I install the new version from scratch. This gives me another few months, and the cycle repeats.

If/when the same happens to openSUSE 12.3, I'll try to report back, but, again, I have absolutely no idea how to provide more helpful dumps for such cases.
Comment 11 Borislav Petkov 2014-03-08 23:02:10 UTC
Is this issue still of interest or can we close?
Comment 12 Anton Samsonov 2014-03-10 08:09:04 UTC
(In reply to comment #11)
> Is this issue still of interest or can we close?

I've updated to 12.3 and 13.1 since then, with relatively less usage than earlier, so problems (if they persist) didn't have much time to accumulate. Thus it may be better to close this entry and perhaps open a new one if/when necessary — against a recent openSUSE version.
Comment 13 Jeff Mahoney 2014-08-08 20:29:11 UTC
This report is against openSUSE 12.2, which is no longer under maintenance. If
you are able to reproduce it with openSUSE 13.1 or openSUSE Factory, please
re-open and reset the "Product" field to the appropriate release.