Bug 132104

Summary: BUG in drivers/block/cfq_iosched.c:1148
Product: [openSUSE] SUSE LINUX 10.0
Reporter: Richard Biener <rguenther>
Component: Kernel
Assignee: Jens Axboe <axboe>
Status: RESOLVED FIXED
QA Contact: E-mail List <qa-bugs>
Severity: Normal
Priority: P2 - High
CC: stoppe
Version: Final
Target Milestone: ---
Hardware: i686
OS: SuSE Linux 10.0
Whiteboard:
Found By: Development
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
Attachments:
  bootup messages
  dmesg after boot
  Add debug catches and dumps for this bug in CFQ
  Screenshot with start of kernel panic with "kernel BUG at drivers/block/cfq-iosched.c" message

Description Richard Biener 2005-11-03 09:36:56 UTC
This is the second time the kernel has oopsed with $subject: the first time during regular work, the second during an (idle) night. No further I/O is possible after the oops (I guess it still holds a lock on something).

SysRq-p shows the swapper task active; the filesystem is reiserfs.
Comment 1 Richard Biener 2005-11-03 09:38:47 UTC
Created attachment 56324
bootup messages
Comment 2 Richard Biener 2005-11-03 09:39:15 UTC
Created attachment 56326
dmesg after boot
Comment 3 Richard Biener 2005-11-03 09:39:56 UTC
A camera snapshot of the oops message (from the first oops) will follow once I get it off the camera.
Comment 4 Jens Axboe 2005-11-03 10:19:09 UTC
This is a should-not-happen case: the queue selected for service turns out to be empty. So either it became empty and didn't get expired, or we selected an empty queue. Hmm.

Are you using ionice for anything?
Comment 5 Jens Axboe 2005-11-03 10:33:49 UTC
Created attachment 56341
Add debug catches and dumps for this bug in CFQ
Comment 6 Jens Axboe 2005-11-03 10:34:26 UTC
Richard, can you build a kernel with this patch and run with that? I can also check it into SL100 and you can just grab one of the KOTD kernels.
Comment 7 Jens Axboe 2005-11-03 10:35:12 UTC
BTW, the patch should also stop it from crashing and continue on, if the problem isn't due to a memory/hardware problem (ie the data structures have been fscked up).
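
For readers without access to the attachment, the general idea of "warn and continue instead of crashing" can be sketched in user-space C. This is only an illustration under stated assumptions, not the attached patch and not the real CFQ code: `toy_queue`, `select_queue` and `empty_selected_count` are hypothetical names, and the model simply skips a queue that was picked for service but holds no requests, logging a warning where a BUG()-style check would have halted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a per-process request queue; not the real cfq_queue. */
struct toy_queue {
    int id;
    int queued; /* number of pending requests */
};

/* Counts how often the "should not happen" condition was observed. */
static int empty_selected_count = 0;

/*
 * Pick the next queue to service, scanning round-robin from index `start`.
 * A candidate that turns out to be empty is reported and skipped, so the
 * caller keeps running instead of aborting on the inconsistency.
 * Returns NULL when no queue has pending requests.
 */
static struct toy_queue *select_queue(struct toy_queue *qs, size_t n, size_t start)
{
    assert(qs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        struct toy_queue *q = &qs[(start + i) % n];
        if (q->queued > 0)
            return q;
        /* Inconsistent state: warn and move on rather than halt. */
        empty_selected_count++;
        fprintf(stderr, "warning: queue %d selected but empty, skipping\n", q->id);
    }
    return NULL;
}
```

The counter makes the otherwise silent recovery observable, which is the same reason a debug patch would dump state when the condition triggers.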
Comment 8 Richard Biener 2005-11-03 10:39:12 UTC
OK, I'll do that sometime today.  Thanks so far.
Comment 9 Christoph Stoppe 2005-11-09 06:44:43 UTC
Created attachment 56740
Screenshot with start of kernel panic with "kernel BUG at drivers/block/cfq-iosched.c" message

Hi,

We also encountered this bug with a SuSE 10.0 system running on a Dell PowerEdge 750 server. I attached a screenshot (sorry for the bad quality) which shows the start of the kernel panic message.

I'll try to apply the patch to the kernel and report back on whether it helps. This could take about two weeks, since our machine crashes "only" every 4-8 days...

kind regards,

Christoph Stoppe
Comment 10 Christoph Stoppe 2005-11-16 07:31:41 UTC
This morning our server produced some output with the patched kernel. The following messages appeared five times in /var/log/messages:

Nov 13 06:15:14 webserver kernel: Badness in __cfq_set_active_queue at drivers/block/cfq-iosched.c:795
Nov 13 06:15:14 webserver kernel:  [<c029ba1e>] cfq_set_active_queue+0xbe/0x140
Nov 13 06:15:14 webserver kernel:  [<c029c349>] cfq_dispatch_requests+0x39/0x90
Nov 13 06:15:14 webserver kernel:  [<c029c439>] cfq_next_request+0x99/0xb0
Nov 13 06:15:14 webserver kernel:  [<c028f802>] elv_next_request+0x12/0x170
Nov 13 06:15:14 webserver kernel:  [<f883a7f7>] scsi_dispatch_cmd+0x177/0x2d0 [scsi_mod]
Nov 13 06:15:14 webserver kernel:  [<f8840e45>] scsi_request_fn+0x45/0x3c0 [scsi_mod]
Nov 13 06:15:14 webserver kernel:  [<c0291a66>] blk_remove_plug+0x26/0x60
Nov 13 06:15:14 webserver kernel:  [<c0291be0>] blk_run_queue+0x30/0x50
Nov 13 06:15:14 webserver kernel:  [<f88400a6>] scsi_run_queue+0x76/0xb0 [scsi_mod]
Nov 13 06:15:14 webserver kernel:  [<f8840246>] scsi_end_request+0xb6/0x110 [scsi_mod]
Nov 13 06:15:14 webserver kernel:  [<f884056f>] scsi_io_completion+0x16f/0x510 [scsi_mod]
Nov 13 06:15:14 webserver kernel:  [<c012f350>] lock_timer_base+0x20/0x50
Nov 13 06:15:14 webserver kernel:  [<f8815f01>] sd_rw_intr+0x161/0x400 [sd_mod]
Nov 13 06:15:14 webserver kernel:  [<f883db85>] scsi_delete_timer+0x15/0x60 [scsi_mod]
Nov 13 06:15:14 webserver kernel:  [<c01190df>] smp_apic_timer_interrupt+0xdf/0x100
Nov 13 06:15:14 webserver kernel:  [<f88635b4>] ata_scsi_qc_complete+0x24/0x40 [libata]
Nov 13 06:15:14 webserver kernel:  [<f8861393>] ata_qc_complete+0x33/0xc0 [libata]
Nov 13 06:15:14 webserver kernel:  [<f883acba>] scsi_finish_command+0x8a/0xd0 [scsi_mod]
Nov 13 06:15:14 webserver kernel:  [<f886189b>] ata_interrupt+0x9b/0x120 [libata]
Nov 13 06:15:14 webserver kernel:  [<f883abb7>] scsi_softirq+0xa7/0xe0 [scsi_mod]
Nov 13 06:15:14 webserver kernel:  [<c012b4a2>] __do_softirq+0x72/0xe0
Nov 13 06:15:14 webserver kernel:  [<c012b545>] do_softirq+0x35/0x40
Nov 13 06:15:14 webserver kernel:  [<c010703b>] do_IRQ+0x3b/0x70
Nov 13 06:15:14 webserver kernel:  [<c010537a>] common_interrupt+0x1a/0x20
Nov 13 06:15:14 webserver kernel:  [<c0102305>] mwait_idle+0x25/0x50
Nov 13 06:15:14 webserver kernel:  [<c01020d7>] cpu_idle+0x37/0xc0
Nov 13 06:15:14 webserver kernel:  [<c040691a>] start_kernel+0x17a/0x1e0
Nov 13 06:15:14 webserver kernel:  [<c0406330>] unknown_bootoption+0x0/0x1e0
Nov 13 06:15:14 webserver kernel: rb empty on dispatch: q=0/0, a=0/1, d=0/0, rr=0, f=40, k=0

kind regards,

Christoph Stoppe
Comment 11 Christoph Stoppe 2006-02-28 07:09:43 UTC
We switched our server from XFS to Reiserfs (without re-installing) and nothing changed. The kernel panics keep occurring about once a day. Is anyone working on this issue, and will there be a solution in the form of an update for SUSE 10.0?

kind regards,

Christoph Stoppe
Comment 12 Jens Axboe 2006-02-28 09:59:16 UTC
Christoph, let me know what arch and kernel you are using (eg i386/x86-64 and default/smp) and I'll try and build a test kernel.
Comment 13 Richard Biener 2006-02-28 10:01:16 UTC
Christoph, a workaround is to use elevator=anticipatory as a kernel parameter.
Comment 14 Jens Axboe 2006-02-28 10:02:39 UTC
Yes, that will work of course. If Christoph is willing to test a new kernel, that would be nice though.
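
To confirm that the workaround took effect after a reboot, one option is to look at the scheduler list the kernel exposes per block device. As far as I know, 2.6 kernels with runtime-selectable schedulers show it in /sys/block/<dev>/queue/scheduler, with the active scheduler in square brackets; the exact path and device name on a given machine are assumptions here. A minimal C sketch that extracts the bracketed name from such a line (`active_scheduler` is a hypothetical helper, not an existing API):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the bracketed (active) scheduler name from a line such as
 * "noop [anticipatory] deadline cfq" into `out`.
 * Returns 0 on success, -1 if no well-formed [name] is present
 * or the name does not fit into `out`.
 */
static int active_scheduler(const char *line, char *out, size_t outlen)
{
    assert(line != NULL && out != NULL);

    const char *open = strchr(line, '[');
    if (open == NULL)
        return -1;

    const char *close = strchr(open + 1, ']');
    if (close == NULL)
        return -1;

    size_t len = (size_t)(close - open - 1);
    if (len == 0 || len >= outlen)
        return -1;

    memcpy(out, open + 1, len);
    out[len] = '\0';
    return 0;
}
```

Feeding it the line read from the sysfs file should yield "anticipatory" when the boot parameter was honoured, and "cfq" when the machine is still on the default.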
Comment 15 Christoph Stoppe 2006-02-28 10:23:37 UTC
Hi,

Thanks for your fast replies. Here's the info you requested:

Running "uname -a" gives:

Linux webserver 2.6.13-15.7-smp #1 SMP Wed Dec 7 08:18:11 CET 2005 i686 i686 i386 GNU/Linux

I have already installed the kernel update to 2.6.13-15.8, but have had no time to restart the machine. Maybe this will happen this coming weekend.

As mentioned before, this machine is a Dell PowerEdge 750 with one CPU. Running "cat /proc/cpuinfo" gives:

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 3
model name      : Intel(R) Pentium(R) 4 CPU 2.80GHz
stepping        : 3
cpu MHz         : 2800.410
cache size      : 1024 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe pni monitor ds_cpl cid
bogomips        : 5608.69

If you need more information about the machine's hardware, don't hesitate to ask.

I'll try the "elevator=anticipatory" workaround when restarting the machine with the new kernel.

I could also test a new kernel, but would need some time to install it, since the machine in question is a production webserver (so testing another kernel could only happen on weekends).

kind regards,

Christoph
Comment 16 Jens Axboe 2006-02-28 10:40:15 UTC
Thanks Christoph. I have another test right now, so if your machine is in production I'd suggest you go with the anticipatory work-around for now. If testing works out at this end, the patch will go out with the next kernel update anyways.
Comment 17 Jens Axboe 2006-03-13 11:47:11 UTC
The fix has been verified as working outside of bugzilla. It has been committed to cvs, closing bug.