Bugzilla – Full Text Bug Listing

| Summary: | BUG in drivers/block/cfq_iosched.c:1148 | ||
|---|---|---|---|
| Product: | [openSUSE] SUSE LINUX 10.0 | Reporter: | Richard Biener <rguenther> |
| Component: | Kernel | Assignee: | Jens Axboe <axboe> |
| Status: | RESOLVED FIXED | QA Contact: | E-mail List <qa-bugs> |
| Severity: | Normal | ||
| Priority: | P2 - High | CC: | stoppe |
| Version: | Final | ||
| Target Milestone: | --- | ||
| Hardware: | i686 | ||
| OS: | SuSE Linux 10.0 | ||
| Whiteboard: | |||
| Found By: | Development | Services Priority: | |
| Business Priority: | | Blocker: | --- |
| Marketing QA Status: | --- | IT Deployment: | --- |

Attachments:
bootup messages
dmesg after boot
Add debug catches and dumps for this bug in CFQ
Screenshot with start of kernel panic with "kernel BUG at drivers/block/cfq-iosched.c" message

Description
Richard Biener
2005-11-03 09:36:56 UTC
Created attachment 56324 [details]
bootup messages
Created attachment 56326 [details]
dmesg after boot
A camera snapshot of the oops message (from the first oops) will follow once I get it off the camera.

This is a should-not-happen case: the queue selected for service turns out to be empty. So either it became empty and didn't get expired, or we selected an empty queue. Hmm. Are you using ionice for anything?

Created attachment 56341 [details]
Add debug catches and dumps for this bug in CFQ
Richard, can you build a kernel with this patch and run with that? I can also check it into SL100 and you can just grab one of the KOTD kernels.

BTW, the patch should also stop it from crashing and let it continue on, provided the problem isn't due to a memory/hardware fault (i.e. the data structures have been fscked up).

Ok, I'll do that sometime today. Thanks so far.

Created attachment 56740 [details]
Screenshot with start of kernel panic with "kernel BUG at drivers/block/cfq-iosched.c" message
Hi,
we also encountered this bug with a SuSE 10.0 system running on a Dell PowerEdge 750 server. I attached a screenshot (sorry for the bad quality) which shows the start of the kernel panic message.
I'll try to apply the patch to the kernel and report back whether it helps. This could take about two weeks, since our machine crashes "only" every 4-8 days...
kind regards,
Christoph Stoppe
This morning our server produced some output with the patched kernel. The following messages appeared five times in /var/log/messages:

Nov 13 06:15:14 webserver kernel: Badness in __cfq_set_active_queue at drivers/block/cfq-iosched.c:795
Nov 13 06:15:14 webserver kernel: [<c029ba1e>] cfq_set_active_queue+0xbe/0x140
Nov 13 06:15:14 webserver kernel: [<c029c349>] cfq_dispatch_requests+0x39/0x90
Nov 13 06:15:14 webserver kernel: [<c029c439>] cfq_next_request+0x99/0xb0
Nov 13 06:15:14 webserver kernel: [<c028f802>] elv_next_request+0x12/0x170
Nov 13 06:15:14 webserver kernel: [<f883a7f7>] scsi_dispatch_cmd+0x177/0x2d0 [scsi_mod]
Nov 13 06:15:14 webserver kernel: [<f8840e45>] scsi_request_fn+0x45/0x3c0 [scsi_mod]
Nov 13 06:15:14 webserver kernel: [<c0291a66>] blk_remove_plug+0x26/0x60
Nov 13 06:15:14 webserver kernel: [<c0291be0>] blk_run_queue+0x30/0x50
Nov 13 06:15:14 webserver kernel: [<f88400a6>] scsi_run_queue+0x76/0xb0 [scsi_mod]
Nov 13 06:15:14 webserver kernel: [<f8840246>] scsi_end_request+0xb6/0x110 [scsi_mod]
Nov 13 06:15:14 webserver kernel: [<f884056f>] scsi_io_completion+0x16f/0x510 [scsi_mod]
Nov 13 06:15:14 webserver kernel: [<c012f350>] lock_timer_base+0x20/0x50
Nov 13 06:15:14 webserver kernel: [<f8815f01>] sd_rw_intr+0x161/0x400 [sd_mod]
Nov 13 06:15:14 webserver kernel: [<f883db85>] scsi_delete_timer+0x15/0x60 [scsi_mod]
Nov 13 06:15:14 webserver kernel: [<c01190df>] smp_apic_timer_interrupt+0xdf/0x100
Nov 13 06:15:14 webserver kernel: [<f88635b4>] ata_scsi_qc_complete+0x24/0x40 [libata]
Nov 13 06:15:14 webserver kernel: [<f8861393>] ata_qc_complete+0x33/0xc0 [libata]
Nov 13 06:15:14 webserver kernel: [<f883acba>] scsi_finish_command+0x8a/0xd0 [scsi_mod]
Nov 13 06:15:14 webserver kernel: [<f886189b>] ata_interrupt+0x9b/0x120 [libata]
Nov 13 06:15:14 webserver kernel: [<f883abb7>] scsi_softirq+0xa7/0xe0 [scsi_mod]
Nov 13 06:15:14 webserver kernel: [<c012b4a2>] __do_softirq+0x72/0xe0
Nov 13 06:15:14 webserver kernel: [<c012b545>] do_softirq+0x35/0x40
Nov 13 06:15:14 webserver kernel: [<c010703b>] do_IRQ+0x3b/0x70
Nov 13 06:15:14 webserver kernel: [<c010537a>] common_interrupt+0x1a/0x20
Nov 13 06:15:14 webserver kernel: [<c0102305>] mwait_idle+0x25/0x50
Nov 13 06:15:14 webserver kernel: [<c01020d7>] cpu_idle+0x37/0xc0
Nov 13 06:15:14 webserver kernel: [<c040691a>] start_kernel+0x17a/0x1e0
Nov 13 06:15:14 webserver kernel: [<c0406330>] unknown_bootoption+0x0/0x1e0
Nov 13 06:15:14 webserver kernel: rb empty on dispatch: q=0/0, a=0/1, d=0/0, rr=0, f=40, k=0

kind regards,
Christoph Stoppe

We switched our server from XFS to Reiserfs (without re-installing) and nothing changed. The kernel panics keep occurring about once a day. Is anyone working on this issue, and will there be a solution in the form of an update for SuSE 10.0?
kind regards,
Christoph Stoppe

Christoph, let me know what arch and kernel you are using (e.g. i386/x86-64 and default/smp) and I'll try to build a test kernel.

Christoph, a workaround is to use elevator=anticipatory as a kernel parameter.

Yes, that will work of course. If Christoph is willing to test a new kernel, that would be nice though.

Hi,
thanks for your fast replies. Here's the info you requested:
An "uname -a" gives:
Linux webserver 2.6.13-15.7-smp #1 SMP Wed Dec 7 08:18:11 CET 2005 i686 i686 i386 GNU/Linux
I already installed the kernel update to 2.6.13-15.8, but have not had time to restart the machine. Maybe this will happen on the coming weekend.
As mentioned before, this machine is a DELL PowerEdge 750 which has one CPU. A "cat /proc/cpuinfo" gives:
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 3
model name : Intel(R) Pentium(R) 4 CPU 2.80GHz
stepping : 3
cpu MHz : 2800.410
cache size : 1024 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe pni monitor ds_cpl cid
bogomips : 5608.69
If you need more information about the machine's hardware, don't hesitate to ask.
I'll try the "elevator=anticipatory" workaround when restarting the machine with the new kernel. I could even test a new kernel, but would need some time to install it, since the machine in question is a production webserver (so testing another kernel could only happen on weekends).
kind regards,
Christoph

Thanks Christoph. I have another test right now, so if your machine is in production I'd suggest you go with the anticipatory work-around for now. If testing works out at this end, the patch will go out with the next kernel update anyways.

The fix has been verified as working outside of bugzilla. It has been committed to cvs, closing bug.
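
For anyone applying the elevator=anticipatory workaround suggested above, a quick way to confirm which I/O scheduler is in effect after the reboot is to read the block device's queue/scheduler file in sysfs, where the active scheduler appears in square brackets. The following is only a minimal Python sketch of that check and is not part of the original report: the device name "sda" and the helper names are assumptions for illustration, and the sample string merely imitates the usual layout of that file.

```python
from pathlib import Path


def parse_schedulers(text):
    """Split the contents of a queue/scheduler file into (active, available).

    The file lists scheduler names separated by spaces; the active one is
    wrapped in square brackets, e.g. "noop anticipatory deadline [cfq]".
    """
    active = None
    available = []
    for name in text.split():
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]
            active = name
        available.append(name)
    return active, available


def active_scheduler(device="sda", sysfs_root="/sys/block"):
    """Read the active scheduler for one device.

    "sda" is an assumed device name; substitute the disk actually in use.
    """
    path = Path(sysfs_root) / device / "queue" / "scheduler"
    return parse_schedulers(path.read_text())[0]


if __name__ == "__main__":
    # Sample text only, so the sketch runs without touching /sys.
    sample = "noop [anticipatory] deadline cfq"
    active, available = parse_schedulers(sample)
    print("active:", active)
    print("available:", ", ".join(available))
```

If the workaround took effect, the active scheduler reported for the affected disk should be anticipatory rather than cfq.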