Bug 113237

Summary: XFS oops when trying installation repair
Product: [openSUSE] SUSE LINUX 10.0 Reporter: michel munnix <michel.munnix>
Component: Kernel Assignee: Andreas Gruenbacher <agruen>
Status: RESOLVED FIXED QA Contact: E-mail List <qa-bugs>
Severity: Normal    
Priority: P5 - None CC: forgotten_3A9F3QliFF, forgotten_aHtZ2osk0j, forgotten_f0K9NrX7su, nathans, yast2-maintainers
Version: Beta 2   
Target Milestone: ---   
Hardware: i686   
OS: All   
Whiteboard:
Found By: Other Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---
Attachments: xfsbufd startup and wakeup fixes

Description michel munnix 2005-08-26 09:10:06 UTC
When prompted to choose update or install, I clicked on "Other".
Chose automatic repair.
Activated the swap.
The repair did not proceed.

I tried to reproduce the problem but it did not occur again

here is an extract of dmesg:
md: ... autorun DONE.
end_request: I/O error, dev fd0, sector 0
end_request: I/O error, dev fd0, sector 0
end_request: I/O error, dev fd0, sector 0
end_request: I/O error, dev fd0, sector 0
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
Adding 1052216k swap on /dev/hda1.  Priority:-1 extents:1
Adding 1052216k swap on /dev/hda1.  Priority:-2 extents:1
SGI XFS with ACLs, security attributes, realtime, large block numbers, no debug enabled
SGI XFS Quota Management subsystem
Unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
c0118c4d
*pde = 00000000
Oops: 0000 [#1]
Modules linked in: xfs exportfs reiserfs ext3 jbd parport_pc parport edd dm_snapshot multipath raid6 raid5 xor raid1 raid0 dm_mod st piix fan thermal processor usb_storage usbhid uhci_hcd usbcore ide_disk ide_cd ide_core sg sr_mod sd_mod scsi_mod cdrom cramfs vfat fat nls_iso8859_1 nls_cp437 af_packet nvram
CPU:    0
EIP:    0060:[<c0118c4d>]    Not tainted VLI
EFLAGS: 00210046   (2.6.13-rc6-git7-3-default)
EIP is at try_to_wake_up+0xd/0x90
eax: 00000000   ebx: 00000000   ecx: 00000000   edx: 0000000f
esi: 00200246   edi: 00000000   ebp: c4c21d74   esp: c4c21d68
ds: 007b   es: 007b   ss: 0068
Process y2base (pid: 2499, threadinfo=c4c20000 task=c1617530)
Stack: 00000000 c3bbafc0 000000c0 0000000b c928d534 c0145f91 00000000 00000000
       00000000 000000d2 00000000 00000180 0000658f 00000006 c4c21e18 00000080
       0000000b c0146f88 000000d2 c035b4c4 0000658e 00000009 00000060 00000003
Call Trace:
 [<c928d534>] xfsbufd_wakeup+0x24/0x30 [xfs]
 [<c0145f91>] shrink_slab+0xa1/0x180
 [<c0146f88>] try_to_free_pages+0xe8/0x1b0
 [<c0140eaf>] __alloc_pages+0x1ef/0x420
 [<c0149fef>] do_wp_page+0x9f/0x2e0
 [<c014aefb>] __handle_mm_fault+0x11b/0x130
 [<c0117497>] do_page_fault+0x127/0x5ef
 [<c016a178>] poll_freewait+0x48/0x60
 [<c016a64a>] do_select+0x30a/0x340
 [<c0117370>] do_page_fault+0x0/0x5ef
 [<c0103f0f>] error_code+0x4f/0x60
 [<c016007b>] bd_acquire+0x2b/0xa0
 [<c01e33de>] __put_user_4+0x12/0x18
 [<c016a8ba>] sys_select+0x21a/0x360
 [<c0102d79>] syscall_call+0x7/0xb
Code: 3c 54 3f c0 55 0f 94 c0 25 ff 00 00 00 89 e5 5d c3 8d b6 00 00 00 00 8d bc
 27 00 00 00 00 55 89 e5 57 89 cf 56 53 89 c3 9c 5e fa <8b> 08 31 c0 85 ca 74 0d
 8b 53 28 85 d2 74 14 c7 03 00 00 00 00
Comment 1 Olaf Kirch 2005-08-26 09:19:51 UTC
Walks like an XFS bug, quacks like an XFS bug, must be an XFS bug :) 
Comment 2 Andreas Gruenbacher 2005-08-26 10:24:10 UTC
SGI, is this enough information for you to fix this? 
Comment 3 Christoph Hellwig 2005-08-26 11:38:09 UTC
Created attachment 47757 [details]
xfsbufd startup and wakeup fixes
Comment 4 Christoph Hellwig 2005-08-26 11:42:06 UTC
What does this automatic repair choice do?

Anyway, this oops means xfsbufd_task is NULL.  The only way I can see this
happening is a race when the xfsbufd thread is started: xfsbufd_wakeup is
called before the child has had a chance to run.  The attached patch switches
xfsbufd startup to the kthread infrastructure, which avoids this race, and adds
some small fixes to xfsbufd_wakeup.

Could you spin a kernel with that fix for the reporter?

(sorry, this should have gotten out with the attachment, but @#W%#% bugzilla
ignores the comment when you're adding an attachment)
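
The startup race described in this comment can be illustrated with a small
userspace sketch. This is an analogy using POSIX threads, not the XFS code and
not the attached patch: every name in it (struct daemon, daemon_task,
daemon_start, daemon_wakeup, daemon_stop) is hypothetical. It assumes the point
of the kthread conversion is that the creator receives the thread handle (in
the kernel, kthread_run returns the task pointer to its caller) and publishes
it before startup returns, so a waker never depends on the child having run.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the daemon's task handle. */
struct daemon {
    pthread_t       thread;
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    bool            force_flush; /* set by a waker, cleared by the daemon */
    bool            stop;
    int             flushes;     /* number of flush requests served */
};

static struct daemon  daemon_storage;
static struct daemon *daemon_task;  /* plays the role of xfsbufd_task */

static void *daemon_main(void *arg)
{
    struct daemon *d = arg;

    assert(d != NULL);
    pthread_mutex_lock(&d->lock);
    for (;;) {
        if (d->force_flush) {       /* serve pending work before stopping */
            d->force_flush = false;
            d->flushes++;
            continue;
        }
        if (d->stop)
            break;
        pthread_cond_wait(&d->cond, &d->lock);
    }
    pthread_mutex_unlock(&d->lock);
    return NULL;
}

/* The creator publishes the handle itself; it does not wait for the child
 * to announce itself, so there is no window after a successful start in
 * which the handle is still NULL. */
static int daemon_start(void)
{
    struct daemon *d = &daemon_storage;

    pthread_mutex_init(&d->lock, NULL);
    pthread_cond_init(&d->cond, NULL);
    d->force_flush = false;
    d->stop = false;
    d->flushes = 0;
    if (pthread_create(&d->thread, NULL, daemon_main, d) != 0) {
        pthread_cond_destroy(&d->cond);
        pthread_mutex_destroy(&d->lock);
        return -1;
    }
    daemon_task = d;
    return 0;
}

/* Returns 1 if a wakeup was delivered, 0 if there is no daemon to wake. */
static int daemon_wakeup(void)
{
    struct daemon *d = daemon_task;

    if (d == NULL)                  /* defensive: never touch a NULL handle */
        return 0;
    pthread_mutex_lock(&d->lock);
    d->force_flush = true;
    pthread_cond_signal(&d->cond);
    pthread_mutex_unlock(&d->lock);
    return 1;
}

/* Unpublishes the handle, stops the thread, returns the flushes served. */
static int daemon_stop(void)
{
    struct daemon *d = daemon_task;
    int served;

    if (d == NULL)
        return -1;
    daemon_task = NULL;             /* no new wakeups from this point on */
    pthread_mutex_lock(&d->lock);
    d->stop = true;
    pthread_cond_signal(&d->cond);
    pthread_mutex_unlock(&d->lock);
    pthread_join(d->thread, NULL);
    served = d->flushes;
    pthread_cond_destroy(&d->cond);
    pthread_mutex_destroy(&d->lock);
    return served;
}
```

A caller would run daemon_start() once, call daemon_wakeup() from any context
that wants a flush, and call daemon_stop() at teardown. The sketch is not safe
against a wakeup that runs concurrently with daemon_stop(); real code would
need its own ordering guarantee for that case.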
Comment 5 Andreas Gruenbacher 2005-08-31 18:33:12 UTC
http://portal.suse.de/sdb/de/2003/11/YaST-System-Repair.html contains a German 
description of system repair. This feature most certainly has nothing to do 
with this bug, though. 
 
It's a bit tricky to test this: system repair is offered during installation 
only.  Can we maybe trigger this with a simple test case? At least a trivial 
loop didn't fail for me, with or without the fix: 
 
#! /bin/sh 
set -v 
while :; do 
    modprobe xfs 
    mount /dev/hda7 /mnt 
    dd if=/dev/urandom of=/mnt/random bs=16 count=1 
    umount /mnt 
    rmmod xfs 
done 
 
I saw you submitted this change to the xfs cvs, but without this hunk: 
 
@@ -1744,10 +1743,15 @@ 
        int                     priority, 
        unsigned int            mask) 
 { 
-       if (xfsbufd_force_sleep) 
+       if (xfsbufd_force_sleep || !priority) 
                return 0; 
+       if (!xfsbufd_task) { 
+               WARN_ON(1); 
+               return 0; 
+       } 
+ 
        xfsbufd_force_flush = 1; 
-       barrier(); 
+       wmb(); 
        wake_up_process(xfsbufd_task); 
        return 0; 
 } 
 
Christoph, I think we can just take the fix. Which version should we go with? 
Thanks. 
Comment 6 Christoph Hellwig 2005-09-04 23:29:27 UTC
The additional fixes shouldn't be necessary.  They were just some more
defensive programming, but I talked to Nathan and am pretty sure either the diff
I checked into CVS or today's "Make sure the threads and shaker in xfs_buf
are de-initialized in reverse startup order" checkin should fix the issue.
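
The "reverse startup order" idea named in that checkin title can be sketched
generically as follows. This is not the XFS change itself: the step names
(queue, thread, shaker) and the helper functions are hypothetical, and the
order shown is an assumption chosen for illustration. The point is only that
whatever is brought up last (here the shaker, which would call into the
thread) is torn down first, so nothing can call into a component that is
already gone.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define NSTEPS 3

/* Hypothetical startup steps, listed in startup order. */
static const char *const step_names[NSTEPS] = { "queue", "thread", "shaker" };

static char trace[128];  /* records the order of init/exit calls */

static void record(char op, const char *name)
{
    size_t used = strlen(trace);

    assert(used < sizeof(trace));
    snprintf(trace + used, sizeof(trace) - used, "%c%s ", op, name);
}

/* Initializes every step in order, then tears them down in reverse. */
static const char *run_lifecycle(void)
{
    int i;

    trace[0] = '\0';
    for (i = 0; i < NSTEPS; i++)
        record('+', step_names[i]);   /* startup: first to last */
    for (i = NSTEPS - 1; i >= 0; i--)
        record('-', step_names[i]);   /* teardown: last to first */
    return trace;
}
```

Calling run_lifecycle() yields a trace in which the shaker is removed before
the thread it would wake, which is the ordering property the checkin title
describes.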
Comment 7 Andreas Gruenbacher 2005-09-05 23:55:46 UTC
Okay, I have added the two changesets from the xfs cvs as 
patches.fixes/xfs-switch-to-kthread-api and 
patches.fixes/xfs-switch-to-kthread-api-2. Thanks.