Bug 128446

Summary: kernel crash/oops with XFS
Product: [openSUSE] SUSE Linux 10.1 Reporter: Adrian Schröter <adrian.schroeter>
Component: Kernel Assignee: Forgotten User f0K9NrX7su <forgotten_f0K9NrX7su>
Status: RESOLVED DUPLICATE QA Contact: E-mail List <qa-bugs>
Severity: Critical    
Priority: P5 - None CC: ajones, dmueller, forgotten_aHtZ2osk0j, forgotten_f0K9NrX7su
Version: unspecified   
Target Milestone: ---   
Hardware: Other   
OS: Other   
Whiteboard:
Found By: Other Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---

Description Adrian Schröter 2005-10-14 15:54:26 UTC
The near-alpha2 kernel crashed on booting. The crash happens in XFS code; XFS is used
on two partitions on Dirk's system.
 
I have the full oops on my camera but no cable here at the moment, so I am entering
only the trace. Tell me if you need more:
 
xfs_trans_update_ail 
xfs_trans_chunk_committed 
xfs_trans_committed 
xlog_state_do_callback 
xlog_iodone 
pagebuf_iodone_work 
worker_thread 
pagebuf_iodone_work 
default_wake_function 
...
Comment 1 Andreas Gruenbacher 2005-10-17 11:22:16 UTC
SGI, this was on a machine with xfs root, with an almost-stock 2.6.14-rc4-git4 kernel. Are you aware of any problems that look like this? Thanks for checking!
Comment 2 Forgotten User f0K9NrX7su 2005-11-03 05:33:32 UTC
Hi Andreas,

As a matter of fact we do have one such reported bug recently from
the community (pv#945029 - sorry no bugworks access:) which has the
same stack callback.
However, their problem occurs when they run out of space (and say
it happens when testing with default ACLs and inheriting ACLs).
I plan to try out their scenario.
However, are there any unusual circumstances in your situation which
would provide a clue to reproduce locally?

Things are going wrong when the in-memory log buffer makes it to disk,
we get a callback and then call our xfs_trans_committed routine.
This adds the items in the transaction to the active-item-list, which
is a list of items (for metadata) which are in the on-disk log but
whose metadata has not been written to disk yet.
If the item already exists then it just updates its position in the list.
For pv#945029, they reported that xfs_ail_insert fails because the
lip->li_ail.ail_forw field is NULL, which is a problem when it is linking the
next item's back ptr to our new item.
The insert works by scanning back from the end of the list.
So we traverse just using the back ptrs. Somehow the back ptrs are
intact but the forward ptr isn't.
The active item list (AIL) is locked prior to this call, so there
shouldn't be a race problem.
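
For illustration, the backward-scan insert described above can be sketched in
plain C. This is only a simplified sketch under stated assumptions, not the XFS
source: the struct layout and the names sketch_item, sketch_ail_init and
sketch_ail_insert are hypothetical, the list is assumed to be circular with a
sentinel head and sorted by LSN, and the caller is assumed to hold the AIL lock.
The explicit NULL check marks the point where a broken forward pointer would
otherwise be dereferenced.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a log item (not the XFS struct). */
struct sketch_item {
    struct sketch_item *forw;  /* next item, towards higher LSNs */
    struct sketch_item *back;  /* previous item, towards lower LSNs */
    long lsn;                  /* position in the log */
};

/* Initialise an empty circular list; head is a sentinel, not a real item. */
static void sketch_ail_init(struct sketch_item *head)
{
    assert(head != NULL);
    head->forw = head;
    head->back = head;
    head->lsn = -1;
}

/*
 * Insert lip in LSN order, scanning back from the tail using only the
 * back pointers. Returns 0 on success, or -1 if the forward pointer of
 * the insertion point is NULL (the corruption described in the report).
 */
static int sketch_ail_insert(struct sketch_item *head, struct sketch_item *lip)
{
    struct sketch_item *pos;

    assert(head != NULL && lip != NULL);

    /* Walk backwards until an item with LSN <= lip->lsn, or the head. */
    pos = head->back;
    while (pos != head && pos->lsn > lip->lsn)
        pos = pos->back;

    /*
     * The scan above never touched a forward pointer, so it succeeds even
     * if one is broken. Without this check, the final linking step below
     * would dereference NULL.
     */
    if (pos->forw == NULL)
        return -1;

    lip->forw = pos->forw;
    lip->back = pos;
    pos->forw = lip;
    lip->forw->back = lip;  /* link the next item's back ptr to the new item */
    return 0;
}
```

In this sketch the scan can complete on intact back pointers while the last
linking step still fails, which matches the symptom reported for pv#945029.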

--Tim
Comment 3 Dirk Mueller 2005-11-03 11:48:15 UTC
Unfortunately I cannot provide further information, as I reinstalled the corrupted partition with a different filesystem.

I cannot immediately trigger it, but running autobuild (which does a lot of compilation, file reads and writes) on the machine for several days appears to have caused this problem.
Comment 4 Andreas Gruenbacher 2005-11-04 14:06:36 UTC
Tim, I'm assigning this bug to you until we have a fix.
Comment 5 Forgotten User f0K9NrX7su 2005-12-05 00:33:22 UTC
Traceback looks the same as the 133990 traceback.
Was the FS full and was it using default ACLs?
--Tim

*** This bug has been marked as a duplicate of 133990 ***
Comment 6 Dirk Mueller 2005-12-05 09:42:10 UTC
I don't think the file system was full, but it could have happened. Looks similar indeed.