Bugzilla – Bug 128446
kernel crash/oops with XFS
Last modified: 2005-12-08 00:54:27 UTC
The near-alpha2 kernel crashed on booting. It happens in XFS code; XFS is used on two partitions on Dirk's system. I have the full oops on my camera but no cable here at the moment, so I am entering only the trace. Tell me if you need more: xfs_trans_update_ail xfs_trans_chunk_committed xfs_trans_committed xlog_state_do_callback xlog_iodone pagebuf_iodone_work worker_thread pagebuf_iodone_work default_wake_function ...
SGI, this was on a machine with xfs root, with an almost-stock 2.6.14-rc4-git4 kernel. Are you aware of any problems that look like this? Thanks for checking!
Hi Andreas, As a matter of fact we do have one such bug reported recently by the community (pv#945029 - sorry, no bugworks access:) which has the same stack trace. However, their problem occurs when they run out of space (and they say it happens when testing with default ACLs and inheriting ACLs). I plan to try out their scenario. However, are there any unusual circumstances in your situation which would provide a clue for reproducing it locally? Things go wrong when the in-memory log buffer makes it to disk: we get a callback and then call our xfs_trans_committed routine. This adds the items in the transaction to the active item list (AIL), which is a list of items (for metadata) that are in the on-disk log but whose metadata has not been written to disk yet. If an item already exists in the list, it just updates its position in the list. For pv#945029, they reported that xfs_ail_insert fails because the lip->li_ail.ail_forw field is NULL, which is a problem when it links the next item's back ptr to our new item. The insert works by scanning back from the end of the list, so we traverse using only the back ptrs. Somehow the back ptrs are intact but the forward ptr isn't. The AIL is locked prior to this call, so there shouldn't be a race problem. --Tim
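For anyone following along, the insert described above can be sketched roughly as below. This is a minimal illustration and not the actual XFS code: the struct item type, its field names, the sentinel-head list layout, and the ail_init/ail_insert/ail_delete/ail_update helpers are all hypothetical simplifications. It only shows why a backward scan can succeed on intact back ptrs and still fail at link time if a forward ptr is NULL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a log item; NOT the real XFS struct. */
struct item {
    struct item *forw;   /* next item (higher or equal LSN) */
    struct item *back;   /* previous item (lower or equal LSN) */
    unsigned long lsn;   /* log sequence number used as the sort key */
};

/* Assumed layout: circular list with a sentinel head; empty list points at itself. */
static void ail_init(struct item *head)
{
    head->forw = head;
    head->back = head;
    head->lsn = 0;
}

/* Insert new_item in LSN order, scanning backward from the tail. */
static void ail_insert(struct item *head, struct item *new_item)
{
    struct item *pos = head->back;

    /* The scan touches only back pointers, so it works even if a
     * forward pointer somewhere in the list has been corrupted. */
    while (pos != head && pos->lsn > new_item->lsn)
        pos = pos->back;

    /* Linking is the first place pos->forw is dereferenced. A NULL here
     * corresponds to the failure mode described in the comment above. */
    assert(pos->forw != NULL);

    new_item->forw = pos->forw;
    new_item->back = pos;
    pos->forw->back = new_item;   /* would oops if pos->forw were NULL */
    pos->forw = new_item;
}

/* Unlink an item and mark it as not on the list. */
static void ail_delete(struct item *it)
{
    it->back->forw = it->forw;
    it->forw->back = it->back;
    it->forw = NULL;
    it->back = NULL;
}

/* Add the item, or move it if it is already on the list.
 * Assumes items start out zero-initialized (forw == NULL means "not listed"). */
static void ail_update(struct item *head, struct item *it, unsigned long lsn)
{
    if (it->forw != NULL)
        ail_delete(it);
    it->lsn = lsn;
    ail_insert(head, it);
}
```

In this sketch a caller would hold whatever lock protects the list around ail_update, matching the note above that the AIL is locked before the call.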
Unfortunately I cannot provide further information, as I reinstalled the corrupted partition with a different filesystem. I cannot trigger it immediately, but running autobuild (which does a lot of compilation, file reads and writes) on the machine for several days appears to have caused this problem.
Tim, I'm assigning this bug to you until we have a fix.
The traceback looks the same as the one in bug 133990. Was the FS full, and was it using default ACLs? --Tim *** This bug has been marked as a duplicate of 133990 ***
I don't think the file system was full, but it could have happened. It looks similar indeed.