Bugzilla – Full Text Bug Listing
| Summary: | kernel oops at boot | | |
|---|---|---|---|
| Product: | [openSUSE] SUSE Linux 10.1 | Reporter: | Jon Nelson <jnelson-suse> |
| Component: | Kernel | Assignee: | Neil Brown <nfbrown> |
| Status: | RESOLVED FIXED | QA Contact: | E-mail List <qa-bugs> |
| Severity: | Normal | | |
| Priority: | P5 - None | | |
| Version: | Beta 4 | | |
| Target Milestone: | --- | | |
| Hardware: | Other | | |
| OS: | Other | | |
| Whiteboard: | fixreleased:kernel:sles10 | | |
| Found By: | Other | Services Priority: | |
| Business Priority: | | Blocker: | --- |
| Marketing QA Status: | --- | IT Deployment: | --- |
| Attachments: | Patch to fix lockup | | |
Description

Jon Nelson 2006-02-19 16:31:16 UTC

mdadm triggers a soft lockup in find_get_pages:

    read_lock_irq(&mapping->tree_lock);

It seems we're leaking a write lock somewhere. Jon, can you reproduce this?
I'll see what I can do - I've been working on a bunch of other items recently. I seem to recall seeing at least one other oops at boot time, although I don't think it was exactly the same.

The other oops may have been the holder of the write lock. We need whichever oops was first.

(In reply to comment #3)
> The other oops may have been the holder of the write lock. We need whichever
> oops was first.

I don't want to be misunderstood - I believe I have seen multiple, different oopses, all at boot time. However, not /every/ boot seems to generate an oops.

This trace isn't making any sense at all to me. mdadm calls an ioctl which goes through block_ioctl to generic_file_read. What ioctl would do that? There is no mention of 'md_ioctl' in the trace, so it doesn't look like an md-specific ioctl is being called. The only other ioctls mdadm calls are BLKGETSIZE, BLKGETSIZE64, and BLKFLSBUF, and I don't see any of them calling generic_file_read... I guess it is possible that if md is a module, maybe its symbols aren't being printed for some reason. There is an md ioctl that could possibly call generic_file_read, but you would still expect read_cache_page to be in the trace, and this would only happen when starting an array that had a write-intent bitmap stored in a separate file. Are there any md arrays with the bitmap in a separate file? Very weird .... have you considered memcheck?? Probably a long shot...

This was a fresh 10.1 beta4 install, no raid devices at all. I do use LVM, though, if that matters. I do not get the same oops every time I boot, nor do I even *get* an oops every time I boot. Both the frequency and the oops itself seem random. As far as the machine goes, I've had it about 3 years and it's been rock solid running 9.1 through 10.0, and a bunch of stuff prior to that. I will try to find some time to boot it a bunch of times and see if I can get more oopses for you. When they happen, they invariably happen early in the boot process.
re comment #5 - the generic_file_read addresses on the stack are probably leftovers from earlier calls. In the light of comment #6, this may be bad RAM. Please try running memtest for a few hours, just to be sure. If your RAM is okay, please run hwinfo on your machine and attach the output to this bug report.

I haven't been able to make this happen with beta5, so I'm not sure what you want to do here. I'm quite sure the memory is fine (it was a daily use machine that was rock solid up until a few weeks ago when I started with the betas).

Well, if you cannot reproduce it, maybe we can assume it is fixed. On reviewing the trace, the only thing I can come up with is that maybe invalidate_mapping_pages is just taking a loooong time to invalidate all the pages that may have been cached for some block device. The location of the lockup (find_get_pages+0x4c/0x53) looks to be more like the unlock than the lock, so I don't think it is a lock leak. Chris: You probably know more about the MM than me. Would it make sense to put a cond_resched() in invalidate_mapping_pages just after the unlock_page?

invalidate_mapping_pages seems to expect no schedules at all. We would have to audit the callers for spin locks. It's quite possible this is a false positive of the softlock detection code. Let's resolve this as INVALID.

Does 'softlock' have false positives? 10 seconds without a schedule is a long time in anybody's book... I might still look into cond_resched in invalidate_mapping_pages, but I'm happy with this bug going INVALID for now. It can always be reopened if it recurs.

I've had a very similar bug reported on linux-raid, so I'm re-opening this bug.

Created attachment 79158 [details]
Patch to fix lockup
I believe this patch fixes the lockup.
Leaving this bug as 'assigned' to remind me to get the patch into CVS after seeking feedback on linux-kernel.

I haven't had this bug since beta5 or so. I tested all betas and release candidates through rc3 *except* rc2.

I didn't get this into SLES10, but it will be in -SP1 and any maintenance update. The reason it hasn't hit since beta5 is probably just luck. The bug only hits if a race is lost, and lots of variables could affect that.

Present and active in CVS SLES10_GA_BRANCH for this bug: patches.fixes/truncate-soft-lockup

Adding whiteboard flag for maintenance tracking.

Approved for maintenance (just for completeness).

truncate-soft-lockup: Patch published in kernelupdate 2.6.16.21-0.25, dated Sep 19, 2006 & released Oct 04, 2006. Setting Whiteboard Status -> fixreleased