Bug 384150

Summary: ata errors causing hang in Yast
Product: [openSUSE] openSUSE 11.0
Reporter: Andras Mantia <amantia>
Component: Kernel
Assignee: Tejun Heo <teheo>
Status: RESOLVED WONTFIX
QA Contact: E-mail List <qa-bugs>
Severity: Major
Priority: P5 - None
CC: trenn
Version: Beta 1
Target Milestone: ---
Hardware: Other
OS: Other
Whiteboard:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
Attachments:
  dmesg after the errors appeared
  dmesg output after disabling hal for sr0
  lspci output

Description Andras Mantia 2008-04-26 22:11:28 UTC
Since I upgraded to 11.0 I have problems using my DVD writer. The 11.0beta1 DVD is inside, and I try to install some packages. After a while I get lots of errors, like:

Apr 27 01:07:15 stein klogd: ata2.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Apr 27 01:07:15 stein klogd: ata2.01: cmd a0/00:00:00:00:00/00:00:00:00:00/b0 tag 0
Apr 27 01:07:15 stein klogd:          cdb 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
Apr 27 01:07:15 stein klogd:          res 40/00:02:00:0c:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
Apr 27 01:07:15 stein klogd: ata2.01: status: { DRDY }
Apr 27 01:07:15 stein klogd: ata2: soft resetting link
Apr 27 01:07:16 stein klogd: ata2.00: configured for PIO0
Apr 27 01:07:16 stein klogd: ata2.01: configured for PIO0
Apr 27 01:07:16 stein klogd: ata2: EH complete

Installation just hangs at this point. I cannot eject the medium using "eject sr0", nor can I close the yast install module. The drive and the medium have no problems, and a quick Google search showed that I'm not the only one having this issue.
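
For anyone trying to judge how often these errors occur, a rough sketch like the following could count the libata "exception" lines per device in a saved log. The function name and the regular expression are illustrative assumptions based only on the log excerpt above; other kernel versions may format these lines differently.

```python
import re
from collections import Counter

# Matches the line that opens each libata error report, e.g.
# "ata2.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen".
# The pattern is an assumption derived from the excerpt in this report.
EXCEPTION_RE = re.compile(r"\b(ata\d+(?:\.\d+)?): exception Emask")


def count_exceptions(log_text):
    """Return a Counter mapping device name (e.g. 'ata2.01') to exception count."""
    counts = Counter()
    for line in log_text.splitlines():
        match = EXCEPTION_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Three lines copied from the log excerpt in the description.
sample = """\
Apr 27 01:07:15 stein klogd: ata2.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Apr 27 01:07:15 stein klogd: ata2.01: cmd a0/00:00:00:00:00/00:00:00:00:00/b0 tag 0
Apr 27 01:07:15 stein klogd: ata2: EH complete
"""

print(count_exceptions(sample))
```

Feeding it the full /var/log/messages contents instead of the short sample would show whether one device or both are affected.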
Comment 1 Andras Mantia 2008-04-26 22:56:54 UTC
I have to add that it seems the log is filled with this error even when the drive is not really used. :(
Comment 2 Andras Mantia 2008-04-27 08:13:47 UTC
Now it is clear that I get the errors even if I don't use the drive. Not only that, but the load on the machine increases continuously; last night the load average was above 46 when I shut it down.
Comment 3 Andras Mantia 2008-04-27 08:26:29 UTC
I just realized that the error is for two drives:
From the log: 
ata2: PATA max UDMA/133 cmd 0x170 ctl 0x376 bmdma 0xfc08 irq 15
ata2.00: ATAPI: TSSTcorpCD/DVDW TS-H552U, US07, max UDMA/33
ata2.01: ATAPI: JLMS    XJ-HD163, GH5S, max UDMA/33
ata2.00: configured for UDMA/33
ata2.01: configured for UDMA/33

the drives are on the same cable, the writer is the master, the reader is the slave.
Comment 4 Andras Mantia 2008-04-27 10:39:48 UTC
I disabled pata_via (removed it from the initrd modules, blacklisted it, and added via82cxxx to the initrd), and now it seems to work OK. Of course it is no longer using libata and my device names changed, so I had to adapt the fstab as well.
Comment 5 Tejun Heo 2008-05-06 06:29:23 UTC
That's hald polling for media presence and the kernel is timing it out.  This is usually caused by drives locking up on continuous polling, which is done by trying to open the cdrom device.  The open command sequences used by the IDE cdrom driver and the SCSI cdrom driver (sr) can be different under some circumstances and this might be causing the problem.  Can you please do the following to verify?

* Change the driver back to pata_via.  Let such failures occur several times and attach the result of "dmesg" here.

* Disable polling on both devices using "hal-disable-polling /dev/sr0; hal-disable-polling /dev/sr1"

* Reboot and check that both cdroms work fine.  Auto-mount and stuff won't work but everything else should.  Post the result of "dmesg" after doing so.

You can change to mount-by-UUID or mount-by-volName, which is default for fresh installs.  That way switching between the two drivers won't require updating fstab.

Thanks.
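
As a rough illustration of the mount-by-UUID suggestion above, the sketch below rewrites the device field of an fstab line to the UUID= form. The function name, the device name /dev/hda2 and the UUID are all made up for the example; real values would come from the system itself (for instance from the symlinks under /dev/disk/by-uuid).

```python
def to_uuid_entry(fstab_line, uuid_by_device):
    """Rewrite the first field of one fstab line to UUID=<uuid> when known.

    Comment lines, blank lines and devices missing from the mapping are
    returned unchanged.
    """
    stripped = fstab_line.strip()
    if not stripped or stripped.startswith("#"):
        return fstab_line
    fields = stripped.split()
    uuid = uuid_by_device.get(fields[0])
    if uuid is None:
        return fstab_line
    fields[0] = "UUID=" + uuid
    return " ".join(fields)


# Hypothetical mapping; the device and the UUID are placeholders.
uuids = {"/dev/hda2": "01234567-89ab-cdef-0123-456789abcdef"}

print(to_uuid_entry("/dev/hda2 / ext3 defaults 1 1", uuids))
```

An entry written this way does not depend on whether the disk shows up as /dev/hdX (old IDE driver) or /dev/sdX (libata).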
Comment 6 Andras Mantia 2008-05-09 14:37:19 UTC
Due to a hardware upgrade I have only one IDE device now, the DVD reader. Luckily (or not) the bug is visible with this device alone. So as a first step, I will attach the dmesg output.
Comment 7 Andras Mantia 2008-05-09 14:38:21 UTC
Created attachment 213984
dmesg after the errors appeared
Comment 8 Andras Mantia 2008-05-09 14:54:22 UTC
Created attachment 213992
dmesg output after disabling hal for sr0
Comment 9 Andras Mantia 2008-05-09 14:56:02 UTC
Disabling hal makes the device work with pata as well. I attached the new log. Please ignore the read errors at the beginning; I had put the wrong medium in the reader. Ignore the vmware errors too, they are unrelated. :)
Comment 10 Tejun Heo 2008-05-13 13:04:15 UTC
Hmmm... Can you be persuaded into sending the drive to me?  I can pay for the shipping and replacement drive via paypal.  We had several reports regarding drives failing on media presence polling and having a malfunctioning one around will definitely help fixing the problem.

Thanks.
Comment 11 Andras Mantia 2008-05-13 14:11:18 UTC
I could send it to you (especially within the EU), but I'm not sure whether this is caused by the drive or by the chipset. I saw it with at least two drives: the DVD writer that I removed, which was a Samsung/TSSTcorp, and the Lite-On/JLMS DVD reader.
The mainboard is an ASUS A8V-Deluxe, socket 939 in case it matters.
Anyway, if you need it and it would be useful for you, I can send it next week.
Comment 12 Tejun Heo 2008-05-13 14:21:22 UTC
Hmmm... I'm located in Korea but UPS should work well enough.  I think it's more likely the drive but it would be great if you can ship the board too.  Would that be possible?
Comment 13 Andras Mantia 2008-05-13 14:47:42 UTC
OK, I have to check how I can send it to Korea. I have never done it, especially with UPS, and in the past you could send packages abroad only from a special post office, twice a week.
Unfortunately shipping the board is harder, especially because it is s939 and it is hard to get a (good) s939 board today (almost all of them are AM2). That would mean changing the architecture (CPU, memory, graphics card), and well, I just bought a new graphics card...
So maybe I should send the drive and you could look for a second-hand MB there?
Comment 14 Tejun Heo 2008-05-14 03:12:17 UTC
Okay, only the drive then.  Please lemme know after you figured out how to send the drive.  Thanks.
Comment 15 Andras Mantia 2008-05-20 16:56:04 UTC
UPS sending is expensive (~90 euro), regular post is cheaper, but would take much more time to get it there. I will make a test with a different IDE drive soon, and if that fails, I think we can be sure that the problem is not in the driver, but in the chipset (or chipset driver).
Comment 16 Andras Mantia 2008-05-23 16:44:25 UTC
Today I made some tests:
- Plextor IDE CD writer on the same channel, using the same cable. No errors.
- Lite-On LTD-163 DVD reader: works fine until something starts to access it. Now it seems that it is NOT hald. I saw the errors appear after I ran Windows in vmware, so I rebooted and stopped vmware completely. Everything was fine until I tried to play some mp3s from a CD. Now kio_file (from KDE) was using the drive and I couldn't kill this process because the kernel was busy (and I got the ata errors).

So it seems that indeed the drive or the driver is the guilty one. As I said sending from here is expensive and complicated. I found similar drives on eBay for sale (without bidding) for a reasonable price. Can't you get one from there?

Here is an example link:
http://cgi.ebay.com/LITEON-LTD-163-16X-INTERNAL-CD-DVD-ROM-DRIVE_W0QQitemZ220206744898QQihZ012QQcategoryZ3754QQrdZ1QQssPageNameZWD1VQQcmdZViewItem

As a final note, I had absolutely no problems with the driver with prior versions of openSUSE, including the 10.3 version with all the updates.
Comment 17 Tejun Heo 2008-05-26 03:46:15 UTC
Does specifying "libata.force=1:pio4" make any difference?
Comment 18 Andras Mantia 2008-05-29 20:23:22 UTC
Sorry, I couldn't try this option yet, but I found that "noapic" helps, I don't get the errors anymore.
Of course I don't understand why it worked before without noapic, or why the Plextor drive works also without it.
Comment 19 Tejun Heo 2008-05-30 03:43:28 UTC
Hmm.. it doesn't really add up.  The only way acpi can affect libata operation is via IRQ routing in this case and if IRQ routing was bonkers, it shouldn't have worked at all for any drive.  Can you please try for longer period of time w/ and w/o the acpi option and confirm the behaviors?  Thanks.
Comment 20 Andras Mantia 2008-05-31 14:33:13 UTC
I tested a little more:
- noapic (apic, not acpi) helps for sure, I was running the computer for days without seeing the error
- without noapic the error is visible soon after boot
- libata.force=1:pio4 : I'm running with this and without noapic for 30 minutes, without problems, playing music from the problematic drive.
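
To double-check which of the options listed above a given boot actually used, one could inspect the kernel command line (available in /proc/cmdline on a running system). The sketch below is only an illustration: the function name is invented, and the sample command line, including its root= value, is a placeholder and not taken from this machine.

```python
def boot_options(cmdline):
    """Split a kernel command line into a dict of option -> value.

    Options without '=' (plain flags such as noapic) map to None.
    """
    options = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        options[key] = value if sep else None
    return options


# Hypothetical command line for illustration only.
sample_cmdline = "root=/dev/sda2 noapic libata.force=1:pio4"

opts = boot_options(sample_cmdline)
print("noapic" in opts)
print(opts.get("libata.force"))
```

Note that this distinguishes "noapic" from ACPI-related options such as "acpi=off", which are separate tokens.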
Comment 21 Andras Mantia 2008-05-31 14:34:33 UTC
And of course, in the very same moment when I pressed "Commit" the music stopped and I see the errors. Switching back to noapic now...
Comment 22 Tejun Heo 2008-06-01 05:26:08 UTC
I see. noapic.  It's still about the same tho.  If noapic is specified, IRQs are routed through the legacy 8259 IRQ controller, which makes it an IRQ routing problem.  Can you please post the result of "lspci -nnvvv"?
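
One way to see the routing difference described above is to look at /proc/interrupts, where x86 kernels label each IRQ line with the interrupt controller handling it (names containing "IO-APIC" or "XT-PIC"). The sketch below is a minimal illustration under that assumption; the function name is invented, the two sample lines are made up, and the exact column layout varies between kernel versions.

```python
def irq_controller(interrupts_text, irq):
    """Return 'IO-APIC', 'XT-PIC', or None for the given IRQ number.

    interrupts_text is expected to look like the contents of /proc/interrupts.
    """
    prefix = f"{irq}:"
    for line in interrupts_text.splitlines():
        if line.strip().startswith(prefix):
            if "IO-APIC" in line:
                return "IO-APIC"
            if "XT-PIC" in line:
                return "XT-PIC"
            return None
    return None


# Made-up sample lines, not taken from the reporter's machine.
with_apic = " 15:      10432   IO-APIC-edge      pata_via\n"
with_noapic = " 15:      10432    XT-PIC-XT        pata_via\n"

print(irq_controller(with_apic, 15))
print(irq_controller(with_noapic, 15))
```

IRQ 15 is used here because the probe log in comment 3 shows ata2 on irq 15.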
Comment 23 Andras Mantia 2008-06-01 07:15:56 UTC
Created attachment 219360
lspci output
Comment 24 Andras Mantia 2008-06-01 07:17:59 UTC
I added the output when running noapic. Do you also want the same output when noapic is not used?
Comment 25 Tejun Heo 2008-06-01 08:17:35 UTC
Going through the bug again.  Both using a different drive && noapic work around the problem.  The former points at the drive specific problem while the latter points at generic IRQ routing problem.  The two causes interacting with each other is quite unlikely although I can't say it can never happen.  Strange....
Comment 26 Andras Mantia 2008-06-01 08:30:15 UTC
I admit this is strange. I will borrow the other drive again and test it for a longer period.
Comment 27 Tejun Heo 2008-06-01 08:35:58 UTC
Thanks.  I appreciate that.  Also, you never had any problem w/ SL103?

Cc'ing Thomas for reference.
Comment 28 Andras Mantia 2008-06-01 08:47:13 UTC
No, the system worked fine with 10.3 and earlier versions. I have had the drive since 2002 and the motherboard since 2005 or 2006. It might be that with 10.3 I was still using the old drivers rather than libata, as my IDE hard disk was /dev/hda, so that might not be relevant. But I certainly never used any extra kernel parameters.
Comment 29 Thomas Renninger 2008-06-12 13:15:26 UTC
Puhh, no idea.
Maybe you should update to a newer BIOS, not sure whether BIOS initialization can be different if you have another driver? Very strange...
I close this one as won't fix, IMO it's not worth wasting more time. Please keep commenting or reopen the bug if you find out more.
Comment 30 Andras Mantia 2008-06-12 13:27:54 UTC
I have two other drives to test with, I just need some time to do it. Upgrading the BIOS is not possible, as I already have the latest one. Downgrading is possible, but the old BIOS caused hangs with the Opteron; maybe I could try it with 11.0...
I'm fine with closing it as wontfix, but if you look at the lkml list you will see that I'm far from being the only one having such issues.
Comment 31 Tejun Heo 2008-06-12 13:45:38 UTC
Andras, please add comments and/or reopen if you find anything interesting.  Also, which threads in lkml are you talking about?  I try to watch ATA related problems in LKML but miss a lot of them.