Bug 393675

Summary: 2.6.25.4-2-default kernel -> ata errors, system freeze
Product: [openSUSE] openSUSE 11.0 Reporter: Rolf Offermanns <roffermanns>
Component: KernelAssignee: Tejun Heo <teheo>
Status: RESOLVED NORESPONSE QA Contact: E-mail List <qa-bugs>
Severity: Major    
Priority: P5 - None CC: dchubarov, rolf.offermanns, trenn
Version: Factory   
Target Milestone: ---   
Hardware: i686   
OS: openSUSE 11.0   
Whiteboard:
Found By: --- Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---
Attachments: hdparm -I for the disk in question as suggested by Tejun Heo
full kernel boot log as suggested by Tejun Heo
lspci -nn
/proc/ioports
acpidump
lspci -nnvvv
/proc/interrupts

Description Rolf Offermanns 2008-05-22 16:54:35 UTC
My laptop (Thinkpad T43p) freezes for 10-20 sec. whenever I switch from ac power to battery or vice versa.

I see the following message in the kernel log, whenever this happens:
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata1.00: cmd ca/00:08:77:14:79/00:00:00:00:00/e2 tag 0 dma 4096 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: status: { DRDY }
ata1: port is slow to respond, please be patient (Status 0xd0)
ata1: device not ready (errno=-16), forcing hardreset
ata1: soft resetting link
ata1.00: configured for UDMA/100
ata1: EH complete
sd 0:0:0:0: [sda] 117210240 512-byte hardware sectors (60012 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA


Bug 384150 has a similar message.
Comment 1 Tejun Heo 2008-06-24 07:45:09 UTC
Hmmm... This looks like a hardware problem to me.  SATA links are very susceptible to electro magnetic interferences and voltage fluctuations.  They're the canaries in the system with the serial port.  (It's SATA, right?, please post full kernel boot log and the result of hdparm -I)

It probably stands out due to the high timeout value sd uses (30 secs).  Adjusting timeout values is planned but it seems to difficult to get people agree on the correct value.

Thanks.
Comment 2 Dmitri Chubarov 2008-07-25 06:18:50 UTC
I have updated from 2.6.25.5 to 2.6.25.11-0.1 and started to see these messages on the first reboot. Later followed the advice to boot the kernel with combined_mode=libata. Still the same messages showed up in dmesg immediately after boot. Could this be related to this update?

The controller is :
Intel Corporation 82801FR/FRW (ICH6R/ICH6RW) SATA Controller (rev 03)
The device is: 
Seagate Barracuda 7200.7 ST3120827AS 120GB Hard Drive

Though smartctl report looks pretty sad:
ID# ATTRIBUTE_NAME           VALUE WORST THRESH TYPE    WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate      061   047   006    Pre-fail    -      182468611
  3 Spin_Up_Time             097   096   000    Pre-fail    -      0
  4 Start_Stop_Count         100   100   020    Old_age     -      87
  5 Reallocated_Sector_Ct    100   100   036    Pre-fail    -      0
  7 Seek_Error_Rate          089   060   030    Pre-fail    -      818543413
  9 Power_On_Hours           083   083   000    Old_age     -      15464
 10 Spin_Retry_Count         100   100   097    Pre-fail    -      0
 12 Power_Cycle_Count        100   100   020    Old_age     -      89
194 Temperature_Celsius      037   044   000    Old_age     -      37 (0 19 0 0)
195 Hardware_ECC_Recovered   061   047   000    Old_age     -      182468611
197 Current_Pending_Sector   100   100   000    Old_age     -      0
198 Offline_Uncorrectable    100   100   000    Old_age     -      0
199 UDMA_CRC_Error_Count     200   200   000    Old_age     -      0
200 Multi_Zone_Error_Rate    100   253   000    Old_age     -      0
202 TA_Increase_Count        100   253   000    Old_age     -      0
Comment 3 Dmitri Chubarov 2008-07-25 06:22:06 UTC
Created attachment 229930 [details]
hdparm -I for the disk in question as suggested by Tejun Heo
Comment 4 Dmitri Chubarov 2008-07-25 06:26:52 UTC
Created attachment 229932 [details]
full kernel boot log as suggested by Tejun Heo
Comment 5 Dmitri Chubarov 2008-07-25 14:01:40 UTC
noapic boot option fixed the problem for me
Comment 6 Dmitri Chubarov 2008-07-25 14:05:47 UTC
with noapic the system boots without issues but eventually freezes as before.
Comment 7 Tejun Heo 2008-07-28 06:06:13 UTC
Hmmm... Strange, so it's not a hardware problem.  Does running "hdparm -B 255 /dev/sda" make any difference?

Thomas, it looks like it also could be an IRQ related problem.  Does anything ring a bell?
Comment 8 Thomas Renninger 2008-07-28 12:16:49 UTC
This "could" be related:

Driver 'sr' needs updating - please use bus_type methods
sr0: scsi3-mmc drive: 40x/40x writer cd/rw xa/form2 cdda tray
Uniform CD-ROM driver Revision: 3.20
sr 4:0:0:0: Attached scsi CD-ROM sr0
sd 0:0:0:0: Attached scsi generic sg0 type 0
sr 4:0:0:0: Attached scsi generic sg1 type 5
ACPI: PCI Interrupt 0000:00:1f.3[B] -> GSI 19 (level, low) -> IRQ 19
ACPI: I/O resource 0000:00:1f.3 [0x400-0x41f] conflicts with ACPI region SMRG [0x400-0x40f]
ACPI: Device needs an ACPI driver

What is 0x400-0x40f, is it used by libata devices, then ACPI may poke that IO space without driver synchronization.

Can you provide:
cat /proc/ioport
lspci -nn
acpidump output
Comment 9 Dmitri Chubarov 2008-07-30 14:51:13 UTC
Created attachment 230800 [details]
lspci -nn
Comment 10 Dmitri Chubarov 2008-07-30 14:51:42 UTC
Created attachment 230801 [details]
/proc/ioports
Comment 11 Dmitri Chubarov 2008-07-30 14:52:52 UTC
Created attachment 230802 [details]
acpidump
Comment 12 Dmitri Chubarov 2008-07-30 14:55:07 UTC
(In reply to comment #7 from Tejun Heo)
> Hmmm... Strange, so it's not a hardware problem.  Does running "hdparm -B 255
> /dev/sda" make any difference?
> 
How do I run hdparm on a disk holding the root filesystem?
When I do 
    hdparm -v -v -B 255 /dev/sda
I get the following Input/output error 
/dev/sda:
 setting Advanced Power Management level to disabled
 HDIO_DRIVE_CMD failed: Input/output error
 IO_support    =  0 (default) 
16-bit)
 HDIO_GET_UNMASKINTR failed: Inappropriate ioctl for device
 HDIO_GET_DMA failed: Inappropriate ioctl for device
 HDIO_GET_KEEPSETTINGS failed: Inappropriate ioctl for device
 readonly      =  0 (off)
 readahead     = 256 (on)
 geometry      = 14593/255/63, sectors = 234441648, start = 0
Comment 13 Tejun Heo 2008-07-30 23:01:03 UTC
Looks like the drive just doesn't support APM.  So, it's unrelated.  Can you please post the result of "lspci -nnvvv".  Let's see what's overlapping with ACPI.
Comment 14 Dmitri Chubarov 2008-07-31 03:18:46 UTC
Created attachment 230944 [details]
lspci -nnvvv

Sending the output from lspci -nnvvv.

Something in the output does indeed show overlapping IRQ assignments. Both SATA Controller and SMBus are routed to IRQ 19.
Comment 15 Dmitri Chubarov 2008-07-31 03:34:24 UTC
Created attachment 230948 [details]
/proc/interrupts

I am also attaching /proc/interrupts in case the problem might be an IRQ issue, which agrees well with the fact that the frequency of system freezes varies between reboots
Comment 16 Tejun Heo 2008-07-31 10:10:14 UTC
IRQs can be shared, so usually it should be okay.  The overlapping region is SMBus controller, which should be okay too as long as only either one of the two interfaces is used (right Thomas?).

Does unloading all usb related modules make any difference?
Comment 17 Dmitri Chubarov 2008-07-31 12:51:47 UTC
(In reply to comment #16 from Tejun Heo)
> IRQs can be shared, so usually it should be okay.  The overlapping region is
> SMBus controller, which should be okay too as long as only either one of the
> two interfaces is used (right Thomas?).

I have tried changing the  "PnP OS" switch in the BIOS between "Yes" and "No" and booting with pci=routeirq, disabling the ieee1394 port but the IRQ allocation remains almost the same.

> 
> Does unloading all usb related modules make any difference?
> 
It seems, there was no difference. In case I might be missing something, I paste lots of messages from syslog below.

# modprobe -r  ehci_hcd uhci_hcd usbhid usbcore

Jul 31 19:15:21 clust08 kernel: ehci_hcd 0000:00:1d.7: remove, state 4
Jul 31 19:15:21 clust08 kernel: usb usb2: USB disconnect, address 1
Jul 31 19:15:21 clust08 kernel: ehci_hcd 0000:00:1d.7: USB bus 2 deregistered
Jul 31 19:15:21 clust08 kernel: uhci_hcd 0000:00:1d.3: remove, state 4
Jul 31 19:15:21 clust08 kernel: usb usb5: USB disconnect, address 1
Jul 31 19:15:21 clust08 kernel: uhci_hcd 0000:00:1d.3: USB bus 5 deregistered
Jul 31 19:15:21 clust08 kernel: uhci_hcd 0000:00:1d.2: remove, state 4
Jul 31 19:15:21 clust08 kernel: usb usb4: USB disconnect, address 1
Jul 31 19:15:21 clust08 kernel: uhci_hcd 0000:00:1d.2: USB bus 4 deregistered
Jul 31 19:15:21 clust08 kernel: uhci_hcd 0000:00:1d.1: remove, state 4
Jul 31 19:15:21 clust08 kernel: usb usb3: USB disconnect, address 1
Jul 31 19:15:21 clust08 kernel: uhci_hcd 0000:00:1d.1: USB bus 3 deregistered
Jul 31 19:15:21 clust08 kernel: uhci_hcd 0000:00:1d.0: remove, state 1
Jul 31 19:15:21 clust08 kernel: usb usb1: USB disconnect, address 1
Jul 31 19:15:21 clust08 kernel: usb 1-1: USB disconnect, address 2
Jul 31 19:15:21 clust08 kernel: usb 1-2: USB disconnect, address 3
Jul 31 19:15:21 clust08 kernel: uhci_hcd 0000:00:1d.0: USB bus 1 deregistered
Jul 31 19:15:21 clust08 kernel: usbcore: deregistering interface driver usbhid
Jul 31 19:15:21 clust08 kernel: usbcore: deregistering interface driver hiddev
Jul 31 19:15:21 clust08 kernel: usbcore: deregistering device driver usb
Jul 31 19:15:21 clust08 kernel: usbcore: deregistering interface driver usbfs
Jul 31 19:15:21 clust08 kernel: usbcore: deregistering interface driver hub

...

Jul 31 19:43:10 clust08 kernel: ata2.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x2 frozen
Jul 31 19:43:10 clust08 kernel: ata2.00: cmd 61/09:00:8f:3d:54/00:00:02:00:00/40 tag 0 ncq 4608 
out
Jul 31 19:43:10 clust08 kernel:          res 40/00:00:00:4f:c2/00:00:00:4f:c2/00 Emask 0x4 (time
out)
Jul 31 19:43:10 clust08 kernel: ata2.00: status: { DRDY }
Jul 31 19:43:13 clust08 kernel: ata2: soft resetting link
Jul 31 19:43:19 clust08 kernel: ata2: port is slow to respond, please be patient (Status 0xd0)
Jul 31 19:43:23 clust08 kernel: ata2: softreset failed (device not ready)
Jul 31 19:43:23 clust08 kernel: ata2: hard resetting link
Jul 31 19:43:29 clust08 kernel: ata2: port is slow to respond, please be patient (Status 0x80)
Jul 31 19:43:33 clust08 kernel: ata2: COMRESET failed (errno=-16)
Jul 31 19:43:33 clust08 kernel: ata2: hard resetting link
Jul 31 19:43:34 clust08 kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Jul 31 19:43:34 clust08 kernel: ata2.00: configured for UDMA/133
Jul 31 19:43:34 clust08 kernel: ata2: EH complete
Jul 31 19:43:34 clust08 kernel: sd 1:0:0:0: [sda] 234441648 512-byte hardware sectors (120034 MB)
Jul 31 19:43:34 clust08 kernel: sd 1:0:0:0: [sda] Write Protect is off
Jul 31 19:43:34 clust08 kernel: sd 1:0:0:0: [sda] Mode Sense: 00 3a 00 00
Jul 31 19:43:34 clust08 kernel: sd 1:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

Comment 18 Thomas Renninger 2008-07-31 12:56:30 UTC
- Oh midair collision..., should still be relevant: -

I doubt this is related to USB, those seem to be correct.

Can you remove the driver who registers for the smbus. lsmod |grep i801 or lsmod |grep smbus might show it.

I don't see why 1f.2 (PCI device/slot -> AHCI) is routed to irq 19, IMO it should have been evaluated to 18.
But the first IDE device is rather huge with channels and in the channels drive device declarations, I probably miss something there. I just paste some info from dmesg to make it easier to look for it later:

ACPI: PCI Interrupt 0000:00:1f.2[B] -> GSI 19 (level, low) -> IRQ 19
ahci 0000:00:1f.2: AHCI 0001.0000 32 slots 4 ports 1.5 Gbps 0xf impl IDE mode

ata_piix 0000:00:1f.1: version 2.12
ACPI: PCI Interrupt 0000:00:1f.1[A] -> GSI 18 (level, low) -> IRQ 18

You should try to remove the smbus driver first.
blacklist i2c-i801
in /etc/modprobe.conf
should do the trick.
Comment 19 Dmitri Chubarov 2008-07-31 13:53:33 UTC
Reporting the SMBus results...

After blacklisting the i2c_i801 module the system rebooted and stayed up for some 25 minutes, then the disk froze again. SMBus now is on IRQ 0, therefore not guilty isn't it? The only two devices remaining on IRQ 19 are USB#2 and SATA.

On the other hand, I have found a thread on the ASUS forum where people are reporting Windows hanging on some boards and SMBus shows I/O ports conflict in Windows device manager. The thread is on the Russian ASUS support forum, so it is of limited use here. I post a link anyway 
http://forum.asus.ru/viewtopic.php?t=3344

I will report again when I manage to update the firmware on the machine in question. Apparently ASUS flash burner fails to run from a floppy image.




Comment 20 Thomas Renninger 2008-07-31 14:03:51 UTC
> SMBus now is on IRQ 0
?
You could try to move the driver to be sure it's not touched:
mv /lib/modules/`uname -r`/kernel/drivers/i2c/busses/i2c-i801.ko /tmp
and reboot. Just to be sure...

> I will report again when I manage to update the firmware on the machine in
> question
This is a good idea
Comment 21 Dmitri Chubarov 2008-08-01 09:11:10 UTC
(In reply to comment #20 from Thomas Renninger)
> > SMBus now is on IRQ 0
> ?
> You could try to move the driver to be sure it's not touched:
> mv /lib/modules/`uname -r`/kernel/drivers/i2c/busses/i2c-i801.ko /tmp
> and reboot. Just to be sure...

That comes from lspci: 

00:1f.3 SMBus [0c05]: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) SMBus Controller [8086:266a] (rev 03)
	Subsystem: ASUSTeK Computer Inc. P5GD1-VW Mainboard [1043:80a6]
	Control: I/O+ Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin B routed to IRQ 0
	Region 4: I/O ports at 0400 [size=32]
	Kernel modules: i2c-i801

The module is not loaded.

> 
> > I will report again when I manage to update the firmware on the machine in
> > question
> This is a good idea

It keeps coming. I have updated from 1005 BIOS to 1012 yet it is there. About 10 events in 15 hours probably correlated with user activity. 

Comment 22 Thomas Renninger 2008-08-01 09:45:39 UTC
I expect 10.3 did and still does work?

This is getting ugly.
You could try to remove battery, thermal, processor, ac, fan drivers.
The processor driver you better move away again or it gets loaded through cpufreq drivers:
mv /lib/modules/`uname -r`/kernel/drivers/acpi/processor.ko /tmp
Be careful to not delete it or you will have to reinstall the kernel.
You could copy all of them away.

If it works then it is related to C-states, cpufreq, EC or something like that.
If you find out how you can trigger that easier (e.g. installing and running bonnie or something CPU/IO intensive task?) that could also help a lot.
Comment 23 Andreas Jaeger 2008-10-24 15:29:33 UTC
The requested information has not been provided for over 4 weeks.
The bug gets therefore closed as NORESPONSE.

Please reopen the bug if you can supply the requested information.
I would appreciate if you could test the current openSUSE 11.1 Beta
first and see whether that one fixes the problem.
Comment 24 Rolf Offermanns 2008-10-24 16:11:48 UTC
In fact I am already using the 11.1 betas for a while on this laptop and did not experience this problem yet.