Bugzilla – Bug 393675
2.6.25.4-2-default kernel -> ata errors, system freeze
Last modified: 2008-10-24 16:11:48 UTC
My laptop (Thinkpad T43p) freezes for 10-20 sec. whenever I switch from ac power to battery or vice versa. I see the following message in the kernel log, whenever this happens: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen ata1.00: cmd ca/00:08:77:14:79/00:00:00:00:00/e2 tag 0 dma 4096 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) ata1.00: status: { DRDY } ata1: port is slow to respond, please be patient (Status 0xd0) ata1: device not ready (errno=-16), forcing hardreset ata1: soft resetting link ata1.00: configured for UDMA/100 ata1: EH complete sd 0:0:0:0: [sda] 117210240 512-byte hardware sectors (60012 MB) sd 0:0:0:0: [sda] Write Protect is off sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00 sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA Bug 384150 has a similar message.
Hmmm... This looks like a hardware problem to me. SATA links are very susceptible to electro magnetic interferences and voltage fluctuations. They're the canaries in the system with the serial port. (It's SATA, right?, please post full kernel boot log and the result of hdparm -I) It probably stands out due to the high timeout value sd uses (30 secs). Adjusting timeout values is planned but it seems to difficult to get people agree on the correct value. Thanks.
I have updated from 2.6.25.5 to 2.6.25.11-0.1 and started to see these messages on the first reboot. Later followed the advice to boot the kernel with combined_mode=libata. Still the same messages showed up in dmesg immediately after boot. Could this be related to this update? The controller is : Intel Corporation 82801FR/FRW (ICH6R/ICH6RW) SATA Controller (rev 03) The device is: Seagate Barracuda 7200.7 ST3120827AS 120GB Hard Drive Though smartctl report looks pretty sad: ID# ATTRIBUTE_NAME VALUE WORST THRESH TYPE WHEN_FAILED RAW_VALUE 1 Raw_Read_Error_Rate 061 047 006 Pre-fail - 182468611 3 Spin_Up_Time 097 096 000 Pre-fail - 0 4 Start_Stop_Count 100 100 020 Old_age - 87 5 Reallocated_Sector_Ct 100 100 036 Pre-fail - 0 7 Seek_Error_Rate 089 060 030 Pre-fail - 818543413 9 Power_On_Hours 083 083 000 Old_age - 15464 10 Spin_Retry_Count 100 100 097 Pre-fail - 0 12 Power_Cycle_Count 100 100 020 Old_age - 89 194 Temperature_Celsius 037 044 000 Old_age - 37 (0 19 0 0) 195 Hardware_ECC_Recovered 061 047 000 Old_age - 182468611 197 Current_Pending_Sector 100 100 000 Old_age - 0 198 Offline_Uncorrectable 100 100 000 Old_age - 0 199 UDMA_CRC_Error_Count 200 200 000 Old_age - 0 200 Multi_Zone_Error_Rate 100 253 000 Old_age - 0 202 TA_Increase_Count 100 253 000 Old_age - 0
Created attachment 229930 [details] hdparm -I for the disk in question as suggested by Tejun Heo
Created attachment 229932 [details] full kernel boot log as suggested by Tejun Heo
noapic boot option fixed the problem for me
with noapic the system boots without issues but eventually freezes as before.
Hmmm... Strange, so it's not a hardware problem. Does running "hdparm -B 255 /dev/sda" make any difference? Thomas, it looks like it also could be an IRQ related problem. Does anything ring a bell?
This "could" be related: Driver 'sr' needs updating - please use bus_type methods sr0: scsi3-mmc drive: 40x/40x writer cd/rw xa/form2 cdda tray Uniform CD-ROM driver Revision: 3.20 sr 4:0:0:0: Attached scsi CD-ROM sr0 sd 0:0:0:0: Attached scsi generic sg0 type 0 sr 4:0:0:0: Attached scsi generic sg1 type 5 ACPI: PCI Interrupt 0000:00:1f.3[B] -> GSI 19 (level, low) -> IRQ 19 ACPI: I/O resource 0000:00:1f.3 [0x400-0x41f] conflicts with ACPI region SMRG [0x400-0x40f] ACPI: Device needs an ACPI driver What is 0x400-0x40f, is it used by libata devices, then ACPI may poke that IO space without driver synchronization. Can you provide: cat /proc/ioport lspci -nn acpidump output
Created attachment 230800 [details] lspci -nn
Created attachment 230801 [details] /proc/ioports
Created attachment 230802 [details] acpidump
(In reply to comment #7 from Tejun Heo) > Hmmm... Strange, so it's not a hardware problem. Does running "hdparm -B 255 > /dev/sda" make any difference? > How do I run hdparm on a disk holding the root filesystem? When I do hdparm -v -v -B 255 /dev/sda I get the following Input/output error /dev/sda: setting Advanced Power Management level to disabled HDIO_DRIVE_CMD failed: Input/output error IO_support = 0 (default) 16-bit) HDIO_GET_UNMASKINTR failed: Inappropriate ioctl for device HDIO_GET_DMA failed: Inappropriate ioctl for device HDIO_GET_KEEPSETTINGS failed: Inappropriate ioctl for device readonly = 0 (off) readahead = 256 (on) geometry = 14593/255/63, sectors = 234441648, start = 0
Looks like the drive just doesn't support APM. So, it's unrelated. Can you please post the result of "lspci -nnvvv". Let's see what's overlapping with ACPI.
Created attachment 230944 [details] lspci -nnvvv Sending the output from lspci -nnvvv. Something in the output does indeed show overlapping IRQ assignments. Both SATA Controller and SMBus are routed to IRQ 19.
Created attachment 230948 [details] /proc/interrupts I am also attaching /proc/interrupts in case the problem might be an IRQ issue, which agrees well with the fact that the frequency of system freezes varies between reboots
IRQs can be shared, so usually it should be okay. The overlapping region is SMBus controller, which should be okay too as long as only either one of the two interfaces is used (right Thomas?). Does unloading all usb related modules make any difference?
(In reply to comment #16 from Tejun Heo) > IRQs can be shared, so usually it should be okay. The overlapping region is > SMBus controller, which should be okay too as long as only either one of the > two interfaces is used (right Thomas?). I have tried changing the "PnP OS" switch in the BIOS between "Yes" and "No" and booting with pci=routeirq, disabling the ieee1394 port but the IRQ allocation remains almost the same. > > Does unloading all usb related modules make any difference? > It seems, there was no difference. In case I might be missing something, I paste lots of messages from syslog below. # modprobe -r ehci_hcd uhci_hcd usbhid usbcore Jul 31 19:15:21 clust08 kernel: ehci_hcd 0000:00:1d.7: remove, state 4 Jul 31 19:15:21 clust08 kernel: usb usb2: USB disconnect, address 1 Jul 31 19:15:21 clust08 kernel: ehci_hcd 0000:00:1d.7: USB bus 2 deregistered Jul 31 19:15:21 clust08 kernel: uhci_hcd 0000:00:1d.3: remove, state 4 Jul 31 19:15:21 clust08 kernel: usb usb5: USB disconnect, address 1 Jul 31 19:15:21 clust08 kernel: uhci_hcd 0000:00:1d.3: USB bus 5 deregistered Jul 31 19:15:21 clust08 kernel: uhci_hcd 0000:00:1d.2: remove, state 4 Jul 31 19:15:21 clust08 kernel: usb usb4: USB disconnect, address 1 Jul 31 19:15:21 clust08 kernel: uhci_hcd 0000:00:1d.2: USB bus 4 deregistered Jul 31 19:15:21 clust08 kernel: uhci_hcd 0000:00:1d.1: remove, state 4 Jul 31 19:15:21 clust08 kernel: usb usb3: USB disconnect, address 1 Jul 31 19:15:21 clust08 kernel: uhci_hcd 0000:00:1d.1: USB bus 3 deregistered Jul 31 19:15:21 clust08 kernel: uhci_hcd 0000:00:1d.0: remove, state 1 Jul 31 19:15:21 clust08 kernel: usb usb1: USB disconnect, address 1 Jul 31 19:15:21 clust08 kernel: usb 1-1: USB disconnect, address 2 Jul 31 19:15:21 clust08 kernel: usb 1-2: USB disconnect, address 3 Jul 31 19:15:21 clust08 kernel: uhci_hcd 0000:00:1d.0: USB bus 1 deregistered Jul 31 19:15:21 clust08 kernel: usbcore: deregistering interface driver usbhid Jul 31 19:15:21 clust08 kernel: usbcore: deregistering interface driver hiddev Jul 31 19:15:21 clust08 kernel: usbcore: deregistering device driver usb Jul 31 19:15:21 clust08 kernel: usbcore: deregistering interface driver usbfs Jul 31 19:15:21 clust08 kernel: usbcore: deregistering interface driver hub ... Jul 31 19:43:10 clust08 kernel: ata2.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x2 frozen Jul 31 19:43:10 clust08 kernel: ata2.00: cmd 61/09:00:8f:3d:54/00:00:02:00:00/40 tag 0 ncq 4608 out Jul 31 19:43:10 clust08 kernel: res 40/00:00:00:4f:c2/00:00:00:4f:c2/00 Emask 0x4 (time out) Jul 31 19:43:10 clust08 kernel: ata2.00: status: { DRDY } Jul 31 19:43:13 clust08 kernel: ata2: soft resetting link Jul 31 19:43:19 clust08 kernel: ata2: port is slow to respond, please be patient (Status 0xd0) Jul 31 19:43:23 clust08 kernel: ata2: softreset failed (device not ready) Jul 31 19:43:23 clust08 kernel: ata2: hard resetting link Jul 31 19:43:29 clust08 kernel: ata2: port is slow to respond, please be patient (Status 0x80) Jul 31 19:43:33 clust08 kernel: ata2: COMRESET failed (errno=-16) Jul 31 19:43:33 clust08 kernel: ata2: hard resetting link Jul 31 19:43:34 clust08 kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300) Jul 31 19:43:34 clust08 kernel: ata2.00: configured for UDMA/133 Jul 31 19:43:34 clust08 kernel: ata2: EH complete Jul 31 19:43:34 clust08 kernel: sd 1:0:0:0: [sda] 234441648 512-byte hardware sectors (120034 MB) Jul 31 19:43:34 clust08 kernel: sd 1:0:0:0: [sda] Write Protect is off Jul 31 19:43:34 clust08 kernel: sd 1:0:0:0: [sda] Mode Sense: 00 3a 00 00 Jul 31 19:43:34 clust08 kernel: sd 1:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
- Oh midair collision..., should still be relevant: - I doubt this is related to USB, those seem to be correct. Can you remove the driver who registers for the smbus. lsmod |grep i801 or lsmod |grep smbus might show it. I don't see why 1f.2 (PCI device/slot -> AHCI) is routed to irq 19, IMO it should have been evaluated to 18. But the first IDE device is rather huge with channels and in the channels drive device declarations, I probably miss something there. I just paste some info from dmesg to make it easier to look for it later: ACPI: PCI Interrupt 0000:00:1f.2[B] -> GSI 19 (level, low) -> IRQ 19 ahci 0000:00:1f.2: AHCI 0001.0000 32 slots 4 ports 1.5 Gbps 0xf impl IDE mode ata_piix 0000:00:1f.1: version 2.12 ACPI: PCI Interrupt 0000:00:1f.1[A] -> GSI 18 (level, low) -> IRQ 18 You should try to remove the smbus driver first. blacklist i2c-i801 in /etc/modprobe.conf should do the trick.
Reporting the SMBus results... After blacklisting the i2c_i801 module the system rebooted and stayed up for some 25 minutes, then the disk froze again. SMBus now is on IRQ 0, therefore not guilty isn't it? The only two devices remaining on IRQ 19 are USB#2 and SATA. On the other hand, I have found a thread on the ASUS forum where people are reporting Windows hanging on some boards and SMBus shows I/O ports conflict in Windows device manager. The thread is on the Russian ASUS support forum, so it is of limited use here. I post a link anyway http://forum.asus.ru/viewtopic.php?t=3344 I will report again when I manage to update the firmware on the machine in question. Apparently ASUS flash burner fails to run from a floppy image.
> SMBus now is on IRQ 0 ? You could try to move the driver to be sure it's not touched: mv /lib/modules/`uname -r`/kernel/drivers/i2c/busses/i2c-i801.ko /tmp and reboot. Just to be sure... > I will report again when I manage to update the firmware on the machine in > question This is a good idea
(In reply to comment #20 from Thomas Renninger) > > SMBus now is on IRQ 0 > ? > You could try to move the driver to be sure it's not touched: > mv /lib/modules/`uname -r`/kernel/drivers/i2c/busses/i2c-i801.ko /tmp > and reboot. Just to be sure... That comes from lspci: 00:1f.3 SMBus [0c05]: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) SMBus Controller [8086:266a] (rev 03) Subsystem: ASUSTeK Computer Inc. P5GD1-VW Mainboard [1043:80a6] Control: I/O+ Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Interrupt: pin B routed to IRQ 0 Region 4: I/O ports at 0400 [size=32] Kernel modules: i2c-i801 The module is not loaded. > > > I will report again when I manage to update the firmware on the machine in > > question > This is a good idea It keeps coming. I have updated from 1005 BIOS to 1012 yet it is there. About 10 events in 15 hours probably correlated with user activity.
I expect 10.3 did and still does work? This is getting ugly. You could try to remove battery, thermal, processor, ac, fan drivers. The processor driver you better move away again or it gets loaded through cpufreq drivers: mv /lib/modules/`uname -r`/kernel/drivers/acpi/processor.ko /tmp Be careful to not delete it or you will have to reinstall the kernel. You could copy all of them away. If it works then it is related to C-states, cpufreq, EC or something like that. If you find out how you can trigger that easier (e.g. installing and running bonnie or something CPU/IO intensive task?) that could also help a lot.
The requested information has not been provided for over 4 weeks. The bug gets therefore closed as NORESPONSE. Please reopen the bug if you can supply the requested information. I would appreciate if you could test the current openSUSE 11.1 Beta first and see whether that one fixes the problem.
In fact I am already using the 11.1 betas for a while on this laptop and did not experience this problem yet.