Bug 157514 - System lockup with concurrent access to SATA disks on Promise PDC20378
Summary: System lockup with concurrent access to SATA disks on Promise PDC20378
Status: RESOLVED WONTFIX
Alias: None
Product: openSUSE 10.3
Classification: openSUSE
Component: Kernel
Version: Final
Hardware: i686 openSUSE 10.3
Priority: P5 - None    Severity: Major
Target Milestone: ---
Assignee: Takashi Iwai
QA Contact: E-mail List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-03-13 07:57 UTC by Bernhard Heibler
Modified: 2008-10-31 07:30 UTC
2 users

See Also:
Found By: Customer
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
/etc/raidtab (788 bytes, text/plain)
2006-03-13 08:01 UTC, Bernhard Heibler
Details
lspci (1.36 KB, text/plain)
2006-03-13 08:02 UTC, Bernhard Heibler
Details
fdisk -l (437 bytes, text/plain)
2006-03-13 08:02 UTC, Bernhard Heibler
Details
/var/log/messages (2.00 KB, text/plain)
2006-03-13 08:02 UTC, Bernhard Heibler
Details
mdstat (458 bytes, text/plain)
2006-03-15 10:36 UTC, Bernhard Heibler
Details
messages (2.12 KB, text/plain)
2006-03-15 10:36 UTC, Bernhard Heibler
Details
/var/log/messages Suse 10.3 (2.22 KB, text/plain)
2008-02-13 19:16 UTC, Bernhard Heibler
Details
/var/log/boot.msg Suse 10.3 (35.95 KB, text/plain)
2008-02-13 19:17 UTC, Bernhard Heibler
Details
output of hwinfo of suse 10.3 (386.31 KB, text/plain)
2008-02-13 19:17 UTC, Bernhard Heibler
Details

Description Bernhard Heibler 2006-03-13 07:57:06 UTC
I have set up a dual Xeon 2.8 GHz system using a Tyan Tiger i7505 board. I have connected two Seagate Barracuda 7200.9 400 GB SATA disks (ST3400633AS) to the onboard Promise SATA controller (Promise Technology, Inc. PDC20378 (FastTrak 378/SATA 378) (rev 02)). I have constructed 3 RAID 1 partitions. The system ran fine for one week, but then it suddenly started to freeze. There were no special messages in /var/log/messages; the system just stopped. But I did find some SATA DriveStatusError messages in /var/log/messages, so I thought one of my disks had died. I checked them with the Seagate tools but found no errors. So I did more tests. It seems that the system freezes as soon as the reconstruction of the big RAID partition is finished. But again, it just locks up without error messages. So for now I have removed the second partition from the RAID and left it in error state, and the system has run without any problems for 2 weeks now.
Comment 1 Bernhard Heibler 2006-03-13 07:58:39 UTC
I use the following kernel:

Linux lux 2.6.13-15.8-smp #1 SMP Tue Feb 7 11:07:24 UTC 2006 i686 i686 i386 GNU/Linux
Comment 2 Bernhard Heibler 2006-03-13 08:01:58 UTC
Created attachment 72487 [details]
/etc/raidtab
Comment 3 Bernhard Heibler 2006-03-13 08:02:14 UTC
Created attachment 72488 [details]
lspci
Comment 4 Bernhard Heibler 2006-03-13 08:02:29 UTC
Created attachment 72489 [details]
fdisk -l
Comment 5 Bernhard Heibler 2006-03-13 08:02:46 UTC
Created attachment 72490 [details]
/var/log/messages
Comment 6 Hannes Reinecke 2006-03-13 10:56:57 UTC
Neil, that's one for you.
Comment 7 Neil Brown 2006-03-14 00:10:12 UTC
This is very strange... The raid1 code generally works fine in that kernel and can certainly complete a resync.  Possibly there is some strange interaction with the SATA driver, but I cannot think what it would be.

I gather that you have three raid1 arrays and only one is causing trouble.  Is that correct?

Can you try the recovery again (add the other partition back in), and if/when it hangs, type
   alt-sysrq-P
and
   alt-sysrq-T
on the console and see if you get any useful output.  If we are lucky it will appear in the kernel log, but probably not.  Possibly take a digital photo of the -P output, or copy it down by hand (if it works at all).

Also, could you get me the output of
  cat /proc/mdstat
  mdadm -X /dev/sda3
  mdadm -X /dev/sdb3
while the recovery is happening.

Thanks.
Comment 8 Bernhard Heibler 2006-03-15 10:35:52 UTC
The problem also happens on the mid-size RAID partition. First I added partition sdb1 to md0, which is the swap partition. The sync completed without any problem. Then I added sdb2 to md1, which is the root partition, and it happened again: as the sync process got to the end, the system locked up. The good news is that it is not a sudden death; the lockup takes 5-10 seconds. At first, file system access seems to block while the network still works, but then it is all over. I didn't get any output from the alt-sysrq-P stuff, but I will try again. I will attach a notebook to the serial port and try to get the kernel messages. I will try this on Friday afternoon, since the system is a production one ...

Comment 9 Bernhard Heibler 2006-03-15 10:36:44 UTC
Created attachment 72969 [details]
mdstat
Comment 10 Bernhard Heibler 2006-03-15 10:36:58 UTC
Created attachment 72970 [details]
messages
Comment 11 Derek Moore 2006-04-27 15:01:29 UTC
We have run into this problem (or a very similar problem) at my work.  We build Linux machines for retail stores with the system installed on a RAID-1 array of two 200 GB SATA drives.

We had a problem where some systems were locking up and some weren't.  We tried many things: disabling ACPI in various ways, disabling powersaved, upgrading the BIOS, nolapic, etc.  Nothing seemed to keep the boxes from locking up.

Eventually, I realized that different hardware techs were building the boxes slightly differently.  One tech was putting the SATA drives as primary on separate channels.  Another tech was putting the SATA drives on the same channel.

All the boxes with the two SATA drives as master and slave *on the same channel* were the boxes that were locking up.  When the SATA drives were on separate channels, the boxes did not lock.

I also discovered a method by which I could consistently lock-up the SATA-drives-on-same-channel boxes.  After the initial install, the RAID is rebuilding itself for the first time.  While the RAID is rebuilding itself, run the cron job '/etc/cron.daily/suse.de-updatedb'.  The extra drive activity caused by the updatedb, plus all the work of rebuilding the RAID, seemed to be what was causing the box to just freeze.

I'm no kernel hacker, but I theorize this is a problem with libata and/or SATA chipset drivers.  It doesn't necessarily seem to be that the MD subsystem is causing the lock-ups, except in the sense that the MD subsystem puts enough strain and stress on the bus to cause a freeze somehow.
Comment 13 Jens Axboe 2006-08-08 13:38:48 UTC
We don't do this level of support on the SL series of products, especially not when we have a 10.1 out. Please try to reproduce it on 10.1; there have been lots of SATA fixes since then.
Comment 14 Jens Axboe 2006-09-12 13:12:44 UTC
Closing as advertised.
Comment 15 Bernhard Heibler 2008-02-07 19:10:52 UTC
Sorry, I have to reopen this one; the problem seems to be in every Linux version I have checked: SuSE 9.1, 10.0, 10.2, 10.3.
Comment 16 Bernhard Heibler 2008-02-13 19:01:34 UTC
It looks like the problem is more likely to happen with newer hard disks. I have old SATA 1 disks that run a whole night without any problems, but with newer ones things get screwed up pretty fast.  /var/log/messages of 2.6.22.17-0.1-default:

Feb 13 19:42:03 suse103 kernel: raid1: raid set md0 active with 2 out of 2 mirrors
Feb 13 19:42:09 suse103 kernel: kjournald starting.  Commit interval 5 seconds
Feb 13 19:42:09 suse103 kernel: EXT3 FS on md0, internal journal
Feb 13 19:42:09 suse103 kernel: EXT3-fs: mounted filesystem with ordered data mode.
Feb 13 19:54:31 suse103 kernel: ata4.00: exception Emask 0x10 SAct 0x0 SErr 0x180100 action 0x2
Feb 13 19:54:31 suse103 kernel: ata4.00: port_status 0x20200000
Feb 13 19:54:31 suse103 kernel: ata4.00: cmd c8/00:f0:30:e7:9f/00:00:00:00:00/e3 tag 0 cdb 0x0 data 122880 in
Feb 13 19:54:31 suse103 kernel:          res 51/0c:f0:30:e7:9f/00:00:00:00:00/e3 Emask 0x10 (ATA bus error)
Feb 13 19:54:32 suse103 kernel: ata4: soft resetting link
Feb 13 19:54:32 suse103 kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Feb 13 19:54:32 suse103 kernel: ata4.00: configured for UDMA/133
Feb 13 19:54:32 suse103 kernel: ata4: EH complete
Feb 13 19:54:32 suse103 kernel: sd 3:0:0:0: [sdb] 156301488 512-byte hardware sectors (80026 MB)
Feb 13 19:54:32 suse103 kernel: sd 3:0:0:0: [sdb] Write Protect is off
Feb 13 19:54:32 suse103 kernel: sd 3:0:0:0: [sdb] Mode Sense: 00 3a 00 00
Feb 13 19:54:32 suse103 kernel: sd 3:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Feb 13 19:54:42 suse103 kernel: ata4.00: exception Emask 0x10 SAct 0x0 SErr 0x380100 action 0x2
Feb 13 19:54:42 suse103 kernel: ata4.00: port_status 0x20200000
Feb 13 19:54:42 suse103 kernel: ata4.00: cmd c8/00:38:68:34:a0/00:00:00:00:00/e3 tag 0 cdb 0x0 data 28672 in
Feb 13 19:54:42 suse103 kernel:          res 51/0c:38:68:34:a0/00:00:00:00:00/e3 Emask 0x10 (ATA bus error)
Feb 13 19:54:42 suse103 kernel: ata4: soft resetting link
Feb 13 19:54:42 suse103 kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Feb 13 19:54:43 suse103 kernel: ata4.00: configured for UDMA/133
Feb 13 19:54:43 suse103 kernel: ata4: EH complete
Feb 13 19:54:43 suse103 kernel: sd 3:0:0:0: [sdb] 156301488 512-byte hardware sectors (80026 MB)
Feb 13 19:54:43 suse103 kernel: sd 3:0:0:0: [sdb] Write Protect is off
Feb 13 19:54:43 suse103 kernel: sd 3:0:0:0: [sdb] Mode Sense: 00 3a 00 00
Feb 13 19:54:43 suse103 kernel: sd 3:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Feb 13 19:54:58 suse103 kernel: ata4.00: exception Emask 0x10 SAct 0x0 SErr 0x380100 action 0x2
Feb 13 19:54:58 suse103 kernel: ata4.00: port_status 0x20200000
Feb 13 19:54:58 suse103 kernel: ata4.00: cmd 25/00:80:90:e0:9d/00:01:03:00:00/e0 tag 0 cdb 0x0 data 196608 in
Feb 13 19:54:58 suse103 kernel:          res 51/0c:00:0f:e2:9d/0c:00:00:00:00/e0 Emask 0x10 (ATA bus error)
Feb 13 19:54:58 suse103 kernel: ata4: soft resetting link
Feb 13 19:54:59 suse103 kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Feb 13 19:54:59 suse103 kernel: ata4.00: configured for UDMA/133
Feb 13 19:54:59 suse103 kernel: ata4: EH complete
Feb 13 19:54:59 suse103 kernel: sd 3:0:0:0: [sdb] 156301488 512-byte hardware sectors (80026 MB)
Feb 13 19:54:59 suse103 kernel: sd 3:0:0:0: [sdb] Write Protect is off
Feb 13 19:54:59 suse103 kernel: sd 3:0:0:0: [sdb] Mode Sense: 00 3a 00 00
Feb 13 19:54:59 suse103 kernel: sd 3:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Feb 13 19:55:04 suse103 kernel: ata4: limiting SATA link speed to 1.5 Gbps
Feb 13 19:55:04 suse103 kernel: ata4.00: exception Emask 0x10 SAct 0x0 SErr 0x380100 action 0x6
Feb 13 19:55:04 suse103 kernel: ata4.00: port_status 0x20200000
Feb 13 19:55:04 suse103 kernel: ata4.00: cmd c8/00:d0:a0:dc:9f/00:00:00:00:00/e3 tag 0 cdb 0x0 data 106496 in
Feb 13 19:55:04 suse103 kernel:          res 51/0c:d0:a0:dc:9f/0c:00:00:00:00/e3 Emask 0x10 (ATA bus error)
Feb 13 19:55:04 suse103 kernel: ata4: hard resetting link
Feb 13 19:55:04 suse103 kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Feb 13 19:55:04 suse103 kernel: ata4.00: configured for UDMA/133
Feb 13 19:55:04 suse103 kernel: ata4: EH complete
Feb 13 19:55:05 suse103 kernel: ata4.00: exception Emask 0x10 SAct 0x0 SErr 0x380100 action 0x2
Feb 13 19:55:05 suse103 kernel: ata4.00: port_status 0x20200000
Feb 13 19:55:05 suse103 kernel: ata4.00: cmd c8/00:d0:a0:dc:9f/00:00:00:00:00/e3 tag 0 cdb 0x0 data 106496 in
Feb 13 19:55:05 suse103 kernel:          res 51/0c:00:6f:dd:9f/0c:00:00:00:00/e3 Emask 0x10 (ATA bus error)
Feb 13 19:55:05 suse103 kernel: ata4: hard resetting link
Feb 13 19:55:05 suse103 kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Feb 13 19:55:05 suse103 kernel: ata4.00: configured for UDMA/133
Feb 13 19:55:05 suse103 kernel: ata4: EH complete
Feb 13 19:55:05 suse103 kernel: sd 3:0:0:0: [sdb] 156301488 512-byte hardware sectors (80026 MB)
Feb 13 19:55:05 suse103 kernel: sd 3:0:0:0: [sdb] Write Protect is off
Feb 13 19:55:05 suse103 kernel: sd 3:0:0:0: [sdb] Mode Sense: 00 3a 00 00
Feb 13 19:55:05 suse103 kernel: sd 3:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Feb 13 19:55:05 suse103 kernel: sd 3:0:0:0: [sdb] 156301488 512-byte hardware sectors (80026 MB)
Feb 13 19:55:05 suse103 kernel: sd 3:0:0:0: [sdb] Write Protect is off
Feb 13 19:55:05 suse103 kernel: sd 3:0:0:0: [sdb] Mode Sense: 00 3a 00 00
Feb 13 19:55:05 suse103 kernel: sd 3:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
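For anyone triaging similar logs, the pattern above (exception → ATA bus error → link reset → EH complete, repeating until the kernel clamps the link speed) can be counted mechanically. A minimal sketch, not part of the original report, that tallies these events in a /var/log/messages excerpt:

```python
import re

def summarize_ata_errors(log_text):
    """Count libata exceptions, bus errors, and link resets in a
    /var/log/messages excerpt like the one quoted above."""
    stats = {"exceptions": 0, "bus_errors": 0, "soft_resets": 0, "hard_resets": 0}
    for line in log_text.splitlines():
        if re.search(r"ata\d+\.\d+: exception", line):
            stats["exceptions"] += 1
        if "(ATA bus error)" in line:
            stats["bus_errors"] += 1
        if "soft resetting link" in line:
            stats["soft_resets"] += 1
        if "hard resetting link" in line:
            stats["hard_resets"] += 1
    return stats

sample = """\
Feb 13 19:54:31 suse103 kernel: ata4.00: exception Emask 0x10 SAct 0x0 SErr 0x180100 action 0x2
Feb 13 19:54:31 suse103 kernel:          res 51/0c:f0:30:e7:9f/00:00:00:00:00/e3 Emask 0x10 (ATA bus error)
Feb 13 19:54:32 suse103 kernel: ata4: soft resetting link
Feb 13 19:55:04 suse103 kernel: ata4: hard resetting link
"""
print(summarize_ata_errors(sample))
```

A rising bus-error count under concurrent I/O, with no media errors from SMART, points at the link/controller rather than the disks, which matches the discussion in this bug.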
Comment 17 Bernhard Heibler 2008-02-13 19:04:27 UTC
The hard disk itself logs the following SMART error:

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 47 97 3b e9 e1  Error: ICRC, ABRT 71 sectors at LBA = 0x01e93b97 = 32062359

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 c8 16 3b e9 e1 00      06:23:41.121  READ DMA
  c8 00 00 16 3a e9 e1 00      06:23:41.110  READ DMA
  c8 00 00 16 39 e9 e1 00      06:23:41.097  READ DMA
  c8 00 80 96 38 e9 e1 00      06:23:41.087  READ DMA
  c8 00 40 56 38 e9 e1 00      06:23:41.084  READ DMA
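The LBA that the drive reports can be reconstructed from the SN/CL/CH/DH task-file registers in the error record (for LBA28, the low nibble of DH carries bits 24-27). A small sketch, added for illustration, that reproduces the drive's own arithmetic:

```python
def lba28(sn, cl, ch, dh):
    """Reassemble an LBA28 address from ATA task-file registers:
    SN = bits 0-7, CL = bits 8-15, CH = bits 16-23, DH bits 0-3 = bits 24-27."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn

# Registers from the SMART error above: ER ST SC SN CL CH DH = 84 51 47 97 3b e9 e1
lba = lba28(0x97, 0x3B, 0xE9, 0xE1)
print(hex(lba), lba)  # 0x1e93b97 32062359, matching the drive's report
```

The ICRC bit in the error register (0x84 includes ICRC) indicates a CRC failure on the transfer, consistent with the "ATA bus error" lines in the kernel log rather than a bad sector.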
Comment 18 Bernhard Heibler 2008-02-13 19:16:55 UTC
Created attachment 194753 [details]
/var/log/messages Suse 10.3
Comment 19 Bernhard Heibler 2008-02-13 19:17:15 UTC
Created attachment 194754 [details]
/var/log/boot.msg Suse 10.3
Comment 20 Bernhard Heibler 2008-02-13 19:17:49 UTC
Created attachment 194755 [details]
output of hwinfo of suse 10.3
Comment 21 Roger Larsson 2008-06-05 22:37:48 UTC
Was about to create a new bug report, but it can go with this one.
I ran into this with a SuSE 10.3 system and a USB disk (WD Elements 500GB).
- this disk was bought today; I have seen this now and then with another USB disk.
(but this time I was going to back up before trying 11.0 RC, and it was very
consistent!)

Nothing works - NumLock dead, magic SysRq, magic sync/unmount/boot.

What is worse is that after this happened I could not even plug it in without freezing...

Running in failsafe (init 3) did not work either.

But killing hald-addon-storage did!

# ps aux | grep hald-addon-storage
... hald-addon-storage: polling /dev/sr0 (every 2 sec)
... hald-addon-storage: polling /dev/sdc (every 2 sec)
Kill the one polling your device. (The functionality provided might not be necessary, as I got the popup anyway! Kernel support?)

(Maybe firewire uses another addon)

Good luck!
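The grep-and-kill step above can be scripted. A sketch (the sample `ps aux` lines below are stand-ins modeled on the output quoted in this comment, not real output) that picks out the PID of the poller for a given device:

```python
def find_poller_pid(ps_lines, device):
    """Return the PID of the hald-addon-storage process polling `device`,
    given lines in `ps aux` format (PID is the second column)."""
    for line in ps_lines:
        if "hald-addon-storage" in line and f"polling {device} " in line:
            return int(line.split()[1])
    return None

# Hypothetical ps aux output resembling the lines quoted above
sample = [
    "root  2311  0.0  0.1  ...  hald-addon-storage: polling /dev/sr0 (every 2 sec)",
    "root  2318  0.0  0.1  ...  hald-addon-storage: polling /dev/sdc (every 2 sec)",
]
print(find_poller_pid(sample, "/dev/sdc"))  # -> 2318; then: kill 2318
```

This only stops hald's periodic media polling for that one device; it is a workaround, not a fix for the underlying freeze.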
Comment 22 Roger Larsson 2008-08-26 21:00:22 UTC
The motherboard in #21 has now been replaced, as the error rate increased. The mentioned "workaround" from #21 stopped working for one disk...
I really do suspect my HW to be the problem, but I have no means to verify that.
Comment 23 Takashi Iwai 2008-10-31 07:30:30 UTC
OK, let's close this bug now.  If you encounter the same problem again, please reopen.

Thanks.