Bug 115095 - suspend-to-disk hangs during suspend
Summary: suspend-to-disk hangs during suspend
Status: RESOLVED INVALID
Alias: None
Product: SUSE LINUX 10.0
Classification: openSUSE
Component: Kernel (show other bugs)
Version: RC 1
Hardware: i386 All
: P5 - None : Normal
Target Milestone: ---
Assignee: Pavel Machek
QA Contact: E-mail List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-09-02 22:06 UTC by Carl-Daniel Hailfinger
Modified: 2005-11-22 23:43 UTC (History)
3 users (show)

See Also:
Found By: Development
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
full suspend log including sysrq-p and sysrq-t (46.58 KB, text/plain)
2005-09-02 22:08 UTC, Carl-Daniel Hailfinger
Details
Full backtrace including SysRq-P and SysRq-T (45.68 KB, text/plain)
2005-09-12 13:30 UTC, Carl-Daniel Hailfinger
Details
SysRq-P, SysRq-T during hanging suspend-to-disk (43.27 KB, text/plain)
2005-09-12 23:41 UTC, Carl-Daniel Hailfinger
Details
backtrace of hang during suspend-to-disk (vanilla 2.6.13.1) (44.51 KB, text/plain)
2005-09-17 11:41 UTC, Carl-Daniel Hailfinger
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Carl-Daniel Hailfinger 2005-09-02 22:06:19 UTC
Suspend-to-disk hangs _during suspend_ if a cdrom has been mounted before
suspend. If the cdrom is already unmounted again, the same problem happens.
Perfectly reproducible, clean install of default beta4.

Machine is a Pentium3 desktop with 1GHz, no other special features. It suspends
and resumes just fine if I don't use the CD drive.

normal suspend-to-disk kernel messages:
uhci_hcd 0000:00:04.3: remove, state 1
usb usb2: USB disconnect, address 1
uhci_hcd 0000:00:04.3: USB bus 2 deregistered
ACPI: PCI interrupt for device 0000:00:04.3 disabled
uhci_hcd 0000:00:04.2: remove, state 1
usb usb1: USB disconnect, address 1
uhci_hcd 0000:00:04.2: USB bus 1 deregistered
ACPI: PCI interrupt for device 0000:00:04.2 disabled
end_request: I/O error, dev fd0, sector 0
Stopping tasks: ============================================================|
Freeing memory... done (104114 pages freed)
swsusp: Need to copy 16869 pages
swsusp: critical section/: done (16936 pages copied)
swsusp: Restoring Highmem
PCI: Setting latency timer of device 0000:00:01.0 to 64
ACPI: PCI Interrupt 0000:00:09.0[A] -> Link [LNKD] -> GSI 5 (level, low) -> IRQ 5
eth0: link up, 100Mbps, full-duplex, lpa 0x45E1
Writing data to swap (16936 pages)... done
Writing pagedir...done (67 pages)
S|
Shutdown: hdb
Shutdown: hda


messages when the box hangs:
uhci_hcd 0000:00:04.3: remove, state 1
usb usb2: USB disconnect, address 1
uhci_hcd 0000:00:04.3: USB bus 2 deregistered
ACPI: PCI interrupt for device 0000:00:04.3 disabled
uhci_hcd 0000:00:04.2: remove, state 1
usb usb1: USB disconnect, address 1
uhci_hcd 0000:00:04.2: USB bus 1 deregistered
ACPI: PCI interrupt for device 0000:00:04.2 disabled
end_request: I/O error, dev fd0, sector 0
Stopping tasks: ===============================================================|
Freeing memory... done (25192 pages freed)
swsusp: Need to copy 11903 pages
swsusp: critical section/: done (11950 pages copied)
swsusp: Restoring Highmem
PCI: Setting latency timer of device 0000:00:01.0 to 64
ACPI: PCI Interrupt 0000:00:09.0[A] -> Link [LNKD] -> GSI 5 (level, low) -> IRQ 5
eth0: link up, 100Mbps, full-duplex, lpa 0x45E1
<== it hangs here forever (at least 2 hours)

I got backtraces with sysrq-t on serial console. Probably interesting things:
hald-addon-storage and do_acpi_sleep because of disk access and ktorrent because
of being stopped and in interrupt.

The following is an excerpt from sysrq-t (only processes not sitting in
refrigerator+0x55/0x70).

ksoftirqd/0   S 08C56E00     0     2      1             3       (L-TLB)
dfd83fc0 00000046 00000000 08c56e00 003d104e c029a361 d61e9000 dfdd2658 
       dfdd2530 08c56e00 003d104e 00000001 c040c710 007a1200 00000000 dfd82000 
       dfd81fa0 00000000 c0120f60 c0120fc2 fffffffc c012db65 ffffffff ffffffff 
Call Trace:
 [<c029a361>] qdisc_restart+0x11/0x1a0
 [<c0120fc2>] ksoftirqd+0x62/0xb0
 [<c012db65>] kthread+0x85/0x90
 [<c012dae0>] kthread+0x0/0x90
 [<c01012f1>] kernel_thread_helper+0x5/0x14

events/0      R running     0     3      1             4     2 (L-TLB)

khelper       S 9E3E7B00     0     4      1             5     3 (L-TLB)
dff4df70 00000046 00000000 9e3e7b00 003d1046 0000007b 00000001 dffe1b88 
       dffe1a60 9e3e7b00 003d1046 cab27e9c dff4df6c 029f6300 00000000 dfff7548 
       00000202 cab27e9c dfff7540 c012a59e dfff7548 dfff7550 dfff7558 dff4c000 
Call Trace:
 [<c012a59e>] worker_thread+0x1ee/0x200
 [<c012a1a0>] __call_usermodehelper+0x0/0x50
 [<c01195c7>] __wake_up_common+0x37/0x60
 [<c0119580>] default_wake_function+0x0/0x10
 [<c012a3b0>] worker_thread+0x0/0x200
 [<c012db65>] kthread+0x85/0x90
 [<c012dae0>] kthread+0x0/0x90
 [<c01012f1>] kernel_thread_helper+0x5/0x14

kthread       S AE408100     0     5      1     7     107     4 (L-TLB)
df84bf70 00000046 00000000 ae408100 003d08d1 ad4c5d00 003d08d1 dffe1678 
       dffe1550 ae408100 003d08d1 ae408100 003d08d1 00000000 00000000 dffd5e68 
       00000282 db9a1f64 dffd5e60 c012a59e dffd5e68 dffd5e70 dffd5e78 df84a000 
Call Trace:
 [<c012a59e>] worker_thread+0x1ee/0x200
 [<c012db70>] keventd_create_kthread+0x0/0x50
 [<c01195c7>] __wake_up_common+0x37/0x60
 [<c0119580>] default_wake_function+0x0/0x10
 [<c012a3b0>] worker_thread+0x0/0x200
 [<c012db65>] kthread+0x85/0x90
 [<c012dae0>] kthread+0x0/0x90
 [<c01012f1>] kernel_thread_helper+0x5/0x14

kacpid        S 5BE0F600     0     7      5            74       (L-TLB)
dff49f70 00000046 00000000 5be0f600 003d1039 c022a4c1 00000001 dffe1168 
       dffe1040 5be0f600 003d1039 dba51e20 dff49f6c 00000000 00000000 dffd5268 
       00000282 dba51e20 dffd5260 c012a59e dffd5268 dffd5270 dffd5278 dff48000 
Call Trace:
 [<c022a4c1>] acpi_bus_check_device+0x3e/0x61
 [<c012a59e>] worker_thread+0x1ee/0x200
 [<c021443f>] acpi_os_execute_deferred+0x0/0x16
 [<c01195c7>] __wake_up_common+0x37/0x60
 [<c0119580>] default_wake_function+0x0/0x10
 [<c012a3b0>] worker_thread+0x0/0x200
 [<c012db65>] kthread+0x85/0x90
 [<c012dae0>] kthread+0x0/0x90
 [<c01012f1>] kernel_thread_helper+0x5/0x14

kblockd/0     S A5870C00     0    74      5           105     7 (L-TLB)
dfea3f70 00000046 00000000 a5870c00 003d104c d564172c 00000001 dff76698 
       dff76570 a5870c00 003d104c df6d765c dfea3f6c 02625a00 00000000 dff73b88 
       00000247 df6d765c dff73b80 c012a59e dff73b88 dff73b90 dff73b98 dfea2000 
Call Trace:
 [<c012a59e>] worker_thread+0x1ee/0x200
 [<c0262a30>] cfq_kick_queue+0x0/0x60
 [<c01195c7>] __wake_up_common+0x37/0x60
 [<c0119580>] default_wake_function+0x0/0x10
 [<c012a3b0>] worker_thread+0x0/0x200
 [<c012db65>] kthread+0x85/0x90
 [<c012dae0>] kthread+0x0/0x90
 [<c01012f1>] kernel_thread_helper+0x5/0x14

pdflush       D 936E5D00     0   105      5           106    74 (L-TLB)
dfee3ef4 00000046 00000000 936e5d00 003d1058 e66e5f00 003d1049 dff76188 
       dff76060 936e5d00 003d1058 e66e5f00 dfee3efc 00b71b00 00000000 001ecfce 
       c0365428 000010e5 dfee3f24 c02f7517 c040d058 dfed8ab4 001ecfce 4b87ad6e 
Call Trace:
 [<c02f7517>] schedule_timeout+0x47/0xb0
 [<c01246a0>] process_timeout+0x0/0x10
 [<c02f749e>] io_schedule_timeout+0xe/0x20
 [<c025965e>] blk_congestion_wait+0x6e/0x90
 [<c012dea0>] autoremove_wake_function+0x0/0x30
 [<c0142392>] wb_kupdate+0xe2/0x130
 [<c0142b20>] pdflush+0x0/0x20
 [<c0142a82>] __pdflush+0xb2/0x150
 [<c0142b3a>] pdflush+0x1a/0x20
 [<c01422b0>] wb_kupdate+0x0/0x130
 [<c012db65>] kthread+0x85/0x90
 [<c012dae0>] kthread+0x0/0x90
 [<c01012f1>] kernel_thread_helper+0x5/0x14

pdflush       S 5EF83600     0   106      5           108   105 (L-TLB)
df923f9c 00000046 00000000 5ef83600 003d1016 5ef83600 003d1016 dfeb3bc8 
       dfeb3aa0 5ef83600 003d1016 5ef83600 003d1016 01ab3f00 00000000 df923fc0 
       df923fb4 df922000 c0142b20 c0142a47 fffffffc dfd81f98 00000000 c0142b3a 
Call Trace:
 [<c0142b20>] pdflush+0x0/0x20
 [<c0142a47>] __pdflush+0x77/0x150
 [<c0142b3a>] pdflush+0x1a/0x20
 [<c012db65>] kthread+0x85/0x90
 [<c012dae0>] kthread+0x0/0x90
 [<c01012f1>] kernel_thread_helper+0x5/0x14

aio/0         S 6737CF00     0   108      5           698   106 (L-TLB)
df927f70 00000046 00000000 6737cf00 003d08ba 6737cf00 003d08ba dfeb31a8 
       dfeb3080 6737cf00 003d08ba 6737cf00 003d08ba 00000000 00000000 df926000 
       dfd81f64 dff73480 dff73480 c012a59e dff73488 dff73490 df927fc8 df926000 
Call Trace:
 [<c012a59e>] worker_thread+0x1ee/0x200
 [<c01195c7>] __wake_up_common+0x37/0x60
 [<c0119580>] default_wake_function+0x0/0x10
 [<c012a3b0>] worker_thread+0x0/0x200
 [<c012db65>] kthread+0x85/0x90
 [<c012dae0>] kthread+0x0/0x90
 [<c01012f1>] kernel_thread_helper+0x5/0x14

kseriod       S C1571900     0   698      5           977   108 (L-TLB)
dfe63fa4 00000046 00000000 c1571900 003d08ba ffffffea c03679a8 dff76ba8 
       dff76a80 c1571900 003d08ba c03679bc 00000246 016e3600 00000000 dfe62000 
       dfe62000 00000000 c0249370 c024942d 00000000 dff76a80 c012dea0 c0362bdc 
Call Trace:
 [<c0249370>] serio_thread+0x0/0xf0
 [<c024942d>] serio_thread+0xbd/0xf0
 [<c012dea0>] autoremove_wake_function+0x0/0x30
 [<c012db65>] kthread+0x85/0x90
 [<c012dae0>] kthread+0x0/0x90
 [<c01012f1>] kernel_thread_helper+0x5/0x14

reiserfs/0    S F128F400     0   977      5         12269   698 (L-TLB)
deeb7f70 00000046 00000000 f128f400 003d1049 00000000 00000001 dee536d8 
       dee535b0 f128f400 003d1049 d828de00 deeb7f6c 00000000 00000000 dee611e8 
       00000246 d828de00 dee611e0 c012a59e dee611e8 dee611f0 dee611f8 deeb6000 
Call Trace:
 [<c012a59e>] worker_thread+0x1ee/0x200
 [<e15afb30>] flush_async_commits+0x0/0x70 [reiserfs]
 [<c01195c7>] __wake_up_common+0x37/0x60
 [<c0119580>] default_wake_function+0x0/0x10
 [<c012a3b0>] worker_thread+0x0/0x200
 [<c012db65>] kthread+0x85/0x90
 [<c012dae0>] kthread+0x0/0x90
 [<c01012f1>] kernel_thread_helper+0x5/0x14

khubd         S 5C6ACA00     0 12269      5         12443   977 (L-TLB)
dc98dfa4 00000046 00000000 5c6aca00 003d103a 5c6aca00 003d103a dbf44148 
       dbf44020 5c6aca00 003d103a 5c6aca00 003d103a 00000000 00000000 dc98c000 
       dc98c000 00000000 e194c540 e194c5fa 00000000 dbf44020 c012dea0 e1963828 
Call Trace:
 [<e194c540>] hub_thread+0x0/0xf0 [usbcore]
 [<e194c5fa>] hub_thread+0xba/0xf0 [usbcore]
 [<c012dea0>] autoremove_wake_function+0x0/0x30
 [<c012db65>] kthread+0x85/0x90
 [<c012dae0>] kthread+0x0/0x90
 [<c01012f1>] kernel_thread_helper+0x5/0x14

kgameportd    S AE7D8A00     0 12443      5               12269 (L-TLB)
d6093fa4 00000046 0028b0aa ae7d8a00 003d08d1 c022aa00 d63126ac d6328148 
       d6328020 afebc000 003d08d1 d6312600 00000246 016e3600 00000000 d6092000 
       d6092000 00000000 e197f630 e197f6ed 00000000 d6328020 c012dea0 e19825fc 
Call Trace:
 [<c022aa00>] acpi_platform_notify+0x25/0x4d
 [<e197f630>] gameport_thread+0x0/0xf0 [gameport]
 [<e197f6ed>] gameport_thread+0xbd/0xf0 [gameport]
 [<c012dea0>] autoremove_wake_function+0x0/0x30
 [<c012db65>] kthread+0x85/0x90
 [<c012dae0>] kthread+0x0/0x90
 [<c01012f1>] kernel_thread_helper+0x5/0x14

hald-addon-st D EE899100     0 15147  14499               15103 (NOTLB)
d1021ef8 00000086 00000000 ee899100 003d1049 df77d964 c016fe10 dbed0148 
       dbed0020 ee899100 003d1049 00000000 b7f46ff4 1a39de00 00000000 d1020000 
       00000000 b7f46ff4 d1020000 c0135575 d1021fbc 00000000 c0102b95 c01604db 
Call Trace:
 [<c016fe10>] __invalidate_device+0x30/0x40
 [<c0135575>] refrigerator+0x55/0x70
 [<c0102b95>] do_signal+0x85/0x110
 [<c01604db>] do_open+0x22b/0x2f0
 [<c0160649>] blkdev_open+0x39/0x50
 [<c0158743>] dentry_open+0x153/0x1d0
 [<c01585de>] filp_open+0x3e/0x50
 [<c0102c49>] do_notify_resume+0x29/0x40
 [<c0102e0e>] work_notifysig+0x13/0x25

ktorrent      D ECDE5200     0 17552      1         17791 15948 (NOTLB)
cfea7ef8 00200086 00061a80 ecde5200 003d1049 c0103dea cfea6000 d14c9bc8 
       d14c9aa0 ed1b5b00 003d1049 cfea6000 d14c9aa0 577fe700 00000000 cfea6000 
       00000000 4129cce8 cfea6000 c0135575 cfea7fbc 00000000 c0102b95 00000000 
Call Trace:
 [<c0103dea>] common_interrupt+0x1a/0x20
 [<c0135575>] refrigerator+0x55/0x70
 [<c0102b95>] do_signal+0x85/0x110
 [<c016a170>] __pollwait+0x0/0xa0
 [<c016a910>] sys_select+0x290/0x360
 [<c0102c49>] do_notify_resume+0x29/0x40
 [<c0102e0e>] work_notifysig+0x13/0x25

do_acpi_sleep D 19B2ED00     0 18064  14534                     (NOTLB)
c7469dac 00000082 00000000 19b2ed00 003d104e 00000006 00000008 d11d2148 
       d11d2020 19b2ed00 003d104e c0261973 e087662c 863acb00 00000000 c7468000 
       c7469ddc e0876434 c7469dd0 c02f6ed6 00000001 d11d2020 c0119580 c7469de0 
Call Trace:
 [<c0261973>] cfq_next_request+0x33/0x90
 [<c02f6ed6>] wait_for_completion+0x76/0xb0
 [<c0119580>] default_wake_function+0x0/0x10
 [<e08607e1>] ide_do_drive_cmd+0xa1/0xf0 [ide_core]
 [<e085e49d>] generic_ide_resume+0x7d/0x90 [ide_core]
 [<c02595d0>] blk_end_sync_rq+0x0/0x20
 [<c0256038>] resume_device+0x48/0xb0
 [<c025612e>] dpm_resume+0x8e/0xa0
 [<c0256151>] device_resume+0x11/0x20
 [<c0136836>] swsusp_write+0x6/0x20
 [<c0137379>] pm_suspend_disk+0x79/0x90
 [<c0135395>] enter_state+0x45/0x70
 [<c01354ca>] state_store+0x8a/0x98
 [<c0135440>] state_store+0x0/0x98
 [<c01912de>] subsys_attr_store+0x1e/0x30
 [<c01914ed>] flush_write_buffer+0x1d/0x30
 [<c0191530>] sysfs_write_file+0x30/0x50
 [<c0191500>] sysfs_write_file+0x0/0x50
 [<c015935d>] vfs_write+0x8d/0x170
 [<c01594ec>] sys_write+0x3c/0x70
 [<c0102d1b>] sysenter_past_esp+0x54/0x79

Full backtrace will follow as attachment.

May be related to Bug 99589.
Comment 1 Carl-Daniel Hailfinger 2005-09-02 22:08:26 UTC
Created attachment 48668 [details]
full suspend log including sysrq-p and sysrq-t
Comment 2 Carl-Daniel Hailfinger 2005-09-02 22:13:13 UTC
Interesting. SysRq-s,u,b gives another backtrace. Should I open a separate bug
for that?

SysRq : Emergency Sync
Emergency Sync complete
SysRq : Emergency Remount R/O
SysRq : Resetting
Badness in pci_get_subsys at drivers/pci/search.c:234
 [<c01fa241>] pci_get_subsys+0xe1/0xf0
 [<c01fa25b>] pci_get_device+0xb/0x10
 [<c0115d62>] mach_reboot_fixups+0x12/0x30
 [<c01122da>] machine_emergency_restart+0x3a/0xb0
 [<c0246c82>] __handle_sysrq+0x72/0x100
 [<c0246d27>] handle_sysrq+0x17/0x20
 [<c0241ddc>] kbd_event+0x9c/0xc0
 [<c026c5e3>] input_event+0xb3/0x400
 [<c026fbb2>] atkbd_report_key+0x22/0xa0
 [<c026fdc9>] atkbd_interrupt+0x199/0x590
 [<c0249d89>] serio_interrupt+0x19/0x3d
 [<c024a9f1>] i8042_interrupt+0xf1/0x200
 [<c013bc23>] handle_IRQ_event+0x33/0x60
 [<c013bcd5>] __do_IRQ+0x85/0xd0
 [<c01055e8>] do_IRQ+0x38/0x60
 [<c0103dea>] common_interrupt+0x1a/0x20
 [<e0830ac3>] acpi_processor_idle+0x10e/0x26f [processor]
 [<c01010ae>] cpu_idle+0xe/0x50
 [<c03a8730>] start_kernel+0x160/0x1b0
No reboot fixup found for your hardware
Comment 3 Carl-Daniel Hailfinger 2005-09-02 22:17:58 UTC
The machine replayed dozens of transactions on all mounted filesystems on the
next boot although I had used SysRq-s. Strange.
Comment 4 Carl-Daniel Hailfinger 2005-09-02 22:32:27 UTC
Booting up to kdm, inserting a cdrom, waiting a few seconds, ejecting it again
and suspending afterwards also hangs during suspend. So we can rule out mounting
by hand and strange kde effects. However, I cannot reproduce with runlevel 3.
Comment 5 Carl-Daniel Hailfinger 2005-09-02 23:04:12 UTC
This is getting so strange.
Boot to runlevel 3, insert cd, suspend -> works
Boot to runlevel 3, insert cd, init 5, suspend -> BREAKS
Boot to runlevel 3, insert cd, init 5, init 3, suspend -> works
Boot to runlevel 3, insert cd, init 5, init 3, init 5, suspend -> works
Boot to runlevel 3, insert cd, init 5, init 3, init 5, eject cd, insert cd,
suspend -> works
Boot to runlevel 3, insert cd, init 5, init 3, init 5, eject cd, insert cd, log
into kde, suspend -> works
Comment 6 Forgotten User ZhJd0F0L3x 2005-09-03 20:39:20 UTC
 this is really scary...
Comment 7 Pavel Machek 2005-09-04 21:03:56 UTC
Yes, scary. Can you try latest vanilla kernel to see if it is suse or generic
problem? Can you try to generate smaller testcase? init 5 does really a lot of
stuff, break_suspend.c would be great.
Comment 8 Carl-Daniel Hailfinger 2005-09-04 21:24:16 UTC
Did you already look at the backtrace?

Especially the following seems fishy to me (the process is called
hald-addon-storage). How can a process be in the refrigerator (at the same place
as all other frozen processes) and enter __invalidate_device from there?

hald-addon-st D EE899100     0 15147  14499               15103 (NOTLB)
Call Trace:
 [<c016fe10>] __invalidate_device+0x30/0x40
 [<c0135575>] refrigerator+0x55/0x70
 [<c0102b95>] do_signal+0x85/0x110
 [<c01604db>] do_open+0x22b/0x2f0
 [<c0160649>] blkdev_open+0x39/0x50
 [<c0158743>] dentry_open+0x153/0x1d0
 [<c01585de>] filp_open+0x3e/0x50
 [<c0102c49>] do_notify_resume+0x29/0x40
 [<c0102e0e>] work_notifysig+0x13/0x25


Besides that, I already spent 15 hours chasing this bug :-(
I'm really glad I lowered it down from "random freeze at suspend" to "freezes if
cdrom is/was inserted and is/was in runlevel 5 the first time after boot".

Oh, and another hint: Suspend-to-disk spins down all disks before writing to
swap. It then spins up again the disks to write the data to swap. My swap
partition is on hdb and it seems (from hearing) that only hda and not hdb is
spun up again, so the data can't be written to swap.
Comment 9 Forgotten User ZhJd0F0L3x 2005-09-04 21:37:35 UTC
is hal-addon-storage maybe already in state D before suspend?
Comment 10 Carl-Daniel Hailfinger 2005-09-04 21:54:07 UTC
But how can the freezer appear in the middle of the hald-addon-storage
backtrace? Should it not refuse to freeze if anything is in state D?

Hm. hald-addon-storage is in state D for a few seconds after ejecting the data
cd. After that, it returns to state S. And now the machine seems to suspend just
fine. AARGH!

I will now get a backtrace of hald-addon-storage in state D to compare it with
the backtrace of hanging suspend.
Comment 11 Carl-Daniel Hailfinger 2005-09-04 22:21:10 UTC
On top of all existing problems, this is a hard-to-trigger race condition. 

hald-addon-storage hangs in state D for 0.1 seconds every two seconds.
hald-addon-storage hangs in state D for 10-20 seconds after closing the cd drive.
hald-addon-storage hangs in state D for 5 seconds after opening the cd drive.

So if the process freeze happens at exactly the right(wrong) time, you will hit
the bug.

However, the backtrace of hald-addon-storage in state D during "normal" use has
nothing to do with the backtrace found here.
Comment 12 Pavel Machek 2005-09-04 22:56:01 UTC
refrigerator() should just fail, if some process is in D state. Not sure what is
going on here.
Comment 13 Jens Axboe 2005-09-05 13:24:18 UTC
Pavel, please take a look.
Comment 14 Pavel Machek 2005-09-06 20:42:10 UTC
Is there some easy way to reproduce "D state on hald-addon-storage" without
actually running hald?

Do you have those backtraces you mentioned in comment #10?
Comment 15 Danny Al-Gaaf 2005-09-06 20:50:56 UTC
No, there is no way to run hald-addon-storage without HAL.
Comment 16 Pavel Machek 2005-09-06 21:17:07 UTC
Is there way to at least strace it to show what it does/create small application
to "emulate" hald-addon-storage?
Comment 17 Carl-Daniel Hailfinger 2005-09-07 14:45:24 UTC
I have to dig out these backtraces.

If you want a small application to emulate the relevant effects of
hald-addon-storage, you probably should just poll the cdrom in an endless loop.

Danny, can you extract the polling code?
Comment 18 Carl-Daniel Hailfinger 2005-09-07 15:25:30 UTC
Pavel: Here are the backtraces from "normal" hald-addon-storage in state D.

hald-addon-st D 2C9BC600     0  5379   5095          5639  5177 (NOTLB)
d9393d6c 00000086 00000000 2c9bc600 003d092b 2c9bc600 003d092b d81d46d8
       d81d45b0 2c9bc600 003d092b 2c9bc600 003d092b 007a1200 00000000 d9392000
       d9393d9c e0876ce8 d9393d90 c02f6ed6 00000001 d81d45b0 c0119580 d9393da0
Call Trace:
 [<c02f6ed6>] wait_for_completion+0x76/0xb0
 [<c0119580>] default_wake_function+0x0/0x10
 [<e08607e1>] ide_do_drive_cmd+0xa1/0xf0 [ide_core]
 [<e151b8c4>] cdrom_queue_packet_command+0x34/0xc0 [ide_cd]
 [<c0119580>] default_wake_function+0x0/0x10
 [<e08607fd>] ide_do_drive_cmd+0xbd/0xf0 [ide_core]
 [<e151c48f>] cdrom_check_status+0x5f/0x70 [ide_cd]
 [<c02595d0>] blk_end_sync_rq+0x0/0x20
 [<e151d4ae>] ide_cdrom_check_media_change_real+0x1e/0x40 [ide_cd]
 [<e151086a>] media_changed+0x4a/0x90 [cdrom]
 [<c01601e3>] check_disk_change+0x13/0x60
 [<e150fe33>] cdrom_open+0x63/0xe0 [cdrom]
 [<e151e229>] idecd_open+0x49/0xa0 [ide_cd]
 [<c0160325>] do_open+0x75/0x2f0
 [<c016062c>] blkdev_open+0x1c/0x50
 [<c01586ae>] dentry_open+0xbe/0x1d0
 [<c01585de>] filp_open+0x3e/0x50
 [<c01588e7>] sys_open+0x37/0xc0
 [<c0102d1b>] sysenter_past_esp+0x54/0x79
hald-addon-st D 30A95F00     0  5639   5095                5379 (NOTLB)
df323d6c 00000082 00000000 30a95f00 003d092b 30a95f00 003d092b df2dfb88
       df2dfa60 30a95f00 003d092b 30a95f00 003d092b 03567e00 00000000 df322000
       df323d9c e0876af0 df323d90 c02f6ed6 00000001 df2dfa60 c0119580 df323da0
Call Trace:
 [<c02f6ed6>] wait_for_completion+0x76/0xb0
 [<c0119580>] default_wake_function+0x0/0x10
 [<e08607e1>] ide_do_drive_cmd+0xa1/0xf0 [ide_core]
 [<e151b8c4>] cdrom_queue_packet_command+0x34/0xc0 [ide_cd]
 [<c02f6ed6>] wait_for_completion+0x76/0xb0
 [<c0119580>] default_wake_function+0x0/0x10
 [<e08607fd>] ide_do_drive_cmd+0xbd/0xf0 [ide_core]
 [<e151c48f>] cdrom_check_status+0x5f/0x70 [ide_cd]
 [<c02595d0>] blk_end_sync_rq+0x0/0x20
 [<e151d4ae>] ide_cdrom_check_media_change_real+0x1e/0x40 [ide_cd]
 [<e151086a>] media_changed+0x4a/0x90 [cdrom]
 [<c01601e3>] check_disk_change+0x13/0x60
 [<e150fe33>] cdrom_open+0x63/0xe0 [cdrom]
 [<e151e229>] idecd_open+0x49/0xa0 [ide_cd]
 [<c0160325>] do_open+0x75/0x2f0
 [<c016062c>] blkdev_open+0x1c/0x50
 [<c01586ae>] dentry_open+0xbe/0x1d0
 [<c01585de>] filp_open+0x3e/0x50
 [<c01588e7>] sys_open+0x37/0xc0
 [<c0102d1b>] sysenter_past_esp+0x54/0x79
Comment 19 Carl-Daniel Hailfinger 2005-09-10 19:11:09 UTC
Argh. With RC1 it even hangs if no cdrom is/was ever inserted.
I'm now in the unfortunate situation that every suspend hangs.

Pavel: Is there any more information I can give you?
Comment 20 Pavel Machek 2005-09-11 10:00:35 UTC
Does it hang in the same way as with cdrom before? Any processes in D state with
RC1? Does it work in init=/bin/bash?
Comment 21 Carl-Daniel Hailfinger 2005-09-11 23:32:35 UTC
Argh! I still cannot reproduce without X running.
Comment 22 Carl-Daniel Hailfinger 2005-09-12 06:24:39 UTC
And the reduced testcase I had does not help at all :(
Bah. Getting it to hang without X is near impossible...
To make matters worse, suspend-to-disk via "powersave -U" makes some dbus calls.

And the backtraces get even worse... How can a process in the refrigerator cause
page faults?!? (process "substrate"). And hald-addon-storage isn't involved with
the hang anymore.

Excerpt from my most recent hang:
ksoftirqd/0   S 854E6D00     0     2      1             3       (L-TLB)
dfd83fc0 00000046 00000000 854e6d00 003d08f1 c029d051 dd7d9400 dfdd3658 
       dfdd3530 854e6d00 003d08f1 00000001 c0412710 00f42400 00000000 dfd82000 
       dfd81fa0 00000000 c0120f60 c0120fc2 fffffffc c012db65 ffffffff ffffffff 
Call Trace:
 [<c029d051>] qdisc_restart+0x11/0x1a0
 [<c0120f60>] ksoftirqd+0x0/0xb0
 [<c0120fc2>] ksoftirqd+0x62/0xb0
 [<c012db65>] kthread+0x85/0x90
 [<c012dae0>] kthread+0x0/0x90
 [<c01012f1>] kernel_thread_helper+0x5/0x14
events/0      R running     0     3      1             4     2 (L-TLB)
khelper       S 34FF5900     0     4      1             5     3 (L-TLB)
dfddbf70 00000046 00000000 34ff5900 003d08e6 0000007b 00000001 dffe1b88 
       dffe1a60 34ff5900 003d08e6 d504be68 dfddbf6c 01312d00 00000000 dfff7548 
       00000202 d504be68 dfff7540 c012a59e dfff7548 dfff7550 dfff7558 dfdda000 
Call Trace:
 [<c012a59e>] worker_thread+0x1ee/0x200
 [<c012a1a0>] __call_usermodehelper+0x0/0x50
 [<c01195c7>] __wake_up_common+0x37/0x60
 [<c0119580>] default_wake_function+0x0/0x10
 [<c012a3b0>] worker_thread+0x0/0x200
 [<c012db65>] kthread+0x85/0x90
 [<c012dae0>] kthread+0x0/0x90
 [<c01012f1>] kernel_thread_helper+0x5/0x14
kthread       S 8F1C6E00     0     5      1     9     377     4 (L-TLB)
dff49f70 00000046 00000000 8f1c6e00 003d08c3 8e655300 003d08c3 dffe1678 
       dffe1550 8f1c6e00 003d08c3 8f1c6e00 003d08c3 00000000 00000000 dffd5e68 
       00000282 dc6adf64 dffd5e60 c012a59e dffd5e68 dffd5e70 dffd5e78 dff48000 
Call Trace:
 [<c012a59e>] worker_thread+0x1ee/0x200
 [<c012db70>] keventd_create_kthread+0x0/0x50
 [<c01195c7>] __wake_up_common+0x37/0x60
 [<c0119580>] default_wake_function+0x0/0x10
 [<c012a3b0>] worker_thread+0x0/0x200
 [<c012db65>] kthread+0x85/0x90
 [<c012dae0>] kthread+0x0/0x90
 [<c01012f1>] kernel_thread_helper+0x5/0x14
kacpid        S 15186700     0     9      5           340       (L-TLB)
dfe85f70 00000046 00000000 15186700 003d08df c022c6cd 00000001 dffc8188 
       dffc8060 15186700 003d08df d4382960 dfe85f6c 00000000 00000000 dffd5268 
       00000286 d4382960 dffd5260 c012a59e dffd5268 dffd5270 dffd5278 dfe84000 
Call Trace:
 [<c022c6cd>] acpi_bus_check_device+0x3e/0x61
 [<c012a59e>] worker_thread+0x1ee/0x200
 [<c021447f>] acpi_os_execute_deferred+0x0/0x35
 [<c01195c7>] __wake_up_common+0x37/0x60
 [<c0119580>] default_wake_function+0x0/0x10
 [<c012a3b0>] worker_thread+0x0/0x200
 [<c012db65>] kthread+0x85/0x90
 [<c012dae0>] kthread+0x0/0x90
 [<c01012f1>] kernel_thread_helper+0x5/0x14
kblockd/0     S CE707600     0   340      5           375     9 (L-TLB)
dfe87f70 00000046 00000000 ce707600 003d08f0 afaeed00 00000001 dff62bc8 
       dff62aa0 ce707600 003d08f0 df6dd65c dfe87f6c 007a1200 00000000 dff5fb88 
       00000247 df6dd65c dff5fb80 c012a59e dff5fb88 dff5fb90 dff5fb98 dfe86000 
Call Trace:
 [<c012a59e>] worker_thread+0x1ee/0x200
 [<c02653e0>] cfq_kick_queue+0x0/0x60
 [<c01195c7>] __wake_up_common+0x37/0x60
 [<c0119580>] default_wake_function+0x0/0x10
 [<c012a3b0>] worker_thread+0x0/0x200
 [<c012db65>] kthread+0x85/0x90
 [<c012dae0>] kthread+0x0/0x90
 [<c01012f1>] kernel_thread_helper+0x5/0x14
pdflush       S BCCF6E00     0   375      5           376   340 (L-TLB)
dfe89f9c 00000046 00000000 bccf6e00 003d08ba c0118be7 dffd0700 dff626b8 
       dff62590 bccf6e00 003d08ba dfe89fc8 00000046 00000000 00000000 dfe89fc0 
       dfe89fb4 dfe88000 c0142b20 c0142a47 fffffffc dfd81f98 00000000 c0142b3a 
Call Trace:
 [<c0118be7>] activate_task+0x57/0x70
 [<c0142b20>] pdflush+0x0/0x20
 [<c0142a47>] __pdflush+0x77/0x150
 [<c0142b3a>] pdflush+0x1a/0x20
 [<c012db65>] kthread+0x85/0x90
 [<c012dae0>] kthread+0x0/0x90
 [<c01012f1>] kernel_thread_helper+0x5/0x14
pdflush       D 82EC1300     0   376      5           378   375 (L-TLB)
dfe8bde0 00000046 00000000 82ec1300 003d08f1 0000000e 00000000 dff621a8 
       dff62080 82ec1300 003d08f1 dfe8bdc4 dfe8bdc4 029f6300 00000000 dfe8be10 
       00000000 dfe8be18 dfe8bde8 c02fa16e c1401008 c015a2cb c02fa330 c015a2a0 
Call Trace:
 [<c02fa16e>] io_schedule+0xe/0x20
 [<c015a2cb>] sync_buffer+0x2b/0x30
 [<c02fa330>] __wait_on_bit+0x40/0x70
 [<c015a2a0>] sync_buffer+0x0/0x30
 [<c015a2a0>] sync_buffer+0x0/0x30
 [<c02fa3cf>] out_of_line_wait_on_bit+0x6f/0x80
 [<c012ded0>] wake_bit_function+0x0/0x40
 [<c015a333>] __wait_on_buffer+0x23/0x30
 [<e15ac723>] flush_commit_list+0x353/0x490 [reiserfs]
 [<e15b0c0d>] do_journal_end+0x90d/0x9b0 [reiserfs]
 [<e15afb1a>] journal_end_sync+0x5a/0x70 [reiserfs]
 [<e159d420>] reiserfs_sync_fs+0x40/0x50 [reiserfs]
 [<c015ee4d>] sync_supers+0x6d/0xc0
 [<c01422de>] wb_kupdate+0x2e/0x130
 [<c02f9883>] schedule+0x343/0x610
 [<c0142b20>] pdflush+0x0/0x20
 [<c0142a82>] __pdflush+0xb2/0x150
 [<c0142b3a>] pdflush+0x1a/0x20
 [<c01422b0>] wb_kupdate+0x0/0x130
 [<c012db65>] kthread+0x85/0x90
 [<c012dae0>] kthread+0x0/0x90
 [<c01012f1>] kernel_thread_helper+0x5/0x14
aio/0         S BE3DA400     0   378      5           970   376 (L-TLB)
df92bf70 00000046 00000000 be3da400 003d08ba be3da400 003d08ba dff4d6d8 
       dff4d5b0 be3da400 003d08ba be3da400 003d08ba 00000000 00000000 df92a000 
       dfd81f64 dff5f480 dff5f480 c012a59e dff5f488 dff5f490 df92bfc8 df92a000 
Call Trace:
 [<c012a59e>] worker_thread+0x1ee/0x200
 [<c01195c7>] __wake_up_common+0x37/0x60
 [<c0119580>] default_wake_function+0x0/0x10
 [<c012a3b0>] worker_thread+0x0/0x200
 [<c012db65>] kthread+0x85/0x90
 [<c012dae0>] kthread+0x0/0x90
 [<c01012f1>] kernel_thread_helper+0x5/0x14
kseriod       S 1768CA00     0   970      5          1235   378 (L-TLB)
dff4bfa4 00000046 00000000 1768ca00 003d08bb ffffffea c036cb88 dffe1168 
       dffe1040 1768ca00 003d08bb c036cb9c 00000246 007a1200 00000000 dff4a000 
       dff4a000 00000000 c024bd20 c024bddd 00000000 dffe1040 c012dea0 c0367c1c 
Call Trace:
 [<c024bd20>] serio_thread+0x0/0xf0
 [<c024bddd>] serio_thread+0xbd/0xf0
 [<c012dea0>] autoremove_wake_function+0x0/0x30
 [<c012db65>] kthread+0x85/0x90
 [<c012dae0>] kthread+0x0/0x90
 [<c01012f1>] kernel_thread_helper+0x5/0x14
reiserfs/0    D 88E1F400     0  1235      5          2732   970 (L-TLB)
dfdddef8 00000046 00000000 88e1f400 003d08f1 c1380d00 c13ae5a0 dffc8ba8 
       dffc8a80 88e1f400 003d08f1 c11a7d00 c1329420 007a1200 00000000 ce70c4f8 
       00000286 dffc8a80 ce70c500 c02f93ee 00000001 dffc8a80 c0119580 ce70c500 
Call Trace:
 [<c02f93ee>] __down+0xbe/0xe0
 [<c0119580>] default_wake_function+0x0/0x10
 [<c02f9513>] __down_failed+0x7/0xc
 [<e15b0d4a>] .text.lock.journal+0x8/0x10e [reiserfs]
 [<e15afb60>] flush_async_commits+0x30/0x70 [reiserfs]
 [<c012a519>] worker_thread+0x169/0x200
 [<e15afb30>] flush_async_commits+0x0/0x70 [reiserfs]
 [<c01195c7>] __wake_up_common+0x37/0x60
 [<c0119580>] default_wake_function+0x0/0x10
 [<c012a3b0>] worker_thread+0x0/0x200
 [<c012db65>] kthread+0x85/0x90
 [<c012dae0>] kthread+0x0/0x90
 [<c01012f1>] kernel_thread_helper+0x5/0x14
khubd         S B512FE00     0  2732      5          2835  1235 (L-TLB)
def13fa4 00000046 00000000 b512fe00 003d08df dc704940 daf94400 df2dfb88 
       df2dfa60 b512fe00 003d08df 00000000 00000001 016e3600 00000000 def12000 
       def12000 00000000 e18e3540 e18e35fa 00000000 df2dfa60 c012dea0 e18fa828 
Call Trace:
 [<e18e3540>] hub_thread+0x0/0xf0 [usbcore]
 [<e18e35fa>] hub_thread+0xba/0xf0 [usbcore]
 [<c012dea0>] autoremove_wake_function+0x0/0x30
 [<c012db65>] kthread+0x85/0x90
 [<c012dae0>] kthread+0x0/0x90
 [<c01012f1>] kernel_thread_helper+0x5/0x14
kgameportd    S 8F968000     0  2835      5                2732 (L-TLB)
dc70ffa4 00000046 0021e88e 8f968000 003d08c3 c022cc0c dc7cbcac dc6a4678 
       dc6a4550 90c7ad00 003d08c3 dc7cbc00 00000246 01312d00 00000000 dc70e000 
       dc70e000 00000000 e1921630 e19216ed 00000000 dc6a4550 c012dea0 e19245fc 
Call Trace:
 [<c022cc0c>] acpi_platform_notify+0x25/0x4d
 [<e1921630>] gameport_thread+0x0/0xf0 [gameport]
 [<e19216ed>] gameport_thread+0xbd/0xf0 [gameport]
 [<c012dea0>] autoremove_wake_function+0x0/0x30
 [<c012db65>] kthread+0x85/0x90
 [<c012dae0>] kthread+0x0/0x90
 [<c01012f1>] kernel_thread_helper+0x5/0x14
do_acpi_sleep D 9678F500     0  6128   5223                     (NOTLB)
ce167dac 00000082 00000000 9678f500 003d08f1 df718200 0000000a d840cbc8 
       d840caa0 9678f500 003d08f1 c0264323 e0876434 bef92900 00000000 ce166000 
       ce167ddc e0876434 ce167dd0 c02f9bc6 00000001 d840caa0 c0119580 ce167de0 
Call Trace:
 [<c0264323>] cfq_next_request+0x33/0x90
 [<c02f9bc6>] wait_for_completion+0x76/0xb0
 [<c0119580>] default_wake_function+0x0/0x10
 [<e08607e1>] ide_do_drive_cmd+0xa1/0xf0 [ide_core]
 [<e085e49d>] generic_ide_resume+0x7d/0x90 [ide_core]
 [<c025bf80>] blk_end_sync_rq+0x0/0x20
 [<c02589e8>] resume_device+0x48/0xb0
 [<c0258ade>] dpm_resume+0x8e/0xa0
 [<c0258b01>] device_resume+0x11/0x20
 [<c0136836>] swsusp_write+0x6/0x20
 [<c0137379>] pm_suspend_disk+0x79/0x90
 [<c0135395>] enter_state+0x45/0x70
 [<c01354ca>] state_store+0x8a/0x98
 [<c0135440>] state_store+0x0/0x98
 [<c01912de>] subsys_attr_store+0x1e/0x30
 [<c01914ed>] flush_write_buffer+0x1d/0x30
 [<c0191530>] sysfs_write_file+0x30/0x50
 [<c0191500>] sysfs_write_file+0x0/0x50
 [<c015935d>] vfs_write+0x8d/0x170
 [<c01594ec>] sys_write+0x3c/0x70
 [<c0102d1b>] sysenter_past_esp+0x54/0x79
substrate     D AC586F00     0  6131   6130                     (NOTLB)
d5009ef8 00000082 00000000 ac586f00 003d08f0 00000002 00000001 c15c3168 
       c15c3040 ac586f00 003d08f0 00000000 c0117370 09c67100 00000000 d5008000 
       00000000 c0117370 d5008000 c0135575 d5009fbc 00000000 c0102b95 cdda5074 
Call Trace:
 [<c0117370>] do_page_fault+0x0/0x5ef
 [<c0117370>] do_page_fault+0x0/0x5ef
 [<c0135575>] refrigerator+0x55/0x70
 [<c0102b95>] do_signal+0x85/0x110
 [<c014c9e8>] __do_mmap_pgoff+0x508/0x780
 [<c014ca12>] __do_mmap_pgoff+0x532/0x780
 [<c02f9883>] schedule+0x343/0x610
 [<c0117370>] do_page_fault+0x0/0x5ef
 [<c0102c49>] do_notify_resume+0x29/0x40
 [<c0102e0e>] work_notifysig+0x13/0x25
Comment 23 Carl-Daniel Hailfinger 2005-09-12 06:51:37 UTC
Interesting similarities about both hang backtraces: Processes reiserfs/0 and
do_acpi_sleep. Partial diff follows.

 reiserfs/0
 Call Trace:
- __down+0xbe/0xe0
- default_wake_function+0x0/0x10
- __down_failed+0x7/0xc
- .text.lock.journal+0x8/0x10e [reiserfs]
- flush_async_commits+0x30/0x70 [reiserfs]
- worker_thread+0x169/0x200
+ worker_thread+0x1ee/0x200
  flush_async_commits+0x0/0x70 [reiserfs]
  __wake_up_common+0x37/0x60
  default_wake_function+0x0/0x10
  worker_thread+0x0/0x200
  kthread+0x85/0x90
  kthread+0x0/0x90
  kernel_thread_helper+0x5/0x14

 do_acpi_sleep
 Call Trace:
  cfq_next_request+0x33/0x90
  wait_for_completion+0x76/0xb0
  default_wake_function+0x0/0x10
  ide_do_drive_cmd+0xa1/0xf0 [ide_core]
  generic_ide_resume+0x7d/0x90 [ide_core]
  blk_end_sync_rq+0x0/0x20
  resume_device+0x48/0xb0
  dpm_resume+0x8e/0xa0
  device_resume+0x11/0x20
  swsusp_write+0x6/0x20
  pm_suspend_disk+0x79/0x90
  enter_state+0x45/0x70
  state_store+0x8a/0x98
  state_store+0x0/0x98
  subsys_attr_store+0x1e/0x30
  flush_write_buffer+0x1d/0x30
  sysfs_write_file+0x30/0x50
  sysfs_write_file+0x0/0x50
  vfs_write+0x8d/0x170
  sys_write+0x3c/0x70
  sysenter_past_esp+0x54/0x79
Comment 24 Forgotten User ZhJd0F0L3x 2005-09-12 06:56:59 UTC
BTW: do_acpi_sleep is a shellscript that does nothing but "echo disk >
/sys/power/state"
Comment 25 Carl-Daniel Hailfinger 2005-09-12 13:00:23 UTC
That means the backtrace of do_acpi_sleep tells us where suspend is hanging? If
so, how to debug further?
Comment 26 Carl-Daniel Hailfinger 2005-09-12 13:27:53 UTC
Hm. SysRq-S doesn't work. Neither does SysRq-U. They seem to hang like the rest
of this system.
Comment 27 Carl-Daniel Hailfinger 2005-09-12 13:30:19 UTC
Created attachment 49610 [details]
Full backtrace including SysRq-P and SysRq-T

Jens, could you have a look at this? I suspect a problem in the block layer.
Comment 28 Carl-Daniel Hailfinger 2005-09-12 13:48:54 UTC
I did another backtrace after SysRq-S and SysRq-U. The only task with different
backtrace was one of the pdflush tasks.

Before SysRq-S, SysRq-U:
pdflush       S 6F530100     0   375      5           376   340 (L-TLB)
dfe89f9c 00000046 00000000 6f530100 003d08ba c0118be7 dffd0700 dff636b8 
       dff63590 6f530100 003d08ba dfe89fc8 00000046 00000000 00000000 dfe89fc0 
       dfe89fb4 dfe88000 c0142b20 c0142a47 fffffffc dfd81f98 00000000 c0142b3a 
Call Trace:
 [<c0118be7>] activate_task+0x57/0x70
 [<c0142b20>] pdflush+0x0/0x20
 [<c0142a47>] __pdflush+0x77/0x150
 [<c0142b3a>] pdflush+0x1a/0x20
 [<c012db65>] kthread+0x85/0x90
 [<c012dae0>] kthread+0x0/0x90
 [<c01012f1>] kernel_thread_helper+0x5/0x14

After SysRq-S, SysRq-U:
pdflush       D 91641A00     0   375      5           376   340 (L-TLB)
dfe89d8c 00000046 00000000 91641a00 003d0945 dffcd17c c014aeba dff636b8 
       dff63590 91641a00 003d0945 00000000 00000000 00000000 00000000 dfe89dbc 
       dfe89dc4 c1400f98 dfe89d94 c02fa16e 00000000 c015a2cb c02fa41c c015a2a0 
Call Trace:
 [<c014aeba>] __handle_mm_fault+0xda/0x130
 [<c02fa16e>] io_schedule+0xe/0x20
 [<c015a2cb>] sync_buffer+0x2b/0x30
 [<c02fa41c>] __wait_on_bit_lock+0x3c/0x70
 [<c015a2a0>] sync_buffer+0x0/0x30
 [<c015a2a0>] sync_buffer+0x0/0x30
 [<c02fa4bf>] out_of_line_wait_on_bit_lock+0x6f/0x80
 [<c012ded0>] wake_bit_function+0x0/0x40
 [<c015a2f3>] __lock_buffer+0x23/0x30
 [<c015ba5c>] __block_write_full_page+0x1fc/0x2e0
 [<c015fb50>] blkdev_get_block+0x0/0x50
 [<c015d042>] block_write_full_page+0x92/0xc0
 [<c014572c>] pagevec_lookup_tag+0x1c/0x30
 [<c015fb50>] blkdev_get_block+0x0/0x50
 [<c0179645>] mpage_writepages+0x205/0x3b0
 [<c015fc70>] blkdev_writepage+0x0/0x10
 [<c014255a>] do_writepages+0x1a/0x30
 [<c0177c10>] __sync_single_inode+0x50/0x220
 [<c0177e03>] __writeback_single_inode+0x23/0x100
 [<c0178084>] generic_sync_sb_inodes+0x1a4/0x280
 [<c0142b20>] pdflush+0x0/0x20
 [<c0178287>] sync_inodes_sb+0x67/0x70
 [<c0178302>] __sync_inodes+0x52/0x90
 [<c0178351>] sync_inodes+0x11/0x30
 [<c015a681>] do_sync+0x11/0x60
 [<c0142a82>] __pdflush+0xb2/0x150
 [<c0142b3a>] pdflush+0x1a/0x20
 [<c015a670>] do_sync+0x0/0x60
 [<c012db65>] kthread+0x85/0x90
 [<c012dae0>] kthread+0x0/0x90
 [<c01012f1>] kernel_thread_helper+0x5/0x14


Another interesting fact: SysRq-O didn't poweroff the machine and it gave
interesting messages:
SysRq : Power Off
md: stopping all md devices.
md: md0 switched to read-only mode.
Shutdown: hdb

However, I have *two* harddisks (hda and hdb). And even stranger, SysRq-O caused
one of the harddisks to spin up again which it didn't during suspend.
Comment 29 Carl-Daniel Hailfinger 2005-09-12 13:59:01 UTC
I did another backtrace after SysRq-O. The only thing that changed was that
events/0 was no longer in state R, but in state D. Backtrace of this task follows:

events/0      D D9891200     0     3      1             4     2 (L-TLB)
dfd91e0c 00000046 062ccf80 d9891200 003d09cb e08763a0 e0864b7a dfdd3148 
       dfdd3020 17492d00 003d09cc 00000000 00000000 3f6b5a00 00000000 dfd90000 
       dfd91e3c e087662c dfd91e30 c02f9bc6 00000001 dfdd3020 c0119580 dfd91e40 
Call Trace:
 [<e0864b7a>] do_rw_taskfile+0x1fa/0x210 [ide_core]
 [<c02f9bc6>] wait_for_completion+0x76/0xb0
 [<c0119580>] default_wake_function+0x0/0x10
 [<e08607e1>] ide_do_drive_cmd+0xa1/0xf0 [ide_core]
 [<e085e40a>] generic_ide_suspend+0x7a/0x90 [ide_core]
 [<c025bf80>] blk_end_sync_rq+0x0/0x20
 [<e081ed30>] ide_device_shutdown+0x30/0x40 [ide_disk]
 [<c0258509>] device_shutdown+0x49/0x66
 [<c0127cb0>] kernel_power_off+0x20/0x30
 [<c012a519>] worker_thread+0x169/0x200
 [<c01375f0>] do_poweroff+0x0/0x10
 [<c01195c7>] __wake_up_common+0x37/0x60
 [<c0119580>] default_wake_function+0x0/0x10
 [<c012a3b0>] worker_thread+0x0/0x200
 [<c012db65>] kthread+0x85/0x90
 [<c012dae0>] kthread+0x0/0x90
 [<c01012f1>] kernel_thread_helper+0x5/0x14
Comment 30 Carl-Daniel Hailfinger 2005-09-12 19:59:21 UTC
While trying to reproduce, it hung during resume instead of during suspend.
Comment 31 Carl-Daniel Hailfinger 2005-09-12 20:21:45 UTC
Diff between backtrace during suspend (still working) and resume (hang directly
after):

-do_acpi_sleep D CF7A5900     0  6029   5264                5950 (NOTLB)
-df5fbdac 00000082 0007a120 cf7a5900 003d08fd e08763a0 e0864b7a dffc8698 
-       dffc8570 cfb76200 003d08fd 00000000 00000000 7b803600 00000000 df5fa000 
-       df5fbddc e0876434 df5fbdd0 c02f9bc6 00000001 dffc8570 c0119580 df5fbde0 
+do_acpi_sleep D 93F0CB00     0  6418   5264                6340 (NOTLB)
+d367ddb0 00000082 00000000 93f0cb00 000000d7 e08763a0 e0864b7a d2ae6168 
+       d2ae6040 93f0cb00 000000d7 00000000 00000000 09059100 0000000b d367c000 
+       d367dde0 e0876434 d367ddd4 c02f9bc6 00000001 d2ae6040 c0119580 d367dde4 
 Call Trace:
  [<e0864b7a>] do_rw_taskfile+0x1fa/0x210 [ide_core]
  [<c02f9bc6>] wait_for_completion+0x76/0xb0
  [<c0119580>] default_wake_function+0x0/0x10
  [<e08607e1>] ide_do_drive_cmd+0xa1/0xf0 [ide_core]
  [<e085e49d>] generic_ide_resume+0x7d/0x90 [ide_core]
  [<c025bf80>] blk_end_sync_rq+0x0/0x20
  [<e0864cd0>] task_no_data_intr+0x0/0x70 [ide_core]
  [<c02589e8>] resume_device+0x48/0xb0
  [<c0258ade>] dpm_resume+0x8e/0xa0
  [<c0258b01>] device_resume+0x11/0x20
- [<c0136836>] swsusp_write+0x6/0x20
- [<c0137379>] pm_suspend_disk+0x79/0x90
+ [<c01371f5>] finish+0x5/0x40
+ [<c0137355>] pm_suspend_disk+0x55/0x90
  [<c0135395>] enter_state+0x45/0x70
  [<c01354ca>] state_store+0x8a/0x98
  [<c0135440>] state_store+0x0/0x98
  [<c01912de>] subsys_attr_store+0x1e/0x30
  [<c01914ed>] flush_write_buffer+0x1d/0x30
  [<c0191530>] sysfs_write_file+0x30/0x50
  [<c0191500>] sysfs_write_file+0x0/0x50
  [<c015935d>] vfs_write+0x8d/0x170
  [<c01594ec>] sys_write+0x3c/0x70
  [<c0102d1b>] sysenter_past_esp+0x54/0x79
+kdesktop_lock D 62CD9000     0  6419   6292                     (NOTLB)
+cee5fef8 00200082 00000000 62cd9000 000000cb 00000000 deafbc4c deef2b68 
+       deef2a40 62cd9000 000000cb 00000000 00000000 179a7b00 00000000 cee5e000 
+       00000000 00000000 cee5e000 c0135575 cee5ffbc 00000000 c0102b95 c025b6fb 
+Call Trace:
+ [<c0135575>] refrigerator+0x55/0x70
+ [<c0102b95>] do_signal+0x85/0x110
+ [<c025b6fb>] __freed_request+0x5b/0x60
+ [<c025b718>] freed_request+0x18/0x60
+ [<e085eb7e>] __ide_end_request+0x9e/0x140 [ide_core]
+ [<e085ec72>] ide_end_request+0x52/0x60 [ide_core]
+ [<c02594b2>] elv_queue_empty+0x12/0x20
+ [<e085ff99>] ide_do_request+0xb9/0x3b0 [ide_core]
+ [<c02f9883>] schedule+0x343/0x610
+ [<c013bc23>] handle_IRQ_event+0x33/0x60
+ [<c013bcf3>] __do_IRQ+0xa3/0xd0
+ [<c0102c49>] do_notify_resume+0x29/0x40
+ [<c0102e0e>] work_notifysig+0x13/0x25

More notes on what I do to get the hang during suspend: Log into KDE, click on
Logout->Suspend. Calling powersave -U from console doesn't trigger it reliably.

Upping severity because I can reproduce this 100% reliably, and it worked just
fine in 9.3
Comment 32 Chris L Mason 2005-09-12 23:17:19 UTC
I suspect this is actually in the cdrom code, but lets try with elevator=deadline please 
Comment 33 Carl-Daniel Hailfinger 2005-09-12 23:31:39 UTC
Unfortunately, elevator=deadline didn't help. Will retry with elevator=noop.
A cdrom is no longer needed in RC1 to trigger the bug reliably.
Comment 34 Carl-Daniel Hailfinger 2005-09-12 23:37:39 UTC
elevator=noop didn't help either. But after >150 journal replays in the last few
days I really appreciate having reiserfs.

Complete backtrace follows as attachment.
Comment 35 Carl-Daniel Hailfinger 2005-09-12 23:41:54 UTC
Created attachment 49704 [details]
SysRq-P, SysRq-T during hanging suspend-to-disk
Comment 36 Carl-Daniel Hailfinger 2005-09-12 23:49:17 UTC
The above backtrace was generated with elevator=noop.

Some info about my system (dmesg excerpt):
VP_IDE: IDE controller at PCI slot 0000:00:04.1
    ACPI-0212: *** Warning: Device is not power manageable
VP_IDE: chipset revision 16
VP_IDE: not 100% native mode: will probe irqs later
VP_IDE: VIA vt82c686a (rev 22) IDE UDMA66 controller on pci0000:00:04.1
    ide0: BM-DMA at 0xd800-0xd807, BIOS settings: hda:DMA, hdb:DMA
    ide1: BM-DMA at 0xd808-0xd80f, BIOS settings: hdc:DMA, hdd:DMA
hda: ST340823A, ATA DISK drive
hdb: ST3200822A, ATA DISK drive
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
hda: max request size: 128KiB
hda: 78165360 sectors (40020 MB) w/1024KiB Cache, CHS=65535/16/63, UDMA(66)
hda: cache flushes not supported
 hda: hda1 hda2 < hda5 hda6 hda7 hda8 hda9 hda10 >
hdb: max request size: 1024KiB
hdb: 390721968 sectors (200049 MB) w/8192KiB Cache, CHS=24321/255/63, UDMA(66)
hdb: cache flushes supported
 hdb: hdb1 hdb2 < hdb5 hdb6 hdb7 hdb8 >
hdc: LITE-ON CD-RW SOHR-5238S, ATAPI CD/DVD-ROM drive
hdd: LITEON DVD-ROM LTD122, ATAPI CD/DVD-ROM drive
ide1 at 0x170-0x177,0x376 on irq 15
hdc: ATAPI 52X CD-ROM CD-R/RW drive, 2048kB Cache, UDMA(33)
Uniform CD-ROM driver Revision: 3.20
hdd: ATAPI 40X DVD-ROM drive, 512kB Cache, UDMA(33)

I don't know whether the ACPI warning applies to my IDE controller or just
happens to be printed at that time.

All filesystems (except swap and /boot) are on LVM on /dev/hdb.
swap and /boot are on normal dos-type partitions on /dev/hdb.

Please note that at the end of the backtrace, SysRq-S and SysRq-U seem to do
nothing and SysRq-O hangs after the "Shutdown: hdb" message.
Comment 37 Carl-Daniel Hailfinger 2005-09-13 14:05:34 UTC
It even hangs after deleting cdrom.ko and ide-cd.ko from /lib/modules and from
the initrd, (and I verified with lsmod that they're indeed not loaded).

So it's not cdrom related.

What else can I try? Were the backtraces helpful?
Comment 38 Chris L Mason 2005-09-13 15:50:44 UTC
Vojtech, it looks like we're stuck in ide land after resume, could you please take a quick 
look? 
Comment 39 Carl-Daniel Hailfinger 2005-09-13 15:57:44 UTC
Small correction: The hang occurs almost always during suspend, not during resume.
Comment 40 Beau Steward 2005-09-14 07:16:57 UTC
I'm having this problem as well. When my panics happen, the screen corrupts so
I'm not able to see the backtraces and I have no way to connect to the machine
via serial (it's a laptop). I'm going to try with vanilla sources tomorrow, and
I will also watch this bug for resolution. Hardware is HP ZE2000z (ZE2200 CTO)
with ATI Radeon XPRESS 200M mainboard and AMD Turion 64 processor running
x86_64. I've had one successful suspend in all of my attempts, but that didn't
resume. Suspend to ram works fine and I use it often.
Comment 41 Pavel Machek 2005-09-14 10:33:32 UTC
Carl: can you confirm that you have no processes in "D" state before you attempt
suspend? Does vanilla kernel work for you? Would it be possible to try it
without second harddrive, and preferably with LVM out of the picture? People
don't usually use LVM/hdb on notebooks, so that gets little testing.

Oh and decreasing severity. This is normal problem. "9.3 worked" is not too
helpfull, unless you want to binary search where it broke.
Comment 42 Carl-Daniel Hailfinger 2005-09-15 10:44:29 UTC
Yes, there are no processes in D state before suspend. And this is not a
notebook, but a normal desktop. Trying without a second harddrive is
unfortunately not possible because the first harddrive has the boot loader and
the BIOS can't boot from the second hard drive because it's bigger than 128 GB.
Killing LVM would mean I have to burn 200 GB to CD-ROMs (I have no DVD burner,
nor would that help because some files are >10 GB) before repartioning.

So what is left to check? Vanilla 2.6.13?
Comment 43 Pavel Machek 2005-09-15 11:10:49 UTC
Yes, checking vanilla 2.6.13 would be nice.

Plus if you see processes in "D" state during normal operation... that should
not be. File a separate bug about that, and mark this one as "depends on that".
Comment 44 Beau Steward 2005-09-16 15:40:18 UTC
I tried vanilla sources and I got the same display corruption, but I could not
see any signs of a kernel panic (that is not to say a panic didn't happen). I
think I would like to try an older 2.6 kernel as I've been experiencing another
problem which I had not experienced on other distributions that also continued
on the vanilla sources.
Comment 45 Beau Steward 2005-09-16 15:41:27 UTC
Oh, my last attempt, I did check for D state processes and none existed.
Comment 46 Beau Steward 2005-09-16 17:11:22 UTC
For the first time ever, I was able to suspend to disk and then resume. The
kernel that I used was vanilla 2.6.12 compiled from make oldconfig. I'll have to
do some additional testing later.
Comment 47 Carl-Daniel Hailfinger 2005-09-17 11:40:19 UTC
With vanilla 2.6.13.1 and identical .config, it did survive one suspend/resume
cycle. The second cycle hung during suspend, again in IDE land. Short excerpt
from backtrace, full backtrace will be attached:

do_acpi_sleep D E08512A0     0  5786   4885                5720 (NOTLB)
c9df7dac 00000082 0008b824 e08512a0 e083fb6a d24296d8 d24295b0 9d82ea00 
       003d0931 00000000 00000000 7d2b7500 00000000 c9df6000 c9df7ddc e0851330 
       c9df7dd0 c02ec846 00000001 d24295b0 c01183d0 c9df7de0 c9df7de0 c9df7e04 
Call Trace:
 [<e083fb6a>] do_rw_taskfile+0x1fa/0x210 [ide_core]
 [<c02ec846>] wait_for_completion+0x76/0xb0
 [<c01183d0>] default_wake_function+0x0/0x10
 [<e083b7d1>] ide_do_drive_cmd+0xa1/0xf0 [ide_core]
 [<e083948d>] generic_ide_resume+0x7d/0x90 [ide_core]
 [<c024f9c0>] blk_end_sync_rq+0x0/0x20
 [<e083fcc0>] task_no_data_intr+0x0/0x70 [ide_core]
 [<c024c428>] resume_device+0x48/0xb0
 [<c024c51e>] dpm_resume+0x8e/0xa0
 [<c024c541>] device_resume+0x11/0x20
 [<c0135096>] swsusp_write+0x6/0x20
 [<c0135bd9>] pm_suspend_disk+0x79/0x90
 [<c0133bf5>] enter_state+0x45/0x70
 [<c0133d2a>] state_store+0x8a/0x98
 [<c0133ca0>] state_store+0x0/0x98
 [<c018e80e>] subsys_attr_store+0x1e/0x30
 [<c018ea1d>] flush_write_buffer+0x1d/0x30
 [<c018ea60>] sysfs_write_file+0x30/0x50
 [<c018ea30>] sysfs_write_file+0x0/0x50
 [<c0157339>] vfs_write+0x99/0x160
 [<c01574ac>] sys_write+0x3c/0x70
 [<c0102d1b>] sysenter_past_esp+0x54/0x79

and it hangs there forever.
Comment 48 Carl-Daniel Hailfinger 2005-09-17 11:41:58 UTC
Created attachment 50228 [details]
backtrace of hang during suspend-to-disk (vanilla 2.6.13.1)
Comment 49 Carl-Daniel Hailfinger 2005-09-17 11:45:40 UTC
this last backtrace was with vanilla 2.6.13.1.

Vojtech, can you please have a look?
Comment 50 Carl-Daniel Hailfinger 2005-09-17 11:56:46 UTC
OK, the vanilla kernel is much better than the SUSE kernel.
* SUSE kernel from RC1 always hangs during suspend
* vanilla kernel always survives the first suspend cycle but will hang on the
second cycle


Excerpt of backtrace of hang during resume:

do_acpi_sleep D E08512A0     0  5792   4958                     (NOTLB)
ddb95db0 00000086 0008b824 e08512a0 e083fb6a d112d1c8 d112d0a0 0f9b0500 
       003d0919 00000000 00000000 ad9d9200 00000018 ddb94000 ddb95de0 e0851330 
       ddb95dd4 c02ec846 00000001 d112d0a0 c01183d0 ddb95de4 ddb95de4 ddb95e08 
Call Trace:
 [<e083fb6a>] do_rw_taskfile+0x1fa/0x210 [ide_core]
 [<c02ec846>] wait_for_completion+0x76/0xb0
 [<c01183d0>] default_wake_function+0x0/0x10
 [<e083b7d1>] ide_do_drive_cmd+0xa1/0xf0 [ide_core]
 [<e083948d>] generic_ide_resume+0x7d/0x90 [ide_core]
 [<c024f9c0>] blk_end_sync_rq+0x0/0x20
 [<e083fcc0>] task_no_data_intr+0x0/0x70 [ide_core]
 [<c024c428>] resume_device+0x48/0xb0
 [<c024c51e>] dpm_resume+0x8e/0xa0
 [<c024c541>] device_resume+0x11/0x20
 [<c0135a55>] finish+0x5/0x40
 [<c0135bb5>] pm_suspend_disk+0x55/0x90
 [<c0133bf5>] enter_state+0x45/0x70
 [<c0133d2a>] state_store+0x8a/0x98
 [<c0133ca0>] state_store+0x0/0x98
 [<c018e80e>] subsys_attr_store+0x1e/0x30
 [<c018ea1d>] flush_write_buffer+0x1d/0x30
 [<c018ea60>] sysfs_write_file+0x30/0x50
 [<c018ea30>] sysfs_write_file+0x0/0x50
 [<c0157339>] vfs_write+0x99/0x160
 [<c01574ac>] sys_write+0x3c/0x70
 [<c0102d1b>] sysenter_past_esp+0x54/0x79
Comment 51 Carl-Daniel Hailfinger 2005-09-17 12:02:53 UTC
Common part of backtrace:
do_acpi_sleep D E08512A0     0  5792   4958                     (NOTLB)
Call Trace:
 [<e083fb6a>] do_rw_taskfile+0x1fa/0x210 [ide_core]
 [<c02ec846>] wait_for_completion+0x76/0xb0
 [<c01183d0>] default_wake_function+0x0/0x10
 [<e083b7d1>] ide_do_drive_cmd+0xa1/0xf0 [ide_core]
 [<e083948d>] generic_ide_resume+0x7d/0x90 [ide_core]
 [<c024f9c0>] blk_end_sync_rq+0x0/0x20
 [<e083fcc0>] task_no_data_intr+0x0/0x70 [ide_core]
 [<c024c428>] resume_device+0x48/0xb0
 [<c024c51e>] dpm_resume+0x8e/0xa0
 [<c024c541>] device_resume+0x11/0x20

During each hang, hda has already spun up while hdb still is not spinning.
Comment 52 Beau Steward 2005-09-17 16:18:36 UTC
On my testing with 2.6.13.1, I was unable to get any successful suspends.
However, with 2.6.12, I was able to suspend and resume 5 times in a row without
displayed errors, but it wasn't without problems. The problem I ran into is
being investigated in another bug. It seems my noapictimer kernel flag is
ignored on resume so the system clock goes back to 2x speed or faster.
Comment 53 Pavel Machek 2005-09-19 14:09:06 UTC
Can you find which patch between 2.6.12 and 2.6.13 breaks it?
Comment 54 Beau Steward 2005-09-21 17:45:11 UTC
I am going through the various versions from 2.6.12 to 2.6.13.1 with clean
compiles. It's taking some time, but I am honestly not liking the way suspend to
disk is working in 2.6.12.* with the previously mentioned problem. However, I
hope to have at least up to 2.6.12.4 tested by today, and the rest tomorrow.
It's possible the 2x speed problem will be resolved with bug 113323.
Comment 55 Pavel Machek 2005-09-21 20:38:24 UTC
Uff, no.

Going through 2.6.12.1, 2.6.12.2, 2.6.12.3 is not going to be helpfull. These
are stable releases, nothing significant changes there.

Go through 2.6.12, 2.6.13-rc1, 2.6.13-rc2...
Comment 56 Beau Steward 2005-09-23 07:00:34 UTC
I've found that this starts between 2.6.13-rc1 and 2.6.13-rc3. I've not been 
able to get 2.6.13-rc2 to compile yet, I'll work on that more. I was able to do 
multiple successful suspends with 2.6.13-rc1. I'll try to have information on 
rc2 by tomorrow. And yeah, I wasn't thinking when I orginally posted my testing 
method. I started breaking it down into groups after a couple of tests. Also, I 
noticed 2.6.13.2 was released so I tested this, and it is not fixed.
Comment 57 Beau Steward 2005-09-24 06:19:10 UTC
Removed a pci audio device module and rc2 compiled. Problem exists between 2.6.
13-rc2 and 2.6.13-rc3.
Comment 58 Pavel Machek 2005-09-26 13:19:07 UTC
2.6.13-rc2 to -rc3 has some differences in SMP swsusp handling, but they should
not matter to you (right?). So it has to be something in surrounding code.

Carl, can you confirm that -rc2 works for you?

Beau, can you do binary search with git? Diff between -rc2 and -rc3 is still 2MB
:-(.
Comment 59 Pavel Machek 2005-09-26 13:20:18 UTC
[Alternatively, just try few daily snapshots, if it is easier than getting git
up/running.]
Comment 60 Beau Steward 2005-09-29 23:49:40 UTC
I'm not sure about using git, so I'll check out the daily snapshots this 
weekend. I apologize I can't work much faster on this, doing this when I have 
time.
Comment 61 Pavel Machek 2005-11-22 23:43:08 UTC
It has been way more than a weekend, sorry.