Bug 305162 - system takes three minutes to resume (regression)
Summary: system takes three minutes to resume (regression)
Status: VERIFIED FIXED
Alias: None
Product: openSUSE 10.3
Classification: openSUSE
Component: Mobile Devices (show other bugs)
Version: RC 1
Hardware: i586 Other
: P5 - None : Critical (vote)
Target Milestone: ---
Assignee: Thomas Renninger
QA Contact: E-mail List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2007-08-28 08:20 UTC by Felix Möller
Modified: 2007-10-11 10:21 UTC (History)
4 users (show)

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
coolo: SHIP_STOPPER-


Attachments
this is the log when pressing the power button immideately after suspending. This is reproducible every time. (6.84 KB, text/plain)
2007-08-28 08:20 UTC, Felix Möller
Details
/var/log/messages (30.77 KB, text/plain)
2007-09-04 15:32 UTC, Felix Möller
Details
this is the dmesg with cpufreq debug (69.55 KB, text/plain)
2007-09-09 16:42 UTC, Felix Möller
Details
cpufreq_check_negative_load.patch (1.36 KB, patch)
2007-09-11 22:03 UTC, Thomas Renninger
Details | Diff
cpufreq_fix_tickless_acounting.patch (2.56 KB, patch)
2007-09-11 22:04 UTC, Thomas Renninger
Details | Diff
this is the new dmesg again stalling (71.70 KB, text/plain)
2007-09-13 14:25 UTC, Felix Möller
Details
patch from current kernel-git (3.72 KB, text/plain)
2007-09-24 19:26 UTC, Felix Möller
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Felix Möller 2007-08-28 08:20:12 UTC
Created attachment 160171 [details]
this is the log when pressing the power button immideately after suspending. This is reproducible every time.

Since arround beta1 my system (Intel MacBook) takes 3 minutes to resume. During the last alphas the system took 3 seconds to resume. So something is broken here. :-(
Comment 1 Holger Macht 2007-08-28 08:40:50 UTC
From the log:

---
Di 28. Aug 09:31:02 CEST 2007: done running suspend hooks.
+ /usr/sbin/s2ram
Allocated buffer at 0x2010 (base is 0x0)
ES: 0x0201 EBX: 0x0000
Function not supported?
Switching from vt7 to vt1
Calling save_state
fbcon fb0 state 1
Calling restore_state_from
fbcon fb0 state 0
switching back to vt7
+ RET=0
+ set +x
Di 28. Aug 09:33:38 CEST 2007: running resume hooks.
---

Pavel, Stefan, Olaf, any idea?
Comment 2 Olaf Hering 2007-08-28 08:53:12 UTC
no idea about intel hardware
Comment 3 Felix Möller 2007-08-29 20:23:37 UTC
Did todays factory upgrade:
# rpm -q kernel-default suspend
kernel-default-2.6.22.5-2
suspend-0.50.20070731-12

but still:
Wed Aug 29 22:13:04 CEST 2007: done running suspend hooks.
+ /usr/sbin/s2ram
Allocated buffer at 0x2010 (base is 0x0)
ES: 0x0201 EBX: 0x0000
Function not supported?
Switching from vt7 to vt1
Calling save_state
fbcon fb0 state 1
Calling restore_state_from
fbcon fb0 state 0
switching back to vt7
+ RET=0
+ set +x
Wed Aug 29 22:15:26 CEST 2007: running resume hooks.

Any idea what to do? As far as I can research my system did suspend fine with factory of 2007-08-10 (~ Beta1). :-(

The time difference does not seem to be constant now its 2:22 in my original report it was 2:36 ...
Comment 4 Pavel Machek 2007-09-04 10:28:57 UTC
Does system still work properly after resume? Does "time sleep 1" produce something reasonable?

Where is the time spent? While suspending?

Stefan, how widespread is this problem? Can we reproduce it on machine where VGA works during resume, or where we have serial/netconsole?
Comment 5 Felix Möller 2007-09-04 15:31:09 UTC
(In reply to comment #4 from Pavel Machek)
> Does system still work properly after resume? Does "time sleep 1" produce
> something reasonable?
# time sleep 1

real    0m1.003s
user    0m0.000s
sys     0m0.004s

The system works perfectly after resume. Even the WLAN is reconnecting. ;) But booting takes 1 minute and resuming 3 so it is unusable. :-(

> Where is the time spent? While suspending?
The system suspends within 2 seconds or so. The time is spent resuming.
 
> Stefan, how widespread is this problem? Can we reproduce it on machine where
> VGA works during resume, or where we have serial/netconsole?
Comment 6 Felix Möller 2007-09-04 15:32:54 UTC
Created attachment 161747 [details]
/var/log/messages

This is /var/log/messages before and after the supend.
Comment 7 Pavel Machek 2007-09-04 21:31:17 UTC
Unfortunately, /var/log/messages shows timestamps from syslog (not from kernel) that are useless.

If you increase console loglevel (sysrq-9), can you see the messages during resume? Or perhaps enable timestamps for printk (it is in kernel config somewhere,  IIRC).
Comment 8 Pavel Machek 2007-09-04 21:35:14 UTC
Hmm, can you try if "nohz=off highres=off" helps? that's only big change in latest alphas I'm aware of.

ALso try  running with nosmp and minimal drivers...
Comment 9 Felix Möller 2007-09-05 07:23:39 UTC
(In reply to comment #8 from Pavel Machek)
> Hmm, can you try if "nohz=off highres=off" helps? that's only big change in
> latest alphas I'm aware of.
Setting "nohz=off highres=off" breaks it totally. I get with moved /lib/modules and without:
===== Mi 5. Sep 08:34:26 CEST 2007: running hook: /usr/lib/pm-utils/sleep.d/99info =====
Mi 5. Sep 08:34:26 CEST 2007: done running suspend hooks.
+ /usr/sbin/s2ram
Allocated buffer at 0x2010 (base is 0x0)
ES: 0x0201 EBX: 0x0000

and there the pm-suspend.log file just ends. 

> ALso try  running with nosmp and minimal drivers...
The default kernel-parameters with a moved /lib/modules/ do work. Unfortunately the intel-xorg driver does not even load without the modules. Therefore I just can say it is working from the console with /lib/modules/2.6.22.5-8-default/ moved to /lib/modules/2.6.22.5-8-default.fm/.

The system then still has some drivers loaded:
Module                  Size  Used by
usbhid                 41300  0
hid                    29184  1 usbhid
ff_memless              9352  1 usbhid
ehci_hcd               34956  0
uhci_hcd               27024  0
usbcore               123372  4 usbhid,ehci_hcd,uhci_hcd
edd                    12996  0
ext3                  131848  2
mbcache                12292  1 ext3
jbd                    68148  1 ext3
fan                     9220  0
ata_piix               18948  2
libata                135116  1 ata_piix
piix                   12932  0 [permanent]
thermal                19848  0
processor              40744  1 thermal
sr_mod                 19492  0
cdrom                  37020  1 sr_mod
ide_disk               20480  0
ide_core              122948  2 piix,ide_disk
sd_mod                 31104  3
sg                     37036  0
scsi_mod              140376  4 libata,sr_mod,sd_mod,sg
Comment 10 Pavel Machek 2007-09-05 08:34:52 UTC
Ok, so it is fast with minimal modules, and slow with the full set? Can you selectively load/unload modules to find out which one causes the slowdown?
Comment 11 Felix Möller 2007-09-05 09:14:19 UTC
yes with minimal modules it suspends and resumes in 2 seconds. 

I will try to find out which module causes the delay.
Comment 12 Felix Möller 2007-09-08 13:24:41 UTC
After several hours of rebooting I found the culprit: acpi_cpufreq

I have to add an "install acpi_cpufreq /bin/true" to /etc/modprobe.d/blacklist to solve the problem.

Reading through the changelog there are two commits which look related, therefore adding Thomas.

* Mo Aug 13 2007 - trenn@suse.de
- patches.fixes/cpufreq_check_negative_load.patch:
  ondemand: add a check to avoid negative load calculation
  (https://bugzilla.novell.com/show_bug.cgi?id=297355).
- patches.fixes/cpufreq_fix_tickless_acounting.patch: ondemand:
  fix tickless accounting and software coordination bug
  (https://bugzilla.novell.com/show_bug.cgi?id=297355).

* Mo Aug 06 2007 - trenn@suse.de
- patches.arch/acpi_autoloading_acpica.patch:
- patches.arch/acpi_autoloading_interface.patch:
- patches.arch/acpi_autoloading_modaliases.patch:
- patches.fixes/cpufreq_move_policy_init.patch: ondemand as
  default governor.
- patches.fixes/cpufreq_ondemand_as_default.patch:
- Update config files.
  These are ACPI autoloading (via udev) and use ondemand as
  default governor when loading low-level cpufreq driver patches.
Comment 13 Thomas Renninger 2007-09-09 15:43:15 UTC
Venkatesh, maybe you heard of this already?

Felix, do you know how to compile the SUSE sources?
Can you retry with cpufreq debug compiled in?
Install latest kernel-source package then do:
cd /use/src/linux
cp arch/x86_64/defconfig.default .config   (or i386?)
edit .config, search for something like CPU_FREQ_DEBUG and edit the line to:
CONFIG_CPU_FREQ_DEBUG=y   (not sure whether the config name is exactly correct...)
then:
make -j3
make install
cd /boot
mkinitrd -k vmlinuz-$NEW_KERNEL_VERSION -i initrd-$NEW_KERNEL_VERSION
edit /boot/grub/menu.lst    and add the newly installed kernel, best as a separate entry, also add the boot parameter cpufreq.debug=7).
Comment 14 Felix Möller 2007-09-09 16:01:47 UTC
You were right with the config option. I am compiling the kernel right now. 

The system is i586 with a Core Duo. It is an MacBook of early 2006.

Will report back later today.
Comment 15 Felix Möller 2007-09-09 16:42:17 UTC
Created attachment 162896 [details]
this is the dmesg with cpufreq debug

For me it looks interesting around line 800:
[  130.548000] PM: Writing back config space on device 0000:00:07.0 at offset 4 (was 0, writing 90444000)
[  130.548000] PM: Writing back config space on device 0000:00:07.0 at offset 1 (was b00000, writing b00002)
[  270.900000] PM: Writing back config space on device 0000:00:1b.0 at offset f (was 100, writing 10b)
[  270.900000] PM: Writing back config space on device 0000:00:1b.0 at offset 4 (was 4, writing 90440004)

# lspci -s 00:07.0 -nn
00:07.0 Performance counters [Class 1101]: Intel Corporation Unknown device [8086:27a3] (rev 03)

Thomas what else do you need?
Comment 16 Thomas Renninger 2007-09-09 17:20:22 UTC
> Thomas what else do you need?
Time :) Just kidding...

Maybe Venkatesh can have a look at this, it looks very Intel specific.
Please ping me at end of the week or beginning of next week if Venkatesh also has no time or nobody else had a look at this.
Comment 17 Thomas Renninger 2007-09-09 17:38:16 UTC
Maybe Pavel or Rafael already know about this?

This is a good hint:
> Since arround beta1 my system (Intel MacBook) takes 3 minutes to resume.
> During the last alphas the system took 3 seconds to resume.
If you still have the rpms flying around or even the kernels still installed, could you post which was the last working kernel version and from which one things broke, pls (rpm -qp --changelog kernel-default.rpm should extract the version from an rpm).
Just an idea, maybe nmi_watchdog=0 works? If yes, it's likely that this problem got fixed in Beta3 (nmi_watchdog was enabled by accident on i386 (sometimes?)).
Comment 18 Felix Möller 2007-09-09 18:59:38 UTC
nmi_watchdog=0 watchdog does not help.

But just adding SUSPEND_MODULES="acpi_cpufreq" to /etc/pm/config.d/config helps. This makes frequency-scaling working for me again and I do not have to blacklist acpi_cpufreq any longer.

Sadly I do not have the old kernels anymore. But as far as I can research my system did suspend fine with factory of 2007-08-10 (~ Beta1). :-(
Comment 19 Pavel Machek 2007-09-11 16:57:25 UTC
Hmm... not much I could add.

It looks like acpi_cpufreq problem, and question is if recent patches introduced the problem, or if module autoloading caused it to be used...
Comment 20 Venkatesh Pallipadi 2007-09-11 18:16:12 UTC
If I understand correctly, there are two recent changes that can be potentially causing this.
* Mo Aug 13 2007 - trenn@suse.de
- patches.fixes/cpufreq_check_negative_load.patch:
  ondemand: add a check to avoid negative load calculation
  (https://bugzilla.novell.com/show_bug.cgi?id=297355).
- patches.fixes/cpufreq_fix_tickless_acounting.patch: ondemand:
  fix tickless accounting and software coordination bug
  (https://bugzilla.novell.com/show_bug.cgi?id=297355).

* Mo Aug 06 2007 - trenn@suse.de
- patches.arch/acpi_autoloading_acpica.patch:
:
:

Can you rollback the two patches that went on Aug 13 and see whether the problem is still there?
Comment 21 Thomas Renninger 2007-09-11 22:03:46 UTC
Created attachment 163373 [details]
cpufreq_check_negative_load.patch

Seems as these patches are in stable 2.6.22.X now.
These are the two patches, you can rip them out with:
cd /usr/src/linux
patch -R -p1 -i ATTACHED_FILE
Comment 22 Thomas Renninger 2007-09-11 22:04:30 UTC
Created attachment 163374 [details]
cpufreq_fix_tickless_acounting.patch
Comment 23 Felix Möller 2007-09-13 14:25:08 UTC
Created attachment 163900 [details]
this is the new dmesg again stalling

I ripped both of the patches out and rebuilt the kernel, but it is still stalling.
Comment 24 Thomas Renninger 2007-09-13 19:14:51 UTC
Greg, do you know whether something changed here?
Andi said, we could blacklist the device, but (possibly?) only if there is a driver for it?
Here the relevant information collected:

Since arround beta1 my system (Intel MacBook) takes 3 minutes to resume. During
the last alphas the system took 3 seconds to resume. So something is broken
here. :-(

For me it looks interesting around line 800:
[  130.548000] PM: Writing back config space on device 0000:00:07.0 at offset 4
(was 0, writing 90444000)
[  130.548000] PM: Writing back config space on device 0000:00:07.0 at offset 1
(was b00000, writing b00002)
[  270.900000] PM: Writing back config space on device 0000:00:1b.0 at offset f
(was 100, writing 10b)
[  270.900000] PM: Writing back config space on device 0000:00:1b.0 at offset 4
(was 4, writing 90440004)

# lspci -s 00:07.0 -nn
00:07.0 Performance counters [Class 1101]: Intel Corporation Unknown device
[8086:27a3] (rev 03)
Comment 25 Felix Möller 2007-09-21 11:36:18 UTC
Another weird thing is with acpi_cpufreq cpuinfo looks a little strange:
processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 14
model name      : Genuine Intel(R) CPU            1400  @ 1.83GHz
stepping        : 8
cpu MHz         : 1952.038
cache size      : 2048 KB

the MHz is bigger than what my CPU has;)
Comment 26 Rafael Wysocki 2007-09-21 13:57:46 UTC
There are some patches from Thomas Gleixner that fix similar issues in the current Linus' tree.

AFAICS, they are these two patches plus-minus some minor stuff:
http://www.sisk.pl/kernel/hibernation_and_suspend/2.6.23-rc6/patches/32-timekeeping-prevent-time-going-backwards-on-resume-v2.patch
http://www.sisk.pl/kernel/hibernation_and_suspend/2.6.23-rc6/patches/38-cpu-hotplug-support-broken.patch
Comment 27 Thomas Renninger 2007-09-21 14:45:01 UTC
Rafael: thanks a lot.
Those patches patch fine with our 2.6.22 kernel.
As those patches came in for rc6, I expect they should be rather stable, but we should/must still check whether it fits well for our kernel and ask on kernel list.

It seems this one needs some minor adjustance:
drivers/acpi/processor_core.c: In function ‘acpi_cpu_soft_notify’:
drivers/acpi/processor_core.c:726: error: implicit declaration of function ‘acpi_processor_tstate_has_changed’
make[2]: *** [drivers/acpi/processor_core.o] Error 1

I won't be able to do that before Mo.
Increasing severity, looks like we have a bug for SMP suspend and CPU hotplug where a fix already exists, better get this in now, than with an update.
Comment 28 Stephan Kulow 2007-09-21 14:52:09 UTC
I tend to disagree actually. I would rather see some more testing than risking breaking the final with somethign we could have avoided
Comment 29 Thomas Renninger 2007-09-21 14:59:27 UTC
Posting on kernel list to get a risk estimation should be a good idea anyway?
Comment 30 Stephan Kulow 2007-09-21 15:08:13 UTC
yes. But 10.3 won't actively waiting for this fix
Comment 31 Felix Möller 2007-09-21 15:09:39 UTC
I applied 32-timekeeping-prevent-time-going-backwards-on-resume-v2.patch which does not fix the problem. 

I cannot apply 38-cpu-hotplug-support-broken.patch as it does not compile.

My current kernel is:
# rpm -q kernel-default
kernel-default-2.6.22.5-25
Comment 32 Felix Möller 2007-09-24 07:16:05 UTC
Thomas I think you might want to have a look at http://kerneltrap.org/mailarchive/linux-kernel/2007/9/22/272671 .
Comment 33 Rafael Wysocki 2007-09-24 12:03:47 UTC
Yes, definitely.
Comment 34 Felix Möller 2007-09-24 19:26:28 UTC
Created attachment 174436 [details]
patch from current kernel-git

I applied the attached patch, which is a copy from the kernel-git-repository to my kernel-source-2.6.22.5-29 and now the system resume and suspends fine as far as i can see.

Maybe this is something for the 2.6.22.x releases?
Comment 35 Thomas Renninger 2007-09-25 07:54:15 UTC
Felix: Great work!
This is definetly safe enough to go in. I can't promise because we are at a very late state.., will ask for inclusion.
Ahh, a Vaio is mentioned..., this probably happens on the machines with broken apic timer in deeper C states.
I will come back...
Comment 36 Rafael Wysocki 2007-09-25 09:53:55 UTC
(In reply to comment #34 from Felix Möller)
> Created an attachment (id=174436) [details]
> patch from current kernel-git
> 
> I applied the attached patch, which is a copy from the kernel-git-repository to
> my kernel-source-2.6.22.5-29 and now the system resume and suspends fine as far
> as i can see.
> 
> Maybe this is something for the 2.6.22.x releases?
> 

I think this patch is -stable material.
Comment 37 Forgotten User ZhJd0F0L3x 2007-09-25 13:29:34 UTC
BTW: we could probably also work around this one from userspace...
Comment 38 Thomas Renninger 2007-09-25 13:43:21 UTC
You mean with:
echo 1 >/sys/module/processor/max_cstate
at suspend and
echo 3 >/sys/module/processor/max_cstate
at resume time?

Better not, the patch exactly does the same where it should be done.
Patch from comment #34 added -> closing.
Rafael, could you add me to CC if you ask for stable inclusion for this one or tell me whether it will hit stable kernel or not (if you are CC'ed by requester...). Thanks Felix and Rafael for finally resolving this one.
Comment 39 Rafael Wysocki 2007-09-25 14:08:06 UTC
(In reply to comment #38 from Thomas Renninger)
> You mean with:
> echo 1 >/sys/module/processor/max_cstate
> at suspend and
> echo 3 >/sys/module/processor/max_cstate
> at resume time?
> 
> Better not, the patch exactly does the same where it should be done.

I agree.  This is a kernel problem and should be fixed in the kernel.

> Patch from comment #34 added -> closing.

> Rafael, could you add me to CC if you ask for stable inclusion for this one or
> tell me whether it will hit stable kernel or not (if you are CC'ed by
> requester...).

I will.

> Thanks Felix and Rafael for finally resolving this one.

Well, the credit should go to Thomas Gleixner, although I helped him a bit. ;-)
Comment 40 Thomas Renninger 2007-09-25 14:38:19 UTC
Rafael, FYI:

I know these Vaios have a broken lapic timer in deeper C-states.

I expect the IPI timer broadcast setup which is a workaround for 
those in 10.2 is messed up in timer suspend functions
by Thomas Gleissner's nohz/clocksource/whatever patches
that came in some time ago.

If I understood that right, they are still thinking about a right
fix, which might be a modification where IPI broadcast timers break,
I will post to lkml as soon as I find the time.

The workaround for those Vaios to use IPI broadcast timer interrupts is located here(formerly AMDs where blacklisted to use IPI, I added Pentium Ms to 10.2 some time ago, but this fell out due to CLOCKSOURCE patches and if I read the code right all machines are now using IPI interrupts):

2.6.22 (drivers/acpi/processor_idle.c):
#ifdef CONFIG_GENERIC_CLOCKEVENTS
        unsigned long reason;

        reason = pr->power.timer_broadcast_on_state < INT_MAX ?
                CLOCK_EVT_NOTIFY_BROADCAST_ON : CLOCK_EVT_NOTIFY_BROADCAST_OFF;

        clockevents_notify(reason, &pr->id);
#else

This is just a guess, but I could imagine the clocksoure (or whatever) timer suspend functions are messing up the IPI broadcast timer setup somehow. Therefore the Sony Vaios take that long at suspend time (hitting keys should help to be quicker if this is the problem).
With 10.2, without blacklisting those machines even booted very slow (nolapic or pressing keys (producing interrupts) helped).

I haven't followed up the lkml thread, don't know whether this info is still needed. I will search for the thread if I find the time, but time for me is precious at the moment, maybe this helps a bit fixing up the root cause...
Of course this is just a guess, it may be something else :)
Comment 41 Forgotten User ZhJd0F0L3x 2007-09-25 15:11:29 UTC
(In reply to comment #39 from Rafael Wysocki)
> (In reply to comment #38 from Thomas Renninger)
> > You mean with:
> > echo 1 >/sys/module/processor/max_cstate
> > at suspend and
> > echo 3 >/sys/module/processor/max_cstate
> > at resume time?
> > 
> > Better not, the patch exactly does the same where it should be done.
> 
> I agree.  This is a kernel problem and should be fixed in the kernel.

Of course. But for the case that a kernel patch would have been considered "too intrusive" (probably by some manager :-), i might have been able to sneak in a pm-utils "fix" :-)

But i'm happy that this is not needed now.
Comment 42 Rafael Wysocki 2007-09-25 20:24:19 UTC
Referring to Comment #40:

This certainly is valuable information, as it may help understand some strange behavior that we observe on these boxes from time to time (especially on the Andrew's one).

OTOH, I'm not sure if it's still relevant as far as suspend is concerned.

Thanks for it anyway. :-)
Comment 43 Felix Möller 2007-10-11 07:58:08 UTC
It is working for me with:

# uname -a
Linux macbook 2.6.22.9-0.4-default #1 SMP 2007/10/05 21:32:04 UTC i686 i686 i386 GNU/Linux

Thanks!