|
Bugzilla – Full Text Bug Listing |
| Summary: | system takes three minutes to resume (regression) | ||
|---|---|---|---|
| Product: | [openSUSE] openSUSE 10.3 | Reporter: | Felix Möller <felix> |
| Component: | Mobile Devices | Assignee: | Thomas Renninger <trenn> |
| Status: | VERIFIED FIXED | QA Contact: | E-mail List <qa-bugs> |
| Severity: | Critical | ||
| Priority: | P5 - None | CC: | coolo, rjwysocki, trenn, venkatesh.pallipadi |
| Version: | RC 1 | Flags: | coolo:
SHIP_STOPPER-
|
| Target Milestone: | --- | ||
| Hardware: | i586 | ||
| OS: | Other | ||
| Whiteboard: | |||
| Found By: | --- | Services Priority: | |
| Business Priority: | Blocker: | --- | |
| Marketing QA Status: | --- | IT Deployment: | --- |
| Attachments: |
this is the log when pressing the power button immideately after suspending. This is reproducible every time.
/var/log/messages this is the dmesg with cpufreq debug cpufreq_check_negative_load.patch cpufreq_fix_tickless_acounting.patch this is the new dmesg again stalling patch from current kernel-git |
||
From the log: --- Di 28. Aug 09:31:02 CEST 2007: done running suspend hooks. + /usr/sbin/s2ram Allocated buffer at 0x2010 (base is 0x0) ES: 0x0201 EBX: 0x0000 Function not supported? Switching from vt7 to vt1 Calling save_state fbcon fb0 state 1 Calling restore_state_from fbcon fb0 state 0 switching back to vt7 + RET=0 + set +x Di 28. Aug 09:33:38 CEST 2007: running resume hooks. --- Pavel, Stefan, Olaf, any idea? no idea about intel hardware Did todays factory upgrade: # rpm -q kernel-default suspend kernel-default-2.6.22.5-2 suspend-0.50.20070731-12 but still: Wed Aug 29 22:13:04 CEST 2007: done running suspend hooks. + /usr/sbin/s2ram Allocated buffer at 0x2010 (base is 0x0) ES: 0x0201 EBX: 0x0000 Function not supported? Switching from vt7 to vt1 Calling save_state fbcon fb0 state 1 Calling restore_state_from fbcon fb0 state 0 switching back to vt7 + RET=0 + set +x Wed Aug 29 22:15:26 CEST 2007: running resume hooks. Any idea what to do? As far as I can research my system did suspend fine with factory of 2007-08-10 (~ Beta1). :-( The time difference does not seem to be constant now its 2:22 in my original report it was 2:36 ... Does system still work properly after resume? Does "time sleep 1" produce something reasonable? Where is the time spent? While suspending? Stefan, how widespread is this problem? Can we reproduce it on machine where VGA works during resume, or where we have serial/netconsole? (In reply to comment #4 from Pavel Machek) > Does system still work properly after resume? Does "time sleep 1" produce > something reasonable? # time sleep 1 real 0m1.003s user 0m0.000s sys 0m0.004s The system works perfectly after resume. Even the WLAN is reconnecting. ;) But booting takes 1 minute and resuming 3 so it is unusable. :-( > Where is the time spent? While suspending? The system suspends within 2 seconds or so. The time is spent resuming. > Stefan, how widespread is this problem? Can we reproduce it on machine where > VGA works during resume, or where we have serial/netconsole? Created attachment 161747 [details]
/var/log/messages
This is /var/log/messages before and after the supend.
Unfortunately, /var/log/messages shows timestamps from syslog (not from kernel) that are useless. If you increase console loglevel (sysrq-9), can you see the messages during resume? Or perhaps enable timestamps for printk (it is in kernel config somewhere, IIRC). Hmm, can you try if "nohz=off highres=off" helps? that's only big change in latest alphas I'm aware of. ALso try running with nosmp and minimal drivers... (In reply to comment #8 from Pavel Machek) > Hmm, can you try if "nohz=off highres=off" helps? that's only big change in > latest alphas I'm aware of. Setting "nohz=off highres=off" breaks it totally. I get with moved /lib/modules and without: ===== Mi 5. Sep 08:34:26 CEST 2007: running hook: /usr/lib/pm-utils/sleep.d/99info ===== Mi 5. Sep 08:34:26 CEST 2007: done running suspend hooks. + /usr/sbin/s2ram Allocated buffer at 0x2010 (base is 0x0) ES: 0x0201 EBX: 0x0000 and there the pm-suspend.log file just ends. > ALso try running with nosmp and minimal drivers... The default kernel-parameters with a moved /lib/modules/ do work. Unfortunately the intel-xorg driver does not even load without the modules. Therefore I just can say it is working from the console with /lib/modules/2.6.22.5-8-default/ moved to /lib/modules/2.6.22.5-8-default.fm/. The system then still has some drivers loaded: Module Size Used by usbhid 41300 0 hid 29184 1 usbhid ff_memless 9352 1 usbhid ehci_hcd 34956 0 uhci_hcd 27024 0 usbcore 123372 4 usbhid,ehci_hcd,uhci_hcd edd 12996 0 ext3 131848 2 mbcache 12292 1 ext3 jbd 68148 1 ext3 fan 9220 0 ata_piix 18948 2 libata 135116 1 ata_piix piix 12932 0 [permanent] thermal 19848 0 processor 40744 1 thermal sr_mod 19492 0 cdrom 37020 1 sr_mod ide_disk 20480 0 ide_core 122948 2 piix,ide_disk sd_mod 31104 3 sg 37036 0 scsi_mod 140376 4 libata,sr_mod,sd_mod,sg Ok, so it is fast with minimal modules, and slow with the full set? Can you selectively load/unload modules to find out which one causes the slowdown? yes with minimal modules it suspends and resumes in 2 seconds. I will try to find out which module causes the delay. After several hours of rebooting I found the culprit: acpi_cpufreq I have to add an "install acpi_cpufreq /bin/true" to /etc/modprobe.d/blacklist to solve the problem. Reading through the changelog there are two commits which look related, therefore adding Thomas. * Mo Aug 13 2007 - trenn@suse.de - patches.fixes/cpufreq_check_negative_load.patch: ondemand: add a check to avoid negative load calculation (https://bugzilla.novell.com/show_bug.cgi?id=297355). - patches.fixes/cpufreq_fix_tickless_acounting.patch: ondemand: fix tickless accounting and software coordination bug (https://bugzilla.novell.com/show_bug.cgi?id=297355). * Mo Aug 06 2007 - trenn@suse.de - patches.arch/acpi_autoloading_acpica.patch: - patches.arch/acpi_autoloading_interface.patch: - patches.arch/acpi_autoloading_modaliases.patch: - patches.fixes/cpufreq_move_policy_init.patch: ondemand as default governor. - patches.fixes/cpufreq_ondemand_as_default.patch: - Update config files. These are ACPI autoloading (via udev) and use ondemand as default governor when loading low-level cpufreq driver patches. Venkatesh, maybe you heard of this already? Felix, do you know how to compile the SUSE sources? Can you retry with cpufreq debug compiled in? Install latest kernel-source package then do: cd /use/src/linux cp arch/x86_64/defconfig.default .config (or i386?) edit .config, search for something like CPU_FREQ_DEBUG and edit the line to: CONFIG_CPU_FREQ_DEBUG=y (not sure whether the config name is exactly correct...) then: make -j3 make install cd /boot mkinitrd -k vmlinuz-$NEW_KERNEL_VERSION -i initrd-$NEW_KERNEL_VERSION edit /boot/grub/menu.lst and add the newly installed kernel, best as a separate entry, also add the boot parameter cpufreq.debug=7). You were right with the config option. I am compiling the kernel right now. The system is i586 with a Core Duo. It is an MacBook of early 2006. Will report back later today. Created attachment 162896 [details]
this is the dmesg with cpufreq debug
For me it looks interesting around line 800:
[ 130.548000] PM: Writing back config space on device 0000:00:07.0 at offset 4 (was 0, writing 90444000)
[ 130.548000] PM: Writing back config space on device 0000:00:07.0 at offset 1 (was b00000, writing b00002)
[ 270.900000] PM: Writing back config space on device 0000:00:1b.0 at offset f (was 100, writing 10b)
[ 270.900000] PM: Writing back config space on device 0000:00:1b.0 at offset 4 (was 4, writing 90440004)
# lspci -s 00:07.0 -nn
00:07.0 Performance counters [Class 1101]: Intel Corporation Unknown device [8086:27a3] (rev 03)
Thomas what else do you need?
> Thomas what else do you need?
Time :) Just kidding...
Maybe Venkatesh can have a look at this, it looks very Intel specific.
Please ping me at end of the week or beginning of next week if Venkatesh also has no time or nobody else had a look at this.
Maybe Pavel or Rafael already know about this?
This is a good hint:
> Since arround beta1 my system (Intel MacBook) takes 3 minutes to resume.
> During the last alphas the system took 3 seconds to resume.
If you still have the rpms flying around or even the kernels still installed, could you post which was the last working kernel version and from which one things broke, pls (rpm -qp --changelog kernel-default.rpm should extract the version from an rpm).
Just an idea, maybe nmi_watchdog=0 works? If yes, it's likely that this problem got fixed in Beta3 (nmi_watchdog was enabled by accident on i386 (sometimes?)).
nmi_watchdog=0 watchdog does not help. But just adding SUSPEND_MODULES="acpi_cpufreq" to /etc/pm/config.d/config helps. This makes frequency-scaling working for me again and I do not have to blacklist acpi_cpufreq any longer. Sadly I do not have the old kernels anymore. But as far as I can research my system did suspend fine with factory of 2007-08-10 (~ Beta1). :-( Hmm... not much I could add. It looks like acpi_cpufreq problem, and question is if recent patches introduced the problem, or if module autoloading caused it to be used... If I understand correctly, there are two recent changes that can be potentially causing this. * Mo Aug 13 2007 - trenn@suse.de - patches.fixes/cpufreq_check_negative_load.patch: ondemand: add a check to avoid negative load calculation (https://bugzilla.novell.com/show_bug.cgi?id=297355). - patches.fixes/cpufreq_fix_tickless_acounting.patch: ondemand: fix tickless accounting and software coordination bug (https://bugzilla.novell.com/show_bug.cgi?id=297355). * Mo Aug 06 2007 - trenn@suse.de - patches.arch/acpi_autoloading_acpica.patch: : : Can you rollback the two patches that went on Aug 13 and see whether the problem is still there? Created attachment 163373 [details]
cpufreq_check_negative_load.patch
Seems as these patches are in stable 2.6.22.X now.
These are the two patches, you can rip them out with:
cd /usr/src/linux
patch -R -p1 -i ATTACHED_FILE
Created attachment 163374 [details]
cpufreq_fix_tickless_acounting.patch
Created attachment 163900 [details]
this is the new dmesg again stalling
I ripped both of the patches out and rebuilt the kernel, but it is still stalling.
Greg, do you know whether something changed here? Andi said, we could blacklist the device, but (possibly?) only if there is a driver for it? Here the relevant information collected: Since arround beta1 my system (Intel MacBook) takes 3 minutes to resume. During the last alphas the system took 3 seconds to resume. So something is broken here. :-( For me it looks interesting around line 800: [ 130.548000] PM: Writing back config space on device 0000:00:07.0 at offset 4 (was 0, writing 90444000) [ 130.548000] PM: Writing back config space on device 0000:00:07.0 at offset 1 (was b00000, writing b00002) [ 270.900000] PM: Writing back config space on device 0000:00:1b.0 at offset f (was 100, writing 10b) [ 270.900000] PM: Writing back config space on device 0000:00:1b.0 at offset 4 (was 4, writing 90440004) # lspci -s 00:07.0 -nn 00:07.0 Performance counters [Class 1101]: Intel Corporation Unknown device [8086:27a3] (rev 03) Another weird thing is with acpi_cpufreq cpuinfo looks a little strange: processor : 1 vendor_id : GenuineIntel cpu family : 6 model : 14 model name : Genuine Intel(R) CPU 1400 @ 1.83GHz stepping : 8 cpu MHz : 1952.038 cache size : 2048 KB the MHz is bigger than what my CPU has;) There are some patches from Thomas Gleixner that fix similar issues in the current Linus' tree. AFAICS, they are these two patches plus-minus some minor stuff: http://www.sisk.pl/kernel/hibernation_and_suspend/2.6.23-rc6/patches/32-timekeeping-prevent-time-going-backwards-on-resume-v2.patch http://www.sisk.pl/kernel/hibernation_and_suspend/2.6.23-rc6/patches/38-cpu-hotplug-support-broken.patch Rafael: thanks a lot. Those patches patch fine with our 2.6.22 kernel. As those patches came in for rc6, I expect they should be rather stable, but we should/must still check whether it fits well for our kernel and ask on kernel list. It seems this one needs some minor adjustance: drivers/acpi/processor_core.c: In function ‘acpi_cpu_soft_notify’: drivers/acpi/processor_core.c:726: error: implicit declaration of function ‘acpi_processor_tstate_has_changed’ make[2]: *** [drivers/acpi/processor_core.o] Error 1 I won't be able to do that before Mo. Increasing severity, looks like we have a bug for SMP suspend and CPU hotplug where a fix already exists, better get this in now, than with an update. I tend to disagree actually. I would rather see some more testing than risking breaking the final with somethign we could have avoided Posting on kernel list to get a risk estimation should be a good idea anyway? yes. But 10.3 won't actively waiting for this fix I applied 32-timekeeping-prevent-time-going-backwards-on-resume-v2.patch which does not fix the problem. I cannot apply 38-cpu-hotplug-support-broken.patch as it does not compile. My current kernel is: # rpm -q kernel-default kernel-default-2.6.22.5-25 Thomas I think you might want to have a look at http://kerneltrap.org/mailarchive/linux-kernel/2007/9/22/272671 . Yes, definitely. Created attachment 174436 [details]
patch from current kernel-git
I applied the attached patch, which is a copy from the kernel-git-repository to my kernel-source-2.6.22.5-29 and now the system resume and suspends fine as far as i can see.
Maybe this is something for the 2.6.22.x releases?
Felix: Great work! This is definetly safe enough to go in. I can't promise because we are at a very late state.., will ask for inclusion. Ahh, a Vaio is mentioned..., this probably happens on the machines with broken apic timer in deeper C states. I will come back... (In reply to comment #34 from Felix Möller) > Created an attachment (id=174436) [details] > patch from current kernel-git > > I applied the attached patch, which is a copy from the kernel-git-repository to > my kernel-source-2.6.22.5-29 and now the system resume and suspends fine as far > as i can see. > > Maybe this is something for the 2.6.22.x releases? > I think this patch is -stable material. BTW: we could probably also work around this one from userspace... You mean with: echo 1 >/sys/module/processor/max_cstate at suspend and echo 3 >/sys/module/processor/max_cstate at resume time? Better not, the patch exactly does the same where it should be done. Patch from comment #34 added -> closing. Rafael, could you add me to CC if you ask for stable inclusion for this one or tell me whether it will hit stable kernel or not (if you are CC'ed by requester...). Thanks Felix and Rafael for finally resolving this one. (In reply to comment #38 from Thomas Renninger) > You mean with: > echo 1 >/sys/module/processor/max_cstate > at suspend and > echo 3 >/sys/module/processor/max_cstate > at resume time? > > Better not, the patch exactly does the same where it should be done. I agree. This is a kernel problem and should be fixed in the kernel. > Patch from comment #34 added -> closing. > Rafael, could you add me to CC if you ask for stable inclusion for this one or > tell me whether it will hit stable kernel or not (if you are CC'ed by > requester...). I will. > Thanks Felix and Rafael for finally resolving this one. Well, the credit should go to Thomas Gleixner, although I helped him a bit. ;-) Rafael, FYI:
I know these Vaios have a broken lapic timer in deeper C-states.
I expect the IPI timer broadcast setup which is a workaround for
those in 10.2 is messed up in timer suspend functions
by Thomas Gleissner's nohz/clocksource/whatever patches
that came in some time ago.
If I understood that right, they are still thinking about a right
fix, which might be a modification where IPI broadcast timers break,
I will post to lkml as soon as I find the time.
The workaround for those Vaios to use IPI broadcast timer interrupts is located here(formerly AMDs where blacklisted to use IPI, I added Pentium Ms to 10.2 some time ago, but this fell out due to CLOCKSOURCE patches and if I read the code right all machines are now using IPI interrupts):
2.6.22 (drivers/acpi/processor_idle.c):
#ifdef CONFIG_GENERIC_CLOCKEVENTS
unsigned long reason;
reason = pr->power.timer_broadcast_on_state < INT_MAX ?
CLOCK_EVT_NOTIFY_BROADCAST_ON : CLOCK_EVT_NOTIFY_BROADCAST_OFF;
clockevents_notify(reason, &pr->id);
#else
This is just a guess, but I could imagine the clocksoure (or whatever) timer suspend functions are messing up the IPI broadcast timer setup somehow. Therefore the Sony Vaios take that long at suspend time (hitting keys should help to be quicker if this is the problem).
With 10.2, without blacklisting those machines even booted very slow (nolapic or pressing keys (producing interrupts) helped).
I haven't followed up the lkml thread, don't know whether this info is still needed. I will search for the thread if I find the time, but time for me is precious at the moment, maybe this helps a bit fixing up the root cause...
Of course this is just a guess, it may be something else :)
(In reply to comment #39 from Rafael Wysocki) > (In reply to comment #38 from Thomas Renninger) > > You mean with: > > echo 1 >/sys/module/processor/max_cstate > > at suspend and > > echo 3 >/sys/module/processor/max_cstate > > at resume time? > > > > Better not, the patch exactly does the same where it should be done. > > I agree. This is a kernel problem and should be fixed in the kernel. Of course. But for the case that a kernel patch would have been considered "too intrusive" (probably by some manager :-), i might have been able to sneak in a pm-utils "fix" :-) But i'm happy that this is not needed now. Referring to Comment #40: This certainly is valuable information, as it may help understand some strange behavior that we observe on these boxes from time to time (especially on the Andrew's one). OTOH, I'm not sure if it's still relevant as far as suspend is concerned. Thanks for it anyway. :-) It is working for me with: # uname -a Linux macbook 2.6.22.9-0.4-default #1 SMP 2007/10/05 21:32:04 UTC i686 i686 i386 GNU/Linux Thanks! |
Created attachment 160171 [details] this is the log when pressing the power button immideately after suspending. This is reproducible every time. Since arround beta1 my system (Intel MacBook) takes 3 minutes to resume. During the last alphas the system took 3 seconds to resume. So something is broken here. :-(