|
Bugzilla – Full Text Bug Listing |
| Summary: | AMD X_64 Multi-core CPU's have NO Thermal Protection from both Kernel or Power Manager to the point of PC auto Shut-down | ||
|---|---|---|---|
| Product: | [openSUSE] openSUSE 11.3 | Reporter: | Maks Vasilev <max> |
| Component: | Kernel | Assignee: | E-mail List <kernel-maintainers> |
| Status: | VERIFIED WONTFIX | QA Contact: | E-mail List <qa-bugs> |
| Severity: | Critical | ||
| Priority: | P2 - High | CC: | adrian.schroeter, aj, boris.ostrovsky, bruno, forgotten_7Vd19u3Vod, jeffm, kjozic, malmyzh, max, opensuse-bugs, per, roland.haidl, scott, trenn |
| Version: | Final | ||
| Target Milestone: | --- | ||
| Hardware: | x86-64 | ||
| OS: | openSUSE 11.3 | ||
| Whiteboard: | |||
| Found By: | --- | Services Priority: | |
| Business Priority: | Blocker: | --- | |
| Marketing QA Status: | --- | IT Deployment: | --- |
| Attachments: | dmesg; ACPI=off(AMD Phenom II X4 920); Startup with BIOS Feature = ENABLED; boot.msg with "Cool 'n Quiet" disabled in BIOS; Issue has been present since 10.x - We just go round and around with this one with endless user logs, user Bugs and NO Resolution for over 3 or so years!; 1 of 2 Images; 2 of 2 Images; Self Titled (7 attachments); My logs - not broken.; My logs from Phenom II X6 1055T - Also not broken |
||
Created attachment 328844 [details]
ACPI=off(AMD Phenom II X4 920)
Comment on attachment 328844 [details]
ACPI=off(AMD Phenom II X4 920)
I have the same
Ouu... sorry, first time. Screenshot for acpi=on: http://i051.radikal.ru/0911/c4/13385d2e10e3.jpg Thanks.
Irek, is there any way you can capture the top of that oops?
I have no idea how to capture the start of this oops. I experimented with the Pause key, but the oops is displayed in one portion. My screenshot: http://picpaste.com/pics/19381.1259251702.jpeg My mainboard is a GIGABYTE GA-MA790FX-DS5; Irek Fasikhov's is a GIGABYTE GA-MA790X-DS4.
Same with GA-MA790X-DS4 motherboard.
Same problem with GIGABYTE GA-MA770-DS3, AMD 770 / SB600 chipsets, and much other similar hardware; see also bug 548108, where detailed information about hardware and messages can be found. Comment 15 of bug 548108 suggests that the problem is caused by the SUSE patches to the kernel.
Same here (backport of #548108) with motherboard Gigabyte MA-790FX-DQ6. Detailed hwinfo here: https://bugzilla.novell.com/attachment.cgi?id=331572 The vanilla kernel works without a hitch. The 2.6.32 SUSE flavor has the same trouble. Important detail: a 32-bit kernel boots! All 11.1 kernels work.
I installed the latest vanilla kernel from SUSE Factory:
2.6.32-41-vanilla #1 SMP 2009-12-11 11:05:24 -0500 x86_64 x86_64 x86_64 GNU/Linux
Mainboard: GIGABYTE GA-MA770-DS3, AMD 770 / SB600 chipsets
The system boots without any problem with ACPI switched on; more details in bug 548108.
After upgrading the kernel to:
http://download.opensuse.org/repositories/Kernel:/HEAD/openSUSE_Factory/x86_64/kernel-default-2.6.32-41.1.x86_64.rpm
dmesg:
[ 16.933728] powernow-k8: Found 1 AMD Phenom(tm) II X4 920 Processor processors (4 cpu cores) (version 2.20.00)
[ 16.933768] powernow-k8: 0 : pstate 0 (2800 MHz)
[ 16.933769] powernow-k8: 1 : pstate 1 (2100 MHz)
[ 16.933771] powernow-k8: 2 : pstate 2 (1600 MHz)
[ 16.933772] powernow-k8: 3 : pstate 3 (800 MHz)
linux-n4fy:/home/kataklysm # uname -a
Linux linux-n4fy 2.6.32-41-default #1 SMP 2009-12-11 11:05:24 -0500 x86_64 x86_64 x86_64 GNU/Linux
acpi works
ACPI works for me with kernel http://download.opensuse.org/repositories/Kernel:/HEAD/openSUSE_Factory/x86_64/kernel-default-2.6.32-41.2.x86_64.rpm and mainboard GA-MA770-DS3, BIOS version F6. I'll try BIOS version F7 and the installation kernel as recommended in bug 548108.
For MB GA-MA770-DS3 rev. 1 (with SB600), a BIOS upgrade to F7 and loading "Optimized Defaults" solves the problem. I recommend marking this report as a duplicate of bug 548108.
Created attachment 344333 [details]
Startup with BIOS Feature = ENABLED
Sorry .ODT text file contents
I have an AMD Phenom(tm) 9950 Quad-Core Processor
Speed: 1125,000,0003001.2500 MHz
Cores: 4
Memory Information
Total memory (RAM): 71.258 GiB
Free memory: 61.258 GiB (+ 6291.257 MiB Caches)
Free swap: 241.256 GiB
The AMD BIOS control lets us either enable a feature called Cool N' Quiet or not. The default is enabled. When enabled, the CPU always starts at a very low clock speed (during boot, or under any low demand?). The upshift of the CPU clock frequency depends on ACPI and lags horribly, which for many of us gives poorer overall performance. Because the CPU does not start at full clock speed, and takes an inordinate amount of time to upshift unless sufficient demands are made of it, many of us disable this in the BIOS.
If I boot the Linux 2.6.31.12-0.1-desktop x86_64 kernel with Cool N' Quiet ENABLED (CPU ACPI Interface ON), the system boots without much of an issue. The start-up log is attached.
IF, like most of us who want full performance all the time, I DISABLE this, then video, sound, freezing, frozen keyboard and similar problems plague an X_86 PC, and the start-up log indicates "CPUFrequency Not Supported".
The install DVD kernel does not have the freezing issues described, whether the feature is DISABLED or ENABLED in the BIOS. Only Linux 2.6.31.12-0.1-desktop x86_64 does.
We need kernel development to cope with both BIOS settings, particularly for the many who want to start at maximum CPU clock frequency.
Now it gets ugly. I have attached the start-up log of kernel Linux 2.6.31.12-0.1-desktop x86_64 with the BIOS feature enabled. The PC is stable, but the log is very, very ugly and sick, BUT the hardware is NOT.
All of the above is 100% reproducible, unless my 6 x X_64 PCs just happen to have the same problem. There is nothing wrong with our hardware and nothing wrong with our BIOS versions!
I have the same processor and MB GA-MA770-DS3 rev. 1 (with SB600), BIOS version F7. Enabling or disabling "Cool N' Quiet" works equally well with kernel 2.6.31.12-0.1-desktop x86_64 (see attached boot.msg).
Created attachment 345459 [details]
boot.msg with "Cool 'n Quiet" disabled in BIOS
Created attachment 348776 [details]
Issue has been present since 10.x - We just go round and around with this one with endless user logs, user Bugs and NO Resolution for over 3 or so years!
Until we can create a CPU Ladder that can interface with an X_64 CPU with BIOS ACPI OFF, nothing is going to change. As a courtesy, the attached document was sent to AMD Processor Development in the US in the hope that they can assist with development dollars and a FIX. We need to throw money at this one, work with AMD and actually resolve it. It's been around since 10.3 and has been reported over a dozen times without resolution, but with much discussion.
My start-up log says CPU Frequency Not Supported, Yes I run an AMD Sponsor X_64 CPU on all my PC's. Unless I am mistaken, AMD is a partner, yet we cannot fix this and work together with AMD. Cross fingers the direct approach works.
CC: Added QA, et al. Request escalation to Blocker Status against RC of 11.3 together with the true implications of that Status. Quick discussion and verdict please!
According to comments 14, 15 and 16, the initial problem (not booting with acpi=on) has been solved for quite a while -> marking the bug fixed. It is filed against 11.3 (updated?), but the initial problem is solved.
Some additional info:
- A bug has been resolved which speeds up freq switching with the ondemand governor:
- 732553e567c2700ba5b9bccc6ec885c75779a94b
- count IO as cpu time
- ...
You shouldn't see any/much performance impact with powernow-k8.
Does this board have a graphics card on board? Then the overheating probably comes/came from missing power management for the graphics card. That would be around 10W more power consumption constantly. It took quite some time for AMD to fix this, but it has now also found its way into the open source radeon KMS (kernel mode setting) driver. Hm, I'm not sure whether it's already in 2.6.33 (11.3); it definitely is in the latest kernel(s). You could try a latest kernel: ftp://ftp.suse.com/pub/projects/kernel/kotd/master/x86_64
If you still have problems, please open a new bug!
ACPI is irrelevant to this issue; whether it is on or off does not matter!
- Renamed Bug
QA - Category may be CPU Power or Frequency Bug, with or without Kernel interference!
This bug was opened in 2009 and is a remnant of others opened in 2008. It remains a huge issue, and all 3 of my X_64 PCs exhibit it.
Just install an X_64 PC standard with BIOS Cool 'N Quiet = off. The default is normally off, but check it!
All my X64 PCs have different BIOS makers, their firmware is up to date, and all have multicore X64 AMD CPUs.
QUOTE
<5>[ 9.363205] sd 0:0:0:0: Attached scsi generic sg0 type 0
<3>[ 9.349605] k10temp 0000:00:18.3: unreliable CPU thermal sensor; monitoring disabled
<6>[ 9.349846] 8139cp: 8139cp: 10/100 PCI Ethernet driver v1.3 (Mar 22, 2004)
<6>[ 9.353020] 8139cp 0000:01:06.0: This (id 10ec:8139 rev 10) is not an 8139C+ compatible chip, use 8139too
<6>[ 9.357042] FDC 0 is a post-1991 82077
......................
QUOTE
<6>[ 0.856600] cpuidle: using governor ladder
<6>[ 0.856601] cpuidle: using governor menu
<6>[ 0.856791] usbcore: registered new interface driver hiddev
<6>[ 0.856800] usbcore: registered new interface driver usbhid
<6>[ 0.856801] usbhid: USB HID core driver
...............................
<6>[ 0.006402] Dentry cache hash table entries: 1048576 (order: 11, 8388608 bytes)
<6>[ 0.009624] Inode-cache hash table entries: 524288 (order: 10, 4194304 bytes)
<4>[ 0.011183] Mount-cache hash table entries: 256
<7>[ 0.011317] tseg: 0000000000
<6>[ 0.011327] CPU: Physical Processor ID: 0
<6>[ 0.011329] CPU: Processor Core ID: 0
<6>[ 0.011331] mce: CPU supports 6 MCE banks
<6>[ 0.011339] using C1E aware idle routine
<6>[ 0.011341] Performance Events: AMD PMU driver.
<6>[ 0.011344] ... version: 0
<6>[ 0.011345] ... bit width: 48
<6>[ 0.011346] ... generic registers: 4
<6>[ 0.011348] ... value mask: 0000ffffffffffff
<6>[ 0.011349] ... max period: 00007fffffffffff
<6>[ 0.011350] ... fixed-purpose events: 0
<6>[ 0.011351] ... event mask: 000000000000000f
<6>[ 0.011418] Unpacking initramfs...
<6>[ 0.163370] Freeing initrd memory: 10284k freed
<6>[ 0.167256] ACPI: Core revision 20100121
<6>[ 0.179909] Setting APIC routing to flat
<6>[ 0.180388] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
<6>[ 0.190404] CPU0: AMD Phenom(tm) 9950 Quad-Core Processor stepping 03
<6>[ 0.196024] Booting Node 0, Processors #1 #2 #3 Ok.
<6>[ 0.450022] Brought up 4 CPUs
<6>[ 0.450025] Total of 4 processors activated (22400.87 BogoMIPS).
Part of the log is above; however, it is 100% reproducible on ANY X_64 from AMD. An installation, being one of the most resource-hungry tasks and running for a sustained period, can and will completely overheat the CPU to shut-down if the ambient temperature is above 32 degrees C.
In the very unlikely event that you cannot see this on your test PC with any X_64 AMD CPU with Cool n Quiet OFF, please ask me for specific logs for all 3 PCs, which all show that both thermal monitoring and the CPU governing ladder are OFF or UNABLE or NOT AVAILABLE. Please be specific in your log requests!
Either throw money at this and fix it, or close as WONTFIX. It will put all of us out of our misery ...:-)
Anyone else have their boot logs handy????? ...:-)
Sorry - final thoughts - I have a Widget that displays the real-time temperature of each core of my quad core CPU, so I am sure monitoring the thermal temp is relatively easy. All we need is a CPU Thermal Ladder to throttle the CPU and/or change the CPU fan speed in the event the BIOS is not so enabled, and/or suspend or shut down the O/S if 'n' temperature is observed, i.e. 70+
(In reply to comment #29)
> All we need is a CPU Thermal Ladder to throttle the CPU and/or Change CPU fan
> speed in the event BIOS is not so enabled, and/or suspend or shutdown the O/S
> if 'n' temperature is observed - I.E 70+
My only real interest in this is that my main workstation is an AMD Phenom quad too - is there a good reason why the BIOS would not be enabled for Cool'n'Quiet?
Whether Cool'n'Quiet is off by default is up to the OEM/platform/BIOS vendor. The vendor must also make sure the installed fans are capable of getting the hot air out of the chassis. The vendor can implement thermal/passive cooling at temperature X via an ACPI passive trip point.
Oh, that reminds me of a common AMD quad core BIOS bug for which I posted a workaround that never made it in.
Hmm, iirc there should be an ACPI message when the thermal driver is loaded, but I can't see any in the provided dmesgs.
Argh, reading again from the beginning, several bugs are mixed up here.
Bug 1 (and this is even more critical):
- Machine does not boot, with a backtrace to the ACPI subsystem
-> I doubt everyone is seeing this?
-> Could be related to "ACPI global code execution", which got fixed recently; it might not be in the latest kernel yet, need to double check. Could also be something else.
Bug 2: thermal shutdowns due to a possibly missing or non-working passive cooling BIOS implementation.
In fact "Cool'n Quiet" CPU frequency scaling needs a working ACPI subsystem. It could be that you hit 2. only because of 1. Booting with acpi=off is bad, and ACPI-controlled temperature control (on which quite a lot of vendors, especially laptop vendors, depend) will not work. Instead of reducing the max freq by one step if it gets too hot, the kernel or, even worse, the HW will shut the machine off.
Irek (and/or whoever else sees the "Machine not booting with acpi on" issue): can you please open a new bug and name it something like "Early boot hang with backtrace to ACPI subsystem"? Unfortunately the picture of the hang does not show the whole backtrace, but can you also attach it: http://i051.radikal.ru/0911/c4/13385d2e10e3.jpg together with acpidump and dmesg output (which you might be able to retrieve by booting with acpi=off).
Let's make sure all machines boot and work properly without acpi=off; if you still have thermal problems, let's look at these separately.
Putting NeedInfo on Maxim won't help, as it's clear he has given up - see Bug Activity.
In the test PC I used, I copied the very, very partial boot....
I think a PC's thermal shutdown qualifies for the Priority.
I think the title is representative of the bug... There have been bugs on this subject since 2008.
As I can get a Widget to monitor CPU temp, I wrote that this issue could be solved by the most unlikely type of actions.
Therefore surely the Kernel or CPU Power Ladder can be so much more sophisticated and eloquent in keeping the CPU from thermal shutdown.
I quoted the most important lines, showing that CPU thermal sensor monitoring was unreliable and hence was disabled. I gather this decision came from either the Kernel or the CPU Power ladder!
k10temp 0000:00:18.3: unreliable CPU thermal sensor; monitoring disabled
<6>[ 0.856600] cpuidle: using governor ladder
<6>[ 0.856601] cpuidle: using governor menu
I can run all my PC's with ACPI = off and this will not change.
If you are unable to test this on an AMD X_64 PC, please close as WONTFIX so we can all see that the creation of more bugs on this issue, which started in 2008, can be put to bed.
> I can run all my PC's with ACPI = off and this will not change.
Looks like you have some problems with it...
Some clarifications:
- This is not an AMD problem; the OEM/platform/BIOS vendor has to make sure the thermal management of the platform works
- This is mostly controlled via ACPI
- acpi=off is totally unsupported; even most IRQ routing information is passed via ACPI, and you are lucky to get recent machines booted at all without it.
> If you are unable to test this on an AMD X_64 PC - please close as WONTFIX
As said, this is vendor specific (if the problem persists without acpi=off). If you are not willing to test this or debug it without acpi=off, I have to close this bug as "invalid".
BTW: the cpuidle subsystem is totally unused on your quad core. It's about C-states, which on AMD were only implemented (via ACPI!) on Turion single cores.
I'll keep the bug open for a while for comments, but I tend to close it anyway and suggest opening new, clean bugs. The root cause of your problems (or most problems in this bug) seems to be acpi=off, which (nearly?) everybody used here. Of course you run into thermal management problems if you switch off thermal management.
Seems to me Thomas has hit the nail on the head - "Of course you run into thermal management problems if you switch off thermal management."
Scott, if I'm reading everything right, it sounds like you're asking for a way to monitor and manage CPU temperature without ACPI? If this is correct, I would suggest you open a feature request instead. Or enable ACPI on your PCs. If running with ACPI enabled is causing problems for you, report those.
Created attachment 405877 [details]
1 of 2 Images
NO! I run with no change to ACPI either in BIOS nor Boot Switch.
Looking at ANY AMD X_64 CPU you can see the boot log for yourself!
We currently have neither a CPU thermal ladder nor kernel throttling.
As such the CPU runs at max clock all the time and the GPU follows the CPU's lead. Given a warm ambient temperature, a full installation (reason above) can and will either invoke a BIOS overheat shut-down and/or completely cook both CPU and GPU.
The bugger in the middle of this is AMD's Cool 'n Quiet!
When enabled, the CPU always starts at the lowest clock speed and slowly upshifts its clock speed when under pressure. The reason this has a BIOS default of Disabled is that the upshift in clock speed takes forever to occur and makes the PC perform like a slug.
The BIOS default is thankfully DISABLED.
Cool 'N Quiet is AMD's BIOS control over CPU clock speed and it can and does slow the CPU/GPU clock when thermal sensors get far too hot!
The issue we have is:
We currently have neither a CPU thermal ladder nor kernel throttling of the CPU/GPU.
EVERY boot log of an AMD X_64 CPU will show that there is NO thermal monitoring because of ?? and that the kernel has no throttling because of ?
We have a Widget that can report thermal monitoring, so why can't our boot process monitor the CPU/GPU temperature and take action when required?
Attached are images that show that KDE cannot use Power Control and has No ability to use power control.
We need a CPU Thermal Ladder for AMD X_64 CPU's and the ability to slow its clock via either Kernel or Thermal Ladder or ???
Thanks for taking this seriously, everyone else in development gave up in 2008 and all the users who have reported any type of CPU being cooked have also given up...:-)
Created attachment 405878 [details]
2 of 2 Images
Just to clarify this bug, and sorry to keep repeating it, but under test and normal conditions I have ... ACPI IS ENABLED IN BOTH BIOS AND BOOTED WITHOUT ANY ACPI SWITCH AND WITH NO OTHER BIOS OR SOFTWARE SETTING THAT TURNS ACPI OFF! WHAT-SO-EVER!!!!
Puhh, quite some info mixed together. I won't go into too much detail, but I'll try to answer or explain some of the topics you mention:
GPU
---
Interesting topic. The GPU is the second biggest power waster. Depending on your graphics driver and HW, your GPU might do some power saving or not. On-board ATI graphics in particular can consume quite some energy. The latest open source KMS driver may do some power saving, but it might be worth double-checking with fglrx, which may be able to do better.
CPU
---
Throttling
..........
A technique which is much less efficient compared to cpufreq. Intel uses it in worst case to avoid thermal shutdowns. You never want to have this enabled if you have cpufreq scaling (aka Powernow!).
Powernow!
.........
Should always be enabled. Yes there were issues. But there were significant improvements over the past years. Namely:
- first userspace governor was used, checking cpu load every 333ms
- then ondemand (kernel) governor was used, but on AMD systems
checking of cpu load went up to about 1.2 seconds in worst case.
- With latest kernel on quad cores, every 10ms cpu load should get checked
in kernel.
- The very latest kernel counts IO wait time as CPU load time or "CPU is utilized
time" (this should be available on 11.4); it may give you a bit of improvement
on heavy disk work.
Performance loss is nearly zero. I doubt you find a workload to prove more than
2% of performance loss even if you try really hard to scale the CPU utilization up and down all the time.
Anyway, if possible, please switch powernow-k8 on.
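To check whether a cpufreq driver such as powernow-k8 is actually bound and which governor is in use, the cpufreq sysfs files can be read directly. The following is only a minimal sketch: the helper name `read_cpufreq` is made up for illustration, and it assumes the usual per-CPU layout (`cpufreq/scaling_driver`, `cpufreq/scaling_governor`, `cpufreq/scaling_available_frequencies`, with frequencies in kHz), which may differ between kernel versions and drivers.

```python
from pathlib import Path


def read_cpufreq(cpu_dir):
    """Return driver, governor and available frequencies for one CPU directory.

    cpu_dir is e.g. /sys/devices/system/cpu/cpu0 (assumed layout).
    Returns None when no cpufreq directory exists, i.e. no driver is bound.
    """
    cpufreq = Path(cpu_dir) / "cpufreq"
    if not cpufreq.is_dir():
        return None

    def read(name):
        # Individual files may be missing depending on the driver.
        entry = cpufreq / name
        return entry.read_text().strip() if entry.is_file() else None

    freqs = read("scaling_available_frequencies")
    return {
        "driver": read("scaling_driver"),
        "governor": read("scaling_governor"),
        # sysfs reports kHz; convert to MHz to match the dmesg output
        "available_mhz": [int(khz) // 1000 for khz in freqs.split()] if freqs else [],
    }
```

Calling `read_cpufreq("/sys/devices/system/cpu/cpu0")` on a machine where frequency scaling works should return a dictionary; `None` would match the "No cpufreq driver available" situation reported in this bug.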
> We need a CPU Thermal Ladder
You mix something up there. This has nothing to do with the cpuidle ladder governor. What you want to have is a passive trip point that can be set if the OEM/BIOS does not provide one.
Quick introduction to passive cooling:
Via thermal ACPI tables there are active (can be more than one), passive, hot, critical trip points which define a temperature.
active -> different fans or fan states are switched on
passive -> First try to limit cpufreq, if not available try throttling
critical -> shut the machine down
These are defined via ACPI and there were bugs, probably still are in BIOS or in kernel.
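The trip point scheme above can be sketched as a small decision function. This is purely illustrative and not how the kernel implements it: the function name, the action strings and the example temperatures are made up, and the "hot" trip point is left out because its action is not described here.

```python
def cooling_action(temp_c, trips):
    """Pick the action for the most severe trip point that temp_c has reached.

    trips maps a trip type ("active", "passive", "critical") to a temperature
    in degrees C. Types the BIOS does not export are simply absent.
    """
    # Checked from most to least severe.
    ladder = (
        ("critical", "shut the machine down"),
        ("passive", "limit cpufreq, else throttle"),
        ("active", "switch fan on"),
    )
    for kind, action in ladder:
        limit = trips.get(kind)
        if limit is not None and temp_c >= limit:
            return action
    return "nothing"


# Hypothetical BIOS that exports no passive trip point: between the fan
# threshold and the critical temperature nothing slows the CPU down.
no_passive = {"active": 50, "critical": 90}
with_passive = {"active": 50, "passive": 70, "critical": 90}
```

With these example values, `cooling_action(75, with_passive)` returns the cpufreq-limiting action, while `cooling_action(75, no_passive)` only keeps the fan on, so the next step up is a shutdown.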
Which trip points your BIOS exports to OS can be found here (deprecated in 11.4):
cat /proc/acpi/thermal_zone/THRM/trip_points
in recent kernels you have to gather this info from sysfs:
/sys/devices/virtual/thermal
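A rough sketch of gathering the trip points from sysfs follows. It assumes the usual layout below that directory (`thermal_zone*/trip_point_N_type` and `trip_point_N_temp`, with temperatures in millidegrees Celsius); the function name `read_trip_points` is made up, and the files present depend on the kernel version and on what the BIOS exports.

```python
from pathlib import Path


def read_trip_points(base="/sys/devices/virtual/thermal"):
    """Return {zone_name: [(trip_type, temp_in_deg_c), ...]} for every thermal zone."""
    zones = {}
    for zone in sorted(Path(base).glob("thermal_zone*")):
        trips = []
        for type_file in sorted(zone.glob("trip_point_*_type")):
            temp_file = zone / type_file.name.replace("_type", "_temp")
            try:
                trip_type = type_file.read_text().strip()
                millideg = int(temp_file.read_text())
            except (OSError, ValueError):
                # Unreadable or unset trip points are skipped.
                continue
            trips.append((trip_type, millideg / 1000.0))
        zones[zone.name] = trips
    return zones
```

An empty list for a zone, or no "passive" entry in it, would suggest that the BIOS does not export a passive trip point for that zone.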
I remember one bug I've seen in several BIOSes which supported dual core AMD CPUs and then were enhanced to support socket compatible quad core AMD cpus:
A passive trip point is connected to a CPU. But the CPU object's ACPI name got renamed, but they forgot to change it in the passive trip point definition.
I submitted a workaround to assign all CPUs (should always be intended) to this passive trip point if there is an error to reference the (wrong name/not existing) CPU.
This is a wild guess. I also expect you have different issues as this is platform/BIOS specific and there are several people in CC of this bug.
Please try to gather some more info. Enable PowerNow! and ACPI. If you still have issues, monitor (ACPI) temperature (exported in the paths I point to above), look at trip points, etc.
> What you want to have is a passive trip point
There were efforts to give userspace the possibility to set a passive trip point temperature, even if the OEM/BIOS did not export one. I very much liked that and also tried to make it happen. That's perfect, e.g., for getting a silent laptop (define the passive trip point below the active ones to avoid fan activity) or for working around thermal shutdown problems. In the end, some workarounds like that went in for specific machines in this area. But better to gather some more info first; this might not be needed.
Argh, I wanted to write:
> We need a CPU Thermal Ladder
What you mean is a passive trip point
Created attachment 405885 [details]
Self Titled
>GPU
>---
>Interesting topic. GPU is the second biggest power waster. Depending on your
>graphics driver and HW your GPU might do some powe
In any X_64 AMD CPU the CPU frequency is directly proportional to the GPU, either as a separate PCIE card or on board.
>CPU
>---
>Throttling
>..........
Is our own current way of preventing overheat damage to I586 CPUs, but cannot be applied to X64.
>Powernow!
>.........
>Should always be enabled. Yes there were issues. But there were significant
>improvements over the past years. Namely:.....
I don't care how much power I use or is wasted!
Here the issue is that the Power config is vanilla. I have not disabled or enabled anything to do with the CPU. I cannot do this, as it's greyed out and not accessible by default.
I have NO ability to affect the greyed-out settings in the pics I offered you.
There is no functional ability for me to do this on ANY X_64 AMD CPU!
> We need a CPU Thermal Ladder
The default vanilla boot log shows that the thermal ladder is not even available on X_64 AMD CPUs.
My Widget has NO trouble reporting the default sensors for CPU temp.
May I suggest you perform a test vanilla installation on any X_64 AMD CPU without touching ACPI and leaving the default, Cool 'n Quiet disabled.
Final thoughts: openSUSE, SLED, SLES and Novell are all very highly committed to providing direct support to AMD, and we advertise AMD in just about everything to do with these products. The amount of money that AMD throws our way for all the above products is enormous, and I would hate to suggest that ignoring a vendor like AMD be taken lightly whatsoever! AMD is a huge, huge money thrower at all of Novell's products, which also include openSUSE, and to ignore software support for them would amount to suicide!
Created attachment 405886 [details]
Self Titled
Created attachment 405887 [details]
Self Titled
Created attachment 405888 [details]
Self Titled
Created attachment 405889 [details]
Self Titled
Created attachment 405890 [details]
Self Titled
Created attachment 405891 [details]
Self Titled
Thomas, can you please insert 'Target Milestone for Fix' for this bug? I think everyone in the CC, as well as past inaction since 11.0, might bring a few people back to openSUSE. Many X64 users have long gone, since this bug has failed to be fixed since it was first reported in 11.0. After this gets fixed we might just be able to tempt users back to openSUSE AND SLES/SLED, because most X64 users deserted SUSE in droves!
The problem is that it's hard to identify your exact problem and that there seem to be several. In comment #5 and/or #8 there were pictures of an ACPI kernel segfault; unfortunately these are not available anymore. This is a critical issue and acpi=off is not a valid workaround; this must get fixed. It looks like some people worked around the bug with acpi=off, which led to thermal shutdowns.
Is your current complaint that you cannot configure power saving settings via KDE? Can you please install the cpufrequtils package and try to read out the available CPU frequencies?
Maybe http://lists.opensuse.org/opensuse-kernel/2011-01/msg00100.html is related?
FWIW, I have at least 4 multicore AMD x86-64 machines. Two are Phenom II X6 1055Ts, one is a dual-proc Opteron 6128, and one is a dual-proc dual-core AMD processor old enough that it doesn't have a name.
My desktop is one of the Phenom II X6's. I rebooted with Cool 'n' Quiet disabled and have been running 10-way kernel compiles in a loop for several hours now to try to force the CPU into thermal shutdown. Instead, the temperature has peaked at 48C in the CPU and 39C on the motherboard. The CPU and chassis fans are both running at around 3100 RPM. All cores are running at their specified max of 2.8GHz. This is on openSUSE Factory without the acpi-cpufreq module loaded.
> Maybe http://lists.opensuse.org/opensuse-kernel/2011-01/msg00100.html is
> related?
There was a bug that cpufreq drivers do not get loaded, because we got rid of hal. Instead there is /etc/init.d/cpufreq from pm-utils package which will load them.
This is a temporary workaround until the cpufreq drivers get autoloaded which will get implemented in the next kernel versions.
You may want to double check whether you were affected.
This problem is entirely reproducible. Its consequences are varied and severe, but the bottom line is that the thermal temp is ignored and there is no process in the operation of openSUSE that can throttle the CPU so it won't fry itself. Before the CPU does fry itself, we get thermal shutdowns from the motherboard and a variety of CPU errors, often evident on screen or otherwise.
There is NO advantage to turning ACPI off. I reported this on a vanilla KDE installation with vanilla updates of 11.3 on my
Processor (CPU): AMD Phenom(tm) 9950 Quad-Core Processor
Speed: 2,800.00 MHz
Cores: 4
This problem was also evident on all 5 different X_64 PCs I have networked and has been around since 11.1, and I really don't know why this has taken man-hours of discussion but NO action.
Either close as WONTFIX or continue to test your own test installation on an X_64 AMD processor and tell us all what you have observed, please.
RE: #50 Try a NEW installation and add File Server, Network Admin....etc., and try when the ambient temp is above 32C.
Those conditions are going to be tough to reproduce for Thomas and me, as we both live in the northern hemisphere in relatively temperate climates, even at the height of summer.
> there is no process in the operation of OpenSuse that can throttel the CPU so
> it wont fry itself.
This is wrong. OEMs have to pass passive and critical trip points. If the temperature exceeds the first, the CPU will get throttled; at the latter, the machine will be shut down. As said, this all depends on the BIOS, not on AMD and (as long as there is no bug) not on the OS. It also depends on your OEM providing fans which can transport away the heat, even if the CPUs are fully utilized for some hours, without the CPUs getting "fried" or the machine getting slowed down.
I agree it would be fine to have an ACPI-independent temperature control and throttle mechanism, but I could not convince Len Brown about it.
> evident on all 5 different X_64 PC's I have
Which problem?
- powernow-k8 driver is not loaded and cpufrequtils does not show you cpu frequency switching capabilities?
- Do the machines shut down (or hard switch off)?
- Or is it about KDE windows not showing cpu frequency capabilities?
Please attach acpidump and dmidecode of systems which shut down/switch off hard.
It matters not what the conditions are. It's a simple fact that the O/S has NO control over the CPU temp and is not able to throttle it down a few clicks. No matter your install, you will see in your boot log that there is NO CPU support....and no CPU Thermal Ladder able to assist. It's not that it will fail when X is present; it's simply that the O/S has no ability to keep the CPU cool by throttling it down a few clicks...none whatsoever...It is irrelevant to make the condition fail and shut down the PC. The problem is that there is NO provision from within openSUSE to look after this situation when present.
If you really want to reproduce it, overclock the CPU and GPU and perform an installation. Remember, the issue is that we have no control over the CPU to avoid thermal damage to both.
Again: This is wrong. If the OEM provides a passive trip point, the kernel will throttle the CPU if this temperature is reached. Please provide acpidump and dmidecode and I can check whether trip points are provided. If not, you have to complain to your OEM and/or you have to buy better fans.
As I said, this is 100% reproducible. Your boot and system logs should reveal that the CPU is NOT supported. My total of 6 x X_64 PCs have a variety of BIOS manufacturers.
Can you paste the log entry you're talking about? I don't see anything in my log that looks suspicious.
k10temp 0000:00:18.3: unreliable CPU thermal sensor; monitoring disabled
Do you want the whole boot log? - Before that, I would be interested if you could attach your boot log for me to compare...I think there is great advantage in doing this, even though all my PCs carry warnings of no support for the CPU:
Jan 5 12:52:19 MULTIVAC-010 rchal: CPU frequency scaling is not supported by your processor.
Jan 5 12:52:19 MULTIVAC-010 rchal: boot with 'CPUFREQ=no' in to avoid this warning.
Jan 5 12:52:19 MULTIVAC-010 rchal: Cannot load cpufreq governors - No cpufreq driver available
No, that's enough. I have nothing to paste because even with the k10temp module loaded, I don't get any log output from it.
The log message does help to pinpoint what the problem is, and the problem is that you're running into AMD Erratum #319.
Processors affected by erratum 319, which include Quad-Core Opterons, Six-Core Opterons, Embedded Opterons, Phenom Triple- and Quad-Core, Dual-Core Athlons, and the 2-, 3-, and 4-core versions of the Phenom II, have an unreliable temperature sensor.
None of my processors are affected by erratum 319 because they are either too old or too new. I expect that all of yours are affected, hence why you're able to 100% reproduce and I can't at all.
This isn't a matter of not having Linux support for the processor - it's that the CPU itself has a design flaw that exhibits itself on /some/ processors but not necessarily all of them. Since AMD has deemed it worthwhile to issue an erratum, it's worth looking into. Since there's no workaround except at the hardware level (or BIOS, I'm not sure. Thomas might know more), the right answer for the kernel is to not load the driver by default, because otherwise it would be returning bad data.
You can override the default and force it to load by loading the module with "force=1". You might try that and see if it helps.
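For anyone trying the override (for example `modprobe k10temp force=1`, run as root), a rough way to see what the sensor then reports is to read it back through the hwmon sysfs files. The sketch below is only an illustration, not something taken from the driver: the function name `read_k10temp_celsius` is made up here, and the assumption is that the driver exposes `name` and `temp1_input` either directly under `/sys/class/hwmon/hwmonN` or under its `device/` subdirectory, as older kernels do. On an erratum 319 part the values may still be wrong.

```python
from pathlib import Path


def read_k10temp_celsius(hwmon_root="/sys/class/hwmon"):
    """Return {hwmon_name: degrees_C} for every hwmon device named k10temp.

    hwmon_root is a parameter only so the function can be pointed at a
    test directory; the default is the usual sysfs location.
    """
    readings = {}
    for dev in sorted(Path(hwmon_root).glob("hwmon*")):
        # Assumption: attributes live in hwmonN/ or in hwmonN/device/.
        for base in (dev, dev / "device"):
            name_file = base / "name"
            temp_file = base / "temp1_input"
            if not (name_file.is_file() and temp_file.is_file()):
                continue
            if name_file.read_text().strip() != "k10temp":
                continue
            # hwmon reports temperatures in millidegrees Celsius.
            readings[dev.name] = int(temp_file.read_text().strip()) / 1000.0
            break
    return readings
```

Calling `read_k10temp_celsius()` with no arguments on a machine where the module loaded returns one entry per k10temp device; an empty dict means the driver did not bind.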
As for the lack of CPU frequency scaling, as Thomas has said, that's another issue that we'll need to examine.
FWIW, on my desktop system, it works. I see:
[ 80.802316] powernow-k8: Found 1 AMD Phenom(tm) II X6 1055T Processor (6 cpu cores) (version 2.20.00)
[ 80.802331] powernow-k8: Core Performance Boosting: on.
[ 80.802367] powernow-k8: 0 : pstate 0 (2800 MHz)
[ 80.802368] powernow-k8: 1 : pstate 1 (2200 MHz)
[ 80.802370] powernow-k8: 2 : pstate 2 (1500 MHz)
[ 80.802371] powernow-k8: 3 : pstate 3 (800 MHz)
On one of my servers, I see:
[ 16.163755] powernow-k8: Found 4 AMD Opteron(tm) Processor 6128 processors (16 cpu cores) (version 2.20.00)
[ 16.174797] powernow-k8: 0 : pstate 0 (2000 MHz)
[ 16.180279] powernow-k8: 1 : pstate 1 (1500 MHz)
[ 16.185772] powernow-k8: 2 : pstate 2 (1200 MHz)
[ 16.191263] powernow-k8: 3 : pstate 3 (1000 MHz)
[ 16.191264] powernow-k8: 4 : pstate 4 (800 MHz)
[ 16.194290] powernow-k8: 0 : pstate 0 (2000 MHz)
[ 16.194293] powernow-k8: 1 : pstate 1 (1500 MHz)
[ 16.194294] powernow-k8: 2 : pstate 2 (1200 MHz)
[ 16.194296] powernow-k8: 3 : pstate 3 (1000 MHz)
[ 16.194297] powernow-k8: 4 : pstate 4 (800 MHz)
... both are relatively idle and currently running at 800 MHz on all cores.
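To compare boot logs between a working and a non-working machine, it can help to pull the P-state table out of the powernow-k8 lines automatically. The following is a minimal sketch that assumes the log format is exactly as pasted above; the helper name `parse_pstates` and the `SAMPLE` text are illustrative only.

```python
import re

# Matches lines such as: "powernow-k8: 0 : pstate 0 (2800 MHz)"
PSTATE_RE = re.compile(
    r"powernow-k8:\s+(\d+)\s*:\s*pstate\s+(\d+)\s+\((\d+)\s*MHz\)"
)


def parse_pstates(log_text):
    """Return a sorted list of (pstate, MHz) pairs found in the log text.

    Repeated tables (one per processor) collapse into a single entry
    per P-state number.
    """
    found = {}
    for match in PSTATE_RE.finditer(log_text):
        found[int(match.group(2))] = int(match.group(3))
    return sorted(found.items())


# Sample copied from the desktop log quoted above.
SAMPLE = """\
[ 80.802331] powernow-k8: Core Performance Boosting: on.
[ 80.802367] powernow-k8: 0 : pstate 0 (2800 MHz)
[ 80.802368] powernow-k8: 1 : pstate 1 (2200 MHz)
[ 80.802370] powernow-k8: 2 : pstate 2 (1500 MHz)
[ 80.802371] powernow-k8: 3 : pstate 3 (800 MHz)
"""
```

An empty result for a given boot log means the driver never printed a P-state table, which matches the "No compatible ACPI _PSS objects found" case.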
When I disable Cool 'n' Quiet in the BIOS, I see:
powernow-k8: Found 1 AMD Phenom(tm) II X6 1055T Processor (6 cpu cores) (version 2.20.00)
powernow-k8: Core Performance Boosting: on.
[Firmware Bug]: powernow-k8: No compatible ACPI _PSS objects found.
[Firmware Bug]: powernow-k8: Try again with latest BIOS.
... and the module refuses to load and all cores were running at 2.8 GHz.
Are you running into the scaling issues with Cool 'n' Quiet enabled?
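One way to answer that question on a running system is to look at the cpufreq directory for each core in sysfs. The sketch below assumes the usual layout under `/sys/devices/system/cpu/cpuN/cpufreq/` with the `scaling_driver`, `scaling_governor` and `scaling_cur_freq` files; the function name `cpufreq_status` is hypothetical. A core with no `cpufreq` directory has no scaling driver bound, which is the situation the rchal messages describe.

```python
from pathlib import Path


def cpufreq_status(cpu_root="/sys/devices/system/cpu"):
    """Return {cpu_name: info-dict or None} for each cpuN directory.

    None means no cpufreq directory exists for that core, i.e. no
    frequency scaling driver is bound to it.
    """
    status = {}
    for cpu in sorted(Path(cpu_root).glob("cpu[0-9]*")):
        cpufreq = cpu / "cpufreq"
        if not cpufreq.is_dir():
            status[cpu.name] = None
            continue

        def read(name, base=cpufreq):
            path = base / name
            return path.read_text().strip() if path.is_file() else None

        cur_khz = read("scaling_cur_freq")  # reported in kHz
        status[cpu.name] = {
            "driver": read("scaling_driver"),
            "governor": read("scaling_governor"),
            "cur_mhz": int(cur_khz) // 1000 if cur_khz else None,
        }
    return status
```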
More info on Erratum 319:
From the code:
/*
* Erratum 319: The thermal sensor of Socket F/AM2+ processors
* may be unreliable.
*/
(this eventually prints the error message you've run into)
From "Revision Guide for AMD Family 10h Processors"[1], page 72:
319 Inaccurate Temperature Measurement
Description
The internal thermal sensor used for CurTmp (F3xA4[31:21]), hardware thermal control (HTC),
software thermal control (STC) thermal zone, and the sideband temperature sensor interface (SB-TSI)
may report inconsistent values.
For CPUID Fn0000_0001_EAX[7:4] (Model) 4 and higher, this temperature inconsistency will occur
only on AM2r2, Fr2, Fr5 and Fr6 package processors
Potential Effect on System
HTC, STC thermal zone, and SB-TSI do not provide reliable thermal protection. This does not affect
THERMTRIP or the use of the STC-active state through StcPstateLimit or StcPstateEn
(F3x68[30:28, 5]).
Suggested Workaround
None. Platforms that accept AM2r2, Fr2 (1207), Fr5 (1207) or Fr6 (1207) package processors should
be designed with conventional thermal control and throttling methods or utilize PROCHOT_L
functionality based on temperature measurements from an analog thermal diode
(THERMDA/THERMDC). These systems should not rely on the HTC features, STC thermal zone
features, or use SB-TSI.
When (((CPUID Fn8000_0001_EBX[PkgType, bits 31:28] == 1 (AM2r2 or AM3)) &&
(F2x[1, 0]94[Ddr3Mode, bit 8] == 0)) || (CPUID Fn8000_0001_EBX[31:28] == 0 (F (1207)))),
software should not modify HtcTmpLmt (F3x64[22:16]), utilize the value from CurTmp, or enable
any of the STC thermal zone features by setting StcThrottEn, StcApcTmpLoEn, StcApcTmpHiEn,
StcSbcTmpLoEn, or StcSpcTmpHiEn (F3x68[4,3:0]).
Fix Planned
Yes
[1] http://support.amd.com/us/Processor_TechDocs/41322.pdf
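The "software should not" condition quoted above can be written out as a small predicate, which makes it easier to see which package/memory combinations are affected. This is only a sketch of the revision guide's wording, not the kernel's actual check: the function and constant names are invented for illustration, and it takes a single F2x94 value although the guide refers to F2x[1, 0]94 (both DRAM controllers).

```python
# Package type values as given in the quoted condition.
PKGTYPE_F_1207 = 0         # Socket F (1207)
PKGTYPE_AM2R2_OR_AM3 = 1   # AM2r2 or AM3


def erratum_319_applies(cpuid_8000_0001_ebx, f2x94_dram_config_high):
    """Evaluate the quoted condition from raw register values.

    cpuid_8000_0001_ebx:    EBX from CPUID Fn8000_0001
    f2x94_dram_config_high: value of F2x94 (assumption: one DCT only)
    """
    pkg_type = (cpuid_8000_0001_ebx >> 28) & 0xF       # PkgType, bits 31:28
    ddr3_mode = (f2x94_dram_config_high >> 8) & 0x1    # Ddr3Mode, bit 8

    if pkg_type == PKGTYPE_F_1207:
        return True
    # AM2r2/AM3 package running without DDR3 is treated as affected.
    return pkg_type == PKGTYPE_AM2R2_OR_AM3 and ddr3_mode == 0
```

For example, a package type of 1 with the Ddr3Mode bit set evaluates to false, which corresponds to an AM3 system with DDR3 memory.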
(In reply to comment #50)
> My desktop is one of the Phenom II X6's. I rebooted with Cool 'n' quiet
> disabled and have been running 10-way kernel compiles in a loop for several
> hours now to try to force the CPU into thermal shutdown. Instead, the
> temperature has peaked at 48C in the CPU and 39C on the motherboard. The CPU
> and Chassis fans are both running around 3100 RPM. All cores are running at
> their specified max of 2.8GHz.
Jeff, for stress-testing a CPU or the cooling, try mprime (from http://www.mersenne.org/) - I'm pretty certain I was able to drive my quad-phenom to about 60-61C when I was testing a new motherboard.
I am getting more and more confused.., it would be great if you could provide requested data to double check.
I do not think that the issue has to do with the thermal erratum (not sure).
If it does, the subject of the bug is totally wrong: in that case the kernel should *not* look at the temperature sensor. Platform vendors should know about it and provide an alternate thermal sensor as described and export it via ACPI.
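Whether a platform actually exports a thermal zone with a passive trip point can be checked from user space. The sketch below assumes the generic thermal sysfs layout (`/sys/class/thermal/thermal_zoneN/trip_point_M_type` and `trip_point_M_temp`, temperatures in millidegrees Celsius); the helper names `list_trip_points` and `has_passive_trip` are made up for this example.

```python
from pathlib import Path


def list_trip_points(thermal_root="/sys/class/thermal"):
    """Return {zone_name: [(trip_type, degrees_C), ...]} for each zone."""
    zones = {}
    for zone in sorted(Path(thermal_root).glob("thermal_zone*")):
        trips = []
        for type_file in sorted(zone.glob("trip_point_*_type")):
            # trip_point_N_type pairs with trip_point_N_temp
            temp_file = zone / type_file.name.replace("_type", "_temp")
            if not temp_file.is_file():
                continue
            trips.append(
                (
                    type_file.read_text().strip(),
                    int(temp_file.read_text().strip()) / 1000.0,
                )
            )
        zones[zone.name] = trips
    return zones


def has_passive_trip(zones):
    """True if any zone lists a trip point of type 'passive'."""
    return any(
        kind == "passive" for trips in zones.values() for kind, _ in trips
    )
```

If `has_passive_trip(list_trip_points())` is false, the firmware has not given the kernel a temperature at which to start throttling, and only a critical trip point (if any) remains.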
If the powernow-k8 driver is not loading and CPU frequency scaling is not available, please check your BIOS settings and, ideally, update the BIOS first.
> I really dont know why this has taken man hours of discussion but NO action.
Ok, instead of further discussing..., could you please take the following actions:
- Make sure you do not use acpi=off but default boot params
- For all platforms where powernow-k8 driver does not load/work, please
update the BIOS and check BIOS settings for related options. The settings
may vary depending on the BIOS vendor, it may be included in the ACPI or
a power subsection. There may be options like "performance/power optimized"
or simply Cool'n Quiet on/off. You should be able to make powernow-k8
and dynamic cpu frequency scaling working by that.
- For each affected machine, collect dmesg, dmidecode, acpidump, and
/proc/cpuinfo
output. Tar the info of each machine up and attach it separately with a
short description if the symptoms are slightly different
(kernel initiates shutdown after about x minutes, or machine hangs hard,
or power is switched off hard by HW, etc.)
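The data collection requested above can be scripted so that every machine produces the same archive. This is a rough sketch under a few assumptions: `dmesg`, `dmidecode` and `acpidump` are installed and the script runs as root (the latter two normally need it), and the member file names and the `collect` helper are invented for illustration. A command that cannot be started is recorded in the archive rather than aborting the run.

```python
import io
import subprocess
import tarfile
import time

# Archive member name -> command to run. The tool names come from the
# request above; the member file names are arbitrary.
DEFAULT_COMMANDS = {
    "dmesg.txt": ["dmesg"],
    "dmidecode.txt": ["dmidecode"],
    "acpidump.txt": ["acpidump"],
    "cpuinfo.txt": ["cat", "/proc/cpuinfo"],
}


def collect(archive_path, commands=None):
    """Run each command and store its output in a gzip-compressed tar."""
    commands = DEFAULT_COMMANDS if commands is None else commands
    with tarfile.open(archive_path, "w:gz") as tar:
        for member, argv in commands.items():
            try:
                out = subprocess.run(
                    argv,
                    stdout=subprocess.PIPE,
                    stderr=subprocess.STDOUT,
                    check=False,
                ).stdout
            except OSError as exc:
                # Tool missing or not executable: note it in the archive.
                out = f"could not run {argv!r}: {exc}\n".encode()
            info = tarfile.TarInfo(member)
            info.size = len(out)
            info.mtime = int(time.time())
            tar.addfile(info, io.BytesIO(out))
    return archive_path
```

For instance, `collect("machine1-info.tar.gz")` writes one archive per machine, which can then be attached together with the short description of the symptoms.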
Currently I have nearly zero info. There are boot messages with acpi=off -> worthless, there are pointers to kernel oopses related to ACPI which is another problem, etc.
Firstly this problem is evident on many, many users' X_64 PCs - At last count there were some 35 users that have reported this issue, be it in the many past bug reports either closed as duplicates or past bugs closed as NO answer because they lost faith in this problem being resolved. I have kept a running tally, and the Original Reporter of this bug gave up and I assumed responsibility for the bug - See bug history.
2 users had completely cooked Video cards, as the CPU frequency is directly proportional to the GPU frequency. 1 user had their CPU completely cooked and unusable.
To say that this issue is confined to a small number of AMD Processors is totally false! Having to create a workaround for a Vanilla Installation with Default settings in BIOS is not a good enough reason to pass this bug off with a potential workaround. It's Vanilla, and I have stated this position from day 1, and currently we have no answer, so this remains a Major Bug that requires resolution.
I don't understand why you have limited information :- Please upload the following logs from your test PC from your /var/log directory. Every time you reboot your test PC the logs should show more than enough information. I would like to see a PC that has no problem, like your test one - Can you also copy your sysinfo:// in respect to your processor information.
install, boot.msg, boot.omsg, messages, warn
I run a vanilla installation and use BIOS defaults. I don't change anything - Cool N' Quiet's default is OFF - as such it is NOT acceptable to pass this bug off with an awkward workaround - With AMD's current financial commitment with Novell, Enterprise Suse and hence OpenSuse - we cannot afford to sit and do nothing about this problem. Forget about anything to do with added switches etc. This needs to work on Vanilla!
It's not a small number of processors, it's nearly anything with a socket F or AM2+. It's a bad situation to be sure. Scott, I'm not sure what you expect us to do here.
Your hardware is the problem and you're refusing to use any of the solutions. I really don't care if Cool 'n' Quiet defaults to off in your BIOS. It's wrong. You have processors with a known erratum that DIRECTLY RELATES to the problem you're reporting, you refuse to use a hardware technology that could alleviate the problem, and you refuse to use the suggested workaround just in case your hardware doesn't actually have faulty thermal sensors. The authors of that code thought the erratum made using it with unreliable data a dangerous operation, and that's why it's disabled by default. They wrote in an option to allow the user to override it if the user is sure they don't have faulty hardware or are looking to find out. I'm not about to second guess them based only on your handwaving and refusals. If there are others still getting emails on this report, might you be willing to try one of the workarounds? Otherwise we're not getting anywhere and we'll have to close this as WONTFIX.
Bug 557586 - AMD X_64 Multi-core CPU's have NO Thermal Protection from both Kernel or Power Management on a Vanilla Installation despite hardware settings
This has always been a software issue and never about hardware - The default hardware is inconsequential. I only included hardware information to demonstrate this issue is reproducible in spite of Hardware settings and with NO changes to software ACPI whatsoever!!!! The problem is that simple, yet we skirt around fixing this bug. It's not rocket science and never was - If you decide this problem is not worthy of fixing you can always close as WONTFIX, and that action will silence future bug reports, which have grown exponentially since they started to be reported in Bugzilla on X_64 CPUs since 11.0. It is abundantly clear from ANY log that the boot process warns that it is incapable of monitoring CPU temp and incapable of exerting any control over the CPU Clock.
It is also abundantly clear that this is an O/S issue and that BIOS Manufacturers and Revisions are irrelevant, but they are included to rule out any Hardware problem that contributes to this Bug. We only need to correct this bug on a Vanilla installation with Vanilla Switches and Vanilla Hardware on any AMD X_64 CPU! As we can achieve this on I586 CPUs and NOT X_64 CPUs, we expect any X_64 CPU to do the same - Problem is X_64 cannot do this! A search of the Bugzilla Database will confirm this. I have privately contacted every user from any closed or given-up Bug Report since 11.0 that repeats and confirms this same problem is universal. Without your test logs showing that you have no problems, I cannot see that this issue does not apply to your X_64 and then work out what the difference is! Thus I can only work on my 6 x X_64 PCs and every other user's logs which I have obtained.
Created attachment 409718 [details]
My logs - not broken.
Yes your logs are not broken, however you are running 16-core AMD Opteron processors with two gigabit Ethernet cards. The AMD Opteron™ 6000 Series Platform above features parallel processing, which makes it essentially able to handle both 32bit code and 64bit code in both O/S and Application - So of course it just shows up as having the O/S exerting perfect control: it reads the CPU temp and can throttle the CPU - Yes we have perfect control over any I586-32bit O/S - We know this http://blogs.amd.com/work/2011/01/12/best-cpu-of-the-year/
It's hardly the type of processor to test on. Its ability to run 32bit code means that the test subject is compromised by having both 32bit and 64bit parallel processing. How about trying a test PC that is vanilla X64, just like everyone who has an issue on this - Attaching log files as you have done confuses and can never reflect the BUG. I don't know of many SMBs or users that can afford an AMD Opteron™ 6000 Series PC. - These log details are of no value whatsoever. I thought you told me you had done a test install on a test X64 PC. Attaching the log files from your main production server is at best irrelevant and could be viewed as stalling for time. Where are the logs of the test X64 vanilla installation you said you had performed with no real issue? Can you just upload the logs from the test-offline X64 PC please
Vanilla QA - Please advise why you have failed to manage this problem and not intervened in the seemingly wasted man hours of obstructional rhetoric with neither resolve nor concern?
?!?
> .. why you have failed ..
I wonder what you try to achieve with that comment, you won't get your problem addressed with it.
I run dozens of AMD x86 machines here without seeing thermal shutdowns.
You suggested to stop discussing and get things fixed, so would you please follow my instructions so that I can work on this issue:
- Make sure you do not use acpi=off but default boot params
- For all platforms where powernow-k8 driver does not load/work, please
update the BIOS and check BIOS settings for related options. The settings
may vary depending on the BIOS vendor, it may be included in the ACPI or
a power subsection. There may be options like "performance/power
optimized" or simply Cool'n Quiet on/off. You should be able to make
powernow-k8 and dynamic cpu frequency scaling working by that.
- For each affected machine, collect dmesg, dmidecode, acpidump, and
/proc/cpuinfo
output. Tar the info of each machine up and attach it separately with a
short description if the symptoms are slightly different
(kernel initiates shutdown after about x minutes, or machine hangs hard,
or power is switched off hard by HW, etc.)
I'll have to close the bug "resolved noresponse" if you don't/cannot provide the requested info, and will do so in a few days.
Created attachment 409817 [details]
My logs from Phenom II X6 1055T - Also not broken
>I wonder what you try to achieve with that comment, you won't get your problem
>addressed with it.
It's our Problem not mine - This is a Suse Linux problem to solve - I'm just the messenger - I have taken a huge number of steps to remedy this, as the Vanilla Installation fails and tells us it has failed
The Bug as reported is very simple and clear - The O/S has NO ability to monitor Temp and rejects the O/S ability to modify clock speed and states this in its own logs!
RE: QA - I now know that suse.de does not operate a Quality System - We have the electronic means of logging bugs in our 'Problem Management System' but do nothing about its functional ability.
For the good of the project I think the QA should attend the next Quality Assurance Meeting, which occurs every year in Brussels. Forgive me - I was under the impression that suse.de ran a Quality System
The ISO (International Standards) are created / modified / enhanced by the yearly meetings in Brussels. The suse.de Project falls short in following any such QA Procedure. The suse.de project should follow both ISO90001 and ISO90002, from memory.
I have already provided every bit of log data you need, and having viewed the log results from your Online AMD 6000 Platform that runs SLES, its logs are irrelevant as I stated above.
It is clear you have not even attempted to perform a Vanilla Installation on an AMD 64-bit PC in order to reproduce the problem logs.
I capitulated due to lack of confidence! Development and bug fixes seem to have the lowest priorities.
The ethos of hoping the next version of KDE will fix most of the bug reports logged against the previous release of OpenSuse is very interesting!
This is the first time I have ever seen a huge project that does not conform to any Quality Process.
I have now given up on ever having this issue dealt with, as nothing has been done to solve it despite my log data and my performing a Vanilla Installation on my own offline X_64 PC.
The log data you have provided is irrelevant due to its amazing hardware ability.
The Title of the Bug remains at
AMD X_64 Multi-core CPU's have NO Thermal Protection from both Kernel or Power Manage!
|
Created attachment 328842 [details]
dmesg
User-Agent: Opera/9.80 (X11; Linux x86_64; U; ru) Presto/2.2.15 Version/10.01
Start installing the system from DVD. After "Loading Linux kernel" in the boot menu I get a blank black screen with the CapsLock + ScrollLock LEDs on. Passing acpi=off as a parameter in the installer boot menu works, but ACPI does not work on the installed system. If I remove acpi=off from the grub config line I get the same blank black screen on boot. It's not a hardware problem; any other OS, including openSUSE 11.1 x86_64, works fine. dmesg attached.
Reproducible: Always
Steps to Reproduce:
1. just boot