Bug 127555 - powernow-k8 mysterious oops in write_new_fid
Summary: powernow-k8 mysterious oops in write_new_fid
Status: VERIFIED INVALID
Alias: None
Product: SUSE LINUX 10.0
Classification: openSUSE
Component: Kernel
Version: RC 4
Hardware: x86-64 SUSE Other
Priority: P5 - None  Severity: Major
Target Milestone: ---
Assignee: Joachim Deguara
QA Contact: E-mail List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-10-11 13:53 UTC by Glen Christensen
Modified: 2008-07-17 13:29 UTC (History)
CC List: 5 users

See Also:
Found By: System Test
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
Tarball of /var/log/YaST2 (199.21 KB, application/x-gzip)
2005-10-11 13:54 UTC, Glen Christensen
zipped /var/log/messages file (8.22 KB, application/x-gzip)
2005-10-11 13:55 UTC, Glen Christensen
boot.msg file as requested (24.19 KB, application/octet-stream)
2005-10-11 15:39 UTC, Glen Christensen
hwinfo output attached (146.63 KB, application/octet-stream)
2005-10-12 16:58 UTC, Glen Christensen
new boot.msg file as per request (24.24 KB, application/octet-stream)
2005-10-12 17:43 UTC, Glen Christensen
lsmod output (2.16 KB, application/octet-stream)
2005-10-12 19:43 UTC, Glen Christensen

Description Glen Christensen 2005-10-11 13:53:44 UTC
Running SuSE Linux 10.0 RC4. After a clean install, the OS freezes up when
prompted for login.  I have installed four separate times, with the same result
each time.  I am dual booting with NLD and I am able to mount the SuSE 10
partition and modify files.  Thus I have booted with run level 2, 3 and 5. 
Whether booting with X or just in text mode the machine locks up just after the
prompt for login.  I have also duplicated this on two separate boxes, both are
Bridge Technologies AMD 64.
Comment 1 Glen Christensen 2005-10-11 13:54:31 UTC
Created attachment 53631
Tarball of /var/log/YaST2
Comment 2 Glen Christensen 2005-10-11 13:55:56 UTC
Created attachment 53632
zipped /var/log/messages file
Comment 3 Dr. Werner Fink 2005-10-11 13:59:26 UTC
Q: How is the summary connected to the text in the description?
Please would you attach the file /var/log/boot.msg, thanks.
Comment 4 Danny Al-Gaaf 2005-10-11 14:15:28 UTC
The messages in /var/log/messages related to submountd are no problem if there 
is no floppy in the floppydrive and should have nothing to do with the reported 
problem.
Comment 5 Dr. Werner Fink 2005-10-11 14:23:08 UTC
This is what I've also suspected; IMHO it is purely coincidental.
Comment 6 Glen Christensen 2005-10-11 15:39:56 UTC
Created attachment 53648
boot.msg file as requested
Comment 7 Glen Christensen 2005-10-11 15:42:15 UTC
I copy and pasted the wrong text in the original summary and have changed it to
reflect the actual problem.
Comment 8 Dr. Werner Fink 2005-10-11 16:14:44 UTC
Q: Is it possible to use a remote system to login over network with ssh?
Comment 9 Glen Christensen 2005-10-11 16:23:43 UTC
That will work.  The box is currently booted to NLD.  You can ssh to
151.155.207.117  root password is novell.  Already mounted is: /dev/hda3 /suse10.
Comment 10 Dr. Werner Fink 2005-10-11 16:31:37 UTC
Ahh .. OK, this looks like X11 Window System bug maybe
the driver of the graphic card has a problem.
Comment 11 Stefan Dirsch 2005-10-11 16:40:57 UTC
According to the logfiles below /suse10 there aren't any X11 related problems.
Could you do the following. Boot into runlevel 3, login and simply start "X".
Does this already crash your machine?
Comment 12 Glen Christensen 2005-10-11 18:06:40 UTC
My machine crashes when I get the login prompt on run level 3.  So I am unable
to start "X"
Comment 13 Stefan Dirsch 2005-10-12 05:56:33 UTC
Then it cannot be an X.Org related problem. Assigning back to the maintainer 
of the component. 
Comment 14 Dr. Werner Fink 2005-10-12 09:32:02 UTC
Hmmm ... then all virtual consoles seem to cause crashes.

IMHO this is a hardware problem.  Even if it works with
older versions, the new gcc produces binaries and a kernel
which seem to be too much stress for your systems.

Please check your CMOS setup and enable `safe settings';
if this does not help you should update your BIOS.

On the other hand, it could be insufficient support for
the Bridge Technologies AMD 64 from the kernel's side.
Therefore I reassign it to the kernel component maintainer.

Hubert? Do we have an AMD 64 expert around to have a look
at this problem? I've also found an oops in the
/var/log/messages in attachment with id=53632. 
Comment 15 Hubert Mantel 2005-10-12 14:29:33 UTC
Maybe Andi has an idea...
Comment 16 Andreas Kleen 2005-10-12 14:39:56 UTC
It oopses in the powernow-k8 driver. Mark, known problem? 

Reporter: First try a BIOS update. If that doesn't help, you can work around it by
editing /etc/sysconfig/powersave/cpufreq
and replacing CPUFREQD_MODULE="" with CPUFREQD_MODULE="off". That should
make the machine work.

Oct  5 16:28:32 linux kernel: powernow-k8: Found 1 AMD Athlon 64 / Opteron
processors (version 1.50.3)
Oct  5 16:28:32 linux kernel: powernow-k8:    0 : fid 0x10 (2400 MHz), vid 0x2
(1500 mV)
Oct  5 16:28:32 linux kernel: powernow-k8:    1 : fid 0xe (2200 MHz), vid 0x6
(1400 mV)
Oct  5 16:28:32 linux kernel: powernow-k8:    2 : fid 0xc (2000 MHz), vid 0xa
(1300 mV)
Oct  5 16:28:32 linux kernel: powernow-k8:    3 : fid 0xa (1800 MHz), vid 0xe
(1200 mV)
Oct  5 16:28:32 linux kernel: powernow-k8:    4 : fid 0x2 (1000 MHz), vid 0x12
(1100 mV)
Oct  5 16:28:32 linux kernel: cpu_init done, current fid 0x10, vid 0x2
Oct  5 16:28:32 linux rcpowersaved: enter 'powernow_k8' into CPUFREQD_MODULE in
/etc/sysconfig/powersave/cpufreq.
Oct  5 16:28:32 linux rcpowersaved: this will speed up starting powersaved and
avoid unnecessary warnings in syslog.
Oct  5 16:28:33 linux kernel: Unable to handle kernel NULL pointer dereference
at 0000000000000012 RIP: 
Oct  5 16:28:33 linux kernel: <ffffffff883721a2>{:powernow_k8:write_new_vid+66}
Oct  5 16:28:33 linux kernel: PGD 31ce2067 PUD 31ce3067 PMD 0 
Oct  5 16:28:33 linux kernel: Oops: 0002 [1] 
Oct  5 16:28:33 linux kernel: CPU 0 
Oct  5 16:28:33 linux kernel: Modules linked in: cpufreq_ondemand
cpufreq_userspace cpufreq_powersave powernow_k8 freq_table edd snd_pcm_oss
snd_mixer_oss snd_seq snd_seq_device snd_intel8x0 snd_ac97_codec snd_ac97_bus
snd_pcm snd_timer snd soundcore snd_page_alloc lp parport_pc parport af_packet
joydev sg st sr_mod ipv6 nls_utf8 hfsplus vfat fat subfs button battery ac
floppy sk98lin ohci1394 ieee1394 skge generic e1000 shpchp pci_hotplug
i2c_nforce2 i2c_core ehci_hcd ohci_hcd usbcore dm_mod reiserfs fan thermal
processor sata_sil it821x ide_cd cdrom sata_nv libata amd74xx sd_mod scsi_mod
ide_disk ide_core
Oct  5 16:28:33 linux kernel: Pid: 3, comm: events/0 Tainted: G     U
2.6.13-15-default
Oct  5 16:28:33 linux kernel: RIP: 0010:[<ffffffff883721a2>]
<ffffffff883721a2>{:powernow_k8:write_new_vid+66}
Oct  5 16:28:33 linux kernel: RSP: 0018:ffff81007fde5db8  EFLAGS: 00010246
Oct  5 16:28:33 linux kernel: RAX: 0000000000000012 RBX: 0000000000000002 RCX:
00000000c0010042
Oct  5 16:28:33 linux kernel: RDX: 0000000000000002 RSI: 0000000000000012 RDI:
ffff81002fdf9380
Oct  5 16:28:33 linux kernel: RBP: ffff81002fdf9380 R08: ffffffff80419410 R09:
0000000000000000
Oct  5 16:28:33 linux kernel: R10: 0000000000000002 R11: 00000000ffffffff R12:
0000000000000012
Oct  5 16:28:33 linux kernel: R13: 0000000000000002 R14: 0000000000000012 R15:
ffff81006372aa00
Oct  5 16:28:33 linux kernel: FS:  0000000040200960(0000)
GS:ffffffff8049b800(0000) knlGS:00000000569a2780
Oct  5 16:28:33 linux kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Oct  5 16:28:33 linux kernel: CR2: 0000000000000012 CR3: 0000000031ce1000 CR4:
00000000000006e0
Oct  5 16:28:33 linux kernel: Process events/0 (pid: 3, threadinfo
ffff81007fde4000, task ffff81007ffa06f0)
Oct  5 16:28:33 linux kernel: Stack: ffff81002fdf9380 0000000000000002
0000000000000002 ffffffff883728f0 
Oct  5 16:28:33 linux kernel:        ffffffff80422d70 0000000000000002
000000024bd48398 ffffffff00000010 
Oct  5 16:28:33 linux kernel:        00249f0000000000 00000000000f4240 
Oct  5 16:28:33 linux kernel: Call
Trace:<ffffffff883728f0>{:powernow_k8:powernowk8_target+1008}
Oct  5 16:28:33 linux kernel:       
<ffffffff883796d9>{:cpufreq_ondemand:do_dbs_timer+409}
Oct  5 16:28:33 linux kernel:        <ffffffff801436af>{worker_thread+415}
<ffffffff801301c0>{default_wake_function+0}
Oct  5 16:28:33 linux kernel:        <ffffffff80143510>{worker_thread+0}
<ffffffff80143510>{worker_thread+0}
Oct  5 16:28:33 linux kernel:        <ffffffff8014779d>{kthread+205}
<ffffffff8010f392>{child_rip+8}
Oct  5 16:28:33 linux kernel:        <ffffffff80232160>{dummycon_dummy+0}
<ffffffff801476d0>{kthread+0}
Oct  5 16:28:33 linux kernel:        <ffffffff8010f38a>{child_rip+0} 
Oct  5 16:28:33 linux kernel: 
Oct  5 16:28:33 linux kernel: Code: b9 41 00 01 c0 ba 01 00 00 00 c1 e0 08 09 d8
0d 00 00 01 00 
Oct  5 16:28:33 linux kernel: RIP
<ffffffff883721a2>{:powernow_k8:write_new_vid+66} RSP <ffff81007fde5db8>
Oct  5 16:28:33 linux kernel: CR2: 0000000000000012
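Editorial note on the P-state table at the top of the log above: the five fid/vid rows are consistent with a simple linear mapping. The sketch below checks that. The two formulas are an assumption inferred from the printed rows (they reproduce every row exactly), not a quotation of the driver source, and the function names are hypothetical.

```python
# Assumed mappings, inferred from the five table rows in the log above.
def fid_to_mhz(fid):
    """Frequency in MHz for a frequency ID (assumption: 800 MHz base, 100 MHz steps)."""
    return 800 + 100 * fid


def vid_to_mv(vid):
    """Voltage in mV for a voltage ID (assumption: 1550 mV base, 25 mV steps down)."""
    return 1550 - 25 * vid


# (fid, vid, MHz, mV) exactly as printed by powernow-k8 in the log.
LOGGED_PSTATES = [
    (0x10, 0x02, 2400, 1500),
    (0x0E, 0x06, 2200, 1400),
    (0x0C, 0x0A, 2000, 1300),
    (0x0A, 0x0E, 1800, 1200),
    (0x02, 0x12, 1000, 1100),
]

# True only if every logged row matches the assumed formulas.
table_is_consistent = all(
    fid_to_mhz(fid) == mhz and vid_to_mv(vid) == mv
    for fid, vid, mhz, mv in LOGGED_PSTATES
)
print(table_is_consistent)
```

Note that the lowest P-state uses vid 0x12, the same value that shows up as the faulting address in the oops.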
Comment 17 Glen Christensen 2005-10-12 16:05:52 UTC
This seems to have worked around the problem: replacing CPUFREQD_MODULE="" with
CPUFREQD_MODULE="off"

What is the downside of having this set to off?
Comment 18 Mark Langsdorf 2005-10-12 16:20:58 UTC
Disabling CPUFREQ turns off dynamic frequency control.  The processor will no 
longer reduce frequency when it's idle, and will thus draw more power and run 
the fans.  It's not a huge thing, but it's something that could be avoided.

Andi, the error report doesn't make any sense.  write_new_fid() only has one 
pointer, a struct powernow_k8_data *data, and is only called from 
core_frequency_transition().  Neither function modifies the pointer.  If data 
were NULL, it should have crashed in a higher level function.

What hardware is this occurring on?
Comment 19 Andreas Kleen 2005-10-12 16:32:34 UTC
The disassembled code is

   0:   b9 41 00 01 c0          mov    $0xc0010041,%ecx
   5:   ba 01 00 00 00          mov    $0x1,%edx
   a:   c1 e0 08                shl    $0x8,%eax
   d:   09 d8                   or     %ebx,%eax

which looks like shortly before the wrmsr write. But yes 
it's impossible, this code cannot reference address 12. 
Looks very dubious. Maybe the CPU gets confused?
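
Editorial note: the disassembly above can be reproduced from the "Code:" line of the oops with a few lines of Python. This is a minimal sketch, not a general x86 disassembler: it recognises only the five encodings that occur in this particular byte string, and the names `CODE` and `decode` are hypothetical.

```python
import struct

# Bytes copied from the "Code:" line of the oops.
CODE = "b9 41 00 01 c0 ba 01 00 00 00 c1 e0 08 09 d8 0d 00 00 01 00"


def decode(code_hex):
    """Decode the handful of instruction encodings present in CODE."""
    data = bytes.fromhex(code_hex)
    out = []
    i = 0
    while i < len(data):
        op = data[i]
        if op in (0xB9, 0xBA):  # mov imm32 into %ecx (b9) or %edx (ba)
            (imm,) = struct.unpack_from("<I", data, i + 1)
            reg = "%ecx" if op == 0xB9 else "%edx"
            out.append("mov    $%#x,%s" % (imm, reg))
            i += 5
        elif data[i:i + 2] == b"\xc1\xe0":  # shl imm8,%eax
            out.append("shl    $%#x,%%eax" % data[i + 2])
            i += 3
        elif data[i:i + 2] == b"\x09\xd8":  # or %ebx,%eax
            out.append("or     %ebx,%eax")
            i += 2
        elif op == 0x0D:  # or imm32,%eax
            (imm,) = struct.unpack_from("<I", data, i + 1)
            out.append("or     $%#x,%%eax" % imm)
            i += 5
        else:
            raise ValueError("unhandled opcode byte %#x at offset %d" % (op, i))
    return out


for line in decode(CODE):
    print(line)
```

All of the decoded instructions are register/immediate operations; none has a memory operand, which is the basis for the remark that this code cannot reference address 12. Loading 0xc0010041 into %ecx is what one would expect just before a wrmsr, since wrmsr takes the MSR index in %ecx.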


Comment 20 Mark Langsdorf 2005-10-12 16:51:16 UTC
Oct  5 16:28:33 linux kernel: Unable to handle kernel NULL pointer dereference
at 0000000000000012 RIP: 
Oct  5 16:28:33 linux kernel: <ffffffff883721a2>{:powernow_k8:write_new_vid+66}

It looks to me like the error is occurring at 1c6:
1c6:    8b 55 28               mov   0x29(%rbp),%edx

which is at least a pointer dereference: it looks like line 216 of powernow-k8.c,
the "if (savefid != data->currfid) {" following the call to 
query_current_values_with_pending_wait().

It still makes no sense - data->currfid has already been dereferenced in the 
function once, and nothing alters the value of data.  

If the processor is getting confused, I can't see how.  This is tested code 
that hasn't changed through the life of powernow-k8.

Again, what hardware is this running on?
Comment 21 Glen Christensen 2005-10-12 16:58:16 UTC
Created attachment 53857
hwinfo output attached
Comment 22 Andreas Kleen 2005-10-12 17:09:52 UTC
I disassembled the Code: line in the oops. It's normally accurate.

Your 0x29(%rbp) doesn't match the value of RBP (0xffff81002fdf9380)
and the resulting address (12)
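
Editorial note: the mismatch can be checked with the register values from the oops. The sketch below is plain arithmetic on numbers copied from the register dump. The variable names are hypothetical, and the list of registers that happen to hold the faulting value is an observation, not a diagnosis.

```python
# Values copied from the register dump in the oops.
REGS = {
    "RAX": 0x0000000000000012, "RBX": 0x0000000000000002,
    "RCX": 0x00000000C0010042, "RDX": 0x0000000000000002,
    "RSI": 0x0000000000000012, "RDI": 0xFFFF81002FDF9380,
    "RBP": 0xFFFF81002FDF9380, "R08": 0xFFFFFFFF80419410,
    "R09": 0x0000000000000000, "R10": 0x0000000000000002,
    "R11": 0x00000000FFFFFFFF, "R12": 0x0000000000000012,
    "R13": 0x0000000000000002, "R14": 0x0000000000000012,
    "R15": 0xFFFF81006372AA00,
}
CR2 = 0x12  # faulting address reported by the kernel

# Address an access through 0x29(%rbp) would have touched (64-bit wraparound).
effective = (REGS["RBP"] + 0x29) & 0xFFFFFFFFFFFFFFFF
print(hex(effective), effective == CR2)

# Registers whose value equals the faulting address.
holding_cr2 = sorted(name for name, value in REGS.items() if value == CR2)
print(holding_cr2)
```

An access through 0x29(%rbp) would have landed in kernel address space, nowhere near 0x12. The value 0x12 does appear in RAX, RSI, R12 and R14, which suggests the fault address was a data value rather than a corrupted struct pointer.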
Comment 23 Glen Christensen 2005-10-12 17:16:48 UTC
I spoke too soon on the workaround.  CPUFREQD_MODULE="off" only worked for about
2-3 minutes and then the machine froze again.
Comment 24 Andreas Kleen 2005-10-12 17:26:24 UTC
Hmm, can you double-check in lsmod before the hang that the powernow-k8 module
really is not loaded? 

Also, while you can still log in remotely, attach the boot.msg again.
Comment 25 Glen Christensen 2005-10-12 17:43:02 UTC
Created attachment 53860
new boot.msg file as per request
Comment 26 Andreas Kleen 2005-10-12 18:12:09 UTC
There is no oops in there, so it must be something else.

BTW you misedited cpufreq:

/etc/sysconfig/powersave/cpufreq: line 110: unexpected EOF while looking for matching `"'
/etc/sysconfig/powersave/cpufreq: line 111: syntax error: unexpected end of fil
Comment 27 Glen Christensen 2005-10-12 19:43:37 UTC
Okay, modified the syntax of cpufreq and I am running for now.  lsmod output
attached, no evidence of powernow-k8 module
Comment 28 Glen Christensen 2005-10-12 19:43:59 UTC
Created attachment 53872
lsmod output
Comment 29 Bodo Bauer 2006-02-08 10:54:52 UTC
Adding David and Jacob as Mark is on sabbatical
Comment 30 Dave Keck 2006-02-28 15:45:40 UTC
This bug has been around since last October.  Is it still something that needs looking into???
Comment 31 Thomas Renninger 2006-02-28 15:56:05 UTC
Not sure, maybe I can now reproduce this with a new machine, or at least I think I got something similar (freezing with cpufreq). Need to evaluate...
Comment 32 Thomas Renninger 2006-03-29 12:52:31 UTC
Please reopen if you still see this with latest 10.1/SLES10 products.
Comment 33 Glen Christensen 2006-03-29 16:21:57 UTC
I still see this, but a work around is to set the CPUFREQD_MODULE="off" in /etc/sysconfig/powersave/cpufreq
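
Editorial note: the workaround is a one-line change to a shell-style config file, and comment 26 shows how easily an unbalanced quote breaks it. The sketch below applies the change to the file's text and runs a crude quote check. Both function names are hypothetical, and the quote check is an assumption-laden simplification (it ignores escaped quotes and quotes inside comments), not a shell parser.

```python
import re


def set_cpufreqd_module(config_text, value="off"):
    """Return config_text with the CPUFREQD_MODULE assignment set to value."""
    new_line = 'CPUFREQD_MODULE="%s"' % value
    updated, count = re.subn(
        r"(?m)^CPUFREQD_MODULE=.*$", lambda m: new_line, config_text
    )
    if count == 0:
        # No existing assignment found: append one instead.
        updated = config_text.rstrip("\n") + "\n" + new_line + "\n"
    return updated


def has_balanced_quotes(config_text):
    """Crude check: every line contains an even number of double quotes."""
    return all(line.count('"') % 2 == 0 for line in config_text.splitlines())


sample = 'FOO="bar"\nCPUFREQD_MODULE=""\nBAZ="1"\n'
patched = set_cpufreqd_module(sample)
print(patched)
print(has_balanced_quotes(patched))
```

To use it on a real system one would read /etc/sysconfig/powersave/cpufreq, keep a backup copy, and write the patched text back only if the quote check passes.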
Comment 34 Andreas Kleen 2006-03-29 18:28:59 UTC
With what kernel do you see it? Can you try the latest beta?
Comment 36 Dave Keck 2006-04-18 11:33:12 UTC
Assigning to Mark Langsdorf.
Comment 37 Andreas Jaeger 2007-04-25 20:09:23 UTC
This bug has been in state NEW for over a year.  Joachim, please advise what to do.
Comment 38 Joachim Deguara 2007-04-26 07:58:10 UTC
Glen, can you test with a release version, perhaps something newer like openSUSE 10.2?  I would also check if you can update the BIOS, as the version you have from hwinfo says it's from 2004.

If nothing happens on this bug in two weeks then I would just mark it WONTFIX or INVALID.

side note: Mark or Thomas, looks like he was using ondemand governor, is that fine with this processor?
Comment 39 Thomas Renninger 2007-04-26 08:27:42 UTC
> ondemand governor, is that fine with this processor?:
Grepped out of hwinfo:
       "Socket 754"
       "AMD"
       "AMD Athlon(tm) 64 Processor 3400+"

Definitely must work. We have a lot of them. Some boards were upgraded and needed a BIOS update to get cpufreq working, or possibly had cpufreq not working because of the BIOS. This is/was one of the most common AMD processors. I am pretty sure it has been supported by ondemand for a long time; I got some reports with dual cores, and that was also some time ago...

IMO we should close this one (just doing this now), could be some weird HW (possibly BIOS) defect related to cpufreq?

Maybe simply exchanging the processor helps?

I have the strong feeling that if you buy another machine of the same type it's working...
Comment 40 Glen Christensen 2007-04-26 13:51:21 UTC
Yes, a bios update resolves the issue.