Bugzilla – Bug 127555
powernow-k8 mysterious oops in write_new_fid
Last modified: 2008-07-17 13:29:32 UTC
Running SuSE Linux 10.0 RC4. After a clean install, the OS freezes when it prompts for login. I have installed four separate times, with the same result each time. I am dual booting with NLD and am able to mount the SuSE 10 partition and modify files, so I have booted into runlevels 2, 3 and 5. Whether booting with X or just in text mode, the machine locks up just after the login prompt appears. I have also reproduced this on two separate boxes; both are Bridge Technologies AMD 64.
Created attachment 53631 [details] Tarball of /var/log/YaST2
Created attachment 53632 [details] zipped /var/log/messages file
Q: How is the summary connected to the text in the description? Would you please attach the file /var/log/boot.msg? Thanks.
The messages in /var/log/messages related to submountd are no problem if there is no floppy in the floppy drive, and should have nothing to do with the reported problem.
This is what I've also suspected; IMHO it is simply pure coincidence.
Created attachment 53648 [details] boot.msg file as requested
I copied and pasted the wrong text into the original summary and have changed it to reflect the actual problem.
Q: Is it possible to use a remote system to login over network with ssh?
That will work. The box is currently booted to NLD. You can ssh to 151.155.207.117 root password is novell. Already mounted is: /dev/hda3 /suse10.
Ahh .. OK, this looks like an X11 Window System bug; maybe the driver of the graphics card has a problem.
According to the logfiles below /suse10 there aren't any X11 related problems. Could you do the following. Boot into runlevel 3, login and simply start "X". Does this already crash your machine?
My machine crashes when I get the login prompt in runlevel 3, so I am unable to start "X".
Then it cannot be an X.Org related problem. Assigning back to the maintainer of the component.
Hmmm ... then all virtual consoles seem to cause crashes. IMHO this is a hardware problem. Even if it works with older versions, the new gcc produces binaries and a kernel which seem to be too much stress for your systems. Please check your CMOS setup and enable `safe settings'; if this does not help, you should update your BIOS. On the other hand, it could be insufficient support for the Bridge Technologies AMD 64 on the kernel side. Therefore I reassign it to the kernel component maintainer. Hubert? Do we have an AMD 64 expert around to have a look at this problem? I've also found an oops in the /var/log/messages in the attachment with id=53632.
Maybe Andi has an idea...
It oopses in the powernow-k8 driver. Mark, known problem? Reporter: First try a BIOS update. If that doesn't help, you can work around it by editing /etc/sysconfig/powersave/cpufreq and replacing CPUFREQD_MODULE="" with CPUFREQ_MODULE="off". That should make the machine work.

Oct 5 16:28:32 linux kernel: powernow-k8: Found 1 AMD Athlon 64 / Opteron processors (version 1.50.3)
Oct 5 16:28:32 linux kernel: powernow-k8: 0 : fid 0x10 (2400 MHz), vid 0x2 (1500 mV)
Oct 5 16:28:32 linux kernel: powernow-k8: 1 : fid 0xe (2200 MHz), vid 0x6 (1400 mV)
Oct 5 16:28:32 linux kernel: powernow-k8: 2 : fid 0xc (2000 MHz), vid 0xa (1300 mV)
Oct 5 16:28:32 linux kernel: powernow-k8: 3 : fid 0xa (1800 MHz), vid 0xe (1200 mV)
Oct 5 16:28:32 linux kernel: powernow-k8: 4 : fid 0x2 (1000 MHz), vid 0x12 (1100 mV)
Oct 5 16:28:32 linux kernel: cpu_init done, current fid 0x10, vid 0x2
Oct 5 16:28:32 linux rcpowersaved: enter 'powernow_k8' into CPUFREQD_MODULE in /etc/sysconfig/powersave/cpufreq.
Oct 5 16:28:32 linux rcpowersaved: this will speed up starting powersaved and avoid unnecessary warnings in syslog.
Oct 5 16:28:33 linux kernel: Unable to handle kernel NULL pointer dereference at 0000000000000012 RIP:
Oct 5 16:28:33 linux kernel: <ffffffff883721a2>{:powernow_k8:write_new_vid+66}
Oct 5 16:28:33 linux kernel: PGD 31ce2067 PUD 31ce3067 PMD 0
Oct 5 16:28:33 linux kernel: Oops: 0002 [1]
Oct 5 16:28:33 linux kernel: CPU 0
Oct 5 16:28:33 linux kernel: Modules linked in: cpufreq_ondemand cpufreq_userspace cpufreq_powersave powernow_k8 freq_table edd snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device snd_intel8x0 snd_ac97_codec snd_ac97_bus snd_pcm snd_timer snd soundcore snd_page_alloc lp parport_pc parport af_packet joydev sg st sr_mod ipv6 nls_utf8 hfsplus vfat fat subfs button battery ac floppy sk98lin ohci1394 ieee1394 skge generic e1000 shpchp pci_hotplug i2c_nforce2 i2c_core ehci_hcd ohci_hcd usbcore dm_mod reiserfs fan thermal processor sata_sil it821x ide_cd cdrom sata_nv libata amd74xx sd_mod scsi_mod ide_disk ide_core
Oct 5 16:28:33 linux kernel: Pid: 3, comm: events/0 Tainted: G U 2.6.13-15-default
Oct 5 16:28:33 linux kernel: RIP: 0010:[<ffffffff883721a2>] <ffffffff883721a2>{:powernow_k8:write_new_vid+66}
Oct 5 16:28:33 linux kernel: RSP: 0018:ffff81007fde5db8 EFLAGS: 00010246
Oct 5 16:28:33 linux kernel: RAX: 0000000000000012 RBX: 0000000000000002 RCX: 00000000c0010042
Oct 5 16:28:33 linux kernel: RDX: 0000000000000002 RSI: 0000000000000012 RDI: ffff81002fdf9380
Oct 5 16:28:33 linux kernel: RBP: ffff81002fdf9380 R08: ffffffff80419410 R09: 0000000000000000
Oct 5 16:28:33 linux kernel: R10: 0000000000000002 R11: 00000000ffffffff R12: 0000000000000012
Oct 5 16:28:33 linux kernel: R13: 0000000000000002 R14: 0000000000000012 R15: ffff81006372aa00
Oct 5 16:28:33 linux kernel: FS: 0000000040200960(0000) GS:ffffffff8049b800(0000) knlGS:00000000569a2780
Oct 5 16:28:33 linux kernel: CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Oct 5 16:28:33 linux kernel: CR2: 0000000000000012 CR3: 0000000031ce1000 CR4: 00000000000006e0
Oct 5 16:28:33 linux kernel: Process events/0 (pid: 3, threadinfo ffff81007fde4000, task ffff81007ffa06f0)
Oct 5 16:28:33 linux kernel: Stack: ffff81002fdf9380 0000000000000002 0000000000000002 ffffffff883728f0
Oct 5 16:28:33 linux kernel: ffffffff80422d70 0000000000000002 000000024bd48398 ffffffff00000010
Oct 5 16:28:33 linux kernel: 00249f0000000000 00000000000f4240
Oct 5 16:28:33 linux kernel: Call Trace:<ffffffff883728f0>{:powernow_k8:powernowk8_target+1008}
Oct 5 16:28:33 linux kernel: <ffffffff883796d9>{:cpufreq_ondemand:do_dbs_timer+409}
Oct 5 16:28:33 linux kernel: <ffffffff801436af>{worker_thread+415} <ffffffff801301c0>{default_wake_function+0}
Oct 5 16:28:33 linux kernel: <ffffffff80143510>{worker_thread+0} <ffffffff80143510>{worker_thread+0}
Oct 5 16:28:33 linux kernel: <ffffffff8014779d>{kthread+205} <ffffffff8010f392>{child_rip+8}
Oct 5 16:28:33 linux kernel: <ffffffff80232160>{dummycon_dummy+0} <ffffffff801476d0>{kthread+0}
Oct 5 16:28:33 linux kernel: <ffffffff8010f38a>{child_rip+0}
Oct 5 16:28:33 linux kernel:
Oct 5 16:28:33 linux kernel: Code: b9 41 00 01 c0 ba 01 00 00 00 c1 e0 08 09 d8 0d 00 00 01 00
Oct 5 16:28:33 linux kernel: RIP <ffffffff883721a2>{:powernow_k8:write_new_vid+66} RSP <ffff81007fde5db8>
Oct 5 16:28:33 linux kernel: CR2: 0000000000000012
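For readers following the log, the five P-state lines can be pulled apart mechanically. The sketch below is only an illustration: the helper names fid_to_mhz and vid_to_mv are hypothetical, and the linear relations they use are fits to the five logged entries, not formulas taken from the driver source.

```python
import re

# P-state lines copied from the log in this report (timestamps trimmed).
LOG = """\
powernow-k8: 0 : fid 0x10 (2400 MHz), vid 0x2 (1500 mV)
powernow-k8: 1 : fid 0xe (2200 MHz), vid 0x6 (1400 mV)
powernow-k8: 2 : fid 0xc (2000 MHz), vid 0xa (1300 mV)
powernow-k8: 3 : fid 0xa (1800 MHz), vid 0xe (1200 mV)
powernow-k8: 4 : fid 0x2 (1000 MHz), vid 0x12 (1100 mV)
"""

PSTATE = re.compile(
    r"(\d+) : fid (0x[0-9a-f]+) \((\d+) MHz\), vid (0x[0-9a-f]+) \((\d+) mV\)"
)


def parse_pstates(text):
    """Return (index, fid, mhz, vid, mv) tuples for every P-state line."""
    return [
        (int(i), int(fid, 16), int(mhz), int(vid, 16), int(mv))
        for i, fid, mhz, vid, mv in PSTATE.findall(text)
    ]


# Hypothetical helpers: linear fits to the logged table only (an assumption).
def fid_to_mhz(fid):
    return 800 + 100 * fid


def vid_to_mv(vid):
    return 1550 - 25 * vid


pstates = parse_pstates(LOG)
for index, fid, mhz, vid, mv in pstates:
    # Print each entry and whether the fitted relations reproduce it.
    print(index, hex(fid), mhz, fid_to_mhz(fid) == mhz,
          hex(vid), mv, vid_to_mv(vid) == mv)
```

The lowest entry (fid 0x2, vid 0x12) carries the same two values that appear in RBX and RAX in the register dump, which suggests the oops happened during a transition to the slowest P-state.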
This seems to have worked around the problem: replacing CPUFREQD_MODULE="" with CPUFREQ_MODULE="off". What is the downside of having this set to off?
Disabling CPUFREQ turns off dynamic frequency control. The processor will no longer reduce its frequency when it's idle, and will thus draw more power and run the fans. It's not a huge thing, but it's something that could be avoided. Andi, the error report doesn't make any sense. write_new_fid() only has one pointer, a struct powernow_k8_data *data, and is only called from core_frequency_transition(). Neither function modifies the pointer. If data were NULL, it should have crashed in a higher level function. What hardware is this occurring on?
The disassembled code is

   0: b9 41 00 01 c0    mov $0xc0010041,%ecx
   5: ba 01 00 00 00    mov $0x1,%edx
   a: c1 e0 08          shl $0x8,%eax
   d: 09 d8             or %ebx,%eax

which looks like shortly before the wrmsr write. But yes, it's impossible: this code cannot reference address 12. Looks very dubious. Maybe the CPU gets confused?
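As a cross-check of that reading, the bytes from the Code: line can be walked by hand. The sketch below hard-codes the offsets of the four instructions listed above plus the trailing five bytes (0d 00 00 01 00), which decode as an OR of a 32-bit immediate into %eax. It assumes, without proof, that RAX and RBX in the register dump still hold the values these instructions would operate on.

```python
# Bytes from the "Code:" line of the oops in this report.
CODE = bytes.fromhex(
    "b9 41 00 01 c0 ba 01 00 00 00 c1 e0 08 09 d8 0d 00 00 01 00"
)

# Register values from the same oops (assumed to be the instruction inputs).
RAX = 0x12
RBX = 0x02


def imm32(data, offset):
    """Read a little-endian 32-bit immediate starting at offset."""
    return int.from_bytes(data[offset:offset + 4], "little")


ecx = imm32(CODE, 1)       # b9 imm32: mov $imm32,%ecx
edx = imm32(CODE, 6)       # ba imm32: mov $imm32,%edx
shift = CODE[12]           # c1 e0 ib: shl $ib,%eax
or_mask = imm32(CODE, 16)  # 0d imm32: or $imm32,%eax

# Replay the arithmetic on %eax: shift, OR in %ebx, OR in the immediate.
eax = ((RAX << shift) | RBX | or_mask) & 0xFFFFFFFF

print(hex(ecx), hex(edx), hex(eax))
```

None of these instructions takes a memory operand, which is consistent with the remark that this code cannot reference address 12.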
Oct 5 16:28:33 linux kernel: Unable to handle kernel NULL pointer dereference at 0000000000000012 RIP:
Oct 5 16:28:33 linux kernel: <ffffffff883721a2>{:powernow_k8:write_new_vid+66}

It looks to me like the error is occurring at 1c6:

 1c6: 8b 55 28    mov 0x29(%rbp),%edx

which is at least a pointer dereference: it looks like line 216 of powernow-k8.c, the "if (savefid != data->currfid) {" following the call to query_current_values_with_pending_wait(). It still makes no sense: data->currfid has already been dereferenced in the function once, and nothing alters the value of data. If the processor is getting confused, I can't see how. This is tested code that hasn't changed through the life of powernow-k8. Again, what hardware is this running on?
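One more data point from the oops is the "Oops: 0002" field, which on x86 is the page-fault error code. A minimal sketch of decoding it follows; the function name and dictionary keys are hypothetical, while the bit meanings are the standard x86 ones.

```python
def decode_page_fault_error(code):
    """Split an x86 page-fault error code into its documented flag bits."""
    return {
        "page_present": bool(code & 0x01),       # 0 means no page was mapped
        "write": bool(code & 0x02),              # 1 means the access was a write
        "user_mode": bool(code & 0x04),          # 0 means kernel mode
        "reserved_bit": bool(code & 0x08),
        "instruction_fetch": bool(code & 0x10),
    }


info = decode_page_fault_error(0x0002)
print(info)
```

Read this way, the fault was a kernel-mode write to an unmapped page. A load such as the mov from memory into %edx would not normally be reported as a write, so this field may be one more sign that the oops data is not self-consistent.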
Created attachment 53857 [details] hwinfo output attached
I disassembled the Code: line in the oops. It's normally accurate. Your 0x29(%rbp) doesn't match the value of RBP (0xffff81002fdf9380) and the resulting address (12).
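The mismatch can be checked with plain arithmetic. This sketch takes RBP and CR2 from the register dump and the 0x29 displacement from the instruction quoted in the thread, and computes the address that operand would actually touch.

```python
RBP = 0xFFFF81002FDF9380   # from the register dump in the oops
CR2 = 0x12                 # faulting address reported by the oops
DISPLACEMENT = 0x29        # displacement quoted in the thread

# Effective address of a disp(%rbp) operand, wrapped to 64 bits.
effective = (RBP + DISPLACEMENT) & 0xFFFFFFFFFFFFFFFF

print(hex(effective), effective == CR2)
```

With these register values the operand lands in mapped kernel memory near the data structure, nowhere near address 0x12.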
I spoke too soon on the workaround. CPUFREQ_MODULE="off" only worked for about 2-3 minutes and then the machine froze again.
Hmm, can you double-check in lsmod before the hang that the powernow-k8 module really is not loaded? Also, while you can still log in remotely, attach the boot.msg again.
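A check like this can also be scripted against saved lsmod output. The sketch below is illustrative only: module_loaded is a hypothetical helper, and the sample text uses module names from the oops with made-up sizes. It relies on the fact that lsmod lists modules with underscores, so powernow-k8 shows up as powernow_k8.

```python
def module_loaded(lsmod_text, name):
    """Return True if name appears as a module in lsmod-style output."""
    wanted = name.replace("-", "_")  # lsmod shows underscores, not dashes
    for line in lsmod_text.splitlines():
        fields = line.split()
        if fields and fields[0] == wanted:
            return True
    return False


# Illustrative sample; the sizes are invented, not from the attachment.
SAMPLE = """\
Module                  Size  Used by
cpufreq_ondemand        6172  0
powernow_k8            12272  0
freq_table              4612  1 powernow_k8
"""

print(module_loaded(SAMPLE, "powernow-k8"))
print(module_loaded(SAMPLE, "sata_nv"))
```

Matching on the first field only avoids a false positive from the "Used by" column, where powernow_k8 also appears on the freq_table line.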
Created attachment 53860 [details] new boot.msg file as per request
There is no oops in there, so it must be something else. BTW you misedited cpufreq:

/etc/sysconfig/powersave/cpufreq: line 110: unexpected EOF while looking for matching `"'
/etc/sysconfig/powersave/cpufreq: line 111: syntax error: unexpected end of file
Okay, I fixed the syntax of cpufreq and the machine is running for now. lsmod output attached; no evidence of the powernow-k8 module.
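The shell reports the line where it gave up, not the line where the quote was opened, so finding the bad edit can take a moment. A rough sketch of a checker follows; it assumes every assignment fits on one line and simply flags lines with an odd number of unescaped double quotes. The variable OTHER_OPTION in the sample is hypothetical.

```python
def unbalanced_quote_lines(text):
    """Return 1-based numbers of non-comment lines with an odd quote count."""
    bad = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if line.lstrip().startswith("#"):
            continue  # ignore comment lines entirely
        quotes = line.count('"') - line.count('\\"')  # skip escaped quotes
        if quotes % 2 == 1:
            bad.append(lineno)
    return bad


# Sample sysconfig-style text with a missing closing quote on line 2.
SAMPLE = (
    '# a comment with "one quote\n'
    'CPUFREQD_MODULE="off\n'
    'OTHER_OPTION="yes"\n'
)

print(unbalanced_quote_lines(SAMPLE))
```

This is deliberately naive: it does not handle single quotes, multi-line values, or a trailing comment that contains a quote.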
Created attachment 53872 [details] lsmod output
Adding David and Jacob as Mark is on sabbatical
This bug has been around since last October. Is it still something that needs looking into???
Not sure. Maybe I can now reproduce this with a new machine, or at least I think I got something similar (freezing with cpufreq). Need to evaluate...
Please reopen if you still see this with the latest 10.1/SLES10 products.
I still see this, but a workaround is to set CPUFREQD_MODULE="off" in /etc/sysconfig/powersave/cpufreq.
With what kernel do you see it? Can you try the latest beta?
Assigning to Mark Langsdorf.
This bug has been in state NEW for over a year. Joachim, please advise what to do.
Glen, can you test with a release version, perhaps something newer like openSUSE 10.2? I would also check whether you can update the BIOS, as hwinfo says the version you have is from 2004. If nothing happens on this bug in two weeks then I would just mark it WONTFIX or INVALID. Side note: Mark or Thomas, it looks like he was using the ondemand governor; is that fine with this processor?
> ondemand governor, is that fine with this processor? Grepped out of hwinfo: "Socket 754" "AMD" "AMD Athlon(tm) 64 Processor 3400+". It definitely must work. We have a lot of them; some boards were upgraded and needed a BIOS update to get cpufreq working, or possibly had cpufreq not working because of the BIOS. This is/was one of the most common AMD processors. I am pretty sure it has been supported by ondemand for a long time; we got some reports with dual cores, and that was also some time ago... IMO we should close this one (just doing this now). It could be some weird HW (possibly BIOS) defect related to cpufreq. Maybe simply exchanging the processor helps? I have the strong feeling that if you buy another machine of the same type it will work...
Yes, a BIOS update resolves the issue.