Bugzilla – Bug 146529
System freezes during installation in the second stage, just as it starts installing from disk 2
Last modified: 2006-03-10 18:00:15 UTC
When I install Beta 2 on my system, it freezes in the second stage, just before it actually starts installing the packages from disk 2. My system is an SMP system. I have a feeling it's a kernel problem: during the first stage it used a single-processor kernel. I'm not sure, though.
Created attachment 65569 [details] The YaST files that I have. I don't know if there is any useful information in them, as the system froze completely, so I don't know when it stopped writing to the hard disk.
Created attachment 65570 [details] The frozen screen
Created attachment 65571 [details] The frozen screen more detail
Please also attach the syslog (/var/log/messages) of this machine; this will show whether this is a kernel problem. Furthermore, please give more information about your hardware (hwinfo).
This could be related to bug #146450. Let's wait for Werner.
Created attachment 65755 [details] This is the hwinfo for my dual Athlon machine. This is the only information I can give right now. I'm now running Beta1 on this machine. As I have to provide info for more bugs, I can give you more information tonight, or more probably tomorrow night. The hwinfo was generated by Beta1.
I don't have time to reproduce this bug today. Probably tomorrow. I had a few bugs that needed extra info.
If this is a kernel issue, the syslog is required. Furthermore the output of `lsmod' could be of help.
The whole system freezes, so I can't send the lsmod output.
I wanted to attach the /var directory, but it's too big. You can download it at: http://worldcitizen.demon.nl/var_dual_athon_101_beta2.tgz
I've just tried with the default kernel (nosmp). Still the same problem.
By the way, the problem appears both when installing from CD and when installing from NFS. Also, it didn't ask for disk 2 at that time.
By the way 10.1 beta1 installs without a problem.
10.1 Beta 3 same result. :-(
Can I help in any way? To provide extra info, or do some tests, etc.?
Did a few tests; none of them helped. :-(
Only IDE disks - result the same freeze
Without firewire card - result the same freeze
Without TV card - result the same freeze
Also not fixed in Beta 4; exactly the same problem.
The log finishes somewhere inside Keyboard::SetLanguage function. Jiri?
Joop, are you able to get new logs from the reproduced situation with the latest beta?
One more hint: type y2debug=1 at the boot line when the installation boots to 2nd stage to get more verbose logging.
Created attachment 69639 [details] YaST2 log YaST2 log with option y2debug=1
Added the compressed /var/log/YaST2 files with boot option y2debug=1
I'm not sure what was last now: reading /etc/YaST2/control_files/order.ycp? Or Pkg Builtin called: SetLocale called from clients/installation.ycp?
Not sure whether this is a zypp-based problem; adding Stano. Also adding kernel maintainers (kernel problem suggested by reporter).
If the problem appeared before beta4, it's not libzypp related.
If you suspect a kernel problem, please switch to virtual console #7 (the one showing the syslog output) shortly before the machine hangs. Please let us know if there's any unusual output there.
Created attachment 69872 [details] The error I get on terminal #9 in the second phase of the installation process
According to http://lwn.net/Articles/168975/ this might be related to the internal memory. I'm running a memtest. Until now nothing has been found; I haven't had any memory errors yet.
Test ran for 20 minutes + no error found.
Kernel panic...
After moving the directory /lib/modules/2.6.16-rc3-git3-2-smp/kernel/drivers/edac out of the way (via rescue -> mount -> mv; the destination doesn't matter, I used /edac), the problem didn't occur any more. So maybe an EDAC bug?
Tomorrow I will do a new install with the help of the rescue disk. Before the second boot I will move /lib/modules/2.6.16-rc3-git3-2-smp/kernel/drivers/edac again. That way I will probably be able to complete the installation.
See Documentation/edac/edac.txt:
  The 'edac' kernel module goal is to detect and report errors that occur within the computer system. In the initial release, memory Correctable Errors (CE) and Uncorrectable Errors (UE) are the primary errors being harvested.
The last message is
  EDAC MC0 UE page 0x0, offset 0x0, grain 536870912, row 0, labels "": AMD762
i.e. this uncorrectable ECC error is reported by memory controller 0. The weird grain number is actually 0x20000000.
Please try running memtest on your machine overnight and see if that reports any bad RAM.
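For readers who want to pull the fields out of a message like the one quoted above, here is a minimal Python sketch. The regular expression and the `parse_edac_message` helper are illustrative assumptions based only on the single message in this report; they are not taken from the EDAC sources and may not match other message variants.

```python
import re

# Message text copied from this bug report. The pattern below is an
# assumption derived from this one sample, not the kernel's format spec.
MESSAGE = 'EDAC MC0 UE page 0x0, offset 0x0, grain 536870912, row 0, labels "": AMD762'

PATTERN = re.compile(
    r'EDAC MC(?P<mc>\d+):? (?P<kind>UE|CE) page (?P<page>0x[0-9a-fA-F]+), '
    r'offset (?P<offset>0x[0-9a-fA-F]+), grain (?P<grain>\d+), '
    r'row (?P<row>\d+), labels "(?P<labels>[^"]*)": (?P<driver>\S+)'
)


def parse_edac_message(line):
    """Return the message fields as a dict, or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "mc": int(fields["mc"]),              # memory controller index
        "kind": fields["kind"],               # "UE" or "CE"
        "page": int(fields["page"], 16),
        "offset": int(fields["offset"], 16),
        "grain": int(fields["grain"]),        # printed in decimal in the log
        "row": int(fields["row"]),
        "labels": fields["labels"],
        "driver": fields["driver"],
    }


info = parse_edac_message(MESSAGE)
print(hex(info["grain"]))                     # grain shown in hexadecimal
print(info["grain"] // (1024 * 1024), "MiB")  # grain expressed in MiB
```

Printing the grain in hexadecimal makes it easier to see that the decimal value in the log is a round power of two rather than a random number.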
I've also calculated the value: 0x20000000 = 512 MB. Would this mean the second memory bank (DIMM0 being 0 - 536870912)? But I presume row 0 is DIMM 0? I have two 512 MB modules installed in my system. Might it be that EDAC thinks I have one 1MB DIMM installed? (Just wondering.)
All, I've checked something in the hwinfo I added before, and I see something weird: Bank 3 appears twice, once with and once without memory. See Memory Device: #23 and Memory Device: #24 (Location: "S3", Bank: "Bank 3"). I installed the memory as suggested in the manual (I'll check it again tonight to be 100% sure): ftp://ftp.tyan.com/manuals/m_S2460_103.pdf
This might mean it thinks I have one 1GB chip in my system? Or that the modules are in the wrong order (starting with DIMM3 instead of DIMM0), but as I said, I'm quite sure this is not the case.

Physical Memory Array: #20
  Use: 0x03 (System memory)
  Location: 0x03 (Motherboard)
  Slots: 4
  Max. Size: 4 GB
  ECC: 0x03 (None)
Memory Device: #21
  Location: "S1"
  Bank: "Bank 1"
  Memory Array: #20
  Error Info: No Error
  Form Factor: 0x09 (DIMM)
  Type: 0x03 (DRAM)
  Type Detail: 0x0080 (Synchronous)
  Data Width: 0 bits
  Size: No Memory Installed
Memory Device: #22
  Location: "S2"
  Bank: "Bank 2"
  Memory Array: #20
  Error Info: No Error
  Form Factor: 0x09 (DIMM)
  Type: 0x03 (DRAM)
  Type Detail: 0x0080 (Synchronous)
  Data Width: 0 bits
  Size: No Memory Installed
Memory Device: #23
  Location: "S3"
  Bank: "Bank 3"
  Memory Array: #20
  Error Info: No Error
  Form Factor: 0x09 (DIMM)
  Type: 0x03 (DRAM)
  Type Detail: 0x0080 (Synchronous)
  Data Width: 32 bits (+32 ECC bits)
  Size: 24 MB
Memory Device: #24
  Location: "S3"
  Bank: "Bank 3"
  Memory Array: #20
  Error Info: No Error
  Form Factor: 0x09 (DIMM)
  Type: 0x03 (DRAM)
  Type Detail: 0x0080 (Synchronous)
  Data Width: 0 bits
  Size: No Memory Installed
Memory Array Mapping: #25
  Memory Array: #20
  Partition Width: 2
  Start Address: 0x00000000
  End Address: 0x01800000
Memory Device Mapping: #26
  Memory Device: #21
  Array Mapping: #25
  Row: 1
  Interleave Pos: 0
  Interleaved Depth: 1
  Start Address: 0x00000000
  End Address: 0x00000400
Memory Device Mapping: #27
  Memory Device: #22
  Array Mapping: #25
  Row: 1
  Interleave Pos: 0
  Interleaved Depth: 1
  Start Address: 0x00000000
  End Address: 0x00000400
Memory Device Mapping: #28
  Memory Device: #23
  Array Mapping: #25
  Row: 1
  Interleave Pos: 0
  Interleaved Depth: 1
  Start Address: 0x00000000
  End Address: 0x01800000
Type 32 Record: #29
  Data 00: 20 0b 1d 00 00 00 00 00 00 00 00
Config Status: cfg=new, avail=yes, need=no, active=unknown
By the way, "Size: 24 MB" is rather strange. Shouldn't it be 1024 MB?
End Address: 0x01800000 is indeed 24MB, weird.
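The arithmetic behind that observation can be checked with a few lines of Python. This is only a sketch: it assumes, as the comments here do, that the hwinfo "End Address" can be read as a plain byte count, and the `to_mib` helper is a name made up for the illustration.

```python
MIB = 1024 * 1024


def to_mib(n_bytes):
    """Convert a byte count to MiB (hypothetical helper for this sketch)."""
    return n_bytes / MIB


# "End Address" of the memory array mapping in the hwinfo output,
# interpreted as bytes (an assumption made in this thread).
end_address = 0x01800000

# What the reporter says is physically installed: two 512 MB modules.
installed = 2 * 512 * MIB

print(to_mib(end_address))   # size implied by the reported mapping
print(to_mib(installed))     # size the reporter expects to see
```

Under that assumption the mapping covers 24 MiB, which is nowhere near the 1024 MiB that two 512 MB modules should give.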
Created attachment 70010 [details] Latest hwinfo
I did a lot of tests:
- 1 DIMM in DIMM0
- the other DIMM in DIMM0
- 1 DIMM in DIMM4
Nothing worked; I get exactly the same error. Would this mean that both DIMMs are broken? The only thing that worked was switching ECC off altogether, which isn't what I really want. I have a feeling ECC handling isn't implemented correctly in EDAC for the AMD762. Tonight I will run the memory test.
I ran the memory test overnight: 6 passes in 6:45 hours. No errors.
I have been looking on the internet and found the information below. Maybe this option would be good for investigating this problem?
  echo "0" >/sys/devices/system/edac/mc/panic_on_ue
Is the following done in openSuSE?
  echo "1" >/sys/devices/system/edac/mc/panic_on_ue
This might cause the kernel panic. Or is this set to 1 as soon as the EDAC modules are loaded? Do you know at which stage the EDAC modules are loaded?

DIRECTORY 'mc'
In directory 'mc' are EDAC system overall control and attribute files:

Panic on UE control file: 'panic_on_ue'
  An uncorrectable error will cause a machine panic. This is usually desirable. It is a bad idea to continue when an uncorrectable error occurs - it is indeterminate what was uncorrected and the operating system context might be so mangled that continuing will lead to further corruption. If the kernel has MCE configured, then EDAC will never notice the UE.
  LOAD TIME: module/kernel parameter: panic_on_ue=[0|1]
  RUN TIME: echo "1" >/sys/devices/system/edac/mc/panic_on_ue

Log UE control file: 'log_ue'
  Generate kernel messages describing uncorrectable errors. These errors are reported through the system message log system. UE statistics will be accumulated even when UE logging is disabled.
  LOAD TIME: module/kernel parameter: log_ue=[0|1]
  RUN TIME: echo "1" >/sys/devices/system/edac/mc/log_ue

Log CE control file: 'log_ce'
  Generate kernel messages describing correctable errors. These errors are reported through the system message log system. CE statistics will be accumulated even when CE logging is disabled.
  LOAD TIME: module/kernel parameter: log_ce=[0|1]
  RUN TIME: echo "1" >/sys/devices/system/edac/mc/log_ce

Polling period control file: 'poll_msec'
  The time period, in milliseconds, for polling for error information. Too small a value wastes resources. Too large a value might delay necessary handling of errors and might lose valuable information for locating the error. 1000 milliseconds (once each second) is about right for most uses.
  LOAD TIME: module/kernel parameter: poll_msec=[0|1]
  RUN TIME: echo "1000" >/sys/devices/system/edac/mc/poll_msec

Module Version read-only attribute file: 'mc_version'
  The EDAC CORE module's version and compile date are shown here to indicate what EDAC is running.
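To see what these control files are currently set to on a running system, something like the following Python sketch could be used. The file names and the default directory come from the documentation quoted above; the `read_edac_controls` function is a hypothetical helper, and the demo runs against a temporary directory because the real sysfs path only exists when the EDAC modules are loaded.

```python
import os
import tempfile

# Control and attribute files named in the quoted edac.txt excerpt.
CONTROL_FILES = ("panic_on_ue", "log_ue", "log_ce", "poll_msec", "mc_version")


def read_edac_controls(base="/sys/devices/system/edac/mc"):
    """Read each control file under `base`; missing or unreadable files map to None."""
    values = {}
    for name in CONTROL_FILES:
        path = os.path.join(base, name)
        try:
            with open(path) as handle:
                values[name] = handle.read().strip()
        except OSError:
            values[name] = None
    return values


# Demo against a throwaway directory that mimics one control file,
# so the sketch runs on machines without EDAC loaded.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "panic_on_ue"), "w") as handle:
    handle.write("1\n")

demo = read_edac_controls(demo_dir)
print(demo)
```

On the affected machine, calling `read_edac_controls()` with no argument right after the modules load would show whether `panic_on_ue` is already 1 at that point.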
I'm aware of the documentation :) Yes, it may help to try booting the kernel with edac_mc.panic_on_ue=0. But I'm afraid this is just a workaround; something else is broken. Anyway, please give it a try.
Created attachment 70592 [details] The error I get with boot option: edac_mc.panic_on_ue=0
I've sent a mail to Tyan, kernel-smp and the bluesmoke mailing list. I got a mail back from Tyan; it's not promising yet. I have now asked for a test tool to test the ECC functionality, but haven't had an answer yet.
The error I get, as text:
EDAC MC0: UE page 0x0, offset 0x0, grain 536870912, row 0, labels "": AMD762
Kernel panic - not syncing: EDAC MC0: UE page 0x0, offset 0x0, grain 536870912, row 0, labels "": AMD762
Badness in smp_call_function at arch/i386/kernel/smp.c:595
 [<c0112968>] smp_call_function+0x52/0xc0
 [<c0120c13>] printk+0x14/0x18
 [<c01129e9>] smp_send_stop+0x13/0x1c
 [<c01202d9>] panic+0x5d/0xec
 [<f97389d6>] edac_mc_handle_ue+0x12a/0x13e [edac_mc]
 [<c011a0aa>] move_tasks+0x1d5/0x212
 [<c0294330>] _spin_unlock_irq+0x5/0xa7
 [<c0238f2c>] pci_conf1_write+0x99/0xa7
 [<c0239faa>] pci_write+0x1d/0x22
 [<f95ca2e0>] amd76x_check+0xc3/0xf7 [amd76x_edac]
 [<f97387db>] do_edac_check+0x19/0xaf [edac_mc]
 [<f9738c55>] edac_kernel_thread+0x3a/0x92 [edac_mc]
 [<f9738c1b>] edac_kernel_thread+0x0/0x92 [edac_mc]
 [<c0102005>] kernel_thread_helper+0x5/0xb
REMARK: I typed this over from the screen, so there might be typos in it.
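To pick out which loadable modules appear in a trace like this, a short Python sketch along the following lines may help. The regular expression and the `module_frames` helper are assumptions based on the frame layout in this one trace, and the sample is an excerpt of the frames typed above. Frames on an i386 stack dump can be stale, so the result is a hint, not proof of the call path.

```python
import re

# Excerpt of the trace from this report (hand-typed by the reporter).
TRACE = """\
[<c01202d9>] panic+0x5d/0xec
[<f97389d6>] edac_mc_handle_ue+0x12a/0x13e [edac_mc]
[<c0239faa>] pci_write+0x1d/0x22
[<f95ca2e0>] amd76x_check+0xc3/0xf7 [amd76x_edac]
[<f97387db>] do_edac_check+0x19/0xaf [edac_mc]
[<f9738c55>] edac_kernel_thread+0x3a/0x92 [edac_mc]
"""

# address, function+offset/size, optional [module] suffix
FRAME = re.compile(
    r'\[<([0-9a-f]+)>\]\s+(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?'
)


def module_frames(trace):
    """Return (function, module) pairs for frames that carry a module tag."""
    frames = []
    for _address, function, module in FRAME.findall(trace):
        if module:  # frames in the core kernel have no [module] suffix
            frames.append((function, module))
    return frames


for function, module in module_frames(TRACE):
    print(f"{module}: {function}")
```

For this excerpt, the tagged frames all belong to `edac_mc` and `amd76x_edac`, which is consistent with the panic being raised from the EDAC polling thread.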
The edac development team is also looking into this. Link: http://sourceforge.net/mailarchive/forum.php?forum_id=43090 . It's not synced yet. :-(
By the way, the boot option edac_mc.panic_on_ue=0 isn't enough; that's why it still panicked. Instead, this should be added to /etc/modprobe.conf.local: options edac_mc panic_on_ue=0 The bluesmoke development team taught me this.
-> gregkh for monitoring
I've disabled it from the build now. Also, based on upstream comments, this code _really_ needs work and might just be removed from 2.6.16 entirely.