Bug 146529 - System freezes during installation in the second stage. When it just starts installing from disk 2
Status: RESOLVED FIXED
Alias: None
Product: SUSE Linux 10.1
Classification: openSUSE
Component: Kernel
Version: Beta 2
Hardware: x86 Other
Importance: P5 - None : Normal
Target Milestone: ---
Assignee: Greg Kroah-Hartman
QA Contact: E-mail List
Reported: 2006-01-29 21:13 UTC by Joop Boonen
Modified: 2006-03-10 18:00 UTC
Found By: Other

Attachments
The YaST files that I have. (132.31 KB, application/x-gtar)
2006-01-29 21:16 UTC, Joop Boonen
The frozen screen (160.47 KB, image/jpeg)
2006-01-29 21:20 UTC, Joop Boonen
The frozen screen, more detail (484.84 KB, image/jpeg)
2006-01-29 21:21 UTC, Joop Boonen
This is the hwinfo for my dual Athlon machine (229.36 KB, text/plain)
2006-01-30 20:09 UTC, Joop Boonen
YaST2 log (3.84 MB, application/x-compressed-tar)
2006-02-21 18:53 UTC, Joop Boonen
The error I get on terminal #9 in the second phase of the installation process (98.16 KB, image/jpeg)
2006-02-22 21:12 UTC, Joop Boonen
Latest hwinfo (228.76 KB, text/plain)
2006-02-23 17:02 UTC, Joop Boonen
The error I get with boot option: edac_mc.panic_on_ue=0 (518.04 KB, image/jpeg)
2006-02-28 09:18 UTC, Joop Boonen

Description Joop Boonen 2006-01-29 21:13:21 UTC
When I install Beta 2 on my system, it freezes in the second stage, just before it actually starts installing the packages from disk 2.

My system is an SMP system. I have a feeling it's a kernel problem: during the first stage it used a single-processor kernel. I'm not sure, though.
Comment 1 Joop Boonen 2006-01-29 21:16:03 UTC
Created attachment 65569
The YaST files that I have.

I don't know if there is any useful information in them. The system froze completely, so I don't know when it stopped writing to the hard disk.
Comment 2 Joop Boonen 2006-01-29 21:20:20 UTC
Created attachment 65570
The frozen screen
Comment 3 Joop Boonen 2006-01-29 21:21:08 UTC
Created attachment 65571
The frozen screen, more detail
Comment 4 Michael Gross 2006-01-30 12:15:23 UTC
Please also attach the syslog (/var/log/messages) of this machine; it will show whether this is a kernel problem. Furthermore, please give more information about your hardware (hwinfo).
Comment 5 Michael Gross 2006-01-30 15:10:24 UTC
This could be related to bug #146450
Let's wait for Werner
Comment 6 Joop Boonen 2006-01-30 20:09:54 UTC
Created attachment 65755
This is the hwinfo for my dual Athlon machine

This is the only information I can give right now. I'm now running Beta1 on this machine. As I have to give info for more bugs, I can give you more information tonight, or more likely tomorrow night.

The hwinfo was generated under Beta1.
Comment 7 Joop Boonen 2006-01-30 21:46:48 UTC
I don't have time to reproduce this bug today. Probably tomorrow. I had a few bugs that needed extra info.
Comment 8 Michael Gross 2006-01-31 09:52:40 UTC
If this is a kernel issue, the syslog is required. Furthermore, the output of `lsmod' could be of help.
Comment 9 Joop Boonen 2006-01-31 18:39:37 UTC
The whole system freezes, so I can't send the lsmod output.
Comment 10 Joop Boonen 2006-01-31 18:53:00 UTC
I wanted to attach the /var directory, but it's too big.

You can download it at:
http://worldcitizen.demon.nl/var_dual_athon_101_beta2.tgz
Comment 11 Joop Boonen 2006-01-31 20:48:07 UTC
I've just tried with the default kernel (nosmp). Still the same problem.
Comment 12 Joop Boonen 2006-01-31 20:50:09 UTC
By the way, the problem appears both when installing from CD and when installing from NFS. Also, it hadn't asked for disk 2 at that point.
Comment 13 Joop Boonen 2006-02-01 07:06:48 UTC
By the way 10.1 beta1 installs without a problem.
Comment 14 Joop Boonen 2006-02-03 22:18:41 UTC
10.1 Beta 3 same result. :-(
Comment 15 Joop Boonen 2006-02-07 06:38:24 UTC
Can I help in any way? I could provide extra info, run some tests, etc.
Comment 16 Joop Boonen 2006-02-07 20:56:33 UTC
Did a few tests; none of them helped. :-(

Only IDE disks - same freeze
Without FireWire card - same freeze
Without TV card - same freeze
Comment 17 Joop Boonen 2006-02-19 16:51:19 UTC
Also not fixed in Beta 4; exactly the same problem.
Comment 18 Jiri Srain 2006-02-19 20:42:38 UTC
The log finishes somewhere inside Keyboard::SetLanguage function.

Jiri?
Comment 19 Jiří Suchomel 2006-02-20 08:50:56 UTC
Joop, are you able to reproduce the situation with the latest beta and get new logs from it?
Comment 20 Jiří Suchomel 2006-02-20 14:34:59 UTC
One more hint: type y2debug=1 at the boot line when the installation boots to 2nd stage to get more verbose logging.
Comment 21 Joop Boonen 2006-02-21 18:53:56 UTC
Created attachment 69639
YaST2 log

YaST2 log with option y2debug=1
Comment 22 Joop Boonen 2006-02-21 18:55:38 UTC
Added the compressed /var/log/YaST2 files with boot option y2debug=1
Comment 23 Jiří Suchomel 2006-02-22 09:03:06 UTC
I'm not sure what was last now: reading /etc/YaST2/control_files/order.ycp? Or Pkg Builtin called: SetLocale called from clients/installation.ycp?
Comment 24 Jiří Suchomel 2006-02-22 10:14:06 UTC
Not sure whether this is a zypp-related problem; adding Stano.

Also adding the kernel maintainers (the reporter suggested a kernel problem).
Comment 25 Stanislav Visnovsky 2006-02-22 10:25:22 UTC
If the problem appeared before beta4, it's not libzypp related.
Comment 26 Olaf Kirch 2006-02-22 13:03:41 UTC
If you suspect a kernel problem, please switch to virtual console #7
(the one showing the syslog output) shortly before the machine hangs.
Please let us know if there's any unusual output there.
Comment 27 Joop Boonen 2006-02-22 21:12:42 UTC
Created attachment 69872
The error I get on terminal #9 in the second phase of the installation process
Comment 28 Joop Boonen 2006-02-22 21:34:43 UTC
According to http://lwn.net/Articles/168975/ this might be related to the system memory. I'm running a memtest. So far nothing has been found; I haven't seen any memory errors yet.
Comment 29 Joop Boonen 2006-02-22 21:52:42 UTC
The test ran for 20 minutes and found no errors.
Comment 30 Jiri Srain 2006-02-22 22:02:46 UTC
Kernel panic...
Comment 31 Joop Boonen 2006-02-22 22:20:16 UTC
After moving the directory (via rescue -> mount -> mv):
/lib/modules/2.6.16-rc3-git3-2-smp/kernel/drivers/edac
(the destination doesn't matter; I moved it to /edac)
the problem didn't occur any more. So maybe an EDAC bug?
Comment 32 Joop Boonen 2006-02-22 22:24:21 UTC
Tomorrow I will do a new install with the help of the rescue disk. Before the second boot I will move /lib/modules/2.6.16-rc3-git3-2-smp/kernel/drivers/edac away again. That way I will probably be able to complete the installation.
Comment 33 Olaf Kirch 2006-02-23 10:24:43 UTC
See Documentation/edac/edac.txt:

The 'edac' kernel module goal is to detect and report errors that occur
within the computer system. In the initial release, memory Correctable Errors
(CE) and Uncorrectable Errors (UE) are the primary errors being harvested.

The last message is

EDAC MC0 UE page 0x0, offset 0x0, grain 536870912, row 0, labels "": AMD762

i.e. this uncorrectable ECC error is reported by memory controller 0.
The weird grain number is actually 0x20000000.

Please try running memtest on your machine overnight and see if that
reports any bad RAM.
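
The grain conversion mentioned above is easy to double-check in a shell. A minimal sketch, using only printf (the function name grain_to_hex is made up for illustration):

```shell
# Print a decimal byte count, such as the EDAC "grain" value, in hexadecimal.
grain_to_hex() {
    printf '0x%x\n' "$1"
}

grain_to_hex 536870912   # prints 0x20000000
```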
Comment 34 Joop Boonen 2006-02-23 12:49:57 UTC
I've also calculated the value: 0x20000000 = 512 MB. Would this point to the second memory bank? (DIMM0 is 0 - 536870912.) But I presume row 0 is DIMM 0? I have two 512 MB modules installed in my system.

Might it be that EDAC thinks I have one 1 MB DIMM installed? (I'm just wondering.)
Comment 35 Joop Boonen 2006-02-23 13:19:02 UTC
All,

I've checked the hwinfo I attached before, and I see something weird: Bank 3 appears twice, once with and once without memory. See Memory Device: #24 and Memory Device: #25 (Location: "S3", Bank: "Bank 3"). I've installed the memory as suggested in the manual (I'll check it again tonight to be 100% sure): ftp://ftp.tyan.com/manuals/m_S2460_103.pdf

This might mean it thinks I have one 1 GB chip inside my system? Or that the DIMMs are in the wrong order (starting with DIMM3 instead of DIMM0), but as I said, I'm quite sure this is not the case.

  Physical Memory Array: #20
    Use: 0x03 (System memory)
    Location: 0x03 (Motherboard)
    Slots: 4
    Max. Size: 4 GB
    ECC: 0x03 (None)
  Memory Device: #21
    Location: "S1"
    Bank: "Bank 1"
    Memory Array: #20
    Error Info: No Error
    Form Factor: 0x09 (DIMM)
    Type: 0x03 (DRAM)
    Type Detail: 0x0080 (Synchronous)
    Data Width: 0 bits
    Size: No Memory Installed
  Memory Device: #22
    Location: "S2"
    Bank: "Bank 2"
    Memory Array: #20
    Error Info: No Error
    Form Factor: 0x09 (DIMM)
    Type: 0x03 (DRAM)
    Type Detail: 0x0080 (Synchronous)
    Data Width: 0 bits
    Size: No Memory Installed
  Memory Device: #23
    Location: "S3"
    Bank: "Bank 3"
    Memory Array: #20
    Error Info: No Error
    Form Factor: 0x09 (DIMM)
    Type: 0x03 (DRAM)
    Type Detail: 0x0080 (Synchronous)
    Data Width: 32 bits (+32 ECC bits)
    Size: 24 MB
  Memory Device: #24
    Location: "S3"
    Bank: "Bank 3"
    Memory Array: #20
    Error Info: No Error
    Form Factor: 0x09 (DIMM)
    Type: 0x03 (DRAM)
    Type Detail: 0x0080 (Synchronous)
    Data Width: 0 bits
    Size: No Memory Installed
  Memory Array Mapping: #25
    Memory Array: #20
    Partition Width: 2
    Start Address: 0x00000000
    End Address: 0x01800000
  Memory Device Mapping: #26
    Memory Device: #21
    Array Mapping: #25
    Row: 1
    Interleave Pos: 0
    Interleaved Depth: 1
    Start Address: 0x00000000
    End Address: 0x00000400
  Memory Device Mapping: #27
    Memory Device: #22
    Array Mapping: #25
    Row: 1
    Interleave Pos: 0
    Interleaved Depth: 1
    Start Address: 0x00000000
    End Address: 0x00000400
  Memory Device Mapping: #28
    Memory Device: #23
    Array Mapping: #25
    Row: 1
    Interleave Pos: 0
    Interleaved Depth: 1
    Start Address: 0x00000000
    End Address: 0x01800000
  Type 32 Record: #29
    Data 00: 20 0b 1d 00 00 00 00 00 00 00 00
  Config Status: cfg=new, avail=yes, need=no, active=unknown
Comment 36 Joop Boonen 2006-02-23 13:28:17 UTC
By the way: 
Size: 24 MB
Is rather strange? Should be 1024 MB? 
Comment 37 Joop Boonen 2006-02-23 13:34:24 UTC
End Address: 0x01800000 is indeed 24 MB. Weird.
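
This can be confirmed with shell arithmetic, treating the address as a byte count as this comment does (that interpretation is an assumption, and the helper name bytes_to_mib is hypothetical). The same check gives 512 for the 0x20000000 grain value discussed in comment 34:

```shell
# Convert a byte count (decimal or 0x-prefixed hex) to whole MiB.
bytes_to_mib() {
    echo $(( $1 / 1024 / 1024 ))
}

bytes_to_mib 0x01800000   # prints 24
bytes_to_mib 0x20000000   # prints 512
```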
Comment 38 Joop Boonen 2006-02-23 17:02:14 UTC
Created attachment 70010
Latest hwinfo
Comment 39 Joop Boonen 2006-02-23 22:45:08 UTC
I did a lot of tests:
1 DIMM in dimm0
the other DIMM in dimm0
1 DIMM in dimm4
Nothing worked; I get exactly the same error. Would this mean that both DIMMs are broken?

The only thing that worked was switching ECC off altogether. This isn't what I really want.

I have a feeling ECC handling isn't implemented correctly in EDAC for the AMD762.

Tonight I will run the memory test.
Comment 40 Joop Boonen 2006-02-24 05:39:05 UTC
I ran the memory test overnight: 6 passes in 6:45 hours. No errors.
Comment 41 Joop Boonen 2006-02-24 08:03:42 UTC
I have been looking on the internet and found the information below.

Maybe this option would be useful for investigating the problem?
echo "0" >/sys/devices/system/edac/mc/panic_on_ue

Is the following done in openSUSE: echo "1" >/sys/devices/system/edac/mc/panic_on_ue ? This might cause the kernel panic. Or is this set to 1 as soon as the EDAC modules are loaded? Do you know at which stage the EDAC modules are loaded?

DIRECTORY 'mc'

In directory 'mc' are EDAC system overall control and attribute files:


Panic on UE control file:

	'panic_on_ue'

	An uncorrectable error will cause a machine panic.  This is usually
	desirable.  It is a bad idea to continue when an uncorrectable error
	occurs - it is indeterminate what was uncorrected and the operating
	system context might be so mangled that continuing will lead to further
	corruption. If the kernel has MCE configured, then EDAC will never
	notice the UE.

	LOAD TIME: module/kernel parameter: panic_on_ue=[0|1]

	RUN TIME:  echo "1" >/sys/devices/system/edac/mc/panic_on_ue


Log UE control file:

	'log_ue'

	Generate kernel messages describing uncorrectable errors.  These errors
	are reported through the system message log system.  UE statistics
	will be accumulated even when UE logging is disabled.

	LOAD TIME: module/kernel parameter: log_ue=[0|1]

	RUN TIME: echo "1" >/sys/devices/system/edac/mc/log_ue


Log CE control file:

	'log_ce'

	Generate kernel messages describing correctable errors.  These
	errors are reported through the system message log system.
	CE statistics will be accumulated even when CE logging is disabled.

	LOAD TIME: module/kernel parameter: log_ce=[0|1]

	RUN TIME: echo "1" >/sys/devices/system/edac/mc/log_ce


Polling period control file:

	'poll_msec'

	The time period, in milliseconds, for polling for error information.
	Too small a value wastes resources.  Too large a value might delay
	necessary handling of errors and might loose valuable information for
	locating the error.  1000 milliseconds (once each second) is about
	right for most uses.

	LOAD TIME: module/kernel parameter: poll_msec=[0|1]

	RUN TIME: echo "1000" >/sys/devices/system/edac/mc/poll_msec


Module Version read-only attribute file:

	'mc_version'

	The EDAC CORE modules's version and compile date are shown here to
	indicate what EDAC is running.
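
To see what these control files are currently set to, something like the following could be used. It is only a sketch: the directory and file names come from the documentation quoted above, the function name show_edac_controls is made up, and the directory will only contain these files on a kernel where edac_mc exposes them.

```shell
# Print the current value of each EDAC control file named in the quoted
# documentation. The directory can be overridden by passing it as $1.
show_edac_controls() {
    dir="${1:-/sys/devices/system/edac/mc}"
    for f in panic_on_ue log_ue log_ce poll_msec; do
        if [ -r "$dir/$f" ]; then
            printf '%s=%s\n' "$f" "$(cat "$dir/$f")"
        else
            printf '%s=<not available>\n' "$f"
        fi
    done
}

show_edac_controls
```

Files that are missing or unreadable are reported as not available instead of causing an error, so the loop is safe to run before the modules are loaded.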
Comment 42 Olaf Kirch 2006-02-24 09:10:33 UTC
I'm aware of the documentation :)

Yes, it may help to try booting the kernel with edac_mc.panic_on_ue=0

But I'm afraid this is just a workaround; something else is broken.
Anyway, please give it a try
Comment 43 Joop Boonen 2006-02-28 09:18:37 UTC
Created attachment 70592
The error I get with boot option: edac_mc.panic_on_ue=0
Comment 44 Joop Boonen 2006-02-28 15:16:51 UTC
I've sent mail to Tyan, kernel-smp and the bluesmoke mailing list.

I got a reply from Tyan; it's not promising yet. I have now asked for a test tool to test the ECC functionality. No answer yet.
Comment 45 Joop Boonen 2006-02-28 15:23:54 UTC
The error I get, as text:

EDAC MC0: UE page 0x0, offset 0x0, grain 536870912, row 0, labels "": AMD762
Kernel panic - not syncing: EDAC MC0: UE page 0x0, offset 0x0, grain
536870912, row 0, labels "": AMD762

 Badness in smp_call_function at arch/i386/kernel/smp.c:595
 [<c0112968>] smp_call_function+0x52/0xc0
 [<c0120c13>] printk+0x14/0x18
 [<c01129e9>] smp_send_stop+0x13/0x1c
 [<c01202d9>] panic+0x5d/0xec
 [<f97389d6>] edac_mc_handle_ue+0x12a/0x13e [edac_mc]
 [<c011a0aa>] move_tasks+0x1d5/0x212
 [<c0294330>] _spin_unlock_irq+0x5/0xa7
 [<c0238f2c>] pci_conf1_write+0x99/0xa7
 [<c0239faa>] pci_write+0x1d/0x22
 [<f95ca2e0>] amd76x_check+0xc3/0xf7 [amd76x_edac]
 [<f97387db>] do_edac_check+0x19/0xaf [edac_mc]
 [<f9738c55>] edac_kernel_thread+0x3a/0x92 [edac_mc]
 [<f9738c1b>] edac_kernel_thread+0x0/0x92 [edac_mc]
 [<c0102005>] kernel_thread_helper+0x5/0xb

REMARK: I typed this in by hand from the screen, so there might be typos in it.
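
For anyone searching logs for the same message, the fields of the UE line can be pulled out with sed. This is a rough sketch based only on the message format shown above; the function name parse_edac_ue is hypothetical.

```shell
# Extract page, offset, grain and row from an EDAC "UE" message line.
# Prints nothing if the line does not look like a UE message.
parse_edac_ue() {
    printf '%s\n' "$1" | sed -n \
        's/.*UE page \(0x[0-9a-fA-F]*\), offset \(0x[0-9a-fA-F]*\), grain \([0-9]*\), row \([0-9]*\),.*/page=\1 offset=\2 grain=\3 row=\4/p'
}

parse_edac_ue 'EDAC MC0: UE page 0x0, offset 0x0, grain 536870912, row 0, labels "": AMD762'
```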
Comment 46 Joop Boonen 2006-03-01 07:46:05 UTC
The edac development team is also looking into this. Link: http://sourceforge.net/mailarchive/forum.php?forum_id=43090 . It's not synced yet. :-(
Comment 47 Joop Boonen 2006-03-01 08:23:38 UTC
By the way, the boot option edac_mc.panic_on_ue=0 isn't enough; that's why it still panicked. Instead, this line should be added to /etc/modprobe.conf.local:
options edac_mc panic_on_ue=0

The bluesmoke development team taught me this.
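
A sketch of adding that line without duplicating it on repeated runs. The file path and option line are the ones given in this comment; the helper name add_edac_option is made up, and writing to the real file needs root.

```shell
# Append the edac_mc option line to a modprobe config file, but only if the
# exact line is not already there. The target file can be overridden via $1.
add_edac_option() {
    conf="${1:-/etc/modprobe.conf.local}"
    line='options edac_mc panic_on_ue=0'
    grep -qxF "$line" "$conf" 2>/dev/null || printf '%s\n' "$line" >> "$conf"
}

# Example (as root): add_edac_option
```

The setting should only take effect the next time the edac_mc module is loaded, so a reboot or module reload would be needed afterwards.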
Comment 52 Chris L Mason 2006-03-10 13:32:27 UTC
-> gregkh for monitoring
Comment 53 Greg Kroah-Hartman 2006-03-10 18:00:15 UTC
I've disabled it from the build now.

Also, based on upstream comments, this code _really_ needs work and might just
be removed from 2.6.16 entirely.