Bug 353252

Summary: Spamassassin sa-learn causing kernel errors
Product: [openSUSE] openSUSE 10.3 Reporter: David Rankin <drankinatty>
Component: KernelAssignee: Jeff Mahoney <jeffm>
Status: RESOLVED FIXED QA Contact: E-mail List <qa-bugs>
Severity: Major    
Priority: P5 - None    
Version: Final   
Target Milestone: ---   
Hardware: i686   
OS: openSUSE 10.3   
Whiteboard:
Found By: --- Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---
Attachments: sa-learn script output showing perl errors
/var/logmessages
syslog from reboot 15/01/08 w/fetchmail disabled

Description David Rankin 2008-01-11 15:50:00 UTC
The sa-learn part of spam assassin is causing kernel errors when run under openSuSE 10.3. The error and backtrace and calltrace are shown in the error log and output shown below:

Jan 10 13:00:02 bonza /usr/sbin/cron[9911]: (david) CMD (/home/david/linux/scripts/Learn_as_spam_cron)
Jan 10 13:00:02 bonza imapd[9931]: PAM audit_log_acct_message() failed: Operation not permitted
Jan 10 13:00:06 bonza kernel: BUG: unable to handle kernel NULL pointer dereference at virtual address 00000044
Jan 10 13:00:06 bonza kernel:  printing eip:
Jan 10 13:00:06 bonza kernel: c02c5027
Jan 10 13:00:06 bonza kernel: *pde = 00000000
Jan 10 13:00:06 bonza kernel: Oops: 0002 [#2]
Jan 10 13:00:06 bonza kernel: SMP 
Jan 10 13:00:06 bonza kernel: last sysfs file: /class/input/input4/event4/dev
Jan 10 13:00:06 bonza kernel: Modules linked in: joydev st ide_disk ide_cd ide_core af_packet iptable_filter ip_tables ip6_tables x_tables apparmor loop 8139cp 8139too mii button sr_mod i2c_nforce2 shpchp nvidia_agp cdrom rtc_cmos rtc_core rtc_lib pci_hotplug agpgart i2c_core sg dm_mirror dm_log ehci_hcd sd_mod ohci_hcd usbcore dm_snapshot dm_mod edd ext3 mbcache jbd fan sata_nv pata_amd libata scsi_mod thermal processor
Jan 10 13:00:06 bonza kernel: CPU:    0
Jan 10 13:00:06 bonza kernel: EIP:    0060:[<c02c5027>]    Tainted: G      N VLI
Jan 10 13:00:06 bonza kernel: EFLAGS: 00210286   (2.6.22.13-0.3-default #1)
Jan 10 13:00:06 bonza kernel: EIP is at _spin_lock+0x1/0xf
Jan 10 13:00:06 bonza kernel: eax: 00000044   ebx: d4cdf978   ecx: d4cdf888   edx: 00000001
Jan 10 13:00:06 bonza kernel: esi: 00000000   edi: 0000004e   ebp: d4cdf930   esp: d940dd84
Jan 10 13:00:06 bonza kernel: ds: 007b   es: 007b   fs: 00d8  gs: 0033  ss: 0068
Jan 10 13:00:06 bonza kernel: Process sa-learn (pid: 10213, ti=d940c000 
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
task=ce19bab0 task.ti=d940c000)
Jan 10 13:00:06 bonza kernel: Stack: c018de77 c1990b40 c1990b60 c1990b80 c1990ba0 c1990bc0 d4cdf978 d4cdf888 
Jan 10 13:00:06 bonza kernel:        00000000 0000004e 0000004e c0180a14 00000080 d4cdfa7c e23cdabc 00019ed8 
Jan 10 13:00:06 bonza kernel:        dfffeaa0 00000096 000280d2 c015add0 00004950 000024a8 00000000 00cf6c00 
Jan 10 13:00:06 bonza kernel: Call Trace:
Jan 10 13:00:06 bonza kernel:  [<c018de77>] remove_inode_buffers+0x3b/0xc8
Jan 10 13:00:06 bonza kernel:  [<c0180a14>] shrink_icache_memory+0xc5/0x1b6
Jan 10 13:00:06 bonza kernel:  [<c015add0>] shrink_slab+0xde/0x13b
Jan 10 13:00:06 bonza kernel:  [<c015b6b8>] try_to_free_pages+0x173/0x247
Jan 10 13:00:06 bonza kernel:  [<c015731a>] __alloc_pages+0x1a1/0x2d6
Jan 10 13:00:06 bonza kernel:  [<c0162361>] find_mergeable_anon_vma+0x5e/0xb2
Jan 10 13:00:06 bonza kernel:  [<c015f61a>] __handle_mm_fault+0x482/0xa83
Jan 10 13:00:06 bonza kernel:  [<c011c854>] kunmap_atomic+0x54/0x7c
Jan 10 13:00:06 bonza kernel:  [<c011c860>] kunmap_atomic+0x60/0x7c
Jan 10 13:00:06 bonza kernel:  [<c0137c62>] getnstimeofday+0x30/0xbe
Jan 10 13:00:06 bonza kernel:  [<c0161668>] vma_adjust+0x1ca/0x360
Jan 10 13:00:06 bonza kernel:  [<c02c67ca>] do_page_fault+0x26c/0x5a6
Jan 10 13:00:06 bonza kernel:  [<c02c655e>] do_page_fault+0x0/0x5a6
Jan 10 13:00:06 bonza kernel:  [<c02c5362>] error_code+0x72/0x80
Jan 10 13:00:06 bonza kernel:  =======================
Jan 10 13:00:06 bonza kernel: Code: 05 90 ff 00 30 d2 89 d0 c3 89 c2 90 81 28 00 00 00 01 0f 94 c0 84 c0 b9 01 00 00 00 75 09 90 81 02 00 00 00 01 30 c9 89 c8 c3 90 <fe> 08 79 09 f3 90 80 38 00 7e f9 eb f2 c3 90 81 28 00 00 00 01 
Jan 10 13:00:06 bonza kernel: EIP: [<c02c5027>] _spin_lock+0x1/0xf SS:ESP 0068:d940dd84

  The PAM error is not the cause of the sa-learn failure. See:

https://bugzilla.novell.com/show_bug.cgi?id=331683

  The output suggests that there is a perl error causing the mess. I will attach the file for your review.

  The installed packages are:

09:47 bonza~> uname -r
2.6.22.13-0.3-default

09:47 bonza~> rpm -qa | grep spamassassin
perl-spamassassin-3.2.3-11.1
spamassassin-3.2.3-11.1

   The script calling sa-learn is:

#!/bin/bash
/usr/bin/fetchmail -a -s -n --folder Mail/spam-learn -m '/usr/bin/sa-learn --spam'

  Please let me know what additional information I can provide and I will be happy to run any tests and provide any additional information you may require.
Comment 1 David Rankin 2008-01-11 15:51:24 UTC
Created attachment 190265 [details]
sa-learn script output showing perl errors
Comment 2 David Rankin 2008-01-15 20:13:27 UTC
Guys,

    It looks like spamassassin starts the cascading error that accumulates over time. Before 13:00:01 after the server restart and ~ 12:40, there were no PAM errors despite the large amount of mail going across the server. You can see the start of the problem in the logs the first time the spamassassin daemon (program) is called on the hour to clean and learn from spam in the Mail/spam-learn imap folder:

Jan 15 13:00:01 bonza /usr/sbin/cron[4029]: (david) CMD (/home/david/linux/scripts/Learn_as_spam_cron)
Jan 15 13:00:25 bonza imapd[4050]: PAM audit_log_acct_message() failed: Operation not permitted
Jan 15 13:01:32 bonza sshd[5039]: Accepted publickey for david from 192.168.7.101 port 41909 ssh2
Jan 15 13:01:38 bonza su: (to root) david on /dev/pts/2
Jan 15 13:05:01 bonza /usr/sbin/cron[5100]: (drr) CMD (/usr/local/bin/Learn_as_spam_cron)
Jan 15 13:05:03 bonza imapd[5102]: PAM audit_log_acct_message() failed: Operation not permitted
Jan 15 13:06:53 bonza imapd[5333]: PAM audit_log_acct_message() failed: Operation not permitted
Jan 15 13:06:55 bonza imapd[5329]: PAM audit_log_acct_message() failed: Operation not permitted
Jan 15 13:06:55 bonza imapd[5435]: PAM audit_log_acct_message() failed: Operation not permitted
Jan 15 13:06:56 bonza imapd[5440]: PAM audit_log_acct_message() failed: Operation not permitted

    I am including the syslog (/var/log/messages) for approximately the 20 hours before this problem cascaded into a kernel melt-down. Hopefully, there will be something useful in the log. The pertinent part of the log to look at is the Learn_as_spam_cron script entries that are followed by the PAM error that appear to ultimately cascade into kernel errors over the course of a day. Before this meltdown, the server had been up ~3 days.

    For the record, the permission on the script are:

root@bonza:/home/david/linux/error # l ../scripts/
total 16
drwxr-xr-x 2 david users 4096 2008-01-06 20:47 ./
drwxr-xr-x 7 david users 4096 2008-01-12 19:28 ../
-rwxr----- 1 david users   95 2008-01-06 20:47 Learn_as_spam_cron*


    I'll attach the log after this post goes through. If you want/need anything else or want me to run more tests let me know.
Comment 3 David Rankin 2008-01-15 20:20:23 UTC
Created attachment 190615 [details]
/var/logmessages

Hope this helps!
Comment 4 David Rankin 2008-01-16 04:08:46 UTC
    The process that generates the error is fetchmail, but the error is a result of authentication against the UW imap server:

/usr/bin/fetchmail -a -v -n --folder Mail/spam-learn --ssl --sslcert imap.key.pub
fetchmail: 6.3.8 querying mail.******.com (protocol IMAP) at Tue 15 Jan 2008 09:55:20 PM CST: poll started
Trying to connect to 192.168.7.15/993...connected.
fetchmail: Issuer Organization: RankinLaw
fetchmail: Issuer CommonName: mail.******.com
fetchmail: Server CommonName: mail.******.com
fetchmail: mail.rbpllc.com key fingerprint: BD:9C:2D:7A:DB:<snipped>
fetchmail: Server certificate verification error: self signed certificate
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
fetchmail: IMAP< * OK [CAPABILITY IMAP4REV1 LITERAL+ SASL-IR LOGIN-REFERRALS AUTH=PLAIN AUTH=LOGIN] bonza IMAP4rev1 2006c.374 at Tue, 15 Jan 2008 21:55:20 -0600 (CST)
fetchmail: IMAP> A0001 CAPABILITY
fetchmail: IMAP< * CAPABILITY IMAP4REV1 LITERAL+ IDLE UIDPLUS NAMESPACE MAILBOX-REFERRALS BINARY UNSELECT SCAN SORT THREAD=REFERENCES THREAD=ORDEREDSUBJECT MULTIAPPEND SASL-IR LOGIN-REFERRALS AUTH=PLAIN AUTH=LOGIN
fetchmail: IMAP< A0001 OK CAPABILITY completed
fetchmail: IMAP> A0002 LOGIN "david" *
fetchmail: IMAP< A0002 OK [CAPABILITY IMAP4REV1 LITERAL+ IDLE UIDPLUS NAMESPACE MAILBOX-REFERRALS BINARY UNSELECT SCAN SORT THREAD=REFERENCES THREAD=ORDEREDSUBJECT MULTIAPPEND] User david authenticated
fetchmail: IMAP> A0003 SELECT "Mail/spam-learn"
fetchmail: IMAP< * 0 EXISTS
fetchmail: IMAP< * 0 RECENT
fetchmail: IMAP< * OK [UIDVALIDITY 1168962957] UID validity status
fetchmail: IMAP< * OK [UIDNEXT 3238] Predicted next UID
fetchmail: IMAP< * FLAGS (Junk NonJunk \Answered \Flagged \Deleted \Draft \Seen)
fetchmail: IMAP< * OK [PERMANENTFLAGS (Junk NonJunk \* \Answered \Flagged \Deleted \Draft \Seen)] Permanent flags
fetchmail: IMAP< A0003 OK [READ-WRITE] SELECT completed
fetchmail: No mail for david at mail.*****.com (folder Mail/spam-learn)
fetchmail: IMAP> A0004 LOGOUT
fetchmail: IMAP< * BYE bonza.*****.com IMAP4rev1 server terminating connection
fetchmail: IMAP< A0004 OK LOGOUT completed
fetchmail: 6.3.8 querying mail.******.com (protocol IMAP) at Tue 15 Jan 2008 09:55:20 PM CST: poll completed
fetchmail: normal termination, status 1

    If run fetchmail with --sslcertck, authentication fails, but no PAM errors are generated:

21:41 bonza~/linux/scripts/test> /usr/bin/fetchmail -a -v -n --folder Mail/spam-learn --ssl --sslcert imap.key.pub --sslcertck
fetchmail: 6.3.8 querying mail.*****.com (protocol IMAP) at Tue 15 Jan 2008 09:41:43 PM CST: poll started
Trying to connect to 192.168.7.15/993...connected.
fetchmail: Issuer Organization: RankinLaw
fetchmail: Issuer CommonName: mail.******.com
fetchmail: Server CommonName: mail.******.com
fetchmail: mail.rbpllc.com key fingerprint: BD:9C:2D:7A:DB:19:0A:
fetchmail: Server certificate verification error: self signed certificate
12789:error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed:s3_clnt.c:894:
fetchmail: SSL connection failed.
fetchmail: socket error while fetching from david@mail.*****.com
fetchmail: 6.3.8 querying mail.*****.com (protocol IMAP) at Tue 15 Jan 2008 09:41:43 PM CST: poll completed
fetchmail: Query status=2 (SOCKET)
fetchmail: normal termination, status 2

    This tells us what process is causing the imapd PAM error, but does not answer why it causes kernel errors or a breakdown in the kernel protections. Something in the interchange is violating somebody else's space. It is repeatable 100% of the time. Just run fetchmail against a UW imap setup with a self-signed certificate. After fetchmail is run, subsequent imap mail server access by any mail client will then generate the imapd PAM when the folders are accessed.

    These may be a few more clues to help. I hope. Again, if you need more information, just ask. Thanks.
Comment 5 David Rankin 2008-01-16 16:18:53 UTC
Damn!,

    I hate being wrong. fetchmail may be "one of the causes" for the imap PAM error, but it is *not* the "sole cause" of the problem. For the previous post, I disabled all calls to fetchmail on the system and rebooted and watched the logs for a while and had no errors. I check the logs again this morning and they are flooded with imapd PAM errors. AARGH! Seriously, I've attached another log that covers the time from reboot last night to present. Hope this help!

    Again, if there is anything I can do, just ask.
Comment 6 David Rankin 2008-01-16 16:22:15 UTC
Created attachment 190730 [details]
syslog from reboot 15/01/08 w/fetchmail disabled

After identifying fetchmail as one of the causes of the imapd PAM error, fetchmail was disabled (all scripts removed that call it) and the system was rebooted.

100s of imapd PAM error still occur. This is the syslog from reboot on 15/01/08 until resent.
Comment 7 David Rankin 2008-01-16 16:36:59 UTC
Guys,

    I may have had another stroke of genius! (or confusion from an actual stroke),  but could this be related to the fact I am running a bios raid? For some strange reason, it looks like smartd is triggering the imapd PAM errors in the latest logs, could the nvidia BIOS raid be causing the PAM error as it attempts to mirror the imapd transactions regardless of what process actually created the imapd action?

    Just another SWAG (swinging wild ass guess) with a bit of logic applied or totally misapplied as the case may be. 
Comment 8 David Rankin 2008-01-26 07:55:35 UTC
Guys,

    I have captured another crash and trace after the sa-lean process was called. Hopefully this little additional information will help. Here is the information:

Jan 25 22:00:01 bonza /usr/sbin/cron[12116]: (david) CMD (/home/david/linux/scripts/Learn_as_spam_cron)
Jan 25 22:00:07 bonza syslog-ng[2830]: SIGHUP received, restarting syslog-ng
Jan 25 22:00:08 bonza syslog-ng[2830]: new configuration initialized
Jan 25 22:00:57 bonza su: (to nobody) root on none
Jan 25 22:01:08 bonza syslog-ng[2830]: last message repeated 6 times
Jan 25 22:02:00 bonza kernel: BUG: unable to handle kernel paging request at virtual address 00009100
Jan 25 22:02:00 bonza kernel:  printing eip:
Jan 25 22:02:00 bonza kernel: c017e889
Jan 25 22:02:00 bonza kernel: *pde = 00000000
Jan 25 22:02:00 bonza kernel: Oops: 0000 [#1]
Jan 25 22:02:00 bonza kernel: SMP 
Jan 25 22:02:00 bonza kernel: last sysfs file: /class/input/input4/event4/dev
Jan 25 22:02:00 bonza kernel: Modules linked in: joydev nls_utf8 cifs af_packet iptable_filter ip_tables ip6_tables x_tables apparmor loop 8139cp 8139too mii button rtc_cmos rtc_core rtc_lib nvidia_agp agpgart i2c_nforce2 i2c_core shpchp pci_hotplug sr_mod cdrom sg dm_mirror dm_log sd_mod ehci_hcd ohci_hcd usbcore dm_snapshot dm_mod edd ext3 mbcache jbd fan sata_nv pata_amd libata scsi_mod thermal processor
Jan 25 22:02:00 bonza kernel: CPU:    0
Jan 25 22:02:00 bonza kernel: EIP:    0060:[<c017e889>]    Tainted: G      N VLI
Jan 25 22:02:00 bonza kernel: EFLAGS: 00010206   (2.6.22.13-0.3-default #1)
Jan 25 22:02:00 bonza kernel: EIP is at __d_lookup+0xb9/0xd9
Jan 25 22:02:00 bonza kernel: eax: d8973cb5   ebx: d4cdfbe8   ecx: 00000011   edx: c2019180
Jan 25 22:02:00 bonza kernel: esi: c04b36bc   edi: 00009100   ebp: c04d825c   esp: c2ca3db8
Jan 25 22:02:00 bonza kernel: ds: 007b   es: 007b   fs: 00d8  gs: 0033  ss: 0068
Jan 25 22:02:00 bonza kernel: Process find (pid: 12436, ti=c2ca2000 task=c4096570 task.ti=c2ca2000)
Jan 25 22:02:00 bonza kernel: Stack: c2ca3e3c 0000000a d8973cb5 c9ce3000 d8973cb5 c04b36bc c9ce3000 c2ca3f04 
Jan 25 22:02:00 bonza kernel:        c0175f80 c2ca3e48 c2ca3e3c c2ca3f04 dfff1ca0 d8973cb5 c04b36bc c9ce3000 
Jan 25 22:02:00 bonza kernel:        c2ca3f04 c0177eb5 d34020ac 0000000c c9ce300a dfff1d00 00000000 db1268f0 
Jan 25 22:02:00 bonza kernel: Call Trace:
Jan 25 22:02:00 bonza kernel:  [<c0175f80>] do_lookup+0x24/0x140
Jan 25 22:02:00 bonza kernel:  [<c0177eb5>] __link_path_walk+0x7c9/0xc26
Jan 25 22:02:00 bonza kernel:  [<c018237d>] mntput_no_expire+0x13/0x6c
Jan 25 22:02:00 bonza kernel:  [<c01783bb>] link_path_walk+0xa9/0xb3
Jan 25 22:02:00 bonza kernel:  [<c0178356>] link_path_walk+0x44/0xb3
Jan 25 22:02:00 bonza kernel:  [<c0178644>] do_path_lookup+0x168/0x1cc
Jan 25 22:02:00 bonza kernel:  [<c0177413>] getname+0x59/0xab
Jan 25 22:02:00 bonza kernel:  [<c0178e29>] __user_walk_fd+0x2f/0x40
Jan 25 22:02:00 bonza kernel:  [<c01729b1>] vfs_lstat_fd+0x16/0x3d
Jan 25 22:02:00 bonza kernel:  [<c0172a1d>] sys_lstat64+0xf/0x23
Jan 25 22:02:00 bonza kernel:  [<c0104e22>] sysenter_past_esp+0x6b/0xa9
Jan 25 22:02:00 bonza kernel:  =======================
Jan 25 22:02:00 bonza kernel: Code: 54 24 0c e8 26 08 05 00 85 c0 75 14 31 d2 f6 43 04 10 75 05 90 ff 03 89 da b0 01 86 43 08 eb 21 b0 01 86 43 08 8b 3f 85 ff 74 14 <8b> 07 0f 18 00 90 8d 5f f0 8b 44 24 08 39 43 1c 75 e8 eb 84 31 
Jan 25 22:02:00 bonza kernel: EIP: [<c017e889>] __d_lookup+0xb9/0xd9 SS:ESP 0068:c2ca3db8
Jan 25 22:02:00 bonza kernel: klogd 1.4.1, ---------- state change ---------- 
Comment 9 David Rankin 2008-01-31 14:34:41 UTC
Here are a few more log entries. Hopefully they will add another piece to the puzzle:

Jan 31 08:13:05 bonza syslog-ng[2748]: STATS: dropped 0
Jan 31 08:13:30 bonza kernel: VM: killing process sa-learn
Jan 31 08:13:30 bonza kernel: swap_free: Unused swap offset entry 00000040

This machine never swaps:

08:26 bonza~> free -tm
             total       used       free     shared    buffers     cached
Mem:          2027       1790        237          0        482        804
-/+ buffers/cache:        503       1524
Swap:         2125          0       2125
Total:        4153       1790       2362

So why did the kernel have to free an unused swap offset after killing sa-learn? Let me know what else I can send you. Thanks.
Comment 10 David Rankin 2008-02-11 17:37:33 UTC
Guys,

    This may be a pure kernel/Gigabyte issue as opposed to a sa-learn only issue. The sa-learn problem looks like it may the the result of a larger issue concerning kernel stability/instability on the gigabyte platform. I have experienced several additional meltdowns. Some the logs are blank. Some the logs contain useful information. The latest, yesterday involved the simple routine daily mandb process. The following was collected:

Feb 10 04:30:09 bonza kernel: BUG: unable to handle kernel NULL pointer dereference at virtual address 00000044
Feb 10 04:30:09 bonza kernel:  printing eip:
Feb 10 04:30:09 bonza kernel: c02c6157
Feb 10 04:30:09 bonza kernel: *pde = 00000000
Feb 10 04:30:09 bonza kernel: Oops: 0002 [#2]
Feb 10 04:30:09 bonza kernel: SMP 
Feb 10 04:30:09 bonza kernel: last sysfs file: /devices/platform/i2c-adapter/i2c-9191/9191-0290/name
Feb 10 04:30:09 bonza kernel: Modules linked in: af_packet it87 hwmon_vid hwmon i2c_isa iptable_filter ip_tables ip6_tables x_tables apparmor loop 8139cp 8139too mii button sr_mod cdrom shpchp i2c_nforce2 rtc_cmos pci_hotplug rtc_core nvidia_agp rtc_lib i2c_core agpgart sg dm_mirror dm_log sd_mod ehci_hcd ohci_hcd usbcore dm_snapshot dm_mod edd ext3 mbcache jbd fan sata_nv pata_amd libata scsi_mod thermal processor
Feb 10 04:30:09 bonza kernel: CPU:    0
Feb 10 04:30:09 bonza kernel: EIP:    0060:[<c02c6157>]    Tainted: G      N VLI
Feb 10 04:30:09 bonza kernel: EFLAGS: 00010286   (2.6.22.16-0.2-default #1)
Feb 10 04:30:09 bonza kernel: EIP is at _spin_lock+0x1/0xf
Feb 10 04:30:09 bonza kernel: eax: 00000044   ebx: d4cdf978   ecx: d4cdf888   edx: 00000001
Feb 10 04:30:09 bonza kernel: esi: 00000000   edi: 00000039   ebp: d4cdf930   esp: d56dfd84
Feb 10 04:30:09 bonza kernel: ds: 007b   es: 007b   fs: 00d8  gs: 0033  ss: 0068
Feb 10 04:30:09 bonza kernel: Process mandb (pid: 3975, ti=d56de000 task=f7f79570 task.ti=d56de000)
Feb 10 04:30:09 bonza kernel: Stack: c018eee3 00000282 00000246 00000000 dfafba40 dfb77e00 d4cdf978 d4cdf888 
Feb 10 04:30:09 bonza kernel:        00000000 00000039 00000039 c0181a80 00000080 d4cdfc68 d4ccbe94 00027fd8 
Feb 10 04:30:09 bonza kernel:        dfffeaa0 0000008e 000280d2 c015bdfc 00003c90 00001e48 00000000 013fec00 
Feb 10 04:30:09 bonza kernel: Call Trace:
Feb 10 04:30:09 bonza kernel:  [<c018eee3>] remove_inode_buffers+0x3b/0xc8
Feb 10 04:30:09 bonza kernel:  [<c0181a80>] shrink_icache_memory+0xc5/0x1b6
Feb 10 04:30:09 bonza kernel:  [<c015bdfc>] shrink_slab+0xde/0x13b
Feb 10 04:30:09 bonza kernel:  [<c015c6e4>] try_to_free_pages+0x173/0x247
Feb 10 04:30:09 bonza kernel:  [<c015833e>] __alloc_pages+0x1a1/0x2d6
Feb 10 04:30:09 bonza kernel:  [<c0163392>] find_mergeable_anon_vma+0x5e/0xb2
Feb 10 04:30:09 bonza kernel:  [<c0160646>] __handle_mm_fault+0x482/0xa83
Feb 10 04:30:09 bonza kernel:  [<c0170c32>] do_sync_read+0x0/0x10a
Feb 10 04:30:09 bonza kernel:  [<c01632a8>] arch_get_unmapped_area_topdown+0x9b/0x127
Feb 10 04:30:09 bonza kernel:  [<c0162c52>] vma_merge+0x168/0x178
Feb 10 04:30:09 bonza kernel:  [<c02c78fa>] do_page_fault+0x26c/0x5a6
Feb 10 04:30:09 bonza kernel:  [<c02c768e>] do_page_fault+0x0/0x5a6
Feb 10 04:30:09 bonza kernel:  [<c02c6492>] error_code+0x72/0x80
Feb 10 04:30:09 bonza kernel:  =======================
Feb 10 04:30:09 bonza kernel: Code: 05 90 ff 00 30 d2 89 d0 c3 89 c2 90 81 28 00 00 00 01 0f 94 c0 84 c0 b9 01 00 00 00 75 09 90 81 02 00 00 00 01 30 c9 89 c8 c3 90 <fe> 08 79 09 f3 90 80 38 00 7e f9 eb f2 c3 90 81 28 00 00 00 01 
Feb 10 04:30:09 bonza kernel: EIP: [<c02c6157>] _spin_lock+0x1/0xf SS:ESP 0068:d56dfd84
Feb 10 04:30:09 bonza su: (to nobody) root on none
Feb 10 04:30:35 bonza syslog-ng[2748]: last message repeated 3 times

    Let me know what else I can provide. I am not the only user affected by instability on Gigabyte boards. Per Jessen on the openSuSE list is likewise seeing the exact same issues. I am more than happy to provide whatever you need to see if we can solve this. This has happened regardless of the kernel version installed. Currently we have: 

11:35 bonza~> uname -r
2.6.22.16-0.2-default

    Also, if you need/would like direct access to the box, contact me privately and I will be glad to arrange it.
Comment 11 David Rankin 2008-02-11 18:17:12 UTC
Here is a bit more information:

    Since this has occurred with both mandriva and opensuse, on different power supplies and with different ram and hardrives, it is pretty much limited to being a gigabyte motherboard problem. The relevant information is:

Gigabyte GA-N700SL
2 Gig of Ram
Raid1: 2 Seagate (ST3250410AS_6RY28WW1) SATA 250G drives in Raid 1 (md raid)
ATA: Maxtor_5T020H2_T2H34J2 as swap 

   This occurred without the raid as well so that isn't the culprit.
Comment 12 Jeff Mahoney 2008-02-20 20:18:04 UTC
The Oops in the initial description was caused by buffer_mapping (inode->i_data.assoc_mapping) being NULL in remove_inode_buffers(). spin_lock hits it and derefs the NULL pointer to find the lock.

The Oops in comment #3 was caused by the buffer head being NULL, and it's accessing bh->b_data (0x19 = 25). NULL in this case likely means 0x1.

I haven't decoded the others because I'm not really sure how we should proceed here. You're getting Oopses all over the kernel in *core* data paths. These paths are well tested, and more importantly used by every Linux user every day. I know you've said you've tried using different memory, etc - but I'd stil like to see the results of a memtest run.

Also, is this system overclocked?
Comment 13 David Rankin 2008-02-21 04:48:50 UTC
Jeff,

    The system is "bone stock" no overclocking whatsoever. I run all my servers very conservatively. I will install and give a full memory test just to make triple sure. Also, the system, and all of the meltdowns seem to be associated with cron running. The last two meltdowns have been the direct result of the cron.daily script being run. I have set the cron.daily run time to 12:15 (lunch) just to monitor this behavior. The result of the meltdown today is provided below. 

    Even though all of these data paths are well tested, there seems to be a problem with their operation on this Gigabyte board. I'll run memtest, but it seems unlikely that I could have gotten 3 bad sets of memory. Here is the latest explosion:

Feb 20 12:15:07 bonza syslog-ng[2858]: STATS: dropped 0
Feb 20 12:15:38 bonza kernel: BUG: unable to handle kernel paging request at virtual address 03188010
Feb 20 12:15:38 bonza kernel:  printing eip:
Feb 20 12:15:38 bonza kernel: c01890c1
Feb 20 12:15:38 bonza kernel: *pde = 00000000
Feb 20 12:15:38 bonza kernel: Oops: 0000 [#1]
Feb 20 12:15:38 bonza kernel: SMP 
Feb 20 12:15:38 bonza kernel: last sysfs file: /devices/platform/i2c-adapter/i2c-9191/9191-0290/name
Feb 20 12:15:38 bonza kernel: Modules linked in: af_packet it87 hwmon_vid hwmon i2c_isa iptable_filter ip_tables ip6_tables x_tables apparmor loop 8139cp 8139too mii button snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm snd_timer snd ns558 parport_pc parport gameport soundcore nvidia_agp shpchp snd_page_alloc agpgart rtc_cmos pci_hotplug i2c_nforce2 rtc_core sr_mod rtc_lib cdrom i2c_core sg dm_mirror dm_log sd_mod ehci_hcd ohci_hcd usbcore dm_snapshot dm_mod edd ext3 mbcache jbd fan sata_nv pata_amd libata scsi_mod thermal processor
Feb 20 12:15:38 bonza kernel: CPU:    0
Feb 20 12:15:38 bonza kernel: EIP:    0060:[<c01890c1>]    Tainted: G      N VLI
Feb 20 12:15:38 bonza kernel: EFLAGS: 00010202   (2.6.22.17-0.1-default #1)
Feb 20 12:15:38 bonza kernel: EIP is at __writeback_single_inode+0x1af/0x31c
Feb 20 12:15:38 bonza kernel: eax: 03188000   ebx: 00000001   ecx: 00000000   edx: c01589ec
Feb 20 12:15:38 bonza kernel: esi: d4cdfe4c   edi: c21a7600   ebp: d4cdfef4   esp: dfb0def0
Feb 20 12:15:38 bonza kernel: ds: 007b   es: 007b   fs: 00d8  gs: 0000  ss: 0068
Feb 20 12:15:38 bonza kernel: Process pdflush (pid: 182, ti=dfb0c000 task=dff5aab0 task.ti=dfb0c000)
Feb 20 12:15:38 bonza kernel: Stack: 00000002 df8369a0 df8369a0 00000000 dfb0df74 00000000 00000000 00000000 
Feb 20 12:15:38 bonza kernel:        00000002 f8c7483e df8d7ec0 df8d7ec0 00000000 00000002 d4cdfe4c dfb78100 
Feb 20 12:15:38 bonza kernel:        c21a6e00 dfb0df74 c018950f c21a6e7c c21a6e84 02015354 00000000 c21a6e00 
Feb 20 12:15:38 bonza kernel: Call Trace:
Feb 20 12:15:38 bonza kernel:  [<f8c7483e>] dm_table_any_congested+0x33/0x4c [dm_mod]
Feb 20 12:15:38 bonza kernel:  [<c018950f>] sync_sb_inodes+0x16a/0x225
Feb 20 12:15:38 bonza kernel:  [<c018990c>] writeback_inodes+0x6a/0xb2
Feb 20 12:15:39 bonza kernel:  [<c0159445>] wb_kupdate+0x83/0xe5
Feb 20 12:15:39 bonza kernel:  [<c0159825>] pdflush+0x123/0x1c2
Feb 20 12:15:39 bonza kernel:  [<c01593c2>] wb_kupdate+0x0/0xe5
Feb 20 12:15:39 bonza kernel:  [<c0159702>] pdflush+0x0/0x1c2
Feb 20 12:15:39 bonza kernel:  [<c01350ee>] kthread+0x38/0x5e
Feb 20 12:15:39 bonza kernel:  [<c01350b6>] kthread+0x0/0x5e
Feb 20 12:15:39 bonza kernel:  [<c0106117>] kernel_thread_helper+0x7/0x10
Feb 20 12:15:39 bonza kernel:  =======================
Feb 20 12:15:39 bonza kernel: Code: 89 86 38 01 00 00 b0 01 86 05 84 eb 35 c0 8b 54 24 10 89 e8 e8 77 fd fc ff 80 e3 03 89 44 24 14 74 3b 8b 86 9c 00 00 00 8b 40 20 <83> 78 10 00 74 1f 89 f0 e8 ff 94 ff ff 85 c0 75 14 8b 86 9c 00 
Feb 20 12:15:39 bonza kernel: EIP: [<c01890c1>] __writeback_single_inode+0x1af/0x31c SS:ESP 0068:dfb0def0
Feb 20 12:15:39 bonza kernel: klogd 1.4.1, ---------- state change ---------- 
Feb 20 12:15:48 bonza kernel: BUG: unable to handle kernel paging request at virtual address 0318800c
Feb 20 12:15:48 bonza kernel:  printing eip:
Feb 20 12:15:48 bonza kernel: c0189754
Feb 20 12:15:48 bonza kernel: *pde = 00000000
Feb 20 12:15:48 bonza kernel: Oops: 0000 [#2]
Feb 20 12:15:48 bonza kernel: SMP 
Feb 20 12:15:48 bonza kernel: last sysfs file: /devices/platform/i2c-adapter/i2c-9191/9191-0290/name
Feb 20 12:15:48 bonza kernel: Modules linked in: af_packet it87 hwmon_vid hwmon i2c_isa iptable_filter ip_tables ip6_tables x_tables apparmor loop 8139cp 8139too mii button snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm snd_timer snd ns558 parport_pc parport gameport soundcore nvidia_agp shpchp snd_page_alloc agpgart rtc_cmos pci_hotplug i2c_nforce2 rtc_core sr_mod rtc_lib cdrom i2c_core sg dm_mirror dm_log sd_mod ehci_hcd ohci_hcd usbcore dm_snapshot dm_mod edd ext3 mbcache jbd fan sata_nv pata_amd libata scsi_mod thermal processor
Feb 20 12:15:48 bonza kernel: CPU:    0
Feb 20 12:15:48 bonza kernel: EIP:    0060:[<c0189754>]    Tainted: G      N VLI
Feb 20 12:15:48 bonza kernel: EFLAGS: 00010202   (2.6.22.17-0.1-default #1)
Feb 20 12:15:48 bonza kernel: EIP is at __mark_inode_dirty+0x1d/0x16b
Feb 20 12:15:48 bonza kernel: eax: 03188000   ebx: 00000001   ecx: 00000000   edx: 00000001
Feb 20 12:15:48 bonza kernel: esi: d4cdfe4c   edi: 00001000   ebp: d4cdfe4c   esp: ed5bdf0c
Feb 20 12:15:48 bonza kernel: ds: 007b   es: 007b   fs: 00d8  gs: 0033  ss: 0068
Feb 20 12:15:48 bonza kernel: Process mandb (pid: 17139, ti=ed5bc000 task=ef319ab0 task.ti=ed5bc000)
Feb 20 12:15:48 bonza kernel: Stack: d4cdfe4c 223215a6 47bc6e53 c21a7600 c0128f0a d4cdfe4c c21a7600 00001000 
Feb 20 12:15:48 bonza kernel:        00000000 d4cdfe4c 00001000 ed5bc000 c017394f ed5bdf44 d4d05534 dfff18a0 
Feb 20 12:15:48 bonza kernel:        00000000 bfb71228 083ab5a4 00000000 00000001 00000000 000e2ae9 00000000 
Feb 20 12:15:48 bonza kernel: Call Trace:
Feb 20 12:15:48 bonza kernel:  [<c0128f0a>] current_fs_time+0x41/0x46
Feb 20 12:15:48 bonza kernel:  [<c017394f>] sys_readlinkat+0x76/0xa4
Feb 20 12:15:48 bonza kernel:  [<c01739a4>] sys_readlink+0x27/0x2b
Feb 20 12:15:48 bonza kernel:  [<c0104e22>] sysenter_past_esp+0x6b/0xa9
Feb 20 12:15:48 bonza kernel:  =======================
Feb 20 12:15:48 bonza kernel: Code: f7 ff ff b2 01 86 15 84 eb 35 c0 5b 5e c3 55 89 c5 57 56 53 89 d3 83 ec 20 8b 80 9c 00 00 00 f6 c2 03 89 44 24 18 74 0e 8b 40 20 <8b> 50 0c 85 d2 74 04 89 e8 ff d2 f0 83 04 24 00 89 d8 23 85 38 
Feb 20 12:15:48 bonza kernel: EIP: [<c0189754>] __mark_inode_dirty+0x1d/0x16b SS:ESP 0068:ed5bdf0c
Feb 20 12:15:48 bonza su: (to nobody) root on none
Feb 20 12:16:34 bonza syslog-ng[2858]: last message repeated 6 times

(then it dies)
Comment 14 David Rankin 2008-02-28 05:39:45 UTC
Sorry, forgot the check the button to turn off the need status. Also, I'm beginning to suspect a mother board or memory problem. I'm looking into that.
Comment 15 David Rankin 2008-03-04 00:32:55 UTC
You can close this bug.

    Apparently there was a spurious memory address in my ocz platinum RAM that was the culprit. It only appeared under certain loads. I replaced the ram 4 days ago and it is stable. It is hard to believe this was the problem after memtesting the ram less that 3 months ago. This is the first ram I have had just "go bad" and this wasn't cheepo ram either.