Bug 117197 - strange segfaults in conjunction with grep and SMP kernel
Status: RESOLVED FIXED
Alias: None
Product: SUSE LINUX 10.0
Classification: openSUSE
Component: Kernel
Version: RC 4
Hardware: x86-64 All
Priority: P5 - None, Severity: Normal
Target Milestone: ---
Assignee: Hubert Mantel
QA Contact: E-mail List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-09-15 11:40 UTC by Wilken Gottwalt
Modified: 2006-02-27 16:32 UTC

See Also:
Found By: Development
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
Add boot param pdc=off to disable pdc (2.09 KB, patch)
2005-09-22 14:19 UTC, Thomas Renninger

Description Wilken Gottwalt 2005-09-15 11:40:42 UTC
I am getting a lot of segfaults which seem to come from grep. They occur in random
situations.


Sep 15 12:56:31 linux kernel: grep[6262]: segfault at 0000000000000000 rip 00002aaaaaaae05c rsp 00007fffffd200a0 error 4
Sep 15 12:56:31 linux kernel: grep[6273]: segfault at 0000000000000000 rip 00002aaaaaaae05c rsp 00007fffffb92d80 error 4
Sep 15 12:56:31 linux kernel: grep[6284]: segfault at 0000000000000000 rip 00002aaaaaaae05c rsp 00007fffff922ce0 error 4
Sep 15 12:56:31 linux kernel: grep[6322]: segfault at 0000000000000000 rip 00002aaaaaaae05c rsp 00007fffffd3a4d0 error 4
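
For reading the "error 4" field in these messages: on x86 the number is the page fault error code, a bit mask (bit 0 = protection violation vs. page not present, bit 1 = write vs. read, bit 2 = user vs. kernel mode). A minimal sketch that decodes the three low bits; the function name decode_pf_error is made up for illustration and is not part of any tool:

```shell
# Decode the low three bits of an x86 page fault error code,
# as printed after "error" in the kernel's segfault messages.
# decode_pf_error is a hypothetical helper name.
decode_pf_error() {
    code="$1"
    # bit 0: 0 = page not present, 1 = protection violation
    if [ $((code & 1)) -ne 0 ]; then cause="protection violation"; else cause="page not present"; fi
    # bit 1: 0 = read access, 1 = write access
    if [ $((code & 2)) -ne 0 ]; then access="write"; else access="read"; fi
    # bit 2: 0 = kernel mode, 1 = user mode
    if [ $((code & 4)) -ne 0 ]; then mode="user"; else mode="kernel"; fi
    printf '%s-mode %s, %s\n' "$mode" "$access" "$cause"
}
```

Called as "decode_pf_error 4", this reports a user-mode read of a page that is not present, which together with the fault address 0000000000000000 looks like a read through a NULL pointer.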


checking for spec icon... zziplib-icon.png (fallback)
checking for a BSD-compatible install... .././configure: line 2257: 10020
Segmentation fault      grep dspmsg "$as_dir/$ac_prog$ac_exec_ext" >/dev/null 2>&1
/usr/bin/install -c


It always looks like this. It happens often when grep gets called from a script.
It's also not possible to make a testcase for it; it's totally random.
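
Even without a deterministic testcase, a brute-force loop can give a rough idea of how often it happens. A sketch, assuming a shell (such as bash) that reports death by signal as exit status 128 + signal number, so 139 for SIGSEGV; the function name count_sigsegv is made up for illustration:

```shell
# Run a command N times and print how many runs were killed by SIGSEGV.
# Assumes the shell reports "killed by signal S" as exit status 128 + S
# (SIGSEGV is signal 11, hence 139). count_sigsegv is a hypothetical name.
count_sigsegv() {
    runs="$1"; shift
    crashes=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        "$@" >/dev/null 2>&1
        if [ "$?" -eq 139 ]; then
            crashes=$((crashes + 1))
        fi
        i=$((i + 1))
    done
    echo "$crashes"
}
```

For example, "count_sigsegv 1000 grep dspmsg /usr/bin/install" repeats a grep call similar to the one from the configure output above; the file argument is only a placeholder.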

There is a bug which seems to be similar -> 105885.
Comment 1 Wilken Gottwalt 2005-09-15 11:41:26 UTC
It does not happen, if I install a default kernel.
Comment 2 Dr. Werner Fink 2005-09-15 11:51:55 UTC
Are there _more_ segfaults in the /var/log/messages?
This could be a _hardware_ problem, to be exact a
problem of the BIOS.

Compare with bug #106147
Comment 3 Wilken Gottwalt 2005-09-15 12:03:31 UTC
Yes, there are some more, but they all look similar. A new BIOS isn't an
option; I already have the newest.

SL93 runs fine, and so does SLES9. It only happens with the SMP kernel of
SL10RC4.
Comment 4 Mike Fabian 2005-09-15 14:18:58 UTC
I also suspect a hardware or BIOS problem. See also bug #105885.

Strange that it hits grep so often though ...
Comment 5 Mike Fabian 2005-09-15 14:20:35 UTC
By the way, in Wilken's /var/log/messages there were also a few
segfaults from other programs: two Perl scripts (which do *not* call
grep) and a few segfaults from gcc as well.
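
To see at a glance which programs are affected, the segfault lines can be tallied per program name. A sketch, assuming the log uses the one-line format quoted in the description ("... kernel: NAME[PID]: segfault at ..."); it uses awk rather than grep, and the function name count_segfaults is made up for illustration:

```shell
# Count kernel segfault messages per program name.
# Usage: count_segfaults [logfile]   (defaults to /var/log/messages)
# count_segfaults is a hypothetical helper name.
count_segfaults() {
    logfile="${1:-/var/log/messages}"
    awk '/kernel: .*\[[0-9]+\]: segfault at / {
        # the field after "kernel:" looks like "grep[6262]:"
        for (i = 1; i < NF; i++) {
            if ($i == "kernel:") {
                name = $(i + 1)
                sub(/\[[0-9]+\]:$/, "", name)   # strip the "[PID]:" suffix
                count[name]++
                break
            }
        }
    }
    END { for (n in count) print count[n], n }' "$logfile" | sort -rn
}
```

The output is one line per program, with the most frequent crasher first.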

Comment 6 Michael Matz 2005-09-15 14:23:42 UTC
Let's also put Andi in CC.  Is this an Intel or an AMD machine?  In the past
we had many machines with this problem and a BIOS update seemed
to have solved it.  But now we again get reports from people who already
have the newest BIOS but still see these random segfaults, which can't
be reproduced with e.g. a 9.3 kernel.  IIRC Thomas Renninger even found
a set of four patches in the kernel which triggered them, so CCing
him too.
Comment 8 Thomas Renninger 2005-09-15 15:27:32 UTC
It was introduced somewhere between the 9.3 Gold and the 9.3 YOU update kernel.
I tried to narrow down the set of patches, but did not get far -> often the
machine needed extra boot params or did not boot at all. After BIOS updates
helped I didn't look any further...
If this becomes a bigger problem, someone with more knowledge of memory
management should have a look at the 9.3 patches; there are not many that could
trigger the problem, I think.

A BIOS update (if available) will help.

Olaf/Andi: If you think it's worth testing the patch from #72919 -> There
is still one machine for which no new BIOS is available and which still shows
the segfault ... error 4 messages. Tell me and I'll give it a try.
Comment 10 Thomas Renninger 2005-09-15 20:09:43 UTC
The patch from okir in #72919 (disabling PDC) makes the segfaults disappear.
Booting is enough to get segfaults on *stravinsky* (with pdc off, it compiled
kdebase3 and updated to RC4 without any)

We should invert the logic (enable pdc by default) as there is a BIOS update and
according to AMD (#72919 comment #1) there could be performance regressions of up
to 2%. Still, there are no BIOS updates for some machines (e.g. the FSC CELSIUS -
stravinsky, and sf also has such a machine without a BIOS update AFAIK).

-> Assigning to okir as he already made up this patch.

I tried to identify bad patches by removing them from the 9.3 YOU update kernel
(as 9.3 goldmaster worked)... Could it be that a patch was reverted for some
reason, rather than a bad one added?!?
I had a quick grep through the changelog, but couldn't find anything related...
Comment 11 Wilken Gottwalt 2005-09-16 05:02:26 UTC
Ah, I forgot: my machine is a dual Opteron 244. The board is a kind of prototype
(RioWorks HDAMB Rev.E). And like I said, a new BIOS isn't an option. I have the
newest one from their page, but maybe someone can get a new/special one which
isn't public.
Comment 13 Wilken Gottwalt 2005-09-17 08:48:10 UTC
Thomas Renninger made a kernel where PDC is off by default. This kernel works
perfectly.
Comment 14 JP Rosevear 2005-09-22 13:59:33 UTC
Where is this kernel to try?  I've seen similar problems on my x86_64 SMP machine.
Comment 15 Thomas Renninger 2005-09-22 14:19:54 UTC
Created attachment 50658
Add boot param pdc=off to disable pdc

Can this one be added to the SL 10.0 kernel, please?
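
When testing a kernel with this patch, it helps to confirm which parameters the running kernel was actually booted with. A sketch that looks for an exact parameter in /proc/cmdline; the function name has_boot_param is made up for illustration, and whether pdc=off has any effect depends on the patch being applied:

```shell
# Succeed if the kernel command line contains the given parameter verbatim.
# Usage: has_boot_param pdc=off [cmdline-file]   (defaults to /proc/cmdline)
# has_boot_param is a hypothetical helper name.
has_boot_param() {
    param="$1"
    cmdline="${2:-/proc/cmdline}"
    # pad with spaces so only whole, space-separated words match
    case " $(cat "$cmdline") " in
        *" $param "*) return 0 ;;
    esac
    return 1
}
```

For example: has_boot_param pdc=off && echo "booted with pdc=off"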
Comment 16 Thomas Renninger 2005-09-22 14:21:44 UTC
I just used Okir's patch and inverted the logic. Not even compile tested, but
should work ...
Comment 17 Pavel Janik 2005-11-09 17:02:04 UTC
OK, I have the same problem on a dual Opteron machine.

grep from SL10.0 works with only one processor (second one is physically pulled out of the motherboard).

grep from SL10.0 randomly sigsegvs as stated above.

I ended up using grep from SL9.3. It works in both uni and SMP mode.

I can provide any details needed, because it is my testing machine.
Comment 18 Andreas Kleen 2005-11-10 02:42:58 UTC
This is all the TLB flush filter bug. Either update your BIOS
or wait for the next kernel update, which will disable it.
I don't know when that will be.
Comment 19 Peter Gunreben 2006-02-27 16:32:00 UTC
Unfortunately, I have the same problem with grep. My hardware is a dual Opteron (2GHz) on an MSI K8D-Master3. The pdc=0 kernel parameter doesn't help. However, "echo 0 > /proc/sys/kernel/randomize_va_space" helps.
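
For testing, the old setting can be remembered so that it is easy to put back afterwards. A sketch, to be run as root; the function name set_va_randomization is made up for illustration, and the optional second argument exists only so the function can be tried on an ordinary file:

```shell
# Write a new value to randomize_va_space and print the previous one.
# Usage: old=$(set_va_randomization 0)   # needs root for the real /proc file
# set_va_randomization is a hypothetical helper name.
set_va_randomization() {
    value="$1"
    knob="${2:-/proc/sys/kernel/randomize_va_space}"
    old=$(cat "$knob") || return 1      # remember the current setting
    echo "$value" > "$knob" || return 1 # apply the new one
    echo "$old"
}
```

After the test run, set_va_randomization "$old" restores the saved value.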

As I'm not really satisfied with the required kernel reconfig, I've tried to find the root cause of this problem. So far, I have noticed that the segfault doesn't occur if I omit the -O2 in the CFLAGS of grep.spec.

In grep.spec the configure command starts with 
./configure CFLAGS="$RPM_OPT_FLAGS" ...
which expands into 
./configure CFLAGS="-O2 -g -fmessage-length=0 -D_FORTIFY_SOURCE=2" ...
If I omit the "-O2", namely
./configure CFLAGS="-g -fmessage-length=0 -D_FORTIFY_SOURCE=2" ...
the problem is gone.
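
To repeat this comparison without editing the flags by hand, a single flag can be filtered out of the flag list. A sketch; the function name strip_flag is made up for illustration and is not part of rpm or configure:

```shell
# Print the given flags with every exact occurrence of one flag removed.
# Usage: strip_flag FLAG_TO_DROP FLAG...
# strip_flag is a hypothetical helper name.
strip_flag() {
    drop="$1"; shift
    out=""
    for f in "$@"; do
        # keep every flag except the one to drop, joined by single spaces
        [ "$f" = "$drop" ] || out="${out:+$out }$f"
    done
    printf '%s\n' "$out"
}
```

It could then be used as ./configure CFLAGS="$(strip_flag -O2 $RPM_OPT_FLAGS)" ... where $RPM_OPT_FLAGS is left unquoted on purpose so that it splits into separate flags.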

Hmmm, I'm not sure whether this hint is of any help.