Bugzilla – Bug 117197
strange segfaults in conjunction with grep and SMP kernel
Last modified: 2006-02-27 16:32:00 UTC
I am getting a lot of segfaults which seem to belong to grep. They occur in random situations.

Sep 15 12:56:31 linux kernel: grep[6262]: segfault at 0000000000000000 rip 00002aaaaaaae05c rsp 00007fffffd200a0 error 4
Sep 15 12:56:31 linux kernel: grep[6273]: segfault at 0000000000000000 rip 00002aaaaaaae05c rsp 00007fffffb92d80 error 4
Sep 15 12:56:31 linux kernel: grep[6284]: segfault at 0000000000000000 rip 00002aaaaaaae05c rsp 00007fffff922ce0 error 4
Sep 15 12:56:31 linux kernel: grep[6322]: segfault at 0000000000000000 rip 00002aaaaaaae05c rsp 00007fffffd3a4d0 error 4

checking for spec icon... zziplib-icon.png (fallback)
checking for a BSD-compatible install... .././configure: line 2257: 10020 Segmentation fault grep dspmsg "$as_dir/$ac_prog$ac_exec_ext" >/dev/null 2>&1
/usr/bin/install -c

It always looks like this. It happens often when grep gets called from a script. It's also not possible to make a testcase for it; it's totally random. There is a bug which seems to be similar: bug #105885.
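For anyone triaging similar logs, the fields in these kernel lines can be pulled apart mechanically. The following is only a rough sketch and not part of the original report: the function names are hypothetical, it assumes the line format quoted above, that the kernel prints the error code in hex, and the usual x86 page-fault error code bits (bit 0 = protection violation rather than not-present page, bit 1 = write, bit 2 = user mode). Under that reading, "error 4" would be a user-mode read of an unmapped page, and "segfault at 0000000000000000" a read at address zero.

```python
import re
from collections import Counter

# Assumed format, taken from the log lines quoted in this report:
#   grep[6262]: segfault at <addr> rip <rip> rsp <rsp> error <code>
SEGFAULT_RE = re.compile(
    r"(?P<prog>[^\s\[\]]+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"rip (?P<rip>[0-9a-f]+) rsp (?P<rsp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
)


def decode_error_code(code):
    """Split an x86 page-fault error code into its low flag bits."""
    return {
        "protection_violation": bool(code & 0x1),  # clear: page not present
        "write": bool(code & 0x2),                 # clear: read access
        "user_mode": bool(code & 0x4),             # clear: kernel-mode access
    }


def parse_segfaults(lines):
    """Return one dict per line that matches the segfault format."""
    hits = []
    for line in lines:
        m = SEGFAULT_RE.search(line)
        if m is None:
            continue  # not a segfault line, skip it
        hits.append({
            "prog": m.group("prog"),
            "pid": int(m.group("pid")),
            "addr": int(m.group("addr"), 16),
            "rip": int(m.group("rip"), 16),
            "rsp": int(m.group("rsp"), 16),
            # Assumption: the kernel prints the error code in hex.
            "error": int(m.group("err"), 16),
        })
    return hits


def count_by_program(hits):
    """Count segfaults per program name."""
    return Counter(hit["prog"] for hit in hits)
```

Feeding the lines of /var/log/messages to parse_segfaults() and then count_by_program() gives a quick picture of whether grep is the only victim or whether other programs are hit as well. An identical rip across different PIDs, as in the four lines above, would suggest the same instruction faulting each time.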
It does not happen if I install a default kernel.
Are there _more_ segfaults in the /var/log/messages? This could be a _hardware_ problem, to be exact a problem of the BIOS. Compare with bug #106147
Yes, there are some more, but they all look similar. A new BIOS isn't an option; I already have the newest. SL9.3 runs fine, and so does SLES9. It only happens with the SMP kernel of SL10RC4.
I also suspect a hardware or BIOS problem. See also bug #105885. Strange that it hits grep so often though ...
By the way, in Wilkens /var/log/messages there were also a few segfaults from other programs, two perl-scripts (which do *not* call grep) and a few segfaults from gcc as well.
Let's also put Andi in CC. Is this an Intel or an AMD machine? In the past we had many machines with this problem, and a BIOS update seemed to have solved it. But now we again get reports from people who already have the newest BIOS and still see these random segfaults, which can't be reproduced with e.g. a 9.3 kernel. IIRC Thomas Renninger even found a set of four patches in the kernel which triggered them, so CCing him too.
It's somewhere between the 9.3 Gold and the 9.3 YOU update kernel. I tried to narrow down the set of patches, but did not get far: often the machine needed extra boot params or did not boot at all. After BIOS updates helped, I didn't look any further... If this becomes a bigger problem, someone with more memory knowledge should have a look at the 9.3 patches; there are not many that could trigger the problem, I think. A BIOS update (if available) will help.

Olaf/Andi: If you think it's worth a try to test the patch from #72919: there is still one machine for which no new BIOS is available and which still shows the "segfault ... error 4" messages. Tell me and I'll give it a try.
The patch from okir in #72919 (disabling PDC) makes the segfaults disappear. Booting is enough to get segfaults on *stravinsky*; with pdc off, it compiled kdebase3 and updated to RC4 without any.

We should invert the logic (enable pdc by default), as there is a BIOS update and, according to AMD (#72919 comment #1), there could be performance regressions of up to 2%. Still, BIOS updates are not available for all machines (e.g. the FSC CELSIUS: stravinsky, and AFAIK sf also has such a machine without a BIOS update).

-> Assigning to okir, as he already wrote this patch.

I tried to identify bad patches by removing them from the 9.3 YOU update kernel (as 9.3 goldmaster worked)... Could it be that the patch has been reverted for some reason, rather than a bad one added?!? I had a quick grep through the changelog, but couldn't find anything related...
Ah, I forgot: my machine is a dual Opteron 244. The board is a kind of prototype (RioWorks HDAMB Rev.E). And like I said, a new BIOS isn't an option; I have the newest one from their page, but maybe someone can get a new/special one which isn't available to the public.
Thomas Renninger made a kernel where PDC is off by default. This kernel works perfectly.
Where is this kernel to try? I've seen similar problems on my x86_64 SMP machine.
Created attachment 50658
Add boot param pdc=off to disable pdc

Can this one be added to the SL 10.0 kernel, please?
I just used Okir's patch and inverted the logic. Not even compile tested, but should work ...
OK, I have the same problem on a dual Opteron machine. grep from SL10.0 works with only one processor (the second one physically pulled out of the motherboard). With both processors, grep from SL10.0 randomly segfaults as stated above. I ended up using grep from SL9.3. It works in both uni and SMP mode. I can provide any details needed, because it is my testing machine.
This is all the TLB flush filter bug. Either update your BIOS or wait for the next kernel update, which will disable it. I don't know when that will be.
Unfortunately, I have the same problem with grep. My hardware is a dual Opteron (2 GHz) on an MSI K8D-Master3. The pdc=0 kernel parameter doesn't help. However, this does help:

echo 0 > /proc/sys/kernel/randomize_va_space

As I'm not really satisfied with the required kernel reconfiguration, I've tried to find the root cause of this problem. So far, I have noticed that the segfault doesn't occur if I omit -O2 from the CFLAGS in grep.spec. In grep.spec the configure command starts with

./configure CFLAGS="$RPM_OPT_FLAGS" ...

which expands into

./configure CFLAGS="-O2 -g -fmessage-length=0 -D_FORTIFY_SOURCE=2" ...

If I omit the "-O2", namely

./configure CFLAGS="-g -fmessage-length=0 -D_FORTIFY_SOURCE=2" ...

the problem is gone. Hmmm, I'm not sure whether this hint is of any help.
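When comparing machines, it may help to record which address-space randomization setting each one is actually running with. The sketch below is only an illustration under stated assumptions: the helper names are hypothetical, and the descriptions for values 1 and 2 follow the meaning documented for current Linux kernels, which may differ from the kernel discussed in this report.

```python
from pathlib import Path

# The sysctl file written by the "echo 0 > ..." workaround in this comment.
ASLR_PATH = Path("/proc/sys/kernel/randomize_va_space")

# Assumption: meanings as documented for current Linux kernels.
ASLR_MODES = {
    0: "randomization disabled",
    1: "stack, mmap base and VDSO randomized",
    2: "as 1, plus heap (brk) randomized",
}


def read_aslr_mode(path=ASLR_PATH):
    """Read the integer setting from the sysctl file (or a copy of it)."""
    return int(Path(path).read_text().strip())


def describe_aslr_mode(mode):
    """Return a short description for a randomize_va_space value."""
    return ASLR_MODES.get(mode, "unknown mode %d" % mode)
```

On a Linux system, describe_aslr_mode(read_aslr_mode()) reports the live setting. The path parameter exists so that a saved copy of the file from another machine can be checked the same way.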