|
Bugzilla – Full Text Bug Listing |
| Summary: | grep segfaults when locale is utf8 | ||
|---|---|---|---|
| Product: | [openSUSE] SUSE LINUX 10.0 | Reporter: | Andreas Klein <asklein> |
| Component: | Basesystem | Assignee: | Mads Martin Joergensen <mmj> |
| Status: | RESOLVED INVALID | QA Contact: | E-mail List <qa-bugs> |
| Severity: | Normal | ||
| Priority: | P5 - None | CC: | meissner, ro, vetter, wgottwalt |
| Version: | Beta 2 | ||
| Target Milestone: | --- | ||
| Hardware: | x86-64 | ||
| OS: | All | ||
| Whiteboard: | |||
| Found By: | Other | Services Priority: | |
| Business Priority: | Blocker: | --- | |
| Marketing QA Status: | --- | IT Deployment: | --- |
| Attachments: | Oops on smp opteron with beta1; input for grep; core |
||
|
Description
Andreas Klein
2005-08-19 15:49:09 UTC
Created attachment 46711 [details]
Oops on smp opteron with beta1
You can trigger the problem by running /usr/sbin/ghostscript-cjk-config while LANG is set to en_US.UTF-8:

wpyc022:~ # echo $LANG
en_US.UTF-8

If LANG is set to C, no segfaults occur. What really puzzles me is that it works on single opterons, regardless of the LANG setting.

Maybe something is subtly different in your single opteron installation.

Erm. Does the oops attached above really belong to this bug report? That's an oops in lockd, which is completely unrelated to any grep segfaults. Did you mean to attach that oops to a different report?

No, the Oops is not related to this problem. I thought at first it might be, but then I realized that the Oops has nothing to do with this. And I don't think the problem is a kernel issue, which is why I changed the component from kernel to basesystem. But somehow mmj assigned it to Hubert again.

And round and round we go... sending back to Mads.

And how can I reproduce the problem in grep?

# echo $LANG ; /usr/sbin/ghostscript-cjk-config
en_US.UTF-8
# cat /proc/cpuinfo
...
processor : 1
vendor_id : AuthenticAMD
cpu family : 15
model : 5
model name : AMD Opteron(tm) Processor 248
...

Run it more often. The segfault is not visible in your terminal. You have to look in /var/log/messages for it.
Mon Aug 29 16:18:51 CEST 2005
wpyc022:~ # /usr/sbin/ghostscript-cjk-config
wpyc022:~ # /usr/sbin/ghostscript-cjk-config
wpyc022:~ # /usr/sbin/ghostscript-cjk-config
wpyc022:~ # /usr/sbin/ghostscript-cjk-config
wpyc022:~ # /usr/sbin/ghostscript-cjk-config
wpyc022:~ # /usr/sbin/ghostscript-cjk-config
wpyc022:~ # /usr/sbin/ghostscript-cjk-config
wpyc022:~ # date
Mon Aug 29 16:19:54 CEST 2005

Aug 29 16:19:03 wpyc022 kernel: grep[18059]: segfault at 0000000000000000 rip 00002aaaaaaae05c rsp 00007fffffdb8cc0 error 4
Aug 29 16:19:12 wpyc022 kernel: grep[18228]: segfault at 0000000000000000 rip 00002aaaaaaae05c rsp 00007fffffd5fcc0 error 4
Aug 29 16:19:20 wpyc022 kernel: grep[18360]: segfault at 0000000000000000 rip 00002aaaaaaae05c rsp 00007fffffd54d00 error 4
Aug 29 16:19:20 wpyc022 kernel: grep[18363]: segfault at 0000000000000000 rip 00002aaaaaaae05c rsp 00007fffffcccf10 error 4
Aug 29 16:19:20 wpyc022 kernel: grep[18379]: segfault at 0000000000000000 rip 00002aaaaaaae05c rsp 00007fffff879df0 error 4
Aug 29 16:19:29 wpyc022 kernel: grep[18511]: segfault at 0000000000000000 rip 00002aaaaaaae05c rsp 00007fffffa13ce0 error 4
Aug 29 16:19:29 wpyc022 kernel: grep[18529]: segfault at 0000000000000000 rip 00002aaaaaaae05c rsp 00007fffff8abe10 error 4

15 times in a row, and I still get nothing. Neither in log/messages nor in dmesg.

Mike, can you reproduce/have an idea?

Hmm, maybe there must be another package present to trigger the problem? Should I attach my package selection file?

No, I cannot reproduce this either. I tried on Beta3 on a x86_64 single processor machine and on Beta3 on a x86_64 dual processor machine. Closing as WORKSFORME.

Andreas, if you still think this is a bug you may reopen it but then please find out exactly which grep command causes the segfault. This should be rather easy, if you run /usr/sbin/ghostscript-cjk with verbose output:

/usr/sbin/ghostscript-cjk --verbosity 1

You can see all "grep" commands executed by /usr/sbin/ghostscript-cjk in the output.
Looks like this:

executing: grep -q "Creator: aliascid.ps" fonts.cache-1
executing: grep -q "Creator: aliascid.ps" MOEKai-Regular

Find out exactly which grep command segfaults for you. And then please attach the file which was the argument of the grep command to this bugreport.

Until then it is WORKSFORME.

Andreas> Hmm, maybe there must be another package present, to trigger the problem?
Andreas> Should I attach my package selection file?

I think it would help us much more if you can check exactly which grep command segfaults and attach the file which was used as an argument in that grep command.

This is really a strange problem. Running

/usr/sbin/ghostscript-cjk --verbosity 1

does not lead to a segfault, and neither does running the grep itself. Running /usr/sbin/ghostscript-cjk without arguments still produces the segfaults with grep.

This problem depends on the running kernel.
After I installed kernel-default on that machine, I could no longer trigger the problem.
Summary:
The problem is independent of the LANG setting. It was just an unlucky coincidence.
SMP kernel:
grep segfaults can be produced by the following commands:
while :; do /usr/sbin/ghostscript-cjk; done
while :; do /sbin/conf.d/SuSEconfig.lyx-cjk; done
The segfault happens on average in one out of 10-20 executions.
Running the greps that were executed by hand does not segfault.
Running the script with verbosity 1 does not segfault.
The segfault happens at grep in this part. Maybe it has something to do with
the fs-cache?
rm -f wrap_chkconfig.ltx chkconfig.vars chkconfig.classes chklayouts.tex
done > chklayouts.tex
${LATEX} wrap_chkconfig.ltx 2>/dev/null | grep '^\+'
eval `cat chkconfig.vars | sed 's/-/_/g'`
test -n "${rmcopy}" && rm -f chkconfig.ltx
fi
kernel-default:
Installing the single-processor kernel on that machine "solves" the problem!
I tested several hundred executions without a segfault.
smp-kernel with nosmp kernel-parameter:
kernel does not boot: hda lost interrupt; kernel panic
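
The reproduction loop and the log check described in this summary can be sketched roughly as follows. This is only an illustration: the helper names `count_segfaults`, `segfault_rips` and `repro_loop` are made up here, the log format is assumed to match the `kernel: grep[PID]: segfault at ... rip ... rsp ...` lines quoted earlier, and syslog may write its lines with some delay.

```sh
# Count kernel "segfault" lines for one program in a log file.
# $1 = log file (e.g. /var/log/messages), $2 = program name (default: grep)
count_segfaults() {
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true"
    grep -c "kernel: ${2:-grep}\[[0-9]*]: segfault" "$1" || true
}

# Print the distinct rip values found in those lines, one per line.
segfault_rips() {
    grep "kernel: ${2:-grep}\[[0-9]*]: segfault" "$1" |
        sed 's/.* rip \([0-9a-f]*\) .*/\1/' | sort -u
}

# Run a command N times and report how many new segfault lines appeared.
# $1 = log file, $2 = number of runs, rest = command to run
repro_loop() {
    log=$1
    runs=$2
    shift 2
    before=$(count_segfaults "$log")
    i=0
    while [ "$i" -lt "$runs" ]; do
        "$@" >/dev/null 2>&1 || true   # the crash happens in a child, so ignore the status
        i=$((i + 1))
    done
    after=$(count_segfaults "$log")
    echo "$((after - before)) segfault(s) in $runs run(s)"
}
```

For example, `repro_loop /var/log/messages 100 /usr/sbin/ghostscript-cjk` followed by `segfault_rips /var/log/messages` would show how often grep crashed and whether the rip is the same each time.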
Do you always get the same oops in lockd, or random different ones?

The Oops was just once with beta1. Beta2 and Beta3 never had an Oops. The grep segfaults started with the Beta3 smp-kernel. /var/log/mcelog shows no problems.

I still don't believe this is a kernel problem. Maybe the problem is just
that grep receives an incomplete multibyte sequence from the pipe, and barfs.
But let's see.
Andreas, please provide the input that's being piped into grep above.
If you just run "grep < output", does it still segfault? IOW, does it depend
on the data being fed to grep, or the fact that there's two commands
connected in a pipe?
If not: run this single pipe (latex|grep) - does it segfault?
If not: can you write a small C program that does something like this:
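    #include <unistd.h>
    int main(void) {
        char c;
        while (read(0, &c, 1) == 1) {   /* one byte per read() */
            write(1, &c, 1);
            usleep(10 * 1000);          /* 10 ms pause; msleep() is not a userspace call */
        }
    }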
and use that to pipe the latex output into grep?
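
For a quick test without compiling anything, the same byte-at-a-time relay can be approximated in shell. This is a rough sketch, not a drop-in replacement: the function name `slow_relay` is made up here, it relies on `od` and `printf '%b'` to pass arbitrary bytes through, and the fractional `sleep` value assumes a `sleep` that accepts it (GNU coreutils does).

```sh
# Copy stdin to stdout one byte per write, pausing after each byte, so the
# reader on the other end of the pipe sees multibyte characters split up.
# $1 = pause between bytes in seconds (default: 0.01)
slow_relay() {
    delay=${1:-0.01}
    # od prints every input byte as a three-digit octal number
    od -An -v -to1 | while read -r line; do
        for o in $line; do
            printf '%b' "\\0$o"   # turn the octal number back into the raw byte
            sleep "$delay"
        done
    done
}
```

It would sit in the pipe like this: `${LATEX} wrap_chkconfig.ltx 2>/dev/null | slow_relay | grep '^\+'`.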
Were you able to extract the latex output?
BTW you mentioned that the segfault happens here:
${LATEX} wrap_chkconfig.ltx 2>/dev/null | grep '^\+'
I didn't find this in ghostscript-cjk-config
(In reply to comment #21)
> Were you able to extract the latex output?

I will try it.

> BTW you mentioned that the segfault happens here:
>
> ${LATEX} wrap_chkconfig.ltx 2>/dev/null | grep '^\+'
>
> I didn't find this in ghostscript-cjk-config

This was from /usr/share/lyx-cjk/configure, which is called by one of the SuSEconfig scripts.

I just tried beta4. The segfault still exists. It does not happen on every run, and it is always in a different line. But it only happens with the smp kernel. The segfault always happens at address 0 and the same rip. The rsp changes.

grep[11162]: segfault at 0000000000000000 rip 00002aaaaaaae05c rsp 00007fffff90db10 error 4

wpyc022:~ # /sbin/conf.d/SuSEconfig.ghostscript-cjk
wpyc022:~ # /sbin/conf.d/SuSEconfig.ghostscript-cjk
/sbin/conf.d/SuSEconfig.ghostscript-cjk: line 26: 20297 Segmentation fault /usr/sbin/ghostscript-cjk-config
wpyc022:~ # /sbin/conf.d/SuSEconfig.ghostscript-cjk
wpyc022:~ # /sbin/conf.d/SuSEconfig.ghostscript-cjk
wpyc022:~ # /sbin/conf.d/SuSEconfig.ghostscript-cjk
wpyc022:~ # /sbin/conf.d/SuSEconfig.ghostscript-cjk
wpyc022:~ # /sbin/conf.d/SuSEconfig.ghostscript-cjk
wpyc022:~ # /sbin/conf.d/SuSEconfig.ghostscript-cjk
wpyc022:~ # /sbin/conf.d/SuSEconfig.ghostscript-cjk
wpyc022:~ # /sbin/conf.d/SuSEconfig.ghostscript-cjk
wpyc022:~ # /sbin/conf.d/SuSEconfig.ghostscript-cjk
wpyc022:~ # /sbin/conf.d/SuSEconfig.ghostscript-cjk
wpyc022:~ # /sbin/conf.d/SuSEconfig.ghostscript-cjk
wpyc022:~ # /sbin/conf.d/SuSEconfig.ghostscript-cjk
wpyc022:~ # /sbin/conf.d/SuSEconfig.ghostscript-cjk
/sbin/conf.d/SuSEconfig.ghostscript-cjk: line 22: 22124 Segmentation fault /usr/sbin/acroread-cidfont-config
wpyc022:~ # /sbin/conf.d/SuSEconfig.ghostscript-cjk
wpyc022:~ # /sbin/conf.d/SuSEconfig.ghostscript-cjk
wpyc022:~ # /sbin/conf.d/SuSEconfig.ghostscript-cjk
wpyc022:~ # /sbin/conf.d/SuSEconfig.ghostscript-cjk

I don't have a beta2 anymore, so I cannot validate exactly. But the RIP is definitely in a shared library.
You can verify if you start "grep X" in one window, and do a "cat /proc/$PID/maps" in another.

I am quite sure that this is a problem with multibyte support in the locale code. The reason no-one is able to reproduce it here may just depend on the set of packages you have installed. Lots of CJK packages maybe?

Please, it's important that you provide us with the input that makes grep segfault. Just insert a "tee /tmp/somefile" into the latex|grep pipe.

Olaf> I am quite sure that this is a problem with multibyte support in the
Olaf> locale code. The reason no-one is able to reproduce it here may just
Olaf> depend on the set of packages you have installed.

I don't think so, because I cannot reproduce it either and I have *all* CJK packages installed which are available on SuSE Linux and some more.

And apparently Andreas cannot give us the exact input that makes grep segfault. I also asked for that input; without such input I cannot even start to try finding a bug in grep. What's the problem with providing the input?

It's quite possible that if you just extract the input, it will not crash grep if you just feed it that file. Most likely the problem also depends on the exact sequence of read()s that we're doing. My guess is still that LC_COLLATE or something like this is being fed an incomplete collation sequence and walks off into the woods.

If you like, you can also try prefixing that grep invocation by "strace -e read -o /tmp/grep.trace". Again, this may or may not yield useful information, as strace changes the application's timing considerably.

(In reply to comment #24)
> I don't have a beta2 anymore, so I cannot validate exactly. But the RIP
> is definitely in a shared library. You can verify if you start "grep X"
> in one window, and do a "cat /proc/$PID/maps" in another.

The RIP is still the same in beta 4: 00002aaaaaaae05c

cat /proc/$PID/maps shows ld.so for this:

2aaaaaaab000-2aaaaaac0000 r-xp 00000000 03:06 34914 /lib64/ld-2.3.5.so

> I am quite sure that this is a problem with multibyte support in the
> locale code. The reason no-one is able to reproduce it here may just
> depend on the set of packages you have installed. Lots of CJK packages
> maybe?

Go to zzz_all, select all packages in this list, and install. That's how all machines are installed here. Remove *debuginfo, beagle, kat and choose only one kernel. I can give you the software-selection file, or create an account for you on that machine.

> Please, it's important that you provide us with the input that makes
> grep segfault. Just insert a "tee /tmp/somefile" into the latex|grep
> pipe.

OK, I will attach it. But if the tee is present, it does not segfault! Even adding some prints before and after that is enough to prevent the segfault.

I enabled corefile writing and will attach the corefile.

Created attachment 48694 [details]
input for grep
This is the input that is fed into grep where the segfaults sometimes occur.
If the tee is present, grep does not segfault.
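
The "tee into the pipe" capture asked for above can be wrapped in a small helper. This is a minimal sketch with a made-up name (`run_with_capture`); as reported in this thread, adding `tee` changes the timing of the pipe and may well hide the crash.

```sh
# Run a consumer command on stdin while saving an exact copy of stdin.
# $1 = capture file, rest = consumer command and its arguments
run_with_capture() {
    capture=$1
    shift
    # tee writes the copy; the consumer reads the same bytes from the pipe
    tee "$capture" | "$@"
}
```

In the failing pipe it would be used as `${LATEX} wrap_chkconfig.ltx 2>/dev/null | run_with_capture /tmp/somefile grep '^\+'`. The function returns the exit status of the consumer, so a crash of grep would still be visible to the caller.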
Created attachment 48695 [details]
core
Thanks for the core dump.
The faulting address (00002aaaaaaae05c) is actually in this segment:
2aaaaaac8000-2aaaaaafb000 r--p 00000000 /usr/lib/locale/en_US.utf8/LC_CTYPE
(Note that the map you gave in comment #27 doesn't contain the faulting RIP)
IOW it jumps right into the CTYPE tables. Reproducibly. I find it hard
to believe that this should be a kernel bug.
I think someone from the packagers team needs to look at the core dump.
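
Looking up which mapping contains a faulting address, as done by hand in the comments above, can be sketched as a small shell function. The name `find_mapping` is made up here, and the input is assumed to be in the `start-end perms ...` format of /proc/$PID/maps with addresses that fit into the shell's signed 64-bit arithmetic (user-space addresses do).

```sh
# Print the maps line(s) whose address range contains a given address.
# $1 = address in hex (with or without 0x), $2 = file in /proc/PID/maps format
find_mapping() {
    addr=$(( 0x${1#0x} ))
    while read -r range rest; do
        case $range in
            *-*) ;;          # looks like "start-end"
            *) continue ;;   # skip blank or unexpected lines
        esac
        start=$(( 0x${range%-*} ))
        end=$(( 0x${range#*-} ))
        # a mapping covers start (inclusive) up to end (exclusive)
        if [ "$addr" -ge "$start" ] && [ "$addr" -lt "$end" ]; then
            printf '%s %s\n' "$range" "$rest"
        fi
    done < "$2"
}
```

Typical use would be `find_mapping 00002aaaaaaae05c /proc/$PID/maps`. Fed only the two ranges quoted in this thread, the comparison places 00002aaaaaaae05c inside 2aaaaaaab000-2aaaaaac0000 (the ld-2.3.5.so line) and below the start of the LC_CTYPE range, so the lookup is best repeated against the maps of the crashing process or the core itself.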
It could be triggered somehow by the randomized va mappings. It seems to cause all kinds of user space problems. See the long story in http://bugzilla.kernel.org/show_bug.cgi?id=4851

Does it go away with echo 0 > /proc/sys/kernel/randomize_va_space ? If yes, we should probably consider disabling it for the release.

This makes it better.
After
echo 0 > /proc/sys/kernel/randomize_va_space
none of the segfaults in the cjk scripts occur anymore. So this prevents most of
the segfaults.
Only one segfault remains:
updmap: initial config file is `/etc/texmf/web2c/updmap.cfg'
/usr/bin/updmap: line 298: 30445 Segmentation fault grep "$pat" "$file" >/dev/null
Sep 5 12:19:14 wpyc022 kernel: grep[30445]: segfault at 0000000000000000 rip 00002aaaaaaae05c rsp 00007fffffffdb70 error 4
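
Switching a setting such as randomize_va_space off only for the duration of one test run, and restoring the old value afterwards, can be sketched as below. The helper name `with_setting` is made up here; writing to files under /proc/sys needs root.

```sh
# Temporarily write a value to a settings file, run a command, then restore.
# $1 = settings file, $2 = temporary value, rest = command to run
with_setting() {
    file=$1
    value=$2
    shift 2
    old=$(cat "$file") || return 1      # remember the current value
    echo "$value" > "$file" || return 1
    status=0
    "$@" || status=$?                   # keep going so the restore always runs
    echo "$old" > "$file"
    return "$status"
}
```

For the test discussed here this would be, as root, `with_setting /proc/sys/kernel/randomize_va_space 0 /usr/sbin/ghostscript-cjk`. The old value is put back even if the command fails.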
This is still the same RIP as all the other reports. Also, we're not doing much in terms of randomization anyway. Mostly the stack.

It would really help to capture such a crash in gdb if possible. There is a core file attached above, but I cannot find a grep with debuginfo symbols that would show a sane backtrace. Andreas, can you perhaps attach the grep binary? Or try with the beta4 grep binary?

The core was from a beta4 grep.

taskset -p 0x01 $$ also seems to prevent the segfault. Another way is to produce load with cpuburn.

I tried with beta3 on reger.suse.de (4 cpu machine)... but no luck so far.

Our machine is a Tyan S2885 with 2x Opteron 242, 2GB RAM.

Andreas, how new is your BIOS? If not new, then please try and reproduce with the latest BIOS. With dual Opterons and a Tyan board we've seen weird stuff before on Rudi's machine.

No, it is not the newest BIOS. I had the problem that the newest did not work reliably on machines with 6GB RAM. I use 2.02w. With that I was able to find settings for MTRR and the adjust memory switch, so that SLES8, SLES9 and SUSE 9.3 were running without problems. Which BIOS do you recommend/use on your Tyan S2885 board?

2.02w is one of the old beta series. My machine (also K8W aka 2885) is running 2.05 nicely with 4G of memory. I also had tons of segfaults with BIOSes prior to 2.05 when running recent kernels.

(In reply to comment #43)
> 2.02w is one of the old beta series.
> my machine (also K8W aka 2885) is running 2.05 nicely with
> 4G of memory.

I just tested 2.05 on my workstation and it works even with 6GB. Only 2.04 had the 6GB problem.

> I also had tons of segfaults with bioses prior
> to 2.05 when running recent kernels.

I will install 2.05 this afternoon on the beta-machine to see if my segfaults are gone, too.

After the BIOS update the segfaults are gone! The BIOS seems to repair all segfaults.

So since it was a local hw problem, this was invalid. |