Bug 1181479 - LLVM 11 can not be built because of a kernel or hardware problem
Status: RESOLVED INVALID
Classification: openSUSE
Product: openSUSE Tumbleweed
Component: Kernel
Version: Current
Hardware: S/390-64 Other
Importance: P2 - High, Normal
Assigned To: openSUSE Kernel Bugs
 
Reported: 2021-01-27 16:31 UTC by Sarah Julia Kriesch
Modified: 2021-05-12 12:29 UTC (History)

Attachments
OBS build log with bad page state. (20.32 KB, application/x-xz; charset=binary)
2021-05-11 22:05 UTC, Aaron Puchert

Description Sarah Julia Kriesch 2021-01-27 16:31:19 UTC
User-Agent:       Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0
Build Identifier: 


OBS reports a failed build of LLVM 11 for s390x with the following message:
[41008s] No buildstatus set, either the base system is broken (kernel/initrd/udev/glibc/bash/perl)
[41008s] or the build host has a kernel or hardware problem...

Reproducible: Always

Steps to Reproduce:
see: https://build.opensuse.org/package/live_build_log/openSUSE:Factory:zSystems/llvm11/standard/s390x
Actual Results:  
Failed builds of LLVM 11 because of a kernel or hardware problem on the base system (the VM)

Expected Results:  
successful builds of LLVM11 in the OBS
Comment 1 Ruediger Oertel 2021-01-27 23:29:13 UTC
the last lines in that log are:
[10986s] [4035/4215] : && /usr/bin/cmake -E rm -f lib64/libclangStaticAnalyzerFrontend.a && /home/abuild/rpmbuild/BUILD/llvm-11.0.1.src/stage1/bin/llvm-ar Dqc lib64/libclangStaticAnalyzerFrontend.a  tools/clang/lib/StaticAnalyzer/Frontend/CMakeFiles/obj.clangStaticAnalyzerFrontend.dir/AnalysisConsumer.cpp.o tools/clang/lib/StaticAnalyzer/Frontend/CMakeFiles/obj.clangStaticAnalyzerFrontend.dir/AnalyzerHelpFlags.cpp.o tools/clang/lib/StaticAnalyzer/Frontend/CMakeFiles/obj.clangStaticAnalyzerFrontend.dir/CheckerRegistry.cpp.o tools/clang/lib/StaticAnalyzer/Frontend/CMakeFiles/obj.clangStaticAnalyzerFrontend.dir/CreateCheckerManager.cpp.o tools/clang/lib/StaticAnalyzer/Frontend/CMakeFiles/obj.clangStaticAnalyzerFrontend.dir/FrontendActions.cpp.o tools/clang/lib/StaticAnalyzer/Frontend/CMakeFiles/obj.clangStaticAnalyzerFrontend.dir/ModelConsumer.cpp.o tools/clang/lib/StaticAnalyzer/Frontend/CMakeFiles/obj.clangStaticAnalyzerFrontend.dir/ModelInjector.cpp.o && /home/abuild/rpmbuild/BUILD/llvm-11.0.1.src/stage1/bin/llvm-ranlib -D lib64/libclangStaticAnalyzerFrontend.a && :
[40994s] qemu-system-s390x: terminating on signal 15 from pid 125036 (<unknown process>)


Job seems to be stuck here, killed. (after 30000 seconds of inactivity)
[41007s] ### VM INTERACTION END ###
[41008s] No buildstatus set, either the base system is broken (kernel/initrd/udev/glibc/bash/perl)
[41008s] or the build host has a kernel or hardware problem...


so llvm is stuck for 30000 seconds, possibly because of memory pressure, and after that timeout the process gets killed (and the build subsequently fails with a misleading error message, as the build script can't find out what exactly happened).
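
The timestamps in the excerpt fit this reading: the last build step was logged at 10986s and the VM was terminated at 40994s, which is roughly the 30000-second inactivity limit. A minimal sketch of that arithmetic, assuming the `[NNNNNs]` prefix format shown above (the helper name `largest_gap` is hypothetical and not part of OBS; the sample lines are shortened):

```python
import re

# Matches the "[10986s]" elapsed-time prefix seen on the build log lines.
STAMP = re.compile(r"^\[\s*(\d+)s\]")

def largest_gap(lines):
    """Return (gap, start, end) for the longest silence between stamped lines."""
    stamps = [int(m.group(1)) for m in map(STAMP.match, lines) if m]
    gaps = [(b - a, a, b) for a, b in zip(stamps, stamps[1:])]
    return max(gaps) if gaps else (0, None, None)

# Shortened copies of the lines quoted above.
log = [
    "[10986s] [4035/4215] : && /usr/bin/cmake -E rm -f ...",
    "[40994s] qemu-system-s390x: terminating on signal 15 ...",
    "[41007s] ### VM INTERACTION END ###",
    "[41008s] No buildstatus set, either the base system is broken",
]

gap, start, end = largest_gap(log)
print(f"longest silence: {gap}s (from {start}s to {end}s)")
```

A gap this close to the timeout only shows that nothing was logged in that window; it does not say whether the cause was memory pressure or a hang.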

this is s390zp2a with worker VMs like
OBS_INSTANCE_MEMORY=$(( 12 * 1024 ))
OBS_WORKER_ROOT_SIZE=60000
OBS_WORKER_SWAP_SIZE=2000
OBS_WORKER_JOBS=6

so a VM with 12G memory gets stuck with 6 jobs. might be a compiler bug or the processes really get large enough to stall the machine due to memory pressure.
these are already the largest worker instances we run, so you might try with
manually setting a different (lower) number of jobs in the specfile or similar.
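
To make the numbers concrete: with the settings above, each of the 6 jobs gets about 2 GiB on average. A rough sketch of how a lower job count could be derived from a per-job memory budget follows. The function name `jobs_for_memory` and the 4096 MiB budget are assumptions for illustration, not values taken from OBS or the llvm specfile.

```python
def jobs_for_memory(total_mib, max_jobs, per_job_mib):
    """Cap the job count so that every job gets at least per_job_mib of RAM."""
    return max(1, min(max_jobs, total_mib // per_job_mib))

OBS_INSTANCE_MEMORY = 12 * 1024   # MiB, as in the worker config above
OBS_WORKER_JOBS = 6

# Average memory per job with the current settings (MiB).
print(OBS_INSTANCE_MEMORY // OBS_WORKER_JOBS)

# Job count for a hypothetical budget of 4096 MiB per job.
print(jobs_for_memory(OBS_INSTANCE_MEMORY, OBS_WORKER_JOBS, 4096))
```

The result would then have to be applied by hand in the specfile, however the package limits its parallel jobs.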
Comment 2 Aaron Puchert 2021-02-04 00:14:36 UTC
(In reply to Sarah Julia Kriesch from comment #0)
> https://build.opensuse.org/package/live_build_log/openSUSE:Factory:zSystems/llvm11/standard/s390x

I checked the link just now and the latest build was successful. The history also shows the previous build as the only failing one.

(In reply to Ruediger Oertel from comment #1)
> so a VM with 12G memory gets stuck with 6 jobs. might be a compiler bug or
> the processes really get large enough to stall the machine due to memory
> pressure.
> these are already the largest worker instances we run, so you might try with
> manually setting a different (lower) number of jobs in the specfile or
> similar.

Generally I would have gone with OOM, but 2G per job should be more than enough. Ninja's job pools are strange, so we could have up to 5 compile jobs plus 1 ThinLTO link job running at the same time. But even that is unlikely to consume something remotely close to 12G.

Still, it is likely that a ThinLTO job was involved: this is pretty late in the build, and while the compiler itself is single-threaded (so whatever it does is pretty well reproducible), ThinLTO jobs are multi-threaded. So it could be a deadlock, or a race condition if s390x has weaker memory ordering than x86. (Which it likely has, x86 is pretty constrained.)

So there could be a bug, but unless we can observe this live there is too little to go on I think.
Comment 3 Aaron Puchert 2021-04-18 21:13:17 UTC
The 3 latest builds have succeeded, and it's almost certainly not a kernel bug.

More likely it's a hang caused by a race condition, but since LLVM 12 is around the corner I'll probably not have the time to look into it. It's also not exactly a part of LLVM I'm familiar with. So I suggest closing this bug.

If anyone is curious, the bug is most likely in the gold linker plugin LLVMgold.so or the LTO library, whichever does the actual parallelization. ThreadSanitizer is sadly not supported on s390x (https://clang.llvm.org/docs/ThreadSanitizer.html), but the tool is sensitive enough to detect the race on other platforms. So this could be worth a try. However, my machine isn't big enough for that. ;)
Comment 4 Sarah Kriesch 2021-04-25 16:58:18 UTC
Thank you for looking into this!
We can create a new bug report if that happens again.
Comment 5 Aaron Puchert 2021-05-11 22:00:16 UTC
By the way, I just got an actual kernel issue in devel:tools:compiler/llvm12:

[  172s] [  159.153691] User process fault: interruption code 0010 ilc:3 in cc1plus[1000000+13a6000]
[  172s] [  159.153864] Failing address: 0000000000000000 TEID: 0000000000000800
[  172s] [  159.153959] Fault in primary space mode while using user ASCE.
[  172s] [  159.154055] AS:00000000828781c7 R3:0000000082d20007 S:0000000000000020 
[...]
[  173s] c++: internal compiler error: Segmentation fault signal terminated program cc1plus
[  173s] Please submit a full bug report,
[  173s] with preprocessed source if appropriate.
[  173s] See <https://bugs.opensuse.org/> for instructions.
[...]
[  180s] [  166.845717] BUG: Bad page state in process cc1plus  pfn:165f01
[  180s] [  166.847446] BUG: Bad page state in process cc1plus  pfn:165f02
[  180s] [  166.847937] BUG: Bad page state in process cc1plus  pfn:165f03
[  180s] [  166.849085] BUG: Bad page state in process cc1plus  pfn:165f04
[  180s] [  166.849462] BUG: Bad rss-counter state mm:00000000c03d85c2 type:MM_FILEPAGES val:-256
[  180s] [  166.850834] BUG: Bad page state in process cc1plus  pfn:165f05
[... all numbers in between ...]
[  180s] [  166.898083] BUG: Bad page state in process cc1plus  pfn:165f3c

But there is little to go on. The VM is started on s390zp29 via

[   13s] /usr/bin/qemu-system-s390x -nodefaults -no-reboot -nographic -vga none -cpu host -enable-kvm -object rng-random,filename=/dev/random,id=rng0 -device virtio-rng-ccw,rng=rng0 -runas qemu -net none -kernel /var/cache/obs/worker/root_2/.mount/boot/kernel -initrd /var/cache/obs/worker/root_2/.mount/boot/initrd -append root=/dev/disk/by-id/virtio-0 rootfstype=ext4 rootflags=noatime ext4.allow_unsupported=1 mitigations=off panic=1 quiet no-kvmclock elevator=noop nmi_watchdog=0 rw rd.driver.pre=binfmt_misc console=hvc0 init=/.build/build -m 12288 -drive file=/var/cache/obs/worker/root_2/root,format=raw,if=none,id=disk,cache=unsafe -device virtio-blk-ccw,drive=disk,serial=0 -drive file=/var/cache/obs/worker/root_2/swap,format=raw,if=none,id=swap,cache=unsafe -device virtio-blk-ccw,drive=swap,serial=1 -device virtio-serial-ccw -device virtconsole,chardev=virtiocon0 -chardev stdio,mux=on,id=virtiocon0 -mon chardev=virtiocon0 -chardev socket,id=monitor,server,nowait,path=/var/cache/obs/worker/root_2/root.qemu/monitor -mon chardev=monitor,mode=readline -smp 6
[   20s] ### VM INTERACTION END ###
[   20s] 2nd stage started in virtual machine
[   20s] machine type: s390x
[   20s] Linux version: 5.12.0-2-default #1 SMP Thu Apr 29 12:08:56 UTC 2021 (c4830af)
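
For anyone going through the full log, the "Bad page state" messages quoted above can be condensed into page-frame ranges. A small sketch, assuming the message format shown in the excerpt (the function name `bad_pfn_ranges` is hypothetical; requires Python 3.8 or later):

```python
import re

# Matches e.g. "BUG: Bad page state in process cc1plus  pfn:165f01"
BAD_PAGE = re.compile(r"BUG: Bad page state in process (\S+)\s+pfn:([0-9a-f]+)")

def bad_pfn_ranges(lines):
    """Collapse reported page frame numbers into sorted (first, last) ranges."""
    pfns = sorted({int(m.group(2), 16) for l in lines if (m := BAD_PAGE.search(l))})
    ranges = []
    for pfn in pfns:
        if ranges and pfn == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], pfn)   # extend the current run
        else:
            ranges.append((pfn, pfn))           # start a new run
    return ranges

# A few lines copied from the excerpt above; the rss-counter line is ignored.
sample = [
    "[  180s] [  166.845717] BUG: Bad page state in process cc1plus  pfn:165f01",
    "[  180s] [  166.847446] BUG: Bad page state in process cc1plus  pfn:165f02",
    "[  180s] [  166.849462] BUG: Bad rss-counter state mm:00000000c03d85c2 type:MM_FILEPAGES val:-256",
    "[  180s] [  166.850834] BUG: Bad page state in process cc1plus  pfn:165f05",
]

for first, last in bad_pfn_ranges(sample):
    print(f"pfn {first:x}-{last:x}: {last - first + 1} page(s)")
```

If the log really contains every pfn from 165f01 through 165f3c, this would collapse to a single contiguous run of 60 pages.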
Comment 6 Aaron Puchert 2021-05-11 22:05:46 UTC
Created attachment 849275
OBS build log with bad page state.

Let's just attach the actual log. (That's https://build.opensuse.org/public/build/devel:tools:compiler/openSUSE_Factory_zSystems/s390x/llvm12/_log compressed.)
Comment 7 Sarah Kriesch 2021-05-12 12:29:26 UTC
Can you create a new bug report, please?
This bug is closed. Thank you!