Bugzilla – Bug 1216638
qemu-system-s390x crashes when building python-onnx with --vm-type=qemu
Last modified: 2023-11-24 16:30:06 UTC
When using osc to build python-onnx on my x86_64 notebook for s390x with a command like this:

osc build --vm-type=qemu

System is openSUSE Tumbleweed.
qemu-s390x has version 8.1.0.
The build is against current Tumbleweed.
The notebook kernel is 6.6.0-rc4 from Kernel:HEAD (due to different issues with an AMD graphics card).

qemu crashes during the build process. Ulrich Weigand thought it was a good idea to open a bug regarding this. I'll add the stack trace here; however, the core file is about 450 MB and I don't know where to place it. I'll be happy to upload it wherever you like:

           PID: 9035 (qemu-system-s39)
           UID: 107 (qemu)
           GID: 107 (qemu)
         Slice: user-1000.slice
     Owner UID: 1000 (bg)
       Boot ID: 7954757a21bd43c4956ba7ef9027a353
    Machine ID: 6121e5cbf8c440c19e94ff66842aaae6
      Hostname: maxwell.drachenhort
       Storage: /var/lib/systemd/coredump/core.qemu-system-s39.107.7954757a21bd43c4956ba7ef9027a353.9035.1698391907000000.zst (present)
  Size on Disk: 450.7M
       Message: Process 9035 (qemu-system-s39) of user 107 dumped core.
Stack trace of thread 9052:
#0  0x00007f62ad491dec __pthread_kill_implementation (libc.so.6 + 0x91dec)
#1  0x00007f62ad43f0c6 raise (libc.so.6 + 0x3f0c6)
#2  0x00007f62ad4268d7 abort (libc.so.6 + 0x268d7)
#3  0x00007f62ada89fd5 n/a (libglib-2.0.so.0 + 0x22fd5)
#4  0x00007f62adaef8da g_assertion_message_expr (libglib-2.0.so.0 + 0x888da)
#5  0x0000560d36a5dcd7 cc_calc_addu (qemu-system-s390x + 0x4cbcd7)
#6  0x0000560d36a66e78 cc_calc_addu (qemu-system-s390x + 0x4d4e78)
#7  0x0000560d36a66ed0 calc_cc (qemu-system-s390x + 0x4d4ed0)
#8  0x0000560d36a68b99 do_program_interrupt (qemu-system-s390x + 0x4d6b99)
#9  0x0000560d36a68e14 do_svc_interrupt (qemu-system-s390x + 0x4d6e14)
#10 0x0000560d36b46314 cpu_handle_exception (qemu-system-s390x + 0x5b4314)
#11 0x0000560d36b46dbd cpu_exec_setjmp (qemu-system-s390x + 0x5b4dbd)
#12 0x0000560d36b46e8d cpu_exec (qemu-system-s390x + 0x5b4e8d)
#13 0x0000560d36b5f8d3 tcg_cpus_exec (qemu-system-s390x + 0x5cd8d3)
#14 0x0000560d36cfe0d8 qemu_thread_start (qemu-system-s390x + 0x76c0d8)
#15 0x00007f62ad48ff44 start_thread (libc.so.6 + 0x8ff44)
#16 0x00007f62ad5184cc __clone3 (libc.so.6 + 0x1184cc)

Stack trace of thread 9038:
#0 0x00007f62ad48c4ee __futex_abstimed_wait_common (libc.so.6 + 0x8c4ee)
#1 0x00007f62ad48f230 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8f230)
#2 0x0000560d36d0391b qemu_cond_wait_impl (qemu-system-s390x + 0x77191b)
#3 0x0000560d369c52db qemu_wait_io_event (qemu-system-s390x + 0x4332db)
#4 0x0000560d36b5f861 mttcg_cpu_thread_fn (qemu-system-s390x + 0x5cd861)
#5 0x0000560d36cfe0d8 qemu_thread_start (qemu-system-s390x + 0x76c0d8)
#6 0x00007f62ad48ff44 start_thread (libc.so.6 + 0x8ff44)
#7 0x00007f62ad5184cc __clone3 (libc.so.6 + 0x1184cc)

Stack trace of thread 9041:
#0 0x00007f62ad48c4ee __futex_abstimed_wait_common (libc.so.6 + 0x8c4ee)
#1 0x00007f62ad48f230 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8f230)
#2 0x0000560d36d0391b qemu_cond_wait_impl (qemu-system-s390x + 0x77191b)
#3 0x0000560d369c52db qemu_wait_io_event (qemu-system-s390x + 0x4332db)
#4 0x0000560d36b5f861 mttcg_cpu_thread_fn (qemu-system-s390x + 0x5cd861)
#5 0x0000560d36cfe0d8 qemu_thread_start (qemu-system-s390x + 0x76c0d8)
#6 0x00007f62ad48ff44 start_thread (libc.so.6 + 0x8ff44)
#7 0x00007f62ad5184cc __clone3 (libc.so.6 + 0x1184cc)

Stack trace of thread 9040:
#0 0x00007f6292c3fe43 n/a (n/a + 0x0)

ELF object binary architecture: AMD x86-64
Thanks for the bug report! Ilya, can you have a look?
For the time being, I made the corefile available at: https://gunreben.synology.me:8443/core-qemu-s390x/core.qemu-system-s39.107.7954757a21bd43c4956ba7ef9027a353.9035.1698391907000000.zst
(In reply to Berthold Gunreben from comment #0)
> When using osc to build python-onnx on my x86_64 notebook for s390x with a
> command like this:
>
> osc build --vm-type=qemu
>
> System is openSUSE Tumbleweed
> qemu-s390x has version 8.1.0

What's the exact version of the QEMU rpm (e.g., `rpm -qa | grep qemu-s390x`)?
------- Comment From iii@de.ibm.com 2023-10-30 13:03 EDT -------
A similar issue was reported upstream a while ago: https://gitlab.com/qemu-project/qemu/-/issues/1913

I plan to start looking into it in the near future.
(In reply to Dario Faggioli from comment #3)
> What's the exact version of the QEMU rpm (e.g., `rpm -qa|grep qemu-s390x`) ?

The version is: qemu-s390x-8.1.0-2.2.x86_64
------- Comment From iii@de.ibm.com 2023-10-30 19:32 EDT -------
Unfortunately I could not analyze the core file, since the debuginfo for the respective versions of qemu and glib seems to be gone from the debuginfod server. However, the only way I can see this happening is mis-emulation of the LAALG instruction. Could you please try the following patch?

--- a/target/s390x/tcg/translate.c
+++ b/target/s390x/tcg/translate.c
@@ -2681,8 +2681,7 @@ static DisasJumpType op_laa(DisasContext *s, DisasOps *o)
     tcg_gen_atomic_fetch_add_i64(o->in2, o->in2, o->in1, get_mem_index(s),
                                  s->insn->data | MO_ALIGN);
     /* However, we need to recompute the addition for setting CC. */
-    tcg_gen_add_i64(o->out, o->in1, o->in2);
-    return DISAS_NEXT;
+    return op_addu64(s, o);
 }

It helps with a synthetic testcase, which triggers this assertion.
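For context, the condition code the patch recomputes is the s390x logical-add CC that `cc_calc_addu` later validates. A minimal Python sketch of that computation (illustrative only, not QEMU's actual helper):

```python
def cc_addu(a: int, b: int) -> int:
    """Condition code for an s390x 64-bit logical (unsigned) add,
    as produced by instructions like LAALG:
    cc = (result != 0) + 2 * carry.
    The carry out of a single 64-bit add is 0 or 1, which is exactly
    the invariant the failing g_assert(carry_out <= 1) checks."""
    total = a + b
    result = total & ((1 << 64) - 1)  # low 64 bits of the sum
    carry = total >> 64               # carry out of bit 63: always 0 or 1
    assert carry <= 1                 # the assertion that fires in this bug
    return (result != 0) + 2 * carry
```

If the translator fails to store the fresh carry into cc_src (the suspected mis-emulation), a stale value larger than 1 can reach the helper and trip the assertion seen in the stack trace.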
------- Comment From iii@de.ibm.com 2023-10-30 20:56 EDT------- I just realized I posted a wrong upstream bugtracker link. Sorry about that. The correct link is: https://gitlab.com/qemu-project/qemu/-/issues/1865 ("ERROR:../target/s390x/tcg/cc_helper.c:128:cc_calc_addu: assertion failed: (carry_out <= 1)").
------- Comment From iii@de.ibm.com 2023-10-31 01:53 EDT------- I posted this fix and a second one upstream: https://lists.gnu.org/archive/html/qemu-devel/2023-10/msg10251.html
Ok, we'll include the patches in our QEMU as soon as they're accepted upstream.
(In reply to LTC BugProxy from comment #6)
> ------- Comment From iii@de.ibm.com 2023-10-30 19:32 EDT -------
> Unfortunately I could not analyze the core file, since the debuginfo for the
> respective versions of qemu and glib seems to be gone from the debuginfod
> server. However, the only way I can see this happening is mis-emulation of
> the LAALG instruction. Could you please try the following patch?
>
> --- a/target/s390x/tcg/translate.c
> +++ b/target/s390x/tcg/translate.c
> @@ -2681,8 +2681,7 @@ static DisasJumpType op_laa(DisasContext *s, DisasOps *o)
>      tcg_gen_atomic_fetch_add_i64(o->in2, o->in2, o->in1, get_mem_index(s),
>                                   s->insn->data | MO_ALIGN);
>      /* However, we need to recompute the addition for setting CC. */
> -    tcg_gen_add_i64(o->out, o->in1, o->in2);
> -    return DISAS_NEXT;
> +    return op_addu64(s, o);
>  }
>
> It helps with a synthetic testcase, which triggers this assertion.

I added all four patches to the current qemu build and tried to use that. Unfortunately, however, the qemu process hangs during startup, and I aborted it after 5 minutes.
Strace of that process shows a lot of these messages:

read(7, "\1\0\0\0\0\0\0\0", 512) = 8
ppoll([{fd=0, events=POLLIN}, {fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=14, events=POLLIN}, {fd=15, events=POLLIN}, {fd=16, events=POLLIN}, {fd=17, events=POLLIN}, {fd=18, events=POLLIN}, {fd=19, events=POLLIN}, {fd=20, events=POLLIN}, {fd=21, events=POLLIN}, {fd=22, events=POLLIN}, {fd=23, events=POLLIN}, {fd=24, events=POLLIN}, {fd=25, events=POLLIN}, {fd=26, events=POLLIN}, {fd=27, events=POLLIN}, {fd=28, events=POLLIN}, {fd=29, events=POLLIN}, {fd=30, events=POLLIN}, {fd=31, events=POLLIN}, {fd=32, events=POLLIN}, {fd=33, events=POLLIN}, {fd=34, events=POLLIN}, {fd=35, events=POLLIN}, {fd=36, events=POLLIN}, ...], 75, {tv_sec=0, tv_nsec=2819928}, NULL, 8) = 0 (Timeout)
futex(0x560d8155d33c, FUTEX_WAKE_PRIVATE, 2147483647) = 1
futex(0x560d80378220, FUTEX_WAKE_PRIVATE, 1) = 1
ppoll([{fd=0, events=POLLIN}, {fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=14, events=POLLIN}, {fd=15, events=POLLIN}, {fd=16, events=POLLIN}, {fd=17, events=POLLIN}, {fd=18, events=POLLIN}, {fd=19, events=POLLIN}, {fd=20, events=POLLIN}, {fd=21, events=POLLIN}, {fd=22, events=POLLIN}, {fd=23, events=POLLIN}, {fd=24, events=POLLIN}, {fd=25, events=POLLIN}, {fd=26, events=POLLIN}, {fd=27, events=POLLIN}, {fd=28, events=POLLIN}, {fd=29, events=POLLIN}, {fd=30, events=POLLIN}, {fd=31, events=POLLIN}, {fd=32, events=POLLIN}, {fd=33, events=POLLIN}, {fd=34, events=POLLIN}, {fd=35, events=POLLIN}, {fd=36, events=POLLIN}, ...], 75, {tv_sec=1, tv_nsec=866405668}, NULL, 8) = 1 ([{fd=7, revents=POLLIN}], left {tv_sec=1, tv_nsec=866168154})
read(7, "\1\0\0\0\0\0\0\0", 512) = 8
ppoll([{fd=0, events=POLLIN}, {fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=14, events=POLLIN}, {fd=15, events=POLLIN}, {fd=16, events=POLLIN}, {fd=17, events=POLLIN}, {fd=18, events=POLLIN}, {fd=19, events=POLLIN}, {fd=20, events=POLLIN}, {fd=21, events=POLLIN}, {fd=22, events=POLLIN}, {fd=23, events=POLLIN}, {fd=24, events=POLLIN}, {fd=25, events=POLLIN}, {fd=26, events=POLLIN}, {fd=27, events=POLLIN}, {fd=28, events=POLLIN}, {fd=29, events=POLLIN}, {fd=30, events=POLLIN}, {fd=31, events=POLLIN}, {fd=32, events=POLLIN}, {fd=33, events=POLLIN}, {fd=34, events=POLLIN}, {fd=35, events=POLLIN}, {fd=36, events=POLLIN}, ...], 75, {tv_sec=0, tv_nsec=399209418}, NULL, 8) = 1 ([{fd=7, revents=POLLIN}], left {tv_sec=0, tv_nsec=393119149})
read(7, "\1\0\0\0\0\0\0\0", 512) = 8
ppoll([{fd=0, events=POLLIN}, {fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=14, events=POLLIN}, {fd=15, events=POLLIN}, {fd=16, events=POLLIN}, {fd=17, events=POLLIN}, {fd=18, events=POLLIN}, {fd=19, events=POLLIN}, {fd=20, events=POLLIN}, {fd=21, events=POLLIN}, {fd=22, events=POLLIN}, {fd=23, events=POLLIN}, {fd=24, events=POLLIN}, {fd=25, events=POLLIN}, {fd=26, events=POLLIN}, {fd=27, events=POLLIN}, {fd=28, events=POLLIN}, {fd=29, events=POLLIN}, {fd=30, events=POLLIN}, {fd=31, events=POLLIN}, {fd=32, events=POLLIN}, {fd=33, events=POLLIN}, {fd=34, events=POLLIN}, {fd=35, events=POLLIN}, {fd=36, events=POLLIN}, ...], 75, {tv_sec=0, tv_nsec=2921201}, NULL, 8) = 0 (Timeout)
I reproduced the issue with a new qemu from openSUSE:Factory. Corefile is now found at https://gunreben.synology.me:8443/core-qemu-s390x/core.qemu-system-s39.107.af2ba91654fa432fb1f4d0f4de830d84.17957.1698747135000000.zst
(In reply to Berthold Gunreben from comment #11)
> I reproduced the issue with a new qemu from openSUSE:Factory. Corefile is
> now found at
>
> https://gunreben.synology.me:8443/core-qemu-s390x/core.qemu-system-s39.107.af2ba91654fa432fb1f4d0f4de830d84.17957.1698747135000000.zst

The version of qemu is now qemu-s390x-8.1.2-1.1.x86_64
------- Comment From iii@de.ibm.com 2023-11-02 11:03 EDT -------
I extracted the failing s390x instruction from the core dump, which turned out to be CLC. This is what one of my patches is supposed to fix. When I view the new core file in GDB (with debuginfod, which now works), I see:

(gdb) disassemble /s op_clc
../target/s390x/tcg/translate.c:
2019        tcg_gen_qemu_ld_tl(cc_src, o->addr1, get_mem_index(s), mop);
2020        tcg_gen_qemu_ld_tl(cc_dst, o->in2, get_mem_index(s), mop);

which is unpatched code. Could you please double-check that the changes really found their way into the binary repo? I did `zypper source-install qemu`, got:

Version: 8.1.2
Release: 1.2

and I'm not seeing the patches there either; but maybe I'm confused with respect to which repo the updated build was supposed to go into.
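For reference, CLC (Compare Logical Character) compares two equal-length memory operands byte-wise as unsigned values and only sets the condition code. A minimal Python sketch of the intended result (illustrative only, not the QEMU implementation; the bug here is that the translation above stages the operands in cc_src/cc_dst, not the comparison itself):

```python
def cc_clc(op1: bytes, op2: bytes) -> int:
    """s390x CLC condition code:
    0 = operands equal, 1 = first operand low, 2 = first operand high.
    Python compares bytes objects byte-wise as unsigned values,
    which matches CLC's left-to-right logical comparison."""
    assert len(op1) == len(op2), "CLC compares operands of equal length"
    if op1 == op2:
        return 0
    return 1 if op1 < op2 else 2
```

The upstream fix ("target/s390x: Fix CLC corrupting cc_src") addresses the staging of operands through cc_src/cc_dst seen in the disassembly above, which can leave stale values behind; the sketch only shows what the condition code should end up being.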
(In reply to LTC BugProxy from comment #13)
> and I'm not seeing the patches there either; but maybe I'm confused with
> respect to in which repo the updated build was supposed to go.

Well, as far as I can see, there is a PULL request that includes your patches, and it is from yesterday: https://lore.kernel.org/qemu-devel/20231107183228.276424-1-thuth@redhat.com/

I've also just looked at the upstream master branch and, as far as I can see, the patches are not there yet. (Or am I just not seeing them?)

We tend not to pick patches from mailing lists; we either wait for them to come in via a stable release or, if they're important, apply them ourselves, but not before they're committed. This is, in fact, what I meant in comment 9... Sorry if it wasn't clear enough.

So, no, the patches are not there yet. They will be once they have made it upstream, so that I can cherry-pick them from the master branch, which is how our process works.
------- Comment From iii@de.ibm.com 2023-11-13 05:25 EDT -------
The following commits are now in master:

commit aba2ec341c6d20c8dc3e6ecf87fa7c1a71e30c1e
Author: Ilya Leoshkevich <iii@linux.ibm.com>
Date:   Mon Nov 6 10:31:22 2023 +0100

    target/s390x: Fix CLC corrupting cc_src

and

commit bea402482a8c94389638cbd3d7fe3963fb317f4c
Author: Ilya Leoshkevich <iii@linux.ibm.com>
Date:   Mon Nov 6 10:31:24 2023 +0100

    target/s390x: Fix LAALG not updating cc_src

I believe they should be enough to resolve this problem.
(In reply to LTC BugProxy from comment #15)
> I believe they should be enough to resolve this problem.

Ok, thanks. I pushed the "backports" here: https://github.com/openSUSE/qemu/commits/factory

Now OBS will build our Staging package here: https://build.opensuse.org/package/show/Virtualization:Staging/qemu

And, if everything is ok, I'll push it to our Devel Project and then to Factory.
This is an autogenerated message for OBS integration: This bug (1216638) was mentioned in https://build.opensuse.org/request/show/1126776 Factory / qemu
I can confirm that qemu-system-s390x no longer crashes during the build of python-onnx. I consider this fixed.
Then it is verified as fixed. Thanks for this collaborative contribution.
Verified by Berthold