Bug 1216638 - qemu-system-s390x crashes when building python-onnx with --vm-type=qemu
Status: VERIFIED FIXED
Alias: None
Product: openSUSE Tumbleweed
Classification: openSUSE
Component: KVM
Version: Current
Hardware: S/390-64 Other
Importance: P2 - High : Normal
Target Milestone: ---
Assignee: Dario Faggioli
QA Contact: E-mail List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-10-27 07:56 UTC by Berthold Gunreben
Modified: 2023-11-24 16:30 UTC
CC List: 9 users

See Also:
Found By: Community User
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Description Berthold Gunreben 2023-10-27 07:56:53 UTC
When using osc to build python-onnx on my x86_64 notebook for s390x with a command like this:

osc build --vm-type=qemu

System is openSUSE Tumbleweed
qemu-s390x has version 8.1.0
building is against current tumbleweed
notebook kernel is 6.6.0-rc4 from Kernel:HEAD (due to different issues with AMD graphics card)

qemu crashes during the build process. Ulrich Weigand thought it was a good idea to open a bug for this. I'll add the stack trace here; the core file, however, is about 450 MB, and I don't know where to put it. I'm happy to upload it wherever you like:

           PID: 9035 (qemu-system-s39)
           UID: 107 (qemu)
           GID: 107 (qemu)
         Slice: user-1000.slice
     Owner UID: 1000 (bg)
       Boot ID: 7954757a21bd43c4956ba7ef9027a353
    Machine ID: 6121e5cbf8c440c19e94ff66842aaae6
      Hostname: maxwell.drachenhort
       Storage: /var/lib/systemd/coredump/core.qemu-system-s39.107.7954757a21bd43c4956ba7ef9027a353.9035.1698391907000000.zst (present)
  Size on Disk: 450.7M
       Message: Process 9035 (qemu-system-s39) of user 107 dumped core.
                
                Stack trace of thread 9052:
                #0  0x00007f62ad491dec __pthread_kill_implementation (libc.so.6 + 0x91dec)
                #1  0x00007f62ad43f0c6 raise (libc.so.6 + 0x3f0c6)
                #2  0x00007f62ad4268d7 abort (libc.so.6 + 0x268d7)
                #3  0x00007f62ada89fd5 n/a (libglib-2.0.so.0 + 0x22fd5)
                #4  0x00007f62adaef8da g_assertion_message_expr (libglib-2.0.so.0 + 0x888da)
                #5  0x0000560d36a5dcd7 cc_calc_addu (qemu-system-s390x + 0x4cbcd7)
                #6  0x0000560d36a66e78 cc_calc_addu (qemu-system-s390x + 0x4d4e78)
                #7  0x0000560d36a66ed0 calc_cc (qemu-system-s390x + 0x4d4ed0)
                #8  0x0000560d36a68b99 do_program_interrupt (qemu-system-s390x + 0x4d6b99)
                #9  0x0000560d36a68e14 do_svc_interrupt (qemu-system-s390x + 0x4d6e14)
                #10 0x0000560d36b46314 cpu_handle_exception (qemu-system-s390x + 0x5b4314)
                #11 0x0000560d36b46dbd cpu_exec_setjmp (qemu-system-s390x + 0x5b4dbd)
                #12 0x0000560d36b46e8d cpu_exec (qemu-system-s390x + 0x5b4e8d)
                #13 0x0000560d36b5f8d3 tcg_cpus_exec (qemu-system-s390x + 0x5cd8d3)
                #14 0x0000560d36cfe0d8 qemu_thread_start (qemu-system-s390x + 0x76c0d8)
                #15 0x00007f62ad48ff44 start_thread (libc.so.6 + 0x8ff44)
                #16 0x00007f62ad5184cc __clone3 (libc.so.6 + 0x1184cc)
                
                Stack trace of thread 9038:
                #0  0x00007f62ad48c4ee __futex_abstimed_wait_common (libc.so.6 + 0x8c4ee)
                #1  0x00007f62ad48f230 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8f230)
                #2  0x0000560d36d0391b qemu_cond_wait_impl (qemu-system-s390x + 0x77191b)
                #3  0x0000560d369c52db qemu_wait_io_event (qemu-system-s390x + 0x4332db)
                #4  0x0000560d36b5f861 mttcg_cpu_thread_fn (qemu-system-s390x + 0x5cd861)
                #5  0x0000560d36cfe0d8 qemu_thread_start (qemu-system-s390x + 0x76c0d8)
                #6  0x00007f62ad48ff44 start_thread (libc.so.6 + 0x8ff44)
                #7  0x00007f62ad5184cc __clone3 (libc.so.6 + 0x1184cc)
                
                Stack trace of thread 9041:
                #0  0x00007f62ad48c4ee __futex_abstimed_wait_common (libc.so.6 + 0x8c4ee)
                #1  0x00007f62ad48f230 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8f230)
                #2  0x0000560d36d0391b qemu_cond_wait_impl (qemu-system-s390x + 0x77191b)
                #3  0x0000560d369c52db qemu_wait_io_event (qemu-system-s390x + 0x4332db)
                #4  0x0000560d36b5f861 mttcg_cpu_thread_fn (qemu-system-s390x + 0x5cd861)
                #5  0x0000560d36cfe0d8 qemu_thread_start (qemu-system-s390x + 0x76c0d8)
                #6  0x00007f62ad48ff44 start_thread (libc.so.6 + 0x8ff44)
                #7  0x00007f62ad5184cc __clone3 (libc.so.6 + 0x1184cc)
                
                Stack trace of thread 9040:
                #0  0x00007f6292c3fe43 n/a (n/a + 0x0)
                ELF object binary architecture: AMD x86-64
Comment 1 Ulrich Weigand 2023-10-27 08:48:51 UTC
Thanks for the bug report!

Ilya, can you have a look?
Comment 2 Berthold Gunreben 2023-10-27 13:29:57 UTC
For the time being, I have made the core file available at:

https://gunreben.synology.me:8443/core-qemu-s390x/core.qemu-system-s39.107.7954757a21bd43c4956ba7ef9027a353.9035.1698391907000000.zst
Comment 3 Dario Faggioli 2023-10-30 16:42:01 UTC
(In reply to Berthold Gunreben from comment #0)
> When using osc to build python-onnx on my x86_64 notebook for s390x with a
> command like this:
> 
> osc build --vm-type=qemu
> 
> System is openSUSE Tumbleweed
> qemu-s390x has version 8.1.0
>
What's the exact version of the QEMU rpm (e.g., `rpm -qa|grep qemu-s390x`)?
Comment 4 LTC BugProxy 2023-10-30 17:10:38 UTC
------- Comment From iii@de.ibm.com 2023-10-30 13:03 EDT-------
A similar issue was reported upstream a while ago: https://gitlab.com/qemu-project/qemu/-/issues/1913

I plan to start looking into it in the near future.
Comment 5 Berthold Gunreben 2023-10-30 21:33:30 UTC
(In reply to Dario Faggioli from comment #3)
> (In reply to Berthold Gunreben from comment #0)
> > When using osc to build python-onnx on my x86_64 notebook for s390x with a
> > command like this:
> > 
> > osc build --vm-type=qemu
> > 
> > System is openSUSE Tumbleweed
> > qemu-s390x has version 8.1.0
> >
> What's the exact version of the QEMU rpm (e.g., `rpm -qa|grep qemu-s390x`) ?

The version is:
qemu-s390x-8.1.0-2.2.x86_64
Comment 6 LTC BugProxy 2023-10-30 23:40:36 UTC
------- Comment From iii@de.ibm.com 2023-10-30 19:32 EDT-------
Unfortunately I could not analyze the core file, since the debuginfo for the respective versions of qemu and glib seems to be gone from the debuginfod server. However, the only way I can see this happening is mis-emulation of the LAALG instruction. Could you please try the following patch?

--- a/target/s390x/tcg/translate.c
+++ b/target/s390x/tcg/translate.c
@@ -2681,8 +2681,7 @@ static DisasJumpType op_laa(DisasContext *s, DisasOps *o)
     tcg_gen_atomic_fetch_add_i64(o->in2, o->in2, o->in1, get_mem_index(s),
                                  s->insn->data | MO_ALIGN);
     /* However, we need to recompute the addition for setting CC.  */
-    tcg_gen_add_i64(o->out, o->in1, o->in2);
-    return DISAS_NEXT;
+    return op_addu64(s, o);
 }

It helps with a synthetic testcase, which triggers this assertion.
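
To illustrate why the condition code needs a proper carry value, here is a minimal standalone sketch in C of how a 64-bit "load and add logical" operation and its CC could be modelled. It is not QEMU code: `laalg_model` and `cc_add_logical` are hypothetical names, C11 atomics stand in for the TCG atomic operation, and the CC encoding assumed here is the s390x ADD LOGICAL convention (0 or 1 without carry, 2 or 3 with carry, odd when the result is non-zero). The `assert` mirrors the `carry_out <= 1` invariant; a carry input that is not freshly computed as 0 or 1 would violate it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Condition code for an unsigned ("logical") 64-bit add:
 * bit 0 = result is non-zero, bit 1 = carry out of the top bit.
 * The carry must be exactly 0 or 1. */
static unsigned cc_add_logical(uint64_t carry_out, uint64_t result)
{
    assert(carry_out <= 1);
    return (unsigned)(result != 0) + 2u * (unsigned)carry_out;
}

/* Hypothetical model of LAALG: atomically add `addend` to *mem,
 * return the old memory value, and report the CC of the addition. */
static uint64_t laalg_model(_Atomic uint64_t *mem, uint64_t addend,
                            unsigned *cc)
{
    uint64_t old = atomic_fetch_add(mem, addend);
    uint64_t sum = old + addend;     /* recomputed only to derive the CC */
    uint64_t carry_out = sum < old;  /* 1 if the addition wrapped around */

    *cc = cc_add_logical(carry_out, sum);
    return old;
}
```

In this model, memory holding 0xFFFFFFFFFFFFFFFF and an addend of 1 returns the old value, leaves 0 in memory, and yields CC 2 (zero result with carry).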
Comment 7 LTC BugProxy 2023-10-31 01:00:20 UTC
------- Comment From iii@de.ibm.com 2023-10-30 20:56 EDT-------
I just realized I posted the wrong upstream bug tracker link. Sorry about that.

The correct link is: https://gitlab.com/qemu-project/qemu/-/issues/1865 ("ERROR:../target/s390x/tcg/cc_helper.c:128:cc_calc_addu: assertion failed: (carry_out <= 1)").
Comment 8 LTC BugProxy 2023-10-31 06:00:29 UTC
------- Comment From iii@de.ibm.com 2023-10-31 01:53 EDT-------
I posted this fix and a second one upstream: https://lists.gnu.org/archive/html/qemu-devel/2023-10/msg10251.html
Comment 9 Dario Faggioli 2023-10-31 07:52:24 UTC
Ok, we'll include the patches in our QEMU as soon as they're accepted upstream.
Comment 10 Berthold Gunreben 2023-10-31 10:10:55 UTC
(In reply to LTC BugProxy from comment #6)
> ------- Comment From iii@de.ibm.com 2023-10-30 19:32 EDT-------
> Unfortunately I could not analyze the core file, since the debuginfo for the
> respective versions of qemu and glib seems to be gone from the debuginfod
> server. However, the only way I can see this happening is mis-emulation of
> the LAALG instruction. Could you please try the following patch?
> 
> --- a/target/s390x/tcg/translate.c
> +++ b/target/s390x/tcg/translate.c
> @@ -2681,8 +2681,7 @@ static DisasJumpType op_laa(DisasContext *s, DisasOps
> *o)
> tcg_gen_atomic_fetch_add_i64(o->in2, o->in2, o->in1, get_mem_index(s),
> s->insn->data | MO_ALIGN);
> /* However, we need to recompute the addition for setting CC.  */
> -    tcg_gen_add_i64(o->out, o->in1, o->in2);
> -    return DISAS_NEXT;
> +    return op_addu64(s, o);
> }
> 
> It helps with a synthetic testcase, which triggers this assertion.

I added all four patches to the current qemu build and tried to use that. Unfortunately, the qemu process hangs during startup, and I aborted it after 5 minutes. An strace of that process shows many entries like these:

read(7, "\1\0\0\0\0\0\0\0", 512)        = 8
ppoll([{fd=0, events=POLLIN}, {fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=14, events=POLLIN}, {fd=15, events=POLLIN}, {fd=16, events=POLLIN}, {fd=17, events=POLLIN}, {fd=18, events=POLLIN}, {fd=19, events=POLLIN}, {fd=20, events=POLLIN}, {fd=21, events=POLLIN}, {fd=22, events=POLLIN}, {fd=23, events=POLLIN}, {fd=24, events=POLLIN}, {fd=25, events=POLLIN}, {fd=26, events=POLLIN}, {fd=27, events=POLLIN}, {fd=28, events=POLLIN}, {fd=29, events=POLLIN}, {fd=30, events=POLLIN}, {fd=31, events=POLLIN}, {fd=32, events=POLLIN}, {fd=33, events=POLLIN}, {fd=34, events=POLLIN}, {fd=35, events=POLLIN}, {fd=36, events=POLLIN}, ...], 75, {tv_sec=0, tv_nsec=2819928}, NULL, 8) = 0 (Timeout)
futex(0x560d8155d33c, FUTEX_WAKE_PRIVATE, 2147483647) = 1
futex(0x560d80378220, FUTEX_WAKE_PRIVATE, 1) = 1
ppoll([{fd=0, events=POLLIN}, {fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=14, events=POLLIN}, {fd=15, events=POLLIN}, {fd=16, events=POLLIN}, {fd=17, events=POLLIN}, {fd=18, events=POLLIN}, {fd=19, events=POLLIN}, {fd=20, events=POLLIN}, {fd=21, events=POLLIN}, {fd=22, events=POLLIN}, {fd=23, events=POLLIN}, {fd=24, events=POLLIN}, {fd=25, events=POLLIN}, {fd=26, events=POLLIN}, {fd=27, events=POLLIN}, {fd=28, events=POLLIN}, {fd=29, events=POLLIN}, {fd=30, events=POLLIN}, {fd=31, events=POLLIN}, {fd=32, events=POLLIN}, {fd=33, events=POLLIN}, {fd=34, events=POLLIN}, {fd=35, events=POLLIN}, {fd=36, events=POLLIN}, ...], 75, {tv_sec=1, tv_nsec=866405668}, NULL, 8) = 1 ([{fd=7, revents=POLLIN}], left {tv_sec=1, tv_nsec=866168154})
read(7, "\1\0\0\0\0\0\0\0", 512)        = 8
ppoll([{fd=0, events=POLLIN}, {fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=14, events=POLLIN}, {fd=15, events=POLLIN}, {fd=16, events=POLLIN}, {fd=17, events=POLLIN}, {fd=18, events=POLLIN}, {fd=19, events=POLLIN}, {fd=20, events=POLLIN}, {fd=21, events=POLLIN}, {fd=22, events=POLLIN}, {fd=23, events=POLLIN}, {fd=24, events=POLLIN}, {fd=25, events=POLLIN}, {fd=26, events=POLLIN}, {fd=27, events=POLLIN}, {fd=28, events=POLLIN}, {fd=29, events=POLLIN}, {fd=30, events=POLLIN}, {fd=31, events=POLLIN}, {fd=32, events=POLLIN}, {fd=33, events=POLLIN}, {fd=34, events=POLLIN}, {fd=35, events=POLLIN}, {fd=36, events=POLLIN}, ...], 75, {tv_sec=0, tv_nsec=399209418}, NULL, 8) = 1 ([{fd=7, revents=POLLIN}], left {tv_sec=0, tv_nsec=393119149})
read(7, "\1\0\0\0\0\0\0\0", 512)        = 8
ppoll([{fd=0, events=POLLIN}, {fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=14, events=POLLIN}, {fd=15, events=POLLIN}, {fd=16, events=POLLIN}, {fd=17, events=POLLIN}, {fd=18, events=POLLIN}, {fd=19, events=POLLIN}, {fd=20, events=POLLIN}, {fd=21, events=POLLIN}, {fd=22, events=POLLIN}, {fd=23, events=POLLIN}, {fd=24, events=POLLIN}, {fd=25, events=POLLIN}, {fd=26, events=POLLIN}, {fd=27, events=POLLIN}, {fd=28, events=POLLIN}, {fd=29, events=POLLIN}, {fd=30, events=POLLIN}, {fd=31, events=POLLIN}, {fd=32, events=POLLIN}, {fd=33, events=POLLIN}, {fd=34, events=POLLIN}, {fd=35, events=POLLIN}, {fd=36, events=POLLIN}, ...], 75, {tv_sec=0, tv_nsec=2921201}, NULL, 8) = 0 (Timeout)
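
For readers trying to interpret the trace above: an 8-byte read that returns the value 1 right after the descriptor was reported readable is consistent with an eventfd being drained by an event loop, although the trace alone does not prove what fd 7 is. The following is a minimal sketch in C of that wait-and-drain pattern, assuming Linux. The function names are hypothetical, and plain `poll()` is used instead of `ppoll()` to keep the sketch short.

```c
#include <assert.h>
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Wait up to timeout_ms for the eventfd to become readable, then drain it.
 * Returns the counter value that was read, or 0 on timeout. */
static uint64_t wait_and_drain(int efd, int timeout_ms)
{
    struct pollfd pfd = { .fd = efd, .events = POLLIN };
    uint64_t value = 0;

    if (poll(&pfd, 1, timeout_ms) == 1 && (pfd.revents & POLLIN)) {
        /* An eventfd read always transfers exactly 8 bytes. */
        ssize_t n = read(efd, &value, sizeof(value));
        assert(n == (ssize_t)sizeof(value));
        (void)n;
    }
    return value;
}

/* Signal the eventfd once, as another thread would do to wake the loop. */
static void notify(int efd)
{
    uint64_t one = 1;
    ssize_t n = write(efd, &one, sizeof(one));
    assert(n == (ssize_t)sizeof(one));
    (void)n;
}
```

A loop built this way alternates between timeouts and short wake-ups when idle, so such a trace of one thread does not by itself show whether the emulated guest is making progress.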
Comment 11 Berthold Gunreben 2023-10-31 10:19:24 UTC
I reproduced the issue with a new qemu from openSUSE:Factory. The core file is now available at:

https://gunreben.synology.me:8443/core-qemu-s390x/core.qemu-system-s39.107.af2ba91654fa432fb1f4d0f4de830d84.17957.1698747135000000.zst
Comment 12 Berthold Gunreben 2023-10-31 10:20:19 UTC
(In reply to Berthold Gunreben from comment #11)
> I reproduced the issue with a new qemu from openSUSE:Factory. Corefile is
> now found at 
> 
> https://gunreben.synology.me:8443/core-qemu-s390x/core.qemu-system-s39.107.
> af2ba91654fa432fb1f4d0f4de830d84.17957.1698747135000000.zst

The qemu version is now qemu-s390x-8.1.2-1.1.x86_64
Comment 13 LTC BugProxy 2023-11-02 15:12:59 UTC
------- Comment From iii@de.ibm.com 2023-11-02 11:03 EDT-------
I extracted the failing s390x instruction from the core dump, which turned out to be CLC. This is what one of my patches is supposed to fix.

When I view the new core file in GDB (with debuginfod, which now works), I see:

(gdb) disassemble /s op_clc

../target/s390x/tcg/translate.c:
2019	        tcg_gen_qemu_ld_tl(cc_src, o->addr1, get_mem_index(s), mop);

../target/s390x/tcg/translate.c:
2020	        tcg_gen_qemu_ld_tl(cc_dst, o->in2, get_mem_index(s), mop);

which is unpatched code.

Could you please double check that the changes really found their way into the binary repo? I did `zypper source-install qemu`, got:

Version:        8.1.2
Release:        1.2

and I'm not seeing the patches there either, but maybe I'm confused about which repo the updated build was supposed to go to.
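
As background on the CLC instruction mentioned above, here is a minimal sketch in C of the comparison it performs, as I understand the architecture: two byte strings of 1 to 256 bytes are compared left to right as unsigned bytes, and the condition code reports equal (0), first operand low (1), or first operand high (2). `clc_model` is a hypothetical name and this is not the QEMU implementation. The relevant point for this bug is that both operands have to be read from memory before the CC inputs are known, which is the step where the unpatched code shown above loads directly into `cc_src` and `cc_dst`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the CLC condition code.
 * 0 = operands equal, 1 = first operand low, 2 = first operand high. */
static unsigned clc_model(const uint8_t *op1, const uint8_t *op2, size_t len)
{
    /* CLC encodes a length of 1 to 256 bytes. */
    assert(len >= 1 && len <= 256);

    for (size_t i = 0; i < len; i++) {
        if (op1[i] != op2[i]) {
            /* The first differing byte decides, compared as unsigned. */
            return op1[i] < op2[i] ? 1u : 2u;
        }
    }
    return 0u;
}
```

Note that the comparison is unsigned: a byte of 0xff compares high against 0x01.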
Comment 14 Dario Faggioli 2023-11-08 08:15:13 UTC
(In reply to LTC BugProxy from comment #13)
> and I'm not seeing the patches there either; but maybe I'm confused with
> respect to in which repo the updated build was supposed to go.
>
Well, as far as I can see, there is a pull request that includes your patches, and it is from yesterday:

https://lore.kernel.org/qemu-devel/20231107183228.276424-1-thuth@redhat.com/

I've also just looked at the upstream master branch and, as far as I can see, the patches are not there yet. (Or am I just not seeing them?)

We tend not to pick patches from mailing lists. We either wait for them to arrive via a stable release or, if they're important, apply them ourselves, but not before they're committed upstream. This is, in fact, what I meant in comment 9. Sorry if it wasn't clear enough.

So, no, the patches are not there yet. They will be once they have made it upstream, so that I can cherry-pick them from the master branch, which is how our process works.
Comment 15 LTC BugProxy 2023-11-13 10:29:50 UTC
------- Comment From iii@de.ibm.com 2023-11-13 05:25 EDT-------
The following commits are now in master:

commit aba2ec341c6d20c8dc3e6ecf87fa7c1a71e30c1e
Author: Ilya Leoshkevich <iii@linux.ibm.com>
Date:   Mon Nov 6 10:31:22 2023 +0100

target/s390x: Fix CLC corrupting cc_src

and

commit bea402482a8c94389638cbd3d7fe3963fb317f4c
Author: Ilya Leoshkevich <iii@linux.ibm.com>
Date:   Mon Nov 6 10:31:24 2023 +0100

target/s390x: Fix LAALG not updating cc_src

I believe they should be enough to resolve this problem.
Comment 16 Dario Faggioli 2023-11-15 12:21:09 UTC
(In reply to LTC BugProxy from comment #15)
> I believe they should be enough to resolve this problem.
>
Ok, thanks. Pushed the "backports" here:
https://github.com/openSUSE/qemu/commits/factory

Now OBS will build our Staging package here:
https://build.opensuse.org/package/show/Virtualization:Staging/qemu

And, if everything is ok, I'll push it to our Devel Project and then to Factory.
Comment 17 OBSbugzilla Bot 2023-11-16 08:35:02 UTC
This is an autogenerated message for OBS integration:
This bug (1216638) was mentioned in
https://build.opensuse.org/request/show/1126776 Factory / qemu
Comment 19 Berthold Gunreben 2023-11-23 10:04:52 UTC
I can confirm that qemu-system-s390x no longer crashes during the build of python-onnx. I consider this fixed.
Comment 20 Sarah Kriesch 2023-11-24 11:07:19 UTC
Then it is verified as fixed.
Thanks for this collaborative contribution.
Comment 21 Sarah Kriesch 2023-11-24 11:08:11 UTC
Verified by Berthold