Bugzilla – Bug 1215337
python-onnx does not build s390x
Last modified: 2024-07-10 12:05:10 UTC
python-onnx does not build on openSUSE:Factory:zSystems for quite some time:

osc buildhistory openSUSE:Factory:zSystems python-onnx standard s390x
TIME                 SRCMD5                            VER-REL.BUILD#  REV  DURATION
2019-11-26 15:59:51  c7251471230f79907f69762d42f620a6  1.6.0-1.1       1    414
2019-12-24 10:29:50  c7251471230f79907f69762d42f620a6  1.6.0-1.2       1    821
2020-01-07 20:22:00  c7251471230f79907f69762d42f620a6  1.6.0-1.3       1    138
2020-01-08 17:02:51  0147179dbf202fdba63d0887ccf75400  1.6.0-2.1       2    192
2020-01-11 08:08:49  0147179dbf202fdba63d0887ccf75400  1.6.0-2.2       2    137
2020-01-21 18:59:45  0147179dbf202fdba63d0887ccf75400  1.6.0-2.3       2    135
2020-02-05 07:12:05  0147179dbf202fdba63d0887ccf75400  1.6.0-2.4       2    668
2020-02-07 20:58:34  0147179dbf202fdba63d0887ccf75400  1.6.0-2.5       2    178
2020-02-16 16:40:23  0147179dbf202fdba63d0887ccf75400  1.6.0-2.6       2    280
2020-02-18 13:51:04  0147179dbf202fdba63d0887ccf75400  1.6.0-2.7       2    227
2020-02-19 17:01:33  0147179dbf202fdba63d0887ccf75400  1.6.0-2.8       2    264

This package is needed in order to prepare a container for onnx-mlir which is the platform that also supports IBM Telum processors on Mainframe.
Typical Big Endian error:

[ 4283s] if not isinstance(model, ModelProto):
[ 4283s]     raise ValueError(f'VersionConverter only accepts ModelProto as model, incorrect type: {type(model)}')
[ 4283s] if not isinstance(target_version, int):
[ 4283s]     raise ValueError(f'VersionConverter only accepts int as target_version, incorrect type: {type(target_version)}')
[ 4283s] model_str = model.SerializeToString()
[ 4283s] > converted_model_str = C.convert_version(model_str, target_version)
[ 4283s] E onnx.onnx_cpp2py_export.shape_inference.InferenceError: [ShapeInferenceError] (op_type:Upsample): [ShapeInferenceError] Inferred shape and existing shape differ in dimension 0: (0) vs (1)
[ 4283s]
[ 4283s] ../../../BUILDROOT/python-onnx-1.12.0-3.1.s390x/usr/lib64/python3.9/site-packages/onnx/version_converter.py:166: InferenceError
For the time being, I just disabled the tests that break the build. Please have a look at

osc rdiff openSUSE:Factory python-onnx openSUSE:Factory:zSystems

I would think that IBM should be interested in debugging this themselves.

One thing that surprised me is that the build on my notebook with qemu was almost as fast as the one on real hardware in OBS.

Notebook: 7817s
Mainframe: 7657s

I used this command to build locally:

osc build --vm-type=qemu --vm-memory=12G standard s390x

and of course, the notebook is rather powerful with 16 virtual cores (8 cores, SMT2, Ryzen 6900HS)
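To put the two build durations quoted above into proportion, here is a small arithmetic sketch. The variable names are made up for illustration; only the two numbers come from the comment.

```python
# Build durations in seconds, as reported in the comment above.
notebook_qemu_s = 7817   # local build under qemu on the notebook
mainframe_obs_s = 7657   # build on real s390x hardware in OBS

# Absolute and relative difference between the two builds.
difference_s = notebook_qemu_s - mainframe_obs_s
slowdown_pct = 100.0 * difference_s / mainframe_obs_s

print(f"difference: {difference_s} s")
print(f"qemu build slower by {slowdown_pct:.1f} %")
```

So the emulated build took 160 seconds longer, which is roughly 2 % of the total build time.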
Hi Berthold, I can forward this bug report only with the build/error log. Can you attach it, please?
Created attachment 869863 [details] buildlog with failing tests
Hi Thomas, as Berthold is mentioning, python-onnx is really important for onnx-mlir and with that for using AI related software on IBM Telum processor(s). (In reply to Berthold Gunreben from comment #0) > python-onnx does not build on openSUSE:Factory:zSystems for quite some time: ... > This package is needed in order to prepare a container for onnx-mlir which > is the platform that also supports IBM Telum processors on Mainframe. Can you forward that to any Python Expert for this topic, please?
Does it still not build as I am getting now:

2019-11-26 15:59:51  c7251471230f79907f69762d42f620a6  1.6.0-1.1   1  414
2019-12-24 10:29:50  c7251471230f79907f69762d42f620a6  1.6.0-1.2   1  821
2020-01-07 20:22:00  c7251471230f79907f69762d42f620a6  1.6.0-1.3   1  138
2020-01-08 17:02:51  0147179dbf202fdba63d0887ccf75400  1.6.0-2.1   2  192
2020-01-11 08:08:49  0147179dbf202fdba63d0887ccf75400  1.6.0-2.2   2  137
2020-01-21 18:59:45  0147179dbf202fdba63d0887ccf75400  1.6.0-2.3   2  135
2020-02-05 07:12:05  0147179dbf202fdba63d0887ccf75400  1.6.0-2.4   2  668
2020-02-07 20:58:34  0147179dbf202fdba63d0887ccf75400  1.6.0-2.5   2  178
2020-02-16 16:40:23  0147179dbf202fdba63d0887ccf75400  1.6.0-2.6   2  280
2020-02-18 13:51:04  0147179dbf202fdba63d0887ccf75400  1.6.0-2.7   2  227
2020-02-19 17:01:33  0147179dbf202fdba63d0887ccf75400  1.6.0-2.8   2  264
2023-09-26 13:54:55  58e929acfff8575cc616eb942c686a64  1.12.0-2.1  2  7688

Can you please also re-check after https://build.opensuse.org/request/show/1116892 was accepted.
Berthold has disabled the tests. Therefore, the build is successful now.
Berthold, can you remove your SR again, please?
(In reply to Christian Goll from comment #6) > Does it still not build as I am getting now: > 2023-09-26 13:54:55 58e929acfff8575cc616eb942c686a64 1.12.0-2.1 2 > 7688 > > Can you please also re-check after > https://build.opensuse.org/request/show/1116892 was accepted. yes, the build worked because I disabled the failing checks. I removed the package from openSUSE:Factory:zSystems to let you see the real results, and also to pick up the new package after the SR has been accepted. I see that there has not been a test of s390x from within the SR, therefore, I have some doubts that the situation will change.
The package is on failed again.
(In reply to Sarah Kriesch from comment #10) > The package is on failed again. Right. So an update to the latest version did not resolve the problem. Anyway, this ticket is barking up the wrong tree: Instead of setting expectations for a community contributor to fix BE issues, IBM should look into this. The screening team had assigned this purely based on package maintainership, however, others on this ticket who are much closer to IBM could have fixed this.
(In reply to Egbert Eich from comment #11) > Anyway, this ticket is barking up the wrong tree: Instead of setting > expectations for a community contributor to fix BE issues, IBM should look > into this. > > The screening team had assigned this purely based on package maintainership, > however, others on this ticket who are much closer to IBM could have fixed > this. That is the reason, why I have forwarded it to IBM. I did the review of this bugreport and have identified, that it is a deeper architecture related issue. It is already mirrored.
(In reply to Sarah Kriesch from comment #12) > (In reply to Egbert Eich from comment #11) > > Anyway, this ticket is barking up the wrong tree: Instead of setting > > expectations for a community contributor to fix BE issues, IBM should look > > into this. > > > > The screening team had assigned this purely based on package maintainership, > > however, others on this ticket who are much closer to IBM could have fixed > > this. > > That is the reason, why I have forwarded it to IBM. I did the review of this > bugreport and have identified, that it is a deeper architecture related > issue. > It is already mirrored. Yeah, I've seen that. But: $ host bugzilla.linux.ibm.com Host bugzilla.linux.ibm.com not found: 3(NXDOMAIN) Next time, please reassign and add the maintainer to Cc. Thanks :)
(In reply to Egbert Eich from comment #13) > (In reply to Sarah Kriesch from comment #12) > > (In reply to Egbert Eich from comment #11) > > > Anyway, this ticket is barking up the wrong tree: Instead of setting > > > expectations for a community contributor to fix BE issues, IBM should look > > > into this. > > > > > > The screening team had assigned this purely based on package maintainership, > > > however, others on this ticket who are much closer to IBM could have fixed > > > this. > > > > That is the reason, why I have forwarded it to IBM. I did the review of this > > bugreport and have identified, that it is a deeper architecture related > > issue. > > It is already mirrored. > > Yeah, I've seen that. But: > $ host bugzilla.linux.ibm.com > Host bugzilla.linux.ibm.com not found: 3(NXDOMAIN) > > Next time, please reassign and add the maintainer to Cc. Thanks :) Hi Egbert, yes, the IBM Bugzilla is only visible in the IBM Intranet.
------- Comment From Andreas.Krebbel@de.ibm.com 2023-10-20 05:01 EDT------- I had a quick look at the upstream version. The shape inference test does not fail there anymore but there are more other failures. From a quick look it appears to be related to specific data types and saving/loading of raw tensors. Some of it is certainly endianness related and for others additional support might be required. I'll check with the onnx-mlir (DLC) folks who have contributed before or will have a look myself but it will take some time. Apart from preventing the OpenSUSE package from building successfully, are there any known problems which prevent using it for onnx-mlir purposes right now? I would rather want to focus on these cases first. Of course I agree that we should address these issues. However, as a workaround could you use the variant installable via pip to continue? This version might be more up-to-date and I think this is how most people use it anyway.
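To illustrate the kind of endianness problem with raw tensors mentioned above, here is a minimal sketch using only the Python standard library. It is not onnx code: the sample values are made up, and the "misread" step merely simulates what a native-order read would do on a big-endian host such as s390x. The one fact taken from ONNX is that `TensorProto.raw_data` is defined as little-endian.

```python
import struct

# Hypothetical sample data: three float32 values in [0, 1).
values = (0.5322033, 0.9073346, 0.5053618)

# Serialize as little-endian float32, the byte order ONNX uses for raw_data.
raw_le = struct.pack("<3f", *values)

# Simulated bug: interpret the same bytes in big-endian order, which is
# what reading in native order amounts to on s390x.
misread = struct.unpack(">3f", raw_le)

# Endian-safe read: always name the stored byte order explicitly.
correct = struct.unpack("<3f", raw_le)

print("misread:", misread)
print("correct:", correct)
```

The misread values are unrelated to the stored ones, whereas the explicit little-endian read gives the same result on every architecture.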
Hi Andreas, nice to meet you again on Bugzilla! :) Christian, the package maintainer, has already updated the python-onnx package. You can get every build/error log based on the latest updates at this link: https://build.opensuse.org/package/live_build_log/openSUSE:Factory:zSystems/python-onnx/standard/s390x @Christian: Do you know why python-onnx exists but onnx-mlir does not? I cannot find it in our repositories. (A short answer is enough.)
pip is resulting in the following error message on openSUSE Tumbleweed (s390x):

pip install onnx
Defaulting to user installation because normal site-packages is not writeable
Collecting onnx
  Downloading onnx-1.14.1.tar.gz (11.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 11.3/11.3 MB 31.2 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... error
  error: subprocess-exited-with-error

  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> [19 lines of output]
      fatal: not a git repository (or any parent up to mount point /)
      Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
      Traceback (most recent call last):
        File "/usr/lib/python3.8/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 351, in <module>
          main()
        File "/usr/lib/python3.8/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 333, in main
          json_out['return_val'] = hook(**hook_input['kwargs'])
        File "/usr/lib/python3.8/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 118, in get_requires_for_build_wheel
          return hook(config_settings)
        File "/tmp/pip-build-env-ndn81q7z/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 355, in get_requires_for_build_wheel
          return self._get_build_requires(config_settings, requirements=['wheel'])
        File "/tmp/pip-build-env-ndn81q7z/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 325, in _get_build_requires
          self.run_setup()
        File "/tmp/pip-build-env-ndn81q7z/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 507, in run_setup
          super(_BuildMetaLegacyBackend, self).run_setup(setup_script=setup_script)
        File "/tmp/pip-build-env-ndn81q7z/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 341, in run_setup
          exec(code, locals())
        File "<string>", line 85, in <module>
      AssertionError: Could not find cmake executable!
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.
I have noticed that I used pip for Python 3.8 instead of python311 (the latest Python version in openSUSE). In this version, we cannot use pip for packages that already exist in openSUSE.
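The pip log above points at two environment issues: pip ran under Python 3.8, and the source build could not find cmake. A minimal pre-flight sketch for checking both before retrying might look like the following. The function name `build_preflight` and the default minimum version (3, 11) are assumptions for illustration and are not part of pip or onnx.

```python
import shutil
import sys


def build_preflight(min_python=(3, 11)):
    """Collect facts relevant to building onnx from its source tarball."""
    return {
        # Version of the interpreter that is actually running.
        "python": sys.version_info[:3],
        # True if the interpreter is at least the assumed minimum version.
        "python_ok": sys.version_info[:2] >= min_python,
        # "big" on s390x, "little" on x86_64 and aarch64.
        "byteorder": sys.byteorder,
        # Path to cmake, or None if it is not on PATH.
        "cmake": shutil.which("cmake"),
    }


if __name__ == "__main__":
    for key, value in build_preflight().items():
        print(f"{key}: {value}")
```

Running the script with the intended interpreter (for example `python3.11 preflight.py`, where the file name is hypothetical) and then installing with `python3.11 -m pip install onnx` keeps pip tied to that same interpreter.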
Latest Assertion Errors:

[ 465s] _ TestExternalDataToArray_0_protobuf.test_save_model_with_external_data_multiple_times _
[ 465s] [gw0] linux -- Python 3.10.14 /usr/bin/python3.10
[ 465s] test/test_external_data.py:685: in test_save_model_with_external_data_multiple_times
[ 465s]     np.testing.assert_allclose(
[ 465s] /usr/lib64/python3.10/contextlib.py:79: in inner
[ 465s]     return func(*args, **kwds)
[ 465s] E AssertionError:
[ 465s] E Not equal to tolerance rtol=1e-07, atol=0
[ 465s] E
[ 465s] E x and y nan location mismatch:
[ 465s] E x: array([[[ 6.166898e+34, 4.026997e-26, 1.645485e+22, ...,
[ 465s] E 1.920033e+25, 2.967944e-20, -4.614014e-04],
[ 465s] E [-1.356084e+15, -8.810646e-38, -2.061762e+04, ...,...
[ 465s] E y: array([[[5.322033e-01, 9.073346e-01, 5.053618e-01, ..., 6.152101e-01,
[ 465s] E 6.564350e-01, 4.549692e-01],
[ 465s] E [6.703315e-01, 4.256554e-01, 5.766872e-01, ..., 9.197403e-01,...
[ 465s] _____ TestExternalDataToArray_0_protobuf.test_to_array_with_external_data ______
[ 465s] [gw0] linux -- Python 3.10.14 /usr/bin/python3.10
[ 465s] test/test_external_data.py:664: in test_to_array_with_external_data
[ 465s]     np.testing.assert_allclose(loaded_large_data, self.large_data)
[ 465s] /usr/lib64/python3.10/contextlib.py:79: in inner
[ 465s]     return func(*args, **kwds)
[ 465s] E AssertionError:
[ 465s] E Not equal to tolerance rtol=1e-07, atol=0
[ 465s] E
[ 465s] E x and y nan location mismatch:
[ 465s] E x: array([[[ 5.798805e-30, -1.165122e+02, 8.127961e+20, ...,
[ 465s] E 1.806727e-37, -8.921661e-06, 5.233179e-20],
[ 465s] E [-5.477265e+32, -4.729377e+18, -1.624872e+24, ...,...
[ 465s] E y: array([[[0.002852, 0.527004, 0.746832, ..., 0.459877, 0.340009,
[ 465s] E 0.630724],
[ 465s] E [0.135593, 0.767637, 0.135425, ..., 0.046752, 0.25354 ,...
[ 465s] _ TestExternalDataToArray_1_textproto.test_save_model_with_external_data_multiple_times _
[ 465s] [gw0] linux -- Python 3.10.14 /usr/bin/python3.10
[ 465s] test/test_external_data.py:685: in test_save_model_with_external_data_multiple_times
[ 465s]     np.testing.assert_allclose(
[ 465s] /usr/lib64/python3.10/contextlib.py:79: in inner
[ 465s]     return func(*args, **kwds)
[ 465s] E AssertionError:
[ 465s] E Not equal to tolerance rtol=1e-07, atol=0
[ 465s] E
[ 465s] E x and y nan location mismatch:
[ 465s] E x: array([[[ 1.639122e+29, 7.927940e-02, -1.220582e-34, ...,
[ 465s] E 5.746708e+14, 3.756967e+37, -3.229969e+24],
[ 465s] E [-1.614809e-31, 2.402144e-37, 8.529220e-11, ...,...
[ 465s] E y: array([[[9.063177e-01, 5.410980e-02, 1.856786e-01, ..., 3.320491e-01,
[ 465s] E 6.167372e-01, 4.964211e-01],
[ 465s] E [3.072628e-01, 9.829561e-01, 7.018124e-02, ..., 1.915139e-01,...
[ 465s] __________________ test_make_tensor_raw[TensorProto.FLOAT16] ___________________
[ 465s] [gw0] linux -- Python 3.10.14 /usr/bin/python3.10
[ 465s] test/helper_test.py:928: in test_make_tensor_raw
[ 465s]     np.testing.assert_equal(np_array, numpy_helper.to_array(tensor))
[ 465s] /usr/lib64/python3.10/contextlib.py:79: in inner
[ 465s]     return func(*args, **kwds)
[ 465s] E AssertionError:
[ 465s] E Arrays are not equal
[ 465s] E
[ 465s] E Mismatched elements: 6 / 6 (100%)
[ 465s] E Max absolute difference: 1.146
[ 465s] E Max relative difference: 1576.
[ 465s] E x: array([[ 1.309 , -0.0258, 1.1455],
[ 465s] E [ 0.3464, 0.774 , -0.7744]], dtype=float16)
[ 465s] E y: array([[ 1.060e+00, -3.735e-03, -1.278e-03],
[ 465s] E [-2.199e-04, 1.633e-01, 2.102e-01]], dtype=float16)
Fedora has to exclude the s390x architecture because of build failures in python-onnx: https://bugzilla.redhat.com/show_bug.cgi?id=2212096 I will try excluding python310 to solve our issue.
Python312 has got following error messages:

[ 4951s] FAILED test/helper_test.py::TestHelperTensorFunctions::test_make_bfloat16_tensor_raw - AssertionError:
[ 4951s] FAILED test/helper_test.py::TestHelperTensorFunctions::test_make_float8e4m3fn_tensor_raw - ValueError: buffer size must be a multiple of element size
[ 4951s] FAILED test/helper_test.py::TestHelperTensorFunctions::test_make_float8e4m3fnuz_tensor_raw - ValueError: buffer size must be a multiple of element size
[ 4951s] FAILED test/helper_test.py::TestHelperTensorFunctions::test_make_float8e5m2_tensor_raw - ValueError: buffer size must be a multiple of element size
[ 4951s] FAILED test/helper_test.py::TestHelperTensorFunctions::test_make_float8e5m2fnuz_tensor_raw - ValueError: buffer size must be a multiple of element size
[ 4951s] FAILED test/helper_test.py::test_make_tensor_raw[TensorProto.FLOAT] - AssertionError:
[ 4951s] FAILED test/helper_test.py::test_make_tensor_raw[TensorProto.FLOAT16] - AssertionError:
[ 4951s] FAILED test/helper_test.py::test_make_tensor_raw[TensorProto.DOUBLE] - AssertionError:
[ 4951s] FAILED test/helper_test.py::test_make_tensor_raw[TensorProto.COMPLEX64] - AssertionError:
[ 4951s] FAILED test/helper_test.py::test_make_tensor_raw[TensorProto.COMPLEX128] - AssertionError:
[ 4951s] FAILED test/helper_test.py::test_make_tensor_raw[TensorProto.UINT32] - AssertionError:
[ 4952s] FAILED test/helper_test.py::test_make_tensor_raw[TensorProto.UINT64] - AssertionError:
[ 4952s] FAILED test/test_external_data.py::TestExternalDataToArray_0_protobuf::test_save_model_with_external_data_multiple_times - AssertionError:
[ 4952s] FAILED test/test_external_data.py::TestExternalDataToArray_0_protobuf::test_to_array_with_external_data - AssertionError:
[ 4952s] FAILED test/test_external_data.py::TestExternalDataToArray_1_textproto::test_save_model_with_external_data_multiple_times - AssertionError:
[ 4952s] FAILED test/test_external_data.py::TestExternalDataToArray_1_textproto::test_to_array_with_external_data - AssertionError:
[ 4952s] ==== 16 failed, 4394 passed, 3020 skipped, 32 warnings in 729.60s (0:12:09) ====
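The `ValueError: buffer size must be a multiple of element size` reported for the float8 tests is the message NumPy's `np.frombuffer` raises when the byte count does not divide evenly by the element width. The sketch below only shows how that class of error arises in general. Whether this is the actual code path inside onnx on s390x is not established in this thread, and the three-byte payload is a made-up example.

```python
import numpy as np

# Three raw bytes, e.g. three values of a 1-byte element type.
payload = bytes(3)

# Reading them as 1-byte elements works and yields three elements.
as_uint8 = np.frombuffer(payload, dtype=np.uint8)

# Reading the same bytes as 2-byte elements cannot work,
# because 3 is not a multiple of 2.
try:
    np.frombuffer(payload, dtype=np.uint16)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(as_uint8.shape, error_message)
```

If a conversion path assumes a wider element type than the one actually stored, an odd number of 1-byte values would trigger exactly this message.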
@Andreas and @Marcus: IBM is supporting Python ONNX together with the IBM Deep Learning Compiler on Linux on Z: https://ibm.github.io/ai-on-z-101/onnxdlc/ Can you please tell us which Python versions are supported? It is not buildable on openSUSE and Fedora. I am open to adapting the Python versions in our build spec for python-onnx.
------- Comment From Andreas.Krebbel@de.ibm.com 2024-06-17 10:17 EDT------- The same python versions which work on x86 should also work on s390x. I've just tried to build upstream onnx with Python 3.10.14 on Fedora 40 and that worked fine. Do you have the build logs of a failing build at hand? Btw. we continued working on the ONNX testsuite failures. With these two changes from a colleague we are down to a single failure. Second one needs to be merged though: https://github.com/onnx/onnx/commit/3d976ff359fe22c8f881a15119c3f61e497dd86a https://github.com/onnx/onnx/pull/6183
The highest available Python version in openSUSE and Fedora is 3.13. That is the build log with excluding 3.10: https://build.opensuse.org/package/live_build_log/home:AdaLovelace:branches:openSUSE:Factory:zSystems/python-onnx/openSUSE_Factory_zSystems/s390x That is the build log including Python 3.10, Python 3.11, Python 3.12 and Python 3.13: https://build.opensuse.org/package/live_build_log/openSUSE:Factory:zSystems/python-onnx/standard/s390x Our Python Onnx Maintainers are mainlining/updating continuously. We are using onnx 1.16.0 at the moment. So, do you mean I should exclude Python 3.10-3.12? What is supported on your "supported" Enterprise distributions for AI on IBM Z?
I want to test your referenced patch.
@Andreas This bug report has been forwarded to you via the IBM Bugzilla. Therefore, you can get an openSUSE VM for development (not only Fedora) by IBM. Fedora is the partner community of openSUSE, and we are supporting each other. Please fix the bug on openSUSE Tumbleweed. As soon as python-onnx is working on openSUSE, it will be enabled for builds on Fedora also again.
------- Comment From Andreas.Krebbel@de.ibm.com 2024-06-18 01:59 EDT------- (In reply to comment #25) > The highest available Python version in openSUSE and Fedora is 3.13. > > That is the build log with excluding 3.10: > https://build.opensuse.org/package/live_build_log/home:AdaLovelace:branches: > openSUSE:Factory:zSystems/python-onnx/openSUSE_Factory_zSystems/s390x > > That is the build log including Python 3.10, Python 3.11, Python 3.12 and > Python 3.13: > https://build.opensuse.org/package/live_build_log/openSUSE:Factory:zSystems/ > python-onnx/standard/s390x These logs do not indicate actual build failures to me. The builds are done at this point. In both logs I see 16 *testsuite* failures. These are known to us and we are working on it, but I do not see a reason to let the build fail because of that. Once the patches mentioned earlier have been merged we will be down to one fail in our tests. The last remaining fail seems to be triggered by a build issue of the pillow package and not onnx. We will try to address this. > So, do you mean I should exclude Python 3.10-3.12? I don't see why these testsuite failures are related to the Python version. Are you saying that it works with Python 3.9? That would surprise me. > What is supported on your "supported" Enterprise distributions for AI on IBM > Z? This depends on what you mean by supported. Actually supported are the packages used in products like the AI Toolkit or CP4D. Apart from that, we try to keep things working best can do. In principle all the Python versions which are working on other platforms should also work on IBM Z. If not, we will try have a look asap.
(In reply to LTC BugProxy from comment #27) > ------- Comment From Andreas.Krebbel@de.ibm.com 2024-06-18 01:59 EDT------- > (In reply to comment #25) > These logs do not indicate actual build failures to me. The builds are done > at this point. In both logs I see 16 *testsuite* failures. These are known > to us and we are working on it, but I do not see a reason to let the build > fail because of that. Once the patches mentioned earlier have been merged we > will be down to one fail in our tests. The last remaining fail seems to be > triggered by a build issue of the pillow package and not onnx. We will try > to address this. The OBS has got an integrated CI/CD pipeline. Therefore, if too many tests (from upstream) are failing, our builds are failing also. You can experience that also with Koji at Fedora. I want to highlight that we do not receive any z16 system (all Linux distributions). We have to trust you as a Developer that everything is working as expected. Of course, we can disable the upstream tests (as Berthold did after the creation of this bug report). Do you want to guarantee that all AI-specific features will work as expected on z16 if we can not test them based on missing hardware? openSUSE is a rolling release (an upstream Linux distribution) and we are running upstream tests for all Python packages (or most based on the feature that we can disable also). If we have to trust any tests for AI, we have to do that based on what is available. Our Python Maintainers are also on this bug report. You can develop based on openSUSE because of this IBM Bugzilla entry. Our collaboration should be improved in the future. We have trust you with your development (not only on Fedora). We are both rpm-based Linux distributions but with different package versions. 
Yesterday, I was close to writing in my release announcement for the IBM community that Linux distributions can no longer guarantee working z16 features for AI, cryptography, and IBM-developed Confidential Computing, because z16 hardware for testing/validation and tests from the developer side are missing.
------- Comment From Andreas.Krebbel@de.ibm.com 2024-06-19 05:37 EDT------- (In reply to comment #28) > (In reply to LTC BugProxy from comment #27) > > (In reply to comment #25) > > These logs do not indicate actual build failures to me. The builds are done > > at this point. In both logs I see 16 *testsuite* failures. These are known > > to us and we are working on it, but I do not see a reason to let the build > > fail because of that. Once the patches mentioned earlier have been merged we > > will be down to one fail in our tests. The last remaining fail seems to be > > triggered by a build issue of the pillow package and not onnx. We will try > > to address this. > > The OBS has got an integrated CI/CD pipeline. Therefore, if too many tests > (from upstream) are failing, our builds are failing also. You can experience > that also with Koji at Fedora. What, in general, is a good thing to get aware of problems, but also has the consequence of considering a package build to be failing because of testcase bugs. > I want to highlight that we do not receive any z16 system (all Linux > distributions). We have to trust you as a Developer that everything is > working as expected. Of course, we can disable the upstream tests (as > Berthold did after the creation of this bug report). Do you want to > guarantee that all AI-specific features will work as expected on z16 if we > can not test them based on missing hardware? This has nothing to do with z16. The onnx package does not have any support for z16 as far as I can tell. > openSUSE is a rolling release (an upstream Linux distribution) and we are > running upstream tests for all Python packages (or most based on the feature > that we can disable also). If we have to trust any tests for AI, we have to > do that based on what is available. Our Python Maintainers are also on this > bug report. Do you really see the onnx distro package being used frequently? 
To my understanding most people seem to use what comes from the python package index repository instead. To be honest I'm not sure how relevant this package is at all. > You can develop based on openSUSE because of this IBM Bugzilla entry. Our > collaboration should be improved in the future. We have trust you with your > development (not only on Fedora). We are both rpm-based Linux distributions > but with different package versions. > > Yesterday, I was shortly before writing into my Release Announcement for the > IBM Community, that Linux distributions can not guarantee working z16 > features for AI, Cryptography and IBM developed Confidential Computing any > more based on missing z16 hardware for testing/validation and missing tests > from Developer side. What are the z16 specific packages you ship with opensuse? I'm not aware of anything but libzdnn and this is fully supported by IBM.
(In reply to LTC BugProxy from comment #29) > ------- Comment From Andreas.Krebbel@de.ibm.com 2024-06-19 05:37 EDT------- > (In reply to comment #28) > > (In reply to LTC BugProxy from comment #27) > > > (In reply to comment #25) > > > These logs do not indicate actual build failures to me. The builds are done > > > at this point. In both logs I see 16 *testsuite* failures. These are known > > > to us and we are working on it, but I do not see a reason to let the build > > > fail because of that. Once the patches mentioned earlier have been merged we > > > will be down to one fail in our tests. The last remaining fail seems to be > > > triggered by a build issue of the pillow package and not onnx. We will try > > > to address this. > > > > The OBS has got an integrated CI/CD pipeline. Therefore, if too many tests > > (from upstream) are failing, our builds are failing also. You can experience > > that also with Koji at Fedora. > What, in general, is a good thing to get aware of problems, but also has the > consequence of considering a package build to be failing because of testcase > bugs. > We are doing that to verify that the software is working as expected on the different hardware architectures. If tests fail on s390x, something is going wrong, and the software is not called "enabled". Fedora is doing the same. We trust upstream communities and their tests, because they want to have it also running on all Linux distributions and hardware architectures. If x86 and arm are building fine, and it does not build on s390x because of multiple testcase failures, it is a case for such a bug report. > > I want to highlight that we do not receive any z16 system (all Linux > > distributions). We have to trust you as a Developer that everything is > > working as expected. Of course, we can disable the upstream tests (as > > Berthold did after the creation of this bug report). 
Do you want to > > guarantee that all AI-specific features will work as expected on z16 if we > > can not test them based on missing hardware? > This has nothing to do with z16. The onnx package does not have any support > for z16 as far as I can tell. > You have integrated the onnx software into the presentation about AI on z16. That had also AI integrated on the hardware level. You have to use/call it anyways also from the software level. Therefore, I thought that there is any dependency to the z16 hardware available. > > openSUSE is a rolling release (an upstream Linux distribution) and we are > > running upstream tests for all Python packages (or most based on the feature > > that we can disable also). If we have to trust any tests for AI, we have to > > do that based on what is available. Our Python Maintainers are also on this > > bug report. > Do you really see the onnx distro package being used frequently? To my > understanding most people seem to use what comes from the python package > index repository instead. To be honest I'm not sure how relevant this > package is at all. > Here you can see, that we have got more than one Maintainer: https://build.opensuse.org/projects/science:machinelearning/packages/python-onnx/files/python-onnx.changes?expand=1 We recommend our users to use our own packages, because we test that all in OBS and on top with openQA. openSUSE Tumbleweed is one of the most well tested rolling release distributions and many call it already as "stable" because of that. That would not be possible via the Python package index. That is independent. openSUSE Leap 15.6 has Python 3.11 available for development. Here, I would recommend using pip because the package is missing now. 
> > Yesterday, I was shortly before writing into my Release Announcement for the > > IBM Community, that Linux distributions can not guarantee working z16 > > features for AI, Cryptography and IBM developed Confidential Computing any > > more based on missing z16 hardware for testing/validation and missing tests > > from Developer side. > What are the z16 specific packages you ship with opensuse? I'm not aware of > anything but libzdnn and this is fully supported by IBM. I had an email discussion related to that 1.5 months ago. I have got the focus on the open source Python packages. SUSE people said that the feature requests have been missing to integrate the compatibility with IBM related software like libzdnn. I can do only so much as my time is providing to contribute to openSUSE zSystems. That includes also fixing dependencies of Python packages (not only). My goal is, that all IBM software can run on our Linux distribution. I have found at the IBM Z Symposium also a company supporting openSUSE at their customers. That is one of the first steps into this direction.
Until now, I have not disabled tests for open source projects supported by IBM just to get the required packages built despite failing tests. From my point of view, IBM should provide customer support (also for community distributions) if the software fails afterwards. @Thomas and @Marcus: What is your point of view? Can we add a new rule for (community) support so that IBM customers can also create IBM bug reports for community distributions "directly" for software where your developers are requesting to "disable tests"?
Ok, @Andreas, we will disable the upstream tests for s390x for our builds for "your supported" onnx. But we want to keep this bug report open, so that you can develop on an "openSUSE Tumbleweed" system (not only Fedora). From time to time, I will enable the tests with a fresh onnx version in my home repository. This bug report will be closed as soon as everything is working (incl. tests).
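As an alternative to switching off the whole test suite on s390x, endianness-sensitive tests could be skipped individually while everything else keeps running. The following is only a sketch with the standard `unittest` module. The class and test names are hypothetical, and this is not how the openSUSE spec file or the upstream onnx test suite currently handles it.

```python
import sys
import unittest

# True on big-endian hosts such as s390x.
IS_BIG_ENDIAN = sys.byteorder == "big"


class RawTensorRoundTrip(unittest.TestCase):  # hypothetical test case
    @unittest.skipIf(IS_BIG_ENDIAN, "known raw-tensor failure on big-endian, see bug 1215337")
    def test_raw_round_trip(self):
        # Placeholder body standing in for an endianness-sensitive check.
        data = (1.0, 2.0, 3.0)
        self.assertEqual(data, tuple(data))

    def test_endian_independent(self):
        # Runs on every architecture.
        self.assertEqual(sum([1, 2, 3]), 6)


if __name__ == "__main__":
    unittest.main()
```

With this pattern the skipped tests still show up in the report, so it stays visible which checks are pending on s390x.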
This is an autogenerated message for OBS integration: This bug (1215337) was mentioned in https://build.opensuse.org/request/show/1183184 Factory / python-onnx
The Volkswagen CI (https://github.com/auchenberg/volkswagen) has been enabled for python-onnx (only on s390x) for openSUSE Tumbleweed.
Ulrich told us in our Linux Distributions Working Group yesterday that IBM is providing such Python software as separate IBM related Python libraries. @Andreas Where can I find the s390x version of onnx then?