Bugzilla – Bug 1206359
kernel 6.1: random build failures: "error: <random package>: install failed"
Last modified: 2023-01-23 15:50:19 UTC
kernel 6.1 is submitted to openSUSE:Factory:Staging:J. There, a lot of packages randomly fail. It happens usually at the installation phase and looks like (build rpm):
> [ 12s] libdb-4_8-4.8.30-41.81 ########################################
> [ 12s] error: libdb-4_8-4.8.30-41.81.x86_64: install failed
> [ 12s] perl-5.36.0-5.47 ########################################
> [ 13s] liblsan0-12.2.1+git537-2.20 ########################################

Note that the failure is intermittent. Only some package installations fail (here only libdb). Note this can happen later in the loop (samba rpm):
> error: samba-libs-4.17.3+git.283.2157972742b-3.2.x86_64: install failed
> ...
> [ 2288s] failed to install rpms, aborting build
> ...
> [ 2288s] Failed to write wtmp record, ignoring: Read-only file system
> [ 2288s] Failed to write utmp record: Read-only file system

I'm currently out of ideas. It looks like the VM image becomes RO and then RW randomly. The image is ext4. I cannot reproduce locally with --kvm-type=kvm. Any hint, anyone?
And yet, it looks host-specific: only the lamb and old-cirrus machines seem to be affected. I don't know what their configuration is, though. Maybe Dominique can shed some light?
(In reply to Jiri Slaby from comment #1)
> And yet, it looks like host-specific. lamb and old-cirrus machines seem to
> be affected only. I don't know what's their configuration though. Maybe
> Dominique can shed some light?

Not really - except if you can do something with the information provided by
> osc workerinfo x86_64:lamb02:1

I'm sure Adrian is better suited to answer the original question.
BTW, can we get kernel logs from the build in OBS if there's anything interesting?
Bad run:
```
[ 11s] D: ========== +++ libdb-4_8-4.8.30-41.81 x86_64-linux 0x0
[ 11s] D: install: libdb-4_8-4.8.30-41.81.x86_64 has 2 files
[ 11s] D: Plugin: calling hook psm_pre in selinux plugin
[ 11s] libdb-4_8-4.8.30-41.81 D: create 100755 1 ( 0, 0)1595712 /usr/lib64/libdb-4.8.so;63989ef2
[ 11s] D: create 100755 1 ( 0, 0)1743256 /usr/lib64/libdb_cxx-4.8.so;63989ef2
[ 11s] D: ========== Directories not explicitly included in package:
[ 11s] D: 0 /usr/lib64/
[ 11s] D: ==========
[ 11s] ########################################
[ 11s] D: Plugin: calling hook fsm_file_prepare in fsverity plugin
[ 11s] D: fsverity no signature for: path /usr/lib64/libdb-4.8.so;63989ef2 dest /usr/lib64/libdb-4.8.so
[ 11s] D: Plugin: calling hook fsm_file_prepare in ima plugin
[ 11s] D: Plugin: calling hook fsm_file_prepare in selinux plugin
[ 11s] D: Plugin: calling hook fsm_file_prepare in fsverity plugin
[ 11s] D: fsverity no signature for: path /usr/lib64/libdb_cxx-4.8.so;63989ef2 dest /usr/lib64/libdb_cxx-4.8.so
[ 11s] D: Plugin: calling hook fsm_file_prepare in ima plugin
[ 11s] D: Plugin: calling hook fsm_file_prepare in selinux plugin
[ 11s] D: adding "libdb-4_8" to Name index.
[ 11s] D: adding 2 entries to Basenames index.
[ 11s] D: adding "System/Libraries" to Group index.
[ 11s] D: adding 25 entries to Requirename index.
[ 11s] D: adding 5 entries to Providename index.
[ 11s] D: adding 1 entries to Dirnames index.
[ 11s] D: adding 1 entries to Installtid index.
[ 11s] D: adding 1 entries to Sigmd5 index.
[ 11s] D: adding "39322b96c5751e2783f0d4fc0307783ae06e355e" to Sha1header index.
[ 11s] D: Plugin: calling hook psm_post in syslog plugin
[ 11s] error: libdb-4_8-4.8.30-41.81.x86_64: install failed
```
a good package right before this one
```
[ 11s] D: ========== +++ libasan8-12.2.1+git537-2.20 x86_64-linux 0x0
[ 11s] D: install: libasan8-12.2.1+git537-2.20.x86_64 has 2 files
[ 11s] D: Plugin: calling hook psm_pre in selinux plugin
[ 11s] libasan8-12.2.1+git537-2.20 D: create 120777 1 ( 0, 0) 16 /usr/lib64/libasan.so.8;63989ef2
[ 11s] D: create 100755 1 ( 0, 0)1307120 /usr/lib64/libasan.so.8.0.0;63989ef2
[ 11s] D: ========== Directories not explicitly included in package:
[ 11s] D: 0 /usr/lib64/
[ 11s] D: ==========
[ 11s] ########################################
[ 11s] D: Plugin: calling hook fsm_file_prepare in fsverity plugin
[ 11s] D: fsverity skipping non regular: path /usr/lib64/libasan.so.8;63989ef2 dest /usr/lib64/libasan.so.8
[ 11s] D: Plugin: calling hook fsm_file_prepare in ima plugin
[ 11s] D: Plugin: calling hook fsm_file_prepare in selinux plugin
[ 11s] D: Plugin: calling hook fsm_file_prepare in fsverity plugin
[ 11s] D: fsverity no signature for: path /usr/lib64/libasan.so.8.0.0;63989ef2 dest /usr/lib64/libasan.so.8.0.0
[ 11s] D: Plugin: calling hook fsm_file_prepare in ima plugin
[ 11s] D: Plugin: calling hook fsm_file_prepare in selinux plugin
[ 11s] D: adding "libasan8" to Name index.
[ 11s] D: adding 2 entries to Basenames index.
[ 11s] D: adding "Development/Languages/C and C++" to Group index.
[ 11s] D: adding 21 entries to Requirename index.
[ 11s] D: adding 3 entries to Providename index.
[ 11s] D: adding 1 entries to Conflictname index.
[ 11s] D: adding 1 entries to Dirnames index.
[ 11s] D: adding 1 entries to Installtid index.
[ 11s] D: adding 1 entries to Sigmd5 index.
[ 11s] D: adding "2973d96fce735d6078efd3998891e1a7d8e1f98a" to Sha1header index.
[ 11s] D: %post(libasan8-12.2.1+git537-2.20.x86_64): scriptlet start
[ 11s] D: %post(libasan8-12.2.1+git537-2.20.x86_64): execv(/sbin/ldconfig) pid 957
[ 11s] D: Plugin: calling hook scriptlet_fork_post in prioreset plugin
[ 11s] D: Plugin: calling hook scriptlet_fork_post in selinux plugin
[ 11s] D: %post(libasan8-12.2.1+git537-2.20.x86_64): waitpid(957) rc 957 status 0
[ 11s] D: Plugin: calling hook scriptlet_post in syslog plugin
[ 11s] D: Plugin: calling hook psm_post in syslog plugin
```
So from the logs it looks like, on the newer kernel, rpm sometimes fails to launch the processes for the post scriptlets. Sadly, without any useful error.
(In reply to Marcus Rückert from comment #5)
> so from the logs it looks like on the newer kernel rpm sometimes fails to
> launch the processes for the post scriptlets. sadly without any useful error.

If post hooks failed, that would dump:
  Plugin %s: hook psm_post failed

But that didn't happen. So post hooks likely finished w/o error. It's something done after post hooks, right? Michael?
(In reply to Jiri Slaby from comment #6)
> (In reply to Marcus Rückert from comment #5)
> > so from the logs it looks like on the newer kernel rpm sometimes fails to
> > launch the processes for the post scriptlets. sadly without any useful error.
>
> If post hooks failed, that would dump:
>   Plugin %s: hook psm_post failed
>
> But that didn't happen. So post hooks likely finished w/o error. It's
> something done after post hooks, right? Michael?

Or even before the post hooks. Post hooks are run despite errors and do not affect the return status from rpmpsmRun(). The question is why rpm does not note anything about the failure in the debug log at all?
(In reply to Jan Kara from comment #3)
> BTW, can we get kernel logs from the build in OBS if there's anything
> interesting?

Adding to this: for debug runs, can we increase the kernel debug level? Currently, /usr/lib/build/build-vm does:
  echo 4 > /proc/sysrq-trigger
so "echo 8" might show us more.
We ran with 7 already and it didn't give us additional details.
(In reply to Marcus Rückert from comment #5)
> so from the logs it looks like on the newer kernel rpm sometimes fails to
> launch the processes for the post scriptlets. sadly without any useful error.

From a different build:
https://build.opensuse.org/package/live_build_log/openSUSE:Factory:Staging:J/talloc/standard/i586

[ 18s] python-rpm-macros-20221117.8687578-2.1########################################
[ 18s] error: python-rpm-macros-20221117.8687578-2.10.noarch: install failed

python-rpm-macros does not even have scriptlets, so I think we're following a red herring with the scriptlets.
(In reply to Marcus Rückert from comment #5)
> so from the logs it looks like on the newer kernel rpm sometimes fails to
> launch the processes for the post scriptlets. sadly without any useful error.

Ah. So this (AFAICT) means dbAdd() failed, because:
* a lot of "adding X entries to XX index" happens inside dbAdd().
* the following runFileTriggers() and runImmedFileTriggers() would emit an error in case of error.
* the following runInstScript(psm, RPMTAG_POSTIN) was not run for libdb, and it should have been, as for libasan8 above (both have "postinstall program: /sbin/ldconfig").
* then only psm_post is run, regardless of previous errors.

So it might be interesting to see which of these invocations in rpmdbAdd() fails (increases ret):
  for (int dbix = 0; dbix < db->db_ndbi; dbix++) {
    ...
    ret += idxdbPut(dbi, rpmtag, hdrNum, h);
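To make the inferred ordering easier to follow, here is a compilable schematic of it. This is NOT rpm's actual code, just the control flow reasoned out above rendered with stand-in stubs: if dbAdd() fails, %post is never launched, the psm_post hook still runs, and the only visible trace is the bare "install failed" line.
```
#include <stdio.h>

/* All functions below are stand-in stubs, not rpm internals. */
static int  dbAdd_stub(void)                { return 1; /* the failing case */ }
static void runFileTriggers_stub(void)      { /* logs its own errors */ }
static int  runInstScript_postin_stub(void) { puts("running %post"); return 0; }
static void psm_post_hooks_stub(void)       { /* runs regardless of rc */ }

int main(void)
{
    int rc = dbAdd_stub();                 /* emits the "adding ... index" lines */
    if (rc == 0) {
        runFileTriggers_stub();            /* would emit an error on failure */
        rc = runInstScript_postin_stub();  /* never reached for libdb above */
    }
    psm_post_hooks_stub();                 /* psm_post runs despite the error */
    if (rc)
        fprintf(stderr, "error: install failed\n");  /* the only message seen */
    return rc;
}
```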
(In reply to Jiri Slaby from comment #11)
> ret += idxdbPut(dbi, rpmtag, hdrNum, h);

One thing where the above returns RPMRC_FAIL is when this fails:
  flock(pkgdb->fd, LOCK_EX);

Let me check if there are any changes in file locking between 6.0 and 6.1.
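For context, a minimal self-contained sketch of the flock() pattern referenced above; the file name and error handling are illustrative, not rpm's code:
```
#include <stdio.h>
#include <fcntl.h>
#include <sys/file.h>

int main(void)
{
    /* "Packages.db" is just an example path. */
    int fd = open("Packages.db", O_RDWR | O_CREAT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* Block until we hold an exclusive lock on the database file.
     * A -1 return here is the failure mode speculated about above. */
    if (flock(fd, LOCK_EX) < 0) {
        perror("flock");
        return 1;
    }
    /* ... update the database ... */
    flock(fd, LOCK_UN);
    return 0;
}
```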
(In reply to Jiri Slaby from comment #12)
> Let me check if there are any changes in file locking in between 6.0 and 6.1.

And there are none. Can anyone test with this patched rpm:
https://build.opensuse.org/package/show/home:jirislaby:debug61/rpm
? There is a patch with many more rpmlog(RPMLOG_DEBUG, ...) calls to narrow down what actually fails and how.
(In reply to Jiri Slaby from comment #13)
> Can anyone test with this patched rpm:
> https://build.opensuse.org/package/show/home:jirislaby:debug61/rpm
> ? There is a patch with many more rpmlog(RPMLOG_DEBUG, ...) to narrow what
> and how actually fails.

I hit it once in home:jirislaby:debug61:rebuild and it points to rpmidxPutInternal()'s:
  addnewkey(idxdb, key, keyl, &keyoff)
And that does pwrite(), mmap() and mremap() internally. I respun the debug patch to narrow it down further.
(In reply to Jiri Slaby from comment #14)
> And that does pwrite(), mmap() and mremap() internally.

And the winner is:
[ 30s] error: remapslot (223): mapmem(old=0xffffffffffffffff, osz=1056768, nsz=1060864, fl=3, fd=4, off=868352) failed 12

i.e. mremap fails with ENOMEM. Vlasta, any ideas?
(In reply to Jiri Slaby from comment #15)
> (In reply to Jiri Slaby from comment #14)
> > And that does pwrite(), mmap() and mremap() internally.
>
> And the winner is:
> [ 30s] error: remapslot (223): mapmem(old=0xffffffffffffffff, osz=1056768,
> nsz=1060864, fl=3, fd=4, off=868352) failed 12

How exactly is that log related to the mremap() syscall and its parameters? Can I assume the address and old/new sizes are 1:1 parameters?

If 'fl=3' are the flags of mremap(), then going by tools/include/uapi/linux/mman.h it would be MREMAP_MAYMOVE | MREMAP_FIXED. With MREMAP_FIXED there should also be a fifth parameter giving the new address, which I don't see here. And the "fd=4, off=868352" are not mremap() parameters.

> i.e. mremap fails with ENOMEM. Vlasta, any ideas?

At least old=0xffffffffffffffff looks quite wrong, but it should result in EINVAL rather than ENOMEM. In general, 6.1 includes the maple tree, so that would be my first suspect to look at, but we should make sure what the mremap() parameters exactly are, and ideally get a /proc/pid/smaps snapshot upon the failure.
(In reply to Vlastimil Babka from comment #16)
> How exactly is that log related to mremap() syscall and its parameters? Can
> I assume the address and old/new sizes are 1:1 parameters?

No, sorry for being misleading. The source is:
https://github.com/rpm-software-management/rpm/blob/master/lib/backend/ndb/rpmxdb.c#L111
so it's in fact:
  mremap(MAP_FAILED, 1056768, 1060864, MREMAP_MAYMOVE);

> if 'fl=3' are flags of mremap() then going by

Those fl, fd and off are used as prot, fd and off resp. for mmap() -- i.e. not in this case.

> At least old=0xffffffffffffffff looks quite wrong, but it should result in
> EINVAL rather than ENOMEM.

Ugh, right. So apparently some previous mmap() failed, i.e. returned -1 == MAP_FAILED.

> In general 6.1 includes the maple tree, so that would be my first suspect to
> look, but we should make sure what the mremap() parameters exactly are and
> ideally get a /proc/pid/smaps snapshot upon the failure.

Let me also catch the *previous* mmap() returning MAP_FAILED, which is put into oldaddr.
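For readers following along, a minimal self-contained sketch of the pattern that rpmxdb.c's mapmem() wraps: map a file region, or grow an existing mapping with mremap(MREMAP_MAYMOVE). Names, file, and sizes are illustrative; this is not rpm's code.
```
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Illustrative stand-in for rpmxdb.c's mapmem(). */
static void *mapmem_sketch(void *old, size_t osz, size_t nsz,
                           int prot, int fd, off_t off)
{
    if (old != NULL)    /* grow, allowing the kernel to move the mapping */
        return mremap(old, osz, nsz, MREMAP_MAYMOVE);
    return mmap(NULL, nsz, prot, MAP_SHARED, fd, off);
}

int main(void)
{
    int fd = open("testfile", O_RDWR | O_CREAT | O_TRUNC, 0644);
    ftruncate(fd, 8192);    /* two pages backing the mapping */

    void *m = mapmem_sketch(NULL, 0, 4096, PROT_READ | PROT_WRITE, fd, 0);
    m = mapmem_sketch(m, 4096, 8192, PROT_READ | PROT_WRITE, fd, 0);
    printf("mapping at %p (%s)\n", m, m == MAP_FAILED ? "failed" : "ok");
    return 0;
}
```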
Created attachment 863524 [details]
smaps

(In reply to Jiri Slaby from comment #17)
> > In general 6.1 includes the maple tree, so that would be my first suspect to
> > look, but we should make sure what the mremap() parameters exactly are and
> > ideally get a /proc/pid/smaps snapshot upon the failure.
>
> Let me catch also the *previous* mmap() giving MAP_FAILED which is put to
> oldaddr.

smaps dump attached. I don't know where this "-1 as first argument" comes from. slot->mapped (the old addr) is never assigned MAP_FAILED or -1.

error: remapslot (232): mapped=0xffffffffffffffff shift=0 mapmem(old=0xffffffffffffffff, osz=1056768, nsz=1060864, prot=3, fd=4, off=868352) failed 12
(In reply to Jiri Slaby from comment #18)
> smaps dump attached. I don't know where this "-1 as first argument" is from.
> slot->mapped (the old addr) is never assigned MAP_FAILED or -1.

And none of the previous mmap()s fail either.
(In reply to Vlastimil Babka from comment #16)
> At least old=0xffffffffffffffff looks quite wrong, but it should result in
> EINVAL rather than ENOMEM.
> In general 6.1 includes the maple tree, so that would be my first suspect to
> look

I have to go now, so for now I have come to a suspicion that this is memory corruption. That 0xffffffffffffffff must be written by something else. So could the mmap() kernel rewrite have ended up in something like this in rpm:
1) slots = calloc(); (it's 16320B, so likely via sbrk(), I didn't check)
2) do a lot of mremap() and mmap(), some of which simply return the "slots" address from 1) again.
3) slot[X].mapped gets rewritten by mistake due to 2) with a bunch of 0xff.
4) we see slot->mapped as -1 here.
?
But wouldn't that lead to EINVAL because offset_in_page(addr) is not zero? That sounds more like your debug statement does not match what happens.
(In reply to Michael Schröder from comment #21)
> But wouldn't that lead to EINVAL because offset_in_page(addr) is not zero?
> That sounds more like your debug statement does not match what happens.

I might have made a mistake somewhere. Feel free to check the patch ;):
https://build.opensuse.org/package/view_file/home:jirislaby:debug61/rpm/debug61.patch?expand=1
Your debug statement is wrong because mapped gets overwritten with the mapmem() result. I.e. mapped is always MAP_FAILED and not the old address.
Created attachment 863532 [details]
straced-rpm.zst

I was able to get rpm to fail inside strace -f. Full build log attached.
(In reply to Fabian Vogt from comment #24)
> Created attachment 863532 [details]
> straced-rpm.zst
>
> I was able to get rpm to fail inside strace -f. Full build log attached.

> mmap2(NULL, 241664, PROT_READ|PROT_WRITE, MAP_SHARED, 4, 0xaa000) = 0xf767e000
> mremap(0xf767e000, 241664, 245760, MREMAP_MAYMOVE) = 0xf767e000
> mremap(0xf767e000, 245760, 249856, MREMAP_MAYMOVE) = 0xf767e000
> mremap(0xf767e000, 249856, 253952, MREMAP_MAYMOVE) = 0xf767e000
> mremap(0xf767e000, 253952, 258048, MREMAP_MAYMOVE) = 0xf767e000
> mremap(0xf767e000, 258048, 262144, MREMAP_MAYMOVE) = -1 ENOMEM (Cannot allocate memory)
> mremap(0xf767e000, 258048, 262144, MREMAP_MAYMOVE) = -1 ENOMEM (Cannot allocate memory)
> mremap(0xf767e000, 258048, 262144, MREMAP_MAYMOVE) = -1 ENOMEM (Cannot allocate memory)
> mremap(0xf767e000, 258048, 262144, MREMAP_MAYMOVE) = -1 ENOMEM (Cannot allocate memory)
> mremap(0xf767e000, 258048, 262144, MREMAP_MAYMOVE) = -1 ENOMEM (Cannot allocate memory)
> mremap(0xf767e000, 258048, 262144, MREMAP_MAYMOVE) = -1 ENOMEM (Cannot allocate memory)

Looks interesting.
Created attachment 863539 [details]
smaps

Smaps for this error (I fixed the "mapped" variable dump):
error: remapslot (235): mapped=0x7f72f4e78000 shift=0 mapmem(old=0x7f72f4e78000, osz=1056768, nsz=1060864, prot=3, fd=4, off=868352) failed 12
Created attachment 863541 [details]
full log incl. smaps

From another run (I dumped even more), IIUC:
> mapmem(130): map=0x7ff717f7f000 (old=(nil) osz=0 nsz=1056768 prot=3 fd=4 off=868352 ) off in hex=0x0d4000 nsz in hex=0x102000

So mmap mapped this part of the file (Index.db): 0x0d4000-0x1d6000.

> mapmem(130): remap=0xffffffffffffffff (old=0x7ff717f7f000 osz=1056768 nsz=1060864 prot=3 fd=4 off=868352 )

Now it wants 0x0d4000-0x1d7000, but fails:
> error: remapslot (240): mapped=0x7ff717f7f000 shift=0 mapmem(old=0x7ff717f7f000, osz=1056768, nsz=1060864, prot=3, fd=4, off=868352) failed 12

The original mapping (mmap above):
> 7ff717f56000-7ff718081000 rw-s 000ab000 fd:00 656385 /usr/lib/sysimage/rpm/Index.db
This mapping provides 0x0ab000-0x1d6000. BTW, it was requested earlier by another mmap:
> mapmem(130): map=0x7ff717f56000 (old=(nil) osz=0 nsz=167936 prot=3 fd=4 off=700416 )
and was not released until now.

The above mapping cannot be enlarged, because this vma follows:
> 7ff718082000-7ff718087000 rw-s 00047000 fd:00 656385 /usr/lib/sysimage/rpm/Index.db

So the question is why mremap did not apply the MAYMOVE rule. Provided it's shared, it should create a new mapping (preserving the old one for the older mmap) and return it. Or not? (I don't know the rules here.)
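As an aside on the MAYMOVE question, a minimal self-contained demonstration (my own construction, not taken from this bug) of what the rule is expected to do when in-place growth is blocked: the kernel relocates the mapping, and the data stays coherent because it is backed by the file.
```
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    long pg = sysconf(_SC_PAGESIZE);
    int fd = open("demo.db", O_RDWR | O_CREAT | O_TRUNC, 0644);
    ftruncate(fd, 4 * pg);

    char *a = mmap(NULL, pg, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    /* Block in-place growth by placing an anonymous mapping right after
     * 'a' (MAP_FIXED_NOREPLACE fails harmlessly if the slot is taken). */
    mmap(a + pg, pg, PROT_READ | PROT_WRITE,
         MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE, -1, 0);

    char *b = mremap(a, pg, 2 * pg, MREMAP_MAYMOVE);
    if (b == MAP_FAILED)
        perror("mremap");
    else
        printf("old=%p new=%p (%s)\n", (void *)a, (void *)b,
               b == a ? "grown in place" : "moved");
    return 0;
}
```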
So I reverted:

commit ca3d76b0aa808a06997297d123b66d17b81e5285
Author: Jakub Matěna <matenajakub@gmail.com>
Date:   Fri Jun 3 16:57:19 2022 +0200

    mm: add merging after mremap resize

in my home:jirislaby:debug61 testbed, and it seems to fix the problem. I submitted it to Staging to verify.

The problem is I don't understand why it helps, as the vma merge in mremap happens only in the "if (vma_expandable(vma, new_len - old_len)) {" case. So the remap we have here should not be affected. Leaving it up to Vlasta; I believe he knows this far better than me.
Thanks Jiri, that was very informative; I will check what exactly is wrong with the commit.

(In reply to Jiri Slaby from comment #27)
> The above mapping cannot be enlarged, because this vma follows:
> > 7ff718082000-7ff718087000 rw-s 00047000 fd:00 656385 /usr/lib/sysimage/rpm/Index.db

Actually, if my calculation is right, the enlarged mapping would only now become adjacent to this one, so it's not yet preventing the enlarge.

> So the question is why remap did not apply the MAYMOVE rule. Provided it's
> shared, it should create a new mapping (and preserve the old for the older
> mmap) and return it. Or not? (I don't know the rules here.)

Thus MAYMOVE shouldn't trigger yet. But with the new commit, a merge is attempted that previously wouldn't be. The merge probably fails because the two mappings have incompatible offsets into the file. So the merge fails, and it should fall back to just enlarging the mapping as before the commit. But it seems that fallback doesn't happen and we return -ENOMEM instead.
Hmm, so I don't see a problem in the commit. If the next adjacent vma is incompatible for merging, it should just enlarge the current one as previously. Nor does the ENOMEM reproduce with my primitive testcase; it just works as expected. I wonder if you were just lucky when testing with the revert, or whether something else is going on and the revert just hides some other issue.
OK got a small reproducer, and it's indeed the commit causing it. Will check tomorrow if it's easily fixable or I'll submit a revert upstream.
Created attachment 863563 [details]
the fix

Here's the fix, submitted upstream:
https://lore.kernel.org/all/20221216163227.24648-1-vbabka@suse.cz/T/#u
Pushed to master and to stable for-next.
The fix is in 6.2-rc1, but because the stable process sucks, it's not in 6.1.2. Hopefully 6.1.3 then. But our stable branch is fixed, so closing.
This bug is back with kernel 6.1.3 in Tumbleweed: RPM installations fail randomly and strace shows mremap returning -ENOMEM.
6.1.3 had the identical fix from upstream stable that we had in our stable branch, and I can't reproduce the testcases failing with either 6.1.3 or 6.1.4. So I'm afraid I'll need logs of the mremap() failures and smaps to investigate, as before.
Created attachment 864106 [details]
testcase

By the way, here are my hacked-up tests from when I was investigating the bug. To run them:

gcc -O2 -o test test.c
dd if=/dev/random of=data bs=4k count=20
./test

The program produces some output, then prints its pid and waits for Enter, in case capturing /proc/pid/smaps is needed.

Here's how it looks on an unfixed kernel:

test0: mremap from 7fde78865000 to 7fde78865000
test1: mremap wihout MREMAP_MAYMOVE enlarging last page to two page: Cannot allocate memory
test2: mremap without MREMAP_MAYMOVE enlarging last page to two page: Cannot allocate memory
test3: mremap without MREMAP_MAYMOVE to fill the gap: Cannot allocate memory
test4: mremap without MREMAP_MAYMOVE enlarging last page to two page: Cannot allocate memory
test5: mremap without MREMAP_MAYMOVE to fill the gap: Cannot allocate memory
test6: mremap from 7fde7886c000 to 7fde78868000
test7: mremap from 7fde7886b000 to 7fde78867000

And on a fixed kernel (test4 and test5 are supposed to fail; test6 and test7 are supposed to have different "from" and "to"):

test0: mremap from 7fa0da499000 to 7fa0da499000
test1: mremap from 7fa0da49a000 to 7fa0da49a000
test2: mremap from 7fa0da4a0000 to 7fa0da4a0000
test3: mremap from 7fa0da49f000 to 7fa0da49f000
test4: mremap without MREMAP_MAYMOVE enlarging last page to two page: Cannot allocate memory
test5: mremap without MREMAP_MAYMOVE to fill the gap: Cannot allocate memory
test6: mremap from 7fa0da4a0000 to 7fa0da49c000
test7: mremap from 7fa0da49f000 to 7fa0da49b000
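For illustration, a simplified sketch of what the "fill the gap" tests exercise, reconstructed from the output above (the actual attachment may differ): two shared mappings of the same file with incompatible file offsets, separated by a one-page gap. Growing the first mapping into the gap makes it adjacent to the second vma, which cannot be merged; a correct kernel simply expands in place, while the regressed kernel returned ENOMEM.
```
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    long pg = sysconf(_SC_PAGESIZE);
    int fd = open("data", O_RDWR);   /* the file created by dd above */
    if (fd < 0) {
        perror("open data");
        return 1;
    }

    /* Reserve address space so the layout below is deterministic. */
    char *base = mmap(NULL, 8 * pg, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    munmap(base, 8 * pg);

    /* [ file page @off 0 ][ one-page gap ][ file page @off 8*pg ] */
    char *a = mmap(base, pg, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_FIXED, fd, 0);
    mmap(base + 2 * pg, pg, PROT_READ | PROT_WRITE,
         MAP_SHARED | MAP_FIXED, fd, 8 * pg);

    /* Fill the gap, without MREMAP_MAYMOVE: must succeed in place. */
    void *r = mremap(a, pg, 2 * pg, 0);
    if (r == MAP_FAILED)
        perror("mremap to fill the gap");   /* the regression */
    else
        printf("expanded in place at %p\n", r);
    return 0;
}
```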
Created attachment 864137 [details]
smaps from binutils failure

(In reply to Vlastimil Babka from comment #36)
> 6.1.3 had the identical fix from upstream stable that we had in our stable
> branch. I can't reproduce the testcases failing with 6.1.3 nor 6.1.4. So I'm
> afraid I'll need logs of mremap() failings and smaps to investigate, as
> before.

Kernel 6.1.4 arrived. Initially I thought it was fixed, but then it failed reproducibly again, when installing binutils inside a podman container. After retrying a few times, it works:

(23/29) Installing: libctf0-2.39-1.3.x86_64 ........................................[done]
error: binutils-2.39-1.3.x86_64: install failed
(24/29) Installing: binutils-2.39-1.3.x86_64 .......................................[error]
Installation of binutils-2.39-1.3.x86_64 failed:
Error: Subprocess failed. Error: RPM failed: Command exited with status 1.
Abort, retry, ignore? [a/r/i] (a): r
error: binutils-2.39-1.3.x86_64: install failed
error: binutils-2.39-1.3.x86_64: erase skipped
(24/29) Installing: binutils-2.39-1.3.x86_64 .......................................[error]
Installation of binutils-2.39-1.3.x86_64 failed:
Error: Subprocess failed. Error: RPM failed: Command exited with status 1.
Abort, retry, ignore? [a/r/i] (a): r
error: binutils-2.39-1.3.x86_64: install failed
error: binutils-2.39-1.3.x86_64: erase skipped
error: binutils-2.39-1.3.x86_64: erase skipped
(24/29) Installing: binutils-2.39-1.3.x86_64 .......................................[error]
Installation of binutils-2.39-1.3.x86_64 failed:
Error: Subprocess failed. Error: RPM failed: Command exited with status 1.
Abort, retry, ignore? [a/r/i] (a): r
(24/29) Installing: binutils-2.39-1.3.x86_64 .................................<100%>[/]error: binutils-2.39-1.3.x86_64: install failed
error: binutils-2.39-1.3.x86_64: erase skipped
error: binutils-2.39-1.3.x86_64: erase skipped
error: binutils-2.39-1.3.x86_64: erase skipped
(24/29) Installing: binutils-2.39-1.3.x86_64 .......................................[error]
Installation of binutils-2.39-1.3.x86_64 failed:
Error: Subprocess failed. Error: RPM failed: Command exited with status 1.
Abort, retry, ignore? [a/r/i] (a): r
update-alternatives: using /usr/bin/ld.bfd to provide /usr/bin/ld (ld) in auto mode
(24/29) Installing: binutils-2.39-1.3.x86_64 ........................................[done]
(25/29) Installing: libxcrypt-devel-4.4.33-1.1.x86_64 ...............................[done]

For some reason it hits only binutils here, and I can even reproduce it by only installing binutils:
podman run --rm -it --privileged opensuse/tumbleweed zypper in -y binutils

I attached gdb to break at the failing mremap and collected the parameters as well as smaps:

Thread 2.1 "rpm" hit Catchpoint 2 (call to syscall mremap), 0x00007effd9d554cf in mremap () from target:/lib64/libc.so.6
(gdb) p/x $rdi
$21 = 0x7effd9648000
(gdb) p/x $rsi
$22 = 0x70000
(gdb) p/x $rdx
$23 = 0x71000
(gdb) p/x $r10
$24 = 0x1
(gdb) stepi
0x00007effd9d554cf in mremap () from target:/lib64/libc.so.6
(gdb) p/x $rax
$26 = 0xfffffffffffffff4

I don't see any mapping directly following the Index.db one here, so this might be a different/new issue...
That's indeed weird, unless there are multiple threads, where gdb would pause only one of them and the smaps snapshot could be inaccurate due to another thread's activity. But I guess rpm doesn't have multiple threads?

Would it be possible to also capture a kernel trace of the syscall? I.e., before stepping into the syscall, doing:

trace-cmd record -p function_graph -g __x64_sys_mremap -P <pid_of_the_rpm_process>

and afterwards ctrl+c, then trace-cmd report.
Created attachment 864155 [details]
traces for the binutils failure

(In reply to Vlastimil Babka from comment #39)
> That's indeed weird, unless there are multiple threads where gdb would pause
> only one of them and the smaps snapshot could be inaccurate due to other
> thread's activity. But I guess rpm doesn't have multiple threads?

Not that I'm aware of.

> Would it be possible to capture also a kernel trace of the syscall? I.e.
> before stepping into the syscall, doing:
>
> trace-cmd record -p function_graph -g __x64_sys_mremap -P
> <pid_of_the_rpm_process>
>
> afterwards ctrl+c, trace-cmd report.

Done. I used the -c option and the pid of zypper. I simultaneously straced the process for mmap, mremap and munmap calls; both traces are in the attached tarball.
For completeness, some discussion on IRC:

[15:57] <fvogt> vbabka: I also hit the rpm issue on my workstation now, which also meant I could test older kernels more easily. Turns out it's already broken with the first 6.1.0 in TW, i.e. the bug was never fully fixed
[15:58] <vbabka> fvogt: oh ok, what filesystem/block device is /usr/lib/sysimage/rpm/Index.db on?
[15:59] <fvogt> vbabka: In both cases it's in a container, using overlayfs on one system and fuse-overlayfs on another
[16:00] <fvogt> The former has ext4 as backing storage, the fuse-overlayfs system has btrfs
[16:07] <fvogt> I tried to reproduce it in a clean VM twice now, but it works fine...
[16:32] <vbabka> yeah my guess is vma has vm_ops.close() but I don't immediately see that in overlayfs
[16:36] <fvogt> vbabka: fuse has
[16:37] <vbabka> yep, but you said in one case it's overlayfs without fuse?
[16:37] <fvogt> I think so, let me verify
[16:38] <fvogt> I'm wrong!
[16:38] <fvogt> Even though it's privileged, podman still uses fuse-overlayfs
[16:38] <fvogt> So your guess is probably correct
[16:44] <vbabka> hmm okay!
[16:45] <fvogt> vbabka: Your testcase complains heavily when running it in a fuse-overlayfs mount
[16:45] <fvogt> Even more than expected even, only test6 and test7 pass, even test0 fails
[16:46] <vbabka> that looks promising, let's see if there's a nonintrusive way to fix
Created attachment 864158 [details]
candidate fix

Hopefully this patch will fix it.
(In reply to Vlastimil Babka from comment #42)
> Created attachment 864158 [details]
> candidate fix
>
> Hopefully this patch will fix it.

Tested on 6.1.4, works!
Tested-by: Fabian Vogt <fvogt@suse.com>

FYI, there's a typo in the message: "we nevert remove"
Patch on the way to mainline and meanwhile in stable branch. Thanks.