|
Bugzilla – Full Text Bug Listing |
| Summary: | LTP: dio_append.c freeze on a wait system call on build 50.1 | ||
|---|---|---|---|
| Product: | [openSUSE] PUBLIC SUSE Linux Enterprise Server 15 SP6 | Reporter: | WEI GAO <wegao> |
| Component: | Kernel | Assignee: | Kernel Bugs <kernel-bugs> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | Normal | ||
| Priority: | P1 - Urgent | CC: | andrea.cervesato, martin.doucha, petr.vorel, rtsvetkov |
| Version: | unspecified | ||
| Target Milestone: | --- | ||
| Hardware: | S/390 | ||
| OS: | Other | ||
| URL: | https://openqa.suse.de/tests/13372268/modules/ADI000/steps/7 | ||
| Whiteboard: | |||
| Found By: | openQA | Services Priority: | |
| Business Priority: | Blocker: | Yes | |
| Marketing QA Status: | --- | IT Deployment: | --- |
| Attachments: | strace log 49.1 | ||
|
Description
WEI GAO
2024-01-29 08:25:49 UTC
Created attachment 872241 [details]
strace log 49.1
This seems to be a test issue, related to a synchronization problem between parent and children. By adding a usleep(10) at the end of the reading process, the test passes without problems. Probably the children are now much faster at processing I/O syscalls, blocking the parent's exit and timing out the test in openQA. I will send a patch for it.

Sent a patch to fix this: https://patchwork.ozlabs.org/project/ltp/patch/20240129101448.14463-1-andrea.cervesato@suse.de/

(In reply to WEI GAO from comment #0)
> strace show process stop on wait call forever.

This is the trace of the watchdog process, which fork()s the actual test process and then waits for it to exit or time out. You need to use strace -f to capture a useful trace.

And for gdb: set follow-fork-mode child (https://github.com/linux-test-project/ltp?tab=readme-ov-file#debugging-with-gdb)

Ok I found it: there's a problem with the parent implementation. Basically the test is calling unlink() just after telling all children that they should stop. What happens is that when at least one child is pending and processing lseek(), the openat() syscall keeps failing with -ENOENT because of unlink(). We should take a closer look at the lseek() or openat() implementation to check if there's a bug around there, because it's strange that -ENOENT is not propagated through lseek().

The fix for the test is easy and it's something we have to add anyway. At dio_append.c:85, we need to wait for the children to end just before removing the file.

    *run_child = 0;
    tst_reap_children(); <-----
    SAFE_UNLINK(filename);

(In reply to Andrea Cervesato from comment #6)
> Ok I found it: there's a problem with the parent implementation. Basically
> the test is calling unlink() just after telling all children that they
> should stop. What happens is that when at least one child is pending and
> processing lseek(), the openat() syscall keeps failing with -ENOENT because
> of unlink(). We should take a closer look at the lseek() or openat()
> implementation to check if there's a bug around there, because it's strange
> that -ENOENT is not propagated through lseek().
>
> The fix for the test is easy and it's something we have to add anyway.
> At dio_append.c:85, we need to wait for the children to end just before
> removing the file.
>
>     *run_child = 0;
>     tst_reap_children(); <-----
>     SAFE_UNLINK(filename);

I also suggest not using a while loop around the open syscall, so that there is no risk of an endless loop.

@@ -102,8 +103,9 @@ static inline void io_read_eof(const char *filename, volatile int *run_child)
 	int fd;
 	int r;
 
-	while ((fd = open(filename, O_RDONLY, 0666)) < 0)
-		usleep(100);
+	fd = SAFE_OPEN(filename, O_RDONLY, 0666);

(In reply to Andrea Cervesato from comment #6)
> Ok I found it: there's a problem with the parent implementation. Basically
> the test is calling unlink() just after telling all children that they
> should stop. What happens is that when at least one child is pending and
> processing lseek(), the openat() syscall keeps failing with -ENOENT because
> of unlink(). We should take a closer look at the lseek() or openat()
> implementation to check if there's a bug around there, because it's strange
> that -ENOENT is not propagated through lseek().

I've tried to reproduce the issue on SLE-15SP5 using the latest LTP git and everything passed: https://openqa.suse.de/tests/13375452#step/ADI000/8
For now, I'd assume this issue is a kernel bug.

(In reply to WEI GAO from comment #7)
> I also suggest not using a while loop around the open syscall, so that
> there is no risk of an endless loop.
>
> @@ -102,8 +103,9 @@ static inline void io_read_eof(const char *filename,
> volatile int *run_child)
>  	int fd;
>  	int r;
>
> -	while ((fd = open(filename, O_RDONLY, 0666)) < 0)
> -		usleep(100);
> +	fd = SAFE_OPEN(filename, O_RDONLY, 0666);

This change would break the test because the test file does not exist when io_read_eof() gets called. I guess the easiest solution would be:
diff --git a/testcases/kernel/io/ltp-aiodio/common.h b/testcases/kernel/io/ltp-aiodio/common.h
index 200bbe18e..159bb25e4 100644
--- a/testcases/kernel/io/ltp-aiodio/common.h
+++ b/testcases/kernel/io/ltp-aiodio/common.h
@@ -62,8 +62,11 @@ static inline void io_read(const char *filename, int filesize, volatile int *run
 	int i;
 	int r;
 
-	while ((fd = open(filename, O_RDONLY, 0666)) < 0)
+	while ((fd = open(filename, O_RDONLY, 0666)) < 0) {
+		if (!*run_child)
+			return;
 		usleep(100);
+	}
 
 	tst_res(TINFO, "child %i reading file", getpid());
@@ -102,8 +105,11 @@ static inline void io_read_eof(const char *filename, volatile int *run_child)
 	int fd;
 	int r;
 
-	while ((fd = open(filename, O_RDONLY, 0666)) < 0)
+	while ((fd = open(filename, O_RDONLY, 0666)) < 0) {
+		if (!*run_child)
+			return;
 		usleep(100);
+	}
 
 	tst_res(TINFO, "child %i reading file", getpid());
With that, children that haven't had a chance to run would exit the retry loop once the parent has signaled that they should exit.
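To see the behaviour in isolation, here is a minimal standalone sketch of the same idea outside the LTP harness. It is an illustration under assumptions, not the LTP code: open_until_stopped() and demo_stop_flag() are hypothetical names, the flag lives in an anonymous shared mapping set up with plain mmap(), and the child is reaped with waitpid() rather than the LTP helpers.

```c
#define _GNU_SOURCE /* for MAP_ANONYMOUS and usleep() under strict -std modes */
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Retry open() until the file shows up or the parent clears *run_child.
 * Returns an open fd, or -1 if asked to stop before the file existed.
 * (Hypothetical helper, modelled on the retry loop in the diff.)
 */
static int open_until_stopped(const char *filename, volatile int *run_child)
{
	int fd;

	assert(filename && run_child);

	while ((fd = open(filename, O_RDONLY)) < 0) {
		if (!*run_child)
			return -1;
		usleep(100);
	}

	return fd;
}

/*
 * Hypothetical demo: the child waits for a file that never appears, the
 * parent clears the shared flag and then reaps the child.
 * Returns 0 if the child left the retry loop and exited cleanly, else -1.
 */
static int demo_stop_flag(const char *missing_path)
{
	int status;
	pid_t pid;
	/* The flag must be in shared memory so the child sees the update. */
	volatile int *run_child = mmap(NULL, sizeof(*run_child),
				       PROT_READ | PROT_WRITE,
				       MAP_SHARED | MAP_ANONYMOUS, -1, 0);

	if (run_child == MAP_FAILED)
		return -1;

	*run_child = 1;

	pid = fork();
	if (pid < 0) {
		munmap((void *)run_child, sizeof(*run_child));
		return -1;
	}

	if (pid == 0) {
		int fd = open_until_stopped(missing_path, run_child);

		if (fd >= 0)
			close(fd);
		/* Exit 0 only when the loop was left because of the flag. */
		_exit(fd < 0 ? 0 : 1);
	}

	usleep(10000);	/* let the child spin in the retry loop for a while */
	*run_child = 0;	/* tell the child to stop */

	/* Reap the child before any cleanup, mirroring the ordering above. */
	if (waitpid(pid, &status, 0) != pid)
		status = -1;

	munmap((void *)run_child, sizeof(*run_child));

	return (status != -1 && WIFEXITED(status) &&
		WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

Calling demo_stop_flag() with a path that does not exist should return 0. Without the `if (!*run_child)` check, the child would never leave the loop and the waitpid() call would block, which is the hang this bug describes.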
Also, on my machine the test runs for 0.1s with 16 children. That means that even if there were no other processes running on the machine, it would be understandable that one or two children wouldn't get to run before the parent is done. Quite possibly we should change the test so that the children have a chance to run; maybe we should increase the number of appends by 10, since the test was written at a time when I/O was an order of magnitude or even two slower.
Fixed by https://patchwork.ozlabs.org/project/ltp/patch/20240131100018.15767-2-andrea.cervesato@suse.de/
Waiting for a new build to close the ticket.

The bug has been fixed. The ticket can be closed. Thanks

Problem fixed. |