Bug 1217225

Summary: LTP futex_waitv01: return EINVAL instead of EFAULT (32bit compatibility layer)
Product: [openSUSE] PUBLIC SUSE Linux Enterprise Server 15 SP6 Reporter: WEI GAO <wegao>
Component: KernelAssignee: Kernel Bugs <kernel-bugs>
Status: RESOLVED WORKSFORME QA Contact:
Severity: Normal    
Priority: P2 - High CC: eugenio.paolantonio, mhocko, nik.borisov, rtsvetkov, tiwai, wegao
Version: unspecified   
Target Milestone: ---   
Hardware: x86-64   
OS: Other   
URL: https://openqa.suse.de/tests/12803495/modules/futex_waitv01/steps/7
Whiteboard:
Found By: openQA Services Priority:
Business Priority: Blocker: Yes
Marketing QA Status: --- IT Deployment: ---
Attachments: simple code to reproduce issue.

Description WEI GAO 2023-11-16 11:51:29 UTC
## Observation
openQA test in scenario sle-15-SP6-Online-x86_64-ltp_syscalls_m32@64bit fails in
[futex_waitv01](https://openqa.suse.de/tests/12803495/modules/futex_waitv01/steps/7)

futex_waitv01 case will use syscall(__NR_futex_waitv, waiters, nr_waiters, flags, timo, clockid) to verify correct error returned when input wrong parameter.

I encounter same issue on my Ubuntu 22.04 LTS/6.2.0-36-generic/ldd (Ubuntu GLIBC 2.35-0ubuntu3.4) 2.35. So i trace the code on ubuntu env i found the timeout(__kernel_timespec __user *) parameter get from syscall32 compatibility layer to __do_sys_futex_waitv is broken, so it will failed within sub function futex_init_timeout() and return EFAULT always, so  futex_waitv01/futex_wait02/futex_wait03 cases will both get same EFAULT error lead LTP case failed.

kernel code:
SYSCALL_DEFINE5(futex_waitv, struct futex_waitv __user *, waiters,
                unsigned int, nr_futexes, unsigned int, flags,
                struct __kernel_timespec __user *, timeout, clockid_t, clockid)
{
        struct hrtimer_sleeper to;
        struct futex_vector *futexv;
...

        if (timeout) {
...
                if (get_timespec64(&ts, timeout))
                        return -EFAULT;

                ret = futex_init_timeout(FUTEX_WAIT_BITSET, flag_init, &ts, &time);  <<<<< directly return -EINVAL since &ts is wrong value
...
        }
Comment 1 Takashi Iwai 2023-11-17 13:23:35 UTC
The description is a bit confusing.  Is it the case where 32bit-compat futex_waiv syscall returns -EINVAL although -EFAULT is expected?  The subject shows so.  Meanwhile, you wrote as if -EFAULT is always returned in the later description.

And, it's really difficult to get what actually openQA is testing only from the report.  Could you give a small test code snippet that shows the error, so that one can reproduce easily?
Comment 2 WEI GAO 2023-11-19 14:26:01 UTC
(In reply to Takashi Iwai from comment #1)
> The description is a bit confusing.  Is it the case where 32bit-compat
> futex_waiv syscall returns -EINVAL although -EFAULT is expected?  The
> subject shows so.  Meanwhile, you wrote as if -EFAULT is always returned in
> the later description.
Yes, the LTP case expect get error EFAULT but get error EINVAL so failed.Such as LTP log:
futex_waitv01.c:78: TFAIL: futex_waitv address is NULL expected EFAULT: EINVAL (22)  
Above error triggered by following code: 

static void test_null_address(void)
{
        struct timespec to;

        init_waitv();
        init_timeout(&to);

        waitv->uaddr = 0x00000000;  <<<<put futex_waitv syscall's first parameter's uaddr memeber to 0, and following line expect get EFAULT error.
But acutally get EINVAL, so case report a falure.
        TST_EXP_FAIL(futex_waitv(waitv, 1, 0, &to, CLOCK_MONOTONIC), EFAULT,
                     "futex_waitv address is NULL");
}


> 
> And, it's really difficult to get what actually openQA is testing only from
> the report.  Could you give a small test code snippet that shows the error,
> so that one can reproduce easily?

Sorry for the confusing on description. The openqa just run LTP case and copy LTP failed info, the error message is created by LTP. I can create a simple test code without use any LTP library later.
Comment 3 WEI GAO 2023-11-20 02:47:33 UTC
Created attachment 870847 [details]
simple code to reproduce issue.
Comment 4 Takashi Iwai 2023-11-20 13:42:40 UTC
(In reply to WEI GAO from comment #3)
> Created attachment 870847 [details]
> simple code to reproduce issue.

It seems working on x86_64 VM with the recent SLE15-SP6 kernel.

Could you retest with the latest SLE15-SP6 KOTD?
Comment 5 WEI GAO 2023-11-20 14:39:37 UTC
(In reply to Takashi Iwai from comment #4)
> (In reply to WEI GAO from comment #3)
> > Created attachment 870847 [details]
> > simple code to reproduce issue.
> 
> It seems working on x86_64 VM with the recent SLE15-SP6 kernel.
> 
> Could you retest with the latest SLE15-SP6 KOTD?

Where i can download latest build for check?
Comment 6 Takashi Iwai 2023-11-20 14:44:21 UTC
Either IBS Devel:Kernel:SLE15-SP6 or OBS Kernel:SLE15-SP6 repo
  http://download.suse.de/ibs/Devel:/Kernel:/SLE15-SP6/standard/
  http://download.opensuse.org/repositories/Kernel:/SLE15-SP6/pool/
Comment 7 WEI GAO 2023-11-21 00:58:24 UTC
(In reply to Takashi Iwai from comment #6)
> Either IBS Devel:Kernel:SLE15-SP6 or OBS Kernel:SLE15-SP6 repo
>   http://download.suse.de/ibs/Devel:/Kernel:/SLE15-SP6/standard/
>   http://download.opensuse.org/repositories/Kernel:/SLE15-SP6/pool/

Result indeed pass. So let's wait next build result of openqa.

susetest:~ # uname -r
6.4.0-150600.98.g0f0ffd2-default
susetest:~ # cat /etc/os-release
NAME="SLES"
VERSION="15-SP6"
VERSION_ID="15.6"
PRETTY_NAME="SUSE Linux Enterprise Server 15 SP6"
ID="sles"
ID_LIKE="suse"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sles:15:sp6"
DOCUMENTATION_URL="https://documentation.suse.com/"

susetest:~ # gcc -m32 -g a.c
a.c:73:1: warning: return type defaults to ‘int’ [-Wimplicit-int]
 main() {
 ^~~~
susetest:~ # ./a.out
futex is 0x87f31a0
!!! ret is -1
!!! errno is 14
!!! Error message: Bad address
Case pass
Comment 8 WEI GAO 2023-11-23 05:48:09 UTC
After further invesgitate, the simple test code indeed can pass, but not means timeout parameter from user space can be handled by kernel correctly, it still can lead failure if you start check real syscall function(if you put all correct parameter and do syscall, you will found timeout value is broken)
I try to test on latest kernel and give following fix patch.
https://lkml.org/lkml/2023/11/23/13
Comment 9 Takashi Iwai 2023-11-23 07:09:58 UTC
That makes sense, and the patch looks OK for x86.

But it might be better to unify the call for both 32 and 64bit instead of duplicated code.  Also, you'd need to cover other architectures like mips, sparc, powerpc and arm64, too.

Let's see the upstream reaction.
Comment 10 WEI GAO 2023-11-24 00:03:32 UTC
(In reply to Takashi Iwai from comment #9)
> That makes sense, and the patch looks OK for x86.
> 
> But it might be better to unify the call for both 32 and 64bit instead of
> duplicated code.  Also, you'd need to cover other architectures like mips,
> sparc, powerpc and arm64, too.
> 
> Let's see the upstream reaction.

Feedback from André Almeida <andrealmeid@igalia.com>: 
From, what I recall, we don't want to add new syscalls with old_timespec32, giving that they will have a limited lifetime. Instead, userspace should be able to come up with a 64-bit timespec implementation for -m32.
Comment 11 Takashi Iwai 2023-11-24 08:25:43 UTC
(In reply to WEI GAO from comment #10)
> (In reply to Takashi Iwai from comment #9)
> > That makes sense, and the patch looks OK for x86.
> > 
> > But it might be better to unify the call for both 32 and 64bit instead of
> > duplicated code.  Also, you'd need to cover other architectures like mips,
> > sparc, powerpc and arm64, too.
> > 
> > Let's see the upstream reaction.
> 
> Feedback from André Almeida <andrealmeid@igalia.com>: 
> From, what I recall, we don't want to add new syscalls with old_timespec32,
> giving that they will have a limited lifetime. Instead, userspace should be
> able to come up with a 64-bit timespec implementation for -m32.

That's understandable, and I vaguely remember of a similar argument at the time of switching to 64bit timespec by Arnd Bergmann.

How is it glibc-side implemention?  Does it also use 32bit timespec?  If yes, a problem is worse.
Comment 12 WEI GAO 2023-11-27 04:40:40 UTC
(In reply to Takashi Iwai from comment #11)
> (In reply to WEI GAO from comment #10)
> > (In reply to Takashi Iwai from comment #9)
> > > That makes sense, and the patch looks OK for x86.
> > > 
> > > But it might be better to unify the call for both 32 and 64bit instead of
> > > duplicated code.  Also, you'd need to cover other architectures like mips,
> > > sparc, powerpc and arm64, too.
> > > 
> > > Let's see the upstream reaction.
> > 
> > Feedback from André Almeida <andrealmeid@igalia.com>: 
> > From, what I recall, we don't want to add new syscalls with old_timespec32,
> > giving that they will have a limited lifetime. Instead, userspace should be
> > able to come up with a 64-bit timespec implementation for -m32.
> 
> That's understandable, and I vaguely remember of a similar argument at the
> time of switching to 64bit timespec by Arnd Bergmann.
> 
> How is it glibc-side implemention?  Does it also use 32bit timespec?  If
> yes, a problem is worse.

I have checked the glibc latest code but do not see any implemention(*.c) on futex_waitv syscall. So normally you have to do syscall directly with __NR_futex_waitv. I do not think glibc-side can covert this struct correctly.
Comment 13 Takashi Iwai 2023-11-27 08:17:16 UTC
(In reply to WEI GAO from comment #12)
> I have checked the glibc latest code but do not see any implemention(*.c) on
> futex_waitv syscall. So normally you have to do syscall directly with
> __NR_futex_waitv. I do not think glibc-side can covert this struct correctly.

If that's true, it implies that we rather need to fix in the kernel side as you did.  The information should be shown in the discussion thread.
Comment 14 WEI GAO 2023-11-27 12:17:02 UTC
(In reply to Takashi Iwai from comment #13)
> (In reply to WEI GAO from comment #12)
> > I have checked the glibc latest code but do not see any implemention(*.c) on
> > futex_waitv syscall. So normally you have to do syscall directly with
> > __NR_futex_waitv. I do not think glibc-side can covert this struct correctly.
> 
> If that's true, it implies that we rather need to fix in the kernel side as
> you did.  The information should be shown in the discussion thread.

I reply the email thread.
Comment 15 WEI GAO 2023-12-03 23:52:37 UTC
Patch for LTP, fix in user space instead of kernel:
https://patchwork.ozlabs.org/project/ltp/patch/20231203235117.29677-1-wegao@suse.com/
Comment 16 Radoslav Tzvetkov 2023-12-12 16:48:55 UTC
So is there any development in the past week?
Comment 17 WEI GAO 2023-12-13 06:42:46 UTC
(In reply to Radoslav Tzvetkov from comment #16)
> So is there any development in the past week?
No any fix in kernel side, we just need adjust LTP test case.

Patch in LTP is under reviewing. I think we can close this issue since upstream think this is userspace issue instead of kernel. I put to close status.