Bug 1224402

Summary: /usr/src/kernel-modules/nvidia-390.157-default/nvidia/nv-procfs.o: warning: objtool: nv_register_procfs() falls through to next function nv_unregister_procfs()
Product: [openSUSE] openSUSE Distribution Reporter: Michal Suchanek <msuchanek>
Component: X11 3rd Party DriverAssignee: Stefan Dirsch <sndirsch>
Status: RESOLVED FIXED QA Contact: Stefan Dirsch <sndirsch>
Severity: Normal    
Priority: P3 - Medium CC: mbenes, mjambor, msuchanek, petr.pavlu, pmladek, sndirsch
Version: Leap 15.5   
Target Milestone: ---   
Hardware: Other   
OS: Other   
Whiteboard:
Found By: --- Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---
Attachments: The nvidia module that includes the symbol
build log
source
preprocessed source
object file
nvidia package change

Description Michal Suchanek 2024-05-17 08:34:33 UTC
Is this a bug in objtool?

How would you even go about writing such function?
Comment 1 Michal Suchanek 2024-05-17 08:39:56 UTC
Created attachment 874943 [details]
The nvidia module that includes the symbol
Comment 2 Miroslav Beneš 2024-05-22 13:28:00 UTC
Strange and I would need to take a look at sources because

000000000000abb0 <nv_register_procfs>:
[...]
    adbc:       e8 00 00 00 00          call   adc1 <nv_register_procfs+0x211>
                        adbd: R_X86_64_PC32     __stack_chk_fail-0x4
    adc1:       4c 8b 05 00 00 00 00    mov    0x0(%rip),%r8        # adc8 <nv_register_procfs+0x218>
                        adc4: R_X86_64_PC32     .bss+0x1284
    adc8:       48 c7 c1 00 00 00 00    mov    $0x0,%rcx
                        adcb: R_X86_64_32S      .rodata+0x420
    adcf:       48 89 c2                mov    %rax,%rdx
    add2:       be 24 81 00 00          mov    $0x8124,%esi
    add7:       e8 00 00 00 00          call   addc <nv_register_procfs+0x22c>
                        add8: R_X86_64_PC32     proc_create_data-0x4
    addc:       0f 1f 40 00             nopl   0x0(%rax)

It really ends with a call to proc_create_data() which, obviously, is not a noreturn function. So it falls through to the next one. If it ended with the call to __stack_chk_fail() a couple of lines above, like most of the other functions, everything would be fine, because __stack_chk_fail() is marked as noreturn and objtool knows about it.

The code between the call to __stack_chk_fail() and the call to proc_create_data() fills arguments. adc1 address is a target of intra-function jump so looks like a valid code.

000000000000ade0 <nv_unregister_procfs>:
    ade0:       e8 00 00 00 00          call   ade5 <nv_unregister_procfs+0x5>
                        ade1: R_X86_64_PC32     __fentry__-0x4
    ade5:       48 8b 3d 00 00 00 00    mov    0x0(%rip),%rdi        # adec <nv_unregister_procfs+0xc>
                        ade8: R_X86_64_PC32     .bss+0x1274
    adec:       e9 00 00 00 00          jmp    adf1 <nv_unregister_procfs+0x11>
                        aded: R_X86_64_PC32     proc_remove-0x4
    adf1:       66 66 2e 0f 1f 84 00    data16 cs nopw 0x0(%rax,%rax,1)
    adf8:       00 00 00 00 
    adfc:       0f 1f 40 00             nopl   0x0(%rax)

I left it here as an example. There is jmp at the end which is correct. It does not return. Sibling jump. All good.

I am only guessing without sources though. Could you also appends logs from the compilation, please?
Comment 3 Michal Suchanek 2024-05-22 14:22:35 UTC
Created attachment 875026 [details]
build log
Comment 4 Michal Suchanek 2024-05-22 14:26:03 UTC
Created attachment 875027 [details]
source
Comment 5 Michal Suchanek 2024-05-22 14:26:33 UTC
Created attachment 875028 [details]
preprocessed source
Comment 6 Miroslav Beneš 2024-05-24 12:33:19 UTC
Thanks. So it seems that... (thanks Petr Pavlu for helping out because it is really a strange thing)

    adc1:       4c 8b 05 00 00 00 00    mov    0x0(%rip),%r8        # adc8 <nv_register_procfs+0x218>
                        adc4: R_X86_64_PC32     .bss+0x1284
    adc8:       48 c7 c1 00 00 00 00    mov    $0x0,%rcx
                        adcb: R_X86_64_32S      .rodata+0x420
    adcf:       48 89 c2                mov    %rax,%rdx
    add2:       be 24 81 00 00          mov    $0x8124,%esi
    add7:       e8 00 00 00 00          call   addc <nv_register_procfs+0x22c>
                        add8: R_X86_64_PC32     proc_create_data-0x4
    addc:       0f 1f 40 00             nopl   0x0(%rax)

is just dead code. Originally it comes from

    for (i = 0; __nv_patches[i].short_description; i++)
    {
        nv_procfs_add_text_file(proc_nvidia_patches,
            __nv_patches[i].short_description, __nv_patches[i].description);
    }

where __nv_patches is a global array initialized as

static struct {
                const char *short_description;
                const char *description;
              } __nv_patches[] = {
{ ((void *)0), ((void *)0) } };

The dead code is nv_procfs_add_text_file() inlined (it contains just the call to proc_create_data()).

The compiler knows that the loop does not exist and it is optimized out. The call site (from the "loop") looks like

    aced:       48 8b 3d 00 00 00 00    mov    0x0(%rip),%rdi        # acf4 <nv_register_procfs+0x144>
                        acf0: R_X86_64_PC32     .bss+0x127c
    acf4:       48 85 ff                test   %rdi,%rdi
    acf7:       0f 85 c4 00 00 00       jne    adc1 <nv_register_procfs+0x211>

Meaning that the jump never happens (the array contains just zeros). However it is all left there. No idea why.

So objtool basically reports a correct thing.

Martin, does it ring a bell? Could you shed some light on this?
Comment 7 Michal Suchanek 2024-05-24 13:36:32 UTC
Created attachment 875072 [details]
object file
Comment 8 Martin Jambor 2024-05-27 15:45:46 UTC
GCC 7 deduces that the loop cannot complete its first iteration
because then it would read past the __nv_patches array if it did.  So
it puts its __builtin_unreachable after the call to proc_create_data
because if it did not return, it would be still necessary to call it.

To make the compiler do what you want it do, I think you need to make
__nv_patches const.
Comment 9 Miroslav Beneš 2024-05-28 06:49:06 UTC
Michal, could you try Martin's suggestion, please?

I do not think this is something that objtool can do anything about.
Comment 10 Michal Suchanek 2024-05-28 08:45:43 UTC
Created attachment 875152 [details]
nvidia package change

Making the array const is indeed workaround for the problem.

Nonetheless, we have real code here where gcc fails to eliminate some dead branch of the code, and as a result objtool issues a useless warning.

That's problem with the tools, not the code.
Comment 11 Martin Jambor 2024-05-28 09:39:26 UTC
I will look into why the variable was not promoted to const
automagically in this case because that seems to be somewhat lame (but
maybe some real reason escaped me).  However, failing that, GCC cannot
reliably identify every piece of unreachable code.  If, for example,
the address of the non-const array escaped the compilation unit, then
GCC simply might not do anything better than what it does.

(Well, GCC 13 has a new option -funreachable-traps which puts a trap
before the "fall-through" but we have just discovered its current
documentation is not correct and since ATM I cannot really estimate
its costs, I cannot really recommend using it or even evaluating it.)
Comment 12 Miroslav Beneš 2024-05-28 12:13:56 UTC
I will not get caught in "broken code, broken tools, broken gcc" discussion but the thing is that objtool cannot know if it is a bug in the code or not. So it reports the warning and stops the analysis. Things like this are fixed in the mainline kernel and const would be appropriate here.

Having said that the support for noreturn functions in objtool is not great. There have been talks about improving it but it needs to happen in cooperation with GCC because there is currently no way to know that GCC decided to place unreachable somewhere (decided that a function is noreturn). I do not know if there is a progress on this front behind the scenes.
Comment 13 Martin Jambor 2024-05-28 13:04:03 UTC
GCC 13 (at least) does not automagically promote the array to const because of the -flive-patching=inline-clone option (which disables the pure-const pass).  It is even natural, we need to be ready for some unknown live-patch to possibly leak the address even when such a leak is not present in the code at hand.
Comment 14 Martin Jambor 2024-05-28 13:07:03 UTC
(In reply to Miroslav Beneš from comment #12)
> Having said that the support for noreturn functions in objtool is not great.
> There have been talks about improving it but it needs to happen in
> cooperation with GCC because there is currently no way to know that GCC
> decided to place unreachable somewhere (decided that a function is
> noreturn). I do not know if there is a progress on this front behind the
> scenes.

Once we start building our enterprise kernels with GCC 13+, asking our
performance team to evaluate the option -funreachable-traps might be
worthwhile.
Comment 15 Michal Suchanek 2024-05-28 14:41:14 UTC
Please consider adding the attached patch to avoid displaying this (false-positive) warning to users.
Comment 16 Stefan Dirsch 2024-05-28 16:05:43 UTC
Which warning do you mean exactly?

conftest.sh creates code snippets to figure out for specific functions, how many and which kind of parameters are used in the kernel against the driver is being built. This is needed due to all kind of backports done in downstream kernels of distributions, so checking the kernel version doesn't work.

So maybe it's just overkill to fix such a warning? I mean if the detection would be wrong because of this, probably the driver build would fail.
Comment 17 Michal Suchanek 2024-05-28 16:32:34 UTC
(In reply to Stefan Dirsch from comment #16)
> Which warning do you mean exactly?

The one shown in the bug title which is displayed every time the package is installed on the user's system and builds the kernel modules.

This part of the conftest.sh ends up in the generated code, confuses gcc which fails to eliminate some dead code which then confuses objtool.
Comment 18 Stefan Dirsch 2024-05-28 19:04:10 UTC
Ok. BTW, there is no such warning when building against Tumbleweed, just against Leap 15.5 and 15.6. Check this

https://build.suse.de/project/monitor/Proprietary:X11:Drivers

nvidia-driver-G06
nvidia-gfxG05
nvidia-gfxG04 (this bug G04 = 390.157)

All these are affected by this objtool warning on Leap. mabye one should figure out what's different in objtool for Tumbleweed?
Comment 19 Miroslav Beneš 2024-05-29 07:01:03 UTC
(In reply to Martin Jambor from comment #13)
> GCC 13 (at least) does not automagically promote the array to const because
> of the -flive-patching=inline-clone option (which disables the pure-const
> pass).  It is even natural, we need to be ready for some unknown live-patch
> to possibly leak the address even when such a leak is not present in the
> code at hand.

Bah, right. I wonder if we should just get rid of the option eventually as upstream did and somehow detect dangerous changes in our tooling. Adding Petr for his awareness. Thanks a lot for the explanation.

(In reply to Stefan Dirsch from comment #18)
> Ok. BTW, there is no such warning when building against Tumbleweed, just
> against Leap 15.5 and 15.6. Check this
> 
> https://build.suse.de/project/monitor/Proprietary:X11:Drivers
> 
> nvidia-driver-G06
> nvidia-gfxG05
> nvidia-gfxG04 (this bug G04 = 390.157)
> 
> All these are affected by this objtool warning on Leap. mabye one should
> figure out what's different in objtool for Tumbleweed?

Upstream objtool reports the same warning on the object file so there is no difference. TW does not use the GCC option mentioned above so the object code is different.
Comment 20 Stefan Dirsch 2024-05-29 09:46:00 UTC
@Miroslav So I apply this patch and we're done? Then please assign this ticket to me.

(Then I will also try to push this patch upstream)
Comment 21 Miroslav Beneš 2024-05-29 11:21:32 UTC
(In reply to Stefan Dirsch from comment #20)
> @Miroslav So I apply this patch and we're done? Then please assign this
> ticket to me.
> 
> (Then I will also try to push this patch upstream)

Yes, if you apply the patch, GCC will remove the dead code and everything will be all right from objtool perspective.
Comment 22 Stefan Dirsch 2024-05-29 12:25:46 UTC
Patch is now applied to G04, G05 and G06 packages. Closing. I'll inform nvidia by email about this patch and ticket.