|
Bugzilla – Full Text Bug Listing |
| Summary: | 4.17 regression loading custom SPI module | ||
|---|---|---|---|
| Product: | [openSUSE] openSUSE Tumbleweed | Reporter: | Andreas Färber <afaerber> |
| Component: | Kernel | Assignee: | Michal Kubeček <mkubecek> |
| Status: | RESOLVED FIXED | QA Contact: | E-mail List <qa-bugs> |
| Severity: | Normal | ||
| Priority: | P3 - Medium | CC: | agraf, dimstar, jeffm, matz, mbrugger, mkubecek, msuchanek, ppyu, rguenther, stefan.bruens, yousaf.kaukab |
| Version: | Current | ||
| Target Milestone: | --- | ||
| Hardware: | aarch64 | ||
| OS: | openSUSE Factory | ||
| Whiteboard: | |||
| Found By: | --- | Services Priority: | |
| Business Priority: | Blocker: | --- | |
| Marketing QA Status: | --- | IT Deployment: | --- |
| Attachments: | tentative patch | ||
|
Description
Andreas Färber
2018-05-09 01:16:28 UTC
"21656030" (0x014a71de) doesn't look like a typical underflow, rather a random garbage, IMHO. Do we have source of the module? (In reply to Michal Kubeček from comment #1) > Do we have source of the module? Pushed to: https://github.com/afaerber/lan9252-module Derived from previously working ones: https://github.com/afaerber/netx-module https://github.com/afaerber/lora-modules DT overlay: https://github.com/afaerber/dt-overlays/blob/master/sun50i-a64-pine64%2Barpi600%2Blan9252.dts I couldn't see any obvious problem in the module source, neither have I any idea what kind of change could result in such refcounting problem. Some tips for debugging: 1. Seeing how does the refcount behave might give us some hint. Try e.g. to display the refcount repeatedly after loading the module to see if it's growing, decreasing or stays the same (in the third case, it would be interesting to see how the values differ in repeated tests). 2. There are tracepoints module:module_get and module:module_point so that you could use perf to trace refcount changes. If there is a misbalance in get and put operations, it could tell us where. I'm seeing similar symptoms on a Raspberry Pi 3 Model B with 4.17-rc4: # lsmod Module Size Used by sx1276 20480 10549406 lora_dev 16384 10510879 sx1276 lora 16384 10455742 [...] spi_bcm2835 20480 0 [...] As you can see, the numbers are differing significantly between modules where they should have the same usage count. They do not differ between lsmod calls. What does the trace output on module load look like when you enable the module_get tracepoint (/sys/kernel/debug/tracing/events/module/module_get/)? Another thought: is it possible that the config with which the module(s) were compiled differs slightly from that of the running kernel (while controlling for kernel version)? That may potentially produce a differently sized module struct whose fields could have slightly different offsets.. (In reply to Jessica Yu from comment #6) > Another thought: is it possible that the config with which the module(s) > were compiled differs slightly from that of the running kernel (while > controlling for kernel version)? That may potentially produce a differently > sized module struct whose fields could have slightly different offsets.. Looks like you are right. I checked the layout of struct module in vmlinux and compiled module (compiled on SLE15 but that shouldn't make a difference) and indeed, they differ: --- m2.vmlinux 2018-05-14 06:15:09.710000000 -0400 +++ m2.module 2018-05-14 06:13:39.130000000 -0400 @@ -168,8 +168,9 @@ /* | 4 */ unsigned int percpu_size; /* | 4 */ unsigned int num_tracepoints; /* | 8 */ struct tracepoint * const *tracepoints_ptrs; +/* | 8 */ struct jump_entry *jump_entries; +/* | 4 */ unsigned int num_jump_entries; /* | 4 */ unsigned int num_trace_bprintk_fmt; -/* XXX 4-byte hole */ /* | 8 */ const char **trace_bprintk_fmt_start; /* | 8 */ struct trace_event_call **trace_events; /* | 4 */ unsigned int num_trace_events; As a result, the module thinks module::refcnt should be at offset 808 while kernel expects it at offset 800. module's structure has module::exit at offset 800 and the numbers shown in comments 0 and 5 could pretty well be lower 32 bits of these callbacks. The two extra members in module's version indicate a HAVE_JUMP_LABEL was enabled when building the module but not kernel so the next question is where does this difference come from. HAVE_JUMP_LABEL is defined in <linux/jump_label.h> this way:
#if defined(CC_HAVE_ASM_GOTO) && defined(CONFIG_JUMP_LABEL)
# define HAVE_JUMP_LABEL
#endif
where CONFIG_JUMP_LABEL is enabled. CC_HAVE_ASM_GOTO is detected on build
time by trying to compile this (see scripts/gcc-goto.sh):
-----------------------------------------------------------------------------
int main(void)
{
#if defined(__arm__) || defined(__aarch64__)
/*
* Not related to asm goto, but used by jump label
* and broken on some ARM GCC versions (see GCC Bug 48637).
*/
static struct { int dummy; int state; } tp;
asm (".long %c0" :: "i" (&tp.state));
#endif
entry:
asm goto ("" :::: entry);
return 0;
}
-----------------------------------------------------------------------------
For some reason this succeeds when building the module but fails when building
the kernel packages in IBS. There is one related commit between 4.16 and
4.17-rc1: e501ce957a78 ("x86: Force asm-goto") but I don't see how it could
break the detection in IBS.
One strange thing in IBS build log is that it installs gcc 7.3.1 but libgcc_s1
version 8.0.1:
> [ 35s] libgcc_s1-8.0.1+r259636-1.1 ########################################
> [ 60s] gcc7-7.3.1+r258812-3.1 ########################################
No idea if it could be related. The SLE15 system I was testing on has also gcc 7.3.1 but
slightly different:
gcc7-7.3.1+r258812-2.8.aarch64
(In reply to Michal Kubeček from comment #9) > > [ 35s] libgcc_s1-8.0.1+r259636-1.1 ######################################## > > [ 60s] gcc7-7.3.1+r258812-3.1 ######################################## > > No idea if it could be related. The SLE15 system I was testing on has also > gcc 7.3.1 but slightly different: > > gcc7-7.3.1+r258812-2.8.aarch64 Sorry for the noise, I was looking at libgcc_s1 above. So the version is the same. It's probably either the libgcc_s1 or something in the environment. (In reply to Michal Kubeček from comment #8) > [...] CC_HAVE_ASM_GOTO is detected on build > time by trying to compile this (see scripts/gcc-goto.sh): > > ----------------------------------------------------------------------------- > int main(void) > { > #if defined(__arm__) || defined(__aarch64__) > /* > * Not related to asm goto, but used by jump label > * and broken on some ARM GCC versions (see GCC Bug 48637). > */ > static struct { int dummy; int state; } tp; > asm (".long %c0" :: "i" (&tp.state)); > #endif > > entry: > asm goto ("" :::: entry); > return 0; > } > ----------------------------------------------------------------------------- > > For some reason this succeeds when building the module but fails when > building > the kernel packages in IBS. I've confirmed that commenting out the two arm lines it still misbehaves, whereas if I torpedo the test by adding a syntactically invalid "foo" it works. Maybe the compiler folks have an idea of what's going wrong here? What's this strange
static struct { int dummy; int state; } tp;
asm (".long %c0" :: "i" (&tp.state));
supposed to do besides having a random number as instruction? That is,
if you are lucky it might be a valid instruction. Maybe it shouldn't be
in the execution flow? Or the testcase is supposed to be compiled and
not run?
So - how does this testcase FAIL when building the kernel? Ah, possibly different compiler flags? Like -fPIE/fPIC vs. no such flags? Because the quoted part ends up emitting a relocation. (In reply to Richard Biener from comment #13) > What's this strange > > static struct { int dummy; int state; } tp; > asm (".long %c0" :: "i" (&tp.state)); > > supposed to do besides having a random number as instruction? That is, > if you are lucky it might be a valid instruction. Maybe it shouldn't be > in the execution flow? Or the testcase is supposed to be compiled and > not run? AFAIU it's a compile test only (needs to work for cross-compiling, too). Confirming that Tumbleweed's published aarch64 gcc7 matches the revision Michal mentioned, which is the same that built the latest Kernel:HEAD kernel-default on aarch64. Local libgcc_s1 is also at version 8, no difference to OBS/IBS. https://build.opensuse.org/package/show/Kernel:HEAD/kernel-default (Note: -rc5 will arrive with the next cron job kernel source upload and potentially lead to newer Factory package versions used.) (In reply to Richard Biener from comment #14) > So - how does this testcase FAIL when building the kernel? ICE for %c - cf. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=48637 (In reply to Andreas Färber from comment #16) > (In reply to Richard Biener from comment #14) > > So - how does this testcase FAIL when building the kernel? > > ICE for %c - cf. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=48637 So it ICEs for building the kernel? Well. GCC trunk has for example > ./cc1 -quiet t.c -I include -fPIE t.c: In function ‘main’: t.c:9:3: warning: asm operand 0 probably doesn’t match constraints asm (".long %c0" :: "i" (&tp.state)); ^~~ t.c:9:3: error: impossible constraint in ‘asm’ > ./cc1 -quiet t.c -I include so whether it compiles depends on the flags. (In reply to Richard Biener from comment #17) > (In reply to Andreas Färber from comment #16) > > (In reply to Richard Biener from comment #14) > > > So - how does this testcase FAIL when building the kernel? > > > > ICE for %c - cf. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=48637 > > So it ICEs for building the kernel? Well. GCC trunk has for example That was a bug fixed back in 2011. It's an additional test ot make sure we do not try to use jumplabels if kernel is built with (old) compiler exhibiting it. Certainly not what is going on in IBS now. > > ./cc1 -quiet t.c -I include -fPIE > t.c: In function ‘main’: > t.c:9:3: warning: asm operand 0 probably doesn’t match constraints > asm (".long %c0" :: "i" (&tp.state)); > ^~~ > t.c:9:3: error: impossible constraint in ‘asm’ > > ./cc1 -quiet t.c -I include > > so whether it compiles depends on the flags. This is more like it. (In reply to Richard Biener from comment #17) > (In reply to Andreas Färber from comment #16) > > (In reply to Richard Biener from comment #14) > > > So - how does this testcase FAIL when building the kernel? > > > > ICE for %c - cf. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=48637 > > So it ICEs for building the kernel? Oh, you mean _our_ kernel build. :) We don't see that in the build log because gcc-goto.sh redirects output to /dev/null; I hoped you had a hunch what might differ between the two environments and how to investigate or fix. The test has been in the kernel for years already: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/scripts/gcc-goto.sh?id=f3c003f72dfb2497056bcbb864885837a1968ed5 https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/scripts/gcc-goto.sh?id=a9468f30b5eac6957c86aea97954553bfb7c1f18 Did we recently try building the kernel with PIE or other changing flags? (In reply to Andreas Färber from comment #19) > Did we recently try building the kernel with PIE or other changing flags? gcc-PIE-7-2.7 :( which I locally do not have installed. Relying on gcc-PIE seems rather risky to me - can we BuildIgnore that and instead make sure the flags are always the same, both in OBS and for local builds? I'm confused... Kernel:stable (4.16.8) from May 9 seems to be built with the same gcc packages as Kernel:HEAD but it has jump_entries and num_jump_entries in struct module. So there seems to be some difference between master and stable. Some problem with commit e501ce957a78, perhaps? (In reply to Andreas Färber from comment #20) > (In reply to Andreas Färber from comment #19) > > Did we recently try building the kernel with PIE or other changing flags? > > gcc-PIE-7-2.7 :( which I locally do not have installed. > Relying on gcc-PIE seems rather risky to me - can we BuildIgnore that and > instead make sure the flags are always the same, both in OBS and for local > builds? ANY build in OBS uses gcc-PIE, also if you use 'osc build' it is being installed. PIE has been in use since 2017-05-27 (on Factory) BuildIgnore: gcc-PIE is possible, but should be discussed with the sec team first (who asked for PIE to be enabled in first place) Installing gcc-PIE makes the out of tree module compatible with kernel image (no jump_entries and num_jump_entries in struct module). But it still does not explain the difference between Kernel:HEAD (master) and Kernel:stable (stable). BuildIgnore gcc-PIE when building the kernel would IMHO result in an opposite problem if an out of tree module is built on a system with gcc-PIE installed. Perhaps rather let kernel-*-devel (or kernel-devel) depend on gcc-PIE. Anyway, long term solution should be fixing the testcase so that we detect CC_HAVE_ASM_GOTO correctly, with or without gcc-PIE. But that is beyond my ARM64 assembler skills. https://kernel.opensuse.org/cgit/kernel/commit/?id=e501ce957a78 moves the gcc-goto.sh test to an earlier location - possibly the $(KBUILD_CFLAGS) passed to the script make a difference? (In reply to Andreas Färber from comment #25) > https://kernel.opensuse.org/cgit/kernel/commit/?id=e501ce957a78 moves the > gcc-goto.sh test to an earlier location - possibly the $(KBUILD_CFLAGS) > passed to the script make a difference? This looks promising. There is KBUILD_CFLAGS += $(call cc-option,-fno-PIE) KBUILD_AFLAGS += $(call cc-option,-fno-PIE) between the new and old location. Created attachment 770256 [details]
tentative patch
This shold do the trick, just move the lines disabling PIE before the asm goto
test again.
Tested with the lan9252 module: both with and without gcc-PIE package, struct
module has jump_entries and num_jump_entries members.
Going to check that it does the trick in IBS as well but unfortunately I got
some slow worker, the build (project home:mkubecek:bsc1092456) is already
running for over 3 hours and doesn't seem to be going to finish any time soon.
Once tested, I would like to submit the patch to upstream, unless someone has
a better idea. In any case, it would be nice to fix the testcase but that
would require someone else.
Good news... with kernel packages built in IBS with the patch from comment 27, struct module looks the same in vmlinux, lan9252.ko built with gcc-PIE package installed and lan9252.ko built without gcc-PIE installed. I also tried to insmod lan9252.ko, refcount is zero after loading and module can be unloaded cleanly. I'm going to submit the patch to upstream. Let's see if someone comes with a nicer solution. (In reply to Michal Kubeček from comment #28) > Good news... with kernel packages built in IBS with the patch from comment 27, > struct module looks the same For the sake of completeness: "the same" means it has both members related to jump labels (i.e. what we want). The fix (or rather workaround) was accepted to kbuild tree but it's not in v4.17-rc6 yet so I cherry picked it to our master branch. 4.17-rc6 packages building now in IBS Devel:Kernel:master already have it. OBS Kernel:HEAD was not updated to RC6 yet but once it is, it should have the patch as well. 4.17-rc7 packages available in both OBS Kernel:HEAD and IBS Devel:Kernel:master now should be fixed. Feel free to reopen if something goes wrong. |