Bugzilla – Bug 1213755
soft lockup during test fstests with ext4
Last modified: 2023-08-24 10:54:22 UTC
When run generic/101-200 in a serial in ext4 in alp build2.4. System got hung after run generic/166, which could see from snapshot is passed, but suddenly hung before test generic/167. http://10.67.133.133/tests/295#step/generate_report/323 [details for soft lockup] [ 5615.234855][ C0] watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [kworker/0:204:9780] [ 5641.077264][ C1] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: [ 5644.940911][ C1] rcu: 0-...!: (1 GPs behind) idle=9504/1/0x4000000000000000 softirq=107591/107591 fqs=4 [ 5644.941842][ C1] rcu: 1-...!: (1 GPs behind) idle=16c4/0/0x1 softirq=112547/112548 fqs=4 [ 5644.943270][ C1] rcu: (detected by 1, t=19162 jiffies, g=181153, q=129 ncpus=4) [ 5644.998338][ C0] Kernel panic - not syncing: softlockup: hung tasks [ 5644.998901][ C1] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G L 6.4.1-1-default #1 ALP-0.1 (unreleased) c69a840d71ac942b652150cd3cda4b08c7fc98a1 [ 5644.998906][ C1] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014 [ 5644.998908][ C1] RIP: 0010:pv_native_safe_halt+0xf/0x20 [ 5644.998912][ C1] Code: 0b 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d d3 65 37 00 fb f4 <c3> cc cc cc cc 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90 90 90 90 [ 5644.998914][ C1] RSP: 0018:ffffb131800a3ed8 EFLAGS: 00000212 [ 5644.998916][ C1] RAX: 0000000000000001 RBX: 0000000000000001 RCX: ffff9a924526b200 [ 5644.998918][ C1] RDX: 4000000000000000 RSI: ffffffff859dad71 RDI: 0000000000a916bc [ 5644.998919][ C1] RBP: ffff9a92402ecf00 R08: 0000000000a916bc R09: 00000000000001e3 [ 5644.998920][ C1] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000 [ 5644.998921][ C1] R13: 0000000000000000 R14: ffff9a92402ecf00 R15: 0000000000000000 [ 5644.998922][ C1] FS: 0000000000000000(0000) GS:ffff9a927bc80000(0000) knlGS:0000000000000000 [ 5644.998924][ C1] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 5644.998925][ C1] CR2: 00007fc607fa7840 CR3: 0000000107328000 CR4: 00000000000006e0 [ 5644.998931][ C1] Call Trace: [ 5644.998933][ C1] <IRQ> [ 5644.998935][ C1] ? rcu_dump_cpu_stacks+0xc4/0x100 [ 5644.998940][ C1] ? rcu_sched_clock_irq+0xfa1/0x1120 [ 5644.998943][ C1] ? timekeeping_update+0xdd/0x130 [ 5644.998946][ C1] ? timekeeping_advance+0x377/0x570 [ 5644.998948][ C1] ? scheduler_tick+0xdb/0x2b0 [ 5644.998951][ C1] ? update_process_times+0x5f/0x90 [ 5644.998954][ C1] ? tick_sched_handle+0x21/0x60 [ 5644.998957][ C1] ? tick_sched_timer+0x6f/0x90 [ 5644.998959][ C1] ? __pfx_tick_sched_timer+0x10/0x10 [ 5644.998960][ C1] ? __hrtimer_run_queues+0x112/0x2b0 [ 5644.998963][ C1] ? hrtimer_interrupt+0xf8/0x230 [ 5644.998966][ C1] ? __sysvec_apic_timer_interrupt+0x61/0x130 [ 5644.998968][ C1] ? sysvec_apic_timer_interrupt+0x6d/0x90 [ 5644.998970][ C1] </IRQ> [ 5644.998971][ C1] <TASK> [ 5644.998972][ C1] ? asm_sysvec_apic_timer_interrupt+0x1a/0x20 [ 5644.998975][ C1] ? pv_native_safe_halt+0xf/0x20 [ 5644.998978][ C1] default_idle+0x9/0x20 [ 5644.998981][ C1] default_idle_call+0x2c/0xe0 [ 5644.998984][ C1] do_idle+0x226/0x270 [ 5645.001763][ C1] cpu_startup_entry+0x1d/0x20 [ 5645.001769][ C1] start_secondary+0x12e/0x150 [ 5645.012098][ C1] secondary_startup_64_no_verify+0x10b/0x10b [ 5645.014647][ C1] </TASK> [ 5645.061225][ C0] CPU: 0 PID: 9780 Comm: kworker/0:204 Tainted: G L 6.4.1-1-default #1 ALP-0.1 (unreleased) c69a840d71ac942b652150cd3cda4b08c7fc98a1 [ 5645.063377][ C0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014 [ 5645.065423][ C0] Workqueue: events netstamp_clear [ 5645.066725][ C0] Call Trace: [ 5645.072083][ C0] <IRQ> [ 5645.072945][ C0] dump_stack_lvl+0x47/0x60 [ 5645.079379][ C0] panic+0x178/0x330 [ 5645.151886][ C0] ? __pfx_nmi_raise_cpu_backtrace+0x10/0x10 [ 5645.153137][ C0] watchdog_timer_fn+0x1d2/0x210 [ 5645.155428][ C0] ? __pfx_watchdog_timer_fn+0x10/0x10 [ 5645.156591][ C0] __hrtimer_run_queues+0x112/0x2b0 [ 5645.157748][ C0] hrtimer_interrupt+0xf8/0x230 [ 5645.159110][ C0] __sysvec_apic_timer_interrupt+0x61/0x130 [ 5645.160287][ C0] sysvec_apic_timer_interrupt+0x6d/0x90 [ 5645.162132][ C0] </IRQ> [ 5645.162865][ C0] <TASK> [ 5645.163692][ C0] asm_sysvec_apic_timer_interrupt+0x1a/0x20 [ 5645.165056][ C0] RIP: 0010:smp_call_function_many_cond+0xf8/0x4a0 [ 5645.166948][ C0] Code: 7b 08 48 63 d0 e8 d8 8c 45 00 3b 05 02 a7 f2 01 73 25 48 63 d0 48 8b 33 48 03 34 d5 00 0b ac 85 8b 56 08 83 e2 01 74 0a f3 90 <8b> 4e 08 83 e1 01 75 f6 83 c0 01 eb c1 48 83 c4 40 5b 5d 41 5c 41 [ 5645.172073][ C0] RSP: 0018:ffffb13186ec7d70 EFLAGS: 00000202 [ 5645.174186][ C0] RAX: 0000000000000002 RBX: ffff9a927bc3b2c0 RCX: 0000000000000001 [ 5645.177912][ C0] RDX: 0000000000000001 RSI: ffff9a927bd40980 RDI: ffff9a9240071bd8 [ 5645.179519][ C0] RBP: 0000000000000000 R08: ffff9a92400716b8 R09: ffff9a927bc3b2c0 [ 5645.181567][ C0] R10: ffff9a92400716b8 R11: 0000000000000000 R12: ffff9a927bc3b2c0 [ 5645.183488][ C0] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000001 [ 5645.185957][ C0] ? __pfx_do_sync_core+0x10/0x10 [ 5645.187399][ C0] on_each_cpu_cond_mask+0x24/0x40 [ 5645.188592][ C0] text_poke_bp_batch+0xb7/0x2f0 [ 5645.190028][ C0] ? arch_jump_label_transform_queue+0x57/0x80 [ 5645.191284][ C0] text_poke_finish+0x1f/0x30 [ 5645.192328][ C0] arch_jump_label_transform_apply+0x1a/0x30 [ 5645.193642][ C0] jump_label_update+0x9c/0x100 [ 5645.194771][ C0] static_key_enable_cpuslocked+0x63/0x90 [ 5645.195910][ C0] static_key_enable+0x1a/0x20 [ 5645.196885][ C0] process_one_work+0x20a/0x420 [ 5645.197897][ C0] worker_thread+0x4e/0x3b0 [ 5645.198839][ C0] ? __pfx_worker_thread+0x10/0x10 [ 5645.199891][ C0] kthread+0xe8/0x120 [ 5645.200752][ C0] ? __pfx_kthread+0x10/0x10 [ 5645.201706][ C0] ret_from_fork+0x2c/0x50 [ 5645.202644][ C0] </TASK> [ 5645.203573][ C0] Kernel Offset: 0x3400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) [ 5645.205798][ C0] Rebooting in 90 seconds..
Looking at the kernel log there's nothing pointing at ext4, or to any filesystem code. There's a softlockup that seems to be happening in the self-modifying code infrastructure (jump_label), in the network core code, more specifically in the netstamp_clear workqueue. I've tried to find something similar on lkml but no luck. Wrong search keywords, I guess. Goldwyn, can you suggest someone else to bring some knowledge on these areas? By the way, I see there are a lot of fstests that aren't being executed simply because the VM doesn't include some tools (for example, the 'chacl', from the 'acl' package). Where can I report these sort of things? (Or try to help fixing them?)
Yes, I assigned it in a hurry. This does not look like a softlockup in ext4 and looks like something else is taking up processing resources. Re-assigning back to kernel-bugs.
It needs retests with the later 6.4.x kernel, in anyway...
Requesting retest of the scenario
This test generic/101-200 in ext4 tested in ALP build5.6 didn't reproduce this issue. https://openqa.suse.de/tests/11835577 Consider this a random issue when running xfstests, and not a filesystem issue. I checked all tests in build5.6 and didn't see related fails. https://openqa.suse.de/tests/overview?distri=alp&version=1.0&build=5.6&groupid=519
Closing as WORKSFORME in latest version tested (build5.6)