Bug 1213755

Summary: soft lockup during test fstests with ext4
Product: [SUSE ALP - SUSE Adaptable Linux Platform] Granite Reporter: Yong Sun <yosun>
Component: KernelAssignee: Kernel Bugs <kernel-bugs>
Status: RESOLVED WORKSFORME QA Contact:
Severity: Normal    
Priority: P2 - High CC: fcrozat, jalausuch, jcheung, lan, rgoldwyn, tiwai, yosun
Version: unspecified   
Target Milestone: ---   
Hardware: All   
OS: Other   
Whiteboard:
Found By: --- Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---

Description Yong Sun 2023-07-28 09:31:24 UTC
When run generic/101-200 in a serial in ext4 in alp build2.4. System got hung after run generic/166, which could see from snapshot is passed, but suddenly hung before test generic/167.
http://10.67.133.133/tests/295#step/generate_report/323

[details for soft lockup]
[ 5615.234855][    C0] watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [kworker/0:204:9780]
[ 5641.077264][    C1] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 5644.940911][    C1] rcu: 	0-...!: (1 GPs behind) idle=9504/1/0x4000000000000000 softirq=107591/107591 fqs=4
[ 5644.941842][    C1] rcu: 	1-...!: (1 GPs behind) idle=16c4/0/0x1 softirq=112547/112548 fqs=4
[ 5644.943270][    C1] rcu: 	(detected by 1, t=19162 jiffies, g=181153, q=129 ncpus=4)
[ 5644.998338][    C0] Kernel panic - not syncing: softlockup: hung tasks
[ 5644.998901][    C1] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G             L      6.4.1-1-default #1 ALP-0.1 (unreleased) c69a840d71ac942b652150cd3cda4b08c7fc98a1
[ 5644.998906][    C1] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014
[ 5644.998908][    C1] RIP: 0010:pv_native_safe_halt+0xf/0x20
[ 5644.998912][    C1] Code: 0b 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d d3 65 37 00 fb f4 <c3> cc cc cc cc 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90 90 90 90
[ 5644.998914][    C1] RSP: 0018:ffffb131800a3ed8 EFLAGS: 00000212
[ 5644.998916][    C1] RAX: 0000000000000001 RBX: 0000000000000001 RCX: ffff9a924526b200
[ 5644.998918][    C1] RDX: 4000000000000000 RSI: ffffffff859dad71 RDI: 0000000000a916bc
[ 5644.998919][    C1] RBP: ffff9a92402ecf00 R08: 0000000000a916bc R09: 00000000000001e3
[ 5644.998920][    C1] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
[ 5644.998921][    C1] R13: 0000000000000000 R14: ffff9a92402ecf00 R15: 0000000000000000
[ 5644.998922][    C1] FS:  0000000000000000(0000) GS:ffff9a927bc80000(0000) knlGS:0000000000000000
[ 5644.998924][    C1] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5644.998925][    C1] CR2: 00007fc607fa7840 CR3: 0000000107328000 CR4: 00000000000006e0
[ 5644.998931][    C1] Call Trace:
[ 5644.998933][    C1]  <IRQ>
[ 5644.998935][    C1]  ? rcu_dump_cpu_stacks+0xc4/0x100
[ 5644.998940][    C1]  ? rcu_sched_clock_irq+0xfa1/0x1120
[ 5644.998943][    C1]  ? timekeeping_update+0xdd/0x130
[ 5644.998946][    C1]  ? timekeeping_advance+0x377/0x570
[ 5644.998948][    C1]  ? scheduler_tick+0xdb/0x2b0
[ 5644.998951][    C1]  ? update_process_times+0x5f/0x90
[ 5644.998954][    C1]  ? tick_sched_handle+0x21/0x60
[ 5644.998957][    C1]  ? tick_sched_timer+0x6f/0x90
[ 5644.998959][    C1]  ? __pfx_tick_sched_timer+0x10/0x10
[ 5644.998960][    C1]  ? __hrtimer_run_queues+0x112/0x2b0
[ 5644.998963][    C1]  ? hrtimer_interrupt+0xf8/0x230
[ 5644.998966][    C1]  ? __sysvec_apic_timer_interrupt+0x61/0x130
[ 5644.998968][    C1]  ? sysvec_apic_timer_interrupt+0x6d/0x90
[ 5644.998970][    C1]  </IRQ>
[ 5644.998971][    C1]  <TASK>
[ 5644.998972][    C1]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 5644.998975][    C1]  ? pv_native_safe_halt+0xf/0x20
[ 5644.998978][    C1]  default_idle+0x9/0x20
[ 5644.998981][    C1]  default_idle_call+0x2c/0xe0
[ 5644.998984][    C1]  do_idle+0x226/0x270
[ 5645.001763][    C1]  cpu_startup_entry+0x1d/0x20
[ 5645.001769][    C1]  start_secondary+0x12e/0x150
[ 5645.012098][    C1]  secondary_startup_64_no_verify+0x10b/0x10b
[ 5645.014647][    C1]  </TASK>
[ 5645.061225][    C0] CPU: 0 PID: 9780 Comm: kworker/0:204 Tainted: G             L      6.4.1-1-default #1 ALP-0.1 (unreleased) c69a840d71ac942b652150cd3cda4b08c7fc98a1
[ 5645.063377][    C0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014
[ 5645.065423][    C0] Workqueue: events netstamp_clear
[ 5645.066725][    C0] Call Trace:
[ 5645.072083][    C0]  <IRQ>
[ 5645.072945][    C0]  dump_stack_lvl+0x47/0x60
[ 5645.079379][    C0]  panic+0x178/0x330
[ 5645.151886][    C0]  ? __pfx_nmi_raise_cpu_backtrace+0x10/0x10
[ 5645.153137][    C0]  watchdog_timer_fn+0x1d2/0x210
[ 5645.155428][    C0]  ? __pfx_watchdog_timer_fn+0x10/0x10
[ 5645.156591][    C0]  __hrtimer_run_queues+0x112/0x2b0
[ 5645.157748][    C0]  hrtimer_interrupt+0xf8/0x230
[ 5645.159110][    C0]  __sysvec_apic_timer_interrupt+0x61/0x130
[ 5645.160287][    C0]  sysvec_apic_timer_interrupt+0x6d/0x90
[ 5645.162132][    C0]  </IRQ>
[ 5645.162865][    C0]  <TASK>
[ 5645.163692][    C0]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 5645.165056][    C0] RIP: 0010:smp_call_function_many_cond+0xf8/0x4a0
[ 5645.166948][    C0] Code: 7b 08 48 63 d0 e8 d8 8c 45 00 3b 05 02 a7 f2 01 73 25 48 63 d0 48 8b 33 48 03 34 d5 00 0b ac 85 8b 56 08 83 e2 01 74 0a f3 90 <8b> 4e 08 83 e1 01 75 f6 83 c0 01 eb c1 48 83 c4 40 5b 5d 41 5c 41
[ 5645.172073][    C0] RSP: 0018:ffffb13186ec7d70 EFLAGS: 00000202
[ 5645.174186][    C0] RAX: 0000000000000002 RBX: ffff9a927bc3b2c0 RCX: 0000000000000001
[ 5645.177912][    C0] RDX: 0000000000000001 RSI: ffff9a927bd40980 RDI: ffff9a9240071bd8
[ 5645.179519][    C0] RBP: 0000000000000000 R08: ffff9a92400716b8 R09: ffff9a927bc3b2c0
[ 5645.181567][    C0] R10: ffff9a92400716b8 R11: 0000000000000000 R12: ffff9a927bc3b2c0
[ 5645.183488][    C0] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000001
[ 5645.185957][    C0]  ? __pfx_do_sync_core+0x10/0x10
[ 5645.187399][    C0]  on_each_cpu_cond_mask+0x24/0x40
[ 5645.188592][    C0]  text_poke_bp_batch+0xb7/0x2f0
[ 5645.190028][    C0]  ? arch_jump_label_transform_queue+0x57/0x80
[ 5645.191284][    C0]  text_poke_finish+0x1f/0x30
[ 5645.192328][    C0]  arch_jump_label_transform_apply+0x1a/0x30
[ 5645.193642][    C0]  jump_label_update+0x9c/0x100
[ 5645.194771][    C0]  static_key_enable_cpuslocked+0x63/0x90
[ 5645.195910][    C0]  static_key_enable+0x1a/0x20
[ 5645.196885][    C0]  process_one_work+0x20a/0x420
[ 5645.197897][    C0]  worker_thread+0x4e/0x3b0
[ 5645.198839][    C0]  ? __pfx_worker_thread+0x10/0x10
[ 5645.199891][    C0]  kthread+0xe8/0x120
[ 5645.200752][    C0]  ? __pfx_kthread+0x10/0x10
[ 5645.201706][    C0]  ret_from_fork+0x2c/0x50
[ 5645.202644][    C0]  </TASK>
[ 5645.203573][    C0] Kernel Offset: 0x3400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 5645.205798][    C0] Rebooting in 90 seconds..
Comment 1 Luis Henriques 2023-07-28 16:15:28 UTC
Looking at the kernel log there's nothing pointing at ext4, or to any filesystem code.  There's a softlockup that seems to be happening in the self-modifying code infrastructure (jump_label), in the network core code, more specifically in the netstamp_clear workqueue.  I've tried to find something similar on lkml but no luck.  Wrong search keywords, I guess.

Goldwyn, can you suggest someone else to bring some knowledge on these areas?

By the way, I see there are a lot of fstests that aren't being executed simply because the VM doesn't include some tools (for example, the 'chacl', from the 'acl' package).  Where can I report these sort of things? (Or try to help fixing them?)
Comment 2 Goldwyn Rodrigues 2023-08-01 12:29:13 UTC
Yes, I assigned it in a hurry. This does not look like a softlockup in ext4 and looks like something else is taking up processing resources. Re-assigning back to kernel-bugs.
Comment 3 Takashi Iwai 2023-08-01 12:39:03 UTC
It needs retests with the later 6.4.x kernel, in anyway...
Comment 4 Goldwyn Rodrigues 2023-08-23 18:40:41 UTC
Requesting retest of the scenario
Comment 5 Yong Sun 2023-08-24 03:43:43 UTC
This test generic/101-200 in ext4 tested in ALP build5.6 didn't reproduce this issue. https://openqa.suse.de/tests/11835577
Consider this a random issue when running xfstests, and not a filesystem issue. I checked all tests in build5.6 and didn't see related fails. https://openqa.suse.de/tests/overview?distri=alp&version=1.0&build=5.6&groupid=519
Comment 6 Goldwyn Rodrigues 2023-08-24 10:54:22 UTC
Closing as WORKSFORME in latest version tested (build5.6)