Bug 1227345

Summary: Kernel 6.9 (MicroOS): a single cpu core is available.
Product: [openSUSE] openSUSE Tumbleweed Reporter: Maxime Thirion <maxime.thirion>
Component: Kernel Assignee: openSUSE Kernel Bugs <kernel-bugs>
Status: RESOLVED INVALID QA Contact: E-mail List <qa-bugs>
Severity: Normal    
Priority: P5 - None CC: jslaby, maxime.thirion, tiwai
Version: Current   
Target Milestone: ---   
Hardware: x86-64   
OS: Other   
Whiteboard:
Found By: --- Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---
Attachments: htop showing only one core

Description Maxime Thirion 2024-07-03 13:18:37 UTC
Created attachment 875851 [details]
htop showing only one core

Hello everyone !

Overview: 

I'm experiencing a really strange problem on two servers, in production, hosted by the Hetzner company.

Since upgrading to kernel 6.9, only one cpu core is available, giving catastrophic performance.

The strange thing is that the two machines are not identical:

The first server has an Intel(R) Xeon(R) CPU E3-1275 v5 with 2x 512 GB NVMe SSD and 64 GB RAM.
The second server has an Intel(R) Core(TM) i7-3770 CPU with 4x 6 TB HDD and 32 GB RAM.

I first contacted the hosting company, who told me they were not aware of the problem. At their suggestion, I rebooted one of the machines into their rescue system, which runs a 6.9.7 kernel (so it is also very recent), and the problem does not occur there: all the CPU cores are active.

Steps to Reproduce : 

Update to kernel 6.9

Actual Results:

Only one CPU core is available, and performance is very poor.
lscpu reports only one core per socket and only one thread per core.

kutta:~ # lscpu
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          39 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   1
  On-line CPU(s) list:    0
Vendor ID:                GenuineIntel
  BIOS Vendor ID:         Intel(R) Corporation
  Model name:             Intel(R) Xeon(R) CPU E3-1275 v5 @ 3.60GHz
    BIOS Model name:      Intel(R) Xeon(R) CPU E3-1275 v5 @ 3.60GHz To Be Filled By O.E.M. CPU @ 3.6GHz
    BIOS CPU family:      179
    CPU family:           6
    Model:                94
    Thread(s) per core:   1
    Core(s) per socket:   1
    Socket(s):            1
    Stepping:             3
    CPU(s) scaling MHz:   36%
    CPU max MHz:          4000.0000
    CPU min MHz:          800.0000
    BogoMIPS:             7202.00
    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb pti ssbd ibrs ibpb stibp tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp vnmi md_clear flush_l1d arch_capabilities
Virtualization features:  
  Virtualization:         VT-x
Caches (sum of all):      
  L1d:                    32 KiB (1 instance)
  L1i:                    32 KiB (1 instance)
  L2:                     256 KiB (1 instance)
  L3:                     8 MiB (1 instance)
NUMA:                     
  NUMA node(s):           1
  NUMA node0 CPU(s):      0
Vulnerabilities:          
  Gather data sampling:   Vulnerable: No microcode
  Itlb multihit:          KVM: Mitigation: VMX disabled
  L1tf:                   Mitigation; PTE Inversion; VMX conditional cache flushes, SMT disabled
  Mds:                    Mitigation; Clear CPU buffers; SMT disabled
  Meltdown:               Mitigation; PTI
  Mmio stale data:        Mitigation; Clear CPU buffers; SMT disabled
  Reg file data sampling: Not affected
  Retbleed:               Mitigation; IBRS
  Spec rstack overflow:   Not affected
  Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:             Mitigation; IBRS; IBPB conditional; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
  Srbds:                  Mitigation; Microcode
  Tsx async abort:        Mitigation; TSX disable

Expected Results:

Correct and optimal operation of the cpu, with all cores activated, as with kernels 6.8 and earlier or with 6.9 supplied by the provider's rescue system.

Additional Information: Any other useful information.

I couldn't find the system in the list, but it is openSUSE MicroOS, so the servers have a read-only file system and they are in production (which limits my options for testing and reinstalling).

I did a rollback on the server with the Xeon that was still in 6.4 and the problem no longer appears.

I no longer have a snapshot with an old kernel on the i7, as I didn't notice the problem immediately.
I can do some "tests" on this server if necessary.

The servers were up to date (before the Xeon rollback).
Comment 1 Takashi Iwai 2024-07-03 13:36:31 UTC
IIRC, there were relevant bugs in the early 6.9.x kernels, and it's been fixed in the recent 6.9.x release (around 6.9.4).
Comment 2 Maxime Thirion 2024-07-03 14:01:15 UTC
(In reply to Takashi Iwai from comment #1)
> IIRC, there were relevant bugs in the early 6.9.x kernels, and it's been
> fixed in the recent 6.9.x release (around 6.9.4).

If I've understood correctly, with the first buggy kernels the cores appeared, but tasks weren't always sent to the other cores?

Here, the cores disappeared from the system with kernel 6.9.
Only dmidecode shows me the 4 cores, but only one is activated.

And I have the same problem with the latest kernel available on MicroOS/Tumbleweed:

cloud:~ # uname -a
Linux cloud 6.9.7-1-default #1 SMP PREEMPT_DYNAMIC Fri Jun 28 05:50:47 UTC 2024 (a5efffa) x86_64 x86_64 x86_64 GNU/Linux

cloud:~ # dmidecode -t processor |grep Core
        Family: Core 2 Duo
        Version: Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
        Core Count: 4
        Core Enabled: 1
Comment 3 Takashi Iwai 2024-07-04 12:53:44 UTC
(In reply to Maxime Thirion from comment #2)
> (In reply to Takashi Iwai from comment #1)
> > IIRC, there were relevant bugs in the early 6.9.x kernels, and it's been
> > fixed in the recent 6.9.x release (around 6.9.4).
> 
> If I've understood correctly, with the first buggy kernels the cores
> appeared, but tasks weren't always sent to the other cores?
> 
> Here, the cores disappeared from the system with kernel 6.9.
> Only dmidecode shows me the 4 cores, but only one is activated.
> 
> And I have the same problem with the latest kernel available on
> MicroOS/Tumbleweed:
> 
> cloud:~ # uname -a
> Linux cloud 6.9.7-1-default #1 SMP PREEMPT_DYNAMIC Fri Jun 28 05:50:47 UTC
> 2024 (a5efffa) x86_64 x86_64 x86_64 GNU/Linux
> 
> cloud:~ # dmidecode -t processor |grep Core
>         Family: Core 2 Duo
>         Version: Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
>         Core Count: 4
>         Core Enabled: 1

Then it's a different bug from what I originally thought.
Please check with the latest 6.10-rc kernel in OBS Kernel:HEAD repo, and if the issue is reproducible, better report to the upstream.
Comment 4 Maxime Thirion 2024-07-04 22:22:36 UTC
Thank you very much for your feedback.

I tried the default kernel and the vanilla kernel, both at 6.9.7.

I also tried the vanilla 6.10-rc6 kernel.

The problem is still there: only one core is available.

On the other hand, for some reason, both my servers have the "noapic" option enabled in the kernel parameters in grub, but I've never touched or added this option.

Honestly, I don't even know exactly what it does.

By removing this option, all cores are initialized and operation is normal.

So, the problem comes from the 6.9 kernel with the noapic option, and the fact that this option is "default" on my installations.

I'm guessing that Hetzner's rescue system doesn't have this option, which explains why everything works with their system.
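
For anyone wanting to check their own machine for the same situation, here is a minimal sketch. It assumes a Linux system with a readable /proc/cmdline and a `getconf` that supports `_NPROCESSORS_ONLN`; the function name `has_kernel_param` is made up for this example and does not come from any tool mentioned in this report.

```shell
#!/bin/sh
# Hypothetical helper: succeed if parameter $1 appears as a whole word
# in the kernel command line string $2 (so "noapic" does not match
# a longer word that merely starts with it).
has_kernel_param() {
    case " $2 " in
        *" $1 "*) return 0 ;;
    esac
    return 1
}

# Command line of the running kernel; empty if /proc/cmdline is unreadable.
cmdline=$(cat /proc/cmdline 2>/dev/null || true)

if has_kernel_param noapic "$cmdline"; then
    echo "noapic is set on the kernel command line"
else
    echo "noapic is not set on the kernel command line"
fi

# Number of CPUs currently online, as seen by the C library.
echo "online CPUs: $(getconf _NPROCESSORS_ONLN)"
```

If the first line reports that noapic is set and the CPU count is 1 on a multi-core machine, the symptom matches what is described in this comment.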
Comment 5 Takashi Iwai 2024-07-08 11:16:18 UTC
If it happens with the latest 6.10-rc code, it should be reported to the upstream.

Just to be sure, when you boot with 6.8.x kernel, the problem doesn't appear?  You can take the old 6.8.x kernel from my OBS kernel repo, for example,
  http://download.opensuse.org/repositories/home:/tiwai:/kernel:/6.8/standard/
Comment 6 Jiri Slaby 2024-07-09 08:23:12 UTC
(In reply to Maxime Thirion from comment #4)
> for some reason, both my servers have the "noapic" option
> enabled in the kernel parameters in grub, but I've never touched or added
> this option.

That option should never be used in production.

> Honestly, I don't even know exactly what it does.

W/o APIC, you cannot use the other cores, of course.

> By removing this option, all cores are initialized and operation is normal.

Obviously ;).

> So, the problem comes from the 6.9 kernel with the noapic option, and the
> fact that this option is "default" on my installations.
> 
> I'm guessing that Hetzner's rescue system doesn't have this option, which
> explains why everything works with their system.

It remains to investigate how that option got there. So you have it in /etc/default/grub?
Comment 7 Maxime Thirion 2024-07-09 08:52:21 UTC
Before kernel 6.9, all processor cores were active despite the NOAPIC option.

On the server with the i7, updates are applied once a week with an automatic reboot. The server has therefore run every kernel released since it was installed, and the problem only started with 6.9.

The two machines were installed in early 2022.

As I haven't touched the Xeon yet, here are a few parameters:

kutta:~ # uname -a
Linux kutta.r-virtuel.net 6.4.6-1-default #1 SMP PREEMPT_DYNAMIC Tue Jul 25 04:42:30 UTC 2023 (55520bc) x86_64 x86_64 x86_64 GNU/Linux

kutta:~ # lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         39 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  8
  On-line CPU(s) list:   0-7
Vendor ID:               GenuineIntel
  BIOS Vendor ID:        Intel(R) Corporation
  Model name:            Intel(R) Xeon(R) CPU E3-1275 v5 @ 3.60GHz
    BIOS Model name:     Intel(R) Xeon(R) CPU E3-1275 v5 @ 3.60GHz To Be Filled By O.E.M. CPU @ 3.6GHz
    BIOS CPU family:     179
    CPU family:          6
    Model:               94
    Thread(s) per core:  2
    Core(s) per socket:  4
    Socket(s):           1
    Stepping:            3
    CPU(s) scaling MHz:  92%
    CPU max MHz:         4000.0000
    CPU min MHz:         800.0000
    BogoMIPS:            7202.00
[...]

kutta:~ # cat /proc/cmdline 
BOOT_IMAGE=/boot/vmlinuz-6.4.6-1-default root=UUID=8a392962-8a90-4306-8d47-919298279afd noapic quiet security=apparmor mitigations=auto

kutta:~ # cat /etc/default/grub
# If you change this file, run 'grub2-mkconfig -o /boot/grub2/grub.cfg' afterwards to update
# /boot/grub2/grub.cfg.

# Uncomment to set your own custom distributor. If you leave it unset or empty, the default
# policy is to determine the value from /etc/os-release
GRUB_DISTRIBUTOR=
GRUB_DEFAULT=saved
GRUB_HIDDEN_TIMEOUT=0
GRUB_HIDDEN_TIMEOUT_QUIET=true
GRUB_TIMEOUT=0
GRUB_CMDLINE_LINUX_DEFAULT="noapic quiet security=apparmor mitigations=auto"
GRUB_CMDLINE_LINUX=""

# Uncomment to automatically save last booted menu entry in GRUB2 environment

# variable `saved_entry'
# GRUB_SAVEDEFAULT="true"
#Uncomment to enable BadRAM filtering, modify to suit your needs

# This works with Linux (no patch required) and with any kernel that obtains
# the memory map information from GRUB (GNU Mach, kernel of FreeBSD ...)
# GRUB_BADRAM="0x01234567,0xfefefefe,0x89abcdef,0xefefefef"
#Uncomment to disable graphical terminal (grub-pc only)

GRUB_TERMINAL="gfxterm"
# The resolution used on graphical terminal
#note that you can use only modes which your graphic card supports via VBE

# you can see them in real GRUB with the command `vbeinfo'
GRUB_GFXMODE="auto"
# Uncomment if you don't want GRUB to pass "root=UUID=xxx" parameter to Linux
# GRUB_DISABLE_LINUX_UUID=true
#Uncomment to disable generation of recovery mode menu entries

# GRUB_DISABLE_RECOVERY="true"
#Uncomment to get a beep at grub start

# GRUB_INIT_TUNE="480 440 1"
GRUB_BACKGROUND=
GRUB_THEME=/boot/grub2/themes/openSUSE/theme.txt
SUSE_BTRFS_SNAPSHOT_BOOTING="true"
GRUB_DISABLE_OS_PROBER="false"
GRUB_ENABLE_CRYPTODISK="n"
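
As a rough sketch of the fix discussed in this thread, the filter below removes the standalone word noapic from the GRUB_CMDLINE_LINUX_DEFAULT line and leaves every other line untouched. The function name `strip_noapic` is invented for this example, and the sketch assumes a `sed` that supports `-E` (GNU and BSD sed both do). It reads from stdin and writes to stdout, so the result can be reviewed before anything is replaced.

```shell
#!/bin/sh
# Hypothetical filter: drop the whole word "noapic" from the
# GRUB_CMDLINE_LINUX_DEFAULT="..." line and tidy up leftover spaces.
strip_noapic() {
    sed -E \
        -e '/^GRUB_CMDLINE_LINUX_DEFAULT=/s/([" ])noapic([" ])/\1\2/' \
        -e '/^GRUB_CMDLINE_LINUX_DEFAULT=/s/=" +/="/' \
        -e '/^GRUB_CMDLINE_LINUX_DEFAULT=/s/ +"$/"/' \
        -e '/^GRUB_CMDLINE_LINUX_DEFAULT=/s/  +/ /g'
}

# Demo on the line shown above.
printf '%s\n' 'GRUB_CMDLINE_LINUX_DEFAULT="noapic quiet security=apparmor mitigations=auto"' | strip_noapic
# prints: GRUB_CMDLINE_LINUX_DEFAULT="quiet security=apparmor mitigations=auto"
```

On a real system one would run something like `strip_noapic < /etc/default/grub > /tmp/grub.new`, compare the two files, and only then replace the original. As the header comment of the file says, grub2-mkconfig has to be run afterwards. On a read-only system such as MicroOS that step presumably has to happen inside a transactional update; that is an assumption, not something verified in this report.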
Comment 8 Maxime Thirion 2024-07-09 08:57:21 UTC
I think I've found a bug in the openSUSE installer that adds the NOAPIC option under a certain condition.

I use openSUSE a lot, as a desktop, as a server, with and without microOS variants.

I checked my other machines and the NOAPIC option is not present in the grub.

But the two servers mentioned in this message had a less "standard" installation, as Hetzner doesn't support openSUSE by default.
I therefore start from an ubuntu installation, on which I download vmlinuz.install and initrd.install and then connect to the server via ssh -X root@XXX to start the installation.

Apparently, with this type of "remote" installation, the NOAPIC option is added automatically.

Here I'm looking at a third server which has been installed in the same way, which is not yet in production, which is based on Leap Micro and the option is also present ...

kiba:~ # cat /proc/cmdline 
BOOT_IMAGE=/boot/vmlinuz-5.14.21-150500.55.68-default root=UUID=c887e740-95e0-4968-8bb6-f5467ea422d6 rd.timeout=60 noapic swapaccount=1 quiet rd.shell=0 security=selinux selinux=1 enforcing=1 mitigations=auto

kiba:~ # cat /etc/os-release 
NAME="openSUSE Leap Micro"
VERSION="5.5"
ID="opensuse-leap-micro"
ID_LIKE="suse opensuse opensuse-leap suse-microos"
VERSION_ID="5.5"
PRETTY_NAME="openSUSE Leap Micro 5.5"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:opensuse:leap-micro:5.5"
BUG_REPORT_URL="https://bugs.opensuse.org"
HOME_URL="https://www.opensuse.org/"
DOCUMENTATION_URL="https://en.opensuse.org/Portal:LeapMicro"
LOGO="distributor-logo-LeapMicro"
Comment 9 Jiri Slaby 2024-07-09 10:44:15 UTC
I wonder, is it inherited from the ubuntu command line?

Can you describe exact steps you do to install?
Comment 10 Maxime Thirion 2024-07-09 13:52:42 UTC
Ubuntu is just used to launch the remote installation of openSUSE.

To install, I proceed as follows:

cd /boot

wget --output-document=vmlinuz.install http://download.opensuse.org/tumbleweed/repo/oss/boot/x86_64/loader/linux

wget --output-document=initrd.install http://download.opensuse.org/tumbleweed/repo/oss/boot/x86_64/loader/initrd

I retrieve the disk UUID via lsblk:
lsblk -f /dev/sdb2
NAME    FSTYPE            LABEL            UUID                                 MOUNTPOINT
sdb2    linux_raid_member 62-210-136-200:0 f10e1a06-0cb2-1aeb-e92f-937476d3ea65
└─md0   ext4                               5b69725e-6914-463e-bf26-6b02a222f59c /boot

I modify on ubuntu the 40_custom to add a section that will allow me to start the installation :

menuentry "openSUSE Tumbleweed" {
   set root='mduuid/f10e1a060cb21aebe92f937476d3ea65'
   linux /vmlinuz.install noapic usessh=1 sshpassword="12345678" install=http://download.opensuse.org/tumbleweed/repo/oss/ hostip=XXX netmask=XXX gateway=XXX nameserver=XXX
}

Indeed, in my notes, based on the tutorial I had taken over, the noapic option is used.

I then edit the /etc/default/grub file to set `GRUB_DEFAULT="openSUSE Tumbleweed"`.

update-grub2

Reboot

ssh -X root@IP_SERVEUR

and then run the installation (yast.sh I think).

Then the openSUSE installation takes over, deletes ubuntu, reformats the disks as I wish and the server works fine afterwards.

So the installer thinks it's doing the right thing by carrying the option I used to start the installation over into the default boot parameters of the installed system.

In the end, it's really well done, the bug is me following the tutorial without cleaning up this piece of code.

However, the behavior of the NOAPIC option has changed in 6.9, and would otherwise have gone unnoticed.
Comment 11 Jiri Slaby 2024-07-10 06:44:14 UTC
(In reply to Maxime Thirion from comment #10)
> To install, I proceed as follows:

Now I see.

> Indeed, in my notes, based on the tutorial I had taken over, the noapic
> option is used.

Well...

> However, the behavior of the NOAPIC option has changed in 6.9, and would
> otherwise have gone unnoticed.

Right, this is intentional, see:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7c0edad3643f4493c4dafa6f5dfcfb1a86432156

So this is a bug in notes.