Bug 1218925 - duplicate nvme-uuid and ID_WWN for 2 SSD (dual boot Windows 11 / openSUSE)
Summary: duplicate nvme-uuid and ID_WWN for 2 SSD (dual boot Windows 11 / openSUSE)
Status: IN_PROGRESS
Alias: None
Product: openSUSE Distribution
Classification: openSUSE
Component: Basesystem
Version: Leap 15.5
Hardware: x86-64 openSUSE Leap 15.5
Priority: P5 - None
Severity: Critical
Target Milestone: ---
Assignee: E-mail List
QA Contact: E-mail List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-01-17 18:54 UTC by Otmar Mak
Modified: 2024-07-17 06:30 UTC
CC List: 5 users

See Also:
Found By: Customer
Services Priority:
Business Priority:
Blocker: No
Marketing QA Status: ---
IT Deployment: ---
chcao: needinfo? (o_mak_qafb)


Attachments
list of UUIDs and devices (3.09 KB, text/plain)
2024-01-17 18:54 UTC, Otmar Mak
nvme id-ctr (4.36 KB, text/plain)
2024-01-18 10:46 UTC, Otmar Mak
nvme id-ns (1.15 KB, text/plain)
2024-01-18 10:48 UTC, Otmar Mak
hwinfo kernel (4.79 KB, text/plain)
2024-01-18 10:49 UTC, Otmar Mak

Description Otmar Mak 2024-01-17 18:54:02 UTC
Created attachment 871960 [details]
list of UUIDs and devices

Leap 15.5 apparently assigned identical nvme-uuids and ID_WWNs to two SSDs with same specifications but different serial numbers and different partitions.

Booting openSUSE produces warnings about 'duplicate nsid' and potential data corruption. Windows 11 can no longer be booted (stop code: registry error),
but openSUSE is still able to start, mount the Windows partitions, and read files
from the Windows system. But the bootloader can no longer update /boot/efi.

However, after login the Plasma desktop displays a warning that the first disk (nvme0n1) is likely to fail soon. The EFI partition is on this disk (nvme0n1p1).

What's the best way to fix this critical situation?
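
To illustrate the kind of duplicate reported above, here is a minimal Python sketch that looks for identifiers shared by more than one NVMe namespace. It assumes the namespace identifiers are exposed as sysfs attributes named wwid, uuid, nguid and eui under /sys/block/<namespace>/; which of these are present depends on the kernel and the device, so missing attributes are skipped. The function name and its parameters are hypothetical and not part of any tool mentioned in this report.

```python
from collections import defaultdict
from pathlib import Path


def find_duplicate_nvme_ids(sys_block="/sys/block",
                            attrs=("wwid", "uuid", "nguid", "eui")):
    """Return {(attr, value): [namespaces]} for IDs shared by several namespaces.

    Assumption: each NVMe namespace appears as <sys_block>/nvmeXnY and exposes
    its identifiers as small text files named after the attribute.
    """
    seen = defaultdict(list)
    for dev in sorted(Path(sys_block).glob("nvme*n*")):
        for attr in attrs:
            path = dev / attr
            if not path.is_file():
                continue  # attribute not exposed for this namespace
            value = path.read_text().strip()
            if value:
                seen[(attr, value)].append(dev.name)
    # Keep only identifiers that more than one namespace reports.
    return {key: devs for key, devs in seen.items() if len(devs) > 1}
```

Calling find_duplicate_nvme_ids() with the default path and printing the result would list each shared identifier together with the namespaces that report it; an empty dictionary means no duplicates were found among the attributes checked.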
Comment 1 Otmar Mak 2024-01-17 19:08:45 UTC
correction: the Bootloader can no longer be updated.
Comment 2 Daniel Wagner 2024-01-18 08:48:04 UTC
The SSD firmware is reporting identical IDs (that's a bug) and the Linux
nvme subsystem was enforcing that these IDs are unique. This check
was weakened later on for consumer disks. The SP5 kernel should already
ship the relaxed checks, but apparently they don't work for you.
Are you on the latest SP5 kernel?

Can you post the output of 'nvme id-ctrl' for both disks and
'nvme id-ns' for the namespaces? Thanks.
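
A hedged sketch of how the requested output could be collected for both disks in one go. It assumes nvme-cli is installed, that the commands are run with sufficient privileges, and that namespaces follow the usual nvmeXnY naming with the controller at /dev/nvmeX. The helper names nvme_id_commands and collect_nvme_ids are hypothetical.

```python
import re
import subprocess


def nvme_id_commands(namespace):
    """Build the 'nvme id-ctrl' and 'nvme id-ns' command lines for one namespace.

    Assumption: 'namespace' is a block device name such as 'nvme0n1', whose
    controller character device is /dev/nvme0.
    """
    match = re.fullmatch(r"(nvme\d+)n\d+", namespace)
    if match is None:
        raise ValueError(f"not an NVMe namespace name: {namespace!r}")
    return [
        ["nvme", "id-ctrl", f"/dev/{match.group(1)}"],
        ["nvme", "id-ns", f"/dev/{namespace}"],
    ]


def collect_nvme_ids(namespaces):
    """Run the commands and return {command line: captured stdout}."""
    outputs = {}
    for namespace in namespaces:
        for argv in nvme_id_commands(namespace):
            result = subprocess.run(argv, capture_output=True, text=True, check=True)
            outputs[" ".join(argv)] = result.stdout
    return outputs
```

For the two disks in this report, collect_nvme_ids(["nvme0n1", "nvme1n1"]) would gather four outputs that can be saved and attached.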
Comment 3 Otmar Mak 2024-01-18 10:46:47 UTC
Created attachment 871974 [details]
nvme id-ctr
Comment 4 Otmar Mak 2024-01-18 10:48:04 UTC
Created attachment 871975 [details]
nvme id-ns
Comment 5 Otmar Mak 2024-01-18 10:49:03 UTC
Created attachment 871976 [details]
hwinfo kernel
Comment 6 Otmar Mak 2024-01-18 10:51:24 UTC
thanks for response.

additional information:

Operating System: openSUSE Leap 15.5
KDE Plasma Version: 5.27.9
KDE Frameworks Version: 5.103.0
Qt Version: 5.15.8
Kernel Version: 5.14.21-150500.55.39-default (64-bit)
Comment 7 Otmar Mak 2024-01-18 14:38:57 UTC
Priority NONE means there won't be any indications in near future on how to resolve the issue ?

Maybe only replacement of SSD as an option ?

Comments appreciated.
Comment 8 Takashi Iwai 2024-01-18 15:19:23 UTC
(In reply to Otmar Mak from comment #7)
> Priority NONE means there won't be any indications in near future on how to
> resolve the issue ?

The priority field of Bugzilla is only for the developers' side, not something the reporter should change at all.
Comment 9 Otmar Mak 2024-01-18 15:27:20 UTC
Comment on priority ZERO:

Already the Leap 15.5 installer (bootable USB ISO) did not recognize
both SSDs and could not be used to do an upgrade of Leap 15.4 on 2nd SSD.

Instead, the upgrade was then done with zypper in Leap 15.4

Testing a recent FEDORA installer, it recognized both SSDs, but detected that both disks had the same UUID.

Even after replacing the corrupted SSD (which is now read-only also for 
yast2 partitioner and other disk tools) it seems risky that reinstallation of
Leap 15.5 and Windows 11 will produce the same issue again.
Comment 10 Otmar Mak 2024-01-18 15:30:25 UTC
(In reply to Takashi Iwai from comment #8)
> (In reply to Otmar Mak from comment #7)
> > Priority NONE means there won't be any indications in near future on how to
> > resolve the issue ?
> 
> The priority field of Bugzilla is only for the developers' side, not something
> the reporter should change at all.

sure
Comment 11 Daniel Wagner 2024-01-18 17:40:37 UTC
> Maybe only replacement of SSD as an option ?

No need to discard the hardware, we fix this in the kernel. But till
then you only have the option to use an older kernel.
Comment 12 Otmar Mak 2024-01-18 18:19:14 UTC
(In reply to Daniel Wagner from comment #11)
> > Maybe only replacement of SSD as an option ?
> 
> No need to discard the hardware, we fix this in the kernel. But till
> then you only have the option to use an older kernel.

Thanks.
But will an older kernel both create different UUIDs for the disks AND make the obviously corrupted nvme0n1 SSD writeable again ? 

Now it is read-only for disk tools (windows and linux) also from bootable external media. This means Windows 11 is dead and cannot be repaired, and re-partitioning is not possible.
Comment 13 Daniel Wagner 2024-01-19 07:48:02 UTC
The nvme subsystem doesn't create new UUIDs; it will just ignore the missing
UUIDs. This makes the disks accessible again. Sadly, this doesn't magically fix any data corruption.
Comment 14 Daniel Wagner 2024-01-25 15:40:11 UTC
It turns out kernel 5.14.21-150500.55.39-default ships all the patches
which allow duplicate nsids for PCI devices which do not
set the cmic or nmic bits (these are the patches I was referring to
in comment #2).

The id-ctrl and id-ns output shows that these fields are not set. As
all these conditions are met, the nsid quirk is enabled automatically.

This matches the statement from the initial comment that the kernel
says 'duplicate nsid' and warns of 'possible data corruption' (if
you see these messages, the nsid quirk is activated).

That means there is no need to set the nsid quirk by hand. 
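
The cmic/nmic condition described above can be checked from saved command output. The following is a minimal sketch, assuming the human-readable 'name : value' layout that nvme-cli prints for id-ctrl and id-ns, with values written in decimal or with a 0x prefix. The function names are hypothetical, and the check only mirrors the condition stated in this comment, not the kernel's actual quirk logic.

```python
def parse_nvme_fields(text):
    """Parse 'name : value' lines into a dict (assumed nvme-cli text layout)."""
    fields = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            fields[name.strip()] = value.strip()
    return fields


def cmic_and_nmic_unset(id_ctrl_text, id_ns_text):
    """True when neither the controller's cmic nor the namespace's nmic has a bit set.

    Raises KeyError if a field is missing, so an incomplete capture is not
    silently treated as 'unset'.
    """
    cmic = int(parse_nvme_fields(id_ctrl_text)["cmic"], 0)
    nmic = int(parse_nvme_fields(id_ns_text)["nmic"], 0)
    return cmic == 0 and nmic == 0
```

Passing the text of the id-ctrl and id-ns attachments for one disk to cmic_and_nmic_unset would show whether that disk falls under the condition described here.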

BUT as the warning says, this quirk can lead to data corruption. The
issue is that the /dev/disk/by-id  symlinks are not stable.

How do you address the disks in /etc/fstab?
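
To make the question concrete, here is a small sketch that classifies how each /etc/fstab entry identifies its filesystem. It assumes the standard whitespace-separated fstab layout with the device specification in the first field; the category names ("tag", "by-id", "device-path", "other") and both function names are made up for this illustration.

```python
def classify_fstab_source(spec):
    """Classify the first fstab field by how it identifies the filesystem."""
    if spec.startswith(("UUID=", "PARTUUID=", "LABEL=", "PARTLABEL=")):
        return "tag"
    if spec.startswith(("/dev/disk/by-uuid/", "/dev/disk/by-partuuid/")):
        return "tag"
    if spec.startswith("/dev/disk/by-id/"):
        return "by-id"  # derived from the device IDs that are duplicated here
    if spec.startswith("/dev/"):
        return "device-path"  # e.g. /dev/nvme0n1p1, depends on enumeration order
    return "other"


def audit_fstab(text):
    """Return (spec, mount point, kind) for every non-comment fstab entry."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 2:
            continue  # malformed line, ignore
        entries.append((fields[0], fields[1], classify_fstab_source(fields[0])))
    return entries
```

Running audit_fstab on the contents of /etc/fstab and looking for "by-id" or "device-path" entries would point at mounts that do not rely on filesystem or partition UUIDs.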
Comment 15 Otmar Mak 2024-02-05 09:19:04 UTC
in /etc/fstab the partitions or subvolumes are currently identified by their UUIDs.
 
The defective SSD however has been replaced by a new SSD in the meantime,
which was formatted by  nvme format and partitioned with a linux program.

It is not known whether the /etc/fstab was using different types of addresses before the SSD replacement.
Comment 16 Daniel Wagner 2024-02-16 14:44:49 UTC
So you say, one drive broke and the kernel then corrupted the good disk?

Right, fstab will use the partition UUIDs, so as far as I understand this
should be stable. I am a bit at a loss as to what happened in your case. I expect
you still see the 'duplicate id' message also with Fedora?
Comment 17 Otmar Mak 2024-02-17 09:17:26 UTC
(In reply to Daniel Wagner from comment #16)
> So you say, one drive broke and the kernel then corrupted the good disk?
> 
> Right, fstab will use the partition UUIDs, so as far as I understand this
> should be stable. I am a bit at a loss as to what happened in your case. I expect
> you still see the 'duplicate id' message also with Fedora?

NO.

let's recap the history:

* the computer was delivered with 2 SSDs, Windows 10 pre-installed
  on the first SSD
* opensuse 15.4 was installed on the second SSD, with EFI partition
  used on first SSD. 
* Windows 11 was installed on first SSD, replacing Windows 10.
  OpenSUSE bootloader was used for booting into Windows or Linux.
  No problems so far.
* Upgrade to openSUSE 15.5:
  the installation iso image on USB did not recognize both SSDs
  and could not be used to upgrade opensuse.
  Instead, openSUSE 15.4 was then upgraded to 15.5 using zypper.

* soon after that: warnings about the first SSD being in danger of
  data corruption, as described earlier.

* some weeks later: Windows 11 could no longer be booted, and the 
  first SSD was now strictly read-only.

* experimenting with repair options:
  trying to install Fedora on first SSD failed: error message about
  duplicate UUIDs for SSDs.
  Other repair options also failed (nvme commands, gdisk, windows commands)

* first corrupted SSD was replaced by a new SSD of different brand.
  Using Fedora installer again, the new SSD was formatted by nvme format,
  UUID was assigned by gdisk. The new SSD was then partitioned with gdisk.
  Fedora was installed for test purposes, but their bootloader did not
  identify opensuse 15.5.

* Fedora was uninstalled and Windows 11 re-installed.

* opensuse 15.5. remained functional throughout the process,
  after re-installation of Windows11 the bootloader was updated
  in order to include Windows 11 in the boot menu.

Issues seem to be fixed now.
There is no longer any "duplicate UUID" warning.
The EFI partition on the first SSD is currently no longer in /etc/fstab.
Comment 18 Daniel Wagner 2024-02-19 06:58:14 UTC
Thanks for the recap. Indeed, it sounds more like the first disk died and
the warning wasn't actually happening. Shall we close the bug?
Comment 19 Otmar Mak 2024-02-19 08:54:28 UTC
(In reply to Daniel Wagner from comment #18)
> Thanks for the recap. Indeed, it sounds more like the first disk died and
> the warning wasn't actually happening. Shall we close the bug?

My presumption has been different:

Because of your earlier comments, I assumed that if Windows is the first operating system that is installed on an SSD, the SSD is not assigned a UUID, which apparently created the problems.

Therefore, as described in the above history, I made sure to assign a UUID to the new SSD with a Linux command first, and to partition it with a Linux program,
before re-installing Windows 11.

To me it seemed that opensuse 15.5. installer had an issue, because it could not recognize both SSDs.

I leave it to you how you interpret the cause of problems.
Comment 20 Daniel Wagner 2024-02-26 09:28:04 UTC
As far as I can tell, the kernel had all the needed patches to operate correctly.
Sorry, the installer is not really my turf, can't really help there.
Comment 21 Otmar Mak 2024-02-26 12:49:34 UTC
(In reply to Daniel Wagner from comment #20)
> As far as I can tell, the kernel had all the needed patches to operate
> correctly.
> Sorry, the installer is not really my turf, can't really help there.

So even if the kernel had the required patches, these still may not have been safe enough (?) according to your comment #14, because they did not prevent data corruption, as was the warning during boot.

In my case maybe the EFI partition got corrupted, because this was the only partition of the first SSD (with Windows) that got mounted during boot, and was included in /etc/fstab.

As described in the history above, the issue seems to be resolved now after
replacement of the hardware and activating the new SSD with Linux first before
re-installing Windows.
Comment 22 Daniel Wagner 2024-03-21 12:51:24 UTC
(In reply to Otmar Mak from comment #21)
> So even if the kernel had the required patches, these still may not have
> been safe enough (?) according to your comment #14, because they did not
> prevent data corruption, as was the warning during boot.

Indeed, this is obviously bad, but I don't think it's a problem caused
by the kernel.

The kernel is issuing a warning. It is just a warning and not an
error because there are plenty of consumer devices out there
which report non-unique IDs, and people were really upset when
their systems stopped working after the nvme subsystem enforced the
uniqueness of the IDs.

This means the devices will always show up and user land has to
decide what to do with them. As long as the partitions have proper UUIDs,
all should work, assuming /etc/fstab uses the UUIDs from
the OS PoV. I don't really know how this works with Windows, but
I am pretty sure Fedora is also using UUIDs.

Anyway, what I am trying to say is that the data corruption you got was not due to the
kernel doing funky stuff. Something went wrong in user land.

Let's try to ask someone with more know-how on the boot process.
Comment 23 Otmar Mak 2024-03-23 09:19:40 UTC
> ....
> Anyway, what I am trying to say is that the data corruption you got was not due to the
> kernel doing funky stuff. Something went wrong in user land.
> 
> Let's try to ask someone with more know how on the boot process.

From the user land perspective, there were no hints whatsoever what a user could do in response to the warnings in order to prevent data corruption.

One speculation I had was that maybe the mount command was no longer safe, when mounting Windows partitions r/w using /dev/nvme....

One more observation: Windows 11 stopped booting after a Windows update had been started and, following the reboot, a Linux update ran instead of
the Windows update being completed. Maybe that somehow broke the system.

But even then there remain open questions from the user perspective:

(1) Why did openSUSE installer for Leap 15.5. not recognize both SSDs,
    whereas Fedora installer did, although with a duplicate UUID warning ?
(2) what kind of settings or precautions should a user take, if such warnings
    occur ?

The only idea I had was to make sure that Linux is installed FIRST on both SSDs,
and only afterwards Windows, in that order. I have not had any problems since.
Comment 24 Chenzi Cao 2024-07-16 15:02:10 UTC
Hi Otmar, is the issue still reproducible? Or do you think it needs to be assigned to the installer's maintainer to take a look?
Comment 25 Otmar Mak 2024-07-17 06:30:30 UTC
(In reply to Chenzi Cao from comment #24)
> Hi Otmar, is the issue still reproducible? Or do you think it needs to be
> assigned to the installer's maintainer to take a look?

Since I have migrated to the latest OpenSUSE Leap Version, the issue is no longer reproducible on my computer. Trying to reproduce would also mean risking damage to an SSD and data loss. I have not observed any similar issue since opensuse Leap upgrade and more Microsoft Windows 11 updates.