Bugzilla – Bug 1218925
duplicate nvme-uuid and ID_WWN for 2 SSD (dual boot Windows 11 / openSUSE)
Last modified: 2024-07-17 06:30:30 UTC
Created attachment 871960 [details]
list of UUIDs and devices

Leap 15.5 apparently assigned identical nvme-uuids and ID_WWNs to two SSDs with the same specifications but different serial numbers and different partitions. Booting openSUSE produces warnings about 'duplicate nsid' and potential data corruption. Windows 11 can no longer be booted (stop code: registry error), but openSUSE is still able to start, mount the Windows partitions, and read files from the Windows system. However, the bootloader can no longer update /boot/efi. In addition, after login the Plasma desktop displays a warning that the first disk (nvme0n1) will likely fail soon. The EFI partition is on this disk (nvme0n1p1). What's the best way to fix this critical situation?
correction: the Bootloader can no longer be updated.
The SSD firmware is reporting identical IDs (that's a bug), and the Linux nvme subsystem was enforcing that these IDs are unique. This check was weakened later on for consumer disks. The SP5 kernel should already ship the relaxed checks, but apparently it doesn't work for you. Are you on the latest SP5 kernel? Can you post the output of 'nvme id-ctrl' for both disks and 'nvme id-ns' for the namespaces? Thanks.
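For readers who want to see what "identical IDs" means in practice, here is a minimal sketch of a duplicate check. It is not the kernel's logic. The helper names `find_duplicate_ids` and `read_sysfs_ids` are hypothetical, and reading the identifier from `/sys/block/<namespace>/wwid` is an assumption about the sysfs layout on the running system.

```python
from collections import defaultdict
from pathlib import Path


def find_duplicate_ids(ids_by_device):
    """Return {identifier: sorted device names} for every identifier
    that is reported by more than one device."""
    devices_by_id = defaultdict(list)
    for device, identifier in ids_by_device.items():
        if identifier:  # ignore devices that report no identifier at all
            devices_by_id[identifier].append(device)
    return {ident: sorted(devs)
            for ident, devs in devices_by_id.items()
            if len(devs) > 1}


def read_sysfs_ids(attribute="wwid", sys_block=Path("/sys/block")):
    """Collect one identifier per NVMe namespace.

    Assumption: each namespace directory under sys_block exposes the
    identifier as a plain-text file named `attribute`.
    """
    ids = {}
    for entry in sorted(sys_block.glob("nvme*n*")):
        attr_file = entry / attribute
        if attr_file.is_file():
            ids[entry.name] = attr_file.read_text().strip()
    return ids
```

Calling `find_duplicate_ids(read_sysfs_ids())` on the affected machine would be expected to list both namespaces under one identifier. An empty result would suggest the identifiers differ, at least for the attribute that was read.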
Created attachment 871974 [details] nvme id-ctr
Created attachment 871975 [details] nvme id-ns
Created attachment 871976 [details] hwinfo kernel
Thanks for the response. Additional information:
Operating System: openSUSE Leap 15.5
KDE Plasma Version: 5.27.9
KDE Frameworks Version: 5.103.0
Qt Version: 5.15.8
Kernel Version: 5.14.21-150500.55.39-default (64-bit)
Priority NONE means there won't be any indications in near future on how to resolve the issue ? Maybe only replacement of SSD as an option ? Comments appreciated.
(In reply to Otmar Mak from comment #7)
> Priority NONE means there won't be any indications in near future on how to
> resolve the issue ?

The priority field of Bugzilla is only for the developers' side, not something the reporter should change at all.
Comment on priority ZERO: The Leap 15.5 installer (bootable USB ISO) already did not recognize both SSDs and could not be used to upgrade Leap 15.4 on the 2nd SSD. Instead, the upgrade was done with zypper from within Leap 15.4. A recent Fedora installer, tested afterwards, recognized both SSDs but detected that both disks had the same UUID. Even after replacing the corrupted SSD (which is now read-only, also for the yast2 partitioner and other disk tools), it seems risky that a reinstallation of Leap 15.5 and Windows 11 will produce the same issue again.
(In reply to Takashi Iwai from comment #8)
> (In reply to Otmar Mak from comment #7)
> > Priority NONE means there won't be any indications in near future on how to
> > resolve the issue ?
>
> The priority field of Bugzilla is only for developer's side, not something
> the reporter should change at all.

sure
> Maybe only replacement of SSD as an option ?

No need to discard the hardware, we fix this in the kernel. But till then you only have the option to use an older kernel.
(In reply to Daniel Wagner from comment #11)
> > Maybe only replacement of SSD as an option ?
>
> No need to discard the hardware, we fix this in the kernel. But till
> then you only have the option to use an older kernel.

Thanks. But will an older kernel both create different UUIDs for the disks AND make the obviously corrupted nvme0n1 SSD writable again? It is now read-only for disk tools (Windows and Linux), even from bootable external media. This means Windows 11 is dead and cannot be repaired, and re-partitioning is not possible.
The nvme subsystem doesn't create new UUIDs; it will just ignore the missing UUIDs. This makes the disks accessible again. Sadly, this doesn't magically fix any data corruption.
It turns out kernel 5.14.21-150500.55.39-default ships all the patches which allow duplicate nsids for PCI devices which do not set the cmic or nmic bits (these are the patches I was referring to in comment #2). The id-ctrl and id-ns output shows that these fields are not set. As all these conditions are met, the nsid quirk is enabled automatically. This matches the statement from the initial comment that the kernel says 'duplicate nsid' and warns about 'possible data corruption' (if you see this message, the nsid quirk is activated). That means there is no need to set the nsid quirk by hand.

BUT as the warning says, this quirk can lead to data corruption. The issue is that the /dev/disk/by-id symlinks are not stable. How do you address the disks in /etc/fstab?
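The fstab question can be checked mechanically. The following is a rough sketch that reports which entries address their device through a /dev/disk/by-id symlink or a raw kernel name instead of a filesystem tag such as UUID=. The function names and the sample entries are made up for illustration. Treating kernel names such as /dev/nvme0n1p3 as risky is an assumption of this sketch, not something stated in this report.

```python
def classify_fstab_source(source):
    """Classify how a single fstab entry names its device."""
    for tag in ("UUID=", "PARTUUID=", "LABEL=", "PARTLABEL="):
        if source.startswith(tag):
            return tag.rstrip("=")
    if source.startswith("/dev/disk/by-id/"):
        return "by-id"
    if source.startswith("/dev/disk/"):
        return "by-other-link"
    if source.startswith("/dev/"):
        return "kernel-name"
    return "other"  # e.g. tmpfs, network shares


def fstab_sources(text):
    """Yield (source, mount_point) for each non-comment fstab line."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) >= 2:
            yield fields[0], fields[1]


def risky_entries(text):
    """Return entries whose device naming may change when IDs collide."""
    risky = {"by-id", "kernel-name"}
    return [(src, mnt) for src, mnt in fstab_sources(text)
            if classify_fstab_source(src) in risky]


# Hypothetical fstab content, for illustration only.
SAMPLE = """
# <device> <mount point> <type> <options> <dump> <pass>
UUID=1111-2222 /boot/efi vfat defaults 0 2
/dev/disk/by-id/nvme-eui.0000000000000001-part2 / btrfs defaults 0 0
/dev/nvme0n1p3 /mnt/win ntfs defaults 0 0
tmpfs /tmp tmpfs defaults 0 0
"""

for source, mount_point in risky_entries(SAMPLE):
    print(f"{mount_point}: addressed via {source}")
```

On a real system the text would come from reading /etc/fstab. Entries that use UUID= are left alone here because they refer to the filesystem rather than to the identifiers reported by the drive.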
In /etc/fstab, the partitions and subvolumes are currently identified by their UUIDs. However, the defective SSD has been replaced in the meantime by a new SSD, which was formatted with nvme format and partitioned with a Linux program. It is not known whether /etc/fstab used different types of addresses before the SSD replacement.
So you say, one drive broke and the kernel then corrupted the good disk?

Right, fstab will use the partition UUIDs, so as far as I understand this should be stable. I am a bit at a loss as to what happened in your case. I expect you still see the 'duplicate id' message also with Fedora?
(In reply to Daniel Wagner from comment #16)
> So you say, one drive broke and the kernel then corrupted the good disk?
>
> Right, fstab will use the partition UUIDs, so as far as I understand this
> should be stable. I am a bit at a loss what happened in your case. I expect
> you still see the 'duplicate id' message also with Fedora?

NO. Let's recap the history:

* The computer was delivered with 2 SSDs, with Windows 10 pre-installed on the first SSD.
* openSUSE 15.4 was installed on the second SSD, using the EFI partition on the first SSD.
* Windows 11 was installed on the first SSD, replacing Windows 10. The openSUSE bootloader was used for booting into Windows or Linux. No problems so far.
* Upgrade to openSUSE 15.5: the installation ISO image on USB did not recognize both SSDs and could not be used to upgrade openSUSE. Instead, openSUSE 15.4 was upgraded to 15.5 using zypper.
* Soon after that: warnings about the first SSD being in danger of data corruption, as described earlier.
* Some weeks later: Windows 11 could no longer be booted, and the first SSD was now strictly read-only.
* Experimenting with repair options: trying to install Fedora on the first SSD failed with an error message about duplicate UUIDs for the SSDs. Other repair options also failed (nvme commands, gdisk, Windows commands).
* The corrupted first SSD was replaced by a new SSD of a different brand. Using the Fedora installer again, the new SSD was formatted with nvme format, and a UUID was assigned by gdisk. The new SSD was then partitioned with gdisk. Fedora was installed for test purposes, but its bootloader did not identify openSUSE 15.5.
* Fedora was uninstalled and Windows 11 re-installed.
* openSUSE 15.5 remained functional throughout the process. After the re-installation of Windows 11, the bootloader was updated in order to include Windows 11 in the boot menu.

Issues seem to be fixed now. There is no longer any "duplicate UUID" warning. The EFI partition on the first SSD is currently no longer in /etc/fstab.
Thanks for the recap. Indeed, it sounds more like the first disk died and the warning wasn't actually happening. Shall we close the bug?
(In reply to Daniel Wagner from comment #18)
> Thanks for the recap. Indeed, it sounds more like the first disk died and
> the warning wasn't actually happening. Shall we close the bug?

My presumption has been different: because of your earlier comments, I assumed that if Windows is the first operating system installed on an SSD, the SSD is not assigned a UUID, which apparently created the problems. Therefore, as described in the above history, I made sure to assign a UUID to the new SSD with a Linux command first, and to partition it with a Linux program, before re-installing Windows 11. To me it seemed that the openSUSE 15.5 installer had an issue, because it could not recognize both SSDs. I leave it to you how you interpret the cause of the problems.
As far as I can tell, the kernel had all the needed patches to operate correctly. Sorry, the installer is not really my turf, so I can't really help there.
(In reply to Daniel Wagner from comment #20)
> As far as I can tell, the kernel had all the needed patches to operate
> correctly.
> Sorry, the installer is not really my turf, can't really help there.

So even if the kernel had the required patches, these still may not have been safe enough (?) according to your comment #14, because they did not prevent the data corruption that the boot warning mentioned. In my case, maybe the EFI partition got corrupted, because it was the only partition of the first SSD (with Windows) that was mounted during boot and included in /etc/fstab. As described in the history above, the issue seems to be resolved now after replacing the hardware and activating the new SSD with Linux first, before re-installing Windows.
(In reply to Otmar Mak from comment #21)
> So even if the kernel had the required patches, these still may not have
> been safe enough (?) according to your comment #14, because they did not
> prevent data corruption, as was the warning during boot.

Indeed, this is obviously bad, but I don't think it's a problem caused by the kernel. The kernel is issuing a warning. It is just a warning and not an error because there are plenty of consumer devices out there which report non-unique IDs, and people were really upset when their systems stopped working after the nvme subsystem enforced the uniqueness of the IDs.

This means the devices will always show up, and user land has to decide what to do with them. As long as the partitions have proper UUIDs, all should work, assuming /etc/fstab uses the UUIDs from the OS PoV. I don't really know how this works with Windows, but I'm pretty sure Fedora is also using UUIDs.

Anyway, what I'm trying to say is that the data corruption you got was not due to the kernel doing funky stuff. Something went wrong in user land.

Let's try to ask someone with more know-how on the boot process.
> ....
> Anyway, what I try to say is, the data corruption you got, was not due to the
> kernel doing funky stuff. Something went wrong in user land.
>
> Let's try to ask someone with more know how on the boot process.

From the user land perspective, there were no hints whatsoever about what a user could do in response to the warnings in order to prevent data corruption. One speculation I had was that maybe the mount command was no longer safe when mounting Windows partitions r/w using /dev/nvme....

One more observation: the Windows 11 system stopped being bootable after a Windows update was followed, on reboot, by a Linux update instead of the completion of the Windows update. Maybe that somehow broke the system.

But even then, open questions remain from the user perspective:
(1) Why did the openSUSE installer for Leap 15.5 not recognize both SSDs, whereas the Fedora installer did, although with a duplicate UUID warning?
(2) What kind of settings or precautions should a user take if such warnings occur?

The only idea I had was to make sure that Linux is installed FIRST on both SSDs, and Windows only afterwards. I have not had any problems since.
Hi Otmar, is the issue still reproducible? Or do you think we need to assign it to the installer's maintainer to take a look at this issue?
(In reply to Chenzi Cao from comment #24)
> Hi Otmar, is the issue still reproducible please? Or do you think need to
> assign it to installer's maintainer to take a look at this issue please?

Since I migrated to the latest openSUSE Leap version, the issue is no longer reproducible on my computer. Trying to reproduce it would also mean risking damage to an SSD and data loss. I have not observed any similar issue since the openSUSE Leap upgrade and further Microsoft Windows 11 updates.