Bug 1203566 - [Build 21.1] openQA test fails in ibft - smartctl -i /dev/sda failed for 'mandatory smart cmd failure'
Status: RESOLVED NORESPONSE
Classification: openSUSE
Product: PUBLIC SUSE Linux Enterprise Server 15 SP5
Component: YaST2
Version: unspecified
Hardware: x86-64 SLES 15
Priority: P2 - High  Severity: Normal
Target Milestone: ---
Assigned To: E-mail List
https://openqa.suse.de/tests/9536843/...
Reported: 2022-09-20 11:41 UTC by Ming Li
Modified: 2023-01-09 10:10 UTC (History)

See Also:
Found By: openQA
Services Priority:
Business Priority:
Blocker: Yes
Marketing QA Status: ---
IT Deployment: ---


Attachments
ibft y2log (7.82 MB, application/x-bzip)
2022-11-08 11:04 UTC, Ming Li

Description Ming Li 2022-09-20 11:41:19 UTC
## Observation

openQA test in scenario sle-15-SP5-Online-x86_64-cryptlvm_iscsi@64bit fails in
[ibft](https://openqa.suse.de/tests/9536843/modules/ibft/steps/40)

## Test suite description
Conducts installation on iSCSI device relying on iBFT with encrypted LVM.


## Reproducible

Fails since (at least) Build [21.1](https://openqa.suse.de/tests/9518762)


## Expected result

Last good: [19.1](https://openqa.suse.de/tests/9418168) (or more recent)


## Further details

Always latest result in this scenario: [latest](https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Online&machine=64bit&test=cryptlvm_iscsi&version=15-SP5)
Comment 1 Stefan Weiberg 2022-09-26 15:08:49 UTC
Not sure about the root cause, but it could be related to the iscsi setup in yast2. We didn't have a kernel change so far. For now I am setting the YaST2 component.

One note on the bug report: the Observation section links to a different issue than the Reproducible section.
Comment 2 Stefan Weiberg 2022-09-26 15:09:44 UTC
Could you maybe collect and attach the y2logs of that system? They are not available in openQA.
Comment 3 Stefan Hundhammer 2022-10-05 11:25:46 UTC
  smartctl -i /dev/sda

failed with:

  "A mandatory SMART command failed: exiting"

When successful, that command ("-i" for "--info") should result in
something like this:


smartctl 7.2 2021-09-14 r5237 [x86_64-linux-5.14.21-150400.24.21-default] (SUSE RPM)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     SAMSUNG SpinPoint F3
Device Model:     SAMSUNG HD103SJ
Serial Number:    S246JD2Z921835
LU WWN Device Id: 5 0024e9 0040754bf
Firmware Version: 1AJ10001
User Capacity:    1.000.204.886.016 bytes [1,00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 6
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Wed Oct  5 13:22:20 2022 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
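
For anyone scripting this check, a minimal sketch of how the two outcomes quoted above could be told apart from the captured text. It only looks for the strings shown in this comment; the function name and the three return labels are made up for illustration and are not part of smartctl or of the openQA test.

```python
import re

# Error text quoted in this bug report for the failing case.
FAILURE_MARKER = "A mandatory SMART command failed"


def classify_smartctl_info(output):
    """Classify text captured from `smartctl -i <device>`.

    Returns "failed", "enabled" or "unknown" (hypothetical labels).
    """
    if FAILURE_MARKER in output:
        return "failed"
    # Collect the value of every "SMART support is:" line.
    values = re.findall(r"^SMART support is:\s*(.+)$", output, flags=re.MULTILINE)
    if any(value.startswith("Enabled") for value in values):
        return "enabled"
    # Anything else (empty output, other states) is left undecided.
    return "unknown"
```

Feeding it the successful output above should give "enabled", and the error line from the failing job should give "failed".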
Comment 4 Stefan Hundhammer 2022-10-05 11:28:56 UTC
cat 07-committed.yml 

# 2022-09-19 05:00:13 -0400
---
- disk:
    name: "/dev/sda"
    size: 20971540 KiB (20.00 GiB)
    block_size: 0.5 KiB
    io_size: 0 B
    min_grain: 1 MiB
    align_ofs: 0 B
    partition_table: gpt
    partitions:
    - free:
        size: 1 MiB
        start: 0 B
    - partition:
        size: 8 MiB
        start: 1 MiB
        name: "/dev/sda1"
        type: primary
        id: bios_boot
    - partition:
        size: 20962307.5 KiB (19.99 GiB)
        start: 9 MiB
        name: "/dev/sda2"
        type: primary
        id: lvm
        encryption:
          type: luks
          name: "/dev/mapper/cr_scsi-1IET_00010001-part2"
          password: "***"
    - free:
        size: 16.5 KiB
        start: 20971523.5 KiB (20.00 GiB)
- disk:
    name: "/dev/vda"
    size: 20 GiB
    block_size: 0.5 KiB
    io_size: 0 B
    min_grain: 1 MiB
    align_ofs: 0 B
- lvm_vg:
    vg_name: system
    extent_size: 4 MiB
    lvm_lvs:
    - lvm_lv:
        lv_name: home
        size: 5864 MiB (5.73 GiB)
        stripes: 1
        file_system: xfs
        mount_point: "/home"
    - lvm_lv:
        lv_name: root
        size: 13400 MiB (13.09 GiB)
        stripes: 1
        file_system: btrfs
        mount_point: "/"
        btrfs:
          default_subvolume: "@"
          subvolumes:
          - subvolume:
              path: "@"
          - subvolume:
              path: "@/boot/grub2/i386-pc"
          - subvolume:
              path: "@/boot/grub2/x86_64-efi"
          - subvolume:
              path: "@/opt"
          - subvolume:
              path: "@/root"
          - subvolume:
              path: "@/srv"
          - subvolume:
              path: "@/tmp"
          - subvolume:
              path: "@/usr/local"
          - subvolume:
              path: "@/var"
              nocow: true
    - lvm_lv:
        lv_name: swap
        size: 1204 MiB (1.18 GiB)
        stripes: 1
        file_system: swap
        mount_point: swap
    lvm_pvs:
    - lvm_pv:
        blk_device: "/dev/mapper/cr_scsi-1IET_00010001-part2"
Comment 5 Stefan Hundhammer 2022-10-05 11:49:25 UTC
The iSCSI disk in question is /dev/sda.

While configuring iSCSI, the screenshots in the failing test case look exactly the same as in the "last good" one.

Yet later, after installation, in the failing case the iSCSI disk does not accept SMART commands. But is that really something that can be influenced from the client side? Shouldn't that be set up on the iSCSI server, and that's it?

Are we absolutely sure that the iSCSI server did not change in the meantime? Is that a physical server, or also a virtual machine? On physical machines, I remember that SMART support needs to be enabled in the BIOS. Is it plausible that anything changed there; on the iSCSI server side?

I also briefly checked the yast-iscsi-client code; the last pull request is from June 21st, well before the "last good" test case. And even that PR does not appear to be even remotely related.

  https://github.com/yast/yast-iscsi-client/pull/120/files

So, please check the iSCSI server side first.
Comment 6 Ming Li 2022-11-08 11:04:04 UTC
Created attachment 862728 [details]
ibft y2log
Comment 7 Ming Li 2022-11-08 11:05:59 UTC
(In reply to Stefan Weiberg from comment #2)
> Could you maybe collect and attach the y2logs of that system? They are not
> available in openQA.

A problem with the ibft worker blocked me from reproducing this bug for a while, but I finally reproduced it and got the ibft y2log: https://openqa.nue.suse.com/tests/9898913#step/ibft/41
Comment 8 Stefan Hundhammer 2022-11-08 12:31:03 UTC
I never got an answer to my question in comment #5: Are you sure that SMART support is enabled on the server side?
Comment 10 Stefan Hundhammer 2022-11-08 12:33:05 UTC
BTW I don't think that's something that can be extracted from a y2log on the server side; you'll have to run "smartctl" commands there.
Comment 11 Ming Li 2022-11-09 01:29:58 UTC
(In reply to Stefan Hundhammer from comment #10)
> BTW I don't think that's something that can be extracted from an y2log on
> the server side; you'll have to run "smartctl" commands there.

Hi, I think you mean running the command directly on the ibft worker.
I have a recent job that passed; please check https://openqa.nue.suse.com/tests/9891296#step/ibft/40. It ran on openqaworker6:1, which is an instance of the qemu_x86_64_ibft worker, so I think the ibft setup on the server side is correct, at least for the runs where the command succeeds.
I agree with you that something is wrong on the qemu_x86_64_ibft worker when the failure happens. In fact, our openQA test runs on the SUT, which is a VM hosted on the worker, so maybe we can check on the SUT as well. Please give me some instructions on how to check it, thanks.
Comment 12 Ming Li 2022-11-09 02:28:24 UTC
(In reply to Stefan Hundhammer from comment #10)
> BTW I don't think that's something that can be extracted from an y2log on
> the server side; you'll have to run "smartctl" commands there.

Hi, I just checked the iSCSI server; I can't access it.

# iscsiadm --mode discovery --op update --type sendtargets --portal x.x.x.x
iscsiadm: cannot make connection to x.x.x.x: No route to host
iscsiadm: cannot make connection to x.x.x.x: No route to host
iscsiadm: cannot make connection to x.x.x.x: No route to host
iscsiadm: cannot make connection to x.x.x.x: No route to host
iscsiadm: connection login retries (reopen_max) 5 exceeded
iscsiadm: Could not perform SendTargets discovery: iSCSI PDU timed out

For security reasons, I haven't pasted the IP address of the iSCSI server here. It is in the autoinst-log, and I can also send it to you via e-mail.
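
The "No route to host" errors above mean the discovery never got as far as iSCSI; the TCP connection itself failed. A minimal sketch of a plain reachability probe that could be run before iscsiadm, assuming the portal listens on the standard iSCSI port 3260 (the port is not stated in this report, and the function name is hypothetical):

```python
import socket

# Standard iSCSI target port; an assumption, the report does not state it.
ISCSI_DEFAULT_PORT = 3260


def portal_reachable(host, port=ISCSI_DEFAULT_PORT, timeout=3.0):
    """Return True if a TCP connection to the portal can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "No route to host", "Connection refused" and timeouts.
        return False
```

If this returns False for the portal address, the problem is at the network level (routing, firewall, or the host being down) rather than in the target configuration.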
Comment 13 Stefan Hundhammer 2022-11-09 09:36:55 UTC
I cannot check that remotely from here. Somebody who has access to that server will need to log in and issue "smartctl" commands to check if the machine has SMART enabled. From our investigations in this bug so far, it looks very much like it's not.

Not every problem in the world is a YaST installer problem. We cannot do system administration for the server infrastructure in the QA labs.
Comment 14 Ming Li 2022-11-10 07:08:15 UTC
(In reply to Stefan Hundhammer from comment #13)
> I cannot check that remotely from here. Somebody who has access to that
> server will need to log in and issue "smartctl" commands to check if the
> machine has SMART enabled. From our investigations in this bug so far, it
> looks very much like it's not.
> 
> Not every problem in the world is a YaST installer problem. We cannot do
> system administration for the server infrastructure in the QA labs.

I can access the machine now, but I don't know how to set it up; it seems there is no smartctl command.

leli@worker2:~> smartctl
-bash: smartctl: command not found
leli@worker2:~>

I don't know who maintains the iSCSI server.
Comment 15 Richard Fan 2022-11-10 08:13:27 UTC
https://openqa.nue.suse.com/tests/9898914#step/ibft/48

There is no issue with this iscsi server. [seems we have more than 1 iscsi server?]

I can give the iscsi tgt server configuration:

#tgt-admin -s
Target 1: iqn.2016-02.openqa.de:for.openqa
    System information:
        Driver: iscsi
        State: ready
    I_T nexus information:
    LUN information:
        LUN: 0
            Type: controller
            SCSI ID: IET     00010000
            SCSI SN: beaf10
            Size: 0 MB, Block size: 1
            Online: Yes
            Removable media: No
            Prevent removal: No
            Readonly: No
            SWP: No
            Thin-provisioning: No
            Backing store type: null
            Backing store path: None
            Backing store flags: 
        LUN: 1
            Type: disk
            SCSI ID: IET     00010001
            SCSI SN: beaf11
            Size: 21475 MB, Block size: 512
            Online: Yes
            Removable media: No
            Prevent removal: No
            Readonly: No
            SWP: No
            Thin-provisioning: No
            Backing store type: rdwr
            Backing store path: /opt/openqa-iscsi-disk
            Backing store flags: 
    Account information:
    ACL information:
        ALL
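
In case someone wants to compare this configuration between workers or over time, here is a rough sketch that pulls the LUN entries out of output shaped like the listing above. It relies only on the indentation and the "key: value" layout visible here; it is not an official parser for tgt-admin, and the function name is made up.

```python
import re


def parse_tgt_luns(text):
    """Extract LUN entries from `tgt-admin -s` style output as dicts."""
    luns = []
    current = None
    lun_indent = 0
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        indent = len(raw) - len(raw.lstrip())
        match = re.match(r"LUN:\s*(\d+)$", line)
        if match:
            # A new "LUN: N" header starts a new entry.
            current = {"lun": int(match.group(1))}
            lun_indent = indent
            luns.append(current)
        elif current is not None:
            if indent <= lun_indent:
                # Left the LUN block (e.g. "Account information:").
                current = None
            elif ":" in line:
                key, _, value = line.partition(":")
                current[key.strip()] = value.strip()
    return luns
```

For the listing above this should yield two entries, with LUN 1 reporting `Type` "disk" and `Backing store path` "/opt/openqa-iscsi-disk".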
Comment 16 George Gkioulis 2022-11-14 09:34:33 UTC
The issue does not seem to be originating from a change in configuration of the iscsi server.

This is the original failure: https://openqa.suse.de/tests/9536843#step/ibft/34 with target: iqn.2016-02.openqa.de:for.openqa and portal: 10.160.1.93

This is a recent test that passed: https://openqa.suse.de/tests/9680734#step/ibft/34
It has the same iscsi target: iqn.2016-02.openqa.de:for.openqa and portal: 10.160.1.93 and there has been no configuration change there since the issue.

Since SMART support seems to be enabled in the run that PASSES, it could be that there is a different underlying issue.
Comment 17 Liu Shukui 2022-11-16 02:45:07 UTC
Timeout in new Build 40.1:

https://openqa.suse.de/tests/9919949#step/ibft/40
Comment 18 Stefan Hundhammer 2022-12-13 09:54:49 UTC
So, what is the status of this?

Do we know now whether or not SMART is enabled on that server?

Is the smartmontools package installed on that server? Please notice that they are in /usr/sbin which you may not have in your $PATH as a normal user.
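
A small sketch of that lookup, for illustration: it searches $PATH first and then a few sbin directories. The directory list and the function name are assumptions; only /usr/sbin is mentioned above.

```python
import os
import shutil

# /usr/sbin is the location named above; the other two are assumed extras.
SBIN_DIRS = ("/usr/sbin", "/sbin", "/usr/local/sbin")


def find_admin_tool(name, extra_dirs=SBIN_DIRS):
    """Locate an executable, also searching directories missing from $PATH."""
    dirs = [d for d in os.environ.get("PATH", "").split(os.pathsep) if d]
    dirs.extend(extra_dirs)
    # shutil.which returns the full path, or None if nothing is found.
    return shutil.which(name, path=os.pathsep.join(dirs))
```

If `find_admin_tool("smartctl")` returns None, the tool is most likely not installed at all, as opposed to merely being outside a normal user's $PATH.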
Comment 19 Stefan Hundhammer 2022-12-13 10:06:32 UTC
Please also notice that it's not YaST that tries to use the smartctl command, it's additional tests in your test setup.

I don't know to what extent SMART works over iSCSI, and what the requirements are for it.

I don't see any option that looks even remotely related to SMART in yast-iscsi-client, and that code has not changed for a long time (see comment #5).

I don't see ANY indication that this should be a YaST bug.
Comment 20 Stefan Hundhammer 2022-12-13 10:15:58 UTC
As for SMART over iSCSI:

https://www.smartmontools.org/wiki/FAQ#SmartmontoolsforFireWireUSBandSATAdiskssystems

"SCSI commands can be conveyed by many transports: the veteran SCSI Parallel Interface (SPI), Fibre Channel (FC), Infiniband (SRP), Serial Attached SCSI (SAS), IP (iSCSI and iSER), USB (mass storage), and IEEE 1394 (SBP) to name some."


Maybe this is helpful to debug from your test client's side what is going on:

"The '-d sat' option instructs smartctl and smartd to assume a SATL is in place and act accordingly."


It might even be useful to always use that "-d sat" option in that test when it is known that the target disk is iSCSI.
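
A sketch of how a test helper could add that option only for iSCSI disks. Both function names are hypothetical, and the iSCSI detection assumes the udev by-path naming scheme in which iSCSI disks carry "-iscsi-" in the link name. Whether "-d sat" actually helps for this particular target is not established here.

```python
def looks_like_iscsi(by_path_name):
    """Heuristic: assumes udev by-path names for iSCSI disks contain "-iscsi-"."""
    return "-iscsi-" in by_path_name


def smartctl_info_command(device, force_sat=False):
    """Build the argument list for `smartctl -i`, optionally with `-d sat`."""
    cmd = ["smartctl", "-i"]
    if force_sat:
        # "-d sat" tells smartctl to assume a SCSI-to-ATA translation layer.
        cmd += ["-d", "sat"]
    cmd.append(device)
    return cmd
```

The returned list can be passed as-is to whatever runs commands in the test, for example `subprocess.run(smartctl_info_command("/dev/sda", force_sat=True))`.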
Comment 21 Stefan Hundhammer 2022-12-19 12:51:09 UTC
No feedback.

Besides, as mentioned multiple times, there is no hint that this might be a YaST bug. YaST does not change anything related to SMART. It's either the test server (which quite possibly might not have SMART support enabled in the BIOS) or the iSCSI transport layer.
Comment 22 Huajian Luo 2022-12-22 07:04:45 UTC
We still hit this bug in Build 64.1:
https://openqa.suse.de/tests/10220394#step/ibft/41
Comment 23 Joaquín Rivera 2023-01-09 10:10:38 UTC
On the worker where this is running, SMART support is available and enabled:

smartctl --all /dev/sda
smartctl 7.2 2021-09-14 r5237 [x86_64-linux-5.14.21-150400.24.38-default] (SUSE RPM)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Constellation ES.3
Device Model:     ST1000NM0033-9ZM173
Serial Number:    Z1W5P5JM
LU WWN Device Id: 5 000c50 091d250d3
Firmware Version: SN06
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Jan  9 09:59:55 2023 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

In the VM, the output looks like this; the command sporadically works or fails:
https://openqa.suse.de/tests/10187388#step/ibft/42

When I try the command with additional options, `smartctl -d sat -T permissive -a /dev/sda`, I run into trouble like that described here:
https://www.smartmontools.org/wiki/SAT-with-UAS-Linux
Searching further, I always end up at the same uas-related issue, but when running lsmod on the worker and in the VM, I don't find that kernel module.

To summarize:
We have an installation in a VM using an iSCSI disk from the worker where the VM is running: https://openqa.suse.de/tests/10226693#step/iscsi_configuration/3 (screenshot of the iSCSI configuration during installation),
and sporadically, when we query disk information with the smartctl command in the installed system, we don't get an answer.
This is definitely not related to YaST. Our plan is to disable this check in the test, as there seems to be no guarantee that the command can run and return a reliable answer.

Please forward this bug to Kernel or another component if you think it makes sense. For our testing scope, and given the information at https://www.smartmontools.org/wiki/FAQ#SmartmontoolsforFireWireUSBandSATAdiskssystems, it does not seem worth it to us.

Thanks, Stefan Hundhammer, for the information provided; it helps us understand the issue and be aware of what makes sense for us to test. Sorry for the late response.
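
For reference, if the check were kept rather than disabled, one alternative for a sporadically failing command is to retry it a few times and report the outcome without failing the whole job. This is only a sketch of that alternative, not what is planned above; the function name and parameters are made up, and `check` stands for any callable that returns True on success.

```python
import time


def run_flaky_check(check, attempts=3, delay=0.0):
    """Call `check()` up to `attempts` times; return (ok, tries_used)."""
    for attempt in range(1, attempts + 1):
        if check():
            return True, attempt
        if attempt < attempts:
            # Wait before the next try; no sleep after the last one.
            time.sleep(delay)
    return False, attempts
```

The caller can then log a soft failure when the first element is False instead of aborting the test module.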