Bug 1214469 - Networking Issue After Transactional Update
Summary: Networking Issue After Transactional Update
Status: RESOLVED INVALID
Alias: None
Product: openSUSE Tumbleweed
Classification: openSUSE
Component: MicroOS
Version: Current
Hardware: Other Other
Importance: P5 - None : Normal
Target Milestone: ---
Assignee: Forgotten User u0-bnvADNc
QA Contact: E-mail List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-08-22 13:05 UTC by Samuel Conway
Modified: 2023-08-24 13:29 UTC (History)

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
pre-update (771.95 KB, image/jpeg)
2023-08-22 13:05 UTC, Samuel Conway
post-update (771.45 KB, image/jpeg)
2023-08-22 13:06 UTC, Samuel Conway

Description Samuel Conway 2023-08-22 13:05:49 UTC
Created attachment 868938 [details]
pre-update

Summary:
After performing a transactional update on an openSUSE Tumbleweed system, the networking bridges and virtual interfaces were not functioning. As a result, the affected device, which serves as a VM-host for an opnsense firewall, lost its network connectivity. Rolling back to an older snapshot and disabling the transactional-update.timer service was required to restore networking functionality. The issue had significant impact as all devices on the network lost internet connectivity. Manual intervention in the form of a keyboard and monitor connection was needed to perform the rollback.

Description:
On [Date], I performed a transactional update on my openSUSE Tumbleweed system. The system was being used as a VM-host for an opnsense firewall, which plays a critical role in managing network traffic for my network. The transactional update was carried out using the standard update procedure.

After the update was applied and the system rebooted, it was immediately apparent that there were issues with the networking configuration. None of the networking bridges or virtual interfaces were functioning as expected. This resulted in a complete loss of network connectivity for all the virtual machines running on the host, as well as the host itself.

Due to the severity of the issue and the impact it had on the network, I decided to manually intervene by connecting a keyboard and monitor to the affected system. After accessing the system locally, I attempted to diagnose the problem. It became evident that the issue was related to the recent transactional update, as rolling back to an older snapshot of the system resulted in the restoration of networking functionality. This process involved using the system's snapshot manager to revert to a state prior to the update.

Additionally, to prevent this issue from occurring again in the future, I disabled the transactional-update.timer service. While this action helped avoid further disruption, it's important to note that transactional updates are a critical part of openSUSE Tumbleweed's update process and should ideally work without causing networking issues.

Impact:
The impact of this issue was significant. Due to the loss of networking bridges and virtual interfaces, all the devices on the network, including the virtual machines hosted on the affected system, lost internet connectivity. This disruption lasted until I could manually intervene, roll back to an older snapshot, and disable the transactional update timer. The incident led to downtime and network interruptions and required manual intervention, which is not ideal for a system that's intended to provide stable and uninterrupted network services.

Steps to Reproduce:

- Install openSUSE Tumbleweed on a system configured as a VM-host.
- Set up networking bridges and virtual interfaces.
- Perform a transactional update using the standard update procedure.
- Reboot the system after the update.
- Observe that networking bridges and virtual interfaces are not functioning as expected, resulting in a loss of network connectivity for all devices.

Expected Results:
After a transactional update, the system's networking configuration, including bridges and virtual interfaces, should remain intact, and the network connectivity for all devices should not be affected.

Additional Information:
- Intel Pentium N6005, 4x Intel i226-V 2.5Gb LAN 
- Output of relevant commands (`ip addr show`): see attached
Comment 1 Samuel Conway 2023-08-22 13:06:12 UTC
Created attachment 868939 [details]
post-update
Comment 2 Samuel Conway 2023-08-22 13:10:30 UTC
Not sure this helps, but when I entered a transactional-update shell session, I got some "Operation not supported" messages:
❯ transactional-update shell
Checking for newer version.
New version found - updating...
Loading repository data...
Reading installed packages...
Retrieving: transactional-update-4.3.0-1.2.x86_64 (openSUSE-Tumbleweed-Oss)                  (1/1),  71.0 KiB    
Retrieving: transactional-update-4.3.0-1.2.x86_64.rpm .....................................................[done]
(1/1) /tmp/transactional-update.sFwAhDUSa5/repo-oss/x86_64/transactional-update-4.3.0-1.2.x86_64.rpm ......[done]
Loading repository data...
Reading installed packages...
Retrieving: libtukit4-4.3.0-1.2.x86_64 (openSUSE-Tumbleweed-Oss)                             (1/2), 163.7 KiB    
Retrieving: libtukit4-4.3.0-1.2.x86_64.rpm ................................................................[done]
(1/2) /tmp/transactional-update.sFwAhDUSa5/repo-oss/x86_64/libtukit4-4.3.0-1.2.x86_64.rpm .................[done]
Retrieving: tukit-4.3.0-1.2.x86_64 (openSUSE-Tumbleweed-Oss)                                 (2/2),  67.6 KiB    
Retrieving: tukit-4.3.0-1.2.x86_64.rpm ....................................................................[done]
(2/2) /tmp/transactional-update.sFwAhDUSa5/repo-oss/x86_64/tukit-4.3.0-1.2.x86_64.rpm .....................[done]
transactional-update 4.3.0 started
Options: shell
Separate /var detected.
2023-08-22 14:56:57 tukit 4.3.0 started
2023-08-22 14:56:57 Options: -c138 open 
2023-08-22 14:57:00 Using snapshot 138 as base for new snapshot 141.
2023-08-22 14:57:00 Syncing /etc of previous snapshot 137 as base into new snapshot "/.snapshots/141/snapshot"
2023-08-22 14:57:00 SELinux is enabled.
Relabeled /var/lib/machines from system_u:object_r:unlabeled_t:s0 to system_u:object_r:systemd_machined_var_lib_t:s0
setxattr failed: /var/lib/machines: Operation not supported
ID: 141
2023-08-22 14:57:17 Transaction completed.
Opening chroot in snapshot 141, continue with 'exit'
2023-08-22 14:57:17 tukit 4.3.0 started
2023-08-22 14:57:17 Options: call 141 bash 
Relabeled /var/lib/machines from system_u:object_r:unlabeled_t:s0 to system_u:object_r:systemd_machined_var_lib_t:s0
setxattr failed: /var/lib/machines: Operation not supported
2023-08-22 14:57:19 Executing `bash`:

root in / 
❯ exit
2023-08-22 14:57:56 Application returned with exit status 0.
2023-08-22 14:57:56 Transaction completed.
2023-08-22 14:57:56 tukit 4.3.0 started
2023-08-22 14:57:56 Options: close 141 
Relabeled /var/lib/machines from system_u:object_r:unlabeled_t:s0 to system_u:object_r:systemd_machined_var_lib_t:s0
setxattr failed: /var/lib/machines: Operation not supported
2023-08-22 14:58:01 New default snapshot is #141 (/.snapshots/141/snapshot).
2023-08-22 14:58:01 Transaction completed.

Please reboot your machine to activate the changes and avoid data loss.
New default snapshot is #141 (/.snapshots/141/snapshot).
transactional-update finished
Comment 3 Thorsten Kukuk 2023-08-22 13:27:45 UTC
I doubt this has anything to do with transactional-update at all.
There are many MicroOS production machines here and none shows similar behavior.
I'm afraid you need to go through the log files and search for the error message explaining why your network interfaces don't come up anymore. That's the only way to find out which updated package breaks your network setup.
Some SELinux failures in "transactional-update shell" are unrelated.
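
A minimal sketch of that log search, assuming the journal is persistent so that entries from the failed boot survive the rollback. The function name `show_boot_errors` and the default boot offset of -1 are illustrative only; the right offset depends on how many reboots have happened since the failure (`journalctl --list-boots` lists them).

```shell
#!/bin/sh
# Sketch: print error-level messages and the NetworkManager log for one boot.
# Assumption: systemd-journald keeps logs across reboots (persistent journal).
show_boot_errors() {
    boot="${1:--1}"   # -1 = previous boot; pass another offset if needed
    # All messages of priority "err" or worse from that boot
    journalctl --boot="$boot" --priority=err --no-pager
    # Everything NetworkManager logged during that boot
    journalctl --boot="$boot" --unit=NetworkManager --no-pager
}
```

Calling `show_boot_errors -2` would inspect the boot before the previous one.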
Comment 4 Samuel Conway 2023-08-22 13:45:39 UTC
Hi Thorsten, thanks for the response, I guess I panicked a bit...

There are two failed services in the current snapshot.
❯ systemctl list-units --failed
  UNIT                               LOAD   ACTIVE SUB    DESCRIPTION                       
● NetworkManager-wait-online.service loaded failed failed Network Manager Wait Online
● snapper-cleanup.service            loaded failed failed Daily Cleanup of Snapper Snapshots

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.
2 loaded units listed.

I managed to start NetworkManager-wait-online.service, but snapper-cleanup.service is throwing an error.

❯ systemctl status snapper-cleanup.service
× snapper-cleanup.service - Daily Cleanup of Snapper Snapshots
     Loaded: loaded (/etc/systemd/system/snapper-cleanup.service; static)
     Active: failed (Result: exit-code) since Tue 2023-08-22 14:43:32 CEST; 30min ago
   Duration: 70ms
TriggeredBy: ● snapper-cleanup.timer
       Docs: man:snapper(8)
             man:snapper-configs(5)
   Main PID: 27309 (code=exited, status=1/FAILURE)
        CPU: 10ms

Aug 22 14:43:32 srv01 systemd[1]: Started Daily Cleanup of Snapper Snapshots.
Aug 22 14:43:32 srv01 systemd-helper[27309]: running cleanup for 'root'.
Aug 22 14:43:32 srv01 systemd-helper[27309]: running number cleanup for 'root'.
Aug 22 14:43:32 srv01 systemd-helper[27309]: Deleting snapshot failed.
Aug 22 14:43:32 srv01 systemd-helper[27309]: number cleanup for 'root' failed.
Aug 22 14:43:32 srv01 systemd-helper[27309]: running timeline cleanup for 'root'.
Aug 22 14:43:32 srv01 systemd-helper[27309]: running empty-pre-post cleanup for 'root'.
Aug 22 14:43:32 srv01 systemd[1]: snapper-cleanup.service: Main process exited, code=exited, status=1/FAILURE
Aug 22 14:43:32 srv01 systemd[1]: snapper-cleanup.service: Failed with result 'exit-code'.


I suspect this is the cause: since it is not removing older snapshots, my /root partition has only 1 GB of free space left.

Currently I have snapshots 62-142; I was able to remove 63-90 with:
snapper delete 63-90
For some reason, deleting 62 throws an error: 
Deleting snapshot failed.
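
A rough sketch for checking the free space and the snapshot list together, assuming a btrfs root filesystem managed by snapper. The helper names `free_kib` and `snapshot_report` are made up for illustration, and plain `df` can be imprecise on btrfs, which is why the btrfs tool is called as well.

```shell
#!/bin/sh
# Print the free space (in KiB) of the filesystem holding a path, using POSIX df.
free_kib() {
    df -Pk "${1:-/}" | awk 'NR == 2 { print $4 }'
}

# Sketch: show free space, btrfs allocation, and the snapper snapshot list.
# Assumption: btrfs root with snapper; both tools typically need root rights.
snapshot_report() {
    echo "free on /: $(free_kib /) KiB"
    btrfs filesystem usage /
    snapper list
}
```

Running `snapshot_report` as root before and after a `snapper delete` shows whether the deletion actually freed space.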
Comment 5 Santiago Zarate 2023-08-22 13:55:13 UTC
(In reply to Samuel Conway from comment #0)

Can you describe the network configuration? I have a personal host that's also a VM host and a VPN (WG) gateway, and I haven't had issues, despite it having two VMs with bridged networks.

flowchart TD
    A[Internet] --> system[host network adapter]
    system --> bridge{libvirtd-bridged network}
    bridge --> VM1
    bridge --> VM2
    bridge --> VM3
     
(See routed network on hetzner: https://docs.hetzner.com/robot/dedicated-server/ip/additional-ip-adresses/)
Comment 6 Samuel Conway 2023-08-22 14:44:19 UTC
(In reply to Santiago Zarate from comment #5)

My setup was done via Cockpit-GUI.

I guess I have the same:
     A[Internet] --> system[host network adapter]
     system --> bridge
     bridge --> VM1

All three network interfaces are direct attachments, and one of them is not used.
Perhaps there are commands I can use to help describe my setup better?
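
One possible set of commands for describing such a setup, as a sketch. It assumes iproute2, NetworkManager, and libvirt are installed; the function name `describe_network` is illustrative, and the domain name passed to it is whatever the VM is called on the host.

```shell
#!/bin/sh
# Sketch: collect an overview of host interfaces, bridges, and VM attachments.
# Assumption: ip/bridge (iproute2), nmcli (NetworkManager), virsh (libvirt) exist.
describe_network() {
    domain="${1:-}"          # optional libvirt domain name
    ip -brief addr show      # one line per interface with state and addresses
    bridge link show         # ports that are enslaved to a bridge
    nmcli device status      # NetworkManager's view of each device
    virsh net-list --all     # libvirt-defined virtual networks, if any
    if [ -n "$domain" ]; then
        virsh domiflist "$domain"   # interfaces attached to that VM
    fi
}
```

For example, `describe_network fw01` would end with the interface list of the domain `fw01`.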
Comment 7 Samuel Conway 2023-08-22 14:49:28 UTC
(In reply to Samuel Conway from comment #6)

virsh domiflist fw01
 Interface   Type     Source   Model    MAC
-----------------------------------------------------------
 macvtap0    direct   enp3s0   virtio   52:54:00:31:e7:57
 macvtap1    direct   enp4s0   virtio   52:54:00:14:27:c7
 macvtap2    direct   enp5s0   virtio   52:54:00:44:83:53
Comment 8 Samuel Conway 2023-08-23 15:25:57 UTC
This seems to be a snapper issue, so we can close this report.
Comment 9 Samuel Conway 2023-08-24 13:29:50 UTC
closed